# Experimenting with Cleaning, Clustering & Summarization Pipelines

### To do (technical)
- Implement date windows on my corpus loader function
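
One possible shape for the date-window to-do, sketched under assumptions: it filters an already-loaded DataFrame rather than living inside `helper.load_clean_corpus`, it expects ISO-style strings in the `date` column, and `filter_by_date_window` and its parameters are hypothetical names that do not exist in `lib.helper`.

```python
import pandas as pd


def filter_by_date_window(df, start, end, date_col="date"):
    """Keep rows whose date falls in the half-open window [start, end).

    Hypothetical helper: `start` and `end` are date strings such as
    "2019-12-01". Unparseable dates become NaT and are dropped.
    """
    dates = pd.to_datetime(df[date_col], utc=True, errors="coerce")
    start_ts = pd.Timestamp(start, tz="UTC")
    end_ts = pd.Timestamp(end, tz="UTC")
    mask = (dates >= start_ts) & (dates < end_ts)
    return df[mask]
```

Something like `filter_by_date_window(corpus, "2019-12-01", "2019-12-10")` would then restrict the corpus to early December before clustering.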

In [1]:
import os
import re
import json

import numpy as np
import pandas as pd
import networkx as nx

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import lib.helper as helper
import lib.embedding_models as reps

from importlib import reload

%matplotlib inline

In [23]:
# Useful flatten function from Alex Martelli on https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
flatten = lambda l: [item for sublist in l for item in sublist]

In [2]:
# Should be the same path on all my PCs; each scrape is saved there as a separate JSON file.
storage_path = "/home/ozwald/Dropbox/news_crow/scrape_results"

# "bing" is targeted news search corpus, "RSS" is from specific world and local news feeds.
corpus_type = "disaster"

## 1.  Retrieve Corpus

The corpus is scraped by the "run_news_scrapes.py" script (run by Windows Task Scheduler) every 12 hours, a bit past midday and a bit past midnight.

The "bing" corpus consists of news titles and text extracts retrieved from the Bing News Search API using a few Home Office-related keywords.

The "RSS" corpus is pulled directly from a number of RSS feeds for world news sites and local British news sites, with no filters applied for story type or subject.

### First, get a list of all the news dumps created so far

In [3]:
corpus = helper.load_clean_corpus(storage_path, corpus_type)

Total files: 56
Loading file: bing_disaster_corpus_2019-12-17_0022.json
Loading file: bing_disaster_corpus_2019-12-27_1803.json
Loading file: bing_disaster_corpus_2019-12-19_0022.json
Loading file: bing_disaster_corpus_2019-12-08_0023.json
Loading file: bing_disaster_corpus_2019-11-17_1952.json
Loading file: bing_disaster_corpus_2019-11-25_0022.json
Loading file: bing_disaster_corpus_2019-12-01_1222.json
Loading file: bing_disaster_corpus_2019-12-06_1222.json
Loading file: bing_disaster_corpus_2019-12-12_0022.json
Loading file: bing_disaster_corpus_2019-11-19_0022.json
Loading file: bing_disaster_corpus_2019-11-17_1956.json
Loading file: bing_disaster_corpus_2019-12-18_1222.json
Loading file: bing_disaster_corpus_2019-12-04_0022.json
Loading file: bing_disaster_corpus_2019-11-23_1223.json
Loading file: bing_disaster_corpus_2019-12-03_0022.json
Loading file: bing_disaster_corpus_2019-11-30_1222.json
Loading file: bing_disaster_corpus_2019-12-10_0022.json
Loading file: bing_disaster_corp

In [4]:
corpus.head()

Unnamed: 0,title,summary,date,link,source_url,retrieval_timestamp,origin,clean_text
0,Usman vs Covington live stream: Free links to ...,Colby Covington is set to take on Kamaru Usman...,2019-12-17T00:02:00.0000000Z,https://www.independent.co.uk/sport/general/mm...,www.independent.co.uk,2019-12-17 00:21:31.786528,bing_news_api,Usman vs Covington live stream: Free links to ...
6,<b>Flood</b> warnings in place across Berkshir...,A number of <b>flood</b> warnings are in place...,2019-12-16T13:26:00.0000000Z,https://www.getreading.co.uk/news/reading-berk...,www.getreading.co.uk,2019-12-17 00:21:31.787132,bing_news_api,Flood warnings in place across Berkshire after...
11,<b>Flood</b> alert issued for Burton after rai...,People across Burton and South Derbyshire are ...,2019-12-16T09:17:00.0000000Z,https://www.derbytelegraph.co.uk/burton/flood-...,www.derbytelegraph.co.uk,2019-12-17 00:21:31.787399,bing_news_api,Flood alert issued for Burton after rainfall. ...
13,Swindon <b>flood</b> defence pond overflows an...,Pavements and a park have been left flooded du...,2019-12-16T17:57:00.0000000Z,https://www.bbc.co.uk/news/uk-england-wiltshir...,www.bbc.co.uk,2019-12-17 00:21:31.787503,bing_news_api,Swindon flood defence pond overflows and cause...
14,<b>Flood</b> Warnings issued for River Severn ...,Heavy rain has seen the Environment Agency put...,2019-12-16T11:57:00.0000000Z,https://www.gloucestershirelive.co.uk/news/glo...,www.gloucestershirelive.co.uk,2019-12-17 00:21:31.787551,bing_news_api,Flood Warnings issued for River Severn as Envi...
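
Judging by the rows above, `clean_text` looks like the title and summary joined with a full stop, with the `<b>` highlight tags removed. The sketch below is one way that cleaning step could be written. It is an assumption about the approach, not the code in `helper.load_clean_corpus`, and `clean_article_text` is a hypothetical name. It decodes HTML entities before stripping tags, which should avoid leftover fragments such as `quot;` that show up in some of the later printed articles.

```python
import html
import re

# Matches simple HTML tags such as <b> and </b>
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_article_text(title, summary):
    """Join title and summary, decode entities, strip tags, tidy whitespace."""
    text = f"{title}. {summary}"
    text = html.unescape(text)        # &quot; -> ", &amp; -> &, etc.
    text = TAG_PATTERN.sub("", text)  # drop the search-highlight markup
    return re.sub(r"\s+", " ", text).strip()
```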


## 2. Clustering using Entity Detection And Network Analytics

This doesn't resolve very well for the Bing corpus, because the text contains a whole bunch of the keywords used in the original searches.  I suspect that has a lot to do with the failure of the other methods too.  For the network analytics method I'm going to try removing those keywords from the table first.

In [5]:
with open("/home/ozwald/Dropbox/news_crow/scrape_settings.json", "r") as f:
    scrape_config = json.load(f)

search_terms = scrape_config['disaster_search_list']
search_terms = re.sub(r"[^0-9A-Za-z ]", "", " ".join(search_terms)).lower().split()
search_terms = set(search_terms)

In [6]:
search_terms

{'disaster',
 'drought',
 'droughts',
 'flash',
 'flood',
 'flooding',
 'floods',
 'hurricane',
 'mudslide',
 'natural',
 'snow',
 'tsunami',
 'typhoon',
 'wildfire',
 'wildfires'}

In [7]:
model = reps.NounAdjacencyModel(corpus['clean_text'], corpus['clean_text'])

In [8]:
model.noun_sets[3]

{'Council',
 'Fields',
 'Margaret',
 'Merton',
 'Parish',
 'St',
 'Stratton',
 'Swindon'}
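
To make the link between `model.noun_sets` and `model.table` concrete, here is a minimal sketch of turning per-document noun sets into a document-by-noun indicator table. The internals of `reps.NounAdjacencyModel` are not shown in this notebook, so this is an assumption about the general idea, and `noun_indicator_table` is a hypothetical name.

```python
import pandas as pd


def noun_indicator_table(noun_sets):
    """Build a 0/1 table with one row per document and one column per noun."""
    vocab = sorted(set().union(*noun_sets))
    rows = [[int(noun in nouns) for noun in vocab] for nouns in noun_sets]
    return pd.DataFrame(rows, columns=vocab)


# Two toy documents; the second shares only "Swindon" with the first
table = noun_indicator_table([{"Swindon", "Council"}, {"Swindon"}])
```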

In [11]:
nouns_df = model.table.copy()
nouns_df.head()

clean_text,Littry,nac,North_Lincolnshire,Peak_District,Rickett,Secretary,Wyke,B-52,Lingard,hammer,...,Barrier,Merthyr,Snowdonia,Burke,Greens,Glatthaar,Experimental,Barlow,Unsworth,WALMART
"Usman vs Covington live stream: Free links to watch UFC 245 flood online as piracy hits ;peak levels;. Colby Covington is set to take on Kamaru Usman at the main event of UFC 245, with the Welterweight title on the line. Unusually, UFC has also put two other title fights on a single pay-per-view in the US, as Max Holloway faces Alexander Volkanovski and Amanda Nunes squares off against Germaine de Randamie. Fight fans will be able to watch the ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Flood warnings in place across Berkshire after heavy downpours. A number of flood warnings are in place across Berkshire after heavy downpours at the weekend. People have been warned flooding is possible near many of the county;s rivers and those living or working near by have been urged to be prepared. The flood information service on gov.uk has issued a number of flood alerts and experts are monitoring ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Flood alert issued for Burton after rainfall. People across Burton and South Derbyshire are being warned to quot;be preparedquot; as a flood alert has been issued across the area. Persistent rainfall has fallen across the last seven days, leading to the warning on the Government website, gov.uk. The flood alert for Burton was issued on the last night, Sunday, December 15, and is still in place today.",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Swindon flood defence pond overflows and causes flooding. Pavements and a park have been left flooded during work to install new drainage. The work, at Merton Fields in Swindon, was for an attenuation pond to divert water during heavy rainfall. However, it immediately overflowed in heavy rain and residents reported water pouring under fences and into gardens. Stratton St Margaret Parish Council ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Flood Warnings issued for River Severn as Environment Agency says immediate action required. Heavy rain has seen the Environment Agency put Flood Warnings on stretches of the River Severn in Gloucestershire. It says river levels are expected to remain high until Wednesday and flooding is expected and immediate action required. The Flood Warnings are in place for the River Severn at Apperley and The Leigh and on the River Severn at ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Drop any noun/noun phrase containing one of the search terms, then create an adjacency matrix

### Drop any noun/phrase occurring too infrequently

In [13]:
# Rank all nouns by total frequency across the corpus, most common first
nouns_to_keep = list(nouns_df.\
                    sum(axis=0).\
                    sort_values(ascending=False).\
                    index)

# Cut out any nouns containing the original search terms
# (substring match; case-sensitive, and search_terms are all lowercase)
nouns_to_keep = [noun for noun in nouns_to_keep if not any(term in noun for term in search_terms)]

# Keep only top 500 most common
nouns_to_keep = nouns_to_keep[:500]

# Subset nouns dataframe
nouns_df = nouns_df[nouns_to_keep]

print(nouns_df.shape)

(3348, 500)
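
The substring filter above is case-sensitive, so a capitalised noun such as "Flood" would not match the lowercase term "flood". Simply lowercasing the noun would over-correct, because "Snowdonia" contains "snow". A token-level match is one alternative, sketched here under the assumption that multi-word nouns are joined with underscores (as in "North_Lincolnshire"). `drop_search_term_nouns` is a hypothetical name.

```python
import re


def drop_search_term_nouns(nouns, search_terms):
    """Drop nouns where any whole token (case-insensitive) is a search term."""
    terms = {term.lower() for term in search_terms}
    kept = []
    for noun in nouns:
        # Split on underscores, hyphens and whitespace to get the tokens
        tokens = set(re.split(r"[_\s\-]+", noun.lower()))
        if not tokens & terms:
            kept.append(noun)
    return kept


kept = drop_search_term_nouns(
    ["Flood_Warning", "Swindon", "wildfires", "Snowdonia", "Wildfire-ravaged"],
    {"flood", "snow", "wildfire", "wildfires"},
)
```

Whether this is preferable depends on how aggressively the search vocabulary should be stripped. It changes which 500 nouns survive, so downstream counts would differ from the outputs recorded in this notebook.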


In [14]:
embeddings = np.asarray(nouns_df)
adjacency = np.dot(embeddings, embeddings.T)
print(np.max(adjacency))

13
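
A small worked example of what the dot product computes, assuming the noun table holds 0/1 indicators: entry (i, j) is the number of nouns documents i and j share, and the diagonal holds each document's own noun count. Under that assumption the overall maximum always sits on the diagonal, so the 13 printed above would be the largest noun count of a single document rather than the largest overlap between two documents.

```python
import numpy as np

# Three toy documents over four nouns (1 = noun present)
doc_noun = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])

shared = doc_noun @ doc_noun.T
# shared[0, 1] is the number of nouns documents 0 and 1 have in common;
# shared[0, 0] is just the number of nouns in document 0.

# Largest overlap between two *different* documents
off_diagonal = shared.copy()
np.fill_diagonal(off_diagonal, 0)
max_overlap = off_diagonal.max()
```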


In [15]:
# If the "lower" limit is 1, the graph has so many edges it eats ALL the memory of my desktop, even
# with just 500-ish stories to process.
upper = 100
lower = 3
G = nx.Graph()
rows, cols = np.where((upper >= adjacency) & (adjacency >= lower))
weights = adjacency[rows, cols].astype(float).tolist()
edges = zip(rows.tolist(), cols.tolist(), weights)
G.add_weighted_edges_from(edges)

# Simplify; remove self-edges
G.remove_edges_from(nx.selfloop_edges(G))

In [16]:
G.number_of_edges()

1873
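
Because the adjacency matrix is symmetric, every pair is visited twice and every document meets itself on the diagonal. A variant that only looks at the upper triangle is sketched below. `graph_from_shared_counts` is a hypothetical helper, and it behaves differently from the cell above in one respect: documents with no qualifying neighbour are never added as nodes.

```python
import numpy as np
import networkx as nx


def graph_from_shared_counts(adjacency, lower, upper):
    """Build a weighted graph from pairs sharing between lower and upper nouns.

    `lower` must be at least 1, otherwise the zeroed-out lower triangle
    would pass the threshold.
    """
    upper_triangle = np.triu(adjacency, k=1)  # drop diagonal and duplicate pairs
    rows, cols = np.where((upper_triangle >= lower) & (upper_triangle <= upper))
    G = nx.Graph()
    G.add_weighted_edges_from(
        (int(r), int(c), float(adjacency[r, c])) for r, c in zip(rows, cols)
    )
    return G


toy_adjacency = np.array([[3, 2, 0], [2, 2, 0], [0, 0, 1]])
toy_graph = graph_from_shared_counts(toy_adjacency, lower=2, upper=100)
```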

In [17]:
#G_plot = nx.petersen_graph()
#plt.subplot(121)
#nx.draw(G, with_labels=True, font_weight='bold')
#plt.subplot(122)
#nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True, font_weight='bold')

### Cliques, worth a look?
Idea from the docs, explanation at https://en.wikipedia.org/wiki/Clique_(graph_theory)

So, cliques are allowed to overlap; I should've thought of that.  Still, the preliminary results are good, and I've found I can disambiguate the cliques to some degree by cutting out weaker links (those with fewer shared entities).

I should add that this approach also appears to suffer from the same problem as the other clustering methods: clusters are ultimately hierarchical!
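
A toy illustration of both points, using made-up edge weights: two triangles share node 2, so `nx.find_cliques` reports that node in both maximal cliques, and dropping the weaker edges separates them.

```python
import networkx as nx

G_toy = nx.Graph()
G_toy.add_weighted_edges_from([
    (0, 1, 5), (0, 2, 5), (1, 2, 5),  # strongly linked triangle
    (2, 3, 3), (2, 4, 3),             # weaker links into a second group
    (3, 4, 5),
])

# Node 2 belongs to both maximal cliques
overlapping = sorted(sorted(c) for c in nx.find_cliques(G_toy))

# Keep only edges with at least 4 shared entities, then look again
strong = nx.Graph()
strong.add_edges_from(
    (u, v, d) for u, v, d in G_toy.edges(data=True) if d["weight"] >= 4
)
separated = sorted(sorted(c) for c in nx.find_cliques(strong))
```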

In [18]:
cliques = []
for x in nx.find_cliques(G):
    x.sort()
    cliques.append((len(x), x))

In [19]:
cliques_df = pd.DataFrame({"nodes_list": [x[1] for x in cliques],
                           "clique_size": [x[0] for x in cliques]}).\
                    sort_values("clique_size", ascending=False).\
                    reset_index()

In [20]:
len(cliques_df[cliques_df['clique_size'] >= 5])

59

In [21]:
cliques_df[cliques_df['clique_size'] >= 5]

Unnamed: 0,index,nodes_list,clique_size
0,247,"[339, 492, 869, 1104, 2356, 2359, 2458, 2461, ...",11
1,106,"[167, 687, 911, 914, 916, 917, 1128, 1700, 196...",10
2,885,"[546, 1006, 1362, 1464, 1605, 1791, 2541, 2613...",10
3,84,"[141, 145, 146, 151, 672, 1321, 1422, 1423, 2024]",9
4,673,"[1214, 1271, 1495, 1639, 1996, 2230, 2233]",7
5,20,"[28, 354, 1281, 1730, 2458, 2657, 3319]",7
6,139,"[211, 596, 1568, 2112, 2378, 2492, 2686]",7
7,533,"[911, 912, 916, 917, 1128, 1700, 1961]",7
8,886,"[546, 1006, 1362, 1605, 1791, 2588, 2613]",7
9,83,"[141, 145, 672, 1320, 1321, 1422, 2024]",7
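
The overlap is visible in the table itself: rows 3 and 9 share most of their nodes. One way to quantify that, as a rough sketch, is the Jaccard similarity between the node lists. `jaccard` is a hypothetical helper, and the two lists are copied from rows 3 and 9 above.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two node lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


clique_3 = [141, 145, 146, 151, 672, 1321, 1422, 1423, 2024]
clique_9 = [141, 145, 672, 1320, 1321, 1422, 2024]

overlap = jaccard(clique_3, clique_9)  # 6 shared nodes out of 10 distinct
```

Cliques above some similarity threshold could then be merged or treated as one story, though choosing that threshold is a judgment call.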


In [24]:
cliqued = set(flatten(list(cliques_df['nodes_list'])))
len(cliqued)

1867

In [25]:
for node in cliques_df.iloc[0]['nodes_list']:
    article = nouns_df.reset_index().iloc[node]
    print(article['clean_text'])

Wildfire-ravaged areas of Australia get Christmas respite. An emergency vehicle near a fire in Blackheath, New South Wales (Ingleside Rural Fire Brigade/AP) The wildfire crisis forced Mr Morrison to cut short his much-criticised family holiday in Hawaii. He returned to Australia on Saturday night. “To Andrew and Geoffrey’s parents, we know this is going to be a tough Christmas for you, first one ...
Why record wildfires and soaring temperatures won;t sway Australia;s government on climate change. The Australian Government is under fire for inaction on climate change while a bush fire crisis and a heatwave sweeps the country. On Tuesday Australia experienced its hottest day on record with the national average temperature reaching a high of 40.C. Late on Wednesday there were 100 bush fires burning in New South Wales alone, with 54 still ...
‘National tragedy’: Hundreds of koalas feared dead in Australian wildfire. Hundreds of koalas are feared to have died in wildfires raging along Austr

In [26]:
for node in cliques_df.iloc[1]['nodes_list']:
    article = nouns_df.reset_index().iloc[node]
    print(article['clean_text'])

Three cows presumed dead after being swept away by Hurricane Dorian are found alive months later. Three cows swept off an island of North Carolina during the raging storm of Hurricane Dorian have been found alive months later, after reportedly swimming for several miles. They were grazing on their home of Cedar Island when the severe weather hit ...
Three cows swim five miles to safety after being swept out to sea in hurricane. The cows were swept away from Cedar Island, in North Carolina, US, by a ;mini-tsunami; caused by Hurricane Dorian - and were lucky not to have been dragged to their deaths in the Atlantic
Cows swept away by Hurricane Dorian found alive in North Carolina. Three cows swept off an island in North Carolina during Hurricane Dorian have been found alive after apparently swimming for several miles. The cows belong to a herd on the US state;s Cedar Island but were swept away in September by a quot;mini tsunamiquot; generated by Dorian. They were presumed dead until they

In [27]:
for node in cliques_df.iloc[3]['nodes_list']:
    article = nouns_df.reset_index().iloc[node]
    print(article['clean_text'])

Typhoon Kammuri slams into Philippines, forcing thousands to flee. Typhoon Kammuri has made landfall in the central Philippines, at the southern end of Luzon island. At least 200,000 residents have been evacuated from coastal and mountainous areas over fears of flooding, storm surges and landslides. Some events at the ...
Typhoon Kammuri: At least four dead as storm hammers Philippines. At least four people have been killed after Typhoon Kammuri slammed into the Philippines. Hundreds of thousands of people were evacuated from high-risk villages while Manilla;s international airport was shut on Tuesday as fierce winds and rain hit the country.
Driver gets stuck in river after Typhoon Kammuri in the Philippines. People in a car have been rescued after the vehicle got stuck in a raging river on its way to deliver aid to the victims of the recent Typhoon Kammuri in the Philippines.
Typhoon Kammuri slams into Philippines, forcing thousands to flee. Typhoon Kammuri has made landfall in the c

In [28]:
for node in cliques_df.iloc[17]['nodes_list']:
    article = nouns_df.reset_index().iloc[node]
    print(article['clean_text'])

Road closed due to fallen tree as further flood alerts issued. quot;We will organise removal of the tree as soon as possible. Please plan your journey and use alternative routes in the meantime.quot; Meanwhile a number of Environment Agency flood alerts remain in force in Staffordshire - including on the Rivers Sow and Penk in Stafford Borough and the Churnet, Tean and Upper Dove in the Staffordshire Moorlands.
Flood alert remains in force as further rain expected in Staffordshire. A flood alert remains in force for two rivers around Stafford Borough this evening (Wednesday November 1). The alert, which means flooding is possible, has been issued by the Environment Agency for the Rivers Sow and Penk in Stafford Borough. The alert adds that further rainfall is possible on Thursday. The Environment Agency alert states ...
Flood alerts in force for rivers across Staffordshire following heavy rain. Flood alerts have been issued for a number of rivers in Staffordshire following heavy rain. 

### Connected components

In [32]:
nx.number_connected_components(G)

877

In [33]:
components = [component for component in nx.connected_components(G)]

In [34]:
sum([len(component) for component in components])

1867
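
877 components across 1867 nodes averages out at barely two nodes each, which suggests a lot of very small components. One likely contributor: the diagonal of the adjacency matrix passes the threshold for any document with enough nouns, so that document is added as a node via its self-edge and stays in the graph, isolated, once self-loops are removed. The toy example below reproduces that effect and counts components by size.

```python
from collections import Counter

import numpy as np
import networkx as nx

# Documents 0 and 1 share three nouns; document 2 shares none with anyone
toy_adjacency = np.array([
    [3, 3, 0],
    [3, 3, 0],
    [0, 0, 4],
])

rows, cols = np.where(toy_adjacency >= 3)
G_toy = nx.Graph()
G_toy.add_edges_from(zip(rows.tolist(), cols.tolist()))
G_toy.remove_edges_from(nx.selfloop_edges(G_toy))

# Map component size -> number of components of that size
size_counts = Counter(len(c) for c in nx.connected_components(G_toy))
```

Running the same `Counter` over `nx.connected_components(G)` would show how many of the 877 components are singletons.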

### Community Detection Algorithm

In [35]:
from community import best_partition

In [36]:
# Apply Louvain Community Detection
# The keys are nodes, the values are the partitions they belong to
partition = best_partition(G)

# Partition ids start at 0, so count the distinct ids rather than taking the max
number_partitions = len(set(partition.values()))
number_partitions
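
For a feel of what Louvain returns, here is a toy run on two tightly knit groups joined by a single edge. Note the swap: this sketch uses the Louvain implementation bundled with recent networkx releases (`louvain_communities`) rather than the `community` package imported above, so it runs with networkx alone. The result is converted to the same node-to-partition mapping that `best_partition` produces.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Two complete graphs of four nodes each, joined by one bridging edge
G_toy = nx.barbell_graph(4, 0)

# A list of node sets; the seed fixes the otherwise random node ordering
communities = louvain_communities(G_toy, seed=0)

# Same shape as best_partition's output: node -> partition id
toy_partition = {
    node: idx for idx, members in enumerate(communities) for node in members
}

# Modularity score of the partition (higher means denser within-group links)
quality = modularity(G_toy, communities)
```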

In [50]:
# Iterate through and get a list of partitions and their nodes
partition_contents = {}
for key in partition.keys():
    partition_contents[partition[key]] = partition_contents.get(partition[key], []) + [key]

# Drop partitions that are too small
for key in list(partition_contents.keys()):
    if len(partition_contents[key]) < 5:
        partition_contents.pop(key)

In [53]:
# Let's see how big our "clusters" are, and how many there are total after removing the tiny ones
partition_lengths = [len(value) for key, value in partition_contents.items()]
print(partition_lengths, sum(partition_lengths))

[38, 57, 24, 41, 27, 26, 15, 7, 14, 12, 26, 12, 6, 12, 60, 16, 58, 7, 24, 26, 8, 8, 14, 24, 6, 9, 6, 5, 5, 15, 7, 9, 7, 7, 5, 7, 7, 5, 5] 667
