# Experimenting with Cleaning, Clustering & Summarization Pipelines

### To do (technical)
- Implement date windows on my corpus loader function
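
A minimal sketch of what the date-window item could look like, assuming the loader returns a DataFrame with an ISO 8601 `date` column like the one printed in section 1. The function name `filter_by_date_window` and its parameters are hypothetical and not part of `lib.helper`.

```python
import pandas as pd

def filter_by_date_window(corpus, start=None, end=None, date_col="date"):
    """Return rows whose timestamp falls in [start, end); either bound may be None."""
    # Parse the ISO 8601 strings to timezone-aware UTC timestamps; bad values become NaT
    dates = pd.to_datetime(corpus[date_col], utc=True, errors="coerce")
    mask = dates.notna()
    if start is not None:
        mask &= dates >= pd.Timestamp(start, tz="UTC")
    if end is not None:
        mask &= dates < pd.Timestamp(end, tz="UTC")
    return corpus[mask]

# Hypothetical rows, using the same date format as the scraped corpus
example = pd.DataFrame({
    "title": ["a", "b", "c"],
    "date": ["2019-12-17T00:02:00.0000000Z",
             "2019-12-16T13:26:00.0000000Z",
             "2019-12-16T09:17:00.0000000Z"],
})

# Keep stories from midday on 16 Dec up to (not including) midnight on 17 Dec
window = filter_by_date_window(example, start="2019-12-16T12:00", end="2019-12-17")
print(window)
```

The half-open interval means consecutive windows never count the same story twice.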

In [1]:
import os
import re
import json

import numpy as np
import pandas as pd
import networkx as nx

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import lib.helper as helper
import lib.embedding_models as reps

from importlib import reload

%matplotlib inline

## 1.  Retrieve Corpus

The corpus is scraped by the "run_news_scrapes.py" script, run by Windows Task Scheduler every 12 hours, a little after midday and a little after midnight.

The "bing" corpus consists of news titles and text extracts retrieved from the Bing News Search API, using a few Home Office-related keywords.

The "RSS" corpus is pulled directly from a number of RSS feeds for world news sites and local British news sites, with no filters on story type or subject.

In [2]:
# Should be the same path on all my PCs; each scrape is saved there as a separate JSON file.
storage_path = "/home/ozwald/Dropbox/news_crow/scrape_results"

# "bing" is the targeted news search corpus, "RSS" comes from specific world and local news feeds.
# This run loads the "disaster" corpus.
corpus_type = "disaster"

# There's a helper function to go find and drag out the various JSON files created by the scrapers.
corpus = helper.load_clean_corpus(storage_path, corpus_type)

# See how it turned out
print(corpus.head())
print(corpus.shape)

Total files: 77
9.1 of files read.
18.2 of files read.
27.3 of files read.
36.4 of files read.
45.5 of files read.
54.5 of files read.
63.6 of files read.
72.7 of files read.
81.8 of files read.
90.9 of files read.
100.0 of files read.
                                                title  \
0   Usman vs Covington live stream: Free links to ...   
11  <b>Flood</b> alert issued for Burton after rai...   
13  Swindon <b>flood</b> defence pond overflows an...   

                                              summary  \
0   Colby Covington is set to take on Kamaru Usman...   
11  People across Burton and South Derbyshire are ...   
13  Pavements and a park have been left flooded du...   
14  Heavy rain has seen the Environment Agency put...   

                            date  \
0   2019-12-17T00:02:00.0000000Z   
6   2019-12-16T13:26:00.0000000Z   
11  2019-12-16T09:17:00.0000000Z   
13  2019-12-16T17:57:00.0000000Z   
14  2019-12-16T11:57:00.0000000Z   

                                

## 2. Clustering using Entity Detection And Network Analytics

In [3]:
# Retrieve the set of search terms used for Bing, so we can remove them before
# clustering.
with open("/home/ozwald/Dropbox/news_crow/scrape_settings.json", "r") as f:
    scrape_config = json.load(f)

search_terms = scrape_config['disaster_search_list']
search_terms = re.sub(r"[^0-9A-Za-z ]", "", " ".join(search_terms)).lower().split()
search_terms = set(search_terms)
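
To make the normalisation above concrete, here is a small worked example using only the standard library. The search list is made up; the real one lives in `scrape_settings.json`.

```python
import re

# Hypothetical search list, standing in for scrape_config['disaster_search_list']
example_terms = ["flood warning", "wild-fire", "Storm surge", "flood alert"]

joined = " ".join(example_terms)
# Strip anything that is not a letter, digit or space, then lowercase and split on whitespace
tokens = re.sub(r"[^0-9A-Za-z ]", "", joined).lower().split()
unique_terms = set(tokens)

print(sorted(unique_terms))
# ['alert', 'flood', 'storm', 'surge', 'warning', 'wildfire']
```

Note that multi-word phrases are split into single words, so a phrase like "flood warning" later removes every noun containing "warning", not just nouns containing the full phrase.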

In [4]:
# Generate the text representation
model = reps.NounAdjacencyModel(list(corpus['clean_text']), list(corpus['clean_text']))

# Tabulate for convenience
nouns_df = model.table.copy()
nouns_df.head()

Unnamed: 0,I;ve,Mersey,JESSE,Zane,Jerry;s,Harry_Jackson,Eastford,Harington_Jon,McCluskey,quot;magicalquot,...,Henman,Economic,Bushfire,Barclays,Dartford,Greek,Apollo,Moisture,Authorities,Buckland
"Usman vs Covington live stream: Free links to watch UFC 245 flood online as piracy hits ;peak levels;. Colby Covington is set to take on Kamaru Usman at the main event of UFC 245, with the Welterweight title on the line. Unusually, UFC has also put two other title fights on a single pay-per-view in the US, as Max Holloway faces Alexander Volkanovski and Amanda Nunes squares off against Germaine de Randamie. Fight fans will be able to watch the ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Flood warnings in place across Berkshire after heavy downpours. A number of flood warnings are in place across Berkshire after heavy downpours at the weekend. People have been warned flooding is possible near many of the county;s rivers and those living or working near by have been urged to be prepared. The flood information service on gov.uk has issued a number of flood alerts and experts are monitoring ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Flood alert issued for Burton after rainfall. People across Burton and South Derbyshire are being warned to quot;be preparedquot; as a flood alert has been issued across the area. Persistent rainfall has fallen across the last seven days, leading to the warning on the Government website, gov.uk. The flood alert for Burton was issued on the last night, Sunday, December 15, and is still in place today.",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Swindon flood defence pond overflows and causes flooding. Pavements and a park have been left flooded during work to install new drainage. The work, at Merton Fields in Swindon, was for an attenuation pond to divert water during heavy rainfall. However, it immediately overflowed in heavy rain and residents reported water pouring under fences and into gardens. Stratton St Margaret Parish Council ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Flood Warnings issued for River Severn as Environment Agency says immediate action required. Heavy rain has seen the Environment Agency put Flood Warnings on stretches of the River Severn in Gloucestershire. It says river levels are expected to remain high until Wednesday and flooding is expected and immediate action required. The Flood Warnings are in place for the River Severn at Apperley and The Leigh and on the River Severn at ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
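
Column names such as `I;ve` and `quot;magicalquot` in the table above look like HTML entities (`&#39;`, `&quot;`) that lost their `&`, `#` and digits during cleaning. That is an assumption on my part, since the cleaning code is not shown here. If it holds, decoding entities before stripping punctuation might avoid the residue. A rough sketch, with a hypothetical helper name and a made-up input string:

```python
import html
import re

def unescape_then_strip(text):
    """Decode HTML entities and drop markup tags before any punctuation filtering."""
    text = html.unescape(text)            # "&quot;" -> '"', "&#39;" -> "'"
    text = re.sub(r"<[^>]+>", "", text)   # remove tags such as <b>...</b>
    return text

# Made-up raw string in the style of the Bing titles
raw = "<b>Flood</b> alert: residents told to &quot;be prepared&quot; as river won&#39;t recede"
print(unescape_then_strip(raw))
# Flood alert: residents told to "be prepared" as river won't recede
```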


### Drop any noun/noun phrase containing one of the search terms, then create an adjacency matrix

### Drop any noun/phrase occurring too infrequently

In [5]:
# Rank all nouns by total frequency, most common first
nouns_to_keep = list(
    nouns_df
    .sum(axis=0)
    .sort_values(ascending=False)
    .index
)

# Cut out any nouns containing the original search terms
nouns_to_keep = [noun for noun in nouns_to_keep if not any(term in noun for term in search_terms)]

# Keep only top 500 most common
nouns_to_keep = nouns_to_keep[:500]

# Subset nouns dataframe
nouns_df = nouns_df[nouns_to_keep]

print(nouns_df.shape)

(4454, 500)
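
One thing worth checking in the filter above: the search terms were lowercased in cell 3, while noun columns such as `Bushfire` keep their capitalisation, and Python's `in` test on strings is case-sensitive. The sketch below compares the two behaviours on made-up inputs; it is an illustration, not a claim about what the real corpus contains.

```python
# Hypothetical inputs: lowercased search terms and mixed-case noun phrases
search_terms = {"flood", "bushfire"}
nouns = ["Bushfire", "Flood_Warning", "Met_Office", "flood", "Sydney"]

# Case-sensitive filter (the same substring test as the notebook cell above)
kept_sensitive = [n for n in nouns if not any(t in n for t in search_terms)]

# Case-insensitive variant: lowercase each noun before the substring test
kept_insensitive = [n for n in nouns if not any(t in n.lower() for t in search_terms)]

print(kept_sensitive)    # ['Bushfire', 'Flood_Warning', 'Met_Office', 'Sydney']
print(kept_insensitive)  # ['Met_Office', 'Sydney']
```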


In [6]:
embeddings = np.asarray(nouns_df)
adjacency = np.dot(embeddings, embeddings.T)
print(np.max(adjacency))

11
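
A tiny worked example of what the dot product above computes, on a made-up matrix. If the table holds 0/1 presence flags, entry (i, j) is the number of nouns stories i and j share; if it holds counts, it is a count-weighted overlap instead. The diagonal is each story compared with itself, so the maximum printed above may well be a diagonal entry.

```python
import numpy as np

# Hypothetical story-by-noun matrix: 3 stories, 4 nouns (1 = noun present)
embeddings = np.array([
    [1, 1, 0, 0],   # story 0
    [1, 1, 1, 0],   # story 1
    [0, 0, 0, 1],   # story 2
])

adjacency = embeddings @ embeddings.T
print(adjacency)
# [[2 2 0]
#  [2 3 0]
#  [0 0 1]]

# Zero the diagonal on a copy to see the largest overlap between two different stories
off_diagonal = adjacency.copy()
np.fill_diagonal(off_diagonal, 0)
print(off_diagonal.max())  # 2
```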


In [42]:
# If the "lower" limit is 1, the graph has so many edges it eats ALL the memory of my desktop, even
# with just 500-ish stories to process.
upper = 100
lower = 3
G = nx.Graph()
rows, cols = np.where((upper >= adjacency) & (adjacency >= lower))
weights = [float(adjacency[rows[i], cols[i]]) for i in range(len(rows))]
edges = zip(rows.tolist(), cols.tolist(), weights)
G.add_weighted_edges_from(edges)

# Simplify; remove self-edges
G.remove_edges_from(nx.selfloop_edges(G))

In [43]:
G.number_of_edges()

2547
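
Because the adjacency matrix is symmetric, `np.where` over the full matrix returns every pair twice, plus the diagonal. An undirected `nx.Graph` collapses the duplicates, but the intermediate lists are twice as long as needed. A possible alternative, sketched with NumPy only on a made-up matrix, is to threshold just the upper triangle:

```python
import numpy as np

# Hypothetical symmetric adjacency matrix for 4 stories
adjacency = np.array([
    [4, 3, 0, 1],
    [3, 5, 2, 3],
    [0, 2, 2, 0],
    [1, 3, 0, 6],
])
lower, upper = 3, 100

# k=1 keeps entries strictly above the diagonal: no self-loops, each pair once.
# This relies on lower > 0, because np.triu zeroes everything else.
upper_tri = np.triu(adjacency, k=1)
rows, cols = np.nonzero((upper_tri >= lower) & (upper_tri <= upper))
edges = [(int(r), int(c), float(adjacency[r, c])) for r, c in zip(rows, cols)]

print(edges)  # [(0, 1, 3.0), (1, 3, 3.0)]
```

The resulting list has the `(u, v, weight)` shape that `G.add_weighted_edges_from` accepts, and no self-loop removal is needed afterwards.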

### Cliques, worth a look?
Idea from the docs, explanation at https://en.wikipedia.org/wiki/Clique_(graph_theory)

So, cliques are allowed to overlap; I should have thought of that.  Still, the preliminary results are good, and I've found I can disambiguate the cliques to some degree by cutting out weaker links (those with fewer shared entities).

It also appears to suffer from the same problem as the other clustering methods: the clusters are ultimately hierarchical!

In [44]:
cliques = []
for x in nx.find_cliques(G):
    x.sort()
    cliques.append((len(x), x))

In [45]:
cliques_df = pd.DataFrame({"nodes_list": [x[1] for x in cliques],
                           "clique_size": [x[0] for x in cliques]}).\
                    sort_values("clique_size", ascending=False).\
                    reset_index()

In [46]:
len(cliques_df[cliques_df['clique_size'] >= 5])

77

In [47]:
cliques_df[cliques_df['clique_size'] >= 5]

Unnamed: 0,index,nodes_list,clique_size
0,1657,"[310, 474, 622, 999, 1169, 1227, 1237, 1248, 1...",25
1,190,"[326, 327, 498, 501, 504, 1264, 2433, 2434, 39...",10
2,727,"[1310, 1875, 2015, 2307, 3356, 3509, 3700, 389...",10
3,80,"[141, 145, 146, 151, 802, 1733, 1833, 1834, 2670]",9
4,448,"[788, 1244, 1412, 1726, 3158, 4084, 4086, 4287]",8
...,...,...,...
72,644,"[1175, 1179, 1483, 3534, 3976]",5
73,520,"[920, 1898, 2040, 2049, 2991]",5
74,566,"[1028, 1030, 1036, 1426, 2143]",5
75,134,"[223, 229, 1309, 2721, 3760]",5


In [48]:
# Useful flatten function from Alex Martelli on https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
flatten = lambda l: [item for sublist in l for item in sublist]

In [49]:
cliqued = set(flatten(list(cliques_df['nodes_list'])))
len(cliqued)

2358
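
Since maximal cliques can share nodes, it may help to measure the overlap and, if one label per story is needed, to resolve it somehow. The sketch below counts clique memberships per node and then greedily assigns each node to the largest clique it appears in. This is just one simple heuristic on made-up cliques, not something taken from the networkx docs.

```python
from collections import Counter

# Hypothetical maximal cliques (lists of node ids), in the shape nx.find_cliques yields
cliques = [[1, 2, 3, 4], [3, 4, 5], [6, 7], [4, 5, 8]]

# How many cliques does each node belong to?
membership = Counter(node for clique in cliques for node in clique)
shared_nodes = sorted(node for node, count in membership.items() if count > 1)

# Greedy disambiguation: visit cliques from largest to smallest and
# let each clique claim only the nodes that are still unassigned
assignment = {}
for clique_id, clique in sorted(enumerate(cliques), key=lambda pair: -len(pair[1])):
    for node in clique:
        assignment.setdefault(node, clique_id)

print(shared_nodes)  # [3, 4, 5]
print(assignment)
```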

In [50]:
for node in cliques_df.iloc[0]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

Australia wildfires: minister donates £,000 to koala charity as he asks for change in international development funding. He revealed that he has given the fee Members of Parliament are given if they lose their seats to Wires, an Australian charity which is rescuing animals, including koalas, from the wildfires. Australia is burning. Over a billion animals have died, including 0% of the entire koala population in the mid-north coast of New South Wales.
Wildfire-ravaged areas of Australia get Christmas respite. An emergency vehicle near a fire in Blackheath, New South Wales (Ingleside Rural Fire Brigade/AP) The wildfire crisis forced Mr Morrison to cut short his much-criticised family holiday in Hawaii. He returned to Australia on Saturday night. “To Andrew and Geoffrey’s parents, we know this is going to be a tough Christmas for you, first one ...
Why record wildfires and soaring temperatures won;t sway Australia;s government on climate change. The Australian Government is under fire fo

In [51]:
for node in cliques_df.iloc[1]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

Typhoon Phanfone: Philippines assesses devastation as storm with winds up to 200kph kills at least 20. At least 20 people have been confirmed dead as the Philippines counted the cost of a devastating typhoon that ripped through the centre of the country on Christmas Day. Typhoon Phanfone made landfall on Tuesday night, but the extent of the damage was only ...
Typhoon Phanfone kills at least 16 as 120mph winds smash into the Philippines. At least 16 people have died after a typhoon swept across the Philippines on Christmas Day. Typhoon Phanfone saw winds of 120mph smash into remote villages and popular tourist areas, leaving a trail of devastation. Tens of thousands of people have been left stranded as transport ground to a halt while they tried to make their way home for ...
Typhoon Phanfone claims at least 16 lives in Philippines. A typhoon that swept through the Philippines on Tuesday and Wednesday has claimed at least 16 lives, and caused tens of thousands of people to be evacuated

In [52]:
for node in cliques_df.iloc[2]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

What the Met Office is saying about snow in its UK long-range forecast. We;re unlikely to see much of the white stuff in Gloucestershire but northern areas could see some within the month If you fancy seeing some winter snow, you’ll have to travel to northern parts of the UK. The Met Office has released its latest long-range forecast for the country into February. It says there could be some cold spells in ...
Election week could be hit with snow and freezing fog, say Met Office. SNOW and freezing fog could be on the cards for next week;s General Election. Temperatures will dive to sub-zero levels in some parts of the UK as winter really sets in, coinciding with voters heading to the polls, the Met Office has said. Election week is predicted to see longer spells of rain, wintry showers and harsh winds, meteorologists say.
Snow is set to hit the UK but Met Office predicts Hull will escape early falls. The Met Office says they can only officially predict five days ahead, but can safely s

### Connected components
This works for the small disaster corpus, but not for the larger corpora.  Does a greater number of nodes increase the odds of an accidental small-world effect, where a few chance links join otherwise unrelated stories into one big component?

In [76]:
print(nx.number_connected_components(G))

# Get a list of node lists, one per connected component with at least 5 nodes
components = [list(component) for component in nx.connected_components(G) if len(component) >= 5]

1142
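
On the question of why larger corpora might merge: a single weak edge is enough to join two components, so more nodes mean more chances for such a bridge. The toy graph below (made-up weights, helper name hypothetical) shows how raising the minimum edge weight by one splits a merged component back into two groups.

```python
import networkx as nx

# Two tight groups of stories joined by one weak "bridge" edge (weight 3)
weighted_edges = [
    (0, 1, 6.0), (1, 2, 5.0), (0, 2, 7.0),   # group A
    (3, 4, 6.0), (4, 5, 8.0), (3, 5, 5.0),   # group B
    (2, 3, 3.0),                             # accidental bridge
]

def components_at_threshold(edges, min_weight):
    """Connected components using only edges with weight >= min_weight."""
    G = nx.Graph()
    # Add every node up front so stories that lose all edges still show up
    G.add_nodes_from({n for u, v, _ in edges for n in (u, v)})
    G.add_weighted_edges_from((u, v, w) for u, v, w in edges if w >= min_weight)
    return sorted(sorted(c) for c in nx.connected_components(G))

print(components_at_threshold(weighted_edges, 3))  # [[0, 1, 2, 3, 4, 5]]
print(components_at_threshold(weighted_edges, 4))  # [[0, 1, 2], [3, 4, 5]]
```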


In [77]:
for node in components[0][:10]:
    print(corpus.iloc[node]['clean_text'])

Government should pay to rebuild houses with extra storeys to prevent future flooding misery: Yorkshire Post Letters. AS a ‘soft southerner’ – but, with very fond memories of childhood days spent in the West Riding (my favourite uncle was the ‘Station Master’ at Bradford Railway Station and the surrounding areas) and hopes of moving to live in Yorkshire next ...
UK weather forecast – Bitter -7C blast to see Britain freeze this weekend with snow and icy rain forecast. THE UK is bracing for a weekend of rain and snow - with temperatures set to plunge as low as -7C in Scotland and the north of England. Slightly milder temperatures elsewhere will be cancelled out by cutting winds of up to 50mph, according to forecasts. The Met Office is forecasting snow and sleet across northern England and Scotland for much of ...
BBC Weather forecast: Britain braces for snow as nation set to be battered by 60mph winds. BBC Weather Chris Fawkes warned: quot;It’s going to be quite a blustery day, a day of 

In [78]:
for node in components[1][:10]:
    print(corpus.iloc[node]['clean_text'])

Australia wildfires: Firefighters save prehistoric tree species. Specialist firefighters have saved the world;s last remaining wild stand of a prehistoric tree from wildfires that razed forests west of Sydney, officials said Thursday. Firefighters were winched from helicopters to reach the cluster of fewer than 200 Wollemi pines in a remote gorge in the Blue Mountains a week before a massive wildlife bore ...
Smoke from Australian wildfires will make a ‘full circuit’ around Earth, Nasa says. A satellite image shows thick smoke moving into the Tasman Sea from the states of New South Wales and Victoria (Nasa) As the devastating wildfires continue to ravage Australia, the resulting smoke is set to make a complete circuit of the globe. Nasa is currently tracking the smoke as it moves through the atmosphere. The space agency says the ...
Bushfires thick smog to Sydney as storms lash Queensland bringing flash floods and giant hail. Sydneysiders have been warned to expect a very hazy Sunday a

In [79]:
for node in components[2][:10]:
    print(corpus.iloc[node]['clean_text'])

Former High Sheriff of Derbyshire was swept away for half a mile after car got stuck in floods, inquest hears. The former High Sheriff of Derbyshire was swept away for half a mile after driving into floodwater, an inquest has heard. Annie Hall;s body was recovered from flooded farmland close to the River Derwent in Darley Dale, near Matlock, in the early hours of November 8. Chesterfield Coroner;s Court heard how the car Mrs Hall and her husband had ...
Woman killed in floods named as former High Sheriff of Derbyshire. The woman who died after being swept away by floodwater amid torrential rain across northern England has been named as former High Sheriff of Derbyshire, Annie Hall. Ms Hall’s body was found on Friday morning after emergency services were called to the ...
UK flooding: Body of woman dragged from flood water as torrential rain hits swathes of England. The body of a woman has been dragged from flood water after reports of someone being swept away by the River Derwent in De

### Community Detection Algorithm

In [22]:
from community import best_partition

In [23]:
# Apply Louvain Community Detection
# The keys are nodes, the values are the partitions they belong to
partition = best_partition(G)

# Labels start at 0, so this is the highest label, one less than the number of partitions
number_partitions = max(partition.values())
number_partitions

1150

In [24]:
# Iterate through and get a list of partitions and their nodes
partition_contents = {}
for key in partition.keys():
    partition_contents[partition[key]] = partition_contents.get(partition[key], []) + [key]

# Drop partitions that are too small
for key in list(partition_contents.keys()):
    if len(partition_contents[key]) < 5:
        partition_contents.pop(key)
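
The grouping and filtering above can also be written with `collections.defaultdict`, which avoids rebuilding a list on every iteration. A small sketch on a made-up partition dict, assuming the same node-to-community shape that `best_partition` returns:

```python
from collections import defaultdict

# Hypothetical node -> community mapping
partition = {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 1}

# Invert to community -> list of nodes
partition_contents = defaultdict(list)
for node, community_id in partition.items():
    partition_contents[community_id].append(node)

# Count distinct labels rather than taking the maximum label
number_partitions = len(partition_contents)

# Keep only communities with at least `min_size` members
min_size = 2
large = {cid: nodes for cid, nodes in partition_contents.items() if len(nodes) >= min_size}

print(number_partitions)  # 3
print(large)              # {0: [0, 1, 3], 1: [2, 5]}
```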

In [25]:
# Let's see how big our "clusters" are, and how many stories they cover in total after removing the tiny ones
partition_lengths = {key: len(value) for key, value in partition_contents.items()}
print(partition_lengths, sum(partition_lengths.values()))

{2: 25, 5: 24, 10: 71, 12: 19, 14: 70, 15: 9, 16: 27, 18: 45, 20: 19, 27: 7, 32: 6, 46: 22, 65: 26, 66: 7, 67: 7, 76: 153, 80: 15, 82: 7, 87: 6, 94: 21, 100: 6, 103: 16, 140: 11, 142: 7, 145: 6, 154: 6, 157: 14, 164: 14, 175: 5, 178: 5, 185: 5, 208: 7, 217: 9, 220: 10, 239: 8, 250: 16, 259: 7, 269: 9, 291: 7, 299: 7, 308: 7, 313: 11, 350: 5, 364: 7, 365: 7, 381: 8, 394: 5, 399: 5, 504: 12, 505: 6, 630: 7, 821: 6} 847


In [26]:
for node in partition_contents[2][:10]:
    print(corpus.iloc[node]['clean_text'])

Flooding On Isle Of Wight Possible Until Thursday. A Flood Alert’s still in place for the Isle of Wight and the next update’s at 4pm today (Wednesday). Yarmouth, East Cowes and Cowes are the areas most at risk. A spokesperson for the Environment Agency said, “Wednesday morning’s tide at 11:07 is higher than normal due to ex Storm Sebastien which brings strong South Westerly Force 7 ...
Ex Storm Sebastien Brings Flooding And Force Seven Winds To Isle Of Wight. A Flood Alert’s in place for the Isle of Wight today (Wednesday) due to ex Storm Sebastien’s strong South Westerly Force 7 winds. Following a higher tide than usual yesterday (Tuesday) the ...


In [27]:
for node in partition_contents[5][:10]:
    print(corpus.iloc[node]['clean_text'])

;The storm literally sucks up the sea level;: Flood manager on working in extreme weather planning. “These are the highest sea levels we have seen in 15 years in some parts of the west of Scotland and the Western Isles,” says Vincent Fitzsimons, the Scottish Environment Protection Agency’s head of flooding services. “It’s an unusual and really dangerous combination of storm surge, naturally high tides – because of the way the Moon ...
Flood alert issued for Glasgow as Storm Brendan to hit UK. A flood alert is in place for Glasgow today as Storm Brendan is set to hit the country. While Glasgow will avoid the worst of the deluge, the Scottish Environment Protection Agency (SEPA) have issued a flood alert for West Central Scotland. The alert is to inform the public to be prepared for possible flooding. Heavy rain and strong gales are ...
Flood alert issued for Glasgow. The city has been soaked throughout Saturday ahead of what;s predicted to be a similarly wet Sunday. As a result, the Sco

In [28]:
for node in partition_contents[10][:10]:
    print(corpus.iloc[node]['clean_text'])

UK weather forecast: Met Office give its verdict on snow for the general election. WX Charts showed that up to one inch of snow could fall every hour in places next week with as much as 12 inches falling on high ground in northern Scotland
Will it snow on Christmas Day in Hull? The Met Office predicts what;s going to happen. It is that time of the year again when people take a punt on whether it is going to be a white Christmas. Unfortunately for Hull, it is unlikely snow will fall on December 25, with meteorologists at the Met Office saying it is quot;more than likelyquot; to be a dry Christmas for the city and East Yorkshire. Meteorologist Greg Dewhurst said the weather ...
UK weather forecast – White Christmas ‘expected’ with snow in parts in -4C DEEP FREEZE, Met Office says. The Scottish Highlands could see a light flurry of snow on Christmas DayCredit: Weather Outlook For there to be a White Christmas, there only needs to be snow somewhere in the UK - and the Scottish Highlands ma

In [29]:
for node in partition_contents[14][:10]:
    print(corpus.iloc[node]['clean_text'])

Bushfires thick smog to Sydney as storms lash Queensland bringing flash floods and giant hail. Sydneysiders have been warned to expect a very hazy Sunday as bushfire smoke is blown across the city. Temperatures are set to hit the high 40s in New South Wales by Thursday, according to the Bureau of Meteorology. There were 111 fires burning across the state on Saturday night, 60 of them not contained. Some 1500 firefighters were ...
Wildfire-ravaged areas of Australia get Christmas respite. An emergency vehicle near a fire in Blackheath, New South Wales (Ingleside Rural Fire Brigade/AP) The wildfire crisis forced Mr Morrison to cut short his much-criticised family holiday in Hawaii. He returned to Australia on Saturday night. “To Andrew and Geoffrey’s parents, we know this is going to be a tough Christmas for you, first one ...
Firefighters tackle blazes in New South Wales wildfires. Firefighters are battling wildfires in the city of Lithgow, north-west of Sydney with the state of New Sou