# Experimenting with Cleaning, Clustering & Summarization Pipelines

### To do (technical)
- Implement date windows on my corpus loader function
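
One possible shape for the date-window item, as a rough sketch only. The function name `filter_by_date_window` and its parameters are hypothetical (they are not part of `lib.helper`), and it assumes the corpus keeps its ISO-8601 timestamps in a `date` column, as the loader output further down suggests.

```python
import pandas as pd


def filter_by_date_window(df, start, end, date_col="date"):
    """Return rows whose timestamp falls in the half-open window [start, end)."""
    # Parse timestamps as timezone-aware UTC; unparseable values become NaT
    # and are excluded because comparisons against NaT are False.
    dates = pd.to_datetime(df[date_col], utc=True, errors="coerce")
    start_ts = pd.to_datetime(start, utc=True)
    end_ts = pd.to_datetime(end, utc=True)
    mask = (dates >= start_ts) & (dates < end_ts)
    return df[mask]


# Hypothetical toy corpus, not real scrape data.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "date": ["2019-11-16T09:00:00Z", "2019-11-17T17:35:00Z", "2019-11-18T00:00:00Z"],
})
window = filter_by_date_window(toy, "2019-11-17", "2019-11-18")
print(window)
```

A half-open window makes consecutive windows easy to chain, because a story stamped exactly at a boundary lands in one window only.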

In [1]:
import os
import re
import json

import numpy as np
import pandas as pd
import networkx as nx

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import lib.helper as helper
import lib.embedding_models as reps

from importlib import reload

%matplotlib inline

## 1.  Retrieve Corpus

The corpus is scraped by the "run_news_scrapes.py" script, which Windows Task Scheduler runs every 12 hours: shortly after midday and shortly after midnight.

The "bing" corpus consists of news titles and text extracts retrieved from the Bing News Search API using a few Home Office-related keywords.

The "RSS" corpus is pulled directly from a number of RSS feeds for world news sites and local British news sites, with no filtering by story type or subject.

In [2]:
# Should be the same path on all my PCs; each scrape is saved there as a separate JSON file.
storage_path = "D:/Dropbox/news_crow/scrape_results"

# "bing" is the targeted news-search corpus; "RSS" comes from specific world and local news feeds.
corpus_type = "disaster"

# There's a helper function to go find and drag out the various JSON files created by the scrapers.
corpus = helper.load_clean_corpus(storage_path, corpus_type)

# See how it turned out
print(corpus.head())
print(corpus.shape)

Total files: 153
9.8 percent of files read.
19.6 percent of files read.
29.4 percent of files read.
39.2 percent of files read.
49.0 percent of files read.
58.8 percent of files read.
68.6 percent of files read.
78.4 percent of files read.
88.2 percent of files read.
98.0 percent of files read.
                                               title  \
2  UK weather forecast – More than 100 <b>flood</...   
6  UK weather forecast: <b>Flood</b> chaos contin...   

                                             summary  \
0  Residents have been warned to &quot;remain vig...   
1  The Environment Agency has a number of <b>floo...   
2  <b>FLOOD</b>-ravaged villages in the UK have b...   
5  The Environment Agency has issued 57 <b>flood<...   
6  Despite some areas enduring their &#39;wettest...   

                           date  \
0  2019-11-17T17:35:00.0000000Z   
1  2019-11-17T18:35:00.0000000Z   
2  2019-11-17T13:45:00.0000000Z   
5  2019-11-17T16:38:00.0000000Z   
6  2019-11-17T18:32:00.

## 2. Use Detected Nouns to create a Graph Representation

In [3]:
# Retrieve the set of search terms used for Bing, so we can remove them before
# clustering.
with open("D:/Dropbox/news_crow/scrape_settings.json", "r") as f:
    scrape_config = json.load(f)

search_terms = scrape_config['disaster_search_list']
search_terms = re.sub(r"[^0-9A-Za-z ]", "", " ".join(search_terms)).lower().split()
search_terms = set(search_terms)
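
To make the normalisation above concrete, here is the same join, strip, lower-case and split sequence applied to a made-up term list. `example_terms` is a hypothetical stand-in for `scrape_config['disaster_search_list']`, whose real contents are not shown here.

```python
import re

# Hypothetical stand-in for scrape_config['disaster_search_list'].
example_terms = ["flood warning", "wild-fire", "Storm's path"]

# Join into one string, delete everything except letters, digits and spaces,
# lower-case the result, then split on whitespace and de-duplicate.
joined = " ".join(example_terms)
cleaned = re.sub(r"[^0-9A-Za-z ]", "", joined).lower()
example_search_terms = set(cleaned.split())

print(sorted(example_search_terms))
```

Punctuation is deleted rather than replaced with a space, so a hyphenated term such as "wild-fire" collapses into the single token "wildfire".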

In [4]:
# Generate the text representation
model = reps.NounAdjacencyModel(list(corpus['clean_text']), list(corpus['clean_text']))

# Tabulate for convenience
nouns_df = model.table.copy()
nouns_df.head()

Unnamed: 0,Wey,sofa,NAIROBI,BBC2;s,Rhona,BT-001,Tonight;s,Call,Elphin,sport;s,...,Storm_Atiyah,Bonner,pooch,M65_Burnley,fundraiser,Illinois,Learning_Academy,Moxley,Hartside,Coscelli
"West Midlands flood warnings prompt ;remain vigilant; alert. Residents have been warned to quot;remain vigilantquot; as up to 20 flood warnings are in place in the West Midlands with more rain forecast. There are 1 warnings affecting Worcestershire, along the River Severn, Avon and Teme, and six in Shropshire. Flood defences were put up in Ironbridge on Saturday evening. The Environment Agency (EA) said river ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
New flood warnings issued with more homes at risk. The Environment Agency has a number of flood alerts and warnings in place More residents are being told that floodwater could enter their homes as new red warnings are put in place. The Environment Agency has updated the flood risk for Hull and East Yorkshire this afternoon with four red flood warnings now in force. A red warning indicates that ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
UK weather forecast – More than 100 flood alerts across Britain as villages are cut off for days and more storms hit. FLOOD-ravaged villages in the UK have been cut off for days as Atlantic storms threaten to unleash another deluge early next week. Swathes of Britain that were left devastated by torrential downpours will face yet more floods - with more than 100 alerts in place. It comes amid predictions bitterly cold weather will largely hold out for the rest ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"UK flood warning map: Flood chaos to continue - Where is under threat of flooding?. The Environment Agency has issued 57 flood warnings at the time of writing, meaning flooding is expected and immediate action is required. There are also 0 flood alerts in place across the country, warning flooding is possible and to be prepared, READ MORE: Snow maps latest forecast: Arctic blast to hit UK with snow and sleet To see a full ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"UK weather forecast: Flood chaos continues with -5C freeze to follow ;wettest autumn;. Despite some areas enduring their ;wettest ever autumns;, much-needed relief from heavy rainfall has been forecast for flood-hit areas in the coming days",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Drop any noun/noun phrase containing one of the search terms, then create an adjacency matrix

#### Drop any noun/phrase occurring too infrequently

In [5]:
# Rank all nouns by total count across the corpus, most common first
nouns_to_keep = list(nouns_df.\
                    sum(axis=0).\
                    sort_values(ascending=False).\
                    index)

# Cut out any nouns containing the original search terms.
# Note: search_terms were lower-cased above, and this substring test is case-sensitive.
nouns_to_keep = [noun for noun in nouns_to_keep if not any(term in noun for term in search_terms)]

# Keep only top 500 most common
nouns_to_keep = nouns_to_keep[:500]

# Subset nouns dataframe
nouns_df = nouns_df[nouns_to_keep]

print(nouns_df.shape)

(12459, 500)
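
A toy run of the same rank, filter and truncate steps may help when checking the behaviour. All data below is made up. The second filter, which lower-cases each noun before testing, is an optional variant and not what the cell above does.

```python
import pandas as pd

# Hypothetical document-by-noun count table (rows are documents).
toy_nouns = pd.DataFrame({
    "flood_warning": [2, 1, 1],
    "river": [1, 1, 1],
    "Flood": [1, 1, 0],
    "koala": [0, 1, 0],
})
toy_terms = {"flood"}

# Rank nouns by total count, most common first.
ranked = list(toy_nouns.sum(axis=0).sort_values(ascending=False).index)

# Case-sensitive substring filter, as in the cell above.
kept = [noun for noun in ranked if not any(term in noun for term in toy_terms)]

# Optional variant: lower-case each noun first so "Flood" is also dropped.
kept_ci = [noun for noun in ranked if not any(term in noun.lower() for term in toy_terms)]

print(kept)
print(kept_ci)
```

With lower-cased search terms, the case-sensitive test lets capitalised nouns such as "Flood" through, which may or may not be what is wanted.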


In [6]:
embeddings = np.asarray(nouns_df)
adjacency = np.dot(embeddings, embeddings.T)
print(np.max(adjacency))

12
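
For intuition about what the dot product produces, here is a three-document example with made-up counts. Entry (i, j) of the result sums, over nouns, the product of the two documents' counts, so with 0/1 entries it is simply the number of shared nouns. Whether `model.table` holds raw counts or 0/1 flags is not shown here, so that is left as an assumption.

```python
import numpy as np

# Rows are documents, columns are noun counts (hypothetical data).
toy_embeddings = np.array([
    [1, 1, 0],   # doc 0 mentions nouns A and B
    [1, 0, 1],   # doc 1 mentions nouns A and C
    [0, 0, 2],   # doc 2 mentions noun C twice
])

# Document-by-document similarity: symmetric, with self-similarity on the diagonal.
toy_adjacency = toy_embeddings @ toy_embeddings.T
print(toy_adjacency)
```

The diagonal holds each document's similarity with itself, which is why self-edges have to be removed once the graph is built.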


In [7]:
# If the "lower" limit is 1, the graph has so many edges it eats ALL the memory of my desktop, even
# with just 500-ish stories to process.
upper = 100
lower = 3
G = nx.Graph()
rows, cols = np.where((upper >= adjacency) & (adjacency >= lower))
# Look up all edge weights in one vectorised indexing step
weights = adjacency[rows, cols].astype(float).tolist()
edges = zip(rows.tolist(), cols.tolist(), weights)
G.add_weighted_edges_from(edges)

# Simplify; remove self-edges
G.remove_edges_from(nx.selfloop_edges(G))

In [8]:
G.number_of_edges()

5019
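
The thresholding step can be tried on a tiny matrix to see which nodes and edges survive. The values below are made up. The sketch suggests one likely reason single-node cliques show up later: a document whose only qualifying entry is its own diagonal enters the graph through a self-edge and stays behind as an isolated node once self-edges are removed.

```python
import numpy as np
import networkx as nx

# Hypothetical symmetric similarity matrix for four documents.
toy_adjacency = np.array([
    [5, 3, 1, 0],
    [3, 4, 0, 0],
    [1, 0, 6, 2],
    [0, 0, 2, 2],
])
lower, upper = 3, 100

# Keep only entries inside [lower, upper]; this includes diagonal entries.
rows, cols = np.where((toy_adjacency >= lower) & (toy_adjacency <= upper))
weights = toy_adjacency[rows, cols].astype(float)

toy_G = nx.Graph()
toy_G.add_weighted_edges_from(zip(rows.tolist(), cols.tolist(), weights.tolist()))
toy_G.remove_edges_from(list(nx.selfloop_edges(toy_G)))

print(sorted(toy_G.nodes))
print(list(toy_G.edges(data="weight")))
```

Node 2 remains in the graph with no edges, while node 3 never enters it because none of its entries reach the lower limit.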

## 3.  Create (overlapping) clusters using Maximal Cliques
Idea from the docs, explanation at https://en.wikipedia.org/wiki/Clique_(graph_theory)

In [9]:
cliques = []
for x in nx.find_cliques(G):
    x.sort()
    cliques.append((len(x), x))
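
A small hand-checkable graph shows why maximal cliques give overlapping clusters. In this made-up example node 2 belongs to two triangles, and an isolated node comes back as a clique of size 1.

```python
import networkx as nx

toy_G = nx.Graph()
# Two triangles, {0, 1, 2} and {2, 3, 4}, sharing node 2.
toy_G.add_edges_from([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (2, 4)])
# An isolated node, reported as a clique of size 1.
toy_G.add_node(5)

# find_cliques yields each maximal clique as a list of nodes.
toy_cliques = sorted(sorted(clique) for clique in nx.find_cliques(toy_G))
print(toy_cliques)
```

Because a node can sit in several maximal cliques, the same story can appear in more than one cluster.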

In [10]:
cliques_df = pd.DataFrame({"nodes_list": [x[1] for x in cliques],
                           "clique_size": [x[0] for x in cliques]}).\
                    sort_values("clique_size", ascending=False).\
                    reset_index()

cliques_df = cliques_df[(cliques_df['clique_size'] >= 3) & (cliques_df['clique_size'] <=30)]

In [59]:
cliques

[(2, [0, 309]),
 (2, [6, 268]),
 (1, [8]),
 (2, [10, 57]),
 (1, [11]),
 (1, [12]),
 (1, [13]),
 (1, [15]),
 (6, [17, 896, 1561, 1605, 2658, 3168]),
 (2, [17, 10291]),
 (2, [17, 2883]),
 (2, [17, 10382]),
 (1, [19]),
 (2, [20, 36]),
 (1, [21]),
 (3, [27, 254, 345]),
 (1, [30]),
 (6, [33, 273, 564, 745, 893, 3395]),
 (1, [37]),
 (1, [40]),
 (1, [41]),
 (2, [42, 8050]),
 (2, [42, 10261]),
 (1, [46]),
 (2, [49, 2518]),
 (1, [51]),
 (1, [53]),
 (1, [55]),
 (2, [57, 9239]),
 (1, [58]),
 (4, [59, 72, 82, 281]),
 (2, [59, 476]),
 (2, [63, 6614]),
 (2, [63, 351]),
 (1, [67]),
 (1, [68]),
 (1, [70]),
 (1, [71]),
 (1, [76]),
 (1, [79]),
 (1, [81]),
 (1, [83]),
 (1, [87]),
 (2, [90, 278]),
 (2, [92, 279]),
 (1, [94]),
 (4, [100, 225, 2499, 2505]),
 (2, [106, 538]),
 (1, [115]),
 (3, [120, 1126, 5176]),
 (1, [131]),
 (2, [133, 333]),
 (1, [137]),
 (5, [138, 334, 1801, 3072, 3447]),
 (27,
  [138,
   331,
   334,
   794,
   2765,
   3072,
   3076,
   3079,
   3235,
   3322,
   3324,
   3447,
   3531,

In [12]:
cliques_df

Unnamed: 0,index,nodes_list,clique_size
0,54,"[138, 331, 334, 794, 2765, 3072, 3076, 3079, 3...",27
1,1941,"[5715, 6085, 6127, 6307, 6421, 6550, 6617, 743...",17
2,2212,"[1511, 1737, 1951, 2449, 4150, 4858, 5089, 516...",14
3,2189,"[6092, 6267, 6355, 6417, 6444, 6450, 6463, 662...",12
4,200,"[482, 7549, 8015, 8036, 10255, 10814, 10918, 1...",12
...,...,...,...
692,1177,"[3269, 9220, 9758]",3
693,2333,"[6621, 6752, 9132]",3
694,2335,"[6622, 6740, 9338]",3
695,2336,"[6622, 6740, 8525]",3


In [13]:
# Useful flatten function from Alex Martelli on https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
flatten = lambda l: [item for sublist in l for item in sublist]

In [14]:
cliqued = set(flatten(list(cliques_df['nodes_list'])))
len(cliqued)

1384
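
As an alternative to the lambda, the standard library's `itertools.chain.from_iterable` does the same flattening. The node lists below are made up, and the point of the set is that a node appearing in several cliques is counted once.

```python
from itertools import chain

# Hypothetical clique membership lists; node 2 appears in two of them.
toy_node_lists = [[0, 1, 2], [2, 3, 4], [7, 8, 9]]

# Flatten lazily, then de-duplicate with a set.
toy_cliqued = set(chain.from_iterable(toy_node_lists))
print(len(toy_cliqued))
```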

In [15]:
for node in cliques_df.iloc[0]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

‘National tragedy’: Hundreds of koalas feared dead in Australian wildfire. Hundreds of koalas are feared to have died in wildfires raging along Australia’s east coast. The fire started on Friday after a lightning strike hit a forest in the state of New South Wales. The blaze has since burned through 4,00 acres. Sharing the ...
‘National tragedy’: Hundreds of koalas feared dead in Australian wildfire. Hundreds of koalas are feared to have died in wildfires raging along Australia’s east coast. The fire started on Friday after a lightning strike hit a forest in the state of New South Wales. The blaze has since burned through 4,00 acres. Sharing the full story, not just the headlines Sue Ashton, the president of Port Macquarie Koala Hospital ...
Koala rescued from wildfires by passerby. A badly-burnt koala is rescued by a brave passerby as it crawled through a wildfire in Australia. The koala, named Lewis, is facing a long recovery. Toni Doherty gave him water and wrapped him in a blanket 

In [16]:
for node in cliques_df.iloc[1]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

Storm Dennis to bring ‘risk of flooding’ to UK says Met Office. The Met Office has warned that Storm Dennis is likely to bring a ‘risk of flooding’ with ‘widespread rain’. They add that there won’t be as heavy winds as Storm Ciara.
Storm Dennis flood risk could be worse than Ciara. Storm Dennis is forecast to batter large swathes of the country with 70mph winds and up to 140mm (5.5in) of rain in some areas. The Environment Agency (EA) said the flood impact from the weather system is likely to be worse than last weekend’s Storm Ciara due to rain falling on already saturated ground. The Met Office has issued severe ...
Met Office forecasts gales, heavy rain and possible snow. The UK is braced for more bad weather after the wettest February on record with gale-force winds and heavy rain forecast - and even the possibility of snow on high ground. Recent weeks have seen the country battered by Storm Ciara, Storm Dennis and Storm Jorge. The next few days should be relatively dry with a few s

In [17]:
for node in cliques_df.iloc[2]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

Election week could be hit with snow and freezing fog, say Met Office. SNOW and freezing fog could be on the cards for next week;s General Election. Temperatures will dive to sub-zero levels in some parts of the UK as winter really sets in, coinciding with voters heading to the polls, the Met Office has said. Election week is predicted to see longer spells of rain, wintry showers and harsh winds, meteorologists say.
Election week could be hit with snow and freezing fog, say Met Office. SNOW and freezing fog could be on the cards for next week;s General Election. Temperatures will dive to sub-zero levels in some parts of the UK as winter really sets in, coinciding with voters heading to the polls, the Met Office has said. Election week is ...
Snow is set to hit the UK but Met Office predicts Hull will escape early falls. The Met Office says they can only officially predict five days ahead, but can safely say there will be no snow falling across the region anytime soon. However, there wi

## 4. Create (flat) clusters using the Community Detection Algorithm

In [18]:
from community import best_partition

In [32]:
# Apply Louvain Community Detection
# The keys are nodes, the values are the partitions they belong to
partition = best_partition(G)

In [39]:
partition[6]

1
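
To see the kind of mapping Louvain produces, here is a toy graph with two dense groups joined by one edge. This sketch swaps in NetworkX's built-in `louvain_communities` (available in NetworkX 2.8 and later) instead of `best_partition` from the `community` package used above, then converts the result into the same node-to-cluster dictionary shape.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

toy_G = nx.Graph()
# Two fully connected groups of four nodes each.
toy_G.add_edges_from((a, b) for a in range(4) for b in range(a + 1, 4))
toy_G.add_edges_from((a, b) for a in range(4, 8) for b in range(a + 1, 8))
# A single bridging edge between the groups.
toy_G.add_edge(3, 4)

# Returns a list of node sets; the seed fixes the random node ordering.
communities = louvain_communities(toy_G, seed=0)

# Convert to a node -> cluster id mapping, like the partition dict above.
toy_partition = {node: i for i, nodes in enumerate(communities) for node in nodes}
print(toy_partition)
```

Unlike the clique approach, every node gets exactly one cluster id, so the clusters are flat and non-overlapping.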

In [56]:
# Append partition data to DF, save to file
partition_df = pd.DataFrame.\
               from_dict(partition, orient="index").\
               rename({0: "cluster"}, axis=1)

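# Partition keys are positional row numbers (the same ones used with corpus.iloc),
# and the corpus index has gaps, so reset it before joining on the index.
corpus.reset_index(drop=True).join(partition_df).\
       to_csv("working/corpus_clustered_louvain.csv")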

In [29]:
# Iterate through and get a list of partitions and their nodes
partition_contents = {}
for key in partition.keys():
    partition_contents[partition[key]] = partition_contents.get(partition[key], []) + [key]

# Drop partitions that are too small
for key in list(partition_contents.keys()):
    if len(partition_contents[key]) < 3:
        partition_contents.pop(key)
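
The same group-then-prune logic can be written with `collections.defaultdict`, which avoids rebuilding a list on every append. The partition below is made up, and `min_size` is a hypothetical name for the size cut-off of 3 used above.

```python
from collections import defaultdict

# Hypothetical node -> cluster id mapping.
toy_partition = {0: 1, 1: 1, 2: 1, 3: 7, 4: 7, 5: 9}
min_size = 3

# Invert the mapping: cluster id -> list of member nodes.
groups = defaultdict(list)
for node, cluster_id in toy_partition.items():
    groups[cluster_id].append(node)

# Keep only clusters with at least min_size members.
toy_contents = {cid: nodes for cid, nodes in groups.items() if len(nodes) >= min_size}
print(toy_contents)
```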

In [30]:
# Let's see how big our "clusters" are, and how many there are total after removing the tiny ones
partition_lengths = {key:len(value) for key, value in partition_contents.items()}
print(partition_lengths, sum(partition_lengths.values()))

{1: 67, 3: 112, 8: 25, 12: 3, 14: 54, 24: 5, 37: 37, 40: 3, 44: 86, 46: 5, 51: 4, 56: 41, 57: 67, 61: 12, 62: 3, 64: 9, 70: 3, 71: 112, 72: 96, 76: 49, 80: 21, 82: 6, 87: 4, 88: 4, 98: 6, 99: 3, 105: 3, 108: 13, 109: 3, 113: 54, 119: 12, 140: 4, 145: 3, 148: 3, 152: 185, 160: 11, 182: 46, 184: 26, 187: 17, 194: 5, 235: 3, 238: 13, 253: 3, 287: 3, 288: 3, 290: 3, 292: 26, 295: 3, 318: 7, 323: 17, 329: 6, 334: 15, 344: 18, 352: 3, 367: 3, 368: 7, 369: 3, 372: 3, 387: 7, 409: 3, 417: 4, 443: 3, 450: 4, 479: 4, 497: 8, 506: 3, 528: 3, 539: 7, 549: 4, 551: 3, 556: 5, 566: 4, 582: 4, 640: 3, 657: 11, 671: 69, 682: 3, 694: 3, 698: 3, 718: 7, 752: 3, 754: 4, 765: 13, 769: 5, 787: 5, 837: 3, 852: 6, 853: 8, 877: 3, 914: 3, 937: 6, 940: 3, 966: 4, 995: 4, 997: 4, 1017: 27, 1031: 4, 1045: 6, 1053: 3, 1063: 4, 1066: 5, 1078: 15, 1096: 3, 1109: 4, 1111: 3, 1120: 3, 1126: 3, 1139: 3, 1175: 6, 1178: 3, 1207: 3, 1227: 5, 1252: 4, 1288: 3, 1296: 3, 1299: 3, 1313: 8, 1314: 6, 1329: 3, 1332: 10, 1334: 4,

In [31]:
len(partition_contents)

167

In [25]:
for node in partition_contents[1][:10]:
    print(corpus.iloc[node]['clean_text'])

UK weather: Flooding chaos as northern England warned of further misery. The flooding which has brought chaos and misery to large parts of the UK is expected to last until Tuesday, according to the Environment Agency (EA). Homes in Gloucestershire and Worcestershire were left waterlogged over the weekend after the rivers Severn and Avon burst their banks. Drone footage revealed the extent of the flood around the ...
Storm Dennis LIVE - Flood alerts and travel updates as UK prepares for weekend of bad weather. Forecasters predict that large swathes of England, Scotland, Wales and Northern Ireland will be hit with 70mph winds and up to 140mm of rain. The Environment Agency (EA) said the flood impact from the wet weather is likely to be worse than last weekend’s Storm Ciara due to rain falling on already saturated ground. Storm Ciara, which hit the ...
Snow forecast by Met Office for South West as BBC says cold front will see temperatures fall. The Met Office has said parts of the South W

In [26]:
for node in partition_contents[3][:10]:
    print(corpus.iloc[node]['clean_text'])

Council issue advice on how residents in flood-hit Doncaster village can receive their post. Many people living in Fishlake will be living with the continuing impact of the floods, which hit the village after torrential rain began to fall on November 7, for months to come. Enivironment agency spokesman John Curtin said they hope all houses in the village should be free of water by the end of today. But while council workers and other ...
Mayor exposes Environment Agency’s flooding failure – The Yorkshire Post says. Boris Johnson finally met Doncaster flooding victims on November 1. Copyright: JPIMedia Resell For, even though George Eustice, the newly-appointed Environment Secretary, said, curtly, last week that a summit will now be instigated within “two months”, its importance and urgency is made clear by Doncaster mayor Ros Jones as her borough ...
Hour by hour weather forecast for Birmingham on Friday after city battered by floods. Rather cloudy on Sunday with the odd shower. Dry an

In [27]:
for node in partition_contents[8][:10]:
    print(corpus.iloc[node]['clean_text'])

Driving in flood water: How to drive through flood water as Britain UNDERWATER. How should you drive through flood water? According to the AA-Environment Agency study, 74 percent of 17,0 people would happily drive through floodwaters in the UK. The latest figures mean nearly three-quarters of British drivers risk their lives as they drive in bad weather. The Environment Agency advises just one foot of water is enough to ...
Environment Agency warns of ;difficult decisions; to protect Broads from flooding. In 2008 the then boss of the Agency warned that the Broads would be lost to sea within 100 years. At the same time Natural England said it was considering whether to abandon parts of the Broads to flooding, including Potter Heigham, Hickling, Horsey, Winterton and Sea Palling. It would ruin Norfolk;s tourism industry, but when asked last week ...


In [None]:
partition_contents