# Experimenting with Cleaning, Clustering & Summarization Pipelines

### To do (technical)
- Implement date windows on my corpus loader function

In [1]:
import os
import re
import json

import numpy as np
import pandas as pd
import networkx as nx

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import lib.helper as helper
import lib.embedding_models as reps

from importlib import reload

%matplotlib inline

# Useful flatten function from Alex Martelli on https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
flatten = lambda l: [item for sublist in l for item in sublist]

## 1.  Retrieve Corpus

The corpus is being scraped by the "run_news_scrapes.py" script (and windows task scheduler) every 12 hours, a bit past midday and a bit past midnight.

The "bing" corpus are news titles and text extracts gotten from the bing news search API, using a few Home Office - related keywords.

The "RSS" corpus is plugged directly into a number of RSS feeds for world news sites and local british news sites, with no filters for news story types or subjects applied.

In [2]:
# Should be same path for all my PC's, it's where each scrape goes as a separate json file.
storage_path = "D:/Dropbox/news_crow/scrape_results"

# "bing" is targeted news search corpus, "RSS" is from specific world and local news feeds.
corpus_type = "disaster"

# There's a helper function to go find and drag out the various JSON files created by the scrapers.
corpus = helper.load_clean_corpus(storage_path, corpus_type)

# Make sure after cleaning etc it's indexed from 0
corpus.reset_index(inplace=True)
corpus.index.name = "node"

# See how it turned out
print(corpus.shape)
corpus.head()

Total files: 304
9.9 percent of files read.
19.7 percent of files read.
29.6 percent of files read.
39.5 percent of files read.
49.3 percent of files read.
59.2 percent of files read.
69.1 percent of files read.
78.9 percent of files read.
88.8 percent of files read.
98.7 percent of files read.
(22091, 9)


Unnamed: 0_level_0,index,title,summary,date,link,source_url,retrieval_timestamp,origin,clean_text
node,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,0,West Midlands <b>flood</b> warnings prompt &#3...,Residents have been warned to &quot;remain vig...,2019-11-17T17:35:00.0000000Z,https://www.bbc.co.uk/news/uk-england-50451817,www.bbc.co.uk,2019-11-17 19:50:58.278878,bing_news_api,West Midlands flood warnings prompt ;remain vi...
1,1,New <b>flood</b> warnings issued with more hom...,The Environment Agency has a number of <b>floo...,2019-11-17T18:35:00.0000000Z,https://www.hulldailymail.co.uk/news/hull-east...,www.hulldailymail.co.uk,2019-11-17 19:50:58.278928,bing_news_api,New flood warnings issued with more homes at r...
2,2,UK weather forecast – More than 100 <b>flood</...,<b>FLOOD</b>-ravaged villages in the UK have b...,2019-11-17T13:45:00.0000000Z,https://www.thesun.co.uk/news/10342583/uk-weat...,www.thesun.co.uk,2019-11-17 19:50:58.278953,bing_news_api,UK weather forecast – More than 100 flood aler...
3,5,UK <b>flood</b> warning map: <b>Flood</b> chao...,The Environment Agency has issued 57 <b>flood<...,2019-11-17T16:38:00.0000000Z,https://www.express.co.uk/news/weather/1205629...,www.express.co.uk,2019-11-17 19:50:58.279028,bing_news_api,UK flood warning map: Flood chaos to continue ...
4,6,UK weather forecast: <b>Flood</b> chaos contin...,Despite some areas enduring their &#39;wettest...,2019-11-17T18:32:00.0000000Z,https://www.mirror.co.uk/news/uk-news/uk-weath...,www.mirror.co.uk,2019-11-17 19:50:58.279047,bing_news_api,UK weather forecast: Flood chaos continues wit...


## 2. Use Detected Nouns to create a Graph Representation

In [3]:
# Retrive the set of search terms used for Bing, so we can remove them before
# clustering.
with open("D:/Dropbox/news_crow/scrape_settings.json", "r") as f:
    scrape_config = json.load(f)

search_terms = scrape_config['disaster_search_list']
search_terms = re.sub(r"[^0-9A-Za-z ]", "", " ".join(search_terms)).lower().split()
search_terms = set(search_terms)

In [4]:
# Generate the text representation
model = reps.NounAdjacencyModel(list(corpus['clean_text']), list(corpus['clean_text']))

# Tabulate for convenience
nouns_df = model.table.copy()
nouns_df.head()

Unnamed: 0,Assange,Marines,McClay,IM,IPCC,Briton,Lockmeadow,Bible,Congenital,IIHF,...,Monte,Devine,Iris,NHSE,Monmouth;s,Chiseldon,Nightlife,Jadon,5Dr,Ware;s
"West Midlands flood warnings prompt ;remain vigilant; alert. Residents have been warned to quot;remain vigilantquot; as up to 20 flood warnings are in place in the West Midlands with more rain forecast. There are 1 warnings affecting Worcestershire, along the River Severn, Avon and Teme, and six in Shropshire. Flood defences were put up in Ironbridge on Saturday evening. The Environment Agency (EA) said river ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
New flood warnings issued with more homes at risk. The Environment Agency has a number of flood alerts and warnings in place More residents are being told that floodwater could enter their homes as new red warnings are put in place. The Environment Agency has updated the flood risk for Hull and East Yorkshire this afternoon with four red flood warnings now in force. A red warning indicates that ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
UK weather forecast – More than 100 flood alerts across Britain as villages are cut off for days and more storms hit. FLOOD-ravaged villages in the UK have been cut off for days as Atlantic storms threaten to unleash another deluge early next week. Swathes of Britain that were left devastated by torrential downpours will face yet more floods - with more than 100 alerts in place. It comes amid predictions bitterly cold weather will largely hold out for the rest ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"UK flood warning map: Flood chaos to continue - Where is under threat of flooding?. The Environment Agency has issued 57 flood warnings at the time of writing, meaning flooding is expected and immediate action is required. There are also 0 flood alerts in place across the country, warning flooding is possible and to be prepared, READ MORE: Snow maps latest forecast: Arctic blast to hit UK with snow and sleet To see a full ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"UK weather forecast: Flood chaos continues with -5C freeze to follow ;wettest autumn;. Despite some areas enduring their ;wettest ever autumns;, much-needed relief from heavy rainfall has been forecast for flood-hit areas in the coming days",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Drop any noun/noun phrase containing one of the search terms, then create an adjacency matrix

#### Drop any noun/phrase occuring too infrequently

In [5]:
# Get X most common nouns
nouns_to_keep = list(nouns_df.\
                    sum(axis=0).\
                    sort_values(ascending=False).\
                    index)

# Cut out any nouns containing the original search terms
nouns_to_keep = [noun for noun in nouns_to_keep if sum([term in noun for term in search_terms]) == 0]

# Keep only most common
nouns_to_keep = nouns_to_keep[:2000]

# Subset nouns dataframe
nouns_df = nouns_df[nouns_to_keep]

print(nouns_df.shape)

(22091, 2000)


In [6]:
embeddings = np.asarray(nouns_df)
adjacency = np.dot(embeddings, embeddings.T)
print(np.max(adjacency))

27


In [7]:
# If the "lower" limit is 1, the graph has so many edges it eats ALL the memory of my desktop, even
# with just 500-ish stories to process.
upper = 100
lower = 3
G = nx.Graph()
rows, cols = np.where((upper >= adjacency) & (adjacency >= lower))
weights = [float(adjacency[rows[i], cols[i]]) for i in range(len(rows))]
edges = zip(rows.tolist(), cols.tolist(), weights)
G.add_weighted_edges_from(edges)

# Simplify; remove self-edges - not sure if needed?
G.remove_edges_from(nx.selfloop_edges(G))

In [8]:
G.number_of_edges()

12318

## 3c.  Try CDLIB

In [9]:
import cdlib
from cdlib import algorithms
from cdlib import evaluation

In [10]:
# Simple (flat) clustering
lp_coms = algorithms.label_propagation(G)

# Traditional (easy) community detection
louvain_coms = algorithms.louvain(G)

In [11]:
# This result implies that the two methods have come to very similar conclusions...
# This function apparently isn't defined for overlapping communities
evaluation.normalized_mutual_information(lp_coms, louvain_coms)

MatchingResult(score=0.9593520005997092, std=None)

In [12]:
# Build dict of node-to-cluster lookup
community_lookup = {}
for comm_index, members in enumerate(louvain_coms.communities):
    for member in members:
        community_lookup[member] = comm_index

In [13]:
# Add cluster to DF.  If node not in cluster, assign -1 (outlier)
corpus['node'] = corpus.index
corpus['cluster'] = corpus['node'].apply(lambda x: community_lookup.get(x, -1))
corpus[['clean_text', 'cluster']].head(10)

Unnamed: 0_level_0,clean_text,cluster
node,Unnamed: 1_level_1,Unnamed: 2_level_1
0,West Midlands flood warnings prompt ;remain vi...,423
1,New flood warnings issued with more homes at r...,1536
2,UK weather forecast – More than 100 flood aler...,-1
3,UK flood warning map: Flood chaos to continue ...,-1
4,UK weather forecast: Flood chaos continues wit...,-1
5,Flood warnings in place as groundwater levels ...,1537
6,Motorists ignore road closure at North Bank as...,19
7,Latest on flood warnings and road closures in ...,37
8,UK weather: 66 flood warnings in place as ;cha...,1
9,Italy weather: Italy braces as flood warnings ...,1538


In [14]:
corpus.to_csv("working/disaster_clustered_louvain.csv")

In [15]:
#bigclam_coms.communities

In [16]:
#bigclam_coms.average_internal_degree()

In [17]:
#bigclam_coms.newman_girvan_modularity()

## 3.  Create (overlapping) clusters using Maximal Cliques
Idea from the docs, explanation at https://en.wikipedia.org/wiki/Clique_(graph_theory)
Expanded using k-clique-communities REF FIND PAPER

In [18]:
c = list(nx.algorithms.community.kclique.k_clique_communities(G, 4))
cliques = [(len(x), x) for x in c]

In [19]:
cliques

[(14,
  frozenset({6,
             268,
             348,
             1363,
             2370,
             2790,
             2907,
             3404,
             3521,
             5351,
             5716,
             5841,
             7182,
             7704})),
 (7, frozenset({7, 569, 4173, 5706, 6532, 8672, 10271})),
 (4, frozenset({8, 51, 63, 351})),
 (6, frozenset({33, 273, 564, 745, 893, 3395})),
 (4, frozenset({49, 4072, 4299, 12150})),
 (7, frozenset({58, 463, 1985, 1990, 2118, 2131, 2304})),
 (5, frozenset({58, 1990, 1992, 2031, 2124})),
 (4, frozenset({59, 72, 82, 281})),
 (4, frozenset({90, 278, 3024, 3191})),
 (4, frozenset({100, 225, 2499, 2505})),
 (5, frozenset({101, 476, 843, 1857, 3180})),
 (12,
  frozenset({112,
             115,
             395,
             542,
             1118,
             1324,
             8064,
             9248,
             9642,
             10303,
             13276,
             13376})),
 (64,
  frozenset({138,
             145,


In [20]:
cliques_df = pd.DataFrame({"nodes_list": [x[1] for x in cliques],
                           "clique_size": [x[0] for x in cliques]}).\
                    sort_values("clique_size", ascending=False).\
                    reset_index()

cliques_df = cliques_df[(cliques_df['clique_size'] >= 3) & (cliques_df['clique_size'] <=100)]

In [21]:
cliques_df

Unnamed: 0,index,nodes_list,clique_size
1,12,"(3072, 3076, 3079, 4360, 1801, 4490, 3979, 474...",64
2,16,"(5378, 5381, 5382, 8075, 8077, 8079, 8083, 706...",35
3,78,"(20992, 4613, 1546, 19982, 20879, 3476, 3481, ...",30
4,242,"(16513, 21250, 11587, 17925, 17030, 17926, 127...",30
5,52,"(9856, 3267, 2244, 7429, 9033, 3577, 2062, 270...",28
...,...,...,...
399,196,"(4970, 14331, 5292, 13380)",4
400,197,"(5378, 15899, 5076, 16259)",4
401,198,"(5216, 5097, 15788, 8120)",4
402,199,"(13625, 8175, 5253, 13919)",4


In [22]:
cliqued = set(flatten(list(cliques_df['nodes_list'])))
len(cliqued)

2075

In [23]:
# Flatten the cliques DF into long format
flattened = {"cluster_index":[], "node":[]}

for index, row in cliques_df.iterrows():
    for node in row["nodes_list"]:
        flattened["cluster_index"].append(index)
        flattened["node"].append(node)
        

partition_df = pd.DataFrame(flattened)

# Create a single string variable (";" separated) to record all clusters/cliques a single record belongs in
partition_df["cluster"] = partition_df.\
                          groupby("node")["cluster_index"].\
                          transform(lambda x: ";".join([str(i) for i in x if type(i)==int]))

# Clean up, set index of this and corpus so the two DF's can be joined with little effort
partition_df = partition_df[["node", "cluster"]].\
               drop_duplicates(["node", "cluster"], keep="first").\
               set_index("node")

corpus.drop(["cluster", "node"], axis=1).join(partition_df).\
       to_csv("working/disaster_clustered_cliques.csv")

ValueError: columns overlap but no suffix specified: Index(['cluster'], dtype='object')

### The below attempts overlapping community detection but can only run on connected graphs, think this is an implicit restriction of the algorithm logic.

In [24]:
# Get all connected components (will become less of an issue as graph size increases)
ccs = [(len(x), x) for x in nx.connected_components(G)]

# Sort by size (largest first)
ccs.sort(key = lambda x: x[0], reverse=True)

# Extract largest connected sub-graph
connected_sub = G.subgraph(ccs[0][1])

# re-index nodes from zero to maintain compatibility with CDLIB (sub-dependency, Karate)
# Will need to reverse this indexing when matching assigned clusters back to data
node_relabel_dict = {val: i for i, val in enumerate(list(connected_sub.nodes))}

connected_sub = nx.relabel_nodes(connected_sub, node_relabel_dict)

# Fire algo!
bigclam_coms = algorithms.big_clam(connected_sub)
#leiden_coms = algorithms.leiden(connected_sub)

In [25]:
bigclam_coms.communities

[[0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  62,
  63,
  64,
  65,
  66,
  67,
  68,
  69,
  70,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  78,
  79,
  80,
  81,
  82,
  83,
  84,
  85,
  86,
  87,
  88,
  89,
  90,
  91,
  92,
  93,
  94,
  95,
  96,
  97,
  98,
  99,
  100,
  101,
  102,
  103,
  104,
  105,
  106,
  107,
  108,
  109,
  110,
  111,
  112,
  113,
  114,
  115,
  116,
  117,
  118,
  119,
  120,
  121,
  122,
  123,
  124,
  125,
  126,
  127,
  128,
  129,
  130,
  131,
  132,
  133,
  134,
  135,
  136,
  137,
  138,
  139,
  140,
  141,
  142,
  143,
  144,
  145,
  146,
  147,
  148,
  149,
  150,
  151,
  152,
  153,
  154,
  155,
  156,
  157,
  15

In [26]:
# Build dict of node-to-cluster lookup
community_lookup = {}
for comm_index, members in enumerate(bigclam_coms.communities):
    for member in members:
        community_lookup[member] = community_lookup.get(member, []) + [comm_index]

In [27]:
# Add cluster to DF.  If node not in cluster, assign -1 (outlier)
corpus['node'] = corpus.index
corpus['cluster'] = corpus['node'].apply(lambda x: community_lookup.get(x, [-1]))
corpus[['clean_text', 'cluster']].head(10)

Unnamed: 0_level_0,clean_text,cluster
node,Unnamed: 1_level_1,Unnamed: 2_level_1
0,West Midlands flood warnings prompt ;remain vi...,[0]
1,New flood warnings issued with more homes at r...,[0]
2,UK weather forecast – More than 100 flood aler...,[0]
3,UK flood warning map: Flood chaos to continue ...,[0]
4,UK weather forecast: Flood chaos continues wit...,[0]
5,Flood warnings in place as groundwater levels ...,[0]
6,Motorists ignore road closure at North Bank as...,[0]
7,Latest on flood warnings and road closures in ...,[0]
8,UK weather: 66 flood warnings in place as ;cha...,[0]
9,Italy weather: Italy braces as flood warnings ...,[0]


In [28]:
corpus.to_csv("working/disaster_clustered_bigclam.csv")