# Experimenting with Cleaning, Clustering & Summarization Pipelines

### To do (technical)
- Implement date windows on my corpus loader function
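
A possible starting point for the date-window item, as a minimal sketch. It assumes the scrape date can be read from the filename pattern seen in the loader output (e.g. `bing_corpus_2019-09-05_2135.json`). `files_in_window` is a hypothetical helper, not something that exists in `lib.helper`.

```python
import re
from datetime import date

# Assumed filename pattern, taken from the loader's log output:
#   <corpus>_corpus_YYYY-MM-DD_HHMM.json
FILENAME_DATE = re.compile(r"_(\d{4})-(\d{2})-(\d{2})_\d{4}\.json$")


def files_in_window(filenames, start, end):
    """Return the filenames whose scrape date lies in [start, end] (inclusive)."""
    kept = []
    for name in filenames:
        match = FILENAME_DATE.search(name)
        if match is None:
            continue  # skip files that don't follow the naming pattern
        year, month, day = (int(part) for part in match.groups())
        if start <= date(year, month, day) <= end:
            kept.append(name)
    return kept


example_files = [
    "bing_corpus_2019-09-05_2135.json",
    "bing_corpus_2019-09-06_0019.json",
    "bing_corpus_2019-09-10_1221.json",
]
windowed = files_in_window(example_files, date(2019, 9, 6), date(2019, 9, 9))
print(windowed)
```

Filtering on the filename keeps the window check cheap, since files outside the window never need to be opened.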

In [1]:
import os
import re
import json

import numpy as np
import pandas as pd
import networkx as nx

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import lib.helper as helper
import lib.embedding_models as reps

from importlib import reload

%matplotlib inline



In [4]:
# Should be the same path on all my PCs; it's where each scrape goes as a separate JSON file.
storage_path = "C:/Users/Martin/Dropbox/news_crow/scrape_results"

# "bing" is the targeted news search corpus; "RSS" comes from specific world and local news feeds.
corpus_type = "bing_cor"

## 1.  Retrieve Corpus

The corpus is scraped by the "run_news_scrapes.py" script (run by Windows Task Scheduler) every 12 hours, a bit past midday and a bit past midnight.

The "bing" corpus consists of news titles and text extracts from the Bing News Search API, retrieved using a few Home Office-related keywords.

The "RSS" corpus is pulled directly from a number of RSS feeds for world news sites and local British news sites, with no filters applied for news story type or subject.

### First, get a list of all the news dumps created so far

In [5]:
corpus = helper.load_clean_corpus(storage_path, corpus_type)

Total files: 153
Loading file: bing_corpus_2019-09-05_2135.json
Loading file: bing_corpus_2019-09-06_0019.json
Loading file: bing_corpus_2019-09-06_1221.json
Loading file: bing_corpus_2019-09-07_0019.json
Loading file: bing_corpus_2019-09-07_1221.json
Loading file: bing_corpus_2019-09-08_0019.json
Loading file: bing_corpus_2019-09-08_1221.json
Loading file: bing_corpus_2019-09-09_0019.json
Loading file: bing_corpus_2019-09-09_1221.json
Loading file: bing_corpus_2019-09-10_0019.json
Loading file: bing_corpus_2019-09-10_1221.json
Loading file: bing_corpus_2019-09-11_0019.json
Loading file: bing_corpus_2019-09-11_1221.json
Loading file: bing_corpus_2019-09-12_0019.json
Loading file: bing_corpus_2019-09-12_1221.json
Loading file: bing_corpus_2019-09-13_0019.json
Loading file: bing_corpus_2019-09-13_1221.json
Loading file: bing_corpus_2019-09-14_0019.json
Loading file: bing_corpus_2019-09-14_1221.json
Loading file: bing_corpus_2019-09-15_0019.json
Loading file: bing_corpus_2019-09-15_2059.j

In [6]:
corpus.head()

Unnamed: 0,date,link,origin,retrieval_timestamp,source_url,summary,title,clean_text
14,2019-09-05T15:12:00.0000000Z,https://www.gov.uk/government/news/government-...,bing_news_api,2019-09-05 21:35:05.106001,www.gov.uk,New border controls that will make it harder f...,Government announces <b>immigration</b> plans ...,Government announces immigration plans for no ...
16,2019-09-05T08:23:00.0000000Z,https://www.thesun.co.uk/news/9865413/home-sec...,bing_news_api,2019-09-05 21:35:05.107007,www.thesun.co.uk,PRITI PATEL tonight conceded unlimited EU <b>i...,Home Secretary Priti Patel admits No Deal Brex...,Home Secretary Priti Patel admits No Deal Brex...
28,2019-09-05T16:54:00.0000000Z,https://www.thetelegraphandargus.co.uk/news/17...,bing_news_api,2019-09-05 21:35:05.108030,www.thetelegraphandargus.co.uk,A STUDENT from Bradford has helped create a sh...,Student film on <b>immigration</b> focuses on ...,Student film on immigration focuses on those m...
30,2019-09-05T14:42:00.0000000Z,https://www.gov.uk/government/publications/no-...,bing_news_api,2019-09-05 21:35:05.108030,www.gov.uk,The United Kingdom will be leaving the Europea...,No deal <b>immigration</b> arrangements for EU...,No deal immigration arrangements for EU citize...
31,2019-09-04T22:12:11.0000000Z,https://www.dailymail.co.uk/wires/ap/article-7...,bing_news_api,2019-09-05 21:35:05.108030,www.dailymail.co.uk,MEXICO CITY (AP) - Since last year&#39;s carav...,AP EXPLAINS: What changed in 90 days of <b>imm...,AP EXPLAINS: What changed in 0 days of immigra...


## 2. Clustering using Entity Detection And Network Analytics

This doesn't resolve very well for the Bing corpus, because the stories are full of keywords from the original searches. I suspect that has a lot to do with the failure of the other methods too. For the network analytics method I'm going to try removing the keywords from the table first.

In [7]:
with open("C:/Users/Martin/Dropbox/news_crow/scrape_settings.json", "r") as f:
    scrape_config = json.load(f)

search_terms = scrape_config['search_list']
search_terms = re.sub(r"[^0-9A-Za-z ]", "", " ".join(search_terms)).lower().split()
search_terms = set(search_terms)

In [8]:
search_terms

{'abuse',
 'border',
 'child',
 'domestic',
 'enforcement',
 'force',
 'home',
 'immigration',
 'international',
 'office',
 'patel',
 'priti',
 'secretary',
 'students',
 'uk',
 'windrush'}

In [15]:
model = reps.NounAdjacencyModel(corpus['clean_text'], corpus['clean_text'])

In [16]:
model.noun_sets[3]

{'brexit',
 'brexit_october',
 'eu_uk',
 'european_union',
 'october',
 'united_kingdom'}

In [191]:
nouns_df = model.table.copy()
nouns_df.head()

clean_text,status’,nov,roy,outlook,opjemlock,hudgell,payano,mps’,notts,supermarket,...,burnham,mermaid,wednesday;s,lovell,devops,auckland,cardiff,minster_mill,aa,russel
"Government announces immigration plans for no deal Brexit. New border controls that will make it harder for serious criminals to enter the UK will be introduced in the event of a no deal Brexit, the government has announced today (4 September). In a move signalling the end of free movement in its current form, a ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Home Secretary Priti Patel admits No Deal Brexit wouldn’t end unregulated EU immigration until 2021. PRITI PATEL tonight conceded unlimited EU immigration will all-but remain in place until 2021 in a No Deal. Despite promising tougher criminal checks on migrants from October 1, the Home Office said EU citizens would be allowed unfettered access to the UK ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Student film on immigration focuses on those making district their home. A STUDENT from Bradford has helped create a short film on immigration, featuring interviews with three people who have made the district their home. Ruby Blake, 22, who studied film and TV production at Northumbria University, was inspired to make the ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
No deal immigration arrangements for EU citizens moving to the UK after Brexit. The United Kingdom will be leaving the European Union on 1 October 201. This paper sets out the immigration arrangements that will apply to EU citizens and their family members who are moving to the UK after Brexit on 1 October 201 in the event that ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"AP EXPLAINS: What changed in 0 days of immigration accord?. MEXICO CITY (AP) - Since last year;s caravans of Central American migrants began reaching the U.S. border, the Trump administration had been increasing pressure on Mexico President Andrés Manuel López Obrador to stop the flow of migrants. But it was the ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Drop any noun/noun phrase containing one of the search terms, then create an adjacency matrix

### Drop any noun/phrase occurring too infrequently

In [192]:
# Rank all nouns by total count across the corpus, most common first
nouns_to_keep = list(nouns_df
                     .sum(axis=0)
                     .sort_values(ascending=False)
                     .index)

# Cut out any nouns containing the original search terms (substring match)
nouns_to_keep = [noun for noun in nouns_to_keep
                 if not any(term in noun for term in search_terms)]

# Keep only top 500 most common
nouns_to_keep = nouns_to_keep[:500]

# Subset nouns dataframe
nouns_df = nouns_df[nouns_to_keep]

print(nouns_df.shape)

(4826, 500)
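
One thing to watch: `term in noun` is a substring test, so a short search term like 'uk' also removes any noun that merely contains those letters. A minimal sketch of a stricter, whole-word alternative follows. It assumes noun phrases are joined with underscores, as in the `model.noun_sets` output above; the noun list here is made up for illustration.

```python
search_terms = {"uk", "home", "office"}   # subset of the real set, for illustration
nouns = ["ukraine", "eu_uk", "home_office", "duke", "brexit"]   # made-up examples

# Substring test: also drops 'ukraine' and 'duke' because they contain 'uk'
substring_kept = [n for n in nouns if not any(term in n for term in search_terms)]

# Token test: split each phrase on '_' and compare whole words only
token_kept = [n for n in nouns if not search_terms & set(n.split("_"))]

print(substring_kept)   # ['brexit']
print(token_kept)       # ['ukraine', 'duke', 'brexit']
```

Whether the stricter test is actually better depends on the corpus; the substring version also catches plurals and possessives of the search terms, which the token version lets through.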


In [193]:
embeddings = np.asarray(nouns_df)
adjacency = np.dot(embeddings, embeddings.T)
print(np.max(adjacency))

13
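
To make the dot product concrete, here is a toy version. It assumes the noun table holds 0/1 indicators, in which case entry (i, j) of the product is the number of nouns that stories i and j share, and the diagonal is each story's own noun count. If the table holds counts instead, the entries are sums of products of counts.

```python
import numpy as np

# Toy noun table: 3 stories (rows) x 4 nouns (columns), 1 = noun appears in story
toy = np.array([
    [1, 1, 0, 0],   # story 0
    [1, 1, 1, 0],   # story 1
    [0, 0, 1, 1],   # story 2
])

toy_adjacency = toy @ toy.T
print(toy_adjacency)
# [[2 2 0]
#  [2 3 1]
#  [0 1 2]]
```

Stories 0 and 1 share two nouns, stories 1 and 2 share one, and stories 0 and 2 share none.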


In [288]:
# If the "lower" limit is 1, the graph has so many edges it eats ALL the memory of my desktop, even
# with just 500-ish stories to process.
upper = 100
lower = 3
G = nx.Graph()
# Index pairs of stories sharing between `lower` and `upper` nouns (inclusive)
rows, cols = np.where((adjacency >= lower) & (adjacency <= upper))
weights = adjacency[rows, cols].astype(float).tolist()
edges = zip(rows.tolist(), cols.tolist(), weights)
G.add_weighted_edges_from(edges)

# Simplify; remove self-edges
G.remove_edges_from(nx.selfloop_edges(G))

In [289]:
G.number_of_edges()

2606
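
Because the adjacency matrix is symmetric, every pair shows up twice in `np.where`, plus once on the diagonal as a self-pair. A sketch of one way to cut that down is to threshold only the part strictly above the diagonal. The toy matrix below is made up. Note one difference in behaviour: this sketch adds every story as a node up front, whereas the cell above only creates nodes for stories that appear in at least one qualifying pair (including the self-pair).

```python
import numpy as np
import networkx as nx

toy_adjacency = np.array([
    [4, 3, 0, 1],
    [3, 5, 3, 0],
    [0, 3, 3, 0],
    [1, 0, 0, 2],
])
lower, upper = 3, 100

# Keep only entries strictly above the diagonal: drops self-pairs and mirrored duplicates
upper_triangle = np.triu(toy_adjacency, k=1)
rows, cols = np.where((upper_triangle >= lower) & (upper_triangle <= upper))

toy_G = nx.Graph()
toy_G.add_nodes_from(range(len(toy_adjacency)))   # stories with no links stay as isolated nodes
toy_G.add_weighted_edges_from(
    zip(rows.tolist(), cols.tolist(), upper_triangle[rows, cols].astype(float).tolist())
)

print(sorted(toy_G.edges(data="weight")))   # [(0, 1, 3.0), (1, 2, 3.0)]
```

This roughly halves the edge list passed to networkx and makes the separate self-loop removal unnecessary. It does not shrink the dense adjacency matrix itself, so it may not be enough on its own to fix the memory problem with a lower limit of 1.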

In [290]:
#G_plot = nx.petersen_graph()
#plt.subplot(121)
#nx.draw(G, with_labels=True, font_weight='bold')
#plt.subplot(122)
#nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True, font_weight='bold')

### Cliques, worth a look?
Idea from the docs, explanation at https://en.wikipedia.org/wiki/Clique_(graph_theory)

So, cliques are allowed to overlap; I should've thought of that.  Still, the preliminary results are good, and I've found I can disambiguate the cliques to some degree by cutting out weaker links (those with fewer shared entities).

I should add that it appears to suffer from the same problem as the other clustering methods: clusters are ultimately hierarchical!
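
A tiny illustration of the overlap: `nx.find_cliques` returns maximal cliques, and a node can belong to several of them. The toy graph below is two triangles sharing an edge.

```python
import networkx as nx

toy_G = nx.Graph()
# Two triangles, (0, 1, 2) and (1, 2, 3), sharing the edge (1, 2)
toy_G.add_edges_from([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)])

toy_cliques = sorted(sorted(clique) for clique in nx.find_cliques(toy_G))
print(toy_cliques)   # [[0, 1, 2], [1, 2, 3]]
```

Nodes 1 and 2 appear in both cliques. Nodes 0 and 3 are not linked, so the four nodes do not merge into one larger clique.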

In [291]:
cliques = []
for x in nx.find_cliques(G):
    x.sort()
    cliques.append((len(x), x))

In [292]:
cliques_df = pd.DataFrame({"nodes_list": [x[1] for x in cliques],
                           "clique_size": [x[0] for x in cliques]}).\
                    sort_values("clique_size", ascending=False).\
                    reset_index()

In [293]:
len(cliques_df[cliques_df['clique_size'] >= 5])

86

In [294]:
cliques_df[cliques_df['clique_size'] >= 5]

Unnamed: 0,index,nodes_list,clique_size
0,894,"[2560, 2561, 2565, 2582, 2603, 2635, 2636, 265...",14
1,762,"[2162, 2380, 2381, 2382, 2384, 2386, 2387, 238...",13
2,884,"[2515, 2517, 2518, 2525, 2526, 2527, 2547, 255...",11
3,1339,"[4133, 4171, 4173, 4175, 4182, 4183, 4293, 429...",10
4,949,"[2736, 2748, 2770, 2774, 2788, 2821, 2838, 291...",10
5,1117,"[3282, 3391, 3545, 3546, 3584, 3587, 3628, 375...",9
6,1001,"[2839, 3819, 3988, 3993, 4016, 4025, 4039, 417...",9
7,1084,"[3156, 3988, 3993, 4016, 4025, 4039, 4084, 417...",9
8,1085,"[3156, 3988, 3993, 4016, 4025, 4039, 4174, 434...",9
9,1116,"[3282, 3391, 3545, 3546, 3548, 3584, 3587, 362...",9
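
Rows 6 to 8 above share most of their nodes, which is the overlap problem in practice. One rough way to spot near-duplicate cliques is Jaccard overlap between their node lists. This is a sketch of a technique I haven't used elsewhere in this notebook; the node lists and the 0.5 threshold are made up for illustration.

```python
def jaccard(a, b):
    """Share of nodes two cliques have in common: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Made-up node lists standing in for rows of cliques_df['nodes_list']
toy_cliques = [[1, 2, 3, 4], [2, 3, 4, 5], [10, 11, 12]]

# All pairs of cliques whose overlap meets an (arbitrary) threshold of 0.5
overlapping_pairs = [
    (i, j, jaccard(toy_cliques[i], toy_cliques[j]))
    for i in range(len(toy_cliques))
    for j in range(i + 1, len(toy_cliques))
    if jaccard(toy_cliques[i], toy_cliques[j]) >= 0.5
]
print(overlapping_pairs)   # [(0, 1, 0.6)]
```

Pairs flagged this way could then be merged or collapsed to the larger clique, depending on how strict the clusters need to be.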


Useful flatten function from Alex Martelli on https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists

In [304]:
def flatten(nested):
    return [item for sublist in nested for item in sublist]

cliqued = set(flatten(list(cliques_df['nodes_list'])))
len(cliqued)

2134

In [295]:
# Graph nodes are row positions in nouns_df, so look each story up by position
articles = nouns_df.reset_index()
for node in cliques_df.iloc[0]['nodes_list']:
    print(articles.iloc[node]['clean_text'])

Vietnamese teenager feared among dead migrants. The Vietnamese Embassy in London has started a hotline while the ambassador to the UK, Tran Ngoc An, spoke to Home Secretary Priti Patel on Friday night before meeting investigators from the National Crime Agency and Essex Police. Detective Chief Inspector Martin Pasmore told reporters on Saturday the ambassador had visited the civic centre in ...
Lorry driver charged over migrant deaths. Essex Police initially believed they were all Chinese nationals, but Vietnamese men and women are now feared to be among the dead. Vietnamese ambassador to the UK, Tran Ngoc An, spoke to Home Secretary Priti Patel on Friday night before meeting investigators from the National Crime Agency and Essex Police. Detective Chief Inspector Martin ...
Vietnamese teenager feared among dead migrants. The Vietnamese Embassy in London has started a hotline while the ambassador to the UK, Tran Ngoc An, spoke to Home Secretary Priti Patel on Friday night before meeting

In [296]:
for node in cliques_df.iloc[1]['nodes_list']:
    article = nouns_df.reset_index().iloc[node]
    print(article['clean_text'])

Four men arrested on suspicion of facilitating immigration following raids in Romford and Brentwood. After the lorry was searched, a British national was detained by the French authorities. Four men aged between 2 and were arrested by the National Crime Agency (NCA) on suspicion of facilitating immigration following a raids in Romford and Brentwood. NCA regional head of investigation Gerry McLean said: quot;Those who seek to profit from ...
Luxury yacht used to smuggle group of Albanian migrants to Britain, NCA reveal. Eight Albanian nationals were found on board and their cases will now be dealt with in line with UK immigration rules by Immigration Enforcement. An investigation has been launched by the National Crime Agency (NCA). The yacht’s skipper, a 64-year-old British national, has been been held on suspicion of immigration offences following the ...
Luxury yacht used to smuggle group of Albanian migrants to Britain, NCA reveal. Eight Albanian nationals were found on board and th

In [297]:
for node in cliques_df.iloc[3]['nodes_list']:
    article = nouns_df.reset_index().iloc[node]
    print(article['clean_text'])

Prince Andrew made ;unbelievable; racist comments about Arabs, claims ex-Home Secretary Jacqui Smith. Prince Andrew made racist jokes about Arabs during a state banquet for the Saudi Royal family, a former Home Secretary has claimed. Jacqui Smith, an ex-cabinet member, said the Duke of York made the ;unbelievable; comments while mingling with British politicians at Buckingham Palace. Mrs Smith said the conversation, in which the prince told ...
Former Home Secretary claims Prince Andrew made ‘unbelievable racist comments about Arabs ‘. A former British Home Secretary has claimed that the Duke of York made ‘unbelievable’ racist comments about Arabs during a Saudi state visit to the UK in 2007. Jacqui Smith, who served as Home Secretary from 2007 – 200, said that she felt ‘ashamed’ for not challenging Prince Andrew on his remarks. Mrs Smith, who was one of the most ...
Prince Andrew made ;unbelievable; racist comments about Arabs, ex-Home Secretary says. Prince Andrew has been hauled int

In [298]:
for node in cliques_df.iloc[17]['nodes_list']:
    article = nouns_df.reset_index().iloc[node]
    print(article['clean_text'])

Andrew Marr Accuses Priti Patel Of ‘Laughing’ When Questioned About Brexit Fears. Home Secretary Priti Patel has been criticised for “laughing” during an interview on the BBC’s Andrew Marr Show. Halfway through reading out a list of industry groups who have raised concerns in a letter to the government about the impact of a no-deal Brexit, Marr stopped and said to Patel: “I can’t see why you are laughing.”
Priti Patel bizarrely accused of ;laughing; on BBC;s Andrew Marr Show during discussion of businesses Brexit fears. Andrew Marr accused Priti Patel of laughing on his BBC show today during a discussion of businesses Brexit fears. Mr Marr clashed with the Home Secretary as he put forward concerns about the UK;s hardline approach to leaving the EU. He listed a number of industry bodies worried about what will happen when the country leaves the bloc ...
‘I can’t see why you are laughing’: Priti Patel accused of smirking about impact of no-deal Brexit on BBC;s Andrew Marr Show. Home secr

### Connected components

In [204]:
nx.number_connected_components(G)

506

In [205]:
components = list(nx.connected_components(G))

In [206]:
print([len(component) for component in components])

[2307, 1, 1, 1, 2, 3, 1, 1, 1, 2, 1, 5, 1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 2, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 7, 1, 3, 1, 4, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 10, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 3, 1, 4, 1, 2, 2, 6, 1, 1, 1, 2, 1, 1, 5, 2, 1, 1, 3, 1, 3, 4, 2, 1, 1, 1, 3, 1, 3, 2, 1, 4, 1, 1, 3, 2, 3, 2, 2, 1, 3, 5, 3, 1, 2, 1, 4, 1, 2, 2, 2, 4, 1, 1, 3, 3, 1, 5, 1, 2, 2, 2, 2, 3, 2, 1, 2, 1, 2, 1, 1, 3, 1, 1, 2, 3, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 5, 1, 2, 1, 2, 1, 3, 2, 3, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 2, 3, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 13, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 7, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 3, 1, 4, 2, 2, 2, 2, 3, 1, 2, 1, 2, 4, 3, 1, 2, 1, 2, 3, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2
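
The raw list is hard to read: one giant component of 2307 stories, then a long tail of singletons and small groups. Connected components only need a chain of links, so a few bridging stories are enough to merge otherwise unrelated topics. A sketch of a more readable summary, on a made-up toy graph:

```python
import networkx as nx
from collections import Counter

toy_G = nx.Graph()
toy_G.add_edges_from([(0, 1), (1, 2), (2, 3), (4, 5)])   # one chain of 4, one pair
toy_G.add_nodes_from([6, 7])                              # two isolated stories

# Component sizes, largest first, and how many components there are of each size
sizes = sorted((len(c) for c in nx.connected_components(toy_G)), reverse=True)
size_counts = Counter(sizes)

print(sizes)   # [4, 2, 1, 1]
print(size_counts)
```

Nodes 0 and 3 end up in the same component even though they share no direct link, which is the chaining effect described above.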

### K-cores approach

In [207]:
from networkx.algorithms import core

In [210]:
cores_assigned = core.core_number(G)

In [212]:
easy = pd.DataFrame({"node":list(cores_assigned.keys()),
                     "core":list(cores_assigned.values())})

In [250]:
easy

Unnamed: 0,node,core
0,0,3
1,2769,28
2,2795,45
3,4159,8
4,1,9
5,33,8
6,47,8
7,60,9
8,65,2
9,106,9


In [272]:
# Nodes whose core number is exactly 10 (not the full 10-core, which would be core >= 10)
subby = G.subgraph(easy[easy['core']==10]['node'])

In [273]:
nx.number_connected_components(subby)

17

In [274]:
components = list(nx.connected_components(subby))
print([len(component) for component in components])

[12, 37, 11, 11, 11, 2, 2, 2, 5, 1, 3, 2, 4, 1, 1, 1, 1]
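
Filtering on `core == 10` picks out nodes whose core number is exactly 10 (the "10-shell"), which is not the same as the 10-core (all nodes with core number of at least 10). A toy sketch of the difference, using a made-up graph and k = 2:

```python
import networkx as nx

toy_G = nx.complete_graph(4)                      # nodes 0-3 form a 4-clique
toy_G.add_edges_from([(3, 4), (4, 5), (5, 3)])    # triangle 3-4-5 hanging off node 3
toy_G.add_edge(5, 6)                              # pendant node 6

core_numbers = nx.core_number(toy_G)
exactly_2 = sorted(n for n, k in core_numbers.items() if k == 2)   # core number == 2
at_least_2 = sorted(nx.k_core(toy_G, k=2).nodes())                 # the 2-core

print(core_numbers)   # {0: 3, 1: 3, 2: 3, 3: 3, 4: 2, 5: 2, 6: 1}
print(exactly_2)      # [4, 5]
print(at_least_2)     # [0, 1, 2, 3, 4, 5]
```

The exact-match filter drops the densest nodes (here the 4-clique), which can split a shell into many small pieces. That may be part of why the subgraph above breaks into 17 components.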
