# Experimenting with Cleaning, Clustering & Summarization Pipelines

### To do (technical)
- Implement date windows on my corpus loader function

In [1]:
import os
import re
import json

import numpy as np
import pandas as pd
import networkx as nx

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import lib.helper as helper
import lib.embedding_models as reps

from importlib import reload

%matplotlib inline

In [2]:
# Useful flatten function from Alex Martelli on https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
flatten = lambda l: [item for sublist in l for item in sublist]

In [3]:
# Should be the same path on all my PCs; each scrape is stored there as a separate JSON file.
storage_path = "/home/ozwald/Dropbox/news_crow/scrape_results"

# "bing" is targeted news search corpus, "RSS" is from specific world and local news feeds.
corpus_type = "RSS"

## 1.  Retrieve Corpus

The corpus is scraped by the "run_news_scrapes.py" script (triggered by Windows Task Scheduler) every 12 hours, a bit past midday and a bit past midnight.

The "bing" corpus consists of news titles and text extracts retrieved from the Bing News Search API, using a few Home Office-related keywords.

The "RSS" corpus is pulled directly from a number of RSS feeds for world news sites and local British news sites, with no filtering by story type or subject.

### First, get a list of all the news dumps created so far

In [21]:
corpus = helper.load_clean_corpus(storage_path, corpus_type)

Total files: 223
Loading file: RSS_corpus_2019-11-01_0022.json
Loading file: RSS_corpus_2019-11-12_0021.json
Loading file: RSS_corpus_2019-10-25_1222.json
Loading file: RSS_corpus_2019-12-15_0022.json
Loading file: RSS_corpus_2019-11-23_0023.json
Loading file: RSS_corpus_2019-10-26_1221.json
Loading file: RSS_corpus_2019-10-18_1222.json
Loading file: RSS_corpus_2019-09-23_0020.json
Loading file: RSS_corpus_2019-10-15_0022.json
Loading file: RSS_corpus_2019-11-05_0022.json
Loading file: RSS_corpus_2019-09-06_0020.json
Loading file: RSS_corpus_2020-01-10_1223.json
Loading file: RSS_corpus_2019-10-06_1223.json
Loading file: RSS_corpus_2019-12-23_1222.json
Loading file: RSS_corpus_2019-09-14_1222.json
Loading file: RSS_corpus_2019-09-20_1222.json
Loading file: RSS_corpus_2019-09-22_1222.json
Loading file: RSS_corpus_2019-11-18_1223.json
Loading file: RSS_corpus_2019-09-12_1222.json
Loading file: RSS_corpus_2019-11-30_0022.json
Loading file: RSS_corpus_2019-11-04_0022.json
Loading file: RSS

Loading file: RSS_corpus_2019-09-07_1222.json
Loading file: RSS_corpus_2019-12-24_1221.json
Loading file: RSS_corpus_2019-09-28_1225.json
Loading file: RSS_corpus_2019-09-22_0020.json
Loading file: RSS_corpus_2019-09-18_1222.json
Loading file: RSS_corpus_2019-11-02_1018.json
Loading file: RSS_corpus_2019-12-15_1222.json
Loading file: RSS_corpus_2019-10-16_0807.json
Loading file: RSS_corpus_2019-09-19_0020.json
Loading file: RSS_corpus_2019-12-13_1222.json
Loading file: RSS_corpus_2019-10-31_1222.json
Loading file: RSS_corpus_2020-01-17_1649.json
Loading file: RSS_corpus_2019-09-09_0020.json
Loading file: RSS_corpus_2019-09-29_0021.json
Loading file: RSS_corpus_2019-09-23_1222.json
Loading file: RSS_corpus_2019-10-19_0024.json
Loading file: RSS_corpus_2019-10-03_1223.json
Loading file: RSS_corpus_2019-12-06_0022.json
Loading file: RSS_corpus_2019-09-17_1222.json
Loading file: RSS_corpus_2019-10-08_1223.json
Loading file: RSS_corpus_2020-01-09_1222.json
Loading file: RSS_corpus_2019-10-2
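
For the date-window to-do, one option is to filter the dump file names before loading anything. The following is a minimal sketch, assuming every file name follows the `<corpus_type>_corpus_YYYY-MM-DD_HHMM.json` pattern seen in the log above. `files_in_window` is a hypothetical helper, not something that exists in `lib.helper`.

```python
import re
from datetime import date, datetime

# Hypothetical helper (not part of lib.helper): keep only dump files whose
# file-name date falls inside [start, end], inclusive.
FILENAME_DATE = re.compile(r"_corpus_(\d{4}-\d{2}-\d{2})_\d{4}\.json$")

def files_in_window(filenames, start, end):
    kept = []
    for name in filenames:
        match = FILENAME_DATE.search(name)
        if match is None:
            continue  # skip anything that doesn't look like a dump file
        file_date = datetime.strptime(match.group(1), "%Y-%m-%d").date()
        if start <= file_date <= end:
            kept.append(name)
    return sorted(kept)

example_files = [
    "RSS_corpus_2019-11-01_0022.json",
    "RSS_corpus_2019-12-15_0022.json",
    "RSS_corpus_2020-01-10_1223.json",
]
december = files_in_window(example_files, date(2019, 12, 1), date(2019, 12, 31))
print(december)
```

Filtering on file names keeps the window cheap to apply, but it selects by scrape time rather than by each story's own publication date.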

In [22]:
corpus.head()

Unnamed: 0,title,summary,date,link,source_url,retrieval_timestamp,origin,clean_text
0,Trump impeachment: House votes to formalise in...,The Democratic-controlled chamber approves a r...,"Thu, 31 Oct 2019 20:21:13 GMT",https://www.bbc.co.uk/news/world-us-canada-502...,http://feeds.bbci.co.uk/news/world/rss.xml,2019-11-01 00:21:58.608417,rss_feed,Trump impeachment: House votes to formalise in...
1,Five men acquitted of gang-raping teenager in ...,A court ruled the men did not commit rape beca...,"Thu, 31 Oct 2019 23:23:02 GMT",https://www.bbc.co.uk/news/world-europe-50257922,http://feeds.bbci.co.uk/news/world/rss.xml,2019-11-01 00:21:58.608436,rss_feed,Five men acquitted of gang-raping teenager in ...
2,Brazil wildfires: Blaze advances across Pantan...,The area is one of the most biodiverse regions...,"Fri, 01 Nov 2019 00:11:01 GMT",https://www.bbc.co.uk/news/world-latin-america...,http://feeds.bbci.co.uk/news/world/rss.xml,2019-11-01 00:21:58.608446,rss_feed,Brazil wildfires: Blaze advances across Pantan...
3,Islamic State group names its new leader as Ab...,The jihadist group names Abu Ibrahim al-Hashem...,"Thu, 31 Oct 2019 19:03:25 GMT",https://www.bbc.co.uk/news/world-middle-east-5...,http://feeds.bbci.co.uk/news/world/rss.xml,2019-11-01 00:21:58.608455,rss_feed,Islamic State group names its new leader as Ab...
4,Iraq protests: How tuk-tuks are saving lives i...,"From a nuisance to a necessity, tuk-tuks have ...","Thu, 31 Oct 2019 19:11:51 GMT",https://www.bbc.co.uk/news/world-middle-east-5...,http://feeds.bbci.co.uk/news/world/rss.xml,2019-11-01 00:21:58.608464,rss_feed,Iraq protests: How tuk-tuks are saving lives i...


In [23]:
# Keep only the last 5,000 rows
corpus = corpus.iloc[-5000:]

In [24]:
corpus.shape

(5000, 8)

## 2. Clustering using Entity Detection And Network Analytics

This doesn't resolve very well for the Bing corpus, because the keywords from the original searches appear all over it.  I suspect that also has a lot to do with the failure of the other methods.  For the network analytics method I'm going to try removing the keywords from the table first.

In [25]:
#with open("/home/ozwald/Dropbox/news_crow/scrape_settings.json", "r") as f:
#    scrape_config = json.load(f)
#
#search_terms = scrape_config['disaster_search_list']
#search_terms = re.sub(r"[^0-9A-Za-z ]", "", " ".join(search_terms)).lower().split()
#search_terms = set(search_terms)

In [26]:
#search_terms

In [27]:
model = reps.NounAdjacencyModel(list(corpus['clean_text']), list(corpus['clean_text']))
model.noun_sets[3]

PARALLEL AWESOMENESS!!!
found all nouns
Reduced noun lists to sets
Created entity table
Aggregated result table


{'David',
 'JC',
 'Jersey',
 'Kosher',
 'Lax',
 'New_Jersey',
 'Supermarket',
 'Tuesday',
 'kosher'}
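
One set like this per story is enough to build a document-by-noun indicator table. The following is a minimal sketch of one way that could work, assuming 0/1 cells. It is not the actual `NounAdjacencyModel` code, and the data and variable names are made up for illustration.

```python
import pandas as pd

# Illustrative data: one set of detected nouns per document (made up for this sketch).
texts = ["doc a", "doc b", "doc c"]
noun_sets = [{"Trump", "NATO"}, {"Trump", "Erdogan", "NATO"}, {"Brexit"}]

# Every noun seen anywhere becomes a column.
vocabulary = sorted(set().union(*noun_sets))

# 1 if the document's noun set contains the noun, else 0.
table = pd.DataFrame(
    [[int(noun in nouns) for noun in vocabulary] for nouns in noun_sets],
    index=texts,
    columns=vocabulary,
)
print(table)
```

Column sums of such a table give the number of documents each noun appears in, which is the quantity used further down to rank nouns.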

In [28]:
nouns_df = model.table.copy()
nouns_df.head()

Unnamed: 0,Julius,Subaru,Ozaukee,Feds,EAR,Oct,IKEA,Banks,Theatre,Tomerong,...,Calombaris,Tatyana,Walton,Socorro,Rodrigues,LVMH,Reeves,Shanks,Episcopal,Jenner
Australian chokes back tears as he describes how he was barred from a pub because of his mullet. An all-round Aussie bloke (pictured) claims he was refused entry from bars because his mullet hairstyle went against their dress code.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Lawyer's career is in ruins after he was found guilty of sending 'menacing' texts to mother-of-four. David Wilkie-Thorburn sent a string of threats to Veda Rodrigues (pictured) after he became concerned that staff had been walking out of his husband's salon in Aberdeen.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
"Toad-ally terrified! Driver freaks out when he discovers a tree frog in his car. The motorist from Sanford, just north of Orlando, Florida, was filming the frog which was crawling up his door before it leaped across onto the shocked driver's lap forcing him to drop his phone.",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Lone survivor of Jersey shooting at kosher supermarket says attackers were 'professional'. David Lax described how he dived under a salad bar and pushed his way past one of two shooters in order to escape the attack at the JC Kosher Supermarket in New Jersey on Tuesday.,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"What happens now Boris Johnson has an overwhelming majority and mandate to FINALLY get Brexit done?. His first appointment this morning will be at Buckingham Palace, were he will meet the Queen and officially get approval to form the next Government.",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Drop any noun/noun phrase containing one of the search terms, then create an adjacency matrix

### Drop any noun/phrase occurring too infrequently

In [29]:
# Rank all nouns by total count across the corpus, most common first
nouns_to_keep = list(
    nouns_df.sum(axis=0).sort_values(ascending=False).index
)

# Cut out any nouns containing the original search terms
#nouns_to_keep = [noun for noun in nouns_to_keep if sum([term in noun for term in search_terms]) == 0]

# Keep only top 500 most common
nouns_to_keep = nouns_to_keep[:500]

# Subset nouns dataframe
nouns_df = nouns_df[nouns_to_keep]

print(nouns_df.shape)

(5000, 500)


In [30]:
embeddings = np.asarray(nouns_df)
adjacency = np.dot(embeddings, embeddings.T)
print(np.max(adjacency))

11
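
If the table holds 0/1 indicators (an assumption here), `adjacency[i, j]` is the number of kept nouns that stories `i` and `j` share, and the diagonal holds each story's own noun count. That means the maximum printed above may be a diagonal entry rather than an overlap between two different stories. A toy example with made-up values:

```python
import numpy as np

# Toy indicator matrix: 3 stories x 4 nouns (made-up values).
embeddings = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])
adjacency = embeddings @ embeddings.T
print(adjacency)

# Largest overlap between two *different* stories (ignores the diagonal).
off_diagonal = adjacency[~np.eye(len(adjacency), dtype=bool)]
print(off_diagonal.max())
```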


In [31]:
# If the "lower" limit is 1, the graph has so many edges it eats ALL the memory of my desktop, even
# with just 500-ish stories to process.
upper = 100
lower = 3
G = nx.Graph()
rows, cols = np.where((upper >= adjacency) & (adjacency >= lower))
weights = [float(adjacency[rows[i], cols[i]]) for i in range(len(rows))]
edges = zip(rows.tolist(), cols.tolist(), weights)
G.add_weighted_edges_from(edges)

# Simplify; remove self-edges
G.remove_edges_from(nx.selfloop_edges(G))
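
Note that `np.where` on the full symmetric matrix returns every qualifying pair twice, plus any diagonal entries inside the limits. Those diagonal entries become self-loops, and removing a self-loop leaves its node in `G`, so stories with no real neighbours can remain in the graph as isolated nodes. A possible alternative, sketched here on toy data, reads only the strict upper triangle:

```python
import numpy as np
import networkx as nx

# Toy shared-noun counts for 4 stories (made-up values).
adjacency = np.array([
    [5, 3, 0, 1],
    [3, 4, 4, 0],
    [0, 4, 6, 2],
    [1, 0, 2, 3],
])
lower, upper = 3, 100

# k=1 skips the diagonal, so each unordered pair appears once and no self-loops are created.
rows, cols = np.triu_indices_from(adjacency, k=1)
weights = adjacency[rows, cols]
keep = (weights >= lower) & (weights <= upper)

G = nx.Graph()
G.add_weighted_edges_from(
    zip(rows[keep].tolist(), cols[keep].tolist(), weights[keep].astype(float).tolist())
)
print(sorted(G.edges(data="weight")))
```

With this construction, story 3 never enters the graph because none of its overlaps reach `lower`. Whether that is preferable depends on whether unconnected stories should count as their own clusters.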

In [32]:
G.number_of_edges()

569

In [33]:
#G_plot = nx.petersen_graph()
#plt.subplot(121)
#nx.draw(G, with_labels=True, font_weight='bold')
#plt.subplot(122)
#nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True, font_weight='bold')

### Cliques, worth a look?
Idea from the docs, explanation at https://en.wikipedia.org/wiki/Clique_(graph_theory)

So, cliques are allowed to overlap - should've thought of that.  Still, good preliminary results and I've found I can disambiguate the cliques to some degree by cutting out weaker links (fewer shared entities).

I should add that it also appears to suffer from the same problem as the other clustering methods: clusters are ultimately hierarchical!
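
To make the overlap concrete: `nx.find_cliques` yields maximal cliques, and a single node can belong to several of them. A toy graph (made-up data) with two triangles sharing one node shows this:

```python
import networkx as nx

# Toy graph: two triangles sharing node 2 (made-up data).
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)])

# Sort for a stable, readable listing of the maximal cliques.
cliques = sorted(sorted(c) for c in nx.find_cliques(G))
print(cliques)

# Nodes that appear in more than one maximal clique.
shared = {n for n in G if sum(n in c for c in cliques) > 1}
print(shared)
```

A story like node 2 here would be listed under both cliques, so clique membership alone does not give a clean partition of the corpus.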

In [34]:
cliques = []
for x in nx.find_cliques(G):
    x.sort()
    cliques.append((len(x), x))

In [35]:
cliques_df = pd.DataFrame({"nodes_list": [x[1] for x in cliques],
                           "clique_size": [x[0] for x in cliques]}).\
                    sort_values("clique_size", ascending=False).\
                    reset_index()

In [36]:
len(cliques_df[cliques_df['clique_size'] >= 5])

12

In [37]:
cliques_df[cliques_df['clique_size'] >= 5]

Unnamed: 0,index,nodes_list,clique_size
0,1055,"[2104, 2575, 2698, 3240, 3377, 3809, 4050, 4085]",8
1,58,"[4184, 4204, 4225, 4249, 4723, 4772, 4793]",7
2,1558,"[8, 407, 796, 3398, 3768, 4683]",6
3,84,"[857, 3275, 3715, 3785, 4217]",5
4,1554,"[1290, 1291, 3398, 3768, 3998]",5
5,275,"[398, 438, 2999, 3351, 3376]",5
6,394,"[531, 1650, 2369, 2379, 3580]",5
7,313,"[438, 529, 531, 3351, 3944]",5
8,390,"[529, 531, 1639, 1650, 3765]",5
9,1557,"[745, 1594, 2576, 3398, 3768]",5


In [38]:
cliqued = set(flatten(list(cliques_df['nodes_list'])))
len(cliqued)

1785

In [47]:
for node in cliques_df.iloc[0]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

Donald Trump calls diplomat who dropped 'quid pro quo' bombshell on him 'HUMAN SCUM'. President Donald Trump unloaded on diplomat Bill Taylor, who testified in the House Tuesday, calling he and other purportedly 'Never Trump' Republicans 'human scum.'
Donald Trump claims he got 'deep respect' and 'got along great' with NATO leaders. President Donald Trump called his trip 'very successful' and claimed he got 'deep respect' from world leaders, although one group was caught laughing at his expense.
Donald Trump held last-minute NATO meeting with Turkish president Erdogan. President Donald Trump sat down with Turkish President Recep Erdogan at NATO meetings Wednesday shortly before cancelling his own scheduled press conference.
Donald Trump dubs Prime Minister Scott Morrison a 'man of titanium'. U.S. President Donald Trump has dubbed Prime Minister Scott Morrison a 'man of titanium' in their first meeting in the president's Oval Office.
Donald Trump begins his week at the UN attacking Joe 

In [49]:
for node in cliques_df.iloc[1]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

No more survivors expected to be found after White Island volcano disaster New Zealand. New Zealand police say there have been 'no signs of life at any point' on White Island after it was rocked by an eruption on Monday, and they are not expecting to find any more survivors.
More than 20 Australians are caught up in New Zealand's volcano eruption. Police expect the fatalities to grow after New Zealand's Whakaari, or White Island, volcano erupted on Monday afternoon.
Grave fears for young Australian couple still missing in the wake of New Zealand's volcano disaster. Jason Griffiths, Karla Mathews, and Richard Elzer were passengers on cruise ship Ovation of the Seas with a fourth friend when the trio decided to visit White Island on Monday.
Adelaide schoolgirl and her parents are feared dead in New Zealand volcano eruption. Adelaide lawyer Gavin Dallow, his wife Lisa Hosking, 48, and her daughter Zoe Hosking, 15, went on a tour of White Island on Monday and have not been heard from since

In [50]:
for node in cliques_df.iloc[3]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

Will MPs support Boris Johnson's Brexit deal?. Boris Johnson is locked in a frantic race against time as he tries to persuade a majority of MPs to back his Brexit deal at a crunch vote in the Commons tomorrow.
PM's Brexit timetable rejected. We could be on the brink of an election or a Brexit extension after Boris Johnson suffered another defeat in the Commons.
Labour plunged into fresh Brexit chaos as 1 MPs back Boris Johnson's deal. Labour was rocked by fresh Brexit chaos this evening as 1 of Jeremy Corbyn's MPs defied their leader and sided with the government over the UK's departure from the EU.


In [51]:
for node in cliques_df.iloc[17]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

SECOND Jeffrey Epstein victim claims she had sex with Prince Andrew. The woman claims she was abused by billionaire paedophile Jeffrey Epstein. The Duke of York has denied having slept with 17-year-old Virginia Roberts at Ghislaine Maxwell's Belgravia home.
FBI 'probes Prince Andrew link to Jeffrey Epstein' and 'won't dismiss claims because he is a royal'. FBI has expended their investigation into Jeffrey Epstein to include 100 alleged victims and sex slave Virginia Roberts reportedly isn't the only one who could provide details on Prince Andrew.
Virginia Roberts describes 'having sex with Prince Andrew' aged 17. Giuffre sat down with five other Epstein accusers in her first TV interview where she describes meeting Prince Andrew in London in 2001 and allegedly being 'trafficked' to him.
Bombshell witness was 'feet away' from Prince Andrew and Virginia Roberts. Lisa Bloom, lawyer for victims of Jeffrey Epstein, says an as-yet-unnamed woman 'vividly remembers' seeing the pair (pictured) 

### Connected components

In [52]:
nx.number_connected_components(G)

1452

In [53]:
components = list(nx.connected_components(G))

In [54]:
sum([len(component) for component in components])

1785
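
With 1,452 components covering only 1,785 nodes, at least 1,119 of them must be single stories, so the interesting components are the few larger ones. A minimal sketch, on a toy graph with made-up data, of sorting components by size and dropping the singletons:

```python
import networkx as nx

# Toy graph: one 3-node component, one 2-node component, two isolated nodes.
G = nx.Graph([(0, 1), (1, 2), (3, 4)])
G.add_nodes_from([5, 6])

# Largest components first.
components = sorted(nx.connected_components(G), key=len, reverse=True)

# Keep only components that actually link two or more stories.
multi_story = [c for c in components if len(c) >= 2]
print(len(components), [sorted(c) for c in multi_story])
```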

### Community Detection Algorithm

In [55]:
from community import best_partition

In [56]:
# Apply Louvain Community Detection
# The keys are nodes, the values are the partitions they belong to
partition = best_partition(G)

# Note: labels start at 0, so this is the highest label; the partition count is this value + 1
number_partitions = max(partition.values())
number_partitions

1456

In [57]:
# Iterate through and get a list of partitions and their nodes
partition_contents = {}
for key in partition.keys():
    partition_contents[partition[key]] = partition_contents.get(partition[key], []) + [key]

# Drop partitions that are too small
for key in list(partition_contents.keys()):
    if len(partition_contents[key]) < 5:
        partition_contents.pop(key)

In [58]:
# Let's see how big our "clusters" are, and how many there are total after removing the tiny ones
partition_lengths = [len(value) for key, value in partition_contents.items()]
print(partition_lengths, sum(partition_lengths))

[30, 45, 26, 25, 14, 16, 5, 40, 23, 6, 11, 6, 7] 254
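
The grouping and size filter from the cells above can also be written with `collections.defaultdict`, which avoids rebuilding a list for every node. A sketch on a made-up node-to-label mapping of the same shape that `best_partition` returns; `min_size` is an illustrative parameter name.

```python
from collections import defaultdict

# Made-up node -> community label mapping, same shape as best_partition's output.
partition = {0: 0, 1: 0, 2: 1, 3: 0, 4: 2, 5: 1, 6: 0}
min_size = 3  # illustrative threshold; the notebook uses 5

# Collect the nodes belonging to each community label.
groups = defaultdict(list)
for node, label in partition.items():
    groups[label].append(node)

# Drop communities that are too small to be interesting.
partition_contents = {
    label: nodes for label, nodes in groups.items() if len(nodes) >= min_size
}
print(partition_contents)

# Count communities via distinct labels rather than the largest label.
print(len(set(partition.values())))
```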


In [67]:
partition_contents.keys()

dict_keys([1, 4, 32, 33, 47, 53, 59, 99, 210, 287, 526, 710, 1219])

In [76]:
for node in partition_contents[1][:10]:
    print(corpus.iloc[node]['clean_text'])

What happens now Boris Johnson has an overwhelming majority and mandate to FINALLY get Brexit done?. His first appointment this morning will be at Buckingham Palace, were he will meet the Queen and officially get approval to form the next Government.
Boris Johnson faces crunch final vote on his Queen's Speech TODAY. Boris Johnson today faces a crunch vote on his blueprint for the country as MPs decide whether or not to back his Queen's Speech and Remainers plot to force a second Brexit referendum.
Brussels reacts with 'sadness but also relief' to UK election. EU leaders have said they now want a quick divorce and to move on to talks on a free-trade accord. Czech Prime Minister Andrej Babis said Brexit would be 'bad news for Europe'.
Voters would prefer the short-term disruption of a No Deal Brexit over a Jeremy Corbyn Government. Nearly half of voters would be happy for the UK to leave the EU without a deal if the alternative scenario was Mr Corbyn as Prime Minister. Just 5 per cent pr

In [78]:
for node in partition_contents[4][:10]:
    print(corpus.iloc[node]['clean_text'])

David Cameron congratulates Boris Johnson for 'extraordinary result'. Former prime minister David Cameron, who resigned after the UK voted for Brexit in 2016, said the Conservative majority win today was an 'extraordinary' and 'powerful' result.
Dominic Raab warns Remainer bill would force the UK to roll over to anything Brussels demands. Dominic Raab vowed that Boris Johnson will be able to defy the 'Surrender Act' stipulating that he must ask for a delay to Brexit if he can't secure a deal.
Remainers' LOSE legal challenge that claimed Boris Johnson's Brexit deal is unlawful. At the Court of Session in Edinburgh, Scotland's highest civil cvourt, lawyers argued Boris Johnson's deal, agreed yesterday with the EU, breaches a 2018 UK trade law.
David Cameron admits he is 'haunted' and 'deeply pained' over Brexit. David Cameron (pictured) said he was 'haunted' and 'deeply pained' by the current political landscape and was 'deeply sorry' about what had happened since the UK's 2016 referendu

In [79]:
for node in partition_contents[32][:10]:
    print(corpus.iloc[node]['clean_text'])

Don Trump Jr jokes 'I wish my name was Hunter Biden' to 'make millions off my father's presidency'. Donald Trump Jr claimed he wanted to emulate the former vice president's son, Hunter Biden, in order to profit from his father being in the White House. He also claimed the media was favoring the Bidens.
Donald Trump blasts Democrat savages Adam Schiff and Jerrold Nadler. Trump also called out freshman Rep. Alexandria Ocasio-Cortez and her 'Squad' in tweets on Saturday morning from the White House, prior to departing for Trump National Golf Club.
Trump says he may 'TERMINATE' NYT and Washington Post subscriptions. Donald Trump said that he no longer wants to have hard copies of The New York Times and The Washington Post in the White House, claiming he was going to 'terminate' the newspapers.
Chaos on Capitol Hill as Democrats repeatedly reject GOP proposals for impeachment rules. In an often contentious hearing that stretched late into Wednesday night, members of the House Rules Committe

In [80]:
for node in partition_contents[99][:10]:
    print(corpus.iloc[node]['clean_text'])

Harry and Meghan to call time on illegal poaching of prized shellfish facing extinction. During an official tour to Southern Africa, Prince Harry, 4, and Meghan, 8, (pictured) will shine a spotlight on the prized delicacies, which can sell for up to £420 a plate in China.
Meghan Markle and Prince Harry pictured with baby Archie after delayed flight to Cape Town to begin ten-day Africa tour. HARRY, Meghan and baby Archie have arrived in Cape Town for their first Royal Tour as a family after their commercial British Airways flight was delayed by 40 minutes. The couple are beginning their 10-day tour by visiting a township to support a charity that help disadvantaged children. The family had been 40 minutes late [820;]
Prince Harry 'ignored advice of senior aides and didn't tell William or Charles about press attack'. Royal sources claim Harry was warned not to publish the statement while he and Meghan were still on their taxpayer-funded tour of Africa.
Did Harry and Meghan's SussexRoyal 

In [81]:
for node in partition_contents[210][:10]:
    print(corpus.iloc[node]['clean_text'])

What Donald Trump really thinks of Scott Morrison's wife Jenny Morrison he referred to as 'Jennifer'. The former nurse struck up quite a rapport with the US President in the lead-up to the state dinner in the White House rose garden - the first for an Australian PM in 1 years.
Hillsong pastor claims he 'doesn't know' if Scott Morrison asked for him to come to US. Hillsong pastor Brian Houston said he 'doesn't know' if Scott Morrison had asked for him to join a meeting with US President Donald Trump.
Former EU Council head Tusk tweets image of him holding two fingers against Trump's back like a gun. Former EU Council chief Donald Tusk fired a shot at Donald Trump today by tweeting a photo of himself taken last year pretending to point a gun into the back of the US President during a G7 summit.
Donald Trump approves deployment of US forces to Saudi Arabia after attacks on American oil facilities. DONALD Trump will send troops to Saudi Arabia after a drone attack on the world8217;s bigges