# Experimenting with Cleaning, Clustering & Summarization Pipelines

### To do (technical)
- Implement date windows on my corpus loader function
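
One possible shape for the date-window item is sketched below. It is not part of `lib.helper`: the function name `filter_date_window`, its parameters, and the choice of a half-open `[start, end)` window are all assumptions. It assumes the corpus has a `date` column of ISO-style strings and that `start`/`end` are timezone-naive strings interpreted as UTC.

```python
import pandas as pd

def filter_date_window(corpus, start=None, end=None, date_col="date"):
    """Hypothetical helper: keep rows whose date falls in [start, end)."""
    # Unparseable or missing dates become NaT and are dropped by the mask.
    dates = pd.to_datetime(corpus[date_col], errors="coerce", utc=True)
    mask = dates.notna()
    if start is not None:
        mask &= dates >= pd.Timestamp(start, tz="UTC")
    if end is not None:
        mask &= dates < pd.Timestamp(end, tz="UTC")
    return corpus[mask]

# Toy data, made up for illustration only
toy_corpus = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "date": ["2019-11-16T10:00:00Z", "2019-11-17T17:35:00Z",
             "2019-11-18T09:00:00Z", None],
})
windowed = filter_date_window(toy_corpus, start="2019-11-17", end="2019-11-18")
print(windowed)
```

A half-open window means consecutive windows can be chained without counting a story twice.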

In [17]:
import os
import re
import json

import numpy as np
import pandas as pd
import networkx as nx

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import lib.helper as helper
import lib.embedding_models as reps

from importlib import reload

%matplotlib inline

# Useful flatten function from Alex Martelli on https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
flatten = lambda l: [item for sublist in l for item in sublist]

## 1.  Retrieve Corpus

The corpus is scraped by the "run_news_scrapes.py" script (run via Windows Task Scheduler) every 12 hours, shortly after midday and shortly after midnight.

The "bing" corpus consists of news titles and text extracts retrieved from the Bing News Search API using a few Home Office-related keywords.

The "RSS" corpus is pulled directly from a number of RSS feeds for world news sites and local British news sites, with no filtering by story type or subject.

In [2]:
# Should be the same path on all my PCs; each scrape is saved there as a separate JSON file.
storage_path = "D:/Dropbox/news_crow/scrape_results"

# "bing" is targeted news search corpus, "RSS" is from specific world and local news feeds.
corpus_type = "disaster"

# There's a helper function to go find and drag out the various JSON files created by the scrapers.
corpus = helper.load_clean_corpus(storage_path, corpus_type)

# Make sure after cleaning etc it's indexed from 0
corpus.reset_index(inplace=True)
corpus.index.name = "node"

# See how it turned out
print(corpus.shape)
corpus.head()

Total files: 161
9.9 percent of files read.
19.9 percent of files read.
29.8 percent of files read.
39.8 percent of files read.
49.7 percent of files read.
59.6 percent of files read.
69.6 percent of files read.
79.5 percent of files read.
89.4 percent of files read.
99.4 percent of files read.
(13118, 9)


node,index,title,summary,date,link,source_url,retrieval_timestamp,origin,clean_text
0,0,West Midlands <b>flood</b> warnings prompt &#3...,Residents have been warned to &quot;remain vig...,2019-11-17T17:35:00.0000000Z,https://www.bbc.co.uk/news/uk-england-50451817,www.bbc.co.uk,2019-11-17 19:50:58.278878,bing_news_api,West Midlands flood warnings prompt ;remain vi...
1,1,New <b>flood</b> warnings issued with more hom...,The Environment Agency has a number of <b>floo...,2019-11-17T18:35:00.0000000Z,https://www.hulldailymail.co.uk/news/hull-east...,www.hulldailymail.co.uk,2019-11-17 19:50:58.278928,bing_news_api,New flood warnings issued with more homes at r...
2,2,UK weather forecast – More than 100 <b>flood</...,<b>FLOOD</b>-ravaged villages in the UK have b...,2019-11-17T13:45:00.0000000Z,https://www.thesun.co.uk/news/10342583/uk-weat...,www.thesun.co.uk,2019-11-17 19:50:58.278953,bing_news_api,UK weather forecast – More than 100 flood aler...
3,5,UK <b>flood</b> warning map: <b>Flood</b> chao...,The Environment Agency has issued 57 <b>flood<...,2019-11-17T16:38:00.0000000Z,https://www.express.co.uk/news/weather/1205629...,www.express.co.uk,2019-11-17 19:50:58.279028,bing_news_api,UK flood warning map: Flood chaos to continue ...
4,6,UK weather forecast: <b>Flood</b> chaos contin...,Despite some areas enduring their &#39;wettest...,2019-11-17T18:32:00.0000000Z,https://www.mirror.co.uk/news/uk-news/uk-weath...,www.mirror.co.uk,2019-11-17 19:50:58.279047,bing_news_api,UK weather forecast: Flood chaos continues wit...
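
The `clean_text` column above still carries entity residue such as `quot;` and stray `;` characters. The sketch below shows one way the cleaning step could decode HTML entities before stripping tags. It is not the logic used by `helper.load_clean_corpus`; the function name `clean_snippet` and the sample strings are made up for illustration.

```python
import html
import re

def clean_snippet(text):
    """Hypothetical cleaner: decode entities, strip tags, collapse whitespace."""
    text = html.unescape(text)           # &quot; -> ", &#39; -> '
    text = re.sub(r"<[^>]+>", "", text)  # drop tags such as <b>...</b>
    return re.sub(r"\s+", " ", text).strip()

# Made-up strings in the style of the Bing results
sample_title = "Heavy <b>flood</b> warnings prompt &#39;stay alert&#39; call"
sample_summary = "Residents were told to &quot;remain vigilant&quot;  overnight."
print(clean_snippet(sample_title))
print(clean_snippet(sample_summary))
```

Decoding entities first avoids leaving fragments like `quot;` behind when punctuation is removed later.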


## 2. Use Detected Nouns to create a Graph Representation

In [3]:
# Retrieve the set of search terms used for Bing, so we can remove them before
# clustering.
with open("D:/Dropbox/news_crow/scrape_settings.json", "r") as f:
    scrape_config = json.load(f)

search_terms = scrape_config['disaster_search_list']
search_terms = re.sub(r"[^0-9A-Za-z ]", "", " ".join(search_terms)).lower().split()
search_terms = set(search_terms)

In [4]:
# Generate the text representation
model = reps.NounAdjacencyModel(list(corpus['clean_text']), list(corpus['clean_text']))

# Tabulate for convenience
nouns_df = model.table.copy()
nouns_df.head()

Unnamed: 0,Codogno,Hutt,-14C.,Llewellyn,Magistrates,Border,FARMS,Geraldine,Emma,CR,...,Pais,Milan;s,AECOM,Sciences,Everett,Danielle,Poole,Kurilsk,Wharfemeadows,Deficiency
"West Midlands flood warnings prompt ;remain vigilant; alert. Residents have been warned to quot;remain vigilantquot; as up to 20 flood warnings are in place in the West Midlands with more rain forecast. There are 1 warnings affecting Worcestershire, along the River Severn, Avon and Teme, and six in Shropshire. Flood defences were put up in Ironbridge on Saturday evening. The Environment Agency (EA) said river ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
New flood warnings issued with more homes at risk. The Environment Agency has a number of flood alerts and warnings in place More residents are being told that floodwater could enter their homes as new red warnings are put in place. The Environment Agency has updated the flood risk for Hull and East Yorkshire this afternoon with four red flood warnings now in force. A red warning indicates that ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
UK weather forecast – More than 100 flood alerts across Britain as villages are cut off for days and more storms hit. FLOOD-ravaged villages in the UK have been cut off for days as Atlantic storms threaten to unleash another deluge early next week. Swathes of Britain that were left devastated by torrential downpours will face yet more floods - with more than 100 alerts in place. It comes amid predictions bitterly cold weather will largely hold out for the rest ...,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"UK flood warning map: Flood chaos to continue - Where is under threat of flooding?. The Environment Agency has issued 57 flood warnings at the time of writing, meaning flooding is expected and immediate action is required. There are also 0 flood alerts in place across the country, warning flooding is possible and to be prepared, READ MORE: Snow maps latest forecast: Arctic blast to hit UK with snow and sleet To see a full ...",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"UK weather forecast: Flood chaos continues with -5C freeze to follow ;wettest autumn;. Despite some areas enduring their ;wettest ever autumns;, much-needed relief from heavy rainfall has been forecast for flood-hit areas in the coming days",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Drop any noun/noun phrase containing one of the search terms, then create an adjacency matrix

#### Drop any noun/phrase occurring too infrequently

In [5]:
# Rank all nouns from most to least common
nouns_to_keep = list(nouns_df.\
                    sum(axis=0).\
                    sort_values(ascending=False).\
                    index)

# Cut out any nouns containing the original search terms
nouns_to_keep = [noun for noun in nouns_to_keep if not any(term in noun for term in search_terms)]

# Keep only most common
nouns_to_keep = nouns_to_keep[:2000]

# Subset nouns dataframe
nouns_df = nouns_df[nouns_to_keep]

print(nouns_df.shape)

(13118, 1000)
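
A toy version of the noun filter may make the ranking and search-term removal easier to check. One thing it highlights: the search terms are lower-cased, but the substring test above is case-sensitive, so a noun such as "Flood" would survive it. Whether that is intended is not stated; the `ignore_case` flag below is an assumption added for illustration, as are the function name `select_nouns` and the toy table.

```python
import pandas as pd

def select_nouns(table, terms, top_n, ignore_case=True):
    """Hypothetical helper: rank nouns by total count, drop any containing a term."""
    ranked = table.sum(axis=0).sort_values(ascending=False).index

    def has_term(noun):
        text = noun.lower() if ignore_case else noun
        return any(term in text for term in terms)

    return [noun for noun in ranked if not has_term(noun)][:top_n]

# Toy document-by-noun table, made up for illustration
toy_nouns = pd.DataFrame({
    "Flood": [1, 1, 1],
    "River": [1, 1, 0],
    "Severn": [0, 1, 0],
    "flood warning": [1, 0, 0],
})
toy_terms = {"flood"}

print(select_nouns(toy_nouns, toy_terms, top_n=2))
print(select_nouns(toy_nouns, toy_terms, top_n=2, ignore_case=False))
```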


In [6]:
embeddings = np.asarray(nouns_df)
adjacency = np.dot(embeddings, embeddings.T)
print(np.max(adjacency))

13
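
To make the dot product concrete, here is a small worked example with a made-up 3-document, 4-noun table. With 0/1 entries, cell (i, j) of the product is the number of nouns documents i and j share, and the diagonal is each document's own noun count. If the real table holds counts rather than 0/1 flags, each cell is a sum of count products instead.

```python
import numpy as np

# Rows are documents, columns are nouns (toy values, 0/1 indicators)
toy_embeddings = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])

toy_adjacency = toy_embeddings @ toy_embeddings.T
print(toy_adjacency)
```

Documents 0 and 1 share two nouns, while document 2 shares none with either. Note that the full matrix is dense: for 13,118 stories it has roughly 172 million cells.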


In [7]:
# If the "lower" limit is 1, the graph has so many edges it eats ALL the memory of my desktop, even
# with just 500-ish stories to process.
upper = 100
lower = 3
G = nx.Graph()
rows, cols = np.where((upper >= adjacency) & (adjacency >= lower))
weights = adjacency[rows, cols].astype(float).tolist()
edges = zip(rows.tolist(), cols.tolist(), weights)
G.add_weighted_edges_from(edges)

# Simplify; remove self-edges - not sure if needed?
G.remove_edges_from(nx.selfloop_edges(G))

In [8]:
G.number_of_edges()

7614
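
Because the adjacency matrix is symmetric, thresholding the whole matrix visits every pair twice and also picks up the diagonal. A sketch of an alternative that only looks at the strict upper triangle is below. The function name `graph_from_adjacency` is hypothetical, and it assumes `lower >= 1`, since the zeroed-out lower triangle would otherwise pass the threshold.

```python
import numpy as np
import networkx as nx

def graph_from_adjacency(adjacency, lower, upper):
    """Hypothetical helper: build a weighted graph from thresholded shared-noun counts."""
    if lower < 1:
        raise ValueError("lower must be >= 1 for the upper-triangle trick to work")
    # k=1 keeps only cells above the diagonal, so no self-loops or duplicate pairs
    upper_tri = np.triu(adjacency, k=1)
    rows, cols = np.where((upper_tri >= lower) & (upper_tri <= upper))
    weights = upper_tri[rows, cols].astype(float).tolist()
    graph = nx.Graph()
    graph.add_weighted_edges_from(zip(rows.tolist(), cols.tolist(), weights))
    return graph

toy_adj = np.array([[2, 2, 0], [2, 3, 0], [0, 0, 1]])
toy_graph = graph_from_adjacency(toy_adj, lower=2, upper=100)
print(toy_graph.edges(data=True))
```

As in the cell above, nodes with no qualifying edge are not added to the graph.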

## 3b. Try SNAP
To add: community detection

In [88]:
import snap

G1 = snap.TUNGraph.New()

rows, cols = np.where((upper >= adjacency) & (adjacency >= lower))
edges = zip(rows.tolist(), cols.tolist())


# Add nodes
for i in range(nouns_df.shape[0]):
    G1.AddNode(i)

# Edges
for edge in edges:
    G1.AddEdge(edge[0], edge[1])
    
# Let's have a look at the degrees of the nodes
for NI in G1.Nodes():
    print("node: %d, out-degree %d, in-degree %d" % ( NI.GetId(), NI.GetOutDeg(), NI.GetInDeg()))

node: 0, out-degree 2, in-degree 2
node: 1, out-degree 0, in-degree 0
node: 2, out-degree 0, in-degree 0
node: 3, out-degree 0, in-degree 0
node: 4, out-degree 0, in-degree 0
node: 5, out-degree 1, in-degree 1
node: 6, out-degree 7, in-degree 7
node: 7, out-degree 2, in-degree 2
node: 8, out-degree 1, in-degree 1
node: 9, out-degree 0, in-degree 0
node: 10, out-degree 2, in-degree 2
node: 11, out-degree 1, in-degree 1
node: 12, out-degree 1, in-degree 1
node: 13, out-degree 1, in-degree 1
node: 14, out-degree 2, in-degree 2
node: 15, out-degree 1, in-degree 1
node: 16, out-degree 0, in-degree 0
node: 17, out-degree 9, in-degree 9
node: 18, out-degree 0, in-degree 0
node: 19, out-degree 1, in-degree 1
node: 20, out-degree 2, in-degree 2
node: 21, out-degree 1, in-degree 1
node: 22, out-degree 0, in-degree 0
node: 23, out-degree 0, in-degree 0
node: 24, out-degree 0, in-degree 0
node: 25, out-degree 0, in-degree 0
node: 26, out-degree 2, in-degree 2
node: 27, out-degree 3, in-degree 3
...

node: 3070, out-degree 0, in-degree 0
node: 3071, out-degree 2, in-degree 2
node: 3072, out-degree 35, in-degree 35
node: 3073, out-degree 0, in-degree 0
node: 3074, out-degree 0, in-degree 0
node: 3075, out-degree 2, in-degree 2
node: 3076, out-degree 27, in-degree 27
node: 3077, out-degree 6, in-degree 6
node: 3078, out-degree 1, in-degree 1
node: 3079, out-degree 42, in-degree 42
node: 3080, out-degree 6, in-degree 6
node: 3081, out-degree 0, in-degree 0
node: 3082, out-degree 7, in-degree 7
node: 3083, out-degree 0, in-degree 0
node: 3084, out-degree 10, in-degree 10
node: 3085, out-degree 2, in-degree 2
node: 3086, out-degree 0, in-degree 0
node: 3087, out-degree 0, in-degree 0
node: 3088, out-degree 4, in-degree 4
node: 3089, out-degree 2, in-degree 2
node: 3090, out-degree 1, in-degree 1
node: 3091, out-degree 0, in-degree 0
node: 3092, out-degree 1, in-degree 1
node: 3093, out-degree 1, in-degree 1
node: 3094, out-degree 0, in-degree 0
node: 3095, out-degree 1, in-degree 1
...

node: 6070, out-degree 1, in-degree 1
node: 6071, out-degree 1, in-degree 1
node: 6072, out-degree 2, in-degree 2
node: 6073, out-degree 2, in-degree 2
node: 6074, out-degree 0, in-degree 0
node: 6075, out-degree 0, in-degree 0
node: 6076, out-degree 4, in-degree 4
node: 6077, out-degree 1, in-degree 1
node: 6078, out-degree 10, in-degree 10
node: 6079, out-degree 1, in-degree 1
node: 6080, out-degree 1, in-degree 1
node: 6081, out-degree 1, in-degree 1
node: 6082, out-degree 2, in-degree 2
node: 6083, out-degree 1, in-degree 1
node: 6084, out-degree 5, in-degree 5
node: 6085, out-degree 26, in-degree 26
node: 6086, out-degree 1, in-degree 1
node: 6087, out-degree 4, in-degree 4
node: 6088, out-degree 1, in-degree 1
node: 6089, out-degree 3, in-degree 3
node: 6090, out-degree 2, in-degree 2
node: 6091, out-degree 0, in-degree 0
node: 6092, out-degree 22, in-degree 22
node: 6093, out-degree 0, in-degree 0
node: 6094, out-degree 1, in-degree 1
node: 6095, out-degree 0, in-degree 0
...

node: 9069, out-degree 0, in-degree 0
node: 9070, out-degree 0, in-degree 0
node: 9071, out-degree 2, in-degree 2
node: 9072, out-degree 0, in-degree 0
node: 9073, out-degree 1, in-degree 1
node: 9074, out-degree 2, in-degree 2
node: 9075, out-degree 3, in-degree 3
node: 9076, out-degree 0, in-degree 0
node: 9077, out-degree 5, in-degree 5
node: 9078, out-degree 3, in-degree 3
node: 9079, out-degree 1, in-degree 1
node: 9080, out-degree 0, in-degree 0
node: 9081, out-degree 0, in-degree 0
node: 9082, out-degree 0, in-degree 0
node: 9083, out-degree 2, in-degree 2
node: 9084, out-degree 1, in-degree 1
node: 9085, out-degree 0, in-degree 0
node: 9086, out-degree 2, in-degree 2
node: 9087, out-degree 1, in-degree 1
node: 9088, out-degree 0, in-degree 0
node: 9089, out-degree 3, in-degree 3
node: 9090, out-degree 5, in-degree 5
node: 9091, out-degree 3, in-degree 3
node: 9092, out-degree 1, in-degree 1
node: 9093, out-degree 1, in-degree 1
node: 9094, out-degree 3, in-degree 3
...

node: 12492, out-degree 2, in-degree 2
node: 12493, out-degree 0, in-degree 0
node: 12494, out-degree 3, in-degree 3
node: 12495, out-degree 0, in-degree 0
node: 12496, out-degree 3, in-degree 3
node: 12497, out-degree 1, in-degree 1
node: 12498, out-degree 1, in-degree 1
node: 12499, out-degree 1, in-degree 1
node: 12500, out-degree 18, in-degree 18
node: 12501, out-degree 0, in-degree 0
node: 12502, out-degree 1, in-degree 1
node: 12503, out-degree 1, in-degree 1
node: 12504, out-degree 2, in-degree 2
node: 12505, out-degree 0, in-degree 0
node: 12506, out-degree 0, in-degree 0
node: 12507, out-degree 0, in-degree 0
node: 12508, out-degree 0, in-degree 0
node: 12509, out-degree 0, in-degree 0
node: 12510, out-degree 0, in-degree 0
node: 12511, out-degree 0, in-degree 0
node: 12512, out-degree 1, in-degree 1
node: 12513, out-degree 1, in-degree 1
node: 12514, out-degree 1, in-degree 1
node: 12515, out-degree 1, in-degree 1
node: 12516, out-degree 1, in-degree 1
...
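
The graph is undirected, which is why in-degree and out-degree are always equal in the listing above. One line per node is hard to scan for 13,118 nodes, so a degree histogram may be a more useful summary. The sketch below uses NetworkX rather than SNAP; `degree_histogram` is a hypothetical helper and the toy graph is made up.

```python
from collections import Counter

import networkx as nx

def degree_histogram(graph):
    """Hypothetical helper: map each degree to the number of nodes having it."""
    counts = Counter(degree for _, degree in graph.degree())
    return dict(sorted(counts.items()))

toy_graph = nx.Graph()
toy_graph.add_nodes_from(range(5))  # node 4 stays isolated
toy_graph.add_edges_from([(0, 1), (0, 2), (1, 2), (2, 3)])

toy_hist = degree_histogram(toy_graph)
print(toy_hist)
```

To count isolated stories on the real data, the graph would need every row added as a node first, as the SNAP cell does.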

## 3.  Create (overlapping) clusters using Maximal Cliques
The idea comes from the docs; see https://en.wikipedia.org/wiki/Clique_(graph_theory) for an explanation of cliques.
Expanded using k-clique communities (TODO: find the reference paper).
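
A small example may help show why these clusters can overlap. In NetworkX, a k-clique community is the union of k-cliques that can be reached from one another through cliques sharing k-1 nodes. The toy graph below (made up for illustration) has two 4-cliques that share a single node, so with k=4 they form two communities and the shared node belongs to both.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import k_clique_communities

toy = nx.Graph()
toy.add_edges_from(combinations([0, 1, 2, 3], 2))  # first 4-clique
toy.add_edges_from(combinations([3, 4, 5, 6], 2))  # second 4-clique, shares node 3

toy_communities = sorted(sorted(c) for c in k_clique_communities(toy, 4))
print(toy_communities)
```

On the news graph, a story that links two topics can therefore appear in more than one cluster.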

In [63]:
c = list(nx.algorithms.community.kclique.k_clique_communities(G, 4))
cliques = [(len(x), x) for x in c]

In [64]:
cliques

[(9, frozenset({6, 268, 2370, 2790, 2907, 5716, 5841, 7182, 7704})),
 (6, frozenset({17, 896, 1561, 1605, 2658, 3168})),
 (6, frozenset({33, 273, 564, 745, 893, 3395})),
 (4, frozenset({49, 4072, 4299, 12150})),
 (7, frozenset({58, 463, 1985, 1990, 2118, 2131, 2304})),
 (5, frozenset({58, 1990, 1992, 2031, 2124})),
 (4, frozenset({58, 469, 1617, 2031})),
 (4, frozenset({59, 72, 82, 281})),
 (4, frozenset({90, 278, 3024, 3191})),
 (4, frozenset({100, 225, 2499, 2505})),
 (65,
  frozenset({138,
             145,
             241,
             287,
             331,
             334,
             494,
             539,
             550,
             684,
             790,
             794,
             1110,
             1144,
             1217,
             1322,
             1377,
             1801,
             2429,
             2765,
             3072,
             3076,
             3077,
             3079,
             3084,
             3235,
             3322,
             3324,


In [65]:
cliques_df = pd.DataFrame({"nodes_list": [x[1] for x in cliques],
                           "clique_size": [x[0] for x in cliques]}).\
                    sort_values("clique_size", ascending=False).\
                    reset_index()

cliques_df = cliques_df[(cliques_df['clique_size'] >= 3) & (cliques_df['clique_size'] <=100)]

In [66]:
cliques_df

Unnamed: 0,index,nodes_list,clique_size
1,10,"(3072, 3076, 3077, 3079, 4360, 1801, 138, 3979...",65
2,24,"(12165, 8582, 774, 9740, 653, 9613, 10255, 110...",48
3,14,"(5378, 5381, 5382, 8075, 8077, 8079, 8083, 706...",35
4,70,"(4613, 1546, 11596, 11725, 11598, 1876, 3476, ...",24
5,42,"(9985, 10178, 9987, 9988, 9990, 9991, 9992, 99...",22
...,...,...,...
220,102,"(6552, 2457, 6341, 6087)",4
221,103,"(7777, 2458, 7865, 2509)",4
222,104,"(4902, 2506, 5388, 2674)",4
223,105,"(3088, 2714, 2563, 2532)",4


In [67]:
cliqued = set(flatten(list(cliques_df['nodes_list'])))
len(cliqued)

1171

In [68]:
# Flatten the cliques DF into long format
flattened = {"cluster_index":[], "node":[]}

for index, row in cliques_df.iterrows():
    for node in row["nodes_list"]:
        flattened["cluster_index"].append(index)
        flattened["node"].append(node)
        

partition_df = pd.DataFrame(flattened)

# Create a single string variable (";" separated) to record all clusters/cliques a single record belongs in
partition_df["cluster"] = partition_df.\
                          groupby("node")["cluster_index"].\
                          transform(lambda x: ";".join([str(i) for i in x if type(i)==int]))

# Clean up, then set the index so this and the corpus DataFrame can be joined with little effort
partition_df = partition_df[["node", "cluster"]].\
               drop_duplicates(["node", "cluster"], keep="first").\
               set_index("node")

corpus.join(partition_df).\
       to_csv("working/corpus_clustered_cliques.csv")
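
The long-format step can also be written with `explode` and a grouped aggregation, which avoids the explicit loop. This is a sketch on made-up data, not a drop-in replacement: the toy frame only mimics the `nodes_list` column, and it uses the frame's own index as the cluster ID, as the loop above does.

```python
import pandas as pd

toy_cliques = pd.DataFrame({
    "nodes_list": [frozenset({0, 1, 2, 3}), frozenset({3, 4, 5, 6})],
    "clique_size": [4, 4],
})

# One row per (cluster, node) pair; sorted() turns each frozenset into a list
long_df = (
    toy_cliques["nodes_list"]
    .apply(sorted)
    .explode()
    .rename("node")
    .rename_axis("cluster_index")
    .reset_index()
)

# One ";"-separated string of cluster IDs per node
membership = (
    long_df.groupby("node")["cluster_index"]
    .agg(lambda ids: ";".join(str(i) for i in ids))
    .rename("cluster")
)
print(membership)
```

Node 3 sits in both toy cliques, so its `cluster` value lists both IDs.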

In [69]:
for node in cliques_df.iloc[0]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

Water Thieves Steal 00,000 Litres in Australia as Our Mad Max-Style Future Becomes Reality. Thieves stole roughly 00,000 litres of water in a region of Australia that’s suffering from one of the worst droughts in the history of the country. And with record-breaking heat and bushfires getting even larger, it feels like Australia is living in the future. That future, unfortunately, looks a lot like Mad Max. Police in New South Wales ...
Australian PM defends government policy on climate change amid wildfire crisis. Australia’s embattled prime minister has defended his government’s climate policy, as authorities warned the wildfires crisis could fester for months. Around 200 wildfires were burning in four states, with New South Wales accounting for more than half of them, including 60 fires which are not contained. The disaster has led to renewed ...
Australia wildfires: Cooler weather helps firefighters as PM returns home from holiday to crisis. Cool weather has eased conditions at some 

In [70]:
for node in cliques_df.iloc[1]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

UK coronavirus cases surge by 2,000 in a WEEK as the epidemic outstrips most of Europe. Prime Minister Boris Johnson this week acknowledged the epidemic was entering its ;fast growth phase; and scientists have said Britain and other countries are just weeks behind Italy, which is in the grip of the worst outbreak outside of China. Between March 12 and March 18, the 12 hardest-hit countries in Europe diagnosed more than 5,000 new ...
Coronavirus UK outbreak sees Boris hold Cobra meeting to coordinate response. Boris Johnson should drop his childish ban on ministers appearing on BBC radio programmes,quot; he said. quot;The public deserves to hear what plans are in place to deal with the outbreak.quot; Speaking in Downing Street last night, the Prime Minister said: “Our thoughts are very much with the family of the victim in Yokohama, the UK national ...
Hour by hour weather forecast for Birmingham on Friday after city battered by floods. Rather cloudy on Sunday with the odd shower. Dry a

In [71]:
for node in cliques_df.iloc[2]['nodes_list']:
    print(corpus.iloc[node]['clean_text'])

France vs England rugby: Kick off time, TV channel, teams and live stream free for Six Nations match in Paris. France and England were actually supposed to meet in the Rugby World Cup group stage, but the match in Japan was cancelled due to Typhoon Hagibis. France host England on Sunday, February 2. The match kicks off at pm UK time - 4pm in Paris. It will be played at the Stade de France. France vs England is live on BBC One. Coverage commences half ...
Six Nations: Team-by-team guide, key players, title odds and how to watch. They will try to improve that record, for the most part, without Sergio Parisse. The totemic number eight was denied the farewell he expected when Typhoon Hagibis caused the Azzurri;s final Rugby World Cup pool match to be called off. The 6-year-old plans to be involved in at least one of Italy;s home games during the tournament to put the ...
When is Wales vs Italy? Six Nations 2020 kick-off time, TV channel, team news, referee and odds. It will be Italy’s firs

## 4. Create (flat) clusters using the Louvain Community Detection Algorithm

In [36]:
from community import best_partition

In [37]:
# Apply Louvain Community Detection
# The keys are nodes, the values are the partitions they belong to
partition = best_partition(G)

In [38]:
# Append partition data to DF, save to file
partition_df = pd.DataFrame.\
               from_dict(partition, orient="index").\
               rename({0: "cluster"}, axis=1)

partition_df["cluster"] = partition_df["cluster"].apply(lambda x: str(int(x)))

partition_df.index.name = "node"

corpus.join(partition_df).\
       to_csv("working/corpus_clustered_louvain.csv")

In [39]:
# Iterate through and get a list of partitions and their nodes
partition_contents = {}
for node, cluster_id in partition.items():
    partition_contents.setdefault(cluster_id, []).append(node)

# Drop partitions that are too small
for key in list(partition_contents.keys()):
    if len(partition_contents[key]) < 3:
        partition_contents.pop(key)

In [40]:
# Let's see how big our "clusters" are, and how many there are total after removing the tiny ones
partition_lengths = {key:len(value) for key, value in partition_contents.items()}
print(partition_lengths, sum(partition_lengths.values()))

{2: 27, 3: 44, 5: 66, 10: 39, 13: 86, 14: 3, 17: 3, 22: 45, 26: 150, 27: 48, 31: 22, 32: 6, 37: 194, 44: 4, 48: 3, 49: 26, 53: 3, 54: 3, 60: 96, 62: 3, 63: 7, 66: 4, 69: 4, 70: 7, 72: 112, 73: 3, 76: 26, 78: 16, 81: 46, 82: 3, 84: 12, 93: 280, 97: 47, 105: 22, 108: 9, 110: 33, 115: 5, 116: 4, 128: 3, 138: 3, 148: 27, 157: 3, 165: 21, 174: 3, 177: 3, 181: 3, 190: 81, 219: 30, 220: 27, 228: 4, 236: 4, 260: 3, 271: 27, 275: 3, 282: 14, 284: 3, 290: 36, 292: 7, 308: 15, 311: 6, 315: 3, 323: 3, 342: 3, 343: 7, 345: 9, 347: 31, 350: 3, 358: 16, 365: 3, 368: 14, 369: 24, 381: 6, 384: 19, 392: 8, 395: 3, 404: 27, 421: 3, 431: 3, 435: 7, 436: 3, 439: 3, 448: 3, 458: 7, 459: 7, 484: 12, 489: 3, 499: 8, 501: 4, 532: 4, 557: 3, 564: 3, 569: 3, 572: 5, 589: 3, 601: 9, 611: 3, 615: 3, 618: 8, 630: 15, 651: 9, 665: 4, 667: 3, 671: 14, 687: 7, 691: 5, 701: 3, 706: 3, 709: 4, 715: 3, 719: 3, 722: 3, 736: 3, 766: 3, 774: 5, 788: 3, 822: 12, 847: 5, 856: 3, 867: 3, 870: 50, 873: 3, 896: 9, 928: 3, 936: 3

In [41]:
len(partition_contents)

261

In [42]:
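# Partition IDs vary between runs and small partitions were dropped above,
# so use .get() to avoid a KeyError when the ID is absent.
for node in partition_contents.get(1, [])[:10]:
    print(corpus.iloc[node]['clean_text'])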

In [None]:
for node in partition_contents[3][:10]:
    print(corpus.iloc[node]['clean_text'])

In [None]:
for node in partition_contents.get(8, [])[:10]:
    print(corpus.iloc[node]['clean_text'])
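
Louvain partition IDs are not stable across runs, and partitions with fewer than three nodes are dropped, so a hard-coded ID may not exist. A sketch that groups the partition and then picks the largest clusters by size is below. The helper names `group_partition` and `largest_clusters` and the toy partition are assumptions for illustration.

```python
from collections import defaultdict

def group_partition(partition, min_size=3):
    """Hypothetical helper: map cluster ID -> nodes, keeping clusters of min_size or more."""
    groups = defaultdict(list)
    for node, cluster_id in partition.items():
        groups[cluster_id].append(node)
    return {cid: nodes for cid, nodes in groups.items() if len(nodes) >= min_size}

def largest_clusters(groups, n=3):
    """Return the IDs of the n biggest clusters, largest first."""
    return sorted(groups, key=lambda cid: len(groups[cid]), reverse=True)[:n]

# Toy partition in the same node -> cluster ID shape that best_partition returns
toy_partition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2}
toy_groups = group_partition(toy_partition)
print(toy_groups)
print(largest_clusters(toy_groups, n=1))
```

Looping over `largest_clusters(partition_contents)` and printing the first few `clean_text` values per cluster would then work regardless of which IDs a given run produces.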