# Experimenting with Cleaning, Clustering & Summarization Pipelines

### To do (technical)
- Implement date windows on my corpus loader function

In [1]:
import os
import re
import json
import hdbscan

import numpy as np
import pandas as pd

from sklearn.decomposition import PCA
from collections import Counter
import matplotlib.pyplot as plt

import lib.helper as helper
import lib.embedding_models as reps

from importlib import reload

%matplotlib inline



In [2]:
# Should be the same path on all my PCs; each scrape is stored there as a separate JSON file.
storage_path = "C:/Users/Martin/Dropbox/news_crow/scrape_results"

# "bing" is the targeted news search corpus; "RSS" comes from specific world and local news feeds.
corpus_type = "bing_cor"

## 1.  Retrieve Corpus

The corpus is scraped by the "run_news_scrapes.py" script (run by Windows Task Scheduler) every 12 hours, a bit past midday and a bit past midnight.

The "bing" corpus consists of news titles and text extracts retrieved from the Bing News Search API, using a few Home Office-related keywords.

The "RSS" corpus is pulled directly from a number of RSS feeds for world news sites and local British news sites, with no filters applied for news story type or subject.

### First, get a list of all the news dumps created so far

In [3]:
corpus = helper.load_clean_corpus(storage_path, corpus_type)

Total files: 153
Loading file: bing_corpus_2019-09-05_2135.json
Loading file: bing_corpus_2019-09-06_0019.json
Loading file: bing_corpus_2019-09-06_1221.json
Loading file: bing_corpus_2019-09-07_0019.json
Loading file: bing_corpus_2019-09-07_1221.json
Loading file: bing_corpus_2019-09-08_0019.json
Loading file: bing_corpus_2019-09-08_1221.json
Loading file: bing_corpus_2019-09-09_0019.json
Loading file: bing_corpus_2019-09-09_1221.json
Loading file: bing_corpus_2019-09-10_0019.json
Loading file: bing_corpus_2019-09-10_1221.json
Loading file: bing_corpus_2019-09-11_0019.json
Loading file: bing_corpus_2019-09-11_1221.json
Loading file: bing_corpus_2019-09-12_0019.json
Loading file: bing_corpus_2019-09-12_1221.json
Loading file: bing_corpus_2019-09-13_0019.json
Loading file: bing_corpus_2019-09-13_1221.json
Loading file: bing_corpus_2019-09-14_0019.json
Loading file: bing_corpus_2019-09-14_1221.json
Loading file: bing_corpus_2019-09-15_0019.json
Loading file: bing_corpus_2019-09-15_2059.j
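
The date-window item in the to-do list could be handled before any file is opened, since the dump file names in the log above carry the scrape timestamp. The following is only a sketch under that assumption: `scrape_time` and `files_in_window` are hypothetical helpers, not part of `lib.helper`, and the `_YYYY-MM-DD_HHMM.json` pattern is inferred from the file names printed by the loader.

```python
import re
from datetime import datetime

# Assumed naming pattern, inferred from the loader output:
# <prefix>_YYYY-MM-DD_HHMM.json
STAMP_PATTERN = re.compile(r"_(\d{4}-\d{2}-\d{2})_(\d{4})\.json$")


def scrape_time(filename):
    """Parse the scrape timestamp out of a dump's file name, or return None."""
    match = STAMP_PATTERN.search(filename)
    if match is None:
        return None
    return datetime.strptime(f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H%M")


def files_in_window(filenames, start=None, end=None):
    """Keep files whose scrape time lies in [start, end]; None means unbounded."""
    kept = []
    for name in filenames:
        stamp = scrape_time(name)
        if stamp is None:
            continue  # skip files that do not follow the assumed pattern
        if (start is None or stamp >= start) and (end is None or stamp <= end):
            kept.append(name)
    return kept


files = [
    "bing_corpus_2019-09-05_2135.json",
    "bing_corpus_2019-09-06_0019.json",
    "bing_corpus_2019-09-06_1221.json",
    "bing_corpus_2019-09-07_0019.json",
]
window = files_in_window(files, start=datetime(2019, 9, 6), end=datetime(2019, 9, 7))
print(window)
```

Filtering on file names keeps the loader cheap, but it windows on scrape time rather than publication time; filtering on the `date` column after loading would be the alternative if publication time matters.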

In [4]:
corpus.head()

| | date | link | origin | retrieval_timestamp | source_url | summary | title | clean_text |
|---|---|---|---|---|---|---|---|---|
| 14 | 2019-09-05T15:12:00.0000000Z | https://www.gov.uk/government/news/government-... | bing_news_api | 2019-09-05 21:35:05.106001 | www.gov.uk | New border controls that will make it harder f... | Government announces <b>immigration</b> plans ... | Government announces immigration plans for no ... |
| 16 | 2019-09-05T08:23:00.0000000Z | https://www.thesun.co.uk/news/9865413/home-sec... | bing_news_api | 2019-09-05 21:35:05.107007 | www.thesun.co.uk | PRITI PATEL tonight conceded unlimited EU <b>i... | Home Secretary Priti Patel admits No Deal Brex... | Home Secretary Priti Patel admits No Deal Brex... |
| 28 | 2019-09-05T16:54:00.0000000Z | https://www.thetelegraphandargus.co.uk/news/17... | bing_news_api | 2019-09-05 21:35:05.108030 | www.thetelegraphandargus.co.uk | A STUDENT from Bradford has helped create a sh... | Student film on <b>immigration</b> focuses on ... | Student film on immigration focuses on those m... |
| 30 | 2019-09-05T14:42:00.0000000Z | https://www.gov.uk/government/publications/no-... | bing_news_api | 2019-09-05 21:35:05.108030 | www.gov.uk | The United Kingdom will be leaving the Europea... | No deal <b>immigration</b> arrangements for EU... | No deal immigration arrangements for EU citize... |
| 31 | 2019-09-04T22:12:11.0000000Z | https://www.dailymail.co.uk/wires/ap/article-7... | bing_news_api | 2019-09-05 21:35:05.108030 | www.dailymail.co.uk | MEXICO CITY (AP) - Since last year&#39;s carav... | AP EXPLAINS: What changed in 90 days of <b>imm... | AP EXPLAINS: What changed in 0 days of immigra... |


## 2.  Build Text Model (Representation, e.g. word2vec, entities list...)

- InferSent did not work on either the world corpus or the bing corpus. I suspect the problem lies in the PCA step, which may not cope well with such high-dimensional vectors (length 4096).
- Summed keywords work rather better on the world corpus.
- Summed keywords still fail on the bing/Home Office corpus, giving me one cluster about "immigration" and one about the American Supreme Court.
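
For reference, the "summed keywords" idea can be illustrated as follows. The internals of `reps.GloveWordModel` are not shown in this notebook, so this is only a sketch of the general technique (sum the word vectors of the non-stopword tokens in each text), not the actual implementation. The function name, the whitespace tokenisation and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import numpy as np


def summed_keyword_vector(text, word_vectors, stopwords=frozenset()):
    """Sum the vectors of every known, non-stopword token in `text`.

    `word_vectors` maps a lower-case token to a 1-D numpy array.
    Texts with no known keywords map to the zero vector.
    """
    dim = len(next(iter(word_vectors.values())))
    total = np.zeros(dim)
    for token in text.lower().split():
        if token in stopwords or token not in word_vectors:
            continue
        total += word_vectors[token]
    return total


# Toy vectors standing in for real GloVe embeddings (hypothetical values).
toy_vectors = {
    "immigration": np.array([1.0, 0.0]),
    "plans": np.array([0.0, 2.0]),
    "for": np.array([5.0, 5.0]),
}
vector = summed_keyword_vector(
    "Government announces immigration plans for no deal",
    toy_vectors,
    stopwords={"for", "no"},
)
print(vector)
```

Because every text in the bing corpus was retrieved with the same few keywords, vectors built this way will share a large common component, which may be part of why the clusters collapse onto "immigration".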

In [5]:
# Windows didn't play nicely with the vector datasets because of some obscure encoding problem
# (Python in Conda kept trying to decode using cp1252 regardless of whatever other options I specified!)
# Solution: rewrite the file and drop any characters the Windows encoder refuses to recognise.
# I shouldn't lose too much info.
with open('./lib/InferSent/dataset/fastText/crawl-300d-2M.vec', "r", encoding="cp1252", errors="ignore") as infile:
    with open('./lib/InferSent/dataset/fastText/crawl-300d-2M_win.vec', "wb") as outfile:
        for line in infile:
            outfile.write(line.encode('cp1252'))

In [117]:
infersent = reps.InferSentModel(list(corpus['clean_text']),
                                list(corpus['clean_text']),
                                W2V_PATH = './lib/InferSent/dataset/fastText/crawl-300d-2M_win.vec')

embeddings = infersent.get_embeddings()

Found 16344(/17591) words with w2v vectors
Vocab size : 16344


In [39]:
# Whereas this worked first time!
glove = reps.GloveWordModel(list(corpus['clean_text']), list(corpus['clean_text']))

embeddings = glove.get_embeddings()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Martin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [118]:
# Turn that into a DF for me
embeddings_df = pd.DataFrame({"clean_text": list(embeddings.keys()),
                              "embeddings": list(embeddings.values())})

In [119]:
embeddings_df.head()

| | clean_text | embeddings |
|---|---|---|
| 0 | Government announces immigration plans for no ... | [0.062002, 0.03705715, 0.042470187, 0.00532962... |
| 1 | Home Secretary Priti Patel admits No Deal Brex... | [0.05766312, 0.0521763, 0.101948425, 0.0049104... |
| 2 | Student film on immigration focuses on those m... | [0.04688829, 0.059768386, 0.0808005, -0.017053... |
| 3 | No deal immigration arrangements for EU citize... | [0.065198794, 0.017605491, 0.091646925, 0.0291... |
| 4 | AP EXPLAINS: What changed in 0 days of immigra... | [0.06322765, 0.08380522, 0.037862558, 0.018137... |


## 3. Cluster Text

This is the part where the pipelines get a little more experimental.

In [120]:
embeddings_array = np.vstack(embeddings_df['embeddings'])

In [121]:
embeddings_array.shape

(4583, 4096)

In [155]:
# First, PCA the data
pca = PCA(n_components=20, svd_solver='full')

embeddings_pca = pca.fit_transform(embeddings_array)

In [156]:
embeddings_pca.shape

(4583, 20)

In [157]:
print(pca.explained_variance_ratio_)
print(pca.singular_values_) 

[0.07990959 0.04421469 0.0269681  0.02210961 0.01783179 0.01462192
 0.01230171 0.0114431  0.00995354 0.00972468 0.00835635 0.00819447
 0.00742902 0.00729862 0.00666714 0.00633519 0.00582932 0.00562423
 0.00552668 0.00536528]
[25.707434  19.122408  14.934285  13.522275  12.143858  10.996676
 10.086535   9.7281685  9.072938   8.968028   8.313187   8.232273
  7.838355   7.7692585  7.425558   7.2383437  6.943335   6.820099
  6.7606955  6.6612444]
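
The 20 ratios printed above sum to roughly 0.32, so the reduced data keeps only about a third of the variance in the 4096-dimensional embeddings. A quick way to see how many components a given share of variance would need is to look at the cumulative ratio. The sketch below uses plain numpy on synthetic data as a stand-in for `embeddings_array`; it computes the ratios the same way PCA does (squared singular values of the centred data), and the helper names are hypothetical.

```python
import numpy as np


def explained_variance_ratio(X):
    """Per-component variance share, from the SVD of the centred data."""
    centred = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def components_for_variance(X, target=0.9):
    """Smallest number of leading components whose cumulative share reaches `target`."""
    cumulative = np.cumsum(explained_variance_ratio(X))
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(cumulative)), cumulative


# Synthetic stand-in for embeddings_array: 30 features with decaying scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30)) * np.linspace(5.0, 0.1, 30)
n, cumulative = components_for_variance(X, target=0.9)
print(n, round(float(cumulative[n - 1]), 3))
```

With scikit-learn the same selection can be made directly by passing a fraction, e.g. `PCA(n_components=0.9, svd_solver='full')`, which keeps as many components as are needed to explain that share of the variance.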


In [170]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=3)
clusterer.fit(embeddings_pca)
pd.unique(clusterer.labels_)

array([ 2,  0, -1,  1], dtype=int64)

In [171]:
len(pd.unique(clusterer.labels_))

4

In [172]:
Counter(clusterer.labels_)

Counter({2: 3026, 0: 1473, -1: 67, 1: 17})
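
HDBSCAN marks points it cannot assign to any cluster with the label -1, so the counts above mean two clusters absorb almost everything (3026 and 1473 of 4583 texts, about 66% and 32%), with 67 noise points and one 17-text cluster. To judge whether those clusters mean anything, it helps to see sizes, shares and a few example texts side by side. The helper below is a hypothetical sketch using only the standard library; it assumes the labels and texts are in the same row order.

```python
from collections import Counter, defaultdict


def summarise_clusters(labels, texts, n_examples=3):
    """Return size, share and a few example texts per cluster label.

    Clusters are ordered largest first; label -1 is HDBSCAN's noise label.
    """
    labels = [int(label) for label in labels]  # also accepts numpy integers
    sizes = Counter(labels)
    examples = defaultdict(list)
    for label, text in zip(labels, texts):
        if len(examples[label]) < n_examples:
            examples[label].append(text)
    total = len(labels)
    return {
        label: {"size": size, "share": size / total, "examples": examples[label]}
        for label, size in sizes.most_common()
    }


# Toy data standing in for clusterer.labels_ and the clean_text column.
toy_labels = [0, 0, 1, -1, 0, 1]
toy_texts = ["a", "b", "c", "d", "e", "f"]
summary = summarise_clusters(toy_labels, toy_texts, n_examples=2)
for label, info in summary.items():
    print(label, info["size"], round(info["share"], 2), info["examples"])
```

In this notebook the call would presumably be `summarise_clusters(clusterer.labels_, embeddings_df['clean_text'])`, since `embeddings_array` was stacked from `embeddings_df` in row order.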