# Create Sparse Word Count Matrix

This notebook creates a sparse matrix from a large set of word counts derived from historical British newspapers.
- The columns of this matrix correspond to a chosen vocabulary.
- The rows are the word counts for one newspaper title in a specific month.

This notebook explains and covers the following stages:

- 1. Download original count data from Zenodo
- 2. Process JSON files with word counts
- 3. Construct a vocabulary from the word counts (the matrix columns)
- 4. Convert all JSON word counts to a sparse matrix

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
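# Path and json are used below; import them explicitly instead of relying on the star import
import json
from pathlib import Path

from tools.ngram_creation import *
from tools.ngram_creation import CorpusProcessor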

## 1. Get word counts from Zenodo and unzip .tar file

In [5]:
# ! wget ...

In [1]:
# Path to the ngram
tar_ngram_path = '/Volumes/WorkData/ngrams.tar'
# Path to unzipped data
unzipped_ngram_path = '/Volumes/WorkData/ngrams/ngrams-ouput'

In [7]:
Path(unzipped_ngram_path).mkdir(exist_ok=True)

In [8]:
# unpack tar file (-xf detects compression automatically; quotes protect paths with spaces)
!tar -xf "{tar_ngram_path}" -C "{unzipped_ngram_path}"

## 2. Process JSON files

We load all the JSON files with the `JSONHandler` object.

In [5]:
unzipped_ngram_path = '/Volumes/X9 Pro/ngrams-output'

In [6]:
handler = JSONHandler(unzipped_ngram_path)

Optionally, remove corrupted JSON files.

In [7]:
#handler.check_json()

`len` returns the number of JSON files.

In [8]:
len(handler)

269179

`print` also reports the number of distinct newspapers.

In [9]:
print(handler)

269179 JSON files for 1204 newspapers


Set the location where we store the n-gram corpus as sparse matrices.

In [10]:
save_to = '/Volumes/X9 Pro/ngrams-by-nlp-all'

First we combine the counts by NLP (the newspaper identifier). For each JSON file we remove tokens that occur only once.
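
The idea can be sketched with a plain `Counter`. This is only an illustration under stated assumptions: the helper `combine_counts` is hypothetical, and it assumes that a token is kept when its count in a file is strictly above the threshold; the actual `Vocab` implementation may differ.

```python
from collections import Counter

def combine_counts(monthly_counts, min_threshold=1):
    """Sum word counts over files, skipping tokens whose count in a file
    is not above `min_threshold` (an assumed reading of the threshold)."""
    total = Counter()
    for counts in monthly_counts:
        total.update({w: c for w, c in counts.items() if c > min_threshold})
    return total

# Hypothetical counts for two monthly files of one newspaper
toy_files = [{"railway": 3, "hapax": 1}, {"railway": 2, "machine": 5}]
toy_wc = combine_counts(toy_files, min_threshold=1)
print(toy_wc)
```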

In [11]:
vocab = Vocab(handler, save_to=save_to, min_threshold=1, n_cores=8)

In [12]:
%time vocab.nlp_counts()

0it [00:00, ?it/s]

CPU times: user 6min 47s, sys: 35.2 s, total: 7min 23s
Wall time: 55min 55s


The output of this operation is stored in the `.wc_by_nlp` attribute ("word counts by NLP"). Each element is a dictionary with the word counts of one newspaper.

In [13]:
len(vocab.wc_by_nlp[3])

169372

In [14]:
len(vocab.wc_by_nlp)

1204

Then we combine these dictionaries, setting 5 as the minimum threshold at the level of the NLP (newspaper).

In [15]:
vocab.min_threshold = 5

In [16]:
%time vocab.total_counts()

  0%|          | 0/1204 [00:00<?, ?it/s]

CPU times: user 5min 44s, sys: 29min 7s, total: 34min 51s
Wall time: 1h 39min 7s


In [20]:
len(vocab.vocab)

196719

Now we remove words that appear fewer than 2,500 times in total with `filter_by_min_threshold`.
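
A rough sketch of this kind of corpus-level filter is given below. The function `total_vocab` and the toy numbers are hypothetical, and keeping words with a total count greater than or equal to the cut-off is an assumption about how the threshold is applied.

```python
from collections import Counter

def total_vocab(wc_by_title, min_total):
    """Merge per-title counters and keep words seen at least `min_total` times."""
    wc_total = Counter()
    for wc in wc_by_title:
        wc_total.update(wc)
    return sorted(w for w, c in wc_total.items() if c >= min_total)

# Hypothetical per-newspaper word counts
toy_by_title = [Counter(railway=1800, cricket=40), Counter(railway=900, cricket=10)]
toy_vocab = total_vocab(toy_by_title, min_total=2500)
print(toy_vocab)
```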

In [21]:
vocab.filter_by_min_threshold(2500)

In [22]:
len(vocab.vocab),len(vocab.wc_total)

(196719, 196719)

In [23]:
vocab.save()

In [24]:
# quote the path: it contains a space
!ls -la "{vocab.save_to}"


Lastly, we convert the whole corpus to sparse matrices, one per newspaper, using this vocabulary for the columns.
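
What fixing the columns to a vocabulary means can be sketched as follows. The helper `counts_to_matrix` is hypothetical and is not the `CorpusProcessor` implementation; it only shows that words outside the vocabulary are dropped and that every matrix ends up with the same column order.

```python
import numpy as np
from scipy.sparse import csr_matrix

def counts_to_matrix(count_dicts, vocab):
    """Build a (documents x vocab) sparse matrix; out-of-vocabulary words are dropped."""
    col_of = {w: j for j, w in enumerate(vocab)}
    rows, cols, vals = [], [], []
    for i, counts in enumerate(count_dicts):
        for w, c in counts.items():
            j = col_of.get(w)
            if j is not None:
                rows.append(i)
                cols.append(j)
                vals.append(c)
    shape = (len(count_dicts), len(vocab))
    return csr_matrix((vals, (rows, cols)), shape=shape, dtype=np.int64)

# Hypothetical vocabulary and monthly counts
toy_vocab = ["machine", "railway"]
toy_docs = [{"railway": 3, "unseen": 7}, {"machine": 2}]
toy_X = counts_to_matrix(toy_docs, toy_vocab)
print(toy_X.toarray())
```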

In [25]:
corpus_proc = CorpusProcessor(handler,
                              vocab.vocab, 
                              vocab.save_to,
                              n_cores=8)


In [26]:
%time corpus_proc.process_ngrams()

Processing 0002943
Processing 0000411
Processing 0000604
Processing 0001639
Processing 0000987
Processing 0001586
Processing 0002183
Processing 0001709
/Volumes/X9 Pro/ngrams-by-nlp-all
/Volumes/X9 Pro/ngrams-by-nlp-all/0000604_sparse_matrix.npz
Processing 0000604 done.
Processing 0002064
/Volumes/X9 Pro/ngrams-by-nlp-all
/Volumes/X9 Pro/ngrams-by-nlp-all/0002943_sparse_matrix.npz
Processing 0002943 done.
/Volumes/X9 Pro/ngrams-by-nlp-all
/Volumes/X9 Pro/ngrams-by-nlp-all/0002064_sparse_matrix.npz
Processing 0002064 done.
Processing 0000488
Processing 0001083
/Volumes/X9 Pro/ngrams-by-nlp-all
/Volumes/X9 Pro/ngrams-by-nlp-all/0001586_sparse_matrix.npz
Processing 0001586 done.
/Volumes/X9 Pro/ngrams-by-nlp-all
/Volumes/X9 Pro/ngrams-by-nlp-all/0001709_sparse_matrix.npz
Processing 0001709 done.
/Volumes/X9 Pro/ngrams-by-nlp-all
/Volumes/X9 Pro/ngrams-by-nlp-all/0001083_sparse_matrix.npz
Processing 0001083 done.
Processing 0001091
Processing 0001616
Processing 0000978
/Volumes/X9 Pro/ngra

In [28]:
# quote the path: it contains a space
!ls "{save_to}" | wc -l


In [None]:
!du -h -d 1 "{save_to}/.."

## Merge sparse matrices

In the cells below we merge the sparse matrices created at the NLP level into one large sparse matrix with corresponding metadata.
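
Conceptually, such a merge stacks matrices that share the same columns and concatenates their metadata in the same order. The sketch below uses invented toy data and standard `scipy`/`pandas` calls; it is an assumption about the general approach, not a description of `merge_sparse_matrices`.

```python
import pandas as pd
from scipy.sparse import csr_matrix, vstack

# Two hypothetical per-newspaper matrices sharing the same two vocabulary columns
X_a = csr_matrix([[1, 0], [0, 2]])
X_b = csr_matrix([[3, 4]])
meta_a = pd.DataFrame({"nlp": ["0000001"] * 2, "month": ["1850-01", "1850-02"]})
meta_b = pd.DataFrame({"nlp": ["0000002"], "month": ["1850-01"]})

# Stack rows and keep the metadata aligned with the row order
merged = vstack([X_a, X_b], format="csr")
merged_meta = pd.concat([meta_a, meta_b], ignore_index=True)
merged_meta["idx"] = range(len(merged_meta))  # row position in `merged`
print(merged.shape, len(merged_meta))
```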

In [31]:
handler = JSONHandler('/Volumes/X9 Pro/ngrams-output-final')
#save_to = '../sparse_ngrams'
# load the saved vocabulary and close the file afterwards
with open(Path(save_to) / 'vocab.json') as f:
    vocab = json.load(f)
corpus_proc = CorpusProcessor(handler,
                              vocab = vocab, 
                              save_to = save_to)

In [None]:
# WARNING: this operation requires lots of memory
# bug!!!
save_merged = '/Volumes/X9 Pro/unigram-matrix'
corpus_proc.merge_sparse_matrices(save_merged, override=True,
                                  npd_links_path='data/newspapers_overview_with_links_JISC_NLPs.csv',
                                  npd_data_path='data/MPD_export_1846_1920_20230217.csv')

In [None]:
!ls -la --block-size=G "{save_merged}"

# Fin.

In [None]:
metadata = pd.read_csv('/ngram_datadrive_2/sparse_matrix/metadata.csv',index_col=0)
metadata['idx'] = list(range(metadata.shape[0]))
metadata

In [None]:
timestep = 'month'
by_timestep = metadata.groupby([timestep])['idx'].apply(list).reset_index().sort_values(timestep)

In [None]:
mapping = json.load(open('/ngram_datadrive_2/sparse_matrix/mapping.json'))

In [None]:
# `sp` is assumed to be the merged sparse matrix (one row per metadata row), loaded beforehand
counts = sp[:,mapping['machine']]

In [None]:
#counts[[1,2,4]].sum(axis=0)

In [None]:
from collections import defaultdict
counts_dict = defaultdict(int)
for i,row in by_timestep.iterrows():
    year = int(str(row[timestep]).split('-')[0])
    if year >= 1780 and year < 1920:
        ts_counts = np.squeeze(np.asarray(counts[row.idx].sum(axis=0)))
        counts_dict[row[timestep]] += ts_counts

In [None]:
pd.Series(counts_dict).plot()

In [None]:
handler.nlp_ids[:10]

In [None]:
handler.organize_by_nlp()

In [None]:
#handler.by_nlp['0000032']

In [None]:
from tools.ngram_tools import *

In [None]:
ngram_proc = NGramProcessor('0002340',handler,**{'min_threshold':10})

In [None]:
ngram_proc.nlp

In [None]:
ngram_proc.min_threshold

In [None]:
ngram_proc = NGramProcessor('0002340',handler,**{'min_threshold':10})
ngram_proc.min_threshold = 5
ngram_proc._process_by_nlp()

In [None]:
corpus_proc.process_ngrams()

In [None]:
ngram_proc._process_nlp('0002340')

In [None]:
ngram_proc._files

In [None]:
files = ngram_proc.load_json(handler)

In [None]:
len(list(files))

In [None]:
%matplotlib inline
from pathlib import Path
import json
import pandas as pd
import numpy as np
import multiprocessing
import seaborn as sns
import statsmodels.api as sm
from datetime import datetime
from scipy.sparse import save_npz, load_npz
from collections import defaultdict, Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import jensenshannon
from tqdm import tqdm_notebook
import os
sns.set()

In [None]:
json_files = list(Path('../ngrams-obtained/ngrams-output').glob('*.json')); len(json_files)

In [None]:
run_check = False

def test_json(js):
    """Return the path of a corrupt JSON file, or None if it loads fine."""
    try:
        with open(js,'r') as open_json:
            json.load(open_json)
    except Exception as e:
        print(e,js.name)
        return js

input_variables = [[n] for n in json_files]

if run_check:
    with multiprocessing.Pool(6) as pool:
        corrupt = pool.starmap(test_json,input_variables)

    # valid files return None, so keep only the actual paths
    corrupt_set = {c for c in corrupt if c}

    for c in corrupt_set:
        os.remove(c)

    # refresh the file list after removing corrupt files
    json_files = list(Path('../ngrams-obtained/ngrams-output').glob('*.json')); len(json_files)

In [None]:
nlp_id = lambda x: x.name.split('_')[0]
nlp_ids = list(set(map(nlp_id,json_files)))

In [None]:
print(len(nlp_ids),nlp_ids[0])

In [None]:
json_files_by_nlp = {nlp : list(Path('../ngrams-obtained/ngrams-output').glob(f'{nlp}*.json')) for nlp in nlp_ids}

In [None]:
len(json_files_by_nlp['0002544'])

In [None]:
with_vocab = True
if with_vocab:
    vocab = json.load(open('../ngram-result-sp/vocab.json','r'))
    print(len(vocab))

In [None]:
vocab[:10]

In [None]:
#p_0000081 = json_files_by_nlp.pop('0000081')
#p_0001084 = json_files_by_nlp.pop('0001084')
#p_0001112 = json_files_by_nlp.pop('0001112')

In [None]:
#meta_dict = {2194: ['liberal','london'], 2088: ['conservative','merseyside'],
#            2642: ['liberal','london'], 2254: ['none','london'],
#            2083: ['conservative','merseyside'], 2645: ['conservative','london'],
#            2089: ['conservative','merseyside'], 2090: ['conservative','merseyside'],
#            2244: ['none','london'], 2084:[ 'conservative','merseyside'], 2085: ['conservative','merseyside']}


def generate_metadata(files:list) -> pd.DataFrame:
    
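    """
    Build a metadata table from JSON file names of the form
    `<nlp>_<year>_<month>.json`.

    Arguments:
        files (list): pathlib.Path objects, one per JSON count file

    Returns:
        pandas.DataFrame: one row per file with columns `idx`, `year` and `month`
    """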

    data = []
    for i,js in enumerate(files):
        year = js.name.split('_')[1]
        year_month = '-'.join([js.name.split('_')[1],js.name.split('_')[2][:-5]])
        #politics = meta_dict.get(int(js.name.split('_')[0]),['none'])[0]
        #place = meta_dict.get(int(js.name.split('_')[0]),['none'])[-1]
        data.append([i,year,year_month])
    return pd.DataFrame(data,columns=['idx','year','month'])

In [None]:
class JSONHandler:
    
    def __init__(self,path: str):
        self.json_files = Path(path)
    
    @clas
    def

In [None]:
class NGramProcessor:
    min_threshold = 0
    vocab = []
    def __init__(self, nlp: str, save_to: str):
        self.nlp = nlp
        self.save_to = Path(save_to)
        
    def __str__(self):
        return f'Processing newspaper with NLP {self.nlp}.'
    
    def check_done(self):
        # a newspaper counts as done once its column file has been written
        return (self.save_to / f'{self.nlp}_sparse_matrix_columns.json').is_file()
    
    def load_json(self):
        # lazily load all JSON count files of this newspaper
        return (json.load(open(js,'r')) for js in json_files_by_nlp[self.nlp])

In [None]:
ngram = NGramProcessor(1,'../ngrams')
print(ngram)

In [None]:
ngram.check_done()

In [None]:
ngram.min_threshold = 10

In [None]:
ngram.min_threshold

In [None]:
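def generate_sparse_matrix(nlp: str, min_threshold: int=10, vocab: list=None):
    """
    Arguments
        nlp (str): identifier of the newspaper to process
        min_threshold (int): minimum total count for a word to be kept
            (only used when no vocabulary is given)
        vocab (list): optional fixed vocabulary that defines the columns
    """

    if (output_path / f'{nlp}_sparse_matrix_columns.json').is_file():
        print(f'Already completed {nlp}. Skipping to next newspaper.')
        return None
    #try:
    print(f'Processing {nlp}')
    files = (json.load(open(js,'r')) for js in json_files_by_nlp[nlp]) #(json.load(open(js,'r')) for js in json_files)
    vectorizer = DictVectorizer()
    
    if vocab:
        vectorizer.fit([{w:0 for w in vocab}])
        X_red = vectorizer.transform(files)
        # DictVectorizer sorts feature names by default, so take the column order from it
        vocab = list(vectorizer.get_feature_names_out())
            
    else:
        X = vectorizer.fit_transform(files)
        word_totals = X.sum(axis=0)
        include = np.where(word_totals >=min_threshold)[1]
        
        X_red = X[:,include]
        # get_feature_names() was removed in recent scikit-learn versions
        vocab = list(np.array(vectorizer.get_feature_names_out())[include])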
    
    df = generate_metadata(json_files_by_nlp[nlp])
        
    df.to_csv(output_path / f'{nlp}_metadata.csv')
    save_npz(output_path / f'{nlp}_sparse_matrix.npz', X_red, compressed=True)
    json.dump(vocab,open(output_path / f'{nlp}_sparse_matrix_columns.json','w'))
    
    print(f'Completed {nlp}')
    #except Exception as e:
    #    print(e,f'Error with {nlp}')
        


In [None]:
output_path = Path('../sparse_matrices_with_vocab')
output_path.mkdir(exist_ok=True)

In [None]:
input_variables = [[n,1,vocab] for n in json_files_by_nlp.keys()]

In [None]:
with multiprocessing.Pool(6) as pool: 
    pool.starmap(generate_sparse_matrix,input_variables)
print('Done, almost...')

In [None]:
json_files_by_nlp['0001084'] = p_0001084
generate_sparse_matrix('0001084',20)
print('Done 0001084')

In [None]:
json_files_by_nlp['0000081-1'] = p_0000081[:int(len(p_0000081)/2)]
generate_sparse_matrix('0000081-1',20)


In [None]:
json_files_by_nlp['0000081-2'] = p_0000081[int(len(p_0000081)/2):int(len(p_0000081)/2)+int(len(p_0000081)/4)]
generate_sparse_matrix('0000081-2',20)

In [None]:
json_files_by_nlp['0000081-3'] = p_0000081[int(len(p_0000081)/2)+int(len(p_0000081)/4):]
generate_sparse_matrix('0000081-3',20)

In [None]:
# split the files of 0001112 (not 0000081) into two halves
json_files_by_nlp['0001112-1'] = p_0001112[:int(len(p_0001112)/2)]
generate_sparse_matrix('0001112-1',20)
json_files_by_nlp['0001112-2'] = p_0001112[int(len(p_0001112)/2):]
generate_sparse_matrix('0001112-2',20)
print('Done 0001112')

In [None]:
#json_files_by_nlp['0000081'] = p_0000081
#generate_sparse_matrix('0000081',20)
#print('Done 0000081')

#json_files_by_nlp['0001112'] = p_0001112
#generate_sparse_matrix('0001112',20)
#print('Done, done, done.')

# Metadata

In [None]:
df = generate_metadata(files);df 

# Load data

In [None]:
X_red = load_npz('../sparse_matrix.npz')
df = pd.read_json('../sparse_matrix_rows.json')

df['year_month'] = df.month.apply(lambda x: datetime.strptime(x, "%Y-%m").date())
df['year_month'] = pd.to_datetime(df['year_month'])
df['decade'] = df.year.apply(lambda x: int(str(x)[:3]+'0'))
vocab = json.load(open('../sparse_matrix_columns.json','r'))
vocab_set = set(vocab)

In [None]:
#df['year_month'] = df['month']

In [None]:
X_red.shape,df.shape,len(vocab)

# Timelines

In [None]:
queries = [["tramp","tramps",'beggar','beggars','pauper','paupers','idler',
         'idlers','delinquent','delinquents', 'loafer','loafers','indigent',
         'indigents','vagrant','vagrants','feckless','indolent','unemployable','underclass',
         'unemployables','shirker','shirkers','scrounger','scroungers','skiver','skivers','workshy']]
queries = [['gladstone'],['disraeli']]
queries = [['_party','conservative','tory','liberal','liberals'],['election','elections','poll','polls']]
queries = [['april'],['october']]
queries = [['seconds'],['minutes'],['hours']]
queries = [['winter']]
queries = [['machine','machines','machinery','engine','engines']]
queries = [['accident','accidents','disaster','catastrophe']]
queries = [['accident','accidents']]
queries = [['operative','operatives']]
queries = [['april']]
queries = [['harvest']]
queries = [['rain'],['snow']]
queries = [['morning','noon'],['afternoon','evening']]
queries = [['morning','am']]
queries = [['electricity']]
queries = [['slavery','slaves','slave']]
queries = [['liberty'],['justice'],['democracy']]
queries = [['english'],['british']]
queries = [['king'],['queen']]
queries = [['communism']]
queries = [['cricket']]
queries = [['railway','railways']]

In [None]:
queries = [['operative','operatives']]
queries = [['railway','railways']]
queries = [['cricket']]
queries = [['accident','accidents','disaster','catastrophe']]
queries = [['april']]
#queries = [['_party','conservative','tory','liberal','liberals'],['election','elections','poll','polls']]
#queries = [['accident','accidents','disaster','catastrophe']]
#queries = [["tramp","tramps",'beggar','beggars','pauper','paupers','idler', 'idlers','delinquent','delinquents', 'loafer','loafers','indigent','indigents','vagrant','vagrants','feckless','indolent','unemployable','underclass','unemployables','shirker','shirkers','scrounger','scroungers','skiver','skivers','workshy']]
#queries = [['_party','conservative','tory','liberal','liberals'],['election','elections','poll','polls']]

In [None]:
time_step = 'year_month' # year | year_month
queries = [['_party','conservative','tory','liberal','liberals'],['election','elections','poll','polls']]
queries = [['cricket']]
queries = [['accident','accidents','disaster','catastrophe']]
queries = [['bathing','swimming']]
relative = True
standardize = True

In [None]:
metadata = df.groupby([time_step])['idx'].apply(list).reset_index().sort_values(time_step)

In [None]:
results = {}
for query in queries:
    if isinstance(query,str):
        feature_idx = vocab.index(query)
    elif isinstance(query,list):
        feature_idx = [vocab.index(q) for q in query if q in vocab_set]

    groups = list(metadata['idx'])
    time_units = list(metadata[time_step])

    if relative:
        results[query[0]] = {ts: X_red[group][:,feature_idx].sum() / X_red[group].sum()
                                for ts,group in zip(time_units,groups)}
    else:
        results[query[0]] = {ts: X_red[group][:,feature_idx].sum() 
                                for ts,group in zip(time_units,groups)}

In [None]:
results_df = pd.DataFrame.from_dict(results)

In [None]:
if standardize:
    for c in results_df.columns:
        results_df[c] = (results_df[c] - results_df[c].mean()) / results_df[c].std()

In [None]:
if time_step == 'year':
    results_df.loc[1800:1870].plot()
elif time_step == 'year_month':
    results_df.loc[datetime.strptime('1800-01',"%Y-%m").date():datetime.strptime('1870-01',"%Y-%m").date()].plot(figsize=(25,5),alpha=.75)

In [None]:
#results_df['riot'].corr(results_df['wage'])

# Example Analysis: Accidents are as cyclical as Rain

In [None]:
results_df['month'] = results_df.apply(lambda x: x.name.month,axis=1)

In [None]:
# more accidents in winter months pre-1840
start_year, end_year = 1800,1870

In [None]:
# https://towardsdatascience.com/time-series-analysis-with-theory-plots-and-code-part-1-dd3ea417d8c4
period = results_df.loc[datetime.strptime(f'{start_year}-01',"%Y-%m").date(): datetime.strptime(f'{end_year}-01',"%Y-%m").date()]
# recent seaborn versions require x and y as keyword arguments
sns.boxplot(x=period['month'], y=period[queries[0][0]])

In [None]:
y = results_df[queries[0][0]]
y_sel = y.loc[datetime.strptime(f'{start_year}-01',"%Y-%m").date(): datetime.strptime(f'{end_year}-01',"%Y-%m").date()]

In [None]:
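# period=12 assumes monthly observations; it is needed when the index carries no frequency
decomposition = sm.tsa.seasonal_decompose(y_sel, period=12)
decomp = decomposition.plot()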

In [None]:
decomposition.seasonal.plot(figsize=(20,4))

# Example Analysis: Unemployment

In [None]:
ur = pd.read_csv('../ur.csv')
ur.set_index('year',inplace=True)
ur['std'] = (ur['unemployment_rate'] - ur['unemployment_rate'].mean()) / ur['unemployment_rate'].std()

In [None]:
results_df = pd.DataFrame.from_dict(results)
results_df['std'] = (results_df['tramp'] - results_df['tramp'].mean())/ results_df['tramp'].std()

In [None]:
results_df['std'].loc[1800:1860].plot(legend=False)
ur['std'].loc[1800:1860].plot()

In [None]:
results_df['std'].loc[1800:1860].corr(ur['std'].loc[1800:1860])

# Comparative Timelines

In [None]:
time_step = 'year_month' # year | month
dimension = 'politics' # politics | place

In [None]:
metadata = df.groupby([time_step,dimension])['idx'].apply(list).reset_index().sort_values(time_step)
metadata[dimension].unique()

In [None]:
values =  ['conservative','liberal'] # ['london','merseyside'] | ['liberal','conservative']
query = ['operative','operatives']
query = ['tory','tories','conservative','conservatives']
query = ['bathing']
#query = ['liberal','liberals']
relative = True

In [None]:
results = {}

for v in values:
    if isinstance(query,str):
        feature_idx = vocab.index(query)
    elif isinstance(query,list):
        feature_idx = [vocab.index(q) for q in query if q in vocab_set]

    groups = list(metadata[metadata[dimension]==v].idx)
    time_units = list(metadata[metadata[dimension]==v][time_step])
    if relative:
        results[v] = {ts: X_red[group][:,feature_idx].sum() / X_red[group].sum()
                                  for ts,group in zip(time_units,groups)}
    else:
        results[v] = {ts: X_red[group][:,feature_idx].sum() 
                                  for ts,group in zip(time_units,groups)}

res = pd.DataFrame(results).fillna(0).sort_index()

In [None]:
if time_step == 'year':
    res.loc[1846:1860].plot()
elif time_step == 'year_month':
    res.loc[datetime.strptime('1830-01',"%Y-%m").date():datetime.strptime('1861-01',"%Y-%m").date()].plot(figsize=(10,4))

# Entropy: Heatmaps

In [None]:
time_step = 'year' # decade | year | month 
metadata = df.groupby([time_step])['idx'].apply(list).reset_index().sort_values(time_step)

In [None]:
results = {}
for i,row in metadata.iterrows():
    ts_counts = X_red[row.idx].sum(axis=0)
    ts_total = ts_counts.sum()
    # key by the chosen time step so the cell also works for decade or month
    results[row[time_step]] = (ts_counts / ts_total)

In [None]:
matrix = np.squeeze([results[y] for y in sorted(results.keys())])

In [None]:
pairdist = pairwise_distances(matrix,metric=jensenshannon)

In [None]:
sns.heatmap(pairdist,xticklabels=sorted(results.keys()), yticklabels=sorted(results.keys()))

# Entropy: Word contributions by decade

In [None]:
time_step = 'decade' # decade | year | month 
metadata = df.groupby([time_step])['idx'].apply(list).reset_index().sort_values(time_step)

In [None]:
partial_kl = lambda p,q : p * np.log(2 * p / (p + q))                                      
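
`partial_kl(p, q)` is the per-word term of the Kullback-Leibler divergence between `p` and the mixture `(p + q) / 2`. Summing these terms in both directions and halving the result gives the Jensen-Shannon divergence, which is the square of the distance returned by `scipy.spatial.distance.jensenshannon`. The small check below uses two made-up, strictly positive distributions to illustrate this relation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

partial_kl = lambda p, q: p * np.log(2 * p / (p + q))

# Two made-up word distributions over a three-word vocabulary
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# Jensen-Shannon divergence assembled from the per-word contributions
jsd_from_parts = 0.5 * (partial_kl(p, q).sum() + partial_kl(q, p).sum())
# scipy returns the Jensen-Shannon distance, i.e. the square root of the divergence
jsd_scipy = jensenshannon(p, q) ** 2
print(np.isclose(jsd_from_parts, jsd_scipy))
```

Words with a large `partial_kl` value are therefore the ones that contribute most to the divergence between two periods.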

In [None]:
results = {}
for i,row in metadata.iterrows():
    ts_counts = X_red[row.idx].sum(axis=0)
    ts_total = ts_counts.sum()
    results[row[time_step]] = (ts_counts / ts_total)
    
matrix = np.squeeze([results[y] for y in sorted(results.keys())])

In [None]:
matrix.shape 

In [None]:
decades = sorted(results.keys())
indices = range(len(decades))
dec_bigrams = list(zip(decades[:-1],decades[1:]))
idx_bigrams = list(zip(indices[:-1],indices[1:]));idx_bigrams

In [None]:
results = {}
for d1,d2 in idx_bigrams: # compute features that have a strong signal for d2
    results[(d1,d2)] = {}
    for i,w in enumerate(vocab):
        score = partial_kl(matrix[d2,i],matrix[d1,i])  # computed once per word
        if not np.isnan(score) and len(w)>2:
            results[(d1,d2)][w] = score

In [None]:
sorted_results = sorted(results[(1,2)].items(),key=lambda x: x[1])

In [None]:
sorted(results[(4,5)].items(),key = lambda x : x[1], reverse=True)[:40]

In [None]:
sorted(results[(3,4)].items(),key = lambda x : x[1], reverse=True)[:40]

# Fin.