# Create Sparse Word Count Matrix

This notebook creates a sparse matrix from a large set of word counts derived from historical British newspapers.
- The columns of this matrix correspond to a chosen vocabulary.
- Each row holds the word counts for one newspaper title in a specific month.
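
To make this layout concrete, here is a minimal sketch in plain Python of how such a matrix can be stored in compressed sparse row (CSR) form. The vocabulary, titles, and counts are invented for illustration, and this is not the code used by `tools.ngram_creation`. The three arrays correspond to the `(data, indices, indptr)` triple that `scipy.sparse.csr_matrix` accepts.

```python
# Toy data: one dict of word counts per (newspaper title, month) row.
# All names and numbers here are invented for illustration.
vocab = ["coal", "railway", "strike", "tariff"]
col_index = {word: j for j, word in enumerate(vocab)}

rows = {
    ("Title A", "1850-01"): {"coal": 3, "strike": 1, "zeppelin": 2},
    ("Title A", "1850-02"): {"railway": 5},
    ("Title B", "1850-01"): {"coal": 1, "tariff": 4},
}

# CSR layout: only non-zero counts are stored.
data, indices, indptr = [], [], [0]
row_labels = []
for label, counts in rows.items():
    row_labels.append(label)
    # Keep in-vocabulary words only, ordered by column index.
    entries = sorted((col_index[w], n) for w, n in counts.items() if w in col_index)
    for column, count in entries:
        indices.append(column)
        data.append(count)
    indptr.append(len(data))  # marks where the next row starts


def cell(i, j):
    """Return the count at row i, column j (0 if nothing is stored there)."""
    for k in range(indptr[i], indptr[i + 1]):
        if indices[k] == j:
            return data[k]
    return 0


print(cell(0, col_index["strike"]))  # 1
```

The word `zeppelin` is not in the toy vocabulary, so it does not get a column and is dropped from the first row.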

This notebook covers the following stages:

- 1. Download original count data from Zenodo
- 2. Process JSON files with word counts
- 3. Construct a vocabulary from the word counts (the matrix columns)
- 4. Convert all JSON word counts to a sparse matrix

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
import json
from pathlib import Path

from tools.ngram_creation import *
from tools.ngram_creation import CorpusProcessor

# 1. Get word counts from Zenodo and unzip the .tar file

In [4]:
# Write instructions here
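
Until the download instructions are written, the unpacking step could look roughly like the sketch below, which uses only the standard library. The helper name `unpack_archive` and the archive file name in the usage comment are hypothetical; the actual Zenodo record and file names are not given in this notebook.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, target_dir):
    """Extract a .tar archive into target_dir and return the JSON files found."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(target, **kwargs)
    return sorted(target.rglob("*.json"))


# Hypothetical usage; replace the archive name with the file downloaded from Zenodo:
# json_files = unpack_archive("ngrams-output.tar", "/Volumes/X9 Pro/ngrams-output")
```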

# 2. Process JSON files
## 2.1 Manage JSON files


We load all the JSON files with the `JSONHandler` object.

In [5]:
unzipped_ngram_path = '/Volumes/X9 Pro/ngrams-output'

In [6]:
handler = JSONHandler(unzipped_ngram_path)

Remove corrupted JSON files (uncomment the line below to run this check).

In [None]:
#handler.check_json()
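
As a rough idea of what such a check involves, the sketch below tries to parse every JSON file in a folder and reports the ones that fail. It is an assumption about the general approach, not the actual `check_json` implementation, and the function name `find_corrupted_json` is hypothetical. Whether to delete the reported files is left to the caller.

```python
import json
from pathlib import Path


def find_corrupted_json(folder):
    """Return the JSON files under `folder` that cannot be parsed."""
    corrupted = []
    for path in sorted(Path(folder).rglob("*.json")):
        try:
            with open(path, encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Truncated or otherwise malformed files end up here.
            corrupted.append(path)
    return corrupted
```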

`len` returns the number of JSON files.

In [None]:
len(handler)

`print` also shows the number of distinct newspapers.

In [None]:
print(handler)

## 2.2 Create Vocabulary

Set the location where we store the vocabulary, sparse matrices, and other files.

In [None]:
save_to = '/Volumes/X9 Pro/ngrams-by-nlp-all'

Initially we combine the counts by NLP (that is, per newspaper). For each JSON file we remove tokens that occur only once.

In [None]:
vocab = Vocab(handler, save_to=save_to, min_threshold=1, n_cores=8)

In [None]:
%time vocab.nlp_counts()

The output of this operation is stored in the `.wc_by_nlp` attribute ("word counts by NLP"). Each element is a dictionary of word counts.

In [None]:
len(vocab.wc_by_nlp[3])

In [None]:
len(vocab.wc_by_nlp)
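
The per-NLP aggregation can be pictured with the following sketch. It assumes that a token is dropped from a file when its count is less than or equal to `min_threshold`, which matches "remove tokens that occur only once" for a threshold of 1. The function `combine_by_nlp`, the NLP identifiers, and the counts are invented, and the result here is keyed by identifier, which may differ from how `Vocab.nlp_counts` organises its output.

```python
from collections import Counter, defaultdict


def combine_by_nlp(files, min_threshold=1):
    """Sum per-file counts per NLP, dropping rare tokens in each file first.

    `files` is an iterable of (nlp_id, counts) pairs. Tokens whose count in
    a file is <= min_threshold are ignored (an assumed reading of the threshold).
    """
    by_nlp = defaultdict(Counter)
    for nlp_id, counts in files:
        kept = {tok: n for tok, n in counts.items() if n > min_threshold}
        by_nlp[nlp_id].update(kept)  # Counter.update adds counts
    return dict(by_nlp)


files = [
    ("0001", {"coal": 4, "strike": 1}),
    ("0001", {"coal": 2, "railway": 3}),
    ("0002", {"tariff": 1, "coal": 5}),
]
counts_by_nlp = combine_by_nlp(files)
print(counts_by_nlp["0001"])  # Counter({'coal': 6, 'railway': 3})
```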

Then we combine these dictionaries, setting 5 as the minimum threshold at the level of the NLP/newspaper.

In [None]:
vocab.min_threshold = 5

In [None]:
%time vocab.total_counts()

In [None]:
len(vocab.vocab)

Now we remove words that appear fewer than 2500 times in total with `filter_by_min_threshold`.

In [None]:
vocab.filter_by_min_threshold(2500)

In [None]:
len(vocab.vocab),len(vocab.wc_total)
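
Taken together, the two thresholds can be sketched as below. This is one plausible reading, not the `Vocab` implementation: it assumes that within one NLP only words with a count above `nlp_threshold` contribute to the total, and that a word stays in the vocabulary when its total reaches `total_threshold`. The function `build_vocab` and the toy numbers are hypothetical, and the thresholds are scaled down so the example is easy to follow.

```python
from collections import Counter


def build_vocab(counts_by_nlp, nlp_threshold=5, total_threshold=2500):
    """Return (vocab, total): a word -> column index map and the total counts."""
    total = Counter()
    for counts in counts_by_nlp:
        # Assumed: per NLP, only words with a count above nlp_threshold are summed.
        total.update({w: n for w, n in counts.items() if n > nlp_threshold})
    # Assumed: a word is kept when its total is at least total_threshold.
    kept = sorted(w for w, n in total.items() if n >= total_threshold)
    vocab = {word: column for column, word in enumerate(kept)}
    return vocab, total


toy_counts = [
    {"coal": 8, "railway": 3},
    {"coal": 7, "tariff": 9},
]
toy_vocab, toy_total = build_vocab(toy_counts, nlp_threshold=5, total_threshold=10)
print(toy_vocab)  # {'coal': 0}
```

Sorting the kept words gives every word a stable column index, so matrices built in separate runs share the same column order.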

In [None]:
vocab.save()

In [None]:
!ls -la "{vocab.save_to}"

## 2.3 Convert JSON data to sparse matrices

Lastly, we convert the whole corpus to sparse matrices using this vocabulary.

In [5]:
path_to_json = '/Volumes/X9 Pro/ngrams-output'
path_to_matrices = '/Volumes/X9 Pro/ngrams-by-nlp-all'

handler = JSONHandler(path_to_json)
with open(Path(path_to_matrices) / 'vocab.json') as f:
    vocab = json.load(f)

corpus_proc = CorpusProcessor(handler,
                              vocab, 
                              path_to_matrices,
                              n_cores=8)


In [None]:
corpus_proc.process_ngrams()

In [None]:
!ls "{path_to_matrices}" | wc -l

In [None]:
!du -h --max-depth=1 "{path_to_matrices}/.."

In [6]:
# Create the metadata file
corpus_proc.merge_metadata(Path(path_to_matrices),
                           totals=None,
                           npd_links_path='data/newspapers_overview_with_links_JISC_NLPs.csv',
                           npd_data_path='data/MPD_export_1846_1920_20230504.csv')

  0%|          | 0/1204 [00:00<?, ?it/s]

Saving metadata of size (269179, 40)


## Merge sparse matrices

In the cells below we merge all the n-grams at the NLP level into one large sparse matrix with corresponding metadata.

In [None]:
path_to_json = '/Volumes/X9 Pro/ngrams-output'
path_to_matrices = '/Volumes/X9 Pro/ngrams-by-nlp-all'
save_to = '/Volumes/X9 Pro/unigram-matrix'

handler = JSONHandler(path_to_json)
with open(Path(path_to_matrices) / 'vocab.json') as f:
    vocab = json.load(f)
corpus_proc = CorpusProcessor(handler,
                              vocab=vocab,
                              save_to=save_to)

In [None]:
# WARNING: this operation requires lots of memory (ca. 128 GB) and will likely crash your kernel
save_merged = '/Volumes/X9 Pro/unigram-matrix'
corpus_proc.merge_sparse_matrices(save_merged,
                                  override=True,
                                  npd_links_path='data/newspapers_overview_with_links_JISC_NLPs.csv',
                                  npd_data_path='data/MPD_export_1846_1920_20230504.csv')

In [None]:
!ls -la "{save_merged}" --block-size=G
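
Conceptually, merging per-NLP matrices that share the same vocabulary amounts to stacking their rows. The sketch below shows this for matrices held as CSR triples `(data, indices, indptr)` in plain Python. It illustrates the idea only and is not the `merge_sparse_matrices` implementation; the function `stack_csr` and the toy blocks are hypothetical. Libraries such as SciPy provide `scipy.sparse.vstack` for the same purpose.

```python
def stack_csr(blocks):
    """Vertically stack CSR blocks given as (data, indices, indptr) triples.

    All blocks are assumed to share the same columns (the same vocabulary).
    """
    data, indices, indptr = [], [], [0]
    for block_data, block_indices, block_indptr in blocks:
        offset = len(data)  # entries already stored by earlier blocks
        data.extend(block_data)
        indices.extend(block_indices)
        # Skip the block's leading 0 and shift its row boundaries by the offset.
        indptr.extend(offset + p for p in block_indptr[1:])
    return data, indices, indptr


block_a = ([3, 1, 5], [0, 2, 1], [0, 2, 3])  # two rows
block_b = ([1, 4], [0, 3], [0, 2])           # one row
merged_data, merged_indices, merged_indptr = stack_csr([block_a, block_b])
print(merged_indptr)  # [0, 2, 3, 5]
```

Because the merged arrays hold every non-zero count of the corpus at once, memory use grows with the total number of stored entries. The metadata rows need to be concatenated in the same block order so that they stay aligned with the matrix rows.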

# Fin.