# DATA601
Joshua Black
## Starter Kit Experiments in Processing and Initial Filtering.
This notebook presents initial attempts at processing the National Library
newspaper data. It also presents the results of using topic models to find
genres of writing relevant to investigating philosophical writing within the
'starter kit' corpus, and to find the topics covered within those genres.



## The Starter Pack
The Starter Pack is around 2GB uncompressed and contains articles from:
- Charleston argus.
- Hot lakes chronicle.
- Lyell times and Central Buller gazette.
- Mt. Benger mail.
- The New Zealand gazette and Wellington spectator.
- The Oxford observer : and Canterbury democrat.
- Victoria times.



## Initial Processing
The Starter Pack is provided uncompressed. The full dataset is not, so it
will require a slightly different method.

We begin by importing my helper functions and reading through the Starter Pack
directories to find top-level folders for each issue in the Starter Pack.



In [1]:
import sys
import os
import glob
import re

# Remove before exporting notebook
sys.path.append('/home/joshua/hdd/Documents/MADS/DATA601/')

import pandas as pd

import NL_helpers
import NL_topicmodels

PATH = "/home/joshua/hdd/Documents/MADS/DATA601/NPOD_Starter/"

In [2]:
path_walk = os.walk(PATH)

# Collect issue folders using regex. All are of form NEWSPAPERCODE_DATE,
# where date is in format YYYYMMDD
issue_directories = {}
for location in path_walk:
    match = re.search(r"[A-Z]*_\d{8}$", location[0])  # raw string so \d is not treated as a string escape
    if match:
        issue_directories[match.group(0)] = location[0] + '/'
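
The matching step can be checked in isolation. The following is a minimal sketch, assuming the same pattern as the cell above; the directory paths are hypothetical and stand in for the real Starter Pack layout.

```python
import re

# Same pattern as in the notebook cell, compiled once as a raw string.
ISSUE_PATTERN = re.compile(r"[A-Z]*_\d{8}$")

# Hypothetical directory paths, for illustration only.
sample_paths = [
    "/data/NPOD_Starter/CHARG/1867/CHARG_18670302",
    "/data/NPOD_Starter/CHARG/1867",
    "/data/NPOD_Starter/VT/1841/VT_18410915/MM_01",
]

issue_directories = {}
for path in sample_paths:
    match = ISSUE_PATTERN.search(path)
    if match:
        # Only paths that END in NEWSPAPERCODE_YYYYMMDD are kept.
        issue_directories[match.group(0)] = path + "/"

for issue, directory in issue_directories.items():
    # The last 8 characters are the date; everything before the underscore is the code.
    newspaper, date = issue[:-9], issue[-8:]
    print(newspaper, date, directory)
```

Only the first path is kept, because the `$` anchor requires the issue code to be the final path component.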

Having collected the directories for each issue, we can collect the
information we want from each. In this case, we parse the XML to produce
a Python dictionary with an article id as key, and the newspaper, date,
title, text, and tokenised text as values.

The raw text is given as a list of strings, where each string corresponds to
a 'text block' in the original newspaper scans. The tokenised text
is produced using the Python NLTK regex tokeniser and the default NLTK list
of stopwords.



In [3]:
corpus_dict = {}
for issue, directory in issue_directories.items():
    newspaper = issue[:-9]
    date = issue[-8:]
    articles = NL_helpers.issue2articles(directory)
    for article_code, title_and_text in articles.items():
        article_code = article_code[7:] # remove 'MODSMD_' from article code
        item_id = '_'.join([issue, article_code])
        title, text = title_and_text
        tokenised_and_stopped = NL_helpers.tokenise_and_stop(text)
        corpus_dict[item_id] = (
            newspaper,
            date,
            title,
            text,
            tokenised_and_stopped
        )
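
The code of `NL_helpers.tokenise_and_stop` is not shown in this notebook. As a rough illustration of the idea only, the sketch below tokenises with a regular expression from the standard library and removes a tiny, hypothetical stopword set. It is a stand-in for the NLTK tokeniser and stopword list, not the helper itself, and the function name `tokenise_and_stop_sketch` is an assumption.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's default list instead.
STOPWORDS = {"a", "and", "of", "the", "to", "with", "was", "in"}

def tokenise_and_stop_sketch(text_blocks):
    """Lowercase each text block, split on runs of word characters, drop stopwords."""
    tokens = []
    for block in text_blocks:
        tokens.extend(re.findall(r"\w+", block.lower()))
    return [token for token in tokens if token not in STOPWORDS]

# Text arrives as a list of strings, one per 'text block'.
blocks = ["The ordinary meeting of the Board", "was held in Oxford."]
tokens = tokenise_and_stop_sketch(blocks)
print(tokens)
```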

We now convert this dictionary to a pandas dataframe. We use the object datatype
in order to store Python lists within it. We save it as a pickle so that the
stored data keeps its Python datatypes.



In [4]:
corpus_df = pd.DataFrame.from_dict(
    corpus_dict,
    orient='index',
    dtype = object,
    columns=['Newspaper', 'Date', 'Title', 'Text', 'Tokenised']
    )

pickle_dir = '/home/joshua/hdd/Documents/MADS/DATA601/pickles/'
corpus_df.to_pickle(pickle_dir + 'Starter_Items.tar.gz')
corpus_df

| | Newspaper | Date | Title | Text | Tokenised |
|---|---|---|---|---|---|
| CHARG_18670302_ARTICLE1 | CHARG | 18670302 | UNTITLED | [ago a till last Thurs- with 40oz. of gold and... | [ago, last, thurs, 40oz, gold, went, back, sma... |
| CHARG_18670309_ARTICLE1 | CHARG | 18670309 | UNKNOWN | [1.33 2.25 3.15 2.—Halcyon, s.s., Wing, master... | [halcyon, wing, master, jane, schooner, julia,... |
| CHARG_18670309_ARTICLE2 | CHARG | 18670309 | CHARLESTON ARGUS. | [If the Pakihi district was but well supplied ... | [pakihi, district, well, supplied, water, nut,... |
| CHARG_18670309_ARTICLE3 | CHARG | 18670309 | UNTITLED | [Some little excitement has been evinced in re... | [little, excitement, evinced, reference, tramw... |
| CHARG_18670309_ARTICLE4 | CHARG | 18670309 | SATURDAY, MARCH 9, 1867. UNKNOWN | [3/u^ (Before C. Broad, Larceny. Ann Connelly,... | [broad, larceny, ann, connelly, woman, charged... |
| ... | ... | ... | ... | ... | ... |
| VT_18410915_ARTICLE1 | VT | 18410915 | Wellington Tavern. | [Edward Davis begs to inform his friends and t... | [edward, davis, begs, inform, friends, public,... |
| VT_18410915_ARTICLE2 | VT | 18410915 | UNTITLED | [Messrs Pratt and Bevan beg respectfully to in... | [messrs, pratt, bevan, beg, respectfully, info... |
| VT_18410915_ARTICLE3 | VT | 18410915 | UNTITLED | [We insert the following communication by, par... | [insert, following, communication, partic, ula... |
| VT_18410915_ARTICLE4 | VT | 18410915 | UNTITLED | [The following Particulars were composed a sel... | [following, particulars, composed, selected, m... |
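
The reason for saving the dataframe as a pickle rather than as CSV can be seen in a small standard-library sketch. The row below is hypothetical. pandas' `to_pickle` builds on Python's `pickle` module, so the same behaviour should be expected for list-valued columns.

```python
import csv
import io
import pickle

# One hypothetical row: an item id and its token list.
row = {"id": "VT_18410915_ARTICLE1", "Tokenised": ["edward", "davis", "begs"]}

# Round trip through pickle: the list comes back as a list.
restored = pickle.loads(pickle.dumps(row))

# Round trip through CSV: the list is written as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "Tokenised"])
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(restored["Tokenised"]).__name__)
print(type(from_csv["Tokenised"]).__name__)
```

After the CSV round trip the token list is a single string that would have to be parsed again before use.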


## Initial Topic Model Using Gensim
Earlier experiments showed that leaving empty documents in the corpus makes
it harder to produce interesting models. For example, I found an
interesting-looking topic filled with words in te reo, but was disappointed
to find that it represented a very small number of actual documents and a
huge number of empty ones. I will therefore use two filtering steps. First,
I filter out articles which have fewer than 20 tokens after tokenising.



In [5]:
cutoff = 20
filtered_corpus_df = corpus_df[corpus_df['Tokenised'].apply(lambda x: len(x) >= cutoff)]

We then create a dictionary for applying topic models with Gensim.

In [8]:
from gensim import corpora

In [6]:
minimum_in_docs = 5
dictionary = corpora.Dictionary(filtered_corpus_df['Tokenised'])
dictionary.filter_extremes(no_below=minimum_in_docs, no_above=0.5)
dictionary.compactify()
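
To make the effect of `no_below` and `no_above` concrete, here is a minimal pure-Python sketch of document-frequency filtering. It reflects my understanding of how `filter_extremes` applies these two thresholds and ignores its `keep_n` limit. The function name `filter_extremes_sketch` and the toy documents are assumptions for illustration.

```python
from collections import Counter

def filter_extremes_sketch(documents, no_below=5, no_above=0.5):
    """Return the tokens kept under document-frequency thresholds."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    # no_above is a fraction of the corpus; convert it to a document count.
    max_docs = int(no_above * len(documents))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["gold", "court", "the"],
    ["gold", "race", "the"],
    ["court", "race", "the"],
    ["gold", "the", "wedding"],
]
kept = filter_extremes_sketch(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "the" is dropped for appearing in every document and "wedding" for appearing in only one.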

In [7]:
dictionary.save('dictionaries/starter_pack')

In [9]:
# Run this cell to load the pre-generated dictionary rather than generating it.
dictionary = corpora.Dictionary.load('dictionaries/starter_pack')

This is enough to use the `NL_corpus` class provided in `NL_topicmodels.py`. This generates a bag of words representation of each article on initialisation.

In [11]:
starter_corpus = NL_topicmodels.NL_corpus(filtered_corpus_df, dictionary)
starter_corpus.items

| | Newspaper | Date | Title | Text | Tokenised | BOW |
|---|---|---|---|---|---|---|
| CHARG_18670302_ARTICLE1 | CHARG | 18670302 | UNTITLED | [ago a till last Thurs- with 40oz. of gold and... | [ago, last, thurs, 40oz, gold, went, back, sma... | [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... |
| CHARG_18670309_ARTICLE1 | CHARG | 18670309 | UNKNOWN | [1.33 2.25 3.15 2.—Halcyon, s.s., Wing, master... | [halcyon, wing, master, jane, schooner, julia,... | [(27, 1), (49, 1), (167, 1), (244, 1), (342, 2... |
| CHARG_18670309_ARTICLE2 | CHARG | 18670309 | CHARLESTON ARGUS. | [If the Pakihi district was but well supplied ... | [pakihi, district, well, supplied, water, nut,... | [(10, 1), (11, 1), (29, 1), (37, 1), (48, 1), ... |
| CHARG_18670309_ARTICLE3 | CHARG | 18670309 | UNTITLED | [Some little excitement has been evinced in re... | [little, excitement, evinced, reference, tramw... | [(8, 1), (13, 2), (15, 1), (47, 1), (48, 2), (... |
| CHARG_18670309_ARTICLE4 | CHARG | 18670309 | SATURDAY, MARCH 9, 1867. UNKNOWN | [3/u^ (Before C. Broad, Larceny. Ann Connelly,... | [broad, larceny, ann, connelly, woman, charged... | [(132, 1), (170, 1), (230, 1), (244, 1), (371,... |
| ... | ... | ... | ... | ... | ... | ... |
| OO_18990909_ARTICLE11 | OO | 18990909 | OXFOED ROAD BOARD. | [The ordinary meeting of the Board was held on... | [ordinary, meeting, board, held, welnesday, se... | [(6, 3), (71, 1), (120, 1), (125, 2), (139, 1)... |
| VT_18410915_ARTICLE1 | VT | 18410915 | Wellington Tavern. | [Edward Davis begs to inform his friends and t... | [edward, davis, begs, inform, friends, public,... | [(120, 1), (135, 1), (139, 1), (143, 1), (147,... |
| VT_18410915_ARTICLE2 | VT | 18410915 | UNTITLED | [Messrs Pratt and Bevan beg respectfully to in... | [messrs, pratt, bevan, beg, respectfully, info... | [(86, 1), (105, 1), (122, 1), (147, 1), (170, ... |
| VT_18410915_ARTICLE3 | VT | 18410915 | UNTITLED | [We insert the following communication by, par... | [insert, following, communication, partic, ula... | [(9, 1), (11, 1), (48, 2), (54, 1), (97, 1), (... |


We filter again to remove any article whose BOW representation contains fewer than 20 distinct dictionary terms.

In [13]:
starter_corpus.items = starter_corpus.items[starter_corpus.items['BOW'].apply(lambda x: len(x) >= cutoff)]
len(starter_corpus.items)

10183

The output above shows that we have 10183 remaining documents.
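
The second filter is needed because a bag-of-words vector only counts tokens that survived dictionary filtering, and it has one entry per distinct term rather than per token. The sketch below imitates this with a hypothetical vocabulary; `doc2bow_sketch` is an illustrative stand-in, not the code used by `NL_corpus`.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count in-vocabulary tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary left after dictionary filtering.
token2id = {"gold": 0, "court": 1, "race": 2}

tokens = ["gold", "gold", "harle", "court", "buddo", "gold"]
bow = doc2bow_sketch(tokens, token2id)
print(len(tokens), len(bow), bow)
```

Six tokens shrink to two BOW entries here, so an article can pass the first length filter and still fall below the cutoff in the second.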

We run a topic model on these using `LdaMulticore` from `gensim`. 

In [16]:
from gensim.models import LdaMulticore

In [17]:
starter_model = LdaMulticore(
    starter_corpus,
    num_topics= 50,
    workers = 15,
    chunksize = 220,
    id2word=starter_corpus.dictionary,
    iterations = 500,
    passes = 25,
    eval_every = 100
)

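This model can be visualised using `pyLDAvis`. Passing the `NL_corpus` object
directly fails, because `prepare` needs a corpus which supports `len()` and
`NL_corpus` does not. We instead pass the bag-of-words column as a list,
assuming `NL_corpus` iterates over this column in the same order.

In [18]:
import pyLDAvis.gensim

In [19]:
# Passing starter_corpus itself raises
# "TypeError: object of type 'NL_corpus' has no len()",
# so we hand pyLDAvis a plain list of BOW vectors instead.
vis = pyLDAvis.gensim.prepare(
    starter_model,
    list(starter_corpus.items['BOW']),
    dictionary=starter_corpus.dictionary
)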
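
A corpus object can be made usable by tools that call `len()` on it by defining `__len__` alongside `__iter__`. The class below is a hypothetical minimal sketch of that pattern. It is not the actual `NL_corpus` implementation, whose code is not shown here.

```python
class SizedCorpus:
    """Hypothetical corpus wrapper that supports both iteration and len()."""

    def __init__(self, bow_documents):
        self._bow_documents = list(bow_documents)

    def __iter__(self):
        # Yield one bag-of-words vector per document.
        return iter(self._bow_documents)

    def __len__(self):
        # Lets callers that need the corpus size use len(corpus).
        return len(self._bow_documents)

corpus = SizedCorpus([[(0, 2), (1, 1)], [(2, 1)]])
print(len(corpus), list(corpus))
```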

In [23]:
topic_kws = NL_topicmodels.topics_and_keywords(starter_model)

In [24]:
topic_kws

{0: '0.022*"race" + 0.016*"club" + 0.015*"match" + 0.012*"second" + 0.012*"first" + 0.012*"time" + 0.010*"mile" + 0.010*"oxford" + 0.010*"handicap" + 0.009*"sports"',
 1: '0.020*"miss" + 0.020*"bride" + 0.011*"wedding" + 0.011*"black" + 0.010*"white" + 0.010*"ceremony" + 0.010*"marriage" + 0.009*"friends" + 0.009*"happy" + 0.008*"pretty"',
 2: '0.032*"railway" + 0.020*"meredith" + 0.017*"minister" + 0.015*"line" + 0.013*"letters" + 0.010*"otago" + 0.010*"heriot" + 0.009*"roxburgh" + 0.007*"ashley" + 0.007*"league"',
 3: '0.016*"ashley" + 0.015*"vote" + 0.014*"electors" + 0.012*"prohibition" + 0.012*"election" + 0.011*"votes" + 0.010*"district" + 0.010*"license" + 0.009*"good" + 0.008*"licensing"',
 4: '0.030*"court" + 0.017*"case" + 0.010*"evidence" + 0.009*"defendant" + 0.009*"prisoner" + 0.008*"police" + 0.008*"law" + 0.008*"said" + 0.007*"plaintiff" + 0.007*"judge"',
 5: '0.007*"gas" + 0.006*"old" + 0.005*"years" + 0.005*"harle" + 0.004*"buddo" + 0.004*"billy" + 0.004*"he\'s" + 0.00

The output above suggests that topics 33 and 40 are of interest.
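
The keyword strings are easier to work with once split into word and weight pairs. This is a minimal sketch using a regular expression on strings in the format printed above; `parse_topic_string` is a hypothetical helper, not part of `NL_topicmodels`.

```python
import re

def parse_topic_string(topic):
    """Turn '0.022*"race" + 0.016*"club"' into [('race', 0.022), ('club', 0.016)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.022*"race" + 0.016*"club" + 0.015*"match"'
print(parse_topic_string(topic_0))
```

Gensim's `show_topic` method on the model returns (word, probability) pairs directly, which avoids parsing strings when the model object is still available.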