# DATA601
Joshua Black
## Starter Kit Experiments in Processing and Initial Filtering
This notebook presents initial attempts at processing the National Library
newspaper data. It uses topic models to find genres of writing within the
'starter kit' corpus that are relevant to investigating philosophical writing,
and to find the topics covered within those genres.



## The Starter Pack
The Starter Pack is around 2GB uncompressed and contains articles from
- Charleston argus.
- Hot lakes chronicle.
- Lyell times and Central Buller gazette.
- Mt. Benger mail.
- The New Zealand gazette and Wellington spectator.
- The Oxford observer : and Canterbury democrat.
- Victoria times.



## Initial Processing
The Starter Pack is provided uncompressed. The full dataset is not, so
processing it will require a slightly different method.

We begin by importing my helper functions and reading through the Starter Pack
directories to find top-level folders for each issue in the Starter Pack.



In [None]:

import sys
import os
import glob
import re

# Remove before exporting notebook
sys.path.append('/home/joshua/hdd/Documents/MADS/DATA601/')

import pandas as pd

from NL_helpers import *
from NL_topicmodels import *

PATH = "/home/joshua/hdd/Documents/MADS/DATA601/NPOD_Starter/"

#

In [None]:
path_walk = os.walk(PATH)

# Collect issue folders using regex. All are of form NEWSPAPERCODE_DATE,
# where date is in format YYYYMMDD
issue_directories = {}
for location in path_walk:
    match = re.search(r"[A-Z]*_\d{8}$", location[0])  # raw string avoids an invalid-escape warning for \d
    if match:
        issue_directories[match.group(0)] = location[0] + '/'

#
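To see what the issue-folder pattern accepts, the sketch below applies the same regex to a few made-up paths. The newspaper code `XX`, the date, and the subfolder name `MM_01` are hypothetical and are not taken from the Starter Pack.

```python
import re

# Same pattern as in the cell above: newspaper code, underscore, YYYYMMDD at the end of the path
ISSUE_PATTERN = re.compile(r"[A-Z]*_\d{8}$")

# Hypothetical paths, for illustration only
example_paths = [
    "NPOD_Starter/XX",                    # newspaper folder: no date, so no match
    "NPOD_Starter/XX/XX_18700101",        # issue folder: matches
    "NPOD_Starter/XX/XX_18700101/MM_01",  # subfolder: the date is not at the end, so no match
]

example_issues = {}
for path in example_paths:
    match = ISSUE_PATTERN.search(path)
    if match:
        example_issues[match.group(0)] = path + "/"

print(example_issues)
```

Only the folder whose name ends in the code-and-date form is kept, because the `$` anchor rejects anything deeper in the tree.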

Having collected the directories for each issue, we can collect the
information we want from each. In this case, we parse the XML to produce
a Python dictionary with an article id as key, and the newspaper, date,
title, text, and tokenised text as values.

The raw text is given as a list of strings, where each string corresponds to
a 'text block' in the original newspaper scans. The tokenised text
is produced using the Python NLTK regex tokeniser, with words from the default
NLTK stopword list removed.
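As a rough illustration of regex tokenisation followed by stopword removal, the sketch below uses only the standard library. The `\w+` pattern, the lowercasing step, and the tiny `TOY_STOPWORDS` set are assumptions made for this example; the `tokenise_and_stop` helper in `NL_helpers` relies on NLTK and may behave differently.

```python
import re

# Hypothetical, deliberately tiny stopword set; the NLTK default list is much longer
TOY_STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}


def toy_tokenise_and_stop(text_blocks):
    """Tokenise a list of text blocks with a regex and drop stopwords."""
    tokens = []
    for block in text_blocks:
        # \w+ keeps runs of letters, digits and underscores and discards punctuation
        for token in re.findall(r"\w+", block.lower()):
            if token not in TOY_STOPWORDS:
                tokens.append(token)
    return tokens


toy_tokens = toy_tokenise_and_stop(["The price of wool.", "A meeting in the hall."])
print(toy_tokens)
```

The input mirrors the list-of-text-blocks shape described above, and the output is a single flat list of tokens per article.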



In [None]:
corpus_dict = {}
for issue, directory in issue_directories.items():
    newspaper = issue[:-9]
    date = issue[-8:]
    articles = issue2articles(directory)
    for article_code, title_and_text in articles.items():
        article_code = article_code[7:] # remove 'MODSMD_' from article code
        item_id = '_'.join([issue, article_code])
        title, text = title_and_text
        tokenised_and_stopped = tokenise_and_stop(text)
        corpus_dict[item_id] = (
            newspaper,
            date,
            title,
            text,
            tokenised_and_stopped
        )

#
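The slicing in the loop above can be checked on a single made-up example. The issue code `XX_18700101` and the article suffix `ARTICLE1` are hypothetical; only the `MODSMD_` prefix comes from the notebook.

```python
# Hypothetical issue folder name of the form NEWSPAPERCODE_YYYYMMDD
issue = "XX_18700101"
# Hypothetical raw article code carrying the 'MODSMD_' prefix
raw_article_code = "MODSMD_ARTICLE1"

newspaper = issue[:-9]               # drop the underscore and the 8-digit date
date = issue[-8:]                    # keep only the 8-digit date
article_code = raw_article_code[7:]  # drop the 7-character 'MODSMD_' prefix
item_id = "_".join([issue, article_code])

print(newspaper, date, item_id)
```

Slicing from the end of the string means the newspaper code can be of any length, provided the date is always eight digits.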

We now convert this dictionary to a pandas dataframe. We use the object datatype
in order to store Python lists within it. We save it as a pickle, which
preserves Python datatypes such as lists when the dataframe is reloaded.



In [None]:
corpus_df = pd.DataFrame.from_dict(
    corpus_dict,
    orient='index',
    dtype=object,
    columns=['Newspaper', 'Date', 'Title', 'Text', 'Tokenised']
    )

pickle_dir = '/home/joshua/hdd/Documents/MADS/DATA601/pickles/'
corpus_df.to_pickle(pickle_dir + 'Starter_Items.tar.gz')
corpus_df

#
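A small round trip suggests why pickling suits this dataframe: list-valued cells come back as Python lists. The sketch below is a minimal check under assumptions; the two rows, their ids, and the file name `toy_items.pkl` are hypothetical and written to a temporary directory rather than the notebook's pickle folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows in the same shape as corpus_dict:
# id -> (newspaper, date, title, text blocks, tokens)
toy_dict = {
    "XX_18700101_ARTICLE1": ("XX", "18700101", "First title",
                             ["Block one.", "Block two."], ["block", "one", "block", "two"]),
    "XX_18700101_ARTICLE2": ("XX", "18700101", "Second title",
                             ["Only block."], ["block"]),
}

toy_df = pd.DataFrame.from_dict(
    toy_dict,
    orient="index",
    dtype=object,
    columns=["Newspaper", "Date", "Title", "Text", "Tokenised"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_items.pkl")
    toy_df.to_pickle(toy_path)
    restored_df = pd.read_pickle(toy_path)

# The list-valued cells survive the round trip as lists, not strings
restored_tokens = restored_df.loc["XX_18700101_ARTICLE1", "Tokenised"]
print(type(restored_tokens), restored_tokens)
```

A text format such as CSV would instead store each list as its string representation, which would need parsing on reload.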

## Initial Topic Model Using Gensim
