**Classifier**

We'll need to do the following:

- Load up the list of references
- Make sure that we can get the text for each one (or close to all of them)
- Load up the list of negative references (ie the things that aren't what we are looking for)
- Make sure that these are fine as well
- We need to decide what 'features' we want the classifier to understand and write some methods that take in the text and return  features accompanied by a number that represents that features 'strength'. For example, we could make features out of the word frequency. In this case, the feature would be the word itself, and the number is how often it occurred.
- Finally we need to load the text for these and pass them through the Naive Bayes classifier, tagging them 

In [2]:
from newspaperaccess import *

# Get the connection set up to get access to the newspaper text
n = NewspaperArchive()

# Load up the references to the pages that we know reference Abolitionists
import csv

# Month list to convert a name to a number:
MONTHS = {"january": "01", "february": "02", "march": "03", "april": "04", "may": "05", "june": "06", 
          "july": "07", "august": "08", "september": "09", "october": "10", "november": "11", "december": "12"}

# a method to open the csv file, read it in and store the references in the list
def get_references(filename):
    
    # Start a list to hold the references
    references = []
    
    with open(filename, "r") as pcsv:
    
        # this "DictReader" opens the csv file up and then uses the column headers
        # to work out what to call each bit of data
        reflist = csv.DictReader(pcsv)
    
        # now go through each row, adding them to the list
        for row in reflist:
            # a row will be something like:
            #   {"Newspaper": "Glasgow Herald", "Day": "24", "Month": "January", "Year": "1851", Page: "", etc}
        
            # change the month to be a number, not a name:
            row["month"] = MONTHS[row['month'].lower()]
            
            references.append(row)
    return references

positivereferences = get_references("positives.csv")

# There, we should have a big list draw from that spreadsheet
# Let's see what the 100th item is: (computers count from 0!)

positivereferences[100]

{'article title': 'The Freedmen of America',
 'day': '3',
 'month': '07',
 'newspaper': 'The Caledonian Mercury',
 'page': '',
 'year': '1866'}

In [3]:
# Can we get the text for this reference?
doc = n.get(**positivereferences[100])
doc.keys()

dict_keys(['0005', '0004', '0002', '0003', '0001'])

In [5]:
# What is on page 3? Just the first 500 characters anyway
print([title for title, _ in doc['0003'].values()])

['Money, Trade, and Commerce.', 'LITERATURE.', 'EAST LOTHIAN AGRICULTURAL REPORT FOR JUNE.', 'LATEST MARKETS.', 'GENERAL NEWS.', 'THE FREEDMEN OF AMERICA.', 'Imperial Parliament.', 'ALARMING OUTBREAK OF CHOLERA IN CHESHIRE.', 'CORRESPONDENCE.']


Wow! Terrible OCR! Nevermind, we shall try to continue. The quality is better in other newspapers.

Let's load in the negatives now as well:

In [6]:
nonabospeechesreferences = get_references("nonabospeeches.csv")

nonabospeechesreferences[10]

print(n.get(**nonabospeechesreferences[10]))

Couldn't find 'The Manchester Times and Gazette' in the Newspaper mapping


NoSuchNewspaper: The Manchester Times and Gazette

The error here (the final line tends to be the imporant one) says "No Such Newspaper: The Manchester Times and Gazette. Looking at the list of newspapers that the newspaperaccess file knows about, there isn't a clear match here. Let's try another:

In [7]:
nonabospeechesreferences[16]

{'article title': 'Public Meetings',
 'day': '21',
 'month': '04',
 'newspaper': 'The Operative',
 'page': '',
 'year': '1839'}

In [6]:
print(n.get(**negativereferences[16])['0008'])

VICTORIA THEATREA TUlL I The LONDON GLASGOW COTtON-SPINNERS CON, MITTEE respectfully announce to the Subscribers, their Fel- low Woskmnen, end the Public, that tkey have engaged the above commodIous Theatre for a BENEFIT, in Aid of the Fua4 for the SUPPORT of the WIVES and CHILDREN of the GLAS; G0W COTTON SPNINNERS, on WEDNESDAY, April 24. 139, when will-be presented the highly*- succ~seful Melo-drBaiB, cfflled MARY LE MORE;,or THE IRISH MANIAC. After *hltfh ibe Popular Force of A I THE ENGLISHMAN IN FRANCE. To concludewith the Gr and Scottisu Historical IDreiam of WALC. THE hERO OF SCOTLAND. Ticket ma7 eobtene o the rembr of the committee, at te Bel. Od Biley andof J Neton, S~ecretary, 1, Ste. pl,~ 1ns Bildigs;'Hoboro Loer oxe,2. 6d. Boxes, 2s Pit, s. Gller, 6d Door ope at alfpst Five. Cown- TIIEA:T~tE ROYAL, DRURY-LANE. TOMORRDW EVENING-agSeries of PROMENADE CON- CERTS A LA VALENTINE, with a Band of 100 Pedformere, end the aid of Madarne Albertazzi, Mr:. Balfe, Mr. Giabelel, Mr. Stre

{'day': '5',
 'month': '05',
 'newspaper': 'The Dundee Courier',
 'page': '',
 'year': '1846'}

from newspaperaccess import NewspaperArchive

n = NewspaperArchive()

doc = n.get(newspaper = "ANJO", year = "1846", month = "05", day = "31")

len(doc)

print(doc["0001"][:200])

In [8]:
from feature import get_common_wordlist

# create the word frequency distribution from the newspaper references we have for positive matches

pos_worddist, nn = get_common_wordlist([(x['newspaper'], x['year'], x['month'], x['day']) for x in positivereferences])

1851 24/01
1851 29/01
Couldn't find 'Dundee Courier' in the Newspaper mapping
There was an issue accessing the 'Dundee Courier' for date '['1851', '01', '29']'
Dundee Courier
1851 12/02
1851 19/02
1851 12/04
1851 29/03
Couldn't find 'The York Herald' in the Newspaper mapping
There was an issue accessing the 'The York Herald' for date '['1851', '03', '29']'
The York Herald
1855 23/02
Couldn't find 'The Essex Standard and General Advertiser for the Eastern Counties' in the Newspaper mapping
There was an issue accessing the 'The Essex Standard and General Advertiser for the Eastern Counties' for date '['1855', '02', '23']'
The Essex Standard and General Advertiser for the Eastern Counties
1856 23/10
1857 12/09
Couldn't find 'Hampshire Telegraph and Sussex Chronicle' in the Newspaper mapping
There was an issue accessing the 'Hampshire Telegraph and Sussex Chronicle' for date '['1857', '09', '12']'
Hampshire Telegraph and Sussex Chronicle
1838 21/04
Couldn't find 'The Sheffield Independent'

In [50]:
print(nn)
print("Wordlist length: {0}\n Top 10 words:".format(len(pos_worddist)))
print(pos_worddist.most_common()[:100])

96
Wordlist length: 1706359
 Top 10 words:
[('next', 96), ('fire', 96), ('twelve', 96), ('greatly', 96), ('five', 96), ('days', 96), ('rather', 96), ('letter', 96), ('point', 96), ('d.', 96), ('night', 96), ('11', 96), ('hope', 96), ('cannot', 96), ('country,', 96), ('felt', 96), ('going', 96), ('long', 96), ('much', 96), ('left', 96), ('hold', 96), ('b', 96), ('no.', 96), ('party', 96), ('saturday,', 96), ('sent', 96), ('london,', 96), ('regard', 96), ('monday', 96), ('de-', 96), ('h', 96), ('attended', 96), ('business', 96), ('14', 96), ('us', 96), ('took', 96), ('aid', 96), ('place,', 96), ('al', 96), ('queen', 96), ('5', 96), ('supply', 96), ('obtained', 96), ('20', 96), ('taken', 96), ('these', 96), ('f', 96), ('4', 96), ('life', 96), ('house,', 96), ('within', 96), ('duty', 96), ('members', 96), ('country', 96), ('addressed', 96), ('and,', 96), ('d', 96), ('must', 96), ('receive', 96), ('effect', 96), ('entered', 96), ('ti', 96), ('due', 96), ('told', 96), ('six', 96), ('extent',

In [12]:
# create the word frequency distribution from the newspaper references we have for negative matches

neg_worddist, negn = get_common_wordlist([(x['newspaper'], x['year'], x['month'], x['day']) for x in negativereferences])
print(negn)
print("Wordlist length: {0}\n Top 10 words:".format(len(neg_worddist)))
print(neg_worddist.most_common()[:10])

1869 11/12
Couldn't find 'Lancaster Gazette and General Advertiser for Lancashire, Westmorland, Yorkshire etc.' in the Newspaper mapping
There was an issue accessing the 'Lancaster Gazette and General Advertiser for Lancashire, Westmorland, Yorkshire etc.' for date '['1869', '12', '11']'
Lancaster Gazette and General Advertiser for Lancashire, Westmorland, Yorkshire etc.
1844 22/05
1850 12/10
Couldn't find 'The Northern Star and National Trades' Journal' in the Newspaper mapping
There was an issue accessing the 'The Northern Star and National Trades' Journal' for date '['1850', '10', '12']'
The Northern Star and National Trades' Journal
1861 14/01
Couldn't find 'Dundee Courier' in the Newspaper mapping
There was an issue accessing the 'Dundee Courier' for date '['1861', '01', '14']'
Dundee Courier
1861 12/08
Couldn't find 'Dundee Courier and Daily Argus' in the Newspaper mapping
There was an issue accessing the 'Dundee Courier and Daily Argus' for date '['1861', '08', '12']'
Dundee Cou

In [13]:
# store these wordlists so we don't have to recreate them later on
import json

with open("pos_worddist.json", "w") as pfp:
    json.dump(pos_worddist, pfp)
    
with open("neg_worddist.json", "w") as nfp:
    json.dump(neg_worddist, nfp)

    
# create a wordlist from the positive set of words that do not appear in the negative set
only_pos_worddist = pos_worddist.copy()

for item in set(pos_worddist).intersection(set(neg_worddist)):
    del only_pos_worddist[item]

# Most common positive only words?
print(only_pos_worddist.most_common()[:100])

[('slaveholder', 14), ('detersive', 13), ('pro-slavery', 13), ('slaveholding', 13), ('slaveholder,', 12), ('.......3', 11), ('respectble', 10), ('slow;', 9), ('brokes', 9), ('release,', 9)]


In [14]:
# let's store this word dist too, just because

with open("only_pos_worddist.json", "w") as ofp:
    json.dump(only_pos_worddist, ofp)

In [26]:
# So we have some basic feature sets we could use (common words overall, top 1000 words from just the positives, and so on)
# We also have the set of documents to train on.
import nltk

# Let's make a little method to return a featureset created a newpaper reference and a word dist
def get_features(worddist, **newspaper_ref):
    if n.exists(newspaper_ref['newspaper'], newspaper_ref['year'], newspaper_ref['month'], newspaper_ref['day']):
        features = {"has({0})".format(fword): False for fword,_ in worddist}
        features.update({"count({0})".format(fword): 0 for fword,_ in worddist})
        
        # create freq dist for the newspaper but only for the words we care about
        doc = n.get(**newspaper_ref)
        fdoc = nltk.FreqDist(w.lower() for w in " ".join(doc.values()).split(" ") if w in worddist)
        
        features.update({"has({0})".format(fword): True for fword, _ in fdoc})
        features.update({"count({0})".format(fword): c for fword, c in worddist})
        
        return features
    

In [27]:
# n.get(**positivereferences[100]) -> is in JISC1
featureset = get_features(only_pos_worddist, **positivereferences[100])

In [28]:
len(featureset)

2765288

In [38]:
# Too many features is often a key step in over-fitting our model to the training data
# This is a bad thing! Let's alter that features method to only use the 2000 most common words

def get_features(total_worddist, features_to_take=2000, **newspaper_ref):
    if n.exists(newspaper_ref['newspaper'], newspaper_ref['year'], newspaper_ref['month'], newspaper_ref['day']):
        worddist = total_worddist.most_common()[:features_to_take]
        features = {"has({0})".format(fword): False for fword, _ in worddist}
        features.update({"count({0})".format(fword): 0 for fword, _ in worddist})
        
        # create freq dist for the newspaper but only for the words we care about
        doc = n.get(**newspaper_ref)
        fdoc = nltk.FreqDist(w.lower() for w in " ".join(doc.values()).split(" ") if w in worddist)
        
        features.update({"has({0})".format(fword): True for fword, _ in fdoc})
        features.update({"count({0})".format(fword): c for fword, c in worddist})
        
        return features

In [39]:
# Let's try that again
featureset = get_features(only_pos_worddist, **positivereferences[100])

len(featureset)

4000

In [46]:
'has(slaveholder)' in featureset

True

In [47]:
list(featureset.items())[:10]

[('has(ersonal)', False),
 ('count(011cc)', 4),
 ('count(amputated)', 6),
 ('has(attei)', False),
 ('has(sohi)', False),
 ('has(upton.)', False),
 ('has(btthe)', False),
 ('count(peebles,)', 5),
 ('count(o......)', 7),
 ('has(yelp)', False)]

In [48]:
# let's make a teeny tiny classifer now..

refs = {'p':[], 'n':[]}
while len(refs['p']) <= 10:
    for item in positivereferences:
        if n.exists(item['newspaper'], item['year'], item['month'], item['day']):
            refs['p'].append(item)


while len(refs['n']) <= 10:
    for item in negativereferences:
        if n.exists(item['newspaper'], item['year'], item['month'], item['day']):
            refs['n'].append(item)

# get featuresets for the small set
featuressets = [(get_features(only_pos_wordlist, **ref), label) for label, ref in refs.items()]

#split the set in half to train and test
train_set, test_set = featuresets[:10], featuresets[-10:]

# Train on one half
classifier = nltk.NaiveBayesClassifier.train(train_set)

# test with 2nd half
print(nltk.classify.accuracy(classifier, test_set))

Couldn't find 'Dundee Courier' in the Newspaper mapping


NoSuchNewspaper: Dundee Courier

In [51]:
print(only_pos_worddist.most_common()[:100])

[('slaveholder', 14), ('detersive', 13), ('pro-slavery', 13), ('slaveholding', 13), ('slaveholder,', 12), ('.......3', 11), ('respectble', 10), ('slow;', 9), ('brokes', 9), ('release,', 9), ('busiuess', 9), ('cayley', 9), ('drss', 9), ('harney', 9), ('last)', 9), ('709', 9), ('enland', 9), ('vears.', 9), ('hardie', 8), ('168.', 8), ('"di', 8), ('termsa', 8), ('battalion.', 8), ('sister-in-law', 8), ('harvey.', 8), ('7.-', 8), ('cousiderable', 8), ('rennet', 8), ('comupany,', 8), ('singh', 8), ('549', 8), ('terrify', 8), ('effct', 8), ('fermoy', 7), ('antly', 7), ('debs', 7), ('glans', 7), ('durban,', 7), ('gaiter', 7), ('pove', 7), ('houe.', 7), (';-each', 7), ('o......', 7), ('-28', 7), ('dockyard.', 7), ('en"', 7), ('paddle-wheel', 7), ('redouble', 7), ("'ic.", 7), ('giin', 7), ('haughton', 7), ('7/0', 7), ('guys', 7), ('theobald', 7), ('850.', 7), ('thetown', 7), ('iaod', 7), ('occupa.', 7), ('2-s.', 7), ('heroby', 7), ('acadia,', 7), ('(day', 7), ('disguise.', 7), ('bedford-street'