In [6]:
import json

#Read in the data from the json file
data = None
with open('issues.json') as json_file:
    data = json.load(json_file)
len(data['issues'])

200

In [7]:
issues = data['issues']
print(len(issues[0]['text']))

16


Alright, the data is all here, so let's roll out the barrel, so to speak, and try some sentiment analysis. I may have mentioned in the harvester notebook that the poor quality of some of the OCR output might keep me from retrieving sentiment. That would normally be the case with an approach built on word tokenization. Fortunately, a bit of research led me to the Flair library and character-level LSTM classification of sentiment, which should tolerate misspellings. The other great bit is that there are pretrained models\* that can be loaded and used. Note that the load log below names a DistilBERT checkpoint, so the default `en-sentiment` model may not be the character-level LSTM described in the article.

https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c
https://github.com/flairNLP/flair

\* I have my doubts that everything will transfer over, given the state of the English language and vernacular at the time most of these newspapers were published, but I figure the data will be interesting, and the approach certainly is to me.


In [8]:
import flair

flair_sentiment = flair.models.TextClassifier.load('en-sentiment')

2020-05-26 22:20:55,529 loading file /Users/chrismaness/.flair/models/sentiment-en-mix-distillbert.pt


In [32]:
sentence = "This text classifier works great"
s = flair.data.Sentence(sentence)
flair_sentiment.predict(s)
total_sentiment = s.labels
print(total_sentiment[0].value)
print(total_sentiment[0].score)

True
0.996302604675293


In [10]:
# Let's take a look at the text we have to see if we can break it apart into sentences
issues[0]['text'][0]

'OESEEi THE AGE-HERALD.\n. J1 , _ . ...\n, __\'__ _ _ _ _ V ||\nVOLUME 23_ BIRMINGHAM, ALA., SUNDAY, AUGUST 1, 1897.-SIXTEEN PAGES- NUMBER 167\nGREAT BRITAIN\nIS SATISFIED\nAt the Prospective Settlement ol the\nSeal Question.\nSPAIN MUST RESORT TO WAR\nIn the Event the United States Interleres\nIn Cuban Altairs.\nHER FINANCES AT A LOW EBB,\nAr.d the Struggle Can Only Bp Kept lip While\nHer Funds Lastâ\x80\x94Death Before Dishonor\nIs the Castilian Sloganâ\x80\x94American\nMonopolists Denounced.\nLondon, July si.â\x80\x94Much satisfaction Is\nexpressed in official and mercantile cir\ncles at the prospective settlement of tiie\nseal question by aid of the Washington\nconference, especially a* such an strange\nment will remove ihe cause of irritation\nbetween the United States and Great\nBritain. On the proposal of Mr. Foster\';;\nJourney diplomats ridiculed the idea that\nthere was anything necessary to ho done.\nAmbassador Hay and Mr. Foster have\ncompletely changed this idea and Great\

Oh shoot, that's some dirty text. I'm running out of time to work on this, so in the interest of keeping everything fair we'll do a rough split of the text into sentences using a regex and take a simple average of the sentence sentiments in each document. Let's test it out.

https://regex101.com/r/nG1gU7/27

In [16]:
import re

# Raw string so that \w, \. and \s reach the regex engine unchanged
print(re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', issues[0]['text'][0])[5])

SPAIN MUST RESORT TO WAR
In the Event the United States Interleres
In Cuban Altairs.
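
To see what the two negative lookbehinds buy us, here is a small sketch that runs the same pattern on a few short strings. The sample strings and the name `SENTENCE_BOUNDARY` are made up for illustration; only the pattern itself comes from the cell above.

```python
import re

# Same pattern as above, written as a raw string and compiled once.
# It splits on whitespace that follows '.' or '?', unless the text just
# before looks like "e.g." (\w\.\w.) or like "Mr." ([A-Z][a-z]\.).
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

# Hypothetical sample strings, not taken from the newspaper data
# (apart from the headline fragment in the last one).
samples = [
    "Mr. Foster arrived. Was he late? No.",
    "See e.g. the paper. Done.",
    "Seal Question.\nSPAIN MUST RESORT TO WAR",
]

for sample in samples:
    print(SENTENCE_BOUNDARY.split(sample))
```

Because `\s` also matches newlines, a headline that ends in a period is cut off from the line that follows it, which is why the OCR text above falls apart into headline-sized pieces.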


In [35]:
# Oh hey, the Spanish-American War, neat!

from statistics import mean

def split_text_into_sentences(text):
    # Split on whitespace that follows '.' or '?', except after patterns
    # such as "e.g." or "Mr." (raw string keeps the regex escapes intact)
    return re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

def get_avg_sentiment_for_issue(issue):
    # Signed confidence for every sentence in each text entry of the issue
    sentiment_scores = []

    for text in issue['text']:
        sentences = split_text_into_sentences(text)

        for sentence in sentences:
            s = flair.data.Sentence(sentence)
            flair_sentiment.predict(s)
            total_sentiment = s.labels
            score = total_sentiment[0].score

            # Negative predictions count as negative confidence
            if total_sentiment[0].value == "NEGATIVE":
                score *= -1

            sentiment_scores.append(score)

    return mean(sentiment_scores)

get_avg_sentiment_for_issue(issues[0])

0.05966776119087506

Alright, that's the average for the first issue, and it took about a minute. At that rate the full set of 200 issues will take a few hours, so time to light the ole laptop up and let it run. Interestingly, the score is close to 0, which suggests a rough balance of positive and negative sentences with a slight positive skew.
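
A mean near 0 can come from confident positive and negative sentences cancelling out, or from many weak predictions, and the single number can't tell those apart. One way to check would be to count the labels alongside the mean. This is only a sketch: `summarize_signed_scores` is a hypothetical helper, and the scores below are made up rather than taken from the newspaper data. It assumes signed scores built the same way as in `get_avg_sentiment_for_issue`.

```python
from statistics import mean

def summarize_signed_scores(signed_scores):
    """Count positive and negative sentences alongside the signed mean."""
    positives = [s for s in signed_scores if s > 0]
    negatives = [s for s in signed_scores if s < 0]
    return {
        "n_positive": len(positives),
        "n_negative": len(negatives),
        "mean": mean(signed_scores),
    }

# Made-up scores for illustration only.
example = [0.99, 0.97, -0.98, -0.95, 0.90]
print(summarize_signed_scores(example))
```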

Ways this could be improved:
* Weight each sentence's score, for example by its word count, instead of taking a plain average
* Run the scoring multi-threaded to see if we get some speed
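
The first idea could look something like the sketch below, which weights each sentence by its word count so that a long paragraph counts for more than a five-word headline. This is one possible reading of "weighted average", not something that was run on the data. `weighted_avg_sentiment` is a hypothetical helper, and the example scores are invented.

```python
def weighted_avg_sentiment(scored_sentences):
    """Word-count-weighted mean of signed sentiment scores.

    scored_sentences: iterable of (sentence_text, signed_score) pairs.
    """
    total_weight = 0
    weighted_sum = 0.0
    for text, score in scored_sentences:
        weight = len(text.split())  # number of whitespace-separated tokens
        total_weight += weight
        weighted_sum += weight * score
    if total_weight == 0:
        return 0.0  # assumption: treat input with no words as neutral
    return weighted_sum / total_weight

# Invented scores for illustration only.
example = [
    ("SPAIN MUST RESORT TO WAR", -0.9),
    ("Much satisfaction is expressed in official and mercantile circles.", 0.8),
]
print(weighted_avg_sentiment(example))
```

With a plain average these two sentences would nearly cancel out; with word-count weights the longer sentence dominates.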

In [38]:
def add_avg_sentiment(issues):
    for issue in issues:
        print(issue['title'])
        avg_sentiment = get_avg_sentiment_for_issue(issue)
        issue['avg_sentiment'] = avg_sentiment
        
add_avg_sentiment(issues)

The Age-herald. [volume] 1897-08-01
The Age-herald. [volume] 1897-08-03
The Age-herald. [volume] 1897-08-04
The Age-herald. [volume] 1897-08-05
The Age-herald. [volume] 1897-08-06
The Age-herald. [volume] 1897-08-07
The Age-herald. [volume] 1897-08-08
The Age-herald. [volume] 1897-08-10
The Age-herald. [volume] 1897-08-11
The Age-herald. [volume] 1897-08-12
The Age-herald. [volume] 1897-08-13
The Age-herald. [volume] 1897-08-14
The Age-herald. [volume] 1897-08-15
The Age-herald. [volume] 1897-08-17
The Age-herald. [volume] 1897-08-18
The Age-herald. [volume] 1897-08-19
The Age-herald. [volume] 1897-08-20
The Age-herald. [volume] 1897-08-21
The Age-herald. [volume] 1897-08-22
The Age-herald. [volume] 1897-08-24
The Age-herald. [volume] 1897-08-25
The Age-herald. [volume] 1897-08-26
The Age-herald. [volume] 1897-08-27
The Age-herald. [volume] 1897-08-28
The Age-herald. [volume] 1897-08-29
The Age-herald. [volume] 1897-08-31
The Age-herald. [volume] 1897-09-01
The Age-herald. [volume] 189

In [39]:
issues[0]

{'source': 'https://chroniclingamerica.loc.gov/lccn/sn86072192/1897-08-01/ed-1.json',
 'title': 'The Age-herald. [volume] 1897-08-01',
 'isPartOf': 'The Age-herald. [volume]',
 'datePublished': '1897-08-01',
 'volume': '23',
 'number': '167',
 'text': ['OESEEi THE AGE-HERALD.\n. J1 , _ . ...\n, __\'__ _ _ _ _ V ||\nVOLUME 23_ BIRMINGHAM, ALA., SUNDAY, AUGUST 1, 1897.-SIXTEEN PAGES- NUMBER 167\nGREAT BRITAIN\nIS SATISFIED\nAt the Prospective Settlement ol the\nSeal Question.\nSPAIN MUST RESORT TO WAR\nIn the Event the United States Interleres\nIn Cuban Altairs.\nHER FINANCES AT A LOW EBB,\nAr.d the Struggle Can Only Bp Kept lip While\nHer Funds Lastâ\x80\x94Death Before Dishonor\nIs the Castilian Sloganâ\x80\x94American\nMonopolists Denounced.\nLondon, July si.â\x80\x94Much satisfaction Is\nexpressed in official and mercantile cir\ncles at the prospective settlement of tiie\nseal question by aid of the Washington\nconference, especially a* such an strange\nment will remove ihe cause of 

And with that, I'm afraid that's all the time I have. I would've really enjoyed plotting the sentiment over time to see if it changed over the course of the 200 documents, but having the data is half the battle. I will close by writing this out to the filesystem so I don't have to calculate sentiment again if I need it.

In [40]:
import json

data = {"issues": issues}

with open('issues_with_sentiment.json', 'w') as outfile:
    json.dump(data, outfile)
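
The plot over time didn't happen here, but as a starting point for it, a sketch like the following could turn the saved issues into a per-month series. It assumes each issue carries the `datePublished` and `avg_sentiment` keys seen above. `monthly_avg_sentiment` is a hypothetical helper, and the example scores are invented.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def monthly_avg_sentiment(issues):
    """Group issues by (year, month) and average their 'avg_sentiment'."""
    by_month = defaultdict(list)
    for issue in issues:
        published = date.fromisoformat(issue['datePublished'])
        by_month[(published.year, published.month)].append(issue['avg_sentiment'])
    # Sort the keys so the series comes out in chronological order
    return {month: mean(scores) for month, scores in sorted(by_month.items())}

# Invented scores for illustration; dates use the 'YYYY-MM-DD' format seen above.
example_issues = [
    {'datePublished': '1897-08-01', 'avg_sentiment': 0.06},
    {'datePublished': '1897-08-03', 'avg_sentiment': 0.02},
    {'datePublished': '1897-09-01', 'avg_sentiment': -0.04},
]
for month, score in monthly_avg_sentiment(example_issues).items():
    print(month, round(score, 3))
```

The resulting dictionary could be handed to any plotting library, with the months on the x-axis and the averaged scores on the y-axis.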