# Sentiment Analysis of The Times Music Reviews
## Part II: Sentiment Analysis
*How have artforms been reported?  Is there a status hierarchy between them?  How has this changed over time?*

* **Project:** What counts as culture?  Reporting and criticism in The Times 1785-2000
* **Project Lead:** Dave O'Brien
* **Developer:** Lucy Havens
* **Funding:** Centre for Data, Culture & Society, University of Edinburgh

Begun February 2021

***

First, import the required libraries.

In [2]:
# For data loading
import re
import string
import random  # used later for the train/test split
import numpy as np
import pandas as pd

# For text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
nltk.download('wordnet')
from nltk.corpus import wordnet
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')  # part of speech tags
from nltk.tag import pos_tag
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

# # For data visualization
# import matplotlib.pyplot as plt
# import altair as alt
# import seaborn as sn

[nltk_data] Downloading package punkt to
[nltk_data]     /afs/inf.ed.ac.uk/user/s15/s1545703/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /afs/inf.ed.ac.uk/user/s15/s1545703/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /afs/inf.ed.ac.uk/user/s15/s1545703/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /afs/inf.ed.ac.uk/user/s15/s1545703/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     /afs/inf.ed.ac.uk/user/s15/s1545703/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


### 1. Sentiment Analysis with NLTK's Naive Bayes Classifier
*Code based on: https://www.nltk.org/howto/sentiment.html*

Read the data and tokenize words...

In [59]:
data_path = "../TheTimes_DaveO/TheTimesTextFiles_1953-2000/"
articles = PlaintextCorpusReader(data_path, ".+", encoding='utf-8')
tokens = articles.words()

In [23]:
print(tokens[:10]) # print the first 10 tokens

['CORONATION', 'HONOURS', 'THE', 'THREE', 'NEW', 'PEERS', ':', 'ONE', 'ORDER', 'OF']


...and sentences.

In [60]:
sentences = []
for fileid in articles.fileids():
    # One list of sentence strings per article
    sentences.append(sent_tokenize(articles.raw(fileid)))

In [29]:
sentences[1][:3]

['STOCK EXCHANGE DEALINGS The following list shows transactions marked on the Stock Exchange yesterday and also the latest markings during the week (date in brackets) of any security not marked yesterday.',
 'Only one mark in any one security is recorded at any one price; the sequence of marking is not necessarily that in which the bargains were done.',
 'Number of marks received in each section is shown in brackets after the name of the section concerned.']

That looks good! Hopefully most of the data has been segmented into sentences this neatly. We'll find out!
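One rough way to find out is to look for articles containing unusually long "sentences", which often signal tables or listings that the sentence tokenizer could not split. The sketch below is only a heuristic: the function name `flag_suspect_articles` and the 100-word threshold are assumptions made for illustration, not part of the original analysis.

```python
def flag_suspect_articles(articles_as_sentences, max_words=100):
    """Return (article_index, longest_sentence_word_count) for every article
    whose longest sentence exceeds max_words whitespace-separated words."""
    flagged = []
    for i, article in enumerate(articles_as_sentences):
        # default=0 covers articles for which no sentences were found
        longest = max((len(sentence.split()) for sentence in article), default=0)
        if longest > max_words:
            flagged.append((i, longest))
    return flagged


# Small made-up example: the second "article" is one 120-word run of text
demo = [
    ["A short sentence.", "Another one."],
    [" ".join(["word"] * 120)],
]
print(flag_suspect_articles(demo))
```

In the notebook this could be called as `flag_suspect_articles(sentences)`, and any flagged articles inspected by hand before deciding whether to keep them.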

In [61]:
tokenized_articles = []
for article in sentences:
    tokenized_sentences = []
    for s in article:
        # Use a new name so the corpus-wide `tokens` from above is not overwritten
        sentence_tokens = word_tokenize(s)
        tokenized_sentences.append(sentence_tokens)
    tokenized_articles.append(tokenized_sentences)
print(tokenized_articles[1][1])

['Only', 'one', 'mark', 'in', 'any', 'one', 'security', 'is', 'recorded', 'at', 'any', 'one', 'price', ';', 'the', 'sequence', 'of', 'marking', 'is', 'not', 'necessarily', 'that', 'in', 'which', 'the', 'bargains', 'were', 'done', '.']


Next, let's select random subsets of the articles for a training set and a test set. We'll put 80% of the data in the training set and the remaining 20% in the test set.

In [62]:
total_articles = len(tokenized_articles)
print("Total articles:", total_articles)
training_size = round(total_articles * 0.8)  # 80% for training
test_size = total_articles - training_size   # remaining 20% for testing
print("Training articles:", training_size)
print("Test articles:", test_size)

Total articles: 571
Training articles: 457
Test articles: 114


In [64]:
# Draw the test articles at random, without replacement
test_indeces = random.sample(range(total_articles), test_size)
test_data = [tokenized_articles[i] for i in test_indeces]

print("Test data length:", len(test_data))

Test data length: 114


In [65]:
# Every article not drawn for the test set goes into the training set
test_index_set = set(test_indeces)  # set membership checks are fast
training_indeces = [i for i in range(total_articles) if i not in test_index_set]
training_data = [tokenized_articles[i] for i in training_indeces]

print("Training data length:", len(training_data))

Training data length: 457
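The split above draws a different sample each time the notebook is run. If a repeatable split is wanted, it could be wrapped in a small seeded helper along these lines. This is a sketch only: the function name `split_train_test` and the `seed` parameter are assumptions, since the original cells do not fix a seed.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle item positions with a fixed seed, then cut them into
    a training list and a test list."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # local generator; global state untouched
    n_train = round(len(items) * train_fraction)
    train = [items[i] for i in indices[:n_train]]
    test = [items[i] for i in indices[n_train:]]
    return train, test


# Stand-in data with the same number of items as the corpus above
demo_train, demo_test = split_train_test(list(range(571)))
print(len(demo_train), len(demo_test))
```

With the notebook's data, the call would be `split_train_test(tokenized_articles)`. Using the same seed returns the same split on every run.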


### 2. Sentiment Analysis with NLTK's VADER
*Code based on: https://www.nltk.org/howto/sentiment.html*
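
As a starting point, here is a minimal sketch of how VADER might be applied to the sentence lists built in section 1. It assumes the `vader_lexicon` resource can be downloaded, and the helper names `make_vader_scorer` and `article_sentiment` are hypothetical. Averaging sentence scores per article is one possible aggregation, not an established choice for this project. VADER's lexicon was developed for social media text, so its scores for mid-century newspaper prose should be read with caution.

```python
from statistics import mean


def make_vader_scorer():
    """Return a function mapping a sentence to VADER's compound score (-1 to 1)."""
    # Imported here so the helper below can be used without NLTK's VADER data
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    nltk.download('vader_lexicon', quiet=True)
    analyzer = SentimentIntensityAnalyzer()
    return lambda sentence: analyzer.polarity_scores(sentence)['compound']


def article_sentiment(article_sentences, scorer):
    """Average the sentence-level scores of one article (0.0 if it has no sentences)."""
    scores = [scorer(s) for s in article_sentences]
    return mean(scores) if scores else 0.0
```

For example, `scorer = make_vader_scorer()` followed by `article_sentiment(sentences[1], scorer)` would give one averaged score for the second article.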

### 3. Sentiment Analysis with TensorFlow's Keras
https://www.google.com/search?channel=fs&client=ubuntu&q=tensorflow+keras+python

### 4. Sentiment Analysis with TextBlob's Pattern Analyzer
https://textblob.readthedocs.io/en/dev/advanced_usage.html#advanced
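
A minimal sketch of what this section might look like, assuming the `textblob` package is installed. The helper names `make_textblob_scorer` and `label_polarity`, and the 0.1 threshold, are hypothetical and chosen only for illustration. TextBlob's `PatternAnalyzer` returns a polarity between -1 and 1 and a subjectivity between 0 and 1.

```python
def make_textblob_scorer():
    """Return a function mapping text to (polarity, subjectivity)."""
    # Imported here so label_polarity can be used without textblob installed
    from textblob import TextBlob
    from textblob.sentiments import PatternAnalyzer
    analyzer = PatternAnalyzer()

    def score(text):
        sentiment = TextBlob(text, analyzer=analyzer).sentiment
        return sentiment.polarity, sentiment.subjectivity

    return score


def label_polarity(polarity, threshold=0.1):
    """Turn a polarity score into a coarse label; the threshold is an assumption."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"
```

For example, `polarity, subjectivity = make_textblob_scorer()(articles.raw(fileid))` would score a whole article, and `label_polarity(polarity)` would reduce that score to a label.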