# Sentiment Analysis of *The Times* Music Reviews
## Part III: Sentiment Analysis with NLTK's VADER
*How have artforms been reported?  Is there a status hierarchy between them?  How has this changed over time?*

* **Project:** What counts as culture?  Reporting and criticism in The Times 1785-2000
* **Project Team:** Dave O'Brien (lead), Lucy Havens (Jupyter Notebook author), Orian Brooke, Mark Taylor
* **Funding:** Centre for Data, Culture & Society, University of Edinburgh
* **Dataset:** 83,625 music reviews published in The Times from 1950 through 2009, drawn from The Times Archive

Begun February 2021

***

First, import required programming libraries.

In [2]:
# For data loading
import re
import string
import numpy as np
import pandas as pd

# For text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('wordnet')
from nltk.corpus import wordnet
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
# nltk.download('averaged_perceptron_tagger')
# nltk.download('tagsets')  # part of speech tags
from nltk.tag import pos_tag
# from nltk.classify import NaiveBayesClassifier
# from nltk.corpus import subjectivity
# nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# from nltk.sentiment.util import *

# import tensorflow as tf
# from tensorflow import keras

### About VADER
**VADER** stands for Valence Aware Dictionary and sEntiment Reasoner. This sentiment analyzer estimates the **polarity** of a text (how positive, neutral, or negative it is) as well as the intensity of that sentiment. It reports the proportions of the text that are positive, neutral, and negative, and it also produces a single compound score: the valence scores of the words in the text are summed, adjusted by VADER's rules, and normalized to fall between -1 and 1.
***
References:
* https://www.nltk.org/howto/sentiment.html
* https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664
* https://medium.com/@sharonwoo/sentiment-analysis-with-nltk-422e0f794b8
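
As a rough illustration of the normalization step: in VADER's reference implementation, as I understand it, the summed and rule-adjusted word valences `x` are squashed into the range -1 to 1 with `x / sqrt(x² + alpha)`, where `alpha` defaults to 15. The function below is a stand-alone sketch of that formula, not a call into NLTK, and the name `normalize_compound` is my own.

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Squash a summed valence into (-1, 1); alpha=15 is assumed to match VADER's default."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# A few made-up valence sums, from no sentiment to strongly positive
for total in [0.0, 1.0, 4.0, 20.0]:
    print(total, round(normalize_compound(total), 4))
```

Because the sum grows with the number of sentiment-bearing words, long texts tend to land close to -1 or 1.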

Let's try the `SentimentIntensityAnalyzer`, which I refer to as `analyzer` in my code for brevity, on a few sentences from one of the articles in our corpus to test it out.

In [2]:
analyzer = SentimentIntensityAnalyzer()

In [3]:
data_path = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009"
articles = PlaintextCorpusReader(data_path, ".+/.+", encoding='utf-8')
tokens = articles.words()

In [4]:
articles.fileids()[0]

'TheTimesMusicReviews_1950-2009_part1/20787'

In [5]:
tokens[:10] # print the first 10 tokens

["'", 'SOME', 'NEW', 'SCORES', 'MOTET', 'AND', 'OPERA', 'BY', 'OUR', 'MUSIC']

The *tokens* shown above consist of words, digits, and punctuation.  *Tokenization* is the process of splitting running text into groups of letters (words), groups of digits (numbers), and punctuation marks.  A token may also mix letters and digits, so if a digitization error turned `alliance` into `a1liance`, the tokenizer would keep `a1liance` as a single token rather than separating out the `1`.  Punctuation is handled differently: the corpus reader's default word tokenizer splits punctuation from letters, so an error such as `a!liance` would most likely come out as three tokens (`a`, `!`, `liance`).
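
To see how a particular digitization error would be split, here is a minimal sketch using only Python's `re` module. The pattern `\w+|[^\w\s]+` is, to my knowledge, the one behind NLTK's `WordPunctTokenizer`, which `PlaintextCorpusReader` uses by default for `words()`. Treat that equivalence as an assumption and check it against your NLTK version. The helper name `simple_tokenize` is my own.

```python
import re

# Runs of letters/digits/underscores form one token; runs of punctuation form another
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def simple_tokenize(text):
    """Split text into word-like tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)

print(simple_tokenize("alliance a1liance a!liance"))
# ['alliance', 'a1liance', 'a', '!', 'liance']
```

The digit stays inside its token, while the exclamation mark splits the word into three tokens.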

We can also use tokenization to split running text into what the computer estimates are sentences:

In [6]:
sentences = []
for fileid in articles.fileids():
    # Store one list of sentences per article
    sentences.append(sent_tokenize(articles.raw(fileid)))

In [7]:
sentences[0][:5] # print the first 5 sentences of the first article

["'SOME NEW SCORES MOTET AND OPERA BY OUR MUSIC CRIrIC Music publishing has got into its stride once more after the restraints of -war-time conditions, or so it''appears from the scores that arrive for our inspection and review.",
 'The steady stream bears on its broad bosom Church music, chamber music.',
 'symphonic music and operas, some of it old in new dress, some of it new.',
 'A hand thrust into it on the principle of the lucky dip brings up the largest objects.',
 "Though this is not criticism''s most subtle method of discrimination, it has the same sort of excitement as angling-one might catch a masterpiece."]

Let's run the sentiment analyzer on these sentences.  The sentiment analyzer [assigns four scores](https://github.com/cjhutto/vaderSentiment#about-the-scoring):
* **neg**: a score between 0 and 1, with a higher value indicating that a greater proportion of the text is negative
* **neu**: a score between 0 and 1, with a higher value indicating that a greater proportion of the text is neutral
* **pos**: a score between 0 and 1, with a higher value indicating that a greater proportion of the text is positive
* **compound**: a score between -1 and 1, with negative numbers indicating an overall negative sentiment and positive numbers an overall positive sentiment

In [8]:
article = sentences[0][:5]
for sentence in article:
    print(sentence)
    print(analyzer.polarity_scores(sentence))
    print()

'SOME NEW SCORES MOTET AND OPERA BY OUR MUSIC CRIrIC Music publishing has got into its stride once more after the restraints of -war-time conditions, or so it''appears from the scores that arrive for our inspection and review.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

The steady stream bears on its broad bosom Church music, chamber music.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

symphonic music and operas, some of it old in new dress, some of it new.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

A hand thrust into it on the principle of the lucky dip brings up the largest objects.
{'neg': 0.0, 'neu': 0.7, 'pos': 0.3, 'compound': 0.7184}

Though this is not criticism''s most subtle method of discrimination, it has the same sort of excitement as angling-one might catch a masterpiece.
{'neg': 0.0, 'neu': 0.741, 'pos': 0.259, 'compound': 0.791}
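
As a quick sanity check on how to read these scores, the `neg`, `neu`, and `pos` values are proportions, so they should add up to roughly 1 for each sentence (allowing for rounding). The sketch below re-uses two score dictionaries copied from the output above rather than calling the `analyzer`. The helper name `proportion_total` is my own.

```python
# Score dictionaries copied from the output above
scores_list = [
    {'neg': 0.0, 'neu': 0.7, 'pos': 0.3, 'compound': 0.7184},
    {'neg': 0.0, 'neu': 0.741, 'pos': 0.259, 'compound': 0.791},
]

def proportion_total(scores):
    """Add up the three proportion scores, ignoring the compound score."""
    return scores['neg'] + scores['neu'] + scores['pos']

for scores in scores_list:
    print(round(proportion_total(scores), 3))
```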



**Note:** We'll probably want to decide on a threshold for how large (or small) a compound score needs to be for us to consider a text positive or negative.  The smallest (most negative) score the `analyzer` can assign is -1, and the largest (most positive) is 1.  VADER's [documentation](https://github.com/cjhutto/vaderSentiment#about-the-scoring) suggests the following thresholds:
* positive: compound score >= 0.05
* neutral: -0.05 < compound score < 0.05
* negative: compound score <= -0.05
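
Thresholds like these could be applied with a small helper such as the one below. This is a sketch: the function name `label_sentiment` and its `threshold` parameter are my own, and the default of 0.05 is the cut-off I understand VADER's documentation to recommend. Adjust it if you settle on a different threshold.

```python
def label_sentiment(compound, threshold=0.05):
    """Turn a compound score into a 'positive', 'neutral', or 'negative' label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from the five sentences scored above
for compound in [0.0, 0.0, 0.0, 0.7184, 0.791]:
    print(compound, label_sentiment(compound))
```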

For now, though, we'll keep track of all four scores the `analyzer` assigns.  Next, let's run the `analyzer` on all the articles in our corpus.

In [7]:
# Load the inventory of articles (metadata such as title, year, and article_id)
inventory = pd.read_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_Inventory.csv", index_col=0)
inventory.head()

Unnamed: 0,title,year,author,term,section,pages,filename,article_id,issue_id
20787,SOME NEW SCORES MOTET AND OPERA,1950,BY OUR MUSIC CRITIC,"[' bands', ' composer', ' musical', ' opera', ...",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-023,0FFO-1950-JUN30
20788,"THE ROYAL OPERA "" TRISTAN AND ISOLDE """,1950,'',"[' opera', ' orchestra']",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-027,0FFO-1950-JUN30
20789,GROWING TASTE FOR MUSIC PLEA FOR ENLARGED QUEE...,1950,'',[' country'],Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-032,0FFO-1950-JUN30
20790,ROYAL PHILHARMONIC CONCERT BEECHAM AND MOZART,1950,'',"[' orchestra', ' orchestras']",Reviews,['010'],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-MAR02-010-006,0FFO-1950-MAR02
20791,MUSICAL JOURNALS SOME NEWCOMERS,1950,BY OUR MUSIC CRITIC,"[' musical', ' orchestra', ' orchestras']",Reviews,['007'],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-MAR03-007-010,0FFO-1950-MAR03


In [8]:
positive_list = []
neutral_list = []
negative_list = []
compound_list = []
files = articles.fileids()
for f in files:
    # Read each article as UTF-8 (matching the corpus reader); `with` closes the file for us
    with open("../TheTimes_DaveO/TheTimesMusicReviews_1950-2009/"+f, encoding="utf-8") as text:
        t = text.read()
    scores = analyzer.polarity_scores(t)
    compound_list.append(scores["compound"])
    positive_list.append(scores["pos"])       # VADER's abbreviation for positive
    neutral_list.append(scores["neu"])        # VADER's abbreviation for neutral
    negative_list.append(scores["neg"])       # VADER's abbreviation for negative

# Store the scores in a DataFrame (a type of table) with one row per article and one column per score
df = pd.DataFrame({"article_id":files, "compound":compound_list, "positive":positive_list, "neutral":neutral_list, "negative":negative_list})
df.head()

Unnamed: 0,article_id,compound,positive,neutral,negative
0,TheTimesMusicReviews_1950-2009_part1/20787,0.9897,0.075,0.912,0.013
1,TheTimesMusicReviews_1950-2009_part1/20788,0.9978,0.23,0.744,0.025
2,TheTimesMusicReviews_1950-2009_part1/20789,0.9912,0.124,0.866,0.01
3,TheTimesMusicReviews_1950-2009_part1/20790,0.9886,0.133,0.822,0.044
4,TheTimesMusicReviews_1950-2009_part1/20791,0.8225,0.061,0.893,0.046


In [9]:
# Count articles by the sign of their compound score (a cut-off of exactly 0)
negative = (df["compound"] < 0).sum()
neutral = (df["compound"] == 0).sum()
positive = (df["compound"] > 0).sum()
print("Negative articles:", negative)
print("Neutral articles:", neutral)
print("Positive articles:", positive)

Negative articles: 6875
Neutral articles: 174
Positive articles: 76576
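
The strong skew toward positive articles may partly reflect article length: the compound score is based on a sum over the whole text, so long articles tend to end up near -1 or 1. One alternative, which I believe the VADER documentation also suggests for longer texts, is to score each sentence and average the results. The sketch below shows only the averaging step, using a hypothetical helper `mean_compound` and the five sentence scores printed earlier. In the notebook you would pass in `analyzer.polarity_scores(s)["compound"]` for each sentence `s` of an article.

```python
def mean_compound(sentence_compounds):
    """Average sentence-level compound scores; returns 0.0 for an empty article."""
    if not sentence_compounds:
        return 0.0
    return sum(sentence_compounds) / len(sentence_compounds)

# Compound scores of the first five sentences of the first article (from the earlier output)
first_five = [0.0, 0.0, 0.0, 0.7184, 0.791]
print(round(mean_compound(first_five), 4))
```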


Let's write the results to a CSV file so we can easily reference them in Microsoft Excel or another spreadsheet tool, or load the data for analysis in another Jupyter Notebook.

In [10]:
df.to_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_VADERSentiments.csv")
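
One detail worth knowing when reloading the file: `to_csv` writes the DataFrame's index as an unnamed first column by default, which shows up as `Unnamed: 0` unless you pass `index_col=0` to `read_csv` (or write the file with `index=False`). The sketch below demonstrates this with an in-memory buffer and two rows copied from the table above, so it does not touch the real file. The variable names are my own.

```python
import io
import pandas as pd

# Two rows copied from the scores table above
sample = pd.DataFrame({
    "article_id": ["TheTimesMusicReviews_1950-2009_part1/20787",
                   "TheTimesMusicReviews_1950-2009_part1/20788"],
    "compound": [0.9897, 0.9978],
})

buffer = io.StringIO()
sample.to_csv(buffer)            # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)      # gains an "Unnamed: 0" column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index restored, no extra column

print(list(naive.columns))
print(list(restored.columns))
```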

Other approaches to sentiment analysis you could consider trying are [NLTK's Naive Bayes Classifier](https://www.nltk.org/howto/sentiment.html), a neural network classifier built with [TensorFlow's Keras](https://keras.io/getting_started/intro_to_keras_for_researchers/), and [TextBlob's Pattern Analyzer](https://textblob.readthedocs.io/en/dev/advanced_usage.html#advanced).