# Sentiment Analysis of *The Times* Music Reviews
## Part V: Comparison of Manual and VADER Sentiment Scores
*How have artforms been reported?  Is there a status hierarchy between them?  How has this changed over time?*

* **Project:** What counts as culture?  Reporting and criticism in The Times 1785-2000
* **Project Lead:** Dave O'Brien
* **Developer:** Lucy Havens
* **Funding:** from the Centre for Data, Culture & Society, University of Edinburgh
* **Dataset:** 83,625 reviews about music published in The Times from 1950 through 2009

Begun February 2021

***

## 1. Prepare the Data

Before we can begin coding, we must first import the programming libraries we'd like to use in our code.

In [9]:
# For data loading
import re
import string
import random
import math
import numpy as np
import pandas as pd

# For text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('wordnet')
from nltk.corpus import wordnet
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
# nltk.download('averaged_perceptron_tagger')
# nltk.download('tagsets')  # part of speech tags
from nltk.tag import pos_tag

# For data visualization
import matplotlib.pyplot as plt
import altair as alt   ###  Need to figure out why Altair returns error! (Javascript Error: Unrecognized transform type: "formula")
import seaborn as sns

### 1.1 Select Reviews for Manual Sentiment Analysis

To get a sense of how well the sentiment analyzer performs, we can read some of these reviews and judge for ourselves whether the VADER scores seem accurate!  

Let's randomly select music reviews with at least one of the four genres we're focusing on (opera, jazz, rap, and rock), getting a selection of articles published in every year from 1950 through 2009.

In [11]:
# Load a DataFrame of reviews and their sentiment scores
df = pd.read_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_MetadataWithVADERSentiments.csv")

In [14]:
#  Input: music-related term and a DataFrame of music review metadata
# Output: a list of booleans (True or False) noting which reviews have the input term
def termFilter(term_string, dataframe):
    df_terms = list(dataframe.term)
    with_term = []
    for t in df_terms:
        if term_string in t:
            with_term += [True]
        else:
            with_term += [False]
    
    assert(len(with_term) == len(df_terms))
    return with_term

# Determine which music reviews have the words opera, jazz, rap, and rock
with_opera = termFilter("opera", df)
with_jazz = termFilter("jazz", df)
with_rap = termFilter("rap", df)
with_rock = termFilter("rock", df)

# Add the lists of booleans to the DataFrame of music review metadata (including sentiment scores)
df["with_opera"] = with_opera
df["with_jazz"] = with_jazz
df["with_rap"] = with_rap
df["with_rock"] = with_rock

# Create subsets of data, making one DataFrame of music reviews per genre
df_opera = df[df.with_opera == True]
df_jazz = df[df.with_jazz == True]
df_rap = df[df.with_rap == True]
df_rock = df[df.with_rock == True]
print("Opera articles:",df_opera.shape[0])
print("Jazz articles:",df_jazz.shape[0])
print("Rap articles:",df_rap.shape[0])
print("Rock articles:",df_rock.shape[0])

Opera articles: 18628
Jazz articles: 7681
Rap articles: 1925
Rock articles: 9222
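
One thing to watch: `termFilter` checks whether the term appears anywhere in the text of the `term` column, so a short word like "rap" would also match any longer term that contains it. The sketch below uses a small made-up table, not our real data, to compare that substring test with an exact match on the parsed list of terms. The helper `has_exact_term`, the toy rows, and the term "rapture" are assumptions for illustration. It also assumes the `term` column is stored as the text of a Python list, which is how it appears when the table is displayed.

```python
import ast
import pandas as pd

# Hypothetical toy rows mimicking the 'term' column (a list stored as a string)
toy = pd.DataFrame({
    "identifier": [1, 2, 3],
    "term": ["[' opera', ' orchestra']", "[' rapture', ' choir']", "[' rap', ' rock']"],
})

def has_exact_term(term_string, dataframe):
    """Return a boolean Series: True where term_string is exactly one of the row's terms."""
    def row_has_term(raw):
        # Parse the string into a real list and strip the leading spaces from each term
        terms = [t.strip() for t in ast.literal_eval(raw)]
        return term_string in terms
    return dataframe["term"].apply(row_has_term)

# Substring test (what termFilter does) versus exact-term test
substring_match = toy["term"].str.contains("rap", regex=False)
exact_match = has_exact_term("rap", toy)
print(list(substring_match))
print(list(exact_match))
```

In this toy table the substring test flags the "rapture" row as well as the "rap" row, while the exact test flags only the "rap" row. Whether this affects the real counts depends on which terms occur in the corpus.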


In [15]:
# Create a 2D list of files for each genre, with one list per year of publication
def listsOfArticlesPerYear(dataframe):
    # Get a non-repeating list of all years in the dataframe
    years = dataframe.year.unique()
    articles = []
    for y in years:
        # For each year, create a list of articles published in that year
        # and add those articles' identifiers to a list 
        articles += [list(dataframe[dataframe.year == y].filepath)]
    # Return the two-dimensional list of articles (one sub-list per year) 
    return articles

opera_ids = listsOfArticlesPerYear(df_opera)
# print(opera_ids[1:2])
jazz_ids = listsOfArticlesPerYear(df_jazz)
rap_ids = listsOfArticlesPerYear(df_rap)
rock_ids = listsOfArticlesPerYear(df_rock)

In [16]:
year_entries = len(df.year.unique())
print("Number of publication years for our corpus:",year_entries)

# Randomly select one article from each year from each genre's identifiers lists
def randomSelection(twoDlist):
    to_read = []
    for year_of_articles in twoDlist:
        to_read += [random.choice(year_of_articles)]
    return to_read

read_opera = randomSelection(opera_ids)
print("\nOpera articles to read:", len(read_opera))
read_jazz = randomSelection(jazz_ids)
print("Jazz articles to read:", len(read_jazz))
read_rap = randomSelection(rap_ids)
print("Rap articles to read:", len(read_rap))
read_rock = randomSelection(rock_ids)
print("Rock articles to read:", len(read_rock))
print("\nArticles to read:",len(read_opera)+len(read_jazz)+len(read_rap)+len(read_rock))

Number of publication years for our corpus: 60

Opera articles to read: 60
Jazz articles to read: 60
Rap articles to read: 42
Rock articles to read: 60

Articles to read: 222
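
Rap has only 42 articles to read because `listsOfArticlesPerYear` builds one list per year that actually appears in a genre's DataFrame, so years without a rap review are skipped. Note also that `random.choice` gives a different selection on every run. As a rough sketch of a repeatable alternative, pandas can draw one row per year with a fixed seed. The toy table and the seed value 42 below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy metadata: two reviews in 1950, two in 1951, one in 1952
toy = pd.DataFrame({
    "year": [1950, 1950, 1951, 1951, 1952],
    "filepath": ["part1/1", "part1/2", "part1/3", "part1/4", "part1/5"],
})

# Draw one review per publication year; random_state makes the draw repeatable
picked = toy.groupby("year").sample(n=1, random_state=42)
print(sorted(picked["year"]))
print(len(picked))
```

Only years present in the table are sampled, which matches the behaviour of the loop-based functions above.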


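I'm also going to add the five articles with the most extreme VADER scores (the highest and lowest compound scores and the highest positive, neutral, and negative scores) to our subset of articles, so that we can read them and manually analyze their sentiment too.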

In [17]:
max_min_scores = [70658, 64287, 21217, 98832, 72571]
to_read = []
for identifier in max_min_scores:
    # Look up the file path of the article with this identifier
    filepath = list(df.filepath[df.identifier == identifier])[0]
    to_read.append(filepath)

def flattenTwoDLists(genre_list, to_read):
    for id_list in genre_list:
        for filepath in id_list:
            to_read += [filepath]
    return to_read

to_read_ids = flattenTwoDLists([read_opera, read_jazz, read_rock, read_rap], to_read)
print(to_read_ids[0:2])
print("Total articles to read:", len(to_read_ids))

['TheTimesMusicReviews_1950-2009_part1/70658', 'TheTimesMusicReviews_1950-2009_part1/64287']
Total articles to read: 227


Now let's export the selected reviews so we can read and manually score their sentiment.

In [18]:
data_path = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009"
articles = PlaintextCorpusReader(data_path, ".+/.+", encoding='utf-8')
fileids = articles.fileids()
print(fileids[0])

TheTimesMusicReviews_1950-2009_part1/20787


In [22]:
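for filepath in to_read_ids:
    # Write mode ("w") overwrites on re-run, so a file never ends up with a duplicated review
    with open("../TheTimes_DaveO/ToReadAndScore/"+str(filepath), "w", encoding="utf-8") as file:
        file.write(articles.raw(filepath))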
    
print("Files ready for manual sentiment analysis in the directory ToReadAndScore!")

Files ready for manual sentiment analysis in the directory ToReadAndScore!
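
If the export cell is run more than once, it matters whether each file is overwritten or appended to, and the target subdirectories need to exist before writing. Here is a minimal sketch of an export helper that handles both, written against a temporary directory so it doesn't touch our real files. The function name `export_text` and the placeholder text are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def export_text(out_dir, relative_path, text):
    """Write text to out_dir/relative_path, creating folders and overwriting any old copy."""
    target = Path(out_dir) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)  # create missing subdirectories
    target.write_text(text, encoding="utf-8")         # overwrites rather than appends
    return target

with tempfile.TemporaryDirectory() as tmp:
    # Export the same (placeholder) review twice to mimic re-running the cell
    for _ in range(2):
        written = export_text(tmp, "TheTimesMusicReviews_1950-2009_part1/70658", "placeholder review text")
    content = written.read_text(encoding="utf-8")

print(content)
```

After two exports the file still holds a single copy of the text.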


Let's make sure the files are divided among the three readers so that each person has a roughly equal number of words to read.

In [31]:
dave_path =  "../TheTimes_DaveO/ToReadAndScore/Dave"
dave = PlaintextCorpusReader(dave_path, ".+", encoding='utf-8')
dave_tokens = dave.words()
orian_path =  "../TheTimes_DaveO/ToReadAndScore/Orian"
orian = PlaintextCorpusReader(orian_path, ".+", encoding='utf-8')
orian_tokens = orian.words()
lucy_path =  "../TheTimes_DaveO/ToReadAndScore/Lucy"
lucy = PlaintextCorpusReader(lucy_path, ".+", encoding='utf-8')
lucy_tokens = lucy.words()

In [32]:
print(len(dave_tokens))
print(len(orian_tokens))
print(len(lucy_tokens))

76739
76396
62749


Great!  After scoring these articles, we'll compare our manually assigned sentiment scores with the scores VADER assigns in the sections below.

### 1.2 Load Sentiment Scores
**Step 1:** Load the data with VADER scores for each review in our corpus of articles from The Times as a DataFrame (which is a type of table used in the Python library pandas) called `vader`.

In [27]:
vader = pd.read_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_MetadataWithVADERSentiments.csv")
vader.drop(columns=["Unnamed: 0"], inplace=True)
print("Total Reviews (rows in table):", vader.shape[0])
vader.head(3)

Total Reviews (rows in table): 83625


Unnamed: 0,identifier,title,year,author,term,section,pages,filename,article_id,issue_id,filepath,compound,positive,neutral,negative
0,20787,SOME NEW SCORES MOTET AND OPERA,1950,BY OUR MUSIC CRITIC,"[' bands', ' composer', ' musical', ' opera', ...",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-023,0FFO-1950-JUN30,TheTimesMusicReviews_1950-2009_part1/20787,0.9897,0.075,0.912,0.013
1,20788,"THE ROYAL OPERA "" TRISTAN AND ISOLDE """,1950,'',"[' opera', ' orchestra']",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-027,0FFO-1950-JUN30,TheTimesMusicReviews_1950-2009_part1/20788,0.9978,0.23,0.744,0.025
2,20789,GROWING TASTE FOR MUSIC PLEA FOR ENLARGED QUEE...,1950,'',[' country'],Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-032,0FFO-1950-JUN30,TheTimesMusicReviews_1950-2009_part1/20789,0.9912,0.124,0.866,0.01


**Step 2:** Load the data of manual sentiment scores as DataFrames `man1` and `man2` (the code for a third, `man3`, is commented out for now).  We'll also remove any rows with `NaN` (not a number) scores, which may have been recorded for articles that weren't actually about the music genres opera, rock, rap, or jazz.

In [58]:
man1 = pd.read_csv("ManualSentimentScores/manual-scores-dave.csv", header=None, names=["identifier", "manual score"])
man1.dropna(axis=0, inplace=True)
man1["person"] = "dave"
man2 = pd.read_csv("ManualSentimentScores/manual-scores-lucy.csv", header=None, names=["identifier", "manual score"])
man2.dropna(axis=0, inplace=True)
man2["person"] = "lucy"
#man3 = pd.read_csv("ManualSentimentScores/manual-scores-orian.csv", header=None, names=["identifier", "manual score"])
#man3.dropna(axis=0, inplace=True)
#man3["person"] = "orian"
man1.head(3)

Unnamed: 0,identifier,manual score,person
2,71435,5,dave
3,71703,5,dave
4,72059,5,dave


***Note:*** *For this data, each included article was only given one manual score by one person on a scale of 1 to 5, 1 being very negative, 3 being neutral, and 5 being very positive.*

**Step 3:** Merge the DataFrames of manual scores and VADER scores.

We want to compare the VADER sentiment scores to the manually-assigned sentiment scores, so we'll use the *identifier* column of our DataFrames to keep only the articles from `vader` that were also scored manually.

In [59]:
# Combine the manual sentiment score DataFrames into one dataframe
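# (DataFrame.append was removed in pandas 2.0, so we use pd.concat, which keeps the original row labels)
man = pd.concat([man1, man2]) # add man3 to the list once it is loaded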
#man

In [60]:
# Find the VADER scores for each article in the 'man' DataFrame
man_ids = list(man.identifier)
compound = []
positive = []
neutral = []
negative = []
for i in man_ids:
    compound += [list(vader[vader["identifier"] == i]["compound"])[0]]
    positive += [list(vader[vader["identifier"] == i]["positive"])[0]]
    neutral += [list(vader[vader["identifier"] == i]["neutral"])[0]]
    negative += [list(vader[vader["identifier"] == i]["negative"])[0]]

In [61]:
# Add the VADER scores to 'man'
man["VADER compound"] = compound
man["VADER positive"] = positive
man["VADER neutral"] = neutral
man["VADER negative"] = negative
man.head()

Unnamed: 0,identifier,manual score,person,VADER compound,VADER positive,VADER neutral,VADER negative
2,71435,5,dave,0.9975,0.181,0.815,0.004
3,71703,5,dave,0.9879,0.141,0.832,0.026
4,72059,5,dave,0.9929,0.151,0.829,0.02
5,72092,4,dave,0.9764,0.094,0.865,0.041
7,72950,1,dave,-0.7537,0.081,0.823,0.096
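
The lookup loop above filters `vader` four times for every identifier. As a sketch of an alternative, a single `merge` on the *identifier* column gives the same kind of table in one step. The two small DataFrames below are stand-ins that reuse three rows of scores shown in this notebook; the names `man_toy` and `vader_toy` are assumptions for illustration.

```python
import pandas as pd

# Stand-in for 'man': two manually scored articles
man_toy = pd.DataFrame({
    "identifier": [71435, 72950],
    "manual score": [5, 1],
    "person": ["dave", "dave"],
})

# Stand-in for 'vader': includes one article (71703) that is not in man_toy
vader_toy = pd.DataFrame({
    "identifier": [71435, 71703, 72950],
    "compound": [0.9975, 0.9879, -0.7537],
    "positive": [0.181, 0.141, 0.081],
    "neutral": [0.815, 0.832, 0.823],
    "negative": [0.004, 0.026, 0.096],
})

score_columns = ["compound", "positive", "neutral", "negative"]
merged = man_toy.merge(
    vader_toy[["identifier"] + score_columns],
    on="identifier",
    how="left",               # keep every manually scored article
    validate="many_to_one",   # raise an error if an identifier repeats in vader_toy
).rename(columns={c: "VADER " + c for c in score_columns})
print(merged)
```

With `how="left"`, an article missing from the VADER table would show up with `NaN` scores instead of raising an `IndexError`, which makes gaps easier to spot.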


<!-- OPTION 1:
Since the manual scores and VADER scores use different scales ([see VADER's here](https://github.com/cjhutto/vaderSentiment#about-the-scoring)), we'll normalize and generalize the scores according to the following:
* manual 5 = VADER compound 0.7501-1
* manual 4 = VADER compound 0.5001-0.75
* manual 3 = VADER compound -0.5-0.5
* manual 2 = VADER compound  -->

## 2. Analyze the Data
### 2.1 Manual Scores and VADER Compound Scores
Let's compare the manually-assigned scores to VADER's compound scores.  First, let's find the reviews with the most extreme VADER scores.

In [10]:
maxPos = max(df.positive)
print("Highest Positive Score:\n\t", maxPos)
print("Review with Highest Positive Score:\n\t", list(df.title[df.positive == maxPos])[0], "(Identifier:",list(df.identifier[df.positive == maxPos])[0], ")")

Highest Positive Score:
	 0.457
Review with Highest Positive Score:
	 \  \"The Feigned Lady Gardener\" doesnt exactly trip off the tongue , but\ (Identifier: 72571 )


In [11]:
maxNeg = max(df.negative)
print("Highest Negative Score:\n\t", maxNeg)
print("Review with Highest Negative Score:\n\t", list(df.title[df.negative == maxNeg])[0], "(Identifier:",list(df.identifier[df.negative == maxNeg])[0], ")")

Highest Negative Score:
	 0.319
Review with Highest Negative Score:
	 The Art of Noise The1 week's best music video (Identifier: 98832 )


In [12]:
maxNeu = max(df.neutral)
print("Highest Neutral Score:\n\t", maxNeu)
print("Review with Highest Neutral Score:\n\t", list(df.title[df.neutral == maxNeu])[0], "(Identifier:",list(df.identifier[df.neutral == maxNeu])[0], ")")

Highest Neutral Score:
	 1.0
Review with Highest Neutral Score:
	 BRITISH MUSIC SUNG AT ST. JOHN LATFRAN A CONTEMPORARY MASS (Identifier: 21217 )


In [13]:
maxCom = max(df["compound"])
minCom = min(df["compound"])
print("Most Positive (Highest) Compound Score:\n\t", maxCom)
print("Review with Most Positive (Highest) Compound Score:\n\t", list(df.title[df["compound"] == maxCom])[0], "(Identifier:",list(df.identifier[df["compound"] == maxCom])[0], ")")
print("\nMost Negative (Lowest) Compound Score:\n\t", minCom)
print("Review with Most Negative (Lowest) Compound Score:\n\t", list(df.title[df["compound"] == minCom])[0], "(Identifier:",list(df.identifier[df["compound"] == minCom])[0], ")")

Most Positive (Highest) Compound Score:
	 1.0
Review with Most Positive (Highest) Compound Score:
	 . . .. . " i / . . k LUCGIRAHD (Identifier: 64287 )

Most Negative (Lowest) Compound Score:
	 -0.9998
Review with Most Negative (Lowest) Compound Score:
	 '1993: the year a simple handshake gave new hope lor a more peaceful world...' (Identifier: 70658 )
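
Because the manual scores run from 1 to 5 and VADER's compound scores run from -1 to 1, one possible way to compare them without first converting between scales is a rank correlation. The sketch below computes a Spearman correlation as a Pearson correlation of the ranks. It uses only the five rows printed by `man.head()` in Section 1.2, so it illustrates the method and says nothing reliable about the full set of scored articles. The variable names are assumptions for illustration.

```python
import pandas as pd

# The five rows shown by man.head() in Section 1.2
sample = pd.DataFrame({
    "identifier": [71435, 71703, 72059, 72092, 72950],
    "manual score": [5, 5, 5, 4, 1],
    "VADER compound": [0.9975, 0.9879, 0.9929, 0.9764, -0.7537],
})

# Spearman correlation = Pearson correlation of the ranks (tied values share an average rank)
manual_ranks = sample["manual score"].rank()
vader_ranks = sample["VADER compound"].rank()
rho = manual_ranks.corr(vader_ranks)
print("Spearman rank correlation:", round(rho, 3))
```

A value near 1 means the two scorings order the articles similarly, a value near 0 means little relationship, and a value near -1 means they order them in opposite ways.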
