# Sentiment Analysis of *The Times* Music Reviews
## Part V: Comparison of Manual and VADER Sentiment Scores
*How have artforms been reported?  Is there a status hierarchy between them?  How has this changed over time?*

* **Project:** What counts as culture?  Reporting and criticism in The Times 1785-2000
* **Project Lead:** Dave O'Brien
* **Developer:** Lucy Havens
* **Funding:** from the Centre for Data, Culture & Society, University of Edinburgh
* **Dataset:** 83,625 reviews about music published in The Times from 1950 through 2009

Begun February 2021

***

## 1. Load the Data

First, import required programming libraries.

In [1]:
# For data loading
import re
import string
import random
import math
import numpy as np
import pandas as pd

# For text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('wordnet')
from nltk.corpus import wordnet
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.text import Text
# nltk.download('averaged_perceptron_tagger')
# nltk.download('tagsets')  # part of speech tags
from nltk.tag import pos_tag

# For data visualization
import matplotlib.pyplot as plt
import altair as alt   ###  Need to figure out why Altair returns error! (Javascript Error: Unrecognized transform type: "formula")
import seaborn as sns

Next, load two datasets: the scores produced by running the VADER sentiment analyzer on the text of The Times reviews, and the inventory of review metadata.

We'll load each CSV file into a DataFrame, which keeps the rows and columns organized just as they were in the Sentiment Analysis Notebook (#2).

In [2]:
dfs = pd.read_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_VADERSentiments.csv", index_col=0)
dfs.rename(columns={"article_id":"filepath"}, inplace=True)
dfs.head()

Unnamed: 0,filepath,compound,positive,neutral,negative
0,TheTimesMusicReviews_1950-2009_part1/20787,0.9897,0.075,0.912,0.013
1,TheTimesMusicReviews_1950-2009_part1/20788,0.9978,0.23,0.744,0.025
2,TheTimesMusicReviews_1950-2009_part1/20789,0.9912,0.124,0.866,0.01
3,TheTimesMusicReviews_1950-2009_part1/20790,0.9886,0.133,0.822,0.044
4,TheTimesMusicReviews_1950-2009_part1/20791,0.8225,0.061,0.893,0.046


In [3]:
dfi = pd.read_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_Inventory.csv")
dfi.rename(columns={"Unnamed: 0":"identifier"}, inplace=True)
dfi.head()

Unnamed: 0,identifier,title,year,author,term,section,pages,filename,article_id,issue_id
0,20787,SOME NEW SCORES MOTET AND OPERA,1950,BY OUR MUSIC CRITIC,"[' bands', ' composer', ' musical', ' opera', ...",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-023,0FFO-1950-JUN30
1,20788,"THE ROYAL OPERA "" TRISTAN AND ISOLDE """,1950,'',"[' opera', ' orchestra']",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-027,0FFO-1950-JUN30
2,20789,GROWING TASTE FOR MUSIC PLEA FOR ENLARGED QUEE...,1950,'',[' country'],Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-032,0FFO-1950-JUN30
3,20790,ROYAL PHILHARMONIC CONCERT BEECHAM AND MOZART,1950,'',"[' orchestra', ' orchestras']",Reviews,['010'],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-MAR02-010-006,0FFO-1950-MAR02
4,20791,MUSICAL JOURNALS SOME NEWCOMERS,1950,BY OUR MUSIC CRITIC,"[' musical', ' orchestra', ' orchestras']",Reviews,['007'],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-MAR03-007-010,0FFO-1950-MAR03


To merge these two DataFrames, we first need a shared key. The `identifier` column in the second DataFrame holds the number that appears at the end of each `filepath` in the first DataFrame, so we extract that number from every file path.

In [6]:
paths = list(dfs.filepath)
ids = []
for p in paths:
    # The identifier is the run of digits at the end of each file path
    ids.append(int(re.findall(r"\d+$", p)[0]))
    
print(ids[:10]) # print the first 10 ids

[20787, 20788, 20789, 20790, 20791, 20792, 20793, 20794, 20795, 20796]
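
The same extraction can also be written without an explicit loop. As a minimal sketch, assuming every file path ends in a run of digits (as in the sample above), pandas' `Series.str.extract` pulls out the trailing number in one step. The small `sample` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the filepath format shown above
sample = pd.DataFrame({
    "filepath": [
        "TheTimesMusicReviews_1950-2009_part1/20787",
        "TheTimesMusicReviews_1950-2009_part1/20788",
    ]
})

# Capture the digits at the end of each path and convert them to integers
sample["identifier"] = (
    sample["filepath"].str.extract(r"(\d+)$", expand=False).astype(int)
)
print(sample["identifier"].tolist())
```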


Add an `identifier` column to the first DataFrame of sentiment analysis results.

In [7]:
dfs["identifier"] = ids
dfs.head(3)

Unnamed: 0,filepath,compound,positive,neutral,negative,identifier
0,TheTimesMusicReviews_1950-2009_part1/20787,0.9897,0.075,0.912,0.013,20787
1,TheTimesMusicReviews_1950-2009_part1/20788,0.9978,0.23,0.744,0.025,20788
2,TheTimesMusicReviews_1950-2009_part1/20789,0.9912,0.124,0.866,0.01,20789


Now we can join the DataFrames on their `identifier` columns!

In [8]:
df = dfi.merge(dfs, how="outer", on="identifier")
df.head(3)

Unnamed: 0,identifier,title,year,author,term,section,pages,filename,article_id,issue_id,filepath,compound,positive,neutral,negative
0,20787,SOME NEW SCORES MOTET AND OPERA,1950,BY OUR MUSIC CRITIC,"[' bands', ' composer', ' musical', ' opera', ...",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-023,0FFO-1950-JUN30,TheTimesMusicReviews_1950-2009_part1/20787,0.9897,0.075,0.912,0.013
1,20788,"THE ROYAL OPERA "" TRISTAN AND ISOLDE """,1950,'',"[' opera', ' orchestra']",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-027,0FFO-1950-JUN30,TheTimesMusicReviews_1950-2009_part1/20788,0.9978,0.23,0.744,0.025
2,20789,GROWING TASTE FOR MUSIC PLEA FOR ENLARGED QUEE...,1950,'',[' country'],Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-032,0FFO-1950-JUN30,TheTimesMusicReviews_1950-2009_part1/20789,0.9912,0.124,0.866,0.01


In [8]:
# Check for null values
print(df.year.isnull().values.any())
print(df.term.isnull().values.any())
print(df.title.isnull().values.any())
print(df.section.isnull().values.any())
print(df.pages.isnull().values.any())

False
False
False
False
False


Great, it looks like we haven't lost any articles' metadata!  All the reviews' years, terms, titles, sections, and pages have a value.
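
The null checks above only cover the metadata columns. A complementary check, sketched below on made-up data, uses the `indicator` parameter of `DataFrame.merge` to show which rows of an outer join were found in only one of the two DataFrames. The `left` and `right` frames are hypothetical stand-ins for `dfi` and `dfs`.

```python
import pandas as pd

# Hypothetical stand-ins for the inventory and sentiment DataFrames
left = pd.DataFrame({"identifier": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"identifier": [2, 3, 4], "compound": [0.5, -0.2, 0.9]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
merged = left.merge(right, how="outer", on="identifier", indicator=True)

# Rows that did not find a partner in the other DataFrame
unmatched = merged[merged["_merge"] != "both"]
print(merged["_merge"].value_counts())
print(unmatched)
```

If `unmatched` is empty, every review has both metadata and sentiment scores.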

Now that we have the reviews' sentiment data joined up with the reviews' metadata, we can export the data as a CSV file for quicker reference in the future.

In [10]:
df.to_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_MetadataWithVADERSentiments.csv")

## 2. Analyze the Data
### 2.1 Summary Statistics
Let's figure out what the most positive, negative, and neutral reviews are and what scores the VADER Sentiment Analyzer assigned them.

In [11]:
# Load a DataFrame of reviews and their sentiment scores if you skipped Section 1
# (uncomment the code below by removing the '#' at the beginning of the line, and press shift+enter to run)
#df = pd.read_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_MetadataWithVADERSentiments.csv", index_col=0)

In [10]:
maxPos = max(df.positive)
print("Highest Positive Score:\n\t", maxPos)
print("Review with Highest Positive Score:\n\t", list(df.title[df.positive == maxPos])[0], "(Identifier:",list(df.identifier[df.positive == maxPos])[0], ")")

Highest Positive Score:
	 0.457
Review with Highest Positive Score:
	 \  \"The Feigned Lady Gardener\" doesnt exactly trip off the tongue , but\ (Identifier: 72571 )


In [11]:
maxNeg = max(df.negative)
print("Highest Negative Score:\n\t", maxNeg)
print("Review with Highest Negative Score:\n\t", list(df.title[df.negative == maxNeg])[0], "(Identifier:",list(df.identifier[df.negative == maxNeg])[0], ")")

Highest Negative Score:
	 0.319
Review with Highest Negative Score:
	 The Art of Noise The1 week's best music video (Identifier: 98832 )


In [12]:
maxNeu = max(df.neutral)
print("Highest Neutral Score:\n\t", maxNeu)
print("Review with Highest Neutral Score:\n\t", list(df.title[df.neutral == maxNeu])[0], "(Identifier:",list(df.identifier[df.neutral == maxNeu])[0], ")")

Highest Neutral Score:
	 1.0
Review with Highest Neutral Score:
	 BRITISH MUSIC SUNG AT ST. JOHN LATFRAN A CONTEMPORARY MASS (Identifier: 21217 )


In [13]:
maxCom = max(df["compound"])
minCom = min(df["compound"])
print("Most Positive (Highest) Compound Score:\n\t", maxCom)
print("Review with Most Positive (Highest) Compound Score:\n\t", list(df.title[df["compound"] == maxCom])[0], "(Identifier:",list(df.identifier[df["compound"] == maxCom])[0], ")")
print("\nMost Negative (Lowest) Compound Score:\n\t", minCom)
print("Review with Most Negative (Lowest) Compound Score:\n\t", list(df.title[df["compound"] == minCom])[0], "(Identifier:",list(df.identifier[df["compound"] == minCom])[0], ")")

Most Positive (Highest) Compound Score:
	 1.0
Review with Most Positive (Highest) Compound Score:
	 . . .. . " i / . . k LUCGIRAHD (Identifier: 64287 )

Most Negative (Lowest) Compound Score:
	 -0.9998
Review with Most Negative (Lowest) Compound Score:
	 '1993: the year a simple handshake gave new hope lor a more peaceful world...' (Identifier: 70658 )
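
A more compact way to look up these extremes, sketched here on a made-up DataFrame, is `idxmax` and `idxmin`, which return the index label of the highest or lowest value in a column. Like the `[0]` indexing above, they return only the first match when several reviews tie.

```python
import pandas as pd

# Hypothetical scores for three reviews
scores = pd.DataFrame({
    "identifier": [101, 102, 103],
    "title": ["Review A", "Review B", "Review C"],
    "compound": [0.91, -0.75, 0.10],
})

# Look up the full row of the review with the highest and lowest compound score
most_positive = scores.loc[scores["compound"].idxmax()]
most_negative = scores.loc[scores["compound"].idxmin()]
print(most_positive["title"], most_positive["compound"])
print(most_negative["title"], most_negative["compound"])
```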


To get a sense of how well the sentiment analyzer performs, we can read some of these reviews and judge for ourselves whether the VADER scores seem accurate!  

Let's randomly select music reviews that mention at least one of the four genres we're focusing on (opera, jazz, rap, and rock), picking one article per genre for every year in which that genre appears, from 1950 through 2009.

In [14]:
#  Input: music-related term and a DataFrame of music review metadata
# Output: a list of booleans (True or False) noting which reviews have the input term
def termFilter(term_string, dataframe):
    df_terms = list(dataframe.term)
    with_term = []
    for t in df_terms:
        if term_string in t:
            with_term += [True]
        else:
            with_term += [False]
    
    assert(len(with_term) == len(df_terms))
    return with_term

# Determine which music reviews have the words opera, jazz, rap, and rock
with_opera = termFilter("opera", df)
with_jazz = termFilter("jazz", df)
with_rap = termFilter("rap", df)
with_rock = termFilter("rock", df)

# Add the lists of booleans to the DataFrame of music review metadata (including sentiment scores)
df["with_opera"] = with_opera
df["with_jazz"] = with_jazz
df["with_rap"] = with_rap
df["with_rock"] = with_rock

# Create subsets of data, making one DataFrame of music reviews per genre
df_opera = df[df.with_opera == True]
df_jazz = df[df.with_jazz == True]
df_rap = df[df.with_rap == True]
df_rock = df[df.with_rock == True]
print("Opera articles:",df_opera.shape[0])
print("Jazz articles:",df_jazz.shape[0])
print("Rap articles:",df_rap.shape[0])
print("Rock articles:",df_rock.shape[0])

Opera articles: 18628
Jazz articles: 7681
Rap articles: 1925
Rock articles: 9222
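
Note that `term_string in t` is a substring test on the text of the `term` column, so a search for "rap" also matches any longer term that contains those letters. If exact matches are wanted, one option is to parse each entry into a list first. The sketch below assumes the `term` values are string representations of Python lists, as the sample rows above suggest. `has_exact_term` is a hypothetical helper, and the sample values (including "rapper") are made up.

```python
import ast
import pandas as pd

def has_exact_term(term_list_string, term):
    """Return True if `term` is one of the terms in a stringified list."""
    # Parse the string into a real list and strip the leading spaces
    terms = [t.strip() for t in ast.literal_eval(term_list_string)]
    return term in terms

# Hypothetical term values in the same format as the inventory
sample = pd.DataFrame({
    "term": ["[' opera', ' orchestra']", "[' rapper', ' rock']", "[' rap']"]
})
sample["with_rap"] = sample["term"].apply(lambda s: has_exact_term(s, "rap"))
print(sample["with_rap"].tolist())
```

Only the last row is flagged here, whereas a substring test would also flag the row containing "rapper".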


In [15]:
# Create a 2D list of files for each genre, with one list per year of publication
def listsOfArticlesPerYear(dataframe):
    # Get a non-repeating list of all years in the dataframe
    years = dataframe.year.unique()
    articles = []
    for y in years:
        # For each year, create a list of articles published in that year
        # and add those articles' file paths to the list
        articles += [list(dataframe[dataframe.year == y].filepath)]
    # Return the two-dimensional list of articles (one sub-list per year) 
    return articles

opera_ids = listsOfArticlesPerYear(df_opera)
# print(opera_ids[1:2])
jazz_ids = listsOfArticlesPerYear(df_jazz)
rap_ids = listsOfArticlesPerYear(df_rap)
rock_ids = listsOfArticlesPerYear(df_rock)

In [16]:
year_entries = len(df.year.unique())
print("Number of publication years for our corpus:",year_entries)

# Randomly select one article from each year in a genre's lists of file paths
def randomSelection(twoDlist):
    to_read = []
    for year_of_articles in twoDlist:
        to_read += [random.choice(year_of_articles)]
    return to_read

read_opera = randomSelection(opera_ids)
print("\nOpera articles to read:", len(read_opera))
read_jazz = randomSelection(jazz_ids)
print("Jazz articles to read:", len(read_jazz))
read_rap = randomSelection(rap_ids)
print("Rap articles to read:", len(read_rap))
read_rock = randomSelection(rock_ids)
print("Rock articles to read:", len(read_rock))
print("\nArticles to read:",len(read_opera)+len(read_jazz)+len(read_rap)+len(read_rock))

Number of publication years for our corpus: 60

Opera articles to read: 60
Jazz articles to read: 60
Rap articles to read: 42
Rock articles to read: 60

Articles to read: 222
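
Because `random.choice` draws differently on each run, the selection above changes whenever the cell is re-executed. A minimal sketch of a reproducible alternative, using made-up data, groups the reviews by year and samples one row per group with a fixed `random_state`. The seed value 42 is arbitrary, and `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical reviews spread over three publication years
reviews = pd.DataFrame({
    "year": [1950, 1950, 1951, 1951, 1951, 1952],
    "filepath": ["part1/1", "part1/2", "part1/3", "part1/4", "part1/5", "part1/6"],
})

# Draw one review per year; the fixed seed makes the draw repeatable
picked = reviews.groupby("year").sample(n=1, random_state=42)
print(picked)
```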


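Rap contributes only 42 articles rather than 60 because reviews matching "rap" appear in only 42 of the 60 publication years.

I'm also going to add the five articles with the maximum and minimum VADER scores to our subset of articles to read, so that we can manually analyze their sentiment as well.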

In [17]:
max_min_scores = [70658, 64287, 21217, 98832, 72571]
to_read = []
for identifier in max_min_scores:
    # Look up the file path of each review by its identifier
    filepath = list(df.filepath[df.identifier == identifier])[0]
    to_read.append(filepath)

def flattenTwoDLists(genre_list, to_read):
    for id_list in genre_list:
        for filepath in id_list:
            to_read += [filepath]
    return to_read

to_read_ids = flattenTwoDLists([read_opera, read_jazz, read_rock, read_rap], to_read)
print(to_read_ids[0:2])
print("Total articles to read:", len(to_read_ids))

['TheTimesMusicReviews_1950-2009_part1/70658', 'TheTimesMusicReviews_1950-2009_part1/64287']
Total articles to read: 227


Now let's export the selected reviews so we can read and manually score their sentiment.

In [18]:
data_path = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009"
articles = PlaintextCorpusReader(data_path, ".+/.+", encoding='utf-8')
fileids = articles.fileids()
print(fileids[0])

TheTimesMusicReviews_1950-2009_part1/20787


In [22]:
for filepath in to_read_ids:
    # Write mode ("w") so that re-running this cell does not append duplicate text
    with open("../TheTimes_DaveO/ToReadAndScore/"+str(filepath), "w", encoding="utf-8") as file:
        file.write(articles.raw(filepath))
    
print("Files ready for manual sentiment analysis in the directory ToReadAndScore!")

Files ready for manual sentiment analysis in the directory ToReadAndScore!
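
Since each file path includes a subdirectory (for example `TheTimesMusicReviews_1950-2009_part1/20787`), the output subdirectories must exist before the files can be written. The sketch below shows one way to create them on the fly with `pathlib`. It writes to a temporary directory and uses a made-up `texts` dictionary in place of the corpus reader.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the corpus: file path -> review text
texts = {
    "TheTimesMusicReviews_1950-2009_part1/20787": "First review text.",
    "TheTimesMusicReviews_1950-2009_part1/20788": "Second review text.",
}

# Temporary output directory used only for this illustration
out_dir = Path(tempfile.mkdtemp())
for filepath, text in texts.items():
    target = out_dir / filepath
    # Create the subdirectory if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")

written = sorted(p.name for p in out_dir.rglob("*") if p.is_file())
print(written)
```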


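The exported files were then divided among three readers (Dave, Orian, and Lucy), each with their own subdirectory. Let's check that each reader's share is roughly equal in word count.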

In [31]:
# Load each reader's files as a separate corpus and tokenize them
dave_path = "../TheTimes_DaveO/ToReadAndScore/Dave"
dave = PlaintextCorpusReader(dave_path, ".+", encoding='utf-8')
dave_tokens = dave.words()
orian_path = "../TheTimes_DaveO/ToReadAndScore/Orian"
orian = PlaintextCorpusReader(orian_path, ".+", encoding='utf-8')
orian_tokens = orian.words()
lucy_path = "../TheTimes_DaveO/ToReadAndScore/Lucy"
lucy = PlaintextCorpusReader(lucy_path, ".+", encoding='utf-8')
lucy_tokens = lucy.words()

In [32]:
print(len(dave_tokens))
print(len(orian_tokens))
print(len(lucy_tokens))

76739
76396
62749
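
To put these counts in proportion, the short calculation below expresses each reader's token count as a share of the total, using the three counts printed above.

```python
# Token counts printed above for each reader
token_counts = {"Dave": 76739, "Orian": 76396, "Lucy": 62749}

total = sum(token_counts.values())
shares = {name: count / total for name, count in token_counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Dave and Orian each hold roughly 35% of the tokens, and Lucy roughly 29%.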


Great!  Later on, we'll compare the sentiment scores we assign manually with the scores VADER assigns.