# Pitchfork Says a Lot Despite Appearing to Say Very Little: A Sentiment Analysis of Music Reviews

Music reviews can be tough to grasp. On one hand, you use them to figure out what to listen to. On the other hand, some of them can be esoteric, fatuous, and devoid of any meaning, despite appearing to say a lot. 

__*I was under the impression that [Pitchfork](https://www.pitchfork.com) could be described as such. It turns out I was wrong.*__

## Executive Summary

The aim of this project was to develop a robust technique for sentiment analysis of text that is both ambiguous and subjective. Ambiguity is a problem not only because of the inherent nature of the English language, but also because certain styles of writing sacrifice clarity for a turn of phrase that appears more impressive. Subjectivity enters the fray because of the nature of the content being discussed. Without delving into the intricate philosophies of the matter: you can discuss an art form objectively to a certain extent, but after a point it becomes a matter of taste. Two reviewers can offer opposing but perfectly valid critiques. This element cannot be ignored, because it means that we have to draw fine lines between fact, preference, and opinion. As such, I had to develop a technique that could accurately predict whether a reviewer enjoyed an album or not based purely on a sentiment analysis of the review.

Given that there are fresh-out-of-the-box tools that will promptly give you an idea of the sentiment of a piece of text, I had to beat certain benchmarks for accuracy and ROC-AUC score. I tried models of increasing complexity: I started with a fresh-out-of-the-box model, then moved on to a word-frequency-based model, and then delved into the intricacies of the language and built a more low-level (and robust) model from scratch.

I was able to beat the required benchmarks with a great deal of success. In addition to this, I was also able to take my methodology and convert it into an extremely pliable form that can be used as its own fresh-out-of-the-box sentiment analysis tool. 


_I will go into further detail later on._

## Procuring the Data

I got my data from [Pitchfork's website](https://www.pitchfork.com). To obtain it, I built my own web-scraper that can be found below. Kudos to the web developer - it's rare to see a civilized website in the cold, harsh deserts of the internet.

```Python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```

There were several components to building a web scraper of this nature. First, I had to determine what sorts of information I wanted.

[This website will give you an idea of what each article looked like.](https://pitchfork.com/reviews/albums/18090-jon-hopkins-immunity/)

I decided that I wanted:

1. Artist
2. Album
3. Genres
4. Score
5. Author
6. Time (or Date Posted)
7. Abstract (the little block of text preceding the Article)
8. The Article

I built a function to procure this information from each review page.

```Python
def get_info(review_link):
    """
    This function takes in a review link and returns all the relevant information on the page.
    """
    
    review_link = 'http://pitchfork.com' + review_link
    review_html = requests.get(review_link)
    review_soup = BeautifulSoup(review_html.text, 'html.parser')
    
    artist = 'N/A'
    album = 'N/A'
    genres = ['N/A']
    score = 'N/A'
    author = 'N/A'
    time = 'N/A'
    abstract = 'N/A'
    article = 'N/A'
    
    try:
        artist = review_soup.find(class_='artists').text.strip()
    except:
        pass
    
    try:
        album = review_soup.find(class_='review-title').text.strip()      
    except:
        pass
    
    try:
        # Use a separate name so a missing genre list leaves the ['N/A'] default intact.
        genre_list = review_soup.find(class_='genre-list')
        genres = [item.text.strip() for item in genre_list.find_all('li')]
    except:
        pass
    
    try:
        score = review_soup.find(class_='score').text.strip()
    except:
        pass

    try:
        author = review_soup.find(class_='authors-detail__display-name').text.strip()
    except:
        pass
    
    try:
        time = review_soup.find(class_='pub-date')['datetime'].strip()
    except:
        pass
    
    try:
        abstract = review_soup.find(class_='abstract').text.strip()
    except:
        pass
    
    try:
        article = review_soup.find(class_='contents').text.strip()
    except:
        pass
    
    return (artist, album, genres, score, author, time, abstract, article)  
```
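
The repeated `try`/`except` blocks in `get_info` could be folded into one small helper. The sketch below is only an illustration of that idea, not the code used for the scrape: `safe_extract` and `FakeTag` are hypothetical names, and the fake page merely mimics the way `BeautifulSoup.find` returns `None` when an element is missing.

```python
def safe_extract(getter, default='N/A'):
    """Run a zero-argument getter and fall back to a default if the element is missing."""
    try:
        return getter()
    except (AttributeError, TypeError, KeyError):
        # AttributeError: `.text` on None; TypeError: indexing None; KeyError: missing key
        return default


class FakeTag:
    """Stand-in for a parsed HTML tag, used only for this illustration."""
    def __init__(self, text):
        self.text = text


# A fake "page": one lookup finds an element, the other returns None.
page = {'score': FakeTag(' 8.0 '), 'abstract': None}

score = safe_extract(lambda: page['score'].text.strip())
abstract = safe_extract(lambda: page['abstract'].text.strip())
print(score, abstract)
```

Inside `get_info`, a call might then read `safe_extract(lambda: review_soup.find(class_='score').text.strip())`. Catching only these three exception types also means that unrelated failures are not silently swallowed.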

<br>

I then created a DataFrame that would reflect the information I was scraping.

```Python
columns = "Artist Album Genre Score Author Time Abstract Article".split()
df = pd.DataFrame(columns=columns)
```

<br>
And then the scraper itself.

```Python
URL = 'http://pitchfork.com/reviews/albums/?page='
page_count = 1 # Which page should we start scraping from?

temp_url = URL + str(page_count)
html = requests.get(temp_url)
i = 0

while html.status_code != 404: ## While the page I requested actually exists
    
    soup  = BeautifulSoup(html.text, 'html.parser')
    
    for review in soup.find_all(class_='review'): ## Go through every review on the page.
        
        review_link = review.find('a')['href']
        
        artist, album, genres, score, author, time, abstract, article = get_info(review_link)
        df.loc[i] = {'Artist': artist, 'Album': album, 'Genre': genres, 'Score': score, 'Author': author, \
                        'Time': time, 'Abstract': abstract, 'Article': article}
        
        i += 1
        
        ## Save the file every 100 reviews.
        if i % 100 == 0:
            df.to_csv('../data/Pitchfork Reviews.csv')
    
    ## A little indicator I can use to get updates on my scraper
    print('Just hit the ' + str(i) + 'th review!')
    
    page_count += 1
    temp_url = URL + str(page_count)
    html = requests.get(temp_url)

## Runs once, after the last page has been scraped.
print("We're done! Your web scraper collected " + str(i) + " reviews!")

print("Dropping", str(df.shape[0] - df.dropna().shape[0]), "Values.")
df = df.dropna()
df.to_csv('../data/Pitchfork Reviews.csv')
```

## Data Wrangling

```Python
## Importing the necessary packages.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('seaborn-white')
# Increase figure size
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 17
```

I renamed Time to Date and decided to use it to index my data.

```Python
df = pd.read_csv('../data/Pitchfork Reviews.csv')
df['Date'] = pd.to_datetime(df.Time).dt.date
df = df.drop(['Time', 'Unnamed: 0'], axis=1).set_index(['Date'])
df.head()
```

| Date | Artist | Album | Genre | Score | Author | Abstract | Article |
|---|---|---|---|---|---|---|---|
| 2017-07-18 | Japanese Breakfast | Soft Sounds From Another Planet | Rock | 8.0 | Nathan Reese | Inspired by the cosmos, Japanese Breakfast’s M... | Michelle Zauner’s first album as Japanese Brea... |
| 2017-07-18 | Mura Masa | Mura Masa | Electronic | 7.7 | Eve Barlow | Alex Crossan’s debut album is a love letter to... | Oscar Wilde once said, “The man who can domina... |
| 2017-07-18 | Coca Leaf | Deep Marble Sunrise | Experimental | 7.1 | Paul Thompson | Featuring members of Merchandise, the Ukiah Dr... | Dig your way through the sprawling catalogs of... |
| 2017-07-18 | Claude Speeed | Infinity Ultra | Electronic | 7.4 | Philip Sherburne | With songs that travel great distances between... | When the Scottish electronic musician Claude S... |
| 2017-07-17 | Coldplay | Kaleidoscope EP | Rock | 5.8 | Jamieson Cox | Held up by two good-to-great songs, the lightw... | Coldplay are nearing the end of a restless dec... |


```Python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Index: 14518 entries, 2017-07-18 to 1999-01-12
Data columns (total 7 columns):
Artist      14517 non-null object
Album       14515 non-null object
Genre       14518 non-null object
Score       14518 non-null float64
Author      14518 non-null object
Abstract    14205 non-null object
Article     14517 non-null object
dtypes: float64(1), object(6)
memory usage: 907.4+ KB
```


Dropping all duplicate reviews and any reviews that don't have the necessary components.

```Python
df = df.drop_duplicates().dropna()
```

```Python
df.Genre.value_counts()
```

```
Rock                                                  5673
Electronic                                            1840
Rap                                                   1175
Electronic', 'Rock                                     972
Pop/R&B                                                797
Experimental', 'Rock                                   744
Experimental                                           529
Folk/Country                                           483
Metal', 'Rock                                          305
Metal                                                  287
Electronic', 'Pop/R&B                                  147
Jazz                                                   134
Global                                                  97
Pop/R&B', 'Rap                                          86
Electronic', 'Experimental', 'Rock                      86
Pop/R&B', 'Rock                                         74
Electronic', 'Jazz
```

Creating dummy columns for the various genres.

```Python
def get_genre(x):
    return x.replace("'", ' ').replace(',', ' ').split()

df.Genre = df.Genre.apply(get_genre)
```
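
To make the parsing step concrete, here is a small sketch that runs `get_genre` on a few of the raw strings from the value counts. The function is copied from the cell it follows; the list of sample strings is my own selection for illustration.

```python
def get_genre(x):
    # Turn stray quotes and commas into spaces, then split on whitespace.
    return x.replace("'", ' ').replace(',', ' ').split()


raw_values = ["Rock", "Electronic', 'Rock", "Electronic', 'Experimental', 'Rock"]
parsed = [get_genre(value) for value in raw_values]

for raw, genre_list in zip(raw_values, parsed):
    print(repr(raw), '->', genre_list)
```

Because the function splits on whitespace, a genre name containing a space would be broken into two entries. None of the genres that appear in this dataset has one, so the approach is safe here.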

```Python
genres = []

def get_genre_list(x):
    global genres
    genres += [item for item in x if item not in genres]
    
df.Genre.apply(get_genre_list)
genres
```

```
['Rock',
 'Electronic',
 'Experimental',
 'Metal',
 'Pop/R&B',
 'Rap',
 'Global',
 'Jazz',
 'Folk/Country']
```

```Python
for genre in genres:
    df[genre] = df.Genre.apply(lambda x : 1 if genre in x else 0)
df.head()
```

| Date | Artist | Album | Genre | Score | Author | Abstract | Article | Rock | Electronic | Experimental | Metal | Pop/R&B | Rap | Global | Jazz | Folk/Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-07-18 | Japanese Breakfast | Soft Sounds From Another Planet | [Rock] | 8.0 | Nathan Reese | Inspired by the cosmos, Japanese Breakfast’s M... | Michelle Zauner’s first album as Japanese Brea... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2017-07-18 | Mura Masa | Mura Masa | [Electronic] | 7.7 | Eve Barlow | Alex Crossan’s debut album is a love letter to... | Oscar Wilde once said, “The man who can domina... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2017-07-18 | Coca Leaf | Deep Marble Sunrise | [Experimental] | 7.1 | Paul Thompson | Featuring members of Merchandise, the Ukiah Dr... | Dig your way through the sprawling catalogs of... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2017-07-18 | Claude Speeed | Infinity Ultra | [Electronic] | 7.4 | Philip Sherburne | With songs that travel great distances between... | When the Scottish electronic musician Claude S... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2017-07-17 | Coldplay | Kaleidoscope EP | [Rock] | 5.8 | Jamieson Cox | Held up by two good-to-great songs, the lightw... | Coldplay are nearing the end of a restless dec... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Creating a binary score that I use as my target: if the review is good (a score of 5.0 or above), the field gets a 1. Otherwise it gets a 0.

```Python
df['Score_Binary'] = df.Score.apply(lambda x : 1 if x >= 5 else 0)
df.head()
```

| Date | Artist | Album | Genre | Score | Author | Abstract | Article | Rock | Electronic | Experimental | Metal | Pop/R&B | Rap | Global | Jazz | Folk/Country | Score_Binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-07-18 | Japanese Breakfast | Soft Sounds From Another Planet | [Rock] | 8.0 | Nathan Reese | Inspired by the cosmos, Japanese Breakfast’s M... | Michelle Zauner’s first album as Japanese Brea... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2017-07-18 | Mura Masa | Mura Masa | [Electronic] | 7.7 | Eve Barlow | Alex Crossan’s debut album is a love letter to... | Oscar Wilde once said, “The man who can domina... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2017-07-18 | Coca Leaf | Deep Marble Sunrise | [Experimental] | 7.1 | Paul Thompson | Featuring members of Merchandise, the Ukiah Dr... | Dig your way through the sprawling catalogs of... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2017-07-18 | Claude Speeed | Infinity Ultra | [Electronic] | 7.4 | Philip Sherburne | With songs that travel great distances between... | When the Scottish electronic musician Claude S... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2017-07-17 | Coldplay | Kaleidoscope EP | [Rock] | 5.8 | Jamieson Cox | Held up by two good-to-great songs, the lightw... | Coldplay are nearing the end of a restless dec... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


## Naive Model using TextBlob

The first modelling technique I will attempt to use makes some naive assumptions - 

+ The Pitchfork reviewers understand scaling well.

The scaling fallacy isn't restricted to Pitchfork reviewers. Most people suffer from it. The issue is that people will often, on a scale between 1 and 10, consider ~7 an average. This fallacy is derived from the way school tests are scored, where everyone roughly scores around a 70. However, in a well scaled distribution, the average should be a 5. How does this translate to reviews? It means that a score of 5.0 is average or mediocre. Anything above a 5.0 is on the side of good, and anything below a 5.0 is on the side of bad.

+ Pitchfork reviews may have unbalanced classes.

An interesting problem to deal with is the fact that Pitchfork's reviews won't be distributed very evenly because it's likely that they're reviewing albums that are interesting to the public or musically stimulating in some manner. Therefore, it's fair that we may not see too many scores under 5. This is the unbalanced classes problem. We won't be dealing with that at this stage.

+ TextBlob's sentiment analyzer is complex and can understand nuances in the English language.

It's probably naive to assume this.

```Python
from textblob import TextBlob
from sklearn.metrics import roc_auc_score, accuracy_score
```

```Python
df['Text'] = df['Abstract'] + ' ' + df['Article']
```

```Python
# Take an explicit copy so the new columns below are not set on a slice of df.
naive_sent_score = df[['Text', 'Score', 'Score_Binary']].copy()
```

```Python
def get_naive_sent_score(x):
    return round((TextBlob(x).sentiment.polarity + 1) * 10 / 2, 1)

naive_sent_score['Naive_Sent_Score'] = naive_sent_score.Text.apply(get_naive_sent_score)
naive_sent_score['Naive_Sent_Score_Binary'] = naive_sent_score['Naive_Sent_Score'].apply(lambda x : 1 if x >= 5 else 0)
naive_sent_score.head()
```

| Date | Text | Score | Score_Binary | Naive_Sent_Score | Naive_Sent_Score_Binary |
|---|---|---|---|---|---|
| 2017-07-18 | Inspired by the cosmos, Japanese Breakfast’s M... | 8.0 | 1 | 5.7 | 1 |
| 2017-07-18 | Alex Crossan’s debut album is a love letter to... | 7.7 | 1 | 6.0 | 1 |
| 2017-07-18 | Featuring members of Merchandise, the Ukiah Dr... | 7.1 | 1 | 5.5 | 1 |
| 2017-07-18 | With songs that travel great distances between... | 7.4 | 1 | 6.1 | 1 |
| 2017-07-17 | Held up by two good-to-great songs, the lightw... | 5.8 | 1 | 5.9 | 1 |
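
The rescaling inside `get_naive_sent_score` is easier to follow with a few numbers. TextBlob reports polarity on a scale from -1.0 to 1.0, and the formula stretches that onto Pitchfork's 0 to 10 scale. The sketch below isolates the arithmetic; `polarity_to_score` and `to_binary` are hypothetical helper names, and the polarity values are made up for illustration.

```python
def polarity_to_score(polarity):
    """Rescale a polarity in [-1, 1] to a 0-10 score, rounded to one decimal."""
    return round((polarity + 1) * 10 / 2, 1)


def to_binary(score, threshold=5):
    """Same rule as Score_Binary: 1 if the score reaches the threshold, else 0."""
    return 1 if score >= threshold else 0


for polarity in (-1.0, -0.2, 0.0, 0.2, 1.0):
    score = polarity_to_score(polarity)
    print(polarity, '->', score, '->', to_binary(score))
```

One consequence of this mapping is that any text with a polarity of zero or more is labelled positive, so the binary prediction is really just the sign of the polarity.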


```Python
accuracy_score(naive_sent_score['Score_Binary'], naive_sent_score['Naive_Sent_Score_Binary'])
```

```
0.92333709131905295
```

```Python
roc_auc_score(naive_sent_score['Score_Binary'], naive_sent_score['Naive_Sent_Score'])
```

```
0.69425091112112025
```

The accuracy is high (about 92%), but the ROC-AUC score of about 0.69 is poor. Given the unbalanced classes discussed above, high accuracy on its own says little, so this estimate is still far from ideal.
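
A toy example shows why the two metrics can disagree this much. The sketch below uses made-up labels (nine positive reviews, one negative) and a "model" that gives every review the same score. The `roc_auc` helper is a plain pairwise implementation written for this illustration, not scikit-learn's `roc_auc_score`.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up, heavily unbalanced labels: 9 positive reviews, 1 negative.
y_true = [1] * 9 + [0]
constant_scores = [6.0] * 10   # every review gets the same score
constant_preds = [1] * 10      # so every review is predicted positive

print('Accuracy:', accuracy(y_true, constant_preds))
print('ROC-AUC:', roc_auc(y_true, constant_scores))
```

The constant model is right nine times out of ten, yet it cannot rank a good review above a bad one, which is exactly what ROC-AUC measures. That is why I track both numbers for every model.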

## TFIDF and XGBoost

For my second model, I am going to use TFIDF (which looks at weighted frequency of words) and an XGBoost Classifier, which is a powerful fresh-out-of-the-box modelling technique. I won't spend too much time tuning parameters, given that I just want to get an idea of how the modelling technique does based off of simple word frequencies.

```Python
import random
import string
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
from sklearn import svm
```

```Python
from tqdm import tqdm
tqdm.pandas(desc='progress-bar')
```

```Python
# Work on an explicit copy so that adding columns does not touch df.
copy = df.copy()
```

```Python
# Assumption: the cell that defined re_punct is not shown, so this pattern
# (strip ASCII punctuation) is a reconstruction based on the output below.
re_punct = re.compile('[%s]' % re.escape(string.punctuation))

def preprocess(text):
    stop_words = set(stopwords.words('english') + list(string.punctuation))
    lemma = WordNetLemmatizer()
    try:
        tokens = word_tokenize(text.lower())
        tokens = [t for t in tokens if t not in stop_words]
        tokens = [re.sub(re_punct, '', t) for t in tokens]
        tokens = [t for t in tokens if len(t) > 2]
        tokens = [lemma.lemmatize(t) for t in tokens]
        if len(tokens) == 0:
            return None
        else:
            return ' '.join(tokens)
    except AttributeError:
        # Non-string input (e.g. NaN) has no .lower(); treat it as missing.
        return None
```

```Python
copy['Tokenized'] = copy['Text'].apply(preprocess)
copy = copy[copy['Tokenized'].notnull()]
```
```Python
copy.head()
```

| Date | Artist | Album | Genre | Score | Author | Abstract | Article | Rock | Electronic | Experimental | Metal | Pop/R&B | Rap | Global | Jazz | Folk/Country | Score_Binary | Text | Tokenized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-07-18 | Japanese Breakfast | Soft Sounds From Another Planet | [Rock] | 8.0 | Nathan Reese | Inspired by the cosmos, Japanese Breakfast’s M... | Michelle Zauner’s first album as Japanese Brea... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Inspired by the cosmos, Japanese Breakfast’s M... | inspired cosmos japanese breakfast michelle za... |
| 2017-07-18 | Mura Masa | Mura Masa | [Electronic] | 7.7 | Eve Barlow | Alex Crossan’s debut album is a love letter to... | Oscar Wilde once said, “The man who can domina... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Alex Crossan’s debut album is a love letter to... | alex crossan debut album love letter multicult... |
| 2017-07-18 | Coca Leaf | Deep Marble Sunrise | [Experimental] | 7.1 | Paul Thompson | Featuring members of Merchandise, the Ukiah Dr... | Dig your way through the sprawling catalogs of... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Featuring members of Merchandise, the Ukiah Dr... | featuring member merchandise ukiah drag unifor... |
| 2017-07-18 | Claude Speeed | Infinity Ultra | [Electronic] | 7.4 | Philip Sherburne | With songs that travel great distances between... | When the Scottish electronic musician Claude S... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | With songs that travel great distances between... | song travel great distance pole ambient noise ... |
| 2017-07-17 | Coldplay | Kaleidoscope EP | [Rock] | 5.8 | Jamieson Cox | Held up by two good-to-great songs, the lightw... | Coldplay are nearing the end of a restless dec... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Held up by two good-to-great songs, the lightw... | held two goodtogreat song lightweight coldplay... |


```Python
to_drop = 'Date Artist Album Genre Score Author Abstract Article Text Tokenized Score_Binary'.split()
```

```Python
texts = copy.Tokenized.tolist() 
y = copy['Score_Binary']
vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.8) # Term Frequency Inverse Document Frequency Vectorizer
X = vectorizer.fit_transform(texts)
X_df = pd.concat([copy.reset_index(), pd.DataFrame(X.todense())], axis=1).drop(to_drop, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42) # Train Test split
```
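
The `min_df=0.1` and `max_df=0.8` arguments do most of the vocabulary pruning here. As I understand them, float values are read as proportions of documents: a term is kept only if it appears in at least 10% and at most 80% of the reviews. The sketch below imitates that filter in plain Python on a made-up corpus. `filter_by_document_frequency` is a hypothetical helper, not part of scikit-learn, and the thresholds are changed to suit five tiny documents.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.1, max_df=0.8):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Count each term once per document, however often it occurs inside it.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, c in doc_freq.items() if min_df <= c / n_docs <= max_df)


# Made-up, already tokenized "reviews".
docs = [
    'album great guitar',
    'album dull guitar',
    'album great synth',
    'album lush',
    'album great',
]

kept = filter_by_document_frequency(docs, min_df=0.3, max_df=0.8)
print(kept)
```

Here `album` is dropped because it appears in every document, and `dull`, `synth`, and `lush` are dropped because each appears in only one. The same logic removes both review boilerplate and one-off words from the Pitchfork vocabulary before XGBoost sees it.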

```Python
from xgboost.sklearn import XGBClassifier
xgb = XGBClassifier()
%time xgb.fit(X_train, y_train)
print('Accuracy:', round(xgb.score(X_test, y_test) * 100, 2), '%')
```

```
CPU times: user 13.4 s, sys: 57.2 ms, total: 13.4 s
Wall time: 13.4 s
Accuracy: 93.31 %
```

```Python
preds = xgb.predict_proba(X_test)
print('AUC-ROC Score:', round(roc_auc_score(y_test, [item[1] for item in preds]) * 100, 2), '%')
```

```
AUC-ROC Score: 78.05 %
```


## Some more Advanced NLP

In this section, I plan to do plenty of low-level NLP. This includes - 

1. Removing stop words and punctuation.

2. Filtering by part-of-speech tags in order to keep adjectives, adverbs, and verbs.

3. Lemmatizing text (without Stemming)

```Python
text = df['Abstract'] + ' ' + df['Article']
```

```Python
def process(x):
    """Gets adjectives, adverbs, and verbs."""
    import nltk
    import string
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer
    
    tokens = word_tokenize(x)
    stop_words = set(stopwords.words('english') + list(string.punctuation))
    
    # Unique tokens that are neither stop words nor punctuation.
    filtered = list(set(tokens) - stop_words)
    
    # Keep tags whose first letter is J, R, or V (adjectives, adverbs, verbs).
    word_types = ('J', 'R', 'V')
    tagged = nltk.pos_tag(filtered)
    
    relevant_words = [word for word, tag in tagged if tag[0] in word_types]
    
    lemma = WordNetLemmatizer()
    
    return [lemma.lemmatize(word) for word in relevant_words]
```
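
The tag filter is the heart of `process`, so here it is in isolation. To keep the sketch runnable without downloading any NLTK data, the `(word, tag)` pairs below are written by hand for illustration; in the real pipeline they come from `nltk.pos_tag`, which uses Penn Treebank tags.

```python
# First letters of the Penn Treebank tags for adjectives (JJ*), adverbs (RB*), and verbs (VB*).
word_types = ('J', 'R', 'V')

# Hand-written (word, tag) pairs, standing in for the output of nltk.pos_tag.
tagged = [
    ('voice', 'NN'),
    ('shines', 'VBZ'),
    ('melancholic', 'JJ'),
    ('uniquely', 'RB'),
    ('arrangements', 'NNS'),
    ('affecting', 'VBG'),
]

# Keep a word only if its tag starts with one of the chosen letters.
relevant_words = [word for word, tag in tagged if tag.startswith(word_types)]
print(relevant_words)
```

Matching on the first letter is a cheap way to cover every variant of a tag (`JJ`, `JJR`, `JJS`, and so on). One side effect worth knowing about is that the letter `R` also matches `RP`, the tag for particles, so a few non-adverbs can slip through.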

An example of text before processing.

```Python
text[0]
```

"Inspired by the cosmos, Japanese Breakfast’s Michelle Zauner addresses life on Earth. Her voice shines over melancholic arrangements, evoking Pacific Northwest indie rock as much as shoegaze. Michelle Zauner’s first album as Japanese Breakfast, 2016’s Psychopomp, was a meditation on grief in the wake of her mother’s death from cancer, as well as a raw portrayal of sexuality and heartache. That these subjects could coexist in the same space isn’t unusual (death and sex often mix, especially at the edge of human experience), but Zauner’s gift for connecting specific details to simple metaphor was uniquely affecting. “The dog’s confused/She just paces ‘round all day/She’s sniffing at your empty room,” she sang on “In Heaven.” Then, on “Jane Cum”: “Soulless animal keep feeding on my meat/All my tiny bones between your teeth.”\nWhile Psychopomp focused on the most intimate human experiences, her new album, Soft Sounds From Another Planet, uses big guitars and melancholic arrangements to ad

An example of text after processing.

```Python
process(text[0])
```

['reimagined',
 'back',
 'shoegaze',
 'dog',
 'Japanese',
 'musical',
 'highlighted',
 'mythologizing',
 '‘',
 'portrayal',
 'opening',
 'help',
 'call',
 'whirl',
 'co-producer',
 'non-binary',
 'coexist',
 'album—',
 'set',
 'torch',
 'actually',
 'Instead',
 'animal',
 'atmospheric',
 'old',
 'instantly',
 'move',
 'world',
 'strip',
 'biggest',
 'even',
 'leap',
 'ever',
 'free',
 'first',
 'forward',
 "'m",
 'keep',
 'also',
 'opened',
 'title',
 'races.',
 'writhing/Knuckled',
 'different',
 'sheen',
 'normalize',
 'want',
 'made',
 'trope',
 'new',
 'household',
 'pain',
 'answer',
 'uniquely',
 'lover',
 'general',
 'focused',
 'profound',
 'grief',
 'self-destructive',
 'get',
 'female',
 'tiny',
 'go',
 'scream',
 'inadvertently',
 'affecting',
 'encapsulates',
 'justified',
 'grin',
 'asks',
 'evoking',
 'digital',
 'uncomfortably',
 'grew',
 'seeming',
 'abusing',
 'wish',
 'turn',
 'Out',
 'told',
 'grieving',
 'diving',
 'create',
 'Philly',
 'lost',
 'mother',
 'see',
 '

```Python
ad_nlp = df[['Text', 'Score_Binary']]
ad_nlp.head()
```

| Date | Text | Score_Binary |
|---|---|---|
| 2017-07-18 | Inspired by the cosmos, Japanese Breakfast’s M... | 1 |
| 2017-07-18 | Alex Crossan’s debut album is a love letter to... | 1 |
| 2017-07-18 | Featuring members of Merchandise, the Ukiah Dr... | 1 |
| 2017-07-18 | With songs that travel great distances between... | 1 |
| 2017-07-17 | Held up by two good-to-great songs, the lightw... | 1 |


```Python
X_train, X_test, y_train, y_test = train_test_split(ad_nlp['Text'], ad_nlp['Score_Binary'], random_state=42, stratify=ad_nlp['Score_Binary'])
```

```Python
X_train = X_train.to_frame()
X_test = X_test.to_frame()
```

```Python
X_train.head()
```

| Date | Text |
|---|---|
| 2016-04-22 | The Brooklyn quartet Primitive Weapons blend t... |
| 2000-04-18 | "Live albums always offer a precarious task fo... |
| 2017-03-22 | What is the difference between American and It... |
| 2002-12-03 | I don't speak Norwegian very well. For this re... |
| 2004-07-27 | Major label debut from these Swedish garage-ro... |


```Python
X_test.head()
```

| Date | Text |
|---|---|
| 2004-01-28 | Flip through your calendars and count the days... |
| 2005-04-27 | Latest from the pop depressive is a sprawling,... |
| 2007-07-25 | Self-proclaimed "gypsy punks" offer another di... |
| 2009-10-21 | The 1980s legends continue their lengthy secon... |
| 2010-03-17 | Frenchkiss, lately locating quality bands with... |


Processing all the text in the entire DataFrame.

```Python
X_train['Text_Tokenized'] = X_train.Text.apply(process)
X_train.head()
```

| Date | Text | Text_Tokenized |
|---|---|---|
| 2016-04-22 | The Brooklyn quartet Primitive Weapons blend t... | [mosh, pulse, touring, 'm, brain, Together, bo... |
| 2000-04-18 | "Live albums always offer a precarious task fo... | [combine, Rocked, 'm, giant, mattered, nice, n... |
| 2017-03-22 | What is the difference between American and It... | [back, keying, celestial, musical, organ, rele... |
| 2002-12-03 | I don't speak Norwegian very well. For this re... | [sound, 'm, path, cultural, yet, somewhat, far... |
| 2004-07-27 | Major label debut from these Swedish garage-ro... | [Is, based, debut, clink, wily, yet, attitude,... |


```Python
X_test['Text_Tokenized'] = X_test.Text.apply(process)
X_test.head()
```

| Date | Text | Text_Tokenized |
|---|---|---|
| 2004-01-28 | Flip through your calendars and count the days... | [impressive, cleared, sometimes, recognizable,... |
| 2005-04-27 | Latest from the pop depressive is a sprawling,... | [thought, back, grant, dry, haunt, contagious,... |
| 2007-07-25 | Self-proclaimed "gypsy punks" offer another di... | [somehow, distant, light, exceedingly, wear, b... |
| 2009-10-21 | The 1980s legends continue their lengthy secon... | [Do, punched, basically, motorcycle, two-word,... |
| 2010-03-17 | Frenchkiss, lately locating quality bands with... | [stunted, wedge, somehow, winter, oft-repeated... |


Getting frequency estimates of the entire dataset.

```Python
all_words = []
for words in X_train.Text_Tokenized:
    all_words += words
    
all_words = nltk.FreqDist(all_words)
```

```Python
all_words
```

FreqDist({'mosh': 19,
          'pulse': 419,
          'touring': 381,
          "'m": 2440,
          'brain': 15,
          'Together': 98,
          'bound': 64,
          'however': 1399,
          'express': 116,
          'brooding': 211,
          'self-titled': 640,
          'musicians—knocking': 1,
          'digital': 585,
          'includes': 638,
          'emphasized': 60,
          'perhaps': 1478,
          'groove': 646,
          'shell': 25,
          'act': 694,
          'ran': 108,
          'asking': 259,
          'hurt': 193,
          'album': 4613,
          'trapped': 138,
          'well': 3791,
          'notably': 274,
          'pressure-cooked': 1,
          'great': 2568,
          'breed': 35,
          'years—as': 1,
          'all-out': 25,
          'perfectly': 952,
          'flesh': 120,
          'traditional': 898,
          'repeating': 273,
          'striking': 431,
          'carving': 49,
          'recording': 1505,
          'spent': 

Procuring the n most frequent words to use as features.

```Python
def get_important_features(all_words, n=3000):
    sred = sorted(all_words.items(), key=lambda x : x[1])
    sred = sred[::-1]

    word_features = [word.lower() for (word, count) in sred[:n]]

    return word_features
```

```Python
word_features = get_important_features(all_words, n=4000)
len(word_features)
```

```
4000
```
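
One detail of `get_important_features` is worth a closer look: words are counted first and lowercased afterwards, so two spellings of the same word (say `Even` and `even`) can both land in the feature list. The sketch below reproduces the effect on a made-up token list using `collections.Counter`, and shows one possible alternative, lowercasing before counting. It is an illustration only, not the code behind the results in this post.

```python
from collections import Counter

# Made-up tokens with mixed capitalisation.
tokens = ['Even', 'even', 'even', 'Instead', 'instead', 'great']

# Count first, lowercase afterwards: case variants survive as separate entries.
counts = Counter(tokens)
lowered_after = [word.lower() for word, _ in counts.most_common(4)]

# Lowercase first, then count: case variants are merged into one entry.
merged = Counter(token.lower() for token in tokens)
lowered_before = [word for word, _ in merged.most_common(4)]

print('lowercased after counting: ', lowered_after)
print('lowercased before counting:', lowered_before)
```

With the second approach each slot in the feature list holds a distinct word, and the counts reflect every occurrence regardless of where the word sat in the sentence.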

```Python
from nltk.corpus import stopwords
import string
import pickle
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
```

A function I created to calculate ROC-AUC Score.

```Python
def get_roc_auc_score(model):
    from sklearn.metrics import roc_auc_score
    
    dists = model.prob_classify_many([features[0] for features in test_set])
    probs = [[dist.prob(0), dist.prob(1)] for dist in dists]
    
    probs_neg = [item[0] for item in probs]
    probs_pos = [item[1] for item in probs]
    
    actual = [features[1] for features in test_set]
    
    print("ROC-AUC Score: ", roc_auc_score(actual, probs_pos) * 100, "%")
```

A function I created to easily pickle models.

```Python
def pickle_it(model, model_name):
    import pickle
    # The with-block closes the file even if pickling fails.
    with open(model_name + '.pickle', 'wb') as save_model:
        pickle.dump(model, save_model)
    
def open_jar(model_name):
    import pickle
    with open(model_name + '.pickle', 'rb') as model_file:
        return pickle.load(model_file)
```

A helper function I created to make train and test sets.

```Python
def find_features(x, word_features):
    words = set(x)
    
    features = {}
    for word in word_features:
        features[word] = (word in words)
        
    return features
```
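
To show what the classifiers actually receive, here is `find_features` applied to a made-up review. The function is written as a dictionary comprehension but behaves the same as the version it follows; the feature list and tokens are invented for illustration.

```python
def find_features(x, word_features):
    # Map every feature word to True/False depending on whether the review contains it.
    words = set(x)
    return {word: (word in words) for word in word_features}


word_features = ['great', 'together', 'dull']   # made-up feature list
tokens = ['great', 'Together', 'guitar']        # made-up tokens from one review

features = find_features(tokens, word_features)
print(features)

# Lowercasing the tokens first lets 'Together' match the lowercase feature.
features_lowered = find_features([t.lower() for t in tokens], word_features)
print(features_lowered)
```

Each review therefore becomes a fixed-length record of word presence, with no counts and no word order. The membership test is case-sensitive, so a capitalised token only matches a lowercase feature if the tokens are lowercased first, as in the second call.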

Creating a train and test set for model building.

```Python
word_features
```

["n't",
 'make',
 'even',
 'first',
 'much',
 'sound',
 'come',
 'new',
 'also',
 'still',
 'best',
 'album',
 'get',
 'take',
 "'re",
 'never',
 'back',
 'little',
 'last',
 'go',
 'well',
 'band',
 'made',
 'good',
 'many',
 "'ve",
 'song',
 'often',
 'seems',
 'enough',
 'almost',
 'say',
 'always',
 'know',
 'le',
 'long',
 'really',
 'track',
 'second',
 'yet',
 'single',
 'ever',
 'better',
 'vocal',
 'early',
 'find',
 'give',
 'far',
 'hard',
 'seem',
 'great',
 'together',
 'least',
 "'m",
 'away',
 'got',
 'rather',
 'released',
 'live',
 'old',
 'making',
 'want',
 'musical',
 'actually',
 'set',
 'right',
 'feel',
 'going',
 'different',
 'full',
 'recorded',
 'quite',
 'debut',
 'pretty',
 'even',
 'acoustic',
 'whole',
 'probably',
 'open',
 'put',
 'think',
 'big',
 'turn',
 'real',
 'see',
 'instead',
 'especially',
 'already',
 'beat',
 'easy',
 'mostly',
 'recent',
 'hear',
 'later',
 'next',
 'trying',
 'love',
 'playing',
 'else',
 'sometimes',
 'left',
 'guitar',
 

```Python
train_set = [(find_features(tokens, word_features), cat) for tokens, cat in zip(X_train.Text_Tokenized, y_train)]
test_set = [(find_features(tokens, word_features), cat) for tokens, cat in zip(X_test.Text_Tokenized, y_test)]
```

### NLTK Naive Bayes Classifier

```Python
naive_bayes_classifier = nltk.NaiveBayesClassifier.train(train_set)
```

```Python
print('NLTK Naive Bayes Accuracy:', nltk.classify.accuracy(naive_bayes_classifier, test_set) * 100, '%')
```

```
NLTK Naive Bayes Accuracy: 86.3021420519 %
```

```Python
get_roc_auc_score(naive_bayes_classifier)
```

```
ROC-AUC Score:  86.60786765 %
```

```Python
naive_bayes_classifier.show_most_informative_features(15)
```

```
Most Informative Features
                 stately = True                1 : 0      =      9.9 : 1.0
              controlled = True                1 : 0      =      8.5 : 1.0
                 pulsing = True                1 : 0      =      7.8 : 1.0
              channeling = True                1 : 0      =      6.6 : 1.0
                spectral = True                1 : 0      =      6.5 : 1.0
              seamlessly = True                1 : 0      =      6.2 : 1.0
                 cracked = True                1 : 0      =      6.2 : 1.0
          characteristic = True                1 : 0      =      6.1 : 1.0
               sustained = True                1 : 0      =      5.8 : 1.0
                passable = True                0 : 1      =      5.7 : 1.0
              emphasizes = True                1 : 0      =      5.7 : 1.0
              rollicking = True                1 : 0      =      5.7 : 1.0
              fluttering = True                1 : 0      =      5.6 : 1.0
```

```Python
pickle_it(naive_bayes_classifier, 'naive_bayes_classifier')
naive_bayes_classifier = open_jar('naive_bayes_classifier')
```

### Multinomial Naive Bayes

In [608]:
MNB = SklearnClassifier(MultinomialNB())
MNB.train(train_set)

<SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>

In [None]:
print('Multinomial Naive Bayes Accuracy:', nltk.classify.accuracy(MNB, test_set) * 100, '%')

Multinomial Naive Bayes Accuracy: 87.6832018038 %


In [None]:
get_roc_auc_score(MNB)

In [None]:
pickle_it(MNB, 'mnb_classifier')
MNB = open_jar('mnb_classifier')

### Bernoulli Naive Bayes

In [None]:
BNB = SklearnClassifier(BernoulliNB())
BNB.train(train_set)

In [None]:
print('Bernoulli Naive Bayes Accuracy:', nltk.classify.accuracy(BNB, test_set) * 100, '%')

In [None]:
get_roc_auc_score(BNB)

In [None]:
pickle_it(BNB, 'bnb_classifier')
BNB = open_jar('bnb_classifier')

### Logistic Regression

In [None]:
logistic = SklearnClassifier(LogisticRegression())
logistic.train(train_set)

In [None]:
print('Logistic Regression Accuracy:', nltk.classify.accuracy(logistic, test_set) * 100, '%')

In [None]:
get_roc_auc_score(logistic)

In [None]:
pickle_it(logistic, 'logistic_classifier')
logistic = open_jar('logistic_classifier')

### Stochastic Gradient Descent

In [None]:
SGD = SklearnClassifier(SGDClassifier())
SGD.train(train_set)

In [None]:
print('Stochastic Gradient Descent Accuracy:', nltk.classify.accuracy(SGD, test_set) * 100, '%')

In [None]:
# get_roc_auc_score(SGD)

In [None]:
pickle_it(SGD, 'sgd_classifier')
SGD = open_jar('sgd_classifier')
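
The ROC-AUC call is commented out for this model. One likely reason (an assumption on my part, since `get_roc_auc_score` is defined elsewhere) is that `SGDClassifier` with its default hinge loss does not provide `predict_proba`. ROC-AUC only needs a score that ranks positive reviews above negative ones, so margin scores such as those returned by scikit-learn's `decision_function` would serve just as well. The sketch below computes the metric from raw scores in plain Python; the function name and the toy numbers are hypothetical.

```python
def roc_auc_from_scores(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one example of each class")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and unbounded margin scores (not probabilities), for illustration only
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [2.1, -0.3, 0.4, 0.9, 1.5]

toy_auc = roc_auc_from_scores(toy_labels, toy_scores)
print("ROC-AUC Score: ", round(toy_auc * 100, 2), "%")
```

The pairwise loop is quadratic in the number of reviews, which is fine for a sanity check but not for a large test set.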

### Support Vector Classifier

In [None]:
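# The variable name SVC shadows the sklearn class of the same name,
# so reference the class through its module to keep this cell re-runnable.
from sklearn import svm
SVC = SklearnClassifier(svm.SVC())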
SVC.train(train_set)

In [None]:
print('Support Vector Classifier Accuracy:', nltk.classify.accuracy(SVC, test_set) * 100, '%')

In [None]:
# get_roc_auc_score(SVC)

In [None]:
pickle_it(SVC, 'svc_classifier')
SVC = open_jar('svc_classifier')

### Linear Support Vector Classifier

In [None]:
LSVC = SklearnClassifier(LinearSVC())
LSVC.train(train_set)

In [None]:
print('Linear Support Vector Classifier Accuracy:', nltk.classify.accuracy(LSVC, test_set) * 100, '%')

In [None]:
# get_roc_auc_score(LSVC)

In [None]:
pickle_it(LSVC, 'lsvc_classifier')
LSVC = open_jar('lsvc_classifier')

## VoteClassifier

Finally, I plan to aggregate my models into a single majority-vote ensemble. The idea is that even if voting mildly hurts my accuracy, it should make the predictions more robust and improve my ROC-AUC score.

In [None]:
from nltk.classify import ClassifierI

In [None]:
from statistics import mode

class VoteClassifier(ClassifierI):
    """Majority-vote ensemble over several already-trained classifiers."""
    
    def __init__(self, *classifiers):
        self.classifiers = classifiers
    
    def _votes(self, features):
        # One predicted label per underlying classifier
        return [model.classify(features) for model in self.classifiers]
    
    def classify(self, features):
        # The most common label wins; an odd number of voters avoids ties
        return mode(self._votes(features))
    
    def confidence(self, features):
        # Fraction of classifiers that agree with the winning label
        votes = self._votes(features)
        return votes.count(mode(votes)) / len(votes)

In [None]:
voted_classifier = VoteClassifier(naive_bayes_classifier, MNB, logistic, SGD, SVC)

In [None]:
print('Voted Classifier Accuracy:', nltk.classify.accuracy(voted_classifier, test_set) * 100, '%')
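
The majority-vote idea behind `VoteClassifier` can be checked without any trained models. The sketch below is an illustration under my own assumptions: `ConstantClassifier` and `vote` are hypothetical helpers, and `collections.Counter` stands in for `statistics.mode` when picking the winning label.

```python
from collections import Counter


class ConstantClassifier:
    """Hypothetical stand-in for a trained model: always predicts one label."""

    def __init__(self, label):
        self.label = label

    def classify(self, features):
        return self.label


def vote(classifiers, features):
    """Return the majority label and the share of classifiers that chose it."""
    votes = [clf.classify(features) for clf in classifiers]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)


# Five voters: three predict a positive review (1), two a negative one (0)
stubs = [ConstantClassifier(label) for label in (1, 1, 0, 1, 0)]

label, confidence = vote(stubs, {"best": True})
print(label, confidence)
# 1 0.6
```

Using an odd number of voters, as with the five models passed to `voted_classifier`, means a two-class vote can never end in a tie.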