# Valence Prediction Conclusions:

### Testing Gradient Boosting Regressors

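Increasing the sample size changed the results only slightly.
R2 rose by 0.018 (from 0.080 to 0.098) between 500 and 1687 samples per genre, and to 0.123 on the full data set; MSE stayed at roughly 0.052-0.053.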

on sample size 500/genre and NO hyperparameter tuning (out of the box):
- [GB] Mean Squared Error: 0.05264719893512396
- [GB] R2: 0.08043425668821236

on sample size 1687/genre with optimal hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}:
- [GB] Mean Squared Error: 0.05200244352093515
- [GB] R2: 0.09854691815352778

on full data set with optimal hyperparameters:
- [GB] Mean Squared Error: 0.05284822264086245
- [GB] R2: 0.12264012250017176

#### ** GB runs MUCH faster than the RF regressor and reaches its best R2 on the full data set (0.123), though RF's R2 is still higher there (0.151)

---

### Testing Random Forest Regressors

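Increasing the sample size changed the results only slightly.
R2 rose by about 0.013 (from 0.108 to 0.121) between 500 and 1687 samples per genre, and to 0.151 on the full data set; MSE stayed at roughly 0.051-0.052.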

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

on full data set with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05115798019890316
- [RF] R2: 0.15070068589697727

---

### Issues:
- stemming created gibberish
- GridSearchCV takes 2+ hours to run (up to 4)
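
The grid-search run time follows from simple counting: `GridSearchCV` fits one model per parameter combination per fold, plus one final refit on the whole training set. A rough sketch of that arithmetic for the two grids used later in this notebook (the helper name `count_fits` is hypothetical, introduced only for illustration):

```python
from math import prod

def count_fits(param_grid, cv):
    """Number of model fits GridSearchCV performs: combinations x folds, plus one final refit."""
    n_combinations = prod(len(values) for values in param_grid.values())
    return n_combinations * cv + 1

# Grids copied from the Random Forest and Gradient Boosting cells of this notebook
rf_param_grid = {'n_estimators': [10, 50, 100, 500, 1000],
                 'bootstrap': [True, False]}
gb_param_grid = {'n_estimators': [100, 1000, 1500],
                 'learning_rate': [0.1, 0.3, 0.5],
                 'max_depth': [3, 8, 16, 32]}

rf_fits = count_fits(rf_param_grid, cv=3)   # 10 combinations x 3 folds + 1 refit
gb_fits = count_fits(gb_param_grid, cv=3)   # 36 combinations x 3 folds + 1 refit
print(rf_fits, gb_fits)
```

Passing `n_jobs=-1` to `GridSearchCV` runs these fits in parallel across CPU cores, and `RandomizedSearchCV` caps the number of combinations tried. Both are standard scikit-learn options, though neither was used in the runs recorded here.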

In [2]:
# Import pandas for data handling
import pandas as pd

# NLTK is our Natural Language Toolkit
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
# Store the stopwords in a set so the membership checks in remove_stopwords are fast
stopwords = set(stopwords.words('english'))

# Libraries for helping us with strings
import string
# Regular Expression Library
import re

# Import text vectorizers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Import classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold

#Import Regressor Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Import some ML helper function
from sklearn.model_selection import train_test_split, GridSearchCV
# plot_confusion_matrix was removed in scikit-learn 1.2 and is not used in this regression notebook,
# so it is not imported here; classification_report is imported with the metrics below.


# Import our metrics to evaluate our model
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score


# Library for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse as sparse

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
df = pd.read_csv('../data/labeled_lyrics_w_genres.csv')

# Inspecting The Data

In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,artist,seq,song,label,genre
0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B
1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63,Pop
2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24,R&B
3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536,R&B
4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371,R&B


In [16]:
df.isnull().sum().sum()

0

In [17]:
df.duplicated().sum()

0

In [18]:
df.shape

(58100, 6)

In [19]:
df.genre.value_counts()

No_genre     21069
Pop          20691
Rock          9783
Country       2503
Rap           2311
R&B           1687
Non-Music       56
Name: genre, dtype: int64

### removing No_genre and Non-Music

In [4]:
# Index labels of the rows whose genre is 'No_genre' or 'Non-Music'
rows_to_drop = df[df['genre'].isin(['No_genre', 'Non-Music'])].index
df.drop(rows_to_drop, inplace=True, axis='index')

In [21]:
print(df.shape)
df.head(15)

(36975, 6)


Unnamed: 0.1,Unnamed: 0,artist,seq,song,label,genre
0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B
1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63,Pop
2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24,R&B
3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536,R&B
4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371,R&B
5,5,Elijah Blake,I just want to ready your mind\r\n'Cause I'll ...,Uno,0.321,R&B
7,7,Elis,Dieses ist lange her.\r\nDa ich deine schmalen...,Abendlied,0.333,Pop
8,8,Elis,A child is born\r\nOut of the womb of a mother...,Child,0.506,Pop
9,9,Elis,Out of the darkness you came \r\nYou looked so...,Come to Me,0.179,Pop
10,10,Elis,Each night I lie in my bed \r\nAnd I think abo...,Do You Believe,0.209,Pop


In [5]:
df.genre.value_counts()

Pop        20691
Rock        9783
Country     2503
Rap         2311
R&B         1687
Name: genre, dtype: int64

---

# Data Cleaning (Text Pre Processing)

In [6]:
# 1. function that makes all text lowercase.
def make_lowercase(test_string):
    return test_string.lower()

# 2. function that removes all punctuation. 
def remove_punc(test_string):
    test_string = re.sub(r'[^\w\s]', '', test_string)
    return test_string

# 3. function that removes all stopwords.
def remove_stopwords(test_string):
    # Break the sentence down into a list of words
    words = word_tokenize(test_string)
    
    # Make a list to append valid words into
    valid_words = []
    
    # Loop through all the words
    for word in words:
        
        # Check if word is not in stopwords. Stopwords was imported from nltk.corpus
        if word not in stopwords:
            
            # If word not in stopwords, append to our valid_words
            valid_words.append(word)

    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string

# 4. function to break words into their stem words
def stem_words(a_string):
    # Initialize our Stemmer
    porter = PorterStemmer()
    
    # Break the sentence down into a list of words
    words = word_tokenize(a_string)
    
    # Make a list to append valid words into
    valid_words = []

    # Loop through all the words
    for word in words:
        # Stem the word
        stemmed_word = porter.stem(word) #from nltk.stem import PorterStemmer
        
        # Append stemmed word to our valid_words
        valid_words.append(stemmed_word)
        
    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string 

In [7]:
# Pipeline function 

def text_processing_pipeline(a_string):
    a_string = make_lowercase(a_string)
    a_string = remove_punc(a_string)
    #a_string = stem_words(a_string)  # disabled for now: stemming turned the lyrics into gibberish
    a_string = remove_stopwords(a_string)
    return a_string

In [8]:
# apply preprocessing pipeline 

df['seq_clean'] = df['seq'].apply(text_processing_pipeline)

In [30]:
df.head()

Unnamed: 0.1,Unnamed: 0,artist,seq,song,label,genre,seq_clean
0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B,aint ever trapped bando oh lord dont get wrong...
1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63,Pop,drinks go smoke goes feel got let go cares get...
2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24,R&B,dont live planet earth found love venus thats ...
3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536,R&B,trippin grigio mobbin lights low trippin grigi...
4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371,R&B,see midnight panther gallant brave found found...


In [9]:
X = df['seq_clean'].values

y = df['label'].values

# Sampling smaller batches from dataframe for faster testing

In [16]:
# function to randomly sample k rows from each genre for smaller, faster model testing

def genre_sample(dataframe, k):
    genres = ['R&B', 'Pop', 'Rap', 'Rock', 'Country']

    # DataFrame.append was removed in pandas 2.0, so collect the per-genre samples
    # in a list and concatenate them once (the original row index is preserved)
    samples = [dataframe[dataframe["genre"] == genre].sample(n=k) for genre in genres]

    return pd.concat(samples)
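
An alternative sketch of the same per-genre sampling using `groupby(...).sample(...)` (available in pandas 1.1 and later), with a fixed `random_state` so that repeated runs draw the same rows. The toy DataFrame and the name `genre_sample_grouped` are illustrative assumptions, not part of the notebook. Note that it samples from every genre present in the DataFrame rather than from a fixed list.

```python
import pandas as pd

def genre_sample_grouped(dataframe, k, random_state=42):
    """Draw k rows at random from every genre; the fixed seed makes the draw repeatable."""
    return dataframe.groupby('genre').sample(n=k, random_state=random_state)

# Toy data standing in for the lyrics DataFrame (hypothetical values)
toy = pd.DataFrame({
    'genre': ['Pop'] * 5 + ['Rock'] * 4 + ['Rap'] * 3,
    'label': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.15, 0.25, 0.35],
})

toy_sampled = genre_sample_grouped(toy, k=2)
print(toy_sampled['genre'].value_counts())
```

A fixed seed would also make the 500/genre and 1687/genre results comparable across models, since the RF and GB runs recorded below each drew their own random sample.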

In [17]:
# sampling from the dataframe, k is the # of samples from each genre

df_sampled = genre_sample(df, k=500)
print(df_sampled.shape)
df_sampled.head(10)

(2500, 7)


Unnamed: 0.1,Unnamed: 0,artist,seq,song,label,genre,seq_clean
51429,79586,Goapele,Nobody knows what I go through\nIndecisions ma...,Catch 22,0.44,R&B,nobody knows go indecisions made passive say w...
45287,35483,Wilson Pickett,If you need a little lovin'\r\nCall on me all ...,Danger Zone,0.835,R&B,need little lovin call right want little huggi...
27971,66124,Natalie Cole,What a day this has been!\r\nWhat a rare mood ...,Almost Like Being in Love,0.686,R&B,day rare mood im almost like love theres smile...
37704,34400,Christina Aguilera,I feel like I've been locked up tight\nFor a c...,Genie in a Bottle,0.912,R&B,feel like ive locked tight century lonely nigh...
57183,52614,Prince,With the accurate understanding of God and His...,Rainbow Children,0.2,R&B,accurate understanding god law went work build...
18289,70229,Ray Charles,After one whole quart of brandy\r\nLike a dais...,Bewitched,0.323,R&B,one whole quart brandy like daisy im awake bro...
3084,3084,LL Cool J,"I'ma give all why y'all something, word up\r\n...",This Is Us,0.594,R&B,ima give yall something word word live cats go...
8045,8045,Ray Charles,"If you don't want, you don't have to (get in t...",Leave My Woman Alone,0.842,R&B,dont want dont get trouble dont want dont get ...
39006,114455,Kelly Rowland,You're the only one I see\r\nIt's like no one'...,Feelin Me Right Now,0.625,R&B,youre one see like ones vip know bed im tonigh...
38011,42985,T-Pain,I-I-I'm back!\nFreaknik's back baby!\n\nWhat's...,Freaknik Is Back,0.319,R&B,iiim back freakniks back baby whats happen mo ...


In [18]:
#checking correct amounts of samples per genre were obtained

df_sampled.genre.value_counts()

Rock       500
Country    500
R&B        500
Pop        500
Rap        500
Name: genre, dtype: int64

# Testing Regression Models for label prediction:
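`label` is a float on a 0-1 scale that represents valence (the prediction test further down treats values below 0.5 as negative and the rest as positive).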
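
Since every model below is scored with MSE and R2, it helps to keep their definitions in view. A minimal pure-Python sketch follows; the function names and toy numbers are illustrative, and the notebook itself uses scikit-learn's `mean_squared_error` and `r2_score`. R2 compares the model's squared error with that of always predicting the mean label, so R2 = 0 means no better than the mean and R2 = 1 means a perfect fit.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy valence labels (hypothetical)
y_true = [0.2, 0.4, 0.6, 0.8]
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean

print(mse(y_true, mean_baseline), r2(y_true, mean_baseline))
print(r2(y_true, y_true))
```

Read this way, an R2 of roughly 0.1 means the model removes about 10% of the squared error of the predict-the-mean baseline.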



## Random Forest Regressor

### using the sampled dataset for faster testing

In [20]:
X_sampled = df_sampled['seq_clean'].values

y_sampled = df_sampled['label'].values

In [21]:
X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(X_sampled, y_sampled, 
                                                                             test_size=0.33, random_state=42)

In [22]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train_sample)

X_train_sample = vectorizer.transform(X_train_sample)
X_test_sample = vectorizer.transform(X_test_sample)

print(X_train_sample.shape, type(X_train_sample))

(1675, 19280) <class 'scipy.sparse.csr.csr_matrix'>


In [23]:
# parameter grid to search for the best RandomForestRegressor hyperparameters

param_grid = {'n_estimators': [10, 50, 100, 500, 1000], 
              'bootstrap': [True, False],
             }

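# there are more parameters to test, but the larger grid below raised errors and needs more investigation
# (one likely cause: min_samples_split must be an integer >= 2 or a float in (0, 1], so the original
#  value 0 is rejected by scikit-learn; it is left out of the list below)

# param_grid = {'criterion': ['squared_error', 'absolute_error', 'poisson'],
#               'n_estimators': [10, 50, 100, 500, 1000], 
#               'max_depth': [2, 4, 8, 16, 32, 64], 
#               'min_samples_leaf': [1, 10, 25, 50],
#               'bootstrap': [True, False],
#               'min_samples_split': [2, 4, 8, 16, 32]
#              }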

### *Do not run the next cell unless you have 2+ hours to kill*

In [69]:
print('Running Grid Search...')

# 1. Create a RandomForestRegressor model object without supplying arguments. 

rf_regressor = RandomForestRegressor()

# 2. Run a Grid Search with 3-fold cross-validation and assign the output to the object 'rf_grid'.
#    * Pass the model and the parameter grid to GridSearchCV()
#    * Set the number of folds to 3
#    * Specify the scoring method

rf_grid = GridSearchCV(estimator=rf_regressor, param_grid = param_grid, cv=3, scoring='r2')

# 3. Fit the model (use the 'grid' variable) on the training data and assign the fitted model to the 
#    variable 'rf_grid_search'

rf_grid_search = rf_grid.fit(X_train_sample, y_train_sample)


print('Done')

Running Grid Search...
Done


In [24]:
# finding best parameters for the Random Forest Regressor

best_score = rf_grid_search.best_score_
print("The best score is: ", best_score)

rf_best_params = rf_grid_search.best_params_
print("The best params are: ", rf_best_params)

# conclusion: n_estimators=1000 and bootstrap=True were the best hyperparameters

NameError: name 'rf_grid_search' is not defined

In [76]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model1 = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model1.fit(X_train_sample, y_train_sample)

RandomForestRegressor(n_estimators=1000)

In [1]:
y_sample_pred = rf_model1.predict(X_test_sample)

rf_mse = mean_squared_error(y_test_sample, y_sample_pred)
rf_r2 = r2_score(y_test_sample, y_sample_pred)

print('on sample size of 500/genre with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse))
print('[RF] R2: {0}'.format(rf_r2))

# on sample size of 500/genre with optimal hyperparameters:
# - [RF] Mean Squared Error: 0.05222244210798196
# - [RF] R2: 0.10795588207769391

NameError: name 'rf_model1' is not defined

In [10]:
# Function to test the predictions of the model with NEW unseen text (not part of testing set)

def rgrg_string_test(lyrics):
    new_lyrics = text_processing_pipeline(lyrics)
    print("the processed lyrics are: ", new_lyrics)
    
    new_text_vectorized = vectorizer.transform([new_lyrics])
    
    # predict() returns an array with one element per input, so take the single prediction
    value = rf_model1.predict(new_text_vectorized)[0]
    print("Random Forest Regressor model gives a value of: ", value)
    if value < 0.5:
        print("which is negative")
    else:
        print("which is positive")

In [11]:
test_text1 = "Hit me baby one more time my lonliness is killing me and I must confess I still believe"
test_text2 = "Oh, baby, when you talk like that You make a woman go mad So be wise and keep on Reading the signs of my body"
test_text3 = "looking out on the pouring rain I used to feel so uninspired"
test_text4 = "Girl put your record on tell me your favorite song just go ahead let your hair down"

rgrg_string_test(test_text1)
print('\n')
rgrg_string_test(test_text2)
print('\n')
rgrg_string_test(test_text3)
print('\n')
rgrg_string_test(test_text4)

the processed lyrics are:  hit baby one time lonliness killing must confess still believe


NameError: name 'vectorizer' is not defined

# Running Larger RF Test on 1687 samples from each Genre

- to-do: break this testing out into a function instead of repeating code 
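
One possible shape for the helper mentioned in the to-do above, as a sketch only: it wraps the split, TF-IDF fit, model fit, and scoring steps that the cells below repeat. The name `evaluate_regressor`, its parameters, and the toy lyrics are assumptions introduced for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_regressor(texts, targets, model, test_size=0.33, random_state=42):
    """Split, vectorize (fit on the training split only), fit the model, and score it."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, targets, test_size=test_size, random_state=random_state)
    vec = TfidfVectorizer()
    X_tr_vec = vec.fit_transform(X_tr)   # vocabulary is learned from the training text only
    X_te_vec = vec.transform(X_te)
    model.fit(X_tr_vec, y_tr)
    y_hat = model.predict(X_te_vec)
    return {'mse': mean_squared_error(y_te, y_hat),
            'r2': r2_score(y_te, y_hat),
            'model': model,
            'vectorizer': vec}

# Toy stand-in for df_sampled['seq_clean'] and df_sampled['label'] (hypothetical data)
toy_texts = ['happy sunny dance', 'sad lonely rain', 'love happy smile',
             'cry sad alone', 'dance party happy', 'dark rain sad'] * 5
toy_labels = [0.9, 0.1, 0.8, 0.2, 0.95, 0.15] * 5

result = evaluate_regressor(toy_texts, toy_labels,
                            RandomForestRegressor(n_estimators=20, random_state=0))
print(result['mse'], result['r2'])
```

Returning the fitted vectorizer alongside the model keeps the pair together, which avoids transforming new text with a vectorizer that was fitted on a different sample.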

In [101]:
# sampling from the dataframe, k is 1687 which is the max number of samples from R&B the smallest Genre pool 

df_sampled2 = genre_sample(df, k=1687)
print(df_sampled2.shape)
df_sampled2.head(10)

(8435, 7)


Unnamed: 0.1,Unnamed: 0,artist,seq,song,label,genre,seq_clean
51095,17518,Donna Summer,Here I am on my own again\nThe days rush by\nT...,On My Honor,0.156,R&B,days rush nights seems slow guess ive let way ...
43146,156269,Jagged Edge,"[JD (JE)]\r\n(Girl I got it)\r\nShake, shake i...",I Got It,0.881,R&B,jd je girl got shake shake baby shake shake sh...
9224,144417,Freddie Hubbard,Skylark\r\nHave you anything to say to me?\r\n...,Skylark,0.103,R&B,skylark anything say wont tell love meadow mis...
38011,42985,T-Pain,I-I-I'm back!\nFreaknik's back baby!\n\nWhat's...,Freaknik Is Back,0.319,R&B,iiim back freakniks back baby whats happen mo ...
20314,134197,Sky,Cause I just\r\nCause I just\r\nCould you be h...,Push,0.269,R&B,cause cause could holding something maybe im y...
31346,156280,Jagged Edge,I see you sitting there\r\nLooking like you gl...,Dance Floor,0.737,R&B,see sitting looking like glued chair party goi...
49569,70519,Ray J,Is a precious lil girl n such a pretty lil gir...,Sex in the Rain,0.614,R&B,precious lil girl n pretty lil girl shes grown...
9933,35772,Otis Redding,I want to thank you for being so nice now \r\n...,I Want to Thank You,0.961,R&B,want thank nice want thank giving pride sweet ...
12183,120461,Tory Lanez,"Staring, looking at you from a long way\r\nPas...",High,0.387,R&B,staring looking long way passing ceilings keep...
32704,26366,Leonard Cohen,When it all went down\r\nAnd the pain came thr...,There for You,0.702,R&B,went pain came get dont ask know true get make...


In [102]:
#checking correct amounts of samples per genre were obtained

df_sampled2.genre.value_counts()

Rap        1687
R&B        1687
Country    1687
Pop        1687
Rock       1687
Name: genre, dtype: int64

In [103]:
X_sampled2 = df_sampled2['seq_clean'].values

y_sampled2 = df_sampled2['label'].values

In [104]:
X_train_sample2, X_test_sample2, y_train_sample2, y_test_sample2 = train_test_split(X_sampled2, y_sampled2, 
                                                                             test_size=0.33, random_state=42)

In [105]:
vectorizer2 = TfidfVectorizer()
vectorizer2.fit(X_train_sample2)

X_train_sample2 = vectorizer2.transform(X_train_sample2)
X_test_sample2 = vectorizer2.transform(X_test_sample2)

print(X_train_sample2.shape, type(X_train_sample2))

(5651, 35571) <class 'scipy.sparse.csr.csr_matrix'>


In [106]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model2 = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model2.fit(X_train_sample2, y_train_sample2)

RandomForestRegressor(n_estimators=1000)

In [112]:
y_sample_pred2 = rf_model2.predict(X_test_sample2)
y_sample_pred2

rf_mse2 = mean_squared_error(y_test_sample2, y_sample_pred2)
rf_r2_2 = r2_score(y_test_sample2, y_sample_pred2)

print('on sample size of 1687/genre with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse2))
print('[RF] R2: {0}'.format(rf_r2_2))

# on sample size 1687/genre with optimal hyperparameters:
# - [RF] Mean Squared Error: 0.05101035414841959
# - [RF] R2: 0.1205852034338123

[RF] Mean Squared Error: 0.05101035414841959
[RF] R2: 0.1205852034338123


# Conclusion on testing Random Forest Regressors

Not much was gained or lost from the increase in sample size.
R2 rose by about 0.013 (from 0.108 to 0.121), which is not a meaningful improvement.

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

# Running RF Test on Full Data Set 

In [25]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [26]:
# Initialize our vectorizer
vectorizer = TfidfVectorizer()

# 3. Fit your vectorizer using your X data
# This makes your vocab matrix
vectorizer.fit(X_train)

# 4. Transform your X data using your fitted vectorizer. 
# This transforms your documents into vectors.
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape, type(X))
print(type(X_train))

(24773, 66588) <class 'numpy.ndarray'>
<class 'scipy.sparse.csr.csr_matrix'>


In [27]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model.fit(X_train, y_train)

RandomForestRegressor(n_estimators=1000)

In [28]:
y_pred = rf_model.predict(X_test)
y_pred

rf_mse = mean_squared_error(y_test, y_pred)
rf_r2 = r2_score(y_test, y_pred)

print('on full data set with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse))
print('[RF] R2: {0}'.format(rf_r2))

[RF] Mean Squared Error: 0.05115798019890316
[RF] R2: 0.15070068589697727


In [2]:
#end of RF testing

---

# Gradient Boosting Regressor
Gradient boosting builds an ensemble by adding decision trees one at a time, with each new tree fit to correct the errors left by the trees before it.
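
To make that concrete, here is a minimal from-scratch sketch of boosting with squared-error loss, where each new shallow tree is fit to the residuals of the ensemble so far. It is an illustration on synthetic data, not scikit-learn's `GradientBoostingRegressor` implementation; the function names and the toy data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=3):
    """Fit trees sequentially; each tree targets the current residuals (y - prediction)."""
    base = float(np.mean(y))                 # start from the mean of the targets
    pred = np.full(len(y), base)
    trees, train_mse = [], []
    for _ in range(n_estimators):
        residuals = y - pred                 # negative gradient of 0.5 * squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)   # take a small step toward the residuals
        trees.append(tree)
        train_mse.append(float(np.mean((y - pred) ** 2)))
    return base, trees, train_mse

def predict_boosted(base, trees, X, learning_rate=0.1):
    """Add the shrunken tree outputs on top of the base prediction."""
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

# Synthetic regression problem (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees, train_mse = fit_boosted_trees(X_toy, y_toy)
print(train_mse[0], train_mse[-1])
```

The `learning_rate`, `max_depth`, and `n_estimators` arguments play the same roles as the hyperparameters tuned with `GridSearchCV` later in this section.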

In [29]:
# sampling from the dataframe, k is the # of samples from each genre

df_sampled3 = genre_sample(df, k=500)
print(df_sampled3.shape)
df_sampled3.head(10)


(2500, 7)


Unnamed: 0.1,Unnamed: 0,artist,seq,song,label,genre,seq_clean
27175,104823,Najee,La la la la la la lala la\r\nLa la la la la la...,Another Star,0.968,R&B,la la la la la la lala la la la la la la la la...
47700,124966,Andreya Triana,Wait for all the broken pieces to fade\r\nAnd ...,Song For A Friend,0.261,R&B,wait broken pieces fade lay head upon shoulder...
39277,76472,Wax Tailor,A record of the delightful piece they're going...,Que Sera,0.419,R&B,record delightful piece theyre going play even...
53230,129079,Roy Woods,"Gwan big up urself\r\nOhh, ohh, yeah girl\r\n\...",Gwan Big Up Urself,0.272,R&B,gwan big urself ohh ohh yeah girl come thru mi...
18050,35787,Otis Redding,"I know you told me, long time ago\r\nThat you ...",A Fool for You,0.511,R&B,know told long time ago didnt want yeah didnt ...
45833,70927,Monifah,"Yeah, uh uh\r\nYeah, uh uh\r\nDon't look at me...",Brown Eyes,0.781,R&B,yeah uh uh yeah uh uh dont look like cant resi...
28792,49433,Avant,"This girl was so fine, much more than two guns...",AV,0.475,R&B,girl fine much two guns pump brakes man looked...
22811,20123,Brandy,[Chorus: x2]\r\nWhat's a sunny day without you...,Sunny Day,0.915,R&B,chorus x2 whats sunny day without another 24 p...
43416,32723,Amanda Perez,"Oh Oh\r\nIn my life, my life\r\nEverytime I se...",In My Life,0.563,R&B,oh oh life life everytime see lose cool goin m...
42926,91147,The Time,"Hey, are y'all ready to party up in here?\nI s...",Dreamland,0.0324,R&B,hey yall ready party said ready party want put...


In [30]:
#checking correct amounts of samples per genre were obtained
df_sampled3.genre.value_counts()

Rock       500
Country    500
R&B        500
Pop        500
Rap        500
Name: genre, dtype: int64

In [33]:
X_sampled3 = df_sampled3['seq_clean'].values
y_sampled3 = df_sampled3['label'].values

In [34]:
X_train_sample3, X_test_sample3, y_train_sample3, y_test_sample3 = train_test_split(X_sampled3, y_sampled3, 
                                                                                    test_size=0.33, random_state=42)

In [35]:
vectorizer3 = TfidfVectorizer()
vectorizer3.fit(X_train_sample3)

X_train_sample3 = vectorizer3.transform(X_train_sample3)
X_test_sample3 = vectorizer3.transform(X_test_sample3)

print(X_train_sample3.shape, type(X_train_sample3))

(1675, 18951) <class 'scipy.sparse.csr.csr_matrix'>


In [36]:
gb = GradientBoostingRegressor()

In [37]:
gb.fit(X_train_sample3, y_train_sample3)

GradientBoostingRegressor()

In [38]:
y_pred_sample3 = gb.predict(X_test_sample3)

In [39]:
gb_mse = mean_squared_error(y_test_sample3, y_pred_sample3)
gb_r2 = r2_score(y_test_sample3, y_pred_sample3)

print('For a sample size of 500 and NO hyperparameter tuning:')
print('[GB] Mean Squared Error: {0}'.format(gb_mse))
print('[GB] R2: {0}'.format(gb_r2))

# Recorded results for a sample size of 500/genre and NO hyperparameter tuning:
# [GB] Mean Squared Error: 0.05264719893512396
# [GB] R2: 0.08043425668821236
# Compared with the RF run on 500/genre (MSE 0.0522, R2 0.108), the MSE is about the same
# and R2 is lower by about 0.03. The two runs drew different random samples,
# so the comparison is only approximate.

For a sample size of 500 and NO hyperparameter tuning:
[GB] Mean Squared Error: 0.05264719893512396
[GB] R2: 0.08043425668821236


### Hyperparameter Tuning of Gradient Boosting Regressor with GridSearchCV
- needs to run overnight

In [40]:
gb_param_grid= {'n_estimators': [100, 1000, 1500],
                'learning_rate' : [0.1, 0.3, 0.5],
                'max_depth': [3, 8, 16, 32]
                }

### *Do not run the next cell unless you have 2 hours to kill*

In [41]:
print("Running Grid Search ... ")

gb_regressor = GradientBoostingRegressor()

gb_grid = GridSearchCV(estimator = gb_regressor, param_grid= gb_param_grid, cv=3, scoring= 'r2')

print("Running the fit..")

gb_grid_search = gb_grid.fit(X_train_sample3, y_train_sample3)

print("Done.")

best_score3 = gb_grid_search.best_score_
print("The best score is: ", best_score3)

gb_best_params = gb_grid_search.best_params_
print("The best parameters are: ", gb_best_params)

# The best score is:  0.07331262579138897
# The best parameters are:  {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}

Running Grid Search ... 
Running the fit..
Done.
The best score is:  0.07331262579138897
The best parameters are:  {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}


# Running Larger GB Test on 1687 samples from each Genre
### with optimized hyper parameters {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}

In [42]:
# sampling from the dataframe, k is 1687 which is the max number of samples from R&B the smallest Genre pool 

df_sampled4 = genre_sample(df, k=1687)
print(df_sampled4.shape)
df_sampled4.head(10)

(8435, 7)


Unnamed: 0.1,Unnamed: 0,artist,seq,song,label,genre,seq_clean
582,582,Ella Fitzgerald,Walked with no one and talked with no one\r\nA...,On the Sunny Side of the Street,0.541,R&B,walked one talked one nothing shadows one morn...
39062,148849,Mariah Carey,When I am lost\nYou shine a light for me\nAnd ...,Music Box,0.036,R&B,lost shine light set free low wash away tears ...
55058,156062,Anthony Hamilton,"Mm, mm, mm\r\nMm, mm, yeah\r\nOh\r\nTheir used...",Ball and Chain,0.63,R&B,mm mm mm mm mm yeah oh used little old boy rid...
32270,106918,Jessie Ware,"You and I, blurred lines\r\nWe come together e...",Wildest Moments,0.483,R&B,blurred lines come together every time two wro...
10054,35778,Otis Redding,I was dancing with my baby\r\nTo that Tennesse...,Tennessee Waltz,0.477,R&B,dancing baby tennessee waltz old friend happen...
35160,46893,Faith Evans,I never knew there was a\nLove like this befor...,Love Like This,0.796,R&B,never knew love like never someone show love l...
30659,20142,Brandy,[Chorus]\r\nI never do anything that pleases y...,Apart,0.885,R&B,chorus never anything pleases maybe better apa...
9087,62603,Sarah Vaughan,"Got a little rhythm, a rhythm, a rhythm \r\nTh...",Fascinating Rhythm,0.72,R&B,got little rhythm rhythm rhythm pitapats brain...
4374,4374,The Brothers Johnson,I want to know\r\nJust how you feel\r\nSaid-a ...,Ill Be Good to You,0.93,R&B,want know feel saida want know feel real cause...
28444,150191,Kool & the Gang,"Sing it from the mountain, tell it to the peop...",In the Heart,0.93,R&B,sing mountain tell people sing mountain tell p...


In [44]:
#checking correct amounts of samples per genre were obtained

df_sampled4.genre.value_counts()

Country    1687
R&B        1687
Rap        1687
Pop        1687
Rock       1687
Name: genre, dtype: int64

In [45]:
X_sampled4 = df_sampled4['seq_clean'].values

y_sampled4 = df_sampled4['label'].values

In [46]:
X_train_sample4, X_test_sample4, y_train_sample4, y_test_sample4 = train_test_split(X_sampled4, y_sampled4, 
                                                                                    test_size=0.33, random_state=42)

In [47]:
vectorizer4 = TfidfVectorizer()
vectorizer4.fit(X_train_sample4)

X_train_sample4 = vectorizer4.transform(X_train_sample4)
X_test_sample4 = vectorizer4.transform(X_test_sample4)

print(X_train_sample4.shape, type(X_train_sample4))

(5651, 36249) <class 'scipy.sparse.csr.csr_matrix'>


In [48]:
gb2 = GradientBoostingRegressor(learning_rate=0.1, max_depth=3, n_estimators= 100)
gb2.fit(X_train_sample4, y_train_sample4)

GradientBoostingRegressor()

In [49]:
y_pred_sample4 = gb2.predict(X_test_sample4)

In [51]:
gb_mse2 = mean_squared_error(y_test_sample4, y_pred_sample4)
gb_r2_2 = r2_score(y_test_sample4, y_pred_sample4)

print('For a sample size of 1687 and optimal hyperparameters:')
print('[GB] Mean Squared Error: {0}'.format(gb_mse2))
print('[GB] R2: {0}'.format(gb_r2_2))

# For a sample size of 1687 and optimal hyperparameters:
# [GB] Mean Squared Error: 0.05200244352093515
# [GB] R2: 0.09854691815352778

For a sample size of 1687 and optimal hyperparameters:
[GB] Mean Squared Error: 0.05200244352093515
[GB] R2: 0.09854691815352778


# Running GB on full data set with optimal hyperparameters

In [53]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [54]:
# Initialize our vectorizer
vectorizer = TfidfVectorizer()

# 3. Fit your vectorizer using your X data
# This makes your vocab matrix
vectorizer.fit(X_train)

# 4. Transform your X data using your fitted vectorizer. 
# This transforms your documents into vectors.
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape, type(X))
print(type(X_train))

(24773, 66588) <class 'numpy.ndarray'>
<class 'scipy.sparse.csr.csr_matrix'>


In [55]:
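# The scikit-learn defaults (learning_rate=0.1, max_depth=3, n_estimators=100)
# are the same values the grid search selected, so no arguments are passed here
gb3 = GradientBoostingRegressor()
gb3.fit(X_train, y_train)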

GradientBoostingRegressor()

In [56]:
y_pred = gb3.predict(X_test)
gb3_mse = mean_squared_error(y_test, y_pred)
gb3_r2 = r2_score(y_test, y_pred)

print('On the full data set and optimal hyperparameter tuning:')
print('[GB] Mean Squared Error: {0}'.format(gb3_mse))
print('[GB] R2: {0}'.format(gb3_r2))
print("\n ...")

# On the full data set and optimal hyperparameter tuning:
# [GB] Mean Squared Error: 0.05284822264086245
# [GB] R2: 0.12264012250017176

On the full data set and optimal hyperparameter tuning:
[GB] Mean Squared Error: 0.05284822264086245
[GB] R2: 0.12264012250017176

 ...


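# Conclusion on testing Gradient Boosting Regressors

Increasing the sample size changed the results only slightly.
R2 rose by 0.018 (from 0.080 to 0.098) between 500 and 1687 samples per genre, and to 0.123 on the full data set; MSE stayed at roughly 0.052-0.053.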

on sample size 500/genre and NO hyperparameter tuning:
- [GB] Mean Squared Error: 0.05264719893512396
- [GB] R2: 0.08043425668821236

on sample size 1687/genre with optimal hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}:
- [GB] Mean Squared Error: 0.05200244352093515
- [GB] R2: 0.09854691815352778

on full data set with optimal hyperparameters:
- [GB] Mean Squared Error: 0.05284822264086245
- [GB] R2: 0.12264012250017176

** GB runs MUCH faster than the RF regressor and reaches its best R2 on the full data set (0.123), though RF's R2 is still higher there (0.151)

---
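copied from previous cells for comparison:

# Conclusion on testing Random Forest Regressors

Increasing the sample size changed the results only slightly.
R2 rose by about 0.013 (from 0.108 to 0.121) between 500 and 1687 samples per genre, and to 0.151 on the full data set; MSE stayed at roughly 0.051-0.052.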

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

on full data set with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05115798019890316
- [RF] R2: 0.15070068589697727