# Valence Prediction Conclusions:

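### Testing Gradient Boosting Regressors

Not much gained or lost from the increase in sample size: MSE stayed between 0.052 and 0.053 in every run. 
R2 rose by about 0.018 between 500 and 1687 samples per genre (0.080 to 0.099) and reached 0.123 on the full data set, which is not a large improvement. 

Note that the optimal hyperparameters found by the grid search match scikit-learn's defaults for GradientBoostingRegressor, so the tuned and out-of-the-box runs use the same settings.

on sample size 500/genre and NO hyperparameter tuning (out of the box):
- [GB] Mean Squared Error: 0.05264719893512396
- [GB] R2: 0.08043425668821236

on sample size 1687/genre with optimal hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}:
- [GB] Mean Squared Error: 0.05200244352093515
- [GB] R2: 0.09854691815352778

on full data set with optimal hyperparameters:
- [GB] Mean Squared Error: 0.05284822264086245
- [GB] R2: 0.12264012250017176

#### ** GB runs MUCH faster than RFreg, but RFreg produces the better R2 on the full data set (0.151 vs 0.123)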

---

### Testing Random Forest Regressors

Not much gained or lost from the increase in sample size: MSE stayed between 0.051 and 0.052 in every run. 
R2 rose from 0.108 to 0.121 between 500 and 1687 samples per genre and reached 0.151 on the full data set, which is a modest gain. 

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

on full data set with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05115798019890316
- [RF] R2: 0.15070068589697727

---

### Issues:
- stemming created gibberish
- GridSearchCV takes 2+ hours to run (up to 4)

---

### Try:
- turning valence into a classification problem instead of regression
    - round label values to the nearest tenth and you'll have roughly 10 classification categories (11 if both 0.0 and 1.0 occur)
- look at other regression metrics
    - however R2 may not be a good metric for this because the correlation between the features and the label is low
    - in scikit-learn, functions ending with _score return a value to maximize, the higher the better.
    - functions ending with _error or _loss return a value to minimize, the lower the better.
- try lemmatization instead of stemming
- try Word2Vec instead of TF-IDF
- look into this for speeding up grid search: https://scikit-learn.org/stable/modules/grid_search.html#grid-search 
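
The classification idea in the list above can be sketched as follows. This is a minimal illustration on made-up valence values, not on the notebook's data, and `valence_to_class` is a hypothetical helper name. It uses equal-width bins instead of rounding, because rounding to one decimal produces 11 distinct values (0.0 through 1.0) with the two end classes half as wide as the rest.

```python
import numpy as np

def valence_to_class(y, n_bins=10):
    """Map valence scores in [0, 1] to integer classes 0..n_bins-1 (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    # Equal-width bins: [0.0, 0.1) -> 0, [0.1, 0.2) -> 1, ..., [0.9, 1.0] -> 9
    classes = np.floor(y * n_bins).astype(int)
    # A score of exactly 1.0 would land in bin n_bins, so fold it into the top bin
    return np.clip(classes, 0, n_bins - 1)

# Made-up example values, not taken from the data set
example_valence = [0.0, 0.05, 0.24, 0.626, 0.999, 1.0]
example_classes = valence_to_class(example_valence)
print(example_classes)
```

The resulting integer classes could then be passed to one of the classifiers already imported in this notebook in place of the float `label` column.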


In [1]:
# Import pandas for data handling
import pandas as pd

# NLTK is our Natural Language Toolkit
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
# A set makes the membership checks in remove_stopwords fast
stopwords = set(stopwords.words('english'))

# Libraries for helping us with strings
import string
# Regular Expression Library
import re

# Import text vectorizers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Import classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold

#Import Regressor Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Import some ML helper functions
from sklearn.model_selection import train_test_split, GridSearchCV
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay


# Import our metrics to evaluate our model
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score


# Library for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse as sparse

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df = pd.read_csv('../data/labeled_lyrics_w_genres.csv')

# Inspecting The Data

In [3]:
print(df.isnull().sum().sum())
print(df.duplicated().sum())
print(df.shape)
print(df.genre.value_counts())
df.head()

0
0
(145250, 7)
Pop          57357
No_genre     42789
Rock         26756
Country       7440
Rap           5959
R&B           4773
Non-Music      176
Name: genre, dtype: int64


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,artist,seq,song,label,genre
0,0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B
1,1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63,Pop
2,2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24,R&B
3,3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536,R&B
4,4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371,R&B


### removing No_genre and Non-Music

In [4]:
# Index of the rows that have no usable genre label
rows_to_drop = df[df['genre'].isin(['No_genre', 'Non-Music'])].index
df.drop(rows_to_drop, inplace=True, axis='index')

In [5]:
print(df.shape)
print(df.genre.value_counts())
df.head(15)

(102285, 7)
Pop        57357
Rock       26756
Country     7440
Rap         5959
R&B         4773
Name: genre, dtype: int64


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,artist,seq,song,label,genre
0,0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B
1,1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63,Pop
2,2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24,R&B
3,3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536,R&B
4,4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371,R&B
5,5,5,Elijah Blake,I just want to ready your mind\r\n'Cause I'll ...,Uno,0.321,R&B
7,7,7,Elis,Dieses ist lange her.\r\nDa ich deine schmalen...,Abendlied,0.333,Pop
8,8,8,Elis,A child is born\r\nOut of the womb of a mother...,Child,0.506,Pop
9,9,9,Elis,Out of the darkness you came \r\nYou looked so...,Come to Me,0.179,Pop
10,10,10,Elis,Each night I lie in my bed \r\nAnd I think abo...,Do You Believe,0.209,Pop


---

# Data Cleaning (Text Pre Processing)

In [6]:
# 1. function that makes all text lowercase.
def make_lowercase(test_string):
    return test_string.lower()

# 2. function that removes all punctuation. 
def remove_punc(test_string):
    test_string = re.sub(r'[^\w\s]', '', test_string)
    return test_string

# 3. function that removes all stopwords.
def remove_stopwords(test_string):
    # Break the sentence down into a list of words
    words = word_tokenize(test_string)
    
    # Make a list to append valid words into
    valid_words = []
    
    # Loop through all the words
    for word in words:
        
        # Check if word is not in stopwords. Stopwords was imported from nltk.corpus
        if word not in stopwords:
            
            # If word not in stopwords, append to our valid_words
            valid_words.append(word)

    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string

# 4. function to break words into their stem words
def stem_words(a_string):
    # Initialize our Stemmer
    porter = PorterStemmer()
    
    # Break the sentence down into a list of words
    words = word_tokenize(a_string)
    
    # Make a list to append valid words into
    valid_words = []

    # Loop through all the words
    for word in words:
        # Stem the word
        stemmed_word = porter.stem(word) #from nltk.stem import PorterStemmer
        
        # Append stemmed word to our valid_words
        valid_words.append(stemmed_word)
        
    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string 

In [7]:
# Pipeline function 

def text_processing_pipeline(a_string):
    a_string = make_lowercase(a_string)
    a_string = remove_punc(a_string)
    #a_string = stem_words(a_string)  # disabled for now: stemming turned the lyrics into gibberish
    a_string = remove_stopwords(a_string)
    return a_string

In [8]:
# apply preprocessing pipeline 

df['seq_clean'] = df['seq'].apply(text_processing_pipeline)

In [9]:
df.head(1)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,artist,seq,song,label,genre,seq_clean
0,0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B,aint ever trapped bando oh lord dont get wrong...


In [10]:
X = df['seq_clean'].values
y = df['label'].values

# Sampling smaller batches from dataframe for faster testing

In [11]:
# function to randomly sample k rows from each genre for smaller, faster tests

def genre_sample(dataframe, k, random_state=None):
    genres = ['R&B', 'Pop', 'Rap', 'Rock', 'Country']
    
    # DataFrame.append was removed in pandas 2.0, so collect the per-genre samples
    # in a list and concatenate them once at the end
    samples = [dataframe[dataframe["genre"] == genre].sample(n=k, random_state=random_state)
               for genre in genres]
    
    # pass random_state for a reproducible sample; the default (None) draws a new sample each call
    return pd.concat(samples)

In [12]:
# sampling from the dataframe, k is the # of samples from each genre

df_sampled = genre_sample(df, k=500)
print(df_sampled.shape)

#checking correct amounts of samples per genre were obtained
print(df_sampled.genre.value_counts())
df_sampled.head(5)

(2500, 8)
Rap        500
Pop        500
R&B        500
Country    500
Rock       500
Name: genre, dtype: int64


Unnamed: 0.2,Unnamed: 0,artist,seq,song,label,genre,seq_clean,Unnamed: 0.1
109296,1396,Muddy Waters,"Well, it gettin'\nLate on into the evenin' and...",Feel Like Going Home,0.185,R&B,well gettin late evenin feel like like blowin ...,60423.0
85479,2479,Monifah,"I want you, I want you, I want you, I want you...",You,0.833,R&B,want want want want sun dont shine brighten da...,70912.0
126356,1856,Luther Vandross,What's your name?\r\nGirl what's your number?\...,For the Sweetness of Your Love,0.936,R&B,whats name girl whats number think fell love f...,121952.0
44834,44834,Timbaland,"Is it going, is it going, is it going, is it g...",Give It to Me,0.815,R&B,going going going going dont know youre lookin...,154408.0
79558,708,Muddy Waters,"Want you to rock me baby, rock me all night lo...",Rock Me [#],0.633,R&B,want rock baby rock night long want rock baby ...,60495.0


# Testing Regression Models for label prediction:
- label = float scale (0-1) which signifies valence 



## Random Forest Regressor

- using the sampled dataset for faster testing

In [13]:
X_sampled = df_sampled['seq_clean'].values

y_sampled = df_sampled['label'].values

In [14]:
X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(X_sampled, y_sampled, 
                                                                             test_size=0.33, random_state=42)

In [15]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train_sample)

X_train_sample = vectorizer.transform(X_train_sample)
X_test_sample = vectorizer.transform(X_test_sample)

print(X_train_sample.shape, type(X_train_sample))

(1675, 19224) <class 'scipy.sparse.csr.csr_matrix'>


In [16]:
# parameter grid for the RandomForestRegressor grid search

# 'scoring' is an argument of GridSearchCV, not a hyperparameter of RandomForestRegressor,
# so it does not belong in param_grid. An earlier version of this cell listed it here
# and the search failed with:
#   ValueError: Invalid parameter scoring for estimator RandomForestRegressor().
param_grid = {'n_estimators': [100, 1000]}

# there are more parameters to test but I was getting errors and need to investigate more
# (min_samples_split must be an int >= 2 or a float in (0, 1], so 0 is not a valid value)

# param_grid = {'criterion': ['squared_error', 'absolute_error', 'poisson'],
#               'n_estimators': [10, 50, 100, 500, 1000], 
#               'max_depth': [2, 4, 8, 16, 32, 64], 
#               'min_samples_leaf': [1, 10, 25, 50],
#               'bootstrap': [True, False],
#               'min_samples_split': [2, 4, 8, 16, 32]
#              }

### *Do not run the next cell unless you have 2+ hours to kill*

In [18]:
# setting n_jobs=-1 in GridSearchCV tells scikit-learn to use all processors (parallelization)

print('Running Grid Search...')

# 1. Create a RandomForestRegressor model object without supplying arguments. 

rf_regressor = RandomForestRegressor()

# 2. Run a Grid Search with 3-fold cross-validation and assign the output to the object 'rf_grid'.
#    * Pass the model and the parameter grid to GridSearchCV()
#    * Set the number of folds to 3
#    * Specify the scoring method (scores are negated RMSE, so values closer to 0 are better)

rf_grid = GridSearchCV(estimator=rf_regressor, param_grid=param_grid, cv=3,
                       scoring='neg_root_mean_squared_error', n_jobs=-1)

# 3. Fit the grid search (the 'rf_grid' variable) on the training data and assign the fitted
#    search to the variable 'rf_grid_search'

rf_grid_search = rf_grid.fit(X_train_sample, y_train_sample)


print('Done')

In [21]:
# list the hyperparameters that can go into param_grid; get_params() has to be called
# on an estimator instance (calling it on the GridSearchCV class itself failed earlier)
rf_regressor.get_params().keys()

In [None]:
# finding best parameters for the Random Forest Regressor

best_score = rf_grid_search.best_score_
print("The best score is: ", best_score)

rf_best_params = rf_grid_search.best_params_
print("The best params are: ", rf_best_params)

# conclusion was n_estimators=1000, bootstrap = True are best hyperparameters 

In [None]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model1 = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model1.fit(X_train_sample, y_train_sample)

In [None]:
y_sample_pred = rf_model1.predict(X_test_sample)

rf_mse = mean_squared_error(y_test_sample, y_sample_pred)
rf_r2 = r2_score(y_test_sample, y_sample_pred)

print('on sample size of 500/genre with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse))
print('[RF] R2: {0}'.format(rf_r2))

# on sample size of 500/genre with optimal hyperparameters:
# - [RF] Mean Squared Error: 0.05222244210798196
# - [RF] R2: 0.10795588207769391

In [None]:
# Function to test the predictions of the model with NEW unseen text (not part of testing set)

def rgrg_string_test(lyrics):
    new_lyrics = text_processing_pipeline(lyrics)
    print("the processed lyrics are: ", new_lyrics)
    
    new_text_vectorized = vectorizer.transform([new_lyrics])
    
    # predict() returns an array with one value per input row; take the single prediction
    value = rf_model1.predict(new_text_vectorized)[0]
    print("Random Forest Regressor model gives a value of: ", value)
    if value < 0.50:
        print("which is negative")
    else:
        print("which is positive")

In [None]:
test_text1 = "Hit me baby one more time my loneliness is killing me and I must confess I still believe"
test_text2 = "Oh, baby, when you talk like that You make a woman go mad So be wise and keep on Reading the signs of my body"
test_text3 = "looking out on the pouring rain I used to feel so uninspired"
test_text4 = "Girl put your record on tell me your favorite song just go ahead let your hair down"

rgrg_string_test(test_text1)
print('\n')
rgrg_string_test(test_text2)
print('\n')
rgrg_string_test(test_text3)
print('\n')
rgrg_string_test(test_text4)

# Running Larger RF Test on 1687 samples from each Genre

- to-do: break this testing out into a function instead of repeating code 
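
One possible shape for that to-do is sketched below. `evaluate_regressor` is a hypothetical helper, not something defined elsewhere in this notebook, and the toy texts and labels are made up for illustration. It wraps the split, TF-IDF fit, model fit, and scoring steps that the cells in this notebook repeat for each sample size, using a scikit-learn pipeline so the vectorizer is only ever fitted on the training split.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def evaluate_regressor(texts, labels, regressor, test_size=0.33, random_state=42):
    """Split, vectorize, fit, and score in one call (hypothetical helper)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state)
    # The pipeline fits TF-IDF on the training split only, then fits the regressor
    model = make_pipeline(TfidfVectorizer(), regressor)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return {'mse': mean_squared_error(y_te, y_pred),
            'r2': r2_score(y_te, y_pred),
            'model': model}

# Toy data for illustration only
toy_texts = ['happy sunny love dance', 'sad rain cry alone',
             'love dance party happy', 'alone cry dark sad',
             'sunny party love joy', 'dark rain alone cold'] * 5
toy_labels = [0.9, 0.1, 0.85, 0.15, 0.95, 0.05] * 5

result = evaluate_regressor(toy_texts, toy_labels,
                            GradientBoostingRegressor(n_estimators=20, random_state=0))
print('MSE: {0:.4f}  R2: {1:.4f}'.format(result['mse'], result['r2']))
```

In this notebook the call would presumably look like `evaluate_regressor(df_sampled2['seq_clean'].values, df_sampled2['label'].values, RandomForestRegressor(n_estimators=1000))`, with the returned `model` accepting raw cleaned strings directly.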

In [None]:
# sampling from the dataframe, k is 1687 which is the max number of samples from R&B the smallest Genre pool 

df_sampled2 = genre_sample(df, k=1687)
print(df_sampled2.shape)
df_sampled2.head(10)

In [None]:
#checking correct amounts of samples per genre were obtained

df_sampled2.genre.value_counts()

In [None]:
X_sampled2 = df_sampled2['seq_clean'].values

y_sampled2 = df_sampled2['label'].values

In [None]:
X_train_sample2, X_test_sample2, y_train_sample2, y_test_sample2 = train_test_split(X_sampled2, y_sampled2, 
                                                                             test_size=0.33, random_state=42)

In [None]:
vectorizer2 = TfidfVectorizer()
vectorizer2.fit(X_train_sample2)

X_train_sample2 = vectorizer2.transform(X_train_sample2)
X_test_sample2 = vectorizer2.transform(X_test_sample2)

print(X_train_sample2.shape, type(X_train_sample2))

In [None]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model2 = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model2.fit(X_train_sample2, y_train_sample2)

In [None]:
y_sample_pred2 = rf_model2.predict(X_test_sample2)
y_sample_pred2

rf_mse2 = mean_squared_error(y_test_sample2, y_sample_pred2)
rf_r2_2 = r2_score(y_test_sample2, y_sample_pred2)

print('on sample size of 1687/genre  with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse2))
print('[RF] R2: {0}'.format(rf_r2_2))

# on sample size 1687/genre with optimal hyperparameters:
# - [RF] Mean Squared Error: 0.05101035414841959
# - [RF] R2: 0.1205852034338123

# Conclusion on testing Random Forest Regressors

Not much gained or lost from the increase in sample size. 
R2 rose from 0.108 to 0.121, which is not too significant 

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

# Running RF Test on Full Data Set 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# Initialize our vectorizer
vectorizer = TfidfVectorizer()

# 3. Fit your vectorizer using your X data
# This makes your vocab matrix
vectorizer.fit(X_train)

# 4. Transform your X data using your fitted vectorizer. 
# This transforms your documents into vectors.
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape, type(X))
print(type(X_train))

In [None]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model.fit(X_train, y_train)

In [None]:
y_pred = rf_model.predict(X_test)
y_pred

rf_mse = mean_squared_error(y_test, y_pred)
rf_r2 = r2_score(y_test, y_pred)

print('on full data set with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse))
print('[RF] R2: {0}'.format(rf_r2))

In [None]:
#end of RF testing

---

# Gradient Boosting Regressor
Gradient boosting builds an ensemble of decision trees one at a time, fitting each new tree to the errors left by the trees that came before it.
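
To make the idea concrete, here is a minimal hand-rolled sketch for squared-error loss, where the quantity each new tree is fitted to is simply the current residual. It runs on synthetic data and is only an illustration of the principle; it is not how `GradientBoostingRegressor` is implemented internally, and the names `X_demo`, `y_demo`, `learning_rate`, and `n_rounds` are chosen just for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 1))
y_demo = np.sin(2 * np.pi * X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Round 0: start from a constant prediction (the mean of the targets)
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # What the current ensemble still gets wrong
    residuals = y_demo - prediction
    # Fit a small tree to those residuals
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a damped version of the correction to the running prediction
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print('training MSE before boosting: {0:.4f}'.format(mse_history[0]))
print('training MSE after {0} rounds: {1:.4f}'.format(n_rounds, mse_history[-1]))
```

The three knobs in this sketch correspond to the `learning_rate`, `n_estimators`, and `max_depth` hyperparameters tuned later in this section.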

In [None]:
# sampling from the dataframe, k is the # of samples from each genre

df_sampled3 = genre_sample(df, k=500)
print(df_sampled3.shape)
df_sampled3.head(10)


In [None]:
#checking correct amounts of samples per genre were obtained
df_sampled3.genre.value_counts()

In [None]:
X_sampled3 = df_sampled3['seq_clean'].values
y_sampled3 = df_sampled3['label'].values

In [None]:
X_train_sample3, X_test_sample3, y_train_sample3, y_test_sample3 = train_test_split(X_sampled3, y_sampled3, 
                                                                                    test_size=0.33, random_state=42)

In [None]:
vectorizer3 = TfidfVectorizer()
vectorizer3.fit(X_train_sample3)

X_train_sample3 = vectorizer3.transform(X_train_sample3)
X_test_sample3 = vectorizer3.transform(X_test_sample3)

print(X_train_sample3.shape, type(X_train_sample3))

In [None]:
gb = GradientBoostingRegressor()

In [None]:
gb.fit(X_train_sample3, y_train_sample3)

In [None]:
y_pred_sample3 = gb.predict(X_test_sample3)

In [None]:
gb_mse = mean_squared_error(y_test_sample3, y_pred_sample3)
gb_r2 = r2_score(y_test_sample3, y_pred_sample3)

print('For a sample size of 500/genre and NO hyperparameter tuning:')
print('[GB] Mean Squared Error: {0}'.format(gb_mse))
print('[GB] R2: {0}'.format(gb_r2))
# In the run recorded below, GB's MSE is about the same as Random Forest's at this sample size,
# but its R2 is lower (0.080 vs 0.108), so the comparison is no longer printed as a fixed string.

# For a sample size of 500 and NO hyperparameter tuning:
# [GB] Mean Squared Error: 0.05264719893512396
# [GB] R2: 0.08043425668821236

### Hyperparameter Tuning of Gradient Boosting Regressor with GridSearchCV
- need to run overnight
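
One way to cut the search time is a randomized search, which the scikit-learn grid search guide describes alongside exhaustive search: it fits only `n_iter` randomly chosen combinations instead of all of them. The sketch below runs on small synthetic data, so the grid values, `n_iter=5`, and the `X_demo`/`y_demo` names are assumptions for illustration, not tuned recommendations for the lyrics data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# 27 possible combinations; the randomized search tries only n_iter of them
param_distributions = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1, 0.3],
    'max_depth': [2, 3, 5],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                          # 5 candidates x 3 folds = 15 fits instead of 81
    cv=3,
    scoring='neg_mean_squared_error',  # negated so that higher is better
    random_state=0,
)
search.fit(X_demo, y_demo)

print('best params:', search.best_params_)
print('best cross-validated MSE: {0:.4f}'.format(-search.best_score_))
```

The trade-off is that the best combination in the grid may never be sampled, so this is usually a first pass to narrow the ranges before a smaller exhaustive search.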

In [None]:
gb_param_grid= {'n_estimators': [100, 1000, 1500],
                'learning_rate' : [0.1, 0.3, 0.5],
                'max_depth': [3, 8, 16, 32]
                }

### *Do not run the next cell unless you have 2 hours to kill*

In [None]:
print("Running Grid Search ... ")

gb_regressor = GradientBoostingRegressor()

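# 'mse' is not a valid scikit-learn scorer name. The positive best score recorded below is
# consistent with R2 (the default for regressors), so R2 is made explicit here;
# use scoring='neg_mean_squared_error' to tune on MSE instead.
gb_grid = GridSearchCV(estimator=gb_regressor, param_grid=gb_param_grid, cv=3, scoring='r2', n_jobs=-1)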

print("Running the fit..")

gb_grid_search = gb_grid.fit(X_train_sample3, y_train_sample3)

print("Done.")

best_score3 = gb_grid_search.best_score_
print("The best score is: ", best_score3)

gb_best_params = gb_grid_search.best_params_
print("The best parameters are: ", gb_best_params)

# The best score is:  0.07331262579138897
# The best parameters are:  {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}

# Running Larger GB Test on 1687 samples from each Genre
### with optimized hyperparameters {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}

In [None]:
# sampling from the dataframe, k is 1687 which is the max number of samples from R&B the smallest Genre pool 

df_sampled4 = genre_sample(df, k=1687)
print(df_sampled4.shape)
df_sampled4.head(10)

In [None]:
#checking correct amounts of samples per genre were obtained

df_sampled4.genre.value_counts()

In [None]:
X_sampled4 = df_sampled4['seq_clean'].values

y_sampled4 = df_sampled4['label'].values

In [None]:
X_train_sample4, X_test_sample4, y_train_sample4, y_test_sample4 = train_test_split(X_sampled4, y_sampled4, 
                                                                                    test_size=0.33, random_state=42)

In [None]:
vectorizer4 = TfidfVectorizer()
vectorizer4.fit(X_train_sample4)

X_train_sample4 = vectorizer4.transform(X_train_sample4)
X_test_sample4 = vectorizer4.transform(X_test_sample4)

print(X_train_sample4.shape, type(X_train_sample4))

In [None]:
gb2 = GradientBoostingRegressor(learning_rate=0.1, max_depth=3, n_estimators= 100)
gb2.fit(X_train_sample4, y_train_sample4)

In [None]:
y_pred_sample4 = gb2.predict(X_test_sample4)

In [None]:
gb_mse2 = mean_squared_error(y_test_sample4, y_pred_sample4)
gb_r2_2 = r2_score(y_test_sample4, y_pred_sample4)

print('For a sample size of 1687 and optimal hyperparameters:')
print('[GB] Mean Squared Error: {0}'.format(gb_mse2))
print('[GB] R2: {0}'.format(gb_r2_2))

# For a sample size of 1687 and optimal hyperparameters:
# [GB] Mean Squared Error: 0.05200244352093515
# [GB] R2: 0.09854691815352778

# Running GB on full data set with optimal hyperparameters

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# Initialize our vectorizer
vectorizer = TfidfVectorizer()

# 3. Fit your vectorizer using your X data
# This makes your vocab matrix
vectorizer.fit(X_train)

# 4. Transform your X data using your fitted vectorizer. 
# This transforms your documents into vectors.
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape, type(X))
print(type(X_train))

In [None]:
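# Optimal hyperparameters from the grid search, written out explicitly
# (they are the same as scikit-learn's defaults for GradientBoostingRegressor)
gb3 = GradientBoostingRegressor(learning_rate=0.1, max_depth=3, n_estimators=100)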
gb3.fit(X_train, y_train)

In [None]:
y_pred = gb3.predict(X_test)
gb3_mse = mean_squared_error(y_test, y_pred)
gb3_r2 = r2_score(y_test, y_pred)

print('On the full data set and optimal hyperparameter tuning:')
print('[GB] Mean Squared Error: {0}'.format(gb3_mse))
print('[GB] R2: {0}'.format(gb3_r2))
print("\n ...")

# On the full data set and optimal hyperparameter tuning:
# [GB] Mean Squared Error: 0.05284822264086245
# [GB] R2: 0.12264012250017176

# Conclusion on testing Gradient Boosting Regressors

Not much gained or lost from the increase in sample size: MSE stayed between 0.052 and 0.053 in every run. 
R2 rose by about 0.018 between 500 and 1687 samples per genre (0.080 to 0.099) and reached 0.123 on the full data set, which is not a large improvement. 

on sample size 500/genre and NO hyperparameter tuning:
- [GB] Mean Squared Error: 0.05264719893512396
- [GB] R2: 0.08043425668821236

on sample size 1687/genre with optimal hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}:
- [GB] Mean Squared Error: 0.05200244352093515
- [GB] R2: 0.09854691815352778

on full data set with optimal hyperparameters:
- [GB] Mean Squared Error: 0.05284822264086245
- [GB] R2: 0.12264012250017176

** GB runs MUCH faster than RFreg, but RFreg produces the better R2 on the full data set (0.151 vs 0.123)

---
copied from previous cells for comparison: 

# Conclusion on testing Random Forest Regressors

Not much gained or lost from the increase in sample size: MSE stayed between 0.051 and 0.052 in every run. 
R2 rose from 0.108 to 0.121 between 500 and 1687 samples per genre and reached 0.151 on the full data set, which is a modest gain. 

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

on full data set with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05115798019890316
- [RF] R2: 0.15070068589697727
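
To put these R2 values in context: R2 equals 1 minus the ratio of the model's MSE to the MSE of a baseline that always predicts the mean of the test labels. The sketch below computes that relationship by hand, together with RMSE and MAE, on made-up numbers. `regression_report` is a hypothetical helper and the demo values are not from the data set.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """MSE, RMSE, MAE, mean-baseline MSE, and R2 for one set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    # MSE of a "model" that always predicts the mean of the true labels
    baseline_mse = np.mean((y_true - y_true.mean()) ** 2)
    return {
        'mse': mse,
        'rmse': np.sqrt(mse),
        'mae': np.mean(np.abs(y_true - y_pred)),
        'baseline_mse': baseline_mse,
        'r2': 1.0 - mse / baseline_mse,
    }

# Made-up values for illustration
y_true_demo = [0.2, 0.4, 0.6, 0.8]
y_pred_demo = [0.3, 0.4, 0.6, 0.7]
report = regression_report(y_true_demo, y_pred_demo)
for name, value in report.items():
    print('{0}: {1:.4f}'.format(name, value))
```

Applying the same arithmetic to the recorded RF result at 500/genre (MSE 0.0522, R2 0.108) gives an RMSE of about 0.23 on the 0-1 valence scale and an implied mean-baseline MSE of about 0.0585, so the model removes only around a tenth of the baseline's squared error.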