# Valence Prediction Conclusions:

### Testing Gradient Boosting Regressors

Not much was gained or lost by increasing the sample size.
R2 rose by about 0.018 (0.080 to 0.099) between the 500 and 1687 samples-per-genre runs, which is not a significant gain.

on sample size 500/genre and NO hyperparameter tuning (out of the box):
- [GB] Mean Squared Error: 0.05264719893512396
- [GB] R2: 0.08043425668821236

on sample size 1687/genre with optimal hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100} (these match scikit-learn's defaults for GradientBoostingRegressor):
- [GB] Mean Squared Error: 0.05200244352093515
- [GB] R2: 0.09854691815352778

on full data set with optimal hyperparameters:
- [GB] Mean Squared Error: 0.05284822264086245
- [GB] R2: 0.12264012250017176

#### ** GB runs MUCH faster than RFreg and gets its best R2 on the full data set, though RF still scores higher there (R2 0.151 vs. 0.123)

---

### Testing Random Forest Regressors

Not much was gained or lost by increasing the sample size.
R2 rose by about 0.013 (0.108 to 0.121) between the 500 and 1687 samples-per-genre runs, which is not a significant gain.

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

on full data set with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05115798019890316
- [RF] R2: 0.15070068589697727
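
The two metrics reported above are linked: on a fixed test set, R2 = 1 - MSE / Var(y_test). As a rough sanity check, the sketch below backs out the label variance implied by the full-data-set figures. The helper name `implied_variance` is introduced here for illustration and is not part of the notebook.

```python
# R2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R2)
def implied_variance(mse, r2):
    """Variance of the test labels implied by a reported (MSE, R2) pair."""
    return mse / (1.0 - r2)

# Full-data-set figures copied from the results listed above
gb_var = implied_variance(0.05284822264086245, 0.12264012250017176)
rf_var = implied_variance(0.05115798019890316, 0.15070068589697727)

print(round(gb_var, 4), round(rf_var, 4))
```

Both values come out near 0.0602, which is what one would expect if the two models were scored on the same test split. It also shows why R2 stays low: an MSE of about 0.05 is only slightly better than the roughly 0.06 obtained by always predicting the mean valence.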

---

### Issues:
- stemming created gibberish
- GridSearchCV takes 2+ hours to run (up to 4)

---

### Try:
- turning valence into a classification problem instead of regression
    - round label values to the nearest tenth to get about 10 classes (11 if both 0.0 and 1.0 occur)
- look at other regression metrics
    - however, R2 may not be a good metric here because of the low correlation between the variables
    - functions ending with _score return a value to maximize: the higher, the better
    - functions ending with _error or _loss return a value to minimize: the lower, the better
- try lemmatization instead of stemming
- try Word2Vec instead of TF-IDF
- look into this for improving grid search: https://scikit-learn.org/stable/modules/grid_search.html#grid-search
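
The classification idea in the list above could be prototyped as follows. This is a minimal sketch, not the notebook's method: `valence_to_class` is a hypothetical helper, and it uses equal-width bins (flooring) rather than rounding so that exactly 10 classes result.

```python
def valence_to_class(label, n_classes=10):
    """Map a valence in [0, 1] to an integer class in [0, n_classes - 1]."""
    if not 0.0 <= label <= 1.0:
        raise ValueError("valence must lie in [0, 1]")
    # Equal-width bins; the top edge (1.0) is folded into the last class
    return min(int(label * n_classes), n_classes - 1)

# A few label values taken from the dataframe preview, plus the upper edge case
labels = [0.626, 0.63, 0.24, 0.536, 0.371, 1.0]
classes = [valence_to_class(v) for v in labels]
print(classes)  # [6, 6, 2, 5, 3, 9]
```

With a pandas column the same mapping could be applied via `df['label'].apply(valence_to_class)` before training a classifier.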


In [1]:
# Import pandas and NumPy for data handling
import pandas as pd
import numpy as np

# NLTK is our Natural Language Toolkit
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stopwords = stopwords.words('english')

# Libraries for helping us with strings
import string
# Regular Expression Library
import re

# Import text vectorizers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
warnings.filterwarnings(action = 'ignore')
import gensim
from gensim.models import Word2Vec

# Import classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold

#Import Regressor Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

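# Import some ML helper functions
from sklearn.model_selection import train_test_split, GridSearchCV
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report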


# Import our metrics to evaluate our model
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error, r2_score, max_error
from sklearn.model_selection import cross_val_score


# Library for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse as sparse

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aleksandrageorgievska/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df = pd.read_csv('../data/labeled_lyrics_w_genres.csv')

In [5]:
df.loc[18066]

Unnamed: 0                                                  18066
Unnamed: 0.1                                                65729
artist                                               Beanie Sigel
seq             As far back as I can remember I always wanted ...
song                                           Watch Your B******
label                                                        0.72
genre                                                    No_genre
Name: 18066, dtype: object

# Inspecting The Data

In [3]:
print(df.isnull().sum().sum())
print(df.duplicated().sum())
print(df.shape)
print(df.genre.value_counts())
df.head()

0
0
(145250, 7)
Pop          57357
No_genre     42789
Rock         26756
Country       7440
Rap           5959
R&B           4773
Non-Music      176
Name: genre, dtype: int64


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,artist,seq,song,label,genre
0,0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B
1,1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63,Pop
2,2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24,R&B
3,3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536,R&B
4,4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371,R&B


### Removing the No_genre and Non-Music rows

In [4]:
# Collect the index labels of rows with a placeholder genre, then drop them in place
rows_to_drop = df[(df['genre'] == 'No_genre') | (df['genre'] == 'Non-Music')].index
df.drop(rows_to_drop, inplace=True, axis='index')

In [5]:
print(df.shape)
print(df.genre.value_counts())
df.head(15)

(102285, 7)
Pop        57357
Rock       26756
Country     7440
Rap         5959
R&B         4773
Name: genre, dtype: int64


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,artist,seq,song,label,genre
0,0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B
1,1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63,Pop
2,2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24,R&B
3,3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536,R&B
4,4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371,R&B
5,5,5,Elijah Blake,I just want to ready your mind\r\n'Cause I'll ...,Uno,0.321,R&B
7,7,7,Elis,Dieses ist lange her.\r\nDa ich deine schmalen...,Abendlied,0.333,Pop
8,8,8,Elis,A child is born\r\nOut of the womb of a mother...,Child,0.506,Pop
9,9,9,Elis,Out of the darkness you came \r\nYou looked so...,Come to Me,0.179,Pop
10,10,10,Elis,Each night I lie in my bed \r\nAnd I think abo...,Do You Believe,0.209,Pop


---

# Data Cleaning (Text Pre Processing)

In [6]:
# 1. function that makes all text lowercase.
def make_lowercase(test_string):
    return test_string.lower()

# 2. function that removes all punctuation. 
def remove_punc(test_string):
    test_string = re.sub(r'[^\w\s]', '', test_string)
    return test_string

# 3. function that removes all stopwords.
def remove_stopwords(test_string):
    # Break the sentence down into a list of words
    words = word_tokenize(test_string)
    
    # Make a list to append valid words into
    valid_words = []
    
    # Loop through all the words
    for word in words:
        
        # Check if word is not in stopwords. Stopwords was imported from nltk.corpus
        if word not in stopwords:
            
            # If word not in stopwords, append to our valid_words
            valid_words.append(word)

    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string

# 4. function to break words into their stem words
def stem_words(a_string):
    # Initialize our Stemmer
    porter = PorterStemmer()
    
    # Break the sentence down into a list of words
    words = word_tokenize(a_string)
    
    # Make a list to append valid words into
    valid_words = []

    # Loop through all the words
    for word in words:
        # Stem the word
        stemmed_word = porter.stem(word) #from nltk.stem import PorterStemmer
        
        # Append stemmed word to our valid_words
        valid_words.append(stemmed_word)
        
    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string 

In [7]:
# Pipeline function 

def text_processing_pipeline(a_string):
    a_string = make_lowercase(a_string)
    a_string = remove_punc(a_string)
    # a_string = stem_words(a_string)  # disabled for now: stemming turned the lyrics into gibberish
    a_string = remove_stopwords(a_string)
    return a_string
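
The order of the steps matters. Stripping punctuation before removing stopwords turns a contraction like `don't` into `dont`, which then no longer matches a stopword entry spelled with an apostrophe. This may be one reason tokens such as `dont` survive in the cleaned lyrics. The sketch below is a toy illustration only: it uses a hypothetical three-word stopword set and whitespace tokenization, whereas NLTK's `word_tokenize` splits contractions differently.

```python
import re

# Hypothetical mini stopword set for illustration (NLTK's English list is much longer)
TOY_STOPWORDS = {"i", "don't", "the"}

def strip_punctuation(text):
    # Same regular expression as remove_punc in the notebook
    return re.sub(r"[^\w\s]", "", text)

def drop_stopwords(text):
    # Whitespace tokenization stands in for word_tokenize here
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

lyric = "i don't get the blues"

punct_first = drop_stopwords(strip_punctuation(lyric))      # order used in the pipeline
stopwords_first = strip_punctuation(drop_stopwords(lyric))  # alternative order

print(punct_first)      # dont get blues
print(stopwords_first)  # get blues
```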

In [8]:
# apply preprocessing pipeline 

df['seq_clean'] = df['seq'].apply(text_processing_pipeline)

In [9]:
print(df.shape)
df.head(1)

(102285, 8)


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,artist,seq,song,label,genre,seq_clean
0,0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626,R&B,aint ever trapped bando oh lord dont get wrong...


# Trying Word2Vec 

In [34]:
X = df.seq_clean.apply(lambda x: word_tokenize(x))
X.head()

0    [aint, ever, trapped, bando, oh, lord, dont, g...
1    [drinks, go, smoke, goes, feel, got, let, go, ...
2    [dont, live, planet, earth, found, love, venus...
3    [trippin, grigio, mobbin, lights, low, trippin...
4    [see, midnight, panther, gallant, brave, found...
Name: seq_clean, dtype: object

In [35]:
X.shape

(102285,)

In [43]:
model = Word2Vec(X, sg=1)

In [44]:
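# total_examples should match the corpus size; corpus_count is set when the model is built from X
model.train(X, total_examples=model.corpus_count, epochs=100)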

(1083510289, 1224887500)

In [45]:
model.save("word2vec.model")

In [47]:
model = Word2Vec.load("word2vec.model")

In [48]:
model.wv.most_similar(positive ='')

[('unhappy', 0.6109914183616638),
 ('glad', 0.5998392701148987),
 ('sequestered', 0.5725224018096924),
 ('merriest', 0.5571557879447937),
 ('smiling', 0.5531657934188843),
 ('birthday', 0.5460292100906372),
 ('loveylove', 0.5429710149765015),
 ('gay', 0.5317879915237427),
 ('matzofarian', 0.5311346054077148),
 ('sad', 0.5262659192085266)]

# Using word2vec to create a lyric vector by taking the average of words present in lyric

Reference for the method used: https://www.kaggle.com/code/nitin194/twitter-sentiment-analysis-word2vec-doc2vec/notebook

In [54]:
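import numpy as np

# Note: `list(model.wv.key_to_index)` raised
# AttributeError: 'Word2VecKeyedVectors' object has no attribute 'key_to_index'
# because key_to_index only exists in gensim >= 4.0. `word in model.wv` works in both 3.x and 4.x.
def lyric_vector(tokens, size):
    # Average the Word2Vec vectors of the in-vocabulary tokens; `size` must equal model.wv.vector_size
    sent = np.zeros(size)
    count = 0
    for word in tokens:
        if word in model.wv:
            sent += model.wv[word]
            count += 1
    if count != 0:
        sent /= count
    return sent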
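
The averaging idea itself does not depend on gensim. As a minimal, dependency-free sketch, assume a plain dict `toy_embeddings` stands in for the trained word vectors; the dict and the helper name `average_vector` are made up for illustration.

```python
def average_vector(tokens, embeddings, size):
    """Average the vectors of the tokens found in `embeddings`; zeros if none are found."""
    total = [0.0] * size
    count = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:  # out-of-vocabulary word: skip it
            continue
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]

# Two-dimensional toy vectors; real Word2Vec vectors have vector_size dimensions
toy_embeddings = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
doc_vec = average_vector(["happy", "sad", "unknown"], toy_embeddings, size=2)
print(doc_vec)  # [0.5, 0.5]
```

Each lyric then becomes one fixed-length vector, which can be stacked into a dense feature matrix for the regressors in place of the TF-IDF matrix.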

In [10]:
X = df['seq_clean'].values
y = df['label'].values

# Sampling smaller batches from dataframe for faster testing

In [None]:
# Function to randomly sample k rows from each genre for faster testing on smaller data
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)

def genre_sample(dataframe, k):
    genres = ['R&B', 'Pop', 'Rap', 'Rock', 'Country']
    return pd.concat([dataframe[dataframe["genre"] == genre].sample(n=k) for genre in genres])

In [None]:
# Sampling from the dataframe; k is the number of samples drawn from each genre

df_sampled = genre_sample(df, k=500)
print(df_sampled.shape)

# Checking that the correct number of samples per genre was obtained
print(df_sampled.genre.value_counts())
df_sampled.head(5)

# Testing Regression Models for label prediction:
- label = a float on a 0-1 scale that represents valence



# Gradient Boosting Regressor
Gradient boosting builds an ensemble by adding decision trees one at a time, with each new tree fit to correct the errors left by the trees before it.
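
To make the "each tree corrects the previous error" idea concrete, here is a from-scratch sketch for squared error on one-dimensional toy data. It uses depth-1 trees (stumps) and made-up data. It illustrates the principle only and is not how scikit-learn's `GradientBoostingRegressor` is implemented internally.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error on 1-D data."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split leaves points on both sides
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_estimators=100, learning_rate=0.1):
    """Boosting for squared error: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)  # initial prediction: the mean label
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up toy data: one feature, labels on a 0-1 scale
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.2, 0.2, 0.7, 0.8, 0.9]
model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

The `n_estimators` and `learning_rate` arguments play the same roles as the scikit-learn hyperparameters of the same names: more trees and a larger step both fit the training data more closely.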

### Pipeline to sample the data based on k size, train_test_split, fit model, print results

In [26]:
# Pipeline to sample the data based on k size, train_test_split, fit model, print results

def sample_train_test_pipeline(dataframe, model, sample_size, vectorizer):
    
    # 1) sample `sample_size` rows from each genre and stack them
    #    (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    all_genres = ['R&B', 'Pop', 'Rap', 'Rock', 'Country']
    df_sampled = pd.concat([dataframe[dataframe["genre"] == genre].sample(n=sample_size)
                            for genre in all_genres])

    # 2) print the shape, head, and value counts
    print('\nThe shape of your dataframe is: ', df_sampled.shape, '\n')
    print('\nThe dataframe looks like: \n', df_sampled.head(2), '\n')
    print('Checking each genre has the correct sample size:\n\n', df_sampled["genre"].value_counts(), '\n')
    
    # 3) assign the X and y labels
    X_sampled = df_sampled['seq_clean'].values
    y_sampled = df_sampled['label'].values
    
    # 4) train_test_split
    X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(X_sampled, y_sampled, 
                                                                                    test_size=0.33, random_state=42)
    # 5) vectorize: fit on the training split only so the test split stays unseen
    vectorizer.fit(X_train_sample)

    X_train_sample = vectorizer.transform(X_train_sample)
    X_test_sample = vectorizer.transform(X_test_sample)

    print('\nThe shape and type of X_train_sample: ', X_train_sample.shape, type(X_train_sample))
    
    # 6) the model is supplied by the caller, so there is nothing to define here
    
    # 7) fit model
    print('\nFitting the model...')
    model.fit(X_train_sample, y_train_sample)
    
    y_pred_sample = model.predict(X_test_sample)
    
    # 8) print results 
    print('\nPrinting metrics...')
    model_mse = mean_squared_error(y_test_sample, y_pred_sample)
    model_me = max_error(y_test_sample, y_pred_sample)
    
    print('\nRESULTS...')
    print('For a sample size of', sample_size, ': ')
    print('The Mean Squared Error of the ', model, 'model was: {0}'.format(model_mse))
    print('The Max Error of the ', model, 'model was: {0}'.format(model_me))
    
    print('\nDone!')

In [30]:
sample_train_test_pipeline(dataframe=df, model=GradientBoostingRegressor(n_estimators=200, verbose=2), 
                           sample_size=100, vectorizer=TfidfVectorizer())






The shape of your dataframe is:  (500, 8) 


The dataframe looks like: 
        Unnamed: 0           artist  \
143093       1993  Jesse McCartney   
127867       3367       Diana Ross   

                                                      seq              song  \
143093  Yeah yeah oh yeah yeah\n\nHow do you feel when...         Checkmate   
127867  Dream, love is only in a dream, remember\r\nRe...  Remember Reprise   

         label genre                                          seq_clean  \
143093  0.1460   R&B  yeah yeah oh yeah yeah feel king got ta lie fo...   
127867  0.0757   R&B  dream love dream remember remember life never ...   

        Unnamed: 0.1  
143093      106560.0  
127867       72571.0   

Checking each genre has the correct sample size:

 Rock       100
R&B        100
Rap        100
Country    100
Pop        100
Name: genre, dtype: int64 


The shape and type of X_train_sample:  (335, 7568) <class 'scipy.sparse.csr.csr_matrix'>

Fitting the model...
      Iter

       195           0.0052            0.03s
       196           0.0051            0.02s
       197           0.0051            0.02s
       198           0.0050            0.01s
       199           0.0050            0.01s
       200           0.0049            0.00s

Printing metrics...

RESULTS...
For a sample size of 100 : 
The Mean Squared Error of the  GradientBoostingRegressor(n_estimators=200, verbose=2) model was: 0.06339029791901489
The Max Error of the  GradientBoostingRegressor(n_estimators=200, verbose=2) model was: 0.6465404962964746

Done!


In [31]:
sample_train_test_pipeline(dataframe=df, model=GradientBoostingRegressor(n_estimators=300, verbose=2), 
                           sample_size=100, vectorizer=TfidfVectorizer())


The shape of your dataframe is:  (500, 8) 


The dataframe looks like: 
        Unnamed: 0                artist  \
111432       3532        Bobby Caldwell   
4370         4370  The Brothers Johnson   

                                                      seq            song  \
111432  How long\r\nHow long have you been away\r\nOh,...        My Flame   
4370    So glad we've got a good thing\r\nYou know you...  The Real Thing   

        label genre                                          seq_clean  \
111432  0.411   R&B  long long away oh long cant find words say ive...   
4370    0.963   R&B  glad weve got good thing know make heart sing ...   

        Unnamed: 0.1  
111432       83943.0  
4370          4370.0   

Checking each genre has the correct sample size:

 Rock       100
R&B        100
Rap        100
Country    100
Pop        100
Name: genre, dtype: int64 


The shape and type of X_train_sample:  (335, 7381) <class 'scipy.sparse.csr.csr_matrix'>

Fitting the model...
    

       194           0.0051            0.63s
       195           0.0051            0.62s
       196           0.0050            0.62s
       197           0.0050            0.61s
       198           0.0049            0.61s
       199           0.0049            0.60s
       200           0.0048            0.59s
       201           0.0048            0.59s
       202           0.0048            0.58s
       203           0.0047            0.58s
       204           0.0047            0.57s
       205           0.0046            0.56s
       206           0.0046            0.56s
       207           0.0045            0.55s
       208           0.0045            0.55s
       209           0.0044            0.54s
       210           0.0044            0.53s
       211           0.0043            0.53s
       212           0.0043            0.52s
       213           0.0043            0.52s
       214           0.0042            0.51s
       215           0.0042            0.50s
       216

In [21]:
print('sample size of 4773 and no hyperparameters\n')

sample_train_test_pipeline(dataframe=df, model=GradientBoostingRegressor(), sample_size=4773, vectorizer=TfidfVectorizer())

print('\nsample size of 4773 with hyperparameters:')
sample_train_test_pipeline(dataframe=df, model=GradientBoostingRegressor(learning_rate=0.1, max_depth=3, n_estimators= 100), sample_size=4773, vectorizer=TfidfVectorizer())

print('\nsample size of 4773 with hyperparameters:')
sample_train_test_pipeline(dataframe=df, model=GradientBoostingRegressor(learning_rate=0.1, n_estimators= 100), sample_size=4773, vectorizer=TfidfVectorizer())

print('DONE!!')

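# Observation: the max error was worse with the explicit hyperparameters. Note that learning_rate=0.1,
# max_depth=3 and n_estimators=100 are also scikit-learn's defaults, so all three runs fit the same
# configuration and the differences likely come from the random sampling rather than from tuning.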


sample size of 4773 and no hyperparameters


The shape of your dataframe is:  (23865, 8) 


The dataframe looks like: 
        Unnamed: 0          artist  \
131518       2868    Annie Lennox   
144492       3392  Mint Condition   

                                                      seq  \
131518  Angels from the realms of glory\nWing your fli...   
144492  I wait for the day A sweet gentle sway rocks y...   

                                   song  label genre  \
131518  Angels from the Realms of Glory  0.300   R&B   
144492               U Send Me Swingin'  0.346   R&B   

                                                seq_clean  Unnamed: 0.1  
131518  angels realms glory wing flight earth ye sang ...      155696.0  
144492  wait day sweet gentle sway rocks love right wa...      158019.0   

Checking each genre has the correct sample size:

 Rock       4773
R&B        4773
Rap        4773
Country    4773
Pop        4773
Name: genre, dtype: int64 


The shape and type of X_train_s

### Hyperparameter Tuning of Gradient Boosting Regressor with GridSearchCV
- need to run overnight

In [None]:
# TODO: write a function for hyper parameter tuning 

In [None]:
gb_param_grid= {'n_estimators': [100, 500, 1000],
                'learning_rate' : [0.05, 0.1, 0.15],
                'max_depth': [3, 4, 5]
                }
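
A quick count shows why this search is slow. The sketch below enumerates the grid with `itertools.product`; the `+ 1` assumes GridSearchCV's default `refit=True`, and `cv_folds = 3` mirrors the `cv=3` used in the search cell.

```python
from itertools import product

gb_param_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.05, 0.1, 0.15],
    'max_depth': [3, 4, 5],
}
cv_folds = 3

# Every combination of the listed values is one candidate model
candidates = list(product(*gb_param_grid.values()))
total_fits = len(candidates) * cv_folds + 1  # +1 for the final refit on all training data

print(len(candidates), total_fits)  # 27 82
```

Trimming the value lists shrinks this product quickly. scikit-learn's `RandomizedSearchCV` is another option when only a fixed number of sampled candidates should be fit.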

### *Do not run the next cell unless you have 2 hours to kill*

In [None]:
print("Running Grid Search ... ")

gb_regressor = GradientBoostingRegressor()

gb_grid = GridSearchCV(n_jobs = -1, estimator = gb_regressor, param_grid= gb_param_grid, cv=3, scoring= 'neg_mean_squared_error')

print("Running the fit..")

gb_grid_search = gb_grid.fit(X_train_sample3, y_train_sample3)

print("Done.")

best_score3 = gb_grid_search.best_score_
print("The best score is: ", best_score3)

gb_best_params = gb_grid_search.best_params_
print("The best parameters are: ", gb_best_params)

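# The best score is:  0.07331262579138897
# (with scoring='neg_mean_squared_error', best_score_ is a negated MSE and cannot be positive,
#  so this recorded value was presumably produced with a different scorer)
# The best parameters are:  {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}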

# Running Larger GB Test on 1687 samples from each Genre
### with optimized hyper parameters {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}

In [None]:
# sampling from the dataframe, k is 1687 which is the max number of samples from R&B the smallest Genre pool 

df_sampled4 = genre_sample(df, k=1687)
print(df_sampled4.shape)
df_sampled4.head(10)

In [None]:
#checking correct amounts of samples per genre were obtained

df_sampled4.genre.value_counts()

In [None]:
X_sampled4 = df_sampled4['seq_clean'].values

y_sampled4 = df_sampled4['label'].values

In [None]:
X_train_sample4, X_test_sample4, y_train_sample4, y_test_sample4 = train_test_split(X_sampled4, y_sampled4, 
                                                                                    test_size=0.33, random_state=42)

In [None]:
vectorizer4 = TfidfVectorizer()
vectorizer4.fit(X_train_sample4)

X_train_sample4 = vectorizer4.transform(X_train_sample4)
X_test_sample4 = vectorizer4.transform(X_test_sample4)

print(X_train_sample4.shape, type(X_train_sample4))

In [None]:
gb2 = GradientBoostingRegressor(learning_rate=0.1, max_depth=3, n_estimators= 100)
gb2.fit(X_train_sample4, y_train_sample4)

In [None]:
y_pred_sample4 = gb2.predict(X_test_sample4)

In [None]:
gb_mse2 = mean_squared_error(y_test_sample4, y_pred_sample4)
gb_r2_2 = r2_score(y_test_sample4, y_pred_sample4)

print('For a sample size of 1687 and optimal hyperparameters:')
print('[GB] Mean Squared Error: {0}'.format(gb_mse2))
print('[GB] R2: {0}'.format(gb_r2_2))

# For a sample size of 1687 and optimal hyperparameters:
# [GB] Mean Squared Error: 0.05200244352093515
# [GB] R2: 0.09854691815352778

# Running GB on full data set with optimal hyperparameters

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# Initialize our vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the training data only.
# This builds the vocabulary and the IDF weights.
vectorizer.fit(X_train)

# Transform the train and test documents into TF-IDF vectors with the fitted vectorizer.
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape, type(X))
print(type(X_train))

In [None]:
gb3 = GradientBoostingRegressor()
gb3.fit(X_train, y_train)

In [None]:
y_pred = gb3.predict(X_test)
gb3_mse = mean_squared_error(y_test, y_pred)
gb3_r2 = r2_score(y_test, y_pred)

print('On the full data set and optimal hyperparameter tuning:')
print('[GB] Mean Squared Error: {0}'.format(gb3_mse))
print('[GB] R2: {0}'.format(gb3_r2))
print("\n ...")

# On the full data set and optimal hyperparameter tuning:
# [GB] Mean Squared Error: 0.05284822264086245
# [GB] R2: 0.12264012250017176

# Conclusion on testing Gradient Boosting Regressors

Not much was gained or lost by increasing the sample size.
R2 rose by about 0.018 (0.080 to 0.099) between the 500 and 1687 samples-per-genre runs, which is not a significant gain.

on sample size 500/genre and NO hyperparameter tuning:
- [GB] Mean Squared Error: 0.05264719893512396
- [GB] R2: 0.08043425668821236

on sample size 1687/genre with optimal hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100} (these match scikit-learn's defaults for GradientBoostingRegressor):
- [GB] Mean Squared Error: 0.05200244352093515
- [GB] R2: 0.09854691815352778

on full data set with optimal hyperparameters:
- [GB] Mean Squared Error: 0.05284822264086245
- [GB] R2: 0.12264012250017176

** GB runs MUCH faster than RFreg and gets its best R2 on the full data set, though RF still scores higher there (R2 0.151 vs. 0.123)

## Random Forest Regressor

- using the sampled dataset for faster testing

In [None]:
X_sampled = df_sampled['seq_clean'].values

y_sampled = df_sampled['label'].values

In [None]:
X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(X_sampled, y_sampled, 
                                                                             test_size=0.33, random_state=42)

In [None]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train_sample)

X_train_sample = vectorizer.transform(X_train_sample)
X_test_sample = vectorizer.transform(X_test_sample)

print(X_train_sample.shape, type(X_train_sample))

In [None]:
list(RandomForestRegressor().get_params().keys())

In [None]:
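# Parameter grid for tuning RandomForestRegressor

param_grid = {'n_estimators': [100, 1000],
              'max_depth': [2, 8, 32, None]  # None (not the string 'None') means unlimited depth
             }

# There are more parameters to test, but the larger grid below raised errors that need more investigation.
# Likely culprits: min_samples_split must be at least 2 (0 is invalid), and the criterion names
# 'squared_error', 'absolute_error' and 'poisson' require scikit-learn >= 1.0.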

# param_grid = {'criterion': ['squared_error', 'absolute_error', 'poisson'],
#               'n_estimators': [10, 50, 100, 500, 1000], 
#               'max_depth': [2, 4, 8, 16, 32, 64],
#               'min_samples_leaf': [1, 10, 25, 50],
#               'bootstrap': [True, False],
#               'min_samples_split': [0, 2, 4, 8, 16, 32]
#              }

In [None]:
# Run this after rf_grid has been created in the grid-search cell below
list(rf_grid.get_params().keys())

### *Do not run the next cell unless you have 2+ hours to kill*

In [None]:
print('Running Grid Search...')

# 1. Create a RandomForestRegressor model object without supplying arguments. 

rf = RandomForestRegressor()

# 2. Run a Grid Search with 3-fold cross-validation and assign the output to the object 'rf_grid'.
#    * Pass the model and the parameter grid to GridSearchCV()
#    * Set the number of folds to 3
#    * Specify the scoring method

rf_grid = GridSearchCV(n_jobs = -1, estimator=rf, param_grid = param_grid, cv=3, scoring='neg_mean_squared_error') 

# 3. Fit the grid search ('rf_grid') on the training data and assign the fitted object to the 
#    variable 'rf_grid_search'

rf_grid_search = rf_grid.fit(X_train_sample, y_train_sample)


print('Done')

In [None]:
# finding best parameters for the Random Forest Regressor

best_score = rf_grid_search.best_score_
print("The best score is: ", best_score)

rf_best_params = rf_grid_search.best_params_
print("The best params are: ", rf_best_params)

# Conclusion: n_estimators=1000 and bootstrap=True were the best hyperparameters

In [None]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model1 = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model1.fit(X_train_sample, y_train_sample)

In [None]:
y_sample_pred = rf_model1.predict(X_test_sample)

rf_mse = mean_squared_error(y_test_sample, y_sample_pred)
rf_r2 = r2_score(y_test_sample, y_sample_pred)

print('on sample size of 500/genre with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse))
print('[RF] R2: {0}'.format(rf_r2))

# on sample size of 500/genre with optimal hyperparameters:
# - [RF] Mean Squared Error: 0.05222244210798196
# - [RF] R2: 0.10795588207769391

In [None]:
# Function to test the predictions of the model with NEW unseen text (not part of testing set)

def rgrg_string_test(lyrics):
    new_lyrics = text_processing_pipeline(lyrics)
    print("the processed lyrics are: ", new_lyrics)
    
    new_text_vectorized = vectorizer.transform([new_lyrics])
    
    # predict returns an array with one element per input row; take the single prediction
    value = rf_model1.predict(new_text_vectorized)[0]
    print("Random Forest Regressor model gives a value of: ", value)
    if value < 0.50:
        print("which is negative")
    else: 
        print("which is positive")

In [None]:
test_text1 = "Hit me baby one more time my lonliness is killing me and I must confess I still believe"
test_text2 = "Oh, baby, when you talk like that You make a woman go mad So be wise and keep on Reading the signs of my body"
test_text3 = "looking out on the pouring rain I used to feel so uninspired"
test_text4 = "Girl put your record on tell me your favorite song just go ahead let your hair down"

rgrg_string_test(test_text1)
print('\n')
rgrg_string_test(test_text2)
print('\n')
rgrg_string_test(test_text3)
print('\n')
rgrg_string_test(test_text4)

# Running Larger RF Test on 1687 samples from each Genre

- to-do: break this testing out into a function instead of repeating code 

In [None]:
# sampling from the dataframe, k is 1687 which is the max number of samples from R&B the smallest Genre pool 

df_sampled2 = genre_sample(df, k=1687)
print(df_sampled2.shape)
df_sampled2.head(10)

In [None]:
#checking correct amounts of samples per genre were obtained

df_sampled2.genre.value_counts()

In [None]:
X_sampled2 = df_sampled2['seq_clean'].values

y_sampled2 = df_sampled2['label'].values

In [None]:
X_train_sample2, X_test_sample2, y_train_sample2, y_test_sample2 = train_test_split(X_sampled2, y_sampled2, 
                                                                             test_size=0.33, random_state=42)

In [None]:
vectorizer2 = TfidfVectorizer()
vectorizer2.fit(X_train_sample2)

X_train_sample2 = vectorizer2.transform(X_train_sample2)
X_test_sample2 = vectorizer2.transform(X_test_sample2)

print(X_train_sample2.shape, type(X_train_sample2))

In [None]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model2 = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model2.fit(X_train_sample2, y_train_sample2)

In [None]:
y_sample_pred2 = rf_model2.predict(X_test_sample2)
y_sample_pred2

rf_mse2 = mean_squared_error(y_test_sample2, y_sample_pred2)
rf_r2_2 = r2_score(y_test_sample2, y_sample_pred2)

print('on sample size of 1687/genre  with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse2))
print('[RF] R2: {0}'.format(rf_r2_2))

# on sample size 1687/genre with optimal hyperparameters:
# - [RF] Mean Squared Error: 0.05101035414841959
# - [RF] R2: 0.1205852034338123

# Conclusion on testing Random Forest Regressors

Not much was gained or lost by increasing the sample size.
R2 rose by about 0.013 (0.108 to 0.121) between the 500 and 1687 samples-per-genre runs, which is not a significant gain.

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

# Running RF Test on Full Data Set 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
# Initialize our vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the training data only.
# This builds the vocabulary and the IDF weights.
vectorizer.fit(X_train)

# Transform the train and test documents into TF-IDF vectors with the fitted vectorizer.
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape, type(X))
print(type(X_train))

In [None]:
#Optimal Hyperparameters for RandomForestRegressor based on GridSearchCV

rf_model = RandomForestRegressor(n_estimators=1000, bootstrap = True)

# 2. Fit the model to the training data below
rf_model.fit(X_train, y_train)

In [None]:
y_pred = rf_model.predict(X_test)
y_pred

rf_mse = mean_squared_error(y_test, y_pred)
rf_r2 = r2_score(y_test, y_pred)

print('on full data set with optimal hyperparameters:')
print('[RF] Mean Squared Error: {0}'.format(rf_mse))
print('[RF] R2: {0}'.format(rf_r2))

In [None]:
#end of RF testing

---

---


# Conclusion on testing Random Forest Regressors

Not much was gained or lost by increasing the sample size.
R2 rose by about 0.013 (0.108 to 0.121) between the 500 and 1687 samples-per-genre runs, which is not a significant gain.

on sample size 500/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05222244210798196
- [RF] R2: 0.10795588207769391

on sample size 1687/genre with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05101035414841959
- [RF] R2: 0.1205852034338123

on full data set with optimal hyperparameters:
- [RF] Mean Squared Error: 0.05115798019890316
- [RF] R2: 0.15070068589697727