# Building an Ensembler for a KNN and SVM classifiers

**Now that we have built both SVM and KNN classifiers, what would happen if we were to combine the two in an ensembler?**

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

**Downloading Data as before.**

In [2]:
given = pickle.load(open("cleanedData.pkl", "rb"))
audio_analysis = pickle.load(open("audio_analysis_df.pkl", "rb"))
key_analysis = pickle.load(open("key_analysis_df.pkl", "rb"))
lengths_df = pickle.load(open("lengths_df.pkl", "rb"))
songs = pd.concat([given, audio_analysis, key_analysis, lengths_df], axis=1)

**Narrowing down data frame into the features we desire.**

In [3]:
features = ['genre', 'lyrics', 'num_words', 'num_lines',
       'num_dupes', 'danceability', 'energy', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms', 'time_signature', 
            'year_bin', 'duration_avg', 'loudness_end_avg', 'loudness_max_avg',
       'loudness_max_time_avg', 'loudness_start_avg', 'key_changes',
       'num_sections', 'num_segments']

In [4]:
data = songs[features].dropna()

**Generating TF-IDF vectorizer on the lyrics and using these as well as the features above, for our X_train data set.**

In [5]:
# Quantify the lyrics
lyrics = data["lyrics"].replace("\n", " ")
vector = TfidfVectorizer(norm=None, min_df = 0.015, stop_words={"english"}, lowercase=True) # Do not normalize.
vector.fit(lyrics) # This determines the vocabulary.
tf_idf_sparse = vector.transform(lyrics)
lyrics = pd.DataFrame(tf_idf_sparse.todense())
lyrics.columns = vector.vocabulary_
    
# Use the features that we would like
df_features = data.drop(["lyrics","genre"], axis=1).reset_index().drop("index", axis=1)
    
# Create X_train and y_train
X_train = pd.concat([df_features, lyrics], axis = 1, sort=False)
y_train = data["genre"]

**Creating SVC model (`model1`)**

In [6]:
# Vectorize the categorical variables
X_dict = X_train.to_dict(orient="records")
vec = DictVectorizer(sparse=False)
# Create Scaler, PCA, and SVC objects
scaler = StandardScaler()
pca = PCA(n_components=6)
model = SVC()
  
# Construct the pipeline
model1 = Pipeline([("vectorizer", vec), ("scaler", scaler),("pca", pca), ("model", model)])

**Creating KNN Classifier model (`model2`)**

In [7]:
X_dict = X_train.to_dict(orient="records")
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsClassifier(n_neighbors=36)
  
# Construct the pipeline
model2 = Pipeline([("vectorizer", vec), ("scaler", scaler), ("model", model)])

**Building Ensembler**

In [8]:
eclf1 = VotingClassifier(estimators=[('svc', model1), ('knn', model2)], voting='hard')

In [9]:
eclf1 = eclf1.fit(X_dict, y_train)

In [10]:
f1 = cross_val_score(eclf1,
                        X_dict,
                        y_train == "rock",
                        cv=10, scoring="f1").mean()

In [11]:
f1

0.60541602355882695

**The Ensembler turns out to be worse than our SVC so we will stick with that model.**

**Now we will use the SVC to predict the missing genres in the original data frame, in order to fill in the empty training data. We will then use that entire data set to predict the genres of the current Billboard top 100. This known as regression imputation and it has some drawbacks, such as causing relationships to be over identified and suggest greater precision in the imputed values than there should be. However, we are only doing it for a small fraction of the observations, so we are not too worried.**

In [12]:
missing_genres = songs[songs.genre.isna()][features].drop(["genre"], axis=1)

In [13]:
# Quantify the lyrics
lyrics = missing_genres["lyrics"].replace("\n", " ")
vector = TfidfVectorizer(norm=None, min_df = 0.015, stop_words={"english"}, lowercase=True) # Do not normalize.
vector.fit(lyrics) # This determines the vocabulary.
tf_idf_sparse = vector.transform(lyrics)
lyrics = pd.DataFrame(tf_idf_sparse.todense())
lyrics.columns = vector.vocabulary_
lyrics = lyrics.set_index(pd.Index(missing_genres.index))
    
# Use the features that we would like
df_features = missing_genres.drop(["lyrics"], axis=1)
    
# Create X_train and y_train
X = pd.concat([df_features, lyrics], axis = 1, sort=False)


In [14]:
model1.fit(X_dict, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=6, random_state=None,
  svd_solver='auto', tol=0.0, w...f', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

In [15]:
predictions = model1.predict(X.to_dict(orient="records"))

In [16]:
predictions[:20]

array(['rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock',
       'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'jazz', 'rock',
       'rock', 'rock', 'rock', 'rock'], dtype=object)

In [17]:
missing_genres["genre"] = predictions

In [18]:
songs.update(missing_genres["genre"])

In [19]:
songs_train = songs[features]
y_train = songs["genre"]

In [20]:
# Quantify the lyrics
lyrics = songs_train["lyrics"].replace("\n", " ")
vector = TfidfVectorizer(norm=None, min_df = 0.015, stop_words={"english"}, lowercase=True) # Do not normalize.
vector.fit(lyrics) # This determines the vocabulary.
tf_idf_sparse = vector.transform(lyrics)
lyrics = pd.DataFrame(tf_idf_sparse.todense())
lyrics.columns = vector.vocabulary_
    
# Use the features that we would like
df_features = songs_train.drop(["lyrics"], axis=1)
    
# Create X_train and y_train
X = pd.concat([df_features, lyrics], axis = 1, sort=False)


**Re-fitting the model to include our newly predicted genres.**

In [21]:
model1.fit(X.to_dict(orient="records"), y_train)

Pipeline(memory=None,
     steps=[('vectorizer', DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=6, random_state=None,
  svd_solver='auto', tol=0.0, w...f', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

**Now that we have a finalized model with a complete data set, let's take a look at how it performs agains all the songs. While accuracy is not the best measure of an individual class, we will use it to gain a holistic perspective on the performance of our model across all of our data.**

In [22]:
accuracy = cross_val_score(model1,
                        X.to_dict(orient="records"),
                        y_train,
                        cv=10, scoring="accuracy").mean()
accuracy

0.58845431042106267

Overall, our model did fairly well with regards to all songs.

# **Now we can predict the genres of the current top 100 songs on billboard hot-100!**

In [23]:
top100 = pickle.load(open("top100_df.pkl", "rb"))

In [24]:
top100["year_bin"] = "10s"

In [25]:
features = ['lyrics', 'num_words', 'num_lines',
       'num_dupes', 'danceability', 'energy', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms', 'time_signature', 
            'year_bin', 'duration_avg', 'loudness_end_avg', 'loudness_max_avg',
       'loudness_max_time_avg', 'loudness_start_avg', 'key_changes',
       'num_sections', 'num_segments']

In [26]:
top100[features].head()

Unnamed: 0,lyrics,num_words,num_lines,num_dupes,danceability,energy,loudness,mode,speechiness,acousticness,...,time_signature,year_bin,duration_avg,loudness_end_avg,loudness_max_avg,loudness_max_time_avg,loudness_start_avg,key_changes,num_sections,num_segments
0,We go together \n Better than birds of a feath...,383,46,29,0.846,0.731,-5.027,0,0.064,0.0436,...,4,10s,0.247322,-60.0,-5.140011,0.052402,-16.998504,6,12,732
1,"Yeah, breakfast at Tiffany's and bottles of bu...",480,50,16,0.725,0.321,-10.744,0,0.323,0.578,...,4,10s,0.272733,-60.0,-12.369545,0.067081,-23.344505,1,4,655
2,"Uh, uh, yeah, come on \n \n Please me, baby \...",487,85,51,0.747,0.57,-6.711,1,0.081,0.0642,...,4,10s,0.244987,-60.0,-6.354572,0.059532,-16.06294,5,11,820
3,"Swae Lee: \n Needless to say, I keep it in che...",276,38,11,0.76,0.479,-5.574,1,0.0466,0.556,...,4,10s,0.264724,-60.0,-6.413338,0.057229,-16.804481,3,8,597
4,Found you when your heart was broke \n I fille...,434,54,36,0.752,0.488,-7.05,1,0.0705,0.297,...,4,10s,0.275493,-60.0,-8.837766,0.062921,-18.74872,7,11,732


In [27]:
top100_df = top100[features]

In [28]:
# Quantify the lyrics
lyrics = top100_df["lyrics"].replace("\n", " ")
vector = TfidfVectorizer(norm=None, min_df = 0.015, stop_words={"english"}, lowercase=True) # Do not normalize.
vector.fit(lyrics) # This determines the vocabulary.
tf_idf_sparse = vector.transform(lyrics)
lyrics = pd.DataFrame(tf_idf_sparse.todense())
lyrics.columns = vector.vocabulary_
    
# Use the features that we would like
df_features = top100_df.drop(["lyrics"], axis=1)
    
# Create X_train and y_train
X = pd.concat([df_features, lyrics], axis = 1, sort=False)

In [29]:
top100_preds = model1.predict(X.to_dict(orient="records"))

In [30]:
top100_preds

array(['rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock',
       'hip-hop', 'pop', 'hip-hop', 'country', 'rock', 'rock', 'rock',
       'rock', 'hip-hop', 'pop', 'rock', 'rock', 'rock', 'country',
       'hip-hop', 'country', 'rock', 'rock', 'pop', 'rock', 'rock',
       'country', 'hip-hop', 'rock', 'rock', 'rock', 'rock', 'country',
       'country', 'rock', 'rock', 'rock', 'pop', 'rock', 'rock', 'rock',
       'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock',
       'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock',
       'rock', 'rock', 'rock', 'rock', 'rock', 'country', 'pop', 'rock',
       'rock', 'rock', 'rock', 'rock', 'hip-hop', 'rock', 'rock', 'rock',
       'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'hip-hop',
       'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock',
       'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'pop', 'rock'], dtype=object)

In [31]:
top100["genre"] = top100_preds

**Here are the predictions of the genres that were made by the model for the top 100 songs.**

In [32]:
top100[["title", "artist","genre"]]

Unnamed: 0,title,artist,genre
0,Sucker,Jonas Brothers,rock
1,7 Rings,Ariana Grande,rock
2,Please Me,Cardi B & Bruno Mars,rock
3,Sunflower (Spider-Man: Into The Spider-Verse),Post Malone & Swae Lee,rock
4,Without Me,Halsey,rock
5,Shallow,Lady Gaga & Bradley Cooper,rock
6,Wow.,Post Malone,rock
7,Happier,Marshmello & Bastille,rock
8,Middle Child,J. Cole,hip-hop
9,Sicko Mode,Travis Scott,pop


**How well did our model predict**

In [33]:
top100[["title", "artist","genre"]].loc[84]

title     Whiskey Glasses
artist      Morgan Wallen
genre                rock
Name: 84, dtype: object

**We predicted this song to be country and it is!**

![](whiskeyCountry.png)

**This is fairly impressive since there are very few country songs which have made it to the top 100 in our training data**

In [34]:
songs.genre.value_counts()

rock                 1893
pop                   886
hip-hop               317
soul                  305
country               159
disco                 131
jazz                   76
rnb                    68
electronic/dance       55
blues                  49
alternative/indie      37
folk                   27
reggae                 13
swing                  12
Name: genre, dtype: int64

In [35]:
top100.genre.value_counts()

rock       80
country     7
hip-hop     7
pop         6
Name: genre, dtype: int64

**We may assume pop is such a popular genre nowadays and we may think that most of the songs on the top 100 should be pop. It may also appear that our model over predicts for rock songs. However, a lot of songs nowadays that are classified as pop have elements of rock in them. Indeed, if we look at some of the biggest pop artists around from today, we will see that they are also classified as rock artists. It is worth noting however, that our model may over predict because we did use regression imputation.**

<img src="Pic1.png" width="75%">

<img src="Pic2.png" width="75%">

<img src="Pic3.png" width="75%">