# Genre Prediction Presentation
**By Josh Abel and Kevin Label**

***Note, our model may predict our names are a rap song since they rhyme.***

##  Introduction

When we began brainstorming ideas for a data science project, we tried to think of topics that related to both of us. We settled on music, since we both grew up learning various instruments.

![](josh.png) 
<img src="kevin.png" width="15%">

(We couldn't find an old picture of Kevin, so we let him borrow mine.)

From this we thought of extracting Billboard top 100 data and doing something interesting with it. We began asking ourselves different questions about the analysis of music, and arrived at this one: what really defines a genre? It may seem like common sense to listen to a song and classify its genre. But what measurable features allow our brains to recognize the genre of a song?

This set us on a search for obtainable information about songs that we could use to predict a song's genre.

From our research we came across an article on "towardsdatascience.com" by Rosebud Anwuri, who had previously tried to distinguish between old and new music. As part of that project, she released the data she had gathered and derived as open source. We decided to take her data and expand on it, since we wanted to explore more variables.

Luckily, she used the Spotify API to gather data and provided a unique key for each song. With these keys we were able to query the API ourselves and retrieve more data on each song.

The following is a small segment of her data:

```python
import pandas as pd

# Load the data set published alongside Rosebud Anwuri's article
Rosebud_data = pd.read_csv("https://raw.githubusercontent.com/RosebudAnwuri/TheArtandScienceofData/master/The%20Making%20of%20Great%20Music/data/music_df.csv")
Rosebud_data.head()
```

| Unnamed: 0 | lyrics | num_syllables | pos | year | fog_index | flesch_index | num_words | num_lines | title | f_k_grade | ... | tempo | duration_ms | time_signature | uri | analysis_url | artist_with_features | year_bin | image | cluster | Gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mona Lisa, Mona Lisa, men have named you\nYou'... | 189.0 | 0.199 | 1950 | 5.2 | 88.74 | 145 | 17 | Mona Lisa | 2.9 | ... | 86.198 | 207573.0 | 3 | spotify:track:3k5ycyXX5qsCjLd7R2vphp | https://api.spotify.com/v1/audio-analysis/3k5y... | | 50s | https://i.scdn.co/image/a4c0918f13b67aa8d9f4ea... | String Lover | male |
| 1 | I wanna be Loved\nBy Andrews Sisters\n\nOooo-o... | 270.9 | 0.224 | 1950 | 4.4 | 82.31 | 189 | 31 | I Wanna Be Loved | 3.3 | ... | 170.869 | 198027.0 | 5 | spotify:track:4UY81WrDU3jTROGaKuz4uZ | https://api.spotify.com/v1/audio-analysis/4UY8... | Gordon Jenkins | 50s | https://i.scdn.co/image/42e4dc3ab9b190056a1ca1... | String Lover | Group |
| 2 | I was dancing with my darling to the Tennessee... | 174.6 | 0.351 | 1950 | 5.2 | 88.74 | 138 | 16 | Tennessee Waltz | 2.9 | ... | 86.335 | 182733.0 | 3 | spotify:track:6DKt9vMnMN0HmlnK3EAHRQ | https://api.spotify.com/v1/audio-analysis/6DKt... | | 50s | https://i.scdn.co/image/353b05113b1a140d64d83d... | String Lover | female |
| 3 | Each time I hold someone new\nMy arms grow col... | 135.9 | 0.231 | 1950 | 4.4 | 99.23 | 117 | 18 | I'll Never Be Free | 0.9 | ... | 82.184 | 158000.0 | 3 | spotify:track:0KnD456yC5JuweN932Ems3 | https://api.spotify.com/v1/audio-analysis/0KnD... | Kay Starr | 50s | https://i.scdn.co/image/4bd427bb9181914d0fa448... | String Lover | male |
| 4 | Unfortunately, we are not licensed to display ... | 46.8 | 0.079 | 1950 | 6.0 | 69.79 | 32 | 3 | All My Love | 6.0 | ... | 123.314 | 190933.0 | 4 | spotify:track:05sXHTLqIpwywbpui1JT4o | https://api.spotify.com/v1/audio-analysis/05sX... | | 50s | https://i.scdn.co/image/353b05113b1a140d64d83d... | String Lover | female |


Some of these features were quite useful, such as danceability (a measure from the Spotify API), while others were not, such as the gender of the artist.

We first wanted to see the proportions of genres that made it to the top 100 in the years 1950 to 2015.
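A minimal sketch of how such proportions could be computed with pandas. The toy table and the `genre` column name are assumptions for illustration; the excerpt of the data shown earlier does not include a genre column.

```python
import pandas as pd

# Toy stand-in for the song table; the "genre" column name is an assumption.
songs = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "genre": ["rock", "rock", "pop", "jazz", None],
})

# Share of each genre among songs that have a genre label.
# value_counts skips missing values by default.
genre_share = songs["genre"].value_counts(normalize=True)
print(genre_share)
```

Songs without a genre label are left out of the denominator here, which matters later because not every song in the data had a genre.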

<img src="Img1.png" width="65%">

From this we saw that rock took up almost half of all top 100 songs over those years. We wondered how this could be, so we decided to look at how many songs from each decade made it into the top 100.

<img src="Img2.png" width="55%">

Looking at this, it becomes apparent that over the years more songs have quickly come and gone in the top 100. It is also worth noting that the decade with the most songs reaching the top 100 was the 80s, in which rock was most prevalent.

<img src="Slash.png" width="50%">

We then wanted to merge these two ideas, year and proportion of genres, to see how they move together.

<img src="Img3.png" width="65%">
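One way to tabulate genre share per decade is a normalized cross-tabulation. This is a sketch on made-up rows: `year_bin` is a column in the data shown earlier, while the `genre` column name is an assumption.

```python
import pandas as pd

# Made-up rows; "year_bin" exists in the source data, "genre" is assumed.
songs = pd.DataFrame({
    "year_bin": ["50s", "50s", "80s", "80s", "80s", "10s"],
    "genre": ["jazz", "pop", "rock", "rock", "pop", "hip-hop"],
})

# normalize="index" turns each decade's counts into shares that sum to 1.
genre_by_decade = pd.crosstab(songs["year_bin"], songs["genre"], normalize="index")
print(genre_by_decade)
```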

From this graph we could see how popular a genre was per decade. The change across decades is most visible for rock and pop.

From this, it is clear that the categorical variable "decade" is related to genre. Next, we wanted to visualize whether quantitative variables also show differences between genres.

For example, we can see that the number of repeated lines in a song's lyrics varies dramatically by genre.

<img src="Img4.png" width="55%">

Similarly, we found stark contrasts between genres with regard to features such as danceability and acousticness. The contrast is noticeable as well in a feature such as the number of key changes in a song. Note that these differences are easier to see between genres that are very different, such as jazz and pop, than between genres that are similar, such as pop and hip-hop.

<img src="Img5.png" width="75%">

<img src="Img6.png" width="75%">

<img src="Img7.png" width="75%">

Since it is hard to see the difference between similar genres such as pop and hip-hop, we tried other approaches to find these differences. One way we tried to separate pop and hip-hop, for example, was a principal component analysis (PCA). However, it did not help much in showing the difference.

<img src="Img8.png" width="55%">
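A minimal sketch of this kind of projection with scikit-learn, using synthetic numbers in place of real song features. The two overlapping clusters, the five features, and the standardization step are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two similar genres: heavily overlapping clusters.
pop = rng.normal(loc=0.0, scale=1.0, size=(50, 5))
hip_hop = rng.normal(loc=0.3, scale=1.0, size=(50, 5))
X = np.vstack([pop, hip_hop])

# Standardize first so no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of greatest variance for plotting.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```

PCA picks directions of greatest overall variance without looking at the genre labels, so when two genres overlap in the original features there is no guarantee the projection will pull them apart.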

Now that we knew how our variables interact with the genres of songs, we decided to put them to the test and use them to implement a machine learning model.

Here was our game plan:

(1) Train an SVM classifier and a k-nearest neighbors classifier on the ~4000 songs we had.

(2) See how the two classifiers perform together, by ensembling them.

(3) Choose the best model.

(4) Since not all the songs in our training data had genres corresponding to them, predict the missing genres, put them back into our data, and train a new model using them. This is known as regression imputation (which can have drawbacks).

(5) Scrape the current Billboard top 100, and then gather its data from the Spotify API and a lyrics website.

(6) Use this new data as a test set and predict the genres of the current top 100.

<img src="Billboard.png" width="70%">
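Step (4) of the plan above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's code: the feature matrix, the two-genre labels, and the use of `SVC` with default settings are all assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic features and labels; the first 10 songs pretend to have no genre.
X = rng.normal(size=(60, 4))
y = np.where(X[:, 0] > 0, "rock", "pop").astype(object)
y[:10] = None

# Boolean mask of the rows that do have a genre label.
labeled = np.array([label is not None for label in y])

# Fit on the labeled rows only, then predict the missing genres.
model = SVC().fit(X[labeled], y[labeled].astype(str))
y_filled = y.copy()
y_filled[~labeled] = model.predict(X[~labeled])

# Retrain on the full data, now including the imputed labels.
final_model = SVC().fit(X, y_filled.astype(str))
```

The imputed labels are the first model's own guesses, so any bias it has (for example toward the most common genre) is fed back into the retrained model.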

## Support Vector Machine Model (SVM)

<img src="SVM1.png" width="20%">
<img src="SVM4.png" width="20%">
<img src="SVM3.png" width="20%">
<img src="SVM2.png" width="20%">

An SVM tries to separate categories using a linear boundary. For more complicated data, it may project the data into a higher dimension, such as from 2D to 3D, and then draw a plane through the transformed data. (See above.)
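The effect of that higher-dimensional projection can be seen on a toy data set that no straight line can split. The sketch below uses scikit-learn's `make_circles` (one class forms a ring around the other) and compares a linear kernel with an RBF kernel; the data set and kernel choice are illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged as concentric rings: not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear kernel can only draw a straight line through the plane.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# The RBF kernel implicitly maps the points into a higher-dimensional space,
# where a flat boundary can separate the inner ring from the outer one.
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear: {linear_acc:.2f}, rbf: {rbf_acc:.2f}")
```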

To create the SVM we experimented with various features, used our intuition from the data exploration, and tried a model with and without a TF-IDF vectorizer on the lyrics of each song. We then used all these features in a PCA and constructed our model. The following is our analysis of optimizing the `min_df` parameter in the TF-IDF vectorizer.

<img src="SVM6.png" width="50%">
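For context, `min_df` sets the minimum number of documents (here, songs) a word must appear in to be kept in the vocabulary. The sketch below shows only what the parameter does, on four made-up lyric snippets; it is not the tuning procedure behind the plot above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up lyric snippets, one per "song".
lyrics = [
    "love me tender love me true",
    "baby love is all you need",
    "ride the lightning tonight",
    "dance all night baby",
]

# min_df=1 keeps every word; min_df=2 keeps only words found in 2+ songs.
vocab_all = TfidfVectorizer(min_df=1).fit(lyrics).vocabulary_
vocab_shared = TfidfVectorizer(min_df=2).fit(lyrics).vocabulary_

print(len(vocab_all), sorted(vocab_shared))
```

Raising `min_df` drops rare words, which shrinks the feature space and removes words that appear in too few songs to generalize, at the cost of possibly discarding distinctive vocabulary.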

## K-Nearest Neighbors Classifier

<img src="KNN.png" width="40%">

In our K-Nearest Neighbors Classifier, we similarly tested features, used a TF-IDF vectorizer on the lyrics, and ran PCA on all the features. In the end we settled on k = 36 nearest neighbors, based on the F1 scores of the classes.
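A search over k of this kind might look like the sketch below. It runs on synthetic data with a handful of candidate values and uses macro-averaged F1 under 5-fold cross-validation; the candidate list, the scoring choice, and the data are assumptions, and the k it picks here says nothing about the value found for the real song data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem standing in for the genre data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

candidate_ks = [1, 5, 15, 36]

# Mean macro-F1 across 5 folds for each candidate k.
scores = {
    k: cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=5, scoring="f1_macro"
    ).mean()
    for k in candidate_ks
}

best_k = max(scores, key=scores.get)
print(scores, best_k)
```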

## Ensembling the Two Models

Here we used a voting classifier to combine our two models.

The voting classifier takes each model's estimated probability for each class (for example, a probability for class A and a probability for class B) and combines them to construct a "majority vote".
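A minimal sketch of such an ensemble with scikit-learn's `VotingClassifier`, assuming soft voting (averaging the predicted class probabilities), which is what the description above suggests. With `voting="hard"` the ensemble would instead count each model's predicted label. The synthetic data and hyperparameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        # probability=True is needed so the SVC can report class probabilities.
        ("svm", SVC(probability=True, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",
)
ensemble.fit(X, y)

# Soft voting without weights is the plain average of the two models'
# class probabilities; the predicted class is the one with the highest average.
individual = [est.predict_proba(X) for est in ensemble.estimators_]
averaged = np.mean(individual, axis=0)
ensemble_proba = ensemble.predict_proba(X)
predictions = ensemble.predict(X)
```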

<img src="Ensemble1.png" width="50%">

The above shows how our ensemble performed, measured by the F1 score for rock. We found that it performed worse than the SVC alone, so we stuck with the SVC to make our predictions on the current top 100 songs.

### Missing predictions

First we predicted the genres of the songs whose genres were missing in our original data set.

<img src="Ensemble2.png" width="50%">

We can see that the first 20 predictions on the missing data were mostly rock, with a few jazz predictions.

## Final Prediction on the Current Top 100

After web scraping data for the current Billboard top 100 songs, getting information from the Spotify API, and retraining our model on all the data, we used this model to predict the genres of the current top 100 songs.

The following shows the individual predictions for each of the top 100.

<img src="current.png" width="60%">

<img src="Ens2.png" width="50%">

We can see that our model predicted *Whiskey Glasses* by Morgan Wallen correctly, even though country music was a rare genre.

We can also see that country was a rare top 100 genre from 1950 to 2015.

<img src="final1.png" width="40%">

It may be surprising that our model predicted rock so many more times than pop or hip-hop.

However, we can see from our earlier analysis that pop and rock are fairly similar and so are pop and hip-hop.

<img src="rockpop.png" width="50%">

Furthermore, it may appear that our model overpredicts rock. However, closer inspection reveals that a lot of songs classified as pop nowadays also have elements of rock in them. Indeed, if we look at some of today's biggest pop artists, we see that they are also classified as rock artists. It is also worth noting that our model may slightly overpredict rock because we used regression imputation.

<img src="Pic1.png" width="75%">

<img src="Pic2.png" width="75%">

<img src="Pic3.png" width="75%">

In conclusion, we found that there is no one feature that distinguishes two genres from each other. Rather, it is a combination of many attributes, such as lyrics, acousticness, and how often the flow of the song changes (e.g. number of key changes, number of segments, etc.). Some genres have clear distinctions from one another (e.g. pop and jazz) while others overlap significantly (e.g. pop and hip-hop). Therefore, the next time you listen to a song, don't feel bad if you guess the genre incorrectly ("is Taylor Swift pop or country?").