# Natural Language Processing on Billboard 100 Songs


---

The aim of my project is to predict which group (Top 25, Top 25 - 50, Top 50 - 75, Top 75 - 100) a song in the Billboard 100 belongs to, based on the lyrics of the song. The ultimate question is how much lyrics and words matter compared with other features such as rhythm and artist popularity.

To investigate this question, various natural language processing tools were used, including CountVectorizer, TF-IDF, Sentiment Analysis and Part of Speech Tagging. An additional feature explored was the repetitiveness of a song, measured using the Zopfli compression algorithm.

### Agenda

- [Scraping the Data](#Scraping-the-Data)
     - [Billboard100 Scrap](#Billboard100-Scrap)
     - [Lyric Scrap](#Lyric-Scrap)
     - [Genres Scrap](#Genres-Scrap)
- [EDA](#EDA)
     - [Genre Grouping](#Genre-Grouping)
     - [Artist Popularity](#Artist-Popularity)
- [CountVectorizer Model](#CountVectorizer-Model)
     - [CountVectorizer, Artist Popularity & Genre Model](#CountVectorizer,-Artist-Popularity-&-Genre-Model)
- [TF-IDF Model](#TF-IDF-Model)
     - [TF-IDF, Artist Popularity & Genre Model](#TF-IDF,-Artist-Popularity-&-Genre-Model)
- [Sentiment Analysis](#Sentiment-Analysis)
     - [Sentiment KNN Model](#Sentiment-KNN-Model)
     - [Compound Sentiment & Genres Model](#Compound-Sentiment-&-Genres-Model)
- [Repetitiveness](#Repetitiveness)
- [Part of Speech Tagging](#Part-of-Speech-Tagging)
     - [Percentage Repetition, Part of Speech Tags & Genre Model](#Percentage-Repetition,-Part-of-Speech-Tags-&-Genre-Model)
- [All Features Model](#All-Features-Model)
- [Conclusion](#Conclusion)

In [6]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

<a id="Scraping the Data"></a>
## Scraping the Data

<a id="Billboard100 Scrap"></a>
### Billboard100 Scrap

Scraping the data was a three-step process. First, I needed to acquire the names of the songs, the artists and the ranks of the songs in the Billboard 100 list. The Billboard 100 is a weekly list of the 100 most popular songs. No duplicate songs could be present in the data, because a song's rank changes over time and a duplicated song would be classified into multiple groups. To guarantee there were no duplicates, I grouped the data by artist and song and selected the minimum rank (i.e. the highest position on the chart). The songs were taken from the period January 2016 to August 2018.

For the number one ranked songs in the Billboard 100, the HTML structure of the page was different, which required a separate scrape for these songs (the second code cell below).

Using the rank of the songs, I created the groups that serve as my target variable. The groups were Top 25, Top 25 - 50, Top 50 - 75 and Top 75 - 100.

In [None]:
# Songs ranked 2 - 100
# find() returns None when a tag is missing, so .get_text() raises AttributeError
def extract_song(listing):
    try:
        return listing.find("span", attrs={"class": "chart-list-item__title-text"}).get_text()
    except AttributeError:
        return "Not found"

def extract_artist(listing):
    try:
        return listing.find("div", attrs={"class": "chart-list-item__artist"}).get_text()
    except AttributeError:
        return "Not found"

def extract_rank(listing):
    try:
        return listing.find("div", attrs={"class": "chart-list-item__rank"}).get_text()
    except AttributeError:
        return "Not found"

url_template = "https://www.billboard.com/charts/hot-100/{}"

songs = []
artists = []
ranks = []

for date in set(['2018-08-04', '2018-07-28', '2018-07-21', '2018-07-14', '2018-07-07', 
    '2018-06-30', '2018-06-23', '2018-06-16', '2018-06-09', '2018-06-02','2018-05-26',
                 '2018-05-19','2018-05-12','2018-05-05','2018-04-28','2018-04-21',
                '2018-04-14','2018-04-07','2018-03-31','2018-03-24','2018-03-17',
                '2018-03-10','2018-03-03','2018-02-24','2018-02-17','2018-02-10',
                '2018-02-03','2018-01-27','2018-01-20','2018-01-13','2018-01-06','2017-12-30',
                '2017-12-23','2017-12-16','2017-12-09','2017-12-02','2017-11-25','2017-11-18','2017-11-11',
                '2017-11-04','2017-10-28','2017-10-21','2017-10-14','2017-10-07','2017-09-30','2017-09-23',
                '2017-09-16','2017-09-09','2017-09-02','2017-08-26','2017-08-19','2017-08-12','2017-08-05','2017-07-29', '2017-07-22','2017-07-15','2017-07-08','2017-07-01','2017-06-24','2017-06-17',
                '2017-06-10','2017-06-03','2017-05-27','2017-05-20','2017-05-13','2017-05-06','2017-04-29',
                '2017-04-22','2017-04-15','2017-04-08','2017-04-01','2017-03-25','2017-03-28','2017-03-11',
                '2017-03-18','2017-03-11','2017-03-04','2017-02-25','2017-02-18','2017-02-11','2017-02-04',
                '2017-01-28','2017-01-21','2017-01-14','2017-01-07','2016-12-31','2016-12-24','2016-12-17',
                 '2016-12-10','2016-12-03','2016-11-26','2016-11-19','2016-11-12','2016-11-05','2016-10-29',
                '2016-10-22','2016-10-15','2016-10-08','2016-10-01','2016-09-24','2016-09-17','2016-09-10',
                '2016-09-03','2016-08-27','2016-08-20','2016-08-13','2016-08-06','2016-07-30','2016-07-23','2016-07-23',
                '2016-07-16','2016-07-09','2016-07-02','2016-06-25','2016-06-18','2016-06-11','2016-06-04','2016-05-28'
                ,'2016-05-21','2016-05-14','2016-05-07','2016-04-30','2016-04-23','2016-04-16','2016-04-09','2016-04-02',
                '2016-03-26','2016-03-19','2016-03-12','2016-03-05','2016-02-27','2016-02-20','2016-02-13','2016-02-06',
                '2016-01-30','2016-01-23','2016-01-16','2016-01-09','2016-01-02']):
    updated_url = url_template.format(date)
    chart_request = requests.get(updated_url)
    soup = BeautifulSoup(chart_request.text, "html.parser")
    listings = soup.find_all("div", attrs={"class": "chart-list-item__first-row"})

    # Collect the song, artist and rank from every chart row on the page
    for listing in listings:
        songs.append(extract_song(listing))
        artists.append(extract_artist(listing))
        ranks.append(extract_rank(listing))
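
A hand-typed list of chart dates is easy to get wrong, since a missing comma between two string literals silently joins them into one invalid date. As a minimal sketch, assuming the charts are dated on consecutive Saturdays from 2016-01-02 to 2018-08-04, the same dates could be generated instead. The helper name `weekly_chart_dates` is hypothetical and not part of the original notebook.

```python
from datetime import date, timedelta

def weekly_chart_dates(start, end):
    """Return ISO date strings from start to end (inclusive) in 7-day steps."""
    dates = []
    current = start
    while current <= end:
        dates.append(current.isoformat())
        current += timedelta(weeks=1)
    return dates

# Assumed endpoints, taken from the first and last dates in the hand-typed list
chart_dates = weekly_chart_dates(date(2016, 1, 2), date(2018, 8, 4))
print(len(chart_dates), chart_dates[0], chart_dates[-1])
```

Each string in `chart_dates` can be passed to `url_template.format(...)` in the same way as the literal dates.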

In [None]:
# Songs ranked 1
# find() returns None when a tag is missing, so .get_text() raises AttributeError
def extract_top_song(listing):
    try:
        return listing.find("div", attrs={"class": "chart-number-one__title"}).get_text()
    except AttributeError:
        return "Not found"

def extract_top_artist(listing):
    try:
        return listing.find("div", attrs={"class": "chart-number-one__artist"}).get_text()
    except AttributeError:
        return "Not found"
    
    
url_template = "https://www.billboard.com/charts/hot-100/{}"

top_songs = []
top_artists = []

for date in set(['2018-08-04', '2018-07-28', '2018-07-21', '2018-07-14', '2018-07-07', 
    '2018-06-30', '2018-06-23', '2018-06-16', '2018-06-09', '2018-06-02','2018-05-26',
                 '2018-05-19','2018-05-12','2018-05-05','2018-04-28','2018-04-21',
                '2018-04-14','2018-04-07','2018-03-31','2018-03-24','2018-03-17',
                '2018-03-10','2018-03-03','2018-02-24','2018-02-17','2018-02-10',
                '2018-02-03','2018-01-27','2018-01-20','2018-01-13','2018-01-06','2017-12-30',
                '2017-12-23','2017-12-16','2017-12-09','2017-12-02','2017-11-25','2017-11-18','2017-11-11',
                '2017-11-04','2017-10-28','2017-10-21','2017-10-14','2017-10-07','2017-09-30','2017-09-23',
                '2017-09-16','2017-09-09','2017-09-02','2017-08-26','2017-08-19','2017-08-12','2017-08-05','2017-07-29', '2017-07-22','2017-07-15','2017-07-08','2017-07-01','2017-06-24','2017-06-17',
                '2017-06-10','2017-06-03','2017-05-27','2017-05-20','2017-05-13','2017-05-06','2017-04-29',
                '2017-04-22','2017-04-15','2017-04-08','2017-04-01','2017-03-25','2017-03-28','2017-03-11',
                '2017-03-18','2017-03-11','2017-03-04','2017-02-25','2017-02-18','2017-02-11','2017-02-04',
                '2017-01-28','2017-01-21','2017-01-14','2017-01-07','2016-12-31','2016-12-24','2016-12-17',
                 '2016-12-10','2016-12-03','2016-11-26','2016-11-19','2016-11-12','2016-11-05','2016-10-29',
                '2016-10-22','2016-10-15','2016-10-08','2016-10-01','2016-09-24','2016-09-17','2016-09-10',
                '2016-09-03','2016-08-27','2016-08-20','2016-08-13','2016-08-06','2016-07-30','2016-07-23','2016-07-23',
                '2016-07-16','2016-07-09','2016-07-02','2016-06-25','2016-06-18','2016-06-11','2016-06-04','2016-05-28'
                ,'2016-05-21','2016-05-14','2016-05-07','2016-04-30','2016-04-23','2016-04-16','2016-04-09','2016-04-02',
                '2016-03-26','2016-03-19','2016-03-12','2016-03-05','2016-02-27','2016-02-20','2016-02-13','2016-02-06',
                '2016-01-30','2016-01-23','2016-01-16','2016-01-09','2016-01-02']):
    updated_url = url_template.format(date)
    chart_request = requests.get(updated_url)
    soup = BeautifulSoup(chart_request.text, "html.parser")
    listings = soup.find_all("div", attrs={"class": "chart-number-one__info "})

    # Collect the number one song and its artist from the page
    for listing in listings:
        top_songs.append(extract_top_song(listing))
        top_artists.append(extract_top_artist(listing))

In [None]:
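# Group by artist and song to eliminate duplicate songs, keeping each song's best (lowest) rank.
# Note: "Rank" must be numeric here; string ranks would be compared alphabetically ("10" < "2").

result2 = result.groupby(["Artist","Song"])["Rank"].min().reset_index().set_index('Artist')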
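
To illustrate the deduplication step, here is a small sketch on made-up data (the artists, songs and ranks below are placeholders, not scraped values). It assumes the scraped ranks arrive as text, which is why they are converted to integers before taking the minimum.

```python
import pandas as pd

# Made-up rows: the same song appears twice with different weekly ranks
result = pd.DataFrame({
    "Artist": ["Artist X", "Artist X", "Artist Y"],
    "Song": ["Song A", "Song A", "Song B"],
    "Rank": ["2", "10", "3"],  # scraped text, not numbers
})

# Pitfall: strings compare character by character, so "10" sorts before "2"
string_minimum = min(["2", "10"])

# Convert to integers first, then keep the best (lowest) rank per song
result["Rank"] = result["Rank"].astype(int)
best_rank = result.groupby(["Artist", "Song"])["Rank"].min().reset_index()
print(best_rank)
```

After the conversion, each (artist, song) pair appears once with its highest chart position.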

In [None]:
# Creating the groups from the rank of the songs

classifiers = []

for entry in result2["Rank"]:
    rank = int(entry)
    # Each branch is only reached when the previous upper bound was exceeded
    if rank <= 25:
        classifiers.append("Top 25")
    elif rank <= 50:
        classifiers.append("Top 25 - 50")
    elif rank <= 75:
        classifiers.append("Top 50 - 75")
    else:
        classifiers.append("Top 75- 100")
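
The rank-to-group rule can also be written as a small lookup function. This is only a sketch: the names `rank_to_group`, `UPPER_BOUNDS` and `GROUP_LABELS` are hypothetical, and the label spellings are illustrative and would need to match whatever labels the rest of the analysis uses.

```python
from bisect import bisect_left

UPPER_BOUNDS = [25, 50, 75]  # inclusive upper rank of the first three groups
GROUP_LABELS = ["Top 25", "Top 25 - 50", "Top 50 - 75", "Top 75 - 100"]

def rank_to_group(rank):
    """Map a chart rank (int or numeric string) to its group label."""
    # bisect_left counts how many upper bounds lie strictly below the rank
    return GROUP_LABELS[bisect_left(UPPER_BOUNDS, int(rank))]

groups = [rank_to_group(r) for r in [1, 25, 26, 50, 51, 75, 76, 100]]
print(groups)
```

Keeping the bounds in one list makes it easy to try a different number of groups later.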

<a id="Lyric Scrap"></a>
### Lyric Scrap

Now that none of the songs were duplicated, I was able to scrape the lyrics from LyricsWiki. The URL format was "Artist:Song". For tracks with multiple artists or a featured artist, finding the correct URL for the lyrics was more complicated. To match the artist and song combination used on LyricsWiki, I had to go through these songs one at a time, checking which artist the song was listed under.
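
The manual check for multi-artist tracks could be partly automated by generating the URLs to try in order. The sketch below is an assumption about how that might look, not the method used in this project: the helper `candidate_urls` is hypothetical, and the separator list is illustrative rather than exhaustive.

```python
import re

URL_TEMPLATE = "http://lyrics.wikia.com/wiki/{}:{}"

# Separators assumed to join artists in a chart credit (illustrative only)
SEPARATOR = re.compile(r"\s+(?:&|x|/|Featuring)\s+")

def candidate_urls(artist_credit, song):
    """Return lyric URLs to try in order: the full credit first, then each artist alone."""
    names = [artist_credit] + SEPARATOR.split(artist_credit)
    unique_names = list(dict.fromkeys(names))  # drop repeats, keep order
    return [URL_TEMPLATE.format(name, song) for name in unique_names]

for url in candidate_urls("Drake & Future", "Jumpman"):
    print(url)
```

Each candidate would be requested in turn until a page containing a lyric box is found.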

In [None]:
def cleaned_lyrics(lyric):
    # Split the raw HTML of the lyric box into one string per lyric line
    cleaned = str(lyric).split("<br/>")

    # First line: drop the leading 22 characters, which matches the length
    # of the opening <div class="lyricbox"> tag
    cleaned[0] = cleaned[0][22:]

    # Last line: drop the trailing 38 characters of closing markup
    last_index = len(cleaned) - 1
    lastline = cleaned[last_index]
    cleaned[last_index] = lastline[:len(lastline) - 38]

    return cleaned
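
To make the fixed-width trimming concrete, here is a self-contained sketch on a made-up lyric box. The sample markup, in particular the 38-character closing fragment, is an assumption chosen to match the slice lengths; the real page markup may differ. The function name `strip_lyricbox` is hypothetical.

```python
def strip_lyricbox(html):
    """Split lyric-box HTML on <br/> and trim the wrapping markup by fixed widths."""
    lines = html.split("<br/>")
    lines[0] = lines[0][22:]                     # opening tag is 22 characters long
    lines[-1] = lines[-1][:len(lines[-1]) - 38]  # assumed 38 characters of closing markup
    return lines

# Made-up sample; the closing fragment below is exactly 38 characters long
sample = (
    '<div class="lyricbox">First line<br/>Second line'
    '<div class="lyricsbreak"></div>\n</div>'
)
print(strip_lyricbox(sample))
```

Fixed offsets are fragile: if the site changes its markup, the slices cut into the lyrics, so checking a few results by eye is worthwhile.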

In [None]:
url_template = "http://lyrics.wikia.com/wiki/{}:{}"

lyrics_clean= []

# `zipped` is expected to hold (artist, song) pairs
for entry in zipped:
    artist = entry[0]
    song = entry[1]
    updated_url = url_template.format(artist,song)
    #print(updated_url)
    lyric_request= requests.get(updated_url)
    soup = BeautifulSoup(lyric_request.text,"html.parser")

    lyrics = soup.find_all("div",attrs={"class":"lyricbox"})
    #print(lyrics)

    if lyrics:
        # A page can contain more than one lyric box; clean each one
        for lyric in lyrics:
            lyrics_clean.append(cleaned_lyrics(lyric))
    else:
        lyrics_clean.append("Not Found")

<a id="Genres Scrap"></a>
### Genres Scrap

Finally, I needed the genre of each artist. To acquire the genres I used Every Noise, which returns up to 10 genres for a single artist. The URL format was simply the artist name. For songs with multiple artists, the genre of the main artist was used.

In [None]:
def extract_genre(genres):
    # `genres` is None when no <div> was found, so .get_text() raises AttributeError
    try:
        return genres.get_text().strip("\n")
    except AttributeError:
        return "Not found"

In [None]:
url_template = "http://everynoise.com/lookup.cgi?who={}&mode=map"

genres_everyn = []

for artist in artist_add:
    updated_url = url_template.format(artist)
    #print(updated_url)
    genre_request = requests.get(updated_url)
    soup = BeautifulSoup(genre_request.text,"html.parser")
    # The genres are read from the first <div> on the page
    genres = soup.find("div")

    genres_everyn.append(extract_genre(genres))

Note: At every stage of the scraping process (Billboard 100, Lyrics, Genres) the data was assembled into a DataFrame. These DataFrames were then merged into a single DataFrame containing all the information for every song.

The DataFrame has the columns: Artist, Song, Lyrics, Rank, Group, Genres.

<a id="EDA"></a>
## EDA

Now that all the data was assembled, it was time to clean it. This entailed removing any strange symbols, making all the letters lowercase and removing any foreign-language songs that did not use the English alphabet. Once all of the data was cleaned, graphs were created to explore the data further and gain a deeper understanding.
(Note: The final graphs can be found in the "NLP on Billboard 100 Presenation" PDF in the repository.)

<a id="Genre Grouping"></a>
### Genre Grouping

Within EDA, two additional features were created from the scraped data: genres and artist popularity. The genres scraped from Every Noise were very specific, with up to 10 genres for a single artist. For the genres to carry any weight within the model, wider genre groups needed to be created. For example, for the artist "Drake", Canadian pop and pop were listed as genres, so these were grouped under pop.

In [None]:
general_g2 = []

for entry in df["EV_cleaned_G2"]:
    if "pop" in entry:
        general_g2.append("pop")
    elif "country" in entry:
        general_g2.append("country")
    elif "trap" in entry:
        # checked before "rap", because "rap" is a substring of "trap"
        general_g2.append("trap")
    elif "rap" in entry:
        general_g2.append("rap")
    elif "r&b" in entry:
        general_g2.append("r&b")
    elif "indie" in entry:
        general_g2.append("indie")
    elif "punk" in entry:
        general_g2.append("punk")
    elif "hip hop" in entry:
        general_g2.append("hip hop")
    elif "edm" in entry:
        general_g2.append("electronic")
    elif "metal" in entry:
        general_g2.append("metal")
    elif "reggaeton" in entry:
        general_g2.append("latin")
    elif "reggae fusion" in entry:
        general_g2.append("reggae")
    elif "alternative" in entry:
        general_g2.append("alternative")
    elif "tropical" in entry:
        general_g2.append("tropical")
    elif "rock" in entry:
        general_g2.append("rock")
    elif "boy band" in entry:
        general_g2.append("pop")
    elif "red dirt" in entry:
        general_g2.append("country")
    elif "singer-songwriter" in entry:
        general_g2.append("country") 
    elif "a cappella" in entry:
        general_g2.append("country") 
    elif "bolero" in entry:
        general_g2.append("latin") 
    elif "bachata" in entry:
        general_g2.append("latin") 
    elif "g funk" in entry:
        general_g2.append("r&b")
    elif "neo mellow" in entry:
        general_g2.append("pop")
    elif "new romantic" in entry:
        general_g2.append("pop")
    elif "neo soul" in entry:
        general_g2.append("soul")
    elif "permanent wave" in entry:
        general_g2.append("electronic")
    elif "aussietronica" in entry:
        general_g2.append("electronic")
    elif "moombahton" in entry:
        general_g2.append("electronic")
    elif "brostep" in entry:
        general_g2.append("electronic")
    elif "adult standards" in entry:
        general_g2.append("traditional pop")
    elif "anthem worship" in entry:
        general_g2.append("christian music")
    elif "hollywood" in entry:
        general_g2.append("pop")
    elif "lift kit" in entry:
        general_g2.append("country")
    elif "stomp and holler" in entry:
        general_g2.append("country")
    elif "show tunes" in entry:
        general_g2.append("musicals")
    elif "disney" in entry:
        general_g2.append("musicals")
    elif "motown" in entry:
        general_g2.append("musicals")
    elif "uk funky" in entry:
        general_g2.append("reggae")
    elif "talent show" in entry:
        # also matches "deep talent show"
        general_g2.append("tv music")
    elif "downtempo" in entry:
        general_g2.append("soul")  
    elif "drill" in entry:
        general_g2.append("trap")  
    elif "erotica" in entry:
        general_g2.append("r&b")
    elif "indonesian jazz" in entry:
        general_g2.append("jazz")
    # ("tropical" is already handled by an earlier branch)
    elif "australian dance" in entry:
        general_g2.append("pop")
    elif "progressive electro house" in entry:
        general_g2.append("electronic")
    elif " latin" in entry:
        general_g2.append("latin")
    elif "electronica" in entry:
        general_g2.append("electronic")
    elif " idol" in entry:
        general_g2.append("tv music")
    elif "canadian folk" in entry:
        general_g2.append("folk")
    elif "musical" in entry:
        general_g2.append("musicals")
    elif " big room" in entry:
        general_g2.append("pop")
    elif " deep euro house" in entry:
        general_g2.append("electronic")
    elif " ccm" in entry:
        general_g2.append("alternative")
    elif " neo-psychedelic" in entry:
        general_g2.append("rock")
    elif " movie tunes" in entry:
        general_g2.append("tv music")
    elif " emo" in entry:
        general_g2.append("rock")
    else:
        general_g2.append(entry)  
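
The long `if`/`elif` chain can also be expressed as an ordered table of rules, which makes the matching order explicit. The sketch below uses only a handful of the rules for illustration; `GENRE_RULES` and `general_genre` are hypothetical names, and the example genre strings are made up. Because matching is by substring and the first hit wins, a more specific keyword must come before any keyword it contains.

```python
# Ordered (keyword, general genre) rules; the first keyword found in the entry wins
GENRE_RULES = [
    ("pop", "pop"),
    ("country", "country"),
    ("trap", "trap"),  # must precede "rap": "rap" is a substring of "trap"
    ("rap", "rap"),
    ("r&b", "r&b"),
    ("edm", "electronic"),
    ("reggaeton", "latin"),
]

def general_genre(entry):
    """Return the general genre for a specific genre string, or the string itself."""
    for keyword, label in GENRE_RULES:
        if keyword in entry:
            return label
    return entry  # no rule matched: keep the original genre

examples = [general_genre(g) for g in ["canadian pop", "atl trap", "dirty south rap", "polka"]]
print(examples)
```

With the rules in a list, duplicate or unreachable entries are easier to spot than in a chain of branches.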

<a id="Artist Popularity"></a>
### Artist Popularity

Artist popularity was created by counting the number of times an artist appeared in each group. Every time the artist appeared within a group, their count for that group was increased. This created four new Artist Popularity columns, one for each group (Top 25, Top 25 - 50, Top 50 - 75, Top 75 - 100).
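
As a compact illustration of this counting, the per-artist tally for one group can be sketched with pandas. The DataFrame below is made up (placeholder artist names), and the column name `Artist_Popularity` is hypothetical; only the `Artist_cleaned` column name is taken from this notebook.

```python
import pandas as pd

# Made-up stand-in for the songs in one group (e.g. the Top 25 group)
group25 = pd.DataFrame({
    "Artist_cleaned": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
})

# Count how often each artist appears in the group ...
appearances = group25["Artist_cleaned"].value_counts()

# ... and attach that count to every row of the same artist
group25["Artist_Popularity"] = group25["Artist_cleaned"].map(appearances)
print(group25)
```

This avoids keeping one counter variable per artist, and artists that are new to the data are counted without any code changes.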

In [None]:
#Example of Artist Popularity for Artist in the Top 25 group

twentyone_Savage = 0
five_Seconds_Of_Summer = 0
Adele = 0
Alessia_Cara= 0
Amine= 0
Ariana_Grande= 0
Ayo_Teo= 0
Bazzi= 0
Bebe_Rexha_and_Florida_Georgia_Line= 0
Beyonce= 0
Big_Sean = 0
BlocBoy_JB= 0
Brett_Young= 0
Britney_Spears= 0
Bruno_Mars= 0
Bruno_Mars_and_Cardi_B= 0
Bryson_Tiller= 0
Calvin_Harris = 0
Camila_Cabello = 0
Cardi_B = 0
Charlie_Puth= 0
Childish_Gambino= 0
Chris_Brown= 0
Clean_Bandit= 0
Coldplay= 0
DRAM= 0
DJ_Khaled= 0
DJ_Snake= 0
DNCE= 0
Dan_and_Shay= 0
David_Guetta= 0
Daya= 0
Demi_Lovato= 0
Desiigner= 0
Drake= 0
Drake_and_Future= 0
Dua_Lipa= 0
Ed_Sheeran= 0
Ella_Mai= 0
Elle_King= 0
Ellie_Goulding= 0
Eminem= 0
Fetty_Wap= 0
Fifth_Harmony= 0
Flo_Rida= 0
Florida_Georgia_Line= 0
Flume= 0
French_Montana= 0
Future= 0
G_Eazy= 0
G_Eazy_and_Halsey = 0
G_Eazy_x_Bebe_Rexha = 0
Gucci_Mane = 0
Hailee_Steinfeld_and_Grey = 0
Halsey= 0
Harry_Styles= 0
Imagine_Dragons= 0
J_Balvin_and_Willy_William= 0
J_Cole= 0
James_Arthur= 0
James_Bay= 0
Jeremih= 0
John_Legend= 0
Jon_Bellion= 0
Jordan_Smith= 0
Juice_WRLD= 0
Julia_Michaels= 0
Justin_Bieber= 0
Justin_Timberlake= 0
KYLE= 0
Kane_Brown= 0
Kanye_West= 0
Katy_Perry= 0
Keith_Urban= 0
Kelly_Clarkson= 0
Kendrick_Lamar= 0
Kendrick_Lamar_and_SZA= 0
Kent_Jones= 0
Kesha= 0
Kevin_Gates= 0
Khalid= 0
Khalid_and_Normani= 0
Kiiara= 0
Kodak_Black= 0
Kygo_and_Selena_Gomez= 0
Lady_Gaga= 0
Liam_Payne= 0
Lil_Dicky= 0
Lil_Pump= 0
Lil_Uzi_Vert= 0
Lil_Wayne= 0
Lin_Manuel_Miranda= 0
Logic= 0
Lorde= 0
Luis_Fonsi_and_Daddy_Yankee= 0
Lukas_Graham= 0
MAX= 0
Machine_Gun_Kelly_and_Camila_Cabello= 0
Major_Lazer= 0
Major_Lazer_and_DJ_Snake= 0
Mariah_Carey= 0
Marian_Hill= 0
Mark_Ronson= 0
Maroon_5= 0
Marshmello_and_Anne_Marie = 0
Martin_Garrix_and_Bebe_Rexha= 0
Meghan_Trainor= 0
Migos= 0
Mike_Posner= 0
Miley_Cyrus= 0
NF= 0
Niall_Horan= 0
Nick_Jonas= 0
Nicki_Minaj= 0
Offset_and_Metro_Boomin= 0
Pink= 0
Pentatonix= 0
Portugal_The_Man= 0
Post_Malone= 0
Prince= 0
Prince_And_The_Revolution= 0
Rae_Sremmurd= 0
Rihanna= 0
Ruth_B= 0
Sam_Hunt= 0
Sam_Smith= 0
Selena_Gomez= 0
Shawn_Mendes= 0
Shawn_Mendes_and_Camila_Cabello= 0
Sia= 0
Taylor_Swift= 0
The_Carters= 0
The_Chainsmokers= 0
The_Chainsmokers_and_Coldplay= 0
The_Weeknd = 0
The_Weeknd_and_Kendrick_Lamar= 0
Thomas_Rhett = 0
Tory_Lanez= 0
Travis_Scott= 0
Troye_Sivan= 0
Tyga= 0
X_Ambassadors= 0
XXXTENTACION= 0
Yo_Gotti= 0
Young_MA= 0
Zara_Larsson_and_MNEK= 0
Zay_Hilfigerrr_and_Zayion_McCall= 0
Zayn= 0
Zayn_and_Taylor_Swift= 0
Zedd= 0
Zedd_and_Alessia_Cara= 0
gnash= 0
twenty_one_pilots = 0




for entry in group25["Artist_cleaned"]:
    if entry == '21 Savage': 
        twentyone_Savage +=1  
    elif entry == '5 Seconds Of Summer':
        five_Seconds_Of_Summer+=1
    elif entry == 'Adele':
        Adele +=1  
    elif entry == 'Alessia Cara':
        Alessia_Cara +=1 
    elif entry == 'Amine':
        Amine +=1  
    elif entry == 'Ariana Grande':
        Ariana_Grande +=1  
    elif entry == 'Ayo & Teo':
        Ayo_Teo +=1   
        
    elif entry == 'Bazzi':
        Bazzi +=1      
    elif entry == 'Bebe Rexha & Florida Georgia Line':
        Bebe_Rexha_and_Florida_Georgia_Line +=1      
    elif entry == 'Beyonce':
        Beyonce +=1      
    elif entry == 'Big Sean':
        Big_Sean +=1        
    elif entry == 'BlocBoy JB':
        BlocBoy_JB +=1 
    elif entry == 'Brett Young':
        Brett_Young +=1 
    elif entry == 'Britney Spears':
        Britney_Spears +=1 
    elif entry == 'Bruno Mars':
        Bruno_Mars +=1
    elif entry == 'Bruno Mars & Cardi B':
        Bruno_Mars_and_Cardi_B +=1 
    elif entry == 'Bryson Tiller':
        Bryson_Tiller +=1 
    elif entry == 'Calvin Harris':
        Calvin_Harris +=1
    elif entry == 'Camila Cabello':
        Camila_Cabello +=1
    elif entry == 'Cardi B':
        Cardi_B +=1
    elif entry == 'Charlie Puth':
        Charlie_Puth +=1
    elif entry == 'Childish Gambino':
        Childish_Gambino +=1
    elif entry == 'Chris Brown':
        Chris_Brown +=1
    elif entry == 'Clean Bandit':
        Clean_Bandit +=1
    elif entry == 'Coldplay':
        Coldplay +=1
    elif entry == 'D.R.A.M.':
        DRAM +=1
    elif entry == 'DJ Khaled':
        DJ_Khaled +=1
    elif entry == 'DJ Snake':
        DJ_Snake +=1
    elif entry == 'DNCE':
        DNCE +=1
    elif entry == 'Dan + Shay':
        Dan_and_Shay +=1
        
    elif entry == 'David Guetta':
        David_Guetta +=1
    elif entry == 'Daya':
        Daya +=1
    elif entry == 'Demi Lovato':
        Demi_Lovato +=1
    elif entry == 'Desiigner':
        Desiigner +=1
    elif entry == 'Drake':
        Drake +=1
    elif entry == 'Drake & Future':
        Drake_and_Future +=1
    elif entry == 'Dua Lipa':
        Dua_Lipa +=1
    elif entry == 'Ed Sheeran':
        Ed_Sheeran +=1
    elif entry == 'Ella Mai':
        Ella_Mai +=1
    elif entry == 'Elle King':
        Elle_King +=1
    elif entry == 'Ellie Goulding':
        Ellie_Goulding +=1
    elif entry == 'Eminem':
        Eminem +=1
    elif entry == 'Fetty Wap':
        Fetty_Wap +=1
    elif entry == 'Fifth Harmony':
        Fifth_Harmony +=1
    elif entry == 'Flo Rida':
        Flo_Rida +=1
    elif entry == 'Florida Georgia Line':
        Florida_Georgia_Line +=1
    elif entry == 'Flume':
        Flume +=1
    elif entry == 'French Montana':
        French_Montana +=1
    elif entry == 'Future':
        Future +=1
    elif entry == 'G-Eazy':
        G_Eazy +=1
    elif entry == 'G-Eazy & Halsey':
        G_Eazy_and_Halsey +=1
    elif entry == 'G-Eazy x Bebe Rexha':
        G_Eazy_x_Bebe_Rexha +=1
    elif entry == 'Gucci Mane':
        Gucci_Mane +=1
    elif entry == 'Hailee Steinfeld & Grey':
        Hailee_Steinfeld_and_Grey +=1
    
    elif entry == 'Halsey':
        Halsey +=1
    elif entry == 'Harry Styles':
        Harry_Styles +=1
    elif entry == 'Imagine Dragons':
        Imagine_Dragons +=1
    elif entry == 'J Balvin & Willy William':
        J_Balvin_and_Willy_William +=1
    elif entry == 'J. Cole':
        J_Cole +=1
    elif entry == 'James Arthur':
        James_Arthur +=1
    elif entry == 'James Bay':
        James_Bay +=1
    
    elif entry == 'Jeremih':
        Jeremih +=1
    elif entry == 'John Legend':
        John_Legend +=1
    elif entry == 'Jon Bellion':
        Jon_Bellion +=1
    elif entry == 'Jordan Smith':
        Jordan_Smith +=1
    elif entry == 'Juice WRLD':
        Juice_WRLD +=1
    elif entry == 'Julia Michaels':
        Julia_Michaels +=1
    elif entry == 'Justin Bieber':
        Justin_Bieber +=1
    elif entry == 'Justin Timberlake':
        Justin_Timberlake +=1
    elif entry == 'KYLE':
        KYLE +=1 
    elif entry == 'Kane Brown':
        Kane_Brown +=1
    elif entry == 'Kanye West':
        Kanye_West +=1
    elif entry == 'Katy Perry':
        Katy_Perry +=1
    elif entry == 'Keith Urban':
        Keith_Urban +=1
    elif entry == 'Kelly Clarkson':
        Kelly_Clarkson +=1
    elif entry == 'Kendrick Lamar':
        Kendrick_Lamar +=1
    elif entry == 'Kendrick Lamar & SZA':
        Kendrick_Lamar_and_SZA +=1
    elif entry == 'Kent Jones':
        Kent_Jones +=1
    elif entry == 'Kesha':
        Kesha +=1
    elif entry == 'Kevin Gates':
        Kevin_Gates +=1
    elif entry == 'Khalid':
        Khalid +=1
    elif entry == 'Khalid & Normani':
         Khalid_and_Normani +=1
    elif entry == 'Kiiara':
        Kiiara +=1
    elif entry == 'Kodak Black':
        Kodak_Black +=1
    elif entry == 'Kygo & Selena Gomez':
        Kygo_and_Selena_Gomez +=1
    elif entry == 'Lady Gaga':
        Lady_Gaga +=1
    elif entry == 'Liam Payne':
        Liam_Payne +=1
    elif entry == 'Lil Dicky':
        Lil_Dicky +=1
    elif entry == 'Lil Pump':
        Lil_Pump +=1
    elif entry == 'Lil Uzi Vert':
        Lil_Uzi_Vert +=1

    elif entry == 'Lil Wayne':
        Lil_Wayne +=1
    elif entry == 'Lin-Manuel Miranda':
        Lin_Manuel_Miranda +=1
    elif entry == 'Logic':
        Logic +=1
    elif entry == 'Lorde':
        Lorde +=1
    elif entry == 'Luis Fonsi & Daddy Yankee':
        Luis_Fonsi_and_Daddy_Yankee +=1
    elif entry == 'Lukas Graham':
        Lukas_Graham +=1
    elif entry == 'MAX':
        MAX +=1
    elif entry == 'Machine Gun Kelly & Camila Cabello':
        Machine_Gun_Kelly_and_Camila_Cabello +=1
    elif entry == 'Major Lazer':
        Major_Lazer +=1
    elif entry == 'Major Lazer & DJ Snake':
        Major_Lazer_and_DJ_Snake +=1
    elif entry == 'Mariah Carey':
        Mariah_Carey +=1
    elif entry == 'Marian Hill':
        Marian_Hill +=1
    elif entry == 'Mark Ronson':
        Mark_Ronson +=1
    elif entry == 'Maroon 5':
        Maroon_5 +=1
    elif entry == 'Marshmello & Anne-Marie':
        Marshmello_and_Anne_Marie +=1
    elif entry == 'Martin Garrix & Bebe Rexha':
        Martin_Garrix_and_Bebe_Rexha +=1
    elif entry == 'Meghan Trainor':
        Meghan_Trainor +=1
    elif entry == 'Migos':
        Migos +=1
    elif entry == 'Mike Posner':
        Mike_Posner +=1
    elif entry == 'Miley Cyrus':
        Miley_Cyrus +=1
    elif entry == 'NF':
        NF +=1
    elif entry == 'Niall Horan':
        Niall_Horan +=1
    elif entry == 'Nick Jonas':
        Nick_Jonas +=1
    elif entry == 'Nicki Minaj':
        Nicki_Minaj +=1
    elif entry == 'Offset & Metro Boomin':
        Offset_and_Metro_Boomin +=1
    elif entry == 'P!nk':
        Pink +=1
    elif entry == 'Pentatonix':
        Pentatonix +=1
    elif entry == 'Portugal. The Man':
        Portugal_The_Man +=1
    elif entry == 'Post Malone':
        Post_Malone +=1
    elif entry == 'Prince':
        Prince +=1
    elif entry == 'Prince And The Revolution':
        Prince_And_The_Revolution +=1
    elif entry == 'Rae Sremmurd':
        Rae_Sremmurd +=1
    elif entry == 'Rihanna':
        Rihanna +=1
    elif entry == 'Ruth B':
        Ruth_B +=1
    elif entry == 'Sam Hunt':
        Sam_Hunt +=1
    elif entry == 'Sam Smith':
        Sam_Smith +=1
    elif entry == 'Selena Gomez':
        Selena_Gomez +=1
    elif entry == 'Shawn Mendes':
        Shawn_Mendes +=1
    elif entry == 'Shawn Mendes & Camila Cabello':
        Shawn_Mendes_and_Camila_Cabello +=1
    elif entry == 'Sia':
        Sia +=1
    elif entry == 'Taylor Swift':
        Taylor_Swift +=1
    elif entry == 'The Carters':
        The_Carters +=1
    elif entry == 'The Chainsmokers':
        The_Chainsmokers +=1
    elif entry == 'The Chainsmokers & Coldplay':
        The_Chainsmokers_and_Coldplay +=1
    elif entry == 'The Weeknd':
        The_Weeknd +=1
    elif entry == 'The Weeknd & Kendrick Lamar':
        The_Weeknd_and_Kendrick_Lamar +=1
    elif entry == 'Thomas Rhett':
        Thomas_Rhett +=1
    elif entry == 'Tory Lanez':
        Tory_Lanez +=1
    elif entry == 'Travis Scott':
        Travis_Scott +=1
    elif entry == 'Troye Sivan':
        Troye_Sivan +=1
    elif entry == 'Tyga':
        Tyga +=1
    elif entry == 'X Ambassadors':
        X_Ambassadors +=1
    elif entry == 'XXXTENTACION':
        XXXTENTACION +=1
    elif entry == 'Yo Gotti':
        Yo_Gotti +=1
    elif entry == 'Young M.A':
        Young_MA +=1
    elif entry == 'Zara Larsson & MNEK':
        Zara_Larsson_and_MNEK +=1
    elif entry == 'Zay Hilfigerrr & Zayion McCall':
        Zay_Hilfigerrr_and_Zayion_McCall +=1
    elif entry == 'Zayn':
        Zayn +=1
    elif entry == 'Zayn / Taylor Swift':
        Zayn_and_Taylor_Swift +=1
    elif entry == 'Zedd':
        Zedd +=1
    elif entry == 'Zedd & Alessia Cara':
        Zedd_and_Alessia_Cara +=1
    elif entry == 'gnash':
        gnash +=1
    else:  
        twenty_one_pilots +=1
    

In [None]:
counts = []

        
for entry in group25["Artist_cleaned"]:
    if entry == '21 Savage': 
        counts.append(twentyone_Savage)  
    elif entry == '5 Seconds Of Summer':
        counts.append(five_Seconds_Of_Summer) 
    elif entry == 'Adele':
        counts.append(Adele)   
    elif entry == 'Alessia Cara':
        counts.append(Alessia_Cara)   
    elif entry == 'Amine':
        counts.append(Amine)  
    elif entry == 'Ariana Grande':
        counts.append(Ariana_Grande)   
    elif entry == 'Ayo & Teo':
        counts.append(Ayo_Teo)    
    elif entry == 'Bazzi':
        counts.append(Bazzi)       
    elif entry == 'Bebe Rexha & Florida Georgia Line':
        counts.append(Bebe_Rexha_and_Florida_Georgia_Line)        
    elif entry == 'Beyonce':
        counts.append(Beyonce)       
    elif entry == 'Big Sean':
        counts.append(Big_Sean)        
    elif entry == 'BlocBoy JB':
        counts.append(BlocBoy_JB)  
    elif entry == 'Brett Young':
        counts.append(Brett_Young) 
    elif entry == 'Britney Spears':
        counts.append(Britney_Spears) 
    elif entry == 'Bruno Mars':
        counts.append(Bruno_Mars)
    elif entry == 'Bruno Mars & Cardi B':
        counts.append(Bruno_Mars_and_Cardi_B) 
    elif entry == 'Bryson Tiller':
        counts.append(Bryson_Tiller) 
    elif entry == 'Calvin Harris':
        counts.append(Calvin_Harris)
    elif entry == 'Camila Cabello':
        counts.append(Camila_Cabello)
    elif entry == 'Cardi B':
        counts.append(Cardi_B)
    elif entry == 'Charlie Puth':
        counts.append(Charlie_Puth)
    elif entry == 'Childish Gambino':
        counts.append(Childish_Gambino)
    elif entry == 'Chris Brown':
        counts.append(Chris_Brown)
    elif entry == 'Clean Bandit':
        counts.append(Clean_Bandit)
    elif entry == 'Coldplay':
        counts.append(Coldplay)
    elif entry == 'D.R.A.M.':
        counts.append(DRAM )
    elif entry == 'DJ Khaled':
        counts.append(DJ_Khaled)
    elif entry == 'DJ Snake':
        counts.append(DJ_Snake)
    elif entry == 'DNCE':
        counts.append(DNCE)
    elif entry == 'Dan + Shay':
        counts.append(Dan_and_Shay)
        
    elif entry == 'David Guetta':
        counts.append(David_Guetta)
    elif entry == 'Daya':
        counts.append(Daya)
    elif entry == 'Demi Lovato':
        counts.append(Demi_Lovato)
    elif entry == 'Desiigner':
        counts.append(Desiigner)
    elif entry == 'Drake':
        counts.append(Drake)
    elif entry == 'Drake & Future':
        counts.append(Drake_and_Future)
    elif entry == 'Dua Lipa':
        counts.append(Dua_Lipa)
    elif entry == 'Ed Sheeran':
        counts.append(Ed_Sheeran)
    elif entry == 'Ella Mai':
        counts.append(Ella_Mai)
    elif entry == 'Elle King':
        counts.append(Elle_King)
    elif entry == 'Ellie Goulding':
        counts.append(Ellie_Goulding)
    elif entry == 'Eminem':
        counts.append(Eminem)
    elif entry == 'Fetty Wap':
        counts.append(Fetty_Wap)
    elif entry == 'Fifth Harmony':
        counts.append(Fifth_Harmony)
    elif entry == 'Flo Rida':
        counts.append(Flo_Rida)
    elif entry == 'Florida Georgia Line':
        counts.append(Florida_Georgia_Line)
    elif entry == 'Flume':
        counts.append(Flume)
    elif entry == 'French Montana':
        counts.append(French_Montana)
    elif entry == 'Future':
        counts.append(Future)
    elif entry == 'G-Eazy':
        counts.append(G_Eazy)
    elif entry == 'G-Eazy & Halsey':
        counts.append(G_Eazy_and_Halsey)
    elif entry == 'G-Eazy x Bebe Rexha':
        counts.append(G_Eazy_x_Bebe_Rexha)
    elif entry == 'Gucci Mane':
        counts.append(Gucci_Mane)
    elif entry == 'Hailee Steinfeld & Grey':
        counts.append(Hailee_Steinfeld_and_Grey)
    
    elif entry == 'Halsey':
        counts.append(Halsey)
    elif entry == 'Harry Styles':
        counts.append(Harry_Styles)
    elif entry == 'Imagine Dragons':
        counts.append(Imagine_Dragons)
    elif entry == 'J Balvin & Willy William':
        counts.append(J_Balvin_and_Willy_William)
    elif entry == 'J. Cole':
        counts.append(J_Cole)
    elif entry == 'James Arthur':
        counts.append(James_Arthur)
    elif entry == 'James Bay':
        counts.append(James_Bay)
    
    elif entry == 'Jeremih':
        counts.append(Jeremih)
    elif entry == 'John Legend':
        counts.append(John_Legend)
    elif entry == 'Jon Bellion':
        counts.append(Jon_Bellion)
    elif entry == 'Jordan Smith':
        counts.append(Jordan_Smith)
    elif entry == 'Juice WRLD':
        counts.append(Juice_WRLD)
    elif entry == 'Julia Michaels':
        counts.append(Julia_Michaels)
    elif entry == 'Justin Bieber':
        counts.append(Justin_Bieber)
    elif entry == 'Justin Timberlake':
        counts.append(Justin_Timberlake)
    elif entry == 'KYLE':
        counts.append(KYLE )
    elif entry == 'Kane Brown':
        counts.append(Kane_Brown)
    elif entry == 'Kanye West':
        counts.append(Kanye_West)
    elif entry == 'Katy Perry':
        counts.append(Katy_Perry)
    elif entry == 'Keith Urban':
        counts.append(Keith_Urban)
    elif entry == 'Kelly Clarkson':
        counts.append(Kelly_Clarkson)
    elif entry == 'Kendrick Lamar':
        counts.append(Kendrick_Lamar)
    elif entry == 'Kendrick Lamar & SZA':
        counts.append(Kendrick_Lamar_and_SZA )
    elif entry == 'Kent Jones':
        counts.append(Kent_Jones)
    elif entry == 'Kesha':
        counts.append(Kesha )
    elif entry == 'Kevin Gates':
        counts.append(Kevin_Gates )
    elif entry == 'Khalid':
        counts.append(Khalid )
    elif entry == 'Khalid & Normani':
         counts.append(Khalid_and_Normani )
    elif entry == 'Kiiara':
        counts.append(Kiiara)
    elif entry == 'Kodak Black':
        counts.append(Kodak_Black )
    elif entry == 'Kygo & Selena Gomez':
        counts.append(Kygo_and_Selena_Gomez)
    elif entry == 'Lady Gaga':
        counts.append(Lady_Gaga)
    elif entry == 'Liam Payne':
        counts.append(Liam_Payne)
    elif entry == 'Lil Dicky':
        counts.append(Lil_Dicky)
    elif entry == 'Lil Pump':
        counts.append(Lil_Pump)
    elif entry == 'Lil Uzi Vert':
        counts.append(Lil_Uzi_Vert)

    elif entry == 'Lil Wayne':
        counts.append(Lil_Wayne)
    elif entry == 'Lin-Manuel Miranda':
        counts.append(Lin_Manuel_Miranda)
    elif entry == 'Logic':
        counts.append(Logic)
    elif entry == 'Lorde':
        counts.append(Lorde)
    elif entry == 'Luis Fonsi & Daddy Yankee':
        counts.append(Luis_Fonsi_and_Daddy_Yankee)
    elif entry == 'Lukas Graham':
        counts.append(Lukas_Graham )
    elif entry == 'MAX':
        counts.append(MAX )
    elif entry == 'Machine Gun Kelly & Camila Cabello':
        counts.append(Machine_Gun_Kelly_and_Camila_Cabello)
    elif entry == 'Major Lazer':
        counts.append(Major_Lazer)
    elif entry == 'Major Lazer & DJ Snake':
        counts.append(Major_Lazer_and_DJ_Snake)
    elif entry == 'Mariah Carey':
        counts.append(Mariah_Carey )
    elif entry == 'Marian Hill':
        counts.append(Marian_Hill)
    elif entry == 'Mark Ronson':
        counts.append(Mark_Ronson)
    elif entry == 'Maroon 5':
        counts.append(Maroon_5 )
    elif entry == 'Marshmello & Anne-Marie':
        counts.append(Marshmello_and_Anne_Marie )
    elif entry == 'Martin Garrix & Bebe Rexha':
        counts.append(Martin_Garrix_and_Bebe_Rexha)
    elif entry == 'Meghan Trainor':
        counts.append(Meghan_Trainor)
    elif entry == 'Migos':
        counts.append(Migos)
    elif entry == 'Mike Posner':
        counts.append(Mike_Posner)
    elif entry == 'Miley Cyrus':
        counts.append(Miley_Cyrus)
    elif entry == 'NF':
        counts.append(NF)
    elif entry == 'Niall Horan':
        counts.append(Niall_Horan)
    elif entry == 'Nick Jonas':
        counts.append(Nick_Jonas)
    elif entry == 'Nicki Minaj':
        counts.append(Nicki_Minaj)
    elif entry == 'Offset & Metro Boomin':
        counts.append(Offset_and_Metro_Boomin)
    elif entry == 'P!nk':
        counts.append(Pink )
    elif entry == 'Pentatonix':
        counts.append(Pentatonix )
    elif entry == 'Portugal. The Man':
        counts.append(Portugal_The_Man)
    elif entry == 'Post Malone':
        counts.append(Post_Malone)
    elif entry == 'Prince':
        counts.append(Prince)
    elif entry == 'Prince And The Revolution':
        counts.append(Prince_And_The_Revolution)
    elif entry == 'Rae Sremmurd':
        counts.append(Rae_Sremmurd)
    elif entry == 'Rihanna':
        counts.append(Rihanna)
    elif entry == 'Ruth B':
        counts.append(Ruth_B )
    elif entry == 'Sam Hunt':
        counts.append(Sam_Hunt)
    elif entry == 'Sam Smith':
        counts.append(Sam_Smith)
    elif entry == 'Selena Gomez':
        counts.append(Selena_Gomez)
    elif entry == 'Shawn Mendes':
        counts.append(Shawn_Mendes)
    elif entry == 'Shawn Mendes & Camila Cabello':
        counts.append(Shawn_Mendes_and_Camila_Cabello)
    elif entry == 'Sia':
        counts.append(Sia)
    elif entry == 'Taylor Swift':
        counts.append(Taylor_Swift)
    elif entry == 'The Carters':
        counts.append(The_Carters )
    elif entry == 'The Chainsmokers':
        counts.append(The_Chainsmokers)
    elif entry == 'The Chainsmokers & Coldplay':
        counts.append(The_Chainsmokers_and_Coldplay)
    elif entry == 'The Weeknd':
        counts.append(The_Weeknd)
    elif entry == 'The Weeknd & Kendrick Lamar':
        counts.append(The_Weeknd_and_Kendrick_Lamar)
    elif entry == 'Thomas Rhett':
        counts.append(Thomas_Rhett)
    elif entry == 'Tory Lanez':
        counts.append(Tory_Lanez)
    elif entry == 'Travis Scott':
        counts.append(Travis_Scott)
    elif entry == 'Troye Sivan':
        counts.append(Troye_Sivan)
    elif entry == 'Tyga':
        counts.append(Tyga)
    elif entry == 'X Ambassadors':
        counts.append(X_Ambassadors)
    elif entry == 'XXXTENTACION':
        counts.append(XXXTENTACION )
    elif entry == 'Yo Gotti':
        counts.append(Yo_Gotti)
    elif entry == 'Young M.A':
        counts.append(Young_MA)
    elif entry == 'Zara Larsson & MNEK':
        counts.append(Zara_Larsson_and_MNEK)
    elif entry == 'Zay Hilfigerrr & Zayion McCall':
        counts.append(Zay_Hilfigerrr_and_Zayion_McCall)
    elif entry == 'Zayn':
        counts.append(Zayn)
    elif entry == 'Zayn / Taylor Swift':
        counts.append(Zayn_and_Taylor_Swift)
    elif entry == 'Zedd':
        counts.append(Zedd)
    elif entry == 'Zedd & Alessia Cara':
        counts.append(Zedd_and_Alessia_Cara)
    elif entry == 'gnash':
        counts.append(gnash)
    else:  
        counts.append(twenty_one_pilots)
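
The two cells above can be condensed considerably. A minimal sketch, assuming each counter variable holds the number of times that artist appears in the frame being looped over; the artist names below are toy data and `artist_counts` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter

def artist_counts(artists):
    """Return, for every row, how many rows share that row's artist."""
    tally = Counter(artists)  # artist -> number of appearances
    return [tally[name] for name in artists]

# Toy stand-in for group25["Artist_cleaned"]
toy_artists = ["Drake", "Adele", "Drake", "gnash", "Drake"]
counts = artist_counts(toy_artists)
print(counts)
```

With pandas the same idea can be written as `s.map(s.value_counts())` for a Series `s`. Unlike the final `else` branch above, an artist missing from the hand-written list is never silently given another artist's count.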

<a id="CountVectorizer Model"></a>
## CountVectorizer Model

Aim: Can you classify songs into groups based on the vocabulary used within a song? 

Namely, do songs whose lyrics are about love tend to rank in the Top 25, while songs about violence and betrayal fall into the lower Top 75 - 100 group?

To reduce the noise in the word counts, limits were set on several hyperparameters of the CountVectorizer. The number of features was capped at 30,000 (`max_features=30000`), a term had to appear in at least 20 songs to be kept (`min_df=20`), English stop words were ignored and n-grams of (1,2) were considered.
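
To make the `min_df` setting concrete: it filters on document frequency, the number of songs a term appears in, rather than on total occurrences. A minimal pure-Python sketch of that filter follows; the helper name `filter_by_document_frequency` and the toy lyrics are illustrative assumptions, and the notebook itself relies on `CountVectorizer` for this step.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df):
    """Keep terms that occur in at least `min_df` different documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term repeated inside one song is counted once
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

toy_lyrics = [
    "love love love me",
    "love you baby",
    "baby baby baby baby",
]
vocabulary = filter_by_document_frequency(toy_lyrics, min_df=2)
print(vocabulary)
```

Here "me" and "you" are dropped because each appears in only one song, while "baby" is kept because it appears in two songs, regardless of how often it is repeated within them.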

A Support Vector Classifier with C = 6 was chosen to model the CountVectorizer data. A high C penalises misclassified training points heavily and produces narrow margins, with the intention of separating groups that share many common words. The final model was an SVC with an rbf kernel and C = 6; it reached an accuracy score of roughly 34% (26% baseline).
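
The 26% baseline is read here as the share of the largest group, that is, the accuracy of always predicting the most common class. That reading is an assumption, since the notebook does not show how the baseline was computed. A sketch with made-up labels:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels, not the notebook's data
toy_groups = (["Top 25"] * 26 + ["Top 25-50"] * 25
              + ["Top 50-75"] * 25 + ["Top 75-100"] * 24)
baseline = majority_baseline(toy_groups)
print(round(baseline, 2))
```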

In [13]:
df = pd.read_csv("181716_edacleaned.csv",sep="\t")
df.drop(columns=['Unnamed: 0'],inplace=True)

X = df["Lyrics"]
y = df["Groups"]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y)

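from sklearn.feature_extraction.text import CountVectorizer

#Raw string so that \w is passed to the regex engine unchanged
cvec = CountVectorizer(token_pattern=r'\w+',stop_words="english", ngram_range = (1,2),
                      decode_error="ignore",max_features=30000, min_df=20) 

#Fit the vocabulary on the training lyrics only, then reuse it for the test lyrics
X_train = cvec.fit_transform(X_train)
X_test = cvec.transform(X_test)

#probability=True in order to be able to create an ROC curve 
svc = SVC(C=6, kernel="rbf", probability=True)
svc.fit(X_train,y_train)
svc_model = svc.score(X_test,y_test)
print(svc_model)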

0.33766233766233766


Looking more closely at the data, I found that 70% of the top 20 words were exactly the same across all the groups. The SVC model was therefore unable to differentiate between the vocabularies of the different groups, despite the narrow margins. Because of this underlying similarity across the groups, PCA was not explored. Instead, more features (genres and artist popularity) were added to the model in order to give each group wider identifying characteristics.
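
An overlap figure of this kind can be computed along the following lines. This is only a sketch on toy lyrics: the helper names are hypothetical, stop words are not removed as they are in the notebook, and the notebook's own calculation is not shown in this section.

```python
from collections import Counter

def top_words(lyrics_list, n):
    """Return the set of the n most frequent words across a list of lyrics."""
    counts = Counter(word for lyrics in lyrics_list
                     for word in lyrics.lower().split())
    return {word for word, _ in counts.most_common(n)}

def shared_fraction(groups, n):
    """Fraction of the top-n words that every group has in common."""
    tops = [top_words(lyrics_list, n) for lyrics_list in groups.values()]
    return len(set.intersection(*tops)) / n

toy_groups = {
    "Top 25": ["love you love me", "you know love"],
    "Top 75-100": ["bad you bad", "you lie you bad"],
}
overlap = shared_fraction(toy_groups, n=2)
print(overlap)
```

In the toy data the two groups share one of their two most frequent words ("you"), so half of the top words overlap.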

<a id="CountVectorizer, Artist Popularity & Genre Model"></a>
### CountVectorizer, Artist Popularity & Genre Model

In [None]:
df = pd.read_csv("181716_edacleaned_sent_rep_ALLFRI.csv",sep="\t")

df_count = df[['Lyrics','General_G2','Artist_Top25_Count', 'Artist_Top25_50_Count',
       'Artist_Top50_75_Count', 'Artist_Top75_100_Count',"Groups"]]

#Dummify the genres labels 

df_count_dummy = pd.get_dummies(df_count,columns=['General_G2'])

X = df_count_dummy[["Lyrics",'Artist_Top25_Count', 'Artist_Top25_50_Count',
       'Artist_Top50_75_Count', 'Artist_Top75_100_Count','General_G2_ christmas',
       'General_G2_ funk', 'General_G2_alternative', 'General_G2_christmas',
       'General_G2_country', 'General_G2_electronic', 'General_G2_folk',
       'General_G2_hip hop', 'General_G2_indie', 'General_G2_jazz',
       'General_G2_latin', 'General_G2_metal', 'General_G2_musicals',
       'General_G2_pop', 'General_G2_punk', 'General_G2_r&b', 'General_G2_rap',
       'General_G2_reggae', 'General_G2_rock', 'General_G2_soul',
       'General_G2_traditional pop', 'General_G2_tropical',
       'General_G2_tv music']]

columns_ss = X.iloc[:,1:].columns

#Standardize the data 

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_ss =ss.fit_transform(X.iloc[:,1:])
X_ss_df = pd.DataFrame(X_ss,columns=columns_ss)
X_ss_df.head(1)

X_f = pd.concat([X["Lyrics"], X_ss_df], axis=1)
X_f

y = df_count_dummy["Groups"]

#Train test split 

Xtrain,Xtest,ytrain,ytest = train_test_split(X_f,y,test_size=0.2,stratify=y)

#Apply CVEC 

cvec = CountVectorizer(token_pattern=r'\w+',stop_words="english", ngram_range = (1,2),
                      decode_error="ignore",max_features=50000, min_df=20) 

#Fit on the training lyrics only; name the columns after the terms so that all
#column names are strings (get_feature_names_out() needs scikit-learn 1.0 or later)
Xtrain_l = pd.DataFrame(cvec.fit_transform(Xtrain["Lyrics"]).toarray(),
                        columns=cvec.get_feature_names_out())
Xtest_l = pd.DataFrame(cvec.transform(Xtest["Lyrics"]).toarray(),
                       columns=cvec.get_feature_names_out())

#drop=True stops the old row index from being added as an extra feature column
Train_AG = Xtrain.iloc[:,1:].reset_index(drop=True)
Test_AG = Xtest.iloc[:,1:].reset_index(drop=True)

#Xtrain with CVEC, Artist Popularity, Genre 
Xtrain_final =pd.concat([Xtrain_l, Train_AG], axis=1)

#Xtest with CVEC, Artist Popularity, Genre 
Xtest_final =pd.concat([Xtest_l, Test_AG], axis=1)
Xtest_final.head(1)

#Modeling 
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(Xtrain_final,ytrain)
logreg.score(Xtest_final,ytest)

<a id="TD-IDF Model"></a>
## TF-IDF Model

Aim: Can you exploit the repetition of particular words and use uncommon words to better classify the songs? 

To reduce the noise in the TF-IDF representation, the number of features was limited to 3000 and n-grams of (1,2) were considered.
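
As a reminder of what the vectorizer computes: each term count is scaled by an inverse document frequency, so words that occur in most songs are down-weighted. The sketch below follows the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and leaves out the final L2 normalisation of each row. The function name and toy lyrics are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per document (no L2 normalisation)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(tokens).items()}
            for tokens in tokenised]

toy_lyrics = ["love love baby", "love tonight"]
weights = tfidf(toy_lyrics)
print(weights[0])
```

"love" appears in both toy songs, so its idf is exactly 1 and its weight equals its raw count, whereas "baby" and "tonight" each appear in one song and receive a larger multiplier.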

A Logistic Regression model proved best at classifying songs based on TF-IDF, yielding an accuracy score of 29% (3% above baseline). This is poorer than the CountVectorizer model used above. 

In [None]:
df = pd.read_csv("181716_edacleaned.csv",sep="\t")

X = df["Lyrics"]
y = df["Groups"]
Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=0.2,stratify=y)


from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer(stop_words='english',
                        max_features=3000,ngram_range=(1,2))

tvec.fit(Xtrain)
X_train = tvec.transform(Xtrain)
X_test = tvec.transform(Xtest)


logreg = LogisticRegression(C=1)
logreg.fit(X_train,ytrain)
logreg.score(X_test,ytest)

Looking closely at the data, I found that many of the top 20 words in each group from the TF-IDF were the same words that ranked highest in the CountVectorizer, because of how often they occur across the different songs. In addition, 50% of the top 20 words were exactly the same in every group. This resulted in a poor model, only 3% above baseline. 

In order to expand the feature space of the model, genres and artist popularity were added to the TF-IDF features. This raised the accuracy score sharply, to 99.5%. Looking more closely at the largest coefficients in the regression, Artist Popularity appeared to carry the heaviest weight within the model. In order to see the predictive ability of the actual NLP features of the songs, Artist Popularity was removed from future models. 
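
One way to see why the artist-popularity columns can dominate: if the per-group counts are computed from the same songs whose group is being predicted (an assumption about how the EDA step built them, not something this section confirms), then an artist with a single charting song has a count of 1 in exactly the column matching that song's label. A toy sketch in which simply picking the largest count column recovers every label:

```python
from collections import Counter, defaultdict

GROUPS = ["Top 25", "Top 25-50", "Top 50-75", "Top 75-100"]

def artist_group_counts(songs):
    """songs: list of (artist, group). Returns artist -> Counter of groups."""
    table = defaultdict(Counter)
    for artist, group in songs:
        table[artist][group] += 1
    return table

def predict_from_counts(artist, table):
    """Predict the group whose count column is largest for this artist."""
    return max(GROUPS, key=lambda g: table[artist][g])

# Hypothetical songs, not taken from the dataset
toy_songs = [("Artist A", "Top 25"), ("Artist B", "Top 50-75"),
             ("Artist C", "Top 75-100"), ("Artist D", "Top 25-50")]
table = artist_group_counts(toy_songs)
hits = sum(predict_from_counts(a, table) == g for a, g in toy_songs)
accuracy = hits / len(toy_songs)
print(accuracy)
```

Recomputing the counts from the training split only would be one way to check how much of the 99.5% survives.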

<a id="TF-IDF, Artist Popularity & Genre Model"></a>
### TF-IDF, Artist Popularity & Genre Model

In [None]:
df_count = df[['Lyrics','General_G2','Artist_Top25_Count', 'Artist_Top25_50_Count',
       'Artist_Top50_75_Count', 'Artist_Top75_100_Count',"Groups"]]

df_count_dummy = pd.get_dummies(df_count,columns=['General_G2'])

X = df_count_dummy[["Lyrics",'Artist_Top25_Count', 'Artist_Top25_50_Count',
       'Artist_Top50_75_Count', 'Artist_Top75_100_Count','General_G2_ christmas',
       'General_G2_ funk', 'General_G2_alternative', 'General_G2_christmas',
       'General_G2_country', 'General_G2_electronic', 'General_G2_folk',
       'General_G2_hip hop', 'General_G2_indie', 'General_G2_jazz',
       'General_G2_latin', 'General_G2_metal', 'General_G2_musicals',
       'General_G2_pop', 'General_G2_punk', 'General_G2_r&b', 'General_G2_rap',
       'General_G2_reggae', 'General_G2_rock', 'General_G2_soul',
       'General_G2_traditional pop', 'General_G2_tropical',
       'General_G2_tv music']]

columns_ss = X.iloc[:,1:].columns

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_ss =ss.fit_transform(X.iloc[:,1:])
X_ss_df = pd.DataFrame(X_ss,columns=columns_ss)
X_ss_df.head(1)

X_f = pd.concat([X["Lyrics"], X_ss_df], axis=1)
X_f

y = df_count_dummy["Groups"]

#Train-Test Split 

Xtrain,Xtest,ytrain,ytest = train_test_split(X_f,y,test_size=0.2,stratify=y)


#Apply TF-IDF

tvec = TfidfVectorizer(stop_words='english',
                       max_features=3000,ngram_range=(1,2))

#get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
Xtrain_l = pd.DataFrame(tvec.fit_transform(Xtrain["Lyrics"]).toarray(),
                        columns=tvec.get_feature_names_out())
Xtest_l = pd.DataFrame(tvec.transform(Xtest["Lyrics"]).toarray(),
                       columns=tvec.get_feature_names_out())

#drop=True stops the old row index from being added as an extra feature column
Train_AG = Xtrain.iloc[:,1:].reset_index(drop=True)
Test_AG = Xtest.iloc[:,1:].reset_index(drop=True)

#This is the training set to fit with model 
Xtrain_final =pd.concat([Xtrain_l, Train_AG], axis=1)

#This is the testing set to score the model on 
Xtest_final =pd.concat([Xtest_l, Test_AG], axis=1)


#GRIDSEARCHCV

from sklearn.model_selection import GridSearchCV

params = {"C":np.logspace(-7,2,100)}

logreg = LogisticRegression()

grid = GridSearchCV(logreg,params,verbose=1,cv=10)
grid.fit(Xtrain_final,ytrain)

#Getting the best model from the GridSearchCV
optimal_model = grid.best_estimator_

optimal_model.score(Xtest_final,ytest)

In [None]:
#Looking at the largest coefficients in the model 
#coef_[0] holds the coefficients for the first class only (optimal_model.classes_[0])

coefs = pd.DataFrame(dict(coef=optimal_model.coef_[0],
                                     abscoef=np.abs(optimal_model.coef_[0]),
                                     feature=Xtest_final.columns))
coefs.sort_values('abscoef', ascending=False, inplace=True)
coefs.head(10)

<a id="Sentiment Analysis"></a>
## Sentiment Analysis 

Aim: Create an improved classifier model using the sentiment of songs. 

Radio stations, and the commuters who listen to them, are more likely to favour happier songs over sad ones. Therefore, happier songs should appear within the Top 25 and Top 25-50 groups more often than in the later groups.

The VADER sentiment analysis toolkit was used to compute the sentiment scores for each song.

In [None]:
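#Assumption: VADER is loaded from the vaderSentiment package 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

#Initialise as floats, since the scores assigned below are floats
df['vader_neg'] = 0.0
df['vader_pos'] = 0.0
df['vader_neu'] = 0.0
df['vader_compound'] = 0.0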

for i, q in enumerate(df["Lyrics"]):
    vs = analyzer.polarity_scores(q)
    df.iloc[i, -4] = vs['neg']
    df.iloc[i, -3] = vs['pos']
    df.iloc[i, -2] = vs['neu']
    df.iloc[i, -1] = vs['compound']

<a id="Sentiment KNN Model"></a>
### Sentiment KNN Model

From an initial review of the sentiment scores, no clear difference in positive or negative sentiment could be seen across the groups. A likely reason is that rhythm plays an important role in setting the tone of a song: many songs have very happy lyrics but a sad rhythm. This juxtaposition is not picked up by the sentiment analysis, which works on the lyrics alone, so no significant difference appears across the groups. Nevertheless, a KNN model was used to classify songs into groups based on sentiment, starting from the 5 nearest neighbours and tuning the number of neighbours with a grid search.  
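
For readers unfamiliar with KNN, the idea is a majority vote among the closest training points. The sketch below is a bare-bones illustration in pure Python with made-up sentiment vectors; `knn_predict` is a hypothetical helper, and the notebook uses scikit-learn's `KNeighborsClassifier` instead.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up (neg, pos, neu, compound) sentiment vectors
train_X = [(0.1, 0.6, 0.3, 0.9), (0.2, 0.5, 0.3, 0.8), (0.1, 0.5, 0.4, 0.7),
           (0.6, 0.1, 0.3, -0.9), (0.5, 0.2, 0.3, -0.8), (0.5, 0.1, 0.4, -0.7)]
train_y = ["Top 25", "Top 25", "Top 25",
           "Top 75-100", "Top 75-100", "Top 75-100"]
prediction = knn_predict(train_X, train_y, query=(0.1, 0.6, 0.3, 0.85), k=5)
print(prediction)
```

When the groups overlap heavily in sentiment space, as the review above suggests, the neighbours of any song come from all four groups and the vote carries little information.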

In [None]:
df = pd.read_csv("181716_edacleaned_sentiment.csv",sep="\t")

X = df[['vader_neg', 'vader_pos', 'vader_neu', 'vader_compound']]
y = df["Groups"]

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

X_ss = ss.fit_transform(X)

X_train,X_test,y_train,y_test = train_test_split(X_ss,y,test_size=0.2,stratify=y)


#GRIDSEARCH to find optimal number of neighbours

params = {"n_neighbors":[5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,90,100]}

knn = KNeighborsClassifier()

grid = GridSearchCV(knn,params,verbose=1,cv=10)
grid.fit(X_train,y_train)

In [None]:
#Optimal model found in the Gridsearch
optimal_model = grid.best_estimator_
optimal_model.score(X_test,y_test)

<a id="Compound-Sentiment-&-Genres-Model"></a>
### Compound Sentiment & Genres Model

Looking closely at the data, compound sentiment proved to be the feature that differed most across the groups. Genre was added to the model to reflect the genre composition of the different groups, in order to create a better classifier.

In [None]:
df = pd.read_csv("181716_edacleaned_sent_rep_ALLFRI.csv",sep="\t")
df_dum = pd.get_dummies(df,columns=["General_G2"])

y = df["Groups"]
X_data  = df_dum[['vader_compound',
       'General_G2_ christmas', 'General_G2_ funk',
       'General_G2_alternative', 'General_G2_christmas', 'General_G2_country',
       'General_G2_electronic', 'General_G2_folk', 'General_G2_hip hop',
       'General_G2_indie', 'General_G2_jazz', 'General_G2_latin',
       'General_G2_metal', 'General_G2_musicals', 'General_G2_pop',
       'General_G2_punk', 'General_G2_r&b', 'General_G2_rap',
       'General_G2_reggae', 'General_G2_rock', 'General_G2_soul',
       'General_G2_traditional pop', 'General_G2_tropical',
       'General_G2_tv music']]

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X = ss.fit_transform(X_data)

Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=0.2,stratify=y)


#GRIDSEARCH to find optimal hyperparameters 

logreg = LogisticRegression()

params = {"solver":["newton-cg","sag","saga","lbfgs"],
          "C":[1,2,3,4,5,6,7,8,9,10]}

grid = GridSearchCV(logreg,params,cv=5,verbose=1)
grid.fit(Xtrain,ytrain)

optimal_grid = grid.best_estimator_
optimal_grid.score(Xtest,ytest)

<a id="Repetitiveness"></a>
## Repetitiveness


Aim: Can the genre breakdown within each group be exploited to better classify songs?

Pop songs are known to be more repetitive than rap songs. Given the genre composition of each group, can a percentage-of-repetition feature help to classify songs better?

The percentage of repetition was calculated as: (length of the song - length of the compressed song) / length of the song. 

In order to assess the amount of repetition in a song, the Zopfli compression algorithm was used. Zopfli uses Lempel-Ziv (LZ77) matching and Huffman coding under the hood. Lempel-Ziv seeks to match a sequence of text in a song to an earlier occurrence of the same sequence and replaces it with a "marker" pointing back to it, so that only the non-repetitive text is stored in full. Huffman coding counts how often each symbol appears and assigns shorter bit encodings to symbols that appear more often, resulting in a more compressed document. 
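
The formula can be tried out without Zopfli installed. The sketch below swaps in Python's built-in `zlib` (also DEFLATE-based, but a different compressor, so the numbers will not match Zopfli's) purely to illustrate the calculation; `repetition_ratio` is a hypothetical helper name.

```python
import zlib

def repetition_ratio(lyrics):
    """(original bytes - compressed bytes) / original bytes."""
    raw = lyrics.encode("utf-8")
    compressed = zlib.compress(raw, 9)
    return (len(raw) - len(compressed)) / len(raw)

repetitive = "na na na na hey hey goodbye " * 20
varied = ("The quick brown fox jumps over the lazy dog "
          "while seven wizards juggle boxing gloves.")
print(round(repetition_ratio(repetitive), 2), round(repetition_ratio(varied), 2))
```

A highly repetitive chorus compresses far better than a line with no repeats. For very short texts the compressed form can be longer than the original, which makes the ratio negative.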

In [None]:
import zopfli

length_compression = []
percent_compress = []

for song in df["Lyrics"]:
    song_byt = song.encode()
    
    c = zopfli.ZopfliDeflater()
    z = c.compress(song_byt) + c.flush()
    
    #Compare byte lengths on both sides, so non-ASCII characters are not undercounted
    len_song = len(song_byt)
    len_compress = len(z)
    
    length_compression.append(len_compress)
    percent_compress.append((len_song - len_compress)/len_song)
    
df["percent_compress"] = percent_compress
df["length_compression"] = length_compression

<a id="Part-of-Speech-Tagging"></a>
## Part of Speech Tagging



Aim: Look at the construction of the lyrics to see whether the number of nouns, adjectives and other part of speech tags differs across the groups.  

NLTK's toolkit was used to get the part of speech tags for every song. For each part of speech tag a new column was created, holding the count of that tag in every song. 
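
The per-tag counting can be expressed compactly with `collections.Counter`. The sketch below starts from an already-tagged line so that it runs without NLTK's models; in the notebook the `(word, tag)` pairs would come from `pos_tag`. The tag list shown is a small illustrative subset and `count_tags` is a hypothetical helper.

```python
from collections import Counter

TAGS_OF_INTEREST = ["NN", "VB", "JJ", "PRP"]  # illustrative subset of Penn Treebank tags

def count_tags(tagged_song, tags=TAGS_OF_INTEREST):
    """Return {tag: count} for the chosen tags, plus 'none' for everything else."""
    tally = Counter(tag for _, tag in tagged_song)
    row = {tag: tally[tag] for tag in tags}
    row["none"] = sum(n for tag, n in tally.items() if tag not in tags)
    return row

# Hand-tagged toy line, standing in for pos_tag output
tagged = [("I", "PRP"), ("love", "VBP"), ("you", "PRP"), ("baby", "NN")]
row = count_tags(tagged)
print(row)
```

A list of such rows, one per song, can be passed to `pd.DataFrame` to obtain one column per tag.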

In [None]:
#Creating count of different Part of Speech Tags in every song

quote_final_values = [] 
openbracket_final_values = [] 
closebracket_final_values = []
comma_final_values = []
tire_final_values = []
doublepoints_final_values = []
openquotes_final_values = []
dot_final_values = []
cc_final_values = [] 
cd_final_values = []
dt_final_values = []
ex_final_values = []
fw_final_values = []
in_final_values = []
jj_final_values = []
jjr_final_values = []
jjs_final_values = []
ls_final_values = []
md_final_values = []
nn_final_values = []
nnp_final_values = []
nnps_final_values = []
nns_final_values = []
pdt_final_values = []
pos_final_values = []
prp_final_values = []
prpdollar_final_values = []
rb_final_values = []
rbr_final_values = []
rp_final_values = []
sym_final_values = []
to_final_values = []
uh_final_values = []
vb_final_values = []
vbd_final_values = []
vbg_final_values = []
vbn_final_values = []
vbp_final_values = []
vbz_final_values = []
wdt_final_values = []
wp_final_values = []
wpdollar_final_values = []
wrb_final_values = []
none_final_valuess = []



for song in df["Lyrics"]:
    t_sentence = []
    for lyrics in song.split(" "):
        tags = pos_tag(tok.tokenize(lyrics)) #here we get all the tags 
        
        for t in tags: #loop through the tags, to only get the part of speech 
            t1 = t[1]
            
            t_sentence.append(t1) #append it to our list
    quote_count = 0 
    openbracket_count = 0 
    closebracket_count = 0
    comma_count = 0
    tire_count = 0
    doublepoints_count = 0
    openquotes_count = 0
    dot_count = 0
    cc_count = 0 
    cd_count = 0
    dt_count = 0
    ex_count = 0
    fw_count = 0
    in_count = 0
    jj_count = 0
    jjr_count = 0
    jjs_count = 0
    ls_count = 0
    md_count = 0
    nn_count = 0
    nnp_count = 0
    nnps_count = 0
    nns_count = 0
    pdt_count = 0
    pos_count = 0
    prp_count = 0
    prpdollar_count = 0
    rb_count = 0
    rbr_count = 0
    rp_count = 0
    sym_count = 0
    to_count = 0
    uh_count = 0
    vb_count = 0
    vbd_count = 0
    vbg_count = 0
    vbn_count = 0
    vbp_count = 0
    vbz_count = 0
    wdt_count = 0
    wp_count = 0
    wpdollar_count = 0
    wrb_count = 0
    none_count = 0

    for entry in t_sentence:
        #print(entry)
        if entry == "''":
            quote_count +=1 
        elif entry =='(':
            openbracket_count +=1 
        elif entry ==')':
            closebracket_count+=1  
        elif entry ==',':
            comma_count+=1 
        elif entry =='--':
            tire_count+=1 
        elif entry ==':':
            doublepoints_count+=1
        elif entry =='``':
            openquotes_count+=1 
        elif entry =='.':
            dot_count+=1 
        elif entry =='CC':
            cc_count+=1
        elif entry =='CD':
            cd_count+=1
        elif entry =='DT':
            dt_count+=1
        elif entry =='EX':
            ex_count+=1
        elif entry =='FW':
            fw_count+=1
        elif entry =='IN':
            in_count+=1
        elif entry =='JJ':
            jj_count+=1 
        elif entry =='JJR':
            jjr_count+=1    
        elif entry =='JJS':
             jjs_count+=1
        elif entry =='LS':
             ls_count+=1      
        elif entry =='MD':
            md_count+=1  
        elif entry =='NN':
            nn_count+=1 
        elif entry =='NNP':
            nnp_count+=1    
        elif entry =='NNPS':
            nnps_count+=1   
        elif entry =='NNS':
             nns_count+=1 
        elif entry =='PDT':
            pdt_count+=1 
        elif entry =='POS':
            pos_count+=1 
        elif entry =='PRP':
            prp_count+=1 
        elif entry =='PRP$':
            prpdollar_count+=1 
        elif entry =='RB':
            rb_count+=1 
        elif entry =='RBR':
            rbr_count+=1 
        elif entry =='RP':
            rp_count+=1 
        elif entry =='SYM':
            sym_count+=1 
        elif entry =='TO':
             to_count+=1
        elif entry =='UH':
            uh_count+=1 
        elif entry =='VB':
            vb_count+=1 
        elif entry =='VBD':
            vbd_count+=1     
        elif entry =='VBG':
            vbg_count+=1 
        elif entry =='VBN':
            vbn_count+=1
        elif entry =='VBP':
            vbp_count+=1
        elif entry =='VBZ':
            vbz_count+=1
        elif entry =='WDT':
            wdt_count+=1
        elif entry =='WP':
            wp_count+=1
        elif entry =='WP$':
            wpdollar_count+=1
        elif entry =='WRB':
            wrb_count+=1
        else:
            none_count+=1
    
    
    
    
    quote_final_values.append(quote_count) 
    openbracket_final_values.append(openbracket_count) 
    closebracket_final_values.append(closebracket_count)
    comma_final_values.append(comma_count)
    tire_final_values.append(tire_count)
    doublepoints_final_values.append(doublepoints_count)
    openquotes_final_values.append(openquotes_count)
    dot_final_values.append(dot_count)
    cc_final_values.append(cc_count) 
    cd_final_values.append(cd_count)
    dt_final_values.append(dt_count)
    ex_final_values.append(ex_count)
    fw_final_values.append(fw_count)
    in_final_values.append(in_count)
    jj_final_values.append(jj_count)
    jjr_final_values.append(jjr_count)
    jjs_final_values.append(jjs_count)
    ls_final_values.append(ls_count)
    md_final_values.append(md_count)
    nn_final_values.append(nn_count)
    nnp_final_values.append(nnp_count)
    nnps_final_values.append(nnps_count)
    nns_final_values.append(nns_count)
    pdt_final_values.append(pdt_count)
    pos_final_values.append(pos_count)
    prp_final_values.append(prp_count)
    prpdollar_final_values.append(prpdollar_count)
    rb_final_values.append(rb_count)
    rbr_final_values.append(rbr_count)
    rp_final_values.append(rp_count)
    sym_final_values.append(sym_count)
    to_final_values.append(to_count)
    uh_final_values.append(uh_count)
    vb_final_values.append(vb_count)
    vbd_final_values.append(vbd_count)
    vbg_final_values.append(vbg_count)
    vbn_final_values.append(vbn_count)
    vbp_final_values.append(vbp_count)
    vbz_final_values.append(vbz_count)
    wdt_final_values.append(wdt_count)
    wp_final_values.append(wp_count)
    wpdollar_final_values.append(wpdollar_count)
    wrb_final_values.append(wrb_count)
    none_final_valuess.append(none_count)

In [None]:
#Adding the Part of Speech Tags to the DataFrame (new columns)

df["quote"] = quote_final_values
df["openbracker"] = openbracket_final_values  
df["closebracket"] = closebracket_final_values 
df["comma"] = comma_final_values  
df["tire"] = tire_final_values 
df["doublepoints"] = doublepoints_final_values  
df["openquotes"] = openquotes_final_values  
df["dot"] = dot_final_values 
df["cc"] = cc_final_values   
df["cd"] = cd_final_values  
df["dt"] = dt_final_values  
df["ex"] = ex_final_values  
df["fw"] = fw_final_values  
df["in"] = in_final_values  
df["jj"] = jj_final_values  
df["jjr"] = jjr_final_values  
df["jjs"] = jjs_final_values  
df["ls"] = ls_final_values  
df["md"] = md_final_values  
df["nn"] = nn_final_values  
df["nnp"] = nnp_final_values  
df["nnps"] = nnps_final_values  
df["nns"] = nns_final_values  
df["pdt"] = pdt_final_values  
df["pos"] = pos_final_values  
df["prp"] = prp_final_values  
df["prpdollar"] = prpdollar_final_values  
df["rb"] = rb_final_values  
df["rbr"] = rbr_final_values  
df["rp"] = rp_final_values  
df["sym"] = sym_final_values  
df["to"] = to_final_values  
df["uh"] = uh_final_values  
df["vb"] = vb_final_values  
df["vbd"] = vbd_final_values  
df["vbg"] = vbg_final_values  
df["vbn"] = vbn_final_values  
df["vbp"] = vbp_final_values  
df["vbz"] = vbz_final_values  
df["wdt"] = wdt_final_values  
df["wp"] = wp_final_values  
df["wpdollar"] = wpdollar_final_values  
df["wrb"] = wrb_final_values  
df["none"] = none_final_valuess  

<a id="Percentage-Repetition,-Part-of-Speech-Tags-&-Genre-Model"></a>
### Percentage Repetition, Part of Speech Tags & Genre Model

While attempting different models, I noticed that some features simply added noise, resulting in a poorer classifier, whilst others created a stronger classifier. Part of speech tags, Percentage of Repetition and Genres were significantly different across the groups and were therefore combined in one model. Using a Logistic Regression, an accuracy score of 40% was achieved, making this the best model for classifying songs. 

In [None]:
#Concatenating the various DataFrames
df = pd.read_csv("181716_edacleaned_sent_rep_ALLFRI.csv",sep="\t")
df2 = pd.read_csv("181716_edacleaned_ALL.csv",sep="\t")
sub_df2 = df2[['quote', 'openbracker', 'closebracket',
       'comma', 'tire', 'doublepoints', 'openquotes', 'dot', 'cc', 'cd', 'dt',
       'ex', 'fw', 'in', 'jj', 'jjr', 'jjs', 'ls', 'md', 'nn', 'nnp', 'nnps',
       'nns', 'pdt', 'pos', 'prp', 'prpdollar', 'rb', 'rbr', 'rp', 'sym', 'to',
       'uh', 'vb', 'vbd', 'vbg', 'vbn', 'vbp', 'vbz', 'wdt', 'wp', 'wpdollar',
       'wrb', 'none']]

DF = pd.concat([df, sub_df2], axis=1)

#Getting the Dummy variables 
DF_dum = pd.get_dummies(DF,columns=["General_G2"])

In [None]:
y = DF_dum["Groups"]
X = DF_dum[['vader_compound','percent_compress', 'quote', 'openbracker', 'closebracket',
       'comma', 'tire', 'doublepoints', 'openquotes', 'dot', 'cc', 'cd', 'dt',
       'ex', 'fw', 'in', 'jj', 'jjr', 'jjs', 'ls', 'md', 'nn', 'nnp', 'nnps',
       'nns', 'pdt', 'pos', 'prp', 'prpdollar', 'rb', 'rbr', 'rp', 'sym', 'to',
       'uh', 'vb', 'vbd', 'vbg', 'vbn', 'vbp', 'vbz', 'wdt', 'wp', 'wpdollar',
       'wrb', 'none', 'General_G2_ christmas', 'General_G2_ funk',
       'General_G2_alternative', 'General_G2_christmas', 'General_G2_country',
       'General_G2_electronic', 'General_G2_folk', 'General_G2_hip hop',
       'General_G2_indie', 'General_G2_jazz', 'General_G2_latin',
       'General_G2_metal', 'General_G2_musicals', 'General_G2_pop',
       'General_G2_punk', 'General_G2_r&b', 'General_G2_rap',
       'General_G2_reggae', 'General_G2_rock', 'General_G2_soul',
       'General_G2_traditional pop', 'General_G2_tropical',
       'General_G2_tv music']]

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_ss = ss.fit_transform(X)

X_train,X_test,y_train,y_test = train_test_split(X_ss,y,test_size=0.2,stratify=y)

#Fit a support vector classifier; probability=True enables predict_proba
svc = SVC(C=1.06, probability=True)
svc.fit(X_train,y_train)

#Accuracy on the held-out test set
svc.score(X_test,y_test)
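
In the cell above the scaler is fitted on all rows before the train/test split, so the test rows influence the scaling. One possible alternative, shown here only as a sketch, is to put the scaler and the classifier in a pipeline and score it with cross-validation. The data below is synthetic (`make_classification` with four classes standing in for the four chart groups), so the printed scores say nothing about the real songs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X and y: four classes, like the four chart groups
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6,
    n_classes=4, random_state=0,
)

# The scaler is refitted on the training part of every fold only
pipe = make_pipeline(StandardScaler(), SVC(C=1.06))

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)

print(scores.mean(), scores.std())
```

Averaging over several folds also gives a rough idea of how much a single accuracy figure can move from one random split to the next.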

<a id="All-Features-Model"></a>
### All Features Model

As the final model, I wanted to combine all of the NLP features created from the lyrics into one model. With a Random Forest classifier, the highest accuracy score of 99% was achieved. Nevertheless, as previously seen, most of the model's weight stems from the Artist Popularity features.

In [None]:
df = pd.read_csv("181716_edacleaned_ALL.csv",sep="\t")

df2 = pd.read_csv("181716_edacleaned_sent_rep_ALLFRI.csv",sep="\t")
artist_df = df2[['Artist_Top25_Count', 'Artist_Top25_50_Count','Artist_Top50_75_Count', 'Artist_Top75_100_Count']]

DF = pd.concat([df, artist_df ], axis=1)
df_dum = pd.get_dummies(DF,columns=["General_G2"])

In [None]:
y = df_dum["Groups"]

X_art = df_dum[['percent_compress', 
       'vader_neg', 'vader_pos', 'vader_neu', 'vader_compound', 'quote',
       'openbracker', 'closebracket', 'comma', 'tire', 'doublepoints',
       'openquotes', 'dot', 'cc', 'cd', 'dt', 'ex', 'fw', 'in', 'jj', 'jjr',
       'jjs', 'ls', 'md', 'nn', 'nnp', 'nnps', 'nns', 'pdt', 'pos', 'prp',
       'prpdollar', 'rb', 'rbr', 'rp', 'sym', 'to', 'uh', 'vb', 'vbd', 'vbg',
       'vbn', 'vbp', 'vbz', 'wdt', 'wp', 'wpdollar', 'wrb', 'none',
       'Artist_Top25_Count', 'Artist_Top25_50_Count', 'Artist_Top50_75_Count',
       'Artist_Top75_100_Count', 'General_G2_ christmas', 'General_G2_ funk',
       'General_G2_alternative', 'General_G2_christmas', 'General_G2_country',
       'General_G2_electronic', 'General_G2_folk', 'General_G2_hip hop',
       'General_G2_indie', 'General_G2_jazz', 'General_G2_latin',
       'General_G2_metal', 'General_G2_musicals', 'General_G2_pop',
       'General_G2_punk', 'General_G2_r&b', 'General_G2_rap',
       'General_G2_reggae', 'General_G2_rock', 'General_G2_soul',
       'General_G2_traditional pop', 'General_G2_tropical',
       'General_G2_tv music']]


from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_ss = ss.fit_transform(X_art)

X_train,X_test,y_train,y_test = train_test_split(X_ss,y,test_size=0.2,stratify=y)


rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
rfc.score(X_test,y_test)
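
One way to check how much of a Random Forest's weight comes from a given feature is to look at its `feature_importances_` attribute. The sketch below uses synthetic data, not the song data: `artist_count_demo` is a hypothetical column built to follow the label closely, and the two noise columns carry no signal.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic labels for four groups
y_demo = rng.integers(0, 4, size=n)

# One hypothetical column that tracks the label and two pure noise columns
X_demo = pd.DataFrame({
    "artist_count_demo": y_demo + rng.normal(0, 0.1, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

rfc_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_demo.fit(X_demo, y_demo)

# Impurity-based importances, one per column, normalised to sum to 1
importances = pd.Series(
    rfc_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

With the objects from the cell above, the same idea would be `pd.Series(rfc.feature_importances_, index=X_art.columns)`, since scaling keeps the column order of `X_art`.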

<a id="Conclusion"></a>
### Conclusion

For the conclusion of the project and more details on the accuracy scores obtained with the various models, please see the "NLP Billboard100 Presentation" PDF in the repository.