# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since its outputs, like probabilities, live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).
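
To make the "unit interval" point concrete, here is a minimal NumPy sketch of the logistic (sigmoid) function that logistic regression applies to a linear score. The weights `w`, intercept `b`, and inputs `X` are made-up values for illustration only; nothing here is fitted to the FMA data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept, and inputs (illustrative only).
w = np.array([0.8, -1.5])
b = 0.2
X = np.array([[1.0, 0.5],
              [3.0, -1.0],
              [-2.0, 2.0]])

scores = X @ w + b                    # linear part: any real number
probs = sigmoid(scores)               # logistic part: strictly between 0 and 1
labels = (probs >= 0.5).astype(int)   # threshold at 0.5 to get a class

print(probs)
print(labels)
```

A score of 0 maps to a probability of exactly 0.5, which is why the sign of the linear score decides the predicted class at that threshold.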

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and use to classify the genre of tracks. To get you started:

### First I'll download all the data and set a few parameters for Pandas. 

In [1]:
import pandas as pd
# LogisticRegression is the classifier this notebook is about.
from sklearn.linear_model import LinearRegression, LogisticRegression
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)  # Unlimited columns

In [2]:
#!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
#!unzip fma_metadata.zip


### Now I'll open the Tracks.csv and create my dataframe.


In [3]:
# tracks.csv has a multi-row header, so a plain read_csv leaves the column names all messed up.
names = pd.read_csv('fma_metadata/tracks.csv')
names.head()

Unnamed: 0.1,Unnamed: 0,album,album.1,album.2,album.3,album.4,album.5,album.6,album.7,album.8,album.9,album.10,album.11,album.12,artist,artist.1,artist.2,artist.3,artist.4,artist.5,artist.6,artist.7,artist.8,artist.9,artist.10,artist.11,artist.12,artist.13,artist.14,artist.15,artist.16,set,set.1,track,track.1,track.2,track.3,track.4,track.5,track.6,track.7,track.8,track.9,track.10,track.11,track.12,track.13,track.14,track.15,track.16,track.17,track.18,track.19
0,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
1,track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
4,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World


Ooh, those headers look all messed up. I'll replace them with a cleaner list.

### Replace headers with tidy list of good names.

In [4]:
# The real column names sit in the first data row, so I pull that row out as a list.
# I'm going to save these names and use them as the feature names.
cols = names[0:1].values.tolist()
cols = cols[0]
cols[0] = 'track_id'
print(f"The fixed list of names now: {cols[0:3]} ...\n")

# Instead of renaming in place, I'm going to re-read the CSV and skip the three header rows.
# This way pandas infers each column's dtype from the data rows alone.
tracks = pd.read_csv('fma_metadata/tracks.csv', skiprows=[0,1,2], header=None, names=cols)
tracks.head(2)

The fixed list of names now: ['track_id', 'comments', 'date_created'] ...



Unnamed: 0,track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
0,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
1,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
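
As an alternative to rebuilding the header by hand, `pd.read_csv` can read several header rows at once into two-level column labels. The sketch below uses a small inline CSV that mimics the layout of `tracks.csv` (the toy rows and the `toy` names are assumptions for illustration, not the real file). Flattening the two levels into names like `album_comments` also avoids the duplicate names (`comments`, `comments.1`, ...) seen above, which recent pandas versions may refuse when passed through `names=`.

```python
import io
import pandas as pd

# Toy stand-in for tracks.csv: two header rows plus a row naming the index.
toy_csv = io.StringIO(
    ",album,album,track,track\n"
    ",comments,title,duration,genre_top\n"
    "track_id,,,,\n"
    "2,0,AWOL - A Way Of Life,168,Hip-Hop\n"
    "3,0,AWOL - A Way Of Life,237,Hip-Hop\n"
)

# header=[0, 1] builds two-level column labels; index_col=0 uses track_id as the index.
toy = pd.read_csv(toy_csv, index_col=0, header=[0, 1])
print(toy.columns.tolist())

# Flatten to unique single-level names such as 'album_comments' and 'track_duration'.
toy.columns = ["_".join(pair) for pair in toy.columns]
toy = toy.reset_index()
print(toy.head())
```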


### Inspect the data a bit more....

In [5]:
print(tracks.shape)


(106574, 53)


### Here I'm going to load features.csv.

These are the features extracted from the audio, as described in the FMA repository:
https://github.com/mdeff/fma

In [6]:
features = pd.read_csv('fma_metadata/features.csv', skiprows=[1,2,3])
features = features.rename(columns={'feature': 'track_id'})

In [7]:
# Here I print a longer head table to make sure that the rows align.
# I also print the shape to make sure it looks right for merging the two dataframes.
print(features.shape)
features.head(3)

(106574, 519)


Unnamed: 0,track_id,chroma_cens,chroma_cens.1,chroma_cens.2,...,zcr.4,zcr.5,zcr.6
0,2,7.180653,5.230309,0.249321,...,0.0,2.089872,0.061448
1,3,1.888963,0.760539,0.345297,...,0.0,1.716724,0.06933
2,5,0.527563,-0.077654,-0.27961,...,0.0,2.193303,0.044861

3 rows × 519 columns


### Merge Tracks and Features into one dataset

In [8]:
# The merge is left commented out for now, so df holds only the track metadata.
#df = pd.merge(tracks, features, on='track_id')
df = tracks
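
If the commented-out merge is switched back on, a sketch along these lines could join the two tables on `track_id` and guard against accidental row duplication. The two small frames and the `feature_a` column are hypothetical stand-ins, not the real `tracks` and `features` data.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only).
tracks_toy = pd.DataFrame({
    "track_id": [2, 3, 5],
    "genre_top": ["Hip-Hop", "Hip-Hop", "Hip-Hop"],
})
features_toy = pd.DataFrame({
    "track_id": [2, 3, 5],
    "feature_a": [0.1, 0.2, 0.3],
})

# Check that both tables list the same tracks in the same order.
print(tracks_toy["track_id"].equals(features_toy["track_id"]))

# Inner join on the shared key; validate raises if track_id repeats on either side.
merged = pd.merge(tracks_toy, features_toy, on="track_id",
                  how="inner", validate="one_to_one")
print(merged.shape)
```

Passing a single column name to `on` is enough when both frames share the key; `validate="one_to_one"` is optional but catches duplicated keys early.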

In [9]:
df.head()

Unnamed: 0,track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
0,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
1,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
2,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
3,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
4,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level


In [10]:
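topgenre_unique = df['genre_top'].value_counts()
# genre_top holds strings (or NaN), so this comparison with 0 removes nothing.
df = df[df.genre_top != 0]

print(topgenre_unique)
# Country music? Not in my classifier. Hahaha
# Drop the genres with the fewest tracks (rows with a missing genre are kept for now).
replace_these = ["Jazz", "Old-Time / Historic", "Spoken", "Soul-RnB", "Blues", "Country", "Easy Listening"]
df = df[~df.genre_top.isin(replace_these)]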

topgenre_unique = df['genre_top'].value_counts()
print(topgenre_unique)

Rock                   14182
Experimental           10608
Electronic              9372
Hip-Hop                 3552
Folk                    2803
Pop                     2332
Instrumental            2079
International           1389
Classical               1230
Jazz                     571
Old-Time / Historic      554
Spoken                   423
Country                  194
Soul-RnB                 175
Blues                    110
Easy Listening            24
Name: genre_top, dtype: int64
Rock             14182
Experimental     10608
Electronic        9372
Hip-Hop           3552
Folk              2803
Pop               2332
Instrumental      2079
International     1389
Classical         1230
Name: genre_top, dtype: int64
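
Instead of listing the small genres by hand, the same idea can be driven by a count threshold. This is a sketch on a toy column; the cutoff `min_tracks` and the counts are assumptions for illustration. Unlike the filter above, it also drops rows whose genre is missing.

```python
import pandas as pd

# Toy genre column (counts are made up for illustration).
toy = pd.DataFrame(
    {"genre_top": ["Rock"] * 5 + ["Folk"] * 3 + ["Blues"] * 1 + [None] * 2}
)

min_tracks = 3  # hypothetical cutoff
counts = toy["genre_top"].value_counts()   # missing values are excluded from the counts
keep = counts[counts >= min_tracks].index

# isin() is False for missing values, so unlabeled rows are dropped here too.
filtered = toy[toy["genre_top"].isin(keep)]
print(filtered["genre_top"].value_counts())
```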


### Now a bit of Feature Engineering on the merged DF


In [11]:
# Which columns are missing data
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

Unnamed: 0,Total,Percent
lyricist,104224,0.997139
publisher,103354,0.988816
information.1,102466,0.98032
composer,100985,0.966151
wikipedia_page,99239,0.949447
active_year_end,99209,0.94916
date_recorded,98911,0.946308
related_projects,91639,0.876735
associated_labels,90395,0.864834
language_code,90364,0.864537


In [12]:
# Let's take some of the data and see if we can do some feature engineering.

# Here I'll use label encoding to turn categorical columns into integer codes.
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()

# df['name'] is a categorical variable, so encode it.
#df = df.dropna(subset=['name'])
df["name_code"] = lb_make.fit_transform(df["name"])
print(df[["name", "name_code"]].head(10))

# df['type'] is a categorical variable, so drop rows where it is missing and encode it.
df = df.dropna(subset=['type'])
df["type_code"] = lb_make.fit_transform(df["type"])
print(df[["type", "type_code"]].head(10))

# df['genre_top'] is the target; drop rows where it is missing and encode it.
df = df.dropna(subset=['genre_top'])
df["genre_top_code"] = lb_make.fit_transform(df["genre_top"])
print(df[["genre_top", "genre_top_code"]].head(10))


         name  name_code
0        AWOL        297
1        AWOL        297
2        AWOL        297
3   Kurt Vile       7099
4  Nicky Cook       9159
5  Nicky Cook       9159
6  Nicky Cook       9159
7  Nicky Cook       9159
8  Nicky Cook       9159
9        AWOL        297
    type  type_code
0  Album          0
1  Album          0
2  Album          0
3  Album          0
4  Album          0
5  Album          0
6  Album          0
7  Album          0
8  Album          0
9  Album          0
       genre_top  genre_top_code
0        Hip-Hop               4
1        Hip-Hop               4
2        Hip-Hop               4
3            Pop               7
9        Hip-Hop               4
10          Rock               8
11          Rock               8
12  Experimental               2
13  Experimental               2
14          Folk               3
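One thing to keep in mind with the cell above: each call to `fit_transform` refits the same `LabelEncoder`, so after the last call it only remembers the `genre_top` classes. A minimal sketch of keeping one encoder per column, so that codes can be mapped back to labels later, is shown here on made-up lists; the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values (illustrative only).
names = ["AWOL", "Kurt Vile", "Nicky Cook", "AWOL"]
genres = ["Hip-Hop", "Pop", "Rock", "Hip-Hop"]

# One encoder per column, so every mapping stays available.
encoders = {"name": LabelEncoder(), "genre_top": LabelEncoder()}
name_codes = encoders["name"].fit_transform(names)
genre_codes = encoders["genre_top"].fit_transform(genres)

# inverse_transform turns integer codes (e.g. model predictions) back into labels.
decoded = encoders["genre_top"].inverse_transform(genre_codes)

print(list(genre_codes), list(decoded))
print(list(encoders["name"].classes_))
```

With separate encoders, predicted genre codes can be decoded without building a separate lookup table.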


In [13]:
df.shape

(45682, 56)

In [14]:
# Drop all remaining columns missing data. 
nan_columns = df.columns[df.isna().any()].tolist()
df = df.drop(columns=nan_columns)
df.head(1)

# Make sure we didn't miss any.
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(4)

Unnamed: 0,Total,Percent
genre_top_code,0,0.0
type_code,0,0.0
comments,0,0.0
date_created,0,0.0


In [15]:
# All the rows that will be dropped are dropped. Now I'll keep a lookup table of labels and their codes.
df_key= df[["genre_top", "genre_top_code","type", "type_code","name", "name_code"]]

# Let's drop all the non-numerical columns now. 
data = df._get_numeric_data()
data.shape

(45682, 19)

### Test/Train Split and Prepare Some Balanced Samples. 

In [16]:
#!pip install imbalanced-learn
from collections import Counter

from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler

# Define my X & Y
y=data["genre_top_code"]
X=data.drop(columns=['genre_top_code', 'track_id'])

# Standard-scale X so that the SAG / SAGA solvers converge faster
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Here is where I split the data into test/train sets. 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Undersample into balanced group. 
print('Original dataset shape %s' % Counter(y_train))
rus = RandomUnderSampler(sampling_strategy='not minority',replacement=True, random_state=42)
X_RandomUS, y_RandomUS = rus.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_RandomUS))

Original dataset shape Counter({8: 11009, 2: 8233, 1: 7081, 4: 2754, 3: 2066, 7: 1821, 5: 1584, 6: 1049, 0: 948})
Resampled dataset shape Counter({0: 948, 1: 948, 2: 948, 3: 948, 4: 948, 5: 948, 6: 948, 7: 948, 8: 948})
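In the cell above the scaler is fit on all of `X` before the split, so the test rows contribute to the mean and standard deviation used for scaling. A variant that avoids this is sketched below on synthetic data: split first, fit the scaler on the training rows only, then apply it to both parts. The arrays and their sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # synthetic features
y_toy = rng.integers(0, 3, size=200)                   # synthetic labels

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(3))
```

The same rule applies to resampling: only the training split should be under- or oversampled, which is what the notebook already does.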


### Define my success metrics

In [17]:
# Lets define a performance evaluation function.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

def get_metrics(y_test, y_predicted):  
    # true positives / (true positives+false positives)
    precision = precision_score(y_test, y_predicted, pos_label=None,
                                    average='weighted')             
    # true positives / (true positives + false negatives)
    recall = recall_score(y_test, y_predicted, pos_label=None,
                              average='weighted')
    
    # harmonic mean of precision and recall
    f1 = f1_score(y_test, y_predicted, pos_label=None, average='weighted')
    
    # (true positives + true negatives) / total
    accuracy = accuracy_score(y_test, y_predicted)
    return accuracy, precision, recall, f1
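A note on reading these numbers: with `average='weighted'`, recall is the per-class recall weighted by class frequency, which works out to the same value as accuracy. That is why the `recall` and `accuracy` columns match in every trial below. The sketch uses made-up labels to show this, and prints `classification_report` (imported above) for the per-class breakdown.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Made-up labels for three classes (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")

print(acc, weighted_recall)

# Per-class precision/recall/F1 shows which classes the model struggles with.
print(classification_report(y_true, y_pred, zero_division=0))
```

For imbalanced classes, `average='macro'` or the per-class report is often more informative than the weighted averages.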

### Now for some Logistic Regression - Original Data First

In [18]:
from sklearn.linear_model import LogisticRegression
#from sklearn.linear_model import LogisticRegressionCV

# Baseline logistic regression model (fit on the original, unresampled training data below)
logreg = LogisticRegression(max_iter=1000,
                            n_jobs=-1, 
                            verbose=1,
                            random_state=42)

In [19]:
# Fit with the Original Data
logreg.fit(X_train, y_train)

# Let's check the performance on the test set (model trained on the original, imbalanced data)
y_predicted = logreg.predict(X_test)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)
trials = []
trials.append(["Trial","accuracy","precision","recall","f1"])
trial = "Base"
trials.append([trial, accuracy, precision, recall, f1])
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))

[LibLinear]accuracy = 0.454, precision = 0.441, recall = 0.454, f1 = 0.398


In [20]:
# Try a basic logistic regression with default settings apart from a higher max_iter
logreg = LogisticRegression(verbose=1,
                            max_iter=10000,
                            random_state=42)

# Fit with the Original Data
logreg.fit(X_train, y_train)

# lets check the performance on the unbalanced data set. 
y_predicted = logreg.predict(X_test)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)
trial = "BasicBase"
trials.append([trial, accuracy, precision, recall, f1])
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))

[LibLinear]accuracy = 0.454, precision = 0.441, recall = 0.454, f1 = 0.398


### Now I attempt with the differently sampled data. 

In [21]:
# Fit with the undersampled data
logreg.fit(X_RandomUS, y_RandomUS)

# lets check the performance w/ Undersampling
y_predicted = logreg.predict(X_test)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)
trial = 'RandomUnderSampler'
trials.append([trial, accuracy, precision, recall, f1])
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))

[LibLinear]accuracy = 0.327, precision = 0.440, recall = 0.327, f1 = 0.350


In [22]:
# Lets try 2 Kinds of oversampling. 
from imblearn.over_sampling import SMOTE, ADASYN

# First SMOTE
X_overSMOTE, y_overSMOTE = SMOTE().fit_resample(X_train, y_train)
print(f"overSMOTED: {sorted(Counter(y_overSMOTE).items())}\n")

# Then ADASYN
X_overADASYN, y_overADASYN = ADASYN().fit_resample(X_train, y_train)
print(f"overADASYN: {sorted(Counter(y_overADASYN).items())}\n")


overSMOTED: [(0, 11009), (1, 11009), (2, 11009), (3, 11009), (4, 11009), (5, 11009), (6, 11009), (7, 11009), (8, 11009)]

overADASYN: [(0, 10997), (1, 10411), (2, 10894), (3, 10954), (4, 10875), (5, 11129), (6, 10981), (7, 10794), (8, 11009)]



In [23]:
# Checking how SMOTE oversampling performs. 
logreg = LogisticRegression(n_jobs=-1, 
                            verbose=1,
                            max_iter=1000,
                            random_state=42)

logreg.fit(X_overSMOTE, y_overSMOTE)

# lets check the performance w/ Oversampling 
y_predicted = logreg.predict(X_test)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)
trial = 'SMOTE'
trials.append([trial, accuracy, precision, recall, f1])
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))

[LibLinear]accuracy = 0.330, precision = 0.455, recall = 0.330, f1 = 0.348


In [24]:
# Checking how ADASYN oversampling performs. 
logreg = LogisticRegression(n_jobs=-1, 
                            verbose=1,
                            max_iter=1000,
                            random_state=42)

logreg.fit(X_overADASYN, y_overADASYN)

# lets check the performance w/ Oversampling 
y_predicted = logreg.predict(X_test)
accuracy, precision, recall, f1 = get_metrics(y_test, y_predicted)
trial = 'ADASYN'
trials.append([trial, accuracy, precision, recall, f1])
print("accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f" % (accuracy, precision, recall, f1))

[LibLinear]accuracy = 0.315, precision = 0.434, recall = 0.315, f1 = 0.326


### Let's try Recursive Feature Elimination to see which features are most important. 

In [26]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Lets see if we can find the ideal features.
logreg = LogisticRegression(n_jobs=-1, random_state=42, verbose=1, max_iter=100, warm_start=True)

rfe = RFE(logreg, n_features_to_select=1, verbose=1)
rfe = rfe.fit(X_test, y_test)

print(rfe.support_)
print(rfe.ranking_)

Fitting estimator with 17 features.
[LibLinear]Fitting estimator with 16 features.
[LibLinear]Fitting estimator with 15 features.
[LibLinear]Fitting estimator with 14 features.
[LibLinear]Fitting estimator with 13 features.
[LibLinear]Fitting estimator with 12 features.
[LibLinear]Fitting estimator with 11 features.
[LibLinear]Fitting estimator with 10 features.
[LibLinear]Fitting estimator with 9 features.
[LibLinear]Fitting estimator with 8 features.
[LibLinear]Fitting estimator with 7 features.
[LibLinear]Fitting estimator with 6 features.
[LibLinear]Fitting estimator with 5 features.
[LibLinear]Fitting estimator with 4 features.
[LibLinear]Fitting estimator with 3 features.
[LibLinear]Fitting estimator with 2 features.
[LibLinear][LibLinear][False False False False False False False False False False False  True
 False False False False False]
[16 14 11  3  2  9  7 10 12 17  8  1  4  5 13 15  6]


In [28]:
names = data.drop(columns=['genre_top_code','track_id'])
feature_names = list(names.columns)

print ("Features sorted by their rank:")
print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), feature_names)))

Features sorted by their rank:
[(1, 'favorites.2'), (2, 'tracks'), (3, 'listens'), (4, 'interest'), (5, 'listens.1'), (6, 'type_code'), (7, 'favorites.1'), (8, 'duration'), (9, 'comments.1'), (10, 'id.1'), (11, 'id'), (12, 'bit_rate'), (13, 'number'), (14, 'favorites'), (15, 'name_code'), (16, 'comments'), (17, 'comments.2')]


In [79]:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import (LinearRegression, Ridge, 
                                  Lasso, RandomizedLasso)
from sklearn.feature_selection import RFE,RFECV,f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import sklearn.metrics
import numpy as np
#!pip install minepy
from minepy import MINE
sorted(sklearn.metrics.SCORERS.keys()) 

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'brier_score_loss',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'mutual_info_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'v_measure_score']

In [90]:
# Scoring types to consider
# 'neg_mean_squared_error', 'accuracy', 'precision', 'recall', 'neg_log_loss'

 
ranks = {}
names = feature_names

Y= y_test
X= X_test

print(Y.shape)
print(X.shape)

logreg.fit(X, Y)

def rank_to_dict(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks ))

print (np.abs(logreg.coef_).ndim)

logreg = LogisticRegression(n_jobs=-1, random_state=42, verbose=1, max_iter=100, warm_start=True)
#rfecv.grid_scores_
rfecv = RFECV(logreg, min_features_to_select=1, step=1, cv=StratifiedKFold(2),
              scoring='r2',verbose=1)
rfecv.fit(X,Y)

# 'neg_mean_squared_error', 'accuracy', 'precision', 'recall', 'neg_log_loss', 'f1_weighted'

#ranks["RFECV"] = rank_to_dict(([rfecv.grid_scores_]), names, order=-1)
#ranks["RFE"] = rank_to_dict((float(i) for i in rfe.ranking_), names, order=-1)
#rfe.ranking_
print(rfecv.ranking_)
print(rfecv.grid_scores_)
#print(ranking = [float(i) for i in ranking])
#print(map(float, ranking))
#order=-1
#print(np.array(float(i) for i in ranking))
#print(order*np.array([float(i) for i in rfe.ranking_]))
ranks["Stability"] = rank_to_dict(np.abs([1-i for i in rfecv.grid_scores_]), names)
print (ranks)

(9137,)
(9137, 17)
[LibLinear]2
Fitting estimator with 17 features.
[LibLinear]Fitting estimator with 16 features.
[LibLinear]Fitting estimator with 15 features.
[LibLinear]Fitting estimator with 14 features.
[LibLinear]Fitting estimator with 13 features.
[LibLinear]Fitting estimator with 12 features.
[LibLinear]Fitting estimator with 11 features.
[LibLinear]Fitting estimator with 10 features.
[LibLinear]Fitting estimator with 9 features.
[LibLinear]Fitting estimator with 8 features.
[LibLinear]Fitting estimator with 7 features.
[LibLinear]Fitting estimator with 6 features.
[LibLinear]Fitting estimator with 5 features.
[LibLinear]Fitting estimator with 4 features.
[LibLinear]Fitting estimator with 3 features.
[LibLinear]Fitting estimator with 2 features.
[LibLinear][LibLinear]Fitting estimator with 17 features.
[LibLinear]Fitting estimator with 16 features.
[LibLinear]Fitting estimator with 15 features.
[LibLinear]Fitting estimator with 14 features.
[LibLinear]Fitting estimator with 13

In [101]:
#https://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import (LinearRegression, Ridge, 
                                  Lasso)
from sklearn.feature_selection import RFE,RFECV,f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np
#!pip install minepy
#from minepy import MINE
 
np.random.seed(0)
 
size = 750
 
Y = y_test
X = X_test
 
names = feature_names
 
ranks = {}
 
def rank_to_dict(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks ))
 
lr = LinearRegression(normalize=True)
lr.fit(X, Y)
ranks["Linear Regression"] = rank_to_dict(np.abs(lr.coef_), names)

rf = RandomForestRegressor()
rf.fit(X,Y)
ranks["Random Forest"] = rank_to_dict(rf.feature_importances_, names)

f, pval  = f_regression(X, Y, center=True)
ranks["Corr. (p_value)"] = rank_to_dict(f, names)

ridge = Ridge(alpha=7)
ridge.fit(X, Y)
ranks["Ridge Regression"] = rank_to_dict(np.abs(ridge.coef_), names)

lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["LASSO"] = rank_to_dict(np.abs(lasso.coef_), names)

# Deprecated and since removed from scikit-learn
#rlasso = RandomizedLasso(alpha=0.04)
#rlasso.fit(X, Y)
#ranks["Stability"] = rank_to_dict(np.abs(rlasso.scores_), names)

#stop the search when 5 features are left (they will get equal scores)
rfe = RFE(logreg, n_features_to_select=5)
rfe.fit(X,Y)
ranks["Recursive Feature Elimination"] = rank_to_dict(([float(i) for i in rfe.ranking_]), names, order=-1)

#mine = MINE()
#mic_scores = []
#for i in range(X.shape[1]):
#    mine.compute_score(X[:,i], Y)
#    m = mine.mic()
#    mic_scores.append(m)
#ranks["MIC"] = rank_to_dict(mic_scores, names) 

r = {}
for name in names:
    r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)
ranks["Feat.Select Mean"] = r
    
#rfecv.grid_scores_- Accuracy Contribution
rfecv = RFECV(logreg, min_features_to_select=1, step=1, cv=StratifiedKFold(2),
              scoring='accuracy',verbose=0)
rfecv.fit(X,Y)
ranks["Accuracy"] = rank_to_dict(np.abs([1-i for i in rfecv.grid_scores_]), names)

#rfecv.grid_scores_ - Precision Contribution
rfecv = RFECV(logreg, min_features_to_select=1, step=1, cv=StratifiedKFold(2),
              scoring='precision_weighted',verbose=0)
rfecv.fit(X,Y)
ranks["Precision"] = rank_to_dict(np.abs([1-i for i in rfecv.grid_scores_]), names)

#rfecv.grid_scores_ - Recall Contribution
rfecv = RFECV(logreg, min_features_to_select=1, step=1, cv=StratifiedKFold(2),
              scoring='recall_weighted',verbose=0)
rfecv.fit(X,Y)
ranks["Recall"] = rank_to_dict(np.abs([1-i for i in rfecv.grid_scores_]), names)

#rfecv.grid_scores_ - F1 Contribution
rfecv = RFECV(logreg, min_features_to_select=1, step=1, cv=StratifiedKFold(2),
              scoring='f1_weighted',verbose=0)
rfecv.fit(X,Y)
ranks["f1"] = rank_to_dict(np.abs([1-i for i in rfecv.grid_scores_]), names)

r = {}
for name in names:
    r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)
ranks["Scores Mean"] = r

methods = list(ranks.keys())
#methods.append("Mean")

df_tabs = pd.DataFrame.from_dict(ranks)
df_tabs = df_tabs[methods]
df_tabs


Unnamed: 0,Linear Regression,Random Forest,Corr. (p_value),Ridge Regression,LASSO,Recursive Feature Elimination,Mean,Accuracy,Precision,Recall,f1
bit_rate,0.07,0.42,0.07,0.08,0.14,0.42,0.2,0.09,0.13,0.09,0.11
comments,0.01,0.09,0.05,0.01,0.0,0.08,0.04,1.0,1.0,1.0,1.0
comments.1,0.0,0.16,0.04,0.0,0.0,0.67,0.15,0.28,0.3,0.28,0.26
comments.2,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.09,0.04,0.09,0.1
duration,0.2,0.85,0.19,0.23,0.48,0.75,0.45,0.1,0.06,0.1,0.1
favorites,0.08,0.23,0.1,0.09,0.18,0.25,0.16,0.69,0.56,0.69,0.58
favorites.1,0.0,0.57,0.03,0.0,0.0,0.83,0.24,0.22,0.28,0.22,0.19
favorites.2,0.02,0.18,0.05,0.02,0.0,1.0,0.21,0.04,0.04,0.04,0.05
id,0.12,0.89,0.03,0.14,0.0,0.5,0.28,0.63,0.51,0.63,0.52
id.1,0.17,0.94,0.0,0.19,0.11,0.58,0.33,0.09,0.14,0.09,0.11
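To make the numbers in this table easier to interpret, here is a worked example of what `rank_to_dict` does: it min-max scales a list of scores to the range 0 to 1, and `order=-1` flips the direction so that a rank of 1 (best) maps to 1.0. The function is restated in a self-contained form; the feature names `a`, `b`, `c` and the ranking are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def rank_to_dict(ranks, names, order=1):
    # Scale the scores to [0, 1]; order=-1 reverses which end counts as best.
    column = order * np.array([ranks], dtype=float).T
    scaled = MinMaxScaler().fit_transform(column).T[0]
    return dict(zip(names, (round(float(v), 2) for v in scaled)))

# RFE-style ranking where 1 is the best feature (values invented).
rfe_ranking = [1, 3, 2]
scores = rank_to_dict(rfe_ranking, ["a", "b", "c"], order=-1)

print(scores)  # {'a': 1.0, 'b': 0.0, 'c': 0.5}
```

Each column of the table is scaled independently, so a value of 1.0 only means "highest within that method", not that the methods agree on magnitude.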


### Add in the Features.csv

In [None]:
# Merge the tracks and features dataframes on their shared track_id column
df = pd.merge(tracks, features, on='track_id')
print(df.shape)

# Fix Genres list. 
topgenre_unique = df['genre_top'].value_counts()
df = df[df.genre_top != 0]

# Country music? Not in my classifier. Hahaha
replace_these = ["Jazz", "Old-Time / Historic", "Spoken", "Soul-RnB", "Blues", "Country","Easy Listening"]
for i in replace_these:
    df = df[df.genre_top != i]

# Lets take some of the data and see if we can do some tricky feature engineering.
# Here I'll use Label Encoding 
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()

# df['name'] Should encode this name. Categorical variable.
#df = df.dropna(subset=['name'])
df["name_code"] = lb_make.fit_transform(df["name"])

# df['type'] Should encode this name. Categorical variable.
df = df.dropna(subset=['type'])
df["type_code"] = lb_make.fit_transform(df["type"])

# df['genre_top'] Should encode this name. Categorical variable.
df = df.dropna(subset=['genre_top'])
df["genre_top_code"] = lb_make.fit_transform(df["genre_top"])

# Drop all remaining columns missing data. 
nan_columns = df.columns[df.isna().any()].tolist()
df = df.drop(columns=nan_columns)
df.head(1)

# All the rows that will be dropped are dropped. Now I'll keep a lookup table of labels and their codes.
df_key= df[["genre_top", "genre_top_code","type", "type_code","name", "name_code"]]

# Lets drop all the non-numerical columns now. 
data = df._get_numeric_data()
print(data.shape)

# Define my X & Y
y=data["genre_top_code"]
# Drop the target and the track_id identifier, as in the earlier cell
X=data.drop(columns=['genre_top_code', 'track_id'])

# Standard-scale X so that the SAG / SAGA solvers converge faster
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Here is where I split the data into test/train sets. 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Undersample into balanced group. 
print('Original dataset shape %s' % Counter(y_train))
rus = RandomUnderSampler(sampling_strategy='not minority',replacement=True, random_state=42)
X_RandomUS, y_RandomUS = rus.fit_resample(X_train, y_train)
print('Resampled dataset shape %s' % Counter(y_RandomUS))

In [None]:
logreg = LogisticRegression(class_weight ='balanced',
                            multi_class='auto', 
                            solver='lbfgs',
                            n_jobs=-1, 
                            verbose=1,
                            max_iter=500,
                            random_state=42)
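The `class_weight='balanced'` option in the cell above is an alternative to resampling: rather than changing the training rows, it reweights each class by `n_samples / (n_classes * count)`. The sketch below checks that formula against scikit-learn's `compute_class_weight` on made-up labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: 2, 6 and 12 samples of classes 0, 1 and 2.
y_toy = np.array([0] * 2 + [1] * 6 + [2] * 12)
classes = np.unique(y_toy)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights computed by hand.
by_hand = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Rare classes get larger weights, so their errors count more in the loss without discarding or synthesizing any rows.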

## Guidance for the Assignment




This is the biggest dataset you've played with so far, and while it does generally fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?
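For the first task, one possible approach is to read the CSV with a two-row header and then flatten the column names. The sketch below uses a tiny in-memory CSV that imitates the layout seen when loading `tracks.csv` (a group row, a field row, then a row holding the index name); the layout and values are assumptions for illustration, so check them against the real file before relying on this.

```python
import io
import pandas as pd

# Tiny imitation of the tracks.csv layout (assumed, values invented).
toy_csv = io.StringIO(
    ",album,album,track,track\n"
    ",title,listens,title,genre_top\n"
    "track_id,,,,\n"
    "2,AWOL - A Way Of Life,6073,Food,Hip-Hop\n"
    "10,Constant Hitmaker,47632,Freeway,Pop\n"
)

# header=[0, 1] builds two-level column names; index_col=0 uses track_id as the index.
tracks_toy = pd.read_csv(toy_csv, index_col=0, header=[0, 1])

# Flatten ('track', 'genre_top') into 'track_genre_top'.
tracks_toy.columns = ["_".join(col) for col in tracks_toy.columns]

print(tracks_toy.columns.tolist())
print(tracks_toy.loc[10, "track_genre_top"])
```

Prefixing each field with its group keeps repeated names such as `title` or `listens` distinct, instead of the `.1`, `.2` suffixes pandas adds by default.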

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers

If you find that a model fit to classify *all* genres isn't very good, it's totally OK to limit it to the most frequent genres, or to try combining or clustering genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a long-term project than a stretch goal, and won't fit in Colab) if you'd like you can check those out and see what sort of deeper analysis you can do.
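As a starting point for the "from scratch" stretch goal, here is a minimal sketch of binary logistic regression trained with batch gradient descent on the logistic loss. It is not the linked article's code; the function names, learning rate, and iteration count are arbitrary choices, and the data is synthetic. The assignment itself is multiclass, so this only illustrates the two-class building block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean logistic loss; returns [bias, weights...]."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                         # predicted probabilities
        grad = Xb.T @ (p - y) / len(y)              # gradient of the mean log loss
        w -= lr * grad
    return w

# Two well-separated synthetic clusters (illustrative only).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

w = fit_logistic(X_toy, y_toy)
probs = sigmoid(np.hstack([np.ones((100, 1)), X_toy]) @ w)
pred = (probs >= 0.5).astype(int)

print("training accuracy:", (pred == y_toy).mean())
```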