# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

In [0]:
!pip install kaggle



In [0]:
# Note - you'll also have to sign up for Kaggle and authorize the API
# https://github.com/Kaggle/kaggle-api#api-credentials

# This essentially means uploading a kaggle.json file
# For Colab we can have it in Google Drive
from google.colab import drive
drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

# You also have to join the Titanic competition to have access to the data
!kaggle competitions download -c titanic

Mounted at /content/drive
env: KAGGLE_CONFIG_DIR=/content/drive/My Drive/
Downloading train.csv to /content
  0% 0.00/59.8k [00:00<?, ?B/s]
100% 59.8k/59.8k [00:00<00:00, 23.7MB/s]
Downloading test.csv to /content
  0% 0.00/28.0k [00:00<?, ?B/s]
100% 28.0k/28.0k [00:00<00:00, 28.8MB/s]
Downloading gender_submission.csv to /content
  0% 0.00/3.18k [00:00<?, ?B/s]
100% 3.18k/3.18k [00:00<00:00, 2.15MB/s]


In [0]:
# How would we try to do this with linear regression?
import pandas as pd

train_df = pd.read_csv('train.csv').dropna()
test_df = pd.read_csv('test.csv').dropna()  # Unlabeled, for Kaggle submission

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [0]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,183.0,183.0,183.0,183.0,183.0,183.0,183.0
mean,455.36612,0.672131,1.191257,35.674426,0.464481,0.47541,78.682469
std,247.052476,0.470725,0.515187,15.643866,0.644159,0.754617,76.347843
min,2.0,0.0,1.0,0.92,0.0,0.0,0.0
25%,263.5,0.0,1.0,24.0,0.0,0.0,29.7
50%,457.0,1.0,1.0,36.0,0.0,0.0,57.0
75%,676.0,1.0,1.0,47.5,1.0,1.0,90.0
max,890.0,1.0,3.0,80.0,3.0,4.0,512.3292


In [0]:
from sklearn.linear_model import LinearRegression

X = train_df[['Pclass', 'Age', 'Fare']]
y = train_df.Survived

linear_reg = LinearRegression().fit(X, y)
linear_reg.score(X, y)  # R^2 on the training data

0.08389810726550917

In [0]:
linear_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

array([0.79543117, 0.58610823, 0.67595121, 0.793829  , 0.62090522,
       0.72542107, 0.59848968, 0.58734245, 0.48567063, 0.77627736,
       0.84211887, 0.57052439, 0.7754689 , 0.96621114, 0.70287941,
       0.57673837, 0.72321391, 0.75894755, 0.77968041, 0.50246003,
       0.49858077, 0.7474959 , 0.3542282 , 0.61648435, 0.71300224,
       0.66294608, 0.53175333, 0.77397395, 0.68419387, 0.68395536,
       0.52041202, 0.56814038, 0.79586606, 0.81372012, 0.61068545,
       0.57260627, 0.52525981, 0.58055388, 0.45584728, 0.67976208,
       0.8226707 , 0.84286197, 0.96189157, 0.66724612, 0.68589478,
       0.61846513, 0.63455044, 0.68275686, 0.65738372, 0.45198998,
       0.59988596, 0.63845908, 0.63132487, 0.7888473 , 0.60126246,
       0.79714045, 0.78713803, 0.54643775, 0.42823635, 0.7711724 ,
       0.53552976, 0.55608044, 0.54480459, 0.57031915, 0.65080369,
       0.77958926, 0.6371013 , 0.70993488, 0.71493598, 0.60375943,
       0.54407206, 0.48186138, 0.76576089, 0.75456305, 0.53968

In [0]:
linear_reg.coef_

array([-0.08596295, -0.00829314,  0.00048775])

In [0]:
import numpy as np

test_case = np.array([[1, 5, 500]])  # Rich 5-year old in first class
linear_reg.predict(test_case)

array([1.14845883])
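
A predicted "survival" of about 1.15 is not a valid probability. A minimal sketch of the same failure on a toy dataset (the six points below are invented for illustration and are not Titanic data): a straight line fitted to 0/1 labels leaves the unit interval once you move away from the bulk of the data, while the probabilities from logistic regression cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (assumed for illustration): one feature, binary label
X_toy = np.array([[0], [1], [2], [3], [4], [5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# One point at the edge of the training range, one far beyond it
X_new = np.array([[0], [10]])

linear_pred = LinearRegression().fit(X_toy, y_toy).predict(X_new)
logistic_pred = LogisticRegression().fit(X_toy, y_toy).predict_proba(X_new)[:, 1]

print(linear_pred)    # unbounded: can fall below 0 or above 1
print(logistic_pred)  # probabilities, strictly between 0 and 1
```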

In [0]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)  # Mean accuracy on the training data (not comparable to R^2)



0.7103825136612022

In [0]:
log_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [0]:
log_reg.predict(test_case)[0]

1

In [0]:
help(log_reg.predict)

Help on method predict in module sklearn.linear_model.base:

predict(X) method of sklearn.linear_model.logistic.LogisticRegression instance
    Predict class labels for samples in X.
    
    Parameters
    ----------
    X : array_like or sparse matrix, shape (n_samples, n_features)
        Samples.
    
    Returns
    -------
    C : array, shape [n_samples]
        Predicted class label per sample.



In [0]:
log_reg.predict_proba(test_case)[0]

array([0.02485552, 0.97514448])

In [0]:
# What's the math?
log_reg.coef_

array([[-0.0455017 , -0.02912513,  0.0048037 ]])

In [0]:
log_reg.intercept_

array([1.45878264])

In [0]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [0]:
sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(test_case)))

array([[0.97514448]])

So, clearly a more appropriate model in this situation! For more on the math, [see this Wikipedia example](https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study).
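
Spelled out with the numbers printed in the cells above, the prediction is just the intercept plus a weighted sum of the features, passed through the sigmoid. This is a small worked check using the rounded values as displayed, so the last digits may differ slightly from scikit-learn's own output.

```python
import numpy as np

# Values copied from the cell outputs above (rounded as printed)
coef = np.array([-0.0455017, -0.02912513, 0.0048037])  # Pclass, Age, Fare
intercept = 1.45878264
passenger = np.array([1, 5, 500])  # first class, age 5, fare 500

# Linear part: intercept + sum of coefficient * feature (the log-odds)
log_odds = intercept + coef @ passenger

# The sigmoid maps the log-odds to a probability in (0, 1)
p_survive = 1 / (1 + np.exp(-log_odds))

print(log_odds, p_survive)
```

The large positive contribution comes from the fare term (0.0048037 × 500), which outweighs the small negative terms for class and age.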

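For the live session, let's tackle [another classification dataset on absenteeism](http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work). It has 21 classes, but remember that scikit-learn's `LogisticRegression` automatically handles more than two classes. How? In the one-vs-rest scheme, it fits one binary model per label, treating that label as 1 and every other label as 0. scikit-learn can alternatively fit a single multinomial (softmax) model; which one you get by default depends on the version and solver.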
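
A minimal sketch of the one-model-per-label idea, made explicit with scikit-learn's `OneVsRestClassifier` wrapper. The bundled iris data is used purely as a stand-in (an assumption for illustration; it is not the absenteeism dataset).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_iris, y_iris = load_iris(return_X_y=True)  # 3 classes, stand-in dataset

# One binary logistic model per class: "this class" (1) vs "everything else" (0)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)

print(len(ovr.estimators_))           # one fitted binary model per class
print(ovr.predict_proba(X_iris[:2]))  # one probability column per class
```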

In [0]:
# Live - let's try absenteeism!

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and try to use to classify genre of tracks. To get you started:

In [1]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip

--2019-02-25 22:20:55--  https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
Resolving os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)... 86.119.28.13, 2001:620:5ca1:2ff::ce53
Connecting to os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)|86.119.28.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358412441 (342M) [application/zip]
Saving to: ‘fma_metadata.zip’


2019-02-25 22:21:37 (8.27 MB/s) - ‘fma_metadata.zip’ saved [358412441/358412441]

Archive:  fma_metadata.zip
 bunzipping: fma_metadata/README.txt  
 bunzipping: fma_metadata/checksums  
 bunzipping: fma_metadata/not_found.pickle  
 bunzipping: fma_metadata/raw_genres.csv  
 bunzipping: fma_metadata/raw_albums.csv  
 bunzipping: fma_metadata/raw_artists.csv  
 bunzipping: fma_metadata/raw_tracks.csv  
 bunzipping: fma_metadata/tracks.csv  
 bunzipping: fma_metadata/genres.csv  
 bunzipping: fma_metadata/raw_echonest.csv  
 bunzipping: fma_metadata/echonest.csv  
 bunzipping: fma_metadata/features.

In [0]:
import pandas as pd

In [4]:
tracks = pd.read_csv('fma_metadata/tracks.csv')
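
A note on loading: `tracks.csv` has a multi-row header, so a plain `read_csv` leaves header text in the first data rows (visible in `tracks.head(10)` below). A hedged alternative, assuming the file keeps the layout seen there (two header rows followed by a `track_id` row), is to pass `header=[0, 1]` and `index_col=0`. The tiny inline CSV here is invented to mimic that layout; for the real file you would pass `'fma_metadata/tracks.csv'` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Invented miniature of the tracks.csv layout (assumption for illustration)
csv_text = """,album,album,track,track
,comments,listens,genre_top,listens
track_id,,,,
2,0,6073,Hip-Hop,1293
3,0,6073,Hip-Hop,514
"""

# Two header rows become a column MultiIndex; the first column becomes the index
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1])

print(demo.columns.tolist())
print(demo[("track", "listens")])  # unambiguous: the track-level listens
```

With this layout, album-level and track-level columns that share a name stay distinguishable, which avoids the manual renaming and row-dropping steps.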



In [0]:
pd.set_option('display.max_columns', None)  # Unlimited columns
tracks.head()

In [18]:
tracks.head(10)

Unnamed: 0.1,Unnamed: 0,album,album.1,album.2,album.3,album.4,album.5,album.6,album.7,album.8,...,track.10,track.11,track.12,track.13,track.14,track.15,track.16,track.17,track.18,track.19
0,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,...,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
1,track_id,,,,,,,,,,...,,,,,,,,,,
2,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,...,,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,...,,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
4,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,...,,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
5,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,...,,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
6,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,...,,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level
7,26,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,...,,1060,en,Attribution-NonCommercial-NoDerivatives (aka M...,193,,4,,[],Where is your Love?
8,30,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,...,,718,en,Attribution-NonCommercial-NoDerivatives (aka M...,612,,5,,[],Too Happy
9,46,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,...,,252,en,Attribution-NonCommercial-NoDerivatives (aka M...,171,,8,,[],Yosemite


In [19]:
tracks.loc[0:0].values

array([[nan, 'comments', 'date_created', 'date_released', 'engineer',
        'favorites', 'id', 'information', 'listens', 'producer', 'tags',
        'title', 'tracks', 'type', 'active_year_begin',
        'active_year_end', 'associated_labels', 'bio', 'comments',
        'date_created', 'favorites', 'id', 'latitude', 'location',
        'longitude', 'members', 'name', 'related_projects', 'tags',
        'website', 'wikipedia_page', 'split', 'subset', 'bit_rate',
        'comments', 'composer', 'date_created', 'date_recorded',
        'duration', 'favorites', 'genre_top', 'genres', 'genres_all',
        'information', 'interest', 'language_code', 'license', 'listens',
        'lyricist', 'number', 'publisher', 'tags', 'title']], dtype=object)

In [0]:
tracks.columns = ['id', 'comments', 'date_created', 'date_released', 'engineer',
        'favorites', 'id', 'information', 'listens', 'producer', 'tags',
        'title', 'tracks', 'type', 'active_year_begin',
        'active_year_end', 'associated_labels', 'bio', 'comments',
        'date_created', 'favorites', 'id', 'latitude', 'location',
        'longitude', 'members', 'name', 'related_projects', 'tags',
        'website', 'wikipedia_page', 'split', 'subset', 'bit_rate',
        'comments', 'composer', 'date_created', 'date_recorded',
        'duration', 'favorites', 'genre_top', 'genres', 'genres_all',
        'information', 'interest', 'language_code', 'license', 'listens',
        'lyricist', 'number', 'publisher', 'tags', 'title']
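
Note that the list above repeats several names (`id`, `comments`, `favorites`, `listens`, and others), because the album and track groups (among others) share sub-column names. A minimal sketch of one way to keep them distinct, assuming the group name is taken from pandas' auto-generated headers (`album`, `album.1`, ..., `track.19`) before they are overwritten. The frame and the `unique_names` helper below are hypothetical stand-ins, not part of the dataset's tooling.

```python
import pandas as pd

# Invented miniature of the frame as read without header handling
raw = pd.DataFrame(
    [[None, "comments", "listens", "comments", "listens"],
     ["track_id", None, None, None, None],
     ["2", "0", "6073", "0", "1293"]],
    columns=["Unnamed: 0", "album", "album.1", "track", "track.1"],
)

def unique_names(frame):
    """Combine the group prefix (album, track, ...) with the first-row name."""
    names = []
    for top, sub in zip(frame.columns, frame.iloc[0]):
        group = top.split(".")[0]  # "album.1" -> "album"
        names.append("track_id" if pd.isna(sub) else f"{group}_{sub}")
    return names

# Drop the two leftover header rows, then apply the prefixed names
clean = raw.drop([0, 1]).reset_index(drop=True)
clean.columns = unique_names(raw)

print(list(clean.columns))
```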

In [0]:
tracks = tracks.drop([0,1])


In [59]:
tracks.head(10)

Unnamed: 0,id,comments,date_created,date_released,engineer,favorites,id.1,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.2,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
2,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
4,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
5,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
6,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level
7,26,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:05,2008-01-01 00:00:00,181,0,,"[76, 103]","[17, 10, 76, 103]",,1060,en,Attribution-NonCommercial-NoDerivatives (aka M...,193,,4,,[],Where is your Love?
8,30,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:11,2008-01-01 00:00:00,174,0,,"[76, 103]","[17, 10, 76, 103]",,718,en,Attribution-NonCommercial-NoDerivatives (aka M...,612,,5,,[],Too Happy
9,46,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:53,2008-01-01 00:00:00,104,0,,"[76, 103]","[17, 10, 76, 103]",,252,en,Attribution-NonCommercial-NoDerivatives (aka M...,171,,8,,[],Yosemite
10,48,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:56,2008-01-01 00:00:00,205,0,,"[76, 103]","[17, 10, 76, 103]",,247,en,Attribution-NonCommercial-NoDerivatives (aka M...,173,,9,,[],Light of Light
11,134,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:43:19,2008-11-26 00:00:00,207,3,Hip-Hop,[21],[21],,1126,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,943,,5,,[],Street Music


In [0]:
tracks

In [61]:
tracks[['comments', 'favorites', 'listens', 'tracks']]

Unnamed: 0,comments,comments.1,comments.2,favorites,favorites.1,favorites.2,listens,listens.1,tracks
2,0,0,0,4,9,2,6073,1293,7
3,0,0,0,4,9,1,6073,514,7
4,0,0,0,4,9,6,6073,1151,7
5,0,3,0,4,74,178,47632,50135,2
6,0,2,0,2,10,0,2710,361,13
7,0,2,0,2,10,0,2710,193,13
8,0,2,0,2,10,0,2710,612,13
9,0,2,0,2,10,0,2710,171,13
10,0,2,0,2,10,0,2710,173,13
11,0,0,0,4,9,3,6073,943,7


In [58]:
tracks.dtypes

id                   object
comments             object
date_created         object
date_released        object
engineer             object
favorites            object
id                   object
information          object
listens              object
producer             object
tags                 object
title                object
tracks               object
type                 object
active_year_begin    object
active_year_end      object
associated_labels    object
bio                  object
comments             object
date_created         object
favorites            object
id                   object
latitude             object
location             object
longitude            object
members              object
name                 object
related_projects     object
tags                 object
website              object
wikipedia_page       object
split                object
subset               object
bit_rate             object
comments             object
composer            
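
Every column shows up as `object` because the leftover header rows were read as data, so the numbers are stored as strings. One hedged way to convert the columns you intend to model is `pd.to_numeric`; the frame and the column choice below are invented for illustration.

```python
import pandas as pd

# Stand-in frame (invented values) with numbers stored as strings
demo = pd.DataFrame({
    "listens": ["1293", "514", "1151"],
    "duration": ["168", "237", "not available"],
    "genre_top": ["Hip-Hop", "Hip-Hop", "Pop"],
})

numeric_cols = ["listens", "duration"]  # hypothetical choice of columns
# errors="coerce" turns anything unparseable into NaN instead of raising
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
```

Coerced values become missing, so it is worth checking `isnull().sum()` again after converting.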

In [40]:
tracks.isnull().sum()

id                        0
comments                  0
date_created           3529
date_released         36280
engineer              91279
favorites                 0
id                        0
information           23425
listens                   0
producer              88514
tags                      0
title                  1025
tracks                    0
type                   6508
active_year_begin     83863
active_year_end      101199
associated_labels     92303
bio                   35418
comments                  0
date_created            856
favorites                 0
id                        0
latitude              62030
location              36364
longitude             62030
members               59725
name                      0
related_projects      93422
tags                      0
website               27318
wikipedia_page       100993
split                     0
subset                    0
bit_rate                  0
comments                  0
composer            

In [41]:
tracks.columns

Index(['id', 'comments', 'date_created', 'date_released', 'engineer',
       'favorites', 'id', 'information', 'listens', 'producer', 'tags',
       'title', 'tracks', 'type', 'active_year_begin', 'active_year_end',
       'associated_labels', 'bio', 'comments', 'date_created', 'favorites',
       'id', 'latitude', 'location', 'longitude', 'members', 'name',
       'related_projects', 'tags', 'website', 'wikipedia_page', 'split',
       'subset', 'bit_rate', 'comments', 'composer', 'date_created',
       'date_recorded', 'duration', 'favorites', 'genre_top', 'genres',
       'genres_all', 'information', 'interest', 'language_code', 'license',
       'listens', 'lyricist', 'number', 'publisher', 'tags', 'title'],
      dtype='object')

In [0]:
df = tracks[['comments', 'favorites', 'listens', 'tracks', 'genre_top']].dropna()

In [70]:
df.shape

(49598, 10)

In [0]:
# ATTEMPT 1
X = df[['comments', 'favorites', 'listens', 'tracks']]

In [73]:
X.shape

(49598, 9)
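
Four names were requested but `X` has nine columns. That follows from the duplicated column labels: selecting a label that occurs more than once returns every column carrying it. A small demonstration on an invented frame:

```python
import pandas as pd

# Invented frame with a repeated column label, as in `tracks`
demo = pd.DataFrame([[0, 3, 6073], [1, 2, 1681]],
                    columns=["comments", "comments", "listens"])

selected = demo[["comments", "listens"]]
print(selected.shape)  # two labels requested, three columns returned

# Listing the repeated labels is a quick way to spot the problem
print(demo.columns[demo.columns.duplicated()].tolist())
```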

In [0]:
y = df['genre_top']

In [0]:
from sklearn.linear_model import LogisticRegression  # the .logistic submodule path is private and removed in newer releases
from sklearn.model_selection import train_test_split

In [76]:
model = LogisticRegression(random_state=42, solver='lbfgs', multi_class='multinomial', max_iter=1000)
model.fit(X,y)
model.score(X,y)



0.2861002459776604

In [83]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=.2, random_state=42)

model2 = LogisticRegression(random_state=42, multi_class='multinomial', solver='lbfgs', max_iter=1500)
model2.fit(X_train,Y_train)
model2.score(X_test,Y_test)



0.2810483870967742

In [0]:
# ATTEMPT 2
df = tracks[['comments', 'favorites', 'listens', 'tracks', 'genre_top', 'bit_rate', 'number', 'interest', 'duration']].dropna()

In [0]:
X = df[['comments', 'favorites', 'listens', 'tracks', 'bit_rate', 'number', 'interest', 'duration']]
y = df['genre_top']

In [90]:
df.head(20)

Unnamed: 0,comments,comments.1,comments.2,favorites,favorites.1,favorites.2,listens,listens.1,tracks,genre_top,bit_rate,number,interest,duration
2,0,0,0,4,9,2,6073,1293,7,Hip-Hop,256000,3,4656,168
3,0,0,0,4,9,1,6073,514,7,Hip-Hop,256000,4,1470,237
4,0,0,0,4,9,6,6073,1151,7,Hip-Hop,256000,6,1933,206
5,0,3,0,4,74,178,47632,50135,2,Pop,192000,1,54881,161
11,0,0,0,4,9,3,6073,943,7,Hip-Hop,256000,5,1126,207
12,1,1,1,0,0,0,3331,1832,4,Rock,256000,0,2484,837
13,1,1,1,0,0,0,3331,1498,4,Rock,256000,0,1948,509
14,1,0,0,2,5,2,1681,1278,2,Experimental,256000,1,2559,1233
15,1,0,0,2,5,2,1681,489,2,Experimental,256000,2,1909,1231
16,0,0,0,1,11,3,1304,582,2,Folk,128000,2,702,296


In [86]:
model = LogisticRegression(random_state=42, solver='lbfgs', multi_class='multinomial', max_iter=1000)
model.fit(X,y)
model.score(X,y)



0.3774547360780677

In [89]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=.5, random_state=42)

model2 = LogisticRegression(random_state=42, multi_class='multinomial', solver='lbfgs', max_iter=2000)
model2.fit(X_train,Y_train)
model2.score(X_test,Y_test)



0.3793701358925763

In [0]:
# ATTEMPT 3
df = tracks[['listens', 'tracks', 'bit_rate', 'number', 'interest', 'duration', 'genre_top']].dropna()


In [0]:
X = df[['listens', 'tracks', 'bit_rate', 'number', 'interest', 'duration']]
y = df['genre_top']

In [96]:
model = LogisticRegression(random_state=42, solver='lbfgs', multi_class='multinomial', max_iter=2500)
model.fit(X,y)
model.score(X,y)



0.3793298116859551
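
An accuracy near 0.38 is hard to judge without a reference point. One common check, sketched here on synthetic data (the real `X` and `y` are not reproduced, and the class weights are an assumption), is to compare against a classifier that always predicts the most frequent class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the genre data: 4 imbalanced classes (assumption)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, weights=[0.4, 0.3, 0.2, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)  # share of the largest class
logit_acc = logit.score(X_te, y_te)
print(baseline_acc, logit_acc)
```

On the real data, the baseline is roughly the share of the most common genre (`y.value_counts(normalize=True).max()`), and the model is only informative to the extent that it beats that number.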

This is the biggest data you've played with so far, and while it does generally fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?
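
For the "best predictors" question, one hedged approach is to standardize the features so that coefficient magnitudes are comparable, then rank features by their average absolute coefficient across classes. This is a sketch on invented data with hypothetical feature names, not a result from the FMA dataset, and coefficient size is only a rough guide to usefulness.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
# Invented data: "duration" carries signal, "listens" is pure noise
y_demo = rng.integers(0, 3, size=n)
X_demo = pd.DataFrame({
    "duration": 200 + 60 * y_demo + rng.normal(0, 30, size=n),
    "listens": rng.normal(5000, 2000, size=n),
})

# Scaling first puts all features on the same footing
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# coef_ has one row per class; average the magnitudes per feature
coefs = pipe.named_steps["logisticregression"].coef_
importance = pd.Series(np.abs(coefs).mean(axis=0), index=X_demo.columns)
print(importance.sort_values(ascending=False))
```

Scaling also tends to help the `lbfgs` solver converge in fewer iterations when features span very different ranges, as `bit_rate` and `number` do here.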

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers

If you find that a model fitted to classify *all* genres isn't very good, it's totally OK to limit it to the most frequent genres, or perhaps to try combining or clustering genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
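
A minimal sketch of restricting the data to the most frequent genres, using an invented frame and a hypothetical cutoff `top_n`:

```python
import pandas as pd

# Invented stand-in for the cleaned frame
demo = pd.DataFrame({
    "duration": [168, 237, 206, 161, 837, 509, 1233, 296],
    "genre_top": ["Hip-Hop", "Hip-Hop", "Hip-Hop", "Pop",
                  "Rock", "Rock", "Experimental", "Folk"],
})

top_n = 2  # hypothetical cutoff
# Count tracks per genre and keep the labels of the top_n largest groups
top_genres = demo["genre_top"].value_counts().nlargest(top_n).index
subset = demo[demo["genre_top"].isin(top_genres)]

print(sorted(top_genres))
print(len(subset))
```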

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

The best predictors of genre seem to be the numerical columns.
The `favorites` and `comments` names were each shared by three columns, and these didn't seem very helpful.
I was surprised that the scores were as high as they were.