# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since its outputs, like probabilities, lie in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

### Get data, option 1: Kaggle API

#### Sign up for Kaggle and get an API token
1. [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. 
2. [Follow these instructions](https://github.com/Kaggle/kaggle-api#api-credentials) to create a Kaggle “API Token” and download your `kaggle.json` file.

_This will enable you to download data directly from Kaggle. If you run into problems, don’t worry — I’ll give you an easy alternative way to download today’s data, so you can still follow along with the lecture hands-on. And then we’ll help you through the Kaggle process after the lecture._

#### Put `kaggle.json` in the correct location

- ***If you're using Anaconda,*** put the file in the directory specified in the [instructions](https://github.com/Kaggle/kaggle-api#api-credentials).

- ***If you're using Google Colab,*** upload the file to your Google Drive, and run this cell:

In [0]:
from google.colab import drive
drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

#### Install the Kaggle API package and use it to get the data

You also have to join the Titanic competition to have access to the data.

In [0]:
!pip install kaggle

In [0]:
!kaggle competitions download -c titanic

### Get data, option 2: Download from the competition page
1. [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. 
2. [Go to the Titanic competition page](https://www.kaggle.com/c/titanic) to download the [data](https://www.kaggle.com/c/titanic/data).

### Get data, option 3: Use Seaborn

```
import seaborn as sns
train = sns.load_dataset('titanic')
```

But Seaborn's version of the Titanic dataset is not identical to Kaggle's version, as we'll see during this lesson!

### Read data

In [1]:
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.shape, test.shape

((891, 12), (418, 11))

### Data Exploration

In [2]:
train.sample(n=5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0 | D56 | S |
| 873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0 | NaN | S |
| 783 | 784 | 0 | 3 | Johnston, Mr. Andrew G | male | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 450 | 451 | 0 | 2 | West, Mr. Edwy Arthur | male | 36.0 | 1 | 2 | C.A. 34651 | 27.75 | NaN | S |
| 641 | 642 | 1 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3 | B35 | C |


In [3]:
test.sample(n=5)

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 953 | 2 | McCrae, Mr. Arthur Gordon | male | 32.0 | 0 | 0 | 237216 | 13.5 | NaN | S |
| 70 | 962 | 3 | Mulvihill, Miss. Bertha E | female | 24.0 | 0 | 0 | 382653 | 7.75 | NaN | Q |
| 85 | 977 | 3 | Khalil, Mr. Betros | male | NaN | 1 | 0 | 2660 | 14.4542 | NaN | C |
| 246 | 1138 | 2 | Karnes, Mrs. J Frank (Claire Bennett) | female | 22.0 | 0 | 0 | F.C.C. 13534 | 21.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |


In [4]:
# Normalize value_counts to get proportions instead of raw counts
target = 'Survived'
train[target].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

In [5]:
train.describe(include='number')

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [6]:
train.describe(exclude='number')

| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Nasser, Mrs. Nicholas (Adele Achem) | male | CA. 2343 | B96 B98 | S |
| freq | 1 | 577 | 7 | 4 | 644 |


### How would we try to do this with linear regression?

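Some values are missing (for example, `Age` is present in only 714 of the 891 training rows), and scikit-learn's `LinearRegression` does not accept missing values, so we impute them first. See the scikit-learn imputation docs: https://scikit-learn.org/stable/modules/impute.html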

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

# Separate target from features
features = ['Pclass', 'Age', 'Fare']
target = 'Survived'
X_train = train[features]
y_train = train[target]
X_test = test[features]

# Impute missing data
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Run linear regression
lin_reg = LinearRegression()
lin_reg.fit(X_train_imputed, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [9]:
X_train.shape, X_train_imputed.shape, X_test.shape, X_test_imputed.shape

((891, 3), (891, 3), (418, 3), (418, 3))

In [17]:
X_test.tail()

| | Pclass | Age | Fare |
|---|---|---|---|
| 413 | 3 | NaN | 8.05 |
| 414 | 1 | 39.0 | 108.9 |
| 415 | 3 | 38.5 | 7.25 |
| 416 | 3 | NaN | 8.05 |
| 417 | 3 | NaN | 22.3583 |


In [18]:
X_test_imputed[-5:]

array([[  3.        ,  29.69911765,   8.05      ],
       [  1.        ,  39.        , 108.9       ],
       [  3.        ,  38.5       ,   7.25      ],
       [  3.        ,  29.69911765,   8.05      ],
       [  3.        ,  29.69911765,  22.3583    ]])
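Each missing `Age` above was replaced by 29.699, the mean age of the training set. A minimal sketch of that behavior on toy arrays (the values and the `toy_*` names are invented for illustration), assuming `SimpleImputer`'s default `strategy='mean'`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): one feature column with a missing entry
toy_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
toy_test = np.array([[np.nan], [50.0]])

toy_imputer = SimpleImputer()  # default strategy='mean'
toy_train_imputed = toy_imputer.fit_transform(toy_train)
toy_test_imputed = toy_imputer.transform(toy_test)

# The fill value is learned from the training data only: (20 + 30 + 40) / 3 = 30
print(toy_imputer.statistics_)
print(toy_test_imputed)
```

Fitting on the training data and only transforming the test data keeps information from the test set out of the fill value.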

In [10]:
import numpy as np

test_case = np.array([[1, 5, 500]]) # Rich 5-year-old in first class (Pclass=1, Age=5, Fare=500)
lin_reg.predict(test_case)

array([1.19207871])

In [11]:
y_pred = lin_reg.predict(X_test_imputed)
pd.Series(y_pred).describe()

count    418.000000
mean       0.392117
std        0.181876
min        0.011755
25%        0.227341
50%        0.339570
75%        0.516439
max        0.954827
dtype: float64

In [12]:
pd.Series(lin_reg.coef_, X_train.columns)

Pclass   -0.210390
Age      -0.007358
Fare      0.000751
dtype: float64

In [13]:
lin_reg.intercept_

1.0638995000035445
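This is where linear regression goes wrong for classification: nothing keeps its output between 0 and 1. As a rough check, the prediction for the rich 5-year-old can be reproduced by hand from the coefficients and intercept printed above (copied here as rounded constants, so the result matches only to a few decimals):

```python
import numpy as np

# Coefficients and intercept copied (rounded) from the outputs above
coefs = np.array([-0.210390, -0.007358, 0.000751])  # Pclass, Age, Fare
intercept = 1.0638995

# Same hypothetical passenger as test_case: first class, age 5, fare 500
passenger = np.array([1, 5, 500])

linear_prediction = intercept + coefs @ passenger
print(linear_prediction)      # about 1.19, which is not a valid probability
print(linear_prediction > 1)  # True
```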

### How would we do this with Logistic Regression?

In [20]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)

print('Prediction for rich 5 year old:', log_reg.predict(test_case))
print('Predicted probability for rich 5 year old:', log_reg.predict_proba(test_case))

Prediction for rich 5 year old: [1]
Predicted probability for rich 5 year old: [[0.02778799 0.97221201]]


In [21]:
threshold = 0.5
(log_reg.predict_proba(X_test_imputed)[:, 1] > threshold).astype(int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,

In [22]:
manual_predictions = (log_reg.predict_proba(X_test_imputed)[:,1] > threshold).astype(int)
direct_predictions = log_reg.predict(X_test_imputed)

all(manual_predictions == direct_predictions)

True

### How accurate is the Logistic Regression?

In [30]:
score = log_reg.score(X_train_imputed, y_train)
print('Train Accuracy Score:', score)

Train Accuracy Score: 0.7025813692480359


In [23]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(log_reg, X_train_imputed, y_train, cv=10)
print('Cross Validation Accuracy Scores:', scores)

Cross Validation Accuracy Scores: [0.63333333 0.62222222 0.68539326 0.71910112 0.69662921 0.69662921
 0.76404494 0.75280899 0.73033708 0.71590909]


In [24]:
scores = pd.Series(scores)
scores.min(), scores.mean(), scores.max()

(0.6222222222222222, 0.7016408466689366, 0.7640449438202247)
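To judge whether roughly 70% accuracy is good, it may help to compare against a majority-class baseline. A rough sketch, assuming the class counts implied by the `value_counts` output earlier (549 of 891 passengers did not survive); `X_toy` and `y_toy` are stand-ins, not the real training data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Stand-in labels with the class balance implied above: 549 zeros, 342 ones
y_toy = np.array([0] * 549 + [1] * 342)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)  # about 0.616
```

Against that baseline, a cross-validated mean of about 0.70 is a modest improvement.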

### What's the math for the Logistic Regression?

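Logistic regression computes the same kind of linear combination of features as linear regression, then passes it through the logistic (sigmoid) function, which squashes any real number into the interval (0, 1):

https://en.wikipedia.org/wiki/Logistic_function

Wikipedia also works through a small example by hand:

https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study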

In [31]:
log_reg.coef_

array([[-0.9345267 , -0.03569729,  0.00422069]])

In [33]:
log_reg.intercept_

array([2.55763985])

In [25]:
# The logistic sigmoid "squishing" function,
# implemented to work with numpy arrays
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [26]:
sigmoid(np.dot(log_reg.coef_, test_case.T) + log_reg.intercept_)

array([[0.97221201]])

In [27]:
log_reg.predict_proba(test_case)

array([[0.02778799, 0.97221201]])
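The same number can be written out by hand from the fitted values printed above, rounded to five decimals, with the features ordered Pclass, Age, Fare. The symbols $z$ and $\beta_i$ are notation introduced here for illustration, not names from the notebook:

$$
\begin{aligned}
z &= \beta_0 + \beta_1\,\text{Pclass} + \beta_2\,\text{Age} + \beta_3\,\text{Fare} \\
  &\approx 2.55764 - 0.93453(1) - 0.03570(5) + 0.00422(500) \approx 3.555 \\
P(\text{Survived} = 1) &= \frac{1}{1 + e^{-z}} \approx \frac{1}{1 + 0.0286} \approx 0.972
\end{aligned}
$$

The other column of `predict_proba` is simply the complement, $1 - 0.972 \approx 0.028$.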

## Feature Engineering

Get the [Category Encoders](http://contrib.scikit-learn.org/categorical-encoding/) library.

If you're running on Google Colab:

```
!pip install category_encoders
```

If you're running locally with Anaconda:

```
!conda install -c conda-forge category_encoders
```

## Assignment: real-world classification

We're going to check out a larger dataset: the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and use to classify tracks by genre. To get you started:

### Get and unzip the data

#### Google Colab

In [0]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip

#### Windows
- Download the [zip file](https://os.unil.cloud.switch.ch/fma/fma_metadata.zip)
- You may need to use [7zip](https://www.7-zip.org/download.html) to unzip it


#### Mac
- Download the [zip file](https://os.unil.cloud.switch.ch/fma/fma_metadata.zip)
- You may need to use [p7zip](https://superuser.com/a/626731) to unzip it

### Look at first 3 lines of raw file

In [49]:
!head -n 3 fma_metadata/tracks.csv

,album,album,album,album,album,album,album,album,album,album,album,album,album,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,set,set,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track
,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
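The raw file has two header rows (a group such as `album` or `track`, then a field name) followed by a row that holds only the index name. One way to get flat names such as `album_comments` might be to read both header rows and join them. The sketch below uses a tiny in-memory stand-in with invented data rows rather than the real file, and `toy_tracks` is a hypothetical name:

```python
import io
import pandas as pd

# Tiny stand-in for tracks.csv with the same three-line header layout
# (the data rows are invented for illustration)
raw = io.StringIO(
    ",album,album,track,track\n"
    ",comments,title,genre_top,title\n"
    "track_id,,,,\n"
    "2,0,First Album,Hip-Hop,Song A\n"
    "3,1,First Album,Rock,Song B\n"
)

# header=[0, 1] reads the first two rows as a two-level column index
toy_tracks = pd.read_csv(raw, index_col=0, header=[0, 1])

# Join the two levels into flat names such as 'album_comments'
toy_tracks.columns = ['_'.join(col) for col in toy_tracks.columns]
print(toy_tracks.columns.tolist())
```

Editing the header in a spreadsheet and saving a cleaned copy is another workable route.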


### Read with pandas
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [332]:
tracks = pd.read_csv('fma_metadata/tracks.csv', index_col=0)
df = tracks.sample(n=5000, random_state=42)

In [333]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
print(df.shape)
df.head(5)

(5000, 52)


Unnamed: 0,album_comments,album_date_created,album_date_released,album_engineer,album_favorites,album_id,album_information,album_listens,album_producer,album_tags,album_title,album_tracks,album_type,artist_active_year_begin,artist_active_year_end,artist_associated_labels,artist_bio,artist_comments,artist_date_created,artist_favorites,artist_id,artist_latitude,artist_location,artist_longitude,artist_members,artist_name,artist_related_projects,artist_tags,artist_website,artist_wikipedia_page,set_split,set_subset,track_bit_rate,track_comments,track_composer,track_date_created,track_date_recorded,track_duration,track_favorites,track_genre_top,track_genres,track_genres_all,track_information,track_interest,track_language_code,track_license,track_listens,track_lyricist,track_number,track_publisher,track_tags,track_title
43742,-1,,,,-1,8578,,-1,,[],Second Spring,-1,,,,,,4,8/12/2010 1:22,82,8390,39.263517,"Lutherville, MD; Woodbine, MD",-76.623942,,The New Mystikal Troubadours,Bad Liquor Pond\nDr. Tuborg,"['the new mystikal troubadours', 'the agrarians']",http://thenewmystikaltroubadours.bandcamp.com/,,validation,large,160000,0,,2/17/2011 21:33,,325,2,,"[3, 33, 38, 250]","[33, 3, 38, 17, 250]",,2629,,Attribution-Noncommercial-Share Alike 3.0 Unit...,850,,8,,[],A Parody
31941,0,6/23/2010 12:17,2/12/2009 0:00,,1,6655,,19141,,[],La Fantaisie des Biches,14,Album,,,,"<p><span class=""long_text""><span style=""backgr...",1,6/23/2010 12:18,15,7888,46.211401,France,2.20936,,Misiaczek,,"['misiaczek', 'super moyen', 'misiaczek mczk']",www.misiaczek.net,,test,large,320000,0,,6/23/2010 12:15,,168,2,,"[66, 77]","[66, 2, 12, 77]",,2105,,Attribution 2.0 France,1320,,9,,[],safari truites
50985,3,7/15/2011 14:40,,,10,9619,<p>Great Jazz selections from Kevin MacLeod's ...,458874,,[],Jazz Sampler,20,Album,1/1/1998 0:00,1/1/2014 0:00,,"<p>For information about usage, please see my ...",32,7/15/2011 10:50,431,11431,40.692455,"New York, NY",-73.990364,Kevin Macleod,Kevin MacLeod,freepd.com,['kevin macleod'],http://incompetech.com,http://en.wikipedia.org/wiki/Kevin_MacLeod_(mu...,training,large,267146,0,,7/15/2011 10:19,,174,26,,"[4, 13]","[4, 13]",,31224,,Creative Commons Attribution,19783,,0,,[],Night on the Docks - Sax
146981,0,11/28/2016 12:37,11/18/2016 0:00,,0,21917,,19158,,[],не знаю,11,Album,,,,"<p><span style=""color: #333333; font-family: G...",4,3/31/2013 2:17,110,15891,58.007359,"Perm, Russia",56.228149,Konstantin Trokay,Kosta T,,"['skripka', 'improv', 'ambient', 'reverb', 'ac...",https://soundcloud.com/konstantin-trokay,,training,large,320000,0,konstantin trokai,11/28/2016 12:37,,308,0,,"[1, 18, 38, 94, 107, 125, 250, 514, 659, 1235]","[1, 514, 5, 38, 107, 17, 18, 1235, 659, 250, 1...",,3491,,Attribution,2908,,10,,[],печально всё как-то ...
20461,0,11/4/2009 2:19,,,1,4611,"<p>ZONA MC, from Italy is the lone Artist to c...",3139,,[],QUELLO ROTTO,5,Album,,,,"<p>ZONA MC, from Italy is the lone Artist to c...",1,11/4/2009 2:24,6,5443,41.87194,Italy,12.56738,,ZONA MC,,['zona mc'],http://www.myspace.com/lamjula,,training,medium,171259,0,,11/4/2009 2:20,,86,0,Hip-Hop,[21],[21],,722,en,Attribution-Noncommercial-Share Alike 2.5 Italy,385,,6,,[],Non mi e venuta una buona idea per il titolo


In [334]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 43742 to 76049
Data columns (total 52 columns):
album_comments              5000 non-null int64
album_date_created          4844 non-null object
album_date_released         3290 non-null object
album_engineer              689 non-null object
album_favorites             5000 non-null int64
album_id                    5000 non-null int64
album_information           3888 non-null object
album_listens               5000 non-null int64
album_producer              811 non-null object
album_tags                  5000 non-null object
album_title                 4953 non-null object
album_tracks                5000 non-null int64
album_type                  4710 non-null object
artist_active_year_begin    1051 non-null object
artist_active_year_end      232 non-null object
artist_associated_labels    704 non-null object
artist_bio                  3368 non-null object
artist_comments             5000 non-null int64
artist_date_crea

In [335]:
df.describe(include='number')

Unnamed: 0,album_comments,album_favorites,album_id,album_listens,album_tracks,artist_comments,artist_favorites,artist_id,artist_latitude,artist_longitude,track_bit_rate,track_comments,track_duration,track_favorites,track_interest,track_listens,track_number
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,2115.0,2115.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.4242,1.3262,12783.0296,34985.57,19.9778,2.0994,32.0074,11938.5344,39.237138,-36.995743,264248.616,0.0336,278.5166,3.0638,3690.291,2394.2812,8.237
std,2.639403,3.455052,6325.693078,175851.4,40.000046,6.820852,105.984695,6900.847278,19.606598,66.444585,67765.834004,0.306741,297.209732,12.507414,22262.34,9151.499846,14.672569
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-45.87876,-156.331925,-1.0,0.0,1.0,0.0,4.0,2.0,0.0
25%,0.0,0.0,7729.75,3343.25,7.0,0.0,1.0,6323.75,39.271398,-79.385324,196190.25,0.0,151.0,0.0,576.75,270.0,2.0
50%,0.0,0.0,13320.0,8931.0,11.0,0.0,5.0,12094.0,41.515476,-73.326524,319720.0,0.0,218.0,1.0,1266.0,725.0,5.0
75%,0.0,1.0,18193.5,23919.0,18.0,1.0,17.0,17784.0,48.872076,4.47677,320000.0,0.0,304.0,3.0,3075.25,1983.25,9.0
max,53.0,61.0,22917.0,3564243.0,652.0,79.0,963.0,24349.0,65.0103,174.885971,333035.0,11.0,3710.0,599.0,1314156.0,429168.0,255.0


In [336]:
df.describe(exclude='number')

Unnamed: 0,album_date_created,album_date_released,album_engineer,album_information,album_producer,album_tags,album_title,album_type,artist_active_year_begin,artist_active_year_end,artist_associated_labels,artist_bio,artist_date_created,artist_location,artist_members,artist_name,artist_related_projects,artist_tags,artist_website,artist_wikipedia_page,set_split,set_subset,track_composer,track_date_created,track_date_recorded,track_genre_top,track_genres,track_genres_all,track_information,track_language_code,track_license,track_lyricist,track_publisher,track_tags,track_title
count,4844,3290,689,3888,811,5000,4953,4710,1051,232,704,3368,4968,3314,2202,5000,624,5000,3693,255,5000,5000,177,5000,312,2309,5000,5000,122,710,4996,19,54,5000,5000
unique,3621,1725,242,2941,273,742,3730,4,52,35,281,1883,2918,985,1286,3104,278,3013,2038,136,3,3,87,4153,176,16,1708,1519,109,13,66,14,23,755,4917
top,12/4/2008 9:27,1/1/2009 0:00,Ernie Indradat,<p>Now you will be able to hear this unique an...,Terre T,[],The Conet Project,Album,1/1/2007 0:00,1/1/2014 0:00,"Care in the Community Recordings, Gagarin Reco...","<p><span style=""color: #333333; font-family: G...",12/4/2008 9:23,"Brooklyn, NY",Konstantin Trokay,Kosta T,"Ratatat, Lullatone, Nightmares On Wax, Air, Mo...",[],https://soundcloud.com/konstantin-trokay,http://translate.google.com/translate?hl=en&am...,training,large,konstantin trokai,4/5/2013 20:29,11/26/2008 0:00,Rock,[21],[21],"<p><span style=""font-family: Verdana,Geneva,Ar...",en,Attribution-Noncommercial-Share Alike 3.0 Unit...,Wayne Myers,Victrola Dog (ASCAP),[],Untitled
freq,32,25,47,13,40,3898,13,4113,93,26,31,38,40,107,38,38,30,133,38,11,3961,3853,25,6,46,665,147,147,3,674,899,5,19,3879,14


In [337]:
df.columns

Index(['album_comments', 'album_date_created', 'album_date_released',
       'album_engineer', 'album_favorites', 'album_id', 'album_information',
       'album_listens', 'album_producer', 'album_tags', 'album_title',
       'album_tracks', 'album_type', 'artist_active_year_begin',
       'artist_active_year_end', 'artist_associated_labels', 'artist_bio',
       'artist_comments', 'artist_date_created', 'artist_favorites',
       'artist_id', 'artist_latitude', 'artist_location', 'artist_longitude',
       'artist_members', 'artist_name', 'artist_related_projects',
       'artist_tags', 'artist_website', 'artist_wikipedia_page', 'set_split',
       'set_subset', 'track_bit_rate', 'track_comments', 'track_composer',
       'track_date_created', 'track_date_recorded', 'track_duration',
       'track_favorites', 'track_genre_top', 'track_genres',
       'track_genres_all', 'track_information', 'track_interest',
       'track_language_code', 'track_license', 'track_listens',
       'track_

### Data Cleaning

In [338]:
# Drop rows with NaN values in target column
df = df.dropna(subset=['track_genre_top'])
print(df['track_genre_top'].isnull().sum())

0


In [339]:
df.shape

(2309, 52)

In [340]:
# Separate the target from the features
target = 'track_genre_top'
features = ['album_comments', 'album_favorites', 'album_id', 'album_listens', 'album_tracks', 'artist_comments', 
            'artist_favorites', 'artist_id', 'artist_latitude', 'artist_longitude', 'track_bit_rate', 'track_comments', 
            'track_duration', 'track_favorites', 'track_interest', 'track_listens', 'track_number']

X = df[features]
y = df[target]

X.shape, y.shape

((2309, 17), (2309,))

In [341]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1847, 17), (1847,), (462, 17), (462,))

In [342]:
# Impute missing values
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

### Fit Logistic Regression!

In [343]:
# Fit the model
model = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)
model.fit(X_train_imputed, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [344]:
# Accuracy scores
train_score = model.score(X_train_imputed, y_train)
print('Train Accuracy Score:', train_score)

test_score = model.score(X_test_imputed, y_test)
print('Test Accuracy Score:', test_score)

Train Accuracy Score: 0.3638332430969139
Test Accuracy Score: 0.329004329004329


### Improve Model

In [345]:
df.select_dtypes(include='number').columns

Index(['album_comments', 'album_favorites', 'album_id', 'album_listens',
       'album_tracks', 'artist_comments', 'artist_favorites', 'artist_id',
       'artist_latitude', 'artist_longitude', 'track_bit_rate',
       'track_comments', 'track_duration', 'track_favorites', 'track_interest',
       'track_listens', 'track_number'],
      dtype='object')

In [404]:
# Separate the target and features
new_target = 'track_genre_top'
new_features = ['album_comments', 'album_favorites', 'album_listens', 'album_tracks',
                'artist_latitude', 'artist_longitude', 'track_comments', 'track_duration', 
                'track_favorites', 'track_number']

X = df[new_features]
y = df[new_target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [405]:
# Impute missing values
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

In [406]:
# Fit the model
model2 = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)
model2.fit(X_train_imputed, y_train)

# Accuracy scores
train_score = model2.score(X_train_imputed, y_train)
print('Train Accuracy Score:', train_score)

test_score = model2.score(X_test_imputed, y_test)
print('Test Accuracy Score:', test_score)

Train Accuracy Score: 0.376285868976719
Test Accuracy Score: 0.36363636363636365
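One further idea, not something tried in this notebook: the numeric features sit on very different scales (`album_listens` reaches the millions while `track_comments` is mostly zero), and standardizing them before fitting sometimes helps a regularized logistic regression. A minimal sketch on synthetic data, where `X_syn`, `y_syn`, and the other names are placeholders rather than the FMA columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the FMA features (not the real data)
X_syn, y_syn = make_classification(n_samples=500, n_features=6, n_informative=4,
                                   n_classes=3, random_state=42)
X_syn[:, 0] *= 1e5       # mimic a large-scale column like album_listens
X_syn[::10, 1] = np.nan  # mimic missing values

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42)

# Impute, then standardize, then fit; each step is fit on training data only
pipeline = make_pipeline(SimpleImputer(), StandardScaler(),
                         LogisticRegression(max_iter=1000))
pipeline.fit(Xs_train, ys_train)
pipeline_test_score = pipeline.score(Xs_test, ys_test)
print('Test Accuracy Score:', pipeline_test_score)
```

Whether this helps on the real features is an open question; treat it as something to try, not a guaranteed gain.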




### Analysis

**What are the best predictors of genre?**

- The best-scoring model used `album_comments`, `album_favorites`, `album_listens`, `album_tracks`, `artist_latitude`, `artist_longitude`, `track_comments`, `track_duration`, `track_favorites`, and `track_number`. Interestingly, removing all of the artist predictors except latitude and longitude made the score go up. With fewer missing values, location might have been a good categorical variable to encode. As it stands, I would have had to drop the rows with NaN locations, since there's really no way to fill them in, and that would have significantly reduced my observations.

**What information isn't very useful for predicting genre?**

- I thought the tags might be nice to encode, but most of them were empty lists. As for the numerical data, the album id, artist id, and track listens were not useful.

**What surprised you the most about your results?**

- It was really difficult to improve the test score. I was also surprised that improving the test score usually lowered the train score.

This dataset is bigger than many you've worked with so far, and while it should fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting, or [downcasting numeric datatypes](https://www.dataquest.io/blog/pandas-big-data/).
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers.
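For the downcasting suggestion in the caveats above, a small sketch using `pd.to_numeric` with its `downcast` option. The `toy` frame and its values are invented, and the column names are borrowed from the dataset for illustration only:

```python
import pandas as pd

# Toy columns (made-up values) with pandas' default 64-bit dtypes
toy = pd.DataFrame({'track_number': [1, 2, 3, 255],
                    'artist_latitude': [39.26, 46.21, 40.69, 58.01]})
print(toy.dtypes)

# Downcast each column to the smallest dtype that can hold its values
toy['track_number'] = pd.to_numeric(toy['track_number'], downcast='unsigned')
toy['artist_latitude'] = pd.to_numeric(toy['artist_latitude'], downcast='float')
print(toy.dtypes)
```

Integer columns with missing values are stored as floats, so they would need imputing or dropping before an integer downcast applies.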

If you find that a model fit to classify *all* genres isn't very good, it's totally OK to limit it to the most frequent genres, or to try combining or clustering genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
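Limiting the problem to the most frequent genres could look something like the sketch below. It uses a toy frame with invented rows, and `n`, `top_genres`, and `toy_top` are hypothetical names:

```python
import pandas as pd

# Toy target column (made-up rows) standing in for df['track_genre_top']
toy = pd.DataFrame({'track_genre_top': ['Rock'] * 5 + ['Electronic'] * 4 +
                                       ['Hip-Hop'] * 3 + ['Jazz'] + ['Blues']})

# Keep only rows whose genre is among the n most frequent
n = 3
top_genres = toy['track_genre_top'].value_counts().nlargest(n).index
toy_top = toy[toy['track_genre_top'].isin(top_genres)]
print(sorted(toy_top['track_genre_top'].unique()))
```

Accuracy on the reduced problem is not directly comparable to accuracy on all genres, since the task itself has become easier.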

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a long-term project than a stretch goal, and won't fit in Colab) if you'd like you can check those out and see what sort of deeper analysis you can do.