# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since its outputs, like probabilities, live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

### Get data, option 1: Kaggle API

#### Sign up for Kaggle and get an API token
1. [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. 
2. [Follow these instructions](https://github.com/Kaggle/kaggle-api#api-credentials) to create a Kaggle “API Token” and download your `kaggle.json` file.

_This will enable you to download data directly from Kaggle. If you run into problems, don’t worry — I’ll give you an easy alternative way to download today’s data, so you can still follow along with the lecture hands-on. And then we’ll help you through the Kaggle process after the lecture._

#### Put `kaggle.json` in the correct location

- ***If you're using Anaconda,*** put the file in the directory specified in the [instructions](https://github.com/Kaggle/kaggle-api#api-credentials).

- ***If you're using Google Colab,*** upload the file to your Google Drive, and run this cell:

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

#### Install the Kaggle API package and use it to get the data

You must also join the Titanic competition on Kaggle to get access to the data.

In [5]:
!pip install kaggle



In [7]:
!kaggle competitions download -c titanic

train.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
Downloading gender_submission.csv to /Users/azel/Documents/GitHub/DS-Unit-2-Sprint-3-Classification-Validation/module1-logistic-regression
  0%|                                               | 0.00/3.18k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 3.18k/3.18k [00:00<00:00, 1.79MB/s]


### Get data, option 2: Download from the competition page
1. [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. 
2. [Go to the Titanic competition page](https://www.kaggle.com/c/titanic) to download the [data](https://www.kaggle.com/c/titanic/data).

### Get data, option 3: Use Seaborn

```
import seaborn as sns
train = sns.load_dataset('titanic')
```

But Seaborn's version of the Titanic dataset is not identical to Kaggle's version, as we'll see during this lesson!

### Read data

In [6]:
import pandas as pd

In [9]:
train = pd.read_csv('/Users/azel/Documents/GitHub/DS-Unit-2-Sprint-3-Classification-Validation/module1-logistic-regression/train.csv')
test = pd.read_csv('/Users/azel/Documents/GitHub/DS-Unit-2-Sprint-3-Classification-Validation/module1-logistic-regression/test.csv')

In [11]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


### How would we try to do this with linear regression?

https://scikit-learn.org/stable/modules/impute.html

In [27]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

features = ['Pclass','Age','Fare']
target = 'Survived'

X_train = train[features]
y_train = train[target]

X_test = test[features]

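imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
# Note: fit_transform refits the imputer on the test set, so test NaNs are filled
# with test-set means. imputer.transform(X_test) would reuse the training means.
X_test_imputed = imputer.fit_transform(X_test)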

In [19]:
X_train_imputed[:6]

array([[ 3.        , 22.        ,  7.25      ],
       [ 1.        , 38.        , 71.2833    ],
       [ 3.        , 26.        ,  7.925     ],
       [ 1.        , 35.        , 53.1       ],
       [ 3.        , 35.        ,  8.05      ],
       [ 3.        , 29.69911765,  8.4583    ]])

In [22]:
X_train.head()

Unnamed: 0,Pclass,Age,Fare
0,3,22.0,7.25
1,1,38.0,71.2833
2,3,26.0,7.925
3,1,35.0,53.1
4,3,35.0,8.05


In [23]:
X_test.tail()

Unnamed: 0,Pclass,Age,Fare
413,3,,8.05
414,1,39.0,108.9
415,3,38.5,7.25
416,3,,8.05
417,3,,22.3583


In [29]:
X_test_imputed[-5:]

array([[  3.        ,  30.27259036,   8.05      ],
       [  1.        ,  39.        , 108.9       ],
       [  3.        ,  38.5       ,   7.25      ],
       [  3.        ,  30.27259036,   8.05      ],
       [  3.        ,  30.27259036,  22.3583    ]])
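The last rows show missing test ages filled with 30.27, which is the mean age of the test set, because the imputer was refit on `X_test`. A common alternative is to learn the fill values from the training data only and reuse them on the test data. The sketch below does mean imputation by hand with NumPy on made-up ages (the arrays are illustrative, not Titanic rows); with scikit-learn the same idea is `imputer.fit(X_train)` followed by `imputer.transform(X_test)`.

```python
import numpy as np

# Illustrative ages only; np.nan marks a missing value.
train_age = np.array([22.0, 38.0, np.nan, 35.0])
test_age = np.array([np.nan, 39.0, 38.5])

# Learn the fill value from the training data alone.
train_mean = np.nanmean(train_age)

# Reuse that same value for both sets, so nothing is learned from the test set.
train_filled = np.where(np.isnan(train_age), train_mean, train_age)
test_filled = np.where(np.isnan(test_age), train_mean, test_age)

print(train_mean)
print(test_filled)
```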

In [30]:
lin_reg = LinearRegression()
lin_reg.fit(X_train_imputed, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [31]:
import numpy as np

test_case = np.array([[1, 5, 500]])  # Rich 5-year-old in first class
lin_reg.predict(test_case)

array([1.19207871])

In [32]:
pd.Series(lin_reg.coef_, X_train.columns)

Pclass   -0.210390
Age      -0.007358
Fare      0.000751
dtype: float64
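The prediction of 1.19 above is not a valid probability, and that is where linear regression goes wrong for this task. A minimal sketch of the effect on made-up data (not the Titanic set): a straight line fit to a 0/1 target dips below 0 and climbs above 1 near the ends of the feature range.

```python
import numpy as np

# Toy data: one feature, binary target that switches from 0 to 1 halfway.
x = np.arange(10)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Ordinary least squares fit of a straight line y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
line_pred = slope * x + intercept

# The smallest prediction is negative and the largest exceeds 1.
print(line_pred.min(), line_pred.max())
```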

### How would we do this with Logistic Regression?

In [37]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver = 'lbfgs')
log_reg.fit(X_train_imputed, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [39]:
log_reg.predict(test_case)

array([1])

In [41]:
log_reg.predict_proba(test_case)

array([[0.02778799, 0.97221201]])
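`predict_proba` returns one column per class, in the order of `log_reg.classes_` (here 0 = died, 1 = survived), and each row sums to 1. For a binary problem, `predict` returns the class with the larger probability, which amounts to a 0.5 cutoff. A small NumPy sketch using the probabilities printed above (the `classes` array is an assumption about the class order):

```python
import numpy as np

proba = np.array([[0.02778799, 0.97221201]])  # output of predict_proba above
classes = np.array([0, 1])                    # assumed order of log_reg.classes_

# Each row is a probability distribution over the classes.
row_sums = proba.sum(axis=1)

# Picking the most probable class reproduces predict() for this row.
labels = classes[proba.argmax(axis=1)]
print(row_sums, labels)
```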

In [42]:
log_reg.predict_proba(X_test_imputed)[:, 1]

array([0.19085685, 0.13076995, 0.18486051, 0.23627608, 0.27300709,
       0.33031372, 0.21675828, 0.47074764, 0.2977617 , 0.29035639,
       0.21530085, 0.52267833, 0.75938256, 0.18990847, 0.55071029,
       0.48721585, 0.37549991, 0.27586047, 0.23571485, 0.13921516,
       0.47763266, 0.3649756 , 0.66288758, 0.75630594, 0.73439675,
       0.12243134, 0.75014158, 0.2652934 , 0.57156891, 0.22529204,
       0.27159598, 0.49120168, 0.20798568, 0.2265992 , 0.68910514,
       0.29404308, 0.21541083, 0.27707412, 0.25006306, 0.25196894,
       0.17058217, 0.65802896, 0.15757318, 0.41885796, 0.55935625,
       0.2488185 , 0.53546271, 0.21519691, 0.45097775, 0.1878664 ,
       0.73489845, 0.44724964, 0.51792142, 0.84987609, 0.41913919,
       0.38225551, 0.18815861, 0.24860162, 0.22120867, 0.80929153,
       0.3058765 , 0.40209402, 0.29822154, 0.26920185, 0.90606015,
       0.42471955, 0.29833567, 0.53103445, 0.65406254, 0.64366709,
       0.25540976, 0.2764264 , 0.22309632, 0.67321035, 0.78012

In [43]:
# Predicted survival probabilities, then hard 0/1 labels, for the test set.
log_reg.predict_proba(X_test_imputed)[:, 1]
log_reg.predict(X_test_imputed)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,

### How accurate is the Logistic Regression?

In [44]:
score = log_reg.score(X_train_imputed, y_train)
print(score)

0.7025813692480359


In [45]:
X_train_imputed.shape

(891, 3)

In [46]:
y_pred = log_reg.predict(X_train_imputed)
len(y_pred)

891

In [48]:
len(y_train)

891

In [49]:
y_train[:5]

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [50]:
y_pred[:5]

array([0, 1, 0, 1, 0])
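For a classifier, `score` is plain accuracy: the share of predictions that match the true labels. The first five rows shown above make a quick hand check (four of five match). The `Survived` mean from `train.describe()` also gives the accuracy of always guessing the majority class, which is a useful baseline to compare the 0.70 score against.

```python
import numpy as np

y_true = np.array([0, 1, 1, 1, 0])  # y_train[:5] from above
y_hat = np.array([0, 1, 0, 1, 0])   # y_pred[:5] from above

# Accuracy is the mean of the element-wise matches.
accuracy = (y_true == y_hat).mean()
print(accuracy)

# Majority-class baseline: about 38% survived, so always predicting 0
# is right for the remaining share of passengers.
survival_rate = 0.383838  # mean of Survived in train.describe()
baseline = max(survival_rate, 1 - survival_rate)
print(round(baseline, 4))
```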

In [52]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(log_reg, X_train_imputed, y_train, cv = 10)
scores


array([0.63333333, 0.62222222, 0.68539326, 0.71910112, 0.69662921,
       0.69662921, 0.76404494, 0.75280899, 0.73033708, 0.71590909])

In [53]:
scores = pd.Series(scores)
scores.min(), scores.mean(), scores.max()

(0.6222222222222222, 0.7016408466689366, 0.7640449438202247)
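`cross_val_score` with `cv = 10` splits the 891 rows into ten folds, fits on nine of them, and scores on the one held out, ten times over. The sketch below shows only the splitting step with plain NumPy. It is a simplification: for classifiers scikit-learn uses stratified folds by default, so its fold sizes and membership can differ from these.

```python
import numpy as np

n_rows, n_folds = 891, 10
indices = np.arange(n_rows)

# Split the row indices into ten nearly equal, non-overlapping folds.
folds = np.array_split(indices, n_folds)

for val_idx in folds:
    # Everything outside the held-out fold would be used for fitting.
    train_idx = np.setdiff1d(indices, val_idx)
    assert len(train_idx) + len(val_idx) == n_rows

fold_sizes = [len(f) for f in folds]
print(fold_sizes)
```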

### What's the math for the Logistic Regression?

https://en.wikipedia.org/wiki/Logistic_function

https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study
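In short, the logistic function σ(z) = 1 / (1 + e^(-z)) squashes any real number into the interval (0, 1). Logistic regression feeds it a linear combination of the features, z = b0 + b1·x1 + ..., and reads the result as a probability. A minimal sketch with made-up coefficients (`b0` and `b1` below are illustrative assumptions, not fitted values):

```python
import math

def sigmoid(z):
    """Logistic function: maps a real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and slope for a single feature x.
b0, b1 = -4.0, 1.5

def predict_probability(x):
    # Linear score first, then squash it into a probability.
    return sigmoid(b0 + b1 * x)

for x in (0, 2, 4, 6):
    print(x, round(predict_probability(x), 3))
```

Each coefficient is the change in log-odds per unit of its feature, which is why the curve is S-shaped in probability but a straight line in log-odds.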

## Feature Engineering

Get the [Category Encoder](http://contrib.scikit-learn.org/categorical-encoding/) library

If you're running on Google Colab:

```
!pip install category_encoders
```

If you're running locally with Anaconda:

```
!conda install -c conda-forge category_encoders
```

In [None]:
!conda install -c conda-forge category_encoders

In [1]:
import category_encoders as ce

## Assignment: real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and try to use to classify genre of tracks. To get you started:

### Get and unzip the data

#### Google Colab

In [2]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip

/bin/sh: wget: command not found
unzip:  cannot find or open fma_metadata.zip, fma_metadata.zip.zip or fma_metadata.zip.ZIP.


#### Windows
- Download the [zip file](https://os.unil.cloud.switch.ch/fma/fma_metadata.zip)
- You may need to use [7zip](https://www.7-zip.org/download.html) to unzip it


#### Mac
- Download the [zip file](https://os.unil.cloud.switch.ch/fma/fma_metadata.zip)
- You may need to use [p7zip](https://superuser.com/a/626731) to unzip it

### Look at first 3 lines of raw file

In [4]:
!head -n 3 /Users/azel/fma_metadata/tracks.csv

,album,album,album,album,album,album,album,album,album,album,album,album,album,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,set,set,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track
,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### Read with pandas
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [319]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import category_encoders as ce

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [320]:
tracks = pd.read_csv('/Users/azel/fma_metadata/tracks.csv', header = [0,1], index_col = 0)
tracks.index.name = None  # drop the 'track_id' label; `del` on index.name fails in newer pandas
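Because the file has two header rows, `header = [0, 1]` gives the columns a two-level `MultiIndex`: selecting a top-level name such as `'track'` returns a sub-frame, and a `(group, column)` tuple selects one column. A small sketch on an in-memory CSV that mimics the layout shown above (the rows are trimmed-down samples, and `tracks_demo` is a hypothetical name, not the real file):

```python
import io
import pandas as pd

# Miniature stand-in for tracks.csv: two header rows, then the index-name row.
raw = (
    ",album,album,track,track\n"
    ",title,tracks,duration,genre_top\n"
    "track_id,,,,\n"
    "2,AWOL - A Way Of Life,7,168,Hip-Hop\n"
    "10,Constant Hitmaker,2,161,Pop\n"
)

tracks_demo = pd.read_csv(io.StringIO(raw), header=[0, 1], index_col=0)

# Top-level selection returns every column under that group.
track_cols = list(tracks_demo['track'].columns)

# A (group, column) tuple selects a single column as a Series.
genres = tracks_demo[('track', 'genre_top')]
print(track_cols)
print(genres.tolist())
```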

In [321]:
print(tracks.shape)
print()
print(tracks.info())
tracks.head()

(106574, 52)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106574 entries, 2 to 155320
Data columns (total 52 columns):
(album, comments)              106574 non-null int64
(album, date_created)          103045 non-null object
(album, date_released)         70294 non-null object
(album, engineer)              15295 non-null object
(album, favorites)             106574 non-null int64
(album, id)                    106574 non-null int64
(album, information)           83149 non-null object
(album, listens)               106574 non-null int64
(album, producer)              18060 non-null object
(album, tags)                  106574 non-null object
(album, title)                 105549 non-null object
(album, tracks)                106574 non-null int64
(album, type)                  100066 non-null object
(artist, active_year_begin)    22711 non-null object
(artist, active_year_end)      5375 non-null object
(artist, associated_labels)    14271 non-null object
(artist, bio)           

Unnamed: 0_level_0,album,album,album,album,album,album,album,album,album,album,album,album,album,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,set,set,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track
Unnamed: 0_level_1,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level


In [322]:
tracks.isnull().sum()

album   comments                  0
        date_created           3529
        date_released         36280
        engineer              91279
        favorites                 0
        id                        0
        information           23425
        listens                   0
        producer              88514
        tags                      0
        title                  1025
        tracks                    0
        type                   6508
artist  active_year_begin     83863
        active_year_end      101199
        associated_labels     92303
        bio                   35418
        comments                  0
        date_created            856
        favorites                 0
        id                        0
        latitude              62030
        location              36364
        longitude             62030
        members               59725
        name                      0
        related_projects      93422
        tags                

### Fit Logistic Regression!

In [323]:
(tracks['set']['subset'] == 'large').sum()

81574

In [324]:
# Cleaning up the columns

album = tracks['album'][['engineer','information','producer','tracks','type']]
artist = tracks['artist'][['active_year_begin', 'active_year_end', 'associated_labels', 'bio', 
                           'id', 'latitude', 'location','longitude', 'members', 
                           'name', 'related_projects']]
train_split = tracks['set'][['split','subset']]
track_info = tracks['track'][['bit_rate', 'comments', 'composer', 'date_created', 'date_recorded',
                              'duration', 'favorites', 'genre_top', 'genres', 'genres_all',
                              'interest', 'language_code', 'license', 'listens',
                              'lyricist', 'number', 'publisher', 'tags']]

df = pd.concat([album, artist, train_split, track_info], axis = 1)

In [325]:
print(df.shape)
print()
df.head()

(106574, 36)



Unnamed: 0,engineer,information,producer,tracks,type,active_year_begin,active_year_end,associated_labels,bio,id,latitude,location,longitude,members,name,related_projects,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,interest,language_code,license,listens,lyricist,number,publisher,tags
2,,<p></p>,,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[]
3,,<p></p>,,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[]
5,,<p></p>,,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",1,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[]
10,,,,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",6,,,,"Kurt Vile, the Violators",Kurt Vile,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[]
20,,"<p> ""spiritual songs"" from Nicky Cook</p>",,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[]


In [326]:
df.isnull().sum()

engineer              91279
information           23425
producer              88514
tracks                    0
type                   6508
active_year_begin     83863
active_year_end      101199
associated_labels     92303
bio                   35418
id                        0
latitude              62030
location              36364
longitude             62030
members               59725
name                      0
related_projects      93422
split                     0
subset                    0
bit_rate                  0
comments                  0
composer             102904
date_created              0
date_recorded        100415
duration                  0
favorites                 0
genre_top             56976
genres                    0
genres_all                0
interest                  0
language_code         91550
license                  87
listens                   0
lyricist             106263
number                    0
publisher            105311
tags                

In [327]:
# Target feature is 'genre_top'. About 57,000 rows (56,976) have no value for it.
# Keep only the observations where a top genre is available.
# With more time, the missing top genres could be filled from the other genre columns.

df = df.loc[df['genre_top'].notnull()].copy()

In [328]:
# All of the genres that appear in the target feature.

genres = df['genre_top'].unique()
genres

array(['Hip-Hop', 'Pop', 'Rock', 'Experimental', 'Folk', 'Jazz',
       'Electronic', 'Spoken', 'International', 'Soul-RnB', 'Blues',
       'Country', 'Classical', 'Old-Time / Historic', 'Instrumental',
       'Easy Listening'], dtype=object)

In [329]:
# Fill NaNs in categorical features with a placeholder value. Next step is one-hot encoding.

autofill_none = ['engineer','producer','bio','publisher']
autofill_other = ['type','license','language_code','location']

for column in autofill_none: 
    df[column] = df[column].fillna('None')
    
for column in autofill_other:
    df[column] = df[column].fillna('Other')

In [330]:
# One-hot encode categorical variables that had many NaN values.

for_dummies = ['type', 'language_code']

df = pd.get_dummies(df, columns = for_dummies)
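To see what the fill and encode steps do, here is a toy version: missing categories get a placeholder label, then `pd.get_dummies` expands the column into one indicator column per label. The three-row frame is made up for illustration.

```python
import pandas as pd

# Made-up rows; None stands in for a missing album type.
demo = pd.DataFrame({'type': ['Album', None, 'Live Performance'],
                     'listens': [1293, 514, 1151]})

# Step 1: give missing categories an explicit label.
demo['type'] = demo['type'].fillna('Other')

# Step 2: one indicator column per label, named <column>_<label>.
encoded = pd.get_dummies(demo, columns=['type'])
print(encoded.columns.tolist())
```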

In [331]:
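# Manual modification of the genre list, to handle tricky categories:
# 'Soul-RnB' is split into 'Soul' and 'Rhythm', 'Old-Time / Historic' into
# 'Old-Time' and 'Historic', and both 'Hip Hop' and 'Hip-Hop' spellings are kept.
# (.replace() didn't work here, apparently because the tags are lists of lists.)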


tag_genre = ['Hip Hop', 'Hip-Hop', 'Pop', 'Rock', 'Experimental', 'Folk', 'Jazz',
             'Electronic', 'Spoken', 'International', 'Soul', 'Rhythm' , 'Blues',
             'Country', 'Classical', 'Old-Time', 'Historic', 'Instrumental',
             'Easy Listening']

tag_genre_lower = [s.lower() for s in tag_genre]

In [332]:
# Add one boolean column per genre keyword: True when the track's tags mention it.

for s in tag_genre_lower: 
    df[s + '_mentioned'] = df['tags'].str.contains(s)
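One thing to keep in mind with these flags: `tags` holds the text of a list (for example `['philly', 'kurt vile']` in the preview above), so `str.contains` does substring matching on that text. `'pop'` therefore also fires on a tag like `'experimental pop'`. If exact tag matches are wanted, one option is to parse the text back into a list first. A sketch on made-up tag strings:

```python
import ast
import pandas as pd

# Made-up tag strings in the same "text of a list" form as the tags column.
tags = pd.Series(["['instrumentals', 'experimental pop']", "['pop']", "[]"])

# Substring match: also True for 'experimental pop'.
substring_hit = tags.str.contains('pop', regex=False)

# Exact match: parse each string into a real list, then test membership.
parsed = tags.apply(ast.literal_eval)
exact_hit = parsed.apply(lambda tag_list: 'pop' in tag_list)

print(substring_hit.tolist())
print(exact_hit.tolist())
```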

In [343]:
# These are my selected features.
features = ['engineer','producer','location','bit_rate','comments',
            'date_created','duration','favorites','listens','publisher',
            'type_Album', 'type_Live Performance','type_Radio Program', 
            'type_Single Tracks', 'language_code_ar','language_code_bg',
            'language_code_cs', 'language_code_de', 'language_code_el',
            'language_code_en', 'language_code_es', 
            'language_code_fr','language_code_he', 'language_code_id', 
            'language_code_it','language_code_ja', 'language_code_nl', 
            'language_code_pt','language_code_ru', 'language_code_sr', 
            'language_code_sw','language_code_tr', 'language_code_zh', 
            'hip hop_mentioned','hip-hop_mentioned', 'pop_mentioned', 
            'rock_mentioned','experimental_mentioned', 'folk_mentioned', 
            'jazz_mentioned','electronic_mentioned', 'spoken_mentioned', 
            'international_mentioned','soul_mentioned', 'rhythm_mentioned', 
            'blues_mentioned','country_mentioned', 'classical_mentioned', 
            'old-time_mentioned','historic_mentioned', 'instrumental_mentioned',
            'easy listening_mentioned']

# Find all selected columns whose values are objects (strings).
ob_columns = df[features].select_dtypes(include='object').columns.tolist()
# Convert those columns to categories, then use .cat.codes to encode them as integers.
for col in ob_columns:
    df[col + 'Cat'] = df[col].astype('category').cat.codes
    
ob_columns

['engineer', 'producer', 'location', 'date_created', 'publisher']
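`.cat.codes` replaces each distinct string with an integer, numbered in sorted order of the categories. That makes the columns numeric, but a linear model will read the codes as an ordered scale, which these labels do not have; one-hot encoding or the `category_encoders` library are common alternatives. A toy sketch (values made up):

```python
import pandas as pd

# Made-up locations, including the 'Other' placeholder used above.
location = pd.Series(['New Jersey', 'Other', 'Colchester England', 'Other'])

as_category = location.astype('category')
codes = as_category.cat.codes

# Categories are sorted, so codes follow alphabetical order, not any real ranking.
print(list(as_category.cat.categories))
print(codes.tolist())
```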

In [345]:
features = ['engineerCat','producerCat','locationCat','bit_rate','comments',
            'date_createdCat','duration','favorites','listens','publisherCat',
            'type_Album', 'type_Live Performance','type_Radio Program', 
            'type_Single Tracks', 'language_code_ar','language_code_bg',
            'language_code_cs', 'language_code_de', 'language_code_el',
            'language_code_en', 'language_code_es', 
            'language_code_fr','language_code_he', 'language_code_id', 
            'language_code_it','language_code_ja', 'language_code_nl', 
            'language_code_pt','language_code_ru', 'language_code_sr', 
            'language_code_sw','language_code_tr', 'language_code_zh', 
            'hip hop_mentioned','hip-hop_mentioned', 'pop_mentioned', 
            'rock_mentioned','experimental_mentioned', 'folk_mentioned', 
            'jazz_mentioned','electronic_mentioned', 'spoken_mentioned', 
            'international_mentioned','soul_mentioned', 'rhythm_mentioned', 
            'blues_mentioned','country_mentioned', 'classical_mentioned', 
            'old-time_mentioned','historic_mentioned', 'instrumental_mentioned',
            'easy listening_mentioned']

target = ['genre_top']

train = df[df['split'] == 'training']
test = df[df['split'] == 'test']
validate = df[df['split'] == 'validation']

X_train = train[features]
X_test = test[features]
X_val = validate[features]
y_train = train[target]
y_test = test[target]
y_val = validate[target]

X_train.shape, y_train.shape, X_test.shape, y_test.shape, X_val.shape, y_val.shape

((39943, 52), (39943, 1), (4951, 52), (4951, 1), (4704, 52), (4704, 1))

In [346]:
# Verifying my work
X_train.isnull().sum()

engineerCat                 0
producerCat                 0
locationCat                 0
bit_rate                    0
comments                    0
date_createdCat             0
duration                    0
favorites                   0
listens                     0
publisherCat                0
type_Album                  0
type_Live Performance       0
type_Radio Program          0
type_Single Tracks          0
language_code_ar            0
language_code_bg            0
language_code_cs            0
language_code_de            0
language_code_el            0
language_code_en            0
language_code_es            0
language_code_fr            0
language_code_he            0
language_code_id            0
language_code_it            0
language_code_ja            0
language_code_nl            0
language_code_pt            0
language_code_ru            0
language_code_sr            0
language_code_sw            0
language_code_tr            0
language_code_zh            0
hip hop_me

In [363]:
model = LogisticRegression(solver = 'lbfgs')
model.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
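The `column_or_1d` line is part of a scikit-learn warning: `target` is a list, so `train[target]` is a one-column DataFrame of shape `(n, 1)` rather than a 1-D Series of shape `(n,)`. The model still fits, but passing a Series or calling `.values.ravel()` avoids the warning. A sketch of the shape difference on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({'genre_top': ['Hip-Hop', 'Pop', 'Rock']})

y_frame = demo[['genre_top']]    # list selector: DataFrame, shape (3, 1)
y_series = demo['genre_top']     # string selector: Series, shape (3,)
y_flat = y_frame.values.ravel()  # flatten an existing (n, 1) frame to 1-D

print(y_frame.shape, y_series.shape, y_flat.shape)
```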

In [364]:
print(round(model.score(X_test, y_test)*100, 2 ),'%')

37.1 %


In [361]:
features = ['bit_rate','comments','date_createdCat','duration','favorites',
            'listens','type_Album', 'type_Live Performance','type_Radio Program', 
            'type_Single Tracks','hip hop_mentioned','hip-hop_mentioned', 'pop_mentioned', 
            'rock_mentioned','experimental_mentioned', 'folk_mentioned', 
            'jazz_mentioned','electronic_mentioned', 'spoken_mentioned', 
            'international_mentioned','soul_mentioned', 'rhythm_mentioned', 
            'blues_mentioned','country_mentioned', 'classical_mentioned', 
            'old-time_mentioned','historic_mentioned', 'instrumental_mentioned',
            'easy listening_mentioned']

X_train = train[features]
X_test = test[features]
X_val = validate[features]
y_train = train[target]
y_test = test[target]
y_val = validate[target]



In [362]:
model = LogisticRegression()
model.fit(X_train, y_train)

print(round(model.score(X_test, y_test)*100, 2 ),'%')

  y = column_or_1d(y, warn=True)


37.16 %


This dataset is bigger than many you've worked with so far, and while it should fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting, or [downcasting numeric datatypes](https://www.dataquest.io/blog/pandas-big-data/).
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers
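For the sampling and downcasting suggestions, a hedged sketch on a made-up frame (column names borrowed from the tracks data, values invented): `DataFrame.sample` draws a reproducible subset, and `pd.to_numeric(..., downcast='integer')` picks the smallest integer dtype that holds the values.

```python
import numpy as np
import pandas as pd

# Invented values, stored as 64-bit integers to start with.
demo = pd.DataFrame({'bit_rate': np.full(1000, 256000, dtype='int64'),
                     'duration': np.arange(1000, dtype='int64')})

# Work on a 10% random sample; random_state makes it repeatable.
sample = demo.sample(frac=0.1, random_state=42)

# Downcast each column to the smallest integer dtype that fits its values.
small = demo.copy()
for col in ['bit_rate', 'duration']:
    small[col] = pd.to_numeric(small[col], downcast='integer')

print(len(sample))
print(small.dtypes.to_dict())
print(demo.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```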

If you find that a model fit to classify *all* genres doesn't perform well, it's totally OK to limit it to the most frequent genres, or to try combining or clustering genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
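Limiting the target to the most frequent genres can be done with `value_counts`. A sketch on a made-up label column (the real counts will differ, and `top_k` is an arbitrary choice):

```python
import pandas as pd

# Made-up labels standing in for df['genre_top'].
genre = pd.Series(['Rock', 'Rock', 'Rock', 'Electronic', 'Electronic',
                   'Folk', 'Jazz'])

top_k = 2
# Labels of the k most common genres.
top_genres = genre.value_counts().nlargest(top_k).index

# Keep only rows whose genre is in that set.
keep = genre.isin(top_genres)
filtered = genre[keep]
print(sorted(filtered.unique()))
```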

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression on
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a long-term project than a stretch goal, and won't fit in Colab) you can check those files out and see what sort of deeper analysis you can do.