# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since its outputs, like probabilities, live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

### Get data, option 1: Kaggle API

#### Sign up for Kaggle and get an API token
1. [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. 
2. [Follow these instructions](https://github.com/Kaggle/kaggle-api#api-credentials) to create a Kaggle “API Token” and download your `kaggle.json` file.

_This will enable you to download data directly from Kaggle. If you run into problems, don’t worry — I’ll give you an easy alternative way to download today’s data, so you can still follow along with the lecture hands-on. And then we’ll help you through the Kaggle process after the lecture._

#### Put `kaggle.json` in the correct location

- ***If you're using Anaconda,*** put the file in the directory specified in the [instructions](https://github.com/Kaggle/kaggle-api#api-credentials).

- ***If you're using Google Colab,*** upload the file to your Google Drive, and run this cell:

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

#### Install the Kaggle API package and use it to get the data

You also have to join the Titanic competition to have access to the data.

In [1]:
!pip install kaggle



In [2]:
!kaggle competitions download -c titanic

train.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
gender_submission.csv: Skipping, found more recently modified local copy (use --force to force download)


### Get data, option 2: Download from the competition page
1. [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. 
2. [Go to the Titanic competition page](https://www.kaggle.com/c/titanic) to download the [data](https://www.kaggle.com/c/titanic/data).

### Get data, option 3: Use Seaborn

```
import seaborn as sns
train = sns.load_dataset('titanic')
```

But Seaborn's version of the Titanic dataset is not identical to Kaggle's version, as we'll see during this lesson!

### Read data

In [3]:
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.shape, test.shape

((891, 12), (418, 11))

In [4]:
train.sample(n=5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
109,110,1,3,"Moran, Miss. Bertha",female,,1,0,371110,24.15,,Q
245,246,0,1,"Minahan, Dr. William Edward",male,44.0,2,0,19928,90.0,C78,Q
97,98,1,1,"Greenfield, Mr. William Bertram",male,23.0,0,1,PC 17759,63.3583,D10 D12,C
548,549,0,3,"Goldsmith, Mr. Frank John",male,33.0,1,1,363291,20.525,,S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


In [5]:
test.sample(n=5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
155,1047,3,"Duquemin, Mr. Joseph",male,24.0,0,0,S.O./P.P. 752,7.55,,S
350,1242,1,"Greenfield, Mrs. Leo David (Blanche Strouse)",female,45.0,0,1,PC 17759,63.3583,D10 D12,C
375,1267,1,"Bowen, Miss. Grace Scott",female,45.0,0,0,PC 17608,262.375,,C
287,1179,1,"Snyder, Mr. John Pillsbury",male,24.0,1,0,21228,82.2667,B45,S
229,1121,2,"Hocking, Mr. Samuel James Metcalfe",male,36.0,0,0,242963,13.0,,S


In [6]:
target = 'Survived'
train[target].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

In [7]:
test.columns.tolist(), train.columns.tolist()

(['PassengerId',
  'Pclass',
  'Name',
  'Sex',
  'Age',
  'SibSp',
  'Parch',
  'Ticket',
  'Fare',
  'Cabin',
  'Embarked'],
 ['PassengerId',
  'Survived',
  'Pclass',
  'Name',
  'Sex',
  'Age',
  'SibSp',
  'Parch',
  'Ticket',
  'Fare',
  'Cabin',
  'Embarked'])

In [8]:
train.describe(include='number')

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [9]:
train.describe(exclude='number')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Andersson, Master. Sigvard Harald Elias",male,CA. 2343,G6,S
freq,1,577,7,4,644


### How would we try to do this with linear regression?

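`Age` has missing values (only 714 of the 891 rows have one), so we need to impute before fitting; see the scikit-learn [imputation guide](https://scikit-learn.org/stable/modules/impute.html).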

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

features = ['Pclass', 'Age', 'Fare']
target = 'Survived'
X_train = train[features]
y_train = train[target]
X_test = test[features]

imputer = SimpleImputer()

X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)



lin_reg = LinearRegression()
lin_reg.fit(X_train_imputed, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [11]:
X_train['Age'].mean()

29.69911764705882

In [12]:
X_train_imputed[:6]

array([[ 3.        , 22.        ,  7.25      ],
       [ 1.        , 38.        , 71.2833    ],
       [ 3.        , 26.        ,  7.925     ],
       [ 1.        , 35.        , 53.1       ],
       [ 3.        , 35.        ,  8.05      ],
       [ 3.        , 29.69911765,  8.4583    ]])

In [13]:
X_train.head(6)

Unnamed: 0,Pclass,Age,Fare
0,3,22.0,7.25
1,1,38.0,71.2833
2,3,26.0,7.925
3,1,35.0,53.1
4,3,35.0,8.05
5,3,,8.4583


In [14]:
X_test.tail()

Unnamed: 0,Pclass,Age,Fare
413,3,,8.05
414,1,39.0,108.9
415,3,38.5,7.25
416,3,,8.05
417,3,,22.3583


In [15]:
X_test['Age'].mean()

30.272590361445783

In [16]:
X_test_imputed[-6:]

array([[  3.        ,  28.        ,   7.775     ],
       [  3.        ,  29.69911765,   8.05      ],
       [  1.        ,  39.        , 108.9       ],
       [  3.        ,  38.5       ,   7.25      ],
       [  3.        ,  29.69911765,   8.05      ],
       [  3.        ,  29.69911765,  22.3583    ]])

In [17]:
pd.concat([X_train, X_test])['Age'].mean()

29.881137667304014
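
The outputs above show that the missing test ages were filled with 29.699, the training mean, rather than the test mean (30.27) or the combined mean (29.88): the imputer learns its statistics in `fit` and reuses them in `transform`. A small sketch of the same behavior, with made-up ages (the toy values are assumptions, not Titanic data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages for illustration only
toy_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
toy_test = np.array([[np.nan], [50.0]])

imputer = SimpleImputer()  # default strategy='mean'
train_filled = imputer.fit_transform(toy_train)  # learns the training mean
test_filled = imputer.transform(toy_test)        # reuses it, never refits

print(imputer.statistics_)  # [30.]
print(test_filled.ravel())  # [30. 50.]
```

Reusing the training statistics keeps the test set from influencing anything the model learns.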

In [18]:
import numpy as np

test_case = np.array([[1, 5, 500]])  # rich 5-year-old in 1st class: [Pclass, Age, Fare]
lin_reg.predict(test_case)

array([1.19207871])

In [19]:
y_pred = lin_reg.predict(X_test_imputed)

In [20]:
pd.Series(y_pred).describe()

count    418.000000
mean       0.392117
std        0.181876
min        0.011755
25%        0.227341
50%        0.339570
75%        0.516439
max        0.954827
dtype: float64

In [21]:
pd.Series(lin_reg.coef_, X_train.columns)

Pclass   -0.210390
Age      -0.007358
Fare      0.000751
dtype: float64

In [22]:
lin_reg.intercept_

1.0638995000035438
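
The prediction of about 1.19 for the rich 5-year-old is where linear regression goes wrong: nothing confines a linear model's output to [0, 1], so it cannot be read as a probability. A minimal sketch that recomputes the prediction by hand from the rounded coefficients and intercept printed above (because of the rounding, it only matches to about three decimals):

```python
import numpy as np

# Rounded values copied from the lin_reg.coef_ and lin_reg.intercept_ outputs
coef = np.array([-0.210390, -0.007358, 0.000751])  # Pclass, Age, Fare
intercept = 1.0638995

test_case = np.array([1, 5, 500])  # rich 5-year-old in 1st class

# A linear model is just intercept + dot(coefficients, features)
linear_prediction = intercept + coef @ test_case
print(linear_prediction)  # about 1.192, which is not a valid probability
```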

### How would we do this with Logistic Regression?

In [28]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)
print('Prediction for rich 5yo: ', log_reg.predict(test_case))
print('Predicted probabilities for rich 5yo: ', log_reg.predict_proba(test_case))

Prediction for rich 5yo:  [1]
Predicted probabilities for rich 5yo:  [[0.02778799 0.97221201]]


In [29]:
log_reg.predict_proba(X_test_imputed)

array([[0.80914315, 0.19085685],
       [0.86923005, 0.13076995],
       [0.81513949, 0.18486051],
       [0.76372392, 0.23627608],
       [0.72699291, 0.27300709],
       [0.66968628, 0.33031372],
       [0.78324172, 0.21675828],
       [0.52925236, 0.47074764],
       [0.7022383 , 0.2977617 ],
       [0.70964361, 0.29035639],
       [0.78122042, 0.21877958],
       [0.47732167, 0.52267833],
       [0.24061744, 0.75938256],
       [0.81009153, 0.18990847],
       [0.44928971, 0.55071029],
       [0.51278415, 0.48721585],
       [0.62450009, 0.37549991],
       [0.72413953, 0.27586047],
       [0.76428515, 0.23571485],
       [0.86078484, 0.13921516],
       [0.52236734, 0.47763266],
       [0.6350244 , 0.3649756 ],
       [0.33255308, 0.66744692],
       [0.24369406, 0.75630594],
       [0.26560325, 0.73439675],
       [0.87756866, 0.12243134],
       [0.24985842, 0.75014158],
       [0.7347066 , 0.2652934 ],
       [0.42843109, 0.57156891],
       [0.77111489, 0.22888511],
       [0.

In [31]:
threshold = 0.5

(log_reg.predict_proba(X_test_imputed)[:, 1] > threshold).astype(int)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,

In [32]:
manual_predictions = (log_reg.predict_proba(X_test_imputed)[:, 1] > threshold).astype(int)
direct_predictions = log_reg.predict(X_test_imputed)

all(manual_predictions == direct_predictions)

True
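
With two classes, `predict` is equivalent to thresholding the class-1 probability at 0.5, as the check above confirms. The threshold is a choice, though. A sketch using a few class-1 probabilities copied from the `predict_proba` output above shows how moving it changes the labels:

```python
import numpy as np

# Class-1 (survived) probabilities copied from the predict_proba output
proba_survived = np.array([0.19085685, 0.47074764, 0.52267833, 0.75938256, 0.55071029])

for threshold in (0.3, 0.5, 0.7):
    labels = (proba_survived > threshold).astype(int)
    print(threshold, labels)
```

A lower threshold labels more passengers as survivors; a higher one labels fewer.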

### How accurate is the Logistic Regression?

In [33]:
score = log_reg.score(X_train_imputed, y_train)
print('Train Accuracy Score ', score)

Train Accuracy Score  0.7025813692480359


In [34]:
X_train_imputed.shape

(891, 3)

In [35]:
y_pred= log_reg.predict(X_train_imputed)

In [36]:
len(y_pred)

891

In [37]:
len(y_train)

891

In [38]:
y_pred[:5]

array([0, 1, 0, 1, 0])

In [39]:
y_train[:5].values

array([0, 1, 1, 1, 0])

In [41]:
correct_predictions = (1 + 1 + 0 + 1 + 1)
total_predictions = 5

accuracy = correct_predictions / total_predictions

print(accuracy)

0.8


In [42]:
from sklearn.metrics import accuracy_score

accuracy_score(y_train[:5], y_pred[:5])

0.8

In [44]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(log_reg, X_train_imputed, y_train, cv=10)

scores.min(), scores.mean(), scores.max()

(0.6222222222222222, 0.7016408466689366, 0.7640449438202247)
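
An accuracy of about 0.70 only means something relative to a baseline. Since 61.6% of the training passengers did not survive (see the `value_counts` output earlier), always guessing "did not survive" already scores about 0.616. A sketch of that majority-class baseline, with the class counts 549 and 342 inferred from those proportions and the 891 rows:

```python
import numpy as np

# 549 non-survivors and 342 survivors: 891 rows at the proportions shown earlier
y = np.array([0] * 549 + [1] * 342)

majority_class = np.bincount(y).argmax()
baseline_accuracy = (y == majority_class).mean()
print(majority_class, round(baseline_accuracy, 4))  # 0 0.6162
```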

### What's the math for the Logistic Regression?

https://en.wikipedia.org/wiki/Logistic_function

https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study

In [45]:
log_reg.coef_

array([[-0.9345267 , -0.03569729,  0.00422069]])

In [46]:
log_reg.intercept_

array([2.55763985])

In [47]:
test_case

array([[  1,   5, 500]])

In [54]:
# logistic sigmoid squishing function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [55]:
sigmoid(np.dot(log_reg.coef_, test_case.T) + log_reg.intercept_)

array([[0.97221201]])

In [56]:
sigmoid((log_reg.coef_ @ test_case.T) + log_reg.intercept_)

array([[0.97221201]])
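
The two cells above reproduce `predict_proba` by hand: the model computes a linear score (the log-odds) and squashes it through the sigmoid. A sketch of the same arithmetic using the rounded coefficients printed above. Exponentiating a coefficient gives the multiplicative change in the odds of survival for a one-unit increase in that feature:

```python
import numpy as np

# Values copied from the log_reg.coef_ and log_reg.intercept_ outputs
coef = np.array([-0.9345267, -0.03569729, 0.00422069])  # Pclass, Age, Fare
intercept = 2.55763985
test_case = np.array([1, 5, 500])

log_odds = intercept + coef @ test_case     # linear part
probability = 1 / (1 + np.exp(-log_odds))   # sigmoid maps it into (0, 1)
odds_ratios = np.exp(coef)                  # below 1 lowers the odds, above 1 raises them

print(round(log_odds, 3), round(probability, 4))  # 3.555 0.9722
print(odds_ratios)
```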

## Feature Engineering

Get the [Category Encoder](http://contrib.scikit-learn.org/categorical-encoding/) library

If you're running on Google Colab:

```
!pip install category_encoders
```

If you're running locally with Anaconda:

```
!conda install -c conda-forge category_encoders
```

In [57]:
import seaborn as sns
sns_titanic = sns.load_dataset('titanic')

In [58]:
sns_titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [64]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [67]:
train.columns.tolist()

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [85]:
def make_features(X):
    X = X.copy()
    X['adult_male'] = (X['Sex'] == 'male') & (X['Age'] >= 16)
    X['alone'] = (X['SibSp'] == 0) & (X['Parch'] == 0)
    X['last_name'] = X['Name'].str.split(',').str[0]
    # strip() removes the space left after the comma (e.g. ' Mr' -> 'Mr')
    X['title'] = X['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()
    return X

In [86]:
train = make_features(train)
test = make_features(test)

train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,adult_male,alone,last_name,title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,True,False,Braund,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,False,False,Cumings,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,False,True,Heikkinen,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,False,False,Futrelle,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,True,True,Allen,Mr
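
Logistic regression needs numeric inputs, so string features such as `Sex`, `Embarked`, or the new `title` have to be encoded first. The Category Encoders library linked above is one option; the sketch below swaps in plain `pandas.get_dummies` instead, so that it runs without extra installs. The three-row frame is made up for illustration:

```python
import pandas as pd

# Made-up rows shaped like the engineered Titanic features
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'title': ['Mr', 'Mrs', 'Miss'],
    'alone': [False, False, True],
})

# One indicator column per category; booleans become 0/1 integers
encoded = pd.get_dummies(toy, columns=['Sex', 'title'], dtype=int)
encoded['alone'] = encoded['alone'].astype(int)
print(encoded.columns.tolist())
```

High-cardinality columns such as `last_name` would produce one column per distinct value, so they usually call for a different encoding or for being left out.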


In [None]:
!conda install -c conda-forge category_encoders


## Assignment: real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and try to use to classify genre of tracks. To get you started:

### Get and unzip the data

#### Google Colab

In [72]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip

--2019-05-06 11:08:24--  https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
Resolving os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)... 86.119.28.13, 2001:620:5ca1:2ff::ce53
Connecting to os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)|86.119.28.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358412441 (342M) [application/zip]
Saving to: ‘fma_metadata.zip’


2019-05-06 11:09:49 (4.10 MB/s) - ‘fma_metadata.zip’ saved [358412441/358412441]

Archive:  fma_metadata.zip
 bunzipping: fma_metadata/README.txt  
 bunzipping: fma_metadata/checksums  
 bunzipping: fma_metadata/not_found.pickle  
 bunzipping: fma_metadata/raw_genres.csv  
 bunzipping: fma_metadata/raw_albums.csv  
 bunzipping: fma_metadata/raw_artists.csv  
 bunzipping: fma_metadata/raw_tracks.csv  
 bunzipping: fma_metadata/tracks.csv  
 bunzipping: fma_metadata/genres.csv  
 bunzipping: fma_metadata/raw_echonest.csv  
 bunzipping: fma_metadata/echonest.csv  
 bunzipping: fma_metadata/features.

#### Windows
- Download the [zip file](https://os.unil.cloud.switch.ch/fma/fma_metadata.zip)
- You may need to use [7zip](https://www.7-zip.org/download.html) to unzip it


#### Mac
- Download the [zip file](https://os.unil.cloud.switch.ch/fma/fma_metadata.zip)
- You may need to use [p7zip](https://superuser.com/a/626731) to unzip it

### Look at first 3 lines of raw file

In [None]:
!head -n 3 fma_metadata/tracks.csv

### Read with pandas
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [121]:
# genres.csv has a single header row (unlike tracks.csv, which has two)
genres = pd.read_csv('fma_metadata/genres.csv', index_col=0)

genres

genre_id,#tracks,parent,title,top_level
1,8693,38,Avant-Garde,38
2,5271,0,International,2
3,1752,0,Blues,3
4,4126,0,Jazz,4
5,4106,0,Classical,5
6,914,38,Novelty,38
7,217,20,Comedy,20
8,868,0,Old-Time / Historic,8
9,1987,0,Country,9
10,13845,0,Pop,10
11,367,14,Disco,14


In [125]:
tracks = pd.read_csv('fma_metadata/tracks.csv', header=[0,1], index_col=0)

In [126]:
tracks.head().T

Unnamed: 0,track_id,2,3,5,10,20
album,comments,0,0,0,0,0
album,date_created,2008-11-26 01:44:45,2008-11-26 01:44:45,2008-11-26 01:44:45,2008-11-26 01:45:08,2008-11-26 01:45:05
album,date_released,2009-01-05 00:00:00,2009-01-05 00:00:00,2009-01-05 00:00:00,2008-02-06 00:00:00,2009-01-06 00:00:00
album,engineer,,,,,
album,favorites,4,4,4,4,2
album,id,1,1,1,6,4
album,information,<p></p>,<p></p>,<p></p>,,"<p> ""spiritual songs"" from Nicky Cook</p>"
album,listens,6073,6073,6073,47632,2710
album,producer,,,,,
album,tags,[],[],[],[],[]


In [144]:
# tracks['track', ['genre_top', 'genres']] raises:
#   TypeError: '('track', ['genre_top', 'genres'])' is an invalid key
# Select the top-level 'track' group first, then the columns within it:
tracks['track'][['genre_top', 'genres']].head()

In [124]:
# genre_top is our target, but it is missing 59k values. It seems to be derived from
# genres_all, so one option is to copy the first genre label from genres_all into genre_top:

# tracks['track','genre_top'] = tracks['track','genres_all'].str.strip('[]').str[:3].str.replace(',','')

In [120]:
tracks.isna().sum()

album   comments                  0
        date_created           3529
        date_released         36280
        engineer              91279
        favorites                 0
        id                        0
        information           23425
        listens                   0
        producer              88514
        tags                      0
        title                  1025
        tracks                    0
        type                   6508
artist  active_year_begin     83863
        active_year_end      101199
        associated_labels     92303
        bio                   35418
        comments                  0
        date_created            856
        favorites                 0
        id                        0
        latitude              62030
        location              36364
        longitude             62030
        members               59725
        name                      0
        related_projects      93422
        tags                

In [91]:
tracks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106574 entries, 2 to 155320
Data columns (total 52 columns):
(album, comments)              106574 non-null int64
(album, date_created)          103045 non-null object
(album, date_released)         70294 non-null object
(album, engineer)              15295 non-null object
(album, favorites)             106574 non-null int64
(album, id)                    106574 non-null int64
(album, information)           83149 non-null object
(album, listens)               106574 non-null int64
(album, producer)              18060 non-null object
(album, tags)                  106574 non-null object
(album, title)                 105549 non-null object
(album, tracks)                106574 non-null int64
(album, type)                  100066 non-null object
(artist, active_year_begin)    22711 non-null object
(artist, active_year_end)      5375 non-null object
(artist, associated_labels)    14271 non-null object
(artist, bio)                  71156 n

In [93]:
for dtype in ['float','int','object']:
    selected_dtype = tracks.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))

Average memory usage for float columns: 0.81 MB
Average memory usage for int columns: 0.81 MB
Average memory usage for object columns: 13.88 MB
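
Object (string) columns dominate the memory footprint. Two common remedies, sketched here on a made-up frame rather than the real `tracks` data, are converting low-cardinality strings to the `category` dtype and downcasting integers:

```python
import pandas as pd

# Made-up data: a repetitive string column and a small-range integer column
toy = pd.DataFrame({
    'genre_top': ['Rock', 'Pop', 'Rock', 'Folk'] * 2500,
    'listens': list(range(10000)),
})
before = toy.memory_usage(deep=True).sum()

toy['genre_top'] = toy['genre_top'].astype('category')                # store codes, not strings
toy['listens'] = pd.to_numeric(toy['listens'], downcast='unsigned')   # smallest unsigned int that fits
after = toy.memory_usage(deep=True).sum()

print(before, after, toy.dtypes.tolist())
```

The `category` conversion only pays off when a column has few distinct values relative to its length; free-text columns such as `bio` would not shrink this way.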


### Fit Logistic Regression!

In [158]:
# to start with, I'll use all non-NaN, non-genre features;
# first keep only the rows that have a genre_top label
df_track = tracks['track'].copy()
df_track = df_track[df_track['genre_top'].notna()]
df_track.head(10)

track_id,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
2,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
10,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
134,256000,0,,2008-11-26 01:43:19,2008-11-26 00:00:00,207,3,Hip-Hop,[21],[21],,1126,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,943,,5,,[],Street Music
135,256000,1,,2008-11-26 01:43:26,2008-11-26 00:00:00,837,0,Rock,"[45, 58]","[58, 12, 45]",,2484,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1832,,0,,[],Father's Day
136,256000,1,,2008-11-26 01:43:35,2008-11-26 00:00:00,509,0,Rock,"[45, 58]","[58, 12, 45]",,1948,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1498,,0,,[],Peel Back The Mountain Sky
137,256000,0,,2008-11-26 01:43:42,1978-04-27 00:00:00,1233,2,Experimental,"[1, 32]","[32, 1, 38]",<p>Recorded live in downtown Los Angeles at th...,2559,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1278,,1,,['lafms'],Side A
138,256000,0,,2008-11-26 01:43:56,1978-04-27 00:00:00,1231,2,Experimental,"[1, 32]","[32, 1, 38]",<p>Recorded live in downtown Los Angeles at th...,1909,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,489,,2,,['lafms'],Side B
139,128000,0,,2008-11-26 01:44:05,2008-11-26 00:00:00,296,3,Folk,[17],[17],,702,en,Attribution-Noncommercial-No Derivative Works ...,582,,2,,[],CandyAss


In [169]:
# I'm going to create a baseline model using df_track alone, since I'm not yet
# comfortable working with a multi-index

# df_track['genre_top']
target = 'genre_top'

# this gives me a list of all numeric columns (none of them contain NaNs here)
features = df_track.select_dtypes(include='number').columns.tolist()




In [170]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


def run_log_model(X, y):
    # Split into test and train data
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.20, random_state=42)  
  
    # Fit model using train data
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train, y_train)
    
    #score model
    train_score = model.score(X_train, y_train)
    print('Train Accuracy Score: ', train_score)
    
    test_score = model.score(X_test, y_test)
    print('Test Accuracy Score: ', test_score)

  
  

In [173]:
X = df_track[features]
y = df_track[target]

run_log_model(X,y)



Train Accuracy Score:  0.35969554917082514
Test Accuracy Score:  0.35443548387096774


In [181]:
# now I have a baseline of 0.35, so I will start thinking about ways to improve accuracy.

# df_track = df_track[df_track['genre_top'].notna()]


# separate the multi-index into new dataframes
df_album = tracks['album']
df_artist = tracks['artist']


# get the numeric columns
album_features = df_album.select_dtypes(include='number').columns.tolist()

artist_features = df_artist.select_dtypes(include='number').columns.tolist()

In [193]:
# now re-add these features into my Xs

X = pd.merge(pd.merge(df_track[features], df_album[album_features]), df_artist[artist_features])

y = df_track[target]

run_log_model(X,y)

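# in retrospect, it's not surprising that this doesn't work: pd.merge without `on` joins on shared
# column names rather than on track_id, and y was filtered to rows with a genre_top while the album
# and artist frames were not, so X and y end up with different numbers of rows.
# This should instead be done within a single dataframe.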

ValueError: Found input variables with inconsistent numbers of samples: [48263, 49598]

In [225]:
# separate the multi-index into new dataframes
tracks = tracks.sort_index()
df_album = tracks['album']
df_artist = tracks['artist']
df_track = tracks['track']


df_flat = pd.merge(pd.merge(df_album, df_artist, on='track_id'), df_track, on='track_id')
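
Merging the three groups produces `_x`/`_y` suffixes wherever column names collide. An alternative sketch, on a tiny made-up frame with the same two-level column layout, flattens the MultiIndex into `group_name` labels so every row keeps its `track_id` and no column name is ambiguous:

```python
import pandas as pd

# Made-up frame mimicking the (group, name) column layout of tracks.csv
columns = pd.MultiIndex.from_tuples(
    [('album', 'listens'), ('artist', 'favorites'), ('track', 'listens'), ('track', 'genre_top')]
)
toy_tracks = pd.DataFrame(
    [[6073, 9, 1293, 'Hip-Hop'], [47632, 74, 50135, 'Pop']],
    index=pd.Index([2, 10], name='track_id'),
    columns=columns,
)

flat = toy_tracks.copy()
flat.columns = ['_'.join(pair) for pair in flat.columns]  # ('album', 'listens') -> 'album_listens'
print(flat.columns.tolist())
# ['album_listens', 'artist_favorites', 'track_listens', 'track_genre_top']
```

Because no rows are joined, the row count and index are unchanged, which also addresses the "clean up the variable names" task.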

In [226]:
df_flat.isna().sum()
# ok, so I can now try to use all numerics here for more features

comments_x                0
date_created_x         3529
date_released         36280
engineer              91279
favorites_x               0
id_x                      0
information_x         23425
listens_x                 0
producer              88514
tags_x                    0
title_x                1025
tracks                    0
type                   6508
active_year_begin     83863
active_year_end      101199
associated_labels     92303
bio                   35418
comments_y                0
date_created_y          856
favorites_y               0
id_y                      0
latitude              62030
location              36364
longitude             62030
members               59725
name                      0
related_projects      93422
tags_y                    0
website               27318
wikipedia_page       100993
bit_rate                  0
comments                  0
composer             102904
date_created              0
date_recorded        100415
duration            

In [243]:
# same target
target = 'genre_top'

# drop rows with a missing target, and drop the latitude/longitude columns
df_clean = df_flat[df_flat['genre_top'].notna()].drop(columns=['latitude', 'longitude'])

# this gives me a list of all numeric columns
features = df_clean.select_dtypes(include='number').columns.tolist()


X = df_clean[features]

y = df_clean[target]

run_log_model(X,y)



Train Accuracy Score:  0.38419275165078887
Test Accuracy Score:  0.37691532258064514




In [252]:
# that was a lot of warnings for a low accuracy score
# the warnings suggest I might try standardizing my data:

from sklearn.preprocessing import StandardScaler

# standardize the features, then fit the same logistic regression

def run_stan_log_model(X, y):
    # Split into test and train data
    # (note: scaling before the split lets test rows influence the scaler)
    X_train, X_test, y_train, y_test = train_test_split(StandardScaler().fit_transform(X),
                                                        y, train_size=0.80, test_size=0.20, random_state=42)

    # Fit model using train data
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train, y_train)

    # Score model
    train_score = model.score(X_train, y_train)
    print('Train Accuracy Score: ', train_score)

    test_score = model.score(X_test, y_test)
    print('Test Accuracy Score: ', test_score)

    # coef_ has shape (n_classes, n_features), so it needs a DataFrame, not a Series
    # (pd.Series raised "ValueError: Length of passed values is 16, index implies 15")
    coefs = pd.DataFrame(model.coef_, index=model.classes_, columns=X.columns)

    # Return the model as well: it is local to this function, so using `model`
    # outside it raised "NameError: name 'model' is not defined"
    return model, coefs

model, coefs = run_stan_log_model(X, y)

Train Accuracy Score:  0.4137557336559302
Test Accuracy Score:  0.40625
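
Fitting the scaler on the full data before splitting lets test-set statistics influence training. One way to avoid that, sketched here on synthetic data from `make_classification` (an assumption standing in for the FMA features), is to put the scaler and the model in a pipeline so the scaler is fit on the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real features and genre labels
X_toy, y_toy = make_classification(n_samples=500, n_features=8, n_informative=5,
                                   n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_tr, y_tr)  # scaler statistics come from X_tr only

print('Train accuracy:', pipeline.score(X_tr, y_tr))
print('Test accuracy: ', pipeline.score(X_te, y_te))
```

The same pipeline object can be passed to `cross_val_score`, which then refits the scaler inside every fold.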

This dataset is bigger than many you've worked with so far, and while it should fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting, or [downcasting numeric datatypes](https://www.dataquest.io/blog/pandas-big-data/).
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers

If a model that classifies *all* genres doesn't perform well, it's totally OK to limit it to the most frequent genres, or to try combining or clustering genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
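
Restricting the problem to the most frequent genres can be sketched as follows. The toy labels are made up, and `n_top` is a hypothetical parameter to tune:

```python
import pandas as pd

# Made-up labels: two common genres and two rare ones
toy = pd.DataFrame({
    'genre_top': ['Rock'] * 5 + ['Electronic'] * 4 + ['Folk'] * 2 + ['Blues'],
    'duration': range(12),
})

n_top = 2  # hypothetical choice; try a few values
top_genres = toy['genre_top'].value_counts().nlargest(n_top).index
subset = toy[toy['genre_top'].isin(top_genres)]
print(sorted(subset['genre_top'].unique()), len(subset))
# ['Electronic', 'Rock'] 9
```

Fewer classes makes the task easier, so compare the resulting accuracy against the majority-class rate of the reduced dataset rather than against the all-genre scores.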

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so you can explore it and see what sort of deeper analysis you can do, though this is more of a long-term project than a stretch goal and won't fit in Colab.