# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

In [202]:
# Already installed locally
#!pip install kaggle

In [203]:
# Note - you'll also have to sign up for Kaggle and authorize the API
# https://github.com/Kaggle/kaggle-api#api-credentials

# This essentially means uploading a kaggle.json file
# For Colab we can have it in Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# %env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

# You also have to join the Titanic competition to have access to the data
# !kaggle competitions download -c titanic

### Read and Describe data

In [204]:
import pandas as pd

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [205]:
# 'Survived' target feature is already dropped from test_df for Kaggle submission
print(train_df.shape, test_df.shape)
train_df.sample(5)

(891, 12) (418, 11)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
100,101,0,3,"Petranec, Miss. Matilda",female,28.0,0,0,349245,7.8958,,S
158,159,0,3,"Smiljanic, Mr. Mile",male,,0,0,315037,8.6625,,S
136,137,1,1,"Newsom, Miss. Helen Monypeny",female,19.0,0,2,11752,26.2833,D47,S
133,134,1,2,"Weisz, Mrs. Leopold (Mathilde Francoise Pede)",female,29.0,1,0,228414,26.0,,S
48,49,0,3,"Samaan, Mr. Youssef",male,,2,0,2662,21.6792,,C


In [206]:
# Split train_df descriptions into numeric and non-numeric
train_df.describe(include='number')

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [207]:
train_df.describe(exclude='number')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Beane, Mr. Edward",male,347082,G6,S
freq,1,577,7,4,644


In [208]:
# What percentage survived?
train_df['Survived'].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

### Linear Regression

In [209]:
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer

# Use features that are already of numeric type
features = ['Pclass', 'Age', 'Fare']
target = 'Survived'
X_train = train_df[features]
X_test = test_df[features]
y_train = train_df[target]

imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Reminder: for a regressor, model.score() returns the R-squared statistic.
# R-squared is the proportion of the variance in the dependent variable
# that is predictable from the independent variables.
linear_reg = LinearRegression().fit(X_train_imputed, y_train)

In [210]:
X_train.shape, X_train_imputed.shape, X_test.shape, X_test_imputed.shape

((891, 3), (891, 3), (418, 3), (418, 3))

In [211]:
X_train['Age'].mean()

29.69911764705882

In [212]:
X_test['Age'].mean()

30.272590361445783

In [213]:
X_train.head(6)

Unnamed: 0,Pclass,Age,Fare
0,3,22.0,7.25
1,1,38.0,71.2833
2,3,26.0,7.925
3,1,35.0,53.1
4,3,35.0,8.05
5,3,,8.4583


In [214]:
# SimpleImputer() uses the 'mean' strategy by default: notice the Age in the last row
X_train_imputed[:6]

array([[ 3.        , 22.        ,  7.25      ],
       [ 1.        , 38.        , 71.2833    ],
       [ 3.        , 26.        ,  7.925     ],
       [ 1.        , 35.        , 53.1       ],
       [ 3.        , 35.        ,  8.05      ],
       [ 3.        , 29.69911765,  8.4583    ]])

In [215]:
X_test.tail()

Unnamed: 0,Pclass,Age,Fare
413,3,,8.05
414,1,39.0,108.9
415,3,38.5,7.25
416,3,,8.05
417,3,,22.3583


In [216]:
# Notice that SimpleImputer fills NaNs in the test data with the *train* means.
# This is the proper method: the imputer is fit on the training data only,
# so no information from the test set leaks into preprocessing.
X_test_imputed[-5:]

array([[  3.        ,  29.69911765,   8.05      ],
       [  1.        ,  39.        , 108.9       ],
       [  3.        ,  38.5       ,   7.25      ],
       [  3.        ,  29.69911765,   8.05      ],
       [  3.        ,  29.69911765,  22.3583    ]])
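
To make the fit/transform split concrete, here is a minimal sketch on toy data (the values and the `toy_` names are made up for illustration, not taken from the Titanic files). It shows that the fill value is learned from the training array and then reused, unchanged, on the test array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not from the Titanic files)
toy_train = np.array([[1.0], [3.0], [np.nan]])
toy_test = np.array([[np.nan], [10.0]])

toy_imputer = SimpleImputer(strategy='mean')
toy_imputer.fit(toy_train)          # learns the train mean: (1 + 3) / 2 = 2
print(toy_imputer.statistics_)      # the fill value stored per column

toy_test_filled = toy_imputer.transform(toy_test)
print(toy_test_filled)              # the NaN becomes 2, not the test mean of 10
```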

In [217]:
linear_reg.predict(test_df[['Pclass', 'Age', 'Fare']].dropna())

array([0.18476056, 0.09216475, 0.19420436, 0.24057013, 0.2800808 ,
       0.33664465, 0.21772081, 0.47358573, 0.30571491, 0.29634405,
       0.53456631, 0.74603759, 0.19909257, 0.55361492, 0.48734115,
       0.39486551, 0.28363816, 0.24001647, 0.1070494 , 0.49341948,
       0.36888899, 0.74507271, 0.69730135, 0.07572154, 0.73816528,
       0.27260136, 0.57473387, 0.29474482, 0.49017826, 0.20536585,
       0.67612743, 0.30203598, 0.28471732, 0.25591458, 0.1558448 ,
       0.13695006, 0.4321428 , 0.56185887, 0.2547322 , 0.54470183,
       0.46931104, 0.17978265, 0.72196373, 0.45574283, 0.51322862,
       0.84492785, 0.38101538, 0.18113163, 0.25452575, 0.78559573,
       0.3135732 , 0.41780243, 0.30610588, 0.27665565, 0.95482663,
       0.30620288, 0.53952021, 0.64683924, 0.60947616, 0.26195869,
       0.28414174, 0.22530074, 0.66462079, 0.75476086, 0.77315552,
       0.46812031, 0.4321428 , 0.26195869, 0.40002749, 0.52702595,
       0.51249272, 0.23998209, 0.3063311 , 0.6456702 , 0.28129

In [218]:
pd.Series(linear_reg.coef_, X_train.columns)

Pclass   -0.210390
Age      -0.007358
Fare      0.000751
dtype: float64

In [219]:
import numpy as np

test_case = np.array([[1, 5, 500]])  # Rich 5-year old in first class
linear_reg.predict(test_case)

array([1.19207871])
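
A "survival probability" above 1 is the problem in a nutshell. The sketch below reproduces it on a tiny made-up dataset (the `toy_` arrays are assumptions for illustration): a straight line keeps going past 0 and 1 for extreme inputs, while logistic regression's predicted probabilities stay inside the unit interval.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy one-feature dataset (hypothetical): the label flips from 0 to 1 as x grows
toy_X = np.array([[0], [1], [2], [3], [4], [5]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_linear = LinearRegression().fit(toy_X, toy_y)
toy_logistic = LogisticRegression().fit(toy_X, toy_y)

# Inputs far outside the range seen during fitting
extreme_X = np.array([[-5], [10]])
linear_out = toy_linear.predict(extreme_X)
logistic_out = toy_logistic.predict_proba(extreme_X)[:, 1]

print(linear_out)    # one value below 0 and one above 1
print(logistic_out)  # both values between 0 and 1
```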

### How would we fit this data with Logistic Regression?

[sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [220]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs').fit(X_train_imputed, y_train)
print('Prediction for rich 5 yr old: ', log_reg.predict(test_case))
print('Predicted probabilities for rich 5 yr olds (Die/Survive): ', log_reg.predict_proba(test_case))

Prediction for rich 5 yr old:  [1]
Predicted probabilities for rich 5 yr olds (Die/Survive):  [[0.02778799 0.97221201]]


In [221]:
help(log_reg.predict)

Help on method predict in module sklearn.linear_model.base:

predict(X) method of sklearn.linear_model.logistic.LogisticRegression instance
    Predict class labels for samples in X.
    
    Parameters
    ----------
    X : array_like or sparse matrix, shape (n_samples, n_features)
        Samples.
    
    Returns
    -------
    C : array, shape [n_samples]
        Predicted class label per sample.



In [222]:
log_reg.predict(test_df[['Pclass', 'Age', 'Fare']].dropna())

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,

In [223]:
threshold = 0.5
manual_predictions = (log_reg.predict_proba(X_test_imputed)[:, 1] > threshold).astype(int)
direct_predictions = log_reg.predict(X_test_imputed)

all(manual_predictions == direct_predictions)

True
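
Since `predict` is just `predict_proba` cut at 0.5, you are free to pick a different cutoff. Here is a small sketch with made-up probabilities (the `classify` helper is hypothetical, not a scikit-learn function) showing how a stricter threshold yields fewer positive predictions.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
toy_probabilities = np.array([0.2, 0.5, 0.7, 0.9])

def classify(probabilities, threshold):
    """Label as 1 when the probability is strictly above the threshold."""
    return (probabilities > threshold).astype(int)

default_labels = classify(toy_probabilities, 0.5)
strict_labels = classify(toy_probabilities, 0.8)
print(default_labels)  # [0 0 1 1]
print(strict_labels)   # [0 0 0 1]
```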

### How accurate is this logistic regression model?

In [224]:
# Use different metrics for classification vs regression: 
# R-squared only makes sense for regression and
# accuracy only makes sense for classification

score = log_reg.score(X_train_imputed, y_train)
print('Train accuracy score: ', score)

Train accuracy score:  0.7025813692480359


In [225]:
print('Total predictions: ', X_train_imputed.shape)

Total predictions:  (891, 3)


In [226]:
y_pred = log_reg.predict(X_train_imputed)
len(y_pred)

891

In [227]:
print(y_pred[:5])
print(y_train[:5].values)
print('Correct predictions: ', 1 + 1 + 0 + 1 + 1)
print('Accuracy: ', 4/5)

[0 1 0 1 0]
[0 1 1 1 0]
Correct predictions:  4
Accuracy:  0.8


In [228]:
from sklearn.metrics import accuracy_score

accuracy_score(y_train[:5], y_pred[:5])

0.8
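
The same number falls out of a one-line numpy comparison: accuracy is just the mean of the element-wise matches. This sketch retypes the five predictions and labels printed above (the variable names are new, for illustration only).

```python
import numpy as np

# The first five predictions and true labels, copied from the output above
y_pred_first5 = np.array([0, 1, 0, 1, 0])
y_true_first5 = np.array([0, 1, 1, 1, 0])

matches = y_pred_first5 == y_true_first5   # True where the prediction is right
manual_accuracy = matches.mean()           # fraction of True values
print(matches.sum(), manual_accuracy)
```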

In [229]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(log_reg, X_train_imputed, y_train, cv=10)
print('Cross-Validation Accuracy Scores:\n', scores)

Cross-Validation Accuracy Scores:
 [0.63333333 0.62222222 0.68539326 0.71910112 0.69662921 0.69662921
 0.76404494 0.75280899 0.73033708 0.71590909]


In [230]:
# The range of accuracies for this logistic regression model
scores = pd.Series(scores)
scores.min(), scores.mean(), scores.max()

(0.6222222222222222, 0.7016408466689366, 0.7640449438202247)
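
If `cross_val_score` feels like a black box, the sketch below spells out roughly what it does for a classifier, on synthetic data from `make_classification` (an assumption for illustration, not the Titanic data): split into stratified folds, fit a fresh copy of the model on each training part, and score it on the held-out part.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical)
toy_X, toy_y = make_classification(n_samples=200, n_features=4, random_state=0)
toy_model = LogisticRegression()

manual_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=5).split(toy_X, toy_y):
    # Fit a fresh, unfitted copy on this fold's training rows
    fold_model = clone(toy_model).fit(toy_X[train_idx], toy_y[train_idx])
    # Accuracy on the rows held out from this fold
    manual_scores.append(fold_model.score(toy_X[valid_idx], toy_y[valid_idx]))

library_scores = cross_val_score(toy_model, toy_X, toy_y, cv=5)
print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Each score comes from rows the model did not see while fitting, which is why the spread across folds is a more honest picture than the single train accuracy.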

### What's the math?

In [231]:
log_reg.coef_

array([[-0.9345267 , -0.03569729,  0.00422069]])

In [232]:
log_reg.intercept_

array([2.55763985])

In [233]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [234]:
sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(test_case)))

array([[0.97221201]])

So, clearly a more appropriate model in this situation! For more on the math, [see this Wikipedia example](https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study).
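
To see where 0.972 comes from, here is the same calculation written out by hand. The intercept and coefficients are retyped from the outputs above (rounded as printed, so the result matches only to several decimals); the variable names are new.

```python
import numpy as np

# Values copied from the log_reg.intercept_ and log_reg.coef_ outputs above
intercept = 2.55763985
coefficients = np.array([-0.9345267, -0.03569729, 0.00422069])  # Pclass, Age, Fare
passenger = np.array([1, 5, 500])  # same values as test_case

# Linear part: intercept + sum of coefficient * feature (the log-odds)
log_odds = intercept + coefficients @ passenger
# The sigmoid squashes the log-odds into a probability
probability = 1 / (1 + np.exp(-log_odds))
print(log_odds, probability)
```

The linear part alone is about 3.55, which is not a probability at all; only after the sigmoid does it become a survival probability of roughly 0.972.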

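For the live portion, let's tackle [another classification dataset on absenteeism](http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work). It has 21 classes, but remember that scikit-learn's LogisticRegression automatically handles more than two classes. How? With the one-vs-rest scheme (the default in the scikit-learn version used here), it fits one binary model per label, treating that label as the positive class (1) and all other labels together as the negative class (0), and then predicts the label whose model gives the highest score.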
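
Here is a minimal sketch of the one-vs-rest idea, using scikit-learn's `OneVsRestClassifier` wrapper and the built-in iris data as a stand-in (both are assumptions for illustration, not the absenteeism data). Newer scikit-learn releases may fit a single multinomial model by default instead, so the explicit wrapper keeps the behavior unambiguous.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three-class stand-in dataset
iris_X, iris_y = load_iris(return_X_y=True)

# One binary logistic model is fit per class: "this class" (1) vs. "everything else" (0)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(iris_X, iris_y)

n_binary_models = len(ovr_model.estimators_)
iris_probabilities = ovr_model.predict_proba(iris_X[:3])
print(n_binary_models)           # one fitted model per class
print(iris_probabilities.shape)  # one column of scores per class
```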

In [235]:
# Live - let's try absenteeism!

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and try to use to classify genre of tracks. To get you started:

In [236]:
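def duplicate_feature_deleter(data: pd.DataFrame):
    """Drop suffixed columns ('.1', '.2', '.3') that duplicate their base column."""
    df = data.copy()
    for column in df.columns:
        if column.endswith(('.1', '.2', '.3')):
            base = column[:-2]  # column name without the suffix pandas appended
            # Series.equals treats NaNs in the same positions as equal
            if base in df.columns and df[column].equals(df[base]):
                df = df.drop(columns=column)
    return df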
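
Where do names like `comments.1` come from? When a CSV header repeats a name, `pd.read_csv` appends a numeric suffix to the later copies. The sketch below shows this on a made-up CSV string (the data and variable names are hypothetical) and how to recover the base name so the two columns can be compared.

```python
import io
import pandas as pd

# Hypothetical CSV with a repeated header name
csv_text = "title,listens,title\nSong A,10,Song A\nSong B,20,Song B\n"
toy_df = pd.read_csv(io.StringIO(csv_text))
print(list(toy_df.columns))  # ['title', 'listens', 'title.1']

suffixed = 'title.1'
base_name = suffixed.rsplit('.', 1)[0]  # strip the '.1' suffix -> 'title'
# True only when every row of the suffixed column matches the base column
is_duplicate = bool((toy_df[suffixed] == toy_df[base_name]).all())
print(base_name, is_duplicate)
```

Note that a shared base name does not guarantee shared content, so it is worth comparing values before dropping anything.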

### Read, Clean, Merge

In [237]:
tracks = pd.read_csv('tracks.csv', header=1)
raw_tracks = pd.read_csv('raw_tracks.csv')
genres = pd.read_csv('raw_genres.csv')
echonest = pd.read_csv('raw_echonest.csv', header=2)

pd.set_option('display.max_columns', None)  # Unlimited columns



In [238]:
tracks = tracks.rename(columns={'Unnamed: 0': 'track_id'})
tracks = tracks.drop(tracks.index[0])
echonest = echonest.rename(columns={'Unnamed: 0': 'track_id'})
echonest = echonest.drop(echonest.index[0])
echonest.head(5)

Unnamed: 0,track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence,album_date,album_name,artist_latitude,artist_location,artist_longitude,artist_name,release,artist_discovery_rank,artist_familiarity_rank,artist_hotttnesss_rank,song_currency_rank,song_hotttnesss_rank,artist_discovery,artist_familiarity,artist_hotttnesss,song_currency,song_hotttnesss,000,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029,030,031,032,033,034,035,036,037,038,039,040,041,042,043,044,045,046,047,048,049,050,051,052,053,054,055,056,057,058,059,060,061,062,063,064,065,066,067,068,069,070,071,072,073,074,075,076,077,078,079,080,081,082,083,084,085,086,087,088,089,090,091,092,093,094,095,096,097,098,099,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223
1,2,0.416675,0.675894,0.634476,0.010628,0.177647,0.15931,165.922,0.576661,,,32.6783,"Georgia, US",-83.223,AWOL,AWOL - A Way Of Life,,,,,,0.38899,0.38674,0.40637,0.0,0.0,0.877233,0.588911,0.354243,0.29509,0.298413,0.30943,0.304496,0.334579,0.249495,0.259656,0.318376,0.371974,1.0,0.571,0.278,0.21,0.215,0.2285,0.2375,0.279,0.1685,0.1685,0.279,0.3325,0.049848,0.104212,0.06023,0.05229,0.047403,0.052815,0.052733,0.062216,0.051613,0.057399,0.053199,0.062583,0.036,0.018,0.017,0.021,0.021,0.01,0.015,0.041,0.01,0.009,0.021,0.013,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.964,0.982,0.983,0.979,0.979,0.99,0.985,0.959,0.99,0.991,0.979,0.987,-1.899342,-0.032654,0.878469,1.147538,0.950856,0.948257,1.157887,1.147911,1.646318,1.530193,1.197568,0.745673,2.510038,-1.500183,0.03054,0.694242,0.170432,0.064695,0.874727,0.722576,2.25132,1.708159,1.054857,0.020675,42.949131,44.387436,32.409389,15.668667,10.114028,-4.069252,2.042353,2.188321,-3.805923,-0.494699,6.02467,10.692599,44.442501,42.3885,31.684999,9.9875,9.568501,-7.1485,3.8315,1.8505,-2.6875,-0.8,5.4615,10.2565,39.49482,1966.979126,1825.123047,1903.756714,828.810059,911.155823,581.01532,722.001404,404.682556,315.528473,376.632416,229.282547,0.0,-110.367996,-100.605003,-112.581001,-75.882004,-89.160004,-80.737999,-91.498001,-66.649002,-61.845001,-66.081001,-58.043999,52.006001,216.237,208.423004,145.194,97.482002,98.723,68.091003,101.588997,69.505997,58.227001,69.262001,58.175999,52.006001,326.60498,309.028015,257.774994,173.364014,187.882996,148.82901,193.087006,136.154999,120.072006,135.343002,116.220001,-2.952152,0.060379,0.525976,0.365915,0.018182,0.454431,-0.330007,0.149395,-0.214859,0.030427,-0.153877,-0.150132,13.206213,1.009934,1.577194,0.337023,0.097149,0.40126,0.006324,0.643486,0.012059,0.237947,0.655938,1.213864,-12.486146,-11.2695,46.031261,-60.0,-3.933,56.067001,-2.587475,11.802585,0.04797,0.038275,0.000988,0.0,0.2073,0.2073,1.603659,2.984276,-21.812077,-20.312,49.157482,-60.0,-9.691,50.308998,-1.992303,6.805694,0.23307,0.19288,0.027455,0.06408,3.67696,3.61288,13.31669,262.929749
2,3,0.374408,0.528643,0.817461,0.001851,0.10588,0.461818,126.957,0.26924,,,32.6783,"Georgia, US",-83.223,AWOL,AWOL - A Way Of Life,,,,,,0.38899,0.38674,0.40637,0.0,0.0,0.534429,0.537414,0.443299,0.390879,0.344573,0.366448,0.419455,0.747766,0.460901,0.392379,0.474559,0.406729,0.506,0.5145,0.387,0.3235,0.2805,0.3135,0.3455,0.898,0.4365,0.3385,0.398,0.348,0.079207,0.083319,0.073595,0.071024,0.056679,0.066113,0.073889,0.0881,0.071305,0.059275,0.088222,0.067298,0.04,0.04,0.029,0.021,0.009,0.02,0.02,0.053,0.022,0.032,0.034,0.028,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.96,0.96,0.971,0.979,0.991,0.98,0.98,0.947,0.978,0.968,0.966,0.972,0.197378,0.182772,0.608394,0.785822,0.773479,0.650546,0.574605,-0.767804,0.326266,0.808435,0.506893,0.682139,-1.157548,-1.170147,-0.683188,-0.366574,-0.245357,-0.547616,-0.809711,-0.894495,-0.808738,-0.130317,-1.056485,-0.510763,44.097282,48.790241,10.584317,-2.565454,-1.292831,-4.490149,-0.633741,2.112556,-7.288055,-5.132599,-2.390409,11.447978,45.414501,51.979,6.598,0.28,-0.012,-8.0495,-0.683,1.031,-7.0775,-4.321,-0.9275,9.0895,22.519407,1694.821167,1256.443848,1251.588257,907.545166,588.602051,619.97168,679.588013,301.723358,401.670532,294.221619,411.2995,0.0,-82.774002,-137.591003,-131.729004,-106.242996,-73.445,-83.468002,-81.322998,-71.073997,-108.565002,-71.498001,-51.351002,51.366001,190.334,114.844002,359.941986,101.862,111.082001,78.648003,103.000999,46.223,53.993999,61.181,90.429001,51.366001,273.108002,252.434998,491.67099,208.104996,184.527008,162.115997,184.324005,117.296997,162.559006,132.679001,141.779999,-1.827564,-0.083561,0.162382,0.829534,-0.164874,0.89774,-0.058807,0.365381,-0.131389,-0.245579,-0.33528,0.61312,8.424043,0.230834,0.614212,11.627348,1.015813,1.627731,0.032318,0.819126,-0.030998,0.73461,0.458883,0.999964,-12.502044,-11.4205,26.468552,-60.0,-5.789,54.210999,-1.755855,7.895351,0.057707,0.04536,0.001397,0.0,0.3395,0.3395,2.271021,9.186051,-20.185032,-19.868,24.002327,-60.0,-9.679,50.320999,-1.582331,8.889308,0.258464,0.220905,0.081368,0.06413,6.08277,6.01864,16.673548,325.581085
3,5,0.043567,0.745566,0.70147,0.000697,0.373143,0.124595,100.26,0.621661,,,32.6783,"Georgia, US",-83.223,AWOL,AWOL - A Way Of Life,,,,,,0.38899,0.38674,0.40637,0.0,0.0,0.548093,0.720192,0.389257,0.344934,0.3613,0.402543,0.434044,0.388137,0.512487,0.525755,0.425371,0.446896,0.511,0.772,0.361,0.288,0.331,0.372,0.359,0.279,0.443,0.484,0.368,0.397,0.081051,0.0783,0.048697,0.056922,0.045264,0.066819,0.094489,0.08925,0.098089,0.084133,0.068866,0.086224,0.023,0.023,0.024,0.021,0.023,0.02,0.029,0.022,0.04,0.026,0.032,0.016,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.977,0.977,0.976,0.979,0.977,0.98,0.971,0.978,0.96,0.974,0.968,0.984,0.192822,-0.671701,0.716431,1.177679,0.629193,0.478121,0.63328,0.862866,0.344354,0.26409,0.673426,0.461554,-1.079328,-0.744193,0.101992,0.86234,-0.294693,-0.819228,-0.859957,-0.54047,-1.28792,-1.182708,-0.56651,-0.931232,40.274662,6.508874,-7.265673,-4.340635,10.407271,-7.618135,5.729183,0.286108,10.888865,0.241682,-3.912109,1.756583,41.426998,7.836,-10.48,-5.742,10.468,-10.38,5.908,0.906,12.029,-1.319,-3.049,0.987,27.555271,1956.254395,1382.757202,1596.684204,983.111511,945.404419,659.338928,835.013123,399.154694,483.607666,378.464874,244.344467,0.0,-138.438004,-137.203995,-126.450996,-89.247002,-198.056,-78.342003,-89.745003,-50.596001,-81.189003,-57.694,-57.312,48.240002,211.490005,98.503998,316.550995,92.763,212.559006,69.366997,126.810997,63.868,73.842003,60.088001,70.402,48.240002,349.928009,235.707993,443.001984,182.01001,410.61499,147.709,216.556,114.464005,155.031006,117.781998,127.714005,-2.893349,0.052129,0.169777,0.796858,-0.16451,0.46476,-0.211987,0.027119,-0.21524,0.083052,-0.004778,0.114815,12.998166,1.258411,-0.105143,5.284808,-0.250734,4.719755,-0.183342,0.340812,-0.29597,0.099103,0.098723,1.389372,-15.458095,-14.105,35.955223,-60.0,-7.248,52.751999,-2.505533,9.716598,0.058608,0.0457,0.001777,0.0,0.29497,0.29497,1.827837,5.253727,-24.523119,-24.367001,31.804546,-60.0,-12.582,47.417999,-2.288358,11.527109,0.256821,0.23782,0.060122,0.06014,5.92649,5.86635,16.013849,356.755737
4,10,0.95167,0.658179,0.924525,0.965427,0.115474,0.032985,111.562,0.96359,2008-03-11,Constant Hitmaker,39.9523,"Philadelphia, PA, US",-75.1624,Kurt Vile,Constant Hitmaker,2635.0,2544.0,397.0,115691.0,67609.0,0.557339,0.614272,0.798387,0.005158,0.354516,0.311404,0.711402,0.321914,0.500601,0.250963,0.321316,0.73425,0.325188,0.373012,0.23584,0.368756,0.440775,0.263,0.736,0.273,0.426,0.214,0.288,0.81,0.246,0.295,0.164,0.311,0.386,0.033969,0.070692,0.039161,0.095781,0.024102,0.028497,0.073847,0.045103,0.065468,0.041634,0.041619,0.084442,0.027,0.081,0.035,0.025,0.033,0.008,0.099,0.038,0.022,0.009,0.04,0.019,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.973,0.919,0.965,0.975,0.967,0.992,0.901,0.962,0.978,0.991,0.96,0.981,1.373789,-0.422791,1.210829,0.40199,1.838612,1.161654,-0.547191,1.386647,1.216524,2.3209,1.221092,0.474827,2.1177,-1.080739,1.304154,-1.201286,4.986209,1.730812,-1.124398,1.428501,0.490017,5.137411,1.125351,-0.976438,46.439899,6.738814,72.586601,20.438717,-32.377827,-3.392143,-1.74723,-9.380472,1.754843,-11.147522,5.879513,3.324857,47.187,3.536,74.306,19.589001,-36.053001,-3.497,-0.418,-9.842,0.898,-9.96,6.12,3.911,22.193727,1392.732788,572.282837,694.788147,880.852112,344.104462,395.002258,306.603455,162.284821,283.120117,125.487427,139.459778,0.0,-174.085999,-61.278999,-95.272003,-91.910004,-56.064999,-101.487999,-70.795998,-40.389999,-58.234001,-95.463997,-43.458,52.193001,171.130005,131.085999,298.596985,159.093002,105.116997,73.433998,79.760002,49.598999,63.230999,69.753998,35.388,52.193001,345.216003,192.36499,393.868988,251.003006,161.181992,174.921997,150.556,89.988998,121.464996,165.217987,78.846001,-4.515986,0.082999,-0.471422,2.171539,1.747735,0.435429,-0.603002,0.200987,0.12711,-0.005297,-0.956349,-0.287195,30.331905,2.051292,1.123436,22.177616,7.889378,1.809147,2.219095,1.51843,0.654815,0.650727,12.656473,0.406731,-10.24489,-9.464,20.304308,-60.0,-5.027,54.973,-5.365219,41.201279,0.048938,0.0408,0.002591,0.0,0.89574,0.89574,10.539709,150.359985,-16.472773,-15.903,27.53944,-60.0,-9.025,50.974998,-3.662988,21.508228,0.283352,0.26707,0.125704,0.08082,8.41401,8.33319,21.317064,483.403809
5,134,0.452217,0.513238,0.56041,0.019443,0.096567,0.525519,114.29,0.894072,,,32.6783,"Georgia, US",-83.223,AWOL,AWOL - A Way Of Life,,,,,,0.38899,0.38674,0.40637,0.0,0.0,0.610849,0.569169,0.428494,0.345796,0.37692,0.46059,0.401371,0.4499,0.428946,0.446736,0.479849,0.378221,0.614,0.545,0.363,0.28,0.311,0.397,0.317,0.404,0.356,0.38,0.42,0.292,0.085176,0.092242,0.073183,0.056354,0.062012,0.088343,0.077084,0.097942,0.10179,0.094533,0.089367,0.088544,0.003,0.012,0.003,0.004,0.01,0.015,0.005,0.006,0.016,0.014,0.013,0.007,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.997,0.988,0.997,0.996,0.99,0.985,0.995,0.994,0.984,0.986,0.987,0.993,-0.099188,0.033726,0.699476,1.039602,0.898852,0.505001,0.625378,0.394606,0.554671,0.524494,0.358942,0.864621,-1.25571,-1.326045,-0.515872,0.517893,-0.001662,-0.9677,-0.747463,-1.202028,-1.038381,-1.039335,-1.076293,-0.390393,40.35828,24.115547,20.227261,6.252143,-21.950306,5.266994,-5.009126,0.108464,12.665721,11.32714,-4.692913,-0.311117,41.785,19.591999,7.035,1.393,-23.08,-2.693,-3.409,-1.28,13.749,11.034,-3.839,-1.822,26.85305,4123.289551,4842.126953,2033.496216,1603.765381,1318.972656,709.53772,1050.203857,758.473206,563.139587,366.803772,409.632477,10.065,-162.100998,-137.953995,-137.983002,-146.548996,-105.698997,-84.077003,-99.051003,-71.302002,-76.054001,-78.458,-64.713997,47.804001,243.921005,251.854004,212.216003,107.835999,192.679001,82.653999,107.737,107.257004,89.804001,75.475998,71.638,37.739002,406.022003,389.807983,350.199005,254.384995,298.377991,166.731003,206.787994,178.559006,165.858002,153.93399,136.35199,-1.652579,0.30187,0.665983,0.78496,0.107662,1.03975,-0.137514,0.217578,-0.04449,0.011159,-0.265695,0.331218,3.168006,0.141561,-0.04771,1.916984,-0.139364,2.25103,-0.224826,0.050703,0.188019,0.24975,0.931698,0.766069,-15.145472,-14.151,19.988146,-40.209999,-7.351,32.859001,-1.632508,3.340982,0.05947,0.04856,0.001586,0.01079,0.42006,0.40927,2.763948,13.718324,-24.336575,-22.448999,52.783905,-60.0,-13.128,46.872002,-1.452696,2.356398,0.234686,0.19955,0.149332,0.0644,11.26707,11.20267,26.45418,751.147705


In [239]:
tracks.shape, raw_tracks.shape, genres.shape, echonest.shape

((106574, 53), (109727, 39), (164, 5), (14511, 250))

In [240]:
tracks.columns

Index(['track_id', 'comments', 'date_created', 'date_released', 'engineer',
       'favorites', 'id', 'information', 'listens', 'producer', 'tags',
       'title', 'tracks', 'type', 'active_year_begin', 'active_year_end',
       'associated_labels', 'bio', 'comments.1', 'date_created.1',
       'favorites.1', 'id.1', 'latitude', 'location', 'longitude', 'members',
       'name', 'related_projects', 'tags.1', 'website', 'wikipedia_page',
       'split', 'subset', 'bit_rate', 'comments.2', 'composer',
       'date_created.2', 'date_recorded', 'duration', 'favorites.2',
       'genre_top', 'genres', 'genres_all', 'information.1', 'interest',
       'language_code', 'license', 'listens.1', 'lyricist', 'number',
       'publisher', 'tags.2', 'title.1'],
      dtype='object')

In [241]:
raw_tracks.columns

Index(['track_id', 'album_id', 'album_title', 'album_url', 'artist_id',
       'artist_name', 'artist_url', 'artist_website', 'license_image_file',
       'license_image_file_large', 'license_parent_id', 'license_title',
       'license_url', 'tags', 'track_bit_rate', 'track_comments',
       'track_composer', 'track_copyright_c', 'track_copyright_p',
       'track_date_created', 'track_date_recorded', 'track_disc_number',
       'track_duration', 'track_explicit', 'track_explicit_notes',
       'track_favorites', 'track_file', 'track_genres', 'track_image_file',
       'track_information', 'track_instrumental', 'track_interest',
       'track_language_code', 'track_listens', 'track_lyricist',
       'track_number', 'track_publisher', 'track_title', 'track_url'],
      dtype='object')

In [242]:
genres.columns

Index(['genre_id', 'genre_color', 'genre_handle', 'genre_parent_id',
       'genre_title'],
      dtype='object')

In [243]:
echonest.columns

Index(['track_id', 'acousticness', 'danceability', 'energy',
       'instrumentalness', 'liveness', 'speechiness', 'tempo', 'valence',
       'album_date',
       ...
       '214', '215', '216', '217', '218', '219', '220', '221', '222', '223'],
      dtype='object', length=250)

In [244]:
# Combine separate CSVs into a 'master' DataFrame
fma_df = raw_tracks[['track_id', 'artist_id', 'artist_name', 'track_duration']]
fma_df = pd.merge(fma_df, tracks[['track_id', 'genre_top']], on='track_id', how='inner')
fma_df = pd.merge(fma_df, echonest[['track_id','acousticness','danceability','energy',
                                    'instrumentalness','liveness','speechiness','tempo','valence']],
                  on='track_id', how='inner')
fma_df.head(10)

Unnamed: 0,track_id,artist_id,artist_name,track_duration,genre_top,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence
0,26533,6718,Roman Stolyar & Ilia Belorukov,11:28,Jazz,0.992052,0.449904,0.187091,0.926967,0.153883,0.184544,82.752,0.143674
1,26534,6718,Roman Stolyar & Ilia Belorukov,13:02,Jazz,0.988513,0.299413,0.212382,0.472631,0.064746,0.141513,74.872,0.126995
2,26614,6732,Former Selv,04:05,Electronic,0.031106,0.793853,0.32114,0.88518,0.103475,0.075847,131.997,0.111437
3,26615,6733,Lissom,05:03,Electronic,0.845697,0.061916,0.015739,0.930647,0.107575,0.040837,191.853,0.037751
4,26616,6734,Yann Novak,05:32,Electronic,0.80459,0.164799,0.013755,0.934238,0.097474,0.053641,125.515,0.035151
5,26617,6735,Clinker,05:44,Electronic,0.705879,0.148969,2e-05,0.897722,0.086542,,132.192,0.12669
6,26620,3148,Kamran Sadeghi,05:00,Electronic,0.91528,0.186919,0.000191,0.903295,0.111539,0.08242,131.667,0.579779
7,26638,6742,Flowerheads,02:40,Rock,0.68341,0.375122,0.520872,0.97287,0.111027,0.094277,80.18,0.219157
8,26639,6742,Flowerheads,02:30,Rock,0.725093,0.475448,0.448794,0.971753,0.110724,0.035687,119.978,0.188971
9,26640,6742,Flowerheads,01:17,Rock,0.907694,0.696932,0.334671,0.967341,0.100748,0.138263,220.256,0.900989
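
An inner merge keeps only the keys present in both tables, which is why the merged frame can be much smaller than its inputs. Here is a minimal sketch on two made-up tables (the `toy_` frames and their values are hypothetical); the outer merge with `indicator=True` is a handy way to see which rows an inner merge would discard.

```python
import pandas as pd

# Hypothetical miniature versions of two of the tables
toy_tracks = pd.DataFrame({'track_id': [1, 2, 3], 'genre_top': ['Rock', 'Jazz', None]})
toy_features = pd.DataFrame({'track_id': [2, 3, 4], 'tempo': [120.0, 90.0, 140.0]})

# Inner: only track_ids found in BOTH tables survive (2 and 3)
toy_inner = pd.merge(toy_tracks, toy_features, on='track_id', how='inner')

# Outer + indicator: every track_id, labeled by which table(s) it came from
toy_outer = pd.merge(toy_tracks, toy_features, on='track_id', how='outer', indicator=True)

print(toy_inner)
print(toy_outer['_merge'].value_counts())
```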


In [245]:
fma_df.describe(include='number')

Unnamed: 0,artist_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence
count,7029.0,7029.0,7014.0,7029.0,7029.0,7029.0,6928.0,7029.0,7013.0
mean,9565.798122,0.4486711,0.501338,0.562889,0.595034,0.18157,0.092247,122.241313,0.434712
std,4642.427364,0.3812348,0.190171,0.269048,0.381595,0.149778,0.125025,34.282646,0.271589
min,4.0,9.491e-07,0.051435,2e-05,0.0,0.025916,0.022795,0.0,0.008695
25%,7134.0,0.04641772,0.358666,0.364828,0.115453,0.099536,0.03629,95.989,0.198255
50%,9076.0,0.3927885,0.506497,0.583819,0.814408,0.117868,0.048258,120.011,0.411838
75%,11423.0,0.8506437,0.646359,0.789315,0.909902,0.205161,0.081441,142.746,0.653344
max,20818.0,0.9957965,0.968645,0.999964,0.998016,0.98033,0.964377,249.616,0.99999


In [246]:
fma_df.describe(exclude='number')

Unnamed: 0,track_id,artist_name,track_duration,genre_top
count,7029,7029,7029,4411
unique,7029,1666,672,12
top,107986,51%,03:06,Rock
freq,1,52,48,1820


In [247]:
imputer = SimpleImputer(strategy='mean')
imputer.fit(fma_df.select_dtypes(include='number'))
fma_df_imputed = imputer.transform(fma_df.select_dtypes(include='number'))

In [248]:
fma_df_imputed.shape

(7029, 9)

In [249]:
imputed_features = fma_df.select_dtypes(include='number').columns
fma_df[imputed_features] = fma_df_imputed
fma_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7029 entries, 0 to 7028
Data columns (total 13 columns):
track_id            7029 non-null object
artist_id           7029 non-null float64
artist_name         7029 non-null object
track_duration      7029 non-null object
genre_top           4411 non-null object
acousticness        7029 non-null float64
danceability        7029 non-null float64
energy              7029 non-null float64
instrumentalness    7029 non-null float64
liveness            7029 non-null float64
speechiness         7029 non-null float64
tempo               7029 non-null float64
valence             7029 non-null float64
dtypes: float64(9), object(4)
memory usage: 768.8+ KB


In [250]:
# Drop rows that still contain NaN (here, a missing genre_top), then proceed to logistic regression
fma_df = fma_df.dropna(how='any')

### Logistic Regression

In [251]:
fma_df['genre_top'].value_counts(normalize=True)

Rock                   0.412605
Electronic             0.257311
Hip-Hop                0.102471
Folk                   0.070279
Pop                    0.047382
Jazz                   0.029018
Classical              0.025391
Instrumental           0.019950
Old-Time / Historic    0.015189
International          0.014056
Experimental           0.004081
Blues                  0.002267
Name: genre_top, dtype: float64

In [254]:
from sklearn.model_selection import train_test_split

y = fma_df['genre_top']
X = fma_df.drop(columns='genre_top').select_dtypes(include='number')
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, test_size=0.30, 
                                                    random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3087, 9), (1324, 9), (3087,), (1324,))

In [255]:
log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [256]:
score = log_reg.score(X_train, y_train)
print('Train accuracy score: ', score)

Train accuracy score:  0.5322319403952057


In [257]:
scores = cross_val_score(log_reg, X_train, y_train, cv=10)
print('Cross-Validation Accuracy Scores:\n', scores)



Cross-Validation Accuracy Scores:
 [0.47133758 0.55414013 0.5        0.55339806 0.56818182 0.5487013
 0.54071661 0.56393443 0.50328947 0.54934211]


In [258]:
scores = pd.Series(scores)
scores.min(), scores.mean(), scores.max()

(0.4713375796178344, 0.535304149969664, 0.5681818181818182)
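
Is roughly 0.53 good? One useful yardstick is the majority-class baseline: since Rock makes up about 41% of the labeled tracks, always guessing "Rock" would already be right about 41% of the time on that set. The sketch below shows the idea with scikit-learn's `DummyClassifier` on made-up labels (the `toy_` arrays are hypothetical).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with one dominant class (6 of 10 are 'Rock')
toy_labels = np.array(['Rock'] * 6 + ['Jazz'] * 3 + ['Folk'])
toy_inputs = np.zeros((len(toy_labels), 1))  # features are ignored by this baseline

# Always predicts the most frequent label seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(toy_inputs, toy_labels)
baseline_accuracy = baseline.score(toy_inputs, toy_labels)
print(baseline_accuracy)  # equals the share of the majority class
```

A model is only earning its keep to the extent that it beats this number.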

This is the biggest data you've played with so far, and while it does generally fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers

If you find that fitting a model to classify *all* genres isn't very good, it's totally OK to limit yourself to the most frequent genres, or perhaps to combine or cluster genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
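
Limiting to the most frequent genres can be sketched in a few lines of pandas. The frame below is made up, and `top_k` is an assumed knob for how many genres to keep; only the `genre_top` column name comes from the data above.

```python
import pandas as pd

# Hypothetical miniature frame; 'genre_top' mirrors the column name used above
toy_fma = pd.DataFrame({
    'genre_top': ['Rock', 'Rock', 'Rock', 'Electronic', 'Electronic', 'Blues'],
    'tempo': [120.0, 95.0, 140.0, 128.0, 130.0, 80.0],
})

top_k = 2  # assumption: how many of the most frequent genres to keep
top_genres = toy_fma['genre_top'].value_counts().nlargest(top_k).index

# Keep only rows whose genre is one of the top_k most frequent
toy_subset = toy_fma[toy_fma['genre_top'].isin(top_genres)]
print(sorted(top_genres))
print(len(toy_subset))
```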

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so if you'd like, you can check it out and see what sort of deeper analysis you can do (though this is more of a long-term project than a stretch goal, and won't fit in Colab).

In [None]:
@inproceedings{fma_dataset,
  title = {FMA: A Dataset for Music Analysis},
  author = {Defferrard, Micha\"el and Benzi, Kirell and Vandergheynst, Pierre and Bresson, Xavier},
  booktitle = {18th International Society for Music Information Retrieval Conference},
  year = {2017},
  url = {https://arxiv.org/abs/1612.01840},
}