# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

In [50]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [2]:
!pip install kaggle



In [3]:
# Note - you'll also have to sign up for Kaggle and authorize the API
# https://github.com/Kaggle/kaggle-api#api-credentials

# This essentially means uploading a kaggle.json file
# For Colab we can have it in Google Drive
#from google.colab import drive
#drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=C:\Users\edwar\Downloads

# You also have to join the Titanic competition to have access to the data
!kaggle competitions download -c titanic --force 

env: KAGGLE_CONFIG_DIR=C:\Users\edwar\Downloads
Downloading train.csv to C:\Users\edwar\Downloads

Downloading test.csv to C:\Users\edwar\Downloads

Downloading gender_submission.csv to C:\Users\edwar\Downloads






In [4]:
# How would we try to do this with linear regression?


# dropna() keeps only fully populated rows, which leaves 183 training rows
train_df = pd.read_csv('train.csv').dropna()
test_df = pd.read_csv('test.csv').dropna()  # Unlabeled, for Kaggle submission

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [5]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,183.0,183.0,183.0,183.0,183.0,183.0,183.0
mean,455.36612,0.672131,1.191257,35.674426,0.464481,0.47541,78.682469
std,247.052476,0.470725,0.515187,15.643866,0.644159,0.754617,76.347843
min,2.0,0.0,1.0,0.92,0.0,0.0,0.0
25%,263.5,0.0,1.0,24.0,0.0,0.0,29.7
50%,457.0,1.0,1.0,36.0,0.0,0.0,57.0
75%,676.0,1.0,1.0,47.5,1.0,1.0,90.0
max,890.0,1.0,3.0,80.0,3.0,4.0,512.3292


In [6]:
X = train_df[['Pclass', 'Age', 'Fare']]
y = train_df.Survived

linear_reg = LinearRegression().fit(X, y)
linear_reg.score(X, y)

0.08389810726550906

In [7]:
linear_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

array([0.79543117, 0.58610823, 0.67595121, 0.793829  , 0.62090522,
       0.72542107, 0.59848968, 0.58734245, 0.48567063, 0.77627736,
       0.84211887, 0.57052439, 0.7754689 , 0.96621114, 0.70287941,
       0.57673837, 0.72321391, 0.75894755, 0.77968041, 0.50246003,
       0.49858077, 0.7474959 , 0.3542282 , 0.61648435, 0.71300224,
       0.66294608, 0.53175333, 0.77397395, 0.68419387, 0.68395536,
       0.52041202, 0.56814038, 0.79586606, 0.81372012, 0.61068545,
       0.57260627, 0.52525981, 0.58055388, 0.45584728, 0.67976208,
       0.8226707 , 0.84286197, 0.96189157, 0.66724612, 0.68589478,
       0.61846513, 0.63455044, 0.68275686, 0.65738372, 0.45198998,
       0.59988596, 0.63845908, 0.63132487, 0.7888473 , 0.60126246,
       0.79714045, 0.78713803, 0.54643775, 0.42823635, 0.7711724 ,
       0.53552976, 0.55608044, 0.54480459, 0.57031915, 0.65080369,
       0.77958926, 0.6371013 , 0.70993488, 0.71493598, 0.60375943,
       0.54407206, 0.48186138, 0.76576089, 0.75456305, 0.53968

In [8]:
linear_reg.coef_

array([-0.08596295, -0.00829314,  0.00048775])

In [9]:
test_case = np.array([[1, 5, 500]])  # Rich 5-year-old in first class
linear_reg.predict(test_case)

array([1.14845883])

In [10]:
log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)

0.7103825136612022

In [11]:
log_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
      dtype=int64)

In [12]:
log_reg.predict(test_case)[0]

1

In [14]:
help(log_reg.predict)

Help on method predict in module sklearn.linear_model.base:

predict(X) method of sklearn.linear_model.logistic.LogisticRegression instance
    Predict class labels for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        Samples.
    
    Returns
    -------
    C : array, shape = [n_samples]
        Predicted class label per sample.



In [15]:
log_reg.predict_proba(test_case)[0]

array([0.02485552, 0.97514448])

In [16]:
# What's the math?
log_reg.coef_

array([[-0.0455017 , -0.02912513,  0.0048037 ]])

In [17]:
log_reg.intercept_

array([1.45878264])

In [18]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [19]:
sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(test_case)))

array([[0.97514448]])

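So logistic regression is clearly the more appropriate model here: its output for the test case (about 0.975) is a valid probability, whereas linear regression predicted 1.148, which falls outside the unit interval. For more on the math, [see this Wikipedia example](https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study).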
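The sigmoid calculation can also be reproduced by hand. The sketch below uses only the standard library and assumes the coefficient and intercept values exactly as printed in the `log_reg.coef_` and `log_reg.intercept_` outputs (so they are rounded); the variable names are illustrative.

```python
import math

# Values copied from the printed notebook outputs (rounded as displayed)
coef = [-0.0455017, -0.02912513, 0.0048037]   # Pclass, Age, Fare
intercept = 1.45878264
test_case = [1, 5, 500]                       # first class, age 5, fare 500

# Linear part: z = b0 + b1*x1 + b2*x2 + b3*x3 (the log-odds of survival)
z = intercept + sum(c * x for c, x in zip(coef, test_case))

# The sigmoid maps any real-valued z into the open interval (0, 1)
p_survive = 1 / (1 + math.exp(-z))

print(f"log-odds: {z:.4f}, probability: {p_survive:.4f}")
```

The probability should agree with the second entry of the `predict_proba` output, up to the rounding of the printed coefficients.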

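For the live session, let's tackle [another classification dataset on absenteeism](http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work). It has 21 classes, but remember that scikit-learn's `LogisticRegression` handles more than two classes automatically. How? In the one-vs-rest scheme it fits one binary model per label, treating that label as 1 and every other label as 0, and then predicts the label whose model is most confident. (Recent scikit-learn releases default to a multinomial formulation instead.)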
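To make the "one binary model per label" idea concrete, here is a minimal sketch using scikit-learn's `OneVsRestClassifier` wrapper. The three-class data from `make_blobs` is synthetic and stands in for a real dataset; it is an assumption for illustration, not the absenteeism data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data (illustrative only)
X_demo, y_demo = make_blobs(n_samples=300, centers=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

# One binary logistic model is fitted per class label
print(len(ovr.estimators_))

# Each binary model scores "this class vs. everything else";
# the predicted label is the class whose model gives the highest score.
scores = np.column_stack(
    [est.decision_function(X_demo) for est in ovr.estimators_]
)
manual_pred = ovr.classes_[scores.argmax(axis=1)]
print((manual_pred == ovr.predict(X_demo)).all())
```

With 21 classes the same pattern would simply fit 21 binary models.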

In [20]:
# Live - let's try absenteeism!

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and use to classify the genre of each track. To get you started:

In [21]:
tracks = pd.read_csv('C:\\Users\\edwar\\Downloads\\fma_metadata\\fma_metadata\\tracks.csv')



In [22]:
tracks.describe()

Unnamed: 0.1,Unnamed: 0,album,album.1,album.2,album.3,album.4,album.5,album.6,album.7,album.8,...,track.10,track.11,track.12,track.13,track.14,track.15,track.16,track.17,track.18,track.19
count,106575,106575,103046,70295,15296,106575,106575,83150,106575,18061,...,2350,106575,15025,106488,106575,312,106575,1264,106575,106574
unique,106575,29,14341,3670,623,65,14929,11076,11351,761,...,1587,18977,45,114,15340,67,331,136,2452,94987
top,13199,0,2015-01-26 13:04:57,2008-01-01 00:00:00,Ernie Indradat,0,-1,"<p class=""p1"" style=""margin: 0px; padding: 8px...",-1,Joe Belock,...,"<p><a href=""http://www.myspace.com/theshambler...",320,en,Attribution-Noncommercial-Share Alike 3.0 Unit...,97,Apache Tomcat,1,Victrola Dog (ASCAP),[],Untitled
freq,1,71187,310,667,876,45753,805,310,3130,855,...,22,67,14255,19250,110,44,10459,465,83078,298


In [23]:
pd.set_option('display.max_columns', None)  # Unlimited columns
tracks.head()

Unnamed: 0.1,Unnamed: 0,album,album.1,album.2,album.3,album.4,album.5,album.6,album.7,album.8,album.9,album.10,album.11,album.12,artist,artist.1,artist.2,artist.3,artist.4,artist.5,artist.6,artist.7,artist.8,artist.9,artist.10,artist.11,artist.12,artist.13,artist.14,artist.15,artist.16,set,set.1,track,track.1,track.2,track.3,track.4,track.5,track.6,track.7,track.8,track.9,track.10,track.11,track.12,track.13,track.14,track.15,track.16,track.17,track.18,track.19
0,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
1,track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
4,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World


In [24]:
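# The CSV has a two-row header, so the real column names landed in row 0
tracks = tracks.drop(['Unnamed: 0'], axis=1)
tracks.columns = tracks.iloc[0]   # promote row 0 to column names
tracks = tracks.drop([0, 1])      # drop the name row and the 'track_id' row
tracks = tracks.reset_index(drop=True)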
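
The layout visible in the earlier `tracks.head()` output (two header rows followed by a `track_id` row) is what pandas writes for a two-level column index. Assuming the original `tracks.csv` has that layout, `pd.read_csv` can parse it directly with `header=[0, 1]` and `index_col=0`. The tiny in-memory CSV below is made up to mirror that shape; it is a sketch, not the real file.

```python
import io
import pandas as pd

# Made-up miniature of the tracks.csv layout: group row, name row, index-name row
csv_text = (
    ",album,album,track,track\n"
    ",id,title,genre_top,title\n"
    "track_id,,,,\n"
    "2,1,AWOL - A Way Of Life,Hip-Hop,Food\n"
    "3,1,AWOL - A Way Of Life,Hip-Hop,Electric Ave\n"
)

demo = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1])

# Columns are now (group, name) pairs, so duplicate names stay distinguishable
print(demo.columns.tolist())
print(demo[("track", "genre_top")].tolist())
```

Keeping the two-level columns also avoids the repeated labels (`comments`, `title`, `tags`, ...) that appear once the header is flattened.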

In [25]:
tracks.head()

Unnamed: 0,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
0,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
1,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
3,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
4,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level


In [26]:
tracks.shape

(106574, 52)

In [27]:
# Drop rows where genre_top is missing
tracks = tracks.dropna(subset=['genre_top'])

In [28]:
too_many_nan = tracks.isna().sum() > 10000

In [29]:
too_many_nan

0
comments             False
date_created         False
date_released         True
engineer              True
favorites            False
id                   False
information           True
listens              False
producer              True
tags                 False
title                False
tracks               False
type                 False
active_year_begin     True
active_year_end       True
associated_labels     True
bio                   True
comments             False
date_created         False
favorites            False
id                   False
latitude              True
location              True
longitude             True
members               True
name                 False
related_projects      True
tags                 False
website               True
wikipedia_page        True
split                False
subset               False
bit_rate             False
comments             False
composer              True
date_created         False
date_recorded         True

In [30]:
remove_columns = too_many_nan[too_many_nan]

In [31]:
remove_columns = remove_columns.index.tolist()

In [32]:
tracks2 = tracks.drop(remove_columns, axis=1)
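
The last few cells (count the missing values, flag the columns over a threshold, drop them) can be collapsed into one boolean column mask. A minimal sketch on a made-up frame, where `max_missing` is a hypothetical stand-in for the 10000 used above:

```python
import numpy as np
import pandas as pd

# Made-up data: 'producer' is mostly missing, 'title' has one gap
demo = pd.DataFrame({
    "listens": [10, 20, 30, 40],
    "producer": [np.nan, np.nan, np.nan, "A"],
    "title": ["a", None, "c", "d"],
})

max_missing = 2  # hypothetical threshold for this toy frame

# True for columns whose missing-value count is within the threshold
keep = demo.isna().sum() <= max_missing

# A positional mask, so it also behaves when column labels are duplicated
trimmed = demo.loc[:, keep.to_numpy()]
print(trimmed.columns.tolist())
```

Because the mask is positional, each column is judged on its own count, even when several columns share a label as they do in `tracks`.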

In [33]:
tracks2.isna().sum()

0
comments           0
date_created    1051
favorites          0
id                 0
listens            0
tags               0
title            309
tracks             0
type            2047
comments           0
date_created     215
favorites          0
id                 0
name               0
tags               0
split              0
subset             0
bit_rate           0
comments           0
date_created       0
duration           0
favorites          0
genre_top          0
genres             0
genres_all         0
interest           0
license           59
listens            0
number             0
tags               0
title              1
dtype: int64

In [34]:
tracks2.head()

Unnamed: 0,comments,date_created,favorites,id,listens,tags,title,tracks,type,comments.1,date_created.1,favorites.1,id.1,name,tags.1,split,subset,bit_rate,comments.2,date_created.2,duration,favorites.2,genre_top,genres,genres_all,interest,license,listens.1,number,tags.2,title.1
0,0,2008-11-26 01:44:45,4,1,6073,[],AWOL - A Way Of Life,7,Album,0,2008-11-26 01:42:32,9,1,AWOL,['awol'],training,small,256000,0,2008-11-26 01:48:12,168,2,Hip-Hop,[21],[21],4656,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,3,[],Food
1,0,2008-11-26 01:44:45,4,1,6073,[],AWOL - A Way Of Life,7,Album,0,2008-11-26 01:42:32,9,1,AWOL,['awol'],training,medium,256000,0,2008-11-26 01:48:14,237,1,Hip-Hop,[21],[21],1470,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,4,[],Electric Ave
2,0,2008-11-26 01:44:45,4,1,6073,[],AWOL - A Way Of Life,7,Album,0,2008-11-26 01:42:32,9,1,AWOL,['awol'],training,small,256000,0,2008-11-26 01:48:20,206,6,Hip-Hop,[21],[21],1933,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,6,[],This World
3,0,2008-11-26 01:45:08,4,6,47632,[],Constant Hitmaker,2,Album,3,2008-11-26 01:42:55,74,6,Kurt Vile,"['philly', 'kurt vile']",training,small,192000,0,2008-11-25 17:49:06,161,178,Pop,[10],[10],54881,Attribution-NonCommercial-NoDerivatives (aka M...,50135,1,[],Freeway
9,0,2008-11-26 01:44:45,4,1,6073,[],AWOL - A Way Of Life,7,Album,0,2008-11-26 01:42:32,9,1,AWOL,['awol'],training,medium,256000,0,2008-11-26 01:43:19,207,3,Hip-Hop,[21],[21],1126,Attribution-NonCommercial-ShareAlike 3.0 Inter...,943,5,[],Street Music


In [35]:
tracks2.type.unique()

array(['Album', 'Single Tracks', 'Live Performance', nan, 'Radio Program'],
      dtype=object)

In [36]:
tracks2.genre_top.unique()

array(['Hip-Hop', 'Pop', 'Rock', 'Experimental', 'Folk', 'Jazz',
       'Electronic', 'Spoken', 'International', 'Soul-RnB', 'Blues',
       'Country', 'Classical', 'Old-Time / Historic', 'Instrumental',
       'Easy Listening'], dtype=object)

In [37]:
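# NOTE: the right-hand side is a whole DataFrame, so this overwrites genre_top
# rather than dropping rows; the outputs below reflect the code as it was run.
tracks2.genre_top = tracks2[(tracks2.genre_top != 'Spoken')]
tracks2.genre_top = tracks2[(tracks2.genre_top != 'Old-Time / Historic')]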

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
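
If the aim is to drop the rows belonging to particular genres, the usual pandas pattern is a boolean row filter assigned back to the frame, not to a single column. A minimal sketch on a made-up frame (the data and the `excluded` list are illustrative):

```python
import pandas as pd

# Made-up frame standing in for tracks2
demo = pd.DataFrame({
    "genre_top": ["Rock", "Spoken", "Pop", "Old-Time / Historic", "Rock"],
    "duration": [168, 237, 206, 161, 311],
})

excluded = ["Spoken", "Old-Time / Historic"]

# Keep only rows whose genre is NOT in the excluded list;
# .copy() gives an independent frame and avoids SettingWithCopyWarning later
filtered = demo[~demo["genre_top"].isin(excluded)].copy()

print(filtered["genre_top"].unique())
```

After a filter like this, `genre_top` still holds genre names, which makes for a quick sanity check.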


In [38]:
tracks2.type.isna().sum()

2047

In [39]:
tracks2 = tracks2.dropna(subset=['type'])

In [40]:
tracks2.shape

(47551, 31)

In [41]:
tracks2.genre_top.unique()

array(['0', '1', '2', '4', '3', '5', '17', '6', 0, 2, 1, 4, 3, 5, 7, 8,
       10, 6, 9, 12, 11, 14], dtype=object)

In [44]:
tracks2.genre_top = tracks2.genre_top.astype(int)

In [46]:
label_encoder = LabelEncoder()
tracks2['genre_top'] = label_encoder.fit_transform(tracks2['genre_top'])
tracks2['type'] = label_encoder.fit_transform(tracks2['type'])
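
Refitting one `LabelEncoder` on a second column discards the mapping learned for the first, so the integer codes for `genre_top` can no longer be translated back. A minimal sketch with one encoder per column, on made-up values:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values standing in for the two columns
genres = ["Hip-Hop", "Rock", "Pop", "Hip-Hop"]
types = ["Album", "Single Tracks", "Album", "Live Performance"]

# Separate encoders keep each column's label <-> integer mapping available
genre_encoder = LabelEncoder()
type_encoder = LabelEncoder()
genre_codes = genre_encoder.fit_transform(genres)
type_codes = type_encoder.fit_transform(types)

print(genre_codes.tolist())
# inverse_transform recovers the original genre names from the codes
print(genre_encoder.inverse_transform(genre_codes).tolist())
```

For a nominal feature such as `type`, one-hot encoding (the notebook already imports `OneHotEncoder`) avoids implying an order between categories.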

In [47]:
X = tracks2[['listens', 'tracks', 'bit_rate', 'type', 'duration', 'interest']]
y = tracks2['genre_top']

In [57]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [60]:
tracks_logistic_reg = LogisticRegression().fit(X_train, y_train)
# Accuracy on the training split; score(X_test, y_test) gives the held-out estimate
tracks_logistic_reg.score(X_train, y_train)

0.8466351209253418
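
The score above is measured on the same rows the model was fitted on. Since a test split already exists, comparing both accuracies is a cheap check for overfitting. A minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the FMA features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on rows the model has seen
test_acc = model.score(X_te, y_te)   # accuracy on held-out rows

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers suggests the model is memorizing the training rows.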

### Obscure data


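I originally included the Spoken and Old-Time / Historic genres and got a score of about 39%; after the step meant to remove them, the score rose to about 84% (0.847 on the training split). It seems that obscure target classes can really hurt a model's predictions. That seems obvious in retrospect, but just looking at the data doesn't necessarily yield that intuition. One caveat: after that step, `tracks2.genre_top.unique()` returns numeric values rather than genre names, so the target column itself appears to have changed, and the jump should be re-checked with a plain row filter before reading too much into it. Either way, fitting a model and tweaking it is a valuable way to learn how well your features separate your targets.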

  - What are the best predictors of genre?
  There is a better way to approach feature selection than reasoning about it intuitively. I found that removing hard-to-predict genres has a large effect on the score, but given the opportunity I would go back and use scikit-learn's recursive feature elimination (`RFE`). It looks solid and can reveal relationships better than a correlation matrix, because it eliminates features iteratively and refits the model each time.
  - What information isn't very useful for predicting genre?
  Comments and favorites stuck out to me as not being very useful because their values are so repetitive. A feature that mostly repeats the same value isn't a good predictor, because it provides nearly the same input for every outcome.
  - What surprised you the most about your results?
  I was really surprised that the bit rate had such a large effect on the model score: removing it lowered the score by 6 percent.
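
As a rough sketch of the recursive feature elimination idea mentioned in the first answer, scikit-learn's `RFE` wraps an estimator, repeatedly drops the weakest feature, and refits. The data here is synthetic, and the feature names are borrowed from the notebook only as labels; nothing below is a result for the FMA data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Names reused as labels only; the values are synthetic
feature_names = ["listens", "tracks", "bit_rate", "type", "duration", "interest"]
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Scale first so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)

# Drop one feature per round until three remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_scaled, y_demo)

# ranking_ is 1 for kept features; larger numbers were eliminated earlier
for name, kept, rank in zip(feature_names, selector.support_, selector.ranking_):
    print(f"{name:10s} kept={kept} rank={rank}")
```

Running the same selector on the real feature matrix would give a data-driven answer to the "best predictors" question.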


## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a long-term project than a stretch goal, and won't fit in Colab) if you'd like you can check it out and see what sort of deeper analysis you can do.