# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

In [None]:
!pip install kaggle

In [None]:
# Note - you'll also have to sign up for Kaggle and authorize the API
# https://github.com/Kaggle/kaggle-api#api-credentials

# This essentially means uploading a kaggle.json file
# For Colab we can have it in Google Drive
from google.colab import drive
drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

# You also have to join the Titanic competition to have access to the data
!kaggle competitions download -c titanic

In [None]:
# How would we try to do this with linear regression?
import pandas as pd

train_df = pd.read_csv('train.csv').dropna()
test_df = pd.read_csv('test.csv').dropna()  # Unlabeled, for Kaggle submission

train_df.head()

In [None]:
train_df.describe()

In [None]:
from sklearn.linear_model import LinearRegression

X = train_df[['Pclass', 'Age', 'Fare']]
y = train_df.Survived

linear_reg = LinearRegression().fit(X, y)
linear_reg.score(X, y)

In [None]:
linear_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

In [None]:
linear_reg.coef_

In [None]:
import numpy as np

test_case = np.array([[1, 5, 500]])  # Rich 5-year old in first class
linear_reg.predict(test_case)

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)

In [None]:
log_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

In [None]:
log_reg.predict(test_case)[0]

In [None]:
help(log_reg.predict)

In [None]:
log_reg.predict_proba(test_case)[0]

In [None]:
# What's the math?
log_reg.coef_

In [None]:
log_reg.intercept_

In [None]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [None]:
sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(test_case)))

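So logistic regression is clearly the more appropriate model in this situation: its output is always a valid probability between 0 and 1, whereas a linear regression prediction is unbounded and can fall outside that range. For more on the math, [see this Wikipedia example](https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study).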
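
To make the arithmetic behind the last cell concrete, here is a minimal sketch using only the standard library. The intercept and coefficients are made-up illustrative values, not the ones fitted above; only the `[Pclass, Age, Fare]` test case comes from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical parameters for illustration only (NOT the fitted log_reg values)
intercept = 1.0
coefs = [-1.0, -0.04, 0.002]   # assumed weights for Pclass, Age, Fare
passenger = [1, 5, 500]        # the lecture's test case

# Linear part: an unbounded score (the log-odds)
log_odds = intercept + sum(w * x for w, x in zip(coefs, passenger))

# Squashing the score yields a valid probability
probability = sigmoid(log_odds)
print(log_odds, probability)
```

The linear score can be any real number, but the sigmoid always returns something strictly between 0 and 1, which is why it can be read as a probability.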

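For live - let's tackle [another classification dataset on absenteeism](http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work) - it has 21 classes, but remember, scikit-learn's LogisticRegression automatically handles more than two classes. How? One common scheme is one-vs-rest: fit a separate binary model per label, treating that label as 1 and every other label as 0, then predict the label whose model gives the highest probability. (Depending on the scikit-learn version and solver, a single multinomial model may be fit instead.)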
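
As a rough illustration of how a multi-class problem can be recast as several binary ones, the sketch below builds one 0/1 target per class from a hypothetical list of labels. It shows only the relabeling step of a one-vs-rest scheme, and it is not a description of what scikit-learn does internally.

```python
# Hypothetical class labels, standing in for a multi-class target column
labels = ['low', 'high', 'medium', 'low', 'high']

# One binary target per class: 1 where the row has that label, 0 elsewhere
binary_targets = {
    cls: [1 if label == cls else 0 for label in labels]
    for cls in sorted(set(labels))
}

for cls, target in binary_targets.items():
    print(cls, target)
```

Each of these binary targets could then be fit with its own logistic model, with the final prediction being the class whose model reports the highest probability.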

In [None]:
# Live - let's try absenteeism!

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and use to classify the genre of tracks. To get you started:

In [None]:
# !scp /c/users/jhump/Downloads/fma_metadata.zip -- does not return zip file

In [None]:
# !unzip fma_metadata.zip -- not needed with local workflow

In [1]:
import pandas as pd
tracks = pd.read_csv('C:\\Users\\jhump\\Desktop\\fma_metadata\\tracks.csv')


In [2]:
tracks.describe()

Unnamed: 0.1,Unnamed: 0,album,album.1,album.2,album.3,album.4,album.5,album.6,album.7,album.8,...,track.10,track.11,track.12,track.13,track.14,track.15,track.16,track.17,track.18,track.19
count,106575,106575,103046,70295,15296,106575,106575,83150,106575,18061,...,2350,106575,15025,106488,106575,312,106575,1264,106575,106574
unique,106575,29,14341,3670,623,65,14929,11076,11351,761,...,1587,18977,45,114,15340,67,331,136,2452,94987
top,143640,0,2015-01-26 13:04:57,2008-01-01 00:00:00,Ernie Indradat,0,-1,"<p class=""p1"" style=""margin: 0px; padding: 8px...",-1,Joe Belock,...,"<p><a href=""http://www.myspace.com/theshambler...",320,en,Attribution-Noncommercial-Share Alike 3.0 Unit...,97,Apache Tomcat,1,Victrola Dog (ASCAP),[],Untitled
freq,1,71187,310,667,876,45753,805,310,3130,855,...,22,67,14255,19250,110,44,10459,465,83078,298


In [3]:
pd.set_option('display.max_columns', None)  # Unlimited columns
tracks.head()

Unnamed: 0.1,Unnamed: 0,album,album.1,album.2,album.3,album.4,album.5,album.6,album.7,album.8,album.9,album.10,album.11,album.12,artist,artist.1,artist.2,artist.3,artist.4,artist.5,artist.6,artist.7,artist.8,artist.9,artist.10,artist.11,artist.12,artist.13,artist.14,artist.15,artist.16,set,set.1,track,track.1,track.2,track.3,track.4,track.5,track.6,track.7,track.8,track.9,track.10,track.11,track.12,track.13,track.14,track.15,track.16,track.17,track.18,track.19
0,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
1,track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
4,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World


In [4]:
tracks.shape

(106576, 53)

In [65]:
import numpy as np

tracks_copy = tracks.copy()
new_header = tracks_copy.iloc[0]
tracks_copy = tracks_copy[1:]
tracks_copy.columns = new_header
tracks_copy = tracks_copy.rename(columns={np.nan: 'track_id'})
tracks_copy = tracks_copy.drop(index=[1])
tracks_copy.head(12)
# tracks_copy.shape

Unnamed: 0,track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
2,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
3,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
4,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
5,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
6,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level
7,26,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:05,2008-01-01 00:00:00,181,0,,"[76, 103]","[17, 10, 76, 103]",,1060,en,Attribution-NonCommercial-NoDerivatives (aka M...,193,,4,,[],Where is your Love?
8,30,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:11,2008-01-01 00:00:00,174,0,,"[76, 103]","[17, 10, 76, 103]",,718,en,Attribution-NonCommercial-NoDerivatives (aka M...,612,,5,,[],Too Happy
9,46,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:53,2008-01-01 00:00:00,104,0,,"[76, 103]","[17, 10, 76, 103]",,252,en,Attribution-NonCommercial-NoDerivatives (aka M...,171,,8,,[],Yosemite
10,48,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:56,2008-01-01 00:00:00,205,0,,"[76, 103]","[17, 10, 76, 103]",,247,en,Attribution-NonCommercial-NoDerivatives (aka M...,173,,9,,[],Light of Light
11,134,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:43:19,2008-11-26 00:00:00,207,3,Hip-Hop,[21],[21],,1126,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,943,,5,,[],Street Music
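
An alternative to repairing the header by hand is to let pandas parse the two header rows directly. The sketch below uses a tiny made-up CSV that mimics the layout of `tracks.csv` (two header rows, then a row naming the index). The names `csv_text` and `toy_tracks` are illustrative, and the real file has many more columns.

```python
import io
import pandas as pd

# Tiny stand-in for tracks.csv: two header rows, then a row naming the index
csv_text = """,album,album,track,track
,comments,title,genre_top,title
track_id,,,,
2,0,AWOL - A Way Of Life,Hip-Hop,Food
3,0,AWOL - A Way Of Life,Hip-Hop,Electric Ave
"""

# header=[0, 1] reads both header rows as a column MultiIndex;
# index_col=0 uses the first column (track_id) as the row index
toy_tracks = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1])

# Flatten ('album', 'title') -> 'album_title' so every column name is unique
toy_tracks.columns = ['_'.join(pair) for pair in toy_tracks.columns]
print(toy_tracks.columns.tolist())
```

Prefixing each name with its group also avoids the duplicate column names (`title`, `comments`, `tags`, and so on) that appear once the second header row is used on its own.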


In [21]:
# Map each column position to its name
tracks_copy_cols_dict = dict(enumerate(tracks_copy.columns))
tracks_copy_cols_dict

{0: 'track_id',
 1: 'comments',
 2: 'date_created',
 3: 'date_released',
 4: 'engineer',
 5: 'favorites',
 6: 'id',
 7: 'information',
 8: 'listens',
 9: 'producer',
 10: 'tags',
 11: 'title',
 12: 'tracks',
 13: 'type',
 14: 'active_year_begin',
 15: 'active_year_end',
 16: 'associated_labels',
 17: 'bio',
 18: 'comments',
 19: 'date_created',
 20: 'favorites',
 21: 'id',
 22: 'latitude',
 23: 'location',
 24: 'longitude',
 25: 'members',
 26: 'name',
 27: 'related_projects',
 28: 'tags',
 29: 'website',
 30: 'wikipedia_page',
 31: 'split',
 32: 'subset',
 33: 'bit_rate',
 34: 'comments',
 35: 'composer',
 36: 'date_created',
 37: 'date_recorded',
 38: 'duration',
 39: 'favorites',
 40: 'genre_top',
 41: 'genres',
 42: 'genres_all',
 43: 'information',
 44: 'interest',
 45: 'language_code',
 46: 'license',
 47: 'listens',
 48: 'lyricist',
 49: 'number',
 50: 'publisher',
 51: 'tags',
 52: 'title'}

In [22]:
# rearrange order of tracks_copy columns for more logical, informative appearance
# tracks_copy = tracks_copy.iloc[0, 11, 2:11, 1, 12:52]
cols = tracks_copy.columns.tolist()
cols

['track_id',
 'comments',
 'date_created',
 'date_released',
 'engineer',
 'favorites',
 'id',
 'information',
 'listens',
 'producer',
 'tags',
 'title',
 'tracks',
 'type',
 'active_year_begin',
 'active_year_end',
 'associated_labels',
 'bio',
 'comments',
 'date_created',
 'favorites',
 'id',
 'latitude',
 'location',
 'longitude',
 'members',
 'name',
 'related_projects',
 'tags',
 'website',
 'wikipedia_page',
 'split',
 'subset',
 'bit_rate',
 'comments',
 'composer',
 'date_created',
 'date_recorded',
 'duration',
 'favorites',
 'genre_top',
 'genres',
 'genres_all',
 'information',
 'interest',
 'language_code',
 'license',
 'listens',
 'lyricist',
 'number',
 'publisher',
 'tags',
 'title']

In [19]:
# Build one flat list: track_id, columns 2-11, then 'comments', then the rest
# (concatenating lists avoids nesting the slices inside the result)
new_cols_list = [cols[0]] + cols[2:12] + [cols[1]] + cols[12:]
new_cols_list
type('track_id')
# note: will make this a stretch goal

str

In [41]:
tracks_copy.genre_top.unique()
tracks_copy.producer.unique()

array([nan, 'Alec K. Refearn, Rob Pemberton', 'Tom Buckland',
       'Chief Boima, Oro 11', 'Griffin Rodriguez', 'Dylan Going',
       'Weasel Walter', 'Philip Manley', 'Brian Mumform', 'Johnny Jewel',
       'Greg Ashley', 'Bobb Bruno', 'Mike Rep & Drew Clausen',
       'Phil Elverum', 'Shawn Greenlee', 'Jimmy Hollywood', 'Mike Rep',
       'Terre T', 'Stork', 'Acapulco Rodriguez', 'Liz Berg',
       'Irene Trudel', 'Jason Sigal', 'Brian Turner',
       'Evan "Funk" Davies', 'Keili Hamilton', 'Joe Belock', 'OCDJ',
       'Pat Duncan', 'Pseu Braun', 'Rob Hatch-Miller', 'Scott Williams',
       'Dan Bodah', 'Diane Kamikaze', 'Bethany Ryker', 'Rob Weisberg',
       'Marty McSorley', 'Scott McDowell', 'Rob Weisberg ', 'Rob Lim',
       'Liz French', 'Steven R Smith', 'Evan ', 'WFMU',
       'Michael Goodstein', 'Mike Lupica', 'Dan Deacon', 'Evan Davies',
       'Hayvanlar Alemi', 'Aaron Spectre', 'The Apartment', 'Bob Wiseman',
       'Trouble', 'The Space Lady', 'Jane Shields', 'Kurt Got

In [66]:
mylist = list(tracks_copy.select_dtypes(include=['int64', 'float64']).columns)
mylist
tracks_copy.dtypes
tracks_copy.date_recorded.unique()
# tracks_copy['date_recorded'] = pd.to_numeric(list(tracks_copy['date_recorded']), errors='coerce')
# tracks_copy['date_recorded'].unique()

array(['2008-11-26 00:00:00', '2008-01-01 00:00:00',
       '1978-04-27 00:00:00', '1998-11-26 00:00:00',
       '1995-11-26 00:00:00', '2002-08-01 00:00:00',
       '2006-01-01 00:00:00', nan, '2004-08-20 00:00:00',
       '2006-10-07 00:00:00', '2007-06-01 00:00:00',
       '2006-02-01 00:00:00', '2002-01-01 00:00:00',
       '2007-01-01 00:00:00', '2003-01-01 00:00:00',
       '2007-10-20 00:00:00', '2006-11-26 00:00:00',
       '2006-08-26 00:00:00', '2006-11-07 00:00:00',
       '2005-01-01 00:00:00', '2007-09-01 00:00:00',
       '1982-11-26 00:00:00', '1981-01-01 00:00:00',
       '2002-06-27 00:00:00', '2006-08-30 00:00:00',
       '2005-07-05 00:00:00', '2008-10-01 00:00:00',
       '2001-01-01 00:00:00', '2008-04-14 00:00:00',
       '2007-10-17 00:00:00', '2007-12-11 00:00:00',
       '2005-05-02 00:00:00', '2008-09-29 00:00:00',
       '2007-12-12 00:00:00', '2003-11-01 00:00:00',
       '2000-09-01 00:00:00', '2002-03-01 00:00:00',
       '1999-01-01 00:00:00', '2004-01-01

In [58]:
import datetime as dt
# convert timestamp strings into MMDDHH integer
# first, write function to do conversion

def convert_timestamp_to_int(val):
    """
    https://stackoverflow.com/questions/54046862/python-convert-timestamp-string-to-mmddhh-integer
    """
    date_string = dt.datetime.strptime(val, '%Y-%m-%d %H:%M:%S')
    different_date_string = dt.datetime.strftime(date_string, '%m%d%H')
    return int(different_date_string)

# second, apply function to all values in date_recorded column
tracks_copy['date_recorded'] = tracks_copy['date_recorded'].apply(convert_timestamp_to_int)

TypeError: strptime() argument 1 must be str, not float
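
The `TypeError` above arises because missing dates are stored as the float `NaN`, which `strptime` cannot parse. One possible fix, sketched here on a hypothetical three-item sample rather than the real column, is to return `NaN` for any input that is not a string.

```python
import datetime as dt
import math

def convert_timestamp_to_int(val):
    """Convert 'YYYY-MM-DD HH:MM:SS' to an MMDDHH integer; missing values become NaN."""
    if not isinstance(val, str):
        return math.nan  # e.g. the float NaN pandas uses for missing entries
    parsed = dt.datetime.strptime(val, '%Y-%m-%d %H:%M:%S')
    return int(parsed.strftime('%m%d%H'))

# Hypothetical sample mixing valid timestamps with a missing value
sample = ['2008-11-26 00:00:00', float('nan'), '1978-04-27 00:00:00']
converted = [convert_timestamp_to_int(v) for v in sample]
print(converted)
```

The same function could be passed to `tracks_copy['date_recorded'].apply(...)`. Note that an MMDDHH integer discards the year, so it may or may not be a useful feature for genre.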

In [43]:
# build Logistic Regression model to predict "genre_top" for any given song
# way to do this involves using numeric columns
from sklearn.linear_model import LogisticRegression

X = tracks_copy[['producer', 'tags', 'date_recorded']]
y = tracks_copy.genre_top

log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)

ValueError: could not convert string to float: "['ballad', 'epic', 'rockabilly', 'curse', 'hex', 'hard rock', 'cauldron', 'witches', 'creepy', 'black cats']"
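
The `ValueError` above shows that `LogisticRegression` needs numeric features, while columns like `tags` hold strings. A minimal sketch of one way forward, assuming a made-up `toy` dataframe with a few of the numeric-looking columns, is to coerce those columns to numbers and drop rows that lack a feature or a label.

```python
import pandas as pd

# Made-up rows; values arrive as strings, as they do after the header fix
toy = pd.DataFrame({
    'duration': ['168', '237', '206', '161'],
    'listens': ['1293', '514', '1151', '50135'],
    'bit_rate': ['256000', '256000', '256000', '192000'],
    'tags': ["[]", "[]", "[]", "['philly']"],
    'genre_top': ['Hip-Hop', 'Hip-Hop', None, 'Pop'],
})

numeric_cols = ['duration', 'listens', 'bit_rate']

# Coerce strings to numbers; anything unparseable becomes NaN
features = toy[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Keep only rows that have every feature and a genre label
labeled = features.join(toy['genre_top']).dropna()
X = labeled[numeric_cols]
y = labeled['genre_top']
print(X.dtypes)
```

`X` and `y` built this way can be passed to `LogisticRegression().fit(X, y)`. Which columns actually help predict genre is left for the assignment.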

This is the biggest data you've played with so far, and while it does generally fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is to download it locally and edit it as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers
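
If data size does become a problem, random sampling is one simple option. The sketch below uses a made-up 1,000-row dataframe; the 10% fraction and the seed are arbitrary choices.

```python
import pandas as pd

# Made-up stand-in for the full tracks dataframe
toy = pd.DataFrame({
    'track_id': range(1000),
    'genre_top': ['Rock', 'Pop'] * 500,
})

# Draw a reproducible 10% random sample of the rows
subset = toy.sample(frac=0.1, random_state=42)
print(subset.shape)
```

Fixing `random_state` keeps the sample the same between runs, which makes it easier to compare model iterations.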

If you find that a model fit to classify *all* genres isn't very good, it's totally OK to limit it to the most frequent genres, or perhaps to try combining or clustering genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
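
Restricting to the most frequent genres can be sketched as follows. The series of genre labels is hypothetical, and keeping the top two is an arbitrary cutoff.

```python
import pandas as pd

# Hypothetical genre labels
genres = pd.Series(['Rock', 'Rock', 'Rock', 'Pop', 'Pop',
                    'Jazz', 'Folk', 'Rock', 'Pop'])

# value_counts sorts by frequency, so head(2) gives the two most common genres
top_genres = genres.value_counts().head(2).index

# Boolean mask selecting only rows whose genre is among the most frequent
mask = genres.isin(top_genres)
filtered = genres[mask]
print(filtered.value_counts())
```

The same mask can be applied to the feature dataframe so that `X` and `y` stay aligned.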

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a long-term project than a stretch goal, and won't fit in Colab) you can check it out and see what sort of deeper analysis you can do.