
# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

In [1]:
!pip install kaggle



In [2]:
# Note - you'll also have to sign up for Kaggle and authorize the API
# https://github.com/Kaggle/kaggle-api#api-credentials

# This essentially means uploading a kaggle.json file
# For Colab we can have it in Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

# You also have to join the Titanic competition to have access to the data
!kaggle competitions download -c titanic

Mounted at /content/drive
env: KAGGLE_CONFIG_DIR=/content/drive/My Drive/
train.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
gender_submission.csv: Skipping, found more recently modified local copy (use --force to force download)


In [3]:
!ls /content/drive/My\ Drive/

 Address.gdoc		 ideas	       projects				  Video
'Chromebook Backup'	 kaggle.json  'Rescue Map.gmap'
'Colab Notebooks'	 learning     'Untitled spreadsheet (1).gsheet'
'Group Savings.gsheet'	 personal     'Untitled spreadsheet.gsheet'


In [4]:
# How would we try to do this with linear regression?
import pandas as pd

train_df = pd.read_csv('train.csv').dropna()
test_df = pd.read_csv('test.csv').dropna()  # Unlabeled, for Kaggle submission

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [5]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,183.0,183.0,183.0,183.0,183.0,183.0,183.0
mean,455.36612,0.672131,1.191257,35.674426,0.464481,0.47541,78.682469
std,247.052476,0.470725,0.515187,15.643866,0.644159,0.754617,76.347843
min,2.0,0.0,1.0,0.92,0.0,0.0,0.0
25%,263.5,0.0,1.0,24.0,0.0,0.0,29.7
50%,457.0,1.0,1.0,36.0,0.0,0.0,57.0
75%,676.0,1.0,1.0,47.5,1.0,1.0,90.0
max,890.0,1.0,3.0,80.0,3.0,4.0,512.3292


In [6]:
from sklearn.linear_model import LinearRegression

X = train_df[['Pclass', 'Age', 'Fare']]
y = train_df.Survived

linear_reg = LinearRegression().fit(X, y)
linear_reg.score(X, y)

0.08389810726550917

In [7]:
linear_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

array([0.79543117, 0.58610823, 0.67595121, 0.793829  , 0.62090522,
       0.72542107, 0.59848968, 0.58734245, 0.48567063, 0.77627736,
       0.84211887, 0.57052439, 0.7754689 , 0.96621114, 0.70287941,
       0.57673837, 0.72321391, 0.75894755, 0.77968041, 0.50246003,
       0.49858077, 0.7474959 , 0.3542282 , 0.61648435, 0.71300224,
       0.66294608, 0.53175333, 0.77397395, 0.68419387, 0.68395536,
       0.52041202, 0.56814038, 0.79586606, 0.81372012, 0.61068545,
       0.57260627, 0.52525981, 0.58055388, 0.45584728, 0.67976208,
       0.8226707 , 0.84286197, 0.96189157, 0.66724612, 0.68589478,
       0.61846513, 0.63455044, 0.68275686, 0.65738372, 0.45198998,
       0.59988596, 0.63845908, 0.63132487, 0.7888473 , 0.60126246,
       0.79714045, 0.78713803, 0.54643775, 0.42823635, 0.7711724 ,
       0.53552976, 0.55608044, 0.54480459, 0.57031915, 0.65080369,
       0.77958926, 0.6371013 , 0.70993488, 0.71493598, 0.60375943,
       0.54407206, 0.48186138, 0.76576089, 0.75456305, 0.53968

In [8]:
linear_reg.coef_

array([-0.08596295, -0.00829314,  0.00048775])

In [9]:
import numpy as np

test_case = np.array([[1, 5, 500]])  # Rich 5-year-old in first class
linear_reg.predict(test_case)

array([1.14845883])

In [10]:
np.dot(linear_reg.coef_, test_case.reshape(-1, 1)) + linear_reg.intercept_

array([1.14845883])
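
The prediction above is larger than 1, so it cannot be read as a probability. A minimal sketch on synthetic data (the arrays below are made up for illustration and are not the Titanic data) shows the same effect: a linear fit is unbounded, while the probabilities from logistic regression stay inside the unit interval.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic, illustrative data: one feature and a binary label
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = (x.ravel() + rng.normal(0, 1, size=200) > 5).astype(int)

lin = LinearRegression().fit(x, y)
log = LogisticRegression().fit(x, y)

# Evaluate both models far outside the training range
extreme = np.array([[-20.0], [30.0]])
lin_pred = lin.predict(extreme)
log_prob = log.predict_proba(extreme)[:, 1]

print(lin_pred)  # linear output is not constrained to [0, 1]
print(log_prob)  # probability of class 1, always within [0, 1]
```

The further a point sits from the training data, the further the linear output drifts outside the range a 0/1 label can take.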

In [11]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)



0.7103825136612022

In [12]:
log_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [13]:
log_reg.predict(test_case)[0]

1

In [14]:
help(log_reg.predict)

Help on method predict in module sklearn.linear_model.base:

predict(X) method of sklearn.linear_model.logistic.LogisticRegression instance
    Predict class labels for samples in X.
    
    Parameters
    ----------
    X : array_like or sparse matrix, shape (n_samples, n_features)
        Samples.
    
    Returns
    -------
    C : array, shape [n_samples]
        Predicted class label per sample.



In [15]:
log_reg.predict_proba(test_case)[0]

array([0.02485552, 0.97514448])

In [16]:
# What's the math?
log_reg.coef_

array([[-0.0455017 , -0.02912513,  0.0048037 ]])

In [17]:
log_reg.intercept_

array([1.45878264])

In [0]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [19]:
sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(test_case)))

array([[0.97514448]])

So, clearly a more appropriate model in this situation! For more on the math, [see this Wikipedia example](https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study).
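
To spell out the math in code: the model computes a linear score (the log-odds) and passes it through the sigmoid. The sketch below checks this on synthetic data; `X_demo`, `y_demo` and the weights used to generate them are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic binary problem with three features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Log-odds for each row: intercept + dot(features, coefficients)
log_odds = model.intercept_ + X_demo @ model.coef_.ravel()

# Squash the log-odds into probabilities and compare with scikit-learn
manual_prob = sigmoid(log_odds)
sklearn_prob = model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_prob, sklearn_prob))
```

A log-odds of 0 corresponds to a probability of 0.5, which is the threshold `predict` uses to choose between the two classes.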

For the live portion, let's tackle [another classification dataset on absenteeism](http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work). It has 21 classes, but remember that scikit-learn's `LogisticRegression` handles more than two classes automatically. How? In the version used here, with a one-vs-rest scheme: it fits one binary model per label, treating that label as 1 and every other label as 0, and then predicts the label whose model gives the highest score.
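
A minimal sketch of the one-vs-rest idea, using scikit-learn's bundled iris data instead of the absenteeism file so the example is self-contained (that substitution is an assumption of this sketch). One binary logistic model is fit per class, with that class coded 1 and all other classes coded 0. Newer scikit-learn releases may fit a multinomial (softmax) model by default, so `OneVsRestClassifier` is used here to make the scheme explicit.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_iris, y_iris = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)

# One binary model per class
n_binary_models = len(ovr.estimators_)
print(n_binary_models)

# Rebuild the first binary problem by hand: class 0 versus everything else
y_class0 = (y_iris == ovr.classes_[0]).astype(int)
manual = LogisticRegression(max_iter=1000).fit(X_iris, y_class0)
print(np.allclose(manual.coef_, ovr.estimators_[0].coef_, atol=1e-6))

# The predicted class is the one whose binary model scores highest
scores = ovr.decision_function(X_iris)
by_hand = ovr.classes_[np.argmax(scores, axis=1)]
```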

In [20]:
# Live - let's try absenteeism!
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip
!unzip Absenteeism_at_work_AAA.zip	

--2019-01-22 22:14:30--  http://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66136 (65K) [application/zip]
Saving to: ‘Absenteeism_at_work_AAA.zip.1’


2019-01-22 22:14:31 (180 KB/s) - ‘Absenteeism_at_work_AAA.zip.1’ saved [66136/66136]

Archive:  Absenteeism_at_work_AAA.zip
replace Absenteeism_at_work.arff? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [21]:
!head Absenteeism_at_work.csv

ID;Reason for absence;Month of absence;Day of the week;Seasons;Transportation expense;Distance from Residence to Work;Service time;Age;Work load Average/day ;Hit target;Disciplinary failure;Education;Son;Social drinker;Social smoker;Pet;Weight;Height;Body mass index;Absenteeism time in hours
11;26;7;3;1;289;36;13;33;239.554;97;0;1;2;1;0;1;90;172;30;4
36;0;7;3;1;118;13;18;50;239.554;97;1;1;1;1;0;0;98;178;31;0
3;23;7;4;1;179;51;18;38;239.554;97;0;1;0;1;0;0;89;170;31;2
7;7;7;5;1;279;5;14;39;239.554;97;0;1;2;1;1;0;68;168;24;4
11;23;7;5;1;289;36;13;33;239.554;97;0;1;2;1;0;1;90;172;30;2
3;23;7;6;1;179;51;18;38;239.554;97;0;1;0;1;0;0;89;170;31;2
10;22;7;6;1;361;52;3;28;239.554;97;0;1;1;1;0;4;80;172;27;8
20;23;7;6;1;260;50;11;36;239.554;97;0;1;4;1;0;0;65;168;23;4
14;19;7;2;1;155;12;14;34;239.554;97;0;1;2;1;0;0;95;196;25;40


In [22]:
absent_df = pd.read_csv('Absenteeism_at_work.csv', sep=';')
absent_df.head()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,...,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,2


In [23]:
absent_df.shape

(740, 21)

In [24]:
LogisticRegression

sklearn.linear_model.logistic.LogisticRegression

In [25]:
absent_df.columns

Index(['ID', 'Reason for absence', 'Month of absence', 'Day of the week',
       'Seasons', 'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours'],
      dtype='object')

In [26]:
X = absent_df.drop('Reason for absence', axis='columns')
y = absent_df['Reason for absence']

absent_log1 = LogisticRegression().fit(X, y)
absent_log1.score(X, y)



0.5013513513513513

In [27]:
absent_log1

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [28]:
absent_log1.coef_

array([[-1.27508146e-02, -2.20793886e-01, -4.08404457e-02,
         4.22616389e-01,  7.64613632e-03, -6.18658809e-03,
        -3.11523480e-02,  5.52309565e-02, -1.01707258e-02,
        -1.14781041e-01,  2.51984519e+00,  1.41303285e-02,
         3.56702636e-01,  2.27778616e-01,  1.80478520e-01,
         1.33441356e-01, -9.82098540e-03,  5.23702470e-02,
         7.24744784e-02, -3.15221513e+00],
       [ 5.11128074e-02,  1.77166203e-01, -5.01201750e-01,
        -9.34413067e-02,  8.70070989e-03,  2.90683453e-02,
        -8.19496921e-02,  1.30910624e-01,  3.69193210e-03,
         3.58478924e-02, -7.77198534e-01,  9.21419989e-01,
         2.74046853e-01, -6.53744541e-01, -2.74959514e-01,
        -4.46143318e-01,  1.88638790e-01, -7.60409657e-02,
        -6.85430554e-01,  4.55734002e-03],
       [ 1.74477093e-01,  5.91768593e-01,  4.46355493e-02,
         4.29016838e-01,  5.97802716e-02, -2.02281583e-01,
        -4.44906133e-01, -2.65239556e-01,  8.55486140e-03,
        -4.07755000e-01, -6.7

In [0]:
?LogisticRegression

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and try to use to classify genre of tracks. To get you started:

In [30]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip

--2019-01-22 22:18:14--  https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
Resolving os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)... 86.119.28.13, 2001:620:5ca1:2ff::ce53
Connecting to os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)|86.119.28.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358412441 (342M) [application/zip]
Saving to: ‘fma_metadata.zip.1’


2019-01-22 22:18:28 (24.6 MB/s) - ‘fma_metadata.zip.1’ saved [358412441/358412441]

Archive:  fma_metadata.zip
replace fma_metadata/README.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [31]:
# Loading data and setting the columns.
tracks = pd.read_csv('fma_metadata/tracks.csv', header=1)
tracks.drop(index=0, inplace=True)
tracks.rename(index=str, columns={'Unnamed: 0': 'track_id'}, inplace=True)



In [32]:
tracks.describe()

Unnamed: 0,comments,favorites,id,listens,tracks,comments.1,favorites.1,id.1,latitude,longitude,bit_rate,comments.2,duration,favorites.2,interest,listens.1,number
count,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0,44544.0,44544.0,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0
mean,0.394946,1.286927,12826.933914,32120.31,19.721452,1.894702,30.041915,12036.770404,39.901626,-38.668642,263274.695048,0.031621,277.8491,3.182521,3541.31,2329.353548,8.260945
std,2.268915,3.133035,6290.261805,147853.2,39.943673,6.297679,100.511408,6881.420867,18.24086,65.23722,67623.443584,0.321993,305.518553,13.51382,19017.43,8028.070647,15.243271
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-45.87876,-157.526855,-1.0,0.0,0.0,0.0,2.0,0.0,0.0
25%,0.0,0.0,7793.0,3361.0,7.0,0.0,1.0,6443.0,39.271398,-79.997459,192000.0,0.0,149.0,0.0,599.0,292.0,2.0
50%,0.0,0.0,13374.0,8982.0,11.0,0.0,5.0,12029.5,41.387917,-73.554431,299914.0,0.0,216.0,1.0,1314.0,764.0,5.0
75%,0.0,1.0,18203.0,23635.0,17.0,1.0,16.0,18011.0,48.85693,4.35171,320000.0,0.0,305.0,3.0,3059.0,2018.0,9.0
max,53.0,61.0,22940.0,3564243.0,652.0,79.0,963.0,24357.0,67.286005,175.277,448000.0,37.0,18350.0,1482.0,3293557.0,543252.0,255.0


In [33]:
pd.set_option('display.max_columns', None)  # Unlimited columns
tracks.head(2)

Unnamed: 0,track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
1,2,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000.0,0.0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168.0,2.0,Hip-Hop,[21],[21],,4656.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293.0,,3.0,,[],Food
2,3,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000.0,0.0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237.0,1.0,Hip-Hop,[21],[21],,1470.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514.0,,4.0,,[],Electric Ave


In [34]:
tracks.columns

Index(['track_id', 'comments', 'date_created', 'date_released', 'engineer',
       'favorites', 'id', 'information', 'listens', 'producer', 'tags',
       'title', 'tracks', 'type', 'active_year_begin', 'active_year_end',
       'associated_labels', 'bio', 'comments.1', 'date_created.1',
       'favorites.1', 'id.1', 'latitude', 'location', 'longitude', 'members',
       'name', 'related_projects', 'tags.1', 'website', 'wikipedia_page',
       'split', 'subset', 'bit_rate', 'comments.2', 'composer',
       'date_created.2', 'date_recorded', 'duration', 'favorites.2',
       'genre_top', 'genres', 'genres_all', 'information.1', 'interest',
       'language_code', 'license', 'listens.1', 'lyricist', 'number',
       'publisher', 'tags.2', 'title.1'],
      dtype='object')

In [35]:
tracks.shape

(106574, 53)

In [36]:
tracks.genre_top.value_counts()

Rock                   14182
Experimental           10608
Electronic              9372
Hip-Hop                 3552
Folk                    2803
Pop                     2332
Instrumental            2079
International           1389
Classical               1230
Jazz                     571
Old-Time / Historic      554
Spoken                   423
Country                  194
Soul-RnB                 175
Blues                    110
Easy Listening            24
Name: genre_top, dtype: int64

In [0]:
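# Encode genre names as integers, numbered from the most to the least frequent genre
genre_dict = {
    'Rock': 1,
    'Experimental': 2,
    'Electronic': 3,
    'Hip-Hop': 4,
    'Folk': 5,
    'Pop': 6,
    'Instrumental': 7,
    'International': 8,
    'Classical': 9,
    'Jazz': 10,
    'Old-Time / Historic': 11,
    'Spoken': 12,
    'Country': 13,
    'Soul-RnB': 14,
    'Blues': 15,
    'Easy Listening': 16,
}

# Assign the result back rather than calling replace(..., inplace=True) on the
# attribute, which may act on a temporary copy in newer pandas versions
tracks['genre_top'] = tracks['genre_top'].replace(genre_dict)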

In [38]:
tracks.genre_top.value_counts()

1.0     14182
2.0     10608
3.0      9372
4.0      3552
5.0      2803
6.0      2332
7.0      2079
8.0      1389
9.0      1230
10.0      571
11.0      554
12.0      423
13.0      194
14.0      175
15.0      110
16.0       24
Name: genre_top, dtype: int64
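
scikit-learn classifiers also accept string labels directly, so the numeric encoding is optional for `LogisticRegression`. When integer codes are wanted, one alternative to a hand-written dictionary is a pandas categorical. The sketch below uses a made-up series; `genres`, `codes` and `code_to_genre` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up labels, including one missing value
genres = pd.Series(['Rock', 'Pop', None, 'Rock', 'Jazz', 'Pop', 'Rock'],
                   name='genre_top')

# Order the categories by frequency so that code 0 is the most common genre
order = genres.value_counts().index
cat = pd.Categorical(genres, categories=order)

codes = pd.Series(cat.codes, name='genre_code')  # missing values become -1
code_to_genre = dict(enumerate(cat.categories))  # lookup to decode predictions

print(codes.tolist())
print(code_to_genre)
```

Either way, the codes are only names for the classes. Their numeric order carries no meaning for a classifier, although a regression on the codes would treat it as meaningful.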

In [39]:
# Trial 1: Select the numeric columns and forward-fill missing values before running logistic regression.
numeric = ['int32', 'int64', 'float16', 'float32', 'float64']
tracks_num = tracks.select_dtypes(include=numeric).copy()

print(tracks.columns)
print(tracks_num.columns)

Index(['track_id', 'comments', 'date_created', 'date_released', 'engineer',
       'favorites', 'id', 'information', 'listens', 'producer', 'tags',
       'title', 'tracks', 'type', 'active_year_begin', 'active_year_end',
       'associated_labels', 'bio', 'comments.1', 'date_created.1',
       'favorites.1', 'id.1', 'latitude', 'location', 'longitude', 'members',
       'name', 'related_projects', 'tags.1', 'website', 'wikipedia_page',
       'split', 'subset', 'bit_rate', 'comments.2', 'composer',
       'date_created.2', 'date_recorded', 'duration', 'favorites.2',
       'genre_top', 'genres', 'genres_all', 'information.1', 'interest',
       'language_code', 'license', 'listens.1', 'lyricist', 'number',
       'publisher', 'tags.2', 'title.1'],
      dtype='object')
Index(['comments', 'favorites', 'id', 'listens', 'tracks', 'comments.1',
       'favorites.1', 'id.1', 'latitude', 'longitude', 'bit_rate',
       'comments.2', 'duration', 'favorites.2', 'genre_top', 'interest',
      

In [40]:
tracks_num.dtypes

comments       float64
favorites      float64
id             float64
listens        float64
tracks         float64
comments.1     float64
favorites.1    float64
id.1           float64
latitude       float64
longitude      float64
bit_rate       float64
comments.2     float64
duration       float64
favorites.2    float64
genre_top      float64
interest       float64
listens.1      float64
number         float64
dtype: object

In [41]:
tracks.dtypes

track_id              object
comments             float64
date_created          object
date_released         object
engineer              object
favorites            float64
id                   float64
information           object
listens              float64
producer              object
tags                  object
title                 object
tracks               float64
type                  object
active_year_begin     object
active_year_end       object
associated_labels     object
bio                   object
comments.1           float64
date_created.1        object
favorites.1          float64
id.1                 float64
latitude             float64
location              object
longitude            float64
members               object
name                  object
related_projects      object
tags.1                object
website               object
wikipedia_page        object
split                 object
subset                object
bit_rate             float64
comments.2    

In [42]:
tracks_num.isnull().sum()

comments           0
favorites          0
id                 0
listens            0
tracks             0
comments.1         0
favorites.1        0
id.1               0
latitude       62030
longitude      62030
bit_rate           0
comments.2         0
duration           0
favorites.2        0
genre_top      56976
interest           0
listens.1          0
number             0
dtype: int64

In [0]:
# Forward-fill missing values (this also fills the missing genre_top labels)
tracks_num.ffill(inplace=True)

In [44]:
tracks_num.shape

(106574, 18)

In [45]:
tracks_num.isnull().sum()

comments       0
favorites      0
id             0
listens        0
tracks         0
comments.1     0
favorites.1    0
id.1           0
latitude       0
longitude      0
bit_rate       0
comments.2     0
duration       0
favorites.2    0
genre_top      0
interest       0
listens.1      0
number         0
dtype: int64

In [46]:
X = tracks_num.drop('genre_top', axis='columns')
y = tracks_num['genre_top']

genre_log1 = LogisticRegression(solver= 'newton-cg', 
                                 multi_class='multinomial').fit(X, y)
genre_log1.score(X, y)



0.2772439807082403
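
Forward-filling the whole frame also fills `genre_top`, so each of the 56976 unlabeled tracks inherits the genre of whichever labeled row precedes it. A hedged alternative is to keep only rows with an observed label and impute the feature columns alone. The sketch below uses a tiny made-up frame, and median imputation is one assumption among several reasonable choices.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with gaps in one feature and in the label
demo = pd.DataFrame({
    'listens':   [10.0, 250.0, 31.0, 7.0, 90.0, 45.0],
    'latitude':  [40.1, np.nan, 48.9, np.nan, 41.4, np.nan],
    'genre_top': [1.0, np.nan, 2.0, np.nan, np.nan, 1.0],
})

# Keep only the rows whose label was actually observed
labeled = demo.dropna(subset=['genre_top']).copy()

# Impute missing feature values with the column median instead of dropping rows
feature_cols = ['listens', 'latitude']
labeled[feature_cols] = labeled[feature_cols].fillna(labeled[feature_cols].median())

X_demo = labeled[feature_cols]
y_demo = labeled['genre_top'].astype(int)
print(labeled)
```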

In [47]:
# Trial 2: Select the numeric columns and drop rows with missing values before running logistic regression.

numeric = ['int32', 'int64', 'float16', 'float32', 'float64']
tracks_num1 = tracks.select_dtypes(include=numeric).copy()
tracks_num1.isnull().sum()

comments           0
favorites          0
id                 0
listens            0
tracks             0
comments.1         0
favorites.1        0
id.1               0
latitude       62030
longitude      62030
bit_rate           0
comments.2         0
duration           0
favorites.2        0
genre_top      56976
interest           0
listens.1          0
number             0
dtype: int64

In [48]:
tracks_num1.dropna(inplace=True)
tracks_num1.shape

(20980, 18)

In [49]:
tracks_num1.isnull().sum()

comments       0
favorites      0
id             0
listens        0
tracks         0
comments.1     0
favorites.1    0
id.1           0
latitude       0
longitude      0
bit_rate       0
comments.2     0
duration       0
favorites.2    0
genre_top      0
interest       0
listens.1      0
number         0
dtype: int64

In [50]:
X = tracks_num1.drop('genre_top', axis='columns')
y = tracks_num1['genre_top']

genre_log1 = LogisticRegression(solver= 'newton-cg', 
                                 multi_class='multinomial').fit(X, y)
genre_log1.score(X, y)



0.382650142993327

In [51]:
# Trial 3: Select the numeric columns, drop the 'latitude' and 'longitude' columns,
# and forward-fill missing values before running logistic regression.
numeric = ['int32', 'int64', 'float16', 'float32', 'float64']
tracks_num2 = tracks.select_dtypes(include=numeric).copy()
tracks_num2.isnull().sum()

comments           0
favorites          0
id                 0
listens            0
tracks             0
comments.1         0
favorites.1        0
id.1               0
latitude       62030
longitude      62030
bit_rate           0
comments.2         0
duration           0
favorites.2        0
genre_top      56976
interest           0
listens.1          0
number             0
dtype: int64

In [52]:
tracks_num2.drop(columns=['latitude','longitude'], inplace=True)
tracks_num2.shape

(106574, 16)

In [53]:
tracks_num2.isnull().sum()

comments           0
favorites          0
id                 0
listens            0
tracks             0
comments.1         0
favorites.1        0
id.1               0
bit_rate           0
comments.2         0
duration           0
favorites.2        0
genre_top      56976
interest           0
listens.1          0
number             0
dtype: int64

In [0]:
# Forward-fill missing values (this also fills the missing genre_top labels)
tracks_num2.ffill(inplace=True)

In [55]:
X = tracks_num2.drop('genre_top', axis='columns')
y = tracks_num2['genre_top']

genre_log1 = LogisticRegression(solver= 'newton-cg', 
                                 multi_class='multinomial').fit(X, y)
genre_log1.score(X, y)



0.27716891549533657

In [0]:
# Trial 4: Select the numeric columns, run LinearRegression, and
# identify the top 2 columns before running logistic regression.
tracks_num3 = tracks.select_dtypes('number').copy()

In [0]:
tracks_num3.dropna(inplace=True)

In [58]:
X.columns

Index(['comments', 'favorites', 'id', 'listens', 'tracks', 'comments.1',
       'favorites.1', 'id.1', 'bit_rate', 'comments.2', 'duration',
       'favorites.2', 'interest', 'listens.1', 'number'],
      dtype='object')

In [59]:
from sklearn.linear_model import LinearRegression

X = tracks_num3.drop('genre_top', axis='columns')
y = tracks_num3['genre_top']

linear_reg = LinearRegression().fit(X, y)
linear_reg.score(X, y)

linear_reg.coef_

array([-3.97951792e-03,  1.14328692e-01,  7.01838772e-06,  3.65822959e-06,
       -4.94188734e-07, -3.30077085e-03,  2.28995522e-03,  2.92195352e-05,
        3.79382623e-03,  1.39287253e-03, -2.95139703e-06, -2.70067752e-01,
        2.72525382e-04,  1.38879991e-02, -1.07520547e-05,  3.19481192e-05,
        1.12246242e-02])

In [60]:
X = tracks_num3[['favorites', 'favorites.1']]
y = tracks_num3['genre_top']

genre_log1 = LogisticRegression(solver= 'newton-cg', 
                                 multi_class='multinomial').fit(X, y)
genre_log1.score(X, y)

0.37874165872259297

In [0]:
# Trial 5: Select the numeric columns and drop the less frequent genres.
tracks_num4 = tracks.select_dtypes('number').copy()
tracks_num4.dropna(inplace=True)

In [66]:
tracks_num4.genre_top.value_counts()

1.0     7693
2.0     3888
3.0     3638
4.0     1368
6.0      976
5.0      935
9.0      725
8.0      597
7.0      416
10.0     336
12.0     156
14.0     110
13.0      67
15.0      62
16.0      10
11.0       3
Name: genre_top, dtype: int64

In [0]:
tracks_num4 = tracks_num4[tracks_num4.genre_top < 10]

In [70]:
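# Note: this cell fits on tracks_num1 (the Trial 2 frame), so the genre filter
# applied to tracks_num4 above does not affect the score below
X = tracks_num1.drop('genre_top', axis='columns')
y = tracks_num1['genre_top']

genre_log1 = LogisticRegression(solver='newton-cg',
                                multi_class='multinomial', max_iter=200).fit(X, y)
genre_log1.score(X, y)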



0.4295996186844614

This is the biggest data you've played with so far, and while it does generally fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?
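
When inspecting and iterating, it helps to score on rows the model has not seen and to compare against a trivial baseline. The sketch below does both on synthetic data from `make_classification`; the class weights, the scaling step and the split size are assumptions chosen for illustration, not settings taken from the FMA data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 4-class problem
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=4,
    n_clusters_per_class=1, class_sep=2.0,
    weights=[0.5, 0.25, 0.15, 0.10], random_state=0)

# Hold out a quarter of the rows, preserving the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

# Model: standardize the features, then fit logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f'baseline: {baseline_acc:.3f}  model: {model_acc:.3f}')
```

An accuracy is only informative relative to the baseline: on imbalanced labels, always guessing the largest class can already score well.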

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers
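
For the sampling suggestion above, a minimal sketch on a made-up frame. It assumes pandas 1.1 or newer for `groupby(...).sample`, and the column names are only illustrative.

```python
import pandas as pd

# Made-up frame with an imbalanced label
demo = pd.DataFrame({
    'genre_top': ['Rock'] * 60 + ['Pop'] * 30 + ['Jazz'] * 10,
    'duration': range(100),
})

# Plain random sample: fast, but rare genres can shrink or vanish by chance
random_sample = demo.sample(frac=0.1, random_state=0)

# Stratified sample: take the same fraction within each genre
stratified_sample = demo.groupby('genre_top').sample(frac=0.1, random_state=0)

print(stratified_sample['genre_top'].value_counts())
```

Fixing `random_state` keeps the subset reproducible between runs, which makes comparisons across model iterations fairer.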

If you find that fitting a model to classify *all* genres isn't very good, it's totally OK to limit to the most frequent genres, or perhaps trying to combine or cluster genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
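
Limiting to the most frequent genres can be sketched as follows, again on a made-up frame. `top_k` is a hypothetical parameter, and pooling rare genres into an `'Other'` label is one possible way to combine them.

```python
import pandas as pd

# Made-up frame with two common and two rare genres
demo = pd.DataFrame({
    'genre_top': ['Rock'] * 6 + ['Pop'] * 3 + ['Jazz'] * 2 + ['Blues'],
    'duration': range(12),
})

# Option 1: keep only the rows belonging to the top_k most frequent genres
top_k = 2
top_genres = demo['genre_top'].value_counts().nlargest(top_k).index
subset = demo[demo['genre_top'].isin(top_genres)]

# Option 2: keep every row, but pool the rare genres into one 'Other' label
pooled = demo['genre_top'].where(demo['genre_top'].isin(top_genres), 'Other')

print(subset['genre_top'].value_counts())
print(pooled.value_counts())
```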

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

| Feature | Coefficient value |
|---------|-------------------|
| favorites | 1.14E-01 |
| favorites.2 | 1.39E-02 |
| number | 1.12E-02 |
| latitude | 3.79E-03 |
| favorites.1 | 2.29E-03 |
| longitude | 1.39E-03 |
| duration | 2.73E-04 |
| listens.1 | 3.19E-05 |
| id.1 | 2.92E-05 |
| id | 7.02E-06 |
| listens | 3.66E-06 |
| tracks | -4.94E-07 |
| bit_rate | -2.95E-06 |
| interest | -1.08E-05 |
| comments.1 | -3.30E-03 |
| comments | -3.98E-03 |
| comments.2 | -2.70E-01 |

  - What are the best predictors of genre?

    Answer: Judging by the linear regression coefficients, the features related to favorites (positive coefficients) and comments (negative coefficients) appear to be the best predictors of genre.

  - What information isn't very useful for predicting genre?

    Answer: Judging by the linear regression coefficients, the features related to listens and tracks appear to be the weakest predictors of genre.

  - What surprised you the most about your results?

    Answer: I had expected listens to be a better predictor of genre.
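
One caution when ranking features by coefficient size: a raw coefficient depends on the units of its feature, so a column measured in the thousands (such as a listen count) tends to get a tiny coefficient even when it is informative. The sketch below illustrates this on synthetic data, where both made-up features matter equally; standardizing before fitting makes the coefficients comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
small_scale = rng.normal(0, 1, n)       # feature with values near zero
large_scale = rng.normal(0, 10_000, n)  # feature with values in the thousands

# Both features contribute equally once measured in standard deviations
target = small_scale + large_scale / 10_000 + rng.normal(0, 0.1, n)

# Raw fit: the coefficients differ by orders of magnitude
X_raw = np.column_stack([small_scale, large_scale])
raw_coef = LinearRegression().fit(X_raw, target).coef_

# Standardized fit: the coefficients are now on a common scale
X_std = StandardScaler().fit_transform(X_raw)
std_coef = LinearRegression().fit(X_std, target).coef_

print(raw_coef)
print(std_coef)
```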

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a long-term project than a stretch goal, and won't fit in Colab) if you'd like you can check those out and see what sort of deeper analysis you can do.
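
For the classifier comparison suggested in the list above, a minimal sketch of how similar the API is across model families. It uses scikit-learn's bundled iris data and default hyperparameters, both assumptions made to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Four model families behind the same fit/predict/score interface
models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'naive Bayes': GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X_iris, y_iris, cv=5).mean()
          for name, model in models.items()}

for name, acc in scores.items():
    print(f'{name}: {acc:.3f}')
```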