# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

### Get data, option 1: Kaggle API

#### Sign up for Kaggle and get an API token
1. [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. 
2. [Follow these instructions](https://github.com/Kaggle/kaggle-api#api-credentials) to create a Kaggle “API Token” and download your `kaggle.json` file.

_This will enable you to download data directly from Kaggle. If you run into problems, don’t worry — I’ll give you an easy alternative way to download today’s data, so you can still follow along with the lecture hands-on. And then we’ll help you through the Kaggle process after the lecture._

#### Put `kaggle.json` in the correct location

- ***If you're using Anaconda,*** put the file in the directory specified in the [instructions](https://github.com/Kaggle/kaggle-api#api-credentials).

- ***If you're using Google Colab,*** upload the file to your Google Drive, and run this cell:

In [3]:
from google.colab import drive
drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/


#### Install the Kaggle API package and use it to get the data

You also have to join the Titanic competition to have access to the data.

In [0]:
!pip install kaggle



In [0]:
!kaggle competitions download -c titanic

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 6, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python2.7/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python2.7/dist-packages/kaggle/api/kaggle_api_extended.py", line 116, in authenticate
    self.config_file, self.config_dir))
IOError: Could not find kaggle.json. Make sure it's located in /content/drive/My Drive/. Or use the environment method.


In [0]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

Mounted at /content/drive
env: KAGGLE_CONFIG_DIR=/content/drive/My Drive/


### Get data, option 2: Download from the competition page
1. [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. 
2. [Go to the Titanic competition page](https://www.kaggle.com/c/titanic) to download the [data](https://www.kaggle.com/c/titanic/data).

### Get data, option 3: Use Seaborn

```
import seaborn as sns
train = sns.load_dataset('titanic')
```

But Seaborn's version of the Titanic dataset is not identical to Kaggle's version, as we'll see during this lesson!

### Read data

In [0]:
import pandas as pd

In [0]:
from google.colab import files
uploaded = files.upload()

Saving gender_submission.csv to gender_submission.csv


In [0]:
from google.colab import files
uploaded = files.upload()

Saving test.csv to test.csv


In [0]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train.csv


In [0]:
train = pd.read_csv('train.csv')
train.shape

(891, 12)

In [0]:
train.sample(n=5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
540,541,1,1,"Crosby, Miss. Harriet R",female,36.0,0,2,WE/P 5735,71.0,B22,S
792,793,0,3,"Sage, Miss. Stella Anna",female,,8,2,CA. 2343,69.55,,S
490,491,0,3,"Hagland, Mr. Konrad Mathias Reiersen",male,,1,0,65304,19.9667,,S
799,800,0,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,1,1,345773,24.15,,S
310,311,1,1,"Hays, Miss. Margaret Bechstein",female,24.0,0,0,11767,83.1583,C54,C


In [0]:
test = pd.read_csv('test.csv')
test.shape

(418, 11)

In [0]:
test.sample(n=5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
318,1210,3,"Jonsson, Mr. Nils Hilding",male,27.0,0,0,350408,7.8542,,S
329,1221,2,"Enander, Mr. Ingvar",male,21.0,0,0,236854,13.0,,S
192,1084,3,"van Billiard, Master. Walter John",male,11.5,1,1,A/5. 851,14.5,,S
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
153,1045,3,"Klasen, Mrs. (Hulda Kristina Eugenia Lofqvist)",female,36.0,0,2,350405,12.1833,,S


In [0]:
gender = pd.read_csv('gender_submission.csv')
gender.shape

(418, 2)

In [0]:
# Raw counts of who survived (1) and who did not (0)
target = 'Survived'
train[target].value_counts()
# Pass normalize=True to get proportions instead of counts
#train[target].value_counts(normalize=True)

0    549
1    342
Name: Survived, dtype: int64

In [0]:
# Looking at numeric only
train.describe(include='number')

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [0]:
# Looking at non numeric cols
train.describe(exclude='number')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"McGovern, Miss. Mary",male,1601,G6,S
freq,1,577,7,4,644


### How would we try to do this with linear regression?

https://scikit-learn.org/stable/modules/impute.html

In [0]:
# scikit-learn expects no missing values and all features already numerically encoded
# Import the imputer and the model
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

features = ['Pclass', 'Age', 'Fare']
target = 'Survived'
X_train = train[features]
y_train = train[target]
X_test = test[features]

imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)



lin_reg = LinearRegression()
lin_reg.fit(X_train_imputed, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [0]:
X_train_imputed.shape, X_train.shape

((891, 3), (891, 3))

In [0]:
# fit_transform returns a NumPy array, so slice it instead of calling .head()
# The NaN values are now filled with the column mean
X_train_imputed[:6]

array([[ 3.        , 22.        ,  7.25      ],
       [ 1.        , 38.        , 71.2833    ],
       [ 3.        , 26.        ,  7.925     ],
       [ 1.        , 35.        , 53.1       ],
       [ 3.        , 35.        ,  8.05      ],
       [ 3.        , 29.69911765,  8.4583    ]])

In [0]:
# The test set is filled with the means learned from the training set
# This is the proper way to impute: fill values should come from the training data only
X_test_imputed[-5:]

array([[  3.        ,  29.69911765,   8.05      ],
       [  1.        ,  39.        , 108.9       ],
       [  3.        ,  38.5       ,   7.25      ],
       [  3.        ,  29.69911765,   8.05      ],
       [  3.        ,  29.69911765,  22.3583    ]])

In [0]:
# For comparison: the mean Age over train and test combined (not the value the imputer used)
pd.concat([X_train, X_test])['Age'].mean()

29.881137667304014
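
The combined mean (about 29.88) differs from the training-only mean (about 29.70) that the imputer filled in above. Here is a minimal pandas sketch of the same idea. The `toy_train` and `toy_test` frames and their values are made up for illustration and are not the Titanic data:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the Age column (values are made up for illustration)
toy_train = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 47.0]})

# Learn the fill value from the training rows only
train_mean = toy_train['Age'].mean()

# Apply that same value to both sets
toy_train_filled = toy_train.fillna({'Age': train_mean})
toy_test_filled = toy_test.fillna({'Age': train_mean})

# For comparison: the mean over train and test combined,
# which would let test-set information leak into the fill value
combined_mean = pd.concat([toy_train, toy_test])['Age'].mean()

print('train-only mean:', train_mean)
print('combined mean:  ', combined_mean)
```

Filling with `train_mean` in both frames mirrors calling `fit_transform` on the training set and `transform` on the test set.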

In [0]:
# A rich 5-year-old in first class
import numpy as np
# Feature order: Pclass, Age, Fare
test_case = np.array([[1, 5, 500]])
lin_reg.predict(test_case)

array([1.19207871])

In [0]:
y_pred = lin_reg.predict(X_test_imputed)

In [0]:
pd.Series(y_pred).describe()

count    418.000000
mean       0.392117
std        0.181876
min        0.011755
25%        0.227341
50%        0.339570
75%        0.516439
max        0.954827
dtype: float64

In [0]:
# Each additional year of age lowers the predicted survival value by about 0.007
pd.Series(lin_reg.coef_, X_train.columns)

Pclass   -0.210390
Age      -0.007358
Fare      0.000751
dtype: float64

In [0]:
lin_reg.intercept_

1.063899500003544
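
To see where linear regression goes wrong, it can help to redo the prediction for the rich 5-year-old by hand. The sketch below copies the rounded coefficients and intercept from the outputs above, so the result should match `lin_reg.predict(test_case)` only up to rounding:

```python
import numpy as np

# Coefficients (rounded) and intercept copied from the outputs above
coefs = np.array([-0.210390, -0.007358, 0.000751])  # Pclass, Age, Fare
intercept = 1.063899500003544

# The same test case: first class, 5 years old, fare of 500
case = np.array([1, 5, 500])

# Linear regression is just intercept + weighted sum of the features
linear_pred = intercept + coefs @ case

print('linear prediction:', linear_pred)
print('greater than 1?  ', linear_pred > 1)
```

Nothing in the linear equation keeps the output inside the unit interval, so a large fare pushes the "probability" of survival above 1.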

### How would we do this with Logistic Regression?

In [0]:
from sklearn.linear_model import LogisticRegression

# The API is the same as for linear regression
log_reg = LogisticRegression(solver='lbfgs')
# Always fit on the imputed data because it has no NaNs
log_reg.fit(X_train_imputed, y_train)
# Discrete class prediction, then class probabilities
print('5 year old prediction:', log_reg.predict(test_case))
print('Predicted Prob:', log_reg.predict_proba(test_case))

5 year old prediction: [1]
Predicted Prob: [[0.02778799 0.97221201]]


In [0]:
threshold = 0.5
log_reg.predict_proba(X_test_imputed)[:, 1] > threshold

In [0]:
# Behind the scenes, predict() thresholds the positive-class probability at 0.5
man_predict = (log_reg.predict_proba(X_test_imputed)[:, 1] > threshold).astype(int)
direct_predict = log_reg.predict(X_test_imputed)

all(man_predict == direct_predict)

True

### How accurate is the Logistic Regression?

In [0]:
score = log_reg.score(X_train_imputed, y_train)
print('Train Acc Score', score)

Train Acc Score 0.7025813692480359


In [0]:
# Kaggle withholds the test labels, so there is no y_test;
# we can't get a test score until we submit to Kaggle
# score = log_reg.score(X_test_imputed, y_test)

In [0]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(log_reg, X_train_imputed, y_train, cv=10)
print('CVAS', scores)

CVAS [0.63333333 0.62222222 0.68539326 0.71910112 0.69662921 0.69662921
 0.76404494 0.75280899 0.73033708 0.71590909]


In [0]:
# Put the scores in a Series to look at the min, mean, and max
pd.Series(scores).describe()

In [0]:
X_train_imputed.shape

(891, 3)

In [0]:
y_pred = log_reg.predict(X_train_imputed)

In [0]:
y_pred[:5]

array([0, 1, 0, 1, 0])

In [0]:
y_train[:5].values

array([0, 1, 1, 1, 0])

In [0]:
# 4 of the first 5 predictions above match the true labels
correct_predictions = 4
total_predictions = 5
accuracy = correct_predictions / total_predictions


In [0]:
from sklearn.metrics import accuracy_score
accuracy_score(y_train[:5], y_pred[:5])

0.8
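
Accuracy is simply the fraction of predictions that match the true labels. As a small sketch, the first five labels and predictions from the outputs above are hard-coded here and compared element by element:

```python
import numpy as np

# First five true labels and predictions, copied from the outputs above
y_true_5 = np.array([0, 1, 1, 1, 0])
y_pred_5 = np.array([0, 1, 0, 1, 0])

# Element-wise comparison gives True where the prediction is correct
matches = y_true_5 == y_pred_5

# The mean of a boolean array is the fraction of True values
manual_accuracy = matches.mean()

print('correct:', matches.sum(), 'of', len(matches))
print('accuracy:', manual_accuracy)
```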

### What's the math for the Logistic Regression?

https://en.wikipedia.org/wiki/Logistic_function

https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study

In [0]:
log_reg.coef_

array([[-0.9345267 , -0.03569729,  0.00422069]])

In [0]:
log_reg.intercept_

array([2.55763985])

In [0]:
# Sigmoid: squashes any real number into the interval (0, 1)
def sigmoid(x):
  return 1 / (1 + np.exp(-x))

In [0]:
sigmoid(np.dot(log_reg.coef_, test_case.T) + log_reg.intercept_)

array([[0.97221201]])
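
The same calculation can be written out step by step without the fitted model. This sketch hard-codes the coefficients and intercept printed above (rounded to the digits shown), so it should agree with `predict_proba` only up to rounding:

```python
import numpy as np

# Coefficients and intercept copied from the outputs above
coefs = np.array([-0.9345267, -0.03569729, 0.00422069])  # Pclass, Age, Fare
intercept = 2.55763985

# The same test case: first class, 5 years old, fare of 500
case = np.array([1, 5, 500])

# Step 1: the linear part gives the log-odds of survival
log_odds = intercept + coefs @ case

# Step 2: the sigmoid maps the log-odds to a probability in (0, 1)
prob = 1 / (1 + np.exp(-log_odds))

print('log-odds:', log_odds)
print('probability of survival:', prob)
```

The linear part is the same kind of weighted sum that linear regression uses. The difference is that its output is treated as log-odds and passed through the sigmoid, which is why the result always stays between 0 and 1.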

## Feature Engineering

Get the [Category Encoders](http://contrib.scikit-learn.org/categorical-encoding/) library.

If you're running on Google Colab:

```
!pip install category_encoders
```

If you're running locally with Anaconda:

```
!conda install -c conda-forge category_encoders
```

In [22]:
!pip install category_encoders



## Assignment: real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and try to use to classify genre of tracks. To get you started:

### Get and unzip the data

#### Google Colab

In [0]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip

#### Windows
- Download the [zip file](https://os.unil.cloud.switch.ch/fma/fma_metadata.zip)
- You may need to use [7zip](https://www.7-zip.org/download.html) to unzip it


#### Mac
- Download the [zip file](https://os.unil.cloud.switch.ch/fma/fma_metadata.zip)
- You may need to use [p7zip](https://superuser.com/a/626731) to unzip it

### Look at first 3 lines of raw file

In [9]:
!head -n 3 fma_metadata/tracks.csv

,album,album,album,album,album,album,album,album,album,album,album,album,album,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,set,set,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track
,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### Read with pandas
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [0]:
import pandas as pd

### **Using select DataFrames**

In [32]:
tracks = pd.read_csv('fma_metadata/tracks.csv')



In [0]:
genre = pd.read_csv('fma_metadata/genres.csv')

### **Cleaning Tracks**

In [34]:
tracks.shape

(106576, 53)

In [35]:
# Need to fix column names
# Viewing all columns
pd.set_option('display.max_columns', None)  
tracks.head(1)

Unnamed: 0.1,Unnamed: 0,album,album.1,album.2,album.3,album.4,album.5,album.6,album.7,album.8,album.9,album.10,album.11,album.12,artist,artist.1,artist.2,artist.3,artist.4,artist.5,artist.6,artist.7,artist.8,artist.9,artist.10,artist.11,artist.12,artist.13,artist.14,artist.15,artist.16,set,set.1,track,track.1,track.2,track.3,track.4,track.5,track.6,track.7,track.8,track.9,track.10,track.11,track.12,track.13,track.14,track.15,track.16,track.17,track.18,track.19
0,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title


In [36]:
# Use the first row (the second header level) as the column names
new_cols = tracks.iloc[0]
# Keep everything beneath that row; copy so the columns can be renamed safely
new_tracks = tracks[1:].copy()
new_tracks.columns = new_cols

# Drop the leftover 'track_id' header row as well
new_tracks = new_tracks[1:]
new_tracks.head(1)

Unnamed: 0,nan,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
2,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
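
An alternative to fixing the header by hand is to tell `pd.read_csv` that the file has two header rows and that the first column is the index. The sketch below uses a tiny in-memory CSV shaped like the first lines of `tracks.csv`. The rows in `toy_csv` are made up, and whether this layout suits the rest of your analysis is an assumption to check for yourself:

```python
import io
import pandas as pd

# A tiny stand-in with the same three-line header layout as tracks.csv
# (the data rows are made up for illustration)
toy_csv = io.StringIO(
    ",album,album,track,track\n"
    ",id,title,favorites,genre_top\n"
    "track_id,,,,\n"
    "2,1,Toy Album A,2,Hip-Hop\n"
    "5,1,Toy Album A,6,Hip-Hop\n"
    "10,6,Toy Album B,178,Pop\n"
)

# header=[0, 1] reads the first two rows as a two-level column index;
# index_col=0 uses the first column (the track ids) as the row index
toy_tracks = pd.read_csv(toy_csv, index_col=0, header=[0, 1])

# Columns are now addressed by (group, name), so duplicated names
# such as 'favorites' or 'id' stay distinguishable
print(toy_tracks[('track', 'genre_top')])
print(toy_tracks['track'].columns.tolist())
```

With two-level columns, `toy_tracks['track']` selects only the track-level fields, which avoids the repeated `comments`, `favorites`, and `id` names seen in the flattened header.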


In [37]:
# Count the NaNs across the whole frame
# That is a lot
# To fit a model I will need to drop them
new_tracks.isna().sum().sum()

1673813

### **Looking at which columns I will use to find top genres**

In [48]:
new_tracks.columns

Index([                nan,          'comments',      'date_created',
           'date_released',          'engineer',         'favorites',
                      'id',       'information',           'listens',
                'producer',              'tags',             'title',
                  'tracks',              'type', 'active_year_begin',
         'active_year_end', 'associated_labels',               'bio',
                'comments',      'date_created',         'favorites',
                      'id',          'latitude',          'location',
               'longitude',           'members',              'name',
        'related_projects',              'tags',           'website',
          'wikipedia_page',             'split',            'subset',
                'bit_rate',          'comments',          'composer',
            'date_created',     'date_recorded',          'duration',
               'favorites',         'genre_top',            'genres',
              'genre

In [50]:
# Very informative
new_tracks['genre_top'].value_counts().head()

Rock            14182
Experimental    10608
Electronic       9372
Hip-Hop          3552
Folk             2803
Name: genre_top, dtype: int64

In [51]:
# Non Descriptive
new_tracks['genres_all'].value_counts().head()

[21]                    2735
[15]                    2689
[]                      2231
[12]                    1896
[1, 38, 41, 247, 30]    1633
Name: genres_all, dtype: int64

### Fit Logistic Regression!

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [64]:
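# With an 80/20 train/test split I see an increase in accuracy
# Note: 'favorites' appears three times in the columns (album, artist, track),
# so selecting it by name returns all three columns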
best_tracks = new_tracks[['genre_top','favorites']]
best_tracks = best_tracks.dropna(how='any',axis=0)

X = best_tracks['favorites']
y = best_tracks['genre_top']

X_train,X_test,Y_train,Y_test = train_test_split(X,y, train_size=.8, test_size=.2, random_state=42)

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train,Y_train)
log_reg.score(X_test,Y_test)



0.31340725806451614

This dataset is bigger than many you've worked with so far, and while it should fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting, or [downcasting numeric datatypes](https://www.dataquest.io/blog/pandas-big-data/).
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers
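
As a rough sketch of the sampling and downcasting suggestion in the caveats above, the example below builds a made-up frame (the column names `listens` and `duration` are borrowed from the dataset, the values are not) and shrinks it with `pd.to_numeric`. Downcasting floats can lose precision, so treat this as an illustration rather than a recipe:

```python
import numpy as np
import pandas as pd

# Made-up numeric data stored in 64-bit types
toy = pd.DataFrame({
    'listens': np.arange(1000, dtype='int64'),
    'duration': np.arange(1000) * 0.5,
})
before = toy.memory_usage(deep=True).sum()

# Downcast each column to the smallest type that can hold its values
toy['listens'] = pd.to_numeric(toy['listens'], downcast='integer')
toy['duration'] = pd.to_numeric(toy['duration'], downcast='float')
after = toy.memory_usage(deep=True).sum()

# Work on a random 10% sample while iterating on the model
toy_sample = toy.sample(frac=0.1, random_state=42)

print(toy.dtypes)
print('bytes before:', before, 'after:', after)
print('sample rows:', len(toy_sample))
```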

If you find that fitting a model to classify *all* genres isn't very good, it's totally OK to limit to the most frequent genres, or perhaps trying to combine or cluster genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
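
One possible way to limit the problem to the most frequent genres is to take the top entries of `value_counts()` and filter on them. The frame below is a made-up stand-in, and `top_n` is a hypothetical parameter you would tune yourself:

```python
import pandas as pd

# Made-up stand-in for the cleaned tracks frame
toy = pd.DataFrame({
    'genre_top': ['Rock', 'Rock', 'Rock', 'Folk', 'Folk', 'Jazz', 'Pop', None],
    'favorites': [3, 0, 5, 1, 2, 0, 4, 1],
})

# Hypothetical cutoff: keep only the N most frequent genres
top_n = 2

# value_counts() sorts by frequency and ignores missing genres
top_genres = toy['genre_top'].value_counts().head(top_n).index

# Keep the rows whose genre is one of the frequent ones
subset = toy[toy['genre_top'].isin(top_genres)]

print(list(top_genres))
print(subset)
```

Rows with a missing genre drop out as well, since `isin` is False for them.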

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

***I believe the best predictors of genre are favorites and genre top. Genre top lets you see the top genres if you sort by value_counts, and the same goes for favorites. Using tracks or all genres is not useful for predicting genre; it is quite pointless when that has already been sorted for you. I wouldn't say anything really surprised me about my results, although the score increased when I adjusted the training and testing split.***

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a long-term project than a stretch goal, and won't fit in Colab) if you'd like you can check those out and see what sort of deeper analysis you can do.