# Logistic Regression with scikit-learn

*Adapted from https://github.com/justmarkham*

### Libraries

- [scikit-learn](http://scikit-learn.org/stable/)
- pandas
- matplotlib

In this tutorial we will see some basic examples of Logistic Regression for classification.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
%matplotlib inline

## Classification with Logistic Regression

|*|continuous|categorical|
|---|---|---|
|**supervised**|regression|**classification**|
|**unsupervised**|dim. reduction|clustering|

# Predicting Titanic survival with Logistic Regression

Let's use the data obtained by the _Encyclopedia Titanica_ to predict if a passenger survived the Titanic disaster.

<img src="img/titanic.jpg" width="600">

Let's import the dataset _titanic.csv_ (_hint:_ use the pandas `read_csv` function):

In [12]:
titanic = pd.read_csv("data/titanic.csv", sep=',')
titanic.head(5)

| |survived|name|sex|age|sibsp|parch|ticket|fare|cabin|embarked|
|---|---|---|---|---|---|---|---|---|---|---|
|0|1|Allen, Miss. Elisabeth Walton|female|29.0|0|0|24160|211.3375|B5|S|
|1|1|Allison, Master. Hudson Trevor|male|0.9167|1|2|113781|151.55|C22 C26|S|
|2|0|Allison, Miss. Helen Loraine|female|2.0|1|2|113781|151.55|C22 C26|S|
|3|0|Allison, Mr. Hudson Joshua Creighton|male|30.0|1|2|113781|151.55|C22 C26|S|
|4|0|Allison, Mrs. Hudson J C (Bessie Waldo Daniels)|female|25.0|1|2|113781|151.55|C22 C26|S|


What are the **features**?
- name: Name of the passenger
- sex: Male or Female
- age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Ticket price
- cabin: Cabin number
- embarked: Port of Embarkation

What is the **response**?
- survived: whether the passenger survived the disaster or not

Print the number of survivors and of passengers who died, as well as the percentage of survivors. Is the dataset balanced?

In [13]:
titanic['survived']

0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1309, dtype: int64

In [24]:
dead = titanic[titanic['survived'] == 0]
survived = titanic[titanic['survived'] == 1]
number_of_deaths = len(dead)
number_of_survivors = len(survived)
percentage_of_deaths = (number_of_deaths / len(titanic)) * 100
percentage_of_survivors = (number_of_survivors / len(titanic)) * 100

# print the required information
print('Number of deaths : ', number_of_deaths, '\n', 'Percentage of deaths : ', percentage_of_deaths)
print('\nNumber of survivors : ', number_of_survivors, '\n', 'Percentage of survivors : ', percentage_of_survivors)

Number of deaths :  809 
 Percentage of deaths :  61.80290297937356

Number of survivors :  500 
 Percentage of survivors :  38.19709702062643
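
With roughly 62% of passengers lost against 38% surviving, the dataset is moderately imbalanced. The same counts and percentages can also be read off more compactly with `value_counts`. The sketch below uses a small made-up `survived` column (not the Titanic data) so that it runs on its own; on the real dataset the equivalent call would be `titanic['survived'].value_counts(normalize=True)`.

```python
import pandas as pd

# Toy stand-in for titanic['survived'] (hypothetical values, 1 = survived)
survived = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], name="survived")

counts = survived.value_counts()                 # absolute count per class
shares = survived.value_counts(normalize=True)   # fraction per class

print(counts)
print((shares * 100).round(1))
```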


Specify the columns to use as features:

In [25]:
titanic_features = ['sex', 'age', 'sibsp', 'parch', 'fare']

For the sake of this exercise, we can assume the other features (name, ticket, cabin number, embarked) are not predictive.

### Let's prepare the feature vector for the training

The selected features contain one categorical variable: sex (male|female)

We need to convert it to a numerical variable. Use the pandas method `get_dummies` to take care of this. Check https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [49]:
X = pd.get_dummies(titanic[titanic_features])
X.head()

| |age|sibsp|parch|fare|sex_female|sex_male|
|---|---|---|---|---|---|---|
|0|29.0|0|0|211.3375|1|0|
|1|0.9167|1|2|151.55|0|1|
|2|2.0|1|2|151.55|1|0|
|3|30.0|1|2|151.55|0|1|
|4|25.0|1|2|151.55|1|0|


The categorical feature _sex_ is converted into two binary features, `sex_female` and `sex_male`.
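
The two columns carry the same information, since `sex_male` is always `1 - sex_female`. The minimal sketch below, on a made-up frame rather than the Titanic data, shows how the `drop_first` argument of `get_dummies` keeps a single indicator column instead. Whether to drop one column is a modelling choice; this tutorial keeps both.

```python
import pandas as pd

# Toy frame with the same kind of categorical column (hypothetical rows)
toy = pd.DataFrame({"sex": ["female", "male", "female"], "age": [29.0, 30.0, 25.0]})

both = pd.get_dummies(toy)                       # one indicator column per category
single = pd.get_dummies(toy, drop_first=True)    # first category dropped

print(list(both.columns))
print(list(single.columns))
```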

The Titanic sank in 1912, a long time ago, so some data may be missing. Let's check if there are undefined values. Use pandas' `isna` for this purpose: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html

In [52]:
# Show the rows that contain at least one missing value
X[X.isna().any(axis=1)]

| |age|sibsp|parch|fare|sex_female|sex_male|
|---|---|---|---|---|---|---|
|15|NaN|0|0|25.9250|0|1|
|37|NaN|0|0|26.5500|0|1|
|40|NaN|0|0|39.6000|0|1|
|46|NaN|0|0|31.0000|0|1|
|59|NaN|0|0|27.7208|1|0|
|...|...|...|...|...|...|...|
|1293|NaN|0|0|8.0500|0|1|
|1297|NaN|0|0|7.2500|0|1|
|1302|NaN|0|0|7.2250|0|1|
|1303|NaN|0|0|14.4583|0|1|


Let's try to fix the data with a basic imputation method: replacing each missing value with the mean of its column. Use pandas' `fillna` for this purpose: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html. The `any` method can also be useful: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html

More info: https://en.wikipedia.org/wiki/Imputation_(statistics)

In [58]:
# Replace each missing value with the mean of its column
X = X.fillna(X.mean())

# check if X has any missing values
len(X[X.isna().any(axis=1)])


0

Create the label vector `y`:

In [60]:
y = titanic.survived

Let's create a Logistic Regression model...

In [63]:
logistic = LogisticRegression()

... and evaluate its precision and recall with 10-fold cross-validation. For this, use the `cross_val_score` implementation provided by `sklearn` and already imported above. _Hint:_ check the `scoring` argument of this function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

In [66]:
precision = cross_val_score(logistic, X, y, cv=10, scoring='precision')
recall = cross_val_score(logistic, X, y, cv=10, scoring='recall')

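# Precision: fraction of predicted survivors who actually survived (penalizes false positives)
print("Precision: %0.2f (+/- %0.2f)" % (precision.mean(), precision.std() * 2))
# Recall: fraction of actual survivors the model identified (penalizes false negatives)
print("Recall: %0.2f (+/- %0.2f)" % (recall.mean(), recall.std() * 2))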

Precision: 0.72 (+/- 0.13)
Recall: 0.67 (+/- 0.16)
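
Two refinements are worth considering here. Calling `cross_val_score` twice refits the model once per metric, and filling missing ages with the mean of the whole dataset lets information from the held-out folds leak into training. The sketch below is one possible alternative, not the method used in this tutorial: it puts a `SimpleImputer` and the classifier in a pipeline so that the mean is recomputed on each training fold, and it uses `cross_validate` to score both metrics in one pass. The synthetic data from `make_classification`, the 20% missing rate, and the `max_iter` value are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for (X, y); not the Titanic data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Knock out some values in the first column to mimic the missing ages
rng = np.random.default_rng(0)
X_demo[rng.random(len(X_demo)) < 0.2, 0] = np.nan

# The imputer is fitted on the training part of each fold only
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression(max_iter=1000))

scores = cross_validate(model, X_demo, y_demo, cv=10, scoring=["precision", "recall"])
print("Precision: %0.2f" % scores["test_precision"].mean())
print("Recall: %0.2f" % scores["test_recall"].mean())
```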


### Explore the model output

Let's create a new Logistic Regression model and train it on the full dataset:

In [68]:
logistic = LogisticRegression()

# Train the model
logistic.fit(X,y)

LogisticRegression()

Of course, since we trained on the whole dataset, we don't have new samples to predict, but we can predict the outcome and the associated probability for some artificial samples. Would you have survived?

Remember the features:

In [69]:
X.columns

Index(['age', 'sibsp', 'parch', 'fare', 'sex_female', 'sex_male'], dtype='object')

Would a 25-year-old man with no relatives on board and a fare of 100 survive? _Hint:_ use the model's `predict` method to make the prediction.

In [85]:
test = [25,0,0,100,0,1]

# Check if he would have survived
print("YES" if logistic.predict([test])[0] == 1 else "NO")



NO
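
Because the model was fitted on a DataFrame, recent scikit-learn versions warn when `predict` receives a plain list without column names. A possible way around this, sketched here on made-up data with the same column layout, is to wrap the sample in a one-row DataFrame. This also makes it explicit which value belongs to which feature. The training rows and the `max_iter` value below are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

columns = ["age", "sibsp", "parch", "fare", "sex_female", "sex_male"]

# Tiny made-up training set (not the Titanic data), just to have a fitted model
X_toy = pd.DataFrame(
    [[29, 0, 0, 211.0, 1, 0], [30, 1, 2, 151.0, 0, 1],
     [25, 1, 2, 151.0, 1, 0], [40, 0, 0, 8.0, 0, 1],
     [35, 0, 0, 90.0, 1, 0], [22, 0, 0, 7.0, 0, 1]],
    columns=columns,
)
y_toy = [1, 0, 1, 0, 1, 0]
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# One-row DataFrame with named columns instead of a bare list
sample = pd.DataFrame([[25, 0, 0, 100, 0, 1]], columns=columns)
prediction = model.predict(sample)[0]
print("YES" if prediction == 1 else "NO")
```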


What is the probability distribution behind this prediction? _Hint:_ use the model's `predict_proba` method to get the probability of each class.

In [86]:
# Probability distribution
logistic.predict_proba([test])



array([[0.55322817, 0.44677183]])
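
The two numbers are the estimated probabilities of class 0 (died) and class 1 (survived), in the order given by `logistic.classes_`. `predict` returns the class with the larger probability, which is why a survival probability of about 0.447 yields NO. For a binary model with default settings, the second value is the logistic (sigmoid) function applied to a linear combination of the features, $p = 1 / (1 + e^{-(w \cdot x + b)})$. The minimal sketch below reproduces `predict_proba` by hand from `coef_` and `intercept_`; it uses synthetic data, not the Titanic model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

x = X_demo[:1]                                    # one sample, shape (1, 4)
z = x @ model.coef_.T + model.intercept_          # linear score w.x + b
p_class1 = 1.0 / (1.0 + np.exp(-z))               # sigmoid gives P(class 1)

manual = np.hstack([1.0 - p_class1, p_class1])    # [P(class 0), P(class 1)]
print(manual)
print(model.predict_proba(x))
```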

What about a woman, 35 years old, alone onboard and with the same fare?

In [87]:
test = [35,0,0,100,1,0]

# Check if she would have survived
print("YES" if logistic.predict([test])[0] == 1 else "NO")

# Probability distribution
print("proba : ",logistic.predict_proba([test]))



YES




proba :  [[0.11257807 0.88742193]]
