# Subject: Classical Data Analysis

## Session 3 - Logistic Regression with one variable

### Demo 1 - Logistic Regression in Python

In the last lessons, we introduced linear regression as a predictive modeling method to estimate numeric variables. Now we turn our attention to classification: prediction tasks where the response variable is categorical instead of numeric. In this lesson we will learn how to use a common classification technique known as logistic regression and apply it to the Titanic survival data we used in lesson 2.

## 1. Revisiting the Titanic

We'll start by loading the data and then carrying out a few of the same preprocessing tasks:

In [74]:
import pandas as pd
from sklearn import linear_model
from sklearn import preprocessing

In [75]:
df=pd.read_csv("C:/Users/francisco.sacramento/Desktop/Master_Big_Data_Phyton/6_Exercices/Classical Data Analysis/Session_3_CDA/1_titanic_dataset.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [76]:
df.shape

(891, 12)

We will build a logistic regression model that uses only the Sex variable as a predictor. Before fitting the model, we need to convert Sex to a number, because sklearn's machine learning functions only deal with numeric inputs. We can convert a categorical variable like Sex into a number using the sklearn preprocessing class LabelEncoder():

In [77]:
# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()

# Convert Sex variable to numeric
encoded_sex = label_encoder.fit_transform(df["Sex"])


In [78]:
X=pd.DataFrame(encoded_sex)
X

Unnamed: 0,0
0,1
1,0
2,0
3,0
4,1
5,1
6,1
7,1
8,0
9,0
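
To make the encoding explicit: LabelEncoder numbers the distinct labels in sorted order, so "female" becomes 0 and "male" becomes 1, which matches the column above. The following is a minimal pure-Python sketch of that mapping, not sklearn's own implementation; the names `sex_values`, `classes` and `code_for` are made up for illustration, and the sample values are the Sex entries of the first five passengers shown earlier.

```python
# Sex values of the first five passengers from df.head()
sex_values = ["male", "female", "female", "female", "male"]

# Distinct labels in sorted order: the position of each label is its code
classes = sorted(set(sex_values))
code_for = {label: code for code, label in enumerate(classes)}

# Replace every label with its integer code
encoded = [code_for[value] for value in sex_values]

print(classes)
print(code_for)
print(encoded)
```

With the fitted encoder from the notebook, the same label order should be available as `label_encoder.classes_`.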


In [79]:
# Initialize logistic regression model
log_model = linear_model.LogisticRegression()

# Train the model
log_model.fit(X = pd.DataFrame(encoded_sex), 
              y = df["Survived"])

# Check trained model intercept
print(log_model.intercept_)

# Check trained model coefficients
print(log_model.coef_)

[ 1.00027876]
[[-2.43010712]]


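The logistic regression model coefficients look similar to the output we saw for linear regression. The model produced a positive intercept of about 1.000 and a weight of about -2.430 on the encoded Sex variable. Because LabelEncoder maps female to 0 and male to 1, the negative weight means that being male lowers the predicted chance of survival. Let's use the model to make predictions: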

In [80]:
# Make predictions
preds = log_model.predict_proba(X= pd.DataFrame(encoded_sex)) # Use model.predict_proba() to get the predicted class probabilities.

In [81]:
preds

array([[ 0.80687457,  0.19312543],
       [ 0.26888662,  0.73111338],
       [ 0.26888662,  0.73111338],
       ..., 
       [ 0.26888662,  0.73111338],
       [ 0.80687457,  0.19312543],
       [ 0.80687457,  0.19312543]])

In [82]:
preds = pd.DataFrame(preds)
preds

Unnamed: 0,0,1
0,0.806875,0.193125
1,0.268887,0.731113
2,0.268887,0.731113
3,0.268887,0.731113
4,0.806875,0.193125
5,0.806875,0.193125
6,0.806875,0.193125
7,0.806875,0.193125
8,0.268887,0.731113
9,0.268887,0.731113


In [83]:
preds.columns = ["Death_prob", "Survival_prob"]
preds

Unnamed: 0,Death_prob,Survival_prob
0,0.806875,0.193125
1,0.268887,0.731113
2,0.268887,0.731113
3,0.268887,0.731113
4,0.806875,0.193125
5,0.806875,0.193125
6,0.806875,0.193125
7,0.806875,0.193125
8,0.268887,0.731113
9,0.268887,0.731113


In [84]:
# Generate table of predictions vs Sex
pd.crosstab(df["Sex"], preds["Survival_prob"])

Sex,Survival_prob=0.193125428972,Survival_prob=0.731113382332
female,0,314
male,577,0


The table shows that the model predicted a survival chance of roughly 19% for males and 73% for females. 
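
These two probabilities can be reproduced by hand from the intercept and coefficient printed earlier. Logistic regression passes the linear score, intercept + coefficient × Sex, through the logistic (sigmoid) function. The sketch below assumes the encoding female = 0 and male = 1 and copies the two fitted values from the output above; the helper `sigmoid` and the variable names are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Fitted values copied from the printed model output
intercept = 1.00027876
coef_sex = -2.43010712

# Encoded Sex: female = 0, male = 1
p_female = sigmoid(intercept + coef_sex * 0)
p_male = sigmoid(intercept + coef_sex * 1)

# exp(coefficient) is the ratio of male survival odds to female survival odds
odds_ratio = math.exp(coef_sex)

print(round(p_female, 4), round(p_male, 4), round(odds_ratio, 3))
```

The odds ratio gives another reading of the coefficient: under this model, the odds of survival for a male passenger are less than a tenth of those for a female passenger.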

We can also get the accuracy of the model, that is, the fraction of passengers whose survival it classifies correctly, using the scikit-learn model.score() method:

In [85]:
log_model.score(X = pd.DataFrame(encoded_sex) ,
                y = df["Survived"])

0.78675645342312006
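
The score of about 0.787 can be checked by hand. With a 0.5 threshold, the model predicts survival for every female (probability 0.73) and death for every male (probability 0.19), so it is correct for each woman who survived and each man who died. The sketch below assumes the survivor counts of the standard Titanic training set (233 women and 109 men), which are not shown in this notebook; the group totals 314 and 577 come from the crosstab above.

```python
# Group totals from the crosstab; survivor counts are ASSUMED from the
# standard Titanic training set and are not displayed in this notebook
female_total, female_survived = 314, 233
male_total, male_survived = 577, 109

# The model predicts "survived" for all females and "died" for all males,
# so the correct predictions are the surviving females and the males who died
correct = female_survived + (male_total - male_survived)
accuracy = correct / (female_total + male_total)

print(correct, accuracy)
```

Under these assumed counts the hand-computed fraction agrees with the value returned by `log_model.score()`.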