# Subject: Classical Data Analysis

## Session 3 - Logistic Regression with one variable

### Demo 1 -  Logistic Regression in Python

In the last lessons, we introduced linear regression as a predictive modeling method to estimate numeric variables. Now we turn our attention to classification: prediction tasks where the response variable is categorical instead of numeric. In this lesson we will learn how to use a common classification technique known as logistic regression and apply it to the Titanic survival data we used in lesson 2.

## 1. Revisiting the Titanic

We'll start by loading the data and then carrying out a few of the same preprocessing tasks:

In [39]:
import pandas as pd
from sklearn import linear_model
from sklearn import preprocessing

In [40]:
df=pd.read_csv("C:/Users/francisco.sacramento/Desktop/Master_Big_Data_Phyton/6_Exercices/Classical Data Analysis/Session_3_CDA/1_titanic_dataset.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [41]:
df.shape

(891, 12)

In [42]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We will first build a logistic regression model that uses only the Sex variable as a predictor. Before fitting the model, we need to convert Sex to a number, because scikit-learn's estimators only deal with numeric inputs. We can convert a categorical variable like Sex into integer codes using the sklearn preprocessing class LabelEncoder():

In [43]:
# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()

# Convert Sex variable to numeric
encoded_sex = label_encoder.fit_transform(df["Sex"])


In [44]:
X=pd.DataFrame(encoded_sex)
X

Unnamed: 0,0
0,1
1,0
2,0
3,0
4,1
5,1
6,1
7,1
8,0
9,0
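
In the output above, 1 stands for male and 0 for female, because LabelEncoder assigns integer codes to the classes in sorted order. A minimal plain-Python sketch of that mapping, using the first five Sex values shown by df.head() as a stand-in sample (the names `sex_sample`, `classes` and `code_of` are our own):

```python
# First five Sex values from df.head(); a small stand-in for the full column
sex_sample = ["male", "female", "female", "female", "male"]

# Sorted unique classes get codes 0, 1, ... (mirrors LabelEncoder's ordering)
classes = sorted(set(sex_sample))
code_of = {label: code for code, label in enumerate(classes)}

encoded_sample = [code_of[label] for label in sex_sample]

print(classes)         # ['female', 'male']
print(encoded_sample)  # [1, 0, 0, 0, 1]
```

Right after fitting on Sex, the encoder's own `label_encoder.classes_` attribute holds the same sorted list of classes.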


In [45]:
# Initialize logistic regression model
log_model = linear_model.LogisticRegression()

# Train the model
log_model.fit(X = pd.DataFrame(encoded_sex), 
              y = df["Survived"])

# Check trained model intercept
print(log_model.intercept_)

# Check trained model coefficients
print(log_model.coef_)

[ 1.00027876]
[[-2.43010712]]


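The logistic regression model coefficients look similar to the output we saw for linear regression. The model produced a positive intercept (about 1.000) and a weight of about -2.430 on the encoded Sex variable. Since male is encoded as 1, the negative weight means that being male lowers the predicted odds of survival. Let's use the model to make predictions: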

In [46]:
# Make predictions
preds = log_model.predict_proba(X= pd.DataFrame(encoded_sex)) # Use model.predict_proba() to get the predicted class probabilities.

In [47]:
preds

array([[ 0.80687457,  0.19312543],
       [ 0.26888662,  0.73111338],
       [ 0.26888662,  0.73111338],
       ..., 
       [ 0.26888662,  0.73111338],
       [ 0.80687457,  0.19312543],
       [ 0.80687457,  0.19312543]])

In [48]:
preds = pd.DataFrame(preds)
preds

Unnamed: 0,0,1
0,0.806875,0.193125
1,0.268887,0.731113
2,0.268887,0.731113
3,0.268887,0.731113
4,0.806875,0.193125
5,0.806875,0.193125
6,0.806875,0.193125
7,0.806875,0.193125
8,0.268887,0.731113
9,0.268887,0.731113


In [49]:
preds.columns = ["Death_prob", "Survival_prob"]
preds

Unnamed: 0,Death_prob,Survival_prob
0,0.806875,0.193125
1,0.268887,0.731113
2,0.268887,0.731113
3,0.268887,0.731113
4,0.806875,0.193125
5,0.806875,0.193125
6,0.806875,0.193125
7,0.806875,0.193125
8,0.268887,0.731113
9,0.268887,0.731113


In [50]:
# Generate table of predictions vs Sex
pd.crosstab(df["Sex"], preds["Survival_prob"])

Survival_prob,0.193125428972,0.731113382332
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
female,0,314
male,577,0


The table shows that the model predicted a survival chance of roughly 19% for males and 73% for females. 
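
These two probabilities can be reproduced by hand from the intercept and coefficient printed earlier, since logistic regression passes the linear score through the logistic (sigmoid) function. A minimal sketch, where the helper name `sigmoid` is our own and the constants are copied from the printed output:

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

intercept = 1.00027876   # copied from log_model.intercept_
coef_sex = -2.43010712   # copied from log_model.coef_

p_female = sigmoid(intercept + coef_sex * 0)  # female is encoded as 0
p_male = sigmoid(intercept + coef_sex * 1)    # male is encoded as 1

print(round(p_female, 3), round(p_male, 3))   # 0.731 0.193
```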

We can also get the accuracy of a model using the scikit-learn model.score() function:

In [51]:
log_model.score(X = pd.DataFrame(encoded_sex) ,
                y = df["Survived"])

0.78675645342312006
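
The score can be sanity-checked by hand. With a single binary predictor, predict() assigns each group the class whose probability exceeds 0.5, so the model predicts that every female survives and every male dies. The group sizes (314 and 577) come from the crosstab above. The survivor counts per sex (233 females, 109 males) are not shown in this notebook and are an assumption taken from the standard Kaggle Titanic training file:

```python
# Group sizes from the crosstab above
n_female, n_male = 314, 577

# Assumption: survivor counts per sex in the standard Kaggle training file
female_survived, male_survived = 233, 109

# The model predicts "survived" for all females and "died" for all males
correct = female_survived + (n_male - male_survived)
accuracy = correct / (n_female + n_male)

print(round(accuracy, 4))  # 0.7868
```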

## 2. Adding more variables

Let's make a more complicated model that includes a few more variables from the Titanic training set.

Binary logistic regression assumes that the outcome variable is dichotomous and that the log-odds of the outcome are a linear function of the predictors.
The predictor variables can be categorical or continuous.
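
Written out, the linearity assumption concerns the log-odds of survival rather than the probability itself. With p the probability that Survived = 1 and x_1, ..., x_k the predictors (the symbols are our own notation, not taken from the course material):

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
$$

In scikit-learn terms, \beta_0 corresponds to `intercept_` and \beta_1, ..., \beta_k to the entries of `coef_`.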

In [52]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [53]:
import numpy as np
char_cabin = df["Cabin"].astype(str)     # Convert cabin to str

new_Cabin = np.array([cabin[0] for cabin in char_cabin]) # Take first letter

df["Cabin"] = pd.Categorical(new_Cabin)  # Save the new cabin var

# Impute the median Age (28 for this dataset) for missing Age values
new_age_var = np.where(df["Age"].isnull(), # Logical check
                       28,                 # Value if check is true (hard-coded median)
                       df["Age"])          # Value if check is false

df["Age"] = new_age_var 
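
The constant 28 above is the median of the observed ages. A hedged alternative computes the median instead of hard-coding it. The sketch below runs on a hypothetical six-value Series named `ages`, not on the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with two missing values
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0], name="Age")

median_age = ages.median()         # NaN values are skipped by default
imputed = ages.fillna(median_age)  # Replace each NaN with the median

print(median_age)        # 36.5
print(imputed.tolist())  # [22.0, 38.0, 36.5, 35.0, 36.5, 54.0]
```

On the real data the same pattern would read `df["Age"] = df["Age"].fillna(df["Age"].median())`.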

In [54]:
# Convert more variables to numeric
encoded_class = label_encoder.fit_transform(df["Pclass"])
encoded_cabin = label_encoder.fit_transform(df["Cabin"])

train_features = pd.DataFrame([encoded_class,
                              encoded_cabin,
                              encoded_sex,
                              df["Age"]]).T

# Initialize logistic regression model
log_model = linear_model.LogisticRegression()

# Train the model
log_model.fit(X = train_features ,
              y = df["Survived"])

# Check trained model intercept
print(log_model.intercept_)

# Check trained model coefficients
print(log_model.coef_)

[ 3.32716302]
[[-0.90790164 -0.06426483 -2.43179802 -0.0265924 ]]


In [55]:
train_features

Unnamed: 0,0,1,2,3
0,2.0,8.0,1.0,22.0
1,0.0,2.0,0.0,38.0
2,2.0,8.0,0.0,26.0
3,0.0,2.0,0.0,35.0
4,2.0,8.0,1.0,35.0
5,2.0,8.0,1.0,28.0
6,0.0,4.0,1.0,54.0
7,2.0,8.0,1.0,2.0
8,2.0,8.0,0.0,27.0
9,1.0,8.0,0.0,14.0
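
To see how the four coefficients combine, the sketch below computes the predicted survival probability for the first two rows of train_features by hand. The intercept and coefficients are copied from the printed output; the helper name `survival_prob` is our own, and the results are approximate because the printed values are rounded:

```python
import math

intercept = 3.32716302
# Coefficients in column order: class, cabin, sex, age
coefs = [-0.90790164, -0.06426483, -2.43179802, -0.0265924]

def survival_prob(features):
    """Apply the logistic function to the linear score of one row."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

row0 = [2.0, 8.0, 1.0, 22.0]  # class code 2, cabin code 8, male, age 22
row1 = [0.0, 2.0, 0.0, 38.0]  # class code 0, cabin code 2, female, age 38

print(round(survival_prob(row0), 2))  # 0.12
print(round(survival_prob(row1), 2))  # 0.9
```

Since predict() uses a 0.5 threshold, the first passenger is classified as a non-survivor and the second as a survivor.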


Next, let's make class predictions using this model and then compare the predictions to the actual values:

In [56]:
# Make predictions
preds = log_model.predict(X= train_features)

# Generate table of predictions vs actual
pd.crosstab(preds,df["Survived"])

Survived,0,1
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
0,463,98
1,86,244


The table above shows the classes our model predicted vs. true values of the Survived variable. This table of predicted vs. actual values is known as a confusion matrix.

# The Confusion Matrix

The confusion matrix is a common tool for assessing the results of classification. Each cell tells us something different about our predictions versus the true values. In the table above, rows are the predicted classes and columns are the actual classes:
- The bottom right cell (244) holds the true positives: people the model predicted to survive who actually did survive.
- The bottom left cell (86) holds the false positives: people for whom the model predicted survival who did not actually survive.
- The top left cell (463) holds the true negatives: people correctly identified as non-survivors.
- Finally, the top right cell (98) holds the false negatives: passengers the model identified as non-survivors who actually did survive.
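
The four cells can also be tallied directly from the label pairs. A minimal sketch on hypothetical toy labels (the lists `actual` and `predicted` are invented for illustration and are not Titanic data):

```python
from collections import Counter

# Hypothetical toy labels (1 = survived, 0 = died)
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

# Count each (predicted, actual) combination
cells = Counter(zip(predicted, actual))

tp = cells[(1, 1)]  # predicted survival, survived
fp = cells[(1, 0)]  # predicted survival, died
tn = cells[(0, 0)]  # predicted death, died
fn = cells[(0, 1)]  # predicted death, survived

print(tp, fp, tn, fn)  # 3 1 3 1
```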

We can calculate the overall prediction accuracy from the matrix by adding the total number of correct predictions and dividing by the total number of predictions. In the case of our model, the prediction accuracy is:

In [57]:
(463+244)/891

0.7934904601571269

You can also get the accuracy of a model using the scikit-learn model.score() function:

In [58]:
log_model.score(X = train_features ,
                y = df["Survived"])

0.79349046015712688

In [59]:
from sklearn import metrics

In [60]:
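# View confusion matrix (rows are true labels and columns are predicted labels,
# so it is the transpose of the crosstab above)
metrics.confusion_matrix(y_true=df["Survived"],  # True labels
                         y_pred=preds) # Predicted labels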

array([[463,  86],
       [ 98, 244]])

In [61]:
# View summary of common classification metrics
print(metrics.classification_report(y_true=df["Survived"],
                              y_pred=preds) )

             precision    recall  f1-score   support

          0       0.83      0.84      0.83       549
          1       0.74      0.71      0.73       342

avg / total       0.79      0.79      0.79       891



- Overall prediction accuracy is just one of many quantities you can use to assess a classification model. Accuracy can be misleading when the classes are imbalanced or when false positives and false negatives carry different costs.

- Model's sensitivity (recall): the proportion of positive cases that the model correctly identifies as positive.

- Model's precision: the proportion of positive predictions that turn out to be true positives.

- F1 score (also F-score or F-measure): the harmonic mean of the precision "p" and the recall "r", F1 = 2pr / (p + r). Here "p" is the number of true positives divided by the number of all positive predictions, and "r" is the number of true positives divided by the number of all actual positives.

- Support - class support size (number of elements in each class).
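
The definitions above can be checked against the classification report by computing the metrics for class 1 from the confusion matrix cells. A minimal sketch, with the counts copied from the matrix printed earlier and variable names of our own choosing:

```python
# Cell counts from the confusion matrix above
tn, fp, fn, tp = 463, 86, 98, 244
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted survivors who survived
recall = tp / (tp + fn)     # share of actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class (died, support 549)
baseline = (tn + fp) / total

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.79 0.74 0.71 0.73
print(round(baseline, 2))  # 0.62
```

The baseline shows why accuracy alone can flatter a model: predicting that nobody survives already scores about 62% on this data.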

http://onlineconfusionmatrix.com/