# Logistic Regression 

Logistic regression is a classification method built on the same ideas as linear regression. In this lesson we will learn how to fit a logistic regression model and apply it to the Titanic survival data.

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats
matplotlib.style.use('ggplot')

In [3]:
titanic = pd.read_csv("C:/Users/HP PC/Documents/10 Academy/Jully_Training/Week3/titanic.csv")  

In [4]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [31]:
char_cabin = titanic["Cabin"].astype(str)     # Convert cabin to str

new_Cabin = np.array([cabin[0] for cabin in char_cabin]) # Take first letter

titanic["Cabin"] = pd.Categorical(new_Cabin)  # Save the new cabin var

# Impute median Age for NA Age values
new_age_var = np.where(titanic["Age"].isnull(), # Logical check
                       28,                       # Value if check is true
                       titanic["Age"])     # Value if check is false

titanic["Age"] = new_age_var 

Now we are ready to use a logistic regression model to predict survival. The scikit-learn library provides logistic regression in its linear_model module. Let's make a logistic regression model that uses only the Sex variable as a predictor.

Before creating a model with the Sex variable, we need to convert it to a number, because sklearn's machine learning functions only deal with numeric input. We can convert a categorical variable like Sex into a number using the sklearn preprocessing class LabelEncoder:

In [32]:
from sklearn import linear_model
from sklearn import preprocessing

In [33]:
# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()

# Convert Sex variable to numeric
encoded_sex = label_encoder.fit_transform(titanic["Sex"])

# Initialize logistic regression model
log_model = linear_model.LogisticRegression()

# Train the model
log_model.fit(X = pd.DataFrame(encoded_sex), 
              y = titanic["Survived"])

# Check trained model intercept
print(log_model.intercept_)

# Check trained model coefficients
print(log_model.coef_)

[1.01628767]
[[-2.44597988]]


Let's use the model to make predictions on the same data it was trained on:

In [34]:
# Make predictions
preds = log_model.predict_proba(X= pd.DataFrame(encoded_sex))
preds = pd.DataFrame(preds)
preds.columns = ["Death_prob", "Survival_prob"]

# Generate table of predictions vs Sex
pd.crosstab(titanic["Sex"], preds["Survival_prob"])

Sex \ Survival_prob,0.193147,0.734249
female,0,314
male,577,0


In [35]:
log_model.score(X = pd.DataFrame(encoded_sex) ,
                y = titanic["Survived"])

0.7867564534231201

The table shows that the model predicted a survival chance of roughly 19% for males and 73% for females. If we used this simple model to predict survival, we'd end up predicting that all women survived and that all men died, which is correct for about 79% of the passengers in the training data.
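The two probabilities follow directly from the intercept and coefficient printed by the fitted model. As a minimal sketch, assuming LabelEncoder assigned female = 0 and male = 1 (it sorts labels alphabetically), applying the logistic function to intercept + coefficient × sex reproduces them. The helper name `sigmoid` is ours and is not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Values printed by log_model.intercept_ and log_model.coef_
intercept = 1.01628767
coef_sex = -2.44597988

# Assumed encoding: female -> 0, male -> 1
p_female = sigmoid(intercept + coef_sex * 0)
p_male = sigmoid(intercept + coef_sex * 1)

print(round(p_female, 4), round(p_male, 4))

# For a binary model, predict() assigns class 1 when the
# survival probability exceeds 0.5
pred_female = int(p_female > 0.5)
pred_male = int(p_male > 0.5)
```

Because the female probability is above 0.5 and the male probability is below it, every woman is classified as a survivor and every man as a non-survivor.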

Let's make a more complicated model that includes a few more variables from the Titanic training set:

In [21]:
# Convert more variables to numeric
encoded_class = label_encoder.fit_transform(titanic["Pclass"])
encoded_cabin = label_encoder.fit_transform(titanic["Cabin"])

train_features = pd.DataFrame([encoded_class,
                              encoded_cabin,
                              encoded_sex,
                              titanic["Age"]]).T

# Initialize logistic regression model
log_model = linear_model.LogisticRegression()

# Train the model
log_model.fit(X = train_features ,
              y = titanic["Survived"])

# Check trained model intercept
print(log_model.intercept_)

# Check trained model coefficients
print(log_model.coef_)

[3.85818122]
[[-0.93272694 -0.09865307 -2.51826528 -0.03339524]]
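
To see how the four coefficients combine, here is a rough sketch that scores one hypothetical passenger by hand: a 22-year-old man in third class with no recorded cabin. The encoded values are assumptions. LabelEncoder should map Pclass 3 to 2 and male to 1, and we assume the missing-cabin letter "n" is encoded as 8 (the last of nine sorted labels), which is worth confirming with label_encoder.classes_ before relying on it.

```python
import math

# Values printed by log_model.intercept_ and log_model.coef_,
# in the column order of train_features: class, cabin, sex, age
intercept = 3.85818122
coefs = [-0.93272694, -0.09865307, -2.51826528, -0.03339524]

# Hypothetical passenger; the encoded values are assumptions (see text)
passenger = [2, 8, 1, 22.0]

# Linear predictor (log-odds), then the logistic function
log_odds = intercept + sum(c * x for c, x in zip(coefs, passenger))
survival_prob = 1.0 / (1.0 + math.exp(-log_odds))

print(f"log-odds: {log_odds:.3f}, survival probability: {survival_prob:.3f}")
```

All four coefficients are negative, so a higher encoded class, being male, and greater age each push the predicted survival probability down.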


Next, let's make class predictions using this model and then compare the predictions to the actual values:

In [26]:
# Make predictions
preds = log_model.predict(X= train_features)

# Generate table of predictions vs actual
pd.crosstab(preds,titanic["Survived"])

row_0 \ Survived,0,1
0,458,89
1,91,253


**Note**: Use model.predict_proba() to get the predicted class probabilities. Use model.predict() to get the predicted classes.

You can also get the accuracy of a model on a given data set using the scikit-learn model.score() method:

In [27]:
log_model.score(X = train_features ,
                y = titanic["Survived"])

0.797979797979798
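
The accuracy returned by score() can be cross-checked against the crosstab counts of predictions versus actual values. A small sketch, treating the crosstab rows as predictions and the columns as actual outcomes; precision and recall are extra metrics that the lesson does not compute, included here only for illustration.

```python
# Counts from the crosstab of predictions (rows) vs. actual Survived (columns)
true_neg = 458   # predicted 0, actually 0
false_neg = 89   # predicted 0, actually 1
false_pos = 91   # predicted 1, actually 0
true_pos = 253   # predicted 1, actually 1

total = true_neg + false_neg + false_pos + true_pos

# Share of all passengers classified correctly
accuracy = (true_neg + true_pos) / total
# Share of predicted survivors who actually survived
precision = true_pos / (true_pos + false_pos)
# Share of actual survivors the model identified
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way, 711 correct out of 891, matches the value reported by score().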