# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson you were given a broad overview of logistic regression. This included an introduction to two separate packages for creating logistic regression models. In this lab, you'll be investigating fitting logistic regressions with `statsmodels`. For your first foray into logistic regression, you are going to attempt to build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck or not (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [1]:
# Import the data
import numpy as np
import pandas as pd

df = pd.read_csv('titanic.csv')
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [3]:
# Separate the target from the features, then count survivors and non-survivors
X = df.drop(columns=['Survived'])
y = df['Survived']
print("Number of survivors:", sum(y))
print("Number of non-survivors:", len(y) - sum(y))
print("Total Number of Passengers:", len(y))
print("Survival rate:", sum(y)/len(y))

Number of survivors: 342
Number of non-survivors: 549
Total Number of Passengers: 891
Survival rate: 0.3838383838383838


Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from categorical variables. Remember to drop the first level for each categorical column and make sure all the values are of type `float`: 

In [4]:
# Create dummy variables
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
dummy_dataframe = pd.get_dummies(df[relevant_columns], drop_first=True, dtype='float')

dummy_dataframe.shape

(891, 8)

In [5]:
dummy_dataframe.head()

,Pclass,Age,SibSp,Fare,Survived,Sex_male,Embarked_Q,Embarked_S
0,3,22.0,1,7.25,0,1.0,0.0,1.0
1,1,38.0,1,71.2833,1,0.0,0.0,0.0
2,3,26.0,0,7.925,1,0.0,0.0,1.0
3,1,35.0,1,53.1,1,0.0,0.0,1.0
4,3,35.0,0,8.05,0,1.0,0.0,1.0


In [None]:
# Note: Pclass was not converted to dummy variables because it has a numeric dtype.
# It needs to be cast to a categorical (object) dtype before encoding.
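
The note above reflects how `pd.get_dummies()` chooses columns: by default it only encodes columns with an object, string, or category dtype, and passes numeric columns through untouched. Here is a minimal sketch of that behavior on a made-up four-row DataFrame (the values are illustrative and not taken from `titanic.csv`):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from titanic.csv)
toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 3],                       # integer codes for a categorical feature
    'Sex': ['male', 'female', 'female', 'male'],
})

# With an integer dtype, Pclass is passed through unchanged
as_numeric = pd.get_dummies(toy, drop_first=True, dtype='float')
print(as_numeric.columns.tolist())    # ['Pclass', 'Sex_male']

# After casting to object, Pclass is one-hot encoded as well
toy['Pclass'] = toy['Pclass'].astype('object')
as_category = pd.get_dummies(toy, drop_first=True, dtype='float')
print(as_category.columns.tolist())   # ['Pclass_2', 'Pclass_3', 'Sex_male']
```

Because `drop_first=True`, the first level of each encoded column (`Pclass` 1 and `female`) becomes the reference category and gets no column of its own.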

Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values. 

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 
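
As a quick illustration of what `.dropna()` does, here is a small sketch on a made-up DataFrame (the column names mirror the Titanic data, but the values are invented for the example):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.925, 53.1],
})

print(toy.isna().sum())    # number of missing values in each column
complete = toy.dropna()    # keep only the rows with no missing values
print(complete.shape)      # (2, 3)
```

It is worth dropping the rows before creating dummy variables: by default `pd.get_dummies()` encodes a missing category as all zeros, which would make it indistinguishable from the dropped reference level.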

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [9]:
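# Cast Pclass to object so that get_dummies treats it as categorical
df['Pclass'] = df['Pclass'].astype('object')
# Drop rows with missing values, then create the dummy variables
dummy_dataframe = pd.get_dummies(df[relevant_columns].dropna(), drop_first=True, dtype='float')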
dummy_dataframe.shape

(712, 9)

In [10]:
dummy_dataframe.head()

,Age,SibSp,Fare,Survived,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,22.0,1,7.25,0,0.0,1.0,1.0,0.0,1.0
1,38.0,1,71.2833,1,0.0,0.0,0.0,0.0,0.0
2,26.0,0,7.925,1,0.0,1.0,0.0,0.0,1.0
3,35.0,1,53.1,1,0.0,0.0,0.0,0.0,1.0
4,35.0,0,8.05,0,0.0,1.0,1.0,0.0,1.0


Finally, assign the independent variables to `X` and the target variable to `y`: 

In [11]:
# Split the data into X and y
y = dummy_dataframe['Survived']
X = dummy_dataframe.drop(columns=['Survived'])

## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` was unable to fit the model because a matrix it needed to invert was not full rank. In other words, at least one feature is redundant: it can be written as a combination of the other columns (for example, keeping every level of a dummy-encoded variable alongside the intercept). Try removing some features from the model and running it again.
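
To see what "not full rank" looks like in practice, here is a small sketch on made-up data. It compares the number of columns with the matrix rank reported by `np.linalg.matrix_rank()`, once with every dummy level kept and once with the first level dropped. The DataFrame and its values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'female']})

# Keeping every dummy level AND an intercept makes the columns linearly
# dependent: Sex_female + Sex_male equals the constant column
all_levels = pd.get_dummies(toy, drop_first=False, dtype='float')
all_levels.insert(0, 'const', 1.0)
rank_all = np.linalg.matrix_rank(all_levels.to_numpy())
print(all_levels.shape[1], rank_all)   # 3 columns, rank 2

# Dropping the first level removes the redundancy
reduced = pd.get_dummies(toy, drop_first=True, dtype='float')
reduced.insert(0, 'const', 1.0)
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print(reduced.shape[1], rank_reduced)  # 2 columns, rank 2
```

When the rank is smaller than the number of columns, the features are redundant and the fit is likely to fail with the error described above.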

In [13]:
# Build a logistic regression model using statsmodels
import statsmodels.api as sm

X = sm.add_constant(X)



In [14]:
# Fit model
logit_model = sm.Logit(y, X)

# Get results of the fit
result = logit_model.fit()

Optimization terminated successfully.
         Current function value: 0.444229
         Iterations 6


## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [15]:
# Summary table
result.summary()

Dep. Variable:,Survived,No. Observations:,712.0
Model:,Logit,Df Residuals:,703.0
Method:,MLE,Df Model:,8.0
Date:,"Tue, 07 Jul 2020",Pseudo R-squ.:,0.3417
Time:,17:52:07,Log-Likelihood:,-316.29
converged:,True,LL-Null:,-480.45
Covariance Type:,nonrobust,LLR p-value:,3.814e-66

,coef,std err,z,P>|z|,[0.025,0.975]
const,4.4240,0.534,8.277,0.000,3.376,5.472
Age,-0.0433,0.008,-5.202,0.000,-0.060,-0.027
SibSp,-0.3794,0.125,-3.036,0.002,-0.624,-0.134
Fare,0.0012,0.002,0.468,0.640,-0.004,0.006
Pclass_2,-1.2037,0.328,-3.675,0.000,-1.846,-0.562
Pclass_3,-2.4182,0.340,-7.119,0.000,-3.084,-1.752
Sex_male,-2.6163,0.218,-11.992,0.000,-3.044,-2.189
Embarked_Q,-0.8154,0.598,-1.363,0.173,-1.988,0.357
Embarked_S,-0.4047,0.274,-1.475,0.140,-0.943,0.133


In [None]:
# Your comments here
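# The p-values for 'Fare' (0.640), 'Embarked_Q' (0.173), and 'Embarked_S' (0.140) are all above 0.05,
# so these features do not appear to be statistically significant predictors in this model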
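
Beyond the p-values, the coefficients themselves can be read as changes in log-odds. Exponentiating them gives odds ratios, which are often easier to interpret. The sketch below simply copies the coefficients from the summary table above into a dictionary; it does not refit the model.

```python
import math

# Coefficients (log-odds) copied from the summary table above
coefs = {
    'Age': -0.0433,
    'SibSp': -0.3794,
    'Fare': 0.0012,
    'Pclass_2': -1.2037,
    'Pclass_3': -2.4182,
    'Sex_male': -2.6163,
    'Embarked_Q': -0.8154,
    'Embarked_S': -0.4047,
}

# exp(coef) is the multiplicative change in the odds of survival for a
# one-unit increase in the feature, holding the other features fixed
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name:<11}{ratio:.3f}")
```

For example, the odds ratio for `Sex_male` comes out at roughly 0.07, meaning the model estimates a male passenger's odds of survival at about 7% of those of an otherwise identical female passenger. With the fitted object in hand, `np.exp(result.params)` should give the same numbers without retyping them.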

## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [16]:
# Your code here
X.head()

,const,Age,SibSp,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,1.0,22.0,1,7.25,0.0,1.0,1.0,0.0,1.0
1,1.0,38.0,1,71.2833,0.0,0.0,0.0,0.0,0.0
2,1.0,26.0,0,7.925,0.0,1.0,0.0,0.0,1.0
3,1.0,35.0,1,53.1,0.0,0.0,0.0,0.0,1.0
4,1.0,35.0,0,8.05,0.0,1.0,1.0,0.0,1.0


In [17]:
X_new = X.drop(columns=['Fare', 'Embarked_Q', 'Embarked_S'])

# Fit model
logit_model = sm.Logit(y, X_new)

# Get results of the fit
result = logit_model.fit()

# Print Summary
result.summary()

Optimization terminated successfully.
         Current function value: 0.446657
         Iterations 6


Dep. Variable:,Survived,No. Observations:,712.0
Model:,Logit,Df Residuals:,706.0
Method:,MLE,Df Model:,5.0
Date:,"Tue, 07 Jul 2020",Pseudo R-squ.:,0.3381
Time:,17:57:07,Log-Likelihood:,-318.02
converged:,True,LL-Null:,-480.45
Covariance Type:,nonrobust,LLR p-value:,4.498e-68

,coef,std err,z,P>|z|,[0.025,0.975]
const,4.3254,0.451,9.597,0.000,3.442,5.209
Age,-0.0449,0.008,-5.456,0.000,-0.061,-0.029
SibSp,-0.3786,0.121,-3.119,0.002,-0.616,-0.141
Pclass_2,-1.4063,0.285,-4.937,0.000,-1.965,-0.848
Pclass_3,-2.6450,0.286,-9.251,0.000,-3.205,-2.085
Sex_male,-2.6190,0.215,-12.181,0.000,-3.040,-2.198


In [None]:
# Your comments here
# All of the predictors in the new model are significant, and the Pseudo R-squared barely changes (0.3417 to 0.3381)
# Holding the other features fixed, being in 2nd or 3rd class (relative to 1st) or being male (relative to female)
# is associated with substantially lower odds of survival
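
A more formal way to compare the two models is a likelihood-ratio test, since the reduced model is nested inside the full one. The sketch below uses only the log-likelihoods and model degrees of freedom reported in the two summary tables, and assumes `scipy` is installed (it is a dependency of `statsmodels`):

```python
from scipy.stats import chi2

# Log-likelihoods and model degrees of freedom copied from the two summary tables
llf_full, df_full = -316.29, 8
llf_reduced, df_reduced = -318.02, 5

# Likelihood-ratio statistic: twice the gain in log-likelihood from the extra features
lr_stat = 2 * (llf_full - llf_reduced)
extra_params = df_full - df_reduced

# Under the null hypothesis that the dropped coefficients are all zero,
# lr_stat is approximately chi-squared with extra_params degrees of freedom
p_value = chi2.sf(lr_stat, extra_params)
print(f"LR statistic: {lr_stat:.2f}, df: {extra_params}, p-value: {p_value:.3f}")
```

The p-value comes out at roughly 0.33, well above 0.05, which suggests that dropping `Fare` and the `Embarked` dummies does not significantly worsen the fit.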

## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results, drawing on what you already know about reading linear regression output. Continue on to take a look at building logistic regression models in scikit-learn!