# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson you were given a broad overview of logistic regression, including an introduction to two separate packages for creating logistic regression models. In this lab, you'll practice fitting logistic regression models with `statsmodels`. For your first foray into logistic regression, you are going to build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck or not (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [24]:
#importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Import the data
df = pd.read_csv("titanic.csv")
df.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [14]:
#Getting information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [22]:
# Total number of people who survived/didn't survive
survived = len(df[df["Survived"] == 1])
dead = len(df[df["Survived"] == 0])
print(f'Survived: {survived}\n    Dead: {dead}')


Survived: 342
    Dead: 549
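
The same counts can also be read off in one call. The sketch below uses a small made-up Series as a stand-in for `df["Survived"]` (the values are hypothetical, not the real data) and shows `value_counts()` with and without `normalize=True`:

```python
import pandas as pd

# Toy stand-in for df["Survived"]; these five values are made up for illustration
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

counts = survived.value_counts()                # number of rows per class
shares = survived.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

Applied to the lab data, the counts printed above (342 of 891) correspond to a survival share of roughly 38%, so the two classes are moderately imbalanced.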


Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from categorical variables. Remember to drop the first level for each categorical column and make sure all the values are of type `float`: 

In [12]:
# Create dummy variables
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']

# Take an explicit copy so the column assignments below modify a new DataFrame
# (this avoids pandas' SettingWithCopyWarning)
df_relevant = df[relevant_columns].copy()

# Convert categorical columns to type 'category' to ensure proper encoding
for col in df_relevant.select_dtypes(include='object').columns:
    df_relevant[col] = df_relevant[col].astype('category')

dummy_dataframe = pd.get_dummies(df_relevant, drop_first=True, dtype=float)

dummy_dataframe.shape


(891, 8)
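
To see what dropping the first level does, here is a minimal sketch on a three-row toy DataFrame (the rows are made up; only the column names mirror the lab). It compares the columns produced with and without `drop_first=True`:

```python
import pandas as pd

# Toy data: made-up rows that reuse two of the lab's column names
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

full = pd.get_dummies(toy, dtype=float)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)  # first level of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
```

The dropped levels (`female` and `C` here) become the baseline that the remaining dummy coefficients are measured against, which is why the lab's encoded DataFrame has `Sex_male`, `Embarked_Q` and `Embarked_S` but no `Sex_female` or `Embarked_C`.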

Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values.

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 

In [16]:
# Drop missing rows
dummy_dataframe = dummy_dataframe.dropna(axis=0) #dropping by row
dummy_dataframe.shape

(714, 8)
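
The remaining 714 rows match the non-null count of `Age` in the `df.info()` output, which suggests that only missing ages caused rows to be dropped. One likely reason is that `pd.get_dummies` does not propagate missing values by default: a missing category is encoded as zeros in every dummy column. The sketch below illustrates this on made-up data, so treat it as an illustration of pandas behaviour rather than a statement about specific Titanic rows:

```python
import numpy as np
import pandas as pd

# Toy data: one missing Age and one missing Embarked (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Embarked": ["S", "Q", np.nan],
})

print(toy.isna().sum())  # missing values per column before encoding

encoded = pd.get_dummies(toy, drop_first=True, dtype=float)
print(encoded)           # the missing Embarked becomes 0.0, not NaN

print(toy.dropna().shape)      # dropping before encoding removes two rows
print(encoded.dropna().shape)  # dropping after encoding removes only the missing Age
```

If rows with a missing category should be excluded as well, calling `.dropna()` before `pd.get_dummies` (or passing `dummy_na=True` to make the missing level explicit) are two possible options.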

In [17]:
dummy_dataframe.head()

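| | Pclass | Age | SibSp | Fare | Survived | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 7.25 | 0 | 1.0 | 0.0 | 1.0 |
| 1 | 1 | 38.0 | 1 | 71.2833 | 1 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 26.0 | 0 | 7.925 | 1 | 0.0 | 0.0 | 1.0 |
| 3 | 1 | 35.0 | 1 | 53.1 | 1 | 0.0 | 0.0 | 1.0 |
| 4 | 3 | 35.0 | 0 | 8.05 | 0 | 1.0 | 0.0 | 1.0 |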


Finally, assign the independent variables to `X` and the target variable to `y`: 

In [20]:
# Split the data into X and y
y = dummy_dataframe["Survived"]
X = dummy_dataframe.drop("Survived", axis=1)

## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` was unable to fit the model because the matrix of features was not invertible, which happens when it is not full rank. In other words, at least one feature is redundant: it can be written as a linear combination of the others (for example, keeping every dummy level alongside an intercept). Try removing some features from the model and running it again.
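
One way to check for this kind of redundancy before fitting is to compare the rank of the feature matrix with its number of columns. The sketch below builds a small hypothetical matrix (made-up values) that contains an intercept plus both dummy levels of a two-level feature, then repeats the check with one level dropped:

```python
import numpy as np

# Hypothetical features: an intercept plus BOTH dummy levels of a two-level variable
const = np.ones(4)
sex_female = np.array([1.0, 0.0, 1.0, 0.0])
sex_male = 1.0 - sex_female  # always equals const - sex_female, so it adds no information

X_redundant = np.column_stack([const, sex_female, sex_male])
X_reduced = np.column_stack([const, sex_male])

# A matrix is full rank when its rank equals its number of columns
rank_redundant = np.linalg.matrix_rank(X_redundant)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(rank_redundant, X_redundant.shape[1])
print(rank_reduced, X_reduced.shape[1])
```

Here the redundant matrix has three columns but rank two, while the reduced matrix is full rank. This is the reason for dropping the first level of each categorical feature earlier in the lab.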

In [26]:
# Build a logistic regression model using statsmodels
import statsmodels.api as sm

# Add intercept term
X = sm.add_constant(X)

# Define the logistic regression model (it is fit in the next cell)
logit_model = sm.Logit(y, X)

## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [27]:
# Summary table

# Obtain the results
logit_result = logit_model.fit()

# Print summary of the model
print(logit_result.summary())

Optimization terminated successfully.
         Current function value: 0.443267
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      706
Method:                           MLE   Df Model:                            7
Date:                Mon, 29 Apr 2024   Pseudo R-squ.:                  0.3437
Time:                        01:28:08   Log-Likelihood:                -316.49
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 1.103e-67
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6503      0.633      8.921      0.000       4.409       6.892
Pclass        -1.2118      0.

##### Your comments here
Based on the p-values from the model:

* The intercept term and the variables `Pclass`, `Age`, `SibSp` and `Sex_male` are statistically significant, since their p-values are less than 0.05.
* The variables `Fare`, `Embarked_Q` and `Embarked_S` are not statistically significant, since their p-values are greater than 0.05. Once the other features are accounted for, the data provide little evidence that they help predict survival.

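
Beyond significance, the size of a logistic regression coefficient is easier to read after exponentiating it, which turns a change in log-odds into an odds ratio. As a small worked example, the sketch below uses the `Pclass` coefficient printed in the summary above and assumes `Pclass` is treated as a numeric feature, as it is in this lab:

```python
import math

# Coefficient for Pclass as printed in the summary table above
pclass_coef = -1.2118

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratio = math.exp(pclass_coef)

print(round(odds_ratio, 3))
```

Under this reading, each one-step increase in `Pclass` (for example from 1st to 2nd class) multiplies the odds of survival by roughly 0.3, holding the other features fixed.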
    

## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [35]:
# Your code here
# we redefine the independent variables which will only consist of the significant ones
significant_vars = ["Pclass", "Age", "SibSp","Sex_male"]
X1 = dummy_dataframe[significant_vars]

# Add intercept term
X1 = sm.add_constant(X1)

# Define the reduced logistic regression model (it is fit in the next cell)
logit_model_2 = sm.Logit(y, X1)


In [36]:
# Get the summary of the model
# Obtain the results
logit_result_2 = logit_model_2.fit()

# Print summary of the model
print(logit_result_2.summary())

Optimization terminated successfully.
         Current function value: 0.445882
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      709
Method:                           MLE   Df Model:                            4
Date:                Mon, 29 Apr 2024   Pseudo R-squ.:                  0.3399
Time:                        01:52:57   Log-Likelihood:                -318.36
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 1.089e-69
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6008      0.543     10.306      0.000       4.536       6.666
Pclass        -1.3174      0.

##### Your comments here
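The pseudo R-squared fell from 0.3437 in the first model to 0.3399, a decrease of 0.0038, and the log-likelihood dropped slightly from -316.49 to -318.36. The reduced model therefore fits the data marginally worse, not better, but it does so with three fewer features, which suggests that `Fare`, `Embarked_Q` and `Embarked_S` added very little explanatory power.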
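
Because the second model is nested inside the first, the two can also be compared more formally. The sketch below is one possible approach and not part of the original lab: it plugs the log-likelihoods and `Df Model` values from the two summaries into a likelihood-ratio test and an AIC comparison. It assumes `scipy` is installed, which is normally the case wherever `statsmodels` is.

```python
from scipy.stats import chi2

# Values read from the two summary tables above
llf_full, df_full = -316.49, 7        # Log-Likelihood and Df Model, full model
llf_reduced, df_reduced = -318.36, 4  # Log-Likelihood and Df Model, reduced model

# Likelihood-ratio test: twice the gap in log-likelihood, compared with a
# chi-squared distribution whose degrees of freedom equal the number of dropped features
lr_stat = 2 * (llf_full - llf_reduced)
p_value = chi2.sf(lr_stat, df_full - df_reduced)

# AIC = 2k - 2 * log-likelihood, where k counts the intercept as a parameter
aic_full = 2 * (df_full + 1) - 2 * llf_full
aic_reduced = 2 * (df_reduced + 1) - 2 * llf_reduced

print(lr_stat, p_value)
print(aic_full, aic_reduced)
```

With these inputs the test gives a p-value well above 0.05 and the reduced model has the slightly lower AIC, both of which are consistent with the three dropped features contributing little.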

## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results much as you would for linear regression, building on your previous statistics knowledge. Continue on to take a look at building logistic regression models in Scikit-learn!