# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson you were given a broad overview of logistic regression, including an introduction to two separate packages for creating logistic regression models. In this lab, you'll fit logistic regression models with `statsmodels`. For your first foray into logistic regression, you are going to build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck or not (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [3]:
# Import the data
import pandas as pd


df = pd.read_csv('titanic.csv',index_col = 0)
df.head()



PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [4]:
# Total number of people who survived/didn't survive
df['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64
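
The same question can be answered as a proportion rather than a raw count by passing `normalize=True` to `value_counts`. The sketch below rebuilds the counts printed above in a stand-in Series (an assumption made so the snippet runs without `titanic.csv`); in the lab you would call it on `df['Survived']` directly.

```python
import pandas as pd

# Stand-in for df['Survived'], rebuilt from the counts printed above (assumption)
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

counts = survived.value_counts()                # raw count per class
shares = survived.value_counts(normalize=True)  # fraction per class

print(counts)
print(shares.round(3))
```

Roughly 38% of the passengers in this file survived, so a model that always predicts "did not survive" would already be right about 62% of the time. That baseline is worth keeping in mind when judging any classifier built on this data.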

Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from the categorical variables. Remember to drop the first level of each categorical column and make sure the dummy columns are of type `float`: 

In [19]:
# Create dummy variables
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
dummy_dataframe = pd.get_dummies(df[relevant_columns], drop_first = True, dtype = float)

dummy_dataframe.shape
dummy_dataframe.head()

PassengerId,Pclass,Age,SibSp,Fare,Survived,Sex_male,Embarked_Q,Embarked_S
1,3,22.0,1,7.25,0,1.0,0.0,1.0
2,1,38.0,1,71.2833,1,0.0,0.0,0.0
3,3,26.0,0,7.925,1,0.0,0.0,1.0
4,1,35.0,1,53.1,1,0.0,0.0,1.0
5,3,35.0,0,8.05,0,1.0,0.0,1.0
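
To see what `drop_first=True` does in isolation, here is a minimal sketch on a three-row toy frame (the frame `toy` is made up for illustration and is not part of the lab data). Numeric columns pass through unchanged, and each categorical column loses its first level.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
})

# Encode the string columns; drop the first level of each to avoid redundancy
dummies = pd.get_dummies(toy, drop_first=True, dtype=float)

print(dummies.columns.tolist())
print(dummies.dtypes)
```

The dropped levels (`female` and `C`, the first in sorted order) become the baseline that the remaining dummy coefficients are measured against. That is why the model later reports `Sex_male`, `Embarked_Q`, and `Embarked_S` but no `Sex_female` or `Embarked_C`.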


Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values. 

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 

In [23]:
# Drop missing rows
dummy_dataframe = dummy_dataframe.dropna()
dummy_dataframe.shape

(714, 8)
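
Before dropping rows, it can help to check which columns actually contain missing values. The sketch below uses a made-up four-row frame (an assumption for illustration) and also shows a subtlety: with its default settings, `get_dummies` encodes a missing category as all zeros, so after encoding only the numeric columns can still hold `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age and one missing Embarked
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "Q"],
})

encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# Count missing values per column after encoding
missing_per_column = encoded.isna().sum()
print(missing_per_column)

# Drop every row that still contains a missing value
cleaned = encoded.dropna()
print(len(encoded), "->", len(cleaned))
```

In this toy example the row with a missing `Embarked` survives `dropna()` and is silently treated as the baseline level. Whether that is acceptable depends on the analysis; dropping missing rows before encoding is one alternative.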

In [24]:
dummy_dataframe.head()

PassengerId,Pclass,Age,SibSp,Fare,Survived,Sex_male,Embarked_Q,Embarked_S
1,3,22.0,1,7.25,0,1.0,0.0,1.0
2,1,38.0,1,71.2833,1,0.0,0.0,0.0
3,3,26.0,0,7.925,1,0.0,0.0,1.0
4,1,35.0,1,53.1,1,0.0,0.0,1.0
5,3,35.0,0,8.05,0,1.0,0.0,1.0


Finally, assign the independent variables to `X` and the target variable to `y`: 

In [27]:
# Split the data into X and y
y = dummy_dataframe['Survived']
X = dummy_dataframe.drop('Survived', axis = 1)


## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` could not invert a matrix it needs during fitting because the design matrix is not full rank. In other words, at least one column is redundant: it can be written as a linear combination of the others (for example, keeping a dummy column for every level of a categorical variable alongside the intercept). Try removing some features from the model and running it again.
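
One way to diagnose this kind of redundancy before fitting is to compare the rank of the design matrix with its number of columns. The following is a small sketch on a made-up matrix (not the lab data) in which an intercept is kept alongside both `Sex` dummies, so the two dummy columns always sum to the intercept column.

```python
import numpy as np

# Hypothetical design matrix: intercept, Sex_female, Sex_male (no level dropped)
X_redundant = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

# Dropping one dummy column removes the redundancy
X_reduced = X_redundant[:, :2]

rank_redundant = np.linalg.matrix_rank(X_redundant)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(rank_redundant, "of", X_redundant.shape[1], "columns")
print(rank_reduced, "of", X_reduced.shape[1], "columns")
```

A rank lower than the column count means at least one column carries no new information. With a DataFrame, the same check can be run on `X.values`.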

In [30]:
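# Build a logistic regression model using statsmodels
import statsmodels.api as sm

# Add an intercept column named 'const'; statsmodels does not add one by default
X = sm.add_constant(X)

# Fit the model by maximum likelihood
logit_model = sm.Logit(y, X)
result = logit_model.fit()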


Optimization terminated successfully.
         Current function value: 0.443267
         Iterations 6


In [None]:
"""
This output message indicates that the optimizer converged while fitting the logistic regression model.
Current function value: 0.443267 is the quantity the optimizer minimizes, the average negative log-likelihood per observation. Multiplying it by the 714 observations gives about 316.49, which matches the Log-Likelihood of -316.49 reported in the summary table below. Lower values indicate a better fit on the same data, but the number should not be compared across different datasets.
Iterations 6: The optimizer (Newton's method by default in statsmodels) converged after 6 iterations. Converging in few iterations is a sign of a well-behaved optimization problem.
"""
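
The relationship between the printed function value and the log-likelihood figures that `statsmodels` reports can be checked with a few lines of arithmetic. The sketch below hard-codes the numbers from this lab's output (0.443267, 714 observations, and an LL-Null of -482.26) rather than reading them from a fitted model.

```python
# Values copied from the fit message and the summary table in this lab
n_obs = 714
current_function_value = 0.443267  # average negative log-likelihood
ll_null = -482.26                  # log-likelihood of the intercept-only model

# Total log-likelihood of the fitted model
log_likelihood = -current_function_value * n_obs

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - log_likelihood / ll_null

print(round(log_likelihood, 2))
print(round(pseudo_r2, 4))
```

Both results agree with the `Log-Likelihood` and `Pseudo R-squ.` entries in the summary up to rounding of the printed inputs.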

## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [32]:
# Summary table
result.summary()

0,1,2,3
Dep. Variable:,Survived,No. Observations:,714.0
Model:,Logit,Df Residuals:,706.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 16 Aug 2024",Pseudo R-squ.:,0.3437
Time:,11:25:48,Log-Likelihood:,-316.49
converged:,True,LL-Null:,-482.26
Covariance Type:,nonrobust,LLR p-value:,1.103e-67

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,5.6503,0.633,8.921,0.000,4.409,6.892
Pclass,-1.2118,0.163,-7.433,0.000,-1.531,-0.892
Age,-0.0431,0.008,-5.250,0.000,-0.059,-0.027
SibSp,-0.3806,0.125,-3.048,0.002,-0.625,-0.136
Fare,0.0012,0.002,0.474,0.636,-0.004,0.006
Sex_male,-2.6236,0.217,-12.081,0.000,-3.049,-2.198
Embarked_Q,-0.8260,0.598,-1.381,0.167,-1.999,0.347
Embarked_S,-0.4130,0.269,-1.533,0.125,-0.941,0.115


In [None]:
# Your comments here
"""
const: The intercept (5.6503) is the log-odds of survival when every predictor equals zero. No passenger has Pclass = 0, so the intercept describes a point outside the data and has little practical meaning on its own.

Pclass: The coefficient for Pclass (-1.2118) suggests that as the class number increases (moving from 1st class toward 3rd class), the odds of survival decrease. The p-value (reported as 0.000, meaning below 0.0005) indicates that this effect is statistically significant.

Age: The coefficient for Age (-0.0431) indicates that for each additional year of age, the odds of survival decrease. This is also statistically significant (p-value = 0.000).

SibSp: The coefficient for SibSp (-0.3806) indicates that having more siblings/spouses aboard is associated with lower odds of survival, and this effect is statistically significant (p-value = 0.002).

Fare: The coefficient for Fare (0.0012) suggests a very small positive effect on survival odds, but the p-value (0.636) indicates that this effect is not statistically significant.

Sex_male: The coefficient for Sex_male (-2.6236) indicates that being male is associated with significantly lower odds of survival, which is statistically significant (p-value = 0.000).

Embarked_Q and Embarked_S: The coefficients for Embarked_Q and Embarked_S show negative effects on survival odds, but neither is statistically significant (p-values = 0.167 and 0.125, respectively).

"""
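
The coefficients above are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio, and passing a passenger's log-odds through the logistic function gives a predicted probability. The sketch below hard-codes the coefficients from the summary table and scores passenger 1 from the data preview; with a fitted model you could instead use `np.exp(result.params)` and `result.predict(...)`.

```python
import math

# Coefficients copied from the summary table above
coefs = {
    "const": 5.6503, "Pclass": -1.2118, "Age": -0.0431, "SibSp": -0.3806,
    "Fare": 0.0012, "Sex_male": -2.6236, "Embarked_Q": -0.8260, "Embarked_S": -0.4130,
}

# Odds ratio per one-unit increase in each predictor (intercept excluded)
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items() if name != "const"}
for name, ratio in odds_ratios.items():
    print(f"{name:<11}{ratio:.3f}")

# Passenger 1: 3rd class, male, age 22, one sibling/spouse, fare 7.25, embarked at S
passenger = {
    "const": 1.0, "Pclass": 3, "Age": 22.0, "SibSp": 1,
    "Fare": 7.25, "Sex_male": 1.0, "Embarked_Q": 0.0, "Embarked_S": 1.0,
}

# Linear predictor (log-odds), then the logistic function to get a probability
log_odds = sum(coefs[name] * value for name, value in passenger.items())
probability = 1 / (1 + math.exp(-log_odds))
print(round(probability, 3))
```

An odds ratio of about 0.07 for `Sex_male` means that, holding the other predictors fixed, the odds of survival for men are roughly 7% of the odds for women. The ratio of about 0.96 for `Age` corresponds to odds that are roughly 4% lower for each additional year.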

## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [37]:
dummy_dataframe.head()

PassengerId,Pclass,Age,SibSp,Fare,Survived,Sex_male,Embarked_Q,Embarked_S
1,3,22.0,1,7.25,0,1.0,0.0,1.0
2,1,38.0,1,71.2833,1,0.0,0.0,0.0
3,3,26.0,0,7.925,1,0.0,0.0,1.0
4,1,35.0,1,53.1,1,0.0,0.0,1.0
5,3,35.0,0,8.05,0,1.0,0.0,1.0


In [45]:
# Your code here
# Drop the non-significant predictors and the target itself
influential_X = dummy_dataframe.drop(['Embarked_Q', 'Embarked_S', 'Fare', 'Survived'], axis=1)
influential_X = sm.add_constant(influential_X)
model = sm.Logit(y, influential_X)
result_in = model.fit()

Optimization terminated successfully.
         Current function value: 0.445882
         Iterations 6


In [44]:
result_in.summary()

0,1,2,3
Dep. Variable:,Survived,No. Observations:,714.0
Model:,Logit,Df Residuals:,708.0
Method:,MLE,Df Model:,5.0
Date:,"Fri, 16 Aug 2024",Pseudo R-squ.:,0.9659
Time:,11:44:40,Log-Likelihood:,-16.421
converged:,False,LL-Null:,-482.26
Covariance Type:,nonrobust,LLR p-value:,3.713e-199

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-29.5210,164.910,-0.179,0.858,-352.739,293.697
Pclass,-9.2549,26.001,-0.356,0.722,-60.215,41.706
Age,0.1979,0.060,3.315,0.001,0.081,0.315
SibSp,7.7954,11.027,0.707,0.480,-13.818,29.409
Survived,54.9558,169.961,0.323,0.746,-278.161,388.073
Sex_male,1.3858,1.099,1.261,0.207,-0.768,3.539


In [None]:
# Your comments here
"""
Note: the summary above does not match the model fitted in In [45]. It comes from an earlier run (In [44]) in which 'Survived' was still among the predictors, so the target leaked into the design matrix. The In [45] fit, which drops 'Survived', converged in 6 iterations with a function value of 0.445882. Re-run result_in.summary() to see its coefficients.

Pseudo R-squ.: 0.9659
This is McFadden's pseudo R-squared, 1 - (Log-Likelihood / LL-Null). It is not a proportion of variance explained. A value this close to 1, together with converged: False, is consistent with (quasi-)perfect separation caused by using the target as a predictor. It does not indicate a good model.

Survived: This variable should not be included as a predictor since it is the dependent variable. Its very large coefficient (54.9558) and standard error (169.961) are typical symptoms of separation.

const, Pclass, SibSp, Sex_male: The coefficients and standard errors are inflated, and none is statistically significant (p-values = 0.858, 0.722, 0.480 and 0.207). These estimates are artifacts of the failed fit and should not be interpreted.

Age: The p-value is 0.001, but the sign has flipped relative to the first model (0.1979 versus -0.0431). Because the fit did not converge, this is not evidence that older passengers had higher odds of survival.

Correctly specified reduced model: The function value of 0.445882 implies a log-likelihood of about -318.36 (0.445882 * 714), only slightly below the full model's -316.49. Dropping Fare and the Embarked dummies therefore costs very little in fit.
"""
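
One way to answer "How does this model perform?" is a likelihood-ratio test between the full model and the reduced model, which is valid here because the models are nested and fitted on the same 714 rows. The sketch below uses only numbers printed in this lab: the full model's log-likelihood of -316.49 and the reduced fit's function value of 0.445882. It relies on the standard library, so the chi-square tail probability for 3 degrees of freedom is written out in closed form; with SciPy installed, `scipy.stats.chi2.sf` would do the same job. Treat the result as approximate because the inputs are rounded.

```python
import math

n_obs = 714
ll_full = -316.49               # Log-Likelihood from the first summary table
ll_reduced = -0.445882 * n_obs  # from the function value of the reduced fit
df_diff = 3                     # Fare, Embarked_Q and Embarked_S were dropped

# Likelihood-ratio statistic: twice the gap in log-likelihood
lr_stat = 2 * (ll_full - ll_reduced)

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    t = x / 2
    return math.erfc(math.sqrt(t)) + 2 * math.sqrt(t / math.pi) * math.exp(-t)

p_value = chi2_sf_3df(lr_stat)

# AIC = 2 * (number of parameters) - 2 * log-likelihood; lower is better
aic_full = 2 * 8 - 2 * ll_full        # intercept + 7 predictors
aic_reduced = 2 * 5 - 2 * ll_reduced  # intercept + 4 predictors

print(round(lr_stat, 2), round(p_value, 2))
print(round(aic_full, 2), round(aic_reduced, 2))
```

The large p-value means the three dropped predictors do not improve the fit by a statistically detectable amount, and the slightly lower AIC of the reduced model points the same way.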

## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results, building on what you already know about reading linear regression output. Continue on to take a look at building logistic regression models in scikit-learn!