# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson you were given a broad overview of logistic regression. This included an introduction to two separate packages for creating logistic regression models. In this lab, you'll be investigating fitting logistic regressions with `statsmodels`. For your first foray into logistic regression, you are going to attempt to build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck or not (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [1]:
# Import pandas
import pandas as pd

# Import the data
df = pd.read_csv('titanic.csv')

# Display the first five rows
df.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [2]:
# Total number of people who survived/didn't survive
# Define target variable
y = df['Survived']

# Define independent variables (drop 'Survived' and any non-numeric or irrelevant columns)
X = df.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin'])  # adjust columns as needed
# For categorical variables, you may need to encode them later

# Total number of people who didn't survive
num_not_survived = (y == 0).sum()
print("Number of people who didn't survive:", num_not_survived)

# Total number of people who survived
num_survived = (y == 1).sum()
print("Number of people who survived:", num_survived)

Number of people who didn't survive: 549
Number of people who survived: 342
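
The two counts above also give the overall survival rate, which is a useful baseline to keep in mind before fitting any model. Here is a minimal sketch that rebuilds a stand-in target from the printed counts instead of reading `titanic.csv`; the names `y_demo`, `class_shares`, and `survival_rate` are illustrative and not part of the lab:

```python
import pandas as pd

# Stand-in for df['Survived'], rebuilt from the counts printed above
# (assumption: 549 zeros and 342 ones, no other values)
y_demo = pd.Series([0] * 549 + [1] * 342, name='Survived')

# Share of each class; normalize=True returns proportions instead of counts
class_shares = y_demo.value_counts(normalize=True)
print(class_shares)

# The mean of a 0/1 column equals the proportion of ones
survival_rate = y_demo.mean()
print(f"Survival rate: {survival_rate:.3f}")
```

Roughly 38% of passengers survived, so a model that always predicts `0` would already be right about 62% of the time.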


Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from categorical variables. Remember to drop the first level for each categorical column and make sure all the values are of type `float`: 

In [4]:
# Define the relevant columns
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']

# Select only relevant columns
df_relevant = df[relevant_columns]

# Drop rows with missing values
df_relevant = df_relevant.dropna()

# Create dummy variables for categorical columns
dummy_dataframe = pd.get_dummies(df_relevant, columns=['Sex', 'Embarked'], drop_first=True)

# Ensure all values are float
dummy_dataframe = dummy_dataframe.astype(float)

# Check the shape of the resulting DataFrame
print(dummy_dataframe.shape)

(712, 8)
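
To see what `drop_first=True` and the cast to `float` do, here is a small sketch on a made-up three-row frame. The frame `toy` and its values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical miniature of the lab's data: one numeric and two categorical columns
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# drop_first=True removes the first level of each categorical column,
# which then acts as the reference category
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], drop_first=True)

# Depending on the pandas version, dummy columns may be bool or uint8,
# so cast everything to float before passing the frame to statsmodels
encoded = encoded.astype(float)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

Here `female` and `C` are the dropped levels, so the coefficients on `Sex_male`, `Embarked_Q`, and `Embarked_S` are read relative to them.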


Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values. 

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 

In [5]:
# Drop rows with missing values
dummy_dataframe = dummy_dataframe.dropna()
dummy_dataframe.shape

(712, 8)
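
Before dropping rows, it can help to check which columns actually contain missing values. A hedged sketch on a small hypothetical frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one missing Embarked value
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
    'Embarked': ['S', 'C', None, 'S'],
})

# Count missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(cleaned.shape)
```

In this lab the rows were already dropped before the dummy variables were created, which is why the shape stays at `(712, 8)`. That order matters: with its default settings, `pd.get_dummies()` encodes a missing category as all zeros rather than as `NaN`, so a later `.dropna()` would not catch it.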

Finally, assign the independent variables to `X` and the target variable to `y`: 

In [6]:
# Split the data into X and y
# Target variable
y = dummy_dataframe['Survived']

# Independent variables (drop the target column)
X = dummy_dataframe.drop(columns=['Survived'])


In [7]:
# Check shapes to confirm
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (712, 7)
y shape: (712,)


## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` was unable to fit the model because a matrix it needed to invert was not full rank. In other words, at least one column can be written as a linear combination of the others, so it carries redundant information. Try removing some features from the model and running it again.
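
One common way to end up with a rank-deficient matrix is to keep every dummy level and also add an intercept. The sketch below checks this with `numpy.linalg.matrix_rank` on a made-up column. The helper `design_matrix` is hypothetical, and the column of ones is inserted by hand to play the same role as the intercept term:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'female']})

def design_matrix(frame, drop_first):
    """Dummy-encode `frame` and prepend a column of ones as the intercept."""
    encoded = pd.get_dummies(frame, columns=['Sex'], drop_first=drop_first).astype(float)
    encoded.insert(0, 'const', 1.0)
    return encoded

# All dummy levels kept: Sex_female + Sex_male always equals const
full = design_matrix(toy, drop_first=False)
rank_full = np.linalg.matrix_rank(full.to_numpy())

# First level dropped: the remaining columns are linearly independent
reduced = design_matrix(toy, drop_first=True)
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print(full.columns.tolist(), "rank", rank_full, "of", full.shape[1], "columns")
print(reduced.columns.tolist(), "rank", rank_reduced, "of", reduced.shape[1], "columns")
```

A rank lower than the number of columns signals redundancy of the kind the warning describes.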

In [8]:
# Build a logistic regression model using statsmodels
import statsmodels.api as sm

# Add an intercept term to the independent variables
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Print the summary of the model
print(result.summary())


Optimization terminated successfully.
         Current function value: 0.444229
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  712
Model:                          Logit   Df Residuals:                      704
Method:                           MLE   Df Model:                            7
Date:                Thu, 30 Oct 2025   Pseudo R-squ.:                  0.3417
Time:                        16:27:57   Log-Likelihood:                -316.29
converged:                       True   LL-Null:                       -480.45
Covariance Type:            nonrobust   LLR p-value:                 5.360e-67
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6378      0.633      8.901      0.000       4.396       6.879
Pclass        -1.2102      0.

## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [9]:
# Summary table
summary_table = result.summary()
print(summary_table)

                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  712
Model:                          Logit   Df Residuals:                      704
Method:                           MLE   Df Model:                            7
Date:                Thu, 30 Oct 2025   Pseudo R-squ.:                  0.3417
Time:                        16:29:21   Log-Likelihood:                -316.29
converged:                       True   LL-Null:                       -480.45
Covariance Type:            nonrobust   LLR p-value:                 5.360e-67
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6378      0.633      8.901      0.000       4.396       6.879
Pclass        -1.2102      0.163     -7.427      0.000      -1.530      -0.891
Age           -0.0433      0.008     -5.263      0.0

In [None]:
# Your comments here
"""Features like Pclass, Age, SibSp, and Sex_male are statistically significant predictors of survival."""

"""Fare and the Embarked dummy variables do not provide significant information in this model and could potentially be removed in a simplified model."""
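
Besides p-values, logistic regression coefficients are often read as odds ratios by exponentiating them. The sketch below does this by hand for the two coefficients visible in the summary output above (`Pclass` and `Age`); the variable names are illustrative, and the numbers are copied from the printed table rather than taken from a fitted object:

```python
import math

# Copied from the summary table above
pclass_coef = -1.2102
pclass_ci = (-1.530, -0.891)   # reported 95% interval for the coefficient
age_coef = -0.0433

# exp(coef) is the multiplicative change in the odds of survival
# for a one-unit increase in the feature, holding the others fixed
pclass_or = math.exp(pclass_coef)
pclass_or_ci = tuple(math.exp(bound) for bound in pclass_ci)
age_or = math.exp(age_coef)

# A ten-year age difference multiplies the coefficient by 10 before exponentiating
age_or_10_years = math.exp(10 * age_coef)

print(f"Pclass odds ratio: {pclass_or:.3f} (95% CI {pclass_or_ci[0]:.3f} to {pclass_or_ci[1]:.3f})")
print(f"Age odds ratio per year: {age_or:.3f}")
print(f"Age odds ratio per 10 years: {age_or_10_years:.3f}")
```

Under this reading, each step to a higher class number (for example from 1st to 2nd class) multiplies the odds of survival by about 0.30, and each additional year of age lowers them by about 4%. On a fitted model, `np.exp(result.params)` should give the same ratios for every feature, and `result.pvalues` holds the p-values shown in the table.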

## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [10]:
# Your code here
import statsmodels.api as sm

# Select only significant features
X_significant = X[['Pclass', 'Age', 'SibSp', 'Sex_male']]

# Add an intercept term
X_significant = sm.add_constant(X_significant)

# Fit the logistic regression model
logit_model_sig = sm.Logit(y, X_significant)
result_sig = logit_model_sig.fit()

# Print the summary of the new model
print(result_sig.summary())

Optimization terminated successfully.
         Current function value: 0.446755
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  712
Model:                          Logit   Df Residuals:                      707
Method:                           MLE   Df Model:                            4
Date:                Thu, 30 Oct 2025   Pseudo R-squ.:                  0.3379
Time:                        16:31:19   Log-Likelihood:                -318.09
converged:                       True   LL-Null:                       -480.45
Covariance Type:            nonrobust   LLR p-value:                 5.015e-69
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.5908      0.543     10.288      0.000       4.526       6.656
Pclass        -1.3139      0.

In [None]:
# Your comments here
"""Survival on the Titanic was most strongly influenced by sex, passenger class, age, and the number of siblings or spouses aboard (SibSp). The simplified model's pseudo R-squared (0.3379) is nearly identical to that of the full model (0.3417), so it provides a clear, interpretable framework for predicting survival while avoiding unnecessary complexity."""
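
One way to make the comparison between the two models more concrete is a likelihood-ratio test, since the simplified model is nested inside the full one and both were fit on the same 712 rows. The sketch below is not part of the original lab. It uses the log-likelihoods and `Df Model` values printed in the two summaries above, and it assumes `scipy` is installed:

```python
from scipy.stats import chi2

# Values copied from the two summary tables above
ll_full, df_full = -316.29, 7         # full model: log-likelihood, Df Model
ll_reduced, df_reduced = -318.09, 4   # simplified model

# Likelihood-ratio statistic: twice the gap in log-likelihood
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = df_full - df_reduced

# Under the null hypothesis that the dropped features have zero coefficients,
# lr_stat is approximately chi-squared with df_diff degrees of freedom
p_value = chi2.sf(lr_stat, df_diff)

def aic(log_likelihood, n_slopes):
    """AIC = 2k - 2*LL, counting the intercept as one extra parameter."""
    return 2 * (n_slopes + 1) - 2 * log_likelihood

aic_full = aic(ll_full, df_full)
aic_reduced = aic(ll_reduced, df_reduced)

print(f"LR statistic: {lr_stat:.2f} on {df_diff} df, p-value: {p_value:.3f}")
print(f"AIC full: {aic_full:.2f}, AIC simplified: {aic_reduced:.2f}")
```

The p-value comes out at roughly 0.3, so the three dropped columns do not improve the fit significantly, and the simplified model has the lower AIC. Because the inputs are rounded values from the printed summaries, the results are approximate.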

## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results, building on your earlier statistics knowledge in much the same way as you did for linear regression. Continue on to take a look at building logistic regression models in scikit-learn!