# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson you were given a broad overview of logistic regression. This included an introduction to two separate packages for creating logistic regression models. In this lab, you'll be investigating fitting logistic regressions with `statsmodels`. For your first foray into logistic regression, you are going to attempt to build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck or not (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [1]:
import pandas as pd

# Import the data
df = pd.read_csv('titanic.csv')
print(df.head())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [2]:
# Total number of people who survived/didn't survive
survival_counts = df['Survived'].value_counts()
print(survival_counts)


Survived
0    549
1    342
Name: count, dtype: int64
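
Raw counts are easier to compare as proportions. The sketch below rebuilds the target column from the counts printed above (it does not read `titanic.csv`) and uses `value_counts(normalize=True)`; on the real data, the equivalent call would be `df['Survived'].value_counts(normalize=True)`.

```python
import pandas as pd

# Rebuild the target column from the counts printed above (549 zeros, 342 ones)
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns proportions instead of raw counts
survival_rates = survived.value_counts(normalize=True)
print(survival_rates.round(3))
```

Roughly 38% of passengers survived, so the two classes are somewhat imbalanced.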


Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from categorical variables. Remember to drop the first level for each categorical column and make sure all the values are of type `float`: 

In [3]:
# Create dummy variables
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
df_relevant = df[relevant_columns]
dummy_dataframe = pd.get_dummies(df_relevant, drop_first=True).astype(float)

dummy_dataframe.shape

(891, 8)
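
To see what `drop_first=True` does, here is a minimal sketch on a made-up four-row frame (`toy` is hypothetical, not the Titanic data). The first level of each categorical column, in sorted order, is dropped and becomes the baseline.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# 'female' and 'C' sort first, so they are dropped and act as baseline levels
toy_dummies = pd.get_dummies(toy, drop_first=True).astype(float)
print(toy_dummies.columns.tolist())
# ['Pclass', 'Sex_male', 'Embarked_Q', 'Embarked_S']
```

This matches the eight columns above: four numeric predictors, `Survived`, and three dummy columns.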

Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values. 

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 

In [4]:
# Drop missing rows
dummy_dataframe = dummy_dataframe.dropna()
dummy_dataframe.shape

(714, 8)
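
One subtlety: by default, `pd.get_dummies()` encodes a missing categorical value as all zeros, which is indistinguishable from the dropped baseline level. Calling `.dropna()` after encoding therefore only removes rows with missing numeric values. The sketch below uses a made-up frame (`toy` is hypothetical) to show the difference. Whether this matters here depends on whether `Embarked` has missing entries, which the output above does not show.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing numeric value, one missing categorical value
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Order used in this lab: encode first, then drop rows containing NaN
encoded_then_dropped = pd.get_dummies(toy, drop_first=True).astype(float).dropna()

# Alternative order: drop rows containing NaN first, then encode
dropped_then_encoded = pd.get_dummies(toy.dropna(), drop_first=True).astype(float)

# The row with a missing Embarked value survives only in the first version
print(len(encoded_then_dropped), len(dropped_then_encoded))
```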

Finally, assign the independent variables to `X` and the target variable to `y`: 

In [5]:
# Split the data into X and y
y = dummy_dataframe['Survived']
X = dummy_dataframe.drop(columns=['Survived'])

## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` was unable to fit the model because the matrix of predictors was not invertible: it was not full rank, so at least one column can be written as a linear combination of the others (for example, keeping every dummy level alongside the intercept). Try removing the redundant features from the model and running it again.
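
A quick way to check for this problem before fitting is to compare the rank of the predictor matrix with its number of columns. The sketch below uses a made-up single-column frame (`toy` is hypothetical) and `np.linalg.matrix_rank`; it illustrates the idea and is not part of the lab solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single categorical predictor
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male", "female", "male"]})

# Keeping every dummy level plus a constant: Sex_female + Sex_male equals const
full = pd.get_dummies(toy, drop_first=False).astype(float)
full.insert(0, "const", 1.0)
rank_full = np.linalg.matrix_rank(full.to_numpy())

# Dropping the first level removes the redundancy
reduced = pd.get_dummies(toy, drop_first=True).astype(float)
reduced.insert(0, "const", 1.0)
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print(f"full: {full.shape[1]} columns, rank {rank_full}")
print(f"reduced: {reduced.shape[1]} columns, rank {rank_reduced}")
```

A rank lower than the number of columns signals redundant predictors.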

In [6]:
import statsmodels.api as sm

# Add an intercept term to the model
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Print the summary of the model
print(result.summary())


Optimization terminated successfully.
         Current function value: 0.443267
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      706
Method:                           MLE   Df Model:                            7
Date:                Mon, 04 Nov 2024   Pseudo R-squ.:                  0.3437
Time:                        10:54:00   Log-Likelihood:                -316.49
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 1.103e-67
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6503      0.633      8.921      0.000       4.409       6.892
Pclass        -1.2118      0.

## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [7]:
# Summary table
# Already printed by the previous cell (see the output above)

In [None]:
# Your comments here

# The p-values indicate the statistical significance of each feature in the model:
# - Pclass, Age, SibSp, and Sex_male have very low p-values (< 0.05), indicating that they are statistically significant predictors of survival.
# - Fare, Embarked_Q, and Embarked_S have high p-values (> 0.05), suggesting that they are not statistically significant predictors of survival in this model.
# - The intercept (const) is also statistically significant with a very low p-value.


## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [8]:
# Select only the influential features
X_influential = X[['const', 'Pclass', 'Age', 'SibSp', 'Sex_male']]

# Fit the logistic regression model with the influential features
logit_model_influential = sm.Logit(y, X_influential)
result_influential = logit_model_influential.fit()

# Print the summary of the new model
print(result_influential.summary())


Optimization terminated successfully.
         Current function value: 0.445882
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      709
Method:                           MLE   Df Model:                            4
Date:                Mon, 04 Nov 2024   Pseudo R-squ.:                  0.3399
Time:                        10:56:02   Log-Likelihood:                -318.36
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 1.089e-69
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6008      0.543     10.306      0.000       4.536       6.666
Pclass        -1.3174      0.
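
One way to judge how the smaller model performs is a likelihood-ratio test, since it is nested inside the full model. The sketch below uses only the log-likelihoods and `Df Model` values printed in the two summaries. For 3 degrees of freedom, the chi-square survival function has a closed form, which is used here to avoid extra dependencies; `scipy.stats.chi2.sf(lr_stat, df_diff)` would give the same value.

```python
import math

# Log-likelihoods and model degrees of freedom copied from the two summaries above
llf_full, df_full = -316.49, 7        # all seven predictors
llf_reduced, df_reduced = -318.36, 4  # Pclass, Age, SibSp, Sex_male

# Likelihood-ratio statistic for nested models
lr_stat = 2 * (llf_full - llf_reduced)
df_diff = df_full - df_reduced

# Closed-form chi-square survival function, valid only for 3 degrees of freedom
assert df_diff == 3
p_value = (math.erfc(math.sqrt(lr_stat / 2))
           + math.sqrt(2 * lr_stat / math.pi) * math.exp(-lr_stat / 2))

print(f"LR statistic = {lr_stat:.2f}, df = {df_diff}, p-value = {p_value:.3f}")
```

The p-value comes out at about 0.29, so there is no strong evidence that `Fare` and the `Embarked` dummies improve the fit. The small drop in pseudo R-squared (0.3437 to 0.3399) tells the same story.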


The output above is the result of a logistic regression fit. Its key components are described below.

### Key Components

* **Dependent variable (`Dep. Variable`): `Survived`.** The model predicts the likelihood of survival.
* **Number of observations (`No. Observations`): 714.** The model was fit on the 714 rows that remained after dropping missing values.
* **Model: `Logit`.** A logistic regression model was used.
* **Method: `MLE`.** Maximum likelihood estimation was used to estimate the model parameters.
* **Pseudo R-squared (`Pseudo R-squ.`): 0.3399.** This is McFadden's pseudo R-squared, computed as 1 - (Log-Likelihood / LL-Null). It measures how much the model improves the log-likelihood over the null model. Unlike R-squared in linear regression, it should not be read as the share of variance explained.
* **Log-Likelihood: -318.36.** A measure of model fit. For models fit on the same data, higher values (closer to zero) indicate a better fit.
* **Converged: `True`.** The optimization algorithm converged successfully.
* **LL-Null: -482.26.** The log-likelihood of the null model (intercept only, no predictors).
* **LLR p-value: 1.089e-69.** The p-value for the likelihood ratio test comparing the fitted model to the null model. A very small p-value indicates that the model fits significantly better than the null model.
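
The pseudo R-squared can be reproduced by hand from the two log-likelihoods. The sketch below uses the values printed in the summaries above; `mcfadden_r2` is a helper name made up for this example (on a fitted model, `statsmodels` exposes the same quantity as `result.prsquared`).

```python
# Log-likelihoods copied from the two model summaries above
ll_null = -482.26
ll_full = -316.49
ll_reduced = -318.36


def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)."""
    return 1 - ll_model / ll_null


r2_full = mcfadden_r2(ll_full, ll_null)
r2_reduced = mcfadden_r2(ll_reduced, ll_null)
print(round(r2_full, 4), round(r2_reduced, 4))
# 0.3437 0.3399
```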
### Coefficients and Statistics

Each coefficient is the change in the log-odds of survival for a one-unit increase in that predictor, holding the other predictors fixed.

* **`const`: 5.6008.** The intercept, which is the log-odds of survival when every predictor equals zero. A positive value means the baseline odds of survival are above 1.
* **`Pclass`: -1.3174.** The coefficient for passenger class. A negative value indicates that higher classes (lower numerical values) are associated with higher odds of survival.
* **`Age`: -0.0444.** The coefficient for age. A negative value indicates that older age is associated with lower odds of survival.
* **`SibSp`: -0.3761.** The coefficient for the number of siblings/spouses aboard. A negative value indicates that having more siblings/spouses aboard is associated with lower odds of survival.
* **`Sex_male`: -2.6235.** The coefficient for being male. A negative value indicates that being male is associated with lower odds of survival than being female, the baseline level.
### Statistical Significance

* **`P>|z|`:** the p-value for each coefficient.
* All p-values are very small (0.000 or close to it), indicating that all predictors are statistically significant.

### Confidence Intervals

* **`[0.025, 0.975]`:** the lower and upper bounds of the 95% confidence interval for each coefficient.
* These intervals give a range of plausible values for the true coefficient. An interval that excludes zero corresponds to a p-value below 0.05.
### Interpretation

* The model suggests that travelling in a lower passenger class (a larger `Pclass` number), being older, having more siblings/spouses aboard, and being male are all associated with lower odds of survival.
* The intercept and all coefficients are statistically significant, so these associations are unlikely to be due to chance alone. Significance by itself does not say how large an effect is; the coefficient magnitudes do.
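
Log-odds coefficients are easier to interpret once exponentiated into odds ratios. The sketch below applies `math.exp` to the coefficients quoted above. With a fitted `statsmodels` result you could instead use `np.exp(result_influential.params)`.

```python
import math

# Coefficients (log-odds) quoted in the explanation above
coefficients = {
    "Pclass": -1.3174,
    "Age": -0.0444,
    "SibSp": -0.3761,
    "Sex_male": -2.6235,
}

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An odds ratio below 1 means the odds of survival shrink as the predictor increases. For example, being male multiplies the odds of survival by roughly 0.07, and each additional year of age multiplies them by about 0.96, holding the other predictors fixed.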

## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results much as you would for linear regression, building on your previous statistics knowledge. Continue on to take a look at building logistic regression models in scikit-learn!