# AccelerateAI: Logistic Regression

## Assignment: Insurance Cross Sell

The insurance major wants to understand its cross-sell opportunities by analyzing the information it already holds. As a data scientist, you have access to the Insurance Cross Sell data (```InsuranceCrossSell.csv``` on GitHub).

The variables are described below for quick reference:

- ```Response``` is the binary outcome indicating whether the customer has taken the insurance or not.
- The other columns are predictors: ```Gender```, ```Age```, ```Driving_License```, ```Region_Code```, etc.

Fit a model (using Logit or GLM) and explain the significance of the predictors for the ```Response``` outcome.

In [1]:
import pandas as pd 
import statsmodels.api as sm
from statsmodels.api import Logit, add_constant
import statsmodels.formula.api as smf

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('./InsuranceCrossSell.csv') 

df.sample(5)

| | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 174341 | 174342 | Female | 40 | 1 | 15.0 | 1 | 1-2 Year | No | 37341.0 | 124.0 | 261 | 0 |
| 18430 | 18431 | Male | 21 | 1 | 28.0 | 1 | < 1 Year | No | 2630.0 | 152.0 | 178 | 0 |
| 293711 | 293712 | Female | 22 | 1 | 49.0 | 0 | < 1 Year | Yes | 2630.0 | 152.0 | 180 | 0 |
| 129283 | 129284 | Female | 37 | 1 | 28.0 | 0 | 1-2 Year | No | 46607.0 | 124.0 | 26 | 0 |
| 3428 | 3429 | Female | 44 | 1 | 46.0 | 0 | 1-2 Year | Yes | 23490.0 | 26.0 | 38 | 1 |


In [3]:
df.shape

(381109, 12)

In [4]:
df.columns

Index(['id', 'Gender', 'Age', 'Driving_License', 'Region_Code',
       'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage', 'Response'],
      dtype='object')

In [5]:
# The id column is only a row identifier and carries no predictive information, so drop it.
df.drop(columns='id', inplace=True)

In [6]:
df.columns

Index(['Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
       'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage', 'Response'],
      dtype='object')

In [7]:
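# One-hot encode the categorical columns. drop_first=True drops one level per
# variable so the dummies are not perfectly collinear with the intercept.
# dtype=int keeps the dummies as 0/1 (newer pandas versions default to bool).
df_encoded = pd.get_dummies(data=df, drop_first=True, dtype=int)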

In [8]:
df_encoded.head(5)

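| | Age | Driving_License | Region_Code | Previously_Insured | Annual_Premium | Policy_Sales_Channel | Vintage | Response | Gender_Male | Vehicle_Age_< 1 Year | Vehicle_Age_> 2 Years | Vehicle_Damage_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | 1 | 28.0 | 0 | 40454.0 | 26.0 | 217 | 1 | 1 | 0 | 1 | 1 |
| 1 | 76 | 1 | 3.0 | 0 | 33536.0 | 26.0 | 183 | 0 | 1 | 0 | 0 | 0 |
| 2 | 47 | 1 | 28.0 | 0 | 38294.0 | 26.0 | 27 | 1 | 1 | 0 | 1 | 1 |
| 3 | 21 | 1 | 11.0 | 1 | 28619.0 | 152.0 | 203 | 0 | 1 | 1 | 0 | 0 |
| 4 | 29 | 1 | 41.0 | 1 | 27496.0 | 152.0 | 39 | 0 | 0 | 1 | 0 | 0 |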


In [9]:
df_encoded.columns

Index(['Age', 'Driving_License', 'Region_Code', 'Previously_Insured',
       'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Response',
       'Gender_Male', 'Vehicle_Age_< 1 Year', 'Vehicle_Age_> 2 Years',
       'Vehicle_Damage_Yes'],
      dtype='object')

In [10]:
# Rename the dummy columns whose names contain spaces and symbols ('<', '>'),
# since such names cannot be used directly as terms in a model formula.
df_encoded = df_encoded.rename(columns={'Vehicle_Age_< 1 Year':'VehicleAgeLessThan1Yr','Vehicle_Age_> 2 Years':'VehicleAgeMoreThan2Yrs'})
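
The rename matters because a formula string is parsed like a Python expression, so a bare name containing spaces or `<` is not a valid term. As an alternative sketch, and assuming the patsy formula engine is in use, the `Q()` helper can quote such a name in place instead of renaming it. The data frame `toy` below is synthetic and purely illustrative; only the awkward column name is borrowed from the notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic toy data (NOT the insurance dataset); only the column
# name is taken from the notebook.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({"Vehicle_Age_< 1 Year": rng.integers(0, 2, size=n)})

# Make the response depend mildly on the dummy so the fit is well behaved.
linear = -0.5 + 1.0 * toy["Vehicle_Age_< 1 Year"]
toy["Response"] = rng.binomial(1, 1 / (1 + np.exp(-linear)))

# Q("...") quotes a column name that is not a valid Python identifier.
quoted_fit = smf.logit(
    'Response ~ Q("Vehicle_Age_< 1 Year")', data=toy
).fit(disp=0)
print(quoted_fit.params)
```

Renaming, as done in the cell above, keeps the formula and the summary table easier to read; quoting avoids touching the data frame.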

In [11]:
# Fitting the Logit model with the formula from statsmodels.formula.api

f = 'Response ~ Age + Driving_License + Region_Code + Previously_Insured + Annual_Premium + Policy_Sales_Channel + Vintage + Gender_Male + VehicleAgeLessThan1Yr + VehicleAgeMoreThan2Yrs + Vehicle_Damage_Yes'

logitfit = smf.logit(formula=f, data=df_encoded).fit()

Optimization terminated successfully.
         Current function value: 0.274858
         Iterations 11


In [12]:
# Summary of model for analysis

print(logitfit.summary())

                           Logit Regression Results                           
Dep. Variable:               Response   No. Observations:               381109
Model:                          Logit   Df Residuals:                   381097
Method:                           MLE   Df Model:                           11
Date:                Mon, 19 Sep 2022   Pseudo R-squ.:                  0.2611
Time:                        16:07:22   Log-Likelihood:            -1.0475e+05
converged:                       True   LL-Null:                   -1.4177e+05
Covariance Type:            nonrobust   LLR p-value:                     0.000
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 -2.8979      0.171    -16.909      0.000      -3.234      -2.562
Age                       -0.0253      0.001    -46.450      0.000      -0.026      -0.024
Driv

- The model as a whole is significant: the likelihood-ratio test against the intercept-only model reports an LLR p-value of 0.000.
- All the features considered here are individually significant (based on the `P>|z|` column) except ```Region_Code``` and ```Vintage```.

So we can re-fit the model without these two features.

In [13]:
# Refinement of the model 

f_refined = 'Response ~ Age + Driving_License + Previously_Insured + Annual_Premium + Policy_Sales_Channel + Gender_Male + VehicleAgeLessThan1Yr + VehicleAgeMoreThan2Yrs + Vehicle_Damage_Yes'

logitfit_refined = smf.logit(formula=f_refined, data=df_encoded).fit()

Optimization terminated successfully.
         Current function value: 0.274859
         Iterations 11
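
The two fits report nearly identical objective values (0.274858 and 0.274859), which suggests the dropped features contributed very little. A likelihood-ratio test makes that comparison formal for nested models. The sketch below runs it on synthetic data in which two predictors (`noise1`, `noise2`, hypothetical names) are pure noise by construction; it is not a result for the insurance data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic toy data: y depends on x1 only; noise1/noise2 are irrelevant.
rng = np.random.default_rng(2)
n = 2000
toy = pd.DataFrame({
    "x1": rng.normal(size=n),
    "noise1": rng.normal(size=n),
    "noise2": rng.normal(size=n),
})
linear = -1.0 + 0.8 * toy["x1"]
toy["y"] = rng.binomial(1, 1 / (1 + np.exp(-linear)))

full = smf.logit("y ~ x1 + noise1 + noise2", data=toy).fit(disp=0)
reduced = smf.logit("y ~ x1", data=toy).fit(disp=0)

# LR statistic: twice the gain in log-likelihood from the extra terms,
# compared against a chi-squared with df = number of dropped terms.
lr_stat = 2 * (full.llf - reduced.llf)
df_diff = int(full.df_model - reduced.df_model)
p_value = stats.chi2.sf(lr_stat, df_diff)
print(f"LR statistic = {lr_stat:.3f}, df = {df_diff}, p-value = {p_value:.3f}")
```

In the notebook, `full` and `reduced` would correspond to `logitfit` and `logitfit_refined`. A large p-value would support dropping both features together, which the individual z-tests alone do not strictly guarantee.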


In [14]:
print(logitfit_refined.summary())

                           Logit Regression Results                           
Dep. Variable:               Response   No. Observations:               381109
Model:                          Logit   Df Residuals:                   381099
Method:                           MLE   Df Model:                            9
Date:                Mon, 19 Sep 2022   Pseudo R-squ.:                  0.2611
Time:                        16:07:24   Log-Likelihood:            -1.0475e+05
converged:                       True   LL-Null:                   -1.4177e+05
Covariance Type:            nonrobust   LLR p-value:                     0.000
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                 -2.9085      0.171    -17.038      0.000      -3.243      -2.574
Age                       -0.0254      0.001    -46.458      0.000      -0.026      -0.024
Driv

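Now all the remaining features are individually significant, so we can proceed with this model. Dropping ```Region_Code``` and ```Vintage``` cost essentially nothing in fit: the pseudo R-squared is still 0.2611 and the log-likelihood is unchanged at the reported precision.

The model also remains statistically significant overall (LLR p-value of 0.000).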
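
To explain the effect of a predictor and not only its significance, a logit coefficient can be exponentiated into an odds ratio: the multiplicative change in the odds of `Response = 1` for a one-unit increase in the predictor, holding the others fixed. The sketch below works this through for the `Age` row of the refined summary; the variable names are illustrative.

```python
import math

# Values copied from the Age row of the refined model summary.
age_coef = -0.0254
age_ci = (-0.026, -0.024)

# exp(coef) is the odds ratio for a one-year increase in Age.
odds_ratio = math.exp(age_coef)
ci_ratio = tuple(math.exp(bound) for bound in age_ci)

# Percentage change in the odds per additional year of age.
pct_change = (odds_ratio - 1) * 100

# Effects compound multiplicatively: a ten-year gap scales the odds
# by exp(10 * coef), not by 10 times the one-year change.
ten_year_ratio = math.exp(10 * age_coef)

print(f"Odds ratio per year:     {odds_ratio:.4f}")
print(f"95% CI for odds ratio:   ({ci_ratio[0]:.4f}, {ci_ratio[1]:.4f})")
print(f"Change in odds per year: {pct_change:.2f}%")
print(f"Odds ratio per 10 years: {ten_year_ratio:.4f}")
```

On the fitted result, the same transformation could be applied to every coefficient at once with `np.exp(logitfit_refined.params)`. The negative `Age` coefficient means each additional year is associated with slightly lower odds of taking the insurance.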