# Objective – Phase 1

In [48]:
import pandas as pd
import statsmodels.api as sm

## Data Considerations
Based on your first report, the Bank has strategically binned each of the continuous variables in the data set to help facilitate any further analysis.
- For any variable with missing values, change the data to include a missing category instead of a missing value for the categorical variable.
    - (HINT: Now all variables should be categorized (treated as categorical variables so no more continuous variable assumptions) and without missing values. Banks do this for more advanced modeling purposes that we will talk about in the spring.)
- Check each variable for separation concerns. Document in the report and adjust any variables with complete or quasi-separation concerns.

### Handling Missing Values

In [39]:
df = pd.read_csv("insurance_t_bin.csv")
print(df.shape)
df.head()

(8495, 48)


Unnamed: 0,DDA,CASHBK,DIRDEP,NSF,SAV,ATM,CD,IRA,LOC,INV,...,INVBAL_BIN,ILSBAL_BIN,MMBAL_BIN,MTGBAL_BIN,CCBAL_BIN,INCOME_BIN,LORES_BIN,HMVAL_BIN,AGE_BIN,CRSCORE_BIN
0,1,0,1,0,0,1,0,0,0,0.0,...,01 <= 1025,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,01 <= 0.9,01 <= 50,02 > 5,01 <= 98,02 > 50,01 <= 675
1,0,0,0,0,0,0,0,0,0,0.0,...,01 <= 1025,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,01 <= 0.9,01 <= 50,02 > 5,01 <= 98,02 > 50,01 <= 675
2,1,0,1,0,0,0,0,0,0,0.0,...,01 <= 1025,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,02 > 0.9,01 <= 50,01 <= 5,02 <= 122,02 > 50,01 <= 675
3,1,0,0,1,1,1,0,0,0,,...,00 Miss,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,00 Miss,01 <= 50,01 <= 5,02 <= 122,01 <= 50,01 <= 675
4,1,0,0,1,1,1,1,0,0,,...,00 Miss,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,00 Miss,01 <= 50,02 > 5,01 <= 98,02 > 50,02 > 675


In [40]:
df.isnull().sum()

DDA               0
CASHBK            0
DIRDEP            0
NSF               0
SAV               0
ATM               0
CD                0
IRA               0
LOC               0
INV            1075
ILS               0
MM                0
MMCRED            0
MTG               0
CC             1075
CCPURC         1075
SDB               0
HMOWN          1463
MOVED             0
INAREA            0
INS               0
BRANCH            0
RES               0
ACCTAGE_BIN       0
DDABAL_BIN        0
DEP_BIN           0
DEPAMT_BIN        0
CHECKS_BIN        0
NSFAMT_BIN        0
PHONE_BIN         0
TELLER_BIN        0
SAVBAL_BIN        0
ATMAMT_BIN        0
POS_BIN           0
POSAMT_BIN        0
CDBAL_BIN         0
IRABAL_BIN        0
LOCBAL_BIN        0
INVBAL_BIN        0
ILSBAL_BIN        0
MMBAL_BIN         0
MTGBAL_BIN        0
CCBAL_BIN         0
INCOME_BIN        0
LORES_BIN         0
HMVAL_BIN         0
AGE_BIN           0
CRSCORE_BIN       0
dtype: int64

Looks like INV, CC, CCPURC, and HMOWN all have missing data: 1,075 missing values each for INV, CC, and CCPURC, and 1,463 for HMOWN.

In [41]:
miss = ['INV', 'CC', 'CCPURC', 'HMOWN']
for x in miss:
    print(f"Unique values in {x}: {df[x].unique()}")

Unique values in INV: [ 0. nan  1.]
Unique values in CC: [ 1. nan  0.]
Unique values in CCPURC: [ 1.  0. nan  2.  3.  4.]
Unique values in HMOWN: [ 1.  0. nan]


In [42]:
# replacing missing values with 'Missing'
for col in miss:
    df[col] = df[col].fillna('Missing')
    df[col] = df[col].astype('category')

In [47]:
# dummy-encode the listed categorical columns in preparation for the logistic regression
dum = ['BRANCH', 'RES']
df_encoded = pd.get_dummies(df, columns=miss+dum, drop_first=True, dtype=int)

pd.set_option('display.max_columns', 500)
df_encoded.head()

Unnamed: 0,DDA,CASHBK,DIRDEP,NSF,SAV,ATM,CD,IRA,LOC,ILS,MM,MMCRED,MTG,SDB,MOVED,INAREA,INS,ACCTAGE_BIN,DDABAL_BIN,DEP_BIN,DEPAMT_BIN,CHECKS_BIN,NSFAMT_BIN,PHONE_BIN,TELLER_BIN,SAVBAL_BIN,ATMAMT_BIN,POS_BIN,POSAMT_BIN,CDBAL_BIN,IRABAL_BIN,LOCBAL_BIN,INVBAL_BIN,ILSBAL_BIN,MMBAL_BIN,MTGBAL_BIN,CCBAL_BIN,INCOME_BIN,LORES_BIN,HMVAL_BIN,AGE_BIN,CRSCORE_BIN,INV_1.0,INV_Missing,CC_1.0,CC_Missing,CCPURC_1.0,CCPURC_2.0,CCPURC_3.0,CCPURC_4.0,CCPURC_Missing,HMOWN_1.0,HMOWN_Missing,BRANCH_B10,BRANCH_B11,BRANCH_B12,BRANCH_B13,BRANCH_B14,BRANCH_B15,BRANCH_B16,BRANCH_B17,BRANCH_B18,BRANCH_B19,BRANCH_B2,BRANCH_B3,BRANCH_B4,BRANCH_B5,BRANCH_B6,BRANCH_B7,BRANCH_B8,BRANCH_B9,RES_S,RES_U
0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,01 <= 19.6,06 <= 2188.02,02 <= 3,02 <= 686.49,02 <= 2,01 <= 6.65,01 <= 0,01 <= 0,01 <= 0.01,02 <= 3688.63,01 <= 0,01 <= 250,01 <= 500,01 <= 0.1,01 <= 5000,01 <= 1025,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,01 <= 0.9,01 <= 50,02 > 5,01 <= 98,02 > 50,01 <= 675,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,01 <= 19.6,01 <= 0.1,01 <= 0,01 <= 2.18,01 <= 0,01 <= 6.65,01 <= 0,01 <= 0,01 <= 0.01,01 <= 5.4801,01 <= 0,01 <= 250,01 <= 500,01 <= 0.1,01 <= 5000,01 <= 1025,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,01 <= 0.9,01 <= 50,02 > 5,01 <= 98,02 > 50,01 <= 675,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
2,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,01 <= 19.6,05 <= 1248.47,03 > 3,05 > 6425.57,04 > 4,01 <= 6.65,03 > 1,03 > 3,01 <= 0.01,01 <= 5.4801,01 <= 0,01 <= 250,01 <= 500,01 <= 0.1,01 <= 5000,01 <= 1025,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,02 > 0.9,01 <= 50,01 <= 5,02 <= 122,02 > 50,01 <= 675,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1
3,1,0,0,1,1,1,0,0,0,0,0,0,0,1,0,1,0,01 <= 19.6,03 <= 304.95,02 <= 3,03 <= 2190.32,02 <= 2,01 <= 6.65,00 Miss,01 <= 0,02 <= 61.25,02 <= 3688.63,00 Miss,00 Miss,01 <= 500,01 <= 0.1,01 <= 5000,00 Miss,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,00 Miss,01 <= 50,01 <= 5,02 <= 122,01 <= 50,01 <= 675,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0,1,0,01 <= 19.6,03 <= 304.95,03 > 3,04 <= 6425.57,04 > 4,02 > 6.65,00 Miss,02 <= 3,02 <= 61.25,03 > 3688.63,00 Miss,00 Miss,02 <= 9200,01 <= 0.1,01 <= 5000,00 Miss,01 <= 9690.09,01 <= 1031.1401,01 <= 100000,00 Miss,01 <= 50,02 > 5,01 <= 98,02 > 50,02 > 675,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1


### Checking for Linear Separation/Convergence Problems

Apparently, statsmodels and scikit-learn issue warnings (for example, about perfect separation or failed convergence) and report summary statistics once a model is fitted. I will therefore skip this step for now and check for separation after fitting the model.
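Separation can also be screened for before any model is fitted by cross-tabulating each categorical predictor against the target and looking for zero cells. A zero cell means that level is observed with only one outcome class, which points to quasi-complete separation (or complete separation if every level of the variable behaves that way). The sketch below is one possible check, not the method used in this notebook; the helper name `find_separation` and the toy data are made up for illustration.

```python
import pandas as pd

def find_separation(data, target, predictors):
    """Flag predictor levels in which only one outcome class is observed.

    Hypothetical helper for illustration; not part of pandas or statsmodels.
    """
    flagged = []
    for col in predictors:
        # Rows: levels of the predictor; columns: outcome classes (0 / 1).
        table = pd.crosstab(data[col], data[target])
        for level, counts in table.iterrows():
            if (counts == 0).any():
                flagged.append((col, level, {k: int(v) for k, v in counts.items()}))
    return flagged

# Toy data (made up): level "C" never occurs together with INS = 1.
toy = pd.DataFrame({
    "GRP": ["A", "A", "B", "B", "C", "C"],
    "INS": [0, 1, 0, 1, 0, 0],
})

for col, level, counts in find_separation(toy, "INS", ["GRP"]):
    print(f"{col} level {level!r} has an empty cell: {counts}")
```

On the real data this would be called with `df`, `"INS"`, and the list of predictor columns. Flagged levels are candidates for being merged with a neighbouring category.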

## Model Building
Build a main effects only binary logistic regression model to predict the purchase of the insurance product.
- Use backward selection to do the variable selection – the Bank currently uses 𝛼 = 0.002 and p-values to perform backward, but is open to another technique and/or significance level if documented in your report.
- Report the final variables from this model ranked by p-value.
    - (HINT: Even if you choose to not use p-values to select your variables, you should still rank all final variables by their p-value in this report.)

In [56]:
y = df_encoded.INS
X = df_encoded.drop('INS', axis=1)
print(X.shape)
print(y.shape)

(8495, 72)
(8495,)


ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).