### All the Lifecycle In A Data Science Projects
1. Exploratory Data Analysis
2. **Feature Engineering** -->     ***--THIS SECTION--***
3. **Feature Selection**   -->      ***--THIS SECTION--***
4. Model Building
5. Model Deployment

## Import Necessary Libraries

In [1]:
# Data Analysis
import numpy as np
import pandas as pd

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

## Loading the Dataset

In [2]:
df = pd.read_csv('Insurance_Claim.csv')
df.head()

Unnamed: 0,ID,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,569520,65+,female,majority,0-9y,high school,upper class,0.629027,1.0,after 2015,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,750365,16-25,male,majority,0-9y,none,poverty,0.357757,0.0,before 2015,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,199901,16-25,female,majority,0-9y,high school,working class,0.493146,1.0,before 2015,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0
3,478866,16-25,male,majority,0-9y,university,working class,0.206013,1.0,before 2015,0.0,1.0,32765,11000.0,sedan,0,0,0,0.0
4,731664,26-39,male,majority,10-19y,none,working class,0.388366,1.0,before 2015,0.0,0.0,32765,12000.0,sedan,2,0,1,1.0


### Data Cleaning

In [3]:
print(f'The dataset has a total of {df.shape[0]} observations(rows) and {df.shape[1]} columns')

The dataset has a total of 10000 observations(rows) and 19 columns


In [4]:
### Checking for Duplicates (ID Column)
df.ID.duplicated().any()

False

In [5]:
### Drop ID Column
df.drop(columns= ['ID'], inplace= True)

## Feature Engineering

### 1. Handling Missing Data

In [6]:
### Obtain the count
for column in df.columns:
    if df[column].isnull().any() == True:
        print(f'{column}: {df[column].isnull().sum()}')

CREDIT_SCORE: 982
ANNUAL_MILEAGE: 957


#### ANNUAL MILEAGE

Here we will look at what is the Average Annual Mileage for a Car based on its Year of Manufacturing and Type.

In [7]:
table1 = pd.pivot_table(df, index= ['VEHICLE_YEAR', 'VEHICLE_TYPE'], values= ['ANNUAL_MILEAGE'], aggfunc= ['mean'])
table1

Unnamed: 0_level_0,Unnamed: 1_level_0,mean
Unnamed: 0_level_1,Unnamed: 1_level_1,ANNUAL_MILEAGE
VEHICLE_YEAR,VEHICLE_TYPE,Unnamed: 2_level_2
after 2015,sedan,11392.223481
after 2015,sports car,10805.309735
before 2015,sedan,11842.087599
before 2015,sports car,11832.298137


In [8]:
## Initially filling with 0. --> Helps with assignment
df['ANNUAL_MILEAGE'] = df['ANNUAL_MILEAGE'].fillna(0)

In [9]:
## For loop for filling the Missing Values with the necessary Values

for i in range(len(df)):
    
    if (df['VEHICLE_TYPE'][i] == 'after 2015'):
        if (df['VEHICLE_TYPE'][i] == 'sedan'):
            if (df['ANNUAL_MILEAGE'][i] == 0):
                df['ANNUAL_MILEAGE'][i] = 11392.22
                
        elif (df['VEHICLE_TYPE'][i] == 'sports car'):
            if (df['ANNUAL_MILEAGE'][i] == 0):
                df['ANNUAL_MILEAGE'][i] = 10805.30
            else:
                df['ANNUAL_MILEAGE'][i] = df['ANNUAL_MILEAGE'][i]

                
    elif (df['VEHICLE_TYPE'][i] == 'before 2015'):
        if (df['VEHICLE_TYPE'][i] == 'sports car'):
            if (df['ANNUAL_MILEAGE'][i] == 0):
                df['ANNUAL_MILEAGE'][i] = 11832.29
        elif (df['VEHICLE_TYPE'][i] == 'sedan'):
            if (df['ANNUAL_MILEAGE'][i] == 0):
                df['ANNUAL_MILEAGE'][i] = 11842.08
            else:
                df['ANNUAL_MILEAGE'][i] = df['ANNUAL_MILEAGE'][i]

#### CREDIT SCORE

#### Predict the Missing Values

The intuition behind this method is very simple yet effective. We are going to think of the column having missing values as the dependent variable ( or the y column). The rest of the columns can be the independent variable ( or the x column). Now, we take the completely filled rows as our training set and the missing value containing rows as our test set. Then we simply use a simple Linear regression model or a classification model to predict the missing values. Since this method takes into account the correlation between the missing value column and other columns to predict the missing values, it yields much better results.

In [10]:
### Obtaining the Train and Test data
train_data = df[df['CREDIT_SCORE'].isnull() != True]
test_data = df[df['CREDIT_SCORE'].isnull() == True]

In [11]:
### Split the data into Dependent and Independent Features
X_variable = train_data.drop(columns= 'CREDIT_SCORE')
Y_variable = train_data[['CREDIT_SCORE']]

In [12]:
X_variable.head()

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,65+,female,majority,0-9y,high school,upper class,1.0,after 2015,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,16-25,male,majority,0-9y,none,poverty,0.0,before 2015,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,16-25,female,majority,0-9y,high school,working class,1.0,before 2015,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0
3,16-25,male,majority,0-9y,university,working class,1.0,before 2015,0.0,1.0,32765,11000.0,sedan,0,0,0,0.0
4,26-39,male,majority,10-19y,none,working class,1.0,before 2015,0.0,0.0,32765,12000.0,sedan,2,0,1,1.0


In [13]:
### One-Hot Encoding (Categorical Variables)
train_one = pd.get_dummies(X_variable, drop_first= True)
train_one.head()

Unnamed: 0,VEHICLE_OWNERSHIP,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME,AGE_26-39,...,DRIVING_EXPERIENCE_10-19y,DRIVING_EXPERIENCE_20-29y,DRIVING_EXPERIENCE_30y+,EDUCATION_none,EDUCATION_university,INCOME_poverty,INCOME_upper class,INCOME_working class,VEHICLE_YEAR_before 2015,VEHICLE_TYPE_sports car
0,1.0,0.0,1.0,10238,12000.0,0,0,0,0.0,0,...,0,0,0,0,0,0,1,0,0,0
1,0.0,0.0,0.0,10238,16000.0,0,0,0,1.0,0,...,0,0,0,1,0,1,0,0,1,0
2,1.0,0.0,0.0,10238,11000.0,0,0,0,0.0,0,...,0,0,0,0,0,0,0,1,1,0
3,1.0,0.0,1.0,32765,11000.0,0,0,0,0.0,0,...,0,0,0,0,1,0,0,1,1,0
4,1.0,0.0,0.0,32765,12000.0,2,0,1,1.0,1,...,1,0,0,1,0,0,0,1,1,0


In [14]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model_1 = model.fit(train_one, Y_variable)

In [15]:
### Prediction on the Test data
Y1_pred = model_1.predict(train_one)

In [16]:
### Performance Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt

print(f'MSE: {mean_squared_error(Y_variable, Y1_pred)}')
print(f'MAE: {mean_absolute_error(Y_variable, Y1_pred)}')
print(f'RMSE: {sqrt(mean_squared_error(Y_variable, Y1_pred))}')

MSE: 0.008259913304145342
MAE: 0.07264632607351136
RMSE: 0.09088406518276645


#### Performing the analysis on the Test data

In [17]:
### Split the data into Dependent and Independent Features
XTest_variable = test_data.drop(columns= 'CREDIT_SCORE')
YTest_variable = test_data[['CREDIT_SCORE']]

In [18]:
### One-Hot Encoding (Categorical Variables)
test_one = pd.get_dummies(XTest_variable, drop_first= True)
test_one.head(3)

Unnamed: 0,VEHICLE_OWNERSHIP,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME,AGE_26-39,...,DRIVING_EXPERIENCE_10-19y,DRIVING_EXPERIENCE_20-29y,DRIVING_EXPERIENCE_30y+,EDUCATION_none,EDUCATION_university,INCOME_poverty,INCOME_upper class,INCOME_working class,VEHICLE_YEAR_before 2015,VEHICLE_TYPE_sports car
17,0.0,1.0,0.0,32765,12000.0,0,0,0,1.0,0,...,0,0,0,1,0,1,0,0,1,0
23,0.0,0.0,0.0,10238,17000.0,0,0,0,0.0,0,...,0,0,0,1,0,1,0,0,1,0
37,1.0,1.0,1.0,10238,11000.0,2,0,1,0.0,0,...,1,0,0,1,0,0,0,0,1,0


In [19]:
### Prediction on the Test data
Y_pred = model_1.predict(test_one)
len(Y_pred)

982

In [20]:
### Fill the Missing Values using the Predicted values.
XTest_variable['CREDIT_SCORE'] = Y_pred
XTest_variable.head()

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME,CREDIT_SCORE
17,16-25,male,majority,0-9y,none,poverty,0.0,before 2015,1.0,0.0,32765,12000.0,sedan,0,0,0,1.0,0.326985
23,16-25,male,majority,0-9y,none,poverty,0.0,before 2015,0.0,0.0,10238,17000.0,sedan,0,0,0,0.0,0.331415
37,40-64,female,majority,10-19y,none,middle class,1.0,before 2015,1.0,1.0,10238,11000.0,sedan,2,0,1,0.0,0.535087
38,65+,male,majority,30y+,university,upper class,0.0,after 2015,0.0,1.0,10238,12000.0,sports car,6,0,5,0.0,0.597809
47,40-64,female,majority,20-29y,university,upper class,1.0,after 2015,1.0,1.0,92101,11000.0,sedan,3,0,2,0.0,0.621821


In [21]:
### Combining the DataFrames
DataFrames = [train_data, XTest_variable]
  
CleanData = pd.concat(DataFrames)
CleanData.reset_index(inplace= True)
CleanData.drop(columns= 'index', inplace= True)

CleanData.shape

(10000, 18)

In [22]:
CleanData.head()       ### This is our New DataFrame.

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,65+,female,majority,0-9y,high school,upper class,0.629027,1.0,after 2015,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,16-25,male,majority,0-9y,none,poverty,0.357757,0.0,before 2015,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,16-25,female,majority,0-9y,high school,working class,0.493146,1.0,before 2015,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0
3,16-25,male,majority,0-9y,university,working class,0.206013,1.0,before 2015,0.0,1.0,32765,11000.0,sedan,0,0,0,0.0
4,26-39,male,majority,10-19y,none,working class,0.388366,1.0,before 2015,0.0,0.0,32765,12000.0,sedan,2,0,1,1.0


## Feature Selection

1. Convert all object data types to numerical (Label Encoding)
2. Perform Feature selection method.

In [23]:
from sklearn.preprocessing import LabelEncoder

In [24]:
CleanData['POSTAL_CODE'] = CleanData['POSTAL_CODE'].astype(str)

In [25]:
categorical_features = [feature for feature in CleanData.columns if CleanData[feature].dtypes == 'O']
label_encoder = LabelEncoder()

for feature in categorical_features:
    trial = CleanData[[feature]]
    print(feature)
    trial[feature+' encode'] = label_encoder.fit_transform(trial[feature].values)
    trial = trial.drop_duplicates(feature)
    print(trial)
    print('-------------------------------------------------------------')
    print()

AGE
     AGE  AGE encode
0    65+           3
1  16-25           0
4  26-39           1
5  40-64           2
-------------------------------------------------------------

GENDER
   GENDER  GENDER encode
0  female              0
1    male              1
-------------------------------------------------------------

RACE
        RACE  RACE encode
0   majority            0
43  minority            1
-------------------------------------------------------------

DRIVING_EXPERIENCE
  DRIVING_EXPERIENCE  DRIVING_EXPERIENCE encode
0               0-9y                          0
4             10-19y                          1
5             20-29y                          2
6               30y+                          3
-------------------------------------------------------------

EDUCATION
     EDUCATION  EDUCATION encode
0  high school                 0
1         none                 1
3   university                 2
-------------------------------------------------------------

INCOME
   

In [26]:
### Copy Data for Further use
label = CleanData.copy()
significance = CleanData.copy()

In [27]:
### Creating for-Loops for encoding manually
for i in range(len(label)):
    if label['EDUCATION'][i] == 'high school':
        label['EDUCATION'][i] = 1
    elif label['EDUCATION'][i] == 'none':
        label['EDUCATION'][i] = 0
    else:
        label['EDUCATION'][i] = 2
label['EDUCATION'] = label['EDUCATION'].astype(int)
    
for i in range(len(label)):
    if label['RACE'][i] == 'majority':
        label['RACE'][i] = 1
    else:
        label['RACE'][i] = 0
label['RACE'] = label['RACE'].astype(int)

for i in range(len(label)):
    if label['INCOME'][i] == 'upper class':
        label['INCOME'][i] = 3
    elif label['INCOME'][i] == 'poverty':
        label['INCOME'][i] = 0
    elif label['INCOME'][i] == 'working class':
        label['INCOME'][i] = 1
    else:
        label['INCOME'][i] = 2
label['INCOME'] = label['INCOME'].astype(int)

for i in range(len(label)):
    if label['VEHICLE_YEAR'][i] == 'after 2015':
        label['VEHICLE_YEAR'][i] = 1
    else:
        label['VEHICLE_YEAR'][i] = 0
label['VEHICLE_YEAR'] = label['VEHICLE_YEAR'].astype(int)
        
label.head(3)

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,65+,female,1,0-9y,1,3,0.629027,1.0,1,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,16-25,male,1,0-9y,0,0,0.357757,0.0,0,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,16-25,female,1,0-9y,1,1,0.493146,1.0,0,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0


In [28]:
categorical_features = [feature for feature in label.columns if (label[feature].dtypes == 'O') & (feature != 'POSTAL_CODE') & (feature != 'VEHICLE_TYPE')]
categorical_features

['AGE', 'GENDER', 'DRIVING_EXPERIENCE']

In [29]:
for feature in categorical_features:
    ## Encode labels in all Categorical Columns.
    label[feature]= label_encoder.fit_transform(label[feature])

label.head()

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,3,0,1,0,1,3,0.629027,1.0,1,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,0,1,1,0,0,0,0.357757,0.0,0,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,0,0,1,0,1,1,0.493146,1.0,0,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0
3,0,1,1,0,2,1,0.206013,1.0,0,0.0,1.0,32765,11000.0,sedan,0,0,0,0.0
4,1,1,1,1,0,1,0.388366,1.0,0,0.0,0.0,32765,12000.0,sedan,2,0,1,1.0


- We have label encoded the columns that appear to be **Ordinal** such as 'EDUCATION' or 'AGE'.
- One-Hot Encoding is performed for those columns that are **Not Ordinal**.

**REFERENCE:** https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/#:~:text=When%20to%20use%20a,to%20high%20memory%20consumption

In [30]:
onehot = label.copy()

In [31]:
IndependentO = onehot.drop(columns= 'OUTCOME')
DependentO = onehot[['OUTCOME']].astype(int)

In [32]:
one = pd.get_dummies(IndependentO, drop_first= True)

#### Mutual Information Gain

In [33]:
from sklearn.feature_selection import mutual_info_classif

In [34]:
Independent = one.copy()
Dependent = DependentO.copy()

In [35]:
# determine the mutual information
mutual_info = mutual_info_classif(Independent, Dependent, random_state= 2)
mutual_info

array([0.12473312, 0.00156013, 0.        , 0.15630951, 0.01668744,
       0.08861167, 0.06247536, 0.0747301 , 0.04834029, 0.03169633,
       0.02094903, 0.01256163, 0.07804794, 0.02295744, 0.07411648,
       0.01780909, 0.        , 0.        , 0.        ])

In [36]:
mutual_info = pd.DataFrame(mutual_info)
mutual_info.index = Independent.columns
mutual_info.rename(columns = {0:'Info_Score'}, inplace= True)
mutual_info.sort_values(ascending=False, by= 'Info_Score')

Unnamed: 0,Info_Score
DRIVING_EXPERIENCE,0.15631
AGE,0.124733
INCOME,0.088612
SPEEDING_VIOLATIONS,0.078048
VEHICLE_OWNERSHIP,0.07473
PAST_ACCIDENTS,0.074116
CREDIT_SCORE,0.062475
VEHICLE_YEAR,0.04834
MARRIED,0.031696
DUIS,0.022957


In [37]:
selected = mutual_info[mutual_info.Info_Score > 0.02].index
selected

Index(['AGE', 'DRIVING_EXPERIENCE', 'INCOME', 'CREDIT_SCORE',
       'VEHICLE_OWNERSHIP', 'VEHICLE_YEAR', 'MARRIED', 'CHILDREN',
       'SPEEDING_VIOLATIONS', 'DUIS', 'PAST_ACCIDENTS'],
      dtype='object')

#### Feature Selection (Using Logistic Regression)

In [41]:
import statsmodels.api as sm
logit_model= sm.Logit(Dependent, Independent)
result= logit_model.fit()

print(result.summary())

         Current function value: 0.328240
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:                OUTCOME   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9981
Method:                           MLE   Df Model:                           18
Date:                Fri, 29 Apr 2022   Pseudo R-squ.:                  0.4720
Time:                        13:22:24   Log-Likelihood:                -3282.4
converged:                      False   LL-Null:                       -6217.2
Covariance Type:            nonrobust   LLR p-value:                     0.000
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
AGE                         0.0095      0.047      0.200      0.841      -0.083       0.102
GENDER                    



### Statistical Significance

#### Chi-2 Test --> Categorial Variables

- **Null Hypothesis (H0):** There is no relationship between the variables
- **Alternative Hypothesis (H1)**: There is a relationship between variables (Significant)

In [42]:
significance.head()

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,65+,female,majority,0-9y,high school,upper class,0.629027,1.0,after 2015,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,16-25,male,majority,0-9y,none,poverty,0.357757,0.0,before 2015,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,16-25,female,majority,0-9y,high school,working class,0.493146,1.0,before 2015,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0
3,16-25,male,majority,0-9y,university,working class,0.206013,1.0,before 2015,0.0,1.0,32765,11000.0,sedan,0,0,0,0.0
4,26-39,male,majority,10-19y,none,working class,0.388366,1.0,before 2015,0.0,0.0,32765,12000.0,sedan,2,0,1,1.0


In [43]:
categorical_features = [feature for feature in significance.columns if significance[feature].dtypes == 'O']

In [44]:
# Import the function
from scipy.stats import chi2_contingency

chi2_check = []
for i in categorical_features:
    if chi2_contingency(pd.crosstab(significance['OUTCOME'], significance[i]))[1] < 0.05:
        chi2_check.append('Reject Null Hypothesis (Significant)')
    else:
        chi2_check.append('Fail to Reject Null Hypothesis (Not Significant)')
res = pd.DataFrame(data = [categorical_features, chi2_check] 
             ).T 
res.columns = ['Column', 'Hypothesis']
print(res)

               Column                                        Hypothesis
0                 AGE              Reject Null Hypothesis (Significant)
1              GENDER              Reject Null Hypothesis (Significant)
2                RACE  Fail to Reject Null Hypothesis (Not Significant)
3  DRIVING_EXPERIENCE              Reject Null Hypothesis (Significant)
4           EDUCATION              Reject Null Hypothesis (Significant)
5              INCOME              Reject Null Hypothesis (Significant)
6        VEHICLE_YEAR              Reject Null Hypothesis (Significant)
7         POSTAL_CODE              Reject Null Hypothesis (Significant)
8        VEHICLE_TYPE  Fail to Reject Null Hypothesis (Not Significant)


**INFERENCE**

- Here we can see that the Vehicle Type and the Race are both not significant with the Target variable.
- This alligns with out Mutual Information Method.

#### p-value --> Nmerical Variables

In [45]:
numerical_columns = [feature for feature in significance.columns if significance[feature].dtypes != 'O']
numerical_columns

['CREDIT_SCORE',
 'VEHICLE_OWNERSHIP',
 'MARRIED',
 'CHILDREN',
 'ANNUAL_MILEAGE',
 'SPEEDING_VIOLATIONS',
 'DUIS',
 'PAST_ACCIDENTS',
 'OUTCOME']

In [46]:
import statsmodels.api as sm
for i in numerical_columns[:-1]:
    result = sm.OLS(significance[numerical_columns[-1]], significance[i]).fit()
    p_values = result.summary2().tables[1]['P>|t|']
    
    if (p_values[0]) < 0.05:
        print(i, '-->', 'Reject Null Hypothesis (Significant)')
    else:
        print(i, '-->', 'Fail to Reject Null Hypothesis (Not Significant)')

CREDIT_SCORE --> Reject Null Hypothesis (Significant)
VEHICLE_OWNERSHIP --> Reject Null Hypothesis (Significant)
MARRIED --> Reject Null Hypothesis (Significant)
CHILDREN --> Reject Null Hypothesis (Significant)
ANNUAL_MILEAGE --> Reject Null Hypothesis (Significant)
SPEEDING_VIOLATIONS --> Reject Null Hypothesis (Significant)
DUIS --> Reject Null Hypothesis (Significant)
PAST_ACCIDENTS --> Reject Null Hypothesis (Significant)


In [None]:
# result = sm.OLS(significance[numerical_columns[-1]], significance[numerical_columns[0]]).fit()
# print(result.summary2())

## Feature Encoding

In [47]:
CleanData.head(3)

Unnamed: 0,AGE,GENDER,RACE,DRIVING_EXPERIENCE,EDUCATION,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,POSTAL_CODE,ANNUAL_MILEAGE,VEHICLE_TYPE,SPEEDING_VIOLATIONS,DUIS,PAST_ACCIDENTS,OUTCOME
0,65+,female,majority,0-9y,high school,upper class,0.629027,1.0,after 2015,0.0,1.0,10238,12000.0,sedan,0,0,0,0.0
1,16-25,male,majority,0-9y,none,poverty,0.357757,0.0,before 2015,0.0,0.0,10238,16000.0,sedan,0,0,0,1.0
2,16-25,female,majority,0-9y,high school,working class,0.493146,1.0,before 2015,0.0,0.0,10238,11000.0,sedan,0,0,0,0.0


#### Selecting the Columns for Encoding

In [52]:
### Split the data into Dependent and Independent Features
Independent = CleanData.drop(columns= ['OUTCOME', 'RACE', 'VEHICLE_TYPE', 'POSTAL_CODE', 'SPEEDING_VIOLATIONS', 
                                       'DUIS', 'EDUCATION', 'AGE'])
Dependent = CleanData[['OUTCOME']]

In [53]:
Independent.columns

Index(['GENDER', 'DRIVING_EXPERIENCE', 'INCOME', 'CREDIT_SCORE',
       'VEHICLE_OWNERSHIP', 'VEHICLE_YEAR', 'MARRIED', 'CHILDREN',
       'ANNUAL_MILEAGE', 'PAST_ACCIDENTS'],
      dtype='object')

In [54]:
### Creating for-Loops for encoding manually  
for i in range(len(Independent)):
    if Independent['INCOME'][i] == 'upper class':
        Independent['INCOME'][i] = 3
    elif Independent['INCOME'][i] == 'poverty':
        Independent['INCOME'][i] = 0
    elif Independent['INCOME'][i] == 'working class':
        Independent['INCOME'][i] = 1
    else:
        Independent['INCOME'][i] = 2
Independent['INCOME'] = Independent['INCOME'].astype(int)

for i in range(len(Independent)):
    if Independent['VEHICLE_YEAR'][i] == 'after 2015':
        Independent['VEHICLE_YEAR'][i] = 1
    else:
        Independent['VEHICLE_YEAR'][i] = 0
Independent['VEHICLE_YEAR'] = Independent['VEHICLE_YEAR'].astype(int)
        
Independent.head(3)

Unnamed: 0,GENDER,DRIVING_EXPERIENCE,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,ANNUAL_MILEAGE,PAST_ACCIDENTS
0,female,0-9y,3,0.629027,1.0,1,0.0,1.0,12000.0,0
1,male,0-9y,0,0.357757,0.0,0,0.0,0.0,16000.0,0
2,female,0-9y,1,0.493146,1.0,0,0.0,0.0,11000.0,0


In [55]:
categorical_features_enc = [feature for feature in Independent.columns if (Independent[feature].dtypes == 'O')]
categorical_features_enc

['GENDER', 'DRIVING_EXPERIENCE']

In [56]:
for feature in categorical_features_enc:
    ## Encode labels in all Categorical Columns.
    Independent[feature]= label_encoder.fit_transform(Independent[feature])

Independent.head()

Unnamed: 0,GENDER,DRIVING_EXPERIENCE,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,ANNUAL_MILEAGE,PAST_ACCIDENTS
0,0,0,3,0.629027,1.0,1,0.0,1.0,12000.0,0
1,1,0,0,0.357757,0.0,0,0.0,0.0,16000.0,0
2,0,0,1,0.493146,1.0,0,0.0,0.0,11000.0,0
3,1,0,1,0.206013,1.0,0,0.0,1.0,11000.0,0
4,1,1,1,0.388366,1.0,0,0.0,0.0,12000.0,1


In [57]:
### We will perform One-Hot Encoding
df_enc = Independent
df_enc['OUTCOME'] = Dependent[['OUTCOME']].astype(int)
df_enc.head()

Unnamed: 0,GENDER,DRIVING_EXPERIENCE,INCOME,CREDIT_SCORE,VEHICLE_OWNERSHIP,VEHICLE_YEAR,MARRIED,CHILDREN,ANNUAL_MILEAGE,PAST_ACCIDENTS,OUTCOME
0,0,0,3,0.629027,1.0,1,0.0,1.0,12000.0,0,0
1,1,0,0,0.357757,0.0,0,0.0,0.0,16000.0,0,1
2,0,0,1,0.493146,1.0,0,0.0,0.0,11000.0,0,0
3,1,0,1,0.206013,1.0,0,0.0,1.0,11000.0,0,0
4,1,1,1,0.388366,1.0,0,0.0,0.0,12000.0,1,1


In [58]:
df_enc.to_csv('Data.csv', index= False)

<h4 align= 'center'><strong><b>Now that we have our encoded dataset ready, we can go forward with model building.</b></strong></h4>