### Predict if a driver will file an insurance claim next year.

### Problem Description
Nothing ruins the thrill of buying a brand new car more quickly than seeing your new insurance bill. The sting’s even more painful when you know you’re a good driver. It doesn’t seem fair that you have to pay so much if you’ve been cautious on the road for years.

Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, completely agrees. Inaccuracies in car insurance companies’ claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.




### Data Description
In this competition, we need to predict the probability that an auto insurance policy holder files a claim.

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.

### Importing necessary libraries



#### Data Wrangling libraries:

numpy: `NumPy` is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays.
pandas: `Pandas` is an open-source library built on top of NumPy. It offers data structures and operations for manipulating numerical data and time series, and it is widely used for importing and analyzing data.

#### Data visualization library:

matplotlib: `Matplotlib` is a low-level graph plotting library in Python that serves as a visualization utility.
seaborn: `Seaborn` is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

#### Data Pre-processing Libraries:
From scikit-learn, `StandardScaler` is imported for scaling the data (`MinMaxScaler` is an alternative scaler). `train_test_split` can be used to divide the data into training, validation and test sets for model building. For an imbalanced dataset, `StratifiedKFold` can be used.

`StratifiedKFold`: It provides train/test indices to split data into train/test sets. This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
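As a minimal sketch of what "preserving the percentage of samples for each class" means, the toy example below splits 96 negatives and 4 positives into 4 folds and counts the positives that land in each validation fold. The label counts, the placeholder feature matrix `X` and the choice of 4 folds are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 96 negatives and 4 positives (illustrative only).
y = np.array([0] * 96 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

positives_per_fold = []
for train_idx, valid_idx in skf.split(X, y):
    # Count how many positive samples ended up in this validation fold.
    positives_per_fold.append(int(y[valid_idx].sum()))

print(positives_per_fold)
```

With a plain `KFold`, some folds could end up with no positive sample at all, which would make a metric such as AUC undefined on those folds.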

`RandomizedSearchCV`: Randomized search on hyperparameters. RandomizedSearchCV implements a “fit” and a “score” method. It also implements “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used. The parameters of the estimator used to apply these methods are optimized by cross-validated search over parameter settings.

In contrast to `GridSearchCV`, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter. If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. It is highly recommended to use continuous distributions for continuous parameters.
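The following sketch illustrates the role of `n_iter` on synthetic data. The dataset from `make_classification`, the parameter ranges and the AUC scoring are assumptions chosen only to keep the example small; they are not the settings used later in this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the real training set.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

param_distributions = {
    "n_estimators": randint(10, 40),  # sampled from a distribution
    "max_depth": [2, 4, 6],           # sampled from a list
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # only 5 parameter settings are sampled and evaluated
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

n_tried = len(search.cv_results_["params"])
print(n_tried, search.best_params_)
```

A full grid over the same ranges would have to fit 30 x 3 settings, while the randomized search stops after the 5 sampled ones.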

#### Machine Learning libraries:

`RandomForestClassifier` is a class from the scikit-learn package for training and fitting a Random Forest model. `XGBClassifier` is the XGBoost class for training and fitting a gradient boosting model. `LightGBM` is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. `CatBoost` is a fast, scalable, high-performance library for gradient boosting on decision trees. `TensorFlow` is an open-source library for machine learning that provides optimizers, data flow graphs, and deep learning algorithms. `Keras` is a high-level API running on top of TensorFlow for building and training deep learning models. These libraries provide machine learning models and algorithms that can be used to predict the output for given input data.



In [1]:
#data wrangling libraries
import numpy as np
import pandas as pd

#data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#machine learning libraries
from collections import OrderedDict
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import lightgbm as lgbm
import catboost
import tensorflow as tf
from tensorflow import keras

### Loading the dataset

`pd.read_csv` is the pandas function for reading datasets from CSV files. The train and test datasets are read into data frames named train and test, and another data frame called submission is read from the file sample_submission.csv. The `index_col=0` argument is used for both the train and test datasets; it tells pandas to use the first column of the CSV file (here `id`) as the row index of the data frame. Rows can then be looked up by their id, and `id` is not treated as a feature column.

In [2]:
train = pd.read_csv('train.csv', index_col=0)
test = pd.read_csv('test.csv', index_col = 0)
submission = pd.read_csv('sample_submission.csv')
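To make the effect of `index_col=0` concrete, here is a small sketch that reads a three-row CSV from memory. The CSV text is made up and only mimics the layout of train.csv.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for train.csv (values are made up).
csv_text = "id,target,ps_ind_01\n7,0,2\n9,0,1\n13,1,5\n"

df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.index.name)         # the first CSV column became the row index
print(list(df.columns))      # 'id' is no longer a regular column
print(df.loc[13, "target"])  # rows are looked up by their id label
```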

The `.head()` function in pandas is used to obtain the first 5 rows of a DataFrame or Series. 

In [3]:
train.head()

id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
7,0,2,2,5,1,0,0,1,0,0,...,9,1,5,8,0,1,1,0,0,1
9,0,1,1,7,0,0,0,0,1,0,...,3,1,1,9,0,1,1,0,1,0
13,0,5,4,9,1,0,0,0,1,0,...,4,2,7,7,0,1,1,0,1,0
16,0,0,1,2,0,0,1,0,0,0,...,2,2,4,9,0,0,0,0,0,0
17,0,0,2,0,1,0,1,0,0,0,...,3,1,1,3,0,0,0,1,1,0


In [4]:
test.head()

id,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0,1,8,1,0,0,1,0,0,0,...,1,1,1,12,0,1,1,0,0,1
1,4,2,5,1,0,0,0,0,1,0,...,2,0,3,10,0,0,1,1,0,1
2,5,1,3,0,0,0,0,0,1,0,...,4,0,2,4,0,0,0,0,0,0
3,0,1,6,0,0,1,0,0,0,0,...,5,1,0,5,1,0,1,0,0,0
4,5,1,7,0,0,0,0,0,1,0,...,4,0,0,4,0,1,1,0,0,1


In [5]:
submission.head()

Unnamed: 0,id,target
0,0,0.0364
1,1,0.0364
2,2,0.0364
3,3,0.0364
4,4,0.0364


### Checking the info of the dataset using .info() method.


The `info()` method prints information about the DataFrame: the number of columns, the column labels, the column data types, the memory usage, the index range, and the number of non-null values in each column.

In [6]:
print("Training Data:")
train.info()
print("..................................")

print("Testing Data:")
test.info()
print("..................................")


print("Submission Data:")
submission.info()
print("..................................")

Training Data:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 595212 entries, 7 to 1488027
Data columns (total 58 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   target          595212 non-null  int64  
 1   ps_ind_01       595212 non-null  int64  
 2   ps_ind_02_cat   595212 non-null  int64  
 3   ps_ind_03       595212 non-null  int64  
 4   ps_ind_04_cat   595212 non-null  int64  
 5   ps_ind_05_cat   595212 non-null  int64  
 6   ps_ind_06_bin   595212 non-null  int64  
 7   ps_ind_07_bin   595212 non-null  int64  
 8   ps_ind_08_bin   595212 non-null  int64  
 9   ps_ind_09_bin   595212 non-null  int64  
 10  ps_ind_10_bin   595212 non-null  int64  
 11  ps_ind_11_bin   595212 non-null  int64  
 12  ps_ind_12_bin   595212 non-null  int64  
 13  ps_ind_13_bin   595212 non-null  int64  
 14  ps_ind_14       595212 non-null  int64  
 15  ps_ind_15       595212 non-null  int64  
 16  ps_ind_16_bin   595212 non-null  int64  

`.describe()` computes summary statistics that characterize the data being analyzed. The method returns a DataFrame with the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum for each numeric column. The `.T` attribute then transposes this result so that each feature is a row and each statistic is a column.

In [7]:
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
target,595212.0,0.036448,0.187401,0.0,0.0,0.0,0.0,1.0
ps_ind_01,595212.0,1.900378,1.983789,0.0,0.0,1.0,3.0,7.0
ps_ind_02_cat,595212.0,1.358943,0.664594,-1.0,1.0,1.0,2.0,4.0
ps_ind_03,595212.0,4.423318,2.699902,0.0,2.0,4.0,6.0,11.0
ps_ind_04_cat,595212.0,0.416794,0.493311,-1.0,0.0,0.0,1.0,1.0
ps_ind_05_cat,595212.0,0.405188,1.350642,-1.0,0.0,0.0,0.0,6.0
ps_ind_06_bin,595212.0,0.393742,0.488579,0.0,0.0,0.0,1.0,1.0
ps_ind_07_bin,595212.0,0.257033,0.436998,0.0,0.0,0.0,1.0,1.0
ps_ind_08_bin,595212.0,0.163921,0.370205,0.0,0.0,0.0,0.0,1.0
ps_ind_09_bin,595212.0,0.185304,0.388544,0.0,0.0,0.0,0.0,1.0


In [8]:
test.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ps_ind_01,892816.0,1.902371,1.986503,0.0,0.0,1.0,3.0,7.0
ps_ind_02_cat,892816.0,1.358613,0.663002,-1.0,1.0,1.0,2.0,4.0
ps_ind_03,892816.0,4.413734,2.700149,0.0,2.0,4.0,6.0,11.0
ps_ind_04_cat,892816.0,0.417361,0.493453,-1.0,0.0,0.0,1.0,1.0
ps_ind_05_cat,892816.0,0.408132,1.355068,-1.0,0.0,0.0,0.0,6.0
ps_ind_06_bin,892816.0,0.393246,0.488471,0.0,0.0,0.0,1.0,1.0
ps_ind_07_bin,892816.0,0.257191,0.437086,0.0,0.0,0.0,1.0,1.0
ps_ind_08_bin,892816.0,0.163659,0.369966,0.0,0.0,0.0,0.0,1.0
ps_ind_09_bin,892816.0,0.185905,0.38903,0.0,0.0,0.0,0.0,1.0
ps_ind_10_bin,892816.0,0.000373,0.019309,0.0,0.0,0.0,0.0,1.0


In [9]:
submission.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,892816.0,744153.461357,429683.0,0.0,372021.75,744307.0,1116308.0,1488026.0
target,892816.0,0.0364,6.387602e-13,0.0364,0.0364,0.0364,0.0364,0.0364


### Dealing with missing values:

Missing values in features are given as -1. So I need to deal with them.
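Because -1 is an ordinary number to pandas, methods such as `isna()` do not see these entries as missing. One possible way to inspect them, sketched below on a made-up frame, is to convert the marker to `NaN` first. The column values are illustrative, and the sketch assumes that -1 never occurs as a legitimate value.

```python
import numpy as np
import pandas as pd

# Toy frame using the same convention: -1 marks a missing value.
toy = pd.DataFrame({
    "ps_ind_02_cat": [1, 2, -1, 1],
    "ps_reg_03": [0.7, -1.0, 0.9, -1.0],
    "ps_ind_06_bin": [0, 1, 1, 0],
})

# Turn the marker into NaN so that isna(), fillna(), etc. recognize it.
toy_nan = toy.replace(-1, np.nan)

missing_counts = toy_nan.isna().sum()
print(missing_counts)
```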

Here, there are three types of variables: binary, categorical and continuous. I first separate out these variables.

First I look at the number of unique values in each variable to make sure the binary, categorical and continuous variables are correctly specified in the dataset:

In [10]:
print("UNIQUE VALUE COUNT\n")

print("\n------------train------------\n")
for i in train.columns:
    print(f'{i}: {train[i].nunique()}')

print("\n------------test------------\n")
for i in test.columns:
    print(f'{i}: {test[i].nunique()}')

UNIQUE VALUE COUNT


------------train------------

target: 2
ps_ind_01: 8
ps_ind_02_cat: 5
ps_ind_03: 12
ps_ind_04_cat: 3
ps_ind_05_cat: 8
ps_ind_06_bin: 2
ps_ind_07_bin: 2
ps_ind_08_bin: 2
ps_ind_09_bin: 2
ps_ind_10_bin: 2
ps_ind_11_bin: 2
ps_ind_12_bin: 2
ps_ind_13_bin: 2
ps_ind_14: 5
ps_ind_15: 14
ps_ind_16_bin: 2
ps_ind_17_bin: 2
ps_ind_18_bin: 2
ps_reg_01: 10
ps_reg_02: 19
ps_reg_03: 5013
ps_car_01_cat: 13
ps_car_02_cat: 3
ps_car_03_cat: 3
ps_car_04_cat: 10
ps_car_05_cat: 3
ps_car_06_cat: 18
ps_car_07_cat: 3
ps_car_08_cat: 2
ps_car_09_cat: 6
ps_car_10_cat: 3
ps_car_11_cat: 104
ps_car_11: 5
ps_car_12: 184
ps_car_13: 70482
ps_car_14: 850
ps_car_15: 15
ps_calc_01: 10
ps_calc_02: 10
ps_calc_03: 10
ps_calc_04: 6
ps_calc_05: 7
ps_calc_06: 11
ps_calc_07: 10
ps_calc_08: 11
ps_calc_09: 8
ps_calc_10: 26
ps_calc_11: 20
ps_calc_12: 11
ps_calc_13: 14
ps_calc_14: 24
ps_calc_15_bin: 2
ps_calc_16_bin: 2
ps_calc_17_bin: 2
ps_calc_18_bin: 2
ps_calc_19_bin: 2
ps_calc_20_bin: 2

------------test----

### Separating out the datatypes

Here we create three empty lists to store the names of variables classified according to their data types. The for loop then iterates over the columns of the `train` dataset and checks whether each column name contains the string `'_bin'`, `'_cat'`, or neither. If the name contains `'_bin'`, it is appended to the list `bin_vars`; if it contains `'_cat'`, it is appended to `cat_vars`; otherwise it is appended to `cont_vars`. Finally, the target label is removed from `cont_vars`, and all three lists are printed to show the type of each variable.

`Binary variables (bin_vars)` are variables that can only take two values, such as yes/no, true/false, or 1/0.

`Continuous variables (cont_vars)` are variables that can take any value within a given range. Examples include height, weight, and age. In this dataset the list also holds the ordinal features, since features without a `bin` or `cat` postfix are either continuous or ordinal.

`Categorical variables (cat_vars)` are variables that can take on one of a limited number of values.

In [11]:
bin_vars = [] ## list to store names of binary variables
cat_vars = [] ## list to store names of categorical variables
cont_vars = [] ## list to store names of continuous variables

for i in train.columns:
    if "_bin" in i:
        bin_vars.append(i)
    elif "_cat" in i:
        cat_vars.append(i)
    else:
        cont_vars.append(i)

cont_vars.remove('target')

print("Binary Variables:\n")
print(bin_vars)
print('\n')

print("Categorical Variables:\n")
print(cat_vars)
print('\n')

print("Continuous Variables:\n")
print(cont_vars)
print('\n')

Binary Variables:

['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_calc_15_bin', 'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin', 'ps_calc_20_bin']


Categorical Variables:

['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat']


Continuous Variables:

['ps_ind_01', 'ps_ind_03', 'ps_ind_14', 'ps_ind_15', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03', 'ps_calc_04', 'ps_calc_05', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11', 'ps_calc_12', 'ps_calc_13', 'ps_calc_14']




Now I check if any of the binary variables have -1 as a value. To do this, I check the unique values of each of the binary variables:

In [12]:
print("UNIQUE VALUES OF BINARY VARIABLES:\n")

print("\n---------train-----------\n")
for i in bin_vars:
    print(f'Unique values of {i}: {train[i].unique()}')

print("\n---------test-----------\n")
for i in bin_vars:
    print(f'Unique values of {i}: {test[i].unique()}')

UNIQUE VALUES OF BINARY VARIABLES:


---------train-----------

Unique values of ps_ind_06_bin: [0 1]
Unique values of ps_ind_07_bin: [1 0]
Unique values of ps_ind_08_bin: [0 1]
Unique values of ps_ind_09_bin: [0 1]
Unique values of ps_ind_10_bin: [0 1]
Unique values of ps_ind_11_bin: [0 1]
Unique values of ps_ind_12_bin: [0 1]
Unique values of ps_ind_13_bin: [0 1]
Unique values of ps_ind_16_bin: [0 1]
Unique values of ps_ind_17_bin: [1 0]
Unique values of ps_ind_18_bin: [0 1]
Unique values of ps_calc_15_bin: [0 1]
Unique values of ps_calc_16_bin: [1 0]
Unique values of ps_calc_17_bin: [1 0]
Unique values of ps_calc_18_bin: [0 1]
Unique values of ps_calc_19_bin: [0 1]
Unique values of ps_calc_20_bin: [1 0]

---------test-----------

Unique values of ps_ind_06_bin: [0 1]
Unique values of ps_ind_07_bin: [1 0]
Unique values of ps_ind_08_bin: [0 1]
Unique values of ps_ind_09_bin: [0 1]
Unique values of ps_ind_10_bin: [0 1]
Unique values of ps_ind_11_bin: [0 1]
Unique values of ps_ind_12_bi

* Thus, there are no missing values in the binary columns.

Next, I check the same for the categorical columns:

In [13]:
print("UNIQUE VALUES OF CATEGORICAL VARIABLES:\n")

print("\n---------train-----------\n")
for i in cat_vars:
    print(f'Unique values of {i}: {train[i].unique()}')

print("\n---------test-----------\n")
for i in cat_vars:
    print(f'Unique values of {i}: {test[i].unique()}')

UNIQUE VALUES OF CATEGORICAL VARIABLES:


---------train-----------

Unique values of ps_ind_02_cat: [ 2  1  4  3 -1]
Unique values of ps_ind_04_cat: [ 1  0 -1]
Unique values of ps_ind_05_cat: [ 0  1  4  3  6  5 -1  2]
Unique values of ps_car_01_cat: [10 11  7  6  9  5  4  8  3  0  2  1 -1]
Unique values of ps_car_02_cat: [ 1  0 -1]
Unique values of ps_car_03_cat: [-1  0  1]
Unique values of ps_car_04_cat: [0 1 8 9 2 6 3 7 4 5]
Unique values of ps_car_05_cat: [ 1 -1  0]
Unique values of ps_car_06_cat: [ 4 11 14 13  6 15  3  0  1 10 12  9 17  7  8  5  2 16]
Unique values of ps_car_07_cat: [ 1 -1  0]
Unique values of ps_car_08_cat: [0 1]
Unique values of ps_car_09_cat: [ 0  2  3  1 -1  4]
Unique values of ps_car_10_cat: [1 0 2]
Unique values of ps_car_11_cat: [ 12  19  60 104  82  99  30  68  20  36 101 103  41  59  43  64  29  95
  24   5  28  87  66  10  26  54  32  38  83  89  49  93   1  22  85  78
  31  34   7   8   3  46  27  25  61  16  69  40  76  39  88  42  75  91
  23   2  71 

* Missing values are present in most of the categorical columns, so we need to deal with them. First we find the fraction of missing values in each categorical column. This shows which columns are missing so often that they carry little signal; we can then drop those columns accordingly.

We define a function named `percent_missing` which takes two arguments: `col` and `df`. The function returns the fraction (between 0 and 1) of missing values, i.e. values equal to -1, in the column `col` of the dataframe `df`.

In [14]:
def percent_missing(col, df):
    # fraction (0 to 1) of entries in df[col] equal to the missing-value marker -1
    return (df[col] == -1).mean()

In [15]:
print("PERCENTAGE OF MISSING VALUES IN CATEGORICAL COLUMNS")

print("\n-------train--------\n")
for i in cat_vars:
    if -1 in train[i].unique():
        print(f"Fraction of missing values in {i}: {percent_missing(i, train)}")
    else:
        print(f"Fraction of missing values in {i}: No missing values")

print("\n-------test--------\n")
for i in cat_vars:
    if -1 in test[i].unique():
        print(f"Fraction of missing values in {i}: {percent_missing(i, test)}")
    else:
        print(f"Fraction of missing values in {i}: No missing values")

PERCENTAGE OF MISSING VALUES IN CATEGORICAL COLUMNS

-------train--------

Fraction of missing values in ps_ind_02_cat: 0.0003628959093566662
Fraction of missing values in ps_ind_04_cat: 0.00013944611331760784
Fraction of missing values in ps_ind_05_cat: 0.00975954785857812
Fraction of missing values in ps_car_01_cat: 0.0001797678810239041
Fraction of missing values in ps_car_02_cat: 8.400368272145051e-06
Fraction of missing values in ps_car_03_cat: 0.6908983689844963
Fraction of missing values in ps_car_04_cat: No missing values
Fraction of missing values in ps_car_05_cat: 0.4478253126617071
Fraction of missing values in ps_car_06_cat: No missing values
Fraction of missing values in ps_car_07_cat: 0.019302366215734897
Fraction of missing values in ps_car_08_cat: No missing values
Fraction of missing values in ps_car_09_cat: 0.0009559619093701067
Fraction of missing values in ps_car_10_cat: No missing values
Fraction of missing values in ps_car_11_cat: No missing values

-------test---

The columns `ps_car_03_cat` and `ps_car_05_cat` have the highest fractions of missing values (about 69% and 45% in the train data). So, it is reasonable to drop them from both the train and test datasets.

In [16]:
train.drop(['ps_car_03_cat','ps_car_05_cat'], axis=1, inplace=True)
test.drop(['ps_car_03_cat','ps_car_05_cat'], axis=1, inplace=True)

cat_vars.remove('ps_car_03_cat')
cat_vars.remove('ps_car_05_cat')
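The columns to drop can also be selected programmatically instead of being typed by hand. The sketch below does this on a made-up frame; the cut-off `MISSING_THRESHOLD = 0.4` is an assumed value chosen for illustration, not one stated in this notebook.

```python
import pandas as pd

# Toy data; -1 marks a missing value, as in the competition files.
toy = pd.DataFrame({
    "a_cat": [0, 1, 1, 0, 1],
    "b_cat": [-1, -1, -1, 0, 1],
    "c_cat": [2, -1, 0, 1, 2],
})

MISSING_THRESHOLD = 0.4  # assumed cut-off, not taken from the notebook

# Fraction of -1 entries per column, computed for all columns at once.
missing_fraction = (toy == -1).mean()

# Keep only the names of columns whose missing fraction exceeds the cut-off.
to_drop = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()
toy_reduced = toy.drop(columns=to_drop)

print(to_drop)
```

Deriving the list from the train data and applying the same list to the test data keeps both frames with identical columns.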

Now we use imputation to fill in the missing values of the remaining categorical variables. We use the mode of each categorical column as the imputed value.

`impute_by_mode`: a function that replaces missing values in a dataframe, `df`, with the mode (most frequent value) of each column in a given list of columns, `cols`. Every value equal to -1 in those columns is replaced with the mode of its column.

In [17]:
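def impute_by_mode(cols, df):
    for i in cols:
        # mode of the observed values only, so the -1 marker is never the fill value
        mode = df.loc[df[i] != -1, i].mode()[0]
        # column-wise replacement: avoids per-row chained assignment and updates df itself
        df[i] = df[i].replace(-1, mode)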
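An alternative to a hand-written helper is scikit-learn's `SimpleImputer`, which accepts a custom missing-value marker. The sketch below is an assumption about how it could be applied here, not the approach this notebook takes: it learns the modes on a toy train frame and reuses them on a toy test frame, whereas the notebook imputes each dataset with its own mode. The frames `train_toy` and `test_toy` are made up.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the train and test frames; -1 marks a missing value.
train_toy = pd.DataFrame({"ps_ind_04_cat": [0, 0, 1, -1],
                          "ps_car_02_cat": [1, 1, 0, 1]})
test_toy = pd.DataFrame({"ps_ind_04_cat": [1, -1, 1, 1],
                         "ps_car_02_cat": [-1, 0, 0, 0]})

# Learn the most frequent value of each column from the train data only.
imputer = SimpleImputer(missing_values=-1, strategy="most_frequent")
imputer.fit(train_toy)

# Fill the test data with the modes learned on train.
test_filled = pd.DataFrame(imputer.transform(test_toy),
                           columns=test_toy.columns,
                           index=test_toy.index)
print(test_filled)
```

Reusing the train statistics on the test set is a common convention because it keeps the preprocessing identical at training and prediction time.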

In [18]:
print("Value counts for train data:\n")
for i in cat_vars:
    print(f'Value counts for {i}\n')
    print(train[i].value_counts())
    print('\n--------------\n')

Value counts for train data:

Value counts for ps_ind_02_cat

 1    431859
 2    123573
 3     28186
 4     11378
-1       216
Name: ps_ind_02_cat, dtype: int64

--------------

Value counts for ps_ind_04_cat

 0    346965
 1    248164
-1        83
Name: ps_ind_04_cat, dtype: int64

--------------

Value counts for ps_ind_05_cat

 0    528009
 6     20662
 4     18344
 1      8322
 3      8233
-1      5809
 2      4184
 5      1649
Name: ps_ind_05_cat, dtype: int64

--------------

Value counts for ps_car_01_cat

 11    207573
 7     179247
 6      62393
 10     50087
 4      26174
 9      20323
 5      18142
 8      15093
 3       6658
 0       5904
 2       2144
 1       1367
-1        107
Name: ps_car_01_cat, dtype: int64

--------------

Value counts for ps_car_02_cat

 1    493990
 0    101217
-1         5
Name: ps_car_02_cat, dtype: int64

--------------

Value counts for ps_car_04_cat

0    496581
1     32115
2     23770
8     20598
9     19034
6      1560
3       640
5       54

In [19]:
impute_by_mode(cat_vars, train)

print("Value counts for train data after imputation:\n")
for i in cat_vars:
    print(f'Value counts for {i}\n')
    print(train[i].value_counts())
    print('\n--------------\n')


Value counts for train data after imputation:

Value counts for ps_ind_02_cat

1    432075
2    123573
3     28186
4     11378
Name: ps_ind_02_cat, dtype: int64

--------------

Value counts for ps_ind_04_cat

0    347048
1    248164
Name: ps_ind_04_cat, dtype: int64

--------------

Value counts for ps_ind_05_cat

0    533818
6     20662
4     18344
1      8322
3      8233
2      4184
5      1649
Name: ps_ind_05_cat, dtype: int64

--------------

Value counts for ps_car_01_cat

11    207680
7     179247
6      62393
10     50087
4      26174
9      20323
5      18142
8      15093
3       6658
0       5904
2       2144
1       1367
Name: ps_car_01_cat, dtype: int64

--------------

Value counts for ps_car_02_cat

1    493995
0    101217
Name: ps_car_02_cat, dtype: int64

--------------

Value counts for ps_car_04_cat

0    496581
1     32115
2     23770
8     20598
9     19034
6      1560
3       640
5       545
4       230
7       139
Name: ps_car_04_cat, dtype: int64

--------------


In [20]:
print("Value counts for test data:\n")
for i in cat_vars:
    print(f'Value counts for {i}\n')
    print(test[i].value_counts())
    print('\n--------------\n')

Value counts for test data:

Value counts for ps_ind_02_cat

 1    647468
 2    186174
 3     41986
 4     16881
-1       307
Name: ps_ind_02_cat, dtype: int64

--------------

Value counts for ps_ind_04_cat

 0    519899
 1    372772
-1       145
Name: ps_ind_04_cat, dtype: int64

--------------

Value counts for ps_ind_05_cat

 0    791403
 6     31215
 4     27362
 3     12521
 1     12415
-1      8710
 2      6523
 5      2667
Name: ps_ind_05_cat, dtype: int64

--------------

Value counts for ps_car_01_cat

 11    311152
 7     270370
 6      93386
 10     74500
 4      39546
 9      30178
 5      26940
 8      22510
 3      10024
 0       8940
 2       3098
 1       2012
-1        160
Name: ps_car_01_cat, dtype: int64

--------------

Value counts for ps_car_02_cat

 1    740989
 0    151822
-1         5
Name: ps_car_02_cat, dtype: int64

--------------

Value counts for ps_car_04_cat

0    744753
1     48446
2     35318
8     30613
9     28823
6      2377
3      1073
5       785

In [21]:
impute_by_mode(cat_vars, test)

print("Value counts for test data after imputation:\n")
for i in cat_vars:
    print(f'Value counts for {i}\n')
    print(test[i].value_counts())
    print('\n--------------\n')

Value counts for test data after imputation:

Value counts for ps_ind_02_cat

1    647775
2    186174
3     41986
4     16881
Name: ps_ind_02_cat, dtype: int64

--------------

Value counts for ps_ind_04_cat

0    520044
1    372772
Name: ps_ind_04_cat, dtype: int64

--------------

Value counts for ps_ind_05_cat

0    800113
6     31215
4     27362
3     12521
1     12415
2      6523
5      2667
Name: ps_ind_05_cat, dtype: int64

--------------

Value counts for ps_car_01_cat

11    311312
7     270370
6      93386
10     74500
4      39546
9      30178
5      26940
8      22510
3      10024
0       8940
2       3098
1       2012
Name: ps_car_01_cat, dtype: int64

--------------

Value counts for ps_car_02_cat

1    740994
0    151822
Name: ps_car_02_cat, dtype: int64

--------------

Value counts for ps_car_04_cat

0    744753
1     48446
2     35318
8     30613
9     28823
6      2377
3      1073
5       785
4       397
7       231
Name: ps_car_04_cat, dtype: int64

--------------



Checking whether all the missing values in categorical columns have been dealt with or not:

In [22]:
print("PERCENTAGE OF MISSING VALUES IN CATEGORICAL COLUMNS")

print("\n-------train--------\n")
for i in cat_vars:
    if -1 in train[i].unique():
        print(f"Fraction of missing values in {i}: {percent_missing(i, train)}")
    else:
        print(f"Fraction of missing values in {i}: No missing values")

print("\n-------test--------\n")
for i in cat_vars:
    if -1 in test[i].unique():
        print(f"Fraction of missing values in {i}: {percent_missing(i, test)}")
    else:
        print(f"Fraction of missing values in {i}: No missing values")

PERCENTAGE OF MISSING VALUES IN CATEGORICAL COLUMNS

-------train--------

Fraction of missing values in ps_ind_02_cat: No missing values
Fraction of missing values in ps_ind_04_cat: No missing values
Fraction of missing values in ps_ind_05_cat: No missing values
Fraction of missing values in ps_car_01_cat: No missing values
Fraction of missing values in ps_car_02_cat: No missing values
Fraction of missing values in ps_car_04_cat: No missing values
Fraction of missing values in ps_car_06_cat: No missing values
Fraction of missing values in ps_car_07_cat: No missing values
Fraction of missing values in ps_car_08_cat: No missing values
Fraction of missing values in ps_car_09_cat: No missing values
Fraction of missing values in ps_car_10_cat: No missing values
Fraction of missing values in ps_car_11_cat: No missing values

-------test--------

Fraction of missing values in ps_ind_02_cat: No missing values
Fraction of missing values in ps_ind_04_cat: No missing values
Fraction of missing v

### Now checking for continuous columns

In [23]:
print("PERCENTAGE OF MISSING VALUES IN CONTINUOUS COLUMNS")

print("\n-------train--------\n")
for i in cont_vars:
    if -1 in train[i].unique():
        print(f"Fraction of missing values in {i}: {percent_missing(i, train)}")
    else:
        print(f"Fraction of missing values in {i}: No missing values")

print("\n-------test--------\n")
for i in cont_vars:
    if -1 in test[i].unique():
        print(f"Fraction of missing values in {i}: {percent_missing(i, test)}")
    else:
        print(f"Fraction of missing values in {i}: No missing values")

PERCENTAGE OF MISSING VALUES IN CONTINUOUS COLUMNS

-------train--------

Fraction of missing values in ps_ind_01: No missing values
Fraction of missing values in ps_ind_03: No missing values
Fraction of missing values in ps_ind_14: No missing values
Fraction of missing values in ps_ind_15: No missing values
Fraction of missing values in ps_reg_01: No missing values
Fraction of missing values in ps_reg_02: No missing values
Fraction of missing values in ps_reg_03: 0.18106489788512328
Fraction of missing values in ps_car_11: 8.400368272145051e-06
Fraction of missing values in ps_car_12: 1.68007365442901e-06
Fraction of missing values in ps_car_13: No missing values
Fraction of missing values in ps_car_14: 0.07160473915176441
Fraction of missing values in ps_car_15: No missing values
Fraction of missing values in ps_calc_01: No missing values
Fraction of missing values in ps_calc_02: No missing values
Fraction of missing values in ps_calc_03: No missing values
Fraction of missing values 

### Imputing the missing values

In [24]:
def impute_by_mean(cols, df):
    for i in cols:
        missing = df[i] == -1
        if missing.any():
            # mean of the observed values only; counting the -1 markers would bias it
            df[i] = df[i].mask(missing, df.loc[~missing, i].mean())
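When -1 is only a placeholder, a mean taken over the whole column counts the placeholders as real observations. The small sketch below, on a made-up series, compares that naive mean with the mean of the observed values and then fills the gaps with the latter.

```python
import pandas as pd

# Toy continuous column; -1 marks a missing value.
s = pd.Series([0.5, 0.7, -1.0, 0.9, -1.0])

naive_mean = s.mean()              # treats -1 as a real observation
observed_mean = s[s != -1].mean()  # mean of the observed values only

# Replace the markers with the mean of the observed values.
filled = s.mask(s == -1, observed_mean)

print(naive_mean, observed_mean)
print(filled.tolist())
```

The naive mean is pulled far below every observed value, which is why the fill value should be computed once, from the non-missing entries.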

In [None]:
impute_by_mean(cont_vars, train)
impute_by_mean(cont_vars, test)

In [None]:
print("PERCENTAGE OF MISSING VALUES IN CONTINUOUS COLUMNS")

print("\n-------train--------\n")
for i in cont_vars:
    if -1 in train[i].unique():
        print(f"Fraction of missing values in {i}: {percent_missing(i, train)}")
    else:
        print(f"Fraction of missing values in {i}: No missing values")

print("\n-------test--------\n")
for i in cont_vars:
    if -1 in test[i].unique():
        print(f"Fraction of missing values in {i}: {percent_missing(i, test)}")
    else:
        print(f"Fraction of missing values in {i}: No missing values")

### Exploratory Data Analysis:
Let us first look at the distribution of the target variable:

We create a `pie chart` which displays the value counts of the target feature from the train dataset. 

* `plot_data_train`: stores the value counts of the target feature from the train dataset.
* `plot_labels_train`: stores the wedge labels for the two target classes.
* `plt.pie`: the function that creates a pie chart.
* `autopct`: a format string or function used to label the wedges with their numeric value, e.g. ‘%1.2f’.
* `wedgeprops`: a dictionary of properties passed to the wedge objects when the chart is drawn.

In [None]:
plot_data_train = train['target'].value_counts()
plot_labels_train = plot_data_train.rename(index={0:'0: Insurance not claimed', 1:'1: Insurance claimed'}).index
plt.pie(plot_data_train, labels=plot_labels_train, autopct='%1.1f%%', wedgeprops = dict(edgecolor='black'))

* So, the train data is highly imbalanced: only 3.6% of policy holders have filed a claim. We can therefore use stratified K-fold cross-validation while training and validating the models, or apply resampling techniques such as SMOTE, oversampling, or undersampling.
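As a rough illustration of oversampling, the sketch below balances a made-up frame by sampling minority rows with replacement using pandas only. It is not SMOTE, which synthesizes new points and is provided by the separate imbalanced-learn package; the frame `toy` and its 95/5 split are assumptions for the example.

```python
import pandas as pd

# Toy imbalanced frame: 95 negatives and 5 positives.
toy = pd.DataFrame({"x": range(100), "target": [0] * 95 + [1] * 5})

majority = toy[toy["target"] == 0]
minority = toy[toy["target"] == 1]

# Sample minority rows with replacement until both classes have the same size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

# Combine and shuffle so the classes are interleaved.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)

print(balanced["target"].value_counts())
```

Resampling of this kind should be applied to the training folds only; validation and test data should keep the original class ratio so that the scores stay representative.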

Next, we analyze the different variables being used:

In [None]:
print('Features distribution in the data:\n')
print('No. of binary features: ', len(bin_vars))
print('No. of categorical features: ', len(cat_vars))
print('No. of continuous features: ', len(cont_vars))

plt.pie([len(bin_vars),len(cat_vars), len(cont_vars)],
        labels = ['Binary features','Categorical features', 'Continuous features'],
        colors = ['#F8766D', '#00BFC4', '#F6744D'],
        textprops = {'fontsize': 13},
        autopct = '%1.1f%%')

In [None]:
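def eda_by_vars(var_list, data_train, data_test, categorical):
    """
    Plot an exploratory analysis of each variable (column) in var_list.

    Categorical variables: train/test category shares and claim likelihood per category.
    Continuous variables: train/test density and a box plot split by target.
    """
    colors = sns.cubehelix_palette()
    if categorical: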
        
        for i in var_list:
            
            ## xticks:
            xticks_train = list(data_train[i].unique())
            xticks_test = list(data_test[i].unique())
            
            ## merging xticks_train and xticks_test, keeping first-seen order and dropping duplicates
            xticks = list(OrderedDict.fromkeys(xticks_train + xticks_test))
            # sort only when every label consists of digits
            if all(str(item).isdigit() for item in xticks):
                xticks.sort()
            
            # figure, axes
            sns.set_style('darkgrid')
            fig, ax = plt.subplots(1, 2, figsize=(14,5))
            # figure title
            fig.suptitle(i, fontsize=16)
            
            ## train / test differences:
            pct_train = data_train[i].value_counts(normalize=True).reindex(xticks)
            pct_test = data_test[i].value_counts(normalize=True).reindex(xticks)
            
            pct_train.plot(kind = 'bar',
                           align = 'edge',
                           width = -0.5,
                           ax = ax[0],
                           color = colors[0],
                           edgecolor = 'black')
            
            pct_test.plot(kind = 'bar',
                          align = 'edge',
                          width = 0.5,
                          ax = ax[0],
                          color = colors[1],
                          edgecolor = 'black')
            
            ax[0].set_ylabel('Percent', fontsize=12)
            ax[0].legend(['Train', 'Test'])
            
            ## Claim likelihood:
            sns.pointplot(data = data_train,
                          x = i,
                          y = 'target',
                          color = 'gray',
                          ax = ax[1])
            ax[1].tick_params(axis='x', labelsize=8, rotation=90)
            ax[1].set_xlabel('')
            ax[1].set_ylabel('Claim Likelihood', fontsize=12)
            
    else:
        
        for i in var_list:
            
            # figure, axes
            sns.set_style('darkgrid')
            fig, ax = plt.subplots(1, 2, figsize=(24,5))
            # figure title
            fig.suptitle(i, fontsize=16)

            # train / test differences (fill=True replaces the deprecated shade=True)
            sns.kdeplot(data_train[i], fill=True, color=colors[5], ax=ax[0], label='Train', alpha=0.1, linewidth=3)
            sns.kdeplot(data_test[i], fill=True, color=colors[3], ax=ax[0], label='Test', alpha=0.1, linewidth=3)
        
            ax[0].set_xlabel('')
            ax[0].set_ylabel('Density')
            handles, labels = ax[0].get_legend_handles_labels()
            ax[0].legend(handles, labels)

            # boxplot
            sns.boxplot(y=i, x='target', data=data_train, width=0.3, ax=ax[1], palette=colors, linewidth=3)
            ax[1].set_xlabel('Insurance Claim')
            ax[1].set_ylabel('')

### EDA of Binary Variables

In [None]:
print("BINARY VARIABLES")
eda_by_vars(bin_vars, train, test, categorical=True)

### EDA of Categorical Variables

In [None]:
print("CATEGORICAL VARIABLES")
eda_by_vars(cat_vars, train, test, categorical=True)

### EDA of Continuous Variables

In [None]:
print("CONTINUOUS VARIABLES")
eda_by_vars(cont_vars, train, test, categorical = False)

### Correlation

We create a heatmap of the pairwise correlation values for the `train` dataset. The heatmap displays the correlation between each pair of variables as a colored 2D matrix, with the color of each cell indicating the sign and strength of the correlation. Values close to `1` indicate a strong positive correlation, values close to `-1` indicate a strong negative correlation, and values close to `0` indicate little or no linear relationship between the two variables.

In [None]:
plt.figure(figsize = (20,20))
sns.heatmap(train.corr())
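
A large heatmap can be hard to read, so it may help to list the strongest pairwise correlations as a table as well. The sketch below uses a synthetic frame; the columns `a`, `b`, `c` and the result columns `var_1`, `var_2`, `corr` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built from "a", while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

corr = toy.corr()
cols = corr.columns

# Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once.
i, j = np.triu_indices(len(cols), k=1)
pairs = pd.DataFrame({"var_1": cols[i], "var_2": cols[j], "corr": corr.values[i, j]})

# Rank the pairs by absolute correlation, strongest first.
pairs = pairs.sort_values("corr", key=np.abs, ascending=False).reset_index(drop=True)
print(pairs)
```

The same steps could be applied to `train.corr()` to see which variable pairs stand out.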

* We drop the variables `ps_ind_10_bin`, `ps_ind_11_bin`, `ps_ind_12_bin`, `ps_ind_13_bin` and `ps_ind_14` from both datasets and from the variable lists:

In [None]:
train.drop(['ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14'], axis=1, inplace=True)
test.drop(['ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14'], axis=1, inplace=True)

bin_vars.remove('ps_ind_10_bin')
bin_vars.remove('ps_ind_11_bin')
bin_vars.remove('ps_ind_12_bin')
bin_vars.remove('ps_ind_13_bin')
cont_vars.remove('ps_ind_14')

### Data Preparation

In [None]:
X = train.drop(['target'], axis=1)
y = train['target']
X_test = test.copy()

### Modelling:
Since the competition's evaluation metric is the normalized Gini coefficient, we need a function to calculate it.

The function below computes the `Gini coefficient` of a model's predictions against binary labels. It measures how well the predicted scores rank positive cases (labeled 1) above negative cases (labeled 0): a value of 1 means a perfect ranking, and a value of 0 means the ranking is no better than random. For binary labels without tied predictions, it equals `2 * AUC - 1`.

The parameters, helper functions and variables in the code are as follows:

* `y_true`: the ground-truth labels (0 or 1).
* `y_pred`: the scores or probabilities predicted by the model being evaluated.
* `np.asarray()`: converts the input to a NumPy array.
* `np.argsort()`: returns the indices that would sort an array; here it orders the labels by increasing predicted score.
* `ntrue`: the running count of positive labels; after the loop it holds the total number of positives.
* `delta`: the running count of negative labels seen so far, i.e. negatives ranked above the current sample.
* `gini`: accumulates the number of (positive, negative) pairs in which the negative is ranked higher; the last line converts this count into the Gini coefficient.
* `n`: the number of samples being evaluated.

In [None]:
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini
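
The relationship between the Gini coefficient and AUC can be checked numerically. The following is a minimal sketch, assuming binary labels and continuous scores without ties; `gini_from_auc` and `gini_by_pairs` are hypothetical helper names introduced here, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini_from_auc(y_true, y_pred):
    """Normalized Gini via AUC (assumes binary labels with both classes present)."""
    return 2 * roc_auc_score(y_true, y_pred) - 1

def gini_by_pairs(y_true, y_pred):
    """Brute-force check: count (positive, negative) pairs ranked in the wrong order."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_pred[y_true == 1]
    neg = y_pred[y_true == 0]
    # A pair is discordant when the negative gets a higher score than the positive.
    discordant = (neg[None, :] > pos[:, None]).sum()
    return 1 - 2 * discordant / (len(pos) * len(neg))

# Synthetic labels and scores; positives get a small boost so the ranking is informative.
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
p_toy = rng.random(200) + 0.3 * y_toy

print(gini_from_auc(y_toy, p_toy), gini_by_pairs(y_toy, p_toy))
```

Both functions should print the same value. The pairwise version mirrors the counting logic of `eval_gini`, while the AUC version is shorter and relies on scikit-learn.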

Now we create a function that performs stratified K-fold cross-validation on a given dataset. Its parameters and internal variables are:

`X`: The feature variables of the dataset, also known as the independent variables.

`y`: The target variable of the dataset, also known as the dependent variable.

`model`: The machine learning model that is fitted on each training fold and used to predict on the corresponding validation fold.

`skf`: An instance of the `StratifiedKFold` class from the scikit-learn library. It splits the dataset into training and validation folds while preserving the class proportions in each fold.

`train_id`: The indices of the training rows for the current fold.

`val_id`: The indices of the validation rows for the current fold.

`X_train, X_valid`: The feature variables of the training and validation sets respectively.

`y_train, y_valid`: The target variables of the training and validation sets respectively.

`y_pred`: The predicted probabilities of the positive class on the validation set.

`gini_score`: The Gini score used to evaluate the model's performance on the validation set.

`cv_scores`: A list of the Gini scores from each fold of the cross-validation.

`avg_gini_score`: The average Gini score across all folds.

In [None]:
def cross_validate(X, y, model):
    
    skf = StratifiedKFold(n_splits = 5,
                          shuffle = True,
                          random_state = 0)
    cv_scores = []
    
    for fold, (train_id, val_id) in enumerate(skf.split(X, y)):
        X_train, X_valid = X.iloc[train_id], X.iloc[val_id]
        y_train, y_valid = y.iloc[train_id], y.iloc[val_id]
        
        model.fit(X_train, y_train)
        y_pred = model.predict_proba(X_valid)[:, 1]
        gini_score = eval_gini(y_valid, y_pred)
        cv_scores.append(gini_score)
        
        print(f'Fold {fold}: Gini Score = {gini_score}')
        print('\n-----------------------------\n')
        
    avg_gini_score = np.mean(cv_scores)
    print(f'Average Normalized Gini Score: {avg_gini_score}')
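
As a cross-check on the manual loop, the same kind of evaluation can be sketched with scikit-learn's `cross_val_score`, using the identity Gini = 2 * AUC - 1. This is an alternative under stated assumptions and not the notebook's method: the synthetic data and the `LogisticRegression` model are stand-ins chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (illustration only).
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score fits a fresh clone of the model on each training fold
# and scores it on the matching validation fold.
auc_scores = cross_val_score(LogisticRegression(max_iter=1000),
                             X_toy, y_toy, cv=skf, scoring="roc_auc")
gini_scores = 2 * auc_scores - 1

print("Per-fold Gini:", np.round(gini_scores, 4))
print("Average Gini:", round(gini_scores.mean(), 4))
```

Because a clone is fitted per fold, the estimator passed in stays unfitted, which avoids accidentally reusing a model trained on only one fold.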

### Random Forest Classifier

We create a Random Forest Classifier object and cross-validate it.

`n_estimators`: The number of trees in the forest. More trees usually give more stable predictions, with diminishing returns, at the cost of longer training time and higher memory use.

`max_depth`: The maximum number of levels each tree can have. Deeper trees can capture more complex patterns but are more prone to overfitting.

`max_features`: The number of features considered when looking for the best split. Larger values let each tree fit the training data more closely but make the trees more similar to one another, which reduces the benefit of averaging them.

`bootstrap`: Whether each tree is trained on a bootstrap sample, i.e. a sample of the training rows drawn with replacement. If `False`, every tree is trained on the whole dataset.

`random_state`: The seed that controls the randomness of the bootstrap samples and of the features considered at each split. A fixed value makes the fitted model reproducible.

In [None]:
rf_model = RandomForestClassifier(n_estimators = 1500,
                                  max_depth = 9,
                                  max_features = 10,
                                  bootstrap = True,
                                  random_state = 0)
cross_validate(X, y, rf_model)
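
The effect of `random_state` can be illustrated with a small experiment. This is a minimal sketch on synthetic data with a much smaller forest than the one above; `fit_and_predict` is a hypothetical helper introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem (illustration only).
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

def fit_and_predict(seed):
    """Fit a small forest with the given seed and return class-1 probabilities."""
    forest = RandomForestClassifier(n_estimators=50,
                                    max_depth=4,
                                    max_features=3,
                                    bootstrap=True,
                                    random_state=seed)
    forest.fit(X_toy, y_toy)
    return forest.predict_proba(X_toy)[:, 1]

same_seed_a = fit_and_predict(0)
same_seed_b = fit_and_predict(0)
other_seed = fit_and_predict(1)

print("identical with the same seed:", np.allclose(same_seed_a, same_seed_b))
print("max difference with another seed:", np.abs(same_seed_a - other_seed).max())
```

Two fits with the same seed draw the same bootstrap samples and feature subsets, so their predictions match; a different seed generally produces a slightly different forest.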

### XGBoost Classifier

In [None]:
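# NOTE: num_leaves, min_child_samples and subsample_freq are LightGBM parameter
# names that XGBoost does not define, so they are not expected to influence this model.
xgb_model = XGBClassifier(n_estimators = 1500,
                          learning_rate = 0.01,
                          max_bin = 25,
                          num_leaves = 31,
                          min_child_samples = 1500,
                          colsample_bytree = 0.7,
                          subsample_freq = 1,
                          subsample = 0.7,
                          reg_alpha = 1.0,
                          reg_lambda = 1.0,
                          verbosity = 0,
                          random_state = 0)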

cross_validate(X, y, xgb_model)

### Parameters for boosting algorithms:



In [None]:
boost_params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'learning_rate': 0.01,
    'max_bin': 25,
    'num_leaves': 31,
    'min_child_samples': 1500,
    'colsample_bytree': 0.7,
    'subsample_freq': 1,
    'subsample': 0.7,
    'reg_alpha': 1.0,
    'reg_lambda': 1.0,
    'verbosity': 0,
    'random_state': 0,
    'n_estimators': 1500
}

Here we declare a dictionary of parameters for LightGBM, an implementation of Gradient Boosting Decision Trees (GBDT).

The declared parameters are:

`'objective': 'binary'`: Trains a binary classifier.

`'boosting_type': 'gbdt'`: Sets the boosting type to Gradient Boosting Decision Trees.

`'learning_rate': 0.01`: The learning rate for boosting. It determines the impact of each tree on the final outcome.

`'max_bin': 25`: The maximum number of bins into which feature values are bucketed.

`'num_leaves': 31`: The maximum number of leaves (terminal nodes) in a tree.

`'min_child_samples': 1500`: The minimum number of samples required in a leaf.

`'colsample_bytree': 0.7`: The fraction of columns (features) sampled for each tree.

`'subsample_freq': 1`: The bagging frequency. A value of K means the training data is re-sampled every K iterations, so 1 means at every iteration.

`'subsample': 0.7`: The fraction of training instances sampled in each bagging round.

`'reg_alpha': 1.0`: The L1 regularization term on weights.

`'reg_lambda': 1.0`: The L2 regularization term on weights.

`'verbosity': 0`: The logging level. A value of 0 shows only errors and warnings.

`'random_state': 0`: The seed for the random number generator, which makes the results reproducible.

`'n_estimators': 1500`: The number of boosted trees to fit.

In [None]:
lgbm_model = lgbm.LGBMClassifier(**boost_params)

cross_validate(X, y, lgbm_model)

### CatBoost Classifier

In [None]:
cat_model = catboost.CatBoostClassifier(n_estimators = 1500,
                                        learning_rate = 0.01,
                                        max_bin = 25,
                                        num_leaves = 31,
                                        min_child_samples = 1500,
                                        subsample = 0.7,
                                        reg_lambda = 1.0,
                                        random_state = 0)

cross_validate(X, y, cat_model)

* Since the XGBoost model gives the best results, I choose to use it for predictions on the test set.

### Prediction

In [None]:
# predict_proba returns one column per class; keep the probability of class 1 (claim filed)
final_preds = xgb_model.predict_proba(X_test)[:, 1]

In [None]:
submission['target'] = final_preds
submission.head()

In [None]:
submission.to_csv('submission.csv')  # pass index=False if 'id' is a regular column rather than the index
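
The shape of the `predict_proba` output matters when building the submission. Below is a minimal sketch on synthetic data, assuming a submission layout with an `id` column and a `target` column; the `LogisticRegression` model, the ids, and the names `toy_submission` and `buffer` are hypothetical stand-ins for illustration.

```python
import io
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: first 200 rows for training, last 100 as a stand-in test set.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_toy[:200], y_toy[:200])

# predict_proba returns an array of shape (n_samples, 2): one column per class.
proba = model.predict_proba(X_toy[200:])

toy_submission = pd.DataFrame({
    "id": np.arange(100),      # hypothetical ids
    "target": proba[:, 1],     # probability of class 1 only
})

# Write to an in-memory buffer; index=False keeps the row index out of the file.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])
```

Selecting column 1 yields a one-dimensional array that fits a single `target` column, and `index=False` keeps the output to exactly the two expected columns.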