## Problem Description
Insurance companies take on risk on behalf of their customers, so risk management is a central part of the insurance industry. Insurers consider every quantifiable factor to build profiles of high-risk and low-risk policyholders, and they collect and analyze large amounts of information about those policyholders.

As a data scientist at an insurance company, you need to analyze the available data and predict whether the insurance claim should be sanctioned or not.

## Dataset Description
A zipped file containing the train, test and sample submission files is given. The training dataset covers 52310 customers and the test dataset covers 22421 customers. The features of the dataset are:

- Target: Claim Status (Claim)
- Name of agency (Agency)
- Type of travel insurance agencies (Agency.Type)
- Distribution channel of travel insurance agencies (Distribution.Channel)
- Name of the travel insurance products (Product.Name)
- Duration of travel (Duration)
- Destination of travel (Destination)
- Amount of sales of travel insurance policies (Net.Sales)
- The commission received for travel insurance agency (Commission)
- Age of insured (Age)
- The identification record of every observation (ID)

## Evaluation Metric
The evaluation metric for this task will be **precision_score**.
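
To make the metric concrete, here is a minimal sketch of how precision is computed from true and predicted labels. The helper `precision` and the toy label lists are illustrative assumptions, not part of the competition data; in the notebook itself, `sklearn.metrics.precision_score` does this job.

```python
def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the chosen positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions; a convention chosen for this sketch
    return tp / (tp + fp)


# Hypothetical labels: 1 = claim, 0 = no claim
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0]
print(precision(y_true, y_pred))
```

Because only predicted positives enter the formula, a model that predicts very few claims can score well on precision while missing most real claims, which is why recall and F1 are also worth tracking.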

**=========================================Data Analysis Begins=========================================**

### Importing Necessary Packages

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score,recall_score,f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

Defining path for data files

In [2]:
train_path='./train.csv'
test_path='./test.csv'

Loading data

In [3]:
df_train=pd.read_csv(train_path)
df_test=pd.read_csv(test_path)

Getting intuition about the data with head and describe

In [4]:
df_test.head()


Unnamed: 0,ID,Agency,Agency Type,Distribution Channel,Product Name,Duration,Destination,Net Sales,Commision (in value),Age
0,17631,EPX,Travel Agency,Online,Cancellation Plan,192,HONG KONG,18.0,0.0,36
1,15064,EPX,Travel Agency,Online,1 way Comprehensive Plan,2,SINGAPORE,20.0,0.0,36
2,14139,C2B,Airlines,Online,Bronze Plan,13,SINGAPORE,13.5,3.38,24
3,19754,EPX,Travel Agency,Online,2 way Comprehensive Plan,133,"TAIWAN, PROVINCE OF CHINA",41.0,0.0,36
4,16439,C2B,Airlines,Online,Silver Plan,2,SINGAPORE,30.0,7.5,32


In [5]:
df_test.describe()

Unnamed: 0,ID,Duration,Net Sales,Commision (in value),Age
count,22421.0,22421.0,22421.0,22421.0,22421.0
mean,15499.196646,59.100665,49.44607,12.316924,39.784889
std,2606.751171,114.819397,61.794609,22.957306,13.910773
min,11000.0,-1.0,-297.0,0.0,1.0
25%,13236.0,10.0,19.8,0.0,34.0
50%,15515.0,24.0,29.518868,0.0,36.0
75%,17762.0,58.0,56.0,13.63,43.0
max,20000.0,4857.0,810.0,283.5,118.0


In [6]:
df_train.describe()

Unnamed: 0,ID,Duration,Net Sales,Commision (in value),Age,Claim
count,52310.0,52310.0,52310.0,52310.0,52310.0,52310.0
mean,6005.745804,58.256108,48.554673,12.219963,39.555725,0.166699
std,2306.450475,109.138708,60.198589,22.847645,13.762473,0.37271
min,2000.0,-2.0,-389.0,0.0,0.0,0.0
25%,4015.0,10.0,19.8,0.0,33.0,0.0
50%,6002.0,24.0,29.5,0.0,36.0,0.0
75%,8004.0,57.0,55.0,13.38,43.0,0.0
max,10000.0,4881.0,682.0,262.76,118.0,1.0


In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52310 entries, 0 to 52309
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    52310 non-null  int64  
 1   Agency                52310 non-null  object 
 2   Agency Type           52310 non-null  object 
 3   Distribution Channel  52310 non-null  object 
 4   Product Name          52310 non-null  object 
 5   Duration              52310 non-null  int64  
 6   Destination           52310 non-null  object 
 7   Net Sales             52310 non-null  float64
 8   Commision (in value)  52310 non-null  float64
 9   Age                   52310 non-null  int64  
 10  Claim                 52310 non-null  int64  
dtypes: float64(2), int64(4), object(5)
memory usage: 4.4+ MB


The summary above shows that no column has **"Null"** values. It also shows that 6 columns hold numerical data (4 `int64`, 2 `float64`) and the other 5 hold categorical (`object`) data.

## Segregating Numerical and Object data

In [8]:
numerical_cols=df_train.select_dtypes(include=['int64','float64'])
categorical_cols=df_train.select_dtypes(include='object')
numerical_cols_test=df_test.select_dtypes(include=['int64','float64'])
categorical_cols_test=df_test.select_dtypes(include='object')

In [9]:
categorical_cols.columns

Index(['Agency', 'Agency Type', 'Distribution Channel', 'Product Name',
       'Destination'],
      dtype='object')

### From the information above and a basic analysis, we can conclude the following:
- **ID:** Numerical value for reference; no significance in statistical analysis.
- **Duration:**
    - Has negative values, which is not possible because time cannot be negative. This needs further analysis to find how many such records occur.
    - Has high variance.
    - A standard deviation of 109 against a mean of 58 suggests the data is widely dispersed.
- **Net Sales:**
    - Has negative values. Needs more investigation.
    - The middle 50% of the data (19.8 to 55) lies relatively close to the mean of 48.6.
    - Variance is high (standard deviation of 60).
- **Commision (in value):**
    - Values are relatively close to the mean.
    - No negative values.
    - At least 50% of the records have a value of 0 (the median is 0).
- **Age:**
    - A minimum of 0 and a maximum of 118 suggest the unit is years.
    - The standard deviation is low (13.8 against a mean of 39.6), which suggests the data is close to the mean.
- **Claim:**
    - Binary value (0 or 1); this is the target variable.
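
The follow-up flagged above for negative `Duration` and `Net Sales` values could start with a simple count. This is a minimal sketch on a toy frame standing in for `df_train`; the values are made up for illustration, and only the two column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for df_train; the values are illustrative only.
toy = pd.DataFrame({
    "Duration": [61, 4, -2, 15, -1],
    "Net Sales": [12.0, -389.0, 19.8, 27.0, 37.0],
})

cols = ["Duration", "Net Sales"]

# Number of rows below zero in each column.
negative_counts = (toy[cols] < 0).sum()

# Share of rows below zero, useful when deciding whether to drop or clip them.
negative_share = (toy[cols] < 0).mean()

print(negative_counts)
print(negative_share)
```

If the share turns out to be tiny, dropping or clipping the affected rows is usually an easy decision; a large share would suggest the negatives carry meaning (for example refunds in `Net Sales`).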
    

In [10]:
numerical_cols = numerical_cols.copy()
#Dropping ID and the target (Claim) so that only numerical features remain
numerical_cols.drop(columns=["ID","Claim"],inplace=True)

## Operations on Numerical Data
### Checking skewness

In [11]:
skew_check={}
for i in numerical_cols.columns:
    skew_check[i]=numerical_cols[i].skew()

In [12]:
skew_check

{'Duration': 15.3525235978114,
 'Net Sales': 2.811837338046441,
 'Commision (in value)': 3.5356943446774736,
 'Age': 2.9478911827909426}

The values above show that all numerical columns are **positively skewed**. Transformation techniques can be applied so that the data more closely follows a **Gaussian distribution**.
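
One possible transformation is sketched below. Because `Duration` and `Net Sales` contain negative values, a plain logarithm would fail, so this example uses a sign-preserving `log1p`. The helper name `signed_log1p` and the toy series are assumptions made for illustration; this is one option among several (a Yeo-Johnson power transform is another), not necessarily the method used for the final submission.

```python
import numpy as np
import pandas as pd


def signed_log1p(series):
    """Compress large magnitudes while keeping the sign of each value."""
    return np.sign(series) * np.log1p(np.abs(series))


# Toy right-skewed data with one negative value, standing in for 'Net Sales'.
toy_sales = pd.Series([-5, 1, 2, 3, 4, 5, 10, 50, 400], dtype="float64")

skew_before = toy_sales.skew()
skew_after = signed_log1p(toy_sales).skew()
print(skew_before, skew_after)
```

Tree-based models such as the Random Forest used later are largely insensitive to monotonic transformations of individual features, so the benefit of this step matters more for linear or distance-based models.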

In [13]:
#np.quantile(df_train_logtransformed,q = [0.1, 0.2, 0.3])
seq = np.linspace(0,1,101)
pd.DataFrame(df_train['Net Sales'].quantile(seq)).T

Unnamed: 0,0.00,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,...,0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99,1.00
Net Sales,-389.0,0.0,0.0,0.0,2.12,9.9,10.0,10.0,10.0,10.0,...,112.0,112.0,128.7,164.069017,201.598537,216.0,246.901143,252.85,291.75,682.0


In [14]:
df_train['Net Sales'].describe()

count    52310.000000
mean        48.554673
std         60.198589
min       -389.000000
25%         19.800000
50%         29.500000
75%         55.000000
max        682.000000
Name: Net Sales, dtype: float64

## Operations on Categorical Data

In [15]:
categorical_cols.columns

Index(['Agency', 'Agency Type', 'Distribution Channel', 'Product Name',
       'Destination'],
      dtype='object')

In [16]:
for i in categorical_cols:
    print("Column {} has {} unique values".format(i,len(categorical_cols[i].value_counts())))

Column Agency has 16 unique values
Column Agency Type has 2 unique values
Column Distribution Channel has 2 unique values
Column Product Name has 25 unique values
Column Destination has 97 unique values


The counts above suggest that columns **"Agency Type"** & **"Distribution Channel"** are binary in nature (2 unique values each). The other 3 columns have many distinct values, hence opting for **LabelEncoding**.

In [17]:
le=preprocessing.LabelEncoder() #Initialising LabelEncoder
df_train_enc=pd.DataFrame() #Creating empty DataFrame to store encoded values

for i in categorical_cols.columns: #Looping through to encode each column
    df_train_enc[i] = le.fit_transform(df_train[i])

In [23]:
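#This is not the correct way: re-fitting the encoder on the test set can assign
#different integer codes than on train whenever the two sets contain different categories
df_test_enc=pd.DataFrame() #Creating empty DataFrame to store encoded values
for i in categorical_cols_test.columns:
    df_test_enc[i] = le.fit_transform(df_test[i])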
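
A more consistent approach is to learn the category-to-code mapping from the training data only and reuse it for the test data. The sketch below does this with `pd.Categorical`; the helper names `fit_categories` and `encode`, the toy frames, and the choice of `-1` for categories unseen in training are all assumptions for illustration, not the notebook's actual pipeline.

```python
import pandas as pd


def fit_categories(frame, columns):
    """Learn the sorted list of categories per column from the training data."""
    return {col: sorted(frame[col].unique()) for col in columns}


def encode(frame, categories):
    """Encode with the training categories; unseen values become -1."""
    encoded = pd.DataFrame(index=frame.index)
    for col, cats in categories.items():
        encoded[col] = pd.Categorical(frame[col], categories=cats).codes
    return encoded


# Toy frames standing in for df_train and df_test.
toy_train = pd.DataFrame({"Agency": ["EPX", "C2B", "JZI", "EPX"]})
toy_test = pd.DataFrame({"Agency": ["C2B", "XYZ", "EPX"]})

cats = fit_categories(toy_train, ["Agency"])
toy_train_enc = encode(toy_train, cats)
toy_test_enc = encode(toy_test, cats)
print(toy_test_enc)
```

With this design, the same label always maps to the same integer in both sets, and a category that appears only in the test set is flagged explicitly instead of silently shifting the codes of the others.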

In [19]:
df_test_enc.Agency.unique()

array([ 7,  2,  6, 11, 14,  9, 12,  4,  8, 13, 10,  0,  1, 15,  5,  3])

In [20]:
df_train_enc.Agency.unique()

array([ 7,  6,  9,  2, 12, 14,  5,  8, 11, 13,  1,  4, 10,  0,  3, 15])

In [24]:
df_train_enc.head() #Checking for encoded values

Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Destination
0,7,1,1,10,68
1,7,1,1,10,53
2,6,1,1,16,84
3,7,1,1,1,33
4,7,1,1,1,53


Merging dataframes (numerical + encoded)

In [25]:
df_train_Modified=pd.concat([numerical_cols,df_train_enc],axis=1) #Adding both dataframes
df_test_Modified=pd.concat([numerical_cols_test,df_test_enc],axis=1) #Adding both dataframes

In [26]:
df_train_Modified.head()

Unnamed: 0,Duration,Net Sales,Commision (in value),Age,Agency,Agency Type,Distribution Channel,Product Name,Destination
0,61,12.0,0.0,41,7,1,1,10,68
1,4,17.0,0.0,35,7,1,1,10,53
2,26,19.8,11.88,47,6,1,1,16,84
3,15,27.0,0.0,48,7,1,1,1,33
4,15,37.0,0.0,36,7,1,1,1,53


### Checking Skewness of the Merged Features

In [27]:
clean_skew=df_train_Modified.skew()

clean_skew

Duration                15.352524
Net Sales                2.811837
Commision (in value)     3.535694
Age                      2.947891
Agency                  -0.097111
Agency Type             -0.718350
Distribution Channel    -7.465242
Product Name             0.332685
Destination             -0.590582
dtype: float64

## Model Building

### Splitting data into feature and target variables

In [28]:
#Numerical and label-encoded features, with Claim as the target
X=df_train_Modified #Features
y=df_train['Claim'] #Target

In [29]:
# Train/test split: 80% for training, 20% held out for validation
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=3)

In [30]:
#Checking for shape of Training data
print("Shape of X Train:",X_train.shape," & Y Train is ",y_train.shape)

Shape of X Train: (41848, 9)  & Y Train is  (41848,)
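
Since only about 17% of the training rows are claims (the mean of `Claim` is 0.1667), a stratified split can keep that ratio identical in both partitions. The sketch below shows the `stratify` argument of `train_test_split` on toy data; the toy frame and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 negatives and 20 positives.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 80 + [1] * 20, name="Claim")

# stratify=y_toy preserves the class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=3, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

Without stratification the hold-out split can end up with noticeably more or fewer claims than the training split, which makes precision on the hold-out set a noisier estimate.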


In [42]:
from imblearn.over_sampling import SMOTE
import smote_variants as sv

ModuleNotFoundError: No module named 'imblearn'

In [43]:
pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
Collecting scikit-learn>=0.23
  Downloading scikit_learn-0.24.1-cp37-cp37m-win_amd64.whl (6.8 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn, imbalanced-learn, imblearn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22.1
    Uninstalling scikit-learn-0.22.1:
      Successfully uninstalled scikit-learn-0.22.1
Note: you may need to restart the kernel to use updated packages.


ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'c:\\users\\mrityunjay1.pandey\\anaconda3\\lib\\site-packages\\~klearn\\decomposition\\_cdnmf_fast.cp37-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



### Model Execution

#### Building the Random Forest model with the finalized hyperparameters

In [34]:
#Base Model
perf_report=pd.DataFrame()
rf=RandomForestClassifier(n_estimators=500,
                          min_samples_leaf=1,
                          max_features='log2',
                          max_depth=40,
                          criterion='gini',
                          class_weight='balanced_subsample')
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test) #Predictions on the hold-out split
rf_score=accuracy_score(y_test,y_pred)
rf_precision_score=precision_score(y_test,y_pred)
#Collecting the hold-out metrics in a one-row report
perf_report = pd.concat([perf_report, pd.DataFrame([{'model':'Random Forest',
                                                     'Accuracy': rf_score,
                                                     'Precision': rf_precision_score,
                                                     'Recall': recall_score(y_test,y_pred),
                                                     'F1': f1_score(y_test,y_pred)}])],
                        ignore_index=True)
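
The notebook imports `RandomizedSearchCV` but does not show the search that led to these hyperparameters. The following is a hedged sketch of what such a search might look like when optimizing for precision; the synthetic data, the parameter grid, and the small `n_iter` are assumptions chosen to keep the example fast, not the settings actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data with 9 features, standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, weights=[0.83, 0.17], random_state=3
)

# Hypothetical search space; widen it for a real run.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 3, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced_subsample", random_state=3),
    param_distributions=param_distributions,
    n_iter=4,             # number of sampled parameter combinations
    scoring="precision",  # matches the competition metric
    cv=3,
    random_state=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

After fitting, `search.best_estimator_` is already refit on the full data passed to `fit` and can be used directly for prediction.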

In [41]:
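from sklearn.metrics import cohen_kappa_score #Not included in the first import cell

smote = SMOTE(random_state=12) #Requires the imblearn import to have succeeded

#Oversampling the training data using SMOTE
#fit_resample replaces the older fit_sample method in imbalanced-learn
X_sample, y_sample = smote.fit_resample(X_train, y_train)
rf.fit(X_sample,y_sample) #Note: this re-fits rf in place on the resampled data
y_pred_smote=rf.predict(X_test)
rf_score=accuracy_score(y_test,y_pred_smote)
rf_precision_score=precision_score(y_test,y_pred_smote)
print('rf_score',rf_score)
print('rf_precision_score',rf_precision_score)
print("cohen's kappa = " , cohen_kappa_score(y_test,y_pred_smote))
#Actual classes as rows, predicted classes as columns
pd.crosstab(y_test,y_pred_smote, rownames=['Actual'], colnames=['Predicted'])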

NameError: name 'SMOTE' is not defined
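
When `imblearn` cannot be installed, as happened here, a simpler stand-in is random oversampling of the minority class with plain pandas and NumPy. This is explicitly not SMOTE: it duplicates existing minority rows instead of synthesizing new ones. The function name `random_oversample` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def random_oversample(X, y, random_state=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = np.random.default_rng(random_state)
    counts = y.value_counts()
    target = counts.max()
    chosen = []
    for label, n in counts.items():
        idx = y.index[y == label].to_numpy()
        if n < target:
            # Draw the missing rows with replacement from the same class.
            extra = rng.choice(idx, size=target - n, replace=True)
            idx = np.concatenate([idx, extra])
        chosen.append(idx)
    chosen = np.concatenate(chosen)
    return X.loc[chosen].reset_index(drop=True), y.loc[chosen].reset_index(drop=True)


# Toy data: 8 negatives and 2 positives.
X_small = pd.DataFrame({"Duration": range(10)})
y_small = pd.Series([0] * 8 + [1] * 2, name="Claim")

X_bal, y_bal = random_oversample(X_small, y_small, random_state=12)
print(y_bal.value_counts())
```

As with SMOTE, resampling should be applied to the training split only, so that the hold-out metrics still reflect the original class balance.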

In [35]:
df_test_ID=df_test_Modified.ID #Taking the ID column into a separate variable

In [36]:
df_test_for_pred=df_test_Modified.drop(columns='ID') #Dropping ID to be in line with the model's input

In [37]:
final_pred_rf=rf.predict(df_test_for_pred) #Executing model to predict claim values

In [38]:
#Concatenating ID and Claim to write to CSV
final_pred=pd.concat(
                    [pd.DataFrame(df_test_ID,columns=['ID']),
                         pd.DataFrame(final_pred_rf,columns=['Claim'])],
                         axis=1) #Adding both dataframes

In [39]:
final_pred.head()

Unnamed: 0,ID,Claim
0,17631,0
1,15064,0
2,14139,0
3,19754,0
4,16439,0


### Writing Prediction to CSV

In [40]:
final_pred.to_csv('RF_Tuned.csv',index=False)