<a href="https://colab.research.google.com/github/Camicb/practice/blob/main/Travel_Insurance_Claim_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Travel Insurance Claim Prediction**

#1. Introduction

Many companies that sell tickets or travel packages give consumers the option to purchase travel insurance, also known as travelers insurance. Travel insurance covers the costs and losses associated with traveling, and it is useful protection for those traveling domestically or abroad.
Some travel policies cover damage to personal property, rented equipment, such as rental cars, or even the cost of paying a ransom. 

The objective of this project is to create a machine learning model that lets an insurance company predict whether an insurance buyer will claim their travel insurance.

#2. Import Required Libraries

In [None]:
#!pip install -U imbalanced-learn
#!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 
#!pip install pycaret

In [None]:
#Importing libraries
import pandas as pd
import numpy as np
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno 
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import set_config
from imblearn.over_sampling import SMOTE


#3. Exploratory Data Analysis
##3.1 About the data
There are 11 columns in the dataset:
*   **Duration:** Travel duration
*   **Destination:** Travel destination (country)
*   **Agency:** Agency Name
*   **Agency Type:** Travel Agency or Airlines 
*   **Commision (in value):** Commission on the insurance (the column name is spelled this way in the data)
*   **Age:** Age of the insurance buyer
*   **Gender:** Gender of the insurance buyer
*   **Distribution Channel:** offline/online
*   **Product Name:** Name of the insurance plan
*   **Net Sales:** Net sales
*   **Claim:** If the insurance is claimed or not (the target variable), 0 = not claimed, 1 = claimed


In [None]:
# Load the provided data into a pandas data frame 
ins = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/travel_insurance/Training_set_label.csv" ) # training data
test_ins = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/travel_insurance/Testing_set_label.csv') # testing data

## 3.2 Exploratory data analysis

In [None]:
ins.head()
ins.info()

In [None]:
test_ins.head()
test_ins.info()

The variable Claim is treated as numerical in the training dataset, so it will be transformed into a categorical one.
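The conversion mentioned above can be sketched as follows. This is a minimal illustration on a made-up toy frame (the name `toy` and its values are assumptions, not the real data); the same `astype("category")` call would apply to `ins['Claim']`.

```python
import pandas as pd

# Toy stand-in for the training data; values are illustrative only.
toy = pd.DataFrame({"Duration": [10, 35, 7, 120], "Claim": [0, 0, 1, 0]})

# Cast the 0/1 target to a categorical dtype so it is summarised as classes, not numbers.
toy["Claim"] = toy["Claim"].astype("category")

claim_dtype = str(toy["Claim"].dtype)
claim_levels = list(toy["Claim"].cat.categories)
print(claim_dtype, claim_levels)
```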

In [None]:
# Statistic report
profile = ProfileReport(ins, html={'style': {'full_width': True, 'primary_color': '#30b6c2'}},  samples=None, missing_diagrams=None, interactions=None)
profile.to_file("report.html")
profile.to_notebook_iframe()

In [None]:
# Visualization of missing values 
msno.matrix(ins, figsize=(10,5), fontsize=10, color=(0.0, 0.75, 0.75)) 
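A numeric summary can complement the missingness matrix. The sketch below computes the percentage of missing values per column on a small made-up frame (`toy` and its values are assumptions for illustration); replacing `toy` with `ins` would give the figures for the real training data.

```python
import numpy as np
import pandas as pd

# Toy frame with a mostly-missing column; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["F", np.nan, np.nan, "M", np.nan],
    "Age": [31, 45, 28, 52, 39],
})

# Percentage of missing values per column, largest first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```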

Since 'Gender' has too many missing values, and 'Distribution Channel' is highly correlated with other variables and has imbalanced classes, both columns will be removed entirely. Then the training data will be split into new training and validation sets.

In [None]:
# Selecting the variables
X=ins.drop(['Gender', 'Distribution Channel', 'Claim'], axis=1) 
y=ins['Claim']

# Splitting the data with stratification into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1, stratify=y)
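To see what `stratify` does, the sketch below splits a made-up imbalanced target and compares the share of positive cases in each part. The toy data, the 80/20 split, and the variable names are assumptions chosen so the numbers divide evenly; they are not the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90 zeros and 10 ones (illustrative only).
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([0] * 90 + [1] * 10, name="Claim")

X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)

# With stratify, both parts keep the 10% share of positive cases.
train_share = y_tr.mean()
val_share = y_val.mean()
print(train_share, val_share)
```

Without `stratify`, a small validation set drawn from a heavily imbalanced target could end up with very few claimed policies, or none, which would make validation metrics unreliable.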

In [None]:
X_train.head()
X_train.info()

In [None]:
y_train.value_counts()

# 4. Data wrangling and Feature engineering

The held-out data is changed alongside the training data so that both keep a consistent shape.

In [None]:
pd.set_option('mode.chained_assignment',None) # no warnings

## 4.1 Agency


In [None]:
# Replacing agencies whose frequency in the training data is below 5% with 'Other'
Agencies=X_train.loc[:,'Agency'].value_counts(normalize=True)*100
Agencies=list(Agencies[Agencies < 5].index)

X_train.loc[:,'Agency']=X_train.loc[:,'Agency'].apply(lambda i: 'Other' if i in Agencies else i)
X_test.loc[:,'Agency']=X_test.loc[:,'Agency'].apply(lambda i: 'Other' if i in Agencies else i) 

sns.histplot(data=X_train, x='Agency', color='c', stat='probability')
plt.xticks(rotation='vertical')
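The rare-category grouping used in this section follows one pattern: learn the rare labels from the training data, then apply the same replacement to every set. A possible reusable version is sketched below; the helper name `group_rare` and the toy labels are hypothetical and not part of the notebook.

```python
import pandas as pd

def group_rare(train_col, other_cols, threshold_pct, other_label):
    """Replace categories rarer than threshold_pct in train_col with other_label."""
    freq = train_col.value_counts(normalize=True) * 100
    rare = set(freq[freq < threshold_pct].index)  # learned from training data only

    def replace(col):
        # Keep values that are not rare; swap the rest for other_label.
        return col.where(~col.isin(rare), other_label)

    return replace(train_col), [replace(c) for c in other_cols]

# Toy labels (illustrative only): "C" makes up 3% of the training column.
toy_train = pd.Series(["A"] * 60 + ["B"] * 37 + ["C"] * 3)
toy_valid = pd.Series(["A", "C", "B"])

new_train, (new_valid,) = group_rare(toy_train, [toy_valid], 5, "Other")
print(new_valid.tolist())
```

As in the notebook's cells, a label that appears only in the held-out data is not in the rare list and is therefore left unchanged, which is worth checking before encoding.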

## 4.2 Product Name

In [None]:
# Replacing products whose frequency in the training data is below 5% with 'Other Plan'
Products=X_train.loc[:, 'Product Name'].value_counts(normalize=True)*100
Products=list(Products[Products < 5].index)

X_train['Product Name']=X_train['Product Name'].apply(lambda i: 'Other Plan' if i in Products else i)
X_test['Product Name']=X_test['Product Name'].apply(lambda i: 'Other Plan' if i in Products else i)

sns.histplot(data=X_train, x='Product Name', color='c', stat='probability')
plt.xticks(rotation='vertical')

## 4.3 Duration and Age

In [None]:
# Replacing clear outliers with NaN: Duration values below 1 (zero or negative) and Age values equal to 118
X_train['Duration']= X_train.loc[:, 'Duration'].apply(lambda i: np.nan if i < 1 else i)
X_train['Duration'].isnull().value_counts(normalize=True)*100
print('---')
X_train['Age']= X_train.loc[:,'Age'].apply(lambda i: np.nan if i == 118 else i)
X_train['Age'].isnull().value_counts(normalize=True)*100
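The same outlier replacement can be written without `apply`, using `Series.mask`, which sets a value to NaN wherever the condition holds. The sketch below runs on a made-up frame (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy frame with the two kinds of outliers described above (illustrative only).
toy = pd.DataFrame({"Duration": [-1, 0, 5, 30], "Age": [36, 118, 118, 51]})

# Series.mask sets values to NaN wherever the condition is True.
toy["Duration"] = toy["Duration"].mask(toy["Duration"] < 1)
toy["Age"] = toy["Age"].mask(toy["Age"] == 118)

# Percentage of values that became missing in each column.
missing_share = toy.isna().mean() * 100
print(missing_share)
```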

In [None]:
# Imputing NaN values
#imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#X_train['Age'] = imputer.fit_transform(X_train[['Age']]).ravel()
#X_train['Duration'] = imputer.fit_transform(X_train[['Duration']]).ravel()
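If mean imputation is done by hand, the column means should come from the training data and be reused for the held-out data. The sketch below is a pandas-only stand-in for the commented `SimpleImputer` cell, not the notebook's method; the toy frames and their values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy training and validation frames with missing values (illustrative only).
toy_train = pd.DataFrame({"Age": [30.0, np.nan, 50.0], "Duration": [10.0, 20.0, np.nan]})
toy_valid = pd.DataFrame({"Age": [np.nan, 40.0], "Duration": [np.nan, 5.0]})

# Column means are computed on the training data only (NaN values are skipped).
train_means = toy_train[["Age", "Duration"]].mean()

# The same means fill both sets, so no validation information leaks into training.
toy_train = toy_train.fillna(train_means)
toy_valid = toy_valid.fillna(train_means)
print(toy_valid)
```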

## 4.4 Destination

In [None]:
# Replacing destinations whose frequency in the training data is 1% or less with 'OTHER'
Destination=X_train.loc[:, 'Destination'].value_counts(normalize=True)*100
Destination=list(Destination[Destination > 1].index)

X_train['Destination']=X_train['Destination'].apply(lambda i: 'OTHER' if i not in Destination else i)
X_test['Destination']=X_test['Destination'].apply(lambda i: 'OTHER' if i not in Destination else i)

sns.histplot(data=X_train, x='Destination', color='c', stat='probability')
plt.xticks(rotation='vertical')

## 4.5 Total sales: Net Sales and Commision (in value)

In [None]:
# Adding both values since 55% of the commission values are zero
X_train['Total sales']= X_train['Commision (in value)'] + X_train['Net Sales']
X_test['Total sales']= X_test['Commision (in value)'] + X_test['Net Sales']

In [None]:
# Dropping the two original columns, which are now highly correlated with 'Total sales'
X_train=X_train.drop(columns=['Commision (in value)','Net Sales'])
X_test=X_test.drop(columns=['Commision (in value)','Net Sales'])
X_train.head()


In [None]:
# Encoding the categorical variables with one hot encoding
#X_train=pd.get_dummies(X_train, columns=['Agency','Agency Type','Product Name','Destination'])  
#X_test=pd.get_dummies(X_test, columns=['Agency','Agency Type','Product Name','Destination'])  
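Encoding the two sets separately with `pd.get_dummies` can produce different columns when a category is absent from one of them. One way to handle this, sketched below on made-up frames (names and values are assumptions), is to reindex the held-out frame to the training columns.

```python
import pandas as pd

# Toy frames: the validation set happens to contain only one agency (illustrative only).
toy_train = pd.DataFrame({"Agency": ["A", "B", "Other"], "Age": [30, 41, 25]})
toy_valid = pd.DataFrame({"Agency": ["A", "A"], "Age": [52, 33]})

train_enc = pd.get_dummies(toy_train, columns=["Agency"])
valid_enc = pd.get_dummies(toy_valid, columns=["Agency"])

# Align the validation frame to the training columns; dummies that are
# absent from the validation data are added and filled with 0.
valid_enc = valid_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(valid_enc.columns))
```

Reindexing also drops any dummy column that exists only in the held-out data, so the model always sees the feature set it was trained on.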

In [None]:
# SMOTE-oversampling the minority class in the target variable
#sm = SMOTE(random_state = 25, sampling_strategy = 0.5)
#X_train, y_train= sm.fit_resample(X_train, y_train)
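For a binary target, a float `sampling_strategy` in imbalanced-learn is the desired ratio of minority to majority samples after resampling. The sketch below only does that arithmetic and does not call SMOTE; the helper name `counts_after_oversampling` and the class counts are hypothetical.

```python
def counts_after_oversampling(n_majority, n_minority, sampling_strategy):
    """Return (majority, minority) counts implied by a minority/majority ratio."""
    target_minority = int(n_majority * sampling_strategy)
    # Assumption of this sketch: if the minority class already exceeds the
    # target, its count is simply left unchanged.
    return n_majority, max(n_minority, target_minority)

# Made-up class counts: 1000 unclaimed and 50 claimed policies.
after = counts_after_oversampling(n_majority=1000, n_minority=50, sampling_strategy=0.5)
print(after)
```

With a ratio of 0.5 the minority class is grown to half the size of the majority class, so the classes become less imbalanced but not equal.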

In [None]:
y_train.value_counts()

In [None]:
# Checking the shape of the data for modeling
X_train.shape
y_train.shape
X_test.shape
y_test.shape

In [None]:
# Choosing a model with pycaret 
from pycaret.classification import *

train=pd.concat([X_train, y_train], axis=1)
test=pd.concat([X_test, y_test], axis=1) #validation data

clf=setup(train, target='Claim', test_data=test, 
          normalize=True, 
          fix_imbalance=True,
          feature_selection=True,
          feature_selection_threshold=0.9, 
          feature_selection_method='boruta')

In [None]:
model=compare_models(sort='F1')
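Sorting by F1 rather than accuracy matters when the target is imbalanced. The sketch below, on made-up labels, scores a classifier that always predicts "not claimed"; the helper `accuracy_and_f1` is hypothetical and written out by hand so that the F1 formula is visible.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and the F1 score of the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = float(np.mean(y_true == y_pred))
    # F1 = 2*TP / (2*TP + FP + FN); set to 0 here when the denominator is 0.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Made-up labels: 5% of policies are claimed, and the model never predicts a claim.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
acc, f1 = accuracy_and_f1(y_true, always_negative)
print(acc, f1)
```

The always-negative model looks strong on accuracy while finding no claims at all, which F1 exposes.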

In [None]:
#from sklearn.feature_selection import SelectFromModel
#from sklearn.metrics import accuracy_score, f1_score
#from sklearn.ensemble import GradientBoostingClassifier

#model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
#y_pred = model.predict(X_test)
#ac = accuracy_score(y_test, y_pred)
#fscore = f1_score(y_test ,y_pred)

#print("Baseline Model Accuracy:", ac)
#print("Baseline Model F1 Score:", fscore*100)
