## Travel Insurance
### Content:

Data obtained from a third-party travel insurance servicing company that is based in
Singapore.

### Format:
1. Target: Claim Status (Claim.Status)
2. Name of agency (Agency)
3. Type of travel insurance agencies (Agency.Type)
4. Distribution channel of travel insurance agencies (Distribution.Channel)
5. Name of the travel insurance products (Product.Name)
6. Duration of travel (Duration)
7. Destination of travel (Destination)
8. Amount of sales of travel insurance policies (Net.Sales)
9. Commission received for travel insurance agency (Commission)
10. Gender of insured (Gender)
11. Age of insured (Age)


### Task:
1. Find the critical parameters for claiming the insurance.
2. Build a model to predict the claim status.

We will build a model using a Random Forest Classifier to predict the claim status for travel insurance. The intuition behind each step is as follows:

* **Importing libraries**: We import the necessary libraries such as pandas, numpy, and scikit-learn to handle data manipulation, numerical operations, and machine learning algorithms.

* **Loading the dataset:** We load the travel insurance dataset from a CSV file using pd.read_csv() into a pandas DataFrame. This allows us to access and analyze the data.

* **Data preprocessing and exploration:** We perform initial data exploration to understand the dataset. We check the data types, identify missing values, and explore the distribution of the target variable (claim status). This step helps us identify any issues with the data and decide how to handle them.

* **Feature selection and preprocessing:** We select the relevant features for modeling based on domain knowledge and analysis. We separate the selected features as the input (X) and the target variable (y). Categorical features are one-hot encoded to convert them into a numerical representation suitable for the model.

* **Splitting the data:** We split the data into training and testing sets using train_test_split(). The training set is used to train the model, while the testing set is used to evaluate the model's performance on unseen data.

* **Building the Random Forest Classifier model:** We create an instance of the Random Forest Classifier using RandomForestClassifier(). The Random Forest algorithm combines multiple decision trees to make predictions and handle complex relationships between features.

* **Training the model:** We train the model on the training data by calling the fit() method on the classifier object. This step builds multiple decision trees, each on a bootstrap sample of the training data; their predictions are aggregated later, at prediction time.

* **Making predictions:** We use the trained model to predict the claim status for the test set using the predict() method. This allows us to evaluate how well the model generalizes to unseen data.

* **Evaluating the model's performance:** We calculate the accuracy of the model by comparing the predicted claim status with the actual claim status using the accuracy_score() function. Accuracy is a common metric that measures the fraction of correctly predicted labels.

* **Printing the accuracy:** Finally, we print the accuracy score to assess the performance of the model. A higher accuracy indicates better predictive performance, but the target here is heavily imbalanced (927 claims out of 63,326 policies in the raw data), so accuracy alone can look high even when few claims are detected. Other evaluation metrics, such as precision and recall for the claim class, are needed for a fuller picture.

In [17]:
import pandas as pd
import numpy as np
# LabelEncoder maps class labels to integers; it is used here to encode the target column.
from sklearn.preprocessing import LabelEncoder
# train_test_split splits the data into training and testing sets.
from sklearn.model_selection import train_test_split
# RandomForestClassifier is scikit-learn's implementation of the random forest algorithm.
from sklearn.ensemble import RandomForestClassifier
# accuracy_score computes the fraction of correctly classified samples.
from sklearn.metrics import accuracy_score

**Data Preparation:**

Read the travel insurance data from a CSV file and store it in the `data` DataFrame.

In [18]:
data = pd.read_csv('travel_insurance_data.csv')


Display the first few rows of the dataset to get an overview of the data.

In [19]:
print(data.head())


  Agency    Agency Type Distribution Channel                     Product Name  \
0    CBH  Travel Agency              Offline               Comprehensive Plan   
1    CBH  Travel Agency              Offline               Comprehensive Plan   
2    CWT  Travel Agency               Online  Rental Vehicle Excess Insurance   
3    CWT  Travel Agency               Online  Rental Vehicle Excess Insurance   
4    CWT  Travel Agency               Online  Rental Vehicle Excess Insurance   

  Claim  Duration Destination  Net Sales  Commision (in value) Gender  Age  
0    No       186    MALAYSIA      -29.0                  9.57      F   81  
1    No       186    MALAYSIA      -29.0                  9.57      F   71  
2    No        65   AUSTRALIA      -49.5                 29.70    NaN   32  
3    No        60   AUSTRALIA      -39.6                 23.76    NaN   32  
4    No        79       ITALY      -19.8                 11.88    NaN   41  


Print information about the dataset, including the data types of each column and the presence of any missing values.

In [20]:
print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63326 entries, 0 to 63325
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Agency                63326 non-null  object 
 1   Agency Type           63326 non-null  object 
 2   Distribution Channel  63326 non-null  object 
 3   Product Name          63326 non-null  object 
 4   Claim                 63326 non-null  object 
 5   Duration              63326 non-null  int64  
 6   Destination           63326 non-null  object 
 7   Net Sales             63326 non-null  float64
 8   Commision (in value)  63326 non-null  float64
 9   Gender                18219 non-null  object 
 10  Age                   63326 non-null  int64  
dtypes: float64(2), int64(2), object(7)
memory usage: 5.3+ MB
None
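
The `info()` output above shows that `Gender` is the only column with missing values: 18,219 non-null entries out of 63,326, so roughly 71% of the rows lack a gender. A minimal sketch of how the same figures could be obtained directly, using a small toy frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["F", np.nan, "M", np.nan],
    "Age": [81, 32, 41, 36],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```

On the real data, the same two calls on `data` would report the counts per column without having to subtract from the row total by hand.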


Calculate and display the statistical summary of the numeric columns in the dataset, providing information such as count, mean, standard deviation, minimum, maximum, and quartile values.

In [21]:
print(data.describe())


           Duration     Net Sales  Commision (in value)           Age
count  63326.000000  63326.000000          63326.000000  63326.000000
mean      49.317074     40.702018              9.809992     39.969981
std      101.791566     48.845637             19.804388     14.017010
min       -2.000000   -389.000000              0.000000      0.000000
25%        9.000000     18.000000              0.000000     35.000000
50%       22.000000     26.530000              0.000000     36.000000
75%       53.000000     48.000000             11.550000     43.000000
max     4881.000000    810.000000            283.500000    118.000000
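
The summary above contains values that look implausible for travel insurance: a minimum `Duration` of -2, a maximum of 4881, and a maximum `Age` of 118. The sketch below shows one way such rows could be flagged for review. The bounds `MAX_DURATION_DAYS` and `MAX_AGE` are hypothetical and would need to come from domain knowledge; the toy values only mimic the extremes reported above.

```python
import pandas as pd

# Toy rows mimicking the extremes seen in describe(); not real records.
toy = pd.DataFrame({
    "Duration": [186, -2, 65, 4881],
    "Age": [81, 32, 118, 41],
})

# Hypothetical plausibility bounds; choose them from domain knowledge.
MAX_DURATION_DAYS = 730
MAX_AGE = 100

# Flag rows whose duration is negative or too long, or whose age is too high.
suspect = toy[
    (toy["Duration"] < 0)
    | (toy["Duration"] > MAX_DURATION_DAYS)
    | (toy["Age"] > MAX_AGE)
]
print(suspect)
```

Whether flagged rows should be dropped, capped, or kept is a modeling decision that the notebook does not address.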


Count the occurrences of each value in the 'Claim' column to see how the claim statuses are distributed.

In [22]:
print(data['Claim'].value_counts())


No     62399
Yes      927
Name: Claim, dtype: int64
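
The counts above imply a strongly imbalanced target. As a quick check, the sketch below converts the printed counts into shares; the variable names are illustrative. On the real data, `data['Claim'].value_counts(normalize=True)` would give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
claim_counts = pd.Series({"No": 62399, "Yes": 927}, name="Claim")

# Share of each class in the raw data.
claim_share = claim_counts / claim_counts.sum()
print(claim_share.round(4))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = claim_share.max()
print("Majority-class baseline accuracy:", round(majority_baseline, 4))
```

Only about 1.5% of the policies in the raw data have a claim, so a model that always predicts "No" would already be about 98.5% accurate on that data. These shares apply before any rows are removed.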


**Data Cleaning and Encoding:**

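Remove every row that contains a missing value. Because `Gender` is the only column with missing values (18,219 non-null entries out of 63,326), this keeps 18,219 rows and discards the rest.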

In [23]:
data = data.dropna()
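
Dropping rows is the approach this notebook takes. One alternative, shown here only as a sketch and not as the method used above, is to keep every row and treat a missing gender as its own category. The label `"Unknown"` is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["F", np.nan, "M", np.nan],
    "Claim": ["No", "No", "Yes", "No"],
})

# Option used in the notebook: drop any row with a missing value.
dropped = toy.dropna()

# Alternative: keep all rows and mark missing gender explicitly.
filled = toy.copy()
filled["Gender"] = filled["Gender"].fillna("Unknown")

print(len(toy), len(dropped), len(filled))
```

The trade-off is between losing rows and introducing a category whose meaning is "not recorded"; which is preferable depends on why the values are missing.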


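Create an instance of the LabelEncoder class and use it to encode the categorical variable 'Claim' as numeric labels. The message printed below the cell is pandas' SettingWithCopyWarning; it can typically be avoided by taking an explicit copy of the DataFrame (for example with `.copy()`) before assigning to one of its columns.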

In [24]:
label_encoder = LabelEncoder()
data['Claim'] = label_encoder.fit_transform(data['Claim'])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Claim'] = label_encoder.fit_transform(data['Claim'])
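
To make the encoding explicit, the sketch below runs the same steps on a toy frame and prints the label-to-integer mapping. `LabelEncoder` sorts the classes, so 'No' becomes 0 and 'Yes' becomes 1. The explicit `.copy()` after `dropna()` is an addition for illustration; it makes the frame independent before a column is assigned.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; values are illustrative only.
toy = pd.DataFrame({"Claim": ["No", "Yes", "No", None]})

# Take an explicit copy so the assignment below targets an independent frame.
clean = toy.dropna().copy()

encoder = LabelEncoder()
clean["Claim"] = encoder.fit_transform(clean["Claim"])

# classes_ holds the original labels in the order of their integer codes.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(clean["Claim"].tolist())
```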


**Feature Selection and Encoding:**

* Select the relevant features for modeling by specifying a list of column names, and assign them to **X**.
* Select the target variable 'Claim' and assign it to **y**.

In [25]:
features = ['Agency', 'Agency Type', 'Distribution Channel', 'Product Name', 'Duration', 'Destination', 'Net Sales', 'Commision (in value)', 'Gender', 'Age']
X = data[features]
y = data['Claim']


Perform one-hot encoding on the categorical features in X, converting them into binary columns representing the presence or absence of each category.

In [26]:
X = pd.get_dummies(X)
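
A small sketch of what `pd.get_dummies()` does to a mixed frame, using toy values: numeric columns are passed through unchanged and come first, and each categorical column is replaced by one indicator column per category, named `<column>_<category>`.

```python
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Age": [81, 32, 41],
    "Gender": ["F", "M", "F"],
    "Destination": ["MALAYSIA", "AUSTRALIA", "ITALY"],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

This is why a column with many distinct values, such as a destination column, expands the feature matrix considerably.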


**Model Building and Evaluation:**

Split the data into training and testing sets, with 80% of the data used for training and 20% for testing. The random_state parameter ensures reproducibility.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
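
With an imbalanced target, a purely random split can leave the training and testing sets with different class proportions. The sketch below shows the `stratify` argument of `train_test_split()`, which preserves the class ratio in both sets. This is an optional variation, not what the cell above does, and the toy arrays are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 10% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy keeps the 90/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positives in train:", y_tr.sum(), "of", len(y_tr))
print("Positives in test:", y_te.sum(), "of", len(y_te))
```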


Create an instance of the RandomForestClassifier model and train it on the training data (X_train and y_train). No random_state is set on the classifier, so results can vary slightly between runs.

In [28]:
clf = RandomForestClassifier()
clf.fit(X_train, y_train)


Use the trained model to make predictions on the testing data (X_test) and assign the predicted values to y_pred.

In [29]:
y_pred = clf.predict(X_test)


Calculate the accuracy of the model by comparing the predicted values (y_pred) with the actual values (y_test), and print the accuracy score.

In [30]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9593852908891328
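
An accuracy figure is easier to interpret next to a majority-class baseline and per-class metrics. The sketch below illustrates this on synthetic imbalanced data generated with `make_classification()`; it does not use the travel insurance data, and its numbers say nothing about the model above. The 95/5 class split is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 95% of samples in class 0 (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_pred = forest.predict(X_te)
forest_acc = accuracy_score(y_te, forest_pred)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_te, forest_pred)

print("Baseline accuracy:", baseline_acc)
print("Random forest accuracy:", forest_acc)
print(cm)
print(classification_report(y_te, forest_pred, zero_division=0))
```

Applied to this notebook, the same calls on `y_test` and `y_pred` would show how many actual claims the model identifies, which the accuracy score alone does not reveal.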


Extract the feature importances from the trained model.

In [31]:
importances = clf.feature_importances_

Create a DataFrame that pairs each feature with its importance, sorted from most to least important.

In [32]:
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importances = feature_importances.sort_values('Importance', ascending=False)

Display the sorted feature importances; the top entries are the candidate critical parameters for a claim.

In [33]:
print(feature_importances)

                      Feature  Importance
3                         Age    0.357909
0                    Duration    0.326140
1                   Net Sales    0.123239
2        Commision (in value)    0.122734
127                  Gender_M    0.008149
..                        ...         ...
101          Destination_OMAN    0.000000
120  Destination_TURKMENISTAN    0.000000
90      Destination_MAURITIUS    0.000000
61         Destination_CYPRUS    0.000000
54         Destination_BRAZIL    0.000000

[128 rows x 2 columns]
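
In the table above, each one-hot column receives its own importance, so a categorical feature's contribution is spread across many rows. One way to compare features on equal footing is to sum the importances of all columns derived from the same original column. The sketch below does this on hypothetical importances; the helper `source_column` and all numbers are illustrative, not results from the model above.

```python
import pandas as pd

# Original (pre-encoding) column names used for matching.
original_columns = ["Agency", "Agency Type", "Gender", "Age", "Duration"]

# Hypothetical importances, for illustration only.
toy_importances = pd.DataFrame({
    "Feature": ["Age", "Duration", "Gender_F", "Gender_M",
                "Agency_CWT", "Agency Type_Airlines"],
    "Importance": [0.40, 0.30, 0.05, 0.05, 0.15, 0.05],
})


def source_column(feature, columns):
    """Map a one-hot column name back to the column it was derived from."""
    matches = [c for c in columns if feature == c or feature.startswith(c + "_")]
    # Prefer the longest match so that similar prefixes are not confused.
    return max(matches, key=len) if matches else feature


toy_importances["Source"] = [
    source_column(f, original_columns) for f in toy_importances["Feature"]
]

# Sum the importances of all columns that share a source column.
grouped = (
    toy_importances.groupby("Source")["Importance"]
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

Note that impurity-based importances from a random forest tend to favor numeric features with many distinct values, so the ranking is best read as a starting point rather than a definitive list of critical parameters.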
