# **Project Name**    - Health Insurance Cross Sell Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual



# **Project Summary -**

I took on the task of expanding our insurance services by predicting which current Health Insurance policyholders might be interested in adding Vehicle Insurance. I started by exploring the customer data, understanding the target variable "Response," and identifying class imbalance. After addressing the imbalance and selecting a suitable model, the model revealed key factors influencing customer interest, which can be used to optimize marketing strategies. This predictive model helps reach the right customers with the right message, improving the business's success and its value to stakeholders.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Our insurance company aims to expand its services by offering Vehicle Insurance to existing Health Insurance policyholders. The challenge lies in predicting which customers within our diverse customer base would be interested in this new offering. To address this, we must effectively leverage our customer data to build a predictive model that can distinguish between those interested in Vehicle Insurance and those who are not. Furthermore, we must consider class imbalances in the target variable and ensure that our model is capable of making accurate predictions in this context. The ultimate goal is to optimize our communication strategies, tailor our marketing efforts, and enhance revenue by reaching out to the right customers with the right message at the right time.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure each algorithm answers the following:


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder, StandardScaler
from scipy import stats
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
import random
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import resample
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import  precision_score, recall_score, f1_score

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
csv_file_path = '/content/drive/MyDrive/Project/Health Insurance Cross Sell Prediction/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv'

dataset = pd.read_csv(csv_file_path)
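
The guidelines above ask for exception handling. A minimal sketch of a guarded loader follows. The helper name `load_dataset` is hypothetical (it is not part of the original notebook), and it assumes that a missing or empty file should stop the run with a readable message rather than fail later in the analysis.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read a CSV into a DataFrame, failing early with a readable message.

    `load_dataset` is a hypothetical helper, not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    try:
        frame = pd.read_csv(csv_path)
    except pd.errors.EmptyDataError as exc:
        # Raised by pandas when the file contains no columns to parse
        raise ValueError(f"Dataset file is empty: {csv_path}") from exc
    if frame.empty:
        # Header present but no data rows
        raise ValueError(f"Dataset has no rows: {csv_path}")
    return frame
```

In the notebook this could be called as `dataset = load_dataset(csv_file_path)` in place of the direct `pd.read_csv` call.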


### Dataset First View

In [None]:
# Dataset First Look
print("\nFirst 5 rows of the dataset:")
print(dataset.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
print("Dataset Information:")
dataset.info()  # info() prints its own report and returns None, so no print() is needed

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = dataset.isnull().sum()
# Display the count of missing values for each column
print("Missing Values Count per Column:")
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(dataset.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

Column Count: The dataset contains a total of 12 columns.

Non-Null Count: Each column has 381,109 non-null values, indicating that there are no missing values in the dataset.

Data Types: The dataset comprises three data types:

*   int64: Integer data type for 'id', 'Age', 'Driving_License', 'Previously_Insured', 'Vintage', and 'Response'.
*   float64: Floating-point data type for 'Region_Code', 'Annual_Premium', and 'Policy_Sales_Channel'.
*   object: Object data type for 'Gender', 'Vehicle_Age', and 'Vehicle_Damage'.

Duplicate Values: There are no duplicate rows in the dataset.
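
The facts listed above (row and column counts, missing values, duplicates, data types) can also be collected in one place. The sketch below uses a hypothetical helper, `summarize_frame`, and a tiny made-up frame instead of the real data, so the numbers it prints are illustrative only.

```python
import pandas as pd


def summarize_frame(frame):
    """Return basic shape, missing-value, duplicate and dtype facts as a dict."""
    columns_by_dtype = {}
    for column, dtype in frame.dtypes.items():
        # Group column names under the string name of their dtype
        columns_by_dtype.setdefault(str(dtype), []).append(column)
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "columns_by_dtype": columns_by_dtype,
    }


# Made-up rows, only to show the shape of the output
toy = pd.DataFrame({"a": [1, 1, 2], "b": [0.5, 0.5, None]})
summary = summarize_frame(toy)
print(summary)
```

Calling `summarize_frame(dataset)` on the loaded data would give a quick way to check each column against the dtype lists above.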

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(dataset.columns)

In [None]:
# Dataset Describe
print(dataset.describe())

### Variables Description

id: A unique identifier for each customer.

Gender: The gender of the customer, either 'Male' or 'Female'.

Age: The age of the customer.

Driving_License: Indicates whether the customer has a valid driving license (1 for yes, 0 for no).

Region_Code: A unique code representing the region of the customer.

Previously_Insured: Indicates whether the customer already has vehicle insurance (1 for yes, 0 for no).

Vehicle_Age: The age of the customer's vehicle.

Vehicle_Damage: Indicates whether the customer's vehicle has had past damages (Yes or No).

Annual_Premium: The amount the customer needs to pay as an insurance premium.

Policy_Sales_Channel: An anonymized code representing the channel used to reach out to the customer (e.g., different agents, mail, phone, in person, etc.).

Vintage: The number of days the customer has been associated with the company.

Response: A binary variable indicating whether the customer is interested in vehicle insurance (1 for interested, 0 for not interested).
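
Since the project summary mentions class imbalance in `Response`, it may help to quantify it early. This is a minimal sketch with a hypothetical helper, `class_balance`, run on made-up values rather than the real column.

```python
import pandas as pd


def class_balance(target):
    """Return counts and shares per class, plus the majority/minority ratio."""
    counts = target.value_counts()
    shares = target.value_counts(normalize=True)
    table = pd.DataFrame({"count": counts, "share": shares})
    imbalance_ratio = counts.max() / counts.min()
    return table, imbalance_ratio


# Toy data standing in for dataset['Response']; the values are made up
toy_response = pd.Series([0, 0, 0, 0, 0, 0, 0, 1], name="Response")
table, ratio = class_balance(toy_response)
print(table)
print(f"Majority-to-minority ratio: {ratio:.1f}")
```

On the real data the call would be `class_balance(dataset['Response'])`. A ratio far above 1 suggests that accuracy alone will be a misleading metric.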

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in dataset.columns:
    unique_values = dataset[column].unique()
    print(f"Unique values for {column}: {unique_values}")
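
Printing every unique value gets noisy for high-cardinality columns such as `id`. One possible alternative, sketched here with a hypothetical helper `unique_summary` and made-up rows, reports the number of distinct values per column and lists them only when there are few. The threshold `max_values=10` is an arbitrary assumption.

```python
import pandas as pd


def unique_summary(frame, max_values=10):
    """Summarise each column by distinct-value count, listing values only when few."""
    rows = []
    for column in frame.columns:
        n_unique = frame[column].nunique(dropna=False)
        if n_unique <= max_values:
            # Low-cardinality column: show the actual values
            values = sorted(frame[column].dropna().unique().tolist())
        else:
            values = None
        rows.append({"column": column, "n_unique": n_unique, "values": values})
    return pd.DataFrame(rows).set_index("column")


# Made-up rows using two column names from this dataset
toy = pd.DataFrame({"id": range(20), "Gender": ["Male", "Female"] * 10})
summary = unique_summary(toy)
print(summary)
```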

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# There are no missing values in any of the columns.
# There are no duplicate values.
# The data types for each column seem appropriate.
# Given these factors, the dataset does not require extensive data wrangling
# for missing values, duplicates, or data type issues.

### What all manipulations have you done and insights you found?

No manipulations were needed at this stage:

*   There are no missing values in any of the columns.
*   There are no duplicate rows.
*   The data types for each column seem appropriate.

Given these factors, the dataset does not require extensive data wrangling for missing values, duplicates, or data type issues.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(dataset['Age'], dataset['Annual_Premium'], alpha=0.5)
plt.title('Scatter Plot: Age vs. Annual Premium')
plt.xlabel('Age')
plt.ylabel('Annual Premium')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot to examine the relationship between 'Age' and 'Annual Premium' because it's an effective way to visualize how these two numerical variables interact. Scatter plots allow us to identify patterns, outliers, and potential correlations between variables.

##### 2. What is/are the insight(s) found from the chart?

From the scatter plot, we can observe that there is a broad distribution of customers of different ages and annual premium amounts. It appears that there is no clear linear relationship between age and annual premium. However, we can see some concentration of points in specific regions, suggesting potential clusters or patterns within the data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this chart can be valuable for the business. By understanding the distribution of age and annual premium amounts, the company can better tailor its insurance products and pricing strategies to different customer segments. This could lead to more personalized offerings, potentially attracting a wider customer base and increasing revenue.

The scatter plot does not directly indicate negative growth. However, it highlights the complexity of the relationship between age and annual premium. If not properly analyzed, the lack of a clear linear trend might lead to ineffective pricing strategies. To avoid negative growth, the company should use more advanced analysis to uncover non-linear patterns and tailor its approach accordingly.
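
One simple first step beyond a linear reading is to compare Pearson correlation (linear) with Spearman rank correlation (monotonic). The sketch below uses synthetic data, not the insurance dataset, and the helper name `linear_vs_rank_correlation` is hypothetical. Spearman is computed as Pearson on ranks so that only pandas and NumPy are needed.

```python
import numpy as np
import pandas as pd


def linear_vs_rank_correlation(x, y):
    """Return (Pearson r, Spearman rho) for two numeric Series."""
    pearson = x.corr(y)  # strength of the linear association
    # Spearman's rho equals Pearson's r computed on the ranks
    spearman = x.rank().corr(y.rank())
    return pearson, spearman


# Synthetic example: y grows monotonically but not linearly with x
x = pd.Series(np.arange(1, 51, dtype=float))
y = np.exp(x / 5.0)
pearson, spearman = linear_vs_rank_correlation(x, y)
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

For the chart above the call would be `linear_vs_rank_correlation(dataset['Age'], dataset['Annual_Premium'])`. A Spearman value clearly larger than the Pearson value hints at a monotonic but non-linear relationship. Neither measure detects non-monotonic patterns.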

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Select numerical columns for the pair plot
numerical_cols = ['Age', 'Region_Code', 'Annual_Premium', 'Vintage']

# Create a pair plot
sns.pairplot(dataset[numerical_cols], height=2)
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

I chose to create a pair plot to examine the pairwise relationships between multiple numerical variables ('Age,' 'Region_Code,' 'Annual_Premium,' and 'Vintage'). The pair plot is an excellent choice because it provides a comprehensive view of how these variables interact, allowing us to identify potential correlations and patterns simultaneously.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot, we can observe several insights:

*   There is no strong linear relationship between 'Age' and 'Region_Code' or between 'Age' and 'Vintage'.
*   'Annual_Premium' shows a relatively uniform distribution across 'Age' and 'Vintage'.
*   Some minor clustering or patterns might be present, but no dominant correlations are immediately evident.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the pair plot can be beneficial for the business. While no strong linear relationships were observed, these insights provide valuable information for segmenting customers and personalizing product offerings. By understanding the distribution and patterns within these variables, the company can better target different customer segments with tailored insurance products and pricing strategies, potentially increasing customer satisfaction and revenue.
The pair plot does not reveal any insights that directly lead to negative growth. However, the absence of strong linear relationships suggests the need for more advanced analysis to uncover potential non-linear correlations. To avoid negative growth, the company should focus on understanding non-linear patterns within the data and adapt its strategies accordingly. Without such insights, the business may miss opportunities for effective targeting and personalization.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Count the number of each gender category
gender_counts = dataset['Gender'].value_counts()

# Create a bar chart
plt.figure(figsize=(6, 4))
gender_counts.plot(kind='bar', color=['skyblue', 'lightcoral'])
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)  # To keep gender labels horizontal
plt.show()

##### 1. Why did you pick the specific chart?

I selected a bar chart to visualize the distribution of the categorical variable 'Gender' because it's a clear and effective way to illustrate the gender distribution within the dataset. Bar charts are ideal for displaying the frequency of different categories within a single variable.

##### 2. What is/are the insight(s) found from the chart?

From the bar chart, we can observe the following insights:

*   The dataset contains both male and female policyholders.
*   The number of male policyholders appears to be slightly higher than the number of female policyholders.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the gender distribution chart can have a positive business impact. Understanding the distribution of gender among policyholders can inform the company's marketing and communication strategies. Tailoring these strategies to different gender segments may lead to more effective outreach, potentially attracting a broader customer base and increasing overall customer satisfaction and loyalty.
The chart does not reveal any insights that would lead to negative growth. However, it's important to note that gender is just one aspect of customer demographics. To avoid negative growth, the company should consider a holistic approach to customer segmentation and personalization, incorporating multiple variables to create well-rounded marketing strategies.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Create a histogram for the 'Age' variable
plt.figure(figsize=(8, 6))
plt.hist(dataset['Age'], bins=20, color='skyblue', edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

I chose to create a histogram to visualize the distribution of the numerical variable 'Age' because it's an effective way to understand the age distribution of policyholders. Histograms are particularly useful for identifying the frequency of age groups within the dataset and visualizing any patterns or trends.

##### 2. What is/are the insight(s) found from the chart?

From the histogram, we can observe the following insights:

*   The age distribution is relatively evenly spread across the dataset, with the highest frequency in the central age ranges.
*   While there is no dramatic skew, there is a gradual decrease in frequency as age increases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the age distribution histogram can have a positive business impact. Understanding the distribution of policyholders' ages allows the company to tailor insurance products and marketing strategies to different age groups. This personalization can enhance customer satisfaction, attract a more diverse customer base, and potentially increase revenue.
The histogram does not reveal any insights that lead to negative growth. However, it's important to consider that age is just one dimension of customer demographics. To avoid negative growth, the company should complement age-based insights with other demographic and behavioral factors to create a comprehensive strategy for customer segmentation and product offerings.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Create a box plot for 'Annual_Premium'
plt.figure(figsize=(8, 6))
sns.boxplot(data=dataset, y='Annual_Premium', color='lightcoral')
plt.title('Annual Premium Distribution (Box Plot)')
plt.ylabel('Annual Premium')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

I selected a box plot to visualize the spread and distribution of the numerical variable 'Annual Premium' because it provides a clear representation of key statistics such as the median, quartiles, and potential outliers. Box plots are ideal for understanding the central tendency and variability of the data, making them effective for identifying any extreme values.

##### 2. What is/are the insight(s) found from the chart?

From the box plot of 'Annual Premium,' we can observe several insights:

*   The data distribution exhibits a wide range of annual premium values, with some policyholders having substantially higher premiums.
*   The median annual premium falls within the lower to middle range of the distribution, while policyholders with much higher premium amounts appear as outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the annual premium box plot can have a positive business impact. Understanding the spread of annual premium values allows the company to refine its pricing strategies and potentially offer more competitive premium rates. Addressing outliers may prevent excessive premiums and enhance customer satisfaction, ultimately leading to higher customer retention and potential revenue growth.
The box plot does not directly lead to negative growth. However, it highlights the importance of pricing strategies and the potential impact of outliers on customer satisfaction. Failure to address outliers with exceptionally high premiums may lead to customer attrition and negative growth. To avoid this, the company should conduct outlier analysis and review its pricing models to ensure fairness and competitiveness.
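
The outlier analysis suggested above can start from the same rule a box plot uses to flag points: Tukey fences at 1.5 times the interquartile range. The sketch below uses a hypothetical helper, `iqr_bounds`, and made-up premium values. Capping values at the fences is shown as one option among several, not as the treatment this project necessarily applies.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up premiums standing in for dataset['Annual_Premium']
toy_premium = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 500.0])
lower, upper = iqr_bounds(toy_premium)

# Values outside the fences are the points a box plot draws individually
outliers = toy_premium[(toy_premium < lower) | (toy_premium > upper)]

# One possible treatment: cap extreme values at the fences
capped = toy_premium.clip(lower=lower, upper=upper)
print(f"Fences: [{lower}, {upper}], outliers found: {len(outliers)}")
```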

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Calculate the correlation matrix
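# numeric_only=True skips the text columns ('Gender', 'Vehicle_Age', 'Vehicle_Damage'),
# which would otherwise raise an error in recent pandas versions
corr_matrix = dataset.corr(numeric_only=True)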

# Create a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

I chose to create a correlation heatmap to visually represent the relationships between numerical variables in the dataset. The heatmap is an effective way to identify the strength and direction of correlations, helping us understand how different variables may influence each other.



##### 2. What is/are the insight(s) found from the chart?

From the correlation heatmap, we can observe several insights:

*   There is a strong negative correlation between 'Previously_Insured' and 'Response', suggesting that customers who already have vehicle insurance are less likely to be interested in the new insurance product.
*   'Age' has a relatively weak but positive correlation with 'Response', indicating that interest tends to rise slightly with age rather than being concentrated among the youngest customers.
*   'Policy_Sales_Channel' shows a weak correlation with 'Response'. Because the channel is an anonymized code rather than a true quantity, this coefficient is only a rough hint that certain sales channels may be more effective at reaching interested customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the correlation heatmap can have a positive business impact. By understanding these relationships, the company can tailor its marketing and sales strategies to better target potential customers. For example, focusing on customers without existing vehicle insurance, and on the age groups and sales channels whose correlations point toward interest, may lead to more effective outreach, potentially increasing customer acquisition and revenue.
The heatmap does not directly lead to negative growth. However, it highlights the complexity of customer behavior and the need for targeted marketing strategies. Failure to adapt marketing efforts to the identified correlations may result in missed opportunities and potential negative growth. To avoid this, the company should leverage these insights to optimize its approach to customer outreach.
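
To read the heatmap's `Response` row more directly, the correlations with the target can be pulled out and sorted by strength. This is a sketch on made-up rows. The helper name `correlation_with_target` is hypothetical, and it assumes a pandas version that supports the `numeric_only` argument of `DataFrame.corr`.

```python
import pandas as pd


def correlation_with_target(frame, target="Response"):
    """Rank numeric columns by their Pearson correlation with the target."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    # Order by absolute strength, strongest first, but keep the sign
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up rows using column names from this dataset
toy = pd.DataFrame({
    "Previously_Insured": [1, 1, 1, 0, 0, 0, 1, 0],
    "Age": [25, 30, 28, 45, 50, 41, 33, 38],
    "Gender": ["Male", "Female"] * 4,  # non-numeric, so it is skipped
    "Response": [0, 0, 0, 1, 1, 1, 0, 0],
})
ranked = correlation_with_target(toy)
print(ranked)
```

A positive sign means the feature tends to be higher among interested customers, and a negative sign means it tends to be lower.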

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Create a count plot for 'Vehicle_Age'
plt.figure(figsize=(8, 6))
sns.countplot(data=dataset, x='Vehicle_Age', palette='Set2')
plt.title('Distribution of Vehicle Age')
plt.xlabel('Vehicle Age')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I selected a count plot to visualize the distribution of the categorical variable 'Vehicle_Age' because it's an effective way to illustrate the frequency of different vehicle age categories within the dataset. Count plots are ideal for categorical variables and allow for a straightforward comparison of category counts.

##### 2. What is/are the insight(s) found from the chart?

From the count plot of 'Vehicle_Age,' we can observe the following insights:

*   Most policyholders have vehicles that are between 1 and 2 years old, followed by vehicles less than 1 year old.
*   There are fewer policyholders with vehicles older than 2 years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the 'Vehicle_Age' count plot can have a positive business impact. Understanding the distribution of vehicle age categories can inform the company's product offerings and marketing strategies. For instance, the company can tailor insurance plans and promotional campaigns to cater to the most common vehicle age groups, potentially increasing customer interest and revenue.
The count plot does not lead to negative growth. However, the company should be cautious about overemphasizing certain vehicle age categories while neglecting others. Neglecting to offer insurance plans and promotions suitable for older vehicle age groups might result in missed opportunities and potential negative growth. To avoid this, a balanced approach to marketing and product offerings is crucial.


#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Select a subset of numerical columns for the pair plot
numerical_cols = ['Age', 'Annual_Premium', 'Vintage']

# Create a pair plot for the selected numerical columns
sns.pairplot(dataset[numerical_cols], kind='scatter', diag_kind='kde')  # palette has no effect without hue
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)  # lift the title above the top row
plt.show()

##### 1. Why did you pick the specific chart?

I chose to create a pair plot to visualize the relationships between numerical variables and observe scatter plots of numeric variables. The pair plot is a comprehensive way to analyze and understand how these variables interact and whether there are any observable patterns or correlations.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot, we can observe several insights:

*   There is no strong linear correlation between the numerical variables 'Age', 'Annual_Premium', and 'Vintage'. This suggests that these variables may not have a strong linear influence on each other.
*   The scatter plots reveal the distribution and spread of data points for each variable, showing the concentration of data around certain values.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the pair plot can have a positive business impact. While there may not be strong linear correlations, understanding the relationships between these numerical variables can guide the company in making data-informed decisions. For instance, it can help in optimizing pricing strategies, identifying potential customer segments, and improving customer satisfaction.
The pair plot does not directly lead to negative growth. However, it highlights the complexity of the relationships between these variables and emphasizes the need for a more in-depth analysis to uncover non-linear associations or potential interactions. Ignoring the subtler relationships and patterns may lead to missed opportunities and potential negative growth. To avoid this, the company should conduct further analyses to reveal hidden insights that could drive business improvements.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Create a bar plot for 'Vehicle_Age' vs. 'Response'
plt.figure(figsize=(8, 6))
sns.countplot(data=dataset, x='Vehicle_Age', hue='Response', palette='Set1')
plt.title('Vehicle Age vs. Response')
plt.xlabel('Vehicle Age')
plt.ylabel('Count')
plt.legend(title='Response', labels=['Not Interested', 'Interested'])
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I chose to create a bar plot to compare the count of 'Response' (interested or not) within different 'Vehicle_Age' categories. This specific chart type is effective for visually representing how customer interest in vehicle insurance varies based on the age of the vehicle.

##### 2. What is/are the insight(s) found from the chart?

From the bar plot, we can observe the following insights:

*   Policyholders with vehicles less than 1 year old have the highest count of interest in vehicle insurance.
*   Interest in vehicle insurance decreases as the age of the vehicle increases, with the lowest interest observed for vehicles older than 2 years.
*   While there is a decline in interest with vehicle age, there are still policyholders interested in insurance for older vehicles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this comparison can have a positive business impact. They can inform the company's marketing and product development strategies, allowing for the creation of insurance plans and promotions that cater to the preferences of customers based on their vehicle age. This tailored approach can lead to increased customer interest and higher conversion rates.
The insights from the chart do not inherently lead to negative growth. However, if the company fails to adapt its product offerings and marketing strategies based on the observed differences in customer interest, it may miss opportunities to attract policyholders with older vehicles. Neglecting this segment could result in missed revenue growth, but it's not a direct outcome of the chart itself. To avoid this, the company should create inclusive marketing strategies that cater to a broad range of customers.
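
A count plot mixes two things: how many customers fall in each category and how often they respond positively. Computing the response rate per category separates them. The sketch below uses made-up rows and a hypothetical helper, `response_rate_by`. The category labels are illustrative and may not match the dataset's exact spelling.

```python
import pandas as pd


def response_rate_by(frame, column, target="Response"):
    """Return group size and share of positive responses per category."""
    grouped = frame.groupby(column)[target].agg(customers="count", response_rate="mean")
    return grouped.sort_values("response_rate", ascending=False)


# Made-up rows: the largest group is not the one with the highest rate
toy = pd.DataFrame({
    "Vehicle_Age": ["< 1 Year"] * 6 + ["1-2 Year"] * 3 + ["> 2 Years"] * 1,
    "Response": [1, 1, 0, 0, 0, 0, 1, 0, 0, 1],
})
rates = response_rate_by(toy, "Vehicle_Age")
print(rates)
```

In this toy example the category with the most interested customers is not the one with the highest share of interested customers. That is why counts and rates are worth checking side by side before drawing conclusions from the chart.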

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Create a box plot for 'Age' by 'Response'
plt.figure(figsize=(8, 6))
sns.boxplot(data=dataset, x='Response', y='Age', palette='Set1')
plt.title('Age Distribution by Response')
plt.xlabel('Response')
plt.ylabel('Age')
plt.xticks([0, 1], ['Not Interested', 'Interested'])
plt.show()

##### 1. Why did you pick the specific chart?

I selected a box plot to compare the distribution of ages for policyholders who are interested and not interested in vehicle insurance. The box plot is an effective choice for visualizing the central tendency, spread, and potential outliers within these age distributions.

##### 2. What is/are the insight(s) found from the chart?

From the box plot, we can observe the following insights:

*   Policyholders who are interested in vehicle insurance tend to have a slightly lower median age compared to those who are not interested.
*   There is a wider age distribution for policyholders interested in vehicle insurance, with potential outliers at both ends of the age spectrum.
*   Policyholders not interested in insurance exhibit a narrower age range, with fewer outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from this comparison can have a positive business impact. They suggest that there may be different age groups with varying levels of interest in vehicle insurance. Understanding these age-related preferences can guide the company in tailoring marketing and product strategies to appeal to specific age segments. This targeted approach can lead to increased customer interest and potentially higher conversion rates.
The insights from the box plot do not inherently lead to negative growth. However, if the company fails to adapt its marketing strategies based on the observed differences in age distributions, it may miss opportunities to attract policyholders from different age groups. Neglecting to address the preferences of diverse age segments could result in missed revenue growth. To avoid this, the company should create inclusive marketing strategies that cater to a broad range of customers.
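
To turn the age comparison into segments a marketing team could act on, ages can be binned and the share of interested customers computed per band. This is a sketch on made-up rows. The helper name `response_rate_by_age_band` and the band edges are assumptions chosen for illustration.

```python
import pandas as pd


def response_rate_by_age_band(frame, edges, age_col="Age", target="Response"):
    """Bin ages into left-closed bands and return the positive-response share per band."""
    bands = pd.cut(frame[age_col], bins=edges, right=False)
    return frame.groupby(bands, observed=True)[target].mean()


# Made-up rows standing in for the real 'Age' and 'Response' columns
toy = pd.DataFrame({
    "Age": [22, 24, 27, 33, 36, 38, 44, 47, 52, 61],
    "Response": [0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
})
rates = response_rate_by_age_band(toy, edges=[20, 30, 40, 50, 70])
print(rates)
```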

In [None]:
dataset.columns

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothetical Statement 1**:
There is a significant difference in the mean age of policyholders who are interested in vehicle insurance (Response = 1) compared to those who are not interested (Response = 0).

**Hypothetical Statement 2**:
The mean annual premium for policyholders who already have vehicle insurance (Previously_Insured = 1) is significantly different from the mean annual premium for those who do not have vehicle insurance (Previously_Insured = 0).

**Hypothetical Statement 3**:
The mean vintage (number of days a customer has been associated with the company) for policyholders who are interested in vehicle insurance (Response = 1) is significantly different from the mean vintage for those who are not interested (Response = 0).

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the mean age between policyholders interested in vehicle insurance (Response = 1) and those not interested (Response = 0).

Alternate Hypothesis (H1): There is a significant difference in the mean age between policyholders interested in vehicle insurance (Response = 1) and those not interested (Response = 0).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Separate the age data into two groups based on Response
interested_age = dataset[dataset['Response'] == 1]['Age']
not_interested_age = dataset[dataset['Response'] == 0]['Age']

# Perform the t-test
t_stat, p_value = stats.ttest_ind(interested_age, not_interested_age, equal_var=False)

# Output the p-value
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value for Hypothetical Statement 1 (comparing the mean age of policyholders interested and not interested in vehicle insurance), I performed an independent two-sample t-test. This test is used to assess whether there is a statistically significant difference between the means of two independent groups, in this case, policyholders with Response = 1 (interested) and Response = 0 (not interested). The t-test calculates the t-statistic and p-value to determine whether the difference in means is significant or due to random chance.

##### Why did you choose the specific statistical test?

I chose the independent two-sample t-test for comparing the mean age of policyholders interested and not interested in vehicle insurance because of the following reasons:

Nature of Data: The data in question is numeric (age). The t-test compares means and relies on the sample means being approximately normally distributed. With groups this large, that holds reasonably well by the central limit theorem, even if age itself is skewed.

Two Independent Groups: In this scenario, we have two independent groups (Response = 1 and Response = 0) that we want to compare. The t-test is appropriate for comparing means between two groups.

Objective: The research hypothesis is focused on comparing the means of these two groups to determine if there is a significant difference in age. The t-test is designed for this specific purpose.

Assumptions: The test was run with equal_var=False (Welch's t-test), which does not assume that the two groups have equal variances. This is the safer choice when group sizes differ, as they do for the imbalanced Response variable.
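
The p-value alone does not say how large the difference is, and with very large groups even tiny differences come out significant. The sketch below shows one way to pair Welch's t-test with a decision rule and an effect size. It runs on synthetic stand-in data: the group means, spreads, and sizes are hypothetical, and the 0.05 significance level and the use of Cohen's d are assumptions, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two age groups (hypothetical values, not the project data)
rng = np.random.default_rng(42)
interested = rng.normal(loc=43, scale=12, size=500)
not_interested = rng.normal(loc=38, scale=15, size=2000)

# Welch's t-test: equal_var=False drops the equal-variance requirement
t_stat, p_value = stats.ttest_ind(interested, not_interested, equal_var=False)

# Cohen's d with a pooled standard deviation, as a rough effect-size measure
n1, n2 = len(interested), len(not_interested)
pooled_var = ((n1 - 1) * interested.var(ddof=1) + (n2 - 1) * not_interested.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (interested.mean() - not_interested.mean()) / np.sqrt(pooled_var)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, d = {cohens_d:.2f} -> {decision}")
```

The same pattern applies to the other two hypotheses by swapping in the relevant columns.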

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the mean annual premium between policyholders who already have vehicle insurance (Previously_Insured = 1) and those who do not have vehicle insurance (Previously_Insured = 0).

Alternate Hypothesis (H1): There is a significant difference in the mean annual premium between policyholders who already have vehicle insurance (Previously_Insured = 1) and those who do not have vehicle insurance (Previously_Insured = 0).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Separate the annual premium data into two groups based on Previously_Insured
insured_premium = dataset[dataset['Previously_Insured'] == 1]['Annual_Premium']
not_insured_premium = dataset[dataset['Previously_Insured'] == 0]['Annual_Premium']

# Perform the t-test
t_stat, p_value = stats.ttest_ind(insured_premium, not_insured_premium, equal_var=False)

# Output the p-value
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value for Hypothetical Statement 2 (comparing the mean annual premium of policyholders who are already insured and those who are not insured), I performed an independent two-sample t-test. This test is used to assess whether there is a statistically significant difference between the means of two independent groups, in this case, policyholders with Previously_Insured = 1 (already insured) and Previously_Insured = 0 (not insured). The t-test calculates the t-statistic and p-value to determine whether the difference in means is significant or due to random chance.

##### Why did you choose the specific statistical test?

I chose the independent two-sample t-test for comparing the mean annual premium of policyholders who are already insured and those who are not insured for the following reasons:

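Nature of Data: The data in question is numeric (annual premium). The t-test compares means and relies on the sample means being approximately normally distributed. With groups this large, that holds reasonably well by the central limit theorem, even though annual premium contains extreme values.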

Two Independent Groups: We have two independent groups (Previously_Insured = 1 and Previously_Insured = 0) that we want to compare. The t-test is well-suited for comparing means between two groups.

Objective: The research hypothesis focuses on comparing the means of these two groups to determine if there is a significant difference in annual premium. The t-test is specifically designed for this type of comparison.

Assumptions: The test was run with equal_var=False (Welch's t-test), which does not assume that the two groups have equal variances. This is the safer choice when the spread of the two groups may differ.
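
Because premium data tends to be skewed, it can be worth checking whether a rank-based test reaches the same conclusion as the t-test. The sketch below compares Welch's t-test with the Mann-Whitney U test on synthetic right-skewed data. The lognormal parameters and sample sizes are hypothetical, and the Mann-Whitney test is an additional check that the original analysis did not run.

```python
import numpy as np
from scipy import stats

# Right-skewed synthetic premiums (hypothetical values, not the project data)
rng = np.random.default_rng(0)
insured = rng.lognormal(mean=10.3, sigma=0.5, size=1000)
not_insured = rng.lognormal(mean=10.5, sigma=0.5, size=1000)

# Welch's t-test compares the means
welch = stats.ttest_ind(insured, not_insured, equal_var=False)

# Mann-Whitney U compares the distributions using ranks, so it is less sensitive to outliers
mwu = stats.mannwhitneyu(insured, not_insured, alternative="two-sided")

print(f"Welch p = {welch.pvalue:.3g}, Mann-Whitney p = {mwu.pvalue:.3g}")
```

If both tests agree, the conclusion is less likely to be driven by a handful of extreme premiums.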

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the mean vintage between policyholders interested in vehicle insurance (Response = 1) and those not interested (Response = 0).

Alternate Hypothesis (H1): There is a significant difference in the mean vintage between policyholders interested in vehicle insurance (Response = 1) and those not interested (Response = 0).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Separate the vintage data into two groups based on Response
interested_vintage = dataset[dataset['Response'] == 1]['Vintage']
not_interested_vintage = dataset[dataset['Response'] == 0]['Vintage']

# Perform the t-test
t_stat, p_value = stats.ttest_ind(interested_vintage, not_interested_vintage, equal_var=False)

# Output the p-value
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value for Hypothetical Statement 3 (comparing the mean vintage of policyholders interested and not interested in vehicle insurance), I performed an independent two-sample t-test. This test is used to assess whether there is a statistically significant difference between the means of two independent groups, in this case, policyholders with Response = 1 (interested) and Response = 0 (not interested). The t-test calculates the t-statistic and p-value to determine whether the difference in means is significant or due to random chance.

##### Why did you choose the specific statistical test?

I chose the independent two-sample t-test for comparing the mean vintage of policyholders interested and not interested in vehicle insurance for the following reasons:

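Nature of Data: The data in question is numeric (vintage). The t-test compares means and relies on the sample means being approximately normally distributed. With groups this large, that holds reasonably well by the central limit theorem, even if vintage itself is not normally distributed.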

Two Independent Groups: We have two independent groups (Response = 1 and Response = 0) that we want to compare. The t-test is well-suited for comparing means between two groups.

Objective: The research hypothesis centers on comparing the means of these two groups to determine if there is a significant difference in vintage. The t-test is specifically designed for this type of comparison.

Assumptions: The test was run with equal_var=False (Welch's t-test), which does not assume that the two groups have equal variances. This is the safer choice when group sizes differ, as they do for the imbalanced Response variable.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Check for missing values in the dataset
missing_values = dataset.isnull().sum()

# Output the count of missing values for each column
print("Missing Values Count per Column:")
print(missing_values)

#### What all missing value imputation techniques have you used and why did you use those techniques?

In this specific dataset, I did not use any missing value imputation techniques because there were no missing values in the dataset. The dataset was complete, with no records containing missing values in any of the features. As a result, there was no need for imputation techniques such as mean, median, or mode imputation.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Visualize 'Annual_Premium' using a box plot to identify outliers
plt.figure(figsize=(8, 6))
sns.boxplot(x=dataset['Annual_Premium'])
plt.title("Box Plot of Annual Premium")
plt.show()

# Determine the threshold for identifying outliers
Q1 = dataset['Annual_Premium'].quantile(0.25)
Q3 = dataset['Annual_Premium'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = dataset[(dataset['Annual_Premium'] < lower_bound) | (dataset['Annual_Premium'] > upper_bound)]

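# Handle outliers (You can choose to clip or remove outliers)
# Clip values to the IQR bounds. The first positional argument of clip() is the
# lower limit, so both limits are passed by keyword to avoid mixing them up.
dataset['Annual_Premium'] = dataset['Annual_Premium'].clip(lower=lower_bound, upper=upper_bound)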

# Verify that outliers have been handled
plt.figure(figsize=(8, 6))
sns.boxplot(x=dataset['Annual_Premium'])
plt.title("Box Plot of Annual Premium After Handling Outliers")
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

I used a common outlier treatment technique based on the Interquartile Range (IQR). Specifically, I employed the following steps:

Calculate IQR: I calculated the Interquartile Range (IQR) for the 'Annual_Premium' column. The IQR is the range between the 75th percentile (Q3) and the 25th percentile (Q1) of the data.

Define Lower and Upper Bounds: I defined lower and upper bounds for identifying potential outliers. These bounds were calculated as follows:

Lower Bound: Q1 - 1.5 × IQR

Upper Bound: Q3 + 1.5 × IQR

Outlier Treatment: Any data point (value) that fell below the lower bound or above the upper bound was considered an outlier. Outliers were treated by capping them at the nearest bound. Values below the lower bound were replaced with the lower bound value, and values above the upper bound were replaced with the upper bound value.

The reasons for using the IQR method for outlier treatment are as follows:

Robustness: The IQR method is robust and less sensitive to extreme outliers compared to other methods. It defines outliers based on the distribution of the data itself.

Interpretability: The IQR method is easy to interpret and explain. It relies on quartiles (percentiles), which are commonly used statistical measures.

Conservatism: The 1.5 multiplier in the IQR method is a common choice and strikes a balance between identifying outliers and avoiding excessive data manipulation.

Preservation of Data Distribution: Capping the outliers at the nearest bound retains the overall shape and characteristics of the data distribution while mitigating the impact of extreme values.
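
The capping steps above can be sketched as a small reusable function. The function name `cap_outliers_iqr` and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - m*IQR, Q3 + m*IQR] at the nearest bound."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # Keyword arguments make it explicit which limit is which
    return series.clip(lower=lower, upper=upper)

# Hypothetical premiums with one extreme value
premiums = pd.Series([10, 12, 13, 14, 15, 16, 100], dtype=float)
capped = cap_outliers_iqr(premiums)

# Q1 = 12.5, Q3 = 15.5, IQR = 3.0, so the bounds are 8.0 and 20.0
print(capped.tolist())  # [10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 20.0]
```

Only the value 100 lies outside the bounds, so it is the only one that changes.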

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
dataset = pd.get_dummies(dataset, columns=['Gender', 'Vehicle_Age', 'Vehicle_Damage'], drop_first=True)

# Display the first few rows of the dataset to verify encoding
print("First 5 rows of the dataset after encoding:")
print(dataset.head())

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used one-hot encoding to encode categorical columns. One-hot encoding is a common technique for handling categorical data in machine learning, and here's why it was chosen:

One-Hot Encoding: One-hot encoding was used for the following reasons:

Maintains Categorical Information: One-hot encoding converts categorical data into binary vectors. Each category becomes a binary feature (0 or 1), allowing the model to understand the presence or absence of a category. This approach preserves the distinctiveness of each category.

Avoids Ordinal Assumptions: One-hot encoding is suitable for categorical variables with no inherent ordinal relationship. For example, 'Male' and 'Female' don't have a natural order, making one-hot encoding a better choice than label encoding.

Minimizes Multicollinearity: By setting drop_first=True in pd.get_dummies(), one-hot encoding minimizes multicollinearity. It excludes one of the encoded categories to avoid perfect collinearity, which is beneficial for some machine learning algorithms.

Interpretability: One-hot encoding provides interpretable results. Each binary column represents a specific category, making it easy to understand the impact of each category on the model's predictions.

Label Encoding: Label encoding is another common technique for encoding categorical variables, especially when there's an inherent ordinal relationship between categories. In this dataset, 'Gender' and 'Vehicle_Damage' have no natural order. 'Vehicle_Age' does have a natural order, so label (ordinal) encoding would also be a reasonable choice for it; here all three columns were one-hot encoded.
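
To see what drop_first=True does in practice, the sketch below encodes a three-row sample. The sample rows are hypothetical, and the raw category label '1-2 Year' is an assumption inferred from the encoded column names used later in the notebook.

```python
import pandas as pd

# Hypothetical sample rows; the category labels are assumed
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["1-2 Year", "< 1 Year", "> 2 Years"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

encoded = pd.get_dummies(
    sample,
    columns=["Gender", "Vehicle_Age", "Vehicle_Damage"],
    drop_first=True,  # drop the first (alphabetically sorted) category of each column
)

print(list(encoded.columns))
# ['Gender_Male', 'Vehicle_Age_< 1 Year', 'Vehicle_Age_> 2 Years', 'Vehicle_Damage_Yes']
```

The dropped categories ('Female', '1-2 Year', 'No') become the baseline: a row with zeros in all dummy columns of a variable belongs to its baseline category.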

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
# Expanding contractions is a text preprocessing step typically applied to textual data. However, it doesn't apply to the columns in our dataset as they contain numerical and categorical data.

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations


#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords


#### 6. Rephrase Text

In [None]:
# Rephrase Text


#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)


##### Which text normalization technique have you used and why?

As the dataset has no textual columns, text normalization does not apply.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Create a function to categorize age into groups
def categorize_age(age):
    if age < 30:
        return "Young"
    elif age < 60:
        return "Middle-aged"
    else:
        return "Senior"

# Apply the function to create a new 'Age_Group' column
dataset['Age_Group'] = dataset['Age'].apply(categorize_age)

# Create a new 'Vehicle_Info' feature from string copies of the dummy columns.
# The original dummy columns are left unchanged so they remain usable as model features.
dataset['Vehicle_Info'] = (
    dataset['Vehicle_Age_< 1 Year'].astype(str) + ' - ' + dataset['Vehicle_Damage_Yes'].astype(str)
)

# Create a 'Premium_Per_Age' feature
dataset['Premium_Per_Age'] = dataset['Annual_Premium'] / dataset['Age']



#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Define your feature matrix (X) and target variable (y)
X = dataset[['Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Gender_Male', 'Vehicle_Age_< 1 Year', 'Vehicle_Age_> 2 Years', 'Vehicle_Damage_Yes', 'Premium_Per_Age']]
y = dataset['Response']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Select the top k features using SelectKBest
k = 5  # You can adjust the value of k
selector = SelectKBest(score_func=f_classif, k=k)
X_new = selector.fit_transform(X_train, y_train)

# Get the selected feature indices
selected_indices = selector.get_support(indices=True)

# Get the selected feature names
selected_features = X.columns[selected_indices]

# Train a classifier with the selected features (e.g., Random Forest)
clf = RandomForestClassifier()
clf.fit(X_new, y_train)

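# Evaluate the model on the test set, using the fitted selector to keep the same columns
X_test_new = selector.transform(X_test)
accuracy = clf.score(X_test_new, y_test)
print(f"Selected features: {list(selected_features)}")
print(f"Accuracy with {k} selected features: {accuracy}")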

##### What all feature selection methods have you used  and why?

I have demonstrated the use of the SelectKBest feature selection method in the previous code example. SelectKBest is a univariate feature selection method that selects the top k features based on their scores from a given statistical test. In the example, I used the F-statistic (f_classif) as the scoring function for feature selection. The SelectKBest method is suitable for classification tasks when you want to select the most relevant features based on their relationship with the target variable.

Other common feature selection methods you can consider include:

Recursive Feature Elimination (RFE): RFE recursively removes the least important features and selects the top k features based on the performance of the model. It's effective when you have a clear idea of the number of features you want to keep and want to optimize model performance.

Feature Importance from Tree-Based Models: Tree-based models like Random Forest and Gradient Boosting provide feature importances as a result of training. You can use these importances to select the most important features. This method is suitable when you want to consider interactions and non-linearity in the data.

L1 Regularization (Lasso): L1 regularization encourages sparsity by setting some feature coefficients to zero. Features with non-zero coefficients are selected. L1 regularization is useful for linear models like Logistic Regression and Linear SVM.

Correlation-Based Feature Selection: This method ranks features based on their correlation with the target variable. It's suitable when you want to explore the linear relationship between features and the target.

Mutual Information: Mutual information measures the dependency between two random variables. You can use mutual information for feature selection to capture both linear and non-linear relationships between features and the target.

##### Which all features you found important and why?

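The features treated as important are the k = 5 columns with the highest F-scores, stored in selected_features by the feature selection code. Feature importance was judged with the following methods: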

SelectKBest: This method selects the top k features based on the score function specified. In the code, I used f_classif as the score function, which is suitable for classification tasks. I chose this method because it helps to select the most relevant features and reduce dimensionality, which can lead to improved model performance and faster training times.

VarianceThreshold: This method removes features with low variance. It is particularly useful when dealing with binary features like in this dataset, where most values are zeros. I applied this method to eliminate features with near-constant values, reducing the risk of overfitting.

Feature Importance: Feature importance is derived from tree-based models like Random Forest or XGBoost. It helps identify the most important features for prediction. I used this method to select features that contribute the most to the model's performance.
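
To answer which features matter, it helps to print the scores that SelectKBest computed rather than only the selected subset. The sketch below does this on synthetic data from make_classification. The column names `feature_0` to `feature_7` are hypothetical placeholders, so the output says nothing about the insurance dataset itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data with hypothetical column names
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Fit the selector and rank every feature by its ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = list(X_demo.columns[selector.get_support()])

print(scores.round(2))
print("Selected:", selected)
```

Running the same two print statements against the fitted selector in the notebook would list the actual top features and their scores.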

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# No further transformation is applied here: outliers were capped and the categorical columns were encoded in the earlier steps.

### 6. Data Scaling

In [None]:
# Scaling your data
# Define the columns to scale (excluding ID and Response columns)
columns_to_scale = ['Age', 'Region_Code', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']

# Initialize the Min-Max scaler
scaler = MinMaxScaler()

# Fit and transform the selected columns
dataset[columns_to_scale] = scaler.fit_transform(dataset[columns_to_scale])

# Define the columns to standardize (same columns as above)
columns_to_scale = ['Age', 'Region_Code', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the selected columns.
# Note: standardizing Min-Max-scaled columns gives the same z-scores as standardizing
# the raw columns, so the final values reflect the StandardScaler only.
dataset[columns_to_scale] = scaler.fit_transform(dataset[columns_to_scale])


##### Which method have you used to scale you data and why?

Both scalers are applied in sequence to the same columns in the code above, so the final values are standardized (z-scores). The two methods compare as follows.

Min-Max Scaling:

Method: Scales the data to a specific range, typically [0, 1].

Why use it: It's useful when you have features with different scales and you want to ensure that all features have a comparable impact on the model. Min-Max scaling is particularly useful when you have features with a bounded range.

When to use it: When you want to preserve the shape of the original data distribution and when your model's performance benefits from having all features on a similar scale. It is sensitive to outliers, because the minimum and maximum define the range.

Standardization (Z-score Scaling):

Method: Scales the data to have a mean of 0 and a standard deviation of 1.

Why use it: Standardization does not require the data to be normally distributed. It's beneficial when your data doesn't have a specific bounded range and when features should be centered and expressed in units of their standard deviation.

When to use it: When you want to remove the mean from the data and give the features the same variance, which is particularly useful for models that are sensitive to feature scale, such as distance-based and gradient-based algorithms.
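
Two practical points about scaling can be checked with a short sketch. First, fitting the scaler on the training split only keeps test-set statistics out of preprocessing; this is an assumption about good practice, not what the original cell does. Second, chaining Min-Max scaling and standardization produces the same result as standardization alone. The data here is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic numeric features (hypothetical values)
rng = np.random.default_rng(1)
X_demo = rng.uniform(20, 80, size=(200, 2))
X_train, X_test = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit on the training split only, then apply the same transform to both splits
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Min-Max followed by standardization equals standardization alone,
# because both are linear rescalings of each column
chained = StandardScaler().fit_transform(MinMaxScaler().fit_transform(X_train))
same_result = np.allclose(chained, X_train_std)
print(same_result)  # True
```

In other words, only one of the two scalers needs to be applied, and the choice between them depends on the model.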

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

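No. The dataset has only around a dozen features, which the tree-based models used here handle comfortably, so dimensionality reduction is not needed.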

In [None]:
# Dimensionality Reduction (If needed)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Split the data into features (X) and the target variable (y)
X = dataset.drop(columns=['Response'])  # Features
y = dataset['Response']  # Target variable

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### What data splitting ratio have you used and why?

I have used an 80/20 data splitting ratio, where 80% of the data is used for training and 20% for testing. This ratio is a commonly used default in machine learning and data science for several reasons:

Adequate Training Data: With 80% of the data for training, the model has a substantial amount of data to learn from. It helps ensure that the model can capture underlying patterns and relationships in the data.

Sufficient Testing Data: Allocating 20% of the data for testing provides a large enough testing set to evaluate the model's performance effectively. This size helps in estimating how well the model generalizes to unseen data.

Balanced Trade-off: The 80/20 ratio strikes a balance between having sufficient training data and a reasonable amount of data for testing. It is often considered a good starting point for many machine learning tasks.

Quick Iteration: Smaller test sets allow for faster model evaluation during the development phase, which can be especially helpful when experimenting with different models or hyperparameters.
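
With an imbalanced target, a plain random split can leave the training and test sets with slightly different class ratios. The sketch below shows how the `stratify` argument of train_test_split keeps the ratio the same in both. The use of `stratify` is an addition not present in the original split, and the 12% positive rate is a hypothetical value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 12% positives
y_demo = np.array([0] * 880 + [1] * 120)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep a positive rate close to the overall 12%
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

This keeps the evaluation metrics for the minority class comparable across reruns with different random seeds.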

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, the dataset is imbalanced. This is evident from the distribution of the "Response" variable, which indicates whether customers are interested in vehicle insurance (Response = 1) or not interested (Response = 0).

Explanation:

Class Imbalance: In this dataset, there is a significant imbalance between the two classes of the "Response" variable. The majority of customers (Class 0) are not interested in vehicle insurance, while the minority (Class 1) are interested. This imbalance is common in many real-world datasets.

Impact on Model Training: Class imbalance can have a significant impact on machine learning models. Models trained on imbalanced data tend to perform poorly in predicting the minority class because they become biased towards the majority class.

Challenge for Classification: In the context of vehicle insurance, it is crucial to correctly identify customers interested in purchasing insurance. The imbalanced dataset can make it challenging to build a model that accurately identifies these potential customers.

Need for Balancing Techniques: To address the imbalance, techniques like oversampling the minority class, undersampling the majority class, or using a combination of both can be applied to create a balanced dataset. These techniques help improve the model's ability to predict the minority class accurately.

In [None]:
# Handling Imbalanced Dataset (If needed)

# Separate the majority and minority classes
majority_class = dataset[dataset['Response'] == 0]
minority_class = dataset[dataset['Response'] == 1]

# Upsample the minority class to match the size of the majority class
minority_upsampled = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)

# Combine the upsampled minority class with the majority class
balanced_dataset = pd.concat([majority_class, minority_upsampled])

# Display the class distribution in the balanced dataset
print(balanced_dataset['Response'].value_counts())

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I used the upsampling technique to handle the imbalanced dataset. Specifically, I upsampled the minority class (Response = 1) to match the size of the majority class (Response = 0).

The reason for choosing upsampling in this case is to balance the dataset and avoid bias in the machine learning model. By creating duplicate samples from the minority class, we ensure that both classes have an equal number of instances, which can lead to a more fair and accurate model. This helps the model learn from the minority class more effectively and make better predictions for both classes.

The goal is to prevent the model from being overly influenced by the majority class, which might result in poor performance for the minority class when predicting the target variable. Balancing the dataset is essential when dealing with imbalanced classes to improve the model's ability to detect and correctly classify instances of the minority class.
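
One point to watch with upsampling is the order of operations: if rows are duplicated before the train/test split, copies of the same customer can land in both sets and inflate test scores. The sketch below splits first and upsamples only the training portion. This ordering is a suggested variant rather than what the cell above does, and the small data frame is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical data: 10 positives and 90 negatives
demo = pd.DataFrame({"Age": range(100), "Response": [1] * 10 + [0] * 90})

# Split first so the test set contains only original, unduplicated rows
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Response"]
)

# Upsample the minority class within the training split only
majority = train_df[train_df["Response"] == 0]
minority = train_df[train_df["Response"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Response"].value_counts().to_dict())
print(test_df["Response"].value_counts().to_dict())
```

The model would then be fitted on `train_balanced` and evaluated on the untouched `test_df`.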

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import pandas as pd

# Perform one-hot encoding on categorical variables
dataset = pd.get_dummies(dataset, columns=['Age_Group', 'Vehicle_Info'])

# Split the data into features (X) and the target variable (y)
X = dataset.drop('Response', axis=1)
y = dataset['Response']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Classifier
model_1 = RandomForestClassifier(random_state=42)

# Fit the model on the training data
model_1.fit(X_train, y_train)

# Predict on the test data
y_pred = model_1.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The machine learning model used here is a Random Forest classifier (RandomForestClassifier), a binary classification model that predicts whether a customer is interested in vehicle insurance (Response = 1) or not (Response = 0) based on various features in the dataset.

Here's a breakdown of the model's performance using evaluation metric score chart:

Accuracy: The model achieves an accuracy of 0.86, which means that 86% of the predictions are correct. While accuracy is a commonly used metric, it may not be the most informative in this case due to class imbalance.

Precision: Precision is the share of predictions for a class that are correct. For class 0 (not interested in vehicle insurance), the precision is 0.89: when the model predicts that a customer is not interested, it is right 89% of the time. For class 1 (interested in vehicle insurance), the precision is lower at 0.37: only 37% of the customers flagged as interested actually were, so most class 1 predictions are false positives.

Recall: Recall measures the ability of the model to identify all relevant instances. For class 0, the recall is 0.97, indicating that 97% of actual class 0 instances were correctly identified. This means that the model is very good at correctly identifying customers who are not interested in vehicle insurance. For class 1, the recall is much lower at 0.13, meaning that only 13% of actual class 1 instances were correctly identified. The model struggles to identify customers interested in vehicle insurance.

F1-score: The F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics. For class 0, the F1-score is high at 0.93, indicating a good balance between precision and recall. For class 1, the F1-score is much lower at 0.20, reflecting the trade-off between precision and recall. The F1-score for class 1 is lower due to the lower recall.

Support: The "support" column in the evaluation metric report shows the number of instances in each class. There are 66,699 instances of class 0 and 9,523 instances of class 1 in the test data.
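
For reference, the three metrics above follow the standard definitions below, where TP, FP, and FN are the true positives, false positives, and false negatives for the class being scored:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

As a consistency check on the reported class 0 scores:

$$
F_1^{(0)} = \frac{2 \times 0.89 \times 0.97}{0.89 + 0.97} = \frac{1.7266}{1.86} \approx 0.93
$$

This matches the reported F1-score of 0.93. Small mismatches can appear when recomputing from rounded precision and recall values.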

In [None]:
# Visualizing evaluation Metric Score chart
# Define class labels and their corresponding metric scores
classes = ['Class 0 (Not Interested)', 'Class 1 (Interested)']
precision = [0.89, 0.37]
recall = [0.97, 0.13]
f1_score = [0.93, 0.20]

# Create subplots
fig, ax = plt.subplots()
width = 0.2  # Bar width
x = range(len(classes))

# Create bar plots for precision, recall, and F1-score
ax.bar(x, precision, width, label='Precision')
ax.bar([i + width for i in x], recall, width, label='Recall')
ax.bar([i + 2 * width for i in x], f1_score, width, label='F1-Score')

# Set labels and title
ax.set_xlabel('Classes')
ax.set_ylabel('Score')
ax.set_title('Model Evaluation Metrics')
ax.set_xticks([i + width for i in x])
ax.set_xticklabels(classes)
ax.legend()

# Display the plot
plt.show()
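
The chart above uses scores typed in by hand, which can drift out of sync with the model if it is retrained. One alternative, sketched here under the assumption that the labels are the integers 0 and 1, is to read the scores from classification_report with output_dict=True. The toy labels below are hypothetical.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# output_dict=True returns nested dictionaries keyed by the class label as a string
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)

precision = [report["0"]["precision"], report["1"]["precision"]]
recall = [report["0"]["recall"], report["1"]["recall"]]
f1_score = [report["0"]["f1-score"], report["1"]["f1-score"]]

print([round(v, 2) for v in precision])
print([round(v, 2) for v in recall])
print([round(v, 2) for v in f1_score])
```

These three lists can be passed to the bar chart code in place of the hard-coded values.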

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Define the hyperparameter grid for RandomizedSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False]
}

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Create RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf_classifier, param_distributions=param_grid,
                                   n_iter=5, cv=3, verbose=2, random_state=42, n_jobs=-1, scoring='f1')

# Fit the model on the training data with hyperparameter tuning
random_search.fit(X_train, y_train)

# Get the best estimator with optimized hyperparameters
best_rf_classifier = random_search.best_estimator_

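# Predict on the test data
y_pred = best_rf_classifier.predict(X_test)

# Report the chosen hyperparameters and the tuned model's scores
print("Best parameters:", random_search.best_params_)
print(classification_report(y_test, y_pred))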

##### Which hyperparameter optimization technique have you used and why?

I used Randomized Search Cross-Validation (RandomizedSearchCV) as the hyperparameter optimization technique. The main reason for using RandomizedSearchCV is to efficiently explore a wide range of hyperparameters while providing more flexibility and faster execution compared to Grid Search.

Here's why RandomizedSearchCV was chosen:

Efficiency: RandomizedSearchCV selects a random subset of hyperparameters to evaluate. This random sampling approach often results in faster optimization because it doesn't require evaluating all possible hyperparameter combinations, making it more efficient for large hyperparameter spaces.

Exploration of Hyperparameter Space: RandomizedSearchCV allows you to specify the number of iterations (n_iter), which controls the number of random combinations to try. This means you can balance the trade-off between optimization quality and computation time.

Parallelization: RandomizedSearchCV can take advantage of parallel computing using the n_jobs parameter. This can significantly speed up the search by utilizing multiple CPU cores.

Effective Results: Despite its random sampling, RandomizedSearchCV often finds hyperparameters that perform well. It's an effective technique for quickly narrowing down the hyperparameter space to a promising subset.
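To make the efficiency argument concrete, the sketch below counts how many combinations the Model 1 search space contains and compares that with the `n_iter=5` budget used in the tuning cell. It relies only on scikit-learn's `ParameterGrid`; the variable names `n_grid`, `grid_fits`, and `random_fits` are illustrative and not part of the original notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the Model 1 tuning cell
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True, False],
}

n_iter = 5  # combinations sampled by RandomizedSearchCV
cv = 3      # cross-validation folds per combination

# Every combination an exhaustive grid search would have to evaluate
n_grid = len(ParameterGrid(param_grid))

grid_fits = n_grid * cv
random_fits = n_iter * cv

print(f"Grid search: {n_grid} combinations, {grid_fits} model fits")
print(f"Randomized search: {n_iter} combinations, {random_fits} model fits")
```

With these settings the randomized search fits roughly a tenth as many models as a full grid search would, at the cost of possibly missing the best combination.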

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Tuning with RandomizedSearchCV is reported to have improved Model 1, but the tuned scores were not recorded in this section, so the size of the gain cannot be quantified here. The scores available for comparison are:

Before Hyperparameter Tuning (Initial Model):

Accuracy: 0.86
Precision for class 1: 0.37
Recall for class 1: 0.13
F1-score for class 1: 0.20

After Hyperparameter Tuning (RandomizedSearchCV):

Accuracy: not recorded
Precision for class 1: not recorded
Recall for class 1: not recorded
F1-score for class 1: not recorded

The search optimizes the F1-score of the positive class (scoring='f1'), so that is the first metric to compare. Printing classification_report(y_test, y_pred) for the tuned model gives the exact values needed to complete this comparison and to confirm how much the prediction of class 1 has improved.
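A before/after comparison of this kind can be produced directly from two sets of predictions. The helper names `summarize` and `compare_models` and the toy label arrays below are assumptions made for illustration; in the notebook the inputs would be `y_test` and the predictions of the initial and tuned models.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarize(y_true, y_pred):
    """Accuracy plus precision, recall and F1 for the positive class (label 1)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_1": precision_score(y_true, y_pred, zero_division=0),
        "recall_1": recall_score(y_true, y_pred, zero_division=0),
        "f1_1": f1_score(y_true, y_pred, zero_division=0),
    }


def compare_models(y_true, predictions):
    """Build one row of scores per named set of predictions."""
    table = pd.DataFrame({name: summarize(y_true, pred) for name, pred in predictions.items()})
    return table.T.round(2)


# Toy labels for illustration only; these are not project results
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
predictions_demo = {
    "before": [0, 0, 0, 1, 1, 0, 0, 0],
    "after": [0, 0, 0, 1, 1, 1, 0, 0],
}

comparison = compare_models(y_true_demo, predictions_demo)
print(comparison)
```

Keeping both rows in one table makes it harder to report an improvement without the numbers that support it.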

### ML Model - 2

In [None]:
# Initialize the Decision Tree Classifier
model_2 = DecisionTreeClassifier(random_state=42)
# Fit the model on the training data
model_2.fit(X_train, y_train)

# Predict on the test data
y_pred_model_2 = model_2.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Precision: Precision measures how many of the predicted positive cases were correct. It's a ratio of true positives to the total predicted positives.

Recall: Recall measures how many of the actual positive cases were captured by the model. It's a ratio of true positives to the total actual positives.

F1-score: The F1-score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall.

Accuracy: Accuracy measures the overall correctness of the model. It's a ratio of correctly predicted instances to the total instances.

Confusion Matrix: The confusion matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives.
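The definitions above can be checked by hand from the four confusion-matrix counts. The function name `metrics_from_counts` and the counts passed to it are hypothetical and chosen only to show how an imbalanced problem can combine high accuracy with a low F1-score.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts with 10% positives; these are not project results
scores = metrics_from_counts(tp=20, fp=50, fn=80, tn=850)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

In this example most predictions are correct because most customers are negatives, yet only one in five actual positives is found.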

We will present the evaluation metrics for both models, Model 1 and Model 2, to compare their performance and assess whether the second model provides an improvement.

In [None]:
# Visualizing evaluation Metric Score chart
# Calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred_model_2)
precision = precision_score(y_test, y_pred_model_2)
recall = recall_score(y_test, y_pred_model_2)
f1 = f1_score(y_test, y_pred_model_2)

# Print the classification report
print("Classification Report for Model 2 (Decision Tree Classifier):")
print(classification_report(y_test, y_pred_model_2))

# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_model_2)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Model 2 (Decision Tree Classifier)')
plt.show()

# Create a bar chart for evaluation metrics
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
metric_scores = [accuracy, precision, recall, f1]

plt.figure(figsize=(8, 5))
plt.bar(metrics, metric_scores, color='lightblue')
plt.title('Evaluation Metrics for Model 2 (Decision Tree Classifier)')
plt.xlabel('Metric')
plt.ylabel('Score')
plt.ylim(0, 1)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e. RandomSearch CV)
# Define a smaller search space for hyperparameters
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Initialize the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Initialize RandomizedSearchCV with 3-fold cross-validation and fewer iterations
random_search = RandomizedSearchCV(
    dt_classifier, param_distributions=param_dist, n_iter=10, cv=3, scoring='f1', random_state=42, n_jobs=-1, verbose=2
)

# Run the randomized search on the training data (cross-validates each sampled combination)
random_search.fit(X_train, y_train)

# Get the best estimator
best_dt_model = random_search.best_estimator_

# Predict on the test data using the tuned model
y_pred_model_2_tuned = best_dt_model.predict(X_test)

# Calculate the evaluation metrics
accuracy_tuned = accuracy_score(y_test, y_pred_model_2_tuned)
precision_tuned = precision_score(y_test, y_pred_model_2_tuned)
recall_tuned = recall_score(y_test, y_pred_model_2_tuned)
f1_tuned = f1_score(y_test, y_pred_model_2_tuned)

# Print the classification report
print("Classification Report for Model 2 (Decision Tree Classifier - Tuned):")
print(classification_report(y_test, y_pred_model_2_tuned))

# The confusion matrix and metric bar chart can be plotted as in the untuned Model 2 cell above

# Print and compare the best hyperparameters
print("Best Hyperparameters found by RandomizedSearchCV:")
print(random_search.best_params_)

##### Which hyperparameter optimization technique have you used and why?

I used RandomizedSearchCV for hyperparameter optimization. RandomizedSearchCV is a technique that randomly samples a wide range of hyperparameters for a given machine learning algorithm. It's particularly useful when there are many hyperparameters to tune because it doesn't require an exhaustive search of all possible combinations. Instead, it explores a representative subset of the hyperparameter space, making it computationally less intensive and faster.

The main reasons for choosing RandomizedSearchCV are:

Efficiency: It can significantly reduce the time needed for hyperparameter tuning compared to GridSearchCV, especially when there are a large number of hyperparameters to explore.

Exploration of Hyperparameter Space: RandomizedSearchCV provides a way to explore a broader range of hyperparameters and potentially discover combinations that might not have been considered in a grid search.

Resource-Friendly: RandomizedSearchCV is computationally more efficient and can be used in situations where you have limited computational resources.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After performing hyperparameter tuning using RandomizedSearchCV for Model 2 (Decision Tree Classifier), the evaluation metric scores were unchanged to two decimal places compared to the initial model:

Before Hyperparameter Tuning (Model 2 - Decision Tree Classifier):

Accuracy: 0.84
Precision (Class 0): 0.89
Recall (Class 0): 0.93
F1-score (Class 0): 0.91
Precision (Class 1): 0.29
Recall (Class 1): 0.21
F1-score (Class 1): 0.24

After Hyperparameter Tuning (Model 2 - Decision Tree Classifier):

Accuracy: 0.84
Precision (Class 0): 0.89
Recall (Class 0): 0.93
F1-score (Class 0): 0.91
Precision (Class 1): 0.29
Recall (Class 1): 0.21
F1-score (Class 1): 0.24

Hyperparameter tuning did not improve the evaluation metrics for Model 2. This could be due to the nature of the dataset or the choice of algorithm. Further exploration, feature engineering, or trying different algorithms might be necessary to achieve significant improvements in predictive performance.
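When tuning changes nothing, it helps to look at how much the cross-validated score varies across the sampled combinations. The sketch below does this with the `cv_results_` attribute of `RandomizedSearchCV`. It runs on synthetic data from `make_classification` as a stand-in for `X_train` and `y_train` (an assumption), with a reduced search space so that it finishes quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the project's training set (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.88, 0.12], random_state=42
)

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    scoring='f1',
    random_state=42,
)
search.fit(X_demo, y_demo)

# One row per sampled combination, best rank first
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
cv_table = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(cv_table.to_string(index=False))
```

If the mean scores differ by less than their standard deviations, the search space is probably not where the remaining performance is hidden, which points toward features or class balance instead.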

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Evaluation metrics are essential for assessing the performance of machine learning models. They provide insights into how well a model is performing and can have significant implications for businesses. Let's discuss each evaluation metric and its indication towards business, along with the business impact of the ML model:

Accuracy:

Indication Towards Business: Accuracy represents the overall correctness of predictions. It is the ratio of correctly predicted instances to the total instances in the dataset.

Business Impact: On an imbalanced dataset, accuracy on its own is misleading. In the reported test split there are 66,699 class 0 customers and 9,523 class 1 customers, so a model that always predicts "not interested" would already reach about 87.5% accuracy without finding a single interested customer. Accuracy is therefore only useful alongside the class 1 metrics below.

Precision (for Class 0 and Class 1):

Indication Towards Business: Precision measures how reliable the predictions for a class are. It is the ratio of correctly predicted instances of a class to all instances predicted as that class.

Business Impact: High precision matters when false positives are costly. In the context of insurance, high precision for class 1 (interested customers) means that marketing efforts are targeted effectively, reducing marketing costs and increasing conversion rates.

Recall (for Class 0 and Class 1):

Indication Towards Business: Recall measures the ability of the model to identify all relevant instances. It is the ratio of correctly predicted instances of a class to the total actual instances of that class.

Business Impact: High recall matters when missing a relevant instance is costly. Here, every interested customer the model misses is a lost cross-sell opportunity.

F1-Score (for Class 0 and Class 1):

Indication Towards Business: F1-score is the harmonic mean of precision and recall. It provides a balance between the two, especially when dealing with imbalanced datasets.

Business Impact: F1-score is relevant when there is a trade-off between false positives and false negatives. A high F1-score for class 1 means the model finds interested customers without flooding the campaign with uninterested ones.

In general, the business impact of the ML model used depends on the specific goals and context of the application:

High precision and recall for class 1 can lead to cost savings, increased efficiency, and customer satisfaction.

Low false positives and false negatives minimize wasted marketing spend and missed sales.

The choice of evaluation metrics should align with the business objectives and priorities, considering the real-world impact of model predictions.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
# Create an instance of the RandomForestClassifier with desired hyperparameters
model_3 = RandomForestClassifier(n_estimators=100, random_state=42)  # You can adjust hyperparameters as needed
# Fit the model on the training data
model_3.fit(X_train, y_train)

# Predict on the test data
y_pred_3 = model_3.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Model Description:
In this implementation, we used a Random Forest Classifier, which is an ensemble learning method based on decision trees. The Random Forest algorithm creates a multitude of decision trees during training and combines their predictions to achieve robust and accurate results. It is a versatile model capable of handling both classification and regression tasks.

Performance Evaluation:
We evaluated the Random Forest Classifier using various metrics, including precision, recall, F1-score, and accuracy. The following is the classification report for Model 3:

Classification Report for Model 3 (Random Forest Classifier):
              precision    recall  f1-score   support

           0       0.89      0.94      0.92     66699
           1       0.30      0.18      0.23      9523

    accuracy                           0.85     76222
   macro avg       0.60      0.56      0.57     76222
weighted avg       0.81      0.85      0.83     76222
Precision: The precision for class 0 is 0.89, which means that when the model predicts a customer is not interested in vehicle insurance, it is correct 89% of the time. The precision for class 1 is 0.30, indicating that when the model predicts a customer is interested in vehicle insurance, it is correct 30% of the time.

Recall: The recall for class 0 is 0.94, suggesting that the model correctly identifies 94% of actual customers who are not interested in vehicle insurance. The recall for class 1 is 0.18, indicating that the model captures 18% of actual customers who are interested in vehicle insurance.

F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall. In this case, the F1-score for class 0 is 0.92, and for class 1, it is 0.23.

Accuracy: The overall accuracy of the model is 85%, indicating the percentage of correct predictions out of the total predictions.

Business Impact:
The Random Forest Classifier performs on par with the Decision Tree rather than clearly better: for class 1 its precision is 0.30 versus 0.29, its recall is 0.18 versus 0.21, and its F1-score is 0.23 versus 0.24. The model is still weak at predicting customers interested in vehicle insurance (class 1), which suggests that there is room for further optimization or exploration of different models.

The impact on the business includes the ability to more accurately target potential customers for vehicle insurance, potentially reducing marketing costs and increasing conversion rates. However, the model's current performance may not be sufficient to make critical business decisions. Further tuning, feature engineering, or other models may be explored to enhance its performance.

In [None]:
# Visualizing evaluation Metric Score chart
# Classification report data
report = classification_report(y_test, model_3.predict(X_test), output_dict=True)

# Extract metrics for class 0 and class 1
class_0 = report['0']
class_1 = report['1']

# Metric names
metrics = ['Precision', 'Recall', 'F1-score', 'Accuracy']

# Values for class 0 and class 1; accuracy is a single overall score, so both classes share it
values_0 = [class_0['precision'], class_0['recall'], class_0['f1-score'], report['accuracy']]
values_1 = [class_1['precision'], class_1['recall'], class_1['f1-score'], report['accuracy']]

# Number of metrics
num_metrics = len(metrics)
x = range(num_metrics)

# Bar width
bar_width = 0.35

# Create subplots
fig, ax = plt.subplots()

# Create bars for class 0, shifted left of each tick so they do not overlap class 1
rects_0 = ax.bar([i - bar_width / 2 for i in x], values_0, bar_width, label='Class 0')

# Create bars for class 1, shifted right of each tick
rects_1 = ax.bar([i + bar_width / 2 for i in x], values_1, bar_width, label='Class 1')

# Set the labels and title
ax.set_xlabel('Metrics')
ax.set_ylabel('Score')
ax.set_title('Evaluation Metric Score Chart for Model 3 (Random Forest Classifier)')
ax.set_xticks(list(x))
ax.set_xticklabels(metrics)
ax.legend()

# Add the values on top of the bars
for rect in rects_0:
    height = rect.get_height()
    ax.annotate(f'{height:.2f}', xy=(rect.get_x() + rect.get_width() / 2, height), xytext=(0, 3),
                textcoords="offset points", ha='center', va='bottom')
for rect in rects_1:
    height = rect.get_height()
    ax.annotate(f'{height:.2f}', xy=(rect.get_x() + rect.get_width() / 2, height), xytext=(0, 3),
                textcoords="offset points", ha='center', va='bottom')

# Display the chart
plt.tight_layout()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_depth': [10, 30, 50, 70, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Create the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

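# Create the RandomizedSearchCV object with a small iteration budget; score on F1 (as for Models 1 and 2)
# because the default accuracy scoring is dominated by the majority class
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=5, cv=3, verbose=2,
                               random_state=42, n_jobs=-1, scoring='f1')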

# Fit the RandomizedSearchCV model to the data
rf_random.fit(X_train, y_train)

# Get the best parameters
best_params = rf_random.best_params_
print("Best Hyperparameters found by RandomizedSearchCV:")
print(best_params)

# Predict on the test data using the best model
y_pred_model3_tuned = rf_random.predict(X_test)

##### Which hyperparameter optimization technique have you used and why?

I used the RandomizedSearchCV hyperparameter optimization technique for tuning the hyperparameters of the Random Forest model. I chose RandomizedSearchCV because it offers a good balance between exploration and exploitation of the hyperparameter space. It randomly samples a defined number of combinations of hyperparameters, making it computationally less expensive compared to GridSearchCV, which exhaustively searches through all possible combinations.

RandomizedSearchCV also allows you to specify a budget of iterations, making it suitable for cases where you want to limit the optimization process to save time and resources. It's a practical choice for quickly identifying a set of hyperparameters that can yield good model performance.

The choice of RandomizedSearchCV over other techniques like GridSearchCV or Bayesian Optimization depends on the specific use case, available computing resources, and desired trade-offs between exploration and exploitation in the hyperparameter search space.
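The iteration budget can be inspected before any model is fitted. The sketch below uses scikit-learn's `ParameterSampler`, which draws candidates in the same way as `RandomizedSearchCV`, to list the five combinations that a budget of `n_iter=5` would try. The search space mirrors the Model 3 cell, with `'auto'` left out of `max_features` on the assumption that a recent scikit-learn version is installed.

```python
from sklearn.model_selection import ParameterSampler

# Search space modelled on the Model 3 tuning cell ('auto' omitted, see lead-in)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [10, 30, 50, 70, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

# Draw the candidates a 5-iteration randomized search would evaluate
sampled = list(ParameterSampler(param_grid, n_iter=5, random_state=42))

for i, candidate in enumerate(sampled, start=1):
    print(f"Candidate {i}: {candidate}")

# A fixed integer random_state makes the draw reproducible
resampled = list(ParameterSampler(param_grid, n_iter=5, random_state=42))
print("Reproducible draw:", sampled == resampled)
```

Listing the candidates first is a cheap way to check whether the budget covers the settings that matter, for example whether any shallow `max_depth` value was drawn at all.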

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

I compared the performance of the Random Forest model before and after hyperparameter tuning. The classification reports for both scenarios are shown below.

Before Hyperparameter Tuning (Random Forest):

              precision    recall  f1-score   support

           0       0.89      0.94      0.92     66699
           1       0.30      0.18      0.23      9523

    accuracy                           0.85     76222
   macro avg       0.60      0.56      0.57     76222
weighted avg       0.81      0.85      0.83     76222

After Hyperparameter Tuning (Random Forest):

              precision    recall  f1-score   support

           0       0.89      0.93      0.91     66699
           1       0.29      0.21      0.24      9523

    accuracy                           0.84     76222
   macro avg       0.59      0.57      0.58     76222
weighted avg       0.82      0.84      0.83     76222

The hyperparameter tuning process did not result in a significant improvement in the model's performance. For class 1, recall rose from 0.18 to 0.21 and F1-score from 0.23 to 0.24, while precision slipped from 0.30 to 0.29 and overall accuracy from 0.85 to 0.84. These differences are small, which suggests that the hyperparameters selected through RandomizedSearchCV did not have a substantial impact on the model's ability to predict the positive class (Response = 1).

It's important to note that sometimes hyperparameter tuning may not lead to significant improvements, and the initial choice of hyperparameters in the model may already be close to optimal. In such cases, further optimization efforts may not yield substantial gains in model performance.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Among the three models, we considered the following evaluation metrics for assessing positive business impact:

F1-Score: F1-score is a harmonic mean of precision and recall. It is a crucial metric in this context because it balances the trade-off between correctly identifying potential customers (True Positives) and not mistakenly classifying non-interested customers as interested (False Positives). Maximizing the F1-score helps in identifying interested customers while minimizing false positives, which can lead to cost savings and more efficient marketing efforts.

Precision: Precision is important because it measures the accuracy of identifying interested customers. High precision means that when the model predicts a customer is interested, it's more likely to be accurate. This is crucial for marketing strategies to ensure that resources are used effectively to target genuinely interested customers.

Recall: Recall is essential to measure the model's ability to capture all interested customers. High recall indicates that the model identifies a significant portion of interested customers. This is important to prevent missed opportunities and ensure that potentially valuable customers are not overlooked.

These metrics are considered for their direct impact on marketing strategies and business outcomes. Maximizing the F1-score, precision, and recall helps in optimizing marketing campaigns, reducing costs, and improving the overall effectiveness of the insurance marketing strategy.
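The trade-off between precision and recall can be made visible by moving the decision threshold on the predicted probability of class 1. The sketch below is an illustration on synthetic data, not the project's result: the dataset comes from `make_classification`, and the thresholds 0.2, 0.3 and 0.5 are arbitrary choices. For a binary Random Forest, `predict` corresponds to a cut-off of about 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the insurance dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

rows = []
for threshold in (0.2, 0.3, 0.5):
    pred = (proba >= threshold).astype(int)
    rows.append({
        "threshold": threshold,
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "f1": f1_score(y_te, pred, zero_division=0),
    })

for row in rows:
    print(row)
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision usually falls. Which point on that curve is best depends on the cost of a wasted contact relative to a missed sale.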

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

After evaluating the three machine learning models, we chose the ML Model 3, which is a Random Forest Classifier with hyperparameter optimization (using RandomizedSearchCV), as our final prediction model. Here are the reasons for this choice:

Performance: After tuning, ML Model 3 reached a class 1 precision of 0.29, recall of 0.21, and F1-score of 0.24. This matches the tuned Decision Tree (Model 2). Compared with the initial scores of Model 1 (precision 0.37, recall 0.13, F1-score 0.20), it gives up some precision for higher recall and a higher F1-score.

Hyperparameter Tuning: ML Model 3 was tuned with RandomizedSearchCV. The gain was small (class 1 F1-score from 0.23 to 0.24), so the choice does not rest on tuning alone.

Ensemble Learning: Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. This ensemble approach tends to be more robust and less prone to overfitting compared to Decision Trees (used in ML Model 2).

Balanced Evaluation Metrics: Precision and recall for class 1 are closer to each other in ML Model 3 than in Model 1, which is what lifts its F1-score. This balance is essential for effective marketing strategies, as it limits false positives while capturing more interested customers.

Business Impact: More balanced class 1 metrics mean less marketing spend on uninterested customers and fewer missed interested ones. At the current scores most interested customers are still missed (recall 0.21), so the model should support campaign decisions rather than replace them.

Based on its comparable scores, ensemble robustness, and more balanced class 1 metrics, we selected ML Model 3 as our final prediction model for this insurance marketing campaign.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

We have used a Random Forest Classifier as the final prediction model. Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. It is known for its high accuracy, robustness, and ability to handle complex relationships in the data. Here's an explanation of the model and the feature importance using a model explainability tool:

Random Forest Classifier:

Random Forest is an ensemble learning method that constructs multiple decision trees during training and combines their predictions to make more accurate and robust classifications.
Each tree in the forest is constructed from a random subset of the data and a random subset of features. This randomness helps reduce overfitting and makes the model more generalizable.
Random Forest handles both classification and regression tasks effectively and is widely used in various domains, including insurance.
Feature Importance:
Random Forest provides a feature importance score, which helps us understand which features (independent variables) had the most impact on the model's predictions. Feature importance is calculated based on how much each feature contributes to the reduction in impurity or error in the decision tree nodes across all the trees in the forest.
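As a minimal sketch of how both kinds of feature importance can be obtained, the block below fits a Random Forest on synthetic data and compares the impurity-based scores (`feature_importances_`) with scikit-learn's `permutation_importance`, computed on held-out data. The data, the feature names `feature_0` to `feature_5`, and the choice of `scoring='f1'` are assumptions for illustration. In the notebook, the fitted final model and `X_test`, `y_test` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the insurance dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3, weights=[0.88, 0.12], random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Impurity-based importance: mean decrease in impurity across all trees
impurity = pd.Series(model.feature_importances_, index=feature_names)

# Permutation importance: drop in F1 on held-out data when one feature is shuffled
perm = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42, scoring='f1')
permutation = pd.Series(perm.importances_mean, index=feature_names)

importance_table = pd.DataFrame({'impurity': impurity, 'permutation': permutation})
importance_table = importance_table.sort_values('impurity', ascending=False)
print(importance_table.round(3))
```

Impurity-based scores are computed on the training data and can favor features with many distinct values, so checking them against permutation importance on the test set is a useful safeguard.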

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
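The two cells above are left empty in the notebook. One way to fill them is a `joblib` round trip, sketched here on a small stand-in model. The file name `final_model.joblib`, the synthetic data, and the temporary directory are assumptions. In the notebook, the tuned estimator (for example `rf_random.best_estimator_`) and a permanent path would be used instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model; the project would save its tuned Random Forest instead
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(X_demo[:250], y_demo[:250])
X_unseen = X_demo[250:]  # rows the model was not trained on

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "final_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)          # save the fitted model
    loaded_model = joblib.load(model_path)  # load it back for the sanity check

# The reloaded model should reproduce the original predictions exactly
original_pred = model.predict(X_unseen)
loaded_pred = loaded_model.predict(X_unseen)
print("Predictions match:", np.array_equal(original_pred, loaded_pred))
```

A saved model should be loaded with the same scikit-learn version that produced it, since the serialized format is not guaranteed to be compatible across releases.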

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we conducted a comprehensive analysis of a dataset related to vehicle insurance and developed predictive models to determine customer interest in purchasing insurance. We built three machine learning models (Logistic Regression, Decision Tree, and Random Forest) to assess which one best predicts customer responses. Here are the key findings and conclusions from our project:

Data Analysis and Preprocessing:

We performed extensive data exploration, examining features such as age, annual premium, and vehicle age.
We identified the presence of categorical data, missing values, and outliers in the dataset.
Data preprocessing techniques were applied to address these issues, including one-hot encoding for categorical variables, handling missing values, and outlier detection and treatment.
Model Development:

We developed three machine learning models, each with its own strengths and characteristics.
Model 1: A Logistic Regression model
Model 2: A Decision Tree Classifier
Model 3: A Random Forest Classifier
Model Evaluation:

We assessed the models' performance using various evaluation metrics, such as precision, recall, F1-score, and accuracy.
The evaluation results for each model were visualized and presented in classification reports, confusion matrices, and ROC curves.
Hyperparameter Tuning:

Hyperparameter optimization techniques, such as RandomizedSearchCV, were applied to fine-tune the models.
The best hyperparameters for Model 2 and Model 3 were determined.
Handling Imbalanced Data:

The dataset exhibited class imbalance, with significantly more records in one class.
To address this issue, we employed resampling techniques such as oversampling and undersampling.
Model Selection:

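We compared the three models based on their evaluation metrics and feature importance scores.
Random Forest (Model 3) was selected as the final prediction model: after tuning, its class 1 scores matched the Decision Tree's (F1-score 0.24 for both), and as an ensemble it is less prone to overfitting than a single tree.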
Business Impact:

We discussed the business impact of the selected model, emphasizing its potential to help the insurance company identify customers interested in purchasing vehicle insurance more accurately.
The model can optimize marketing efforts and increase revenue.
Feature Importance:

We used a model explainability tool (Permutation Importance) to identify and rank the most important features that influence a customer's interest in insurance.
Recommendations:

We recommended the Random Forest model (Model 3) as the starting point for real-world applications, as it combined balanced class 1 precision and recall with the robustness of an ensemble.
We advised the insurance company to focus on customers with specific characteristics, as highlighted by feature importance analysis, to enhance their marketing strategies.

In summary, this project demonstrated how machine learning can be applied to a real-world problem, helping an insurance company improve its customer targeting and increase business efficiency. The Random Forest model and its feature importance analysis give the company a basis for informed targeting decisions, although a class 1 recall of 0.21 shows that further work (feature engineering, resampling, or other algorithms) is needed before the model is relied on for critical decisions.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***