<a href="https://colab.research.google.com/github/Prashikdhole/Health_Insurance_Cross-_sale_classification/blob/main/Capstone_Project_Supervised_ML_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - HEALTH INSURANCE CROSS SALE PREDICTION

##### **Project Type**    - Classification
##### **Contribution**    - Team
##### **Team Member 1 -** Dipak Patle
##### **Team Member 2 -** Prashik Dhole

# **Project Summary -**

Cross-selling identifies products or services that satisfy additional, complementary needs that are unfulfilled by the original product that a customer possesses. As an example, a mouse could be cross-sold to a customer purchasing a keyboard. Often, cross-selling points users to products they would have purchased anyway; by showing them at the right time, a store ensures it makes the sale.

Cross-selling is prevalent in various domains and industries including banks. For example, credit cards are cross-sold to people registering a savings account. In ecommerce, cross-selling is often utilized on product pages, during the checkout process, and in lifecycle campaigns. It is a highly-effective tactic for generating repeat purchases, demonstrating the breadth of a catalog to customers. Cross-selling can alert users to products they didn't previously know you offered, further earning their confidence as the best retailer to satisfy a particular need.


# **GitHub Link -**

# **Problem Statement**


Your client is an insurance company that has provided health insurance to its customers. It now needs your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in the vehicle insurance the company provides. An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer pays regularly to the insurance company for this guarantee. For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000, so that if you fall ill and need to be hospitalised in that year, the insurer bears the cost of hospitalisation up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. There may be 100 customers like you, each paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised in that year. In this way everyone shares the risk of everyone else.
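The arithmetic behind this risk-sharing example can be written out as a rough sketch. It uses only the figures quoted in the paragraph and assumes, as an upper bound, that every hospitalised customer claims the full cover; the helper name `worst_case_payout` is hypothetical.

```python
# Worked version of the risk-pooling example, using the figures from the text.
customers = 100           # policyholders paying into the pool
premium = 5_000           # Rs. per customer per year
cover = 200_000           # Rs. maximum payout per hospitalised customer

pool = customers * premium  # total premiums collected in the year


def worst_case_payout(claims: int) -> int:
    """Payout if every claimant uses the full cover (an upper bound)."""
    return claims * cover


for claims in (2, 3):
    payout = worst_case_payout(claims)
    print(f"{claims} claims: pool Rs. {pool:,}, worst-case payout Rs. {payout:,}, "
          f"balance Rs. {pool - payout:,}")
```

Under this worst-case assumption the pool covers two full claims but not three, which is why the insurer cares about estimating how many customers are likely to claim.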

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the insurer provides compensation (called 'sum assured') to the customer. Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company, because it can then plan its communication strategy to reach out to those customers and optimise its business model and revenue.

To predict whether a customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/TRAIN-HEALTH INSURANCE CROSS SELL PREDICTION.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_value_counts = int(df.duplicated().sum())

print("Number of duplicate rows:", duplicate_value_counts)

In [None]:
# Preview the dataset with duplicate rows removed (df itself is not modified)
display(df.drop_duplicates())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
#visualizing the missing values
sns.heatmap(df.isnull(),cmap = 'viridis',cbar=False)

### What did you know about your dataset?

1. The dataset contains 381,109 rows and 12 columns.

2. No duplicate values are present in the dataset.

3. No missing values are present in the dataset.
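The three checks above can be gathered into one small helper. This is a minimal sketch on a made-up frame; the function name `summarize` and the toy data are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> dict:
    """Return the row/column counts plus duplicate and missing-value totals."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_values": int(frame.isnull().sum().sum()),
    }


# Tiny hypothetical frame standing in for the real dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "Age": [25.0, None, 40.0, 40.0],
})
print(summarize(toy))
```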

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### **Variables Description**

**id** : Unique ID for the customer

**Gender** : Gender of the customer

**Age** : Age of the customer

**Driving_License**: 0 = Customer does not have DL, 1 = Customer already has DL

**Region_Code** : Unique code for the region of the customer

**Previously_Insured** : 1 = Customer already has Vehicle Insurance, 0 = Customer doesn't have Vehicle Insurance

**Vehicle_Age** : Age of the Vehicle

**Vehicle_Damage**: 1 = Customer got his/her vehicle damaged in the past. 0 = Customer didn't get his/her vehicle damaged in the past.

**Annual_Premium** : The amount customer needs to pay as premium in the year

**Policy_Sales_Channel** : Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.

**Vintage** : Number of Days, Customer has been associated with the company

**Response** : 1 = Customer is interested, 0 = Customer is not interested

### Check Unique Values for each variable

In [None]:
for column in df.columns:
  unique_values = df[column].unique()

  print(f"Unique values for {column}:{unique_values}")

In [None]:
df.nunique()

## 3. ***Data Wrangling***

In [None]:
df.head(10)


In [None]:
df[df.duplicated()]

In [None]:
df.isnull().sum()

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Univariate Analysis**

#### Chart - 1

In [None]:
# Dependent variable 'Response'
plt.figure(figsize = (8,7))
sns.set_theme(style = 'whitegrid')
sns.countplot(x = 'Response', data = df)

##### 1. Why did you pick the specific chart?

##### Answer - I chose this because a Seaborn countplot is ideal for visually highlighting the differences between two values of a variable.

##### 2. What is/are the insight(s) found from the chart?

##### Answer - Most people choose not to buy vehicle insurance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

##### Answer - The plot shows that only a small share of customers are interested in purchasing vehicle insurance, which limits the immediate cross-sell audience and is a negative signal for the business.

#### Chart - 2

In [None]:
#Distribution of Age
plt.figure(figsize = (15,8))
sns.countplot(x=df['Age'],data=df)

##### 1. Why did you pick the specific chart?

##### Answer - I chose a Seaborn countplot because it shows how many customers there are at each age, which gives an overview of how age is distributed across the dataset.

##### 2. What is/are the insight(s) found from the chart?

##### Answer - Based on the plot, it is evident that the age range of 22-25 has the highest number of customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

##### Answer - Customers aged 22-25 form the largest group in the dataset, suggesting that targeting this age group can contribute to business growth.

#### Chart - 3

In [None]:
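# Insured and uninsured customers
plt.figure(figsize=(7,9))
insured_counts = df['Previously_Insured'].value_counts()
# value_counts() sorts by frequency, so derive the labels from its index (1 = insured, 0 = not insured)
insured_labels = insured_counts.index.map({1: 'Insured', 0: 'Not insured'})
plt.pie(insured_counts, autopct='%.0f%%', shadow=True, startangle=200, explode=[0.01,0])
plt.legend(labels=insured_labels)
plt.show()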

##### 1. Why did you pick the specific chart?

##### Answer - I chose a pie chart, created using the Matplotlib library, because it is a useful tool for visually representing the proportion of different values within a variable.

##### 2. What is/are the insight(s) found from the chart?

##### Answer - The pie chart shows the share of customers who already have vehicle insurance versus those who do not. Customers without prior coverage are the ones who can still opt in for vehicle insurance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

##### Answer - By observing the Pie Chart, the company can identify an opportunity to specifically target customers who do not have previous vehicle insurance. These customers are more likely to be open to considering and opting in for vehicle insurance. Therefore, the company can focus its efforts on actively reaching out to this particular group of customers.

#### Chart - 4

In [None]:
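plt.figure(figsize=(15,9))
a=df['Annual_Premium']
# sns.distplot is deprecated; histplot with a KDE overlay draws the same kind of distribution plot
sns.histplot(a, kde=True, color='purple')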

##### 1. Why did you pick the specific chart?

##### Answer - I chose a distribution plot because it shows how data is distributed or spread out. It is particularly useful for continuous variables.

##### 2. What is/are the insight(s) found from the chart?

##### Answer - Most customers have annual premiums centered around 50,000.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

##### Answer - The plot indicates that customers who chose vehicle insurance are more inclined towards higher premium options. This is advantageous for the business as it leads to increased revenue.

#### Chart - 5

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(df['Annual_Premium'])

**The boxplot above shows that Annual_Premium has many outliers.**
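A boxplot with default settings marks points beyond 1.5 times the interquartile range from the quartiles as outliers. The sketch below applies the same rule by hand; the function `iqr_bounds` and the sample premiums are hypothetical and not taken from the dataset.

```python
import numpy as np


def iqr_bounds(values, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical premiums, not taken from the dataset.
premiums = np.array([10, 20, 30, 40, 50, 60, 70, 80, 1000], dtype=float)
lower, upper = iqr_bounds(premiums)
outliers = premiums[(premiums < lower) | (premiums > upper)]
print(lower, upper, outliers)
```

Applied to `df['Annual_Premium']`, the same fences would give a count of the outliers the boxplot draws, which could inform whether to cap or keep them.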

#### Chart - 6

In [None]:
plt.figure(figsize=(5,7))
sns.countplot(x=df['Vehicle_Damage'])

##### 1. Why did you pick the specific chart?

##### Answer - I chose this because a Seaborn countplot is ideal for visually highlighting the differences between two values of a variable.

##### 2. What is/are the insight(s) found from the chart?

Answer - Customers whose vehicles were damaged in the past are more likely to buy insurance than customers whose vehicles were not damaged.

#### Chart - 7

In [None]:
df['Vehicle_Age'].hist();

##### 1. Why did you pick the specific chart?

##### Answer - A histogram, created using the matplotlib library, is a useful tool for visualizing the distribution of values for a specific variable.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows that most customers have vehicles up to 2 years old and very few have vehicles older than 2 years.

### **Bivariate Analysis**

#### Chart - 8

In [None]:
# Age vs Response
plt.figure(figsize=(16,8))
sns.countplot(data=df, x='Age', hue='Response', palette='CMRmap_r')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

##### Answer - I chose a Seaborn countplot with `hue='Response'` because it shows, for each age, how many customers responded positively and how many did not.

##### 2. What is/are the insight(s) found from the chart?

The countplot shows that people aged 31 to 50 are more likely to respond, while young people below 30 are not interested in vehicle insurance.
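Raw counts depend on how many customers fall in each age, so a response rate per age band can make the comparison easier to read. A minimal sketch on made-up rows; the band edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical rows; the column names follow the dataset.
toy = pd.DataFrame({
    "Age":      [22, 24, 28, 35, 42, 48, 55, 63],
    "Response": [0,  0,  0,  1,  1,  0,  0,  0],
})

# Bucket ages, then take the mean of the 0/1 target to get a response rate.
bands = pd.cut(toy["Age"], bins=[19, 30, 50, 85], labels=["20-30", "31-50", "51-85"])
rate_by_band = toy.groupby(bands, observed=True)["Response"].mean()
print(rate_by_band)
```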

#### Chart - 9

In [None]:
# Gender vs Response: count rows per (Gender, Response) pair and stack the bars
df.groupby(['Gender', 'Response']).size().unstack().plot(kind='bar', stacked=True)


##### 1. Why did you pick the specific chart?

##### Answer - I chose a stacked bar chart because, once the grouped counts are unstacked, it shows the proportion of each Response value within each Gender category.

##### 2. What is/are the insight(s) found from the chart?

The bar plot shows that there are slightly more male customers than female customers, and that males are also slightly more likely to buy insurance.

#### Chart - 10

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='Vehicle_Age', hue='Response', palette='Dark2_r')
plt.xlabel('Vehicle Age', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Vehicle Age and Customer Response analysis', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

##### Answer - I chose a Seaborn countplot with `hue='Response'` because it compares the number of positive and negative responses within each vehicle-age category.

##### 2. What is/are the insight(s) found from the chart?

The countplot shows that customers with vehicles aged 1-2 years are more likely to be interested in buying insurance than the other two groups, while customers with vehicles less than 1 year old are very unlikely to buy insurance.

#### Chart - 11

In [None]:
sns.barplot(x='Response', y='Annual_Premium', data=df)

##### 1. Why did you pick the specific chart?

##### Answer - I chose a Seaborn barplot because it compares the mean of a numerical variable (Annual_Premium) across the values of a categorical variable (Response).

##### 2. What is/are the insight(s) found from the chart?

The bar chart shows that people who respond have a slightly higher annual premium.

#### Chart - 12

In [None]:
plt.figure(figsize=(20,8))
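# numeric_only=True skips the string columns (Gender, Vehicle_Age, Vehicle_Damage), which newer pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True)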

##### 1. Why did you pick the specific chart?

Answer - I chose this because a correlation plot visually represents the relationships between variables in a dataset. It displays the correlation of each variable with itself and with the other columns using a heatmap of colors.

##### 2. What is/are the insight(s) found from the chart?

The correlation plot shows that the target variable is barely affected by the Vintage variable. We can drop the least correlated variables.
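One way to turn this reading of the heatmap into a list of candidates is to rank features by their absolute correlation with the target. The sketch below uses a made-up frame, and the 0.1 cutoff is an arbitrary assumption rather than a recommended value.

```python
import pandas as pd

# Hypothetical numeric frame; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Previously_Insured": [1, 1, 0, 0, 1, 0, 0, 1],
    "Vintage":            [20, 180, 100, 150, 100, 50, 60, 140],
    "Response":           [0, 0, 1, 1, 0, 1, 0, 0],
})

# Absolute Pearson correlation of every feature with the target.
target_corr = toy.corr(numeric_only=True)["Response"].drop("Response").abs()
weak = target_corr[target_corr < 0.1].index.tolist()  # 0.1 is an arbitrary cutoff
print(target_corr.sort_values(ascending=False))
print("Candidates to drop:", weak)
```

Pearson correlation only captures linear association, so a low value is a hint to investigate rather than proof that a feature carries no signal.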

#### Chart - 13

In [None]:
sns.pairplot(df)

##### 1. Why did you pick the specific chart?

##### Answer - The Seaborn Pairplot is a useful tool for visualizing relationships between variables in a dataset. This makes it easier to interpret and understand the data, as it condenses a large amount of information into a single figure.

##### 2. What is/are the insight(s) found from the chart?

##### Answer - By generating scatterplots the Pairplot function in Seaborn allowed us to visually explore and understand the relationships between different columns in the dataset.

# **Conclusion of EDA**

1. Most customers are between 21 and 25 years old. Few customers are above the age of 60.

2. 54% of customers are previously insured and 46% are not yet insured. Customers who are not previously insured are more likely to buy insurance.

3. Customers whose vehicles were damaged in the past are more likely to buy insurance than customers whose vehicles were not damaged.

4. Most customers have vehicles up to 2 years old and very few have vehicles older than 2 years.

5. People aged 31 to 50 are more likely to respond, while young people below 30 are not interested in vehicle insurance.

6. There are slightly more male customers than female customers, and males are also slightly more likely to buy insurance.

7. Customers with vehicles aged 1-2 years are more likely to be interested in buying insurance than the other two groups, while customers with vehicles less than 1 year old are very unlikely to buy insurance.

8. People who respond have a slightly higher annual premium.

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

### HYPOTHESIS : **Males are more interested in buying vehicle insurance.**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer -
***Null Hypothesis:*** No, Males aren't more interested in buying Vehicle Insurance

***Alternative Hypothesis:*** Yes, Males are more interested in buying Vehicle Insurance

#### 2. Perform an appropriate statistical test.

In [None]:
hyp_data = pd.crosstab(df['Response'],df['Gender'],margins=False)
hyp_data

### **Set significance level to 0.05.**

In [None]:
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

stats,p,dof,expected = chi2_contingency(hyp_data)

In [None]:
p

### The p-value is smaller than the significance level, so we reject the null hypothesis in favour of the alternative hypothesis.

##### Which statistical test have you done to obtain P-Value?

##### Answer - We used the Chi-Square contingency test to determine the statistical significance (p-value) of our hypothesis.

##### Why did you choose the specific statistical test?

##### Answer - The Chi-Square contingency test serves as a basis for statistical inference, allowing us to investigate the relationship between variables based on the observed data. By applying this test, we can determine whether there is a significant association between the variables of interest.
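The chi-square test reports whether Gender and Response are associated, but not which gender responds more. A rough sketch of reading both the p-value and the direction from the same table follows; the counts are invented for illustration and are not the notebook's crosstab.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, not the notebook's real crosstab.
table = pd.DataFrame(
    {"Female": [900, 100], "Male": [850, 150]},
    index=pd.Index([0, 1], name="Response"),
)

chi2, p_value, dof, expected = chi2_contingency(table)

# The chi-square test only says the two variables are associated;
# the response rate per gender shows which group responds more.
response_rate = table.loc[1] / table.sum(axis=0)
print(f"p-value = {p_value:.4f}, dof = {dof}")
print(response_rate)
```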

### Hypothetical Statement - 2

### HYPOTHESIS : **As the number of days a customer is associated with the company increases, the chances that the customer will opt in for vehicle insurance increase.**

Answer -
***Null Hypothesis:*** No, as the number of days a customer is associated with the company increases, the customer isn't more likely to buy vehicle insurance.

***Alternative Hypothesis:*** Yes, as the number of days a customer is associated with the company increases, the customer is more likely to buy vehicle insurance.

#### 2. Perform an appropriate statistical test.

In [None]:
hypo_data = pd.crosstab(df['Response'],df['Vintage'],margins=False)
hypo_data

### **Set significance level to 0.05.**

In [None]:
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

stats,p,dof,expected = chi2_contingency(hypo_data)

In [None]:
p

### The p-value is smaller than the significance level, so we reject the null hypothesis in favour of the alternative hypothesis.

##### Which statistical test have you done to obtain P-Value?

##### Answer - We used the Chi-Square contingency test to determine the statistical significance (p-value) of our hypothesis.

##### Why did you choose the specific statistical test?

##### Answer - The Chi-Square contingency test serves as a basis for statistical inference, allowing us to investigate the relationship between variables based on the observed data. By applying this test, we can determine whether there is a significant association between the variables of interest.
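Vintage is a day count with many distinct values, so crossing it directly with Response gives a wide table with few customers per column. One alternative, which is not the approach used in this notebook, is to bin Vintage into quantile bands before running the same test. The data and the band labels below are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 400 customers with distinct Vintage values and a Response
# that does not depend on Vintage (every 8th customer responds).
toy = pd.DataFrame({
    "Vintage": range(10, 410),
    "Response": [1 if i % 8 == 0 else 0 for i in range(400)],
})

# Four equal-sized Vintage bands instead of one column per distinct day count.
toy["Vintage_band"] = pd.qcut(toy["Vintage"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
table = pd.crosstab(toy["Response"], toy["Vintage_band"])

chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"p-value = {p_value:.3f}, dof = {dof}")
```

In this toy data the response rate is nearly identical in every band, so the p-value is large and the null hypothesis would not be rejected.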

## ***6. Feature Engineering & Data Pre-processing***

### **Encoding Object Columns**

In [None]:
#Changing categorical values to numerical values
df['Gender']=df['Gender'].map({'Female':1, 'Male': 0})
df['Vehicle_Age']=df['Vehicle_Age'].map({'< 1 Year':0, '1-2 Year': 1, '> 2 Years':2})
df['Vehicle_Damage']=df['Vehicle_Damage'].map({'Yes':1, 'No': 0})
df.head()
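`Series.map` with a dictionary returns NaN for any label that is missing from the dictionary, so a typo in a key fails silently. A small guard like the following sketch can catch that early; the helper `encode` is hypothetical and not part of the notebook.

```python
import pandas as pd


def encode(series: pd.Series, mapping: dict) -> pd.Series:
    """Map categories to integers and fail loudly on any unmapped label."""
    encoded = series.map(mapping)
    if encoded.isnull().any():
        unknown = sorted(set(series[encoded.isnull()]))
        raise ValueError(f"Unmapped categories: {unknown}")
    return encoded.astype(int)


vehicle_age_map = {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2}
ok = encode(pd.Series(["< 1 Year", "> 2 Years", "1-2 Year"]), vehicle_age_map)
print(ok.tolist())  # [0, 2, 1]
```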

In [None]:
correlation = df.corr()
correlation['Response'].sort_values(ascending=False)[1:]

In [None]:
X=df.drop(columns=['id', 'Driving_License', 'Policy_Sales_Channel','Vintage', 'Response'])
y=df['Response']

In [None]:
# Fill any NaNs with the column mode
# mode() returns a Series, so take its first value to get a scalar fill value
fill_mode=lambda col: col.fillna(col.mode().iloc[0])
X= X.apply(fill_mode, axis =0)
df=df.apply(fill_mode, axis=0)
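When filling with the mode, it matters whether `fillna` receives a Series or a scalar. The sketch below, on a made-up column, shows the difference: `mode()` returns a Series, and `fillna` aligns a Series argument on the index.

```python
import numpy as np
import pandas as pd

col = pd.Series([3.0, np.nan, 3.0, np.nan, 7.0])

# mode() returns a Series (here a single value at index 0).
# Passing it to fillna aligns on the index, so only a NaN at index 0 would be filled.
aligned = col.fillna(col.mode())

# Taking the first modal value gives a scalar that fills every NaN.
scalar = col.fillna(col.mode().iloc[0])

print(aligned.isnull().sum(), scalar.isnull().sum())
```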

## ***7. Model Building***

In [None]:
# Check for imbalance in data
df['Response'].value_counts()

* We can clearly see that the two classes are heavily imbalanced.
* Standard ML techniques such as Decision Tree and Logistic Regression are biased towards the majority class and tend to ignore the minority class. To address this issue we can use resampling techniques.

In [None]:
# Resampling
ros = RandomOverSampler(random_state=0)
X_new,y_new = ros.fit_resample(X, y)

print("After random oversampling of the minority class, total samples are:", len(y_new))
print("Original dataset shape {}".format(Counter(y)))
print("Resampled dataset shape {}".format(Counter(y_new)))

### ***Splitting the Data into Train and Test Sets***

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, random_state=42, test_size=0.3)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
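When oversampling happens before the split, copies of the same minority row can land in both the training and the test set, which tends to inflate test scores. One variation worth considering is to split first and oversample only the training portion. The sketch below uses `sklearn.utils.resample` in place of `RandomOverSampler` and synthetic data, so every name in it is illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 90 negatives, 10 positives.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Split first so the test set keeps the original class ratio ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# ... then oversample the minority class inside the training set only.
minority = X_tr[y_tr == 1]
majority = X_tr[y_tr == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)

X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))

print("train:", Counter(y_bal.tolist()), "test:", Counter(y_te.tolist()))
# No training row appears in the test set, because X holds unique ids here.
print("overlap:", set(X_bal.ravel()) & set(X_te.ravel()))
```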

In [None]:
# Standardizing the dataset using StandardScaler (fit on train only, then applied to test)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### ***Logistic Regression***

In [None]:
model = LogisticRegression(random_state=42)
model = model.fit(X_train, y_train)
# Making Prediction
pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:,1]

#### Model Evaluation

In [None]:
# Evaluation
r_lgt = recall_score(y_test, pred)
print("recall_score : ", r_lgt)

p_lgt = precision_score(y_test, pred)
print("precision_score : ", p_lgt)

f1_lgt = f1_score(y_test, pred)
print("f1_score : ", f1_lgt)

A_lgt = accuracy_score(y_test, pred)
print("accuracy_score : ", A_lgt)

acu_lgt = roc_auc_score(y_test, pred)
print("ROC_AUC_score : ", acu_lgt)
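`roc_auc_score` accepts either hard labels or scores. With hard labels it evaluates a single threshold, while with predicted probabilities it evaluates the full ranking that the ROC curve is drawn from. A small illustration on invented values:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for six customers.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # hard labels at a 0.5 threshold

auc_from_prob = roc_auc_score(y_true, y_prob)   # uses the full ranking
auc_from_pred = roc_auc_score(y_true, y_pred)   # uses a single operating point
print(auc_from_prob, auc_from_pred)
```

The two values generally differ, so a ROC AUC computed from `predict` output is not directly comparable with the area under a curve plotted from `predict_proba` output.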

In [None]:
fpr, tpr, _= roc_curve(y_test, prob)

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Logistic Regression ROC Curve')
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle='--', color='black')
plt.show()

#### Confusion Matrix

In [None]:
matrix=confusion_matrix(y_test, pred)
print(matrix)
sns.heatmap(matrix, annot=True, fmt='g')

* The confusion matrix shows that the model predicts both the positive and the negative class, with misclassifications in each.
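The scores printed earlier can all be derived from the four cells of a confusion matrix. A minimal sketch with invented counts; the function name is hypothetical.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive the usual scores from the four confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts laid out as scikit-learn does: [[tn, fp], [fn, tp]].
scores = metrics_from_confusion(tn=50, fp=30, fn=5, tp=75)
print(scores)
```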

In [None]:
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, pred))

### ***RandomForest Classifier***

In [None]:
rf_model = RandomForestClassifier()
rf_model = rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
rf_proba = rf_model.predict_proba(X_test)[:,1]

#### Model Evaluation

In [None]:
# Evaluation (print the Random Forest scores, not the Logistic Regression ones)
r_rf = recall_score(y_test, rf_pred)
print("recall_score : ", r_rf)

p_rf = precision_score(y_test, rf_pred)
print("precision_score : ", p_rf)

f1_rf = f1_score(y_test, rf_pred)
print("f1_score : ", f1_rf)

A_rf = accuracy_score(y_test, rf_pred)
print("accuracy_score : ", A_rf)

acu_rf = roc_auc_score(y_test, rf_pred)
print("ROC_AUC_score : ", acu_rf)

In [None]:
fpr, tpr, _= roc_curve(y_test, rf_proba)

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Random Forest ROC Curve')
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle='--', color='black')
plt.show()

#### Confusion Matrix

In [None]:
matrix=confusion_matrix(y_test, rf_pred)
print(matrix)
sns.heatmap(matrix, annot=True, fmt='g')

* The confusion matrix shows that this model is much better at predicting the positive response.

In [None]:
print(classification_report(y_test, rf_pred))

* The model performs very well on the test set, so we can use it to predict unseen data.
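A single train/test split gives one estimate of performance. Cross-validation repeats the evaluation over several folds and shows how stable the score is. The sketch below runs on synthetic data from `make_classification`; the fold count, the number of trees and the `roc_auc` scoring choice are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; the real notebook would pass its own X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.85, 0.15], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=cv, scoring="roc_auc",
)
print(scores.mean(), scores.std())
```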

### ***XGBoost***

In [None]:
XG_model = XGBClassifier()
XG_model = XG_model.fit(X_train, y_train)

XG_pred = XG_model.predict(X_test)
XG_prob = XG_model.predict_proba(X_test)[:,1]

#### Model Evaluation

In [None]:
# Evaluation
r_XG = recall_score(y_test, XG_pred)
print("recall_score : ", r_XG)

p_XG = precision_score(y_test, XG_pred)
print("precision_score : ", p_XG)

f1_XG = f1_score(y_test, XG_pred)
print("f1_score : ", f1_XG)

A_XG = accuracy_score(y_test, XG_pred)
print("accuracy_score : ", A_XG)

acu_XG = roc_auc_score(y_test, XG_pred)
print("ROC_AUC_score : ", acu_XG)

In [None]:
fpr, tpr, _= roc_curve(y_test, XG_prob)

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('XGBoost ROC Curve')
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle='--', color='black')
plt.show()

#### Confusion Matrix

In [None]:
matrix=confusion_matrix(y_test, XG_pred)
print(matrix)
sns.heatmap(matrix, annot=True, fmt='g')

* The confusion matrix shows that the model is a bit better at predicting the positive response.

In [None]:
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, XG_pred))

### ***KNN***

In [None]:
knn = KNeighborsClassifier()


In [None]:
knn.fit(X_train,y_train)

In [None]:
knn_pred = knn.predict(X_test)
knn_proba = knn.predict_proba(X_test)[:,1]


In [None]:
r_knn = recall_score(y_test, knn_pred)
print("recall_score : ", r_knn)

p_knn = precision_score(y_test, knn_pred)
print("precision_score : ", p_knn)

f1_knn = f1_score(y_test, knn_pred)
print("f1_score : ", f1_knn)

A_knn = accuracy_score(y_test, knn_pred)
print("accuracy_score : ", A_knn)

acu_knn = roc_auc_score(y_test, knn_pred)
print("ROC_AUC_score : ", acu_knn)

In [None]:
fpr, tpr, _= roc_curve(y_test, knn_proba)

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('KNN ROC Curve')
plt.plot(fpr,tpr)
plt.plot((0,1), linestyle='--', color='black')
plt.show()

In [None]:
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, knn_pred))

#### Confusion Matrix

In [None]:
matrix=confusion_matrix(y_test, knn_pred)
print(matrix)
sns.heatmap(matrix, annot=True, fmt='g')

### ***Comparing the Models***

In [None]:
# Collect every model's scores into one table, one row per model
com = ['Logistic Regression', 'Randomforest', 'XGBClassifier', 'KNN']
data = {
    'Accuracy': [A_lgt, A_rf, A_XG, A_knn],
    'Recall': [r_lgt, r_rf, r_XG, r_knn],
    'Precision': [p_lgt, p_rf, p_XG, p_knn],
    'f1_score': [f1_lgt, f1_rf, f1_XG, f1_knn],
    'ROC_AUC': [acu_lgt, acu_rf, acu_XG, acu_knn],
}
result = pd.DataFrame(data=data, index=com)
result

# ***Conclusion***

* Variables such as Age, Previously_Insured and Annual_Premium have the most effect on the target variable.
* Comparing the ROC curves, we can see that the Random Forest model performs better: a curve closer to the top-left corner indicates better performance.
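"Closer to the top-left corner" can be expressed as a number by measuring the area under each curve. The sketch below applies the trapezoid rule to invented ROC points; the two example models are hypothetical and do not correspond to the models trained above.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a ROC curve given points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area


# Hypothetical ROC points for two models; both start at (0, 0) and end at (1, 1).
model_a = trapezoid_auc([0.0, 0.1, 0.3, 1.0], [0.0, 0.7, 0.9, 1.0])
model_b = trapezoid_auc([0.0, 0.2, 0.5, 1.0], [0.0, 0.4, 0.7, 1.0])
diagonal = trapezoid_auc([0.0, 1.0], [0.0, 1.0])
print(model_a, model_b, diagonal)
```

The diagonal reference line has an area of 0.5, so a curve that bends further towards the top-left corner yields a larger area.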