<h1>Problem Statement</h1>
<p>An insurance company that has provided Health Insurance to its customers now needs our help in building a model to predict whether last year's policyholders (customers) will also be interested in the Vehicle Insurance it offers.</p>
<p>An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.</p>
<p>For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalized in that year, the insurance company will bear the cost of hospitalization etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalization cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalized that year. This way everyone shares the risk of everyone else.</p>
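
The pooling arithmetic in this example can be checked in a few lines of Python. This is only an illustrative sketch: the figures are the ones quoted in the example, and the variable names are made up for the illustration.

```python
# Risk-pooling arithmetic using the figures from the example.
n_customers = 100
premium = 5_000    # Rs. paid by each customer per year
cover = 200_000    # Rs., maximum payout per hospitalized customer

# Total premium collected from all customers in one year.
pool = n_customers * premium

# Largest number of full-cover claims the pool can pay in that year.
max_full_claims = pool // cover

print(f"Premium pool: Rs. {pool:,}")
print(f"Full-cover claims the pool can absorb: {max_full_claims}")
```

The pool only works because most customers do not claim, and because many claims cost less than the full cover.
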
<p>Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance company so that, in case of an unfortunate accident involving the vehicle, the company will pay compensation (called the ‘sum assured’) to the customer.</p>
<p><b>Building a model to predict whether a customer would be interested in Vehicle Insurance</b> is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimize its business model and revenue.</p>
<p>To predict whether a customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.</p>
<h1>Data Description</h1>
<ul>
    <li><b>id:</b> Unique ID for the customer</li>
    <li><b>Gender:</b> Gender of the customer</li>
    <li><b>Age:</b> Age of the customer</li>
    <li><b>Driving_License:</b> 0 : Customer does not have DL, 1 : Customer already has DL</li>
    <li><b>Region_Code:</b> Unique code for the region of the customer</li>
    <li><b>Previously_Insured: </b> 1: Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance</li>
    <li><b>Vehicle_Age:</b> Age of the Vehicle</li>
    <li><b>Vehicle_Damage:</b> 1 : Customer's vehicle was damaged in the past, 0 : Customer's vehicle was not damaged in the past (stored as Yes/No in the raw data)</li>
    <li><b>Annual_Premium:</b> The amount the customer needs to pay as premium in the year</li>
    <li><b>Policy_Sales_Channel:</b> Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.</li>
    <li><b>Vintage:</b> Number of days the customer has been associated with the company</li>
    <li><b>Response:</b> 1 : Customer is interested, 0 : Customer is not interested</li>

</ul>
<h2>Evaluation Metric</h2>
The evaluation metrics are accuracy and ROC AUC score.


## Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, accuracy_score, roc_curve
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## Load the dataset

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/shuvo14051/datasets/master/Sell-Prediction.csv")
df.head()

Unnamed: 0,id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,1,Male,44,1,28.0,0,> 2 Years,Yes,40454.0,26.0,217,1
1,2,Male,76,1,3.0,0,1-2 Year,No,33536.0,26.0,183,0
2,3,Male,47,1,28.0,0,> 2 Years,Yes,38294.0,26.0,27,1
3,4,Male,21,1,11.0,1,< 1 Year,No,28619.0,152.0,203,0
4,5,Female,29,1,41.0,1,< 1 Year,No,27496.0,152.0,39,0


## EDA

In [3]:
print("Shape of the dataset:",df.shape)
print("Null values?", sum(df.isnull().sum()))

Shape of the dataset: (381109, 12)
Null values? 0


In [4]:
print(df.columns.to_list())

['id', 'Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Response']


In [5]:
print(df.dtypes)

id                        int64
Gender                   object
Age                       int64
Driving_License           int64
Region_Code             float64
Previously_Insured        int64
Vehicle_Age              object
Vehicle_Damage           object
Annual_Premium          float64
Policy_Sales_Channel    float64
Vintage                   int64
Response                  int64
dtype: object


## Take a look at the object-type columns

In [6]:
categorical_df = df.select_dtypes(include=['O'])
print(categorical_df.head(5))

   Gender Vehicle_Age Vehicle_Damage
0    Male   > 2 Years            Yes
1    Male    1-2 Year             No
2    Male   > 2 Years            Yes
3    Male    < 1 Year             No
4  Female    < 1 Year             No


In [7]:
df['Gender'].value_counts()

Male      206089
Female    175020
Name: Gender, dtype: int64

In [8]:
df['Vehicle_Age'].value_counts()

1-2 Year     200316
< 1 Year     164786
> 2 Years     16007
Name: Vehicle_Age, dtype: int64

In [9]:
df['Vehicle_Damage'].value_counts()

Yes    192413
No     188696
Name: Vehicle_Damage, dtype: int64

## Map these object-type columns

In [None]:
# Encode damage so that 1 means "damaged", matching the data description
df['Vehicle_Damage'] = df['Vehicle_Damage'].map({"Yes":1,"No":0})
df['Vehicle_Damage'].value_counts()

In [None]:
df['Gender'] = df['Gender'].map({"Male":0,"Female":1})
df['Gender'].value_counts()

In [None]:
df['Vehicle_Age'] = df['Vehicle_Age'].map({"< 1 Year":0,"1-2 Year":1,"> 2 Years":2})
df['Vehicle_Age'].value_counts()

## Now let's take a look at our converted dataset
After mapping all the categorical variables

In [None]:
df.head()

# Mean encoding
Mean encoding (also known as target encoding) is a technique for encoding categorical variables as numerical variables based on the mean of the target variable for each category.

<p>The basic idea of mean encoding is to replace each category of a categorical variable with the mean value of the target variable for that category. This can help capture the relationship between the categorical variable and the target variable, especially when the categorical variable has a large number of unique categories.</p>
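
As a minimal sketch of the idea on toy data (the `city` and `target` columns are hypothetical and not part of this dataset), each category is replaced by the mean of the target within that category:

```python
import pandas as pd

# Toy data: a hypothetical categorical column and a binary target.
toy = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "B", "C"],
    "target": [1,   0,   1,   1,   0,   0],
})

# Mean of the target within each category.
category_means = toy.groupby("city")["target"].mean()

# Replace each category with its target mean.
toy["city_encoded"] = toy["city"].map(category_means)
print(toy)
```

Because the encoding is computed from the target, it is usually safer to compute the category means on the training rows only and then apply them to validation or test rows.
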

### Gender 

In [None]:
means_g = df.groupby("Gender")["Response"].mean()
print(means_g)
df['gender_encoded'] = df["Gender"].map(means_g)

### Driving_License

In [None]:
means_driving = df.groupby("Driving_License")["Response"].mean()
print(means_driving)
df['driving_license_encoded'] = df["Driving_License"].map(means_driving)

### Previously_Insured

In [None]:
means_Insured = df.groupby("Previously_Insured")["Response"].mean()
# Print the means for this column (not the Driving_License means)
print(means_Insured)
df['Previously_Insured_encoded'] = df["Previously_Insured"].map(means_Insured)

### Vehicle_Damage

In [None]:
Vehicle_Damage= df.groupby("Vehicle_Damage")["Response"].mean()
print(Vehicle_Damage)
df['Vehicle_Damage_encoded'] = df["Vehicle_Damage"].map(Vehicle_Damage)

### Vehicle_Age

In [None]:
Vehicle_Age= df.groupby("Vehicle_Age")["Response"].mean()
print(Vehicle_Age)
df['Vehicle_Age_encoded'] = df["Vehicle_Age"].map(Vehicle_Age)

## Target column
From the value counts it is clear that we have an imbalanced classification problem here: 87.74% of customers are not interested in vehicle insurance and only 12.26% are interested.
So we need to use an undersampling or oversampling technique to address this imbalance.
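
To see why this imbalance matters, consider a trivial classifier that always predicts "not interested". The sketch below uses hypothetical labels with roughly the class proportions quoted above, not the real dataset:

```python
import numpy as np

# Illustrative labels: 8,774 negatives and 1,226 positives
# out of 10,000 hypothetical rows.
y_true = np.array([0] * 8774 + [1] * 1226)

# A trivial "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

accuracy = (y_majority == y_true).mean()
# Recall on the positive class: share of interested customers that were found.
recall_pos = (y_majority[y_true == 1] == 1).mean()

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall (class 1): {recall_pos:.4f}")
```

The accuracy looks high even though no interested customer is identified, which is why accuracy alone can mislead on imbalanced data.
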

In [None]:
print(df['Response'].value_counts())
print(df['Response'].value_counts(normalize=True))

## Over sampling
Synthetic Minority Oversampling Technique (SMOTE) balances a dataset by increasing the number of minority-class cases. It generates new synthetic instances from the existing minority cases rather than duplicating them.
<p>ADASYN (Adaptive Synthetic) is an algorithm that also generates synthetic data. Its main advantages are that it does not copy existing minority samples and that it generates more data for “harder to learn” examples.</p>
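
The interpolation step behind SMOTE can be sketched with NumPy alone. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_like_samples` and the toy points are assumptions made for the example. ADASYN additionally chooses the base points more often where minority samples are surrounded by majority samples, which this sketch does not do.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new sample.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class points in two dimensions.
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
synthetic = smote_like_samples(X_minority, n_new=4)
print(synthetic.shape)
```

Every synthetic point lies on a line segment between two real minority points, so no existing row is copied verbatim.
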

In [None]:
# Drop the target and the row identifier, which carries no predictive signal
X = df.drop(['id', 'Response'], axis=1)
y = df['Response']

In [None]:
# y is already a Series, so no `data` argument is needed
sns.countplot(y=y)
plt.show()

In [None]:
from imblearn.over_sampling import ADASYN
# Fix the seed so the synthetic samples are reproducible
sampling = ADASYN(random_state=42)
# Note: resampling before the train/test split means the test set will also contain synthetic rows
X, y = sampling.fit_resample(X, y)

In [None]:
# Plot the resampled target; it no longer has the same length as df
sns.countplot(y=y)
plt.show()

## Train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, 
                                                    stratify=y, random_state=42)

## Scaling

In [None]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_train_pred = clf.predict(X_train)

## Training vs testing accuracy

In [None]:
print(f"Training score: {accuracy_score(y_train,y_train_pred)*100:.2f}%")
print(f"Testing score: {accuracy_score(y_test,y_pred)*100:.2f}%") 

## Classification report

In [None]:
print(classification_report(y_test,y_pred))

## Confusion matrix

In [None]:
cn = confusion_matrix(y_test,y_pred)
sns.heatmap(cn, annot=True, fmt = 'g')
plt.show()

## AUC ROC score and curve

In [None]:
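# ROC AUC should be computed from predicted probabilities, not hard labels
y_prob = clf.predict_proba(X_test)[:, 1]
auc_roc_score = roc_auc_score(y_test, y_prob)
print(f"{auc_roc_score*100:.2f}")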

In [None]:
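# Use positive-class probabilities so the curve is traced over many thresholds
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])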

# Plot ROC curve
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], '--')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

In [None]:
train_sizes, train_scores, test_scores = learning_curve(clf, X_train, y_train, cv=10, scoring='accuracy')

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(train_sizes, train_mean, label='Training Score')
plt.plot(train_sizes, test_mean, label='Cross-Validation Score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)

plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve')
plt.legend(loc='best')

plt.show()