# **Customer Conversion Prediction**








# ☛ Problem Statement :

> You are working for a new-age insurance company and employ multiple outreach plans to sell term insurance to your customers. Telephonic marketing campaigns still remain one of the most effective ways to reach out to people; however, they incur a lot of cost. Hence, it is important to identify beforehand the customers that are most likely to convert so that they can be specifically targeted via call. We are given the historical marketing data of the insurance company and are required to build an ML model that will predict if a client will subscribe to the insurance.






### **Features:**

*   age (numeric)
*   job : type of job
*   marital : marital status
*   education_qual : education status
*   call_type : contact communication type
*   day: last contact day of the month (numeric)
*   mon: last contact month of year
*   dur: last contact duration, in seconds (numeric)
*   num_calls: number of contacts performed during this campaign and for this client
*   prev_outcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")


# **Output variable (desired target):**
  * y - has the client subscribed to the insurance?




# **Basic Analysis of Dataset from Problem Statement and Features**

*   It is a supervised learning problem: we are predicting a target variable.
*   From the target variable we can see that it is a classification problem.
*   Since the target takes only two values (yes/no), it is a binary classification problem.


# **Importing necessary dependencies**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
#To ignore warnings
import warnings
warnings.filterwarnings("ignore")

# **Loading Dataset**

In [None]:
df=pd.read_csv("/train.csv")
pd.set_option('display.max_columns',None)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Checking size of dataset
print("Data set size : ", df.shape)

In [None]:
#Fetching top 5 row in dataset
df.head()

In [None]:
#Fetching Bottom 5 rows
df.tail()

In [None]:
#finding the column names
df.columns

In [None]:
#Basic statistical analysis of dataset
df.describe()

The statistical summary shows the count, mean, standard deviation, min, max, and percentiles (25%, 50%, 75%) of each numeric column.

In [None]:
#checking for the data is balanced or not
df['y'].value_counts()

From the above result we can clearly see that the dataset is imbalanced. Let's find the percentage of each class.

In [None]:
#Finding the percentage of the data
print('Percentage for "no": ',((39916) / (45205)) * 100 )
print('Percentage for "yes": ',((5289) / (45205)) * 100 )

From the above result, the majority class "no" makes up about 88.3% of the rows and the minority class "yes" about 11.7%.
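
The same percentages can be read straight from the data instead of typing the counts by hand. A minimal sketch, assuming a small made-up `y` column (the values below are illustrative only, not the real dataset):

```python
import pandas as pd

# Illustrative stand-in for df['y']; in the notebook the column comes from train.csv.
y = pd.Series(["no"] * 9 + ["yes"] * 1, name="y")

# normalize=True returns class shares instead of raw counts.
class_share = (y.value_counts(normalize=True) * 100).round(2)
print(class_share)
```

In the notebook this would be `df['y'].value_counts(normalize=True) * 100`, which stays correct even after rows are dropped.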

# **Data Preprocessing**
# **Data Cleaning**

Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset to improve its quality and ensure that it is ready for analysis. It involves tasks such as handling missing or duplicate data, correcting data types, and removing outliers or irrelevant information.

### **Missing Values**

In [None]:
#checking for null values
df.isnull().sum()

### **Finding Duplicate Values**

In [None]:
#checking for no of duplicate values
df.duplicated().sum()

The result above shows 6 duplicate rows, so we will drop them.

In [None]:
#dropping duplicates
df = df.drop_duplicates()

In [None]:
df.shape

In [None]:
#after dropping, check the number of duplicates again
df.duplicated().sum()

Duplicates are removed from dataset.

### **Checking Data Type**

In [None]:
df.dtypes

### **Unique Values of Categorical Column**

In [None]:
print("Unique values of Job \n")
print(df['job'].unique())

In [None]:
print("Unique values of Marital Status \n")
print(df['marital'].unique())

In [None]:
print("Unique values of Educational Qualification \n")
print(df['education_qual'].unique())

In [None]:
print("Unique values of Call Type \n")
print(df['call_type'].unique())

In [None]:
print("Unique values of Month \n")
print(df['mon'].unique())

In [None]:
print("Unique values of Previous Outcome \n")
print(df['prev_outcome'].unique())

In [None]:
print("Unique values of Target Variable 'y' \n")
print(df['y'].unique())

The unique values above show no spelling mistakes or upper/lower case mismatches, so the categorical columns contain no inconsistent labels.

### **Exploring the Dataset and Replacing Unknown Values**

**Converting categorical Target column into numerical column.**

In [None]:
df['target'] = df["y"].map({"yes":1 , "no": 0})

In [None]:
df.head()

☛  **Age**


In [None]:
#no of counts for particular age
df.age.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Age
df.groupby('age')['target'].mean()

**☛ Job**

In [None]:
#no of counts for particular job
df.job.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Job
df.groupby('job')['target'].mean()

In [None]:
# dropping the rows where job is unknown
# out of 45211 rows, deleting 288 rows will not have much impact on the dataset, so we drop them

#replacing unknown value as null
df['job'] =df['job'].replace('unknown',np.nan)

In [None]:
#counting the no of null value in job column
df.job.isnull().sum()

In [None]:
#removing null values from job column
df=df.dropna(subset=['job'])

In [None]:
#after removing null values, checking the sum of null values
df.job.isnull().sum()

**Marital Status**

In [None]:
#no of counts for marital status
df.marital.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Marital Status
df.groupby('marital')['target'].mean()

**Educational Qualification**

In [None]:
#no of counts for Educational qualification
df.education_qual.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Educational Qualification
df.groupby('education_qual')['target'].mean()

In [None]:
#Finding the percentage of unknown value
print('Percentage for "Unknown": ',((1730) / (23202+13301+6851+1730)) * 100 )

In [None]:
#replacing unknown value as null
df['education_qual'] =df['education_qual'].replace('unknown',np.nan)

In [None]:
#checking for null values
df.education_qual.isnull().sum()

In [None]:
#droping the null values
df = df. dropna(subset=['education_qual'])

In [None]:
#checking for null value after deleting
df.education_qual.isnull().sum()

**Call Type**

In [None]:
#no of counts for Call type
df.call_type.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Call Type
df.groupby('call_type')['target'].mean()

In [None]:
#Finding the percentage of unknown value
print('Percentage for "Unknown": ',((12283) / (29285+13020+12283)) * 100 )

The unknown call type accounts for 22.50% of the values, so we keep it as it is.
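
Hard-coded counts can go stale once rows are dropped earlier in the notebook, so it may be safer to compute the share of `unknown` from the column itself. A small sketch; `unknown_share` is a hypothetical helper and the sample values are made up:

```python
import pandas as pd

def unknown_share(column: pd.Series, label: str = "unknown") -> float:
    """Return the percentage of rows in `column` equal to `label`."""
    return float((column == label).mean() * 100)

# Illustrative values only; in the notebook pass df['call_type'] instead.
call_type = pd.Series(["cellular", "cellular", "unknown", "telephone"])
print(round(unknown_share(call_type), 2))
```

The same helper could be reused for `education_qual` and `prev_outcome`.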

**Day**

In [None]:
#no of counts for Day
df.day.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Day
df.groupby('day')['target'].mean()

**Month**

In [None]:
#no of counts for month
df.mon.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Month
df.groupby('mon')['target'].mean()

**Duration**

In [None]:
#no of counts for duration
df.dur.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Duration
df.groupby('dur')['target'].mean()

**Number of Calls**

In [None]:
#no of counts for number of calls
df.num_calls.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Number of Calls
df.groupby('num_calls')['target'].mean()

**Previous Outcome**

In [None]:
#no of counts for previous outcome
df.prev_outcome.value_counts()

In [None]:
#checking for the percentage of how many people get insured? compared with Target vs Previous outcome
df.groupby('prev_outcome')['target'].mean()

In [None]:
#Finding the percentage of unknown value
print('Percentage for "Unknown": ',((35280) / (35280+4709+1774+1424)) * 100 )

Around 81% of the values are unknown, so we keep the unknown value as it is.

**Target Variable Y**

In [None]:
#no of counts of target variable y
df.y.value_counts()

### **Outlier Detection and Correction**
**Outlier Detection**
1.   Z-Score
      Z-Score(x)=(x-mean(x)) / SD(x)
      **Threshold Limit**
      Z-Score > 3 or Z-Score < -3 ---> Outlier
2.   IQR
      IQR = Q3(75%)-Q1(25%)
      **Upper Threshold** = Q3 + (1.5 * IQR)
      **Lower Threshold** = Q1 - (1.5 * IQR)
3.   Plotting
      Box Plot

**Outlier Correction**
1.   Deletion
2.   Clip/Strip
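
The detection rules and the two correction options listed above can be sketched as follows. This is an illustration on made-up numbers; `iqr_bounds` and `zscore_outliers` are hypothetical helper names, not part of the notebook:

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask that is True where |z-score| > threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Illustrative data: 1..11 plus one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100], dtype=float)

mask = zscore_outliers(data)                     # Z-score detection
lower, upper = iqr_bounds(data)                  # IQR detection
clipped = np.clip(data, lower, upper)            # correction by clipping
kept = data[(data >= lower) & (data <= upper)]   # correction by deletion
```

Clipping keeps every row but caps the extreme value at the threshold, while deletion shortens the data; the notebook uses clipping.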

In [None]:
df.info()

## **Age**

**Box Plot**





In [None]:
#Outlier Detection using Box Plot for Age Column
sns.set(style="whitegrid")
sns.boxplot(x=df['age'], color='Chartreuse')

The box plot shows many points outside the whiskers, which indicates outliers.

**IQR**

In [None]:
#detecting Outlier for Age column
q1,q3=np.percentile(df["age"],[25,75])
IQR=q3-q1
upper=q3+1.5*IQR
lower=q1-1.5*IQR
print("Upper age bound:",upper,"Lower age bound :", lower)

**Removing outlier for Age**

In [None]:
#capping outliers for age column
# clip() caps values at the lower & upper thresholds.
df.age = df.age.clip(10.5,70.5)

In [None]:
df.age.describe()

**Checking- After outlier removal**

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x=df['age'], color='Chartreuse')

## **Day**

**Box plot**

In [None]:
#Outlier Detection using Box Plot for day Column
sns.set(style="whitegrid")
sns.boxplot(x=df['day'], color='Chartreuse')

**IQR**

In [None]:
#detecting Outlier for Day column
q1,q3=np.percentile(df["day"],[25,75])
IQR=q3-q1
upper=q3+1.5*IQR
lower=q1-1.5*IQR
print("Upper bound:",upper,"Lower bound :", lower)

In [None]:
df.day.describe()

The box plot shows no outliers, and the IQR check confirms it: the min and max values lie between the lower and upper bounds.

## **Duration**

**Box Plot**

In [None]:
#Outlier Detection using Box Plot for duration Column
sns.set(style="whitegrid")
sns.boxplot(x=df['dur'], color='Chartreuse')

**IQR**

In [None]:
#detecting Outlier for Duration column
q1,q3=np.percentile(df["dur"],[25,75])
IQR=q3-q1
upper=q3+1.5*IQR
lower=q1-1.5*IQR
print("Upper bound:",upper,"Lower bound :", lower)

**Removing Outlier for duration column**

In [None]:
#capping outliers for duration column
# clip() caps values at the lower & upper thresholds.
df.dur = df.dur.clip(-219.5,640.5)

In [None]:
df.dur.describe()

**Checking after outlier removal**

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x=df['dur'], color='Chartreuse')

## **No of Calls**

**Box Plot**

In [None]:
#checking for outliers using a boxplot for the num_calls column
sns.set(style="whitegrid")
sns.boxplot(x=df['num_calls'], color='Chartreuse')

**IQR**

In [None]:
#detecting Outlier for number of calls column
q1,q3=np.percentile(df["num_calls"],[25,75])
IQR=q3-q1
upper=q3+1.5*IQR
lower=q1-1.5*IQR
print("Upper bound:",upper,"Lower bound :", lower)

In [None]:
#capping outliers for num_calls column
# clip() caps values at the lower & upper thresholds.
df.num_calls = df.num_calls.clip(-2,6.0)

In [None]:
df.num_calls.describe()

**Checking after outlier removal**

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x=df['num_calls'], color='Chartreuse')

We detected and corrected outliers in all numerical columns, so the data cleaning process is complete.

# **EDA - Exploratory Data Analysis**

EDA is an important step in the data analysis process, as it helps to identify potential issues with the data and to develop a deeper understanding of the relationships between variables.

### **Distribution of Feature and Target variable**

In [None]:
# Age distribution
plt.figure(figsize = (20,20),dpi=180)
plt.subplot(3,4,1)
sns.histplot((df.age),color='BlueViolet')

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['age']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Age Distribution', fontsize = 12, color='maroon', fontweight='bold')
plt.xlabel('Age',fontsize = 12, color='green')
plt.ylabel('Count',fontsize = 12, color='green')




#Job distribution
plt.subplot(3,4,2)
sns.countplot(x=df['job'],order=df.job.value_counts().index)

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['job']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Job Distribution', fontsize = 14, color="maroon", fontweight='bold')
plt.xlabel('Type Of Job',fontsize = 12, color='green')
plt.xticks(rotation = 90)
plt.ylabel('Count',fontsize = 12, color='green')



# Marital distribution
plt.subplot(3,4,3)
custom_colors = {'married': 'Magenta', 'divorced': 'BlueViolet', 'single': 'Lime'}
sns.countplot(x=df['marital'],order=df.marital.value_counts().index, palette=custom_colors)

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['marital']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Marital Status Distribution', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Marital',fontsize = 12, color='green')
plt.xticks(rotation = 90)
plt.ylabel('Count',fontsize = 12, color='green')



# Education qualification distribution
plt.subplot(3,4,4)
custom_colors = {'secondary': 'DarkGreen', 'tertiary': 'LightSeaGreen', 'primary': 'Aquamarine'}
sns.countplot(x=df['education_qual'],order=df.education_qual.value_counts().index, palette=custom_colors)

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['education_qual']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Education Qualification', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Education',fontsize = 12, color='green')
plt.xticks(rotation = 90)
plt.ylabel('Count',fontsize = 12, color='green')




# Call type distribution
plt.subplot(3,4,5)
custom_colors = {'cellular': 'MediumVioletRed', 'telephone': 'purple', 'unknown' :'MediumSpringGreen'}
sns.countplot(x=df['call_type'],order=df.call_type.value_counts().index, palette=custom_colors)

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['call_type']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Call Type', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Call type',fontsize = 12, color='green')
plt.xticks(rotation = 90)
plt.ylabel('Count',fontsize = 12, color='green')




# Day distribution
plt.subplot(3,4,6)
sns.histplot(df['day'], color="Fuchsia")

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['day']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Day', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Day',fontsize = 12, color='green')
plt.xticks(rotation = 90,fontsize = 10)
plt.ylabel('Count',fontsize = 12, color='green')




# Mon distribution
plt.subplot(3,4,7)
sns.countplot(x=df['mon'],order=df.mon.value_counts().index)

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['mon']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Month', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Month',fontsize = 12, color='green')
plt.xticks(rotation = 90)
plt.ylabel('Count',fontsize = 12, color='green')




# Dur distribution
plt.subplot(3,4,8)
sns.histplot((df.dur),color = 'cyan')

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['dur']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Duration', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Duration',fontsize = 12, color='green')
plt.ylabel('Count',fontsize = 12, color='green')




# Num call distribution
plt.subplot(3,4,9)
sns.histplot(df['num_calls'])

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['num_calls']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Number Of Calls', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Number Of Calls',fontsize = 12, color='green')
plt.xticks(rotation = 90)
plt.ylabel('Count',fontsize = 12, color='green')


# Previous outcome distribution
plt.subplot(3,4,10)
custom_colors = {'unknown': 'HotPink', 'failure': 'Olive', 'other': 'Purple', 'success':'Yellow'}
sns.countplot(x=df['prev_outcome'],order=df.prev_outcome.value_counts().index, palette=custom_colors)

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['prev_outcome']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Previous Outcome', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Previous Outcome',fontsize = 12, color='green')
plt.xticks(rotation = 90)
plt.ylabel('Count',fontsize = 12, color='green')


# Target distribution
plt.subplot(3,4,11)
custom_colors = {'no': 'GreenYellow', 'yes': 'Teal'}
sns.countplot(x=df['y'], palette=custom_colors)

# Get the current Axes object
ax = plt.gca()

# Calculate and annotate the percentage of each category
total = float(len(df['y']))
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height/total)
    x = p.get_x() + p.get_width() / 2 - 0.1
    y = height + 5
    ax.text(x, y, percentage, fontsize=8, rotation=90, ha='center', va='bottom', color='Purple')

plt.title('Target Distribution', fontsize = 14, color='maroon', fontweight='bold')
plt.xlabel('Target Distribution',fontsize = 12, color='green')
plt.xticks(rotation = 90)
plt.ylabel('Count',fontsize = 12, color='green')

plt.tight_layout()


plt.show()
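
The percentage-annotation loop is repeated for every subplot in the cell above. One way to shorten it is a small helper that takes the Axes object and the row count. This is a sketch; `percent_label` and `annotate_percentages` are hypothetical names, and the helper relies only on the `patches` and `text` members of a matplotlib Axes:

```python
def percent_label(height, total):
    """Format a bar height as a percentage of `total`, e.g. '12.5%'."""
    return "{:.1f}%".format(100 * height / total)

def annotate_percentages(ax, total, offset=5):
    """Write a percentage label above every bar (patch) on a matplotlib Axes."""
    for p in ax.patches:
        height = p.get_height()
        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
        ax.text(x, height + offset, percent_label(height, total),
                fontsize=8, rotation=90, ha="center", va="bottom", color="Purple")
```

Each `for p in ax.patches:` loop could then become a single call such as `annotate_percentages(plt.gca(), len(df))`.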


**From the above plots we can clearly tell the following interpretation**

### **1. Age**

*   Most Target : 30 to 40 years
*   Least Target : below 20 and above 60


### **2. Job**

*   Most Target : blue-collar and management
*   Least Target : students and housemaids

### **3. Marital Status**

*   Most Target : Married
*   Least Target : Divorced

### **4. Education**

*   Most Target : Secondary
*   Least Target : Primary

### **5. Call Type**

*   Most Target : cellular
*   Least Target : telephone

### **6. Day**   

*   Most Target : Mid of the month
*   Least Target : Beginning of Month

### **7. Month**

*   Most Target : May
*   Least Target : December

### **8. Duration**

*   Most Target : calls lasting around 1750 seconds
*   Least Target : calls lasting around 100 to 200 seconds

### **9. No of Calls**

*   Most Target : most people were contacted one time
*   Least Target : the fewest people were contacted 5 times

### **10. Previous Outcome**

*   Most Target : for most people the previous outcome was unknown
*   Least Target : the previous outcome was success for the fewest people

### **11. Target**

*   Most people (about 88%) did not subscribe; only a small share (about 12%) are insured.


# **Features vs Target**

### **Categorical Variable vs Target (Categorical) -- Job, Marital, Educational Qualification, Call Type, Month**

In [None]:
plt.figure(figsize=(20,35), dpi=180)
#plt.suptitle("Categorical Data Vs Target", fontsize=20, fontweight='bold', color='maroon')
#Jobs vs Target
plt.subplot(3,3,1)
my_colors = ['Magenta', 'cyan']
sns.countplot(x='job',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Jobs vs Target', fontweight='bold', color='maroon')
plt.xlabel('Job', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Marital Status vs Target
plt.subplot(3,3,2)
my_colors = ['Magenta', 'cyan']
sns.countplot(x='marital',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Marital Status vs Target', fontweight='bold', color='maroon')
plt.xlabel('Marital Status', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Educational Qualification vs Target
plt.subplot(3,3,3)
my_colors = ['Magenta', 'cyan']
sns.countplot(x='education_qual',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Educational Qualification vs Target', fontweight='bold', color='maroon')
plt.xlabel('Educational Qualification', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Month vs Target
plt.subplot(3,3,4)
my_colors = ['Magenta', 'cyan']
sns.countplot(x='mon',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Month vs Target', fontweight='bold', color='maroon' )
plt.xlabel('Month', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Previous Outcome vs Target
plt.subplot(3,3,5)
my_colors = ['Magenta', 'cyan']
sns.countplot(x='prev_outcome',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Previous Outcome vs Target', fontweight='bold', color='maroon' )
plt.xlabel('Previous Outcome', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Call Type vs Target
plt.subplot(3,3,6)
my_colors = ['Magenta', 'cyan']
sns.countplot(x='call_type',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Call Type vs Target', fontweight='bold', color='maroon')
plt.xlabel('Call Type', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

plt.show()


### **Categorical Data Vs Target**
**Jobs vs Target**

*   Target (No) : Blue Collar
*   Subscribed (Yes): Management

**Marital Status vs Target**

*   Target (No) : Married
*   Subscribed (Yes): Married

**Educational Qualification vs Target**

*   Target (No): Secondary
*   Subscribed (Yes): Secondary

**Month vs Target**

*   Target (No): May
*   Subscribed (Yes): May

**Previous Outcome vs Target**

*   Target (No): unknown
*   Subscribed (Yes): unknown

**Call Type vs Target**

*   Target (No): Cellular
*   Subscribed (Yes): Cellular

### **Feature VS Target Distribution - Percentage of people Subscribed**

In [None]:
plt.figure(figsize=(20,35), dpi=180)
#plt.suptitle("Categorical Data Vs Target", fontsize=20, fontweight='bold', color='maroon')

#Jobs vs Target
plt.subplot(3,3,1)
(df.groupby('job')['target'].mean()*100).sort_values().plot(kind="bar",color='cyan')
plt.xticks(rotation=50)
plt.title('Jobs vs Target', fontweight='bold', color='maroon')
plt.xlabel('Job', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Marital Status vs Target
plt.subplot(3,3,2)
(df.groupby('marital')['target'].mean()*100).sort_values().plot(kind="bar",color='Magenta')
plt.xticks(rotation=50)
plt.title('Marital Status vs Target', fontweight='bold', color='maroon')
plt.xlabel('Marital Status', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Educational Qualification vs Target
plt.subplot(3,3,3)
(df.groupby('education_qual')['target'].mean()*100).sort_values().plot(kind="bar",color='cyan')
plt.xticks(rotation=50)
plt.title('Educational Qualification vs Target', fontweight='bold', color='maroon')
plt.xlabel('Educational Qualification', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Month vs Target
plt.subplot(3,3,4)
(df.groupby('mon')['target'].mean()*100).sort_values().plot(kind="bar",color='Magenta')
plt.xticks(rotation=50)
plt.title('Month vs Target', fontweight='bold', color='maroon' )
plt.xlabel('Month', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Call Type vs Target
plt.subplot(3,3,5)
(df.groupby('call_type')['target'].mean()*100).sort_values().plot(kind="bar",color='cyan')
plt.xticks(rotation=50)
plt.title('Call Type vs Target', fontweight='bold', color='maroon')
plt.xlabel('Call Type', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#Previous Outcome vs Target
plt.subplot(3,3,6)
(df.groupby('prev_outcome')['target'].mean()*100).sort_values().plot(kind="bar",color='Magenta')
plt.xticks(rotation=50)
plt.title('Previous Outcome vs Target', fontweight='bold', color='maroon')
plt.xlabel('Previous Outcome', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')


plt.show()


### **Percentage of people Subscribed -- Categorical Data Vs Target (Categorical)**
**Jobs vs Target**

*   Most subscribed % : Student, retired
*   Least Subscribed % : blue-collar

**Marital Status vs Target**

*   Most subscribed % : Single
*   Least Subscribed % : Married

**Educational Qualification vs Target**

*   Most subscribed % : tertiary
*   Least Subscribed % : primary

**Month vs Target**

*   Most subscribed % : March, September
*   Least Subscribed % : May

**Call Type vs Target**

*   Most subscribed % : Cellular
*   Least Subscribed % : unknown

**Previous Outcome vs Target**

*   Most subscribed % : Success
*   Least Subscribed % : unknown



### **Numerical Variable vs Target -- Age, Day, Duration, No of Calls**

In [None]:
plt.figure(figsize=(20, 15), dpi=150)
#sub title to show title for overall plot
plt.suptitle("Numerical Data Vs Target", fontsize=18,  fontweight='bold', color='maroon')

#Age vs Target
plt.subplot(2,2,1)
my_colors = ['Magenta', 'DarkBlue']
sns.histplot(x='age',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Age vs Target', fontweight='bold', color='maroon' )
plt.xlabel('Age', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')
#df[['age','target']].corr()

#Day vs Target
plt.subplot(2,2,2)
my_colors = ['Magenta', 'DarkBlue']
sns.histplot(x='day',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Day vs Target', fontweight='bold', color='maroon' )
plt.xlabel('Day', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')
#df[['day','target']].corr()

#Duration vs Target
plt.subplot(2,2,3)
my_colors = ['Magenta', 'DarkBlue']
sns.histplot(x='dur',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('Duration vs Target', fontweight='bold', color='maroon' )
plt.xlabel('Duration', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

#No of Calls vs Target
plt.subplot(2,2,4)
my_colors = ['Magenta', 'DarkBlue']
sns.histplot(x='num_calls',hue='y',data=df, palette=my_colors)
plt.xticks(rotation=50)
plt.title('No of Calls vs Target', fontweight='bold', color='maroon' )
plt.xlabel('No Of Calls', color='DarkGreen')
plt.ylabel('y', color='DarkGreen')

plt.show()


### **Numeric Data vs Target**

**Age vs Target**

*   Target : Middle age people
*   Subscribed : Middle age people

**Day vs Target**

*   Target : Middle of Month
*   Subscribed : Middle of Month

**Duration vs Target**

*  The duration of the call is also important for subscribing to the insurance.

**No of Calls vs Target**

*  As the number of calls increases, subscriptions also increase.

# **Encoding**
In this project I am going to use a decision tree, so we must do label encoding.

In [None]:
df.columns

### **Job**

In [None]:
#Encoding for job column (Label Encoding)
df['job']=df['job'].map({'blue-collar':1,'entrepreneur':2,'services':3,'housemaid':4,'technician':5,'self-employed':6,'admin.':7,'management':8, 'unemployed':9, 'retired': 10, 'student' : 11})
df.head(3)

### **Marital Status**

In [None]:
#Encoding for Marital status (Label Encoding)
df['marital'] =df['marital'].map({'married': 1, 'divorced': 2, 'single' : 3})
df.head(3)

### **Educational Qualification**

In [None]:
#encoding for educational qualification (Label Encoding)
df['education_qual'] = df['education_qual'].map({'primary': 1, 'secondary': 2, 'tertiary' :3})
df.head(3)

### **Month**

In [None]:
# Encoding for month column (Label Encoding)
df['mon']=df['mon'].map({'may': 1, 'jul' : 2, 'jan': 3, 'nov': 4, 'jun' : 5, 'aug' : 6, 'feb' : 7, 'apr' : 8, 'oct' : 9, 'dec' : 10 , 'sep': 11, 'mar': 12})
df.head(3)

### **Call Type**

In [None]:
# Encoding for call type column (Label Encoding)
df['call_type'] = df['call_type'].map({'unknown': 1, 'telephone' : 2, 'cellular' : 3})
df.head(3)

### **Previous Outcome**

In [None]:
# Encoding for previous outcome column (Label Encoding)
df['prev_outcome']=df['prev_outcome'].map({'unknown' : 1, 'failure' : 2, 'other' : 3, 'success': 4})
df.head(3)
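
The integer codes in the maps above appear to follow the subscription rate of each category (for example, `blue-collar` gets the lowest code and `student` the highest). Assuming that is the intent, the mapping could be derived from the data rather than typed by hand. A sketch on made-up rows; `rate_ordered_codes` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def rate_ordered_codes(frame, column, target="target"):
    """Map each category to 1..n, ordered by ascending mean of `target`."""
    order = frame.groupby(column)[target].mean().sort_values().index
    return {category: code for code, category in enumerate(order, start=1)}

# Illustrative rows only; in the notebook pass df and a real column name.
toy = pd.DataFrame({
    "job": ["student", "student", "blue-collar", "blue-collar", "management", "management"],
    "target": [1, 1, 0, 0, 1, 0],
})
codes = rate_ordered_codes(toy, "job")
toy["job_code"] = toy["job"].map(codes)
print(codes)
```

Because the codes are computed from the target, deriving them on the training split only would avoid leaking test information into the features.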

# **Feature and Target Selection**

In [None]:
df.columns

In [None]:
# X --> Feature y-- > Target

x = df[['age', 'job', 'marital', 'education_qual', 'call_type', 'day', 'mon', 'dur', 'num_calls', 'prev_outcome']].values
y=df['target'].values

# **Splitting**

In [None]:
# splitting the data as train and test

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,random_state = 3 )


# **Balancing**

In [None]:
#Balancing the data
from imblearn.combine import SMOTEENN
smt = SMOTEENN(sampling_strategy='all')
x_train_smt, y_train_smt = smt.fit_resample(x_train, y_train)

In [None]:
print(len(x_train_smt))
print(len(y_train_smt))

# **Scaling**

In [None]:
#scaling the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_smt)
x_test_scaled = scaler.transform(x_test)
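
To make the fit/transform split concrete, here is a sketch of what standard scaling does, written with plain NumPy on made-up arrays. The statistics are learned from the training rows only and then reused for the test rows:

```python
import numpy as np

# Illustrative arrays; in the notebook these are x_train_smt and x_test.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# "fit": learn the per-column mean and standard deviation from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": apply the same training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
```

Reusing the training mean and standard deviation on the test set is why the notebook calls `fit_transform` on the training data but only `transform` on the test data.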

# **Modelling**

## **Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

lr = LogisticRegression()

lr.fit(x_train_scaled,y_train_smt)
lr.score(x_test_scaled,y_test)

In [None]:
y_pred=lr.predict_proba(x_test_scaled)
y_pred

In [None]:
log_reg_auroc = roc_auc_score(y_test,y_pred[:,1])
print("AUROC score for logistic regression  :  ",round(log_reg_auroc,2))
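
AUROC can be read as the probability that a randomly chosen subscriber receives a higher predicted score than a randomly chosen non-subscriber. The sketch below computes it that way on a tiny made-up example; `auroc_pairwise` is a hypothetical helper meant for illustration, since it compares every positive with every negative and would be slow on large data:

```python
import numpy as np

def auroc_pairwise(y_true, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc_pairwise(y_true, scores))
```

Unlike accuracy, this measure does not depend on a 0.5 cut-off, which makes it a useful companion metric on an imbalanced target.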

## **K-Nearest Neighbour (KNN)**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
# Try several values of k. cross_val_score refits a clone of knn on folds of the scaled test split.
for i in [1,2,3,4,5,6,7,8,9,10,20,30,40,50]:
  knn = KNeighborsClassifier(n_neighbors=i)
  knn.fit(x_train_scaled, y_train_smt)
  print("K value :", i, "Train Score : ", knn.score(x_train_scaled,y_train_smt), "Cross Val Accuracy :" , np.mean(cross_val_score(knn, x_test_scaled, y_test, cv=10)))

In [None]:
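# i still holds the last value from the loop above (50); replace it with the chosen k
knn = KNeighborsClassifier(n_neighbors=i)
knn.fit(x_train_scaled, y_train_smt)
print("KNN Score: ", knn.score(x_test_scaled, y_test))
# The model was trained on scaled features, so predict on the scaled test set as well
print("AUROC on the test set : ", roc_auc_score(y_test, knn.predict_proba(x_test_scaled)[:, 1]))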
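
An alternative way to pick `k` is to cross-validate on the training split only, with scaling inside a `Pipeline` so each fold fits its own scaler, and to keep the test split for one final evaluation. The sketch below runs on synthetic data from `make_classification` (a stand-in, not the insurance data); the candidate `k` values are arbitrary and the resampling step is left out for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=3
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3, stratify=y_demo
)

# Mean cross-validated AUROC on the training split for each candidate k.
cv_auc = {}
for k in [1, 5, 10, 20]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_auc[k] = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="roc_auc").mean()

best_k = max(cv_auc, key=cv_auc.get)
print("Best k:", best_k)
```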

## **Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
dt = DecisionTreeClassifier()
dt.fit(x_train_smt,y_train_smt)
print("Decision Tree Train Score : ", dt.score(x_train_smt,y_train_smt))
print("AUROC on the test set : ", roc_auc_score(y_test, dt.predict_proba(x_test)[:, 1]))

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score  # used for cross-validation
import numpy as np

for depth in [1,2,3,4,5,6,7,8,9,10,20]:
  dt = DecisionTreeClassifier(max_depth=depth) # limits how deep the tree may grow
  # Fit dt to the resampled training set
  dt.fit(x_train_smt, y_train_smt)
  trainAccuracy = accuracy_score(y_train_smt, dt.predict(x_train_smt)) # training accuracy, shown only for comparison with the cross-validation score
  dt = DecisionTreeClassifier(max_depth=depth) # a fresh, untrained model for cross-validation
  valAccuracy = cross_val_score(dt, x_test_scaled, y_test, cv=10) # syntax: cross_val_score(fresh_model, features, target, cv=5 or 10)
  print("Depth  : ", depth, " Training Accuracy : ", trainAccuracy, " Cross val score : " ,np.mean(valAccuracy))

**max_depth = 5 gives a good cross-validation score of 0.896**

In [None]:
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(x_train_smt,y_train_smt)
print("Decision Tree Train Score : ", dt.score(x_train_smt,y_train_smt))
print("AUROC on the test set : ", roc_auc_score(y_test, dt.predict_proba(x_test)[:, 1]))

## **XG Boost**

In [None]:
import xgboost as xgb
from sklearn.model_selection import cross_val_score
import numpy as np
# The loop variable is named `rate` so the fitted LogisticRegression stored in `lr` is not overwritten
for rate in [0.01,0.02,0.03,0.04,0.05,0.1,0.11,0.12,0.13,0.14,0.15,0.2,0.5,0.7,1]:
  model = xgb.XGBClassifier(learning_rate = rate, n_estimators=100, verbosity = 0) # initialise the model
  model.fit(x_train_smt,y_train_smt) # train the model
  print("Learning rate : ", rate," Train score : ", model.score(x_train_smt,y_train_smt)," Cross-Val score : ", np.mean(cross_val_score(model, x_test, y_test, cv=10)))


**A learning rate of 0.2 gives the best cross-validation score of 0.899**

## **Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf= RandomForestClassifier(max_depth=2,n_estimators=100,max_features="sqrt")    #max_depth=log(no of features)
rf.fit(x_train, y_train)
y_pred= rf.predict(x_test)

In [None]:
# Cross-validation to find the best max_depth and avoid an overfitted model
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
for depth in [1,2,3,4,5,6,7,8,9,10]:
  rf= RandomForestClassifier(max_depth=depth,n_estimators=100,max_features="sqrt")   # limits how deep each tree may grow
  # Fit rf to the training set
  rf.fit(x_train, y_train)
  trainAccuracy = accuracy_score(y_train, rf.predict(x_train)) # training accuracy of the fitted forest
  rf= RandomForestClassifier(max_depth=depth,n_estimators=100,max_features="sqrt")   # a fresh, untrained model for cross-validation
  valAccuracy = cross_val_score(rf, x_train, y_train, cv=10) # syntax: cross_val_score(fresh_model, features, target, cv=5 or 10)
  print("Depth  : ", depth, " Training Accuracy : ", trainAccuracy, " Cross val score : " ,np.mean(valAccuracy))

**Depth = 8 gives a good cross-validation score of 0.904**
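
The manual depth loop can also be written with `GridSearchCV`, which runs the cross-validation for every candidate and records the best one. This is only a sketch on synthetic data from `make_classification` (a stand-in for the training split); the grid and the smaller `n_estimators` are arbitrary choices to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split (not the insurance data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=3)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=25, max_features="sqrt", random_state=3),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best max_depth:", search.best_params_["max_depth"])
print("Best CV accuracy:", round(search.best_score_, 3))
```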

# **Solution Statement**

The models were tested; the score for each model is listed below.

*   **Logistic Regression** - AUROC score is **0.88**
*   **KNN** - AUROC score is **0.895**
*   **Decision Tree** - AUROC score is **0.897**
*   **XG Boost** - cross-validation score is **0.899**
*   **Random Forest** - cross-validation score is **0.904**

**Random Forest gives the highest score, 0.904, so Random Forest is the best model for customer conversion prediction.**

# **Feature Importance**

In [None]:
from xgboost import plot_importance

# plot feature importance
plot_importance(model)
plt.show()

In [None]:
df.columns

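The plot labels the features by column index:

*   f0 - age
*   f1 - job
*   f2 - marital status
*   f3 - educational qualification
*   f4 - call type
*   f5 - day
*   f6 - mon
*   f7 - dur
*   f8 - number of calls
*   f9 - previous outcome

The target `y` is not part of the feature matrix, so it does not appear in the plot.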
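
The `f0`, `f1`, ... labels appear because the model was trained on a plain array (`.values`) without column names. Since Random Forest was chosen as the best model, its importances can also be read from `feature_importances_` and paired with the column names. A sketch on synthetic stand-in data (`X_demo`, `y_demo` and `rf_demo` are illustrative names; the importances it prints say nothing about the real dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ['age', 'job', 'marital', 'education_qual', 'call_type',
                 'day', 'mon', 'dur', 'num_calls', 'prev_outcome']

# Synthetic stand-in data with one column per feature name.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=len(feature_names), random_state=3
)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=3)
rf_demo.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(rf_demo.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```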

### **Conclusion:**

The conclusions below are based on the feature importance from the machine learning model that predicts whether a client will subscribe to the insurance.

The company should focus on the top few features, in the order given below, to get clients to subscribe to the insurance.

*   Duration - The longer the call, the better the chance of influencing the client.
*   Age - Age plays an important role in insurance. Middle-aged people are targeted most, and most of the people who subscribed are also middle-aged.
*   Day - Most people who subscribed were contacted in the middle of the month.
*   Month - More people subscribed in May than in other months.
*   Job - Blue-collar workers are targeted most, but most subscribers hold management jobs.
