<a href="https://colab.research.google.com/github/Aramnani/Capstone-Project-3---Email-Campaign-Effectiveness-Prediction/blob/main/Email_Campaign_Effectiveness_Prediction_Capstone_Project_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Email Campaign Effectiveness Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name -** Aakash Ramnani

# **Project Summary -**

Email marketing is a marketing technique in which businesses stay connected with their customers through emails, making them aware of new products, updates and important notices related to the products they use.

Most importantly, email marketing allows businesses to build relationships with leads, new customers and past customers. It's a way to communicate directly to the customers in their inbox, at a time that is convenient for them. With the right messaging tone and strategies, emails are one of the most important marketing channels.

In this project, we build machine learning models that characterize an email and predict whether it is ignored, read or acknowledged by the reader. In addition, we analyze which features are important for an email not to be ignored.

The main steps of the project are:

1. Basic EDA (Exploratory Data Analysis): I performed basic EDA to understand the data and its characteristics, using univariate, bivariate and multivariate analysis to understand the correlation between variables and to explore their distributions and interactions.

2. Hypothesis Testing: I performed hypothesis testing to check the relationships between the variables, using statistical tests such as the t-test, ANOVA and F-test to compare the means or proportions of different groups or categories.

3. Feature Engineering and Data Preprocessing: I created and selected the most relevant and informative features for the model, using correlation analysis and VIF (Variance Inflation Factor) for feature selection.

4. ML Model Development and Evaluation: I developed and evaluated several machine learning models: Logistic Regression, KNN Classifier, Decision Tree, Random Forest and XGBoost.

5. Model Interpretation and Explanation: I explained the model predictions using feature importance, which helps analyze the effect of different factors on the output variable.

# **GitHub Link -**

https://github.com/Aramnani/Capstone-Project-3---Email-Campaign-Effectiveness-Prediction

# **Problem Statement**


**Most small to medium business owners make effective use of Gmail-based email marketing strategies for offline targeting, converting their prospective customers into leads so that they stay with them in business. The main objective is to create a machine learning model that characterizes each mail and tracks whether it is ignored, read or acknowledged by the reader.**

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams.update({'figure.figsize':(8,5),'figure.dpi':100})

from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from imblearn.over_sampling import SMOTE
from collections import Counter

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, roc_auc_score, f1_score, recall_score,roc_curve, classification_report

!pip install shap
import shap

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
dataset = pd.read_csv('/content/drive/MyDrive/Almabetter/machine learning/project/Classification/data_email_campaign.csv')

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("No. of Rows in dataset are : ", dataset.shape[0])
print("No. of Columns in Dataset are : ", dataset.shape[1])

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

- The columns have float, integer and object datatypes.
- There are missing values in the Customer_Location, Total_Past_Communications, Total_Links and Total_Images columns.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

- There are no duplicated values in our dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
dataset.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(dataset.isnull(), cbar=False)

### What did you know about your dataset?

- There are 68353 rows and 12 columns in our dataset.
- The columns have float, integer and object datatypes.
- There are no duplicated rows in our dataset.
- There are missing values in the Customer_Location, Total_Past_Communications, Total_Links and Total_Images columns.
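
To put numbers on the missing values listed above, it can help to look at both the count and the percentage per column. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, they are not the project data); the same expression can be applied to `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `dataset`; the values are made up.
toy = pd.DataFrame({
    "Customer_Location": ["A", None, "G", None, "B"],
    "Total_Links": [10.0, np.nan, 8.0, 5.0, 12.0],
    "Word_Count": [440, 504, 962, 610, 947],
})

# Count and percentage of missing values per column, largest first
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing)
```

A column with a large missing percentage is a candidate for dropping, while a small percentage is usually easier to impute.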

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe(include = 'all')

### Variables Description

Our email campaign dataset has 68353 observations and 12 features. Email_Status is our target variable.

Our features:

- **Email Id -** The email IDs of the customers/individuals.
- **Email Type -** There are two categories, 1 and 2. We can think of them as marketing emails versus important updates or notices regarding the business.
- **Subject Hotness Score -** A score for the email's subject, based on how good and effective the content is.
- **Email Source -** The source of the email, such as sales and marketing or important admin mails related to the product.
- **Email Campaign Type -** The campaign type of the email.
- **Total Past Communications -** The total number of previous emails from the same source.
- **Customer Location -** Demographic data of the customer: the location where the customer resides.
- **Time Email sent Category -** Three categories, 1, 2 and 3, for the time of day when the email was sent; we can think of them as morning, evening and night time slots.
- **Word Count -** The number of words contained in the email.
- **Total links -** Number of links in the email.
- **Total Images -** Number of images in the email.
- **Email Status -** Our target variable, which records whether the mail was ignored, read or acknowledged by the reader.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in dataset.columns.to_list():
  print(f"Unique values for {col} are : ",dataset[col].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df = dataset.copy()

**EDA (Exploratory Data Analysis)**

In [None]:
df.isnull().sum()

- There are missing values in Customer_Location, Total_Past_Communications, Total_Links and Total_Images. Customer_Location has the most missing values.

- There is no way to guess the Customer_Location in order to impute the null values, so we will see what effect Customer_Location has on Email_Status and then decide whether to impute the missing values or drop the column.

In [None]:
# Let's impute the missing values for other columns except for Customer_location
# Checking distribution of Total_Past_Communications for missing value imputation
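# histplot replaces the deprecated distplot; stat='density' keeps the same y-axis scale
sns.histplot(x = df['Total_Past_Communications'], kde = True, stat = 'density')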

The distribution of Total_Past_Communications is approximately normal, so imputing the missing values with the mean is reasonable.

In [None]:
# Imputing the missing values of Total_Past_Communications with mean.
# Assign the result back instead of using inplace=True on a column selection
df['Total_Past_Communications'] = df['Total_Past_Communications'].fillna(df['Total_Past_Communications'].mean())

In [None]:
# Checking the distribution for Total_Links
sns.displot(x = df['Total_Links'], kde = True)

In [None]:
# Checking the distribution for Total_Images
sns.displot(x = df['Total_Images'], kde = True)

- It is clear from the plots that the distributions of both Total_Links and Total_Images are right-skewed, so the mean would be pulled towards the tail. We therefore impute the missing values with the mode.

In [None]:
# Imputing the missing values of Total_Links and Total_Images with mode
df['Total_Links'] = df['Total_Links'].fillna(df['Total_Links'].mode()[0])
df['Total_Images'] = df['Total_Images'].fillna(df['Total_Images'].mode()[0])
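
The choice between the mean and a more robust statistic was made above by reading the plots. A rough numeric check is the sample skewness. The sketch below is illustrative only: `impute_by_skew` and its `skew_threshold` are hypothetical names and values, and it falls back to the median for skewed columns, whereas the notebook uses the mode.

```python
import numpy as np
import pandas as pd

def impute_by_skew(series: pd.Series, skew_threshold: float = 0.5) -> pd.Series:
    """Fill NaNs with the mean if roughly symmetric, otherwise with the median.

    `skew_threshold` is an illustrative cut-off, not a value from the notebook.
    """
    if abs(series.skew()) < skew_threshold:
        fill_value = series.mean()
    else:
        fill_value = series.median()
    return series.fillna(fill_value)

# Made-up example columns: one symmetric, one with a long right tail
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([0.0, 0.0, 1.0, 1.0, 2.0, 40.0, np.nan])

filled_symmetric = impute_by_skew(symmetric)  # NaN replaced by the mean
filled_skewed = impute_by_skew(skewed)        # NaN replaced by the median

print(filled_symmetric.iloc[-1], filled_skewed.iloc[-1])
```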

In [None]:
df.isnull().sum()

- Univariate Analysis

In [None]:
# Univariate analysis of target variable.
df['Email_Status'].value_counts()

- 0 is for ignored, 1 is for read and 2 is for acknowledged.
- The value count for 0 is far higher than the value counts for 1 and 2.
- Our target variable is imbalanced, and we will have to treat it with the right technique for our model to work correctly.
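
The degree of imbalance is easier to judge from class shares than from raw counts. A small sketch on hypothetical labels (the 80/16/4 split is made up for illustration):

```python
import pandas as pd

# Hypothetical labels: 0 = ignored, 1 = read, 2 = acknowledged
status = pd.Series([0] * 80 + [1] * 16 + [2] * 4, name="Email_Status")

# Share of each class, ordered by label
shares = status.value_counts(normalize=True).sort_index()

# How many majority-class rows there are per minority-class row
counts = status.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```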

In [None]:
df['Word_Count'] = df['Word_Count'].astype('float64')

In [None]:
# Integer columns are treated as categorical and float columns as continuous
categorical_feature = df.select_dtypes(include = 'int64').columns
continuous_feature = df.select_dtypes(include = 'float64').columns

print("Categorical Features are : ", categorical_feature)
print("Continuous Features are : ", continuous_feature)

In [None]:
# Checking the value counts for categorical features
for col in categorical_feature:
  print(df[col].value_counts())

- Bi-variate analysis

- Continuous Variable vs Target Variable

In [None]:
# Email_Status vs Subject_Hotness_score
shs_es = df.groupby(['Email_Status'])['Subject_Hotness_Score'].median().sort_values(ascending = False)

In [None]:
shs_es

- The median Subject_Hotness_Score is lower for read and acknowledged emails than for ignored emails.

In [None]:
# Email_Status vs Total_Past_Communications
tpc_es = df.groupby(['Email_Status'])['Total_Past_Communications'].median().sort_values(ascending = False)
tpc_es

- The more past communications there are, the higher the chance of the email being read or acknowledged.

In [None]:
# Email_Status vs Word_Count
wc_es = df.groupby(['Email_Status'])['Word_Count'].median().sort_values(ascending = False)
wc_es

- Lengthy emails are ignored more often than they are read or acknowledged.

In [None]:
# Total_Links vs Email_Status
tl_es = df.groupby(['Email_Status'])['Total_Links'].median().sort_values(ascending = False)
tl_es

- The median values for all three cases (ignored, read, acknowledged) are almost the same.

In [None]:
# Total_Images vs Email_Status
ti_es = df.groupby(['Email_Status'])['Total_Images'].median().sort_values(ascending = False)
ti_es

- The median value for all three cases (ignored, read, acknowledged) is zero. We need further analysis for proper insight.
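
When the median is zero in every group, other statistics can still separate the groups. One option, sketched here on a hypothetical toy frame (values made up), is to report the mean and the share of emails with at least one image alongside the median:

```python
import pandas as pd

# Hypothetical toy data: most emails have no images in every class
toy = pd.DataFrame({
    "Email_Status": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "Total_Images": [0, 0, 0, 12, 0, 0, 3, 0, 0, 3],
})

# Named aggregation: one output column per statistic
summary = toy.groupby("Email_Status")["Total_Images"].agg(
    median="median",
    mean="mean",
    share_nonzero=lambda s: (s > 0).mean(),
)
print(summary)
```

In this toy example all medians are zero, but the means differ, which is the kind of difference the median alone hides.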

- Categorical Variable vs target Variable

In [None]:
# Email_Type vs Email_Status
et_es = df.groupby(['Email_Type'])['Email_Status'].value_counts()
et_es

- More emails of Email Type 1 are sent than of Email Type 2; however, there is no significant difference in the proportion of emails being ignored, read or acknowledged.
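
Raw counts from `value_counts()` make it hard to compare proportions between categories of different size. A row-normalized cross-tabulation gives the shares directly. This is a minimal sketch on made-up data, not the project dataset:

```python
import pandas as pd

# Hypothetical toy data for two email types
toy = pd.DataFrame({
    "Email_Type":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Email_Status": [0, 0, 0, 0, 1, 0, 0, 0, 1, 2],
})

# normalize="index" makes every row sum to 1, i.e. shares within each Email_Type
row_share = pd.crosstab(toy["Email_Type"], toy["Email_Status"], normalize="index")
print((row_share * 100).round(1))
```

The same call works for the other categorical features by swapping the first argument.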

In [None]:
# Email_Source_Type vs Email_Status
est_es = df.groupby(['Email_Source_Type'])['Email_Status'].value_counts()
est_es

- There is no significant difference in the proportions of ignored, read and acknowledged emails between the two categories of Email Source Type.

In [None]:
# Email_Campaign_Type vs Email_Status
ect_es = df.groupby(['Email_Campaign_Type'])['Email_Status'].value_counts()
ect_es

- Very few emails are sent with email campaign type 1.
- The largest number of emails is sent with email campaign type 2; however, the proportion of ignored emails for email campaign type 2 is very high.

In [None]:
# Time_Email_sent_Category vs Email_Status
tes_es = df.groupby(['Time_Email_sent_Category'])['Email_Status'].value_counts()
tes_es

- Most emails are sent in time slot 2.

In [None]:
# Customer_Location vs Email_Status
cl_es = df.groupby(['Customer_Location'])['Email_Status'].value_counts()
cl_es

- Most emails are sent to customer location 'G'.

### What all manipulations have you done and insights you found?

- There are missing values in Customer_Location, Total_Past_Communications, Total_Links and Total_Images. Customer_Location has the most missing values.

- There is no way to guess the Customer_Location in order to impute the null values, so we will see what effect Customer_Location has on Email_Status and then decide whether to impute the missing values or drop the column.

- The distribution of Total_Past_Communications is approximately normal, so we imputed the missing values with the mean.

- The distributions of both Total_Links and Total_Images are right-skewed, so we imputed the missing values with the mode.

- Our target variable is imbalanced, and we will have to treat it with the right technique for our model to work correctly.

- The median Subject_Hotness_Score is lower for read and acknowledged emails than for ignored emails.

- The more past communications there are, the higher the chance of the email being read or acknowledged.

- Lengthy emails are ignored more often than they are read or acknowledged.

- Very few emails are sent with email campaign type 1.

- The largest number of emails is sent with email campaign type 2; however, the proportion of ignored emails for email campaign type 2 is very high.

- Most emails are sent to customer location 'G'.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
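def without_hue(ax, feature):
    """Annotate each bar of a count plot with its share of all observations."""
    total = len(feature)
    for p in ax.patches:
        # Bar height is the category count; convert it to a percentage of the total
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        # Place the label at the top of the bar, near its horizontal centre
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        ax.annotate(percentage, (x, y), size = 12)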

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Checking target variable
plt.figure(figsize = (7,5))
ax = sns.countplot(x = df['Email_Status'])
plt.xticks(size = 12)
plt.xlabel('Email_Status', size = 12)
plt.yticks(size = 12)
plt.ylabel('count', size = 12)

without_hue(ax, df.Email_Status)

##### 1. Why did you pick the specific chart?

We used the count plot to plot the value counts of the categories in our target variable.

##### 2. What is/are the insight(s) found from the chart?

- 80% of all the emails sent are ignored.
- 16.1% of all the emails are read.
- 3.5% of all the emails are acknowledged.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The plot above tells us that our target variable is imbalanced, so we need to treat it with the right technique before model training to get more reliable predictions.
- We'll use undersampling/SMOTE to balance the data so that the model is not biased towards ignored emails.
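
As an illustration of the undersampling idea mentioned above, here is a minimal pandas-only sketch of random undersampling on hypothetical toy data. It is not the notebook's resampling code, and SMOTE (from `imblearn`) works differently, by synthesizing new minority-class rows. Resampling is normally applied to the training split only, so that the test set keeps the original class distribution.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 80 ignored, 16 read, 4 acknowledged
toy = pd.DataFrame({
    "Email_Status": [0] * 80 + [1] * 16 + [2] * 4,
    "Word_Count": range(100),
})

# The smallest class decides how many rows every class keeps
n_minority = int(toy["Email_Status"].value_counts().min())

# Draw the same number of rows at random from each class
balanced = (
    toy.groupby("Email_Status")
       .sample(n=n_minority, random_state=42)
       .reset_index(drop=True)
)
print(balanced["Email_Status"].value_counts())
```

Undersampling discards most of the majority class, which is its main cost compared with oversampling approaches.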

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Checking categorical variables
for col in categorical_feature:
    counts = df[col].value_counts().sort_index()
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    counts.plot.bar(ax = ax, color='steelblue')
    ax.set_title(col + ' counts')
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

We used bar charts of the value counts because the variables are categorical and we need the number of data points in each category.

##### 2. What is/are the insight(s) found from the chart?

- More emails of Email Type 1 are sent than of Email Type 2.
- There is no significant difference between the counts of Email Source Type 1 and Email Source Type 2.
- Most emails are sent using campaign type 2, and very few are sent using campaign type 1.
- Time email sent has three slots: 1, 2 and 3. Most emails are sent in slot 2, while slots 1 and 3 have almost the same number of emails.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We need further analysis to see if there is any business impact.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Checking Distribution for continuous variable
for value in continuous_feature:
    # histplot replaces the deprecated distplot; stat='density' keeps the same y-axis scale
    sns.histplot(x=df[value], kde=True, stat='density')
    plt.xlabel(value)
    plt.show()

##### 1. Why did you pick the specific chart?

- We used a histogram with a KDE curve to check the distribution of each continuous variable.

##### 2. What is/are the insight(s) found from the chart?

- Except for Total_Past_Communications and Word_Count, all the other continuous features, i.e. Subject_Hotness_Score, Total_Links and Total_Images, are right-skewed, which suggests the presence of outliers in the upper tail.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Checking the distribution can help us impute the missing values and also help us check if we are in line with few of the algorithm assumptions.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Subject_Hotness_score vs Email_Status
sns.boxplot(x = df['Email_Status'], y = df['Subject_Hotness_Score'])
plt.show()

##### 1. Why did you pick the specific chart?

- We have used box plot to check the relation between continuous and target variable, which will also show the presence of outliers in the continuous variable with respect to target variable.

##### 2. What is/are the insight(s) found from the chart?

- The median Subject_Hotness_Score of ignored emails is around 1, with a few outliers.
- Acknowledged emails have the most outliers.
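
The points a box plot draws as outliers are, with the default whisker setting, the values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below counts them per class on made-up data; `count_iqr_outliers` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

# Hypothetical scores: class 2 contains one extreme value (4.0)
toy = pd.DataFrame({
    "Email_Status": [0] * 6 + [2] * 6,
    "Subject_Hotness_Score": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5,
                              0.1, 0.2, 0.2, 0.3, 0.3, 4.0],
})

outliers = toy.groupby("Email_Status")["Subject_Hotness_Score"].apply(count_iqr_outliers)
print(outliers)
```

Because the classes differ a lot in size, comparing the outlier count as a share of each class is usually fairer than comparing raw counts.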

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The Subject_Hotness_Score of read and acknowledged emails is observed to be much lower than that of ignored emails.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Total_Past_Communications vs Email_Status
sns.boxplot(x = df['Email_Status'], y = df['Total_Past_Communications'])

##### 1. Why did you pick the specific chart?

- We have used box plot to check the relation between continuous and target variable, which will also show the presence of outliers in the continuous variable with respect to target variable.

##### 2. What is/are the insight(s) found from the chart?

- The median Total_Past_Communications of read and acknowledged emails is higher than that of ignored emails. This suggests that customers with more past communications are more likely to read or acknowledge an email.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- It is evident from the plot that the more previous interactions there are with a customer, the more the customer tends to read and acknowledge the mail. It is important to build a good rapport and connection with the customer.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Word_Count vs Email_Status
sns.boxplot(x = df['Email_Status'], y = df['Word_Count'])

##### 1. Why did you pick the specific chart?

- We have used box plot to check the relation between continuous and target variable, which will also show the presence of outliers in the continuous variable with respect to target variable.

##### 2. What is/are the insight(s) found from the chart?

- The median word count of ignored emails is higher than that of read and acknowledged emails.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Emails with a high word count are ignored more often than they are read or acknowledged. Lengthy emails should be avoided.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Total_Links vs Email_Status
sns.boxplot(x = df['Email_Status'], y = df['Total_Links'])

##### 1. Why did you pick the specific chart?

- We have used box plot to check the relation between continuous and target variable, which will also show the presence of outliers in the continuous variable with respect to target variable.

##### 2. What is/are the insight(s) found from the chart?

- The median Total_Links value is almost the same in all three cases, and there are a number of outliers in each case.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Total_Links appears to have no significant effect on whether emails are ignored, read or acknowledged.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Total_Images vs Email_Status
sns.boxplot(x = df['Email_Status'], y = df['Total_Images'])

##### 1. Why did you pick the specific chart?

- We have used box plot to check the relation between continuous and target variable, which will also show the presence of outliers in the continuous variable with respect to target variable.

##### 2. What is/are the insight(s) found from the chart?

- The number of outliers in Total_Images is highest for ignored emails.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

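- The high number of upper outliers in Total_Images for ignored emails suggests that image-heavy emails tend to be ignored, although it may partly reflect the fact that ignored emails are by far the largest class.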

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Email_Type vs Email_Status
ax = sns.countplot(x=df['Email_Type'], hue=df['Email_Status'])
unique = len([x for x in df['Email_Type'].unique() if x==x])
bars = ax.patches
for i in range(unique):
  catbars=bars[i:][::unique]
  #get height
  total = sum([x.get_height() for x in catbars])
  #print percentage
  for bar in catbars:
    ax.text(bar.get_x()+bar.get_width()/2., bar.get_height(), f'{bar.get_height()/total:.0%}', ha="center",va="bottom")
plt.show()

##### 1. Why did you pick the specific chart?

- We used a count plot to get the value counts of each category of the variable with respect to the target variable.

##### 2. What is/are the insight(s) found from the chart?

- More emails of type 1 are sent than of type 2.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The proportion of ignored, read and acknowledged emails is almost the same for both email types.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Email_Source_Type vs Email_Status
ax = sns.countplot(x=df['Email_Source_Type'], hue=df['Email_Status'])
unique = len([x for x in df['Email_Source_Type'].unique() if x==x])
bars = ax.patches
for i in range(unique):
  catbars=bars[i:][::unique]
  #get height
  total = sum([x.get_height() for x in catbars])
  #print percentage
  for bar in catbars:
    ax.text(bar.get_x()+bar.get_width()/2., bar.get_height(), f'{bar.get_height()/total:.0%}', ha="center",va="bottom")
plt.show()

##### 1. Why did you pick the specific chart?

- We used a count plot to get the value counts of each category of the variable with respect to the target variable.

##### 2. What is/are the insight(s) found from the chart?

- There is no significant difference between Email_Source_Type 1 and Email_Source_Type 2.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The proportion of ignored, read and acknowledged emails is almost the same for both source types.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Email_Campaign_Type vs Email_Status
ax = sns.countplot(x=df['Email_Campaign_Type'], hue=df['Email_Status'])
unique = len([x for x in df['Email_Campaign_Type'].unique() if x==x])
bars = ax.patches
for i in range(unique):
  catbars=bars[i:][::unique]
  #get height
  total = sum([x.get_height() for x in catbars])
  #print percentage
  for bar in catbars:
    ax.text(bar.get_x()+bar.get_width()/2., bar.get_height(), f'{bar.get_height()/total:.0%}', ha="center",va="bottom")
plt.show()

##### 1. Why did you pick the specific chart?

- We used a count plot to get the value counts of each category of the variable with respect to the target variable.

##### 2. What is/are the insight(s) found from the chart?

- Most emails are sent through campaign type 2, and most ignored emails also belong to campaign type 2.
- Very few emails are sent through campaign type 1.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Very few emails are sent through campaign type 1. However, the rate at which they are read and acknowledged is better than for campaign types 2 and 3.

- Campaign type 3 shows good results: fewer emails were sent, but a comparatively larger share of them were read and acknowledged.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Time_Email_sent_Category vs Email_Status
ax = sns.countplot(x=df['Time_Email_sent_Category'], hue=df['Email_Status'])
unique = len([x for x in df['Time_Email_sent_Category'].unique() if x==x])
bars = ax.patches
for i in range(unique):
  catbars=bars[i:][::unique]
  #get height
  total = sum([x.get_height() for x in catbars])
  #print percentage
  for bar in catbars:
    ax.text(bar.get_x()+bar.get_width()/2., bar.get_height(), f'{bar.get_height()/total:.0%}', ha="center",va="bottom")
plt.show()

##### 1. Why did you pick the specific chart?

- We used a count plot to get the value counts of each category of the variable with respect to the target variable.

##### 2. What is/are the insight(s) found from the chart?

- Time_Email_sent_Category is divided into 3 slots.
- Most emails are sent in time slot 2.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The proportion of ignored, read and acknowledged emails is almost the same for all three time slots.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Customer_Location vs Email_Status
ax = sns.countplot(x=df['Customer_Location'], hue=df['Email_Status'])
unique = len([x for x in df['Customer_Location'].unique() if x==x])
bars = ax.patches
for i in range(unique):
  catbars=bars[i:][::unique]
  #get height
  total = sum([x.get_height() for x in catbars])
  #print percentage
  for bar in catbars:
    ax.text(bar.get_x()+bar.get_width()/2., bar.get_height(), f'{bar.get_height()/total:.0%}', ha="center",va="bottom")
plt.show()

##### 1. Why did you pick the specific chart?

- We used a count plot to get the value counts of each category of the variable with respect to the target variable.

##### 2. What is/are the insight(s) found from the chart?

- Most emails are sent to location 'G'.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- The proportion of ignored, read and acknowledged emails is almost the same for all customer locations, indicating that Customer_Location has no significant effect on the output variable.
- Thus we can drop the Customer_Location column, which has the highest number of missing values.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,8))
# numeric_only=True skips non-numeric (object) columns such as Customer_Location
correlation = df.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')

##### 1. Why did you pick the specific chart?

- We used a correlation heatmap to check for multicollinearity and correlation between variables.

##### 2. What is/are the insight(s) found from the chart?

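- Email_Campaign_Type and Total_Past_Communications show a noticeable correlation with Email_Status. Because the heatmap plots absolute values, it shows the strength of each relationship but not its direction.
- Total_Links and Total_Images are highly correlated with each other.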
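
A pairwise correlation like the one between Total_Links and Total_Images can be followed up with the Variance Inflation Factor (VIF), defined as 1 / (1 - R²) when one feature is regressed on the others. A known identity is that the VIFs equal the diagonal of the inverse of the feature correlation matrix, which the NumPy-only sketch below uses on synthetic data (the column values are generated for illustration, not taken from the dataset). As far as I am aware, `variance_inflation_factor` from statsmodels matches these numbers only when a constant column is included in the design matrix.

```python
import numpy as np
import pandas as pd

# Synthetic features: images are built to correlate with links, words are independent
rng = np.random.default_rng(0)
links = rng.poisson(10, size=500).astype(float)
toy = pd.DataFrame({
    "Total_Links": links,
    "Total_Images": 0.5 * links + rng.normal(0, 1.0, size=500),
    "Word_Count": rng.normal(700, 250, size=500),
})

# VIF of each feature = corresponding diagonal entry of the inverse correlation matrix
corr = toy.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy.columns, name="VIF")
print(vif.round(2))
```

The two correlated columns get a clearly higher VIF than the independent one, which stays close to 1.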

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# pairplot creates its own figure, so no separate plt.figure() call is needed
grid = sns.pairplot(df)
# Add minor ticks and a light grid to every panel
for ax in grid.axes.flat:
    ax.minorticks_on()
    ax.grid(which='both', alpha=0.3, linestyle='--')
plt.show()

##### 1. Why did you pick the specific chart?

We used a pair plot to look at the pairwise relationships between variables and to identify which features are related to each other.

##### 2. What is/are the insight(s) found from the chart?

Total_Images and Total_Links are highly correlated, so we can either keep just one of the two features or combine them into a single feature for model implementation.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

- **Hypothesis - 1 -** The proportion of ignored, read and acknowledged emails is almost the same across all customer locations.
- **Hypothesis - 2 -** There is no correlation between Total_Links and Total_Images.
- **Hypothesis - 3 -** Total_Past_Communications has no effect on Email_Status.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Null Hypothesis (H0) -** The proportion of ignored, read and acknowledged emails is the same across all customer locations (Email_Status is independent of Customer_Location).
- **Alternate Hypothesis (HA) -** The proportion of ignored, read and acknowledged emails differs across customer locations.

- As we are comparing two categorical features, we will use the chi-square test of independence with a significance level of 0.05 for this hypothesis.
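
For reference, the statistic behind this test compares the observed count in each cell of the location-by-status table with the count expected under independence. The symbols below are notation introduced here for illustration: $O_{ij}$ is the observed count for location $i$ and status $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.

$$
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{R_i\,C_j}{N},
\qquad \text{degrees of freedom} = (r-1)(c-1)
$$

A large $\chi^2$ relative to its degrees of freedom gives a small p-value and speaks against independence.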

#### 2. Perform an appropriate statistical test.

In [None]:
!pip install researchpy
import researchpy as rp
import scipy.stats as stats

In [None]:
# Perform Statistical Test to obtain P-Value
# Chi-Square Test
crosstab, test_results, expected = rp.crosstab(dataset["Customer_Location"], dataset["Email_Status"],
                                               test= "chi-square",
                                               expected_freqs= True,
                                               prop= "cell")

In [None]:
crosstab

In [None]:
test_results

- As the p-value 0.5240 is greater than the significance level 0.05, we fail to reject the null hypothesis.

##### Which statistical test have you done to obtain P-Value?

- We have used chi-square test to obtain the p-value.

##### Why did you choose the specific statistical test?

- As both variables in the test are categorical, we used the chi-square test.
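
As a cross-check on the reasoning above, the same kind of test can be sketched with `scipy.stats.chi2_contingency` on a contingency table built by `pd.crosstab`. The toy data below is hypothetical and far too small for a real chi-square test; it only illustrates the mechanics, not the notebook's actual result.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: every location has the same mix of email statuses
toy = pd.DataFrame({
    "Customer_Location": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "Email_Status":      [0, 0, 1, 2, 0, 0, 1, 2, 0, 0, 1, 2],
})

# Observed counts (rows: location, columns: status)
observed = pd.crosstab(toy["Customer_Location"], toy["Email_Status"])

# Returns the test statistic, the p-value, the degrees of freedom and the expected counts
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Because the status mix is identical in every toy location, the observed and expected counts coincide and the test has no evidence against independence.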

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Null Hypothesis (H0) -** There is no correlation between Total_Links and Total_Images.
- **Alternate Hypothesis (HA) -** Total_Links and Total_Images are highly correlated.

- As we have two quantitative variables and need to measure their correlation, we will use Pearson's correlation test.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Pearson's Correlation Test
# pearsonr is exposed from scipy.stats; the scipy.stats.stats path is deprecated
from scipy.stats import pearsonr

In [None]:
var1 = dataset['Total_Links'].fillna(dataset['Total_Links'].mode()[0])
var2 = dataset['Total_Images'].fillna(dataset['Total_Images'].mode()[0])

In [None]:
pearsonr(var1, var2)

- The correlation coefficient r is 0.75, which is greater than 0.5 and indicates a high correlation between the two variables.
- The p-value is reported as 0, which is less than the significance level 0.05, so we reject the null hypothesis.

##### Which statistical test have you done to obtain P-Value?

- We have used Pearson's correlation test to test the given hypothesis.

##### Why did you choose the specific statistical test?

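- We used Pearson's correlation test because we need to measure the strength of the linear relationship between two quantitative variables. As both variables are right-skewed rather than normally distributed, a rank-based measure such as Spearman's correlation is a useful cross-check.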
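
To illustrate how such a correlation test is read, here is a minimal sketch that runs both `pearsonr` and `spearmanr` from `scipy.stats` on synthetic count data. The arrays `links` and `images` are hypothetical stand-ins generated from Poisson draws, not the notebook's columns.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Hypothetical skewed counts standing in for Total_Links and Total_Images
links = rng.poisson(lam=3, size=500) + 1
images = links + rng.poisson(lam=1, size=500)

# Pearson measures linear association, Spearman measures monotonic (rank) association
r, p_r = pearsonr(links, images)
rho, p_rho = spearmanr(links, images)

print(f"Pearson r={r:.2f} (p={p_r:.3g}), Spearman rho={rho:.2f} (p={p_rho:.3g})")
```

When both coefficients are high and both p-values fall below the significance level, the conclusion does not hinge on the normality of the variables.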

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Null Hypothesis (H0) -** Total_Past_Communication has no significant effect on Email_Status.
- **Alternate Hypothesis (HA) -** Total_Past_Communication has significant effect on Email_Status.

- As we need to test one categorical variable against one quantitative variable, we will use a one-way ANOVA test with a significance level of 0.05.

#### 2. Perform an appropriate statistical test.

In [None]:
var1 = dataset['Total_Past_Communications'].fillna(dataset['Total_Past_Communications'].mean())
var2 = dataset['Email_Status']

In [None]:
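# Perform Statistical Test to obtain P-Value
# One Way ANOVA test
from scipy.stats import f_oneway
# Compare Total_Past_Communications across the Email_Status groups (one sample per status)
groups = [var1[var2 == status] for status in sorted(var2.unique())]
f_oneway(*groups)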

- Here the p-value 0.00 is less than the significance level 0.05, so we reject the null hypothesis.

##### Which statistical test have you done to obtain P-Value?

- We have used one way ANOVA test to obtain p-value.

##### Why did you choose the specific statistical test?

- We have one categorical variable and one quantitative variable to compare, so we used a one-way ANOVA test to test the given hypothesis.
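
A one-way ANOVA compares the mean of the quantitative variable across the groups defined by the categorical variable, so `f_oneway` needs one sample per category. The sketch below shows that grouping step on a small hypothetical frame; the values are made up for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: past communications grouped by email status
toy = pd.DataFrame({
    "Email_Status": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "Total_Past_Communications": [20, 22, 19, 21, 30, 31, 29, 32, 40, 41, 39, 42],
})

# One array of the quantitative variable per Email_Status category
groups = [g["Total_Past_Communications"].to_numpy()
          for _, g in toy.groupby("Email_Status")]

f_stat, p_value = f_oneway(*groups)
print(f"F={f_stat:.2f}, p={p_value:.3g}")
```

Passing the quantitative column and the label column as two samples would instead compare their overall means, which does not answer the question of whether the groups differ.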

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
dataset.isnull().sum()

- There were missing values in Customer_Location, Total_Past_Communications, Total_Links and Total_Images.
- We imputed the missing values in Total_Past_Communications with the mean, as its distribution was almost normal.
- We imputed the missing values in Total_Links and Total_Images with the mode, as their distributions were right-skewed.
- As there is no way to infer the customer location, we checked whether Customer_Location has a significant impact on our target variable. We found no significant effect, so we will drop the Customer_Location column.

In [None]:
# Dropping Customer_Location
df.drop('Customer_Location', axis = 1, inplace = True)

In [None]:
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

- Total_Past_Communications: imputed with the mean, as its distribution was almost normal.
- Total_Links and Total_Images: imputed with the mode, as their distributions were right-skewed.
- Customer_Location: dropped rather than imputed, because the location cannot be inferred from the other features and the chi-square test showed no significant effect on the target variable.
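
The two imputation strategies can be sketched with `fillna` on a small hypothetical frame. The column names follow the notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Total_Past_Communications": [20.0, 30.0, np.nan, 40.0],
    "Total_Links": [5.0, 5.0, np.nan, 30.0],
})

# Mean imputation for the roughly symmetric column
toy["Total_Past_Communications"] = toy["Total_Past_Communications"].fillna(
    toy["Total_Past_Communications"].mean()
)

# Mode imputation for the skewed column; mode() returns a Series, so take its first entry
toy["Total_Links"] = toy["Total_Links"].fillna(toy["Total_Links"].mode()[0])

print(toy)
```

The mode is less sensitive than the mean to the long right tail (the value 30 here), which is the usual reason for preferring it on skewed counts.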

### 2. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
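# Manipulate Features to minimize feature correlation and create new features
# Plotting correlation for reference
plt.figure(figsize=(15,8))
# numeric_only=True skips any non-numeric columns (recent pandas versions raise an error otherwise)
correlation = df.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')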

In [None]:
# VIF for checking multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Calculating VIF for every column of X
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return vif

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in categorical_feature]])

- Total_Links and Total_Images are highly correlated, so we combine them into a new column Total_links_images and drop Total_Images and Total_Links from the data frame.

In [None]:
# Total_Links and Total_Images are highly correlated so combining both the column into new column
df['Total_links_images'] = df['Total_Images'] + df['Total_Links']
df.drop(['Total_Images','Total_Links'], axis = 1, inplace = True)

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in categorical_feature]])

- Multicollinearity has been re-checked after combining the two columns.
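
For reference, the VIF of feature j is defined as 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. The sketch below computes it directly with NumPy on synthetic data. The helper name `vif_from_definition` and the three synthetic columns are hypothetical, and because this version adds an intercept to each regression, its numbers can differ from those of the `calc_vif` cells above.

```python
import numpy as np

def vif_from_definition(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
links = rng.normal(size=300)
images = 0.9 * links + 0.1 * rng.normal(size=300)  # nearly collinear with links
words = rng.normal(size=300)                       # independent feature

vifs = vif_from_definition(np.column_stack([links, images, words]))
print(vifs.round(2))
```

The two nearly collinear columns get large VIF values while the independent column stays close to 1, which is the pattern that motivates merging or dropping one of the correlated features.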

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# We do not need Email_ID for prediction so dropping Email_ID from dataframe.
df.drop('Email_ID', axis =1, inplace = True)

##### What all feature selection methods have you used  and why?

- We used correlation and the Variance Inflation Factor to check for multicollinearity, combined Total_Images and Total_Links into Total_links_images, and dropped Total_Images and Total_Links.
- We dropped Email_ID, which is only an identifier, and Customer_Location, which showed no significant effect on the target variable.

##### Which all features you found important and why?

In [None]:
df.columns

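- The feature importance plot of the XGBoost model in the model explainability section ranks Email_Campaign_Type highest, followed by Total_Past_Communications.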

### 3. Handling Outliers

In [None]:
continuous_feature = df.describe(include = 'float64').columns

In [None]:
continuous_features = continuous_feature.drop('Word_Count')

In [None]:
# Handling Outliers & Outlier treatments
# Number of Outliers in all the continuous features, counted per Email_Status class
outliers = {}
for col in continuous_features:
  q_75, q_25 = np.percentile(df.loc[:,col],[75,25])
  IQR = q_75-q_25
  # named upper/lower so the built-in max() and min() are not shadowed
  upper = q_75+(1.5*IQR)
  lower = q_25-(1.5*IQR)
  # Email_Status of every row below the lower fence or above the upper fence
  outlier_list = df.loc[(df[col] < lower) | (df[col] > upper), 'Email_Status'].tolist()
  outliers[col]={}
  for i in outlier_list:
      outliers[col][i] = outliers[col].get(i,0) + 1
print(outliers)

- We have the number of outliers with respect to Email_Status.
- Our output data is imbalanced; Email_Status values 1 and 2 are the minority classes.
- For our model to predict correctly and not be biased towards one class, we need to take care that we do not delete more than 5% of the useful data belonging to the minority classes.

In [None]:
# Percentage of outliers in minority class
min_outliers = 0
maj_outliers = 0
for col in continuous_features:
  # .get avoids a KeyError when a class has no outliers in a column
  min_outliers += outliers[col].get(1, 0)
  min_outliers += outliers[col].get(2, 0)
  maj_outliers += outliers[col].get(0, 0)

status_counts = df['Email_Status'].value_counts()
total_min = status_counts[1] + status_counts[2]
total_maj = status_counts[0]

min_outliers_per = (min_outliers/total_min)*100
maj_outliers_per = (maj_outliers/total_maj)*100
total_percentage = ((min_outliers+maj_outliers)/(total_min+total_maj))*100
print(f'The percentage of outliers in minority classes is {min_outliers_per}')
print(f'The percentage of outliers in majority class is {maj_outliers_per}')
print(f'The percentage of total outliers is {total_percentage}')

- Outliers make up more than 5% of the minority classes, so we will not delete them.
- We delete outliers only for the majority class.

In [None]:
# Deleting outliers for majority class: rows below the 1st or above the 99th percentile of each feature
for col in continuous_features:
  q_low = df[col].quantile(0.01)
  q_hi  = df[col].quantile(0.99)
  df = df.drop(df[(df[col] > q_hi) &  (df['Email_Status']==0)].index)
  df = df.drop(df[(df[col] < q_low) & (df['Email_Status']==0)].index)

##### What all outlier treatment techniques have you used and why did you use those techniques?

- We counted the outliers per Email_Status class using the 1.5 × IQR rule.
- Our output data is imbalanced; Email_Status values 1 and 2 are the minority classes.
- For our model to predict correctly and not be biased towards one class, we took care not to delete more than 5% of the data belonging to the minority classes.
- Outliers make up more than 5% of the minority classes, so we did not delete them.
- For the majority class only, we deleted the rows lying below the 1st or above the 99th percentile of each continuous feature.
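
The per-class outlier count can be sketched compactly with a boolean mask and `value_counts`. The frame below is hypothetical (a single made-up column named `feature`), and the fences use the 1.5 × IQR rule.

```python
import pandas as pd

# Hypothetical toy data: one continuous feature and the target class
toy = pd.DataFrame({
    "feature":      [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.0, 1.1, 8.5],
    "Email_Status": [0, 0, 0, 0, 0, 1, 1, 1, 2, 0],
})

# IQR fences for the feature
q25, q75 = toy["feature"].quantile([0.25, 0.75])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

# Rows outside the fences, counted per Email_Status class
is_outlier = (toy["feature"] < lower) | (toy["feature"] > upper)
outliers_per_class = toy.loc[is_outlier, "Email_Status"].value_counts().to_dict()

print(outliers_per_class)
```

Counting per class before deleting anything makes it visible how much of each minority class a given treatment would remove.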

### 4. Categorical Encoding

In [None]:
# Splitting the target variable from dataframe.
x1 = df.drop('Email_Status', axis =1)
y1 = df['Email_Status']

In [None]:
print(x1.shape)
print(y1.shape)

In [None]:
categorical_feature

In [None]:
categorical_features = categorical_feature.drop('Email_Status')
categorical_features

In [None]:
# Encode your categorical columns
# OneHot Encoding for categorical features
x_ohe = pd.get_dummies(x1, columns = categorical_features, drop_first = True)
x_ohe.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

- We used one-hot encoding because the categorical features are nominal, not ordinal.
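
As an illustration of what `pd.get_dummies` with `drop_first=True` produces, here is a minimal sketch on a hypothetical frame. The column `Email_Type` and all values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data with two nominal columns and one numeric column
toy = pd.DataFrame({
    "Email_Type": [1, 2, 1, 2],
    "Email_Campaign_Type": [1, 2, 3, 2],
    "Word_Count": [440, 504, 962, 610],
})

# drop_first=True drops the first level of each encoded column,
# so a feature with k levels becomes k - 1 indicator columns
encoded = pd.get_dummies(
    toy, columns=["Email_Type", "Email_Campaign_Type"], drop_first=True
)

print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear indicator columns, since the dropped level is implied when all remaining indicators are zero.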

### 5. Data Scaling

In [None]:
# Scaling your data
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_ohe)

In [None]:
x_scaled_df = pd.DataFrame(x_scaled, columns = x_ohe.columns)
x_scaled_df.head()

##### Which method have you used to scale you data and why?

- We used StandardScaler, a preprocessing technique that transforms each numerical feature to have a mean of zero and a standard deviation of one, which can improve the performance of scale-sensitive models.

- It puts the features on a common scale, which makes them comparable and prevents features with large magnitudes from dominating the others.
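
The transformation itself is z = (x - mean) / std, applied column by column. A minimal NumPy sketch on a hypothetical 3 × 2 matrix:

```python
import numpy as np

# Hypothetical toy matrix: two features on very different scales
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

X_scaled = (X - mean) / std
print(X_scaled)
```

After scaling, both columns hold the same values even though the raw magnitudes differed by a factor of twenty.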

### 6. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

- 80% of all the emails sent are ignored.
- 16.1% of all the emails are read.
- 3.5% of all the emails are acknowledged.

It is clear that our output variable is imbalanced. Email_Status values 1 and 2 are the minority classes. We need to treat the imbalance with a proper technique before model implementation for more accurate predictions.

In [None]:
# Handling Imbalanced Dataset (If needed)
# SMOTE (Synthetic Minority Oversampling Technique)
smote = SMOTE()

x_smote,y_smote = smote.fit_resample(x_scaled,y1)

In [None]:
print("x before SMOTE : ", x_scaled.shape)
print("x after SMOTE : ", x_smote.shape)
print("y before SMOTE : ", y1.shape)
print("y after SMOTE : ", y_smote.shape)

In [None]:
# Visualizing data before SMOTE
plt.bar(Counter(df['Email_Status']).keys(), Counter(df['Email_Status']).values())
plt.title("Before SMOTE")

In [None]:
# Visualizing data after SMOTE
plt.bar(Counter(y_smote).keys(), Counter(y_smote).values())
plt.title("After SMOTE")

In [None]:
unique_values, count_values = np.unique(y_smote, return_counts=True)
print("Frequency of unique values of the Email_Status:")
print(np.asarray((unique_values, count_values)))

- We have balanced data with 53502 data points for each class.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

- We have used SMOTE technique to treat imbalanced data.
- SMOTE generates synthetic data for minority class.
- SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.
- Unlike undersampling, SMOTE does not discard any of the original data, so no information is lost.
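
The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the function name `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(X_minority, k=2, rng=None):
    """Create one synthetic point between a random minority point and one of its k nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(X_minority))
    point = X_minority[i]
    # Euclidean distance from the chosen point to every minority point
    dists = np.linalg.norm(X_minority - point, axis=1)
    # Skip position 0 of the sorted order, which is the point itself
    neighbour_idx = np.argsort(dists)[1:k + 1]
    neighbour = X_minority[rng.choice(neighbour_idx)]
    gap = rng.random()  # position along the segment, in [0, 1)
    return point + gap * (neighbour - point)

# Hypothetical minority-class points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(X_min, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, so it stays inside the region the minority class already occupies.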

### 7. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x_train, x_test, y_train, y_test = train_test_split(x_smote, y_smote, test_size = 0.2, random_state = 42, stratify = y_smote)

In [None]:
print("Shape of x_train : ", x_train.shape)
print("Shape of x_test : ", x_test.shape)
print("Shape of y_train : ", y_train.shape)
print("Shape of y_test : ", y_test.shape)

##### What data splitting ratio have you used and why?

- As our dataset is small, we used an 80% - 20% splitting ratio, i.e. 80% of the data forms the training set and 20% the test set.
- We use the stratify parameter in order to make sure that the train and test sets have the same class ratios of the target variable.
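
The effect of `stratify` can be checked on a small imbalanced label vector. The sketch below uses hypothetical labels with an 80/15/5 class mix and compares the class proportions of the two splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 15% class 1, 5% class 2
y = np.array([0] * 80 + [1] * 15 + [2] * 5)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class proportions in each split
train_ratio = np.bincount(y_tr) / len(y_tr)
test_ratio = np.bincount(y_te) / len(y_te)
print(train_ratio, test_ratio)
```

A common design choice is to split first and fit the scaler and any resampling step on the training portion only, so that test-set statistics and synthetic points do not enter the evaluation.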

## ***7. ML Model Implementation***

- Function for evaluating models.

In [None]:
# Columns for comparison metrics
comparison_columns = ['Model_Name', 'Train_Accuracy', 'Train_Recall', 'Train_Precision', 'Train_F1score', 'Train_AUC' ,'Test_Accuracy', 'Test_Recall', 'Test_Precision', 'Test_F1score', 'Test_AUC']

In [None]:
# Function to evaluate the model
def model_evaluation(model_name, model_var, x_train, y_train, x_test, y_test):
  ''' This function predicts and evaluates various models for classification algorithms, visualizes results
      and returns the metrics as a one-element list of dicts for the comparison dataframe.'''

  #Making predictions
  y_pred_train = model_var.predict(x_train)
  y_pred_test = model_var.predict(x_test)

  #probs
  train_prob = model_var.predict_proba(x_train)
  test_prob = model_var.predict_proba(x_test)

  #Accuracy
  accuracy_train = accuracy_score(y_train,y_pred_train)
  accuracy_test = accuracy_score(y_test,y_pred_test)

  #Confusion Matrix
  cm_train = confusion_matrix(y_train,y_pred_train)
  cm_test = confusion_matrix(y_test,y_pred_test)

  #Recall
  train_recall = recall_score(y_train,y_pred_train, average='weighted')
  test_recall = recall_score(y_test,y_pred_test, average='weighted')

  #Precision
  train_precision = precision_score(y_train,y_pred_train, average='weighted')
  test_precision = precision_score(y_test,y_pred_test, average='weighted')

  #F1 Score
  train_f1 = f1_score(y_train,y_pred_train, average='weighted')
  test_f1 = f1_score(y_test,y_pred_test, average='weighted')

  #ROC-AUC
  train_auc = roc_auc_score(y_train,train_prob,average='weighted',multi_class = 'ovr')
  test_auc = roc_auc_score(y_test,test_prob,average='weighted',multi_class = 'ovr')

  #Visualising Results
  print("----- Evaluation -------" + str(model_name) + '-----')
  print("---------------Test data ---------------\n")
  print("Confusion matrix \n")
  print(cm_test)
  print(classification_report(y_test,y_pred_test))

  #create ROC curve
  fpr = {}
  tpr = {}
  thresh ={}
  no_of_class=3
  colors = ['blue', 'green', 'orange']
  for i in range(no_of_class):
      fpr[i], tpr[i], thresh[i] = metrics.roc_curve(y_test, test_prob[:,i], pos_label=i)
      # per-class (one-vs-rest) AUC, so each legend entry reports the area under its own curve
      class_auc = metrics.auc(fpr[i], tpr[i])
      plt.plot(fpr[i], tpr[i], linestyle='--', color=colors[i], label=f'Class {i} vs Others AUC={class_auc:.3f}')
  plt.title('Multiclass ROC curve of '+ str(model_name))
  plt.ylabel('True Positive Rate')
  plt.xlabel('False Positive Rate')
  plt.legend(loc=4)
  plt.show()

  #Saving our results
  global comparison_columns

  metric_scores = [model_name,accuracy_train,train_recall,train_precision,train_f1,train_auc,accuracy_test,test_recall,test_precision,test_f1,test_auc]
  final_dict = dict(zip(comparison_columns,metric_scores))
  dict_list = [final_dict]

  return dict_list

In [None]:
#function to create the comparison table
final_list = []
def add_dict_to_final_df(final_dict):
  global final_list
  for elem in final_dict:
    final_list.append(elem)
  global comp_df
  comp_df = pd.DataFrame(final_list, columns= comparison_columns)

### ML Model - 1 - Logistic Regression

In [None]:
# ML Model - 1 Implementation
lor = LogisticRegression(class_weight='balanced',multi_class='multinomial', solver='lbfgs')

# Fit the Algorithm
lor.fit(x_train,y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
logistic_reg_eval = model_evaluation('Logistic Regression', lor, x_train, y_train, x_test, y_test)
logistic_reg_eval

In [None]:
#adding result to final list
add_dict_to_final_df(logistic_reg_eval)
comp_df

### ML Model - 2 - Decision Tree

In [None]:
# ML Model - 2 Implementation
dtc = DecisionTreeClassifier()

# Fit the Algorithm
dtc.fit(x_train, y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
decision_tree_eval = model_evaluation('Decision Tree', dtc, x_train, y_train, x_test, y_test)
decision_tree_eval

In [None]:
#adding result to final list
add_dict_to_final_df(decision_tree_eval)
comp_df

### ML Model - 3 - Random Forest

In [None]:
# ML Model - 3 Implementation
rfc = RandomForestClassifier(random_state=42, max_depth=5, n_estimators=100, oob_score=True)

# Fit the Algorithm
rfc.fit(x_train, y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
random_forest_eval = model_evaluation('Random Forest', rfc, x_train, y_train, x_test, y_test)
random_forest_eval

In [None]:
#adding result to final list
add_dict_to_final_df(random_forest_eval)
comp_df

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
rfc_gd = RandomForestClassifier(random_state=42, n_jobs=-1)

#GridSearchCV
params = {'max_depth': [3,5,10,20],'min_samples_leaf': [5,10,20,50,100],'n_estimators': [10,25,30,50,100,200]}
gd_cv = GridSearchCV(estimator=rfc_gd, param_grid=params, cv = 4, n_jobs=-1, verbose=1, scoring="f1_weighted")

# Fit the Algorithm
gd_cv.fit(x_train, y_train)

In [None]:
rf_gd_bp = gd_cv.best_estimator_
rf_gd_bp

In [None]:
random_forest_tunned_eval = model_evaluation('Random Forest tuned', rf_gd_bp, x_train, y_train, x_test, y_test)
random_forest_tunned_eval

In [None]:
#adding result to final list
add_dict_to_final_df(random_forest_tunned_eval)
comp_df

##### Which hyperparameter optimization technique have you used and why?

- We have used GridSearchCV for hyperparameter tuning.
- GridSearchCV tries all the combinations of the values passed in the dictionary and evaluates the model for each combination using the Cross-Validation method.
- After the search we therefore have a cross-validated score (weighted F1 here) for every combination of hyperparameters and can choose the one with the best performance.
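
To give a sense of the search cost, the sketch below enumerates the combinations of the parameter grid used above with `itertools.product` and multiplies by the number of folds. It only counts the fits; it does not train any model.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV cell above
params = {
    'max_depth': [3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'n_estimators': [10, 25, 30, 50, 100, 200],
}
cv_folds = 4

# Every combination of one value per hyperparameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)  # 120 480
```

With the default `refit=True`, GridSearchCV additionally refits the best combination once on the full training data.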

### ML Model - 4 - KNN Classifier

In [None]:
# ML Model - 4 Implementation
knn_clf = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)

# Fit the Algorithm
knn_clf.fit(x_train, y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
knn_classifier_eval = model_evaluation('KNN Classifier', knn_clf, x_train, y_train, x_test, y_test)
knn_classifier_eval

In [None]:
#adding result to final list
add_dict_to_final_df(knn_classifier_eval)
comp_df

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Parameter Grid
params_knn = {'n_neighbors':np.arange(1,5)}

# ML Model - 4 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
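knn = KNeighborsClassifier()
# The grid has only 4 candidates, so n_iter is set to 4 (the default of 10 exceeds the search space)
knn_rd_cv= RandomizedSearchCV(knn,params_knn,cv=5,n_iter=4,random_state=42)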

# Fit the Algorithm
knn_rd_cv.fit(x_train,y_train)

In [None]:
# Best Parameters
knn_rd_cv_bp = knn_rd_cv.best_params_
knn_rd_cv_bp

In [None]:
knn_classifier_tunned_eval = model_evaluation('KNN Classifier tuned', knn_rd_cv, x_train, y_train, x_test, y_test)
knn_classifier_tunned_eval

In [None]:
#adding result to final list
add_dict_to_final_df(knn_classifier_tunned_eval)
comp_df

### ML Model - 5 - XGBoost

In [None]:
# ML Model - 5 Implementation
# min_samples_leaf and min_samples_split are scikit-learn tree parameters that XGBoost does not use, so they are omitted
xgb = XGBClassifier(n_estimators=100,max_depth=12)

# Fit the Algorithm
xgb.fit(x_train,y_train)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
xgboost_eval = model_evaluation('XGBoost', xgb, x_train, y_train, x_test, y_test)
xgboost_eval

In [None]:
#adding result to final list
add_dict_to_final_df(xgboost_eval)
comp_df

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
comp_df

In [None]:
# visualization for f1_score
sns.barplot(y=comp_df['Model_Name'], x = comp_df['Test_F1score'])

In [None]:
# visualization for AUC
sns.barplot(y=comp_df['Model_Name'], x = comp_df['Test_AUC'])

- The KNN classifier with the best parameters and XGBoost have the highest train and test F1 scores of all the models.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

- F1 score and ROC AUC are two evaluation metrics that can be useful for email campaign effectiveness prediction, especially when the data is imbalanced.
- ROC AUC is largely insensitive to the class distribution, and F1 balances precision and recall instead of rewarding majority-class predictions. Both therefore give a better indication of the classifier's performance than accuracy, which can be misleading when the data is skewed.
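
To make the comparison with accuracy concrete, the sketch below scores a hypothetical classifier that always predicts the majority class on a small imbalanced label set, using `accuracy_score` and `f1_score` from scikit-learn. The labels are invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels and a classifier that always predicts class 0
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without raising a warning
f1_weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)
f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"accuracy={acc:.3f}, weighted F1={f1_weighted:.3f}, macro F1={f1_macro:.3f}")
```

Accuracy looks acceptable even though two of the three classes are never predicted; the macro-averaged F1 exposes this most clearly, while the weighted average sits in between because it weights each class by its support.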

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
comp_df

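Based on the metrics, the XGBoost Classifier and the KNN Classifier with the best parameters work best, giving train F1 scores of 96% and 99% respectively and test F1 scores of 86.17% and 86.12% respectively. The gap between the train and test scores suggests that both models overfit to some degree.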

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
x_smote_df  = pd.DataFrame(x_smote, columns = x_scaled_df.columns)
x_smote_df

In [None]:
x_train_df = pd.DataFrame(x_train, columns = x_smote_df.columns)
x_train_df.head()

In [None]:
# Feature Importance - XGBoost
feature_imp_xgb = pd.DataFrame({"Variable": x_train_df.columns,"Importance": xgb.feature_importances_})
feature_imp_xgb.sort_values(by="Importance", ascending=False, inplace = True)
sns.barplot(x=feature_imp_xgb['Importance'], y= feature_imp_xgb['Variable'])

- The bar plot shows the features ordered by importance.
- Email_Campaign_Type is the most important feature for the XGBoost Classifier model, followed by Total_Past_Communications.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
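
The two cells above are left empty in the notebook. One way they could be filled is sketched below with the standard-library `pickle` module. The tiny logistic regression, the toy data and the file name `email_status_model.pkl` are hypothetical placeholders, not the notebook's trained model.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the trained model
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

# Save the fitted model to a pickle file
path = os.path.join(tempfile.gettempdir(), "email_status_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it again and predict as a sanity check
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

For deployment, the fitted scaler and the encoded column order would need to be saved alongside the model so that unseen data is preprocessed in the same way.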

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

- **From EDA**
  - The proportions of ignored, read and acknowledged emails are almost the same for all customer locations. Customer location does not by itself influence the Email_Status. Both the EDA and the chi-square hypothesis test showed the same result, so we should not consider location as a factor in people ignoring, reading or acknowledging emails.
  - If the campaign type is 1, there is a 66% chance of an email being read and a 23% chance of it being acknowledged.
  - Time_Email_Sent_category has no significant effect on Email_Status.
  - Analyzing total past communications through EDA and the one-way ANOVA test, it is evident that a higher number of previous emails leads to more read and acknowledged emails. This reflects the value of building a connection with customers.
  - The more words an email contains, the more likely it is to be ignored; overly lengthy emails tend to get ignored.

- **Modeling**
  - Based on the metrics, the XGBoost Classifier and the KNN Classifier with the best parameters work best, giving train F1 scores of 96% and 99% respectively and test F1 scores of 86.17% and 86.12% respectively.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***