<a href="https://colab.research.google.com/github/Sudar278/Email_Campaign_Effectiveness_Prediction-Classification-/blob/main/Email_Campaign_Effectiveness_Prediction(Classification).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Title: Email Campaign Effectiveness Prediction


## Problem Description
Most small and medium-sized business owners use Gmail-based e-mail marketing strategies for offline targeting, converting prospective customers into leads so that they stay with the business.

The main objective is to build a machine learning model that characterizes each mail and tracks whether it is ignored, read, or acknowledged by the reader.


### Data Columns Explanation






Email_ID : E-mail ID of recipients.

Email_Type : 2 different e-mail types: 1 and 2.

Subject_Hotness_Score : Measures the strength and effectiveness of mail subject.

Email_Source_Type : 2 different e-mail source types: 1 and 2.

Customer_Location : Differentiates between 7 different e-mail customer locations: A, B, C, D, E, F and G.

Email_Campaign_Type : 3 different e-mail campaign types: 1, 2 and 3.

Total_Past_Communications : Number of previous communications from the same source.

Time_Email_sent_Category : Differentiates between 3 time-of-day categories in which the mail was sent: 1, 2 and 3.

Word_Count : Number of words in the mail.

Total_Links : Number of links in the mail.

Total_Images : Number of images in the mail.

Email_Status : Differentiates between 3 different e-mail statuses: 0, 1 and 2, representing ignored, read and acknowledged respectively. This is the target variable.


#### Task
Analyse the e-mail marketing campaign data, perform data cleaning, EDA and feature engineering, and build a machine learning classification model to predict whether a mail is ignored, read or acknowledged by the reader.

# Importing Libraries and Defining Functions 

In [1]:
# Importing Libraries
import warnings
warnings.filterwarnings("ignore")

import math

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix

import pickle

# Data Loading

In [2]:
# Loading data from google drive and creating a dataframe df 
from google.colab import drive
drive.mount('/content/drive')
df=pd.read_csv('/content/drive/MyDrive/Capstone_project_datas/Email Campain prediction/data_email_campaign.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
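
The loading cell above only works inside Colab with Drive mounted. As a minimal sketch for running the notebook elsewhere, the helper below reads the CSV from any path and fails early if the file is missing. The function name `load_campaign_csv` and the demo file are hypothetical and only illustrate the idea; the demo rows are made up and are not taken from the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_campaign_csv(csv_path):
    """Read the campaign CSV, failing early with a clear message if it is missing."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    return pd.read_csv(csv_path)


# Demo on a throwaway file with made-up rows (not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_email_campaign.csv"
    demo_path.write_text("Email_ID,Word_Count,Email_Status\nX1,440,0\nX2,504,1\n")
    demo_df = load_campaign_csv(demo_path)

print(demo_df.shape)  # (2, 3)
```

In Colab the same helper could be called with the Drive path used above once `drive.mount` has run.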


## Dataset First View

In [4]:
df.head()

Unnamed: 0,Email_ID,Email_Type,Subject_Hotness_Score,Email_Source_Type,Customer_Location,Email_Campaign_Type,Total_Past_Communications,Time_Email_sent_Category,Word_Count,Total_Links,Total_Images,Email_Status
0,EMA00081000034500,1,2.2,2,E,2,33.0,1,440,8.0,0.0,0
1,EMA00081000045360,2,2.1,1,,2,15.0,2,504,5.0,0.0,0
2,EMA00081000066290,2,0.1,1,B,3,36.0,2,962,5.0,0.0,1
3,EMA00081000076560,1,3.0,2,E,2,25.0,2,610,16.0,0.0,0
4,EMA00081000109720,1,0.0,2,C,3,18.0,2,947,4.0,0.0,0


In [7]:
df.describe()

Unnamed: 0,Email_Type,Subject_Hotness_Score,Email_Source_Type,Email_Campaign_Type,Total_Past_Communications,Time_Email_sent_Category,Word_Count,Total_Links,Total_Images,Email_Status
count,68353.0,68353.0,68353.0,68353.0,61528.0,68353.0,68353.0,66152.0,66676.0,68353.0
mean,1.285094,1.095481,1.456513,2.272234,28.93325,1.999298,699.931751,10.429526,3.550678,0.230934
std,0.451462,0.997578,0.498109,0.46868,12.536518,0.631103,271.71944,6.38327,5.596983,0.497032
min,1.0,0.0,1.0,1.0,0.0,1.0,40.0,1.0,0.0,0.0
25%,1.0,0.2,1.0,2.0,20.0,2.0,521.0,6.0,0.0,0.0
50%,1.0,0.8,1.0,2.0,28.0,2.0,694.0,9.0,0.0,0.0
75%,2.0,1.8,2.0,3.0,38.0,2.0,880.0,14.0,5.0,0.0
max,2.0,5.0,2.0,3.0,67.0,3.0,1316.0,49.0,45.0,2.0
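
In the summary above, Email_Status ranges from 0 to 2 and its 75th percentile is still 0, so at least three quarters of the mails fall in class 0. A quick way to see the exact class shares is sketched below. The mapping of codes 0, 1, 2 to ignored, read, acknowledged is an assumption based on the order given in the column description, and the helper name `status_distribution` and the toy frame are hypothetical.

```python
import pandas as pd

# Assumed mapping from numeric codes to the statuses named in the column description.
STATUS_LABELS = {0: "ignored", 1: "read", 2: "acknowledged"}


def status_distribution(frame, target="Email_Status"):
    """Return the count and percentage share of each target class."""
    counts = frame[target].value_counts().sort_index()
    return pd.DataFrame(
        {
            "label": [STATUS_LABELS.get(code, "unknown") for code in counts.index],
            "count": counts.to_numpy(),
            "share_pct": (counts / counts.sum() * 100).round(2).to_numpy(),
        },
        index=counts.index,
    )


# Toy data for illustration only; on the real data call status_distribution(df).
toy_status = pd.DataFrame({"Email_Status": [0, 0, 0, 1, 2, 0, 1, 0]})
status_summary = status_distribution(toy_status)
print(status_summary)
```

A strongly skewed result here would be the reason to consider the resampling tools imported earlier (`RandomUnderSampler`, `SMOTE`).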


In [21]:
# Total number of rows
print(f'Total number of rows, columns = {df.shape}')
# Number of Duplicate rows
print(f'Number of Duplicate rows = {df[df.duplicated()].shape[0]}')

Total number of rows, columns = (68353, 12)
Number of Duplicate rows = 0


The dataset has 68353 rows and 12 columns, with no duplicate rows.
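
A full-row check does not catch the case where the same Email_ID appears twice with different values. The sketch below counts both kinds of duplicates; the helper name `duplicate_report` and the toy frame are hypothetical, and treating Email_ID as a unique key is an assumption rather than something the dataset description states.

```python
import pandas as pd


def duplicate_report(frame, key="Email_ID"):
    """Count fully duplicated rows and repeated values of the key column."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "duplicate_keys": int(frame[key].duplicated().sum()),
    }


# Toy data: one exact duplicate row (A2, 200) and one repeated key with a new value (A3).
toy_dupes = pd.DataFrame(
    {
        "Email_ID": ["A1", "A2", "A2", "A3", "A3"],
        "Word_Count": [100, 200, 200, 300, 350],
    }
)
dupe_counts = duplicate_report(toy_dupes)
print(dupe_counts)  # {'duplicate_rows': 1, 'duplicate_keys': 2}
```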

In [22]:
# Looking at the basic info of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68353 entries, 0 to 68352
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Email_ID                   68353 non-null  object 
 1   Email_Type                 68353 non-null  int64  
 2   Subject_Hotness_Score      68353 non-null  float64
 3   Email_Source_Type          68353 non-null  int64  
 4   Customer_Location          56758 non-null  object 
 5   Email_Campaign_Type        68353 non-null  int64  
 6   Total_Past_Communications  61528 non-null  float64
 7   Time_Email_sent_Category   68353 non-null  int64  
 8   Word_Count                 68353 non-null  int64  
 9   Total_Links                66152 non-null  float64
 10  Total_Images               66676 non-null  float64
 11  Email_Status               68353 non-null  int64  
dtypes: float64(4), int64(6), object(2)
memory usage: 6.3+ MB


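Four columns contain null values: Customer_Location (11595 missing), Total_Past_Communications (6825), Total_Links (2201) and Total_Images (1677).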
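
The null counts can be read off `df.info()` by subtracting the non-null counts from the row total, but a small table of counts and percentages is easier to scan. A minimal sketch, using a hypothetical helper `missing_summary` and made-up toy rows:

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """List columns with missing values, with counts and percentage of rows."""
    missing = frame.isna().sum()
    summary = pd.DataFrame(
        {
            "missing": missing,
            "missing_pct": (missing / len(frame) * 100).round(2),
        }
    )
    # Keep only columns that actually have gaps, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only; on the real data call missing_summary(df).
toy_missing = pd.DataFrame(
    {
        "Customer_Location": ["E", None, "B", None],
        "Total_Links": [8.0, 5.0, np.nan, 16.0],
        "Word_Count": [440, 504, 962, 610],
    }
)
null_table = missing_summary(toy_missing)
print(null_table)
```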

Total_Past_Communications, Total_Links and Total_Images are integer-valued counts, but pandas stores them as float64 because they contain missing values.
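
If integer semantics are wanted before the nulls are imputed, one option is pandas' nullable `Int64` dtype, which keeps missing entries as `<NA>`. This is only a sketch of that option, not necessarily the approach taken later in this notebook; the helper name `to_nullable_int` and the toy values are hypothetical. The cast raises an error if a column holds fractional values, which doubles as a check that the columns really are counts.

```python
import numpy as np
import pandas as pd

COUNT_COLUMNS = ["Total_Past_Communications", "Total_Links", "Total_Images"]


def to_nullable_int(frame, columns=COUNT_COLUMNS):
    """Return a copy with the given float columns cast to nullable Int64."""
    converted = frame.copy()
    for col in columns:
        # NaN becomes <NA>; whole-number floats become integers.
        converted[col] = converted[col].astype("Int64")
    return converted


# Toy data for illustration only; on the real data call to_nullable_int(df).
toy_counts = pd.DataFrame(
    {
        "Total_Past_Communications": [33.0, np.nan, 36.0],
        "Total_Links": [8.0, 5.0, np.nan],
        "Total_Images": [0.0, 0.0, 0.0],
    }
)
int_counts = to_nullable_int(toy_counts)
print(int_counts.dtypes)
```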


In [31]:
# Print the unique values of each column to identify the categorical variables
for col in df.columns:
  print(f'{col} : {df[col].unique()}')


Email_ID : ['EMA00081000034500' 'EMA00081000045360' 'EMA00081000066290' ...
 'EMA00089998436500' 'EMA00089999168800' 'EMA00089999316900']
Email_Type : [1 2]
Subject_Hotness_Score : [2.2 2.1 0.1 3.  0.  1.5 3.2 0.7 2.  0.5 0.2 1.  4.  1.9 1.1 1.6 0.3 2.3
 1.4 1.7 2.8 1.2 0.8 0.6 4.2 1.8 2.4 0.9 1.3 3.3 2.6 3.1 4.1 2.9 2.7 0.4
 3.5 3.7 2.5 3.8 3.9 3.4 4.6 4.5 3.6 4.4 4.7 5.  4.3 4.8 4.9]
Email_Source_Type : [2 1]
Customer_Location : ['E' nan 'B' 'C' 'G' 'D' 'F' 'A']
Email_Campaign_Type : [2 3 1]
Total_Past_Communications : [33. 15. 36. 25. 18. nan 34. 21. 40. 27. 24. 42. 11. 23. 37. 35. 51.  9.
 39. 31. 50. 30. 14. 45. 53. 28.  7. 38. 52. 22. 43. 12. 16. 20. 41. 56.
 26. 29.  5. 32. 44. 10. 17. 46. 47. 48.  8. 49. 13.  0.  6. 55. 19. 60.
 59. 61. 54. 62. 57. 64. 58. 65. 66. 67. 63.]
Time_Email_sent_Category : [1 2 3]
Word_Count : [ 440  504  962  610  947  416  116 1241  655  744  931  550  565  700
  694 1061  623  560 1082  684  733 1122  649  778  855  704  339  988
  389  636  812  8