<a href="https://colab.research.google.com/github/Subhajit53/Email-Campaign-Effectiveness-Prediction/blob/main/Email_Campaign_Effectiveness_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Email Campaign Effectiveness Prediction : Characterize the mail and track the mail</u></b>

## <b> Problem Description </b>

### Many small and medium-sized business owners use Gmail-based email marketing strategies for offline targeting, converting prospective customers into leads so that they stay with the business. The main objective is to create a machine learning model that characterizes each mail and tracks whether the reader ignores, reads, or acknowledges it.

## <b> Data Description </b>

### Number of observations : 68353

### <b> Columns : </b>

1.   Email_ID : Unique identifier of emails sent
2.   Email_Type : Type of email encoded as 1 and 2   
3.   Subject_Hotness_Score : A score between 0 and 5 for hotness of the email topic
4.   Email_Source_Type : Source of email encoded as 1 and 2
5.   Customer_Location : Location of customer encoded as A,B,C,D,E,F,G
6.   Email_Campaign_Type : Type of campaign encoded as 1, 2 and 3
7.   Total_Past_Communications : Number of past communications
8.   Time_Email_sent_Category : Time at which the email was sent, encoded as 1, 2 and 3
9.   Word_Count : Words in the email
10.   Total_Links : Number of links in the email
11.   Total_Images : Number of images in the email
12.   Email_Status : Email status encoded as 0 : ignored, 1 : read, 2 : acknowledged









# **Introduction :**
##### An email campaign is a sequence of marketing efforts that contacts multiple recipients at once. Email campaigns are designed to reach out to subscribers at the best time and provide valuable content and relevant offers. Using email campaigns allows you to build deep and trusting relationships with your customers.

##### Multiple factors drive the success of an email campaign, such as the email contents, the time at which the email was sent, and the length of the email.

##### Here, I want to build a classification model to predict whether a particular campaign email is going blind or hitting the target, that is, whether it will be ignored, read, or acknowledged.

# **Approach :**
##### To solve the problem, I have devised a 3-step approach below:

#### **1. Basic EDA :**
##### In this step, I want to do some exploration on the data. First, I shall check for null values and try to replace or remove them. Then, I shall check for outliers using boxplots and try to replace or remove them. Thirdly, I shall get some visualizations to get an idea of the variables in hand.


#### **2. Model training and testing :**
##### In this step, I shall split the given dataset into a train-test pair, fit 5 classification models to the train set, make predictions on the test set with each of them, and calculate various evaluation metrics. The models are: Decision Tree, Random Forest, Gradient Boosting Machine, Support Vector Machine, and Naive Bayes Classifier.

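**Note:** Logistic regression in its basic form is a binary classifier, while this is a multi-class (>2) classification problem. It would need a multinomial (softmax) or one-vs-rest extension to apply here, so I have left it out of the comparison.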

#### **3. Model Evaluation :**
##### As the last step, I shall compare all the models and try to come up with a conclusion about which model might be the best choice here.
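
Steps 2 and 3 can be sketched roughly as follows. This is only an illustration on synthetic 3-class data from `make_classification`, not the email dataset; the hyperparameters, the 80/20 split, and the choice of accuracy and macro F1 as metrics are assumptions made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the email dataset (assumption)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Stratified split keeps the class proportions similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The five model families named in step 2, with illustrative settings
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the train set and score it on the test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }

# Side-by-side comparison for step 3
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, "
          f"macro F1={scores['macro_f1']:.3f}")
```

Macro F1 is included next to accuracy because it weights all three classes equally, which can matter if the classes turn out to be imbalanced.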

# **Analysis:**

### **Data Importing and Exploration :**

In [None]:
# Importing essential libraries
import pandas as pd
import numpy as np

In [None]:
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')

In [62]:
# Reading the dataset
df = pd.read_csv('/content/drive/MyDrive/Classification Project/Copy of data_email_campaign.csv', index_col='Email_ID')

In [None]:
# Getting a glimpse of the data
df.head()

| Email_ID | Email_Type | Subject_Hotness_Score | Email_Source_Type | Customer_Location | Email_Campaign_Type | Total_Past_Communications | Time_Email_sent_Category | Word_Count | Total_Links | Total_Images | Email_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EMA00081000034500 | 1 | 2.2 | 2 | E | 2 | 33.0 | 1 | 440 | 8.0 | 0.0 | 0 |
| EMA00081000045360 | 2 | 2.1 | 1 | NaN | 2 | 15.0 | 2 | 504 | 5.0 | 0.0 | 0 |
| EMA00081000066290 | 2 | 0.1 | 1 | B | 3 | 36.0 | 2 | 962 | 5.0 | 0.0 | 1 |
| EMA00081000076560 | 1 | 3.0 | 2 | E | 2 | 25.0 | 2 | 610 | 16.0 | 0.0 | 0 |
| EMA00081000109720 | 1 | 0.0 | 2 | C | 3 | 18.0 | 2 | 947 | 4.0 | 0.0 | 0 |


In [None]:
# Getting the shape of the data
df.shape

(68353, 11)

In [None]:
# Getting a short info of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 68353 entries, EMA00081000034500 to EMA00089999316900
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Email_Type                 68353 non-null  int64  
 1   Subject_Hotness_Score      68353 non-null  float64
 2   Email_Source_Type          68353 non-null  int64  
 3   Customer_Location          56758 non-null  object 
 4   Email_Campaign_Type        68353 non-null  int64  
 5   Total_Past_Communications  61528 non-null  float64
 6   Time_Email_sent_Category   68353 non-null  int64  
 7   Word_Count                 68353 non-null  int64  
 8   Total_Links                66152 non-null  float64
 9   Total_Images               66676 non-null  float64
 10  Email_Status               68353 non-null  int64  
dtypes: float64(4), int64(6), object(1)
memory usage: 6.3+ MB


## **1. Basic EDA :**

From the data shape and info it is clear that there are some null values in the dataset. Let's have a look at them and see what we can do to impute them.

The columns with missing values are :

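1.   Customer_Location (Categorical) : 11595 missing (56758 of 68353 non-null)
2.   Total_Past_Communications (Numerical) : 6825 missing (61528 non-null)
3.   Total_Links (Numerical) : 2201 missing (66152 non-null)
4.   Total_Images (Numerical) : 1677 missing (66676 non-null)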
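
The same list can be produced programmatically instead of reading it off `df.info()`. The sketch below uses a small hand-made frame (`toy`, a hypothetical stand-in with a few of the dataset's column names) so that it runs on its own; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing a few column names from the dataset
toy = pd.DataFrame({
    "Customer_Location": ["E", np.nan, "B", "E", "C"],
    "Total_Past_Communications": [33.0, 15.0, np.nan, 25.0, 18.0],
    "Total_Links": [8.0, 5.0, 5.0, np.nan, 4.0],
    "Word_Count": [440, 504, 962, 610, 947],
})

# Count missing values per column and keep only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]

# Express the counts as a percentage of all rows
missing_pct = (missing / len(toy) * 100).round(1)

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```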



Here is one way to handle them. We can first create dummy (one-hot) features from the categorical column so that every column is numeric, and then use KNNImputer to impute the missing values in the numerical columns.

In [63]:
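# One-hot encoding Customer_Location (the only non-numeric column); dummy columns are named A, B, ..., G
df_with_dummy = pd.get_dummies(df, prefix = '', prefix_sep = '')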

In [64]:
# Imputing missing data using KNN Imputer
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors = 3)
df_with_no_na = imputer.fit_transform(df_with_dummy)

In [66]:
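# Wrapping the imputed array back into a DataFrame with the original column names and index
df_with_no_na = pd.DataFrame(df_with_no_na, columns = df_with_dummy.columns, index = df_with_dummy.index)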
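
To see what this step does on a small scale, here is a minimal sketch on a hand-made frame (`toy`, hypothetical values). It shows two things: `pd.get_dummies` turns a missing category into a row of all-zero dummies rather than NaN, and `KNNImputer` fills a numeric gap with the average of that column over the nearest rows. The use of `n_neighbors=2` and `dtype=float` here is a choice made for the sketch only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame: one missing Total_Links, one missing location
toy = pd.DataFrame({
    "Word_Count": [440, 504, 962, 610, 947],
    "Total_Links": [8.0, 5.0, np.nan, 16.0, 4.0],
    "Customer_Location": ["E", np.nan, "B", "E", "C"],
})

# A missing category becomes an all-zero row of dummies, not NaN
toy_dummies = pd.get_dummies(toy, prefix="", prefix_sep="", dtype=float)

# Each NaN is replaced by the mean of that column over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
toy_filled = pd.DataFrame(
    imputer.fit_transform(toy_dummies),
    columns=toy_dummies.columns,
    index=toy.index,
)
print(toy_filled)
```

Because the features are not scaled, the distance is dominated by the column with the largest range (`Word_Count` in this toy frame). Scaling the columns before imputation is a common option if that is not desired. Also note that the dummy columns cannot be used to recover the missing locations, since those rows are simply all zeros.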

Cool! Now let's copy the imputed columns back into the main dataset.

In [84]:
# Replacing the numerical columns that had missing values with their imputed versions
df['Total_Past_Communications'] = df_with_no_na['Total_Past_Communications'].values
df['Total_Links'] = df_with_no_na['Total_Links'].values
df['Total_Images'] = df_with_no_na['Total_Images'].values

In [83]:
# Check the info again
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 68353 entries, EMA00081000034500 to EMA00089999316900
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Email_Type                 68353 non-null  int64  
 1   Subject_Hotness_Score      68353 non-null  float64
 2   Email_Source_Type          68353 non-null  int64  
 3   Customer_Location          56758 non-null  object 
 4   Email_Campaign_Type        68353 non-null  int64  
 5   Total_Past_Communications  68353 non-null  float64
 6   Time_Email_sent_Category   68353 non-null  int64  
 7   Word_Count                 68353 non-null  int64  
 8   Total_Links                68353 non-null  float64
 9   Total_Images               68353 non-null  float64
 10  Email_Status               68353 non-null  int64  
dtypes: float64(4), int64(6), object(1)
memory usage: 6.3+ MB


Awesome! Now let's do something about the categorical column, which still has missing values. I am going to use ML to solve this problem: I will build a simple Decision Tree model that predicts Customer_Location from the other features. It will be trained on the observations where the location is known and then used to predict the location for the observations where it is missing.
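
A minimal sketch of this idea on a hand-made frame is shown below. The frame `toy`, the feature subset in `features`, and `max_depth=3` are all assumptions made for illustration; the actual notebook may use different features and settings.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy frame with two missing locations
toy = pd.DataFrame({
    "Email_Type": [1, 2, 2, 1, 1, 2],
    "Word_Count": [440, 504, 962, 610, 947, 480],
    "Total_Links": [8.0, 5.0, 5.0, 16.0, 4.0, 6.0],
    "Customer_Location": ["E", np.nan, "B", "E", "C", np.nan],
})

# Predictor columns (illustrative subset) and a mask of rows with a known location
features = ["Email_Type", "Word_Count", "Total_Links"]
known = toy["Customer_Location"].notna()

# Train only on rows where the location is known
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(toy.loc[known, features], toy.loc[known, "Customer_Location"])

# Predict the location for the remaining rows and write it back
toy.loc[~known, "Customer_Location"] = clf.predict(toy.loc[~known, features])
print(toy)
```

The imputed values can only be locations that appear in the training rows, so the quality of this fill depends on how well the other features actually separate the locations.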