
# **Project Title : Email Campaign Effectiveness Prediction**

https://github.com/Kailash-13011992/Introduction-to-Machine-learning

## **Problem Description**

Most small and medium business owners use Gmail-based email marketing strategies
for offline targeting, converting prospective customers into leads so that they
stay with the business.
The main objective is to create a machine learning model that characterizes each mail and tracks
whether it is ignored, read, or acknowledged by the reader.
Data columns are self-explanatory.

## **Business Context**
Email marketing is the act of sending a commercial message, typically to a group of people, using email. In its broadest sense, every email sent to a potential or current customer could be considered email marketing. It involves using email to send advertisements, request business, or solicit sales or donations. Email marketing strategies commonly seek to achieve one or more of three primary objectives, to build loyalty, trust, or brand awareness. The term usually refers to sending email messages with the purpose of enhancing a merchant's relationship with current or previous customers, encouraging customer loyalty and repeat business, acquiring new customers or convincing current customers to purchase something immediately, and sharing third-party ads.

## **Data Description**
* **Email Id** - The email IDs of the customers/individuals.
* **Email Type** - There are two categories, 1 and 2. We can think of them as marketing emails and important updates or notices regarding the business.
* **Subject Hotness Score** - A score for the email's subject, based on how good and effective the content is.
* **Email Source** - The source of the email, such as sales and marketing, or important admin mails related to the product.
* **Email Campaign Type** - The campaign type of the email.
* **Total Past Communications** - The total number of previous mails from the same source.
* **Customer Location** - Demographic data of the customer: the location where the customer resides.
* **Time Email sent Category** - The time of day when the email was sent. It has three categories, 1, 2 and 3, which we can think of as morning, evening and night time slots.
* **Word Count** - The number of words contained in the email.
* **Total links** - Number of links in the email.
* **Total Images** - Number of images in the email.
* **Email Status** - Our target variable, which records whether the mail was ignored, read, or acknowledged by the reader.

## **Data Collection and Preprocessing**

### Importing

In [None]:
# Importing important libraries and modules
# For data reading and manipulation
import pandas as pd
import numpy as np

# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams.update({'figure.figsize':(8,5),'figure.dpi':100})

# VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Modelling
# Train-Test Split
from sklearn.model_selection import train_test_split
# Grid Search for Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, roc_auc_score, f1_score, recall_score,roc_curve, classification_report

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Reading the csv dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/PROJECTS/Supervised ML - Classification/data_email_campaign.csv')


### Data Inspection

In [None]:
# Size of the data
df.shape

In [None]:
# First look of our dataset
df.head()

In [None]:
# Basic info of the data
df.info()

In [None]:
# Description of the data
df.describe()

In [None]:
df.isnull().mean()*100

From the above output it can be observed that 4 features have null values.\
● Customer_Location\
● Total_Past_Communications\
● Total_Links\
● Total_Images\
We will handle them in the upcoming Data Cleaning section.

In [None]:
# Looking for duplicates
df.duplicated().sum()

There are no duplicates in the dataset.

## **Exploratory Data Analysis**

### Categorical Data

In [None]:
#starting with categorical variables
categorical_variables = ['Email_Type','Email_Source_Type','Customer_Location','Email_Campaign_Type','Time_Email_sent_Category']
Target_variable = ['Email_Status']

for value in categorical_variables:
  ax = sns.countplot(x=df[value], hue=df[Target_variable[0]])
  # Number of non-null categories (x==x is False for NaN)
  unique = len([x for x in df[value].unique() if x==x])
  # Bars are created in hue order
  bars = ax.patches
  for j in range(unique):
      # Bars belonging to category j, one per Email_Status value
      catbars=bars[j:][::unique]
      # Total count within this category
      total = sum([x.get_height() for x in catbars])
      # Print each status's share of the category on its bar
      for bar in catbars:
        ax.text(bar.get_x()+bar.get_width()/2.,
                    bar.get_height(),
                    f'{bar.get_height()/total:.0%}',
                    ha="center",va="bottom")
  plt.show()

As can be observed, the distribution of Email_Status is almost the same across the categories of every feature except Email_Campaign_Type, which shows a totally different trend. For Email_Campaign_Type = 1 only 10% of the customers ignore the email, whereas for Email_Campaign_Type = 2 around 87% of customers ignore it.
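
The percentages printed on the bars are shares within each category, not shares of the whole dataset. The following is a minimal sketch of that normalization on hypothetical toy records (the tuples and the helper name `status_share_by_category` are illustrative assumptions, not part of the notebook):

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (Email_Campaign_Type, Email_Status); not the real data
records = [(1, 1), (1, 2), (1, 1), (1, 0),
           (2, 0), (2, 0), (2, 0), (2, 1)]

def status_share_by_category(rows):
    """Return {category: {status: share}}; shares sum to 1 within each category."""
    counts = defaultdict(Counter)
    for category, status in rows:
        counts[category][status] += 1
    shares = {}
    for category, status_counts in counts.items():
        total = sum(status_counts.values())
        shares[category] = {s: n / total for s, n in sorted(status_counts.items())}
    return shares

shares = status_share_by_category(records)
for category, dist in sorted(shares.items()):
    print(category, {s: f"{p:.0%}" for s, p in dist.items()})
```

Normalizing within a category is what makes categories of very different sizes comparable.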

### Continuous Data

#### Univariate

In [None]:
#continuous variables
continuous_variables = ['Subject_Hotness_Score', 'Total_Past_Communications','Word_Count','Total_Links','Total_Images']
# Three variables per figure: boxplots on the top row, distributions on the bottom row
for start in range(0, len(continuous_variables), 3):
    fig = plt.figure(figsize = (15,10))
    for i, c in enumerate(continuous_variables[start:start+3], start=1):
        ax1 = fig.add_subplot(2,3,i)
        sns.boxplot(data = df, x=c, ax = ax1)
        ax2 = fig.add_subplot(2,3,i+3)
        # histplot with a KDE overlay replaces the deprecated distplot
        sns.histplot(df[c], kde=True, stat='density', ax=ax2)

It's evident that **Word_Count** and **Total_Past_Communications** follow almost a **normal distribution**. The rest of the features are **highly skewed** to the **left**.
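
A visual impression of skew can be cross-checked numerically. The sketch below computes the moment-based sample skewness on hypothetical toy lists (not the notebook's columns); under this convention a positive value means a longer tail towards large values and a negative value a longer tail towards small values. On the real data, `df[c].skew()` in pandas gives a closely related, bias-adjusted estimate.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical toy samples, for illustration only
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 2, 2, 3, 10]
long_left_tail = [-v for v in long_right_tail]

for name, sample in [("symmetric", symmetric),
                     ("long right tail", long_right_tail),
                     ("long left tail", long_left_tail)]:
    print(name, round(sample_skewness(sample), 3))
```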

#### Bivariate

In [None]:
#continuous variables through boxplots
fig = plt.figure(figsize = (15,10))
i = 1
for value in continuous_variables:
  if i <= len(continuous_variables):
    axes = fig.add_subplot(2,3,i)
    ax = sns.boxplot(data = df, x = 'Email_Status', y = value, ax = axes)
  i += 1

From the above boxplots, the following observations can be made:
* For a **high Subject_Hotness_Score** the chance of a mail getting **ignored** is also **high**.
* As **Total_Past_Communications** **increases**, the chance of an email getting **ignored decreases**.
* As the **word_count** increases beyond the **600** mark we see that there is a **high** possibility of that email being **ignored**. The ideal mark is **400–600**.

In [None]:
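## Correlation between continuous variables
correlation = df[continuous_variables].corr()
# Plot the signed correlation on a fixed (-1, 1) scale so that the direction of each relationship stays visible
sns.heatmap(correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1)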

Here it can be observed that the correlation between **Total_Images** and **Total_Links** is **0.78**. On a scale of (-1, 1), this is a **high positive correlation**.

## **Data Cleaning**

### Handling Missing Data

In [None]:
# Dropping Customer_Location column from the dataframe
df.drop(columns=['Customer_Location'], inplace = True)
# Removing Customer_Location from categorical_variables
categorical_variables.remove('Customer_Location')

The missing values analysis showed that **Customer_Location** has the **most** missing values (16.963411 % missing). In addition, the frequency plots of Customer_Location against **Email_Status** in the categorical data analysis showed that the share of emails being ignored, read or acknowledged is the same **irrespective** of the **Customer_Location**.\
● The Customer_Location feature does not appear to affect Email_Status, so it can be dropped.

In [None]:
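# Imputing Total_Past_Communications with the mean
# Assigning the result back avoids calling inplace fillna on a column selection, which newer pandas versions warn about
df['Total_Past_Communications'] = df['Total_Past_Communications'].fillna(df['Total_Past_Communications'].mean())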

From the continuous data analysis we know that **Total_Past_Communications** follows an **approximately normal distribution**. So, let's **impute** the missing values with the **mean**.

In [None]:
# Imputing Total_Links with the mode
df['Total_Links'] = df['Total_Links'].fillna(df['Total_Links'].mode()[0])

In [None]:
# Imputing Total_Images with the mode
df['Total_Images'] = df['Total_Images'].fillna(df['Total_Images'].mode()[0])

From the continuous data analysis part it's known that the graph of **Total_Links & Total_Images** is **left skewed**. So, **imputing** the missing values by the **mode** of the values is most appropriate.
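
The reasoning behind the two strategies can be illustrated on small hypothetical lists (the values and the helper name `impute` are assumptions for illustration, with `None` standing in for a missing entry): the mean is a sensible fill value when the observed values are roughly symmetric, while the mode stays at the bulk of a skewed column instead of being pulled towards its tail.

```python
from statistics import mean, mode

def impute(values, strategy):
    """Replace None entries with the mean or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns, not the notebook's data
past_comms = [20, 30, None, 40, 30]     # roughly symmetric
total_links = [1, 1, None, 2, 1, 15]    # skewed by one large value

print(impute(past_comms, "mean"))    # missing entry filled with the mean, 30
print(impute(total_links, "mode"))   # missing entry filled with the mode, 1
print(mean(v for v in total_links if v is not None))  # the mean is pulled up to 4 by the value 15
```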

### Clean-up

In [None]:
# Dropping column Email_ID
df.drop(columns=['Email_ID'], inplace=True)

Email_ID is only an identifier column, so it adds no predictive value to our data and is better dropped.

## **Feature Engineering**

### Multicollinearity

In [None]:
# VIF code
def vif_cal(df):
  vif = pd.DataFrame()
  vif["variables"] = df.columns
  vif["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
  return(vif)

In [None]:
# Let's get VIF scores
vif_df = vif_cal(df[[i for i in df.describe().columns if i not in categorical_variables + ['Email_Status']]])
vif_df

In [None]:
#scatter plot between total images and total links
sns.scatterplot(x=df["Total_Images"],y=df["Total_Links"],hue=df['Email_Status'])

The relation between Total_Links and Total_Images is almost linear, so it is better to add them together into a single feature.
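
As a rough worked example of why a strong pairwise correlation inflates VIF: when a predictor is regressed (with an intercept) on a single other predictor, the R² of that regression equals the squared correlation r², so VIF = 1 / (1 − r²). The sketch below plugs in the correlation of 0.78 reported in the EDA. This is a simplified two-variable illustration, not the computation in `vif_cal`: to my understanding, statsmodels' `variance_inflation_factor` regresses each column on all the other columns passed in and does not add an intercept on its own, so the scores in the notebook can differ from this figure.

```python
def pairwise_vif(r):
    """VIF of a predictor explained by exactly one other predictor with correlation r."""
    return 1.0 / (1.0 - r ** 2)

# 0.78 is the Total_Images / Total_Links correlation reported in the EDA
vif = pairwise_vif(0.78)
print(round(vif, 2))

# VIF grows quickly as the correlation approaches 1
for r in (0.0, 0.5, 0.9, 0.99):
    print(r, round(pairwise_vif(r), 2))
```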

In [None]:
# Combining total links and total images
df['Total_Images_Links'] = df['Total_Images'] + df['Total_Links']
# Dropping previous columns
df.drop(['Total_Images','Total_Links'],inplace=True,axis=1)

In [None]:
# Let's check VIF scores
vif_df = vif_cal(df[[i for i in df.describe().columns if i not in categorical_variables + ['Email_Status']]])
vif_df

### Outliers Treatment

In [None]:
# Removing dropped columns from the dataset
continuous_variables.remove('Total_Images')
continuous_variables.remove('Total_Links')
# Adding the combined column
continuous_variables.append('Total_Images_Links')

In [None]:
df.head()

In [None]:
# Check for the outliers in continuous variables
sns.boxplot(data = df[continuous_variables], orient='h', dodge=False)

The feature **Word_Count** has **no** outliers.

In [None]:
# Removing Word_Count column as it has no outliers
continuous_variables.remove('Word_Count')
# Dictionary to store, per feature, the number of outliers in each Email_Status class
outliers = {}
for elem in continuous_variables:
  # Finding quartiles
  q_75, q_25 = np.percentile(df.loc[:,elem],[75,25])
  # Calculating Inter Quartile Range
  IQR = q_75-q_25
  # Fixing boundaries for outliers (named so as not to shadow the built-ins max and min)
  upper = q_75+(1.5*IQR)
  lower = q_25-(1.5*IQR)
  # Email_Status of every row below the lower or above the upper boundary
  outlier_list = df.loc[(df[elem] < lower) | (df[elem] > upper), 'Email_Status'].tolist()
  outliers[elem]={}
  for status in outlier_list:
      outliers[elem][status] = outliers[elem].get(status,0) + 1
print(outliers)

Since the dependent variable is highly imbalanced, before dropping outliers we must check that doing so will not delete more than 5% of the minority classes, i.e. Email_Status = 1 and 2.
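
The 1.5 × IQR rule and the per-class check can be sketched on a small hypothetical sample (the rows and the helper name `iqr_fences` are assumptions for illustration). The standard library's `statistics.quantiles` with `method="inclusive"` uses linear interpolation between data points, which should match the default behaviour of `np.percentile`.

```python
from collections import Counter
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) outlier boundaries from the quartiles of values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy rows: (feature value, Email_Status); not the real data
rows = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
        (6, 2), (7, 0), (8, 0), (9, 1), (40, 2)]

lower, upper = iqr_fences([value for value, _ in rows])
outlier_counts = Counter(status for value, status in rows
                         if value < lower or value > upper)

# Share of minority-class rows (status 1 or 2) that would be lost
minority_total = sum(1 for _, status in rows if status in (1, 2))
minority_outliers = outlier_counts[1] + outlier_counts[2]
minority_share = 100 * minority_outliers / minority_total
print(lower, upper, dict(outlier_counts), minority_share)
```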

In [None]:
# Finding the percentage of the minority classes that would be affected by outlier removal
sum_min=0
sum_maj=0
for x in continuous_variables:
  # .get avoids a KeyError when a class has no outliers for a feature
  sum_min += outliers[x].get(1,0)
  sum_min += outliers[x].get(2,0)
  sum_maj += outliers[x].get(0,0)
status_counts = df['Email_Status'].value_counts()
total=status_counts[1]+status_counts[2]
total_0=status_counts[0]
print("Percentage of majority class having outliers = ",100*sum_maj/total_0)
print("Percentage of minority class having outliers = ",100*sum_min/total)


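Removing all outliers would delete close to 5% of the minority-class data, so we decided against removing outliers from the minority classes. Only majority-class rows (Email_Status = 0) below the 1st or above the 99th percentile are dropped in the next cell. The remaining outliers can be handled through normalization and by choosing boosted trees for modelling, which are robust to outliers.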

In [None]:
# Deleting majority outliers
for elem in continuous_variables:
  q_low = df[elem].quantile(0.01)
  q_high  = df[elem].quantile(0.99)
  df = df.drop(df[(df[elem] > q_high) &  (df['Email_Status']==0)].index)
  df = df.drop(df[(df[elem] < q_low) & (df['Email_Status']==0)].index)

In [None]:
categorical_variables

In [None]:
#creating dummy variables
df = pd.get_dummies(df,columns=categorical_variables, drop_first=True)
# as some features had binary categories, we are going to delete one of them to keep it binary encoded and have less columns
df.head(2)

In [None]:
df.shape

### Feature Scaling

In [None]:
# Let's add Word_Count back to the continuous variables
continuous_variables.append('Word_Count')

In [None]:
# Splitting the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(df.drop('Email_Status', axis = 1), df['Email_Status'], test_size=0.20, random_state = 42, stratify = df['Email_Status'])
# Explicit copies so that the scaled values can be assigned back safely
X_train, X_test = X_train.copy(), X_test.copy()

We need to stratify to get the same proportion of classes in both sets.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler on the train set only, so that it learns nothing from the test set
scaler.fit(X_train[continuous_variables])

# Transform train and test sets
X_train[continuous_variables] = scaler.transform(X_train[continuous_variables])
X_test[continuous_variables] = scaler.transform(X_test[continuous_variables])
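
To make the effect of stratification concrete, here is a minimal pure-Python sketch that splits each class separately so that the test set keeps the class proportions of the full data. The labels and the helper name `stratified_split` are hypothetical; scikit-learn's `train_test_split(..., stratify=...)` serves the same purpose but its internals are not reproduced here.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly test_size of every class in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 80% class 0, 15% class 1, 5% class 2
labels = [0] * 80 + [1] * 15 + [2] * 5
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
train_counts = Counter(labels[i] for i in train_idx)
print(dict(test_counts), dict(train_counts))
```

Without stratification, a small class such as the 5% one above could end up with no rows at all in the test set purely by chance.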

In [None]:
X_train.head()

### Handling Imbalance

In [None]:
# Visualizing our imbalanced dataset
ax = sns.countplot(x=df['Email_Status'])
totals = []
for i in ax.patches:
    totals.append(i.get_height())

total = sum(totals)

for i in ax.patches:
    ax.text(i.get_x() - .01, i.get_height() + .5, \
          str(round((i.get_height()/total)*100, 2))+'%', fontsize=12)
plt.show()

Only around 3.5% of observations are classified as acknowledged emails and 80% are ignored emails. This will create a bias in favour of ignored emails in the model.

#### Undersampling

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

rus = RandomUnderSampler(random_state=42, replacement=True)
x_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

# Compare against the training set, since only the training data is resampled
print('Original training set shape:', len(y_train))
print('Resampled dataset shape', len(y_train_rus))

In [None]:
plt.bar(Counter(y_train).keys(), Counter(y_train).values())
plt.title("Before Undersampling")

In [None]:
plt.bar(Counter(y_train_rus).keys(), Counter(y_train_rus).values())
plt.title("After Undersampling")

In [None]:
unique_elements, count_of_elements = np.unique(y_train_rus, return_counts=True)
print("Frequency of the unique values of Email_Status:")
print(np.asarray((unique_elements, count_of_elements)))

Random Under Sampler created a balanced dataset of 2373 records.
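
The idea behind random undersampling can be sketched in a few lines: every class is cut down to the size of the smallest class. The labels and the helper name `random_undersample` below are hypothetical, and the sketch samples without replacement, whereas the notebook passes `replacement=True` to `RandomUnderSampler`, so this is an illustration of the principle rather than a reproduction of that call.

```python
import random
from collections import Counter

def random_undersample(labels, seed=42):
    """Return sorted row indices keeping as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept = []
    for cls in counts:
        class_indices = [i for i, y in enumerate(labels) if y == cls]
        kept.extend(rng.sample(class_indices, target))
    return sorted(kept)

# Hypothetical imbalanced labels, for illustration only
labels = [0] * 80 + [1] * 15 + [2] * 5
kept = random_undersample(labels)
balanced_counts = Counter(labels[i] for i in kept)
print(len(kept), dict(balanced_counts))
```

The price of this balance is visible in the sketch: 85 of the 100 toy rows are discarded.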

#### SMOTE (Synthetic Minority Oversampling Technique)

In [None]:
from imblearn.over_sampling import SMOTE
# Fixing the seed keeps the synthetic samples reproducible across runs
smote = SMOTE(random_state=42)

# Fit predictor and target variable
x_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print('Original dataset shape', len(y_train))
print('Resampled dataset shape', len(y_train_smote))

In [None]:
plt.bar(Counter(y_train_smote).keys(), Counter(y_train_smote).values())
plt.title("After SMOTE")
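
SMOTE creates new minority rows by interpolating between an existing minority row and one of its nearest minority-class neighbours: x_new = x + λ (x_nn − x), with λ drawn uniformly from [0, 1). The sketch below is a deliberately simplified version that always uses the single nearest neighbour; the actual algorithm picks at random among k neighbours. The points and helper names are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def synthesize(minority, index, rng):
    """Create one synthetic point between minority[index] and its nearest minority neighbour."""
    base = minority[index]
    others = [p for i, p in enumerate(minority) if i != index]
    neighbour = min(others, key=lambda p: squared_distance(base, p))
    lam = rng.random()  # interpolation weight in [0, 1)
    return [b + lam * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical, already scaled minority-class rows with two features each
minority = [[0.10, 0.20], [0.20, 0.30], [0.80, 0.90]]
rng = random.Random(42)
synthetic = synthesize(minority, 0, rng)
print(synthetic)
```

Because every synthetic row lies on a segment between two real minority rows, oversampling this way adds no information from outside the observed minority region.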

## **Model Implementation and Evaluation**

In [None]:
#Columns needed to compare metrics
comparison_columns = ['Model_Name', 'Train_Accuracy', 'Train_Recall', 'Train_Precision', 'Train_F1score', 'Train_AUC' ,'Test_Accuracy', 'Test_Recall', 'Test_Precision', 'Test_F1score', 'Test_AUC']

In [None]:
def model_evaluation(model_name_RUS,model_name_SMOTE,model_var_rus, model_var_smote, x_train_rus, y_train_rus, x_train_smote, y_train_smote, X_test, y_test):
  ''' This function predicts with and evaluates classification models trained on Random Undersampling and SMOTE data, visualizes the results
      and returns a list of metric dictionaries used to compare the various models.'''

  #Making predictions random undersampling
  y_pred_rus_train = model_var_rus.predict(x_train_rus)
  y_pred_rus_test = model_var_rus.predict(X_test)
  #probs
  train_rus_proba = model_var_rus.predict_proba(x_train_rus)
  test_rus_proba = model_var_rus.predict_proba(X_test)

  #Making predictions smote
  y_pred_smote_train = model_var_smote.predict(x_train_smote)
  y_pred_smote_test = model_var_smote.predict(X_test)
  #probs
  train_sm_proba = model_var_smote.predict_proba(x_train_smote)
  test_sm_proba = model_var_smote.predict_proba(X_test)

  #Evaluation
  #Accuracy RUS
  accuracy_rus_train = accuracy_score(y_train_rus,y_pred_rus_train)
  accuracy_rus_test = accuracy_score(y_test,y_pred_rus_test)
  #Accuracy SMOTE
  accuracy_smote_train = accuracy_score(y_train_smote,y_pred_smote_train)
  accuracy_smote_test = accuracy_score(y_test,y_pred_smote_test)

  #Confusion Matrix RUS
  cm_rus_train = confusion_matrix(y_train_rus,y_pred_rus_train)
  cm_rus_test = confusion_matrix(y_test,y_pred_rus_test)
  #Confusion Matrix SMOTE
  cm_smote_train = confusion_matrix(y_train_smote,y_pred_smote_train)
  cm_smote_test = confusion_matrix(y_test,y_pred_smote_test)

  #Recall RUS
  train_recall_rus = recall_score(y_train_rus,y_pred_rus_train, average='weighted')
  test_recall_rus = recall_score(y_test,y_pred_rus_test, average='weighted')
  #Recall SMOTE
  train_recall_smote = recall_score(y_train_smote,y_pred_smote_train, average='weighted')
  test_recall_smote = recall_score(y_test,y_pred_smote_test, average='weighted')

  #Precision RUS
  train_precision_rus = precision_score(y_train_rus,y_pred_rus_train, average='weighted')
  test_precision_rus = precision_score(y_test,y_pred_rus_test, average='weighted')
  #Precision SMOTE
  train_precision_smote = precision_score(y_train_smote,y_pred_smote_train, average='weighted')
  test_precision_smote = precision_score(y_test,y_pred_smote_test, average='weighted')

  #F1 Score RUS
  train_f1_rus = f1_score(y_train_rus,y_pred_rus_train, average='weighted')
  test_f1_rus = f1_score(y_test,y_pred_rus_test, average='weighted')
  #F1 Score SMOTE
  train_f1_smote = f1_score(y_train_smote,y_pred_smote_train, average='weighted')
  test_f1_smote = f1_score(y_test,y_pred_smote_test, average='weighted')

  #ROC-AUC RUS
  train_auc_rus = roc_auc_score(y_train_rus,train_rus_proba,average='weighted',multi_class = 'ovr')
  test_auc_rus = roc_auc_score(y_test,test_rus_proba,average='weighted',multi_class = 'ovr')
  #ROC-AUC SMOTE
  train_auc_smote = roc_auc_score(y_train_smote,train_sm_proba,average='weighted',multi_class = 'ovr')
  test_auc_smote = roc_auc_score(y_test,test_sm_proba,average='weighted',multi_class = 'ovr')

  #Visualising Results RUS
  print("----- Evaluation on Random Undersampled data -----" + str(model_name_RUS) + "------")
  print("--------------Test data ---------------\n")
  print("Confusion matrix \n")
  print(cm_rus_test)
  print(classification_report(y_test,y_pred_rus_test))

  #create ROC curve
  fpr = {}
  tpr = {}
  thresh ={}
  no_of_class=3
  colors = ['blue', 'green', 'orange']
  for i in range(no_of_class):
      fpr[i], tpr[i], thresh[i] = metrics.roc_curve(y_test, test_rus_proba[:,i], pos_label=i)
      # Label each curve with its own one-vs-rest AUC instead of repeating the overall weighted AUC
      class_auc = metrics.auc(fpr[i], tpr[i])
      plt.plot(fpr[i], tpr[i], linestyle='--', color=colors[i], label=f'Class {i} vs Others AUC={class_auc:.3f}')
  plt.title('Multiclass ROC curve of ' + str(model_name_RUS))
  plt.ylabel('True Positive Rate')
  plt.xlabel('False Positive Rate')
  plt.legend(loc=4)
  plt.show()

  #Visualising Results SMOTE
  print("----- Evaluation on SMOTE data -------" + str(model_name_SMOTE) + '-----')
  print("---------------Test data ---------------\n")
  print("Confusion matrix \n")
  print(cm_smote_test)
  print(classification_report(y_test,y_pred_smote_test))

  #create ROC curve
  fpr = {}
  tpr = {}
  thresh ={}
  no_of_class=3
  colors = ['blue', 'green', 'orange']
  for i in range(no_of_class):
      fpr[i], tpr[i], thresh[i] = metrics.roc_curve(y_test, test_sm_proba[:,i], pos_label=i)
      # Label each curve with its own one-vs-rest AUC instead of repeating the overall weighted AUC
      class_auc = metrics.auc(fpr[i], tpr[i])
      plt.plot(fpr[i], tpr[i], linestyle='--', color=colors[i], label=f'Class {i} vs Others AUC={class_auc:.3f}')
  plt.title('Multiclass ROC curve of '+ str(model_name_SMOTE))
  plt.ylabel('True Positive Rate')
  plt.xlabel('False Positive Rate')
  plt.legend(loc=4)
  plt.show()

  #Saving our results
  global comparison_columns
  metric_scores_rus = [model_name_RUS,accuracy_rus_train,train_recall_rus,train_precision_rus,train_f1_rus,train_auc_rus,accuracy_rus_test,test_recall_rus,test_precision_rus,test_f1_rus,test_auc_rus]
  final_dict_rus = dict(zip(comparison_columns,metric_scores_rus))

  metric_scores_smote = [model_name_SMOTE,accuracy_smote_train,train_recall_smote,train_precision_smote,train_f1_smote,train_auc_smote,accuracy_smote_test,test_recall_smote,test_precision_smote,test_f1_smote,test_auc_smote]
  final_dict_smote = dict(zip(comparison_columns,metric_scores_smote))

  dict_list = [final_dict_rus, final_dict_smote]
  return dict_list

In [None]:
# Function to create the comparison table
final_list = []
def add_list_to_final_df(dict_list):
  global final_list
  for elem in dict_list:
    final_list.append(elem)
  global comparison_df
  comparison_df = pd.DataFrame(final_list, columns= comparison_columns)

### Logistic Regression

In [None]:
# Importing library
from sklearn.linear_model import LogisticRegression
# Fitting Random Under Sampling
logistic_rus = LogisticRegression(class_weight='balanced',multi_class='multinomial', solver='lbfgs')
logistic_rus.fit(x_train_rus, y_train_rus)

In [None]:
# Fitting on smote
logistic_smote = LogisticRegression(class_weight='balanced',multi_class='multinomial', solver='lbfgs')
logistic_smote.fit(x_train_smote, y_train_smote)

In [None]:
# Let's evaluate logistic regression
logistic_reg_list = model_evaluation('Logistic Regression RUS','Logistic Regression SMOTE',logistic_rus, logistic_smote, x_train_rus, y_train_rus, x_train_smote, y_train_smote, X_test, y_test)
logistic_reg_list

In [None]:
# Adding results to final list
add_list_to_final_df(logistic_reg_list)


In [None]:
# Having a look at our final comparison dataframe
comparison_df

### Decision Tree

In [None]:
# Importing library
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Applying Classifier using Random under sampling
dt_rus = DecisionTreeClassifier()
dt_rus.fit(x_train_rus,y_train_rus)

In [None]:
# Applying Classifier using SMOTE
dt_smote = DecisionTreeClassifier()
dt_smote.fit(x_train_smote,y_train_smote)

In [None]:
# Evaluating Decision Tree Classifier
dt_eval_list = model_evaluation('Decision Tree RUS', 'Decision Tree SMOTE', dt_rus, dt_smote, x_train_rus, y_train_rus, x_train_smote, y_train_smote, X_test, y_test)
dt_eval_list

In [None]:
# Updating the results list
add_list_to_final_df(dt_eval_list)
# Having a look at our final comparison dataframe
comparison_df

### KNN

In [None]:
# Importing library
from sklearn.neighbors import KNeighborsClassifier
knn_rus = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)

# Fit the model on the train set
knn_rus.fit(x_train_rus,y_train_rus)

In [None]:
# Applying Classifier SMOTE
knn_smote = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)

# Fit the model on the train set
knn_smote.fit(x_train_smote,y_train_smote)

In [None]:
# KNN Evaluation
knn_eval_list = model_evaluation('KNN RUS', 'KNN SMOTE', knn_rus, knn_smote, x_train_rus, y_train_rus, x_train_smote, y_train_smote, X_test, y_test)
knn_eval_list

In [None]:
# Updating the results list
add_list_to_final_df(knn_eval_list)
# Having a look at our final comparison dataframe
comparison_df


### Random Forest

In [None]:
# Importing library
from sklearn.ensemble import RandomForestClassifier

# Applying Classifier using Random under sampling
rf_rus = RandomForestClassifier(random_state=42, max_depth=5, n_estimators=100, oob_score=True)
rf_rus.fit(x_train_rus,y_train_rus)

In [None]:
# Applying Classifier SMOTE
rf_smote = RandomForestClassifier(random_state=42, max_depth=5, n_estimators=100, oob_score=True)
rf_smote.fit(x_train_smote,y_train_smote)

In [None]:
# Random Forest Evaluation
rf_eval_list = model_evaluation('Random Forest RUS', 'Random Forest SMOTE', rf_rus, rf_smote, x_train_rus, y_train_rus, x_train_smote, y_train_smote, X_test, y_test)
rf_eval_list

In [None]:
# Updating the results list
add_list_to_final_df(rf_eval_list)
# Having a look at our final comparison dataframe
comparison_df

### Random Forest Hyperparameter Tuning

In [None]:
# Instantiating the classifier to be tuned
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# Parameter dictionary
params = {'max_depth': [3,5,10,20],
          'min_samples_leaf': [5,10,20,50,100],
          'n_estimators': [10,25,30,50,100,200]}

# Grid Search to get the best parameters
grid_search = GridSearchCV(estimator=rf, param_grid=params, cv = 4, n_jobs=-1, verbose=1, scoring="f1_weighted")

# Fitting Random Under Sampling to grid search
grid_search.fit(x_train_rus,y_train_rus)

In [None]:
# Best estimator found by the grid search
rf_tuned_rus = grid_search.best_estimator_

In [None]:
# Fitting SMOTE to grid search
grid_search_smote = GridSearchCV(estimator=rf, param_grid=params, cv = 4, n_jobs=-1, verbose=1, scoring="f1_weighted")
grid_search_smote.fit(x_train_smote,y_train_smote)

# Best smote Parameters
rf_tuned_smote = grid_search_smote.best_estimator_

In [None]:
# Evaluation for Random Forest Hyperparameter Tuned model
rf_tuned_list = model_evaluation('Random Forest Tuned RUS', 'Random Forest Tuned SMOTE', rf_tuned_rus, rf_tuned_smote,x_train_rus, y_train_rus, x_train_smote, y_train_smote, X_test, y_test)
rf_tuned_list

In [None]:
# Updating the results list
add_list_to_final_df(rf_tuned_list)
# Having a look at our final comparison dataframe
comparison_df


In [None]:
# Feature importance given by hyperparameter random forest tuned model
feature_imp = pd.DataFrame({"Variable": x_train_smote.columns,"Importance": rf_tuned_smote.feature_importances_})
feature_imp.sort_values(by="Importance", ascending=False, inplace = True)

In [None]:
# Visualizing feature importance
sns.barplot(x=feature_imp['Importance'],y= feature_imp['Variable'])

In [None]:
# Dropping irrelevant features
x_train_smote1 = x_train_smote.drop(['Time_Email_sent_Category_3','Time_Email_sent_Category_2','Email_Campaign_Type_3'],axis=1)
x_train_rus1 = x_train_rus.drop(['Time_Email_sent_Category_3','Time_Email_sent_Category_2','Email_Campaign_Type_3'],axis=1)
X_test1 = X_test.drop(['Time_Email_sent_Category_3','Time_Email_sent_Category_2','Email_Campaign_Type_3'],axis=1)

In [None]:
# Grid Search to get the best parameters for RUS
grid_search_rus = GridSearchCV(estimator=rf, param_grid=params, cv = 4, n_jobs=-1, verbose=1, scoring="f1_weighted")
# Fitting RUS to grid search
grid_search_rus.fit(x_train_rus1,y_train_rus)
# Optimal model
rf_tuned_rus1 = grid_search_rus.best_estimator_

In [None]:
# Fitting SMOTE
grid_search_smote1 = GridSearchCV(estimator=rf, param_grid=params, cv = 4, n_jobs=-1, verbose=1, scoring="f1_weighted")
grid_search_smote1.fit(x_train_smote1,y_train_smote)
# Optimal smote model
rf_tuned_smote1 = grid_search_smote1.best_estimator_

In [None]:
# Model Evaluation for Hyperparameter tuned  Random Forest with feature selection
rf_tuned_list1 = model_evaluation('Random Forest Tuned RUS FSelect', 'Random Forest Tuned SMOTE FSelect', rf_tuned_rus1, rf_tuned_smote1,x_train_rus1, y_train_rus, x_train_smote1, y_train_smote, X_test1, y_test)
rf_tuned_list1

In [None]:
# Updating the results list
add_list_to_final_df(rf_tuned_list1)
# Having a look at our final comparison dataframe
comparison_df

### XGBoost

In [None]:
# Importing library
from xgboost import XGBClassifier

# Fitting rus
# min_samples_leaf and min_samples_split are scikit-learn tree parameters, not XGBoost ones, so they are left out
xgb_rus = XGBClassifier(n_estimators=100, max_depth=12)
xgb_rus.fit(x_train_rus, y_train_rus)

In [None]:
# Fitting smote
# min_samples_leaf and min_samples_split are scikit-learn tree parameters, not XGBoost ones, so they are left out
xgb_smote = XGBClassifier(n_estimators=100, max_depth=12)
xgb_smote.fit(x_train_smote, y_train_smote)

In [None]:
# Model evaluation of XGB
xgb_eval_list = model_evaluation('XGBoost RUS', 'XGBoost SMOTE', xgb_rus, xgb_smote, x_train_rus, y_train_rus, x_train_smote, y_train_smote, X_test, y_test)
xgb_eval_list

In [None]:
# Visualising feature importance of XGBoost Classifier
feature_imp_xgb = pd.DataFrame({"Variable": x_train_smote.columns,"Importance": xgb_smote.feature_importances_})
feature_imp_xgb.sort_values(by="Importance", ascending=False, inplace = True)
sns.barplot(x=feature_imp_xgb['Importance'], y= feature_imp_xgb['Variable'])

In [None]:
# Updating the results list
add_list_to_final_df(xgb_eval_list)
# Having a look at our final comparison dataframe
comparison_df

### Comparison of all the Models

In [None]:
# Visualizing comparison of f1 score for all models
from matplotlib.lines import Line2D
# Creating the figure and axes
fig, ax = plt.subplots()

sns.pointplot(y=comparison_df['Model_Name'], x = comparison_df['Test_F1score'], color='g', ax=ax)
sns.pointplot(y=comparison_df['Model_Name'], x = comparison_df['Train_F1score'], color='r', ax=ax)

# Renaming the axes
ax.set(xlabel="Score", ylabel="Model_Name")
# Explicit legend handles, so the legend does not depend on the order in which seaborn draws its lines
ax.legend(handles=[Line2D([0],[0],color='g',marker='o'), Line2D([0],[0],color='r',marker='o')], labels=["Test_F1score","Train_F1score"])

# Showing the plot
plt.show()

In [None]:
# Visualizing comparison of auc score for all models
from matplotlib.lines import Line2D
# Creating the figure and axes
fig, ax = plt.subplots()

sns.pointplot(y=comparison_df['Model_Name'], x = comparison_df['Test_AUC'], color='g', ax=ax)
sns.pointplot(y=comparison_df['Model_Name'], x = comparison_df['Train_AUC'], color='r', ax=ax)

# Renaming the axes
ax.set(xlabel="Score", ylabel="Model_Name")
# Explicit legend handles, so the legend does not depend on the order in which seaborn draws its lines
ax.legend(handles=[Line2D([0],[0],color='g',marker='o'), Line2D([0],[0],color='r',marker='o')], labels=["Test_AUC","Train_AUC"])

# Showing the plot
plt.show()

## **Conclusions**

* It could be observed from the EDA that **Email_Campaign_Type** was the **most important** feature. If the Email_Campaign_Type was **1**, only about **10%** of emails were ignored, so roughly **90%** were **read or acknowledged**.

* It was observed that both **Time_Email_sent_Category and Customer_Location** were insignificant in determining the **Email_Status**. The ratio of the Email_Status values was the same **irrespective** of the time slot in which the emails were sent.

* As the **word_count** increases beyond the **600** mark we see that there is a **high** possibility of that email being **ignored**. The ideal mark was **400-600**.

* For modelling, it was observed that for **imbalance handling**, oversampling with **SMOTE** worked considerably better than **undersampling**, as the latter discards a large share of the data and therefore loses information.

* The **Decision Tree model** was **overfitting**: it performed very well on the train data but poorly on the test data.

* **Hyperparameter tuning** did not improve the results much and cost a lot of computational time.

* The **XGBoost algorithm** performed **best** on this imbalanced data with outliers, followed by the hyperparameter-tuned Random Forest model after feature selection, with an F1 score of 0.75 on the test set.
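
The F1 scores in the comparison are computed with `average='weighted'`, meaning the per-class F1 scores are averaged with weights proportional to each class's number of true samples. A minimal sketch of that calculation from a confusion matrix follows; the matrix is hypothetical and the helper name `weighted_f1` is an assumption, with rows as true classes and columns as predicted classes, as in scikit-learn's `confusion_matrix`.

```python
def weighted_f1(cm):
    """Support-weighted average of per-class F1; cm[i][j] = true class i, predicted class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        support = sum(cm[k])                               # true samples of class k
        predicted = sum(cm[i][k] for i in range(n_classes))  # samples predicted as class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support / total
    return score

# Hypothetical two-class confusion matrix, for illustration only
cm = [[8, 2],
      [1, 1]]
print(round(weighted_f1(cm), 4))
```

Because the weights follow class support, a weighted F1 on an imbalanced test set is dominated by the majority class, so the per-class rows of the classification report remain worth reading alongside it.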