<a href="https://colab.research.google.com/github/ParagKuthe/Capstone-Project-Play-Store-App-Review-Analysis/blob/main/Email_Campaign_Effectiveness_Prediction_ML_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<font size='8px'><font color='#800080'>**Project Name**</font> - <font color='#3792cb'>**Email Campaign Effectiveness Prediction**</font></font>







##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Project Summary -**

The goal of this project is to create a machine learning model that can characterize and track emails sent through Gmail-based email marketing campaigns. This model will be used by small to medium business owners who are looking to improve the effectiveness of their email marketing efforts and increase customer retention.

One of the main challenges in email marketing is determining which emails are being read, ignored, or acknowledged by the reader. By understanding which emails are most effective at engaging the reader, business owners can tailor their marketing efforts and increase their chances of success.

To address this problem, we will gather data on a variety of email characteristics, including the subject line, sender name, email content, email format, and email frequency. We will also consider the target audience of the emails and any other relevant factors.

Using this data, we will train a machine learning model to predict whether an email is likely to be read, ignored, or acknowledged by the reader. This model will be able to analyze new emails and provide a prediction of how they are likely to be received by the reader.

To evaluate the performance of the model, we will split our data into a training set and a testing set. We will use the training set to fit the model and the testing set to evaluate its performance. We will use a variety of metrics, such as precision, recall, and F1 score, to assess the model's accuracy and effectiveness.
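
The evaluation workflow described above can be sketched as follows. This is a minimal illustration, not the project's final pipeline: the data is synthetic (generated with `make_classification` to mimic three imbalanced classes), and the choice of `LogisticRegression`, the 80/20 split, and the `random_state` values are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the email features (an assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, weights=[0.8, 0.16, 0.04], random_state=42,
)

# stratify=y keeps the class proportions similar in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit on the training set only, then predict on the held-out test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1, plus a single macro-averaged F1 score.
print(classification_report(y_test, y_pred, zero_division=0))
macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
print(f"Macro F1: {macro_f1:.3f}")
```

Macro-averaged F1 weights every class equally, which is one reasonable choice when the minority classes matter as much as the majority class.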

Once the model is trained and evaluated, it can be deployed in a production environment to help small to medium business owners improve the effectiveness of their email marketing campaigns. By using the model to characterize and track emails, they will be able to make more informed decisions about how to target their marketing efforts and increase customer retention.

Overall, this project aims to provide small to medium business owners with a powerful tool for improving the effectiveness of their email marketing campaigns. By using machine learning to characterize and track emails, they will be able to make more informed decisions and increase the chances of success for their marketing efforts.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Small to medium business owners are using Gmail-based email marketing strategies to convert prospective customers into leads, but they are unable to track which emails are being ignored, read, or acknowledged by the reader. They want to create a machine learning model to help characterize and track these emails. The main objective is to improve the effectiveness of their email marketing efforts and increase customer retention.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Data handling and numerics
import numpy as np
import pandas as pd
from numpy import loadtxt

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Statistics
from scipy.stats import *
from scipy import stats
import math
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Modelling, resampling, and evaluation
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, confusion_matrix,
                             f1_score, classification_report, roc_auc_score, mean_squared_error)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


### Dataset First View

In [None]:
# Dataset First Look
df=pd.read_csv("/content/drive/MyDrive/data_email_campaign.csv")
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


### Dataset Information

In [None]:
# Dataset Info
print("Information Of Dataset:\n")
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Count Of Duplicate Values:\n")
# duplicated() flags repeated rows; summing the boolean mask gives their count
duplicate=df.duplicated()
duplicate.sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Null Values In DataSet\n")
null_values=df.isnull().sum().reset_index(name="Null Values")
# Keep only the columns that actually contain nulls
null_values=null_values[null_values["Null Values"]!=0].copy()
# Share of rows with a missing value, per column
null_values["Null Values %"]=null_values["Null Values"]/len(df)*100
null_values

In [None]:
#Removing Missing Values
df.dropna(inplace=True)

In [None]:
df.isnull().sum()

### What did you know about your dataset?

Most small to medium business owners use Gmail-based email marketing strategies to convert prospective customers into leads and keep them as customers. The main objective is to create a machine learning model that characterizes each email and tracks whether it is ignored, read, or acknowledged by the reader.

The raw dataset has 68353 rows and 12 columns. After the rows with null values are removed, 48291 rows and 12 columns remain.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Columns In DataSets:\n")
list(df.columns)

In [None]:
# Dataset Describe
df.describe(include="all")

### Variables Description

**Email_ID :** Email ID of the customer. \
**Email_Type :** 1 for marketing emails, 2 for business emails. \
**Subject_Hotness_Score :** Score based on how good and effective the content is. \
**Email_Source_Type :** Source of the email: 1 for marketing, 2 for business. \
**Customer_Location :** Location of the customer. \
**Email_Campaign_Type :** Campaign type: 1 for marketing, 2 for business, 3 for important admin emails related to products. \
**Total_Past_Communications :** Total number of previous emails from the same source. \
**Time_Email_sent_Category :** Time at which the email was sent: 1 for morning, 2 for evening, 3 for night. \
**Word_Count :** Number of words in the email. \
**Total_Links :** Number of links in the email. \
**Total_Images :** Number of images in the email. \
**Email_Status :** Target variable: 0 if the email was ignored, 1 if read, 2 if acknowledged by the reader.
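
The coded values described above can be turned into readable labels before plotting or reporting. The sketch below is illustrative only: the mappings are taken from the variable descriptions, while the small `sample` frame and the new column names (`Email_Status_Label`, `Time_Sent_Label`) are made up for the example.

```python
import pandas as pd

# Code-to-label mappings taken from the variable descriptions above.
STATUS_LABELS = {0: "Ignored", 1: "Read", 2: "Acknowledged"}
TIME_LABELS = {1: "Morning", 2: "Evening", 3: "Night"}

# Tiny made-up frame standing in for the real dataset.
sample = pd.DataFrame({
    "Email_Status": [0, 1, 2, 0],
    "Time_Email_sent_Category": [1, 2, 3, 2],
})

# assign() returns a new frame, so the original coded columns stay untouched.
labelled = sample.assign(
    Email_Status_Label=sample["Email_Status"].map(STATUS_LABELS),
    Time_Sent_Label=sample["Time_Email_sent_Category"].map(TIME_LABELS),
)
print(labelled)
```

Keeping the numeric codes alongside the labels means the same frame can still be fed to a model while charts use the readable names.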

### Check Unique Values for each variable.

In [None]:
for i in df:
  print("\n Unique Values In ",i,"\n",df[i].unique(),"\nWith Count Of:",df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Drop the identifier column (Email_ID) and the categorical Customer_Location column
df1=df.drop(["Email_ID","Customer_Location"],axis=1)

In [None]:
ignored=df1[df1["Email_Status"]==0]
read=df1[df1["Email_Status"]==1]
acknowledge=df1[df1["Email_Status"]==2]
print("Emails Ignored By The User :",len(ignored))
print("Emails Read By The User :",len(read))
print("Emails Acknowledge By The User :",len(acknowledge))


In [None]:
#Ignored Emails Details
ignored.describe()

In [None]:
#Read Emails Details
read.describe()

In [None]:
#acknowledge Emails Details
acknowledge.describe()

In [None]:
# Count emails per time-of-day category
p = df.groupby("Time_Email_sent_Category").size().reset_index(name='Count')

# Display the result
print(p)


In [None]:
Email_source_type_count_on_Email_type=pd.DataFrame(df.groupby("Email_Type")["Email_Source_Type"].value_counts().reset_index(name="Count"))
Email_source_type_count_on_Email_type

In [None]:
Email_Campaign_Type_count_on_Email_type=pd.DataFrame(df.groupby("Email_Type")["Email_Campaign_Type"].value_counts().reset_index(name="Count"))
Email_Campaign_Type_count_on_Email_type

In [None]:
Customer_Location_on_Email_type=pd.DataFrame(df.groupby("Email_Type")["Customer_Location"].value_counts().reset_index(name="Count"))
Customer_Location_on_Email_type

In [None]:
Time_Email_sent_Category_on_Email_type=pd.DataFrame(df.groupby("Email_Type")["Time_Email_sent_Category"].value_counts().reset_index(name="Count"))
Time_Email_sent_Category_on_Email_type

In [None]:
num_features=df.select_dtypes(include=["float","int"]).columns.to_list()
num_features

In [None]:
df.groupby('Email_Type')[num_features].agg(["sum","mean","median","count"]).T

In [None]:
df.groupby("Subject_Hotness_Score")[num_features].agg(["sum","mean","median","count"]).T

In [None]:
df.groupby("Email_Source_Type")[num_features].agg(["sum","mean","median","count"]).T

In [None]:
df.groupby("Email_Campaign_Type")[num_features].agg(["sum","mean","median","count"]).T

In [None]:
df.groupby("Time_Email_sent_Category")[num_features].agg(["sum","mean","median","count"]).T

In [None]:
df.groupby("Email_Status")[num_features].agg(["sum","mean","median","count"]).T

In [None]:
cat_features=[feature for feature in df.columns.to_list() if feature not in num_features]
cat_features

In [None]:
# Count emails per Email_Status value
value_counts = df['Email_Status'].value_counts().reset_index()

# Rename the columns in the new DataFrame
value_counts.columns = ['Email_Status', 'Count']

# value_counts is already a DataFrame, so it can be used directly
Email_status_value_count = value_counts

# Display the result
print(Email_status_value_count)


In [None]:
Email_status_value_count_on_Email_Type=pd.DataFrame(df.groupby("Email_Type")["Email_Status"].value_counts().reset_index(name="count"))
Email_status_value_count_on_Email_Type

In [None]:
Total_link_on_Email_Type=pd.DataFrame(df.groupby("Email_Type")["Total_Links"].value_counts().reset_index(name="count"))
Total_link_on_Email_Type

In [None]:
Subject_Hotness_Score_on_Email_Source_Type=pd.DataFrame(df.groupby("Email_Source_Type")["Subject_Hotness_Score"].value_counts().reset_index(name="Count"))
Subject_Hotness_Score_on_Email_Source_Type

In [None]:
Customer_Location_on_Email_Source_Type=pd.DataFrame(df.groupby("Email_Source_Type")["Customer_Location"].value_counts().reset_index(name="Count"))
Customer_Location_on_Email_Source_Type

In [None]:
Email_Campaign_Type_on_Email_Source_Type=pd.DataFrame(df.groupby("Email_Source_Type")["Email_Campaign_Type"].value_counts().reset_index(name="Count"))
Email_Campaign_Type_on_Email_Source_Type

In [None]:
Time_Email_sent_Category_on_Email_Source_Type=pd.DataFrame(df.groupby("Email_Source_Type")["Time_Email_sent_Category"].value_counts().reset_index(name="Count"))
Time_Email_sent_Category_on_Email_Source_Type

In [None]:
Email_Status_on_Email_Source_Type=pd.DataFrame(df.groupby("Email_Source_Type")["Email_Status"].value_counts().reset_index(name="Count"))
Email_Status_on_Email_Source_Type

In [None]:
Avg_Link=df.groupby("Email_Type")["Total_Links"].mean().reset_index(name="Avg Link")
Avg_Word_Count=df.groupby("Email_Type")["Word_Count"].mean().reset_index(name="Avg_Word_Count")
Avg_Images=df.groupby("Email_Type")["Total_Images"].mean().reset_index(name="Avg Images")

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
data=df["Email_Type"].value_counts()

# Derive the labels from the value_counts index so each slice gets the right name
labels=data.index.map({1:"Marketing Email",2:"Business Email"}).tolist()
plt.pie(data,labels=labels,autopct="%.1f%%",colors=["cyan", "Green"])
plt.title('Type Of The Emails ',size=15,loc='center')
plt.legend(bbox_to_anchor=(0.9, 0, 0.78, 1))
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart suits this analysis because it shows each email type as a share of the whole (100%).

##### 2. What is/are the insight(s) found from the chart?

Most of the emails are marketing emails (71%); business emails make up the remaining 29%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight helps create a positive business impact. The pie chart shows that most of the emails are marketing emails, which means most business owners actively use email to market their products.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
data=df["Email_Source_Type"].value_counts()

# Derive the labels from the value_counts index so each slice gets the right name
labels=data.index.map({1:"Marketing Email",2:"Business Email"}).tolist()
plt.pie(data,labels=labels,autopct="%.1f%%")
plt.title('Email Source Type',size=15,loc='center')
plt.legend(bbox_to_anchor=(0.9, 0, 0.78, 1))
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart suits this analysis because it shows each email source type as a share of the whole (100%).

##### 2. What is/are the insight(s) found from the chart?

Marketing is the most common email source type at 54%; business sources account for the remaining 46%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight helps create a positive business impact. The pie chart shows that marketing sources account for slightly more than half of the emails, which means business owners rely on both source types when marketing their products by email.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize = (7,7))
plt.title("Count Of Email_Source_Type On The Basis Of Email_Type")
ax = sns.barplot(data = Email_source_type_count_on_Email_type, x = 'Email_Source_Type',y = "Count", hue = "Email_Type")
plt.show()
print("---------------------------------------------------------------------------------")

#  visualization code
plt.figure(figsize = (7,7))
plt.title("Count Of Email_Campaign_Type On The Basis Of Email_Type")
ax = sns.barplot(data = Email_Campaign_Type_count_on_Email_type, x = 'Email_Campaign_Type',y = "Count", hue = "Email_Type")

plt.show()

print("---------------------------------------------------------------------------------")

plt.figure(figsize = (7,7))
plt.title("Count Of Customer_Location On The Basis Of Email_Type")
ax = sns.barplot(data = Customer_Location_on_Email_type, x = 'Customer_Location',y = "Count", hue = "Email_Type")

plt.show()

print("---------------------------------------------------------------------------------")


# Get unique values in the 'Email_Type' column
unique_email_types = Time_Email_sent_Category_on_Email_type['Email_Type'].unique()

# Define a custom color palette for all unique values in 'Email_Type'
custom_palette = {email_type: sns.color_palette("husl")[i] for i, email_type in enumerate(unique_email_types)}

plt.figure(figsize=(7, 7))
plt.title("Count Of Time_Email_sent_Category On The Basis Of Email_Type")
ax = sns.barplot(data=Time_Email_sent_Category_on_Email_type, x='Time_Email_sent_Category', y='Count', hue='Email_Type', palette=custom_palette)

plt.show()

print("---------------------------------------------------------------------------------")

plt.figure(figsize = (7,7))
plt.title("Count Of Total link On The Basis Of Email_Type")
ax = sns.barplot(data = Total_link_on_Email_Type, x = "Total_Links",y = "count", hue = 'Email_Type')
plt.xticks(rotation=50, ha='right')
plt.show()



##### 1. Why did you pick the specific chart?

A bar chart is used to compare the different parameters of the dataset because it makes the results easy to visualise and understand.

##### 2. What is/are the insight(s) found from the chart?

The visualisations above plot different variables of the dataset against Email_Type to show how each variable relates to the email type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help create a positive business impact. Plotting the different variables of the dataset against Email_Type shows how each variable relates to the email type.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Total past communications per customer location
# (the string "sum" avoids the pandas warning raised for the built-in sum)
p_df = df.groupby("Customer_Location").agg({"Total_Past_Communications": "sum"}).reset_index()

# Plotting a bar plot
plt.bar(p_df['Customer_Location'], p_df['Total_Past_Communications'],color="green")

# Adding labels and title
plt.xlabel('Customer Location')
plt.ylabel('Total Past Communications')
plt.title('Total Past Communications by Customer Location')

# Display the plot
plt.show()


In [None]:
data=df["Customer_Location"].value_counts()

# Take the labels from the value_counts index so each slice matches its location
labels=data.index.tolist()
plt.pie(data,labels=labels,autopct="%.1f%%",explode=[0.01,0.01,0.01,0.01,0.01,0.3,0.01],colors=['skyblue','red','green','violet','pink','brown','cyan'])
plt.title('Customer Location',size=15,loc='center')
plt.legend(bbox_to_anchor=(0.9, 0, 0.78, 1))
plt.show()

##### 1. Why did you pick the specific chart?

+ A bar chart is used to compare the different parameters of the dataset because it makes the results easy to visualise and understand. \
+ A pie chart suits the customer location analysis because it shows each location as a share of the whole (100%).

##### 2. What is/are the insight(s) found from the chart?

Chart 1: The highest number of total past communications is with customers in location G. \
Chart 2: Location G has the largest share of customers (40.9%) and location A has the smallest (2.5%).


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help create a positive business impact. Location G has the largest share of customers (40.9%) and location A has the smallest (2.5%). Focusing more on customers in the largest location can ultimately support positive business growth.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Count emails per time-of-day category
p = df.groupby("Time_Email_sent_Category").size().reset_index(name='Count')

plt.bar(p['Time_Email_sent_Category'], p['Count'], width=0.3,color="blue")  # Adjust the width as needed

# Adding labels and title
plt.xlabel('Time Email Sent Category')
plt.ylabel('Count')
plt.title('Count of Emails by Time Email Sent Category')

# Replace the x-axis codes (1, 2, 3) with readable labels, rotated for readability
plt.xticks(p['Time_Email_sent_Category'], ['Morning',  'Evening', 'Night'], rotation=45, ha='right')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is used to compare the different parameters of the dataset because it makes the results easy to visualise and understand.

##### 2. What is/are the insight(s) found from the chart?

The highest number of emails is sent to customers in the evening.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight helps create a positive business impact.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
Email_Status_Pie_Chart=df["Email_Status"].value_counts()
# Derive the labels from the value_counts index so each slice gets the right name
labels=Email_Status_Pie_Chart.index.map({0:"Ignored",1:"Read",2:"Acknowledged"}).tolist()
plt.pie(Email_Status_Pie_Chart,labels=labels,autopct="%.1f%%",shadow=True, explode=[0,0,0],colors=["Red","Orange","skyblue"])
plt.title("Pie Distribution Of Email Status")
plt.legend(bbox_to_anchor=(0.9, 0, 0.78, 1))
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart suits this analysis because it shows each email status as a share of the whole (100%).

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows the share of each email status. \
Ignored emails form the largest share at 80%. \
Emails read by the user account for 16%. \
Emails acknowledged by the user account for 4%.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight helps create a positive business impact. The chart shows what share of the emails is ignored, read, and acknowledged by users.
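
The class shares reported for this chart can also be computed directly rather than read off the pie. The sketch below uses a made-up `Email_Status` column with roughly the 80/16/4 split reported above; the `imbalance_ratio` name is introduced only for illustration.

```python
import pandas as pd

# Made-up target column with roughly the 80/16/4 split reported above.
status = pd.Series([0] * 80 + [1] * 16 + [2] * 4, name="Email_Status")

# normalize=True returns proportions instead of raw counts.
proportions = status.value_counts(normalize=True).sort_index()
print(proportions)

# Ratio of the largest class share to the smallest class share.
imbalance_ratio = proportions.max() / proportions.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

With a split like this, a model that always predicts "ignored" would already reach about 80% accuracy, which is why per-class precision, recall, and F1 are more informative than accuracy alone for this target.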

#### Chart - 7

In [None]:
# Chart - 7 visualization code
custom_palette = ["Red", "Blue", "cyan"]

plt.title("Email status value count on The Basis Of Email Type")

# Make sure to use the correct column names in the plot
sns.barplot(data=Email_status_value_count_on_Email_Type, x="Email_Type", y="count", hue="Email_Status", palette=custom_palette)

plt.show()


##### 1. Why did you pick the specific chart?


A bar chart is used to compare the different parameters of the dataset because it makes the results easy to visualise and understand.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the count of each email status (ignored, read, and acknowledged by the user) for each email type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help create a positive business impact because the chart shows the count of each email status (ignored, read, and acknowledged by the user) for each email type.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Uses Customer_Location_on_Email_Source_Type and the other grouped frames built in the Data Wrangling section
# Get unique values in the 'Email_Source_Type' column
unique_email_source_types = Customer_Location_on_Email_Source_Type['Email_Source_Type'].unique()

# Define a custom color palette for Email_Source_Type
custom_palette = {email_source_type: sns.color_palette("husl")[i] for i, email_source_type in enumerate(unique_email_source_types)}

plt.figure(figsize=(7, 7))
plt.title("Count Of Email_Source_Type On The Basis Of Customer_Location")
ax = sns.barplot(data=Customer_Location_on_Email_Source_Type, x='Customer_Location', y="Count", hue='Email_Source_Type', palette=custom_palette)

plt.show()

print("-----------------------------------------------------------------------------------------")

plt.figure(figsize=(7, 7))
plt.title("Count Of Email_Source_Type On The Basis Of Email_Campaign_Type")
ax = sns.barplot(data=Email_Campaign_Type_on_Email_Source_Type, x='Email_Campaign_Type', y="Count", hue='Email_Source_Type', palette=custom_palette)
plt.show()


print("-----------------------------------------------------------------------------------------")

plt.figure(figsize=(7, 7))
plt.title("Count Of Time_Email_sent_Category On The Basis Of Email_Source_Type")
ax = sns.barplot(data=Time_Email_sent_Category_on_Email_Source_Type, x='Time_Email_sent_Category', y="Count", hue='Email_Source_Type', palette=custom_palette)
plt.show()



print("-----------------------------------------------------------------------------------------")


plt.figure(figsize=(7, 7))
plt.title("Count Of Email_Status On The Basis Of Email_Source_Type")
ax = sns.barplot(data=Email_Status_on_Email_Source_Type, x='Email_Status', y="Count", hue='Email_Source_Type', palette=custom_palette)
plt.show()



##### 1. Why did you pick the specific chart?


A bar chart is used to compare the different parameters of the dataset because it makes the results easy to visualise and understand.

##### 2. What is/are the insight(s) found from the chart?

The charts show how different variables relate to the Email Source Type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help create a positive business impact because the charts show how different variables relate to the Email Source Type.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize = (7,7))
plt.title("Avg Link VS Email Type")
ax = sns.barplot(data = Avg_Link, x = "Email_Type",y = "Avg Link")
plt.show()

print("------------------------------------------------------------------")

plt.figure(figsize = (7,7))
plt.title("Avg Word_Count  VS Email Type")
ax = sns.barplot(data = Avg_Word_Count, x = "Email_Type",y = "Avg_Word_Count")
plt.show()

print("------------------------------------------------------------------")

plt.figure(figsize = (7,7))
plt.title("Avg Images VS Email_Type")
ax = sns.barplot(data = Avg_Images, x = 'Email_Type',y = "Avg Images")
plt.show()


##### 1. Why did you pick the specific chart?


A bar chart is used to compare the different parameters of the dataset because it makes the results easy to visualise and understand.

##### 2. What is/are the insight(s) found from the chart?

The charts show how the average number of links, words, and images relates to the Email Type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help create a positive business impact because the relationship between these variables and the Email Type shows how the email type affects each of them.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
for features in num_features:
  sns.barplot(data=df,x="Email_Type",y=features,hue="Customer_Location")
  # Set the title before show() so it is drawn on this figure
  plt.title(features)
  plt.show()


#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 5))
# numeric_only=True skips text columns such as Email_ID and Customer_Location
sns.heatmap(data=df.corr(numeric_only=True),annot=True,cmap="coolwarm")
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is used because it best represents the pairwise relationships between the different variables.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows the strength and direction of the linear relationship between each pair of numeric variables in the dataset.
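
To complement the heatmap, the strongest pairs can be listed numerically. The sketch below is an illustration on made-up data: the `toy` frame reuses three column names from the dataset, but its values are randomly generated so that `Total_Links` and `Total_Images` are correlated by construction.

```python
import numpy as np
import pandas as pd

# Made-up data: Total_Images is built from Total_Links plus noise, Word_Count is independent.
rng = np.random.default_rng(0)
links = rng.integers(1, 20, size=200)
toy = pd.DataFrame({
    "Total_Links": links,
    "Total_Images": links * 0.5 + rng.normal(0, 1, size=200),
    "Word_Count": rng.integers(100, 1000, size=200),
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)  # rank by absolute correlation
)
print(pairs)
```

On the real dataset, replacing `toy` with `df` would give the same ranking that the heatmap shows visually.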

#### Chart - 12 - Pair Plot

In [None]:
# Pair Plot visualization code
columns = ['Email_Type', 'Subject_Hotness_Score', 'Email_Source_Type',
       'Customer_Location', 'Email_Campaign_Type', 'Total_Past_Communications',
       'Time_Email_sent_Category','Email_Status']
data_for_pairplot = df[columns]

p = sns.pairplot(data_for_pairplot)
plt.show()

##### 1. Why did you pick the specific chart?


A pair plot helps identify the set of features that best explains a relationship between two variables or forms the most separated clusters. It also helps in sketching simple classification boundaries, for example by drawing straight lines that separate groups in the dataset.

I therefore used a pair plot to analyse the patterns in the data and the relationships between the features. It complements the correlation heatmap by showing the pairwise relationships graphically.

##### 2. What is/are the insight(s) found from the chart?

The chart shows few linear relationships between the variables. Most of the plotted variables are categorical with only a few categories, so they do not show much of a relationship.

Total_Links and Total_Images are correlated, as seen in the earlier heatmap.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

* The Email_Type of the campaign will not have any significant impact on the Email_Status.
* The Subject_Hotness_Score of the email will not have any significant impact on the Total_Past_Communications.
* The Customer_Location will not have any significant impact on the Total_Links and Total_Images in the email.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): Email_Type and Email_Status are independent, that is, there is no relationship between them.
* Alternative Hypothesis (H1): Email_Type and Email_Status are not independent, that is, there is a relationship between them.
* Test Type : Chi-square test of independence

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# perform chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(pd.crosstab(df['Email_Type'], df['Email_Status']))

if p_value < 0.05:
    print("Reject the null hypothesis - the Email_Type has a significant impact on the Email_Status")
else:
    print("Fail to reject the null hypothesis - the Email_Type does not have a significant impact on the Email_Status")

##### Which statistical test have you done to obtain P-Value?

For this hypothesis, I used chi-square test of independence which is a statistical test to determine if there is a significant association between two categorical variables. In this case, the two variables are Email_Type and Email_Status.

##### Why did you choose the specific statistical test?

This test is appropriate because the variables are categorical and I want to determine if there is a relationship between them.
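
With a large sample, even a weak association can produce a small p-value, so it can help to report an effect size next to the chi-square result. The sketch below computes Cramér's V, which is an addition for illustration and not part of the original analysis; the contingency table is made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up contingency table: rows are Email_Type (1, 2), columns are Email_Status (0, 1, 2).
table = pd.DataFrame(
    [[400, 70, 30], [150, 40, 10]],
    index=[1, 2], columns=[0, 1, 2],
)

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1.
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}, Cramer's V={cramers_v:.3f}")
```

On the real data, `table` would be `pd.crosstab(df['Email_Type'], df['Email_Status'])`, the same table passed to the test above.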

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis: There is no linear relationship between Subject_Hotness_Score and Total_Past_Communications (H0: ρ = 0)
* Alternative Hypothesis: There is a linear relationship between Subject_Hotness_Score and Total_Past_Communications (H1: ρ ≠ 0)
* Test Type : Pearson's correlation test

#### 2. Perform an appropriate statistical test.

In [None]:
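# The variables used in this test contain null or infinite values, so work on a copy and treat those
data = df.copy()
data = data.replace([np.inf, -np.inf], np.nan)
# numeric_only=True restricts the means to numeric columns; non-numeric columns are left unchanged
data = data.fillna(data.mean(numeric_only=True))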

In [None]:
# Perform Statistical Test to obtain P-Value
# perform correlation test
r, p_value = stats.pearsonr(data['Subject_Hotness_Score'], data['Total_Past_Communications'])

if p_value < 0.05:
    print("Reject the null hypothesis - the Subject_Hotness_Score has a significant impact on the Total_Past_Communications")
else:
    print("Fail to reject the null hypothesis - the Subject_Hotness_Score does not have a significant impact on the Total_Past_Communications")


##### Which statistical test have you done to obtain P-Value?

For this hypothesis, I used Pearson's correlation test which measures the linear correlation between two continuous variables. In this case, the two variables are Subject_Hotness_Score and Total_Past_Communications.

##### Why did you choose the specific statistical test?

This test is appropriate because the variables are continuous and I want to determine if there is a linear relationship between them.
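
As an illustration of what the coefficient measures, the sketch below computes Pearson's r by hand with the standard library. The sample values are hypothetical. The second example shows why the hypotheses are stated in terms of a linear relationship: a perfectly quadratic dependence can still give r = 0.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values with an exact linear relationship (y = 20 * x).
hotness = [0.5, 1.0, 1.5, 2.0, 2.5]
past_comms = [10, 20, 30, 40, 50]
r_linear = pearson_r(hotness, past_comms)

# Hypothetical values with a symmetric quadratic relationship (y = x^2).
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_quadratic = pearson_r(xs, ys)

print(f"linear: r = {r_linear:.2f}, quadratic: r = {r_quadratic:.2f}")
```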

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis: The mean of Total_Links is equal across the locations (A, B, C, D, E, F, G) (H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6 = μ7)
* Alternative Hypothesis: The mean of Total_Links is not equal across the locations (A, B, C, D, E, F, G) (H1: at least one mean is different from the others)
* The same pair of hypotheses is also tested for Total_Images.
* Test Type : ANOVA Test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# perform ANOVA test
# Drop missing values first; f_oneway returns NaN if any group contains NaN
links = df.dropna(subset=['Customer_Location', 'Total_Links'])
f_value, p_value = stats.f_oneway(*[links.loc[links['Customer_Location'] == loc, 'Total_Links']
                                    for loc in ['A', 'B', 'C', 'D', 'E', 'F', 'G']])
if p_value < 0.05:
    print("Reject the null hypothesis - the Customer_Location has a significant impact on the Total_Links in the email")
else:
    print("Fail to reject the null hypothesis - the Customer_Location does not have a significant impact on the Total_Links in the email")


In [None]:
# Perform Statistical Test to obtain P-Value
# perform ANOVA test
# Drop missing values first; f_oneway returns NaN if any group contains NaN
images = df.dropna(subset=['Customer_Location', 'Total_Images'])
f_value, p_value = stats.f_oneway(*[images.loc[images['Customer_Location'] == loc, 'Total_Images']
                                    for loc in ['A', 'B', 'C', 'D', 'E', 'F', 'G']])
if p_value < 0.05:
    print("Reject the null hypothesis - the Customer_Location has a significant impact on the Total_Images in the email")
else:
    print("Fail to reject the null hypothesis - the Customer_Location does not have a significant impact on the Total_Images in the email")


In [None]:
# perform Kruskal-Wallis test
# Drop missing values first so that NaN does not propagate into the statistic
links = df.dropna(subset=['Customer_Location', 'Total_Links'])
stat, p_value = stats.kruskal(*[links.loc[links['Customer_Location'] == loc, 'Total_Links']
                                for loc in ['A', 'B', 'C', 'D', 'E', 'F', 'G']])
if p_value < 0.05:
    print("Reject the null hypothesis - the Customer_Location has a significant impact on the Total_Links in the email")
else:
    print("Fail to reject the null hypothesis - the Customer_Location does not have a significant impact on the Total_Links in the email")


##### Which statistical test have you done to obtain P-Value?

For this hypothesis, I used a one-way ANOVA (Analysis of Variance) test, which determines whether there is a statistically significant difference between the means of two or more groups. I ran it for both Total_Links and Total_Images, and also ran a Kruskal-Wallis test on Total_Links as a non-parametric check.

##### Why did you choose the specific statistical test?

We have seven locations (A, B, C, D, E, F, G) and want to determine whether the mean of Total_Links differs among them. ANOVA is appropriate because Total_Links is numeric and we are comparing the means of multiple groups. The Kruskal-Wallis test is rank-based and compares the groups without assuming normally distributed data.
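
The F statistic behind ANOVA is the ratio of between-group variability to within-group variability. The following is a minimal standard-library sketch of that ratio; the three groups and their values are hypothetical and only illustrate the arithmetic.

```python
def one_way_anova_f(groups):
    """F statistic for one-way ANOVA: mean square between / mean square within."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each observation around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical Total_Links values for three locations (not from the dataset).
links_by_location = {"A": [8, 10, 12], "B": [9, 11, 13], "C": [14, 16, 18]}
f_value = one_way_anova_f(list(links_by_location.values()))
print(f"F = {f_value:.2f}")  # F = 7.75
```

A large F means the group means differ by more than the spread inside each group would explain.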

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.dropna(inplace=True)
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

No missing-value imputation technique is used here, because imputation can bias the data toward particular values, which may reduce the accuracy of the ML model. Instead, all rows containing null values are dropped so that the dataset has no missing values for model building.

### 2. Handling Outliers

In [None]:
plt.scatter(df1["Email_Type"],df1["Subject_Hotness_Score"])
plt.xlabel("Email Type")
plt.ylabel("Subject Hotness Score")
plt.title("Email Type VS Subject Hotness Score")
plt.show()

In [None]:
plt.scatter(df1["Email_Type"],df1["Email_Source_Type"])
plt.xlabel("Email Type")
plt.ylabel("Email Source Type")
plt.title("Email Type VS Email Source Type")
plt.show()

In [None]:
plt.scatter(df1["Email_Type"],df1["Email_Campaign_Type"])
plt.xlabel("Email Type")
plt.ylabel("Email Campaign Type")
plt.title("Email Type VS Email Campaign Type")
plt.show()

In [None]:
plt.scatter(df1["Email_Type"],df1["Total_Past_Communications"])
plt.xlabel("Email Type")
plt.ylabel("Total Past Communications")
plt.title("Email Type VS Total Past Communications")
plt.show()

In [None]:
plt.scatter(df1["Email_Type"],df1["Time_Email_sent_Category"])
plt.xlabel("Email Type")
plt.ylabel("Time Email sent Category")
plt.title("Email Type VS Time Email sent Category")
plt.show()

In [None]:
plt.scatter(df1["Email_Type"],df1["Word_Count"])
plt.xlabel("Email Type")
plt.ylabel("Word Count")
plt.title("Email Type VS Word Count")
plt.show()

In [None]:
plt.scatter(df1["Email_Type"],df1["Total_Links"])
plt.xlabel("Email_Type")
plt.ylabel("Total_Links")
plt.title("Email_Type VS Total_Links")
plt.show()

In [None]:
plt.scatter(df1["Email_Type"],df1["Total_Images"])
plt.xlabel("Email Type")
plt.ylabel("Total_Images")
plt.title("Email Type VS Total_Images")
plt.show()

In [None]:
plt.scatter(df1["Email_Campaign_Type"],df1["Email_Status"])
plt.xlabel("Email Campaign Type")
plt.ylabel("Email Status")
plt.title("Email Campaign Type VS Email Status ")
plt.show()

In [None]:
# Handling Outliers & Outlier treatments in Email_Type
mean=df1["Email_Type"].mean()
std=df1["Email_Type"].std()
outlier=mean + 2 * std
df1[df1["Email_Type"]>outlier]

In [None]:
# Handling Outliers & Outlier treatments in Email_Source_Type
mean=df1["Email_Source_Type"].mean()
std=df1["Email_Source_Type"].std()
outlier=mean + 2 * std
df1[df1["Email_Source_Type"]>outlier]

In [None]:
# Handling Outliers & Outlier treatments In Email_Campaign_Type
mean=df1["Email_Campaign_Type"].mean()
std=df1["Email_Campaign_Type"].std()
outlier=mean + 2 * std
df1[df1["Email_Campaign_Type"]>outlier]

In [None]:
# Handling Outliers & Outlier treatments In Time_Email_sent_Category
mean=df1["Time_Email_sent_Category"].mean()
std=df1["Time_Email_sent_Category"].std()
outlier=mean + 2 * std
df1[df1["Time_Email_sent_Category"]>outlier]

##### What all outlier treatment techniques have you used and why did you use those techniques?

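The checks above (values more than two standard deviations above the column mean) found no outliers in the Email Campaign Effectiveness Prediction dataset, so no outlier treatment was applied.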
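
The cells above only look for values above `mean + 2 * std`. A two-sided variant of the same rule, which also flags unusually small values, could be sketched as below. The function name and the sample values are hypothetical; `statistics.stdev` is the sample standard deviation, which matches the default of pandas `.std()`.

```python
import statistics


def two_sigma_outliers(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    lower, upper = mean - k * std, mean + k * std
    return [v for v in values if v < lower or v > upper]


# Hypothetical word counts with one unusually small value (not from the dataset).
word_counts = [600, 620, 640, 650, 660, 680, 700, 710, 720, 40]
flagged = two_sigma_outliers(word_counts)
print(flagged)  # [40]
```

In this sample an upper-only check would flag nothing, because the unusual value lies below the mean.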

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
#Converting Total_Past_Communications datatype from float to int
df["Total_Past_Communications"]=df["Total_Past_Communications"].astype("int")

#Converting Total_Links datatype from float to int
df["Total_Links"]=df["Total_Links"].astype("int")

#Converting Total_Images datatype from float to int
df["Total_Images"]=df["Total_Images"].astype("int")

#### What all categorical encoding techniques have you used & why did you use those techniques?

Converted the columns "Total_Past_Communications", "Total_Links", and "Total_Images" from the float data type to the int data type.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
#Total_Image_link
df["Total_Image_link"]=df["Total_Links"]+df["Total_Images"]


In [None]:
#Images and links per 100 words (stored as "Avg Word count per image and link")
df["Avg Word count per image and link"]=(df["Total_Image_link"]/df["Word_Count"]) * 100

In [None]:
#Links per word (stored as "Word Per Link")
df["Word Per Link"]=(df["Total_Links"]/df["Word_Count"])

In [None]:
#Images per word (stored as "Word Per Image")
df["Word Per Image"]=(df["Total_Images"]/df["Word_Count"])

#### 2. Feature Selection

In [None]:
# Dropping Constant and Quasi Constant Feature
def dropping_constant(data):
  from  sklearn.feature_selection import VarianceThreshold
  var_thres= VarianceThreshold(threshold=0.05)
  var_thres.fit(data)
  concol = [column for column in data.columns
          if column not in data.columns[var_thres.get_support()]]
          #var_thres.get_support() return boolean values on checking condition
  if "Email_Status" in concol:
    concol.remove("Email_Status")
  else:
    pass
  print(f'Columns dropped: {concol}')
  df_removed_var=data.drop(concol,axis=1)
  return df_removed_var

In [None]:
x=df1.drop("Email_Status",axis=1)

In [None]:
# Calling the function
df_removed_var=dropping_constant(x)

In [None]:
#correlation matrix

corr = df_removed_var.corr()
cmap=sns.diverging_palette(5, 250, as_cmap=True)

def magnify():
    return [dict(selector="th",
                 props=[("font-size", "7pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')])
]

corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format(precision=2)\
    .set_table_styles(magnify())

In [None]:
# Calculating VIF
def calc_vif(X):
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
# Checking Variance Inflation Factor
# the independent variables set
X = df_removed_var.copy()

# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns

# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]

for i in range(len(vif_data)):
  vif_data.loc[i,"VIF"]=vif_data.loc[i,"VIF"].round(2)
  if vif_data.loc[i,"VIF"]>=8:
    print(vif_data.loc[i,"feature"])

In [None]:
# Check Feature Correlation and finding multicollinearity
def correlation(df,threshold):
  col_corr=set()
  corr_matrix= df.corr()
  for i in range (len(corr_matrix.columns)):
    for j in range(i):
      if abs (corr_matrix.iloc[i,j])>threshold:
        colname=corr_matrix.columns[i]
        col_corr.add(colname)
  return list(col_corr)

In [None]:
correlation(df_removed_var,0.6)

In [None]:
#dropping highly correlated values
df_removed=df_removed_var.drop(['Email_Source_Type'],axis=1)
df_removed.shape

In [None]:
# Again checking VIF post-dropped features
# the independent variables set
X = df_removed.copy()

# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns


# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
vif_data["VIF"] = vif_data["VIF"].apply(lambda x: round(x, 2))
vif_data[vif_data["VIF"] >= 8]["feature"].apply(print)

In [None]:
calc_vif(df_removed[[i for i in df_removed.describe().columns if i not in ['Email_Status']]])

##### What all feature selection methods have you used  and why?

I dropped constant and quasi-constant features, dropped columns showing multicollinearity, and validated the result using VIF.

VarianceThreshold is a feature selector that removes all low-variance features. It looks only at the features (X), not the desired outputs (y), and can therefore be used for unsupervised learning.

A Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related. The Pearson correlation is also known as the “product moment correlation coefficient” (PMCC) or simply “correlation”.

Pearson correlations are suitable only for metric variables. The correlation coefficient takes values between -1 and 1:

• A value closer to 0 implies weaker correlation (exact 0 implying no correlation)

• A value closer to 1 implies stronger positive correlation

• A value closer to -1 implies stronger negative correlation

Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. To detect collinearity among variables, simply create a correlation matrix and find variables with large absolute values.

Steps for Implementing VIF

• Calculate the VIF factors.

• Inspect the factor for each predictor variable. If the VIF is between 5 and 10, multicollinearity is likely present and you should consider dropping the variable.

In the VIF method, we pick each feature and regress it against all of the other features. For each regression, the factor is calculated as:

$$VIF=\frac{1}{1-R^2}$$

where $R^2$ (R-squared) is the coefficient of determination of that linear regression. Its value lies between 0 and 1.

First, I dropped columns with constant or quasi-constant variance. Then, using the Pearson correlation, I removed columns showing multicollinearity and checked the VIF of each remaining feature. The usual rule of thumb flags a VIF somewhere between 5 and 10; I used 8 as the cutoff. Some features still exceeded it, so I manipulated some features and again dropped multicollinear columns until every VIF was below 8.
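
To show how the VIF formula behaves, here is a small standard-library sketch. It covers only the special case of exactly two predictors, where the R-squared of regressing one on the other equals their squared Pearson correlation; with more predictors a full regression is needed, as done by `variance_inflation_factor` above. The helper names and sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_from_r2(r_squared):
    """VIF = 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared)


# Hypothetical values for two correlated predictors (not from the dataset).
total_links = [1, 2, 3, 4, 5]
total_images = [2, 1, 4, 3, 5]

r = pearson_r(total_links, total_images)  # 0.8 for these values
vif = vif_from_r2(r ** 2)
print(f"r = {r:.2f}, VIF = {vif:.2f}")

# How the rule-of-thumb cutoffs map back to R^2.
for r2 in (0.8, 0.875, 0.9):
    print(f"R^2 = {r2} gives VIF = {vif_from_r2(r2):.1f}")
```

Read this way, a cutoff of 8 corresponds to a feature whose variance is 87.5% explained by the other features.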

##### Which all features you found important and why?

In [None]:
#important features
df_removed.columns.to_list()

In [None]:
# Embedded Method of validating the feature importances of selected features
def randomforest_embedded(x,y):
  # Create the random forest with hyperparameters
  model= RandomForestClassifier(n_estimators=550)
  # Fit the model
  model.fit(x,y)
  # Get the importance of the resulting features
  importances= model.feature_importances_
  # Create a data frame for visualization
  final_df= pd.DataFrame({"Features": pd.DataFrame(x).columns, "Importances": importances})
  # Sort in ascending order of importance for better visualization
  final_df= final_df.sort_values('Importances')
  return final_df

In [None]:
# Getting feature importance of selected features
randomforest_embedded(x,y=df["Email_Status"])

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Split the numeric columns into roughly symmetric features (mean close to median) and skewed features
symmetric_feature=[]
non_symmetric_feature=[]
for i in df_removed.describe().columns:
  if abs(df_removed[i].mean()-df_removed[i].median())<0.1:
    symmetric_feature.append(i)
  else:
    non_symmetric_feature.append(i)

# Getting Symmetric Distributed Features
print("Symmetric Distributed Features : -",symmetric_feature)

# Getting Skewed Features
print("Skewed Features : -",non_symmetric_feature)

In [None]:
#visualization
for variable in non_symmetric_feature:
  sns.set_context('notebook')
  plt.figure(figsize=(14,5))
  plt.subplot(1,2,1)   #means 1 row, 2 Columns and 1st plot
  df_removed[variable].hist(bins=30)

  ##QQ plot
  plt.subplot(1,2,2)
  stats.probplot(df_removed[variable], dist='norm',plot=plt)
  plt.title(variable)
  plt.show()
  print('='*120)

In [None]:
for col in ['Subject_Hotness_Score','Total_Past_Communications','Word_Count']:
  df_removed[col]=np.sqrt(df_removed[col])

In [None]:
for i,col in enumerate(['Subject_Hotness_Score','Total_Past_Communications','Word_Count']) :
    plt.figure(figsize = (18,18))
    plt.subplot(6,2,i+1);
    sns.distplot(df_removed[col], color = '#055E85', fit = norm);
    feature = df_removed[col]
    plt.axvline(feature.mean(), color='#ff033e', linestyle='dashed', linewidth=3,label= 'mean');  #red
    plt.axvline(feature.median(), color='#A020F0', linestyle='dashed', linewidth=3,label='median'); #purple
    plt.title(f'{col.title()}');
    plt.tight_layout();

Some of the features are categorical and therefore did not require transformation. For three features (Subject_Hotness_Score, Total_Past_Communications, and Word_Count), I applied a square-root transformation to bring their distributions closer to a Gaussian (normal) distribution.
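
One way to check whether a square-root transformation helps is to compare the skewness before and after. The sketch below does this with the standard library on a deliberately idealized, hypothetical sample (perfect squares), so the transformed values come out exactly symmetric; real columns will usually improve less cleanly.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample: the squares of 1..10 (not from the dataset).
raw = [i ** 2 for i in range(1, 11)]
transformed = [math.sqrt(v) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(transformed)
print(f"skewness before: {skew_before:.2f}, after sqrt: {skew_after:.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution; a positive value indicates a long right tail.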

### 6. Data Scaling

In [None]:
# x=df1.drop("Email_Status",axis=1)
y=df1["Email_Status"]

In [None]:
# Scaling your data
#StandardScaler
standard_scaler=StandardScaler()
standard_scaler_scaled_data=standard_scaler.fit_transform(x)
standard_scaler_scaled_data


##### Which method have you used to scale your data and why?

I used StandardScaler, which standardizes each feature to zero mean and unit variance.
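
The cell above fits the scaler on all of `x` before the train/test split. A commonly recommended alternative, sketched here with the standard library, is to estimate the mean and standard deviation on the training rows only and reuse them for the test rows, so that no information from the test set enters the preprocessing. The function names and values are hypothetical; the population standard deviation is used because that is what StandardScaler computes.

```python
import statistics


def fit_standardizer(train_values):
    """Estimate the scaling parameters from the training data only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)  # population std (ddof=0)
    return mean, std


def transform(values, mean, std):
    """Apply previously fitted parameters to any split."""
    return [(v - mean) / std for v in values]


# Hypothetical single feature, already split into train and test.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mean, std = fit_standardizer(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```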

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

This dataset does not need any dimensionality reduction.

Dimensionality reduction is a technique that is used to reduce the number of features in a dataset. It is often used when the number of features is very large, as this can lead to problems such as overfitting and slow computation. There are a variety of techniques that can be used for dimensionality reduction, such as principal component analysis (PCA) and singular value decomposition (SVD).

There are several reasons why dimensionality reduction might be useful. One reason is that it can help to reduce the size of a dataset, which can be particularly useful when the dataset is very large. It can also help to improve the performance of machine learning models by reducing the number of features that the model has to consider, which can lead to faster computation and better generalization to new data.

Another reason to use dimensionality reduction is to mitigate the curse of dimensionality: as the number of dimensions increases, the volume of the space grows exponentially, so the data become sparse. This can make methods such as nearest-neighbor search less effective, because distances between points grow and become less informative. Reducing the number of dimensions lessens this effect.

Finally, dimensionality reduction can also be useful for visualizing high-dimensional data. It can be difficult to visualize data in more than three dimensions, so reducing the number of dimensions can make it easier to understand the patterns in the data.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x_train,x_test,y_train,y_test=train_test_split(standard_scaler_scaled_data,y,test_size=0.2,random_state=45)
print("The Shape Of X Train Dataset:",x_train.shape)
print("The Shape Of X Test Dataset:",x_test.shape)
print("The Shape Of y Train Dataset:",y_train.shape)
print("The Shape Of y Test Dataset:",y_test.shape)


##### What data splitting ratio have you used and why?

For the ML model, I split the data in an 80:20 ratio: 80% of the data is used for training the model and 20% for testing it.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In [None]:
# Dependant Variable Column Visualization
df['Email_Status'].value_counts().plot(kind='pie',
                              figsize=(5,5),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['Ignored','Read','Acknowledged'],
                               colors=['lightgreen','yellow','red'],
                               explode=[0.1,0.1,0.1]
                              );

Imbalanced dataset is relevant primarily in the context of supervised machine learning involving two or more classes.

Imbalance means that the number of data points available for the different classes is different. With two classes, balanced data would mean 50% of the points in each class. For most machine learning techniques, a little imbalance is not a problem: if 60% of the points belong to one class and 40% to the other, it should not cause any significant performance degradation. Only when the class imbalance is high, e.g. 90% of the points in one class and 10% in the other, may standard optimization criteria or performance measures become less effective and need modification.

In our case, the class ratio of the dependent variable is 80:16:4. A model trained on this data is therefore likely to be biased toward the majority class and to predict it too often, so the dataset should be balanced before model creation.
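
To see why accuracy alone is misleading at this ratio, consider a trivial classifier that always predicts the majority class. The sketch below works through the arithmetic on a hypothetical 100-email sample with the same 80:16:4 split, using only the standard library.

```python
# Hypothetical labels with an 80:16:4 split (0 = ignored, 1 = read, 2 = acknowledged).
y_true = [0] * 80 + [1] * 16 + [2] * 4

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(cls):
    """Share of the true members of cls that were predicted as cls."""
    true_positive = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    return true_positive / y_true.count(cls)


recalls = [recall_for(c) for c in (0, 1, 2)]
macro_recall = sum(recalls) / len(recalls)

print(f"accuracy = {accuracy:.2f}")          # 0.80
print(f"per-class recall = {recalls}")       # [1.0, 0.0, 0.0]
print(f"macro recall = {macro_recall:.2f}")  # 0.33
```

The baseline reaches 80% accuracy without ever identifying a read or acknowledged email, which is why per-class precision and recall matter for this dataset.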

In [None]:
df["Email_Status"].value_counts()

In [None]:
print(x_train.shape)
print(y_train.shape)

In [None]:
# SMOTE
# Handling the imbalanced dataset using SMOTE (applied to the training split only)
sm = SMOTE(random_state=42)
x_train, y_train = sm.fit_resample(x_train, y_train)

# Shapes of the train and test sets after resampling
print("Shape of x_train after SMOTE: ", x_train.shape)
print("Shape of y_train after SMOTE: ", y_train.shape)
print("Shape of x_test (unchanged): ", x_test.shape)
print("Shape of y_test (unchanged): ", y_test.shape)

In [None]:
from collections import Counter
counter = Counter(y_train)
for key,value in counter.items():
  per = value / len(y_train) * 100
  print('Class=%d, n=%d (%.3f%%)' % (key, value, per))
# plot the distribution
plt.bar(counter.keys(), counter.values())
plt.show()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I have used SMOTE (Synthetic Minority Over-sampling technique) to balance the 80:16:4 dataset.

SMOTE is a technique in machine learning for dealing with issues that arise when working with an imbalanced dataset. In practice, imbalanced datasets are common, and many ML algorithms are sensitive to class imbalance, so techniques like SMOTE are used to improve their performance.

To address this disparity, balancing schemes that augment the data to make it more balanced before training the classifier were proposed. Oversampling the minority class by duplicating minority samples or undersampling the majority class is the simplest balancing method.

The idea of incorporating synthetic minority samples into tabular data was first proposed in SMOTE, where synthetic minority samples are generated by interpolating pairs of original minority points.

SMOTE is a data augmentation algorithm that creates synthetic data points from raw data. SMOTE can be thought of as a more sophisticated version of oversampling or a specific data augmentation algorithm.

SMOTE has the advantage of not creating duplicate data points, but rather synthetic data points that differ slightly from the original data points. This makes it a useful alternative to simple duplication-based oversampling.

For these reasons, I used the SMOTE technique to balance the dataset.
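
The interpolation step at the heart of SMOTE can be sketched in a few lines of standard-library Python. This is only an illustration of how one synthetic point is formed between a minority sample and one of its minority-class neighbours; it omits the nearest-neighbour search that the `SMOTE` class performs, and the function name and points are hypothetical.

```python
import random


def smote_like_sample(point, neighbour, rng):
    """Return a synthetic point on the line segment between point and neighbour."""
    lam = rng.random()  # random position along the segment, in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbour)]


rng = random.Random(42)

# Two hypothetical minority-class points in a 2-feature space.
minority_point = [1.0, 2.0]
minority_neighbour = [3.0, 6.0]

synthetic = smote_like_sample(minority_point, minority_neighbour, rng)
print(synthetic)
```

Because the new point lies between two real minority samples, it is similar to both without duplicating either.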

## ***7. ML Model Implementation***

### ML Model - LogisticRegression

In [None]:
# ML Model - 1 Implementation
model1=LogisticRegression(fit_intercept=True,
            class_weight='balanced',multi_class='multinomial')
# Fit the Algorithm
model1.fit(x_train,y_train)
# Predict on the model

y_train_pred=model1.predict(x_train)
y_pred=model1.predict(x_test)

training_accuracy=accuracy_score(y_train,y_train_pred)

prediction_accuracy=accuracy_score(y_test,y_pred)


accuracy1=accuracy_score(y_test,y_pred)
precision1=precision_score(y_test,y_pred,average="weighted")
recall1=recall_score(y_test,y_pred,average="weighted")
f1=f1_score(y_test,y_pred,average="weighted")

In [None]:
print("Training Accuracy",training_accuracy * 100,"%")
print("Prediction Accuracy",prediction_accuracy * 100,"%")
print("Accuracy Of LogisticRegression Model Is: ",accuracy1)
print("Precision Of LogisticRegression Model Is: ",precision1)
print("Recall Of LogisticRegression Model Is: ",recall1)
print("F1 Score Of LogisticRegression Model Is: ",f1)

In [None]:
# Get the predicted probabilities
train_pred=model1.predict_proba(x_train)
test_pred=model1.predict_proba(x_test)


In [None]:
# Get the predicted classes
train_class_pred=model1.predict(x_train)
test_class_pred=model1.predict(x_test)


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
#confusion Matrix For Training
labels=["Ignored","Read","Acknowledge"]
cm=confusion_matrix(y_train,train_class_pred)
print(cm)
ax=plt.subplot()
sns.heatmap(cm,annot=True,fmt="d",ax=ax)  # fmt="d" shows the counts as integers
ax.set_title("Confusion matrix For Train")
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
#confusion Matrix for Testing
labels=["Ignored","Read","Acknowledge"]
cm=confusion_matrix(y_test,test_class_pred)
print(cm)
ax=plt.subplot()
sns.heatmap(cm,annot=True,fmt="d",ax=ax)  # fmt="d" shows the counts as integers
ax.set_title("Confusion matrix For Test")
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Initialize Stratified K-Fold Cross-Validation
k = 5
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

# Initialize lists to store the scores
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

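# Iterate over the folds
# Note: the loop reuses the names x_train and y_train, so once it finishes they hold the
# last fold's training rows of x and y rather than the scaled, SMOTE-resampled split created earlier
for train_index, val_index in skf.split(x, y):
    # Split the data into training and validation sets
    x_train, x_val = x.iloc[train_index], x.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]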

    # Train the model
    model2 = LogisticRegression()
    model2.fit(x_train, y_train)

    # Perform predictions on the validation set
    y_pred = model2.predict(x_val)

    # Calculate and append the scores
    accuracy_scores.append(accuracy_score(y_val, y_pred))
    precision_scores.append(precision_score(y_val, y_pred, average="weighted"))
    recall_scores.append(recall_score(y_val, y_pred, average="weighted"))
    f1_scores.append(f1_score(y_val, y_pred, average="weighted"))

# Print the average scores
print(f"Average Accuracy: {np.mean(accuracy_scores) * 100:.2f}%")
print(f"Average Precision: {np.mean(precision_scores):.2f}")
print(f"Average Recall: {np.mean(recall_scores):.2f}")
print(f"Average F1-Score: {np.mean(f1_scores):.2f}")

In [None]:
# Predict on the model for train and test data
y_pred_train2 = model2.predict(x_train)
y_pred2 = model2.predict(x_test)

In [None]:
# Get the predicted probabilities
train_probability2 = model2.predict_proba(x_train)
test_probability2 = model2.predict_proba(x_test)

In [None]:
# Result for train
print("Classification Report (Train):\n", classification_report(y_train, y_pred_train2))
print("ROC AUC Score (Train):", roc_auc_score(y_train, train_probability2, multi_class='ovr'))
print()

In [None]:
# Result for test
print("Classification Report (Test):\n", classification_report(y_test, y_pred2))
print("ROC AUC Score (Test):", roc_auc_score(y_test, test_probability2, multi_class='ovr'))


##### Which hyperparameter optimization technique have you used and why?

I used stratified K-fold cross-validation to evaluate the ML model; no hyperparameter search was performed. In stratified K-fold cross-validation, the proportion of the different classes remains consistent across the training and validation sets, which gives more reliable results when the data are imbalanced.
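
The stratification idea can be illustrated with a short standard-library sketch that deals the indices of each class round-robin into k folds. This is not how `StratifiedKFold` is implemented (for one thing, it does no shuffling); the function name and the label counts are hypothetical and chosen so that every class divides evenly.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical labels with an 80:15:5 split, close to the dataset's class ratio.
labels = [0] * 80 + [1] * 15 + [2] * 5
folds = stratified_folds(labels, k=5)

for number, fold in enumerate(folds, start=1):
    counts = [sum(labels[i] == c for i in fold) for c in (0, 1, 2)]
    print(f"fold {number}: size = {len(fold)}, class counts = {counts}")
```

Every fold ends up with 16, 3, and 1 samples of the three classes, so each validation set contains at least one example of the rarest class.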

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

For the training dataset, ignored emails had a precision of 81%, a recall of 99%, and an F1-score of 89%. Class 2 (email opened) had a precision of 42%, a recall of 4%, and an F1-score of 8%, and class 3 (email acknowledged) had a precision, recall, and F1-score of 0%. Accuracy is 80%, and the average precision, recall, and F1-score are 41%, 34%, and 32% respectively, with a ROC AUC score of 68%.

For the test dataset, ignored emails had a precision of 86%, a recall of 53%, and an F1-score of 66%. Class 2 (email opened) had a precision of 26%, a recall of 43%, and an F1-score of 32%, and class 3 (email acknowledged) had a precision of 3%, a recall of 21%, and an F1-score of 6%. Accuracy is 50%, and the average precision, recall, and F1-score are 38%, 39%, and 34% respectively, with a ROC AUC score of 62%.

### ML Model - RandomForestClassifier

In [None]:
# ML Model - 2 Implementation
model5=RandomForestClassifier()

# Fit the Algorithm
model5.fit(x_train,y_train)
# Predict on the model

y_train_pred=model5.predict(x_train)
y_pred=model5.predict(x_test)

training_accuracy5=accuracy_score(y_train,y_train_pred)
prediction_accuracy5=accuracy_score(y_test,y_pred)

accuracy5=accuracy_score(y_test,y_pred)
precision5=precision_score(y_test,y_pred,average="weighted")
recall5=recall_score(y_test,y_pred,average="weighted")
f5=f1_score(y_test,y_pred,average="weighted")

In [None]:
print("Training Accuracy:",training_accuracy5 * 100)
print("Prediction Accuracy",prediction_accuracy5 *100)
print("Accuracy Of  Model Is: ",accuracy5)
print("Precision Of  Model Is: ",precision5)
print("Recall Of  Model Is: ",recall5)
print("F1 Score Of  Model Is: ",f5)

In [None]:
#Get Predicted Classes
train_class_pred22=model5.predict(x_train)
test_class_pred22=model5.predict(x_test)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
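# Confusion matrix for the training set (rows = actual class, columns = predicted class)
labels=["Ignored","Read","Acknowledged"]
cm=confusion_matrix(y_train,train_class_pred22)
print(cm)
ax=plt.subplot()
sns.heatmap(cm,annot=True,fmt="d",ax=ax)  # fmt="d" shows whole counts instead of scientific notation
ax.set_title("Confusion Matrix for Train")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()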

In [None]:
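# Visualizing evaluation Metric Score chart
# Confusion matrix for the test set (rows = actual class, columns = predicted class)
labels=["Ignored","Read","Acknowledged"]
cm=confusion_matrix(y_test,test_class_pred22)
print(cm)
ax=plt.subplot()
sns.heatmap(cm,annot=True,fmt="d",ax=ax)  # fmt="d" shows whole counts instead of scientific notation
ax.set_title("Confusion Matrix for Test")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()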
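
With roughly 80% of emails in the Ignored class, raw counts in a confusion matrix are dominated by the first row. One way to read it more easily is to normalize each row, so the diagonal shows per-class recall. The sketch below uses a handful of made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical) rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = Ignored, 1 = Read, 2 = Acknowledged
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0])

# normalize="true" divides each row by the number of actual samples in that class
cm_norm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2], normalize="true")
per_class_recall = cm_norm.diagonal()
print(np.round(cm_norm, 2))
print(np.round(per_class_recall, 2))
```

The normalized matrix can be passed to `sns.heatmap` in the same way as the raw one, with `fmt=".2f"` for the annotations.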

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Initialize Stratified K-Fold Cross-Validation
k = 5
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

# Initialize lists to store the scores
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

# Iterate over the folds
# Fold-specific names keep the loop from overwriting the original x_train / y_train split
for train_index, val_index in skf.split(x, y):
    # Split the data into training and validation sets for this fold
    x_fold_train, x_val = x.iloc[train_index], x.iloc[val_index]
    y_fold_train, y_val = y.iloc[train_index], y.iloc[val_index]

    # Train the model on this fold's training split
    model5 = RandomForestClassifier()
    model5.fit(x_fold_train, y_fold_train)

    # Perform predictions on the validation set
    y_pred = model5.predict(x_val)

    # Calculate and append the scores
    accuracy_scores.append(accuracy_score(y_val, y_pred))
    precision_scores.append(precision_score(y_val, y_pred, average="weighted"))
    recall_scores.append(recall_score(y_val, y_pred, average="weighted"))
    f1_scores.append(f1_score(y_val, y_pred, average="weighted"))

# Print the average scores
print(f"Average Accuracy: {np.mean(accuracy_scores) * 100:.2f}%")
print(f"Average Precision: {np.mean(precision_scores):.2f}")
print(f"Average Recall: {np.mean(recall_scores):.2f}")
print(f"Average F1-Score: {np.mean(f1_scores):.2f}")

In [None]:
# Predict on the model for train and test data
y_pred_train5 = model5.predict(x_train)
y_pred5 = model5.predict(x_test)

In [None]:
# Get the predicted probabilities
train_probability5 = model5.predict_proba(x_train)
test_probability5 = model5.predict_proba(x_test)

In [None]:
# Result for train
print("Classification Report (Train):\n", classification_report(y_train, y_pred_train5))
print("ROC AUC Score (Train):", roc_auc_score(y_train, train_probability5, multi_class='ovr'))
print()

In [None]:
# Result for test
print("Classification Report (Test):\n", classification_report(y_test, y_pred5))
print("ROC AUC Score (Test):", roc_auc_score(y_test, test_probability5, multi_class='ovr'))


##### Which hyperparameter optimization technique have you used and why?

I used Stratified K-Fold cross-validation to get a more reliable estimate of the ML model's performance. Stratified K-Fold keeps the proportion of each class consistent across the training and validation folds, which gives more stable results when the data is imbalanced. Strictly speaking, it is a validation scheme rather than a hyperparameter search: the classifier is refit with its default settings on every fold.
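
If an actual hyperparameter search were wanted, the same stratified folds could be handed to scikit-learn's `GridSearchCV`. The following is only a sketch under stated assumptions: the data is synthetic (`make_classification` stands in for the notebook's `x` and `y`), and the parameter grid and the `f1_macro` scoring choice are illustrative rather than something the notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced three-class data standing in for the notebook's x and y (assumption)
x_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.80, 0.16, 0.04], random_state=42,
)

# Hypothetical grid: shallower trees and larger leaves both limit overfitting
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid=param_grid,
    scoring="f1_macro",  # treats all three classes equally
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(x_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on macro F1 instead of accuracy keeps the search from rewarding a model that only predicts the majority class.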

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

For the training dataset, precision, recall, and F1-score are 100% for all three classes: ignored emails, class 2 (email opened), and class 3 (email acknowledged). Accuracy is 100%; the macro-averaged precision, recall, and F1-score are each 100%, with a ROC AUC score of 99%.

For the test dataset, ignored emails have a precision of 87%, a recall of 2%, and an F1-score of 5%. Class 2 (email opened) has a precision of 16%, a recall of 87%, and an F1-score of 27%, and class 3 (email acknowledged) has a precision of 5%, a recall of 15%, and an F1-score of 8%. Accuracy is 17%; the macro-averaged precision, recall, and F1-score are 36%, 35%, and 13% respectively, with a ROC AUC score of 50%. The gap between perfect training scores and 17% test accuracy shows that the model overfits, so cross-validation did not improve performance on the test set.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

I used the following metrics to evaluate the model. Their impact on the business is as follows:
* **Accuracy**: This metric indicates the percentage of correctly classified instances out of the total number of instances. In a business setting, it reflects the overall effectiveness of the model in making correct predictions. Because about 80% of the emails in this dataset are ignored, a model that always predicts "ignored" would already reach roughly 80% accuracy, so a high accuracy score only builds confidence in the model when the per-class metrics are also good.

* **Precision**: This metric indicates the proportion of true positive predictions out of all positive predictions made by the model. In a business setting, this would indicate the level of confidence in the model's ability to identify positive instances correctly. A high precision score would have a positive impact on the business, as it would indicate that the model is not making false positive predictions.

* **Recall**: This metric indicates the proportion of true positive predictions out of all actual positive instances. In a business setting, this would indicate the model's ability to identify all positive instances. A high recall score would have a positive impact on the business, as it would indicate that the model is not missing any positive instances.

* **F1 Score**: This metric is a combination of precision and recall and is used to balance the trade-off between the two. In a business setting, this would indicate the overall effectiveness of the model in making correct predictions while also avoiding false positives and false negatives. A high F1 score would have a positive impact on the business, as it would indicate that the model is making accurate predictions while also being able to identify all positive instances.

* **ROC AUC**: This metric indicates the ability of the model to distinguish between positive and negative instances. In a business setting, this would indicate the model's ability to correctly classify instances as positive or negative. A high ROC AUC score would have a positive impact on the business, as it would indicate that the model is able to correctly classify instances.

In summary, the Random Forest Classifier is an efficient model for the business only when it scores well on all of these metrics on unseen data, which would indicate that it can accurately predict outcomes, identify all positive instances, and correctly separate positive from negative instances. In this notebook it reaches near-perfect training scores but only 17% test accuracy, so it overfits and cannot be relied on as is.
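
To illustrate why these metrics need to be read together on imbalanced data, here is a minimal sketch of a trivial baseline that predicts "ignored" for every email. The labels are made up to match the 80/16/4 class split reported in this notebook; `y_true_demo` and `y_pred_demo` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 80 ignored (0), 16 read (1), 4 acknowledged (2)
y_true_demo = np.array([0] * 80 + [1] * 16 + [2] * 4)
# A trivial "model" that predicts Ignored for every email
y_pred_demo = np.zeros_like(y_true_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)
weighted_precision = precision_score(
    y_true_demo, y_pred_demo, average="weighted", zero_division=0
)
macro_recall = recall_score(y_true_demo, y_pred_demo, average="macro")
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)

print(f"accuracy={acc:.2f} weighted_precision={weighted_precision:.2f}")
print(f"macro_recall={macro_recall:.2f} macro_f1={macro_f1:.2f}")
```

The baseline looks strong on accuracy while finding none of the read or acknowledged emails, which is exactly what the macro-averaged recall and F1 expose.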

### ML Model - XGBClassifier

In [None]:
# ML Model - XGBClassifier implementation
model2=XGBClassifier()

# Fit the algorithm on the training data
model2.fit(x_train,y_train)

# Predict on the training and test sets
y_train_pred=model2.predict(x_train)
y_pred=model2.predict(x_test)

# Accuracy on each split (scikit-learn expects y_true first, then y_pred)
training_accuracy2=accuracy_score(y_train,y_train_pred)
prediction_accuracy2=accuracy_score(y_test,y_pred)

# Weighted-average metrics on the test set
accuracy2=prediction_accuracy2
precision2=precision_score(y_test,y_pred,average="weighted")
recall2=recall_score(y_test,y_pred,average="weighted")
f2=f1_score(y_test,y_pred,average="weighted")

In [None]:
print("Training Accuracy:",training_accuracy2 * 100)
print("Prediction Accuracy",prediction_accuracy2 *100)
print("Accuracy Of Model Is: ",accuracy2)
print("Precision Of Model Is: ",precision2)
print("Recall Of  Model Is: ",recall2)
print("F1 Score Of  Model Is: ",f2)

In [None]:
# Get the predicted probabilities
train_pred3=model2.predict_proba(x_train)
test_pred3=model2.predict_proba(x_test)


In [None]:
# Get the predicted classes
train_class_pred3=model2.predict(x_train)
test_class_pred3=model2.predict(x_test)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
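# Visualizing evaluation Metric Score chart
# Confusion matrix for the training set (rows = actual class, columns = predicted class)
labels=["Ignored","Read","Acknowledged"]
cm=confusion_matrix(y_train,train_class_pred3)
print(cm)
ax=plt.subplot()
sns.heatmap(cm,annot=True,fmt="d",ax=ax)  # fmt="d" shows whole counts instead of scientific notation
ax.set_title("Confusion Matrix for Train")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()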

In [None]:
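# Visualizing evaluation Metric Score chart
# Confusion matrix for the test set (rows = actual class, columns = predicted class)
labels=["Ignored","Read","Acknowledged"]
cm=confusion_matrix(y_test,test_class_pred3)
print(cm)
ax=plt.subplot()
sns.heatmap(cm,annot=True,fmt="d",ax=ax)  # fmt="d" shows whole counts instead of scientific notation
ax.set_title("Confusion Matrix for Test")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()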

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Initialize Stratified K-Fold Cross-Validation
k = 5
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

# Initialize lists to store the scores
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

# Iterate over the folds
# Fold-specific names keep the loop from overwriting the original x_train / y_train split
for train_index, val_index in skf.split(x, y):
    # Split the data into training and validation sets for this fold
    x_fold_train, x_val = x.iloc[train_index], x.iloc[val_index]
    y_fold_train, y_val = y.iloc[train_index], y.iloc[val_index]

    # Train the model on this fold's training split
    model2 = XGBClassifier()
    model2.fit(x_fold_train, y_fold_train)

    # Perform predictions on the validation set
    y_pred = model2.predict(x_val)

    # Calculate and append the scores
    accuracy_scores.append(accuracy_score(y_val, y_pred))
    precision_scores.append(precision_score(y_val, y_pred, average="weighted"))
    recall_scores.append(recall_score(y_val, y_pred, average="weighted"))
    f1_scores.append(f1_score(y_val, y_pred, average="weighted"))

# Print the average scores
print(f"Average Accuracy: {np.mean(accuracy_scores) * 100:.2f}%")
print(f"Average Precision: {np.mean(precision_scores):.2f}")
print(f"Average Recall: {np.mean(recall_scores):.2f}")
print(f"Average F1-Score: {np.mean(f1_scores):.2f}")

In [None]:
# Predict on the model for train and test data
y_pred_train2 = model2.predict(x_train)
y_pred2 = model2.predict(x_test)

In [None]:
# Get the predicted probabilities
train_probability2 = model2.predict_proba(x_train)
test_probability2 = model2.predict_proba(x_test)

In [None]:
# Result for train
print("Classification Report (Train):\n", classification_report(y_train, y_pred_train2))
print("ROC AUC Score (Train):", roc_auc_score(y_train, train_probability2, multi_class='ovr'))
print()

In [None]:
# Result for test
print("Classification Report (Test):\n", classification_report(y_test, y_pred2))
print("ROC AUC Score (Test):", roc_auc_score(y_test, test_probability2, multi_class='ovr'))


##### Which hyperparameter optimization technique have you used and why?

I used Stratified K-Fold cross-validation to get a more reliable estimate of the ML model's performance. Stratified K-Fold keeps the proportion of each class consistent across the training and validation folds, which gives more stable results when the data is imbalanced. Strictly speaking, it is a validation scheme rather than a hyperparameter search: the classifier is refit with its default settings on every fold.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

For the training dataset, ignored emails have a precision of 86%, a recall of 98%, and an F1-score of 91%. Class 2 (email opened) has a precision of 69%, a recall of 32%, and an F1-score of 43%, and class 3 (email acknowledged) has a precision of 91%, a recall of 9%, and an F1-score of 17%. Accuracy is 84%; the macro-averaged precision, recall, and F1-score are 82%, 46%, and 50% respectively, with a ROC AUC score of 68%.

For the test dataset, ignored emails have a precision of 79%, a recall of 1%, and an F1-score of 1%. Class 2 (email opened) has a precision of 16%, a recall of 99%, and an F1-score of 28%, and class 3 (email acknowledged) has a precision, recall, and F1-score of 0%. Accuracy is 17%; the macro-averaged precision, recall, and F1-score are 32%, 33%, and 10% respectively, with a ROC AUC score of 50%. On the test set the model assigns almost every email to class 2, so the training performance does not carry over to unseen data.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

When evaluating the effectiveness of an email campaign in a classification model, the following evaluation metrics would be considered for a positive business impact:

* **Precision**: This metric indicates the proportion of true positive predictions (emails that were opened and resulted in a desired action) out of all positive predictions made by the model. In a business setting, this would indicate the level of confidence in the model's ability to identify individuals who are likely to engage with the campaign. A high precision score would have a positive impact on the business, as it would indicate that the model is not making false positive predictions and is effectively identifying individuals who are likely to engage with the campaign.

* **Recall**: This metric indicates the proportion of true positive predictions (emails that were opened and resulted in a desired action) out of all actual positive instances (emails that were opened and resulted in a desired action). In a business setting, this would indicate the model's ability to identify all individuals who engaged with the campaign. A high recall score would have a positive impact on the business, as it would indicate that the model is not missing any individuals who engaged with the campaign.

* **F1 Score**: This metric is a combination of precision and recall and is used to balance the trade-off between the two. In a business setting, this would indicate the overall effectiveness of the model in identifying individuals who are likely to engage with the campaign while also avoiding false positives and false negatives. A high F1 score would have a positive impact on the business, as it would indicate that the model is effectively identifying individuals who are likely to engage with the campaign while also being able to identify all individuals who engaged with the campaign.

* **ROC AUC**: This metric indicates the ability of the model to distinguish between positive and negative instances. In a business setting, this would indicate the model's ability to correctly classify instances as positive (engaged with the campaign) or negative (did not engage with the campaign). A high ROC AUC score would have a positive impact on the business, as it would indicate that the model is able to correctly classify individuals as likely to engage with the campaign or not.

For a positive business impact, the key evaluation metrics for email campaign effectiveness are therefore **precision** and **recall**, which combine into the **F1 score**. Together they show whether the model flags only individuals who are likely to engage with the campaign (precision) and whether it finds all individuals who actually engaged (recall).
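
As a worked example of how precision, recall, and F1 relate, the sketch below computes them by hand from true positives, false positives, and false negatives for a single "engaged" class. The counts are made up for illustration and the helper `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 50 emails predicted as engaged, 30 of them correctly,
# while 70 emails that were actually engaged were missed
tp, fp, fn = 30, 20, 70
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the model is right about most of the emails it flags but misses most engaged recipients, and the F1 score sits between the two figures, closer to the weaker one.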

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I choose the Logistic Regression model for the final prediction because it performs best across both the training and the test data. It does not overfit the training dataset as the other ML models do, and its accuracy, precision, recall, and F1-score also increase after applying cross-validation.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

### **LogisticRegression :**
Training Accuracy 49.43277040092819 % \
Prediction Accuracy 54.45698312454705 % \
Accuracy Of LogisticRegression Model Is:  0.5445698312454705 \
Precision Of LogisticRegression Model Is:  0.757510952520623 \
Recall Of LogisticRegression Model Is:  0.5445698312454705 \
F1 Score Of LogisticRegression Model Is:  0.6210434210218452

### **LogisticRegression with Cross-Validation:**
Average Accuracy: 80.29% \
Average Precision: 0.72 \
Average Recall: 0.80 \
Average F1-Score: 0.73 \
ROC AUC Score (Train): 0.6810871575694524 \
ROC AUC Score (Test): 0.6243740188218597

### **RandomForestClassifier:**
Training Accuracy: 99.90163849558668 \
Prediction Accuracy 18.31452531317942 \
Accuracy Of  Model Is:  0.1831452531317942 \
Precision Of  Model Is:  0.6766487357671398 \
Recall Of  Model Is:  0.1831452531317942 \
F1 Score Of  Model Is:  0.10333977009590749

### **RandomForestClassifier with Cross-Validation:**
Average Accuracy: 80.07% \
Average Precision: 0.74 \
Average Recall: 0.80 \
Average F1-Score: 0.76 \
ROC AUC Score (Train): 0.9999968916856035 \
ROC AUC Score (Test): 0.4662545240067582

Comparing the results of the two models, LogisticRegression generalizes better: its training and test accuracy are close (about 49% and 54%), whereas RandomForestClassifier drops from 99.9% training accuracy to about 18% on the test set. LogisticRegression also has the higher test ROC AUC score (62% versus 47%). That is why I will use the LogisticRegression model going forward.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
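
The two cells above are left empty in the notebook. As a minimal sketch of what saving and reloading could look like with `joblib`, the example below trains a small LogisticRegression on synthetic data (a stand-in for the notebook's final model), writes it to a temporary directory, and checks that the reloaded model gives the same predictions. The file name, the synthetic data, and the `*_demo` names are all assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data standing in for the notebook's features and labels (assumption)
x_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=42
)
final_model = LogisticRegression(max_iter=1000).fit(x_demo, y_demo)

# Save the fitted model; the file name is hypothetical
model_path = os.path.join(tempfile.mkdtemp(), "email_campaign_model.joblib")
joblib.dump(final_model, model_path)

# Load it back and run a sanity check on a few rows
loaded_model = joblib.load(model_path)
x_unseen = x_demo[:5]
same_predictions = np.array_equal(
    final_model.predict(x_unseen), loaded_model.predict(x_unseen)
)
print("Reloaded model matches original:", same_predictions)
```

In the notebook itself, the saved object would be the fitted final model, and any preprocessing applied to the training features would have to be applied to the unseen data before calling `predict`.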

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

+ For the Email Campaign Type feature, Campaign Type 1 emails had a very high chance of being viewed even though relatively few were sent. Most emails sent under Campaign Type 2 were ignored. Campaign Type 3 appears to have been successful, because more of its emails were read and acknowledged even though fewer were sent.
+ By email type, marketing emails make up 71% of the total and business emails 29%.
+ By email source type, marketing emails make up 54% of the total and admin emails of products 46%.
+ The highest number of emails was sent to customers during the evening.
+ Community G has the largest share of the population (40.9%) and community A the smallest (2.5%).
+ The highest total past communication is with customers in location G.
+ Email_Type 1 has more ignored, read, and acknowledged emails than Email_Type 2.
+ Ignored emails make up the largest share at 80%; 16% of emails were read and 4% were acknowledged.
+ The heatmap shows a negative correlation between Email_Status and Word_count, Total_Links, and Total_Images.
+ There is a positive correlation between the total past communications and Email_Status.
+ The final prediction uses the LogisticRegression model.



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***