Compared with the raw dataset, cleaning removed 149 rows (1.86% of the 8,000 records). All of them were rows with a WebsiteVisits value of 0, which we dropped because a value of 0 means the customer never visited the site. No significant change in the distributions was observed.
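
The row-loss figure above can be reproduced with simple arithmetic. The sketch below also shows one informal way to compare a column's summary statistics before and after filtering. The frame names `raw` and `cleaned` and the toy values are hypothetical stand-ins, not the real data, and `describe()` is only one possible check, not necessarily the one used for the statement above.

```python
import pandas as pd

# Row counts reported in this notebook: 8000 raw rows, 7851 after cleaning.
rows_raw, rows_clean = 8000, 7851
rows_lost = rows_raw - rows_clean
pct_lost = rows_lost / rows_raw * 100
print(f"{rows_lost} rows lost ({pct_lost:.2f}%)")

# Hypothetical toy frame standing in for the real dataset.
raw = pd.DataFrame({
    "WebsiteVisits": [0, 42, 2, 47, 0, 6],
    "Age": [56, 69, 46, 32, 60, 25],
})
cleaned = raw[raw["WebsiteVisits"] != 0]

# Side-by-side summary statistics for one column, before and after filtering.
comparison = pd.concat(
    {"raw": raw["Age"].describe(), "cleaned": cleaned["Age"].describe()},
    axis=1,
)
print(comparison)
```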

### Loading in the dataset
---



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/MISY331/digital_marketing_campaign_dataset.csv")
df.shape

(8000, 20)

In [None]:
df.head()

Unnamed: 0,CustomerID,Age,Gender,Income,CampaignChannel,CampaignType,AdSpend,ClickThroughRate,ConversionRate,WebsiteVisits,PagesPerVisit,TimeOnSite,SocialShares,EmailOpens,EmailClicks,PreviousPurchases,LoyaltyPoints,AdvertisingPlatform,AdvertisingTool,Conversion
0,8000,56,Female,136912,Social Media,Awareness,6497.870068,0.043919,0.088031,0,2.399017,7.396803,19,6,9,4,688,IsConfid,ToolConfid,1
1,8001,69,Male,41760,Email,Retention,3898.668606,0.155725,0.182725,42,2.917138,5.352549,5,2,7,2,3459,IsConfid,ToolConfid,1
2,8002,46,Female,88456,PPC,Awareness,1546.429596,0.27749,0.076423,2,8.223619,13.794901,0,11,2,8,2337,IsConfid,ToolConfid,1
3,8003,32,Female,44085,PPC,Conversion,539.525936,0.137611,0.088004,47,4.540939,14.688363,89,2,2,0,2463,IsConfid,ToolConfid,1
4,8004,60,Female,83964,PPC,Conversion,1678.043573,0.252851,0.10994,0,2.046847,13.99337,6,6,6,8,4345,IsConfid,ToolConfid,1


### Checking for missing values

In [None]:
df.isnull().sum()

Unnamed: 0,0
CustomerID,0
Age,0
Gender,0
Income,0
CampaignChannel,0
CampaignType,0
AdSpend,0
ClickThroughRate,0
ConversionRate,0
WebsiteVisits,0
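
When the per-column listing gets long, a condensed summary can be easier to scan. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative, not taken from the dataset); it lists only the columns that contain missing values and reports the total count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
toy = pd.DataFrame({
    "Age": [56, 69, np.nan],
    "Income": [136912, 41760, 88456],
})

missing_per_column = toy.isnull().sum()
# Keep only the columns that actually have missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
total_missing = int(missing_per_column.sum())

print(columns_with_missing)
print("Total missing cells:", total_missing)
```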


### Removing Duplicates

In [None]:
df.drop_duplicates(inplace=True)
df.shape

(8000, 20)
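
The unchanged shape above indicates that no fully duplicated rows were present. Counting duplicates before dropping them makes that explicit; the sketch below does so on a hypothetical toy frame (names and values are illustrative only).

```python
import pandas as pd

# Hypothetical toy frame in which the last row repeats the second one.
toy = pd.DataFrame({
    "CustomerID": [8000, 8001, 8001],
    "Age": [56, 69, 69],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print("Duplicate rows:", n_duplicates)
print("Shape after dropping:", deduped.shape)
```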

Dropping all rows where WebsiteVisits = 0. Rows lost: 149

In [None]:
df.drop(df[df['WebsiteVisits'] == 0].index, inplace=True)
df.shape

(7851, 20)
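
An equivalent way to express this filter is a boolean mask, which also makes it easy to count the affected rows first. The sketch below uses a hypothetical toy frame. Note that filtering keeps the original index labels, so the index has gaps afterwards; `reset_index(drop=True)` is shown as an optional step and is not something this notebook does.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame({
    "CustomerID": [8000, 8001, 8002, 8003, 8004],
    "WebsiteVisits": [0, 42, 2, 47, 0],
})

zero_visit_mask = toy["WebsiteVisits"] == 0
rows_to_drop = int(zero_visit_mask.sum())

# Keep only the rows with at least one visit; index labels are preserved.
filtered = toy[~zero_visit_mask]

# Optional: renumber the rows from 0 if a contiguous index is preferred.
renumbered = filtered.reset_index(drop=True)

print("Rows to drop:", rows_to_drop)
print(list(filtered.index), list(renumbered.index))
```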

### Creating dummy variables

In [None]:
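# One-hot encode the categorical columns. drop_first=True drops the first
# category of each column so the remaining dummies are not redundant.
# Note: the trailing space in each prefix yields names such as 'is _Male'.
dummy_cols = ['Gender', 'CampaignChannel', 'CampaignType']
df = pd.get_dummies(df,
                    columns=dummy_cols,
                    prefix=['is ', 'channel ', 'type '],
                    drop_first=True,
                    dtype=int)
df.head()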

Unnamed: 0,CustomerID,Age,Income,AdSpend,ClickThroughRate,ConversionRate,WebsiteVisits,PagesPerVisit,TimeOnSite,SocialShares,...,AdvertisingTool,Conversion,is _Male,channel _PPC,channel _Referral,channel _SEO,channel _Social Media,type _Consideration,type _Conversion,type _Retention
1,8001,69,41760,3898.668606,0.155725,0.182725,42,2.917138,5.352549,5,...,ToolConfid,1,1,0,0,0,0,0,0,1
2,8002,46,88456,1546.429596,0.27749,0.076423,2,8.223619,13.794901,0,...,ToolConfid,1,0,1,0,0,0,0,0,0
3,8003,32,44085,539.525936,0.137611,0.088004,47,4.540939,14.688363,89,...,ToolConfid,1,0,1,0,0,0,0,1,0
5,8005,25,42925,9579.388247,0.153795,0.161316,6,2.12585,7.752831,95,...,ToolConfid,1,0,0,0,0,1,0,0,0
6,8006,38,25615,7302.899852,0.040975,0.060977,42,1.753995,10.698672,54,...,ToolConfid,1,0,0,1,0,0,0,0,0
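
To see what `drop_first=True` does, the sketch below encodes a hypothetical toy frame. The first category of each column in sorted order (here `Female` and `Email`) becomes the implicit baseline and gets no column of its own. The prefixes are written without trailing spaces here purely as an illustrative alternative; the notebook itself uses `'is '`, `'channel '` and `'type '`.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "CampaignChannel": ["Social Media", "Email", "PPC"],
})

encoded = pd.get_dummies(
    toy,
    columns=["Gender", "CampaignChannel"],
    prefix=["is", "channel"],  # no trailing space, so names look like 'is_Male'
    drop_first=True,           # baseline categories: 'Female' and 'Email'
    dtype=int,
)

print(list(encoded.columns))
print(encoded)
```

A row with zeros in every `channel_*` column therefore belongs to the baseline channel.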


### Dropping irrelevant columns

In [None]:
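# In the preview above these two columns hold only the placeholder values
# 'IsConfid' and 'ToolConfid', so they are dropped.
columns_to_exclude = ['AdvertisingPlatform', 'AdvertisingTool']

df = df.drop(columns=columns_to_exclude)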

In [None]:
df.head()

Unnamed: 0,CustomerID,Age,Income,AdSpend,ClickThroughRate,ConversionRate,WebsiteVisits,PagesPerVisit,TimeOnSite,SocialShares,...,LoyaltyPoints,Conversion,is _Male,channel _PPC,channel _Referral,channel _SEO,channel _Social Media,type _Consideration,type _Conversion,type _Retention
1,8001,69,41760,3898.668606,0.155725,0.182725,42,2.917138,5.352549,5,...,3459,1,1,0,0,0,0,0,0,1
2,8002,46,88456,1546.429596,0.27749,0.076423,2,8.223619,13.794901,0,...,2337,1,0,1,0,0,0,0,0,0
3,8003,32,44085,539.525936,0.137611,0.088004,47,4.540939,14.688363,89,...,2463,1,0,1,0,0,0,0,1,0
5,8005,25,42925,9579.388247,0.153795,0.161316,6,2.12585,7.752831,95,...,3316,1,0,0,0,0,1,0,0,0
6,8006,38,25615,7302.899852,0.040975,0.060977,42,1.753995,10.698672,54,...,930,1,0,0,1,0,0,0,0,0


### Exporting the train and test sets and saving the cleaned CSV

In [None]:
import pickle
from sklearn.model_selection import train_test_split
X = df.drop(columns=['Conversion', 'CustomerID'])  # Drop target & ID
y = df['Conversion']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Save the train/test splits to pickle files
with open('/content/drive/MyDrive/Colab Notebooks/MISY331/X_train.p', 'wb') as f:
    pickle.dump(X_train, f)

with open('/content/drive/MyDrive/Colab Notebooks/MISY331/X_test.p', 'wb') as f:
    pickle.dump(X_test, f)

with open('/content/drive/MyDrive/Colab Notebooks/MISY331/y_train.p', 'wb') as f:
    pickle.dump(y_train, f)

with open('/content/drive/MyDrive/Colab Notebooks/MISY331/y_test.p', 'wb') as f:
    pickle.dump(y_test, f)
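
Passing `stratify=y` asks `train_test_split` to keep the class proportions of the target roughly equal in both splits. The sketch below checks this on a hypothetical, synthetic target (80 positives, 20 negatives); the data and the variable names are illustrative and not drawn from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target for illustration only: 80 ones and 20 zeros.
toy = pd.DataFrame({
    "feature": range(100),
    "Conversion": [1] * 80 + [0] * 20,
})
X_toy = toy.drop(columns=["Conversion"])
y_toy = toy["Conversion"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification the positive rate should be close in both splits.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.2f}, test positive rate: {test_rate:.2f}")
```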

In [None]:
df.to_csv('digital_marketing_campaign_dataset_cleaned.csv', index=False)
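
The `index=False` argument matters when the file is read back: if the index is written, `read_csv` returns it as an extra column named `Unnamed: 0`. The sketch below demonstrates this with an in-memory buffer and a hypothetical toy frame, so nothing is written to disk.

```python
import io

import pandas as pd

# Hypothetical toy frame with a non-contiguous index, for illustration only.
toy = pd.DataFrame({"Age": [69, 46], "Conversion": [1, 1]}, index=[1, 2])

# Round-trip through CSV text, once with the index and once without it.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```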