# **Marketing Spend Optimization using Machine Learning in Python**

# **Introduction**

A company specializing in B2C sales (product: Data Science and Data Engineering related courses) spends significant money on marketing campaigns across various channels such as social media, email, and search advertising. More information on marketing sources can be found in the data column “Marketing Source”. The marketing team faces challenges in optimizing the marketing budget allocation across these channels to maximize revenue and return on investment (ROI).<br><br>
This project aims to develop a data-driven approach to optimize the marketing budget allocation across various channels to maximize revenue and ROI. By using machine learning algorithms, the marketing team can make informed decisions about allocating the marketing budget based on predicted revenue and ROI.




### **Business Impact of Marketing Budget Optimization**


**Increase product conversions**: Optimizing the marketing budget leads to better user targeting through the right channels and assets, which in turn leads to better conversions.



**Increase revenue**: Higher conversions (see the previous point) lead to more revenue and buyer engagement. For example, if the company targets a user who is more active on Instagram, that user is more likely to click on the ad and add the product to the cart, so the overall probability of an order, and hence revenue, increases.



**Improve budget allocation**: Over-budgeting on inefficient channels wastes marketing money without generating enough revenue.

**Improve Customer Acquisition Cost**: Customer Acquisition Cost (CAC) improves when the right channels are used to target a customer, often leading to better repeat rates as well.


### **Assumptions**

* We assume that <b>Interest Level</b> is our target variable which refers to the interest of a user for a lead id
* We assume that whenever the target variable is NA or Not called, the corresponding lead id is not meaningful and hence dropped
* If the model's prediction is very close to 1, it means that the user is very likely to engage with the lead id
* Columns with a lot of null values are not meaningful and imputation also won't be helpful

## **Approach**


We are treating this problem as a supervised learning problem, so every data point has a target variable from which the model learns the dependencies and then predicts on unseen data.


In real life, this model would tell the business whether a user is likely to engage with the ad or not and that would in turn help the company to allocate budgets accordingly.


Given our assumptions about the data, we will build a prediction model based on the historical data. Simplifying, here's the logic of what we'll build:


1. We'll build a model to identify whether a customer will be interested in the lead.
2. We'll use various tree-based models and compare their performance on interest prediction.
3. We will then choose the most successful model to use in production.

* Exploratory Data Analysis (EDA):
  * Understand the features and their relationships with target variables
  * Check for missing or invalid values and their imputation


* Data Preprocessing:
  * Encode the variables using label encoding
  * Split the dataset into training and testing sets

* Model Building and Testing:
  * Random Forest
  * Light Gradient Boosting
  * Extreme Gradient Boosting


# **Package Requirements**

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing

# **The Data**

In [3]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
csv_file_path = r"D:\Github\ML-Powered-Marketing-Spend-Optimization\Marketing_Data.csv"
df = pd.read_csv(csv_file_path)

Let's look at the first 10 records from the dataframe.

In [4]:
df.head(10)

Unnamed: 0,Lead Id,Lead Owner,Interest Level,Lead created,Lead Location(Auto),Creation Source,Next activity,What do you do currently ?,What are you looking for in Product ?,Website Source,Lead Last Update time,Marketing Source,Lead Location(Manual),Demo Date,Demo Status,Closure date
0,5e502dcf828b8975a78e89f3e9aeac12,e14c3a,Not Interested,12-01-2023 16:42,IN,API,,Student,,,12-01-2023 19:27,,India,,,
1,efe3f074c61959c2ea1906dd0346aa69,d16267,Slightly Interested,04-12-2021 09:32,,API,12-01-2022 00:00,,,Sales lead,12-01-2022 17:17,Paid - Instagram,India,05-12-2021 00:00,No Show,
2,d26dc5cd5843622a203cf396b4ee4b1a,d138f9,No Answer,15-04-2022 10:16,,API,16-04-2022 00:00,,,,16-04-2022 20:35,Paid-Adwords,In,,,
3,d50acaedc1e5b9c18f8ceb3c6cff345b,38e2a6,Not Interested,21-10-2022 17:02,IN,API,23-10-2022 00:00,fresher,,,02-12-2022 13:35,Paid-Adwords,IN,22-11-2022 00:00,Scheduled,
4,07758f3d12a23e68bb3b58b8009dd9a8,d130bb,Not Interested,25-10-2021 10:48,,API,,,,Sales lead,13-11-2021 14:51,Affiliate,India,,,
5,665eb8f7c975b055afa58b5dda3a78bc,d5b5bd,Slightly Interested,24-11-2022 22:41,,API,26-11-2022 00:00,,Big Data engineering,,26-11-2022 19:49,,TR,,,
6,a1ea99cba3b88f6c59fea8a84f051dec,d16267,No Answer,07-07-2022 14:50,,API,,,,Sales lead,08-07-2022 18:20,Medium,India,,,
7,e69523450132baed2dd72836cdfc9778,d130bb,Not Interested,16-09-2021 23:37,,API,,Glass maker at home,,Sales lead,12-11-2021 04:49,Paid - Facebook,India,,,
8,fe244887bc37b5f49311c750ce6b279f,d138f9,No Answer,08-06-2022 13:30,,API,24-06-2022 00:00,,,,24-06-2022 10:44,Paid - Instagram,In,,,
9,3500a29dc4849a7166e98db2e44ddc53,38e2a6,No Answer,21-10-2022 23:50,IN,API,23-10-2022 00:00,,,,23-10-2022 11:09,Paid - Instagram,,,,


From the dataset preview, several fields such as "Demo Date," "Closure date," and "What are you looking for in Product ?" contain many missing values, which may limit their usefulness unless they are carefully imputed or excluded. Categorical features like "Marketing Source" and "Lead Owner" show enough variability to be candidate predictors in a classification model. Additionally, the target variable "Interest Level" includes values like "Not Interested," "Slightly Interested," and "No Answer," so it needs thoughtful preprocessing, such as binarization, before the model can learn meaningful patterns.

# **Exploratory Data Analysis**

### **What can we learn from the data?**

In [5]:
df.shape

(38984, 16)

In [6]:
# Checking the names of the columns
df.columns

Index(['Lead Id', 'Lead Owner', 'Interest Level', 'Lead created',
       'Lead Location(Auto)', 'Creation Source', 'Next activity',
       'What do you do currently ?', 'What are you looking for in Product ?',
       'Website Source', 'Lead Last Update time', 'Marketing Source',
       'Lead Location(Manual)', 'Demo Date', 'Demo Status', 'Closure date'],
      dtype='object')

## **Data Dictionary**



| Column name	 | Description|
| ----- | ----- |
| Lead Id|  Unique Identifier |
| Lead Owner|  Internal sales person associated with the lead |
| Interest Level|  What is lead's interest level? (entered manually) |
| Lead created|  Lead creation date |
| Lead Location(Auto)|  Automatically detected location |
| Creation Source|  Creation source of the lead |
| Next activity|  Date for Next Activity |
| What do you do currently ?|  Current profile of lead |
| What are you looking for in Product ?|  Specific requirement from product |
| Website Source|  Website Source of the Lead |
| Lead Last Update time|  Last update time for Lead |
| Marketing Source|  Marketing Source of the Lead |
| Lead Location(Manual)|  Manually entered lead location |
| Demo Date|  Date for Demo |
| Demo Status|  Status of demo booked with lead |
| Closure date|  Lead closing date |

In [7]:
# Check the Information of the Dataframe, number of unique values and frequency
df.describe()

Unnamed: 0,Lead Id,Lead Owner,Interest Level,Lead created,Lead Location(Auto),Creation Source,Next activity,What do you do currently ?,What are you looking for in Product ?,Website Source,Lead Last Update time,Marketing Source,Lead Location(Manual),Demo Date,Demo Status,Closure date
count,38984,38984,38847,38984,10810,38984,14776,16909,9970,24088,38984,28339,34974,10851,11423,629
unique,37450,23,8,35951,169,3,2610,6831,4046,10,32693,46,415,583,3,277
top,bcbcf737090f0a52c59237fb0ee921d5,2f6f7f,Slightly Interested,13-01-2022 14:05,IN,API,31-01-2023 00:00,Student,DS,Sales lead,06-03-2023 17:53,SEO,IN,30-07-2022 00:00,Scheduled,01-05-2022 00:00
freq,6,5643,14572,17,6735,36291,74,3406,481,23121,401,10127,14126,48,4000,9


In [8]:
# Check the Information of the Dataframe, datatypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38984 entries, 0 to 38983
Data columns (total 16 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Lead Id                                38984 non-null  object
 1   Lead Owner                             38984 non-null  object
 2   Interest Level                         38847 non-null  object
 3   Lead created                           38984 non-null  object
 4   Lead Location(Auto)                    10810 non-null  object
 5   Creation Source                        38984 non-null  object
 6   Next activity                          14776 non-null  object
 7   What do you do currently ?             16909 non-null  object
 8   What are you looking for in Product ?  9970 non-null   object
 9   Website Source                         24088 non-null  object
 10  Lead Last Update time                  38984 non-null  object
 11  Marketing Sourc

**Observation:**
* Several columns contain null values. We will treat them later.
* Some columns are heavily imbalanced (for example, API accounts for 36,291 of the 38,984 Creation Source values), which must be addressed.
* Features like Lead Owner, Marketing Source, and Website Source take a manageable number of distinct values (23, 46, and 10), making them candidate inputs for a classification model predicting lead engagement or conversion.
* Lead created, Next activity, Lead Last Update time, and Demo Date should be of datetime type but are stored as object.

In [9]:
df['Lead created'] = pd.to_datetime(df['Lead created'], format="%d-%m-%Y %H:%M")
df['Lead Last Update time'] = pd.to_datetime(df['Lead Last Update time'], format="%d-%m-%Y %H:%M")
df['Next activity'] = pd.to_datetime(df['Next activity'], format="%d-%m-%Y %H:%M")
df['Demo Date'] = pd.to_datetime(df['Demo Date'], format="%d-%m-%Y %H:%M")

Let's see how many different Lead Owners we have.

In [10]:
df['Lead Owner'].unique()

array(['e14c3a', 'd16267', 'd138f9', '38e2a6', 'd130bb', 'd5b5bd',
       '949886', 'fc348d', 'c18c01', '1eafbe', '2f6f7f', '5fe006',
       '8a10c8', '1a9b5d', 'c5837c', '64c0b2', '684149', '154755',
       'b89cfd', '8c20b0', '2c7db1', '65ed8c', '64347b'], dtype=object)

In [11]:
df['Lead Owner'].value_counts()

Lead Owner
2f6f7f    5643
d16267    5313
1eafbe    5226
d5b5bd    4417
fc348d    3405
1a9b5d    2515
d138f9    2023
e14c3a    1695
c5837c    1415
d130bb    1195
b89cfd    1120
949886     986
8c20b0     667
38e2a6     594
684149     559
c18c01     526
5fe006     525
8a10c8     509
64c0b2     317
2c7db1     288
154755      34
65ed8c      10
64347b       2
Name: count, dtype: int64

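**Observation**
* Leads are unevenly distributed among the 23 lead owners: the four largest owners each hold more than 4,000 leads, while three owners (154755, 65ed8c, 64347b) hold fewer than 50 each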

In [12]:
df['Interest Level'].unique()

array(['Not Interested', 'Slightly Interested', 'No Answer', 'Closed',
       'Not called', 'Invalid Number', 'Fairly Interested', nan,
       'Very Interested'], dtype=object)

In [13]:
df['Interest Level'].value_counts()

Interest Level
Slightly Interested    14572
Not Interested         10545
No Answer               9254
Not called              1585
Fairly Interested       1320
Closed                   811
Invalid Number           636
Very Interested          124
Name: count, dtype: int64

**Observation**
* Some of the interest levels are semantically similar (for example, Slightly Interested, Fairly Interested, and Very Interested)
* Interest level is our target variable, and it is imbalanced: Slightly Interested, Not Interested, and No Answer cover most leads, while Very Interested has only 124

**Points to ponder**
* Should we formulate the problem as a multi-class or a binary classification problem?
* If we want to do binary classification, how do we deal with the many values in our target variable?

In [14]:
df['What do you do currently ?'].value_counts()

What do you do currently ?
Student                           3406
student                           1282
Fresher                            298
Working                            194
Working pro                        148
                                  ... 
JFW                                  1
BTECH last year                      1
Asst Project                         1
He is working in Sales               1
Course AIMA - Business Analyst       1
Name: count, Length: 6831, dtype: int64

In [15]:
df['What do you do currently ?'].unique().shape

(6832,)

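**Observation**
* There are many unique values in the above column (6,831, plus NaN, which is why `unique()` reports 6,832)
* Many values differ only in case or wording (for example, Student and student), so processing the strings would reduce the count

**Think about it**
* Do we need to keep so many values and risk confusing the model?
* What can we do to reduce them?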

In [16]:
df['Creation Source'].unique()

array(['API', 'Manually created', 'Deal'], dtype=object)

In [17]:
df['Creation Source'].value_counts()

Creation Source
API                 36291
Manually created     2533
Deal                  160
Name: count, dtype: int64

**Observation**
* This feature has only 3 unique values and is dominated by one of them: API accounts for 36,291 of 38,984 leads (about 93%)

In [18]:
df['What are you looking for in Product ?'].unique().shape

(4047,)

In [19]:
df['What are you looking for in Product ?'].value_counts()

What are you looking for in Product ?
DS                                  481
ML                                  325
DS projects                         254
ML projects                         221
BD                                  158
                                   ... 
Better knowledge & hands exp          1
ds in shipping and logistics          1
Better Knowledge & Career Opts        1
Project for College project work      1
DL, ML                                1
Name: count, Length: 4046, dtype: int64

**Observation**
* This feature has many unique values (4,046), and processing it will take a lot of time

**Think about it**
* Should we still work with this column?
* If yes, what can we do?

In [20]:
df['Website Source'].unique()

array([nan, 'Sales lead', 'Start Project', 'Demo button lead',
       'Chat lead', 'Cashback lead', 'eBook',
       'Demo button lead, Chat lead', 'Sales lead, Demo button lead',
       'Sales lead, Chat lead', 'Sales lead, eBook'], dtype=object)

In [21]:
df['Website Source'].value_counts()

Website Source
Sales lead                      23121
Start Project                     560
Demo button lead                  267
Chat lead                         114
Cashback lead                      10
eBook                               5
Sales lead, Demo button lead        5
Sales lead, Chat lead               3
Sales lead, eBook                   2
Demo button lead, Chat lead         1
Name: count, dtype: int64

**Observation**
* The column has very little variance: Sales lead accounts for 23,121 of the 24,088 non-null values
* The remaining values are concentrated in one or two categories

**Think about it**
* With almost no variance in the data, can the model learn anything important from this column?

In [22]:
df['Marketing Source'].value_counts()

Marketing Source
SEO                                            10127
Paid - Instagram                                3895
Paid-Adwords                                    3514
Paid-YouTube                                    2652
Affiliate                                       2531
Medium                                          2215
Paid - Facebook                                 1528
Email Campaign                                  1050
Paid - Linkedin                                  154
Naukri                                           102
Medium, Paid-Adwords                              70
SEO, Medium, Paid-Adwords                         57
Referral                                          50
SEO, Affiliate                                    48
Linkedin jobs                                     46
SEO, Paid-Adwords                                 44
SEO, Paid - Instagram                             34
Affiliate, Medium                                 29
SEO, Medium                  

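**Observation**
* There is a long tail of rare values, many of them comma-separated combinations of sources
* The eight most frequent values each have more than 1,000 leads, which is enough data to compare channels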

In [23]:
df['Demo Status'].value_counts()

Demo Status
Scheduled    4000
Done         3956
No Show      3467
Name: count, dtype: int64

**Observation**
* The column is fairly evenly distributed across its values
* It has very few unique values (3)

**Think about it**
* If we use this column, are we introducing feature leakage?

In [24]:
df['Lead Location(Manual)'].value_counts(normalize=True)

Lead Location(Manual)
IN                       0.403900
India                    0.322354
In                       0.059444
US                       0.046635
in                       0.018128
                           ...   
Mumbai, India.           0.000029
Hungary                  0.000029
IN\                      0.000029
ps                       0.000029
Vishakhapatnam, India    0.000029
Name: proportion, Length: 415, dtype: float64

**Observation**
* This feature again has a very long tail

**Think about it**
* What if we just create two categories: India and non-India?
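
As a rough illustration of the India versus non-India idea, the sketch below collapses free-text locations into a 0/1 flag. The helper name `is_india` and the matching rule (a standalone token `in` or `india`) are assumptions made for this example, and missing values are treated as non-India. Entries that name only a city would be missed by this rule.

```python
import re

import pandas as pd


def is_india(location) -> int:
    """Return 1 if the free-text location looks like India, else 0 (hypothetical helper)."""
    if pd.isna(location):
        return 0  # assumption: missing locations count as non-India
    # Lowercase and replace anything that is not a letter with a space
    tokens = re.sub(r"[^a-z]+", " ", str(location).lower()).split()
    return int("in" in tokens or "india" in tokens)


# A few of the raw values shown in the output above
locations = pd.Series(["IN", "India", "In", "US", "Mumbai, India.", "IN\\", "Hungary", None])
location_is_india = locations.map(is_india)
print(location_is_india.tolist())  # [1, 1, 1, 0, 1, 1, 0, 0]
```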

# **Data Processing & Feature engineering**

### **Data Preprocessing and Leakage**

Data leakage occurs when information that would not be available at prediction time, such as information from the test or prediction data, is inadvertently used while training a machine learning model. The model then appears to perform better during evaluation than it will on genuinely unseen data.

Leakage often occurs during the preprocessing phase, when statistics computed on the test or prediction data are used to transform the training data.

For example, consider a scenario where the preprocessing step involves imputing missing values in the dataset. If the missing values are imputed using the mean or median of the entire dataset, including the test and prediction data, then the imputed values in the training data are influenced by the test and prediction data. The test data is then no longer truly unseen, so the evaluation is over-optimistic and the model may generalize poorly.


To avoid data leakage, fit the preprocessing steps on the training data only, and then apply the fitted steps unchanged to the test and prediction data. This keeps the test and prediction data unseen during training and gives a more honest estimate of the model's accuracy.

In this notebook, we perform the preprocessing steps on the full dataset for the sake of simplicity, which could potentially lead to data leakage. In real-world scenarios, split the data first and fit every preprocessing step on the training portion only.
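
For reference, a leakage-safe version of the imputation example might look like the sketch below. It uses a small made-up numeric column rather than the marketing data; the names `toy`, `toy_train`, and `toy_test` are assumptions for illustration. The imputer learns its median from the training rows only and is then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy numeric feature with missing values (hypothetical data, not the marketing dataset)
toy = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 9.0, 10.0]})
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy="median")
train_imputed = imputer.fit_transform(toy_train)  # median is learned from the training rows only
test_imputed = imputer.transform(toy_test)        # the same median is reused, not re-fitted

print("Median learned from train:", imputer.statistics_[0])
```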

### **Missing Value Detection and Imputation**

Real-world datasets are rarely clean, and missing values are one of the most common challenges they pose.

Missing values can be imputed with a provided constant value, or with a statistic (mean, median, or most frequent value) of the column in which they occur.

We saw earlier that there are some missing values in the data. Let's look into them now.
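
Before checking columns one by one, it can help to see all missing-value counts at once. The sketch below defines a hypothetical helper, `missing_summary`, and runs it on a tiny made-up frame; in this notebook it could be called on `df` instead.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, most-missing first."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "share": frame.isna().mean().round(3),
    })
    return summary.sort_values("missing", ascending=False)


# Made-up example frame
demo = pd.DataFrame({"a": [1, None, 3, None], "b": ["x", "y", None, "z"], "c": [1, 2, 3, 4]})
print(missing_summary(demo))
```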

In [25]:
df['Lead Owner'].isna().sum()

0

In [26]:
df['Interest Level'].isna().sum()

137

Since the target variable has missing values, we drop those rows.

In [27]:
df = df[df['Interest Level'].notna()]

In [28]:
print(f"Shape after drop: {df.shape}")

Shape after drop: (38847, 16)


In [29]:
df['Interest Level'].value_counts()

Interest Level
Slightly Interested    14572
Not Interested         10545
No Answer               9254
Not called              1585
Fairly Interested       1320
Closed                   811
Invalid Number           636
Very Interested          124
Name: count, dtype: int64

We checked for missing values in Lead Owner (none found) and Interest Level (137 missing). Since Interest Level is the target variable, we drop those rows to avoid training on unlabeled data. Finally, we check the class distribution to assess potential imbalance for downstream modeling.

##### Now we will handle our target variable

Since the target variable has multiple values and we want to formulate the problem as binary classification, we make the following assignments:

**Label assignment:**
* Slightly Interested = 1
* Not Interested = 0
* No Answer = 0
* Fairly Interested = 1
* Very Interested = 1

We will drop rows where the value is Not called, Closed, or Invalid Number.

In [30]:
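# .copy() makes the filtered frame independent, so later column assignments do not act on a view
df = df[~df['Interest Level'].isin(["Not called", "Closed", "Invalid Number"])].copy()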

In [31]:
df['Interest Level'].value_counts()

Interest Level
Slightly Interested    14572
Not Interested         10545
No Answer               9254
Fairly Interested       1320
Very Interested          124
Name: count, dtype: int64

In [32]:
df['Interest Level'] = df['Interest Level'].apply(lambda x: 1 if x in ["Slightly Interested", "Fairly Interested", "Very Interested"] else 0)

In [33]:
df['Interest Level'].value_counts()

Interest Level
0    19799
1    16016
Name: count, dtype: int64

We cleaned and simplified the Interest Level column to make it suitable for binary classification. First, we removed entries that didn’t reflect actual interest, such as "Not called", "Closed", and "Invalid Number". Then, we grouped all genuinely interested responses ("Slightly Interested", "Fairly Interested", and "Very Interested") under a single class labeled 1, and everything else as 0. This allowed us to frame the problem as a clear interested vs. not interested prediction task, with a fairly balanced distribution across the two classes.
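
The same label assignment can also be written as an explicit dictionary. One possible benefit is that an unexpected interest level raises an error instead of silently becoming 0. The helper name `binarize_interest` and the constant names are assumptions made for this sketch.

```python
import pandas as pd

# Assumption: these mirror the label assignment described in this section
LABEL_MAP = {
    "Slightly Interested": 1,
    "Fairly Interested": 1,
    "Very Interested": 1,
    "Not Interested": 0,
    "No Answer": 0,
}
DROPPED_LEVELS = {"Not called", "Closed", "Invalid Number"}


def binarize_interest(levels: pd.Series) -> pd.Series:
    """Drop non-informative levels and map the rest to 0/1 (hypothetical helper)."""
    unexpected = set(levels.dropna().unique()) - set(LABEL_MAP) - DROPPED_LEVELS
    if unexpected:
        raise ValueError(f"Unexpected interest levels: {sorted(unexpected)}")
    kept = levels[levels.isin(list(LABEL_MAP))]  # also removes missing values
    return kept.map(LABEL_MAP).astype(int)


demo = pd.Series(["Very Interested", "No Answer", "Closed", "Slightly Interested", None])
print(binarize_interest(demo).tolist())  # [1, 0, 1]
```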

#### Drop unimportant columns

In [34]:
df.columns

Index(['Lead Id', 'Lead Owner', 'Interest Level', 'Lead created',
       'Lead Location(Auto)', 'Creation Source', 'Next activity',
       'What do you do currently ?', 'What are you looking for in Product ?',
       'Website Source', 'Lead Last Update time', 'Marketing Source',
       'Lead Location(Manual)', 'Demo Date', 'Demo Status', 'Closure date'],
      dtype='object')

In [35]:
# errors="ignore" keeps this cell safe to re-run after the columns have already been dropped
df = df.drop(["Lead Id", "Lead Location(Manual)", "Demo Date", "Demo Status", "Closure date"], axis=1, errors="ignore")

#### Lead creation time

- We will create 2 features here from our lead creation time column
    1. hour of day
    2. day of week

In [36]:
df['hour_of_day'] = df['Lead created'].dt.hour
df['day_of_week'] = df['Lead created'].dt.weekday

In [37]:
df = df.drop(["Lead created"], axis=1)

The Lead created column is no longer needed, so we drop it.

#### Creation source

In [38]:
df['Creation Source'].value_counts()

Creation Source
API                 33317
Manually created     2346
Deal                  152
Name: count, dtype: int64

In [39]:
from pandas import factorize

In [40]:
labels, categories = factorize(df["Creation Source"])

In [41]:
df["labels"] = labels
abs(df["Interest Level"].corr(df["labels"]))

0.008490292073158507

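The absolute correlation with the target variable is about 0.008, which is negligible. Because we took `abs()`, the sign of the correlation is not visible here.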

In [42]:
df = df.drop(["labels"], axis=1)
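
Because Creation Source is a nominal feature, the integer codes produced by `factorize` have no inherent order, so a Pearson correlation computed on them is hard to interpret. One alternative check, sketched below on made-up rows, is to compare the share of interested leads per category; the frame `toy` is an assumption for illustration.

```python
import pandas as pd

# Made-up rows with the same two columns as the notebook's dataframe
toy = pd.DataFrame({
    "Creation Source": ["API", "API", "API", "Manually created", "Manually created", "Deal"],
    "Interest Level": [1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 target per category is the share of interested leads in that category
rate_by_source = toy.groupby("Creation Source")["Interest Level"].agg(["mean", "count"])
print(rate_by_source)
```

Categories with very few rows give unstable rates, so the `count` column is worth reading alongside the mean.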

#### What do you do currently?

* student = 1
* others = 0

As we saw earlier, this feature has a large number of values of which students are a dominating part.

We will binarize this column into students and non-students

**Facts**

<i>Binarization</i> is the process of dividing data into two groups and assigning one out of two values to all the members of the same group. This is usually accomplished by defining a threshold t and assigning the value 0 to all the data points below the threshold and 1 to those above it.
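
A minimal sketch of both flavors of binarization follows: a numeric threshold, and a text rule like the student versus non-student split used here. The arrays and the threshold `t = 0.5` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Numeric binarization with a threshold t
values = np.array([0.2, 0.5, 0.8])
t = 0.5
binary = (values > t).astype(int)
print(binary.tolist())  # [0, 0, 1]

# Text binarization: 1 if the entry mentions "student" (case-insensitive), else 0
occupations = pd.Series(["Student", "student ", "Working pro", None])
is_student = occupations.str.contains("student", case=False, na=False).astype(int)
print(is_student.tolist())  # [1, 1, 0, 0]
```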

In [43]:
df['What do you do currently ?'].isna().sum()

19419

In [44]:
df['What do you do currently ?'].value_counts(normalize=True)

What do you do currently ?
Student                           0.205904
student                           0.076482
Fresher                           0.018053
Working                           0.011710
Data Engineer                     0.008295
                                    ...   
Completed mba in civil            0.000061
Into automation                   0.000061
Stu - Btech Final year            0.000061
Big Data Prof                     0.000061
Course AIMA - Business Analyst    0.000061
Name: proportion, Length: 6592, dtype: float64

In [45]:
df['What do you do currently ?'] = df['What do you do currently ?'].apply(lambda x: 1 if 'student' in str(x).strip().lower() else 0)

#### Website Source

In [46]:
df['Website Source'].isna().sum()

13828

In [47]:
df['Website Source'].value_counts()

Website Source
Sales lead                      21133
Start Project                     517
Demo button lead                  240
Chat lead                          78
Cashback lead                       6
eBook                               5
Sales lead, Demo button lead        4
Sales lead, Chat lead               2
Sales lead, eBook                   2
Name: count, dtype: int64

In [48]:
df = df.drop(["Website Source"], axis=1)

We drop the <b>Website Source</b> column because it has too little variance: almost all non-null values are Sales lead.

#### Marketing Source

In [49]:
df['Marketing Source'].value_counts()

Marketing Source
SEO                                            9751
Paid - Instagram                               3738
Paid-Adwords                                   3258
Paid-YouTube                                   2376
Affiliate                                      2215
Medium                                         2045
Paid - Facebook                                1436
Email Campaign                                  813
Paid - Linkedin                                 136
Naukri                                           99
Medium, Paid-Adwords                             63
SEO, Medium, Paid-Adwords                        51
Linkedin jobs                                    42
SEO, Paid-Adwords                                39
Referral                                         38
SEO, Affiliate                                   38
SEO, Paid - Instagram                            30
Affiliate, Medium                                27
Paid - Instagram, Paid-Adwords                 

In [50]:
df['Marketing Source'].isna().sum()

9456

Marketing Source has a large number of missing values, and imputing them from the other values would add noise.

Instead, let's create a new value <b>Unknown</b> to substitute for the NA values.

In [51]:
# Assign the result back instead of using inplace=True on a column selection
df['Marketing Source'] = df['Marketing Source'].fillna("Unknown")

PS: Imputing with Unknown led to better results than dropping these rows.

### Label Encoding

**Transforming Categorical Variables**

Transforming variables is an important step in the data preprocessing pipeline of machine learning, as it helps to convert the data into a format that is suitable for analysis and modeling. There are several ways to transform variables, depending on the type and nature of the data.

Categorical variables, for example, are variables that take on discrete values from a finite set of categories, such as colors, gender, or occupation. One common way to transform categorical variables is through one-hot encoding. One-hot encoding involves creating a new binary variable for each category in the original variable, where the value is 1 if the observation belongs to that category and 0 otherwise. This approach is useful when the categories have no natural order or ranking.

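Another way to transform categorical variables is through label encoding. Label encoding involves assigning a unique integer value to each category in the variable. This approach is most natural when the categories have an order or ranking, such as low, medium, and high. Note that scikit-learn's `LabelEncoder` assigns integers by sorted category name, not by any meaningful ranking.

Below, we transform the categorical features into numerical labels.

**Note:** We are NOT using dummy variables here, to avoid an explosion of columns. The tree-based models used below split on thresholds rather than computing distances, so they can work with integer codes, although the codes impose an arbitrary order on the categories.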
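
To make the difference between the two encodings concrete, here is a small sketch on a made-up series of marketing sources. The series `sources` is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sources = pd.Series(["SEO", "Affiliate", "SEO", "Medium"], name="Marketing Source")

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(sources)
print(list(one_hot.columns))  # ['Affiliate', 'Medium', 'SEO']

# Label encoding: one integer per category, assigned in sorted order
codes = LabelEncoder().fit_transform(sources)
print(codes.tolist())  # [2, 0, 2, 1]
```

With 46 distinct marketing sources, one-hot encoding would add 46 columns, whereas label encoding keeps a single column.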


In [52]:
label_encoder1 = preprocessing.LabelEncoder()

In [53]:
df['Marketing Source'] = label_encoder1.fit_transform(df['Marketing Source'])

In [None]:
label_encoder2 = preprocessing.LabelEncoder()

In [None]:
df['Lead Owner']= label_encoder2.fit_transform(df['Lead Owner'])

In [None]:
label_encoder3 = preprocessing.LabelEncoder()
df['Creation Source']= label_encoder3.fit_transform(df['Creation Source'])

In [None]:
df.head()

We transformed three columns using label encoding.

Keep in mind that the encoder fitted on the training data should be reused, unchanged, to transform the test data. Since we are handling train and test together here, this issue does not arise.
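
When train and test are handled separately, the test data may contain categories the encoder never saw. The sketch below swaps in scikit-learn's `OrdinalEncoder` (not the `LabelEncoder` used in this notebook), because it can map unseen categories to a reserved code. The two small frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"Marketing Source": ["SEO", "Affiliate", "Medium"]})
test = pd.DataFrame({"Marketing Source": ["SEO", "Referral"]})  # "Referral" is unseen

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)                    # categories are learned from the training data only
test_codes = encoder.transform(test)  # the fitted encoder is reused on the test data

print(test_codes.ravel().tolist())  # [2.0, -1.0]
```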

# **Model Building and Testing**

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
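# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2;
# RocCurveDisplay and PrecisionRecallDisplay are their replacements
from sklearn.metrics import accuracy_score, precision_recall_curve, roc_curve, RocCurveDisplay, PrecisionRecallDisplay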
from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

Select the features for the model.

In [None]:
X = df[["Lead Owner", "What do you do currently ?", "Marketing Source", "Creation Source", "hour_of_day", "day_of_week"]]
y = df["Interest Level"]

**Splitting the dataset into a training and a test dataset:**

- Training: Part of the data used for training our supervised models
- Test: Part of the dataset used for testing our models' performance

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
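
A variant worth considering is a stratified split, which keeps the class proportions roughly equal in the training and test sets. The sketch below uses a made-up target with 60% zeros and 40% ones; `X_toy` and `y_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 60 of class 0 and 40 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy asks the split to preserve the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```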

In [None]:
X_train.head()

We now have model-ready data.

Let's look into model building.

## **Supervised learning**



Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.

Supervised learning problems fall into two types, classification and regression:

1. Classification uses an algorithm to accurately assign test data into specific categories. It recognizes specific entities within the dataset and attempts to draw some conclusions on how those entities should be labeled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest, which are described in more detail below.


2. Regression is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as sales revenue for a given business. Linear regression and polynomial regression are popular regression algorithms. Logistic regression, despite its name, is typically used for classification.



## **Decision Tree**



A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.


Decision tree learning employs a divide and conquer strategy by conducting a greedy search to identify the optimal split points within a tree. This process of splitting is then repeated in a top-down, recursive manner until all, or the majority of records have been classified under specific class labels. Whether or not all data points are classified as homogenous sets is largely dependent on the complexity of the decision tree. Smaller trees are more easily able to attain pure leaf nodes—i.e. data points in a single class.

However, as a tree grows in size, it becomes increasingly difficult to maintain this purity, and it usually results in too little data falling within a given subtree. When this occurs, it is known as data fragmentation, and it can often lead to overfitting. As a result, decision trees have preference for small trees, which is consistent with the principle of parsimony in Occam’s Razor; that is, “entities should not be multiplied beyond necessity.” Said differently, decision trees should add complexity only if necessary, as the simplest explanation is often the best. To reduce complexity and prevent overfitting, pruning is usually employed; this is a process, which removes branches that split on features with low importance. The model’s fit can then be evaluated through the process of cross-validation.
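
As a rough illustration of pruning followed by cross-validation, the sketch below compares an unpruned tree with a pruned one. It assumes scikit-learn's cost-complexity pruning (`ccp_alpha`), which is one pruning scheme among several and not necessarily the one described above; the data is synthetic and `X_demo`, `y_demo` and the alpha values are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); in the project, X_train / y_train would be used.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

results = {}
for alpha in (0.0, 0.01):
    # ccp_alpha=0.0 leaves the tree unpruned; larger values prune more aggressively.
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    cv_accuracy = cross_val_score(tree, X_demo, y_demo, cv=5).mean()
    n_leaves = tree.fit(X_demo, y_demo).get_n_leaves()
    results[alpha] = (cv_accuracy, n_leaves)
    print(f"ccp_alpha={alpha}: mean CV accuracy={cv_accuracy:.3f}, leaves={n_leaves}")
```

Comparing the number of leaves with the cross-validated accuracy shows whether the extra complexity of the unpruned tree actually pays off on held-out folds.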

### **Bagging**



Bagging is an ensemble learning technique that aims to decrease the variance of a single estimator by combining the predictions from multiple learners. The basic idea behind bagging is to generate multiple versions of the training dataset through random sampling with replacement, and then train a separate classifier for each sampled dataset. The predictions from these individual classifiers are then combined using averaging or voting to obtain a final prediction.

**Algorithm:**

Suppose we have a training set D of size n, and we want to train a classifier using bagging. Here are the steps involved:

* Create k different bootstrap samples from D, each of size n.
* Train a classifier on each bootstrap sample.
* When making predictions on a new data point, take the average or majority vote of the predictions from each of the k classifiers.


**Mathematical Explanation:**

Suppose we have a binary classification problem with classes -1 and 1. Let's also assume that we have a training set D of size n, and we want to train a decision tree classifier using bagging.

**Bootstrap Sample**: For each of the k classifiers, we create a bootstrap sample of size n by sampling with replacement from D. This means that each bootstrap sample may contain duplicates of some instances and may also miss some instances from the original dataset. Let's denote the i-th bootstrap sample as D_i.

**Train a Classifier**: We train a decision tree classifier T_i on each bootstrap sample D_i. This gives us k classifiers T_1, T_2, ..., T_k.

**Combine Predictions**: To make a prediction on a new data point x, we take the majority vote of the predictions from each of the k classifiers.

The idea behind bagging is that the variance of the prediction error decreases as k increases. This is because each classifier has a chance to explore a different part of the feature space due to the random sampling with replacement, and the final prediction is a combination of these diverse classifiers.
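
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: to stay short it uses one-dimensional decision stumps instead of full decision trees, and the helper names (`fit_stump`, `bagging_fit`, `bagging_predict`) and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Return the (threshold, sign) rule with the fewest errors on (x, y) pairs, y in {-1, 1}.
    The rule predicts `sign` when x > threshold, otherwise -sign."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for sign in (1, -1):
            errors = sum(1 for x, y in points if (sign if x > threshold else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    return best[1], best[2]

def stump_predict(stump, x):
    threshold, sign = stump
    return sign if x > threshold else -sign

def bagging_fit(data, k, seed=0):
    """Train k stumps, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(data)
    stumps = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in range(n)]  # bootstrap sample D_i (with replacement)
        stumps.append(fit_stump(sample))               # classifier T_i
    return stumps

def bagging_predict(stumps, x):
    """Majority vote over the individual classifiers."""
    votes = Counter(stump_predict(s, x) for s in stumps)
    return votes.most_common(1)[0][0]

# Toy 1-D data: class -1 for small x, class 1 for large x.
toy = [(0.1, -1), (0.3, -1), (0.4, -1), (0.5, -1), (0.6, 1), (0.8, 1), (0.9, 1), (1.1, 1)]
ensemble = bagging_fit(toy, k=25)
print(bagging_predict(ensemble, 0.2), bagging_predict(ensemble, 1.0))
```

Individual stumps differ because each bootstrap sample omits or repeats some points; the vote smooths out those differences.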



### **Boosting**

Boosting is an ensemble technique that works by combining several weak models (also known as base learners) into a strong model. Its main goal is to reduce the bias of the base learners (and often the variance as well) by iteratively adding new models to the ensemble that focus on correcting the errors made by the previous models. In other words, the boosting algorithm tries to learn from the mistakes of the previous models and improve the overall accuracy of the ensemble.

Boosting works by assigning higher weights to the data points that the previous models misclassified, and lower weights to the ones that were classified correctly. This ensures that the new model focuses more on the difficult data points that the previous models struggled with, and less on the ones that were already well-classified. As a result, the new model is more specialized and can improve the accuracy of the ensemble.

There are several types of boosting algorithms, including AdaBoost (Adaptive Boosting), Gradient Boosting, and XGBoost (Extreme Gradient Boosting). Each of these algorithms has its own approach to assigning weights to the data points and building the new models, but they all share the fundamental idea of iteratively improving the accuracy of the ensemble by combining weak models into a strong one. Boosting is a powerful algorithm that has been shown to achieve state-of-the-art results in many machine learning tasks, such as image classification, natural language processing, and recommender systems.
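
To make the re-weighting idea concrete, here is a small AdaBoost-style sketch in plain Python. It is an illustration under simplifying assumptions (one feature, decision stumps as weak learners, labels in {-1, 1}); the function names and toy data are hypothetical, and library implementations differ in many details.

```python
import math

def weighted_stump(points, weights):
    """Pick the (error, threshold, sign) rule with the lowest weighted error.
    The rule predicts `sign` when x > threshold, otherwise -sign."""
    best = None
    for threshold in sorted({x for x, _ in points} | {float("-inf")}):
        for sign in (1, -1):
            err = sum(w for (x, y), w in zip(points, weights)
                      if (sign if x > threshold else -sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost_fit(points, rounds):
    n = len(points)
    weights = [1.0 / n] * n                      # start with uniform weights
    ensemble = []                                # list of (alpha, threshold, sign)
    for _ in range(rounds):
        err, threshold, sign = weighted_stump(points, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this learner
        ensemble.append((alpha, threshold, sign))
        # Misclassified points (y * prediction == -1) are scaled up, the others down.
        weights = [w * math.exp(-alpha * y * (sign if x > threshold else -sign))
                   for (x, y), w in zip(points, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * (sign if x > threshold else -sign)
                for alpha, threshold, sign in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold separates: +1, -1, +1 along the axis.
toy = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, 1)]
model = adaboost_fit(toy, rounds=3)
print([adaboost_predict(model, x) for x, _ in toy])
```

No single stump classifies this toy data correctly, but the weighted vote of three stumps does, which is the sense in which weak learners combine into a stronger one.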









**Difference between Bagging and Boosting**


To understand boosting, it helps to remember that it is a generic method, not a specific model. Boosting involves specifying a weak model, such as a regression model or a decision tree, and then improving it. In ensemble learning, the primary difference between bagging and boosting is that in bagging the weak learners are trained independently of each other (and can therefore be trained in parallel), whereas in boosting they are trained sequentially. In weight-based boosting such as AdaBoost, each new iteration increases the weights of the data points that the previous model misclassified. This redistribution of weights tells the algorithm which data points to focus on in order to improve its performance.

The two ensemble techniques are also used in different situations. Bagging is typically applied to base learners with high variance and low bias, such as deep decision trees, because they tend to overfit, whereas boosting is applied to base learners with low variance and high bias. While bagging helps prevent overfitting, boosting is more vulnerable to it because it keeps adding learners that further reduce the training error. If overfitting is encountered, limiting the number of models, tuning hyperparameters, and applying regularization can help.


### **Random Forest**


Another way that decision trees can maintain their accuracy is by forming an ensemble via a random forest algorithm; this classifier predicts more accurate results, particularly when the individual trees are uncorrelated with each other.

Random Forest is an ensemble learning algorithm that builds a large number of decision trees and combines them to make a final prediction. It is a type of bagging method, where multiple decision trees are trained on random subsets of the training data and features. The algorithm then averages the predictions of these individual trees to produce a final prediction. Random Forest is particularly useful for handling high-dimensional data and for avoiding overfitting.

**Algorithm of Random Forest**

The algorithm of Random Forest can be summarized in the following steps:

1. Start by randomly selecting a subset of the training data, with replacement. This subset is called the bootstrap sample.

2. Next, randomly select a subset of features from the full feature set. (In the standard algorithm, and in scikit-learn's `RandomForestClassifier` via `max_features`, this feature subset is re-drawn at every split rather than once per tree.)

3. Build a decision tree using the bootstrap sample and the selected subset of features. At each node of the tree, select the best feature and split the data based on the selected feature.

4. Repeat steps 1-3 to build multiple trees.

5. Finally, combine the predictions of all trees to make a final prediction. For classification, this is usually done by taking a majority vote of the predicted classes. For regression, this is usually done by taking the average of the predicted values.


**Mathematics Behind Random Forest**

The mathematics behind Random Forest involves the use of decision trees and the bootstrap sampling technique. Decision trees are constructed using a recursive binary partitioning algorithm that splits the data based on the values of the selected features. At each node, the algorithm chooses the feature and the split point that maximizes the information gain. Information gain measures the reduction in entropy or impurity of the target variable after the split. The goal is to minimize the impurity of the subsets after each split.
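
The information-gain criterion described above can be written out in a few lines. The sketch below assumes a single numeric feature and uses Shannon entropy in bits; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the data on `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_split(values, labels):
    """Greedy search over candidate thresholds for the largest information gain."""
    return max(sorted(set(values)), key=lambda t: information_gain(values, labels, t))

# Hypothetical toy feature and binary target.
feature = [1, 2, 3, 4, 5, 6]
target = [0, 0, 0, 1, 1, 1]
t = best_split(feature, target)
print(t, information_gain(feature, target, t))
```

A tree builder would apply `best_split` to every candidate feature at a node, keep the best one, and then recurse on the two resulting subsets.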

Bootstrap sampling is a statistical technique that involves randomly sampling the data with replacement to create multiple subsets. These subsets are used to train individual decision trees. By using bootstrap samples, the algorithm can generate multiple versions of the same dataset with slightly different distributions. This introduces randomness into the training process, which helps to reduce overfitting.



**Difference between Bagging and Random Forest**

Bagging and Random Forest are both ensemble learning algorithms that involve training multiple models on random subsets of the data. The main difference between the two is the way the individual models are trained.

Bagging involves training multiple models using the bootstrap sampling technique, but each model can use the full set of features. The resulting models tend to make correlated predictions, which limits how much averaging can reduce the variance; bagging also does little to reduce the bias of the model.

Random Forest, on the other hand, involves training multiple models using the bootstrap sampling technique, but each model uses a randomly selected subset of features. This introduces additional randomness into the model and helps to reduce the correlation between individual predictions. Random Forest can achieve better performance than Bagging, especially when dealing with high-dimensional data or noisy features. In simpler terms it uses subsets of observations as well as features.








## **Gradient Boosting Trees**

Gradient boosting is a machine learning technique used in regression and classification tasks, among others. It gives a prediction model in the form of an ensemble of weak prediction models, which are typically decision trees.

When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it often outperforms random forest. A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.

Examples of gradient-boosted tree libraries include XGBoost and LightGBM.
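
The stage-wise idea can be illustrated with a tiny from-scratch sketch. It assumes a regression task with squared-error loss, where the negative gradient is simply the residual, and uses one-feature stumps as weak learners; libraries such as XGBoost and LightGBM are far more elaborate. All names and data below are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares stump: predicts the mean residual on each side of a threshold.
    Assumes xs contains at least two distinct values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def gradient_boost_fit(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)          # stage 0: a constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def gradient_boost_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)

def mse(model, xs, ys):
    return sum((y - gradient_boost_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
constant_only = gradient_boost_fit(xs, ys, n_rounds=0)
boosted = gradient_boost_fit(xs, ys, n_rounds=50)
print(mse(constant_only, xs, ys), mse(boosted, xs, ys))
```

Each round fits a small model to what the current ensemble still gets wrong and adds a damped version of it, so the training error shrinks step by step.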

## **Bagging vs Boosting**


1. Bagging: An ensemble of homogeneous weak learners that are trained independently of each other, in parallel, and whose predictions are combined by averaging or voting.

![image.png](https://media.geeksforgeeks.org/wp-content/uploads/20210707140912/Bagging.png)


2. Boosting: Also an ensemble of homogeneous weak learners, but it works differently from bagging. Here the learners are trained sequentially and adaptively, each one improving on the predictions of the previous ones.

![image.png](https://media.geeksforgeeks.org/wp-content/uploads/20210707140911/Boosting.png)

Credit: geeksforgeeks

In [None]:
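# Imports repeated here so that this cell runs on its own.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
rf = RandomForestClassifier(n_estimators=300)
xgb = XGBClassifier(n_estimators=300, objective='binary:logistic', tree_method='hist', eta=0.1, max_depth=3)
lgb = LGBMClassifier(n_estimators=300)
checkpoint("fcMar1")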

#### Training multiple models together

In [None]:
rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)
lgb.fit(X_train, y_train)

All models are trained very quickly here.

Scikit-learn provides an additional parameter, `n_jobs=-1`, which parallelizes the training of some models (for example `RandomForestClassifier`) across all available CPU cores.

# Model Evaluation

### **Classification Evaluation Metrics**

Classification evaluation metrics are used to evaluate the performance of a machine learning model that is trained for classification tasks. Some of the commonly used classification evaluation metrics are F1 score, recall score, confusion matrix, and ROC AUC score. Here's an overview of each of these metrics:

**F1 score**: The F1 score is a metric that combines the precision and recall of a model into a single value. It is the harmonic mean of precision and recall and takes a value between 0 and 1, where 1 indicates perfect precision and recall. It is calculated as follows:
$$ F1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} $$
where precision is the number of true positives divided by the sum of true positives and false positives, and recall is the number of true positives divided by the sum of true positives and false negatives.

**Recall**: Recall (also known as sensitivity or the true positive rate) is the number of true positives divided by the sum of true positives and false negatives, i.e. the fraction of actual positive cases that the model finds. It is most relevant when the cost of false negatives (missing instances of a class) is high. It is given by the following formula:
$$ Recall = \frac{TP}{TP + FN} $$

**Precision**: Precision is another important classification evaluation metric, which is defined as the ratio of true positives to the total predicted positives. It measures the accuracy of positive predictions made by the classifier, i.e., the proportion of positive identifications that were actually correct.
The formula for precision is:
$$ precision = \frac{true\ positive}{true\ positive + false\ positive} $$
where true positive refers to the cases where the model correctly predicted the positive class, and false positive refers to the cases where the model incorrectly predicted the positive class.
Precision is useful when the cost of false positives is high, such as in medical diagnosis or fraud detection, where a false positive can have serious consequences. In such cases, a higher precision indicates that the model is better at identifying true positives and minimizing false positives.
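
The three formulas above translate directly into code. The following is a small sketch in plain Python; the function name and the example labels are made up for illustration.

```python
def precision_recall_f1(actual, predicted, positive=1):
    """Compute precision, recall and F1 for one positive class from two label lists."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    # Return 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = interested lead, 0 = not interested.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here 3 of the 5 predicted positives are correct (precision 0.6) and 3 of the 4 actual positives are found (recall 0.75); F1 lies between the two, closer to the smaller value.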

**Confusion Matrix**:
A confusion matrix is a table that is often used to describe the performance of a classification model. It compares the predicted labels with the true labels and counts the number of true positives, false positives, true negatives, and false negatives. Here is an example of a confusion matrix:

|          | Actual Positive | Actual Negative |
|----------|----------------|----------------|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
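
As a sketch, the four cells of the table can be counted by hand as follows. The function name and labels are hypothetical. Note that `sklearn.metrics.confusion_matrix` uses the transposed layout: its rows are the true labels and its columns are the predicted labels.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical labels: 1 = positive class, 0 = negative class.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
c = confusion_counts(actual, predicted)

# Same layout as the table above: rows are predictions, columns are actual classes.
print("                   Actual Pos  Actual Neg")
print(f"Predicted Pos      {c['TP']:>10}  {c['FP']:>10}")
print(f"Predicted Neg      {c['FN']:>10}  {c['TN']:>10}")
```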

**ROC AUC Score**:
ROC AUC (Receiver Operating Characteristic Area Under the Curve) score is a measure of how well a classifier is able to distinguish between positive and negative classes. It is calculated as the area under the ROC curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. TPR is the number of true positives divided by the sum of true positives and false negatives, and FPR is the number of false positives divided by the sum of false positives and true negatives.
$$ ROC\ AUC\ Score = \int_0^1 TPR(FPR^{-1}(t)) dt $$
where $FPR^{-1}$ is the inverse of the FPR function.
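
The integral can be approximated by sweeping the decision threshold over the predicted scores and applying the trapezoidal rule to the resulting (FPR, TPR) points. The sketch below does this in plain Python on made-up scores; in practice `sklearn.metrics.roc_auc_score` would normally be used.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold past each distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def roc_auc(scores, labels):
    """Area under the ROC curve via the trapezoidal rule."""
    pts = roc_points(scores, labels)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical predicted probabilities and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(roc_auc(scores, labels))
```

The same number can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one: here 13 of the 16 positive/negative pairs are ordered correctly.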

**When to use which**:

The choice of evaluation metric depends on the specific requirements of the business problem. Here are some general guidelines:

* F1 score: Use the F1 score when the class distribution is imbalanced, and when both precision and recall are equally important.

* Recall score: Use the recall score when the cost of false negatives (i.e., missing instances of a class) is high. For example, in a medical diagnosis problem, the cost of missing a positive case may be high, so recall would be a more appropriate metric.

* Precision: Precision is useful when the cost of false positives is high, such as in medical diagnosis or fraud detection, where a false positive can have serious consequences. In such cases, a higher precision indicates that the model is better at identifying true positives and minimizing false positives.

* Confusion matrix: The confusion matrix is a versatile tool that can be used to visualize the performance of a model across different classes. It can be useful for identifying specific areas of the model that need improvement.

* ROC AUC score: Use the ROC AUC score when the ability to distinguish between positive and negative classes is important. For example, in a credit scoring problem, the ability to distinguish between good and bad credit risks is crucial.

**Importance with respect to the business problem**:

The importance of each evaluation metric varies depending on the business problem. For example, in a spam detection problem, precision may be more important than recall, since false positives (i.e., classifying a non-spam email as spam) may annoy users, while false negatives (i.e., missing a spam email) may not be as harmful. On the other hand, in a disease diagnosis problem, recall may be more important than precision, since missing a positive case (i.e., a false negative) could have serious consequences. Therefore, it is important to choose the evaluation metric that is most relevant to the specific business problem at hand.



### Evaluation metrics

![image.png](https://blog.paperspace.com/content/images/2020/09/Fig01.jpg)

1. Accuracy

\begin{equation}
\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}}
\end{equation}


2. Precision

\begin{equation}
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}
\end{equation}


3. Recall

\begin{equation}
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}
\end{equation}


4. F1-score

\begin{equation}
\text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{equation}

The Precision-Recall (PR) curve and the Area Under the Curve (AUC) are evaluation tools commonly used in binary classification problems to assess the performance of a model and determine an appropriate threshold for decision making.

The Precision-Recall (PR) curve is a graphical representation of the trade-off between precision and recall for different classification thresholds. Precision is the ratio of true positive predictions to the total number of positive predictions, while recall (also known as sensitivity or true positive rate) is the ratio of true positive predictions to the total number of actual positive instances in the data. The PR curve plots precision on the y-axis and recall on the x-axis, with each point on the curve representing a different classification threshold. A higher precision and recall indicate better model performance.

The Area Under the Curve (AUC) is a single scalar value that summarizes the PR curve. It measures the overall performance of the model across all possible classification thresholds. The AUC value ranges from 0 to 1, where a higher value indicates better model performance. An AUC of 1 represents a perfect model that achieves maximum precision and recall across all thresholds.

Choosing the right threshold depends on the specific requirements of your problem. The PR curve can help you visualize the precision-recall trade-off at different thresholds. If your problem prioritizes precision (minimizing false positives), you may want to choose a threshold that maximizes precision while maintaining a reasonable level of recall. On the other hand, if recall is more important (minimizing false negatives), you would choose a threshold that maximizes recall while still maintaining an acceptable level of precision.

The selection of the threshold ultimately depends on the cost or impact of false positives and false negatives in your specific problem domain. By analyzing the PR curve and considering the specific requirements and trade-offs of your problem, you can make an informed decision about the threshold that best balances precision and recall for your particular use case.
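
One possible way to turn this into a rule is sketched below: among all thresholds whose precision meets a chosen target, pick the lowest one, since recall can only fall as the threshold rises. The target value, function names and scores are assumptions made for the example; the right target depends on the business costs discussed above.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(scores, labels, min_precision):
    """Lowest threshold (hence highest recall) whose precision meets the target, or None."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(scores, labels, threshold)
        if precision >= min_precision:
            return threshold, precision, recall
    return None

# Hypothetical predicted probabilities and true labels (1 = interested lead).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(pick_threshold(scores, labels, min_precision=0.75))
```

With real model output, `scores` would come from `predict_proba` on the test set, and the same search could be run with the roles of precision and recall swapped when false negatives are the bigger concern.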







In [None]:
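from sklearn.metrics import accuracy_score
def get_evaluation_metrics(model_name, model, pred, actual):
    # scikit-learn metrics expect (y_true, y_pred); `model` is currently unused.
    print("Accuracy of %s: " % model_name, accuracy_score(actual, pred))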

In [None]:
get_evaluation_metrics("Random Forest", rf, rf.predict(X_test), y_test)
get_evaluation_metrics("XGBoost", xgb, xgb.predict(X_test), y_test)
get_evaluation_metrics("Light GBM", lgb, lgb.predict(X_test), y_test)

In [None]:
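# plot_precision_recall_curve / plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below replace them (available since scikit-learn 1.0).
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay
PrecisionRecallDisplay.from_estimator(rf, X_test, y_test)
RocCurveDisplay.from_estimator(rf, X_test, y_test)

In [None]:
PrecisionRecallDisplay.from_estimator(xgb, X_test, y_test)
RocCurveDisplay.from_estimator(xgb, X_test, y_test)
checkpoint("fcMar1")

In [None]:
PrecisionRecallDisplay.from_estimator(lgb, X_test, y_test)
RocCurveDisplay.from_estimator(lgb, X_test, y_test)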

### **Think about it**

- Although the results are numerically similar for XGBoost and LightGBM, which one do you think is better?
- Why does the PR curve matter here? Think in terms of business justification!

# **Try it out**

- Can you try out grid-search to tune hyperparameters for these models?

- Can you come up with the right thresholds for these models, depending on which you feel is more important here: precision or recall?

- Can you train a multi-layer perceptron and check the performance?

## **Conclusion**

In this project we used several supervised models to predict whether the customer behind a lead would be interested.

The problem could have been formulated as a multi-class classification problem, but we instead formulated it as a binary classification problem. The confidence of the predictions helps stakeholders decide which leads to chase.


A successful data science project requires a clear understanding of the business problem and the data available, as well as the ability to select and apply appropriate data preprocessing techniques, feature engineering methods, and machine learning algorithms. It is also important to assess and optimize the performance of the model and communicate the results effectively to stakeholders.

After looking at the PR and ROC curves above, we can conclude that <b>LightGBM</b> is giving us the best possible results.

## **Interview Questions**

### **Supervised Learning:**

* What are decision trees? How does the model decide on the split?
* What is bootstrap aggregation?
* Explain bias and variance in the context of boosting and bagging.
* How can you use a decision tree for missing value imputation?
* How does regularization work in gradient boosting models?


### **Code Implementation:**

* Is bagging or boosting computationally faster?
* Can you write your own code to calculate precision and recall?
* How would you handle the dataset if the target variable had been imbalanced?

## **Feedback**

In [None]:
feedback()