───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────    
📘 **Author:** Teslim Uthman Adeyanju  
📫 **Email:** [info@adeyanjuteslim.co.uk](mailto:info@adeyanjuteslim.co.uk)  
🔗 **LinkedIn:** [linkedin.com/in/adeyanjuteslimuthman](https://www.linkedin.com/in/adeyanjuteslimuthman)  
🌐 **Website & Blog:** [adeyanjuteslim.co.uk](https://adeyanjuteslim.co.uk)  
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────    


# **Marketing Intelligence: A Predictive Model for Lead Conversion**

**Table of Contents**
- [Project Overview](#project-overview)
- [Project Objectives](#project-objectives)
- [Methodology](#methodology)
- [Library Tools](#library-tools)   

## 📚 1.0 INTRODUCTION

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section gives an overview of the dataset and the problem we are trying to solve. It also covers the project objectives, the methodology, and the tools (libraries) we will use to solve the problem.

</div>

🔗 Link to Dataset: [Kaggle - Leads Dataset](https://www.kaggle.com/ashydv/leads-dataset)

**Project Description**

This project focuses on analyzing lead behavioral patterns and evaluating the impact of different lead sources and engagement attributes on conversion outcomes. By using classification modeling, the project aims to:

- Predict whether a lead will convert into a customer  
- Identify high-impact features influencing conversions  
- Score and segment leads based on their conversion likelihood  

This tool is beneficial for marketing teams, sales strategists, and data-driven business managers seeking to optimize lead nurturing, campaign effectiveness, and customer acquisition.

**🎯 Project Objectives**

- Predict whether a lead will convert into a customer  
- Identify key features influencing lead conversion  
- Score and segment leads based on conversion likelihood  
- Support data-driven decisions in marketing and sales strategy
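
As a rough illustration of the "score and segment" objective, the sketch below fits a logistic regression on synthetic data and turns the predicted probabilities into a 0-100 lead score. Everything here is an assumption made for illustration: the data is randomly generated (it is not the Leads dataset), the feature names `total_visits` and `time_on_site` are placeholders, and the segment cutoffs of 40 and 70 are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Leads dataset): two numeric engagement features.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "total_visits": rng.poisson(3, n),
    "time_on_site": rng.uniform(0, 2000, n),
})
# Hypothetical rule: more time on site makes conversion more likely.
p = 1 / (1 + np.exp(-(X["time_on_site"] - 1000) / 300))
y = (rng.uniform(size=n) < p).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Lead score = predicted conversion probability scaled to 0-100.
scores = pd.Series(
    model.predict_proba(X_test)[:, 1] * 100, index=X_test.index, name="lead_score"
)

# Segment leads into three bands; the cutoffs (40, 70) are illustrative only.
segments = pd.cut(
    scores, bins=[0, 40, 70, 100], labels=["cold", "warm", "hot"], include_lowest=True
)
print(segments.value_counts())
```

In practice the cutoffs would be chosen from the business context, for example from the sales team's capacity to follow up on "hot" leads.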

### 🔍 Library Tools
___

In [11]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## 📚 2.0 DATA PREPROCESSING


<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">

This section focuses on loading the dataset and performing data preprocessing tasks such as handling missing values, changing data types, and confirming the absence of duplicates. This will make our dataset ready for exploratory data analysis and model development.
</div>
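
The three preprocessing tasks named above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the column names `lead_id`, `total_visits`, and `converted` are hypothetical, and the real dataset may need different handling.

```python
import pandas as pd

# Toy frame standing in for the leads data; column names are hypothetical.
toy = pd.DataFrame({
    "lead_id": ["a1", "b2", "b2", "c3"],
    "total_visits": ["5", "2", "2", None],   # numbers stored as text
    "converted": [1, 0, 0, 1],
})

# 1. Duplicates: count fully identical rows, then drop them.
n_duplicates = toy.duplicated().sum()
toy = toy.drop_duplicates().reset_index(drop=True)

# 2. Data types: convert the text column to numeric (unparseable values become NaN).
toy["total_visits"] = pd.to_numeric(toy["total_visits"], errors="coerce")

# 3. Missing values: count NaNs per column.
missing_counts = toy.isnull().sum()

print(n_duplicates)
print(toy.dtypes)
print(missing_counts)
```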

In [12]:
df = pd.read_csv('Leads.csv', engine='pyarrow')
df

| | Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | ... | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No | No | 0 | 0.0 | 0 | 0.00 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No | No | 0 | 5.0 | 674 | 2.50 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No | No | 1 | 2.0 | 1532 | 2.00 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 20.0 | No | Yes | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No | No | 0 | 1.0 | 305 | 1.00 | ... | No | Select | Mumbai | 02.Medium | 01.High | 13.0 | 17.0 | No | No | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | Google | No | No | 1 | 2.0 | 1428 | 1.00 | ... | No | Select | Mumbai | 02.Medium | 01.High | 15.0 | 18.0 | No | No | Modified |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9235 | 19d6451e-fcd6-407c-b83b-48e1af805ea9 | 579564 | Landing Page Submission | Direct Traffic | Yes | No | 1 | 8.0 | 1845 | 2.67 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 15.0 | 17.0 | No | No | Email Marked Spam |
| 9236 | 82a7005b-7196-4d56-95ce-a79f937a158d | 579546 | Landing Page Submission | Direct Traffic | No | No | 0 | 2.0 | 238 | 2.00 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 19.0 | No | Yes | SMS Sent |
| 9237 | aac550fe-a586-452d-8d3c-f1b62c94e02c | 579545 | Landing Page Submission | Direct Traffic | Yes | No | 0 | 2.0 | 199 | 2.00 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 13.0 | 20.0 | No | Yes | SMS Sent |
| 9238 | 5330a7d1-2f2b-4df4-85d6-64ca2f6b95b9 | 579538 | Landing Page Submission | Google | No | No | 1 | 3.0 | 499 | 3.00 | ... | No | | Other Metro Cities | 02.Medium | 02.Medium | 15.0 | 16.0 | No | No | SMS Sent |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

### 🔍 Data Cleaning and Formatting

In [19]:
# check the dataframe shape
df.shape

(9240, 37)

In [51]:
def columns_head_transform(df):
    """
    Transforms the column names of the DataFrame to lowercase and replaces spaces with underscores.
    """
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    return df
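
Note that `columns_head_transform` modifies the DataFrame it receives in place. If a non-mutating version is preferred, one possible variant is sketched below. The function name `normalize_column_names` is hypothetical, and stripping surrounding whitespace and collapsing runs of whitespace are extra assumptions that go beyond the original function.

```python
import pandas as pd

def normalize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with stripped, lowercase, underscore-separated column names."""
    cleaned = df.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    return df.set_axis(cleaned, axis=1)

# Toy frame using a few of the dataset's column names.
toy = pd.DataFrame({"Lead Origin": ["API"], "Do Not Email": ["No"], "TotalVisits": [5.0]})
renamed = normalize_column_names(toy)

print(list(renamed.columns))  # ['lead_origin', 'do_not_email', 'totalvisits']
print(list(toy.columns))      # the original frame keeps its column names
```

Returning a new frame makes re-running notebook cells out of order less error-prone, because the original column names are still available on the source frame.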

In [52]:
columns_head_transform(df)

| | lead_origin | lead_source | do_not_email | do_not_call | converted | totalvisits | total_time_spent_on_website | page_views_per_visit | last_activity | country | ... | receive_more_updates_about_our_courses | tags | lead_quality | update_me_on_supply_chain_content | get_updates_on_dm_content | asymmetrique_activity_score | asymmetrique_profile_score | i_agree_to_pay_the_amount_through_cheque | a_free_copy_of_mastering_the_interview | last_notable_activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | api | olark_chat | no | no | 0 | 0.0 | 0 | 0.00 | page_visited_on_website | | ... | no | interested_in_other_courses | low_in_relevance | no | no | 15.0 | 15.0 | no | no | modified |
| 1 | api | organic_search | no | no | 0 | 5.0 | 674 | 2.50 | email_opened | india | ... | no | ringing | | no | no | 15.0 | 15.0 | no | no | email_opened |
| 2 | landing_page_submission | direct_traffic | no | no | 1 | 2.0 | 1532 | 2.00 | email_opened | india | ... | no | will_revert_after_reading_the_email | might_be | no | no | 14.0 | 20.0 | no | yes | email_opened |
| 3 | landing_page_submission | direct_traffic | no | no | 0 | 1.0 | 305 | 1.00 | unreachable | india | ... | no | ringing | not_sure | no | no | 13.0 | 17.0 | no | no | modified |
| 4 | landing_page_submission | google | no | no | 1 | 2.0 | 1428 | 1.00 | converted_to_lead | india | ... | no | will_revert_after_reading_the_email | might_be | no | no | 15.0 | 18.0 | no | no | modified |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9235 | landing_page_submission | direct_traffic | yes | no | 1 | 8.0 | 1845 | 2.67 | email_marked_spam | saudi_arabia | ... | no | will_revert_after_reading_the_email | high_in_relevance | no | no | 15.0 | 17.0 | no | no | email_marked_spam |
| 9236 | landing_page_submission | direct_traffic | no | no | 0 | 2.0 | 238 | 2.00 | sms_sent | india | ... | no | wrong_number_given | might_be | no | no | 14.0 | 19.0 | no | yes | sms_sent |
| 9237 | landing_page_submission | direct_traffic | yes | no | 0 | 2.0 | 199 | 2.00 | sms_sent | india | ... | no | invalid_number | not_sure | no | no | 13.0 | 20.0 | no | yes | sms_sent |
| 9238 | landing_page_submission | google | no | no | 1 | 3.0 | 499 | 3.00 | sms_sent | india | ... | no | | | no | no | 15.0 | 16.0 | no | no | sms_sent |


In [44]:
df

| | lead origin | lead source | do not email | do not call | converted | totalvisits | total time spent on website | page views per visit | last activity | country | ... | receive more updates about our courses | tags | lead quality | update me on supply chain content | get updates on dm content | asymmetrique activity score | asymmetrique profile score | i agree to pay the amount through cheque | a free copy of mastering the interview | last notable activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | api | olark_chat | no | no | 0 | 0.0 | 0 | 0.00 | page_visited_on_website | | ... | no | interested_in_other_courses | low_in_relevance | no | no | 15.0 | 15.0 | no | no | modified |
| 1 | api | organic_search | no | no | 0 | 5.0 | 674 | 2.50 | email_opened | india | ... | no | ringing | | no | no | 15.0 | 15.0 | no | no | email_opened |
| 2 | landing_page_submission | direct_traffic | no | no | 1 | 2.0 | 1532 | 2.00 | email_opened | india | ... | no | will_revert_after_reading_the_email | might_be | no | no | 14.0 | 20.0 | no | yes | email_opened |
| 3 | landing_page_submission | direct_traffic | no | no | 0 | 1.0 | 305 | 1.00 | unreachable | india | ... | no | ringing | not_sure | no | no | 13.0 | 17.0 | no | no | modified |
| 4 | landing_page_submission | google | no | no | 1 | 2.0 | 1428 | 1.00 | converted_to_lead | india | ... | no | will_revert_after_reading_the_email | might_be | no | no | 15.0 | 18.0 | no | no | modified |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9235 | landing_page_submission | direct_traffic | yes | no | 1 | 8.0 | 1845 | 2.67 | email_marked_spam | saudi_arabia | ... | no | will_revert_after_reading_the_email | high_in_relevance | no | no | 15.0 | 17.0 | no | no | email_marked_spam |
| 9236 | landing_page_submission | direct_traffic | no | no | 0 | 2.0 | 238 | 2.00 | sms_sent | india | ... | no | wrong_number_given | might_be | no | no | 14.0 | 19.0 | no | yes | sms_sent |
| 9237 | landing_page_submission | direct_traffic | yes | no | 0 | 2.0 | 199 | 2.00 | sms_sent | india | ... | no | invalid_number | not_sure | no | no | 13.0 | 20.0 | no | yes | sms_sent |
| 9238 | landing_page_submission | google | no | no | 1 | 3.0 | 499 | 3.00 | sms_sent | india | ... | no | | | no | no | 15.0 | 16.0 | no | no | sms_sent |


In [20]:
df.columns.to_list()

['Prospect ID',
 'Lead Number',
 'Lead Origin',
 'Lead Source',
 'Do Not Email',
 'Do Not Call',
 'Converted',
 'TotalVisits',
 'Total Time Spent on Website',
 'Page Views Per Visit',
 'Last Activity',
 'Country',
 'Specialization',
 'How did you hear about X Education',
 'What is your current occupation',
 'What matters most to you in choosing a course',
 'Search',
 'Magazine',
 'Newspaper Article',
 'X Education Forums',
 'Newspaper',
 'Digital Advertisement',
 'Through Recommendations',
 'Receive More Updates About Our Courses',
 'Tags',
 'Lead Quality',
 'Update me on Supply Chain Content',
 'Get updates on DM Content',
 'Lead Profile',
 'City',
 'Asymmetrique Activity Index',
 'Asymmetrique Profile Index',
 'Asymmetrique Activity Score',
 'Asymmetrique Profile Score',
 'I agree to pay the amount through cheque',
 'A free copy of Mastering The Interview',
 'Last Notable Activity']

In [26]:
# percentage of missing values in each column
df.isnull().sum() / df.shape[0] * 100

Lead Origin                                       0.000000
Lead Source                                       0.389610
Do Not Email                                      0.000000
Do Not Call                                       0.000000
Converted                                         0.000000
TotalVisits                                       1.482684
Total Time Spent on Website                       0.000000
Page Views Per Visit                              1.482684
Last Activity                                     1.114719
Country                                          26.634199
Specialization                                   15.562771
How did you hear about X Education               23.885281
What is your current occupation                  29.112554
What matters most to you in choosing a course    29.318182
Search                                            0.000000
Magazine                                          0.000000
Newspaper Article                                 0.0000

**Missing Value Handling Strategy**


A column is dropped when more than 30–35% of its values are missing and it:
- is not crucial for predicting the target,
- does not correlate strongly with the target,
- has too many unique values (a high-cardinality categorical feature).

This notebook applies 30% as the cutoff.
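
The criteria above can be gathered into one summary table before deciding what to drop. The sketch below is one possible way to do that, not the notebook's own method: the helper `missing_value_report`, the toy data, and the use of Pearson correlation for numeric columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_value_report(df: pd.DataFrame, target: str, threshold: float = 30.0) -> pd.DataFrame:
    """Summarize columns whose share of missing values exceeds `threshold` percent."""
    missing_pct = df.isnull().mean() * 100
    flagged = missing_pct[missing_pct > threshold].index
    rows = []
    for col in flagged:
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        rows.append({
            "column": col,
            "missing_pct": round(missing_pct[col], 2),
            "n_unique": df[col].nunique(),  # cardinality (NaN excluded)
            # Pearson correlation with the target, numeric columns only.
            "corr_with_target": df[col].corr(df[target]) if is_numeric else np.nan,
        })
    return pd.DataFrame(rows, columns=["column", "missing_pct", "n_unique", "corr_with_target"])

# Toy data (hypothetical values, not taken from the Leads dataset).
toy = pd.DataFrame({
    "converted": [1, 0, 1, 0, 1, 0],
    "score": [15.0, np.nan, 18.0, np.nan, np.nan, 12.0],   # 50% missing
    "tags": ["ringing", None, None, "busy", None, None],   # about 66.7% missing
    "visits": [5, 2, 3, 1, 8, 2],                          # complete
})
report = missing_value_report(toy, target="converted")
print(report)
```

Whether a flagged column is "crucial" for the prediction remains a judgment call that a table like this can inform but not make.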

In [30]:
# Get columns with more than 30% missing values
cols_with_missing = df.columns[df.isnull().sum() / df.shape[0] * 100 > 30]
print("Columns with more than 30% missing values:")
for col in cols_with_missing:
    print(f"{col}: {df[col].isnull().sum() / df.shape[0] * 100:.2f}%")

Columns with more than 30% missing values:
Tags: 36.29%
Lead Quality: 51.59%
Asymmetrique Activity Score: 45.65%
Asymmetrique Profile Score: 45.65%


In [None]:
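# Drop the columns flagged above (>30% missing); errors='ignore' skips names already removed
df = df.drop(columns=['Tags', 'Lead Quality', 'Asymmetrique Activity Score',
                      'Asymmetrique Profile Score', 'Asymmetrique Activity Index',
                      'Asymmetrique Profile Index'],
             errors='ignore')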