# Problem Description

## Background

X Education, a company that sells online courses to industry professionals, gets its initial leads from online marketing on various websites. Leads are people who might be interested in joining its courses.

The company wishes to identify the most promising leads, i.e. Hot Leads, who are the most likely to be converted and join its courses.

With this information, the sales team can focus more on communicating with the hot leads rather than making calls to everyone (from the pool of initial leads).

The current lead conversion rate is quite poor, at around 30%.

The CEO has set a target of raising the lead conversion rate to 80%.

For this, a model needs to be built that assigns a lead score to each lead. A higher lead score means a higher chance of conversion, and a lower lead score means a lower chance of conversion.


## Data

The given 'leads' dataset contains 9,240 data points with feature variables such as Lead Source, Total Time Spent on Website, TotalVisits, Last Activity, etc.

The target attribute is 'Converted', which takes the value 1 or 0 with the following meaning:
- 1: Lead was converted
- 0: Lead wasn't converted


## Goal

Build a logistic regression model that assigns a lead score between 0 and 100 to each lead.
- A higher lead score indicates a hot lead and a lower score indicates a cold lead.
- The 'Converted' column will finally have a value of 1 or 0 for each lead.
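
As a rough illustration of this goal, a logistic model produces a conversion probability that can be scaled to a 0 to 100 lead score and then thresholded into a 1/0 prediction. The sketch below is only an assumption about how that could look: the coefficients, the two features, the helper `lead_score`, and the cutoff of 50 are all made up for illustration and are not taken from the dataset or from any fitted model.

```python
import math


def lead_score(features, weights, intercept):
    """Map a linear combination of features to a 0-100 score via the logistic function."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return round(probability * 100)


# Hypothetical coefficients for two made-up, pre-scaled features
weights = [1.2, 0.4]
intercept = -0.5

hot = lead_score([2.0, 1.0], weights, intercept)    # strongly positive features
cold = lead_score([-1.5, 0.0], weights, intercept)  # negative first feature

# Hypothetical cutoff: scores at or above it are predicted as converted (1)
CUTOFF = 50
predicted_converted = [int(score >= CUTOFF) for score in (hot, cold)]
```

In practice the cutoff would be chosen from the data rather than fixed at 50.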


# Read and Understand the dataset

In [1]:
# import required libraries
# import numpy as np
import pandas as pd

In [2]:
# read data
leads = pd.read_csv('Leads.csv')
leads.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,...,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,...,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,...,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


In [3]:
# check number of records and features in the data
leads.shape

(9240, 37)

In [4]:
# check non-null counts and data types of the dataset features
leads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

In [5]:
# A significant number of features have many null values

# Prospect ID and Lead Number are both unique identifiers assigned to each lead
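
The uniqueness claim above can be checked directly with `Series.is_unique`. The snippet below is a minimal sketch on a toy frame with made-up values (the name `toy` is hypothetical); the same check would apply to `leads`.

```python
import pandas as pd

# Toy stand-in for the leads data; the values are made up for illustration
toy = pd.DataFrame({
    "Prospect ID": ["a1", "b2", "c3"],
    "Lead Number": [660737, 660728, 660727],
    "Converted": [0, 0, 1],
})

# An identifier column should have no repeated values
id_columns = ["Prospect ID", "Lead Number"]
uniqueness = {col: bool(toy[col].is_unique) for col in id_columns}
```

A column that passes this check identifies rows rather than describing them, so it carries no predictive signal on its own.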

In [6]:
# look at the conversion rate
leads.Converted.value_counts()

Converted
0    5679
1    3561
Name: count, dtype: int64

In [7]:
# check the percentage conversion rate (mean of a 0/1 column is the share of 1s)
round(leads['Converted'].mean() * 100, 2)

38.54

In [8]:
# About 38.5% of the leads in the given data converted, i.e. just under 40%

In [9]:
# null values stored as NaN in the data
nan_values = leads.isnull().sum()
nan_values.sort_values()

Prospect ID                                         0
I agree to pay the amount through cheque            0
Get updates on DM Content                           0
Update me on Supply Chain Content                   0
Receive More Updates About Our Courses              0
Through Recommendations                             0
Digital Advertisement                               0
Newspaper                                           0
X Education Forums                                  0
A free copy of Mastering The Interview              0
Magazine                                            0
Search                                              0
Newspaper Article                                   0
Last Notable Activity                               0
Lead Number                                         0
Lead Origin                                         0
Total Time Spent on Website                         0
Converted                                           0
Do Not Call                 

In [10]:
# missing values recorded as the placeholder 'Select' in the data
select_values = (leads == 'Select').sum()
select_values.sort_values()

Prospect ID                                         0
X Education Forums                                  0
Newspaper                                           0
Digital Advertisement                               0
Through Recommendations                             0
Receive More Updates About Our Courses              0
Tags                                                0
A free copy of Mastering The Interview              0
Lead Quality                                        0
Get updates on DM Content                           0
Asymmetrique Activity Index                         0
Asymmetrique Profile Index                          0
Asymmetrique Activity Score                         0
Asymmetrique Profile Score                          0
I agree to pay the amount through cheque            0
Update me on Supply Chain Content                   0
Magazine                                            0
Newspaper Article                                   0
What matters most to you in 
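
An alternative to counting NaN and 'Select' separately is to convert the placeholder to NaN first, so that a single `isnull()` call covers both. The sketch below shows this on a toy frame with made-up values (the name `toy` is hypothetical) and confirms that the two approaches agree.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the leads data; the values are made up for illustration
toy = pd.DataFrame({
    "City": ["Select", "Mumbai", None, "Select"],
    "Lead Source": ["Google", None, "Olark Chat", "Google"],
})

# Separate counts, as in the cells above
nan_counts = toy.isnull().sum()
select_counts = (toy == "Select").sum()

# Single pass: turn the placeholder into NaN, then count nulls once
combined = toy.replace("Select", np.nan).isnull().sum()
```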

In [11]:
# overall null values (NaN plus 'Select') per column
total_null_values = nan_values + select_values
total_null_values.sort_values(ascending=False)

How did you hear about X Education               7250
Lead Profile                                     6855
Lead Quality                                     4767
Asymmetrique Profile Score                       4218
Asymmetrique Activity Score                      4218
Asymmetrique Activity Index                      4218
Asymmetrique Profile Index                       4218
City                                             3669
Specialization                                   3380
Tags                                             3353
What matters most to you in choosing a course    2709
What is your current occupation                  2690
Country                                          2461
Page Views Per Visit                              137
TotalVisits                                       137
Last Activity                                     103
Lead Source                                        36
Receive More Updates About Our Courses              0
I agree to pay the amount th

In [12]:
# percentage of overall null values per column
total_perc_null_values = round((total_null_values / len(leads)) * 100, 2)
total_perc_null_values.sort_values(ascending=False)

How did you hear about X Education               78.46
Lead Profile                                     74.19
Lead Quality                                     51.59
Asymmetrique Profile Score                       45.65
Asymmetrique Activity Score                      45.65
Asymmetrique Activity Index                      45.65
Asymmetrique Profile Index                       45.65
City                                             39.71
Specialization                                   36.58
Tags                                             36.29
What matters most to you in choosing a course    29.32
What is your current occupation                  29.11
Country                                          26.63
TotalVisits                                       1.48
Page Views Per Visit                              1.48
Last Activity                                     1.11
Lead Source                                       0.39
Lead Origin                                       0.00
X Educatio

In [13]:
# The following columns have more than 30% null values
# How did you hear about X Education               78.46
# Lead Profile                                     74.19
# Lead Quality                                     51.59
# Asymmetrique Profile Score                       45.65
# Asymmetrique Activity Score                      45.65
# Asymmetrique Activity Index                      45.65
# Asymmetrique Profile Index                       45.65
# City                                             39.71
# Specialization                                   36.58
# Tags                                             36.29
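
Rather than copying the list by hand, the columns above the threshold can be selected with a boolean mask. The sketch below assumes a Series of percentages like `total_perc_null_values`; here it is rebuilt from a few of the values shown above, and the names `perc_null`, `THRESHOLD`, and `high_null_cols` are hypothetical.

```python
import pandas as pd

# A few percentages copied from the output above
perc_null = pd.Series({
    "How did you hear about X Education": 78.46,
    "Lead Profile": 74.19,
    "Tags": 36.29,
    "What matters most to you in choosing a course": 29.32,
    "TotalVisits": 1.48,
})

THRESHOLD = 30  # percent of null values

# Keep only the columns whose null percentage exceeds the threshold
high_null_cols = perc_null[perc_null > THRESHOLD].index.tolist()
```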