<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

## Logistic Regression Case Study on Lead Scoring - DS68 Batch

### Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines such as Google. Once these people land on the website, they might browse the courses, fill in a form for a course, or watch some videos. When they fill in a form providing their email address or phone number, they are classified as leads. The company also gets leads through past referrals.

Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Although X Education generates a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, **the company wishes to identify the most promising leads**, also known as ‘Hot Leads’. If this set of leads is identified successfully, the lead conversion rate should increase, because the sales team will focus on communicating with the promising leads rather than calling everyone.

A typical lead conversion process can be represented using a funnel:

![Lead Conversion Process](https://cdn.upgrad.com/UpGrad/temp/189f213d-fade-4fe4-b506-865f1840a25a/XNote_201901081613670.jpg)

X Education has appointed us to help them select the most promising leads, i.e., those that are most likely to convert into paying customers. The company requires us to build a model that assigns a lead score to each lead, such that leads with higher scores have a higher chance of conversion and leads with lower scores have a lower chance of conversion. **The CEO has given a target lead conversion rate of around 80%.**
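To make the target concrete, the sketch below shows how the conversion rate changes when only leads above a score cutoff are contacted. The frame, the column names `lead_score` and `converted`, the helper `conversion_rate`, and the cutoff of 80 are all hypothetical and chosen for illustration; none of them come from the dataset.

```python
import pandas as pd

# Hypothetical scored leads; scores and outcomes are made up for illustration.
leads = pd.DataFrame({
    "lead_score": [95, 90, 85, 80, 60, 40, 30, 20, 10, 5],
    "converted":  [1,  1,  1,  0,  1,  0,  0,  0,  0,  0],
})

def conversion_rate(frame, cutoff=0):
    """Share of converted leads among those scoring at or above the cutoff."""
    contacted = frame[frame["lead_score"] >= cutoff]
    return contacted["converted"].mean()

overall = conversion_rate(leads)        # every lead is contacted
hot_only = conversion_rate(leads, 80)   # only leads scoring 80 or more

print(f"Contacting everyone:  {overall:.0%}")
print(f"Contacting hot leads: {hot_only:.0%}")
```

In this toy example, restricting calls to the high-scoring leads raises the conversion rate among contacted leads, at the cost of missing the one converting lead that scored below the cutoff.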

### Data

We have been provided with a leads dataset from the past containing approximately 9,000 data points. This dataset consists of various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc., which may or may not be useful in ultimately deciding whether a lead will be converted or not. The target variable is the column ‘Converted’, indicating whether a past lead was converted (1) or not (0).

It is important for us to check for levels present in categorical variables. Many categorical variables have a level called 'Select', which needs to be handled as it is equivalent to a null value.
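A minimal sketch of one way to handle this, assuming the placeholder is the exact string 'Select': replace it with `NaN` so that it is counted together with the other missing values. The small frame below is hypothetical and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Specialization": ["Select", "Business Administration", "Select"],
    "City": ["Mumbai", "Select", None],
})

# Treat the 'Select' placeholder as a missing value.
toy_clean = toy.replace("Select", np.nan)

# Count missing values per column after the replacement.
missing_counts = toy_clean.isnull().sum()
print(missing_counts)
```

After the replacement, the usual missing-value checks such as `isnull().sum()` include the former 'Select' entries.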

### Goals and Objectives

There are several goals for this case study:

- **Build a logistic regression model** to assign a lead score between 0 and 100 to each of the leads. A higher score would mean that the lead is hot (most likely to convert), whereas a lower score would mean that the lead is cold (unlikely to convert).

- Address the additional problems posed by the company, so that our model can adapt if requirements change in the future. These problems are provided in a separate document, and our answers to them will be included in the final presentation together with our recommendations.
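As a rough illustration of the first goal, a lead score can be obtained by scaling the predicted conversion probability of a logistic regression model to the range 0 to 100. The sketch below uses synthetic data with a single made-up feature; it is an assumption about how the score could be derived, not the final model of this report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only: one feature, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of conversion (class 1), scaled to a 0-100 lead score.
prob = model.predict_proba(X)[:, 1]
lead_score = np.round(prob * 100).astype(int)

print(lead_score[:5])
```

A lead with a score close to 100 would be treated as hot, and one with a score close to 0 as cold.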

*___Throughout this report, <font color="red">RED</font> text signifies the <font color="red">key outcomes</font> and <font color="red">understandings</font>.___*

# Step 1: Importing Essential Libraries and Modules

In [10]:
# Import and Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Import essential libraries for data analysis, visualization, and statistical modeling
import numpy as np, pandas as pd
import matplotlib as mpl, matplotlib.pyplot as plt, seaborn as sns
import scipy as sp, scipy.stats as ss
import tabulate
# Set pandas display options to show more rows and columns
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_colwidth", 1500)
pd.set_option("display.max_columns", None)
# Configure Matplotlib for inline plotting in Jupyter Notebook
%matplotlib inline
# Set Seaborn theme and enable color codes for plotting
sns.set_theme()

# Libraries for statistical modeling and machine learning
import statsmodels.api as sm  # For statistical modeling
import sklearn  # For machine learning
from statsmodels.stats.outliers_influence import variance_inflation_factor  # For calculating VIF

# Libraries for data preprocessing and feature engineering
from sklearn.preprocessing import LabelEncoder  # For encoding categorical variables
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets

# Libraries for model selection and evaluation
from sklearn.linear_model import LogisticRegression  # For logistic regression modeling
from sklearn.feature_selection import RFE  # For recursive feature elimination
from sklearn.model_selection import StratifiedKFold # For stratified k-fold cross-validation
from sklearn.model_selection import GridSearchCV # For hyperparameter tuning using grid search
from sklearn import metrics  # For model evaluation metrics
from sklearn.metrics import precision_recall_curve  # For precision-recall curve analysis

# Libraries for dimensionality reduction
from sklearn.decomposition import PCA  # For principal component analysis
from sklearn.decomposition import IncrementalPCA  # For incremental PCA

# Other libraries
from scipy.stats import skew, kurtosis  # For calculating skewness and kurtosis

In [11]:
# Display the versions of Libraries in this report.
print('Numpy version:', np.__version__)
print('Pandas version:', pd.__version__)
print('Matplotlib version:', mpl.__version__)
print('Seaborn version:', sns.__version__)  
print('Scipy version:', sp.__version__)
print('Tabulate version:', tabulate.__version__)
print('statsmodels.api version:' , sm.__version__)
print('Scikit-learn version:' , sklearn.__version__)

Numpy version: 1.26.4
Pandas version: 2.1.4
Matplotlib version: 3.9.2
Seaborn version: 0.13.1
Scipy version: 1.13.1
Tabulate version: 0.9.0
statsmodels.api version: 0.14.2
Scikit-learn version: 1.5.1


# Step 2: Reading, Understanding and Preparing Data

## 1. Read in and Inspect the Dataset

In [14]:
# Load the data with pandas.
df = pd.read_csv('Leads.csv', encoding='utf-8')
df.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Country,Specialization,How did you hear about X Education,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Tags,Lead Quality,Update me on Supply Chain Content,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,Page Visited on Website,,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Interested in other courses,Low in Relevance,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,Email Opened,India,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,Email Opened,India,Business Administration,Select,Student,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,Unreachable,India,Media and Advertising,Word Of Mouth,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,Not Sure,No,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,Converted to Lead,India,Select,Other,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


In [15]:
# Print the shape of df
print(df.shape)

(9240, 37)


In [16]:
# Print the info of df
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

In [17]:
# Print the missing values of columns of df
df.isnull().sum()

Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                        36
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                       137
Total Time Spent on Website                         0
Page Views Per Visit                              137
Last Activity                                     103
Country                                          2461
Specialization                                   1438
How did you hear about X Education               2207
What is your current occupation                  2690
What matters most to you in choosing a course    2709
Search                                              0
Magazine                                            0
Newspaper Article           

In [18]:
# Reprint the missing values of columns of df sorted in ascending order
df.isnull().sum().sort_values()

Prospect ID                                         0
I agree to pay the amount through cheque            0
Get updates on DM Content                           0
Update me on Supply Chain Content                   0
Receive More Updates About Our Courses              0
Through Recommendations                             0
Digital Advertisement                               0
Newspaper                                           0
X Education Forums                                  0
A free copy of Mastering The Interview              0
Magazine                                            0
Search                                              0
Newspaper Article                                   0
Last Notable Activity                               0
Lead Number                                         0
Lead Origin                                         0
Total Time Spent on Website                         0
Converted                                           0
Do Not Call                 
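Raw counts are easier to compare when expressed as a share of all rows. The sketch below computes the percentage of missing values per column on a small hypothetical frame standing in for `df`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "Lead Source": ["Google", None, "Olark Chat", "Google"],
    "TotalVisits": [0.0, 5.0, np.nan, np.nan],
    "Converted": [0, 0, 1, 1],
})

# Percentage of missing values per column, largest first.
missing_pct = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

The same expression applied to `df` would rank the columns by their share of missing values.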

In [19]:
# Summary statistics for the numerical variables of df
df.describe()

| | Lead Number | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | Asymmetrique Activity Score | Asymmetrique Profile Score |
|---|---|---|---|---|---|---|---|
| count | 9240.0 | 9240.0 | 9103.0 | 9240.0 | 9103.0 | 5022.0 | 5022.0 |
| mean | 617188.435606 | 0.38539 | 3.445238 | 487.698268 | 2.36282 | 14.306252 | 16.344883 |
| std | 23405.995698 | 0.486714 | 4.854853 | 548.021466 | 2.161418 | 1.386694 | 1.811395 |
| min | 579533.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 11.0 |
| 25% | 596484.5 | 0.0 | 1.0 | 12.0 | 1.0 | 14.0 | 15.0 |
| 50% | 615479.0 | 0.0 | 3.0 | 248.0 | 2.0 | 14.0 | 16.0 |
| 75% | 637387.25 | 1.0 | 5.0 | 936.0 | 3.0 | 15.0 | 18.0 |
| max | 660737.0 | 1.0 | 251.0 | 2272.0 | 55.0 | 18.0 | 20.0 |
