<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

## Logistic Regression Case Study on Lead Scoring - DS68 Batch

### Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on its website and browse for courses.

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as leads. The company also gets leads through past referrals.

Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Although X Education generates a lot of leads, its lead conversion rate is poor: of 100 leads acquired in a day, only about 30 are converted. To make this process more efficient, **the company wishes to identify the most promising leads**, also known as ‘Hot Leads’. If this set of leads is identified successfully, the lead conversion rate should increase, because the sales team can focus on communicating with likely converters rather than calling everyone.

A typical lead conversion process can be represented using a funnel:

![Lead Conversion Process](https://cdn.upgrad.com/UpGrad/temp/189f213d-fade-4fe4-b506-865f1840a25a/XNote_201901081613670.jpg)

X Education has appointed us to help them select the most promising leads, i.e., those that are most likely to convert into paying customers. The company requires us to build a model that assigns a lead score to each of the leads such that customers with higher lead scores have a higher conversion chance and customers with lower lead scores have a lower conversion chance. **The CEO has given a target lead conversion rate of around 80%.**

### Data

We have been provided with a leads dataset from the past containing approximately 9,000 data points. This dataset consists of various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc., which may or may not be useful in ultimately deciding whether a lead will be converted or not. The target variable is the column ‘Converted’, indicating whether a past lead was converted (1) or not (0).

It is important for us to check for levels present in categorical variables. Many categorical variables have a level called 'Select', which needs to be handled as it is equivalent to a null value.
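As a minimal sketch of that handling, the snippet below replaces the 'Select' level with `NaN` so it is counted with the other missing values. The DataFrame here is a hypothetical toy example: the column names echo attributes mentioned above, but the values are made up and are not taken from the actual leads dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration only)
leads = pd.DataFrame({
    "Lead Source": ["Google", "Select", "Referral", "Google"],
    "Last Activity": ["Email Opened", "Select", "Select", "SMS Sent"],
    "Converted": [1, 0, 0, 1],
})

# 'Select' means the user never picked an option, so treat it as missing
leads_clean = leads.replace("Select", np.nan)

# Percentage of missing values per column after the replacement
missing_pct = leads_clean.isna().mean().mul(100).round(1)
print(missing_pct)
```

Checking the missing-value percentage after the replacement gives a clearer picture of which columns are sparse enough to consider dropping or imputing.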

### Goals and Objectives

There are several goals for this case study:

- **Build a logistic regression model** to assign a lead score between 0 and 100 to each of the leads. A higher score would mean that the lead is hot (most likely to convert), whereas a lower score would mean that the lead is cold (unlikely to convert).

- Address additional problems presented by the company, which our model should be able to adapt to if requirements change in the future. These problems are provided in a separate document. We will address them in our final presentation, where we will also make recommendations.
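To illustrate the first goal, here is a minimal sketch of how a 0 to 100 lead score could be derived from a logistic regression output. It assumes the score is simply the predicted conversion probability multiplied by 100; the function name `lead_score` and the log-odds inputs are hypothetical and only serve the example.

```python
import math

def lead_score(log_odds: float) -> int:
    """Map a model's log-odds output to a 0-100 score via the sigmoid."""
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return round(probability * 100)

# Hypothetical log-odds values for three leads (not real model output)
for z in (-2.0, 0.0, 2.0):
    print(f"log-odds {z:+.1f} -> lead score {lead_score(z)}")
```

Because the sigmoid is monotonic, a higher log-odds value always yields a higher score, so ranking leads by score is equivalent to ranking them by predicted conversion probability.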

*___Throughout this report, <font color="red">RED</font> text signifies the <font color="red">key outcomes</font> and <font color="red">understandings</font>.___*

### Import Important Libraries

```python
# Import important libraries
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation and analysis

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter("ignore", FutureWarning)

# Set pandas display options to show more rows and columns
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_colwidth", 1500)
pd.set_option("display.max_columns", None)

# Libraries for visualization
import matplotlib.pyplot as plt  # For creating plots and charts
import seaborn as sns  # For creating statistical graphics

# Libraries for data modelling
import sklearn  # A general machine learning library
from sklearn.preprocessing import LabelEncoder  # For encoding categorical variables
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets

# Libraries for statsmodels
import statsmodels.api as sm  # For statistical modeling
from statsmodels.stats.outliers_influence import variance_inflation_factor  # For calculating VIF

# Libraries for sklearn
from sklearn.linear_model import LogisticRegression  # For logistic regression modeling
from sklearn.feature_selection import RFE  # For recursive feature elimination
from sklearn import metrics  # For model evaluation metrics
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn.metrics import precision_recall_curve  # For precision-recall curve analysis

# Libraries for PCA
from sklearn.decomposition import PCA  # For principal component analysis
from sklearn.decomposition import IncrementalPCA  # For incremental PCA
```