<a href="https://colab.research.google.com/github/Ashika-Ashok/-Real-Time-AI-Sales-Call-Assistant-for-Enhanced-Conversation-Strategies/blob/main/Lead_Scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lead Scoring

## Logistic Regression

## Problem Statement
An education company named __X Education__ sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.



The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. <br>

__When these people fill out a form providing their email address or phone number, they are classified as leads. The company also gets leads through past referrals.__<br>

Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. __The typical lead conversion rate at X Education is around 30%.__


Although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as __‘Hot Leads’__. <br>

If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone. A typical lead conversion process can be represented using the following funnel:
![image.jpg](attachment:image.jpg)



__Lead Conversion Process__ - Demonstrated as a funnel
As you can see, there are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom.<br>

In the middle stage, you need to nurture the potential leads well (i.e. educate the leads about the product, communicate with them constantly, etc.) in order to get a higher lead conversion.

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. <br>
The company requires you to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance of converting.

__The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.__




### Data

You have been provided with a leads dataset from the past with around 9000 data points. This dataset consists of various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc. which may or may not be useful in ultimately deciding whether a lead will be converted or not. The target variable, in this case, is the column ‘Converted’ which tells whether a past lead was converted or not wherein 1 means it was converted and 0 means it wasn’t converted.

You also need to check the levels present in the categorical variables.<br>

__Many of the categorical variables have a level called 'Select' which needs to be handled because it is as good as a null value.__



### Goal


1. Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads which can be used by the company to target potential leads. A higher score would mean that the lead is hot, i.e. is most likely to convert whereas a lower score would mean that the lead is cold and will mostly not get converted.
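As a rough illustration of the goal, the sketch below shows one way a logistic regression output could be mapped to a 0 to 100 lead score: the predicted conversion probability is multiplied by 100 and rounded. The `lead_score` helper, the coefficients, the intercept and the feature values are all hypothetical placeholders, not results from this dataset.

```python
import numpy as np

def lead_score(features, coefficients, intercept):
    """Map a linear logit to a 0-100 score via the logistic (sigmoid) function."""
    logit = np.dot(features, coefficients) + intercept
    probability = 1.0 / (1.0 + np.exp(-logit))
    return np.round(100 * probability).astype(int)

# Hypothetical coefficients and two hypothetical leads (illustrative values only)
coefficients = np.array([1.2, -0.8])
intercept = -0.5
leads = np.array([[2.0, 0.5],   # lead with a high logit
                  [0.1, 1.5]])  # lead with a low logit

scores = lead_score(leads, coefficients, intercept)
print(scores)
```

A higher score corresponds to a higher predicted conversion probability, so sorting leads by score gives the sales team a priority order.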

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Visualisation
from matplotlib.pyplot import xticks
%matplotlib inline

# Data display customization
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

## Data Preparation

### Data Loading

In [None]:
# pd.read_csv already returns a DataFrame, so no extra wrapping is needed
data = pd.read_csv('../input/Leads.csv')
data.head(5)

In [None]:
#checking duplicates
sum(data.duplicated(subset = 'Prospect ID')) == 0
# No duplicate values
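The uniqueness check above can be illustrated on a small made-up frame. This is only a sketch: the `toy` data is invented, and `is_unique` is shown as an alternative way to ask the same question.

```python
import pandas as pd

# Toy frame with a made-up ID column; 'b' appears twice
toy = pd.DataFrame({"Prospect ID": ["a", "b", "b", "c"], "Converted": [0, 1, 1, 0]})

# duplicated() marks every repeat after the first occurrence as True
n_duplicates = toy.duplicated(subset="Prospect ID").sum()

# is_unique is True only when no value in the column repeats
has_unique_ids = toy["Prospect ID"].is_unique

print(n_duplicates, has_unique_ids)
```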

### Data Inspection

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

### Data Cleaning

In [None]:
# Many columns contain the value 'Select'.
# This happens when the customer did not pick any option from the list, so the default 'Select' is stored.
# 'Select' values are as good as NULL.

# Converting 'Select' values to NaN.
data = data.replace('Select', np.nan)
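A small sketch of what this replacement does, using an invented two-column frame. `DataFrame.replace` with a scalar matches whole cell values only, so a cell equal to 'Select' becomes NaN while other values are left untouched.

```python
import numpy as np
import pandas as pd

# Made-up data: 'Select' stands in for "no option chosen"
toy = pd.DataFrame({
    "City": ["Mumbai", "Select", "Thane"],
    "Specialization": ["Select", "Finance", "Select"],
})

# Exact-match replacement: only cells equal to 'Select' become NaN
cleaned = toy.replace("Select", np.nan)

# Missing values per column after the replacement
print(cleaned.isnull().sum())
```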

In [None]:
data.isnull().sum()

In [None]:
round(100*(data.isnull().sum()/len(data.index)), 2)

In [None]:
# We will drop the columns having more than 70% NA values.
# data = data.drop(data.loc[:,list(round(100*(data.isnull().sum()/len(data.index)), 2)>70)].columns, axis=1)
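The threshold-based drop described in the comment above could be written more compactly. The following is a minimal sketch on invented data, assuming a 70% cutoff; the column names and the `threshold` variable are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "partly_missing": [1.0, np.nan, 3.0, 4.0],        # 25% missing
    "complete": [1, 2, 3, 4],                         # 0% missing
})

threshold = 0.70  # drop columns with more than 70% missing values

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()
cols_to_drop = missing_share[missing_share > threshold].index

reduced = toy.drop(columns=cols_to_drop)
print(list(reduced.columns))
```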

In [None]:
# Now we will take care of null values in each column one by one.

In [None]:
# Lead Quality: Indicates the quality of the lead based on the data and the intuition of the employee who has been assigned to the lead

In [None]:
data['Lead Quality'].describe()

In [None]:
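# Pass the column as x explicitly; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=data['Lead Quality'])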

In [None]:
# Lead Quality is based on the intuition of the employee, so blanks can safely be imputed as 'Not Sure'.
data['Lead Quality'] = data['Lead Quality'].replace(np.nan, 'Not Sure')

In [None]:
sns.countplot(x=data['Lead Quality'])

In [None]:
# Asymmetrique Activity Index, Asymmetrique Profile Index,
# Asymmetrique Activity Score, Asymmetrique Profile Score:
# an index and score assigned to each customer based on their activity and their profile

In [None]:
fig, axs = plt.subplots(2,2, figsize = (10,7.5))
# Columns are passed as x explicitly; recent seaborn versions treat the first positional argument as `data`
plt1 = sns.countplot(x=data['Asymmetrique Activity Index'], ax = axs[0,0])
plt2 = sns.boxplot(x=data['Asymmetrique Activity Score'], ax = axs[0,1])
plt3 = sns.countplot(x=data['Asymmetrique Profile Index'], ax = axs[1,0])
plt4 = sns.boxplot(x=data['Asymmetrique Profile Score'], ax = axs[1,1])
plt.tight_layout()

In [None]:
# There is too much variation in these parameters, so it is not reliable to impute any value in them.
# 45% null values means we need to drop these columns.

In [None]:
# data = data.drop(['Asymmetrique Activity Index','Asymmetrique Activity Score','Asymmetrique Profile Index','Asymmetrique Profile Score'], axis=1)

In [None]:
round(100*(data.isnull().sum()/len(data.index)), 2)

In [None]:
# City

In [None]:
data.City.describe()

In [None]:
sns.countplot(x=data.City)
xticks(rotation = 90)

In [None]:
# Around 60% of the entries are 'Mumbai', so we can impute 'Mumbai' for the missing values.

In [None]:
# data['City'] = data['City'].replace(np.nan, 'Mumbai')
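Before imputing with the most frequent category, it can help to check how dominant that category actually is among the non-missing values. The sketch below does this on an invented `city` Series; the values and the resulting share are illustrative and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries
city = pd.Series(["Mumbai", "Mumbai", "Thane", np.nan, "Mumbai", np.nan], name="City")

# Share of each category among non-missing values (NaN is excluded by default),
# sorted from most to least frequent
shares = city.value_counts(normalize=True)
most_common = shares.index[0]

# Fill missing entries with the most frequent category
imputed = city.fillna(most_common)

print(most_common, round(shares.iloc[0], 2))
```

If the top category holds only a modest share, a separate placeholder category may distort the distribution less than mode imputation.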

In [None]:
# Specialization

In [None]:
data.Specialization.describe()

In [None]:
sns.countplot(x=data.Specialization)
xticks(rotation = 90)

In [None]:
# A lead may have left the specialization blank because their option is not available in the list,
# because they have no specialization, or because they are a student.
# Hence we can create a category "Others" for the missing values.

In [None]:
data['Specialization'] = data['Specialization'].replace(np.nan, 'Others')

In [None]:
round(100*(data.isnull().sum()/len(data.index)), 2)

In [None]:
# Tags

In [None]:
data.Tags.describe()

In [None]:
fig, axs = plt.subplots(figsize = (15,7.5))
sns.countplot(x=data.Tags)
xticks(rotation = 90)

In [None]:
# Blanks in the tag column may be imputed by 'Will revert after reading the email'.

In [None]:
data['Tags'] = data['Tags'].replace(np.nan, 'Will revert after reading the email')

In [None]:
# What matters most to you in choosing a course

In [None]:
data['What matters most to you in choosing a course'].describe()

In [None]:
# Blanks in this column may be imputed by 'Better Career Prospects'.

In [None]:
# data['What matters most to you in choosing a course'] = data['What matters most to you in choosing a course'].replace(np.nan, 'Better Career Prospects')

In [None]:
# Occupation

In [None]:
data['What is your current occupation'].describe()

In [None]:
# 86% of the entries are 'Unemployed', so we can impute 'Unemployed' for the missing values.

In [None]:
data['What is your current occupation'] = data['What is your current occupation'].replace(np.nan, 'Unemployed')

In [None]:
# Country

In [None]:
# Country is India for most values so let's impute the same in missing values.
data['Country'] = data['Country'].replace(np.nan, 'India')

In [None]:
round(100*(data.isnull().sum()/len(data.index)), 2)

In [None]:
data[["Prospect ID","Lead Number"]].nunique()

In [None]:
print(data.shape)

In [None]:
data["Last Notable Activity"].value_counts()

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.drop(["Prospect ID"],axis=1).to_csv("Marketing_Leads_India.csv.gz",index=False,compression="gzip")
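A quick way to sanity-check a gzip-compressed export is to read it back and compare it with the original frame. This sketch uses an invented two-row frame and a temporary directory; the file name `toy_leads.csv.gz` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Lead Number": [1, 2], "Converted": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_leads.csv.gz")
    toy.to_csv(path, index=False, compression="gzip")
    # On read, pandas infers gzip compression from the .gz extension
    restored = pd.read_csv(path)

# True when values, column names and dtypes all match
print(restored.equals(toy))
```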