# Problem Statement

An education company named X Education sells online courses to industry professionals. 

X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company wants to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.

### Goals of the case study.

- Build a logistic regression model that assigns each lead a score between 0 and 100, which the company can use to target potential leads. A higher score means the lead is hot, i.e. most likely to convert, whereas a lower score means the lead is cold and unlikely to convert.

- The company has presented some additional problems that your model should be able to adjust to if the company's requirements change in the future, so you will need to handle these as well. These problems are provided in a separate doc file. Please fill it in based on the logistic regression model you built in the first step. Also, make sure you include this in your final PPT, where you'll make recommendations.
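As a minimal sketch of the first goal, a lead score between 0 and 100 can be derived from a predicted conversion probability by scaling and rounding. The function name `lead_score` and the example probabilities are hypothetical; the sketch assumes the probabilities would come from a fitted classifier (for instance scikit-learn's `predict_proba`), which is not shown here.

```python
import numpy as np

def lead_score(probabilities):
    """Map conversion probabilities in [0, 1] to integer lead scores in [0, 100]."""
    p = np.asarray(probabilities, dtype=float)
    if np.any((p < 0) | (p > 1)):
        raise ValueError("probabilities must lie in [0, 1]")
    # Scale to 0-100 and round to the nearest integer
    return np.rint(p * 100).astype(int)

# Hypothetical probabilities, standing in for a model's predicted conversion chances
example_probs = [0.05, 0.42, 0.87, 1.0]
scores = lead_score(example_probs)
print(scores)
```

Under this mapping a lead with a predicted probability of 0.87 gets a score of 87, so ranking by score is the same as ranking by predicted probability.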

# Importing Libraries and Reading the Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

In [13]:
leads = pd.read_csv("Leads.csv")
leads.head()

| | Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | Last Activity | Country | Specialization | How did you hear about X Education | What is your current occupation | What matters most to you in choosing a course | Search | Magazine | Newspaper Article | X Education Forums | Newspaper | Digital Advertisement | Through Recommendations | Receive More Updates About Our Courses | Tags | Lead Quality | Update me on Supply Chain Content | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No | No | 0 | 0.0 | 0 | 0.0 | Page Visited on Website | NaN | Select | Select | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Interested in other courses | Low in Relevance | No | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No | No | 0 | 5.0 | 674 | 2.5 | Email Opened | India | Select | Select | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Ringing | NaN | No | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No | No | 1 | 2.0 | 1532 | 2.0 | Email Opened | India | Business Administration | Select | Student | Better Career Prospects | No | No | No | No | No | No | No | No | Will revert after reading the email | Might be | No | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 20.0 | No | Yes | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No | No | 0 | 1.0 | 305 | 1.0 | Unreachable | India | Media and Advertising | Word Of Mouth | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Ringing | Not Sure | No | No | Select | Mumbai | 02.Medium | 01.High | 13.0 | 17.0 | No | No | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | Google | No | No | 1 | 2.0 | 1428 | 1.0 | Converted to Lead | India | Select | Other | Unemployed | Better Career Prospects | No | No | No | No | No | No | No | No | Will revert after reading the email | Might be | No | No | Select | Mumbai | 02.Medium | 01.High | 15.0 | 18.0 | No | No | Modified |


In [14]:
leads.shape

(9240, 37)

In [15]:
leads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

# EDA

## Handling NULL Values

### Converting Select to NaN

In [16]:
leads = leads.replace("Select", np.nan)
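The value `Select` is presumably the placeholder left behind when a lead never picked an option from a dropdown, so it carries no more information than a missing value. The sketch below illustrates the effect of the replacement on a small toy frame; the frame `toy` and its values are hypothetical and are not taken from `Leads.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking two dropdown columns
toy = pd.DataFrame({
    "Specialization": ["Select", "Business Administration", "Select"],
    "City": ["Mumbai", "Select", None],
})

# Treat the "Select" placeholder as a missing value
cleaned = toy.replace("Select", np.nan)

# No "Select" cells remain, and they are now counted together with the original None
select_left = int((cleaned == "Select").sum().sum())
missing_per_column = cleaned.isna().sum().to_dict()
print(select_left, missing_per_column)
```

Doing this before computing null percentages matters: otherwise columns dominated by `Select` would look almost complete.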

In [17]:
# Percentage of null values per column

leads_null = leads.isna().mean().reset_index()
leads_null = leads_null.rename(columns = {"index": "Column Name", 0:"Missing Value"})
leads_null["Missing Value"] = leads_null["Missing Value"]*100
leads_null.sort_values(by = "Missing Value", ascending = False, inplace = True)

# The printed figure is the number of columns that contain at least one null value
print("Null Values in the dataset", len(leads_null[leads_null['Missing Value'] > 0]))
leads_null[leads_null['Missing Value'] > 0]

Null Values in the dataset 17


| | Column Name | Missing Value |
|---|---|---|
| 13 | How did you hear about X Education | 78.463203 |
| 28 | Lead Profile | 74.188312 |
| 25 | Lead Quality | 51.590909 |
| 33 | Asymmetrique Profile Score | 45.649351 |
| 32 | Asymmetrique Activity Score | 45.649351 |
| 30 | Asymmetrique Activity Index | 45.649351 |
| 31 | Asymmetrique Profile Index | 45.649351 |
| 29 | City | 39.707792 |
| 12 | Specialization | 36.580087 |
| 24 | Tags | 36.287879 |
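The same per-column missing percentage can also be computed as a single Series, which avoids the `reset_index` and `rename` steps. This is only an alternative sketch, not the notebook's method; the helper name `missing_percentage` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_percentage(df):
    """Return the percentage of missing values per column, highest first,
    keeping only columns that have at least one missing value."""
    pct = df.isna().mean().mul(100)
    return pct[pct > 0].sort_values(ascending=False)

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan],   # 75% missing
    "b": [1, 2, 3, 4],                  # complete, so excluded from the result
    "c": ["x", None, "y", "z"],         # 25% missing
})
result = missing_percentage(toy)
print(result)
```

The length of the returned Series is the number of columns with at least one missing value, which is the figure the cell above prints.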


In [18]:
print("Non-Null Columns in the dataset:", len(leads_null[leads_null["Missing Value"] == 0]))
print("Columns with NULL values in the dataset:", len(leads_null[leads_null["Missing Value"] > 0]))
print("Columns with more than 50% NULL values in the dataset:", len(leads_null[leads_null["Missing Value"] > 50]))
print("Total Columns in the dataset:", len(leads_null))

Non-Null Columns in the dataset: 20
Columns with NULL values in the dataset: 17
Columns with more than 50% NULL values in the dataset: 3
Total Columns in the dataset: 37


In [19]:
# Work on a copy so that the original `leads` frame stays intact as a backup
lead_df = leads.copy()

###### Create a list of the columns with more than 50% NULL values. We will drop these columns and exclude them from further analysis.

In [20]:
lead_null_greaterthan50 = leads_null[leads_null["Missing Value"] > 50]

lead_col_drop = list(lead_null_greaterthan50.iloc[:, 0])
lead_col_drop

['How did you hear about X Education', 'Lead Profile', 'Lead Quality']

In [21]:
lead_df.drop(lead_col_drop, axis = 1, inplace = True)
lead_df.shape

(9240, 34)
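The threshold-based drop can be wrapped in a small function that returns a new frame instead of modifying one in place. The sketch below is illustrative only: the name `drop_sparse_columns`, its `threshold` parameter, and the toy data are hypothetical, and it uses a strict "more than" comparison, so a column that is exactly 50% empty is kept.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=50.0):
    """Return (new_frame, dropped_names), removing columns whose
    missing percentage is strictly greater than threshold."""
    pct_missing = df.isna().mean() * 100
    to_drop = pct_missing[pct_missing > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop

toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # 75% missing, dropped
    "half_empty": [np.nan, np.nan, 1.0, 2.0],       # exactly 50% missing, kept
    "full": [1, 2, 3, 4],
})
trimmed, dropped = drop_sparse_columns(toy)
print(dropped, trimmed.shape)
```

Returning the list of dropped names alongside the trimmed frame makes it easy to log which columns were removed and why.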

### Check Unique values in all columns

In [26]:
lead_df_nunique = lead_df.nunique().reset_index()
lead_df_nunique = lead_df_nunique.rename(columns = {"index": "Column Name", 0:"Unique Values in col"})
lead_df_nunique.sort_values(by = "Unique Values in col", ascending = False, inplace = True)
lead_df_nunique

| | Column Name | Unique Values in col |
|---|---|---|
| 0 | Prospect ID | 9240 |
| 1 | Lead Number | 9240 |
| 8 | Total Time Spent on Website | 1731 |
| 9 | Page Views Per Visit | 114 |
| 7 | TotalVisits | 41 |
| 11 | Country | 38 |
| 23 | Tags | 26 |
| 3 | Lead Source | 21 |
| 12 | Specialization | 18 |
| 10 | Last Activity | 17 |


Columns that hold the same value across the entire data frame are of no use for modelling, because they cannot help distinguish one lead from another. Hence we will remove all columns where `unique values in a column` is 1.

Similarly, Prospect ID and Lead Number both uniquely identify each row in the dataset, so we do not need both columns; we drop Prospect ID and keep Lead Number.
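One detail worth keeping in mind with this check is that `DataFrame.nunique` ignores NaN by default (`dropna=True`), so a column holding a single value plus missing entries also reports 1 unique value. The sketch below, on a hypothetical toy frame, shows the difference and also how identifier-like columns can be spotted by comparing the unique count with the number of rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": [101, 102, 103, 104],                     # unique per row: an identifier
    "always_no": ["No", "No", "No", "No"],          # truly constant
    "no_or_missing": ["No", np.nan, "No", np.nan],  # one value plus NaN
    "mixed": ["Yes", "No", "No", "Yes"],
})

n_unique = toy.nunique()                       # NaN ignored (default dropna=True)
n_unique_with_nan = toy.nunique(dropna=False)  # NaN counted as its own value

constant_cols = n_unique[n_unique == 1].index.tolist()
strictly_constant = n_unique_with_nan[n_unique_with_nan == 1].index.tolist()
id_like = n_unique[n_unique == len(toy)].index.tolist()
print(constant_cols, strictly_constant, id_like)
```

Whether a "one value or missing" column should be dropped is a judgment call; the sketch only makes the distinction visible.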

In [29]:
lead_single_val = lead_df_nunique[lead_df_nunique["Unique Values in col"] == 1]

lead_col_drop = list(lead_single_val.iloc[:, 0])
lead_col_drop.append("Prospect ID")
lead_col_drop

['Update me on Supply Chain Content',
 'Get updates on DM Content',
 'Magazine',
 'I agree to pay the amount through cheque',
 'Receive More Updates About Our Courses',
 'Prospect ID']

In [30]:
lead_df.drop(lead_col_drop, axis = 1, inplace = True)
lead_df.shape

(9240, 28)

In [39]:
for col in lead_df.columns[1:]:
    print(col, ":")
    print(lead_df[col].value_counts(dropna=False))
    print("-"*30)
    print()

Lead Origin :
Landing Page Submission    4886
API                        3580
Lead Add Form               718
Lead Import                  55
Quick Add Form                1
Name: Lead Origin, dtype: int64
------------------------------

Lead Source :
Google               2868
Direct Traffic       2543
Olark Chat           1755
Organic Search       1154
Reference             534
Welingak Website      142
Referral Sites        125
Facebook               55
NaN                    36
bing                    6
google                  5
Click2call              4
Press_Release           2
Social Media            2
Live Chat               2
youtubechannel          1
testone                 1
Pay per Click Ads       1
welearnblog_Home        1
WeLearn                 1
blog                    1
NC_EDM                  1
Name: Lead Source, dtype: int64
------------------------------

Do Not Email :
No     8506
Yes     734
Name: Do Not Email, dtype: int64
------------------------------

Do Not C

In [40]:
lead_df.columns

Index(['Lead Number', 'Lead Origin', 'Lead Source', 'Do Not Email',
       'Do Not Call', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity',
       'Country', 'Specialization', 'What is your current occupation',
       'What matters most to you in choosing a course', 'Search',
       'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations', 'Tags', 'City',
       'Asymmetrique Activity Index', 'Asymmetrique Profile Index',
       'Asymmetrique Activity Score', 'Asymmetrique Profile Score',
       'A free copy of Mastering The Interview', 'Last Notable Activity'],
      dtype='object')

In [42]:
# Percentage of null values per column, recomputed after dropping columns

leads_null = lead_df.isna().mean().reset_index()
leads_null = leads_null.rename(columns = {"index": "Column Name", 0:"Missing Value"})
leads_null["Missing Value"] = leads_null["Missing Value"]*100
leads_null.sort_values(by = "Missing Value", ascending = False, inplace = True)

# The printed figure is the number of columns that contain at least one null value
print("Null Values in the dataset", len(leads_null[leads_null['Missing Value'] > 0]))
leads_null[leads_null['Missing Value'] > 0]

Null Values in the dataset 14


| | Column Name | Missing Value |
|---|---|---|
| 22 | Asymmetrique Activity Index | 45.649351 |
| 25 | Asymmetrique Profile Score | 45.649351 |
| 24 | Asymmetrique Activity Score | 45.649351 |
| 23 | Asymmetrique Profile Index | 45.649351 |
| 21 | City | 39.707792 |
| 11 | Specialization | 36.580087 |
| 20 | Tags | 36.287879 |
| 13 | What matters most to you in choosing a course | 29.318182 |
| 12 | What is your current occupation | 29.112554 |
| 10 | Country | 26.634199 |


In [43]:
print("Non-Null Columns in the dataset:", len(leads_null[leads_null["Missing Value"] == 0]))
print("Columns with NULL values in the dataset:", len(leads_null[leads_null["Missing Value"] > 0]))
print("Columns with more than 50% NULL values in the dataset:", len(leads_null[leads_null["Missing Value"] > 50]))
print("Total Columns in the dataset:", len(leads_null))

Non-Null Columns in the dataset: 14
Columns with NULL values in the dataset: 14
Columns with more than 50% NULL values in the dataset: 0
Total Columns in the dataset: 28
