# Lead Scoring

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

X Education has appointed you to help select the most promising leads, i.e. the leads most likely to convert into paying customers. The task is to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance. The CEO has given a ballpark target lead conversion rate of around 80%.
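As a rough illustration of what a lead score could look like, the sketch below maps a conversion probability to an integer score from 0 to 100. This is only an assumption about the scoring scheme: the function name `to_lead_score` and the example probabilities are hypothetical, and no model has been fitted at this point.

```python
import numpy as np

def to_lead_score(probabilities):
    """Map conversion probabilities in [0, 1] to integer scores in [0, 100]."""
    # Clip first so slightly out-of-range inputs cannot leave the 0-100 scale
    p = np.clip(np.asarray(probabilities, dtype=float), 0.0, 1.0)
    return np.rint(p * 100).astype(int)

# Made-up probabilities for three hypothetical leads
example_probs = [0.05, 0.42, 0.91]
example_scores = to_lead_score(example_probs)
print(example_scores)
```

Because the mapping is monotonic, a lead with a higher predicted probability never receives a lower score.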

In [185]:
# Importing required packages
import numpy as np
import pandas as pd

In [186]:
# Read data
df = pd.read_csv('/content/Leads.csv')
df.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,Country,Specialization,How did you hear about X Education,What is your current occupation,What matters most to you in choosing a course,Search,Magazine,Newspaper Article,X Education Forums,Newspaper,Digital Advertisement,Through Recommendations,Receive More Updates About Our Courses,Tags,Lead Quality,Update me on Supply Chain Content,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,Page Visited on Website,,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Interested in other courses,Low in Relevance,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,Email Opened,India,Select,Select,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,,No,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,Email Opened,India,Business Administration,Select,Student,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,Unreachable,India,Media and Advertising,Word Of Mouth,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,Not Sure,No,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,Converted to Lead,India,Select,Other,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


In [187]:
df.shape

(9240, 37)

In [188]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

In [189]:
df.describe()

Unnamed: 0,Lead Number,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Asymmetrique Activity Score,Asymmetrique Profile Score
count,9240.0,9240.0,9103.0,9240.0,9103.0,5022.0,5022.0
mean,617188.435606,0.38539,3.445238,487.698268,2.36282,14.306252,16.344883
std,23405.995698,0.486714,4.854853,548.021466,2.161418,1.386694,1.811395
min,579533.0,0.0,0.0,0.0,0.0,7.0,11.0
25%,596484.5,0.0,1.0,12.0,1.0,14.0,15.0
50%,615479.0,0.0,3.0,248.0,2.0,14.0,16.0
75%,637387.25,1.0,5.0,936.0,3.0,15.0,18.0
max,660737.0,1.0,251.0,2272.0,55.0,18.0,20.0


In [190]:
df_null = pd.DataFrame(round((df.isna().sum()/df.shape[0])*100,2),columns=['% Null']).reset_index()

In [191]:
df_null.rename(columns={'index':'Column'},inplace=True)

In [192]:
df_high_null = df_null[df_null['% Null']>25].sort_values(by='% Null')

In [193]:
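# Columns with more than 25% nulls, in ascending order of null share
cols_to_drop = df_high_null.Column.values
# Skip the first two entries, so the two columns with the lowest null share above 25% are kept
cols_to_drop = cols_to_drop[2:]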

In [194]:
# Use <= so a column at exactly 25% is not left out of both groups
df_low_null = df_null[df_null['% Null']<=25].sort_values(by='% Null')
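The null-share calculation can also be written more compactly with `isna().mean()`. The sketch below runs on a small made-up frame rather than the leads data; the column names `a` to `d` and the variable names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the leads data (values are made up)
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],                   # no missing values
    'b': [1, np.nan, 3, 4],              # 25% missing
    'c': [np.nan, np.nan, 3, 4],         # 50% missing
    'd': [np.nan, np.nan, np.nan, 4],    # 75% missing
})

# isna().mean() is the fraction of missing values per column
null_pct = (toy.isna().mean() * 100).round(2)

threshold = 25
high_null_cols = null_pct[null_pct > threshold].sort_values().index.tolist()
low_null_cols = null_pct[null_pct <= threshold].index.tolist()
print(high_null_cols, low_null_cols)
```

Using `<=` for the complement means a column at exactly the threshold lands in one of the two groups.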

# Drop columns (missing values > 25%, keeping the two with the lowest null share)

In [195]:
df.drop(labels=list(cols_to_drop),inplace=True,axis=1)

In [196]:
# Drop every row that still has at least one missing value
df.dropna(inplace=True,axis=0)

In [197]:
df.isna().sum()

Prospect ID                                 0
Lead Number                                 0
Lead Origin                                 0
Lead Source                                 0
Do Not Email                                0
Do Not Call                                 0
Converted                                   0
TotalVisits                                 0
Total Time Spent on Website                 0
Page Views Per Visit                        0
Last Activity                               0
Country                                     0
Specialization                              0
How did you hear about X Education          0
What is your current occupation             0
Search                                      0
Magazine                                    0
Newspaper Article                           0
X Education Forums                          0
Newspaper                                   0
Digital Advertisement                       0
Through Recommendations           

In [198]:
df.shape

(4925, 29)
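The row-wise `dropna` is fairly aggressive here. A quick check using the two shapes reported above (9240 rows before, 4925 after) shows how much data it removes; the variable names are only illustrative.

```python
# Row counts taken from the df.shape outputs before and after cleaning
rows_before, rows_after = 9240, 4925

rows_dropped = rows_before - rows_after
pct_dropped = round(rows_dropped / rows_before * 100, 2)
print(rows_dropped, pct_dropped)
```

That is 4315 rows, or roughly 46.7% of the original data, which is worth keeping in mind when interpreting any model trained on the remainder.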

In [199]:
# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
obj_cols = df.select_dtypes(include='object').columns.tolist()
obj_cols

['Prospect ID',
 'Lead Origin',
 'Lead Source',
 'Do Not Email',
 'Do Not Call',
 'Last Activity',
 'Country',
 'Specialization',
 'How did you hear about X Education',
 'What is your current occupation',
 'Search',
 'Magazine',
 'Newspaper Article',
 'X Education Forums',
 'Newspaper',
 'Digital Advertisement',
 'Through Recommendations',
 'Receive More Updates About Our Courses',
 'Update me on Supply Chain Content',
 'Get updates on DM Content',
 'City',
 'I agree to pay the amount through cheque',
 'A free copy of Mastering The Interview',
 'Last Notable Activity']

In [200]:
'Do Not Call','Search','Magazine','Digital Advertisement','Newspaper Article','X Education Forums','Newspaper','Through Recommendations','Receive More Updates About Our Courses','Update me on Supply Chain Content','Get updates on DM Content','I agree to pay the amount through cheque'

('Do Not Call',
 'Search',
 'Magazine',
 'Digital Advertisement',
 'Newspaper Article',
 'X Education Forums',
 'Newspaper',
 'Through Recommendations',
 'Receive More Updates About Our Courses',
 'Update me on Supply Chain Content',
 'Get updates on DM Content',
 'I agree to pay the amount through cheque')

In [201]:
for col in obj_cols:
  print('-----------------------------')
  print(df[col].value_counts())
  print('-----------------------------')

-----------------------------
232f7d3b-ee3b-4987-8ef4-a84e8fdecedf    1
27b0b7e1-4d8a-46b6-be0d-89769d6a6795    1
a6df299b-f612-476e-8a0d-ad948dbaf755    1
06feb89a-fb00-48b2-9ee8-71e69ec328fd    1
0c15052a-9f8a-47c4-9fc3-eb20c84ffd74    1
                                       ..
309b99d6-f4bd-446a-902f-756f59171058    1
49fd74ef-98cf-437f-be6b-3fd85843666f    1
d9ed7525-5cf0-45ba-87c2-ca2bca521874    1
53eb261a-c8a8-410b-9110-3025d9ac5d22    1
7ea5288c-e117-4bb9-8f0f-846bf0534f0f    1
Name: Prospect ID, Length: 4925, dtype: int64
-----------------------------
-----------------------------
Landing Page Submission    3598
API                        1300
Lead Add Form                27
Name: Lead Origin, dtype: int64
-----------------------------
-----------------------------
Google               2028
Direct Traffic       1856
Organic Search        860
Olark Chat             73
Referral Sites         71
Reference              21
Welingak Website        5
bing                    3
Social

In [202]:
cols_to_drop_2 = ['Prospect ID','Do Not Call','Search','Magazine','Digital Advertisement','Newspaper Article','X Education Forums','Newspaper','Through Recommendations','Receive More Updates About Our Courses','Update me on Supply Chain Content','Get updates on DM Content','I agree to pay the amount through cheque']

In [203]:
df.drop(labels=cols_to_drop_2,inplace=True,axis=1)
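The columns in `cols_to_drop_2` were picked by reading the value counts by eye. One way this could be automated, assuming the criterion is that a single category dominates the column, is sketched below on made-up data. The helper `dominant_share` and the 0.85 cutoff are hypothetical and are not taken from the notebook.

```python
import pandas as pd

def dominant_share(series):
    """Return the share (0 to 1) of the most frequent value in a column."""
    # value_counts sorts in descending order, so the first entry is the largest
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Toy frame with one constant, one near-constant and one balanced column
toy = pd.DataFrame({
    'Magazine': ['No'] * 10,
    'Search': ['No'] * 9 + ['Yes'],
    'Lead Origin': ['API'] * 5 + ['Landing Page Submission'] * 5,
})

cutoff = 0.85  # hypothetical cutoff; tune to the data at hand
near_constant = [c for c in toy.columns if dominant_share(toy[c]) >= cutoff]
print(near_constant)
```

A column where nearly every row shares one value carries little signal for separating converted from unconverted leads, which is the usual reason for dropping it.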

In [204]:
df.shape

(4925, 16)

In [205]:
# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
obj_cols = df.select_dtypes(include='object').columns.tolist()
obj_cols

['Lead Origin',
 'Lead Source',
 'Do Not Email',
 'Last Activity',
 'Country',
 'Specialization',
 'How did you hear about X Education',
 'What is your current occupation',
 'City',
 'A free copy of Mastering The Interview',
 'Last Notable Activity']

In [206]:
for col in obj_cols:
  print('-----------------------------')
  print(round((df[col].value_counts()/df.shape[0])*100,2))
  print('-----------------------------')

-----------------------------
Landing Page Submission    73.06
API                        26.40
Lead Add Form               0.55
Name: Lead Origin, dtype: float64
-----------------------------
-----------------------------
Google               41.18
Direct Traffic       37.69
Organic Search       17.46
Olark Chat            1.48
Referral Sites        1.44
Reference             0.43
Welingak Website      0.10
bing                  0.06
Social Media          0.04
Pay per Click Ads     0.02
Click2call            0.02
Press_Release         0.02
testone               0.02
Facebook              0.02
WeLearn               0.02
Name: Lead Source, dtype: float64
-----------------------------
-----------------------------
No     92.61
Yes     7.39
Name: Do Not Email, dtype: float64
-----------------------------
-----------------------------
Email Opened                    38.94
SMS Sent                        33.18
Page Visited on Website          8.06
Converted to Lead                5.79
Olark

In [207]:
cols_to_drop_3 = ['How did you hear about X Education','Country','City','Do Not Email']
df.drop(labels=cols_to_drop_3,inplace=True,axis=1)

In [208]:
df.shape

(4925, 12)

In [209]:
df['Specialization'] = df['Specialization'].apply(lambda x:'Other' if x=='Select' else x)

In [210]:
df['Specialization'].value_counts()

Other                                906
Finance Management                   657
Human Resource Management            576
Marketing Management                 553
Operations Management                349
Business Administration              277
IT Projects Management               258
Supply Chain Management              251
Banking, Investment And Insurance    230
Media and Advertising                152
Travel and Tourism                   145
International Business               128
Healthcare Management                101
Hospitality Management                78
Retail Management                     74
E-COMMERCE                            74
Rural and Agribusiness                52
E-Business                            41
Services Excellence                   23
Name: Specialization, dtype: int64
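The same relabelling of `'Select'` can be done without a Python-level lambda by using `Series.replace`. The sketch below uses a small made-up series, not the real column.

```python
import pandas as pd

# Made-up values; 'Select' stands for an entry where no specialization was chosen
toy_spec = pd.Series(['Select', 'Finance Management', 'Select', 'E-Business'])

# replace swaps exact matches and leaves every other value untouched
cleaned = toy_spec.replace('Select', 'Other')
print(cleaned.tolist())
```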

In [211]:
df.drop(labels='Lead Number',inplace=True,axis=1)

In [212]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4925 entries, 1 to 9239
Data columns (total 11 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Lead Origin                             4925 non-null   object 
 1   Lead Source                             4925 non-null   object 
 2   Converted                               4925 non-null   int64  
 3   TotalVisits                             4925 non-null   float64
 4   Total Time Spent on Website             4925 non-null   int64  
 5   Page Views Per Visit                    4925 non-null   float64
 6   Last Activity                           4925 non-null   object 
 7   Specialization                          4925 non-null   object 
 8   What is your current occupation         4925 non-null   object 
 9   A free copy of Mastering The Interview  4925 non-null   object 
 10  Last Notable Activity                   4925 non-null   obje

In [213]:
df['A free copy of Mastering The Interview'].value_counts()

No     2770
Yes    2155
Name: A free copy of Mastering The Interview, dtype: int64

In [214]:
df['A free copy of Mastering The Interview'] = df['A free copy of Mastering The Interview'].apply(lambda x:1 if x=='Yes' else 0)

In [215]:
df['A free copy of Mastering The Interview'].value_counts()

0    2770
1    2155
Name: A free copy of Mastering The Interview, dtype: int64
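An alternative way to encode a Yes/No column is `Series.map` with an explicit dictionary, sketched here on made-up values.

```python
import pandas as pd

# Made-up Yes/No flags standing in for the real column
toy_flag = pd.Series(['Yes', 'No', 'No', 'Yes'])

# Explicit mapping; any value not in the dict would become NaN
encoded = toy_flag.map({'Yes': 1, 'No': 0})
print(encoded.tolist())
```

Unlike the lambda, which silently turns anything other than `'Yes'` into 0, the dictionary mapping leaves unexpected values as NaN so they can be spotted.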