# Lead Scoring Study 

## A. Problem Statement

1. Build a logistic regression model that assigns each lead a score between 0 and 100, which the company can use to target potential leads
2. Create a model that can adjust to the company's changing requirements
    - When a sales team is recruited and wants to contact all the leads that have a good chance of conversion
    - When the sales team is involved in other projects and will only call when there is a requirement
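
As a rough sketch of the two scenarios above, a predicted conversion probability can be scaled to a 0 to 100 score and compared against a cutoff that the company tunes. The function names, the example probabilities and the cutoff values (30 and 80) are illustrative assumptions, not values from this study.

```python
import numpy as np

def lead_score(probabilities):
    """Scale conversion probabilities in [0, 1] to integer scores in [0, 100]."""
    probs = np.clip(np.asarray(probabilities, dtype=float), 0.0, 1.0)
    return np.rint(probs * 100).astype(int)

def leads_to_call(scores, cutoff):
    """Return the positions of leads whose score is at least `cutoff`."""
    return np.flatnonzero(np.asarray(scores) >= cutoff)

# Hypothetical predicted probabilities for five leads
probs = [0.05, 0.35, 0.62, 0.81, 0.97]
scores = lead_score(probs)

# Large sales team: a low cutoff reaches most leads with a fair chance of converting
aggressive = leads_to_call(scores, cutoff=30)

# Limited calling capacity: a high cutoff keeps only the most likely conversions
conservative = leads_to_call(scores, cutoff=80)
```

Lowering the cutoff widens the call list and raising it narrows the list, which is the adjustment the second requirement asks for.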

#### Importing Relevant Libraries

In [157]:
import numpy as np
import pandas as pd

import warnings 
warnings.filterwarnings('ignore')

In [158]:
df = pd.read_csv('Leads.csv')

In [159]:
# checking the shape of dataframe 

df.shape

(9240, 37)

In [160]:
# observing the dataframe for the first time

df.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,...,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,...,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,...,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


## B. Data Preparation and EDA

In [161]:
# checking for columns that contain the value 'Select'
# 'Select' is the placeholder left when no option was chosen during data entry

df.columns[df.isin(['Select']).any()]

Index(['Specialization', 'How did you hear about X Education', 'Lead Profile',
       'City'],
      dtype='object')

In [162]:
# replacing NaN with 'Select' in 'Specialization', so missing entries share the 'Select' placeholder

df['Specialization'] = df['Specialization'].replace(np.nan, 'Select')

In [163]:
# replacing NaN with 'Select' in 'How did you hear about X Education'

df['How did you hear about X Education'] = df['How did you hear about X Education'].replace(np.nan, 'Select')

In [164]:
# replacing NaN with 'Select' in 'Lead Profile'

df['Lead Profile'] = df['Lead Profile'].replace(np.nan, 'Select')

In [165]:
# replacing NaN with 'Select' in 'City'

df['City'] = df['City'].replace(np.nan, 'Select')
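
The four cells above apply the same replacement to one column at a time. A minimal sketch of doing it in a single step is shown below on a small made-up frame; the toy data is an assumption, and only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the leads data (values are made up)
toy = pd.DataFrame({
    'Specialization': ['Select', np.nan, 'Finance Management'],
    'City': [np.nan, 'Mumbai', 'Select'],
    'TotalVisits': [1.0, np.nan, 3.0],
})

select_cols = ['Specialization', 'City']

# Fill NaN with 'Select' only in the listed columns; other columns keep their NaN
toy[select_cols] = toy[select_cols].fillna('Select')
```

Restricting the fill to an explicit list keeps numeric columns such as 'TotalVisits' untouched.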

In [166]:
# checking the percentage of missing values 

round(df.isnull().sum()/len(df),2)

Prospect ID                                      0.00
Lead Number                                      0.00
Lead Origin                                      0.00
Lead Source                                      0.00
Do Not Email                                     0.00
Do Not Call                                      0.00
Converted                                        0.00
TotalVisits                                      0.01
Total Time Spent on Website                      0.00
Page Views Per Visit                             0.01
Last Activity                                    0.01
Country                                          0.27
Specialization                                   0.00
How did you hear about X Education               0.00
What is your current occupation                  0.29
What matters most to you in choosing a course    0.29
Search                                           0.00
Magazine                                         0.00
Newspaper Article           

#### Dropping columns with a high percentage of missing values

- 'Lead Quality' is a metric based on the intuition of the employee who has been assigned the lead. With its high share of missing values, it does not hold significance for the model we will create
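
Columns with a high share of missing values can also be listed programmatically instead of being read off the printed table. The helper below is a sketch on made-up data; the function name and the 0.40 threshold are assumptions for illustration, not a cutoff used in this study.

```python
import numpy as np
import pandas as pd

def columns_above_missing_share(frame, threshold):
    """Return names of columns whose fraction of missing values exceeds `threshold`."""
    missing_share = frame.isnull().mean()
    return missing_share[missing_share > threshold].index.tolist()

toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],   # 75% missing
    'b': [1.0, 2.0, 3.0, np.nan],         # 25% missing
    'c': ['x', 'y', 'z', 'w'],            # nothing missing
})

# 0.40 is a hypothetical cutoff chosen for illustration only
high_missing = columns_above_missing_share(toy, threshold=0.40)
```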

In [167]:
# dropping 'Lead Quality'

df.drop('Lead Quality' , axis =1 , inplace = True)

#### Note

The following are metrics assigned to each customer based on their activity and profile. They carry little significance for identifying hot leads, especially given their high share of missing values
1. 'Asymmetrique Activity Index'
2. 'Asymmetrique Profile Index'
3. 'Asymmetrique Activity Score'
4. 'Asymmetrique Profile Score'

In [168]:
# dropping the columns mentioned above

df.drop(['Asymmetrique Activity Index', 'Asymmetrique Profile Index',
         'Asymmetrique Activity Score', 'Asymmetrique Profile Score'],
        axis=1, inplace=True)

#### Note

- 'Country' has a significant number of missing values and is not an important variable for our problem statement

In [169]:
# dropping 'Country'

df.drop('Country' , axis =1 , inplace = True)

In [170]:
# checking the percentage of missing values 

round(df.isnull().sum()/len(df),2)

Prospect ID                                      0.00
Lead Number                                      0.00
Lead Origin                                      0.00
Lead Source                                      0.00
Do Not Email                                     0.00
Do Not Call                                      0.00
Converted                                        0.00
TotalVisits                                      0.01
Total Time Spent on Website                      0.00
Page Views Per Visit                             0.01
Last Activity                                    0.01
Specialization                                   0.00
How did you hear about X Education               0.00
What is your current occupation                  0.29
What matters most to you in choosing a course    0.29
Search                                           0.00
Magazine                                         0.00
Newspaper Article                                0.00
X Education Forums          

In [171]:
df.shape

(9240, 31)

#### Note

- 'What is your current occupation', 'What matters most to you in choosing a course' and 'Tags' have a high number of missing values, but they might carry important information for the problem statement. Analysing each of them below

In [172]:
df['What is your current occupation'].value_counts()

Unemployed              5600
Working Professional     706
Student                  210
Other                     16
Housewife                 10
Businessman                8
Name: What is your current occupation, dtype: int64

#### Note

- Almost all the values in 'What is your current occupation' fall into three categories (Unemployed, Working Professional and Student), so the column adds little to the categorisation of hot leads. Combined with its high share of missing values, it is not significant enough to keep. Thus dropping 'What is your current occupation'

In [173]:
# dropping 'What is your current occupation'

df.drop('What is your current occupation' , axis =1 , inplace = True)

In [174]:
df['What matters most to you in choosing a course'].value_counts()

Better Career Prospects      6528
Flexibility & Convenience       2
Other                           1
Name: What matters most to you in choosing a course, dtype: int64

#### Note

- Nearly every value in 'What matters most to you in choosing a course' is Better Career Prospects. Dropping the column will therefore make little difference to the model

In [175]:
# dropping 'What matters most to you in choosing a course'

df.drop('What matters most to you in choosing a course' , axis =1 , inplace = True)

In [176]:
df['Tags'].value_counts()

Will revert after reading the email                  2072
Ringing                                              1203
Interested in other courses                           513
Already a student                                     465
Closed by Horizzon                                    358
switched off                                          240
Busy                                                  186
Lost to EINS                                          175
Not doing further education                           145
Interested  in full time MBA                          117
Graduation in progress                                111
invalid number                                         83
Diploma holder (Not Eligible)                          63
wrong number given                                     47
opp hangup                                             33
number not provided                                    27
in touch with EINS                                     12
Lost to Others

#### Note

- The most frequent values in 'Tags', such as 'Will revert after reading the email', have little significance for our problem statement. We can drop this column

In [177]:
# dropping 'Tags'

df.drop('Tags' , axis =1 , inplace = True)

In [178]:
# checking the missing values now 

round(df.isnull().sum()/len(df.index),2)

Prospect ID                                 0.00
Lead Number                                 0.00
Lead Origin                                 0.00
Lead Source                                 0.00
Do Not Email                                0.00
Do Not Call                                 0.00
Converted                                   0.00
TotalVisits                                 0.01
Total Time Spent on Website                 0.00
Page Views Per Visit                        0.01
Last Activity                               0.01
Specialization                              0.00
How did you hear about X Education          0.00
Search                                      0.00
Magazine                                    0.00
Newspaper Article                           0.00
X Education Forums                          0.00
Newspaper                                   0.00
Digital Advertisement                       0.00
Through Recommendations                     0.00
Receive More Updates

#### Note

- We do not have significant missing values in the dataframe now. We can proceed to the next step. 

In [179]:
df.shape

(9240, 28)

#### Finding highly skewed columns

- Skewed columns are categorical columns in which almost all the rows share a single value
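
One way to spot such columns without inspecting each `value_counts()` by hand is to measure the share of rows taken by the most frequent value. The sketch below uses made-up data; the helper name and the 0.85 threshold are assumptions, not values from this study.

```python
import pandas as pd

def dominant_share(series):
    """Fraction of non-null rows taken by the most frequent value."""
    return series.value_counts(normalize=True).iloc[0]

toy = pd.DataFrame({
    'Magazine': ['No'] * 10,                    # a single value everywhere
    'Search': ['No'] * 9 + ['Yes'],             # 90% 'No'
    'City': ['Mumbai'] * 5 + ['Thane'] * 5,     # evenly split
})

# 0.85 is an illustrative threshold, not one used in this study
skewed = [col for col in toy.columns if dominant_share(toy[col]) > 0.85]
```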

In [180]:
df.columns

Index(['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source',
       'Do Not Email', 'Do Not Call', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity',
       'Specialization', 'How did you hear about X Education', 'Search',
       'Magazine', 'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City', 'I agree to pay the amount through cheque',
       'A free copy of Mastering The Interview', 'Last Notable Activity'],
      dtype='object')

In [181]:
df['Page Views Per Visit'].value_counts()

0.00    2189
2.00    1795
3.00    1196
4.00     896
1.00     651
        ... 
3.43       1
2.56       1
6.33       1
1.64       1
2.08       1
Name: Page Views Per Visit, Length: 114, dtype: int64

In [182]:
# dropping 'Do Not Call'

df.drop('Do Not Call', axis =1 , inplace = True)

In [183]:
df['Magazine'].value_counts()

No    9240
Name: Magazine, dtype: int64

In [184]:
# dropping 'Magazine'

df.drop('Magazine', axis=1 , inplace = True)

In [185]:
df['Search'].value_counts()

No     9226
Yes      14
Name: Search, dtype: int64

In [186]:
# dropping 'Search'

df.drop('Search' , axis =1, inplace =True)

In [187]:
df['Newspaper Article'].value_counts()

No     9238
Yes       2
Name: Newspaper Article, dtype: int64

In [188]:
# dropping 'Newspaper Article'

df.drop('Newspaper Article', axis =1 , inplace = True)

In [189]:
df['X Education Forums'].value_counts()

No     9239
Yes       1
Name: X Education Forums, dtype: int64

In [190]:
# dropping 'X Education Forums'

df.drop('X Education Forums', axis =1 , inplace = True)

In [191]:
df['Newspaper'].value_counts()

No     9239
Yes       1
Name: Newspaper, dtype: int64

In [192]:
# dropping 'Newspaper'

df.drop('Newspaper', axis=1 , inplace = True)

In [193]:
df['Digital Advertisement'].value_counts()

No     9236
Yes       4
Name: Digital Advertisement, dtype: int64

In [194]:
# dropping 'Digital Advertisement'

df.drop('Digital Advertisement', axis=1 , inplace = True)

In [195]:
df['Through Recommendations'].value_counts()

No     9233
Yes       7
Name: Through Recommendations, dtype: int64

In [196]:
# dropping 'Through Recommendations'

df.drop('Through Recommendations', axis=1 , inplace = True)

In [197]:
df['Receive More Updates About Our Courses'].value_counts()

No    9240
Name: Receive More Updates About Our Courses, dtype: int64

In [198]:
# dropping 'Receive More Updates About Our Courses'

df.drop('Receive More Updates About Our Courses', axis=1 , inplace = True)

In [199]:
df['Update me on Supply Chain Content'].value_counts()

No    9240
Name: Update me on Supply Chain Content, dtype: int64

In [200]:
# dropping 'Update me on Supply Chain Content'

df.drop('Update me on Supply Chain Content', axis=1 , inplace = True)

In [201]:
df['Get updates on DM Content'].value_counts()

No    9240
Name: Get updates on DM Content, dtype: int64

In [202]:
# dropping 'Get updates on DM Content'

df.drop('Get updates on DM Content', axis=1 , inplace = True)

In [203]:
df['I agree to pay the amount through cheque'].value_counts()

No    9240
Name: I agree to pay the amount through cheque, dtype: int64

In [204]:
# dropping 'I agree to pay the amount through cheque'

df.drop('I agree to pay the amount through cheque', axis=1 , inplace = True)

#### Note

- The remaining columns are not skewed, so we can proceed to the next step

#### Checking categorical columns for categories with very few rows

In [205]:
df.columns

Index(['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source',
       'Do Not Email', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity',
       'Specialization', 'How did you hear about X Education', 'Lead Profile',
       'City', 'A free copy of Mastering The Interview',
       'Last Notable Activity'],
      dtype='object')

In [206]:
df['Lead Source'].value_counts()

Google               2868
Direct Traffic       2543
Olark Chat           1755
Organic Search       1154
Reference             534
Welingak Website      142
Referral Sites        125
Facebook               55
bing                    6
google                  5
Click2call              4
Press_Release           2
Social Media            2
Live Chat               2
youtubechannel          1
testone                 1
Pay per Click Ads       1
welearnblog_Home        1
WeLearn                 1
blog                    1
NC_EDM                  1
Name: Lead Source, dtype: int64

In [207]:
# putting the values with fewer than 100 occurrences into one category: 'Other_ls'

lis_ls = ['Facebook','bing','google','Click2call','Live Chat','Social Media','Press_Release','blog','Pay per Click Ads','welearnblog_Home','WeLearn','testone','NC_EDM','youtubechannel']

df['Lead Source'] = df['Lead Source'].apply(lambda x : 'Other_ls' if x in lis_ls else x)
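
The list of rare values above is typed out by hand. As a sketch, the same grouping can be derived from the counts themselves; the helper name `group_rare` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def group_rare(series, min_count, other_label):
    """Replace values seen fewer than `min_count` times with `other_label`; NaN is left alone."""
    counts = series.value_counts()
    rare_values = counts[counts < min_count].index
    # `where` keeps entries where the condition is True and replaces the rest
    return series.where(~series.isin(rare_values), other_label)

toy = pd.Series(['Google'] * 4 + ['bing', 'blog', np.nan])
grouped = group_rare(toy, min_count=2, other_label='Other_ls')
```

Because missing entries never match a rare value, they stay as NaN and can still be handled in a later step.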


In [208]:
df['Lead Source'].value_counts()

Google              2868
Direct Traffic      2543
Olark Chat          1755
Organic Search      1154
Reference            534
Welingak Website     142
Referral Sites       125
Other_ls              83
Name: Lead Source, dtype: int64

In [209]:
df['How did you hear about X Education'].value_counts()

Select                   7250
Online Search             808
Word Of Mouth             348
Student of SomeSchool     310
Other                     186
Multiple Sources          152
Advertisements             70
Social Media               67
Email                      26
SMS                        23
Name: How did you hear about X Education, dtype: int64

In [210]:
# putting the values with fewer than 100 occurrences into one category: 'Other_ed'

lis_ed = ['Advertisements','Social Media','Email','SMS']

df['How did you hear about X Education'] = df['How did you hear about X Education'].apply(lambda x : 'Other_ed' if x in lis_ed else x)

In [211]:
df['How did you hear about X Education'].value_counts()

Select                   7250
Online Search             808
Word Of Mouth             348
Student of SomeSchool     310
Other                     186
Other_ed                  186
Multiple Sources          152
Name: How did you hear about X Education, dtype: int64

In [212]:
df['Lead Profile'].value_counts()

Select                         6855
Potential Lead                 1613
Other Leads                     487
Student of SomeSchool           241
Lateral Student                  24
Dual Specialization Student      20
Name: Lead Profile, dtype: int64

In [213]:
lis_lp = ['Lateral Student','Dual Specialization Student']

df['Lead Profile'] = df['Lead Profile'].apply(lambda x : 'Other_lp' if x in lis_lp else x)

In [214]:
df['Lead Profile'].value_counts()

Select                   6855
Potential Lead           1613
Other Leads               487
Student of SomeSchool     241
Other_lp                   44
Name: Lead Profile, dtype: int64

In [215]:
df['Last Activity'].value_counts()

Email Opened                    3437
SMS Sent                        2745
Olark Chat Conversation          973
Page Visited on Website          640
Converted to Lead                428
Email Bounced                    326
Email Link Clicked               267
Form Submitted on Website        116
Unreachable                       93
Unsubscribed                      61
Had a Phone Conversation          30
Approached upfront                 9
View in browser link Clicked       6
Email Received                     2
Email Marked Spam                  2
Visited Booth in Tradeshow         1
Resubscribed to emails             1
Name: Last Activity, dtype: int64

In [216]:
lis_la = ['Unreachable','Unsubscribed','Had a Phone Conversation','View in browser link Clicked','Approached upfront','Email Marked Spam','Email Received','Resubscribed to emails','Visited Booth in Tradeshow']

df['Last Activity'] = df['Last Activity'].apply(lambda x : 'Other_la' if x in lis_la else x)

In [217]:
df['Last Activity'].value_counts()

Email Opened                 3437
SMS Sent                     2745
Olark Chat Conversation       973
Page Visited on Website       640
Converted to Lead             428
Email Bounced                 326
Email Link Clicked            267
Other_la                      205
Form Submitted on Website     116
Name: Last Activity, dtype: int64

In [218]:
df['Last Notable Activity'].value_counts()

Modified                        3407
Email Opened                    2827
SMS Sent                        2172
Page Visited on Website          318
Olark Chat Conversation          183
Email Link Clicked               173
Email Bounced                     60
Unsubscribed                      47
Unreachable                       32
Had a Phone Conversation          14
Email Marked Spam                  2
Approached upfront                 1
Resubscribed to emails             1
View in browser link Clicked       1
Form Submitted on Website          1
Email Received                     1
Name: Last Notable Activity, dtype: int64

In [219]:
lis_na = ['Email Bounced','Unsubscribed','Unreachable','Had a Phone Conversation','Email Marked Spam','Resubscribed to emails','Email Received','Form Submitted on Website','View in browser link Clicked','Approached upfront']

df['Last Notable Activity'] = df['Last Notable Activity'].apply(lambda x : 'Other_na' if x in lis_na else x)

In [220]:
df['Last Notable Activity'].value_counts()

Modified                   3407
Email Opened               2827
SMS Sent                   2172
Page Visited on Website     318
Olark Chat Conversation     183
Email Link Clicked          173
Other_na                    160
Name: Last Notable Activity, dtype: int64

In [221]:
round(df.isnull().sum()/len(df.index),2)

Prospect ID                               0.00
Lead Number                               0.00
Lead Origin                               0.00
Lead Source                               0.00
Do Not Email                              0.00
Converted                                 0.00
TotalVisits                               0.01
Total Time Spent on Website               0.00
Page Views Per Visit                      0.01
Last Activity                             0.01
Specialization                            0.00
How did you hear about X Education        0.00
Lead Profile                              0.00
City                                      0.00
A free copy of Mastering The Interview    0.00
Last Notable Activity                     0.00
dtype: float64

#### Removing any remaining null values

In [223]:
# dropping the rows with null values in the remaining columns, as their number is very small

df.dropna(inplace = True)

In [224]:
# checking for any remaining null values

round(df.isnull().sum()/len(df.index),2)

Prospect ID                               0.0
Lead Number                               0.0
Lead Origin                               0.0
Lead Source                               0.0
Do Not Email                              0.0
Converted                                 0.0
TotalVisits                               0.0
Total Time Spent on Website               0.0
Page Views Per Visit                      0.0
Last Activity                             0.0
Specialization                            0.0
How did you hear about X Education        0.0
Lead Profile                              0.0
City                                      0.0
A free copy of Mastering The Interview    0.0
Last Notable Activity                     0.0
dtype: float64

In [225]:
df.shape

(9074, 16)

In [226]:
#### Checking the percentage of the rows that are left

a = len(df.index)/9240
print(a)

0.982034632034632
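
The cell above hard-codes the original row count of 9240. A small sketch of computing the retained fraction from the frame itself is shown below on made-up data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'TotalVisits': [0.0, 5.0, np.nan, 1.0],
    'Last Activity': ['Modified', None, 'SMS Sent', 'Email Opened'],
})

rows_before = len(toy)      # row count before dropping
cleaned = toy.dropna()      # returns a new frame; `toy` is unchanged
retained = len(cleaned) / rows_before
```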


In [227]:
# We have retained more than 98 percent of the rows after removing the null values

#### Creating Dummies for all the categorical variables 

In [228]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9074 entries, 0 to 9239
Data columns (total 16 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Prospect ID                             9074 non-null   object 
 1   Lead Number                             9074 non-null   int64  
 2   Lead Origin                             9074 non-null   object 
 3   Lead Source                             9074 non-null   object 
 4   Do Not Email                            9074 non-null   object 
 5   Converted                               9074 non-null   int64  
 6   TotalVisits                             9074 non-null   float64
 7   Total Time Spent on Website             9074 non-null   int64  
 8   Page Views Per Visit                    9074 non-null   float64
 9   Last Activity                           9074 non-null   object 
 10  Specialization                          9074 non-null   obje

#### Variable One : Lead Origin

In [229]:
leadorigin_dummy = pd.get_dummies(df['Lead Origin'], drop_first = True)
leadorigin_dummy.head()

Unnamed: 0,Landing Page Submission,Lead Add Form,Lead Import
0,0,0,0
1,0,0,0
2,1,0,0
3,1,0,0
4,1,0,0


#### Variable Two : Lead Source

In [230]:
leadsource_dummy = pd.get_dummies(df['Lead Source'], drop_first = True)
leadsource_dummy.head()

Unnamed: 0,Google,Olark Chat,Organic Search,Other_ls,Reference,Referral Sites,Welingak Website
0,0,1,0,0,0,0,0
1,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0


#### Variable Three : Do Not Email

In [231]:
# creating a function that relabels 'Yes' and 'No' so the dummy column name identifies this variable

def email(x):
    if x == 'Yes':
        return 'yes_email'
    else:
        return 'no_email'

In [232]:

df['Do Not Email'] = df['Do Not Email'].apply(email)

In [233]:
donotemail_dummy = pd.get_dummies(df['Do Not Email'], drop_first = True)
donotemail_dummy.head()

Unnamed: 0,yes_email
0,0
1,0
2,0
3,0
4,0


#### Variable Four : Last Activity 

In [234]:
def sms(x):
    if x == 'SMS Sent':
        return 'sms_sent_la'
    if x == 'Olark Chat Conversation':
        return 'Olark_Chat_Conversation_la'
    else:
        return x

In [235]:
df['Last Activity'] = df['Last Activity'].apply(sms)
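
The renaming done by `sms` can also be written as a lookup table passed to `Series.replace`, which leaves every value that is not in the table unchanged. This is only a sketch on a made-up series; the name `rename_la` is an assumption.

```python
import pandas as pd

# Same renaming as `sms`, written as a lookup table
rename_la = {
    'SMS Sent': 'sms_sent_la',
    'Olark Chat Conversation': 'Olark_Chat_Conversation_la',
}

toy = pd.Series(['SMS Sent', 'Email Opened', 'Olark Chat Conversation'])
renamed = toy.replace(rename_la)
```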

In [236]:
df['Last Activity'].value_counts()

Email Opened                  3432
sms_sent_la                   2716
Olark_Chat_Conversation_la     972
Page Visited on Website        640
Converted to Lead              428
Email Bounced                  312
Email Link Clicked             267
Other_la                       191
Form Submitted on Website      116
Name: Last Activity, dtype: int64

In [237]:
lastactivity_dummy = pd.get_dummies(df['Last Activity'], drop_first = True)
lastactivity_dummy.head()

Unnamed: 0,Email Bounced,Email Link Clicked,Email Opened,Form Submitted on Website,Olark_Chat_Conversation_la,Other_la,Page Visited on Website,sms_sent_la
0,0,0,0,0,0,0,1,0
1,0,0,1,0,0,0,0,0
2,0,0,1,0,0,0,0,0
3,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0


#### Variable Five : Specialization 

In [238]:
df['Specialization'].value_counts()

Select                               3282
Finance Management                    959
Human Resource Management             837
Marketing Management                  823
Operations Management                 499
Business Administration               399
IT Projects Management                366
Supply Chain Management               346
Banking, Investment And Insurance     335
Travel and Tourism                    202
Media and Advertising                 202
International Business                176
Healthcare Management                 156
E-COMMERCE                            111
Hospitality Management                111
Retail Management                     100
Rural and Agribusiness                 73
E-Business                             57
Services Excellence                    40
Name: Specialization, dtype: int64

In [239]:
def specs(x):
    if x == 'Select':
        return 'select_specs'
    else:
        return x

In [240]:
df['Specialization'] = df['Specialization'].apply(specs)

In [241]:
specialization_dummy = pd.get_dummies(df['Specialization'], drop_first = True)
specialization_dummy.head()

Unnamed: 0,Business Administration,E-Business,E-COMMERCE,Finance Management,Healthcare Management,Hospitality Management,Human Resource Management,IT Projects Management,International Business,Marketing Management,Media and Advertising,Operations Management,Retail Management,Rural and Agribusiness,Services Excellence,Supply Chain Management,Travel and Tourism,select_specs
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


#### Variable Six : How did you hear about X Education

In [242]:
df['How did you hear about X Education'].value_counts()

Select                   7086
Online Search             808
Word Of Mouth             347
Student of SomeSchool     310
Other                     186
Other_ed                  185
Multiple Sources          152
Name: How did you hear about X Education, dtype: int64

In [243]:
def edu (x):
    if x == 'Student of SomeSchool':
        return 'edu_someschool'
    else:
        return x

In [244]:
df['How did you hear about X Education'] = df['How did you hear about X Education'].apply(edu)

In [245]:
education_dummy = pd.get_dummies(df['How did you hear about X Education'], drop_first = True)
education_dummy.head()

Unnamed: 0,Online Search,Other,Other_ed,Select,Word Of Mouth,edu_someschool
0,0,0,0,1,0,0
1,0,0,0,1,0,0
2,0,0,0,1,0,0
3,0,0,0,0,1,0
4,0,1,0,0,0,0


#### Variable Seven : Lead Profile

In [246]:
df['Lead Profile'].value_counts()

Select                   6757
Potential Lead           1554
Other Leads               482
Student of SomeSchool     240
Other_lp                   41
Name: Lead Profile, dtype: int64

In [247]:
def lp(x):
    if x == 'Select':
        return 'select_lp'
    elif x == 'Student of SomeSchool':
        return 'lp_someschool'
    else:
        return x

In [248]:
df['Lead Profile'] = df['Lead Profile'].apply(lp)

In [249]:
leadprofile_dummy = pd.get_dummies(df['Lead Profile'], drop_first = True)
leadprofile_dummy.head()

Unnamed: 0,Other_lp,Potential Lead,lp_someschool,select_lp
0,0,0,0,1
1,0,0,0,1
2,0,1,0,0
3,0,0,0,1
4,0,0,0,1


#### Variable Eight : City 

In [250]:
df['City'].value_counts()

Select                         3575
Mumbai                         3177
Thane & Outskirts               745
Other Cities                    680
Other Cities of Maharashtra     446
Other Metro Cities              377
Tier II Cities                   74
Name: City, dtype: int64

In [251]:
def city (x):
    if  x == 'Select':
        return 'select_city'
    else:
        return x

In [252]:
df['City'] = df['City'].apply(city)

In [253]:
city_dummy = pd.get_dummies(df['City'], drop_first = True)
city_dummy.head()

Unnamed: 0,Other Cities,Other Cities of Maharashtra,Other Metro Cities,Thane & Outskirts,Tier II Cities,select_city
0,0,0,0,0,0,1
1,0,0,0,0,0,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0


#### Variable Nine : A free copy of Mastering The Interview

In [254]:
df['A free copy of Mastering The Interview'].value_counts()

No     6186
Yes    2888
Name: A free copy of Mastering The Interview, dtype: int64

In [255]:
# creating a function that relabels 'Yes' and 'No' so the dummy column name identifies this variable

def emails(x):
    if x == 'Yes':
        return 'yes_copy'
    else:
        return 'no_copy'

In [256]:
df['A free copy of Mastering The Interview'] = df['A free copy of Mastering The Interview'].apply(emails)

In [257]:
copy_dummy = pd.get_dummies(df['A free copy of Mastering The Interview'], drop_first = True)
copy_dummy.head()

Unnamed: 0,yes_copy
0,0
1,0
2,1
3,0
4,0


#### Variable Ten : Last Notable Activity

In [258]:
df['Last Notable Activity'].value_counts()

Modified                   3267
Email Opened               2823
SMS Sent                   2152
Page Visited on Website     318
Olark Chat Conversation     183
Email Link Clicked          173
Other_na                    158
Name: Last Notable Activity, dtype: int64

In [259]:
activity_dummy = pd.get_dummies(df['Last Notable Activity'], drop_first = True)
activity_dummy.head()

Unnamed: 0,Email Opened,Modified,Olark Chat Conversation,Other_na,Page Visited on Website,SMS Sent
0,0,1,0,0,0,0
1,1,0,0,0,0,0
2,1,0,0,0,0,0
3,0,1,0,0,0,0
4,0,1,0,0,0,0


In [260]:
#### Putting all the dummy dataframe together 

final = pd.concat([df,leadorigin_dummy,leadsource_dummy,donotemail_dummy,lastactivity_dummy,specialization_dummy,education_dummy,leadprofile_dummy,city_dummy,copy_dummy,activity_dummy],axis =1)
final.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Last Activity,...,Thane & Outskirts,Tier II Cities,select_city,yes_copy,Email Opened,Modified,Olark Chat Conversation,Other_na,Page Visited on Website,SMS Sent
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,no_email,0,0.0,0,0.0,Page Visited on Website,...,0,0,1,0,0,1,0,0,0,0
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,no_email,0,5.0,674,2.5,Email Opened,...,0,0,1,0,1,0,0,0,0,0
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,no_email,1,2.0,1532,2.0,Email Opened,...,0,0,0,1,1,0,0,0,0,0
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,no_email,0,1.0,305,1.0,Other_la,...,0,0,0,0,0,1,0,0,0,0
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,no_email,1,2.0,1428,1.0,Converted to Lead,...,0,0,0,0,0,1,0,0,0,0
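
Note that the dummy frames for 'Last Activity' and 'Last Notable Activity' both contain columns named 'Email Opened' and 'Page Visited on Website', so the concatenated frame carries duplicate column labels. One possible way to avoid this, and to skip the hand-written renaming functions, is the `prefix` argument of `pd.get_dummies`. The sketch below uses made-up data, and the prefixes 'la' and 'na' are hypothetical labels.

```python
import pandas as pd

toy = pd.DataFrame({
    'Last Activity': ['Email Opened', 'SMS Sent', 'Email Opened'],
    'Last Notable Activity': ['Modified', 'Email Opened', 'SMS Sent'],
})

# The prefixes 'la' and 'na' are hypothetical labels chosen for this sketch
dummies = pd.get_dummies(
    toy,
    columns=['Last Activity', 'Last Notable Activity'],
    prefix=['la', 'na'],
    drop_first=True,   # drop the first category of each variable
    dtype=int,         # keep 0/1 integers rather than booleans
)
```

With a prefix per variable, every dummy column name states which original column it came from, and the source columns are removed from the result automatically.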


In [261]:
# dropping the original categorical columns now that their dummies have been created

final = final.drop(['Lead Origin', 'Lead Source', 'Do Not Email', 'Last Activity',
                    'Specialization', 'How did you hear about X Education',
                    'Lead Profile', 'City', 'A free copy of Mastering The Interview',
                    'Last Notable Activity'], axis=1)

In [262]:
final.head()

Unnamed: 0,Prospect ID,Lead Number,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Landing Page Submission,Lead Add Form,Lead Import,Google,...,Thane & Outskirts,Tier II Cities,select_city,yes_copy,Email Opened,Modified,Olark Chat Conversation,Other_na,Page Visited on Website,SMS Sent
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,0,0.0,0,0.0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,0,5.0,674,2.5,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,1,2.0,1532,2.0,1,0,0,0,...,0,0,0,1,1,0,0,0,0,0
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,0,1.0,305,1.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,3256f628-e534-4826-9d63-4a8b88782852,660681,1,2.0,1428,1.0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
