# Leads scoring case study

##### Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

 

The company markets its courses on several websites and search engines such as Google. Once these people land on the website, they might browse the courses, fill in a form for a course, or watch some videos. When they fill in a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%. 

 

Although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will focus on communicating with the promising leads rather than calling everyone. A typical lead conversion process can be represented using the following funnel:

Figure: Lead Conversion Process, demonstrated as a funnel

As you can see, a lot of leads are generated in the initial stage (top), but only a few of them come out as paying customers at the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educate the leads about the product, communicate constantly, etc.) in order to get a higher lead conversion.

 

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of conversion and leads with a lower score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.


##### Goals of the Case Study

Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads which can be used by the company to target potential leads. A higher score would mean that the lead is hot, i.e. is most likely to convert whereas a lower score would mean that the lead is cold and will mostly not get converted.
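
As a rough illustration of what such a score means, the sketch below maps a logistic regression output (the log-odds) to a probability and then to a 0-100 score. The function name `lead_score` is hypothetical, and rounding to the nearest integer is an assumption rather than a stated requirement.

```python
import math

def lead_score(log_odds: float) -> int:
    """Map a log-odds value to a 0-100 lead score (hypothetical helper)."""
    # The logistic (sigmoid) function turns log-odds into a probability in (0, 1).
    prob = 1.0 / (1.0 + math.exp(-log_odds))
    # Scale the probability to 0-100 and round to the nearest integer.
    return round(prob * 100)

# Log-odds of 0 means a 50% conversion probability.
print(lead_score(0.0))
```

A lead with strongly positive log-odds ends up near 100 (hot), and one with strongly negative log-odds ends up near 0 (cold).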

There are some more problems presented by the company which your model should be able to adjust to if the company's requirement changes in the future so you will need to handle these as well. These problems are provided in a separate doc file. Please fill it based on the logistic regression model you got in the first step. Also, make sure you include this in your final PPT where you'll make recommendations.

In [1]:
#importing the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler

In [2]:
#importing dataset

leads_df=pd.read_csv("Leads.csv")
leads_df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Leads.csv'

In [None]:
leads_df.shape

In [None]:
leads_df.info()

In [None]:
leads_df

In [None]:
leads_df.describe()

### EXPLORATORY DATA ANALYSIS
### Data understanding, preparation of the data

In [None]:
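#check for duplicates in the two ID columns
#(print both results; a notebook cell only displays its last expression)

print(leads_df.duplicated(subset = 'Prospect ID').sum() == 0)
print(leads_df.duplicated(subset = 'Lead Number').sum() == 0)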

No duplicate values found in Prospect ID & Lead Number in the dataset


Prospect ID & Lead Number only identify the people approached, so they carry no predictive information and can be dropped.

In [None]:
#dropping Lead Number and Prospect ID since they have all unique values

leads_df.drop(['Prospect ID', 'Lead Number'], axis=1, inplace = True)

In [None]:
#Converting 'Select' values to NaN.

leads_df = leads_df.replace('Select', np.nan)

In [None]:
#checking null values in each column

leads_df.isnull().sum()

In [None]:
#checking percentage of null values in each column

round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)

In [None]:
#dropping cols with 45% or more missing values

cols=leads_df.columns

for i in cols:
    if((100*(leads_df[i].isnull().sum()/len(leads_df.index))) >= 45):
        leads_df.drop(i, axis=1, inplace = True)

In [None]:
#checking null values percentage

round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)

##### Categorical Variable Analysis:

In [None]:
#checking value counts of Country column

leads_df['Country'].value_counts(dropna=False)

In [None]:
#plotting spread of Country column
plt.figure(figsize=(10,5))
s1=sns.countplot(x=leads_df.Country, hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

As we can see, India accounts for most of the values, so this column adds little information and can be dropped.

In [None]:
#creating a list of columns to be dropped

cols_to_drop=['Country']

In [None]:
#checking value counts of "City" column

leads_df['City'].value_counts(dropna=False)

In [None]:
leads_df['City'] = leads_df['City'].replace(np.nan,'Mumbai')

In [None]:
#plotting spread of City column after replacing NaN values

plt.figure(figsize=(10,5))
s1=sns.countplot(x=leads_df.City, hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

In [None]:
#checking value counts of Specialization column

leads_df['Specialization'].value_counts(dropna=False)

In [None]:
# A lead may not have mentioned a specialization because it was not in the list, or because they are a student
# and don't have a specialization yet. So we replace NaN values here with 'Not Specified'.

leads_df['Specialization'] = leads_df['Specialization'].replace(np.nan, 'Not Specified')

In [None]:
#plotting spread of Specialization column

plt.figure(figsize=(10,5))
s1=sns.countplot(x=leads_df.Specialization, hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

From the graph, we see that the Management specializations have a higher number of leads as well as converted leads. So this is definitely a significant variable and should not be dropped.

In [None]:
#combining Management Specializations because they show similar trends

leads_df['Specialization'] = leads_df['Specialization'].replace(['Finance Management','Human Resource Management',
                                                           'Marketing Management','Operations Management',
                                                           'IT Projects Management','Supply Chain Management',
                                                    'Healthcare Management','Hospitality Management',
                                                           'Retail Management'] ,'Management_Specializations')  

In [None]:
#visualizing count of Variable based on Converted value

plt.figure(figsize=(15,5))
s1=sns.countplot(x=leads_df.Specialization, hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

In [None]:
#What is your current occupation

leads_df['What is your current occupation'].value_counts(dropna=False)

In [None]:
#imputing Nan values with mode "Unemployed"

leads_df['What is your current occupation'] = leads_df['What is your current occupation'].replace(np.nan, 'Unemployed')

In [None]:
#checking count of values
leads_df['What is your current occupation'].value_counts(dropna=False)

In [None]:
#visualizing count of Variable based on Converted value
s1=sns.countplot(x=leads_df['What is your current occupation'], hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Working professionals have a high chance of converting.
Unemployed leads are the most numerous in absolute terms.

In [None]:
#checking value counts

leads_df['What matters most to you in choosing a course'].value_counts(dropna=False)

In [None]:
#replacing Nan values with Mode "Better Career Prospects"

leads_df['What matters most to you in choosing a course'] = leads_df['What matters most to you in choosing a course'].replace(np.nan,'Better Career Prospects')

In [None]:
#visualizing count of Variable based on Converted value

s1=sns.countplot(x=leads_df['What matters most to you in choosing a course'], hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

This variable is heavily skewed toward one value, so it adds little information to the model.

In [None]:
#checking value counts of variable
leads_df['What matters most to you in choosing a course'].value_counts(dropna=False)

In [None]:
#Here again we have another Column that is worth Dropping. So we Append to the cols_to_drop List
cols_to_drop.append('What matters most to you in choosing a course')
cols_to_drop

In [None]:
#checking value counts of Tag variable
leads_df['Tags'].value_counts(dropna=False)

In [None]:
#replacing Nan values with "Not Specified"
leads_df['Tags'] = leads_df['Tags'].replace(np.nan,'Not Specified')

In [None]:
#visualizing count of Variable based on Converted value

plt.figure(figsize=(15,5))
s1=sns.countplot(x=leads_df['Tags'], hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Some of the tags have very low frequency, so we group them under a common label.

In [None]:
#replacing tags with low frequency with "Other Tags"
leads_df['Tags'] = leads_df['Tags'].replace(['In confusion whether part time or DLP', 'in touch with EINS','Diploma holder (Not Eligible)',
                                     'Approached upfront','Graduation in progress','number not provided', 'opp hangup','Still Thinking',
                                    'Lost to Others','Shall take in the next coming month','Lateral student','Interested in Next batch',
                                    'Recognition issue (DEC approval)','Want to take admission but has financial problems',
                                    'University not recognized'], 'Other_Tags')

leads_df['Tags'] = leads_df['Tags'].replace(['switched off',
                                      'Already a student',
                                       'Not doing further education',
                                       'invalid number',
                                       'wrong number given',
                                       'Interested  in full time MBA'] , 'Other_Tags')

In [None]:
#checking percentage of missing values
round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)

In [None]:
#checking value counts of Lead Source column

leads_df['Lead Source'].value_counts(dropna=False)

In [None]:
#replacing Nan Values and combining low frequency values
leads_df['Lead Source'] = leads_df['Lead Source'].replace(np.nan,'Others')
leads_df['Lead Source'] = leads_df['Lead Source'].replace('google','Google')
leads_df['Lead Source'] = leads_df['Lead Source'].replace('Facebook','Social Media')
leads_df['Lead Source'] = leads_df['Lead Source'].replace(['bing','Click2call','Press_Release',
                                                     'youtubechannel','welearnblog_Home',
                                                     'WeLearn','blog','Pay per Click Ads',
                                                    'testone','NC_EDM'] ,'Others') 

We can group some of the low-frequency labels under a common label, 'Others'.
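
The replacement above lists each rare label by hand. A minimal sketch of the same idea driven by a frequency threshold is shown below; `group_rare` and `min_count` are hypothetical names, and the toy series is made up purely for illustration.

```python
import pandas as pd

def group_rare(series, min_count, other_label="Others"):
    """Replace labels that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the common label.
    return series.where(~series.isin(rare), other_label)

# Toy data (not taken from Leads.csv).
sources = pd.Series(["Google"] * 5 + ["Direct Traffic"] * 4 + ["bing", "blog"])
grouped = group_rare(sources, min_count=3)
print(grouped.value_counts())
```

A threshold keeps the grouping rule explicit, but an explicit list (as used in this notebook) gives finer control over which labels are merged.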

In [None]:
#visualizing count of Variable based on Converted value
plt.figure(figsize=(15,5))
s1=sns.countplot(x=leads_df['Lead Source'], hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Observations from the graph
1. The maximum number of leads is generated by Google and Direct Traffic.
2. The conversion rate of Reference leads and leads through the Welingak Website is high.
3. To improve the overall lead conversion rate, focus should be on improving lead conversion of Olark Chat, Organic Search, Direct Traffic, and Google leads, and on generating more leads from Reference and the Welingak Website.

In [None]:
#Lead Origin
leads_df['Lead Origin'].value_counts(dropna=False)

In [None]:
#visualizing count of Variable based on Converted value

plt.figure(figsize=(8,5))
s1=sns.countplot(x=leads_df['Lead Origin'], hue=leads_df.Converted)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Observations

1. API and Landing Page Submission bring a higher number of leads as well as conversions.
2. Lead Add Form has a very high conversion rate, but the count of leads is not very high.
3. Lead Import and Quick Add Form get very few leads.
4. To improve the overall lead conversion rate, we have to improve lead conversion of the API and Landing Page Submission origins and generate more leads from Lead Add Form.
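
The observations above are read off the count plots. To back them with numbers, a small sketch like the following could tabulate lead counts and conversion rates per category. The helper name `conversion_summary` is hypothetical and the toy frame is invented for illustration; only the column names `Lead Origin` and `Converted` come from the dataset.

```python
import pandas as pd

def conversion_summary(df, feature, target="Converted"):
    """Lead count and conversion rate (in %) for each level of a categorical feature."""
    summary = df.groupby(feature)[target].agg(leads="count", conversion_rate="mean")
    summary["conversion_rate"] = (100 * summary["conversion_rate"]).round(1)
    return summary.sort_values("leads", ascending=False)

# Toy data (not taken from Leads.csv).
demo = pd.DataFrame({
    "Lead Origin": ["API"] * 4 + ["Lead Add Form"] * 2,
    "Converted": [1, 0, 0, 0, 1, 1],
})
result = conversion_summary(demo, "Lead Origin")
print(result)
```

On the real data this would be called as `conversion_summary(leads_df, 'Lead Origin')` or with any other categorical column.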

In [None]:
plt.figure(figsize=(16, 16))

# one count plot per feature, split by conversion outcome
for i, feature in enumerate(['Lead Source', 'Lead Origin']):
    plt.subplot(3, 3, i+1)
    sns.countplot(x=feature, hue="Converted",data=leads_df)
    plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

**OBSERVATION:**
- Despite having a relatively lower conversion rate of approximately 30%, both API and Landing Page Submission generate a substantial number of leads.
- Conversely, the Lead Add Form generates a significantly lower count of leads, yet boasts a notably high conversion rate.
- Lead Import contributes negligibly to both lead count and conversion rate and can be disregarded.
- To enhance the overall lead conversion rate, efforts should be directed towards improving the conversion rates of API and Landing Page Submission, while simultaneously increasing lead generation from Lead Add Form.

In [None]:
# Last Activity:

leads_df['Last Activity'].value_counts(dropna=False)

In [None]:
sns.countplot(x="Last Activity", hue="Converted", data= leads_df)
plt.xticks( rotation='vertical')
plt.show()

**OBSERVATION:**

- The highest count among last activities is recorded for "Email Opened".
- The maximum conversion rate is observed for the last activity being "SMS Sent".

**We should focus on increasing the conversion rate of leads whose last activity is Email Opened by calling them, and also try to increase the count of leads whose last activity is SMS Sent.**

In [None]:
#replacing Nan Values and combining low frequency values

leads_df['Last Activity'] = leads_df['Last Activity'].replace(np.nan,'Others')
leads_df['Last Activity'] = leads_df['Last Activity'].replace(['Unreachable','Unsubscribed',
                                                        'Had a Phone Conversation', 
                                                        'Approached upfront',
                                                        'View in browser link Clicked',       
                                                        'Email Marked Spam',                  
                                                        'Email Received','Resubscribed to emails',
                                                         'Visited Booth in Tradeshow'],'Others')

In [None]:
# Last Activity:

leads_df['Last Activity'].value_counts(dropna=False)

In [None]:
#Check the Null Values in All Columns:
round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)

In [None]:
#Drop all rows which have Nan Values. Since the number of Dropped rows is less than 2%, it will not affect the model
leads_df = leads_df.dropna()

In [None]:
#Checking percentage of Null Values in All Columns:
round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)

In [None]:
#Do Not Email & Do Not Call
#visualizing count of Variable based on Converted value

plt.figure(figsize=(15,5))

ax1=plt.subplot(1, 2, 1)
ax1=sns.countplot(x=leads_df['Do Not Call'], hue=leads_df.Converted)
ax1.set_xticklabels(ax1.get_xticklabels(),rotation=90)

ax2=plt.subplot(1, 2, 2)
ax2=sns.countplot(x=leads_df['Do Not Email'], hue=leads_df.Converted)
ax2.set_xticklabels(ax2.get_xticklabels(),rotation=90)
plt.show()

In [None]:
#checking value counts for Do Not Call
leads_df['Do Not Call'].value_counts(dropna=False)

In [None]:
#checking value counts for Do Not Email
leads_df['Do Not Email'].value_counts(dropna=False)

We can append the Do Not Call column to the list of columns to be dropped, since more than 90% of its values are a single value.

In [None]:
cols_to_drop.append('Do Not Call')
cols_to_drop

IMBALANCED VARIABLES THAT CAN BE DROPPED
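
The cells that follow inspect each column's value counts one at a time. As a sketch of how the same check could be automated, the hypothetical helper `dominant_columns` below reports every column whose most frequent value reaches a given share. The 90% threshold mirrors the rule of thumb used for Do Not Call, and the toy frame is invented for illustration.

```python
import pandas as pd

def dominant_columns(df, threshold=0.90):
    """Return {column: share of its most frequent value} for near-constant columns."""
    shares = {}
    for col in df.columns:
        # value_counts sorts descending, so the first entry is the top share.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            shares[col] = round(float(top_share), 3)
    return shares

# Toy data (not taken from Leads.csv).
demo = pd.DataFrame({
    "Do Not Call": ["No"] * 19 + ["Yes"],
    "Do Not Email": ["No"] * 12 + ["Yes"] * 8,
})
print(dominant_columns(demo))
```

The returned keys could be passed straight to `cols_to_drop.extend(...)`, although reviewing each column manually, as done here, remains the safer choice.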

In [None]:
leads_df.Search.value_counts(dropna=False)

In [None]:
leads_df.Magazine.value_counts(dropna=False)

In [None]:
leads_df['Newspaper Article'].value_counts(dropna=False)

In [None]:
leads_df['X Education Forums'].value_counts(dropna=False)

In [None]:
leads_df['Newspaper'].value_counts(dropna=False)

In [None]:
leads_df['Digital Advertisement'].value_counts(dropna=False)

In [None]:
leads_df['Through Recommendations'].value_counts(dropna=False)

In [None]:
leads_df['Receive More Updates About Our Courses'].value_counts(dropna=False)

In [None]:
leads_df['Update me on Supply Chain Content'].value_counts(dropna=False)

In [None]:
leads_df['Get updates on DM Content'].value_counts(dropna=False)

In [None]:
leads_df['I agree to pay the amount through cheque'].value_counts(dropna=False)

In [None]:
leads_df['A free copy of Mastering The Interview'].value_counts(dropna=False)

In [None]:
#adding imbalanced columns to the list of columns to be dropped

cols_to_drop.extend(['Search','Magazine','Newspaper Article','X Education Forums','Newspaper',
                 'Digital Advertisement','Through Recommendations','Receive More Updates About Our Courses',
                 'Update me on Supply Chain Content',
                 'Get updates on DM Content','I agree to pay the amount through cheque'])

In [None]:
#checking value counts of last Notable Activity
leads_df['Last Notable Activity'].value_counts()

In [None]:
#clubbing lower frequency values
leads_df['Last Notable Activity'] = leads_df['Last Notable Activity'].replace(['Had a Phone Conversation',
                                                                       'Email Marked Spam',
                                                                         'Unreachable',
                                                                         'Unsubscribed',
                                                                         'Email Bounced',                                                                    
                                                                       'Resubscribed to emails',
                                                                       'View in browser link Clicked',
                                                                       'Approached upfront', 
                                                                       'Form Submitted on Website', 
                                                                       'Email Received'],'Other_Notable_activity')

In [None]:
#visualizing count of Variable based on Converted value

plt.figure(figsize = (14,5))
ax1=sns.countplot(x = "Last Notable Activity", hue = "Converted", data = leads_df)
ax1.set_xticklabels(ax1.get_xticklabels(),rotation=90)
plt.show()

In [None]:
#checking value counts for variable

leads_df['Last Notable Activity'].value_counts()

In [None]:
#list of columns to be dropped
cols_to_drop

In [None]:
#dropping columns
leads_df = leads_df.drop(cols_to_drop, axis=1)
leads_df.info()

These columns have been dropped because they have little impact on the analysis.

##### EDA on Numerical variables

In [None]:
#Check the % of Data that has Converted Values = 1:

Converted = (sum(leads_df['Converted'])/len(leads_df['Converted'].index))*100
Converted

In [None]:
#Checking correlations of numeric values
# figure size
plt.figure(figsize=(10,8))

# heatmap
sns.heatmap(leads_df.select_dtypes(include='number').corr(), cmap="YlGnBu", annot=True)
plt.show()

The Total Time Spent on Website variable is correlated with Converted.

For numerical variables, we plot boxplots to check for any outliers present in the data.

In [None]:
#Total Visits
#visualizing spread of variable

plt.figure(figsize=(6,4))
sns.boxplot(y=leads_df['TotalVisits'])
plt.show()

From the boxplot we can see that there are outliers in the data.
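
One common way to handle such outliers is to keep only the rows that lie between two percentiles, which is the approach this notebook takes for TotalVisits and Page Views Per Visit. The sketch below wraps that in a hypothetical helper, `trim_outliers`. Unlike the notebook cells, it computes both bounds on the same data before filtering, and the toy frame is invented for illustration.

```python
import pandas as pd

def trim_outliers(df, column, lower=0.01, upper=0.99):
    """Keep rows whose value in `column` lies between the given quantiles (inclusive)."""
    lo, hi = df[column].quantile([lower, upper])
    return df[df[column].between(lo, hi)]

# Toy data (not taken from Leads.csv): 1..99 plus one extreme value.
demo = pd.DataFrame({"TotalVisits": list(range(1, 100)) + [250]})
trimmed = trim_outliers(demo, "TotalVisits")
print(len(demo), len(trimmed), trimmed["TotalVisits"].max())
```

Capping values at the percentile bounds instead of dropping rows is an alternative when losing observations is a concern.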

In [None]:
#checking percentile values for "Total Visits"

leads_df['TotalVisits'].describe(percentiles=[0.05,.25, .5, .75, .90, .95, .99])

In [None]:
#Outlier Treatment: remove the top & bottom 1% of values
#(the bounds are the 99th and 1st percentiles, not quartiles)

upper = leads_df.TotalVisits.quantile(0.99)
leads_df = leads_df[(leads_df.TotalVisits <= upper)]
lower = leads_df.TotalVisits.quantile(0.01)
leads_df = leads_df[(leads_df.TotalVisits >= lower)]
sns.boxplot(y=leads_df['TotalVisits'])
plt.show()

In [None]:
leads_df.shape

In [None]:
#checking percentiles for "Total Time Spent on Website"

leads_df['Total Time Spent on Website'].describe(percentiles=[0.05,.25, .5, .75, .90, .95, .99])

In [None]:
#visualizing spread of numeric variable

plt.figure(figsize=(6,4))
sns.boxplot(y=leads_df['Total Time Spent on Website'])
plt.show()

Since there are no major outliers for the above variable, we don't do any outlier treatment for this column.

Check for Page Views Per Visit:

In [None]:
#checking spread of "Page Views Per Visit"

leads_df['Page Views Per Visit'].describe()

In [None]:
#visualizing spread of numeric variable

plt.figure(figsize=(6,4))
sns.boxplot(y=leads_df['Page Views Per Visit'])
plt.show()

In [None]:
#Outlier Treatment: remove the top & bottom 1% of values
#(the bounds are the 99th and 1st percentiles, not quartiles)

upper = leads_df['Page Views Per Visit'].quantile(0.99)
leads_df = leads_df[leads_df['Page Views Per Visit'] <= upper]
lower = leads_df['Page Views Per Visit'].quantile(0.01)
leads_df = leads_df[leads_df['Page Views Per Visit'] >= lower]
sns.boxplot(y=leads_df['Page Views Per Visit'])
plt.show()

In [None]:
leads_df.shape

In [None]:
#checking Spread of "Total Visits" vs Converted variable
sns.boxplot(y = 'TotalVisits', x = 'Converted', data = leads_df)
plt.show()

Observations

1. The medians for converted and not converted leads are close.
2. Nothing conclusive can be said on the basis of Total Visits.


In [None]:
#checking Spread of "Total Time Spent on Website" vs Converted variable

sns.boxplot(x=leads_df.Converted, y=leads_df['Total Time Spent on Website'])
plt.show()

Observations:

1. Leads spending more time on the website are more likely to be converted.
2. Website should be made more engaging to make leads spend more time.

In [None]:
#checking Spread of "Page Views Per Visit" vs Converted variable

sns.boxplot(x=leads_df.Converted,y=leads_df['Page Views Per Visit'])
plt.show()

Observations

1. Median for converted and unconverted leads is the same.
2. Nothing can be said specifically for lead conversion from Page Views Per Visit

In [None]:
#checking missing values in leftover columns

round(100*(leads_df.isnull().sum()/len(leads_df.index)),2)

There are no missing values in the columns to be analyzed further

##### Creating dummy variables

In [None]:
#getting a list of categorical columns

cat_cols= leads_df.select_dtypes(include=['object']).columns
cat_cols

In [None]:
# List of variables to map (named varlist to avoid shadowing the built-in list)

varlist =  ['A free copy of Mastering The Interview','Do Not Email']

# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

# Applying the function to the binary columns
leads_df[varlist] = leads_df[varlist].apply(binary_map)

In [None]:
#getting dummies and dropping the first column and adding the results to the master dataframe
#dtype='uint8' keeps the dummies numeric (pandas >= 2.0 would otherwise return booleans)
dummy = pd.get_dummies(leads_df[['Lead Origin','What is your current occupation',
                             'City']], drop_first=True, dtype='uint8')

leads_df = pd.concat([leads_df,dummy], axis=1)

In [None]:
dummy = pd.get_dummies(leads_df['Specialization'], prefix  = 'Specialization', dtype='uint8')
dummy = dummy.drop(['Specialization_Not Specified'], axis=1)
leads_df = pd.concat([leads_df, dummy], axis = 1)

In [None]:
dummy = pd.get_dummies(leads_df['Lead Source'], prefix  = 'Lead Source', dtype='uint8')
dummy = dummy.drop(['Lead Source_Others'], axis=1)
leads_df = pd.concat([leads_df, dummy], axis = 1)

In [None]:
dummy = pd.get_dummies(leads_df['Last Activity'], prefix  = 'Last Activity', dtype='uint8')
dummy = dummy.drop(['Last Activity_Others'], axis=1)
leads_df = pd.concat([leads_df, dummy], axis = 1)

In [None]:
dummy = pd.get_dummies(leads_df['Last Notable Activity'], prefix  = 'Last Notable Activity', dtype='uint8')
dummy = dummy.drop(['Last Notable Activity_Other_Notable_activity'], axis=1)
leads_df = pd.concat([leads_df, dummy], axis = 1)

In [None]:
dummy = pd.get_dummies(leads_df['Tags'], prefix  = 'Tags', dtype='uint8')
dummy = dummy.drop(['Tags_Not Specified'], axis=1)
leads_df = pd.concat([leads_df, dummy], axis = 1)

In [None]:
#dropping the original columns after dummy variable creation

leads_df.drop(cat_cols, axis=1, inplace = True)

In [None]:
leads_df.head()

###### To build the logistic regression model, we split the data into a train set and a test set

In [None]:
from sklearn.model_selection import train_test_split

# Putting response variable to y
y = leads_df['Converted']

y.head()

X=leads_df.drop('Converted', axis=1)

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
X_train.info()

##### We will scale the numeric columns because they are on very different ranges
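
Standardization subtracts each column's mean and divides by its standard deviation. The statistics are learned from the train set only, and the test set is then typically transformed with those same statistics so that no information leaks from test to train. The NumPy sketch below illustrates this fit/transform split. The function names are hypothetical and the arrays are toy values, and like `StandardScaler` it uses the population standard deviation.

```python
import numpy as np

def fit_standardizer(train):
    """Learn per-column mean and (population) standard deviation from the train set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)          # ddof=0, i.e. population standard deviation
    std[std == 0] = 1.0              # guard against constant columns
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics to any data set."""
    return (values - mean) / std

# Toy data: two numeric features on very different ranges.
train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
test = np.array([[2.0, 40.0]])

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # reuses the train statistics
print(test_scaled)
```

With scikit-learn the same pattern is `scaler.fit_transform(...)` on the train columns followed by `scaler.transform(...)` on the test columns.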

In [None]:
#scaling numeric columns

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

num_cols=X_train.select_dtypes(include=['float64', 'int64']).columns

X_train[num_cols] = scaler.fit_transform(X_train[num_cols])

X_train.head()

##### Model Building using Stats Model & RFE:

In [None]:
import statsmodels.api as sm

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=15)             # running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

In [None]:
#list of RFE supported columns
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

In [None]:
y_train

We have split the data into train and test sets. Now we will build the model and then make predictions on the test data.

# Logistic regression model 1

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm1 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm1.fit()
res.summary()

From the first model we can see that the p-value for the variable Lead Origin_Lead Add Form is high, so we drop that variable and build the second model.

In [None]:
#dropping column with high p-value
col = col.drop('Lead Origin_Lead Add Form')

# Logistic regression model 2

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

From the second model we can see that the p-value for the variable Tags_Closed by Horizzon is high, so we drop that variable and build the third model.

In [None]:
#dropping column with high p-value
col = col.drop('Tags_Closed by Horizzon')

# Logistic regression model 3

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm3 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm3.fit()
res.summary()

The p-value of the variable Last Notable Activity_Modified is high, so we can drop it.

In [None]:
#dropping column with high p-value

col = col.drop('Last Notable Activity_Modified')

# Logistic regression model 4

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm4 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm4.fit()
res.summary()

The p-value of the variable Last Activity_Page Visited on Website is high, so we can drop it.

In [None]:
#dropping column with high p-value

col = col.drop('Last Activity_Page Visited on Website')

# Logistic regression model 5

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm5 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm5.fit()
res.summary()

Since all the p-values are now low, we check the Variance Inflation Factor (VIF) to see whether there is any multicollinearity among the variables.

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

The VIF values are low for all the variables, so we move on to derive the probabilities, lead score, and predictions on the train data:
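
For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. A value near 1 means the feature is not explained by the others, while a large value signals multicollinearity. The NumPy sketch below computes this from scratch on toy columns. The function name `vif` is hypothetical, and because the sketch adds an intercept to each auxiliary regression, its numbers can differ from those returned by the statsmodels call used above.

```python
import numpy as np

def vif(X):
    """VIF for each column of X, using an auxiliary least-squares fit with intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Toy columns: a and b are uncorrelated, c is almost a copy of a.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
b = np.array([1.0, -1.0, 0.0, 0.0, -1.0, 1.0])
c = a + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent, vif_collinear)
```

In the collinear case the VIFs of `a` and `c` blow up because each is almost fully explained by the other.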

In [None]:
# Getting the Predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_prob':y_train_pred})
y_train_pred_final['Prospect ID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final.Converted_prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
from sklearn import metrics

# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate False Positive Rate - predicting conversion when the lead did not convert
print(FP/ float(TN+FP))

In [None]:
# positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

##### PLOTTING ROC CURVE

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, y_train_pred_final.Converted_prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_prob)

The area under the ROC curve (AUC) should be close to 1. We are getting a good value of 0.97, indicating a good predictive model.
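
One way to read the area under the ROC curve is as the probability that a randomly chosen converted lead receives a higher predicted probability than a randomly chosen non-converted lead. The pure-Python sketch below computes the area from that definition on toy values. The name `pairwise_auc` is hypothetical, and the quadratic loop is meant for illustration rather than for large data sets.

```python
def pairwise_auc(actual, probs):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(actual, probs) if y == 1]
    neg = [p for y, p in zip(actual, probs) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores (not model output).
actual = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(actual, probs))
```

Here three of the four positive/negative pairs are ordered correctly, so the area is 0.75.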

##### Finding Optimal Cutoff Point

Above we had chosen an arbitrary cut-off value of 0.5. We need to determine the best cut-off value and the below section deals with that:
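
A common heuristic, and the one the plot further down relies on, is to pick the cutoff where sensitivity and specificity are closest to each other. The pure-Python sketch below shows that search on toy values. The function names are hypothetical, and the labels and probabilities are invented, so the resulting cutoff says nothing about the real model.

```python
def sensitivity_specificity(actual, probs, cutoff):
    """Sensitivity and specificity when predicting 1 for probabilities above cutoff."""
    tp = fn = tn = fp = 0
    for y, p in zip(actual, probs):
        pred = 1 if p > cutoff else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec

def balanced_cutoff(actual, probs, cutoffs):
    """Cutoff at which |sensitivity - specificity| is smallest."""
    def gap(c):
        sens, spec = sensitivity_specificity(actual, probs, c)
        return abs(sens - spec)
    return min(cutoffs, key=gap)

# Toy labels and probabilities (not model output).
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.45, 0.22, 0.35, 0.65, 0.70, 0.80, 0.90]
cutoffs = [x / 10 for x in range(1, 10)]
best = balanced_cutoff(actual, probs, cutoffs)
print(best, sensitivity_specificity(actual, probs, best))
```

If the business cares more about catching every potential conversion than about wasted calls, a lower cutoff that favours sensitivity may be preferable to the balanced one.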

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)


In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()
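
One common way to read a plot of accuracy, sensitivity and specificity against the cutoff is to pick the cutoff where the sensitivity and specificity curves cross. The sketch below does this numerically on made-up data. The helpers `sensi_speci_at` and `balanced_cutoff` and the toy values are assumptions for illustration, so the cutoff they return says nothing about the leads data.

```python
def sensi_speci_at(actual, probs, cutoff):
    """Sensitivity and specificity when predicting 1 for prob > cutoff.

    Assumes both classes are present in `actual`.
    """
    tp = sum(1 for a, p in zip(actual, probs) if a == 1 and p > cutoff)
    fn = sum(1 for a, p in zip(actual, probs) if a == 1 and p <= cutoff)
    tn = sum(1 for a, p in zip(actual, probs) if a == 0 and p <= cutoff)
    fp = sum(1 for a, p in zip(actual, probs) if a == 0 and p > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(actual, probs, cutoffs):
    """Cutoff at which sensitivity and specificity are closest to each other."""
    def gap(c):
        sensi, speci = sensi_speci_at(actual, probs, c)
        return abs(sensi - speci)
    return min(cutoffs, key=gap)

# Toy labels and probabilities, not taken from the leads data
toy_actual = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
toy_probs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
grid = [x / 10 for x in range(10)]  # same 0.0 to 0.9 grid as in the cells above
best = balanced_cutoff(toy_actual, toy_probs, grid)
print(best, sensi_speci_at(toy_actual, toy_probs, best))
```

A finer grid (for example steps of 0.01) would locate the crossing more precisely than steps of 0.1.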

In [None]:
#### From the curve above, 0.3 is the optimum point to take it as a cutoff probability.

y_train_pred_final['final_Predicted'] = y_train_pred_final.Converted_prob.map( lambda x: 1 if x > 0.3 else 0)

y_train_pred_final.head()

In [None]:
y_train_pred_final['Lead_Score'] = y_train_pred_final.Converted_prob.map( lambda x: round(x*100))

y_train_pred_final[['Converted','Converted_prob','Prospect ID','final_Predicted','Lead_Score']].head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_Predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_Predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

Observation:

As we can see above, the model seems to be performing well. The area under the ROC curve is 0.97, which is very good. We have the following values for the train data:


Accuracy : 90.81%

Sensitivity : 92.05%

Specificity : 90.23%

Some other statistics are derived below: False Positive Rate, Positive Predictive Value, Negative Predictive Value, Precision and Recall.

In [None]:
# Calculate False Positive Rate - predicting conversion when the customer has not converted
print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

In [None]:
#Looking at the confusion matrix again

confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_Predicted )
confusion

In [None]:
##### Precision
# Precision = TP / (TP + FP)

confusion[1,1]/(confusion[0,1]+confusion[1,1])

In [None]:
##### Recall
# Recall = TP / (TP + FN)

confusion[1,1]/(confusion[1,0]+confusion[1,1])

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted , y_train_pred_final.final_Predicted)

In [None]:

recall_score(y_train_pred_final.Converted, y_train_pred_final.final_Predicted)

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
# Precision and recall at every threshold, computed from the predicted probabilities
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_prob)

In [None]:
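# Precision (green) and recall (red) against the probability threshold
plt.plot(thresholds, p[:-1], "g-", label="precision")
plt.plot(thresholds, r[:-1], "r-", label="recall")
plt.legend()
plt.show()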

In [None]:
#scaling test set

num_cols=X_test.select_dtypes(include=['float64', 'int64']).columns

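# Use transform (not fit_transform) so the test set is scaled with the parameters learned from the training set
X_test[num_cols] = scaler.transform(X_test[num_cols])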

X_test.head()
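
A note on scaling: the scaling parameters are normally learned from the training data only and then reused unchanged on the test data, so that no information from the test set leaks into the preprocessing. The sketch below shows the idea with plain Python standardization. The helpers `fit_standardizer` and `apply_standardizer` and the sample values are assumptions for illustration and are not the `scaler` object used in this notebook.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)

def apply_standardizer(values, mu, sigma):
    """Standardize values with parameters learned elsewhere (no refitting)."""
    return [(v - mu) / sigma for v in values]

train_minutes = [10.0, 20.0, 30.0]  # hypothetical training feature
test_minutes = [20.0, 40.0]         # hypothetical test feature

# Fit on train only, then apply the same parameters to both sets
mu, sigma = fit_standardizer(train_minutes)
train_scaled = apply_standardizer(train_minutes, mu, sigma)
test_scaled = apply_standardizer(test_minutes, mu, sigma)
print(test_scaled)
```

Refitting on the test set would instead centre the test values on their own mean, so the same raw value would map to different scaled values in train and test.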

In [None]:
X_test = X_test[col]
X_test.head()

In [None]:
X_test

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
X_test_sm

##### Making predictions on the test dataset

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head
y_pred_1.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

In [None]:
# Copying the index into a 'Prospect ID' column
y_test_df['Prospect ID'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_prob'})

In [None]:
y_pred_final.head()

In [None]:
# Rearranging the columns
y_pred_final = y_pred_final[['Prospect ID','Converted','Converted_prob']]
y_pred_final['Lead_Score'] = y_pred_final.Converted_prob.map( lambda x: round(x*100))

In [None]:
# Let's see the head of y_pred_final
y_pred_final.head()

In [None]:
y_pred_final['final_Predicted'] = y_pred_final.Converted_prob.map(lambda x: 1 if x > 0.3 else 0)

In [None]:
y_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_Predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_Predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
precision_score(y_pred_final.Converted , y_pred_final.final_Predicted)

In [None]:
recall_score(y_pred_final.Converted, y_pred_final.final_Predicted)

Observations:
    
After running the model on the Test Data these are the figures we obtain:
    

Accuracy : 90.92%
    
Sensitivity : 91.41%
    
Specificity : 90.62%
    
    
Final Observation:
    
Comparing the values obtained for Train & Test:
    

Train Data:
    
Accuracy : 90.81%

Sensitivity : 92.05%

Specificity : 90.10%
    
Test Data: 
    
Accuracy : 90.92%
    
Sensitivity : 91.41%
    
Specificity : 90.62%

Final conclusion from the logistic regression model:
    
- The model seems to predict conversion very well, and we should be able to give the CEO confidence in making good calls based on this model.
- The final model has a sensitivity of 0.91, which means it correctly identifies 91% of all the customers who actually converted.
- While we have checked both Sensitivity-Specificity and Precision-Recall, we have used the optimal cutoff based on Sensitivity and Specificity for calculating the final prediction.
- The accuracy, sensitivity and specificity on the test set are around 90%, 91% and 90%, which are close to the respective values calculated on the training set.
- Also, the lead scores calculated on the training set indicate a conversion rate of around 90% for the final model's predictions.
- Hence, overall, this model seems to be good.

In [None]:
#  1.	Which are the top three variables in your model which contribute most towards the probability of a lead getting converted?

# From our model, the top three variables are:
# Tags_Lost to EINS
# What is your current occupation_Working Professional
# Total Time Spent on Website

In [None]:
# 2.	What are the top 3 categorical/dummy variables in the model which should be focused the most on in order to increase the probability of lead conversion?

#Tags_Lost to EINS
#Tags_Interested in other courses
#Last Activity_Email Bounced
# All these variables have a low VIF

In [None]:
# 3.	X Education has a period of 2 months every year during which they hire some interns. The sales team, in particular, has around 10 interns allotted to them. So during this phase, they wish to make the lead conversion more aggressive. So they want almost all of the potential leads (i.e. the customers who have been predicted as 1 by the model) to be converted and hence, want to make phone calls to as much of such people as possible. Suggest a good strategy they should employ at this stage.



In [None]:
# 4.	Similarly, at times, the company reaches its target for a quarter before the deadline. During this time, the company wants the sales team to focus on some new work as well. So during this time, the company’s aim is to not make phone calls unless it’s extremely necessary, i.e. they want to minimize the rate of useless phone calls. Suggest a strategy they should employ at this stage.
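
One possible way to frame questions 3 and 4 is as a choice of lead-score threshold: a lower threshold flags more leads for calling (higher sensitivity, more wasted calls), while a higher threshold flags fewer leads (fewer wasted calls, more missed conversions). The sketch below is only an illustration of that idea. The helper `leads_to_call`, the prospect IDs, the scores, and the thresholds 30 and 80 are all hypothetical and are not results from this model.

```python
def leads_to_call(lead_scores, min_score):
    """Return IDs of leads whose score (0-100) is at least min_score."""
    return [lead_id for lead_id, score in lead_scores.items() if score >= min_score]

# Hypothetical lead scores keyed by a made-up prospect ID
scores = {"A": 12, "B": 35, "C": 58, "D": 81, "E": 96}

# Intern period: plenty of calling capacity, so call widely
aggressive = leads_to_call(scores, min_score=30)

# Target already met: call only the leads the model is most sure about
conservative = leads_to_call(scores, min_score=80)

print(aggressive)
print(conservative)
```

In practice the two thresholds would be read off the sensitivity-specificity and precision-recall plots above rather than fixed in advance.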

