## Lead Score - Case Study


## Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When these people fill out a form providing their email address or phone number, they are classified as leads. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

There are a lot of leads generated in the initial stage, but only a few of them come out as paying customers. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc.) in order to get a higher lead conversion.

X Education needs help to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires us to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.

## Goals and Objectives
There are quite a few goals for this case study.

 - Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads which can be used by the company to target potential leads. A higher score would mean that the lead is hot, i.e. is most likely to convert whereas a lower score would mean that the lead is cold and will mostly not get converted.
 - There are some more problems presented by the company which your model should be able to adjust to if the company's requirement changes in the future so you will need to handle these as well. These problems are provided in a separate doc file. Please fill it based on the logistic regression model you got in the first step. Also, make sure you include this in your final PPT where you'll make recommendations.
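
A minimal sketch of how a logistic regression output can be turned into a 0 to 100 lead score. The coefficients, intercept and feature values below are made-up illustrative numbers, not values from the model fitted later in this notebook, and `lead_score` is a hypothetical helper name.

```python
import numpy as np

def lead_score(features, coefficients, intercept):
    """Map a feature vector to a 0-100 lead score via the logistic function."""
    log_odds = intercept + np.dot(features, coefficients)
    probability = 1.0 / (1.0 + np.exp(-log_odds))
    return int(round(float(probability) * 100))

# Hypothetical coefficients for two scaled features (illustrative only)
coefficients = np.array([1.2, -0.8])
intercept = -0.5

hot_lead = np.array([2.0, -1.0])    # feature values that push the log-odds up
cold_lead = np.array([-1.5, 1.0])   # feature values that push the log-odds down

hot_score = lead_score(hot_lead, coefficients, intercept)
cold_score = lead_score(cold_lead, coefficients, intercept)
print(hot_score, cold_score)
```

Because the logistic function is monotonic, ranking leads by score is the same as ranking them by predicted conversion probability.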

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
%matplotlib inline

# Data display customization
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)


#importing sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm  
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve

## Step 1: Reading and Understanding the Data

In [None]:
#load the dataset
leads_df = pd.read_csv("../input/leadscore/Leads.csv")
leads_df.head()

In [None]:
#check the shape of dataframe
leads_df.shape

In [None]:
#inspect the dataframe
leads_df.info()

In [None]:
#check the statistics of dataframe
leads_df.describe()

In [None]:
#check null values in each column in dataframe
leads_df.isnull().sum()

In [None]:
# Duplicate check

leads_df.loc[leads_df.duplicated()]

In [None]:
#check the original Conversion Rate
original_Conversion_rate = round((sum(leads_df['Converted'])/len(leads_df['Converted'].index))*100, 2)
print("The conversion rate of leads is ",original_Conversion_rate)

### Observation

- #### The shape of leads dataset is 9240 rows and 37 columns.
- #### There are 7 numerical columns and 30 categorical columns
- #### There are many 'Select' values present in various columns in the dataset. These values correspond to the user having not made any selection.
- #### There are missing/null values in many columns.
- #### There are no duplicate values in the dataset
- #### The conversion rate of leads is 38.54%

## Step 2: Data Cleaning

In [None]:
#Replacing 'Select' with NaN since the customer has not selected any options for these columns while entering the data.
leads_df = leads_df.replace('Select',np.nan)

In [None]:
#Check number of unique values per column
leads_df.nunique()

### Observation

 - #### As seen above, there are a few columns with only 1 unique value:
    Get updates on DM Content

    Update me on Supply Chain Content

    I agree to pay the amount through cheque

    Receive More Updates About Our Courses

    Magazine

 - #### These columns have only one unique value and no null values, so we can drop them as they won't contribute to the model.

In [None]:
# drop columns that contain a single unique value
leads_df= leads_df.drop(['Magazine','Receive More Updates About Our Courses','I agree to pay the amount through cheque','Get updates on DM Content','Update me on Supply Chain Content'],axis=1)

In [None]:
#drop Prospect ID since they have all unique values

leads_df.drop(['Prospect ID'], axis=1, inplace=True)

In [None]:
# check for percentage of null values in each column

missing_val_percent = round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)
print(missing_val_percent)

### Observation

 - #### There are a few columns with a high percentage (more than 45%) of missing values.

 - #### We will drop the columns where more than 45% of the values are null.
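
The same rule can be applied programmatically instead of listing the columns by hand. This is a sketch on a toy frame; `drop_sparse_columns` is a hypothetical helper and the data is invented for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=45.0):
    """Return a copy of df without columns whose share of nulls exceeds the threshold."""
    missing_pct = df.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > max_missing_pct].index
    return df.drop(columns=to_drop)

# Toy frame (hypothetical data): 'b' is 75% missing, 'c' is 25% missing
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': ['x', None, 'y', 'z'],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Listing the columns explicitly, as the next cell does, is easier to audit; the threshold version is easier to rerun if the data changes.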


In [None]:
leads_df.drop(columns=['Lead Profile','Lead Quality','How did you hear about X Education','Asymmetrique Activity Index', 'Asymmetrique Profile Index','Asymmetrique Activity Score', 'Asymmetrique Profile Score'],inplace=True)

In [None]:
# check for percentage of null values in each column after dropping columns having more than 45% null values

round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)


### Observation

 - #### There are still a few columns with a high percentage of null values, i.e. above 30%.

 - #### Let us explore these columns individually to take care of null values in each column.

In [None]:
#check City column

leads_df.City.describe()

In [None]:
leads_df.City.value_counts(normalize=True)

### Observation

 - #### Around 58% of the non-null values are 'Mumbai', so we can impute the missing values with 'Mumbai'.
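
A small sketch of mode imputation on a toy column, to show what the replacement does to the category shares. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one dominant category and some gaps
city = pd.Series(['Mumbai', 'Mumbai', np.nan, 'Thane', 'Mumbai', np.nan])

most_frequent = city.mode()[0]            # mode() ignores NaN by default
city_imputed = city.fillna(most_frequent)

print(most_frequent)
print(city_imputed.value_counts(normalize=True))
```

Note that imputing with the mode inflates the dominant category: in this toy column its share goes from 3 of 4 known values to 5 of 6 values.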

In [None]:
leads_df['City'] = leads_df['City'].replace(np.nan,'Mumbai')

In [None]:
leads_df.City.value_counts(normalize=True)

In [None]:
#check Specialization column
leads_df.Specialization.describe()

In [None]:
leads_df.Specialization.value_counts(normalize=True)

### Observation

- #### There are 36% null values, we will replace those with 'Others' since NaN values have the highest percentage of values.

- #### Lead may not have mentioned specialization because it was not in the list or maybe they don't have a specialization yet. 

In [None]:
leads_df['Specialization'] = leads_df['Specialization'].replace(np.nan,'Others')

In [None]:
leads_df.Specialization.value_counts(normalize=True)

In [None]:
#Check Tags column
leads_df.Tags.describe()

In [None]:
leads_df.Tags.value_counts(normalize=True)

### Observation

 - #### Tags column contains 36% data with tag -"Will revert after reading the email" and 36% null values
 
 - #### These tags are added by the sales team of X Education and may vary over time, since they reflect the sales team's understanding of each lead. They are therefore not very reliable, and we can drop this column.

In [None]:
#drop Tags column
leads_df = leads_df.drop('Tags', axis=1)

In [None]:
#check 'What matters most to you in choosing a course' column
leads_df['What matters most to you in choosing a course'].describe()

In [None]:
leads_df['What matters most to you in choosing a course'].value_counts(normalize=True)

### Observation

- #### This column is heavily skewed towards better career prospects. Hence we can drop it, since almost all candidates who take a course are looking for a better career.

In [None]:
#drop 'What matters most to you in choosing a course' column
leads_df = leads_df.drop('What matters most to you in choosing a course', axis=1)

In [None]:
#check 'What is your current occupation' column
leads_df['What is your current occupation'].describe()

In [None]:
leads_df['What is your current occupation'].value_counts(normalize=True)

### Observation

 - #### Around 85% of the non-null values are 'Unemployed', so we can impute the missing values with 'Unemployed'.

In [None]:
leads_df['What is your current occupation'] = leads_df['What is your current occupation'].replace(np.nan, 'Unemployed')

In [None]:
leads_df['What is your current occupation'].value_counts(normalize=True)

In [None]:
#check country column
leads_df['Country'].describe()

In [None]:
leads_df['Country'].value_counts(normalize=True)

### Observation

 - #### Around 96% of the non-null values are 'India' and 27% of the values are missing, so dropping this column won't impact the model.

In [None]:
#drop country column
leads_df = leads_df.drop('Country', axis=1)

In [None]:
# check for percentage of null values in each column 
round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)


In [None]:
# The remaining missing values are under 2% per column, so we drop the affected rows.
leads_df.dropna(inplace = True)

In [None]:
# check for percentage of null values in each column 
round(100*(leads_df.isnull().sum()/len(leads_df.index)), 2)

## Step 3: Univariate Analysis and Bi-variate Analysis 

 - ### <u>Lead Origin</u>

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x = "Lead Origin", hue = "Converted", data = leads_df)
plt.title("Conversion in terms of Lead origin")
plt.show()

In [None]:
# Helper: print converted / not-converted counts and the conversion rate (%) per category of `col`
def conversion_summary(df,col):
    convert=df.pivot_table(values='Lead Number',index=col ,columns='Converted', aggfunc='count').fillna(0)
    convert["Conversion(%)"] =round(convert[1]/(convert[0]+convert[1]),2)*100
    print(convert.sort_values(ascending=False,by="Conversion(%)"))

In [None]:
conversion_summary(leads_df,"Lead Origin")

### Observation
#### From the above plot and Lead origin conversion summary, we can infer that:

- #### Lead Add Form has the highest conversion rate, at 94%.
- #### API and Landing Page Submission have conversion rates of 31% and 36%, but they generate the most leads.
- #### Lead Import has the fewest leads and conversions.
- #### To improve the overall lead conversion rate, the focus should be on improving conversion for API and Landing Page Submission. Also, generate more leads from Lead Add Form, since it has a very good conversion rate.


 - ### <u>Lead Source</u>

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x = "Lead Source", hue = "Converted", data = leads_df)
plt.title("Conversion in terms of Lead Source")
plt.xticks(rotation=90)
plt.show()

### Observation

 - #### A few lead sources have very low counts. Hence we can merge them into a common category 'Others'.
 - #### Also, there are 2 categories with the same name, 'Google' and 'google'. Hence we replace 'google' with 'Google' to have a single category.



In [None]:
leads_df['Lead Source'] = leads_df['Lead Source'].replace(['google'], 'Google')
leads_df['Lead Source'] = leads_df['Lead Source'].replace(['Click2call', 'Live Chat', 'NC_EDM', 'Pay per Click Ads', 'Press_Release',
  'Social Media', 'WeLearn', 'bing', 'blog', 'testone', 'welearnblog_Home', 'youtubechannel'], 'Others')

In [None]:
#generate the barplot again to check the distribution
plt.figure(figsize=(15,5))
sns.countplot(x = "Lead Source", hue = "Converted", data = leads_df)
plt.title("Conversion in terms of Lead Source")
plt.xticks(rotation=90)
plt.show()

In [None]:
conversion_summary(leads_df,"Lead Source")

### Observations

#### From the above plot and Lead Source conversion summary, we can infer that:

 - #### Google and Direct Traffic generate the maximum number of leads, but their conversion rates are only around 40% and 32%.
 - #### Welingak Website and Reference have the highest conversion rates, around 98% and 93%, but generate fewer leads.
 - #### Olark Chat and Organic Search generate a significant number of leads, but their conversion rates are around 26% and 38%.
 - #### Lead sources in the 'Others' category ('Click2call', 'Live Chat', 'NC_EDM', 'Pay per Click Ads', 'Press_Release', 'Social Media', 'WeLearn', 'bing', 'blog', 'testone', 'welearnblog_Home', 'youtubechannel') generate very few leads.
 - #### To improve the overall lead conversion rate, the focus should be on improving conversion for the Olark Chat, Organic Search, Direct Traffic and Google lead sources. Also, generate more leads from Reference and Welingak Website, since they have a very good conversion rate.
 

 - ### <u>Do Not Email & Do Not Call</u>
    


In [None]:
fig, axs = plt.subplots(1,2,figsize = (12,6))
sns.countplot(x = "Do Not Email", hue = "Converted", data = leads_df, ax = axs[0])
sns.countplot(x = "Do Not Call", hue = "Converted", data = leads_df, ax = axs[1])
plt.show()


In [None]:
conversion_summary(leads_df,"Do Not Email")


In [None]:
conversion_summary(leads_df,"Do Not Call")

### Observations

#### From the above plot and conversion summary, we can infer that:

 - #### Around 99% of customers do not like to be called or receive emails about the course.


 - ### <u>Total Visits</u>

In [None]:
plt.figure(figsize=(15,5))
sns.boxplot(x=leads_df['TotalVisits'])
plt.show()

### Observations

#### There are a number of outliers in the Total Visits column. We will cap the values at the 5th and 95th percentiles.


In [None]:
# Cap values at the 5th and 95th percentiles (clip avoids chained assignment)
lower, upper = leads_df['TotalVisits'].quantile([0.05, 0.95]).values
leads_df['TotalVisits'] = leads_df['TotalVisits'].clip(lower=lower, upper=upper)

In [None]:
plt.figure(figsize=(15,5))
sns.boxplot(x=leads_df['TotalVisits'])
plt.show()

In [None]:
plt.figure(figsize=(15,5))
sns.boxplot(y = 'TotalVisits', x = 'Converted', data = leads_df)
plt.show()

### Observations
#### From the above boxplot, we can conclude that:

 - #### The median is the same for converted and non-converted leads.
 
 - #### People who visit the platform have roughly equal chances (50-50) of applying and not applying for the course.


 - ### <u>Total time spent on website</u>

In [None]:
plt.figure(figsize=(15,5))
sns.boxplot(y = 'Total Time Spent on Website', x = 'Converted', data = leads_df)
plt.show()

### Observations
#### From the above boxplot, we can conclude that:

 - #### People who spend more time on the website are more likely to opt for a course.
 
 - #### People who spend less time on the website mostly did not opt for any course.


 - ### <u>Page views per visit</u>

In [None]:
plt.figure(figsize=(15,5))
sns.boxplot(y = 'Page Views Per Visit', x = 'Converted', data = leads_df)
plt.show()

### Observations

#### There are a number of outliers in the Page Views Per Visit column. We will cap the values at the 5th and 95th percentiles.


In [None]:
# Cap values at the 5th and 95th percentiles (clip avoids chained assignment)
lower, upper = leads_df['Page Views Per Visit'].quantile([0.05, 0.95]).values
leads_df['Page Views Per Visit'] = leads_df['Page Views Per Visit'].clip(lower=lower, upper=upper)

In [None]:
plt.figure(figsize=(15,5))
sns.boxplot(y = 'Page Views Per Visit', x = 'Converted', data = leads_df)
plt.show()

### Observations
#### From the above boxplot, we can conclude that:

 - #### The median is the same for converted and non-converted leads.
 
 - #### People who view 1 to 3 pages per visit on average have roughly equal chances (50-50) of applying and not applying for the course.
    
 - #### People who don't view any pages have higher conversion chances.

 - ### <u>Last Activity</u>

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x = "Last Activity", hue = "Converted", data = leads_df)
plt.xticks(rotation = 90)
plt.show()

In [None]:
conversion_summary(leads_df,"Last Activity")

### Observation
#### Based on the above plot and conversion summary, we can infer that:

 - #### Maximum leads are generated from people with last activity - Email opened and SMS sent.

 - #### Conversion rate is around 63% and 36% .

 - #### The fewest leads come from people whose last activity is Approached upfront, Email Marked Spam, Resubscribed to emails, emails received, View in browser link Clicked or Visited Booth in Tradeshow.
 
 - #### Olark Chat Conversation and Page Visited on Website generate a significant number of leads, but their conversion rates are only around 9% and 24%.

 - #### To improve the overall lead conversion rate, the focus should be on improving conversion for people whose last activity is Olark Chat Conversation, SMS Sent or Page Visited on Website.
 


 - ### <u>Specialization</u>

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x = "Specialization", hue = "Converted", data = leads_df)
plt.xticks(rotation = 90)
plt.show()

In [None]:
conversion_summary(leads_df,"Specialization")

### Observations

#### From the above plot and conversion summary, we can infer that:

 - #### Most of the specializations have a conversion rate of around 40-50%.


 - ### <u>Occupation</u>

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x = "What is your current occupation", hue = "Converted", data = leads_df)
plt.xticks(rotation = 90)
plt.show()

In [None]:
conversion_summary(leads_df,"What is your current occupation")

### Observations

#### From the above plot and conversion summary, we can infer that:

 - #### Working Professionals and Unemployed people generate the most leads.
 
 - #### The conversion rate for Working Professionals is high, around 92%, while the conversion rate for Unemployed is around 33%.
 
 - #### To improve the overall lead conversion rate, the focus should be on improving conversion of Unemployed leads. Also, generate more leads from Working Professionals.
 


 - ### <u>Search, Newspaper Article, X Education Forums, Newspaper, Digital Advertisement, Through Recommendations</u>

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(3,2,1)
sns.countplot(x = "Search", hue = "Converted", data = leads_df)

plt.subplot(3,2,2)
sns.countplot(x = "Newspaper Article", hue = "Converted", data = leads_df)

plt.subplot(3,2,3)
sns.countplot(x = "X Education Forums", hue = "Converted", data = leads_df)

plt.subplot(3,2,4)
sns.countplot(x = "Newspaper", hue = "Converted", data = leads_df)

plt.subplot(3,2,5)
sns.countplot(x = "Digital Advertisement", hue = "Converted", data = leads_df)

plt.subplot(3,2,6)
sns.countplot(x = "Through Recommendations", hue = "Converted", data = leads_df)

plt.show()

### Observation

#### Almost 99% of customers have not seen the X Education ad through Search, Newspaper Article, X Education Forums, Newspaper, Digital Advertisement or Through Recommendations.

 - ### <u>City</u>

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x = "City", hue = "Converted", data = leads_df)

plt.show()

In [None]:
conversion_summary(leads_df,"City")

### Observation

#### Maximum leads are generated from Mumbai, with a conversion rate of around 36%. Hence the focus should be on increasing the conversion rate for Mumbai.

 - ### <u>A free copy of Mastering The Interview</u>

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x = "A free copy of Mastering The Interview", hue = "Converted", data = leads_df)

plt.show()

In [None]:
conversion_summary(leads_df,"A free copy of Mastering The Interview")

### Observation

 - #### Most of the customers didn't want the free copy of Mastering The Interview.
    
 - #### Customers who opted for the free copy had a conversion rate of 36%, while those who didn't had a conversion rate of 39%.
    
   

 - ### <u>Last Notable Activity</u>

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x = "Last Notable Activity", hue = "Converted", data = leads_df)
plt.xticks(rotation = 90)
plt.show()

### Observation 

#### This column is very much similar to Last activity column

In [None]:
#check correlation among variables
plt.figure(figsize = (12,6))
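# Correlate numeric columns only; np.bool was removed from NumPy, so use the built-in bool
corr = leads_df.select_dtypes(include='number').corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True   # hide the upper triangle and diagonal
sns.heatmap(corr, mask=mask, annot=True, cmap="YlGnBu")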
plt.show()

### Observation

 - #### The Total Visits and Page Views Per Visit columns are correlated.
 - #### Hence we should keep only one of these columns in our model to avoid multicollinearity.
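
Reading correlated pairs off a heatmap gets harder as the number of columns grows. As a sketch, the pairs can also be listed directly; `correlated_pairs`, the 0.7 threshold and the toy values are assumptions made for illustration.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.7):
    """List (col_a, col_b, corr) for numeric column pairs with |corr| above threshold."""
    corr = df.select_dtypes(include='number').corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 2)))
    return pairs

# Toy data (hypothetical): page views track visits closely, time spent does not
toy = pd.DataFrame({
    'TotalVisits': [1, 2, 3, 4, 5, 6],
    'Page Views Per Visit': [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    'Total Time Spent on Website': [50, 10, 40, 20, 60, 30],
})
pairs = correlated_pairs(toy)
print(pairs)
```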

### Based on our data analysis, we conclude that many variables are not significant to the model. Hence we can drop them from further analysis.

In [None]:
leads_df = leads_df.drop(['Lead Number','Search','Newspaper Article','X Education Forums','Newspaper',
           'Digital Advertisement','Through Recommendations'], axis=1)

In [None]:
leads_df.shape

## Step 4: Data Preparation

### Converting binary variables (Yes/No) to 1/0

In [None]:
# List of binary variables
varlist =  ['A free copy of Mastering The Interview','Do Not Email','Do Not Call']

# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

# Applying the map function to the binary variables list
leads_df[varlist] = leads_df[varlist].apply(binary_map)

### Create a dummy variable for the categorical variables
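
Before encoding the real columns, here is a toy illustration of what `pd.get_dummies` with `drop_first=True` produces. The three category values are used only as an example.

```python
import pandas as pd

toy = pd.DataFrame({'Lead Origin': ['API', 'Landing Page Submission', 'Lead Add Form', 'API']})

# drop_first=True drops the first category (alphabetically 'API'), which becomes
# the baseline: a row with all dummy columns equal to 0 means 'API'
dummies = pd.get_dummies(toy[['Lead Origin']], drop_first=True)
print(list(dummies.columns))
print(dummies)
```

Dropping one level per categorical variable avoids perfectly collinear dummy columns, which matters for the VIF checks later on.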

In [None]:
dummy = ['Lead Origin', 'Lead Source', 'Last Activity', 'Specialization','What is your current occupation','City','Last Notable Activity']
dummy_data = pd.get_dummies(leads_df[dummy],drop_first=True)
dummy_data.head()

In [None]:
# Combining dummy data with the original dataset

leads_df = pd.concat([leads_df, dummy_data], axis=1)
leads_df.head()

In [None]:
# Drop the original columns 
drop_cols = ['Lead Origin', 'Lead Source', 'Last Activity', 'Specialization','What is your current occupation','City','Last Notable Activity','Lead Source_Others','Specialization_Others']
leads_df = leads_df.drop(drop_cols, axis=1)
leads_df.head()



In [None]:
#check the shape of dataframe
leads_df.shape

## Step 5: Train-Test Split

In [None]:
# Putting feature variable to X
X = leads_df.drop(['Converted'], axis=1)


# Putting response variable to y
y = leads_df['Converted']

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

## Step 6: Feature Scaling

In [None]:
#create object of StandardScaler
scaler = StandardScaler()

#Apply scaler() to numerical columns
X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.fit_transform(X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])

X_train.head()

## Step 7: Feature Selection Using RFE

In [None]:
logreg = LogisticRegression()

# running RFE with 20 variables as output
rfe = RFE(estimator=logreg, n_features_to_select=20)
rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
col1 = X_train.columns[rfe.support_]
col1

In [None]:
X_train.columns[~rfe.support_]

## Step 9: Model Building

In [None]:
#BUILDING MODEL #1
X_train_sm = sm.add_constant(X_train[col1])
logm1 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm1.fit()
res.summary()

In [None]:
#check variance inflation factor
vif = pd.DataFrame()
vif['Features'] = X_train[col1].columns
vif['VIF'] = [variance_inflation_factor(X_train[col1].values, i) for i in range(X_train[col1].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Observation
 - #### The p-value for the column 'What is your current occupation_Housewife' is very high and above the threshold. Hence we will drop this column from our model.
    


In [None]:
col2 = col1.drop('What is your current occupation_Housewife')
col2

In [None]:
#BUILDING MODEL #2
X_train_sm = sm.add_constant(X_train[col2])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
#check variance inflation factor
vif = pd.DataFrame()
vif['Features'] = X_train[col2].columns
vif['VIF'] = [variance_inflation_factor(X_train[col2].values, i) for i in range(X_train[col2].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Observation
 - #### The p-value and VIF for the column 'Lead Source_Reference' are high and above the thresholds. Hence we will drop this column.
    

In [None]:
col3 = col2.drop('Lead Source_Reference')
col3

In [None]:
#BUILDING MODEL #3
X_train_sm = sm.add_constant(X_train[col3])
logm3 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm3.fit()
res.summary()

In [None]:
#check variance inflation factor
vif = pd.DataFrame()
vif['Features'] = X_train[col3].columns
vif['VIF'] = [variance_inflation_factor(X_train[col3].values, i) for i in range(X_train[col3].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Observation
 - #### The VIF for the column 'What is your current occupation_Unemployed' is high and above the threshold. Hence we will drop it.

In [None]:
col4 = col3.drop('What is your current occupation_Unemployed')
col4

In [None]:
#BUILDING MODEL #4
X_train_sm = sm.add_constant(X_train[col4])
logm4 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm4.fit()
res.summary()

In [None]:
#check variance inflation factor
vif = pd.DataFrame()
vif['Features'] = X_train[col4].columns
vif['VIF'] = [variance_inflation_factor(X_train[col4].values, i) for i in range(X_train[col4].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Observation
 - #### The VIF for the column 'What is your current occupation_Student' is high and above the threshold. Hence we will drop it.

In [None]:
col5 = col4.drop('What is your current occupation_Student')
col5

In [None]:
#BUILDING MODEL #5
X_train_sm = sm.add_constant(X_train[col5])
logm5 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm5.fit()
res.summary()

In [None]:
#check variance inflation factor
vif = pd.DataFrame()
vif['Features'] = X_train[col5].columns
vif['VIF'] = [variance_inflation_factor(X_train[col5].values, i) for i in range(X_train[col5].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Observation

 - #### The VIF values of all the variables are under the threshold value of 3.
 - #### The p-values of all the variables are under the threshold value of 0.05.
 - #### Hence we will consider Model 5 as our final model for further analysis.
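
For intuition about the VIF figures used above: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand on synthetic data. It adds an intercept to each auxiliary regression, which is an assumption of this sketch, so its numbers need not match the cells above, which pass the feature matrix without a constant.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes from
    regressing that column on all the other columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: 'c' is almost a copy of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The two near-duplicate columns get a large VIF while the independent one stays close to 1, which is why dropping one column of a correlated pair brings the remaining VIFs down.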
    




In [None]:
# Getting the Predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]


### Creating a dataframe with the actual 'Converted' flag and the predicted 'Lead_Score_Prob' probabilities

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Lead_Score_Prob':y_train_pred})
y_train_pred_final['Prospect ID'] = y_train.index
y_train_pred_final.head()


In [None]:
# Creating a new column 'predicted' with value 1 if Lead_Score_Prob > 0.5 else 0
y_train_pred_final['predicted'] = y_train_pred_final.Lead_Score_Prob.map(lambda x: 1 if x > 0.5 else 0)

y_train_pred_final.head()

## Step 10: Model Evaluation

In [None]:
# Confusion matrix 
confusion = confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)

In [None]:
# check the overall accuracy.
print(accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let us calculate sensitivity 
round((TP / float(TP+FN)),2)

In [None]:
# Let us calculate specificity
round((TN / float(TN+FP)),2)

In [None]:
# Calculate false positive rate
print(FP/ float(TN+FP))

In [None]:
# positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

## Step 11: Plotting the ROC Curve


 - #### ROC shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
 - #### The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
 - #### The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
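
The area under the ROC curve (AUC) summarises the curve in one number. One way to build intuition, sketched here on invented labels and scores, is that AUC equals the probability that a randomly chosen converted lead receives a higher score than a randomly chosen non-converted lead. `auc_by_ranking` is a hypothetical helper, not part of sklearn.

```python
import numpy as np

def auc_by_ranking(actual, probs):
    """AUC as the share of (positive, negative) pairs in which the positive
    has the higher score; ties count as half."""
    actual = np.asarray(actual)
    probs = np.asarray(probs, dtype=float)
    pos = probs[actual == 1]
    neg = probs[actual == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy labels and predicted probabilities (hypothetical)
actual = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

auc = auc_by_ranking(actual, probs)
print(round(auc, 3))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. A model that ranks every converted lead above every non-converted lead would score 1.0, and random scoring would give about 0.5.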

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = roc_curve( y_train_pred_final.Converted, y_train_pred_final.Lead_Score_Prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Lead_Score_Prob)

#### The area under the ROC curve (AUC) should be close to 1. We are getting a value of 0.88, indicating a good predictive model.

## Step 12: Finding Optimal Cutoff Point

#### Above we chose an arbitrary cutoff value of 0.5. We need to determine the best cutoff value.

#### The optimal cutoff probability is the one at which sensitivity and specificity are balanced.
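
As a sketch of that idea, the cutoff can be searched for directly by minimising the gap between sensitivity and specificity over a grid of candidate values. `balanced_cutoff` and the toy labels and probabilities are assumptions made for illustration; the cells below do the same thing on the real predictions by tabulating and plotting the metrics.

```python
import numpy as np

def balanced_cutoff(actual, probs, cutoffs):
    """Return the cutoff where |sensitivity - specificity| is smallest."""
    actual = np.asarray(actual)
    probs = np.asarray(probs, dtype=float)
    best_cutoff, best_gap = None, np.inf
    for c in cutoffs:
        predicted = (probs > c).astype(int)
        tp = np.sum((predicted == 1) & (actual == 1))
        tn = np.sum((predicted == 0) & (actual == 0))
        fp = np.sum((predicted == 1) & (actual == 0))
        fn = np.sum((predicted == 0) & (actual == 1))
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        gap = abs(sensitivity - specificity)
        if gap < best_gap:
            best_cutoff, best_gap = c, gap
    return best_cutoff

# Toy labels and predicted probabilities (hypothetical)
actual = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.15, 0.25, 0.45, 0.35, 0.55, 0.65, 0.95]
cutoffs = [x / 10 for x in range(10)]

best = balanced_cutoff(actual, probs, cutoffs)
print(best)
```

On this toy data the gap closes at a cutoff of 0.4, where sensitivity and specificity are both 0.75. The helper assumes both classes are present, otherwise one of the ratios would divide by zero.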

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Lead_Score_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
cutoff_df = pd.DataFrame( columns = ['Probability','Accuracy','Sensitivity','Specificity'])


num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    # Confusion matrix for the predictions made at cutoff i
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy, sensitivity and specificity for various probabilities.

sns.set_style('whitegrid')
sns.set_context('paper')

cutoff_df.plot.line(x='Probability', y=['Accuracy','Sensitivity','Specificity'])
plt.xticks(np.arange(0,1,step=.05), size=8)
plt.yticks(size=12)
plt.show()

### Observation

#### From the above curve we can see that the optimal cutoff is at 0.35. This is the point where accuracy, sensitivity and specificity are balanced.

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Lead_Score_Prob.map(lambda x: 1 if x > 0.35 else 0)

y_train_pred_final.head()

In [None]:
#Assigning lead score
y_train_pred_final['Lead_Score'] = y_train_pred_final.Lead_Score_Prob.map( lambda x: round(x*100))

y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's check the sensitivity 
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate
print(FP/ float(TN+FP))

In [None]:
# positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

### Observation

#### With the optimal cutoff of 0.35, Accuracy, Sensitivity and Specificity on the training data are all approximately 81%.
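
The individual cells above can be collected into one helper. The sketch below assumes the four counts come from a 2x2 confusion matrix laid out as in `confusion2`; the function name `classification_metrics` and the counts passed to it are hypothetical and only illustrate the formulas.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the standard rates from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),                # true positive rate (recall)
        "specificity": tn / (tn + fp),                # true negative rate
        "false_positive_rate": fp / (tn + fp),        # equals 1 - specificity
        "positive_predictive_value": tp / (tp + fp),  # precision
        "negative_predictive_value": tn / (tn + fn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow
m = classification_metrics(tp=80, tn=160, fp=40, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```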



## Step 13: Metrics - Precision and Recall

#### To evaluate the model further, we check two important parameters: precision and recall. Precision measures result relevancy (how many of the leads predicted as converted actually converted), and recall measures how many of the truly relevant results are returned (how many of the actual conversions the model captures).

In [None]:
#Calculating Precision
precision =round(TP/float(TP+FP),2)
precision

In [None]:
#Calculating Recall
recall = round(TP/float(TP+FN),2)
recall

In [None]:
#Calculating precision using precision_score function from sklearn
precision_score(y_train_pred_final.Converted , y_train_pred_final.final_predicted)

In [None]:
#Calculating recall using recall_score function from sklearn
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

### Observation

 - #### As per our business objective, recall is the more significant metric, since we don't want to leave out any hot leads that are willing to convert.
 - #### Hence a recall of 81% suggests a good model.


In [None]:
# Let us generate the Precision vs Recall tradeoff curve
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final['Lead_Score_Prob'])
plt.title('Precision vs Recall tradeoff')
plt.plot(thresholds, p[:-1], "g-", label='Precision')    # Plotting precision
plt.plot(thresholds, r[:-1], "r-", label='Recall')       # Plotting recall
plt.xlabel('Probability threshold')
plt.legend()
plt.show()


#### As seen above, there is a tradeoff between Precision and Recall: as the threshold is raised, recall falls while precision generally rises, so improving one usually comes at the cost of the other.
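
The tradeoff can also be seen by computing both metrics by hand at a few thresholds. This is a minimal sketch in plain Python on made-up data; `y_true`, `y_prob` and `precision_recall_at` are hypothetical names, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting 1 for y_prob > threshold."""
    predicted = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, y in zip(y_true, predicted) if t == 1 and y == 1)
    fp = sum(1 for t, y in zip(y_true, predicted) if t == 0 and y == 1)
    fn = sum(1 for t, y in zip(y_true, predicted) if t == 1 and y == 0)
    # Assumed convention: precision is 1.0 when no lead is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical toy data, not taken from the case-study dataset
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.60, 0.25, 0.55, 0.70, 0.80, 0.90]

for threshold in (0.1, 0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Recall can never increase when the threshold is raised, because fewer leads are predicted positive. Precision usually increases but is not guaranteed to do so at every step.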

## Step 14: Making predictions on the test set

In [None]:
X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.transform(X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])


In [None]:
X_test = X_test[col5]

X_test.shape

In [None]:
X_test.head()

In [None]:
#add constant
X_test_sm = sm.add_constant(X_test)

In [None]:
#making predictions on test set
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred (the predicted probabilities) to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)
y_pred_1.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

In [None]:
# Copying the index into a 'Prospect ID' column
y_test_df['Prospect ID'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 

y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1

y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 

y_pred_final= y_pred_final.rename(columns={ 0 : 'Lead_Score_Prob'})

In [None]:
# Rearranging the columns

y_pred_final = y_pred_final.reindex(['Prospect ID','Converted','Lead_Score_Prob'], axis=1)

In [None]:
# Adding Lead_Score column

y_pred_final['Lead_Score'] = round((y_pred_final['Lead_Score_Prob'] * 100),0)

y_pred_final['Lead_Score'] = y_pred_final['Lead_Score'].astype(int)

In [None]:
# Let's see the head of y_pred_final
y_pred_final.head()

In [None]:
y_pred_final['final_Predicted'] = y_pred_final.Lead_Score_Prob.map(lambda x: 1 if x > 0.35 else 0)

In [None]:
y_pred_final.head()

In [None]:
#classifying leads based on Lead score
y_pred_final['Lead_Type'] = y_pred_final.Lead_Score.map(lambda x: 'Hot Lead' if x >35 else 'Cold Lead')
y_pred_final.sort_values(by='Lead_Score', ascending = False)
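
One detail worth checking here (an observation about the two cells above, not a result reported in the notebook): `final_Predicted` compares the raw probability with 0.35, while `Lead_Type` compares the rounded `Lead_Score` with 35. The sketch below, with hypothetical helper names and made-up probabilities, shows that the two rules can disagree for probabilities just above 0.35.

```python
def lead_type_from_score(prob, cutoff_score=35):
    """Mirror of the Lead_Type rule: round to a 0-100 score, then compare."""
    score = round(prob * 100)
    return 'Hot Lead' if score > cutoff_score else 'Cold Lead'


def predicted_from_prob(prob, cutoff=0.35):
    """Mirror of the final_Predicted rule: compare the raw probability."""
    return 1 if prob > cutoff else 0


# Hypothetical probabilities around the cutoff
for prob in (0.349, 0.352, 0.356):
    print(prob, round(prob * 100), predicted_from_prob(prob), lead_type_from_score(prob))
```

If the two columns need to agree for every lead, one option is to derive `Lead_Type` from `final_Predicted` instead of from the rounded score.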

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_Predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_Predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
precision_score(y_pred_final.Converted , y_pred_final.final_Predicted)

In [None]:
recall_score(y_pred_final.Converted, y_pred_final.final_Predicted)

### Final Observation:
#### Let's compare the model performance parameters obtained for the Train and Test data:

 - #### Train Data: 
#### Accuracy : 80.96%
#### Sensitivity : 80.98%
#### Specificity : 80.94%
#### Precision : 72.69%
#### Recall : 80.98%



 - #### Test Data: 
#### Accuracy : 80.35%
#### Sensitivity : 79.37%
#### Specificity : 80.91%
#### Precision : 70.34%
#### Recall : 79.37%



### Observation 


 - #### The train and test performance metrics differ by roughly 1 to 2 percentage points (at most about 2.4 points, for Precision). This implies that our final model did not overfit the training data and is performing well.

 - #### High Sensitivity ensures that almost all leads who are likely to convert are correctly identified, whereas high Specificity ensures that leads on the borderline of converting, or unlikely to convert, are not selected.

 - #### Depending on the business requirement, the probability threshold can be raised or lowered: raising it decreases the Sensitivity and increases the Specificity of the model, while lowering it does the opposite.
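
The size of the train/test gap can be checked directly from the figures reported above. The snippet below is simple arithmetic on those numbers; the dictionary names are illustrative.

```python
# Metrics (in percent) as reported in the Final Observation above
train = {"Accuracy": 80.96, "Sensitivity": 80.98, "Specificity": 80.94,
         "Precision": 72.69, "Recall": 80.98}
test = {"Accuracy": 80.35, "Sensitivity": 79.37, "Specificity": 80.91,
        "Precision": 70.34, "Recall": 79.37}

# Gap in percentage points for each metric (train minus test)
gaps = {name: round(train[name] - test[name], 2) for name in train}
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points")

largest = max(gaps, key=gaps.get)
print("Largest gap:", largest)
```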

### Determining Feature Importance

#### Selecting the coefficients of the selected features from our final model excluding the intercept

In [None]:
pd.options.display.float_format = '{:.2f}'.format
new_params = res.params[1:]
new_params

In [None]:
# Getting a relative coefficient value for all the features wrt the feature with the highest coefficient


feature_importance = new_params
feature_importance = 100.0 * (feature_importance / feature_importance.max())
feature_importance

In [None]:
##Sorting the feature variables based on their relative coefficient values

sorted_idx = np.argsort(feature_importance, kind='quicksort')
sorted_idx

In [None]:
##Plot showing the feature variables based on their relative coefficient values

pos = np.arange(sorted_idx.shape[0]) + .5

featfig = plt.figure(figsize=(10,6))
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance.iloc[np.asarray(sorted_idx)], align='center', color = 'tab:blue',alpha=0.8)
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X_train[col5].columns)[sorted_idx], fontsize=12)
featax.set_xlabel('Relative Feature Importance', fontsize=14)

plt.tight_layout()   
plt.show()

### Final Model Reporting & Equation

#### The log odds are given by: log(P / (1 - P)) = c + B1X1 + B2X2 + B3X3 + .... + BnXn
    

#### log odds = 0.18 +(-1.59 * Do Not Email) + (1.13 * Total Time Spent on Website) + (3.92 * Lead Origin_Lead Add Form) + (1.52 * Lead Origin_Lead Import) + (1.24 * Lead Source_Olark Chat) + (2.06 * Lead Source_Welingak website) + (-1.12 * Last Activity_Converted to Lead) + (-1.28 * Last Activity_Email Bounced) + (1.91 * Last Activity_Had a Phone Conversation) + (-1.32 * Last Activity_Olark Chat Conversation) + (2.75 * What is your current occupation_Working Professional) + (-1.86 * Last Notable Activity_Email Link Clicked) + (-1.40 * Last Notable Activity_Email Opened) + (-1.73 * Last Notable Activity_Modified) + (-1.52* Last Notable Activity_Olark Chat Conversation) + (-1.69 * Last Notable Activity_Page Visited on Website ) 
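
As a worked example of the equation above, the sketch below turns a lead's feature values into log odds, a conversion probability via the logistic function, and a 0 to 100 lead score. The coefficients are copied from the equation; the two example leads are made up, features that are not supplied are treated as 0, and `Total Time Spent on Website` is assumed to be the scaled value produced by `scaler`.

```python
import math

INTERCEPT = 0.18
COEFFICIENTS = {
    "Do Not Email": -1.59,
    "Total Time Spent on Website": 1.13,
    "Lead Origin_Lead Add Form": 3.92,
    "Lead Origin_Lead Import": 1.52,
    "Lead Source_Olark Chat": 1.24,
    "Lead Source_Welingak website": 2.06,
    "Last Activity_Converted to Lead": -1.12,
    "Last Activity_Email Bounced": -1.28,
    "Last Activity_Had a Phone Conversation": 1.91,
    "Last Activity_Olark Chat Conversation": -1.32,
    "What is your current occupation_Working Professional": 2.75,
    "Last Notable Activity_Email Link Clicked": -1.86,
    "Last Notable Activity_Email Opened": -1.40,
    "Last Notable Activity_Modified": -1.73,
    "Last Notable Activity_Olark Chat Conversation": -1.52,
    "Last Notable Activity_Page Visited on Website": -1.69,
}


def lead_score(features):
    """Return (log_odds, probability, score) for a dict of feature values."""
    log_odds = INTERCEPT + sum(COEFFICIENTS[name] * value
                               for name, value in features.items())
    probability = 1.0 / (1.0 + math.exp(-log_odds))  # logistic function
    return log_odds, probability, round(probability * 100)


# Hypothetical lead 1: working professional who came through the lead add form
hot = lead_score({"Lead Origin_Lead Add Form": 1,
                  "What is your current occupation_Working Professional": 1})

# Hypothetical lead 2: opted out of email, last notable activity 'Modified'
cold = lead_score({"Do Not Email": 1, "Last Notable Activity_Modified": 1})

print(hot)
print(cold)
```

With these inputs the first lead's log odds are 0.18 + 3.92 + 2.75 = 6.85 and the second lead's are 0.18 - 1.59 - 1.73 = -3.14, which places them at opposite ends of the score range.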

### <u>Recommendations</u>

 - #### The sales team of X Education should focus on leads with Lead Origin - Lead Add Form, occupation - Working Professional, and Lead Source - Welingak website.
 - #### 'Hot Leads' are customers with a lead score above 35. The sales team should focus on the 'Hot Leads' first.
 - #### The 'Cold Leads' (customers with a lead score <= 35) should be approached after the sales team is done with the 'Hot Leads'.
 - #### Variables such as city, specialization and occupation can potentially explain conversion better. Management should consider making some of these fields mandatory so that they can be used in the model and support business decisions.
 - #### The model's recall is higher than its precision, and the cutoff can be adjusted to suit the company's requirements in the future.
 - #### High Sensitivity ensures that almost all leads who are likely to convert are correctly identified, whereas high Specificity ensures that leads on the borderline of converting, or unlikely to convert, are not selected.
 - #### It is better to give the lowest priority to customers who do not want to be contacted about the course.
 - #### A lead whose Last Notable Activity is 'Modified' is less likely to be a potential lead (coefficient -1.73).