In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

In [None]:
leads = pd.read_csv('Leads.csv')

In [None]:
leads.head()

## Basic Analysis

In [None]:
leads.shape

#### There are a total of 9240 rows and 37 columns

In [None]:
leads.info()

In [None]:
leads.describe()

#### The statistics for the numeric columns show quite a lot of variance as well as null values

## Data Cleaning

#### The columns have various data types and all seem to be in the correct format. However, the non-null counts fall short of the total of 9240 rows, so there are some null values as well

Let's calculate the percentage of null values in the dataset

In [None]:
def calculate_null_percentage(dataset):
    return round(dataset.isnull().sum() / len(dataset) * 100, 2)

In [None]:
calculate_null_percentage(leads)

#### Also, as mentioned in the problem statement, "Select" is to be treated as `null`. If the data was collected through a user interface, a field could offer options such as A, B, C and 'Select'. If the data entry operator or user did not choose any of the valid options, the field would remain as 'Select'. This implies that "Select" is the same as `null` in the dataset.

Let's replace `Select` with null and re-calculate the percentage of `null` values.

In [None]:
leads = leads.replace('Select', np.nan)

In [None]:
calculate_null_percentage(leads)
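
The effect of treating `Select` as missing can be illustrated on a small made-up frame. This is only a sketch: the values and the helper name `null_percentage` are assumptions for illustration, not taken from `Leads.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) where 'Select' is the untouched dropdown default.
toy = pd.DataFrame({
    "Lead Profile": ["Select", "Potential Lead", None, "Select"],
    "City": ["Mumbai", "Select", "Thane", "Mumbai"],
})

def null_percentage(df):
    # Share of missing values per column, in percent.
    return (df.isnull().mean() * 100).round(2)

before = null_percentage(toy)
after = null_percentage(toy.replace("Select", np.nan))

# Side-by-side view of the missing share before and after the replacement.
print(pd.DataFrame({"before": before, "after": after}))
```

Columns that looked almost complete can turn out to be mostly empty once the placeholder is counted as missing.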

#### There is a significant increase from `29.32%` to `74.19%` in the `Lead Profile` column after replacing all the `Select` values.

As a general guideline, columns with more than **`40%`** missing values should be dropped, as they are unlikely to contribute significantly to the analysis.
Let's see what these columns actually contain.

In [None]:
columns_with_high_missing_values = ["How did you hear about X Education", 
                                    "Lead Quality", 
                                    "Lead Profile", 
                                    "Asymmetrique Activity Index", 
                                    "Asymmetrique Profile Index", 
                                    "Asymmetrique Activity Score", 
                                    "Asymmetrique Profile Score"]

In [None]:
leads[columns_with_high_missing_values]
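
Instead of typing the column names by hand, the list could also be derived from the threshold itself. A minimal sketch on made-up data, assuming the same 40% cut-off; the frame and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with different shares of missing values per column.
toy = pd.DataFrame({
    "Lead Quality": [np.nan, "High", np.nan, np.nan, "Low"],  # 60% missing
    "City": ["Mumbai", np.nan, "Thane", "Mumbai", "Mumbai"],  # 20% missing
    "Converted": [0, 1, 0, 1, 0],                             # 0% missing
})

threshold = 40  # percent, the guideline used in this notebook

# Percentage of missing values per column.
null_pct = toy.isnull().mean() * 100

# Names of the columns whose missing share exceeds the threshold.
high_missing = null_pct[null_pct > threshold].index.tolist()
print(high_missing)

reduced = toy.drop(columns=high_missing)
```

Deriving the list this way keeps it in sync with the data if the threshold or the cleaning steps change.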

#### Although these columns seem relevant to the case study, as some of them hold a kind of score, their high percentage of missing values means they need to be dropped from the dataset

In [None]:
leads = leads.drop(columns = columns_with_high_missing_values)

In [None]:
leads.shape

In [None]:
calculate_null_percentage(leads)

#### Specialization, Tags and City have close to `40%` missing values, but we should not drop them as they might have an impact on the overall analysis

In [None]:
leads.Specialization.value_counts() / len(leads) * 100

#### Let's fill the null values with a category called `Other`

In [None]:
leads.Specialization = leads.Specialization.fillna('Other')

#### Recalculating the Specialization values

In [None]:
leads.Specialization.value_counts() / len(leads) * 100

#### Let's apply the same for the Tags column

In [None]:
leads.Tags.value_counts() / len(leads) * 100

The most frequent value in the Tags column is "Will revert after reading the email" i.e. **58.7%**. So, all the missing values can be filled with the same value.

In [None]:
leads.Tags = leads.Tags.fillna('Will revert after reading the email')

In [None]:
leads.Tags.value_counts() / len(leads) * 100

#### Let's look into the City column

In [None]:
leads.City.value_counts() / len(leads) * 100

#### As we do not have enough information on the City, we can fill the missing values with **Mumbai**, which is already present in the dataset and has the highest count. We could have filled them with "Other Cities" as well.

In [None]:
leads.City = leads.City.fillna('Mumbai')
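
The two options mentioned above (filling with the most frequent value or with a separate label) shift the category shares differently. A small sketch on made-up values; the `"Unknown"` label is a hypothetical placeholder, not something used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column: two known cities and three missing entries.
city = pd.Series(["Mumbai", "Mumbai", "Thane", np.nan, np.nan, np.nan], name="City")

# Option 1: fill with the most frequent non-null value.
mode_value = city.mode()[0]
filled_mode = city.fillna(mode_value)

# Option 2: fill with an explicit placeholder label (hypothetical name).
filled_other = city.fillna("Unknown")

# Category shares after each kind of imputation.
share_mode = filled_mode.value_counts(normalize=True)
share_other = filled_other.value_counts(normalize=True)
print(share_mode)
print(share_other)
```

Mode imputation inflates the share of the dominant category, while a placeholder keeps the original shares visible at the cost of one extra category.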

In [None]:
calculate_null_percentage(leads)

#### Country column has **`26.63%`** missing values.

In [None]:
leads.Country.value_counts() / len(leads) * 100

The most frequent country is India, so the missing values can be filled with India.

In [None]:
leads.Country = leads.Country.fillna('India')

In [None]:
leads.Country.value_counts() / len(leads) * 100

#### "What is your current occupation" column has some missing values. Let's impute this.

In [None]:
leads['What is your current occupation'].value_counts() / len(leads) * 100

The majority of the leads here are unemployed. It might not be appropriate to fill the missing data with `Unemployed`; it could be filled with `Other` as well. From a business-domain perspective, however, an unemployed person would (probably) tend to choose a course to gain employment. So the missing values are better filled with `Unemployed`.

In [None]:
leads['What is your current occupation'] = leads['What is your current occupation'].fillna('Unemployed')

In [None]:
leads['What is your current occupation'].value_counts() / len(leads) * 100

#### "What matters most to you in choosing a course" column has a good number of missing values. Let's impute that too

In [None]:
leads['What matters most to you in choosing a course'].value_counts() / len(leads) * 100

The simple choice here is to fill the missing values with "Better Career Prospects".

In [None]:
leads['What matters most to you in choosing a course'] = leads['What matters most to you in choosing a course'].fillna('Better Career Prospects')

In [None]:
leads['What matters most to you in choosing a course'].value_counts() / len(leads) * 100

In [None]:
calculate_null_percentage(leads)

#### The remaining columns each have less than `1.5%` missing values. These rows will be dropped.

In [None]:
leads.dropna(inplace=True)

In [None]:
calculate_null_percentage(leads)

#### As we see, there are no more null values in the dataset. We can proceed to data analysis for a better understanding of the dataset and its features.

## Exploratory Data Analysis

#### Let's find out the conversion rate, as `Converted` is the target variable.

In [None]:
sum(leads.Converted) / len(leads) * 100

#### So, the conversion rate is very close to `38%`

Let's find out how lead origin and lead source relate to conversions.

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Lead Origin', hue = 'Converted')
plt.xticks(rotation = 45)
plt.show()

#### Observations
1. API and Landing Page Submission account for more conversions in numbers
2. Lead Add Form has the highest conversion rate compared to the other two
3. Lead Import has a very low conversion rate
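
A count plot shows conversions in numbers, while a conversion rate needs the share of converted leads within each category. Both can be read from a grouped summary; the sketch below uses made-up rows, not the real dataset.

```python
import pandas as pd

# Toy data: four API leads (one converted) and two Lead Add Form leads (both converted).
toy = pd.DataFrame({
    "Lead Origin": ["API"] * 4 + ["Lead Add Form"] * 2,
    "Converted": [1, 0, 0, 0, 1, 1],
})

summary = toy.groupby("Lead Origin")["Converted"].agg(
    leads="count",       # number of leads in the category
    conversions="sum",   # number of converted leads
    rate="mean",         # share of converted leads
)
summary["rate"] = (summary["rate"] * 100).round(1)  # express the rate in percent
print(summary)
```

A category can have a high rate with few leads, so both the count and the rate are worth reading before drawing conclusions.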

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Lead Source', hue = 'Converted')
plt.xticks(rotation = 45)
plt.show()

#### Observations
1. Olark Chat, Organic Search, Direct Traffic, Google and Reference have more conversions
2. Both Google and google are present - we need to merge them into a single category
3. Also there are many **other** categories with minimal reach and conversions - we can group them into an 'Others' category

In [None]:
leads['Lead Source'].value_counts()

In [None]:
# Replacing google with Google
# (assigning the result back avoids an in-place replace on a column selection)
leads['Lead Source'] = leads['Lead Source'].replace("google", "Google")

In [None]:
# Replacing Click2call, Press_Release, Social Media, 
# Live Chat, youtubechannel, testone, Pay per Click Ads, 
# welearnblog_Home, WeLearn, blog, NC_EDM
# to 'Others'

leads['Lead Source'] = leads['Lead Source'].replace(
    ['Click2call', 'Press_Release', 'Social Media', 'Live Chat',
     'youtubechannel', 'testone', 'Pay per Click Ads', 'welearnblog_Home',
     'WeLearn', 'blog', 'NC_EDM'], "Others")

In [None]:
leads['Lead Source'].value_counts()
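
The rare categories could also be picked by frequency rather than listed by name. This is a sketch on a made-up series, and the `min_count` cut-off is an assumed value chosen for the toy data.

```python
import pandas as pd

# Toy lead sources, including a lower-case variant and two rare labels.
source = pd.Series(
    ["Google"] * 5 + ["google"] + ["Olark Chat"] * 3 + ["blog", "testone"]
)

# Merge the case variant first so it is counted with its main category.
source = source.replace("google", "Google")
counts = source.value_counts()

min_count = 2  # hypothetical cut-off for a "rare" category
rare = counts[counts < min_count].index

# Keep frequent labels, replace rare ones with a single 'Others' label.
grouped = source.where(~source.isin(rare), "Others")
print(grouped.value_counts())
```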

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Lead Source', hue = 'Converted')
plt.xticks(rotation = 45)
plt.show()

#### Observations -
1. Google and Direct Traffic have high conversions in numbers
2. Reference and Welingak website have the highest conversion rates in terms of percentage

More focus should be on Google traffic, Reference and the Welingak website; nurturing these sources might increase the conversion rate by a good margin

Let's see the 'Do Not Email' and 'Do Not Call' columns based on the conversion rate

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(data = leads, x = 'Do Not Email', hue = 'Converted', ax=axs[0])
sns.countplot(data = leads, x = 'Do Not Call', hue = 'Converted', ax=axs[1])
plt.show()

#### Observations -
1. Candidates who allowed email and call communication have more conversions.
2. For the 'No' value of 'Do Not Email' and 'Do Not Call', the conversion rates are similar
3. It can be inferred that interested candidates allowed calls and emails

Let's analyse 'TotalVisits'

In [None]:
plt.boxplot(data=leads, x='TotalVisits')
plt.show()

In [None]:
sns.boxplot(data=leads, x="Converted", y='TotalVisits')
plt.show()

As we see, there are many outliers in the TotalVisits column for both values of the target column. Let's cap the values at the 5th and 99th percentiles and see whether the outliers are removed.

In [None]:
quantile = leads.TotalVisits.quantile([0.05, 0.99]).values
# Cap values outside the 5th-99th percentile range; clip avoids chained assignment
leads['TotalVisits'] = leads['TotalVisits'].clip(lower=quantile[0], upper=quantile[1])
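
Percentile capping of this kind can be sketched with `Series.clip` on a small made-up series. The numbers are illustrative only; the point is that values beyond the chosen percentiles are pulled in to the bounds while the number of rows stays the same.

```python
import pandas as pd

# Toy visit counts with one extreme value.
visits = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 250], name="TotalVisits")

# Lower and upper bounds from the 5th and 95th percentiles.
lower, upper = visits.quantile([0.05, 0.95])

# Values below `lower` become `lower`, values above `upper` become `upper`.
capped = visits.clip(lower=lower, upper=upper)
print(capped)
```

With so few rows the interpolated upper percentile is still influenced by the extreme value, which is one reason a tighter percentile may be needed in practice.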

In [None]:
sns.boxplot(data=leads, x="Converted", y='TotalVisits')
plt.show()

We can still see a good amount of outliers; it will be better to cap at the 95th percentile.

In [None]:
quantile = leads.TotalVisits.quantile([0.05, 0.95]).values
# Cap values outside the 5th-95th percentile range; clip avoids chained assignment
leads['TotalVisits'] = leads['TotalVisits'].clip(lower=quantile[0], upper=quantile[1])

In [None]:
sns.boxplot(data=leads, x="Converted", y='TotalVisits')
plt.show()

#### Observations -
1. There were a good amount of outliers; capping at either the 99th or the 95th percentile could be used for further analysis.
2. The medians of the converted and non-converted leads are about the same.
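
The centre line of a boxplot is the median rather than the mean, and the two can differ noticeably when a group has a few large values. A grouped summary shows both; the rows below are made up for illustration.

```python
import pandas as pd

# Toy data: both groups share the same median, but one has a large value.
toy = pd.DataFrame({
    "Converted": [0, 0, 0, 1, 1, 1],
    "TotalVisits": [1, 2, 3, 1, 2, 12],
})

# Median and mean of the visits per target class.
stats = toy.groupby("Converted")["TotalVisits"].agg(["median", "mean"])
print(stats)
```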

Let's check the "Total Time Spent on Website" column for outliers

In [None]:
sns.boxplot(data=leads, x="Converted", y='Total Time Spent on Website')
plt.show()

#### Observations -
1. There are a good amount of outliers for non-converted leads
2. Also, leads who spend more time on the website tend to convert, so the website could be made more engaging to attract more leads and eventually more conversions

Let's look at the "Page Views Per Visit" attribute

In [None]:
sns.boxplot(data=leads, x="Converted", y='Page Views Per Visit')
plt.show()

There are outliers; we will cap at the 5th and 95th percentiles for this as well

In [None]:
quantile = leads['Page Views Per Visit'].quantile([0.05, 0.95]).values
# Cap values outside the 5th-95th percentile range; clip avoids chained assignment
leads['Page Views Per Visit'] = leads['Page Views Per Visit'].clip(lower=quantile[0], upper=quantile[1])

In [None]:
sns.boxplot(data=leads, x="Converted", y='Page Views Per Visit')
plt.show()

#### Observations -
1. Converted and non-converted leads have quite similar medians
2. We cannot infer here that converted leads visited more pages compared to non-converted ones

Let's analyse 'Last Activity' column

In [None]:
leads['Last Activity']

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Last Activity', hue = 'Converted')
plt.xticks(rotation = 90)
plt.show()

#### Observations - 
1. 'SMS Sent' and 'Email Opened' have more conversions than any other activity
2. 'SMS Sent' has the higher conversion rate
3. We can group the other minor 'Last Activity' values into 'Other'

In [None]:
leads['Last Activity'].value_counts()

In [None]:
leads['Last Activity'] = leads['Last Activity'].replace(
    ['Resubscribed to emails', 'Visited Booth in Tradeshow', 'Email Marked Spam',
     'Email Received', 'Approached upfront', 'View in browser link Clicked',
     'Had a Phone Conversation'], "Other")

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Last Activity', hue = 'Converted')
plt.xticks(rotation = 90)
plt.show()

#### Observations - 
1. Nothing major changes from the previous observations

Let's check conversions based on Country

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data = leads, x = 'Country', hue = 'Converted')
plt.xticks(rotation = 90)
plt.show()

#### Observations -
The majority of the leads, and of the conversions too, are from India

Let's check conversions with specialization

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Specialization', hue = 'Converted')
plt.xticks(rotation = 90)
plt.show()

#### Observations -
1. There are more conversions from the Other category; this implies that leads who did not mention their specialization have more conversions than any single specialization
2. Among the mentioned specializations, Marketing Management, Human Resource Management and Finance Management have very good conversion rates as well as numbers.

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'What is your current occupation', hue = 'Converted')
plt.xticks(rotation = 90)
plt.show()

#### Observations -
1. Unemployed leads have more conversions in numbers - we can infer that unemployed leads are interested in courses to find alternative choices in their career
2. Working professionals have a higher conversion rate - we can infer that working professionals are interested in courses to upskill themselves. This means working professionals have a high chance of joining a course.

In [None]:
leads.info()

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'What matters most to you in choosing a course', hue = 'Converted')
plt.xticks(rotation = 90)
plt.show()

#### Observations -
1. Supporting the previous observation, leads have chosen a course for "Better Career Prospects" and have a decent conversion rate.

Let's check all the advertisement attributes and their contributions to the conversions
They are -
1. Search
2. Magazine
3. Newspaper Article
4. X Education Forums
5. Newspaper
6. Digital Advertisement

In [None]:
fig, axs = plt.subplots(2, 3, figsize=(16, 8))
sns.countplot(data = leads, x = 'Search', hue = 'Converted', ax=axs[0, 0])
sns.countplot(data = leads, x = 'Magazine', hue = 'Converted', ax=axs[0, 1])
sns.countplot(data = leads, x = 'Newspaper Article', hue = 'Converted', ax=axs[0, 2])
sns.countplot(data = leads, x = 'X Education Forums', hue = 'Converted', ax=axs[1, 0])
sns.countplot(data = leads, x = 'Newspaper', hue = 'Converted', ax=axs[1, 1])
sns.countplot(data = leads, x = 'Digital Advertisement', hue = 'Converted', ax=axs[1, 2])
plt.show()

#### Observations -
1. Search has most entries as No with a decent conversion rate; it also has a few Yes entries with conversions
2. Magazine has all values as No - no inference can be drawn from it
3. Newspaper Article has most entries as No.
4. X Education Forums also has most entries as No.
5. Regular Newspaper has most entries as No.
6. Digital Advertisement has most entries as No.
7. This implies that none of the advertising media for X Education has been effective in lead conversions yet.
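
Columns dominated by a single value, like the ones above, can also be found without plotting. The sketch below uses made-up data, and the `cutoff` value is an assumption chosen for the toy frame.

```python
import pandas as pd

# Toy frame: one constant column, one nearly constant, one balanced.
toy = pd.DataFrame({
    "Magazine": ["No"] * 10,
    "Search": ["No"] * 9 + ["Yes"],
    "City": ["Mumbai"] * 5 + ["Thane"] * 5,
})

# Share of the single most frequent value in each column.
dominant_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

cutoff = 0.85  # hypothetical threshold for "nearly constant"
near_constant = dominant_share[dominant_share >= cutoff].index.tolist()
print(near_constant)
```

A column flagged this way carries little information for a model, although a rare value can still matter if it is strongly tied to the target.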

Let's check "Through Recommendations" attribute

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(data = leads, x = 'Through Recommendations', hue = 'Converted')
plt.show()

#### Observations -
1. Most of the leads are not through recommendations.
2. Very few leads came through recommendations, and the conversion rate is high for them

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(data = leads, x = 'Receive More Updates About Our Courses', hue = 'Converted')
plt.show()

#### Observations -
1. No lead opted to receive more updates about the courses; X Education might have to focus on sharing updates about the courses more proactively.

Let's check "Tags" attribute

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Tags', hue = 'Converted')
plt.xticks(rotation=90)
plt.show()

#### Observations -
1. Many converted leads were tagged as reverting after reading the email or as closed by Horizzon.
2. More emails could be sent to increase conversions
3. Also, this column is generated by the sales or marketing team and might not be helpful for model building - we will have to remove this column.

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(data = leads, x = 'Update me on Supply Chain Content', hue = 'Converted')
plt.show()

#### Observations -
1. All are No - not much inference to draw from here

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(data = leads, x = 'Get updates on DM Content', hue = 'Converted')
plt.show()

#### Observations -
1. All leads chose No for getting updates on DM content - not much inference to draw from here

Let's have a look at the "City" attribute

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(data = leads, x = 'City', hue = 'Converted')
plt.xticks(rotation = 90)
plt.show()

#### Observations -
1. Mumbai has the highest lead conversion in numbers.
2. Tier II cities have very few lead conversions
3. Thane & Outskirts, Other Metro Cities and Other Cities have quite high conversion rates although their number of leads is low

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(data = leads, x = 'I agree to pay the amount through cheque', hue = 'Converted')
plt.show()

#### Observations - 
1. All leads chose not to pay by cheque

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(data = leads, x = 'A free copy of Mastering The Interview', hue = 'Converted')
plt.show()

#### Observations -
1. Values are distributed between No and Yes, but not much inference can be drawn from here.
2. Conversion rate seems to be similar for both yes and no

In [None]:
plt.figure(figsize=(16, 8))
sns.countplot(data = leads, x = 'Last Notable Activity', hue = 'Converted')
plt.xticks(rotation=90)
plt.show()

#### Observations -
1. Most of the conversions happened when leads opened an email or an SMS was sent to them. The organization should keep its focus on these two processes to get more engagement

#### Many columns do not add much value to further analysis; we should remove these columns.
1. Lead Number
2. Country
3. Search
4. Magazine
5. Newspaper Article
6. X Education Forums
7. Newspaper
8. Digital Advertisement
9. Through Recommendations
10. Receive More Updates About Our Courses
11. Tags
12. Update me on Supply Chain Content
13. Get updates on DM Content
14. I agree to pay the amount through cheque
15. A free copy of Mastering The Interview

In [None]:
leads.drop(['Lead Number',
'Country',
'Search',
'Magazine',
'Newspaper Article',
'X Education Forums',
'Newspaper',
'Digital Advertisement',
'Through Recommendations',
'Receive More Updates About Our Courses',
'Tags',
'Update me on Supply Chain Content',
'Get updates on DM Content',
'I agree to pay the amount through cheque',
'A free copy of Mastering The Interview'], axis=1, inplace=True)

As observed earlier we will have to drop **"What matters most to you in choosing a course"** column too for better data preparation

In [None]:
leads.drop('What matters most to you in choosing a course', axis=1, inplace=True)

In [None]:
leads.info()

In [None]:
leads.shape

#### After the data clean-up, we have 14 attributes for model development.
##### A couple of columns are categorical, i.e. binary (yes/no) or multi-category. We will need to convert them to numerical form so that they can be used for model preparation.

## Data Preparation

#### There are only two columns with binary categories -
1. Do Not Email
2. Do Not Call

We will convert the `No` to `0` and `Yes` to `1`

In [None]:
columns = ['Do Not Email', 'Do Not Call']
leads[columns] = leads[columns].apply(lambda x: x.map({ 'No': 0, 'Yes': 1}))
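
`Series.map` turns any label missing from the mapping into `NaN`, so a quick check after the conversion can catch unexpected values. A sketch on made-up rows:

```python
import pandas as pd

# Toy frame with the two binary columns.
toy = pd.DataFrame({
    "Do Not Email": ["No", "Yes", "No"],
    "Do Not Call": ["No", "No", "Yes"],
})

binary_columns = ["Do Not Email", "Do Not Call"]
mapping = {"No": 0, "Yes": 1}

# Map each column separately; labels outside the mapping would become NaN.
encoded = toy[binary_columns].apply(lambda col: col.map(mapping))

# Number of values that could not be mapped (0 means every label was covered).
unmapped = int(encoded.isna().sum().sum())
print(encoded)
print("unmapped values:", unmapped)
```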

##### The other columns are multi-value categorical columns. They must be replaced with dummy variables. As per guidelines, we will drop the first dummy column of each categorical column after conversion.

In [None]:
columns_to_create_dummies = ['Lead Origin', 'Lead Source', 'Last Activity', 
    'Specialization', 'What is your current occupation', 
    'City', 'Last Notable Activity']
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
dummies = pd.get_dummies(leads[columns_to_create_dummies], drop_first=True, dtype=int)

In [None]:
dummies.head(3)
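
What `drop_first=True` does can be seen on a single made-up column: with three categories only two dummy columns remain, and the dropped category is represented by zeros in both.

```python
import pandas as pd

# Toy column with three categories.
toy = pd.DataFrame({"City": ["Mumbai", "Thane", "Other Cities", "Mumbai"]})

# Categories are ordered alphabetically, so 'Mumbai' is the one dropped here.
dummies = pd.get_dummies(toy, columns=["City"], drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids a set of dummy columns that always sums to one, which would be perfectly collinear with the model's constant term.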

In [None]:
leads = pd.concat([leads, dummies], axis=1)

In [None]:
leads.shape

#### Let's drop the columns for which dummies are created

In [None]:
leads.drop(columns_to_create_dummies, axis=1, inplace=True)

In [None]:
leads.shape

#### We are now in a position to split the data into train and test sets, using the sklearn module.

In [None]:
X = leads.drop(['Prospect ID', 'Converted'], axis=1)

In [None]:
y = leads.Converted

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)

In [None]:
# Making sure the test-train split is 30-70
print(round(len(X_train) / len(leads) * 100, 1))
print(round(len(X_test) / len(leads) * 100, 1))
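
Besides the size of the split, the share of converted leads in each part can be kept close to the overall rate with the `stratify` argument of `train_test_split`. This is an optional variation, not what the notebook does; the arrays below are made up with a 38% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and a target with 38% positives.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 38 + [0] * 62)

# stratify=y keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=100, stratify=y
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```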

#### Some attributes have numerical values in different ranges. They must be scaled before being used in the model. We will use StandardScaler for this.

In [None]:
scaler = StandardScaler()

In [None]:
X_train.head()

In [None]:
columns_to_scale = ['TotalVisits', 'Total Time Spent on Website', 'Page Views Per Visit']

In [None]:
X_train[columns_to_scale] = scaler.fit_transform(X_train[columns_to_scale])

In [None]:
X_train.head()
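
When the test set is scaled later, the usual practice is to reuse the scaler fitted on the training data (`transform`, not `fit_transform`), so that no information from the test set leaks into the scaling. A minimal sketch on made-up frames:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train and test frames with a single numeric column.
train = pd.DataFrame({"TotalVisits": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"TotalVisits": [2.5, 10.0]})
cols = ["TotalVisits"]

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn mean and standard deviation from the training data only.
train_scaled[cols] = scaler.fit_transform(train[cols])

# Apply the same mean and standard deviation to the test data.
test_scaled[cols] = scaler.transform(test[cols])

print(train_scaled)
print(test_scaled)
```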

#### Checking correlations among the features

In [None]:
plt.figure(figsize = (60,60))
sns.heatmap(X_train.corr(), annot = True, fmt=".3f", cmap="flare")
plt.show()

#### Certain columns are highly correlated with each other, e.g.
1. Last Activity_Email Bounced
2. Lead Source_Reference
3. Lead Source_Facebook
4. Lead Origin_Landing Page Submission
5. Last Notable Activity_SMS Sent
6. Last Notable Activity_Email Opened
7. Last Notable Activity_Had a Phone Conversation
8. Last Notable Activity_Page Visited on Website
9. Last Notable Activity_Unreachable
10. Last Notable Activity_Unsubscribed
11. Last Notable Activity_Email Link Clicked

#### Also there is high correlation between
1. `TotalVisits` and `Page Views Per Visit`
2. `Lead Origin_Landing Page Submission` and `Page Views Per Visit`
3. `Lead Origin_Landing Page Submission` and `TotalVisits`

but we should not drop these features, as they might contribute to the final model selection

In [None]:
columns_to_drop = ['Last Activity_Email Bounced'
,'Lead Source_Reference'
,'Lead Source_Facebook'
,'Lead Origin_Landing Page Submission'
,'Last Notable Activity_SMS Sent'
,'Last Notable Activity_Email Opened'
,'Last Notable Activity_Had a Phone Conversation'
,'Last Notable Activity_Page Visited on Website'
,'Last Notable Activity_Unreachable'
,'Last Notable Activity_Unsubscribed'
,'Last Notable Activity_Email Link Clicked']

X_train = X_train.drop(columns_to_drop, axis=1)
X_test = X_test.drop(columns_to_drop, axis=1)
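
Reading a large heatmap by eye is error-prone, so highly correlated pairs could also be listed from the correlation matrix directly. The sketch below uses synthetic data, and the 0.8 cut-off is an assumption rather than the criterion used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic features: 'b' is nearly a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) pairs and keep those above the cut-off.
pairs = upper.stack()
high = pairs[pairs > 0.8]
print(high)
```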

#### Checking the correlation again

In [None]:
plt.figure(figsize = (60,60))
sns.heatmap(X_train.corr(), annot = True, fmt=".3f", cmap="flare")
plt.show()

#### Now we are in a position to proceed for feature selection using RFE.

In [None]:
lr = LogisticRegression()

# Run RFE to select 25 features
rfe = RFE(estimator = lr, n_features_to_select = 25)
rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

#### Let's check the ranking and the RFE selection criteria with training data

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

#### Let's find out the features selected by RFE

In [None]:
columns_selected_by_rfe = X_train.columns[rfe.support_]

In [None]:
columns_selected_by_rfe
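
How `support_` and `ranking_` relate can be seen on a small synthetic problem: selected features have `support_` set to `True` and a rank of 1, and the remaining features are ranked in the order they were eliminated. The data comes from `make_classification`, and the numbers of features are arbitrary choices for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data with 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Recursively drop the weakest feature until 4 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)

# Indices of the features that were kept.
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)
print(list(zip(range(X.shape[1]), rfe.support_, rfe.ranking_)))
```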

#### Now that we have the first feature set selected by RFE, we can start building the model

## Model Building

In [None]:
logm1 = sm.GLM(y_train, (sm.add_constant(X_train[columns_selected_by_rfe])), family = sm.families.Binomial())
logm1.fit().summary()

#### `What is your current occupation_Housewife`, `Do Not Call` and `Last Notable Activity_View in browser link Clicked` have very high p-values. These columns need to be dropped

In [None]:
columns_selected_by_rfe = columns_selected_by_rfe.drop(["What is your current occupation_Housewife",
                                                        "Do Not Call",
                                                        "Last Notable Activity_View in browser link Clicked"])

In [None]:
logm2 = sm.GLM(y_train, (sm.add_constant(X_train[columns_selected_by_rfe])), family = sm.families.Binomial())
logm2.fit().summary()

#### `Lead Source_bing` has a very high p-value. We need to remove this column.

In [None]:
columns_selected_by_rfe = columns_selected_by_rfe.drop("Lead Source_bing")

In [None]:
logm3 = sm.GLM(y_train, (sm.add_constant(X_train[columns_selected_by_rfe])), family = sm.families.Binomial())
logm3.fit().summary()

#### There are still a few features with high p-values. Let's try to remove them one by one.
#### One of them is `Specialization_Hospitality Management`.

In [None]:
columns_selected_by_rfe = columns_selected_by_rfe.drop("Specialization_Hospitality Management")

In [None]:
logm4 = sm.GLM(y_train, (sm.add_constant(X_train[columns_selected_by_rfe])), family = sm.families.Binomial())
logm4.fit().summary()

#### `City_Tier II Cities` has a high p-value. This feature needs to be removed.

In [None]:
columns_selected_by_rfe = columns_selected_by_rfe.drop("City_Tier II Cities")

In [None]:
logm5 = sm.GLM(y_train, (sm.add_constant(X_train[columns_selected_by_rfe])), family = sm.families.Binomial())
logm5.fit().summary()

#### A few more features have slightly high p-values, but we will look at VIFs next to eliminate further features.