In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
leads = pd.read_csv('Leads.csv')

In [None]:
leads.head()

## Basic Analysis

In [None]:
leads.shape

#### There are a total of 9240 rows of data with 37 columns

In [None]:
leads.info()

In [None]:
leads.describe()

#### The summary statistics for the numeric columns show quite a lot of variance, as well as null values

## Data Cleaning

#### The columns hold various data types, and all seem to be in the correct format. However, some columns have fewer non-null entries than the total of 9240 rows, so there are null values as well

Let's calculate the percentage of null values in the dataset

In [None]:
def calculate_null_percentage(dataset):
    return round(dataset.isnull().sum() / len(dataset) * 100, 2)

In [None]:
calculate_null_percentage(leads)

#### As mentioned in the problem statement, "Select" is to be treated as `null`. If the data was collected through a user interface, a field could have offered options such as A, B, C and 'Select'. When the data entry operator or user did not choose any of the valid options, the field stayed at 'Select'. This implies that "Select" is the same as `null` in the dataset.

Let's replace `Select` with null and re-calculate the `null` values percentage.

In [None]:
leads = leads.replace('Select', np.nan)
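It can help to count how many cells held the placeholder in each column, so that a later jump in the null percentage can be traced back to it. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative and not taken from `Leads.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the leads data
toy = pd.DataFrame({
    "Lead Profile": ["Select", "Potential Lead", None, "Select"],
    "City": ["Mumbai", "Select", "Mumbai", None],
})

# Count the literal placeholder per column before replacing it
select_counts = (toy == "Select").sum()

# Treat the placeholder as missing, then recompute the null share per column
cleaned = toy.replace("Select", np.nan)
null_pct = (cleaned.isnull().mean() * 100).round(2)

print(select_counts)
print(null_pct)
```

Here `isnull().mean() * 100` gives the same per-column percentage as dividing the null count by the number of rows.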

In [None]:
calculate_null_percentage(leads)

#### There is a significant increase from `29.32%` to `74.19%` in the `Lead Profile` column after replacing all the `Select` values.

As a general guideline, columns with more than **`40%`** missing values are dropped, since they carry too little information to contribute meaningfully to the analysis.
Let's see what these columns actually contain.

In [None]:
columns_with_high_missing_values = ["How did you hear about X Education", 
                                    "Lead Quality", 
                                    "Lead Profile", 
                                    "Asymmetrique Activity Index", 
                                    "Asymmetrique Profile Index", 
                                    "Asymmetrique Activity Score", 
                                    "Asymmetrique Profile Score"]

In [None]:
leads[columns_with_high_missing_values]
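A hard-coded list like the one above could also be derived from the null percentages, which keeps the list in sync if the data changes. This is only a sketch on a hypothetical toy frame; the column names and the `THRESHOLD` constant are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "half_missing": [1.0, np.nan, 2.0, np.nan, 3.0],
    "complete": [1, 2, 3, 4, 5],
})

THRESHOLD = 40  # percent; the cutoff mentioned in this notebook

# Percentage of missing values per column
null_pct = toy.isnull().mean() * 100

# Keep only the columns strictly above the cutoff
high_missing = null_pct[null_pct > THRESHOLD].index.tolist()
print(high_missing)

# Drop them in one step
reduced = toy.drop(columns=high_missing)
```

The comparison is strict, so a column sitting exactly at the cutoff (like `half_missing` here) is kept. Whether borderline columns should stay is a judgment call.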

#### Although these columns seem relevant to the case study, since several of them hold some kind of score, their high percentage of missing values means they need to be dropped from the dataset

In [None]:
leads = leads.drop(columns = columns_with_high_missing_values)

In [None]:
leads.shape

In [None]:
calculate_null_percentage(leads)

#### Specialization, Tags and City have close to `40%` missing values, but we should not drop them as they might have an impact on the overall analysis

In [None]:
leads.Specialization.value_counts() / len(leads) * 100

#### Let's fill the null values with a category called `Other`

In [None]:
leads.Specialization = leads.Specialization.fillna('Other')

#### Recalculating the Specialization values

In [None]:
leads.Specialization.value_counts() / len(leads) * 100

#### Let's apply the same to the Tags column

In [None]:
leads.Tags.value_counts() / len(leads) * 100

The most frequent value in the Tags column is "Will revert after reading the email", i.e. **58.7%**. So, all the missing values can be filled with the same value.

In [None]:
leads.Tags = leads.Tags.fillna('Will revert after reading the email')
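Filling with the most frequent value is mode imputation, and the fill value can be looked up rather than typed by hand. Below is a minimal sketch on a hypothetical toy series (`toy_tags` and its shortened labels are assumptions, not values from the dataset):

```python
import pandas as pd

# Hypothetical toy series standing in for the Tags column
toy_tags = pd.Series(
    ["Ringing", "Will revert", None, "Will revert", None],
    name="Tags",
)

# mode() ignores missing values by default; take the first mode if there are ties
most_common = toy_tags.mode().iloc[0]

# Fill every missing entry with that value
filled = toy_tags.fillna(most_common)
print(filled.value_counts())
```

Note that mode imputation inflates the share of the most frequent category, so the distribution after filling differs from the one observed among the non-missing rows.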

In [None]:
leads.Tags.value_counts() / len(leads) * 100

#### Let's look into the City column

In [None]:
leads.City.value_counts() / len(leads) * 100

#### As we do not have enough information about City, we can fill the missing values with **Mumbai**, which is already present in the dataset and has the highest count. We could also have filled them with "Other Cities" here.

In [None]:
leads.City = leads.City.fillna('Mumbai')

In [None]:
calculate_null_percentage(leads)

#### Country column has **`26.63%`** missing values.

In [None]:
leads.Country.value_counts() / len(leads) * 100

India is the most frequently mentioned country, so the missing values can be filled with India.

In [None]:
leads.Country = leads.Country.fillna('India')

In [None]:
leads.Country.value_counts() / len(leads) * 100

#### "What is your current occupation" column has some missing values. Let's impute this.

In [None]:
leads['What is your current occupation'].value_counts() / len(leads) * 100

The majority of the leads are unemployed. Filling every missing value with `Unemployed` might not be appropriate, and `Other` would be a reasonable alternative. From a business-domain perspective, however, an unemployed person would probably tend to choose a course to improve their employment prospects. So the missing values are better filled with `Unemployed`.

In [None]:
leads['What is your current occupation'] = leads['What is your current occupation'].fillna('Unemployed')
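The trade-off between the two fill strategies discussed above can be made visible by comparing the resulting category shares. This is a sketch on a hypothetical toy series; the values and their proportions are made up for illustration:

```python
import pandas as pd

# Hypothetical toy series standing in for the occupation column
occupation = pd.Series([
    "Unemployed", "Unemployed", "Working Professional", None,
    None, "Student", None, "Unemployed",
])

# Strategy 1: fill with the most frequent category
mode_filled = occupation.fillna("Unemployed")

# Strategy 2: keep missing answers as their own category
other_filled = occupation.fillna("Other")

# Share of each category in percent under both strategies
mode_share = mode_filled.value_counts(normalize=True) * 100
other_share = other_filled.value_counts(normalize=True) * 100
print(mode_share)
print(other_share)
```

In this toy data the first strategy doubles the share of `Unemployed`, while the second leaves it unchanged and makes the missing answers visible as `Other`.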

In [None]:
leads['What is your current occupation'].value_counts() / len(leads) * 100

#### The "What matters most to you in choosing a course" column has a good number of missing values. Let's impute that too

In [None]:
leads['What matters most to you in choosing a course'].value_counts() / len(leads) * 100

The choice here is simple: fill the missing values with "Better Career Prospects".

In [None]:
leads['What matters most to you in choosing a course'] = leads['What matters most to you in choosing a course'].fillna('Better Career Prospects')

In [None]:
leads['What matters most to you in choosing a course'].value_counts() / len(leads) * 100

In [None]:
calculate_null_percentage(leads)

#### The remaining columns each have less than `1.5%` missing values. The rows containing them will be dropped.

In [None]:
leads.dropna(inplace=True)
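A small per-column missing percentage does not by itself say how many rows are lost, because missing values in different columns can fall in different rows. A quick check is to compare the row count before and after dropping. The sketch below uses a hypothetical toy frame, not the actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in different rows
toy = pd.DataFrame({
    "TotalVisits": [1.0, np.nan, 3.0, 4.0],
    "Lead Source": ["Google", "Olark Chat", None, "Google"],
})

rows_before = len(toy)

# Drop every row that has at least one missing value
toy = toy.dropna()

# Share of rows removed, in percent
rows_lost_pct = (rows_before - len(toy)) / rows_before * 100
print(f"Dropped {rows_before - len(toy)} of {rows_before} rows ({rows_lost_pct:.1f}%)")
```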

In [None]:
calculate_null_percentage(leads)

#### As we can see, there are no more null values in the dataset. We can proceed to data analysis for a better understanding of the dataset and its features.

## Exploratory Data Analysis

#### Let's find the conversion rate, since `Converted` is the target variable.

In [None]:
sum(leads.Converted) / len(leads) * 100

#### So, the conversion rate is very close to `38%`

Let's look at how lead origin and lead source relate to conversions.

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Lead Origin', hue = 'Converted')
plt.xticks(rotation = 45)
plt.show()

#### Observations
1. API and Landing Page Submission account for the most conversions by count
2. Lead Add Form has the highest conversion rate compared to the other two
3. Lead Import contributes very few conversions
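A count plot shows volumes, so conversion rates have to be read off by eye. The observations above can be checked numerically by grouping on the category and aggregating the 0/1 target. This is a minimal sketch on a hypothetical toy frame; the numbers are invented and do not reflect the real data:

```python
import pandas as pd

# Hypothetical toy frame; Converted is a 0/1 flag as in the notebook
toy = pd.DataFrame({
    "Lead Origin": ["API", "API", "API", "API", "Lead Add Form", "Lead Add Form"],
    "Converted": [1, 0, 0, 0, 1, 1],
})

# Per category: number of leads, number of conversions, and conversion rate
summary = toy.groupby("Lead Origin")["Converted"].agg(
    leads="count",
    conversions="sum",
    rate_pct="mean",
)
summary["rate_pct"] = (summary["rate_pct"] * 100).round(2)
print(summary)
```

Because `Converted` is 0 or 1, its mean within a group is that group's conversion rate. Keeping the lead count next to the rate helps separate high-volume categories from high-rate ones.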

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Lead Source', hue = 'Converted')
plt.xticks(rotation = 45)
plt.show()

#### Observations
1. Olark Chat, Organic Search, Direct Traffic, Google and Reference account for most of the conversions
2. Both Google and google are present - we need to merge them into a single category
3. There are also many **other** categories with minimal reach and conversions - we can merge them into a single "Others" category

In [None]:
leads['Lead Source'].value_counts()

In [None]:
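# Replace the lowercase variant 'google' with 'Google'.
# Assign the result back instead of using inplace=True on a column selection,
# which may not modify the DataFrame under pandas copy-on-write.
leads['Lead Source'] = leads['Lead Source'].replace("google", "Google")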

In [None]:
# Group the low-volume sources (Click2call, Press_Release, Social Media,
# Live Chat, youtubechannel, testone, Pay per Click Ads,
# welearnblog_Home, WeLearn, blog, NC_EDM) under 'Others'.
# The result is assigned back rather than relying on inplace=True.

leads['Lead Source'] = leads['Lead Source'].replace(
    ['Click2call', 'Press_Release', 'Social Media', 'Live Chat',
     'youtubechannel', 'testone', 'Pay per Click Ads', 'welearnblog_Home',
     'WeLearn', 'blog', 'NC_EDM'], "Others")
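Instead of listing every low-volume source by name, the rare categories could be selected by a frequency cutoff. The sketch below works on a hypothetical toy series, and `MIN_COUNT` is an arbitrary value chosen for illustration, not one used by this notebook:

```python
import pandas as pd

# Hypothetical toy series standing in for the 'Lead Source' column
source = pd.Series(
    ["Google"] * 5 + ["google"] * 2 + ["Olark Chat"] * 3 + ["blog", "testone"],
    name="Lead Source",
)

# Merge the case variants first so their counts are combined
source = source.replace("google", "Google")

MIN_COUNT = 3  # hypothetical cutoff: categories seen fewer times become 'Others'

# Find the categories below the cutoff
counts = source.value_counts()
rare = counts[counts < MIN_COUNT].index

# Keep frequent categories as they are, relabel the rare ones
grouped = source.where(~source.isin(rare), "Others")
print(grouped.value_counts())
```

A cutoff picks up new rare values automatically, whereas an explicit list makes the grouping easier to audit. Either approach is defensible.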

In [None]:
leads['Lead Source'].value_counts()

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(data = leads, x = 'Lead Source', hue = 'Converted')
plt.xticks(rotation = 45)
plt.show()

#### Observations
1. Google and Direct Traffic produce the highest number of conversions
2. Reference and Welingak website have the highest conversion rates in percentage terms

More focus should go to Google traffic, Reference and Welingak website; nurturing these sources might increase the conversion rate by a good margin

Let's see 'Do Not Email' and 'Do Not Call' columns based on the conversion rate

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(data = leads, x = 'Do Not Email', hue = 'Converted', ax=axs[0])
sns.countplot(data = leads, x = 'Do Not Call', hue = 'Converted', ax=axs[1])
plt.show()

#### Observations
1. Candidates who allow email and call communication account for more conversions.
2. For candidates with 'No' for 'Do Not Email' and 'No' for 'Do Not Call', the conversion rates are similar
3. It can be inferred that interested candidates allow calls and emails
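Observations like these can be backed by row-normalised percentages, since the two groups in each plot differ greatly in size. A minimal sketch with `pd.crosstab` on a hypothetical toy frame (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame; Converted is a 0/1 flag
toy = pd.DataFrame({
    "Do Not Email": ["No", "No", "No", "No", "Yes", "Yes"],
    "Converted": [1, 1, 0, 0, 0, 0],
})

# normalize="index" makes each row sum to 1, giving the share of
# converted and non-converted leads within each 'Do Not Email' group
rates = pd.crosstab(toy["Do Not Email"], toy["Converted"], normalize="index") * 100
print(rates)
```

The column labelled `1` is the conversion rate in percent for each group, which can be compared directly regardless of group size.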