### Problem statement

An education company named X Education sells online courses to industry professionals.

The typical lead conversion rate at X Education is between 30% and 40%. The CEO wants to raise it to around 80%.

Develop a lead scoring model using a leads dataset from the past with around 9000 data points and various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc.

- Lead Source: The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
- Do Not Email: An indicator variable selected by the customer, stating whether or not they want to be emailed about the course.
- Do Not Call: An indicator variable selected by the customer, stating whether or not they want to be called about the course.
- Converted: The target variable. Indicates whether a lead has been successfully converted or not.
- TotalVisits: The total number of visits made by the customer on the website.
- Page Views Per Visit: The average number of pages viewed by the customer per visit to the website.
- Last Activity: Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
- Country: The country of the customer.
- Search: Indicating whether the customer had seen the ad in any of the listed items (i.e. during a web search).
- Magazine: Indicating whether the customer had seen the ad in any of the listed items (i.e. in a magazine).
- Newspaper Article: Indicating whether the customer had seen the ad in any of the listed items (i.e. in a newspaper article).
- X Education Forums: Indicating whether the customer had seen the ad in any of the listed items (i.e. in an X Education forum).
- Newspaper: Indicating whether the customer had seen the ad in any of the listed items (i.e. in a newspaper).
- Digital Advertisement: Indicating whether the customer had seen the ad in any of the listed items (i.e. in a digital ad).
- Through Recommendations: Indicates whether the customer came in through recommendations.
- Receive More Updates About Our Courses: Indicates whether the customer chose to receive more updates about the courses.
- Specialization: The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form.
- How did you hear about X Education: The source from which the customer heard about X Education.
- What is your current occupation: Indicates whether the customer is a student, unemployed or employed.
- What matters most to you in choosing a course: An option selected by the customer indicating their main motive for doing this course.
- Update me on Supply Chain Content: Indicates whether the customer wants updates on the Supply Chain Content.
- Get updates on DM Content: Indicates whether the customer wants updates on the DM Content.
- Tags: Tags assigned to customers indicating the current status of the lead.
- Lead Quality: Indicates the quality of the lead based on the data and the intuition of the employee who has been assigned to the lead.
- Lead Profile: A lead level assigned to each customer based on their profile.
- I agree to pay the amount through cheque: Indicates whether the customer has agreed to pay the amount through cheque or not.
- A free copy of Mastering The Interview: Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.
- Asymmetrique Activity Index: An index and score assigned to each customer based on their activity and their profile.
- Asymmetrique Profile Index: An index and score assigned to each customer based on their activity and their profile.
- Asymmetrique Activity Score: An index and score assigned to each customer based on their activity and their profile.
- Asymmetrique Profile Score: An index and score assigned to each customer based on their activity and their profile.

In [None]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv(r'C:\Users\CG Lapy 2\Downloads\WORK\data_mining\data_mining_project\lead\Leads X Education.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data1 = data.copy()

#### Handling null values

In [None]:
data1.isnull().sum()

In [None]:
null_values_perc = data1.isnull().sum()/len(data1)
null_values_perc

In [None]:
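#dropping columns containing more than 50% null values
# null_values_perc holds fractions between 0 and 1, so 50% corresponds to 0.5
# columns removed here no longer exist, so later drop() calls must not list them again
# (or must pass errors='ignore')

null_values_50 = null_values_perc[null_values_perc > 0.5]
data1.drop(columns = null_values_50.index, inplace=True)
data1.shape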
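
A null-share threshold only works when it is on the same scale as the values it is compared with. The sketch below uses fractions in [0, 1] throughout; `drop_sparse_columns` and the toy frame are hypothetical and not part of the leads data.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_null_fraction=0.5):
    """Return a copy of df without columns whose null fraction exceeds the limit."""
    # isnull().mean() gives the per-column share of missing values, in [0, 1]
    null_fraction = df.isnull().mean()
    sparse_columns = null_fraction[null_fraction > max_null_fraction].index
    return df.drop(columns=sparse_columns)

# Hypothetical toy frame, used only to show the behaviour
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% null
    "half_missing": [np.nan, np.nan, 1.0, 2.0],       # exactly 50% null
    "complete": [1, 2, 3, 4],
})
kept = drop_sparse_columns(toy)
print(list(kept.columns))
```

A column with exactly 50% nulls is kept because the comparison is strict.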

In [None]:
# assign the result back; an inplace replace on a selected column may act on a copy
data1['Lead Source'] = data1['Lead Source'].fillna("Unknown")

In [None]:
data1['Do Not Call'].value_counts()

In [None]:
#checking the distributions of variables with null values

sns.histplot(x = data1.TotalVisits);

In [None]:
sns.histplot(x = data1['Page Views Per Visit']);

In [None]:
# above two are skewed, so we replace the null values with their median
data1['TotalVisits'] = data1['TotalVisits'].fillna(data1['TotalVisits'].median())
data1['Page Views Per Visit'] = data1['Page Views Per Visit'].fillna(data1['Page Views Per Visit'].median())
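
Why the median rather than the mean: on a right-skewed variable a few extreme values pull the mean far from the bulk of the data. A minimal sketch with made-up visit counts (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample with one extreme value and one missing entry
visits = pd.Series([1, 2, 2, 3, 250, np.nan])

mean_filled = visits.fillna(visits.mean())      # fill value is dragged up by the outlier
median_filled = visits.fillna(visits.median())  # fill value stays near the typical visit count

print("mean fill:", mean_filled.iloc[-1])
print("median fill:", median_filled.iloc[-1])
```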

In [None]:
data1['Last Activity'] = data1['Last Activity'].fillna("None")

In [None]:
data1['City'].unique()

In [None]:
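# if the city is Indian, fill a missing Country with "India"; otherwise fill it with "Unknown"
indian_cities = ['Mumbai', 'Thane & Outskirts', 'Other Cities of Maharashtra']

# only rows where Country is actually missing are touched
country_missing = data1['Country'].isnull()
in_indian_city = data1['City'].isin(indian_cities)

data1.loc[country_missing & in_indian_city, 'Country'] = "India"
data1.loc[country_missing & ~in_indian_city, 'Country'] = "Unknown"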

In [None]:
fields = ['Specialization', 'How did you hear about X Education']
data1[fields] = data1[fields].replace("Select", np.nan)
data1[fields] = data1[fields].fillna("NA")

In [None]:
data1['What is your current occupation'].value_counts()

In [None]:
data1["City"] = data1["City"].fillna("Unknown")

In [None]:
data1['What is your current occupation'] = data1['What is your current occupation'].fillna("NA")

In [None]:
data1['Magazine'].value_counts()

In [None]:
data1['Newspaper Article'].value_counts()

In [None]:
data1['X Education Forums'].value_counts()

In [None]:
data1['Newspaper'].value_counts()

In [None]:
data1['Digital Advertisement'].value_counts()

In [None]:
data1['Receive More Updates About Our Courses'].value_counts()

In [None]:
data1['Get updates on DM Content'].value_counts()

In [None]:
data1['I agree to pay the amount through cheque'].value_counts()

In [None]:
data1['What matters most to you in choosing a course'].value_counts()

In [None]:
#Removing unnecessary and irrelevant columns
# errors='ignore' skips any listed column that an earlier step has already removed

data1.drop(['Prospect ID', 'Lead Number', 'Do Not Call', 'Magazine', 'Newspaper Article','X Education Forums','Newspaper', 'Receive More Updates About Our Courses','Update me on Supply Chain Content','Get updates on DM Content','I agree to pay the amount through cheque','Tags','Lead Quality','What matters most to you in choosing a course'], axis=1, inplace=True, errors='ignore')

In [None]:
# map Yes/No to 1/0 in a single step; values that are already numeric are left unchanged
for i in ['Do Not Email', 'Converted','Search','Through Recommendations','A free copy of Mastering The Interview']:
    data1[i] = pd.to_numeric(data1[i].replace({"Yes": 1, "No": 0}))

In [None]:
data1.drop(['Asymmetrique Activity Score','Asymmetrique Profile Score'], axis=1, inplace=True)

In [None]:
data1.dropna(inplace=True)

In [None]:
data1.columns

In [None]:
data1['Asymmetrique Profile Index'].unique()

In [None]:
data3 = data1.copy()
data4 = data1.copy()

In [None]:
# cleaning the values in Asymmetrique Activity Index and Asymmetrique Profile Index columns

#removing every character that is not a digit

def clean_score(x):
    x = re.sub(r'[^\d]','',x)
    return x

In [None]:
data1['Asymmetrique Profile Index'] = data1['Asymmetrique Profile Index'].apply(clean_score)


In [None]:
data1['Asymmetrique Activity Index'] = data1['Asymmetrique Activity Index'].apply(clean_score)

In [None]:
data1['Asymmetrique Activity Index'].unique()

In [None]:
data1['Asymmetrique Profile Index'].unique()

In [None]:
data1['Asymmetrique Profile Index'] = data1['Asymmetrique Profile Index'].astype(int)
data1['Asymmetrique Activity Index'] = data1['Asymmetrique Activity Index'].astype(int)
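
A quick check of what the cleaning steps above produce. The sample labels are an assumption about the format of the index columns (a numeric prefix followed by a level name); the real values are whatever `unique()` printed above.

```python
import re
import pandas as pd

def clean_score(x):
    # keep only the digits of the label
    return re.sub(r'[^\d]', '', x)

# Assumed label format, for illustration only
sample = pd.Series(['01.High', '02.Medium', '03.Low'])

cleaned = sample.apply(clean_score).astype(int)
print(cleaned.tolist())
```

If the labels do follow this format, the resulting integers are ordinal but run in the opposite direction to the level names: the smallest number belongs to the highest level.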

In [None]:
data2 = data1.copy()

### EDA

In [None]:
sns.countplot( x = data1.Converted, data = data1);

In [None]:
origin_count = data1['Lead Origin'].value_counts().sort_values(ascending=False).index
sns.countplot(x = data1['Lead Origin'], data = data1, order = origin_count)

plt.xticks(rotation = 45);

In [None]:
source_count = data1['Lead Source'].value_counts().sort_values(ascending=False).index
sns.countplot(x = data1['Lead Source'], data = data1, order = source_count)

plt.xticks(rotation = 75);

In [None]:
sns.countplot(x = 'Specialization', data = data1, order = data1['Specialization'].value_counts().index)
plt.title("Specialization of leads")
plt.xticks(rotation = 90);

In [None]:
activity = data1['Last Activity'].value_counts().index
sns.countplot(x = 'Last Activity', data=data1, order = activity)
plt.xticks(rotation=90);

In [None]:
occupation = data1['What is your current occupation'].value_counts().index
sns.countplot(x = 'What is your current occupation', data=data1, order = occupation)
plt.xticks(rotation=90);

In [None]:
city_count = data1['City'].value_counts().index
sns.countplot(x = 'City', data=data1, order = city_count)
plt.xticks(rotation=70);

In [None]:
sns.countplot(x = 'Lead Origin', data = data1, hue = 'Converted')
plt.xticks(rotation = 45)
plt.title("Distribution of Lead Origin by Conversion");

In [None]:
sns.countplot(x = 'Lead Source', data = data1, hue = 'Converted', order = data1['Lead Source'].value_counts().index)
plt.xticks(rotation = 90)
plt.title("Distribution of Lead Source by conversion");

In [None]:
sns.countplot(x = 'Last Notable Activity', data = data1, hue = 'Converted', order = data1['Last Notable Activity'].value_counts().index)
plt.xticks(rotation = 90)
plt.title("Distribution of leads' last notable activity by conversion");

In [None]:
data2 = data1.copy()

### Encoding

In [None]:
#applying ohe on the remaining categorical columns

# Select all categorical columns
cat_columns = data2.select_dtypes(include=['object']).columns

# Apply one-hot encoding to categorical columns
data_encoded = pd.get_dummies(data2, columns=cat_columns)
data_encoded.shape
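
One-hot encoding replaces each categorical column with one indicator column per level, which is why the column count grows. A small sketch on a made-up frame (the values are illustrative, not taken from the leads data):

```python
import pandas as pd

# Hypothetical frame with one numeric target and one categorical feature
toy = pd.DataFrame({
    "Converted": [1, 0, 1],
    "City": ["Mumbai", "Thane", "Mumbai"],
})

toy_encoded = pd.get_dummies(toy, columns=["City"])
print(list(toy_encoded.columns))
print(toy_encoded.shape)
```

`pd.get_dummies` also accepts `drop_first=True`, which drops one level per feature to serve as the baseline; the notebook keeps all levels.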

In [None]:
data_encoded1 = data_encoded.copy()
data_encoded2 = data_encoded.copy()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer

X = data_encoded.drop('Converted', axis=1)
y = data_encoded['Converted']

# Normalizer rescales each row (sample) to unit L2 norm; it does not scale features column-wise
normalizer = Normalizer()
X_normalized = normalizer.fit_transform(X)

# keep the column names and index, and avoid overwriting the raw `data` frame
X_normalized = pd.DataFrame(X_normalized, columns=X.columns, index=X.index)

X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42)
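
It helps to be clear about what `Normalizer` does to the features. The sketch below contrasts it with `StandardScaler` on a made-up array; `StandardScaler` is shown only as a point of comparison and is not what this notebook uses.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical data: rows are samples, columns are features
X_toy = np.array([[3.0, 4.0],
                  [6.0, 8.0],
                  [0.0, 5.0]])

row_scaled = Normalizer().fit_transform(X_toy)      # each row divided by its own L2 norm
col_scaled = StandardScaler().fit_transform(X_toy)  # each column centred and scaled to unit variance

print(np.linalg.norm(row_scaled, axis=1))  # row norms after Normalizer
print(col_scaled.mean(axis=0))             # column means after StandardScaler
```

The first two rows differ only in magnitude, so they become identical after row normalization. Whether that loss of magnitude is acceptable depends on the features involved.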

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

lr = LogisticRegression()

lr.fit(X_train, y_train)
pred_prob1 = lr.predict_proba(X_test)[:,1]

threshold = 0.5
y_pred_lr = (pred_prob1 > threshold).astype(int)
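
The 0.5 cut-off is a choice rather than a given: raising it makes the model flag fewer leads, which tends to raise precision and lower recall. One reading of the conversion target in the problem statement is as precision among the leads that are contacted, so the threshold is worth exploring. A minimal sketch with made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.55, 0.90, 0.20])

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (probs > threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
    print(threshold, results[threshold])
```

In the notebook the same loop could be run on `pred_prob1` and `y_test`.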

In [None]:
accuracy = lr.score(X_test, y_test)
print("Accuracy: ", accuracy)

precision = precision_score(y_test, y_pred_lr)
print("precision: ", precision )

recall = recall_score(y_test, y_pred_lr)
print("Recall: ", recall)

f1 = f1_score(y_test, y_pred_lr)
print("F1 score: ", f1)

In [None]:
cm1 = confusion_matrix(y_test, y_pred_lr)
print(cm1)
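
For reading the printed matrix: scikit-learn puts the true labels on the rows and the predicted labels on the columns, so a binary matrix unpacks as TN, FP, FN, TP. A short sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted conversions that were real
recall = tp / (tp + fn)     # share of real conversions that were found
print(cm)
print(precision, recall)
```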

In [None]:
sns.heatmap(cm1, annot = True, cmap=sns.color_palette("flare", as_cmap=True))
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')
plt.show();

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# predict with the random forest, not the logistic regression model
y_pred_rf = rf.predict(X_test)

In [None]:
accuracy_rf = rf.score(X_test, y_test)
print("Accuracy: ", accuracy_rf)

precision_rf = precision_score(y_test, y_pred_rf)
print("precision: ", precision_rf )

recall_rf = recall_score(y_test, y_pred_rf)
print("Recall: ", recall_rf)

f1_rf = f1_score(y_test, y_pred_rf)
print("F1 score: ", f1_rf)

In [None]:
cm2 = confusion_matrix(y_test, y_pred_rf)
print(cm2)

In [None]:
sns.heatmap(cm2, annot = True, cmap=sns.color_palette("flare", as_cmap=True))
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')
plt.show();

In [None]:
pip install scikit-plot

In [None]:
from sklearn.model_selection import GridSearchCV
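# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement so later cells keep working
from sklearn.metrics import RocCurveDisplay
plot_roc_curve = RocCurveDisplay.from_estimator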

In [None]:
#Fit logistic model
parameters_lr = {'penalty': ['l1', 'l2'], 'C' : np.logspace(-3,3,5,base=10.0)}
lr1 = LogisticRegression(solver='liblinear', random_state=123)

lr_cv = GridSearchCV(lr1, param_grid=parameters_lr, cv=5, scoring='roc_auc', n_jobs=-1)
lr_cv.fit(X_train, y_train)

print(lr_cv.best_params_)
lr_best = lr_cv.best_estimator_

In [None]:
#Fit random forest classifier w/ hyperparameter tuning
parameters_rf = {'max_depth':np.arange(6,30,2),'min_samples_leaf':np.arange(100,500,50)}
rf1 = RandomForestClassifier()

rf_cv = GridSearchCV(rf1, param_grid=parameters_rf, cv=5, scoring='roc_auc', n_jobs=-1)
rf_cv.fit(X_train, y_train)

print(rf_cv.best_params_)
rf_best = rf_cv.best_estimator_

In [None]:
#Plot ROC curve & AUC
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 5))
models = {'Logistic Regression':lr_best, 'Random Forest Classifier':rf_best}

# pair each model with its own subplot
for ax, (n, m) in zip(axes, models.items()):
    plot_roc_curve(m, X_test, y_test, ax=ax)
    ax.set_title('ROC Curve - ' + n)

###### The Logistic Regression model performs better: its ROC curve lies above that of the Random Forest Classifier, which corresponds to a higher AUC
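
A visual comparison of ROC curves can be backed by the AUC value itself. The sketch below uses made-up labels and scores for two hypothetical models; in the notebook the equivalent call would presumably be `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])` for each of `lr_best` and `rf_best`.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores from two models
y_true = [0, 0, 1, 1]
scores_model_a = [0.10, 0.40, 0.35, 0.80]  # one positive is ranked below a negative
scores_model_b = [0.10, 0.20, 0.70, 0.80]  # every positive is ranked above every negative

auc_a = roc_auc_score(y_true, scores_model_a)
auc_b = roc_auc_score(y_true, scores_model_b)
print(auc_a, auc_b)
```

AUC equals the share of positive/negative pairs that the model ranks in the right order, so it does not depend on any particular threshold.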