# ExtraaLearn Project

## Context

The EdTech industry has grown rapidly over the past decade. According to one forecast, the online education market would be worth $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. Features such as ease of information sharing, personalized learning experiences, and transparent assessment mean that online education is now often preferred over traditional education.

In the present scenario, due to Covid-19, the online education sector has witnessed rapid growth and is attracting many new customers. This growth has brought many new companies into the industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. EdTech companies obtain leads from various sources, such as:

* The customer interacts with the marketing front on social media or other online platforms.
* The customer browses the website/app and downloads the brochure.
* The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. To do this, a representative from the organization contacts the lead by phone or email to share further details.

## Objective

ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
* Analyze the data and build an ML model to help identify which leads are more likely to convert to paid customers
* Find the factors driving the lead conversion process
* Create a profile of the leads that are likely to convert


## Data Description

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.


**Data Dictionary**
* ID: ID of the lead
* age: Age of the lead
* current_occupation: Current occupation of the lead. Values include 'Professional', 'Unemployed', and 'Student'
* first_interaction: How the lead first interacted with ExtraaLearn. Values include 'Website' and 'Mobile App'
* profile_completed: Percentage of the profile the lead has filled in on the website/mobile app. Values include Low (0-50%), Medium (50-75%), and High (75-100%)
* website_visits: Number of times the lead has visited the website
* time_spent_on_website: Total time spent on the website
* page_views_per_visit: Average number of pages on the website viewed during the visits.
* last_activity: Last interaction between the lead and ExtraaLearn.
    * Email Activity: Sought details about the program through email, representative shared information such as the program brochure with the lead, etc.
    * Phone Activity: Had a phone conversation with a representative, had a conversation over SMS with a representative, etc.
    * Website Activity: Interacted on live chat with a representative, updated profile on the website, etc.

* print_media_type1: Flag indicating whether the lead had seen the ExtraaLearn ad in a newspaper.
* print_media_type2: Flag indicating whether the lead had seen the ExtraaLearn ad in a magazine.
* digital_media: Flag indicating whether the lead had seen the ExtraaLearn ad on digital platforms.
* educational_channels: Flag indicating whether the lead had heard about ExtraaLearn through education channels such as online forums, discussion threads, educational websites, etc.
* referral: Flag indicating whether the lead had heard about ExtraaLearn through a referral.
* status: Flag indicating whether the lead was converted to a paid customer or not.
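
The dictionary above can double as a quick validation check. The following is a minimal sketch, assuming the listed values are exhaustive (the dictionary only says "values include"); the helper `unexpected_values` and the two-row `sample` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Allowed values taken from the data dictionary above
# (assumption: the listed values are the only valid ones)
expected_values = {
    "current_occupation": {"Professional", "Unemployed", "Student"},
    "first_interaction": {"Website", "Mobile App"},
    "profile_completed": {"Low", "Medium", "High"},
    "last_activity": {"Email Activity", "Phone Activity", "Website Activity"},
}


def unexpected_values(df, expected):
    """Return, per column, any observed values that the dictionary does not list."""
    return {
        col: set(df[col].dropna().unique()) - allowed
        for col, allowed in expected.items()
        if col in df.columns
    }


# Hypothetical two-row sample; the second row contains a misspelt category
sample = pd.DataFrame(
    {
        "current_occupation": ["Student", "Profesional"],
        "first_interaction": ["Website", "Mobile App"],
    }
)
print(unexpected_values(sample, expected_values))
```

Columns that are absent from the frame are skipped, so the same check can be run before and after columns are dropped or encoded.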

## Importing necessary libraries and data

In [64]:
import warnings

warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning

warnings.simplefilter("ignore", ConvergenceWarning)



import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns


import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from scipy.stats import zscore

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    classification_report,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

## Data Overview

- Observations
- Sanity checks

In [45]:
learn = pd.read_csv("/content/ExtraaLearn.csv")

In [46]:
data = learn.copy()

In [47]:
data.head()

Unnamed: 0,ID,age,current_occupation,first_interaction,profile_completed,...
0,EXT001,57,Unemployed,Website,High,...
1,EXT002,56,Professional,Mobile App,Medium,...
2,EXT003,52,Professional,Website,Medium,...
3,EXT004,53,Unemployed,Website,High,...
4,EXT005,23,Student,Website,High,...

In [48]:
data.tail()

Unnamed: 0,ID,age,current_occupation,first_interaction,profile_completed,website_visits,time_spent_on_website,page_views_per_visit,last_activity,print_media_type1,print_media_type2,digital_media,educational_channels,referral,status
4607,EXT4608,35,Unemployed,Mobile App,Medium,15,360,2.17,Phone Activity,No,No,No,Yes,No,0
4608,EXT4609,55,Professional,Mobile App,Medium,8,2327,5.393,Email Activity,No,No,No,No,No,0
4609,EXT4610,58,Professional,Website,High,2,212,2.692,Email Activity,No,No,No,No,No,1
4610,EXT4611,57,Professional,Mobile App,Medium,1,154,3.879,Website Activity,Yes,No,No,No,No,0
4611,EXT4612,55,Professional,Website,Medium,4,2290,2.075,Phone Activity,No,No,No,No,No,0


In [49]:
data.shape

(4612, 15)

In [50]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     4612 non-null   object 
 1   age                    4612 non-null   int64  
 2   current_occupation     4612 non-null   object 
 3   first_interaction      4612 non-null   object 
 4   profile_completed      4612 non-null   object 
 5   website_visits         4612 non-null   int64  
 6   time_spent_on_website  4612 non-null   int64  
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object 
 9   print_media_type1      4612 non-null   object 
 10  print_media_type2      4612 non-null   object 
 11  digital_media          4612 non-null   object 
 12  educational_channels   4612 non-null   object 
 13  referral               4612 non-null   object 
 14  status                 4612 non-null   int64  
dtypes: float64(1), int64(4), object(10)

In [51]:
data.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
4607    False
4608    False
4609    False
4610    False
4611    False
Length: 4612, dtype: bool
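
A per-row boolean Series is hard to read at a glance. A minimal sketch of summarising duplicates and missing values as counts instead, using a small hypothetical frame (`toy`) rather than the real leads data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the leads data (not the real file)
toy = pd.DataFrame(
    {
        "age": [57, 56, 56, None],
        "current_occupation": ["Unemployed", "Professional", "Professional", "Student"],
    }
)

# Number of fully duplicated rows (the third row repeats the second)
n_duplicates = int(toy.duplicated().sum())

# Number of missing values per column
n_missing = toy.isnull().sum()

print("Duplicate rows:", n_duplicates)
print(n_missing)
```

On the notebook's frame the equivalent calls would be `data.duplicated().sum()` and `data.isnull().sum()`.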

## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**
1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?
3. The company uses multiple modes to interact with prospects. Which way of interaction works best?
4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?
5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?
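
Most of these questions reduce to comparing the conversion rate across the levels of a categorical column. Because `status` is coded 0/1, its mean within a group is that group's conversion rate. A minimal sketch, assuming a hypothetical helper `conversion_rate` and a six-row toy frame in place of the real data:

```python
import pandas as pd

# Hypothetical toy leads; the notebook would pass `data` instead
toy = pd.DataFrame(
    {
        "current_occupation": [
            "Professional", "Professional", "Student",
            "Student", "Unemployed", "Unemployed",
        ],
        "first_interaction": [
            "Website", "Mobile App", "Website",
            "Mobile App", "Website", "Mobile App",
        ],
        "status": [1, 0, 0, 0, 1, 0],
    }
)


def conversion_rate(df, column, target="status"):
    """Share of converted leads (mean of the 0/1 target) per category."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)


for col in ["current_occupation", "first_interaction"]:
    print(conversion_rate(toy, col))
    print("-" * 50)
```

The same helper could be applied to `profile_completed`, `last_activity`, and the media flag columns to address questions 3 to 5.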

In [52]:
data.describe()

Unnamed: 0,age,website_visits,time_spent_on_website,page_views_per_visit,status
count,4612.0,4612.0,4612.0,4612.0,4612.0
mean,46.20121,3.56678,724.01127,3.02613,0.29857
std,13.16145,2.82913,743.82868,1.96812,0.45768
min,18.0,0.0,0.0,0.0,0.0
25%,36.0,2.0,148.75,2.07775,0.0
50%,51.0,3.0,376.0,2.792,0.0
75%,57.0,5.0,1336.75,3.75625,1.0
max,63.0,30.0,2537.0,18.434,1.0


In [53]:
# Making a list of all categorical variables
cat_col = list(data.select_dtypes("object").columns)

# Printing the count of each unique value in each column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 50)

EXT001     1
EXT2884    1
EXT3080    1
EXT3079    1
EXT3078    1
          ..
EXT1537    1
EXT1536    1
EXT1535    1
EXT1534    1
EXT4612    1
Name: ID, Length: 4612, dtype: int64
--------------------------------------------------
Professional    2616
Unemployed      1441
Student          555
Name: current_occupation, dtype: int64
--------------------------------------------------
Website       2542
Mobile App    2070
Name: first_interaction, dtype: int64
--------------------------------------------------
High      2264
Medium    2241
Low        107
Name: profile_completed, dtype: int64
--------------------------------------------------
Email Activity      2278
Phone Activity      1234
Website Activity    1100
Name: last_activity, dtype: int64
--------------------------------------------------
No     4115
Yes     497
Name: print_media_type1, dtype: int64
--------------------------------------------------
No     4379
Yes     233
Name: print_media_type2, dtype: int64
--------------------

In [54]:
# Checking the number of unique values in the ID column
data["ID"].nunique()

4612

In [55]:
# ID is unique for every row, so it carries no predictive information
data.drop(["ID"], axis=1, inplace=True)

## Data Preprocessing

- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)

In [65]:
from sklearn.preprocessing import StandardScaler, LabelEncoder

num_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']

# 1. Missing Value Treatment
# Numeric columns are filled with the column mean, the remaining columns with the mode.
# numeric_only=True is required because mean() raises on object columns in pandas 2.x.
missing_values = data.isnull().sum()
data.fillna(data.mean(numeric_only=True), inplace=True)
data.fillna(data.mode().iloc[0], inplace=True)

# 2. Feature Engineering
# Flag leads with more than 5 website visits (5 is the 75th percentile)
data['high_website_visits'] = (data['website_visits'] > 5).astype(int)

# 3. Outlier Detection and Treatment
# Keep rows whose numeric features all lie within 3 standard deviations of the mean
z_scores = zscore(data[num_cols])
data = data[(np.abs(z_scores) < 3).all(axis=1)].copy()

# 4. Preparing Data for Modeling
# One-hot encode the multi-level categorical columns; the Yes/No flag columns are left as-is
data_encoded = pd.get_dummies(data, columns=['current_occupation', 'first_interaction', 'profile_completed', 'last_activity'])

# 5. Any Other Preprocessing Steps
# Standardize the numeric features (tree-based models do not require this)
scaler = StandardScaler()
data_encoded[num_cols] = scaler.fit_transform(data_encoded[num_cols])

# status is already coded 0/1, so LabelEncoder leaves the values unchanged
label_encoder = LabelEncoder()
data_encoded['status'] = label_encoder.fit_transform(data_encoded['status'])

# Display the first few rows of the preprocessed data
print(data_encoded.head())

      age  website_visits  time_spent_on_website  page_views_per_visit  \
0 0.81707         1.75187                1.23929              -0.59700   
1 0.74057        -0.57292               -0.85515              -1.52226   
2 0.43456        -0.10797               -0.52268              -1.66997   
3 0.51106         0.35699               -0.34231              -0.47932   
5 0.28156         0.35699               -0.68151               1.69723   

  print_media_type1 print_media_type2 digital_media educational_channels  \
0               Yes                No           Yes                   No   
1                No                No            No                  Yes   
2                No                No           Yes                   No   
3                No                No            No                   No   
5                No                No            No                  Yes   

  referral  status  high_website_visits  current_occupation_Professional  \
0       No       1    
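
The preprocessing cell fits the scaler on the full dataset before any train/test split, so statistics from the test rows influence the transformation. Tree-based models are insensitive to scaling, but for models that need it, one option is to wrap the preprocessing in a scikit-learn `Pipeline` so that it is fitted on the training split only. The sketch below is an illustration on synthetic data, not the notebook's method; the frame `toy` and its random values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
# Synthetic stand-in for the leads data (values are random, not real)
toy = pd.DataFrame(
    {
        "age": rng.integers(18, 64, size=n),
        "time_spent_on_website": rng.integers(0, 2500, size=n),
        "first_interaction": rng.choice(["Website", "Mobile App"], size=n),
        "status": rng.integers(0, 2, size=n),
    }
)

X = toy.drop(columns="status")
y = toy["status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "time_spent_on_website"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["first_interaction"]),
    ]
)

# Calling fit on the pipeline fits the scaler and encoder on the training split only
model = Pipeline(
    [("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))]
)
model.fit(X_train, y_train)
print("Test accuracy on synthetic data:", model.score(X_test, y_test))
```

Because the labels here are random, the printed accuracy is meaningless; the point is only that the scaler's statistics come from `X_train`.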




## Building a Decision Tree model

In [66]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Selecting features for the model
features = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'high_website_visits',
            'current_occupation_Professional', 'current_occupation_Student', 'first_interaction_Mobile App',
            'profile_completed_High', 'profile_completed_Low', 'last_activity_Website Activity']

X = data_encoded[features]
y = data_encoded['status']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Building the decision tree model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)

# Predicting on the test set
y_pred = tree_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Displaying the results
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)


Accuracy: 0.7894
Confusion Matrix:
 [[532  82]
 [101 154]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.87      0.85       614
           1       0.65      0.60      0.63       255

    accuracy                           0.79       869
   macro avg       0.75      0.74      0.74       869
weighted avg       0.79      0.79      0.79       869
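
The class-1 figures in the report follow directly from the confusion matrix. As a worked check, this short sketch recomputes them from the four counts printed above (scikit-learn's convention is rows for actual classes and columns for predicted classes):

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predicted
conf_matrix = np.array([[532, 82], [101, 154]])
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)  # of leads predicted to convert, the share that did
recall = tp / (tp + fn)     # of leads that converted, the share the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1:        {f1:.2f}")
```

A recall of 0.60 means the tree misses 101 of the 255 leads in the test set that actually converted, which matters more for lead prioritisation than overall accuracy.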



## Model Performance evaluation and improvement

In [67]:
# Re-evaluate the fitted decision tree (tree_model) on the held-out test set (X_test, y_test)

# Predicting on the test set
y_pred = tree_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Displaying the initial results
print("Initial Model Performance:")
print(f"Accuracy: {accuracy:.4f}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)


Initial Model Performance:
Accuracy: 0.7894
Confusion Matrix:
 [[532  82]
 [101 154]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.87      0.85       614
           1       0.65      0.60      0.63       255

    accuracy                           0.79       869
   macro avg       0.75      0.74      0.74       869
weighted avg       0.79      0.79      0.79       869



In [68]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create the GridSearchCV object
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)


Best Hyperparameters: {'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2}
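
The grid search reports the best hyperparameters, but the tuned tree still has to be scored on the test set before it can be compared with the baseline. Since missed conversions are costly, tuning for recall rather than accuracy may also be worth trying. The sketch below shows both steps on synthetic data from `make_classification`; the dataset and the choice of `scoring="recall"` are assumptions for illustration, not results from the leads data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the leads data (roughly 70/30 classes)
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# scoring="recall" selects the parameters with the best recall on the positive class
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="recall"
)
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split with the best parameters
tuned_tree = grid_search.best_estimator_
y_pred_tuned = tuned_tree.predict(X_test)

print("Best hyperparameters:", grid_search.best_params_)
print("Test recall:", recall_score(y_test, y_pred_tuned))
print(classification_report(y_test, y_pred_tuned))
```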


In [69]:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the untuned decision tree (tree_model) on the training set
cv_scores = cross_val_score(tree_model, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Accuracy:", cv_scores.mean())


Cross-Validation Scores: [0.79280576 0.78992806 0.80403458 0.78530259 0.78242075]
Mean CV Accuracy: 0.7908983476043373


In [70]:
from sklearn.ensemble import RandomForestClassifier

# Example using Random Forest
forest_model = RandomForestClassifier(random_state=42)
forest_model.fit(X_train, y_train)
forest_accuracy = accuracy_score(y_test, forest_model.predict(X_test))
print(f"Random Forest Accuracy: {forest_accuracy:.4f}")


Random Forest Accuracy: 0.8423


## Building a Random Forest model

In [71]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Reuses the training and testing sets (X_train, X_test, y_train, y_test) created for the decision tree

# Building the Random Forest model
forest_model = RandomForestClassifier(random_state=42)
forest_model.fit(X_train, y_train)

# Predicting on the test set
y_pred_forest = forest_model.predict(X_test)

# Evaluating the Random Forest model
accuracy_forest = accuracy_score(y_test, y_pred_forest)
conf_matrix_forest = confusion_matrix(y_test, y_pred_forest)
classification_rep_forest = classification_report(y_test, y_pred_forest)

# Displaying the results
print("Random Forest Model Performance:")
print(f"Accuracy: {accuracy_forest:.4f}")
print("Confusion Matrix:\n", conf_matrix_forest)
print("Classification Report:\n", classification_rep_forest)


Random Forest Model Performance:
Accuracy: 0.8423
Confusion Matrix:
 [[553  61]
 [ 76 179]]
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.90      0.89       614
           1       0.75      0.70      0.72       255

    accuracy                           0.84       869
   macro avg       0.81      0.80      0.81       869
weighted avg       0.84      0.84      0.84       869



## Model Performance evaluation and improvement

In [73]:
# Re-evaluate the fitted random forest (forest_model) on the held-out test set (X_test, y_test)

# Predicting on the test set
y_pred_forest = forest_model.predict(X_test)

# Evaluating the Random Forest model
accuracy_forest = accuracy_score(y_test, y_pred_forest)
conf_matrix_forest = confusion_matrix(y_test, y_pred_forest)
classification_rep_forest = classification_report(y_test, y_pred_forest)

# Displaying the initial results
print("Initial Random Forest Model Performance:")
print(f"Accuracy: {accuracy_forest:.4f}")
print("Confusion Matrix:\n", conf_matrix_forest)
print("Classification Report:\n", classification_rep_forest)


Initial Random Forest Model Performance:
Accuracy: 0.8423
Confusion Matrix:
 [[553  61]
 [ 76 179]]
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.90      0.89       614
           1       0.75      0.70      0.72       255

    accuracy                           0.84       869
   macro avg       0.81      0.80      0.81       869
weighted avg       0.84      0.84      0.84       869



In [74]:
from sklearn.model_selection import cross_val_score

# Cross-validation
cv_scores_forest = cross_val_score(forest_model, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-Validation Scores for Random Forest:", cv_scores_forest)
print("Mean CV Accuracy for Random Forest:", cv_scores_forest.mean())


Cross-Validation Scores for Random Forest: [0.83453237 0.85755396 0.85446686 0.83285303 0.83285303]
Mean CV Accuracy for Random Forest: 0.8424518483196153


In [75]:
feature_importances = forest_model.feature_importances_

# Create a DataFrame to display feature importance
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display feature importance
print("Feature Importance:")
print(feature_importance_df)


Feature Importance:
                            Feature  Importance
2             time_spent_on_website     0.28870
7      first_interaction_Mobile App     0.18136
3              page_views_per_visit     0.15050
0                               age     0.12604
8            profile_completed_High     0.10806
1                    website_visits     0.06166
5   current_occupation_Professional     0.03110
10   last_activity_Website Activity     0.02639
6        current_occupation_Student     0.01261
4               high_website_visits     0.00753
9             profile_completed_Low     0.00604
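
The values above are impurity-based importances, which are computed on the training data and can favour continuous or high-cardinality features. Permutation importance on held-out data is a common cross-check. The following is a minimal sketch on synthetic data; the feature names are hypothetical placeholders and the numbers it prints say nothing about the leads data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 2 informative features out of 5
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Shuffle each feature in turn on the test set and measure the drop in score
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=42
)

importance_df = pd.DataFrame(
    {
        "Feature": feature_names,
        "Importance": result.importances_mean,
        "Std": result.importances_std,
    }
).sort_values(by="Importance", ascending=False)
print(importance_df)
```

Neither kind of importance shows whether a feature raises or lowers the chance of conversion; that requires looking at conversion rates or partial dependence.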


## Actionable Insights and Recommendations

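Our analysis of ExtraaLearn's lead data highlights the factors that most influence lead conversion. By random forest feature importance, the strongest predictors are 'time_spent_on_website' (0.29), 'first_interaction_Mobile App' (0.18), 'page_views_per_visit' (0.15), 'age' (0.13), and 'profile_completed_High' (0.11), while 'website_visits' contributes less (0.06). Lead engagement and marketing efforts should therefore emphasize website engagement, the first-interaction channel, and profile completion.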

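Examining interaction channels, whether a lead first interacted through the 'Mobile App' or the 'Website' is the second most important feature, while 'last_activity_Website Activity' carries little weight (0.03). Feature importance does not indicate the direction of an effect, so conversion rates per channel should be compared before allocating more resources to either one.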

Regarding profile characteristics, age is among the more important features, so age-specific outreach strategies may be worth exploring once conversion rates by age group have been examined. Profile completion also matters: 'profile_completed_High' is the fifth most important feature, so encouraging users to complete their profiles could prove beneficial, provided the relationship with conversion is confirmed to be positive.

For marketing strategy, a targeted advertising approach is recommended. Focusing spend on the channels and activities that the model associates with conversion can optimize the marketing budget, and personalizing outreach by age group and profile can make communication more effective.

Looking ahead, continuous monitoring and adaptation are crucial. The random forest (test accuracy 0.84, recall 0.70 on converted leads) outperforms the single decision tree (0.79 and 0.60) and is the better candidate for scoring leads, but it should be re-evaluated regularly on fresh data. Establishing a feedback loop between teams fosters collaboration and allows domain knowledge to be incorporated into the model. Allocating resources according to predicted conversion probabilities can then improve lead management for ExtraaLearn.
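
Allocating resources by predicted conversion probability could look like the sketch below: score incoming leads with `predict_proba` and hand the highest-scoring ones to representatives first. The data is synthetic, and the capacity of 20 calls (`top_k`) is a hypothetical figure chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: X_train plays the historical leads, X_new the incoming ones
X, y = make_classification(
    n_samples=600, n_features=6, weights=[0.7, 0.3], random_state=42
)
X_train, X_new, y_train, _ = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

forest = RandomForestClassifier(random_state=42)
forest.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1 (converted)
scores = pd.Series(forest.predict_proba(X_new)[:, 1], name="conversion_probability")

# Hypothetical capacity: representatives can contact 20 leads in this round
top_k = 20
priority_leads = scores.sort_values(ascending=False).head(top_k)
print(priority_leads)
```

The index of `priority_leads` identifies which rows of `X_new` to contact first; on real data it would map back to the lead IDs.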