<a href="https://colab.research.google.com/github/SaniyaBubere/Lead-Scoring-Model/blob/main/LeadScoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem: Lead Scoring Model

Selling something is not an easy task. A business might have many potential customers, commonly referred to as leads, but not enough resources to cater to them all. Most of the leads won't even turn into actual bookings. So there is a need for a system that prioritises the leads and sorts them by a score, referred to here as the lead score. Whenever a new lead is generated, this system analyses the features of the lead and gives it a score that correlates with its chances of being converted into a booking. Such ranking of potential customers not only saves time but also helps increase the conversion rate by letting the sales team figure out which leads to spend time on.

Here you have a dataset of leads with their features and their status. You have to build an ML model that predicts the lead score as the OUTPUT on the basis of the INPUT set of features. The lead score ranges from 0 to 100; a higher lead score means a higher chance of the lead converting to WON.


# Data Set:

The provided dataset includes information about leads generated for rental properties. Each row represents a lead and provides various details such as agent ID, lead status, lost reason, budget, lease duration, move-in date, source of lead, destination city and country, desired room type, and lead ID.

The dataset also includes information about the source of the leads such as the source website, source city, source country, UTM source, and UTM medium. Finally, each lead has a unique lead ID.

LEAD: In the context of sales and marketing, a lead is a potential customer or prospect who has shown interest in a company's products or services.

Status: Lead status refers to the current stage of a lead in the sales process. It describes whether a lead is a potential customer, a lost opportunity, or a converted customer.

LEASE: Lease refers to a contractual agreement between two parties, whereby the owner of a property (lessor) allows another party (lessee) to use the property for a specified period of time in exchange for rent.

Problem Statement: Develop a machine learning model to predict lead scores ranging from 0 to 100 based on input features. This will help prioritize leads and improve conversion rates.
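One way to turn a binary WON/LOST target into a 0-100 score is to scale a classifier's predicted probability of WON. The sketch below is illustrative only: the synthetic data from `make_classification` and the choice of `LogisticRegression` are assumptions, not the models used later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the leads table (assumption: 1 = WON, 0 = LOST)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

# Probability of the positive class, scaled to a 0-100 lead score
proba_won = clf.predict_proba(X_demo)[:, 1]
lead_score = np.round(proba_won * 100).astype(int)

# Rank leads from most to least promising
ranking = np.argsort(-lead_score)
print(lead_score[ranking][:5])
```

Sorting by the score, as in the last two lines, is what lets a sales team work through the most promising leads first.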

# Summary

In this project, I imported the necessary libraries and mounted the drive. I then inspected the data, removed duplicate rows, and dropped unwanted columns. For data cleaning, I imputed null values with each column's mode and grouped rare values in each categorical column into an 'others' category. I used one-hot encoding to represent the categorical data.

I addressed the class imbalance using the Synthetic Minority Over-sampling Technique (SMOTE). I implemented two models, Random Forest and XGBoost, both as regressors. I evaluated them using MAE, MSE, RMSE, R2, and adjusted R2, and found that the tuned Random Forest model performed better on the test set.

In conclusion, I developed a machine learning model to predict lead scores, which can be used to prioritize leads and improve the conversion rates of businesses.

Importing Libraries

In [63]:
# importing libraries
from collections import Counter

import numpy as np
import pandas as pd
import xgboost as xgb
from imblearn.over_sampling import SMOTE  # needed for the SMOTE section below
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split





In [7]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# reading file 
df=pd.read_excel("/content/drive/MyDrive/Lead Scoring/Data_Science_Internship.xlsx")

In [None]:
# first look of data set
df.head()

In [None]:
# Drop first column of dataframe
df = df.iloc[: , 1:]

# Data Inspection

In [None]:
# Get the number of rows and columns
rows, columns = df.shape

In [None]:
# Print the number of rows and columns
print("Number of rows: ", rows)
print("Number of columns: ", columns)

In [None]:
# Show the duplicated rows
df[df.duplicated()]

In [None]:
# Dropping duplicates (keeping the first occurrence) and checking the shape
df = df.drop_duplicates(keep = 'first')
df.shape

In [None]:
# Dataset Info
df.info()

We have Null Values in our Dataset

In [None]:
# Describing the data
df.describe()

# Data Cleaning

The leads with STATUS other than ‘WON’ or ‘LOST’ are dropped 

In [None]:
# Looking for unique value of status column
df["status"].unique()

In [None]:
# Including only Won and Lost value for status column
df = df[df["status"].isin(["WON","LOST"])]

Replacing the given value with NA

In [None]:
# Storing the given value in variable
value='9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6c9bc9d493a23be9de0' 

In [None]:
# looking for the given value in whole dataset
count=df.isin([value]).sum()
count

In [None]:
# Replace the given value with a NA value
df.replace(value, np.nan, inplace=True)

In [None]:
# Checking whether the replaced value is still present
count=df.isin([value]).sum()
count

In [None]:
# Checking for null value
df.isnull().sum()

In [None]:
# Calculating Null Percentage of each column
round(100*(df.isnull().sum()/len(df.index)), 2)

Removing the columns having more than 40% missing values

In [None]:
# Removing the columns having more than 40% missing values
pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.40].index
df.drop(missing_features, axis=1, inplace=True)

In [None]:
# Checking the update
round(100*(df.isnull().sum()/len(df.index)), 2)

1 column removed.



Removing the rows having 70% or more missing values

In [None]:
# Removing the rows having 70% or more missing values
# (fraction of missing values is computed per row, i.e. along axis=1)
row_pct_null = df.isnull().mean(axis=1)
df = df[row_pct_null < 0.70]


In [None]:
# Checking the shape
df.shape
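
Column-wise and row-wise missing fractions are easy to mix up, so here is a small sketch on a toy frame. The column names and values are made up for illustration; only the 70% row threshold comes from this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": ["x", np.nan, "z", "w"],
    "c": [np.nan, np.nan, np.nan, 1.0],
    "d": [1.0, np.nan, 3.0, 4.0],
})

# Fraction of missing values per column (axis=0) and per row (axis=1)
col_null_frac = toy.isnull().mean(axis=0)
row_null_frac = toy.isnull().mean(axis=1)

# Keep only rows with less than 70% of their values missing
filtered = toy[row_null_frac < 0.70]
print(filtered)
```

Here only the second row, which is entirely missing, is dropped; the row with half of its values missing stays.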

In [None]:
# Function for Unique Value of DataFrame
def get_all_unique_values(df):
    for col in df.columns:
        print(f"Unique values in column '{col}':")
        print(df[col].unique())
        print()

In [None]:
# Get and print all unique values
get_all_unique_values(df)

# Handling Missing Data

In [None]:
# Handle missing values by filling them with mode
df = df.fillna(df.mode().iloc[0])

In [None]:
# Checking for Null values
df.isna().sum()

## Hurray, there are no null values left in our dataset
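
To make the mode imputation step concrete, the following sketch shows what `df.mode().iloc[0]` produces and how `fillna` uses it. The toy values are invented; the column names only mirror ones that appear in this dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "budget": ["low", "low", np.nan, "high"],
    "lease": [np.nan, "full year", "full year", "semester"],
})

# df.mode() returns one row per mode; .iloc[0] picks the first mode of each column
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with its own mode
filled = toy.fillna(modes)
print(filled)
```

Mode imputation keeps every row, but it also inflates the most frequent category, which is worth keeping in mind for columns with many missing values.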

# Changing the Values to Category

In [None]:
# Drop the 'lead_id' column (a unique identifier for each lead)
df = df.drop('lead_id', axis=1)

# categories that make up less than this percentage of a column are grouped as 'others'
threshold = 10

def categories(x, value_counts_dict):
    # return 'others' for categories whose share is below the threshold
    return 'others' if value_counts_dict[x] < threshold else x

# get columns with categorical data (excluding the target)
cat_cols = [col for col in df.select_dtypes(include=['object', 'category']).columns if col != 'status']


# loop through categorical columns
for col in cat_cols:
    print("Column Name : ", col)
    print("-----------------------------------------")

    # percentage share of each category in the column
    value_counts_dict = (df[col].value_counts(normalize=True) * 100).to_dict()

    # replace rare categories with 'others'
    df[col] = df[col].apply(lambda x: categories(x, value_counts_dict))

    print("After :")
    print(df[col].value_counts(normalize=True) * 100)
    print('\n')
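
The same rare-category grouping can be written without a per-element `apply`. This is a minimal sketch, assuming the 10% threshold used above; `group_rare_categories` is a hypothetical helper name, and the toy series is made up.

```python
import pandas as pd

def group_rare_categories(series: pd.Series, threshold_pct: float = 10.0) -> pd.Series:
    """Replace categories whose share is below threshold_pct with 'others'."""
    # Map every value to the percentage share of its category
    share_pct = series.map(series.value_counts(normalize=True) * 100)
    # Keep values at or above the threshold, replace the rest
    return series.where(share_pct >= threshold_pct, "others")

toy = pd.Series(["web"] * 12 + ["referral"] * 7 + ["email"] * 1)
grouped = group_rare_categories(toy)
print(grouped.value_counts())
```

In this toy series, `email` accounts for 5% of the values and is therefore replaced by `others`, while `web` and `referral` are kept.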


# Separating the dependent and independent variables

In [None]:
# Replacing the Lost and Won as 0 and 1 of dependent variable
y = df['status'].replace({'LOST': 0, 'WON': 1})

In [None]:
#  independent variables
X=df.drop(columns='status')

In [None]:
X.head()
X.shape

# One Hot Encoding

In [None]:
# Performing one hot encoding on the independent variables
X=pd.get_dummies(X)
X.shape

In [None]:
X.columns

In [None]:

# converting all columns to a numeric data type (values that cannot be converted become NaN)
X = X.apply(pd.to_numeric, errors='coerce')
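
As a small illustration of what one-hot encoding does to the feature table, the sketch below encodes a toy frame. The values are invented, and treating `movein` as an integer column is an assumption made for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "movein": [3, 7, 12],
    "lease": ["full year", "semester", "full year"],
})

# Only object/category columns are expanded; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
```

Passing `dtype=int` yields 0/1 integer columns directly, which avoids a separate numeric conversion step for the dummy columns.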


# SMOTE:

In [None]:
#counting dataset
df.status.value_counts() 

In [41]:
# initializing smote ()
smote = SMOTE()


In [None]:
# Dividing the dataset into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

In [43]:
# Resampling with SMOTE; note that the full X, y is resampled here, not only the training split
X_train_smote, y_train_smote = smote.fit_resample(X,y)
# "Before" shows the class counts of the training split, "After" those of the resampled full dataset
print("Before SMOTE :" , Counter(y_train))
print("After SMOTE :" , Counter(y_train_smote))

Before SMOTE : Counter({0: 34586, 1: 2459})
After SMOTE : Counter({0: 43235, 1: 43235})


In [44]:
# Looking for shape of Original dataset shape & Resampled dataset shape
print('Original dataset shape', len(df))
print('Resampled dataset shape', len(y_train_smote))

Original dataset shape 46307
Resampled dataset shape 86470


In [45]:
# list of all feature column names in X
columns = list(X.columns)
columns

['movein',
 'Agent_id_2fca346db656187102ce806ac732e06a62df0dbb2829e511a770556d398e1a6e',
 'Agent_id_others',
 'lost_reason_Low availability',
 'lost_reason_Low budget',
 'lost_reason_Not interested',
 'lost_reason_Not responding',
 'lost_reason_others',
 'budget_0-0',
 'budget_others',
 'budget_£121 - £180 Per Week',
 'budget_£60 - £120 Per week',
 'lease_0',
 'lease_Complete Education Year Stay 50 - 52 weeks',
 'lease_Full Year Course Stay 40 - 44 weeks',
 'lease_others',
 'source_7aae3e886e89fc1187a5c47d6cea1c22998ee610ade1f2b7c51be879f0c37ca8',
 'source_others',
 'source_city_ecc0e7dc084f141b29479058967d0bc07dee25d9690a98ee4e6fdad5168274d7',
 'source_city_others',
 'source_country_8da82000ef9c4468ba47362a924b895e40662fed846942a1870a674e5c6d1fc2',
 'source_country_e09e10e67812e9d236ad900e5d46b4308fc62f5d69446a9750aa698e797e9c96',
 'source_country_others',
 'utm_source_7f3fa48ca885678134842fa7456f3ece53a97f843b610185d900ac4e467c7490',
 'utm_source_bbdefa2950f49882f295b1285d4fa9dec45fc

In [46]:
# Create a new DataFrame with the balanced data
balanced_df = pd.DataFrame(X_train_smote, columns=columns)

In [47]:
# adding the resampled target as the 'status' column
balanced_df['status'] = y_train_smote

In [48]:
# check shape of the new dataframe
balanced_df.shape

(86470, 33)

In [52]:
# independent variable (estimator)
X = balanced_df.drop("status", axis = 1)

# dependent variable (label)
y = balanced_df["status"]

In [53]:
# Dividing the balanced dataset into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

Function for Evaluation Metrics

In [54]:
def regression_evaluation_metrics(model, X_train, y_train, X_test, y_test):
    # Fit the model to the training data
    model.fit(X_train, y_train)
    
    # Make predictions on the training and testing data
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Calculate mean absolute error
    MAE_train = mean_absolute_error(y_train, y_pred_train)
    print("MAE Train :", MAE_train)

    MAE_test = mean_absolute_error(y_test, y_pred_test)
    print("MAE Test:", MAE_test)

    # Calculate mean squared error
    MSE_train = mean_squared_error(y_train, y_pred_train)
    print("MSE Train :", MSE_train)

    MSE_test = mean_squared_error(y_test, y_pred_test)
    print("MSE Test:", MSE_test)

    # Calculate root mean squared error
    RMSE_train = np.sqrt(MSE_train)
    print("RMSE Train:", RMSE_train)

    RMSE_test = np.sqrt(MSE_test)
    print("RMSE Test:", RMSE_test)

    # Calculate RMSE relative to the mean of the training target
    # (labelled RMSPE here; strictly this is a normalized RMSE, not a percentage error)
    target_mean = np.mean(y_train)
    RMSPE_train = RMSE_train / target_mean
    print("RMSPE Train:", RMSPE_train)

    RMSPE_test = RMSE_test / target_mean
    print("RMSPE Test:", RMSPE_test)

    # Calculate R-squared
    R2_train = r2_score(y_train, y_pred_train)
    print("R2 Train:", R2_train)

    R2_test = r2_score(y_test, y_pred_test)
    print("R2 Test:", R2_test)

    # Calculate adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - 1 - p)
    # n = 168879 and p = 26 are hard-coded here and do not depend on the data passed in
    ADJUSTED_R2_train = 1 - ((1 - R2_train) * (168879 - 1) / (168879 - 1 - 26))
    print("Adjusted R2 Train :", ADJUSTED_R2_train)

    ADJUSTED_R2_test = 1 - ((1 - R2_test) * (168879 - 1) / (168879 - 1 - 26))
    print("Adjusted R2 Test:", ADJUSTED_R2_test)
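
The adjusted R2 in the function above uses fixed values for the sample count and feature count. A minimal sketch of a version that takes them as arguments follows; `adjusted_r2` is a hypothetical helper, and the numbers in the example call are made up.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Example with made-up numbers: 1,000 rows, 10 features, R^2 = 0.90
print(adjusted_r2(0.90, n_samples=1000, n_features=10))
```

Inside an evaluation function, `n_samples` and `n_features` could be taken from `X_train.shape` or `X_test.shape`, so the correction matches the data actually being scored.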


# Random Forest Model

In [57]:
rf = RandomForestRegressor()

#calling function evaluation_metrics for random forest
regression_evaluation_metrics(rf, X_train, y_train, X_test, y_test)

MAE Train : 0.009300537971172498
MAE Test: 0.01416058190413284
MSE Train : 0.004189330299789019
MSE Test: 0.008431592438905147
RMSE Train: 0.06472503611268995
RMSE Test: 0.0918237030341575
RMSPE Train: 0.1294163973214857
RMSPE Test: 0.18359963236959503
R2 Train: 0.9832426776662545
R2 Test: 0.9662735937081127
Adjusted R2 Train : 0.9832400973569856
Adjusted R2 Test: 0.9662684004823079


In [58]:
# Creating parameter grid  
param_grid = {
    'max_depth': [10,20,30],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [5, 8, 10],
    'n_estimators': [100, 150, 200]
}

In [61]:
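# Instantiate the randomized search (samples 10 parameter combinations by default)
# Note: 'accuracy' is a classification metric, so it cannot score this regressor's continuous predictions
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid,
                                   scoring='accuracy', cv=3, n_jobs=-1, verbose=1)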
     

In [62]:
# Fit the randomized search to the data
random_search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits




In [64]:
#get best parameters
random_search.best_params_

{'n_estimators': 100,
 'min_samples_split': 8,
 'min_samples_leaf': 3,
 'max_depth': 10}

In [65]:
#get best score
random_search.best_score_

nan
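
The `nan` best score is most likely a consequence of `scoring='accuracy'`: accuracy is a classification metric and raises an error on a regressor's continuous predictions, and `RandomizedSearchCV` records failed scores as `nan` by default. A sketch with a regression scorer is shown below; the synthetic data and the tiny search space are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"max_depth": [3, 5], "n_estimators": [10, 20]},
    n_iter=3,
    scoring="neg_mean_squared_error",  # regression scorer; higher (closer to 0) is better
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # a finite negative MSE rather than nan
```

With a valid scorer, `best_params_` is chosen by cross-validated error instead of being picked among candidates that all scored `nan`.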

In [66]:
#calling function evaluation_metrics for random forest
regression_evaluation_metrics(random_search, X_train, y_train, X_test, y_test)

Fitting 3 folds for each of 10 candidates, totalling 30 fits




MAE Train : 0.012344432311094217
MAE Test: 0.01480792503262996
MSE Train : 0.005727585657301699
MSE Test: 0.007647172652413111
RMSE Train: 0.0756808143276861
RMSE Test: 0.08744811405864114
RMSPE Train: 0.15132225371945585
RMSPE Test: 0.17485073093391218
R2 Train: 0.9770896558196006
R2 Test: 0.9694112762531744
Adjusted R2 Train : 0.9770861280618679
Adjusted R2 Test: 0.9694065661708691


# XG BOOST

In [67]:
# Define the XGBoost regressor
xg_reg = xgb.XGBRegressor()

#calling function regression_evaluation_metrics for xgboost
regression_evaluation_metrics(xg_reg, X_train, y_train, X_test, y_test)

MAE Train : 0.016308002728642253
MAE Test: 0.01902612424567739
MSE Train : 0.005952953271478674
MSE Test: 0.007555525971772775
RMSE Train: 0.07715538394356336
RMSE Test: 0.08692252856292651
RMSPE Train: 0.15427062576755035
RMSPE Test: 0.17379983339217286
R2 Train: 0.9761881853018567
R2 Test: 0.9697778633728644
Adjusted R2 Train : 0.9761845187347912
Adjusted R2 Test: 0.9697732097380108


In [68]:

# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum number of samples required to split a node
min_samples_split = [50,100,150]

# Minimum number of samples required at each leaf node
min_samples_leaf = [40,50]

# HYperparameter Grid
param_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}

In [69]:
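# Instantiate the randomized search (samples 10 parameter combinations by default)
# Note: 'accuracy' is a classification metric, so it cannot score this regressor's continuous predictions
XGboost = RandomizedSearchCV(estimator=xg_reg, param_distributions=param_dict,
                             scoring='accuracy', cv=3, n_jobs=-1, verbose=1)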

In [70]:
# Calling function we made evaluation_metrics for XGboost 
regression_evaluation_metrics(XGboost, X_train, y_train, X_test, y_test)

Fitting 3 folds for each of 10 candidates, totalling 30 fits




Parameters: { "min_samples_leaf", "min_samples_split" } are not used.

MAE Train : 0.012292873928187432
MAE Test: 0.01743683799616365
MSE Train : 0.004584886250951359
MSE Test: 0.008221311847348213
RMSE Train: 0.06771178812401397
RMSE Test: 0.09067145001238379
RMSPE Train: 0.13538834740777497
RMSPE Test: 0.18129572581601472
R2 Train: 0.9816604537544773
R2 Test: 0.9671147169855405
Adjusted R2 Train : 0.9816576298127865
Adjusted R2 Test: 0.96710965327674
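
The warning in the output above says that `min_samples_leaf` and `min_samples_split` are not used: they are scikit-learn tree parameters, not XGBoost ones. The sketch below shows a hypothetical search space built from parameter names that `XGBRegressor` does accept. The candidate values are illustrative assumptions, not tuned choices.

```python
# Hypothetical search space using parameter names that XGBRegressor accepts
xgb_param_dict = {
    "n_estimators": [50, 80, 100],
    "max_depth": [4, 6, 8],
    "min_child_weight": [1, 5, 10],  # minimum sum of instance weight needed in a child
    "subsample": [0.7, 0.85, 1.0],   # fraction of rows sampled per tree
    "learning_rate": [0.05, 0.1, 0.3],
}

# Number of distinct combinations a full grid search would try
n_combinations = 1
for values in xgb_param_dict.values():
    n_combinations *= len(values)
print(n_combinations)
```

`min_child_weight` plays a role loosely similar to `min_samples_leaf`, in that larger values make the trees more conservative.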


# Conclusion:

Based on the evaluation metrics, both the Random Forest and XGBoost regressors performed well on the given data. The figures below are for the tuned models.

For the Random Forest regressor, the MAE is 0.0123 on the training set and 0.0148 on the test set. For the XGBoost regressor, the MAE is 0.0123 on the training set and 0.0174 on the test set.

The R2 score for the Random Forest regressor is 0.977 on the training set and 0.969 on the test set. For the XGBoost regressor, the R2 score is 0.982 on the training set and 0.967 on the test set.

The Adjusted R2 score for the Random Forest regressor is 0.977 on the training set and 0.969 on the test set. For the XGBoost regressor, the Adjusted R2 score is 0.982 on the training set and 0.967 on the test set.

It seems that the Random Forest model performs better than the XGBoost model on unseen data. It has lower MAE, MSE, and RMSE values on the test set and a higher test R2 (0.969 versus 0.967). XGBoost has slightly lower errors on the training set, but that advantage does not carry over to the test set.