**Problem: Lead Scoring Model**

Selling something is not an easy task. A business might have many potential customers, commonly referred to as leads, but not enough resources to cater to them all. Moreover, most leads won’t turn into actual bookings. So there is a need for a system that prioritises the leads and sorts them on the basis of a score, referred to here as the lead score. Whenever a new lead is generated, this system analyses the features of the lead and gives it a score that correlates with its chances of being converted into a booking. Such ranking of potential customers not only saves time but also helps increase the conversion rate by letting the sales team figure out which leads to spend time on.

Here you have a dataset of leads with their features and their status. You have to build an ML model that predicts the lead score as an OUTPUT on the basis of the INPUT set of features. This lead score will range from 0 to 100; a higher lead score means a higher chance of the lead converting to WON.

NOTE: Leads with a STATUS other than ‘WON’ or ‘LOST’ can be dropped during training.

NOTE: Treat all columns as CATEGORICAL columns

NOTE: This '9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6c9bc9d493a23be9de0' represents NaN and could be present in more than one column.

Steps should be:

1. Data Cleaning (including Feature Selection)
2. Training (on Y percent of data)
3. Testing (on (100-Y) percent of data)
4. Evaluate the performance using metrics such as accuracy, precision, recall and F1-score.

Dataset link: https://docs.google.com/spreadsheets/d/1rK1CLqpsd6JfSBLk9nRE-f0NzDc9lEXgxZ-cKjxIN_s/edit?usp=sharing


## 1. Data Cleaning and Feature Selection:

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# load the dataset (assumes the sheet was exported as lead.csv and uploaded next to this notebook)
df = pd.read_csv('lead.csv')

In [2]:
# check the column names in the DataFrame
print(df.columns)

# keep only leads whose status is 'WON' or 'LOST'; .copy() avoids modifying a slice in later in-place edits
df = df[df['status'].isin(['WON', 'LOST'])].copy()

Index(['Unnamed: 0', 'Agent_id', 'status', 'lost_reason', 'budget', 'lease',
       'movein', 'source', 'source_city', 'source_country', 'utm_source',
       'utm_medium', 'des_city', 'des_country', 'room_type', 'lead_id'],
      dtype='object')


In [3]:
# replace the placeholder hash that stands for missing values with NaN
df.replace('9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6c9bc9d493a23be9de0', np.nan, inplace=True)

# check for missing values in the DataFrame
print(df.isna().sum())


Unnamed: 0            0
Agent_id              0
status                0
lost_reason        3073
budget             3694
lease              2336
movein            13610
source             5951
source_city        8831
source_country     8622
utm_source           61
utm_medium         3184
des_city           2529
des_country        2529
room_type         23491
lead_id               0
dtype: int64
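
The cell above stops at counting missing values; no columns are actually selected. A minimal sketch of one way to finish the step follows, run on a made-up four-row stand-in for `lead.csv`. It assumes, without confirmation from the dataset, that `Unnamed: 0` and `lead_id` are row identifiers and that `lost_reason` is only recorded after a lead is lost, so it would leak the outcome. The `MISSING` label and the names `NAN_TOKEN`, `toy`, `features` and `target` are illustrative, not part of the original notebook.

```python
import io

import pandas as pd

# Placeholder hash that the problem statement says represents NaN.
NAN_TOKEN = "9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6c9bc9d493a23be9de0"

# Tiny in-memory stand-in for lead.csv (made-up rows, for illustration only).
toy_csv = io.StringIO(
    "Agent_id,status,lost_reason,budget,source,lead_id\n"
    f"a1,WON,{NAN_TOKEN},b1,s1,l1\n"
    f"a2,LOST,low_budget,{NAN_TOKEN},s2,l2\n"
    "a1,OPPORTUNITY,not_responding,b2,s1,l3\n"
    f"a3,LOST,not_responding,b1,{NAN_TOKEN},l4\n"
)

# Read every column as text and map the placeholder hash to NaN while loading.
toy = pd.read_csv(toy_csv, na_values=[NAN_TOKEN], dtype=str)

# Keep only the leads with a final outcome.
toy = toy[toy["status"].isin(["WON", "LOST"])].copy()

# Assumption: identifiers carry no signal and lost_reason leaks the outcome.
drop_cols = ["Unnamed: 0", "lead_id", "lost_reason", "status"]
features = toy.drop(columns=drop_cols, errors="ignore")

# Treat a missing value as its own category instead of discarding the row.
features = features.fillna("MISSING").astype("category")

# Binary target: 1 for WON, 0 for LOST.
target = (toy["status"] == "WON").astype(int)

print(features)
print(target.tolist())
```

Keeping missing values as an explicit category is one option among several; with columns such as `room_type`, where most rows are missing, dropping the column outright may be just as reasonable.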


## 2. Data Preprocessing: 

In [4]:
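# one-hot encode the categorical columns; 'status' becomes the binary column 'status_WON'
# get_dummies only encodes object/category columns, so numeric columns pass through unchanged;
# lost_reason and lead_id are still part of df at this point and get encoded as well
df_encoded = pd.get_dummies(df, drop_first=True)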

# split the data into training and testing sets
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df_encoded, test_size=0.2, random_state=42)
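
The problem statement asks for every column to be treated as categorical, but `pd.get_dummies` leaves numeric columns as they are. The sketch below, on made-up data, casts all features to strings before encoding so that numeric-looking columns are one-hot encoded too, and passes `stratify` to `train_test_split` so both splits keep the same WON/LOST ratio. The column values and the names `toy`, `X` and `y` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up leads: 'lease' is numeric on purpose to show the string cast.
toy = pd.DataFrame({
    "source": ["s1", "s2", "s1", "s3"] * 5,
    "lease": [6, 12, 6, 3] * 5,
    "status": ["WON", "LOST", "LOST", "LOST"] * 5,
})

# Binary target: 1 for WON, 0 for LOST.
y = (toy["status"] == "WON").astype(int)

# Cast every feature to str so get_dummies encodes all of them.
X = pd.get_dummies(toy.drop(columns="status").astype(str), dtype=int)

# stratify=y keeps the share of WON leads equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(sorted(X.columns))
print(y_train.mean(), y_test.mean())
```

Stratifying matters most when one outcome is much rarer than the other, since a plain random split can then leave the test set with very few positive examples.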


## 3. Model Training and Testing:

In [5]:
# import the necessary library
from sklearn.ensemble import GradientBoostingRegressor

# specify the features and target variable
X_train = train_data.drop(['status_WON'], axis=1)
y_train = train_data['status_WON']

# create an instance of the model
model = GradientBoostingRegressor()

In [6]:
# train the model
model.fit(X_train, y_train)

In [7]:
# test the model
X_test = test_data.drop(['status_WON'], axis=1)
y_test = test_data['status_WON']
y_pred = model.predict(X_test)
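
The problem statement asks for a lead score between 0 and 100, whereas `y_pred` from the regressor is a raw estimate of `status_WON` that is not guaranteed to stay inside [0, 1]. Two possible ways to obtain a bounded score are sketched below on synthetic data: clipping and rescaling regressor output, or swapping in `GradientBoostingClassifier` and scaling its predicted probability of the positive class. The classifier is a substitution for illustration, not what the notebook uses, and `X_toy`, `y_toy`, `raw` and `lead_score` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Option 1: clip regressor-style output to [0, 1], then scale to 0-100.
raw = np.array([-0.05, 0.5, 1.2])  # stand-in for values like y_pred
clipped_score = 100 * np.clip(raw, 0.0, 1.0)

# Option 2: use a classifier and scale the probability of the positive class.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 5))  # synthetic one-hot-like features
y_toy = X_toy[:, 0] & X_toy[:, 1]          # 1 plays the role of WON

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_toy, y_toy)

# Columns of predict_proba follow clf.classes_, so column 1 is class 1 (WON).
lead_score = 100 * clf.predict_proba(X_toy)[:, 1]

print(clipped_score)
print(lead_score.min(), lead_score.max())
```

The probability-based score is bounded by construction; whether it is well calibrated on the real leads would still need to be checked.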

## 4. Model Evaluation:

In [8]:
# evaluate the model
# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# calculate the root mean squared error
rmse = np.sqrt(mse)

# calculate the mean absolute error
mae = mean_absolute_error(y_test, y_pred)

# calculate the R-squared score
r2 = r2_score(y_test, y_pred)

# print the results
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)
print("R-squared Score:", r2)

Mean Squared Error: 0.037640521663067895
Root Mean Squared Error: 0.19401165342078783
Mean Absolute Error: 0.09344739194290066
R-squared Score: 0.3606922630442836
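
The problem statement asks for accuracy, precision, recall and F1-score, while the cell above reports regression errors. A minimal sketch of the classification metrics follows, assuming the continuous predictions are turned into WON/LOST labels with a 0.5 threshold. The arrays are made up; in the notebook the inputs would presumably be `y_test` and `y_pred`, and the threshold is an arbitrary choice that could be tuned.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth (1 = WON) and model scores in [0, 1].
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.05])

# Assumption: a lead is predicted WON when its score reaches the threshold.
threshold = 0.5
y_hat = (scores >= threshold).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_hat),
    "precision": precision_score(y_true, y_hat),
    "recall": recall_score(y_true, y_hat),
    "f1": f1_score(y_true, y_hat),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Lowering the threshold generally trades precision for recall, which may suit a sales team that prefers to contact too many leads over missing a likely conversion.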
