In [1]:
import numpy as np
import pandas as pd

Smart Lead Scoring Engine


Can you identify the potential leads for a D2C startup?




Problem Statement


A D2C startup develops products using cutting-edge technologies like Web 3.0. Over the past few months, the company has started multiple marketing campaigns, both offline and digital. As a result, users have started showing interest in the product on the website. These users with intent to buy product(s) are generally known as leads (potential customers).


Leads are captured in two ways: directly and indirectly.


Direct leads are captured via forms embedded in the website while indirect leads are captured based on certain activity of a user on the platform such as time spent on the website, number of user sessions, etc.


Now, the marketing & sales team wants to identify the leads who are more likely to buy the product so that the sales team can manage their bandwidth efficiently by targeting these potential leads and increase the sales in a shorter span of time.


Now, as a data scientist, your task at hand is to predict the propensity to buy a product based on the user's past activities and user level information.



About Dataset


You are provided with last year's leads data, containing both direct and indirect leads. Each lead provides information about their activity on the platform, signup information and campaign information. Based on the user's past activity on the platform, you need to build a predictive model to classify whether the user will buy the product in the next 3 months or not.



Data Dictionary


You are provided with 3 files - train.csv, test.csv and sample_submission.csv



Training set


train.csv contains the leads information for the last year, from Jan 2021 to Dec 2021, along with the target variable indicating whether the user will buy the product in the next 3 months or not.



Variable                      Description
id                            Unique identifier of a lead
created_at                    Date of lead dropped
signup_date                   Sign up date of the user on the website
campaign_var (1 and 2)        Campaign information of the lead
products_purchased            No. of past products purchased at the time of dropping the lead
user_activity_var (1 to 12)   Derived activities of the user on the website
buy                           0 or 1 indicating if the user will buy the product in next 3 months or not
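
The two date columns in the data dictionary lend themselves to simple derived features. The following is only a minimal sketch, not the approach used in this notebook: the helper name, the new column names and the sample rows are all made up for illustration, and it assumes both date columns parse with `pd.to_datetime`. The second sample row leaves `signup_date` empty purely to show how a missing value would propagate.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two date-derived columns (illustrative names)."""
    out = df.copy()
    out["created_at"] = pd.to_datetime(out["created_at"])
    out["signup_date"] = pd.to_datetime(out["signup_date"])
    # Days between signing up on the website and the lead being dropped.
    # The result is NaN when either date is missing.
    out["days_since_signup"] = (out["created_at"] - out["signup_date"]).dt.days
    # Calendar month in which the lead was dropped.
    out["created_month"] = out["created_at"].dt.month
    return out


# Made-up rows that only mimic the column layout of train.csv.
sample = pd.DataFrame({
    "id": [1, 2],
    "created_at": ["2021-01-10", "2021-03-05"],
    "signup_date": ["2021-01-01", None],
})
features = add_date_features(sample)
print(features[["id", "days_since_signup", "created_month"]])
```

Whether such features help would have to be checked against the F1 score on a validation split.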



Test set


test.csv contains the leads information of the current year, from Jan 2022 to March 2022. You need to predict whether the lead will buy the product in the next 3 months or not.



Variable                      Description
id                            Unique identifier of a lead
created_at                    Date of lead dropped
signup_date                   Sign up date of the user on the website
campaign_var (1 and 2)        Campaign information of the lead
products_purchased            No. of past products purchased at the time of dropping the lead
user_activity_var (1 to 12)   Derived activities of the user on the website



Submission File Format


sample_submission.csv contains 2 variables:

Variable   Description
id         Unique identifier of a lead
buy        0 or 1 indicating if the user will buy the product in next 3 months or not
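
A submission in this format can be assembled with pandas. This is a hedged sketch: `make_submission` is a hypothetical helper, and the ids and predictions below are placeholders rather than real model output.

```python
import pandas as pd


def make_submission(ids, predictions) -> pd.DataFrame:
    """Build a two-column frame matching sample_submission.csv: id, buy."""
    ids = list(ids)
    predictions = [int(p) for p in predictions]
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"id": ids, "buy": predictions})


# Placeholder values; in practice ids come from test.csv and
# predictions from the fitted classifier.
submission = make_submission([101, 102, 103], [0, 1, 0])
csv_text = submission.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```

Passing `index=False` keeps the pandas row index out of the file, so the output has exactly the two expected columns.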



Evaluation metric


The evaluation metric for this hackathon is the F1 score of class 1.
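
For reference, the F1 score of class 1 is the harmonic mean of precision and recall, computed with `buy = 1` as the positive class. The sketch below is a plain-Python illustration with made-up labels; it should agree with scikit-learn's `f1_score` under its default settings (`average='binary'`, `pos_label=1`).

```python
def f1_class_1(y_true, y_pred):
    """F1 score treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_class_1(y_true, y_pred)
print(round(score, 4))
```

Because true negatives do not enter the formula, a model that predicts 0 for every lead scores 0 on this metric regardless of its accuracy.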



Public and Private Split


Test data is further divided into Public (40%) and Private (60%) data. 


Your initial responses will be checked and scored on the Public data. The final rankings would be based on your private score which will be published once the competition is over.



Submission Tutorials


All Submissions are to be done at the solution checker tab.



Guidelines for Final Submission


Please ensure that your final submission includes the following:

- Solution file containing the predictions for the id in the test set (Format is given in sample_submission.csv)
- A zipped file containing code & approach (Note that both code and approach document are mandatory for shortlisting)
  - Code: Clean code with comments on each part
  - Approach: Please share your approach to solve the problem (doc/ppt/pdf format). It should cover the following topics:
    - A brief on the approach used to solve the problem.
    - Which Data-preprocessing / Feature Engineering ideas really worked? How did you discover them?
    - What does your final model look like? How did you reach it?

In [3]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.0.6-cp39-none-win_amd64.whl (73.9 MB)
Collecting graphviz
  Downloading graphviz-0.20-py3-none-any.whl (46 kB)
Installing collected packages: graphviz, catboost
Successfully installed catboost-1.0.6 graphviz-0.20


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier
from catboost import CatBoostClassifier



def models(x, y):
    # Fit each baseline classifier on (x, y) and score it on the hold-out split.
    # x_test_sc and y_test are assumed to be defined earlier in the notebook.
    accuracy = []
    f1score = []
    model = []
    
    model.append(LogisticRegression())
    model.append(KNeighborsClassifier())
    model.append(SVC(random_state=40))
    model.append(RandomForestClassifier(random_state=40))
    model.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=40)))
    model.append(BaggingClassifier(random_state=40))
    model.append(GradientBoostingClassifier(random_state=40))
    model.append(XGBClassifier(random_state=40, verbosity=0))
    model.append(CatBoostClassifier(random_state=40, verbose=0))
    
    for mdl in model:
        # Train on the arguments passed in, not on notebook globals.
        mdl.fit(x, y)
        test_pred = mdl.predict(x_test_sc)
        
        accuracy.append((round(accuracy_score(y_test, test_pred), 2))*100)
        f1score.append((round(f1_score(y_test, test_pred), 2))*100)
        
        print(f'Model: {mdl}\nAccuracy: {accuracy_score(y_test, test_pred)}\nF1-score: {f1_score(y_test, test_pred)}\n\n')


models(x_train_sc, y_train)

In [None]:
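from sklearn.model_selection import GridSearchCV


def getBestParams(model,x,y):
    
    tuned_parameters = {'criterion': ['gini', 'entropy'],
                     'min_samples_split': [10, 20, 30],
                     'max_depth': [3, 5, 7, 9],
                     'min_samples_leaf': [15, 20, 25, 30, 35],
                     'max_leaf_nodes': [5, 10, 15, 20, 25],
                     'n_estimators':[100,150,200]}
    # Keep only the parameters this estimator accepts; for example,
    # n_estimators is not a valid parameter for a single decision tree.
    supported = model.get_params()
    param_grid = {k: v for k, v in tuned_parameters.items() if k in supported}
    grid = GridSearchCV(estimator = model, 
                         param_grid = param_grid, 
                         cv = 10)
    
    gridModel=grid.fit(x,y)
    
    return gridModel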
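
Before launching this search it is worth counting how many models it trains. The sketch below copies the grid values from the cell above and multiplies the list lengths; the variable names are illustrative. The count applies to an estimator that accepts all six parameters, such as a random forest.

```python
from math import prod

# Grid values copied from getBestParams above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": [10, 20, 30],
    "max_depth": [3, 5, 7, 9],
    "min_samples_leaf": [15, 20, 25, 30, 35],
    "max_leaf_nodes": [5, 10, 15, 20, 25],
    "n_estimators": [100, 150, 200],
}
cv_folds = 10

# An exhaustive grid search tries every combination of values.
n_candidates = prod(len(values) for values in param_grid.values())
# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 1800 18000
```

With 18,000 fits, narrowing the grid or switching to a randomized search may be a practical alternative if the run takes too long.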

In [None]:
from sklearn.metrics import accuracy_score, classification_report, f1_score


def BuildModel(x,y):
    # x_test_sc and y_test are assumed to be defined earlier in the notebook.
    models=[]
    models.append(LogisticRegression())
    models.append(KNeighborsClassifier())
    models.append(SVC(random_state=40))
    models.append(RandomForestClassifier(random_state=40))
    models.append(GradientBoostingClassifier(random_state=40))
    models.append(XGBClassifier(random_state=40, verbosity=0))
    models.append(DecisionTreeClassifier())
    models.append(GaussianNB())
    
    for model in models:
        
        print(type(model))
        
        # Only the decision tree and the random forest are tuned via grid search.
        if isinstance(model, (DecisionTreeClassifier, RandomForestClassifier)):
            
            bestParams=getBestParams(model,x,y)
            
            print(bestParams.best_params_)
            # GridSearchCV refits the best configuration on all of (x, y) by
            # default (refit=True), so best_estimator_ is ready to predict.
            gridModel=bestParams.best_estimator_
            test_pred=gridModel.predict(x_test_sc)
        
        else:
            model.fit(x,y)
            test_pred=model.predict(x_test_sc)
        
        print(f'Model: {model}\n')
        
        print('\n Classification Report : \n ')
        
        print(classification_report(y_test, test_pred))
        
        print('\n')
        
        print(f'Model: {model}\nAccuracy: {accuracy_score(y_test, test_pred)}\nF1-score: {f1_score(y_test, test_pred)}\n\n')


# Model Building        
BuildModel(x_train_sc,y_train)

In [None]:
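from sklearn.model_selection import GridSearchCV


def getBestParams(model,x,y):
    
    tuned_parameters = {'max_depth':[5,10,15,20,30],'min_samples_split':[10,20,25,40,60,100],'n_estimators':[100,150,200]}
    # Keep only the parameters this estimator accepts; for example,
    # n_estimators is not a valid parameter for a single decision tree.
    supported = model.get_params()
    param_grid = {k: v for k, v in tuned_parameters.items() if k in supported}
    grid = GridSearchCV(estimator = model, 
                         param_grid = param_grid, 
                         cv = 6)
    
    gridModel=grid.fit(x,y)
    
    return gridModel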

In [None]:
from sklearn.metrics import accuracy_score, classification_report, f1_score


def BuildModel(x,y):
    # x_test_sc and y_test are assumed to be defined earlier in the notebook.
    models=[]
    models.append(LogisticRegression())
    models.append(KNeighborsClassifier())
    models.append(SVC(random_state=40))
    models.append(RandomForestClassifier(random_state=40))
    models.append(GradientBoostingClassifier(random_state=40))
    models.append(XGBClassifier(random_state=40, verbosity=0))
    models.append(DecisionTreeClassifier())
    models.append(GaussianNB())
    
    for model in models:
        
        modelName=str(model)
        print(type(model),type(modelName))
        
        # Only the decision tree and the random forest are tuned via grid search.
        if isinstance(model, (DecisionTreeClassifier, RandomForestClassifier)):
            
            bestParams=getBestParams(model,x,y)
            
            print(bestParams.best_params_)
            # GridSearchCV refits the best configuration on all of (x, y) by
            # default (refit=True), so best_estimator_ is ready to predict.
            gridModel=bestParams.best_estimator_
            test_pred=gridModel.predict(x_test_sc)
        
        else:
            model.fit(x,y)
            test_pred=model.predict(x_test_sc)
        
        print(f'Model: {model}\n')
        
        print('\n Classification Report : \n ')
        
        print(classification_report(y_test, test_pred))
        
        print('\n')
        
        print(f'Model: {model}\nAccuracy: {accuracy_score(y_test, test_pred)}\nF1-score: {f1_score(y_test, test_pred)}\n\n')


# Model Building        
BuildModel(x_train_sc,y_train)

In [None]:
1st run

   Model                          Precision Score  Recall Score  Accuracy Score  f1-score
0  Logistic Regression            0.808018         0.747802      0.784254        0.776744
1  Random Forest Best Estimator   0.879615         0.911967      0.893176        0.895499
2  XGBoost Model Best Estimator   0.879816         0.923118      0.898127        0.900947
3  Deep Neural Network            0.866594         0.853957      0.860725        0.860229

2nd run

   Model                          Precision Score  Recall Score  Accuracy Score  f1-score
0  Logistic Regression            0.80728          0.742012      0.781617        0.773271
1  Random Forest Best Estimator   0.877193         0.906069      0.889194        0.891397
2  Random Forest Best Estimator   0.879554         0.905962      0.890539        0.892563
3  Random Forest Best Estimator   0.877461         0.917542      0.894306        0.897054
4  XGBoost Model Best Estimator   0.881333         0.915827      0.895867        0.898249
In [None]:
NEXT

In [None]:
Results wanted

   Model                 Precision Score  Recall Score  Accuracy Score  f1-score
0  Logistic Regression   0.792072         0.829187      0.805026        0.810205
1  Decision Tree Base    0.906736         0.94971       0.925735        0.927726
2  Random Forest Base    0.916614         0.940596      0.927241        0.92845
3  Random Forest Base    0.885047         0.922153      0.900818        0.903219
4  Random Forest Base    0.916614         0.940596      0.927241        0.92845
5  Logistic Regression   0.738056         0.813318      0.761436        0.773861
6  Decision Tree Base    0.881179         0.932769      0.903132        0.90624