# HackerEarth Reduce Marketing Waste Challenge
<hr>
<p align="center">
    <img src="https://d2908q01vomqb2.cloudfront.net/cb4e5208b4cd87268b208e49452ed6e89a68e0b8/2021/07/16/HackerEarthFeatureImage.png" width="500" height="600">
</p>

----------

## Problem

You want to reduce marketing waste and aim your marketing initiatives only at those customers who will benefit from your product. This will result in the following:

* Increased business
* New customers who are compatible with your organization
* Seamless transactions with a higher success rate
* More profit with fewer obstacles

## Task

Your company sells products that are used for hiring assessments. Your task is to predict the probability, expressed as a percentage, that a client will purchase a product, using the features provided in the dataset.

## Evaluation

<code>score = max(0, 100-np.sqrt(metrics.mean_squared_error(actual, predicted)))</code>
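
To make the metric concrete, here is a minimal sketch of the same formula in plain Python, without `numpy` or `sklearn`. The function name `competition_score` is an assumption used only for illustration; the competition itself only defines the expression above.

```python
import math


def competition_score(actual, predicted):
    """Return max(0, 100 - RMSE) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error, then its square root (RMSE)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    # The score is clipped at 0, so a very large error cannot go negative
    return max(0.0, 100.0 - math.sqrt(mse))


print(competition_score([50.0, 60.0], [50.0, 60.0]))  # perfect predictions
print(competition_score([0.0, 0.0], [3.0, 4.0]))      # RMSE = sqrt(12.5)
```

Because the score is `100 - RMSE`, lowering the RMSE by one percentage point raises the score by exactly one point.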

Link: https://www.hackerearth.com/problem/machine-learning/reduce-marketing-waste-24-9c4e0592/

## Environment Setup

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

import xgboost as xgb
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor, RandomForestRegressor
from sklearn.ensemble import IsolationForest
from catboost import CatBoostRegressor
import lightgbm as lgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score

## Dataset

In [2]:
df_train = pd.read_csv('data/train.csv', parse_dates=['Date_of_creation'], na_values=['?'])
df_test = pd.read_csv('data/test.csv', parse_dates=['Date_of_creation'], na_values=['?'])

In [3]:
print('Training Dataset Shape: ' + str(df_train.shape))
print('Test Dataset Shape: ' + str(df_test.shape))

Training Dataset Shape: (7007, 23)
Test Dataset Shape: (2093, 22)


In [4]:
# Keep only rows whose target is a valid percentage (0 to 100), then store the target column in a separate variable
df_train = df_train[df_train['Success_probability'] <= 100]
df_train = df_train[df_train['Success_probability'] >= 0]
target = df_train['Success_probability']

In [5]:
print('Training Dataset Shape: ' + str(df_train.shape))
print('Test Dataset Shape: ' + str(df_test.shape))

Training Dataset Shape: (6422, 23)
Test Dataset Shape: (2093, 22)


In [6]:
# Creating a copy of the Training Dataset
df_train_copy = df_train.copy()

In [7]:
df_test['Internal_rating'].value_counts()

 3.00     421
 5.00     417
 2.00     411
 1.00     399
 4.00     391
-1.00      48
 82.34      6
Name: Internal_rating, dtype: int64

In [8]:
df_test.head()

Unnamed: 0,Deal_title,Lead_name,Industry,Deal_value,Weighted_amount,Date_of_creation,Pitch,Contact_no,Lead_revenue,Fund_category,...,POC_name,Designation,Lead_POC_email,Hiring_candidate_role,Lead_source,Level_of_meeting,Last_lead_update,Internal_POC,Resource,Internal_rating
0,TitleAD16O,Bonilla Ltd Inc,Investment Bank/Brokerage,200988$,,2020-04-15,Product_1,167.332.2751x989,100 - 500 Million,Category 4,...,sonia,Chairman/CEO/President,maureenthomas@bonilla.com,"Designer, fashion/clothing",Marketing Event,Level 1,more than a month,"Massiah,Gerard F",No,-1.0
1,TitleOW6CR,"Williams, Rogers and Roach PLC",Electronics,409961$,2541758.2$,2021-01-23,Product_1,001-486-903-0711x7831,100 - 500 Million,Category 3,...,Daniel Bell,CEO/Co-Founder/Chairman,danielbell@williams.com,Horticultural consultant,Marketing Event,Level 2,Up-to-date,"Smith,Keenan H",Yes,1.0
2,TitleVVJQ5,"Wood, Vaughn and Morales Ltd",Banks,434433$,3041031.0$,2020-07-19,Product_1,(393)104-2610x9723,100 - 500 Million,Category 1,...,Andrew Davis,Chairman/Chief Innovation Officer,andrewdavis@wood.com,Information officer,Marketing Event,Level 2,Did not hear back after Level 1,"Gilley,Janine",Deliverable,5.0
3,TitleUS8NA,Durham-Crawford Inc,Music,218952$,1521716.4$,2020-02-27,Product_2,(817)040-4599,100 - 500 Million,Category 1,...,shital,CEO/Chairman/President,charlesrivera@durhamcrawford.com,Commercial/residential surveyor,Contact Email,Level 3,more than a month,"Morsy,Omar A",No,5.0
4,Title5VGWW,"Simpson, Duncan and Long LLC",Real Estate,392835$,2455218.75$,2020-10-25,Product_1,718-032-5726x76098,500 Million - 1 Billion,Category 3,...,Shelly Stephenson,CEO/Co-Founder/Chairman,shellystephenson@simpson.com,Wellsite geologist,Others,Level 3,More than 2 weeks,"Morsy,Omar A",Deliverable,2.0


In [9]:
# Mark the out-of-range rating 82.34 with a placeholder value (0)
x = df_test[df_test['Internal_rating'] == 82.34].index
df_test.loc[x, 'Internal_rating'] = 0

In [10]:
df_test.Internal_rating = df_test.Internal_rating.abs()

In [11]:
# Replace the placeholder rows with the most frequent rating
df_test.loc[x, 'Internal_rating'] = df_test['Internal_rating'].mode()[0]

In [12]:
df_test.Internal_rating.value_counts()

1.0    453
3.0    421
5.0    417
2.0    411
4.0    391
Name: Internal_rating, dtype: int64
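
The three cleanup cells above can also be written as one vectorized step. The sketch below is an alternative, not the notebook's own code: the helper name `clean_rating` and the assumption that valid ratings lie between 1 and 5 are inferred from the value counts shown above.

```python
import pandas as pd


def clean_rating(rating: pd.Series, low: float = 1, high: float = 5) -> pd.Series:
    """Take absolute values, then replace out-of-range ratings with the mode."""
    rating = rating.abs()
    # Assumption: ratings outside [low, high] are data-entry errors.
    # Note that missing values also count as invalid here.
    valid = rating.between(low, high)
    most_common = rating[valid].mode()[0]
    return rating.where(valid, most_common)


demo = pd.Series([3.0, -1.0, 82.34, 1.0, 2.0])
print(clean_rating(demo).tolist())
```

Computing the mode only over the valid ratings avoids the temporary placeholder value used in the cells above.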

## Exploratory Data Analysis and Preprocessing

In [13]:
print('Total Unique values in each column of Training dataset -> \n')
for col in df_train.columns:
    print(str(col) + " : " + str(len(df_train[col].unique())))
print('\n')
print('Total Unique values in each column of Test dataset -> \n')
for col in df_test.columns:
    print(str(col) + " : " + str(len(df_test[col].unique())))

Total Unique values in each column of Training dataset -> 

Deal_title : 6422
Lead_name : 6422
Industry : 172
Deal_value : 6339
Weighted_amount : 5947
Date_of_creation : 777
Pitch : 2
Contact_no : 6422
Lead_revenue : 3
Fund_category : 4
Geography : 3
Location : 598
POC_name : 4865
Designation : 10
Lead_POC_email : 6422
Hiring_candidate_role : 639
Lead_source : 4
Level_of_meeting : 3
Last_lead_update : 11
Internal_POC : 60
Resource : 7
Internal_rating : 5
Success_probability : 246


Total Unique values in each column of Test dataset -> 

Deal_title : 2093
Lead_name : 2093
Industry : 139
Deal_value : 2084
Weighted_amount : 2034
Date_of_creation : 720
Pitch : 2
Contact_no : 2093
Lead_revenue : 3
Fund_category : 4
Geography : 3
Location : 566
POC_name : 1746
Designation : 10
Lead_POC_email : 2093
Hiring_candidate_role : 618
Lead_source : 4
Level_of_meeting : 3
Last_lead_update : 11
Internal_POC : 60
Resource : 7
Internal_rating : 5


**Columns to drop (unique or nearly unique per row, so they behave like identifiers): Deal_title, Lead_name, Contact_no, POC_name, Lead_POC_email**

In [14]:
# Function for dropping the data
def drop_cols(data):
    data.drop(['Deal_title', 'Lead_name', 'Contact_no', 'POC_name', 'Lead_POC_email'], axis=1, inplace=True)
    return data

In [15]:
df_train_dev = df_train.copy()
df_train_dev = drop_cols(df_train_dev)

In [16]:
df_train_dev.shape

(6422, 18)

In [17]:
# Function for feature extraction
def feature_extraction(data):
    # 1. Remove the '$' from the numerical columns and convert it to numerical
    cols = data.columns
    data[cols] = data[cols].replace({r'\$':''}, regex = True)
    data['Deal_value'] = pd.to_numeric(data['Deal_value'])
    data['Weighted_amount'] = pd.to_numeric(data['Weighted_amount'])
    
    # 2. Extracting Features from the datetime column
    data['Creation_year'] = data['Date_of_creation'].dt.year
    data['Creation_quarter'] = data['Date_of_creation'].dt.quarter
    
    # 3. Drop the datetime column
    data.drop('Date_of_creation', axis=1, inplace=True)
    
    return data
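
As a small illustration of what `feature_extraction` does, the sketch below applies the same two steps to a tiny hand-built frame. The two rows reuse values from the `df_test.head()` output above; the frame name `demo` is hypothetical. Unlike the function above, it strips the `$` only from the one column that needs it.

```python
import pandas as pd

demo = pd.DataFrame({
    "Deal_value": ["200988$", "409961$"],
    "Date_of_creation": pd.to_datetime(["2020-04-15", "2021-01-23"]),
})

# Strip the trailing '$' as a literal character, then convert to a number
demo["Deal_value"] = pd.to_numeric(demo["Deal_value"].str.replace("$", "", regex=False))

# Derive calendar features from the datetime column, then drop it
demo["Creation_year"] = demo["Date_of_creation"].dt.year
demo["Creation_quarter"] = demo["Date_of_creation"].dt.quarter
demo = demo.drop(columns="Date_of_creation")

print(demo)
```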

In [18]:
df_train_dev = feature_extraction(df_train_dev)

In [19]:
df_train_dev.shape

(6422, 19)

In [20]:
df_train_dev.dtypes

Industry                  object
Deal_value               float64
Weighted_amount          float64
Pitch                     object
Lead_revenue              object
Fund_category             object
Geography                 object
Location                  object
Designation               object
Hiring_candidate_role     object
Lead_source               object
Level_of_meeting          object
Last_lead_update          object
Internal_POC              object
Resource                  object
Internal_rating            int64
Success_probability      float64
Creation_year              int64
Creation_quarter           int64
dtype: object

In [21]:
df_train_dev.isnull().sum()

Industry                    1
Deal_value                 46
Weighted_amount           474
Pitch                       0
Lead_revenue                0
Fund_category               0
Geography                 878
Location                    9
Designation                 0
Hiring_candidate_role       0
Lead_source                 0
Level_of_meeting            0
Last_lead_update         1111
Internal_POC                0
Resource                  138
Internal_rating             0
Success_probability         0
Creation_year               0
Creation_quarter            0
dtype: int64

In [22]:
# Function for filling NaN values
def imputation(data):
    numerical = ['Deal_value', 'Weighted_amount']
    labels = ['Industry', 'Last_lead_update', 'Resource', 'Location']

    # Numerical columns are filled with the median, categorical columns with the mode
    for num in numerical:
        data[num] = data[num].fillna(data[num].median())
    for lab in labels:
        data[lab] = data[lab].fillna(data[lab].mode()[0])

    # Split Location on the first space; rows without a second part get State '0'
    # (n must be passed as a keyword in pandas 2.0 and later)
    data[['Place', 'State']] = data['Location'].str.split(' ', n=1, expand=True)
    data['State'] = data['State'].fillna('0')
    # Missing Geography defaults to 'USA'; rows without a State are labelled 'India'
    data['Geography'] = data['Geography'].fillna('USA')
    data.loc[data['State'] == '0', 'Geography'] = 'India'
    data.drop(['Place', 'State', 'Location'], axis=1, inplace=True)

    return data
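
The `Geography` rule in `imputation` is easy to misread, so here is a minimal sketch of it on made-up data. The location strings and the variable names are hypothetical; the real `Location` values are not shown in this notebook.

```python
import pandas as pd

# Hypothetical values: one location with a state part, two without
locations = pd.Series(["Austin TX", "Mumbai", "Pune"])
geography = pd.Series([None, "USA", None])

# Split on the first space only; the second column is missing when there is no space
parts = locations.str.split(" ", n=1, expand=True)
has_state = parts[1].notna()

# Step 1: missing Geography defaults to 'USA'
# Step 2: every row without a state part becomes 'India', even if it already had a value
filled = geography.fillna("USA").mask(~has_state, "India")

print(filled.tolist())
```

Note that step 2 overwrites an existing `Geography` value whenever the location has no state part, which is the same behavior as the `data.loc[...] = 'India'` line in the function.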

In [23]:
df_train_dev = imputation(df_train_dev)

In [24]:
df_train_dev.isnull().sum()

Industry                 0
Deal_value               0
Weighted_amount          0
Pitch                    0
Lead_revenue             0
Fund_category            0
Geography                0
Designation              0
Hiring_candidate_role    0
Lead_source              0
Level_of_meeting         0
Last_lead_update         0
Internal_POC             0
Resource                 0
Internal_rating          0
Success_probability      0
Creation_year            0
Creation_quarter         0
dtype: int64

In [25]:
df_train_dev

Unnamed: 0,Industry,Deal_value,Weighted_amount,Pitch,Lead_revenue,Fund_category,Geography,Designation,Hiring_candidate_role,Lead_source,Level_of_meeting,Last_lead_update,Internal_POC,Resource,Internal_rating,Success_probability,Creation_year,Creation_quarter
0,Restaurants,320506.0,2067263.70,Product_2,50 - 100 Million,Category 2,USA,Executive Vice President,Community pharmacist,Website,Level 3,No track,"Davis,Sharrice A",No,3,73.60,2020,1
1,Construction Services,39488.0,240876.80,Product_2,500 Million - 1 Billion,Category 4,India,Chairman/CEO/President,Recruitment consultant,Others,Level 1,Did not hear back after Level 1,"Brown,Maxine A",No,5,58.90,2019,3
2,Hospitals/Clinics,359392.0,2407926.40,Product_1,500 Million - 1 Billion,Category 4,USA,SVP/General Counsel,Health service manager,Marketing Event,Level 1,Up-to-date,"Georgakopoulos,Vasilios T",No,4,68.80,2019,3
3,Real Estate,76774.0,468321.40,Product_2,500 Million - 1 Billion,Category 3,USA,CEO/Co-Founder/Chairman,"Therapist, speech and language",Contact Email,Level 2,Did not hear back after Level 1,"Brown,Maxine A",We have all the requirements,1,64.50,2021,1
4,Financial Services,483896.0,1549272.35,Product_2,50 - 100 Million,Category 3,India,Executive Vice President,Media planner,Website,Level 2,Up-to-date,"Thomas,Lori E",No,4,62.40,2019,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7000,Beverages (Alcoholic),152908.0,970965.80,Product_1,100 - 500 Million,Category 1,India,CEO/President,Counselling psychologist,Website,Level 2,5 days back,"Gould,Lisa D",No,1,62.70,2020,4
7001,Banks,479541.0,2685429.60,Product_1,100 - 500 Million,Category 3,India,CEO,Systems analyst,Others,Level 2,Up-to-date,"Morsy,Omar A",No,2,57.40,2020,1
7003,Hospitals/Clinics,220208.0,1453372.80,Product_2,100 - 500 Million,Category 1,India,CEO,Financial risk analyst,Marketing Event,Level 2,Up-to-date,"Brown,Maxine A",We have all the requirements,3,26.35,2020,1
7004,Semiconductors,253608.0,1549272.35,Product_1,100 - 500 Million,Category 2,USA,SVP/General Counsel,Nature conservation officer,Marketing Event,Level 3,Up-to-date,"Logan,Kevin N",No,1,70.60,2020,1


In [26]:
print('Total Unique values in each column of Training dataset -> \n')
for col in df_train_dev.columns:
    print(str(col) + " : " + str(len(df_train_dev[col].unique())))

Total Unique values in each column of Training dataset -> 

Industry : 171
Deal_value : 6339
Weighted_amount : 5947
Pitch : 2
Lead_revenue : 3
Fund_category : 4
Geography : 2
Designation : 10
Hiring_candidate_role : 639
Lead_source : 4
Level_of_meeting : 3
Last_lead_update : 10
Internal_POC : 60
Resource : 6
Internal_rating : 5
Success_probability : 246
Creation_year : 3
Creation_quarter : 4


In [27]:
df_train_dev.shape

(6422, 18)

In [29]:
target_dev = df_train_dev['Success_probability']
df_train_dev.drop('Success_probability', axis=1, inplace=True)

In [30]:
target_dev

0       73.60
1       58.90
2       68.80
3       64.50
4       62.40
        ...  
7000    62.70
7001    57.40
7003    26.35
7004    70.60
7006    68.70
Name: Success_probability, Length: 6422, dtype: float64

In [31]:
# Min-max scaling of the numerical columns and label encoding of all remaining columns
def normalization(data):
    # enc = RobustScaler()
    label_enc = LabelEncoder()
    enc = MinMaxScaler()
    numerical = ['Deal_value', 'Weighted_amount']
    data[numerical] = enc.fit_transform(data[numerical])

    for lab in data.columns:
        if lab not in numerical:
            data[lab] = label_enc.fit_transform(data[lab])
    return data
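
One caveat with `normalization` is that it refits the scaler and the label encoder on whatever frame it receives, so the same category can map to different integers in the training and test data. The sketch below shows one possible alternative, fitting on the training frame and reusing the fitted objects on the test frame. It swaps in `OrdinalEncoder` from scikit-learn (with `handle_unknown="use_encoded_value"`, available from version 0.24) in place of `LabelEncoder`; the tiny frames and the `Product_3` category are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical miniature train/test frames
train = pd.DataFrame({"Pitch": ["Product_1", "Product_2", "Product_1"],
                      "Deal_value": [100.0, 300.0, 200.0]})
test = pd.DataFrame({"Pitch": ["Product_2", "Product_3"],
                     "Deal_value": [200.0, 400.0]})

# Fit once, on the training data only
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
scaler = MinMaxScaler()
encoder.fit(train[["Pitch"]])
scaler.fit(train[["Deal_value"]])


def transform(frame):
    """Apply the already-fitted encoder and scaler to a copy of the frame."""
    out = frame.copy()
    out["Pitch"] = encoder.transform(frame[["Pitch"]]).ravel()
    out["Deal_value"] = scaler.transform(frame[["Deal_value"]]).ravel()
    return out


encoded_train = transform(train)
encoded_test = transform(test)
print(encoded_test)
```

Categories that never appeared in training are mapped to `-1` instead of raising an error, and test values outside the training range can scale to values above 1.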

In [32]:
df_train_dev = normalization(df_train_dev)

In [33]:
df_train_dev

Unnamed: 0,Industry,Deal_value,Weighted_amount,Pitch,Lead_revenue,Fund_category,Geography,Designation,Hiring_candidate_role,Lead_source,Level_of_meeting,Last_lead_update,Internal_POC,Resource,Internal_rating,Creation_year,Creation_quarter
0,119,0.639895,0.572982,1,1,1,1,7,126,3,2,6,12,2,2,1,0
1,31,0.076110,0.064622,1,2,3,0,4,501,2,0,2,5,2,4,0,2
2,56,0.717909,0.667802,0,2,3,1,8,286,1,0,8,18,2,3,0,2
3,114,0.150914,0.127930,1,2,2,1,2,611,0,1,2,5,4,0,2,0
4,46,0.967692,0.428803,1,1,2,0,7,369,3,1,8,53,2,3,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7000,18,0.303656,0.267836,0,0,0,0,3,143,3,1,1,20,2,0,1,3
7001,17,0.958955,0.745043,0,0,2,0,0,578,2,1,8,42,2,1,1,0
7003,56,0.438675,0.402110,1,0,0,0,0,250,1,1,8,5,4,2,1,0
7004,130,0.505683,0.428803,0,0,1,1,8,395,1,2,8,35,2,0,1,0


In [34]:
def model_experimentation(models, X_train, X_test, y_train, y_test):
    '''
    Fit each model with its default hyperparameters (no tuning) and
    return the competition score on the test split and on the training split
    '''
    model_scores = {}
    model_train = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        model_preds_train = model.predict(X_train)
        model_train[name] = max(0, 100-np.sqrt(mean_squared_error(y_train, model_preds_train)))
        model_preds= model.predict(X_test)
        model_scores[name] = max(0, 100-np.sqrt(mean_squared_error(y_test, model_preds)))
    return model_scores, model_train

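def test_prediction(model, X, name='submission.csv'):
    # Preprocess the test data, predict, and write Deal_title plus the predictions to a CSV file
    predicted_df = pd.DataFrame()
    X, predicted_df['Deal_title'] = preprocess_test(X)
    predicted_df['Success_probability'] = model.predict(X)
    # index=False keeps the pandas row index out of the submission file
    predicted_df.to_csv('submissions/' + name, index=False)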

In [35]:
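# Final pipeline for the training data
def preprocess(data, train=True):
    data = drop_cols(data)
    data = feature_extraction(data)
    data = imputation(data)

    # Drop the rows that IsolationForest labels as outliers (-1) on the two numerical columns
    iso = IsolationForest()
    yhat = iso.fit_predict(data[['Deal_value', 'Weighted_amount']])
    mask = yhat != -1
    # .copy() makes the filtered frame independent, so the in-place edits below
    # do not raise SettingWithCopyWarning
    data = data[mask].copy()

    target = data['Success_probability']
    data.drop('Success_probability', axis=1, inplace=True)
    data = normalization(data)
    return data, target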
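
The outlier step in `preprocess` relies on `IsolationForest.fit_predict`, which returns `1` for inliers and `-1` for outliers. Here is a minimal sketch on synthetic data; the column values are randomly generated, and the extra column `Deal_value_k` exists only to show that editing the filtered copy leaves the original frame untouched.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the two numerical columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Deal_value": rng.normal(250_000, 50_000, size=200),
    "Weighted_amount": rng.normal(1_500_000, 300_000, size=200),
})

# fit_predict labels each row: 1 = inlier, -1 = outlier
iso = IsolationForest(random_state=0)
labels = iso.fit_predict(demo[["Deal_value", "Weighted_amount"]])

# Filter, then copy, so later column assignments do not touch `demo`
kept = demo[labels != -1].copy()
kept["Deal_value_k"] = kept["Deal_value"] / 1_000

print(f"kept {len(kept)} of {len(demo)} rows")
```

Passing `random_state` makes the set of dropped rows reproducible between runs, which the `IsolationForest()` call in `preprocess` does not guarantee.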

In [36]:
df_train_dev, target_dev = preprocess(df_train.copy())

X_train, X_test, y_train, y_test = train_test_split(df_train_dev, target_dev, test_size=0.33, random_state=42)

models = {'XGB': xgb.XGBRegressor(n_jobs=-1),
          'CAT': CatBoostRegressor(),
          'GBR': GradientBoostingRegressor(),
          'ADA': AdaBoostRegressor(),
          'LGB': lgb.LGBMRegressor(),
          'RF': RandomForestRegressor(n_jobs=-1)
         }

scores, scores_train = model_experimentation(models, X_train, X_test, y_train, y_test)

Learning rate set to 0.04753
0:	learn: 12.2916716	total: 147ms	remaining: 2m 26s
1:	learn: 12.2334412	total: 149ms	remaining: 1m 14s
2:	learn: 12.1711587	total: 151ms	remaining: 50s
...
948:	learn: 6.4226069	total: 1.84s	remaining: 98.9ms
...

In [37]:
scores

{'XGB': 87.5545898845032,
 'CAT': 88.60876418868344,
 'GBR': 88.8580137898176,
 'ADA': 84.19563130415918,
 'LGB': 88.68478605461246,
 'RF': 88.52122155448262}

In [38]:
scores_train

{'XGB': 97.441336462662,
 'CAT': 93.73185721937537,
 'GBR': 89.76941309363636,
 'ADA': 84.37319506197562,
 'LGB': 92.90419539442388,
 'RF': 95.4985907230702}
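
Comparing the two dictionaries side by side gives a rough idea of how much each model overfits. The sketch below copies the scores printed above and computes the train-minus-test gap; the variable name `gap` is an assumption for illustration.

```python
# Scores copied from the two outputs above
scores = {'XGB': 87.5545898845032, 'CAT': 88.60876418868344,
          'GBR': 88.8580137898176, 'ADA': 84.19563130415918,
          'LGB': 88.68478605461246, 'RF': 88.52122155448262}
scores_train = {'XGB': 97.441336462662, 'CAT': 93.73185721937537,
                'GBR': 89.76941309363636, 'ADA': 84.37319506197562,
                'LGB': 92.90419539442388, 'RF': 95.4985907230702}

# A larger gap means the model fits the training split much better than the test split
gap = {name: scores_train[name] - scores[name] for name in scores}

for name, value in sorted(gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: train-test gap = {value:.2f}")
```

On these numbers XGB has the largest gap and ADA the smallest, while GBR combines the best test score with a gap of less than one point.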

In [39]:
df_test

Unnamed: 0,Deal_title,Lead_name,Industry,Deal_value,Weighted_amount,Date_of_creation,Pitch,Contact_no,Lead_revenue,Fund_category,...,POC_name,Designation,Lead_POC_email,Hiring_candidate_role,Lead_source,Level_of_meeting,Last_lead_update,Internal_POC,Resource,Internal_rating
0,TitleAD16O,Bonilla Ltd Inc,Investment Bank/Brokerage,200988$,,2020-04-15,Product_1,167.332.2751x989,100 - 500 Million,Category 4,...,sonia,Chairman/CEO/President,maureenthomas@bonilla.com,"Designer, fashion/clothing",Marketing Event,Level 1,more than a month,"Massiah,Gerard F",No,1.0
1,TitleOW6CR,"Williams, Rogers and Roach PLC",Electronics,409961$,2541758.2$,2021-01-23,Product_1,001-486-903-0711x7831,100 - 500 Million,Category 3,...,Daniel Bell,CEO/Co-Founder/Chairman,danielbell@williams.com,Horticultural consultant,Marketing Event,Level 2,Up-to-date,"Smith,Keenan H",Yes,1.0
2,TitleVVJQ5,"Wood, Vaughn and Morales Ltd",Banks,434433$,3041031.0$,2020-07-19,Product_1,(393)104-2610x9723,100 - 500 Million,Category 1,...,Andrew Davis,Chairman/Chief Innovation Officer,andrewdavis@wood.com,Information officer,Marketing Event,Level 2,Did not hear back after Level 1,"Gilley,Janine",Deliverable,5.0
3,TitleUS8NA,Durham-Crawford Inc,Music,218952$,1521716.4$,2020-02-27,Product_2,(817)040-4599,100 - 500 Million,Category 1,...,shital,CEO/Chairman/President,charlesrivera@durhamcrawford.com,Commercial/residential surveyor,Contact Email,Level 3,more than a month,"Morsy,Omar A",No,5.0
4,Title5VGWW,"Simpson, Duncan and Long LLC",Real Estate,392835$,2455218.75$,2020-10-25,Product_1,718-032-5726x76098,500 Million - 1 Billion,Category 3,...,Shelly Stephenson,CEO/Co-Founder/Chairman,shellystephenson@simpson.com,Wellsite geologist,Others,Level 3,More than 2 weeks,"Morsy,Omar A",Deliverable,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2088,Title2R8VU,"Phillips, Smith and Jones Inc",BioTech/Drugs,417150$,2732332.5$,2020-11-25,Product_2,(444)658-6350x207,100 - 500 Million,Category 3,...,William Spears,Chairman/CEO/President,williamspears@phillips.com,Community development worker,Marketing Event,Level 3,No track,"Ross,Eric L",No,1.0
2089,Title7HCNJ,Elliott-Morales PLC,Real Estate,488661$,2956399.05$,2019-10-20,Product_2,1377254815,50 - 100 Million,Category 3,...,Amy Page,Chief Executive Officer,amypage@elliottmorales.com,Forest/woodland manager,Others,Level 2,Following up but lead not responding,"Abdul-Hamid,Saud Muhamad",Cannot deliver,5.0
2090,TitleCD5YZ,Herrera-Santos PLC,Sales/Marketing Services,421119$,2631993.75$,2019-03-23,Product_1,695-757-3607x5834,50 - 100 Million,Category 1,...,ashma,CEO/Co-Founder/Chairman,markcombs@herrerasantos.com,Actuary,Contact Email,Level 1,,"Bannister,Joan",Deliverable,5.0
2091,Title8OKXL,"Howard, Martinez and Jacobs PLC",Banks,59879$,350292.15$,2019-02-05,Product_1,+1-688-318-4079x644,50 - 100 Million,Category 2,...,aarti,Chief Executive Officer,justinmorgan@howard.com,"Designer, textile",Marketing Event,Level 1,,"Murray,Younetta",We have all the requirements,5.0


In [40]:
# Final pipeline for the test data (no outlier removal; returns Deal_title for the submission file)
def preprocess_test(data, train=True):
    target = data['Deal_title']
    
    data = drop_cols(data)
    data = feature_extraction(data)
    data = imputation(data)
    

    data = normalization(data)
    return data, target

In [42]:
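# Reuse the df_test cleaned in cells 9 to 11; re-reading test.csv here would
# bring back the invalid Internal_rating values (-1 and 82.34)
model_GBR = GradientBoostingRegressor()
model_GBR.fit(df_train_dev, target_dev)
# Pass a copy because the preprocessing functions modify the frame in place
test_prediction(model_GBR, df_test.copy(), 'GBR_1.csv')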