This project deals with leads and bids for loans. Each available lead can be bid on at $3, $35, $50, or $75. Bidding on a lead does not guarantee winning it: the lower the bid price, the less likely the bid is to be won. Once a lead is won, we can attempt to convert it into a loan. I am going to use the lead data to develop a model for deciding which leads to bid on and how much to bid.

Fields

BidPrice: If this is populated, then we bid this price for this lead.
AcceptedBid: This tells you whether the bid was accepted (we “won” the bid).
ExpectedRevenue: This is the amount of revenue we expect to get from the lead if it turns into a loan.

Three things need to happen for us to get any revenue:

We need to bid
We need to win the bid
We need to convert the lead into a loan

ExpectedConversion: This is the expected conversion rate of lead into loan.
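The three conditions above multiply together. As a rough sketch (not part of the original analysis), the expected net value of a single bid could be written as below. The name `p_win` is a hypothetical win probability, and the assumption that the bid price is only paid when the bid is won is mine, not something the data description states.

```python
def expected_net_value(bid_price, p_win, expected_conversion, expected_revenue):
    """Expected net value of one bid.

    Assumptions (not from the source data): p_win is the probability that the
    bid is accepted, and the bid price is only paid when the bid is won.
    """
    if bid_price == 0:  # no bid placed, so no cost and no revenue
        return 0.0
    return p_win * (expected_conversion * expected_revenue - bid_price)


# Hypothetical lead: $400 expected revenue, 25% conversion, 50% chance to win at $35
value = expected_net_value(35, 0.5, 0.25, 400.0)
print(value)
```

Under these assumptions a bid is only worth placing when ExpectedRevenue * ExpectedConversion exceeds the bid price, which is why that product is a natural quantity to build bidding rules on.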

In [None]:
# Imports
import numpy as np 
import pandas as pd 

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor # import the random forest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings("ignore")

First, load the data and fill in NaN BidPrice values with 0 (no bid) to prevent errors when creating and training a model.
Note that I added a column named 'Bid identifier' to the data, equal to ExpectedRevenue * ExpectedConversion.

In [2]:
leadData = pd.read_csv("Lead_Bid_Data.csv")

leadData.isnull().sum()
leadData['BidPrice'] = leadData['BidPrice'].fillna(0)  # assign back instead of a chained inplace fill
leadData.isnull().sum()

id                    0
BidPrice              0
AcceptedBid           0
ExpectedRevenue       0
ExpectedConversion    0
Bid identifier        0
dtype: int64
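For readers without the CSV, here is a small sketch of the two preparation steps on made-up rows: filling a missing BidPrice with 0 and deriving 'Bid identifier'. The toy values are hypothetical and only illustrate the column arithmetic.

```python
import pandas as pd

# Toy rows standing in for Lead_Bid_Data.csv (values are made up)
toy = pd.DataFrame({
    "BidPrice": [35.0, None, 75.0],
    "ExpectedRevenue": [400.0, 250.0, 600.0],
    "ExpectedConversion": [0.25, 0.5, 0.25],
})

# A missing BidPrice means no bid was placed; treat it as a $0 bid
toy["BidPrice"] = toy["BidPrice"].fillna(0)

# Expected revenue weighted by the chance of converting the lead into a loan
toy["Bid identifier"] = toy["ExpectedRevenue"] * toy["ExpectedConversion"]
print(toy)
```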

I create a SelectedFeatures variable holding the column names of the independent (X) variables. y is the dependent variable, the one I want to predict. I want to see if I can train a model to assign a 'BidPrice' to each 'id'. Later I will create 'rules' to assign the 'BidPrice' instead.

In [3]:
SelectedFeatures = ['ExpectedRevenue', 'ExpectedConversion', 'Bid identifier']
X = leadData[SelectedFeatures]
y = leadData['BidPrice'].values

Creating a training and testing set. The test set is 20% of the 'leadData' data set.

In [145]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2, stratify=y)

Starting with a Decision Tree: I will use both the Regressor and the Classifier to predict the 'BidPrice'.
I am using the Regressor for fun, as this is really a classification problem and the Regressor's output is not limited to the $3, $35, $50, and $75 values for the bid price.

The Regressor predicts continuous values rather than the viable bids ($3, $35, $50, $75), so I created a set of rules to convert its predictions into viable bids. Moving the boundaries of those rules gives a bit more freedom over which leads are assigned which bid than a classifier does.

In [146]:
DTModel = DecisionTreeRegressor()
DTModel.fit(X = X_train, y = y_train)
y_pred = DTModel.predict(X = X_test)
# Snap each continuous prediction to a viable bid (0 = no bid).
# Conditions are checked in order and the bands are contiguous, so values such
# as 14.995 or 39.995 can no longer slip through unconverted.
y_pred = np.select(
    [y_pred <= 1, y_pred < 15, y_pred < 40, y_pred < 60],
    [0.0, 3.0, 35.0, 50.0],
    default=75.0,
)
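To make the point about movable boundaries concrete, here is a minimal sketch that puts the cutoffs in a parameter. The function name `snap_to_viable_bids` and the sample predictions are hypothetical. It uses half-open bands, so a prediction of exactly 1.0 lands in the $3 band here, which differs slightly from the rule in the cell above.

```python
import numpy as np

VIABLE_BIDS = np.array([0.0, 3.0, 35.0, 50.0, 75.0])


def snap_to_viable_bids(raw_pred, cutoffs=(1.0, 15.0, 40.0, 60.0)):
    """Map continuous predictions onto viable bids.

    cutoffs are the inclusive lower edges of the $3, $35, $50 and $75 bands;
    anything below the first cutoff becomes 0 (no bid).
    """
    bands = np.digitize(raw_pred, cutoffs)  # band index 0..4 for each prediction
    return VIABLE_BIDS[bands]


# Made-up regressor outputs
raw = np.array([0.4, 2.0, 14.995, 38.0, 59.0, 71.0])
print(snap_to_viable_bids(raw))
# Lowering the $35 cutoff from 15 to 10 moves the third lead from $3 to $35
print(snap_to_viable_bids(raw, cutoffs=(1.0, 10.0, 40.0, 60.0)))
```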

In [147]:
accuracy_score(y_test, y_pred)

0.8389249304911955

Accuracy is 83.9%. This is a little low; I am looking for something in the 90% range.

Now I will try the DecisionTreeClassifier. Because a classifier only predicts class labels that appear in the training set, no rules are needed to make the predicted bids viable.

In [150]:
DTCModel = DecisionTreeClassifier()
DTCModel.fit(X = X_train, y = y_train)
y_pred = DTCModel.predict(X = X_test)

In [151]:
accuracy_score(y_test, y_pred)

0.83790546802595

Accuracy is 83.8%. This is very close to the Regressor's accuracy but still low.

To try to improve the accuracy of the prediction, I am going to use a Random Forest algorithm instead.

I will use both the RandomForestRegressor and the RandomForestClassifier. Like the DecisionTreeRegressor, the RandomForestRegressor produces a continuous range of predictions, so I am using the same rules to convert the predicted values to viable bids.

In [154]:
forestRegressor = RandomForestRegressor()
forestRegressor.fit(X = X_train, y = y_train)
y_pred = forestRegressor.predict(X = X_test)

# Snap each continuous prediction to a viable bid (0 = no bid).
# Conditions are checked in order and the bands are contiguous, so values such
# as 14.995 or 39.995 can no longer slip through unconverted.
y_pred = np.select(
    [y_pred <= 1, y_pred < 15, y_pred < 40, y_pred < 60],
    [0.0, 3.0, 35.0, 50.0],
    default=75.0,
)

In [155]:
accuracy_score(y_test, y_pred)

0.8521779425393883

Accuracy is 85.2%. This is still a bit low, but it is better than the Decision Tree Regressor. Note that if the predicted BidPrices were not converted to viable bid values, accuracy_score above and confusion_matrix and classification_report below would raise an error instead of returning a result, because these functions only accept discrete class labels, not continuous predictions.

In [156]:
confusion_matrix(y_test, y_pred)

array([[5688,  415,  298,    0,    0],
       [ 139,  330,  201,    0,    0],
       [  14,  225, 2047,   59,   67],
       [   0,    0,   18,  116,   50],
       [   0,    0,    7,  102, 1014]], dtype=int64)

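The confusion matrix shows, for each actual bid (rows: $0, $3, $35, $50, $75), how many leads were predicted as each bid (columns, same order). Correct predictions lie on the diagonal.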

In [157]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.97      0.89      0.93      6401
         3.0       0.34      0.49      0.40       670
        35.0       0.80      0.85      0.82      2412
        50.0       0.42      0.63      0.50       184
        75.0       0.90      0.90      0.90      1123

   micro avg       0.85      0.85      0.85     10790
   macro avg       0.69      0.75      0.71     10790
weighted avg       0.88      0.85      0.86     10790



The classification report above shows the precision of the model in predicting each of the values. NOTE: this is after I applied the rules to convert the predicted values to viable values. The $0, $35, and $75 bids are predicted fairly precisely (0.97, 0.80, and 0.90), while the $3 and $50 bids are not (0.34 and 0.42).
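As a cross-check, the precision and recall columns of the report can be recomputed directly from the Random Forest Regressor confusion matrix printed above. This is only a sketch of the arithmetic; the matrix values are copied from that output.

```python
import numpy as np

labels = [0, 3, 35, 50, 75]
# Confusion matrix from the Random Forest Regressor cell:
# rows are actual bids, columns are predicted bids
cm = np.array([
    [5688, 415, 298, 0, 0],
    [139, 330, 201, 0, 0],
    [14, 225, 2047, 59, 67],
    [0, 0, 18, 116, 50],
    [0, 0, 7, 102, 1014],
])

correct = np.diag(cm)
precision = correct / cm.sum(axis=0)  # column sums: how often each bid was predicted
recall = correct / cm.sum(axis=1)     # row sums: how often each bid actually occurred
accuracy = correct.sum() / cm.sum()

for label, p, r in zip(labels, precision, recall):
    print(f"${label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The low precision for the $3 bid comes from its column: of the 970 leads predicted as $3, only 330 actually had a $3 bid.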

Random Forest Classifier, to compare against the Regressor. It uses the same training and testing data as the Regressor model. Here the model predicts $0, $3, $35, $50, or $75 directly, so no logic is needed to convert the values.

In [158]:
forestClassifier = RandomForestClassifier()
forestClassifier.fit(X=X_train, y=y_train)
y_pred_test = forestClassifier.predict(X_test)

In [159]:
accuracy_score(y_test, y_pred_test)

0.8611677479147358

Accuracy = 86.1%, slightly better than the Regressor method for predicting BidPrices but still not what I want.

In [160]:
confusion_matrix(y_test, y_pred_test)

array([[5955,  179,  267,    0,    0],
       [ 287,  197,  186,    0,    0],
       [ 174,  124, 2012,   30,   72],
       [   1,    1,   41,   95,   46],
       [   0,    0,   54,   36, 1033]], dtype=int64)

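The confusion matrix shows, for each actual bid (rows: $0, $3, $35, $50, $75), how many leads were predicted as each bid (columns, same order). Correct predictions lie on the diagonal.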

In [161]:
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

         0.0       0.93      0.93      0.93      6401
         3.0       0.39      0.29      0.34       670
        35.0       0.79      0.83      0.81      2412
        50.0       0.59      0.52      0.55       184
        75.0       0.90      0.92      0.91      1123

   micro avg       0.86      0.86      0.86     10790
   macro avg       0.72      0.70      0.71     10790
weighted avg       0.85      0.86      0.86     10790



The classification report above shows the precision of the model in predicting each of the values. The $0, $35, and $75 bids are again predicted fairly precisely (0.93, 0.79, and 0.90). Compared to the report for the Regressor, precision is higher for the $3 and $50 bids (0.39 vs. 0.34 and 0.59 vs. 0.42), although their recall is lower (0.29 vs. 0.49 and 0.52 vs. 0.63).

# Predict AcceptedBid = 1

Next, I want to predict whether the bid will be accepted (AcceptedBid = 1). This assumes the bid has already been assigned, either by a model or by some other method. For this I will use the 'BidPrice' given in "Soaren_Management_Lead_Bid_Test_Data.csv" (leadData). The testing set is 20% of the data set.

In [162]:
SelectedFeatures2 = ['BidPrice', 'ExpectedRevenue', 'ExpectedConversion', 'Bid identifier']
X2 = leadData[SelectedFeatures2]
y2 = leadData['AcceptedBid'].values

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.20, random_state=0)

As seen above I am using 4 features. I will again use both the RandomForestClassifier and RandomForestRegressor Methods.

In [163]:
forestClassifier2 = RandomForestClassifier()
forestClassifier2.fit(X=X_train2, y=y_train2)
y_pred_test2 = forestClassifier2.predict(X_test2)

In [164]:
accuracy_score(y_test2, y_pred_test2)

0.8949953660797034

Accuracy is 89.5% - very close to 90%.

In [165]:
confusion_matrix(y_test2, y_pred_test2)

array([[7005,  601],
       [ 532, 2652]], dtype=int64)

Confusion matrix shows the number of True and False Positives and Negatives.

In [166]:
print(classification_report(y_test2, y_pred_test2))

              precision    recall  f1-score   support

           0       0.93      0.92      0.93      7606
           1       0.82      0.83      0.82      3184

   micro avg       0.89      0.89      0.89     10790
   macro avg       0.87      0.88      0.87     10790
weighted avg       0.90      0.89      0.90     10790



Now to try the Regressor. Because the Regressor outputs a range of floating-point values, logic is needed to convert each prediction to an 'AcceptedBid' of 0 or 1.

In [169]:
forestRegressor2 = RandomForestRegressor()
forestRegressor2.fit(X = X_train2, y = y_train2)
y_pred2 = forestRegressor2.predict(X = X_test2)

# Threshold the continuous predictions at 0.5 to get AcceptedBid labels (0 or 1)
y_pred2 = (y_pred2 >= 0.5).astype(int)

In [170]:
accuracy_score(y_test2, y_pred2)

0.9012048192771084

Accuracy = 90%, this is good.
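The 0.5 cutoff used to convert the Regressor output is a choice, not a given. The sketch below uses made-up scores and a hypothetical helper `to_accepted` to show how moving the threshold changes how many bids are predicted as accepted, which matters later when the number of accepted bids has to stay within a fixed range.

```python
import numpy as np


def to_accepted(scores, threshold=0.5):
    """Turn continuous regressor outputs into AcceptedBid labels (0 or 1)."""
    return (np.asarray(scores) >= threshold).astype(int)


# Made-up scores standing in for forestRegressor2.predict(X_test2)
scores = [0.05, 0.30, 0.48, 0.52, 0.70, 0.95]

for threshold in (0.3, 0.5, 0.7):
    labels = to_accepted(scores, threshold)
    print(f"threshold={threshold}: {labels.sum()} bids predicted as accepted")
```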

In [171]:
confusion_matrix(y_test2, y_pred2)

array([[6911,  695],
       [ 371, 2813]], dtype=int64)

Confusion matrix shows the number of True and False Positives and Negatives.

In [173]:
print(classification_report(y_test2, y_pred2))

              precision    recall  f1-score   support

           0       0.95      0.91      0.93      7606
           1       0.80      0.88      0.84      3184

   micro avg       0.90      0.90      0.90     10790
   macro avg       0.88      0.90      0.88     10790
weighted avg       0.91      0.90      0.90     10790



I am going to compare the Random Forest algorithm to a Logistic Regression model. The model will use the same training and testing sets.

In [177]:
logreg = LogisticRegression()
logreg.fit(X_train2, y_train2)
y_log_pred = logreg.predict(X_test2)

print('Train/Test split results:')
print(logreg.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test2, y_log_pred))

Train/Test split results:
LogisticRegression accuracy is 0.918


The accuracy for the Logistic Regression model is a bit better than the Random Forest models. This is the model I would use.

In [178]:
print(classification_report(y_test2, y_log_pred))

              precision    recall  f1-score   support

           0       0.97      0.91      0.94      7606
           1       0.81      0.94      0.87      3184

   micro avg       0.92      0.92      0.92     10790
   macro avg       0.89      0.92      0.91     10790
weighted avg       0.93      0.92      0.92     10790



I want to try the Random Forest Classifier here to see how dropping 'ExpectedRevenue' as a feature affects the model.

In [181]:
SelectedFeatures3 = ['BidPrice', 'ExpectedConversion', 'Bid identifier']
X3 = leadData[SelectedFeatures3]
y3 = leadData['AcceptedBid'].values

X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.20, random_state=0)


forestClassifier3 = RandomForestClassifier()
forestClassifier3.fit(X=X_train3, y=y_train3)
y_pred_test3 = forestClassifier3.predict(X_test3)

accuracy_score(y_test3, y_pred_test3)

0.8945319740500464

This model has about the same accuracy as the model with 4 features.

In [182]:
logreg2 = LogisticRegression()
logreg2.fit(X_train3, y_train3)
y_log_pred2 = logreg2.predict(X_test3)

print('Train/Test split results:')
print(logreg2.__class__.__name__+" accuracy is %2.3f" % accuracy_score(y_test3, y_log_pred2))

Train/Test split results:
LogisticRegression accuracy is 0.918


The 3-feature model appears to be just as good as the model with 4 features, so dropping 'ExpectedRevenue' gains nothing; I will use the model with 4 features.

Now I want to create rules for assigning BidPrices instead of using a model.
I will compare the predicted net revenue of the bids that went through against the net revenue of the bids that went through in the file. The net revenue of the file is $1,207,654.
I have 8 test cases; each case changes how the 'BidPrice' is assigned.
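The notebook does not show how net revenue is computed, so the following is only a sketch under an assumed definition: for every accepted bid, expected revenue times expected conversion, minus the bid price. The function name `net_revenue` and the toy rows are hypothetical, and the actual calculation behind the $1,207,654 figure may differ.

```python
import pandas as pd


def net_revenue(df, accepted_col="AcceptedBid"):
    """Net revenue over accepted bids.

    Assumed definition (not confirmed by the source): for each accepted bid,
    ExpectedRevenue * ExpectedConversion minus the BidPrice paid.
    """
    won = df[df[accepted_col] == 1]
    per_lead = won["ExpectedRevenue"] * won["ExpectedConversion"] - won["BidPrice"]
    return float(per_lead.sum())


# Made-up rows: two accepted bids, one rejected bid, one lead with no bid
toy = pd.DataFrame({
    "BidPrice": [35, 75, 3, 0],
    "AcceptedBid": [1, 1, 0, 0],
    "ExpectedRevenue": [400.0, 600.0, 200.0, 100.0],
    "ExpectedConversion": [0.25, 0.5, 0.25, 0.5],
})
print(net_revenue(toy))
```

Passing a different `accepted_col`, such as a column of model predictions, would give the predicted net revenue for a test case under the same assumption.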

# Test Case 1

In [211]:
leadData2 = leadData.copy()
leadData2['BidPrice']=0

leadData75 = leadData2.loc[leadData2['Bid identifier'] >= 140]
leadData75['BidPrice'] = 75

leadData50 = leadData2.loc[(leadData2['Bid identifier'] >= 100) & (leadData2['Bid identifier'] <140)]
leadData50['BidPrice'] = 50

leadData35 = leadData2.loc[(leadData2['Bid identifier'] >= 65) & (leadData2['Bid identifier'] < 100)]
leadData35['BidPrice'] = 35

leadData3 = leadData2.loc[(leadData2['Bid identifier'] >= 20) & (leadData2['Bid identifier'] < 65)]
leadData3['BidPrice'] = 3

leadDataOther = leadData2.loc[leadData2['Bid identifier'] < 20]
leadDataOther['BidPrice'] = 0
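Since every test case only changes the 'Bid identifier' cutoffs, the rule can also be expressed as one function that takes the cutoffs as input. This is a sketch of an alternative, not the code used for the results in this notebook; `assign_bids` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def assign_bids(bid_identifier, thresholds):
    """Return one bid price per lead.

    thresholds maps each bid price to the lowest 'Bid identifier' that earns
    it; leads below every cutoff get a bid of 0.
    """
    values = np.asarray(bid_identifier, dtype=float)
    # Check the highest cutoff first so each lead gets the largest bid it qualifies for
    ordered = sorted(thresholds.items(), key=lambda item: item[1], reverse=True)
    conditions = [values >= cutoff for _, cutoff in ordered]
    prices = [price for price, _ in ordered]
    return np.select(conditions, prices, default=0)


# Cutoffs from Test Case 1
case_1 = {75: 140, 50: 100, 35: 65, 3: 20}

toy = pd.DataFrame({"Bid identifier": [10.0, 20.0, 64.9, 65.0, 120.0, 140.0]})
toy["BidPrice"] = assign_bids(toy["Bid identifier"], case_1)
print(toy)
```

A case that drops a bid level, such as Test Case 6, would simply omit that price from the dictionary, for example `{75: 150, 35: 67, 3: 20}`.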

# Test Case 2

Increases the number of $50 bids and decreases the number of $75 bids compared to the previous case.

In [216]:
leadData2 = leadData.copy()
leadData2['BidPrice']=0

leadData75 = leadData2.loc[leadData2['Bid identifier'] >= 150]
leadData75['BidPrice'] = 75

leadData50 = leadData2.loc[(leadData2['Bid identifier'] >= 100) & (leadData2['Bid identifier'] < 150)]
leadData50['BidPrice'] = 50

leadData35 = leadData2.loc[(leadData2['Bid identifier'] >= 65) & (leadData2['Bid identifier'] < 100)]
leadData35['BidPrice'] = 35

leadData3 = leadData2.loc[(leadData2['Bid identifier'] >= 20) & (leadData2['Bid identifier'] < 65)]
leadData3['BidPrice'] = 3

leadDataOther = leadData2.loc[leadData2['Bid identifier'] < 20]
leadDataOther['BidPrice'] = 0

# Test Case 3

Decreases the number of $50 bids and increases the number of $35 bids compared to the previous case.

In [226]:
leadData2 = leadData.copy()
leadData2['BidPrice']=0

leadData75 = leadData2.loc[leadData2['Bid identifier'] >= 150]
leadData75['BidPrice'] = 75

leadData50 = leadData2.loc[(leadData2['Bid identifier'] >= 110) & (leadData2['Bid identifier'] <150)]
leadData50['BidPrice'] = 50

leadData35 = leadData2.loc[(leadData2['Bid identifier'] >= 65) & (leadData2['Bid identifier'] < 110)]
leadData35['BidPrice'] = 35

leadData3 = leadData2.loc[(leadData2['Bid identifier'] >= 20) & (leadData2['Bid identifier'] < 65)]
leadData3['BidPrice'] = 3

leadDataOther = leadData2.loc[leadData2['Bid identifier'] < 20]
leadDataOther['BidPrice'] = 0

# Test Case 4

Increases the number of $35 bids and decreases the number of $3 bids compared to the previous case.

In [232]:
leadData2 = leadData.copy()
leadData2['BidPrice']=0

leadData75 = leadData2.loc[leadData2['Bid identifier'] >= 150]
leadData75['BidPrice'] = 75

leadData50 = leadData2.loc[(leadData2['Bid identifier'] >= 110) & (leadData2['Bid identifier'] <150)]
leadData50['BidPrice'] = 50

leadData35 = leadData2.loc[(leadData2['Bid identifier'] >= 60) & (leadData2['Bid identifier'] < 110)]
leadData35['BidPrice'] = 35

leadData3 = leadData2.loc[(leadData2['Bid identifier'] >= 20) & (leadData2['Bid identifier'] < 60)]
leadData3['BidPrice'] = 3

leadDataOther = leadData2.loc[leadData2['Bid identifier'] < 20]
leadDataOther['BidPrice'] = 0

# Test Case 5

Decreases the number of $35 bids and increases the number of $3 bids compared to the previous case.

In [238]:
leadData2 = leadData.copy()
leadData2['BidPrice']=0

leadData75 = leadData2.loc[leadData2['Bid identifier'] >= 150]
leadData75['BidPrice'] = 75

leadData50 = leadData2.loc[(leadData2['Bid identifier'] >= 110) & (leadData2['Bid identifier'] <150)]
leadData50['BidPrice'] = 50

leadData35 = leadData2.loc[(leadData2['Bid identifier'] >= 67) & (leadData2['Bid identifier'] < 110)]
leadData35['BidPrice'] = 35

leadData3 = leadData2.loc[(leadData2['Bid identifier'] >= 20) & (leadData2['Bid identifier'] < 67)]
leadData3['BidPrice'] = 3

leadDataOther = leadData2.loc[leadData2['Bid identifier'] < 20]
leadDataOther['BidPrice'] = 0

# Test Case 6

Removes the $50 bids and increases the number of $35 bids compared to the previous case.

In [244]:
leadData2 = leadData.copy()
leadData2['BidPrice']=0

leadData75 = leadData2.loc[leadData2['Bid identifier'] >= 150]
leadData75['BidPrice'] = 75

leadData35 = leadData2.loc[(leadData2['Bid identifier'] >= 67) & (leadData2['Bid identifier'] < 150)]
leadData35['BidPrice'] = 35

leadData3 = leadData2.loc[(leadData2['Bid identifier'] >= 20) & (leadData2['Bid identifier'] < 67)]
leadData3['BidPrice'] = 3

leadDataOther = leadData2.loc[leadData2['Bid identifier'] < 20]
leadDataOther['BidPrice'] = 0

# Test Case 7

Removes the $35 bids and increases the number of $3 bids compared to the previous case. This case does not yield enough accepted bids (AcceptedBid = 1), so it is not viable as is.

In [131]:
leadData2 = leadData.copy()
leadData2['BidPrice']=0

leadData75 = leadData2.loc[leadData2['Bid identifier'] >= 150]
leadData75['BidPrice'] = 75

leadData3 = leadData2.loc[(leadData2['Bid identifier'] >= 20) & (leadData2['Bid identifier'] < 150)]
leadData3['BidPrice'] = 3

leadDataOther = leadData2.loc[leadData2['Bid identifier'] < 20]
leadDataOther['BidPrice'] = 0

# Test Case 8

Increases the number of $75 bids and decreases the number of $3 bids compared to the previous case. This case does not yield enough accepted bids (AcceptedBid = 1), so it is not viable as is.

In [136]:
leadData2 = leadData.copy()
leadData2['BidPrice']=0

leadData75 = leadData2.loc[leadData2['Bid identifier'] >= 100]
leadData75['BidPrice'] = 75

leadData3 = leadData2.loc[(leadData2['Bid identifier'] >= 20) & (leadData2['Bid identifier'] < 100)]
leadData3['BidPrice'] = 3

leadDataOther = leadData2.loc[leadData2['Bid identifier'] < 20]
leadDataOther['BidPrice'] = 0

# Following used for Test Cases 1-5.

In [239]:
leadDataNew = pd.concat([leadData75, leadData50, leadData35, leadData3, leadDataOther])  # DataFrame.append was removed in pandas 2.0
leadDataNew = leadDataNew.sort_index()

# Following used for Test Case 6

In [245]:
leadDataNew = pd.concat([leadData75, leadData35, leadData3, leadDataOther])  # DataFrame.append was removed in pandas 2.0
leadDataNew = leadDataNew.sort_index()

# Following for Test Case 7-8

In [137]:
leadDataNew = pd.concat([leadData75, leadData3, leadDataOther])  # DataFrame.append was removed in pandas 2.0
leadDataNew = leadDataNew.sort_index()

# For all Test Cases

I am using the above rules to assign the BidPrices. I am comparing the predictions (AcceptedBid = 1) of the Random Forest Classifier and the Logistic Regression with 4 features ['BidPrice', 'ExpectedRevenue', 'ExpectedConversion', 'Bid identifier'], and the Random Forest Classifier with 3 features ['BidPrice', 'ExpectedConversion', 'Bid identifier'].
To make sure the number of leads purchased (AcceptedBid = 1) is within 5% of the actual number of leads purchased (15224-16827), I am checking the sum of AcceptedBid = 1 for each model.

In [246]:
# Feature matrices for the fitted models (4-feature and 3-feature variants).
# Distinct names avoid overwriting leadData2/leadData3 from the test-case cells.
features4 = leadDataNew[SelectedFeatures2]
features3 = leadDataNew[SelectedFeatures3]
final_pred = forestClassifier2.predict(features4)
final_pred2 = forestClassifier3.predict(features3)
final_log_pred = logreg.predict(features4)

In [247]:
leadDataNew['BidAcceptedForest2'] = final_pred
leadDataNew['BidAcceptedForest3'] = final_pred2
leadDataNew['BidAcceptedLogReg'] = final_log_pred
leadDataNew['OriginalBidPrice'] = leadData['BidPrice']

Verifying that the AcceptedBid sum is within the range (15224-16827).

In [248]:
print(leadDataNew['AcceptedBid'].sum())
print(leadDataNew['BidAcceptedForest2'].sum())
print(leadDataNew['BidAcceptedForest3'].sum())
print(leadDataNew['BidAcceptedLogReg'].sum())

16026
15432
14975
15309
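The 5% range can be derived from the actual total rather than typed in by hand. The sketch below recomputes the bounds from the 16026 accepted bids in the file and checks the three predicted sums printed above against them; the variable names are hypothetical.

```python
actual_accepted = 16026  # sum of AcceptedBid in the original file
tolerance = 0.05

lower = int(actual_accepted * (1 - tolerance))  # 15224
upper = int(actual_accepted * (1 + tolerance))  # 16827

# Predicted sums copied from the output above
predicted = {
    "BidAcceptedForest2": 15432,
    "BidAcceptedForest3": 14975,
    "BidAcceptedLogReg": 15309,
}

in_range = {name: lower <= total <= upper for name, total in predicted.items()}
print(lower, upper)
print(in_range)
```

For the sums shown here, the 4-feature forest and the logistic regression fall inside the range, while the 3-feature forest total of 14975 falls below the lower bound.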


Reordering the columns in the dataframe and writing the results to a CSV file.

In [249]:
leadDataNew = leadDataNew[['id','OriginalBidPrice','BidPrice','AcceptedBid','BidAcceptedForest2','BidAcceptedForest3', 'BidAcceptedLogReg','ExpectedRevenue','ExpectedConversion','Bid identifier']]
leadDataNew.to_csv("result.csv", index=False)

The following uses a Random Forest Classifier model to assign bid prices and then predicts whether the bids will be accepted using the logistic regression.

In [142]:
data = leadData.copy()
dataX = data[SelectedFeatures]
data_pred = forestClassifier.predict(dataX)
data['BidPrice']=data_pred
dataBid = data[SelectedFeatures2]
data_final_pred = logreg.predict(dataBid)
data['AcceptedBid'] = data_final_pred

In [144]:
data['OriginalBidPrice']=leadData['BidPrice']
data['OriginalAcceptedBid']=leadData['AcceptedBid']
data = data[['id','OriginalBidPrice','OriginalAcceptedBid','BidPrice','AcceptedBid','ExpectedRevenue','ExpectedConversion','Bid identifier']]
data['AcceptedBid'].sum()

18881

As we can see, this approach produces too many leads. The number of leads purchased (AcceptedBid = 1) needs to be within 5% of the actual leads purchased (15224-16827), and 18881 is not between 15224 and 16827.

In [None]:
data.to_csv("result1.csv", index=False)