# The Price Is Right

To examine how different cost functions affect a specific problem, we are going to use an example from the game show The Price is Right. In the first round of The Price is Right, four contestants "Come on down!" to play against each other. They are shown an item and asked to guess how much it costs. A contestant who guesses over the price automatically loses. The winner is the contestant whose guess is closest to the actual price without going over.  
  
So what if we want to maximize our chances of winning on The Price is Right? Well, as data scientists, we might think this is the perfect chance to train a model to make predictions for us. We need a model that gets as close to the correct price as possible without going over. We are going to explore how cost functions can help us build the best model possible.  

We train and test on data from the Amazon Toy Dataset. With the price as our target variable, I have taken the liberty of building our features using NLP and some basic data cleaning on the descriptions, reviews, and other information for each product.  

In this experiment we are going to use a very popular algorithm, [LightGBM from Microsoft](https://lightgbm.readthedocs.io/en/latest/). With LightGBM, we are going to explore several different functions to try to optimize our results. We are going to use the basic MAE and MSE along with some custom objective and evaluation functions of our own to see which one works best.  

To evaluate our "winnings" on The Price is Right, we calculate them as follows. If a prediction is over the true value, our winnings are 0. If a prediction is at or below the true value, our winnings are equal to the prediction. For example, if an item is worth $10 and we predict $7, we win $7. This encourages us not only to predict under the actual value, but also to predict as close to it as possible.
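The rule above can be sketched in a few lines of NumPy. This is only an illustration: the function name `winnings` and the toy prices are assumptions made for this example, not part of the notebook's data.

```python
import numpy as np

def winnings(truth, pred):
    """Per-item winnings: the prediction if it does not exceed the price, else 0."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return np.where(pred <= truth, pred, 0.0)

# Three guesses at a $10 item: under, exactly right, and one cent over.
prices = np.array([10.0, 10.0, 10.0])
guesses = np.array([7.0, 10.0, 10.01])

per_item = winnings(prices, guesses)
total = per_item.sum()
print(per_item)
print(total)
```

Going over by a single cent forfeits the whole item, which is why a symmetric error measure is a poor fit for this game.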

In [53]:
import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestRegressor

## Calculating Winnings
First we are going to write a class to calculate the winnings our model would make from its set of predictions. It will also tell us how many of the products our model overpredicted. If our model overpredicts, our winnings are zero. If our model underpredicts, then our winnings equal our prediction.  

We will also look at winnings as a percentage of the true price for each item we predict under. This shows how we fare when more expensive items are not weighted more heavily, and gives us a second point of view on how our models are doing. It could also represent the difference between someone who would rather win a big item and someone who is just interested in moving on to the second round.
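To see why the two views can disagree, here is a small sketch with made-up numbers. The helper `total_and_standardized` and both sets of guesses are hypothetical, chosen only to contrast a strategy that is close on cheap items with one that lands a single expensive item.

```python
import numpy as np

def total_and_standardized(truth, pred):
    """Return (sum of winnings, sum of winnings as a fraction of each true price)."""
    truth = np.asarray(truth, dtype=float)
    pred = np.asarray(pred, dtype=float)
    won = pred <= truth
    total = np.where(won, pred, 0.0).sum()
    standardized = np.where(won, pred / truth, 0.0).sum()
    return total, standardized

prices = np.array([5.0, 5.0, 200.0])
close_on_cheap = np.array([4.0, 4.0, 250.0])   # wins both cheap items, goes over on the big one
close_on_big = np.array([6.0, 6.0, 150.0])     # goes over on both cheap items, wins the big one

cheap_total, cheap_std = total_and_standardized(prices, close_on_cheap)
big_total, big_std = total_and_standardized(prices, close_on_big)
print(cheap_total, cheap_std)
print(big_total, big_std)
```

The first set of guesses scores higher on standardized winnings while the second scores higher on total winnings, so the two measures can rank the same models differently.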

In [101]:
class price_is_right:
    def __init__(self, truth, preds):
        self.truths = truth
        self.preds = preds

    def calculate_winnings(self):
        winning_df = pd.DataFrame(zip(self.truths, self.preds), columns=['truth', 'pred'])
        # A guess wins only if it does not exceed the true price.
        won = winning_df['pred'] <= winning_df['truth']
        # Float defaults avoid dtype warnings when float winnings are assigned below.
        winning_df['winnings'] = 0.0
        winning_df['winning_percent'] = 0.0
        winning_df.loc[won, 'winnings'] = winning_df['pred']
        winning_df.loc[won, 'winning_percent'] = winning_df['pred'] / winning_df['truth']
        self.winnings = winning_df['winnings'].sum()
        self.winning_percent = winning_df['winning_percent'].sum()
        # Count overpredictions directly rather than inferring them from zero winnings.
        self.overpredicted = int((~won).sum())

    def print_results(self):
        self.calculate_winnings()
        print(f'OVERPREDICTED: {self.overpredicted}/{len(self.preds)}')
        print(f'STANDARDIZED WINNINGS: {self.winning_percent}')
        print(f'TOTAL WINNINGS: {self.winnings}')
        
    

In [102]:
train = pd.read_csv('../data/amazon_train.csv')
test = pd.read_csv('../data/amazon_test.csv')

## Data
Let's take a quick look at the data we are using. The price column is our target. The manufacturer, sellers, category, and subcategory columns have been encoded using [scikit-learn's LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). The name, info, description, and reviews columns were created with word embeddings from [spaCy](https://spacy.io/usage). If you are interested in how the data was prepared, feel free to check out my [other notebook](../notebooks/data_cleaning.ipynb).

In [104]:
train.head()

| Unnamed: 0 | manufacturer | price | number_of_reviews | average_review_rating | sellers | cat | sub_cat | name | info | desc | reviews |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 987 | 8.44 | 2.0 | 5.0 | 549 | 34 | 14 | 0.605618 | -1.30593 | -1.993142 | 1.8243 |
| 1 | 1574 | 209.99 | 2.0 | 4.0 | 699 | 13 | 116 | 3.52892 | -1.269589 | -1.730483 | -4.421397 |
| 2 | 1663 | 4.0 | 3.0 | 5.0 | 81 | 14 | 46 | -3.511664 | -0.837028 | -2.297708 | -3.458916 |
| 3 | 2342 | 19.99 | 1.0 | 5.0 | 160 | 7 | 140 | 0.83604 | -1.123865 | -2.583533 | -2.375369 |
| 4 | 2105 | 1.49 | 77.0 | 4.5 | 387 | 0 | 97 | 1.504639 | 5.455936 | -1.23651 | -2.221026 |


We are going to split our test data into a validation set and a test set, so that we have a validation set to help tune our model.

In [83]:
test, val = train_test_split(test, test_size=.25, random_state=0)
# Pass `columns=` explicitly; the positional axis argument to drop() was removed in pandas 2.0.
x_train = train.drop(columns='price')
y_train = train['price']
x_val = val.drop(columns='price')
y_val = val['price']
x_test = test.drop(columns='price')
y_test = test['price']

## LightGBM
Here is a quick helper function to run an experiment and return its evaluation. Every run uses the same hyperparameters and a fixed random state, so that differences between runs come from the objective and evaluation functions rather than from tuning or chance.

In [105]:
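def run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, obj, eval_met):
    # Same hyperparameters and seed for every run; only the objective and metric vary.
    lgb = LGBMRegressor(objective = obj,
                        subsample = 0.1,
                        random_state = 0,
                        num_leaves = 120,
                        n_estimators = 10000,
                        min_split_gain = 0.06999,
                        min_data_in_leaf = 12,
                        max_depth = 56,
                        learning_rate = 0.005,
                        colsample_bytree = 0.4,
                        boosting_type = 'gbdt')

    # Stop once the validation metric has not improved for 25 rounds.
    # Note: LightGBM 4.0 removed `early_stopping_rounds` and `verbose` from fit();
    # newer versions expect callbacks such as lightgbm.early_stopping instead.
    lgb.fit(x_train, y_train, eval_set=(x_val, y_val), eval_metric=eval_met, early_stopping_rounds=25, verbose=False)

    # Score the held-out test predictions with the winnings calculator.
    preds = lgb.predict(x_test)
    win = price_is_right(y_test, preds)
    return win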

## Loss Functions
Here is the meat of what we are investigating: how different loss functions affect the predictions our model makes. The two functions we will look at that LightGBM already supports are the [Mean Absolute Error](https://en.wikipedia.org/wiki/Mean_absolute_error) and the [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error). We are also going to write a few of our own functions to try to get the most out of our model.  

With LightGBM there are two different types of functions that we need to write. The first is an objective function for training. This function needs to be compatible with the gradient-based optimization ([gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)) used to fit the model. LightGBM requires a custom objective function to return both the first and second derivative of the cost function with respect to the predictions.  
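As a sanity check on such derivatives, the sketch below compares the analytic gradient and second derivative of an asymmetric squared error against finite differences. The function names, the default penalty of 3, and the toy values are assumptions for this example only.

```python
import numpy as np

def asymmetric_squared_loss(truth, pred, penalty=3.0):
    """Squared error, multiplied by `penalty` when the prediction is over the truth."""
    err = truth - pred
    return np.where(err < 0, penalty * err**2, err**2)

def asymmetric_squared_grad_hess(truth, pred, penalty=3.0):
    """First and second derivative of the loss above with respect to `pred`."""
    err = truth - pred
    grad = np.where(err < 0, -2.0 * penalty * err, -2.0 * err)
    hess = np.where(err < 0, 2.0 * penalty, 2.0)
    return grad, hess

truth = np.array([10.0, 10.0])
pred = np.array([7.0, 12.0])  # one underprediction, one overprediction
grad, hess = asymmetric_squared_grad_hess(truth, pred)

# Central finite differences as an independent check of the derivatives.
eps = 1e-3
loss_plus = asymmetric_squared_loss(truth, pred + eps)
loss_mid = asymmetric_squared_loss(truth, pred)
loss_minus = asymmetric_squared_loss(truth, pred - eps)
numeric_grad = (loss_plus - loss_minus) / (2 * eps)
numeric_hess = (loss_plus - 2 * loss_mid + loss_minus) / eps**2

print(grad, numeric_grad)
print(hess, numeric_hess)
```

A check like this is cheap insurance, because a sign error in a custom gradient fails silently: the model simply trains in the wrong direction.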

The second type of function that we can use with LightGBM is an evaluation function. This function scores the model on our validation data during training and determines when training stops. So while the trees are fit according to the objective function, the model keeps adding boosting rounds until the score of our evaluation function stops improving. 
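In LightGBM's scikit-learn interface, a custom evaluation function receives the true values and the predictions and returns a `(name, value, is_higher_better)` tuple. The sketch below shows a metric in that shape, together with a simplified illustration of patience-based stopping. The helper `best_round` and the validation scores are hypothetical, and the loop is not LightGBM's actual implementation.

```python
import numpy as np

def mean_absolute_error_metric(y_true, y_pred):
    """Custom metric in the (name, value, is_higher_better) shape."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return "custom_mae", float(np.mean(np.abs(diff))), False

def best_round(scores, patience, higher_is_better=False):
    """Index of the best score seen before `patience` rounds pass without improvement."""
    best_idx = 0
    for i in range(1, len(scores)):
        if higher_is_better:
            improved = scores[i] > scores[best_idx]
        else:
            improved = scores[i] < scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds, so stop looking
    return best_idx

name, value, higher = mean_absolute_error_metric([10.0, 20.0], [7.0, 24.0])
validation_scores = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4]
stopped_at = best_round(validation_scores, patience=2)
print(name, value, higher)
print(stopped_at)
```

With a patience of 2, the scan stops before it reaches the final, better score of 3.4, which shows how a short patience window can end training early.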

Most of these functions are going to be some variation of the MAE or MSE cost functions mentioned above. The difference is that we are going to add extra penalties if the model predicts above the true value. You will see that we are going to run these experiments several times with different penalties to see how the model is impacted.  
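To make the effect of the penalty concrete, here is a sketch of an asymmetric absolute error applied to one guess that is $2 under and one that is $2 over. The function name and toy values are assumptions for this illustration. A penalty of 1 is included as the symmetric baseline.

```python
import numpy as np

def asymmetric_absolute_loss(truth, pred, penalty):
    """Absolute error, multiplied by `penalty` when the prediction is over the truth."""
    err = np.asarray(truth, dtype=float) - np.asarray(pred, dtype=float)
    return np.where(err < 0, penalty * np.abs(err), np.abs(err))

truth = np.array([10.0, 10.0])
pred = np.array([8.0, 12.0])  # $2 under, $2 over

# Loss per item for each penalty value.
losses = {p: asymmetric_absolute_loss(truth, pred, p).tolist() for p in (1, 3, 5, 10, 12)}
for p, loss in losses.items():
    print(p, loss)
```

The loss for the underprediction stays at 2 while the loss for the overprediction grows linearly with the penalty, which is what pushes the model toward guessing low.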

The last custom function we are going to look at is very specific to the problem at hand. It is an evaluation function whose loss approximates our Price is Right scoring, so that training stops at the point that maximizes our winnings in that specific game.   
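One way to express the game's scoring as a loss is the negative of the mean winnings, so that lower is better. The sketch below uses the hypothetical name `negative_mean_winnings` and toy guesses. It is an illustration of the idea and is not the function used to produce the results in this notebook.

```python
import numpy as np

def negative_mean_winnings(y_true, y_pred):
    """Evaluation metric in (name, value, is_higher_better) form; lower means more winnings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # A guess pays out its own value only when it does not exceed the true price.
    winnings = np.where(y_pred <= y_true, y_pred, 0.0)
    return "negative_mean_winnings", float(-np.mean(winnings)), False

truth = np.array([10.0, 10.0, 10.0])
cautious = np.array([7.0, 8.0, 9.0])     # always under the price
reckless = np.array([11.0, 12.0, 9.0])   # over the price twice

_, cautious_loss, _ = negative_mean_winnings(truth, cautious)
_, reckless_loss, _ = negative_mean_winnings(truth, reckless)
print(cautious_loss, reckless_loss)
```

Because this loss is flat almost everywhere and jumps at the true price, it is not differentiable in a useful way, which makes it suitable as an evaluation metric but not as a training objective.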

So now let's see how the various models perform.

In [127]:
penalty = 3

def overprediction_penalty_objective(truth, pred):
    # Gradient and Hessian of an asymmetric squared error:
    # overpredictions (err < 0) are weighted by `penalty`.
    err = (truth - pred).astype("float")
    grad = np.where(err<0, -2*penalty*err, -2*err)
    hess = np.where(err<0, 2*penalty, 2)
    return grad, hess

# Each evaluation function returns (name, value, is_higher_better).
def overprediction_penalty_squared(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, (err**2)*penalty, err**2) 
    return "overprediction_squared_penalty", np.mean(loss), False

def overprediction_penalty_absolute(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, abs(err)*penalty, abs(err)) 
    return "overprediction_absolute_penalty", np.mean(loss), False

def overprediction_penalty_custom(truth, pred):
    # Note: as written, overpredictions (err < 0) score -pred and everything else scores 0,
    # so with is_higher_better=False a larger overprediction yields a lower value.
    err = (truth - pred).astype("float")
    loss = np.where(err<0, -pred, 0)
    return "custom_penalty", np.mean(loss), False

In [128]:
l1 = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, 'regression_l1', 'regression_l1')
l2 = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, 'regression_l2', 'regression_l1')
l1_custom_mae = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           'regression_l1', overprediction_penalty_absolute)
l1_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_absolute)
l2_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_squared)
pr_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_custom)

results = [l1, l2, l1_custom_mae, l1_custom, l2_custom, pr_custom]
res_names = ['MAE', 'MSE', 'Asymmetric MAE', 'Asymmetric MAE Custom', 'Asymmetric MSE Custom', 'Asymmetric Price is Right Eval']
for res, name in zip(results, res_names) :
    print(name)
    res.print_results()
    print('-'*50)

MAE
OVERPREDICTED: 1073/1923
STANDARDIZED WINNINGS: 518.9734986279623
TOTAL WINNINGS: 13106.62017913922
--------------------------------------------------
MSE
OVERPREDICTED: 1407/1923
STANDARDIZED WINNINGS: 332.0025585063151
TOTAL WINNINGS: 12083.902791589415
--------------------------------------------------
Asymmetric MAE
OVERPREDICTED: 1044/1923
STANDARDIZED WINNINGS: 515.0322848333686
TOTAL WINNINGS: 12472.09020435238
--------------------------------------------------
Asymmetric MAE Custom
OVERPREDICTED: 606/1923
STANDARDIZED WINNINGS: 669.3091728645843
TOTAL WINNINGS: 11387.71741059305
--------------------------------------------------
Asymmetric MSE Custom
OVERPREDICTED: 1056/1923
STANDARDIZED WINNINGS: 515.0055781733957
TOTAL WINNINGS: 13213.585045532007
--------------------------------------------------
Asymmetric Price is Right Eval
OVERPREDICTED: 1094/1923
STANDARDIZED WINNINGS: 500.58660282916566
TOTAL WINNINGS: 13191.59839390855
--------------------------------------------------

In [129]:
penalty = 5
def overprediction_penalty_objective(truth, pred):
    err = (truth - pred).astype("float")
    grad = np.where(err<0, -2*penalty*err, -2*err)
    hess = np.where(err<0, 2*penalty, 2)
    return grad, hess

def overprediction_penalty_squared(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, (err**2)*penalty, err**2) 
    return "overprediction_squared_penalty", np.mean(loss), False

def overprediction_penalty_absolute(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, abs(err)*penalty, abs(err)) 
    return "overprediction_absolute_penalty", np.mean(loss), False

def overprediction_penalty_custom(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, -pred, 0)
    return "custom_penalty", np.mean(loss), False

In [130]:
l1 = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, 'regression_l1', 'regression_l1')
l2 = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, 'regression_l2', 'regression_l1')
l1_custom_mae = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           'regression_l1', overprediction_penalty_absolute)
l1_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_absolute)
l2_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_squared)
pr_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_custom)

results = [l1, l2, l1_custom_mae, l1_custom, l2_custom, pr_custom]
res_names = ['MAE', 'MSE', 'Asymmetric MAE', 'Asymmetric MAE Custom', 'Asymmetric MSE Custom', 'Asymmetric Price is Right Eval']
for res, name in zip(results, res_names) :
    print(name)
    res.print_results()
    print('-'*50)

MAE
OVERPREDICTED: 1073/1923
STANDARDIZED WINNINGS: 518.9734986279623
TOTAL WINNINGS: 13106.62017913922
--------------------------------------------------
MSE
OVERPREDICTED: 1407/1923
STANDARDIZED WINNINGS: 332.0025585063151
TOTAL WINNINGS: 12083.902791589415
--------------------------------------------------
Asymmetric MAE
OVERPREDICTED: 1024/1923
STANDARDIZED WINNINGS: 518.9131936412001
TOTAL WINNINGS: 12208.791338964089
--------------------------------------------------
Asymmetric MAE Custom
OVERPREDICTED: 412/1923
STANDARDIZED WINNINGS: 670.3289902811514
TOTAL WINNINGS: 9620.347383119373
--------------------------------------------------
Asymmetric MSE Custom
OVERPREDICTED: 948/1923
STANDARDIZED WINNINGS: 563.8008071408441
TOTAL WINNINGS: 13111.88744685337
--------------------------------------------------
Asymmetric Price is Right Eval
OVERPREDICTED: 1013/1923
STANDARDIZED WINNINGS: 531.3069657155254
TOTAL WINNINGS: 13172.880637088296
--------------------------------------------------

In [131]:
penalty = 10
def overprediction_penalty_objective(truth, pred):
    err = (truth - pred).astype("float")
    grad = np.where(err<0, -2*penalty*err, -2*err)
    hess = np.where(err<0, 2*penalty, 2)
    return grad, hess

def overprediction_penalty_squared(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, (err**2)*penalty, err**2) 
    return "overprediction_squared_penalty", np.mean(loss), False

def overprediction_penalty_absolute(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, abs(err)*penalty, abs(err)) 
    return "overprediction_absolute_penalty", np.mean(loss), False

def overprediction_penalty_custom(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, -pred, 0)
    return "custom_penalty", np.mean(loss), False

In [132]:
l1 = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, 'regression_l1', 'regression_l1')
l2 = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, 'regression_l2', 'regression_l1')
l1_custom_mae = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           'regression_l1', overprediction_penalty_absolute)
l1_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_absolute)
l2_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_squared)
pr_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_custom)

results = [l1, l2, l1_custom_mae, l1_custom, l2_custom, pr_custom]
res_names = ['MAE', 'MSE', 'Asymmetric MAE', 'Asymmetric MAE Custom', 'Asymmetric MSE Custom', 'Asymmetric Price is Right Eval']
for res, name in zip(results, res_names) :
    print(name)
    res.print_results()
    print('-'*50)

MAE
OVERPREDICTED: 1073/1923
STANDARDIZED WINNINGS: 518.9734986279623
TOTAL WINNINGS: 13106.62017913922
--------------------------------------------------
MSE
OVERPREDICTED: 1407/1923
STANDARDIZED WINNINGS: 332.0025585063151
TOTAL WINNINGS: 12083.902791589415
--------------------------------------------------
Asymmetric MAE
OVERPREDICTED: 1020/1923
STANDARDIZED WINNINGS: 520.4878815965665
TOTAL WINNINGS: 12178.970816858713
--------------------------------------------------
Asymmetric MAE Custom
OVERPREDICTED: 229/1923
STANDARDIZED WINNINGS: 636.9795088982221
TOTAL WINNINGS: 7565.7668748990345
--------------------------------------------------
Asymmetric MSE Custom
OVERPREDICTED: 721/1923
STANDARDIZED WINNINGS: 642.9132780838727
TOTAL WINNINGS: 12157.905742198142
--------------------------------------------------
Asymmetric Price is Right Eval
OVERPREDICTED: 936/1923
STANDARDIZED WINNINGS: 557.4083958514
TOTAL WINNINGS: 13298.838314943547
--------------------------------------------------

In [133]:
penalty = 12
def overprediction_penalty_objective(truth, pred):
    err = (truth - pred).astype("float")
    grad = np.where(err<0, -2*penalty*err, -2*err)
    hess = np.where(err<0, 2*penalty, 2)
    return grad, hess

def overprediction_penalty_squared(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, (err**2)*penalty, err**2) 
    return "overprediction_squared_penalty", np.mean(loss), False

def overprediction_penalty_absolute(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, abs(err)*penalty, abs(err)) 
    return "overprediction_absolute_penalty", np.mean(loss), False

def overprediction_penalty_custom(truth, pred):
    err = (truth - pred).astype("float")
    loss = np.where(err<0, -pred, 0)
    return "custom_penalty", np.mean(loss), False

In [134]:
l1 = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, 'regression_l1', 'regression_l1')
l2 = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test, 'regression_l2', 'regression_l1')
l1_custom_mae = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           'regression_l1', overprediction_penalty_absolute)
l1_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_absolute)
l2_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_squared)
pr_custom = run_experiment(x_train, y_train, x_val, y_val, x_test, y_test,
                           overprediction_penalty_objective, overprediction_penalty_custom)

results = [l1, l2, l1_custom_mae, l1_custom, l2_custom, pr_custom]
res_names = ['MAE', 'MSE', 'Asymmetric MAE', 'Asymmetric MAE Custom', 'Asymmetric MSE Custom', 'Asymmetric Price is Right Eval']
for res, name in zip(results, res_names) :
    print(name)
    res.print_results()
    print('-'*50)

MAE
OVERPREDICTED: 1073/1923
STANDARDIZED WINNINGS: 518.9734986279623
TOTAL WINNINGS: 13106.62017913922
--------------------------------------------------
MSE
OVERPREDICTED: 1407/1923
STANDARDIZED WINNINGS: 332.0025585063151
TOTAL WINNINGS: 12083.902791589415
--------------------------------------------------
Asymmetric MAE
OVERPREDICTED: 1020/1923
STANDARDIZED WINNINGS: 520.4878815965665
TOTAL WINNINGS: 12178.970816858713
--------------------------------------------------
Asymmetric MAE Custom
OVERPREDICTED: 163/1923
STANDARDIZED WINNINGS: 606.7159254879609
TOTAL WINNINGS: 6605.691516861292
--------------------------------------------------
Asymmetric MSE Custom
OVERPREDICTED: 688/1923
STANDARDIZED WINNINGS: 657.5199413663684
TOTAL WINNINGS: 12109.453287846598
--------------------------------------------------
Asymmetric Price is Right Eval
OVERPREDICTED: 877/1923
STANDARDIZED WINNINGS: 591.4610578464806
TOTAL WINNINGS: 13040.864274666164
--------------------------------------------------