# Model evaluation

## Data science question

My original data science question was: predict the ratings of instant noodle products. As mentioned in class, the data science question itself often does not state the goal or how the model's usefulness in the real world would be measured. For this purpose, we define the utility function below.

## Utility function

The model would be used at a grocery store when one wants to choose between two or more ramen products. The product with the highest predicted rating would be purchased. Over time, the aim is for the user of the model to consume higher-rated ramen on average.

To simplify, our utility function outputs 1 when the model orders a pair of products the same way as their actual ratings, and 0 otherwise.

Our utility function in this case is binary. For two products $x_1$ and $x_2$ with actual ratings $y_1 \geq y_2$ and predicted ratings $\hat{y}_1$ and $\hat{y}_2$:

$$
U(x_1, x_2 \mid y_1 \geq y_2) =
     \begin{cases}
       0 &\quad\text{if $\hat{y}_1 < \hat{y}_2$}\\
       1 &\quad\text{if $\hat{y}_1 \geq \hat{y}_2$}
     \end{cases}
$$
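As a concrete illustration, the utility for a single pair could be written as below. This is only a sketch of one reading of the definition: the name `pair_utility` is hypothetical, predicted ties go to the first product, and a tie in the actual ratings is assumed to count as a success because either pick is equally good.

```python
def pair_utility(y_1, y_2, y_hat_1, y_hat_2):
    """Return 1 if the product the model would pick is rated at least as high as the other."""
    # The model picks the product with the higher predicted rating
    # (assumption: a predicted tie goes to the first product).
    if y_hat_1 >= y_hat_2:
        return int(y_1 >= y_2)
    return int(y_2 >= y_1)


# Made-up ratings, for illustration only
print(pair_utility(4.0, 3.0, 3.5, 3.0))  # model prefers product 1, which is rated higher
print(pair_utility(4.0, 3.0, 2.0, 3.0))  # model prefers product 2, which is rated lower
```

Because the function checks the picked product against the other one, it does not matter which product is labelled first.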

## Loading data & models


In [8]:
from sklearn.model_selection import train_test_split
from joblib import load
import pandas as pd
import numpy as np

# Load dataset
ebm_val_data = pd.read_csv('../clean_data/ebm_val_data.csv')
linear_val_data = pd.read_csv('../clean_data/linear_val_data.csv')

y_val = ebm_val_data['Stars']

X_ebm = ebm_val_data.drop(columns=['Stars'])
X_lm = linear_val_data.drop(columns=['Stars'])

# Load models
lm = load('../models/linear.joblib')
ebm = load('../models/ebm.joblib')

## Utility function scores

In [11]:
y_hat_lm = lm.predict(X_lm)
y_hat_ebm = ebm.predict(X_ebm)

In [118]:
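def get_utility_score(y, y_hat):
    """Count pairs (i, j), i < j, where truth and prediction both rate item i at least as high as item j.

    Returns (score, perfect_score), where perfect_score is the number of pairs compared.
    """
    score = 0
    perfect_score = 0
    n = len(y)
    for i in range(n):
        # Compare item i with every later item, so each pair is visited once
        for j in range(i + 1, n):
            if y_hat[i] >= y_hat[j] and y[i] >= y[j]:
                score += 1
            # Every pair adds to the denominator, including pairs where
            # truth and prediction agree that item j is rated higher
            perfect_score += 1
    return score, perfect_score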
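Note that `get_utility_score` credits a pair only when the first item is rated at least as high as the second, both in truth and in prediction. A symmetric variant, which also credits pairs where both agree that the second item is higher, might look like the sketch below. The name `pairwise_ordering_accuracy` and the choice to skip pairs with tied actual ratings are assumptions, not part of the original analysis, and the printed scores in this notebook were not produced with it.

```python
import numpy as np


def pairwise_ordering_accuracy(y, y_hat):
    """Return (correct, compared) over all pairs i < j whose actual ratings differ."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # Index arrays for every unordered pair (i, j) with i < j
    i, j = np.triu_indices(len(y), k=1)
    true_order = np.sign(y[i] - y[j])
    pred_order = np.sign(y_hat[i] - y_hat[j])
    # Assumption: skip pairs with tied actual ratings, since either pick is equally good
    decided = true_order != 0
    # A predicted tie on a decided pair counts as incorrect
    correct = np.sum(true_order[decided] == pred_order[decided])
    return int(correct), int(np.sum(decided))


# Made-up ratings, for illustration only
correct, compared = pairwise_ordering_accuracy([5, 3, 4], [4.5, 3.2, 3.0])
print("{0} of {1} pairs ordered correctly".format(correct, compared))
```

The index arrays grow quadratically with the number of rows, so this form suits datasets of a few thousand rows rather than very large ones.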

In [123]:
lm_utility_score = get_utility_score(y_val, y_hat_lm)
print("The linear model correctly picked the higher-rated ramen {0:.2f}% of the time.".format(100*(lm_utility_score[0]/lm_utility_score[1])))

The linear model correctly picked the higher-rated ramen 31.67% of the time.


In [122]:
ebm_utility_score = get_utility_score(y_val, y_hat_ebm)
print("The EBM correctly picked the higher-rated ramen {0:.2f}% of the time.".format(100*(ebm_utility_score[0]/ebm_utility_score[1])))

The EBM correctly picked the higher-rated ramen 33.85% of the time.
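Before interpreting these percentages, it can help to check what uninformative predictions would score under the same counting rule. The sketch below re-implements that rule in vectorized form and applies it to synthetic ratings, not the ramen data. The name `first_item_agreement`, the sample size, and the quarter-star rating grid are all assumptions made for illustration.

```python
import numpy as np


def first_item_agreement(y, y_hat):
    """Same counting rule as get_utility_score, vectorized over pairs i < j."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    i, j = np.triu_indices(len(y), k=1)
    # A pair is credited only if item i is at least as high in truth AND in prediction
    hits = (y_hat[i] >= y_hat[j]) & (y[i] >= y[j])
    return int(hits.sum()), len(i)


rng = np.random.default_rng(0)
# Synthetic ratings from 0 to 5 in quarter-star steps (NOT the ramen data)
y_fake = rng.integers(0, 21, size=300) / 4
# "Guessing": predictions that carry no information about the ratings
y_hat_random = rng.random(300)

score, total = first_item_agreement(y_fake, y_hat_random)
print("Random guessing scores {0:.2f}% under this counting rule.".format(100 * score / total))
```

Under this rule, even predictions that order every pair correctly can score low: for ascending ratings such as `[1, 2, 3]` with identical predictions, no pair has the first item rated at least as high, so the score is 0 of 3. The guessing baseline for this score is therefore not 50%.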


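From this we find that, although the models would give a smaller error than a human when estimating the specific star rating of a ramen product, both scores fall below the 50% one would expect from guessing between two ramen products, so guessing appears to be the better way to choose which ramen to eat. One caveat: `get_utility_score` only credits pairs in which the first product is rated at least as high as the second, both in truth and in prediction, so 50% may not be the right guessing baseline for this score. Let's see if we can do better by using a fixed prediction of rating based on the average of the star ratings by brand.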


## Simple approach

We start by importing our raw data and obtaining the average star rating by brand.

In [70]:
import pandas as pd

rr_df = pd.read_csv('../raw_data/ramen_ratings.csv')

# Remove unrated entries
rated_df = rr_df[rr_df.Stars != "Unrated"]

# Change column type to float
rated_df_with_types = rated_df.copy()
rated_df_with_types.Stars = rated_df_with_types.Stars.astype('float')

# Get star rating by brand
stars_by_brand = rated_df_with_types[['Stars', 'Brand']].groupby(['Brand']).mean()

In [92]:
def get_rating(row):
    return stars_by_brand.loc[row['Brand']].Stars

predictions = rated_df_with_types.apply(get_rating, axis=1)

In [124]:
avg_brand_score = get_utility_score(rated_df_with_types.Stars.values, predictions.values)
print("Prediction using the average brand rating correctly picked the higher-rated ramen {0:.2f}% of the time.".format(100*(avg_brand_score[0]/avg_brand_score[1])))

Prediction using the average brand rating correctly picked the higher-rated ramen 47.45% of the time.
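The brand averages here are computed from the same rows they are scored on. One way to check how the baseline behaves on unseen products is to compute the averages on one part of the data and predict on the rest. The sketch below shows the idea on a toy table with made-up values. The 25% hold-out share and the fallback to the overall mean for unseen brands are assumptions.

```python
import pandas as pd

# Toy stand-in for the ramen table (values are made up for illustration)
toy = pd.DataFrame({
    "Brand": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "Stars": [4.0, 3.5, 4.5, 2.0, 2.5, 3.0, 5.0, 4.0],
})

# Hold out a share of the rows; brand means come from the remaining rows only
held_out = toy.sample(frac=0.25, random_state=0)
train = toy.drop(held_out.index)

brand_means = train.groupby("Brand")["Stars"].mean()
overall_mean = train["Stars"].mean()

# Brands that do not appear in the training rows fall back to the overall training mean
held_out_pred = held_out["Brand"].map(brand_means).fillna(overall_mean)
print(held_out_pred)
```

The held-out predictions can then be passed to the same scoring function together with `held_out["Stars"].values`.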


## Conclusion

With the current dataset, we weren't able to create a model that could choose the higher-rated ramen between two ramen products with better accuracy than a random guess. In this case, using the average of each brand's rating resulted in more accurate predictions than the models trained on our features, though the brand averages were computed and scored on the full raw dataset rather than on the validation set used for the models.