## Evaluation Metrics

In this part we introduce the evaluation metrics used for our models. We report several metrics because each one gives a different insight into model performance. Here is what we concluded about each of them:

* **R2 Score**: This metric is mostly used when we don't know the scale of the target variable. It shows how well our model performs compared to a simple model that always predicts the mean of the target variable. The problem is that all of our models got a very high R2 score. The reason is that the mean squared error of the mean model is very large, so the ratio in the formula below stays small, and R2 stays close to 1, even when our model is not performing well.

$$ R^2 = 1 - \frac{MSE_{model}}{MSE_{mean}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

We check the RMSE of the mean model to illustrate this point:

```python
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

df = pd.read_csv('../data/not_scaled_data.csv')

# Baseline "mean model": predict the average price for every house
y_true = df['price']
y_pred = [df['price'].mean()] * len(df['price'])

print('Root Mean Squared Error when we guess the mean:', np.sqrt(mean_squared_error(y_true, y_pred)))
```

```
Root Mean Squared Error when we guess the mean: 1418975.7503518122
```


```python
# Same baseline on the log-transformed target (df is loaded in the cell above)
y_true = df['log_price']
y_pred = [df['log_price'].mean()] * len(df['log_price'])

print('Root Mean Squared Error when we guess the mean:', np.sqrt(mean_squared_error(y_true, y_pred)))
```

```
Root Mean Squared Error when we guess the mean: 0.273661211274132
```


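As you can see, the RMSE of the mean model is about 1.42 million for `price` and 0.27 for `log_price`. Since `log_price` is on a base-10 logarithmic scale, an error of 0.27 means being off by a factor of about $10^{0.27} \approx 1.9$. Both baseline errors are very large, and this is why the R2 score is very high for all of our models.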
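The effect can be reproduced on toy data. The sketch below uses hypothetical numbers, not our dataset: it adds the same model error (standard deviation 100,000) to two targets with very different spreads. `r2_from_mse` is a helper introduced here only for illustration; it evaluates the R2 formula above and should agree with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def r2_from_mse(y_true, y_pred):
    """Hypothetical helper: R2 = 1 - MSE_model / MSE_mean, as in the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse_model = mean_squared_error(y_true, y_pred)
    # Baseline "mean model": predict the average target value everywhere
    mse_mean = mean_squared_error(y_true, np.full_like(y_true, y_true.mean()))
    return 1 - mse_model / mse_mean

rng = np.random.default_rng(0)
n = 1_000
# Hypothetical model errors with the same size in both scenarios
errors = rng.normal(0, 100_000, size=n)

# Hypothetical targets (toy values, not real prices): narrow vs. wide spread
y_narrow = rng.normal(500_000, 150_000, size=n)
y_wide = rng.normal(1_500_000, 1_400_000, size=n)

r2_narrow = r2_from_mse(y_narrow, y_narrow + errors)
r2_wide = r2_from_mse(y_wide, y_wide + errors)

print(f"R2 with narrow target spread: {r2_narrow:.3f}")
print(f"R2 with wide target spread:   {r2_wide:.3f}")
```

The absolute errors are identical in both scenarios, yet the wide-spread target scores much closer to 1, because its mean model has a far larger MSE.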

* **Root Mean Squared Error (RMSE), Mean Absolute Error and Median Absolute Error**: We report these metrics to see how large our model's errors are, in the units of the target variable. Because RMSE squares each error before averaging, a large gap between RMSE and Mean Absolute Error tells us that a few prices were predicted very badly compared to the rest. The Median Absolute Error is barely affected by those few bad predictions, so it shows how the model does on a typical house even when it fails badly on a few of them.

* **Mean Absolute Percentage Error and Median Absolute Percentage Error**: These metrics measure the error as a percentage of the actual price. An error of 10,000 on a house that costs 100,000 (10%) is much worse than an error of 10,000 on a house that costs 1,000,000 (1%), but the previous metrics treat the two the same. Again, we report both the mean and the median to see whether our model makes big errors on a few prices.

$$ MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i} $$

* **Within x%**: This metric is the percentage of our predictions that fall within x% of the actual price. It is a good summary of how well our model performs in general. Zillow itself uses this metric alongside the Median Absolute Percentage Error to evaluate the performance of its models ([see this for more information](https://www.zillow.com/tech/home-value-estimates/)).
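A small worked example of these metrics on hypothetical prices (toy numbers chosen for illustration, not results from our models): the first two houses are both off by 10,000, and one house is off by 400,000.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error

# Hypothetical actual and predicted prices (illustration only)
y_true = np.array([100_000, 1_000_000, 300_000, 500_000, 800_000], dtype=float)
y_pred = np.array([110_000, 1_010_000, 300_000, 500_000, 1_200_000], dtype=float)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
median_ae = median_absolute_error(y_true, y_pred)

# Absolute percentage error per house, as in the MAPE formula (fractions, not %)
ape = np.abs(y_true - y_pred) / y_true
mape = ape.mean()
median_ape = np.median(ape)
# Share of predictions within 20% of the actual price
within_20 = np.mean(ape < 0.20) * 100

print(f"RMSE: {rmse:,.0f}  MAE: {mae:,.0f}  Median AE: {median_ae:,.0f}")
print(f"APE per house: {np.round(ape, 3)}")
print(f"MAPE: {mape:.3f}  Median APE: {median_ape:.3f}  Within 20%: {within_20:.0f}%")
```

The first two houses add the same 10,000 to the absolute errors but 10% versus 1% to the percentage errors, and the single 400,000 miss pushes RMSE well above MAE while the median-based metrics stay small.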

Now that we have defined the metrics, we implement some functions to show them for our models.

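The `show_metrics` function takes the predictions and the actual values, plus an optional fitted `target_scaler` (when the target was scaled) and a `logarithm` flag (when the target is `log_price`), and displays the metrics defined above in a neat DataFrame. It reports the metrics on the target as predicted, then after undoing the scaling, then after converting `log_price` back to price, so we can use it for all of our models. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction (0.73 means 73%), while `calc_median_absolute_percentage_error` returns a percentage.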

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from IPython.display import display  # built in inside Jupyter; needed when running as a script

def calc_median_absolute_percentage_error(y_true, y_pred):
    # Returned as a percentage (multiplied by 100)
    return np.median(np.abs((y_true - y_pred) / y_true)) * 100

def calculate_metrics(y_pred, y_test):
    r2 = metrics.r2_score(y_test, y_pred)
    rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
    mean_ae = metrics.mean_absolute_error(y_test, y_pred)
    # scikit-learn returns MAPE as a fraction, not a percentage
    mean_ape = metrics.mean_absolute_percentage_error(y_test, y_pred)
    median_ae = metrics.median_absolute_error(y_test, y_pred)
    median_ape = calc_median_absolute_percentage_error(y_test, y_pred)
    return [r2, rmse, mean_ae, mean_ape, median_ae, median_ape]

def within_x_percent(y_pred, y_test, x):
    # Percentage of predictions whose relative error is below x (e.g. x=0.05 for 5%)
    return np.mean(np.abs((y_pred - y_test) / y_test) < x) * 100

def show_metrics(y_pred, y_test, target_scaler=None, logarithm=False):
    # Accept lists, Series or arrays; .reshape below needs NumPy arrays
    y_pred = np.asarray(y_pred, dtype=float)
    y_test = np.asarray(y_test, dtype=float)

    metrics_df = pd.DataFrame(columns=['Target', 'R2', 'Root Mean Squared Error', 'Mean Absolute Error',
                                       'Mean Absolute Percentage Error', 'Median Absolute Error',
                                       'Median Absolute Percentage Error'])
    # Row 0: metrics on the target exactly as the model predicts it
    metrics_df.loc[0] = ['Target as it is'] + calculate_metrics(y_pred, y_test)

    if target_scaler is not None:
        # Row 1: undo the scaling to get back to the original units
        y_pred = target_scaler.inverse_transform(y_pred.reshape(-1, 1)).flatten()
        y_test = target_scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()
        metrics_df.loc[1] = ['Scaled Target is inversed to real value'] + calculate_metrics(y_pred, y_test)

    if logarithm:
        # Row 2: log_price is a base-10 log, so 10 ** log_price recovers the price
        y_pred = np.power(10, y_pred)
        y_test = np.power(10, y_test)
        metrics_df.loc[2] = ['Target -> 10 ^ Target'] + calculate_metrics(y_pred, y_test)

    # Computed on the values after any inverse transforms above
    dist_df = pd.DataFrame({"within 5%": [within_x_percent(y_pred, y_test, 0.05)],
                            "within 10%": [within_x_percent(y_pred, y_test, 0.10)],
                            "within 20%": [within_x_percent(y_pred, y_test, 0.20)],
                            "within 50%": [within_x_percent(y_pred, y_test, 0.50)],
                            "median absolute percentage error": [calc_median_absolute_percentage_error(y_test, y_pred)]},
                           index=['Percentage'])

    display(metrics_df)
    display(dist_df)
```
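As a quick check of the units in these tables, the sketch below uses hypothetical prices to show that scikit-learn's `mean_absolute_percentage_error` returns a fraction while the median helper returns a percentage. The helper is copied here so the snippet runs on its own; the values are illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def calc_median_absolute_percentage_error(y_true, y_pred):
    # Same helper as above: the result is in percent
    return np.median(np.abs((y_true - y_pred) / y_true)) * 100

# Hypothetical prices and predictions (illustration only)
y_true = np.array([100_000, 200_000, 400_000, 1_000_000], dtype=float)
y_pred = np.array([108_000, 170_000, 400_000, 1_500_000], dtype=float)

mape_fraction = mean_absolute_percentage_error(y_true, y_pred)              # fraction
median_ape_percent = calc_median_absolute_percentage_error(y_true, y_pred)  # percent

# Put both on the same scale before comparing them
print(f"MAPE: {mape_fraction * 100:.2f}%  Median APE: {median_ape_percent:.2f}%")
```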

For example, here are the metrics for the mean model on `price` and on `log_price`:

```python
y_true = df['price']
y_pred = [df['price'].mean()] * len(df['price'])

show_metrics(y_pred, y_true)
```

|   | Target | R2 | Root Mean Squared Error | Mean Absolute Error | Mean Absolute Percentage Error | Median Absolute Error | Median Absolute Percentage Error |
|---|---|---|---|---|---|---|---|
| 0 | Target as it is | 0.0 | 1418976.0 | 777313.945594 | 0.731953 | 591309.636114 | 53.955122 |

|   | within 5% | within 10% | within 20% | within 50% | median absolute percentage error |
|---|---|---|---|---|---|
| Percentage | 5.00528 | 10.200634 | 20.95037 | 46.504752 | 53.955122 |


```python
y_true = df['log_price']
y_pred = [df['log_price'].mean()] * len(df['log_price'])

show_metrics(y_pred, y_true, logarithm=True)
```

|   | Target | R2 | Root Mean Squared Error | Mean Absolute Error | Mean Absolute Percentage Error | Median Absolute Error | Median Absolute Percentage Error |
|---|---|---|---|---|---|---|---|
| 0 | Target as it is | 0.0 | 0.2736612 | 0.211149 | 0.034711 | 0.171059 | 2.836147 |
| 2 | Target -> 10 ^ Target | -0.043476 | 1449493.0 | 698940.744615 | 0.524995 | 392820.83669 | 39.847522 |

|   | within 5% | within 10% | within 20% | within 50% | median absolute percentage error |
|---|---|---|---|---|---|
| Percentage | 5.871172 | 11.784583 | 24.920803 | 60.443506 | 39.847522 |


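Interestingly, when we apply the mean model to `log_price` and convert the prediction back to price, we get a lower MAPE (0.52 vs 0.73), a lower median absolute percentage error (39.8% vs 54.0%) and more predictions within 50% (60.4% vs 46.5%) than when we apply it to the actual price, even though its RMSE is slightly higher and its R2 is slightly negative. This suggests that the feature engineering we did on the target variable can be helpful.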
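One plausible explanation, offered as our reading rather than something tested on the full dataset: converting the mean of a base-10 `log_price` back with `10 ^` gives the geometric mean of the prices, which is never larger than the arithmetic mean and is pulled up less by a few very expensive houses. The toy example below uses hypothetical, right-skewed prices to show the effect on MAPE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical right-skewed prices: one expensive house among cheaper ones
prices = np.array([100_000, 200_000, 300_000, 2_000_000], dtype=float)

arithmetic_mean = prices.mean()
# Mean of log10(price), converted back: this equals the geometric mean
geometric_mean = 10 ** np.mean(np.log10(prices))

# MAPE of a constant prediction at each of the two means
mape_arith = mean_absolute_percentage_error(prices, np.full_like(prices, arithmetic_mean))
mape_geo = mean_absolute_percentage_error(prices, np.full_like(prices, geometric_mean))

print(f"arithmetic mean: {arithmetic_mean:,.0f}  MAPE: {mape_arith:.3f}")
print(f"geometric mean:  {geometric_mean:,.0f}  MAPE: {mape_geo:.3f}")
```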