## Evaluation
One of the tasks of the assignment is to evaluate the model. While Mean Absolute Error (MAE) is a robust metric for regression tasks, other metrics can reveal more about the model's performance. In this notebook I look at several performance metrics and decide which are the most informative. The metrics are measured on the whole dataset: the gap between train and test loss during training was small, so including the training data is unlikely to distort the results.

In [14]:
# load the data
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader
import torch

torch.manual_seed(1337)

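data = pd.read_csv('classical.csv', index_col=0)

def norm_year(year):
    # shift years so they start near zero instead of around 1900
    return year - 1900

# column '23' holds the year; column '43' is the rating we predict
data['23'] = data['23'].apply(norm_year)
X = data.drop(columns=['43'])
y = data['43']

Xt = torch.tensor(X.values, dtype=torch.float)
# pass the underlying NumPy array rather than the pandas Series
yt = torch.tensor(y.values, dtype=torch.float)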

dataset = TensorDataset(Xt, yt)

batch_size = 512
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

Let's also look at the summary statistics of the target, so we have a reference point for the scores we get.

In [15]:
y.describe()

count    99991.000000
mean         3.529868
std          1.125679
min          1.000000
25%          3.000000
50%          4.000000
75%          4.000000
max          5.000000
Name: 43, dtype: float64

In [16]:
# set up the model
from torch import nn

class RatingModel(nn.Module):

    def __init__(self, input_dim):
        super(RatingModel, self).__init__()

        self.model = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.model(x)


In [35]:
# set up the prediction function
from tqdm.notebook import tqdm

def predict(model, dataloader, device="cpu"):
    """Run the model over the dataloader and return (predictions, targets) on the CPU."""
    model.eval()  # evaluation mode
    results, targets = [], []
    with torch.no_grad():  # gradients are not needed for inference
        for inputs, labels in tqdm(dataloader, total=len(dataloader)):
            inputs = inputs.to(device)
            outputs = model(inputs)
            # drop only the trailing dimension so a final batch of size 1 stays 1-D
            results.append(outputs.squeeze(-1).cpu())
            targets.append(labels.cpu())
    return torch.cat(results), torch.cat(targets)

In [50]:
# load the model and get the predictions
# note the call: torch.cuda.is_available without parentheses is always truthy
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = RatingModel(input_dim=43)
# you may need to change the filename to best.pt or another checkpoint
ckpt = torch.load("fc-43-084.pt", map_location=device)
model.load_state_dict(ckpt)
model = model.to(device)

preds, labels = predict(model, dataloader, device)


First, let's re-evaluate the model with the Mean Absolute Error (MAE) metric so we have it on hand to compare with the other metrics. PyTorch implements this loss as L1Loss, so it is easy to compute.

In [51]:
MAEScore = nn.functional.l1_loss(preds, labels).item()
print(MAEScore)

0.8477112650871277


The result we get is almost identical to the result we got during training, so no issues here.

The next metric is Mean Squared Error (MSE). I tried it as the loss function during training, but the model did not converge with it. It can still be used as an evaluation metric.

In [53]:
MSEScore = nn.functional.mse_loss(preds, labels).item()
print(MSEScore)

1.1995724439620972


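The MSE is higher than the MAE, but the two are not in the same units, so it is fairer to compare the MAE with the square root of the MSE: RMSE = sqrt(1.1996) ≈ 1.10 against an MAE of about 0.85. Squaring gives large errors more weight, so this gap means the errors are not evenly sized and a share of predictions miss by well over one rating point. That might suggest the presence of outliers in the dataset, which MSE is sensitive to.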
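To make the relationship between the two metrics concrete, here is a minimal sketch in plain Python. The two error lists are hypothetical and are not taken from the model's predictions; they only show that the same MAE can hide very different RMSE values depending on how the error is spread.

```python
import math

def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Two hypothetical error profiles with the same MAE of 1.0.
even_errors = [1.0, -1.0, 1.0, -1.0]   # every prediction misses by 1
uneven_errors = [0.0, 0.0, 0.0, 4.0]   # one prediction misses by 4

print(mae(even_errors), rmse(even_errors))      # 1.0 1.0
print(mae(uneven_errors), rmse(uneven_errors))  # 1.0 2.0
```

When all errors have the same size, RMSE equals MAE; the further RMSE rises above MAE, the more the total error is concentrated in a few large misses.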

Now let's calculate the accuracy of the model. Predicting a rating is a regression task rather than a classification one, but since the movie ratings in the dataset are discrete, we can round the predictions and treat the ratings as 'classes' to calculate accuracy.

In [61]:
import numpy as np
np_preds = preds.numpy()
np_labels = labels.numpy()

Accuracy = np.sum(np_preds.round() == np_labels) / len(np_preds)
print(Accuracy)

0.3527817503575322


An accuracy of 35% is not high, so to better assess the performance of the model, let's compare it with the performance of random guessing:

In [72]:
rnd_preds = torch.rand(preds.shape) * 4 + 1
print(f'MAE: {nn.functional.l1_loss(rnd_preds, labels).item()}')
print(f'MSE: {nn.functional.mse_loss(rnd_preds, labels).item()}')
np_rnd = rnd_preds.numpy()
print(f'Accuracy: {np.sum(np_rnd.round() == np_labels) / len(np_rnd)}')

MAE: 1.388507604598999
MSE: 2.8849971294403076
Accuracy: 0.21556940124611215
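One detail of this baseline is worth keeping in mind when reading its accuracy: rounding a uniform draw from the range 1 to 5 does not give each rating the same chance. The sketch below computes the exact probabilities from interval lengths; the function name `rounded_class_probabilities` is made up for this illustration.

```python
from fractions import Fraction

def rounded_class_probabilities(low=1, high=5):
    """Probability of each integer class when a uniform draw on [low, high] is rounded."""
    width = Fraction(high - low)
    probs = {}
    for k in range(low, high + 1):
        # values in [k - 0.5, k + 0.5) round to k; clip that interval to [low, high]
        left = max(Fraction(low), Fraction(k) - Fraction(1, 2))
        right = min(Fraction(high), Fraction(k) + Fraction(1, 2))
        probs[k] = (right - left) / width
    return probs

probs = rounded_class_probabilities()
for rating, p in probs.items():
    print(rating, float(p))
```

Ratings 1 and 5 each come up with probability 0.125 and ratings 2 to 4 with probability 0.25, so the random guesser favors the middle of the scale.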


The model performs noticeably better than random guessing on all three metrics, so we can consider it a success.
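A stricter reference point than uniform random guessing is a constant prediction. Always predicting the mean gives an MSE equal to the variance of the target, so with the standard deviation of about 1.126 reported by y.describe(), a mean predictor would score an MSE of roughly 1.27 on this dataset. The sketch below illustrates the idea on a small hypothetical list of ratings (not taken from classical.csv); the helper functions are written for this example only.

```python
import statistics

# Hypothetical ratings, not taken from classical.csv.
ratings = [1, 2, 3, 3, 4, 4, 4, 4, 5, 5]

def mae(preds, targets):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def mse(preds, targets):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

mean_rating = statistics.fmean(ratings)
median_rating = statistics.median(ratings)

# Constant baselines: the same value is predicted for every sample.
mean_preds = [mean_rating] * len(ratings)
median_preds = [median_rating] * len(ratings)

# The MSE of the mean predictor matches the population variance.
print(mse(mean_preds, ratings), statistics.pvariance(ratings))
# The median is the constant that minimizes MAE.
print(mae(median_preds, ratings), mae(mean_preds, ratings))
```

Running the same comparison on the real target column would show how much of the model's score comes from the features rather than from simply predicting a typical rating.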

Of all the metrics tested, accuracy was the least useful, since we treat movie recommendation as a regression task rather than a classification one.
MAE and MSE are both valuable. MAE is the average size of the error, so it tells us how far off the model can be expected to be on a typical prediction.
MSE is closely related, but because the error is squared it penalizes large misses more heavily, so comparing it with MAE shows how often the model makes really large errors.

I will use MAE and MSE as final evaluation metrics.