# Evaluating recommender systems

We will look at the restaurant recommendations once more. To evaluate our recommender system, we will split our data into train and test sets again. This allows us to compare predictions with true values and evaluate how well our recommender performs.

In [None]:
import pandas as pd
import numpy as np

# 1.&nbsp;Import data

In [None]:
# Import the csv with the ratings.
url = 'https://drive.google.com/file/d/1ptu4AlEXO4qQ8GytxKHoeuS1y4l_zWkC/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
frame = pd.read_csv(path)

users_items = pd.pivot_table(data=frame, 
                                 values='rating', 
                                 index='userID', 
                                 columns='placeID')

users_items.fillna(0, inplace=True)

In [None]:
users_items

# 2.&nbsp;Train-test split

## 2.1&nbsp;Find all nonzero ratings

This will help us make the train and test split.

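The `0.0` values that `fillna(0)` put into `users_items` stand for missing ratings, so there is no true value to compare a prediction against and they cannot go to the test set. Therefore, we need to identify the nonzero ratings and make the split on them.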

In [None]:
# Create a DataFrame that contains the positions of all nonzero ratings.
ratings_pos = pd.DataFrame(
    np.nonzero(np.array(users_items)),
    ).T

ratings_pos.head()

In [None]:
# Rename the columns.
ratings_pos.columns = ["row_pos", "column_pos"]
ratings_pos.head()

How shall we interpret the `ratings_pos` DataFrame?

The ratings at the positions [0, 31], [0, 32], [0, 75], [0, 81] etc. from the `users_items` DataFrame are not equal to zero.

The value in the column `row_pos` corresponds to the row index in the `users_items` DataFrame, whereas the value in the column `column_pos` corresponds to the column index.

Example [0,31]: This corresponds to a nonzero rating from the first user (userID = U1001). This is because this user's data is stored in the first row of the `users_items` DataFrame, with index `0`. The rating was for the restaurant in the column at position 31 (132825).

In [None]:
# Get the nonzero ratings from the positions above.
users_items.iloc[0:1, [31, 32, 75, 81, 85]]
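
To make the position logic easier to follow, here is a minimal sketch on a toy matrix. The `toy` DataFrame, its user labels and its place IDs are made up for illustration and are not part of the restaurant data.

```python
import numpy as np
import pandas as pd

# Toy users-items matrix (hypothetical data): 3 users x 4 places, 0 = no rating.
toy = pd.DataFrame(
    [[0, 2, 0, 1],
     [1, 0, 0, 0],
     [0, 0, 2, 0]],
    index=["U1", "U2", "U3"],
    columns=[101, 102, 103, 104],
)

# np.nonzero returns one array with row positions and one with column positions.
rows, cols = np.nonzero(toy.to_numpy())
toy_pos = pd.DataFrame({"row_pos": rows, "column_pos": cols})

# Translate the positions back into user and place labels.
toy_pos["userID"] = toy.index.to_numpy()[rows]
toy_pos["placeID"] = toy.columns.to_numpy()[cols]
print(toy_pos)
```

Each row of `toy_pos` describes exactly one nonzero rating, which is what we need to split the ratings into train and test sets.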

Let's find out how many nonzero ratings the `ratings_pos` DataFrame holds. These are the candidates to take part in the train and test split.

In [None]:
len(ratings_pos)

## 2.2&nbsp;Make the train and test split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split nonzero ratings into train and test sets.
train_pos, test_pos = train_test_split(ratings_pos, 
                                       random_state=123, 
                                       test_size=.1)

These values are in the train set...

In [None]:
train_pos.sort_values(["row_pos","column_pos"]).head(3)

...and these in the test set.

In [None]:
test_pos.sort_values(["row_pos","column_pos"]).head(3)

Now we have two DataFrames called `train_pos` and `test_pos` which contain the rating positions in the `users_items` DataFrame.

## 2.3&nbsp;Create the train DataFrame

The train and test datasets will both have the same shape as the `users_items` DataFrame. Most of their values will be zero, except for the values in the positions stored inside the `train_pos` set for the train dataset, and the `test_pos` set for the test dataset.

In [None]:
# Create a copy of the users-items DataFrame and set all values to zero.
# Assigning through .loc also works when pandas runs in Copy-on-Write mode,
# where writing into `.values` is not allowed.
train = users_items.copy()
train.loc[:, :] = 0.0

In [None]:
# Sum the values in the DataFrame to check that all of them are equal to zero.
train.sum().sum()

In [None]:
# Iterate over the nonzero positions in the train dataset.
# Get the corresponding rating for each nonzero position from the users_items DataFrame.
# Insert that value into the newly created DataFrame at the same position.
for pos in train_pos.values: 
    index = pos[0]
    col = pos[1]
    train.iloc[index, col] = users_items.iloc[index, col]

In [None]:
train.head()
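
The loop above copies one rating at a time. As a possible alternative, the same result can be obtained in one step with NumPy integer-array indexing. The sketch below uses a made-up `toy` matrix and a made-up `toy_train_pos` DataFrame, so the names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy users-items matrix (hypothetical data), 0.0 = no rating.
toy = pd.DataFrame(
    [[0.0, 2.0, 0.0, 1.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0]],
    index=["U1", "U2", "U3"],
    columns=[101, 102, 103, 104],
)

# Hypothetical train positions: three of the four nonzero ratings.
toy_train_pos = pd.DataFrame({"row_pos": [0, 1, 2], "column_pos": [1, 0, 2]})
rows = toy_train_pos["row_pos"].to_numpy()
cols = toy_train_pos["column_pos"].to_numpy()

# Start from an all-zero array and copy over only the train ratings.
values = toy.to_numpy()
train_values = np.zeros_like(values)
train_values[rows, cols] = values[rows, cols]

toy_train = pd.DataFrame(train_values, index=toy.index, columns=toy.columns)
print(toy_train)
```

The rating at position [0, 3] is not listed in `toy_train_pos`, so it stays at zero in `toy_train`, just as a test rating stays at zero in `train`.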

How many ratings from our user `U1001` fell into the train set?

In [None]:
train.iloc[0:1, [31, 32, 75, 81, 85]]

## 2.4&nbsp;Create the test DataFrame

Now it is time for the test set. We will follow the same process.

In [None]:
# Create a copy of the users-items DataFrame and set all values to zero.
test = users_items.copy()
test.loc[:, :] = 0.0

# Iterate over the nonzero positions in the test dataset.
# Get the corresponding rating for each nonzero position from the users_items DataFrame.
# Insert that value in the newly created DataFrame at the same position.
for pos in test_pos.values: 
    index = pos[0]
    col = pos[1]
    test.iloc[index, col] = users_items.iloc[index, col]

How many ratings from our user `U1001` fell into the test set?

In [None]:
test.iloc[0:1, [31, 32, 75, 81, 85]]

We can build a compact DataFrame to store the positions of all the places in the test set and their true rating.

In [None]:
true_test_ratings = []

# Iterate over rows and get the values in the two columns (= positions in users_items).
# Use positions to get ratings and store them in the true_test_ratings list.
# Access the columns by name: positional access like row[0] is deprecated for labelled Series.
for index, row in test_pos.iterrows():
  true_test_ratings.append(users_items.iloc[row["row_pos"], row["column_pos"]])

In [None]:
# Add ratings as new column.
test_pos = test_pos.assign(true_rating = true_test_ratings)

In [None]:
test_pos.head()

## 2.5&nbsp;Create the similarity matrix for the train set

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Get cosine similarities for the train dataset.
train_similarity = pd.DataFrame(cosine_similarity(train), 
                                columns=train.index, 
                                index=train.index)
train_similarity.head(3)
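
As a reminder of what the similarity matrix contains, here is a minimal sketch that computes cosine similarities by hand with NumPy. The `ratings` array is made-up data, and the sketch assumes that every user has at least one rating (an all-zero row would cause a division by zero here).

```python
import numpy as np

# Hypothetical rating vectors of three users over four places.
ratings = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],  # identical to the first user
    [0.0, 1.0, 0.0, 2.0],  # no rated place in common with the first user
])

# Cosine similarity = dot product divided by the product of the vector lengths.
norms = np.linalg.norm(ratings, axis=1)
similarity = (ratings @ ratings.T) / np.outer(norms, norms)
print(similarity.round(2))
```

Users with identical ratings get a similarity of 1, users without any rated place in common get 0, and every user has a similarity of 1 with themselves.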

## 2.6&nbsp;Predict rating for an individual value in the test set

We will look at the rating that user `U1001` gave to restaurant `placeID=135039` - the data value in position [0, 85] that went into the test dataset.

Using only the ratings from the train set and the similarity matrix computed from it, we will predict this value. 

In [None]:
# Get the ratings for restaurant 135039 and the similarities of user U1001.
# Combine them in a DataFrame.
results = (
    pd.DataFrame({
        'ratings': train.loc[:,135039], 
        'similarities' : train_similarity.loc["U1001",:]
    })
)
results.head()

As always, we compute the weights using the similarities.

In [None]:
# Calculate the weights from the similarities and add them in a new column "weights".
results = results.assign(weights = results["similarities"] / (sum(results["similarities"])-1))

In [None]:
results.head(3)

Then we weigh the rating that each user gave to that restaurant using each user's weight.

In [None]:
results = results.assign(weighted_ratings = results["ratings"] * results["weights"])
results.head(3)

Finally, we get the predicted rating for user U1001 for the restaurant `135039` by adding up all the weighted ratings.

In [None]:
pred_rating = results["weighted_ratings"].sum()
pred_rating
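
Written as a formula, the steps above amount to a weighted sum. The symbols are our own shorthand and do not appear in the data: $s_{u,v}$ is the cosine similarity between users $u$ and $v$ in the train set, and $r_{v,i}$ is the rating that user $v$ gave to restaurant $i$ in the train set (0 if there is none).

$$
w_{u,v} = \frac{s_{u,v}}{\sum_{v'} s_{u,v'} - 1}, \qquad \hat{r}_{u,i} = \sum_{v} w_{u,v} \, r_{v,i}
$$

The $-1$ in the denominator removes the similarity of user $u$ with themselves ($s_{u,u} = 1$). The user's own term adds nothing to the sum either, because $r_{u,i} = 0$ in the train set for every rating that went into the test set.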

Let's have a look at the real rating that user U1001 gave to restaurant `135039`.

In [None]:
true_rating = users_items.loc["U1001", 135039]
true_rating

The difference between the prediction and the true value is the error.

In [None]:
error = true_rating - pred_rating
error

# 3.&nbsp;Compute all recommendations for the test set

Now we need to predict the rating for all the restaurants in the test set, and compute the performance metrics.

## 3.1&nbsp;Create a function to get predictions for individual values

We will build a function that predicts the rating for a single user and a single restaurant, taking a row position and a column position as input. To do so, we will reuse the code from above, where we predicted the rating of user U1001 for restaurant `135039`.

In [None]:
def recommender(index_pos, column_pos): 
    # Build a DataFrame with the ratings for one restaurant (column_pos) and
    # the similarities to one user (index_pos).
    results = (
      pd.DataFrame({
          'ratings': train.iloc[:,column_pos], 
          'similarities' : train_similarity.iloc[index_pos,:]
          })
      )
    
    # Compute the weights.
    results = results.assign(weights = results["similarities"] / (sum(results["similarities"]) -1))
    
    # Compute the weighted ratings.
    results = results.assign(weighted_ratings = results["ratings"] * results["weights"])
    
    # Compute the rating prediction for one user and one restaurant.
    prediction = results["weighted_ratings"].sum()

    return prediction

In [None]:
# Run function for user U1001 and restaurant 135039.
recommender(0, 85)
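
Calling the function once per test rating works, but the same weighting can also be expressed as a single matrix product. The sketch below is one possible vectorized variant on made-up data. `train_ratings`, `sim` and `predict_one` are hypothetical names, and the sketch assumes that every user has at least one rating in the train set.

```python
import numpy as np

# Hypothetical train ratings: 3 users x 3 places, 0 = no rating.
train_ratings = np.array([
    [2.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
])

# Cosine similarities between users.
norms = np.linalg.norm(train_ratings, axis=1)
sim = (train_ratings @ train_ratings.T) / np.outer(norms, norms)

# Same weighting as in recommender(): divide by the similarity sum minus 1.
weights = sim / (sim.sum(axis=1, keepdims=True) - 1)

# Row u, column i holds the predicted rating of user u for place i.
predictions = weights @ train_ratings


def predict_one(index_pos, column_pos):
    """Single prediction, following the same steps as recommender()."""
    w = sim[index_pos] / (sim[index_pos].sum() - 1)
    return float((train_ratings[:, column_pos] * w).sum())


print(predictions[0, 1], predict_one(0, 1))
```

Only the entries at positions where the user has no train rating are meaningful predictions. Everywhere else the user's own rating is part of the weighted sum.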

## 3.2&nbsp;Apply function to all values in the test set

Before computing the predicted rating for all rows in the test dataset, let's order the values.

In [None]:
# Sort the values in the test dataset and keep the sorted result.
test_pos = test_pos.sort_values(["row_pos", "column_pos"])
test_pos.head()

To get a prediction for all the values in the test dataset, we will iterate over its rows and then store the predicted ratings in a list.

In [None]:
recs_test = []

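# Iterate over rows of the test_pos dataset.
for index, row in test_pos.iterrows():
    recs_test.append(
        # Use the recommender function. The positions are cast to int because
        # the row values become floats once test_pos contains float columns.
        recommender(
            index_pos = int(row["row_pos"]), 
            column_pos = int(row["column_pos"])
            )
        )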

In [None]:
recs_test

Again, we add the list with the predictions as a new column to the `test_pos` DataFrame.

In [None]:
# Add new column "pred_rating" with the predictions.
test_pos = test_pos.assign(pred_rating = recs_test)

In [None]:
test_pos.head()

## 3.3&nbsp;Use visualizations to compare true and predicted ratings

Let's have a look at the distributions of both the true and the predicted ratings.

In [None]:
# Predicted ratings' distribution first.
test_pos.pred_rating.hist();

In [None]:
# True ratings' distribution.
test_pos.true_rating.hist();

It looks like our predictions are generally much lower than the true ratings. Our small visualizations don't seem to be enough to evaluate the quality of our predictions.

Let's try to quantify this.

# 4.&nbsp;Performance metrics

From the various metrics available, let's pick the $R^2$ score first to quantify the difference between the predicted and the true ratings. As a second step, we will use the mean absolute error.

## 4.1&nbsp;$R^2$ score

The highest possible $R^2$ score is 1. A score of 0 means that the predictions are no better than always predicting the average of the true ratings. The $R^2$ score has no lower bound: it becomes negative when the predictions perform worse than that average would.

Have a look at the documentation of the $R^2$ score [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html?highlight=r2%20score#sklearn.metrics.r2_score).
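
To see how a negative score can arise, here is a minimal sketch that computes $R^2$ by hand as 1 minus the ratio of the model's squared errors to the squared errors of the mean baseline. The function name `r2_by_hand` and all ratings are made up for illustration.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()         # squared errors of the predictions
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # squared errors of the mean baseline
    return 1 - ss_res / ss_tot


y_true = [1, 2, 2, 1]                 # hypothetical true ratings
mean_baseline = [1.5, 1.5, 1.5, 1.5]  # always predict the average
too_low = [0.2, 0.4, 0.3, 0.1]        # roughly the right order, but far too low

print(r2_by_hand(y_true, mean_baseline))  # 0.0
print(r2_by_hand(y_true, too_low))        # negative, roughly -5.9
```

Predictions that are consistently too low are punished heavily, even when they put the items in roughly the right order.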

In [None]:
from sklearn.metrics import r2_score

In [None]:
# Calculate R squared for all predictions and true ratings.
r2_score(test_pos.true_rating, test_pos.pred_rating)

A negative $R^2$ score! Let's try a different metric.

## 4.2&nbsp;Mean Absolute Error

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
# Calculate MAE for all predictions and true ratings.
mean_absolute_error(test_pos.true_rating, test_pos.pred_rating)
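
The mean absolute error is simply the average size of the errors, ignoring their sign. A minimal sketch with made-up ratings:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 2.0, 1.0])  # hypothetical true ratings
y_pred = np.array([0.5, 1.0, 2.5, 1.0])  # hypothetical predictions

# Average of the absolute differences between true and predicted ratings.
mae = np.abs(y_true - y_pred).mean()
print(mae)  # 0.5
```

Unlike the $R^2$ score, the result is expressed in the same unit as the ratings themselves.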

Let's visualize the errors by plotting each predicted rating against its true rating.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.title('Error analysis')
plt.xlabel('Predicted ratings')
plt.ylabel('True ratings')

# Plot diagonal for predictions = true ratings.
sns.lineplot(x=[0,2], y=[0,2], color='red')
# For each test datapoint, plot predicted vs. true rating.
sns.scatterplot(x=test_pos["pred_rating"], y=test_pos["true_rating"], alpha=0.4);

We know that our model won't be capable of exact predictions (i.e. values on the red diagonal). Luckily, this doesn't matter much for recommenders.

Instead, we need to be able to rank items from most likely to be enjoyed to least likely.

Therefore, it is more important for us that the order is correct. To check for this, we will investigate whether the predicted values for true values of 2 are higher than the predicted values for true values of 1.

Average predicted score for true ratings of 2.

In [None]:
# Filter for true ratings of 2.
test_pos.loc[test_pos.true_rating == 2, "pred_rating"].mean()

Average predicted score for true ratings of 1.

In [None]:
# Filter for true ratings of 1.
test_pos.loc[test_pos.true_rating == 1, "pred_rating"].mean()

We can see that on average, our recommender predicts higher ratings for restaurants whose true ratings are also higher. This means that in general, our recommender performs reasonably well at finding the correct order.
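
Both averages can also be obtained in one call, and the ordering can be summarized in a single number. The sketch below uses a made-up `toy_results` DataFrame and computes the Spearman rank correlation as the Pearson correlation of the ranks. This is one possible way to quantify the ordering, not the only one.

```python
import pandas as pd

# Hypothetical test results with true and predicted ratings.
toy_results = pd.DataFrame({
    "true_rating": [2, 1, 2, 1, 2, 1],
    "pred_rating": [0.9, 0.3, 0.6, 0.5, 0.8, 0.2],
})

# Average prediction per true rating, in a single call.
mean_pred = toy_results.groupby("true_rating")["pred_rating"].mean()
print(mean_pred)

# Spearman rank correlation = Pearson correlation of the ranks.
rank_corr = toy_results["true_rating"].rank().corr(toy_results["pred_rating"].rank())
print(rank_corr)
```

A rank correlation close to 1 means that higher true ratings tend to come with higher predictions, regardless of how far off the predicted values themselves are.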

# 5.&nbsp;Challenge

Evaluate whether a recommender system that uses the sum `rating + food_rating + service_rating` performs better than one that uses only the `rating`.

In [None]:
frame.head()

In [None]:
# your code here