# Prac 08
In this prac we will be building a recommender system for movies.

# Q1
Using the [MovieLens](https://grouplens.org/datasets/movielens/) (small) data set, complete the following:
- Download the [ml-latest-small](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip) data file. Unzip it, then upload the files `movies.csv` and `ratings.csv` to Google Colab
- Load the movies and ratings into pandas data frames, and inspect the data
- Convert the ratings dataframe into a utility matrix using `df.pivot`, and inspect the matrix.
- Select a user from the utility matrix that we will recommend a movie for
- Predict the rating of a movie they have not seen using a user-based approach
- Predict the ratings of all unseen movies and recommend the top 10

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn

## Download data
Download the [ml-latest-small](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip) data file. Unzip it, then upload the files `movies.csv` and `ratings.csv` to Google Colab.

Since the data comes as a .zip file, it is probably easiest to use some bash tools to download and extract the files directly into the Colab workspace.

In [None]:
# download and unzip only if the archive is not already present
! [[ -e ml-latest-small.zip ]] || (wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip && unzip ml-latest-small.zip)

## Load the data
Load the movies and ratings into pandas data frames, and inspect the data.
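
As a rough illustration of loading and inspecting a CSV, here is a minimal sketch on a made-up three-row sample. The column names (`userId`, `movieId`, `rating`, `timestamp`) are assumed to match `ratings.csv`, and `sample_csv` and `sample_ratings` are hypothetical names used only here.

```python
import io
import pandas as pd

# Hypothetical three-row sample, assumed to share the column layout of ratings.csv
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "2,1,3.5,1445714835\n"
)

# read_csv accepts a file path or any file-like object
sample_ratings = pd.read_csv(sample_csv).drop(columns="timestamp")

print(sample_ratings.shape)       # (rows, columns)
print(sample_ratings.dtypes)      # type of each column
print(sample_ratings.describe())  # summary statistics for numeric columns
```

Besides `head()`, the `shape`, `dtypes` and `describe()` views are quick ways to check that the file was parsed as expected.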

In [None]:
# load the movies data
movies_df = pd.read_csv(?)
movies_df.head()

In [None]:
# load the ratings data
ratings_df = pd.read_csv(?).drop(columns='timestamp')
ratings_df.head()

## Create a utility matrix
Convert the ratings dataframe into a utility matrix using `df.pivot`, and inspect the matrix.
- Make the matrix smaller by selecting only movies with more than 50 ratings, and users who have rated more than 50 movies
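
To see what the pivot and the count-based filter do, here is a small sketch on toy data. The column names `user`, `item` and `rating`, and the cut-off of 1, are made up for illustration and are not the values needed for the prac.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item) pair
toy_ratings = pd.DataFrame({
    "user":   ["a", "a", "b", "b", "b", "c"],
    "item":   ["x", "y", "x", "y", "z", "z"],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0, 4.5],
})

# One row per user, one column per item; unrated pairs become NaN
toy_utility = toy_ratings.pivot(index="user", columns="item", values="rating")

# count() skips NaN, so these are the number of ratings per user and per item
ratings_per_user = toy_utility.count(axis="columns")
ratings_per_item = toy_utility.count(axis="index")

# Keep users with more than 1 rating and items with more than 1 rating
toy_small = toy_utility.loc[ratings_per_user > 1, ratings_per_item > 1]
print(toy_small)
```

Note that both counts are computed on the full matrix, so after filtering a remaining row or column can end up with fewer ratings than the cut-off.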

In [None]:
# convert ratings table into a utility matrix
utility = ratings_df.pivot(index=?,
                           columns=?,
                           values=?)
utility.tail()

In [None]:
users_gt_50 = utility.count(axis='columns') > ?
movies_gt_50 = utility.count(axis='index') > ?

In [None]:
utility_sm = utility.loc[users_gt_50, movies_gt_50]
utility_sm

## Select a user
Select a user from the utility matrix that we will recommend a movie for.

I recommend choosing user 605 in the above list.

In [None]:
User1 = utility.loc[?]

## Predict the rating of a movie
Predict the rating of a movie they have not seen using a user-based approach
- use a similarity threshold of `0.8` to select similar users
- predict the rating for movie with `id=7`
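
The following sketch shows, on a made-up 4x5 matrix, how Pearson similarity and a threshold pick out peers. The names `toy`, `target`, `toy_sims` and `toy_peers` are hypothetical, and the target user is dropped before comparing instead of being given a similarity of zero.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for four users over five movies (NaN = not rated)
toy = pd.DataFrame(
    [[5.0, 4.0, 1.0, np.nan, 2.0],
     [5.0, 5.0, 2.0, 3.0, 1.0],
     [1.0, 2.0, 5.0, 4.0, np.nan],
     [5.0, np.nan, 1.0, 2.0, 2.0]],
    index=["target", "u1", "u2", "u3"],
    columns=["m1", "m2", "m3", "m4", "m5"],
)

target = toy.loc["target"]

# Series.corr computes Pearson correlation using only the movies both users rated
toy_sims = toy.drop(index="target").apply(target.corr, axis=1)

toy_threshold = 0.8  # same cut-off as suggested in the list above
toy_peers = toy_sims.index[toy_sims > toy_threshold]

print(toy_sims.round(3))
print(list(toy_peers))
```

Here `u2` rates the shared movies in the opposite order to the target, so its correlation is negative and it is excluded from the peers.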

In [None]:
# row1.corr(row2) computes the Pearson correlation coefficient between two users
def corrUser(row):
  if User1.equals(row): # For later implementation we don't want to compare a user to themselves.
    return 0            # Instead of skipping them we just set the correlation to zero.
  return User1.corr(row)

In [None]:
# now compute all the similarities
similarities = utility.apply(corrUser,
                             axis=1)

In [None]:
threshold=?
peers_id = utility.index[similarities>threshold]
print(f"There are {len(peers_id)} similar users")

In [None]:
utility.loc[peers_id]

The predicted score is
$\frac{\sum_i S_i x_i}{\sum_i S_i}$
where $S_i$ is the similarity score of peer $i$ and $x_i$ is that peer's rating. Both sums run only over the peers who have rated the movie.
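
A worked example of this weighted average, using three made-up peers of whom one has not rated the movie. The names `peer_sims`, `peer_ratings` and `toy_prediction` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical similarity scores S_i for three peers
peer_sims = pd.Series({"u1": 0.9, "u2": 0.8, "u3": 1.0})
# Their ratings x_i for one movie; u2 has not rated it
peer_ratings = pd.Series({"u1": 4.0, "u2": np.nan, "u3": 5.0})

rated = peer_ratings.notna()  # mask of peers who rated the movie

# Series.sum skips NaN, so the unrated peer drops out of the numerator
weighted_sum = (peer_sims * peer_ratings).sum()
# The denominator must use the same peers, hence the mask
weight_total = peer_sims[rated].sum()

toy_prediction = weighted_sum / weight_total
print(round(toy_prediction, 3))
```

The result is (0.9 * 4 + 1.0 * 5) / (0.9 + 1.0) = 8.6 / 1.9, roughly 4.53. Without the mask the denominator would be 2.7 and the prediction would be biased downwards.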

In [None]:
# Compute the similarity-weighted average of the peers' scores for movie with id=7
mid = ?
numerator = np.sum(similarities[peers_id] * utility.loc[peers_id, mid])
# the >0 returns True/False (NaN gives False), which is interpreted as 1/0 and acts as a mask
denominator = np.sum(similarities[peers_id] * (utility.loc[peers_id, mid] > 0))
predicted_score = numerator / denominator
predicted_score

## Select top 10 predictions
Predict the ratings of all unseen movies and recommend the top 10
- Use the above to write a function to predict the score for all the movies
- sort the predictions and choose the top 10
- report the predicted score and names of the top 10 movies
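
As an alternative to sorting a dict by hand, the sketch below uses a pandas Series to pick the top entries and look up their titles. The toy predictions, the titles table and the choice of the top 2 are made up for illustration.

```python
import pandas as pd

# Hypothetical {movieId: predicted score} dict and a tiny titles table
toy_predictions = {10: 3.2, 20: 4.8, 30: 4.1, 40: 2.5}
toy_titles = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
})

# A Series lets pandas handle the sorting and the top-k selection in one call
top_2 = pd.Series(toy_predictions, name="predicted").nlargest(2)

# Look up titles by movieId, then attach the scores (aligned on the index)
toy_report = toy_titles.set_index("movieId").loc[top_2.index].assign(predicted=top_2)
print(toy_report)
```

Either approach gives the same ranking; the Series version simply keeps the scores and titles together in one table.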

In [None]:
def predict_scores(peers, utility):
  """
  Predict the score for all unseen movies.
  Peers should list the userid for all the similar users.
  Utility should be a utility matrix.

  Returns a dict of {movieId:score}
  """
  scores = {}
  for movie in utility.columns:
    if np.isnan(User1[movie]):     # only predict scores for unseen movies
      # use the peers argument rather than the global peers_id
      numerator = np.sum(similarities[peers] * utility.loc[peers, movie])
      denominator = np.sum(similarities[peers] * (utility.loc[peers, movie] > 0))
      predicted_score = numerator / denominator
      if not np.isnan(predicted_score):  # don't record predictions of nan
        scores[movie] = predicted_score
  return scores

In [None]:
# compute the predicted scores
predictions = predict_scores(peers_id,
                             utility)

In [None]:
# choose the top10 movieIds
top_10 = sorted(predictions,
                key=predictions.get, # sort based on the value not the key
                reverse=True, # reverse means highest first
                )[:?]  # choose first 10
top_10

In [None]:
# Display a nice table of results
print("Pred \t| Movie")
print("--------------")
for movie in top_10:
  # extract the movie name from the movies_df
  name = movies_df.loc[movies_df['movieId'] == movie, 'title'].values[0]
  print(f"{predictions[movie]:3.1f}\t| {name}")