# Prac 08
In this prac we will be building a recommender system for movies.

# Q1
Using the [MovieLens](https://grouplens.org/datasets/movielens/) (small) data set, complete the following:
- Download the [ml-latest-small](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip) data file. Unzip it, then upload the files 'movies.csv' and 'ratings.csv' to Google Colab
- Load the movies and ratings into pandas data frames, and inspect the data
- Convert the ratings dataframe into a utility matrix using `df.pivot`, and inspect the matrix.
- Select a user from the utility matrix that we will recommend a movie for
- Predict the rating of a movie they have not seen using a user-based approach
- Predict the ratings of all unseen movies and recommend the top 10

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn

## Download data
Download the [ml-latest-small](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip) data file. Unzip it, then upload the files 'movies.csv' and 'ratings.csv' to Google Colab.

Since the data is in a .zip file, it is probably easiest to use a few bash tools to download and extract the files directly into the Colab workspace.

In [None]:
! [[ -e ml-latest-small.zip ]] || (wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip && unzip ml-latest-small.zip)

## Load the data
Load the movies and ratings into a pandas data frame, and inspect the data.

In [None]:
# load the movies data
movies_df = pd.read_csv('/content/ml-latest-small/movies.csv')
movies_df.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
# load the ratings data
ratings_df = pd.read_csv('/content/ml-latest-small/ratings.csv').drop(columns='timestamp')
ratings_df.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


## Create a utility matrix
Convert the ratings dataframe into a utility matrix using `df.pivot`, and inspect the matrix.
- Make the matrix smaller by selecting only movies with more than 50 ratings, and users who have rated more than 50 movies
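
As a minimal sketch of the two steps above, the following pivots a tiny made-up ratings table and then filters it by rating counts. The names `toy`, `toy_utility`, and `small`, the rating values, and the cut-off of one rating are illustrative assumptions, not part of the prac data. Note that `pivot` is expected to raise an error if the same (userId, movieId) pair appears more than once.

```python
import pandas as pd

# Toy ratings in the same long format as ratings.csv (values are made up)
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 1.0],
})

# One row per user, one column per movie; unrated cells become NaN
toy_utility = toy.pivot(index="userId", columns="movieId", values="rating")

# count() ignores NaN, so these are the ratings per user and per movie
ratings_per_user = toy_utility.count(axis="columns")
ratings_per_movie = toy_utility.count(axis="index")

# Keep users and movies with more than one rating (illustrative cut-off)
small = toy_utility.loc[ratings_per_user > 1, ratings_per_movie > 1]
print(small)
```

User 3 and movie 30 each have a single rating, so both are dropped from `small`.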

In [None]:
# convert ratings table into a utility matrix
utility = ratings_df.pivot(index='userId',
                           columns='movieId',
                           values='rating')
utility.tail()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,4.0,...,,,,,,,,,,
610,5.0,,,,,5.0,,,,,...,,,,,,,,,,


In [None]:
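# count() ignores NaN, so these count the ratings per user and per movie
users_gt_50 = utility.count(axis='columns') > 50   # users who rated more than 50 movies
movies_gt_50 = utility.count(axis='index') > 50    # movies with more than 50 ratings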

In [None]:
utility_sm = utility.loc[users_gt_50, movies_gt_50]
utility_sm

movieId,1,2,3,6,7,10,11,16,17,19,...,81845,89745,91500,91529,99114,106782,109374,109487,112852,122904
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,4.0,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
6,,4.0,5.0,4.0,4.0,3.0,4.0,4.0,4.0,2.0,...,,,,,,,,,,
7,4.5,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,5.0,,,5.0,,1.0,0.5,0.5,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,4.0,3.5,,,,,,,,,...,,,,,,,,,,
606,2.5,,,,2.5,,2.5,,4.0,2.0,...,,,4.5,,,,,,,
607,4.0,,,,,,3.0,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,4.0,,4.5,,2.0,...,,,,,,,,,,


## Select a user
Select a user from the utility matrix that we will recommend a movie for.

I recommend choosing user 605 from the list above.

In [None]:
User1 = utility.loc[605]

## Predict the rating of a movie
Predict the rating of a movie they have not seen using a user-based approach.
- use a similarity threshold of `0.8` to select similar users
- predict the rating for the movie with `id=7`

In [None]:
# row1.corr(row2) computes the Pearson correlation coefficient between two users' ratings
def corrUser(row):
  if User1.equals(row): # For later implementation we don't want to compare a user to themselves.
    return 0            # Instead of skipping them we just set the correlation to zero.
  return User1.corr(row)
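
A small sketch of how `Series.corr` treats missing ratings may help here. It only uses movies that both users have rated, so two users with just two co-rated movies get a coefficient of +1 or -1 whenever it is defined. This is one possible reason a threshold of 0.8 can admit users with very few ratings in common. The two series below are made up, and the `min_periods=3` cut-off is an illustrative assumption rather than part of the prac.

```python
import numpy as np
import pandas as pd

# Two made-up users; NaN marks an unrated movie
a = pd.Series([5.0, 3.0, np.nan, 1.0], index=[1, 2, 3, 4])
b = pd.Series([4.0, np.nan, 2.0, 2.0], index=[1, 2, 3, 4])

# Only movies 1 and 4 are co-rated, so the correlation rests on two points
r_default = a.corr(b)

# Requiring at least 3 co-rated movies returns NaN instead
r_strict = a.corr(b, min_periods=3)

print(r_default, r_strict)
```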

In [None]:
# now compute all the similarities
similarities = utility.apply(corrUser,
                             axis=1)



In [None]:
threshold=0.8
peers_id = utility.index[similarities>threshold]
print(f"There are {len(peers_id)} similar users")

There are 15 similar users


In [None]:
utility.loc[peers_id]

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
36,,,,,,,,,,,...,,,,,,,,,,
158,,,,,,,,,,,...,,,,,,,,,,
172,,,,,,,,,,,...,,,,,,,,,,
180,,,,,,,,,,,...,,,,,,,,,,
209,,,,,,,,,,,...,,,,,,,,,,
255,,,,,,,,,,,...,,,,,,,,,,
258,,,,,,,,,,,...,,,,,,,,,,
389,5.0,,,,4.0,5.0,,,3.0,,...,,,,,,,,,,
396,5.0,,,,,,,,,,...,,,,,,,,,,
422,4.0,,,,,,,,,,...,,,,,,,,,,


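The predicted score is the similarity-weighted average of the peers' ratings:
$\frac{\sum_i S_i x_i}{\sum_i S_i}$
where $S_i$ is the similarity score of peer $i$ and $x_i$ is that peer's rating. Both sums run only over the peers who have rated the movie.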
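
To make the formula concrete, here is a worked example with three made-up peers, one of whom has not rated the movie. The arrays `S` and `x` are illustrative values only.

```python
import numpy as np

# Made-up similarity scores and ratings for three peers; NaN means "not rated"
S = np.array([0.9, 0.85, 1.0])
x = np.array([5.0, np.nan, 4.0])

rated = ~np.isnan(x)                      # only peers who rated the movie count
numerator = np.sum(S[rated] * x[rated])   # 0.9*5 + 1.0*4 = 8.5
denominator = np.sum(S[rated])            # 0.9 + 1.0 = 1.9
predicted = numerator / denominator       # about 4.47

print(round(predicted, 2))
```

Dividing by the sum of all three similarity scores instead would pull the prediction down, because the peer who has not rated the movie would count as if they had rated it zero.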

In [None]:
# Compute the similarity-weighted average of peer ratings for the movie with id=1
# (set mid = 7 for the movie named in the task above)
mid = 1
rated = utility.loc[peers_id, mid].notna()  # peers who have rated this movie
numerator = np.sum(similarities[peers_id] * utility.loc[peers_id, mid])
denominator = np.sum(similarities[peers_id] * rated)
predicted_score = numerator / denominator
predicted_score

4.650483330913248

## Select top 10 predictions
Predict the ratings of all unseen movies and recommend the top 10
- Use the above to write a function to predict the score for all the movies
- sort the predictions and choose the top 10
- report the predicted score and names of the top 10 movies

In [None]:
def predict_scores(peers, utility):
  """
  Predict the score for all unseen movies.
  Peers should list the userId of all the similar users.
  Utility should be a utility matrix.
  Uses the globals `User1` and `similarities` defined above.

  Returns a dict of {movieId: score}
  """
  scores = {}
  weights = similarities[peers]            # similarity score of each peer
  for movie in utility.columns:
    if np.isnan(User1[movie]):             # only predict scores for unseen movies
      peer_ratings = utility.loc[peers, movie]
      if peer_ratings.notna().any():       # skip movies that no peer has rated
        numerator = np.sum(weights * peer_ratings)
        denominator = np.sum(weights * peer_ratings.notna())
        scores[movie] = numerator / denominator
  return scores
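
The loop above can also be written without globals and without iterating over columns. The following is a sketch under stated assumptions: `predict_unseen` is a hypothetical helper name, and the toy utility matrix and similarity scores are made up for illustration.

```python
import numpy as np
import pandas as pd

def predict_unseen(user_row, sims, utility, peers):
    """Similarity-weighted mean peer rating for each movie user_row has not rated."""
    peer_ratings = utility.loc[peers]     # rows: peers, columns: movies
    weights = sims.loc[peers]             # one weight per peer
    # sum() skips NaN, so unrated cells contribute nothing to the numerator
    numerator = peer_ratings.mul(weights, axis=0).sum(axis=0)
    # weight of each peer counted only where that peer has a rating
    denominator = peer_ratings.notna().astype(float).mul(weights, axis=0).sum(axis=0)
    keep = (denominator > 0) & user_row.isna()   # rated by a peer, unseen by the user
    return (numerator[keep] / denominator[keep]).sort_values(ascending=False)

# Toy utility matrix: 3 users x 3 movies (values are made up)
toy_utility = pd.DataFrame(
    {10: [np.nan, 5.0, 4.0], 20: [3.0, 4.0, np.nan], 30: [np.nan, np.nan, np.nan]},
    index=[1, 2, 3],
)
toy_sims = pd.Series({1: 0.0, 2: 0.9, 3: 1.0})

preds = predict_unseen(toy_utility.loc[1], toy_sims, toy_utility, peers=[2, 3])
print(preds)
```

Movie 20 is excluded because user 1 has already rated it, and movie 30 is excluded because no peer has rated it, so only movie 10 receives a prediction.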

In [None]:
# compute the predicted scores
predictions = predict_scores(peers_id,
                             utility)

  


In [None]:
# choose the top10 movieIds
top_10 = sorted(predictions,
                key=predictions.get, # sort based on the value not the key
                reverse=True, # reverse means highest first
                )[:10]  # choose first 10
top_10

[6, 62, 140, 141, 216, 293, 520, 628, 736, 750]

In [None]:
# Display a nice table of results
print("Pred \t| Movie")
print("--------------")
for movie in top_10:
  # extract the movie name from the movies_df
  name = movies_df['title'][movies_df['movieId']==movie].values[0]
  print(f"{predictions[movie]:3.1f}\t| {name}")

Pred 	| Movie
--------------
5.0	| Heat (1995)
5.0	| Mr. Holland's Opus (1995)
5.0	| Up Close and Personal (1996)
5.0	| Birdcage, The (1996)
5.0	| Billy Madison (1995)
5.0	| Léon: The Professional (a.k.a. The Professional) (Léon) (1994)
5.0	| Robin Hood: Men in Tights (1993)
5.0	| Primal Fear (1996)
5.0	| Twister (1996)
5.0	| Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
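
As an alternative sketch of the last two cells, the predictions can be held in a pandas Series, which allows `nlargest` for the top-N selection and an index lookup for the titles. The scores in `toy_predictions` are made up, the three titles are taken from the `movies.csv` preview earlier, and the variable names are illustrative.

```python
import pandas as pd

# Made-up predicted scores keyed by movieId
toy_predictions = {1: 4.5, 2: 5.0, 3: 5.0}
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# nlargest uses keep='first' by default, which prefers earlier entries on ties
top = pd.Series(toy_predictions, name="predicted").nlargest(2)

# Look up titles by movieId and line them up with the selected scores
titles = toy_movies.set_index("movieId")["title"]
table = pd.DataFrame({"predicted": top, "title": titles.reindex(top.index)})
print(table)
```

A top 10 in which every score is 5.0 is consistent with ties: under the weighted-average formula, any movie whose only peer rating is 5.0 is predicted at exactly 5.0, so the ordering among such movies is decided by tie-breaking rather than by the scores.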
