# DES431 Project: Recommendation System

# Background

**MovieLens** is a movie recommendation system operated by GroupLens, a research group at the University of Minnesota. MovieLens has been developed to provide personalized movie recommendations to its users based on their viewing history and preferences.

# Task

1. This project is to be completed by a group of three students.
2. Propose and implement your own recommendation system based on the MovieLens dataset.
   - Use `ratings_train.csv` as the training set and `ratings_valid.csv` as the validation set.
   - Your recommendation system may utilize information from `movies.csv` for making recommendations.
   - The structure of the data files is detailed at `https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html`.
   - The goal of the recommendation system is to minimize the root-mean-square error (RMSE), i.e., to minimize the difference between the predicted and actual ratings.
   - Implement a function named `predict_rating`. This function should accept a DataFrame with two columns: `userId` and `movieId`, and return the DataFrame with an additional column named `rating`, containing predicted ratings of a `movieId` by a `userId`.
   - The `predict_rating` function must be compatible with an undisclosed test set having the same format as the validation set. The test set contains  Your implementation will be evaluated by the test set. Failure to comply will result in a 50% deduction of your score.
   - You are required to modify the given program to enhance recommendation quality. Submitting the unaltered original program will be considered plagiarism.
3. Prepare slides for a 7-minute presentation that explains your proposed technique and algorithm for making recommendations, and demonstrates your RMSE results on the validation set.
4. Submit your Python notebook and the presentation slides in PDF format via Google Classroom by April 30, 2024, at 23:59. All members of the group must individually submit their work to Google Classroom. Late submissions will not be accepted and will incur a 10% deduction. Do not procrastinate. Plagiarism and code duplication will be rigorously checked.
5. Present your work on May 1, 2024, within a 7-minute timeframe. Presentations exceeding 7 minutes will result in point deductions.
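
The RMSE objective named in the task can be written out directly. The block below is a minimal sketch, not part of the provided program: the helper name `rmse` and the toy ratings are assumptions made for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences of ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example with made-up ratings (not taken from the dataset)
print(rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0]))
```

Because the errors are squared before averaging, one large miss costs more than several small ones, so predictions far from the true rating are penalized heavily.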


In [2]:
# Edit this cell for the group name and members
# 1. Peereakarn Thongsata 6422780047 
# 2. Thanarat Attakulkijkarn 6422781318
# 3. Theeraphat Wongnijasil 6422782126

In [3]:
import numpy as np
import pandas as pd

# Loading data

In [4]:
ratings_train = pd.read_csv('ratings_train.csv')
ratings_valid = pd.read_csv('ratings_valid.csv')
movies = pd.read_csv('movies.csv')

In [5]:
ratings_train.describe()

,userId,movieId,rating,timestamp
count,96464.0,96464.0,96464.0,96464.0
mean,327.86935,19105.768059,3.509325,1204483000.0
std,183.95296,35243.409786,1.041385,216528300.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1196.0,3.0,1013395000.0
50%,330.0,2959.0,3.5,1182909000.0
75%,479.0,7486.0,4.0,1435993000.0
max,610.0,193609.0,5.0,1537799000.0


In [6]:
ratings_train = pd.merge(ratings_train, movies, on="movieId")
ratings_train.head(10)

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
5,18,1,3.5,1455209816,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
6,19,1,4.0,965705637,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
7,21,1,3.5,1407618878,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
8,27,1,3.0,962685262,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
9,31,1,5.0,850466616,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [7]:
movies.head(10)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


# Constructing model and predicting ratings

In [8]:
# construct a dictionary to map each genre to an index

map_genres = dict()
n_genres = 0

for genres in movies["genres"]:
  for genre in genres.split("|"):
    if genre not in map_genres:
      map_genres[genre] = n_genres
      n_genres += 1

map_genres

{'Adventure': 0,
 'Animation': 1,
 'Children': 2,
 'Comedy': 3,
 'Fantasy': 4,
 'Romance': 5,
 'Drama': 6,
 'Action': 7,
 'Crime': 8,
 'Thriller': 9,
 'Horror': 10,
 'Mystery': 11,
 'Sci-Fi': 12,
 'War': 13,
 'Musical': 14,
 'Documentary': 15,
 'IMAX': 16,
 'Western': 17,
 'Film-Noir': 18,
 '(no genres listed)': 19}

In [9]:
# construct a dictionary to map each userId to an index

map_userId = dict()
n_users = 0

for id in ratings_train["userId"]:
  if id not in map_userId:
    map_userId[id] = n_users
    n_users += 1

map_userId

{1: 0,
 5: 1,
 7: 2,
 15: 3,
 17: 4,
 18: 5,
 19: 6,
 21: 7,
 27: 8,
 31: 9,
 32: 10,
 33: 11,
 40: 12,
 43: 13,
 44: 14,
 45: 15,
 46: 16,
 50: 17,
 54: 18,
 57: 19,
 63: 20,
 64: 21,
 66: 22,
 68: 23,
 71: 24,
 73: 25,
 76: 26,
 78: 27,
 82: 28,
 86: 29,
 89: 30,
 90: 31,
 91: 32,
 93: 33,
 96: 34,
 98: 35,
 103: 36,
 107: 37,
 112: 38,
 119: 39,
 121: 40,
 124: 41,
 130: 42,
 132: 43,
 134: 44,
 135: 45,
 137: 46,
 140: 47,
 141: 48,
 144: 49,
 145: 50,
 151: 51,
 153: 52,
 155: 53,
 156: 54,
 159: 55,
 160: 56,
 161: 57,
 166: 58,
 167: 59,
 169: 60,
 171: 61,
 177: 62,
 178: 63,
 179: 64,
 182: 65,
 185: 66,
 186: 67,
 191: 68,
 193: 69,
 200: 70,
 201: 71,
 202: 72,
 206: 73,
 213: 74,
 214: 75,
 216: 76,
 217: 77,
 219: 78,
 220: 79,
 223: 80,
 226: 81,
 229: 82,
 232: 83,
 233: 84,
 234: 85,
 239: 86,
 240: 87,
 247: 88,
 249: 89,
 252: 90,
 254: 91,
 263: 92,
 264: 93,
 266: 94,
 269: 95,
 270: 96,
 273: 97,
 274: 98,
 275: 99,
 276: 100,
 277: 101,
 279: 102,
 280: 103,
 282:

In [10]:
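# construct a matrix P: P[u][g] is the mean rating user u gave to movies tagged with genre g

P = np.zeros((n_users, n_genres))
freq = np.zeros((n_users, n_genres))

for i, row in ratings_train.iterrows():
  for genre in row["genres"].split("|"):
    P[map_userId[row["userId"]]][map_genres[genre]] += row["rating"]
    freq[map_userId[row["userId"]]][map_genres[genre]] += 1

# 0/0 yields NaN for genres a user never rated; silence the expected warning
with np.errstate(invalid="ignore"):
  P = P/freq

# replace those NaN entries with the user's overall mean rating
# (select the rating column first so the text columns are not averaged)
for uid, mean_rating in ratings_train.groupby("userId")["rating"].mean().items():
  P[map_userId[uid]] = np.nan_to_num(P[map_userId[uid]], nan = mean_rating)

P
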

array([[4.38823529, 4.68965517, 4.54761905, ..., 4.28571429, 5.        ,
        4.36637931],
       [3.25      , 4.33333333, 4.11111111, ..., 3.        , 3.63636364,
        3.63636364],
       [3.31481481, 3.39285714, 3.2       , ..., 1.5       , 3.25      ,
        3.23026316],
       ...,
       [4.15384615, 5.        , 4.41666667, ..., 4.078125  , 4.078125  ,
        4.078125  ],
       [3.54166667, 3.54166667, 4.25      , ..., 3.54166667, 3.54166667,
        3.54166667],
       [5.        , 3.96296296, 3.5       , ..., 3.96296296, 3.96296296,
        3.96296296]])
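
The same per-user, per-genre averaging can also be expressed without an explicit Python loop. The following is a sketch on a made-up toy frame, not the notebook's data; the names `toy`, `long`, `P_toy`, and `user_mean` are hypothetical, and the result is a labeled DataFrame rather than the index-mapped array `P` used above.

```python
import pandas as pd

# Toy stand-in for the merged ratings_train (made-up values)
toy = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 20, 20],
    "rating":  [4.0, 2.0, 5.0],
    "genres":  ["Comedy|Drama", "Comedy", "Comedy"],
})

# One row per (rating, genre) pair
long = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")

# Mean rating per user and genre; combinations never rated become NaN
P_toy = long.pivot_table(index="userId", columns="genre",
                         values="rating", aggfunc="mean")

# Fill NaN with each user's overall mean rating, as the loop version does
user_mean = toy.groupby("userId")["rating"].mean()
P_toy = P_toy.apply(lambda col: col.fillna(user_mean))

print(P_toy)
```

Here user 2 never rated a Drama movie, so that cell is filled with the user's overall mean of 5.0.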

In [11]:
# construct a dictionary to map each movieId to an index

map_movieId = dict()
n_movies = 0

for id in movies["movieId"]:
  if id not in map_movieId:
    map_movieId[id] = n_movies
    n_movies += 1

map_movieId

{1: 0,
 2: 1,
 3: 2,
 4: 3,
 5: 4,
 6: 5,
 7: 6,
 8: 7,
 9: 8,
 10: 9,
 11: 10,
 12: 11,
 13: 12,
 14: 13,
 15: 14,
 16: 15,
 17: 16,
 18: 17,
 19: 18,
 20: 19,
 21: 20,
 22: 21,
 23: 22,
 24: 23,
 25: 24,
 26: 25,
 27: 26,
 28: 27,
 29: 28,
 30: 29,
 31: 30,
 32: 31,
 34: 32,
 36: 33,
 38: 34,
 39: 35,
 40: 36,
 41: 37,
 42: 38,
 43: 39,
 44: 40,
 45: 41,
 46: 42,
 47: 43,
 48: 44,
 49: 45,
 50: 46,
 52: 47,
 53: 48,
 54: 49,
 55: 50,
 57: 51,
 58: 52,
 60: 53,
 61: 54,
 62: 55,
 63: 56,
 64: 57,
 65: 58,
 66: 59,
 68: 60,
 69: 61,
 70: 62,
 71: 63,
 72: 64,
 73: 65,
 74: 66,
 75: 67,
 76: 68,
 77: 69,
 78: 70,
 79: 71,
 80: 72,
 81: 73,
 82: 74,
 83: 75,
 85: 76,
 86: 77,
 87: 78,
 88: 79,
 89: 80,
 92: 81,
 93: 82,
 94: 83,
 95: 84,
 96: 85,
 97: 86,
 99: 87,
 100: 88,
 101: 89,
 102: 90,
 103: 91,
 104: 92,
 105: 93,
 106: 94,
 107: 95,
 108: 96,
 110: 97,
 111: 98,
 112: 99,
 113: 100,
 116: 101,
 117: 102,
 118: 103,
 119: 104,
 121: 105,
 122: 106,
 123: 107,
 125: 108,
 126: 10

In [12]:
# construct a matrix T: T[m][g] = 1 / (number of genres of movie m) if m has genre g, else 0

n_movies = movies["movieId"].nunique()

T = np.zeros((n_movies, n_genres))

for i, row in movies.iterrows():
  genres = row["genres"].split("|")
  for genre in genres:
    # write the weight directly, which avoids the divide-by-zero in 1 / T
    T[map_movieId[row["movieId"]]][map_genres[genre]] = 1 / len(genres)

T


array([[0.2       , 0.2       , 0.2       , ..., 0.        , 0.        ,
        0.        ],
       [0.33333333, 0.        , 0.33333333, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.5       , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
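
A matrix like `T` can also be built from a one-hot encoding of the genre strings. The block below is a sketch on a made-up toy frame, not the notebook's data; `toy_movies`, `onehot`, and `T_toy` are hypothetical names, and the columns come out in alphabetical order rather than in the order of `map_genres`.

```python
import pandas as pd

# Toy stand-in for movies.csv (made-up rows)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres":  ["Adventure|Children", "Comedy", "Adventure|Comedy|Children|Fantasy"],
})

# 0/1 indicator column per genre
onehot = toy_movies["genres"].str.get_dummies(sep="|")

# Divide each row by its genre count so the weights in a row sum to 1
T_toy = onehot.div(onehot.sum(axis=1), axis=0)
T_toy.index = pd.Index(toy_movies["movieId"], name="movieId")

print(T_toy)
```

Each row sums to 1, so multiplying a user's genre-preference row by a movie's row gives the plain average of that user's preferences over the movie's genres.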

In [13]:
# Model construction
predicted = np.matmul(P, np.transpose(T))
avg_rating = ratings_train[['movieId', 'rating']].groupby(by='movieId').mean()

# Prediction
def predict_rating(df):
  # Input:
  #   df = a dataframe with two columns: userId, movieId
  # Output:
  #   a dataframe with three columns: userId, movieId, rating
  rating = []
  for i, row in df.iterrows():
    rating.append(predicted[map_userId[row["userId"]]][map_movieId[row["movieId"]]])

  df["rating"] = rating
  return df
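
Since `map_userId` is built from the training ratings only, a lookup for a user who appears only in the undisclosed test set would raise a `KeyError`. The block below is one possible way to guard against that, shown on made-up toy objects; the function name `predict_rating_safe`, the `global_mean` fallback, and the clipping step are assumptions, not part of the provided program. The clipping range uses the minimum (0.5) and maximum (5.0) ratings reported by `describe()` above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's objects (made-up values)
map_userId = {1: 0, 5: 1}
map_movieId = {10: 0, 20: 1}
predicted = np.array([[4.0, 3.0],
                      [2.5, 5.6]])   # 5.6 is deliberately out of range
global_mean = 3.5                    # assumed fallback for unseen IDs

def predict_rating_safe(df):
    """Hypothetical variant of predict_rating that tolerates unseen IDs."""
    out = df.copy()                          # do not modify the caller's frame
    u = out["userId"].map(map_userId)        # NaN where the user is unknown
    m = out["movieId"].map(map_movieId)      # NaN where the movie is unknown
    known = (u.notna() & m.notna()).to_numpy()
    rating = np.full(len(out), global_mean, dtype=float)
    rating[known] = predicted[u[known].astype(int).to_numpy(),
                              m[known].astype(int).to_numpy()]
    out["rating"] = np.clip(rating, 0.5, 5.0)  # keep within the rating scale
    return out

test_df = pd.DataFrame({"userId": [1, 5, 99], "movieId": [20, 20, 10]})
print(predict_rating_safe(test_df))
```

In this toy run the unseen user 99 receives the fallback value, and the out-of-range 5.6 is clipped to 5.0.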

In [14]:
# Prepare df for prediction (copy so that adding a column does not act on a slice of ratings_valid)
r = ratings_valid[['userId', 'movieId']].copy()

# Predict ratings
ratings_pred = predict_rating(r)


In [15]:
ratings_pred.head(10)

,userId,movieId,rating
0,4,45,3.48847
1,4,52,3.504385
2,4,58,3.504385
3,4,222,3.451699
4,4,247,3.630021
5,4,265,3.545577
6,4,319,3.48847
7,4,345,3.525808
8,4,417,3.535647
9,4,441,3.609756


In [16]:
from sklearn.metrics import mean_squared_error

r_true = ratings_valid['rating'].to_numpy()
r_pred = ratings_pred['rating'].to_numpy()

# take the square root explicitly; the `squared` argument is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(r_true, r_pred))
print(f"RMSE = {rmse:.4f}")
print("Initial RMSE = 0.9171")

RMSE = 0.8976
Initial RMSE = 0.9171
