# DES431 Project: Recommendation System

# Background

**MovieLens** is a movie recommendation system operated by GroupLens, a research group at the University of Minnesota. MovieLens has been developed to provide personalized movie recommendations to its users based on their viewing history and preferences.

# Task

1. This project is to be completed by a group of three students.
2. Propose and implement your own recommendation system based on the MovieLens dataset.
   - Use `ratings_train.csv` as the training set and `ratings_valid.csv` as the validation set.
   - Your recommendation system may utilize information from `movies.csv` for making recommendations.
   - The structure of the data files is detailed at `https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html`.
   - The goal of the recommendation system is to minimize the root-mean-square error (RMSE), i.e., to minimize the difference between the predicted and actual ratings.
   - Implement a function named `predict_rating`. This function should accept a DataFrame with two columns: `userId` and `movieId`, and return the DataFrame with an additional column named `rating`, containing predicted ratings of a `movieId` by a `userId`.
   - The `predict_rating` function must be compatible with an undisclosed test set having the same format as the validation set. Your implementation will be evaluated on the test set. Failure to comply will result in a 50% deduction of your score.
   - You are required to modify the given program to enhance recommendation quality. Submitting the unaltered original program will be considered plagiarism.
3. Prepare slides for a 7-minute presentation that explains your proposed technique and algorithm for making recommendations, and demonstrates your RMSE results on the validation set.
4. Submit your Python notebook and the presentation slides in PDF format via Google Classroom by April 30, 2024, at 23:59. All members of the group must individually submit their work to Google Classroom. Late submissions will not be accepted and will incur a 10% deduction. Do not procrastinate. Plagiarism and code duplication will be rigorously checked.
5. Present your work on May 1, 2024, within a 7-minute timeframe. Presentations exceeding 7 minutes will result in point deductions.
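
The interface and the metric described in the task list can be illustrated with a minimal sketch. It is not the method used in this notebook: `predict_global_mean`, `rmse`, and the toy data are hypothetical names and values chosen for illustration, and the predictor simply returns the training mean for every (user, movie) pair.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for ratings_train.csv / ratings_valid.csv (hypothetical values)
toy_train = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 2.0, 3.0]})
toy_valid = pd.DataFrame({"userId": [2, 1], "movieId": [20, 10], "rating": [2.0, 4.0]})

def predict_global_mean(df, train=toy_train):
    """Same contract as predict_rating: (userId, movieId) in, extra 'rating' column out."""
    out = df[["userId", "movieId"]].copy()
    out["rating"] = train["rating"].mean()
    return out

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

toy_pred = predict_global_mean(toy_valid)
toy_rmse = rmse(toy_valid["rating"], toy_pred["rating"])
print(f"RMSE = {toy_rmse:.4f}")
```

Any model that keeps this input/output shape can be scored the same way, which makes a trivial predictor like this a convenient lower bar to compare against.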


In [None]:
# Edit this cell for the group name and members
'''
Group name: wolf gang
Aticha Numsonthi 6422780203
Kavinnat Rawanggij 64227781201
Pannathat Akkaratornsakul 6422782829
'''

In [1]:
import numpy as np
import pandas as pd



# Loading data

In [2]:
ratings_train = pd.read_csv('ratings_train.csv')
ratings_valid = pd.read_csv('ratings_valid.csv')
movies = pd.read_csv('movies.csv')

In [None]:
ratings_train.head()

# Constructing model and predicting ratings

In [3]:
# Map column position (0, 1, 2, ...) to movieId, in the order of movies.csv
movies_table = dict(enumerate(movies['movieId'].tolist()))

In [4]:
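# Build a dense user x movie rating matrix; 0 marks "not rated".
n = ratings_train['userId'].max()  # assumes user IDs run from 1 to n
m = movies.shape[0]
rating_mat = np.zeros((n, m))
# Invert movies_table (column -> movieId) once, so each lookup is O(1)
# instead of a linear search through a list for every rating
movie_to_col = {movie_id: col for col, movie_id in movies_table.items()}
for user, movie, rating in zip(ratings_train['userId'], ratings_train['movieId'], ratings_train['rating']):
    rating_mat[int(user) - 1, movie_to_col[int(movie)]] = rating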


In [5]:
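# Row r of rating_mat holds userId r+1, so label rows 1..n rather than with
# ratings_train.userId.unique(), which follows order of appearance and omits absent users
rating_df = pd.DataFrame(
    rating_mat,
    index=np.arange(1, rating_mat.shape[0] + 1),
    columns=movies['movieId'].to_numpy(),
)
rating_df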

In [6]:
rating_df

userId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# rating_df[rating_df.loc[:,1] != 0].index.tolist() #rated user in movie1

In [None]:
# # from scipy.sparse.linalg import svds
# # U, S, V = svds(rating_mat)
# U, S, V = np.linalg.svd(rating_mat)
# print(U.shape, S.shape, V.shape)

In [None]:
# mask = (rating_mat != 0).astype(int)
# print(mask)

In [None]:
# n_features = 2
# P = U[:, :n_features]
# S = np.diag(S[:n_features])
# V = V[:n_features, :]
# Q = np.matmul(S, V)

In [None]:
# print(P.shape, Q.shape)
# print(np.matmul(P, Q))
# E = mask * (rating_mat - np.matmul(P, Q))
# mse = np.sum(E**2)/np.sum(mask)
# print(f"MSE = {mse:.4f}")

In [None]:
# x = np.arange(rating_mat.shape[0])
# y = np.arange(rating_mat.shape[1])
# xy = np.dstack(np.meshgrid(x, y)).reshape(-1, 2)

# eta1, eta2 = 0.1, 0.1
# lambda1, lambda2 = 0.0000004, 0.0000004

# for epoch in range(20):
#   np.random.shuffle(xy)
#   for x, y in xy:
#     if rating_mat[x, y] == 0:
      
#       continue
    
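#     epsilon = 2*(rating_mat[x, y] - np.dot(P[x, :], Q[:, y]))
#     q_old = Q[:, y].copy()
#     # gradient step: the error is scaled by the other factor vector
#     Q[:, y] = Q[:, y] + eta1*(epsilon*P[x, :] - lambda2*Q[:, y])
#     P[x, :] = P[x, :] + eta2*(epsilon*q_old - lambda1*P[x, :])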
    
#   E = mask * (rating_mat - np.matmul(P, Q))
#   mse = np.sum(E**2)/np.sum(mask)
#   print(f"Epoch = {epoch+1:4d}, MSE = {mse:.4f}")
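
The commented-out cells above experiment with matrix factorization trained by stochastic gradient descent on the observed ratings only. A minimal, self-contained sketch of that idea follows. The function names `factorize_sgd` and `masked_mse`, the hyperparameters (`lr`, `reg`, `epochs`), the random initialization, and the toy matrix are all assumptions made for illustration; none of them were tuned on, or taken from, the MovieLens data.

```python
import numpy as np

def factorize_sgd(R, n_features=2, lr=0.02, reg=0.02, epochs=300, seed=0):
    """Approximate R by P @ Q using only the observed (non-zero) entries of R."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, n_features))
    Q = rng.normal(scale=0.1, size=(n_features, n_items))
    observed = np.argwhere(R != 0)  # (row, col) pairs of known ratings
    for _ in range(epochs):
        rng.shuffle(observed)  # shuffle the pairs in place
        for u, i in observed:
            err = R[u, i] - P[u, :] @ Q[:, i]
            p_old = P[u, :].copy()
            # Gradient step: the error is scaled by the other factor vector,
            # with L2 regularization shrinking each factor
            P[u, :] += lr * (err * Q[:, i] - reg * P[u, :])
            Q[:, i] += lr * (err * p_old - reg * Q[:, i])
    return P, Q

def masked_mse(R, P, Q):
    """Mean squared error over observed entries only."""
    mask = R != 0
    return float(np.sum((mask * (R - P @ Q)) ** 2) / np.sum(mask))

# Toy 3x3 rating matrix; 0 means "not rated" (hypothetical values)
toy_R = np.array([[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [0.0, 2.0, 5.0]])
P0, Q0 = factorize_sgd(toy_R, epochs=0)  # untrained factors
P1, Q1 = factorize_sgd(toy_R)            # trained factors
mse_before = masked_mse(toy_R, P0, Q0)
mse_after = masked_mse(toy_R, P1, Q1)
print(f"MSE before = {mse_before:.4f}, after = {mse_after:.4f}")
```

Collecting the observed pairs once with `np.argwhere` avoids visiting the empty cells of the matrix in every epoch, which matters for a matrix as sparse as `rating_mat`.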

In [62]:
def userbase_rating(user, movie):
    """Predict `user`'s rating of `movie` from the 9 most similar users who rated it."""
    # Users who have rated this movie (0 means "not rated")
    rating_available = rating_df[rating_df.loc[:, movie] != 0].index.tolist()
    user_j = rating_df.loc[user]  # target user's rating vector
    # Similarity = unnormalized dot product of the two rating vectors
    sim_dict = {i: np.dot(rating_df.loc[i], user_j) for i in rating_available}
    # Keep the 9 neighbours with the highest similarity
    most_9_sim = sorted(sim_dict.items(), key=lambda item: item[1], reverse=True)[:9]
    # Similarity-weighted average of the neighbours' ratings
    x = sum(sim * rating_df.loc[i, movie] for i, sim in most_9_sim)  # numerator
    y = sum(sim for _, sim in most_9_sim)  # denominator
    return x / y
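
`userbase_rating` weights neighbours by the raw dot product, which tends to favour users who rated many movies, and its final division fails when no neighbour has a positive similarity. One possible variant, sketched here under assumptions, uses cosine similarity and returns NaN when no usable neighbour exists so that the caller can fall back to another estimate. The name `cosine_knn_rating`, the parameter `k`, and the toy matrix are hypothetical, and this variant was not evaluated on the validation set.

```python
import numpy as np
import pandas as pd

def cosine_knn_rating(ratings, user, movie, k=9):
    """Similarity-weighted average of the k nearest users' ratings of `movie`.

    `ratings` is a user x movie DataFrame in which 0 means "not rated".
    Returns NaN when the user or movie is unknown or no neighbour is usable.
    """
    if user not in ratings.index or movie not in ratings.columns:
        return float("nan")
    # Other users who rated this movie
    mask = (ratings[movie].to_numpy() != 0) & (ratings.index.to_numpy() != user)
    raters = ratings.index[mask]
    if len(raters) == 0:
        return float("nan")
    target = ratings.loc[user].to_numpy(dtype=float)
    others = ratings.loc[raters].to_numpy(dtype=float)
    # Cosine similarity; rows with a zero norm get similarity 0
    norms = np.linalg.norm(others, axis=1) * np.linalg.norm(target)
    sims = np.divide(others @ target, norms, out=np.zeros(len(raters)), where=norms > 0)
    top = np.argsort(-sims, kind="stable")[:k]
    weights = sims[top]
    if weights.sum() <= 0:
        return float("nan")
    neighbour_ratings = ratings.loc[raters[top], movie].to_numpy(dtype=float)
    return float(weights @ neighbour_ratings / weights.sum())

# Toy matrix: rows are userIds 1..3, columns are movieIds 10, 20, 30 (hypothetical values)
toy_ratings = pd.DataFrame(
    [[5, 4, 0], [5, 4, 3], [1, 0, 5]], index=[1, 2, 3], columns=[10, 20, 30], dtype=float
)
toy_estimate = cosine_knn_rating(toy_ratings, user=1, movie=30)
print(f"{toy_estimate:.3f}")
```

In the toy matrix, user 2 rates the shared movies exactly like user 1, so the estimate lands much closer to user 2's rating of 3 than to user 3's rating of 5.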
    

In [66]:
# Model construction
avg_rating = ratings_train[['movieId', 'rating']].groupby(by='movieId').mean()
users_bias = ratings_train[['userId', 'rating']].groupby(by='userId').mean()
# Scalar mean of the per-movie average ratings (not the mean over all individual ratings)
overall_mean = avg_rating['rating'].mean()

# Prediction
def predict_rating(df):
    # Input:
    #   df = a dataframe with two columns: userId, movieId
    # Output:
    #   a dataframe with three columns: userId, movieId, rating
    predict_list = []
    for userId, movieId in zip(df['userId'], df['movieId']):
        # Deviations of this movie's and this user's mean rating from overall_mean
        movie_bias = avg_rating.loc[movieId, 'rating'] - overall_mean
        user_bias = users_bias.loc[userId, 'rating'] - overall_mean
        # Blend: 35% user-based neighbourhood estimate, 65% bias baseline
        baseline = overall_mean + movie_bias + user_bias
        pred = 0.35 * userbase_rating(userId, movieId) + 0.65 * baseline
        predict_list.append(float(pred))
    # Return a copy so the caller's dataframe is not modified in place
    result = df.copy()
    result.insert(2, "rating", predict_list, True)
    return result
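
The task requires `predict_rating` to run on an undisclosed test set. If that set contains a user or movie with no training ratings (an assumption, since its contents are unknown), the `.loc` lookups above raise a `KeyError`. The sketch below shows one way to make the bias baseline tolerant of unseen IDs by treating their bias as zero. The name `baseline_with_fallback`, the toy data, and the optional clipping to the 0.5 to 5.0 star scale are illustrative choices, not part of the submitted model.

```python
import pandas as pd

def baseline_with_fallback(df, train):
    """Bias baseline that uses a zero bias for users or movies unseen in `train`."""
    movie_mean = train.groupby("movieId")["rating"].mean()
    user_mean = train.groupby("userId")["rating"].mean()
    overall = movie_mean.mean()  # mean of the per-movie averages, as in overall_mean
    # Series.map yields NaN for IDs missing from training; fillna(overall) makes the bias 0
    movie_bias = df["movieId"].map(movie_mean).fillna(overall).to_numpy() - overall
    user_bias = df["userId"].map(user_mean).fillna(overall).to_numpy() - overall
    out = df[["userId", "movieId"]].copy()
    # Optional: keep predictions inside the MovieLens rating scale
    out["rating"] = (overall + movie_bias + user_bias).clip(0.5, 5.0)
    return out

# Toy data (hypothetical values); user 99 and movie 999 never appear in training
toy_fit = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 2.0, 5.0]})
toy_query = pd.DataFrame({"userId": [2, 99, 99], "movieId": [20, 999, 10]})
toy_out = baseline_with_fallback(toy_query, toy_fit)
print(toy_out)
```

The same pattern could wrap the neighbourhood term as well, for example by using only the baseline whenever the user-based estimate cannot be computed.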

In [67]:
# Prepare df for prediction
r = ratings_valid[['userId', 'movieId']].copy()  # copy so ratings_valid itself is never modified

# Predict ratings
ratings_pred = predict_rating(r)


In [68]:
ratings_pred.head(10)

,userId,movieId,rating
0,4,45,3.570818
1,4,52,3.621184
2,4,58,4.080631
3,4,222,4.139533
4,4,247,4.183271
5,4,265,3.967303
6,4,319,3.711424
7,4,345,3.732459
8,4,417,4.269596
9,4,441,4.293134


In [69]:
from sklearn.metrics import mean_squared_error

r_true = ratings_valid['rating'].to_numpy()
r_pred = ratings_pred['rating'].to_numpy()

# Take the square root explicitly: the `squared` argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(r_true, r_pred))
print(f"RMSE = {rmse:.4f}")

RMSE = 0.8621
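
The 0.35 / 0.65 blend in `predict_rating` is a fixed choice. One way such a weight could be selected, sketched under the assumption that the two component predictions are available as separate arrays, is a simple grid search over the weight. `best_blend_weight` and the toy arrays are hypothetical, and the notebook does not state how its weights were chosen.

```python
import numpy as np

def best_blend_weight(y_true, pred_a, pred_b, grid=None):
    """Return (w, rmse) minimising the RMSE of w*pred_a + (1-w)*pred_b over a grid of w."""
    y_true, pred_a, pred_b = (np.asarray(v, dtype=float) for v in (y_true, pred_a, pred_b))
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)  # steps of 0.05
    scores = [
        np.sqrt(np.mean((y_true - (w * pred_a + (1 - w) * pred_b)) ** 2)) for w in grid
    ]
    best = int(np.argmin(scores))
    return float(grid[best]), float(scores[best])

# Toy check: pred_a is perfect and pred_b is constant, so all weight should go to pred_a
toy_true = [4.0, 2.0, 5.0]
toy_w, toy_score = best_blend_weight(toy_true, pred_a=[4.0, 2.0, 5.0], pred_b=[3.0, 3.0, 3.0])
print(toy_w, toy_score)
```

A weight picked on the validation set is fitted to that set, so the validation RMSE it reports is likely to be somewhat optimistic for unseen data; holding out part of the training data for the search is one way to limit that.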


