# DES431 Project: Recommendation System

# Background

**MovieLens** is a movie recommendation system operated by GroupLens, a research group at the University of Minnesota. MovieLens has been developed to provide personalized movie recommendations to its users based on their viewing history and preferences.

# Tasks

1. This project is to be completed by a group of up to three students.
2. Propose and implement your own recommendation system based on the MovieLens dataset.
   - Use `ratings_train.csv` as the training set and `ratings_valid.csv` as the validation set.
   - Your recommendation system may utilize information from `movies.csv` for making recommendations.
   - The structure of the data files is detailed at `https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html`.
   - The goal of the recommendation system is to minimize the root-mean-square error (RMSE), i.e., to minimize the difference between the predicted and actual ratings.
   - You are required to modify the provided program to enhance recommendation quality. Submitting the original, unaltered program will be considered plagiarism.
3. Prepare slides for a 7-minute presentation that explains your proposed technique and algorithm for making recommendations, and demonstrates your RMSE results on the validation set. The slides must include:
   - A diagram and detailed explanation of your model
   - Results on the validation set
   - A discussion of the pros and cons of your model
4. Submit your Python notebook and the presentation slides in PDF format via Google Classroom by May 5, 2025, at 23:59. 
   - All members of the group must individually submit their work to Google Classroom. 
   - Late submissions will not be accepted and will incur a 10% deduction. 
   - Do not procrastinate. Plagiarism and code duplication will be rigorously checked.
5. Present your work on either May 7 or May 14, within a 7-minute timeframe. Presentations exceeding 7 minutes will result in point deductions. The presentation schedule will be announced later.
6. Attend the presentations in person in the classroom on both days. A late penalty will be applied.
7. Evaluate the presentations of all groups, including your own.

You need to complete all tasks (1–7). Failure to complete any task will result in a score deduction.
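
For reference, the RMSE objective named in task 2 can be written as below. The symbols are notation chosen here, not taken from the assignment: $\mathcal{V}$ is the set of (user, movie) pairs in the validation set, $r_{ui}$ is the actual rating, and $\hat{r}_{ui}$ is the predicted rating.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{V}|} \sum_{(u,i) \in \mathcal{V}} \left(\hat{r}_{ui} - r_{ui}\right)^2}
$$

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.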


In [1]:
# Edit this cell for the group name and members
# Group name: DeDuo
# Group member1: Nutthawee Charoenngampis (6522772662)
# Group member2: Siraprapha Pongpan (6522771268)


In [2]:
import numpy as np
import pandas as pd

# Loading data

In [3]:
ratings_train = pd.read_csv('./data/ratings_train.csv')
ratings_valid = pd.read_csv('./data/ratings_valid.csv')
movies = pd.read_csv('./data/movies.csv')

In [4]:
ratings_train.describe()

,userId,movieId,rating,timestamp
count,96464.0,96464.0,96464.0,96464.0
mean,327.86935,19105.768059,3.509325,1204483000.0
std,183.95296,35243.409786,1.041385,216528300.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1196.0,3.0,1013395000.0
50%,330.0,2959.0,3.5,1182909000.0
75%,479.0,7486.0,4.0,1435993000.0
max,610.0,193609.0,5.0,1537799000.0


In [5]:
ratings_train.head(10)

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


In [6]:
ratings_valid.head(10)

,userId,movieId,rating,timestamp
0,4,45,3.0,986935047
1,4,52,3.0,964622786
2,4,58,3.0,964538444
3,4,222,1.0,945629040
4,4,247,3.0,986848894
5,4,265,5.0,964538468
6,4,319,5.0,945079182
7,4,345,4.0,945629063
8,4,417,2.0,945078467
9,4,441,1.0,986934915


In [7]:
movies.head(10)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


# Constructing model and predicting ratings

## Example 1

Predict ratings using the average rating

In [8]:
# Model construction: calculate the average rating for each movie from the training set
avg_rating = ratings_train[['movieId', 'rating']].groupby(by='movieId').mean()

# Merge the average rating with the validation set to get the predicted ratings
ratings_pred = ratings_valid[['userId', 'movieId']].copy()
ratings_pred = ratings_pred.merge(avg_rating, on='movieId', how='left')

In [9]:
# Display the predicted ratings
ratings_pred.head(10)

,userId,movieId,rating
0,4,45,3.366667
1,4,52,3.52
2,4,58,4.0625
3,4,222,3.928571
4,4,247,3.975
5,4,265,3.913793
6,4,319,3.526316
7,4,345,3.573529
8,4,417,4.0
9,4,441,3.986486


In [10]:
# Calculate RMSE
from sklearn.metrics import root_mean_squared_error

r_true = ratings_valid['rating'].to_numpy()
r_pred = ratings_pred['rating'].to_numpy()

rmse = root_mean_squared_error(r_true, r_pred)
print(f"RMSE = {rmse:.4f}")

RMSE = 0.9171
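
In Example 1, the left merge leaves `NaN` as the prediction for any validation movie that has no ratings in the training set. A minimal sketch of one way to guard against that is to fall back to the global mean rating. The toy DataFrames, the `pred` column name, and the choice of fallback are assumptions for illustration, not part of the provided program.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), mirroring the columns of ratings_train / ratings_valid
toy_train = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.0],
})
toy_valid = pd.DataFrame({
    "userId":  [3, 3],
    "movieId": [10, 99],  # movie 99 never appears in toy_train
    "rating":  [4.0, 3.0],
})

# Per-movie mean rating, plus a global mean used as the fallback
global_mean = toy_train["rating"].mean()
movie_mean = (
    toy_train.groupby("movieId", as_index=False)["rating"]
    .mean()
    .rename(columns={"rating": "pred"})
)

# Left merge keeps every validation row; unseen movies get NaN, then the fallback
toy_pred = toy_valid[["userId", "movieId"]].merge(movie_mean, on="movieId", how="left")
toy_pred["pred"] = toy_pred["pred"].fillna(global_mean)

errors = toy_valid["rating"].to_numpy() - toy_pred["pred"].to_numpy()
toy_rmse = float(np.sqrt(np.mean(errors ** 2)))
print(f"toy RMSE = {toy_rmse:.4f}")
```

The same `fillna` step could be applied to `ratings_pred['rating']` before computing the RMSE.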


## Example 2

Use `Surprise`, a Python library for recommender systems (https://surpriselib.com/), to construct an SVD model and perform predictions
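
As described in the Surprise documentation, its `SVD` algorithm (with the default `biased=True`) predicts a rating from a global mean $\mu$, a user bias $b_u$, a movie bias $b_i$, and the dot product of a movie factor vector $q_i$ and a user factor vector $p_u$:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

The parameters are fitted by stochastic gradient descent on the regularized squared error over the known training ratings:

$$
\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)
$$

In the cell below, `n_factors` sets the length of $p_u$ and $q_i$, `n_epochs` the number of passes over the training ratings, `lr_all` the learning rate, and `reg_all` the regularization weight $\lambda$.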

In [11]:
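from surprise import Dataset, Reader, SVD

# Note: the training ratings go down to 0.5 (see describe() above), so
# rating_scale=(0.5, 5) would match the data; with (1, 5), estimates are clipped at 1.
data_train = Dataset.load_from_df(ratings_train[['userId', 'movieId', 'rating']], reader=Reader(rating_scale=(1, 5)))
data_valid = Dataset.load_from_df(ratings_valid[['userId', 'movieId', 'rating']], reader=Reader(rating_scale=(1, 5)))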

In [12]:
from surprise import accuracy

# Use every row of ratings_train for fitting, then train a biased SVD model
trainset = data_train.build_full_trainset()
model2 = SVD(n_factors=200, n_epochs=1000, lr_all=0.005, reg_all=0.02)
model2.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1269cb140>

In [13]:
validset = data_valid.build_full_trainset().build_testset()
predictions = model2.test(validset)
predictions_df = pd.DataFrame(predictions, columns=['uid', 'iid', 'true_r', 'est', 'details'])
accuracy.rmse(predictions, verbose=True)

RMSE: 0.8279


0.8279438003403604

In [14]:
predictions_df.head(10)

,uid,iid,true_r,est,details
0,4,45,3.0,3.767924,{'was_impossible': False}
1,4,52,3.0,3.138017,{'was_impossible': False}
2,4,58,3.0,4.20677,{'was_impossible': False}
3,4,222,1.0,3.475641,{'was_impossible': False}
4,4,247,3.0,4.091084,{'was_impossible': False}
5,4,265,5.0,2.901886,{'was_impossible': False}
6,4,319,5.0,3.490339,{'was_impossible': False}
7,4,345,4.0,2.763079,{'was_impossible': False}
8,4,417,2.0,3.332777,{'was_impossible': False}
9,4,441,1.0,3.606849,{'was_impossible': False}


## Example 3

Use `Surprise` to construct a user-based collaborative filtering model
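
The idea behind mean-centred user-based filtering is to start from the target user's own average rating, then add a similarity-weighted average of how far the nearest neighbours' ratings for that movie deviate from their own averages. The sketch below is a simplified illustration on a tiny made-up matrix, not Surprise's actual implementation; the matrix `R` and the helper names `cosine_on_common` and `predict_user_knn` are hypothetical.

```python
import numpy as np

# Toy user x movie rating matrix (hypothetical); np.nan marks a missing rating
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 2.0,    1.0],
    [1.0, 2.0, 5.0,    np.nan],
])

def cosine_on_common(a, b):
    """Cosine similarity over the movies that both users have rated."""
    common = ~np.isnan(a) & ~np.isnan(b)
    if not common.any():
        return 0.0
    x, y = a[common], b[common]
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def predict_user_knn(R, u, i, k=2):
    """Mean-centred user-based prediction of user u's rating for movie i."""
    user_means = np.nanmean(R, axis=1)
    # Candidate neighbours: other users who have rated movie i
    candidates = [v for v in range(R.shape[0]) if v != u and not np.isnan(R[v, i])]
    # Keep the k most similar candidates
    sims = sorted(((cosine_on_common(R[u], R[v]), v) for v in candidates), reverse=True)[:k]
    denom = sum(s for s, _ in sims)
    if denom <= 0:
        return float(user_means[u])  # no usable neighbours: fall back to the user's mean
    num = sum(s * (R[v, i] - user_means[v]) for s, v in sims)
    return float(user_means[u] + num / denom)

pred = predict_user_knn(R, u=0, i=2)
print(f"predicted rating of user 0 for movie 2: {pred:.3f}")
```

In the cell below, `k=10` caps the number of neighbours per prediction, and `'user_based': True` selects similarities between users rather than between movies.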

In [15]:
from surprise import KNNWithMeans

model3 = KNNWithMeans(k=10, sim_options={'name': 'cosine', 'user_based': True})
model3.fit(trainset)
predictions = model3.test(validset)
predictions_df = pd.DataFrame(predictions, columns=['uid', 'iid', 'true_r', 'est', 'details'])
accuracy.rmse(predictions, verbose=True)
predictions_df.head(10)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8701


,uid,iid,true_r,est,details
0,4,45,3.0,3.861059,"{'actual_k': 10, 'was_impossible': False}"
1,4,52,3.0,3.814467,"{'actual_k': 10, 'was_impossible': False}"
2,4,58,3.0,4.004362,"{'actual_k': 10, 'was_impossible': False}"
3,4,222,1.0,4.081305,"{'actual_k': 10, 'was_impossible': False}"
4,4,247,3.0,4.139495,"{'actual_k': 10, 'was_impossible': False}"
5,4,265,5.0,3.480554,"{'actual_k': 10, 'was_impossible': False}"
6,4,319,5.0,3.925908,"{'actual_k': 10, 'was_impossible': False}"
7,4,345,4.0,3.766812,"{'actual_k': 10, 'was_impossible': False}"
8,4,417,2.0,4.140247,"{'actual_k': 10, 'was_impossible': False}"
9,4,441,1.0,4.074553,"{'actual_k': 10, 'was_impossible': False}"


## My code


In [1]:
### base model + movie genre

import pandas as pd
import numpy as np
from surprise import SVDpp, SVD, KNNBaseline, BaselineOnly, Dataset, Reader
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import GradientBoostingRegressor

# Load data
ratings_train = pd.read_csv("./data/ratings_train.csv")
ratings_valid = pd.read_csv("./data/ratings_valid.csv")
movies = pd.read_csv("./data/movies.csv")

# Convert the training data to Surprise format
reader = Reader(rating_scale=(0.5, 5.0))
data_train = Dataset.load_from_df(ratings_train[['userId', 'movieId', 'rating']], reader)
trainset = data_train.build_full_trainset()

# Compute user and movie average ratings
user_avg = ratings_train.groupby('userId')['rating'].mean()
movie_avg = ratings_train.groupby('movieId')['rating'].mean()

# One-hot encode the genres: one 0/1 column per genre
movies['genres'] = movies['genres'].apply(lambda x: x.split('|'))
mlb = MultiLabelBinarizer()
genres_encoded = mlb.fit_transform(movies['genres'])
genres_df = pd.DataFrame(genres_encoded, columns=mlb.classes_)
movie_genres = pd.concat([movies[['movieId']], genres_df], axis=1)

# Train base models including svdpp, svd, knn, baseline
models = {
    "svdpp": SVDpp(n_factors=120, n_epochs=30, lr_all=0.005, reg_all=0.1),
    "svd": SVD(n_factors=120, n_epochs=30, lr_all=0.005, reg_all=0.1),
    "knn": KNNBaseline(k=50, sim_options={'name': 'pearson_baseline', 'user_based': False}),
    "baseline": BaselineOnly()
}

print("Training base models...")
for name, model in models.items():
    model.fit(trainset)

# Build meta-features for each validation rating
X_meta = []  # one feature row per (user, movie) pair
y_true = []  # the corresponding actual ratings

for _, row in ratings_valid.iterrows():
    uid, iid, true_r = row['userId'], row['movieId'], row['rating']
    
    # Base model predictions
    base_preds = [model.predict(uid, iid).est for model in models.values()]
    
    # User average
    user_avg_rating = user_avg.get(uid, 3.5)
    
    # Movie average
    movie_avg_rating = movie_avg.get(iid, 3.5)
    
    # Genre features
    genre_row = movie_genres[movie_genres['movieId'] == iid]
    if not genre_row.empty:
        genre_features = genre_row.iloc[0, 1:].values
    else:
        genre_features = np.zeros(len(mlb.classes_))
    
    # Combine all features
    features = base_preds + [user_avg_rating, movie_avg_rating] + list(genre_features)
    
    X_meta.append(features)
    y_true.append(true_r)

X_meta = np.array(X_meta)
y_true = np.array(y_true)

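# Train the meta-model.
# Caveat: it is fitted on the validation rows themselves (X_meta, y_true).
meta_model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=5)
meta_model.fit(X_meta, y_true)

# Predict and evaluate.
# Caveat: these are the same rows the meta-model was fitted on, so the RMSE
# printed here is an in-sample (training) error, not a held-out validation
# error, and it is not directly comparable with the RMSE of Examples 1-3.
y_pred = meta_model.predict(X_meta)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"Final Ensemble RMSE with extra features: {rmse:.4f}")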

Training base models...
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Final Ensemble RMSE with extra features: 0.5762
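
The 0.5762 above is measured on the same validation rows the meta-model was fitted to, so it is an in-sample error. A rough sketch of one way to get a held-out estimate is to split the meta-feature rows, fit the meta-model on one part, and score it on the other. The synthetic `X_meta_demo` and `y_demo` arrays are stand-ins (assumptions) for the real `X_meta` and `y_true`, and the 50/50 split is arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_meta / y_true (hypothetical data, for illustration only)
X_meta_demo = rng.normal(size=(600, 6))
y_demo = X_meta_demo[:, 0] + rng.normal(scale=0.9, size=600)

# Hold out half of the rows; the meta-model never sees them during fitting
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_meta_demo, y_demo, test_size=0.5, random_state=0
)

meta_demo = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=5, random_state=0
)
meta_demo.fit(X_fit, y_fit)

# Compare the error on the fitted rows with the error on the unseen rows
rmse_in_sample = float(np.sqrt(mean_squared_error(y_fit, meta_demo.predict(X_fit))))
rmse_holdout = float(np.sqrt(mean_squared_error(y_hold, meta_demo.predict(X_hold))))
print(f"in-sample RMSE: {rmse_in_sample:.4f}")
print(f"held-out RMSE:  {rmse_holdout:.4f}")
```

With the real data, another option would be to build the meta-features from out-of-fold predictions on the training set and keep the whole validation set for scoring. Either way, the figure to report is the one computed on rows the meta-model has not seen.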
