# DES431 Project: Recommendation System

# Background

**MovieLens** is a movie recommendation system operated by GroupLens, a research group at the University of Minnesota. MovieLens has been developed to provide personalized movie recommendations to its users based on their viewing history and preferences.

# Task

1. This project is to be completed by a group of three students.
2. Propose and implement your own recommendation system based on the MovieLens dataset.
   - Use `ratings_train.csv` as the training set and `ratings_valid.csv` as the validation set.
   - Your recommendation system may utilize information from `movies.csv` for making recommendations.
   - The structure of the data files is detailed at `https://files.grouplens.org/datasets/movielens/ml-latest-small-README.html`.
   - The goal of the recommendation system is to minimize the root-mean-square error (RMSE), i.e., to minimize the difference between the predicted and actual ratings.
   - Implement a function named `predict_rating`. This function should accept a DataFrame with two columns: `userId` and `movieId`, and return the DataFrame with an additional column named `rating`, containing predicted ratings of a `movieId` by a `userId`.
   - The `predict_rating` function must be compatible with an undisclosed test set that has the same format as the validation set. Your implementation will be evaluated on this test set. Failure to comply will result in a 50% deduction of your score.
   - You are required to modify the given program to enhance recommendation quality. Submitting the unaltered original program will be considered plagiarism.
3. Prepare slides for a 7-minute presentation that explains your proposed technique and algorithm for making recommendations, and demonstrates your RMSE results on the validation set.
4. Submit your Python notebook and the presentation slides in PDF format via Google Classroom by April 30, 2024, at 23:59. All members of the group must individually submit their work to Google Classroom. Late submissions will not be accepted and will incur a 10% deduction. Do not procrastinate. Plagiarism and code duplication will be rigorously checked.
5. Present your work on May 1, 2024, within a 7-minute timeframe. Presentations exceeding 7 minutes will result in point deductions.
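
The RMSE objective named in item 2 is the square root of the mean squared difference between predicted and actual ratings. The sketch below computes it with the standard library only; the helper name `rmse` and the toy ratings are illustrative assumptions and are not part of the provided program or dataset.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy ratings, made up for illustration (not taken from the dataset)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

# Squared errors are 0.25, 0, 1, 1; their mean is 0.5625, whose square root is 0.75
print(f"RMSE = {rmse(actual, predicted):.4f}")
```

Because the errors are squared before averaging, a single prediction that is off by 1.0 costs as much as four predictions that are each off by 0.5.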


In [None]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     772.0/772.0 kB 10.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3162998 sha256=8733d575a53936c934257c892fde3d9cf1c2216c7ee0fe78d1675accf81d1b21
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [None]:
import numpy as np
import pandas as pd

# Loading data

In [None]:
ratings_train = pd.read_csv('ratings_train.csv')
ratings_valid = pd.read_csv('ratings_valid.csv')
movies = pd.read_csv('movies.csv')

In [None]:
ratings_train.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,96464.0,96464.0,96464.0,96464.0
mean,327.86935,19105.768059,3.509325,1204483000.0
std,183.95296,35243.409786,1.041385,216528300.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1196.0,3.0,1013395000.0
50%,330.0,2959.0,3.5,1182909000.0
75%,479.0,7486.0,4.0,1435993000.0
max,610.0,193609.0,5.0,1537799000.0


In [None]:
ratings_train.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


In [None]:
movies.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


# Constructing model and predicting ratings

In [None]:
# Check for missing values (NaN) in each column
nan_counts = ratings_train.isna().sum()
print(nan_counts)

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64


In [None]:
from surprise import SVD, Dataset, Reader

# Model construction
reader = Reader(rating_scale=(ratings_train['rating'].min(), ratings_train['rating'].max()))
data = Dataset.load_from_df(ratings_train[['userId', 'movieId', 'rating']], reader)
model = SVD(n_factors=300, n_epochs=1200, lr_all=0.007, reg_all=0.09)

# Train the model on the full training set
trainset = data.build_full_trainset()
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7b69ce446f50>
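
For readers unfamiliar with this model, the following is a minimal sketch of the prediction rule a biased matrix-factorization model such as Surprise's `SVD` applies once it is fitted: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors, clipped to the rating scale. The function name `svd_estimate` and all numeric values are hypothetical; a fitted model learns the biases and factors from the training data.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(0.5, 5.0)):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u, clipped to the scale."""
    raw = mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))
    lo, hi = rating_scale
    return min(hi, max(lo, raw))

# Hypothetical values for one (user, movie) pair
mu = 3.5               # global mean rating
b_u, b_i = 0.2, -0.1   # user bias and item bias
p_u = [0.5, -0.2]      # user factors (2 factors here for readability, 300 in the cell above)
q_i = [0.4, 0.3]       # item factors

estimate = svd_estimate(mu, b_u, b_i, p_u, q_i)
print(round(estimate, 2))
```

In this sketch the dot product contributes 0.14, so the estimate is 3.5 + 0.2 - 0.1 + 0.14. The hyperparameters `lr_all` and `reg_all` control how the biases and factors are learned, not how they are combined at prediction time.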

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

def prepare_genre_similarity(movies):
    # Raw string avoids an invalid-escape warning; each pipe-separated genre becomes one token
    tfidf = TfidfVectorizer(token_pattern=r'[^|]+')
    movie_genres_tfidf = tfidf.fit_transform(movies['genres'])
    genre_similarity = cosine_similarity(movie_genres_tfidf)
    movie_idx = pd.Series(data=movies.index, index=movies['movieId']).to_dict()
    return genre_similarity, movie_idx
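
To build intuition for what the genre similarity matrix holds, here is a simplified sketch that computes cosine similarity between binary genre vectors using the standard library only. It deliberately omits the TF-IDF weighting applied by `prepare_genre_similarity`, so the values will differ from the matrix above; the helper names `genre_set` and `genre_cosine` are assumptions made for illustration.

```python
import math

def genre_set(genres):
    """Split a pipe-separated genre string such as 'Comedy|Romance' into a set."""
    return set(genres.split("|"))

def genre_cosine(genres_a, genres_b):
    """Cosine similarity between binary genre vectors (no TF-IDF weighting)."""
    a, b = genre_set(genres_a), genre_set(genres_b)
    return len(a & b) / math.sqrt(len(a) * len(b))

# Genre strings copied from the movies.head(10) output above
jumanji = "Adventure|Children|Fantasy"
tom_and_huck = "Adventure|Children"
heat = "Action|Crime|Thriller"

# Two shared genres out of 3 and 2: 2 / sqrt(6)
print(round(genre_cosine(jumanji, tom_and_huck), 4))
# No shared genres: similarity is zero
print(genre_cosine(jumanji, heat))
```

Since genre vectors have no negative entries, the similarity always falls between 0 and 1. TF-IDF additionally down-weights genres that appear in many movies, so a shared rare genre counts for more than a shared common one.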

##### Implement function for prediction

In [None]:
def predict_rating(df):
    user_preferences = ratings_train[ratings_train['rating'] >= 4]
    genre_similarity, movie_idx = prepare_genre_similarity(movies)

    pred_ratings = []
    for index, row in df.iterrows():
        uid, mid = row['userId'], row['movieId']
        pred = model.predict(uid, mid)
        prediction = pred.est

        # Content-based adjustment
        sim_scores = []
        try:
            liked_movies = user_preferences[user_preferences['userId'] == uid]
            for _, liked_row in liked_movies.iterrows():
                if liked_row['movieId'] in movie_idx and mid in movie_idx:
                    idx = movie_idx[mid]
                    liked_idx = movie_idx[liked_row['movieId']]
                    sim_score = genre_similarity[idx, liked_idx]
                    if np.isnan(sim_score):
                        continue  # Skip NaN similarity scores
                    sim_scores.append(sim_score)

            if sim_scores:
                genre_adjustment = np.mean(sim_scores)
                if np.isnan(genre_adjustment):
                    genre_adjustment = 0
                prediction += genre_adjustment * 0.2  # Adjust prediction based on similarity

            else:
                print(f"No valid similarity scores for UID {uid}, MID {mid}. No adjustment made.")
        except KeyError as e:
            print(f"KeyError for UID {uid}, MID {mid}: {e}")

        pred_ratings.append([uid, mid, prediction])

    # Attach predictions positionally so that duplicate (userId, movieId) rows
    # in df cannot multiply, as they would in a merge on those two columns
    result = df.copy()
    result['rating'] = [p[2] for p in pred_ratings]
    return result
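
The content-based adjustment in `predict_rating` only ever adds to the SVD estimate, because genre similarities are non-negative, so an adjusted prediction can exceed the top of the rating scale. The sketch below isolates the blending step and clips the result to the 0.5 to 5.0 range seen in `ratings_train.describe()`. The clipping step and the helper name `blend_prediction` are assumptions for illustration; they are not part of the function above.

```python
def blend_prediction(cf_estimate, sim_scores, weight=0.2, rating_scale=(0.5, 5.0)):
    """Add a weighted mean genre similarity to a collaborative-filtering estimate, then clip."""
    if sim_scores:
        cf_estimate += weight * (sum(sim_scores) / len(sim_scores))
    lo, hi = rating_scale
    return min(hi, max(lo, cf_estimate))

# 3.6 plus 0.2 times the mean similarity 0.75, so about 3.75
print(blend_prediction(3.6, [0.5, 1.0]))
# 4.95 + 0.2 would exceed the scale, so the result is clipped to 5.0
print(blend_prediction(4.95, [1.0, 1.0]))
# No liked movies with known genres: the estimate is returned unchanged
print(blend_prediction(3.2, []))
```

Because every true rating lies inside the scale, clipping an out-of-range prediction can only move it closer to the true value, so it never increases the squared error.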

In [None]:
# Prepare df for prediction
r = ratings_valid[['userId', 'movieId']]


# Predict ratings
ratings_pred = predict_rating(r)

In [None]:
ratings_pred.head(4)

Unnamed: 0,userId,movieId,rating
0,4,45,3.748059
1,4,52,3.001318
2,4,58,3.982842
3,4,222,3.604508


In [None]:
from sklearn.metrics import mean_squared_error

r_true = ratings_valid['rating'].to_numpy()
r_pred = ratings_pred['rating'].to_numpy()

# The `squared` argument is deprecated in recent scikit-learn releases; take the square root explicitly
rmse = np.sqrt(mean_squared_error(r_true, r_pred))
print(f"RMSE = {rmse:.4f}")

RMSE = 0.8166
