# **COMP9727 Recommender Systems**
## Tutorial Week 3: Collaborative Filtering

@Author: **Mingqin Yu**

@Reviewer: **Wayne Wobcke**

### Objective

The aim of this tutorial is to become more familiar with collaborative filtering for ratings prediction, focusing on neighbourhood methods (user-based and item-based similarity). These methods could be useful for the project, where they can form the basis of a recommender system.

### Before the tutorial

1. Review the lecture material on collaborative filtering and top-N metrics for evaluation of recommendation models.

2. Read, understand and run all the code in the ratings prediction pipeline below, and come prepared to discuss the answers to some of the questions.

3. Have a look at the `seaborn` library for visualization.

### Ratings Prediction Pipeline

1. **Environment Setup and Data Exploration**
    - Install necessary libraries and packages
    - Load the built-in MovieLens dataset
    - Load user data
    - Load movie data
2. **Initial Data Analysis**
    - Display individual user ratings
    - Display ratings for a specific movie
    - Visualize the data
3. **Build the Collaborative Filtering Model**
    - Load custom dataset in `surprise`
    - Split data into training and test sets
    - Build k-Nearest Neighbours model
4. **Model Evaluation**
    - Calculate predictions on the test set
    - Evaluate error using RMSE
5. **Using Predictions to Provide Recommendations**
    - Calculate personalized movie recommendations for users

**Specific Learning Objectives**

1. **Understand**
    - Differences between collaborative filtering and content-based recommendation
2. **Apply** 
    - Learn how to set up and use the `scikit-surprise` library to build a collaborative filtering recommender system
3. **Analyse**
    - Choose appropriate metrics to evaluate the performance of the recommendation model
4. **Evaluate**
    - Understand the differences between artificial problems and real-world applications of recommender systems

## Ratings Prediction and Recommendation Pipeline

### 1. Environment Setup and Data Exploration

__Step 1. Install libraries and packages__

Install (or import) the following packages
   - `pandas`: Data manipulation and analysis library
   - `matplotlib`: Plotting library
   - `scikit-surprise`: Provides tools for building collaborative filtering recommender systems
   - `seaborn`: Data visualization library based on `matplotlib`

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Careful: It might be better to install these outside Jupyter
!pip install scikit-surprise
!pip install seaborn

__Step 2. Load and Explore the Built-In MovieLens Dataset__

Load the MovieLens `ml-100k` dataset and convert it to a DataFrame for further analysis

- **Functions & Methods**
  - `Dataset.load_builtin()`: Loads a dataset from the `surprise` library
  - `pd.DataFrame()`: Converts raw data into a structured DataFrame
- **Parameters & Settings**
  - `ml-100k`: The dataset name, representing a collection of 100,000 movie ratings

In [None]:
import pandas as pd
from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
raw_data = data.raw_ratings

ratings = pd.DataFrame(raw_data, columns=['userId', 'movieId', 'rating', 'timestamp'])
print("First few rows of the ratings DataFrame:")
print(ratings.head())

__Step 3. Load User Data__


Fetch user-specific data from an online source to understand user demographics and attributes

- **Functions & Methods**
  - `pd.read_csv()`: Reads a CSV file and converts it into a DataFrame
- **Parameters & Settings**
  - `sep='|'`: Specifies the delimiter used in the data source
  - `names=user_cols`: Defines the column names for the DataFrame
  - `encoding='latin-1'`: Specifies the character encoding used in the data source

In [None]:
# Define the column names
user_cols = ['userId', 'age', 'gender', 'occupation', 'zip_code']

# Load the data into a pandas DataFrame
users = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.user', sep='|', names=user_cols, encoding='latin-1')

print("First few rows of the users DataFrame:")
print(users.head())

__Step 4. Load Movie Data__

Obtain details about movies, including title, release date and genres

- **Functions & Methods**
   - `pd.read_csv()`: Reads a CSV file and converts it into a DataFrame
- **Parameters & Settings**
  - `sep='|'`: Specifies the delimiter used in the data source
  - `names=item_cols`: Defines the column names for the DataFrame
  - `encoding='latin-1'`: Specifies the character encoding used in the data source

In [None]:
# Define the column names
item_cols = ['movieId', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

# Load the data into a pandas DataFrame
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item', sep='|', names=item_cols, encoding='latin-1')

pd.set_option('display.width', 150)
print("First few rows of the movies DataFrame:")
print(movies.head())

### 2. Initial Data Analysis

__Step 1. Display an Individual User's Ratings__

Extract and view ratings provided by a particular user to understand individual preferences

- **Functions & Methods**
  - DataFrame filtering: Uses boolean indexing to filter rows based on a condition
- **Parameters & Settings**
  - `ratings['userId'] == '196'`: The condition to filter ratings for the user with ID '196'

In [None]:
specific_user_ratings = ratings[ratings['userId'] == '196']
print(f"Ratings provided by user 196:\n {specific_user_ratings}")

__Step 2. Display Ratings for a Specific Movie__

Fetch and display ratings given to a specific movie (in this case, movie ID '50') by different users

- **Functions & Methods**
  - DataFrame filtering: Uses boolean indexing to filter rows based on a condition
- **Parameters & Settings**
  - `ratings['movieId'] == '50'`: The condition to filter ratings for the movie with ID '50'

In [None]:
# Display ratings given to a specific movie (e.g. movie ID '50') by different users
specific_movie_ratings = ratings[ratings['movieId'] == '50']
print(f"Ratings for movie ID 50:\n {specific_movie_ratings}")

__Step 3. Visualize the Data__

Visualize the distribution of ratings in the dataset

- **Functions & Methods**
  - `sns.set_style()`: Set the aesthetic style of the plots
  - `plt.figure()`: Create a new figure for plotting
  - `sns.countplot()`: Show the counts of observations in each categorical bin
  - `plt.title()`: Set the title for the plot
  - `plt.show()`: Display the figure
- **Parameters & Settings**
  - `"whitegrid"`: Grid style for the plot
  - `figsize=(8,6)`: Figure size
  - `x='rating'`: Column in the DataFrame to be used for the x-axis
  - `data=ratings`: DataFrame source for the plot

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
plt.figure(figsize=(8,6))
sns.countplot(x='rating', data=ratings)
plt.title('Distribution of Ratings')
plt.show()

### 3. Build the Collaborative Filtering Model

__Step 1. Load Custom Dataset in Surprise__

Convert the ratings DataFrame into a format suitable for the `surprise` library

- **Functions & Methods**
  - `Reader()`: A parser to read the dataset
  - `Dataset.load_from_df()`: Load a dataset from a pandas DataFrame
- **Parameters & Settings**
  - `rating_scale=(1, 5)`: Specifies the rating scale (1 to 5 for `ml-100k`)
  - `ratings[['userId', 'movieId', 'rating']]`: Columns from the DataFrame to be loaded
  - `reader`: The reader object to parse the dataset

In [None]:
from surprise import Reader

# Ratings in ml-100k lie on a 1 to 5 scale
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

__Step 2. Split Data into Training and Test Sets__

Split the dataset into training and test sets for model evaluation

- **Functions & Methods**
  - `train_test_split()`: Split the dataset into training and test sets
- **Parameters & Settings**
  - `test_size=0.2`: Specifies that 20% of the data will be used for testing

In [None]:
from surprise.model_selection import train_test_split

trainset, testset = train_test_split(data, test_size=0.2)
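
To make the effect of `test_size=0.2` concrete, here is a minimal sketch of a random hold-out split in plain Python. It only illustrates the idea and is not how `surprise` implements `train_test_split()`; the helper name `holdout_split` and the toy ratings are made up for this example.

```python
import random

def holdout_split(ratings, test_size=0.2, seed=0):
    """Shuffle (user, item, rating) tuples and hold out a fraction for testing."""
    shuffled = list(ratings)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: 10 ratings from 5 users on 2 items
toy_ratings = [(f"u{u}", f"i{i}", float((u + i) % 5 + 1))
               for u in range(5) for i in range(2)]

train, test = holdout_split(toy_ratings, test_size=0.2)
print(len(train), len(test))
```

Because the split is random, the RMSE reported later can differ from run to run. In `surprise`, passing a `random_state` argument to `train_test_split()` makes the split repeatable.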

__Step 3. Build k-Nearest Neighbours Model__

Initialize and train a k-Nearest Neighbours model for recommendations

- **Functions & Methods**
  - `KNNBasic()`: A basic collaborative filtering algorithm based on k-nearest neighbours
   - `fit()`: Train the model on the training set
- **Parameters & Settings**
  - `trainset`: The training data

In [None]:
from surprise import KNNBasic

algo = KNNBasic()
algo.fit(trainset)
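
To see what a neighbourhood method computes, the following sketch predicts a rating as the similarity-weighted average of the ratings given by the k most similar users. It is an illustration under stated assumptions, not the `surprise` implementation: the toy data, the helper names `cosine_sim` and `predict`, and the choice of cosine similarity are all made up for this example, and the defaults in `surprise` may differ.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
toy_user_ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m3": 5, "m4": 4},
    "carol": {"m1": 1, "m2": 5, "m4": 2},
    "dave":  {"m1": 5, "m2": 3, "m4": 5},
}

def cosine_sim(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item, k=2):
    """Similarity-weighted average over the k most similar users who rated item."""
    neighbours = [(cosine_sim(toy_user_ratings[user], other), other[item])
                  for name, other in toy_user_ratings.items()
                  if name != user and item in other]
    top_k = sorted(neighbours, reverse=True)[:k]   # most similar first
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbours; a real system needs a fallback
    return sum(sim * rating for sim, rating in top_k) / total_sim

print(predict("alice", "m4", k=2))
print(predict("alice", "m4", k=3))
```

Increasing k here pulls in a less similar user (carol), which lowers the prediction. In `surprise`, the neighbourhood size and similarity settings are passed to the algorithm, for example `KNNBasic(k=20, sim_options={'name': 'cosine', 'user_based': False})` for an item-based model.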

### 4. Model Evaluation

__Step 1. Calculate Predictions on the Test Set__

Make predictions for the user-item pairs in the test set to see the model's output on unseen data

- **Functions & Methods**
  - `test()`: This method of the trained model is used to make predictions on a given set of data

In [None]:
predictions = algo.test(testset)
for pred in predictions[:3]:
    print(pred)

__Step 2. Evaluate Error using RMSE__

Evaluate the performance of the model on the test set using RMSE (lower is better)

- **Functions & Methods**
  - `accuracy.rmse()`: Computes the Root Mean Squared Error (RMSE) between the true ratings and the predicted ratings
- **Parameters & Settings**
  - `RMSE`: A measure of the differences between values (ratings) predicted by the model and the actual observed values

In [None]:
from surprise import accuracy

accuracy.rmse(predictions)
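
RMSE is the square root of the mean of the squared differences between true and estimated ratings. The sketch below computes it by hand on a few made-up pairs; the helper name `rmse` and the data are illustrative only and are not part of `surprise`.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true_r - est) ** 2 for true_r, est in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, estimated) ratings with errors of 1, 0 and 2
toy_pairs = [(4.0, 3.0), (2.0, 2.0), (5.0, 3.0)]
print(round(rmse(toy_pairs), 3))  # 1.291
```

The mean absolute error of the same pairs is 1.0, so squaring makes the single error of 2 weigh more heavily. `surprise` also provides `accuracy.mae()` for comparison.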

### 5. Using Predictions to Provide Recommendations

__Step 1. Calculate personalized movie recommendations for users__

Fetch the top recommendations for users based on the trained model. The function below collects the N items with the highest predicted ratings for each user, and can then be used to display the top 10 (or any other number of) recommendations for specific users. Note that `predictions` covers only the user-item pairs in the test set

- **Functions & Methods**
  - `get_top_n_recommendations()`: A custom function (defined below) that processes the predictions and extracts the top N recommendations for each user

In [None]:
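def get_top_n_recommendations(predictions, n=10):
    """Map each user ID to their n (item ID, estimated rating) pairs with the highest estimates."""
    top_n = {}

    # Group the estimated ratings by user
    for uid, iid, true_r, est, _ in predictions:
        top_n.setdefault(uid, []).append((iid, est))

    # Sort each user's items by estimated rating (highest first) and keep the top n
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

top_n = get_top_n_recommendations(predictions, n=10)

# Show the recommended movie IDs for the first three users
for uid, user_ratings in list(top_n.items())[:3]:
    print(f"Top 10 recommendations for user {uid}: {', '.join([iid for iid, _ in user_ratings])}")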
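
Since the pre-reading covers top-N metrics, here is one possible sketch of precision@k computed from prediction tuples shaped like those above. The helper name `precision_at_k`, the toy tuples and the relevance threshold of 4.0 are assumptions made for illustration; definitions of precision@k vary, and this variant divides by the number of items actually in a user's top-k list.

```python
from collections import defaultdict

def precision_at_k(predictions, k=10, threshold=4.0):
    """Mean over users of the share of top-k items whose true rating >= threshold."""
    by_user = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        by_user[uid].append((est, true_r))

    per_user = []
    for uid, scored in by_user.items():
        scored.sort(key=lambda x: x[0], reverse=True)  # highest estimate first
        top_k = scored[:k]
        hits = sum(1 for _, true_r in top_k if true_r >= threshold)
        per_user.append(hits / len(top_k))
    return sum(per_user) / len(per_user)

# Hypothetical tuples: (user, item, true rating, estimated rating, details)
toy_predictions = [
    ("u1", "m1", 5.0, 4.6, None), ("u1", "m2", 2.0, 4.1, None),
    ("u1", "m3", 4.0, 3.0, None),
    ("u2", "m1", 4.0, 4.8, None), ("u2", "m4", 5.0, 4.2, None),
]
print(precision_at_k(toy_predictions, k=2))
```

For user `u1`, the item with the second-highest estimate has a true rating of only 2.0, so that user's precision@2 is 0.5 even though the estimates themselves look plausible. A low RMSE does not by itself guarantee a good ranking.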

## Discussion Questions

Experiment with the above code before class and think about the answers to the following questions. It does not matter if you do not do all the questions; the main thing is to do some of them and come to class ready for the discussion.

1. **Ratings Data**
    - What do you notice about the ratings scale and the distribution of ratings?
    - What is the "sparsity" of the ratings matrix, i.e. what proportion of values are undefined?
    - Does this make prediction/recommendation easier or harder?
2. **User Characteristics**
    - Analyse the demographics (age, occupation, etc.) of users in the dataset. Discuss any potential biases present.
    - If you have time, see if any of those potential biases are present in the data.
3. **Movie Characteristics**
    - Analyse movies based on their genres and discuss any noticeable trends or popular genres in the dataset.
    - Are certain genres more likely to receive higher ratings?
    - How might the release date of a movie influence its ratings?
    - How might the timestamps of the ratings be utilized?
4. **Model Building**
    - What is the default value of k used in `surprise`?
    - What is the effect of different values of k on prediction and recommendation?
    - What is the default similarity measure used in `surprise`?
    - What is the effect of different similarity measures on prediction and recommendation?
5. **Model Evaluation**
    - How does the training/test split work in this scenario? How does this affect the reliability of the evaluation?
    - Compare the RMSE achieved with different configurations or hyperparameters of the model.
    - What can we say about how well the method would work on cold-start problems (for both users and movies)?
    - How can the method be adjusted to provide more diverse recommendations?
    - Suggest a more suitable evaluation setup, including metrics, for these scenarios.
6. **Recommendation**
    - Compare the ratings of a few specific users. Discuss any similarities and differences observed.
    - How can individual user ratings be utilized to improve the recommendations?
    - How might outliers or extreme ratings from specific users affect the model?
    - How might overfitting manifest in the context of recommendation systems based on predictions?
    - How would this recommender system work in real time?
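
As a starting point for the sparsity question, the sketch below shows the calculation on a tiny made-up set of ratings: sparsity is one minus the number of observed ratings divided by the number of cells in the user-item matrix. The toy data and variable names are illustrative only; the same steps can be applied to the `ratings` DataFrame.

```python
# Hypothetical toy ratings: (user, item, rating)
toy_ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m3", 2.0),
    ("u3", "m2", 1.0),
]

n_users = len({user for user, _, _ in toy_ratings})
n_items = len({item for _, item, _ in toy_ratings})
n_possible = n_users * n_items              # cells in the user-item matrix
sparsity = 1 - len(toy_ratings) / n_possible

print(f"{n_users} users x {n_items} items, sparsity = {sparsity:.2f}")
```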