In [1]:
import numpy as np
import pandas as pd

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # to make jupyter print all outputs, not just the last one
from IPython.display import HTML # to pretty print pandas df and be able to copy them over (e.g. to ppt slides)

In [3]:
df = pd.read_parquet('cleaned/strat_2NDsample_netflix')
df2 = pd.read_parquet('cleaned/strat_sample_movielens')

### Netflix data stats:

In [4]:
# Extract all user IDs from the 'review_data' column using list comprehension
user_ids = [review_entry.get('userId') for row in df['review_data'] for review_entry in row if review_entry.get('userId')]

# Count the number of unique users and reviews
unique_users = set(user_ids)
amount_of_reviews = len(user_ids)

# Calculate averages
avg_reviews_per_unique_user = amount_of_reviews / len(unique_users)
avg_reviews_per_movie_id = amount_of_reviews / len(df)

# Print results
print("There are {} reviews in the NETFLIX dataframe.".format(amount_of_reviews))
print("There are {} unique users who have reviewed a movie.".format(len(unique_users)))
print("There are {} movieIds in the NETFLIX dataset.".format(len(df)))
print("A unique user places {} reviews on average in the NETFLIX dataset.".format(round(avg_reviews_per_unique_user)))
print("A movieId receives {} reviews on average in the NETFLIX dataset.".format(round(avg_reviews_per_movie_id)))

There are 1612370 reviews in the NETFLIX dataframe.
There are 344699 unique users who have reviewed a movie.
There are 250 movieIds in the NETFLIX dataset.
A unique user places 5 reviews on average in the NETFLIX dataset.
A movieId receives 6449 reviews on average in the NETFLIX dataset.


### Movielens data stats:

In [5]:
review_data2 = df2['review_data'].values
review_data2 = [row for row in review_data2 if row is not None]
user_ids2 = np.concatenate([np.array([entry['userId'] for entry in row]) for row in review_data2])
ratings2 = np.concatenate([np.array([entry['rating'] for entry in row]) for row in review_data2])
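# take the movieIds from df2 (the MovieLens frame) and skip rows without review data so they stay aligned with review_data2
movieIds2 = np.concatenate([[movieId] * len(row) for movieId, row in zip(df2['movieId'], df2['review_data']) if row is not None])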

avg_reviews_per_unique_user2 = len(ratings2) / len(np.unique(user_ids2))
avg_reviews_per_movie_id2 = len(ratings2) / len(df2)

# print results
print("There are {} reviews in the MOVIELENS dataframe.".format(len(ratings2)))
print("There are {} unique users who have reviewed a movie.".format(len(np.unique(user_ids2))))
print("There are {} movieIds in the MOVIELENS dataset.".format(len(df2)))
print("A unique user places {} reviews on average in the MOVIELENS dataset.".format(round(avg_reviews_per_unique_user2)))
print("A movieId receives {} reviews on average in the MOVIELENS dataset.".format(round(avg_reviews_per_movie_id2)))

There are 16555 reviews in the MOVIELENS dataframe.
There are 14324 unique users who have reviewed a movie.
There are 200 movieIds in the MOVIELENS dataset.
A unique user places 1 reviews on average in the MOVIELENS dataset.
A movieId receives 83 reviews on average in the MOVIELENS dataset.


**Conclusion:** the MovieLens sample is sparser than the Netflix sample (a larger share of user-movie pairs has no rating): the number of movieIds is comparable (200 vs. 250), but the average number of reviews per userId (about 1 vs. 5) and per movieId (83 vs. 6449) is much lower. We will see in the performance whether this makes a difference.
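
One way to put a number on this is the density of the user-item matrix, that is, the share of user-movie cells that hold a rating. The sketch below plugs in the counts printed above; the helper name `matrix_density` is hypothetical, and it assumes each user rates a given movie at most once.

```python
def matrix_density(n_reviews, n_users, n_movies):
    """Fraction of user-movie cells that contain a rating (assumes one rating per pair)."""
    return n_reviews / (n_users * n_movies)

# counts taken from the printed statistics above
netflix_density = matrix_density(1_612_370, 344_699, 250)
movielens_density = matrix_density(16_555, 14_324, 200)

print(f"Netflix density:   {netflix_density:.2%}")
print(f"MovieLens density: {movielens_density:.2%}")
```

With these counts the Netflix sample comes out at just under 2% filled and the MovieLens sample at well under 1%, which is in line with the conclusion.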

Define dataframes without the date item in the `review_data` dictionaries to start with; date features may be added later for both Netflix and MovieLens:

In [6]:
netflix_df = df.copy()  # work on a copy so the original dataframe stays untouched
netflix_df['review_data'] = netflix_df['review_data'].apply(lambda x: None if x is None else [{'userId': review['userId'], 'rating': review['rating']} for review in x if 'userId' in review and 'rating' in review])

In [7]:
movielens_df = df2.copy()  # work on a copy so the original dataframe stays untouched
movielens_df['review_data'] = movielens_df['review_data'].apply(lambda x: None if x is None else [{'userId': review['userId'], 'rating': review['rating']} for review in x if 'userId' in review and 'rating' in review])

### Feature engineering and preprocessing:

Year and title will be dropped:

In [8]:
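# drop year and title from the dataframes prepared above (review_data without the date item)
netflix_sample = netflix_df.drop(['year','title'],axis=1)
movielens_sample = movielens_df.drop(['year','title'],axis=1)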

Then, let's split our data into train, validation and test sets where we ensure that no training data flows into test and validation sets:

### Function Explanation

`train_val_test_split`

1. **Shuffle the Data**:
   - The input data is shuffled using `data.sample(frac=1, random_state=42)` to ensure randomness. `random_state=42` ensures reproducibility.

2. **Calculate Set Sizes**:
   - The sizes of each set (training, validation, and test) are calculated based on the provided ratios and the total number of samples in the data.

3. **Split the Data**:
   - The shuffled data is split into three sets: training, validation, and test.
   - The training data contains the first `num_train` samples.
   - The validation data contains the next `num_val` samples, starting from the index immediately following the last training sample.
   - The test data contains the remaining samples, starting from the index immediately following the last validation sample.

4. **Reset Index**:
   - The index of each set is reset to ensure that it starts from 0 and increases incrementally.

5. **Return Sets**:
   - The function returns the training, validation, and test sets as pandas DataFrames.

In [9]:
def train_val_test_split(data, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1):
    """
    Splits the data into training, validation, and test sets, simultaneously ensuring no training data flows into validation or test data.

    Parameters:
    - data: pandas DataFrame containing the data to be split.
    - train_ratio: float, ratio of the training set size to the total data size (default: 0.8).
    - val_ratio: float, ratio of the validation set size to the total data size (default: 0.1).
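    - test_ratio: float, ratio of the test set size to the total data size (default: 0.1). It is not used in the calculation: the test set receives all rows left after the train and validation sets.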

    Returns:
    - train_data: pandas DataFrame, training set.
    - val_data: pandas DataFrame, validation set.
    - test_data: pandas DataFrame, test set.
    """
    # Shuffle the data
    data_shuffled = data.sample(frac=1, random_state=42)

    # Calculate the sizes of the train and validation sets; the test set takes the remaining rows
    num_samples = len(data_shuffled)
    num_train = int(train_ratio * num_samples)
    num_val = int(val_ratio * num_samples)

    # Split the shuffled rows into three consecutive, non-overlapping slices,
    # so no training row can end up in the validation or test set.
    # Each slice gets a fresh index that starts at 0.
    train_data = data_shuffled[:num_train].reset_index(drop=True)
    val_data = data_shuffled[num_train:num_train + num_val].reset_index(drop=True)
    test_data = data_shuffled[num_train + num_val:].reset_index(drop=True)

    return train_data, val_data, test_data
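
As a quick sanity check, the same shuffle-then-slice idea can be tried on a toy frame. This is a minimal sketch: `split_rows` is a hypothetical, condensed stand-in for `train_val_test_split`, and the 20 movie rows are made up.

```python
import pandas as pd

def split_rows(data, train_ratio=0.8, val_ratio=0.1):
    # condensed stand-in for train_val_test_split: shuffle once, then slice
    shuffled = data.sample(frac=1, random_state=42)
    num_train = int(train_ratio * len(shuffled))
    num_val = int(val_ratio * len(shuffled))
    train = shuffled[:num_train]
    val = shuffled[num_train:num_train + num_val]
    test = shuffled[num_train + num_val:]
    return train, val, test

# toy frame with 20 hypothetical movie rows
toy = pd.DataFrame({"movieId": range(100, 120)})
train, val, test = split_rows(toy)

print(len(train), len(val), len(test))

# every movieId should land in exactly one of the three sets
train_ids, val_ids, test_ids = (set(part["movieId"]) for part in (train, val, test))
print(train_ids & val_ids, train_ids & test_ids, val_ids & test_ids)
```

Each row here is a movie, so the split keeps movies apart; the same user can still appear in more than one set.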

Let's split both datasets accordingly; they differ in sample size, which lets us see what effect that has on model performance:

In [10]:
# Netflix dataset splitting
train_data, val_data, test_data = train_val_test_split(netflix_sample)
# MovieLens dataset splitting
train_data2, val_data2, test_data2 = train_val_test_split(movielens_sample)

Subsequently, let's define some functions to make our life easier and keep the pipeline compatible with more datasets. We gather the unique item and user IDs, create a user-item matrix that is then centered, perform SVD, and make recommendations using the dot product of the decomposed matrices resulting from SVD:

### Set-up user-item matrix
First we will create a user-item matrix which records all the user-item interactions.


### `create_user_item_matrix` Function Explanation

### Steps:
1. **Extract Review Data**:
   - Extract the review data from the provided DataFrame, which contains user IDs, ratings, and movie IDs.

2. **Create User and Movie IDs Arrays**:
   - Extract user IDs, ratings, and movie IDs from the review data and concatenate them into separate arrays.
   - Generate dictionaries to map user IDs and movie IDs to unique indices in the user-item matrix.

3. **Initialize User-Item Matrix**:
   - Determine the dimensions of the user-item matrix based on the number of unique users and movies.
   - Initialize a user-item matrix filled with zeros; a 0 marks a user-movie pair without a rating.

4. **Populate User-Item Matrix**:
   - Iterate through the review data and populate the user-item matrix with ratings.
   - Map user and movie IDs to their corresponding indices in the matrix and insert the ratings.

5. **Return Results**:
   - Return the user-item matrix along with dictionaries mapping user and movie IDs to indices, and arrays containing user and movie IDs.
  
### Functions Used and Purpose:

- **`np.concatenate()`**: Used to concatenate arrays containing user IDs, ratings, and movie IDs extracted from the review data.
- **`enumerate()`**: Used to assign a consecutive index to each unique user ID and movie ID when building the mapping dictionaries.
- **`np.unique()`**: Used to find the unique user IDs and movie IDs in the review data.
- **`np.zeros()`**: Used to initialize the user-item matrix with zeros.
- **`zip()`**: Used to iterate over multiple iterables simultaneously (user IDs, movie IDs, ratings).
- **Indexing and Slicing**: Used to access and modify elements in arrays and matrices.

In [11]:
def create_user_item_matrix(train_test_val_set):
    """
    Creates a user-item matrix from the provided dataset containing review data.

    Parameters:
    train_test_val_set (DataFrame): DataFrame with one row per movie. It needs a 'movieId' column and a
                                    'review_data' column, which holds a list of dictionaries with the
                                    keys 'userId' and 'rating'.

    Returns:
    user_item_matrix (numpy.ndarray): 2D array of shape (number of unique users, number of unique movies). Each row holds one user's ratings and each column one movie; 0 means the user has not rated that movie.
    
    user_id_dict (dict): Dictionary mapping user IDs to their row index in the user-item matrix.
    
    movie_id_dict (dict): Dictionary mapping movie IDs to their column index in the user-item matrix.
    
    user_ids (numpy.ndarray): Flat array with the user ID of every review, in the order the reviews were read.
    
    movie_ids (numpy.ndarray): Flat array with the movie ID of every review, aligned with user_ids.

    """
    review_data = train_test_val_set['review_data'].values
    user_ids = np.concatenate([np.array([entry['userId'] for entry in row]) for row in review_data])
    ratings = np.concatenate([np.array([entry['rating'] for entry in row]) for row in review_data])
    movieIds = np.concatenate([[movieId] * len(row) for movieId, row in zip(train_test_val_set['movieId'], review_data)])

    # create dictionaries to map user IDs and movie IDs to unique indices to map over
    user_id_dict = {user_id: index for index, user_id in enumerate(np.unique(user_ids))}
    movie_id_dict = {movie_id: index for index, movie_id in enumerate(np.unique(movieIds))}

    # initialize the user-item matrix with zeros (0 = no rating)
    user_count = len(user_id_dict)
    movie_count = len(movie_id_dict)
    user_item_matrix = np.zeros((user_count, movie_count))

    # populate the user-item matrix with the ratings
    for user_id, movie_id, rating in zip(user_ids, movieIds, ratings):
        user_index = user_id_dict[user_id]
        movie_index = movie_id_dict[movie_id]
        user_item_matrix[user_index, movie_index] = rating

    return user_item_matrix, user_id_dict, movie_id_dict, user_ids, movieIds
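
The populate step can also be written without a Python loop. The sketch below is an alternative formulation, not the function used in this notebook: it relies on `np.unique(..., return_inverse=True)`, which returns the sorted unique IDs together with the position of every review in that sorted order. The small arrays are made up for illustration.

```python
import numpy as np

# hypothetical flat arrays, one entry per review
user_ids = np.array([7, 3, 7, 9])
movie_ids = np.array([20, 20, 10, 10])
ratings = np.array([4.0, 2.0, 5.0, 3.0])

# sorted unique IDs plus, for every review, its row / column index
unique_users, user_index = np.unique(user_ids, return_inverse=True)
unique_movies, movie_index = np.unique(movie_ids, return_inverse=True)

# zeros mark "no rating", as in create_user_item_matrix
matrix = np.zeros((len(unique_users), len(unique_movies)))
matrix[user_index, movie_index] = ratings

print(unique_users)
print(unique_movies)
print(matrix)
```

Rows follow the sorted user IDs and columns the sorted movie IDs, which is the same ordering the dictionaries built with `enumerate(np.unique(...))` encode.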

To account for variations in how users rate, let's center the rating matrix with the following function, which subtracts each user's mean rating (the mean of that user's row) from every entry in the row.

### Function explanation
`center_data`

This function creates a centered matrix of the user-item matrix, which is commonly used in matrix factorization algorithms such as Singular Value Decomposition (SVD) and collaborative filtering.

1. **Replace NaN Values**:
   - Check for NaN values in the `user_item_matrix` and replace them with 0. This is a safeguard: missing ratings are commonly represented as NaN, although `create_user_item_matrix` already stores them as 0.

2. **Compute User Means**:
   - Calculate the mean of each user's row across all items. Because unrated items are stored as 0, they count toward this mean, so it is lower than the average of the ratings the user actually gave.

3. **Center the Data**:
   - Subtract the mean of each user (retrieved from the `user_means` array) from every entry in that user's row of the `user_item_matrix`. This centers the data by removing user-specific rating levels.

4. **Return Centered Matrix**:
   - Return the centered user-item matrix, where each user's ratings are adjusted to reflect deviations from their mean rating.
   - Also return the `user_means` array for potential future use or analysis.

In [12]:
def center_data(user_item_matrix):
    """
    Creates a centered version of the previously created user-item matrix.

    Parameters:
    user_item_matrix (numpy.ndarray): 2D float64 array with one row per user and one column per movieId. Any NaN value is replaced by 0 in place, so for the time being a missing rating is treated as 0.

    Returns:
    centered_user_item_matrix (numpy.ndarray): the user-item matrix with each user's row mean subtracted from every entry in that row, to account for differences in how users rate.
    user_means (numpy.ndarray): the row mean of each user.
    
    """
    # Check for NaN values and replace them with 0
    user_item_matrix[np.isnan(user_item_matrix)] = 0
    
    # Compute user means
    user_means = np.mean(user_item_matrix, axis=1)
    
    # Center the data
    centered_user_item_matrix = user_item_matrix - user_means[:, np.newaxis]
    
    return centered_user_item_matrix, user_means
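
To make the effect of the zeros concrete, here is a minimal sketch on a made-up 2 x 3 matrix. It computes the row mean the way `center_data` does and, for comparison only, the mean over rated entries; the second variant is an alternative that this notebook does not use.

```python
import numpy as np

# two users, three movies; 0 marks "not rated"
ratings = np.array([[5.0, 0.0, 3.0],
                    [0.0, 2.0, 0.0]])

# mean over all columns, as in center_data: unrated zeros pull it down
mean_all = ratings.mean(axis=1)

# alternative (not used in this notebook): mean over rated entries only
rated = ratings > 0
mean_rated = ratings.sum(axis=1) / rated.sum(axis=1)

# centering with the all-columns mean makes every row sum to zero
centered = ratings - mean_all[:, np.newaxis]

print(mean_all)
print(mean_rated)
print(centered)
```

The first user has given ratings of 5 and 3, yet the all-columns mean is 8/3 rather than 4, because the unrated movie counts as a 0.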

After centering the data, we would like to work with fewer dimensions, as this takes less time and computation power, ideally without a noticeable loss in prediction quality.

### Function explanation
`apply_svd`

1. **SVD Decomposition**:
   - Apply the SVD decomposition to the `centered_user_item_matrix` using the `np.linalg.svd` function from NumPy. This results in three matrices: U, Sigma, and Vt.

2. **Sigma Matrix Adjustment**:
   - Set up the Sigma matrix by keeping only the first `num_latent_factors` singular values and forming a diagonal matrix with them.

3. **Truncate U and Vt**:
   - Keep only the first `num_latent_factors` columns of U and the first `num_latent_factors` rows of Vt. U then represents the relationship between users and latent factors, and Vt the relationship between latent factors and items.

4. **Return Decomposed Matrices**:
   - Return the decomposed matrices U, Sigma, and Vt, which capture the underlying structure of the user-item interactions in terms of latent factors.

In [13]:
# I will decompose the user item matrix in this function using numpy
def apply_svd(centered_user_item_matrix, num_latent_factors):
    """
    Applies Singular Value Decomposition (SVD) to decompose the centered user-item matrix into three matrices:
    U, Sigma, and Vt.

    U: user matrix; each row is a user and each column a latent factor. The columns of U are orthonormal.
    Sigma: diagonal matrix with the singular values of the chosen latent factors on the diagonal, in descending order.
    Vt: item matrix; each row is a latent factor and each column an item. The rows of Vt are orthonormal.

    Parameters:
    centered_user_item_matrix (numpy.ndarray): Centered user-item matrix to be decomposed.
    num_latent_factors (int): Number of latent factors to retain in the decomposition.

    Returns:
    U (numpy.ndarray): Matrix representing the relationship between users and latent factors.
    Sigma (numpy.ndarray): Diagonal matrix containing the singular values, representing the importance of each latent factor.
    Vt (numpy.ndarray): Transpose of the matrix representing the relationship between items and latent factors.

    """
    # U, Sigma and Vt are created using the svd function from numpy; Sigma comes back as a 1D array of singular values
    U, Sigma, Vt = np.linalg.svd(centered_user_item_matrix, full_matrices=False)
    # keep the largest num_latent_factors singular values and place them on the diagonal of a square matrix
    Sigma = np.diag(Sigma[:num_latent_factors])
    # keep the matching columns of U (users x factors) and rows of Vt (factors x items) so the three matrices can still be multiplied
    U = U[:, :num_latent_factors]
    Vt = Vt[:num_latent_factors, :]
    return U, Sigma, Vt
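
The truncation can be illustrated on a small random matrix. This is a sketch under assumptions: the 6 x 4 matrix merely stands in for a centered user-item matrix, and `reconstruct` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))  # stand-in for a small centered user-item matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)

def reconstruct(k):
    # keep the k largest singular values with the matching columns of U and rows of Vt
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for k in (1, 2, 4):
    err = np.linalg.norm(M - reconstruct(k))  # Frobenius norm of the residual
    print(k, round(float(err), 4))
```

With all four factors the product reproduces `M` up to floating-point error; with fewer factors the residual equals the root of the sum of the squared singular values that were dropped.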

With the decomposed matrices, let's make rating predictions by rebuilding an approximation of the original matrix from the chosen number of latent factors, using matrix multiplication in the function below:

### Function explanation:

`compute_recommendations_for_all_users`

1. **Compute Predicted Ratings**:
   - Use matrix multiplication to compute all predicted ratings based on the decomposed matrices U, Sigma, and Vt. Add back the user mean ratings to obtain the actual predicted ratings.

2. **Mask Interacted Items**:
   - Mask out items that users have already interacted with in the `user_item_matrix` by setting their predicted ratings to negative infinity. This ensures that these items are not recommended again.

3. **Get Top Recommendations**:
   - Sort the predicted ratings for each user in descending order to get the top recommendations. Extract the indices of the top items for each user.

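4. **Adjust Item IDs**:
   - Add 1 to each column index. This yields item IDs only if the IDs are the consecutive integers starting from 1 in column order; otherwise the column indices have to be mapped back to item IDs through `movie_id_dict`.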

5. **Create Recommendations Dictionary**:
   - Create a dictionary `all_recommendations` mapping each user ID to a list of top recommended item IDs.

In [14]:
def compute_recommendations_for_all_users(U, Sigma, Vt, user_means, user_ids, num_recommendations, user_item_matrix):
    """
    Computes recommendations for all users based on the decomposed matrices from Singular Value Decomposition (SVD).

    Parameters:
    U: user matrix; each row is a user and each column a latent factor. The columns of U are orthonormal.
    Sigma: diagonal matrix with the singular values of the chosen latent factors on the diagonal, in descending order.
    Vt: item matrix; each row is a latent factor and each column an item. The rows of Vt are orthonormal.

    user_means (numpy.ndarray): Array containing mean ratings for each user.
    user_ids (numpy.ndarray): Array containing user IDs.
    num_recommendations (int): Number of recommendations to generate for each user.
    user_item_matrix (numpy.ndarray): Matrix representing user-item interactions, where rows correspond to users and columns correspond to items.

    Returns:
    all_recommendations (dict): Dictionary mapping user IDs to lists of top recommended item IDs.
    all_predicted_ratings (numpy.ndarray): Array of predicted ratings for all users and items, with each user's mean added back
                                           (so no longer centered). Items a user has already rated are set to -inf.
                                           Each row corresponds to a user, and each column corresponds to an item.
    """
    # compute the predicted ratings by multiplying the decomposed matrices; this rebuilds an approximation of the centered rating matrix from only the retained latent factors. Adding the user means back undoes the centering, so the predictions are on the original rating scale.
    all_predicted_ratings = np.dot(U, np.dot(Sigma, Vt)) + user_means[:, np.newaxis]

    # mask out items already interacted with by users
    all_predicted_ratings[user_item_matrix > 0] = -np.inf

    # get top recommendations for each user
    top_indices = np.argsort(all_predicted_ratings, axis=1)[:, ::-1]
    top_recommendations = top_indices[:, :num_recommendations] + 1  # Adjust indices to match item IDs

    # Create dictionary mapping user IDs to top recommended item IDs
    all_recommendations = {user_id: top_items for user_id, top_items in zip(user_ids, top_recommendations)}

    return all_recommendations, all_predicted_ratings
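
When the movie IDs are not simply 1, 2, 3, ... in column order, the column indices can be translated back through the inverse of `movie_id_dict`. The sketch below shows the idea; the IDs, the `index_to_movie_id` name and the top indices are all made up for illustration.

```python
import numpy as np

# hypothetical mapping in the same form as movie_id_dict: movie ID -> column index
movie_id_dict = {1193: 0, 2858: 1, 3578: 2}

# invert it so a column index leads back to the movie ID
index_to_movie_id = {index: movie_id for movie_id, index in movie_id_dict.items()}

# hypothetical top-2 column indices for two users
top_indices = np.array([[2, 0],
                        [1, 2]])

top_movie_ids = [[index_to_movie_id[i] for i in row] for row in top_indices]
print(top_movie_ids)
```

The lookup replaces the `+ 1` adjustment and works for any set of IDs, at the cost of passing the dictionary along.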

**The user IDs line up across functions only when they are passed in row order:** `create_user_item_matrix` assigns each user a row index by enumerating `np.unique(user_ids)`, so the rows of the user-item matrix, and of the centered matrix derived from it, follow the sorted user IDs stored in `user_id_dict`. Here's how the IDs are used afterwards:

`compute_recommendations_for_all_users` does not look up indices itself; it pairs the i-th entry of `user_ids` with the i-th row of the predicted ratings using `zip`.
The predicted ratings for each user are computed based on their row within the centered matrix.
After computation, the recommendations are stored in a dictionary keyed by user ID, and the predicted ratings are returned as a matrix in the same row order.

**Therefore**, the recommendations stored under a user ID belong to that user as long as the `user_ids` argument lists the IDs in the same order as the matrix rows, for example the keys of `user_id_dict`. A list built from an unordered `set` does not guarantee that order.
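
A small sketch of how the row order can be read back from the mapping: the dictionary is built the same way as in `create_user_item_matrix`, and since Python dictionaries keep insertion order, its keys come back in row order. The IDs are made up and `row_order` is a hypothetical name.

```python
import numpy as np

user_ids = np.array([42, 7, 42, 1000])  # hypothetical per-review user IDs

# same construction as in create_user_item_matrix
user_id_dict = {user_id: index for index, user_id in enumerate(np.unique(user_ids))}

# dict keys come back in insertion order, i.e. in matrix row order
row_order = list(user_id_dict)

for row, user_id in enumerate(row_order):
    print(row, int(user_id))
```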

***********************

Before parameter tuning, I will run the recommender system on the train and validation sets and record a baseline performance. Root Mean Squared Error (RMSE) will be used as the performance metric. 

- The reason is that performance is measured by comparing the original ratings from train_data and val_data with the corresponding predicted ratings, and a squared-error metric is appropriate for such numeric comparisons. Recall and precision depend on deciding which items are relevant to a user, which is difficult and subjective to define within this model. 
  
- Furthermore, RMSE is expressed in the same units as the input ratings, making it easy to interpret for a user as well as a stakeholder.
- Because the errors are squared before averaging, RMSE weighs large errors more heavily than small ones.

In [15]:
# here I make a baseline selection of latent factors
num_latent_factors = 1

# here I make a baseline selection of recommendations per user
num_recommendations = 4


`Netflix`

Training data

In [16]:
user_item_matrix_train, user_id_dict_train, movie_id_dict_train, user_ids_train, movie_ids_train = create_user_item_matrix(train_data)

# list the user and movie IDs in matrix order (rows and columns); the dict keys keep the sorted order assigned by np.unique
user_ids_train = list(user_id_dict_train)
item_ids_train = list(movie_id_dict_train)

# unpack the tuple returned by the center_data function to get the centered user-item matrix, which is more robust to differences in how users rate, and the user means
centered_user_item_matrix_train, user_means_train = center_data(user_item_matrix_train)

# apply SVD to the centered matrix to reduce its dimensionality and decompose it, so recommendations can be made with the dot product method
U_TRAIN, Sigma_train, Vt_train = apply_svd(centered_user_item_matrix_train, num_latent_factors)

# compute the recommendations
all_recommendations_train, all_predicted_centered_ratings_train = compute_recommendations_for_all_users(U_TRAIN, Sigma_train, Vt_train, user_means_train, user_ids_train, num_recommendations, user_item_matrix_train)

Validation data

In [17]:
user_item_matrix_val, user_id_dict_val, movie_id_dict_val, user_ids_val, movie_ids_val = create_user_item_matrix(val_data)

# list the user and movie IDs in matrix order (rows and columns); the dict keys keep the sorted order assigned by np.unique
user_ids_val = list(user_id_dict_val)
item_ids_val = list(movie_id_dict_val)

# unpack the tuple returned by the center_data function to get the centered user-item matrix, which is more robust to differences in how users rate, and the user means
centered_user_item_matrix_val, user_means_val = center_data(user_item_matrix_val)

# apply SVD to the centered matrix to reduce its dimensionality and decompose it, so recommendations can be made with the dot product method
U_VAL, Sigma_val, Vt_val = apply_svd(centered_user_item_matrix_val, num_latent_factors)

# compute the recommendations
all_recommendations_val, all_predicted_centered_ratings_val = compute_recommendations_for_all_users(U_VAL, Sigma_val, Vt_val, user_means_val, user_ids_val, num_recommendations, user_item_matrix_val)

`Movielens`

Training data

In [18]:
user_item_matrix_train2, user_id_dict_train2, movie_id_dict_train2, user_ids_train2, movie_ids_train2 = create_user_item_matrix(train_data2)

# list the user and movie IDs in matrix order (rows and columns); the dict keys keep the sorted order assigned by np.unique
user_ids_train2 = list(user_id_dict_train2)
item_ids_train2 = list(movie_id_dict_train2)

# unpack the tuple returned by the center_data function to get the centered user-item matrix, which is more robust to differences in how users rate, and the user means
centered_user_item_matrix_train2, user_means_train2 = center_data(user_item_matrix_train2)

# apply SVD to the centered matrix to reduce its dimensionality and decompose it, so recommendations can be made with the dot product method
U_TRAIN2, Sigma_train2, Vt_train2 = apply_svd(centered_user_item_matrix_train2, num_latent_factors)

# compute the recommendations
all_recommendations_train2, all_predicted_centered_ratings_train2 = compute_recommendations_for_all_users(U_TRAIN2, Sigma_train2, Vt_train2, user_means_train2, user_ids_train2, num_recommendations, user_item_matrix_train2)

Validation data

In [19]:
user_item_matrix_val2, user_id_dict_val2, movie_id_dict_val2, user_ids_val2, movie_ids_val2 = create_user_item_matrix(val_data2)

# list the user and movie IDs in matrix order (rows and columns); the dict keys keep the sorted order assigned by np.unique
user_ids_val2 = list(user_id_dict_val2)
item_ids_val2 = list(movie_id_dict_val2)

# unpack the tuple returned by the center_data function to get the centered user-item matrix, which is more robust to differences in how users rate, and the user means
centered_user_item_matrix_val2, user_means_val2 = center_data(user_item_matrix_val2)

# apply SVD to the centered matrix to reduce its dimensionality and decompose it, so recommendations can be made with the dot product method
U_val2, Sigma_val2, Vt_val2 = apply_svd(centered_user_item_matrix_val2, num_latent_factors)

# compute the recommendations
all_recommendations_val2, all_predicted_centered_ratings_val2 = compute_recommendations_for_all_users(U_val2, Sigma_val2, Vt_val2, user_means_val2, user_ids_val2, num_recommendations, user_item_matrix_val2)

**We will use RMSE as performance metric, using the function below to compute it:**

### Function explanation

`compute_rmse`
1. **Handle Missing and Infinite Values**: 
   - Convert `NaN` and infinite values in both `original_ratings` and `predicted_ratings` to 0 using `np.nan_to_num()`. This also turns the `-inf` entries that `compute_recommendations_for_all_users` writes for already-rated items into 0.
   
2. **Flatten Arrays**:
   - Flatten both arrays into 1D arrays so a single boolean mask can be applied to them.

3. **Remove Unrated Items**:
   - Create a mask that drops entries where the original value is exactly 0, and compute the RMSE over the remaining entries. In an uncentered matrix a 0 marks an unrated item; in a centered matrix an unrated item holds minus the user's mean instead, so the mask only removes unrated items when uncentered ratings are passed in.

4. **Compute Squared Differences**:
   - Calculate the squared differences between original and predicted ratings for the rated items.

5. **Compute Mean Squared Error (MSE)**:
   - Compute the mean squared error (MSE) by averaging the squared differences.

6. **Compute RMSE**:
   - Compute the square root of the mean squared error to obtain the RMSE value, which indicates the average difference between the original and predicted ratings.

7. **Return RMSE**:
   - Return the computed RMSE value as the output of the function.

In [20]:
def compute_rmse(original_ratings, predicted_ratings):
    """
    Computes the Root Mean Square Error (RMSE) between the original ratings and the predicted ratings. NaN and infinite values are turned into 0 for now, and entries whose original value is 0 are left out.

    Parameters:
    original_ratings (numpy.ndarray): Array containing the original ratings.
    predicted_ratings (numpy.ndarray): Array containing the predicted ratings.

    Returns:
    float: The RMSE value.
    
    """
    # handle implicit ratings with 0s for now
    original_ratings = np.nan_to_num(original_ratings, nan=0, posinf=0, neginf=0)
    predicted_ratings = np.nan_to_num(predicted_ratings, nan=0, posinf=0, neginf=0)

    # make 1d arrays by flattening them to be able to make masks
    original_ratings_flat = original_ratings.flatten()
    predicted_ratings_flat = predicted_ratings.flatten()
    
    # remove entries with no original rating (unrated items)
    mask = original_ratings_flat != 0
    original_ratings_flat = original_ratings_flat[mask]
    predicted_ratings_flat = predicted_ratings_flat[mask]
    
    # Compute the squared differences
    squared_diff = np.square(original_ratings_flat - predicted_ratings_flat)
    
    # Compute the mean squared error
    mse = np.mean(squared_diff)
    
    # Compute the square root of the mean squared error to get RMSE
    rmse = np.sqrt(mse)
    
    return rmse
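
For comparison, RMSE restricted to rated cells can also be computed with an explicit boolean mask instead of relying on zeros. This is a minimal sketch on made-up numbers, not the function used in this notebook; `rmse_on_rated` is a hypothetical helper and the mask is assumed to come from the uncentered matrix.

```python
import numpy as np

def rmse_on_rated(original, predicted, rated_mask):
    """RMSE over the cells flagged in rated_mask (hypothetical helper)."""
    diff = original[rated_mask] - predicted[rated_mask]
    return np.sqrt(np.mean(diff ** 2))

original = np.array([[5.0, 0.0, 3.0],
                     [0.0, 2.0, 0.0]])
predicted = np.array([[4.0, 1.0, 3.0],
                      [2.0, 4.0, 1.0]])

rated_mask = original > 0  # built from the uncentered matrix, where 0 means "not rated"
print(rmse_on_rated(original, predicted, rated_mask))
```

Only the three rated cells enter the calculation: the errors are 1, 0 and -2, so the result is the square root of 5/3.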

In [21]:
# Evaluate performance on the FIRST training set
train_rmse = compute_rmse(centered_user_item_matrix_train, all_predicted_centered_ratings_train)
# Evaluate performance on the FIRST validation set
val_rmse = compute_rmse(centered_user_item_matrix_val, all_predicted_centered_ratings_val)

# Evaluate performance on the SECOND training set
train_rmse2 = compute_rmse(centered_user_item_matrix_train2, all_predicted_centered_ratings_train2)
# Evaluate performance on the SECOND validation set
val_rmse2 = compute_rmse(centered_user_item_matrix_val2, all_predicted_centered_ratings_val2)

`Netflix` baseline performance:

In [22]:
print("RMSE on training set:", train_rmse)
print("RMSE on validation set:", val_rmse)

RMSE on training set: 0.5734839206350196
RMSE on validation set: 0.8369768965842413


**Baseline findings:** RMSE is higher on the validation set than on the training set, suggesting that the model overfits to data it has already seen. The number of latent factors is set to only 1, which may leave the model too inflexible to account for new patterns in the validation data. Models with fewer latent factors can also be more sensitive to noise in the data.

`Movielens` baseline performance:

In [23]:
print("RMSE on training set:", train_rmse2)
print("RMSE on validation set:", val_rmse2)

RMSE on training set: 0.3102560320623854
RMSE on validation set: 0.7805121058103744


**Baseline findings:** 

The same goes for the MovieLens dataset: a single-latent-factor model has limited capacity to capture the underlying structure of the interactions. The MovieLens dataset is also sparser (more missing ratings), which could be why the model overfits even more strongly on the training data.

## Hyperparameter tuning

`Netflix`

In [24]:
latent_factors_range2 = [1, 10, 25, 50, 250]
rmse_values2 = []
best_nlf_train = None
best_rmse = float('inf')

for num_latent_factors in latent_factors_range2:
    # perform only SVD and generationg of recommendations again while doing the loop, as the matrix and centered matrix will not change depending on the amount of latent factors    
    # apply Singular Value Decomposition (SVD)
    U_TRAIN, Sigma_train, Vt_train = apply_svd(centered_user_item_matrix_train, num_latent_factors)
    
    # compute recommendations for all users
    all_recommendations_train, all_predicted_centered_ratings_train = compute_recommendations_for_all_users(U_TRAIN, Sigma_train, Vt_train, user_means_train, user_ids_train, num_recommendations, user_item_matrix_train)
    
    # compute the Root Mean Squared Error (RMSE) on the training matrix with the compute_rmse function I wrote for it
    rmse = compute_rmse(centered_user_item_matrix_train, all_predicted_centered_ratings_train)
    
    # append the RMSE value to the list
    rmse_values2.append(rmse)

    # check if the current number of latent factors gives the best RMSE so far
    if rmse < best_rmse:
        best_rmse = rmse
        best_nlf_train = num_latent_factors

# print the RMSE for each number of latent factors, in the order they were tried
for i, num_latent_factors in enumerate(latent_factors_range2):
    print(f"Num Latent Factors: {num_latent_factors} | RMSE: {rmse_values2[i]}")

print(f"\nBest number of latent factors: {best_nlf_train} | Best RMSE: {best_rmse}")


Num Latent Factors: 1 | RMSE: 0.5734839206350196
Num Latent Factors: 10 | RMSE: 0.5457853462705187
Num Latent Factors: 25 | RMSE: 0.5383698369814283
Num Latent Factors: 50 | RMSE: 0.5337370895901662
Num Latent Factors: 250 | RMSE: 0.5309695396848648

Best number of latent factors: 250 | Best RMSE: 0.5309695396848648
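
The loop above scores every candidate on the same training matrix the SVD was fitted on. Assuming the error is measured over the whole fitted matrix, the reconstruction error of a truncated SVD can only decrease or stay equal as more factors are kept, which would explain why the largest value in the range comes out as best. A minimal numpy sketch of that behaviour, using a random dense matrix as a stand-in and names that are not part of the notebook:

```python
import numpy as np

rng = np.random.default_rng(42)
matrix = rng.normal(size=(60, 40))   # stand-in for a centered user-item matrix (assumption)

# One full decomposition; each candidate k just keeps the first k factors.
U, s, Vt = np.linalg.svd(matrix, full_matrices=False)

def reconstruction_rmse(k):
    """RMSE between the matrix and its rank-k truncated SVD reconstruction."""
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return float(np.sqrt(np.mean((matrix - approx) ** 2)))

candidates = [1, 5, 10, 20, 40]
errors = [reconstruction_rmse(k) for k in candidates]
for k, err in zip(candidates, errors):
    print(f"k = {k:>2} | RMSE on the fitted matrix: {err:.4f}")
```

In this sketch the error shrinks with every increase of `k` and reaches numerical zero at full rank (`k = 40`), so a score measured this way cannot by itself show when additional factors stop helping on unseen ratings.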


`Movielens`

In [25]:
latent_factors_range2 = [1, 10, 25, 50, 250]
rmse_values2 = []
best_nlf_train2 = None
best_rmse = float('inf')

for num_latent_factors in latent_factors_range2:
    # only the SVD and the generation of recommendations are repeated inside the loop: the user-item matrix and its centered version do not depend on the number of latent factors
    # apply Singular Value Decomposition (SVD)
    U_TRAIN2, Sigma_train2, Vt_train2 = apply_svd(centered_user_item_matrix_train2, num_latent_factors)
    
    # compute recommendations for all users
    all_recommendations_train2, all_predicted_centered_ratings_train2 = compute_recommendations_for_all_users(U_TRAIN2, Sigma_train2, Vt_train2, user_means_train2, user_ids_train2, num_recommendations, user_item_matrix_train2)
    
    # compute the Root Mean Squared Error (RMSE) on the training matrix with the compute_rmse function I wrote for it
    rmse = compute_rmse(centered_user_item_matrix_train2, all_predicted_centered_ratings_train2)
    
    # append the RMSE value to the list
    rmse_values2.append(rmse)

    # check if the current number of latent factors gives the best RMSE so far
    if rmse < best_rmse:
        best_rmse = rmse
        best_nlf_train2 = num_latent_factors

# print the RMSE for each number of latent factors, in the order they were tried
for i, num_latent_factors in enumerate(latent_factors_range2):
    print(f"Num Latent Factors: {num_latent_factors} | RMSE: {rmse_values2[i]}")

print(f"\nBest number of latent factors: {best_nlf_train2} | Best RMSE: {best_rmse}")

Num Latent Factors: 1 | RMSE: 0.3102560320623854
Num Latent Factors: 10 | RMSE: 0.307779326501893
Num Latent Factors: 25 | RMSE: 0.30663196005867877
Num Latent Factors: 50 | RMSE: 0.3064720326682541
Num Latent Factors: 250 | RMSE: 0.3063378920193254

Best number of latent factors: 250 | Best RMSE: 0.3063378920193254
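
As a possible alternative to scoring on the training matrix, the number of latent factors could be chosen on ratings that were hidden from the model during fitting. The following is a hedged sketch on synthetic data, not the notebook's method: the rank-3 toy matrix, the 20% hold-out, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic centered ratings: a rank-3 signal plus noise (40 users x 30 movies).
signal = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30))
ratings = signal + 0.5 * rng.normal(size=signal.shape)

# Hold out roughly 20% of the entries; the model never sees them during fitting.
val_mask = rng.random(ratings.shape) < 0.2
train_matrix = np.where(val_mask, 0.0, ratings)   # held-out cells are filled with 0

U, s, Vt = np.linalg.svd(train_matrix, full_matrices=False)

def validation_rmse(k):
    """RMSE of the rank-k reconstruction, measured only on the held-out entries."""
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return float(np.sqrt(np.mean((ratings - approx)[val_mask] ** 2)))

candidates = [1, 2, 3, 5, 10, 30]
val_scores = {k: validation_rmse(k) for k in candidates}
best_k = min(val_scores, key=val_scores.get)

for k, score in val_scores.items():
    print(f"k = {k:>2} | validation RMSE: {score:.4f}")
print("best k on held-out entries:", best_k)
```

At full rank (`k = 30` here) the reconstruction reproduces the training matrix exactly, including the zeros written into the held-out cells, so the held-out score no longer rewards adding factors indefinitely.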


## Final predictions on test set:

The best number of latent factors found during tuning is used to predict on the test set:

`Netflix`

In [57]:
num_latent_factors_test1 = best_nlf_train

user_item_matrix_test1, user_id_dict_test1, movie_id_dict_test1, user_ids_test1, movie_ids_test1 = create_user_item_matrix(test_data)

# get unique user and movie IDs: use set to ensure unique values and put the IDs in a list
user_ids_test1 = list(set(user_ids_test1))
item_ids_test1 = list(set(movie_ids_test1))

# unpack the tuple returned by the center_data function to get an updated user-item matrix that is more robust to variations in rating
centered_user_item_matrix_test1, user_means_test1 = center_data(user_item_matrix_test1)

# apply SVD using the centered matrix to reduce memory usage and to decompose the matrix to be able to make recommendations using the dot product method
U_test1, Sigma_test1, Vt_test1 = apply_svd(centered_user_item_matrix_test1, num_latent_factors_test1)

# compute the recommendations
all_recommendations_test1, all_predicted_centered_ratings_test1 = compute_recommendations_for_all_users(U_test1, Sigma_test1, Vt_test1, user_means_test1, user_ids_test1, num_recommendations, user_item_matrix_test1)

In [61]:
test_rmse = compute_rmse(centered_user_item_matrix_test1, all_predicted_centered_ratings_test1)
print("RMSE on test set:", test_rmse)
print("RMSE on training set:", train_rmse)
print("RMSE on val set:", val_rmse)

RMSE on test set: 0.8045300162758322
RMSE on training set: 0.5734839206350196
RMSE on val set: 0.8369768965842413


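After hyperparameter tuning, the test RMSE (0.805) is slightly lower than the baseline validation RMSE (0.837), suggesting that a higher number of latent factors is better able to capture the complexity of the data, in other words: the preferences of the users. Note that the training and validation RMSE printed above are still the baseline values obtained with 1 latent factor.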

`Movielens`

In [28]:
num_latent_factors_test2 = best_nlf_train2

user_item_matrix_test2, user_id_dict_test2, movie_id_dict_test2, user_ids_test2, movie_ids_test2 = create_user_item_matrix(test_data2)

# get unique user and movie IDs: use set to ensure unique values and put the IDs in a list
user_ids_test2 = list(set(user_ids_test2))
item_ids_test2 = list(set(movie_ids_test2))

# unpack the tuple returned by the center_data function to get an updated user-item matrix that is more robust to variations in rating
centered_user_item_matrix_test2, user_means_test2 = center_data(user_item_matrix_test2)

# apply SVD using the centered matrix to reduce memory usage and to decompose the matrix to be able to make recommendations using the dot product method
U_test2, Sigma_test2, Vt_test2 = apply_svd(centered_user_item_matrix_test2, num_latent_factors_test2)

# compute the recommendations
all_recommendations_test2, all_predicted_centered_ratings_test2 = compute_recommendations_for_all_users(U_test2, Sigma_test2, Vt_test2, user_means_test2, user_ids_test2, num_recommendations, user_item_matrix_test2)

In [62]:
test_rmse2 = compute_rmse(centered_user_item_matrix_test2, all_predicted_centered_ratings_test2)
print("RMSE on test set:", test_rmse2)
print("RMSE on training set:", train_rmse2)
print("RMSE on val set:", val_rmse2)

RMSE on test set: 0.7514817317977059
RMSE on training set: 0.3102560320623854
RMSE on val set: 0.7805121058103744


After fitting a higher number of latent factors, the test RMSE (0.751) is slightly lower than the baseline validation RMSE (0.781), suggesting that the hyperparameter tuning helped the model perform better on unseen data. It remains well above the training RMSE (0.310), which is the baseline value for 1 latent factor.

Using 250 latent factors reduced some of the overfitting seen on the validation data: more latent factors appear better able to capture the complexity of the Movielens data.
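
The change between the baseline validation RMSE and the final test RMSE can be expressed as a relative drop. The sketch below only does arithmetic on the values reported in the outputs above; the dictionary layout and names are chosen for this illustration.

```python
# RMSE values copied from the notebook outputs above.
reported = {
    "Netflix": {"val_baseline": 0.8369768965842413, "test": 0.8045300162758322},
    "Movielens": {"val_baseline": 0.7805121058103744, "test": 0.7514817317977059},
}

relative_drop = {}
for name, scores in reported.items():
    # fraction by which the test RMSE is lower than the baseline validation RMSE
    drop = (scores["val_baseline"] - scores["test"]) / scores["val_baseline"]
    relative_drop[name] = drop
    print(f"{name}: test RMSE is {drop:.1%} lower than the baseline validation RMSE")
```

This gives a drop of about 3.9% for Netflix and about 3.7% for Movielens. The two scores come from different data splits and different numbers of latent factors, so the figures are a rough indication rather than a like-for-like comparison.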

### Overall conclusion:

SVD is a very quick way of creating a recommender system: its calculations are fast compared to KNN, which took approx. 20 minutes for a fraction of the number of reviews SVD can handle.

Dimensionality reduction, which is at the core of this model, allowed it to capture the complexity of the data with fewer features.
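
What "fewer features" means for a truncated SVD can be illustrated with a small numpy sketch. It uses a synthetic matrix that has rank 2 by construction, so two latent factors are enough to reproduce it; the matrix sizes and names are assumptions for this example and are unrelated to the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# A 50 x 40 matrix that has exactly rank 2 by construction.
n_users, n_movies, k = 50, 40, 2
matrix = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_movies))

U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]       # keep only the first k factors

full_size = n_users * n_movies                # numbers stored in the full matrix
reduced_size = k * (n_users + n_movies + 1)   # numbers stored in U_k, Sigma_k and Vt_k

print("max reconstruction error:", float(np.max(np.abs(matrix - approx))))
print("stored values:", reduced_size, "instead of", full_size)
```

Here 182 stored values stand in for the 2000 entries of the full matrix. Real rating data is not exactly low rank, so the reconstruction is approximate and the number of factors controls the trade-off.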