# Advanced Recommender Systems

### Content-Based Filtering: 
Recommends items based on the attributes of the items and a profile of the user’s preferences. These algorithms try to recommend items that are similar to those that a user liked in the past. The similarity of items is determined based on features associated with the compared items.

E.g.: if a user has liked movies from a particular genre or director in the past, the system will recommend movies from those or similar genres or directors. The recommendations are based on item properties like description, category, actors, directors, etc.

Content-based methods are limited by the amount of knowledge they have about an item. They can only make recommendations based on available item attributes, and they might not be able to capture the quality of the content without user ratings.
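
As a minimal sketch of the idea (the genre flags, item indices, and liked items below are made up for illustration and are not taken from the dataset used later), a user profile can be built by averaging the feature vectors of liked items, and the remaining items ranked by cosine similarity to that profile:

```python
import numpy as np

# Hypothetical items described by binary genre flags: [action, comedy, sci-fi]
item_features = np.array([
    [1, 0, 1],  # item 0: action / sci-fi
    [1, 0, 0],  # item 1: action
    [0, 1, 0],  # item 2: comedy
    [0, 0, 1],  # item 3: sci-fi
    [1, 1, 0],  # item 4: action / comedy
], dtype=float)

liked = [0, 1]  # items the (hypothetical) user liked

# User profile: mean feature vector of the liked items
profile = item_features[liked].mean(axis=0)

# Cosine similarity between the profile and every item
norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile)
scores = item_features @ profile / norms

# Rank only the items the user has not liked yet, best first
candidates = [i for i in range(len(item_features)) if i not in liked]
ranking = sorted(candidates, key=lambda i: scores[i], reverse=True)
print(ranking)
```

The ranking depends only on item attributes, which is why this approach cannot tell a good action movie from a bad one without extra signals such as ratings.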

### Collaborative Filtering:

Recommends items by identifying patterns in the preferences of multiple users. This approach doesn't require information about the items themselves; it operates purely on the user-item interaction data. There are two main subtypes:

- **Memory-Based**: involves using the entire user-item dataset to generate recommendations:
  - **User-Item Filtering**: finds users that are similar to the target user (based on similarity in their ratings) and recommends items that these similar users liked. For instance, if users A and B have similar ratings for the same movies, then movies liked by user A that user B has not seen yet might be recommended to user B.
  - **Item-Item Filtering**: takes an item the user liked and recommends items that are similar to it, where similarity is computed from the ratings given by all users. For example, if a user liked a particular book, books that other users rated similarly will be recommended.

- **Model-Based**: involves building predictive models. These models predict user ratings for items based on past ratings:
  - **Matrix Factorization**: for example, Singular Value Decomposition (SVD). The user-item interaction matrix is decomposed into the product of lower-dimensional rectangular matrices. This reduces the dimensionality of the dataset and provides estimates for the missing values in the matrix.
  - **Machine Learning Algorithms**: Various machine learning algorithms like neural networks, clustering, and regression models are used to predict user ratings.

Memory-based methods are easy to implement and produce reasonable recommendation quality. However, they are not scalable and are computationally expensive as they require computing the similarities between all user pairs or all item pairs.

Model-based methods are scalable and can deal with higher sparsity level than memory-based models, but they are more complex and require a learning process to build the model.
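
To make the matrix factorization idea concrete, here is a small sketch using a truncated SVD from NumPy. The toy matrix `R` and the number of latent factors `k` are illustrative assumptions. Note that plain SVD treats the zeros as real ratings; dedicated recommender libraries usually fit only the observed entries.

```python
import numpy as np

# Toy user-item matrix (0 = unrated); values are illustrative only
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

k = 2  # number of latent factors to keep (assumed)

# Full decomposition, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_hat is a dense rank-k approximation: every cell, including the
# previously empty ones, now holds an estimated score
print(np.round(R_hat, 2))
```

The rows of `U[:, :k]` and the columns of `Vt[:k, :]` can be read as user and item coordinates in a shared latent space.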

In [280]:
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style("whitegrid")

## Data Retrieval

In [281]:
col_names = ["user_id", "item_id", "rating", "timestamp"]
df = pd.read_csv("./filez/Movie_u.data", sep="\t", names=col_names)  # tab separated

movie_titles = pd.read_csv("./filez/Movie_Id_Titles")
df = pd.merge(df, movie_titles, on="item_id")

df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)


In [282]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print("Num of Users:  " + str(n_users))
print("Num of Movies: " + str(n_items))

Num of Users:  944
Num of Movies: 1682


## Train/Test Split

In [283]:
from sklearn.model_selection import train_test_split

"""
segment data in just two sets of data:
- training matrix: 75% of the ratings
- testing matrix: 25% of the ratings
"""

train_data, test_data = train_test_split(df, test_size=0.25)
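
The split above is random, so the metrics further down change slightly on every run. If reproducible numbers are wanted, `train_test_split` accepts a `random_state` seed. A small sketch on stand-in data (the array `rows` and the seed value 42 are arbitrary choices, not part of the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # stand-in for the rows of `df`

# The same seed yields the same partition on every call
train_a, test_a = train_test_split(rows, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.25, random_state=42)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```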

## Memory-Based Collaborative Filtering
* *User-Item Collaborative Filtering*: “Users who are similar to you also liked …”
* *Item-Item Collaborative Filtering*: “Users who liked this item also liked …”

Both filters require creating a user-item matrix for the entire dataset. Since we have split the data into testing and training we will need to create two ``[944 x 1682]`` matrices (all users by all movies).


### Train/Test Matrices

In [284]:
# Create two user-item matrices, one for training and another for testing

"""
- The for loop iterates over each row (line) in the train_data DataFrame.
- line[0] is the DataFrame index; line[1] and line[2] are the user ID and
    item ID, respectively. The - 1 is there because Python uses zero-based
    indexing, while user IDs and item IDs typically start from 1.
    Note: this dataset also contains user_id 0 (see df.head() above), which
    maps to index -1, i.e. the last row of the matrix.
- line[3] is the rating given by the user to the item.
- Each line[1] - 1 and line[2] - 1 combination corresponds to a specific cell
    in the matrix, and line[3] is the value assigned to that cell. Essentially,
    this step fills in the user-item matrix with the ratings from the
    train dataset.
(then the same for the test dataset)
"""

train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1] - 1, line[2] - 1] = line[3]
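
The same fill can be written without an explicit Python loop by using NumPy integer-array indexing. The sketch below compares both approaches on a tiny made-up frame (`toy`, `n_u`, and `n_i` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train_data, with 1-based ids as in the notebook
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [1, 3, 2, 3],
    "rating": [5, 3, 4, 2],
})
n_u, n_i = 3, 3

# Loop version, mirroring the cell above
loop_matrix = np.zeros((n_u, n_i))
for line in toy.itertuples():
    loop_matrix[line[1] - 1, line[2] - 1] = line[3]

# Vectorized version: one assignment over paired row/column index arrays
vec_matrix = np.zeros((n_u, n_i))
vec_matrix[toy["user_id"].to_numpy() - 1, toy["item_id"].to_numpy() - 1] = (
    toy["rating"].to_numpy()
)

print(np.array_equal(loop_matrix, vec_matrix))
```

Both versions rely on the same `id - 1` offset, so whatever code later reads these matrices has to apply the same offset.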


### Similarity Matrices

In [285]:
from sklearn.metrics.pairwise import pairwise_distances

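"""
Cosine similarity is a common choice in recommender systems: the ratings are
seen as vectors in n-dimensional space and the similarity is calculated from
the angle between these vectors.
-> Since all ratings are non-negative, the similarity ranges from 0 to 1.

Note: pairwise_distances(..., metric="cosine") returns the cosine *distance*,
i.e. 1 - cosine similarity, so the matrices below hold distances (0 = same
direction). Subtract them from 1 to obtain similarities.
"""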

user_similarity_matrix = pairwise_distances(train_data_matrix, metric="cosine")
item_similarity_matrix = pairwise_distances(train_data_matrix.T, metric="cosine")
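
The difference between cosine distance and cosine similarity is easy to check on a toy matrix. In this sketch the three rating vectors are made up; `pairwise_distances` and `cosine_similarity` are the scikit-learn functions from `sklearn.metrics.pairwise`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

ratings = np.array([
    [5.0, 4.0, 0.0],  # user 0
    [5.0, 4.0, 0.0],  # user 1: identical to user 0
    [0.0, 0.0, 3.0],  # user 2: no overlap with user 0
])

dist = pairwise_distances(ratings, metric="cosine")
sim = 1.0 - dist  # convert distance to similarity

print(np.round(dist, 3))  # identical users are at distance 0
print(np.round(sim, 3))   # identical users have similarity 1
```

Used as weights, a distance matrix gives the largest influence to the least similar users or items, which is the opposite of what the prediction step intends.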

Similarity between users can be seen as weights that are multiplied by the ratings of a similar user (corrected for the average rating of that user). We need to normalize the weighted sum so that the ratings stay between 1 and 5 and, as a final step, add the average rating of the user for whom we are predicting.

The idea here is that some users tend to give consistently high or low ratings to all movies. The relative difference between the ratings a user gives matters more than the absolute values. E.g.: suppose user *k* gives 4 stars to their favourite movies and 3 stars to all other good movies, while another user *t* gives 5 stars to the movies they like and 3 stars to the ones they fell asleep over. These two users could have very similar taste but use the rating scale differently.

When making a prediction with item-based CF, we don't need to correct for the user's average rating, since the query user's own ratings are used to make the prediction.
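
Written as formulas, the two prediction rules described above look as follows. The notation is introduced here for illustration: $r_{u,i}$ is the rating of user $u$ for item $i$, $\bar{r}_u$ is the mean rating of user $u$, and $s$ denotes the user-user or item-item weights. In the code below the sums run over all rows or columns of the matrix, including unrated (zero) cells.

User-based:

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v} \left|s_{u,v}\right|}
$$

Item-based:

$$
\hat{r}_{u,i} = \frac{\sum_{j} s_{i,j}\, r_{u,j}}{\sum_{j} \left|s_{i,j}\right|}
$$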

### Predictions

In [286]:
def predict(ratings, similarity, type="user"):
    """output: The predicted ratings matrix, which has the same dimensions as the input ratings matrix."""
    if type == "user":
        # calculates the mean rating for each user (across all items)
        mean_user_rating = ratings.mean(axis=1)
        # calculates the difference between each user's ratings and their mean rating
        # use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        # - similarity.dot(ratings_diff): This computes the weighted sum of rating deviations
        #   for each user, using the similarity matrix: it tries to capture the influence of
        #   similar users' preferences.
        # - The result is divided by the sum of the absolute values of the user similarities
        #   (np.abs(similarity).sum(axis=1)), normalizing the prediction.
        # - the mean user rating is added back to adjust the prediction to the appropriate scale.
        pred = (
            mean_user_rating[:, np.newaxis]
            + similarity.dot(ratings_diff)
            / np.array([np.abs(similarity).sum(axis=1)]).T
        )
    elif type == "item":
        # - computes the predicted rating for each item as a weighted average
        #   of all other items, weighted by the item similarity scores.
        # - the ratings are multiplied (dot product) by the item similarity matrix.
        # - then, it's normalized by dividing by the sum of the absolute values
        #   of the item similarities (np.abs(similarity).sum(axis=1)).
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

In [287]:
item_prediction = predict(train_data_matrix, item_similarity_matrix, type="item")
user_prediction = predict(train_data_matrix, user_similarity_matrix, type="user")

There are many evaluation metrics, but one of the most popular for evaluating the accuracy of predicted ratings is *Root Mean Squared Error (RMSE)*.

### Model Evaluation

In [288]:
from sklearn.metrics import mean_squared_error
from math import sqrt


def rmse(prediction, ground_truth):
    """
    Since we only want to consider predicted ratings that are in the
    test dataset, we filter out all other elements in the prediction
    matrix with `prediction[ground_truth.nonzero()]`.
    """
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
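
To see what the masking does, here is a NumPy-only sketch of the same calculation on a tiny example. The helper name `masked_rmse` and the toy matrices are illustrative. A constant prediction of 3 stars is included as a reference point: on a 1-to-5 scale it can never be off by more than 2, so its RMSE is at most 2.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE computed only over cells that hold a rating (non-zero) in ground_truth."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

truth = np.array([[5.0, 0.0, 3.0],
                  [0.0, 1.0, 0.0]])  # three rated cells
pred = np.array([[4.0, 2.5, 3.0],
                 [1.0, 3.0, 4.0]])

# Errors on the rated cells are -1, 0 and 2, so RMSE = sqrt(5 / 3)
score = masked_rmse(pred, truth)

# Reference: always predict 3; errors are -2, 0 and 2, so RMSE = sqrt(8 / 3)
constant = masked_rmse(np.full_like(truth, 3.0), truth)

print(round(score, 3), round(constant, 3))
```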

In [289]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.124570497835974
Item-based CF RMSE: 3.4535601310961703


**Root Mean Square Error (RMSE)**
- RMSE is a standard way to measure the error of a model in predicting quantitative data.
- It is the square root of the mean of the squared differences between predicted and observed values, i.e. the quadratic mean of these differences.
- In simpler terms, RMSE tells you how far the predictions typically are from the actual ratings, in rating units. Lower values of RMSE indicate more accurate predictions.


**Interpreting the Output**
- **User-based CF RMSE**: `3.124570497835974`: This is the RMSE for the user-based collaborative filtering model. It summarizes the typical size of the error between the model's predictions and the actual ratings in the test data set, with large errors weighted more heavily than in a plain average. On a 1-to-5 rating scale a value of about 3.12 is very high: always predicting 3 stars could never give an RMSE above 2.
- **Item-based CF RMSE**: `3.4535601310961703`: This is the RMSE for the item-based collaborative filtering model. Like the user-based RMSE, it measures the typical error in the item-based predictions. The value of about 3.45 indicates a slightly higher prediction error than for the user-based model.


**Why This Matters**

- By comparing the RMSE of different models, you can assess which model performs better in terms of accuracy. In this case, the user-based model has a lower RMSE, suggesting it may be more accurate for your dataset.
- However, it's important to note that RMSE is just one metric, and it should be considered alongside other factors like the nature of the data, the context of the problem, and the specific use case of the recommendations.
- It's also worth noting that RMSE values are relative and should be interpreted within the context of your specific dataset and application. For instance, an RMSE of 3.12 could be acceptable for ratings on a 0-to-100 scale, but it is far too high for the 1-to-5 star scale used here.

Memory-based algorithms are easy to implement and produce reasonable prediction quality.

The drawback of memory-based CF is that it doesn't scale to real-world scenarios and doesn't address the well-known cold-start problem, that is, when a new user or a new item enters the system.

Model-based CF methods are scalable and can deal with higher sparsity levels than memory-based models, but they also suffer when new users or items without any ratings enter the system.

## Recommendation

In [290]:
def recommend_title(user_id: int, type: str):
    # Step 1: Choose method
    pred = user_prediction[user_id] if type == "user" else item_prediction[user_id]

    # Step 2: Get predicted ratings for this user
    predicted_ratings_df = pd.DataFrame(pred, columns=["predicted_rating"])

    # Step 3: Filter out items the user has already rated
    # user_rated_items = df[df["user_id"] == user_id]["item_id"].values
    # predicted_ratings_df = predicted_ratings_df[
    #     ~predicted_ratings_df.index.isin(user_rated_items)
    # ]
    

    # Step 4: Select the top 5 titles
    top_5_items = (
        predicted_ratings_df.sort_values(by="predicted_rating", ascending=False)
        .head(5)
        .index
    )

    # Step 5: Retrieve the titles
    top_5_titles = df[df["item_id"].isin(top_5_items)]["title"].unique().tolist()

    # Step 6: Print the top 5 titles
    print(f"Top 5 recommended titles for user {user_id} based on {type} prediction:\n")
    for item in top_5_titles:
        print(f"- {item}")
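
Because the matrices were filled at `[user_id - 1, item_id - 1]`, any lookup has to apply the same offset in both directions, and the ranking order is lost if titles are fetched with `isin`. The following is a hedged sketch of a selection step that handles both and also skips already-rated items. The helper name `top_n_item_ids` and the toy arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def top_n_item_ids(pred_row, rated_row, n=5):
    """Return 1-based item ids of the n best predictions among unrated items."""
    # Hide items the user has already rated
    scores = np.where(rated_row > 0, -np.inf, pred_row)
    # Column indices sorted by descending score
    order = np.argsort(-scores, kind="stable")[:n]
    # Drop masked entries in case fewer than n unrated items exist
    order = order[np.isfinite(scores[order])]
    # Column index k corresponds to item_id k + 1
    return (order + 1).tolist()

pred_row = np.array([4.8, 2.0, 4.5, 3.9, 1.0])   # predicted ratings for one user
rated_row = np.array([5.0, 0.0, 0.0, 0.0, 2.0])  # that user's known ratings

top = top_n_item_ids(pred_row, rated_row, n=2)
print(top)
```

In the notebook this could be called as `top_n_item_ids(user_prediction[user_id - 1], train_data_matrix[user_id - 1])`, with the titles then looked up one id at a time so that the ranking order is preserved.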

In [291]:
recommend_title(user_id=70, type='user')

Top 5 recommended titles for user 70 based on user prediction:

- Men in Black (1997)
- I.Q. (1994)
- Snow White and the Seven Dwarfs (1937)
- Apocalypse Now (1979)


In [292]:
recommend_title(user_id=70, type='item')

Top 5 recommended titles for user 70 based on item prediction:

- Tetsuo II: Body Hammer (1992)
- Innocent Sleep, The (1995)
- Broken English (1996)
- Underneath, The (1995)
- Death in the Garden (Mort en ce jardin, La) (1956)


In [293]:
df[df['user_id'] == 70].sort_values(by=['rating'], ascending=False).head(15)

Unnamed: 0,user_id,item_id,rating,timestamp,title
18017,70,174,5,884065782,Raiders of the Lost Ark (1981)
10397,70,423,5,884066910,E.T. the Extra-Terrestrial (1982)
29617,70,228,5,884064269,Star Trek: The Wrath of Khan (1982)
22539,70,419,5,884065035,Mary Poppins (1964)
28157,70,483,5,884064444,Casablanca (1942)
68363,70,511,5,884067855,Lawrence of Arabia (1962)
58976,70,588,5,884065528,Beauty and the Beast (1991)
944,70,172,5,884064217,"Empire Strikes Back, The (1980)"
25448,70,298,5,884064134,Face/Off (1997)
10271,70,143,5,884149431,"Sound of Music, The (1965)"


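**Conclusion**
- This method is not working properly 😅
- The items recommended by the two methods are unrelated to each other and to the user's highest-rated films.
- Likely contributing factors are visible in the code above: `pairwise_distances` returns cosine distances rather than similarities, unrated cells (zeros) are averaged as if they were ratings, and `recommend_title` indexes the prediction matrices with `user_id` and matches 0-based column indices against 1-based `item_id` values, although the matrices were filled at `[user_id - 1, item_id - 1]`.
- Alternatively, the SVD method is worth trying.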