<center>
    <h1 id='content-based-filtering' style='color:#7159c1; font-size:350%'>Collaborative Filtering</h1>
    <i style='font-size:125%'>Recommendations of Items from Similar Users</i>
</center>

> **Topics**

```
- ✨ Content-Based Filtering Problems
- ✨ Collaborative Filtering
- ✨ User-Based Approach
- ✨ K-Nearest Neighbors
- ✨ Hands-on
```

<h1 id='0-content-based-filtering-problems' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Content-Based Filtering Problems</h1>

In the previous two notebooks, we dove into Content-Based Filtering using the Plot Description and Metadata approaches and got better recommendation results!!

Nonetheless, you may be wondering: *"Okay, where is the catch? Is this method really perfect? Are there any problems with it?"*. And yes, even though it gives better results, Content-Based Filtering has some cons.

The first problem is that the recommendations are based on similar items regardless of the user's tastes. Picture this: if user A and user B are both into Mob Psycho 100, they will receive the same recommendations of similar animes, regardless of their individual tastes and, consequently, a Recommendation Bubble is created.

Besides, people's tastes change over time. Even though user A is into shounen animes like Mob Psycho 100 today, in a few weeks this very user may be into slice-of-life animes. Since the recommendations keep using Mob Psycho 100 as a parameter, the user will not receive any slice-of-life recommendations and may end up searching for another platform to watch what they are looking for.

Thus, in order to minimize these problems, a new recommendation method was created: `Collaborative Filtering`!! Let's find out what it is and how it works.

<h1 id='1-collaborative-filtering' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Collaborative Filtering</h1>

`Collaborative Filtering` recommends animes that similar users liked, which brings it closer to each user's real tastes. If you use Netflix, you have probably already stumbled upon some series marked as `For you`. If so, congrats, that is a real-world Collaborative Filtering Recommendation!! To make things even clearer, assume that two similar users, user A and user B, like Demon Slayer, and that user B is also into Grand Blue: the platform will then recommend Grand Blue to user A.

Besides, this Filtering has two modes: 1) `User-Based`, where the user receives recommendations of items that similar users liked; and 2) `Item-Based`, where the user receives recommendations of items similar to the ones they already rated well, with the similarity between items measured from how users rated them.
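
To make the two modes more concrete, here is a minimal sketch on made-up ratings. It assumes Cosine Similarity and treats a missing rating as 0, which is a simplification; the function name `cosine_similarity_matrix` and the data are hypothetical. The only difference between the modes is whether rows (users) or columns (animes) are compared.

```python
import numpy as np

# Rows are users, columns are animes; 0 means "not rated" (toy data for this sketch).
ratings = np.array([
    [9, 8, 0, 2],
    [8, 9, 7, 0],
    [1, 0, 9, 8],
], dtype=float)

def cosine_similarity_matrix(matrix):
    """Cosine Similarity between every pair of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / np.where(norms == 0, 1.0, norms)  # avoids division by zero
    return normalized @ normalized.T

user_similarity = cosine_similarity_matrix(ratings)    # User-Based: compares rows (users)
item_similarity = cosine_similarity_matrix(ratings.T)  # Item-Based: compares columns (animes)

print(user_similarity.round(2))
print(item_similarity.round(2))
```

In this toy data, the first user ends up closer to the second user than to the third one, because the first two rated the same animes in a similar way.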


About the advantages:

> **Better Recommendations** - `since it recommends animes that similar users liked, this method tends to get closer to the user's tastes when compared to Content-Based and Demographic Filtering`;

> **Personalized Recommendations** - `even when two users search for recommendations using the same anime, for instance Mob Psycho 100, each of them will receive different recommendations according to their own tastes`;

> **Low Bubble of Recommendations** - `consequently, the probability of a Bubble of Recommendations being created is low and, even if one is created, it will be small`.

<br />

Disadvantages-wise:

> **More Data Required** - `in order to get closer to the users' tastes, in addition to animes data, users data is also needed, such as their ratings on previously watched animes`;

> **Bubble of Recommendations** - `even though the probability of a small Bubble of Recommendations being created is low, there is still a risk of it happening`;

> **Outliers** - `a cut-off on the number of ratings and on the mean rating score per user is needed in order to keep outliers out of the recommendations. For instance, consider that user A rated 100 animes, most of them with score 1, and that this user's mean score over all rated animes is 1.5. It means the user gave bad ratings to nearly every anime they watched and, consequently, may be up to no good on the platform, adding outliers to the ratings`;

> **More Computational Cost and Power** - `Collaborative Filtering Algorithms are more complex and sophisticated than the previous ones, so more computational cost and power are needed to run them`.
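
The cut-off described under **Outliers** could be sketched with pandas as below. This is only an illustration: the thresholds `MIN_RATINGS_PER_USER` and `MIN_MEAN_RATING`, the helper `drop_outlier_users` and the toy data are assumptions, and the column names simply mirror the ones used later in this notebook.

```python
import pandas as pd

# Toy ratings table (made up for this sketch).
ratings = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'anime_id': [10, 11, 12, 10, 11, 12, 10, 11, 12, 13],
    'rating':   [1, 1, 2, 8, 7, 9, 6, 9, 4, 7],
})

# Hypothetical thresholds: they would have to be tuned for the real dataset.
MIN_RATINGS_PER_USER = 3
MIN_MEAN_RATING = 2.0

def drop_outlier_users(df, min_ratings, min_mean):
    """Keeps only the users with enough ratings and a mean rating at or above the cut-off."""
    stats = df.groupby('user_id')['rating'].agg(['count', 'mean'])
    kept_users = stats[(stats['count'] >= min_ratings) & (stats['mean'] >= min_mean)].index
    return df[df['user_id'].isin(kept_users)]

filtered = drop_outlier_users(ratings, MIN_RATINGS_PER_USER, MIN_MEAN_RATING)
print(sorted(filtered['user_id'].unique().tolist()))  # [2, 3]
```

User 1 is dropped because their mean rating (about 1.33) is below the cut-off, while users 2 and 3 are kept.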

<br />

The image below illustrates how this technique works:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/2-collaborative-filtering.png' alt='Collaborative Filtering Diagram' />
    <figcaption>Figure 1 - Collaborative Filtering Diagram. By <a href='https://www.analyticsvidhya.com/blog/2022/02/introduction-to-collaborative-filtering/'>Shivam Baldha - Introduction to Collaborative Filtering©</a>.</figcaption>
</figure>

<br /><br />

In this notebook, we are going deeper into the User-Based technique and using K-Nearest Neighbors to find similar users.

<h1 id='2-user-based-approach' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | User-Based Approach</h1>

In a few words, consider a person called user E: the `Collaborative Filtering User-Based Approach` works by finding users similar to user E and then recommending to him the items that those similar users liked.

To do it, the Algorithm first calculates the similarity between the users using `Pearson Correlation, Cosine Similarity or another metric`, then predicts the rating user E would give to the animes that the most similar users have watched, and finally recommends the ones with the highest predicted ratings.
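
As a rough sketch of the similarity step, the function below computes the Pearson Correlation between two users over the items both of them rated. The name `pearson_similarity` and the toy ratings are hypothetical, and real libraries may handle the edge cases differently.

```python
import math
import numpy as np

def pearson_similarity(ratings_a, ratings_b):
    """Pearson Correlation over the items rated by both users; NaN marks a missing rating."""
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)
    both_rated = ~np.isnan(a) & ~np.isnan(b)
    if both_rated.sum() < 2:
        return float('nan')  # not enough co-rated items to compare the users
    a, b = a[both_rated], b[both_rated]
    a_centered, b_centered = a - a.mean(), b - b.mean()
    denominator = math.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    if denominator == 0:
        return float('nan')  # a user who gave the same score everywhere has no variance
    return float((a_centered * b_centered).sum() / denominator)

# Toy ratings (made up); each position is one anime.
user_e = [5, 4, np.nan, 1]
user_x = [4, 3, 5, 1]            # similar tastes to user E
user_y = [1, 2, 4, 5]            # opposite tastes to user E
user_z = [np.nan, np.nan, 3, np.nan]  # no anime in common with user E

for name, other in (('x', user_x), ('y', user_y), ('z', user_z)):
    print(name, pearson_similarity(user_e, other))
```

Here user x gets a similarity close to 1, user y gets -1 and user z gets NaN, since user z shares no rated anime with user E.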

For example, consider the following situation where we want to recommend movies to user E. The first step is to calculate the similarity of the other users to this one. The image below illustrates the situation:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/18.0-collaborative-filtering-user-based.png' alt='Collaborative Filtering example using User-Based approach' />
    <figcaption>Figure 2 - Board illustrating the similarity of the users to user E. The rows (indexes) are the users, the columns are the movies with the ratings each user gave to them, and the last column is the similarity of each user to user E. The similarity has been calculated using Pearson Correlation. Since users A and F have not rated any movie that user E has rated, their similarity is 0 (NaN). Since the similarity is calculated relative to user E, user E has full similarity to himself; also, user D is totally different from user E because their similarity is -1. By <a href='https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system'>Sibtesam Ahmed - Getting Started with a Movie Recommendation System©</a>.</figcaption>
</figure>

<br /><br />

After calculating the similarities, we have to predict the ratings that user E would give to the movies he hasn't rated and then recommend to him the movies that the most similar users liked and that got the highest predicted ratings. The following image pictures the results:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/18.1-collaborative-filtering-user-based.png' alt='Collaborative Filtering example results using User-Based approach' />
    <figcaption>Figure 3 - Board illustrating the results of the Collaborative Filtering. The predicted ratings of user E are marked with asterisks (*). The most similar users to user E are C and B. Probably, Avengers would be the recommended movie, since it is the movie that a similar user (B) liked and that got a high predicted rating for user E. By <a href='https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system'>Sibtesam Ahmed - Getting Started with a Movie Recommendation System©</a>.</figcaption>
</figure>

<br /><br />
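
The prediction step can be sketched as a similarity-weighted average of the ratings given by similar users. This is one common formula and not necessarily the one used to build the figures above; the helper `predict_rating` and the numbers are made up for illustration.

```python
def predict_rating(neighbors):
    """
    Similarity-weighted average of the neighbors' ratings for one item.

    neighbors: list of (similarity_to_target_user, rating_given_by_neighbor) pairs.
    Neighbors with non-positive (or NaN) similarity are ignored.
    """
    useful = [(sim, rating) for sim, rating in neighbors if sim > 0]
    total_similarity = sum(sim for sim, _ in useful)
    if total_similarity == 0:
        return None  # no similar user rated this item, so nothing can be predicted
    return sum(sim * rating for sim, rating in useful) / total_similarity

# Made-up neighbors of the target user for a single movie.
neighbors_for_movie = [(0.9, 5.0), (0.6, 4.0), (-1.0, 1.0)]
prediction = predict_rating(neighbors_for_movie)
print(round(prediction, 2))  # 4.6
```

The neighbor with similarity -1 is left out, so the prediction is (0.9 * 5 + 0.6 * 4) / (0.9 + 0.6) = 4.6. The movies with the highest predicted ratings would then be the ones recommended.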

<h1 id='3-k-nearest-neighbors' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | K-Nearest Neighbors</h1>

Instead of using Pearson Correlation or Cosine Similarity directly to find users similar to a given one, we are going one step further and applying `K-Nearest Neighbors` to do this task. This algorithm does one thing differently from the example in the previous section: instead of looking for individual similar users, it considers that the users are grouped into clusters, and its major goal is to find the cluster most similar to a given user.

K-Nearest Neighbors works like this:

> 1 - group the users into clusters (when the categories are known, we can stick to them. When the categories are unknown, we can use Unsupervised Machine Learning Algorithms, such as `K-Means Clustering`, to cluster the data);

> 2 - for a given user, find the K nearest neighbors, "K" being the number of nearest neighbors to be considered;

> 3 - when "K" is equal to 1, the given user is assigned to the cluster of its single nearest neighbor. When "K" is greater than 1, the given user is assigned to the cluster that most of the nearest neighbors belong to. If there is a tie, we randomly choose one of the tied clusters (picture that user E has 5 nearest neighbors from cluster Red and 5 others from cluster Blue. Since both clusters have the same number of users among the nearest neighbors, we randomly choose between Red and Blue as the cluster user E is similar to).
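
The three steps above can be sketched in plain Python as follows. The function `knn_assign_cluster`, the Euclidean distance and the two-dimensional toy points are assumptions made for this illustration only.

```python
import math
import random
from collections import Counter

def knn_assign_cluster(point, labeled_points, k, rng=None):
    """
    Assigns `point` to the cluster that most of its k nearest neighbors belong to.

    labeled_points: list of (coordinates, cluster_label) pairs (step 1 is assumed done).
    """
    rng = rng or random.Random()
    # Step 2: sort the known points by distance and keep the k closest ones.
    by_distance = sorted(labeled_points, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    # Step 3: majority vote; ties are broken at random.
    top_count = max(votes.values())
    tied = sorted(label for label, count in votes.items() if count == top_count)
    return rng.choice(tied)

# Toy users already grouped into two clusters (made-up coordinates).
users = [
    ((1.0, 1.0), 'Red'), ((1.5, 2.0), 'Red'), ((2.0, 1.0), 'Red'),
    ((5.0, 5.0), 'Blue'), ((6.0, 5.5), 'Blue'), ((5.5, 6.5), 'Blue'),
]
user_e = (2.0, 2.0)

print(knn_assign_cluster(user_e, users, k=3))  # Red
```

With `k=3`, the three closest users to user E are all from the Red cluster, so Red wins the vote.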

<br />

The image below pictures an example of the clustering:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/19-k-nearest-neighbors.png' alt='Example of K-Nearest Neighbors Algorithm assigning a cluster to a given data point' />
    <figcaption>Figure 4 - Example of K-Nearest Neighbors Algorithm assigning a cluster to a given data point. By <a href='https://www.youtube.com/watch?v=HVXime0nQeI'>StatQuest with Josh Starmer - StatQuest: K-nearest neighbors, Clearly Explained©</a>.</figcaption>
</figure>

<br /><br />

About the number of Nearest Neighbors (K) to be used, we have to consider the following:

> 1 - There is no physical or biological way to determine the best value for "K", so you may have to try out a few values before settling on one. Do this by pretending part of the training data is "unknown";

> 2 - Low values for K, such as K=1 or K=2, can be noisy and subject to the effects of outliers;

> 3 - Large values for K smooth things over, but you do not want K to be so large that a category with only a few samples in it will always be outvoted by other categories.
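
A minimal sketch of point 1, assuming synthetic two-dimensional data and hypothetical helper names (`knn_predict`, `holdout_accuracy`): part of the labeled data is held out as "unknown", and a few values of K are compared by how well they recover the hidden labels.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k):
    """Label of the majority among the k training points closest to `query`."""
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]  # in this sketch, ties resolve to the smallest label

def holdout_accuracy(train_x, train_y, valid_x, valid_y, k):
    """Share of held-out points whose label is recovered with the given k."""
    predictions = np.array([knn_predict(train_x, train_y, q, k) for q in valid_x])
    return float((predictions == valid_y).mean())

rng = np.random.default_rng(0)
# Two synthetic categories: a large one (label 0) and a small one (label 1).
big = rng.normal(loc=0.0, scale=1.0, size=(60, 2))
small = rng.normal(loc=4.0, scale=1.0, size=(12, 2))
x = np.vstack([big, small])
y = np.array([0] * 60 + [1] * 12)

# Pretend 25% of the labeled data is "unknown".
order = rng.permutation(len(x))
valid_idx, train_idx = order[:18], order[18:]

scores = {
    k: holdout_accuracy(x[train_idx], y[train_idx], x[valid_idx], y[valid_idx], k)
    for k in (1, 3, 5, 15, 45)
}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Point 3 also shows up in this setup: if K is set to the size of the whole training set, every query is assigned to the large category, even a point sitting in the middle of the small one.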

<br />

For better explanations about how K-Nearest Neighbors and K-Means Clustering work, consider watching these two videos provided by [StatQuest with Josh Starmer](https://www.youtube.com/@statquest): [StatQuest: K-nearest neighbors, Clearly Explained](https://www.youtube.com/watch?v=HVXime0nQeI) and [StatQuest: K-means clustering](https://www.youtube.com/watch?v=4b5d3muPQmA).

<h1 id='4-hands-on' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Hands-on</h1>

```
- Settings
- Reading Datasets
- Dropping Variables
- Splitting Dataset into Training and Validation
- Training the Model
- Recommendations
```

---

**- Settings**

In [None]:
# ---- Imports ----
import inflect       # pip install inflect
import numpy as np   # pip install numpy
import pandas as pd  # pip install pandas

# ---- Surprise Imports ----
#
# pip install scikit-surprise
#
# If you get any problems installing the package, follow the steps in this Stack Overflow link:
# - https://stackoverflow.com/questions/44951456/pip-error-microsoft-visual-c-14-0-is-required
#
from surprise import accuracy
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise.dataset import Dataset
from surprise.model_selection.validation import cross_validate
from surprise.reader import Reader
from surprise.model_selection import RandomizedSearchCV
from surprise.model_selection import train_test_split

# ---- Constants ----
DATASETS_PATH = ('./datasets')
INFLECT_ENGINE = (inflect.engine())
SEED = (20240106)
SURPRISE_READER = (Reader()) # default rating_scale is (1, 5); pass rating_scale=(min, max) if the ratings use another scale

TRAINING_DF_SIZE = (0.80)
VALIDATION_DF_SIZE = (0.20)

SIMILARITY_OPTIONS = {
    'name': ['cosine', 'pearson', 'pearson_baseline'] # similarity metrics: the best one will be used for recommendations;
    , 'user_based': [True] # True: User-Based Approach; False: Item-Based Approach;
    , 'min_support': [3, 4, 5] # User-Based: minimum number of similar items to be considered; Item-Based: minimum number of similar users to be considered.
}
KNN_PARAMS = {
    'k': range(30, 100, 1)              # list of nearest neighbors to be considered: the best one will be used for recommendations;
    , 'sim_options': SIMILARITY_OPTIONS # similarity parameters.
}

# ---- Settings ----
np.random.seed(SEED)
pd.set_option('display.max_columns', None)

# ---- Functions ----
def find_best_knn_model(model, parameters, dataset):
    """
    Description:
        - runs a Randomized Search CV to find the best score, parameters and estimator for a
    K-Nearest Neighbors model;
        - returns the fitted search object, which exposes the best results per evaluation metric.
    
    Parameters:
        - model: Surprise algorithm class to be tuned. For this notebook, the K-Nearest Neighbors variants are the chosen ones;
        - parameters: dictionary of KNN Parameters;
        - dataset: Surprise Dataset.
    """
    randomized_search = RandomizedSearchCV(
        model
        , parameters
        , n_jobs=4                  # number of CPU cores used on models' training and validation
        , measures=['rmse', 'mse']  # evaluation metrics
        , random_state=SEED         # seed used when sampling the parameter combinations
    )
    randomized_search.fit(dataset)
    
    # each attribute below is a dictionary keyed by evaluation metric ('rmse' and 'mse')
    print(f'- Best Score: {randomized_search.best_score}')
    print(f'- Best Parameters: {randomized_search.best_params}')
    print(f'- Best Estimator: {randomized_search.best_estimator}')
    
    return randomized_search

---

**- Reading Datasets**

In [None]:
# ---- Reading Animes Datasets ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')

print(f'- Number of Observations: {animes_df.shape[0]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[0])})')
print(f'- Number of Variables: {animes_df.shape[1]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[1])})')
print('---')

animes_df.head()

In [None]:
# ---- Reading Users Dataset ----
users_df = pd.read_csv(f'{DATASETS_PATH}/users-details-transformed-2023.csv', index_col='id')

print(f'- Number of Observations: {users_df.shape[0]} ({INFLECT_ENGINE.number_to_words(users_df.shape[0])})')
print(f'- Number of Variables: {users_df.shape[1]} ({INFLECT_ENGINE.number_to_words(users_df.shape[1])})')
print('---')

users_df.head()

In [None]:
# ---- Reading Ratings Dataset ----
ratings_df = pd.read_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv')

print(f'- Number of Observations: {ratings_df.shape[0]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[0])})')
print(f'- Number of Variables: {ratings_df.shape[1]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[1])})')
print('---')

ratings_df.head()

---

**- Dropping Variables**

The Surprise package needs only three variables: the user id, the anime id and the rating the user gave to the anime. Thus, we have to drop the user name and anime title variables.

In [None]:
# ---- Dropping Variables ----
variables_to_keep = ['user_id', 'anime_id', 'rating']
ratings_df = ratings_df[variables_to_keep]

---

**- Splitting Dataset into Training and Validation**

In [None]:
# ---- Splitting Dataset into Training and Validation ----
#
# - converting Pandas DataFrame into Surprise DataFrame
#
ratings_surprise_df = Dataset.load_from_df(ratings_df, SURPRISE_READER)

In [None]:
# ---- Splitting Dataset into Training and Validation ----
#
# - training: 80%
# - validation: 20%
#
training_surprise_df, validation_surprise_df = train_test_split(
    data=ratings_surprise_df
    , train_size=TRAINING_DF_SIZE
    , test_size=VALIDATION_DF_SIZE
    , random_state=SEED
)

---

**- Training the Model**

In [None]:
%%time
knn_model = find_best_knn_model(
    KNNBasic
    , KNN_PARAMS
    , ratings_surprise_df
)

In [None]:
%%time
knn_with_means_model = find_best_knn_model(
    KNNWithMeans
    , KNN_PARAMS
    , ratings_surprise_df
)
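
For the Recommendations step listed in the Hands-on outline, a possible sketch is shown below. It assumes that predictions are available as `(user id, item id, true rating, estimated rating, details)` tuples, which is the field order of the Prediction objects returned by a fitted Surprise algorithm's `test` method. The helper `top_n_recommendations` and the hand-written predictions are hypothetical stand-ins, so the block runs without training any model.

```python
from collections import defaultdict

def top_n_recommendations(predictions, n=10):
    """
    Groups predictions by user and keeps the n items with the highest estimated rating.

    predictions: iterable of (user_id, item_id, true_rating, estimated_rating, details) tuples.
    """
    per_user = defaultdict(list)
    for user_id, item_id, _true_rating, estimated_rating, _details in predictions:
        per_user[user_id].append((item_id, estimated_rating))
    return {
        user_id: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for user_id, items in per_user.items()
    }

# Hand-written stand-in for the predictions of a fitted model (assumption for this sketch).
fake_predictions = [
    (1, 100, None, 8.2, {}),
    (1, 101, None, 9.1, {}),
    (1, 102, None, 6.5, {}),
    (2, 100, None, 7.0, {}),
]

recommendations = top_n_recommendations(fake_predictions, n=2)
print(recommendations)  # {1: [(101, 9.1), (100, 8.2)], 2: [(100, 7.0)]}
```

With a real model, the predictions would come from an algorithm fitted on the training set and asked to predict the animes each user has not rated yet.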

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).