<center>
    <h1 id='content-based-filtering' style='color:#7159c1; font-size:350%'>Collaborative Filtering</h1>
    <i style='font-size:125%'>Recommendations of Items from Similar Users</i>
</center>

> **Topics**

```
- 🧑‍🤝‍🧑 Content-Based Filtering Problems
- 🧑‍🤝‍🧑 Collaborative Filtering
- 🧑‍🤝‍🧑 User-Based Approach
- 🧑‍🤝‍🧑 K-Nearest Neighbors
- 🧑‍🤝‍🧑 K-Nearest Neighbors Basic VS K-Nearest Neighbors With Means VS K-Nearest Neighbors With Z-Score
- 🧑‍🤝‍🧑 Grid Search CV VS Randomized Search CV
- 🧑‍🤝‍🧑 Hands-on
```

<h1 id='0-content-based-filtering-problems' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧑‍🤝‍🧑 | Content-Based Filtering Problems</h1>

In the previous two notebooks, we dived into Content-Based Filtering with the Plot Description and Metadata approaches and got better recommendation results!!

Nonetheless, you may be wondering: *"Okay, where is the catch? Is this method really perfect? Are there any problems with it?"*. And yes, even though it gives better results, Content-Based Filtering has some drawbacks.

The first problem is that the recommendations are based on similar items regardless of the user's tastes. Picture this: if user A and user B are both into Mob Psycho 100, they will receive the same recommendations of similar animes, regardless of their individual tastes and, consequently, a Recommendation Bubble is created.

Besides, people's tastes change over time. Even though user A is into shounen animes like Mob Psycho 100 today, in a few weeks this very user may be into slice-of-life animes. Since the recommendations still use Mob Psycho 100 as a parameter, the user will not receive any slice-of-life recommendations and may end up searching for another platform to watch what they are looking for.

Thus, in order to minimize these problems, another recommendation method can be used: `Collaborative Filtering`!! Let's find out what it is and how it works.

<h1 id='1-collaborative-filtering' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧑‍🤝‍🧑 | Collaborative Filtering</h1>

`Collaborative Filtering` recommends animes that similar users liked, which lets it get closer to the users' real tastes. If you use Netflix, you have probably stumbled upon some series marked as `For you`. If so, congrats, that is a real-world Collaborative Filtering Recommendation!! To make things even clearer, assume that two similar users, user A and user B, like Demon Slayer, and that user B is also into Grand Blue. In this case, the platform will recommend Grand Blue to user A.

Besides, this Filtering has two modes: 1) `User-Based`, where the user receives recommendations of items that similar users liked; and 2) `Item-Based`, where the user receives recommendations of items that are similar to the ones the user already rated well, with the similarity between two items being measured from the ratings users gave to both. Both of them follow the `Memory/Neighborhood Logic`.
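
The Demon Slayer and Grand Blue example can be sketched in a few lines of Python. This is only a toy illustration: the `liked` dictionary and the `recommend_from_similar_users` helper are hypothetical names, and "similar" is simplified here to "shares at least one liked anime".

```python
# Toy data (hypothetical): the animes each user liked
liked = {
    'user_a': {'Demon Slayer'},
    'user_b': {'Demon Slayer', 'Grand Blue'},
    'user_c': {'Mob Psycho 100'},
}


def recommend_from_similar_users(target_user, liked_by_user):
    """Recommends animes liked by users who share at least one liked anime with the target user."""
    target_likes = liked_by_user[target_user]
    recommendations = set()
    for other_user, other_likes in liked_by_user.items():
        if other_user == target_user:
            continue
        # Simplification: two users count as similar when their liked sets overlap
        if target_likes & other_likes:
            recommendations |= other_likes - target_likes
    return sorted(recommendations)


print(recommend_from_similar_users('user_a', liked))  # ['Grand Blue']
```

User C shares nothing with user A, so Mob Psycho 100 is not recommended. A real system would replace the overlap check with a proper similarity score.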


About the advantages:

> **Better Recommendations** - `since it recommends animes that similar users liked, this method tends to get closer to the user's tastes when compared to Content-Based and Demographic Filtering`;

> **Community Tastes** - `it recommends similar items that similar users liked`;

> **Personalized Recommendations** - `even though two users are searching for recommendations using the same anime, such as Mob Psycho 100, both of them will receive different recommendations due to their tastes`;

> **Low Bubble of Recommendations** - `consequently, the probability of a Bubble of Recommendations being created is low and, even if one is created, it will be small`.

<br />

Disadvantages-wise:

> **More Data Required** - `in order to get closer to the users' tastes, in addition to the animes data, users data is needed, such as their ratings on previously watched animes`;

> **Few Data Available for Items** - `new animes will have few ratings from users, leading to outliers`;

> **Bubble of Recommendations** - `even though the probability of a Bubble of Recommendations being created is low, there is still a risk of it happening`;

> **Outliers** - `a cut-off on the number of ratings and on the mean rating per user is needed in order to avoid outliers in the recommendations. For instance, consider that user A rated 100 animes, most of them with a score of 1, and that the user's mean score over all rated animes is 1.5. It means that the user gave a bad rating to almost every anime watched and, consequently, may be acting in bad faith on the platform, adding outliers to the ratings`;

> **More Computational Cost and Power** - `Collaborative Filtering Algorithms are more complex and sophisticated than the previous ones, so more computational cost and power are needed to run them`.

<br />

The image below illustrates how this technique works:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/2-collaborative-filtering.png' alt='Collaborative Filtering Diagram' />
    <figcaption>Figure 1 - Collaborative Filtering Diagram. By <a href='https://www.analyticsvidhya.com/blog/2022/02/introduction-to-collaborative-filtering/'>Shivam Baldha - Introduction to Collaborative Filtering©</a>.</figcaption>
</figure>

<br /><br />

In this notebook, we are going to focus on the User-Based technique and use K-Nearest Neighbors to find similar users.

<h1 id='2-user-based-approach' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧑‍🤝‍🧑 | User-Based Approach</h1>

In a few words, consider a person called user E. The `Collaborative Filtering User-Based Approach` works by finding users similar to user E and then recommending to user E the items that those similar users liked.

To do it, the Algorithm first calculates the similarity between the users using `Pearson Correlation, Cosine Similarity or another metric`, then predicts the rating user E would give to the animes that the most similar users have watched and recommends the ones with the highest predicted ratings.
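
As a minimal sketch of the similarity step, the function below computes the Pearson Correlation between two users over the items both of them rated. The `ratings` dictionary and the `pearson_similarity` name are hypothetical, and returning 0.0 when the correlation is undefined is an assumption made for this sketch.

```python
from math import sqrt

# Hypothetical ratings: {user: {movie: rating}}
ratings = {
    'user_e': {'movie_1': 5, 'movie_2': 3, 'movie_3': 1},
    'user_b': {'movie_1': 4, 'movie_2': 3, 'movie_3': 2, 'movie_4': 5},
    'user_d': {'movie_1': 1, 'movie_2': 3, 'movie_3': 5, 'movie_4': 2},
}


def pearson_similarity(ratings_u, ratings_v):
    """Pearson Correlation over the items both users rated; 0.0 when it is undefined."""
    common_items = sorted(set(ratings_u) & set(ratings_v))
    if len(common_items) < 2:
        return 0.0
    u = [ratings_u[item] for item in common_items]
    v = [ratings_v[item] for item in common_items]
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    covariance = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if spread_u == 0 or spread_v == 0:
        return 0.0  # a user who gave the same rating to every common item
    return covariance / (spread_u * spread_v)


print(round(pearson_similarity(ratings['user_e'], ratings['user_b']), 4))  # 1.0
print(round(pearson_similarity(ratings['user_e'], ratings['user_d']), 4))  # -1.0
```

In this toy data, user B ranks the common movies in the same order as user E (similarity 1), while user D ranks them in the opposite order (similarity -1).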

For example, consider the following situation where we want to recommend movies to user E. The first step is to calculate the similarity of the other users to this one. The image below illustrates the situation:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/6-collaborative-filtering-user-based.png' alt='Collaborative Filtering example using User-Based approach' />
    <figcaption>Figure 2 - Table illustrating the similarity of the users to user E. The rows are the users, the columns are the movies with each user's ratings, and the last column is the similarity of each user to user E, calculated using Pearson Correlation. Since users A and F have not rated any of the movies that user E has rated, their similarity cannot be computed (NaN) and is treated as 0. Since the similarity is calculated relative to user E, user E has full similarity to itself; also, user D is the exact opposite of user E, since their similarity is -1. By <a href='https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system'>Sibtesam Ahmed - Getting Started with a Movie Recommendation System©</a>.</figcaption>
</figure>

<br /><br />

After calculating the similarities, we have to predict the ratings that user E would give to the movies he hasn't rated and then recommend to him the movies that the most similar users liked and that got the highest predicted ratings for user E. The following image pictures the results:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/7-collaborative-filtering-user-based.png' alt='Collaborative Filtering example results using User-Based approach' />
    <figcaption>Figure 3 - Table illustrating the results of the Collaborative Filtering. The predicted ratings of user E are marked with asterisks (*). The users most similar to user E are C and B. Avengers would probably be the recommended movie, since it is a movie that a similar user (B) liked and that got a high predicted rating for user E. By <a href='https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system'>Sibtesam Ahmed - Getting Started with a Movie Recommendation System©</a>.</figcaption>
</figure>
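
The prediction step can be sketched as a similarity-weighted average of the ratings given by the similar users. The numbers below are hypothetical and are not the ones shown in Figure 3, and ignoring users with zero or negative similarity is an assumption of this sketch, not necessarily what the original example does.

```python
def predict_rating(similarities, neighbor_ratings):
    """Similarity-weighted average of the ratings given by positively correlated users."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for user, rating in neighbor_ratings.items():
        similarity = similarities.get(user, 0.0)
        if similarity <= 0:
            continue  # users with no or negative similarity are ignored in this sketch
        weighted_sum += similarity * rating
        similarity_sum += similarity
    # None signals that no similar user rated the movie
    return weighted_sum / similarity_sum if similarity_sum else None


similarities_to_e = {'user_b': 0.9, 'user_c': 0.6, 'user_d': -1.0}  # hypothetical values
ratings_for_movie = {'user_b': 5, 'user_c': 4, 'user_d': 1}         # hypothetical values

print(round(predict_rating(similarities_to_e, ratings_for_movie), 2))  # 4.6
```

The more similar a user is to user E, the more that user's rating pulls the prediction towards it.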

<br /><br />

<h1 id='3-k-nearest-neighbors' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧑‍🤝‍🧑 | K-Nearest Neighbors</h1>

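Instead of comparing user E against every other user as in the previous example, we are going a step further and applying `K-Nearest Neighbors` (KNN) to this task. A similarity metric such as Pearson Correlation or Cosine Similarity is still used to measure how close two users are, but only the K closest users (the nearest neighbors) are taken into account. In its classic form, KNN is a classification algorithm: it considers that the users are grouped into clusters and its goal is to find the cluster that best matches a given user. For rating prediction, the ratings of the K nearest neighbors are combined instead of counting votes, but the neighbor search is the same.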

K-Nearest Neighbors works like this:

> 1 - group the users into clusters (when the categories are known, we can stick to them. When the categories are unknown, we can use Unsupervised Machine Learning Algorithms, such as `K-Means Clustering`, to cluster the data);

> 2 - for a given user, find the K nearest neighbors, where "K" is the number of nearest neighbors to be considered;

> 3 - when "K" is equal to 1, the given user is assigned to the cluster of its single nearest neighbor. When "K" is greater than 1, the given user is assigned to the cluster that most of its nearest neighbors belong to. If there is a tie, we randomly choose one of the tied clusters (picture that user E has 5 nearest neighbors from cluster Red and 5 others from cluster Blue. Since both clusters have the same number of nearest neighbors, we randomly choose between Red and Blue).
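
The three steps above can be sketched as follows. This is a minimal illustration, assuming users are represented as 2D points and that Euclidean distance measures closeness; `knn_assign_cluster` and the sample data are hypothetical.

```python
import random
from collections import Counter
from math import dist


def knn_assign_cluster(point, labeled_points, k, rng=random):
    """Assigns `point` to the cluster most common among its k nearest labeled points."""
    # Step 2: sort the labeled points by Euclidean distance and keep the k closest
    nearest = sorted(labeled_points, key=lambda item: dist(point, item[0]))[:k]
    # Step 3: count the clusters among the neighbors and break ties at random
    votes = Counter(cluster for _, cluster in nearest)
    highest_vote = max(votes.values())
    tied_clusters = sorted(cluster for cluster, count in votes.items() if count == highest_vote)
    return rng.choice(tied_clusters)


# Step 1 (hypothetical data): users already grouped into the clusters 'Red' and 'Blue'
labeled_users = [
    ((1.0, 1.0), 'Red'), ((1.5, 2.0), 'Red'), ((2.0, 1.5), 'Red'),
    ((6.0, 6.0), 'Blue'), ((6.5, 7.0), 'Blue'), ((7.0, 6.5), 'Blue'),
]

print(knn_assign_cluster((2.0, 2.0), labeled_users, k=3))  # Red
```

With `k=6`, the same point would get three Red and three Blue votes, so the result would be picked at random between the two clusters.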

<br />

The image below pictures an example of the clustering:

<br />

<figure style='text-align:center'>
    <img style='border-radius:20px' src='./assets/8-k-nearest-neighbors.png' alt='Example of K-Nearest Neighbors Algorithm assigning a cluster to a given data point' />
    <figcaption>Figure 4 - Example of K-Nearest Neighbors Algorithm assigning a cluster to a given data point. By <a href='https://www.youtube.com/watch?v=HVXime0nQeI'>StatQuest with Josh Starmer - StatQuest: K-nearest neighbors, Clearly Explained©</a>.</figcaption>
</figure>

<br /><br />

About the number of Nearest Neighbors (K) to use, we have to consider the following:

> 1 - There is no physical or biological way to determine the best value for "K", so you may have to try out a few values before settling on one. Do this by pretending part of the training data is "unknown";

> 2 - Low values for K, such as K=1 or K=2, can be noisy and subject to the effects of outliers;

> 3 - Large values for K smooth things over, but you do not want K to be so large that a category with only a few samples in it will always be outvoted by other categories.
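
The idea of pretending that part of the training data is "unknown" can be sketched like this. The data, the `knn_predict` and `holdout_accuracy` helpers and the candidate values of K are all hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter
from math import dist


def knn_predict(point, training_points, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(training_points, key=lambda item: dist(point, item[0]))[:k]
    return Counter(cluster for _, cluster in nearest).most_common(1)[0][0]


def holdout_accuracy(training_points, holdout_points, k):
    """Share of held-out points whose known cluster matches the KNN prediction."""
    hits = sum(knn_predict(point, training_points, k) == cluster for point, cluster in holdout_points)
    return hits / len(holdout_points)


# Hypothetical data: 'Red' is a large cluster, 'Blue' has only two training samples
training_points = [
    ((1.0, 1.0), 'Red'), ((1.2, 0.8), 'Red'), ((0.8, 1.1), 'Red'),
    ((1.1, 1.3), 'Red'), ((0.9, 0.7), 'Red'),
    ((3.0, 3.0), 'Blue'), ((3.2, 2.9), 'Blue'),
]
# Part of the labeled data is kept aside and treated as "unknown"
holdout_points = [((1.0, 0.9), 'Red'), ((3.1, 3.1), 'Blue')]

for k in (1, 3, 5, 7):
    print(f'k={k}: accuracy={holdout_accuracy(training_points, holdout_points, k)}')
```

With this toy data, K=1 and K=3 classify both held-out points correctly, while K=5 and K=7 let the large Red cluster outvote the small Blue one, which is the effect described in item 3.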

<br />

For better explanations about how K-Nearest Neighbors and K-Means Clustering work, consider watching these two videos provided by [StatQuest with Josh Starmer](https://www.youtube.com/@statquest): [StatQuest: K-nearest neighbors, Clearly Explained](https://www.youtube.com/watch?v=HVXime0nQeI) and [StatQuest: K-means clustering](https://www.youtube.com/watch?v=4b5d3muPQmA).

<h1 id='4-k-nearest-neighbors-basic-vs-k-nearest-neighbors-with-means-vs-k-nearest-neighbors-with-z-score' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧑‍🤝‍🧑 | K-Nearest Neighbors Basic VS K-Nearest Neighbors With Means VS K-Nearest Neighbors With Z-Score</h1>

K-Nearest Neighbors (KNN) has some variations designed to get better results in specific situations. The three main ones are `Basic`, `With Means` and `With Z-Score`. Let's see their descriptions and settle on which one to use in the project:

> **K-Nearest Neighbors Basic** - `known as the vanilla version, this is the plain KNN Algorithm explained in the previous section`;

> **K-Nearest Neighbors With Means** - `this one is like the Basic version, but it also takes the mean rating of each user into account, so a rating is judged relative to how the user usually rates, which reduces the effect of users who rate everything unusually high or low`;

> **K-Nearest Neighbors With Z-Score** - `this one is like the With Means version, with the addition of the Z-Score Normalization of each user's ratings (each rating is centered on the user's mean and divided by the user's standard deviation)`.

Since taking each user's mean rating into account is a great way to get more accurate results and since the ratings dataset is not normalized, we are going to stick to `K-Nearest Neighbors With Means` from now on!
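
To make the difference between the three variations more concrete, here is a minimal sketch of how each one could combine the neighbors' ratings into a prediction. It follows the formulas as I understand them from the Surprise documentation, but it is not Surprise's code: the function names, the neighbor tuples and the use of the population standard deviation are assumptions of this sketch.

```python
from statistics import mean, pstdev

# Each neighbor is (similarity, rating given to the target item, all ratings of that neighbor)


def predict_basic(neighbors):
    """Basic: similarity-weighted average of the neighbors' raw ratings."""
    total_similarity = sum(sim for sim, _, _ in neighbors)
    return sum(sim * rating for sim, rating, _ in neighbors) / total_similarity


def predict_with_means(user_ratings, neighbors):
    """With Means: weighted average of mean-centered ratings, shifted back by the user's own mean."""
    total_similarity = sum(sim for sim, _, _ in neighbors)
    offset = sum(sim * (rating - mean(history)) for sim, rating, history in neighbors)
    return mean(user_ratings) + offset / total_similarity


def predict_with_z_score(user_ratings, neighbors):
    """With Z-Score: like With Means, but deviations are also scaled by each user's standard deviation."""
    total_similarity = sum(sim for sim, _, _ in neighbors)
    offset = sum(sim * (rating - mean(history)) / pstdev(history) for sim, rating, history in neighbors)
    return mean(user_ratings) + pstdev(user_ratings) * offset / total_similarity


# Hypothetical data: the target user's past ratings and two neighbors
target_user_ratings = [2, 3, 4]
neighbors = [
    (0.8, 5, [5, 5, 2]),  # generous neighbor: mean rating 4
    (0.2, 2, [1, 2, 3]),  # strict neighbor: mean rating 2
]

print(round(predict_basic(neighbors), 2))                              # 4.4
print(round(predict_with_means(target_user_ratings, neighbors), 2))    # 3.8
print(round(predict_with_z_score(target_user_ratings, neighbors), 2))  # 3.46
```

The Basic prediction is pulled up by the generous neighbor's raw rating of 5, whereas the other two variations only use how far that 5 is from the neighbor's usual ratings.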

<h1 id='5-grid-search-cv-vs-randomized-search-cv' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧑‍🤝‍🧑 | Grid Search CV VS Randomized Search CV</h1>

Okay, now that we have chosen the model to find similar users, we have to find the best hyperparameters for it. Hyperparameters are the parameters that the algorithm cannot learn by itself during training, so it is up to us to choose the values that make the model provide the best results.

For K-Nearest Neighbors, the hyperparameters we are going to consider are the `Similarity Metric, User-Based Approach, Minimum Number of Common Items and Minimum and Maximum Number of Nearest Neighbors`.

Fortunately, instead of trying out values by hand, we can use techniques such as `Grid Search CV` and `Randomized Search CV`:

> **Grid Search CV** - `tests all combinations of the hyperparameter values and returns the ones that gave the model the best results. This approach is recommended for small datasets`;

> **Randomized Search CV** - `tests some randomly chosen combinations of the hyperparameter values and returns the ones that gave the model the best results. This approach is recommended for large datasets`.

For a better understanding, consider we have two hyperparameters with candidate values $A=[1, 2, 3]$ and $B=[4, 5, 6]$. In Grid Search CV, all combinations are tested, that is, all A-B pairs in $pairs = [(1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6)]$; whereas in Randomized Search CV, only a few randomly chosen A-B pairs are tested.
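
The A-B example can be reproduced with the standard library. This sketch only builds the candidate pairs and does not train anything; the number of sampled pairs (4) and the seed are arbitrary choices made for illustration.

```python
import random
from itertools import product

A = [1, 2, 3]
B = [4, 5, 6]

# Grid Search: every A-B combination is a candidate
grid_candidates = list(product(A, B))
print(len(grid_candidates))  # 9

# Randomized Search: only a fixed number of randomly drawn combinations is a candidate
rng = random.Random(20240106)  # seeded so the draw is reproducible
randomized_candidates = rng.sample(grid_candidates, k=4)
print(len(randomized_candidates))  # 4
```

In Surprise's `RandomizedSearchCV`, the number of sampled combinations is controlled by the `n_iter` parameter, which defaults to 10 if I read the API correctly.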

Since our ratings dataset is quite large (over 23 million observations!!), we are going to stick to `Randomized Search CV`.

Besides, notice that the Recommendation Problem has been turned into an Optimization Problem, where we have to find the Machine Learning Algorithm and the Hyperparameter Values that best fit the problem. Now, let's go to the code!!

<h1 id='6-hands-on' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧑‍🤝‍🧑 | Hands-on</h1>

```
- Settings
- Reading Datasets
- Dropping Variables
- Getting Random Observations for Hyperparameter Tuning
- Finding Best Hyperparameters for the Model
- Splitting Dataset into Training and Validation
- Training the Model
- Recommendations
```

> **OBS.:** the `Reader` used in this notebook is created with Surprise's default rating scale, which goes from 1 to 5, whereas the animes dataset uses ratings up to 10. Therefore, we divide the dataset ratings by 2 before handing them to Surprise, so that they fit the Surprise scale, and we multiply the Surprise predictions by 2 when returning the recommendations, so that they fit the animes dataset scale again.

---

**- Settings**

In [1]:
# -----------------
# ---- Imports ----
# -----------------
from collections import defaultdict  # standard library, no installation needed
import inflect                       # pip install inflect
import numpy as np                   # pip install numpy
import pandas as pd                  # pip install pandas




# --------------------------
# ---- Surprise Imports ----
# --------------------------
#
# pip install scikit-surprise
#
# If you get any problems installing the package, follow the steps in this Stack Overflow link:
# - https://stackoverflow.com/questions/44951456/pip-error-microsoft-visual-c-14-0-is-required
#
from surprise import accuracy
from surprise import KNNWithMeans
from surprise.dataset import Dataset
from surprise.model_selection.validation import cross_validate
from surprise.model_selection import RandomizedSearchCV
from surprise.model_selection import train_test_split
from surprise.reader import Reader




# -------------------
# ---- Constants ----
# -------------------
DATASETS_PATH = ('./datasets')
INFLECT_ENGINE = (inflect.engine())
SEED = (20240106)
SEED2 = (20240107)
SURPRISE_READER = (Reader())

HYPERPARAMETER_TUNING_SAMPLE_SIZE = 15_000 # my laptop crashes with more than 15,000 observations. I need more RAM :( (currently I have 12GB)
TRAINING_DF_SIZE = (0.70)
VALIDATION_DF_SIZE = (0.30)

SIMILARITY_OPTIONS = {
    'name': ['cosine', 'pearson', 'pearson_baseline', 'msd'] # similarity metrics: the best one will be used for recommendations;
    , 'user_based': [True] # True: User-Based Approach; False: Item-Based Approach;
    , 'min_support': [3, 4, 5] # User-Based: minimum number of common items two users must share for their similarity to be considered; Item-Based: minimum number of common users.
}
KNN_PARAMS = {
    'k': [30, 40, 50]              # list of maximum nearest neighbors to be considered: the best one will be used for recommendations;
    , 'min_k': [20, 25, 30]        # minimum number of nearest neighbors to be considered;
    , 'sim_options': SIMILARITY_OPTIONS # similarity parameters;
}




# ------------------
# ---- Settings ----
# ------------------
np.random.seed(SEED)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)




# -------------------
# ---- Functions ----
# -------------------
def find_best_hyperparameters_values(model, search_cv, parameters, dataset):
    r"""
    \ Description:
        - creates a parameter Search CV to find the best score, hyperparameter values and estimator for a given
    K-Nearest Neighbors model;
        - returns the fitted Search CV object, which holds the best score, hyperparameters and estimator.
    
    \ Parameters:
        - model: Surprise prediction algorithm class. For this notebook, K-Nearest Neighbors With Means is the chosen one;
        - search_cv: Surprise Parameter Search class. For this notebook, Randomized Search CV is the chosen one;
        - parameters: dictionary of KNN Parameters;
        - dataset: Surprise Dataset.
    """
    classification_model = search_cv(
        model
        , parameters
        , n_jobs=-1                 # number of CPU cores used on models' training and validation (-1 all cores are used)
        , measures=['rmse']         # evaluation metrics
        , cv=3                      # number of folds for cross-validation
        , pre_dispatch='1*n_jobs'   # number of dispatches for each job to process simultaneously
        , random_state=SEED
    )
    classification_model.fit(dataset)
    
    print(f'- Best Score: {classification_model.best_score}')
    print(f'- Best Parameters: {classification_model.best_params}')
    print(f'- Best Estimator: {classification_model.best_estimator}')
    
    return classification_model


def get_recommendations(predictions_df, animes_df, maximum_number_of_recommendations=10):
    r"""
    \ Description:
        - Maps the Predictions for Each User;
        - Sorts the Predictions for Each User;
        - Keeps the Highest Predicted Ratings of Each User, up to the Maximum Number of Recommendations;
        - Returns a Dictionary Mapping Each User ID to Its List of (Anime ID, Predicted Rating) Pairs.
    
    \ Parameters:
        - Predictions DF: Surprise DataFrame;
        - Animes DF: Pandas DataFrame;
        - Maximum Number of Recommendations: Integer.
    """
    
    # Mapping the Predictions for Each User
    #
    # - since the ratings were divided by 2 to fit the Surprise rating scale
    # (up to 5) and the animes dataset works with ratings up to 10, we have to
    # multiply the estimated rating by 2, so that the Surprise predictions
    # fit the animes dataset ratings again.
    #
    top_n_recommendations = defaultdict(list)
    for user_id, anime_id, _true_rating, estimated_rating, _details in predictions_df:
        top_n_recommendations[user_id].append((anime_id, estimated_rating * 2))
    
    # Sorting the Predictions of Each User and Keeping the Highest Predicted Ratings
    for uuid, user_ratings in top_n_recommendations.items():
        user_ratings.sort(key=lambda rating: rating[1], reverse=True)
        top_n_recommendations[uuid] = user_ratings[:maximum_number_of_recommendations]
    
    return top_n_recommendations

In [2]:
# -----------------
# ---- Classes ----
# -----------------
class collaborative_filtering_user_based_approach():
    """
    This class applies Collaborative Filtering with User-Based Approach to recommend 'n' animes to a given user.
    """
    
    def __init__(self, model, training_df, validation_df, full_dataset, animes_df):
        r"""
        \ Description:
            - Class Constructor.
        
        \ Parameters:
            - Model: Surprise Prediction Model. K-Nearest Neighbors With Means is the chosen one for this notebook;
            - Training DF: Surprise DataFrame;
            - Validation DF: Surprise DataFrame;
            - Full Dataset: Surprise DataFrame (Merge of: Training DF and Validation DF);
            - Animes DF:  Pandas DataFrame.
            
        \ Other Attributes:
            - Predictions Validations: Surprise DataFrame;
            - Top Recommendations: Surprise DataFrame;
            - Recommendations DF: Pandas DataFrame.
        """
        self.model = model
        self.training_df = training_df
        self.validation_df = validation_df
        self.full_dataset = full_dataset
        self.animes_df = animes_df
        
        self.predictions_validations = None
        self.top_recommendations = None
        self.recommendations_df = None
        
    def fit_and_predict(self, maximum_number_of_recommendations, verbose=True):
        r"""
        \ Description:
            - Applies training, validation and evaluation steps;
            - Gets the top 'n' recommendations for each user and stores them in 'recommendations_df';
            - Returns the Root Mean Squared Error (RMSE) of the validation predictions.
        
        \ Parameters:
            - Maximum Number of Recommendations: Integer;
            - Verbose: Boolean.
        """
        # ---- Fitting Step ----
        if verbose:
            fitting_step_title = '** Fitting Step **'
            print('*' * len(fitting_step_title))
            print(fitting_step_title)
            print('*' * len(fitting_step_title))
            print('\n')
            
        self.model.fit(self.training_df)
        
        # ---- Validation Step -----
        if verbose:
            prediction_step_title = '** Prediction Step **'
            print('\n\n')
            print('*' * len(prediction_step_title))
            print(prediction_step_title)
            print('*' * len(prediction_step_title))
            print('\n')
            
        self.predictions_validations = self.model.test(self.validation_df)

        # ---- Evaluation Step ----
        if verbose:
            evaluation_step_title = '** Evaluation Step **'
            print('\n\n')
            print('*' * len(evaluation_step_title))
            print(evaluation_step_title)
            print('*' * len(evaluation_step_title))
            print('\n')
        
        # verbose=False: the RMSE is printed once, by the formatted message below
        rmse_evaluation = round(accuracy.rmse(self.predictions_validations, verbose=False), 4)
        print(f'- Root Mean Squared Error (RMSE) for the predictions: {rmse_evaluation}')
        print('\n\n')
        
        # ---- Getting Recommendations ----
        if verbose:
            predicting_recommendations_step_title = '** Predicting Recommendations Step **'
            print('\n\n')
            print('*' * len(predicting_recommendations_step_title))
            print(predicting_recommendations_step_title)
            print('*' * len(predicting_recommendations_step_title))
            print('\n')
            
        # Passing the requested maximum instead of silently falling back to the default of 10
        self.top_recommendations = get_recommendations(
            self.predictions_validations, self.animes_df, maximum_number_of_recommendations
        )
        
        # Building the DataFrame once from flat rows avoids calling pd.concat inside a loop
        recommendation_rows = [
            (user_id, anime_id, predicted_rating)
            for user_id, user_recommendations in self.top_recommendations.items()
            for anime_id, predicted_rating in user_recommendations
        ]
        self.recommendations_df = pd.DataFrame(
            recommendation_rows, columns=['user_id', 'anime_id', 'predicted_rating']
        )
        
        # ---- Return ----
        print('Fitting and Prediction Done 😉👍')
        print('\n')
        return rmse_evaluation
    
    def cross_validation(self, verbose=True):
        r"""
        \ Description:
            - Applies Cross-Validation and evaluates its results using RMSE;
            - Returns the mean Cross-Validation RMSE, rounded to four decimal places.
        
        \ Parameters:
            - Verbose: Boolean.
        """
        # ---- Cross-Validation Step ----
        if verbose:
            cross_validation_step_title = '** Cross-Validation Step **'
            print('*' * len(cross_validation_step_title))
            print(cross_validation_step_title)
            print('*' * len(cross_validation_step_title))
            print('\n')
            
        cross_validation_results = cross_validate(self.model, self.full_dataset, n_jobs=-1)
        cross_validation_results = round(cross_validation_results['test_rmse'].mean(), 4)
            
        # ---- Evaluation Step ----
        if verbose:
            validation_step_title = '** Validation Step **'
            print('\n\n')
            print('*' * len(validation_step_title))
            print(validation_step_title)
            print('*' * len(validation_step_title))
            print('\n')
        
        print(f'- Mean Cross-Validation Root Mean Squared Error (RMSE): {cross_validation_results}')
        print('\n\n')
        
        # ---- Return ----
        print('Cross-Validation Done 😉👍')
        print('\n')
        return cross_validation_results
    
    def recommend(self, user_id, maximum_number_of_recommendations=10, verbose=True):
        r"""
        \ Description:
            - Gets the 'n' recommended items;
            - Gets the recommended animes info;
            - Returns the recommended animes info and the predicted rating as a Pandas DataFrame.
        
        \ Parameters:
            - User ID: Integer;
            - Maximum Number of Recommendations: Integer;
            - Verbose: Boolean.
        """
        # ---- Recommendation Step ----
        if verbose:
            recommendation_step_title = '** Recommendations Step **'
            print('*' * len(recommendation_step_title))
            print(recommendation_step_title)
            print('*' * len(recommendation_step_title))
            print('\n')
        
        recommendations_df = self.recommendations_df.loc[self.recommendations_df['user_id'] == user_id] \
            .head(maximum_number_of_recommendations)
        
        # ---- Formatting Animes Recommendation DataFrame ----
        recommended_anime_ids = recommendations_df.anime_id.to_list()
        recommended_animes_df = self.animes_df.loc[recommended_anime_ids].copy()  # copy: the new column must not touch animes_df
        recommended_animes_df['predicted_rating'] = recommendations_df['predicted_rating'].to_list()
        
        # ---- Return ----
        if verbose: display(recommended_animes_df)
        print('Recommendations Done, Enjoy 😉👍')
        print('\n')
        
        return recommended_animes_df

---

**- Reading Datasets**

In [3]:
# ---- Reading Animes Dataset ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')[
    ['title', 'score', 'genres', 'is_hentai', 'image_url']
]

print(f'- Number of Observations: {animes_df.shape[0]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[0])})')
print(f'- Number of Variables: {animes_df.shape[1]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[1])})')
print('---')

animes_df.head()

- Number of Observations: 23748 (twenty-three thousand, seven hundred and forty-eight)
- Number of Variables: 5 (five)
---


| id | title | score | genres | is_hentai | image_url |
|---|---|---|---|---|---|
| 1 | cowboy bebop | 8.75 | award winning, action, sci-fi | 0 | https://cdn.myanimelist.net/images/anime/4/19644.jpg |
| 5 | cowboy bebop tengoku no tobira | 8.38 | action, sci-fi | 0 | https://cdn.myanimelist.net/images/anime/1439/93480.jpg |
| 6 | trigun | 8.22 | adventure, action, sci-fi | 0 | https://cdn.myanimelist.net/images/anime/7/20310.jpg |
| 7 | witch hunter robin | 7.25 | mystery, supernatural, action, drama | 0 | https://cdn.myanimelist.net/images/anime/10/19969.jpg |
| 8 | bouken ou beet | 6.94 | adventure, supernatural, fantasy | 0 | https://cdn.myanimelist.net/images/anime/7/21569.jpg |


In [4]:
# ---- Reading Ratings Dataset ----
ratings_df = pd.read_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv')

print(f'- Number of Observations: {ratings_df.shape[0]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[0])})')
print(f'- Number of Variables: {ratings_df.shape[1]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[1])})')
print('---')

print(f'- Number of Unique Users: {ratings_df.user_id.nunique()} ({INFLECT_ENGINE.number_to_words(ratings_df.user_id.nunique())})')
print(f'- Number of Unique Animes: {ratings_df.anime_id.nunique()} ({INFLECT_ENGINE.number_to_words(ratings_df.anime_id.nunique())})')
print('---')

ratings_df.head()

- Number of Observations: 23796586 (twenty-three million, seven hundred and ninety-six thousand, five hundred and eighty-six)
- Number of Variables: 5 (five)
---
- Number of Unique Users: 264067 (two hundred and sixty-four thousand and sixty-seven)
- Number of Unique Animes: 16380 (sixteen thousand, three hundred and eighty)
---


|   | user_id | username | anime_id | anime_title | rating |
|---|---|---|---|---|---|
| 0 | 1 | xinil | 21 | one piece | 9 |
| 1 | 1 | xinil | 48 | hack sign | 7 |
| 2 | 1 | xinil | 320 | a kite | 5 |
| 3 | 1 | xinil | 49 | aa megami-sama | 8 |
| 4 | 1 | xinil | 304 | aa megami-sama movie | 8 |


---

**- Dropping Variables**

The Surprise package needs only three variables: the user id, the anime id and the rating the user gave to the anime. Thus, we keep these three and drop the remaining ones, such as the username and anime title variables.

In [5]:
# ---- Dropping Variables ----
variables_to_keep = ['user_id', 'anime_id', 'rating']
ratings_df = ratings_df[variables_to_keep]

---

**- Getting Random Observations for Hyperparameter Tuning**

In [6]:
# ---- Getting Random Observations for Hyperparameter Tuning ----
#
# - since the default Surprise Reader works with ratings up to 5 and
# the animes dataset works with ratings up to 10, we have to
# divide the 'rating' variable by 2, so that the animes dataset
# ratings fit the Surprise rating scale.
#
temp_ratings_df = ratings_df.copy()
temp_ratings_df['rating'] = temp_ratings_df['rating'] / 2  # vectorized division instead of a row-by-row apply

hyperparameter_tuning_df = temp_ratings_df.sample(HYPERPARAMETER_TUNING_SAMPLE_SIZE, random_state=SEED)
hyperparameter_tuning_df = Dataset.load_from_df(hyperparameter_tuning_df, SURPRISE_READER)

print(
    f'- Number of Observations for Hyperparameter Tuning: {HYPERPARAMETER_TUNING_SAMPLE_SIZE}'
    f' ({INFLECT_ENGINE.number_to_words(HYPERPARAMETER_TUNING_SAMPLE_SIZE)})'
)

- Number of Observations for Hyperparameter Tuning: 15000 (fifteen thousand)
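
The rescaling above is plain arithmetic, and it can be undone the same way when a prediction has to be shown on the original scale. Here is a minimal sketch of both directions. The helper names `to_surprise_scale` and `to_original_scale` are hypothetical and not part of the notebook; the sample values are the first five ratings printed by `ratings_df.head()`.

```python
# Hypothetical helpers (not part of the notebook) mapping ratings between scales.
def to_surprise_scale(rating, factor=2):
    """Map a 0-10 rating onto the 0-5 scale."""
    return rating / factor


def to_original_scale(rating, factor=2):
    """Map a 0-5 rating (or prediction) back onto the 0-10 scale."""
    return rating * factor


original_ratings = [9, 7, 5, 8, 8]
scaled_ratings = [to_surprise_scale(rating) for rating in original_ratings]
restored_ratings = [to_original_scale(rating) for rating in scaled_ratings]

print(scaled_ratings)    # [4.5, 3.5, 2.5, 4.0, 4.0]
print(restored_ratings)  # [9.0, 7.0, 5.0, 8.0, 8.0]
```

Since the mapping is linear, the order of the ratings is preserved, so halving them should not change which animes end up ranked first.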


---

**- Finding Best Hyperparameters for the Model**

In [7]:
# ---- Finding the Best Hyperparameters for the Model ----
#
# - using Randomized Search CV in order to find the best Hyperparameters Values.
#
hyperparameters_tuning_values = find_best_hyperparameters_values(
    model=KNNWithMeans
    , search_cv=RandomizedSearchCV
    , parameters=KNN_PARAMS
    , dataset=hyperparameter_tuning_df
)

- Best Score: {'rmse': 0.8510943485466242}
- Best Parameters: {'rmse': {'k': 30, 'min_k': 25, 'sim_options': {'name': 'pearson_baseline', 'user_based': True, 'min_support': 5}}}
- Best Estimator: {'rmse': <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x000002082B76BD60>}
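
To make the difference between the two search strategies concrete, here is a small pure-Python sketch. It is not the Surprise implementation: `param_grid` is a hypothetical search space (the real `KNN_PARAMS` is defined earlier in the notebook) and `toy_rmse` is a made-up score that stands in for a cross-validated RMSE. Grid Search scores every combination, whereas Randomized Search scores only a random subset of them.

```python
import itertools
import random

# Hypothetical search space; the notebook's real KNN_PARAMS is defined elsewhere.
param_grid = {
    'k': [10, 20, 30, 40],
    'min_k': [1, 5, 25],
    'min_support': [1, 5],
}


def toy_rmse(params):
    """Made-up score standing in for a cross-validated RMSE (lower is better)."""
    return (
        0.85
        + abs(params['k'] - 30) / 1000
        + abs(params['min_k'] - 25) / 1000
        + abs(params['min_support'] - 5) / 1000
    )


def all_combinations(grid):
    """Expand the grid into a list of parameter dictionaries."""
    names = list(grid)
    return [dict(zip(names, values)) for values in itertools.product(*grid.values())]


def grid_search(grid, score):
    """Score every combination and keep the best one."""
    candidates = all_combinations(grid)
    return min(candidates, key=score), len(candidates)


def randomized_search(grid, score, n_iter, seed):
    """Score only `n_iter` randomly drawn combinations and keep the best one."""
    candidates = random.Random(seed).sample(all_combinations(grid), n_iter)
    return min(candidates, key=score), len(candidates)


grid_best, grid_evaluations = grid_search(param_grid, toy_rmse)
random_best, random_evaluations = randomized_search(param_grid, toy_rmse, n_iter=10, seed=42)

print(f'- Grid Search: {grid_evaluations} evaluations, best = {grid_best}')
print(f'- Randomized Search: {random_evaluations} evaluations, best = {random_best}')
```

The trade-off is the usual one: Randomized Search is cheaper because it tries fewer combinations, but it may miss the best one, so its score can only be equal to or worse than the Grid Search score on the same space.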


In [8]:
# ---- Finding Best Hyperparameters for the Model ----
#
# - getting the model with the best parameters.
#
chosen_knn_with_means_model = hyperparameters_tuning_values.best_estimator['rmse']
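
For intuition about what a K-Nearest Neighbors With Means model does with `k` and `min_k`, here is a minimal user-based sketch in pure Python. It is an illustration, not the Surprise code: the users, ratings and similarity values are made up, and the similarities are hard-coded instead of being computed with `pearson_baseline`. The prediction is the target user's mean rating plus the similarity-weighted average of how far each neighbor's rating of the item sits from that neighbor's own mean.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def predict_knn_with_means(target_user, item, ratings, similarities, k=30, min_k=1):
    """Predict a rating from the k most similar users who rated `item`.

    ratings: {user: {item: rating}}, similarities: {(user_a, user_b): similarity}.
    """
    target_mean = mean(list(ratings[target_user].values()))

    # Keep the k most similar users who rated the item, then drop
    # the ones whose similarity is not positive.
    neighbors = sorted(
        (
            (similarities.get((target_user, other), 0.0), other)
            for other in ratings
            if other != target_user and item in ratings[other]
        ),
        reverse=True,
    )[:k]
    neighbors = [(sim, other) for sim, other in neighbors if sim > 0]

    # Not enough neighbors: fall back to the user's own mean rating.
    if not neighbors or len(neighbors) < min_k:
        return target_mean

    weighted_deviations = sum(
        sim * (ratings[other][item] - mean(list(ratings[other].values())))
        for sim, other in neighbors
    )
    return target_mean + weighted_deviations / sum(sim for sim, _ in neighbors)


# Made-up data on the 0-5 scale.
toy_ratings = {
    'ana': {'a': 4.0, 'b': 2.0},
    'bob': {'a': 4.0, 'b': 2.0, 'c': 4.5},
    'cid': {'a': 2.0, 'b': 2.0, 'c': 0.5},
}
toy_similarities = {('ana', 'bob'): 0.75, ('ana', 'cid'): 0.25}

prediction = predict_knn_with_means('ana', 'c', toy_ratings, toy_similarities)
fallback = predict_knn_with_means('ana', 'c', toy_ratings, toy_similarities, min_k=3)

print(prediction)  # 3.5
print(fallback)    # 3.0
```

With `min_k=3` only two neighbors are available, so the sketch returns the user's mean. This is why a high `min_k`, such as the 25 found above, makes the model more conservative for users with few similar neighbors.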

---

**- Splitting Dataset into Training and Validation**

In [9]:
# ---- Splitting Dataset into Training and Validation ----
#
# - converting a random sample of the Pandas DataFrame into a Surprise Dataset
#
ratings_surprise_df = Dataset.load_from_df(
    temp_ratings_df.sample(HYPERPARAMETER_TUNING_SAMPLE_SIZE, random_state=SEED2)
    , SURPRISE_READER
)

In [10]:
# ---- Splitting Dataset into Training and Validation ----
#
# - deleting some dataframes from the memory since we are going to use 'ratings_surprise_df' from now on;
# - datasets to delete:
#    \ ratings_df;
#    \ temp_ratings_df;
#    \ hyperparameter_tuning_df.
#
ratings_df = None
temp_ratings_df = None
hyperparameter_tuning_df = None

In [11]:
# ---- Splitting Dataset into Training and Validation ----
#
# - training: 70%
# - validation: 30%
#
training_surprise_df, validation_surprise_df = train_test_split(
    data=ratings_surprise_df
    , train_size=TRAINING_DF_SIZE
    , test_size=VALIDATION_DF_SIZE
    , random_state=SEED
)
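
The idea behind a 70/30 split can be sketched without any library: shuffle the observations with a fixed seed and cut the list at 70%. The function `split_ratings` and the toy ratings below are hypothetical, written only to illustrate the step; the notebook itself relies on Surprise's `train_test_split`.

```python
import random


def split_ratings(ratings, train_size=0.7, seed=42):
    """Shuffle a copy of the ratings and cut it into training and validation."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# Made-up (user_id, anime_id, rating) tuples.
toy_ratings = [(user, anime, 2.5) for user in range(10) for anime in range(10)]
training, validation = split_ratings(toy_ratings)

print(f'- Training Observations: {len(training)}')      # 70
print(f'- Validation Observations: {len(validation)}')  # 30
```

Fixing the seed makes the split reproducible, so the reported scores can be compared between runs.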

---

**- Training the Model**

In [12]:
# ---- Training the Model ----
#
# - creating the model;
# - fitting and predicting;
# - cross-validating the data.
#
user_based_recommender = collaborative_filtering_user_based_approach(
    chosen_knn_with_means_model
    , training_surprise_df
    , validation_surprise_df
    , ratings_surprise_df
    , animes_df
)

model_rmse = user_based_recommender.fit_and_predict(maximum_number_of_recommendations=10, verbose=True)
model_cross_validation_rmse = user_based_recommender.cross_validation(verbose=True)

******************
** Fitting Step **
******************


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.



*********************
** Prediction Step **
*********************





*********************
** Evaluation Step **
*********************


RMSE: 0.8622
- Root Mean Squared Error (RMSE) for the predictions: 0.8622






*************************************
** Predicting Recommendations Step **
*************************************


Fitting and Prediction Done 😉👍


***************************
** Cross-Validation Step **
***************************





*********************
** Validation Step **
*********************


- Mean Cross-Validation Root Mean Squared Error (RMSE): 0.8481



Cross-Validation Done 😉👍
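
For reference, the two metrics reported above can be sketched in a few lines of pure Python. The functions `rmse` and `k_fold_indices` are hypothetical helpers, and the "model" is a deliberately naive one that always predicts the mean of the training ratings, so the numbers it produces have nothing to do with the notebook's results. The point is only the mechanics: RMSE is the square root of the mean squared error, and the cross-validation score is the mean of the RMSE values obtained on each fold.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equally sized lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def k_fold_indices(n_observations, n_folds):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_observations))
    for fold in range(n_folds):
        validation = indices[fold::n_folds]
        training = [index for index in indices if index % n_folds != fold]
        yield training, validation


# Made-up ratings on the 0-5 scale.
toy_ratings = [4.5, 3.5, 2.5, 4.0, 4.0, 3.0]

fold_scores = []
for training_idx, validation_idx in k_fold_indices(len(toy_ratings), n_folds=3):
    # Naive stand-in model: always predict the mean training rating.
    training_mean = sum(toy_ratings[i] for i in training_idx) / len(training_idx)
    actual = [toy_ratings[i] for i in validation_idx]
    fold_scores.append(rmse(actual, [training_mean] * len(actual)))

mean_cv_rmse = sum(fold_scores) / len(fold_scores)
print(f'- RMSE per fold: {[round(score, 4) for score in fold_scores]}')
print(f'- Mean Cross-Validation RMSE: {mean_cv_rmse:.4f}')
```

A cross-validation score close to the single validation score, as in the output above, suggests the validation result does not depend much on one particular split.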




---

**- Recommendations**

In [13]:
# ---- Recommendations ----
user_based_top_10_animes = user_based_recommender.recommend(
    user_id=388458
    , maximum_number_of_recommendations=10
    , verbose=True
)

**************************
** Recommendations Step **
**************************




id,title,score,genres,is_hentai,image_url,predicted_rating
18465,genshiken nidaime,7.43,comedy,0,https://cdn.myanimelist.net/images/anime/11/52935.jpg,7.618476


Recommendations Done, Enjoy 😉👍
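
A top-N recommendation step usually boils down to three moves: drop the items the user has already rated, predict a rating for the remaining ones, and keep the N highest predictions. The sketch below assumes that general recipe; it is not the code behind `user_based_recommender.recommend`. The function `recommend_top_n`, the anime ids and the predicted values are all made up.

```python
def recommend_top_n(user_id, seen_items, all_items, predict, n=10):
    """Return up to `n` unseen items as (item, predicted_rating), best first."""
    candidates = [item for item in all_items if item not in seen_items]
    scored = sorted(
        ((predict(user_id, item), item) for item in candidates),
        reverse=True,
    )
    return [(item, score) for score, item in scored[:n]]


# Made-up predictions standing in for a trained model.
toy_predictions = {'anime_a': 3.1, 'anime_b': 4.4, 'anime_c': 2.0, 'anime_d': 3.9}


def toy_predict(user_id, item):
    """Hypothetical predictor that ignores the user and looks the rating up."""
    return toy_predictions[item]


top_2 = recommend_top_n(1, {'anime_c'}, toy_predictions, toy_predict, n=2)
top_10 = recommend_top_n(1, {'anime_c'}, toy_predictions, toy_predict, n=10)

print(top_2)        # [('anime_b', 4.4), ('anime_d', 3.9)]
print(len(top_10))  # 3
```

Note that `n` works as a cap and not as a guarantee: when there are fewer candidates than `n`, fewer rows come back.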




---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/)