<center>
    <h1 id='content-based-filtering' style='color:#7159c1; font-size:350%'>Collaborative Filtering</h1>
    <i style='font-size:125%'>Recommendations of Items from Similar Items that Similar Users Liked</i>
</center>

> **Topics**

```
- ✨ Collaborative Filtering User-Based Problems
- ✨ Item-Based Approach
- ✨ Matrix Decomposition
- ✨ Singular Value Decomposition (SVD)
- ✨ Singular Value Decomposition (SVD) VS Singular Value Decomposition Plus (SVD++)
- ✨ Hands-on
```

<h1 id='0-collaborative-user-based-problems' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Collaborative Filtering User-Based Problems</h1>

<h1 id='5-hands-on' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Hands-on</h1>

```
- Settings
- Reading Datasets
- Dropping Variables
- Getting Random Observations for Hyperparameter Tuning
- Finding Best Hyperparameters for the Model
- Splitting Dataset into Training and Validation
- Training the Model
- Recommendations
```

> **OBS.:** the default Surprise `Reader` assumes a rating scale from 1 to 5, while the animes dataset works with ratings from 0 to 10. Since this notebook keeps the default `Reader`, we divide the animes dataset ratings by 2 before training so they fit into the range up to 5. When returning the recommendations, we multiply the Surprise predictions by 2 so they are back on the animes dataset scale.
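
The round trip described above can be sketched in plain Python. This is only an illustration: the helper names (`to_model_scale`, `to_dataset_scale`, `rmse`) and the toy ratings are assumptions and do not come from the notebook or from Surprise. It also shows a side effect worth keeping in mind: an RMSE measured on the halved scale corresponds to twice that value on the original 0 to 10 scale.

```python
import math

SCALE_FACTOR = 2  # assumption: 0-10 dataset ratings are halved before training


def to_model_scale(rating):
    """Halves a dataset rating so it fits the range used for training."""
    return rating / SCALE_FACTOR


def to_dataset_scale(prediction):
    """Doubles a model prediction so it is comparable to the original ratings."""
    return prediction * SCALE_FACTOR


def rmse(true_values, predicted_values):
    """Root mean squared error between two equally sized sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings and predictions, for illustration only
dataset_ratings = [9, 7, 5, 8]
model_ratings = [to_model_scale(rating) for rating in dataset_ratings]
model_predictions = [4.0, 3.0, 3.0, 4.5]
dataset_predictions = [to_dataset_scale(prediction) for prediction in model_predictions]

rmse_model_scale = rmse(model_ratings, model_predictions)        # error on the halved scale
rmse_dataset_scale = rmse(dataset_ratings, dataset_predictions)  # error on the 0-10 scale

print(f'- RMSE on the halved scale: {rmse_model_scale}')
print(f'- RMSE on the dataset scale: {rmse_dataset_scale}')
```

Because every error is doubled together with the ratings, the RMSE values reported later in this notebook are on the halved scale and should be read with that in mind.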

---

**- Settings**

In [1]:
# -----------------
# ---- Imports ----
# -----------------
from collections import defaultdict  # standard library, no installation needed
import inflect                       # pip install inflect
import numpy as np                   # pip install numpy
import pandas as pd                  # pip install pandas




# --------------------------
# ---- Surprise Imports ----
# --------------------------
#
# pip install scikit-surprise
#
# If you get any troubles installing the package, follow the steps in this Stack Overflow link:
# - https://stackoverflow.com/questions/44951456/pip-error-microsoft-visual-c-14-0-is-required
#
from surprise import accuracy
from surprise import SVD
from surprise.dataset import Dataset
from surprise.model_selection.validation import cross_validate
from surprise.model_selection import RandomizedSearchCV
from surprise.model_selection import train_test_split
from surprise.reader import Reader




# -------------------
# ---- Constants ----
# -------------------
DATASETS_PATH = ('./datasets')
INFLECT_ENGINE = (inflect.engine())
SEED = (20240106)
SEED2 = (20240107)
SURPRISE_READER = (Reader())

HYPERPARAMETER_TUNING_SAMPLE_SIZE = 250_000 # SVD has a better memory performance than KNN, thus we can increase the sample size
# however, with more than 250,000 observations, the training and prediction steps take a considerable time
TRAINING_DF_SIZE = (0.80)
VALIDATION_DF_SIZE = (0.20)

SIMILARITY_OPTIONS = {
    'name': ['cosine', 'pearson', 'pearson_baseline', 'msd'] # similarity metrics: the best one will be used for recommendations;
    , 'user_based': [False] # True: User-Based Approach; False: Item-Based Approach;
    , 'min_support': [3, 4, 5] # User-Based: minimum number of common items for a similarity to be considered; Item-Based: minimum number of common users for a similarity to be considered.
}
SVD_PARAMS = {
    'n_epochs': [5, 10, 15, 20] # number of iterations in SVD;
    , 'lr_all': [0.002, 0.005, 0.007]  # learning rate for all parameters (lr_bu, lr_bi, lr_pu and lr_qi);
    , 'reg_all': [0.4, 0.6, 0.8] # regularization term for all parameters (reg_bu, reg_bi, reg_pu and reg_qi).
}




# ------------------
# ---- Settings ----
# ------------------
np.random.seed(SEED)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)




# -------------------
# ---- Functions ----
# -------------------
def find_best_hyperparameters_values(model, search_cv, parameters, dataset):
    r"""
    \ Description:
        - Creates a parameter Search CV to find the best score, hyperparameter values and estimator for a given
    Singular Value Decomposition model;
        - Returns the fitted Search CV, which holds the best score, hyperparameters and estimator.
        
    \ Parameters:
        - model: Chosen Matrix Decomposition Model class. For this notebook, Singular Value Decomposition (SVD) is the chosen one;
        - search_cv: Surprise Parameter Search class. For this notebook, Randomized Search CV is the chosen one;
        - parameters: dictionary of SVD Parameters;
        - dataset: Surprise Dataset.
    """
    matrix_decomposition_model = search_cv(
        model
        , parameters
        , n_jobs=-1                 # number of CPU cores used on models' training and validation (-1: all cores are used)
        , measures=['rmse']         # evaluation metrics
        , cv=3                      # number of folds for cross-validation
        , pre_dispatch='1*n_jobs'   # number of dispatches for each job to process simultaneously
        , random_state=SEED
    )
    matrix_decomposition_model.fit(dataset)
    
    print(f'- Best Score: {matrix_decomposition_model.best_score}')
    print(f'- Best Parameters: {matrix_decomposition_model.best_params}')
    print(f'- Best Estimator: {matrix_decomposition_model.best_estimator}')
    
    return matrix_decomposition_model

def get_recommendations(predictions_df, animes_df, maximum_number_of_recommendations=10):
    r"""
    \ Description:
        - Maps the Predictions for Each User;
        - Sorts the Predictions for Each User;
        - Keeps at most 'maximum_number_of_recommendations' Predictions for Each User;
        - Returns a Dictionary mapping each User ID to its List of (anime id, predicted rating) pairs.
        
    \ Parameters:
        - Predictions DF: List of Surprise Predictions;
        - Animes DF: Pandas DataFrame (currently not used by this function);
        - Maximum Number of Recommendations: Integer.
    """
    
    # Mapping the Predictions for Each User
    #
    # - since the ratings were divided by 2 before training and the
    # animes dataset works with ratings from 0 to 10, we multiply the
    # 'estimated (est)' value by 2 so the Surprise predictions are on
    # the same scale as the animes dataset ratings.
    #
    top_n_recommendations = defaultdict(list)
    for uuid, anime_id, _true_rating, est, _details in predictions_df:
        top_n_recommendations[uuid].append((anime_id, est * 2))
    
    # Sorting Predictions for Each User and Retrieving the Recommendations
    for uuid, user_ratings in top_n_recommendations.items():
        user_ratings.sort(key=lambda rating: rating[1], reverse=True)
        top_n_recommendations[uuid] = user_ratings[:maximum_number_of_recommendations]
        
    return top_n_recommendations

In [2]:
# -----------------
# ---- Classes ----
# -----------------
class collaborative_filtering_item_based_approach():
    """
    This class applies Collaborative Filtering with Item-Based Approach to recommend 'n' animes to a given user.
    """
    
    def __init__(self, model, training_df, validation_df, full_dataset, animes_df):
        r"""
        \ Description:
            - Class Constructor.
            
        \ Parameters:
            - Model: Surprise Prediction Model. Singular Value Decomposition is the chosen one for this notebook;
            - Training DF: Surprise Trainset;
            - Validation DF: Surprise Testset (list of user id, anime id and rating tuples);
            - Full Dataset: Surprise Dataset (source of both Training DF and Validation DF);
            - Animes DF: Pandas DataFrame.
            
        \ Other Attributes:
            - Predictions Validations: List of Surprise Predictions;
            - Top Recommendations: Dictionary mapping each user id to its list of (anime id, predicted rating) pairs;
            - Recommendations DF: Pandas DataFrame.
        """
        self.model = model
        self.training_df = training_df
        self.validation_df = validation_df
        self.full_dataset = full_dataset
        self.animes_df = animes_df
        
        self.predictions_validations = None
        self.top_recommendations = None
        self.recommendations_df = None
        
    def fit_and_predict(self, maximum_number_of_recommendations, verbose=True):
        r"""
        \ Description:
            - Applies training, validation and evaluation steps;
            - Gets the top 'n' recommendations and stores them as a Pandas DataFrame in 'recommendations_df';
            - Returns the Root Mean Squared Error (RMSE) of the validation predictions.
            
        \ Parameters:
            - Maximum Number of Recommendations: Integer;
            - Verbose: Boolean.
        """
        # ---- Fitting Step ----
        if verbose:
            fitting_step_title = '** Fitting Step **'
            print('*' * len(fitting_step_title))
            print(fitting_step_title)
            print('*' * len(fitting_step_title))
            print('\n')
        
        self.model.fit(self.training_df)
        
        # ---- Prediction Step ----
        if verbose:
            prediction_step_title = '** Prediction Step **'
            print('\n\n')
            print('*' * len(prediction_step_title))
            print(prediction_step_title)
            print('*' * len(prediction_step_title))
            print('\n')
            
        self.predictions_validations = self.model.test(self.validation_df)
        
        # ---- Evaluation Step ----
        if verbose:
            evaluation_step_title = '** Evaluation Step **'
            print('\n\n')
            print('*' * len(evaluation_step_title))
            print(evaluation_step_title)
            print('*' * len(evaluation_step_title))
            print('\n')
            
        rmse_evaluation = round(accuracy.rmse(self.predictions_validations), 4)
        print(f'- Root Mean Squared Error (RMSE) for the predictions: {rmse_evaluation}')
        print('\n\n')
        
        # ---- Getting Recommendations ----
        if verbose:
            predicting_recommendations_step_title = '** Predicting Recommendations Step **'
            print('\n\n')
            print('*' * len(predicting_recommendations_step_title))
            print(predicting_recommendations_step_title)
            print('*' * len(predicting_recommendations_step_title))
            print('\n')
            
        self.top_recommendations = get_recommendations(
            self.predictions_validations
            , self.animes_df
            , maximum_number_of_recommendations
        )
        
        # Flattening the per-user lists into (user, anime, rating) rows
        # and building the DataFrame in a single pass
        recommendation_rows = [
            (user_id, anime_id, predicted_rating)
            for user_id, user_recommendations in self.top_recommendations.items()
            for anime_id, predicted_rating in user_recommendations
        ]
        self.recommendations_df = pd.DataFrame(
            recommendation_rows
            , columns=['user_id', 'anime_id', 'predicted_rating']
        )
            
        # ---- Return ----
        print('Fitting and Prediction Done 😉👍')
        print('\n')
        return rmse_evaluation
    
    def cross_validation(self, verbose=True):
        r"""
        \ Description:
            - Applies Cross-Validation and evaluates its results using RMSE;
            - Returns the mean RMSE across the folds as a Float.
            
        \ Parameters:
            - Verbose: Boolean.
        """
        # ---- Cross-Validation Step ----
        if verbose:
            cross_validation_step_title = '** Cross-Validation Step **'
            print('*' * len(cross_validation_step_title))
            print(cross_validation_step_title)
            print('*' * len(cross_validation_step_title))
            print('\n')
            
        cross_validation_results = cross_validate(self.model, self.full_dataset, n_jobs=-1)
        cross_validation_results = round(cross_validation_results['test_rmse'].mean(), 4)
        
        # ---- Evaluation Step ----
        if verbose:
            validation_step_title = '** Validation Step **'
            print('\n\n')
            print('*' * len(validation_step_title))
            print(validation_step_title)
            print('*' * len(validation_step_title))
            print('\n')
            
        print(f'- Mean Cross-Validation Root Mean Squared Error (RMSE): {cross_validation_results}')
        print('\n\n')
        
        # ---- Return ----
        print('Cross-Validation Done 😉👍')
        print('\n')
        return cross_validation_results
    
    def recommend(self, user_id, maximum_number_of_recommendations=10, verbose=True):
        r"""
        \ Description:
            - Gets the 'n' recommended items;
            - Gets the recommended animes info;
            - Returns the recommended animes info and the predicted rating as a Pandas DataFrame.
            
        \ Parameters:
            - User ID: Integer;
            - Maximum Number of Recommendations: Integer;
            - Verbose: Boolean.
        """
        # ---- Recommendation Step ----
        if verbose:
            recommendation_step_title = '** Recommendations Step **'
            print('*' * len(recommendation_step_title))
            print(recommendation_step_title)
            print('*' * len(recommendation_step_title))
            print('\n')
            
        recommendations_df = self.recommendations_df.loc[self.recommendations_df['user_id'] == user_id] \
            .head(maximum_number_of_recommendations)
        
        # Formatting Animes Recommendation DataFrame ----
        recommended_anime_ids = recommendations_df.anime_id.to_list()
        recommended_animes_df = self.animes_df.loc[recommended_anime_ids].copy()  # copy, so the new column is not written into a slice of animes_df
        recommended_animes_df['predicted_rating'] = recommendations_df['predicted_rating'].to_list()
        
        # ---- Return ----
        if verbose: display(recommended_animes_df)
        print('Recommendations Done, Enjoy 😉👍')
        print('\n')
        
        return recommended_animes_df

---

**- Reading Datasets**

In [3]:
# ---- Reading Animes Dataset ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')[
    ['title', 'score', 'genres', 'is_hentai', 'image_url']
]

print(f'- Number of Observations: {animes_df.shape[0]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[0])})')
print(f'- Number of Variables: {animes_df.shape[1]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[1])})')
print('---')

animes_df.head()

- Number of Observations: 23748 (twenty-three thousand, seven hundred and forty-eight)
- Number of Variables: 5 (five)
---


| id | title | score | genres | is_hentai | image_url |
|---|---|---|---|---|---|
| 1 | cowboy bebop | 8.75 | award winning, action, sci-fi | 0 | https://cdn.myanimelist.net/images/anime/4/19644.jpg |
| 5 | cowboy bebop tengoku no tobira | 8.38 | action, sci-fi | 0 | https://cdn.myanimelist.net/images/anime/1439/93480.jpg |
| 6 | trigun | 8.22 | adventure, action, sci-fi | 0 | https://cdn.myanimelist.net/images/anime/7/20310.jpg |
| 7 | witch hunter robin | 7.25 | mystery, supernatural, action, drama | 0 | https://cdn.myanimelist.net/images/anime/10/19969.jpg |
| 8 | bouken ou beet | 6.94 | adventure, supernatural, fantasy | 0 | https://cdn.myanimelist.net/images/anime/7/21569.jpg |


In [4]:
# ---- Reading Ratings Dataset ----
ratings_df = pd.read_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv')

print(f'- Number of Observations: {ratings_df.shape[0]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[0])})')
print(f'- Number of Variables: {ratings_df.shape[1]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[1])})')
print('---')

print(f'- Number of Unique Users: {ratings_df.user_id.nunique()} ({INFLECT_ENGINE.number_to_words(ratings_df.user_id.nunique())})')
print(f'- Number of Unique Animes: {ratings_df.anime_id.nunique()} ({INFLECT_ENGINE.number_to_words(ratings_df.anime_id.nunique())})')
print('---')

ratings_df.head()

- Number of Observations: 23796586 (twenty-three million, seven hundred and ninety-six thousand, five hundred and eighty-six)
- Number of Variables: 5 (five)
---
- Number of Unique Users: 264067 (two hundred and sixty-four thousand and sixty-seven)
- Number of Unique Animes: 16380 (sixteen thousand, three hundred and eighty)
---


| | user_id | username | anime_id | anime_title | rating |
|---|---|---|---|---|---|
| 0 | 1 | xinil | 21 | one piece | 9 |
| 1 | 1 | xinil | 48 | hack sign | 7 |
| 2 | 1 | xinil | 320 | a kite | 5 |
| 3 | 1 | xinil | 49 | aa megami-sama | 8 |
| 4 | 1 | xinil | 304 | aa megami-sama movie | 8 |


---

**- Dropping Variables**

The Surprise package needs only three variables, in this order: the user id, the anime id and the rating the user gave to the anime. Thus, we drop the username and anime title variables.

In [5]:
# ---- Dropping Variables ----
variables_to_keep = ['user_id', 'anime_id', 'rating']
ratings_df = ratings_df[variables_to_keep]

---

**- Getting Random Observations for Hyperparameter Tuning**

In [6]:
# ---- Getting Random Observations for Hyperparameter Tuning ----
#
# - since this notebook keeps the default Surprise rating scale, which
# goes up to 5, and the animes dataset works with ratings from 0 to 10,
# we divide the 'rating' variable by 2 so the anime dataset ratings
# fit into that range.
#
temp_ratings_df = ratings_df.copy()
temp_ratings_df['rating'] = temp_ratings_df['rating'] / 2  # vectorized division over the whole column

hyperparameter_tuning_df = temp_ratings_df.sample(HYPERPARAMETER_TUNING_SAMPLE_SIZE, random_state=SEED)
hyperparameter_tuning_df = Dataset.load_from_df(hyperparameter_tuning_df, SURPRISE_READER)

print(
    f'- Number of Observations for Hyperparameter Tuning: {HYPERPARAMETER_TUNING_SAMPLE_SIZE}'
    f' ({INFLECT_ENGINE.number_to_words(HYPERPARAMETER_TUNING_SAMPLE_SIZE)})'
)

- Number of Observations for Hyperparameter Tuning: 250000 (two hundred and fifty thousand)


---

**- Finding Best Hyperparameters for the Model**

In [7]:
# ---- Finding the Best Hyperparameters for the Model ----
#
# - using Randomized Search CV in order to find the best Hyperparameters Values.
#
hyperparameters_tuning_values = find_best_hyperparameters_values(
    model=SVD
    , search_cv=RandomizedSearchCV
    , parameters=SVD_PARAMS
    , dataset=hyperparameter_tuning_df
)

- Best Score: {'rmse': 0.7450818977744432}
- Best Parameters: {'rmse': {'n_epochs': 20, 'lr_all': 0.007, 'reg_all': 0.4}}
- Best Estimator: {'rmse': <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x0000021313E2F3D0>}
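
To make the search step more concrete, here is a minimal sketch of what a randomized search does with the grid: it enumerates all combinations and evaluates only a random subset of them. It is not Surprise's implementation; the helper name `sample_parameter_combinations` is hypothetical, and `n_iter=10` is an assumption based on what is, to my understanding, the default number of sampled combinations in Surprise's `RandomizedSearchCV`.

```python
import itertools
import random

# Same grid as SVD_PARAMS in the Settings cell
svd_params = {
    'n_epochs': [5, 10, 15, 20],
    'lr_all': [0.002, 0.005, 0.007],
    'reg_all': [0.4, 0.6, 0.8],
}


def sample_parameter_combinations(param_grid, n_iter, seed):
    """Returns the full grid and n_iter distinct combinations drawn from it."""
    names = list(param_grid)
    full_grid = [
        dict(zip(names, values))
        for values in itertools.product(*param_grid.values())
    ]
    rng = random.Random(seed)
    sampled = rng.sample(full_grid, k=min(n_iter, len(full_grid)))
    return full_grid, sampled


full_grid, sampled = sample_parameter_combinations(svd_params, n_iter=10, seed=20240106)
print(f'- Grid size: {len(full_grid)}; sampled combinations: {len(sampled)}')
```

Under these assumptions only 10 of the 36 combinations are scored, so the reported best parameters are the best among the sampled ones, not necessarily the best of the whole grid. The best values found above (`n_epochs=20`, `lr_all=0.007`, `reg_all=0.4`) all sit at the edge of their ranges, which may be a hint that extending the grid is worth trying.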


In [8]:
# ---- Finding the Best Hyperparameters for the Model ----
#
# - getting the model with the best parameters.
#
chosen_svd_model = hyperparameters_tuning_values.best_estimator['rmse']
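
For readers wondering what `n_epochs`, `lr_all` and `reg_all` actually control, the sketch below trains a tiny biased matrix factorization with plain stochastic gradient descent. The prediction rule is `global_mean + user_bias + item_bias + user_factors . item_factors`, which is the form Surprise documents for its `SVD` algorithm. Everything else is an assumption for illustration: the function names, `n_factors=2`, the initialization and the toy ratings are hypothetical, and this is not the library's code.

```python
import math
import random


def train_biased_mf(ratings, n_factors=2, n_epochs=20, lr_all=0.007, reg_all=0.4, seed=0):
    """Fits biases and latent factors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    global_mean = sum(rating for _, _, rating in ratings) / len(ratings)
    users = sorted({user for user, _, _ in ratings})
    items = sorted({item for _, item, _ in ratings})
    user_bias = {user: 0.0 for user in users}
    item_bias = {item: 0.0 for item in items}
    user_factors = {user: [rng.gauss(0, 0.1) for _ in range(n_factors)] for user in users}
    item_factors = {item: [rng.gauss(0, 0.1) for _ in range(n_factors)] for item in items}

    def predict(user, item):
        dot = sum(p * q for p, q in zip(user_factors[user], item_factors[item]))
        return global_mean + user_bias[user] + item_bias[item] + dot

    for _ in range(n_epochs):                      # n_epochs: passes over the ratings
        for user, item, rating in ratings:
            error = rating - predict(user, item)
            # lr_all scales every update; reg_all pulls every parameter towards zero
            user_bias[user] += lr_all * (error - reg_all * user_bias[user])
            item_bias[item] += lr_all * (error - reg_all * item_bias[item])
            for f in range(n_factors):
                p, q = user_factors[user][f], item_factors[item][f]
                user_factors[user][f] += lr_all * (error * q - reg_all * p)
                item_factors[item][f] += lr_all * (error * p - reg_all * q)

    return predict


def training_rmse(ratings, predict):
    """Root mean squared error of the predictions on the given ratings."""
    squared_errors = [(rating - predict(user, item)) ** 2 for user, item, rating in ratings]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings on the halved scale, for illustration only
toy_ratings = [
    ('ana', 'a1', 4.5), ('ana', 'a2', 2.0), ('bob', 'a1', 4.0),
    ('bob', 'a3', 1.5), ('cai', 'a2', 2.5), ('cai', 'a3', 1.0),
]
rmse_5_epochs = training_rmse(toy_ratings, train_biased_mf(toy_ratings, n_epochs=5))
rmse_200_epochs = training_rmse(toy_ratings, train_biased_mf(toy_ratings, n_epochs=200))
print(f'- Training RMSE after 5 epochs: {rmse_5_epochs:.4f}')
print(f'- Training RMSE after 200 epochs: {rmse_200_epochs:.4f}')
```

With a small learning rate, more epochs let the parameters move further from their starting values, which is one plausible reading of why the search above preferred the largest `n_epochs` and `lr_all` together with the smallest `reg_all`.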

---

**- Splitting Dataset into Training and Validation**

In [9]:
# ---- Splitting Dataset into Training and Validation ----
#
# - converting Pandas DataFrame into Surprise DataFrame
#
ratings_surprise_df = Dataset.load_from_df(
    temp_ratings_df.sample(HYPERPARAMETER_TUNING_SAMPLE_SIZE, random_state=SEED2)
    , SURPRISE_READER
)

In [10]:
# ---- Splitting Dataset into Training and Validation ----
#
# - deleting some dataframes from the memory since we are going to use 'ratings_surprise_df' from now on;
# - datasets to delete:
#    \ ratings_df;
#    \ temp_ratings_df;
#    \ hyperparameter_tuning_df.
#
ratings_df = None
temp_ratings_df = None
hyperparameter_tuning_df = None

In [11]:
# ---- Splitting Dataset into Training and Validation ----
#
# - training: 80%;
# - validation: 20%.
#
training_surprise_df, validation_surprise_df = train_test_split(
    data=ratings_surprise_df
    , train_size=TRAINING_DF_SIZE
    , test_size=VALIDATION_DF_SIZE
    , random_state=SEED
)
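
As a rough picture of what the 80/20 split does, the sketch below shuffles a list of rating triples and cuts it at the 80% mark. The helper `split_ratings` and the toy triples are hypothetical; Surprise's `train_test_split` additionally converts the training part into its own trainset structure, which is not reproduced here.

```python
import random


def split_ratings(ratings, train_size=0.80, seed=20240106):
    """Shuffles a copy of the ratings and cuts it into training and validation lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical (user_id, anime_id, rating) triples, for illustration only
toy_ratings = [(user_id, anime_id, 3.5) for user_id in range(1, 6) for anime_id in (21, 48)]
training, validation = split_ratings(toy_ratings)

# Users that appear only in the validation part have no training history
training_users = {user_id for user_id, _, _ in training}
unseen_users = {user_id for user_id, _, _ in validation if user_id not in training_users}

print(f'- Training: {len(training)}; Validation: {len(validation)}')
print(f'- Validation users without training ratings: {len(unseen_users)}')
```

Since the split is made over individual ratings, a user with few ratings in the sample can land entirely in the validation part; for such users the model has no learned user parameters to rely on.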

---

**- Training the Model**

In [12]:
# ---- Training the Model ----
#
# - creating the model;
# - fitting and predicting;
# - cross-validating the data.
#
item_based_recommender = collaborative_filtering_item_based_approach(
    chosen_svd_model
    , training_surprise_df
    , validation_surprise_df
    , ratings_surprise_df
    , animes_df
)

model_rmse = item_based_recommender.fit_and_predict(maximum_number_of_recommendations=10, verbose=True)
model_cross_validation_rmse = item_based_recommender.cross_validation(verbose=True)

******************
** Fitting Step **
******************





*********************
** Prediction Step **
*********************





*********************
** Evaluation Step **
*********************


RMSE: 0.7426
- Root Mean Squared Error (RMSE) for the predictions: 0.7426






*************************************
** Predicting Recommendations Step **
*************************************


Fitting and Prediction Done 😉👍


***************************
** Cross-Validation Step **
***************************





*********************
** Validation Step **
*********************


- Mean Cross-Validation Root Mean Squared Error (RMSE): 0.7412



Cross-Validation Done 😉👍
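
The mean cross-validation RMSE printed above is an average over folds. A minimal sketch of that idea follows, assuming 5 folds (which I believe is what `cross_validate` uses when no `cv` is given) and replacing the SVD model with a global-mean baseline so the example stays self-contained. The function names and toy ratings are hypothetical, and unlike Surprise's default the folds here are not shuffled.

```python
import math


def k_fold_indices(n_observations, n_splits=5):
    """Yields (training_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_observations))
    base_size, remainder = divmod(n_observations, n_splits)
    start = 0
    for fold in range(n_splits):
        fold_size = base_size + (1 if fold < remainder else 0)
        validation = indices[start:start + fold_size]
        training = indices[:start] + indices[start + fold_size:]
        yield training, validation
        start += fold_size


def mean_cross_validation_rmse(ratings, n_splits=5):
    """Scores a global-mean baseline on each fold and averages the fold RMSEs."""
    fold_rmses = []
    for training, validation in k_fold_indices(len(ratings), n_splits):
        global_mean = sum(ratings[i] for i in training) / len(training)
        squared_errors = [(ratings[i] - global_mean) ** 2 for i in validation]
        fold_rmses.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(fold_rmses) / len(fold_rmses)


# Hypothetical ratings on the halved scale, for illustration only
toy_ratings = [4.5, 3.5, 2.5, 4.0, 4.0, 3.0, 5.0, 2.0, 3.5, 4.5]
mean_rmse = mean_cross_validation_rmse(toy_ratings)
print(f'- Mean Cross-Validation RMSE (global-mean baseline): {mean_rmse:.4f}')
```

Every observation is used for validation exactly once, which is why the cross-validated RMSE (0.7412) is a steadier estimate than the single validation split (0.7426), even though the two are close here.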




---

**- Recommendations**

In [13]:
# ---- Recommendations ----
item_based_top_10_animes = item_based_recommender.recommend(
    user_id=388_458
    , maximum_number_of_recommendations=10
    , verbose=True
)

**************************
** Recommendations Step **
**************************




| id | title | score | genres | is_hentai | image_url | predicted_rating |
|---|---|---|---|---|---|---|
| 18465 | genshiken nidaime | 7.43 | comedy | 0 | https://cdn.myanimelist.net/images/anime/11/52935.jpg | 7.819258 |


Recommendations Done, Enjoy 😉👍
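
Only one anime is returned for this user because the predictions come from `model.test(validation_df)`, which scores only the (user, anime) pairs that landed in the validation split. A common alternative is to score every anime the user has not rated in the training data and keep the best ones; Surprise offers `build_anti_testset()` on its trainset for this purpose. The sketch below only mirrors that idea in plain Python: the helper names, the toy ratings and the `fake_scores` stand-in for model predictions are all hypothetical.

```python
from collections import defaultdict


def build_candidate_pairs(training_ratings):
    """Returns every (user_id, anime_id) pair that is absent from the training ratings."""
    rated_by_user = defaultdict(set)
    all_animes = set()
    for user_id, anime_id, _ in training_ratings:
        rated_by_user[user_id].add(anime_id)
        all_animes.add(anime_id)
    return [
        (user_id, anime_id)
        for user_id in sorted(rated_by_user)
        for anime_id in sorted(all_animes - rated_by_user[user_id])
    ]


def top_n(scored_pairs, n=10):
    """Keeps the n highest scored animes per user from (user_id, anime_id, score) triples."""
    per_user = defaultdict(list)
    for user_id, anime_id, score in scored_pairs:
        per_user[user_id].append((anime_id, score))
    return {
        user_id: sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]
        for user_id, scores in per_user.items()
    }


# Hypothetical training ratings on the halved scale, for illustration only
toy_training = [(1, 21, 4.5), (1, 48, 3.5), (2, 21, 4.0), (2, 320, 2.5), (3, 49, 4.0)]
candidates = build_candidate_pairs(toy_training)

# Stand-in for model predictions; multiplied by 2 to return to the 0-10 scale
fake_scores = {21: 4.4, 48: 3.6, 320: 2.7, 49: 4.1}
scored = [(user_id, anime_id, fake_scores[anime_id] * 2) for user_id, anime_id in candidates]
recommendations = top_n(scored, n=2)

for user_id, user_recommendations in recommendations.items():
    print(f'- User {user_id}: {user_recommendations}')
```

The number of candidate pairs grows with users times animes, so on the full dataset this would most likely need to be done per user rather than for everyone at once.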






---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).