<center>
    <h1 id='hybrid-filtering' style='color:#7159c1; font-size:350%'>Hybrid Filtering</h1>
    <i style='font-size:125%'>Combining Content-Based Filtering, Collaborative Filtering and Demographic Filtering</i>
</center>

> **Topics**

```
- 🍡 Collaborative Filtering Problems
- 🍡 Hybrid Filtering
- 🍡 Hands-on
- 🍡 Benchmarking
```

<h1 id='0-collaborative-filtering-problems' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🍡 | Collaborative Filtering Problems</h1>

Collaborative Filtering has some issues we have to pay attention to, the first being `computational cost and time`. As seen in the Collaborative Filtering Item-Based Algorithm, a laptop with 12GB of RAM reached 100% memory usage even though the sample was small compared to the total amount of data: slightly more than one million observations out of up to twenty-three million.
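
A rough back-of-the-envelope sketch helps show why memory runs out so quickly. The helper name `dense_matrix_gib` and the count of 40,000 rows are illustrative assumptions, not figures taken from the dataset; the only fact used is that a dense `float64` matrix stores 8 bytes per value.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size, in GiB, of a dense matrix (float64 by default)."""
    return n_rows * n_cols * bytes_per_value / 1024 ** 3

# Hypothetical example: a square similarity matrix with 40,000 rows and columns
print(f'{dense_matrix_gib(40_000, 40_000):.1f} GiB')  # 11.9 GiB
```

Since the size grows with the square of the number of rows, doubling the rows roughly quadruples the memory needed.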

Another problem is the `available data`. Since our dataset is large and contains observations from many users and animes, we did not face this problem, but it is important to keep it in mind. Collaborative Filtering requires a good number of user ratings for each anime in order to recognize the users' tastes and retrieve suitable recommendations. Because of this, when the platform has new users or newly released animes, Collaborative Filtering may not work very well for them, since the available data about them is scarce.

The solution for the first problem is straightforward: use a more powerful machine for the task, or switch to a more efficient algorithm.

For the second problem, we can turn to `Hybrid Filtering`, the last and best Recommendation System technique we are going to see in this project.

<h1 id='1-hybrid-filtering' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🍡 | Hybrid Filtering</h1>

`Hybrid Filtering` combines Content-Based Filtering and Collaborative Filtering. It normally applies the first technique when few user ratings are available for a given anime and smoothly shifts to the second as more user ratings become available for that anime.

To make things clearer, picture a situation where only a few users have rated Dragon Ball Z. Given the small number of ratings, the technique will use Content-Based Filtering and recommend animes similar to Dragon Ball Z.

On the other hand, picture a situation where many users have rated Noragami. Given the large number of ratings, the technique will use Collaborative Filtering and recommend items that similar users have liked.
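
The smooth transition described above can be sketched as a weight that grows with the number of ratings. This is only an illustrative assumption: the function names, the linear ramp and the `cutoff` value are hypothetical, both scores are assumed to be on the same scale, and the hands-on code in this notebook uses a simpler hard switch between techniques.

```python
def collaborative_weight(number_of_ratings, cutoff):
    """Share of the final score given to Collaborative Filtering (0.0 to 1.0)."""
    if cutoff <= 0:
        return 1.0
    return min(number_of_ratings / cutoff, 1.0)

def hybrid_score(content_score, collaborative_score, number_of_ratings, cutoff):
    """Linear blend of both scores, driven by how many ratings the anime has."""
    weight = collaborative_weight(number_of_ratings, cutoff)
    return (1 - weight) * content_score + weight * collaborative_score

print(hybrid_score(0.8, 0.4, number_of_ratings=0, cutoff=100))    # content-based only
print(hybrid_score(0.8, 0.4, number_of_ratings=50, cutoff=100))   # halfway blend
print(hybrid_score(0.8, 0.4, number_of_ratings=500, cutoff=100))  # collaborative only
```

With few ratings the content-based score dominates; once the anime reaches the cutoff, only the collaborative score counts.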

About the advantages:

> **Content-Based Filtering and Collaborative Filtering** - `since Hybrid Filtering combines both techniques and switches between them according to the chosen user/item, it has the advantages of both`;

> **Better Recommendations and Small Bubble** - `consequently, better recommendations are made, with only a small probability of creating a Bubble of Recommendations`.

<br />

Disadvantages-wise:

> **Content-Based Filtering and Collaborative Filtering** - `it also inherits the disadvantages of whichever technique is applied to the chosen user/item`;

> **Required Users and Items Data** - `it requires the dataset to contain data about the items and the users, as well as the interactions between them, that is, the users' ratings for the items`;

> **Computational Cost and Time** - `also, more computational cost and time are needed for the model`.

<br />

In this notebook, we are going to apply Hybrid Filtering by merging `Content-Based Filtering based on Items' Metadatas` with `Collaborative Filtering based on Users` and `Demographic Filtering`. Before heading to the code, let's see how this algorithm works.

<h1 id='2-hands=on' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🍡 | Hands-on</h1>

Steps:

```
- Settings;
- Demographic Filtering Algorithm;
- Content-Based Filtering Items' Metadatas Algorithm;
- Collaborative Filtering Users-Based Algorithm;
- Recommendations.
```

---

**- Settings**

In [1]:
# ---- Imports ----
import numpy as np                                           # pip install numpy
import pandas as pd                                          # pip install pandas
from sklearn.feature_extraction.text import TfidfVectorizer  # pip install scikit-learn
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import linear_kernel
from sklearn.model_selection import train_test_split


# ---- Constants ----
DATASETS_PATH = ('./datasets')
SEED = (20240420) # April 20, 2024 (fourth Bitcoin Halving)

ANIMES_SCORED_BY_CUTOFF = (0.75)
USERS_NUMBER_RATINGS_CUTOFF = (2_000)
BASELINE_PREDICTION = (2.5)

# ---- Settings ----
np.random.seed(SEED)
pd.set_option('display.max_columns', None)

# ---- Functions: Content-Based Filtering ----
def generate_metadatas_sequential_text(dataset, features):
    """
    \ Description:
        - iterates over each dataset row and each element of the 'features' parameter;
        - if the value at row[feature] is different from a single hyphen:
             - the value gets all spaces replaced by underscores;
             - every comma-space separator (now a comma-underscore) is replaced by a single space;
             - the resulting value and a trailing space are appended to the row's sequential text;
        - the row's sequential text is stripped and appended to sequential_text_list;
        - at the end, sequential_text_list is returned.
    
    \ Parameters:
        - dataset: Pandas DataFrame;
        - features: list of strings.
    """
    sequential_text_list = []
    
    for index, row in dataset.iterrows():
        current_sequential_text = ''
        for feature in features:
            if row[feature] != '-':
                current_sequential_text += row[feature].replace(' ', '_').replace(',_', ' ')
                current_sequential_text += ' '
                     
        sequential_text_list.append(current_sequential_text.strip())
    
    return sequential_text_list

def get_recommendations_content_filtering(df, title, animes_indices, cosine_similarity, number_recommendations=10):
    """
    \ Description:
        - gets the index of the anime that matches the title;
        - gets the pairwise similarity scores of all animes with the chosen anime;
        - sorts the animes by similarity score in descending order;
        - gets the scores of the top 'number_recommendations' animes, excluding the chosen one;
        - gets the animes indices;
        - returns the recommended animes id, title, synopsis, score, genre and image url.
    
    \ Parameters:
        - df: Pandas DataFrame;
        - title: string;
        - animes_indices: Pandas Series mapping each title to its row position;
        - cosine_similarity: NumPy array of floats;
        - number_recommendations: integer.
    """
    index = animes_indices[title]
    
    similarity_scores = list(enumerate(cosine_similarity[index]))
    similarity_scores = sorted(similarity_scores, key=lambda score: score[1], reverse=True)
    similarity_scores = similarity_scores[1:number_recommendations+1]
    
    recommended_animes_indices = [index[0] for index in similarity_scores]
    recommended_animes_scores = [index[1] for index in similarity_scores]
    
    recommendations_df = df.iloc[recommended_animes_indices][
        ['id', 'title', 'synopsis', 'score', 'genres', 'image_url']
    ].set_index('id')
    recommendations_df['cosine_similarity'] = recommended_animes_scores
    
    return recommendations_df

# ---- Functions: Collaborative Filtering ----
def calculate_score(user_id, anime_id):
    """
    \ Description:
        - drops the selected user from 'user_id' parameter on similarities and ratings matrices;
        - calculates the total score and weight between the users;
        - calculates the average rating given by the user from 'user_id';
        - returns that average plus the similarity-weighted average of the other users' normalized ratings.
    
    \ Parameters:
        - user_id: integer;
        - anime_id: integer.
        
    \ Return:
        - Baseline Prediction: float (when item is not into training dataset OR
    none of the similar users have rated items in common with the 'user_id' parameter);
        - Predicted Rating: float.
    """
    # If the item is not into the training dataset, the baseline value is returned
    if anime_id not in ratings_matrix.columns: return BASELINE_PREDICTION

    # Dropping the selected user from 'user_id' parameter
    similarity_scores = similarity_matrix[user_id].drop(labels=user_id)
    normalized_ratings = normalized_ratings_matrix[anime_id].drop(index=user_id)
    
    # Dropping users that haven't rated the item
    similarity_scores.drop(index=normalized_ratings[normalized_ratings.isnull()].index, inplace=True)
    normalized_ratings.dropna(inplace=True)
    
    # If none of the other users have rated items in common with the user in question, the baseline value is returned
    if similarity_scores.isna().all(): return BASELINE_PREDICTION
    
    # Calculating Predicted Rating
    total_score = 0
    total_weight = 0
    
    for user_id_rating in normalized_ratings.index:
        # It is possible that another user rated the item but that
        # they have not rated any items in common with the user in question
        if not pd.isna(similarity_scores[user_id_rating]):
            total_score += normalized_ratings[user_id_rating] * similarity_scores[user_id_rating]
            total_weight += abs(similarity_scores[user_id_rating])
            
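    # If every remaining similarity is zero, the division below is undefined, so the baseline value is returned
    if total_weight == 0: return BASELINE_PREDICTION

    # Mean of this user's ratings only (same value as ratings_matrix.T.mean()[user_id], without averaging every user)
    avg_user_rating = ratings_matrix.loc[user_id].mean()
    return avg_user_rating + total_score / total_weight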

def get_recommendations_collaborative_filtering(df, animes_df, user_id, number_recommendations=10):
    """
    \ Description:
        - filters the top 'number_recommendations' recommendations by 'predicted_rating';
        - creates a dataframe containing info about the filtered animes;
        - merges 'predicted_rating' into the dataset;
        - drops unneeded columns;
        - returns the recommendations sorted by 'predicted_rating' in descending order.
    
    \ Parameters:
        - df: Pandas DataFrame;
        - animes_df: Pandas DataFrame;
        - user_id: integer;
        - number_recommendations: integer.
        
    \ Return:
        - recommendations_df: Pandas DataFrame.
    """
    filtered_animes = df.loc[df.user_id == user_id]            \
      .sort_values(by='predicted_rating', ascending=False)    \
      .head(number_recommendations)
    
    recommended_animes_ids = filtered_animes.anime_id.unique().tolist()
    
    recommendations_df = animes_df.loc[animes_df.id.isin(recommended_animes_ids)][
        ['id', 'title', 'synopsis', 'score', 'genres', 'image_url']
    ]
    
    recommendations_df = recommendations_df.merge(
        filtered_animes
        , left_on='id'
        , right_on='anime_id'
        , how='left'
    )
    
    recommendations_df.drop(columns=['anime_id', 'user_id'], inplace=True)
    
    return recommendations_df.sort_values(by='predicted_rating', ascending=False)

# ---- Functions: General ----
def get_recommendations(
    predicted_ratings_df
    , animes_collaborative_filtering_df
    , user_id
    , animes_content_filtering_df
    , title
    , animes_indices
    , cosine_similarity
    , animes_demographic_filtering_df
    , number_recommendations=10
):
    """
    \ Description:
        - if the selected anime has more than the cut-off number of ratings, recommendations
    are made applying Collaborative Filtering User-Based Algorithm;
        - if the selected anime does not have more than the cut-off number of ratings, recommendations
    are made applying Content-Based Item Metadatas Algorithm;
        - if the anime does not exist in the dataset, recommendations are made applying Demographic
    Filtering.
    
    \ Parameters:
        - predicted_ratings_df: Pandas DataFrame;
        - animes_collaborative_filtering_df: Pandas DataFrame;
        - user_id: integer;
        
        - animes_content_filtering_df: Pandas DataFrame;
        - title: string;
        - animes_indices: Pandas Series;
        - cosine_similarity: NumPy Array;
        
        - animes_demographic_filtering_df: Pandas DataFrame;
        
        - number_recommendations: integer.
    
    \ Return:
        - recommendations_df: Pandas DataFrame.
    """
    if (animes_collaborative_filtering_df['title'] == title).any():
        print('- Using Collaborative Filtering User-Based Algorithm!')
        print('---\n\n')
        return get_recommendations_collaborative_filtering(
            predicted_ratings_df
            , animes_collaborative_filtering_df
            , user_id
            , number_recommendations
        )
    elif (animes_content_filtering_df['title'] == title).any():
        print('- Using Content-Based Item Metadatas Algorithm!')
        print('---\n\n')
        return get_recommendations_content_filtering(
            animes_content_filtering_df
            , title
            , animes_indices
            , cosine_similarity
            , number_recommendations
        )
    else:
        print('- Using Demographic Filtering Algorithm!')
        print('---\n\n')
        return animes_demographic_filtering_df.sort_values(by='score', ascending=False).head(number_recommendations)
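
The underscore replacement performed by `generate_metadatas_sequential_text` is easier to see on a single value. The sketch below is a minimal illustration: `to_metadata_tokens` is a hypothetical helper that repeats the same two `replace` calls, and the regular expression is the default `token_pattern` of scikit-learn's `TfidfVectorizer`, used here only to show that a multi-word genre survives as one token.

```python
import re

def to_metadata_tokens(value):
    """Joins the words of each comma-separated entry with underscores."""
    return value.replace(' ', '_').replace(',_', ' ')

text = to_metadata_tokens('slice of life, award winning')
print(text)  # slice_of_life award_winning

# Default token pattern of TfidfVectorizer: words of two or more characters
print(re.findall(r'(?u)\b\w\w+\b', text))  # ['slice_of_life', 'award_winning']
```

Without the underscores, "slice of life" would be split into three unrelated words before the TF-IDF weights are calculated.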

---

**- Demographic Filtering Algorithm**

In [2]:
# ---- Reading Dataset ----
animes_demographic_filtering_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')
animes_demographic_filtering_df = animes_demographic_filtering_df.loc[
    animes_demographic_filtering_df.score > 0
][['title', 'genres', 'score', 'scored_by', 'popularity', 'image_url']]
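
As a minimal illustration of what this step feeds into, the toy example below applies the same filter (`score > 0`) and the score ranking that `get_recommendations` uses as its fallback. The titles and scores are made up for the example.

```python
import pandas as pd

toy_animes_df = pd.DataFrame({
    'title': ['anime a', 'anime b', 'anime c', 'anime d'],
    'score': [7.9, 0.0, 9.1, 8.4],
})

# Same two steps as the notebook: drop unscored titles, then rank by score
top_animes_df = toy_animes_df.loc[toy_animes_df.score > 0] \
    .sort_values(by='score', ascending=False) \
    .head(2)

print(top_animes_df.title.tolist())  # ['anime c', 'anime d']
```

Every user receives the same list here, which is why this technique is kept only as the last resort.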

---

**- Content-Based Filtering Items' Metadatas Algorithm**

In [3]:
# ---- Reading Dataset ----
animes_content_filtering_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')[
    ['title', 'synopsis', 'score', 'genres', 'type', 'source', 'image_url']
]

# ---- Generating Sequential Text for Metadatas ----
metadata_features = ['genres', 'type', 'source']
animes_content_filtering_df['metadatas'] = generate_metadatas_sequential_text(animes_content_filtering_df, metadata_features)
animes_content_filtering_df.head()

# ---- Lower Casing ----
animes_content_filtering_df.metadatas = animes_content_filtering_df.metadatas.apply(lambda metadata: metadata.lower())

# ---- Removing All Break Lines (\n) and Special Characters (\t \r \x0b \x0c) ----
animes_content_filtering_df.metadatas = animes_content_filtering_df.metadatas.apply(lambda metadata: ' '.join(metadata.split()))

# ---- Calculating TF-IDF ----
tfidf_vectorizer = TfidfVectorizer(analyzer='word', norm='l2', stop_words='english')
tfidf_metadatas = tfidf_vectorizer.fit_transform(animes_content_filtering_df.metadatas)

# ---- Calculating Cosine Similarity ----
cosine_similarity_metadatas = linear_kernel(tfidf_metadatas, tfidf_metadatas)

# ---- Resetting Animes DataFrame Index ----
#
# - so that the index follows a sequence from 0 to 'n - 1', being 'n'
# the total number of animes.
#
animes_content_filtering_df.reset_index(inplace=True)

# ---- Getting Animes ID-Title Pairs ----
animes_indices = pd.Series(animes_content_filtering_df.index, index=animes_content_filtering_df.title)
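
Because `TfidfVectorizer` is configured with `norm='l2'`, every row already has unit length, so the dot product computed by `linear_kernel` equals the cosine similarity. The sanity check below uses made-up metadata strings; the toy corpus is an assumption for illustration only, and `stop_words` is left out because it does not affect the comparison.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_metadatas = [
    'action adventure tv manga',
    'action adventure movie original',
    'slice_of_life comedy tv manga',
]

tfidf = TfidfVectorizer(analyzer='word', norm='l2').fit_transform(toy_metadatas)

dot_products = linear_kernel(tfidf, tfidf)
cosines = cosine_similarity(tfidf, tfidf)

print(np.allclose(dot_products, cosines))       # True
print(np.allclose(np.diag(dot_products), 1.0))  # True: each anime matches itself
```

The second and third strings share no words, so their similarity is zero, while each of them overlaps with the first one.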

---

**- Collaborative Filtering Users-Based Algorithm**

In [4]:
# ---- Reading Animes Dataset ----
animes_collaborative_filtering_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv')[
    ['id', 'title', 'synopsis', 'score', 'genres', 'image_url', 'scored_by']
]

# ---- Filtering Animes with a number of Users Ratings greater than or equal to the cutoff ----
minimum_number_of_ratings = animes_collaborative_filtering_df.scored_by.quantile(q=ANIMES_SCORED_BY_CUTOFF, interpolation='linear')
animes_collaborative_filtering_df = animes_collaborative_filtering_df.loc[animes_collaborative_filtering_df.scored_by >= minimum_number_of_ratings].copy()

# ---- Reading Ratings Dataset ----
ratings_df = pd.read_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv')[
    ['user_id', 'anime_id', 'rating']
]

# ---- Filtering Ratings by Filtered Animes ----
filtered_animes_ids = animes_collaborative_filtering_df.id.to_list()
ratings_df = ratings_df.loc[ratings_df.anime_id.isin(filtered_animes_ids)].copy()

# ---- Filtering Ratings from users with at least USERS_NUMBER_RATINGS_CUTOFF (2000) Ratings ----
users_ratings_count = ratings_df.user_id.value_counts()
ratings_df = ratings_df.loc[
    ratings_df.user_id.isin(users_ratings_count[users_ratings_count >= USERS_NUMBER_RATINGS_CUTOFF].index)
].copy()

# ---- Splitting Dataset into Train and Validation ----
train_ratings_df, valid_ratings_df = train_test_split(
    ratings_df
    , train_size=0.80
    , test_size=0.20
    , random_state=SEED
)

# ---- Calculating Ratings Matrix ----
#
# - values: users ratings to animes;
# - indexes: users ids;
# - columns: animes ids;
#
ratings_matrix = pd.pivot_table(train_ratings_df, values='rating', index='user_id', columns='anime_id')
normalized_ratings_matrix = ratings_matrix.subtract(ratings_matrix.mean(axis=1), axis=0)

# ---- Calculating Users Similarity Matrix ----
similarity_matrix = ratings_matrix.T.corr(method='pearson')

# ---- Predictions Calculation ----
valid_ratings = np.array(valid_ratings_df['rating'])
users_ids_list = valid_ratings_df['user_id']
animes_ids_list = valid_ratings_df['anime_id']
predicted_ratings = np.array([calculate_score(user_id, anime_id) for (user_id, anime_id) in zip(users_ids_list, animes_ids_list)])

# ---- Validation ----
rmse = np.sqrt(mean_squared_error(valid_ratings, predicted_ratings))

# ---- Predicted Ratings Dataset ----
predicted_ratings_df = pd.DataFrame(columns=['user_id', 'anime_id', 'predicted_rating'])
predicted_ratings_df['user_id'] = users_ids_list
predicted_ratings_df['anime_id'] = animes_ids_list
predicted_ratings_df['predicted_rating'] = predicted_ratings
predicted_ratings_df.reset_index(drop=True, inplace=True)
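
The prediction made by `calculate_score` follows the usual mean-centered weighted average: the user's own average rating plus the similarity-weighted average of the other users' normalized ratings. The sketch below reproduces that idea on a made-up 3-user matrix; `predict_rating` is a hypothetical, vectorized stand-in rather than the notebook's function, and it returns `None` where the notebook would return its baseline value.

```python
import numpy as np
import pandas as pd

# Rows are users, columns are animes; user 1 has not rated anime 30
toy_ratings_matrix = pd.DataFrame(
    {10: [8.0, 9.0, 3.0], 20: [6.0, 7.0, 9.0], 30: [np.nan, 9.0, 2.0], 40: [7.0, 8.0, 4.0]},
    index=pd.Index([1, 2, 3], name='user_id'),
)

user_means = toy_ratings_matrix.mean(axis=1)
normalized = toy_ratings_matrix.subtract(user_means, axis=0)
similarity = toy_ratings_matrix.T.corr(method='pearson')

def predict_rating(user_id, anime_id):
    # Normalized ratings from the other users that rated the anime
    neighbours = normalized[anime_id].drop(index=user_id).dropna()
    # Similarities to those users, ignoring undefined correlations
    weights = similarity.loc[user_id, neighbours.index].dropna()
    total_weight = weights.abs().sum()
    if total_weight == 0:
        return None  # no usable neighbour
    weighted_sum = (neighbours.loc[weights.index] * weights).sum()
    return user_means.loc[user_id] + weighted_sum / total_weight

predicted_rating = predict_rating(user_id=1, anime_id=30)
print(round(predicted_rating, 2))
```

User 2 rates like user 1 and scored the anime above their own average, while user 3 has the opposite taste and scored it below theirs, so both push the prediction above user 1's average of 7.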

---

**- Recommendations**

In [5]:
# ---- Content-Based Filtering: Search Function ----
#
# - searches for anime titles that contain a given string, so that one of them
# can be used in the next cell to get recommendations.
#
animes_content_filtering_df.title.loc[animes_content_filtering_df.title.str.contains('brotherhood')]

3961                      fullmetal alchemist brotherhood
4578             fullmetal alchemist brotherhood specials
5174     fullmetal alchemist brotherhood - 4-koma theater
11624                        brotherhood final fantasy xv
Name: title, dtype: object

In [6]:
# ---- Getting Recommendations ----
get_recommendations(
    predicted_ratings_df=predicted_ratings_df
    , animes_collaborative_filtering_df=animes_collaborative_filtering_df
    , user_id=609_917
    , animes_content_filtering_df=animes_content_filtering_df
    , title='fullmetal alchemist brotherhood'
    , animes_indices=animes_indices
    , cosine_similarity=cosine_similarity_metadatas
    , animes_demographic_filtering_df=animes_demographic_filtering_df
    , number_recommendations=10
)

- Using Collaborative Filtering User-Based Algorithm!
---




Unnamed: 0,id,title,synopsis,score,genres,image_url,predicted_rating
4,4181,clannad after story,"clannad: after story, the sequel to the critic...",8.93,"supernatural, romance, drama",https://cdn.myanimelist.net/images/anime/1299/...,9.768797
2,1575,code geass hangyaku no lelouch,"in the year 2010, the holy empire of britannia...",8.7,"action, sci-fi, award winning, drama",https://cdn.myanimelist.net/images/anime/1032/...,9.74966
3,2001,tengen toppa gurren lagann,simon and kamina were born and raised in a dee...,8.63,"adventure, action, sci-fi, award winning",https://cdn.myanimelist.net/images/anime/4/512...,9.661821
5,9989,ano hi mita hana no namae wo bokutachi wa mada...,jinta yadomi is peacefully living as a recluse...,8.31,"supernatural, drama",https://cdn.myanimelist.net/images/anime/5/796...,9.632413
6,11741,fate zero 2nd season,as the fourth holy grail war rages on with no ...,8.55,"action, fantasy, supernatural",https://cdn.myanimelist.net/images/anime/1522/...,9.623956
8,36862,made in abyss movie 3 fukaki tamashii no reimei,"after bonding over a tragic loss, the long-suf...",8.63,"fantasy, drama, adventure, sci-fi, mystery",https://cdn.myanimelist.net/images/anime/1502/...,9.594341
7,12355,ookami kodomo no ame to yuki,"hana, a hard-working college student, falls in...",8.58,"slice of life, fantasy, award winning",https://cdn.myanimelist.net/images/anime/9/357...,9.556795
1,1535,death note,"brutal murders, petty thefts, and senseless vi...",8.62,"suspense, supernatural",https://cdn.myanimelist.net/images/anime/9/945...,9.526311
9,36990,non non biyori movie vacation,"with summer vacation coming to an end, the gir...",8.25,slice of life,https://cdn.myanimelist.net/images/anime/1044/...,9.460765
0,572,kaze no tani no nausica,a millennium has passed since the catastrophic...,8.36,"adventure, fantasy, award winning",https://cdn.myanimelist.net/images/anime/10/75...,9.422699


<h1 id='3-benchmarking' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🍡 | Benchmarking</h1>

In [1]:
# ---- Imports ----
import numpy as np                                           # pip install numpy
import pandas as pd                                          # pip install pandas
import psutil                                                # pip install psutil
import os                                                    # standard library
from sklearn.feature_extraction.text import TfidfVectorizer  # pip install scikit-learn
from sklearn.metrics import mean_squared_error               # pip install scikit-learn
from sklearn.metrics.pairwise import linear_kernel           # pip install scikit-learn
from sklearn.model_selection import train_test_split         # pip install scikit-learn
import threading                                             # standard library
import time                                                  # standard library


# ---- Constants ----
ANIMES_SCORED_BY_CUTOFF = (0.75)
USERS_NUMBER_RATINGS_CUTOFF = (2_000)
BASELINE_PREDICTION = (2.5)

NUMBER_OF_RECOMMENDATIONS = (10)
NUMBER_OF_ITERATIONS = (10)
DATASETS_PATH = ('./datasets')
SEED = (20240420) # April 20, 2024 (fourth Bitcoin Halving)

# ---- Settings ----
np.random.seed(SEED)

# ---- Functions: Content-Based Filtering ----
def generate_metadatas_sequential_text(dataset, features):
    """
    \ Description:
        - iterates over each dataset row and each element of the 'features' parameter;
        - if the value at row[feature] is different from a single hyphen:
             - the value gets all spaces replaced by underscores;
             - every comma-space separator (now a comma-underscore) is replaced by a single space;
             - the resulting value and a trailing space are appended to the row's sequential text;
        - the row's sequential text is stripped and appended to sequential_text_list;
        - at the end, sequential_text_list is returned.
    
    \ Parameters:
        - dataset: Pandas DataFrame;
        - features: list of strings.
    """
    sequential_text_list = []
    
    for index, row in dataset.iterrows():
        current_sequential_text = ''
        for feature in features:
            if row[feature] != '-':
                current_sequential_text += row[feature].replace(' ', '_').replace(',_', ' ')
                current_sequential_text += ' '
                     
        sequential_text_list.append(current_sequential_text.strip())
    
    return sequential_text_list

def get_recommendations_content_filtering(df, title, animes_indices, cosine_similarity, number_recommendations=10):
    """
    \ Description:
        - gets the index of the anime that matches the title;
        - gets the pairwise similarity scores of all animes with the chosen anime;
        - sorts the animes by similarity score in descending order;
        - gets the scores of the top 'number_recommendations' animes, excluding the chosen one;
        - gets the animes indices;
        - returns the recommended animes id, title, synopsis, score, genre and image url.
    
    \ Parameters:
        - df: Pandas DataFrame;
        - title: string;
        - animes_indices: Pandas Series mapping each title to its row position;
        - cosine_similarity: NumPy array of floats;
        - number_recommendations: integer.
    """
    index = animes_indices[title]
    
    similarity_scores = list(enumerate(cosine_similarity[index]))
    similarity_scores = sorted(similarity_scores, key=lambda score: score[1], reverse=True)
    similarity_scores = similarity_scores[1:number_recommendations+1]
    
    recommended_animes_indices = [index[0] for index in similarity_scores]
    recommended_animes_scores = [index[1] for index in similarity_scores]
    
    recommendations_df = df.iloc[recommended_animes_indices][
        ['id', 'title', 'synopsis', 'score', 'genres', 'image_url']
    ].set_index('id')
    recommendations_df['cosine_similarity'] = recommended_animes_scores
    
    return recommendations_df

# ---- Functions: Collaborative Filtering ----
def calculate_score(user_id, anime_id, ratings_matrix, normalized_ratings_matrix, similarity_matrix):
    """
    \ Description:
        - drops the selected user from 'user_id' parameter on similarities and ratings matrices;
        - calculates the total score and weight between the users;
        - calculates the average rating given by the user from 'user_id';
        - returns that average plus the similarity-weighted average of the other users' normalized ratings.
    
    \ Parameters:
        - user_id: integer;
        - anime_id: integer;
        - ratings_matrix: Pandas DataFrame;
        - normalized_ratings_matrix: Pandas DataFrame;
        - similarity_matrix: Pandas DataFrame.
        
    \ Return:
        - Baseline Prediction: float (when item is not into training dataset OR
    none of the similar users have rated items in common with the 'user_id' parameter);
        - Predicted Rating: float.
    """
    # If the item is not into the training dataset, the baseline value is returned
    if anime_id not in ratings_matrix.columns: return BASELINE_PREDICTION

    # Dropping the selected user from 'user_id' parameter
    similarity_scores = similarity_matrix[user_id].drop(labels=user_id)
    normalized_ratings = normalized_ratings_matrix[anime_id].drop(index=user_id)
    
    # Dropping users that haven't rated the item
    similarity_scores.drop(index=normalized_ratings[normalized_ratings.isnull()].index, inplace=True)
    normalized_ratings.dropna(inplace=True)
    
    # If none of the other users have rated items in common with the user in question, the baseline value is returned
    if similarity_scores.isna().all(): return BASELINE_PREDICTION
    
    # Calculating Predicted Rating
    total_score = 0
    total_weight = 0
    
    for user_id_rating in normalized_ratings.index:
        # It is possible that another user rated the item but that
        # they have not rated any items in common with the user in question
        if not pd.isna(similarity_scores[user_id_rating]):
            total_score += normalized_ratings[user_id_rating] * similarity_scores[user_id_rating]
            total_weight += abs(similarity_scores[user_id_rating])
            
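    # If every remaining similarity is zero, the division below is undefined, so the baseline value is returned
    if total_weight == 0: return BASELINE_PREDICTION

    # Mean of this user's ratings only (same value as ratings_matrix.T.mean()[user_id], without averaging every user)
    avg_user_rating = ratings_matrix.loc[user_id].mean()
    return avg_user_rating + total_score / total_weight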

def get_recommendations_collaborative_filtering(df, animes_df, user_id, number_recommendations=10):
    """
    \ Description:
        - filters the top 'number_recommendations' recommendations by 'predicted_rating';
        - creates a dataframe containing info about the filtered animes;
        - merges 'predicted_rating' into the dataset;
        - drops unneeded columns;
        - returns the recommendations sorted by 'predicted_rating' in descending order.
    
    \ Parameters:
        - df: Pandas DataFrame;
        - animes_df: Pandas DataFrame;
        - user_id: integer;
        - number_recommendations: integer.
        
    \ Return:
        - recommendations_df: Pandas DataFrame.
    """
    filtered_animes = df.loc[df.user_id == user_id]            \
      .sort_values(by='predicted_rating', ascending=False)    \
      .head(number_recommendations)
    
    recommended_animes_ids = filtered_animes.anime_id.unique().tolist()
    
    recommendations_df = animes_df.loc[animes_df.id.isin(recommended_animes_ids)][
        ['id', 'title', 'synopsis', 'score', 'genres', 'image_url']
    ]
    
    recommendations_df = recommendations_df.merge(
        filtered_animes
        , left_on='id'
        , right_on='anime_id'
        , how='left'
    )
    
    recommendations_df.drop(columns=['anime_id', 'user_id'], inplace=True)
    
    return recommendations_df.sort_values(by='predicted_rating', ascending=False)

# ---- Functions: General ----
def get_recommendations(
    predicted_ratings_df
    , animes_collaborative_filtering_df
    , user_id
    , animes_content_filtering_df
    , title
    , animes_indices
    , cosine_similarity
    , animes_demographic_filtering_df
    , number_recommendations=10
):
    """
    \ Description:
        - if the selected anime has more than the cut-off number of ratings, recommendations
    are made applying Collaborative Filtering User-Based Algorithm;
        - if the selected anime does not have more than the cut-off number of ratings, recommendations
    are made applying Content-Based Item Metadatas Algorithm;
        - if the anime does not exist in the dataset, recommendations are made applying Demographic
    Filtering.
    
    \ Parameters:
        - predicted_ratings_df: Pandas DataFrame;
        - animes_collaborative_filtering_df: Pandas DataFrame;
        - user_id: integer;
        
        - animes_content_filtering_df: Pandas DataFrame;
        - title: string;
        - animes_indices: Pandas Series;
        - cosine_similarity: NumPy Array;
        
        - animes_demographic_filtering_df: Pandas DataFrame;
        
        - number_recommendations: integer.
    
    \ Return:
        - recommendations_df: Pandas DataFrame.
    """
    if (animes_collaborative_filtering_df['title'] == title).any():
        return get_recommendations_collaborative_filtering(
            predicted_ratings_df
            , animes_collaborative_filtering_df
            , user_id
            , number_recommendations
        )
    elif (animes_content_filtering_df['title'] == title).any():
        return get_recommendations_content_filtering(
            animes_content_filtering_df
            , title
            , animes_indices
            , cosine_similarity
            , number_recommendations
        )
    else:
        return animes_demographic_filtering_df.sort_values(by='score', ascending=False).head(number_recommendations)
    
def hybrid_filtering(
    demographic_filtering_df
    , content_based_filtering_df
    , collaborative_filtering_df
    , ratings_df
    , number_recommendations
    , anime_title
):
    """
    \ Description:
        - applies Hybrid Filtering for Benchmark.
        
    \ Parameters:
        - demographic_filtering_df: Pandas DataFrame;
        - content_based_filtering_df: Pandas DataFrame;
        - collaborative_filtering_df: Pandas DataFrame;
        - ratings_df: Pandas DataFrame;
        - number_recommendations: integer;
        - anime_title: string.
    """
    temp_demographic_filtering_df = demographic_filtering_df.copy()
    temp_content_based_filtering_df = content_based_filtering_df.copy()
    temp_collaborative_filtering_df = collaborative_filtering_df.copy()
    temp_ratings_df = ratings_df.copy()
    
    
    
    # ***************************************
    # ** Content-Based Filtering Metadatas **
    # ***************************************
    
    # ---- Calculating TF-IDF ----
    tfidf_vectorizer = TfidfVectorizer(analyzer='word', norm='l2', stop_words='english')
    tfidf_metadatas = tfidf_vectorizer.fit_transform(temp_content_based_filtering_df.metadatas)
    
    # ---- Calculating Cosine Similarity ----
    cosine_similarity_metadatas = linear_kernel(tfidf_metadatas, tfidf_metadatas)
    
    # ---- Resetting Animes DataFrame Index ----
    #
    # - so that the index follows a sequence from 0 to 'n - 1', 'n' being
    # the total number of animes.
    #
    temp_content_based_filtering_df.reset_index(inplace=True)
    
    # ---- Mapping Animes Titles to their Row Indices ----
    temp_animes_indices = pd.Series(temp_content_based_filtering_df.index, index=temp_content_based_filtering_df.title)
    
    
    
    # ****************************************
    # ** Collaborative Filtering User-Based **
    # ****************************************
    
    # ---- Splitting Dataset into Train and Validation ----
    train_ratings_df, valid_ratings_df = train_test_split(
        temp_ratings_df
        , train_size=0.80
        , test_size=0.20
        , random_state=SEED
    )
    
    # ---- Calculating Ratings Matrix ----
    #
    # - values: users ratings to animes;
    # - indexes: users ids;
    # - columns: animes ids;
    #
    ratings_matrix = pd.pivot_table(train_ratings_df, values='rating', index='user_id', columns='anime_id')
    normalized_ratings_matrix = ratings_matrix.subtract(ratings_matrix.mean(axis=1), axis=0)
    
    # ---- Calculating Users Similarity Matrix ----
    similarity_matrix = ratings_matrix.T.corr(method='pearson')
    
    # ---- Predictions Calculation ----
    valid_ratings = np.array(valid_ratings_df['rating'])
    users_ids_list = valid_ratings_df['user_id']
    animes_ids_list = valid_ratings_df['anime_id']
    predicted_ratings = np.array([
        calculate_score(user_id, anime_id, ratings_matrix, normalized_ratings_matrix, similarity_matrix)
        for (user_id, anime_id)
        in zip(users_ids_list, animes_ids_list)
    ])
    
    # ---- Validation ----
    rmse = np.sqrt(mean_squared_error(valid_ratings, predicted_ratings))
    
    # ---- Predicted Ratings Dataset ----
    predicted_ratings_df = pd.DataFrame(columns=['user_id', 'anime_id', 'predicted_rating'])
    predicted_ratings_df['user_id'] = users_ids_list
    predicted_ratings_df['anime_id'] = animes_ids_list
    predicted_ratings_df['predicted_rating'] = predicted_ratings
    predicted_ratings_df.reset_index(drop=True, inplace=True)
    
    
    
    # *********************
    # ** Recommendations **
    # *********************
    
    # ---- Getting Recommendations ----
    return get_recommendations(
      predicted_ratings_df=predicted_ratings_df
      , animes_collaborative_filtering_df=temp_collaborative_filtering_df
      , user_id=609_917
      , animes_content_filtering_df=temp_content_based_filtering_df
      , title=anime_title
      , animes_indices=temp_animes_indices
      , cosine_similarity=cosine_similarity_metadatas
      , animes_demographic_filtering_df=temp_demographic_filtering_df
      , number_recommendations=number_recommendations
    )
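
The dispatch order used by `get_recommendations` above (Collaborative Filtering when it knows the title, Content-Based Filtering otherwise, and Demographic Filtering as the last resort) can be illustrated in isolation. The sketch below is a simplified stand-in, not the project's implementation: `pick_strategy` and the toy title sets are hypothetical, and plain Python sets replace the DataFrames.

```python
# Minimal sketch of the fallback order, assuming each technique exposes
# the set of titles it has enough data for (hypothetical toy data).

def pick_strategy(title, collaborative_titles, content_titles):
    """Return the name of the first technique that knows the title."""
    if title in collaborative_titles:
        return 'collaborative'
    if title in content_titles:
        return 'content-based'
    # unknown title: fall back to the globally best-scored animes
    return 'demographic'


collaborative_titles = {'fullmetal alchemist brotherhood'}
content_titles = {'fullmetal alchemist brotherhood', 'a newly released anime'}

for title in ['fullmetal alchemist brotherhood', 'a newly released anime', '?']:
    strategy = pick_strategy(title, collaborative_titles, content_titles)
    print(f'{title!r} -> {strategy}')
```

A title such as `'?'`, which no technique knows, always ends in the demographic branch, which is presumably why the benchmark uses it to exercise the fallback path.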

In [2]:
# ***************************
# ** Demographic Filtering **
# ***************************

# ---- Reading Dataset ----
animes_demographic_filtering_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')
animes_demographic_filtering_df = animes_demographic_filtering_df.loc[
    animes_demographic_filtering_df.score > 0
][['title', 'genres', 'score', 'scored_by', 'popularity', 'image_url']]



# ***************************************
# ** Content-Based Filtering Metadatas **
# ***************************************

# ---- Reading Dataset ----
animes_content_filtering_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')[
    ['title', 'synopsis', 'score', 'genres', 'type', 'source', 'image_url']
]

# ---- Generating Sequential Text for Metadatas ----
metadata_features = ['genres', 'type', 'source']
animes_content_filtering_df['metadatas'] = generate_metadatas_sequential_text(animes_content_filtering_df, metadata_features)

# ---- Lower Casing ----
animes_content_filtering_df.metadatas = animes_content_filtering_df.metadatas.apply(lambda metadata: metadata.lower())

# ---- Removing All Line Breaks (\n) and Other Whitespace Characters (\t \r \x0b \x0c) ----
animes_content_filtering_df.metadatas = animes_content_filtering_df.metadatas.apply(lambda metadata: ' '.join(metadata.split()))



# ****************************************
# ** Collaborative Filtering User-Based **
# ****************************************

# ---- Reading Animes Dataset ----
animes_collaborative_filtering_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv')[
    ['id', 'title', 'synopsis', 'score', 'genres', 'image_url', 'scored_by']
]

# ---- Filtering Animes with a Number of Users Ratings Greater than or Equal to a Cutoff ----
minimum_number_of_ratings = animes_collaborative_filtering_df.scored_by.quantile(q=ANIMES_SCORED_BY_CUTOFF, interpolation='linear')
animes_collaborative_filtering_df = animes_collaborative_filtering_df.loc[animes_collaborative_filtering_df.scored_by >= minimum_number_of_ratings].copy()

# ---- Reading Ratings Dataset ----
ratings_df = pd.read_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv')[
    ['user_id', 'anime_id', 'rating']
]

# ---- Filtering Ratings by Filtered Animes ----
filtered_animes_ids = animes_collaborative_filtering_df.id.to_list()
ratings_df = ratings_df.loc[ratings_df.anime_id.isin(filtered_animes_ids)].copy()

# ---- Keeping Only Ratings from Users with at Least USERS_NUMBER_RATINGS_CUTOFF (2000) Ratings ----
users_ratings_count = ratings_df.user_id.value_counts()
ratings_df = ratings_df.loc[
    ratings_df.user_id.isin(users_ratings_count[users_ratings_count >= USERS_NUMBER_RATINGS_CUTOFF].index)
].copy()



# ***************
# ** Benchmark **
# ***************

# ---- Benchmark Dataset ----
benchmark_df = pd.DataFrame(
    columns=[
        'iteration', 'algorithm', 'execution_time', 'avg_cpu_usage'
        , 'min_cpu_usage', 'max_cpu_usage', 'avg_ram_usage'
        , 'min_ram_usage', 'max_ram_usage'
    ]
)

In [3]:
# ---- Thread ----
#
# - the benchmark state (python_process, the CPU and RAM usage lists,
# execution_time and running) lives at module level and is initialized
# in the next cell; 'global' statements are only needed inside functions.
#

def benchmark():
    global iteration_cpu_usage
    global iteration_ram_usage
    global running
    
    running = True
    
    while running:
        iteration_cpu_usage.append(python_process.cpu_percent(interval=0.1) / psutil.cpu_count())
        iteration_ram_usage.append(python_process.memory_percent(memtype='uss'))
        #iteration_ram_usage.append(python_process.memory_full_info().uss / 1024 / 1024) # in MB

def start_thread():
    global thread
    thread = threading.Thread(target=benchmark)
    thread.start()

def stop_thread():
    global thread
    global running
    
    running = False
    thread.join() # wait for thread's end
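
As an alternative to the shared `running` flag above, the same start/stop pattern can be written with `threading.Event`, which avoids module-level globals. This is only a sketch under assumptions: `UsageSampler` is a hypothetical helper, and `time.perf_counter` stands in for a real `psutil` reading so the example depends on the standard library only.

```python
import threading
import time


class UsageSampler:
    """Collects samples from `sample_fn` on a background thread until stopped."""

    def __init__(self, sample_fn, interval=0.01):
        self.sample_fn = sample_fn
        self.interval = interval
        self.samples = []
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # take at least one sample, then keep sampling until stop() is called
        while True:
            self.samples.append(self.sample_fn())
            if self._stop_event.wait(self.interval):
                break

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()  # wait for the sampling thread to finish


# hypothetical stand-in for a psutil reading such as memory_percent()
sampler = UsageSampler(sample_fn=time.perf_counter)
sampler.start()
time.sleep(0.05)  # the workload being measured would run here
sampler.stop()

print(f'collected {len(sampler.samples)} samples')
```

Because `_run` samples before checking the event, the list is never empty when `stop()` returns, so averaging the samples afterwards cannot divide by zero.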

In [4]:
# ---- Benchmark ----
python_process = psutil.Process(os.getpid())

iteration_cpu_usage = []
cpu_usage = []
min_cpu_usage = []
max_cpu_usage = []

iteration_ram_usage = []
ram_usage = []
min_ram_usage = []
max_ram_usage = []

execution_time = []

running = False

animes_title = [
    'fullmetal alchemist brotherhood', 'fullmetal alchemist brotherhood', 'fullmetal alchemist brotherhood', 'fullmetal alchemist brotherhood'
    , '?', '?', '?'
    , '?', '?', '?'
]

for iteration in range(NUMBER_OF_ITERATIONS):
    # ---- Globals ----
    #
    # - no 'global' statements are needed here: the loop already runs at
    # module level, and redeclaring the names assigned above as 'global'
    # raises a SyntaxError when this code runs as a plain script.
    #
    
    # ---- Thread ----
    iteration_cpu_usage = []
    iteration_ram_usage = []

    start_time = time.perf_counter()
    start_thread()
    
    try:
        hybrid_filtering(
            animes_demographic_filtering_df
            , animes_content_filtering_df
            , animes_collaborative_filtering_df
            , ratings_df
            , NUMBER_OF_RECOMMENDATIONS
            , animes_title[iteration]
        )
    except Exception as exception:
        print(f'- An exception occurred: {exception}')
    finally:
        stop_thread()
    
    # ---- Computing Benchmarks ----
    print(f'- Calculations of iteration {iteration}')
    
    final_time = time.perf_counter()
    execution_time.append(final_time - start_time)
    
    cpu_usage.append(sum(iteration_cpu_usage) / len(iteration_cpu_usage))
    min_cpu_usage.append(min(iteration_cpu_usage))
    max_cpu_usage.append(max(iteration_cpu_usage))
    
    ram_usage.append(sum(iteration_ram_usage) / len(iteration_ram_usage))
    min_ram_usage.append(min(iteration_ram_usage))
    max_ram_usage.append(max(iteration_ram_usage))

- Calculations of iteration 0
- Calculations of iteration 1
- Calculations of iteration 2
- Calculations of iteration 3
- Calculations of iteration 4
- Calculations of iteration 5
- Calculations of iteration 6
- Calculations of iteration 7
- Calculations of iteration 8
- Calculations of iteration 9


In [5]:
# ---- Storing Data ----
benchmark_df['iteration'] = list(range(NUMBER_OF_ITERATIONS))
benchmark_df['algorithm'] = 'Hybrid Filtering'
benchmark_df['execution_time'] = execution_time

benchmark_df['avg_cpu_usage'] = cpu_usage
benchmark_df['min_cpu_usage'] = min_cpu_usage
benchmark_df['max_cpu_usage'] = max_cpu_usage

benchmark_df['avg_ram_usage'] = ram_usage
benchmark_df['min_ram_usage'] = min_ram_usage
benchmark_df['max_ram_usage'] = max_ram_usage

benchmark_df

| | iteration | algorithm | execution_time | avg_cpu_usage | min_cpu_usage | max_cpu_usage | avg_ram_usage | min_ram_usage | max_ram_usage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Hybrid Filtering | 232.815563 | 11.763836 | 0.0 | 16.625 | 22.977762 | 1.172168 | 44.168416 |
| 1 | 1 | Hybrid Filtering | 250.184384 | 11.688466 | 0.0 | 16.125 | 26.164434 | 1.158836 | 46.662203 |
| 2 | 2 | Hybrid Filtering | 195.960814 | 11.926131 | 0.0 | 14.3375 | 26.070355 | 1.160801 | 49.996925 |
| 3 | 3 | Hybrid Filtering | 194.296961 | 12.172251 | 0.0 | 12.6 | 35.412753 | 1.158965 | 58.004571 |
| 4 | 4 | Hybrid Filtering | 182.30886 | 12.303803 | 0.0 | 16.125 | 35.415534 | 1.165212 | 62.632606 |
| 5 | 5 | Hybrid Filtering | 163.964925 | 12.231969 | 0.0 | 14.3375 | 25.645048 | 1.16138 | 58.433024 |
| 6 | 6 | Hybrid Filtering | 182.320029 | 12.305696 | 0.0 | 12.6 | 35.507852 | 1.162378 | 62.631382 |
| 7 | 7 | Hybrid Filtering | 171.493498 | 12.09204 | 0.0 | 16.125 | 26.301949 | 1.172909 | 59.183823 |
| 8 | 8 | Hybrid Filtering | 170.785071 | 12.257074 | 0.0 | 12.6 | 28.401564 | 1.16708 | 59.017788 |
| 9 | 9 | Hybrid Filtering | 171.382266 | 12.177311 | 0.0 | 12.6 | 22.924738 | 1.161831 | 57.461896 |
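
To read the ten runs at a glance, the execution times can be summarized with the standard library. This is a minimal sketch, assuming the values are copied by hand from the `execution_time` column above (in seconds, as measured by `time.perf_counter`); the variable names are illustrative only.

```python
from statistics import mean, stdev

# execution times (in seconds) copied from the benchmark table above
execution_times = [
    232.815563, 250.184384, 195.960814, 194.296961, 182.30886,
    163.964925, 182.320029, 171.493498, 170.785071, 171.382266,
]

mean_time = mean(execution_times)
std_time = stdev(execution_times)  # sample standard deviation

print(f'mean execution time: {mean_time:.2f} s')
print(f'standard deviation: {std_time:.2f} s')
print(f'fastest / slowest run: {min(execution_times):.2f} s / {max(execution_times):.2f} s')
```

The same aggregation could be done with `benchmark_df['execution_time'].describe()` once the DataFrame is in memory.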


In [6]:
# ---- Exporting Data ----
benchmark_df.to_csv(
    f'{DATASETS_PATH}/benchmarks/hybrid-filtering.csv'
    , index=False
)

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).