<center>
    <h1 id='hybrid-filtering' style='color:#7159c1; font-size:350%'>Hybrid Filtering</h1>
    <i style='font-size:125%'>Combining Content-Based Filtering and Collaborative Filtering with LightFM Algorithm</i>
</center>

> **Topics**

```
- ✨ Collaborative Filtering Problems
- ✨ Hybrid Filtering
- ✨ LightFM Algorithm
- ✨ Hands-on
```

<h1 id='0-collaborative-filtering-problems' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Collaborative Filtering Problems</h1>

Collaborative Filtering has some issues that we have to pay attention, being the `computational cost and time` the first one. As seen on User-Based approach with K-Nearest Neighbors (KNN) Algorithm, a laptop with 12GB of RAM memory got 100% usage of this hardware, even though the dataset's sample being small compared to the total amount of data: only fifteen thousand out of up to twenty-three million. A way to minimize the memory cost and increase the sample size was to use the Item-Based Approach with Singular Value Decomposition (SVD) Algorithm, but again, even getting better recommendations, we stumbled upon with the computation time problem. A sample size of two hundred fifty thousand observations took a considerable time to process compared to the other models made so far.

Another problem is the `available data`. Since our dataset is large and contains observations from a bunch of users and animes, we did not face it off, but it is important to have this issue in mind. Collaborative-Filtering requires a good number of users ratings of each anime in order to better recognizing the users tastes and retrieving more suitable recommendations. Due to this, when the platform has new users or new released animes, the Collaborative Filtering may not work very well with them, since the available data about them is scarce.

The solution for the first problem is literally using a more powerful machine too do the tasks, changing the KNN Algorithm for more performatic ones and/or working with better Hyperparameters Values.

About the second problem, we can go into `Hybrid Filtering`, the best and last Recommendation System Technique we are going to see in this project.

<h1 id='1-hybrid-filtering' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Hybrid Filtering</h1>

`Hybrid Filtering` combines the Content-Based Filtering and the Collaborative Filtering altogether. It normally applies the first technique when there are few users ratings available foor a given anime and smoothly replaces it to the second technique as more users ratings become avaible for the given anime.

To making things clearer, picture a situation where only a few users have rated Dragon Ball Z, in light of the small number of ratings, the technique will use Content-Based Filtering and recommend similar animes to Dragon Ball Z.

In the other hand, a situation where many users have rated Noragami anime, due to the large number of ratings, the technique will use Collaborative Filtering and recommend similar items that similar users have liked.

About the advantages:

> **Content-Based Filtering and Collaborative Filtering** - `since Hybrid Filtering combines the both techniques and switches between them accordingly to the chosen user/item, this technique has the advantages of both of them`;

> **Better Recommendations and Small Bubble** - `consequently, better recommendations are made with a tiny probability of creating a Bubble of Recommendations`.

<br />

Disadvantages-wise:

> **Content-Based Filtering and Collaborativee Filtering** - `it also has the chosen technique to the chosen user/item disadvantages`;

> **Required Users and Items Data** - `it requires that the dataset contains datas about the items and the users, as well as the interactions between them, that is, the users ratings for the items`;

> **Computational Cost and Time** - `also, more computational cost and time is needed for the model`.

<br />

In this notebook, we are going to apply Hybrid Filtering using the LightFM Algorithm. Thus, before heading to the code, let's see how this algorithm works.

<h1 id='2-lightfm-algorithm' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | LightFM Algorithm</h1>

In a few words, `LightFM` is an Algorithm that implements Hybrid Filtering very well, making possible to incorporate both item and user data into Matrix Factorization Algorithms. When calculating the recommendations, the algorithm automatically switches between Content-Based Filtering and Collaborative Filtering depending on the chosen user and item.

When the chosen user does not have so much interactions (ratings) with the items, the model uses Content-Based Filtering for recommendations and only considers the items data. When the chosen user have a considerable number of interactions, the model uses Collaborative Filtering for recommendations and considers both user and item data.

Briefly, the algorithm works like this:

1. Convert the Dataset into LightFM Dataset Format;
2. Calculates the user/items interaction matrix and their respective weights;
3. Train the model taking the interactions into consideration;
4. Makes predictions about the user ratings to items.

<br />

If you want to go further to the theory of LightFM, consider giving a read on these documents:

1. [LightFM - hybrid matrix factorisation on MovieLens (Python, CPU)](https://github.com/recommenders-team/recommenders/blob/main/examples/02_model_hybrid/lightfm_deep_dive.ipynb);

2. [Welcome to LightFM’s documentation!](https://making.lyst.com/lightfm/docs/home.html).

Now, let's go straight to the code!

<h1 id='3-hands=on' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Hands-on</h1>

```
- Settings
- Reading Datasets
- Dropping Variables
- Getting Sample for Model
- Converting Dataset to LightFM Dataset and Calculating Interaction Matrix
- Training the Model
- Recommendations
```

> **OBS.:** since LightFM Package only works with ratings from 0 to 5 and the animes dataset works with ratings from 0 to 10, we have to divide the animes dataset ratings by 2 in order to the LightFM Predictions be accordingly to the range from 0 to 5. After that, we have to multiply the LightFM Predictions by 2 in order to the predictions fit the animes dataset ratings when returning the recommendations.

---

**- Settings**

In [4]:
# -----------------
# ---- Imports ----
# -----------------
import inflect       # pip install inflect
import numpy as np   # pip install numpy
import pandas as pd  # pip install pandas




# -------------------------
# ---- LightFM Imports ----
# -------------------------
#
# pip install lightfm
#
from lightfm import cross_validation
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import auc_score




# -------------------
# ---- Constants ----
# -------------------
DATASETS_PATH = ('./datasets')
INFLECT_ENGINE = (inflect.engine())
SEED = (20240106)

SAMPLE_SIZE = (250_000)
TRAINING_DF_SIZE = (0.80)
VALIDATION_DF_SIZE = (0.20)

NUMBER_RECOMMENDATIONS = (10)
LEARNING_RATE = (0.25)
NUMBER_COMPONENTS = (20) # number of Latent Factors for LightFM
NUMBER_EPOCHS = (20)
ITEM_ALPHA = (1e-6) # regularisation for both users features
USER_ALPHA = (1e-6) # regularisation for both items features




# ------------------
# ---- Settings ----
# ------------------
np.random.seed(SEED)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)




# -------------------
# ---- Functions ----
# -------------------

# \ Description:
#    - returns the map of anime ids using a LightFM Mapping Dictionary.
#
# \ Parameters:
#    - anime_ids_list: List;
#    - anime_ids_map: LightFM Mapping.
#
map_anime_ids = lambda anime_ids_list, anime_ids_map: [anime_ids_map[id] for id in anime_ids_list]

# \ Description:
#    - returns the map of a user id using LightFM Mapping Dictionary.
#
# \ Parameters:
#    - user_id: Integer;
#    - user_ids_map: LightFM Mapping.
#
map_user_id = lambda user_id, user_ids_map: user_ids_map[user_id]

def get_recommendations(predictions, anime_ids_list, animes_df, number_of_recommendations=10):
    """
    \ Description:
        - creates a Pandas DataFrame containing the predicted rating and the anime info;
        - sorts the data by predicted rating in descending order;
        - returns the first 'number_of_recommendations' animes with higher predicted ratings.
    
    \ Parameters:
        - predictions: LightFM Predictions Array;
        - anime_ids_list: List;
        - animes_df: Pandas DataFrame;
        - number_of_recommendations: Integer.
    """
    # ---- Creating DataFrame ----
    recommendations_df = pd.DataFrame(columns=[
        'anime_id', 'title', 'genres', 'is_hentai', 'image_url', 'predicted_rating'
    ])
    
    # ---- Setting Datas into DataFrame ----
    recommendations_df.anime_id = anime_ids_list
    recommendations_df.title = animes_df.title.loc[anime_ids_list].to_list()
    recommendations_df.genres = animes_df.genres.loc[anime_ids_list].to_list()
    recommendations_df.is_hentai = animes_df.is_hentai.loc[anime_ids_list].to_list()
    recommendations_df.image_url = animes_df.image_url.loc[anime_ids_list].to_list()
    recommendations_df.predicted_rating = predictions * 2
    
    # ---- Rounding Ratings to Four Significant Digits ----
    recommendations_df.predicted_rating = recommendations_df.predicted_rating.apply(lambda rating: round(rating, 4))
    
    # ---- Rounding Ratings Higher than 10 ----
    #
    # - it happens due to some tiny decimal ratings shy, such as 10.0718
    #
    recommendations_df.predicted_rating = recommendations_df.predicted_rating.apply(lambda rating: 10.00 if rating > 10 else rating)
    
    # ---- Setting 'anime_id' as Index ----
    recommendations_df.set_index('anime_id', inplace=True)
    
    # ---- Sorting Dataset by 'predicted_rating' ----
    recommendations_df.sort_values(by='predicted_rating', ascending=False, inplace=True)
    
    # ---- Return ----
    return recommendations_df.head(number_of_recommendations)

---

**- Reading Datasets**

In [5]:
# ---- Reading Animes Dataset ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')[
    ['title', 'score', 'genres', 'is_hentai', 'type', 'producers', 'licensors', 'studios', 'source', 'image_url']
]

print(f'- Number of Observations: {animes_df.shape[0]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[0])})')
print(f'- Number of Variables: {animes_df.shape[1]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[1])})')
print('---')

animes_df.head()

- Number of Observations: 23748 (twenty-three thousand, seven hundred and forty-eight)
- Number of Variables: 10 (ten)
---


Unnamed: 0_level_0,title,score,genres,is_hentai,type,producers,licensors,studios,source,image_url
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,cowboy bebop,8.75,"award winning, action, sci-fi",0,tv,bandai visual,"funimation, bandai entertainment",sunrise,original,https://cdn.myanimelist.net/images/anime/4/19644.jpg
5,cowboy bebop tengoku no tobira,8.38,"action, sci-fi",0,movie,"sunrise, bandai visual",sony pictures entertainment,bones,original,https://cdn.myanimelist.net/images/anime/1439/93480.jpg
6,trigun,8.22,"adventure, action, sci-fi",0,tv,victor entertainment,"funimation, geneon entertainment usa",madhouse,manga,https://cdn.myanimelist.net/images/anime/7/20310.jpg
7,witch hunter robin,7.25,"mystery, supernatural, action, drama",0,tv,"dentsu, bandai visual, tv tokyo music, victor entertainment","funimation, bandai entertainment",sunrise,original,https://cdn.myanimelist.net/images/anime/10/19969.jpg
8,bouken ou beet,6.94,"adventure, supernatural, fantasy",0,tv,"dentsu, tv tokyo",illumitoon entertainment,toei animation,manga,https://cdn.myanimelist.net/images/anime/7/21569.jpg


In [6]:
# ---- Reading Ratings Dataset ----
ratings_df = pd.read_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv')

print(f'- Number of Observations: {ratings_df.shape[0]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[0])})')
print(f'- Number of Variables: {ratings_df.shape[1]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[1])})')
print('---')

print(f'- Number of Unique Users: {ratings_df.user_id.nunique()} ({INFLECT_ENGINE.number_to_words(ratings_df.user_id.nunique())})')
print(f'- Number of Unique Animes: {ratings_df.anime_id.nunique()} ({INFLECT_ENGINE.number_to_words(ratings_df.anime_id.nunique())})')
print('---')

ratings_df.head()

- Number of Observations: 23796586 (twenty-three million, seven hundred and ninety-six thousand, five hundred and eighty-six)
- Number of Variables: 5 (five)
---
- Number of Unique Users: 264067 (two hundred and sixty-four thousand and sixty-seven)
- Number of Unique Animes: 16380 (sixteen thousand, three hundred and eighty)
---


Unnamed: 0,user_id,username,anime_id,anime_title,rating
0,1,xinil,21,one piece,9
1,1,xinil,48,hack sign,7
2,1,xinil,320,a kite,5
3,1,xinil,49,aa megami-sama,8
4,1,xinil,304,aa megami-sama movie,8


---

**- Dropping Variables**

In [7]:
# ---- Dropping Variables ----
variables_to_keep = ['user_id', 'anime_id', 'rating']
ratings_df = ratings_df[variables_to_keep]

---

**- Getting Sample for the Model**

In [8]:
# ---- Getting Sample for the Model ----
sample_ratings_df = ratings_df.sample(
    n=SAMPLE_SIZE
    , random_state=SEED
)

sample_ratings_df.rating = sample_ratings_df.rating.apply(lambda rating: rating / 2)

In [9]:
# ---- Getting Sample for the Model ----
#
# since we are going to use 'sample_ratings_df' from now on, let's delete the following datasets:
#    - ratings_df
#
ratings_df = None

---

**- Listing All Unique Users IDs and Unique Animes IDs**

In [10]:
# ---- Listing All Unique Animes IDs ----
unique_user_ids = list(set(sample_ratings_df.user_id.unique()))
unique_anime_ids = list(set(sample_ratings_df.anime_id.unique()))

---

**- Converting Dataset to LightFM Dataset and Calculating Interaction Matrix**

In [11]:
# ---- Converting Dataset to LightFM Dataset ----
#
# - the 'fit' method created mappings of users and items by their IDs
#
lightfm_df = Dataset()
lightfm_df.fit(users=unique_user_ids, items=unique_anime_ids)

number_mapped_users, number_mapped_animes = lightfm_df.interactions_shape()
print(f'- Number of Mapped Users: {number_mapped_users} ({INFLECT_ENGINE.number_to_words(number_mapped_users)})')
print(f'- Number of Mapped Animes: {number_mapped_animes} ({INFLECT_ENGINE.number_to_words(number_mapped_animes)})')

- Number of Mapped Users: 106404 (one hundred and six thousand, four hundred and four)
- Number of Mapped Animes: 8906 (eight thousand, nine hundred and six)


In [12]:
# ---- Calculating Full Interaction Matrix ----
#
# - interactions: matrix containing all users-animes interactions. Its size is a tuple containing
# the animes ids and the users ids. interactions.shape > (number of users ids, number of items ids)
#
# - weights: matrix containing all users-animes interactions weights. Its size is a tuple containing
# all the interactions weights. weights.shape > (number of users ids, number of items data).
#
(interactions, weights) = lightfm_df.build_interactions(sample_ratings_df.values)

In [13]:
# ---- Deleting Datasets ----
#
# since we are going to use 'lightfm_df' from now on, let's delete the following datasets:
#    - sample_ratings_df
#
sample_ratings_df = None

---

**- Training the Model**

In this notebook, the LightFM model will be using the `Logistic (default loss in LightFM)` as the Loss Function. It would be better to use `Weighted Approximate-Rank Pairwise (WARP)` instead, but since [it does not work on Windows Environments](https://github.com/lyst/lightfm/issues/690), we are going stick into Logistic.

If you are interested about how `WARP` works, consider giving a read in the LightFM Documentation: [Learning-to-rank using the WARP loss](https://making.lyst.com/lightfm/docs/examples/warp_loss.html#learning-to-rank-using-the-warp-loss). Briefly, it maximises the rank of positive examples by repeatedly sampling negative examples until a rank violation has been located. This approach is recommended when only positive interactions are present.

In [14]:
# ---- Training the Model ----
hybrid_filtering_recommender = LightFM(
    loss='logistic'
    , k=NUMBER_RECOMMENDATIONS
    , no_components=NUMBER_COMPONENTS
    , learning_rate=LEARNING_RATE
    , user_alpha=USER_ALPHA
    , item_alpha=ITEM_ALPHA
    , random_state=SEED
)

hybrid_filtering_recommender.fit(
    interactions=interactions
    , epochs=NUMBER_EPOCHS
)

print('- Model Parameters:')
hybrid_filtering_recommender.get_params()

- Model Parameters:


{'loss': 'logistic',
 'learning_schedule': 'adagrad',
 'no_components': 20,
 'learning_rate': 0.25,
 'k': 10,
 'n': 10,
 'rho': 0.95,
 'epsilon': 1e-06,
 'max_sampled': 10,
 'item_alpha': 1e-06,
 'user_alpha': 1e-06,
 'random_state': RandomState(MT19937) at 0x154A923FA40}

In [15]:
# ---- Evaluation ----
auc_score(
    hybrid_filtering_recommender
    , interactions
    , num_threads=4
).mean()

0.9031274

The AUC Score indicates the probability that a randomly chosen positive example has a higher score than a randomly chosen negative example, being 1.0 a perfect score. It is commonly evaluated by the followings ranges:

> **AUC-ROC = 0.5** - `No discrimination (random performance)`;

> **AUC-ROC 0.5 to 0.6** - `Poor discrimination`;

> **AUC-ROC 0.6 to 0.7** - `Fair discrimination`;

> **AUC-ROC 0.7 to 0.8** - `Good discrimination`;

> **AUC-ROC 0.8 to 0.9** - `Very good discrimination`;

> **AUC-ROC > 0.9** - `Excellent discrimination`.

<br />

Then, we can say that our Hybrid Filtering Recommendation Model is doind a great job!!

---

**- Recommendations**

In [16]:
# ---- Recommendations ----
#
# - choosing user id;
# - mapping all anime_ids,
#
mapped_user_id = map_user_id(
  user_id=388_458
   , user_ids_map=lightfm_df.mapping()[0]
)

mapped_anime_ids = map_anime_ids(
    anime_ids_list=unique_anime_ids
    , anime_ids_map=lightfm_df.mapping()[2]
)

In [17]:
# ---- Recommendations ----
predictions = hybrid_filtering_recommender.predict(mapped_anime_ids, mapped_anime_ids)

get_recommendations(
    predictions= predictions
    , anime_ids_list=unique_anime_ids
    , animes_df=animes_df
    , number_of_recommendations=NUMBER_RECOMMENDATIONS
)

Unnamed: 0_level_0,title,genres,is_hentai,image_url,predicted_rating
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1575,code geass hangyaku no lelouch,"award winning, action, sci-fi, drama",0,https://cdn.myanimelist.net/images/anime/1032/135088.jpg,10.0
223,dragon ball,"adventure, comedy, fantasy, action",0,https://cdn.myanimelist.net/images/anime/1887/92364.jpg,10.0
269,bleach,"adventure, fantasy, action",0,https://cdn.myanimelist.net/images/anime/3/40451.jpg,10.0
199,sen to chihiro no kamikakushi,"adventure, supernatural, award winning",0,https://cdn.myanimelist.net/images/anime/6/79597.jpg,10.0
4898,kuroshitsuji,"mystery, comedy, supernatural, action",0,https://cdn.myanimelist.net/images/anime/5/27013.jpg,10.0
202,wolf s rain,"drama, fantasy, action, mystery, adventure, sci-fi",0,https://cdn.myanimelist.net/images/anime/5/59403.jpg,10.0
7724,shiki,"horror, suspense, supernatural, mystery",0,https://cdn.myanimelist.net/images/anime/1531/119165.jpg,10.0
934,higurashi no naku koro ni,"horror, suspense, supernatural, mystery",0,https://cdn.myanimelist.net/images/anime/12/19634.jpg,10.0
205,samurai champloo,"adventure, comedy, action",0,https://cdn.myanimelist.net/images/anime/1375/121599.jpg,10.0
431,howl no ugoku shiro,"drama, fantasy, award winning, romance, adventure",0,https://cdn.myanimelist.net/images/anime/5/75810.jpg,10.0


---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub:** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).