<center>
    <h1 id='hybrid-filtering' style='color:#7159c1; font-size:350%'>Hybrid Filtering</h1>
    <i style='font-size:125%'>Combining Content-Based Filtering and Collaborative Filtering with LightFM Algorithm</i>
</center>

> **Topics**

```
- ✨ Collaborative Filtering Problems
- ✨ Hybrid Filtering
- ✨ LightFM Algorithm
- ✨ Hands-on
```

<h1 id='0-collaborative-filtering-problems' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Collaborative Filtering Problems</h1>

Collaborative Filtering has some issues that deserve attention, the first being `computational cost and time`. As seen in the User-Based approach with the K-Nearest Neighbors (KNN) Algorithm, a laptop with 12GB of RAM reached 100% memory usage even though the sample was small compared to the full dataset: only fifteen thousand out of roughly twenty-three million observations. One way to reduce the memory cost and increase the sample size was to use the Item-Based approach with the Singular Value Decomposition (SVD) Algorithm. Even though it produced better recommendations, we then ran into the computation time problem: a sample of two hundred fifty thousand observations took considerably longer to process than the other models built so far.

Another problem is the `available data`. Since our dataset is large and contains observations from many users and animes, we did not face this issue, but it is important to keep it in mind. Collaborative Filtering requires a good number of user ratings for each anime in order to recognize the users' tastes and retrieve suitable recommendations. Because of this, when the platform has new users or newly released animes, Collaborative Filtering may not work well for them, since the available data about them is scarce. This is commonly known as the cold-start problem.

The solution to the first problem is, quite literally, to use a more powerful machine for the tasks, to replace the KNN Algorithm with a more efficient one, and/or to tune the Hyperparameter values.

As for the second problem, we can turn to `Hybrid Filtering`, the best and last Recommendation System technique we are going to see in this project.

<h1 id='1-hybrid-filtering' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Hybrid Filtering</h1>

`Hybrid Filtering` combines Content-Based Filtering and Collaborative Filtering. It normally applies the first technique when few user ratings are available for a given anime and smoothly shifts to the second as more user ratings become available for that anime.

To make things clearer, picture a situation where only a few users have rated Dragon Ball Z. Given the small number of ratings, the technique will use Content-Based Filtering and recommend animes similar to Dragon Ball Z.

On the other hand, picture a situation where many users have rated Noragami. Given the large number of ratings, the technique will use Collaborative Filtering and recommend items that similar users have liked.
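
One simple way to picture this smooth hand-over is a weighted blend whose weight depends on how many ratings an anime has. The sketch below is illustrative only: the function names `blend_weight` and `hybrid_score`, the damping constant `k`, and the example scores are assumptions made for this example, not part of this project or of any library.

```python
def blend_weight(number_ratings: int, k: float = 50.0) -> float:
    """Share of the final score given to Collaborative Filtering.

    Returns 0.0 with no ratings and approaches 1.0 as ratings accumulate.
    `k` (hypothetical) is the number of ratings at which both techniques
    contribute equally.
    """
    if number_ratings < 0:
        raise ValueError('number_ratings must be non-negative')
    return number_ratings / (number_ratings + k)


def hybrid_score(content_score: float, collaborative_score: float,
                 number_ratings: int, k: float = 50.0) -> float:
    """Blend both scores according to the amount of rating data available."""
    weight = blend_weight(number_ratings, k)
    return (1.0 - weight) * content_score + weight * collaborative_score


# an anime with few ratings leans on its content-based score
few_ratings_score = hybrid_score(content_score=0.9, collaborative_score=0.1, number_ratings=5)

# an anime with many ratings leans on its collaborative score
many_ratings_score = hybrid_score(content_score=0.9, collaborative_score=0.1, number_ratings=5000)
```

With these made-up numbers, the first score stays close to the content-based value and the second one close to the collaborative value.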

About the advantages:

> **Content-Based Filtering and Collaborative Filtering** - `since Hybrid Filtering combines both techniques and switches between them according to the chosen user/item, it has the advantages of both`;

> **Better Recommendations and Small Bubble** - `consequently, better recommendations are made, with a low probability of creating a Bubble of Recommendations`.

<br />

Disadvantages-wise:

> **Content-Based Filtering and Collaborative Filtering** - `it also inherits the disadvantages of whichever technique is applied to the chosen user/item`;

> **Required Users and Items Data** - `it requires that the dataset contains data about the items and the users, as well as the interactions between them, that is, the users' ratings for the items`;

> **Computational Cost and Time** - `also, more computational cost and time is needed for the model`.

<br />

In this notebook, we are going to apply Hybrid Filtering using the LightFM Algorithm. Thus, before heading to the code, let's see how this algorithm works.

<h1 id='2-lightfm-algorithm' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | LightFM Algorithm</h1>

In a few words, `LightFM` is an algorithm that implements Hybrid Filtering by making it possible to incorporate both item and user data (features) into a Matrix Factorization model. It does not run two separate recommenders: each user and each item is represented by the sum of the latent vectors of its features, so the same model behaves like Content-Based Filtering or like Collaborative Filtering depending on how much interaction data exists for the chosen user and item.

When the chosen user or item has few interactions (ratings), the recommendations rely mostly on the feature data, as in Content-Based Filtering. When there is a considerable number of interactions, the latent vectors learned for that specific user and item dominate, as in Collaborative Filtering. Note that when no user or item features are passed to the model, LightFM reduces to a plain Collaborative Filtering Matrix Factorization model.
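
As the LightFM documentation describes, a user or an item is represented by the sum of the embeddings of its features, and the score is based on the dot product of the two representations. The sketch below shows that idea in plain Python with made-up two-dimensional embeddings. Every name and number in it is an assumption chosen for illustration, and the bias terms are left out for brevity.

```python
def sum_embeddings(features, embeddings):
    """Representation of a user/item: element-wise sum of its features' embeddings."""
    dimension = len(next(iter(embeddings.values())))
    total = [0.0] * dimension
    for feature in features:
        for index, value in enumerate(embeddings[feature]):
            total[index] += value
    return total


def score(user_vector, item_vector):
    """Dot product between the user and item representations."""
    return sum(u * i for u, i in zip(user_vector, item_vector))


# hypothetical item feature embeddings: one per genre, plus one per anime ID
item_embeddings = {
    'genre:action': [1.0, 0.0],
    'genre:drama': [0.0, 1.0],
    'anime:noragami': [0.5, 0.25],
}

# hypothetical user embedding: only the ID feature
user_embeddings = {'user:1': [1.0, 0.5]}
user_vector = sum_embeddings(['user:1'], user_embeddings)

# well-known anime: its own ID embedding plus its genre embedding
noragami_vector = sum_embeddings(['anime:noragami', 'genre:action'], item_embeddings)

# brand-new anime: no learned ID embedding yet, so only its genre contributes
new_anime_vector = sum_embeddings(['genre:action'], item_embeddings)

noragami_score = score(user_vector, noragami_vector)    # 1.625
new_anime_score = score(user_vector, new_anime_vector)  # 1.0
```

The new anime can still be scored through its genre alone, which is the content-based side of the model, while the well-known anime also benefits from the embedding learned from its own interactions.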

If you want to dig deeper into the theory behind LightFM, consider reading these documents:

1. [LightFM - hybrid matrix factorisation on MovieLens (Python, CPU)](https://github.com/recommenders-team/recommenders/blob/main/examples/02_model_hybrid/lightfm_deep_dive.ipynb);

2. [Welcome to LightFM’s documentation!](https://making.lyst.com/lightfm/docs/home.html).

Now, let's go straight to the code!

<h1 id='3-hands=on' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>✨ | Hands-on</h1>

```
- Settings
- Reading Datasets
- Dropping Variables
- Getting Sample for Model
- Converting Dataset to LightFM Dataset and Calculating Interaction Matrix
- Splitting Dataset into Training and Validation
- Training the Model
- Recommendations
```

---

**- Settings**

In [2]:
# -----------------
# ---- Imports ----
# -----------------
import inflect       # pip install inflect
import numpy as np   # pip install numpy
import pandas as pd  # pip install pandas




# -------------------------
# ---- LightFM Imports ----
# -------------------------
#
# pip install lightfm
#
from lightfm import cross_validation
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import precision_at_k as lightfm_precision_at_k
from lightfm.evaluation import recall_at_k as light_fm_recall_at_k




# -------------------
# ---- Constants ----
# -------------------
DATASETS_PATH = ('./datasets')
INFLECT_ENGINE = (inflect.engine())
SEED = (20240110)

TRAINING_DF_SIZE = (0.80)
VALIDATION_DF_SIZE = (0.20)

LEARNING_RATE = (0.25)
NUMBER_COMPONENTS = (20) # number of Latent Factors for LightFM
NUMBER_EPOCHS = (20)




# ------------------
# ---- Settings ----
# ------------------
np.random.seed(SEED)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

---

**- Reading Datasets**

In [3]:
# ---- Reading Animes Dataset ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv', index_col='id')[
    ['title', 'score', 'genres', 'is_hentai', 'type', 'producers', 'licensors', 'studios', 'source', 'image_url']
]

print(f'- Number of Observations: {animes_df.shape[0]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[0])})')
print(f'- Number of Variables: {animes_df.shape[1]} ({INFLECT_ENGINE.number_to_words(animes_df.shape[1])})')
print('---')

animes_df.head()

- Number of Observations: 23748 (twenty-three thousand, seven hundred and forty-eight)
- Number of Variables: 10 (ten)
---


id,title,score,genres,is_hentai,type,producers,licensors,studios,source,image_url
1,cowboy bebop,8.75,"award winning, action, sci-fi",0,tv,bandai visual,"funimation, bandai entertainment",sunrise,original,https://cdn.myanimelist.net/images/anime/4/19644.jpg
5,cowboy bebop tengoku no tobira,8.38,"action, sci-fi",0,movie,"sunrise, bandai visual",sony pictures entertainment,bones,original,https://cdn.myanimelist.net/images/anime/1439/93480.jpg
6,trigun,8.22,"adventure, action, sci-fi",0,tv,victor entertainment,"funimation, geneon entertainment usa",madhouse,manga,https://cdn.myanimelist.net/images/anime/7/20310.jpg
7,witch hunter robin,7.25,"mystery, supernatural, action, drama",0,tv,"dentsu, bandai visual, tv tokyo music, victor entertainment","funimation, bandai entertainment",sunrise,original,https://cdn.myanimelist.net/images/anime/10/19969.jpg
8,bouken ou beet,6.94,"adventure, supernatural, fantasy",0,tv,"dentsu, tv tokyo",illumitoon entertainment,toei animation,manga,https://cdn.myanimelist.net/images/anime/7/21569.jpg


In [4]:
# ---- Reading Ratings Dataset ----
ratings_df = pd.read_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv')

print(f'- Number of Observations: {ratings_df.shape[0]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[0])})')
print(f'- Number of Variables: {ratings_df.shape[1]} ({INFLECT_ENGINE.number_to_words(ratings_df.shape[1])})')
print('---')

print(f'- Number of Unique Users: {ratings_df.user_id.nunique()} ({INFLECT_ENGINE.number_to_words(ratings_df.user_id.nunique())})')
print(f'- Number of Unique Animes: {ratings_df.anime_id.nunique()} ({INFLECT_ENGINE.number_to_words(ratings_df.anime_id.nunique())})')
print('---')

ratings_df.head()

- Number of Observations: 23796586 (twenty-three million, seven hundred and ninety-six thousand, five hundred and eighty-six)
- Number of Variables: 5 (five)
---
- Number of Unique Users: 264067 (two hundred and sixty-four thousand and sixty-seven)
- Number of Unique Animes: 16380 (sixteen thousand, three hundred and eighty)
---


,user_id,username,anime_id,anime_title,rating
0,1,xinil,21,one piece,9
1,1,xinil,48,hack sign,7
2,1,xinil,320,a kite,5
3,1,xinil,49,aa megami-sama,8
4,1,xinil,304,aa megami-sama movie,8


---

**- Dropping Variables**

In [5]:
# ---- Dropping Variables ----
variables_to_keep = ['user_id', 'anime_id', 'rating']
ratings_df = ratings_df[variables_to_keep]

---

**- Converting Dataset to LightFM Dataset and Calculating Interaction Matrix**

In [7]:
# ---- Converting Dataset to LightFM Dataset ----
lightfm_df = Dataset()

# the 'fit' method creates mappings of users and items by their IDs
lightfm_df.fit(
    ratings_df.user_id
    , ratings_df.anime_id
)

number_mapped_users, number_mapped_animes = lightfm_df.interactions_shape()
print(f'- Number of Mapped Users: {number_mapped_users} ({INFLECT_ENGINE.number_to_words(number_mapped_users)})')
print(f'- Number of Mapped Animes: {number_mapped_animes} ({INFLECT_ENGINE.number_to_words(number_mapped_animes)})')

- Number of Mapped Users: 264067 (two hundred and sixty-four thousand and sixty-seven)
- Number of Mapped Animes: 16380 (sixteen thousand, three hundred and eighty)
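
Conceptually, the mappings created by `fit` translate the original IDs into contiguous row and column indices of the interaction matrix. A rough plain-Python equivalent could look like the sketch below. It assumes IDs are indexed in order of first appearance, and the helper `build_id_mapping` and the toy IDs are hypothetical, not part of LightFM.

```python
def build_id_mapping(ids):
    """Map each distinct ID to a contiguous index, in order of first appearance."""
    mapping = {}
    for identifier in ids:
        mapping.setdefault(identifier, len(mapping))
    return mapping


# toy data: six ratings made by three users on four animes
user_ids = [1, 1, 1, 7, 7, 42]
anime_ids = [21, 48, 320, 21, 49, 48]

user_id_mapping = build_id_mapping(user_ids)     # {1: 0, 7: 1, 42: 2}
anime_id_mapping = build_id_mapping(anime_ids)   # {21: 0, 48: 1, 320: 2, 49: 3}

# (number of users, number of animes)
interactions_shape = (len(user_id_mapping), len(anime_id_mapping))
```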


In [8]:
# ---- Calculating Full Interaction Matrix ----
#
# - interactions: sparse matrix containing all users-animes interactions, with one row per
# user and one column per anime. interactions.shape > (number of users ids, number of animes ids)
#
# - weights: sparse matrix with the same shape as interactions, containing the weight (here,
# the rating) of each interaction. weights.shape > (number of users ids, number of animes ids)
#
(interactions, weights) = lightfm_df.build_interactions(ratings_df.values)
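
To make the difference between the two matrices concrete, the sketch below builds both from three toy ratings, using dictionaries keyed by `(row, column)` as a stand-in for sparse matrices. The variable names and the toy data are assumptions made for this example.

```python
# toy ratings: (user_id, anime_id, rating)
ratings = [
    (1, 21, 9),
    (1, 48, 7),
    (7, 21, 5),
]

# hypothetical ID-to-index mappings, as produced by a previous fitting step
user_index = {1: 0, 7: 1}
anime_index = {21: 0, 48: 1}

interactions_sketch = {}  # 1 wherever a user has rated an anime
weights_sketch = {}       # the rating given in that same cell

for user_id, anime_id, rating in ratings:
    cell = (user_index[user_id], anime_index[anime_id])
    interactions_sketch[cell] = 1
    weights_sketch[cell] = float(rating)
```

Both structures share the same cells. The first only records that an interaction happened, while the second records how strong it was.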

---

**- Splitting Dataset into Training and Validation**

In [9]:
# ---- Splitting Dataset into Training and Validation ----
#
# LightFM works slightly differently compared to other packages, as it expects the training and
# test sets to have the same dimensions. Therefore, the conventional train-test split will not work.
#
# The package includes the cross_validation.random_train_test_split method, which splits the
# interaction data into two disjoint training and test sets.
#
# However, note that it does not validate the interactions in the test set to guarantee that all
# items and users have historical interactions in the training set. Therefore, this may result in a
# partial cold-start problem in the test set.
#
training_lightfm_df, validation_lightfm_df = cross_validation.random_train_test_split(
    interactions=interactions
    , test_percentage=VALIDATION_DF_SIZE
    , random_state=SEED
)

print(f'- Number of Users (rows) of Training Dataset: {(training_lightfm_df.shape[0])} ({INFLECT_ENGINE.number_to_words(training_lightfm_df.shape[0])})')
print(f'- Number of Animes (columns) of Training Dataset: {(training_lightfm_df.shape[1])} ({INFLECT_ENGINE.number_to_words(training_lightfm_df.shape[1])})')
print('---')

print(f'- Number of Users (rows) of Validation Dataset: {(validation_lightfm_df.shape[0])} ({INFLECT_ENGINE.number_to_words(validation_lightfm_df.shape[0])})')
print(f'- Number of Animes (columns) of Validation Dataset: {(validation_lightfm_df.shape[1])} ({INFLECT_ENGINE.number_to_words(validation_lightfm_df.shape[1])})')

- Number of Users (rows) of Training Dataset: 264067 (two hundred and sixty-four thousand and sixty-seven)
- Number of Animes (columns) of Training Dataset: 16380 (sixteen thousand, three hundred and eighty)
---
- Number of Users (rows) of Validation Dataset: 264067 (two hundred and sixty-four thousand and sixty-seven)
- Number of Animes (columns) of Validation Dataset: 16380 (sixteen thousand, three hundred and eighty)
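
Both sets keep the full users-by-animes shape because the split moves individual interactions, not whole rows or columns. The sketch below illustrates that idea on a toy set of `(row, column)` cells. The function `split_interactions` is a hypothetical stand-in written for this example, not LightFM's implementation.

```python
import random


def split_interactions(cells, test_percentage, seed):
    """Randomly assign each interaction (row, column) to training or validation."""
    shuffled = sorted(cells)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    number_validation = int(round(test_percentage * len(shuffled)))
    cutoff = len(shuffled) - number_validation
    return set(shuffled[:cutoff]), set(shuffled[cutoff:])


# toy data: 10 users who each rated the same 5 animes (50 interactions)
cells = {(user, anime) for user in range(10) for anime in range(5)}

training_cells, validation_cells = split_interactions(
    cells
    , test_percentage=0.20
    , seed=20240110
)
```

Because cells are assigned at random, a user or an anime can end up with all of its interactions in the validation set, which is the partial cold-start situation mentioned in the comments of the cell above.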


---

**- Training the Model**

In this notebook, the LightFM model uses `Weighted Approximate-Rank Pairwise (WARP)` as the loss function. Further explanation on the topic can be found in the LightFM Documentation: [Learning-to-rank using the WARP loss](https://making.lyst.com/lightfm/docs/examples/warp_loss.html#learning-to-rank-using-the-warp-loss).

In general, it maximises the rank of positive examples by repeatedly sampling negative examples until a rank violation has been located. This approach is recommended when only positive interactions are present.
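
The sampling step can be pictured with the simplified sketch below. It assumes a fixed table of scores and treats every item other than the positive one as a negative. The function `warp_sample`, the margin, the log weighting, and the example scores are illustrative assumptions and not LightFM's actual implementation.

```python
import math
import random


def warp_sample(scores, positive_item, max_sampled=10, margin=1.0, rng=None):
    """Sample negatives until one violates the ranking.

    Returns (violating_item, update_weight), or None when no violation is
    found within `max_sampled` tries. Finding a violation quickly suggests
    the positive item is ranked low, so the update weight is larger.
    """
    rng = rng or random.Random(0)
    negatives = [item for item in scores if item != positive_item]
    for number_sampled in range(1, max_sampled + 1):
        candidate = rng.choice(negatives)
        # rank violation: the negative scores within `margin` of the positive
        if scores[candidate] > scores[positive_item] - margin:
            estimated_rank = (len(scores) - 1) // number_sampled
            return candidate, math.log(max(1.0, estimated_rank))
    return None


# hypothetical model scores for one user
well_ranked = {'one piece': 5.0, 'trigun': 1.0, 'a kite': 0.5}
badly_ranked = {'one piece': 0.0, 'trigun': 1.0, 'a kite': 2.0, 'noragami': 3.0, 'hack sign': 4.0}

no_violation = warp_sample(well_ranked, positive_item='one piece')  # None: nothing to fix
violation = warp_sample(badly_ranked, positive_item='one piece')    # found on the first try
```

In a real training loop, a found violation would trigger a gradient update that pushes the positive item above the sampled negative.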

In [None]:
# ---- Training the Model ----
hybrid_filtering_recommender = LightFM(
    loss='warp'
    , no_components=NUMBER_COMPONENTS # dimensionality of the latent vectors
    , learning_rate=LEARNING_RATE
    , random_state=SEED
)

hybrid_filtering_recommender.fit(
    interactions=training_lightfm_df
    , epochs=NUMBER_EPOCHS
)
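
The settings cell imports `precision_at_k` and `recall_at_k` for evaluating the trained model. As a reminder of what these two metrics measure for a single user, here is a minimal plain-Python sketch. The helper `precision_and_recall_at_k` and the toy ranking are assumptions made for this example and are independent of LightFM.

```python
def precision_and_recall_at_k(ranked_items, relevant_items, k):
    """Precision@k and recall@k for one user.

    - precision@k: share of the top-k recommendations that are relevant
    - recall@k: share of the relevant items that appear in the top-k
    """
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# hypothetical ranking produced by a model, best first
ranked = ['noragami', 'trigun', 'cowboy bebop', 'one piece', 'witch hunter robin']

# hypothetical animes the user actually liked in the validation set
relevant = {'trigun', 'one piece', 'bouken ou beet'}

precision, recall = precision_and_recall_at_k(ranked, relevant, k=4)
# two hits in the top four: precision = 2 / 4, recall = 2 / 3
```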

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).