# Content-Based Recommendation
The LensKit model in `algorithms/genre.py` applies content-based filtering: it recommends items by comparing each anime's content with the anime a user has already rated. The content of each item is represented as a set of descriptors. Here the descriptors are the anime's genres, and the user signal is explicit feedback in the form of ratings on a 1-10 scale.

## Run the first few cells below if running on Google Colab

In [None]:
!pip install lenskit~=0.14

Collecting lenskit~=0.14
  Downloading lenskit-0.14.4-py3-none-any.whl (74 kB)
Collecting binpickle>=0.3.2 (from lenskit~=0.14)
  Downloading binpickle-0.3.4-py3-none-any.whl (13 kB)
Collecting seedbank>=0.1.0 (from lenskit~=0.14)
  Downloading seedbank-0.1.3-py3-none-any.whl (8.5 kB)
Collecting csr>=0.3.1 (from lenskit~=0.14)
  Downloading csr-0.5.1-py3-none-any.whl (25 kB)
Collecting anyconfig==0.13.* (from seedbank>=0.1.0->lenskit~=0.14)
  Downloading anyconfig-0.13.0-py2.py3-none-any.whl (87 kB)
Installing collected packages: binpickle, anyconfig, seedbank, csr, lenskit
Successfully installed anyconfig-0.13.0 binpickle-0.3.4 csr-0.5.1 lenskit-0.14.4 seedbank-0.1.3


In [None]:
from google.colab import drive

# connect to Drive files
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
%cd "/content/drive/MyDrive/DSCI 641: Project"

/content/drive/MyDrive/DSCI 641: Project


## Import Libraries

In [None]:
# Built-in Python libraries
import pickle
from pathlib import Path

In [None]:
# PyData libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# LensKit libraries
import lenskit.crossfold as xf
from lenskit import Recommender, topn, batch, util, Predictor
from lenskit.metrics.topn import ndcg
from csr import CSR

In [None]:
from algorithms import genre

In [None]:
path = Path(".")

## Read in Dataset

### Anime
* Useful for looking up information about an anime (name, genre, number of episodes, etc.) given its ID
* Preprocessing: rename the `anime_id` column to `item` and filter out anime whose type is `Music`

In [None]:
anime_df = (
    pd
    .read_csv(path / "dataset/2017" / "anime.csv")
    .rename(columns={"anime_id": "item"})
    .query("type != 'Music'")
)
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11806 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   item      11806 non-null  int64  
 1   name      11806 non-null  object 
 2   genre     11744 non-null  object 
 3   type      11781 non-null  object 
 4   episodes  11806 non-null  object 
 5   rating    11576 non-null  float64
 6   members   11806 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 737.9+ KB


### Ratings

In [None]:
train = (
    pd
    .read_csv("data/train.csv")
    .drop(columns=["Unnamed: 0"])
    .astype({"item": "int32", "user": "int32", "rating": "int32"})
)

test = (
    pd
    .read_csv("data/eval.csv")
    .drop(columns=["Unnamed: 0"])
    .astype({"item": "int32", "user": "int32", "rating": "int32"})
)

## Calculate TF-IDF

From the first few lectures, we learned about Term Frequency (TF) and Inverse Document Frequency (IDF). TF-IDF measures how distinctive a word is by comparing the number of times the word appears in a document with the number of documents in which the word appears.

This allows the system to quantify the importance of each genre for an anime relative to the entire dataset. By highlighting the genres that are distinctive for each anime, TF-IDF can improve the recommendations: anime that share high TF-IDF scores for the same genres are likely to be more similar, which allows for more targeted recommendations. On the other hand, common genres that appear in most anime are not very useful for distinguishing between anime, so their TF-IDF scores will be lower.

For the TF-IDF vectorizer, I set the minimum document frequency (`min_df`) to 5, meaning that a term must appear in at least 5 documents to be considered. This value is experimental and could vary, but the point is to filter out rare genres or quirks in the data so that the model generalizes better. `strip_accents` is used to help standardize the text.
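
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a hypothetical four-title catalogue. It uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is the default behavior documented for scikit-learn's `TfidfVectorizer`. The titles, genres, and the helper `tfidf_vector` are made up for illustration and are not part of the project code.

```python
import math
from collections import Counter

# Hypothetical toy catalogue: "comedy" is in every title, "sports" in only one.
toy_genres = {
    "A": ["comedy", "school", "sports"],
    "B": ["comedy", "school"],
    "C": ["comedy", "romance"],
    "D": ["comedy", "romance", "school"],
}

n_docs = len(toy_genres)
# Document frequency: the number of titles that carry each genre.
doc_freq = Counter(g for genres in toy_genres.values() for g in set(genres))

# Smoothed IDF, ln((1 + n) / (1 + df)) + 1, so a genre found everywhere gets weight 1.
idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}


def tfidf_vector(genres):
    """Return an L2-normalized {genre: weight} mapping for one title."""
    counts = Counter(genres)
    raw = {g: counts[g] * idf[g] for g in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {g: w / norm for g, w in raw.items()}


weights_a = tfidf_vector(toy_genres["A"])
for genre_name, weight in sorted(weights_a.items(), key=lambda kv: -kv[1]):
    print(f"{genre_name:8s} {weight:.3f}")
```

In this toy setup the rare genre (`sports`) ends up with the largest weight for title A and the ubiquitous one (`comedy`) with the smallest, which is the effect described above.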

In [None]:
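def create_genre_score_matrix(anime_df: pd.DataFrame) -> CSR:
    """Build a TF-IDF matrix with one row per anime from its genre string."""
    anime_df = anime_df.copy(deep=True)
    tfv = TfidfVectorizer(
        min_df=5,                 # ignore terms found in fewer than 5 anime
        strip_accents="unicode",  # normalize accented characters
        token_pattern=r"\w{1,}",  # a token is any run of word characters
        ngram_range=(1, 5),       # single tokens up to 5-token sequences
        stop_words="english"
    )

    # Anime without a genre get an empty string so the vectorizer can process them
    anime_df.genre = anime_df.genre.fillna("")
    genres = anime_df.genre.str.split(",").astype(str)

    # Note: fit_transform returns a SciPy sparse matrix (not a `csr.CSR`)
    # of shape (n_anime, n_terms)
    tfv_matrix = tfv.fit_transform(genres)
    return tfv_matrix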

In [None]:
def calculate_cosine_sim(
    anime_df: pd.DataFrame,
    tfv_matrix: CSR
) -> pd.DataFrame:
    cos_sim = cosine_similarity(tfv_matrix, tfv_matrix)
    return pd.DataFrame(
        cos_sim,
        index=anime_df.item,
        columns=anime_df.item
    )

In [None]:
tfv_matrix = create_genre_score_matrix(anime_df=anime_df)
genre_scores = calculate_cosine_sim(anime_df=anime_df, tfv_matrix=tfv_matrix)
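
As a rough illustration of what the item-by-item similarity matrix encodes, the sketch below computes cosine similarity between a few hypothetical genre weight vectors and ranks the neighbors of one item. The item IDs, weights, and the helpers `cosine` and `most_similar` are assumptions for this example only; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

# Hypothetical genre weight vectors keyed by item id.
toy_vectors = {
    101: {"comedy": 1.0, "school": 1.2, "sports": 1.9},
    102: {"comedy": 1.0, "school": 1.2, "sports": 1.9},  # same genres as 101
    103: {"comedy": 1.0, "romance": 1.5},
    104: {"mystery": 1.7, "school": 1.2},
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(item, k=2):
    """Rank the other items by similarity to `item`, highest first."""
    scores = {
        other: cosine(toy_vectors[item], vec)
        for other, vec in toy_vectors.items()
        if other != item
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]


print(most_similar(101))
```

Items with identical genre profiles get a similarity of 1, and items with no genre in common get 0. On the real matrix, a comparable lookup would be something like `genre_scores.loc[11771].nlargest(5)`.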

## Create and Fit Model
For each user, the model recommends 20 items. In our use case, a user enters some anime they like and we display the recommendations on a webpage, so 20 items fit neatly in a 4-by-5 grid.
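
A minimal sketch of the grid layout idea, assuming the recommendations arrive as a flat list of 20 item IDs. The helper `to_grid` and the placeholder IDs are hypothetical and not part of the project code.

```python
def to_grid(items, rows=4, cols=5):
    """Split a flat recommendation list into `rows` lists of `cols` items."""
    if len(items) != rows * cols:
        raise ValueError(f"expected {rows * cols} items, got {len(items)}")
    return [items[r * cols:(r + 1) * cols] for r in range(rows)]


# Hypothetical item ids standing in for one user's 20 recommendations.
recommended = list(range(1, 21))
grid = to_grid(recommended)
for row in grid:
    print(row)
```

Each inner list corresponds to one row of the page, with the highest-ranked items in the first row.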

In [None]:
genre_model = genre.GenreRec()

In [None]:
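def fit_the_model(
    algo: Predictor,
    train: pd.DataFrame,
    genre_df: pd.DataFrame = None
):
    """Clone `algo`, wrap it as a top-N recommender, and fit it on `train`."""
    # Clone so the original, unfitted algorithm object is left untouched
    fittable = util.clone(algo)
    # Recommender.adapt wraps a predictor so it can produce top-N lists
    fittable = Recommender.adapt(fittable)
    # The genre similarity matrix is passed to fit() as a keyword argument
    fittable.fit(train, genre=genre_df)
    return fittable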

In [None]:
fittable = fit_the_model(
    algo=genre_model,
    train=train,
    genre_df=genre_scores
)
users = test.user.unique()
recs = batch.recommend(fittable, users, 20)
recs["algorithm"] = "content-based"

### Experiment with One User
* Here we have one user who rated item 11771, `Kuroko no Basket` (English title: `Kuroko's Basketball`), very highly but has not rated items 12189 or 16417 yet (they appear with a rating of -1). Item 11771 is a popular sports anime about high schoolers playing basketball. The model's top recommendations include the 2nd and 3rd seasons of the same show, which makes sense given how highly the user rated it. Other sports anime are recommended as well, such as `Diamond no Ace`, a baseball anime, and `Prince of Tennis: The National Tournament Finals`, a tennis anime. This suggests the model works as intended for this user.

In [None]:
train.query("user == 2")

Unnamed: 0,item,user,rating
603650,11771,2,10
987026,12189,2,-1
996527,16417,2,-1


In [None]:
anime_df.query("item.isin([11771, 12189, 16417])")

Unnamed: 0,item,name,genre,type,episodes,rating,members
122,11771,Kuroko no Basket,"Comedy, School, Shounen, Sports",TV,25,8.46,338315
335,12189,Hyouka,"Mystery, School, Slice of Life",TV,22,8.17,372246
2133,16417,Tamako Market,"Comedy, Slice of Life",TV,12,7.35,128529


In [None]:
(
    fittable
    .predict_for_user(user=2, items=None)
    .nlargest(10)
    .to_frame("score")
    .merge(anime_df, on="item", how="left")
)

Unnamed: 0,item,score,name,genre,type,episodes,rating,members
0,24415,3.188425,Kuroko no Basket 3rd Season,"Comedy, School, Shounen, Sports",TV,25,8.62,184525
1,16894,3.188425,Kuroko no Basket 2nd Season,"Comedy, School, Shounen, Sports",TV,25,8.58,243325
2,30230,3.188425,Diamond no Ace: Second Season,"Comedy, School, Shounen, Sports",TV,51,8.5,38531
3,11771,3.188425,Kuroko no Basket,"Comedy, School, Shounen, Sports",TV,25,8.46,338315
4,18689,3.188425,Diamond no Ace,"Comedy, School, Shounen, Sports",TV,75,8.25,81384
5,4053,3.188425,Prince of Tennis: The National Tournament Finals,"Comedy, School, Shounen, Sports",OVA,7,8.15,21020
6,16916,3.188425,Kuroko no Basket: Tip Off,"Comedy, School, Shounen, Sports",Special,1,8.07,41760
7,22125,3.188425,Kuroko no Basket: Mou Ikkai Yarimasen ka,"Comedy, School, Shounen, Sports",Special,1,7.86,20397
8,15487,3.188425,Kuroko no Basket NG-shuu,"Comedy, School, Shounen, Sports",Special,9,7.84,32215
9,5751,3.188425,Prince of Tennis: Another Story - Messages Fro...,"Comedy, School, Shounen, Sports",OVA,4,7.82,9487
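
The implementation of `GenreRec` is not shown in this notebook, so the following is only a sketch of one common way a content-based scorer can combine a user's ratings with item-item similarities: each candidate's score is the similarity-weighted sum of the user's ratings. The item names, similarity values, the choice to skip ratings of -1, and the helper `score_items` are all assumptions for illustration.

```python
# Hypothetical similarities from each rated item to candidate items
# (1.0 means an identical genre profile).
similarity = {
    "kuroko_s1": {
        "kuroko_s1": 1.0,
        "kuroko_s2": 1.0,
        "diamond_no_ace": 1.0,
        "hyouka": 0.3,
    },
    "hyouka": {"hyouka": 1.0, "kuroko_s1": 0.3, "tamako_market": 0.4},
}

# Hypothetical ratings; -1 is treated here as "no rating given" (an assumption).
user_ratings = {"kuroko_s1": 10, "hyouka": -1}


def score_items(ratings, sim):
    """Score candidates by the similarity-weighted sum of the user's ratings."""
    rated = {item: r for item, r in ratings.items() if r > 0}
    candidates = {cand for item in rated for cand in sim[item]}
    return {
        cand: sum(sim[item].get(cand, 0.0) * r for item, r in rated.items())
        for cand in candidates
    }


scores = score_items(user_ratings, similarity)
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{item:15s} {score:.2f}")
```

Under this scheme, candidates with the same genre profile as the highly rated item all receive the same score. That is at least consistent with the tied scores in the table above, although the actual model may compute its scores differently.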


### Save Model

In [None]:
(path / "models").mkdir(exist_ok=True)

In [None]:
with open("models/content_based_model.pkl", "wb") as fp:
    pickle.dump(fittable, fp)

## Metrics
* Hit Rate
* nDCG
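
For intuition, here is a simplified sketch of both metrics for a single recommendation list, using binary relevance (an item either is or is not in the user's test set) and the standard log2 rank discount. The function names `hit_for_list` and `binary_ndcg` and the example IDs are hypothetical; LensKit's own `topn.hit` and `topn.ndcg` may differ in details such as how gains are defined.

```python
import math


def hit_for_list(recommended, relevant):
    """Return 1.0 if at least one recommended item is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended) else 0.0


def binary_ndcg(recommended, relevant):
    """nDCG with a gain of 1 for relevant items and a log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended, start=1)
        if item in relevant
    )
    # The ideal list places every relevant item at the top.
    ideal_hits = min(len(relevant), len(recommended))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: one of two relevant items appears, at rank 2.
recs_for_user = [5, 3, 9, 1]
test_items = {3, 7}
print(hit_for_list(recs_for_user, test_items))
print(round(binary_ndcg(recs_for_user, test_items), 4))
```

Averaging the per-user values over all test users gives the aggregate hit rate and nDCG.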

In [None]:
rla = topn.RecListAnalysis()
rla.add_metric(topn.ndcg)
rla.add_metric(topn.hit)
results = rla.compute(recs, test)


In [None]:
recs.to_parquet("data/output/content-based-rec-20.parquet", index=False)

In [None]:
results.ndcg.mean()

0.022162480866059927

In [None]:
results.hit.mean()

0.22542022