<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# TF-IDF Content-Based Recommendation on the MovieLens Dataset
This notebook demonstrates a simple implementation of Term Frequency-Inverse Document Frequency (TF-IDF) content-based recommendation on the [MovieLens dataset](https://grouplens.org/datasets/movielens/), downloaded from GroupLens.

We will create a recommender that returns the top k movies most similar to any movie of interest (the query item), based on the text of its title and genres.
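
To make the idea concrete, here is a minimal, library-free sketch of TF-IDF weighting followed by cosine similarity. It is an illustration only, not the code inside `TfidfRecommender`: the helper names `tfidf_vectors` and `cosine`, the three toy texts, and the smoothed IDF formula are all assumptions chosen for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {
                # Term frequency times a smoothed inverse document frequency
                term: (count / len(tokens))
                * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                for term, count in counts.items()
            }
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy item descriptions (assumed for illustration)
docs = [
    "Toy Story 1995 Animation Children Comedy",
    "Heavyweights 1994 Children Comedy",
    "Heat 1995 Action Crime Thriller",
]
vecs = tfidf_vectors(docs)
sim_comedy = cosine(vecs[0], vecs[1])  # shares two terms with the query
sim_crime = cosine(vecs[0], vecs[2])   # shares only the year
print(f"{sim_comedy:.3f} vs {sim_crime:.3f}")
```

The item that shares more terms with the query item scores higher, which is the ranking signal a content-based recommender of this kind relies on.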

In [1]:
import sys
import logging
import scipy
import numpy as np
import pandas as pd

from recommenders.models.tfidf.tfidf_utils import TfidfRecommender
from recommenders.datasets import movielens

# Print version
print(f"System version: {sys.version}")


System version: 3.9.21 (main, Dec 11 2024, 16:35:24) [MSC v.1929 64 bit (AMD64)]


### 1. Load the dataset into a dataframe
Let's begin by loading the MovieLens ratings into a Pandas dataframe, together with the title, genres, and release year of each rated movie.

In [2]:
# Top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = "100k"

In [3]:
# Set the log level to DEBUG so that download and tokenizer requests are logged
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

In [4]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=["userID", "itemID", "rating", "timestamp"],
    title_col="title",
    genres_col='genres',
    year_col='year'
)
# Convert the float precision to 32-bit in order to reduce memory consumption
data["rating"] = data["rating"].astype(np.float32)

data.head()

2025-02-18 22:05:18 DEBUG    Starting new HTTPS connection (1): files.grouplens.org:443
2025-02-18 22:05:18 DEBUG    https://files.grouplens.org:443 "GET /datasets/movielens/ml-100k.zip HTTP/1.1" 200 4924029
2025-02-18 22:05:18 INFO     Downloading https://files.grouplens.org/datasets/movielens/ml-100k.zip
100%|██████████| 4.81k/4.81k [00:01<00:00, 2.51kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp,title,genres,year
0,196,242,3.0,881250949,Kolya (1996),Comedy,1996
1,186,302,3.0,891717742,L.A. Confidential (1997),Crime|Film-Noir|Mystery|Thriller,1997
2,22,377,1.0,878887116,Heavyweights (1994),Children's|Comedy,1994
3,244,51,2.0,880606923,Legends of the Fall (1994),Drama|Romance|War|Western,1994
4,166,346,1.0,886397596,Jackie Brown (1997),Crime|Drama,1997
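
The effect of the 32-bit cast in the cell above can be checked with a small sketch. The synthetic `ratings64` series below is an assumption standing in for the real `rating` column; whole-number ratings are exactly representable in `float32`, so no information is lost.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "rating" column (assumed values)
ratings64 = pd.Series([3.0, 3.0, 1.0, 2.0, 1.0] * 1000, dtype=np.float64)
ratings32 = ratings64.astype(np.float32)

# Memory used by the values only, ignoring the index
bytes64 = ratings64.memory_usage(index=False)
bytes32 = ratings32.memory_usage(index=False)
print(bytes64, bytes32)
```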


### 4. Instantiate the recommender
All functions for data preparation and recommendation are contained within the **TfidfRecommender** class we have imported. Prior to running these functions, we must create an object of this class.

Select one of the following tokenization methods to use in the model:

| tokenization_method | Description                                                                                                                      |
|:--------------------|:---------------------------------------------------------------------------------------------------------------------------------|
| 'none'              | No tokenization is applied. Each word is considered a token.                                                                     |
| 'nltk'              | Simple stemming is applied using NLTK.                                                                                           |
| 'bert'              | HuggingFace BERT word tokenization ('bert-base-cased') is applied.                                                               |
| 'scibert'           | SciBERT word tokenization ('allenai/scibert_scivocab_cased') is applied.<br>This is recommended for scientific journal articles. |

In [35]:
# Create the recommender object
recommender = TfidfRecommender(id_col='itemID', tokenization_method='bert')

### 5. Prepare text for use in the TF-IDF recommender

In [36]:
# Replace the '|' genre separator with a space so that each genre becomes a separate word
data['genres'] = data['genres'].str.replace('|', ' ', regex=False)
data.head()

Unnamed: 0,userID,itemID,rating,timestamp,title,genres,year
0,196,242,3.0,881250949,Kolya (1996),Comedy,1996
1,186,302,3.0,891717742,L.A. Confidential (1997),Crime Film-Noir Mystery Thriller,1997
2,22,377,1.0,878887116,Heavyweights (1994),Children's Comedy,1994
3,244,51,2.0,880606923,Legends of the Fall (1994),Drama Romance War Western,1994
4,166,346,1.0,886397596,Jackie Brown (1997),Crime Drama,1997
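
The `regex=False` argument in the replacement above matters because `|` is the alternation operator in regular expressions. The sketch below, using two sample genre strings taken from the output above, shows the literal replacement, an equivalent escaped pattern, and what an unescaped `|` does in Python's `re` module.

```python
import re

import pandas as pd

genres = pd.Series(["Crime|Film-Noir|Mystery|Thriller", "Children's|Comedy"])

# Literal replacement, as used in the notebook
literal = genres.str.replace("|", " ", regex=False)

# Equivalent regular-expression form: the pipe must be escaped
escaped = genres.str.replace(r"\|", " ", regex=True)

# Unescaped, "|" means "empty string OR empty string": it matches the empty
# string at every position and never consumes the pipe character itself
unescaped = re.sub("|", " ", "Crime|Drama")

print(literal.tolist())
print(repr(unescaped))
```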


In [37]:
cols_to_clean = ['title','genres']
clean_col = 'cleaned_text'
df_clean = recommender.clean_dataframe(data, cols_to_clean, clean_col)
df_clean

Unnamed: 0,userID,itemID,rating,timestamp,title,genres,year,cleaned_text
0,196,242,3.0,881250949,Kolya (1996),Comedy,1996,Kolya 1996 Comedy
1,186,302,3.0,891717742,L.A. Confidential (1997),Crime Film-Noir Mystery Thriller,1997,LA Confidential 1997 Crime FilmNoir Mystery Th...
2,22,377,1.0,878887116,Heavyweights (1994),Children's Comedy,1994,Heavyweights 1994 Childrens Comedy
3,244,51,2.0,880606923,Legends of the Fall (1994),Drama Romance War Western,1994,Legends of the Fall 1994 Drama Romance War Wes...
4,166,346,1.0,886397596,Jackie Brown (1997),Crime Drama,1997,Jackie Brown 1997 Crime Drama
...,...,...,...,...,...,...,...,...
99995,880,476,3.0,880175444,"First Wives Club, The (1996)",Comedy,1996,First Wives Club The 1996 Comedy
99996,716,204,5.0,879795543,Back to the Future (1985),Comedy Sci-Fi,1985,Back to the Future 1985 Comedy SciFi
99997,276,1090,1.0,874795795,Sliver (1993),Thriller,1993,Sliver 1993 Thriller
99998,13,225,2.0,882399156,101 Dalmatians (1996),Children's Comedy,1996,101 Dalmatians 1996 Childrens Comedy


In [45]:
# Keep only the first 100 rows for this demonstration
df_clean = df_clean[:100]
len(df_clean)

100

Let's also tokenize the cleaned text for use in the TF-IDF model. The tokens are stored within our TfidfRecommender object.

In [46]:
# Tokenize text with tokenization_method specified in class instantiation
tf, vectors_tokenized = recommender.tokenize_text(df_clean, text_col="cleaned_text")

2025-02-18 23:07:08 DEBUG    https://huggingface.co:443 "HEAD /bert-base-cased/resolve/main/tokenizer_config.json HTTP/1.1" 200 0


### 6. Recommend movies using TF-IDF
Let's now fit the recommender model to the processed data (tokens) and retrieve the top k recommended movies.

We pass k=5 to `recommend_top_k_items`, so the function returns the top 5 recommendations for each movie in `df_clean`.

In [47]:
# Fit the TF-IDF vectorizer
recommender.fit(tf, vectors_tokenized)
tokens = recommender.get_tokens()
tokens


{'ko': 584,
 'lya': 622,
 '1996': 129,
 'comedy': 340,
 'ko lya': 585,
 'lya 1996': 623,
 '1996 comedy': 135,
 'ko lya 1996': 586,
 'lya 1996 comedy': 624,
 'la': 587,
 'fi': 465,
 'dent': 386,
 'ial': 539,
 '1997': 144,
 'crime': 360,
 'film': 468,
 'oir': 693,
 'mystery': 669,
 'hrill': 531,
 'er': 439,
 'la fi': 588,
 'fi dent': 466,
 'dent ial': 387,
 'ial 1997': 540,
 '1997 crime': 149,
 'crime film': 364,
 'film oir': 469,
 'oir mystery': 694,
 'mystery hrill': 670,
 'hrill er': 532,
 'la fi dent': 589,
 'fi dent ial': 467,
 'dent ial 1997': 388,
 'ial 1997 crime': 541,
 '1997 crime film': 151,
 'crime film oir': 365,
 'film oir mystery': 470,
 'oir mystery hrill': 695,
 'mystery hrill er': 671,
 'heavyweight': 507,
 '1994': 99,
 'children': 318,
 'heavyweight 1994': 508,
 '1994 children': 102,
 'children comedy': 321,
 'heavyweight 1994 children': 509,
 '1994 children comedy': 103,
 'legends': 602,
 'fall': 454,
 'drama': 411,
 'romance': 761,
 'war': 929,
 'western': 931,
 'leg

In [48]:
len(tokens)

955
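
The dictionary printed above contains single tokens as well as two- and three-token phrases, which is why 100 short texts yield 955 entries. The sketch below reproduces that pattern for the first title. The helpers `ngrams` and `build_vocabulary` are hypothetical, and the assumption that the integer values are column indices assigned in sorted term order is inferred from the printed output rather than confirmed from the library.

```python
def ngrams(tokens, max_n=3):
    """All contiguous phrases of 1 to max_n tokens, shortest first."""
    phrases = []
    for n in range(1, max_n + 1):
        phrases.extend(
            " ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        )
    return phrases


def build_vocabulary(token_lists, max_n=3):
    """Map each distinct phrase to an index in sorted order (assumed convention)."""
    terms = sorted({p for tokens in token_lists for p in ngrams(tokens, max_n)})
    return {term: index for index, term in enumerate(terms)}


# Sub-word tokens for "Kolya 1996 Comedy", as they appear in the output above
kolya = ["ko", "lya", "1996", "comedy"]
kolya_terms = ngrams(kolya)

# A second, made-up token list to show how the vocabulary grows
vocab = build_vocabulary([kolya, ["jackie", "brown", "1997", "crime", "drama"]])
print(kolya_terms)
print(len(vocab))
```

A four-token text contributes 4 + 3 + 2 = 9 phrases, matching the first nine entries of the printed dictionary.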

In [49]:
# Get recommendations
top_k_recommendations = recommender.recommend_top_k_items(df_clean, k=5)

In our recommendation table, each row represents a single recommendation.

- **itemID** identifies the movie the recommendations are made for (the query item).
- **rec_rank** contains the recommendation's rank (e.g., a rank of 1 means the top recommendation).
- **rec_score** is the cosine similarity score between the query movie and the recommended movie.
- **rec_itemID** identifies the recommended movie.
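
A table with this layout can be derived from any matrix of item vectors. The sketch below is one way to do it under stated assumptions: `top_k_table` is a hypothetical helper, the four small vectors are made up, and the code is not the library's own implementation of `recommend_top_k_items`.

```python
import numpy as np
import pandas as pd


def top_k_table(item_ids, vectors, k):
    """Rank the k most similar other items for every item (hypothetical helper)."""
    # Normalise rows to unit length so that a dot product equals cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T
    # Exclude each item from its own recommendations
    np.fill_diagonal(sim, -np.inf)

    rows = []
    for i, item in enumerate(item_ids):
        # Indices of the k highest scores, best first
        best = np.argsort(-sim[i], kind="stable")[:k]
        for rank, j in enumerate(best, start=1):
            rows.append(
                {
                    "itemID": item,
                    "rec_rank": rank,
                    "rec_score": float(sim[i, j]),
                    "rec_itemID": item_ids[j],
                }
            )
    return pd.DataFrame(
        rows, columns=["itemID", "rec_rank", "rec_score", "rec_itemID"]
    )


# Made-up item vectors; the IDs are borrowed from the data preview
item_ids = [242, 302, 377, 346]
vectors = np.array(
    [
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0],
        [1.0, 0.0, 0.5],
        [0.0, 1.0, 1.0],
    ]
)
recs = top_k_table(item_ids, vectors, k=2)
print(recs)
```

Each query item contributes exactly k rows, so the table has `len(item_ids) * k` rows as long as there are more than k items.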

In [50]:
# Preview the recommendations
top_k_recommendations

Unnamed: 0,itemID,rec_rank,rec_score,rec_itemID
0,242,1,0.138428,1049
1,242,2,0.137708,25
2,242,3,0.126384,111
3,242,4,0.045733,1081
4,242,5,0.042057,288
...,...,...,...,...
470,23,1,0.205055,100
471,23,2,0.202522,54
472,23,3,0.184241,98
473,23,4,0.180336,332
