<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# TF-IDF Content-Based Recommendation on the COVID-19 Open Research Dataset
This notebook demonstrates a simple implementation of Term Frequency-Inverse Document Frequency (TF-IDF) content-based recommendation on the [COVID-19 Open Research Dataset](https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/), hosted through Azure Open Datasets.

In this notebook, we create a recommender that returns the top k articles most similar to any article of interest (the query item) in the COVID-19 Open Research Dataset.
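
To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is not the `TfidfRecommender` implementation: the helper names `tfidf_vectors` and `cosine`, the whitespace tokenization, the smoothed IDF formula, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF keeps weights positive for terms found in every document
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "crime drama thriller",
    "crime drama",
    "children comedy",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # two shared terms, so the score is well above zero
print(cosine(vecs[0], vecs[2]))  # no shared terms, so the score is 0.0
```

Items whose text shares rare terms receive the highest scores, which is the ranking signal a content-based recommender relies on.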

In [1]:
import sys
import logging
import scipy
import numpy as np
import pandas as pd

from recommenders.models.tfidf.tfidf_utils import TfidfRecommender
from recommenders.datasets import movielens

# Print version
print(f"System version: {sys.version}")


System version: 3.9.21 (main, Dec 11 2024, 16:35:24) [MSC v.1929 64 bit (AMD64)]


### 1. Load the dataset into a dataframe
Let's begin by loading the dataset into a Pandas dataframe. The cells below download the MovieLens ratings together with each movie's title, genres, and release year, which supply the text used by the recommender.

In [2]:
# Top k items to recommend
TOP_K = 10

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = "100k"

In [3]:
# Set log level to DEBUG so that download and tokenizer requests are logged
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

In [4]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=["userID", "itemID", "rating", "timestamp"],
    title_col="title",
    genres_col='genres',
    year_col='year'
)
# Convert the float precision to 32-bit in order to reduce memory consumption
data["rating"] = data["rating"].astype(np.float32)

data.head()

2025-02-18 22:05:18 DEBUG    Starting new HTTPS connection (1): files.grouplens.org:443
2025-02-18 22:05:18 DEBUG    https://files.grouplens.org:443 "GET /datasets/movielens/ml-100k.zip HTTP/1.1" 200 4924029
2025-02-18 22:05:18 INFO     Downloading https://files.grouplens.org/datasets/movielens/ml-100k.zip
100%|██████████| 4.81k/4.81k [00:01<00:00, 2.51kKB/s]


Unnamed: 0,userID,itemID,rating,timestamp,title,genres,year
0,196,242,3.0,881250949,Kolya (1996),Comedy,1996
1,186,302,3.0,891717742,L.A. Confidential (1997),Crime|Film-Noir|Mystery|Thriller,1997
2,22,377,1.0,878887116,Heavyweights (1994),Children's|Comedy,1994
3,244,51,2.0,880606923,Legends of the Fall (1994),Drama|Romance|War|Western,1994
4,166,346,1.0,886397596,Jackie Brown (1997),Crime|Drama,1997
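
The frame above has one row per rating (note the `userID`, `rating`, and `timestamp` columns), so a movie rated by many users appears many times. A content-based model compares items only, so one option, which is not part of the original notebook, is to collapse the frame to one row per `itemID` before cleaning. The helper `one_row_per_item` and the sample rows below are hypothetical and exist only to illustrate the idea.

```python
import pandas as pd


def one_row_per_item(ratings, id_col="itemID", keep_cols=("title", "genres", "year")):
    """Keep only item-level columns and drop repeated rows for the same item."""
    cols = [id_col, *keep_cols]
    return ratings[cols].drop_duplicates(subset=id_col).reset_index(drop=True)


# Made-up stand-in for the ratings frame: two users rated item 242
ratings = pd.DataFrame({
    "userID": [196, 63, 186],
    "itemID": [242, 242, 302],
    "rating": [3.0, 3.0, 3.0],
    "title": ["Kolya (1996)", "Kolya (1996)", "L.A. Confidential (1997)"],
    "genres": ["Comedy", "Comedy", "Crime|Film-Noir|Mystery|Thriller"],
    "year": ["1996", "1996", "1997"],
})
items = one_row_per_item(ratings)
print(items)
```

Applied to the real data, this would be called as `one_row_per_item(data)`, and the result would replace `data` in the later cleaning step.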


### 4. Instantiate the recommender
All functions for data preparation and recommendation are contained within the **TfidfRecommender** class we have imported. Prior to running these functions, we must create an object of this class.

Select one of the following tokenization methods to use in the model:

| tokenization_method | Description                                                                                                                      |
|:--------------------|:---------------------------------------------------------------------------------------------------------------------------------|
| 'none'              | No tokenization is applied. Each word is considered a token.                                                                     |
| 'nltk'              | Simple stemming is applied using NLTK.                                                                                           |
| 'bert'              | HuggingFace BERT word tokenization ('bert-base-cased') is applied.                                                               |
| 'scibert'           | SciBERT word tokenization ('allenai/scibert_scivocab_cased') is applied.<br>This is recommended for scientific journal articles. |

In [5]:
# Create the recommender object; id_col must name an ID column that exists in the dataframe
recommender = TfidfRecommender(id_col='itemID', tokenization_method='scibert')

### 5. Prepare text for use in the TF-IDF recommender

In [6]:
data['genres'] = data['genres'].str.replace('|', ' ', regex=False)
data.head()

Unnamed: 0,userID,itemID,rating,timestamp,title,genres,year
0,196,242,3.0,881250949,Kolya (1996),Comedy,1996
1,186,302,3.0,891717742,L.A. Confidential (1997),Crime Film-Noir Mystery Thriller,1997
2,22,377,1.0,878887116,Heavyweights (1994),Children's Comedy,1994
3,244,51,2.0,880606923,Legends of the Fall (1994),Drama Romance War Western,1994
4,166,346,1.0,886397596,Jackie Brown (1997),Crime Drama,1997


In [7]:
cols_to_clean = ['title','genres']
clean_col = 'cleaned_text'
df_clean = recommender.clean_dataframe(data, cols_to_clean, clean_col)
df_clean.head()

Unnamed: 0,userID,itemID,rating,timestamp,title,genres,year,cleaned_text
0,196,242,3.0,881250949,Kolya (1996),Comedy,1996,Kolya 1996 Comedy
1,186,302,3.0,891717742,L.A. Confidential (1997),Crime Film-Noir Mystery Thriller,1997,LA Confidential 1997 Crime FilmNoir Mystery Th...
2,22,377,1.0,878887116,Heavyweights (1994),Children's Comedy,1994,Heavyweights 1994 Childrens Comedy
3,244,51,2.0,880606923,Legends of the Fall (1994),Drama Romance War Western,1994,Legends of the Fall 1994 Drama Romance War Wes...
4,166,346,1.0,886397596,Jackie Brown (1997),Crime Drama,1997,Jackie Brown 1997 Crime Drama
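
Judging from the output above, the cleaning step joins the selected columns and strips punctuation (`L.A.` becomes `LA`, `Film-Noir` becomes `FilmNoir`). The sketch below reproduces only that visible behaviour with a regular expression. `clean_text` is a hypothetical helper, and the real `clean_dataframe` method may apply further normalization that is not shown here.

```python
import re


def clean_text(*fields):
    """Join text fields and drop characters that are not word characters or whitespace."""
    joined = " ".join(fields)
    no_punct = re.sub(r"[^\w\s]", "", joined)
    # Collapse any runs of whitespace left behind by removed characters
    return re.sub(r"\s+", " ", no_punct).strip()


print(clean_text("Kolya (1996)", "Comedy"))
print(clean_text("Heavyweights (1994)", "Children's Comedy"))
```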


Let's also tokenize the cleaned text for use in the TF-IDF model. The tokens are stored within our TfidfRecommender object.

In [8]:
# Tokenize text with tokenization_method specified in class instantiation
tf, vectors_tokenized = recommender.tokenize_text(df_clean, text_col="cleaned_text")

2025-02-18 22:05:22 DEBUG    Starting new HTTPS connection (1): huggingface.co:443
2025-02-18 22:05:22 DEBUG    https://huggingface.co:443 "HEAD /allenai/scibert_scivocab_cased/resolve/main/tokenizer_config.json HTTP/1.1" 404 0
2025-02-18 22:05:22 DEBUG    https://huggingface.co:443 "HEAD /allenai/scibert_scivocab_cased/resolve/main/vocab.txt HTTP/1.1" 200 0


### 6. Recommend articles using TF-IDF
Let's now fit the recommender model to the processed data (tokens) and retrieve the top k recommended articles.

The `k` argument of `recommend_top_k_items` sets how many recommendations are returned for each item. The call below passes `k=1`, so it requests only the single most similar item for each row.

In [9]:
# Fit the TF-IDF vectorizer
recommender.fit(tf, vectors_tokenized)
recommender.get_tokens()


{'ko': 6416,
 'ly': 7119,
 '1996': 537,
 'come': 2669,
 'dy': 3556,
 'ko ly': 6427,
 'ly 1996': 7130,
 '1996 come': 554,
 'come dy': 2672,
 'ko ly 1996': 6428,
 'ly 1996 come': 7131,
 '1996 come dy': 555,
 'la': 6447,
 'confident': 2723,
 'ial': 5512,
 '1997': 580,
 'crime': 2858,
 'film': 4471,
 'ir': 5940,
 'mys': 7787,
 'ter': 10727,
 'thr': 10837,
 'iller': 5730,
 'la confident': 6466,
 'confident ial': 2724,
 'ial 1997': 5513,
 '1997 crime': 601,
 'crime film': 2869,
 'film ir': 4474,
 'ir mys': 5943,
 'mys ter': 7788,
 'ter thr': 10752,
 'thr iller': 10838,
 'la confident ial': 6467,
 'confident ial 1997': 2725,
 'ial 1997 crime': 5514,
 '1997 crime film': 603,
 'crime film ir': 2870,
 'film ir mys': 4475,
 'ir mys ter': 5944,
 'mys ter thr': 7794,
 'ter thr iller': 10753,
 'heavy': 5250,
 'weight': 11812,
 '1994': 454,
 'children': 2466,
 'heavy weight': 5255,
 'weight 1994': 11813,
 '1994 children': 467,
 'children come': 2469,
 'heavy weight 1994': 5256,
 'weight 1994 children
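
The vocabulary printed above contains single subword pieces as well as space-joined runs of two and three consecutive pieces, that is, n-grams of length 1 to 3. The sketch below shows how such n-grams can be built from a token list. `ngrams` is a hypothetical helper, and the range 1 to 3 is inferred from the printed keys rather than taken from the library's configuration.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of the token list, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Subword pieces as they appear in the vocabulary for "Kolya 1996 Comedy"
pieces = ["ko", "ly", "1996", "come", "dy"]
print(ngrams(pieces))
```

For these five pieces the function yields the same twelve keys that open the dictionary above, from `'ko'` through `'1996 come dy'`.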

In [10]:
# Get recommendations
top_k_recommendations = recommender.recommend_top_k_items(df_clean, k=1)

MemoryError: Unable to allocate 34.5 GiB for an array with shape (4625328666,) and data type float64
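
The size in the error message can be checked with simple arithmetic: each `float64` value takes 8 bytes. The second part of the sketch compares a dense pairwise similarity matrix over every rating row with one over every distinct movie. The counts of 100,000 ratings and 1,682 movies are the published sizes of MovieLens 100k and are assumptions here, since the notebook does not print them. The exact array that failed to allocate is internal to the library.

```python
BYTES_PER_FLOAT64 = 8
GIB = 2 ** 30


def gib_for(n_elements):
    """Memory needed for n_elements float64 values, in GiB."""
    return n_elements * BYTES_PER_FLOAT64 / GIB


# Element count reported in the MemoryError above
print(f"{gib_for(4_625_328_666):.1f} GiB")

# Dense pairwise similarity over every rating row versus every distinct movie
# (assumed sizes for MovieLens 100k: 100,000 ratings and 1,682 movies)
print(f"{gib_for(100_000 ** 2):.1f} GiB")
print(f"{gib_for(1_682 ** 2) * 1024:.1f} MiB")
```

The gap between the last two figures suggests that computing similarities over one row per movie, rather than one row per rating, would keep the computation well within ordinary memory limits.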

In our recommendation table, each row represents a single recommendation.

- **cord_uid** identifies the query article for which recommendations are made.
- **rec_rank** contains the recommendation's rank (e.g., a rank of 1 means the top recommendation).
- **rec_score** is the cosine similarity score between the query article and the recommended article.
- **rec_cord_uid** corresponds to the recommended article.
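
The four columns can be illustrated by ranking each row of a similarity matrix. This is a sketch under stated assumptions, not the library's code: `top_k_from_similarity` is a hypothetical helper, and the 3x3 matrix and the IDs `a`, `b`, `c` are made up.

```python
import numpy as np


def top_k_from_similarity(sim, ids, k):
    """Return (query_id, rec_rank, rec_score, rec_id) rows from a square similarity matrix."""
    rows = []
    for i, query_id in enumerate(ids):
        scores = sim[i].astype(float)  # astype returns a copy, so sim is not modified
        scores[i] = -np.inf  # never recommend the query item to itself
        best = np.argsort(-scores)[:k]  # indices of the k highest scores
        for rank, j in enumerate(best, start=1):
            rows.append((query_id, rank, float(scores[j]), ids[j]))
    return rows


# Made-up 3x3 cosine similarity matrix for three items
sim = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
for row in top_k_from_similarity(sim, ids=["a", "b", "c"], k=2):
    print(row)
```

Each printed tuple follows the column order of the recommendation table: query ID, rank, score, and recommended ID. The sketch assumes `k` is smaller than the number of items.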

In [None]:
# Preview the recommendations
top_k_recommendations

Unnamed: 0,cord_uid,rec_rank,rec_score,rec_cord_uid
0,ej795nks,1,0.142033,u7lz3spe
1,ej795nks,2,0.117743,j35w1vsw
2,ej795nks,3,0.100325,nt60lv2k
3,ej795nks,4,0.076779,vp9d9vmp
4,ej795nks,5,0.074392,05d1mhkq
...,...,...,...,...
1280,yetdnv6j,1,0.048499,9w9w0z4o
1281,yetdnv6j,2,0.046675,6nas74q1
1282,yetdnv6j,3,0.044476,7docv0dt
1283,yetdnv6j,4,0.040522,oj60pldq
