# Getting started

In this notebook, we illustrate how to use the Neural News Recommendation with Multi-Head Self-Attention ([NRMS](https://aclanthology.org/D19-1671/)). The implementation is taken from the [recommenders](https://github.com/recommenders-team/recommenders) repository. We have simply stripped the model to keep it cleaner.

We use a small dataset, which is downloaded from [recsys.eb.dk](https://recsys.eb.dk/). All the datasets are stored in the folder path ```~/ebnerd_data/*```.

## Load functionality

In [2]:
from transformers import AutoTokenizer, AutoModel
from pathlib import Path
import tensorflow as tf
import polars as pl
import datetime

from ebrec.utils._constants import *

from ebrec.utils._behaviors import (
    create_binary_labels_column,
    sampling_strategy_wu2019,
    add_prediction_scores,
    truncate_history,
    ebnerd_from_path,
)
from ebrec.evaluation import MetricEvaluator, AucScore, NdcgScore, MrrScore
from ebrec.utils._articles import convert_text2encoding_with_transformers
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings
from ebrec.utils._python import write_submission_file, rank_predictions_by_score

from ebrec.models.newsrec.dataloader import NRMSDataLoader
from ebrec.models.newsrec.model_config import hparams_nrms
from ebrec.models.newsrec import NRMSModel




In [3]:
# Enable memory growth on every GPU, then list all physical devices
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

physical_devices = tf.config.list_physical_devices()
print("Available devices:", physical_devices)

Available devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


## Load dataset

### Generate labels
We sample a few impressions just to get started. For the test set we create a dummy label column of 0s and 1s; these are not the true labels.

In [4]:
# PATH = Path("~/ebnerd_data").expanduser()
# #
# DATASPLIT = "ebnerd_small"
# DUMP_DIR = Path("ebnerd_predictions")
# DUMP_DIR.mkdir(exist_ok=True, parents=True)

In [5]:
from pathlib import Path

# Use raw string to avoid issues with backslashes
PATH = Path(r"C:\Users\antot\Downloads\ebnerd-benchmark\examples\ebnerd_data").expanduser()
TRAIN = "ebnerd_small"  # [ebnerd_demo, ebnerd_small, ebnerd_large]
VAL = "ebnerd_small"
TEST = "ebnerd_testset"  # alternative: "ebnerd_testset_gt"


# Create a directory for dumping predictions
#DUMP_DIR = Path("ebnerd_predictions")
#DUMP_DIR.mkdir(exist_ok=True, parents=True)

In [48]:
DUMP_DIR = Path(r"C:\Users\antot\Downloads\ebnerd-benchmark\examples").expanduser()
DUMP_DIR.mkdir(exist_ok=True, parents=True)

History size can often be a memory bottleneck. If you change it, the NRMS hyperparameter ```history_size``` must be updated to the same value so that the model input shape matches the data.

In [7]:
HISTORY_SIZE = 20
hparams_nrms.history_size = HISTORY_SIZE
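
To make the role of ```history_size``` concrete, here is a minimal sketch of how a click history could be brought to a fixed length. The helper `fix_history_length` is hypothetical and is not part of `ebrec`; it assumes that the most recent clicks are kept and that short histories are left-padded with `0`, which is consistent with the `[0, 9745590, …]` histories printed in the training sample of this notebook.

```python
from typing import List


def fix_history_length(history: List[int], history_size: int, pad_value: int = 0) -> List[int]:
    """Keep the `history_size` most recent ids; left-pad shorter histories."""
    recent = history[-history_size:] if history_size > 0 else []
    padding = [pad_value] * (history_size - len(recent))
    return padding + recent


short_history = fix_history_length([11, 12, 13], history_size=5)
long_history = fix_history_length(list(range(1, 31)), history_size=5)
print(short_history)  # [0, 0, 11, 12, 13]
print(long_history)   # [26, 27, 28, 29, 30]
```

Each batch holds roughly `batch_size × history_size × MAX_TITLE_LENGTH` token ids for the histories alone, which is why a larger history increases memory use.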

In [8]:
# We just want to load the necessary columns
COLUMNS = [
    DEFAULT_USER_COL,
    DEFAULT_IMPRESSION_ID_COL,
    DEFAULT_IMPRESSION_TIMESTAMP_COL,
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
]
# This notebook is just a simple 'get-started'; we down sample the number of samples to just run quickly through it.
FRACTION = 0.01

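In this example we load the training and validation splits from their own folders. For every impression, `sampling_strategy_wu2019` keeps the clicked article and samples `npratio=4` negatives from the in-view articles, and `create_binary_labels_column` adds the matching binary labels. Later cells work on small subsets to keep the run time short.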

In [9]:
# Load your train and validation datasets directly
df_train = ebnerd_from_path(
    PATH.joinpath("ebnerd_small/train"),
    history_size=HISTORY_SIZE,
    
).select(COLUMNS).pipe(
    sampling_strategy_wu2019,
    npratio=4,
    shuffle=True,
    with_replacement=True,
    seed=123,
).pipe(create_binary_labels_column)

df_validation = ebnerd_from_path(
    PATH.joinpath("ebnerd_small/validation"),
    history_size=HISTORY_SIZE
    
).select(COLUMNS).pipe(
    sampling_strategy_wu2019,
    npratio=4,
    shuffle=True,
    with_replacement=True,
    seed=123,
).pipe(create_binary_labels_column)

print(f"Train samples: {df_train.height}\nValidation samples: {df_validation.height}")

# Preview the datasets
print("Train Data Sample:")
print(df_train.head(2))

print("Validation Data Sample:")
print(df_validation.head(2))


Train samples: 234277
Validation samples: 246289
Train Data Sample:
shape: (2, 7)
┌─────────┬──────────────┬──────────────┬──────────────┬──────────────┬──────────────┬─────────────┐
│ user_id ┆ impression_i ┆ impression_t ┆ article_id_f ┆ article_ids_ ┆ article_ids_ ┆ labels      │
│ ---     ┆ d            ┆ ime          ┆ ixed         ┆ clicked      ┆ inview       ┆ ---         │
│ u32     ┆ ---          ┆ ---          ┆ ---          ┆ ---          ┆ ---          ┆ list[i8]    │
│         ┆ u32          ┆ datetime[μs] ┆ list[i32]    ┆ list[i64]    ┆ list[i64]    ┆             │
╞═════════╪══════════════╪══════════════╪══════════════╪══════════════╪══════════════╪═════════════╡
│ 139836  ┆ 149474       ┆ 2023-05-24   ┆ [0, 9745590, ┆ [9778657]    ┆ [9778728,    ┆ [0, 0, … 1] │
│         ┆              ┆ 07:47:53     ┆ … 9765156]   ┆              ┆ 9778669, …   ┆             │
│         ┆              ┆              ┆              ┆              ┆ 9778657]     ┆             │
│ 143471 
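
The negative sampling step can be sketched in plain Python. The function `sample_impression` below is a hypothetical illustration and not the implementation of `sampling_strategy_wu2019`; it assumes one clicked article per impression, sampling with replacement, and a label of 1 for the clicked article and 0 for every sampled negative.

```python
import random
from typing import List, Tuple


def sample_impression(
    inview: List[int], clicked: int, npratio: int, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Return (candidates, labels): one positive plus `npratio` sampled negatives."""
    negatives = [article for article in inview if article != clicked]
    # Sample with replacement so impressions with few negatives still yield npratio items.
    sampled = rng.choices(negatives, k=npratio) if negatives else []
    candidates = sampled + [clicked]
    rng.shuffle(candidates)
    labels = [1 if article == clicked else 0 for article in candidates]
    return candidates, labels


rng = random.Random(123)
candidates, labels = sample_impression(
    inview=[101, 102, 103, 104, 105, 106], clicked=104, npratio=4, rng=rng
)
print(candidates, labels)
```

With `npratio=4` every training row ends up with five candidates and exactly one positive label, which gives the model a fixed-size softmax target.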

In [10]:

# Data preprocessing parameters
MAX_TITLE_LENGTH = 30
HISTORY_SIZE = 20
EPOCHS = 5
hparams_nrms.history_size = HISTORY_SIZE

# This notebook is just a simple 'get-started'; we downsample the data to run through it quickly.
FRACTION = 0.01
FRACTION_TEST = 1.0

# Batch sizes
BATCH_SIZE_TRAIN = 64
BATCH_SIZE_VAL = 64
BATCH_SIZE_TEST_WO_B = 64
BATCH_SIZE_TEST_W_B = 64

# Number of chunks the test set is split into, and how many of them are already processed
N_CHUNKS_TEST = 10
CHUNKS_DONE = 0


### Test set
We load the test split from `ebnerd_testset/test` and sample a fraction of its impressions.

In [11]:
df_test = (
    ebnerd_from_path(
        PATH.joinpath("ebnerd_testset", "test")
    )
    .sample(fraction=FRACTION)
)

print(f"Test samples: {df_test.height}")
print("Test Data Sample:")
print(df_test.head(2))


Test samples: 135367
Test Data Sample:
shape: (2, 15)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ impressio ┆ impressio ┆ read_time ┆ scroll_pe ┆ … ┆ is_subscr ┆ session_i ┆ is_beyond ┆ article_ │
│ n_id      ┆ n_time    ┆ ---       ┆ rcentage  ┆   ┆ iber      ┆ d         ┆ _accuracy ┆ id_fixed │
│ ---       ┆ ---       ┆ f32       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│ u32       ┆ datetime[ ┆           ┆ f32       ┆   ┆ bool      ┆ u32       ┆ bool      ┆ list[i32 │
│           ┆ μs]       ┆           ┆           ┆   ┆           ┆           ┆           ┆ ]        │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 210091342 ┆ 2023-06-0 ┆ 11.0      ┆ null      ┆ … ┆ false     ┆ 55445808  ┆ false     ┆ [9786066 │
│           ┆ 8         ┆           ┆           ┆   ┆           ┆           ┆           ┆ ,        │
│           ┆ 05:32:53  ┆           ┆

In [12]:
COLUMNSTEST = [
    DEFAULT_USER_COL,
    DEFAULT_IMPRESSION_ID_COL,
    DEFAULT_IMPRESSION_TIMESTAMP_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_LABELS_COL

]

## Load articles

In [13]:

df_articles_train = pl.read_parquet(PATH.joinpath("ebnerd_small/articles.parquet"))
df_articles_train.head()
#df_articles_test = pl.read_parquet(TEST_MAIN_PATH.joinpath("articles.parquet"))

article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str
3001353,"""Natascha var i…","""Politiet frygt…",2023-06-29 06:20:33,False,"""Sagen om den ø…",2006-08-31 08:06:45,[3150850],"""article_defaul…","""https://ekstra…",[],[],"[""Kriminalitet"", ""Personfarlig kriminalitet""]",140,[],"""krimi""",,,,0.9955,"""Negative"""
3003065,"""Kun Star Wars …","""Biografgængern…",2023-06-29 06:20:35,False,"""Vatikanet har …",2006-05-21 16:57:00,[3006712],"""article_defaul…","""https://ekstra…",[],[],"[""Underholdning"", ""Film og tv"", ""Økonomi""]",414,"[433, 434]","""underholdning""",,,,0.846,"""Positive"""
3012771,"""Morten Bruun f…","""FODBOLD: Morte…",2023-06-29 06:20:39,False,"""Kemien mellem …",2006-05-01 14:28:40,[3177953],"""article_defaul…","""https://ekstra…",[],[],"[""Erhverv"", ""Kendt"", … ""Ansættelsesforhold""]",142,"[196, 199]","""sport""",,,,0.8241,"""Negative"""
3023463,"""Luderne flytte…","""I landets tynd…",2023-06-29 06:20:43,False,"""Det frække erh…",2007-03-24 08:27:59,[3184029],"""article_defaul…","""https://ekstra…",[],[],"[""Livsstil"", ""Erotik""]",118,[133],"""nyheder""",,,,0.7053,"""Neutral"""
3032577,"""Cybersex: Hvor…","""En flirtende s…",2023-06-29 06:20:46,False,"""De fleste af o…",2007-01-18 10:30:37,[3030463],"""article_defaul…","""https://ekstra…",[],[],"[""Livsstil"", ""Partnerskab""]",565,[],"""sex_og_samliv""",,,,0.9307,"""Neutral"""


In [14]:
df_articles_test = pl.read_parquet(PATH.joinpath("ebnerd_testset", "articles.parquet"))
df_articles_test.head()


article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str
3000022,"""Hanks beskyldt…","""Tom Hanks har …",2023-06-29 06:20:32,False,"""Tom Hanks skul…",2006-09-20 09:24:18,[3518381],"""article_defaul…","""https://ekstra…","[""David Gardner""]","[""PER""]","[""Kriminalitet"", ""Kendt"", … ""Litteratur""]",414,[432],"""underholdning""",,,,0.9911,"""Negative"""
3000063,"""Bostrups aske …","""Studieværten b…",2023-06-29 06:20:32,False,"""Strålende sens…",2006-09-24 07:45:30,"[3170935, 3170939]","""article_defaul…","""https://ekstra…",[],[],"[""Kendt"", ""Underholdning"", … ""Personlig begivenhed""]",118,[133],"""nyheder""",,,,0.5155,"""Neutral"""
3000613,"""Jesper Olsen r…","""Den tidligere …",2023-06-29 06:20:33,False,"""Jesper Olsen, …",2006-05-09 11:29:00,[3164998],"""article_defaul…","""https://ekstra…","[""Frankrig"", ""Jesper Olsen"", … ""Jesper Olsen""]","[""LOC"", ""PER"", … ""PER""]","[""Kendt"", ""Sport"", … ""Sygdom og behandling""]",142,"[196, 271]","""sport""",,,,0.9876,"""Negative"""
3000700,"""Madonna topløs…","""47-årige Madon…",2023-06-29 06:20:33,False,"""Skal du have s…",2006-05-04 11:03:12,[3172046],"""article_defaul…","""https://ekstra…",[],[],"[""Kendt"", ""Livsstil"", ""Underholdning""]",414,[432],"""underholdning""",,,,0.8786,"""Neutral"""
3000840,"""Otto Brandenbu…","""Sangeren og sk…",2023-06-29 06:20:33,False,"""'Og lidt for S…",2007-03-01 18:34:00,[3914446],"""article_defaul…","""https://ekstra…",[],[],"[""Kendt"", ""Underholdning"", … ""Musik og lyd""]",118,[133],"""nyheder""",,,,0.9468,"""Negative"""


## Init model using HuggingFace's tokenizer and word embedding
The original implementation uses GloVe embeddings and the matching tokenizer. To get going fast, we use a multilingual language model from Hugging Face instead: its tokenizer encodes the article text, and its word-embedding matrix initializes NRMS.


In [15]:
from transformers import AutoModel, AutoTokenizer
TRANSFORMER_MODEL_NAME = "FacebookAI/xlm-roberta-base"
TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]
MAX_TITLE_LENGTH = 30

# LOAD HUGGINGFACE:
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)

# We'll init the word embeddings using the pretrained transformer's embedding matrix
word2vec_embedding = get_transformers_word_embeddings(transformer_model)
# Concatenate the text columns and tokenize them
df_articles_train, cat_cal = concat_str_columns(df_articles_train, columns=TEXT_COLUMNS_TO_USE)
df_articles_train, token_col_title = convert_text2encoding_with_transformers(
    df_articles_train, transformer_tokenizer, cat_cal, max_length=MAX_TITLE_LENGTH
)
# =>
article_mapping_train = create_article_id_to_value_mapping(
    df=df_articles_train, value_col=token_col_title
)



df_articles_test, cat_cal = concat_str_columns(df_articles_test, columns=TEXT_COLUMNS_TO_USE)
df_articles_test, token_col_title = convert_text2encoding_with_transformers(
    df_articles_test, transformer_tokenizer, cat_cal, max_length=MAX_TITLE_LENGTH
)
# =>
article_mapping_test = create_article_id_to_value_mapping(
    df=df_articles_test, value_col=token_col_title
)




# Initiate the dataloaders
In the implementations we have decoupled the models from the data. Hence, you should build a dataloader that fits your needs.

Note, with this ```NRMSDataLoader``` the ```eval_mode=False``` is meant for ```model.model.fit()``` whereas ```eval_mode=True``` is meant for ```model.scorer.predict()```. 
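
As an illustration of the lookup such a dataloader has to perform, the sketch below maps article ids to their token encodings and falls back to an all-zero encoding for ids that are missing from the mapping, which is how we read `unknown_representation="zeros"`. The function `lookup_tokens` and the toy mapping are hypothetical; `NRMSDataLoader` itself may be implemented differently.

```python
from typing import Dict, List


def lookup_tokens(
    article_ids: List[int], article_dict: Dict[int, List[int]], title_length: int
) -> List[List[int]]:
    """Map each article id to its token ids; unknown ids become all zeros."""
    zeros = [0] * title_length
    return [article_dict.get(article_id, zeros) for article_id in article_ids]


# Toy mapping with a title length of 4 tokens (invented values).
toy_mapping = {9778657: [5, 17, 42, 0], 9778728: [8, 3, 0, 0]}
encoded = lookup_tokens([9778657, 123, 9778728], toy_mapping, title_length=4)
print(encoded)  # [[5, 17, 42, 0], [0, 0, 0, 0], [8, 3, 0, 0]]
```

The same lookup would be applied both to the history column and to the in-view candidates, so that every id in a batch resolves to a vector of equal length.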

In [16]:
# Initialize DataLoaders for train and validation
print("Initializing train and validation dataloaders...")

Initializing train and validation dataloaders...


In [17]:
BATCH_SIZE = 8 # try with 64
df_train_subset = df_train[:1000] 
df_val_subset = df_validation[:1000]  
train_dataloader = NRMSDataLoader(
    behaviors=df_train_subset,
    article_dict=article_mapping_train,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=False,
    batch_size=BATCH_SIZE,
)
val_dataloader = NRMSDataLoader(
    behaviors=df_val_subset,
    article_dict=article_mapping_train,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=False,
    batch_size=BATCH_SIZE,
)

## Train the model


In [18]:
# List all physical devices
physical_devices = tf.config.list_physical_devices()
print("Available devices:", physical_devices)

Available devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


Initiate the NRMS-model:

In [19]:
model = NRMSModel(
    hparams=hparams_nrms,
    word2vec_embedding=word2vec_embedding,
    seed=42,
)
model.model.compile(
    optimizer=model.model.optimizer,
    loss=model.model.loss,
    metrics=["AUC"],
)

MODEL_NAME = model.__class__.__name__
MODEL_WEIGHTS = DUMP_DIR.joinpath(f"state_dict/{MODEL_NAME}/weights")
LOG_DIR = DUMP_DIR.joinpath(f"runs/{MODEL_NAME}")




### Callbacks
We will add some callbacks to model training.

In [38]:
from pathlib import Path
from tensorflow.keras.callbacks import ModelCheckpoint

# Define paths
DUMP_DIR = Path(r"C:\Users\antot\Downloads\ebnerd-benchmark\examples").expanduser()
DUMP_DIR.mkdir(exist_ok=True, parents=True)

MODEL_NAME = model.__class__.__name__
MODEL_WEIGHTS = DUMP_DIR.joinpath(f"state_dict/{MODEL_NAME}/weights")
LOG_DIR = DUMP_DIR.joinpath(f"runs/{MODEL_NAME}")

# Ensure directory for weights exists
MODEL_WEIGHTS.parent.mkdir(parents=True, exist_ok=True)

# Compile the model
model = NRMSModel(
    hparams=hparams_nrms,
    word2vec_embedding=word2vec_embedding,
    seed=42,
)
model.model.compile(
    optimizer=model.model.optimizer,
    loss=model.model.loss,
    metrics=["AUC"],
)

# Define checkpoint callback
checkpoint_callback = ModelCheckpoint(
    filepath=str(MODEL_WEIGHTS),
    save_weights_only=True,
    monitor='val_auc',
    mode='max',
    save_best_only=True,
    verbose=1
)

In [21]:
# model = NRMSModel
# MODEL_NAME = model.__class__.__name__


## Train and store the weights

In [39]:
import time
import tensorflow as tf

# Learning rate scheduler
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_auc",  # Monitor validation AUC
    mode="max",         # Maximize AUC
    factor=0.2,         # Reduce learning rate by 80%
    patience=2,         # Wait for 2 epochs with no improvement
    min_lr=1e-5         # Set a minimum learning rate
)

# Use callbacks if enabled
USE_CALLBACKS = True
callbacks = [lr_scheduler] if USE_CALLBACKS else []

# Training loop
EPOCHS = 1  # Adjust to desired number of epochs
for epoch in range(EPOCHS):
    start_time = time.time()
    print(f"Starting Epoch {epoch + 1}/{EPOCHS}")

    # Train the model for one epoch
    model.model.fit(
        train_dataloader,              # Training data
        validation_data=val_dataloader,  # Validation data
        epochs=1,                       # One epoch at a time
        callbacks=callbacks,            # Use callbacks if enabled
        verbose=1                       # Display progress
    )

    # Measure epoch duration
    epoch_time = time.time() - start_time
    print(f"Epoch {epoch + 1} completed in {epoch_time:.2f} seconds")

# Save weights after training
MODEL_WEIGHTS.parent.mkdir(parents=True, exist_ok=True)  # Ensure directory exists
model.model.save_weights(MODEL_WEIGHTS)
print(f"Model weights saved at: {MODEL_WEIGHTS}")


Starting Epoch 1/1
Epoch 1 completed in 261.31 seconds
Model weights saved at: C:\Users\antot\Downloads\ebnerd-benchmark\examples\state_dict\NRMSModel\weights
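
To see what the scheduler settings imply, the following sketch simulates a simplified plateau rule on a made-up sequence of validation AUC values. It is an illustration only and not the Keras implementation: it ignores options such as `cooldown` and `min_delta`, and both the function name `simulate_plateau_schedule` and the AUC values are invented.

```python
from typing import List


def simulate_plateau_schedule(
    val_auc: List[float],
    lr: float,
    factor: float = 0.2,
    patience: int = 2,
    min_lr: float = 1e-5,
) -> List[float]:
    """Return the learning rate in effect after each epoch (mode='max')."""
    best = float("-inf")
    wait = 0
    history = []
    for auc in val_auc:
        if auc > best:
            # Improvement: remember the new best value and reset the counter.
            best, wait = auc, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


lrs = simulate_plateau_schedule([0.60, 0.62, 0.61, 0.62, 0.63, 0.63], lr=1e-4)
print(lrs)  # the rate drops from 1e-4 to about 2e-5 at the fourth epoch
```

One thing worth checking in your own setup: the training loop above calls `fit` with `epochs=1` on every iteration, and Keras callbacks generally re-initialise their counters when `fit` starts, so a `patience` of 2 may never be reached unless `fit` is called once with `epochs=EPOCHS`.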


In [23]:
if USE_CALLBACKS:
    _ = model.model.load_weights(filepath=MODEL_WEIGHTS)




# Example: how to compute some metrics

In [40]:
df_test = (
    ebnerd_from_path(
        PATH.joinpath("ebnerd_testset", "test")
    )
    .sample(fraction=FRACTION)
)

print(f"Test samples: {df_test.height}")
print("Test Data Sample:")
print(df_test.head(2))


Test samples: 135367
Test Data Sample:
shape: (2, 15)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ impressio ┆ impressio ┆ read_time ┆ scroll_pe ┆ … ┆ is_subscr ┆ session_i ┆ is_beyond ┆ article_ │
│ n_id      ┆ n_time    ┆ ---       ┆ rcentage  ┆   ┆ iber      ┆ d         ┆ _accuracy ┆ id_fixed │
│ ---       ┆ ---       ┆ f32       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│ u32       ┆ datetime[ ┆           ┆ f32       ┆   ┆ bool      ┆ u32       ┆ bool      ┆ list[i32 │
│           ┆ μs]       ┆           ┆           ┆   ┆           ┆           ┆           ┆ ]        │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 508798591 ┆ 2023-06-0 ┆ 15.0      ┆ null      ┆ … ┆ false     ┆ 35722524  ┆ false     ┆ [9782438 │
│           ┆ 1         ┆           ┆           ┆   ┆           ┆           ┆           ┆ ,        │
│           ┆ 13:10:44  ┆           ┆

Note: this is what the original author uses; it is not ours. We need a function that creates an empty label column and then, once the scores are computed, sets the entry with the highest score to 1. It is probably defined somewhere in the utils or the dataloaders.

In [41]:
df_test = (
    ebnerd_from_path(PATH.joinpath("ebnerd_testset", "test"), history_size=HISTORY_SIZE)
    .sample(fraction=FRACTION_TEST)
    .with_columns(
        pl.col(DEFAULT_INVIEW_ARTICLES_COL)
        .list.first()
        .alias(DEFAULT_CLICKED_ARTICLES_COL)
    )
    .select(COLUMNS + [DEFAULT_IS_BEYOND_ACCURACY_COL])
    .with_columns(
        pl.col(DEFAULT_INVIEW_ARTICLES_COL)
        .list.eval(pl.element() * 0)
        .alias(DEFAULT_LABELS_COL)
    )
)
df_test.head()

user_id,impression_id,impression_time,article_id_fixed,article_ids_clicked,article_ids_inview,is_beyond_accuracy,labels
u32,u32,datetime[μs],list[i32],i32,list[i32],bool,list[i32]
35982,6451339,2023-06-05 15:02:49,"[9786268, 9782806, … 9789494]",9796527,"[9796527, 7851321, … 9492777]",False,"[0, 0, … 0]"
36012,6451363,2023-06-05 15:03:56,"[9788323, 9788362, … 9790885]",9798532,"[9798532, 9791602, … 9798958]",False,"[0, 0, … 0]"
36162,6451382,2023-06-05 15:25:53,"[9788524, 9788106, … 9790700]",9798498,"[9798498, 9793856, … 9798724]",False,"[0, 0, … 0]"
36162,6451383,2023-06-05 15:26:35,"[9788524, 9788106, … 9790700]",9797419,"[9797419, 9798829, … 9798805]",False,"[0, 0, … 0]"
36162,6451385,2023-06-05 15:26:14,"[9788524, 9788106, … 9790700]",9785014,"[9785014, 9798958, … 9486080]",False,"[0, 0, … 0]"


In [42]:
df_test["is_beyond_accuracy"]

shape: (13_536_710,)
Series: 'is_beyond_accuracy' [bool]
[
	false
	false
	false
	false
	false
	false
	false
	false
	false
	false
	false
	false
	…
	true
	true
	true
	true
	true
	true
	true
	true
	true
	true
	true
	true
	true
]

Caution: the next cell takes only a subset of `df_test`.

In [27]:
df_test= df_test[:10000]

I split the data here so that I can run the test.

In [43]:
import polars as pl

# Assume df_test is already defined
df_test = df_test[:1000]  # Restrict to the first 1,000 rows

# Split 500 rows for each case
df_false = df_test[:500].with_columns(
    pl.lit(False).alias("is_beyond_accuracy")
)

df_true = df_test[500:1000].with_columns(
    pl.lit(True).alias("is_beyond_accuracy")
)

# Combine into a single DataFrame
df_test = pl.concat([df_false, df_true])

# Verify the distribution
print(
    df_test.group_by("is_beyond_accuracy")
    .agg(pl.count().alias("count"))
)


shape: (2, 2)
┌────────────────────┬───────┐
│ is_beyond_accuracy ┆ count │
│ ---                ┆ ---   │
│ bool               ┆ u32   │
╞════════════════════╪═══════╡
│ false              ┆ 500   │
│ true               ┆ 500   │
└────────────────────┴───────┘



In [44]:
# Filter rows into two subsets
df_test_wo_beyond = df_test.filter(~pl.col("is_beyond_accuracy"))
df_test_w_beyond = df_test.filter(pl.col("is_beyond_accuracy"))

# Verify the split
print("Rows without beyond accuracy (False):", df_test_wo_beyond.shape[0])
print("Rows with beyond accuracy (True):", df_test_w_beyond.shape[0])


Rows without beyond accuracy (False): 500
Rows with beyond accuracy (True): 500


In [30]:
# df_test_w_beyond

In [45]:
from ebrec.utils._polars import split_df_chunks



df_test_chunks = split_df_chunks(df_test_wo_beyond, n_chunks=N_CHUNKS_TEST)
df_pred_test_wo_beyond = []

In [46]:
BATCH_SIZE_TRAIN = 32
BATCH_SIZE_VAL = 32
BATCH_SIZE_TEST_WO_B = 32
BATCH_SIZE_TEST_W_B = 4
N_CHUNKS_TEST = 10
CHUNKS_DONE = 0

In [52]:
import gc
from tensorflow.keras.backend import clear_session
TEST_DF_DUMP = DUMP_DIR.joinpath("test_predictions", MODEL_NAME)
TEST_DF_DUMP.mkdir(parents=True, exist_ok=True)

df_test_chunks = split_df_chunks(df_test_wo_beyond, n_chunks=N_CHUNKS_TEST)
df_pred_test_wo_beyond = []

for i, df_test_chunk in enumerate(df_test_chunks[CHUNKS_DONE:], start=1 + CHUNKS_DONE):
    print(f"Init test-dataloader: {i}/{len(df_test_chunks)}")
    # Initialize DataLoader
    test_dataloader_wo_b = NRMSDataLoader(
        behaviors=df_test_chunk,
        article_dict=article_mapping_test,
        unknown_representation="zeros",
        history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
        eval_mode=True,
        batch_size=BATCH_SIZE_TEST_WO_B,
    )
    # Predict and clear session
    scores = model.scorer.predict(test_dataloader_wo_b)
    clear_session()

    # Process the predictions
    df_test_chunk = add_prediction_scores(df_test_chunk, scores.tolist()).with_columns(
        pl.col("scores")
        .map_elements(lambda x: list(rank_predictions_by_score(x)))
        .alias("ranked_scores")
    )

    # Save the processed chunk
    df_test_chunk.select(DEFAULT_IMPRESSION_ID_COL, "ranked_scores").write_parquet(
        TEST_DF_DUMP.joinpath(f"pred_wo_ba_{i}.parquet")
    )

    # Append and clean up
    df_pred_test_wo_beyond.append(df_test_chunk)

    # Cleanup
    del df_test_chunk, test_dataloader_wo_b, scores
    gc.collect()

Init test-dataloader: 1/10
Init test-dataloader: 2/10
Init test-dataloader: 3/10
Init test-dataloader: 4/10
Init test-dataloader: 5/10
Init test-dataloader: 6/10
Init test-dataloader: 7/10
Init test-dataloader: 8/10
Init test-dataloader: 9/10
Init test-dataloader: 10/10
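
Predicting chunk by chunk keeps only one dataloader and one score array in memory at a time. The splitting itself can be sketched on a plain list as follows; `split_into_chunks` is a hypothetical stand-in, and `split_df_chunks` may distribute the rows differently.

```python
from typing import List, Sequence


def split_into_chunks(rows: Sequence, n_chunks: int) -> List[list]:
    """Split `rows` into `n_chunks` contiguous parts whose sizes differ by at most one."""
    base, remainder = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks receive one extra row.
        size = base + (1 if i < remainder else 0)
        chunks.append(list(rows[start:start + size]))
        start += size
    return chunks


chunks = split_into_chunks(list(range(500)), n_chunks=10)
print([len(chunk) for chunk in chunks])  # ten chunks of 50 rows each
```

Because the chunks are contiguous and ordered, concatenating the per-chunk predictions restores the original row order.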


In [53]:
import polars as pl

# Concatenate all DataFrame chunks into a single DataFrame
df_pred_test_wo_beyond = pl.concat(df_pred_test_wo_beyond)

# Now you can use the .select() method
df_pred_test_wo_beyond.select(DEFAULT_IMPRESSION_ID_COL, "scores").write_parquet(
    TEST_DF_DUMP.joinpath("pred_wo_ba.parquet")
)

# View the head of the DataFrame
print(df_pred_test_wo_beyond.head(30))


shape: (30, 10)
┌─────────┬────────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ user_id ┆ impression ┆ impressio ┆ article_i ┆ … ┆ is_beyond ┆ labels    ┆ scores    ┆ ranked_sc │
│ ---     ┆ _id        ┆ n_time    ┆ d_fixed   ┆   ┆ _accuracy ┆ ---       ┆ ---       ┆ ores      │
│ u32     ┆ ---        ┆ ---       ┆ ---       ┆   ┆ ---       ┆ list[i32] ┆ list[f64] ┆ ---       │
│         ┆ u32        ┆ datetime[ ┆ list[i32] ┆   ┆ bool      ┆           ┆           ┆ list[i64] │
│         ┆            ┆ μs]       ┆           ┆   ┆           ┆           ┆           ┆           │
╞═════════╪════════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 35982   ┆ 6451339    ┆ 2023-06-0 ┆ [9786268, ┆ … ┆ false     ┆ [0, 0, …  ┆ [0.630941 ┆ [3, 2, …  │
│         ┆            ┆ 5         ┆ 9782806,  ┆   ┆           ┆ 0]        ┆ ,         ┆ 9]        │
│         ┆            ┆ 15:02:49  ┆ …         ┆   ┆           ┆           

In [54]:
print("Init test-dataloader: beyond-accuracy")
test_dataloader_w_b = NRMSDataLoader(
    behaviors=df_test_w_beyond,
    article_dict=article_mapping_test,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=True,
    batch_size=BATCH_SIZE_TEST_W_B,
)

Init test-dataloader: beyond-accuracy


In [55]:
scores = model.scorer.predict(test_dataloader_w_b)
df_pred_test_w_beyond = add_prediction_scores(
    df_test_w_beyond, scores.tolist()
).with_columns(
    pl.col("scores")
    .map_elements(lambda x: list(rank_predictions_by_score(x)))
    .alias("ranked_scores")
)
df_pred_test_w_beyond.select(DEFAULT_IMPRESSION_ID_COL, "ranked_scores").write_parquet(
    TEST_DF_DUMP.joinpath("pred_w_ba.parquet")
)
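
The `ranked_scores` column stores, for every in-view article, its position when the scores are sorted in descending order, so the highest score gets rank 1. A minimal pure-Python sketch of this conversion is shown below; `ranks_from_scores` is a hypothetical stand-in and not the code of `rank_predictions_by_score`, and its handling of ties is an assumption.

```python
from typing import List


def ranks_from_scores(scores: List[float]) -> List[int]:
    """Rank 1 goes to the highest score; ties are broken by original position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


example_ranks = ranks_from_scores([0.92, 0.12, 0.55, 0.97])
print(example_ranks)  # [2, 4, 3, 1]
```

The output list is aligned with the input: position `i` holds the rank of the `i`-th in-view article, not the index of the article at rank `i`.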



In [57]:
# Check the schemas of both DataFrames
print("Schema of df_pred_test_wo_beyond:")
print(df_pred_test_wo_beyond.schema)

print("Schema of df_pred_test_w_beyond:")
print(df_pred_test_w_beyond.schema)

# Align column types
df_pred_test_wo_beyond = df_pred_test_wo_beyond.with_columns(
    [pl.col(column).cast(df_pred_test_w_beyond.schema[column]) for column in df_pred_test_w_beyond.schema]
)

# Combine both DataFrames
df_test = pl.concat([df_pred_test_wo_beyond, df_pred_test_w_beyond])

# Write to Parquet
df_test.select(DEFAULT_IMPRESSION_ID_COL, "ranked_scores").write_parquet(
    TEST_DF_DUMP.joinpath("pred_concat.parquet")
)


Schema of df_pred_test_wo_beyond:
OrderedDict([('user_id', UInt32), ('impression_id', UInt32), ('impression_time', Datetime(time_unit='us', time_zone=None)), ('article_id_fixed', List(Int32)), ('article_ids_clicked', Int32), ('article_ids_inview', List(Int32)), ('is_beyond_accuracy', Boolean), ('labels', List(Int32)), ('scores', List(Float64)), ('ranked_scores', List(Int64))])
Schema of df_pred_test_w_beyond:
OrderedDict([('user_id', UInt32), ('impression_id', UInt32), ('impression_time', Datetime(time_unit='us', time_zone=None)), ('article_id_fixed', List(Int32)), ('article_ids_clicked', Int32), ('article_ids_inview', List(Int32)), ('is_beyond_accuracy', Boolean), ('labels', List(Int32)), ('scores', List(Float64)), ('ranked_scores', List(Int64))])


In [None]:
print(type(df_pred_test_wo_beyond))
print(type(df_pred_test_w_beyond))


<class 'polars.dataframe.frame.DataFrame'>
<class 'polars.dataframe.frame.DataFrame'>


In [58]:
print(df_test)

shape: (1_000, 10)
┌─────────┬────────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ user_id ┆ impression ┆ impressio ┆ article_i ┆ … ┆ is_beyond ┆ labels    ┆ scores    ┆ ranked_sc │
│ ---     ┆ _id        ┆ n_time    ┆ d_fixed   ┆   ┆ _accuracy ┆ ---       ┆ ---       ┆ ores      │
│ u32     ┆ ---        ┆ ---       ┆ ---       ┆   ┆ ---       ┆ list[i32] ┆ list[f64] ┆ ---       │
│         ┆ u32        ┆ datetime[ ┆ list[i32] ┆   ┆ bool      ┆           ┆           ┆ list[i64] │
│         ┆            ┆ μs]       ┆           ┆   ┆           ┆           ┆           ┆           │
╞═════════╪════════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 35982   ┆ 6451339    ┆ 2023-06-0 ┆ [9786268, ┆ … ┆ false     ┆ [0, 0, …  ┆ [0.630941 ┆ [3, 2, …  │
│         ┆            ┆ 5         ┆ 9782806,  ┆   ┆           ┆ 0]        ┆ ,         ┆ 9]        │
│         ┆            ┆ 15:02:49  ┆ …  

In [59]:
import polars as pl
import numpy as np

# Build a placeholder 'labels' column with the same length as 'ranked_scores': 1 for the
# top-ranked article, 0 otherwise. These labels come from the model's own predictions, not
# from real clicks, so metrics computed against them are trivially perfect.
df_test = df_test.with_columns(
    pl.struct(["ranked_scores", "scores"])
    .map_elements(lambda row: [1 if rank == 1 else 0 for rank in row["ranked_scores"]]
           if len(row["ranked_scores"]) == len(row["scores"]) else None)
    .alias("labels")
)

# Check for rows where labels are None (mismatched lengths)
invalid_rows = df_test.filter(pl.col("labels").is_null())

if invalid_rows.height > 0:
    print("Found rows with mismatched 'ranked_scores' and 'scores':")
    print(invalid_rows)

# Verify the updated 'labels' column
print(df_test.select(["ranked_scores", "labels"]))


shape: (1_000, 2)
┌───────────────┬─────────────┐
│ ranked_scores ┆ labels      │
│ ---           ┆ ---         │
│ list[i64]     ┆ list[i64]   │
╞═══════════════╪═════════════╡
│ [3, 2, … 9]   ┆ [0, 0, … 0] │
│ [4, 7, … 3]   ┆ [0, 0, … 0] │
│ [2, 5, … 3]   ┆ [0, 0, … 0] │
│ [9, 11, … 3]  ┆ [0, 0, … 0] │
│ [2, 6, … 1]   ┆ [0, 0, … 1] │
│ …             ┆ …           │
│ [7, 30, … 35] ┆ [0, 0, … 0] │
│ [20, 5, … 31] ┆ [0, 0, … 0] │
│ [3, 7, … 2]   ┆ [0, 0, … 0] │
│ [2, 5, … 1]   ┆ [0, 0, … 1] │
│ [1, 31, … 34] ┆ [1, 0, … 0] │
└───────────────┴─────────────┘



In [60]:
df_test.head()

user_id,impression_id,impression_time,article_id_fixed,article_ids_clicked,article_ids_inview,is_beyond_accuracy,labels,scores,ranked_scores
u32,u32,datetime[μs],list[i32],i32,list[i32],bool,list[i64],list[f64],list[i64]
35982,6451339,2023-06-05 15:02:49,"[9786268, 9782806, … 9789494]",9796527,"[9796527, 7851321, … 9492777]",False,"[0, 0, … 0]","[0.630941, 0.950646, … 0.008185]","[3, 2, … 9]"
36012,6451363,2023-06-05 15:03:56,"[9788323, 9788362, … 9790885]",9798532,"[9798532, 9791602, … 9798958]",False,"[0, 0, … 0]","[0.104358, 0.001216, … 0.179523]","[4, 7, … 3]"
36162,6451382,2023-06-05 15:25:53,"[9788524, 9788106, … 9790700]",9798498,"[9798498, 9793856, … 9798724]",False,"[0, 0, … 0]","[0.097785, 0.000392, … 0.015833]","[2, 5, … 3]"
36162,6451383,2023-06-05 15:26:35,"[9788524, 9788106, … 9790700]",9797419,"[9797419, 9798829, … 9798805]",False,"[0, 0, … 0]","[0.024571, 0.001634, … 0.580246]","[9, 11, … 3]"
36162,6451385,2023-06-05 15:26:14,"[9788524, 9788106, … 9790700]",9785014,"[9785014, 9798958, … 9486080]",False,"[0, 0, … 1]","[0.918962, 0.120063, … 0.9671]","[2, 6, … 1]"


In [62]:
# Save the first 5 rows to a CSV file for inspection.
# Only the preview is converted to pandas (this needs pandas and pyarrow to be installed),
# so df_test itself stays a polars DataFrame for the cells below.
df_test.head().to_pandas().to_csv("df_test_head.csv", index=False)

print("The first 5 rows of df_test have been saved as 'df_test_head.csv'.")


The first 5 rows of df_test have been saved as 'df_test_head.csv'.


-----------------------------------------------------------------------------

## Add the predictions to the dataframe

This step concatenates the beyond-accuracy and non-beyond-accuracy parts and then makes the predictions.

### Compute metrics

In [164]:
metrics = MetricEvaluator(
    labels=df_test["labels"].to_list(),
    predictions=df_test["scores"].to_list(),
    metric_functions=[AucScore(), MrrScore(), NdcgScore(k=5), NdcgScore(k=10)],
)
metrics.evaluate()

AUC: 100%|█████████████████████████████████| 1000/1000 [00:02<00:00, 367.36it/s]
AUC: 100%|███████████████████████████████| 1000/1000 [00:00<00:00, 17145.15it/s]
AUC: 100%|████████████████████████████████| 1000/1000 [00:00<00:00, 5578.18it/s]
AUC: 100%|████████████████████████████████| 1000/1000 [00:00<00:00, 8029.14it/s]


<MetricEvaluator class>: 
 {
    "auc": 1.0,
    "mrr": 1.0,
    "ndcg@5": 1.0,
    "ndcg@10": 1.0
}
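
To make these numbers easier to interpret, here is a small worked example of MRR and nDCG@k for a single impression. The helpers `mrr` and `ndcg_at_k` are hypothetical sketches with binary relevance and a log2 discount; the metric classes in `ebrec.evaluation` may use different definitions. The example also shows why a label placed on the model's own top-scored article always yields 1.0.

```python
import math
from typing import List


def mrr(labels: List[int], scores: List[float]) -> float:
    """Reciprocal rank of the highest-ranked relevant item (0.0 if none)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    for position, index in enumerate(order, start=1):
        if labels[index] == 1:
            return 1.0 / position
    return 0.0


def ndcg_at_k(labels: List[int], scores: List[float], k: int) -> float:
    """nDCG@k with binary gains and a log2 position discount."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    dcg = sum(labels[i] / math.log2(pos + 1) for pos, i in enumerate(order, start=1))
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


scores = [0.10, 0.70, 0.20]
true_labels = [0, 0, 1]  # the clicked article was ranked second by the model
self_labels = [0, 1, 0]  # label placed on the model's own top score
print(mrr(true_labels, scores), ndcg_at_k(true_labels, scores, k=5))
print(mrr(self_labels, scores), ndcg_at_k(self_labels, scores, k=5))  # 1.0 1.0
```

With real click labels the first line gives an MRR of 0.5 and an nDCG@5 of about 0.63, whereas self-derived labels can only ever produce perfect scores.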

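Note that these metrics are computed against labels derived from the model's own top-ranked prediction (the test set has no true labels), so the perfect scores are expected and not informative. For meaningful metrics, evaluate on the validation set, where the true labels are available, and simply add the test set to your flow for the submission file.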

In [28]:
# DATASPLIT is only defined in a commented-out cell above, so it is set here.
DATASPLIT = "ebnerd_small"
write_submission_file(
    impression_ids=df_test[DEFAULT_IMPRESSION_ID_COL],
    prediction_scores=df_test["ranked_scores"],
    path=DUMP_DIR.joinpath("predictions.txt"),
    filename_zip=f"{DATASPLIT}_predictions-{MODEL_NAME}.zip",
)

0it [00:00, ?it/s]

2446it [00:00, 27609.70it/s]

Zipping ebnerd_predictions/predictions.txt to ebnerd_predictions/ebnerd_small_predictions-NRMSModel.zip





# DONE 🚀