<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# DKN: Deep Knowledge-Aware Network for News Recommendation

DKN \[1\] is a deep learning model that incorporates information from a knowledge graph for better news recommendation. Specifically, DKN uses the TransX \[2\] family of methods for knowledge graph representation learning, then applies a CNN framework, named KCNN, to combine entity embeddings with word embeddings and generate a final embedding vector for a news article. Click-through rate (CTR) prediction is made via an attention-based neural scorer.

## Properties of DKN:

- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. 
- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knowledge-level representations of news articles.
- DKN uses an attention module to dynamically calculate a user's aggregated historical representation.


## Data format

DKN takes several files as input as follows:

- **training / validation / test files**: each line in these files represents one instance. `impressionid` is used to evaluate performance within an impression session, so it is only needed for evaluation; you can set it to 0 for training data. The format is: <br> 
`[label] [userid] [CandidateNews]%[impressionid] `<br> 
e.g., `1 train_U1 N1%0` <br> 

- **user history file**: each line in this file represents a user's click history. Set the `history_size` parameter in the config file to the maximum number of clicks kept per user. If a user has more than `history_size` clicks, only the last `history_size` are kept; if a user has fewer, the history is padded with 0. The format is: <br> 
`[Userid] [newsid1,newsid2...]`<br>
e.g., `train_U1 N1,N2` <br> 

- **document feature file**: It contains the word and entity features for news articles. News articles are represented by aligned title words and title entities. For example, for the news title <i>"Trump to deliver State of the Union address next week"</i>, the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entities value may be `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of the entity vector is non-zero, corresponding to the word "Trump". Word and entity values are hashed to integers from 1 to `n` (where `n` is the number of distinct words or entities). Each feature has a fixed length k (the `doc_size` parameter): if a document has more than k words, truncate it to k words; if it has fewer, pad with 0 at the end. 
The format is: <br> 
`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]`

- **word embedding / entity embedding / context embedding files**: These are `*.npy` files of pretrained embeddings. After loading, each file is a two-dimensional matrix of shape `[n+1, k]`, where n is the number of words (or entities) in the corresponding hash dictionary and k is the embedding dimension. Row 0 is reserved for zero padding. 

In this experiment, we used GloVe\[4\] vectors to initialize the word embeddings. We trained entity embeddings using TransE\[2\] on the knowledge graph; the context embedding of an entity is the average of the embeddings of its neighbors in the knowledge graph.<br>
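
The input formats above can be sketched with a few small helpers. This is only an illustration of the described conventions, not the reader used by any library; the function names `parse_instance`, `fit_history` and `fit_doc_features` are hypothetical, and padding the history at the end is an assumption (the text only says that short histories are padded with 0).

```python
def parse_instance(line):
    """Parse '[label] [userid] [CandidateNews]%[impressionid]'."""
    label, user_id, rest = line.strip().split(" ")
    news_id, impression_id = rest.split("%")
    return int(label), user_id, news_id, impression_id


def fit_history(clicks, history_size, pad=0):
    """Keep the last `history_size` clicks; pad with `pad` if too short.

    Assumes history_size >= 1 and that padding goes at the end.
    """
    kept = list(clicks[-history_size:])
    return kept + [pad] * (history_size - len(kept))


def fit_doc_features(ids, doc_size):
    """Truncate to the first `doc_size` ids, or pad with 0 at the end."""
    kept = list(ids[:doc_size])
    return kept + [0] * (doc_size - len(kept))


print(parse_instance("1 train_U1 N1%0"))
print(fit_history(["N1", "N2", "N3"], history_size=2))
print(fit_doc_features([34, 45, 334], doc_size=5))
```

Note the asymmetry: histories keep their most recent entries, whereas document features keep the leading words of the title.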

## MIND dataset

The MIND dataset\[3\] is a large-scale English news dataset. It was collected from anonymized behavior logs of the Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression.

[MIND-small](https://azure.microsoft.com/en-us/services/open-datasets/catalog/microsoft-news-dataset/) is a smaller version of the MIND dataset, built by randomly sampling 50,000 users and their behavior logs from the MIND dataset.

The dataset contains these files for both training and validation data:

#### behaviors.tsv

The behaviors.tsv file contains the impression logs and users' news click histories. It has 5 columns separated by tabs:

+ Impression ID. The ID of an impression.
+ User ID. The anonymous ID of a user.
+ Time. The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
+ History. The news click history (ID list of clicked news) of this user before this impression.
+ Impressions. List of news displayed in this impression and user's click behaviors on them (1 for click and 0 for non-click).

One simple example: 

`1    U82271    11/11/2019 3:28:58 PM    N3130 N11621 N12917 N4574 N12140 N9748    N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0 `

#### news.tsv

The news.tsv file contains detailed information about the news articles involved in the behaviors.tsv file. It has 8 columns, separated by tabs:

+ News ID
+ Category
+ SubCategory
+ Title
+ Abstract
+ URL
+ Title Entities (entities contained in the title of this news)
+ Abstract Entities (entities contained in the abstract of this news)

One simple example: 

`N46466    lifestyle    lifestyleroyals    The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By    Shop the notebooks, jackets, and more that the royals can't live without.    https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata    [{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]    [] `
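
The entity columns are JSON lists, so they need to be decoded before the Wikidata IDs can be used. The helper below is a minimal sketch; the name `extract_wikidata_ids` is hypothetical, and ordering the IDs by their first occurrence offset is a design choice made here, not something the dataset prescribes.

```python
import json


def extract_wikidata_ids(entities_json):
    """Return the WikidataIds of one entity column, in reading order."""
    # Missing values arrive from pandas as NaN (a float), not a string.
    if not isinstance(entities_json, str) or not entities_json.strip():
        return []
    entities = json.loads(entities_json)
    # Sort by the earliest character offset so ids follow the text order.
    entities.sort(key=lambda e: min(e.get("OccurrenceOffsets") or [0]))
    return [e["WikidataId"] for e in entities]


sample = (
    '[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", '
    '"WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], '
    '"SurfaceForms": ["Prince Philip"]}, '
    '{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", '
    '"Confidence": 0.97, "OccurrenceOffsets": [11], '
    '"SurfaceForms": ["Queen Elizabeth"]}]'
)
print(extract_wikidata_ids(sample))  # ['Q9682', 'Q80976']
```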

#### entity_embedding.vec & relation_embedding.vec

The entity_embedding.vec and relation_embedding.vec files contain the 100-dimensional embeddings of the entities and relations, learned with the TransE method from a subgraph of the WikiData knowledge graph. In both files, the first column is the ID of the entity/relation, and the other columns are the embedding vector values.

One simple example: 

`Q42306013  0.014516 -0.106958 0.024590 ... -0.080382`
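
As a rough sketch, a `.vec` file in this layout can be read into a dictionary, and a context embedding can then be computed as the average of the neighbors' vectors, as described in the data format section. The `neighbors` mapping is a hypothetical input that would have to be built separately from knowledge graph triples; the toy IDs and 3-dimensional vectors below are made up.

```python
import io

import numpy as np


def load_vec(handle):
    """Read '<id> v1 v2 ...' lines into {id: float32 vector}."""
    vectors = {}
    for line in handle:
        parts = line.split()  # splits on tabs and spaces alike
        if len(parts) < 2:
            continue  # skip blank lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def context_embedding(entity_id, neighbors, vectors):
    """Average the embeddings of an entity's neighbors that have a vector."""
    found = [vectors[n] for n in neighbors.get(entity_id, []) if n in vectors]
    if not found:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim, dtype="float32")  # no known neighbors
    return np.mean(found, axis=0)


# Toy data standing in for entity_embedding.vec.
toy_file = io.StringIO("Q1\t0.0 0.0 0.0\nQ2\t1.0 2.0 3.0\nQ3\t3.0 2.0 1.0\n")
vectors = load_vec(toy_file)
neighbors = {"Q1": ["Q2", "Q3"]}
print(context_embedding("Q1", neighbors, vectors))
```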


## DKN architecture

The following figure shows the architecture of DKN.

![](https://recodatasets.z20.web.core.windows.net/images/dkn_architecture.png)

DKN takes one piece of candidate news and one piece of a user’s clicked news as input. For each piece of news, a specially designed KCNN is used to process its title and generate an embedding vector. KCNN is an extension of traditional CNN that allows flexibility in incorporating symbolic knowledge from a knowledge graph into sentence representation learning. 
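
The idea of treating word and entity embeddings as aligned channels can be illustrated in plain NumPy. This is a simplified sketch under stated assumptions, not the published KCNN: it uses two channels and a single window size, and it omits the context-embedding channel suggested by the input files listed earlier. The function name `kcnn_encode` and all shapes are illustrative.

```python
import numpy as np


def kcnn_encode(word_emb, entity_emb, filters, bias):
    """Encode one title from position-aligned word and entity embeddings.

    word_emb, entity_emb: (n_words, dim)
    filters: (n_filters, window, dim, 2); bias: (n_filters,)
    Returns a (n_filters,) title vector.
    """
    x = np.stack([word_emb, entity_emb], axis=-1)  # (n_words, dim, 2)
    n_filters, window = filters.shape[0], filters.shape[1]
    n_pos = x.shape[0] - window + 1
    feature_maps = np.empty((n_filters, n_pos))
    for p in range(n_pos):
        patch = x[p:p + window]  # (window, dim, 2)
        # Contract each filter with the patch over window, dim and channel.
        feature_maps[:, p] = np.tensordot(filters, patch, axes=3) + bias
    # ReLU followed by max-over-time pooling.
    return np.maximum(feature_maps, 0.0).max(axis=1)


rng = np.random.default_rng(0)
title_vec = kcnn_encode(
    word_emb=rng.normal(size=(10, 8)),
    entity_emb=rng.normal(size=(10, 8)),
    filters=rng.normal(size=(4, 3, 8, 2)),
    bias=np.zeros(4),
)
print(title_vec.shape)  # (4,)
```

Words without an entity simply contribute a zero vector in the entity channel, which matches the zero-padded entity features described in the data format section.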

With the KCNN, we obtain a set of embedding vectors for a user’s clicked history. To get the final embedding of the user with respect to the current candidate news, we use an attention-based method to automatically match the candidate news to each piece of their clicked news, and aggregate the user’s historical interests with different weights. The candidate news embedding and the user embedding are concatenated and fed into a deep neural network (DNN) to calculate the predicted probability that the user will click the candidate news.
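
The aggregation step can be written out in a few lines of NumPy. The sketch below assumes dot-product scoring between the candidate and each clicked article, which is a simplification chosen for illustration; the text above only specifies that the method is attention-based. The function name `user_embedding` is hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def user_embedding(clicked, candidate):
    """clicked: (n_clicked, dim) title vectors; candidate: (dim,)."""
    scores = clicked @ candidate  # one relevance score per clicked article
    weights = softmax(scores)     # attention weights, summing to 1
    return weights @ clicked, weights


clicked = np.array([[1.0, 0.0], [0.0, 1.0]])
candidate = np.array([2.0, 0.0])
user_vec, weights = user_embedding(clicked, candidate)

# The DNN input is the concatenation of user and candidate embeddings.
dnn_input = np.concatenate([user_vec, candidate])
print(weights, dnn_input.shape)
```

Because the weights depend on the candidate, the same click history yields a different user vector for each candidate article.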

## Global settings and imports

In [2]:
# Basic environment setup
import os
import sys
import zipfile
import shutil
import json
import numpy as np
import pandas as pd
import tensorflow as tf

# Set paths
BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, "mind-dkn")

# Confirm TensorFlow version
print("Python version:", sys.version)
print("TensorFlow version:", tf.__version__)


Python version: 3.11.9 (main, Apr 17 2025, 20:47:49) [Clang 17.0.0 (clang-1700.0.13.3)]
TensorFlow version: 2.15.1


In [3]:
# Paths to your local MIND dataset
NEWS_PATH = "/Users/anuj/Downloads/MINDsmall_train/news.tsv"
BEHAVIORS_PATH = "/Users/anuj/Downloads/MINDsmall_train/behaviors.tsv"

# Load news data
news_df = pd.read_csv(NEWS_PATH, sep='\t', header=None,
                      names=['NewsID', 'Category', 'SubCategory', 'Title', 'Abstract', 'URL', 'TitleEntities', 'AbstractEntities'])

# Load behaviors data
behaviors_df = pd.read_csv(BEHAVIORS_PATH, sep='\t', header=None,
                           names=['ImpressionID', 'UserID', 'Time', 'History', 'Impressions'])

# Preview
print("News samples:")
print(news_df.head(2))
print("\nBehaviors samples:")
print(behaviors_df.head(2))


News samples:
   NewsID   Category      SubCategory  \
0  N55528  lifestyle  lifestyleroyals   
1  N19639     health       weightloss   

                                               Title  \
0  The Brands Queen Elizabeth, Prince Charles, an...   
1                      50 Worst Habits For Belly Fat   

                                            Abstract  \
0  Shop the notebooks, jackets, and more that the...   
1  These seemingly harmless habits are holding yo...   

                                             URL  \
0  https://assets.msn.com/labs/mind/AAGH0ET.html   
1  https://assets.msn.com/labs/mind/AAB19MK.html   

                                       TitleEntities  \
0  [{"Label": "Prince Philip, Duke of Edinburgh",...   
1  [{"Label": "Adipose tissue", "Type": "C", "Wik...   

                                    AbstractEntities  
0                                                 []  
1  [{"Label": "Adipose tissue", "Type": "C", "Wik...  

Behaviors samples:
   Impression

In [4]:
def generate_training_data(behaviors_df):
    data = []

    for _, row in behaviors_df.iterrows():
        history = row['History']
        history_list = history.split() if isinstance(history, str) else []

        impressions = row['Impressions'].split()
        for imp in impressions:
            if '-' not in imp:
                continue
            news_id, label = imp.split('-')
            label = int(label)
            data.append((history_list, news_id, label))

    return data

# Generate training samples
training_samples = generate_training_data(behaviors_df)
print(f"Total training samples: {len(training_samples)}")
print("Example sample (history, candidate, label):")
print(training_samples[0])


Total training samples: 5843444
Example sample (history, candidate, label):
(['N55189', 'N42782', 'N34694', 'N45794', 'N18445', 'N63302', 'N10414', 'N19347', 'N31801'], 'N55689', 1)


## Data preparation

In this example, we walk through a real case of applying DKN to a raw news dataset from the very beginning. We start from a copy of the open-source MIND dataset in its original raw format, then process the raw data files into the input format described above.

In [5]:
import re
from collections import Counter

def tokenize_title(title):
    # Basic lowercase + remove punctuation
    return re.findall(r"\w+", title.lower())

# Build vocab from titles
word_counter = Counter()

news_title_dict = {}
for _, row in news_df.iterrows():
    news_id = row['NewsID']
    tokens = tokenize_title(row['Title'])
    news_title_dict[news_id] = tokens
    word_counter.update(tokens)

# Show most common words
print("Most common tokens:", word_counter.most_common(10))


Most common tokens: [('to', 14168), ('in', 13332), ('the', 11661), ('s', 9499), ('of', 8099), ('for', 7642), ('a', 6109), ('on', 5325), ('and', 4892), ('with', 3737)]


In [6]:
ENTITY_EMBEDDINGS_PATH = "/Users/anuj/Downloads/MINDsmall_train/entity_embedding.vec"
EMBEDDING_DIM = 100  # Update this if entity_embedding.vec uses a different dimension

def load_entity_embeddings(embedding_path, word_counter, embedding_dim=100, min_freq=1):
    vocab = [word for word, freq in word_counter.items() if freq >= min_freq]
    word2idx = {word: idx + 1 for idx, word in enumerate(vocab)}  # Reserve 0 for padding
    idx2word = {idx: word for word, idx in word2idx.items()}

    embeddings_index = {}
    with open(embedding_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) <= embedding_dim:
                continue  # Skip malformed lines
            word = parts[0]
            vector = np.asarray(parts[1:], dtype='float32')
            embeddings_index[word] = vector

    # Create embedding matrix
    embedding_matrix = np.zeros((len(word2idx) + 1, embedding_dim))
    for word, idx in word2idx.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[idx] = embedding_vector
        else:
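            # Random init for tokens missing from the embedding file. Caution:
            # the file is keyed by Wikidata IDs (e.g. "Q42306013") while `vocab`
            # holds lowercased title tokens, so lookups are expected to miss and
            # rows fall back to this random initialization.
            embedding_matrix[idx] = np.random.normal(scale=0.6, size=(embedding_dim,))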

    return word2idx, idx2word, embedding_matrix

# Assuming `word_counter` is defined earlier in your code
word2idx, idx2word, embedding_matrix = load_entity_embeddings(ENTITY_EMBEDDINGS_PATH, word_counter, EMBEDDING_DIM)

print(f"Vocabulary size: {len(word2idx)}")
print(f"Embedding matrix shape: {embedding_matrix.shape}")


Vocabulary size: 31023
Embedding matrix shape: (31024, 100)


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_TITLE_LEN = 20  # As used in the original DKN setup

# Convert titles to sequences of word indices
def encode_titles(news_title_dict, word2idx, max_len=MAX_TITLE_LEN):
    news_title_encoded = {}

    for news_id, tokens in news_title_dict.items():
        indices = [word2idx.get(word, 0) for word in tokens]  # 0 is for OOV/padding
        padded_indices = pad_sequences([indices], maxlen=max_len, padding='post', truncating='post')[0]
        news_title_encoded[news_id] = padded_indices

    return news_title_encoded

news_title_encoded = encode_titles(news_title_dict, word2idx)

# Check a sample
sample_id = list(news_title_encoded.keys())[0]
print(f"News ID: {sample_id}")
print(f"Encoded title: {news_title_encoded[sample_id]}")


News ID: N55528
Encoded title: [ 1  2  3  4  5  6  7  5  8  9 10  0  0  0  0  0  0  0  0  0]



## Create hyper-parameters

In [None]:
MAX_HISTORY_LEN = 50  # As per DKN

user_histories = []
candidate_titles = []
labels = []
for history, candidate, label in training_samples:
    # Encode user history as list of title sequences
    encoded_history = [news_title_encoded.get(news_id, np.zeros(MAX_TITLE_LEN, dtype=int))
                       for news_id in history]

    # Pad/truncate to MAX_HISTORY_LEN
    if len(encoded_history) < MAX_HISTORY_LEN:
        pad_len = MAX_HISTORY_LEN - len(encoded_history)
        encoded_history += [np.zeros(MAX_TITLE_LEN, dtype=int)] * pad_len
    else:
        encoded_history = encoded_history[:MAX_HISTORY_LEN]

    # Convert to numpy array
    encoded_history = np.array(encoded_history)

    # Encode candidate
    encoded_candidate = news_title_encoded.get(candidate, np.zeros(MAX_TITLE_LEN, dtype=int))

    user_histories.append(encoded_history)
    candidate_titles.append(encoded_candidate)
    labels.append(label)

user_histories = np.array(user_histories)
candidate_titles = np.array(candidate_titles)
labels = np.array(labels)

print("user_histories shape:", user_histories.shape)  # (num_samples, 50, 20)
print("candidate_titles shape:", candidate_titles.shape)  # (num_samples, 20)
print("labels shape:", labels.shape)  # (num_samples,)
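
With the 5,843,444 samples reported above, the full `user_histories` array holds about 5.84 billion integers (5,843,444 × 50 × 20), which may not fit in memory. One possible alternative, sketched here under the assumption that batches can be built on demand, is a generator that encodes one batch at a time. The name `batch_generator` is hypothetical, and the toy lookup below stands in for `news_title_encoded`.

```python
import numpy as np


def batch_generator(samples, title_lookup, batch_size, max_history_len, max_title_len):
    """Yield ((histories, candidates), labels) one batch at a time."""
    empty = np.zeros(max_title_len, dtype=np.int32)
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        hist = np.zeros((len(chunk), max_history_len, max_title_len), dtype=np.int32)
        cand = np.zeros((len(chunk), max_title_len), dtype=np.int32)
        labels = np.zeros(len(chunk), dtype=np.float32)
        for i, (history, candidate, label) in enumerate(chunk):
            # Same truncation as the cell above: keep the first entries.
            for j, news_id in enumerate(history[:max_history_len]):
                hist[i, j] = title_lookup.get(news_id, empty)
            cand[i] = title_lookup.get(candidate, empty)
            labels[i] = label
        yield (hist, cand), labels


# Toy data: three samples, titles of length 3, histories of length 2.
toy_samples = [(["N1", "N2"], "N3", 1), ([], "N1", 0), (["N9"], "N2", 1)]
toy_lookup = {
    "N1": np.array([1, 1, 0]),
    "N2": np.array([2, 0, 0]),
    "N3": np.array([3, 3, 3]),
}
for (hist, cand), labels in batch_generator(toy_samples, toy_lookup, 2, 2, 3):
    print(hist.shape, cand.shape, labels)
```

Such a generator could be wrapped with `tf.data.Dataset.from_generator` or a similar mechanism before being passed to training; that wiring is left out here.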


## Train the DKN model

In [None]:
import tensorflow as tf
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dot, Activation, Concatenate, Dropout, Softmax, Multiply, Lambda

# Constants
MAX_HISTORY_LEN = 50
MAX_TITLE_LEN = 20
EMBEDDING_DIM = embedding_matrix.shape[1]
FILTER_NUM = 100
WINDOW_SIZE = 3

# Shared embedding + CNN encoder for titles
def build_title_encoder():
    title_input = Input(shape=(MAX_TITLE_LEN,), dtype='int32')  # (None, 20)
    
    embedding_layer = Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=EMBEDDING_DIM,
        weights=[embedding_matrix],
        input_length=MAX_TITLE_LEN,
        trainable=False
    )

    embedded_title = embedding_layer(title_input)  # (None, 20, 100)
    
    conv = Conv1D(filters=FILTER_NUM, kernel_size=WINDOW_SIZE, activation='relu')(embedded_title)
    pooled = GlobalMaxPooling1D()(conv)  # (None, 100)

    model = Model(inputs=title_input, outputs=pooled)
    return model

# Instantiate encoder
title_encoder = build_title_encoder()

# Input layers
user_history_input = Input(shape=(MAX_HISTORY_LEN, MAX_TITLE_LEN), dtype='int32', name='user_history')  # (None, 50, 20)
candidate_input = Input(shape=(MAX_TITLE_LEN,), dtype='int32', name='candidate')  # (None, 20)

# Encode user history (each of 50 titles)
user_history_reshaped = tf.reshape(user_history_input, [-1, MAX_TITLE_LEN])  # (None*50, 20)
user_encoded = title_encoder(user_history_reshaped)  # (None*50, 100)
user_encoded = tf.reshape(user_encoded, [-1, MAX_HISTORY_LEN, FILTER_NUM])  # (None, 50, 100)

# Encode candidate
candidate_encoded = title_encoder(candidate_input)  # (None, 100)
candidate_expanded = tf.expand_dims(candidate_encoded, axis=1)  # (None, 1, 100)

# Attention mechanism (dot-product + softmax)
attention_scores = tf.reduce_sum(user_encoded * candidate_expanded, axis=-1)  # (None, 50)
attention_weights = tf.nn.softmax(attention_scores, axis=-1)  # (None, 50)

# Weighted sum over user history
attention_weights_expanded = tf.expand_dims(attention_weights, axis=-1)  # (None, 50, 1)
user_vector = tf.reduce_sum(user_encoded * attention_weights_expanded, axis=1)  # (None, 100)

# Concatenate user vector and candidate vector
final_vector = Concatenate()([user_vector, candidate_encoded])  # (None, 200)
final_vector = Dropout(0.2)(final_vector)
output = Dense(1, activation='sigmoid')(final_vector)

# Build model
dkn_model = Model(inputs=[user_history_input, candidate_input], outputs=output)
dkn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC'])

dkn_model.summary()


Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 user_history (InputLayer)   [(None, 50, 20)]             0         []                            
                                                                                                  
 tf.reshape (TFOpLambda)     (None, 20)                   0         ['user_history[0][0]']        
                                                                                                  
 candidate (InputLayer)      [(None, 20)]                 0         []                            
                                                                                                  
 model (Functional)          (None, 100)                  3132500   ['tf.reshape[0][0]',          
                                                                     'candidate[0][0]']     


In [None]:
BATCH_SIZE = 4  
EPOCHS = 3  # Start small—can increase later

history = dkn_model.fit(
    x=[user_histories, candidate_titles],
    y=labels,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.1,
    verbose=1
)


Epoch 1/3

## Evaluate the DKN model

In [None]:
import matplotlib.pyplot as plt

def plot_training_history(history):
    # Plot loss
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(history.history['loss'], label='Train Loss')
    plt.plot(history.history['val_loss'], label='Val Loss')
    plt.title('Loss over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Binary Crossentropy Loss')
    plt.legend()

    # Plot AUC
    plt.subplot(1, 2, 2)
    plt.plot(history.history['auc'], label='Train AUC')
    plt.plot(history.history['val_auc'], label='Val AUC')
    plt.title('AUC over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('AUC')
    plt.legend()

    plt.tight_layout()
    plt.show()

plot_training_history(history)


NameError: name 'history' is not defined

In [None]:
import pandas as pd

# Paths to MINDsmall_dev set
DEV_BEHAVIORS_PATH = "/Users/anuj/Downloads/MINDsmall_dev/behaviors.tsv"
DEV_NEWS_PATH = "/Users/anuj/Downloads/MINDsmall_dev/news.tsv"

# Load dev behaviors
dev_behaviors = pd.read_csv(
    DEV_BEHAVIORS_PATH, sep='\t', header=None,
    names=['ImpressionID', 'UserID', 'Time', 'History', 'Impressions']
)

# Load dev news
dev_news = pd.read_csv(
    DEV_NEWS_PATH, sep='\t', header=None,
    names=['NewsID', 'Category', 'SubCategory', 'Title', 'Abstract', 'URL', 'TitleEntities', 'AbstractEntities']
)

print(f"Dev behaviors: {dev_behaviors.shape}")
print(f"Dev news: {dev_news.shape}")


## Document embedding inference API

After training, you can get document embeddings through this document embedding inference API. The input file format is the same as the document feature file. The output file format is: `[Newsid] [embedding]`

In [None]:
# Tokenize and pad dev news titles using existing word2idx
def preprocess_news_titles(news_df, word2idx, max_len):
    tokenized_titles = []
    news_id_list = []
    
    for _, row in news_df.iterrows():
        news_id = row['NewsID']
        title = str(row['Title']).lower()
        tokens = re.findall(r'\w+', title)
        token_indices = [word2idx.get(word, 0) for word in tokens]
        if len(token_indices) < max_len:
            token_indices += [0] * (max_len - len(token_indices))
        else:
            token_indices = token_indices[:max_len]
        tokenized_titles.append(token_indices)
        news_id_list.append(news_id)
    
    return dict(zip(news_id_list, tokenized_titles))

dev_news_encoded = preprocess_news_titles(dev_news, word2idx, MAX_TITLE_LEN)

print(f"Encoded {len(dev_news_encoded)} dev news articles")


<reco_utils.recommender.deeprec.models.dkn.DKN at 0x7fe60850deb8>

## Results on large MIND dataset

Here are performances using the large MIND dataset (1,000,000 users, 161,013 news articles and 15,777,377 impression logs). 

| Models | g-AUC | MRR | NDCG@5 | NDCG@10 |
| :------ | :------: | :------: | :------: | :------: |
| LibFM | 0.5993 | 0.2823 | 0.3005 | 0.3574 |
| Wide&Deep | 0.6216 | 0.2931 | 0.3138 | 0.3712 |
| DKN | 0.6436 | 0.3128 | 0.3371 | 0.3908 |


Note that the DKN results were obtained with Microsoft Recommenders, while the results of the first two models come from the MIND paper \[3\].
All results are compared on the same test dataset. 
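
The metrics in the table are ranking metrics computed within each impression and then averaged. The sketch below shows one common way to define them, assuming that g-AUC denotes AUC averaged over impressions and that labels are binary; it is not necessarily the exact evaluation code behind the numbers above, and all function names are illustrative.

```python
from collections import defaultdict

import numpy as np


def auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    diff = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def mrr(labels, scores):
    """Mean reciprocal rank of the clicked items."""
    ranked = np.asarray(labels, dtype=float)[np.argsort(scores)[::-1]]
    return (ranked / np.arange(1, len(ranked) + 1)).sum() / ranked.sum()


def ndcg_at_k(labels, scores, k):
    """NDCG@k with binary gains."""
    labels = np.asarray(labels, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))

    def dcg(rel):
        rel = rel[:k]
        return (rel * discounts[:len(rel)]).sum()

    ranked = labels[np.argsort(scores)[::-1]]
    return dcg(ranked) / dcg(np.sort(labels)[::-1])


def grouped_mean(metric, impression_ids, labels, scores):
    """Average `metric` over impressions that contain both clicks and non-clicks."""
    groups = defaultdict(lambda: ([], []))
    for imp, y, s in zip(impression_ids, labels, scores):
        groups[imp][0].append(y)
        groups[imp][1].append(s)
    values = [metric(y, s) for y, s in groups.values() if 0 < sum(y) < len(y)]
    return float(np.mean(values))


imp_ids = [1, 1, 1, 2, 2]
y_true = [0, 1, 0, 1, 0]
y_score = [0.1, 0.9, 0.3, 0.2, 0.8]
print("g-AUC:", grouped_mean(auc, imp_ids, y_true, y_score))
print("MRR:", grouped_mean(mrr, imp_ids, y_true, y_score))
print("NDCG@5:", grouped_mean(lambda y, s: ndcg_at_k(y, s, 5), imp_ids, y_true, y_score))
```

In the toy data, the first impression ranks its clicked article first and the second ranks it last, so g-AUC averages to 0.5 and MRR to 0.75.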

One epoch takes 6381.3s (5066.6s for training, 1314.7s for evaluating) for DKN on GPU. Hardware specification for running the large dataset: <br>
GPU: Tesla P100-PCIE-16GB <br>
CPU: 6 cores Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz

## References

\[1\] Wang, Hongwei, et al. "DKN: Deep Knowledge-Aware Network for News Recommendation." Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.<br>
\[2\] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E <br>
\[3\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. https://msnews.github.io/competition.html <br>
\[4\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/

In [None]:
def create_eval_samples(dev_behaviors_df, news_encoded, max_history_len):
    user_histories = []
    candidate_news = []
    labels = []
    impression_ids = []

    for _, row in dev_behaviors_df.iterrows():
        history_ids = str(row['History']).split() if pd.notna(row['History']) else []
        history_titles = [news_encoded.get(nid, [0] * MAX_TITLE_LEN) for nid in history_ids]
        if len(history_titles) < max_history_len:
            history_titles += [[0] * MAX_TITLE_LEN] * (max_history_len - len(history_titles))
        else:
            history_titles = history_titles[:max_history_len]

        impressions = str(row['Impressions']).split()
        for imp in impressions:
            if '-' not in imp:
                continue
            news_id, label = imp.split('-')
            candidate_title = news_encoded.get(news_id, [0] * MAX_TITLE_LEN)

            user_histories.append(history_titles)
            candidate_news.append(candidate_title)
            labels.append(int(label))
            impression_ids.append(row['ImpressionID'])

    return np.array(user_histories), np.array(candidate_news), np.array(labels), impression_ids

max_history_len = 50  # Same as during training

eval_user_histories, eval_candidate_titles, eval_labels, eval_impression_ids = create_eval_samples(
    dev_behaviors, dev_news_encoded, max_history_len
)

print(f"Evaluation samples: {len(eval_labels)}")


TF Version: 2.15.1
GPU Available: []




Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 hist_input (InputLayer)     [(None, 50, 20)]             0         []                            
                                                                                                  
 news_input (InputLayer)     [(None, 20)]                 0         []                            
                                                                                                  
 embedding_1 (Embedding)     (None, 50, 20, 100)          5399200   ['hist_input[0][0]']          
                                                                                                  
 embedding (Embedding)       (None, 20, 100)              5399200   ['news_input[0][0]']          
                                                                                              
