<a href="https://colab.research.google.com/github/RoboTuan/KRED/blob/master/kred_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!nvidia-smi

Mon May  1 20:39:10 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Introduction

This repository is the implementation of [KRED: Knowledge-Aware Document Representation for News Recommendations](https://arxiv.org/abs/1910.11494) [1].


## Model description



KRED is a knowledge-enhanced framework that enriches a document embedding with knowledge information for multiple news recommendation tasks. The framework has two main parts: a representation enhancement part (left) and a multi-task training part (right).

![](./framework.PNG)

## Dataset description and download

The MIND dataset [2] is a large-scale English news dataset collected from anonymized behavior logs of the Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, the non-click events and the user's historical news click behaviors before that impression.

For quicker training and evaluation, we sample the MINDdemo dataset of 5k users from the MINDsmall dataset. The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. If you want to run experiments on MINDsmall or MINDlarge, please change the download source: set the MIND_type parameter to one of ['large', 'small', 'demo'] to choose the dataset.

MINDdemo_train is used for training, and MINDdemo_dev is used for evaluation. Training data and evaluation data are each composed of a news file and a behaviors file. A more detailed data description is available in the [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md).
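
To make the behaviors file concrete, here is a minimal sketch of parsing one line. It assumes the tab-separated layout described in the MIND repo linked above (impression ID, user ID, time, click history, and impressions given as `newsID-label` pairs). The sample line and the helper name `parse_behavior_line` are made up for illustration and are not part of KRED.

```python
def parse_behavior_line(line):
    """Split one behaviors.tsv line into its fields (hypothetical helper)."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    clicked, not_clicked = [], []
    for item in impressions.split():
        # Each impression item looks like "<newsID>-<label>", label 1 = clicked
        news_id, label = item.rsplit("-", 1)
        (clicked if label == "1" else not_clicked).append(news_id)
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": history.split(),
        "clicked": clicked,
        "not_clicked": not_clicked,
    }

# Made-up sample line, not taken from the dataset
sample = "1\tU100\t11/15/2019 8:55:22 AM\tN1 N2 N3\tN10-1 N11-0 N12-0"
record = parse_behavior_line(sample)
print(record["history"], record["clicked"], record["not_clicked"])
```

The history field corresponds to the user's earlier clicks, while the clicked and non-clicked lists come from the current impression.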

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import os
if not os.path.isdir('KRED'):
  !git clone https://github.com/RoboTuan/KRED

Cloning into 'KRED'...
remote: Enumerating objects: 292, done.
remote: Counting objects: 100% (153/153), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 292 (delta 108), reused 119 (delta 88), pack-reused 139
Receiving objects: 100% (292/292), 44.16 MiB | 14.44 MiB/s, done.
Resolving deltas: 100% (171/171), done.


In [5]:
!mkdir -p ./data/train ./data/valid ./data/kg
!cp /content/drive/MyDrive/MINDsmall_dev.zip ./data/valid
!cp /content/drive/MyDrive/MINDsmall_train.zip ./data/train
!cp /content/drive/MyDrive/kg.zip ./data/kg

In [6]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence-transformers)
  Downloading huggingface_hub-0.14.1-py3-

In [1]:
!cp KRED/config.yaml .

In [2]:
import sys
sys.path.append('KRED')
import os
from utils.util import *
from train_test import *

# Options: demo, small, large
MIND_type = 'small'
data_path = "./data/"

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
knowledge_graph_file = os.path.join(data_path, 'kg/wikidata-graph', r'wikidata-graph.tsv')
entity_embedding_file = os.path.join(data_path, 'kg/wikidata-graph', r'entity2vecd100.vec')
relation_embedding_file = os.path.join(data_path, 'kg/wikidata-graph', r'relation2vecd100.vec')

mind_url, mind_train_dataset, mind_dev_dataset, _ = get_mind_data_set(MIND_type)

kg_url = "https://kredkg.blob.core.windows.net/wikidatakg/"

# Download each resource only if it is not already on disk
if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)

if not os.path.exists(knowledge_graph_file):
    download_deeprec_resources(kg_url, os.path.join(data_path, 'kg'), "kg.zip")

## Loading config

In [3]:
import sys
import os
sys.path.append('')
sys.argv = ['']

import argparse
from parse_config import ConfigParser

parser = argparse.ArgumentParser(description='KRED')


parser.add_argument('-c', '--config', default="./KRED/config.yaml", type=str,
                    help='config file path (default: ./KRED/config.yaml)')
parser.add_argument('-r', '--resume', default=None, type=str,
                    help='path to latest checkpoint (default: None)')
parser.add_argument('-d', '--device', default=None, type=str,
                    help='indices of GPUs to enable (default: all)')

#config = parser.parse_args("")
config = ConfigParser.from_args(parser)
config



<parse_config.ConfigParser at 0x7ff4aabddd50>
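
The cell above resets `sys.argv` before building the parser. Inside a notebook kernel, `sys.argv` holds the kernel's own launch arguments, which a script-style parser would reject. The sketch below shows the same idea with the standard library only; it does not use KRED's `ConfigParser`, and passing an explicit empty list to `parse_args` is shown as an alternative to overwriting `sys.argv`, not as what the repository does.

```python
import argparse

parser = argparse.ArgumentParser(description="KRED")
parser.add_argument("-c", "--config", default="./KRED/config.yaml", type=str)
parser.add_argument("-r", "--resume", default=None, type=str)
parser.add_argument("-d", "--device", default=None, type=str)

# An explicit empty list makes argparse ignore sys.argv and fall back to the defaults
args = parser.parse_args([])
print(args.config, args.resume, args.device)
```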

## Create hyper-parameters

In [4]:
epochs = 1
batch_size = 64
train_type = "single_task"
task = "user2item" # task should be within: user2item, item2item, vert_classify, pop_predict

config['trainer']['epochs'] = epochs
config['data_loader']['batch_size'] = batch_size
config['trainer']['training_type'] = train_type
config['trainer']['task'] = task
config['trainer']['save_period'] = epochs/2
config['data']['sentence_embedding_folder'] = "/content/drive/MyDrive/sentence_embedding/"

## Process dataset

Since the MIND dataset does not contain users' location information, we cannot use the local news detection task.


In [5]:
if not os.path.isfile("/content/drive/MyDrive/sentence_embedding/train_news_embeddings.pkl"):
    write_embedding_news("./data/train", config["data"]["sentence_embedding_folder"])

if not os.path.isfile("/content/drive/MyDrive/sentence_embedding/valid_news_embeddings.pkl"):
    write_embedding_news("./data/valid", config["data"]["sentence_embedding_folder"])


In [6]:
# data = load_data_mind(config, sentence_embedding_folder)
if not os.path.isfile("/content/drive/MyDrive/sentence_embedding/data_mind.pkl"):
    write_data_mind(config, "/content/drive/MyDrive/sentence_embedding/")
data = read_pickle("/content/drive/MyDrive/sentence_embedding/data_mind.pkl")

test_data = data[-1]
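
The two cells above follow a compute-once, reload-later pattern: an expensive artifact is written to disk only if it is missing, and is read back afterwards. A generic sketch with `pickle` is shown below. The names `load_or_build` and `expensive_build` are hypothetical, and KRED's own `read_pickle`, `write_embedding_news` and `write_data_mind` may work differently.

```python
import os
import pickle
import tempfile

def load_or_build(path, build_fn):
    """Return the pickled object at `path`, building and saving it first if missing."""
    if not os.path.isfile(path):
        with open(path, "wb") as f:
            pickle.dump(build_fn(), f)
    with open(path, "rb") as f:
        return pickle.load(f)

calls = []

def expensive_build():
    calls.append(1)  # count how often the expensive step actually runs
    return {"N1": [0.1, 0.2]}  # toy stand-in for news embeddings

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "toy_embeddings.pkl")
    first = load_or_build(cache, expensive_build)
    second = load_or_build(cache, expensive_build)  # served from disk

print(first == second, len(calls))
```

Storing the cache on a mounted drive, as the notebook does, lets later sessions skip the embedding step entirely.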

In [7]:
def limit_user2item_validation_data(data, size):
    """Keep only the first `size` entries of every field in the validation data (last element of `data`)."""
    test_data = data[-1]
    test_data_reduced = {key: test_data[key][:size] for key in test_data.keys()}
    # Rebuild the tuple with the reduced validation data in the last position
    return data[:-1] + (test_data_reduced,)

data = limit_user2item_validation_data(data, 10000)
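
To see what this truncation does, here is a toy run. It assumes, as the function itself does, that the last element of the tuple is a dict whose values can be sliced; the key names and the three-element tuple are invented and much smaller than the real 11-element `data`.

```python
def limit_user2item_validation_data(data, size):
    test_data = data[-1]
    test_data_reduced = {key: test_data[key][:size] for key in test_data.keys()}
    return data[:-1] + (test_data_reduced,)

# Toy stand-in for the real tuple; key names are invented
toy = ("train", "other", {"item1": ["U1", "U2", "U3"], "label": [1, 0, 1]})
limited = limit_user2item_validation_data(toy, 2)

print(limited[-1])
print(toy[-1]["label"])  # the original dict is left untouched
```

Because slicing builds a new dict, a reference taken before the call (such as `test_data` in the earlier cell) still points to the full validation data.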


In [8]:
len(data)

11

## Train the KRED model

In [9]:
single_task_training(config, data)

Using device: cuda:0


INFO:train:model training


all loss: tensor(1707.3665, device='cuda:0', grad_fn=<AddBackward0>)


INFO:trainer:Saving checkpoint: out/saved/models/KRED/0501_210029/checkpoint-model-epoch1.pth ...


auc socre: 0.6016132115953025


## Evaluate the KRED model

In [10]:
testing(test_data, config)

auc score:0.6149325123656655
ndcg score:0.3369811877204074
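
For reference on what the two numbers measure, here is a small pure-Python sketch of AUC and nDCG@k for a single impression. It is not KRED's evaluation code, which may aggregate over impressions differently; the labels and scores are made up.

```python
import math

def auc(labels, scores):
    """Probability that a random clicked item outscores a random non-clicked one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def dcg_at_k(labels_in_rank_order, k):
    # Position i (0-based) is discounted by log2(i + 2)
    return sum(l / math.log2(i + 2) for i, l in enumerate(labels_in_rank_order[:k]))

def ndcg_at_k(labels, scores, k=10):
    ranked = [l for _, l in sorted(zip(scores, labels), key=lambda t: -t[0])]
    ideal = sorted(labels, reverse=True)
    return dcg_at_k(ranked, k) / dcg_at_k(ideal, k)

labels = [1, 0, 0, 1]          # 1 = clicked, 0 = not clicked (made up)
scores = [0.9, 0.8, 0.3, 0.5]  # model scores (made up)
auc_value = auc(labels, scores)
ndcg_value = ndcg_at_k(labels, scores)
print(auc_value, ndcg_value)
```

AUC only looks at pairwise ordering between clicked and non-clicked items, while nDCG rewards placing clicked items near the top of the ranking.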


## Performance on MINDlarge

We tested the performance on the MINDlarge dev dataset for your reference:

| Models | AUC | NDCG@10 |
| :------- | :------- | :------- |
| KRED (single-task training) | 0.6702 | 0.4018 |
| KRED (multi-task training) | 0.6731 | 0.4039 |


## References

[1] Liu, Danyang, et al. "KRED: Knowledge-Aware Document Representation for News Recommendations." Fourteenth ACM Conference on Recommender Systems. 2020.

[2] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.