# Dataset Preprocessing

To run the KGCN model on a new dataset, we need to prepare three files: `ratings.csv`, `kg.txt`, and `item_index2entity_id.txt`. Neither the paper nor the code explains how datasets such as `music` or `movie` were built, so this preprocessing notebook serves as a reference.

The `product` dataset is built upon the Rec-Tmall dataset, which can be found at https://tianchi.aliyun.com/dataset/140281.

You can either download the full dataset or use the sample dataset in the `./raw` directory.

In [1]:
import pandas as pd
import numpy as np

log_path = "./raw/(sample)sam_tianchi_2014002_rec_tmall_log.csv"
product_path = "./raw/(sample)sam_tianchi_2014001_rec_tmall_product.csv"

## Generate ratings.csv

Convert four behaviors into explicit ratings:

['click', 'collect', 'cart', 'alipay'] → [1.25, 2.5, 3.75, 5]
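The four rating values are evenly spaced on a 5-point scale: the k-th behavior in the list receives 5·k/4. The following is a minimal sketch of that arithmetic, assuming the list is ordered from weakest to strongest purchase intent; `BEHAVIORS` and `behavior_to_rating` are illustrative names that do not appear in the notebook.

```python
# Behaviors in the order used by the notebook (assumed weakest to strongest).
BEHAVIORS = ['click', 'collect', 'cart', 'alipay']

def behavior_to_rating(action, max_rating=5.0):
    """Map a behavior name to an evenly spaced rating in (0, max_rating]."""
    rank = BEHAVIORS.index(action) + 1  # 1-based position in the ordering
    return max_rating * rank / len(BEHAVIORS)

ratings = {action: behavior_to_rating(action) for action in BEHAVIORS}
print(ratings)
# {'click': 1.25, 'collect': 2.5, 'cart': 3.75, 'alipay': 5.0}
```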

In [2]:
# Read raw data
log = pd.read_csv(log_path)
log.groupby('action').count()

| action | item_id | user_id | vtime |
|---|---|---|---|
| alipay | 7 | 7 | 7 |
| cart | 358 | 358 | 358 |
| click | 4474 | 4474 | 4474 |
| collect | 5 | 5 | 5 |


In [3]:
# Convert four behaviors into explicit ratings
log.action = log.action.replace(
    ['click', 'collect', 'cart', 'alipay'], 
    [1.25, 2.5, 3.75, 5])
log = log[['user_id', 'item_id', 'action']].rename({'action': 'rating'}, axis=1)
log.groupby('rating').count()

| rating | user_id | item_id |
|---|---|---|
| 1.25 | 4474 | 4474 |
| 2.5 | 5 | 5 |
| 3.75 | 358 | 358 |
| 5.0 | 7 | 7 |


In [4]:
# Save to ratings.csv
log.to_csv('ratings.csv', index=False, sep='\t')

## Generate kg.txt

Build the following relations from `*product.csv`:

* product.belong_to.leave_category
* leave_category.belong_to.parent_category
* product.product_brand.brand
* product.selled_by.seller
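To make the four relations concrete, here is a minimal sketch that produces the triples for a single product row. It follows the conventions of the notebook code (an `i` prefix for items, a `c` prefix for categories, and the parent category taken as the text before the first `-`). The function name `triples_for_product` and the sample values are hypothetical and are not taken from the Rec-Tmall data.

```python
def triples_for_product(item_id, category, brand_id, seller_id):
    """Return the [head, relation, tail] triples for one product row."""
    leaf = f'c{category}'
    parent = f'c{category.split("-")[0]}'  # text before the first '-'
    triples = [
        [f'i{item_id}', 'product.belong_to.leave_category', leaf],
        [leaf, 'leave_category.belong_to.parent_category', parent],
        [f'i{item_id}', 'product.selled_by.seller', str(seller_id)],
    ]
    if brand_id is not None:  # the brand may be missing for some products
        triples.append(
            [f'i{item_id}', 'product.product_brand.brand', str(brand_id)])
    return triples

# Hypothetical sample values, for illustration only.
for triple in triples_for_product(42, '50-1201', 'b7', 's3'):
    print('\t'.join(triple))
```

Each printed line has the same tab-separated `head relation tail` layout that `kg.txt` uses.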

In [5]:
product = pd.read_csv(product_path, encoding='GBK')
# Optional: pass dtype={'title':'str', 'pict_url':'str', 'category':'str', 'brand_id':'str', 'seller_id':'str'}
product.dtypes

item_id       int64
title        object
pict_url     object
category     object
brand_id     object
seller_id    object
dtype: object

In [6]:
kg = []  # rows of [head, relation, tail]
for _, row in product.iterrows():
    
    # product.belong_to.leave_category
    kg.append([
        f'i{row.item_id}',
        'product.belong_to.leave_category',
        f'c{row.category}'
    ])

    # leave_category.belong_to.parent_category
    entry = [
        f'c{row.category}',
        'leave_category.belong_to.parent_category',
        f'c{row.category.split("-")[0]}'
    ]
    if entry not in kg:  # add each category link only once
        kg.append(entry)

    # product.product_brand.brand (skipped when brand_id is NaN)
    if pd.notna(row.brand_id):
        kg.append([
            f'i{row.item_id}',
            'product.product_brand.brand',
            f'{row.brand_id}'
        ])

    # product.selled_by.seller
    kg.append([
        f'i{row.item_id}',
        'product.selled_by.seller',
        f'{row.seller_id}'
    ])
np.savetxt('kg.txt', kg, fmt='%s', delimiter='\t')

## Generate item_index2entity_id.txt

The items/entities include user ids, product ids, parent and leaf categories, brand ids, and seller ids.
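As a small illustration of how such an index can be built, the sketch below collects heads and tails from a few hypothetical triples, adds user ids, and numbers the unique values. It uses `dict.fromkeys` for deduplication, which is a substitution for the `set` used in the notebook, chosen here so that the assigned indices are the same on every run. All sample values and the name `entity_to_index` are assumptions.

```python
# Hypothetical triples and user ids, for illustration only.
triples = [
    ['i1', 'product.belong_to.leave_category', 'c5-6'],
    ['c5-6', 'leave_category.belong_to.parent_category', 'c5'],
    ['i1', 'product.selled_by.seller', 's9'],
]
user_ids = [100, 101, 100]

# Heads and tails are entities; relations (the middle column) are not.
entities = [t[0] for t in triples] + [t[2] for t in triples]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique = list(dict.fromkeys(entities + user_ids))
entity_to_index = {name: idx for idx, name in enumerate(unique)}
print(entity_to_index)
# {'i1': 0, 'c5-6': 1, 'c5': 2, 's9': 3, 100: 4, 101: 5}
```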

In [7]:
npkg = np.array(kg)
items_in_kg = npkg[:, 0::2].flatten().tolist()
items_of_user = log.user_id.tolist()
items = items_in_kg + items_of_user
items = list(set(items))  # Remove duplicates
i2e = [[item, entity] for entity, item in enumerate(items)]
np.savetxt('item_index2entity_id.txt', i2e, fmt='%s', delimiter='\t')

## Convert all entities to new ids (Optional)

Replace every identifier in `ratings.csv` and `kg.txt` with its new id from `item_index2entity_id.txt`.
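Conceptually, this conversion is a dictionary lookup on the head and tail of every triple. The sketch below shows that idea in plain Python on a few hypothetical triples; `remap` and the sample mapping are illustrative names and values, not part of the notebook.

```python
# Hypothetical mapping from original identifiers to new integer ids.
entity_to_index = {'i1': 0, 'c5-6': 1, 'c5': 2, 's9': 3}

triples = [
    ['i1', 'product.belong_to.leave_category', 'c5-6'],
    ['c5-6', 'leave_category.belong_to.parent_category', 'c5'],
    ['i1', 'product.selled_by.seller', 's9'],
]

def remap(triple, mapping):
    """Replace head and tail with their new ids; keep the relation as-is."""
    head, relation, tail = triple
    # A missing identifier raises KeyError instead of passing through silently.
    return [mapping[head], relation, mapping[tail]]

remapped = [remap(t, entity_to_index) for t in triples]
for head, relation, tail in remapped:
    print(head, relation, tail, sep='\t')
```

Unlike `DataFrame.replace`, which leaves unmatched values unchanged, the dictionary lookup fails loudly when an identifier has no entry, which can help catch a mismatch between the files.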

In [8]:
i2e = dict(i2e)

ratings = log
ratings = ratings.replace(i2e)
ratings.to_csv('ratings.csv', index=False, sep='\t')

kg = pd.DataFrame(kg)
kg = kg.replace(i2e)
kg.to_csv('kg.txt', header=False, index=False, sep='\t')

i2e = [[entity, entity] for entity, item in enumerate(items)]
np.savetxt('item_index2entity_id.txt', i2e, fmt='%s', delimiter='\t')