## 🧹 Tokenflow Dataset Preparation for Link Prediction

This notebook provides the necessary preprocessing steps to prepare the **Tokenflow** dataset for use with the **RelationalAI Graph Neural Network (GNN) learning engine**.

We specifically format the dataset for a **link prediction task**, where the goal is to predict relationships (edges) between entities in a graph.

Once the preprocessing is complete, the processed data will be uploaded to the appropriate **Snowflake tables**, ready to be consumed by the GNN engine.


In [None]:
import pandas as pd

## 🔗 Link Prediction Use Case: Predicting Transactions

In our link prediction task, the goal is to predict **transaction links** between **senders** and **buyers**.

To prepare the data for this task, we will follow these steps:

1. **Create Entity Tables**
   We will create two tables, one for **buyers** and one for **senders**, holding the unique `BUY_TOKEN_ADDRESS` and `TX_SENDER_ADDRESS` values respectively.

2. **Create the Relationship (Edge) Table**
   We will use `token-trades.csv` as our **transaction table**, representing the edges (links) between senders and buyers.

3. **Generate Train, Validation, and Test Splits**
   Using the transaction data, we will create separate datasets for training, validation, and testing.

This setup ensures the data is properly formatted and ready for ingestion by the RelationalAI GNN engine.
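
As a rough illustration of the three steps above, the sketch below runs them on a tiny in-memory table. The column names match those used later in this notebook; the toy rows, the cutoff date, and the `split_by_time` helper are hypothetical and only there to show the shape of the result (the split is simplified to two parts instead of three).

```python
import pandas as pd

# Toy trades table (hypothetical rows, same column names as the notebook)
toy_trades = pd.DataFrame({
    "TX_SENDER_ADDRESS": ["s1", "s1", "s2", "s2"],
    "BUY_TOKEN_ADDRESS": ["t1", "t2", "t1", "t1"],
    "BLOCK_TIMESTAMP": pd.to_datetime(
        ["2024-11-01", "2024-11-02", "2024-11-03", "2025-01-20"]
    ),
})

# Step 1: entity tables hold the unique IDs of each node type
toy_senders = toy_trades[["TX_SENDER_ADDRESS"]].drop_duplicates()
toy_buyers = toy_trades[["BUY_TOKEN_ADDRESS"]].drop_duplicates()

# Step 2: the edge table is the trades table itself (one row per link)
toy_edges = toy_trades.copy()


def split_by_time(edges, cutoff):
    """Step 3, simplified: rows up to `cutoff` versus rows after it."""
    cutoff = pd.Timestamp(cutoff)
    before = edges[edges["BLOCK_TIMESTAMP"] <= cutoff]
    after = edges[edges["BLOCK_TIMESTAMP"] > cutoff]
    return before, after


toy_train, toy_eval = split_by_time(toy_edges, "2024-12-31")
print(len(toy_senders), len(toy_buyers), len(toy_train), len(toy_eval))
```

Splitting on time rather than at random keeps every evaluation edge strictly later than the training edges, which mirrors how the real splits are built further down.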


In [None]:
# Load the raw trades; the path is relative to this notebook
trades_df = pd.read_csv('../../data/token-trades.csv')
trades_df.head()

In [None]:
# Entity tables: one row per unique buyer and sender ID
buyers_df = trades_df[['BUY_TOKEN_ADDRESS']].drop_duplicates()
senders_df = trades_df[['TX_SENDER_ADDRESS']].drop_duplicates()

# Transaction (edge) table: keep a subset of the trade features.
# .copy() makes this an independent DataFrame, so the in-place
# timestamp conversion further down does not raise a SettingWithCopyWarning.
transactions_df = trades_df[['TX_SENDER_ADDRESS','BUY_TOKEN_ADDRESS','BLOCK_TIMESTAMP',
                             'BUY_AMOUNT','BUY_TOKEN_SYMBOL','SELL_TOKEN_SYMBOL',
                             'SELL_AMOUNT']].copy()
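
Before building the splits, it can help to confirm that the ID columns are usable as node keys. The check below is a minimal sketch and not part of the original pipeline: `count_missing_keys` is a hypothetical helper that reports how many null values each ID column contains, since `drop_duplicates()` keeps a null as if it were a regular ID.

```python
import pandas as pd


def count_missing_keys(df, key_columns):
    """Return {column: number of null values} for the given ID columns."""
    return {col: int(df[col].isna().sum()) for col in key_columns}


# Toy example with one missing sender address (hypothetical data)
toy_keys = pd.DataFrame({
    "TX_SENDER_ADDRESS": ["s1", None, "s2"],
    "BUY_TOKEN_ADDRESS": ["t1", "t1", "t2"],
})
missing = count_missing_keys(toy_keys, ["TX_SENDER_ADDRESS", "BUY_TOKEN_ADDRESS"])
print(missing)
```

In this notebook the equivalent call would be `count_missing_keys(trades_df, ['TX_SENDER_ADDRESS', 'BUY_TOKEN_ADDRESS'])`.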

In [None]:
transactions_df['BLOCK_TIMESTAMP'].max()

In [None]:
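# Create train, validation, and test data.
# A date-only string compares as midnight (00:00:00), so trades later
# on 2025-01-14 fall into validation rather than training.
train_start_date = "2024-10-21"
train_end_date = "2025-01-14"
val_end_date = "2025-01-31"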

# Ensure BLOCK_TIMESTAMP is in datetime format
transactions_df['BLOCK_TIMESTAMP'] = pd.to_datetime(transactions_df['BLOCK_TIMESTAMP'])

train_df = transactions_df[
    (transactions_df['BLOCK_TIMESTAMP'] >= train_start_date) &
    (transactions_df['BLOCK_TIMESTAMP'] <= train_end_date)
][['BLOCK_TIMESTAMP', 'TX_SENDER_ADDRESS', 'BUY_TOKEN_ADDRESS']]

val_df = transactions_df[
    (transactions_df['BLOCK_TIMESTAMP'] > train_end_date) &
    (transactions_df['BLOCK_TIMESTAMP'] <= val_end_date)
][['BLOCK_TIMESTAMP', 'TX_SENDER_ADDRESS', 'BUY_TOKEN_ADDRESS']]

test_df = transactions_df[
    (transactions_df['BLOCK_TIMESTAMP'] > val_end_date)
][['BLOCK_TIMESTAMP', 'TX_SENDER_ADDRESS', 'BUY_TOKEN_ADDRESS']]

# Note: for link prediction tasks, the destination entities (the
# entities we are trying to predict a link to) are expected to be
# grouped in a list, so we transform the train, validation, and test
# dataframes accordingly.
train_df = train_df.groupby(['TX_SENDER_ADDRESS', 'BLOCK_TIMESTAMP'])['BUY_TOKEN_ADDRESS'].agg(list).reset_index()
val_df = val_df.groupby(['TX_SENDER_ADDRESS', 'BLOCK_TIMESTAMP'])['BUY_TOKEN_ADDRESS'].agg(list).reset_index()
test_df = test_df.groupby(['TX_SENDER_ADDRESS', 'BLOCK_TIMESTAMP'])['BUY_TOKEN_ADDRESS'].agg(list).reset_index()


print(f'Train size: {train_df.shape[0]}')
print(f'Validation size: {val_df.shape[0]}')
print(f'Test size: {test_df.shape[0]}')

# In link prediction problems, all rows within the validation split
# (and likewise within the test split) need to share a single
# timestamp, so we overwrite it with a placeholder for now.
val_df['BLOCK_TIMESTAMP'] = pd.Timestamp("2025-01-31 13:23:59.000")
test_df['BLOCK_TIMESTAMP'] = pd.Timestamp("2025-02-01 13:23:59.000")
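
A quick way to sanity-check the task tables is to summarize each split after the transformations above. The `summarize_split` helper below is a hypothetical sketch; it assumes the column names used in this notebook and reports the row count, the number of distinct timestamps, and whether every target is a list. The toy rows are made up for illustration.

```python
import pandas as pd


def summarize_split(df, time_col="BLOCK_TIMESTAMP", target_col="BUY_TOKEN_ADDRESS"):
    """Summarize a task table: size, distinct timestamps, list-valued targets."""
    is_list = df[target_col].map(lambda v: isinstance(v, list))
    return {
        "rows": len(df),
        "unique_timestamps": int(df[time_col].nunique()),
        "targets_are_lists": bool(is_list.all()),
    }


# Toy edges: s1 buys two tokens at the same timestamp, s2 buys one later
toy_val_edges = pd.DataFrame({
    "TX_SENDER_ADDRESS": ["s1", "s1", "s2"],
    "BUY_TOKEN_ADDRESS": ["t1", "t2", "t1"],
    "BLOCK_TIMESTAMP": pd.to_datetime(["2025-01-20", "2025-01-20", "2025-01-21"]),
})

# Same grouping as in the notebook: one row per (sender, timestamp)
toy_val = (
    toy_val_edges
    .groupby(["TX_SENDER_ADDRESS", "BLOCK_TIMESTAMP"])["BUY_TOKEN_ADDRESS"]
    .agg(list)
    .reset_index()
)

# Overwrite with a single placeholder timestamp, as done for val_df and test_df
toy_val["BLOCK_TIMESTAMP"] = pd.Timestamp("2025-01-31 13:23:59")
summary = summarize_split(toy_val)
print(summary)
```

Running `summarize_split` on `val_df` and `test_df` should report exactly one distinct timestamp each if the overwrite above was applied.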

## Upload Data To Snowflake

We assume you have created a `.env` file in the same directory as this notebook. (A sample `.env` file is included for reference.)

Your `.env` file should contain the following fields:

```
ACCOUNT_NAME=snowflake_account_name
USER_NAME=snowflake_user_name
PASSWORD=snowflake_user_password
WAREHOUSE=snowflake_warehouse
APP_NAME=RAI_EXPT_APP
AUTH_METHOD=password
```
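
Because a missing field only surfaces later as a connection error, a small pre-flight check can be useful. The sketch below is an assumption-laden example rather than part of the provided tooling: `missing_env_keys` is a hypothetical helper, and it only covers the four variables that this notebook reads into its Snowflake configuration. It would be run after `load_dotenv()` has populated the environment.

```python
import os

# The four variables this notebook reads when building its Snowflake config
REQUIRED_KEYS = ("ACCOUNT_NAME", "USER_NAME", "PASSWORD", "WAREHOUSE")


def missing_env_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


# Example with an explicit mapping instead of the real environment
example_env = {"ACCOUNT_NAME": "my_account", "USER_NAME": ""}
print(missing_env_keys(example_env))
```

Called without arguments, `missing_env_keys()` inspects `os.environ`; an empty list means all four values are present.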

In [None]:
from load_to_snowflake import create_session, load_to_snowflake


In [None]:
import os
from dotenv import load_dotenv
load_dotenv()

snowflake_config = {
    "account": os.getenv("ACCOUNT_NAME"),
    "user": os.getenv("USER_NAME"),
    "password": os.getenv("PASSWORD"),
    "warehouse": os.getenv("WAREHOUSE"),
}

session = create_session(snowflake_config)

In [None]:
# (dataframe, schema, table name) for every table to upload
uploads = [
    (buyers_df, "DATA", "BUYERS"),
    (senders_df, "DATA", "SENDERS"),
    (transactions_df, "DATA", "TRANSACTIONS"),
    (train_df, "TASK", "TRAIN"),
    (test_df, "TASK", "TEST"),
    (val_df, "TASK", "VALIDATION"),
]

for df, schema, table_name in uploads:
    load_to_snowflake(
        session=session,
        df=df,
        database="GNN_TOKENFLOW",
        schema=schema,
        table_name=table_name,
    )