# Graph Construction Step

* Construct the graph for each site's transaction data

Each node represents a transaction, and each edge represents a relationship between two transactions. Because all transactions at a site share the same Sender_BIC, the sender cannot distinguish them, so an edge is added between two transactions when both of the following rules hold:

1. The two transactions have the same Receiver_BIC.
2. The time difference between the two transactions is less than 6000, measured in the units of the `Time` column.

Note that in real applications, such rules should be designed around the characteristics of the data at hand.
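
To make the two rules concrete, here is a minimal sketch that applies them to a tiny hand-made table. The four transactions and their values are hypothetical and are not taken from the dataset; only the column names and the threshold of 6000 come from this notebook.

```python
import pandas as pd

# Hypothetical toy transactions; the values are made up for illustration.
toy = pd.DataFrame({
    "UETR": ["t1", "t2", "t3", "t4"],
    "Receiver_BIC": ["BANK_A", "BANK_A", "BANK_B", "BANK_A"],
    "Time": [0, 1000, 2000, 6500],
})
time_threshold = 6000

toy_edges = []
rows = toy.sort_values("Time").to_dict("records")
for i, a in enumerate(rows):
    for b in rows[i + 1:]:
        same_receiver = a["Receiver_BIC"] == b["Receiver_BIC"]     # rule 1
        close_in_time = b["Time"] - a["Time"] < time_threshold     # rule 2
        if same_receiver and close_in_time:
            toy_edges.append((a["UETR"], b["UETR"]))

print(toy_edges)
```

Here `t1` and `t2` are linked, as are `t2` and `t4`. `t1` and `t4` share a receiver but are 6500 apart, so they get no edge, and `t3` goes to a different receiver, so it stays isolated.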

### Load Data

In [None]:
site_input_dir = "/tmp/dataset/horizontal_credit_fraud_data/"
site_name = "ZHSZUS33_Bank_1"

In [None]:
import os

import pandas as pd

dataset_names = ["train", "test"]
datasets = {}

for ds_name in dataset_names:
    file_name = os.path.join(site_input_dir, site_name, f"{ds_name}.csv")
    df = pd.read_csv(file_name)
    datasets[ds_name] = df
    print(df)

In [None]:
df.columns

In [None]:
import pandas as pd

edge_maps = {}

info_columns = ['Time', 'Receiver_BIC', 'UETR']
time_threshold = 6000

for ds_name in dataset_names:
    df = datasets[ds_name]
    
    # Find transaction pairs that are within the time threshold
    # First sort the table by 'Time'
    df = df.sort_values(by="Time")
    # Keep only the columns that are needed for the graph edge map
    df = df[info_columns]

    # Pull the columns out as NumPy arrays once, rather than on every loop iteration
    times = df["Time"].values
    receivers = df["Receiver_BIC"].values
    uetrs = df["UETR"].values

    # Then, for each row, scan the following rows that are:
    # - within the time threshold
    # - sent to the same Receiver_BIC
    graph_edge_map = []
    for i in range(len(df)):
        j = i + 1
        while j < len(df) and times[j] < times[i] + time_threshold:
            if receivers[j] == receivers[i]:
                graph_edge_map.append([uetrs[i], uetrs[j]])
            j += 1

    print(f"Generated edge map for {ds_name}, in total {len(graph_edge_map)} valid edges for {len(df)} transactions")

    edge_maps[ds_name] = pd.DataFrame(graph_edge_map)    
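
The loop above walks over every transaction inside the time window, including those sent to other receivers. One possible variant, sketched below, groups by `Receiver_BIC` first so that only same-receiver pairs are ever compared. The helper name `build_edge_map` is hypothetical and is not part of the original notebook. It should yield the same set of pairs, though the row order differs and the direction of a pair can differ when two transactions share the same `Time`.

```python
import numpy as np
import pandas as pd


def build_edge_map(df, time_threshold=6000):
    """Return a DataFrame of UETR pairs that satisfy both edge rules.

    Hypothetical helper: assumes df has 'Time', 'Receiver_BIC' and 'UETR' columns.
    """
    sources, targets = [], []
    # Grouping by receiver means only same-receiver pairs are ever compared.
    for _, group in df.groupby("Receiver_BIC", sort=False):
        group = group.sort_values("Time", kind="stable")
        times = group["Time"].to_numpy()
        uetrs = group["UETR"].to_numpy()
        # ends[i] is the first position whose Time is >= times[i] + threshold,
        # so positions i+1 .. ends[i]-1 are strictly within the window.
        ends = np.searchsorted(times, times + time_threshold, side="left")
        for i, end in enumerate(ends):
            for j in range(i + 1, end):
                sources.append(uetrs[i])
                targets.append(uetrs[j])
    # Columns 0 and 1 match the layout of pd.DataFrame(list_of_pairs).
    return pd.DataFrame({0: sources, 1: targets})
```

Calling `build_edge_map(datasets["train"])` would then give a frame that can be saved in the same way as `edge_maps["train"]`.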


In [None]:
edge_maps["train"]

In [None]:
for name in edge_maps:
    site_dir = os.path.join(site_input_dir, site_name)
    os.makedirs(site_dir, exist_ok=True)
    edge_map_file_name = os.path.join(site_dir, f"{name}_edgemap.csv")
    print("Saving edge map to", edge_map_file_name)
    # save to csv file without header and index
    edge_maps[name].to_csv(edge_map_file_name, header=False, index=False)
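
Because the edge map is written without a header or index, whatever reads it back has to say so explicitly. The sketch below shows a round trip on a hypothetical two-edge map in a temporary directory. The column names `source_uetr` and `target_uetr` are illustrative assumptions, not names used elsewhere in this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-edge map, written the same way as above (no header, no index).
edges = pd.DataFrame([["t1", "t2"], ["t2", "t4"]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_edgemap.csv")
    edges.to_csv(path, header=False, index=False)
    # header=None stops pandas from treating the first edge as column names.
    loaded = pd.read_csv(path, header=None, names=["source_uetr", "target_uetr"])

print(loaded)
```

Without `header=None`, the first edge would be consumed as the header row and silently dropped from the graph.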

In [None]:
! tree /tmp/dataset/horizontal_credit_fraud_data/ZHSZUS33_Bank_1

Let's go back to the [XGBoost Notebook](../xgboost.ipynb)