# Link Prediction

## Preparation

In [1]:
%env NX_CUGRAPH_AUTOCONFIG=True

env: NX_CUGRAPH_AUTOCONFIG=True


In [2]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding
%pip install igraph networkit pandas matplotlib seaborn networkx numpy scikit-learn tqdm ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [3]:
# %pip uninstall torch pykeen
%pip install torch --index-url https://download.pytorch.org/whl/cu126
%pip install pykeen

Looking in indexes: https://download.pytorch.org/whl/cu126
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import networkx as nx
import pickle
import random
import igraph as ig
import networkit as nk

from itertools import combinations
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from tqdm import tqdm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

### Dataset Preparation

In [5]:
pickle_file_path = 'dataset/amazon_copurchase_graph.pickle'
with open(pickle_file_path, 'rb') as f:
    G = pickle.load(f)

print(G)

DiGraph with 259102 nodes and 1207337 edges


#### Node Features

In [6]:

print(f"Total Nodes: {G.number_of_nodes()}")


for node, data in list(G.nodes(data=True))[:5]:
    print(f"Node: {node}, Data: {data}")

print()
sample_node = next(iter(G.nodes(data=True)))[1]
print("Node features:", list(sample_node.keys()))

Total Nodes: 259102
Node: 1, Data: {'title': 'Patterns of Preaching: A Sermon Sampler', 'group': 'Book', 'salesrank': 396585.0, 'review_cnt': 2, 'downloads': 2, 'rating': 5.0, 'in_degree': 0, 'out_degree': 4, 'pagerank_centrality': 6.210153588242165e-07, 'betweenness_centrality': 0.0, 'harmonic_closeness_centrality': 0.1442557706580312, 'degree_centrality': 1.5437995221940477e-05, 'community': 10}
Node: 2, Data: {'title': 'Candlemas: Feast of Flames', 'group': 'Book', 'salesrank': 168596.0, 'review_cnt': 12, 'downloads': 12, 'rating': 4.5, 'in_degree': 1, 'out_degree': 4, 'pagerank_centrality': 7.560926314778459e-07, 'betweenness_centrality': 31563.672353370643, 'harmonic_closeness_centrality': 0.1444868764333364, 'degree_centrality': 1.92974940274256e-05, 'community': 10}
Node: 4, Data: {'title': 'Life Application Bible Commentary: 1 and 2 Timothy and Titus', 'group': 'Book', 'salesrank': 631289.0, 'review_cnt': 1, 'downloads': 1, 'rating': 4.0, 'in_degree': 24, 'out_degree': 5, 'page

The node features in this graph are:

*   **`title`**:
    *   **Data type**: String (text)
    *   **Description**: The name or title of the product. This feature gives a textual description of the product.
    *   **Examples**: "Patterns of Preaching: A Sermon Sampler", "Candlemas: Feast of Flames", etc.

*   **`group`**:
    *   **Data type**: String (categorical)
    *   **Description**: The category or group the product belongs to. This feature identifies the product type (for example Book, Music, DVD).
    *   **Example**: "Book"

*   **`salesrank`**:
    *   **Data type**: Float
    *   **Description**: The product's Amazon sales rank. A lower `salesrank` means higher sales and popularity, so this feature is commonly used as a measure of how well a product sells on Amazon.
    *   **Examples**: `396585.0`, `168596.0`, `1270652.0`, etc.

*   **`review_cnt`**:
    *   **Data type**: Integer
    *   **Description**: The number of customer reviews the product has received. A higher `review_cnt` can indicate greater visibility, higher popularity, or more customer engagement.
    *   **Examples**: `2`, `12`, `1`, `1`, `0`, etc.

*   **`downloads`**:
    *   **Data type**: Integer
    *   **Description**: The number of downloads associated with the product. Its exact meaning depends on the dataset source: it may represent downloads of a digital product or some other engagement metric. For the "Book" products in this sample, it could refer to downloads of book samples or another form of engagement relevant to the dataset.
    *   **Examples**: `2`, `12`, `1`, `1`, `0`, etc.

*   **`rating`**:
    *   **Data type**: Float
    *   **Description**: The average customer rating of the product, typically on a scale from 0 to 5 (or a similar system). This feature reflects customer satisfaction and the perceived overall quality of the product.
    *   **Examples**: `5.0`, `4.5`, `5.0`, `4.0`, `0.0`, etc.

*   **`in_degree`**:
    *   **Data type**: Integer
    *   **Description**: The number of edges pointing into this node, that is, how many other products link to this product in the graph. In this dataset it can be read as how often the product is referenced by other products.
    *   **Examples**: `0`, `1`, `24`, `53`, `21`, etc.

*   **`out_degree`**:
    *   **Data type**: Integer
    *   **Description**: The number of edges leaving this node, that is, how many other products this product references.
    *   **Examples**: `4`, `4`, `5`, `5`, `5`, etc.

*   **`pagerank_centrality`**:
    *   **Data type**: Float
    *   **Description**: The node's PageRank score. This metric measures a node's importance from the number and quality of the links pointing to it. The higher the value, the more influential the node is in the network.
    *   **Examples**: `6.21e-07`, `7.56e-07`, `1.34e-05`, etc.

*   **`betweenness_centrality`**:
    *   **Data type**: Float
    *   **Description**: Measures how often a node lies on the shortest path between two other nodes. Nodes with high betweenness centrality act as "bridges" connecting different parts of the graph.
    *   **Examples**: `0.0`, `31563.67`, `6528478.27`, `15442396.47`, etc.

*   **`harmonic_closeness_centrality`**:
    *   **Data type**: Float
    *   **Description**: A variant of closeness centrality that measures how close a node is to the other nodes using harmonic distance. The higher the value, the closer the node is to many other nodes in the graph.
    *   **Examples**: `0.1442`, `0.1444`, `0.1558`, `0.1658`, etc.

*   **`degree_centrality`**:
    *   **Data type**: Float
    *   **Description**: Measures the proportion of other nodes connected to this node. It is computed as the node's total number of connections (degree) divided by the maximum possible number of connections in the graph.
    *   **Examples**: `1.54e-05`, `1.92e-05`, `1.11e-04`, `2.23e-04`, etc.

*   **`community`**:
    *   **Data type**: Integer (categorical)
    *   **Description**: The ID of the community this node belongs to, as assigned by a community detection algorithm. Nodes in the same community are more likely to be connected to each other than to nodes in other communities.
    *   **Examples**: `10`, `10`, `10`, `31`, etc.
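
As a quick sanity check on `degree_centrality`, the sketch below recomputes it as (in_degree + out_degree) / (n - 1) for the first two sample nodes printed above. The helper name `toy_degree_centrality` is hypothetical, and the assumption that the stored attribute uses this normalization is inferred from the printed values, not stated by the dataset.

```python
import math

def toy_degree_centrality(in_degree: int, out_degree: int, num_nodes: int) -> float:
    """Total degree divided by the maximum possible number of neighbors (n - 1)."""
    return (in_degree + out_degree) / (num_nodes - 1)

NUM_NODES = 259102  # node count printed by the notebook

# Degrees taken from the sample output for nodes 1 and 2.
node1 = toy_degree_centrality(in_degree=0, out_degree=4, num_nodes=NUM_NODES)
node2 = toy_degree_centrality(in_degree=1, out_degree=4, num_nodes=NUM_NODES)

print(node1, node2)
```

Both results agree with the stored `degree_centrality` values for those nodes to within floating-point rounding, which suggests the attribute counts incoming and outgoing edges together.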

#### Edge Features

In [7]:
print(f"Total Edges: {G.number_of_edges()}")

for u, v, data in list(G.edges(data=True))[:5]:
    print(f"Edge: ({u}, {v}), Data: {data}")

sample_edge = next(iter(G.edges(data=True)))[2]
print("\nEdge features:", list(sample_edge.keys()))


Total Edges: 1207337
Edge: (1, 2), Data: {}
Edge: (1, 4), Data: {}
Edge: (1, 5), Data: {}
Edge: (1, 15), Data: {}
Edge: (2, 11), Data: {}

Edge features: []


This graph has no edge features.

### Split Dataset

In [8]:
nkG = nk.nxadapter.nx2nk(G)

edges = list(G.edges())
existing_edges = set(edges)

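# Sample non-edges using the NetworKit graph (faster edge lookups).
# nx2nk relabels nodes to consecutive IDs 0..n-1 in G.nodes() order, so the
# original NetworkX IDs are mapped to NetworKit IDs before calling hasEdge.
def sample_non_edges_nk(nkG, num_samples):
    non_edges = set()
    nodes = list(G.nodes())
    node_to_nk = {node: i for i, node in enumerate(nodes)}

    while len(non_edges) < num_samples:
        u, v = random.sample(nodes, 2)
        if not nkG.hasEdge(node_to_nk[u], node_to_nk[v]):
            non_edges.add((u, v))

    return list(non_edges)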

num_samples = len(edges)
non_edges = sample_non_edges_nk(nkG, num_samples)

train_edges, test_edges = train_test_split(edges, test_size=0.2, random_state=42)
# Split the negatives once so that train and test non-edges cannot overlap.
train_non_edges, test_non_edges = train_test_split(non_edges, test_size=0.2, random_state=42)

G_train = nx.Graph()
G_train.add_nodes_from(G.nodes())
G_train.add_edges_from(train_edges)

print(f"Train Edges: {len(train_edges)}, Test Edges: {len(test_edges)}")
print(f"Train Non-Edges: {len(train_non_edges)}, Test Non-Edges: {len(test_non_edges)}")

Train Edges: 965869, Test Edges: 241468
Train Non-Edges: 965869, Test Non-Edges: 241468
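
To make the negative-sampling step concrete, here is a minimal, dependency-free sketch on a toy directed graph. The names `toy_sample_non_edges`, `toy_nodes`, and `toy_edges` are hypothetical, and a plain Python set stands in for the graph's edge lookup; it is an illustration of the idea, not the notebook's exact procedure.

```python
import random

def toy_sample_non_edges(nodes, existing_edges, num_samples, seed=0):
    """Draw `num_samples` distinct directed pairs (u, v) that are not edges."""
    rng = random.Random(seed)
    non_edges = set()
    while len(non_edges) < num_samples:
        u, v = rng.sample(nodes, 2)  # two distinct nodes, so no self-loops
        if (u, v) not in existing_edges:
            non_edges.add((u, v))
    return sorted(non_edges)  # sorted only to make the toy output reproducible

toy_nodes = [1, 2, 3, 4, 5]
toy_edges = {(1, 2), (1, 4), (2, 3), (4, 5)}
negatives = toy_sample_non_edges(toy_nodes, toy_edges, num_samples=4)

# Slicing one pool keeps the train and test negatives disjoint.
train_neg, test_neg = negatives[:3], negatives[3:]
print(train_neg, test_neg)
```

Rejection sampling like this is cheap when the graph is sparse, because almost every random pair is a non-edge.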


In [10]:
# Evaluation metrics for the ranking problem
def precision_at_k(y_true, y_scores, k):
    sorted_indices = np.argsort(y_scores)[::-1]
    top_k = sorted_indices[:k]
    return np.mean(y_true[top_k])

def recall_at_k(y_true, y_scores, k):
    sorted_indices = np.argsort(y_scores)[::-1]
    top_k = sorted_indices[:k]
    return np.sum(y_true[top_k]) / np.sum(y_true)

def mean_average_precision(y_true, y_scores):
    sorted_indices = np.argsort(y_scores)[::-1]
    relevant = np.cumsum(y_true[sorted_indices])
    precision_at_i = relevant / (np.arange(len(y_true)) + 1)
    return np.sum(precision_at_i * y_true[sorted_indices]) / np.sum(y_true)

def f1_beta_at_k(y_true, y_scores, k, beta=1):
    precision_k = precision_at_k(y_true, y_scores, k)
    recall_k = recall_at_k(y_true, y_scores, k)

    if precision_k + recall_k == 0:
        return 0.0

    beta_sq = beta ** 2
    return (1 + beta_sq) * (precision_k * recall_k) / ((beta_sq * precision_k) + recall_k)
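
The ranking metrics above are easiest to check on a tiny example. The sketch below restates them in plain Python (the `toy_*` names are hypothetical, and it divides by `k` directly, which matches the NumPy versions whenever `k` does not exceed the number of samples) and evaluates them on five hand-labelled pairs.

```python
def ranked_labels(y_true, y_scores):
    """Return the labels reordered by descending score."""
    order = sorted(range(len(y_scores)), key=lambda i: y_scores[i], reverse=True)
    return [y_true[i] for i in order]

def toy_precision_at_k(y_true, y_scores, k):
    return sum(ranked_labels(y_true, y_scores)[:k]) / k

def toy_recall_at_k(y_true, y_scores, k):
    return sum(ranked_labels(y_true, y_scores)[:k]) / sum(y_true)

def toy_average_precision(y_true, y_scores):
    """Mean of precision@i taken at every rank i that holds a positive."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels(y_true, y_scores), start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / sum(y_true)

y_true = [1, 0, 1, 1, 0]              # 1 = real edge, 0 = sampled non-edge
y_scores = [0.9, 0.8, 0.7, 0.3, 0.2]  # model scores, already in descending order

p_at_3 = toy_precision_at_k(y_true, y_scores, 3)  # 2 of the top 3 are positives
r_at_3 = toy_recall_at_k(y_true, y_scores, 3)     # 2 of the 3 positives are found
ap = toy_average_precision(y_true, y_scores)      # (1/1 + 2/3 + 3/4) / 3
print(p_at_3, r_at_3, ap)
```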



In [11]:
# train_pairs = train_edges + train_non_edges
# train_labels = np.array([1] * len(train_edges) + [0] * len(train_non_edges))

# test_pairs = test_edges + test_non_edges
# test_labels = np.array([1] * len(test_edges) + [0] * len(test_non_edges))

# train_features = extract_features(G_train, train_pairs)
# test_features = extract_features(G_train, test_pairs)

# X_train = train_features.drop(columns=["node1", "node2"])
# X_test = test_features.drop(columns=["node1", "node2"])

# # X_train_selected, selected_features = feature_selection(X_train, train_labels, top_k=10)
# # selected_feature_names = X_train.columns[selected_features]
# # print("Selected features:", selected_feature_names.tolist())

# # X_test_selected = X_test.iloc[:, selected_features]

# models = {
#     "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
#     "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
#     "Naive Bayes": GaussianNB()
# }

# k = 100000

# print("{:<25} {:>10} {:>10} {:>15} {:>15} {:>10} {:>10}".format(
#     "Model", "AUC-ROC", "AP Score", f"Precision@{k}", f"Recall@{k}", "MAP", f"F1@{k}"
# ))
# print("=" * 105)

# for name, model in models.items():
#     model.fit(X_train, train_labels)

#     probabilities = model.predict_proba(X_test)[:, 1]

#     auc_roc = roc_auc_score(test_labels, probabilities)
#     ap_score = average_precision_score(test_labels, probabilities)

#     precision_at_k_ml = precision_at_k(test_labels, probabilities, k)
#     recall_at_k_ml = recall_at_k(test_labels, probabilities, k)
#     map_score = mean_average_precision(test_labels, probabilities)
#     f1_k_ml = f1_beta_at_k(test_labels, probabilities, k)

#     print("{:<25} {:>10.6f} {:>10.6f} {:>15.6f} {:>15.6f} {:>10.6f} {:>10.6f}".format(
#         name.upper(), auc_roc, ap_score, precision_at_k_ml, recall_at_k_ml, map_score, f1_k_ml
#     ))


## Graph Embedding Link Prediction

In [12]:
train_edges, test_edges = train_test_split(edges, test_size=0.2, random_state=42)
train_edges[:5]

[(168955, 57948),
 (26010, 27648),
 (209390, 204186),
 (123658, 114870),
 (111403, 49956)]

After several rounds of parameter tuning, the final model is configured as follows.

- TransE model with 200-dimensional embeddings
- Adam optimizer with a learning rate of 0.01
- Loss function: MarginRankingLoss for training
- Batch size: 256 (training), 64 (evaluation)
- Negative sampling: basic
- LP regularization with a weight of 0.01
- Trained for 20 epochs on the GPU
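
For readers unfamiliar with TransE, the following sketch shows the idea behind the model and the loss in plain Python: a triple (h, r, t) is plausible when h + r lies close to t, and the margin ranking loss pushes true triples to score higher than corrupted ones by at least a margin. The function names, the norm `p=2`, and the margin values are illustrative assumptions; the settings PyKEEN actually applies may differ.

```python
def transe_score(h, r, t, p=2):
    """TransE plausibility: the negative L_p distance between h + r and t."""
    distance = sum(abs(hi + ri - ti) ** p for hi, ri, ti in zip(h, r, t)) ** (1 / p)
    return -distance

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Zero once the true triple outscores the corrupted one by `margin`."""
    return max(0.0, margin + neg_score - pos_score)

# Toy 2-dimensional embeddings (hypothetical values).
h = [1.0, 2.0]
r = [0.5, -1.0]
t_true = [1.5, 1.0]      # exactly h + r, so the distance is 0
t_corrupt = [-1.5, 5.0]  # differs from h + r by (3, -4), so the distance is 5

pos_score = transe_score(h, r, t_true)
neg_score = transe_score(h, r, t_corrupt)

loss_small_margin = margin_ranking_loss(pos_score, neg_score, margin=1.0)
loss_large_margin = margin_ranking_loss(pos_score, neg_score, margin=6.0)
print(pos_score, neg_score, loss_small_margin, loss_large_margin)
```

With a margin of 1 the corrupted triple is already far enough away and contributes no loss; with a margin of 6 it still contributes a loss of 1.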

Only 50% of the data is used (`reducer = 0.5`) to shorten training time. Within that subset, the training, validation, and test splits are 70%, 15%, and 15%.
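
The split call in the next cell scales each ratio by `reducer = 0.5`. The sketch below spells out the resulting fractions and the approximate triple counts they imply for the 1,207,337 edges in the graph. The counts are plain arithmetic and will not exactly match the group sizes PyKEEN logs, since those come from its own splitting procedure.

```python
import math

reducer = 0.5  # fraction of all triples that is actually used
names = ["train", "validation", "test", "unused"]
ratios = [0.7 * reducer, 0.15 * reducer, 0.15 * reducer, 1 - reducer]
total_triples = 1207337  # edge count printed by the notebook

for name, ratio in zip(names, ratios):
    print(f"{name:<10} {ratio:>6.3f}  ~{ratio * total_triples:,.0f} triples")

# The four ratios must cover the whole dataset.
ratios_sum = sum(ratios)
```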

In [21]:
import pandas as pd
import numpy as np
from pykeen.triples import TriplesFactory

# Assumes `edges` (the full edge list of G) is defined in an earlier cell

triples = np.array(edges)
reducer = 0.5

relation_placeholder = np.full((triples.shape[0], 1), "bought_with", dtype=object)
triples = np.column_stack((triples[:, 0], relation_placeholder, triples[:, 1]))
triples = triples.astype(str)

tf = TriplesFactory.from_labeled_triples(triples, create_inverse_triples=True)

tf_train, tf_validation, tf_test, tf_unused = tf.split([0.7 * reducer, 0.15 * reducer, 0.15 * reducer, (1 - reducer)])

INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [164222, 90551, 90550, 603669]
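
Because the factory is created with `create_inverse_triples=True`, every training triple is also seen in the reverse direction under a second relation. The sketch below illustrates that idea on two toy triples; the helper name and the `_inverse` suffix are illustrative assumptions rather than PyKEEN internals. It also checks the arithmetic that ties the 422567 training triples reported in the save log to the 3302 batches per epoch shown in the training progress bar, assuming the doubled triples are batched in groups of 256.

```python
import math

def add_inverse_triples(triples):
    """Append (t, r + '_inverse', h) for every (h, r, t)."""
    return triples + [(t, f"{r}_inverse", h) for h, r, t in triples]

toy = [("1", "bought_with", "2"), ("1", "bought_with", "4")]
augmented = add_inverse_triples(toy)

# 422567 training triples, doubled by inverse triples, batch size 256.
batches_per_epoch = math.ceil(2 * 422567 / 256)
print(len(augmented), batches_per_epoch)
```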


In [19]:
import torch
torch.cuda.is_available()

True

In [22]:
from pykeen.pipeline import pipeline

# Define and train the model
result = pipeline(
    training=tf_train,
    testing=tf_test,
    validation=tf_validation,
    model='TransE',
    epochs=20,
    model_kwargs={'embedding_dim': 200},
    optimizer='Adam',
    optimizer_kwargs={'lr': 0.01},
    loss='MarginRankingLoss',
    training_kwargs={'batch_size': 256},
    negative_sampler='basic',
    regularizer='LP',
    regularizer_kwargs={'weight': 0.01},
    evaluator_kwargs={
        'filtered': True,
        'batch_size': 64
    },
    device='cuda:0'
)

# Evaluate the model
result.metric_results.to_df()

INFO:pykeen.pipeline.api:Using device: cuda:0
INFO:pykeen.triples.triples_factory:Creating inverse triples.


Training epochs on cuda:0:   0%|          | 0/20 [00:00<?, ?epoch/s]

INFO:pykeen.triples.triples_factory:Creating inverse triples.


Training batches on cuda:0:   0%|          | 0/3302 [00:00<?, ?batch/s]

Evaluating on cuda:0:   0%|          | 0.00/90.5k [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 10045.01s seconds


|     | Side | Rank_type | Metric | Value |
| --- | --- | --- | --- | --- |
| 0 | head | optimistic | inverse_harmonic_mean_rank | 0.000047 |
| 1 | tail | optimistic | inverse_harmonic_mean_rank | 0.000041 |
| 2 | both | optimistic | inverse_harmonic_mean_rank | 0.000044 |
| 3 | head | realistic | inverse_harmonic_mean_rank | 0.000047 |
| 4 | tail | realistic | inverse_harmonic_mean_rank | 0.000041 |
| ... | ... | ... | ... | ... |
| 220 | tail | realistic | adjusted_hits_at_k | -0.000017 |
| 221 | both | realistic | adjusted_hits_at_k | -0.000011 |
| 222 | head | pessimistic | adjusted_hits_at_k | -0.000005 |
| 223 | tail | pessimistic | adjusted_hits_at_k | -0.000017 |


In [74]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(result.metric_results.to_df())

|     | Side | Rank_type | Metric | Value |
| --- | --- | --- | --- | --- |
| 0 | head | optimistic | inverse_harmonic_mean_rank | 4.709105e-05 |
| 1 | tail | optimistic | inverse_harmonic_mean_rank | 4.067223e-05 |
| 2 | both | optimistic | inverse_harmonic_mean_rank | 4.388164e-05 |
| 3 | head | realistic | inverse_harmonic_mean_rank | 4.709095e-05 |
| 4 | tail | realistic | inverse_harmonic_mean_rank | 4.067215e-05 |
| 5 | both | realistic | inverse_harmonic_mean_rank | 4.388155e-05 |
| 6 | head | pessimistic | inverse_harmonic_mean_rank | 4.709085e-05 |
| 7 | tail | pessimistic | inverse_harmonic_mean_rank | 4.067207e-05 |
| 8 | both | pessimistic | inverse_harmonic_mean_rank | 4.388146e-05 |
| 9 | head | optimistic | z_geometric_mean_rank | -3.570078 |


In [40]:
model = result

In [72]:
import numpy as np

metrics = result.metric_results.to_dict()

hmr = model.get_metric('harmonic_mean_rank')  # emphasizes results ranked near the top
ihmr = model.get_metric('inverse_harmonic_mean_rank')  # reciprocal of HMR, i.e. the MRR
mrr = model.get_metric('mean_reciprocal_rank')

k = 100000

# Display results in a formatted table
print("{:<25} {:>10} {:>10} {:>15} {:>15} {:>10} {:>10} {:>10} {:>10} {:>10}".format(
    "Model", "AUC-ROC", "AP Score", f"Precision@100k", f"Recall@100k", "MAP", f"F1@100k", "HMR", "IHMR", "MRR"
))

print("{:<25} {:>10} {:>10} {:>15} {:>15} {:>10} {:>10} {:>10.2f} {:>10.6f} {:>10.2f}".format(
    "TransE (PyKEEN)", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A", hmr, ihmr, mrr
))


Model                        AUC-ROC   AP Score  Precision@100k     Recall@100k        MAP    F1@100k        HMR       IHMR        MRR
TransE (PyKEEN)                  N/A        N/A             N/A             N/A        N/A        N/A   22788.62   0.000044       0.00
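
The HMR and IHMR columns are two views of the same quantity: the harmonic mean rank is the reciprocal of the mean reciprocal rank. The sketch below demonstrates this on three toy ranks (the function and variable names are hypothetical) and then applies it to the realistic, both-sides IHMR value from the metrics table.

```python
def harmonic_mean_rank(ranks):
    """Number of ranks divided by the sum of their reciprocals."""
    return len(ranks) / sum(1 / r for r in ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all evaluated triples."""
    return sum(1 / r for r in ranks) / len(ranks)

toy_ranks = [1, 2, 4]
toy_hmr = harmonic_mean_rank(toy_ranks)    # 3 / 1.75
toy_mrr = mean_reciprocal_rank(toy_ranks)  # 1.75 / 3

# Realistic, both-sides inverse harmonic mean rank from the metrics table.
reported_ihmr = 4.388155e-05
implied_hmr = 1 / reported_ihmr
print(toy_hmr, toy_mrr, implied_hmr)
```

The implied HMR agrees with the 22788.62 printed in the summary row. Note that the MRR column shows `0.00` only because it is formatted with two decimal places.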


In [25]:
# Save the model
result.save_to_directory("embedding_model")

# Load the model (save_to_directory stores it as trained_model.pkl inside the directory)
# import torch
# trained_model = torch.load("embedding_model/trained_model.pkl", weights_only=False)

INFO:pykeen.triples.triples_factory:Stored TriplesFactory(num_entities=259102, num_relations=2, create_inverse_triples=True, num_triples=422567) to file:///D:/Academic/Projects/PD/embedding_model2/training_triples
INFO:pykeen.pipeline.api:Saved to directory: D:\Academic\Projects\PD\embedding_model2
