Summary:
- We build a knowledge graph using account transactions and ownership/control relations.
- We use a Knowledge-Graph Embedding model (TransE) to learn embeddings for all entities and relations.
- After training, we can score possible new links (e.g., hidden suspicious transfers).
- Higher score = more plausible link under the model (TransE scores are negative distances, not probabilities).
- Goal: uncover hidden or future suspicious connections (e.g., money laundering paths).

Expected result:
- Entity embeddings and relation embeddings.
- Numerical scores for candidate triples.
- Accounts or entities with high hidden link scores are flagged for investigation.
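
To make the scoring idea concrete, here is a minimal sketch of the TransE scoring rule in plain Python. It assumes the common formulation score(h, r, t) = -||h + r - t|| with an L2 norm (the norm order is an assumption here), and the toy 2-dimensional vectors are made up for illustration; they are not taken from the trained model.

```python
import math

def transe_score(h, r, t):
    """Negative L2 distance between (h + r) and t; closer to 0 means more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy 2-d embeddings (illustrative values only, not learned)
account_a = [0.0, 1.0]
transfers_to = [1.0, 0.0]
account_b = [1.0, 1.0]   # exactly account_a + transfers_to
account_z = [4.0, 5.0]   # far from account_a + transfers_to

good = transe_score(account_a, transfers_to, account_b)
bad = transe_score(account_a, transfers_to, account_z)
print("plausible triple:", good)
print("implausible triple:", bad)
```

A triple whose tail sits exactly at head + relation gets the best possible score (zero distance); every other triple gets a negative score that decreases as the distance grows.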

# 🧰 Packages used for Knowledge Graph Embedding code

## 📦 pandas
- Create and manage tables (triples DataFrame).
- Used to clearly define (head, relation, tail) triples before converting to graph.

---

## 📦 torch
- Core PyTorch library for tensor operations.
- Used to create tensors for scoring triples and accessing embeddings.

---

## 📦 pykeen
- Main library for building and training Knowledge Graph Embeddings.
- Key features:
  - `TriplesFactory`: convert triples to internal graph object.
  - `pipeline`: easy way to train KGE models like TransE, RotatE.
  - Scoring functions to evaluate potential links.

---

## ✅ Summary
> *These packages together allow you to define your knowledge graph, train embeddings, and score suspicious or hidden links easily.*

In [None]:
PyKEEN is a framework for knowledge graph embedding (KGE):
- PyKEEN is a unified library that supports many different KGE algorithms.
- You can switch between them just by changing the model name.
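
As a rough sketch of what "changing the model name" can look like, the snippet below builds one set of pipeline arguments per model. The helper `make_pipeline_kwargs` and the chosen model list are illustrative, not part of PyKEEN; the commented-out call assumes PyKEEN is installed and that `train_tf` and `test_tf` exist as created in Step 2.

```python
def make_pipeline_kwargs(model_name, num_epochs=100):
    """Hypothetical helper: keyword arguments for one PyKEEN pipeline run."""
    return dict(model=model_name, training_kwargs=dict(num_epochs=num_epochs))

# Model names are passed to the pipeline as plain strings
model_names = ["TransE", "RotatE", "DistMult"]
configs = {name: make_pipeline_kwargs(name) for name in model_names}

for name, kwargs in configs.items():
    print(name, kwargs)
    # With PyKEEN installed and train_tf / test_tf defined:
    # from pykeen.pipeline import pipeline
    # result = pipeline(training=train_tf, testing=test_tf, **kwargs)
```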

In [10]:
import pandas as pd
import torch
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# ---------------------------------
# Step 1: Create knowledge graph triples
# ---------------------------------
triples = [
    ("A", "transfers_to", "B"),
    ("A", "transfers_to", "C"),
    ("B", "transfers_to", "D"),
    ("C", "transfers_to", "D"),
    ("A", "owns", "CompanyX"),
    ("CompanyX", "controlled_by", "PersonY"),
    ("PersonY", "controls", "D")
]

triples_df = pd.DataFrame(triples, columns=["head", "relation", "tail"])
print("🔗 Triples table (knowledge graph):")
print(triples_df, "\n")



🔗 Triples table (knowledge graph):
       head       relation      tail
0         A   transfers_to         B
1         A   transfers_to         C
2         B   transfers_to         D
3         C   transfers_to         D
4         A           owns  CompanyX
5  CompanyX  controlled_by   PersonY
6   PersonY       controls         D 



In [None]:
# ---------------------------------
# Step 2: Convert to TriplesFactory
# ---------------------------------
'''
TriplesFactory maps entity and relation labels to integer IDs
(assigned in sorted label order):

Entity        ID
A             0
B             1
C             2
CompanyX      3
D             4
PersonY       5

Relation      ID
controlled_by 0
controls      1
owns          2
transfers_to  3

It then converts each triple to a row of IDs:
(A, transfers_to, B) => [0, 3, 1]

Tensor example:
tensor([
  [0, 3, 1],   # (A, transfers_to, B)
  [0, 3, 2],   # (A, transfers_to, C)
  [1, 3, 4],   # (B, transfers_to, D)
  ...
])
'''
tf = TriplesFactory.from_labeled_triples(
    triples_df[["head", "relation", "tail"]].values
)
print('tf', tf)
# Split into train/test sets (the pipeline below needs both training and testing triples)
train_tf, test_tf = tf.split([0.8, 0.2])
print(train_tf, test_tf)


INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [0, 2]


tf TriplesFactory(num_entities=6, num_relations=4, create_inverse_triples=False, num_triples=7)
TriplesFactory(num_entities=6, num_relations=4, create_inverse_triples=False, num_triples=5) TriplesFactory(num_entities=6, num_relations=4, create_inverse_triples=False, num_triples=2)
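
The label-to-ID conversion can be reproduced by hand. The sketch below is plain Python, not PyKEEN's implementation; it assumes IDs are assigned in sorted label order, which matches the IDs printed elsewhere in this notebook (`head_id 0` for `A`, `relation_id 3` for `transfers_to`).

```python
triples = [
    ("A", "transfers_to", "B"),
    ("A", "transfers_to", "C"),
    ("B", "transfers_to", "D"),
    ("C", "transfers_to", "D"),
    ("A", "owns", "CompanyX"),
    ("CompanyX", "controlled_by", "PersonY"),
    ("PersonY", "controls", "D"),
]

# Collect unique labels and number them in sorted order (assumed convention)
entities = sorted({label for h, _, t in triples for label in (h, t)})
relations = sorted({r for _, r, _ in triples})
entity_to_id = {label: i for i, label in enumerate(entities)}
relation_to_id = {label: i for i, label in enumerate(relations)}

# Each labeled triple becomes a row of three integer IDs
mapped_triples = [
    [entity_to_id[h], relation_to_id[r], entity_to_id[t]] for h, r, t in triples
]

print(entity_to_id)
print(relation_to_id)
print(mapped_triples[0])  # IDs for (A, transfers_to, B)
```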


In [26]:
# ---------------------------------
# Step 3: Train KGE model
# ---------------------------------
result = pipeline(
    training=train_tf,
    testing=test_tf,
    model='TransE',
    training_kwargs=dict(num_epochs=100),
)



INFO:pykeen.pipeline.api:Using device: None
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()


Training epochs on cpu:   0%|          | 0/100 [00:00<?, ?epoch/s]

Training batches on cpu:   0%|          | 0.00/1.00 [00:00<?, ?batch/s]



Evaluating on cpu:   0%|          | 0.00/2.00 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.02s seconds


In [None]:

# ---------------------------------
# Step 4: Extract embeddings
# ---------------------------------
'''
The TriplesFactory stores the label-to-ID dictionaries (IDs follow sorted label order):
tf.entity_to_id   => {'A': 0, 'B': 1, 'C': 2, 'CompanyX': 3, 'D': 4, 'PersonY': 5}
tf.relation_to_id => {'controlled_by': 0, 'controls': 1, 'owns': 2, 'transfers_to': 3}
'''

entity_embeddings = result.model.entity_representations[0]
relation_embeddings = result.model.relation_representations[0]

# Example: Get embedding for A
embedding_A = entity_embeddings(torch.tensor([tf.entity_to_id["A"]])).detach().numpy()
print("📍 Embedding vector for entity A:")
print(embedding_A, "\n")

# Example: Get embedding for relation transfers_to
embedding_transfer = relation_embeddings(torch.tensor([tf.relation_to_id["transfers_to"]])).detach().numpy()
print("🔄 Embedding vector for relation 'transfers_to':")
print(embedding_transfer, "\n")



📍 Embedding vector for entity A:
[[-0.07973986 -0.2504249   0.05097923 -0.25063926  0.02950187 -0.01629204
   0.18301998  0.18390365 -0.1437828   0.08742514  0.21019123 -0.05539561
   0.16767271  0.16617793  0.02767996 -0.01452485  0.06938939 -0.02264258
   0.05187031  0.12631048 -0.2414875  -0.22792947 -0.08107183 -0.22353646
  -0.08727829  0.08163689 -0.04147071  0.11045516  0.22577046  0.20580865
   0.07212144 -0.04152542 -0.01053331 -0.06780873 -0.17319354  0.05516351
   0.19073737 -0.06339318 -0.02646952  0.10326536  0.24820502 -0.10201132
   0.06053919 -0.14650267  0.02554761  0.2055259   0.09851103  0.20880887
   0.22851089  0.10236115]] 

🔄 Embedding vector for relation 'transfers_to':
[[ 0.04591494  0.07179378 -0.1028071   0.21411969 -0.26528192 -0.06401383
  -0.26743394  0.11708802  0.07005366  0.0508561  -0.17405905  0.17120609
  -0.16560054 -0.05427961  0.28832406  0.16804111 -0.13138331  0.05978727
   0.22079481 -0.05185619 -0.08317145 -0.2004517   0.12980919 -0.01638967
 

In [31]:
# ---------------------------------
# Step 5: Score potential hidden link
# ---------------------------------
triple = torch.tensor([
    [tf.entity_to_id["A"], tf.relation_to_id["transfers_to"], tf.entity_to_id["D"]]
])
score = result.model.score_hrt(triple).detach().numpy()

# score has shape (1, 1); .item() extracts the single value as a Python float
print(f"⚖️ Score for (A, transfers_to, D): {score.item():.4f}")

⚖️ Score for (A, transfers_to, D): -9.8958



In [32]:
# ---------------------------------
# Step 6: Print evaluation metrics
# ---------------------------------
metrics = result.metric_results.to_dict()

print("\n📊 Available metric keys:")
print(metrics.keys())

# Get first available key
first_key = list(metrics.keys())[0]

print(f"\n📊 Evaluation metrics ({first_key}):")
for metric, value in metrics[first_key].items():
    if isinstance(value, (float, int)):
        print(f"{metric}: {value:.4f}")
    else:
        print(f"{metric}: {value}")  # Print raw if not a simple number


📊 Available metric keys:
dict_keys(['head', 'tail', 'both'])

📊 Evaluation metrics (head):
optimistic: {'median_absolute_deviation': 1.482602218505602, 'geometric_mean_rank': 3.872983346207417, 'adjusted_inverse_harmonic_mean_rank': -0.29221732745961815, 'adjusted_arithmetic_mean_rank_index': -0.33333333333333326, 'variance': 1.0, 'inverse_arithmetic_mean_rank': 0.25, 'harmonic_mean_rank': 3.75, 'adjusted_arithmetic_mean_rank': 1.2307692307692308, 'standard_deviation': 1.0, 'z_arithmetic_mean_rank': -0.676481425202546, 'inverse_median_rank': 0.25, 'arithmetic_mean_rank': 4.0, 'median_rank': 4.0, 'inverse_geometric_mean_rank': 0.2581988897471611, 'z_geometric_mean_rank': -0.7311934862411587, 'count': 2.0, 'inverse_harmonic_mean_rank': 0.26666666666666666, 'z_inverse_harmonic_mean_rank': -0.8140279252758773, 'adjusted_geometric_mean_rank_index': -0.4176853051609841, 'hits_at_1': 0.0, 'hits_at_3': 0.5, 'hits_at_5': 1.0, 'hits_at_10': 1.0, 'z_hits_at_k': 0.0, 'adjusted_hits_at_k': 0.0}
re

In [35]:
print(metrics['head'])

{'optimistic': {'median_absolute_deviation': 1.482602218505602, 'geometric_mean_rank': 3.872983346207417, 'adjusted_inverse_harmonic_mean_rank': -0.29221732745961815, 'adjusted_arithmetic_mean_rank_index': -0.33333333333333326, 'variance': 1.0, 'inverse_arithmetic_mean_rank': 0.25, 'harmonic_mean_rank': 3.75, 'adjusted_arithmetic_mean_rank': 1.2307692307692308, 'standard_deviation': 1.0, 'z_arithmetic_mean_rank': -0.676481425202546, 'inverse_median_rank': 0.25, 'arithmetic_mean_rank': 4.0, 'median_rank': 4.0, 'inverse_geometric_mean_rank': 0.2581988897471611, 'z_geometric_mean_rank': -0.7311934862411587, 'count': 2.0, 'inverse_harmonic_mean_rank': 0.26666666666666666, 'z_inverse_harmonic_mean_rank': -0.8140279252758773, 'adjusted_geometric_mean_rank_index': -0.4176853051609841, 'hits_at_1': 0.0, 'hits_at_3': 0.5, 'hits_at_5': 1.0, 'hits_at_10': 1.0, 'z_hits_at_k': 0.0, 'adjusted_hits_at_k': 0.0}, 'realistic': {'median_absolute_deviation': 1.482602218505602, 'geometric_mean_rank': 3.872

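PyKEEN reports each metric under three rank types, which differ in how ties (candidates with the same score as the true triple) are handled:
- Optimistic: the true triple is ranked first among the tied candidates (best case).
- Pessimistic: the true triple is ranked last among the tied candidates (worst case).
- Realistic: the mean of the optimistic and pessimistic ranks.

Usually, we focus on realistic.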
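
The three rank types can be illustrated with a small plain-Python sketch. The function `rank_types` is a hypothetical helper written for this note, not a PyKEEN API, and the scores are made up; it assumes that a higher score is better.

```python
def rank_types(true_score, other_scores):
    """Return (optimistic, pessimistic, realistic) rank of the true triple."""
    better = sum(s > true_score for s in other_scores)   # strictly better candidates
    tied = sum(s == true_score for s in other_scores)    # candidates with the same score
    optimistic = better + 1            # true triple placed first among the ties
    pessimistic = better + tied + 1    # true triple placed last among the ties
    realistic = (optimistic + pessimistic) / 2
    return optimistic, pessimistic, realistic

# Made-up scores: one competitor beats the true triple, two tie with it
with_ties = rank_types(-9.0, [-7.5, -9.0, -9.0, -11.0])
without_ties = rank_types(-9.0, [-7.5, -11.0])
print("with ties:", with_ties)
print("without ties:", without_ties)
```

When there are no ties, all three rank types coincide, which is why the optimistic and realistic values often look identical in practice.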

📊 Key metrics explained

⭐ Hits@K (values quoted from the head / optimistic output above)
- hits_at_1: fraction of test triples whose true head is ranked #1 (0.0 here).
- hits_at_3: fraction whose true head is ranked in the top 3 (0.5 here, i.e. 1 of the 2 test triples).
- hits_at_5, hits_at_10: same logic with a larger K (both 1.0 here).

Interpretation:
- hits_at_3 = 0.5 ⇒ in 50% of head-prediction cases, the true head is in the top 3.
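
The reported head metrics can be reproduced from the individual ranks. The notebook does not print those ranks; the values 3 and 5 used below are inferred, being the pair consistent with the reported `count` of 2, `arithmetic_mean_rank` of 4.0, `variance` of 1.0 and `harmonic_mean_rank` of 3.75. The helper functions are a plain-Python sketch, not PyKEEN's implementation.

```python
def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank (reported by PyKEEN as inverse_harmonic_mean_rank)."""
    return sum(1 / r for r in ranks) / len(ranks)

# Inferred head ranks of the two test triples (see the note above)
ranks = [3, 5]

for k in (1, 3, 5, 10):
    print(f"hits_at_{k}: {hits_at_k(ranks, k):.2f}")
print(f"inverse_harmonic_mean_rank: {mean_reciprocal_rank(ranks):.4f}")
```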

In [38]:
# ---------------------------------
# Step 7: Generate and save suspicious link scores
# ---------------------------------
print("\n💡 Generating possible suspicious links...")

# List all entities
all_entities = list(tf.entity_to_id.keys())
print("all_entities",all_entities)
# Example: Check all possible tails for (A, transfers_to, ?)
head_id = tf.entity_to_id["A"]
relation_id = tf.relation_to_id["transfers_to"]
print('head_id', head_id)
print('relation_id',relation_id)
triples_to_score = []

for tail in all_entities:
    if tail != "A":  # Avoid self-loop
        tail_id = tf.entity_to_id[tail]
        triples_to_score.append([head_id, relation_id, tail_id])

triples_tensor = torch.tensor(triples_to_score)

# Get scores
scores = result.model.score_hrt(triples_tensor).detach().numpy()

# Create DataFrame (one row per scored candidate tail)
candidate_tails = [t for t in all_entities if t != "A"]
suspicious_df = pd.DataFrame({
    "head": ["A"] * len(candidate_tails),
    "relation": ["transfers_to"] * len(candidate_tails),
    "tail": candidate_tails,
    "score": scores.flatten()
})

# Sort by score, descending: PyKEEN's TransE score is the negative distance
# -||h + r - t||, so a higher (less negative) score means a more plausible triple.
suspicious_df = suspicious_df.sort_values(by="score", ascending=False).reset_index(drop=True)
print("\n✅ Candidate links ranked by plausibility (highest score first):")
print(suspicious_df)

# Save example (e.g., suspicious_df.to_sql(...) or .to_csv(...) )
# suspicious_df.to_csv("suspicious_links.csv", index=False)


💡 Generating possible suspicious links...
all_entities ['A', 'B', 'C', 'CompanyX', 'D', 'PersonY']
head_id 0
relation_id 3

✅ Candidate links ranked by plausibility (highest score first):
  head      relation      tail      score
0    A  transfers_to         B  -7.791265
1    A  transfers_to  CompanyX  -9.465056
2    A  transfers_to         D  -9.895801
3    A  transfers_to   PersonY -10.303490
4    A  transfers_to         C -10.747336
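
The candidate list still contains links that are already in the graph (A → B and A → C). A possible follow-up, sketched here in plain Python, is to drop known triples so that only new links are flagged. The scores are copied from the output above; the ordering assumes PyKEEN's convention that a higher score means a more plausible triple.

```python
# Triples already present in the knowledge graph (from Step 1)
known_triples = {
    ("A", "transfers_to", "B"),
    ("A", "transfers_to", "C"),
    ("B", "transfers_to", "D"),
    ("C", "transfers_to", "D"),
    ("A", "owns", "CompanyX"),
    ("CompanyX", "controlled_by", "PersonY"),
    ("PersonY", "controls", "D"),
}

# (tail, score) pairs for (A, transfers_to, ?), copied from the output above
scored_tails = [
    ("C", -10.747336),
    ("PersonY", -10.303490),
    ("D", -9.895801),
    ("CompanyX", -9.465056),
    ("B", -7.791265),
]

# Keep only candidates not already in the graph, highest score first
new_candidates = sorted(
    (
        (tail, score)
        for tail, score in scored_tails
        if ("A", "transfers_to", tail) not in known_triples
    ),
    key=lambda pair: pair[1],
    reverse=True,
)

for tail, score in new_candidates:
    print(f"(A, transfers_to, {tail}): {score:.4f}")
```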
