Step 1: Load the Pre-split Data
The training, validation, and test data are already split and stored in train.txt, valid.txt, and test.txt, so we load these files directly (datasets: FB15K, FB15k-237, NELL, WN18RR).

In [None]:
import pandas as pd

# Load the pre-split training, validation, and test data
train_data_loader = pd.read_csv('data/train.txt', sep=' ', header=None, names=['head', 'relation', 'tail'])
valid_data_loader = pd.read_csv('data/valid.txt', sep=' ', header=None, names=['head', 'relation', 'tail'])
test_data_loader = pd.read_csv('data/test.txt', sep=' ', header=None, names=['head', 'relation', 'tail'])

# Print the head of the loaded datasets to ensure correctness
print("Training Data Head:", train_data_loader.head())
print("Validation Data Head:", valid_data_loader.head())
print("Test Data Head:", test_data_loader.head())


Data Processing Explanation:
Data Source: train.txt, valid.txt, and test.txt files, each containing triplet data (head, relation, tail).
Data Processing: These files are loaded as pandas DataFrames, and columns are named (head, relation, tail). This data will be used for model training, validation, and testing.
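
Because the field separator depends on how the files were exported (tab-separated files are also common for these benchmarks), it can help to confirm that every line really splits into three fields. The sketch below is a minimal check using only the standard library; the helper name read_triples and the in-memory sample are illustrative assumptions, not part of the project code.

```python
import csv
import io

def read_triples(handle, sep="\t"):
    """Read (head, relation, tail) triples and fail early on malformed lines."""
    triples = []
    for line_no, row in enumerate(csv.reader(handle, delimiter=sep), start=1):
        if not row:  # skip blank lines
            continue
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(row)}")
        triples.append(tuple(row))
    return triples

# Illustrative in-memory file standing in for train.txt
sample = io.StringIO("a\tlikes\tb\nb\tlikes\tc\n")
triples = read_triples(sample)
print(triples)
```

If the wrong separator is passed, each line is read as a single field and the helper raises a ValueError instead of silently producing misaligned columns.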

Step 2: Data Preprocessing
Before training, ensure that there are no missing values by removing rows containing NaN values.
Use the IQR (Interquartile Range) method to detect outliers. IQR is the range of the middle 50% of the data distribution. Any data points smaller than Q1 minus 1.5 times the IQR, or larger than Q3 plus 1.5 times the IQR, are considered outliers.

In [None]:
# Remove missing values
train_data_loader = train_data_loader.dropna()
valid_data_loader = valid_data_loader.dropna()
test_data_loader = test_data_loader.dropna()
# Identify and remove outliers using the IQR method.
# Note: quantile() requires numeric columns, so this assumes integer-encoded IDs.
for column in ['head', 'tail', 'relation']:
    Q1 = train_data_loader[column].quantile(0.25)
    Q3 = train_data_loader[column].quantile(0.75)
    IQR = Q3 - Q1
    train_data_loader = train_data_loader[(train_data_loader[column] >= (Q1 - 1.5 * IQR)) &
                                          (train_data_loader[column] <= (Q3 + 1.5 * IQR))]

# Print the sizes of the datasets after cleaning
# (NaN removal for all splits, plus outlier removal for the training set)
print(f"Training Data Size after Dropping NA and Outliers: {len(train_data_loader)}")
print(f"Validation Data Size after Dropping NA: {len(valid_data_loader)}")
print(f"Test Data Size after Dropping NA: {len(test_data_loader)}")


Data Processing Explanation:
Data Source: The train_data_loader, valid_data_loader, and test_data_loader DataFrames loaded in Step 1.
Data Processing: Rows containing NaN values are removed so that each triplet (head, relation, tail) is complete, and rows of the training set whose values fall outside the IQR bounds are dropped. This prevents issues during training.
Data Usage: Cleaned datasets will be used for training, validation, and testing.
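
To make the IQR rule concrete, the following sketch computes the outlier bounds for a small list of numbers using only the standard library. The helper name iqr_bounds and the sample values are illustrative assumptions; the 'inclusive' method is chosen because it interpolates linearly between data points.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) bounds outside of which a value counts as an outlier."""
    # n=4 yields the three quartile cut points Q1, Q2 (median), Q3
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative values; 100 lies far from the rest
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_bounds(sample)
kept = [v for v in sample if lower <= v <= upper]
print(lower, upper, kept)
```

Here Q1 is 3 and Q3 is 7, so the IQR is 4 and the bounds are -3 and 13; the value 100 is dropped. The rule applies only to numeric values, so the columns must hold numeric IDs for the filter in the cell above to run.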

Step 3: Negative Sample Generation (Different Methods for Different Models)
For TransE and DistMult decoders, we use Bernoulli Negative Sampling.
For RotatE decoder, we use Self-Adversarial Negative Sampling.
Bernoulli Negative Sampling (for TransE and DistMult):

In [None]:
import numpy as np

# Bernoulli Negative Sampling: For TransE and DistMult
def bernoulli_negative_sampling(train_data_loader, num_negatives=5):
    negative_samples = []
    # Collect every entity that appears as a head or as a tail
    entities = pd.unique(train_data_loader[['head', 'tail']].values.ravel())

    for _, row in train_data_loader.iterrows():
        head, relation, tail = row['head'], row['relation'], row['tail']
        
        for _ in range(num_negatives):  # Generate num_negatives negative samples per positive sample
            if np.random.rand() < 0.5:  # 50% chance to replace the head entity
                negative_head = np.random.choice(entities)
                negative_samples.append((negative_head, relation, tail))
            else:  # 50% chance to replace the tail entity
                negative_tail = np.random.choice(entities)
                negative_samples.append((head, relation, negative_tail))

    return pd.DataFrame(negative_samples, columns=['head', 'relation', 'tail'])


Data Processing Explanation:
Training Data Source: train_data_loader, which contains the positive samples from the training set.
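Data Processing: Each negative sample is created by replacing either the head or the tail of a positive triple with a randomly chosen entity. Note that the code above picks head or tail with a fixed probability of 0.5, which is uniform corruption; the Bernoulli scheme proper sets the head-replacement probability per relation to tph / (tph + hpt), where tph is the average number of tails per head and hpt is the average number of heads per tail.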
Data Usage: The generated negative samples are combined with the original training data to form a dataset that will be used for training.
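
As a rough illustration of relation-specific Bernoulli probabilities, the sketch below computes tph / (tph + hpt) for each relation, where tph is the average number of tails per head and hpt is the average number of heads per tail. The function name head_replacement_probs and the toy triples are assumptions made for this example; the project code may compute these statistics differently.

```python
from collections import defaultdict

def head_replacement_probs(triples):
    """Map each relation to tph / (tph + hpt), the probability of corrupting the head."""
    tails_of = defaultdict(set)  # (relation, head) -> distinct tails
    heads_of = defaultdict(set)  # (relation, tail) -> distinct heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)

    tail_counts = defaultdict(list)  # relation -> number of tails for each head
    head_counts = defaultdict(list)  # relation -> number of heads for each tail
    for (r, _), tails in tails_of.items():
        tail_counts[r].append(len(tails))
    for (r, _), heads in heads_of.items():
        head_counts[r].append(len(heads))

    probs = {}
    for r in tail_counts:
        tph = sum(tail_counts[r]) / len(tail_counts[r])  # average tails per head
        hpt = sum(head_counts[r]) / len(head_counts[r])  # average heads per tail
        probs[r] = tph / (tph + hpt)
    return probs

# Toy one-to-many relation: one head linked to three tails
toy = [("a", "r", "x"), ("a", "r", "y"), ("a", "r", "z")]
probs = head_replacement_probs(toy)
print(probs)
```

For this one-to-many relation tph is 3 and hpt is 1, so the head is replaced with probability 0.75. Replacing the head more often here lowers the chance of producing a false negative, since a random tail is more likely to form a true triple. In the sampling loop, comparing np.random.rand() against probs[relation] would take the place of the fixed 0.5 threshold.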

Self-Adversarial Negative Sampling (for RotatE):

In [None]:
import torch

# Self-Adversarial Negative Sampling: For RotatE
def self_adversarial_negative_sampling(train_data_loader, model, num_negatives=5, temperature=1.0):
    negative_samples = []
    # Collect every entity that appears as a head or as a tail
    entities = pd.unique(train_data_loader[['head', 'tail']].values.ravel())

    for _, row in train_data_loader.iterrows():
        head, relation, tail = row['head'], row['relation'], row['tail']
        candidates = []

        # Generate candidate negative samples
        for _ in range(num_negatives * 10):  # Generate multiple candidate negative samples
            if np.random.rand() < 0.5:
                negative_head = np.random.choice(entities)
                candidates.append((negative_head, relation, tail))
            else:
                negative_tail = np.random.choice(entities)
                candidates.append((head, relation, negative_tail))

        # Use the model's scoring method to calculate the scores of the candidate negative samples
        scores = model.score_candidates(candidates)  # Assuming the model has a score_candidates method
        scores = torch.softmax(scores / temperature, dim=0)  # Adjust based on temperature
        selected_indices = torch.multinomial(scores, num_negatives, replacement=True)

        # Convert the index tensor to plain Python ints before indexing the list
        for idx in selected_indices.tolist():
            negative_samples.append(candidates[idx])

    return pd.DataFrame(negative_samples, columns=['head', 'relation', 'tail'])


Data Processing Explanation:
Training Data Source: train_data_loader, which contains the positive samples from the training set.
Data Processing: Self-Adversarial Negative Sampling is applied. Multiple negative candidates are generated, and their scores are computed using the model itself. Based on the scores, negative samples are selected using the softmax function and temperature scaling.
Data Usage: The selected negative samples are combined with the original positive samples and will be used for training.
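
The effect of the temperature is easiest to see on a few fixed scores. The sketch below mirrors the softmax-then-sample step in plain Python, without torch or a trained model; the candidate scores are hypothetical values chosen only for illustration.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

candidate_scores = [2.0, 1.0, 0.0]  # hypothetical model scores for three candidates
sharp = softmax(candidate_scores, temperature=0.5)
flat = softmax(candidate_scores, temperature=5.0)

# Sample candidate indices with replacement, weighted by the probabilities
rng = random.Random(0)
picked = rng.choices(range(len(candidate_scores)), weights=sharp, k=5)
print(sharp, flat, picked)
```

With a low temperature most of the probability mass goes to the highest-scoring candidate, so the sampler concentrates on the negatives the model finds most plausible; with a high temperature the selection approaches uniform sampling.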

Step 4: Model Training
We initialize the corresponding model based on the decoder choice and proceed with training, using the generated negative samples.

In [None]:
import torch
import torch.optim as optim
import torch.nn as nn
from model import TransEModel, DistMultModel, RotatEModel  # Import your selected model

# Choose decoder (model)
model_choice = "RotatE"  # Can choose "TransE", "DistMult", "RotatE"

# Initialize the corresponding model
# (num_entities, num_relations, and embedding_dim must be defined before this point)
if model_choice == "TransE":
    model = TransEModel(num_entities, num_relations, embedding_dim)
elif model_choice == "DistMult":
    model = DistMultModel(num_entities, num_relations, embedding_dim)
else:  # Default to RotatE model
    model = RotatEModel(num_entities, num_relations, embedding_dim)

# Loss function and optimizer
loss_fn = nn.CrossEntropyLoss()  # Using cross-entropy loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer with a learning rate of 0.001

# The training process is described in detail in trainer.py.

Data Processing Explanation:
Training Data Source: train_data_loader (positive samples), combined with the generated negative samples.
Data Usage: The model is trained using the training data, which now includes both positive and generated negative samples. The loss is computed and used to update model parameters through backpropagation.
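
The model constructors above take num_entities and num_relations, which the cell does not define. One possible way to derive them is to assign each distinct entity and relation a consecutive integer ID, as sketched below. The helper build_index and the toy triples are illustrative assumptions, and embedding_dim remains a hyperparameter to be chosen separately.

```python
def build_index(values):
    """Assign consecutive integer IDs in order of first appearance."""
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index

# Toy triples standing in for the rows of the training DataFrame
triples = [("a", "likes", "b"), ("b", "likes", "c"), ("a", "knows", "c")]

# Entities are collected from both the head and the tail position
entity_to_id = build_index(e for h, _, t in triples for e in (h, t))
relation_to_id = build_index(r for _, r, _ in triples)

num_entities = len(entity_to_id)
num_relations = len(relation_to_id)

# Integer-encoded triples, suitable for indexing embedding tables
encoded = [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in triples]
print(num_entities, num_relations, encoded)
```

With the DataFrames from Step 1, the rows could be supplied through train_data_loader.itertuples(index=False). To cover entities that appear only in the validation or test triples, the index would need to be built over all three splits.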

Step 5: Validation Set Usage
At the end of each epoch, we use the validation data to evaluate the model and help with hyperparameter tuning.

In [None]:
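# Validation Phase: After each epoch, evaluate the model using the validation set
model.eval()  # Switch to evaluation mode (disables dropout and similar layers)

val_loss = 0.0  # Initialize validation loss
with torch.no_grad():  # Gradients are not needed during evaluation
    for batch in valid_data_loader.values:  # Validation data comes from valid_data_loader
        head, relation, tail = batch[0], batch[1], batch[2]
        score = model(head, relation, tail)
        # Assumption: `target` is built the same way as in trainer.py; it is not defined in this notebook
        loss = loss_fn(score, target)
        val_loss += loss.item()  # Accumulate validation loss

print(f"Validation Loss: {val_loss}")  # Print validation loss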


Data Processing Explanation:
Validation Data Source: valid_data_loader, which comes from the validation set.
Data Usage: At the end of each epoch, we evaluate the model on the validation set, compute the validation loss, and print it to monitor performance.

Step 6: Test Set Usage
Finally, after training, we evaluate the model on the test set to assess how well it generalizes to unseen data.
The testing process is described in detail in Tester.py.

Data Processing Explanation:
Test Data Source: test_data_loader, which comes from the test set.
Data Usage: The model is evaluated on the test data to assess how well it generalizes to unseen data, and the final test loss is computed.

This notebook only provides a detailed explanation of data processing and its usage in training, validation, and testing. The full implementation can be found in the code.