# Inference Notebook
This Jupyter notebook is the inference script for our relation-extraction models trained on the TACRED dataset. It loads the pre-trained models, preprocesses the input data, runs inference, and evaluates the models' performance.


### Load the Data
#### Example Input Test Data
The example test input below is a list of dictionaries, one per data instance, in the same format as the TACRED data. Any sentence to be tested should be provided in this format.

In [1]:
# This is an example input test data represented as a list of dictionaries.
# Each dictionary represents a data instance with various attributes such as 'id', 'docid', 'relation', 'token', 'subj_start', 'subj_end', 'obj_start', 'obj_end', 'subj_type', 'obj_type', 'stanford_pos', 'stanford_ner', 'stanford_head', 'stanford_deprel'.
# 'id': Unique identifier for the data instance.
# 'docid': Identifier for the document containing the data instance.
# 'relation': Type of relation between entities.
# 'token': List of tokens representing the text of the instance.
# 'subj_start', 'subj_end': Start and end indices (both inclusive) of the subject entity in the 'token' list.
# 'obj_start', 'obj_end': Start and end indices (both inclusive) of the object entity in the 'token' list.
# 'subj_type', 'obj_type': Types of subject and object entities.
# 'stanford_pos': Part-of-speech tags assigned by the Stanford NLP toolkit.
# 'stanford_ner': Named entity recognition tags assigned by the Stanford NLP toolkit.
# 'stanford_head': Dependency parsing head indices for each token.
# 'stanford_deprel': Dependency relation labels for each token.
input_test_data = [{'id': '098f6eb6b0421982e87d',
                    'docid': 'APW_ENG_20091113.0131',
                    'relation': 'per:age',
                    'token': ['Sarah', ',', '33', ',', 'agreed', ',', 'citing', 'the', 'weeks', 'when', 'protesters', 'gathered', 'outside', 'their', 'home', ',', 'once', 'even', 'breaking', 'their', 'home', "'s", 'front', 'window', '.'],
                    'subj_start': 0,
                    'subj_end': 0,
                    'obj_start': 2,
                    'obj_end': 2,
                    'subj_type': 'PERSON',
                    'obj_type': 'NUMBER',
                    'stanford_pos': ['NNP', ',', 'CD', ',', 'VBD', ',', 'VBG', 'DT', 'NNS', 'WRB', 'NNS', 'VBD', 'IN', 'PRP$', 'NN', ',', 'RB', 'RB', 'VBG', 'PRP$', 'NN', 'POS', 'NN', 'NN', '.'],
                    'stanford_ner': ['PERSON', 'O', 'NUMBER', 'O', 'O', 'O', 'O', 'DURATION', 'DURATION', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
                    'stanford_head': [5, 1, 1, 1, 0, 5, 5, 9, 7, 12, 12, 9, 15, 15, 12, 12, 19, 19, 12, 21, 24, 21, 24, 19, 5],
                    'stanford_deprel': ['nsubj', 'punct', 'amod', 'punct', 'ROOT', 'punct', 'xcomp', 'det', 'dobj', 'advmod', 'nsubj', 'acl:relcl', 'case', 'nmod:poss', 'nmod', 'punct', 'advmod', 'advmod', 'advcl', 'nmod:poss', 'nmod:poss', 'case', 'compound', 'dobj', 'punct']
                    }]
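
To make the index convention concrete, the following minimal sketch recovers the subject and object strings from an instance. It assumes only the TACRED fields shown above; the helper name `entity_text` is hypothetical and is not part of the project's modules.

```python
def entity_text(instance, role):
    """Return the surface string of the 'subj' or 'obj' entity."""
    start = instance[f"{role}_start"]
    end = instance[f"{role}_end"]  # TACRED end indices are inclusive
    return " ".join(instance["token"][start:end + 1])


# A shortened copy of the example instance (first five tokens only).
example = {
    "token": ["Sarah", ",", "33", ",", "agreed"],
    "subj_start": 0, "subj_end": 0,
    "obj_start": 2, "obj_end": 2,
}

subject = entity_text(example, "subj")
obj = entity_text(example, "obj")
print(subject, obj)  # prints: Sarah 33
```

For the example above, the subject is the person `Sarah` and the object is the number `33`, which matches the gold relation `per:age`.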

## BERT-based Approach

### 1. Importing Libraries
This cell imports the Python packages and helper code required by the BERT-based approach through `module1`.

In [2]:
from module1 import *



### 2. Preprocess the Dataset
This cell preprocesses the test data by converting tokens into numerical representations and recording the subject and object positions. The result is wrapped in a `TACREDDataset` and served in batches by a `DataLoader`.

In [3]:
test_processed_data = [preprocess_data(item) for item in input_test_data]
test_dataset = TACREDDataset(test_processed_data)
test_dataloader = DataLoader(test_dataset, batch_size=32, collate_fn=collate_fn)
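
The internals of `preprocess_data` live in `module1` and are not shown in this notebook. One common way to expose subject and object positions to a BERT-style encoder is to wrap each entity in marker tokens. The sketch below illustrates that idea only; the function name `add_entity_markers` and the marker strings are hypothetical, and `module1` may handle positions differently.

```python
def add_entity_markers(tokens, subj_span, obj_span):
    """Wrap the subject and object spans (inclusive indices) in marker tokens.

    The marker strings are an assumption made for illustration.
    """
    subj_start, subj_end = subj_span
    obj_start, obj_end = obj_span
    marked = []
    for i, tok in enumerate(tokens):
        if i == subj_start:
            marked.append("[SUBJ]")
        if i == obj_start:
            marked.append("[OBJ]")
        marked.append(tok)
        if i == subj_end:
            marked.append("[/SUBJ]")
        if i == obj_end:
            marked.append("[/OBJ]")
    return marked


marked = add_entity_markers(["Sarah", ",", "33", ",", "agreed"], (0, 0), (2, 2))
print(marked)
# ['[SUBJ]', 'Sarah', '[/SUBJ]', ',', '[OBJ]', '33', '[/OBJ]', ',', 'agreed']
```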


### 3. Loading the Model

In [4]:
with open('model1.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

In [5]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
optimizer = AdamW(loaded_model.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()




Validation Loss: 7.6921
Validation Accuracy: 0.0000


### 4. Evaluating the Model

In [13]:
true_labels,predictions = evaluate_model(loaded_model, test_dataloader, device)
original_true_labels = label_encoder.inverse_transform(true_labels)
original_predictions = label_encoder.inverse_transform(predictions)
print(f'True Label : {original_true_labels}, Predicted Label : {original_predictions}')


True Label : ['per:age'], Predicted Label : ['no_relation']
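
The `label_encoder` object comes from `module1` and is not defined in this notebook. Assuming it behaves like scikit-learn's `LabelEncoder`, which sorts the class names and assigns each one its position in that order, the encode/decode round trip amounts to the following pure-Python stand-in. The three-label set is a hypothetical subset chosen for illustration.

```python
# Hypothetical subset of relation labels, for illustration only.
labels = ["per:age", "no_relation", "org:founded"]

# Sorted unique class names; the index of each name is its encoded value.
classes = sorted(set(labels))
label_to_index = {name: i for i, name in enumerate(classes)}


def encode(names):
    """Map relation names to integer class indices."""
    return [label_to_index[name] for name in names]


def decode(indices):
    """Map integer class indices back to relation names."""
    return [classes[i] for i in indices]


encoded = encode(["per:age", "no_relation"])
decoded = decode(encoded)
print(encoded, decoded)  # prints: [2, 0] ['per:age', 'no_relation']
```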


## GNN-based Approach

This second approach performs GNN-based relation extraction using contrastive learning.


### 1. Installing Required Packages
These cells install and import the Python packages required to run the code. They also import the project's custom module `modules2`.

In [7]:
# Uncomment the line below to install the required Python packages with pip.
# !pip install numpy torch torch-geometric scikit-learn

In [8]:
# Importing necessary libraries for the code execution.
import json  # Importing the JSON module for handling JSON data.
import random  # Importing the random module for generating random numbers.
import numpy as np  # Importing numpy library and aliasing it as np for numerical computations.
import torch  # Importing PyTorch library for deep learning.
import torch.nn.functional as F  # Importing torch.nn.functional for various neural network operations.
from torch.nn import CrossEntropyLoss  # Importing CrossEntropyLoss for computing the loss.
from torch_geometric.data import Data  # Importing Data from torch_geometric for graph data handling.
from torch_geometric.loader import DataLoader  # Importing DataLoader for batching graphs (its location since PyG 2.0).
from torch_geometric.nn import GATConv, global_mean_pool  # Importing GATConv and global_mean_pool for graph convolution and pooling operations.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support  # Importing metrics from scikit-learn for evaluation.

import modules2  # Importing custom modules for the project.

### 2. Test Data Preprocessing
This cell preprocesses the test data by converting tokens, POS tags, and NER tags into numerical representations. Graphs are then built from the preprocessed data to represent the structured relationships between tokens and their features, which are passed to the model. Finally, a DataLoader is created for the test graphs to enable batch processing during testing.

In [9]:
# The test data can be read from a JSON file with 'modules2.read_json_file'. It is loaded only to create
# 'pos_tag_to_index' and 'ner_tag_to_index', which are required during preprocessing.
# test_data = modules2.read_json_file('test.json')

# It creates tag indices for part-of-speech (POS) tags and named entity recognition (NER) tags from the test data.
# The 'create_tag_indices' function generates dictionaries mapping POS and NER tags to numerical indices.
pos_tag_to_index, ner_tag_to_index = modules2.create_tag_indices(modules2.test_data)

# It preprocesses the test dataset using the custom function 'modules2.preprocess_dataset'.
# This function converts tokens, POS tags, and NER tags into numerical representations using the dictionaries generated earlier.
preprocessed_test_data = modules2.preprocess_dataset(input_test_data, pos_tag_to_index, ner_tag_to_index)

# It creates graphs from the preprocessed test data using the custom function 'modules2.create_graphs'.
# These graphs represent the structured relationships between tokens and their features.
graphs_test = modules2.create_graphs(preprocessed_test_data)

# It creates a DataLoader for the test graphs to facilitate batch processing during testing.
# The DataLoader is initialized with a batch size of 32 and shuffle set to False.
test_loader = DataLoader(graphs_test, batch_size=32, shuffle=False)
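
The helpers in `modules2` are not reproduced in this notebook. As a rough illustration of the two ideas in this step, the sketch below builds a tag-to-index mapping and derives graph edges from `stanford_head`, where each entry is the 1-based index of a token's head and 0 marks the root. The function names `build_tag_index` and `dependency_edges` are hypothetical, and the actual `create_tag_indices` and `create_graphs` may construct features and edges differently.

```python
def build_tag_index(instances, key):
    """Map every distinct tag found under `key` to an integer index."""
    tags = sorted({tag for inst in instances for tag in inst[key]})
    return {tag: i for i, tag in enumerate(tags)}


def dependency_edges(heads):
    """Return undirected dependency edges as (source, target) pairs.

    `heads` uses 1-based head indices with 0 for the root; the returned
    pairs use 0-based token ids, with one pair per direction.
    """
    edges = []
    for child, head in enumerate(heads):
        if head == 0:  # the root token has no head
            continue
        parent = head - 1
        edges.append((parent, child))
        edges.append((child, parent))
    return edges


# First five tokens of a TACRED-style instance ("Sarah , 33 , agreed").
sample = [{
    "stanford_pos": ["NNP", ",", "CD", ",", "VBD"],
    "stanford_head": [5, 1, 1, 1, 0],
}]

pos_index = build_tag_index(sample, "stanford_pos")
edges = dependency_edges(sample[0]["stanford_head"])
print(pos_index)   # {',': 0, 'CD': 1, 'NNP': 2, 'VBD': 3}
print(len(edges))  # 8
```

A list of `(source, target)` pairs like this can be transposed into the two-row edge index that graph libraries such as PyTorch Geometric expect.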



### 3. Loading Trained Model
A pre-trained RelationExtractionGNN model is loaded from a specified file path.

In [10]:
# The variable 'model_path' stores the path where the trained model is saved.
model_path = 'model2.pth'

In [11]:
# This code segment loads a trained RelationExtractionGNN model from a specified file path.

# It first initializes an instance of the RelationExtractionGNN class from the custom module 'modules2'.
loaded_model = modules2.RelationExtractionGNN(num_node_features=58, num_classes=42)

# It loads the saved checkpoint from the file specified by 'model_path' using the torch.load function.
# map_location='cpu' lets a checkpoint saved on a GPU load on a machine without CUDA.
checkpoint = torch.load(model_path, map_location='cpu')

# It then loads the model state dictionary into the initialized model using the load_state_dict method.
# This step initializes the model parameters with the saved weights and biases.
loaded_model.load_state_dict(checkpoint['model_state_dict'])


<All keys matched successfully>

### 4. Evaluating Model
The loaded model is evaluated on the test input, and the true and predicted labels for the input instance are printed.


In [12]:
# This code snippet evaluates the loaded model on the test dataset and prints the true and predicted labels for a single instance.
true_labels, predicted_labels, _, _, _, _ = modules2.evaluate_model(loaded_model, test_loader)
print(f'True Label : {modules2.index_to_relation[true_labels[0]]}, Predicted Label : {modules2.index_to_relation[predicted_labels[0]]}')

True Label : per:age, Predicted Label : no_relation
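
With a single test instance, aggregate metrics are not very informative, and precision or recall can be undefined when a class is never predicted. For reference, TACRED results are conventionally reported as micro-averaged precision, recall, and F1, with `no_relation` treated as the negative class. The sketch below is a pure-Python illustration of that convention; `micro_prf` is a hypothetical helper and is not the `evaluate_model` function from `modules2`.

```python
def micro_prf(true_labels, predicted_labels, negative="no_relation"):
    """Micro-averaged precision, recall, and F1 that ignore the negative class."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if p != negative and p == t)
    predicted = sum(1 for _, p in pairs if p != negative)
    gold = sum(1 for t, _ in pairs if t != negative)

    # Fall back to 0.0 when a denominator is zero instead of raising.
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The single instance from this notebook: gold 'per:age', predicted 'no_relation'.
scores = micro_prf(["per:age"], ["no_relation"])
print(scores)  # (0.0, 0.0, 0.0)

# A small made-up batch to show non-zero values.
batch_scores = micro_prf(
    ["per:age", "no_relation", "per:age"],
    ["per:age", "per:age", "no_relation"],
)
print(batch_scores)  # (0.5, 0.5, 0.5)
```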


