Just using this to write the easily test the code for the baseline model. Final implementation will be in a py script, so it can be run from command line using GPU.


# To do!
- create function to extract data to train model -- DONE!
- create function to output tags into appropriate format -- DONE!
- make model -- DONE!
  - Incorporate start, stop and unknown tokens into the convert data shape. Start and stop should be both a label and a vocab. Unknown should only be vocab -- DONE!
  - Define allowed transitions, such as cannot transition into start token, cannot transition into pad token, except from stop token, cannot transition out of stop token except into pad token, can only transition into I tokens, from the B token of the same category. Potentially use allowed_transitions from the allen nlp CRF module to create it, it should then be fed into the model on its creation -- DONE!
- define hyperparamter space and random space search to optimize on dev dataset
  - Hyperparameters we have are DIM_EMBEDDING, LSTM_HIDDEN, LEARNING_RATE, EPOCHS and BATCH_SIZE. The values we have currently were selected arbitrarily, we could look at articles implementing Bi-LSTM and CRF for inspiration on ranges and appropriate values. 
  - https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html I think this might be the easiest way to implement it, otherwise we might have to implement from scratch
- train model -- This part should be working, just need to select the hyperparameters before we actually do it.
- submit results

In [2]:
#imports for ray
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
from torch.nn.parallel import DataParallel
import os
from ray.air import Checkpoint, session
# TODO: Migrate to ray.train.Checkpoint and remove following line(not sure how to do it)
os.environ["RAY_AIR_NEW_PERSISTENCE_MODE"]="0"

ImportError: cannot import name 'Checkpoint' from 'ray.air' (d:\School\ITU\Coding\Anaconda\lib\site-packages\ray\air\__init__.py)

In [13]:
#Putting all the imports in one place for readability
import numpy as np
import torch
from torch import nn
from allennlp.modules.conditional_random_field import ConditionalRandomField as CRF
from allennlp.modules import conditional_random_field as CRFmodule
from torcheval.metrics.functional import multiclass_accuracy
from torcheval.metrics.functional import multiclass_confusion_matrix as MCM
import random
from collections import Counter


# Setting seeds to ensure reproducibility of results

random.seed(666)
np.random.seed(666)
torch.manual_seed(666)

<torch._C.Generator at 0x23790786e70>

In [14]:
#Extracts the data into 2 lists of lists, one with the tokens another with the tags


def extractData(filePath):
    """
    Returns:tuple: A tuple containing input data (list of lists of words), tags (list of lists of tags),
    and metadata (list of tuples containing newdoc_id, sent_id, and text).
    """
    wordsData = []
    tagsData = []
    currentSent = None
    with open(filePath, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line.startswith("# sent_id"):
                sentId = line.split("= ")[1]
            elif line.startswith("#"):
                continue
            elif line:                
                parts = line.split('\t')
                word = parts[1]
                tag = parts[2]
                if sentId != currentSent:
                    currentSent = sentId
                    wordsData.append([word])
                    tagsData.append([tag])
                else:
                    wordsData[-1].append(word)
                    tagsData[-1].append(tag)
    return wordsData, tagsData

# Example usage:
#file_path = "../Data/UniversalNER/train/en_ewt-ud-train.iob2"
#words_data, tags_data = extract_data(file_path)
# for words, tags in zip(words_data, tags_data):
#     print("Words:", words)
#     print("Tags:", tags)
#     print()


In [15]:
#Converts the Data into a tensor for use by the model

def convertDataShape(data, vocabulary = {}, labels = [], training = True, PADDING_TOKEN = '<PAD>', START_TOKEN = '<START>', STOP_TOKEN = '<END>', UNKNOWN_TOKEN = '<UNK>'):
    """
    If training is enabled creates a vocabulary of all words in a list. Otherwise, a vocabulary should be passed.
    Does the same with the labels.
    Creates a matrix of sentences and positions, where each value indicates a word via its index in the vocabulary.
    Creates another matrix of sentences and positions, where the values indicate a label.
    '<PAD>' or another user defined token is used as padding for short sentences. Will also act as an unknown token, if not training, it is assumed to be in vocabulary.
    Returns, the vocabulary, the labels and the two matrices.
    
    Input:
    data          - (string list * string list) list - List of sentences. Each sentence is a tuple of two lists. The first is a list of words, the second a list of labels.
    vocabulary    - string : int dictionary          - Dictionary of words in the vocabulary, values are the indices. Should be provided if not training. Defaults to empty dict.
    labels        - string : int dictionary          - Dictionary of labels to classify, values are the indices. Should be provided if not training. Defaults to empty dict.
    training      - boolean                          - Boolean variable deffining whether training is taking place, if yes then a new vocabulary will be created. Defaults to yes.
    PADDING_TOKEN - string                           - Token to be used as padding. Default is provided
    START_TOKEN   - string                           - Token to be used as marker for the start of the sentence. Default is provided
    STOP_TOKEN    - string                           - Token to be used as marker for the end of the sentence. Default is provided
    UNKNOWN_TOKEN - string                           - Token to be used as the unknown token. Default is provided
    
    Output:
    Xmatrix       - 2D torch.tensor                  - 2d torch tensor containing the index of the word in the sentence in the vocabulary
    Ymatrix       - 2D torch.tensor                  - 2d torch tensor containing the index of the label in the sentence in the labels
    vocabulary    - string : int dictionary          - Dictionary of words, with indices as values, used for training.
    labels        - string : int dictionary          - Dictionary of all the labels, with indices as values, used for classification. (all the labels are expected to be present in the training data, or in other words, the label list provided should be exhaustive)
    """


    if training:
        vocabList = sorted(set(word for sentence, _ in data for word in sentence))
        
        #In order to be able to work with unknown words in the future, we turn some of the least common words into unknown words so we can train on them
        #This is done by removing them from the vocab list before creating the dictionary
        vocabCount = Counter([word for sentence, _ in data for word in sentence])
        UNKNOWN_RATIO = 5 #This should be percentage of tokens we want to turn into Unknown tokens, the least common tokens will be used
        cutoff = int(len(vocabList) / (100 / UNKNOWN_RATIO)) + 1
        removeList = vocabCount.most_common()[:-cutoff:-1]
        for i in removeList:
            vocabList.remove(i[0])

        # Adding the special tokens in the first positions after the least common have been removed and creating the dictionaries
        vocabList = [PADDING_TOKEN, START_TOKEN, STOP_TOKEN, UNKNOWN_TOKEN] + vocabList
        vocabulary = {word: i for i, word in enumerate(vocabList)}
        labelList = [PADDING_TOKEN, START_TOKEN, STOP_TOKEN] + sorted(set(label for _, sentenceLabels in data for label in sentenceLabels))
        labels = {label: i for i, label in enumerate(labelList)}
    
    # Adding two to the max len in order to accomodate the introduction of start and end tokens
    maxLen = max(len(sentence) for sentence, _ in data) + 2
    Xmatrix = np.zeros((len(data), maxLen), dtype=int)
    Ymatrix = np.zeros((len(data), maxLen), dtype=int)

    for i, (sentence, sentenceLabels) in enumerate(data):
        #Set the first token as the start token (assumes it's index is 1)
        Xmatrix[i, 0] = 1
        Ymatrix[i, 0] = 1
        #Set all the indices to the correct index, with the unknown token as default
        for j, word in enumerate(sentence):
            Xmatrix[i, j+1] = vocabulary.get(word, vocabulary[UNKNOWN_TOKEN])
        for j, label in enumerate(sentenceLabels):
            Ymatrix[i, j+1] = labels.get(label, labels[START_TOKEN])
            lastWord = j         
        # Sets the token after the last word as en end token
        Xmatrix[i, lastWord + 2] = 2
        Ymatrix[i, lastWord + 2] = 2
    
    return torch.tensor(Xmatrix, dtype=torch.long), torch.tensor(Ymatrix, dtype=torch.long), vocabulary, labels

# two first sentences of EWT training dataset so that quickdebugging can be run



trainingDebugSen = [["Where", "in", "the", "world", "is", "Iguazu", "?"], ["Iguazu", "Falls"]]
trainingDebugTags = [["O", "O", "O", "O", "O", "B-LOC", "O"], ["B-LOC", "I-LOC"]]

dataDebug, labelsDebug, vocabDebug, tagsDebug = convertDataShape(list(zip(trainingDebugSen, trainingDebugTags)))
print(dataDebug)
print(labelsDebug)
print(vocabDebug)
print(tagsDebug)

tensor([[ 1,  7,  8, 10, 11,  9,  6,  4,  2],
        [ 1,  6,  5,  2,  0,  0,  0,  0,  0]])
tensor([[1, 5, 5, 5, 5, 5, 3, 5, 2],
        [1, 3, 4, 2, 0, 0, 0, 0, 0]])
{'<PAD>': 0, '<START>': 1, '<END>': 2, '<UNK>': 3, '?': 4, 'Falls': 5, 'Iguazu': 6, 'Where': 7, 'in': 8, 'is': 9, 'the': 10, 'world': 11}
{'<PAD>': 0, '<START>': 1, '<END>': 2, 'B-LOC': 3, 'I-LOC': 4, 'O': 5}


In [16]:
class baselineModel(torch.nn.Module):
    def __init__(self, nWords, tags, dimEmbed, dimHidden, constraints):
        super().__init__()
        self.dimEmbed = dimEmbed
        self.dimHidden = dimHidden
        self.vocabSize = nWords
        self.tagSetSize = len(tags)

        self.embed = nn.Embedding(nWords, dimEmbed)
        self.LSTM = nn.LSTM(dimEmbed, dimHidden, bidirectional=True)
        self.linear = nn.Linear(dimHidden * 2, self.tagSetSize)
        

        # Initialize the CRF layer
        self.CRF = CRF(self.tagSetSize, constraints = constraints, include_start_end_transitions=True)

    def forwardTrain(self, inputData, labels):
        # Embedding and LSTM layers
        wordVectors = self.embed(inputData)
        lstmOut, _ = self.LSTM(wordVectors)
        
        # Linear layer
        emissions = self.linear(lstmOut)
        
        # CRF layer to compute the log likelihood loss
        log_likelihood = self.CRF(emissions, labels)
        
        # The loss is the negative log-likelihood
        loss = -log_likelihood
        return loss
        
    def forwardPred(self, inputData):
        # Embedding and LSTM layers
        wordVectors = self.embed(inputData)
        lstmOut, _ = self.LSTM(wordVectors)
        
        # Linear layer
        emissions = self.linear(lstmOut)
        
        # Decode the best path
        best_paths = self.CRF.viterbi_tags(emissions)
        
        # Extract the predicted tags from the paths
        predictions = [path for path, score in best_paths]
        return predictions


In [17]:

def saveToIob2(words, labels, outputFilePath):
    """
    Save words and their corresponding labels in IOB2 format.

    Args:
    words (list): List of lists containing words.
    labels (list): List of lists containing labels.
    output_file (str): Path to the output IOB2 file.
    """
    with open(outputFilePath, 'w', encoding='utf-8') as file:
        for i in range(len(words)):
            for j in range(len(words[i])):
                line = f"{j+1}\t{words[i][j]}\t{labels[i][j]}\n"
                file.write(line)
            file.write('\n')

In [9]:
# two first sentences of EWT training dataset so that quickdebugging can be run

tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

trainingDebugSen = [["Where", "in", "the", "world", "is", "Iguazu", "?"], ["Iguazu", "Falls"]]
trainingDebugTags = [["O", "O", "O", "O", "O", "B-LOC", "O"], ["B-LOC", "I-LOC"]]

dataDebug, labelsDebug, vocabDebug, tagsDebug = convertDataShape(list(zip(trainingDebugSen, trainingDebugTags)))

In [10]:
#Quick traininig script on the debug dataset

DIM_EMBEDDING = 100
LSTM_HIDDEN = 50
LEARNING_RATE = 0.01
EPOCHS = 5

random.seed(666)
np.random.seed(666)
torch.manual_seed(666)

constraint_type = None

model = baselineModel(len(vocabDebug), tagsDebug, DIM_EMBEDDING, LSTM_HIDDEN, constraint_type)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    model.train()
    
    optimizer.zero_grad()
    loss = model.forwardTrain(dataDebug, labelsDebug)
    
    loss.backward()
    optimizer.step()
    
    print(f"Epoch {epoch}, Loss: {loss.item()}")


Epoch 0, Loss: 33.8591194152832
Epoch 1, Loss: 24.502077102661133
Epoch 2, Loss: 17.171268463134766
Epoch 3, Loss: 11.20111083984375
Epoch 4, Loss: 6.7864837646484375


In [11]:
#Getting predicitons and checking accuracy


with torch.no_grad():
    predictsDebug = model.forwardPred(dataDebug)

confMat = MCM(torch.flatten(torch.tensor(predictsDebug, dtype=torch.long)), torch.flatten(labelsDebug), num_classes = len(tagsDebug))

acc = torch.trace(confMat[1:,1:])/torch.sum(confMat[1:,1:]) #Taking away the first collumn and first row, because those correspond to the padding token and we don't care
acc

tensor(1.)

In [18]:
# Loading all the training data sets

filePathTrain = "../Data/UniversalNER/train/"
wordsData = []
tagsData = []
datasets = ["da_ddt", "en_ewt", "hr_set", "pt_bosque", "sk_snk", "sr_set", "sv_talbanken", "zh_gsdsimp", "zh_gsd"]

for i in datasets:
    wordsDataTemp, tagsDataTemp = extractData(filePathTrain + i + "-ud-train.iob2")
    wordsData += wordsDataTemp
    tagsData += tagsDataTemp

trainData, trainLabels, vocab, labels = convertDataShape(list(zip(wordsData, tagsData)))

In [None]:
def train_cifar(config, data_dir=None):
    net = baselineModel(len(vocabDebug), tagsDebug, DIM_EMBEDDING, LSTM_HIDDEN, constraint_type)

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    checkpoint = session.get_checkpoint()

    if checkpoint:
        checkpoint_state = checkpoint.to_dict()
        start_epoch = checkpoint_state["epoch"]
        net.load_state_dict(checkpoint_state["net_state_dict"])
        optimizer.load_state_dict(checkpoint_state["optimizer_state_dict"])
    else:
        start_epoch = 0

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs]
    )

    trainloader = torch.utils.data.DataLoader(
        train_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )
    valloader = torch.utils.data.DataLoader(
        val_subset, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8
    )

    for epoch in range(start_epoch, 10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print(
                    "[%d, %5d] loss: %.3f"
                    % (epoch + 1, i + 1, running_loss / epoch_steps)
                )
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        checkpoint_data = {
            "epoch": epoch,
            "net_state_dict": net.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        }
        checkpoint = Checkpoint.from_dict(checkpoint_data)

        session.report(
            {"loss": val_loss / val_steps, "accuracy": correct / total},
            checkpoint=checkpoint,
        )
    print("Finished Training")

In [19]:
labels

{'<PAD>': 0,
 '<START>': 1,
 '<END>': 2,
 '-': 3,
 'B-LOC': 4,
 'B-ORG': 5,
 'B-OTH': 6,
 'B-PER': 7,
 'I-LOC': 8,
 'I-ORG': 9,
 'I-OTH': 10,
 'I-PER': 11,
 'O': 12}

In [10]:
DIM_EMBEDDING = 100
LSTM_HIDDEN = 50
LEARNING_RATE = 0.01
EPOCHS = 5
BATCH_SIZE = 32


PADDING_TOKEN = '<PAD>'
START_TOKEN = '<START>'
STOP_TOKEN = '<END>'
# The make constraint from the module was yielding some weird results so I decided to hardcode this for our use case, assuming the following dict of tags
#{'<PAD>': 0, '<START>': 1, '<END>': 2, '-': 3, 'B-LOC': 4, 'B-ORG': 5, 'B-OTH': 6, 'B-PER': 7, 'I-LOC': 8, 'I-ORG': 9, 'I-OTH': 10, 'I-PER': 11, 'O': 12}
CONSTRAINTS = [(1, 4), (1, 5), (1, 6), (1, 7), (1, 10), (2, 0), (4, 2), (4, 4), (4, 5), (4, 6), (4, 7), (4, 8) (4, 12), 
              (5, 2), (5, 4), (5, 5), (5, 6), (5, 7), (5, 9), (5, 12), (6, 2), (6, 4), (6, 5), (6, 6), (6, 7), (6, 10), (6, 12),
              (7, 2), (7, 4), (7, 5), (7, 6), (7, 7), (7, 11), (7, 12), (8, 2), (8, 4), (8, 5), (8, 6), (8, 7), (8, 8), (8, 12),
              (9, 2), (9, 4), (9, 5), (9, 6), (9, 7), (9, 9), (9, 12), (10, 2), (10, 4), (10, 5), (10, 6), (10, 7), (10, 10), (10, 12),
              (11, 2), (11, 4), (11, 5), (11, 6), (11, 7), (11, 11), (11, 12), (12, 2), (12, 4), (12, 5), (12, 6), (12, 7), (12, 12)]

random.seed(666)
np.random.seed(666)
torch.manual_seed(666)

numBatches = trainData.shape[0] // BATCH_SIZE

trainDataBatches = trainData[:BATCH_SIZE*numBatches].view(numBatches, trainData.shape[1], BATCH_SIZE)
trainLabelsBatches = trainLabels[:BATCH_SIZE*numBatches].view(numBatches, trainLabels.shape[1], BATCH_SIZE)



model = baselineModel(len(vocab), labels, DIM_EMBEDDING, LSTM_HIDDEN, CONSTRAINTS)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    model.train()

    model.zero_grad()

    for batch in zip(trainDataBatches, trainLabelsBatches): 
        optimizer.zero_grad()
        
        loss = model.forwardTrain(batch[0], batch[1])
        loss.backward()
        optimizer.step()
        
     
    
    print(f"Epoch {epoch}, Loss: {loss.item()}")


Epoch 0, Loss: 76.76655578613281
Epoch 1, Loss: 45.305023193359375
Epoch 2, Loss: 42.572418212890625
Epoch 3, Loss: 42.45745849609375
Epoch 4, Loss: 41.4228515625


In [44]:
#Loading all the dev datasets

filePathDev = "../Data/UniversalNER/dev/"

wordsDataDev = []
tagsDataDev = []
datasets = ["da_ddt", "en_ewt", "hr_set", "pt_bosque", "sk_snk", "sr_set", "sv_talbanken", "zh_gsdsimp", "zh_gsd"]

for i in datasets:
    wordsDataTemp, tagsDataTemp = extractData(filePathDev + i + "-ud-dev.iob2")
    wordsDataDev += wordsDataTemp
    tagsDataDev += tagsDataTemp

devData, devLabels, _, _ = convertDataShape(list(zip(wordsDataDev, tagsDataDev)), vocabulary = vocab, labels = labels, training = False)

In [12]:
#Getting predicitons and checking accuracy

DEV_BATCH_SIZE = 113

devNumBatches = devData.shape[0] // DEV_BATCH_SIZE
devDataBatches = devData[:DEV_BATCH_SIZE*devNumBatches].view(devNumBatches, DEV_BATCH_SIZE, devData.shape[1])
devLabelsBatches = devLabels[:DEV_BATCH_SIZE*devNumBatches].view(devNumBatches, DEV_BATCH_SIZE, devData.shape[1])

predicts = []
with torch.no_grad():

    for batch in devDataBatches:
        predicts += model.forwardPred(batch)



KeyboardInterrupt: 

In [None]:
confMat = MCM(torch.flatten(torch.tensor(predicts, dtype=torch.long)), torch.flatten(devLabels), num_classes = len(labels))

#Taking away the first three collumns and rows, because those correspond to the functional tokens and we don't care
acc = torch.trace(confMat[3:,3:])/torch.sum(confMat[3:,3:]) 
acc

ValueError: The `input` and `target` should have the same first dimension, got shapes torch.Size([16330400]) and torch.Size([16379868]).

In [None]:
outputFilePath = "./baselineModel.iob2"

#convert the predictions back into labels

# creates a list of lists of tags, where the padding token is excluded
predictLabels = [[list(labels.keys())[i] for i in j if list(labels.keys())[i] != PADDING_TOKEN and list(labels.keys())[i] != START_TOKEN and list(labels.keys())[i] != STOP_TOKEN] for j in predicts]

# the saveToIob2 works when provided data in the right format
saveToIob2(devWordsData, predictLabels, outputFilePath)


In [5]:
#Loading all the training data for the submission

filePathTest = "../Project/en_ewt-ud-test-masked.iob2"

wordsDataTest, tagsDataTest = extractData(filePathTest)

testData, _, _, _ = convertDataShape(list(zip(wordsDataTest, tagsDataTest)), vocabulary = vocab, labels = labels, training = False)


with torch.no_grad():

    predictsTest = model.forwardPred(testData)

outputFilePathTest = "./baselineModelSubmit.iob2"

#convert the predictions back into labels

# creates a list of lists of tags, where the padding token is excluded
predictLabelsTest = [[list(labels.keys())[i] for i in j if list(labels.keys())[i] != PADDING_TOKEN and list(labels.keys())[i] != START_TOKEN and list(labels.keys())[i] != STOP_TOKEN] for j in predictsTest]

# the saveToIob2 works when provided data in the right format
saveToIob2(wordsDataTest, predictLabelsTest, outputFilePathTest)

NameError: name 'extractData' is not defined

In [14]:
import re

# Open the input file
with open("LOTR_as_txt.sty", "r", encoding="utf-8") as file:
    content = file.read()

# Remove chapter headings and names
cleaned_content = re.sub(r'Chapter\s+\w+\s+.*?\n', '', content)

# Tokenize the cleaned content into words and punctuation
tokens = re.findall(r"[\w’]+|[.,!?;():\[\]]", cleaned_content)

# Write each token to a new text file
with open("LOTR_tokens.txt", "w", encoding="utf-8") as output_file:
    for token in tokens:
        output_file.write(token + "\n")
