# Baseline for EWT
To get your project started, begin by implementing a baseline model. Ideally, this will be the main baseline that you compare against in your paper. Note that this baseline should be more advanced than just predicting the majority class (O).

We will use the EWT portion of the Universal NER project, which we provide in the folder "Project_description" for convenience. You can use the train data (en_ewt-ud-train.iob2) and dev data (en_ewt-ud-dev.iob2) to build your baseline, then upload your predictions on the test data (en_ewt-ud-test.iob2).

It is important to upload your predictions in the same format as the training and test files, so that the span_f1.py script can be used.
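
The format matters because span-level F1 scores whole entity spans rather than single tokens. The sketch below illustrates the general idea on IOB2 tags. It is not the provided span_f1.py, whose details (for example how stray `I-` tags are treated) may differ; the names `extract_spans` and `span_f1` and the toy tag sequences are made up for illustration.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) spans in one IOB2 tag sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing span
        # An open span ends at "O", at a new "B-", or at an "I-" with another label
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold_sents, pred_sents):
    """Micro-averaged F1 over exactly matching spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
f1 = span_f1(gold, pred)  # one of two gold spans found, no wrong spans predicted
print(round(f1, 3))
```

In this sketch a span only counts as correct when its start, end, and label all match, which is why a baseline that gets most tokens right can still score low.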

Note that you do not have to implement your baseline from scratch. For example, you can use the code from the RNN or BERT assignments as a starting point.

__________________________________________________

# Homemade baseline implementation with RNN

First, import the modules we need

In [None]:
import torch
from torch import nn

Then, read in the file

In [None]:
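def read_ewt_file(path):
    """Read an .iob2 file into a list of (words, tags) sentence pairs."""
    data=[]
    words=[]
    tags=[]
    nr_tags=0
    nr_toks=0

    #Use a context manager so the file is closed again after reading
    with open(path, encoding='utf-8') as f:
        for line in f:
            line=line.strip()

            if line:
                #Skip comment lines such as "# text = ..."
                if line[0]=='#':
                    continue

                #Columns: token nr, word, tag, and two extra columns
                elements=line.split('\t')
                nr_toks+=1

                words.append(elements[1])
                tags.append(elements[2])

                #Diagnostic: print any value in the 4th column that is not the "-" placeholder
                if elements[3]!='-':
                    print(elements[3])
                #Count the tokens whose 5th column is "stephen"
                if elements[4]=='stephen':
                    nr_tags+=1

            else:
                #A blank line ends the current sentence
                if words:
                    data.append((words, tags))
                words=[]
                tags=[]

    #Add the last sentence if the file does not end with a blank line
    if tags:
        data.append((words, tags))

    #Avoid dividing by zero on an empty file
    proportion_tagged=nr_tags/nr_toks if nr_toks else 0.0

    return data, proportion_tagged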

In [None]:
train_data,prop_tag_train=read_ewt_file('Project_description/en_ewt-ud-train.iob2')
print("Proportion of training data tagged: ", prop_tag_train)
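
To make the expected layout concrete, here is a minimal sketch that parses a tiny hand-written sample instead of the real file. The sentence is made up, `parse_iob2` is a hypothetical helper that only keeps the word and tag columns, and the last two columns are shown as `-` placeholders purely as an assumption.

```python
# A made-up sentence in the tab-separated layout the reader above expects:
# token nr, word, tag, and two further columns
sample = (
    "# text = Anna visited Copenhagen\n"
    "1\tAnna\tB-PER\t-\t-\n"
    "2\tvisited\tO\t-\t-\n"
    "3\tCopenhagen\tB-LOC\t-\t-\n"
    "\n"
)


def parse_iob2(lines):
    """Group token lines into (words, tags) pairs, one pair per sentence."""
    data, words, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: the current sentence is complete
            if words:
                data.append((words, tags))
            words, tags = [], []
        elif not line.startswith("#"):
            cols = line.split("\t")
            words.append(cols[1])
            tags.append(cols[2])
    if words:
        data.append((words, tags))
    return data


parsed = parse_iob2(sample.splitlines())
print(parsed)
```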

## Vocab and tensor creation

Append each new word and label we meet in the training data to vocab lists, so we can look up indexes

In [None]:
#Initialize a list for each vocabulary, where item at index 0 should just be pad/unknown
word_vocab=['<PAD>']
label_vocab=['<PAD>']

#Iterate over each sentence and corresponding label in the training data
for pair in train_data:
    #Unpack the tokens and labels, to iterate simultaneously over each word/token and its label
    for word, label in zip(pair[0], pair[1]):

        #Check if the word and label already exist in the vocabulary, and if not, add them
        if word not in word_vocab:
            word_vocab.append(word)
        if label not in label_vocab:
            label_vocab.append(label)
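
Checking `word not in word_vocab` on a list scans the whole list every time, which gets slow for a large vocabulary. A possible alternative, sketched here on made-up toy data, is to keep a dictionary from item to index. `build_vocab`, `word2idx`, `label2idx` and `idx2label` are illustrative names that do not appear elsewhere in the notebook.

```python
def build_vocab(sequences, pad_token="<PAD>"):
    """Map every distinct item to an index, reserving 0 for the pad/unknown token."""
    item2idx = {pad_token: 0}
    for seq in sequences:
        for item in seq:
            if item not in item2idx:  # dictionary lookup, no list scan
                item2idx[item] = len(item2idx)
    return item2idx


toy_data = [
    (["Anna", "visited", "Copenhagen"], ["B-PER", "O", "B-LOC"]),
    (["Anna", "left"], ["B-PER", "O"]),
]

word2idx = build_vocab(words for words, _ in toy_data)
label2idx = build_vocab(tags for _, tags in toy_data)

# Dictionaries keep insertion order, so the keys double as an index-to-label list
idx2label = list(label2idx)

print(word2idx)
print(word2idx.get("unseen", 0))  # unknown words fall back to index 0
```

The same `.get(token, 0)` lookup also covers words that only occur in the test data.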

Save the metrics we need for later

In [None]:
nr_diff_words=len(word_vocab)

#Will be used as the nr of rows in our pytorch tensors
nr_sent=len(train_data)

#Will be used as the nr of columns in our pytorch tensors
longest_sent=max([len(x[0]) for x in train_data])

Convert each word to its corresponding vocabulary index, and place it into a pytorch tensor

In [None]:
#Initialize tensors with 0-values, this is especially good since the 0-index in our vocabs is just the unknown/pad token and label.
# So anything that doesn't get replaced by a vocabulary index, will just have index for <PAD>

train_data_matrix=torch.zeros((nr_sent,longest_sent)) #PyTorch tensor of dim 12543 x 159 . Should consist of sentences word by word as rows, and padding for shorter sentences.
train_label_matrix=torch.zeros((nr_sent,longest_sent)) #PyTorch tensor of dim 12543 x 159, containing values from the label index for each word in the train_data_matrix

#Iterate over the training data again, this time looking up the vocab index for each token and label, to create pytorch tensors of sentence representation
for sent_nr, (sentence, labels) in enumerate(train_data):
    for tok_nr, (token, label) in enumerate(zip(sentence, labels)):
        token_idx=word_vocab.index(token)
        label_idx=label_vocab.index(label)
        train_data_matrix[sent_nr,tok_nr]=token_idx
        train_label_matrix[sent_nr,tok_nr]=label_idx

#Convert both matrices to dtype torch.long: torch.zeros creates float tensors by default, but nn.Embedding and CrossEntropyLoss expect integer indices
train_data_matrix=train_data_matrix.long()
train_label_matrix=train_label_matrix.long()
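
For comparison, PyTorch also ships a helper for this kind of padding. The sketch below uses `torch.nn.utils.rnn.pad_sequence` on made-up index lists. Note that it pads to the longest sequence in the list it is given, not to a fixed `longest_sent`, so it is an alternative under that assumption rather than a drop-in replacement.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Made-up sentences that are already converted to vocabulary indices
toy_indices = [[1, 2, 3], [1, 4]]

# One 1D long tensor per sentence; shorter sentences are filled with 0 (<PAD>)
padded = pad_sequence(
    [torch.tensor(sent, dtype=torch.long) for sent in toy_indices],
    batch_first=True,
    padding_value=0,
)

print(padded.shape)  # one row per sentence, one column per token position
print(padded)
```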

________________________________________________

## Now use the pytorch tensors as training data for an RNN
#### Okay, I lied above: the actual batch division and RNN use Rob's solution, but the data conversion and prediction output are all our own work

Below this, we don't 100% understand what's going on, but have adapted Rob's code to make it work for our implementation

In [None]:
tmp_feats=torch.zeros((200,100))

batch_size=32
num_batches=int(len(tmp_feats)/batch_size)

tmp_feats_batches=tmp_feats[:batch_size*num_batches].view(num_batches, batch_size, 100)

for feats_batch in tmp_feats_batches:
    print(feats_batch.shape)

In [None]:
def create_batches(batch_size, train_data_matrix, train_label_matrix):
    #Only full batches are kept: leftover sentences at the end are dropped
    num_batches=int(len(train_data_matrix)/batch_size)
    batches_X=train_data_matrix[:batch_size*num_batches].view(num_batches, batch_size, train_data_matrix.shape[1]) #shape: (num_batches, batch_size, longest_sent)
    batches_Y=train_label_matrix[:batch_size*num_batches].view(num_batches, batch_size, train_label_matrix.shape[1]) #shape: (num_batches, batch_size, longest_sent)
    return batches_X, batches_Y
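
What the reshaping does can be checked on a small made-up tensor. The sketch below repeats the slice-and-view approach, counts how many rows it drops, and shows `torch.split` as one possible way to keep the final, smaller batch. The sizes (10 sentences of 4 tokens, batch size 3) are arbitrary.

```python
import torch

X = torch.arange(10 * 4).view(10, 4)  # 10 toy "sentences" with 4 tokens each
batch_size = 3

# Same idea as create_batches: cut off the remainder, then reshape to 3D
num_batches = len(X) // batch_size
full = X[:batch_size * num_batches].view(num_batches, batch_size, X.shape[1])
dropped = len(X) - batch_size * num_batches

# Alternative: torch.split keeps the last, smaller batch as well
chunks = torch.split(X, batch_size)

print(full.shape, "rows dropped:", dropped)
print([len(chunk) for chunk in chunks])
```

If the last partial batch is kept, code that reshapes with a fixed `BATCH_SIZE` would need to use the actual batch length instead.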

In [None]:
torch.manual_seed(42)
DIM_EMBEDDING=100
RNN_HIDDEN=50
BATCH_SIZE=32
LEARNING_RATE=0.01
EPOCHS=10

class TaggerModel(torch.nn.Module):
    def __init__(self, nwords, ntags):
        super().__init__()
        self.embed = nn.Embedding(nwords, DIM_EMBEDDING)
        self.rnn = nn.RNN(DIM_EMBEDDING, RNN_HIDDEN, batch_first=True)
        self.fc = nn.Linear(RNN_HIDDEN, ntags)
        
    def forward(self, input_data):
        word_vectors = self.embed(input_data)
        output, hidden = self.rnn(word_vectors)
        predictions = self.fc(output)

        return predictions 

model = TaggerModel(len(word_vocab), len(label_vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=0, reduction='sum')

#creating the batches 
batches_X, batches_y = create_batches(BATCH_SIZE, train_data_matrix, train_label_matrix)

for epoch in range(EPOCHS):
    model.train()
    # reset the gradient
    model.zero_grad()
    print(f"Epoch {epoch+1}\n-------------------------------")
    loss_sum = 0

    # loop over batches
    for X, y in zip(batches_X, batches_y):
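        predicted_values = model(X) #calling the model directly runs forward()
        predicted_values=predicted_values.view(BATCH_SIZE*longest_sent, -1) #reshape from 3D (batch, tokens, labels) to 2D (batch*tokens, labels)
        
        # calculate loss
        y=y.reshape(-1) #flatten the labels to 1D, so they line up with the rows of predicted_values
        loss = loss_function(predicted_values, y)
        loss_sum += loss.item() #loss is summed over the non-padding tokens of the batch, and averaged over batches below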

        # update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Average summed loss per batch after epoch {epoch+1}: {loss_sum/batches_X.shape[0]}")
        
# set to evaluation mode
model.eval()
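
The `ignore_index=0` argument is what keeps the padding from influencing training. A small check on made-up scores, sketched below, shows that a position whose target is 0 adds nothing to the summed loss; the logits and targets are arbitrary.

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0, reduction='sum')

# Three token positions with scores for three labels; label 0 is <PAD>
logits = torch.tensor([[0.0, 2.0, 0.0],
                       [0.0, 0.0, 2.0],
                       [5.0, 0.0, 0.0]])
targets = torch.tensor([1, 2, 0])  # the last position is padding

loss_all = loss_fn(logits, targets)           # padding position included
loss_real = loss_fn(logits[:2], targets[:2])  # only the two real tokens

# Dividing by the number of real tokens gives a per-token loss
n_real = (targets != 0).sum()
print(loss_all.item(), loss_real.item(), (loss_all / n_real).item())
```

Since both losses are equal, padded positions can stay in the tensors without affecting the gradient.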

In [None]:
#Read in the dev data, which we use as our test data, with the same function as for the training data
test_data,prop_tag_test=read_ewt_file('Project_description/en_ewt-ud-dev.iob2')
print("Proportion of testing data tagged: ", prop_tag_test)

#Get the nr of sentences from the testing data, so we can initialize a properly sized pytorch tensor
test_nr_sent=len(test_data)
test_data_matrix=torch.zeros((test_nr_sent,longest_sent)) #PyTorch tensor of dim test_nr_sent x longest_sent, padded with 0. Note: a sentence longer than longest_sent (the longest training sentence) would raise an IndexError in the loop below

#Iterate over the testing data, and add each word to the test_data_matrix based on index. We do not need a test_label matrix, as we are trying to predict labels
for sent_nr, (sentence, labels) in enumerate(test_data):
    for tok_nr, (token, label) in enumerate(zip(sentence, labels)):
        #Note that since we are using the word vocabulary created by the training data, we don't know that all words from the testing data will be present
        try:
            token_idx=word_vocab.index(token)
        #New words are mapped to index 0, which doubles as the pad/unknown token. list.index raises ValueError for a missing item
        except ValueError:
            token_idx=0
            
        test_data_matrix[sent_nr,tok_nr]=token_idx
test_data_matrix=test_data_matrix.long()

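#Now that our test-data is in the correct format, we run it through our trained model to get predicted label scores for the test-data.
#torch.no_grad() switches off gradient tracking, which is not needed for prediction and saves memory
with torch.no_grad():
    predictions_test = model(test_data_matrix)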

___________________________________________

## Lastly, from the predictions for each label, take the most likely, and convert to iob2-format for the output file
#### Oh yeah, and we're back to homemade solutions!

First, convert the predictions to label strings

In [None]:
#predictions_test is a 3-dimensional tensor of shape (sentences, tokens, labels), holding one score per entry in label_vocab for every token.
# We take the highest-scoring label index for each token, and look that index up in our label_vocab to get the label as a string
test_pred=[]
for i, sentence in enumerate(predictions_test):
    labels=[]
    #Only the first len(test_data[i][0]) positions are real tokens, the rest is padding
    for j in range(len(test_data[i][0])):
        label=label_vocab[torch.argmax(predictions_test[i,j,:])]
        labels.append(label)
    test_pred.append((test_data[i][0],labels))
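
The same lookup can be done for all tokens at once with `argmax(dim=-1)`. The sketch below uses random made-up scores and a shortened label list. Excluding index 0 so that `<PAD>` can never be written to the output file is our own assumption here, not something the loop above does.

```python
import torch

label_vocab = ["<PAD>", "O", "B-PER", "I-PER"]  # shortened toy label list
sentences = [["Anna", "left"], ["Hi"]]          # made-up test sentences

torch.manual_seed(0)
# Stand-in for the model output: (sentences, padded length, labels)
scores = torch.randn(2, 3, len(label_vocab))

# Ignore column 0 (<PAD>) when picking the best label, then shift indices back by 1
best = scores[:, :, 1:].argmax(dim=-1) + 1

# Cut every row to the real sentence length before looking up the label strings
pred = [
    [label_vocab[idx] for idx in row[:len(words)].tolist()]
    for row, words in zip(best, sentences)
]
print(pred)
```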

Then, assemble it all into a string of the proper format - with 5 columns of nr, token, label, "-", and "stephen" if the token has a label

In [None]:
#To get it as iob2-format output, we assemble the predicted labels with the corresponding words in a string, as well as a "stephen" if a word has a label
output_txt=""
for sentence, labels in test_pred:
    output_txt+="\n# text = "+" ".join(sentence)+"\n"
    for i, (token, label) in enumerate(zip(sentence,labels)):
        steph="-"
        if label != "O":
            steph="stephen"
        line=str(i+1)+"\t"+token+"\t"+label+"\t-\t"+steph+"\n"
        output_txt+=line

Very last step: Save the string as a new file in the directory

In [None]:
#Lastly, write that string into a file
with open("baseline_test_pred_output.iob2", "w", encoding="utf-8") as f:
    f.write(output_txt)
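
As a possible variation, the output string can be assembled with `str.join` instead of repeated `+=`, which is easier to test on a toy prediction. `format_iob2` is a hypothetical helper. It follows the five-column layout used above, but puts the blank line after each sentence instead of before it.

```python
def format_iob2(pred_sents, annotator="stephen"):
    """Build iob2-style text from a list of (words, labels) pairs."""
    blocks = []
    for words, labels in pred_sents:
        lines = ["# text = " + " ".join(words)]
        for i, (word, label) in enumerate(zip(words, labels), start=1):
            # The fifth column marks tokens that received an entity label
            flag = annotator if label != "O" else "-"
            lines.append(f"{i}\t{word}\t{label}\t-\t{flag}")
        blocks.append("\n".join(lines))
    # One blank line after every sentence, including the last one
    return "\n\n".join(blocks) + "\n\n"


toy_pred = [(["Anna", "left"], ["B-PER", "O"])]
text = format_iob2(toy_pred)
print(text)
```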