# Word sense disambiguation - Sara Nordin Hällgren

Instructions on how to run this notebook: change the filepaths under __Load the data__ if needed. I have referenced GloVe but ended up not using it.

In [1]:
# Imports - should just be standard packages

import re
import math
import time
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from dataclasses import dataclass

import nltk
from nltk.tokenize import word_tokenize

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Load the data

In [17]:
train_path = "a3_data/wsd_train.txt"
test_path = "a3_data/wsd_test_blind.txt"
validation_path = r"C:\Users\saran\OneDrive\Dokument\GitHub\NLP\Project3\a3_data\wsd_test.txt"

with open(train_path, encoding = "utf-8") as f:
    for d, line in enumerate(f):
        print(line.lower())
        break
        
with open(validation_path, encoding = "utf-8") as f:
    for d, line in enumerate(f):
        print(line.lower())
        break

keep%2:42:07::	keep.v	15	action by the committee in pursuance of its mandate , the committee will continue to keep under review the situation relating to the question of palestine and participate in relevant meetings of the general assembly and the security council . the committee will also continue to monitor the situation on the ground and draw the attention of the international community to urgent developments in the occupied palestinian territory , including east jerusalem , requiring international action .

physical%5:00:00:material:01	physical.a	58	iaea pointed out that training and education were fundamental to the agency 's approach to enhancing physical protection systems in states . training courses , workshops and seminars that had been held on six continents had raised awareness and had provided hands-on experience of various subjects including the physical protection of research facilities , the practical operation of physical protection systems , and the engineering safet

In [3]:
def load_data(file_path):
    
    sense_list = []
    lemma_list = []
    position_list = []
    text_list = []

    with open(file_path, encoding = "utf-8") as f:
        
        for d, line in enumerate(f):

            line = line.lower()

            # Extract sense key
            ix = line.find("\t")
            sense_key = line[0:ix]
            line = line[ix+1:]

            # Extract lemma
            ix = line.find("\t")
            lemma = line[0:ix]
            line = line[ix+1:]

            # Extract position
            ix = line.find("\t")
            position = line[0:ix]

            # Extract text (split on whitespace into a list of tokens)
            text = line[ix+1:].split()

            sense_list.append(sense_key)
            lemma_list.append(lemma)
            position_list.append(position)
            text_list.append(text)
    
    # Convert to df
    df = pd.DataFrame(sense_list, columns = ["Sense_key"])
    df["Lemma"] = lemma_list
    df["Position"] = position_list
    df["Text"] = text_list

    del sense_list, lemma_list, position_list
    
    return df

Using the given WordNet data, the task is to implement WSD as a classification task. This is implemented with two different kinds of neural network: a (word-based) convolutional neural network and a regular deep network. They are discussed in more detail below; first I discuss some general choices:

__Design choices__

First of all, it made sense to convert each text to lower case to avoid having the same word appear twice in the vocabulary. However, I debated whether to remove stopwords as we did in assignment 2. In the end, I decided to keep the stopwords, since they can change the sense of a word. An example is:

- Standing in line (waiting for something specific)
- Standing in a line (they are just standing)

The second decision was how to represent each line in the documents numerically. A CBoW approach would be a good fit for assigning topics to documents but seems like a bad choice in this setting, since the information that distinguishes the word senses will almost certainly get lost. I have used nn.Embedding from PyTorch, and also created the infrastructure for using GloVe embeddings. One could learn the representations during training, but in some cases there are not many examples per unique sense_key. Since the WSD texts appear to be generic enough, pretrained embedding vectors should be good enough.

Note that the longest line in any of the documents contains 284 words, but that many are shorter. In the CNN case it made sense to zero pad the data, so that each document is represented by a $284 \times 64$ matrix (where 64 is the embedding dimension). CNNs used in image analysis settings are able to handle pictures of similar dimensions easily. Since I had to train one CNN for each unique lemma, 30 in total, it was possible to adapt the input dimensions by only using the longest line for each lemma in the dataset. This was found to slightly speed up computations but it does not appear worth the trouble here - however this would be a good approach if some documents are much longer than others. 
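
The padding idea can be sketched in a few lines. This is a minimal illustration rather than the code used below; `pad_to_length` is a helper name introduced here, and the token indices are made up. It also shows the per-lemma variant, where the target length is the longest line in the current subset instead of a global 284.

```python
def pad_to_length(token_ids, seq_len, pad_idx=0):
    """Right-pad a list of token indices with pad_idx up to seq_len."""
    return token_ids + [pad_idx] * (seq_len - len(token_ids))

# Two toy "documents" already converted to vocabulary indices
batch = [[5, 2, 9], [7, 1]]

# Per-lemma maximum length instead of a fixed global length
longest = max(len(sentence) for sentence in batch)
padded = [pad_to_length(sentence, longest) for sentence in batch]
print(padded)  # [[5, 2, 9], [7, 1, 0]]
```

Index 0 is reserved for padding, which is why the vocabulary indices in this notebook start at 1.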

Since we have one key word per document and we know its position, it seems like a shame not to use the actual position of the word. Also, since the deep network does not accept a 2D input each document needs to be flattened. Flattening a $284 \times 64$ matrix leads to a very long vector, which leads to an extremely large number of nodes in the network. Because of this I instead used a window of 10 words on each side of the lemma. 
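
As a rough sketch of the windowing step, assuming the position field is a zero-based token index: `context_window` is a hypothetical helper, not the method used in Preprocessing_2 further down. The window is shifted when the lemma sits close to either end of the line, so the result still contains up to 20 tokens.

```python
def context_window(tokens, position, win_size=10):
    """Return up to 2*win_size tokens around `position`, clamped to the line."""
    # Shift the window left or right so that it never leaves the line;
    # the outer max also handles lines shorter than the window
    start = max(min(position - win_size, len(tokens) - 2 * win_size), 0)
    return tokens[start:start + 2 * win_size]

tokens = [f"w{i}" for i in range(50)]
window = context_window(tokens, position=25)
print(len(window), window[0], window[-1])  # 20 w15 w34
```

With a 50-dimensional embedding, such a window flattens to a vector of length 1000, compared with 284 x 64 = 18176 for a whole padded document.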

Lastly, the output dimension of both the CNN and the DNN is determined by the number of possible senses for each lemma. I have one-hot encoded the sense labels and passed the network output through a log-softmax function.
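
To make the output stage concrete, here is a small pure-Python sketch of log-softmax followed by the negative log-likelihood of the true class. The scores and the three-sense setup are invented for illustration; in the notebook itself this is handled by `nn.LogSoftmax` and `nn.NLLLoss`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def nll(log_probs, target_idx):
    """Negative log-likelihood of the true class index."""
    return -log_probs[target_idx]

logits = [2.0, 0.5, -1.0]      # hypothetical scores for a lemma with three senses
one_hot = [1, 0, 0]            # one-hot target: the first sense is correct
target_idx = one_hot.index(1)  # the loss needs the class index, not the one-hot vector

log_probs = log_softmax(logits)
loss = nll(log_probs, target_idx)
print(round(loss, 4))
```

The conversion from the one-hot vector back to a class index mirrors what the training loop has to do before the loss is computed.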

# Classify using a CNN

CNNs seemed promising for this kind of task, as they can model interactions between words (exactly what we want). In this case, we ignore the position of the lemma and see this as a document classification problem. The main structure of my CNN code is based on a text classification tutorial I found on GitHub, where the goal is to distinguish between two different languages:

- https://github.com/FernandoLpz/Text-Classification-CNN-PyTorch

I also saw an example of how to utilize the log softmax outputs with a suitable loss function in this tutorial:

- https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html

The idea with a network built from feature maps is to learn local features. In image analysis, this could be an edge, a corner or a leopard spot. The same kernel is applied to different parts of the data, with the same weights and thresholds. In the WSD setting, one hopes that it will be able to pick up on the patterns different words appear in, and learn to predict a sense key based on this. 

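My CNN has one convolution layer consisting of four parallel convolutions with different kernel sizes (2, 3, 4, and 5), each producing 32 feature maps. They all have stride 2, a ReLU activation function, and a max pooling layer. They all apply a one-dimensional convolution. The intent is that the entire embedding for each word is read in while the kernel covers a few words at a time; note, however, that as the layers are defined below (`nn.Conv1d(seq_len, ...)` applied to a tensor of shape batch x seq_len x embedding_size), the word positions act as input channels and the kernels slide along the embedding dimension. The max pooling outputs are concatenated and fed through a fully connected layer. Finally, 25% dropout is applied to the output of that layer, after which it is passed through a log-softmax layer, which pairs with the NLLLoss (negative log-likelihood) loss function.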
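
The number of inputs to the fully connected layer follows from the output-length formula in the PyTorch `Conv1d` documentation. The sketch below redoes that arithmetic in plain Python, assuming padding 0 and dilation 1 (the PyTorch defaults) and the values set in the Parameters cell further down (embedding size 64, stride 2, 32 feature maps); `out_length` is a helper name introduced here. The convolved axis has length 64 because that is the quantity `in_features_fc` uses.

```python
import math

def out_length(length, kernel, stride, padding=0, dilation=1):
    """Output length of Conv1d / MaxPool1d along the convolved axis."""
    return math.floor((length + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)

embedding_size, stride, out_size = 64, 2, 32  # values from the Parameters cell

pooled = []
for kernel in (2, 3, 4, 5):
    conv_len = out_length(embedding_size, kernel, stride)   # after the convolution
    pooled.append(out_length(conv_len, kernel, stride))     # after max pooling

fc_inputs = sum(pooled) * out_size
print(pooled, fc_inputs)  # [16, 15, 14, 13] 1856
```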

I have implemented early stopping: after the first 10 epochs, training stops as soon as the loss fails to drop below the lowest loss recorded in a window of the 5 most recent epochs.
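
A patience-style formulation of early stopping can be sketched in isolation as follows. This is a standalone illustration with an invented loss curve, not the exact check used in `Run.train`; `should_stop` is a hypothetical helper.

```python
def should_stop(losses, window=5):
    """True when none of the last `window` losses beats the best earlier loss."""
    if len(losses) <= window:
        return False
    best_before = min(losses[:-window])
    return min(losses[-window:]) >= best_before

history = []
stopped_at = None
for epoch, loss in enumerate([1.0, 0.6, 0.4, 0.41, 0.45, 0.42, 0.43, 0.44, 0.39]):
    history.append(loss)
    if should_stop(history):
        stopped_at = epoch
        break

print(stopped_at)  # 7
```

In this toy run the best loss (0.4) is reached at epoch 2, and training stops at epoch 7 after five epochs without improvement, so the later value 0.39 is never seen. That is the usual trade-off of a short patience window.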

## Preprocessing

In [4]:
class Preprocessing:
    def __init__(self, df, num_words, seq_len):
        self.data = df
        self.num_words = num_words
        self.seq_len = seq_len  
        
        self.vocabulary = None
        self.x_tokenized = None
        self.x_padded = None
        self.x_raw = None
        
        self.lemma = None
        self.n_outputs = None
        self.le = None
        self.y = None
        self.y_onehot = None
        
        self.x_train = None
        self.x_test = None
        self.y_train = None
        self.y_test = None  
    
    def load_data(self):
        # split into sentences (x) and sense key (y)
        df = self.data
        self.x_raw = df.Text.values
        self.lemma = df.Lemma.iloc[0]
        self.n_outputs = len(df.Sense_key.unique())
        
        labels = np.asarray(df.Sense_key.values)
        le = preprocessing.LabelEncoder()
        self.y = le.fit_transform(labels)
        self.le = le
        
    def build_vocabulary(self):
        # Builds the vocabulary 
        self.vocabulary = dict()
        fdist = nltk.FreqDist()

        for sentence in self.x_raw:
            for word in sentence:
                fdist[word] += 1

        common_words = fdist.most_common(self.num_words)

        for idx, word in enumerate(common_words):
            self.vocabulary[word[0]] = (idx+1)
            
    def word_to_idx(self):
        # By using the dictionary each token is transformed into its index based representation
        self.x_tokenized = list() 

        for sentence in self.x_raw:
            temp_sentence = list()
            for word in sentence:
                if word in self.vocabulary.keys():
                    temp_sentence.append(self.vocabulary[word])
            self.x_tokenized.append(temp_sentence)
    
    def padding_sentences(self):
        # Each sentence which does not fulfill the required length is padded with the index 0
        pad_idx = 0
        self.x_padded = list()

        for sentence in self.x_tokenized:
            while len(sentence) < self.seq_len:
                sentence.insert(len(sentence), pad_idx)

            self.x_padded.append(sentence)
            
        self.x_padded = np.array(self.x_padded) 
        
    def onehot_encode(self):
        # Create a onehot encoded representation of the targets:
        # row i of the identity matrix is the one-hot vector for class i
        self.y_onehot = np.eye(self.n_outputs)[self.y]
            
    def split_data(self):
        self.x_train, self.x_test, self.y_train, self.y_test = \
        train_test_split(self.x_padded, self.y_onehot, test_size=0.25, random_state=None)

## Parameters

In [5]:
@dataclass
class Parameters:

    # Preprocessing parameters
    num_words: int = 8000
    seq_len: int = 284

    # Model parameters
    embedding_size: int = 64
    out_size: int = 32
    stride: int = 2

    # Training parameters
    epochs: int = 100
    batch_size: int = 12
    learning_rate: float = 0.001
    early_stopping_win: int = 5
        
    # Runtime parameters - will be different for each lemma
    n_outputs: int = None

## TextClassifier

In [6]:
class TextClassifier(nn.Module):

    def __init__(self, params):
        super(TextClassifier, self).__init__()

        # Parameters regarding text preprocessing
        self.seq_len = params.seq_len
        self.num_words = params.num_words
        self.embedding_size = params.embedding_size

        # Dropout definition
        self.dropout = nn.Dropout(0.25)

        # CNN parameters definition
        # Kernel sizes
        self.kernel_1 = 2
        self.kernel_2 = 3
        self.kernel_3 = 4
        self.kernel_4 = 5

        # Output size for each convolution
        self.out_size = params.out_size
        # Number of strides for each convolution
        self.stride = params.stride

        # Embedding layer definition
        self.embedding = nn.Embedding(self.num_words + 1, self.embedding_size, padding_idx=0)

        # Convolution layers definition
        self.conv_1 = nn.Conv1d(self.seq_len, self.out_size, self.kernel_1, self.stride)
        self.conv_2 = nn.Conv1d(self.seq_len, self.out_size, self.kernel_2, self.stride)
        self.conv_3 = nn.Conv1d(self.seq_len, self.out_size, self.kernel_3, self.stride)
        self.conv_4 = nn.Conv1d(self.seq_len, self.out_size, self.kernel_4, self.stride)

        # Max pooling layers definition
        self.pool_1 = nn.MaxPool1d(self.kernel_1, self.stride)
        self.pool_2 = nn.MaxPool1d(self.kernel_2, self.stride)
        self.pool_3 = nn.MaxPool1d(self.kernel_3, self.stride)
        self.pool_4 = nn.MaxPool1d(self.kernel_4, self.stride)

        # Fully connected layer definition
        self.fc = nn.Linear(self.in_features_fc(), params.n_outputs)
        
        # Softmax output layer definition
        self.log_softmax = nn.LogSoftmax(dim = 1)

    def in_features_fc(self):
        '''Calculates the number of input features for the fully connected layer,
        i.e. the size of the concatenated outputs after Convolution + Max pooling
        (padding = 0 and dilation = 1 in this model)

        Convolved_Features = floor((embedding_size + (2 * padding) - dilation * (kernel - 1) - 1) / stride) + 1
        Pooled_Features = floor((Convolved_Features + (2 * padding) - dilation * (kernel - 1) - 1) / stride) + 1

        source: https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
        '''
        
        # Calculate size of convolved/pooled features for convolution_1/max_pooling_1 features
        out_conv_1 = ((self.embedding_size - 1 * (self.kernel_1 - 1) - 1) / self.stride) + 1
        out_conv_1 = math.floor(out_conv_1)
        out_pool_1 = ((out_conv_1 - 1 * (self.kernel_1 - 1) - 1) / self.stride) + 1
        out_pool_1 = math.floor(out_pool_1)

        # Calculate size of convolved/pooled features for convolution_2/max_pooling_2 features
        out_conv_2 = ((self.embedding_size - 1 * (self.kernel_2 - 1) - 1) / self.stride) + 1
        out_conv_2 = math.floor(out_conv_2)
        out_pool_2 = ((out_conv_2 - 1 * (self.kernel_2 - 1) - 1) / self.stride) + 1
        out_pool_2 = math.floor(out_pool_2)

        # Calculate size of convolved/pooled features for convolution_3/max_pooling_3 features
        out_conv_3 = ((self.embedding_size - 1 * (self.kernel_3 - 1) - 1) / self.stride) + 1
        out_conv_3 = math.floor(out_conv_3)
        out_pool_3 = ((out_conv_3 - 1 * (self.kernel_3 - 1) - 1) / self.stride) + 1
        out_pool_3 = math.floor(out_pool_3)

        # Calculate size of convolved/pooled features for convolution_4/max_pooling_4 features
        out_conv_4 = ((self.embedding_size - 1 * (self.kernel_4 - 1) - 1) / self.stride) + 1
        out_conv_4 = math.floor(out_conv_4)
        out_pool_4 = ((out_conv_4 - 1 * (self.kernel_4 - 1) - 1) / self.stride) + 1
        out_pool_4 = math.floor(out_pool_4)

        # Returns "flattened" vector (input for fully connected layer)
        return (out_pool_1 + out_pool_2 + out_pool_3 + out_pool_4) * self.out_size
    

    def forward(self, x):

        # Sequence of tokens is filtered through an embedding layer
        x = self.embedding(x)

        # Convolution layer 1 is applied
        x1 = self.conv_1(x)
        x1 = torch.relu(x1)
        x1 = self.pool_1(x1)

        # Convolution layer 2 is applied
        x2 = self.conv_2(x)
        x2 = torch.relu(x2)
        x2 = self.pool_2(x2)

        # Convolution layer 3 is applied
        x3 = self.conv_3(x)
        x3 = torch.relu(x3)
        x3 = self.pool_3(x3)

        # Convolution layer 4 is applied
        x4 = self.conv_4(x)
        x4 = torch.relu(x4)
        x4 = self.pool_4(x4)

        # The output of each convolutional layer is concatenated into a unique vector
        union = torch.cat((x1, x2, x3, x4), 2)
        union = union.reshape(union.size(0), -1)
        
        # The "flattened" vector is passed through a fully connected layer
        out = self.fc(union)
        # Dropout is applied
        out = self.dropout(out)
        # Log softmax is applied
        out = self.log_softmax(out)

        # Use this, or there's a dim-0 error when a batch contains only one value
        if len(out) > 1:
            return out.squeeze()
        else:
            return out

## Run

In [8]:
class DatasetMapper(Dataset):

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

class Run:
    '''Training, evaluation and metrics calculation'''

    @staticmethod
    def train(model, data, params):

        # Initialize dataset mapper
        train = DatasetMapper(data['x_train'], data['y_train'])
        test = DatasetMapper(data['x_test'], data['y_test'])

        # Initialize loaders
        loader_train = DataLoader(train, batch_size=params.batch_size)
        loader_test = DataLoader(test, batch_size=params.batch_size)

        # Define loss function and optimizer
        loss_function = nn.NLLLoss()
        optimizer = optim.Adam(model.parameters(), lr=params.learning_rate)
        
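        # Define vector for early stopping; start at +inf so that the first check can pass
        # (the NLL loss is never negative, so zeros would end training at the first check)
        prev_loss = np.full(params.early_stopping_win, np.inf)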

        # Starts training phase
        for epoch in range(params.epochs):
            # Set model in training mode
            model.train()
            predictions = []
            # Starts batch training
            for x_batch, y_batch in loader_train:

                y_batch = y_batch.type(torch.FloatTensor)

                # Feed the model
                y_pred = model(x_batch.long())
                                          
                # Transform back from onehot encoded targets to class indices,
                # which is the target format NLLLoss expects
                y_true = y_batch.argmax(dim=1)

                # Loss calculation
                loss = loss_function(y_pred, y_true)

                # Clear gradients
                optimizer.zero_grad()

                # Gradients calculation
                loss.backward()

                # Gradients update
                optimizer.step()

                # Save predictions
                predictions += list(y_pred.detach().numpy())
                
            # Evaluation phase
            test_predictions = Run.evaluation(model, loader_test)
            
            # Metrics calculation
            train_accuracy = Run.calculate_accuracy(data['y_train'], predictions)
            test_accuracy = Run.calculate_accuracy(data['y_test'], test_predictions)
            
            if epoch % 5 == 0:
                print("Epoch: %d, loss: %.4f, Train accuracy: %.4f, Test accuracy: %.4f" % \
                      (epoch, loss.item(), train_accuracy, test_accuracy))
            
            # Early stopping check
            if epoch > 10:
                if loss.item() < min(prev_loss):
                    prev_loss = prev_loss[1:]
                    prev_loss = np.append(prev_loss, loss.item())
                else:
                    break
                
        return train_accuracy, test_accuracy

    @staticmethod
    def evaluation(model, loader_test):

        # Set the model in evaluation mode
        model.eval()
        predictions = []

        # Start evaluation phase
        with torch.no_grad():
            for x_batch, y_batch in loader_test:
                y_pred = model(x_batch.long())
                predictions += list(y_pred.detach().numpy())
        return predictions
        
    @staticmethod
    def calculate_accuracy(ground_truth, predictions):
        # Fraction of examples whose highest-scoring class matches the one-hot target
        predicted = np.argmax(np.asarray(predictions), axis=1)
        expected = np.argmax(np.asarray(ground_truth), axis=1)

        # Return accuracy
        return float(np.mean(predicted == expected))
    
    @staticmethod
    def prediction(model, data, le, params):
        
        # Initialize loader
        loader = DataLoader(data, batch_size=params.batch_size, shuffle=False)
        
        model.eval()
        predictions = []
        
        with torch.no_grad():
            for x_batch in loader:
                pred = model(x_batch.long())
                predictions += list(pred.detach().numpy())
                
        # Pick the highest-scoring class for each example (exactly one per line,
        # even if two classes happen to share the maximum score)
        sense_pred = [int(np.argmax(line)) for line in predictions]

        sense_pred = le.inverse_transform(sense_pred)
        
        # Return the predicted senses
        return sense_pred
                

## Controller

In [9]:
class Controller(Parameters):

    def __init__(self, df, validation_df):
        
        self.lemma = None
        self.train_accuracy = None
        self.test_accuracy = None
        self.sense_pred = None
        
        # Preprocessing pipeline
        self.data, lemma, n_outputs, le, vocabulary = self.prepare_data(df, Parameters.num_words, Parameters.seq_len)
        
        self.le = le
        self.lemma = lemma
        self.vocabulary = vocabulary
        Parameters.n_outputs = n_outputs  

        # Initialize the model
        self.model = TextClassifier(Parameters)

        # Training - Evaluation pipeline
        train_accuracy, test_accuracy = Run().train(self.model, self.data, Parameters)

        # Save accuracies
        self.train_accuracy = train_accuracy
        self.test_accuracy = test_accuracy
        
        # Make predictions on validation dataset
        self.validation_data = self.prepare_validation_data(validation_df, self.vocabulary, Parameters.seq_len)
        self.sense_pred = Run().prediction(self.model, self.validation_data, self.le, Parameters)
 
    @staticmethod
    def prepare_data(df, num_words, seq_len):
        
        # Preprocessing pipeline
        pr = Preprocessing(df, num_words, seq_len)
        pr.load_data()
        pr.build_vocabulary()
        pr.word_to_idx()
        pr.padding_sentences()
        pr.onehot_encode()
        pr.split_data()

        return ({'x_train': pr.x_train, 'y_train': pr.y_train, 'x_test': pr.x_test, 'y_test': pr.y_test}, \
                pr.lemma, pr.n_outputs, pr.le, pr.vocabulary)
   
    @staticmethod
    def prepare_validation_data(df, vocabulary, seq_len):
        
        num_words = len(vocabulary)

        pr = Preprocessing(df, num_words, seq_len)
        pr.vocabulary = vocabulary
        pr.seq_len = seq_len
        pr.load_data()
        pr.word_to_idx()
        pr.padding_sentences()

        return pr.x_padded

    # if __name__ == '__main__':
    #    controller = Controller(df_pos)

## Run the code

In [18]:
df = load_data(train_path)
test_df = load_data(test_path)
validation_df = load_data(validation_path)

In [19]:
# Loop over all lemmas

lemma_vec = []
train_accuracy_vec = []
test_accuracy_vec = []
predicted_df = test_df.copy()

start_time = time.time()

for lemma in df.Lemma.unique():
    
    df_short = df[df.Lemma == lemma]
    test_short = test_df[test_df.Lemma == lemma]
    controller = Controller(df_short, test_short)
    
    print('-'*60)
    print("Lemma: %s, Final training accuracy: %.4f, Final test accuracy: %.4f" % \
                  (controller.lemma, controller.train_accuracy, controller.test_accuracy))
    print('-'*60)
    
    # Append accuracies for each lemma
    lemma_vec.append(controller.lemma)
    train_accuracy_vec.append(controller.train_accuracy)
    test_accuracy_vec.append(controller.test_accuracy)
    
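    # Make predictions
    # Write the predicted senses back by index label; .loc avoids the chained
    # assignment that triggers pandas' SettingWithCopyWarning
    predictions = controller.sense_pred
    predicted_df.loc[test_short.index, "Sense_key"] = predictions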
    
elapsed_time = time.time() - start_time
print("Elapsed time: ", elapsed_time)  

evaluation_df = pd.DataFrame(lemma_vec, columns = ["Lemma"])
evaluation_df['training_acc'] = train_accuracy_vec
evaluation_df['test_acc'] = test_accuracy_vec
evaluation_df.to_csv('CNN_evaluation.csv', index=False)
predicted_df.to_csv('CNN_predictions.csv', index=False)

accuracy = np.sum(validation_df.Sense_key == predicted_df.Sense_key)/(len(validation_df))
print("Validation accuracy: %.5f" % accuracy)

Epoch: 0, loss: 1.5711, Train accuracy: 0.3858, Test accuracy: 0.4463
Epoch: 5, loss: 0.7047, Train accuracy: 0.9669, Test accuracy: 0.5122
Epoch: 10, loss: 0.3128, Train accuracy: 0.9916, Test accuracy: 0.5493
------------------------------------------------------------
Lemma: keep.v, Final training accuracy: 0.9936, Final test accuracy: 0.5389
------------------------------------------------------------


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


Epoch: 0, loss: 0.6237, Train accuracy: 0.2150, Test accuracy: 0.1825
Epoch: 5, loss: 0.1731, Train accuracy: 0.9652, Test accuracy: 0.2644
Epoch: 10, loss: 0.0000, Train accuracy: 0.9956, Test accuracy: 0.2868
------------------------------------------------------------
Lemma: national.a, Final training accuracy: 0.9988, Final test accuracy: 0.2886
------------------------------------------------------------
Epoch: 0, loss: 1.9395, Train accuracy: 0.1807, Test accuracy: 0.2276
Epoch: 5, loss: 0.2962, Train accuracy: 0.9535, Test accuracy: 0.1955
Epoch: 10, loss: 0.4435, Train accuracy: 0.9984, Test accuracy: 0.2388
------------------------------------------------------------
Lemma: build.v, Final training accuracy: 0.9973, Final test accuracy: 0.2404
------------------------------------------------------------
Epoch: 0, loss: 1.5064, Train accuracy: 0.3315, Test accuracy: 0.4265
Epoch: 5, loss: 0.1135, Train accuracy: 0.9903, Test accuracy: 0.4534
Epoch: 10, loss: 0.6152, Train accura

Epoch: 0, loss: 1.6819, Train accuracy: 0.2368, Test accuracy: 0.2743
Epoch: 5, loss: 0.4100, Train accuracy: 0.9794, Test accuracy: 0.3787
Epoch: 10, loss: 0.0174, Train accuracy: 0.9931, Test accuracy: 0.4049
------------------------------------------------------------
Lemma: life.n, Final training accuracy: 0.9894, Final test accuracy: 0.4011
------------------------------------------------------------
Epoch: 0, loss: 1.5601, Train accuracy: 0.2615, Test accuracy: 0.2874
Epoch: 5, loss: 0.0200, Train accuracy: 0.9993, Test accuracy: 0.3772
Epoch: 10, loss: 0.2336, Train accuracy: 1.0000, Test accuracy: 0.3952
------------------------------------------------------------
Lemma: order.n, Final training accuracy: 1.0000, Final test accuracy: 0.4012
------------------------------------------------------------
Epoch: 0, loss: 1.5478, Train accuracy: 0.2144, Test accuracy: 0.2179
Epoch: 5, loss: 0.4664, Train accuracy: 0.9759, Test accuracy: 0.2452
Epoch: 10, loss: 0.3362, Train accuracy: 

In [22]:
test_short.head()

|     | Sense_key | Lemma   | Position | Text |
|-----|-----------|---------|----------|------|
| 92  | ?         | major.a | 67       | [resource, requirements, (, before, recosting,... |
| 110 | ?         | major.a | 39       | [the, most, significant, outcome, of, the, fir... |
| 137 | ?         | major.a | 28       | [however, ,, as, a, part, of, the, united, nat... |
| 202 | ?         | major.a | 73       | [the, representative, of, indonesia, ,, speaki... |
| 211 | ?         | major.a | 39       | [total, contributions, for, unifem, increased,... |


A separate model is trained for each lemma and then used for prediction; all of them stop training early, after around 10-15 epochs. Loss and accuracy scores are printed on a lemma-by-lemma basis to make comparison between models easier. For most (if not all) lemmas, the training accuracy comes close to 100% while the test accuracy varies a lot, which indicates heavy overfitting. For some words, such as **lead** and **build**, the test accuracy is as low as 20% when training stops. For **line** we get a test accuracy as high as 88%, implying that there is a clearer local structure around different senses of this particular lemma.

The accuracy on the validation set is 42%, which is definitely above the MFS (most frequent sense) baseline accuracy. Since I have applied a quite simple model to a quite complicated problem, 42% does not seem all that bad. In an image analysis setting we would commonly see several convolution layers stacked on top of each other. This could improve performance here too, but probably not by much. CNNs do learn local structure, but they have no recurrent connections that carry information along the sentence, and will probably miss more subtle language structures no matter how many layers we stack.
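
For reference, the MFS baseline mentioned above always predicts the sense that occurs most often for the lemma in the training data. A minimal sketch of how it could be computed is shown below; `mfs_baseline` is a hypothetical helper and the (lemma, sense) pairs are toy data, not the actual dataset.

```python
from collections import Counter

def mfs_baseline(train_pairs, eval_pairs):
    """Accuracy of always predicting each lemma's most frequent training sense."""
    counts = {}
    for lemma, sense in train_pairs:
        counts.setdefault(lemma, Counter())[sense] += 1
    most_frequent = {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}
    correct = sum(most_frequent.get(lemma) == sense for lemma, sense in eval_pairs)
    return correct / len(eval_pairs)

# Toy (lemma, sense) pairs; the sense labels are made up for illustration
train_pairs = [("keep.v", "s1"), ("keep.v", "s1"), ("keep.v", "s2"), ("line.n", "s3")]
eval_pairs = [("keep.v", "s1"), ("keep.v", "s2"), ("line.n", "s3"), ("line.n", "s4")]
print(mfs_baseline(train_pairs, eval_pairs))  # 0.5
```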

I see no noticeable change in the validation accuracy after running the code a couple of times; of course, this could be analysed further.

# Classify using a deep network

I also wanted to evaluate the performance of a simple deep network. As mentioned above, I use a window of 20 words from each document, centered on the position of the lemma. The embedding dimension here is 50, meaning that the reshaped vector for each embedded document will have length 1000. This is the dimension of the input layer in the neural network: the following layers are of dimensions 200, 100, 50, with the output dimension again being determined by the number of unique senses for the lemma. 
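
To give a feeling for the sizes involved, the sketch below counts the trainable weights of the fully connected stack described above and compares them with what the same stack would need if a whole padded document (284 x 64) were flattened instead. It assumes every layer has a bias term and uses a hypothetical lemma with 5 senses; `dense_parameter_count` is a helper name introduced here.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected stack with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

win_size, embedding_dim = 10, 50
input_dim = 2 * win_size * embedding_dim   # 20 words x 50 dimensions
n_senses = 5                               # hypothetical; depends on the lemma

windowed = dense_parameter_count([input_dim, 200, 100, 50, n_senses])
full_doc = dense_parameter_count([284 * 64, 200, 100, 50, n_senses])
print(input_dim, windowed, full_doc)  # 1000 225605 3660805
```

Almost all of the parameters sit in the first layer, which is why shrinking the input with a context window matters so much for this architecture.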

The Preprocessing_2 class is almost identical to the Preprocessing class above; in retrospect this did not need to be a separate class at all. The main difference is that I have added (and commented out) infrastructure for using GloVe pretrained vectors. It was interesting to try GloVe but I did not notice any performance differences compared to the nn.Embedding - one can only get so far with a classical deep net, no matter how good the embedding is. 

In this network as well, I have used 25% dropout followed by a log softmax function. 

## Preprocessing

embeddings_dict = {}

with open("glove.6B.50d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector

In [336]:
class Preprocessing_2:
    def __init__(self, df, num_words, win_size, embedding_size):
        self.data = df
        self.num_words = num_words        
        self.embedding_size = embedding_size
        self.win_size = win_size
        self.seq_len = 2*win_size
        
        self.vocabulary = None
        self.x_tokenized = None
        self.x_embedded = None
        self.x_padded = None
        self.x_raw = None
        
        self.lemma = None
        self.n_outputs = None
        self.le = None
        self.y = None
        self.y_onehot = None
        
        self.x_train = None
        self.x_test = None
        self.y_train = None
        self.y_test = None  
    
    def load_data(self):
        # split into sentences (x) and sense key (y)
        df = self.data
        self.x_raw = df.Text.values
        self.lemma = df.Lemma.iloc[0]
        self.n_outputs = len(df.Sense_key.unique())
        
        labels = np.asarray(df.Sense_key.values)
        le = preprocessing.LabelEncoder()
        self.y = le.fit_transform(labels)
        self.le = le
        
    def remove_text(self):
        win_size = self.win_size

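        for i, pos in enumerate(self.data.Position):

            line_len = len(self.x_raw[i])
            pos = int(pos)

            # Clamp the window so that it stays inside the line; the outer max
            # also covers lines shorter than the full window, which would
            # otherwise produce a negative slice start
            start = max(min(pos - win_size, line_len - 2 * win_size), 0)

            self.x_raw[i] = self.x_raw[i][start:start + 2 * win_size]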
        
#     def glove_embedding(self, embeddings_dict):
        
#         embedding_size = self.embedding_size
#         seq_len = self.seq_len

#         padding = np.zeros(embedding_size)
#         embedded_matrix = []

#         for line in self.x_raw:

#             embedded_line = []

#             for word in line:
#                 try:
#                     embedded_line.append(embeddings_dict[word])
#                 except KeyError:
#                     continue

#             while len(embedded_line) < seq_len:
#                 embedded_line.append(padding)

#             embedded_line = np.array(embedded_line)
#             embedded_matrix.append(embedded_line)

#        self.embedded_matrix = np.array(embedded_matrix)   
        
    def build_vocabulary(self):
        # Builds the vocabulary 
        self.vocabulary = dict()
        fdist = nltk.FreqDist()

        for sentence in self.x_raw:
            for word in sentence:
                fdist[word] += 1

        common_words = fdist.most_common(self.num_words)

        for idx, word in enumerate(common_words):
            self.vocabulary[word[0]] = (idx+1)
            
    def word_to_idx(self):
        # By using the dictionary each token is transformed into its index based representation
        self.x_tokenized = list() 

        for sentence in self.x_raw:
            temp_sentence = list()
            for word in sentence:
                if word in self.vocabulary.keys():
                    temp_sentence.append(self.vocabulary[word])
            self.x_tokenized.append(temp_sentence)
    
    def padding_sentences(self):
        # Each sentence shorter than the required length is right-padded with the index 0
        pad_idx = 0
        self.x_padded = list()

        for sentence in self.x_tokenized:
            n_missing = max(self.seq_len - len(sentence), 0)
            self.x_padded.append(sentence + [pad_idx] * n_missing)
            
        self.x_padded = np.array(self.x_padded) 
        
    def onehot_encode(self):
        # Create a one-hot encoded representation of the targets:
        # row i gets a 1 in the column of its label-encoded class self.y[i]
        self.y_onehot = np.zeros((len(self.y), self.n_outputs))
        self.y_onehot[np.arange(len(self.y)), self.y] = 1
            
    def split_data(self):
        self.x_train, self.x_test, self.y_train, self.y_test = \
        train_test_split(self.x_padded, self.y_onehot, test_size=0.25, random_state=None)
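
To make the windowing step in `remove_text` concrete, here is a minimal standalone sketch. It assumes each text is already a list of tokens and that the position is a 0-based token index; `context_window` is a hypothetical helper and not part of the notebook.

```python
def context_window(tokens, pos, win_size):
    """Return up to 2*win_size tokens centred on index pos (hypothetical helper)."""
    width = 2 * win_size
    # Latest start that still leaves room for a full window
    last_start = max(len(tokens) - width, 0)
    # Centre on pos, then clamp the start into [0, last_start]
    start = min(max(pos - win_size, 0), last_start)
    return tokens[start:start + width]


tokens = "the committee will continue to keep under review the situation".split()
window = context_window(tokens, pos=5, win_size=2)
print(window)
```

Texts shorter than the full window are returned unchanged, so padding can be left to a later step.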

## Parameters

In [337]:
@dataclass
class Parameters_2:

    # Preprocessing parameters
    num_words: int = 8000
    win_size: int = 10
    seq_len: int = 2*win_size
    embedding_size: int = 50

    # Model parameters
    out_size_1: int = 200
    out_size_2: int = 100
    out_size_3: int = 50
  
    # Training parameters
    epochs: int = 100
    batch_size: int = 12
    learning_rate: float = 0.001
    early_stopping_win: int = 5
        
    # Runtime parameters - will be different for each lemma
    n_outputs: int = None
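
A side note on `@dataclass`: only annotated attributes become dataclass fields, and per-lemma values such as `n_outputs` can be set on a copy rather than on the class itself. The sketch below illustrates both points with a hypothetical `DemoParams` class; it is not the configuration used in this notebook.

```python
from dataclasses import dataclass, fields, replace


@dataclass
class DemoParams:
    # Annotated attributes become dataclass fields
    num_words: int = 8000
    n_outputs: int = 0
    # No annotation: a plain class attribute, not a field
    win_size = 10


field_names = [f.name for f in fields(DemoParams)]
print(field_names)

# Per-lemma settings on a copy instead of mutating the class itself
base = DemoParams()
keep_params = replace(base, n_outputs=4)
print(base.n_outputs, keep_params.n_outputs, keep_params.win_size)
```

Working on copies would keep one lemma's `n_outputs` from leaking into the next run of the loop.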

## Deep classifier

In [338]:
class DeepClassifier(nn.Module):

    def __init__(self, params):
        super(DeepClassifier, self).__init__()

        # Parameters regarding text preprocessing
        self.seq_len = params.seq_len
        self.num_words = params.num_words
        self.embedding_size = params.embedding_size

        # Dropout definition
        self.dropout = nn.Dropout(0.25)

        # Define fully connected layers
        self.fc_1 = nn.Linear(self.seq_len*params.embedding_size, params.out_size_1)
        self.fc_2 = nn.Linear(params.out_size_1, params.out_size_2)
        self.fc_3 = nn.Linear(params.out_size_2, params.out_size_3)
        self.fc_4 = nn.Linear(params.out_size_3, params.n_outputs)

        # Embedding layer definition
        self.embedding = nn.Embedding(self.num_words + 1, self.embedding_size, padding_idx=0)
        
        # Softmax output layer definition
        self.log_softmax = nn.LogSoftmax(dim = 1)

    def forward(self, x):

        # Sequence of tokens is filtered through an embedding layer
        x = self.embedding(x)
        
        # Reshape to one dimension
        x = x.reshape(x.size(0), -1)

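        # Pass the embedded vector through the fully connected layers.
        # Note: no activation is applied between the layers, so the four
        # linear layers together act as a single linear (affine) map.
        x1 = self.fc_1(x)
        x2 = self.fc_2(x1)
        x3 = self.fc_3(x2)
        out = self.fc_4(x3)
        
        # Dropout is applied to the output scores, followed by log-softmax
        out = self.dropout(out)
        out = self.log_softmax(out)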

        # Use this, or there's a dim-0 error when a batch contains only one value
        if len(out) > 1:
            return out.squeeze()
        else:
            return out
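
The forward pass above chains four `nn.Linear` layers with nothing in between. Stacked affine layers without a non-linearity compose into a single affine map, which the following NumPy sketch illustrates with two small random layers (the sizes and names are made up for the example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "linear layers": y = W x + b
W1, b1 = rng.normal(size=(5, 8)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)

x = rng.normal(size=8)

# Layer by layer, with no activation in between
stacked = W2 @ (W1 @ x + b1) + b2

# The same map written as one layer
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = W @ x + b

print(np.allclose(stacked, collapsed))

# A ReLU between the layers breaks this equivalence in general
with_relu = W2 @ np.maximum(W1 @ x + b1, 0) + b2
```

Inserting an activation such as `F.relu` between the `fc_*` layers would be one way to give the network more capacity than a linear classifier on the embeddings; whether that helps on this data is untested here.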

## Run

In [339]:
class DatasetMapper_2(Dataset):

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

class Run_2:
    '''Training, evaluation and metrics calculation'''

    @staticmethod
    def train(model, data, params):

        # Initialize dataset mapper
        train = DatasetMapper_2(data['x_train'], data['y_train'])
        test = DatasetMapper_2(data['x_test'], data['y_test'])

        # Initialize loaders
        loader_train = DataLoader(train, batch_size=params.batch_size)
        loader_test = DataLoader(test, batch_size=params.batch_size)

        # Define loss function and optimizer
        loss_function = nn.NLLLoss()
        optimizer = optim.Adam(model.parameters(), lr=params.learning_rate)
        
        # Define vector for early stopping
        prev_loss = np.zeros(params.early_stopping_win)

        # Starts training phase
        for epoch in range(params.epochs):
            # Set model in training mode
            model.train()
            predictions = []
            # Starts batch training
            for x_batch, y_batch in loader_train:

                # Feed the model
                y_pred = model(x_batch.long())
                                          
                # Transform back from one-hot encoded targets to class indices
                y_true = torch.argmax(y_batch, dim=1)

                # Loss calculation
                loss = loss_function(y_pred, y_true)

                # Clear gradients from the previous step
                optimizer.zero_grad()

                # Gradients calculation
                loss.backward()

                # Parameter update
                optimizer.step()

                # Save predictions
                predictions += list(y_pred.detach().numpy())
                
            # Evaluation phase
            test_predictions = Run_2.evaluation(model, loader_test)
            
            # Metrics calculation
            train_accuracy = Run_2.calculate_accuracy(data['y_train'], predictions)
            test_accuracy = Run_2.calculate_accuracy(data['y_test'], test_predictions)
            
            if epoch % 5 == 0:
                print("Epoch: %d, loss: %.4f, Train accuracy: %.4f, Test accuracy: %.4f" % \
                      (epoch, loss.item(), train_accuracy, test_accuracy))
            
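            # Early stopping check (uses the loss of the last training batch).
            # NOTE: prev_loss starts as all zeros and the NLL loss is non-negative,
            # so the comparison below is False the first time it runs and training
            # always stops at epoch 11. Initialising prev_loss with np.inf would be
            # needed for the window to act as a real stopping criterion.
            if epoch > 10:
                if loss.item() < min(prev_loss):
                    prev_loss = prev_loss[1:]
                    prev_loss = np.append(prev_loss, loss.item())
                else:
                    break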
                
        return train_accuracy, test_accuracy

    @staticmethod
    def evaluation(model, loader_test):

        # Set the model in evaluation mode
        model.eval()
        predictions = []

        # Start evaluation phase
        with torch.no_grad():
            for x_batch, y_batch in loader_test:
                y_pred = model(x_batch.long())
                predictions += list(y_pred.detach().numpy())
        return predictions
        
    @staticmethod
    def calculate_accuracy(ground_truth, predictions):
        # Fraction of samples whose highest-scoring class matches the one-hot target
        correct = 0
        
        for true, pred in zip(ground_truth, predictions):
            if true[np.argmax(pred)] == 1:
                correct += 1
            
        # Return accuracy
        return correct / len(ground_truth)
    
    @staticmethod
    def prediction(model, data, le, params):
        
        # Initialize loader (no shuffling, so predictions keep the row order)
        loader = DataLoader(data, batch_size=params.batch_size, shuffle=False)
        
        model.eval()
        predictions = []
        
        with torch.no_grad():
            for x_batch in loader:
                pred = model(x_batch.long())
                predictions += list(pred.detach().numpy())
                
        # Pick the highest-scoring class for each sample (exactly one per row)
        sense_pred = [int(np.argmax(line)) for line in predictions]
                    
        # Map class indices back to sense keys
        sense_pred = le.inverse_transform(sense_pred)
        
        # Return the predicted senses
        return sense_pred
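
The early-stopping check in `train` compares the loss of the last batch against a window of earlier losses. A common alternative is a patience counter on a per-epoch validation loss. The sketch below shows that idea with a hypothetical `EarlyStopper` helper and made-up loss values; it is not what produced the results in this notebook.

```python
class EarlyStopper:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2)
losses = [1.0, 0.8, 0.9, 0.85, 0.7]  # made-up per-epoch validation losses
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at)
```

Starting `best` at infinity means the first epoch always counts as an improvement, so the stopper cannot fire before any loss has been recorded.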
                

## Controller

In [342]:
class Controller_2(Parameters_2):

    def __init__(self, df, validation_df):
        
        self.lemma = None
        self.train_accuracy = None
        self.test_accuracy = None
        self.sense_pred = None
        
        # Preprocessing pipeline
        self.data, lemma, n_outputs, le, vocabulary = self.prepare_data(df, Parameters_2)
        
        self.le = le
        self.lemma = lemma
        self.vocabulary = vocabulary
        Parameters_2.n_outputs = n_outputs  

        # Initialize the model
        self.model = DeepClassifier(Parameters_2)

        # Training - Evaluation pipeline
        train_accuracy, test_accuracy = Run_2().train(self.model, self.data, Parameters_2)

        # Save accuracies
        self.train_accuracy = train_accuracy
        self.test_accuracy = test_accuracy
        
        # Make predictions on validation dataset
        self.validation_data = self.prepare_validation_data(validation_df, self.vocabulary, Parameters_2)
        self.sense_pred = Run_2().prediction(self.model, self.validation_data, self.le, Parameters_2)
 
    @staticmethod
    def prepare_data(df, params):
        
        # Preprocessing pipeline
        pr = Preprocessing_2(df, params.num_words, params.win_size, params.embedding_size)
        pr.load_data()
        pr.remove_text()
        pr.build_vocabulary()
        pr.word_to_idx()
        pr.padding_sentences()
        pr.onehot_encode()
        pr.split_data()
        
        return ({'x_train': pr.x_train, 'y_train': pr.y_train, 'x_test': pr.x_test, 'y_test': pr.y_test}, \
                pr.lemma, pr.n_outputs, pr.le, pr.vocabulary)
   
    @staticmethod
    def prepare_validation_data(df, vocabulary, params):
        
        num_words = len(vocabulary)

        pr = Preprocessing_2(df, num_words, params.win_size, params.embedding_size)
        pr.vocabulary = vocabulary
        pr.seq_len = params.seq_len
        
        pr.load_data()
        pr.remove_text()
        pr.word_to_idx()
        pr.padding_sentences()

        return pr.x_padded
    

## Run the code

In [276]:
df = load_data(train_path)
test_df = load_data(test_path)
validation_df = load_data(validation_path)

In [334]:
# Loop over all lemmas

lemma_vec = []
train_accuracy_vec = []
test_accuracy_vec = []
predicted_df = test_df.copy()

start_time = time.time()

for lemma in df.Lemma.unique():
    
    df_short = df[df.Lemma == lemma]
    test_short = test_df[test_df.Lemma == lemma]
    controller = Controller_2(df_short, test_short)
    
    print('-'*60)
    print("Lemma: %s, Final training accuracy: %.4f, Final test accuracy: %.4f" % \
                  (controller.lemma, controller.train_accuracy, controller.test_accuracy))
    print('-'*60)
    
    # Append accuracies for each lemma
    lemma_vec.append(controller.lemma)
    train_accuracy_vec.append(controller.train_accuracy)
    test_accuracy_vec.append(controller.test_accuracy)
    
    # Write the predicted senses back into the matching rows
    # (.loc assigns in place; chained .iloc[...] assignment may only modify a copy)
    predicted_df.loc[test_short.index, "Sense_key"] = controller.sense_pred
    
elapsed_time = time.time() - start_time
print("Elapsed time: ", elapsed_time)  

evaluation_df = pd.DataFrame(lemma_vec, columns = ["Lemma"])
evaluation_df['training_acc'] = train_accuracy_vec
evaluation_df['test_acc'] = test_accuracy_vec
evaluation_df.to_csv('Deep_evaluation.csv', index=False)
predicted_df.to_csv('Deep_predictions.csv', index=False)

accuracy = np.sum(validation_df.Sense_key == predicted_df.Sense_key)/(len(validation_df))
print("Validation accuracy: %.5f" % accuracy)

Epoch: 0, loss: 1.3227, Train accuracy: 0.5873, Test accuracy: 0.6590
Epoch: 5, loss: 0.5443, Train accuracy: 0.8872, Test accuracy: 0.6664
Epoch: 10, loss: 0.6236, Train accuracy: 0.9508, Test accuracy: 0.7013
------------------------------------------------------------
Lemma: keep.v, Final training accuracy: 0.9520, Final test accuracy: 0.7042
------------------------------------------------------------
Epoch: 0, loss: 1.7453, Train accuracy: 0.4357, Test accuracy: 0.4860
Epoch: 5, loss: 0.0000, Train accuracy: 0.9105, Test accuracy: 0.5847
Epoch: 10, loss: 0.0000, Train accuracy: 0.9323, Test accuracy: 0.5624
------------------------------------------------------------
Lemma: national.a, Final training accuracy: 0.9534, Final test accuracy: 0.5493
------------------------------------------------------------
Epoch: 0, loss: 2.0070, Train accuracy: 0.3282, Test accuracy: 0.3942
Epoch: 5, loss: 0.3522, Train accuracy: 0.8001, Test accuracy: 0.3718
Epoch: 10, loss: 0.2950, Train accurac

Epoch: 0, loss: 1.4057, Train accuracy: 0.4503, Test accuracy: 0.6150
Epoch: 5, loss: 0.3632, Train accuracy: 0.8860, Test accuracy: 0.5333
Epoch: 10, loss: 0.2160, Train accuracy: 0.9466, Test accuracy: 0.5883
------------------------------------------------------------
Lemma: find.v, Final training accuracy: 0.9511, Final test accuracy: 0.5517
------------------------------------------------------------
Epoch: 0, loss: 1.3837, Train accuracy: 0.5040, Test accuracy: 0.5784
Epoch: 5, loss: 0.0641, Train accuracy: 0.9184, Test accuracy: 0.6138
Epoch: 10, loss: 0.3143, Train accuracy: 0.9570, Test accuracy: 0.5914
------------------------------------------------------------
Lemma: life.n, Final training accuracy: 0.9502, Final test accuracy: 0.5970
------------------------------------------------------------
Epoch: 0, loss: 1.3417, Train accuracy: 0.5163, Test accuracy: 0.6327
Epoch: 5, loss: 0.0507, Train accuracy: 0.9261, Test accuracy: 0.6567
Epoch: 10, loss: 0.6647, Train accuracy: 0

Interestingly, the final validation accuracy is higher for the regular deep network than for the CNN! It is also noticeably faster to train. Validation accuracy is around 51% over a couple of runs (again, this could be explored further). The deep network's prediction across a 20-word window in the validation set is based on whether it has seen similar words in the same order before. In other words, it takes a less sophisticated approach than the CNN but still manages to do quite well, possibly because I have filtered out a lot of unnecessary information in the preprocessing stage: words closer to the lemma are more likely to be related to the sense. A more consistent way of filtering information would be to only look at the sentence containing the lemma and disregard the rest of the document. 
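
The sentence-level filtering suggested above could look roughly like this. The sketch assumes tokenized text in which sentences end with a `.`, `!` or `?` token, as in the sample lines at the top of the notebook; `sentence_around` is a hypothetical helper and has not been evaluated on the data.

```python
SENTENCE_END = {".", "!", "?"}


def sentence_around(tokens, pos):
    """Return the tokens of the sentence that contains index pos."""
    # Walk left to the token right after the previous sentence end
    start = pos
    while start > 0 and tokens[start - 1] not in SENTENCE_END:
        start -= 1
    # Walk right up to and including the next sentence end
    end = pos
    while end < len(tokens) - 1 and tokens[end] not in SENTENCE_END:
        end += 1
    return tokens[start:end + 1]


tokens = "it rained . we keep the door shut . nobody came".split()
sentence = sentence_around(tokens, pos=4)
print(sentence)
```

Unlike the fixed window, this yields inputs of varying length, so padding or truncation to a common `seq_len` would still be needed before the embedding layer.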

This data processing approach could be tried with CNNs as well, but even more interestingly it could be fed into a more sophisticated model. GRU, LSTM and attention models are all much better suited for this task, and it would be very interesting to explore how they perform. A structure that allows for feedback and gives different importance to different parts of the text would have a really good chance of doing well in WSD. However, due to lack of time I will have to look into this some other time. 