# CE7455 Assignment 2

Name: PENG HONGYI <br>
Matric No: G2105029E

## Question one (i)

__Named Entity Recognition__ (NER) is an important task in NLP that attempts to locate and classify predefined entities in a sentence. In this assignment, we use _eng.train_ for training, _eng.testa_ for validation, and _eng.testb_ for testing.

A sentence from the dataset is shown below. The first column is the input word and the last column is the output tag. The dataset contains four types of predefined entities: PERSON, LOCATION, ORGANIZATION, and MISC. Each tag in the last column consists of a BIO prefix and the ground-truth entity type, separated by '-'; tokens outside any entity are tagged O.

| Word    | POS | Chunk | Tag    |
|---------|-----|-------|--------|
| EU      | NNP | I-NP | I-ORG  |
| rejects | VBZ | I-VP | O      |
| German  | JJ  | I-NP | I-MISC |
| call    | NN  | I-NP | O      |
| to      | TO  | I-VP | O      |
| boycott | VB  | I-VP | O      |
| British | JJ  | I-NP | I-MISC |
| lamb    | NN  | I-NP | O      |
| .       | .   | O    | O      |
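
To make the tag format concrete, the following sketch splits one line of the data file into its columns and separates the tag into its prefix and entity type. The function name `parse_conll_line` is hypothetical and is not part of the provided code; the sketch assumes whitespace-separated columns as in the table above.

```python
def parse_conll_line(line):
    """Split one data line into (word, tag prefix, entity type)."""
    columns = line.split()
    word, tag = columns[0], columns[-1]
    if tag == "O":
        # Tokens outside any entity have no entity type.
        return word, "O", None
    prefix, entity_type = tag.split("-", 1)
    return word, prefix, entity_type


print(parse_conll_line("EU NNP I-NP I-ORG"))   # ('EU', 'I', 'ORG')
print(parse_conll_line("rejects VBZ I-VP O"))  # ('rejects', 'O', None)
```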

## Question one (ii)


The provided code contains five preprocessing steps:
* Replace every digit with 0.
* Convert BIO tagging to BIOES tagging.
* Generate the word mapping.
* Generate the tag mapping.
* Generate the character mapping.

The mappings are dictionaries that assign an integer id to every unique word, character, and tag. After preprocessing, we found:

> Found 17493 unique words (203621 in total) <br>
> Found 75 unique characters <br>
> Found 19 unique named entity tags

The preprocessed dataset is stored in _data/mapping.pkl_. To save time, we load the preprocessed data directly throughout this assignment.
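
The preprocessing steps listed above can be illustrated with a minimal sketch. The function names `zero_digits`, `bio_to_bioes`, and `build_mapping` are illustrative and may not match the provided code. The sketch assumes that every entity in the input starts with a `B-` tag, and that ids are assigned by decreasing frequency; both are assumptions rather than statements about the actual implementation.

```python
import re


def zero_digits(text):
    """Replace every digit with 0."""
    return re.sub(r"\d", "0", text)


def bio_to_bioes(tags):
    """Convert BIO tags (each entity assumed to start with B-) to BIOES."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, entity_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag.
        continues = next_tag.startswith("I-")
        if prefix == "B":
            converted.append(tag if continues else "S-" + entity_type)
        else:  # prefix == "I"
            converted.append(tag if continues else "E-" + entity_type)
    return converted


def build_mapping(items):
    """Assign ids by decreasing frequency; ties are broken alphabetically."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    ordered = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: idx for idx, item in enumerate(ordered)}


print(zero_digits("1996-08-22"))
# 0000-00-00
print(bio_to_bioes(["B-ORG", "O", "B-PER", "I-PER", "I-PER"]))
# ['S-ORG', 'O', 'B-PER', 'I-PER', 'E-PER']
print(build_mapping(["the", "cat", "the", "dog"]))
# {'the': 0, 'cat': 1, 'dog': 2}
```

With four entity types, the BIOES scheme yields 4 × 4 + 1 = 17 tags; the count of 19 reported above is consistent with two additional start and stop tags used by a CRF layer.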



In [1]:
# Load data
import pickle
with open('data/mapping.pkl', 'rb') as f:
    mapping = pickle.load(f)
list(mapping.keys())


['word_to_id', 'tag_to_id', 'char_to_id', 'parameters', 'word_embeds']

In [2]:
word_to_id = mapping['word_to_id']
tag_to_id = mapping['tag_to_id']
char_to_id = mapping['char_to_id']
# We use our own parameters
# parameters = mapping['parameters']
word_embeds = mapping['word_embeds']

In [3]:
from collections import OrderedDict
import torch
parameters = OrderedDict()
parameters['train'] = "./data/eng.train" #Path to train file
parameters['dev'] = "./data/eng.testa" #Path to dev file
parameters['test'] = "./data/eng.testb" #Path to test file
parameters['tag_scheme'] = "BIOES" #BIO or BIOES
parameters['lower'] = True # Boolean variable to control lowercasing of words
parameters['zeros'] =  True # Boolean variable to control replacement of  all digits by 0 
parameters['char_dim'] = 30 #Char embedding dimension
parameters['word_dim'] = 100 #Token embedding dimension
parameters['word_lstm_dim'] = 200 #Token LSTM hidden layer size
parameters['word_bidirect'] = True #Use a bidirectional LSTM for words
parameters['embedding_path'] = "./data/glove.6B.100d.txt" #Location of pretrained embeddings
parameters['all_emb'] = 1 #Load all embeddings
parameters['crf'] =1 #Use CRF (0 to disable)
parameters['dropout'] = 0.5 #Dropout on the input (0 = no dropout)
parameters['epoch'] =  50 #Number of epochs to run
parameters['weights'] = "" #Path to pretrained weights from a previous run
parameters['name'] = "self-trained-model" # Model name
parameters['gradient_clip']=5.0
parameters['char_mode']="CNN"
models_path = "./models/" #path to saved models
parameters['use_gpu'] = torch.cuda.is_available() #GPU Check
use_gpu = parameters['use_gpu']
parameters['reload'] = "./models/pre-trained-model" 

In [4]:
from Utils import load_sentences
from TagConversion import update_tag_scheme
train_sentences = load_sentences(parameters['train'], parameters['zeros'])
test_sentences = load_sentences(parameters['test'], parameters['zeros'])
val_sentences = load_sentences(parameters['dev'], parameters['zeros'])
update_tag_scheme(train_sentences, parameters['tag_scheme'])
update_tag_scheme(val_sentences, parameters['tag_scheme'])
update_tag_scheme(test_sentences, parameters['tag_scheme'])


In [5]:
from Utils import prepare_dataset

train_data = prepare_dataset(
    train_sentences, word_to_id, char_to_id, tag_to_id, parameters['lower']
)
val_data = prepare_dataset(
    val_sentences, word_to_id, char_to_id, tag_to_id, parameters['lower']
)
test_data = prepare_dataset(
    test_sentences, word_to_id, char_to_id, tag_to_id, parameters['lower']
)
print("{} / {} / {} sentences in train / val / test.".format(len(train_data), len(val_data), len(test_data)))

14041 / 3250 / 3453 sentences in train / val / test.


In [6]:
from BaseModel import BiLSTM_CRF
model = BiLSTM_CRF(vocab_size=len(word_to_id),
                   tag_to_ix=tag_to_id,
                   embedding_dim=parameters['word_dim'],
                   hidden_dim=parameters['word_lstm_dim'],
                   use_gpu=use_gpu,
                   char_to_ix=char_to_id,
                   pre_word_embeds=word_embeds,
                   use_crf=parameters['crf'],
                   char_mode=parameters['char_mode'])



In [7]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [8]:
num_parameters = count_parameters(model)
print(f'The base model contains {num_parameters} parameters')

The base model contains 2284255 parameters


In [9]:
model.load_state_dict(torch.load(parameters['reload']))
print("model reloaded :", parameters['reload'])
if use_gpu:
    model.cuda()

model reloaded : ./models/pre-trained-model


In [10]:
import numpy as np
from torch.autograd import Variable
from Helper import get_chunks
from tqdm import tqdm


def evaluate(model, datas):
    """Return the chunk-level F1 score of `model` on `datas`."""
    correct_preds, total_correct, total_preds = 0., 0., 0.
    for data in tqdm(datas, total=len(datas)):
        ground_truth_id = data['tags']
        chars2 = data['chars']
        if parameters['char_mode'] == 'LSTM':
            # Sort words by length (longest first) and record in d which
            # original position each sorted position came from.
            chars2_sorted = sorted(chars2, key=lambda p: len(p), reverse=True)
            d = {}
            for i, ci in enumerate(chars2):
                for j, cj in enumerate(chars2_sorted):
                    if ci == cj and j not in d and i not in d.values():
                        d[j] = i
                        break
            chars2_length = [len(c) for c in chars2_sorted]
            char_maxl = max(chars2_length)
            # Zero-pad every word to the length of the longest word.
            chars2_mask = np.zeros(
                (len(chars2_sorted), char_maxl), dtype='int')
            for i, c in enumerate(chars2_sorted):
                chars2_mask[i, :chars2_length[i]] = c
            chars2_mask = Variable(torch.LongTensor(chars2_mask))

        if parameters['char_mode'] == 'CNN':
            # The CNN needs no sorting, only zero-padding.
            d = {}
            chars2_length = [len(c) for c in chars2]
            char_maxl = max(chars2_length)
            chars2_mask = np.zeros(
                (len(chars2_length), char_maxl), dtype='int')
            for i, c in enumerate(chars2):
                chars2_mask[i, :chars2_length[i]] = c
            chars2_mask = Variable(torch.LongTensor(chars2_mask))

        dwords = Variable(torch.LongTensor(data['words']))
        if use_gpu:
            val, out = model(
                dwords.cuda(), chars2_mask.cuda(), chars2_length, d)
        else:
            val, out = model(dwords, chars2_mask, chars2_length, d)
        predicted_id = out
        # Compare entity chunks rather than individual token tags.
        lab_chunks = set(get_chunks(ground_truth_id, tag_to_id))
        lab_pred_chunks = set(get_chunks(predicted_id,
                                         tag_to_id))

        correct_preds += len(lab_chunks & lab_pred_chunks)
        total_preds += len(lab_pred_chunks)
        total_correct += len(lab_chunks)

    # Precision, recall, and F1 over all chunks in the dataset.
    p = correct_preds / total_preds if correct_preds > 0 else 0
    r = correct_preds / total_correct if correct_preds > 0 else 0
    F = 2 * p * r / (p + r) if correct_preds > 0 else 0
    return F
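
The score returned by `evaluate` is a chunk-level F1: a predicted entity counts as correct only if its type and both boundaries match a ground-truth entity. The sketch below works through the arithmetic on a toy example. The helper name `chunk_f1` and the `(type, start, end)` tuple layout are assumptions for illustration; the actual format returned by `get_chunks` is not shown in this notebook.

```python
def chunk_f1(gold_chunks, predicted_chunks):
    """Return (precision, recall, F1) over two collections of chunks."""
    gold, predicted = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical chunks in the form (entity type, start index, end index).
gold = [("ORG", 0, 1), ("MISC", 2, 3), ("MISC", 6, 7)]
predicted = [("ORG", 0, 1), ("MISC", 2, 3), ("PER", 6, 7), ("LOC", 8, 9)]

precision, recall, f1 = chunk_f1(gold, predicted)
print(precision, recall, f1)
```

Two of the four predicted chunks are correct, so precision is 2/4 and recall is 2/3, giving F1 = 4/7 ≈ 0.571. The third prediction has the right boundaries but the wrong type, so it does not count.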

In [15]:
train_data[0]['chars']

[[31, 48],
 [6, 0, 52, 0, 12, 2, 7],
 [42, 0, 6, 14, 1, 3],
 [12, 1, 9, 9],
 [2, 5],
 [21, 5, 19, 12, 5, 2, 2],
 [36, 6, 4, 2, 4, 7, 11],
 [9, 1, 14, 21],
 [18]]

In [16]:
f_score = evaluate(model, test_data)
print(f'Original Model Test F1-Score: {f_score}')

100%|██████████| 3453/3453 [03:46<00:00, 15.23it/s]

Original Model Test F1-Score: 0.8401554170055119





We load the trained model provided at https://github.com/TheAnig/NER-LSTM-CNN-Pytorch/raw/master/trained-model-cpu and evaluate its performance on the test set. The test F1-score is:
> trained model: __0.84__

## Question one (iii)

Either a CNN or an LSTM can be used to perform character-level encoding.
In the provided code, the CNN is declared as
```
char_cnn = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))
```
The LSTM is defined as
```
char_lstm = nn.LSTM(char_embedding_dim, char_lstm_dim, num_layers=1, bidirectional=True)
init_lstm(char_lstm)
```
Regardless of which character-level encoder is used, the extracted character-level representation is concatenated with the word embeddings and fed into a bidirectional LSTM. The input dimension of this word-level LSTM, however, depends on the encoder. With the CNN, the input dimension is word_embedding_dim + out_channels. With the LSTM, it is word_embedding_dim + char_lstm_dim * 2, because the bidirectional character LSTM contributes one hidden state per direction.
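
The dimension bookkeeping can be checked with a few lines of arithmetic. In the sketch below, `word_embedding_dim = 100` comes from `parameters['word_dim']`, and `out_channels = 25` is inferred from the `LSTM(125, 200, bidirectional=True)` printed later in this notebook for the CNN mode. The value `char_lstm_dim = 25` is a hypothetical choice used only for illustration, and `conv_output_length` is an illustrative helper implementing the standard convolution output-length formula.

```python
def conv_output_length(length, kernel, padding, stride=1):
    """Output length of a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


word_embedding_dim = 100   # parameters['word_dim']
out_channels = 25          # inferred from LSTM(125, 200, ...)
char_lstm_dim = 25         # hypothetical value for illustration

# Input size of the word-level LSTM for each character encoder.
cnn_mode_input_dim = word_embedding_dim + out_channels
lstm_mode_input_dim = word_embedding_dim + 2 * char_lstm_dim
print(cnn_mode_input_dim, lstm_mode_input_dim)  # 125 150

# A 7-character word convolved with kernel height 3 and padding 2
# yields 9 positions, which max-pooling then reduces to a single
# vector of out_channels values per word.
print(conv_output_length(7, kernel=3, padding=2))  # 9
```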

## Question one (iv)
As mentioned in the previous section, the word embeddings, concatenated with the character-level embeddings, are fed into an LSTM.
In this section, we replace this LSTM with a CNN.
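
One possible shape-compatible replacement is sketched below. It is a minimal sketch under stated assumptions, not the final model: the class name `CNNWordEncoder`, the kernel size of 3, and the ReLU activation are all illustrative choices. The input size 125 and output size 400 are taken from the word-level BiLSTM in this notebook (`LSTM(125, 200, bidirectional=True)`), so that the layers after the encoder would receive tensors of the same shape.

```python
import torch
from torch import nn


class CNNWordEncoder(nn.Module):
    """Hypothetical single-layer CNN standing in for the word-level BiLSTM."""

    def __init__(self, input_dim=125, output_dim=400, kernel_size=3):
        super().__init__()
        # Padding of kernel_size // 2 keeps one output vector per word
        # (valid for odd kernel sizes).
        self.conv = nn.Conv1d(input_dim, output_dim, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, embeds):
        # embeds: (seq_len, 1, input_dim), the layout the BiLSTM receives.
        x = embeds.permute(1, 2, 0)        # (1, input_dim, seq_len)
        x = torch.relu(self.conv(x))       # (1, output_dim, seq_len)
        return x.permute(2, 0, 1)          # (seq_len, 1, output_dim)


encoder = CNNWordEncoder()
with torch.no_grad():
    out = encoder(torch.randn(10, 1, 125))
print(out.shape)  # torch.Size([10, 1, 400])
```

Unlike the LSTM, this encoder sees only a fixed window of neighbouring words at each position, so the kernel size (or the number of stacked layers) controls how much context each word representation captures.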

In [16]:
class OneCNN_WordEncoderModel(BiLSTM_CRF):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        print(self.lstm)

In [17]:
model = OneCNN_WordEncoderModel(vocab_size=len(word_to_id),
                   tag_to_ix=tag_to_id,
                   embedding_dim=parameters['word_dim'],
                   hidden_dim=parameters['word_lstm_dim'],
                   use_gpu=use_gpu,
                   char_to_ix=char_to_id,
                   pre_word_embeds=word_embeds,
                   use_crf=parameters['crf'],
                   char_mode=parameters['char_mode'])

LSTM(125, 200, bidirectional=True)




In [18]:
train_data[0]['words']

[944, 15473, 198, 590, 8, 3848, 207, 6233, 2]

In [37]:
from torch import nn
with torch.no_grad():
    x = torch.randn(10, 1, 125)
    test = nn.Conv1d(in_channels=1, out_channels=20, kernel_size=10)
    pool = nn.MaxPool1d(6)
    y, _ = model.lstm(x)
    test_y = test(x)
    print(y.shape)
    print(test_y.shape)
    test_y = pool(test_y)
    # Flatten channels and pooled positions per row; view(10, ) fails
    # because the tensor holds 10 * 20 * 19 = 3800 elements.
    test_y = test_y.view(10, -1)
    print(test_y.shape)

torch.Size([10, 1, 400])
torch.Size([10, 20, 116])
torch.Size([10, 380])
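
The shapes in this experiment follow from the standard output-length formulas for `nn.Conv1d` and `nn.MaxPool1d` (whose stride defaults to the kernel size). The sketch below reproduces the arithmetic in plain Python; the helper names are illustrative. It shows that the pooled tensor holds 10 × 20 × 19 = 3800 elements, which can be reshaped to `(10, 380)` but not to `(10,)`.

```python
def conv1d_output_length(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def maxpool1d_output_length(length, kernel):
    """Output length of 1D max-pooling with stride equal to the kernel size."""
    return (length - kernel) // kernel + 1


rows, channels = 10, 20
conv_len = conv1d_output_length(125, kernel=10)       # 116
pool_len = maxpool1d_output_length(conv_len, kernel=6)  # 19
total_elements = rows * channels * pool_len           # 3800
features_per_row = channels * pool_len                # 380

print(conv_len, pool_len, total_elements, features_per_row)
# 116 19 3800 380
```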