### Challenge

Build an LSTM model for named entity recognition (NER).  
The dataset is attached to this email, or you can use any other publicly available NER dataset.
You can find more information about the dataset in the link provided. You are free to choose any library.   

### Imports

In [1]:
import torch
from torch import nn

from read_data import read_data
from preproc import *
from dataloader import get_dls, Dataset
from model import Model, model_output
from training_loop import training_loop
from utils import classificationreport

See the files above for the full code; this notebook only walks through the process.

###  Data

I downloaded the data from https://github.com/davidsbatista/NER-datasets/tree/master/CONLL2003

#### Reading data

I use the function read_data from the Python file read_data.py.

In [2]:
training_sentence_tags = read_data('data/train.txt')


Let's take a look at the data.

In [3]:
training_sentence_tags[1]

(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
 ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'])
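
The actual reader lives in read_data.py and is not shown here. As a rough illustration of how such (words, tags) pairs could be produced, here is a minimal sketch. It assumes the usual CoNLL-2003 layout (one token per line, whitespace-separated columns with the NER tag last, blank lines between sentences). The name `read_conll` and the sample lines are hypothetical, not the notebook's own code or data.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into (words, tags) pairs, one pair per sentence."""
    sentences: List[Sentence] = []
    words: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        columns = line.split()
        words.append(columns[0])   # first column: the token
        tags.append(columns[-1])   # last column: the NER tag
    if words:
        # The input may not end with a blank line.
        sentences.append((words, tags))
    return sentences


# Illustrative sample in the assumed four-column format.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
""".splitlines()

parsed = read_conll(sample)
```

With a real file, the same function could be applied to an open file handle, since iterating over a file yields its lines.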

#### Total vocabulary and labels

No preprocessing is done on individual words, because lowercasing might hurt on an NER task (capitalisation is a useful cue for entities).
Lemmatization and stemming could be helpful, but a deep learning model can largely figure these out for itself.

In [4]:
vocabtoidx, labeltoidx = vocab_to_idx(training_sentence_tags)


vocabtoidx is a dictionary that maps each word in the training vocabulary to an index.

In [12]:
# print(vocabtoidx)

In [7]:
vocabtoidx['The']

15

labeltoidx is a dictionary that maps each label to an index.

In [13]:
labeltoidx

{'B-LOC': 6,
 'B-MISC': 3,
 'B-ORG': 2,
 'B-PER': 4,
 'I-LOC': 9,
 'I-MISC': 8,
 'I-ORG': 7,
 'I-PER': 5,
 'O': 1,
 'pad': 0}
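
The real vocab_to_idx is defined in preproc.py. The sketch below shows one way such dictionaries could be built, assuming indices are handed out in order of first appearance and index 0 is reserved for padding. That assumption is consistent with the label indices shown above but is not confirmed by the source. The name `build_indices` and the `<pad>` word token are hypothetical.

```python
def build_indices(sentence_tags, word_pad="<pad>", label_pad="pad"):
    """Assign indices in order of first appearance, reserving 0 for padding."""
    word_to_idx = {word_pad: 0}
    label_to_idx = {label_pad: 0}
    for words, tags in sentence_tags:
        for word in words:
            # setdefault only inserts when the key is new, so indices stay dense.
            word_to_idx.setdefault(word, len(word_to_idx))
        for tag in tags:
            label_to_idx.setdefault(tag, len(label_to_idx))
    return word_to_idx, label_to_idx


# First entry is assumed to be the -DOCSTART- marker, second is the sentence shown earlier.
sentences = [
    (["-DOCSTART-"], ["O"]),
    (["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."],
     ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]),
]
word_to_idx, label_to_idx = build_indices(sentences)
```

Under these assumptions 'EU' maps to 2 and 'B-MISC' to 3, in line with the outputs in this notebook.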

#### Convert sentences and tags to numerical format

The function prepare_sentence_tags converts a sentence and its tags to numerical format using the vocabtoidx and labeltoidx dictionaries.

In [18]:
training_sentence_tags[1]

(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
 ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'])

The above sentence is converted to numerical format by the following function call.

In [17]:
prepare_sentence_tags(training_sentence_tags[1], vocabtoidx, labeltoidx)

([2, 3, 4, 5, 6, 7, 8, 9, 10], [2, 1, 3, 1, 1, 1, 3, 1, 1])
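
Because vocabtoidx only covers the training vocabulary, validation and test sentences can contain unseen words. How prepare_sentence_tags deals with them is in preproc.py and not shown here. The sketch below illustrates one common option, a dedicated unknown-word index. The function `encode_sentence`, the `<unk>` token, and the toy dictionaries are assumptions for illustration only.

```python
def encode_sentence(sentence_tags, word_to_idx, label_to_idx, unk_idx):
    """Map words and tags to indices; unseen words fall back to unk_idx."""
    words, tags = sentence_tags
    word_ids = [word_to_idx.get(word, unk_idx) for word in words]
    tag_ids = [label_to_idx[tag] for tag in tags]  # every tag is expected to be known
    return word_ids, tag_ids


# Toy dictionaries (hypothetical, not the notebook's real indices).
word_to_idx = {"<pad>": 0, "<unk>": 1, "EU": 2, "rejects": 3, "German": 4}
label_to_idx = {"pad": 0, "O": 1, "B-ORG": 2, "B-MISC": 3}

encoded = encode_sentence(
    (["EU", "rejects", "Belgian"], ["B-ORG", "O", "B-MISC"]),
    word_to_idx,
    label_to_idx,
    unk_idx=word_to_idx["<unk>"],
)
```

Here 'Belgian' is not in the toy vocabulary, so it is encoded with the unknown-word index.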

The function prepare_batch converts the entire dataset into numerical format.

In [19]:
train_data=list(prepare_batch(training_sentence_tags, vocabtoidx, labeltoidx))

In [25]:
train_data[:3]  ## the first three sentences in numerical format

[([1], [1]),
 ([2, 3, 4, 5, 6, 7, 8, 9, 10], [2, 1, 3, 1, 1, 1, 3, 1, 1]),
 ([11, 12], [4, 5])]

#### PyTorch Dataset and DataLoader

I intend to build the model using PyTorch.  
To pass the data into a PyTorch model, it is convenient to use a PyTorch DataLoader.

In [31]:
dataset_train = Dataset(train_data)
train_dl = get_dls(dataset_train, bs=16) ## get_dls returns the dataloader

I've passed a collate function to the dataloader so that all sentences in a batch have the same length: each sentence is padded with index 0
to the length of the longest sentence in the batch.
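
The collate function itself is in dataloader.py. A minimal sketch of the padding idea, assuming each dataset item is a (word_ids, tag_ids) pair of Python lists and that 0 is the padding index, could look like this. The name `pad_collate` is hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch, pad_idx=0):
    """Pad every sentence in the batch to the length of the longest one."""
    word_ids = [torch.tensor(words, dtype=torch.long) for words, _ in batch]
    tag_ids = [torch.tensor(tags, dtype=torch.long) for _, tags in batch]
    # pad_sequence stacks variable-length tensors into one (batch, max_len) tensor.
    padded_words = pad_sequence(word_ids, batch_first=True, padding_value=pad_idx)
    padded_tags = pad_sequence(tag_ids, batch_first=True, padding_value=pad_idx)
    return padded_words, padded_tags


# Three short encoded sentences of different lengths.
batch = [([1], [1]), ([2, 3, 4], [2, 1, 3]), ([11, 12], [4, 5])]
words, tags = pad_collate(batch)
```

A function like this can be handed to `torch.utils.data.DataLoader` through its `collate_fn` argument.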

In [27]:
next(iter(train_dl))

(tensor([[  1,   0,   0,  ...,   0,   0,   0],
         [  2,   3,   4,  ...,   0,   0,   0],
         [ 11,  12,   0,  ...,   0,   0,   0],
         ...,
         [ 15, 349, 350,  ...,   0,   0,   0],
         [  1,   0,   0,  ...,   0,   0,   0],
         [355, 356, 357,  ...,   0,   0,   0]]),
 tensor([[1, 0, 0,  ..., 0, 0, 0],
         [2, 1, 3,  ..., 0, 0, 0],
         [4, 5, 0,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 0, 0,  ..., 0, 0, 0],
         [6, 1, 6,  ..., 0, 0, 0]]))

In [32]:
valid_sentence_tags = read_data('data/valid.txt')
dataset_valid = Dataset(list(prepare_batch(valid_sentence_tags, vocabtoidx, labeltoidx)))
valid_dl = get_dls(dataset_valid, 32)


### Model

In [28]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


I experimented with different hyperparameters in another notebook and picked the ones that work best on the validation data.   
These are the default options used in the main file.

In [30]:
net = Model(vocabtoidx, len(labeltoidx), 128, 128, pretrained=False, freeze=False).to(device)


The model is a very simple LSTM model, shown below.

In [37]:
net

Model(
  (emb): Embedding(23627, 128)
  (lstm): LSTM(128, 128, batch_first=True)
  (linear): Linear(in_features=128, out_features=10, bias=True)
)
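
The Model class is defined in model.py. Based only on the printed structure above, a module with the same three layers could be sketched as follows. The class name `SimpleTagger` and the forward pass are assumptions; the real class also takes options such as pretrained and freeze that are not reproduced here.

```python
import torch
from torch import nn


class SimpleTagger(nn.Module):
    """Embedding -> single-layer LSTM -> per-token linear classifier."""

    def __init__(self, vocab_size, n_labels, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, n_labels)

    def forward(self, word_ids):
        embedded = self.emb(word_ids)      # (batch, seq_len, emb_dim)
        outputs, _ = self.lstm(embedded)   # (batch, seq_len, hidden_dim)
        return self.linear(outputs)        # (batch, seq_len, n_labels)


tagger = SimpleTagger(vocab_size=23627, n_labels=10)
logits = tagger(torch.tensor([[2, 3, 4], [11, 12, 0]]))
```

The output holds one score per label for every token position, so the predicted tag for a token is the argmax over the last dimension.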

In [33]:
loss_func = nn.CrossEntropyLoss() ## Since this is a classification task
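
Whether padded positions are excluded from the loss depends on training_loop.py, which is not shown here. One possible way to ignore them, sketched below as an assumption rather than as the approach used in this notebook, is the `ignore_index` argument of `nn.CrossEntropyLoss` together with flattening the (batch, seq_len, n_labels) scores.

```python
import torch
from torch import nn

torch.manual_seed(0)
PAD_IDX = 0  # label index reserved for padding, as in labeltoidx

logits = torch.randn(2, 3, 10)                        # (batch, seq_len, n_labels)
targets = torch.tensor([[2, 1, 3], [4, 5, PAD_IDX]])  # last position is padding

# Positions whose target equals PAD_IDX do not contribute to the loss.
masked_loss_func = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# CrossEntropyLoss expects (N, C) scores and (N,) targets, so flatten the batch.
loss = masked_loss_func(logits.reshape(-1, 10), targets.reshape(-1))
```

With the default mean reduction, the loss is averaged over the five real tokens only.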

In [34]:
opt = torch.optim.SGD(net.parameters(), 1e1) ## optimiser which applies gradients to parameters; the second argument is the learning rate (1e1 = 10.0)

In [35]:
epochs=5 ## early stopping is not implemented yet; left for future work

In [36]:
training_loop(net, opt, loss_func, epochs, train_dl, valid_dl, verbosity=1)

Train loss:0.404 Train accuracy:0.837 Valid loss:0.181 Valid accuracy:0.870
Train loss:0.131 Train accuracy:0.917 Valid loss:0.142 Valid accuracy:0.909
Train loss:0.079 Train accuracy:0.949 Valid loss:0.120 Valid accuracy:0.923
Train loss:0.049 Train accuracy:0.968 Valid loss:0.125 Valid accuracy:0.930
Train loss:0.031 Train accuracy:0.979 Valid loss:0.125 Valid accuracy:0.930
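
The validation loss stops improving after the third epoch, which is the situation early stopping is meant to catch. A minimal patience-based sketch is given below, replayed on the validation losses printed above. The class `EarlyStopper` and the patience of 2 are hypothetical choices, not part of training_loop.py.

```python
class EarlyStopper:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, valid_loss):
        if valid_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Validation losses from the five epochs logged above.
valid_losses = [0.181, 0.142, 0.120, 0.125, 0.125]

stopper = EarlyStopper(patience=2)
stop_epoch = None
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if stopper.should_stop(valid_loss):
        stop_epoch = epoch
        break
```

In a real loop, the model weights from the best epoch would typically be saved whenever the loss improves, so that they can be restored after stopping.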


### Performance on Validation and Test sets

#### Performance on validation set

In [38]:
correct, predicted = model_output(net, valid_dl)
print(classificationreport(correct.cpu(), predicted.cpu(), target_names=list(zip(*labeltoidx.items()))[0][:-1]))

              precision    recall  f1-score   support

           O       0.95      0.99      0.97     42973
       B-ORG       0.75      0.61      0.67      1340
      B-MISC       0.80      0.66      0.73       922
       B-PER       0.74      0.64      0.69      1842
       I-PER       0.88      0.58      0.70      1307
       B-LOC       0.86      0.76      0.81      1837
       I-ORG       0.73      0.55      0.63       750
      I-MISC       0.63      0.58      0.60       346
       I-LOC       0.77      0.67      0.72       257

   micro avg       0.93      0.93      0.93     51574
   macro avg       0.79      0.67      0.72     51574
weighted avg       0.93      0.93      0.93     51574
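
The target_names expression in the call above relies on the iteration order of labeltoidx, with 'pad' assumed to come last. A more order-independent alternative, sketched under the assumption that classificationreport expects names sorted by label index, is to sort by index and drop the padding label explicitly.

```python
labeltoidx = {'B-LOC': 6, 'B-MISC': 3, 'B-ORG': 2, 'B-PER': 4, 'I-LOC': 9,
              'I-MISC': 8, 'I-ORG': 7, 'I-PER': 5, 'O': 1, 'pad': 0}

# Sort labels by their index and leave out the padding label.
target_names = [
    label
    for label, idx in sorted(labeltoidx.items(), key=lambda item: item[1])
    if label != 'pad'
]
```

This yields the labels in the same order as the rows of the report, starting with 'O' and ending with 'I-LOC'.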



#### Performance on Test set

In [39]:
test_sentence_tags = read_data('data/test.txt')
dataset_test = Dataset(list(prepare_batch(test_sentence_tags, vocabtoidx, labeltoidx)))
test_dl = get_dls(dataset_test, 32)


In [40]:
correct, predicted = model_output(net, test_dl)
print(classificationreport(correct.cpu(), predicted.cpu(), target_names=list(zip(*labeltoidx.items()))[0][:-1]))

              precision    recall  f1-score   support

           O       0.94      0.98      0.96     38520
       B-ORG       0.74      0.50      0.59      1660
      B-MISC       0.70      0.56      0.62       701
       B-PER       0.56      0.53      0.55      1616
       I-PER       0.77      0.38      0.51      1156
       B-LOC       0.81      0.73      0.77      1667
       I-ORG       0.70      0.55      0.62       834
      I-MISC       0.46      0.61      0.53       214
       I-LOC       0.68      0.49      0.57       257

   micro avg       0.91      0.91      0.91     46625
   macro avg       0.71      0.59      0.63     46625
weighted avg       0.90      0.91      0.90     46625



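Overall, test performance is close to validation performance (micro-average F1 of 0.91 versus 0.93), although the macro-average F1 drops from 0.72 to 0.63, mostly because recall on the entity classes is lower.

**Note: the performance can change a little with each run.**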

### Future Work

* A lot of optimisations can still be done.
* Early stopping can be implemented.
* Bidirectional LSTMs can be used. Deeper (stacked) LSTMs can be used.
* A BiLSTM-CRF will probably give much better results.
* The hyperparameters are tuned for randomly initialised embeddings. They can be tuned together with pretrained embeddings to get a better score.
* POS tags can be fed through a linear layer as additional input features, which may improve results.
* The trained word embeddings can be visualised.
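
As an illustration of the bidirectional and deeper LSTM ideas above, here is a minimal sketch. The class name `BiLSTMTagger` and its defaults are assumptions and have not been tuned or evaluated on this dataset. The main change compared to a unidirectional model is that the linear layer receives twice the hidden size, because the forward and backward states are concatenated.

```python
import torch
from torch import nn


class BiLSTMTagger(nn.Module):
    """Embedding -> (stacked) bidirectional LSTM -> per-token linear classifier."""

    def __init__(self, vocab_size, n_labels, emb_dim=128, hidden_dim=128, num_layers=1):
        super().__init__()
        # padding_idx=0 keeps the padding embedding fixed at zero.
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated: 2 * hidden_dim.
        self.linear = nn.Linear(2 * hidden_dim, n_labels)

    def forward(self, word_ids):
        embedded = self.emb(word_ids)      # (batch, seq_len, emb_dim)
        outputs, _ = self.lstm(embedded)   # (batch, seq_len, 2 * hidden_dim)
        return self.linear(outputs)        # (batch, seq_len, n_labels)


bilstm = BiLSTMTagger(vocab_size=23627, n_labels=10, num_layers=2)
bilstm_logits = bilstm(torch.tensor([[2, 3, 4], [11, 12, 0]]))
```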