## CS310 Natural Language Processing
## Lab 5 (part 2): Data preparation for Named Entity Recognition (NER)

This lab uses the CoNLL-2003 English named entity recognition (NER) dataset, a collection of Reuters news articles.

The dataset is annotated with four types of named entities: persons (`PER`), locations (`LOC`), organizations (`ORG`), and miscellaneous entities (`MISC`) that do not belong to the previous three types. It is divided into three parts: **training**, **development**, and **testing**.


In [2]:
from pprint import pprint
import torch.nn as nn   
import torch
import torch.nn.functional as F

In [3]:
TRAIN_PATH = 'data/train.txt'
DEV_PATH = 'data/dev.txt'
TEST_PATH = 'data/test.txt'
EMBEDDINGS_PATH = 'data/glove.6B.100d.txt' 
# Download from https://nlp.stanford.edu/data/glove.6B.zip
# The archive includes 50-, 100-, 200-, and 300-dimensional vectors.

The dataset is in the IOB format.
The IOB format is a simple chunk-tagging scheme that assigns one label to each token. For a token inside a named entity, the label combines two parts: the position of the token within the entity, B (beginning) or I (inside), and the entity type, one of the four types mentioned above. A token outside any named entity gets the single label O (outside), which carries no type. For example, in the sentence "I live in New York", the word "New" is labeled "B-LOC", the word "York" is labeled "I-LOC", and the word "I" is labeled "O".
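To make the labeling scheme concrete, here is a minimal sketch that recovers entity spans from a sequence of IOB tags. The helper name `iob_to_spans` and the `example_*` variables are illustrative and not part of the lab code.

```python
def iob_to_spans(tags):
    """Collect (entity_type, start, end) spans from IOB tags; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an 'I-' tag of the same type.
        continues = tag.startswith('I-') and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        # 'B-' always opens a span; a stray 'I-' with no open span opens one too.
        if tag.startswith('B-') or (tag.startswith('I-') and start is None):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans


example_words = ['I', 'live', 'in', 'New', 'York']
example_tags = ['O', 'O', 'O', 'B-LOC', 'I-LOC']
example_spans = iob_to_spans(example_tags)
print([(etype, ' '.join(example_words[s:e])) for etype, s, e in example_spans])
# → [('LOC', 'New York')]
```

Two adjacent entities of the same type stay separate because the second one starts with its own `B-` tag.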

In [4]:
def read_ner_data(path_to_file):
    words = []
    tags = []
    with open(path_to_file, 'r', encoding='utf-8') as file:
        for line in file:
            splitted = line.split()
            if len(splitted) == 0:
                continue
            word = splitted[0]
            if word == '-DOCSTART-':
                continue
            entity = splitted[-1]
            words.append(word)
            tags.append(entity)
        return words, tags

In [5]:
train_words, train_tags = read_ner_data(TRAIN_PATH)
dev_words, dev_tags = read_ner_data(DEV_PATH)
test_words, test_tags = read_ner_data(TEST_PATH)

In [6]:
train_words[:5], train_tags[:5]

(['EU', 'rejects', 'German', 'call', 'to'], ['B-ORG', 'O', 'B-MISC', 'O', 'O'])

In [7]:
pprint(list(zip(train_words[:10], train_tags[:10])))

[('EU', 'B-ORG'),
 ('rejects', 'O'),
 ('German', 'B-MISC'),
 ('call', 'O'),
 ('to', 'O'),
 ('boycott', 'O'),
 ('British', 'B-MISC'),
 ('lamb', 'O'),
 ('.', 'O'),
 ('Peter', 'B-PER')]


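**Note** that
- Sentences are separated by a blank line, and many of them end with the token '.' tagged 'O'. `read_ner_data` skips the blank lines, so it returns flat lists of tokens and tags without sentence boundaries.
- The same padding and packing pipeline as in the previous lab needs to be used for the NER data, too.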
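Padding and packing operate on individual sentences, so the blank-line boundaries have to be kept somewhere. The following is a minimal sketch of a sentence-level reader, assuming the same line layout that `read_ner_data` relies on (word in the first column, tag in the last). The name `read_ner_sentences` and the sample lines are illustrative only.

```python
def read_ner_sentences(lines):
    """Group (word, tag) pairs into sentences, splitting on blank lines."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == '-DOCSTART-':
            # Blank line or document marker: close the sentence in progress.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample lines; a real file would be passed as an open file object.
sample_lines = [
    '-DOCSTART- O', '',
    'EU B-ORG', 'rejects O', 'German B-MISC', 'call O', '. O', '',
    'Peter B-PER', '',
]
sample_sentences = read_ner_sentences(sample_lines)
print(len(sample_sentences), [len(s) for s in sample_sentences])
```

Because the function accepts any iterable of lines, it could be called as `read_ner_sentences(open(TRAIN_PATH, encoding='utf-8'))` on the real data.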

---

### T1. Build vocabularies for both words and labels (tags)

Use *ALL* the data from train, dev, and test sets to build the vocabularies, for word and label (tag), respectively.

In [8]:
words = set(train_words + dev_words + test_words)
tags = set(train_tags + dev_tags + test_tags)

In [9]:
### START YOUR CODE ###
# Sort before enumerating so that the indices are reproducible across runs
# (the iteration order of a set of strings can differ between Python processes).
vocab_words = {word: idx for idx, word in enumerate(sorted(words))}
vocab_tags = {tag: idx for idx, tag in enumerate(sorted(tags))}

In [10]:
print('Word vocabulary size:', len(vocab_words))
print('Tag vocabulary size:', len(vocab_tags))
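Once the vocabularies exist, tokens and tags are mapped to integer indices before they reach the model, and predicted indices are mapped back to tag strings. Here is a minimal sketch on toy data. `build_vocab`, the `toy_*` names, and `id_to_tag` are illustrative and not part of the lab template.

```python
def build_vocab(items):
    """Map each distinct item to an index, in sorted order for reproducibility."""
    return {item: idx for idx, item in enumerate(sorted(set(items)))}


toy_words = ['EU', 'rejects', 'German', 'call', 'to']
toy_tags = ['B-ORG', 'O', 'B-MISC', 'O', 'O']

toy_vocab_words = build_vocab(toy_words)
toy_vocab_tags = build_vocab(toy_tags)

# Encode: strings -> indices.
word_ids = [toy_vocab_words[w] for w in toy_words]
tag_ids = [toy_vocab_tags[t] for t in toy_tags]

# Decode: indices -> strings, using the inverted tag vocabulary.
id_to_tag = {idx: tag for tag, idx in toy_vocab_tags.items()}
decoded_tags = [id_to_tag[i] for i in tag_ids]

print(word_ids)      # → [0, 3, 1, 2, 4]
print(tag_ids)       # → [1, 2, 0, 2, 2]
print(decoded_tags)  # → ['B-ORG', 'O', 'B-MISC', 'O', 'O']
```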

### Model Architecture

In the `__init__` method, initialize `word_embeddings` with a pretrained embedding weight matrix loaded from `glove.6B.100d.txt`.

For some model variants, e.g., a maximum entropy Markov model (MEMM), you also need to initialize `tag_embeddings` with a random weight matrix.

The `forward` method takes the sequences of word indices (and the sequence lengths) as input and returns the log probabilities of the predicted labels (tags).

### Read embeddings

In [11]:
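# Read GloVe embeddings: each line is a word followed by its vector components.
embedding_dict = {}
with open(EMBEDDINGS_PATH, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = torch.tensor([float(val) for val in values[1:]])
        embedding_dict[word] = vector
vocab_size = len(embedding_dict)

embedding_dim = 100 # 100d
# Row i holds the vector of the i-th word in the GloVe file, so this matrix
# is indexed by GloVe order, not by the indices in vocab_words.
embedding_matrix = torch.zeros(vocab_size, embedding_dim)
for i, word in enumerate(embedding_dict):
    embedding_matrix[i] = embedding_dict[word]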
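Since the embedding layer looks up rows by word index, the rows of the weight matrix generally need to line up with the word vocabulary built in T1. Below is a minimal sketch of one way to do that, shown on toy data. The helper `build_embedding_matrix`, the lowercase fallback (the GloVe 6B vocabulary is uncased), and the small random initialization for missing words are assumptions, not part of the lab template.

```python
import torch


def build_embedding_matrix(vocab, embedding_dict, dim):
    """Return a (len(vocab), dim) matrix whose row i belongs to the word with index i."""
    matrix = torch.zeros(len(vocab), dim)
    hits = 0
    for word, idx in vocab.items():
        # Try the exact form first, then the lowercased form.
        vector = embedding_dict.get(word)
        if vector is None:
            vector = embedding_dict.get(word.lower())
        if vector is None:
            # Assumed fallback: small random vector for out-of-GloVe words.
            matrix[idx] = torch.randn(dim) * 0.1
        else:
            matrix[idx] = vector
            hits += 1
    return matrix, hits


torch.manual_seed(0)
toy_vocab = {'EU': 0, 'lamb': 1, 'zzz-unknown': 2}
toy_glove = {'eu': torch.ones(4), 'lamb': torch.full((4,), 2.0)}
toy_matrix, toy_hits = build_embedding_matrix(toy_vocab, toy_glove, dim=4)
print(toy_matrix.shape, toy_hits)
```

With the real data, the call would be `build_embedding_matrix(vocab_words, embedding_dict, embedding_dim)`, and the resulting matrix would have `len(vocab_words)` rows instead of one row per GloVe word.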

In [12]:
class LSTMTagger(nn.Module):
    def __init__(self, *args, **kwargs):
        super(LSTMTagger, self).__init__()
        hidden_size = 128
        # Pretrained GloVe vectors; from_pretrained freezes them by default.
        self.word_embeddings = nn.Embedding.from_pretrained(embedding_matrix)
        self.lstm = nn.LSTM(embedding_dim, hidden_size=hidden_size, batch_first=True, bidirectional=True)
        # A bidirectional LSTM concatenates the forward and backward states,
        # so each token is represented by 2 * hidden_size features.
        self.fc = nn.Linear(2 * hidden_size, len(vocab_tags))

    def forward(self, seq, seq_lens):
        # seq: list of 1-D LongTensors of word indices; seq_lens: their lengths (on CPU).
        padded_seqs = nn.utils.rnn.pad_sequence(seq, batch_first=True)
        padded_embs = self.word_embeddings(padded_seqs)
        packed_embs = nn.utils.rnn.pack_padded_sequence(padded_embs, seq_lens, batch_first=True, enforce_sorted=False)
        out_packed, _ = self.lstm(packed_embs)
        out_unpacked, _ = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)
        logits = self.fc(out_unpacked)
        log_probs = F.log_softmax(logits, dim=-1)
        return log_probs

In [13]:
model = LSTMTagger()

In [14]:
print(model)

LSTMTagger(
  (word_embeddings): Embedding(400000, 100)
  (lstm): LSTM(100, 128, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=256, out_features=9, bias=True)
)
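The tagger returns log probabilities for every position of the padded batch, including padded ones, so the loss should skip the padding. Here is a minimal sketch using `nn.NLLLoss` with `ignore_index`. The random `log_probs` tensor stands in for the model output, and `PAD_TAG = -100` is an assumed padding label chosen to match the default `ignore_index`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_TAG = -100  # assumed padding label; equals NLLLoss's default ignore_index

torch.manual_seed(0)
batch_size, max_len, num_tags = 2, 4, 9

# Stand-in for the tagger output: shape (batch, max_len, num_tags).
log_probs = F.log_softmax(torch.randn(batch_size, max_len, num_tags), dim=-1)

# Gold tag indices for two sentences of lengths 4 and 2 (made-up values).
gold = [torch.tensor([3, 0, 0, 5]), torch.tensor([1, 2])]
padded_gold = nn.utils.rnn.pad_sequence(gold, batch_first=True, padding_value=PAD_TAG)

# Flatten to (batch * max_len, num_tags) and (batch * max_len,) for NLLLoss.
loss_fn = nn.NLLLoss(ignore_index=PAD_TAG)
loss = loss_fn(log_probs.reshape(-1, num_tags), padded_gold.reshape(-1))

print(padded_gold.tolist())  # → [[3, 0, 0, 5], [1, 2, -100, -100]]
print(loss.item())
```

The loss is averaged over the six real tokens only; the two padded positions contribute nothing.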
