# Assignment 3 - Named Entity Recognition (NER)

Welcome to the third programming assignment of Course 3. In this assignment, you will learn to build more complicated models with Trax. By completing this assignment, you will be able to: 

- Design the architecture of a neural network, train it, and test it. 
- Process features and represent them
- Understand word padding
- Implement LSTMs
- Test with your own sentence

## Outline
- [Introduction](#0)
- [Part 1:  Exploring the data](#1)
    - [1.1  Importing the Data](#1.1)
    - [1.2  Data generator](#1.2)
        - [Exercise 01](#ex01)
- [Part 2:  Building the model](#2)
    - [Exercise 02](#ex02)
- [Part 3:  Train the Model ](#3)
    - [Exercise 03](#ex03)
- [Part 4:  Compute Accuracy](#4)
    - [Exercise 04](#ex04)
- [Part 5:  Testing with your own sentence](#5)


<a name="0"></a>
# Introduction

We start by defining named entity recognition (NER). NER is a subtask of information extraction that locates and classifies named entities in a text. The named entities could be organizations, persons, locations, times, etc. 

For example:

<img src = 'ner.png' width="width" height="height" style="width:600px;height:150px;"/>

Is labeled as follows: 

- French: geopolitical entity
- Morocco: geographic entity 
- Christmas: time indicator

Every token labeled with an `O` is not considered to be a named entity. In this assignment, you will train a named entity recognition system that trains in a few seconds (on a GPU) and reaches around 75% accuracy. You will then load a version of the same model that was trained for a longer period of time and evaluate it, which should give about 96% accuracy. Finally, you will be able to test your named entity recognition system with your own sentence.

In [1]:
#!pip -q install trax==1.3.1

import trax 
from trax import layers as tl
import os 
import numpy as np
import pandas as pd


from utils import get_params, get_vocab
import random as rnd

# set random seeds to make this notebook easier to replicate
trax.supervised.trainer_lib.init_random_number_generators(33)

INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0 


DeviceArray([ 0, 33], dtype=uint32)

In [2]:
import numpy as np
import pandas as pd
import os

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

#from utils import get_params, get_vocab
import random as rnd

<a name="1"></a>
# Part 1:  Exploring the data

We will be using a dataset from Kaggle, which we will preprocess for you. The original data consists of four columns: the sentence number, the word, the part of speech of the word, and the tag. A few tags you might expect to see are: 

* geo: geographical entity
* org: organization
* per: person 
* gpe: geopolitical entity
* tim: time indicator
* art: artifact
* eve: event
* nat: natural phenomenon
* O: filler word


In [3]:
# display original kaggle data
data = pd.read_csv("ner_dataset.csv", encoding = "ISO-8859-1") 
train_sents = open('data/small/train/sentences.txt', 'r').readline()
train_labels = open('data/small/train/labels.txt', 'r').readline()
print('SENTENCE:', train_sents)
print('SENTENCE LABEL:', train_labels)
print('ORIGINAL DATA:\n', data.head(5))
#del(data, train_sents, train_labels)

SENTENCE: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

SENTENCE LABEL: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O

ORIGINAL DATA:
     Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O


In [4]:
# forward-fill the sentence number onto every row that belongs to that sentence
data = pd.read_csv("ner_dataset.csv", encoding = "ISO-8859-1").ffill()
data.tail(10)

Unnamed: 0,Sentence #,Word,POS,Tag
1048565,Sentence: 47958,impact,NN,O
1048566,Sentence: 47958,.,.,O
1048567,Sentence: 47959,Indian,JJ,B-gpe
1048568,Sentence: 47959,forces,NNS,O
1048569,Sentence: 47959,said,VBD,O
1048570,Sentence: 47959,they,PRP,O
1048571,Sentence: 47959,responded,VBD,O
1048572,Sentence: 47959,to,TO,O
1048573,Sentence: 47959,the,DT,O
1048574,Sentence: 47959,attack,NN,O


<a name="1.1"></a>
## 1.1  Importing the Data

In this part, we will import the preprocessed data and explore it.

```
def get_vocab(vocab_path, tags_path):
    vocab = {}
    with open(vocab_path) as f:
        for i, l in enumerate(f.read().splitlines()):
            vocab[l] = i  # word indices start at 0
    # the pad token takes the next free index
    vocab['<PAD>'] = len(vocab) # 35180
    # loading tags (we require this to map tags to their indices)
    tag_map = {}
    with open(tags_path) as f:
        for i, t in enumerate(f.read().splitlines()):
            tag_map[t] = i 
    
    return vocab, tag_map
```

---

```
def get_params(vocab, tag_map, sentences_file, labels_file):
    sentences = []
    labels = []

    with open(sentences_file) as f:
        for sentence in f.read().splitlines():
            # replace each token by its index if it is in vocab
            # else use index of UNK_WORD
            s = [vocab[token] if token in vocab 
                 else vocab['UNK']
                 for token in sentence.split(' ')]
            sentences.append(s)

    with open(labels_file) as f:
        for sentence in f.read().splitlines():
            # replace each label by its index
            l = [tag_map[label] for label in sentence.split(' ')]
            labels.append(l) 
    return sentences, labels, len(sentences)
```

**Encoding the words**  
The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our sentences into integers so they can be passed into the network.

>Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**.

In [5]:
from collections import Counter

In [6]:
## Build a dictionary that maps words to integers
words_count = Counter(data["Word"])
vocab_to_int = {w: (i+1) for i, w in enumerate(words_count)}
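# caution: ids start at 1, so len(vocab_to_int) is already the id of the last word;
# len(vocab_to_int) + 1 would give 'UNK' an id of its own
vocab_to_int['UNK'] = len(vocab_to_int)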
#vocab_to_int['<PAD>'] = len(vocab_to_int)

In [7]:
dict(list(words_count.items())[:10])

{'Thousands': 114,
 'of': 26354,
 'demonstrators': 110,
 'have': 5485,
 'marched': 65,
 'through': 515,
 'London': 261,
 'to': 23213,
 'protest': 237,
 'the': 52573}

In [8]:
dict(list(vocab_to_int.items())[:10]), dict(list(vocab_to_int.items())[-5:])

({'Thousands': 1,
  'of': 2,
  'demonstrators': 3,
  'have': 4,
  'marched': 5,
  'through': 6,
  'London': 7,
  'to': 8,
  'protest': 9,
  'the': 10},
 {'hardliner': 35175,
  'indicative': 35176,
  '3700': 35177,
  'Bermel': 35178,
  'UNK': 35178})
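
Notice in the output above that `'Bermel'` and `'UNK'` share the id 35178: because the ids start at 1, `len(vocab_to_int)` equals the largest id already in use. The sketch below reproduces the collision on a hypothetical three-word vocabulary (the names `toy_words` and `toy_vocab` are made up for illustration) and shows one possible way to avoid it.

```python
# Hypothetical toy vocabulary; ids start at 1 so that 0 stays free for padding.
toy_words = ["London", "Iraq", "British"]
toy_vocab = {w: i + 1 for i, w in enumerate(toy_words)}

# len(toy_vocab) is 3, which is already the id of "British".
colliding_id = len(toy_vocab)
print(colliding_id == toy_vocab["British"])

# Using len(...) + 1 gives the unknown-word token an id of its own.
toy_vocab["UNK"] = len(toy_vocab) + 1
print(toy_vocab)
```

If a fix like this is adopted, the largest id grows by one, so any layer sized from the vocabulary has to account for it.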

In [9]:
# stats about vocabulary
print('Unique words: ', len(vocab_to_int))  # number of keys, including 'UNK'
print()

Unique words:  35179



In [10]:
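## Build a dictionary that maps tags to integers
## (a set has no fixed order, so these ids can differ between runs;
##  iterating over sorted(set(...)) instead would make them stable)
tag_to_int = {t: (i+1) for i, t in enumerate(set(data["Tag"]))}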
tag_to_int

{'I-nat': 1,
 'O': 2,
 'I-eve': 3,
 'I-per': 4,
 'I-art': 5,
 'B-art': 6,
 'I-tim': 7,
 'B-per': 8,
 'B-nat': 9,
 'B-gpe': 10,
 'B-geo': 11,
 'B-org': 12,
 'I-geo': 13,
 'I-gpe': 14,
 'B-eve': 15,
 'B-tim': 16,
 'I-org': 17}

Convert the sentences and their tags to integers and store them in new lists called `sents_int` and `labels_int`.

In [11]:
def get_params2(vocab_to_int, tag_to_int, df):
    sents_int = []
    labels_int = [] 
    
    sents_w = df.groupby('Sentence #')['Word'].apply(list).values
    for sent in sents_w:
        sent_int = [vocab_to_int[w] if w in vocab_to_int else vocab_to_int['UNK'] for w in sent]
        sents_int.append(sent_int)
    
    sents_tags = df.groupby('Sentence #')['Tag'].apply(list).values
    for sent_tags in sents_tags:
        label_int = [tag_to_int[t] for t in sent_tags]
        labels_int.append(label_int)
    
    return sents_int, labels_int, len(sents_int)

In [12]:
sents_int, labels_int, sent_size = get_params2(vocab_to_int, tag_to_int, data)

Split the data into training, validation, and test sets with a 70:15:15 ratio.

In [33]:
split_frac1, split_frac2 = 0.7, 0.15

## split data into training, validation, and test data (sentences, labels, and len)
split_idx = int(len(sents_int)*split_frac1)
train_sent, train_label = sents_int[:split_idx], labels_int[:split_idx]

split_idx2 = split_idx + int(len(sents_int)*split_frac2)
valid_sent, valid_label = sents_int[split_idx:split_idx2], labels_int[split_idx:split_idx2]
test_sent, test_label = sents_int[split_idx2:], labels_int[split_idx2:]

train_len, valid_len, test_len = len(train_sent), len(valid_sent), len(test_sent)

In [34]:
train_len, valid_len, test_len

(33571, 7193, 7195)

In [35]:
train_sent[0]

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 10,
 16,
 2,
 17,
 18,
 19,
 20,
 21,
 22]

---

In [36]:
vocab, tag_map = get_vocab('data/large/words.txt', 'data/large/tags.txt')
t_sentences, t_labels, t_size = get_params(vocab, tag_map, 'data/large/train/sentences.txt', 'data/large/train/labels.txt')
v_sentences, v_labels, v_size = get_params(vocab, tag_map, 'data/large/val/sentences.txt', 'data/large/val/labels.txt')
test_sentences, test_labels, test_size = get_params(vocab, tag_map, 'data/large/test/sentences.txt', 'data/large/test/labels.txt')

In [37]:
t_size, v_size, test_size

(33570, 7194, 7194)

In [None]:
t_size + v_size + test_size

In [None]:
tag_map

`vocab` is a dictionary that translates a word string to a unique number. Given a sentence, you can represent it as an array of numbers translating with this dictionary. The dictionary contains a `<PAD>` token. 

When training an LSTM using batches, all your input sentences must be the same size. To accomplish this, you set the length of your sentences to a certain number and add the generic `<PAD>` token to fill all the empty spaces. 
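
As a rough illustration of this padding step, the sketch below right-pads a few made-up index lists to the length of the longest one. The helper name `pad_batch` is hypothetical, and the pad id `35180` is simply the `<PAD>` index that `get_vocab` produces in this notebook.

```python
def pad_batch(sentences, pad_id):
    """Right-pad every sentence (a list of ints) to the longest length in the batch."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_id] * (max_len - len(s)) for s in sentences]

# Made-up sentences of different lengths, already converted to word indices.
batch = [[4, 8, 15], [16, 23], [42]]
padded = pad_batch(batch, pad_id=35180)
for row in padded:
    print(row)
```

Every row now has the same length, so the batch can be stored as one rectangular array.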

In [None]:
# vocab translates from a word to a unique number
print('vocab["the"]:', vocab["the"])
# Pad token
print('padded token:', vocab['<PAD>'])

In [None]:
###
# vocab translates from a word to a unique number
print('vocab["the"]:', vocab_to_int["the"])

The `tag_map` maps each possible tag a word can have to an index. Run the cell below to see the possible classes you will be predicting. The prefixes in the tags mean:
* I: Token is inside an entity.
* B: Token begins an entity.

In [None]:
print(tag_map)

In [14]:
###
print(tag_to_int)

{'I-nat': 1, 'O': 2, 'I-eve': 3, 'I-per': 4, 'I-art': 5, 'B-art': 6, 'I-tim': 7, 'B-per': 8, 'B-nat': 9, 'B-gpe': 10, 'B-geo': 11, 'B-org': 12, 'I-geo': 13, 'I-gpe': 14, 'B-eve': 15, 'B-tim': 16, 'I-org': 17}


So the coding scheme that tags the entities is a minimal one where B- indicates the first token in a multi-token entity, and I- indicates one in the middle of a multi-token entity. If you had the sentence 

**"Sharon flew to Miami on Friday"**

the outputs would look like:

```
Sharon B-per
flew   O
to     O
Miami  B-geo
on     O
Friday B-tim
```

your tags would reflect three tokens beginning with B-, since there are no multi-token entities in the sequence. But if you added Sharon's last name to the sentence: 

**"Sharon Floyd flew to Miami on Friday"**

```
Sharon B-per
Floyd  I-per
flew   O
to     O
Miami  B-geo
on     O
Friday B-tim
```

then your tags would change to show first "Sharon" as B-per, and "Floyd" as I-per, where I- indicates an inner token in a multi-token sequence.
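
To make the scheme concrete, here is a minimal sketch that groups `B-`/`I-` tagged tokens back into whole entities. The function name `bio_to_spans` is hypothetical and not part of the assignment; it also assumes that a stray `I-` tag without a matching open entity should simply be ignored.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close the previous one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes whatever entity is open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = "Sharon Floyd flew to Miami on Friday".split()
tags = ["B-per", "I-per", "O", "O", "B-geo", "O", "B-tim"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Here "Sharon Floyd" comes back as a single `per` entity because the `I-per` tag attaches "Floyd" to the entity opened by `B-per`.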

In [38]:
# Exploring information about the data
print('The number of outputs is tag_map', len(tag_map))
# The number of vocabulary tokens (including <PAD>)
g_vocab_size = len(vocab)
print(f"Num of vocabulary words: {g_vocab_size}")
print('The vocab size is', len(vocab))
print('The training size is', t_size)
print('The validation size is', v_size)
print('An example of the first sentence is', t_sentences[0])
print('An example of its corresponding label is', t_labels[0])

The number of outputs is tag_map 17
Num of vocabulary words: 35181
The vocab size is 35181
The training size is 33570
The validation size is 7194
An example of the first sentence is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 9, 15, 1, 16, 17, 18, 19, 20, 21]
An example of its corresponding label is [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]


In [40]:
###
# Exploring information about the data
print('The number of outputs is tag_map', len(tag_to_int))
# The number of vocabulary tokens (including <PAD>)
g_vocab_size = len(vocab_to_int)
#print(f"Num of vocabulary words: {g_vocab_size}")
print(f'The vocab size is {g_vocab_size}')
print(f'The training size is {train_len}')
print(f'The validation size is {valid_len}')
print(f'The test size is {test_len}')
print(f'An example of the first sentence is {train_sent[0]}')
print(f'An example of its corresponding label is {train_label[0]}')

The number of outputs is tag_map 17
The vocab size is 35179
The training size is 33571
The validation size is 7193
The test size is 7195
An example of the first sentence is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 10, 16, 2, 17, 18, 19, 20, 21, 22]
An example of its corresponding label is [2, 2, 2, 2, 2, 2, 11, 2, 2, 2, 2, 2, 11, 2, 2, 2, 2, 2, 10, 2, 2, 2, 2, 2]


So you can see that we have already encoded each sentence as a list of integers, one per word. We also have 17 possible classes, as shown in the tag map.


<a name="1.2"></a>
## 1.2  Data generator

In Python, a generator is a function that behaves like an iterator: each time it is asked for a value, it returns the next item and pauses until the next request. Here is a [link](https://wiki.python.org/moin/Generators) to review Python generators. 
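
As a quick refresher, the following minimal sketch shows a generator pausing at `yield` and resuming on the next request. The function `count_up` is a made-up example, unrelated to the assignment data.

```python
def count_up(limit):
    """Yield 0, 1, ..., limit - 1, pausing after each value."""
    n = 0
    while n < limit:
        yield n   # execution pauses here until the next value is requested
        n += 1    # local variables keep their values between requests

gen = count_up(3)
first = next(gen)
second = next(gen)
rest = list(gen)   # consumes whatever is left
print(first, second, rest)
```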

In many AI applications it is very useful to have a data generator. You will now implement a data generator for our NER application.

<a name="ex01"></a>
### Exercise 01

**Instructions:** Implement a data generator function that takes in `batch_size, x, y, pad, shuffle`, where `x` is a large list of sentences, `y` is a list of the tags associated with those sentences, and `pad` is a pad value. Return a subset of those inputs in a tuple of two arrays `(X,Y)`. Each is an array of dimension (`batch_size, max_len`), where `max_len` is the length of the longest sentence *in that batch*. You will pad the X and Y examples with the pad argument. If `shuffle=True`, the data will be traversed in a random order.

**Details:**

This code has an outer loop  
```
while True:  
...  
yield((X,Y))  
```

This loop runs continuously in the fashion of generators, pausing when yielding the next values. We will generate one batch of `batch_size` examples on each pass of this loop.    

It has two inner loops. 
1. The first stores in temporary lists the data samples to be included in the next batch, and finds the maximum length of the sentences contained in it. By adjusting the length to include only the size of the longest sentence in each batch, overall computation is reduced. 

2. The second loop moves those inputs from the temporary lists into NumPy arrays pre-filled with pad values.

There are three slightly out of the ordinary features. 
1. The first is the use of the NumPy `full` function to fill the NumPy arrays with a pad value. See [full function documentation](https://numpy.org/doc/1.18/reference/generated/numpy.full.html).

2. The second is tracking the current location in the incoming lists of sentences. Generator variables hold their values between invocations, so we create an `index` variable, initialize it to zero, and increment it by one for each sample included in a batch. However, we do not use the `index` to access the positions of the list of sentences directly. Instead, we use it to select one index from a list of indexes. In this way, we can change the order in which we traverse our original list while leaving the original list untouched.  

3. The third also relates to wrapping. Because `batch_size` and the length of the input lists are not aligned, gathering a batch_size group of inputs may involve wrapping back to the beginning of the input loop. In our approach, it is just enough to reset the `index` to 0. We can re-shuffle the list of indexes to produce different batches each time.
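
The index bookkeeping described in the last two points can be sketched in isolation. The generator below, with the hypothetical name `index_stream`, yields only positions and leaves out the padding; the `seed` parameter is an addition that makes the example repeatable.

```python
import random

def index_stream(num_lines, shuffle=False, seed=0):
    """Yield positions into a dataset forever, wrapping (and optionally reshuffling) at the end."""
    rng = random.Random(seed)              # local RNG so the example is repeatable
    lines_index = list(range(num_lines))   # the list of indexes that gets shuffled
    if shuffle:
        rng.shuffle(lines_index)
    index = 0
    while True:
        if index >= num_lines:
            # wrap around, and reshuffle so the next pass uses a new order
            index = 0
            if shuffle:
                rng.shuffle(lines_index)
        yield lines_index[index]
        index += 1

stream = index_stream(num_lines=8)
batch1 = [next(stream) for _ in range(5)]
batch2 = [next(stream) for _ in range(5)]
print(batch1)
print(batch2)
```

With 8 lines and batches of 5, the second batch takes the last three positions and then wraps back to positions 0 and 1.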

Batching

In [42]:
###
class EntityDataset:
    def __init__(self, sentences, labels, pad):
        # sentences: [[61, 249, 19, ..., 722, 21], [...]]
        # labels: [[15, 15, 15, ..., 15, 15], [...]]
        # pad - an integer representing a pad character
        self.sentences = sentences
        self.labels = labels
        self.max_len = max([len(i) for i in self.sentences])
        self.pad = pad
        
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self, item):
        sentence = self.sentences[item] # get only one sentence
        labels = self.labels[item]
        
        # 1 marks a real token, 0 marks padding
        mask = [1] * len(sentence)
        # all tokens belong to a single sentence, so every segment id is 0
        token_type_ids = [0] * len(sentence)
        
        # right-pad to max_len, the length of the longest sentence in the dataset
        padding_len = self.max_len - len(sentence)
    
        sentence = sentence + ([self.pad] * padding_len)
        # mask and segment ids are padded with 0 regardless of the pad id
        mask = mask + ([0] * padding_len)
        token_type_ids = token_type_ids + ([0] * padding_len)
        labels = labels + ([self.pad] * padding_len)
        
        return {
            "input_ids": torch.tensor(sentence, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "attention_mask": torch.tensor(mask, dtype=torch.long),  
            "labels": torch.tensor(labels, dtype=torch.long)
        }

In [46]:
# Create the datasets
# (EntityDataset plays a role similar to TensorDataset)
train_dataset = EntityDataset(
    sentences=train_sent,
    labels=train_label,
    pad=0
)

valid_dataset = EntityDataset(
    sentences=valid_sent,
    labels=valid_label,
    pad=0
)
test_dataset = EntityDataset(
    sentences=test_sent,
    labels=test_label,
    pad=0
)

In [47]:
train_dataset[0]

{'input_ids': tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 10, 16,  2,
         17, 18, 19, 20, 21, 22,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [48]:
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 8
TEST_BATCH_SIZE = 8

In [49]:
# num_workers as a positive integer will turn on multi-process data loading 
# with the specified number of loader worker processes.
train_dataloader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE,
                              shuffle=True, num_workers=4)
valid_dataloader = DataLoader(valid_dataset, batch_size=VALID_BATCH_SIZE, 
                              shuffle=True, num_workers=1)
test_dataloader = DataLoader(test_dataset, batch_size=TEST_BATCH_SIZE,  
                             shuffle=True, num_workers=1)

Take a look at a batch of data:

In [50]:
one_batch = next(iter(train_dataloader))
one_batch

{'input_ids': tensor([[ 3541,    19,    10,  ...,     0,     0,     0],
         [ 9154, 10478,   127,  ...,     0,     0,     0],
         [   62, 32058, 32059,  ...,     0,     0,     0],
         ...,
         [   17,   126,   127,  ...,     0,     0,     0],
         [ 3946,   443,  6174,  ...,     0,     0,     0],
         [  817, 29373,  6570,  ...,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'labels': tensor([[ 2,  2,  2,  ...,  0,  0,  0],
         [ 2,  2,  2,  ...,  0,  0,  0],
         [ 2, 12, 17,  ...,  0,  0,  

In [51]:
print(one_batch['input_ids'], '\n')
print(f"Batch size: {len(one_batch['input_ids'])}")

tensor([[ 3541,    19,    10,  ...,     0,     0,     0],
        [ 9154, 10478,   127,  ...,     0,     0,     0],
        [   62, 32058, 32059,  ...,     0,     0,     0],
        ...,
        [   17,   126,   127,  ...,     0,     0,     0],
        [ 3946,   443,  6174,  ...,     0,     0,     0],
        [  817, 29373,  6570,  ...,     0,     0,     0]]) 

Batch size: 32


In [31]:
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: data_generator
def data_generator(batch_size, x, y, pad, shuffle=False, verbose=False):
    '''
      Input: 
        batch_size - integer describing the batch size
        x - list containing sentences where words are represented as integers
        y - list containing tags associated with the sentences
        shuffle - Shuffle the data order
        pad - an integer representing a pad character
        verbose - Print information during runtime
      Output:
        a tuple containing 2 elements:
        X - np.ndarray of dim (batch_size, max_len) of padded sentences
        Y - np.ndarray of dim (batch_size, max_len) of tags associated with the sentences in X
    '''
    
    # count the number of lines in data_lines
    num_lines = len(x)
    
    # create an array with the indexes of data_lines that can be shuffled
    lines_index = [*range(num_lines)]
    
    # shuffle the indexes if shuffle is set to True
    if shuffle:
        rnd.shuffle(lines_index)
    
    index = 0 # tracks current location in x, y
    while True:
        buffer_x = [0] * batch_size # Temporary list to store the raw x data for this batch
        buffer_y = [0] * batch_size # Temporary list to store the raw y data for this batch
                
        ### START CODE HERE (Replace instances of 'None' with your code) ###
        
        # Copy into the temporary buffers the sentences in x[index : index + batch_size] 
        # along with their corresponding labels y[index : index + batch_size]
        # Find maximum length of sentences in x[index : index + batch_size] for this batch. 
        # Reset the index if we reach the end of the data set, and shuffle the indexes if needed.
        max_len = 0
        for i in range(batch_size):
             # if the index is greater than or equal to the number of lines in x
            if index >= num_lines:
                # then reset the index to 0
                index = 0
                # re-shuffle the indexes if shuffle is set to True
                if shuffle:
                    rnd.shuffle(lines_index)
            
            # The current position is obtained using `lines_index[index]`
            # Store the x value at the current position into the buffer_x
            buffer_x[i] = x[lines_index[index]]
            
            # Store the y value at the current position into the buffer_y
            buffer_y[i] = y[lines_index[index]]
            
            lenx = len(buffer_x[i])    #length of current x[]
            if lenx > max_len:
                max_len = lenx         #max_len tracks longest x[]
            
            # increment index by one
            index += 1


        # create X,Y, NumPy arrays of size (batch_size, max_len) 'full' of pad value
        X = np.full((batch_size, max_len), pad)
        Y = np.full((batch_size, max_len), pad)

        # copy values from lists to NumPy arrays. Use the buffered values
        for i in range(batch_size):
            # get the example (sentence as a tensor)
            # in `buffer_x` at the `i` index
            x_i = buffer_x[i]
            
            # similarly, get the example's labels
            # in `buffer_y` at the `i` index
            y_i = buffer_y[i]
            
            # Walk through each word in x_i
            for j in range(len(x_i)):
                # store the word in x_i at position j into X
                X[i, j] = x_i[j]
                
                # store the label in y_i at position j into Y
                Y[i, j] = y_i[j]

    ### END CODE HERE ###
        if verbose: print("index=", index)
        yield((X,Y))

In [41]:
batch_size = 5
mini_sentences = t_sentences[0: 8]
mini_labels = t_labels[0: 8]
dg = data_generator(batch_size, mini_sentences, mini_labels, vocab["<PAD>"], shuffle=False, verbose=True)
X1, Y1 = next(dg)
X2, Y2 = next(dg)
print(Y1.shape, X1.shape, Y2.shape, X2.shape)
print(X1[0][:], "\n", Y1[0][:])

index= 5
index= 2
(5, 30) (5, 30) (5, 30) (5, 30)
[    0     1     2     3     4     5     6     7     8     9    10    11
    12    13    14     9    15     1    16    17    18    19    20    21
 35180 35180 35180 35180 35180 35180] 
 [    0     0     0     0     0     0     1     0     0     0     0     0
     1     0     0     0     0     0     2     0     0     0     0     0
 35180 35180 35180 35180 35180 35180]


**Expected output:**   
```
index= 5
index= 2
(5, 30) (5, 30) (5, 30) (5, 30)
[    0     1     2     3     4     5     6     7     8     9    10    11
    12    13    14     9    15     1    16    17    18    19    20    21
 35180 35180 35180 35180 35180 35180] 
 [    0     0     0     0     0     0     1     0     0     0     0     0
     1     0     0     0     0     0     2     0     0     0     0     0
 35180 35180 35180 35180 35180 35180]  
```

In [69]:
###
BATCH_SIZE = 5
mini_sentences = train_sent[0: 8]
mini_labels = train_label[0: 8]
mini_dataset = EntityDataset(mini_sentences, mini_labels, pad=0)
mini_dataloader = DataLoader(mini_dataset, batch_size=BATCH_SIZE, shuffle=True)
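# note: each iter() call starts a fresh pass over the data, so batch2 below is
# the first batch of a new pass, not the second batch of the same pass
batch1 = next(iter(mini_dataloader))
batch2 = next(iter(mini_dataloader))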
print(batch1['input_ids'].shape, batch1['labels'].shape, batch2['input_ids'].shape, batch2['labels'].shape)
#print(Y1.shape, X1.shape, Y2.shape, X2.shape
print(batch1['input_ids'][0], "\n", batch1['labels'][0])
#print(X1[0][:], "\n", Y1[0][:])

torch.Size([5, 35]) torch.Size([5, 35]) torch.Size([5, 35]) torch.Size([5, 35])
tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 10, 16,  2,
        17, 18, 19, 20, 21, 22,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]) 
 tensor([ 2,  2,  2,  2,  2,  2, 11,  2,  2,  2,  2,  2, 11,  2,  2,  2,  2,  2,
        10,  2,  2,  2,  2,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])


Prepare the dataset for use with PyTorch and BERT.

In [19]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertConfig

#from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

torch.__version__

'1.5.1'

The BERT implementation comes with a pretrained tokenizer and a defined vocabulary. We load the one related to the smallest pre-trained model, `bert-base-cased`. We use the `cased` variant since it is well suited for NER.

In [20]:
!pwd

/home/jovyan/work


In [22]:
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9679 sha256=7db5dd0192c9891cd636d0f27ea08a78c98fa57ebfea1f4f33314f01ef2ee10f
  Stored in directory: /home/jovyan/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
You should consider upgrading via the '/opt/conda/bin/python -m pip install --upgrade pip' command.


Now we tokenize all sentences. Since the BERT tokenizer is based on a [WordPiece tokenizer](https://blog.floydhub.com/tokenization-nlp/), it splits words into subword tokens. For example, ‘gunships’ is split into the two tokens ‘guns’ and ‘##hips’. We therefore have to propagate each word-level label to all of the subword tokens of that word. In practice you would solve this with a specialized data structure based on label spans, but for simplicity I do it explicitly here.

In [43]:
def tokenize_and_preserve_labels(sentence, text_labels):
    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):

        # Tokenize the word and count # of subwords the word is broken into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels

In [44]:
tokenized_texts_and_labels = [
    tokenize_and_preserve_labels(sent, labs)
    for sent, labs in zip(sentences, labels)
]

AttributeError: 'NoneType' object has no attribute 'tokenize'
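
The `AttributeError` above indicates that `tokenizer` was `None` when the cell ran. To illustrate the label propagation on its own, the sketch below uses a stand-in tokenizer: `ToyTokenizer` is hypothetical, with a hard-coded split table, and is not the BERT tokenizer. Unlike the cell above, it takes the tokenizer as an argument rather than reading a global, which is a design choice made only for this example.

```python
class ToyTokenizer:
    """Stand-in for a subword tokenizer; the split table is hard-coded."""
    SPLITS = {"gunships": ["guns", "##hips"]}

    def tokenize(self, word):
        # Words missing from the table are returned as a single token.
        return self.SPLITS.get(word, [word])

def propagate_labels(sentence, text_labels, tokenizer):
    """Repeat each word-level label once per subword produced for that word."""
    tokens, labels = [], []
    for word, label in zip(sentence, text_labels):
        pieces = tokenizer.tokenize(word)
        tokens.extend(pieces)
        labels.extend([label] * len(pieces))
    return tokens, labels

tokens, labels = propagate_labels(
    ["Iraqi", "gunships", "attacked"], ["B-gpe", "O", "O"], ToyTokenizer()
)
print(tokens)
print(labels)
```

The token and label lists stay the same length, so every subword still has exactly one label.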

<a name="2"></a>
# Part 2:  Building the model

You will now implement the model. The original assignment uses Google's Trax, and this notebook also builds the model in PyTorch. Your model will be able to distinguish the following:
<table>
    <tr>
        <td>
<img src = 'ner1.png' width="width" height="height" style="width:500px;height:150px;"/>
        </td>
    </tr>
</table>

The model architecture will be as follows: 

<img src = 'ner2.png' width="width" height="height" style="width:600px;height:250px;"/>

Concretely: 

* Use the input tensors you built in your data generator
* Feed it into an Embedding layer, which maps each word index to a dense vector
* Feed it into an LSTM layer
* Run the output through a linear layer
* Run the result through a log softmax layer to get the predicted class for each word.
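
The last step can be illustrated without any framework. The sketch below applies a log softmax to made-up scores for a single word and picks the most likely tag; the three tag names and the score values are chosen only for the example.

```python
import math

def log_softmax(scores):
    """Numerically stable log softmax over a list of raw scores."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

# Made-up scores for one word over three hypothetical tags.
tag_names = ["O", "B-geo", "B-per"]
scores = [2.0, 0.5, -1.0]
log_probs = log_softmax(scores)

# The predicted class is the tag with the highest log-probability.
predicted = tag_names[max(range(len(log_probs)), key=log_probs.__getitem__)]
print(predicted)
```

Exponentiating the log-probabilities gives values that sum to one, and the ranking of the tags is the same as for the raw scores.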

Good news! We won't make you implement the LSTM unit drawn above. However, we will ask you to build the model. 

<a name="ex02"></a>
### Exercise 02

**Instructions:** Implement the initialization step and the forward function of your Named Entity Recognition system.  
Use the `help` function, e.g. `help(tl.Dense)`, for more information on a layer.
   
- [tl.Serial](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/combinators.py#L26): Combinator that applies layers serially (by function composition).
    - You can pass in the layers as arguments to `Serial`, separated by commas. 
    - For example: `tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))` 


-  [tl.Embedding](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L113): Initializes the embedding layer. Its weight matrix has one row per vocabulary entry and one column per embedding dimension. 
    - `tl.Embedding(vocab_size, d_feature)`.
    - `vocab_size` is the number of unique words in the given vocabulary.
    - `d_feature` is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
    

-  [tl.LSTM](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/rnn.py#L87): The `Trax` LSTM layer. 
    - `tl.LSTM(n_units)` builds an LSTM layer with `n_units` units.



-  [tl.Dense](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L28):  A dense layer.
    - `tl.Dense(n_units)`: The parameter `n_units` is the number of units chosen for this dense layer.  


- [tl.LogSoftmax](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L242): Log of the output probabilities.
    - Here, you don't need to set any parameters for `LogSoftmax()`.
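
To make the role of `vocab_size` and `d_feature` more tangible, here is a framework-free sketch of an embedding lookup: a table with one row per word id, indexed by the ids of a sentence. The helper names `make_embedding_table` and `embed`, the tiny sizes, and the random initialization range are all assumptions made for illustration.

```python
import random

def make_embedding_table(vocab_size, d_feature, seed=0):
    """Build a (vocab_size x d_feature) table of small random numbers."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_feature)]
            for _ in range(vocab_size)]

def embed(table, word_ids):
    """Replace every word id by its row in the table."""
    return [table[i] for i in word_ids]

table = make_embedding_table(vocab_size=10, d_feature=4)
sentence = [1, 2, 3, 2]   # made-up word ids
vectors = embed(table, sentence)
print(len(vectors), len(vectors[0]))
```

A sentence of 4 ids becomes 4 vectors of length `d_feature`, and repeated ids map to the same row.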
 

**Online documentation**

- [tl.Serial](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#module-trax.layers.combinators)

- [tl.Embedding](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding)

-  [tl.LSTM](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.LSTM)

-  [tl.Dense](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Dense)

- [tl.LogSoftmax](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.LogSoftmax)


- [nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)

- [nn.LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM)

- [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)

- [nn.functional.log_softmax](https://pytorch.org/docs/stable/generated/torch.nn.functional.log_softmax.html)

- [nn.LogSoftmax](https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html)

In [21]:
# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU; consider making n_epochs very small.')

No GPU available, training on CPU; consider making n_epochs very small.


In [None]:
class NER_torch(nn.Module):
    
    def __init__(self, vocab_size=35181, d_model=50, tags=tag_map):
        '''
          Input: 
            vocab_size - integer containing the size of the vocabulary
            d_model - integer describing the embedding size (also used as the LSTM hidden size)
            tags - dictionary mapping each tag to its index
        '''
        super().__init__()
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.tags = tags
        # sizes used by init_hidden; a single LSTM layer with d_model hidden units
        self.n_layers = 1
        self.n_hidden = d_model
        
        # embedding -> LSTM -> linear layer over the tags
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, self.n_hidden, num_layers=self.n_layers, batch_first=True)
        self.fc = nn.Linear(self.n_hidden, len(self.tags))
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden
    
    
    def forward(self, x, hidden=None):
        # x: (batch_size, seq_len) tensor of word indices
        x = self.embedding(x)
        # when hidden is None, nn.LSTM starts from zero hidden and cell states
        l_out, hidden = self.lstm(x, hidden)
        # remaining steps follow the architecture listed above:
        # a linear layer, then log softmax over the tags for each word
        out = self.fc(l_out)
        return F.log_softmax(out, dim=-1)
        
        