## Problem 3. BiLSTM-CRF (30 points)

In this problem, you will build a BiLSTM-CRF model for named entity recognition (NER).
The training, validation, and test datasets are English text and are prepared by the code below. The labels are the tags of the words in the text.

The requirements are:

1)	[10 points] Build a BiLSTM-CRF model. If you use only an LSTM model, you will receive only half of the points. All code should be in the Jupyter notebook.

2)	[5 points] The training module should include training and validation processes. The training and validation batch size should be 32. 

3)	[5 points] Batches should have different max lengths. 

4)	[5 points] Plot the training and validation loss curves, with one point per epoch.

5)	[5 points] Evaluate the model on the test dataset using the F1 score.

Note: Write comments that explain what the important parts of your code do and why your code meets the above requirements. The given code for data preprocessing should not be changed.
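Requirement 3 can be read as padding each batch only to the length of its own longest sequence, rather than padding every sentence to one global length. The following is a minimal pure-Python sketch of that idea, not a reference solution. The helper names `pad_batch` and `iter_batches` and the values `pad_word=99` and `pad_tag=0` are hypothetical. Because the notebook's `word_to_ix` and `tag_dicts` both assign index 0 to a real entry, a dedicated padding index (for example `n_words` for words) would likely be needed in practice.

```python
def pad_batch(seqs, tags, pad_word, pad_tag):
    """Pad one batch to the length of its own longest sequence."""
    batch_max = max(len(s) for s in seqs)
    padded_seqs = [s + [pad_word] * (batch_max - len(s)) for s in seqs]
    padded_tags = [t + [pad_tag] * (batch_max - len(t)) for t in tags]
    # The mask marks real tokens (1) versus padding (0), so that padded
    # positions can be excluded from the loss and from evaluation.
    mask = [[1] * len(s) + [0] * (batch_max - len(s)) for s in seqs]
    return padded_seqs, padded_tags, mask


def iter_batches(seqs, tags, batch_size, pad_word, pad_tag):
    """Yield padded batches; each batch gets its own max length."""
    for start in range(0, len(seqs), batch_size):
        yield pad_batch(seqs[start:start + batch_size],
                        tags[start:start + batch_size],
                        pad_word, pad_tag)


# Toy data (hypothetical indexes), batch size 2 instead of 32 for brevity.
toy_x = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10, 11]]
toy_y = [[0, 1, 1], [0], [1, 0], [0, 0, 1, 1, 0]]
batches = list(iter_batches(toy_x, toy_y, batch_size=2, pad_word=99, pad_tag=0))
batch_lengths = [len(padded_seqs[0]) for padded_seqs, _, _ in batches]
print(batch_lengths)  # the two batches are padded to different lengths
```

In this toy run the first batch is padded to length 3 and the second to length 5. The same logic could be placed in a custom collate function if a data loader is used.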

#### Reference: 
For more information about the sequence tagging problem and BiLSTM-CRF, see this paper: https://arxiv.org/pdf/1603.01360.pdf


In [1]:
import pandas as pd
import numpy as np

We use an NER dataset for training. Here is the preprocessing code.

In [2]:
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.drop(['POS'], axis =1)
data = data.fillna(method="ffill")
data.tail(15)

Unnamed: 0,Sentence #,Word,Tag
1048560,Sentence: 47958,of,O
1048561,Sentence: 47958,the,O
1048562,Sentence: 47958,rockets,O
1048563,Sentence: 47958,exploded,O
1048564,Sentence: 47958,upon,O
1048565,Sentence: 47958,impact,O
1048566,Sentence: 47958,.,O
1048567,Sentence: 47959,Indian,B-gpe
1048568,Sentence: 47959,forces,O
1048569,Sentence: 47959,said,O


The dataset includes 35178 unique words and 17 different tags.

In [3]:
word_to_ix = {}
words = set(list(data['Word'].values))
for w in words:
    word_to_ix[w]=len(word_to_ix)
n_words = len(words)
print(n_words)

35178


In [4]:
tag_dicts={}
tags = set(list(data["Tag"].values))
for t in tags:
    tag_dicts[t]=len(tag_dicts)
n_tags = len(tags)
print(n_tags)
print(tag_dicts)

17
{'I-org': 0, 'B-gpe': 1, 'B-tim': 2, 'B-nat': 3, 'B-geo': 4, 'I-eve': 5, 'B-eve': 6, 'I-per': 7, 'O': 8, 'I-tim': 9, 'I-nat': 10, 'I-geo': 11, 'B-org': 12, 'I-art': 13, 'B-art': 14, 'B-per': 15, 'I-gpe': 16}


In [5]:
agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),s["Tag"].values.tolist())]
grouped = data.groupby("Sentence #").apply(agg_func)
sentences = [s for s in grouped]
print(sentences[15])

[('Israeli', 'B-gpe'), ('officials', 'O'), ('say', 'O'), ('Prime', 'B-per'), ('Minister', 'I-per'), ('Ariel', 'I-per'), ('Sharon', 'I-per'), ('will', 'O'), ('undergo', 'O'), ('a', 'O'), ('medical', 'O'), ('procedure', 'O'), ('Thursday', 'B-tim'), ('to', 'O'), ('close', 'O'), ('a', 'O'), ('tiny', 'O'), ('hole', 'O'), ('in', 'O'), ('his', 'O'), ('heart', 'O'), ('discovered', 'O'), ('during', 'O'), ('treatment', 'O'), ('for', 'O'), ('a', 'O'), ('minor', 'O'), ('stroke', 'O'), ('suffered', 'O'), ('last', 'O'), ('month', 'O'), ('.', 'O')]


We convert the data into a list of word sequences and a list of tag sequences. We also map the words and tags to their dictionary indexes, which are easier to feed into the embedding layer. Sentences longer than `max_len` = 50 tokens are truncated.

In [6]:
max_len = 50
X = [[w[0]for w in s] for s in sentences]
Y = [[w[1]for w in s] for s in sentences]
new_data = []
new_tags=[]
for seq,tag in zip(X,Y):
    new_seq=[]
    new_tag=[]
    for i in range(max_len):
        try:
            new_seq.append(word_to_ix[seq[i]])
            new_tag.append(tag_dicts[tag[i]])
        except:
            pass
    new_data.append(new_seq)
    new_tags.append(new_tag)
print(new_data[15])
print(new_tags[15])

[20731, 27809, 23435, 33831, 14278, 28646, 28798, 10565, 18082, 11320, 32605, 8275, 29662, 23702, 2864, 11320, 9218, 33694, 34569, 34817, 26892, 11644, 33227, 16522, 8380, 11320, 24477, 21909, 17088, 2970, 24587, 25245]
[1, 8, 8, 15, 7, 7, 7, 8, 8, 8, 8, 8, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]


In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_data, new_tags, test_size=0.3)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3)
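Because the second split is applied to the 70% that remains after the first one, the final proportions are not 70/30. The short check below works out the approximate fractions; it uses only the `test_size=0.3` value from the cell above. Note also that no `random_state` is passed to `train_test_split`, so the split differs from run to run.

```python
test_size = 0.3

# First split: 30% goes to the test set, 70% remains.
remaining = 1 - test_size
# Second split: 30% of the remainder becomes the validation set.
valid_fraction = remaining * test_size
train_fraction = remaining * (1 - test_size)

fractions = {
    "train": round(train_fraction, 2),
    "valid": round(valid_fraction, 2),
    "test": round(test_size, 2),
}
print(fractions)
```

So roughly 49% of the sentences are used for training, 21% for validation, and 30% for testing.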

### Step 1. Model building
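As background for the CRF part of the model, the sketch below illustrates Viterbi decoding, which a CRF layer typically uses to pick the best tag sequence at prediction time. It is a simplified pure-Python illustration and not a full solution: the function name `viterbi_decode` and the toy scores are hypothetical, start and end transitions are omitted, and in a real model the emission scores would come from the BiLSTM and the transition scores would be learned parameters.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for one sentence.

    emissions[t][j]   : score of tag j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    # scores[j] = best score of any path that ends in tag j at this step
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for arriving at tag j.
            best_prev = max(range(n_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path.
    last = max(range(n_tags), key=lambda j: scores[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


# Toy example with 2 tags and 3 positions (hypothetical scores).
emissions = [[2.0, 0.0], [0.0, 1.0], [1.5, 0.0]]
transitions = [[0.5, -1.0], [-1.0, 0.5]]
score, path = viterbi_decode(emissions, transitions)
greedy = [max(range(2), key=lambda j: e[j]) for e in emissions]
print(path, score, greedy)
```

In this toy example the per-position argmax (`greedy`) switches tags in the middle, while the Viterbi path stays on tag 0 because the transition scores penalize switching. This is the kind of sequence-level consistency that the CRF layer adds on top of an LSTM-only tagger.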

### Step 2. Training module

### Step 3. Test module
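The assignment does not specify which F1 variant to use. The sketch below assumes a token-level, micro-averaged F1 that ignores the `O` tag; entity-level (span) F1 is a common alternative for NER and would give different numbers. The function name `token_f1` and the toy tag sequences are hypothetical, and padded positions are assumed to have been removed before the call.

```python
def token_f1(true_tags, pred_tags, ignore=("O",)):
    """Micro-averaged token-level F1 over all tags not listed in `ignore`."""
    tp = fp = fn = 0
    for gold_seq, pred_seq in zip(true_tags, pred_tags):
        for gold, pred in zip(gold_seq, pred_seq):
            if pred not in ignore:
                if pred == gold:
                    tp += 1  # correct entity tag
                else:
                    fp += 1  # predicted an entity tag that is wrong
            if gold not in ignore and pred != gold:
                fn += 1      # missed or mislabeled a gold entity tag
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: one sentence with four tokens.
gold = [["B-per", "I-per", "O", "B-geo"]]
pred = [["B-per", "O", "O", "B-org"]]
f1 = token_f1(gold, pred)
print(round(f1, 3))
```

Here there is 1 true positive, 1 false positive, and 2 false negatives, so precision is 1/2, recall is 1/3, and F1 is 0.4. To apply this to the notebook's integer labels, the predictions and gold labels would first be mapped back to tag strings, or `ignore` would be set to the index of `O` in `tag_dicts`.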