# Named Entity Recognition (NER)

Named Entity Recognition (NER) is an important task in natural language processing. In this assignment you will implement a neural network model for NER, using an approach called the Sliding Window Neural Network. The dataset is composed of sentences. The dataframe already has each word parsed in one column and the corresponding label (entity) in the second column. We will build a "window" model: the idea is to use a 5-word window to predict the named entity of the middle word. The cells below load the data and show its first rows:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from ner import *
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
data = pd.read_csv("data/Genia4ERtask1.iob2", sep="\t", header=None, names=["word", "label"])

In [4]:
data.head(10)

| | word | label |
|---|---|---|
| 0 | IL-2 | B-DNA |
| 1 | gene | I-DNA |
| 2 | expression | O |
| 3 | and | O |
| 4 | NF-kappa | B-protein |
| 5 | B | I-protein |
| 6 | activation | O |
| 7 | through | O |
| 8 | CD28 | B-protein |
| 9 | requires | O |
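
To make the window idea concrete, here is a minimal sketch built only from the ten rows shown above. The helper name `make_windows` is hypothetical and is not part of `ner.py`; it simply pairs each run of 5 consecutive words with the label of the middle word.

```python
# Hypothetical helper (not from ner.py): pair each 5-word window
# with the label of its middle word.
words = ["IL-2", "gene", "expression", "and", "NF-kappa",
         "B", "activation", "through", "CD28", "requires"]
labels = ["B-DNA", "I-DNA", "O", "O", "B-protein",
          "I-protein", "O", "O", "B-protein", "O"]


def make_windows(words, labels, size=5):
    """Return (window, middle_label) pairs for every full window."""
    mid = size // 2
    return [
        (words[i:i + size], labels[i + mid])
        for i in range(len(words) - size + 1)
    ]


windows = make_windows(words, labels)
for window, label in windows[:2]:
    print(window, "->", label)
```

With 10 words and a window of 5 there are 10 - 5 + 1 = 6 observations. In this sketch the first two and last two words of the sequence never appear as a middle word.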


In [5]:
tiny_data = pd.read_csv("data/tiny.ner.train", sep="\t", header=None, names=["word", "label"])

The first observation is the 5 words starting with 'IL-2', and its label is the entity for the middle word 'expression'. The second observation is the 5 words starting with 'gene', and its label is the entity for the word 'and'. Each observation therefore has 5 features (categorical variables), which are words. We use a word embedding to represent each value of the categorical features. For each observation, we concatenate its 5 word embeddings. The vector of concatenated embeddings is fed to a linear layer.
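
The architecture described above could be sketched in PyTorch as follows. This is a minimal sketch, assuming the 5 embeddings are concatenated and passed to a single linear layer; the class name `WindowModelSketch` and the toy sizes are hypothetical, and the assignment's actual `NERModel` in `ner.py` may differ.

```python
import torch
import torch.nn as nn


class WindowModelSketch(nn.Module):
    """Hypothetical sketch: embed 5 word ids, concatenate, apply one linear layer."""

    def __init__(self, vocab_size, n_class, emb_size=100, window=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.linear = nn.Linear(window * emb_size, n_class)

    def forward(self, x):
        # x: (batch, window) integer word indices
        e = self.emb(x)              # (batch, window, emb_size)
        e = e.flatten(start_dim=1)   # (batch, window * emb_size)
        return self.linear(e)        # (batch, n_class) unnormalized scores


# Illustrative sizes only.
model = WindowModelSketch(vocab_size=50, n_class=7, emb_size=4)
x = torch.tensor([[11, 30, 26, 18, 13]])
logits = model(x)
print(logits.shape)
```

The linear layer takes `window * emb_size` inputs because the 5 embeddings are laid side by side rather than averaged, so the model can treat each position in the window differently.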

## Split dataset

In [6]:
N = int(data.shape[0]*0.8)
N

394040

In [7]:
train_df = data.iloc[:N,].copy()
valid_df = data.iloc[N:,].copy()

In [8]:
train_df.shape, valid_df.shape

((394040, 2), (98511, 2))

## Word and label to index mapping

In [9]:
vocab2index = label_encoding(train_df["word"].values)
label2index = label_encoding(train_df["label"].values)
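
The implementation of `label_encoding` lives in `ner.py` and is not shown here. As a rough sketch of what such a mapping might look like, assume each distinct value gets a consecutive integer in sorted order, and that words unseen in training share one extra index (which would be consistent with `vocab_size = len(vocab2index)+1` later in this notebook). Both function names below are hypothetical.

```python
# Hypothetical sketch, not the assignment's ner.py implementation.
def label_encoding_sketch(values):
    """Map each distinct value to a consecutive integer (sorted order assumed)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode_words_sketch(words, vocab2index):
    """Encode words; unseen words share the extra index len(vocab2index)."""
    unknown = len(vocab2index)
    return [vocab2index.get(w, unknown) for w in words]


toy_train_words = ["IL-2", "gene", "expression", "and", "gene"]
toy_vocab2index = label_encoding_sketch(toy_train_words)
print(toy_vocab2index)
print(encode_words_sketch(["gene", "CD28"], toy_vocab2index))
```

Building the mappings from the training split only, as the cell above does, means validation words that never occur in training need some fallback index like the one assumed here.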

## Label Encoding categorical variables

In [10]:
tiny_vocab2index = label_encoding(tiny_data["word"].values)
tiny_label2index = label_encoding(tiny_data["label"].values)
tiny_data_enc = dataset_encoding(tiny_data, tiny_vocab2index, tiny_label2index)

In [11]:
actual = np.array([17, 53, 31, 25, 44, 41, 32,  0, 11,  1])
assert(np.array_equal(tiny_data_enc.iloc[30:40].word.values, actual))

## Dataset definition

In [12]:
tiny_ds = NERDataset(tiny_data_enc)

In [13]:
len(tiny_ds)

93

In [14]:
tiny_data_enc[:20]

| | word | label |
|---|---|---|
| 0 | 11 | 0 |
| 1 | 30 | 3 |
| 2 | 26 | 6 |
| 3 | 18 | 6 |
| 4 | 13 | 2 |
| 5 | 7 | 5 |
| 6 | 17 | 6 |
| 7 | 60 | 6 |
| 8 | 8 | 2 |
| 9 | 52 | 6 |


In [15]:
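# Scratch check: collect every full 5-row window of the encoded tiny dataset.
# A separate name avoids overwriting the `data` dataframe loaded earlier.
windows = []
for i in range(len(tiny_data_enc) - 4):
    windows.append(tiny_data_enc[i:i+5])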

In [16]:
len(tiny_ds[:][0])

93

In [17]:
x, y = tiny_ds[0]
x,y

(array([11, 30, 26, 18, 13]), 6)

In [18]:
x, y = tiny_ds[0]
assert(np.array_equal(x, np.array([11, 30, 26, 18, 13])))
assert(y == 6)
assert(len(tiny_ds) == 93)
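
`NERDataset` is defined in `ner.py` and is not shown here. The following is a minimal sketch of a dataset that would satisfy the first-item checks above, assuming item `i` is the 5 word ids starting at row `i` and the label of the middle row. The class name `WindowDatasetSketch` is hypothetical, and the toy data are the ten encoded rows displayed above.

```python
import numpy as np
from torch.utils.data import Dataset


class WindowDatasetSketch(Dataset):
    """Hypothetical sketch: item i is (5 word ids from row i, label of the middle row)."""

    def __init__(self, words, labels, window=5):
        self.words = np.asarray(words)
        self.labels = np.asarray(labels)
        self.window = window

    def __len__(self):
        # Only full windows count: the last (window - 1) rows cannot start a window.
        return len(self.words) - self.window + 1

    def __getitem__(self, idx):
        x = self.words[idx:idx + self.window]
        y = self.labels[idx + self.window // 2]
        return x, y


# The ten encoded rows displayed above.
words = [11, 30, 26, 18, 13, 7, 17, 60, 8, 52]
labels = [0, 3, 6, 6, 2, 5, 6, 6, 2, 6]
ds = WindowDatasetSketch(words, labels)
x, y = ds[0]
print(x, y, len(ds))
```

Under this assumption the dataset length is the number of rows minus 4. The sketch handles integer indices only, so slicing such as `tiny_ds[:]` would need extra handling.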

## Testing

In [19]:
# encoding datasets
train_df_enc = dataset_encoding(train_df, vocab2index, label2index)
valid_df_enc = dataset_encoding(valid_df, vocab2index, label2index)

In [20]:
# creating datasets
train_ds = NERDataset(train_df_enc)
valid_ds = NERDataset(valid_df_enc)

# dataloaders
batch_size = 10000
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

In [22]:
vocab_size = len(vocab2index)+1
n_class = len(label2index)
emb_size = 100


model = NERModel(vocab_size, n_class, emb_size)
optimizer = get_optimizer(model, lr = 0.01, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

train loss  0.758 val loss 0.404 and accuracy 0.877
train loss  0.318 val loss 0.325 and accuracy 0.899
train loss  0.251 val loss 0.301 and accuracy 0.906
train loss  0.217 val loss 0.308 and accuracy 0.905
train loss  0.195 val loss 0.286 and accuracy 0.911
train loss  0.181 val loss 0.297 and accuracy 0.908
train loss  0.170 val loss 0.285 and accuracy 0.912
train loss  0.162 val loss 0.296 and accuracy 0.910
train loss  0.156 val loss 0.311 and accuracy 0.909
train loss  0.151 val loss 0.310 and accuracy 0.908
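
`get_optimizer` and `train_model` come from `ner.py` and are not shown. As a hedged sketch of what one training epoch might involve, the code below assumes an Adam optimizer with weight decay and a cross-entropy loss; neither choice is confirmed by this notebook, and all function names are hypothetical. A stand-in linear model and random data keep it self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def get_optimizer_sketch(model, lr=0.01, wd=0.0):
    # Assumption: Adam with weight decay; the real get_optimizer may differ.
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)


def train_epoch_sketch(model, optimizer, batches):
    """Run one pass over (x, y) batches and return the mean training loss."""
    model.train()
    total, sum_loss = 0, 0.0
    for x, y in batches:
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        sum_loss += loss.item() * y.shape[0]   # re-weight by batch size
        total += y.shape[0]
    return sum_loss / total


# Toy data and model, for illustration only.
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_batches = [(torch.randn(16, 4), torch.randint(0, 3, (16,))) for _ in range(4)]
opt = get_optimizer_sketch(toy_model, lr=0.01, wd=1e-5)
losses = [train_epoch_sketch(toy_model, opt, toy_batches) for _ in range(5)]
print(losses)
```

Weighting each batch loss by its size gives a per-example average even when the final batch is smaller than the others.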


In [23]:
optimizer = get_optimizer(model, lr = 0.001, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

train loss  1.629 val loss 1.652 and accuracy 0.895
train loss  1.626 val loss 1.651 and accuracy 0.897
train loss  1.625 val loss 1.650 and accuracy 0.897
train loss  1.622 val loss 1.645 and accuracy 0.903
train loss  1.619 val loss 1.643 and accuracy 0.905
train loss  1.616 val loss 1.641 and accuracy 0.907
train loss  1.614 val loss 1.640 and accuracy 0.908
train loss  1.612 val loss 1.639 and accuracy 0.909
train loss  1.610 val loss 1.639 and accuracy 0.909
train loss  1.609 val loss 1.638 and accuracy 0.909


In [24]:
valid_loss, valid_acc = valid_metrics(model, valid_dl)

In [25]:
valid_loss, valid_acc

(1.6383890185784586, 0.9093668470260997)
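
`valid_metrics` is also defined in `ner.py`. A minimal sketch of how a loss and accuracy pair like the one above could be computed, assuming cross-entropy loss and accuracy taken as the fraction of correct `argmax` predictions. The function name and the toy model are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def valid_metrics_sketch(model, batches):
    """Hypothetical sketch: mean cross-entropy loss and accuracy over (x, y) batches."""
    model.eval()
    total, sum_loss, correct = 0, 0.0, 0
    with torch.no_grad():
        for x, y in batches:
            out = model(x)                                    # (batch, n_class)
            loss = F.cross_entropy(out, y)                    # mean over the batch
            sum_loss += loss.item() * y.shape[0]              # re-weight by batch size
            correct += (out.argmax(dim=1) == y).sum().item()
            total += y.shape[0]
    return sum_loss / total, correct / total


# Toy check with a stand-in linear model and random data (illustrative only).
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_batches = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(2)]
loss, acc = valid_metrics_sketch(toy_model, toy_batches)
print(loss, acc)
```

Running the evaluation under `torch.no_grad()` with the model in `eval()` mode avoids tracking gradients and disables training-only behavior such as dropout.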

In [26]:
assert(np.abs(valid_loss - 0.3) < 0.02)

AssertionError: 

In [79]:
assert(np.abs(valid_acc - 0.9) < 0.01)

In [27]:
a1 = nn.Linear(10, 3)

In [30]:
a1.weight.shape

torch.Size([3, 10])