![MLU Logo](https://drive.corp.amazon.com/view/bwernes@/MLU_Logo.png?download=true)

## NLP2 Lecture 2 Support Notebook
### NATURAL LANGUAGE PROCESSING WITH TRANSFORMERS
### A POS tagger trained on the UD treebank by fine-tuning a BERT model

#### NOTE: run this with conda_amazonei_tensorflow_p36 kernel

### Table of Contents
<p>
<div class="lev1">
    <a href="#Chunking"><span class="toc-item-num">1&nbsp;&nbsp;</span>
        Chunking
    </a>
</div>
<div class="lev1">
    <a href="#Conditional-Random-Fields-for-NER"><span class="toc-item-num">2&nbsp;&nbsp;</span>
        Conditional Random Fields for NER
    </a>
</div>
<div class="lev1">
    <a href="#POS-Tagging-with-BERT"><span class="toc-item-num">3&nbsp;&nbsp;</span>
        POS Tagging with BERT
    </a>
</div>

# Chunking

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sentence = "The instructor taught the student to process natural language"

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
grammar = "NP: {<DT>?<JJ>*<NN>}"

In [3]:
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
print("Found Noun Phrases:")
# Walk every subtree and keep the ones the grammar labelled as NP.
for stree in result.subtrees():
    if stree.label() == 'NP':
        print('\t' + '_'.join(word.lower() for word, tag in stree))

Found Noun Phrases:
	the_instructor
	the_student
	natural_language
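
The tag pattern `{<DT>?<JJ>*<NN>}` reads as: an optional determiner, any number of adjectives, then a singular noun. The sketch below illustrates that matching idea using only the standard library. The hard-coded POS tags and the helper name `np_chunks` are assumptions made for illustration; this is not NLTK's own implementation.

```python
import re

# Hypothetical POS tags for the example sentence (hard-coded, not produced by a tagger here).
tagged = [
    ("The", "DT"), ("instructor", "NN"), ("taught", "VBD"),
    ("the", "DT"), ("student", "NN"), ("to", "TO"),
    ("process", "VB"), ("natural", "JJ"), ("language", "NN"),
]

# DT? JJ* NN expressed over a string of "<TAG>" units.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

def np_chunks(tagged_tokens):
    """Return the word lists whose tag sequence matches DT? JJ* NN."""
    tag_string = "".join("<{}>".format(tag) for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Every token contributes exactly one "<", so counting "<" gives token offsets.
        first = tag_string[:match.start()].count("<")
        length = match.group().count("<")
        chunks.append([word for word, _ in tagged_tokens[first:first + length]])
    return chunks

chunks = np_chunks(tagged)
for chunk in chunks:
    print("_".join(word.lower() for word in chunk))
```

Because the pattern asks for `<NN>` exactly, a plural noun tagged `NNS` would not be chunked unless the grammar is widened.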


## End of Example.  Return to Slides

<div class="lev1">
    <a href="#NLP2-Lecture-2-Support-Notebook">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        Go to TOP
    </a>
</div>

# Conditional Random Fields for NER

# Named Entity Recognition using CRF model
Named Entity Recognition (NER) is a common task in Natural Language Processing (NLP). An entity is a span of text that is of interest, and NER extracts such spans from a corpus and classifies them into predefined categories such as location, organization, and person name.
Information about labels:
* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon

1. Total word count = 1354149
2. Target data column: Tag

#### Importing Libraries

In [2]:
!pip install sklearn_crfsuite



In [3]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_f1_score
from sklearn_crfsuite.metrics import flat_classification_report

import boto3
from os import path

In [4]:
# import the datasets
bucketname = 'mlu-courses-datalake' 
filename = 'NLP2/data/ner_dataset.csv' 
s3 = boto3.resource('s3')
if not path.exists("../data/ner_dataset.csv"):
    s3.Bucket(bucketname).download_file(filename, '../data/ner_dataset.csv')

In [5]:
#Reading the csv file
df = pd.read_csv('../data/ner_dataset.csv', encoding = "ISO-8859-1")

In [6]:
#Display first 10 rows
df.head(10)

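| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | NaN | of | IN | O |
| 2 | NaN | demonstrators | NNS | O |
| 3 | NaN | have | VBP | O |
| 4 | NaN | marched | VBN | O |
| 5 | NaN | through | IN | O |
| 6 | NaN | London | NNP | B-geo |
| 7 | NaN | to | TO | O |
| 8 | NaN | protest | VB | O |
| 9 | NaN | the | DT | O |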


In [7]:
df.describe()

| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| count | 47959 | 1048575 | 1048575 | 1048575 |
| unique | 47959 | 35178 | 42 | 17 |
| top | Sentence: 40028 | the | NN | O |
| freq | 1 | 52573 | 145807 | 887908 |


#### Observations:
* The dataset contains 47959 sentences.
* It has 35178 unique words.
* There are 17 distinct labels (tags).

In [8]:
#Displaying the unique Tags
df['Tag'].unique()

array(['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
       'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
       'I-eve', 'I-nat'], dtype=object)
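
These tags follow the BIO scheme: `B-` marks the first token of an entity, `I-` marks a continuation of the same entity, and `O` marks tokens outside any entity. A minimal sketch of turning a BIO-tagged sentence back into entity spans follows. The helper name `bio_to_spans` is hypothetical, and the "New York" tokens are an invented addition to the words taken from the first sentence of the dataset.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_words = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # An I- tag of the same type extends the entity that is currently open.
        if prefix == "I" and entity_type == current_type:
            current_words.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_words:
            spans.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            current_type, current_words = entity_type, [token]
        else:
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans

tokens = ["through", "London", "to", "protest", "the", "war", "in", "Iraq", "and", "New", "York"]
tags = ["O", "B-geo", "O", "O", "O", "O", "O", "B-geo", "O", "B-geo", "I-geo"]
spans = bio_to_spans(tokens, tags)
print(spans)
```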

In [9]:
#Checking null values, if any.
df.isnull().sum()

Sentence #    1000616
Word                0
POS                 0
Tag                 0
dtype: int64

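The 'Sentence #' column has many missing values because only the first word of each sentence carries the sentence ID. We forward-fill the column ('ffill'), which propagates the last valid observation to the rows that follow it.

In [10]:
# Forward-fill: each row inherits the sentence ID of the nearest labelled row above it.
df = df.ffill()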
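
To make the forward-fill behaviour concrete, here is a small standard-library sketch of the same idea on a plain list. The function name `forward_fill` and the sample values are assumptions for illustration; pandas performs the equivalent operation column by column.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

sentence_ids = ["Sentence: 1", None, None, "Sentence: 2", None]
filled_ids = forward_fill(sentence_ids)
print(filled_ids)
```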

In [11]:
# Helper class that groups the rows by sentence. Each sentence becomes a list of (word, POS, tag) tuples.
class sentence(object):
    def __init__(self, df):
        self.n_sent = 1
        self.df = df
        self.empty = False
        agg = lambda s : [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                       s['POS'].values.tolist(),
                                                       s['Tag'].values.tolist())]
        self.grouped = self.df.groupby("Sentence #").apply(agg)
        self.sentences = [s for s in self.grouped]
        
    def get_text(self):
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent +=1
            return s
        except KeyError:
            return None

In [12]:
#Displaying one full sentence
getter = sentence(df)
sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
sentences[0]

'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'

In [13]:
#sentence with its pos and tag.
sent = getter.get_text()
print(sent)

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]


Getting all the sentences in the dataset.

In [14]:
sentences = getter.sentences

#### Feature Preparation
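Each token is described by a dictionary of features: its lowercased form, its suffixes, capitalization and digit flags, and its POS tag, plus the same kind of information for the previous and next tokens. The feature set follows the template used in the sklearn-crfsuite tutorial and can be customized.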

In [15]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
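
As an illustration of how the feature set could be customized, the sketch below adds a few extra per-word features. The names `word_shape` and `extra_features` and the choice of features are hypothetical; they are not part of the notebook's pipeline.

```python
import re

def word_shape(word):
    """Map upper-case letters to X, lower-case letters to x, digits to d; keep the rest."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def extra_features(word):
    """Hypothetical additional features that could be merged into a feature dict."""
    return {
        'word[:3]': word[:3],              # prefix, complementing the suffix features
        'word.shape': word_shape(word),    # coarse orthographic pattern
        'word.has_hyphen': '-' in word,
    }

print(extra_features("London"))
print(extra_features("B-52"))
```

One way to use it would be to call `features.update(extra_features(word))` inside `word2features` before returning.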

In [16]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

## Training the model with scikit-learn:

We can now train the model with the conditional random field implementation provided by sklearn-crfsuite: initialize a `CRF` instance and fit it to the training data with the `fit` method.

In [18]:
%%time
# takes roughly 4 minutes (CRFsuite trains on the CPU)
crf = CRF(algorithm = 'lbfgs',
         c1 = 0.1,
         c2 = 0.1,
         max_iterations = 100,
         all_possible_transitions = False)
crf.fit(X_train, y_train)

CPU times: user 4min 18s, sys: 244 ms, total: 4min 18s
Wall time: 4min 18s


AttributeError: 'CRF' object has no attribute 'keep_tempfiles'


In [21]:
#Predicting on the test set.
y_pred = crf.predict(X_test)
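
For each sentence, `crf.predict` returns the tag sequence with the highest total score under the trained model. The sketch below illustrates the underlying idea, Viterbi decoding over per-token scores and tag-transition scores, and contrasts it with picking the best tag for each token independently. All scores, the three-tag subset, and the function names are made up for illustration; they are not values or code from sklearn-crfsuite.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions: list of {tag: score}, one dict per token (missing tags score 0).
    transitions: {(prev_tag, cur_tag): score} (missing pairs score 0).
    """
    best = [{t: emissions[0].get(t, 0.0) for t in tags}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + em.get(cur, 0.0)
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def greedy(emissions, tags):
    """Pick the best tag per token, ignoring transitions."""
    return [max(tags, key=lambda t: em.get(t, 0.0)) for em in emissions]

tags = ["O", "B-geo", "I-geo"]
# Hypothetical scores for the tokens "in", "New", "York".
emissions = [
    {"O": 2.0},
    {"B-geo": 1.0, "O": 0.8, "I-geo": 0.2},
    {"B-geo": 1.2, "I-geo": 1.0, "O": 0.1},
]
transitions = {("O", "I-geo"): -5.0, ("B-geo", "I-geo"): 1.5, ("B-geo", "B-geo"): -1.0}

print(greedy(emissions, tags))
print(viterbi(emissions, transitions, tags))
```

With these numbers the per-token choice tags "York" as `B-geo`, while the sequence-level decoding prefers `I-geo` because the `B-geo` to `I-geo` transition is rewarded.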

#### Evaluating the model performance.
We evaluate the model with precision, recall, and F1-score. Accuracy is a poor metric for this dataset because the classes are heavily imbalanced: most tokens carry the `O` tag.

In [22]:
f1_score = flat_f1_score(y_test, y_pred, average = 'weighted')
print(f1_score)

0.9711344117595363


In [23]:
report = flat_classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

       B-art       0.50      0.17      0.25        83
       B-eve       0.43      0.35      0.39        57
       B-geo       0.86      0.90      0.88      7472
       B-gpe       0.97      0.94      0.95      3227
       B-nat       0.76      0.30      0.43        44
       B-org       0.79      0.75      0.77      3970
       B-per       0.85      0.82      0.83      3472
       B-tim       0.92      0.88      0.90      4134
       I-art       0.29      0.08      0.12        63
       I-eve       0.32      0.29      0.30        49
       I-geo       0.82      0.80      0.81      1477
       I-gpe       0.74      0.53      0.62        43
       I-nat       0.75      0.25      0.38        12
       I-org       0.81      0.80      0.80      3306
       I-per       0.85      0.91      0.87      3480
       I-tim       0.86      0.76      0.81      1356
           O       0.99      0.99      0.99    177791

   micro avg       0.97   

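The weighted F1 score of about 0.97 is dominated by the majority `O` tag. Entity tags with more than 1000 test examples reach F1 between 0.77 and 0.95, while the rare art, eve, and nat tags stay between 0.12 and 0.43.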
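
To see how much the `O` tag dominates the headline number, the sketch below recomputes three summary figures from the per-tag F1 scores and supports printed in the report above. The variable names are assumptions, and the inputs are the rounded values from the report, so the results are approximate.

```python
# (f1, support) per tag, copied from the classification report above.
report_rows = {
    "B-art": (0.25, 83), "B-eve": (0.39, 57), "B-geo": (0.88, 7472),
    "B-gpe": (0.95, 3227), "B-nat": (0.43, 44), "B-org": (0.77, 3970),
    "B-per": (0.83, 3472), "B-tim": (0.90, 4134), "I-art": (0.12, 63),
    "I-eve": (0.30, 49), "I-geo": (0.81, 1477), "I-gpe": (0.62, 43),
    "I-nat": (0.38, 12), "I-org": (0.80, 3306), "I-per": (0.87, 3480),
    "I-tim": (0.81, 1356), "O": (0.99, 177791),
}

total = sum(support for _, support in report_rows.values())

# Accuracy of a trivial model that predicts "O" for every token.
majority_accuracy = report_rows["O"][1] / total
# Macro average: every tag counts equally, however rare it is.
macro_f1 = sum(f1 for f1, _ in report_rows.values()) / len(report_rows)
# Weighted average: every tag counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report_rows.values()) / total

print("always-O accuracy: %.3f" % majority_accuracy)
print("macro F1:          %.3f" % macro_f1)
print("weighted F1:       %.3f" % weighted_f1)
```

A model that always predicts `O` would already be right on roughly 85% of test tokens, and the macro-averaged F1 (about 0.65) is far below the weighted one (about 0.97), which is why the per-tag rows are more informative than a single score.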

## End of Example.  Return to Slides

<div class="lev1">
    <a href="#NLP2-Lecture-2-Support-Notebook">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        Go to TOP
    </a>
</div>

# POS Tagging with BERT

In [25]:
__author__ = "kyubyong"
__address__ = "https://github.com/kyubyong/nlp_made_easy"
__email__ = "kbpark.linguist@gmail.com"

In [26]:
!pip install -q pytorch_pretrained_bert

In [27]:
import os
from tqdm import tqdm_notebook as tqdm
import numpy as np
import torch
import torch.nn as nn
from torch.utils import data
import torch.optim as optim
from pytorch_pretrained_bert import BertTokenizer

In [28]:
torch.__version__

'1.4.0'

#### Check if the GPU is being used by PyTorch

In [29]:
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_cached(0)/1024**3,1), 'GB')

Using device: cuda

Tesla K80
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB


In [30]:
import nltk
nltk.download('treebank')
tagged_sents = nltk.corpus.treebank.tagged_sents()
len(tagged_sents)

[nltk_data] Downloading package treebank to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


3914

In [31]:
tagged_sents[0]

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [32]:
tags = list(set(word_pos[1] for sent in tagged_sents for word_pos in sent))

In [33]:
",".join(tags)

"-NONE-,NN,PRP,RBR,CD,DT,FW,-RRB-,MD,POS,LS,VBZ,NNP,VB,VBN,NNPS,SYM,NNS,#,VBG,CC,WDT,WRB,JJ,.,PDT,:,,,$,-LRB-,``,WP,'',RBS,UH,RB,TO,JJR,JJS,VBD,EX,RP,IN,VBP,WP$,PRP$"

In [34]:
# By convention, the 0'th slot is reserved for padding.
tags = ["<pad>"] + tags

In [35]:
tag2idx = {tag:idx for idx, tag in enumerate(tags)}
idx2tag = {idx:tag for idx, tag in enumerate(tags)}

In [36]:
# Let's split the data into train and test (or eval)
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(tagged_sents, test_size=.1)
len(train_data), len(test_data)

(3522, 392)

In [37]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Data loader

In [38]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

100%|██████████| 213450/213450 [00:00<00:00, 23903299.75B/s]


In [39]:
class PosDataset(data.Dataset):
    def __init__(self, tagged_sents):
        sents, tags_li = [], [] # list of lists
        for sent in tagged_sents:
            words = [word_pos[0] for word_pos in sent]
            tags = [word_pos[1] for word_pos in sent]
            sents.append(["[CLS]"] + words + ["[SEP]"])
            tags_li.append(["<pad>"] + tags + ["<pad>"])
        self.sents, self.tags_li = sents, tags_li

    def __len__(self):
        return len(self.sents)

    def __getitem__(self, idx):
        words, tags = self.sents[idx], self.tags_li[idx] # words, tags: string list

        # Only the first word piece of each word keeps the tag; the other pieces get <pad>.
        x, y = [], [] # list of ids
        is_heads = [] # list. 1: the token is the first piece of a word
        for w, t in zip(words, tags):
            tokens = tokenizer.tokenize(w) if w not in ("[CLS]", "[SEP]") else [w]
            xx = tokenizer.convert_tokens_to_ids(tokens)

            is_head = [1] + [0]*(len(tokens) - 1)

            t = [t] + ["<pad>"] * (len(tokens) - 1)  # <PAD>: no decision
            yy = [tag2idx[each] for each in t]  # (T,)

            x.extend(xx)
            is_heads.extend(is_head)
            y.extend(yy)

        assert len(x)==len(y)==len(is_heads), "len(x)={}, len(y)={}, len(is_heads)={}".format(len(x), len(y), len(is_heads))

        # seqlen
        seqlen = len(y)

        # to string
        words = " ".join(words)
        tags = " ".join(tags)
        return words, x, is_heads, tags, y, seqlen
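
The alignment between word-level tags and sub-word pieces performed in `__getitem__` can be illustrated without BERT. In the sketch below, `toy_wordpiece` is a hypothetical stand-in that simply cuts words every four characters; the real `BertTokenizer` uses a learned vocabulary and would split words differently. The helper name `align` is likewise an assumption.

```python
def toy_wordpiece(word, max_len=4):
    """Hypothetical stand-in for a WordPiece tokenizer: cut every max_len characters."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]

def align(words, tags, pad_tag="<pad>"):
    """Give each word's tag to its first piece and pad_tag to the remaining pieces."""
    pieces, piece_tags, is_heads = [], [], []
    for word, tag in zip(words, tags):
        sub = toy_wordpiece(word)
        pieces.extend(sub)
        piece_tags.extend([tag] + [pad_tag] * (len(sub) - 1))
        is_heads.extend([1] + [0] * (len(sub) - 1))
    return pieces, piece_tags, is_heads

pieces, piece_tags, is_heads = align(["nonexecutive", "director"], ["JJ", "NN"])
print(pieces)
print(piece_tags)
print(is_heads)
```

The `is_heads` flags are what later allow the evaluation code to keep exactly one prediction per original word.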

In [40]:
def pad(batch):
    '''Pads to the longest sample'''
    f = lambda x: [sample[x] for sample in batch]
    words = f(0)
    is_heads = f(2)
    tags = f(3)
    seqlens = f(-1)
    maxlen = np.array(seqlens).max()

    f = lambda x, seqlen: [sample[x] + [0] * (seqlen - len(sample[x])) for sample in batch] # 0: <pad>
    x = f(1, maxlen)
    y = f(-2, maxlen)


    f = torch.LongTensor

    return words, f(x), is_heads, tags, f(y), seqlens
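
The core of `pad` is right-padding every sequence in a batch with 0 up to the length of the longest one, so that the batch can be stacked into a rectangular tensor. A standard-library sketch of that step, with the hypothetical helper name `pad_batch` and made-up token ids:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each list of ids with pad_id to the length of the longest list."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_batch(batch)
print(padded)
```

Padding per batch rather than to a global maximum length keeps short batches small.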

## Model

In [41]:
from pytorch_pretrained_bert import BertModel

In [42]:
class Net(nn.Module):
    def __init__(self, vocab_size=None):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased')

        self.fc = nn.Linear(768, vocab_size)
        self.device = device

    def forward(self, x, y):
        '''
        x: (N, T). int64
        y: (N, T). int64
        '''
        x = x.to(device)
        y = y.to(device)
        
        if self.training:
            self.bert.train()
            encoded_layers, _ = self.bert(x)
            enc = encoded_layers[-1]
        else:
            self.bert.eval()
            with torch.no_grad():
                encoded_layers, _ = self.bert(x)
                enc = encoded_layers[-1]
        
        logits = self.fc(enc)
        y_hat = logits.argmax(-1)
        return logits, y, y_hat


## Train and evaluate

In [43]:
def train(model, iterator, optimizer, criterion):
    model.train()
    for i, batch in enumerate(iterator):
        words, x, is_heads, tags, y, seqlens = batch
        _y = y # for monitoring
        optimizer.zero_grad()
        logits, y, _ = model(x, y) # logits: (N, T, VOCAB), y: (N, T)

        logits = logits.view(-1, logits.shape[-1]) # (N*T, VOCAB)
        y = y.view(-1)  # (N*T,)

        loss = criterion(logits, y)
        loss.backward()

        optimizer.step()

        if i%10==0: # monitoring
            print("step: {}, loss: {}".format(i, loss.item()))

In [44]:
def eval(model, iterator):
    model.eval()

    Words, Is_heads, Tags, Y, Y_hat = [], [], [], [], []
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            words, x, is_heads, tags, y, seqlens = batch

            _, _, y_hat = model(x, y)  # y_hat: (N, T)

            Words.extend(words)
            Is_heads.extend(is_heads)
            Tags.extend(tags)
            Y.extend(y.numpy().tolist())
            Y_hat.extend(y_hat.cpu().numpy().tolist())

    ## get the results and save them
    with open("result", 'w') as fout:
        for words, is_heads, tags, y_hat in zip(Words, Is_heads, Tags, Y_hat):
            y_hat = [hat for head, hat in zip(is_heads, y_hat) if head == 1]
            preds = [idx2tag[hat] for hat in y_hat]
            assert len(preds)==len(words.split())==len(tags.split())
            for w, t, p in zip(words.split()[1:-1], tags.split()[1:-1], preds[1:-1]):
                fout.write("{} {} {}\n".format(w, t, p))
            fout.write("\n")
            
    ## calc metric
    # Read the result file once; each non-empty line is "word gold_tag predicted_tag".
    with open('result', 'r') as fin:
        rows = [line.split() for line in fin.read().splitlines() if len(line) > 0]
    y_true = np.array([tag2idx[row[1]] for row in rows])
    y_pred = np.array([tag2idx[row[2]] for row in rows])

    acc = (y_true==y_pred).astype(np.int32).sum() / len(y_true)

    print("acc=%.2f"%acc)

## Load model and train

In [45]:
%%time
# around 00:45 minutes with gpu
model = Net(vocab_size=len(tag2idx))
model.to(device)
model = nn.DataParallel(model)

100%|██████████| 404400730/404400730 [00:08<00:00, 49892812.39B/s]


CPU times: user 12.3 s, sys: 3.13 s, total: 15.4 s
Wall time: 16.4 s


In [46]:
train_dataset = PosDataset(train_data)
eval_dataset = PosDataset(test_data)

train_iter = data.DataLoader(dataset=train_dataset,
                             batch_size=8,
                             shuffle=True,
                             num_workers=1,
                             collate_fn=pad)
test_iter = data.DataLoader(dataset=eval_dataset,
                             batch_size=8,
                             shuffle=False,
                             num_workers=1,
                             collate_fn=pad)

optimizer = optim.Adam(model.parameters(), lr = 0.0001)

criterion = nn.CrossEntropyLoss(ignore_index=0)
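
With `ignore_index=0`, positions whose target is the `<pad>` index contribute nothing to the loss, and the mean is taken over the remaining positions only. The sketch below reproduces that behaviour in plain Python for illustration. The function name and the toy logits are assumptions, and returning 0.0 when every position is ignored is a choice made for this sketch only.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index=0):
    """Mean negative log-likelihood over positions whose target != ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: no contribution to the loss
        log_norm = math.log(sum(math.exp(value) for value in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses) if losses else 0.0

# Three positions, three classes; class 0 plays the role of <pad>.
logits = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 0.0, 0.0]]
targets = [1, 1, 0]
loss = masked_cross_entropy(logits, targets)

# Changing the logits at the padded position leaves the loss untouched.
other_logits = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [-3.0, 4.0, 1.0]]
same_loss = masked_cross_entropy(other_logits, targets)
print(loss, same_loss)
```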

In [47]:
%%time
# around 02:16 minutes with gpu
train(model, train_iter, optimizer, criterion)
eval(model, test_iter)

step: 0, loss: 3.836632490158081
step: 10, loss: 1.5371752977371216
step: 20, loss: 0.7555820941925049
step: 30, loss: 0.3922635316848755
step: 40, loss: 0.23540589213371277
step: 50, loss: 0.317504346370697
step: 60, loss: 0.13820095360279083
step: 70, loss: 0.18537919223308563
step: 80, loss: 0.05420791730284691
step: 90, loss: 0.1245027482509613
step: 100, loss: 0.2243964523077011
step: 110, loss: 0.1313113123178482
step: 120, loss: 0.08636090904474258
step: 130, loss: 0.10636947304010391
step: 140, loss: 0.16407494246959686
step: 150, loss: 0.1347183883190155
step: 160, loss: 0.05791875720024109
step: 170, loss: 0.09987923502922058
step: 180, loss: 0.11940781772136688
step: 190, loss: 0.182859405875206
step: 200, loss: 0.07417254149913788
step: 210, loss: 0.1644778698682785
step: 220, loss: 0.11197341978549957
step: 230, loss: 0.0539681538939476
step: 240, loss: 0.3709862232208252
step: 250, loss: 0.10583696514368057
step: 260, loss: 0.09497590363025665
step: 270, loss: 0.106808081

In [48]:
open('result', 'r').read().splitlines()[:100]

['Characters NNS NNS',
 'drink VBP VBP',
 'Salty NNP NNP',
 'Dogs NNP NNPS',
 ', , ,',
 'whistle VBP VBP',
 '`` `` ``',
 'Johnny NNP NNP',
 'B. NNP NNP',
 'Goode NNP NNP',
 "'' '' ''",
 'and CC CC',
 'watch VBP VB',
 'Bugs NNP NNP',
 'Bunny NNP NNP',
 'reruns NNS NNS',
 '. . .',
 '',
 'In IN IN',
 'a DT DT',
 'disputed VBN VBN',
 '1985 CD CD',
 'ruling NN NN',
 ', , ,',
 'the DT DT',
 'Commerce NNP NNP',
 'Commission NNP NNP',
 'said VBD VBD',
 '0 -NONE- -NONE-',
 'Commonwealth NNP NNP',
 'Edison NNP NNP',
 'could MD MD',
 'raise VB VB',
 'its PRP$ PRP$',
 'electricity NN NN',
 'rates NNS NNS',
 'by IN IN',
 '$ $ $',
 '49 CD CD',
 'million CD CD',
 '*U* -NONE- -NONE-',
 '*-1 -NONE- -NONE-',
 'to TO TO',
 'pay VB VB',
 'for IN IN',
 'the DT DT',
 'plant NN NN',
 '. . .',
 '',
 '`` `` ``',
 'I PRP PRP',
 'deserve VBP VBP',
 'something NN NN',
 'for IN IN',
 'my PRP$ PRP$',
 'loyalty NN NN',
 ', , ,',
 "'' '' ''",
 'she PRP PRP',
 'says VBZ VBZ',
 '*T*-1 -NONE- -NONE-',
 '. . .',
 '',
 'T
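
Each line of the `result` file has the form "word gold_tag predicted_tag", so the mismatches are easy to tabulate. The sketch below does this for a handful of lines copied from the output above. The helper name `tag_errors` is hypothetical, and the counts apply only to this excerpt, not to the full test set.

```python
from collections import Counter

# A few lines copied from the output above ('' separates sentences).
lines = [
    'Characters NNS NNS', 'drink VBP VBP', 'Salty NNP NNP', 'Dogs NNP NNPS',
    ', , ,', 'whistle VBP VBP', 'and CC CC', 'watch VBP VB',
    'Bugs NNP NNP', '. . .', '',
]

def tag_errors(result_lines):
    """Count (gold, predicted) pairs that disagree, plus the accuracy of the excerpt."""
    errors, n_tokens = Counter(), 0
    for line in result_lines:
        if not line:
            continue  # sentence boundary
        word, gold, pred = line.split()
        n_tokens += 1
        if gold != pred:
            errors[(gold, pred)] += 1
    accuracy = (n_tokens - sum(errors.values())) / n_tokens
    return errors, accuracy

errors, accuracy = tag_errors(lines)
print(errors.most_common())
print("accuracy on this excerpt: %.2f" % accuracy)
```

In this excerpt both errors are confusions between closely related tags (`NNP` vs. `NNPS`, `VBP` vs. `VB`).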

## End of Example.  Return to Slides

<div class="lev1">
    <a href="#NLP2-Lecture-2-Support-Notebook">
        <span class="toc-item-num">&nbsp;&nbsp;</span>
        Go to TOP
    </a>
</div>