# demo 

We will go through 4 examples: 

* **[text classification ](#text_classification)** - the goal is to classify a single sentence or short text.


* **[text pair classification ](#text_pair_classification)** - the goal is to classify a pair of sentences or short texts.


* **[text pair regression ](#text_pair_regression)** - the goal is to predict a numerical value for a pair of sentences or short texts.


* **[Named Entity Recognition (NER)](#ner_conll_eng)** - the goal is to tag each token in a list of tokens as a person, location, organization, etc.

### A note on GPU cards and memory

While it's possible, running the examples without a GPU card of some sort would be too slow. In addition, the BERT models (especially the large model) are pretty big, so it helps to have more GPU memory.

The three parameters that reduce the GPU memory requirements the most are:

* **`bert_model`** - BERT models come in 2 sizes: `base` and `large`. As you would expect, the large model demands more GPU memory and takes longer to train. If you have a small GPU, start with any of the `base` models. The default is `'bert-base-uncased'`.

> `base(110M parameter models)` : `'bert-base-uncased'`, `'bert-base-cased'`, `'bert-base-multilingual-uncased'`, `'bert-base-multilingual-cased'`, and `'bert-base-chinese'`

> `large(340M parameter models)`: `'bert-large-uncased'` and `'bert-large-cased'`


* **`max_seq_length`** - the default is 128 with a max value of 512. Setting it to a smaller value like 96 or even 64 saves a lot of GPU memory and still gets good results on a lot of tasks.


* **`train_batch_size`** - the default is 32. Cutting it in half will save memory and should also still give good results.
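
As a rough way to compare settings, the sketch below counts the padded tokens processed per training batch relative to the defaults (`max_seq_length=128`, `train_batch_size=32`). The helper `relative_tokens_per_batch` is made up for this illustration and is not part of `bert-sklearn`. It is only a crude proxy: it ignores the fixed cost of the model weights and optimizer state, and attention memory grows faster than linearly with sequence length.

```python
# Hypothetical helper, not part of bert-sklearn: a crude proxy for how much
# work (and activation memory) a setting needs relative to the defaults.
DEFAULT_MAX_SEQ_LENGTH = 128
DEFAULT_TRAIN_BATCH_SIZE = 32


def relative_tokens_per_batch(max_seq_length, train_batch_size):
    """Padded tokens per batch, as a fraction of the default configuration."""
    if not 1 <= max_seq_length <= 512:
        raise ValueError("max_seq_length must be between 1 and 512")
    if train_batch_size < 1:
        raise ValueError("train_batch_size must be at least 1")
    default_tokens = DEFAULT_MAX_SEQ_LENGTH * DEFAULT_TRAIN_BATCH_SIZE
    return (max_seq_length * train_batch_size) / default_tokens


# Halving both parameters processes a quarter of the tokens per batch.
ratio = relative_tokens_per_batch(max_seq_length=64, train_batch_size=16)
print(ratio)
```

Treat the ratio as a guide for which setting to try first, not as a memory measurement.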


In addition to these three parameters, [huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT#Training-large-models-introduction,-tools-and-examples) has several options to reduce the GPU memory requirements, which are passed through in `bert-sklearn`:

* **`gradient_accumulation_steps`** - this is the number of steps over which gradients are accumulated before the optimizer performs an update step. The default is 1. Setting it to a higher integer (e.g. 2, 4, up to the **`train_batch_size`**) reduces GPU memory use at the cost of longer compute time. I use this a lot when I train BERT models on my laptop GPU.


* **`fp16`** - whether to use 16-bit float precision instead of 32-bit. The default is `False`. To enable half precision, you must install [Nvidia apex](https://github.com/NVIDIA/apex). Then setting this option to `True` will cut the model memory load in half. I use this when I train on my laptop GPU as well.
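
To make the gradient accumulation trade-off concrete, here is a small sketch of the arithmetic. It assumes `bert-sklearn` follows the huggingface example scripts, where the batch sent to the GPU on each forward pass is `train_batch_size // gradient_accumulation_steps`, and the optimizer updates only after that many smaller batches have been processed. The function names are illustrative and not part of either library.

```python
import math


def micro_batch_size(train_batch_size, gradient_accumulation_steps=1):
    """Batch size per forward/backward pass (assumed huggingface-style rule)."""
    if not 1 <= gradient_accumulation_steps <= train_batch_size:
        raise ValueError(
            "gradient_accumulation_steps must be between 1 and train_batch_size"
        )
    return train_batch_size // gradient_accumulation_steps


def iterations_per_epoch(n_train, train_batch_size, gradient_accumulation_steps=1):
    """Number of forward/backward passes needed to see every example once."""
    per_pass = micro_batch_size(train_batch_size, gradient_accumulation_steps)
    return math.ceil(n_train / per_pass)


# 900 training examples, as in the subsampled runs of this demo
print(micro_batch_size(16, 2))           # 8 examples on the GPU at a time
print(iterations_per_epoch(900, 16))     # 57
print(iterations_per_epoch(900, 16, 2))  # 113
```

The iteration counts agree with the progress bars in the runs later in this demo (57 for the classifiers, 113 for the NER model with `gradient_accumulation_steps=2`), which is consistent with the assumption above.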


Finally, two other system setups will help reduce the memory requirement:

* `multiple gpus` - on a single machine with multiple GPUs, following the huggingface port, the GPUs are detected and the load is split across the cards. Effectively this cuts the memory requirement in half.


* `distributed training` - the huggingface port allows you to train across distributed GPUs. The parameter **`local_rank`** is exposed in `bert-sklearn`, but this option has not been tested yet.


In [1]:
import os
import math
import random
import csv
import sys

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
import statistics as stats

from bert_sklearn import BertClassifier
from bert_sklearn import BertRegressor
from bert_sklearn import BertTokenClassifier
from bert_sklearn import load_model

def read_tsv(filename, quotechar=None):
    with open(filename, "r", encoding='utf-8') as f:
        return list(csv.reader(f, delimiter="\t", quotechar=quotechar))   

def flatten(l):
    return [item for sublist in l for item in sublist]

def read_CoNLL2003_format(filename, idx=3):
    """Read file in CoNLL-2003 shared task format"""
    
    # read file (the context manager closes the file handle when done)
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip()

    # find sentence-like boundaries
    lines = lines.split("\n\n")

    # split on newlines
    lines = [line.split("\n") for line in lines]
    
    # get tokens
    tokens = [[l.split()[0] for l in line] for line in lines]
    
    # get labels/tags
    labels = [[l.split()[idx] for l in line] for line in lines]
    
    #convert to df
    data= {'tokens': tokens, 'labels': labels}
    df=pd.DataFrame(data=data)
    
    return df


<a id='text_classification'></a>
# text classification 

For single text classification, we have the input data `X`, and target data `y` where:

* `X` is a list, pandas Series, or numpy array of text data.


* `y` is a list, pandas Series, or numpy array of text labels.

For this example, we will use the **`Stanford Sentiment Treebank (SST-2)`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). The **`SST-2`** task consists of sentences drawn from movie reviews and annotated with a sentiment label. 

See [website](https://nlp.stanford.edu/sentiment/code.html) and [paper](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) for more info.

The input features are short sentences and the labels are the standard sentiment polarity of:

*    0 for negative 


*    1 for positive.

## get data

First download the data using the GLUE downloader:

In [2]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples/glue_data --tasks SST

Downloading and extracting SST...
	Completed!


In [3]:
"""
SST-2 train data size: 67349 
SST-2 dev data size: 872 
"""
DATADIR = './glue_examples/glue_data'

def get_sst_data(train_file=DATADIR + '/SST-2/train.tsv',
                 dev_file=DATADIR + '/SST-2/dev.tsv'):

    train = pd.read_csv(train_file, sep='\t', encoding='utf8', keep_default_na=False)
    train.columns=['text', 'label']
    print("SST-2 train data size: %d "%(len(train)))
    
    dev = pd.read_csv(dev_file, sep='\t', encoding='utf8', keep_default_na=False)
    dev.columns=['text', 'label']
    print("SST-2 dev data size: %d "%(len(dev)))
    label_list = np.unique(train['label'])

    return train, dev, label_list

train, dev, label_list = get_sst_data()
train.head()

SST-2 train data size: 67349 
SST-2 dev data size: 872 


| | text | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |
| 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |


## setup data

We will subsample the data for the demo. To see a finetune run on the full data, see [SST-2.ipynb](https://github.com/charles9n/bert-sklearn/blob/master/glue_examples/SST-2.ipynb).

In [4]:
# subsample data 
n = 1000
train = train.sample(n, random_state=42)

X_train = train['text']
y_train = train['label']

# use the dev set for testing
test = dev
X_test = test['text']
y_test = test['label']

## define model

We will set up a classifier with the default settings, but let's reduce **`max_seq_length`** and **`train_batch_size`** so it can run on a smaller GPU. This config uses ~5GB of GPU memory on my laptop's 8GB GTX-1070:

In [5]:
model = BertClassifier(max_seq_length=64, train_batch_size=16)
model

Building sklearn text classifier...


BertClassifier(bert_model='bert-base-uncased', epochs=3, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1, ignore_label=None,
        label_list=None, learning_rate=2e-05, local_rank=-1,
        logfile='bert_sklearn.log', loss_scale=0, max_seq_length=64,
        num_mlp_hiddens=500, num_mlp_layers=0, random_state=42,
        restore_file=None, train_batch_size=16, use_cuda=True,
        validation_fraction=0.1, warmup_proportion=0.1)

## finetune model

Finetuning means fitting the model on the training data.

The `model.fit()` routine with default parameters:

* Loads the pretrained BERT model. The first run will be slower, as it downloads the model set in `model.bert_model` from the internet. Subsequent calls will be faster, as the model is saved in a local file cache.


* Uses 10% of the data for validation, set in `model.validation_fraction`, and finetunes BERT on the remainder for 3 epochs, set in `model.epochs`.
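
The validation hold-out can be pictured with the sketch below. This is not the actual `bert-sklearn` implementation, only a stand-in using Python's `random` module to show how a `validation_fraction` of 0.1 turns 1000 examples into 900 for training and 100 for validation, and why a different seed yields a different split. The function name and signature are assumptions for illustration.

```python
import random


def split_train_validation(n_samples, validation_fraction=0.1, seed=42):
    """Shuffle indices 0..n_samples-1 and hold out a fraction for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_val = round(n_samples * validation_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_train_validation(1000)
print(len(train_idx), len(val_idx))  # 900 100

# A different seed holds out a different set of examples.
_, other_val_idx = split_train_validation(1000, seed=4)
print(val_idx == other_val_idx)  # False
```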

In [6]:
%%time
model.fit(X_train, y_train)

100%|██████████| 231508/231508 [00:00<00:00, 946970.40B/s]


Loading bert-base-uncased model...


100%|██████████| 407873900/407873900 [00:42<00:00, 9656718.51B/s] 


Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 57/57 [00:16<00:00,  4.02it/s, loss=0.522]
                                                           

Epoch 1, Train loss: 0.5217, Val loss: 0.4849, Val accy: 79.00%


Training: 100%|██████████| 57/57 [00:15<00:00,  4.02it/s, loss=0.156]
                                                           

Epoch 2, Train loss: 0.1564, Val loss: 0.4529, Val accy: 82.00%


Training: 100%|██████████| 57/57 [00:16<00:00,  4.01it/s, loss=0.0379]
                                                           

Epoch 3, Train loss: 0.0379, Val loss: 0.5570, Val accy: 81.00%
CPU times: user 53.6 s, sys: 20.7 s, total: 1min 14s
Wall time: 1min 43s




BertClassifier(bert_model='bert-base-uncased', epochs=3, eval_batch_size=8,
        fp16=False, gradient_accumulation_steps=1, ignore_label=None,
        label_list=array([0, 1]), learning_rate=2e-05, local_rank=-1,
        logfile='bert_sklearn.log', loss_scale=0, max_seq_length=64,
        num_mlp_hiddens=500, num_mlp_layers=0, random_state=42,
        restore_file=None, train_batch_size=16, use_cuda=True,
        validation_fraction=0.1, warmup_proportion=0.1)

## score and make predictions on test data

In [7]:
# score model
accy = model.score(X_test, y_test)

# make class probability predictions
y_prob = model.predict_proba(X_test)
print("class prob estimates:\n", y_prob)

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_pred, y_test) * 100))

target_names = ['negative', 'positive']
print(classification_report(y_test, y_pred, target_names=target_names))

Predicting:   0%|          | 0/109 [00:00<?, ?it/s]       


Loss: 0.3805, Accuracy: 86.81%


Predicting:   0%|          | 0/109 [00:00<?, ?it/s]          

class prob estimates:
 [[0.00115015 0.99884987]
 [0.9683653  0.03163463]
 [0.00374866 0.9962514 ]
 ...
 [0.9052405  0.09475956]
 [0.19052215 0.8094778 ]
 [0.00482875 0.99517125]]


                                                             

Accuracy: 86.81%
              precision    recall  f1-score   support

    negative       0.88      0.85      0.86       428
    positive       0.86      0.89      0.87       444

   micro avg       0.87      0.87      0.87       872
   macro avg       0.87      0.87      0.87       872
weighted avg       0.87      0.87      0.87       872





## save/load model from disk

In [8]:
#save model to disk
savefile = '/data/test.bin'
model.save(savefile)

del model

# load model from disk
model = load_model(savefile)

# predict with new model
accy = model.score(X_test, y_test)

Loading model from /data/test.bin...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor


Testing:   0%|          | 0/109 [00:00<?, ?it/s]

Building sklearn text classifier...


                                                          


Loss: 0.3805, Accuracy: 86.81%




### random seed
The finetuned model weights depend on the random seed used to seed the PyTorch and NumPy RNGs, and the variance in test accuracy is higher when the training data is small. You can skip the next cell if you want. But if you want to check the variability over a few random seeds, it takes ~3 min to run and uses 6.5GB on my laptop GPU. Note that the random seed also affects the internal validation split in `model.fit()`.

In [9]:
%%time
scores = []
for seed in [4, 27, 33]:
    model.random_state = seed
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 57/57 [00:15<00:00,  4.06it/s, loss=0.515]
                                                           

Epoch 1, Train loss: 0.5146, Val loss: 0.3517, Val accy: 87.00%


Training: 100%|██████████| 57/57 [00:16<00:00,  4.11it/s, loss=0.167]
                                                           

Epoch 2, Train loss: 0.1672, Val loss: 0.3790, Val accy: 88.00%


Training: 100%|██████████| 57/57 [00:16<00:00,  4.10it/s, loss=0.0316]
                                                           

Epoch 3, Train loss: 0.0316, Val loss: 0.5478, Val accy: 87.00%


                                                          


Loss: 0.3528, Accuracy: 88.53%
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 57/57 [00:16<00:00,  4.05it/s, loss=0.554]
                                                           

Epoch 1, Train loss: 0.5535, Val loss: 0.4035, Val accy: 83.00%


Training: 100%|██████████| 57/57 [00:16<00:00,  4.02it/s, loss=0.234]
                                                           

Epoch 2, Train loss: 0.2345, Val loss: 0.3220, Val accy: 86.00%


Training: 100%|██████████| 57/57 [00:17<00:00,  3.78it/s, loss=0.122] 
                                                           

Epoch 3, Train loss: 0.1218, Val loss: 0.3950, Val accy: 84.00%


                                                          


Loss: 0.3871, Accuracy: 87.04%
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 57/57 [00:17<00:00,  3.62it/s, loss=0.528]
                                                           

Epoch 1, Train loss: 0.5283, Val loss: 0.4882, Val accy: 81.00%


Training: 100%|██████████| 57/57 [00:17<00:00,  3.62it/s, loss=0.146]
                                                           

Epoch 2, Train loss: 0.1458, Val loss: 0.3105, Val accy: 85.00%


Training: 100%|██████████| 57/57 [00:19<00:00,  3.04it/s, loss=0.0409]
                                                           

Epoch 3, Train loss: 0.0409, Val loss: 0.3568, Val accy: 88.00%


                                                          


Loss: 0.4476, Accuracy: 87.73%
CPU times: user 2min 11s, sys: 1min 1s, total: 3min 12s
Wall time: 3min 15s




In [17]:
# let's also add the accy from our earlier run, which used the default seed=42
scores = np.array(scores + [accy])
print(scores)
print("%0.2f%% (+/-%0.03f)"% (stats.mean(scores), stats.stdev(scores) * 2))

[88.53211009 87.0412844  87.7293578  86.81192661]
87.53% (+/-1.549)


<a id='text_pair_classification'></a>

# text pair classification

For text pair classification, we have input data `X` and target data `y`, where:

* `X` is a list, pandas DataFrame, or numpy array of text pairs (`text_a`, `text_b`).


* `y` is a list, pandas Series, or numpy array of text labels.

For this example, we will use the **`Quora Question Pairs (QQP)`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). This data consists of question pairs from the Quora website labeled as duplicate or not. See the [original release post](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) for more info.

The input features are pairs of questions (`text_a`, `text_b`) along with the labels:

*    0 if `text_a` and `text_b` are not duplicates

*    1 if `text_a` and `text_b` are duplicates
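
As a minimal sketch of this layout, the snippet below builds `X` and `y` from plain Python lists, using two question pairs from the QQP training data. The `check_text_pairs` helper is hypothetical (it is not part of `bert-sklearn`) and only verifies that every row is a pair of strings with a matching label.

```python
# Two parallel lists of questions and their duplicate labels.
text_a = ["What can one do after MBBS?",
          "What causes stool color to change to yellow?"]
text_b = ["What do i do after my MBBS ?",
          "What can cause stool to come out as little balls?"]
labels = [1, 0]

# One row per example: [text_a, text_b]
X = [list(pair) for pair in zip(text_a, text_b)]
y = labels


def check_text_pairs(X, y):
    """Hypothetical sanity check: each row is a pair of strings, one label per row."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    for row in X:
        if len(row) != 2 or not all(isinstance(text, str) for text in row):
            raise ValueError("each row of X must be a (text_a, text_b) pair")
    return True


print(check_text_pairs(X, y))
```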


## get data

In [18]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples/glue_data --tasks QQP

Downloading and extracting QQP...
	Completed!


In [19]:
"""
QQP train data size: 363849 
QQP dev data size: 40430 
"""

DATADIR = './glue_examples/glue_data'

def get_quora_df(filename):
    rows = read_tsv(filename)
    df=pd.DataFrame(rows[1:], columns=rows[0])
    df=df[['question1', 'question2', 'is_duplicate']]
    df = df[pd.notnull(df['is_duplicate'])]
    df.columns=['text_a', 'text_b', 'label']
    return df

def get_quora_data(train_file=DATADIR+'/QQP/train.tsv', 
                   dev_file=DATADIR+'/QQP/dev.tsv'):
    train = get_quora_df(train_file)
    print("QQP train data size: %d "%(len(train)))
    dev = get_quora_df(dev_file)
    print("QQP dev data size: %d "%(len(dev)))

    label_list = np.unique(train['label'].values)
    return train, dev, label_list

train, dev, label_list = get_quora_data()
train.head()

QQP train data size: 363849 
QQP dev data size: 40430 


| | text_a | text_b | label |
|---|---|---|---|
| 0 | How is the life of a math student? Could you d... | Which level of prepration is enough for the ex... | 0 |
| 1 | How do I control my horny emotions? | How do you control your horniness? | 1 |
| 2 | What causes stool color to change to yellow? | What can cause stool to come out as little balls? | 0 |
| 3 | What can one do after MBBS? | What do i do after my MBBS ? | 1 |
| 4 | Where can I find a power outlet for my laptop ... | Would a second airport in Sydney, Australia be... | 0 |


## setup data

We will subsample the data for the demo. To see a finetune run on the full data, see [QQP.ipynb](https://github.com/charles9n/bert-sklearn/blob/master/glue_examples/QQP.ipynb).

In [20]:
# subsample data 
n = 1000
train = train.sample(n, random_state=42)
dev = dev.sample(n, random_state=42)

X_train = train[['text_a', 'text_b']]
y_train = train['label']

# use the dev set for testing...
test = dev
X_test = test[['text_a', 'text_b']]
y_test = test['label']

## finetune

In [21]:
%%time
# define model
model = BertClassifier(max_seq_length=64, train_batch_size=16)

# fit model
model.fit(X_train, y_train)

# score model
model.score(X_test, y_test)

# make predictions
y_pred = model.predict(X_test)
print("Accuracy: %0.2f%%"%(metrics.accuracy_score(y_pred, y_test) * 100))

target_names = ['not duplicate', 'is duplicate']
print(classification_report(y_test, y_pred, target_names=target_names))

Building sklearn text classifier...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 57/57 [00:15<00:00,  4.14it/s, loss=0.649]
                                                           

Epoch 1, Train loss: 0.6490, Val loss: 0.5802, Val accy: 63.00%


Training: 100%|██████████| 57/57 [00:15<00:00,  4.13it/s, loss=0.463]
                                                           

Epoch 2, Train loss: 0.4628, Val loss: 0.5607, Val accy: 66.00%


Training: 100%|██████████| 57/57 [00:15<00:00,  4.06it/s, loss=0.277]
                                                           

Epoch 3, Train loss: 0.2770, Val loss: 0.7237, Val accy: 64.00%


Predicting:   0%|          | 0/125 [00:00<?, ?it/s]       


Loss: 0.5703, Accuracy: 73.60%


                                                             

Accuracy: 73.60%
               precision    recall  f1-score   support

not duplicate       0.83      0.72      0.77       617
 is duplicate       0.63      0.76      0.69       383

    micro avg       0.74      0.74      0.74      1000
    macro avg       0.73      0.74      0.73      1000
 weighted avg       0.75      0.74      0.74      1000

CPU times: user 43.6 s, sys: 20.3 s, total: 1min 3s
Wall time: 1min 5s




<a id='text_pair_regression'></a>

# text pair regression  

For text pair regression, we have input data `X` and target data `y`, where:

* `X` is a list, pandas DataFrame, or numpy array of text pairs (`text_a`, `text_b`).


* `y` is a list, pandas Series, or numpy array of floats.


For this example, we will use the **`STS-B`** data set from the [GLUE benchmarks](https://gluebenchmark.com/). The data consists of sentence pairs drawn from news headlines and image captions with annotated similarity scores ranging from 0 to 5.

See [website](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) and [paper](http://www.aclweb.org/anthology/S/S17/S17-2001.pdf) for more info.


## get data

In [17]:
%%bash
python3 ./glue_examples/download_glue_data.py --data_dir ./glue_examples/glue_data --tasks STS

Downloading and extracting STS...
	Completed!


In [22]:
"""
STS-B train data size: 5749 
STS-B dev data size: 1500 
"""

DATADIR = './glue_examples/glue_data'

def get_sts_b_df(filename):
    rows = read_tsv(filename)
    df=pd.DataFrame(rows[1:], columns=rows[0])
    df=df[['sentence1', 'sentence2', 'score']]    
    df.columns=['text_a', 'text_b', 'label']
    df.label = pd.to_numeric(df.label)
    df = df[pd.notnull(df['label'])]                
    return df

def get_sts_b_data(train_file=DATADIR + '/STS-B/train.tsv',
                   dev_file=DATADIR + '/STS-B/dev.tsv'):
    train = get_sts_b_df(train_file)
    print("STS-B train data size: %d "%(len(train)))    
    dev   = get_sts_b_df(dev_file)
    print("STS-B dev data size: %d "%(len(dev)))  
    return train,dev

train, dev = get_sts_b_data()
train.head()

STS-B train data size: 5749 
STS-B dev data size: 1500 


| | text_a | text_b | label |
|---|---|---|---|
| 0 | A plane is taking off. | An air plane is taking off. | 5.0 |
| 1 | A man is playing a large flute. | A man is playing a flute. | 3.8 |
| 2 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... | 3.8 |
| 3 | Three men are playing chess. | Two men are playing chess. | 2.6 |
| 4 | A man is playing the cello. | A man seated is playing the cello. | 4.25 |


## setup data

We will subsample the data for the demo. To see a finetune run on the full data, see [STS-B.ipynb](https://github.com/charles9n/bert-sklearn/blob/master/glue_examples/STS-B.ipynb).

In [24]:
# subsample data
n = 1000
train = train.sample(n, random_state=42)
dev = dev.sample(n, random_state=42)

X_train = train[['text_a', 'text_b']]
y_train = train['label']

# use the dev set for testing...
test = dev
X_test = test[['text_a', 'text_b']]
y_test = test['label']

## finetune

* For regression, the reported validation accuracy (`Val accy`) is the Pearson correlation, scaled by 100.
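
For reference, the Pearson correlation can be computed by hand as sketched below. This is a plain-Python illustration of the statistic, not the code `bert-sklearn` uses internally. The `y_true` values are the first five STS-B training labels, and the `y_pred` values are made up for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


y_true = [5.0, 3.8, 3.8, 2.6, 4.25]  # similarity scores from the data
y_pred = [4.6, 3.5, 4.0, 2.9, 4.1]   # made-up predictions for illustration
score = pearson_r(y_true, y_pred) * 100
print("Pearson : %0.2f" % score)
```

A perfectly linear relationship gives 100 on this scale, and unrelated predictions give a value near 0.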

In [25]:
%%time
from scipy.stats import pearsonr

# define model
model = BertRegressor()
model.max_seq_length = 64

# fit model
model.fit(X_train, y_train)

# score model
model.score(X_test, y_test)

# make predictions
y_pred = model.predict(X_test)
pearson_accy = pearsonr(y_pred, y_test)[0] * 100
print("Pearson : %0.2f"%(pearson_accy))

Building sklearn text regressor...
Loading bert-base-uncased model...
Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 29/29 [00:12<00:00,  2.84it/s, loss=3.16]
                                                           

Epoch 1, Train loss: 3.1627, Val loss: 1.0854, Val accy: 78.50%


Training: 100%|██████████| 29/29 [00:12<00:00,  2.81it/s, loss=0.715]
                                                           

Epoch 2, Train loss: 0.7148, Val loss: 0.7729, Val accy: 83.84%


Training: 100%|██████████| 29/29 [00:12<00:00,  2.80it/s, loss=0.333]
                                                           

Epoch 3, Train loss: 0.3329, Val loss: 0.6205, Val accy: 84.72%


Predicting:   0%|          | 0/125 [00:00<?, ?it/s]       


Loss: 0.5922, Accuracy: 86.04%


                                                             

Pearson : 86.04
CPU times: user 35.9 s, sys: 18.9 s, total: 54.8 s
Wall time: 56.2 s




<a id='ner_conll_eng'></a>


# CoNLL 2003 Named Entity Recognition (NER)

The **`CoNLL 2003`** shared task consists of data from the Reuters 1996 news corpus with annotations for 4 types of `Named Entities` (persons, locations, organizations, and miscellaneous entities). The data is in [IOB2](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format. Each entity token has a `'B-'` or `'I-'` tag indicating whether it is the first token of an entity or a token inside one.

* **`Person`**: `'B-PER'` and  `'I-PER'`


* **`Organization`**: `'B-ORG'` and `'I-ORG'`


* **`Location`**: `'B-LOC'`  and `'I-LOC'`


* **`Miscellaneous`**: `'B-MISC'` and `'I-MISC'`


* **`Other (non-named entity)`**: `'O'`

See [website](https://www.clips.uantwerpen.be/conll2003/ner/) and [paper](https://www.clips.uantwerpen.be/conll2003/pdf/14247tjo.pdf) for more info.

The data is already tokenized and tagged:

In [None]:
# tokens: EU     rejects  German  call  to  boycott  British  lamb . 
# tags  : B-ORG  O        B-MISC  O     O   O        B-MISC   O    O

So for the named entity recognition (NER) task, the data consists of features `X` and labels `y`:

* **`X`**: a list of lists of tokens


* **`y`**: a list of lists of NER tags
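
To make the IOB2 scheme concrete, the sketch below groups a tagged token list back into entity spans. The helper `iob2_to_spans` is written for this illustration and is not part of `bert-sklearn`. It assumes well-formed IOB2 input and simply drops an `'I-'` tag that does not continue an entity of the same type.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a 'B-' tag always starts a new entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # an 'I-' tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # 'O' (or a stray 'I-' tag) closes any open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(iob2_to_spans(tokens, tags))
# [('EU', 'ORG'), ('German', 'MISC'), ('British', 'MISC')]
```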


## get data

In [26]:
%%bash
cd other_examples
DATADIR="ner_english"
if test ! -d "$DATADIR";then
    echo "Creating $DATADIR dir"
    mkdir "$DATADIR"
    cd "$DATADIR"
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/train.txt
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/test.txt
    wget https://raw.githubusercontent.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/master/data/dev.txt
fi

In [27]:
"""
Train data: 14987 sentences, 204567 tokens
Dev data: 3466 sentences, 51578 tokens
Test data: 3684 sentences, 46666 tokens
"""
DATADIR = "./other_examples/ner_english/"

def get_conll2003_data(trainfile=DATADIR + "train.txt",
                       devfile=DATADIR + "dev.txt",
                       testfile=DATADIR + "test.txt"):

    train = read_CoNLL2003_format(trainfile)
    print("Train data: %d sentences, %d tokens"%(len(train), len(flatten(train.tokens))))
    dev = read_CoNLL2003_format(devfile)
    print("Dev data: %d sentences, %d tokens"%(len(dev), len(flatten(dev.tokens))))
    test = read_CoNLL2003_format(testfile)
    print("Test data: %d sentences, %d tokens"%(len(test), len(flatten(test.tokens))))
    
    return train, dev, test


train, dev, test = get_conll2003_data()
train.head()

Train data: 14987 sentences, 204567 tokens
Dev data: 3466 sentences, 51578 tokens
Test data: 3684 sentences, 46666 tokens


| | tokens | labels |
|---|---|---|
| 0 | [-DOCSTART-] | [O] |
| 1 | [EU, rejects, German, call, to, boycott, Briti... | [B-ORG, O, B-MISC, O, O, O, B-MISC, O, O] |
| 2 | [Peter, Blackburn] | [B-PER, I-PER] |
| 3 | [BRUSSELS, 1996-08-22] | [B-LOC, O] |
| 4 | [The, European, Commission, said, on, Thursday... | [O, B-ORG, I-ORG, O, O, O, O, O, O, B-MISC, O,... |


## setup data

We will subsample the data for the demo. To see a finetune run on the full data, see [ner_english.ipynb](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_english.ipynb).

In [28]:
X_train, y_train = train.tokens, train.labels
X_dev, y_dev = dev.tokens, dev.labels
X_test, y_test = test.tokens, test.labels

label_list = np.unique(flatten(y_train))
label_list = list(label_list)
print("\nNER tags:",label_list)

# take a subset of the data for demo
n = 1000
X_train, y_train = X_train[:n], y_train[:n]
X_test, y_test = X_test[:n], y_test[:n]


NER tags: ['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']


## finetune 


Let's define our model using the **`BertTokenClassifier`** class.

* We will include an **`ignore_label`** option to exclude the `'O'` (non-named-entity) label when calculating `f1`. The non-named entities are a huge majority of the labels, and `f1` is typically reported with them excluded.


* We will also use the cased model, `'bert-base-cased'`, as casing provides an important signal for NER. The first time you run this, it will take a little longer to download the model into the cache.


* With the `BertTokenClassifier` we should also be mindful to set **`max_seq_length`** high enough to cover the lengths of the token lists. See the extended demo for more detail.

This uses around 7GB on my laptop. If this gives you an OOM error, set **`max_seq_length`** to a lower number, e.g. 128 or 96.
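
A quick way to get a floor for the sequence length is to look at the longest token list, as sketched below. The helper name is made up for this example. It adds 2 for BERT's `[CLS]` and `[SEP]` markers, but it does not run the WordPiece tokenizer, which can split a token into several pieces, so the length actually needed may be higher.

```python
def seq_length_lower_bound(token_lists, n_special_tokens=2):
    """Longest token list plus room for BERT's [CLS] and [SEP] markers.

    This is only a lower bound: WordPiece may split a token into several
    sub-tokens, which this sketch does not account for.
    """
    if not token_lists:
        raise ValueError("token_lists must not be empty")
    return max(len(tokens) for tokens in token_lists) + n_special_tokens


sentences = [
    ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."],
    ["Peter", "Blackburn"],
    ["BRUSSELS", "1996-08-22"],
]
print(seq_length_lower_bound(sentences))  # 11
```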

In [29]:
%%time
# define model
model = BertTokenClassifier(bert_model='bert-base-cased',
                            epochs=3,
                            max_seq_length=173,
                            learning_rate=2e-5,
                            train_batch_size=16,
                            eval_batch_size=16,
                            gradient_accumulation_steps=2,
                            ignore_label=['O'])


print(model)

# fit model
model.fit(X_train, y_train)

# score model
f1_test = model.score(X_test, y_test)
print("Test f1: %0.02f"%(f1_test))

# make predictions
y_preds = model.predict(X_test)

# calculate the probability of each class
y_probs = model.predict_proba(X_test)

print(classification_report(flatten(y_test), flatten(y_preds)))

Building sklearn token classifier...
BertTokenClassifier(bert_model='bert-base-cased', epochs=3,
          eval_batch_size=16, fp16=False, gradient_accumulation_steps=2,
          ignore_label=['O'], label_list=None, learning_rate=2e-05,
          local_rank=-1, logfile='bert_sklearn.log', loss_scale=0,
          max_seq_length=173, num_mlp_hiddens=500, num_mlp_layers=0,
          random_state=42, restore_file=None, train_batch_size=16,
          use_cuda=True, validation_fraction=0.1, warmup_proportion=0.1)


100%|██████████| 213450/213450 [00:00<00:00, 860870.37B/s]


Loading bert-base-cased model...


100%|██████████| 404400730/404400730 [00:50<00:00, 8020332.83B/s] 


Defaulting to linear classifier/regressor
train data size: 900, validation data size: 100


Training: 100%|██████████| 113/113 [00:34<00:00,  3.72it/s, loss=0.0501]
                                                         

Epoch 1, Train loss: 0.0501, Val loss: 0.0102, Val accy: 95.16%, f1: 83.12


Training: 100%|██████████| 113/113 [00:35<00:00,  3.66it/s, loss=0.00848]
                                                         

Epoch 2, Train loss: 0.0085, Val loss: 0.0071, Val accy: 96.93%, f1: 90.42


Training: 100%|██████████| 113/113 [00:39<00:00,  3.33it/s, loss=0.00417]
                                                         

Epoch 3, Train loss: 0.0042, Val loss: 0.0055, Val accy: 97.30%, f1: 91.74


Predicting:   0%|          | 0/63 [00:00<?, ?it/s]         

Test f1: 87.65


                                                           

              precision    recall  f1-score   support

       B-LOC       0.84      0.86      0.85       418
      B-MISC       0.67      0.69      0.68       189
       B-ORG       0.84      0.82      0.83       489
       B-PER       0.99      0.92      0.95       655
       I-LOC       0.86      0.55      0.67        66
      I-MISC       0.59      0.72      0.65        82
       I-ORG       0.87      0.88      0.87       205
       I-PER       0.99      1.00      0.99       483
           O       0.99      0.99      0.99      8479

   micro avg       0.96      0.96      0.96     11066
   macro avg       0.85      0.83      0.83     11066
weighted avg       0.96      0.96      0.96     11066

CPU times: user 1min 56s, sys: 1min 7s, total: 3min 3s
Wall time: 3min 40s




## check results on test data

In [30]:
i = 152
tokens = X_test[i]
labels = y_test[i]
preds = y_preds[i]
data = {"token": tokens,"label": labels,"predict": preds}
df=pd.DataFrame(data=data)
print(df)

         token   label predict
0        Dutch  B-MISC  B-MISC
1      forward       O       O
2       Reggie   B-PER   B-PER
3      Blinker   I-PER   I-PER
4          had       O       O
5          his       O       O
6   indefinite       O       O
7   suspension       O       O
8       lifted       O       O
9           by       O       O
10        FIFA   B-ORG   B-ORG
11          on       O       O
12      Friday       O       O
13         and       O       O
14         was       O       O
15         set       O       O
16          to       O       O
17        make       O       O
18         his       O       O
19   Sheffield   B-ORG   B-LOC
20   Wednesday   I-ORG   I-ORG
21    comeback       O       O
22     against       O       O
23   Liverpool   B-ORG   B-ORG
24          on       O       O
25    Saturday       O       O
26           .       O       O


In [31]:
# pretty-print the class probabilities for this observation
prob = y_probs[i]
tokens_prob = model.tokens_proba(tokens, prob)

         token  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER    O
0        Dutch   0.11    0.81   0.06   0.01   0.00    0.00   0.00   0.00 0.00
1      forward   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
2       Reggie   0.00    0.00   0.00   1.00   0.00    0.00   0.00   0.00 0.00
3      Blinker   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.99 0.00
4          had   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
5          his   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
6   indefinite   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
7   suspension   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
8       lifted   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
9           by   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
10        FIFA   0.30    0.07   0.60   0.01   0.01    0.00   0.01   0.00 0.00
11          on   0.00    0.00   0.00   0.00   0.00    0.00   0.0

Finally, let's predict the tags and tag probabilities on some new text:

In [32]:
text = "Jefferson wants to go to France."       

tag_predicts  = model.tag_text(text)       
prob_predicts = model.tag_text_proba(text)    

Predicting:   0%|          | 0/1 [00:00<?, ?it/s]        

       token predicted tags
0  Jefferson          B-PER
1      wants              O
2         to              O
3         go              O
4         to              O
5     France          B-LOC
6          .              O


                                                         

       token  B-LOC  B-MISC  B-ORG  B-PER  I-LOC  I-MISC  I-ORG  I-PER    O
0  Jefferson   0.00    0.01   0.03   0.95   0.00    0.00   0.00   0.00 0.00
1      wants   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
2         to   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
3         go   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
4         to   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00
5     France   0.99    0.00   0.00   0.00   0.00    0.00   0.00   0.00 0.00
6          .   0.00    0.00   0.00   0.00   0.00    0.00   0.00   0.00 1.00


