## <center> Named Entity Recognition using HuggingFace BERT Transformer </center> <br><sup><center> W266: Final Project Source Code </center></sup><br><sup><center> Pierce Coggins and Bhuvnesh Sharma </center></sup>
___

This project will employ the HuggingFace Transformers framework and SimpleTransformers as the basis for developing a Transformer-based model for NER. Most transformer models have only been tested against the primary CoNLL-2003 NER dataset and not other more complex NER datasets. In this project we look to test the BERT pretained base model against more recent NER benchmarks that better represent modern day NER tasks. Here is an overview of the datasets we will be testing against:


Corpus | Year | Text Source | # of Tags 
:-----:|:-----:|:-----:|:-----:
CoNLL03 | 2003 | Reuters News | 4
W-NUT | 2015 - 2018 | User-generated Text | 7
GENIA | 2004 | Biology & Clinical Text | 6



While BERT-like transformer models have performed exceedingly well across various NLP tasks, including NER, this performance was conducted only on the most cannonical NLP datasets. Within the field of NER, the BERT model performance was assessed using the CoNLL-2003 dataset, sourced from Reuters News articles from 1996-1997 with only 4 named entities tagged. As illustrated above, many of the publicly available NER corpora are generated from new sources; however, the field of NER has progressed beyond the foundation established by the CoNLL-2003 dataset. For example, The W-NUT corpus (2015-2018) focuses on testing the generalization of NER models against a more diverse text environment and the GENIA corpus is generated from domain specific biology and clinical text. While the team at Google only presented their results for the CoNLL-2003 dataset, the objectiv in this project is to assess how generalizable the BERT pretrained models are when fine-tuned on more complex NER datasets.



In [5]:
import pandas as pd 
from simpletransformers.ner import NERModel

### Base Bert model on CoNLL-2003 dataset

In [None]:
model = NERModel('bert', 'bert-base-cased', use_cuda = False)

In [None]:
model.train_model('data/train.txt')

**Evaluating Base Bert model against CoNLL-2003 train dataset**

In [6]:
result, model_outputs, predictions = model.eval_model('data/train.txt')

Converting to features started.


HBox(children=(IntProgress(value=0, max=14041), HTML(value='')))




HBox(children=(IntProgress(value=0, max=1756), HTML(value='')))


{'eval_loss': 0.022640006968443673, 'precision': 0.9710453480945285, 'recall': 0.9704668283756755, 'f1_score': 0.9707560020432487}


** Evaluating Base Bert model against CoNNL-2003 test dataset **

In [3]:
test_result, test_model_outputs, test_predictions = model.eval_model('data/test.txt')

Features loaded from cache at cache_dir/cached_dev_bert_128_9_3453


HBox(children=(IntProgress(value=0, max=432), HTML(value='')))


{'eval_loss': 0.10845712301886158, 'precision': 0.9011353711790393, 'recall': 0.9140680368532955, 'f1_score': 0.9075556337408742}


### Base Bert model on W-NUT Dataset 

Read in Train and test datasets, standardize column data types and remove NAN values

In [76]:
wnut_train = pd.read_csv('data/train.csv')
wnut_test = pd.read_csv('data/test.csv')
wnut_train = pd.DataFrame(wnut_train)
wnut_test = pd.DataFrame(wnut_test)

convert_dict = {'sentence_id': int, 
                'words': str,
                'labels': str
               } 

wnut_train_new = wnut_train.dropna() 
wnut_test_new = wnut_test.dropna()

wnut_train_new = wnut_train_new.astype(convert_dict) 
wnut_test_new = wnut_test_new.astype(convert_dict)

In [77]:
wnut_train_new.head

<bound method NDFrame.head of        sentence_id                   words      labels
0                0               @paulwalk           O
1                0                      It           O
2                0                      's           O
3                0                     the           O
4                0                    view           O
5                0                    from           O
6                0                   where           O
7                0                       I           O
8                0                      'm           O
9                0                  living           O
10               0                     for           O
11               0                     two           O
12               0                   weeks           O
13               0                       .           O
14               0                  Empire  B-location
15               0                   State  I-location
16               0                B

**Building and training Base BERT model on W-NUT dataset **

Note updated labels have been applied to match with custom W-NUT entity labels

In [78]:
WNUT_model = NERModel('bert', 'bert-base-cased', labels=["O", "B-location", "I-location", "B-person", "I-person", "B-corporation", "I-corporation", "B-group", "I-group", "B-creative-work", "I-creative-work", "B-product", "I-product"], use_cuda = False)

In [79]:
WNUT_model.train_model(wnut_train_new, output_dir = 'outputs_wnut', args={'overwrite_output_dir': True})

Features loaded from cache at cache_dir/cached_train_bert_128_13_2173


HBox(children=(IntProgress(value=0, description='Epoch', max=1, style=ProgressStyle(description_width='initial…

HBox(children=(IntProgress(value=0, description='Current iteration', max=272, style=ProgressStyle(description_…

Running loss: 0.010380

Training of bert model complete. Saved to outputs_wnut.


** Evaluation Base BERT model against W-NUT Training dataset **

In [None]:
WNUT_result, WNUT_model_outputs, WNUT_predictions = WNUT_model.eval_model(wnut_train_new)

**Evaluating Base Bert model against W-NUT Test dataset**

In [46]:
WNUT_result, WNUT_model_outputs, WNUT_predictions = WNUT_model.eval_model(wnut_test_new)

Features loaded from cache at cache_dir/cached_dev_bert_128_13_1287


HBox(children=(IntProgress(value=0, max=161), HTML(value='')))


{'eval_loss': 0.07599619912527363, 'precision': 0.0, 'recall': 0, 'f1_score': 0}


### Base Bert model against GENIA Medical NER Dataset

In [61]:
genia_train = pd.read_csv('data/genia_train.csv')
genia_test = pd.read_csv('data/genie_test.csv')
genia_train_new = pd.DataFrame(genia_train)
genia_test_new = pd.DataFrame(genia_test)

In [65]:
convert_dict = {'sentence_id': int, 
                'words': str,
                'labels': str
               } 

genia_train_new = genia_train_new.dropna() 
genia_test_new = genia_test_new.dropna()

genia_train_new = genia_train_new.astype(convert_dict) 
genia_test_new = genia_test_new.astype(convert_dict)

genia_test_new.head

<bound method NDFrame.head of        sentence_id           words       labels
0                0          Number            O
1                0              of            O
2                0  glucocorticoid    B-protein
3                0       receptors    I-protein
4                0              in            O
5                0     lymphocytes  B-cell_type
6                0             and            O
7                0           their            O
8                0     sensitivity            O
9                0              to            O
10               0         hormone            O
11               0          action            O
12               0               .            O
14               1             The            O
15               1           study            O
16               1    demonstrated            O
17               1               a            O
18               1       decreased            O
19               1           level            O
20        

In [54]:
genia_model = NERModel('bert', 'bert-base-cased', labels=["O", "B-DNA", "I-DNA", "B-protein", "I-protein", "B-cell_type", "I-cell_type", "B-cell_line", "I-cell_line", "B-RNA", "I-RNA"], use_cuda = False)

In [55]:
genia_model.train_model(genia_train, output_dir = 'outputs_genia', args={'overwrite_output_dir': True})

Converting to features started.


HBox(children=(IntProgress(value=0, max=1418), HTML(value='')))




HBox(children=(IntProgress(value=0, description='Epoch', max=1, style=ProgressStyle(description_width='initial…

HBox(children=(IntProgress(value=0, description='Current iteration', max=178, style=ProgressStyle(description_…

Running loss: 0.333312

Training of bert model complete. Saved to outputs_genia.


In [68]:
genia_result, genia_model_outputs, genia_predictions = genia_model.eval_model(genia_train_new)

Converting to features started.


HBox(children=(IntProgress(value=0, max=1418), HTML(value='')))




HBox(children=(IntProgress(value=0, max=178), HTML(value='')))


{'eval_loss': 0.19831275284876315, 'precision': 0.6495744680851064, 'recall': 0.7490186457311089, 'f1_score': 0.6957611668185961}


In [67]:
genia_result, genia_model_outputs, genia_predictions = genia_model.eval_model(genia_test_new)

Converting to features started.


HBox(children=(IntProgress(value=0, max=560), HTML(value='')))




HBox(children=(IntProgress(value=0, max=70), HTML(value='')))


{'eval_loss': 0.2804098552891186, 'precision': 0.4858689116055322, 'recall': 0.6417791898332009, 'f1_score': 0.5530458590006845}
