Resume NER Part 4: Working with Flair NLP

---

In this part we will use Flair to train an NER model on our data and evaluate the results. Make sure you have set up your Google account and uploaded your files to Google Drive. This notebook is meant to run on Google Colab.

Let's change the working directory to the Google Drive folder that holds our training data, and then install Flair.

In [0]:
import os
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
os.chdir("/content/gdrive/My Drive/SAKI/NER/flair") 

Mounted at /content/gdrive
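
Before moving on, it can help to confirm that the corpus files are actually present in the working directory. The following is a minimal sketch, not part of the original notebook: `missing_files` is a hypothetical helper, and the file names are the ones used later in this notebook.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(folder: str, names: Iterable[str]) -> List[str]:
    """Return the entries of `names` that do not exist as files in `folder`."""
    base = Path(folder)
    return [name for name in names if not (base / name).is_file()]


# Check the current working directory for the two corpus files.
missing = missing_files(".", ["train_res_bilou.txt", "test_res_bilou.txt"])
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All corpus files found.")
```

If anything is reported as missing, the usual causes are a different Drive folder name or files that have not finished uploading.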


In [0]:
# install the flair library
!pip install flair

Collecting flair
  Downloading https://files.pythonhosted.org/packages/4e/3a/2e777f65a71c1eaa259df44c44e39d7071ba8c7780a1564316a38bf86449/flair-0.4.2-py3-none-any.whl (136kB)
Collecting sqlitedict>=1.6.0 (from flair)
  Downloading https://files.pythonhosted.org/packages/0f/1c/c757b93147a219cf1e25cef7e1ad9b595b7f802159493c45ce116521caff/sqlitedict-1.6.0.tar.gz
Collecting mpld3==0.3 (from flair)
  Downloading https://files.pythonhosted.org/packages/91/95/a52d3a83d0a29ba0d6898f6727e9858fe7a43f6c2ce81a5fe7e05f0f4912/mpld3-0.3.tar.gz (788kB)
Collecting bpemb>=0.2.9 (from flair)
  Downloading https://files.pythonhosted.org/packages/bc/70/468a9652095b370f797ed37ff77e742b11565c6fd79eaeca5f2e50b164a7/bpemb-0.3.0-py3-none-any.whl
Collecting deprecated>=1.2.4 (from flair)
  Downloading https://files.pythonhosted.org/packages/9f/7a/003fa432f1e45625626549726c2fbb7a29baa7

In the next section, we will train an NER model with Flair. The code is adapted from the Flair tutorial 7, "Training a Model":
https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md



In [0]:
from typing import List
from flair.data import Corpus
from flair.data_fetcher import NLPTaskDataFetcher

# folder where training and test data are
data_folder = '/content/gdrive/My Drive/SAKI/NER/flair'

train_file = 'train_res_bilou.txt'
test_file = 'test_res_bilou.txt'

# relevant columns for the "gold standard" in the bilou tagged corpus
columns = {1: 'text', 3: 'ner'}

# 1.0 is full data; use a smaller number like 0.1 to test run the code
downsample = 1.0 

corpus: Corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder, columns, train_file=train_file, test_file=test_file, dev_file=None
).downsample(downsample)
print(corpus)

tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary.idx2item)

2019-06-18 18:47:28,553 Reading data from /content/gdrive/My Drive/SAKI/NER/flair
2019-06-18 18:47:28,554 Train: /content/gdrive/My Drive/SAKI/NER/flair/train_res_bilou.txt
2019-06-18 18:47:28,560 Dev: None
2019-06-18 18:47:28,561 Test: /content/gdrive/My Drive/SAKI/NER/flair/test_res_bilou.txt


Corpus: 11569 train + 1285 dev + 3667 test sentences
[b'<unk>', b'O', b'B-Name', b'L-Name', b'B-Companies_worked_at', b'I-Companies_worked_at', b'L-Companies_worked_at', b'U-Companies_worked_at', b'B-College_Name', b'L-College_Name', b'I-College_Name', b'-', b'U-College_Name', b'I-Name', b'<START>', b'<STOP>']
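
To make the `columns` mapping more concrete, here is a minimal sketch of how a column-formatted file can be read. It assumes whitespace-separated columns and blank lines between sentences. The sample rows are invented for illustration (only the tag names come from the dictionary above), and `read_columns` is a hypothetical helper, not flair's own loader.

```python
from typing import Dict, List

# Made-up sample: column 1 holds the token, column 3 the BILOU tag.
# Columns 0 and 2 are placeholders that the mapping simply ignores.
SAMPLE = """\
0 Jane NNP B-Name
1 Doe NNP L-Name
2 worked VBD O
3 at IN O
4 Acme NNP U-Companies_worked_at

0 Studied VBD O
1 at IN O
2 Example NNP B-College_Name
3 State NNP I-College_Name
4 University NNP L-College_Name
"""


def read_columns(text: str, columns: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Split `text` into sentences; each token keeps only the mapped columns."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


sentences = read_columns(SAMPLE, {1: "text", 3: "ner"})
for token in sentences[0]:
    print(token["text"], token["ner"])
```

In the BILOU scheme, `B-`, `I-` and `L-` mark the beginning, inside and last token of a multi-token entity, `U-` marks a single-token entity, and `O` marks tokens outside any entity.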


In [0]:
# initialize embeddings
from flair.embeddings import TokenEmbeddings, CharacterEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings

embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),
    # CharacterEmbeddings(),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(
    hidden_size=256, embeddings=embeddings,
    tag_dictionary=tag_dictionary, tag_type='ner',
    use_crf=True
)

2019-06-18 18:47:35,337 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim.vectors.npy not found in cache, downloading to /tmp/tmpbgnfqaii


100%|██████████| 160000128/160000128 [00:17<00:00, 9230471.05B/s]

2019-06-18 18:47:53,693 copying /tmp/tmpbgnfqaii to cache at /root/.flair/embeddings/glove.gensim.vectors.npy





2019-06-18 18:47:53,913 removing temp file /tmp/tmpbgnfqaii
2019-06-18 18:47:55,002 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim not found in cache, downloading to /tmp/tmpk3x7hji0


100%|██████████| 21494764/21494764 [00:03<00:00, 5444640.47B/s]

2019-06-18 18:47:59,956 copying /tmp/tmpk3x7hji0 to cache at /root/.flair/embeddings/glove.gensim
2019-06-18 18:47:59,980 removing temp file /tmp/tmpk3x7hji0



2019-06-18 18:48:02,668 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-forward--h2048-l1-d0.05-lr30-0.25-20/news-forward-0.4.1.pt not found in cache, downloading to /tmp/tmpcdwbdiu1


100%|██████████| 73034624/73034624 [00:09<00:00, 7887117.57B/s]

2019-06-18 18:48:12,949 copying /tmp/tmpcdwbdiu1 to cache at /root/.flair/embeddings/news-forward-0.4.1.pt
2019-06-18 18:48:13,019 removing temp file /tmp/tmpcdwbdiu1





2019-06-18 18:48:21,774 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-backward--h2048-l1-d0.05-lr30-0.25-20/news-backward-0.4.1.pt not found in cache, downloading to /tmp/tmp6d_mj2pt


100%|██████████| 73034575/73034575 [00:09<00:00, 7709786.95B/s]

2019-06-18 18:48:32,305 copying /tmp/tmp6d_mj2pt to cache at /root/.flair/embeddings/news-backward-0.4.1.pt
2019-06-18 18:48:32,401 removing temp file /tmp/tmp6d_mj2pt
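
`StackedEmbeddings` combines several embeddings by concatenating their vectors for each token. The toy sketch below illustrates only that idea with plain Python lists. The vectors and their sizes are invented, and `stack` is a hypothetical helper rather than a flair function.

```python
from typing import List


def stack(vectors: List[List[float]]) -> List[float]:
    """Concatenate several per-token vectors into one longer vector."""
    stacked: List[float] = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Invented toy vectors standing in for one token's embeddings.
word_vec = [0.1, 0.2, 0.3]   # e.g. a word embedding
forward_vec = [1.0, 2.0]     # e.g. a forward language-model embedding
backward_vec = [-1.0, -2.0]  # e.g. a backward language-model embedding

token_vec = stack([word_vec, forward_vec, backward_vec])
print(len(token_vec))  # 7
```

The length of the stacked vector is the sum of the individual lengths, which is why adding more embedding types increases the input size of the tagger.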





In [5]:
from flair.trainers import ModelTrainer
from pathlib import Path

model_name = 'resources/taggers/flair-wordflairembeddings'
if downsample < 1.0:
    # keep test runs on downsampled data in a separate folder
    model_name += '-test'

# only checkpoint (and resume) runs on the full data
use_checkpoints = downsample == 1.0
checkpoint_path = Path(model_name) / 'checkpoint.pt'
if use_checkpoints and checkpoint_path.exists():
    # resume from the saved checkpoint
    checkpoint = tagger.load_checkpoint(checkpoint_path)
    trainer: ModelTrainer = ModelTrainer.load_from_checkpoint(checkpoint, corpus)
else:
    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# start/continue training
trainer.train(
    model_name, learning_rate=0.1, mini_batch_size=32,
    # anneal_with_restarts=True,
    max_epochs=50,
    train_with_dev=False,  # notebook output shows a run with train_with_dev=True
    checkpoint=use_checkpoints
)

2019-06-18 18:48:34,038 loading file resources/taggers/flair-wordflairembeddings-nodev/checkpoint.pt
2019-06-18 18:48:39,209 ----------------------------------------------------------------------------------------------------
2019-06-18 18:48:39,218 Evaluation method: MICRO_F1_SCORE
2019-06-18 18:48:41,144 ----------------------------------------------------------------------------------------------------
2019-06-18 18:48:42,576 epoch 30 - iter 0/402 - loss 0.19545156
2019-06-18 18:49:10,527 epoch 30 - iter 40/402 - loss 0.19786211
2019-06-18 18:49:45,029 epoch 30 - iter 80/402 - loss 0.21883249
2019-06-18 18:50:12,880 epoch 30 - iter 120/402 - loss 0.20899170
2019-06-18 18:50:44,409 epoch 30 - iter 160/402 - loss 0.21028733
2019-06-18 18:51:13,363 epoch 30 - iter 200/402 - loss 0.21183648
2019-06-18 18:51:41,159 epoch 30 - iter 240/402 - loss 0.20423505
2019-06-18 18:52:11,540 epoch 30 - iter 280/402 - loss 0.20235777
2019-06-18 18:52:42,389 epoch 30 - iter 320/402 - loss 0.20012768

{'dev_loss_history': [],
 'dev_score_history': [],
 'test_score': 0.7326,
 'train_loss_history': [0.19494456864559828,
  0.1939115002009999,
  0.19734645368003134,
  0.18257623558762062,
  0.16943227295851826,
  0.18599699165171651,
  0.18081205983215304,
  0.17127067643908125,
  0.1746583947198308,
  0.15683405879718154,
  0.1502690320806717,
  0.13643562435102996,
  0.14545134799693948,
  0.14352850059964764,
  0.13466501158120028,
  0.13289340272249273,
  0.1292913343516452,
  0.12182469621523102,
  0.12411503006347377,
  0.12359519671667274,
  0.12156489881367158,
  0.12187670634605398,
  0.1163323722217656,
  0.12312254132888284,
  0.11414515104756426,
  0.11460021330942562,
  0.11136574598390665,
  0.11380362604363631,
  0.11198965602772153,
  0.10533681278353307,
  0.10926857564728058,
  0.10718384726130548,
  0.10442720870695897,
  0.09965691947614524,
  0.09491823016855847]}
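
The training log reports `MICRO_F1_SCORE` as the evaluation method. As a rough sketch of what micro-averaging means, the example below pools true positives, false positives and false negatives over all entity classes before computing precision, recall and F1. The counts are invented for illustration and are not the results of this run. `micro_f1` is a hypothetical helper, not flair's evaluation code.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-class (true pos., false pos., false neg.) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions: avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts, for illustration only.
counts = {
    "Name": (90, 10, 5),
    "Companies_worked_at": (60, 30, 40),
    "College_Name": (50, 20, 15),
}
print(round(micro_f1(counts), 4))  # 0.7692
```

Because the counts are pooled, frequent classes weigh more heavily in the micro-averaged score than rare ones.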

In [0]:
# only works when training the model in one go,
# otherwise the loss.tsv file gets overwritten by flair
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves(model_name + '/loss.tsv')
plotter.plot_weights(model_name + '/weights.txt')
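
Since `loss.tsv` is overwritten when training is resumed from a checkpoint, one workaround is to copy the file aside before each resumed run. The sketch below uses only the standard library. `backup_loss_file` is a hypothetical helper, and the timestamped file name is an arbitrary choice.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def backup_loss_file(model_dir: str) -> Optional[Path]:
    """Copy loss.tsv in `model_dir` to a timestamped file.

    Returns the path of the copy, or None if there is no loss.tsv yet.
    """
    source = Path(model_dir) / "loss.tsv"
    if not source.is_file():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"loss-{stamp}.tsv")
    shutil.copy2(source, target)
    return target


# Call this before trainer.train(...) when resuming from a checkpoint.
backup = backup_loss_file("resources/taggers/flair-wordflairembeddings")
print("Backup:", backup)
```

The backed-up files could then be concatenated by hand to plot the full training history across runs.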