Resume NER Part 4: Working with Flair NLP

---

In this part we will use flair NLP to train a model on our data and evaluate the results. Please make sure you have set up your Google account and uploaded your files to Google drive. This Notebook should run on Google Colab.

Let's change the working directory to the Google drive where our training data is, and install flair nlp. 

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive/


In [2]:
with open("/content/drive/My Drive/SAKI_2019/dataset/converted_resumes.json","r") as f: 
    data = f.read()
print(len(data))


3202595


In [3]:
# download flair library #
! pip install flair

Collecting flair
[?25l  Downloading https://files.pythonhosted.org/packages/4e/3a/2e777f65a71c1eaa259df44c44e39d7071ba8c7780a1564316a38bf86449/flair-0.4.2-py3-none-any.whl (136kB)
[K     |████████████████████████████████| 143kB 3.0MB/s 
[?25hCollecting sqlitedict>=1.6.0 (from flair)
  Downloading https://files.pythonhosted.org/packages/0f/1c/c757b93147a219cf1e25cef7e1ad9b595b7f802159493c45ce116521caff/sqlitedict-1.6.0.tar.gz
Collecting bpemb>=0.2.9 (from flair)
  Downloading https://files.pythonhosted.org/packages/bc/70/468a9652095b370f797ed37ff77e742b11565c6fd79eaeca5f2e50b164a7/bpemb-0.3.0-py3-none-any.whl
Collecting mpld3==0.3 (from flair)
[?25l  Downloading https://files.pythonhosted.org/packages/91/95/a52d3a83d0a29ba0d6898f6727e9858fe7a43f6c2ce81a5fe7e05f0f4912/mpld3-0.3.tar.gz (788kB)
[K     |████████████████████████████████| 798kB 43.8MB/s 
Collecting regex (from flair)
[?25l  Downloading https://files.pythonhosted.org/packages/6f/4e/1b178c38c9a1a184288f72065a65ca01f3154df

In the next section, we will train a NER model with flair. This code is taken from the flair nlp tutorials section 7. "Training a model" 
https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md



In [0]:
# imports 
from flair.data import Corpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from typing import List

# columns of "gold standard" ner annotations and text
columns = {3: 'text', 1: 'ner'}

# folder where training and test data are
data_folder = '/content/drive/My Drive/SAKI_2019/dataset/flair'

# 2. what tag do we want to predict?
tag_type = 'ner'


In [9]:
downsample = 1.0 # 1.0 is full data, try a much smaller number like 0.01 to test run the code
# 1. get the corpus
corpus: Corpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns,
                                                              train_file='train_res_bilou.txt',
                                                              test_file='test_res_bilou.txt',
                                                              dev_file=None).downsample(downsample)
print(corpus)

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)


2019-06-16 17:36:36,010 Reading data from /content/drive/My Drive/SAKI_2019/dataset/flair
2019-06-16 17:36:36,012 Train: /content/drive/My Drive/SAKI_2019/dataset/flair/train_res_bilou.txt
2019-06-16 17:36:36,013 Dev: None
2019-06-16 17:36:36,014 Test: /content/drive/My Drive/SAKI_2019/dataset/flair/test_res_bilou.txt


  
  train_file, column_format
  test_file, column_format


Corpus: 287375 train + 31930 dev + 319305 test sentences
[b'<unk>', b'O', b'"I-Companies', b'-', b'"U-Companies', b'I-Degree', b'"L-Companies', b'B-Name', b'B-Degree', b'"B-Companies', b'L-Degree', b'U-Degree', b'L-Name', b'I-Name', b'ner', b'<START>', b'<STOP>']


In [10]:

# 4. initialize embeddings. Experiment with different embedding types to see what gets the best results
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings,FlairEmbeddings
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),
    # comment in this line to use character embeddings
    # CharacterEmbeddings(),

    # comment in these lines to use flair embeddings (needs a LONG time to train :-)
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)




2019-06-16 17:45:24,570 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim.vectors.npy not found in cache, downloading to /tmp/tmpvzszo_wh


100%|██████████| 160000128/160000128 [00:08<00:00, 18430901.01B/s]

2019-06-16 17:45:33,802 copying /tmp/tmpvzszo_wh to cache at /root/.flair/embeddings/glove.gensim.vectors.npy





2019-06-16 17:45:34,024 removing temp file /tmp/tmpvzszo_wh
2019-06-16 17:45:34,518 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim not found in cache, downloading to /tmp/tmp1ml3yifh


100%|██████████| 21494764/21494764 [00:01<00:00, 12528775.09B/s]

2019-06-16 17:45:36,756 copying /tmp/tmp1ml3yifh to cache at /root/.flair/embeddings/glove.gensim
2019-06-16 17:45:36,781 removing temp file /tmp/tmp1ml3yifh



  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


2019-06-16 17:45:38,021 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-forward--h2048-l1-d0.05-lr30-0.25-20/news-forward-0.4.1.pt not found in cache, downloading to /tmp/tmpn_21mirz


100%|██████████| 73034624/73034624 [00:04<00:00, 16201632.38B/s]

2019-06-16 17:45:43,073 copying /tmp/tmpn_21mirz to cache at /root/.flair/embeddings/news-forward-0.4.1.pt





2019-06-16 17:45:43,182 removing temp file /tmp/tmpn_21mirz
2019-06-16 17:45:51,525 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-backward--h2048-l1-d0.05-lr30-0.25-20/news-backward-0.4.1.pt not found in cache, downloading to /tmp/tmpnzszxmw6


100%|██████████| 73034575/73034575 [00:03<00:00, 18829618.17B/s]

2019-06-16 17:45:55,942 copying /tmp/tmpnzszxmw6 to cache at /root/.flair/embeddings/news-backward-0.4.1.pt





2019-06-16 17:45:56,059 removing temp file /tmp/tmpnzszxmw6


In [0]:
# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
trainer.train('resources/taggers/resume-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150)

# 8. plot training curves (optional)
#from flair.visual.training_curves import Plotter
#plotter = Plotter()
#plotter.plot_training_curves('resources/taggers/example-ner/loss.tsv')
#plotter.plot_weights('resources/taggers/example-ner/weights.txt')


2019-06-16 17:46:04,055 ----------------------------------------------------------------------------------------------------
2019-06-16 17:46:04,057 Evaluation method: MICRO_F1_SCORE
2019-06-16 17:46:04,403 ----------------------------------------------------------------------------------------------------
2019-06-16 17:46:04,878 epoch 1 - iter 0/8981 - loss 2.70170760
2019-06-16 17:47:05,017 epoch 1 - iter 898/8981 - loss 0.17520278
2019-06-16 17:48:04,936 epoch 1 - iter 1796/8981 - loss 0.14706109
2019-06-16 17:49:05,807 epoch 1 - iter 2694/8981 - loss 0.13494926
2019-06-16 17:50:05,864 epoch 1 - iter 3592/8981 - loss 0.12657222
2019-06-16 17:51:06,201 epoch 1 - iter 4490/8981 - loss 0.12074549
2019-06-16 17:52:07,710 epoch 1 - iter 5388/8981 - loss 0.11594817
2019-06-16 17:53:09,458 epoch 1 - iter 6286/8981 - loss 0.11331125
2019-06-16 17:54:10,606 epoch 1 - iter 7184/8981 - loss 0.11087043
2019-06-16 17:55:12,353 epoch 1 - iter 8082/8981 - loss 0.10855913
2019-06-16 17:56:11,579 ep