<a href="https://colab.research.google.com/github/jloutz/Resume_NER/blob/master/flair_nlp_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Resume NER Part 4: Working with Flair NLP

---

In this part we will use Flair NLP to train a model on our data and evaluate the results. Please make sure you have set up your Google account and uploaded your files to Google Drive. This notebook should run on Google Colab.

Let's mount Google Drive, change the working directory to the Drive folder where our training data is, and install Flair.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
import os
os.chdir("/content/gdrive/My Drive/flair_final") 
os.listdir('.')

['resources']

In [1]:
# download flair library #
! pip install flair

Collecting flair
  Downloading https://files.pythonhosted.org/packages/4e/3a/2e777f65a71c1eaa259df44c44e39d7071ba8c7780a1564316a38bf86449/flair-0.4.2-py3-none-any.whl (136kB)
     |████████████████████████████████| 143kB 2.8MB/s 
Collecting pytorch-pretrained-bert>=0.6.1 (from flair)
  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123kB)
     |████████████████████████████████| 133kB 39.6MB/s 
Collecting regex (from flair)
  Downloading https://files.pythonhosted.org/packages/6f/4e/1b178c38c9a1a184288f72065a65ca01f3154df43c6ad898624149b8b4e0/regex-2019.06.08.tar.gz (651kB)
     |████████████████████████████████| 655kB 37.5MB/s 
Collecting bpemb>=0.2.9 (from flair)
  Downloading https://files.pythonhosted.org/packages/bc/70/468a9652095b370f797ed37ff77e742b11565c6fd79eaeca5f2e50b164a7/bpemb-0.3.0-py3-none-any.whl
Collecting mpld3==

In the next section, we will train an NER model with Flair. This code is taken from section 7 of the Flair tutorials, "Training a Model":
https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md
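Before loading the data, it can help to sanity-check the training file. The following is a minimal sketch in plain Python, not Flair's own loader. It assumes the usual column format: one token per line, whitespace-separated columns, and a blank line between sentences. The function name `read_column_file`, the placeholder middle column `X`, and the tags in `sample` are invented for illustration; only the `columns` mapping mirrors the one used in this notebook.

```python
from typing import Dict, List


def read_column_file(text: str, columns: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Parse whitespace-separated column data; a blank line ends a sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # blank line: close the sentence collected so far
            if current:
                sentences.append(current)
                current = []
            continue
        # keep only the columns named in the mapping
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


# hypothetical sample: column 0 is the token, column 1 is ignored, column 2 is the tag
sample = """Senior X B-Designation
Sales X I-Designation
Manager X L-Designation
in X O
Atlanta X O

Python X U-Skills
"""

sentences = read_column_file(sample, {0: 'text', 2: 'ner'})
print(len(sentences), sentences[0][0])
```

If a line has fewer columns than the mapping expects, this sketch raises an `IndexError`, which is a quick way to spot malformed rows.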



In [0]:
# imports
from flair.data import Corpus
from flair.data_fetcher import NLPTaskDataFetcher

# map column index to annotation layer; make sure this matches your file structure
columns = {0: 'text', 2: 'ner'}

# folder where the training and test data are
data_folder = '/content/gdrive/My Drive/flair_nan'

# 1.0 uses the full data; try a much smaller value such as 0.1 for a quick test run
downsample = 1.0

# your train file name
train_file = 'train_res_bilou_flair_nan.txt'

# your test file name
test_file = 'test_res_bilou_flair_nan.txt'

# 1. get the corpus (with dev_file=None, a dev set is sampled from the training data)
corpus: Corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder, columns,
    train_file=train_file,
    test_file=test_file,
    dev_file=None).downsample(downsample)
print(corpus)

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary.idx2item)


2019-06-18 18:39:00,632 Reading data from /content/gdrive/My Drive/flair_nan
2019-06-18 18:39:00,633 Train: /content/gdrive/My Drive/flair_nan/train_res_bilou_flair_nan.txt
2019-06-18 18:39:00,634 Dev: None
2019-06-18 18:39:00,639 Test: /content/gdrive/My Drive/flair_nan/test_res_bilou_flair_nan.txt




Corpus: 7202 train + 800 dev + 2929 test sentences
[b'<unk>', b'O', b'B-Designation', b'L-Designation', b'-', b'I-Designation', b'B-Skills', b'L-Skills', b'I-Skills', b'U-Skills', b'U-Designation', b'B-Degree', b'I-Degree', b'L-Degree', b'U-Degree', b'<START>', b'<STOP>']
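The tag dictionary above uses the BILOU scheme: `B-` begins a multi-token entity, `I-` continues it, `L-` ends it, `U-` marks a single-token entity, and `O` is outside any entity. As a rough illustration, a tag sequence can be decoded into entity spans like this. The helper `bilou_to_spans` is hypothetical and not part of Flair; it simply drops malformed sequences.

```python
from typing import List, Tuple


def bilou_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_inclusive, label) for each well-formed BILOU span."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition('-')
        if prefix == 'U' and name:
            # single-token entity
            spans.append((i, i, name))
            start, label = None, None
        elif prefix == 'B' and name:
            # open a new multi-token entity
            start, label = i, name
        elif prefix == 'I' and name == label and start is not None:
            # still inside the open entity
            continue
        elif prefix == 'L' and name == label and start is not None:
            # close the open entity
            spans.append((start, i, name))
            start, label = None, None
        else:
            # 'O' or an inconsistent tag: discard any open entity
            start, label = None, None
    return spans


tags = ['B-Designation', 'I-Designation', 'L-Designation', 'O', 'U-Skills']
spans = bilou_to_spans(tags)
print(spans)
```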


In [0]:

# 4. initialize embeddings. Experiment with different embedding types to see what gives the best results
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings, CharacterEmbeddings
from typing import List

embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),
    # uncomment this line to use character embeddings
    #CharacterEmbeddings(),

    # comment out these two lines to skip flair embeddings (they need a LONG time to train :-)
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type='ner',
                                        use_crf=True)

2019-06-18 18:39:12,715 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim.vectors.npy not found in cache, downloading to /tmp/tmp8wcvnwel

100%|██████████| 160000128/160000128 [00:18<00:00, 8743468.83B/s]

2019-06-18 18:39:32,039 copying /tmp/tmp8wcvnwel to cache at /root/.flair/embeddings/glove.gensim.vectors.npy
2019-06-18 18:39:32,779 removing temp file /tmp/tmp8wcvnwel
2019-06-18 18:39:33,907 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim not found in cache, downloading to /tmp/tmpkp1z97z5

100%|██████████| 21494764/21494764 [00:03<00:00, 6983033.22B/s]

2019-06-18 18:39:38,012 copying /tmp/tmpkp1z97z5 to cache at /root/.flair/embeddings/glove.gensim
2019-06-18 18:39:38,036 removing temp file /tmp/tmpkp1z97z5



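`StackedEmbeddings` combines several embeddings by concatenating their vectors for each token, so the tagger sees one long vector per word. The toy sketch below shows the idea in plain Python. The lookup tables, the dimensions, and the `stack` helper are made up for illustration; real Flair embeddings are contextual and are computed by a language model, not read from a table.

```python
from typing import Dict, List


def stack(token: str, tables: List[Dict[str, List[float]]], dims: List[int]) -> List[float]:
    """Concatenate a token's vector from each table; unknown tokens get zeros."""
    stacked: List[float] = []
    for table, dim in zip(tables, dims):
        stacked.extend(table.get(token, [0.0] * dim))
    return stacked


# two toy "embeddings" with 2 and 1 dimensions (hypothetical values)
word_table = {'python': [1.0, 2.0]}
char_table = {'python': [3.0]}

vector = stack('python', [word_table, char_table], [2, 1])
print(vector, len(vector))
```

The length of the stacked vector is the sum of the individual dimensions, which is why adding more embedding types makes training slower and more memory-hungry.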


In [0]:
# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# give your model a name and folder of your choice; the model is saved there for loading later.
# you can run this notebook many times with different embeddings/params and save the models under different names
model_name = 'resources/taggers/ner-word_final'

#resources/taggers/resume-ner-word_forward_backward_final == only wordembedding

# 7. start training - reduce the batch size if you get memory errors.
# How many epochs does it take before the model stops improving? Start with a large number like 150;
# you can stop the code cell at any time - the framework persists the best model even if you interrupt training.
trainer.train(model_name,
              learning_rate=0.1,
              mini_batch_size=32,
              #anneal_with_restarts=True,
              max_epochs=150)




2019-06-18 18:40:05,667 ----------------------------------------------------------------------------------------------------
2019-06-18 18:40:05,673 Evaluation method: MICRO_F1_SCORE
2019-06-18 18:40:06,039 ----------------------------------------------------------------------------------------------------
2019-06-18 18:40:06,918 epoch 1 - iter 0/226 - loss 19.15983200
2019-06-18 18:40:13,361 epoch 1 - iter 22/226 - loss 11.42582655
2019-06-18 18:40:21,876 epoch 1 - iter 44/226 - loss 10.11105117
2019-06-18 18:40:32,002 epoch 1 - iter 66/226 - loss 9.00965321
2019-06-18 18:40:39,820 epoch 1 - iter 88/226 - loss 8.32511708
2019-06-18 18:40:46,567 epoch 1 - iter 110/226 - loss 8.18648899
2019-06-18 18:40:55,145 epoch 1 - iter 132/226 - loss 7.91619583
2019-06-18 18:41:03,798 epoch 1 - iter 154/226 - loss 7.66961915
2019-06-18 18:41:10,822 epoch 1 - iter 176/226 - loss 7.54908306
2019-06-18 18:41:20,375 epoch 1 - iter 198/226 - loss 7.27625955
2019-06-18 18:41:27,775 epoch 1 - iter 220/22
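The `iter .../226` counter in the log is consistent with one iteration per mini-batch: 7202 training sentences at a batch size of 32 give 226 batches per epoch. The sketch below checks that arithmetic and pulls the loss out of a log line with a regular expression. The pattern and the `parse_loss` helper are assumptions based on the log lines shown here, not a Flair API.

```python
import math
import re

train_sentences = 7202   # from the corpus summary printed earlier
mini_batch_size = 32     # value passed to trainer.train
batches_per_epoch = math.ceil(train_sentences / mini_batch_size)

# pattern inferred from the training log in this notebook
LOG_LINE = re.compile(r"epoch (\d+) - iter (\d+)/(\d+) - loss ([\d.]+)")


def parse_loss(line: str):
    """Return (epoch, iteration, total_iterations, loss), or None if the line does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    epoch, iteration, total, loss = match.groups()
    return int(epoch), int(iteration), int(total), float(loss)


line = "2019-06-18 18:40:13,361 epoch 1 - iter 22/226 - loss 11.42582655"
parsed = parse_loss(line)
print(batches_per_epoch)
print(parsed)
```

Collecting the parsed losses over a whole run makes it easy to plot the curve and see where the model stops improving.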

In [0]:
from flair.data import Sentence

designation = 'I am Jannis Wolf, Student worker in Erlangen'
#designation = 'Senior Sales Manager in Atlanta'

degree = 'Bachelor in medical engineering at the university of Erlangen'
#degree = 'Bachelor in computer science at the university of chennai'

skill = ' Good communication skills and Team Work'

# tag each example with the trained model and print the result
for text in (designation, degree, skill):
    sentence: Sentence = Sentence(text)

    trainer.model.predict(sentence)

    print("Analysing %s" % sentence)
    print("\nThe following NER tags are found: \n")
    print(sentence.to_tagged_string())

#tagger: SequenceTagger = SequenceTagger.load("ner")

#sentence: Sentence = Sentence("George Washington went to Washington .")
#tagger.predict(sentence)

#print("Analysing %s" % sentence)
#print("\nThe following NER tags are found: \n")
#print(sentence.to_tagged_string())

ModuleNotFoundError: ignored