Flair

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import os

In [3]:
os.chdir( "/content/gdrive/MyDrive/flair" ) 

In [None]:
pip install flair

In [5]:
from flair.data import Corpus
from flair.datasets import WIKINER_ENGLISH

Next, we create __wikiner_corpus__, an instance of the class __Corpus__.<br>
Read the documentation of __Corpus__ [here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md).<br>
__Question 1__: Explain what the __WIKINER__ corpus is.<br>

The WikiNER dataset is a NER dataset generated automatically from Wikipedia. It contains tags such as S-PER and E-PER, so, as with our résumés earlier, an entity type like PERSON is represented not by a single tag but by several position-specific ones. The same holds for organisations (S-ORG, B-ORG, I-ORG, E-ORG) and for the other entity types.
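
To make the position-specific tags concrete, here is a minimal sketch of how entity spans could be turned into tags with the prefixes B, I, E and S that appear in the WikiNER tag list. The function `encode_bioes` and the example spans are our own illustration, not part of Flair.

```python
from typing import List, Tuple


def encode_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end_exclusive, label) spans into one tag per token."""
    tags = ["O"] * n_tokens  # tokens outside any entity stay "O"
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # tokens inside the entity
            tags[end - 1] = f"E-{label}"  # last token of the entity
    return tags


# Hypothetical example sentence and spans, chosen for illustration only.
tokens = ["President", "Vladimir", "Putin", "visited", "Russia"]
spans = [(1, 3, "PER"), (4, 5, "LOC")]
tags = encode_bioes(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(token, tag)
```

A two-token name gets B-PER and E-PER, while a one-token location gets S-LOC, which matches the pattern visible in the corpus.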

<br>

Then, we create __tag_dictionary__, which holds the __NER__ tags in __BIOES__ encoding (the same idea as __BILUO__, with E in place of L and S in place of U).

In [6]:
# 1. get the corpus
wikiner_corpus: Corpus = WIKINER_ENGLISH().downsample(0.1)
print(wikiner_corpus)
# 2. make the tag dictionary from the corpus
tag_dictionary = wikiner_corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary.idx2item)

2021-05-22 10:59:11,386 https://raw.githubusercontent.com/dice-group/FOX/master/input/Wikiner/aij-wikiner-en-wp3.bz2 not found in cache, downloading to /tmp/tmpx7tfxsn0


100%|██████████| 6208404/6208404 [00:00<00:00, 13216516.29B/s]

2021-05-22 10:59:11,955 copying /tmp/tmpx7tfxsn0 to cache at /root/.flair/datasets/wikiner_english/aij-wikiner-en-wp3.bz2
2021-05-22 10:59:11,972 removing temp file /tmp/tmpx7tfxsn0





2021-05-22 10:59:20,590 Reading data from /root/.flair/datasets/wikiner_english
2021-05-22 10:59:20,594 Train: /root/.flair/datasets/wikiner_english/aij-wikiner-en-wp3.train
2021-05-22 10:59:20,597 Dev: None
2021-05-22 10:59:20,599 Test: None
Corpus: 11514 train + 1279 dev + 1422 test sentences
[b'<unk>', b'O', b'B-PER', b'E-PER', b'S-PER', b'S-ORG', b'B-MISC', b'I-MISC', b'E-MISC', b'S-MISC', b'B-LOC', b'E-LOC', b'S-LOC', b'B-ORG', b'I-ORG', b'E-ORG', b'I-LOC', b'I-PER', b'<START>', b'<STOP>']


In [7]:
print(wikiner_corpus.train[73])
print(wikiner_corpus.train[73].to_tagged_string())

Sentence: "In March 2006 , Russia agreed to erase $ 4.74 billion of Algeria 's Soviet-era debt during a visit by President Vladimir Putin to the country , the first by a Russian leader in half a century ."   [− Tokens: 38  − Token-Labels: "In <IN> March <NNP> 2006 <CD> , <,> Russia <NNP/S-LOC> agreed <VBD> to <TO> erase <VB> $ <$> 4.74 <CD> billion <CD> of <IN> Algeria <NNP/S-LOC> 's <POS> Soviet-era <JJ/S-MISC> debt <NN> during <IN> a <DT> visit <NN> by <IN> President <NNP> Vladimir <NNP/B-PER> Putin <NNP/E-PER> to <TO> the <DT> country <NN> , <,> the <DT> first <JJ> by <IN> a <DT> Russian <JJ/S-MISC> leader <NN> in <IN> half <PDT> a <DT> century <NN> . <.>"]
In <IN> March <NNP> 2006 <CD> , <,> Russia <NNP/S-LOC> agreed <VBD> to <TO> erase <VB> $ <$> 4.74 <CD> billion <CD> of <IN> Algeria <NNP/S-LOC> 's <POS> Soviet-era <JJ/S-MISC> debt <NN> during <IN> a <DT> visit <NN> by <IN> President <NNP> Vladimir <NNP/B-PER> Putin <NNP/E-PER> to <TO> the <DT> country <NN> , <,> the <DT> first <

Above, we loaded a corpus: a collection of texts together with their annotations.<br>
Next, we load the data we prepared using spaCy.
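
The data files are expected in a column format. As far as we understand it, each line holds one token followed by its tag, separated by whitespace, and an empty line ends a sentence. The sketch below reads such text with plain Python to show what a mapping like `{0: 'text', 1: 'ner'}` means. The function `read_column_text` and the sample lines are hypothetical and only illustrate the layout; they are not Flair's loader.

```python
from typing import Dict, List


def read_column_text(text: str, columns: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Parse column-formatted text into sentences of {column_name: value} tokens."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:  # last sentence if the text has no trailing empty line
        sentences.append(current)
    return sentences


# Made-up sample in the assumed layout: token, whitespace, NER tag.
sample = """\
Worked O
at O
Infosys U-Companies

Python B-Skills
and I-Skills
Selenium L-Skills
"""

columns = {0: "text", 1: "ner"}
sentences = read_column_text(sample, columns)
print(len(sentences), "sentences")
print(sentences[0])
```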

In [8]:
from flair.data_fetcher import NLPTaskDataFetcher

downsample = 1.0 # 1.0 is full data, try a much smaller number like 0.01 to test run the code
data_folder = os.getcwd()
columns = {0: 'text', 1: 'ner'}

# 1. get the corpus
corpus: Corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder,
    columns,
    train_file='training_data.csv',
    test_file='test_data.csv',
    dev_file=None,
).downsample(downsample)
print(corpus)

# 2. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary.idx2item)


2021-05-22 10:59:33,427 Reading data from /content/gdrive/My Drive/flair
2021-05-22 10:59:33,429 Train: /content/gdrive/My Drive/flair/training_data.csv
2021-05-22 10:59:33,431 Dev: None
2021-05-22 10:59:33,433 Test: /content/gdrive/My Drive/flair/test_data.csv


Corpus: 6290 train + 699 dev + 3322 test sentences
[b'<unk>', b'O', b'B-Skills', b'I-Skills', b'L-Skills', b'U-Companies', b'B-Companies', b'I-Companies', b'L-Companies', b'B-Degree', b'I-Degree', b'L-Degree', b'U-Skills', b'U-Degree', b'<START>', b'<STOP>']


__Question 2__: What is the difference between the __tag_dictionary__ created in the cell above and the __tag_dictionary__ created before that?<br>

The second tag_dictionary contains only our three selected entity types:
- Skills
- Companies
- Degree

Each type appears in several variants, prefixed with B, I, L or U according to the BILUO scheme. The first tag_dictionary instead covers the WikiNER types PER, LOC, ORG and MISC, with the prefixes B, I, E and S. The second dictionary is the one we want for our task.
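
The comparison can also be checked mechanically. The following sketch takes the two tag lists printed above (written as plain strings instead of bytes for readability) and groups the position prefixes by entity type. The helper `summarize` is our own name, not a Flair function.

```python
from collections import defaultdict

wikiner_tags = ["<unk>", "O", "B-PER", "E-PER", "S-PER", "S-ORG", "B-MISC",
                "I-MISC", "E-MISC", "S-MISC", "B-LOC", "E-LOC", "S-LOC",
                "B-ORG", "I-ORG", "E-ORG", "I-LOC", "I-PER", "<START>", "<STOP>"]
resume_tags = ["<unk>", "O", "B-Skills", "I-Skills", "L-Skills", "U-Companies",
               "B-Companies", "I-Companies", "L-Companies", "B-Degree",
               "I-Degree", "L-Degree", "U-Skills", "U-Degree", "<START>", "<STOP>"]


def summarize(tags):
    """Map each entity type to the sorted position prefixes it occurs with."""
    prefixes = defaultdict(set)
    for tag in tags:
        if "-" not in tag:
            continue  # skips O, <unk>, <START> and <STOP>
        prefix, entity = tag.split("-", 1)
        prefixes[entity].add(prefix)
    return {entity: "".join(sorted(p)) for entity, p in sorted(prefixes.items())}


print(summarize(wikiner_tags))
# {'LOC': 'BEIS', 'MISC': 'BEIS', 'ORG': 'BEIS', 'PER': 'BEIS'}
print(summarize(resume_tags))
# {'Companies': 'BILU', 'Degree': 'BILU', 'Skills': 'BILU'}
```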

<br>

Next, we take the last nine sentences from the test data and print their annotations using __to_tagged_string__.

In [24]:
for i in range(1, 10):
  print(corpus.test[-i].to_tagged_string())

, <I-Skills> Catia <I-Skills> V6 <I-Skills> , <I-Skills> Ansys <L-Skills>
Sri Venkateswara College Of Engineering - Chennai , Tamil Nadu 2012 to 2016 SKILLS ANSYS <B-Skills> ( <I-Skills> Less <I-Skills> than <I-Skills> 1 <I-Skills> year <I-Skills> ) <I-Skills> , <I-Skills> CATIA <I-Skills> ( <I-Skills> Less <I-Skills> than <I-Skills> 1 <I-Skills> year <I-Skills> ) <I-Skills> , <I-Skills> CREO <I-Skills> ( <I-Skills> Less <I-Skills> than <I-Skills> 1 <I-Skills> year <I-Skills> ) <I-Skills> , <I-Skills> PARAMETRIC <I-Skills> ( <I-Skills> Less <I-Skills> than <I-Skills> 1 <I-Skills> year <I-Skills> ) <I-Skills> , <I-Skills> PYTHON <I-Skills> ( <I-Skills> Less <I-Skills> than <I-Skills> 1 <I-Skills> year <I-Skills> ) <I-Skills> , <I-Skills> Selenium <I-Skills> , <I-Skills> Selenium <I-Skills> Webdriver <I-Skills> , <I-Skills> Testing <I-Skills> , <I-Skills> Functional <I-Skills> Testing <I-Skills> , <I-Skills> Automation <I-Skills> Testing <I-Skills> , <I-Skills> Regression <I-Skills> Test

__Question 3__: Why is not every word annotated?<br>

Because the data is annotated only with our three labels Degree, Companies and Skills. Words that belong to none of these labels carry the tag O, which to_tagged_string does not print, so they appear without an annotation.
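
This behaviour can be imitated in a few lines. The sketch below is a simplified stand-in for the rendering, assuming only that tokens tagged O are printed without a tag; it is not Flair's actual implementation, and the tokens and tags are shortened from the résumé output above.

```python
def tagged_string(tokens, tags):
    """Join tokens into one string, appending <tag> after every non-O token."""
    parts = []
    for token, tag in zip(tokens, tags):
        parts.append(token)
        if tag != "O":  # tokens outside any entity get no visible tag
            parts.append(f"<{tag}>")
    return " ".join(parts)


tokens = ["SKILLS", "ANSYS", ",", "CATIA"]
tags = ["O", "B-Skills", "I-Skills", "L-Skills"]
rendered = tagged_string(tokens, tags)
print(rendered)
# SKILLS ANSYS <B-Skills> , <I-Skills> CATIA <L-Skills>
```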

<br>

How do you explain the difference from the result of __to_tagged_string__ applied to a sentence from the WikiNER corpus?

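In both cases, to_tagged_string writes the annotation in angle brackets directly after the token it belongs to. In the WikiNER sentence every token has a part-of-speech tag, so every word is followed by a tag, and tokens that belong to an entity additionally show the NER tag after a slash, for example Vladimir <NNP/B-PER>. Our data has only the ner column, so only the tokens of an entity are followed by a tag. In both outputs the prefix letter shows where an entity starts and ends, which makes it easy to see how many entities a sentence contains.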
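
As a rough illustration of reading the entities back out of such a string, the sketch below parses a tagged string and groups tokens into entities using the prefix letters. It accepts both prefix sets (B, I, E, S and B, I, L, U) and ignores other tags inside the brackets, such as part-of-speech tags. The function `extract_entities` is our own helper, and the two input strings are shortened from the outputs above.

```python
import re
from typing import List, Tuple

# A position prefix, a hyphen, then the entity type, e.g. "B-PER" or "U-Skills".
NER_TAG = re.compile(r"^([BIELSU])-(.+)$")


def extract_entities(tagged: str) -> List[Tuple[str, str]]:
    """Return (entity text, entity type) pairs found in a tagged string."""
    entities, current, prev_token = [], [], None
    for piece in tagged.split():
        is_tag = piece.startswith("<") and piece.endswith(">") and prev_token is not None
        if not is_tag:
            prev_token = piece  # an ordinary token
            continue
        for tag in piece[1:-1].split("/"):  # e.g. "NNP/B-PER" -> "NNP", "B-PER"
            match = NER_TAG.match(tag)
            if not match:
                continue  # not an NER tag, e.g. a part-of-speech tag
            prefix, label = match.groups()
            if prefix in "SU":  # single-token entity
                entities.append((prev_token, label))
            elif prefix == "B":  # entity starts
                current = [prev_token]
            elif prefix == "I":  # entity continues
                current.append(prev_token)
            else:  # "E" or "L": entity ends
                current.append(prev_token)
                entities.append((" ".join(current), label))
                current = []
        prev_token = None  # the tag has been consumed
    return entities


wiki = ("Russia <NNP/S-LOC> agreed <VBD> to <TO> a <DT> visit <NN> by <IN> "
        "President <NNP> Vladimir <NNP/B-PER> Putin <NNP/E-PER>")
resume = "SKILLS ANSYS <B-Skills> , <I-Skills> CATIA <L-Skills>"

print(extract_entities(wiki))
print(extract_entities(resume))
```

Counting the returned pairs gives the number of entities in a sentence, regardless of how many tokens each entity spans.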