Flair

In [10]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [11]:
import os

In [12]:
os.chdir( "/content/gdrive/MyDrive/flair" ) 

In [13]:
%pip install flair



In [14]:
from flair.data import Corpus
from flair.datasets import WIKINER_ENGLISH

Next, we create __wikiner_corpus__, an instance of the class __Corpus__.<br>
The documentation of __Corpus__ is available [here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md).<br>
__Question 1__: Explain what the __WIKINER__ corpus is.<br>
__Answer 1__: A corpus is a collection of sentences split into training, validation (dev) and test sets. __WIKINER__ is a NER dataset that was generated automatically from Wikipedia.<br>
Then, we create __tag_dictionary__, which lists the NER tags of the corpus in __BIOES__ encoding (the prefixes __B__, __I__, __E__ and __S__, plus the tag __O__).

In [27]:
# 1. get the corpus (downsampled to 10% to keep the run short)
wikiner_corpus: Corpus = WIKINER_ENGLISH().downsample(0.1)
print(wikiner_corpus)
# 2. make the tag dictionary from the corpus
tag_dictionary = wikiner_corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary.idx2item)

2021-05-23 15:57:25,013 Reading data from /root/.flair/datasets/wikiner_english
2021-05-23 15:57:25,016 Train: /root/.flair/datasets/wikiner_english/aij-wikiner-en-wp3.train
2021-05-23 15:57:25,019 Dev: None
2021-05-23 15:57:25,021 Test: None
Corpus: 11514 train + 1279 dev + 1422 test sentences
[b'<unk>', b'O', b'B-MISC', b'E-MISC', b'S-PER', b'S-LOC', b'B-ORG', b'I-ORG', b'E-ORG', b'S-ORG', b'I-MISC', b'B-PER', b'I-PER', b'E-PER', b'S-MISC', b'B-LOC', b'E-LOC', b'I-LOC', b'<START>', b'<STOP>']
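
The printed list suggests how the tag dictionary is laid out: special items first (`<unk>`, `O`), then every tag found in the corpus, then `<START>` and `<STOP>`. The following is a minimal pure-Python sketch of that idea, not flair's actual implementation. The helper `build_tag_dictionary` and the two example sentences are hypothetical and only serve as illustration.

```python
from typing import Iterable, List


def build_tag_dictionary(tagged_sentences: Iterable[List[str]]) -> List[str]:
    """Hypothetical helper: collect tags in first-seen order, framed by special items."""
    idx2item = ["<unk>", "O"]  # special items seen at the start of the printed list
    for tags in tagged_sentences:
        for tag in tags:
            if tag not in idx2item:
                idx2item.append(tag)
    idx2item += ["<START>", "<STOP>"]  # special items seen at the end of the printed list
    return idx2item


# Made-up tag sequences for two sentences (illustration only)
example_sentences = [
    ["O", "O", "S-PER", "B-PER", "E-PER", "O"],
    ["B-ORG", "I-ORG", "E-ORG", "O", "S-LOC"],
]
tag_list = build_tag_dictionary(example_sentences)
print(tag_list)
```

The position of a tag in this list is the integer index that a sequence tagger would predict for it.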


In [31]:
print(wikiner_corpus.train[73])
print(wikiner_corpus.train[73].to_tagged_string())

Sentence: "Scholars include the linguist and Boasian Edward Sapir ."   [− Tokens: 9  − Token-Labels: "Scholars <NNS> include <VBP> the <DT> linguist <NN> and <CC> Boasian <JJ/S-PER> Edward <NNP/B-PER> Sapir <NNP/E-PER> . <.>"]
Scholars <NNS> include <VBP> the <DT> linguist <NN> and <CC> Boasian <JJ/S-PER> Edward <NNP/B-PER> Sapir <NNP/E-PER> . <.>


Above, we loaded a corpus: a collection of texts together with their annotations.<br>
Next, we load the data that we prepared using spaCy.

In [21]:
# Note: NLPTaskDataFetcher is deprecated in newer flair releases in favour of flair.datasets
from flair.data_fetcher import NLPTaskDataFetcher

downsample = 0.1  # 1.0 is the full data; try a much smaller number like 0.01 for a test run
data_folder = os.getcwd()
columns = {0: 'text', 1: 'ner'}  # column 0 holds the token, column 1 the NER tag

# 1. get the corpus (no dev file is given, so the dev set is split off automatically)
corpus: Corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder,
    columns,
    train_file='training_data.csv',
    test_file='test_data.csv',
    dev_file=None,
).downsample(downsample)
print(corpus)

# 2. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary.idx2item)


2021-05-23 15:41:30,800 Reading data from /content/gdrive/My Drive/flair
2021-05-23 15:41:30,804 Train: /content/gdrive/My Drive/flair/training_data.csv
2021-05-23 15:41:30,808 Dev: None
2021-05-23 15:41:30,810 Test: /content/gdrive/My Drive/flair/test_data.csv


Corpus: 798 train + 89 dev + 303 test sentences
[b'<unk>', b'O', b'I-Degree', b'B-Location', b'I-Location', b'L-Location', b'B-Skills', b'I-Skills', b'L-Skills', b'U-Location', b'U-Degree', b'U-Skills', b'B-Degree', b'L-Degree', b'<START>', b'<STOP>']
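
To make the role of `columns = {0: 'text', 1: 'ner'}` concrete, here is a minimal sketch of how a column-formatted file can be read. It assumes the usual CoNLL-style layout (one token per line, whitespace-separated columns, a blank line between sentences); whether `training_data.csv` follows exactly this layout is an assumption. The helper `parse_column_text` and the sample text are hypothetical and are not part of flair.

```python
from typing import Dict, List


def parse_column_text(text: str, column_map: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Hypothetical helper: split column-formatted text into sentences of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append({name: fields[idx] for idx, name in column_map.items()})
    if current:  # the last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences


# Made-up sample in the assumed layout (illustration only)
sample_text = """Bengaluru U-Location
, O
Karnataka O

Python U-Skills
developer O
"""
sample_columns = {0: 'text', 1: 'ner'}
parsed = parse_column_text(sample_text, sample_columns)
print(len(parsed), parsed[0][0])
```

Each token ends up with one entry per mapped column, which mirrors the idea of attaching a `ner` annotation to every token text.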


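__Question 2__: What is the difference between the __tag_dictionary__ created in the cell above and the __tag_dictionary__ created before that?<br>
__Answer 2__: The two dictionaries differ in their entity labels (__PER__, __LOC__, __ORG__, __MISC__ versus __Degree__, __Location__, __Skills__) and in their encoding scheme. The WikiNER dictionary uses the prefixes __E__ (end) and __S__ (single), which are semantically the same as the prefixes __L__ (last) and __U__ (unit) of the BILUO scheme used in our data.<br>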
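
The equivalence between the two schemes can be written down as a simple prefix mapping. The sketch below converts BIOES-style tags, as printed for the WikiNER corpus, into BILUO-style tags, as printed for our data. The function `bioes_to_biluo` is a hypothetical helper for illustration; it handles only entity tags and `O`, not special items such as `<unk>`.

```python
# Prefix mapping between the two schemes: only E and S are renamed
BIOES_TO_BILUO = {"B": "B", "I": "I", "E": "L", "S": "U"}


def bioes_to_biluo(tag: str) -> str:
    """Hypothetical helper: rewrite the prefix of a BIOES tag to its BILUO counterpart."""
    if tag == "O":
        return tag
    prefix, _, label = tag.partition("-")
    return f"{BIOES_TO_BILUO[prefix]}-{label}"


wiki_tags = ["S-PER", "B-PER", "E-PER", "O"]
converted = [bioes_to_biluo(t) for t in wiki_tags]
print(converted)  # ['U-PER', 'B-PER', 'L-PER', 'O']
```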


Next, we take the first sentence from the test data and print it together with its annotations using __to_tagged_string__.

In [19]:
# print the first test sentence together with its tags
print(corpus.test[0].to_tagged_string())

Willing to relocate to : Bengaluru <U-Location> , Karnataka WORK EXPERIENCE Principal Engineer Technical Staff Company 1 - Bengaluru <U-Location> , Karnataka - September 2005 to Present Total Experience : 12 years 6 months .


__Question 3__: Why is not every word annotated?<br>
How do you explain the difference from the result of __to_tagged_string__ applied to a sentence from the WikiNER corpus?<br>
__Answer 3__: Most words are not part of any entity. They carry the tag __O__, which __to_tagged_string__ does not print. The WikiNER corpus additionally provides a POS tag for every token, so each word there is shown with at least its POS tag.
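
The two outputs can be reproduced with a small sketch, assuming that a tagged string lists every non-`O` tag of a token and skips `O` entirely. This matches the printed examples but is not flair's actual code. The function `render_tagged` and the token tuples are hypothetical; the token values are copied from the two outputs above.

```python
from typing import List, Optional, Tuple


def render_tagged(tokens: List[Tuple[str, Optional[str], str]]) -> str:
    """Hypothetical helper: each token is (text, POS tag or None, NER tag)."""
    parts: List[str] = []
    for text, pos, ner in tokens:
        parts.append(text)
        # keep the POS tag if present, and the NER tag unless it is 'O'
        tags = [t for t in (pos, None if ner == "O" else ner) if t]
        if tags:
            parts.append("<" + "/".join(tags) + ">")
    return " ".join(parts)


# WikiNER-style tokens: every token has a POS tag
wiki_tokens = [("Edward", "NNP", "B-PER"), ("Sapir", "NNP", "E-PER"), (".", ".", "O")]
# Tokens from our data: no POS column, most NER tags are 'O'
resume_tokens = [("Bengaluru", None, "U-Location"), (",", None, "O"), ("Karnataka", None, "O")]

wiki_line = render_tagged(wiki_tokens)
resume_line = render_tagged(resume_tokens)
print(wiki_line)    # Edward <NNP/B-PER> Sapir <NNP/E-PER> . <.>
print(resume_line)  # Bengaluru <U-Location> , Karnataka
```

With a POS column every token gets at least one tag in the output; without it, only tokens inside an entity are marked.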