<a href="https://colab.research.google.com/github/ApurbaPaul-NLP/FLAIR-MODELS/blob/main/Prog3_07_09_2022_Loading_Training_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install flair

**The Corpus Object**

The Corpus represents a dataset that you use to train a model. 

It consists of a list of train sentences, a list of dev sentences, and a list of test sentences, which correspond to the training, validation and testing split during model training.

The following example snippet instantiates the Universal Dependency Treebank for English as a corpus object:

In [2]:
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

2022-09-06 23:35:35,483 https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu not found in cache, downloading to /tmp/tmpxut8yj9t


1738438B [00:00, 64337284.44B/s]         

2022-09-06 23:35:35,558 copying /tmp/tmpxut8yj9t to cache at /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
2022-09-06 23:35:35,563 removing temp file /tmp/tmpxut8yj9t





2022-09-06 23:35:36,318 https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-test.conllu not found in cache, downloading to /tmp/tmpqniotv4f


1738935B [00:00, 61981066.72B/s]         

2022-09-06 23:35:36,410 copying /tmp/tmpqniotv4f to cache at /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu
2022-09-06 23:35:36,416 removing temp file /tmp/tmpqniotv4f





2022-09-06 23:35:37,857 https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-train.conllu not found in cache, downloading to /tmp/tmptpx0_el2


13686411B [00:00, 101458594.52B/s]

2022-09-06 23:35:38,039 copying /tmp/tmptpx0_el2 to cache at /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2022-09-06 23:35:38,057 removing temp file /tmp/tmptpx0_el2
2022-09-06 23:35:38,061 Reading data from /root/.flair/datasets/ud_english
2022-09-06 23:35:38,063 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2022-09-06 23:35:38,066 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
2022-09-06 23:35:38,069 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu





The first time you call this snippet, it triggers a download of the Universal Dependency Treebank for English onto your hard drive. 

It then reads the train, test and dev splits into the Corpus which it returns. 

Check the length of the three splits to see how many Sentences each one contains:

In [3]:
# print the number of Sentences in the train split
print(len(corpus.train))

# print the number of Sentences in the test split
print(len(corpus.test))

# print the number of Sentences in the dev split
print(len(corpus.dev))

12543
2077
2001
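
As a quick sanity check on those numbers, the share of each split can be worked out with plain arithmetic. This sketch hard-codes the three sizes printed above instead of reading them from the corpus, so it runs without Flair; the variable names are illustrative only.

```python
# Split sizes copied from the output above (not read from Flair).
split_sizes = {"train": 12543, "test": 2077, "dev": 2001}

total = sum(split_sizes.values())

# Fraction of all sentences that falls into each split.
split_shares = {name: size / total for name, size in split_sizes.items()}

for name, share in split_shares.items():
    print(f"{name}: {share:.1%}")
```

Roughly three quarters of the sentences sit in the training split, and the rest is divided almost evenly between test and dev.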


You can also access the Sentence objects in each split directly. 

For instance, let us look at the first Sentence in the test split of the English UD (the train and dev splits are indexed the same way):

In [4]:
# get the first Sentence in the test split
sentence = corpus.test[0]

# print with all annotations
print(sentence)

# print only with POS annotations (better readability)
print(sentence.to_tagged_string('pos'))

Sentence: "What if Google Morphed Into GoogleOS ?" → ["What"/what/PRON/WP/root/Int, "if"/if/SCONJ/IN/mark, "Google"/Google/PROPN/NNP/nsubj/Sing, "Morphed"/morph/VERB/VBD/advcl/Ind/Sing/3/Past/Fin, "Into"/into/ADP/IN/case, "GoogleOS"/GoogleOS/PROPN/NNP/obl/Sing, "?"/?/PUNCT/./punct]
Sentence: "What if Google Morphed Into GoogleOS ?" → ["What"/WP, "if"/IN, "Google"/NNP, "Morphed"/VBD, "Into"/IN, "GoogleOS"/NNP, "?"/.]


**Helper functions**

A Corpus provides several helper functions.

For instance, you can downsample the data by calling downsample() and passing a ratio.

The following cell loads the full corpus and a second copy downsampled to 10%, then prints both for comparison:

In [5]:
import flair.datasets

# load the full corpus
corpus = flair.datasets.UD_ENGLISH()

# load a second copy and keep only 10% of each split
downsampled_corpus = flair.datasets.UD_ENGLISH().downsample(0.1)

# printing both corpora shows that the second one has been reduced to 10% of the data
print("--- 1 Original ---")
print(corpus)

print("--- 2 Downsampled ---")
print(downsampled_corpus)


2022-09-06 23:40:14,405 Reading data from /root/.flair/datasets/ud_english
2022-09-06 23:40:14,408 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2022-09-06 23:40:14,410 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
2022-09-06 23:40:14,411 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu
2022-09-06 23:40:36,687 Reading data from /root/.flair/datasets/ud_english
2022-09-06 23:40:36,688 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2022-09-06 23:40:36,693 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
2022-09-06 23:40:36,698 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu
--- 1 Original ---
Corpus: 12543 train + 2001 dev + 2077 test sentences
--- 2 Downsampled ---
Corpus: 1254 train + 200 dev + 208 test sentences
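
To make the effect of the ratio concrete, here is a minimal pure-Python sketch of downsampling. It assumes that downsampling means drawing a random subset of each split and rounding the target size to the nearest integer. That assumption is consistent with the sizes printed above, but it is not taken from Flair's source; `downsample_split` is a hypothetical helper, and integer IDs stand in for Sentence objects.

```python
import random


def downsample_split(items, ratio, seed=0):
    """Return a random subset holding about `ratio` of `items` (hypothetical helper)."""
    keep = round(len(items) * ratio)
    return random.Random(seed).sample(items, keep)


# Stand-in splits with the same sizes as UD_ENGLISH: integer IDs instead of Sentences.
splits = {
    "train": list(range(12543)),
    "dev": list(range(2001)),
    "test": list(range(2077)),
}

downsampled = {name: downsample_split(items, 0.1) for name, items in splits.items()}
print({name: len(items) for name, items in downsampled.items()})
# prints {'train': 1254, 'dev': 200, 'test': 208}
```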


**Creating label dictionaries**

For many learning tasks you need to create a "dictionary" that contains all the labels you want to predict. 

You can generate this dictionary directly out of the Corpus by calling the method make_label_dictionary and passing the desired label_type.

For instance, the UD_ENGLISH corpus instantiated above has multiple layers of annotation, such as regular POS tags ('pos'), universal POS tags ('upos') and morphological tags ('tense', 'number', and so on).

Create label dictionaries for universal POS tags by passing label_type='upos' like this:

In [6]:
# create label dictionary for a Universal Part-of-Speech tagging task
upos_dictionary = corpus.make_label_dictionary(label_type='upos')

# print dictionary
print(upos_dictionary)

2022-09-06 23:42:42,509 Computing label dictionary. Progress:


12543it [00:00, 19772.59it/s]

2022-09-06 23:42:43,196 Dictionary created for label 'upos' with 18 values: NOUN (seen 34761 times), PUNCT (seen 23620 times), VERB (seen 22946 times), PRON (seen 18589 times), ADP (seen 17730 times), DET (seen 16314 times), ADJ (seen 13167 times), AUX (seen 12440 times), PROPN (seen 12345 times), ADV (seen 9462 times), CCONJ (seen 6690 times), PART (seen 5745 times), SCONJ (seen 4554 times), NUM (seen 4119 times), X (seen 704 times), SYM (seen 698 times), INTJ (seen 694 times)
Dictionary with 18 tags: <unk>, NOUN, PUNCT, VERB, PRON, ADP, DET, ADJ, AUX, PROPN, ADV, CCONJ, PART, SCONJ, NUM, X, SYM, INTJ





In [7]:
# create label dictionary for a morphological tense tagging task
tense_dictionary = corpus.make_label_dictionary(label_type='tense')

# print dictionary
print(tense_dictionary)

2022-09-06 23:44:02,263 Computing label dictionary. Progress:


12543it [00:00, 40272.98it/s]

2022-09-06 23:44:02,581 Dictionary created for label 'tense' with 3 values: Pres (seen 10870 times), Past (seen 9357 times)
Dictionary with 3 tags: <unk>, Pres, Past
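
The idea behind such a dictionary can be sketched in a few lines of plain Python: count every label, then give each one an integer ID, with a reserved `<unk>` entry first. This is an illustration under those assumptions and not Flair's implementation; `build_label_dictionary` and the toy data are hypothetical.

```python
from collections import Counter


def build_label_dictionary(sentences, unk="<unk>"):
    """Map each label to an integer ID, most frequent first, with `unk` at ID 0."""
    counts = Counter(label for sentence in sentences for label in sentence)
    item2idx = {unk: 0}
    for label, _ in counts.most_common():
        item2idx[label] = len(item2idx)
    return item2idx, counts


# Toy data: each inner list holds the tense labels found in one sentence.
toy_sentences = [["Past", "Pres"], ["Pres"], ["Pres", "Past", "Pres"]]

item2idx, counts = build_label_dictionary(toy_sentences)
print(item2idx)  # prints {'<unk>': 0, 'Pres': 1, 'Past': 2}
```

In the outputs above, the logged count ("3 values" for 'tense') is one more than the number of labels listed with frequencies, which matches the `<unk>` entry at the front of the printed dictionary.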





**Dictionaries for other label types**

If you don't know the label types in a corpus, call make_label_dictionary with a label name that does not exist (e.g. corpus.make_label_dictionary(label_type='abcd')). The call fails, but the error message lists every label type in the corpus together with the number of sentences that carry it:

In [9]:
# 'all' is not a real label type; any unknown name triggers the error message that lists the available label types
corpus.make_label_dictionary(label_type='all')

2022-09-06 23:47:13,794 Computing label dictionary. Progress:


12543it [00:00, 50241.67it/s]

2022-09-06 23:47:14,054 ERROR: You specified label_type='all' which is not in this dataset!
2022-09-06 23:47:14,056 ERROR: The corpus contains the following label types: 'lemma' (in 12543 sentences), 'upos' (in 12543 sentences), 'pos' (in 12543 sentences), 'dependency' (in 12543 sentences), 'number' (in 12037 sentences), 'verbform' (in 10123 sentences), 'prontype' (in 9750 sentences), 'person' (in 9387 sentences), 'mood' (in 8911 sentences), 'tense' (in 8747 sentences), 'degree' (in 7149 sentences), 'definite' (in 6851 sentences), 'case' (in 6492 sentences), 'gender' (in 2829 sentences), 'numtype' (in 2771 sentences), 'poss' (in 2537 sentences), 'voice' (in 1085 sentences), 'typo' (in 553 sentences), 'extpos' (in 185 sentences), 'abbr' (in 177 sentences), 'reflex' (in 134 sentences), 'style' (in 48 sentences), 'foreign' (in 5 sentences)





Exception: ignored
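
The "in N sentences" figures in that message are per-sentence counts: a label type is counted once for each sentence in which it occurs at least once. A minimal sketch of that bookkeeping follows, using a toy corpus in which every token is a plain dict of label types. The data layout and the `label_type_coverage` helper are assumptions made for illustration, not Flair's internals.

```python
from collections import Counter


def label_type_coverage(sentences):
    """Count, for every label type, the number of sentences that carry it."""
    coverage = Counter()
    for sentence in sentences:
        # A sentence is a list of tokens; each token is a {label_type: value} dict.
        types_in_sentence = {label_type for token in sentence for label_type in token}
        coverage.update(types_in_sentence)
    return coverage


toy_corpus = [
    [{"upos": "PRON", "number": "Sing"}, {"upos": "VERB", "tense": "Past"}],
    [{"upos": "PUNCT"}],
]

for label_type, n in label_type_coverage(toy_corpus).most_common():
    print(f"'{label_type}' (in {n} sentences)")
```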

Let's create dictionaries for regular part of speech tags and a morphological number tagging task:

In [11]:
# create label dictionary for a regular POS tagging task
pos_dictionary = corpus.make_label_dictionary(label_type='pos')
print(pos_dictionary)
# create label dictionary for a morphological number tagging task
number_dictionary = corpus.make_label_dictionary(label_type='number')
print(number_dictionary)

2022-09-06 23:49:31,938 Computing label dictionary. Progress:


12543it [00:00, 19977.25it/s]

2022-09-06 23:49:32,575 Dictionary created for label 'pos' with 51 values: NN (seen 26920 times), IN (seen 20882 times), DT (seen 16845 times), NNP (seen 12401 times), PRP (seen 12220 times), JJ (seen 11587 times), RB (seen 10592 times), . (seen 10317 times), VB (seen 9487 times), NNS (seen 8450 times), , (seen 8062 times), CC (seen 6693 times), VBD (seen 5400 times), VBP (seen 5349 times), VBZ (seen 4569 times), CD (seen 4002 times), VBN (seen 3957 times), VBG (seen 3330 times), MD (seen 3292 times), TO (seen 3283 times)
Dictionary with 51 tags: <unk>, NN, IN, DT, NNP, PRP, JJ, RB, ., VB, NNS, ,, CC, VBD, VBP, VBZ, CD, VBN, VBG, MD, TO, PRP$, -RRB-, -LRB-, WDT, :, WRB, ``, '', WP, RP, UH, POS, HYPH, NNPS, JJR, JJS, EX, NFP, RBR, ADD, GW, $, PDT, SYM, LS, RBS, FW, AFX, WP$
2022-09-06 23:49:32,578 Computing label dictionary. Progress:



12543it [00:00, 29369.75it/s]

2022-09-06 23:49:33,015 Dictionary created for label 'number' with 3 values: Sing (seen 60903 times), Plur (seen 16416 times)
Dictionary with 3 tags: <unk>, Sing, Plur





**Dictionaries for other corpora types**

The method make_label_dictionary can be used for any corpus, including text classification corpora:

In [14]:
# create label dictionary for a text classification task
corpus = flair.datasets.TREC_6()
# TREC_6 has a single label type, 'question_class'
# (passing an unknown label_type would log an error that lists it, as shown earlier)
corpus.make_label_dictionary('question_class')

2022-09-06 23:52:44,860 Reading data from /root/.flair/datasets/trec_6
2022-09-06 23:52:44,861 Train: /root/.flair/datasets/trec_6/train.txt
2022-09-06 23:52:44,865 Dev: None
2022-09-06 23:52:44,866 Test: /root/.flair/datasets/trec_6/test.txt
2022-09-06 23:52:45,447 Initialized corpus /root/.flair/datasets/trec_6 (label type name is 'question_class')
2022-09-06 23:52:45,449 Computing label dictionary. Progress:


4907it [00:00, 51206.17it/s]

2022-09-06 23:52:45,559 Dictionary created for label 'question_class' with 7 values: ENTY (seen 1103 times), HUM (seen 1088 times), DESC (seen 1057 times), NUM (seen 817 times), LOC (seen 764 times), ABBR (seen 78 times)





<flair.data.Dictionary at 0x7f1fe8118810>

**The MultiCorpus Object**

If you want to train multiple tasks at once, you can use the MultiCorpus object. 

To initialize a MultiCorpus, first create any number of Corpus objects, then pass them as a list to the MultiCorpus constructor.

For instance, the following snippet loads a combined corpus consisting of the English, German and Dutch Universal Dependency Treebanks.

The MultiCorpus inherits from Corpus, so you can use it like any other corpus to train your models.

In [15]:
english_corpus = flair.datasets.UD_ENGLISH()
german_corpus = flair.datasets.UD_GERMAN()
dutch_corpus = flair.datasets.UD_DUTCH()

# make a multi corpus consisting of three UDs
from flair.data import MultiCorpus
multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])

2022-09-06 23:54:43,993 Reading data from /root/.flair/datasets/ud_english
2022-09-06 23:54:43,996 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2022-09-06 23:54:43,998 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
2022-09-06 23:54:44,001 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu
2022-09-06 23:55:12,275 https://raw.githubusercontent.com/UniversalDependencies/UD_German-GSD/master/de_gsd-ud-dev.conllu not found in cache, downloading to /tmp/tmp3t7gzolw


902796B [00:00, 53792292.90B/s]          

2022-09-06 23:55:12,353 copying /tmp/tmp3t7gzolw to cache at /root/.flair/datasets/ud_german/de_gsd-ud-dev.conllu
2022-09-06 23:55:12,357 removing temp file /tmp/tmp3t7gzolw





2022-09-06 23:55:12,763 https://raw.githubusercontent.com/UniversalDependencies/UD_German-GSD/master/de_gsd-ud-test.conllu not found in cache, downloading to /tmp/tmpeeea0kze


1205132B [00:00, 68425904.19B/s]         

2022-09-06 23:55:12,824 copying /tmp/tmpeeea0kze to cache at /root/.flair/datasets/ud_german/de_gsd-ud-test.conllu
2022-09-06 23:55:12,827 removing temp file /tmp/tmpeeea0kze





2022-09-06 23:55:14,596 https://raw.githubusercontent.com/UniversalDependencies/UD_German-GSD/master/de_gsd-ud-train.conllu not found in cache, downloading to /tmp/tmpb0mwdw9d


19591438B [00:00, 103732657.16B/s]

2022-09-06 23:55:14,842 copying /tmp/tmpb0mwdw9d to cache at /root/.flair/datasets/ud_german/de_gsd-ud-train.conllu





2022-09-06 23:55:14,868 removing temp file /tmp/tmpb0mwdw9d
2022-09-06 23:55:14,874 Reading data from /root/.flair/datasets/ud_german
2022-09-06 23:55:14,875 Train: /root/.flair/datasets/ud_german/de_gsd-ud-train.conllu
2022-09-06 23:55:14,878 Dev: /root/.flair/datasets/ud_german/de_gsd-ud-dev.conllu
2022-09-06 23:55:14,881 Test: /root/.flair/datasets/ud_german/de_gsd-ud-test.conllu
2022-09-06 23:55:41,812 https://raw.githubusercontent.com/UniversalDependencies/UD_Dutch-Alpino/master/nl_alpino-ud-dev.conllu not found in cache, downloading to /tmp/tmpkb3e33fv


965554B [00:00, 62084392.46B/s]          

2022-09-06 23:55:41,872 copying /tmp/tmpkb3e33fv to cache at /root/.flair/datasets/ud_dutch/nl_alpino-ud-dev.conllu
2022-09-06 23:55:41,877 removing temp file /tmp/tmpkb3e33fv





2022-09-06 23:55:42,229 https://raw.githubusercontent.com/UniversalDependencies/UD_Dutch-Alpino/master/nl_alpino-ud-test.conllu not found in cache, downloading to /tmp/tmpm6se4lpu


924462B [00:00, 58229083.41B/s]          

2022-09-06 23:55:42,285 copying /tmp/tmpm6se4lpu to cache at /root/.flair/datasets/ud_dutch/nl_alpino-ud-test.conllu
2022-09-06 23:55:42,293 removing temp file /tmp/tmpm6se4lpu





2022-09-06 23:55:43,828 https://raw.githubusercontent.com/UniversalDependencies/UD_Dutch-Alpino/master/nl_alpino-ud-train.conllu not found in cache, downloading to /tmp/tmprnadjor1


14864664B [00:00, 109544685.68B/s]

2022-09-06 23:55:44,023 copying /tmp/tmprnadjor1 to cache at /root/.flair/datasets/ud_dutch/nl_alpino-ud-train.conllu
2022-09-06 23:55:44,044 removing temp file /tmp/tmprnadjor1
2022-09-06 23:55:44,048 Reading data from /root/.flair/datasets/ud_dutch
2022-09-06 23:55:44,050 Train: /root/.flair/datasets/ud_dutch/nl_alpino-ud-train.conllu
2022-09-06 23:55:44,053 Dev: /root/.flair/datasets/ud_dutch/nl_alpino-ud-dev.conllu
2022-09-06 23:55:44,056 Test: /root/.flair/datasets/ud_dutch/nl_alpino-ud-test.conllu
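
Conceptually, combining corpora can be pictured as concatenating their matching splits. The sketch below illustrates that idea with a toy stand-in class. `ToyCorpus` and `combine` are hypothetical names, strings stand in for Sentence objects, and the concatenation behaviour is an assumption about what a combined corpus looks like, not a copy of Flair's MultiCorpus code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCorpus:
    """Stand-in for a corpus: three lists of sentences."""
    train: list = field(default_factory=list)
    dev: list = field(default_factory=list)
    test: list = field(default_factory=list)


def combine(corpora):
    """Concatenate the matching splits of several corpora into one."""
    return ToyCorpus(
        train=[s for c in corpora for s in c.train],
        dev=[s for c in corpora for s in c.dev],
        test=[s for c in corpora for s in c.test],
    )


english = ToyCorpus(train=["en-1", "en-2"], dev=["en-3"], test=["en-4"])
german = ToyCorpus(train=["de-1"], dev=["de-2"], test=["de-3"])

combined = combine([english, german])
print(len(combined.train), len(combined.dev), len(combined.test))  # prints 3 2 2
```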





**Reading Your Own Sequence Labeling Dataset**

If you want to train on a sequence labeling dataset that is not available as a ready-made class in flair.datasets, you can load it with the ColumnCorpus object.

Most sequence labeling datasets in NLP use some sort of column format in which each line is a word and each column is one level of linguistic annotation.

See for instance these two sentences:



```
George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

Sam N B-PER
Houston N I-PER
stayed V O
home N O
```





The first column is the word itself, the second holds coarse PoS tags, and the third holds BIO-annotated NER tags. An empty line separates sentences.

To read such a dataset, define the column structure as a dictionary and instantiate a ColumnCorpus.





```
from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define columns
columns = {0: 'text', 1: 'pos', 2: 'ner'}

# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data/folder'

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')
```





This gives you a Corpus object that contains the train, dev and test splits, each of which is a list of Sentence objects.

To check how many sentences there are in the training split, run:





```
len(corpus.train)
```





You can also access a sentence and inspect its annotations.

Let's assume that the training split is read from the example above. Executing these commands

```
print(corpus.train[0].to_tagged_string('ner'))
print(corpus.train[1].to_tagged_string('pos'))
```

will print the sentences with different layers of annotation:

```
George <B-PER> Washington <I-PER> went to Washington <B-LOC>

Sam <N> Houston <N> stayed <V> home <N>
```

The exact formatting depends on the Flair version: the 0.11 release used earlier in this notebook prints the `Sentence: "..." → [...]` form shown in the UD_ENGLISH example.
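
To make the column format described in this section concrete, here is a minimal pure-Python reader for it, together with a small renderer that produces the inline-tag strings shown above. It is a sketch that runs without Flair and is not how ColumnCorpus is implemented. `read_column_data` and `to_tagged_string` are hypothetical helpers, and the sketch assumes whitespace-separated columns and that the `O` tag is left out of the rendered string.

```python
SAMPLE = """\
George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

Sam N B-PER
Houston N I-PER
stayed V O
home N O
"""


def read_column_data(text, columns):
    """Split column-formatted text into sentences of {column_name: value} tokens."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append({name: fields[index] for index, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


def to_tagged_string(sentence, label_type):
    """Render tokens as 'word <TAG>', leaving out the BIO 'O' tag."""
    parts = []
    for token in sentence:
        parts.append(token["text"])
        if token[label_type] != "O":
            parts.append(f"<{token[label_type]}>")
    return " ".join(parts)


# Same column mapping as in the ColumnCorpus example.
columns = {0: "text", 1: "pos", 2: "ner"}
sentences = read_column_data(SAMPLE, columns)

print(to_tagged_string(sentences[0], "ner"))
print(to_tagged_string(sentences[1], "pos"))
```

The `columns` dictionary plays the same role here as in the ColumnCorpus call: it names each column by position, so adding another annotation layer only means adding another entry.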

