In [1]:
!pip install flair



# Flair Experiments

## I. Ontonotes Named Entity Recognition (English)

### a. Data

The [Ontonotes corpus](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf) is a widely used resource for a range of NLP tasks and contains rich NER annotation. Obtain the corpus and split it into train, dev and test sets using the scripts provided by the [CoNLL-12 shared task](http://conll.cemantix.org/2012/data.html).

Place train, test and dev data in CoNLL-03 format in `resources/tasks/onto-ner/` as follows:

`resources/tasks/onto-ner/eng.testa`

`resources/tasks/onto-ner/eng.testb`

`resources/tasks/onto-ner/eng.train`
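
Before training, it can help to check that the files really follow the expected column layout. The sketch below does this without Flair. It assumes whitespace-separated columns, one token per line, and a blank line between sentences; the helper `read_column_sentences`, the four-column layout and the sample text are illustrative assumptions, not part of Flair or of the corpus.

```python
from typing import Dict, List


def read_column_sentences(lines: List[str],
                          column_format: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Group token lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            if current:
                sentences.append(current)
                current = []
            continue
        # keep only the columns that are named in column_format
        current.append({name: fields[idx] for idx, name in column_format.items()})
    if current:
        sentences.append(current)
    return sentences


# hypothetical sample in a four-column layout (token, two POS columns, NER tag)
sample = """George NNP PROPN B-PERSON
Washington NNP PROPN I-PERSON
went VBD VERB O
to TO ADP O
Washington NNP PROPN B-GPE

He PRP PRON O
stayed VBD VERB O
""".splitlines()

column_format = {0: 'text', 1: 'pos', 2: 'upos', 3: 'ner'}
sentences = read_column_sentences(sample, column_format)
print(len(sentences))  # 2
print([(t['text'], t['ner']) for t in sentences[0]])
```

To check a real file, pass `open(path).read().splitlines()` instead of `sample`. A line with fewer columns than expected then raises an `IndexError`, which is a quick way to spot a malformed export.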





### b. Best Known Configuration

Once you have the data, reproduce our experiments exactly as for CoNLL-03, but with a different dataset and with FastText embeddings, which work better on this dataset.

You also need to provide a `column_format` for the `ColumnCorpus` object, indicating which column in the training file holds the 'ner' annotation. The full code is then as follows:

In [2]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings
from flair.embeddings import StackedEmbeddings, FlairEmbeddings
from typing import List

In [0]:
# 1. get the corpus

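# ColumnCorpus is imported above, so it is called directly
# (the `flair` package itself is never imported in this notebook)
corpus: Corpus = ColumnCorpus('resources/tasks/onto-ner',
                              column_format = {0: 'text', 1: 'pos', 2: 'upos', 3: 'ner'},
                              tag_to_bioes = 'ner')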
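
The `tag_to_bioes` argument requests that the 'ner' tags be converted to the BIOES scheme, in which single-token entities are marked `S-` and the last token of a longer entity is marked `E-`. The following is a minimal sketch of such a conversion, assuming the input is already in IOB2 (every entity starts with `B-`). The helper `iob2_to_bioes` is illustrative and is not Flair's implementation.

```python
from typing import List


def iob2_to_bioes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to BIOES by looking one tag ahead."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            bioes.append(tag)
            continue
        prefix, label = tag.split('-', 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        # the entity continues only if the next tag is I- with the same label
        continues = next_tag == 'I-' + label
        if prefix == 'B':
            bioes.append(('B-' if continues else 'S-') + label)
        else:  # prefix == 'I'
            bioes.append(('I-' if continues else 'E-') + label)
    return bioes


print(iob2_to_bioes(['B-PERSON', 'I-PERSON', 'O', 'B-GPE']))
# ['B-PERSON', 'E-PERSON', 'O', 'S-GPE']
```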

In [0]:
# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type = tag_type)
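
Conceptually, a tag dictionary maps each distinct tag in the corpus to an integer id that the tagger can predict. The sketch below shows the idea in plain Python. `build_tag_index` is a hypothetical helper, and Flair's own `Dictionary` may hold additional special entries that are not modelled here.

```python
from typing import Dict, List


def build_tag_index(tagged_sentences: List[List[str]]) -> Dict[str, int]:
    """Assign consecutive ids to tags in order of first appearance."""
    tag_to_id: Dict[str, int] = {}
    for tags in tagged_sentences:
        for tag in tags:
            if tag not in tag_to_id:
                tag_to_id[tag] = len(tag_to_id)
    return tag_to_id


print(build_tag_index([['B-PERSON', 'E-PERSON', 'O'], ['O', 'S-GPE']]))
# {'B-PERSON': 0, 'E-PERSON': 1, 'O': 2, 'S-GPE': 3}
```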

In [0]:
# 4. initialize embeddings

embedding_types: List[TokenEmbeddings] = [
                  WordEmbeddings('crawl'),
                  FlairEmbeddings('news-forward'),
                  FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings = embedding_types)

In [0]:
# 5. initialize sequence tagger

from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size = 256,
                                        embeddings = embeddings,
                                        tag_dictionary = tag_dictionary,
                                        tag_type = tag_type)

In [0]:
# 6. initialize trainer

from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-ner',
              learning_rate = 0.1,
              train_with_dev = True,
              # the dataset is large, so consider setting embeddings_storage_mode
              # to 'none' (embeddings are then not kept in memory)
              embeddings_storage_mode = 'none')

## II. Penn Treebank Part-of-Speech Tagging (English)

### a. Data

Get the [Penn treebank](https://catalog.ldc.upenn.edu/ldc99t42) and follow the guidelines in [Collins (2002)](http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf) to produce train, dev and test splits. Convert the splits into CoNLL-U format and place the train, test and dev data in `/path/to/penn/` as follows:

`/path/to/penn/test.conll`

`/path/to/penn/train.conll`

`/path/to/penn/valid.conll`

Then, run the experiments with extvec embeddings and contextual string embeddings. Also, select 'pos' as `tag_type`, so the algorithm knows that POS tags and not NER are to be predicted from this data.
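
The conversion to CoNLL-U can be sketched as follows. CoNLL-U uses ten tab-separated columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and a blank line after each sentence. This sketch places the Penn Treebank tag in the XPOS column on the assumption that this is the column read as 'pos'; verify this against your Flair version. The helper `to_conllu` is hypothetical, and HEAD is filled with a numeric placeholder because some readers parse it as an integer.

```python
from typing import List, Tuple


def to_conllu(sentence: List[Tuple[str, str]]) -> str:
    """Render (token, PTB tag) pairs as one CoNLL-U sentence block."""
    rows = []
    for i, (form, ptb_tag) in enumerate(sentence, start=1):
        # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        # '_' marks an unspecified field; HEAD '0' is only a placeholder
        columns = [str(i), form, '_', '_', ptb_tag, '_', '0', '_', '_', '_']
        rows.append('\t'.join(columns))
    # a blank line terminates the sentence
    return '\n'.join(rows) + '\n\n'


print(to_conllu([('The', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]), end='')
```

Writing the output of `to_conllu` for every sentence of a split into `train.conll`, `valid.conll` or `test.conll` gives files in the layout listed above.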

### b. Best Known Configuration

In [0]:
from flair.data import Corpus
from flair.datasets import UniversalDependenciesCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings
from flair.embeddings import StackedEmbeddings, FlairEmbeddings
from typing import List

In [0]:
# 1. get the corpus

corpus: Corpus = UniversalDependenciesCorpus(base_path = '/path/to/penn')

In [0]:
# 2. what tag do we want to predict?
tag_type = 'pos'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type = tag_type)

In [0]:
# 4. initialize embeddings

embedding_types: List[TokenEmbeddings] = [
            WordEmbeddings('extvec'),
            FlairEmbeddings('news-forward'),
            FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings = embedding_types)
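
`StackedEmbeddings` concatenates the vectors of its component embeddings for each token, so the tagger's input size is the sum of their lengths. The arithmetic is sketched below. The size of 2048 per Flair language model and the total of 4396 come from the model summary printed at the end of this notebook for the same embedding stack. The value 300 for `extvec` is inferred as the remainder and should be treated as an assumption.

```python
# per-token vector length of each component (300 is inferred, see above)
component_sizes = {
    'extvec': 300,
    'news-forward': 2048,
    'news-backward': 2048,
}

# concatenation means the lengths simply add up
stacked_size = sum(component_sizes.values())
print(stacked_size)  # 4396
```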

In [0]:
# 5. initialize sequence tagger

from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size = 256,
                                        embeddings = embeddings,
                                        tag_dictionary = tag_dictionary,
                                        tag_type = tag_type)

In [0]:
# 6. initialize trainer

from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-pos',
              train_with_dev = True,
              max_epochs = 150)

## III. CoNLL-2000 Noun Phrase Chunking (English)

### a. Data

The data is included in Flair and is downloaded automatically when you run the script.

### b. Best Known Configuration

Run the code with extvec embeddings and our proposed contextual string embeddings. Use 'np' as `tag_type`, so the algorithm knows that chunking tags and not NER are to be predicted from this data.
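
To make the chunking tags concrete, the sketch below groups tokens into chunks from their tags. It assumes BIO-style tags with a chunk-type suffix such as `B-NP` and `I-NP`; the helper `extract_chunks` and the sample sentence are illustrative and not part of Flair.

```python
from typing import List, Tuple


def extract_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (chunk type, chunk text) pairs from BIO chunk tags."""
    chunks = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        # a chunk ends at 'O', at a new 'B-' tag, or when the type changes
        starts_new = tag == 'O' or tag.startswith('B-') or tag[2:] != label
        if starts_new and words:
            chunks.append((label, ' '.join(words)))
            words, label = [], None
        if tag != 'O':
            if not words:
                label = tag[2:]
            words.append(token)
    if words:
        chunks.append((label, ' '.join(words)))
    return chunks


tokens = ['He', 'reckons', 'the', 'current', 'account', 'deficit']
tags = ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP', 'I-NP']
print(extract_chunks(tokens, tags))
# [('NP', 'He'), ('VP', 'reckons'), ('NP', 'the current account deficit')]
```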

In [0]:
from flair.data import Corpus
from flair.datasets import CONLL_2000
from flair.embeddings import TokenEmbeddings, WordEmbeddings
from flair.embeddings import StackedEmbeddings, FlairEmbeddings
from typing import List

In [5]:
# 1. get the corpus
corpus: Corpus = CONLL_2000()

2019-12-23 08:16:48,816 Reading data from /root/.flair/datasets/conll_2000
2019-12-23 08:16:48,817 Train: /root/.flair/datasets/conll_2000/train.txt
2019-12-23 08:16:48,818 Dev: None
2019-12-23 08:16:48,818 Test: /root/.flair/datasets/conll_2000/test.txt





In [0]:
# 2. what tag do we want to predict?
tag_type = 'np'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type = tag_type)

In [7]:
# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [
              WordEmbeddings('extvec'),
              FlairEmbeddings('news-forward'),
              FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings = embedding_types)







In [0]:
# 5. initialize sequence tagger

from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size = 256,
                                        embeddings = embeddings,
                                        tag_dictionary = tag_dictionary,
                                        tag_type = tag_type)

In [0]:
# 6. initialize trainer

from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/example-chunk',
              train_with_dev = True,
              max_epochs = 150)

2019-12-23 08:23:18,621 ----------------------------------------------------------------------------------------------------
2019-12-23 08:23:18,624 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('extvec')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4396, out_features=4396, bias=True)
  (rnn): LSTM(4396, 256, batch_first=True, 
