# Flair: A Natural Language Processing Library

---

Flair is an NLP library built on top of *PyTorch*. The NLP tasks it handles include *Named-Entity Recognition*, *Part-of-Speech Tagging*, *Text Classification*, and *Custom Language Modeling*.

What makes Flair stand out is that it is built around state-of-the-art word embeddings and lets users combine different embeddings to represent words and documents.

---

## contextual string embeddings for sequence labeling

---

Contextual String Embeddings are a type of word embedding produced from the internal states of a trained character language model. Because those states depend on the surrounding characters, the same word can be represented differently, and so carry a different meaning, in different sentences.

In contextual string embeddings, words are modeled as sequences of characters, and each embedding is contextualized by the surrounding text. This means the same word can have different embeddings depending on its context.

Take, for instance, the word *key*. In some contexts it is an object that unlocks something; in others it marks what is most important, as in the *key* takeaway or the *key* point; and in still others it labels a value, as in *key*-value pairs.

With contextual string embeddings, each of these *keys* is given a separate, context-dependent representation. Think of all the cases in which the same English word appears in different contexts and you'll see the benefit of the tool.
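
To make the idea concrete, here is a toy sketch in plain Python. It is not Flair's model: `run_char_rnn` is a fixed, untrained recurrence invented for this illustration, and `contextual_embedding` is a hypothetical helper. What it does mirror is the general recipe: read the characters left to right up to the end of the word, read them right to left down to the start of the word, and join the two resulting states.

```python
import math

DIM = 4  # toy state size; the Flair language models used in this notebook have 2048 units


def run_char_rnn(chars):
    """Fold a character sequence into a small state vector.

    A fixed, untrained recurrence used purely for illustration.
    """
    state = [0.0] * DIM
    for ch in chars:
        code = ord(ch)
        state = [
            math.tanh(0.5 * state[(i + 1) % DIM] + math.sin(code * (i + 1)))
            for i in range(DIM)
        ]
    return state


def contextual_embedding(sentence, word):
    """Embed the first occurrence of `word` using the characters around it."""
    start = sentence.index(word)
    end = start + len(word)
    forward = run_char_rnn(sentence[:end])               # left context plus the word
    backward = run_char_rnn(reversed(sentence[start:]))  # right context plus the word, read backwards
    return forward + backward                            # list concatenation


unlock = contextual_embedding("I lost my key", "key")
point = contextual_embedding("The key point is clear", "key")
again = contextual_embedding("I lost my key", "key")

print(unlock != point)  # does the same word get different vectors in different sentences?
print(unlock == again)  # does the same word in the same sentence get the same vector?
```

A static lookup table would return one vector for *key* regardless of the sentence; here the vector depends on everything the two passes have read.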


---

## connecting to *Google Drive*, importing dataset

---




In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
# file_id = '1fr4ff3mLKTY0WOvXI1x4Fj9Xu_hgxyQV' ### File ID ###
# file_id = '1GhyH4k9C4uPRnMAMKhJYOqa-V9Tqt4q8' ### File ID ###
# data = drive.CreateFile({'id': file_id})

In [None]:
import shutil

## transferring dataset into readable format

---



In [None]:
# install Flair, which is built on top of PyTorch
!pip install flair
import torch
import flair



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Flair's two core data types are the *Sentence* and *Token* objects, which are central to the library. A Sentence holds a textual sentence and is essentially a list of Tokens.
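
To make the sentence/token relationship concrete without depending on Flair itself, here is a small sketch that parses text in the column layout the corpus files loaded later in this notebook are assumed to follow: one token per line with whitespace-separated columns, and a blank line between sentences. The sample lines, their tags, and the helper `read_column_text` are invented for illustration; only the column mapping `{0: 'text', 1: 'pos', 2: 'ner'}` comes from the notebook.

```python
# Column mapping copied from the notebook; everything else here is illustrative.
columns = {0: 'text', 1: 'pos', 2: 'ner'}

# Made-up sample in a one-token-per-line layout (tags are placeholders).
sample = """\
In APPR O
Neupflanzungen NN O

Das ART O
Tier NN O
"""


def read_column_text(text, columns):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append({name: fields[i] for i, name in columns.items()})
    if current:                       # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


sentences = read_column_text(sample, columns)
for sentence in sentences:
    print(" ".join(token['text'] for token in sentence))
```

Each inner list plays the role of a sentence and each dict the role of a token carrying its text plus one label per column.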

In [None]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from typing import List
import gensim
import flair, torch
# flair.device = torch.device('cuda:0')

import time
start_time = time.time()

import os

In [None]:
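# use the first GPU if one is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')

# tell Flair which device to run on
flair.device = device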

In [None]:
columns = {0: 'text', 1: 'pos', 2: 'ner'}
    
data_path = os.path.join(os.getcwd(),"drive/MyDrive/JKI_Arbeit/HortiSem")
# initializing the corpus
corpus: Corpus = ColumnCorpus(data_path, columns,
                              train_file = 'train.txt',
                              dev_file = 'dev.txt',
                              test_file = 'test.txt')

print(len(corpus.train))
print(corpus.train[58].to_tagged_string('ner'))


# tag to predict
tag_type = 'ner'

tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# print(corpus)

# word_vectors = gensim.models.KeyedVectors.load_word2vec_format('german.model', binary=True)
# word_vectors.save('german.model.gensim')

# german_embedding = WordEmbeddings('german.model.gensim')

# init forward and backward Flair embeddings for German
flair_embedding_forward = FlairEmbeddings('de-forward')
flair_embedding_backward = FlairEmbeddings('de-backward')

embedding_types: List[TokenEmbeddings] = [
    # german_embedding,
    flair_embedding_forward,
    flair_embedding_backward,
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)


trainer : ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('models',
              learning_rate=0.01,
              mini_batch_size=64,
              max_epochs=50,
              )


print(f"Training took {time.time() - start_time:.1f} seconds")


2021-09-20 08:17:29,364 Reading data from /content/drive/MyDrive/JKI_Arbeit/HortiSem
2021-09-20 08:17:29,366 Train: /content/drive/MyDrive/JKI_Arbeit/HortiSem/train.txt
2021-09-20 08:17:29,368 Dev: /content/drive/MyDrive/JKI_Arbeit/HortiSem/dev.txt
2021-09-20 08:17:29,370 Test: /content/drive/MyDrive/JKI_Arbeit/HortiSem/test.txt
1377
In Neupflanzungen ist dieser bereits bei 1 Tier auf 2 - 3 Haupttrieben erreicht
2021-09-20 08:17:33,582 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/lm-mix-german-forward-v0.2rc.pt not found in cache, downloading to /tmp/tmp53c8ef__


100%|██████████| 72818995/72818995 [00:04<00:00, 14584259.99B/s]

2021-09-20 08:17:38,960 copying /tmp/tmp53c8ef__ to cache at /root/.flair/embeddings/lm-mix-german-forward-v0.2rc.pt
2021-09-20 08:17:39,044 removing temp file /tmp/tmp53c8ef__





2021-09-20 08:17:50,875 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/lm-mix-german-backward-v0.2rc.pt not found in cache, downloading to /tmp/tmpyjie3t2v


100%|██████████| 72818995/72818995 [00:04<00:00, 16020632.82B/s]

2021-09-20 08:17:56,035 copying /tmp/tmpyjie3t2v to cache at /root/.flair/embeddings/lm-mix-german-backward-v0.2rc.pt





2021-09-20 08:17:56,134 removing temp file /tmp/tmpyjie3t2v
2021-09-20 08:17:57,527 ----------------------------------------------------------------------------------------------------
2021-09-20 08:17:57,529 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=275, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=275, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4096, out_features=4096, bias=True)
  (rnn): LSTM(4096, 256, batch_f

In [None]:
import shutil
shutil.move('/content/resources/taggers/ner_bb/', "/content/drive/MyDrive/model/")

'/content/drive/MyDrive/model/'
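
`shutil.move` returns the final destination path, which is what the cell output above shows. The sketch below uses a temporary directory to stand in for both the Colab filesystem and the mounted Drive; the folder layout and the file name `final-model.pt` are placeholders. It shows the return value and that the source folder no longer exists afterwards.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Placeholder for a folder produced by training.
    src = root / "resources" / "taggers" / "ner_bb"
    src.mkdir(parents=True)
    (src / "final-model.pt").write_text("placeholder weights")

    # Placeholder for the mounted Drive.
    drive = root / "drive" / "MyDrive"
    drive.mkdir(parents=True)

    # The destination does not exist yet, so the folder is renamed to it.
    returned = shutil.move(str(src), str(drive / "model"))

    moved_files = sorted(p.name for p in Path(returned).iterdir())
    source_gone = not src.exists()

print(Path(returned).name)
print(moved_files)
print(source_gone)
```

If the destination already exists as a directory, `shutil.move` places the source folder inside it instead, so the returned path is worth checking before loading a model from it.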

In [None]:
# initializing the corpus
corpus: Corpus = ColumnCorpus(data_path, columns,
                              train_file = 'train.txt',
                              dev_file = 'dev.txt',
                              test_file = 'test.txt')

print(len(corpus.train))

# tag to predict
tag_type = 'ner'

tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# print(corpus)

# word_vectors = gensim.models.KeyedVectors.load_word2vec_format('german.model', binary=True)
# word_vectors.save('german.model.gensim')

# german_embedding = WordEmbeddings('german.model.gensim')

# init forward and backward Flair embeddings for German
flair_embedding_forward = FlairEmbeddings('de-forward')
flair_embedding_backward = FlairEmbeddings('de-backward')

embedding_types: List[TokenEmbeddings] = [
    # german_embedding,
    flair_embedding_forward,
    flair_embedding_backward,
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)


trainer : ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('model',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=50,
              embeddings_storage_mode='gpu'
              )

2021-09-20 09:15:50,568 Reading data from /content/drive/MyDrive/JKI_Arbeit/HortiSem
2021-09-20 09:15:50,571 Train: /content/drive/MyDrive/JKI_Arbeit/HortiSem/train.txt
2021-09-20 09:15:50,573 Dev: /content/drive/MyDrive/JKI_Arbeit/HortiSem/dev.txt
2021-09-20 09:15:50,576 Test: /content/drive/MyDrive/JKI_Arbeit/HortiSem/test.txt
1377
2021-09-20 09:15:53,548 ----------------------------------------------------------------------------------------------------
2021-09-20 09:15:53,550 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=275, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 2048)

{'dev_loss_history': [tensor(0.4702, device='cuda:0'),
  tensor(0.3804, device='cuda:0'),
  tensor(0.3119, device='cuda:0'),
  tensor(0.2167, device='cuda:0'),
  tensor(0.2580, device='cuda:0'),
  tensor(0.1975, device='cuda:0'),
  tensor(0.1574, device='cuda:0'),
  tensor(0.1454, device='cuda:0'),
  tensor(0.1346, device='cuda:0'),
  tensor(0.1221, device='cuda:0'),
  tensor(0.1872, device='cuda:0'),
  tensor(0.1164, device='cuda:0'),
  tensor(0.1275, device='cuda:0'),
  tensor(0.1171, device='cuda:0'),
  tensor(0.1138, device='cuda:0'),
  tensor(0.0984, device='cuda:0'),
  tensor(0.0928, device='cuda:0'),
  tensor(0.1207, device='cuda:0'),
  tensor(0.0931, device='cuda:0'),
  tensor(0.1225, device='cuda:0'),
  tensor(0.0823, device='cuda:0'),
  tensor(0.0929, device='cuda:0'),
  tensor(0.0794, device='cuda:0'),
  tensor(0.0790, device='cuda:0'),
  tensor(0.0749, device='cuda:0'),
  tensor(0.0800, device='cuda:0'),
  tensor(0.0823, device='cuda:0'),
  tensor(0.0708, device='cuda:0'),
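
The dev loss in the output above does not fall monotonically. As a rough sketch, the values visible in the truncated output can be copied into plain floats to find the epoch with the lowest dev loss and the epochs where it got worse. With the actual dictionary, each tensor would first need converting, for example with `float(t)`; that step is left out here so the sketch runs without PyTorch.

```python
# Dev losses copied from the truncated output above (one value per epoch).
dev_loss_history = [
    0.4702, 0.3804, 0.3119, 0.2167, 0.2580, 0.1975, 0.1574,
    0.1454, 0.1346, 0.1221, 0.1872, 0.1164, 0.1275, 0.1171,
    0.1138, 0.0984, 0.0928, 0.1207, 0.0931, 0.1225, 0.0823,
    0.0929, 0.0794, 0.0790, 0.0749, 0.0800, 0.0823, 0.0708,
]

best_loss = min(dev_loss_history)
best_epoch = dev_loss_history.index(best_loss) + 1  # epochs are counted from 1

# Epochs whose dev loss is higher than that of the epoch before.
worse_epochs = [
    epoch
    for epoch, (prev, cur) in enumerate(
        zip(dev_loss_history, dev_loss_history[1:]), start=2
    )
    if cur > prev
]

print(f"best dev loss {best_loss:.4f} at epoch {best_epoch}")
print("dev loss increased at epochs:", worse_epochs)
```

Because the printed list is cut off, this only describes the epochs that are visible, not the whole run.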


In [None]:
columns = {0: 'text', 1: 'pos', 2: 'ner'}
    
data_path = os.path.join(os.getcwd(),"drive/MyDrive/JKI_Arbeit/HortiSem")

# initializing the corpus
corpus: Corpus = ColumnCorpus(data_path, columns,
                              train_file = 'train.txt',
                              dev_file = 'dev.txt',
                              test_file = 'test.txt')


print(len(corpus.train))

# tag to predict
tag_type = 'ner'

tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# print(corpus)

# word_vectors = gensim.models.KeyedVectors.load_word2vec_format('german.model', binary=True)
# word_vectors.save('german.model.gensim')

# german_embedding = WordEmbeddings(os.path.join(data_path,'german.model.gensim'))

# init forward and backward Flair embeddings for German
flair_embedding_forward = FlairEmbeddings('de-forward')
flair_embedding_backward = FlairEmbeddings('de-backward')

embedding_types: List[TokenEmbeddings] = [
    # german_embedding,
    flair_embedding_forward,
    flair_embedding_backward,
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)


trainer : ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('/content/drive/MyDrive/jmodel',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=60
              )

2021-09-21 11:49:19,069 Reading data from /content/drive/MyDrive/JKI_Arbeit/HortiSem
2021-09-21 11:49:19,071 Train: /content/drive/MyDrive/JKI_Arbeit/HortiSem/train.txt
2021-09-21 11:49:19,073 Dev: /content/drive/MyDrive/JKI_Arbeit/HortiSem/dev.txt
2021-09-21 11:49:19,077 Test: /content/drive/MyDrive/JKI_Arbeit/HortiSem/test.txt
1377
2021-09-21 11:49:23,895 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/lm-mix-german-forward-v0.2rc.pt not found in cache, downloading to /tmp/tmp7dekkb0w


100%|██████████| 72818995/72818995 [00:02<00:00, 29718110.89B/s]

2021-09-21 11:49:26,441 copying /tmp/tmp7dekkb0w to cache at /root/.flair/embeddings/lm-mix-german-forward-v0.2rc.pt





2021-09-21 11:49:26,558 removing temp file /tmp/tmp7dekkb0w
2021-09-21 11:49:38,108 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/lm-mix-german-backward-v0.2rc.pt not found in cache, downloading to /tmp/tmp48mnvh7h


100%|██████████| 72818995/72818995 [00:02<00:00, 33149643.41B/s]

2021-09-21 11:49:40,403 copying /tmp/tmp48mnvh7h to cache at /root/.flair/embeddings/lm-mix-german-backward-v0.2rc.pt





2021-09-21 11:49:40,506 removing temp file /tmp/tmp48mnvh7h
2021-09-21 11:49:42,763 ----------------------------------------------------------------------------------------------------
2021-09-21 11:49:42,765 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=275, bias=True)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=275, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4096, out_features=4096, bias=True)
  (rnn): LSTM(4096, 256, batch_f

{'dev_loss_history': [tensor(0.5038, device='cuda:0'),
  tensor(0.3965, device='cuda:0'),
  tensor(0.2915, device='cuda:0'),
  tensor(0.2233, device='cuda:0'),
  tensor(0.2321, device='cuda:0'),
  tensor(0.1796, device='cuda:0'),
  tensor(0.1731, device='cuda:0'),
  tensor(0.1471, device='cuda:0'),
  tensor(0.1370, device='cuda:0'),
  tensor(0.1441, device='cuda:0'),
  tensor(0.1218, device='cuda:0'),
  tensor(0.1449, device='cuda:0'),
  tensor(0.1044, device='cuda:0'),
  tensor(0.1057, device='cuda:0'),
  tensor(0.1014, device='cuda:0'),
  tensor(0.1103, device='cuda:0'),
  tensor(0.0978, device='cuda:0'),
  tensor(0.0921, device='cuda:0'),
  tensor(0.1045, device='cuda:0'),
  tensor(0.0946, device='cuda:0'),
  tensor(0.0813, device='cuda:0'),
  tensor(0.0792, device='cuda:0'),
  tensor(0.0878, device='cuda:0'),
  tensor(0.0872, device='cuda:0'),
  tensor(0.0750, device='cuda:0'),
  tensor(0.0735, device='cuda:0'),
  tensor(0.0742, device='cuda:0'),
  tensor(0.0698, device='cuda:0'),


In [None]:
shutil.move('/content/model/', "/content/drive/MyDrive/JKI_Arbeit/HortiSem/FlairModel")

'/content/drive/MyDrive/JKI_Arbeit/HortiSem/FlairModel'

## word embeddings with *Flair*

---

The cell below lists a few of the more popular word embeddings. We will use Stacked Embeddings, which concatenate the vectors of several embeddings into a single, richer word representation without adding much complexity to the code.
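
Before turning to the Flair cell, here is a library-free illustration of what stacking means: two per-token vectors are concatenated into one. The tiny vectors and the helper `stack` are made up for the example; the sizes in the final check come from the model summary printed earlier, where two 2048-dimensional Flair language models feed a layer with 4096 input features.

```python
def stack(*vectors):
    """Concatenate several embedding vectors for one token into a single vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Made-up 3-dimensional vectors standing in for a forward and a backward embedding.
forward = [0.1, 0.2, 0.3]
backward = [0.9, 0.8, 0.7]

token_vector = stack(forward, backward)
print(token_vector)       # forward values followed by backward values
print(len(token_vector))  # length is the sum of the input lengths

# Same arithmetic with the sizes from the model summary: 2048 + 2048.
stacked_size = len(stack([0.0] * 2048, [0.0] * 2048))
print(stacked_size)
```

Adding another embedding to the stack simply lengthens the vector, which is why the layers after it must be sized to the combined length.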