<a href="https://colab.research.google.com/github/ApurbaPaul-NLP/FLAIR-MODELS/blob/main/Prog2_06_09_2022_WordEmbeddings_FlairEmbeddings_StackedEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install flair


# **Embeddings**

All word embedding classes inherit from the TokenEmbeddings class and implement the embed() method, which you call to embed your text. 

This means that for most users of Flair, the complexity of different embeddings remains hidden behind this interface. 

Simply instantiate the embedding class you require and call embed() to embed your text. 

All embeddings produced with our methods are PyTorch vectors, so they can be immediately used for training and fine-tuning.
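
The idea of a shared interface can be sketched in plain Python. This is a toy illustration, not Flair's code: `ToyToken`, `ToyEmbeddings`, `LengthEmbeddings` and `VowelEmbeddings` are hypothetical names invented here, and plain lists stand in for PyTorch vectors. It only shows why the calling code stays the same whichever embedding class is chosen.

```python
from abc import ABC, abstractmethod


class ToyToken:
    """Hypothetical token that stores the vector assigned to it."""

    def __init__(self, text):
        self.text = text
        self.embedding = None


class ToyEmbeddings(ABC):
    """Hypothetical base class: subclasses only define how one token is embedded."""

    @abstractmethod
    def _vector_for(self, token):
        ...

    def embed(self, tokens):
        # Shared entry point: callers use embed() whatever the subclass does.
        for token in tokens:
            token.embedding = self._vector_for(token)
        return tokens


class LengthEmbeddings(ToyEmbeddings):
    def _vector_for(self, token):
        # One made-up feature: the number of characters.
        return [float(len(token.text))]


class VowelEmbeddings(ToyEmbeddings):
    def _vector_for(self, token):
        # One made-up feature: the number of vowels.
        return [float(sum(ch in "aeiou" for ch in token.text.lower()))]


# The loop body is identical for both embedders.
for embedder in (LengthEmbeddings(), VowelEmbeddings()):
    tokens = embedder.embed([ToyToken("The"), ToyToken("grass")])
    print(type(embedder).__name__, [t.embedding for t in tokens])
```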

**Classic Word Embeddings**

Classic word embeddings are static and word-level, meaning that each distinct word gets exactly one pre-computed embedding. 

Most embeddings fall under this class, including the popular GloVe or Komninos embeddings.
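
To make "static" concrete, here is a minimal sketch in plain Python. It is not how Flair stores its vectors: the table `toy_vectors`, the fallback `UNKNOWN` and the helper `embed_static` are hypothetical, and the numbers are invented. The point is only that the lookup ignores context, so a word receives the same vector wherever it appears.

```python
# Toy illustration of a static, word-level embedding (not Flair's code).
# The vectors below are invented for the example.
toy_vectors = {
    "the": [0.1, 0.3],
    "grass": [0.7, -0.2],
    "is": [0.0, 0.5],
    "green": [0.6, -0.1],
}
UNKNOWN = [0.0, 0.0]  # assumed fallback for words missing from the table


def embed_static(tokens):
    """Look up one pre-computed vector per token, ignoring context."""
    return [toy_vectors.get(token.lower(), UNKNOWN) for token in tokens]


first = embed_static(["The", "grass", "is", "green"])
second = embed_static(["Green", "is", "the", "grass"])

# "green" gets the same vector in both sentences, whatever its neighbours are.
print(first[3] == second[0])
```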

Simply instantiate the WordEmbeddings class and pass a string identifier of the embedding you wish to load. 

So, if you want to use GloVe embeddings, pass the string 'glove' to the constructor:

In [2]:
from flair.embeddings import WordEmbeddings
from flair.data import Sentence

# init embedding
glove_embedding = WordEmbeddings('glove')

In [3]:
# create sentence.
sentence = Sentence('The grass is green .')

# embed a sentence using glove.
glove_embedding.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "The"
tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459

**English-FastText Embedding**

In [4]:
eng_embedding = WordEmbeddings('en-crawl')
eng_embedding.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "The"
tensor([ 3.4100e-02,  2.3550e-01, -6.3600e-02, -2.6600e-02,  3.9000e-02,
         1.8200e-02,  1.5850e-01, -3.9070e-01, -4.3700e-02, -4.8400e-02,
        -1.0740e-01,  8.3800e-02, -2.5350e-01, -3.0200e-02, -1.5200e-01,
        -2.3300e-02,  2.1290e-01, -1.2400e-02, -5.9100e-02,  4.3200e-02,
        -2.9000e-03, -6.3700e-02,  8.1700e-02, -5.1700e-02,  5.1900e-02,
         4.9900e-02, -1.5120e-01, -1.5300e-02, -5.8800e-02, -3.3890e-01,
         3.1600e-02,  2.5000e-03,  1.7000e-02,  2.0200e-01,  2.9000e-02,
        -2.1000e-03, -2.6000e-03,  5.3000e-02,  1.3900e-02,  1.2660e-01,
         5.7500e-02, -2.5300e-02, -7.8000e-02, -1.8300e-02, -1.4100e-01,
        -8.2000e-03,  4.2100e-02, -5.5000e-03, -1.9000e-03, -7.8200e-02,
         2.3600e-02,  3.4040e-01, -1.3570e-01, -9.4500e-02, -2.3200e-02,
         4.2600e-02,  5.9800e-02,  2.1380e-01,  1.0600e-02, -8.6500e-02,
         2.4990e-01,  2.7580e-01,  1.0400e-01,  

**Flair Embeddings**

Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. 

Key differences are: 

    (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and
    (2) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.
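
Both differences can be illustrated with a toy sketch in plain Python. This is not Flair's language model: the running integer state below is a stand-in for a recurrent network's hidden state, and `run_states`, `embed_forward` and `embed_backward` are hypothetical helpers. The sketch reads the text character by character, takes a word's value from the state at the word boundary, and so gives the same word different values in different contexts.

```python
def run_states(text):
    """Return a running state after each character, reading left to right.

    The state is a base-256 encoding of everything read so far
    (assumes characters with code points below 256).
    """
    states, state = [], 0
    for ch in text:
        state = state * 256 + ord(ch)
        states.append(state)
    return states


def embed_forward(text):
    """Forward direction: a word's value is the state after its last character."""
    states = run_states(text)
    result, pos = [], 0
    for word in text.split(" "):
        result.append((word, states[pos + len(word) - 1]))
        pos += len(word) + 1  # skip the separating space
    return result


def embed_backward(text):
    """Backward direction: read right to left, take the state at the word's first character."""
    states = run_states(text[::-1])[::-1]  # states[i] now summarises text[i:]
    result, pos = [], 0
    for word in text.split(" "):
        result.append((word, states[pos]))
        pos += len(word) + 1
    return result


a = dict(embed_forward("The grass is green ."))
b = dict(embed_forward("Paint the grass green ."))
c = dict(embed_forward("The grass is green and wet ."))
print(a["green"] != b["green"])  # different left context, different value
print(a["green"] == c["green"])  # forward direction ignores what follows

back_a = dict(embed_backward("The grass is green ."))
back_c = dict(embed_backward("The grass is green and wet ."))
print(back_a["green"] != back_c["green"])  # backward direction sees what follows
```

In this toy, the forward pass only sees the text to the left of a word and the backward pass only the text to the right, which is one way to see why the two directions are typically used together.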

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, same as standard word embeddings:

In [5]:
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)
# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "The"
tensor([-0.0021,  0.0005,  0.0469,  ..., -0.0004, -0.0393,  0.0106],
       device='cuda:0')
Token[1]: "grass"
tensor([-0.0006,  0.0047,  0.0248,  ..., -0.0004, -0.0236,  0.0117],
       device='cuda:0')
Token[2]: "is"
tensor([ 0.0011, -0.0032,  0.0156,  ..., -0.0061,  0.0112,  0.0100],
       device='cuda:0')
Token[3]: "green"
tensor([-0.0034,  0.0003,  0.0256,  ..., -0.0026, -0.0118,  0.0455],
       device='cuda:0')
Token[4]: "."
tensor([ 0.0008,  0.0002,  0.1262,  ..., -0.0002,  0.0039,  0.0058],
       device='cuda:0')


**Forward and Backward Flair Embeddings**

In [6]:
# init forward and backward embeddings for English
flair_embedding_forward = FlairEmbeddings('en-forward')
flair_embedding_backward = FlairEmbeddings('en-backward')

In [7]:
# embed words in sentence
flair_embedding_forward.embed(sentence)
# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "The"
tensor([-0.0021,  0.0005,  0.0469,  ..., -0.0004, -0.0393,  0.0106],
       device='cuda:0')
Token[1]: "grass"
tensor([-0.0006,  0.0047,  0.0248,  ..., -0.0004, -0.0236,  0.0117],
       device='cuda:0')
Token[2]: "is"
tensor([ 0.0011, -0.0032,  0.0156,  ..., -0.0061,  0.0112,  0.0100],
       device='cuda:0')
Token[3]: "green"
tensor([-0.0034,  0.0003,  0.0256,  ..., -0.0026, -0.0118,  0.0455],
       device='cuda:0')
Token[4]: "."
tensor([ 0.0008,  0.0002,  0.1262,  ..., -0.0002,  0.0039,  0.0058],
       device='cuda:0')


In [8]:
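# embed with the backward model; the sentence still holds the forward
# embeddings from the previous cell, so token.embedding below appears to
# combine both (the trailing values match the forward output)
flair_embedding_backward.embed(sentence)
# now check out the embedded tokens.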
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "The"
tensor([ 0.0085, -0.0139, -0.0008,  ..., -0.0004, -0.0393,  0.0106],
       device='cuda:0')
Token[1]: "grass"
tensor([ 0.0049, -0.0203,  0.0007,  ..., -0.0004, -0.0236,  0.0117],
       device='cuda:0')
Token[2]: "is"
tensor([ 0.0045,  0.0119, -0.0011,  ..., -0.0061,  0.0112,  0.0100],
       device='cuda:0')
Token[3]: "green"
tensor([-0.0012, -0.0028,  0.0070,  ..., -0.0026, -0.0118,  0.0455],
       device='cuda:0')
Token[4]: "."
tensor([-0.0008, -0.0064, -0.0006,  ..., -0.0002,  0.0039,  0.0058],
       device='cuda:0')


**Stacked Embeddings**

Stacked embeddings are one of the most important concepts of this library. You can use them to combine different embeddings, for instance if you want to use both traditional embeddings and contextual string embeddings. Stacked embeddings allow you to mix and match. 

We find that a combination of embeddings often gives the best results.

All you need to do is use the StackedEmbeddings class and instantiate it by passing a list of embeddings that you wish to combine. 

For instance, let's combine classic GloVe embeddings with forward and backward Flair embeddings. 

This is a combination that we generally recommend to most users, especially for sequence labeling.

Instantiate the embeddings you wish to combine, then pass them to StackedEmbeddings:

In [9]:
from flair.embeddings import WordEmbeddings, FlairEmbeddings
from flair.embeddings import StackedEmbeddings

# init standard GloVe embedding
glove_embedding = WordEmbeddings('glove')

# init Flair forward and backwards embeddings
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')
# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
    glove_embedding,
    flair_embedding_forward,
    flair_embedding_backward,
])

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "The"
tensor([-0.0382, -0.2449,  0.7281,  ..., -0.0065, -0.0053,  0.0090],
       device='cuda:0')
Token[1]: "grass"
tensor([-0.8135,  0.9404, -0.2405,  ...,  0.0354, -0.0255, -0.0143],
       device='cuda:0')
Token[2]: "is"
tensor([-5.4264e-01,  4.1476e-01,  1.0322e+00,  ..., -5.3691e-04,
        -9.6750e-03, -2.7541e-02], device='cuda:0')
Token[3]: "green"
tensor([-0.6791,  0.3491, -0.2398,  ..., -0.0007, -0.1333,  0.0161],
       device='cuda:0')
Token[4]: "."
tensor([-0.3398,  0.2094,  0.4635,  ...,  0.0005, -0.0177,  0.0032],
       device='cuda:0')


Words are now embedded using a concatenation of three different embeddings.

This means that the resulting embedding vector is still a single PyTorch vector.
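
The concatenation can be sketched with plain Python lists standing in for PyTorch vectors. This is a toy under stated assumptions, not Flair's implementation: `toy_glove`, `toy_forward`, `toy_backward` and `embed_stacked` are hypothetical, their dimensions are invented, and the parts are joined in list order, which may differ from the order Flair uses internally. What carries over is that the combined vector is flat and its length is the sum of the component lengths.

```python
# Toy stand-ins for three embedders; each returns a fixed-length list of floats.
def toy_glove(word):  # hypothetical, 3 dimensions
    return [float(len(word)), 0.0, 1.0]


def toy_forward(word):  # hypothetical, 2 dimensions
    return [0.5, -0.5]


def toy_backward(word):  # hypothetical, 2 dimensions
    return [0.25, 0.75]


def embed_stacked(word, embedders):
    """Concatenate the outputs of all embedders into one flat vector."""
    vector = []
    for embed in embedders:
        vector.extend(embed(word))
    return vector


stacked = embed_stacked("grass", [toy_glove, toy_forward, toy_backward])
print(len(stacked))  # 3 + 2 + 2 = 7
print(stacked)
```

With the real classes, the same arithmetic can be checked by comparing the embedding_length property of stacked_embeddings with the sum of the embedding_length values of its components.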