<a href="https://colab.research.google.com/github/LxYuan0420/nlp/blob/main/notebooks/flair/TUTORIAL_3_WORD_EMBEDDING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#!pip install flair

##### Tutorial 3: Word Embeddings
We provide a set of classes with which you can embed the words in sentences in various ways. This tutorial explains how that works. We assume that you're familiar with the base types of this library.

###### Embeddings
All word embedding classes inherit from the TokenEmbeddings class and implement the embed() method which you need to call to embed your text. This means that for most users of Flair, the complexity of different embeddings remains hidden behind this interface. Simply instantiate the embedding class you require and call embed() to embed your text. All embeddings produced with our methods are PyTorch vectors, so they can be immediately used for training and fine-tuning.
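
The shared interface described above can be pictured with a small, self-contained sketch. The classes below (`ToyToken`, `ToySentence`, `ToyTokenEmbeddings`, `ToyLengthEmbeddings`) are hypothetical stand-ins written for illustration only. They are not Flair's classes, and they use plain Python lists rather than PyTorch tensors.

```python
from abc import ABC, abstractmethod


class ToyToken:
    """Hypothetical stand-in for a token: holds its text and an embedding."""

    def __init__(self, text):
        self.text = text
        self.embedding = []


class ToySentence:
    """Hypothetical stand-in for a sentence: splits text on whitespace."""

    def __init__(self, text):
        self.tokens = [ToyToken(word) for word in text.split()]


class ToyTokenEmbeddings(ABC):
    """Common interface: subclasses only decide how one token is embedded."""

    @abstractmethod
    def embed_token(self, token):
        """Return a list of floats for a single token."""

    def embed(self, sentences):
        # Accept either one sentence or a list of sentences.
        if isinstance(sentences, ToySentence):
            sentences = [sentences]
        for sentence in sentences:
            for token in sentence.tokens:
                token.embedding = self.embed_token(token)
        return sentences


class ToyLengthEmbeddings(ToyTokenEmbeddings):
    """A deliberately trivial embedding: [word length, 1.0 if capitalized]."""

    def embed_token(self, token):
        return [float(len(token.text)), float(token.text[0].isupper())]


embedding = ToyLengthEmbeddings()
sentence = ToySentence("The grass is green .")
embedding.embed(sentence)
first_token_vector = sentence.tokens[0].embedding
```

Whatever the subclass computes, the calling code only ever touches `embed()` and the per-token `embedding` attribute, which is the point of the common interface.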

This tutorial introduces some common embeddings and shows you how to use them. For more details on these embeddings and an overview of all supported embeddings, see the Flair embeddings documentation.

###### Classic Word Embeddings

Classic word embeddings are static and word-level, meaning that each distinct word gets exactly one pre-computed embedding. Most embeddings fall under this class, including the popular GloVe or Komninos embeddings.
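
To make "static" concrete, here is a minimal sketch of a word-level lookup. The three-dimensional vectors in `toy_vectors` are made up for illustration (real GloVe vectors are pre-trained), and lowercasing the word before lookup is an assumption of this toy, not a statement about how any particular embedding handles case.

```python
# Hypothetical lookup table with invented values.
toy_vectors = {
    "the": [0.1, 0.2, 0.3],
    "grass": [0.9, 0.1, 0.4],
    "is": [0.0, 0.5, 0.5],
    "green": [0.8, 0.2, 0.6],
}


def embed_static(words):
    """Look up each word independently of its neighbours."""
    return [toy_vectors[word.lower()] for word in words]


first = embed_static(["The", "grass", "is", "green"])
second = embed_static(["Green", "is", "the", "grass"])

# "grass" gets the same vector in both sentences, whatever surrounds it.
same_vector = first[1] == second[3]
```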

Simply instantiate the WordEmbeddings class and pass a string identifier of the embedding you wish to load. So, if you want to use GloVe embeddings, pass the string 'glove' to the constructor:

In [2]:
from flair.embeddings import WordEmbeddings
from flair.data import Sentence

# init embedding
glove_embedding = WordEmbeddings("glove")

2022-11-13 07:21:48,977 https://flair.informatik.hu-berlin.de/resources/embeddings/token/glove.gensim.vectors.npy not found in cache, downloading to /tmp/tmpy179fc8e
100%|██████████| 160000128/160000128 [00:10<00:00, 15844361.35B/s]
2022-11-13 07:21:59,521 copying /tmp/tmpy179fc8e to cache at /root/.flair/embeddings/glove.gensim.vectors.npy
2022-11-13 07:21:59,819 removing temp file /tmp/tmpy179fc8e
2022-11-13 07:22:00,257 https://flair.informatik.hu-berlin.de/resources/embeddings/token/glove.gensim not found in cache, downloading to /tmp/tmp2me7d8za
100%|██████████| 21494764/21494764 [00:01<00:00, 11359994.50B/s]
2022-11-13 07:22:02,549 copying /tmp/tmp2me7d8za to cache at /root/.flair/embeddings/glove.gensim
2022-11-13 07:22:02,591 removing temp file /tmp/tmp2me7d8za

Now, create an example sentence and call the embedding's embed() method. You can also pass a list of sentences to this method since some embedding types make use of batching to increase speed.

In [3]:
sentence = Sentence("The grass is green.")

glove_embedding.embed(sentence)

for token in sentence:
    print(token)
    print(token.embedding)
    print(token.embedding.size())

Token[0]: "The"
tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459

This prints out the tokens and their embeddings. GloVe embeddings are PyTorch vectors of dimensionality 100.

You choose which pre-trained embeddings to load by passing the appropriate id string to the constructor of the WordEmbeddings class. Typically, you use the two-letter language code to init an embedding, so 'en' for English, 'de' for German, and so on. By default, this initializes FastText embeddings trained over Wikipedia. To use FastText embeddings trained over web crawls instead, append '-crawl' to the language code. For example, pass 'de-crawl' to use embeddings trained over German web crawls:

In [None]:
german_embedding = WordEmbeddings('de-crawl')

Check out the full list of all word embeddings models here, along with more explanations on this class.
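
The id convention described above (a language code, optionally followed by '-crawl') can be sketched as a tiny string helper. `fasttext_id` is a hypothetical function introduced here for illustration and is not part of Flair. It only builds the string and does not check whether a model exists for that language.

```python
def fasttext_id(language_code, crawl=False):
    """Hypothetical helper: build an id string following the convention above."""
    return f"{language_code}-crawl" if crawl else language_code


wiki_id = fasttext_id("de")               # Wikipedia-trained variant
crawl_id = fasttext_id("de", crawl=True)  # web-crawl variant
```

The resulting string is what you would pass to the WordEmbeddings constructor, as in the 'de-crawl' cell above.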

We generally recommend the FastText embeddings, or GloVe if you want a smaller model.

###### Flair Embeddings
Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. The key differences are: (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (2) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.
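
Both properties can be illustrated with a toy that has nothing to do with Flair's actual model. The sketch below reads text one character at a time, keeps a single running number as its "state", and takes the state after a word's last character as that word's embedding. The recurrence, the `carry` parameter, and the function names are all assumptions made for illustration. A real contextual string embedding comes from a trained character language model and is a long vector, not a scalar.

```python
import math


def forward_char_states(text, carry=0.9):
    """Toy forward pass: one scalar state per character, left to right.

    Each state depends on every character read so far. This recurrence is an
    assumption made for illustration, not Flair's character language model.
    """
    state = 0.0
    states = []
    for char in text:
        state = math.tanh(carry * state + ord(char) / 1000.0)
        states.append(state)
    return states


def toy_contextual_embedding(text, word):
    """Return the state right after the last character of `word` in `text`."""
    last_char_index = text.index(word) + len(word) - 1
    return forward_char_states(text)[last_char_index]


bank_after_river = toy_contextual_embedding("river bank", "bank")
bank_after_money = toy_contextual_embedding("money bank", "bank")

# Same word, different left context, different value.
embeddings_differ = bank_after_river != bank_after_money
```

There is no vocabulary anywhere in this sketch, only characters, and the value assigned to "bank" changes with the text that precedes it. A static lookup table would return the same vector in both cases.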

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, same as standard word embeddings:

In [None]:
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

You choose which embeddings to load by passing the appropriate string to the constructor of the FlairEmbeddings class. For all supported languages, there is a forward and a backward model. You can load a model for a language by using the two-letter language code followed by a hyphen and either forward or backward. So, if you want to load the forward and backward Flair models for German, do it like this:

In [None]:
# init forward and backward embeddings for German
flair_embedding_forward = FlairEmbeddings('de-forward')
flair_embedding_backward = FlairEmbeddings('de-backward')

Check out the full list of all pre-trained FlairEmbeddings models here, along with more information on standard usage.
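
The naming convention for language-specific Flair models can be sketched as a small string helper. `flair_ids` is a hypothetical function introduced for illustration and is not part of Flair. It only formats the two strings and does not verify that models exist for the given language. Note that some models, such as the 'news-forward' model used earlier, are identified by a name rather than a language code.

```python
def flair_ids(language_code):
    """Hypothetical helper: return the (forward, backward) id strings."""
    return (f"{language_code}-forward", f"{language_code}-backward")


forward_id, backward_id = flair_ids("de")
```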

###### Stacked Embeddings
Stacked embeddings are one of the most important concepts of this library. You can use them to combine different embeddings, for instance if you want to use traditional embeddings together with contextual string embeddings. Stacked embeddings allow you to mix and match. We find that a combination of embeddings often gives the best results.

All you need to do is instantiate the StackedEmbeddings class with a list of the embeddings that you wish to combine. For instance, let's combine classic GloVe embeddings with forward and backward Flair embeddings. This is a combination that we generally recommend to most users, especially for sequence labeling.

First, instantiate the three embeddings you wish to combine:

In [4]:
from flair.embeddings import WordEmbeddings, FlairEmbeddings

# init standard GloVe embedding
glove_embedding = WordEmbeddings("glove")

# init flair forward and backward embeddings
flair_forward_embedding = FlairEmbeddings("news-forward")
flair_backward_embedding = FlairEmbeddings("news-backward")

2022-11-13 07:25:48,056 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/news-forward-0.4.1.pt not found in cache, downloading to /tmp/tmpxrtps9wo
100%|██████████| 73034624/73034624 [00:04<00:00, 16689320.06B/s]
2022-11-13 07:25:52,908 copying /tmp/tmpxrtps9wo to cache at /root/.flair/embeddings/news-forward-0.4.1.pt
2022-11-13 07:25:53,101 removing temp file /tmp/tmpxrtps9wo
2022-11-13 07:25:53,857 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/news-backward-0.4.1.pt not found in cache, downloading to /tmp/tmpbn_rsn3r
100%|██████████| 73034575/73034575 [00:04<00:00, 17869053.97B/s]
2022-11-13 07:25:58,338 copying /tmp/tmpbn_rsn3r to cache at /root/.flair/embeddings/news-backward-0.4.1.pt
2022-11-13 07:25:58,481 removing temp file /tmp/tmpbn_rsn3r

Now instantiate the StackedEmbeddings class and pass it a list containing these three embeddings.

In [5]:
from flair.embeddings import StackedEmbeddings

# create a StackedEmbeddings object that combines GloVe and forward/backward Flair embeddings
stacked_embeddings = StackedEmbeddings([
    glove_embedding,
    flair_forward_embedding,
    flair_backward_embedding,
])

That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.

In [7]:
sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)
    print(token.embedding.size())

Token[0]: "The"
tensor([-0.0382, -0.2449,  0.7281,  ..., -0.0065, -0.0053,  0.0090])
torch.Size([4196])
Token[1]: "grass"
tensor([-0.8135,  0.9404, -0.2405,  ...,  0.0354, -0.0255, -0.0143])
torch.Size([4196])
Token[2]: "is"
tensor([-5.4264e-01,  4.1476e-01,  1.0322e+00,  ..., -5.3691e-04,
        -9.6750e-03, -2.7541e-02])
torch.Size([4196])
Token[3]: "green"
tensor([-0.6791,  0.3491, -0.2398,  ..., -0.0007, -0.1333,  0.0161])
torch.Size([4196])
Token[4]: "."
tensor([-0.3398,  0.2094,  0.4635,  ...,  0.0005, -0.0177,  0.0032])
torch.Size([4196])


Words are now embedded using a concatenation of three different embeddings. The result for each token is still a single PyTorch vector, and its size (4196 in the output above) is the sum of the sizes of the three component embeddings.
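
The concatenation can be sketched with plain Python lists. The `stack` function is a hypothetical illustration, not Flair's implementation. The GloVe size of 100 comes from the text above. The size of 2048 for each Flair direction is an assumption, inferred by splitting the remaining 4096 dimensions of the printed 4196 evenly between the forward and backward models.

```python
def stack(*vectors):
    """Concatenate several per-token vectors into one, in the order given."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Placeholder vectors with the assumed sizes; the values are dummies.
glove_part = [0.1] * 100
forward_part = [0.2] * 2048   # assumed size, see note above
backward_part = [0.3] * 2048  # assumed size, see note above

stacked_vector = stack(glove_part, forward_part, backward_part)
stacked_size = len(stacked_vector)
```

Because the parts are simply laid end to end, the first 100 entries of the stacked vector are the GloVe values. This matches the printed output, where the stacked vector for "The" begins with the same numbers as its GloVe embedding.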

###### Next
For more details on these embeddings and a full overview of all word embeddings that we support, see the tutorial that lists all supported embeddings. You can also skip the details on word embeddings and go directly to document embeddings, which let you embed entire text passages with one vector for tasks such as text classification. Alternatively, go directly to the tutorial about loading your corpus, which is a prerequisite for training your own models.