### How to use gensim word2vec embeddings loaded from file

This notebook shows how to load gensim embeddings from a file and use them to train a `Flair` model for a sequence labelling task.

Example embeddings are available here:
http://dsmodels.nlp.ipipan.waw.pl/

They were trained in many different configurations:
1. On different corpora.
2. On lemmas or on word forms.
3. With all parts of speech or only some of them.
4. With different vector sizes.
5. With the CBOW or Skip-Gram neural network architecture.
6. With different neural network training algorithms.
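
The file name used later in this notebook, `wiki-lemmas-restricted-100-skipg-ns.txt.gz`, appears to encode these choices. The sketch below splits such a name into its parts. The field order, the field names, and the helper `parse_embedding_name` are assumptions inferred from that single file name, not a documented convention.

```python
from typing import NamedTuple


class EmbeddingSpec(NamedTuple):
    # Field meanings are assumed from one file name, not from documentation.
    corpus: str        # e.g. "wiki"
    unit: str          # "lemmas" or "forms"
    pos_filter: str    # e.g. "restricted"
    vector_size: int   # e.g. 100
    architecture: str  # e.g. "skipg" or "cbow"
    algorithm: str     # e.g. "ns"


def parse_embedding_name(file_name: str) -> EmbeddingSpec:
    """Split a name like 'wiki-lemmas-restricted-100-skipg-ns.txt.gz'."""
    # Drop everything after the first dot (".txt.gz"), then split on "-".
    base = file_name.split(".", 1)[0]
    corpus, unit, pos_filter, size, architecture, algorithm = base.split("-")
    return EmbeddingSpec(corpus, unit, pos_filter, int(size), architecture, algorithm)


spec = parse_embedding_name("wiki-lemmas-restricted-100-skipg-ns.txt.gz")
print(spec)
```

A name that does not have exactly six dash-separated parts raises a `ValueError`, which is a deliberate choice here: it is safer to fail than to guess the configuration.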

In [1]:
import os
os.chdir("..")

import pathlib

from flair.data import Sentence

from embeddings.defaults import RESULTS_PATH
from embeddings.embedding.static.embedding import StandardStaticWordEmbeddingPL
from embeddings.evaluator.sequence_labeling_evaluator import SequenceLabelingEvaluator
from embeddings.pipeline.flair_sequence_labeling import FlairSequenceLabelingPipeline

##### Get embeddings for words

In [2]:
embedding_path = pathlib.Path("../wiki-lemmas-restricted-100-skipg-ns.txt.gz")

embedding = StandardStaticWordEmbeddingPL(str(embedding_path))

sentence = Sentence("Nas nie przekonają, że białe jest białe, a czarne jest czarne.")

embedding.embed([sentence])

[Sentence: "Nas nie przekonają , że białe jest białe , a czarne jest czarne ."   [− Tokens: 14]]

In [3]:
for token in sentence:
    print(token)
    print(token.embedding)

Token: 1 Nas
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])
Token: 2 nie
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.])
Token: 3 przekonają
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
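
All-zero tensors like the ones above are what a static embedding lookup typically returns for a token it cannot find. Whether that is the cause here is an assumption, although a lemma-based model restricted to some parts of speech may well lack inflected forms and function words. The sketch below reads the word2vec text format (a header line with vocabulary size and dimension, then one word and its vector per line) using only the standard library and falls back to zeros for unknown words. `load_word2vec_text` and `lookup` are hypothetical helpers, the toy vectors are made up, and this is not how `StandardStaticWordEmbeddingPL` is implemented.

```python
import gzip
import os
import tempfile
from typing import Dict, List


def load_word2vec_text(path: str) -> Dict[str, List[float]]:
    """Read a gzipped word2vec text file into a word -> vector mapping."""
    vectors: Dict[str, List[float]] = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Header line: "<vocabulary size> <vector dimension>".
        _vocab_size, dim = (int(x) for x in handle.readline().split())
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            values = [float(x) for x in parts[1:] if x]
            if len(values) == dim:  # skip malformed lines
                vectors[parts[0]] = values
    return vectors


def lookup(vectors: Dict[str, List[float]], word: str, dim: int) -> List[float]:
    """Return the stored vector, or a zero vector for an unknown word."""
    return vectors.get(word, [0.0] * dim)


# Toy file with two made-up lemma vectors (not taken from the real model).
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.txt.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as handle:
        handle.write("2 3\nbiały 0.1 0.2 0.3\nczarny 0.4 0.5 0.6\n")
    toy_vectors = load_word2vec_text(toy_path)

print(lookup(toy_vectors, "biały", 3))  # lemma present in the toy file
print(lookup(toy_vectors, "białe", 3))  # inflected form absent, so zeros
```

If many tokens come back as zeros, lemmatizing the input or choosing a form-based embedding file may be worth trying.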

#### Training a model for sequence labelling

Before training a model, we need to define its parameters:
- embedding_path - path to the file that contains the gensim embeddings
- dataset_name - name of the dataset hosted on Hugging Face: https://huggingface.co/datasets/
- input_column_name - name of the input column, specific to the selected dataset
- target_column_name - name of the target column, specific to the selected dataset
- root - root directory of the output path
- hidden_size - hidden size of the model trained for the sequence labelling task; the model is defined here: https://github.com/flairNLP/flair/blob/579c7b70dfcc1d184bd47069f722d7a9ae8b78d7/flair/models/sequence_tagger_model.py#L26

Other important parameters:
- task_model_kwargs - parameters of the model that will be trained (e.g. number of RNN layers, type of RNN layers, ...)
- task_train_kwargs - training parameters (all of them can be found here: https://github.com/flairNLP/flair/blob/579c7b70dfcc1d184bd47069f722d7a9ae8b78d7/flair/trainers/trainer.py#L73)
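
As a rough illustration, the two dictionaries could look like the sketch below. The keys are argument names that, to our knowledge, exist in Flair's `SequenceTagger` constructor (`rnn_layers`, `rnn_type`) and in `ModelTrainer.train` (`learning_rate`, `mini_batch_size`, `max_epochs`); please verify them against the linked sources for the Flair version in use. The values and the `example_` variable names are arbitrary placeholders.

```python
# Assumed Flair argument names; check them against the linked source files.
example_task_model_kwargs = {
    "rnn_layers": 2,     # number of stacked RNN layers
    "rnn_type": "GRU",   # type of RNN layer
}
example_task_train_kwargs = {
    "learning_rate": 0.1,
    "mini_batch_size": 32,
    "max_epochs": 3,     # kept small, for a quick test run only
}

print(example_task_model_kwargs)
print(example_task_train_kwargs)
```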

In [4]:
embedding_path = pathlib.Path("../wiki-lemmas-restricted-100-skipg-ns.txt.gz")

dataset_name = "clarin-pl/kpwr-ner"
input_column_name = "tokens"
target_column_name = "ner"
root = RESULTS_PATH.joinpath("pos_tagging")

output_path = pathlib.Path(root, embedding_path.stem, dataset_name)
output_path.mkdir(parents=True, exist_ok=True)

hidden_size = 64
task_train_kwargs = {
    "max_epochs": 3  # for testing purposes only
}
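
A note on how `output_path` is composed: `Path.stem` removes only the last suffix, so for a `.txt.gz` file the directory name keeps `.txt`, and the `/` in the dataset name produces a nested directory. The sketch below shows this with pure paths, so nothing is created on disk. The `results` root is a placeholder and not the actual value of `RESULTS_PATH`; the `example_` names are hypothetical.

```python
from pathlib import PurePosixPath

example_embedding_path = PurePosixPath("../wiki-lemmas-restricted-100-skipg-ns.txt.gz")

# .stem strips only the final ".gz"; ".txt" stays in the name.
stem = example_embedding_path.stem

# Stripping every suffix requires removing the joined suffixes explicitly.
all_suffixes = "".join(example_embedding_path.suffixes)
bare_name = example_embedding_path.name[: -len(all_suffixes)]

# "results" is a placeholder root, not the real RESULTS_PATH.
example_output_path = PurePosixPath("results", "pos_tagging", stem, "clarin-pl/kpwr-ner")

print(stem)
print(bare_name)
print(example_output_path.parts)
```

Because of `parents=True` in the `mkdir` call, the nested `clarin-pl/kpwr-ner` directories are created without further handling.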

In [5]:
pipeline = FlairSequenceLabelingPipeline(
    embedding_path,
    dataset_name,
    input_column_name,
    target_column_name,
    output_path,
    hidden_size,
    task_train_kwargs=task_train_kwargs,
)
result = pipeline.run()

Using custom data configuration default
Reusing dataset kpwrner (/Users/lukaszkoziol/.cache/huggingface/datasets/clarin-pl___kpwrner/default/0.0.0/001e3d471298007e8412e3a6ccc06bec000dec1bce0cf8e0ba7e5b7e105b1342)



2022-01-21 09:24:09,412 - embeddings.transformation.flair_transformation.corpus_transformation - INFO - Info of ['train', 'test']:
{'builder_name': 'kpwrner',
 'citation': '',
 'config_name': 'default',
 'dataset_size': 13212646,
 'description': 'KPWR-NER tagging dataset.',
 'download_checksums': {'https://huggingface.co/datasets/clarin-pl/kpwr-ner/resolve/main/data/kpwr-ner-n82-test.iob': {'checksum': '7b86fd227605b7e5f807eedbcd87573271d8adb86cfddf56c763b1751e71a924',
                                                                                                                       'num_bytes': 2247780},
                        'https://huggingface.co/datasets/clarin-pl/kpwr-ner/resolve/main/data/kpwr-ner-n82-train-tune.iob': {'checksum': '7ab673f299b3a9e875c2c46ef1051807d98f923f0356d0be78556c832481efea',
                                                                                                                             'num_bytes': 6719818}},
 'download_size': 8967598,
 'f

2022-01-21 09:24:20,562 ----------------------------------------------------------------------------------------------------
2022-01-21 09:24:20,563 Model: "SequenceTagger(
  (embeddings): WordEmbeddingsPL(
    '../wiki-lemmas-restricted-100-skipg-ns.txt.gz'
    (embedding): Embedding(446609, 100)
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=100, out_features=100, bias=True)
  (rnn): LSTM(100, 64, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=128, out_features=163, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2022-01-21 09:24:20,564 ----------------------------------------------------------------------------------------------------
2022-01-21 09:24:20,566 Corpus: "Corpus: 13959 train + 0 dev + 4323 test sentences"
2022-01-21 09:24:20,568 ----------------------------------------------------------------------------------------------------
2022-01-21 09:24:20,568 Param

2022-01-21 09:53:17,570 ----------------------------------------------------------------------------------------------------




In [6]:
result

{'seqeval__mode_None__scheme_None': {'nam_adj': {'precision': 0.0,
   'recall': 0.0,
   'f1': 0.0,
   'number': 52},
  'nam_adj_city': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 42},
  'nam_adj_country': {'precision': 0.5,
   'recall': 0.006024096385542169,
   'f1': 0.011904761904761906,
   'number': 166},
  'nam_adj_person': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 18},
  'nam_eve': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 8},
  'nam_eve_human': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 78},
  'nam_eve_human_cultural': {'precision': 0.0,
   'recall': 0.0,
   'f1': 0.0,
   'number': 22},
  'nam_eve_human_holiday': {'precision': 0.0,
   'recall': 0.0,
   'f1': 0.0,
   'number': 9},
  'nam_eve_human_sport': {'precision': 0.0,
   'recall': 0.0,
   'f1': 0.0,
   'number': 55},
  'nam_fac_bridge': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 4},
  'nam_fac_goe': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 64},
  'nam_
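
The `nam_adj_country` row can be checked by hand. A precision of 0.5 together with a recall of 1/166 is consistent with one correct prediction out of two predicted entities; these counts are inferred from the reported values and are not read from the output. A short sketch of the arithmetic, with a hypothetical helper `precision_recall_f1`:

```python
from typing import Tuple


def precision_recall_f1(true_positives: int, predicted: int, gold: int) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1, with 0.0 for empty denominators."""
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Inferred counts for nam_adj_country: 1 correct, 2 predicted, 166 gold entities.
p, r, f1 = precision_recall_f1(1, 2, 166)
print(p, r, f1)

# A class that is never predicted, such as nam_adj with 52 gold entities.
print(precision_recall_f1(0, 0, 52))
```

With only three training epochs, most classes are never predicted, which explains the many zero rows.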