# Training a Gensim `Doc2Vec` model

This notebook trains a document embedding model that is later used to vectorize preprocessed input strings.

*Version: 2022-03-29*

---

## Imports

In [2]:
# standard library
from pathlib import Path
from shutil import copyfileobj
from tempfile import NamedTemporaryFile
from typing import Final, Tuple

# third-party libraries
import cloudpickle
from gensim.models import Doc2Vec
from psutil import Process
from tqdm.notebook import tqdm

# LexNLP
from lexnlp.ml.gensim_utils import DummyGensimKeyedVectors, TrainingCallback

---

## Constants

In [3]:
# directory paths
PATH_PREPROCESSED: Final[Path] = Path('./preprocessed/')
PATH_OUTPUT: Final[Path] = Path('./output/')
PATH_OUTPUT.mkdir(exist_ok=True)

# number of processors to use for multiprocessing
MAX_WORKERS: Final[int] = (len(Process().cpu_affinity()) - 1) or 1
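
`Process().cpu_affinity()` is not available on every platform (psutil documents it for Linux, Windows and FreeBSD), so the constant above may fail elsewhere, for example on macOS. The following is a minimal stdlib-only sketch of the same "all usable cores minus one, but at least one" rule. The helper name `usable_workers` and its `reserve` parameter are illustrative and not part of this notebook.

```python
import os


def usable_workers(reserve: int = 1) -> int:
    """Return the number of CPUs this process may use, minus `reserve`, never below 1."""
    try:
        # Linux: respects the affinity mask of the current process (e.g. set by taskset).
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the total CPU count.
        available = os.cpu_count() or 1
    return max(available - reserve, 1)


print(usable_workers())
```

Reserving one core keeps the notebook kernel responsive while training runs; `max(..., 1)` plays the same role as the `or 1` in the constant above.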

---

## Training

In [4]:
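with NamedTemporaryFile(mode='wb') as corpus_file:

    # concatenate all preprocessed text files into a single corpus file
    text_files: Tuple[Path, ...] = tuple(PATH_PREPROCESSED.rglob('*.txt'))
    for path in tqdm(text_files, total=len(text_files)):
        with open(path, 'rb') as f:
            copyfileobj(f, corpus_file)

    # flush buffered writes: Gensim reopens the file by name and must see the whole corpus
    corpus_file.flush()

    # train the Doc2Vec model (inside the `with` block, while the temporary file still exists)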
    doc2vec_model: Doc2Vec = Doc2Vec(
        documents=None,
        corpus_file=corpus_file.name,
        vector_size=200,
        dm_mean=None,
        dm=1,
        dbow_words=0,
        dm_concat=0,
        dm_tag_count=1,
        dv=None,
        dv_mapfile=None,
        comment=None,
        trim_rule=None,
        window=10,
        epochs=40,
        shrink_windows=True,
        min_count=20,
        workers=MAX_WORKERS,
        callbacks=(TrainingCallback(),)
    )

  0%|          | 0/9 [00:00<?, ?it/s]

Started training...
Gensim version: 4.1.2, model.vector_size=200, model.window=10, model.min_count=20, model.dm=True
Started epoch 1 / 40
...[Epoch 1 | total_train_time: 9.846747789997607]
Started epoch 2 / 40
...[Epoch 2 | total_train_time: 22.107361511996714]
Started epoch 3 / 40
...[Epoch 3 | total_train_time: 35.72111703200062]
Started epoch 4 / 40
...[Epoch 4 | total_train_time: 48.96803596600512]
Started epoch 5 / 40
...[Epoch 5 | total_train_time: 62.0138194290048]
Started epoch 6 / 40
...[Epoch 6 | total_train_time: 75.1317966600036]
Started epoch 7 / 40
...[Epoch 7 | total_train_time: 87.57887790200766]
Started epoch 8 / 40
...[Epoch 8 | total_train_time: 100.27492325301137]
Started epoch 9 / 40
...[Epoch 9 | total_train_time: 112.83466162801051]
Started epoch 10 / 40
...[Epoch 10 | total_train_time: 126.18972277301509]
Started epoch 11 / 40
...[Epoch 11 | total_train_time: 138.1070833520207]
Started epoch 12 / 40
...[Epoch 12 | total_train_time: 149.82595640401996]
Started ep
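
When `corpus_file` is used, Gensim expects a plain-text file in `LineSentence` format: one document per line, with tokens separated by whitespace. Plain byte concatenation, as in the training cell above, joins the last line of one file to the first line of the next whenever a file lacks a trailing newline. Below is a minimal sketch of a concatenation helper that guards against this. `concatenate_corpus` is a hypothetical name, and the sketch assumes the preprocessed files already hold one document per line.

```python
import os
from pathlib import Path
from shutil import copyfileobj
from tempfile import TemporaryDirectory
from typing import BinaryIO, Iterable


def concatenate_corpus(paths: Iterable[Path], destination: BinaryIO) -> None:
    """Append each file to `destination`, ensuring every file ends with a newline."""
    for path in paths:
        if path.stat().st_size == 0:
            continue  # nothing to copy
        with open(path, 'rb') as source:
            copyfileobj(source, destination)
            source.seek(-1, os.SEEK_END)  # re-read the final byte of the source file
            if source.read(1) != b'\n':
                destination.write(b'\n')
    # Make the bytes visible to readers that open the file by name.
    destination.flush()


# Small demonstration with two throwaway files.
with TemporaryDirectory() as tmp:
    first, second = Path(tmp, 'a.txt'), Path(tmp, 'b.txt')
    first.write_bytes(b'alpha beta')       # no trailing newline
    second.write_bytes(b'gamma delta\n')
    merged = Path(tmp, 'corpus.txt')
    with open(merged, 'wb') as out:
        concatenate_corpus([first, second], out)
    corpus_lines = merged.read_text().splitlines()

print(corpus_lines)  # ['alpha beta', 'gamma delta']
```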

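Replace the trained document vectors (`doc2vec_model.dv`, a `KeyedVectors` instance) with a `DummyGensimKeyedVectors` placeholder, which is constructed from the vector size alone; this radically reduces the size of the saved model.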

In [5]:
doc2vec_model.dv = DummyGensimKeyedVectors(doc2vec_model.dv.vector_size)
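
The saving comes from dropping the per-document vector array: with `vector_size=200`, each training document otherwise contributes 200 floats to the pickled model. The sketch below illustrates the effect with stdlib stand-ins only. The dictionaries `full` and `placeholder` are hypothetical illustrations and are not the Gensim or LexNLP data structures; the document count of 10,000 is likewise an arbitrary assumption.

```python
import pickle
from array import array

N_DOCUMENTS = 10_000   # assumed corpus size, for illustration only
VECTOR_SIZE = 200

# Stand-in for a trained vector store: one float32 row per document.
full = {
    'vector_size': VECTOR_SIZE,
    'vectors': array('f', bytes(4 * N_DOCUMENTS * VECTOR_SIZE)),
}

# Stand-in for a placeholder that keeps only the dimensionality.
placeholder = {'vector_size': VECTOR_SIZE}

full_size = len(pickle.dumps(full))
placeholder_size = len(pickle.dumps(placeholder))
print(f'full: {full_size} bytes, placeholder: {placeholder_size} bytes')
```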

Create a filename that encodes the main hyperparameters:

In [6]:
filename_doc2vec_model: str = (
    f'vectorsize{doc2vec_model.vector_size}_'
    f'window{doc2vec_model.window}_'
    f'dm{doc2vec_model.dm}_'
    f'mincount{doc2vec_model.min_count}_'
    f'epochs{doc2vec_model.epochs}'
    '.doc2vec'
)
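
Because the training log above reports `model.dm=True`, the `dm` segment of the name renders as `dmTrue` rather than `dm1`. A small sketch with plain values in place of the model attributes shows the resulting name; `build_filename` is a hypothetical helper introduced only for this illustration.

```python
def build_filename(vector_size: int, window: int, dm: bool, min_count: int, epochs: int) -> str:
    """Mirror the naming scheme of the cell above using plain values."""
    return (
        f'vectorsize{vector_size}_'
        f'window{window}_'
        f'dm{dm}_'
        f'mincount{min_count}_'
        f'epochs{epochs}'
        '.doc2vec'
    )


filename = build_filename(vector_size=200, window=10, dm=True, min_count=20, epochs=40)
print(filename)  # vectorsize200_window10_dmTrue_mincount20_epochs40.doc2vec
```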

Save the model using `cloudpickle` (instead of `Doc2Vec.save(...)`); this ensures `DummyGensimKeyedVectors` is also serialized.

In [7]:
with open(PATH_OUTPUT / filename_doc2vec_model, 'wb') as f:
    cloudpickle.dump(doc2vec_model, f)
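
A file written with `cloudpickle.dump` can be read back with the standard `pickle` module, provided the classes referenced in the pickle (here Gensim and LexNLP) are importable in the loading environment. Below is a minimal sketch of the loading side. `load_pickled_model` is a hypothetical helper, and the demonstration round-trips a plain dictionary rather than a real model.

```python
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Any


def load_pickled_model(path: Path) -> Any:
    """Load a pickled object. Only use this on files from a trusted source."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Demonstration with a stand-in object instead of a trained model.
with TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo.doc2vec'
    with open(demo_path, 'wb') as f:
        pickle.dump({'vector_size': 200}, f)
    restored = load_pickled_model(demo_path)

print(restored)  # {'vector_size': 200}
```

With the real model file, the loaded object would typically be used through `infer_vector`, which takes a list of tokens and returns a vector of length `vector_size`.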