# Training a Gensim `Doc2Vec` model

This notebook trains a document embedding model that will be used to vectorize preprocessed input strings.

*Version: 2022-04-19*

---

**Imports**

In [1]:
# standard library
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Final, Tuple

# third-party libraries
import cloudpickle
from gensim.models import Doc2Vec
from pandas import DataFrame, concat, read_csv
from psutil import Process

# LexNLP
from lexnlp.ml.gensim_utils import DummyGensimKeyedVectors, TrainingCallback

---

## Constants

In [2]:
# directory paths
PATH_PREPROCESSED: Final[Path] = Path('./preprocessed/')
PATH_OUTPUT: Final[Path] = Path('./output/')
PATH_OUTPUT.mkdir(exist_ok=True)

# number of processors to use for multiprocessing
MAX_WORKERS: Final[int] = (len(Process().cpu_affinity()) - 1) or 1
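
`Process.cpu_affinity()` from `psutil` is not available on every platform (macOS, for example, does not provide it). A minimal sketch of a fallback using only the standard library follows; the helper names `usable_cpu_count` and `worker_count` are hypothetical and not part of the original notebook.

```python
import os


def usable_cpu_count() -> int:
    """Return the number of CPUs this process may run on (at least 1)."""
    try:
        # Available on Linux: respects affinity masks such as those set by taskset.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the machine-wide count.
        return os.cpu_count() or 1


def worker_count(reserve: int = 1) -> int:
    """Leave `reserve` CPUs free, but never return fewer than one worker."""
    return max(usable_cpu_count() - reserve, 1)


MAX_WORKERS = worker_count()
```

For any CPU count of one or more, `max(n - 1, 1)` gives the same result as the `(n - 1) or 1` expression used in the cell above.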

---

## Training

In [3]:
%%time

with NamedTemporaryFile(mode='w') as corpus_file:

    # convert `text` column of CSV files to a single, newline-separated text file
    csv_files: Tuple[Path, ...] = tuple(PATH_PREPROCESSED.rglob('*.csv'))
    dataframe: DataFrame = concat(map(read_csv, csv_files), ignore_index=True)
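    for line in dataframe['text']:
        corpus_file.write(f'{line}\n')

    # flush buffered writes so Gensim sees the full corpus when it opens the file by name
    corpus_file.flush()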

    # train the Doc2Vec model
    doc2vec_model: Doc2Vec = Doc2Vec(
        documents=None,
        corpus_file=corpus_file.name,
        vector_size=200,
        dm_mean=None,
        dm=1,
        dbow_words=0,
        dm_concat=0,
        dm_tag_count=1,
        dv=None,
        dv_mapfile=None,
        comment=None,
        trim_rule=None,
        window=10,
        epochs=40,
        shrink_windows=True,
        min_count=20,
        workers=MAX_WORKERS,
        callbacks=(TrainingCallback(),)
    )

Started training...
Gensim version: 4.1.2, model.vector_size=200, model.window=10, model.min_count=20, model.dm=True
Started epoch 1 / 40
...[Epoch 1 | total_train_time: 8.010563021001872]
Started epoch 2 / 40
...[Epoch 2 | total_train_time: 18.961073338999995]
Started epoch 3 / 40
...[Epoch 3 | total_train_time: 29.97430093100411]
Started epoch 4 / 40
...[Epoch 4 | total_train_time: 40.92812875901291]
Started epoch 5 / 40
...[Epoch 5 | total_train_time: 52.11123081902042]
Started epoch 6 / 40
...[Epoch 6 | total_train_time: 63.10769526602235]
Started epoch 7 / 40
...[Epoch 7 | total_train_time: 73.96174055703159]
Started epoch 8 / 40
...[Epoch 8 | total_train_time: 84.61831618103315]
Started epoch 9 / 40
...[Epoch 9 | total_train_time: 95.4833044380357]
Started epoch 10 / 40
...[Epoch 10 | total_train_time: 106.30930993004586]
Started epoch 11 / 40
...[Epoch 11 | total_train_time: 117.0936355470476]
Started epoch 12 / 40
...[Epoch 12 | total_train_time: 128.38158319305512]
Started epo
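
The `corpus_file` argument expects a file in Gensim's `LineSentence` format: one document per line, with tokens separated by whitespace. A text value that contains a line break would therefore be read as several documents. The sketch below shows one way to guard against that, assuming the texts are plain strings; the helper name `write_line_corpus` is hypothetical and the sample strings are made up for illustration.

```python
from tempfile import NamedTemporaryFile
from typing import Iterable, TextIO


def write_line_corpus(texts: Iterable[str], handle: TextIO) -> int:
    """Write one whitespace-normalized document per line; return the count."""
    count = 0
    for text in texts:
        # str.split() with no arguments splits on any run of whitespace,
        # so embedded newlines and tabs collapse to single spaces.
        tokens = text.split()
        if not tokens:
            continue  # skip empty documents instead of writing blank lines
        handle.write(' '.join(tokens) + '\n')
        count += 1
    # Flush so a reader opening the file by name sees everything written so far.
    handle.flush()
    return count


with NamedTemporaryFile(mode='w+', encoding='utf-8') as corpus_file:
    n_docs = write_line_corpus(
        ['first doc\nwith a break', '   ', 'second doc'], corpus_file
    )
    corpus_file.seek(0)
    lines = corpus_file.read().splitlines()
```

In `corpus_file` mode Gensim uses line numbers as document tags, so skipping empty rows shifts the tags relative to the DataFrame index. That should matter little here, since the document vectors are discarded after training.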

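Replace the trained `KeyedVectors` in `doc2vec_model.dv` with a `DummyGensimKeyedVectors` placeholder, which is constructed from the vector size alone. This drops the per-document vectors learned during training and radically reduces the size of the saved file.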

In [4]:
doc2vec_model.dv = DummyGensimKeyedVectors(doc2vec_model.dv.vector_size)
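
The effect of this kind of swap on serialized size can be illustrated without Gensim. The sketch below is only an analogy: it uses `types.SimpleNamespace` objects as hypothetical stand-ins for the model, the vector table, and the placeholder, and it makes no claim about how `DummyGensimKeyedVectors` is implemented.

```python
import pickle
from types import SimpleNamespace

# Hypothetical model with a large per-document vector table.
vector_size = 200
model = SimpleNamespace(
    vector_size=vector_size,
    table=SimpleNamespace(
        vector_size=vector_size,
        rows=[[0.5] * vector_size for _ in range(1_000)],
    ),
)
size_full = len(pickle.dumps(model))

# Swap the heavy table for a placeholder that only remembers the vector size.
model.table = SimpleNamespace(vector_size=model.table.vector_size)
size_slim = len(pickle.dumps(model))

print(f'with table: {size_full:,} bytes; with placeholder: {size_slim:,} bytes')
```

The placeholder still exposes `vector_size`, so code that only needs the dimensionality keeps working after the swap.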

Create a filename:

In [5]:
filename_doc2vec_model: str = (
    f'vectorsize{doc2vec_model.vector_size}_'
    f'window{doc2vec_model.window}_'
    f'dm{doc2vec_model.dm}_'
    f'mincount{doc2vec_model.min_count}_'
    f'epochs{doc2vec_model.epochs}'
    '.doc2vec'
)

Save the model using `cloudpickle` (instead of `Doc2Vec.save(...)`); this ensures `DummyGensimKeyedVectors` is also serialized.

In [6]:
%%time

with open(PATH_OUTPUT / filename_doc2vec_model, 'wb') as f:
    cloudpickle.dump(doc2vec_model, f)

CPU times: user 9.63 ms, sys: 11.9 ms, total: 21.6 ms
Wall time: 122 ms
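
Files written with `cloudpickle.dump` can be read back with the standard `pickle` module. A minimal loading sketch follows; the helper name `load_pickled_model` is hypothetical, and the round trip uses a small dictionary rather than a real model so that it runs without Gensim.

```python
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Any


def load_pickled_model(path: Path) -> Any:
    """Load an object written with `cloudpickle.dump` or `pickle.dump`."""
    with open(path, 'rb') as f:
        # Unpickling can execute arbitrary code: only load files you trust.
        return pickle.load(f)


# Round trip with a stand-in object to show the call pattern.
with TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo.doc2vec'
    with open(demo_path, 'wb') as f:
        pickle.dump({'vector_size': 200}, f)
    restored = load_pickled_model(demo_path)
```

For the model saved above, the call would be `load_pickled_model(PATH_OUTPUT / filename_doc2vec_model)`, presumably with `gensim` and `lexnlp` importable in the loading environment. New text can then be vectorized with `Doc2Vec.infer_vector`, which takes a list of string tokens.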
