### Word2Vec Window Alignment
* As of March 14th, 2023, the Gensim Word2Vec implementation does not support left-aligned or right-aligned windows. Only centered windows are available, where `n` words are taken from each side of the current word. The `Word2Vec` class in Gensim takes the window size `n` through its `window` parameter. To compare the performance of left-aligned and right-aligned windows, we modified the Gensim code. Install the modified version using `pip`:
   ```cmd
   pip install git+https://github.com/KarahanS/custom-gensim.git@window-alignment
   ```
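To make the three alignments concrete, here is a minimal sketch of which positions would count as context for a given word. It is not the code from the modified Gensim fork. The function name `context_indices` is hypothetical, and the sketch assumes that a left-aligned window takes `n` words from the left only and a right-aligned window takes `n` words from the right only. It also ignores the random window shrinking that Gensim applies during training.

```python
def context_indices(center, length, window, alignment=0):
    """Return the positions used as context for the word at `center`.

    alignment: -1 = left only, 0 = centered, 1 = right only
    (assumed convention, mirroring the ALIGNMENT setting in this notebook).
    """
    left = range(max(0, center - window), center) if alignment <= 0 else range(0)
    right = (
        range(center + 1, min(length, center + window + 1))
        if alignment >= 0
        else range(0)
    )
    return list(left) + list(right)


tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
center = 4  # "jumps"
for name, alignment in [("left", -1), ("centered", 0), ("right", 1)]:
    positions = context_indices(center, len(tokens), window=2, alignment=alignment)
    print(name, [tokens[j] for j in positions])
```

With `window=2`, the centered variant sees four neighbors of "jumps", while each one-sided variant sees only two.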

In [6]:
import multiprocessing
import logging
from gensim.models import Word2Vec
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent))

from utils.utils import LineSentences
from utils.utils import callback
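The `LineSentences` helper lives in `utils/utils.py` and is not shown in this notebook. The sketch below is a guess at what such a helper might look like, assuming one whitespace-tokenized sentence per line across several files. The class name `LineSentencesSketch` is hypothetical. The important property is that the object can be iterated more than once, because Word2Vec scans the corpus once to build the vocabulary and again for every epoch. A plain generator would be exhausted after the first pass.

```python
import tempfile
from pathlib import Path


class LineSentencesSketch:
    """Hypothetical stand-in for utils.utils.LineSentences.

    Streams token lists from several text files, one sentence per line.
    """

    def __init__(self, paths):
        self.paths = [Path(p) for p in paths]

    def __iter__(self):
        # A fresh pass starts every time iteration begins, so the corpus
        # can be read repeatedly without loading it into memory.
        for path in self.paths:
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:  # skip empty lines
                        yield tokens


# Small demonstration with two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    first.write_text("bir iki\n\nüç\n", encoding="utf-8")
    second.write_text("dört beş altı\n", encoding="utf-8")

    corpus = LineSentencesSketch([first, second])
    first_pass = list(corpus)
    second_pass = list(corpus)  # the same object can be iterated again

print(first_pass)
```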

### Gensim Word2Vec
_Documentation: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec_
* `sentences` _(iterable of iterables, optional)_: The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See `BrownCorpus`, `Text8Corpus` or `LineSentence` in the word2vec module for such examples. See also the tutorial on data streaming in Python. If you don’t supply `sentences`, the model is left uninitialized; do this only if you plan to initialize it in some other way.
* `vector_size` _(int, optional)_: Dimensionality of the word vectors.
* `window` _(int, optional)_: Maximum distance between the current and predicted word within a sentence.
* `min_count` _(int, optional)_: Ignores all words with total frequency lower than this.
* `workers` _(int, optional)_: Number of worker threads used to train the model (faster training on multicore machines).
* `sg` _({0, 1}, optional)_: Training algorithm: 1 for skip-gram; otherwise CBOW.
* `hs` _({0, 1}, optional)_: If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
* `negative` _(int, optional)_: If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

In short:

| SG | HS | Negative | Training Algorithm |
|----|----|----------|-------------------|
| 1  | 1  |          | Skip-Gram Hierarchical Softmax |
| 1  | 0  | $\neq$ 0 | Skip-Gram Negative Sampling |
| 1  | 0  | = 0 | No training |
| 0  | 1  |          | CBOW Hierarchical Softmax |
| 0  | 0  | $\neq$ 0 | CBOW Negative Sampling |
| 0 | 0  | = 0 | No training |
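The table can also be read as a small decision function. The sketch below simply encodes the rows above. The helper name `training_algorithm` is hypothetical and is not part of Gensim.

```python
def training_algorithm(sg, hs, negative):
    """Name the training setup selected by the sg / hs / negative settings.

    Mirrors the summary table: hs is checked first, then negative.
    """
    architecture = "Skip-Gram" if sg == 1 else "CBOW"
    if hs == 1:
        return f"{architecture} Hierarchical Softmax"
    if negative > 0:
        return f"{architecture} Negative Sampling"
    return "No training"


for settings in [(1, 1, 0), (1, 0, 5), (1, 0, 0), (0, 1, 0), (0, 0, 5), (0, 0, 0)]:
    print(settings, "->", training_algorithm(*settings))
```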


In [3]:
# documentation: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec
# Assumption: Provided input is a txt file with one sentence per line.
INPUT = ["../corpus/turkish-texts-tokenized.txt", "../corpus/bounwebcorpus.txt"]
MIN_COUNT = 10  # ignore all words with total frequency lower than this
EMB = 300       # dimensionality of word vectors
WINDOW = 5      # maximum distance between the target and context word within a sentence
EPOCH = 10      # number of iterations (epochs) over the corpus
SG = 1          # training algorithm: 1 for skip-gram; otherwise CBOW
HS = 0          # 1: hierarchical softmax; 0: negative sampling is used if NEGATIVE > 0. If both are 0, nothing is trained.
NEGATIVE = 5    # number of "noise words" drawn for negative sampling (usually 5-20); 0 disables negative sampling
OUTPUT = "word2vec_5epoch_leftaligned.model"
ALIGNMENT = -1  # -1 for left alignment, 1 for right alignment, 0 for centered alignment

# If both HS and NEGATIVE are 0, neither hierarchical softmax nor negative sampling is active, so the word vectors are never updated and stay at their random initialization. Set at least one of them to obtain useful embeddings.

In [5]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = Word2Vec(
    sentences=LineSentences(INPUT),
    vector_size=EMB,
    window=WINDOW,
    min_count=MIN_COUNT,
    epochs=EPOCH,
    sg=SG,
    hs=HS,
    negative=NEGATIVE,
    compute_loss=True,           # needed so the callback can report the training loss
    window_alignment=ALIGNMENT,  # only available in the modified Gensim build
    workers=multiprocessing.cpu_count(),
    callbacks=[callback()],
)
# Save only the word vectors, in the binary word2vec format.
model.wv.save_word2vec_format(OUTPUT, binary=True)

2023-03-14 19:07:14,355 : INFO : collecting all words and their counts
2023-03-14 19:07:14,357 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-03-14 19:07:14,358 : INFO : collected 1573 word types from a corpus of 2838 raw words and 9 sentences
2023-03-14 19:07:14,359 : INFO : Creating a fresh vocabulary
2023-03-14 19:07:14,360 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=10 retains 20 unique words (1.27% of original 1573, drops 1553)', 'datetime': '2023-03-14T19:07:14.360778', 'gensim': '4.3.1.dev0', 'python': '3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'prepare_vocab'}
2023-03-14 19:07:14,360 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=10 leaves 593 word corpus (20.89% of original 2838, drops 2245)', 'datetime': '2023-03-14T19:07:14.360778', 'gensim': '4.3.1.dev0', 'python': '3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42)
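The `prepare_vocab` lines in the log report the effect of `min_count`: 20 of the 1573 word types are kept (20 / 1573 ≈ 1.27%). The sketch below reproduces that kind of bookkeeping on a toy corpus. It is not Gensim's implementation, and the function name `prune_vocabulary` is hypothetical. It only assumes the documented rule that words with a total frequency below `min_count` are ignored.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Count tokens and keep only the words seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: count for word, count in counts.items() if count >= min_count}
    return {
        "unique_total": len(counts),        # word types before pruning
        "unique_kept": len(kept),           # word types after pruning
        "words_total": sum(counts.values()),  # raw corpus size in tokens
        "words_kept": sum(kept.values()),     # tokens that survive pruning
    }


toy_corpus = [["a", "b", "a"], ["a", "c", "b"]]
stats = prune_vocabulary(toy_corpus, min_count=2)
print(stats)
print(f"{100 * stats['unique_kept'] / stats['unique_total']:.2f}% of word types retained")
```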

Loss after epoch 0: 879.2654418945312
Loss after epoch 1: 1132.3890991210938
Loss after epoch 2: 960.003662109375
Loss after epoch 3: 906.3017578125
Loss after epoch 4: 728.04736328125
Loss after epoch 5: 1011.609375
Loss after epoch 6: 795.62060546875
Loss after epoch 7: 799.21923828125
Loss after epoch 8: 860.857421875
Loss after epoch 9: 735.37060546875
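The `callback` helper that prints these lines is defined in `utils/utils.py` and is not shown here. The values go up and down instead of growing steadily, which suggests that it reports a per-epoch loss. As far as we know, `get_latest_training_loss()` in Gensim returns a running total for the whole training call, so a per-epoch value has to be computed as the difference between consecutive totals. The sketch below illustrates that idea under this assumption. `EpochLossLogger` and `FakeModel` are hypothetical names, and `FakeModel` only stands in for a trained model so that the example runs without Gensim. In a real run the logger would normally subclass `gensim.models.callbacks.CallbackAny2Vec`.

```python
class EpochLossLogger:
    """Hypothetical sketch of a per-epoch loss callback."""

    def __init__(self):
        self.epoch = 0
        self.previous_total = 0.0
        self.losses = []

    def on_epoch_end(self, model):
        # Assumption: the model reports a cumulative loss, so the loss of
        # this epoch is the increase since the previous epoch.
        total = model.get_latest_training_loss()
        epoch_loss = total - self.previous_total
        self.previous_total = total
        self.losses.append(epoch_loss)
        print(f"Loss after epoch {self.epoch}: {epoch_loss}")
        self.epoch += 1


class FakeModel:
    """Stand-in that replays a fixed sequence of cumulative losses."""

    def __init__(self, totals):
        self._totals = iter(totals)

    def get_latest_training_loss(self):
        return next(self._totals)


logger = EpochLossLogger()
fake_model = FakeModel([10.0, 25.0, 30.0])
for _ in range(3):
    logger.on_epoch_end(fake_model)
```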


In [None]:
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format(OUTPUT, binary=True)

In [None]:
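# Analogy query: kral (king) - adam (man) + kadın (woman), scored with the multiplicative 3CosMul objective
word_vectors.most_similar_cosmul(positive=['kadın', 'kral'], negative=['adam'])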
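For readers unfamiliar with the multiplicative objective behind `most_similar_cosmul`, here is a minimal pure-Python sketch of the 3CosMul score on hand-made toy vectors. It is an illustration rather than Gensim's vectorized implementation. The three-dimensional vectors and the English words are invented for the example, and the small `eps` term only guards against division by zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosmul_score(candidate, positive, negative, eps=1e-6):
    """3CosMul: product of shifted positive similarities over the negative ones."""
    # Cosines are shifted from [-1, 1] to [0, 1] so the ratio stays non-negative.
    numerator = math.prod((1 + cosine(candidate, p)) / 2 for p in positive)
    denominator = math.prod((1 + cosine(candidate, n)) / 2 for n in negative) + eps
    return numerator / denominator


# Toy embedding with axes (royalty, male, female); values are made up.
toy = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}
query_words = {"woman", "king", "man"}
scores = {
    word: cosmul_score(vector, [toy["woman"], toy["king"]], [toy["man"]])
    for word, vector in toy.items()
    if word not in query_words  # the query words themselves are excluded
}
best = max(scores, key=scores.get)
print(best, scores)
```

On these toy vectors the highest-scoring candidate is "queen", because it is close to both positive words and orthogonal to the negative one.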

In [None]:
# Collect the vocabulary as a list of words
vocab = list(word_vectors.index_to_key)

In [None]:
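# The vocabulary is already stored in descending order of frequency; the explicit sort below is only a safeguard.
# Caveat: when vectors are loaded with load_word2vec_format and no vocabulary file is supplied, the 'count' attribute
# may hold placeholder values that only encode this order, not the real corpus frequencies.
word_counts = [word_vectors.get_vecattr(word, 'count') for word in vocab]  # count attribute stored for each word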

# Sort the vocabulary by word counts in descending order
sorted_vocab = [word for _, word in sorted(zip(word_counts, vocab), reverse=True)]
print(sorted_vocab[:10])

In [None]:
# write vocab to corpus/vocab.txt
with open("../corpus/vocab.txt", "w", encoding="utf-8") as f:
    for word in sorted_vocab:
        f.write(word + "\n")