# Training customized word embeddings

Word embeddings became popular around 2013, largely thanks to [this paper](https://arxiv.org/abs/1301.3781) with the beautiful title 
*Efficient Estimation of Word Representations in Vector Space* by Tomas Mikolov et al. at Google. This was the foundation of Word2Vec.

The idea behind it is best summarized by the following quote: 


> *You shall know a word by the company it keeps (Firth, J. R. 1957:11)*

![](https://ruder.io/content/images/size/w2000/2016/04/word_embeddings_colah.png)

Let me start with a fascinating example of word embeddings in practice. Below, you can see a figure from the paper 
*Dynamic Word Embeddings for Evolving Semantic Discovery*. Here (in simple terms) the researchers estimated word vectors from textual inputs in different time-frames. They picked out some terms and people that obviously changed *their company* over the years. Then they looked at the relative position of these terms compared to terms that did not change much (anchors). If you are interested in this kind of research, check out [this blog](https://blog.acolyer.org/2018/02/22/dynamic-word-embeddings-for-evolving-semantic-discovery/) that describes the paper briefly or the [original paper](https://arxiv.org/abs/1703.00607).

![alt text](https://adriancolyer.files.wordpress.com/2018/02/evolving-word-embeddings-fig-1.jpeg)

Word embeddings allow us to create term representations that "learn" meaning from semantic and syntactic features. These models take a sequence of sentences as input and scan the whole corpus for every individual term and all of its occurrences. Such contextual learning seems to be able to pick up non-trivial conceptual details, and it is this class of models that today enables technologies such as chatbots, machine translation and much more.
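To make "the company it keeps" a bit more concrete, here is a minimal sketch of how (target, context) pairs can be collected from a tokenized sentence with a fixed window. It is a simplified illustration in plain Python, not what gensim does internally (real implementations add, among other things, subsampling and a varying window size); the function name `context_pairs` is made up for this example.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ['you', 'can', 'slice', 'the', 'sausages']
pairs = context_pairs(sentence, window=1)
print(pairs)
```

Words that keep showing up with the same context words end up with similar pairs, and therefore with similar vectors after training.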

The early word embedding models were Word2Vec and [GloVe](https://nlp.stanford.edu/projects/glove/).
In 2016 Facebook presented [fastText](https://fasttext.cc/) (by the way - by then Tomas Mikolov was working for Facebook and is one of the authors of the [paper](https://arxiv.org/abs/1607.04606) that introduces the research behind fastText). This model extends the idea of Word2Vec, enriching these vectors with information from sub-word elements. What does that mean? Words are not only defined by surrounding words but also by the character n-grams (short character sequences) that make up the word. Why should that be a good idea? Well, now words such as *apple* and *apples* do not only get similar vectors due to them often sharing context but also because they are composed of the same sub-word elements. This comes in particularly handy when we are dealing with languages that have a rich morphology, such as Turkish or Russian. This is also great when working with web-text, which is often messy and misspelt.
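A small sketch of what "sub-word elements" means in practice: the fastText paper breaks each word into character n-grams (lengths 3 to 6 by default) after wrapping it in the boundary markers `<` and `>`. The helper `char_ngrams` below is a made-up illustration in plain Python; the real library additionally hashes the n-grams into a fixed number of buckets, which is left out here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in the boundary markers '<' and '>'."""
    wrapped = f'<{word}>'
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

apple = char_ngrams('apple')
apples = char_ngrams('apples')
shared = apple & apples   # n-grams that both words have in common

print(sorted(shared))
print(f'{len(shared)} of {len(apple)} n-grams of "apple" also occur in "apples"')
```

Because most n-grams are shared, the two words start out with closely related representations even before any context is taken into account. The same mechanism helps with misspellings and with words that never appeared in the training data.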

The current state-of-the-art transformer models go even further and implement context-specificity (a word may change meaning depending on the context in which it occurs).

Now the good news: you will find pre-trained vectors from all the mentioned models online. They will do great in most cases. However, when working on specific tasks, such as less common languages and/or technical jargon (a specific scientific field or industry, e.g. finance or insurance), it is nice to know how to train such word vectors yourself.


In this tutorial we will train the "classic" Word2Vec model, considering bi-grams. We will also look a bit into data-engineering issues in sequence-training. Finally, we will look at how we can use such models for text representation beyond individual words.

## Data

The data used here are 10k cooking related posts from Reddit. They come in JSON-lines format and can be either downloaded first or opened via requests.

## Plan of attack
In this tutorial we will not be using spaCy, as it is not fast enough for use in training large language models.
The intent is to understand training from disk, where the file is not loaded into memory as a single object (with e.g. pandas) but streamed from disk line by line.

In [1]:
# download data (optional when training from memory)
!wget https://raw.githubusercontent.com/aaubs/ds-master/main/data/reddit_r_cooking_sample.jsonl

--2022-11-01 15:04:31--  https://raw.githubusercontent.com/aaubs/ds-master/main/data/reddit_r_cooking_sample.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2675456 (2.6M) [text/plain]
Saving to: ‘reddit_r_cooking_sample.jsonl’


2022-11-01 15:04:32 (32.9 MB/s) - ‘reddit_r_cooking_sample.jsonl’ saved [2675456/2675456]



In [2]:
# installs
!pip install --upgrade gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.2.0


In [3]:
import pandas as pd
import numpy as np
import json

# we will use nltk for sentence tokenization
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

# we will be using gensim for training
import gensim
from gensim import utils
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS


# Logging settings
import logging

for handler in logging.root.handlers[:]:
   logging.root.removeHandler(handler)

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Simple In-memory training

To better understand the training itself we start with simple in-memory model training. All the data will be loaded with pandas.
Preprocessing results will also be kept in memory as Python lists. This is a viable approach up to a certain data size. When going beyond 5M texts (depending on the hardware) that's probably not a good idea.

In [4]:
# load data
data = pd.read_json('https://raw.githubusercontent.com/aaubs/ds-master/main/data/reddit_r_cooking_sample.jsonl', lines=True)

In [5]:
data.head()

Unnamed: 0,text,meta
0,Where do you get the mock duck? I've only rece...,"{'section': 'Cooking', 'utc': '1364690064'}"
1,Microwaves are terrible. Everyone in this sub ...,"{'section': 'Cooking', 'utc': '1368260826'}"
2,My Pro 500 is going on 18 years old. Thing is ...,"{'section': 'Cooking', 'utc': 1518485096}"
3,deglazing works ok. but not as well as on a st...,"{'section': 'Cooking', 'utc': '1413146528'}"
4,Does Google not exist in Germany? 7g dry is 1....,"{'section': 'Cooking', 'utc': 1522171636}"


Word2Vec uses sentences to train, not paragraphs. Therefore we will need to sentence-tokenize.

In [6]:
# NLTK tokenizer:
sent_tokenize('this is a sentence. also that one.')

['this is a sentence.', 'also that one.']

In [7]:
# Let's apply that to all texts
sentences = []
for i in data['text']:
  sentences.extend(sent_tokenize(i))

In [8]:
len(sentences)

29445

Gensim has efficient simple preprocessing as part of its utility functions. That works well for most Latin-script texts. Check out the [Gensim docs](https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html) for more info.

In [9]:
# simple prepro (tokenization, lowercase, de-accent (optional))
sentences_prepro = [utils.simple_preprocess(line) for line in sentences]

We are not removing stopwords for Word2Vec, as the model actually cares about syntax. One thing that we can do is identify n-grams (phrases).

In [10]:
# training a model to identify n-grams
phrase_model = Phrases(sentences_prepro, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)

2022-11-01 15:04:50,384 : INFO : collecting all words and their counts
2022-11-01 15:04:50,387 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2022-11-01 15:04:50,554 : INFO : PROGRESS: at sentence #10000, processed 123637 words and 74036 word types
2022-11-01 15:04:50,729 : INFO : PROGRESS: at sentence #20000, processed 245368 words and 129436 word types
2022-11-01 15:04:50,899 : INFO : collected 176735 token types (unigram + bigrams) from a corpus of 359868 words and 29445 sentences
2022-11-01 15:04:50,902 : INFO : merged Phrases<176735 vocab, min_count=1, threshold=1, max_vocab_size=40000000>
2022-11-01 15:04:50,907 : INFO : Phrases lifecycle event {'msg': 'built Phrases<176735 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.52s', 'datetime': '2022-11-01T15:04:50.907864', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'created'}


In [11]:
# apply the model
sentences_phrased = [phrase_model[line] for line in sentences_prepro]

In [12]:
# quick check
sentences_phrased[:5]

[['where_do', 'you_get', 'the', 'mock', 'duck'],
 ['ve_only', 'recently', 'tried_it', 'in', 'restaurant', 'and', 'loved_it'],
 ['hoisin', 'we_use', 'for', 'sandwich', 'condiment', 'mixed_with_sriracha'],
 ['you_could', 'make_those', 'pancakes', 'with', 'another', 'faux', 'meat'],
 ['some_of_those',
  'grain',
  'sausages_are',
  'really_good',
  'and',
  'you_can',
  'slice_them']]

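Obviously, some hyperparameter tuning is needed: with `min_count=1` and `threshold=1` almost every pair of neighbouring words is merged into a "phrase".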
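To get a feeling for what `min_count` and `threshold` do, here is a sketch of the bigram score as I understand gensim's default scorer (it follows Mikolov et al., 2013): a pair is accepted as a phrase when its score is above `threshold`. The function and parameter names are my own and the counts are hypothetical, so treat this as an illustration rather than the library code.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Bigram score: high when `a b` occurs together more often than the
    individual frequencies of `a` and `b` would suggest."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# hypothetical counts, for illustration only
vocab_size = 10_000

# two fairly rare words that almost always appear together (think "sous vide")
rare_pair = phrase_score(count_a=60, count_b=50, count_ab=45,
                         vocab_size=vocab_size, min_count=25)

# two very frequent words that happen to be neighbours 45 times (think "it in")
common_pair = phrase_score(count_a=5000, count_b=4000, count_ab=45,
                           vocab_size=vocab_size, min_count=25)

# a pair seen exactly min_count times can never score above zero
borderline = phrase_score(count_a=60, count_b=50, count_ab=25,
                          vocab_size=vocab_size, min_count=25)

print(rare_pair, common_pair, borderline)
```

With `threshold=20` only the first pair would survive. Raising `min_count` removes pairs that are simply too rare to trust, and raising `threshold` removes pairs made of words that are frequent everywhere.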

In [13]:
# adjusting min_count and threshold (the threshold is compared with a score calculated within the model - read the docs)
phrase_model = Phrases(sentences_prepro, min_count=25, threshold=20, connector_words=ENGLISH_CONNECTOR_WORDS)
sentences_phrased = [phrase_model[line] for line in sentences_prepro]
sentences_phrased[:5]

2022-11-01 15:04:51,611 : INFO : collecting all words and their counts
2022-11-01 15:04:51,616 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2022-11-01 15:04:51,807 : INFO : PROGRESS: at sentence #10000, processed 123637 words and 74036 word types
2022-11-01 15:04:52,124 : INFO : PROGRESS: at sentence #20000, processed 245368 words and 129436 word types
2022-11-01 15:04:52,311 : INFO : collected 176735 token types (unigram + bigrams) from a corpus of 359868 words and 29445 sentences
2022-11-01 15:04:52,313 : INFO : merged Phrases<176735 vocab, min_count=25, threshold=20, max_vocab_size=40000000>
2022-11-01 15:04:52,322 : INFO : Phrases lifecycle event {'msg': 'built Phrases<176735 vocab, min_count=25, threshold=20, max_vocab_size=40000000> in 0.71s', 'datetime': '2022-11-01T15:04:52.322617', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'created'}


[['where', 'do', 'you', 'get', 'the', 'mock', 'duck'],
 ['ve',
  'only',
  'recently',
  'tried',
  'it',
  'in',
  'restaurant',
  'and',
  'loved',
  'it'],
 ['hoisin',
  'we',
  'use',
  'for',
  'sandwich',
  'condiment',
  'mixed',
  'with',
  'sriracha'],
 ['you',
  'could',
  'make',
  'those',
  'pancakes',
  'with',
  'another',
  'faux',
  'meat'],
 ['some',
  'of',
  'those',
  'grain',
  'sausages',
  'are',
  'really',
  'good',
  'and',
  'you',
  'can',
  'slice',
  'them']]

In [14]:
# did we actually find anything?
for phrase, score in phrase_model.find_phrases(sentences_prepro).items():
    print(phrase, score)

as_well 27.471339649272448
ve_been 41.653628014475565
stainless_steel 339.0599520383693
your_own 20.223897445413495
more_than 20.47408324458768
stir_fry 192.49911686782454
salt_pepper 31.312819683243973
olive_oil 184.98806397708285
store_bought 37.18857840249137
sour_cream 170.60931322975085
ve_never 32.897408361970214
slow_cooker 332.58133391235117
mashed_potatoes 166.10432330827066
thank_you 20.47161354330867
tomato_sauce 20.939855748581923
they_re 27.604060913705585
ve_got 21.553749715437903
check_out 30.309552392385527
talking_about 51.24875724937863
cast_iron 782.157477411027
alton_brown 256.58036640165915
pulled_pork 112.0196486780152
http_www 292.3240096923725
com_recipes 21.29508394248534
better_than 33.40834415963816
don_know 24.0042240154292
sous_vide 1497.1500605082697
next_time 35.865252904469585
grocery_store 213.3450024142926
imgur_com 81.90843485169492
ground_beef 91.19453044375645
grew_up 45.984822202948834
make_sure 24.00867902694272
chicken_breasts 29.188962816157037


Once sentences are pre-processed (tokenized, list of lists) we can train the model.
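One parameter worth understanding before training is `min_count`: tokens that occur fewer times than this in the whole corpus are dropped from the vocabulary and get no vector. The sketch below mimics that pruning step with a `Counter` on a made-up toy corpus; `prune_vocab` is a hypothetical helper, not a gensim function.

```python
from collections import Counter

# toy corpus in the same "list of lists of tokens" format used for training
toy_sentences = [
    ['cast_iron', 'skillet', 'is', 'great'],
    ['season', 'the', 'cast_iron', 'skillet'],
    ['the', 'wok', 'is', 'hot'],
]

def prune_vocab(sentences, min_count):
    """Count all tokens and keep only those seen at least `min_count` times."""
    counts = Counter(token for sent in sentences for token in sent)
    return {tok: c for tok, c in counts.items() if c >= min_count}

vocab = prune_vocab(toy_sentences, min_count=2)
print(vocab)
```

On a small corpus a high `min_count` can remove most of the vocabulary, so it is worth checking how many words are retained after training starts.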

In [15]:
model = gensim.models.Word2Vec(sentences=sentences_phrased, 
                               vector_size=300, 
                               window=5, 
                               min_count=5, 
                               workers=4, 
                               epochs=15)

2022-11-01 15:04:54,022 : INFO : collecting all words and their counts
2022-11-01 15:04:54,027 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-11-01 15:04:54,068 : INFO : PROGRESS: at sentence #10000, processed 121743 words, keeping 9863 word types
2022-11-01 15:04:54,105 : INFO : PROGRESS: at sentence #20000, processed 241677 words, keeping 13807 word types
2022-11-01 15:04:54,139 : INFO : collected 16762 word types from a corpus of 354479 raw words and 29445 sentences
2022-11-01 15:04:54,140 : INFO : Creating a fresh vocabulary
2022-11-01 15:04:54,179 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4857 unique words (28.98% of original 16762, drops 11905)', 'datetime': '2022-11-01T15:04:54.179874', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'prepare_vocab'}
2022-11-01 15:04:54,191 : INFO : Word2Vec lifecycle event 

In [16]:
# check most similar terms
model.wv.most_similar('dutch_oven')

[('electric', 0.8057990670204163),
 ('bare', 0.7724413275718689),
 ('pressure_cooker', 0.7668929696083069),
 ('smoker', 0.7512131929397583),
 ('ss', 0.743894636631012),
 ('cast_iron', 0.743208646774292),
 ('oven', 0.721584141254425),
 ('crock_pot', 0.7208809852600098),
 ('wok', 0.7184653282165527),
 ('skillet', 0.7170997858047485)]

In [17]:
# we can call the vector of each word
model.wv['kettle']

array([ 0.01459577,  0.00051552,  0.03378123, -0.06099399,  0.14045642,
       -0.06822933, -0.08402817,  0.23614497,  0.13813491, -0.12879542,
        0.01250404, -0.14240834,  0.01904952,  0.03136403, -0.03646796,
       -0.13897865,  0.05526898, -0.0383767 , -0.0014335 , -0.04038331,
        0.09598418, -0.01977006,  0.07523971,  0.11076099, -0.08894028,
       -0.03512162, -0.02173638,  0.05409386, -0.09885938, -0.01030666,
       -0.04874817,  0.18512665, -0.07877014,  0.12038369, -0.14203385,
        0.13327682,  0.06715305, -0.2822212 , -0.09083666, -0.06442374,
       -0.22567119, -0.00222421,  0.12802269, -0.12861009, -0.15939379,
        0.11796158,  0.10332466,  0.0298819 , -0.07896766,  0.26637504,
        0.06439476, -0.04309073, -0.14780426,  0.06083586, -0.0601622 ,
        0.07341482,  0.29285246, -0.01333882, -0.05669258, -0.04951216,
       -0.15498039,  0.10532729, -0.05211889,  0.13003305, -0.03847679,
        0.0511533 ,  0.00375731,  0.16538042, -0.12729827, -0.15

In [18]:
model.wv.vectors.shape

(4857, 300)

In [19]:
# from here you can access the key-to-index dict for mapping
model.wv.key_to_index

{'the': 0,
 'and': 1,
 'it': 2,
 'to': 3,
 'you': 4,
 'of': 5,
 'in': 6,
 'is': 7,
 'that': 8,
 'for': 9,
 'with': 10,
 'but': 11,
 'or': 12,
 'on': 13,
 'have': 14,
 'if': 15,
 'can': 16,
 'this': 17,
 'my': 18,
 'like': 19,
 'just': 20,
 'are': 21,
 'not': 22,
 'be': 23,
 'as': 24,
 'so': 25,
 'some': 26,
 'your': 27,
 'make': 28,
 'them': 29,
 'at': 30,
 'use': 31,
 'they': 32,
 'was': 33,
 'all': 34,
 'do': 35,
 'good': 36,
 'from': 37,
 'one': 38,
 'what': 39,
 'get': 40,
 'out': 41,
 'don': 42,
 'then': 43,
 'up': 44,
 'about': 45,
 'add': 46,
 'would': 47,
 'an': 48,
 'will': 49,
 'when': 50,
 'more': 51,
 'there': 52,
 'cook': 53,
 'cooking': 54,
 'me': 55,
 'also': 56,
 'really': 57,
 'chicken': 58,
 'sauce': 59,
 'time': 60,
 'how': 61,
 'food': 62,
 'recipe': 63,
 'meat': 64,
 'pan': 65,
 'water': 66,
 'into': 67,
 'want': 68,
 're': 69,
 'no': 70,
 'think': 71,
 'cheese': 72,
 'we': 73,
 'go': 74,
 'too': 75,
 'because': 76,
 'way': 77,
 'put': 78,
 'any': 79,
 'made': 80,


## Training Word2Vec from disk

Let's assume you want to train a word-embedding model from disk, for example because you downloaded all of Wikipedia or one of the large (multi-GB) datasets from Hugging Face.

In [20]:
# open file (not read yet) from disk
texts_reddit = open('/content/reddit_r_cooking_sample.jsonl','r')

In [21]:
# read a single line (each call moves on to the next line)
texts_reddit.readline()

'{"text":"Where do you get the mock duck? I\'ve only recently tried it in a restaurant and loved it. Hoisin we use for sandwich condiment mixed with sriracha. You could make those pancakes with another faux-meat. Some of those grain sausages are really good and you can slice them. The brand I buy sometimes is Field Roast. Also hoisin stir fried veggies with peanut is delicious.","meta":{"section":"Cooking","utc":"1364690064"}}\n'

In [22]:
# Decode JSON
json.loads(texts_reddit.readline())

{'text': 'Microwaves are terrible. Everyone in this sub should know that they have no good culinary purpose. The reason they leave unpopped kernels is because they heat unevenly. Inside the microwave oven there are hotspots and there are coldspots. The rotating dish things try to mitigate this but it\'s still a problem. The best way to do it is to not use the microwave. Simple as that. Learn to do it in a pan. It\'s almost as easy as boiling water. This question is a bit like "how do I blanch vegetables in the oven". Use the right tool for the job.',
 'meta': {'section': 'Cooking', 'utc': '1368260826'}}

We need to turn our comments into sentences (tokenize) and preprocess them. There is no need to repeat that on the fly in each of the 15 training epochs.
For that we create a new file `sentances.txt`, tokenize our texts and write all sentences as lines into the new file. Using one sentence per line in TXT files is a common approach.

In [23]:
# we need to re-open the file to start from the top
texts_reddit = open('/content/reddit_r_cooking_sample.jsonl','r')

In [24]:
# write one sentence per line into a new file
with open('sentances.txt','w') as f:
  for line in texts_reddit: # iterate over the json-lines with comments (alternative to readline())
    line = json.loads(line) # decode json
    for sent in sent_tokenize(line['text']): # sent-tokenize
      f.write(sent) # write sents into the new file
      f.write('\n')
# no explicit f.close() needed: the with-block closes the file

The next step is not easy but important, and your first step to writing "real code".
We need to define something that allows us to retrieve our sentences from the stored file one by one (and start from the beginning after the last one).

A class with an `__iter__` method can help here. Iterating over an instance yields the sentences one by one. `yield` is different from `return`: the latter ends the execution and hands back the overall result of a function, whereas `yield` hands back one value, pauses, and resumes from the same spot when the next value is requested.
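Why a class and not just a plain generator? Training makes several passes over the corpus (one to build the vocabulary and then one per epoch), so the object has to be iterable more than once. A minimal sketch of the difference, using a throwaway temporary file and a hypothetical `LineCorpus` class that splits on whitespace instead of using gensim's preprocessing:

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable: every loop re-opens the file and starts from the top."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.split()   # one sentence per line, tokens split on whitespace

# write a tiny two-line corpus to a temporary file
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write('season the skillet\nslice the sausages\n')
tmp.close()

corpus = LineCorpus(tmp.name)
first_pass = list(corpus)
second_pass = list(corpus)     # works again: __iter__ creates a fresh generator

gen = (tokens for tokens in corpus)
list(gen)                      # consumes the generator
leftover = list(gen)           # a plain generator is exhausted after one pass

print(len(first_pass), len(second_pass), len(leftover))
os.remove(tmp.name)
```

Only one line is held in memory at a time, no matter how large the file is, and every new loop starts from the first sentence again.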

In [25]:
path = "/content/sentances.txt"

In [26]:
class MyCorpus:
    """An iterator that yields sentences (lists of str)."""
    def __iter__(self):
        for line in open(path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)

Let's try out how that works

In [27]:
# instantiate a corpus object
sentences_disk = MyCorpus()

In [28]:
# define a generator (similar to list comprehension but on "stand-by")
test_gen = (a for a in sentences_disk)

In [29]:
# every time we call next, it runs one iteration
next(test_gen)

['where', 'do', 'you', 'get', 'the', 'mock', 'duck']

Let's train our Phrases model from the disk-corpus

In [30]:
sentences_disk = MyCorpus()

In [31]:
phrase_model = Phrases(sentences_disk, min_count=25, threshold=20, connector_words=ENGLISH_CONNECTOR_WORDS)

2022-11-01 15:05:12,251 : INFO : collecting all words and their counts
2022-11-01 15:05:12,269 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2022-11-01 15:05:13,049 : INFO : PROGRESS: at sentence #10000, processed 123637 words and 74036 word types
2022-11-01 15:05:13,960 : INFO : PROGRESS: at sentence #20000, processed 245368 words and 129436 word types
2022-11-01 15:05:14,751 : INFO : collected 176735 token types (unigram + bigrams) from a corpus of 359868 words and 29445 sentences
2022-11-01 15:05:14,760 : INFO : merged Phrases<176735 vocab, min_count=25, threshold=20, max_vocab_size=40000000>
2022-11-01 15:05:14,767 : INFO : Phrases lifecycle event {'msg': 'built Phrases<176735 vocab, min_count=25, threshold=20, max_vocab_size=40000000> in 2.52s', 'datetime': '2022-11-01T15:05:14.767561', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'created'}


In [32]:
for phrase, score in phrase_model.find_phrases(sentences_disk).items():
    print(phrase, score)

as_well 27.471339649272448
ve_been 41.653628014475565
stainless_steel 339.0599520383693
your_own 20.223897445413495
more_than 20.47408324458768
stir_fry 192.49911686782454
salt_pepper 31.312819683243973
olive_oil 184.98806397708285
store_bought 37.18857840249137
sour_cream 170.60931322975085
ve_never 32.897408361970214
slow_cooker 332.58133391235117
mashed_potatoes 166.10432330827066
thank_you 20.47161354330867
tomato_sauce 20.939855748581923
they_re 27.604060913705585
ve_got 21.553749715437903
check_out 30.309552392385527
talking_about 51.24875724937863
cast_iron 782.157477411027
alton_brown 256.58036640165915
pulled_pork 112.0196486780152
http_www 292.3240096923725
com_recipes 21.29508394248534
better_than 33.40834415963816
don_know 24.0042240154292
sous_vide 1497.1500605082697
next_time 35.865252904469585
grocery_store 213.3450024142926
imgur_com 81.90843485169492
ground_beef 91.19453044375645
grew_up 45.984822202948834
make_sure 24.00867902694272
chicken_breasts 29.188962816157037


🚀🚀🚀
**Efficiency** is key when working from disk.
Let's preprocess the inputs using simple-prepro and the phrases model.
Since we preprocess our sentences into lists, we need to store them as JSON so that we can load them back as Python lists rather than strings.

In [33]:
sentences_disk = MyCorpus()

In [34]:
# open new file (txt file with one JSON list per line)
with open('sentances_phrases.txt','w') as f:
  for sent in sentences_disk: # iterate over the preprocessed sentences (lists of tokens)
    f.write(json.dumps(phrase_model[sent])) # apply the phrase model and write the tokens as JSON
    f.write('\n')

In [35]:
path = '/content/sentances_phrases.txt'

In [36]:
class MyCorpus_processed:
    """An iterator that yields sentences (lists of str)."""
    def __iter__(self):
        for line in open(path):
            # each line holds one sentence stored as a JSON list of tokens
            yield json.loads(line)

In [37]:
sentences_disk = MyCorpus_processed()

In [38]:
# train Word2Vec on the preprocessed corpus streamed from disk
model = gensim.models.Word2Vec(sentences=sentences_disk, 
                               vector_size=300, 
                               window=5, 
                               min_count=5, 
                               workers=4, 
                               epochs=15)

2022-11-01 15:05:21,332 : INFO : collecting all words and their counts
2022-11-01 15:05:21,334 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-11-01 15:05:21,415 : INFO : PROGRESS: at sentence #10000, processed 121743 words, keeping 9863 word types
2022-11-01 15:05:21,494 : INFO : PROGRESS: at sentence #20000, processed 241677 words, keeping 13807 word types
2022-11-01 15:05:21,574 : INFO : collected 16762 word types from a corpus of 354479 raw words and 29445 sentences
2022-11-01 15:05:21,576 : INFO : Creating a fresh vocabulary
2022-11-01 15:05:21,608 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 4857 unique words (28.98% of original 16762, drops 11905)', 'datetime': '2022-11-01T15:05:21.608214', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'prepare_vocab'}
2022-11-01 15:05:21,611 : INFO : Word2Vec lifecycle event 

In [39]:
model.wv.most_similar('coriander')

[('thyme', 0.9487970471382141),
 ('bay', 0.9379823803901672),
 ('basil', 0.9286872148513794),
 ('chilies', 0.9284491539001465),
 ('chives', 0.928163468837738),
 ('parsley', 0.9272999167442322),
 ('oregano', 0.9271392226219177),
 ('chile', 0.925579845905304),
 ('mint', 0.9250453114509583),
 ('carrot', 0.9242677688598633)]

### Bonus: Training FastText

Training FastText is syntax-wise the same.
There are a few other parameters that you can tune.

In [40]:
model_fasttext = FastText(sentences = sentences_disk, 
                          vector_size=300, 
                          window=8, 
                          min_count=5, 
                          workers=4, 
                          epochs=15)

2022-11-01 15:05:34,458 : INFO : collecting all words and their counts
2022-11-01 15:05:34,460 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-11-01 15:05:34,590 : INFO : PROGRESS: at sentence #10000, processed 121743 words, keeping 9863 word types
2022-11-01 15:05:34,665 : INFO : PROGRESS: at sentence #20000, processed 241677 words, keeping 13807 word types
2022-11-01 15:05:34,736 : INFO : collected 16762 word types from a corpus of 354479 raw words and 29445 sentences
2022-11-01 15:05:34,738 : INFO : Creating a fresh vocabulary
2022-11-01 15:05:34,773 : INFO : FastText lifecycle event {'msg': 'effective_min_count=5 retains 4857 unique words (28.98% of original 16762, drops 11905)', 'datetime': '2022-11-01T15:05:34.773171', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'prepare_vocab'}
2022-11-01 15:05:34,775 : INFO : FastText lifecycle event 

In [41]:
model_fasttext.wv.most_similar('coriander')

[('chowder', 0.9274277091026306),
 ('chili_powder', 0.9191348552703857),
 ('black_pepper', 0.9164717197418213),
 ('powder', 0.9057203531265259),
 ('powders', 0.9046632647514343),
 ('cayenne', 0.902331531047821),
 ('garlic_powder', 0.9016087055206299),
 ('powdered', 0.9012802839279175),
 ('pepper', 0.8967088460922241),
 ('oregano', 0.8918057084083557)]

In [42]:
model.wv['powder']

array([-0.12078518,  0.597558  ,  0.00331327,  1.0297153 ,  0.09155773,
        0.29574957,  0.73296636,  0.4650821 ,  0.384209  ,  0.21762931,
       -0.6055789 , -0.051665  ,  0.39077055, -0.45504335, -0.20111151,
        0.39284787, -0.06931885,  0.18525817, -0.00431984,  0.03986278,
       -0.1095335 ,  0.6636997 , -0.09697071,  0.5727559 ,  0.2781841 ,
       -0.37117946, -0.09656594, -0.3709168 , -0.58877665, -0.32047302,
       -0.32964802, -0.43679228, -0.3207291 , -0.35386488, -0.03786171,
       -0.36373922,  0.02904632, -0.26232532,  0.41615665,  0.30471483,
        0.19186927,  0.68935853, -0.46712887, -0.08489431, -0.08985439,
       -0.06875525, -0.11272377,  0.43016848,  0.3556957 ,  0.15710631,
        0.08752429, -0.32841668, -0.78596765, -0.2321159 , -0.9267788 ,
        0.44065472,  0.7954728 , -0.68115926,  1.067942  , -0.09555547,
       -1.3683629 ,  0.54949826, -0.5056994 ,  0.32614797,  1.0164382 ,
       -0.62646544,  0.07067577,  0.28844243, -0.08612514, -0.44

## Visualizing Word-Vectors

Now that we have our word vectors, we can reduce their dimensionality to explore them visually.

In [43]:
!pip install umap-learn -q



In [44]:
import random
import umap
import altair as alt

In [45]:
# picking 2000 random vectors from the W2V model
idx = random.sample(range(len(model.wv.vectors)), 2000)

In [46]:
# creating 2D reduction
umap_reducer = umap.UMAP(random_state=42, n_components=2)
embeddings = umap_reducer.fit_transform(model.wv.vectors[idx])

In [47]:
# df for plot
df_plot = pd.DataFrame(embeddings, columns=['x','y'])

In [48]:
# vector-labels
labels = [model.wv.index_to_key[ix] for ix in idx]

In [49]:
df_plot['labels'] = labels

In [50]:
# plot
alt.Chart(df_plot).mark_circle(size=60).encode(
    x='x',
    y='y',
    tooltip=['labels']
).properties(
    width=800,
    height=600
).interactive()

## Create sentence embeddings from our W2V model

The final aim is to use the custom W2V embeddings to vectorize sentences.
We will look at average vectors and TFIDF-weighted average embeddings.

In [51]:
test_sents = ['I love chicken super much with soy',
              'I enjoy asian food, especially chicken',
              'Give me cake', 'mexican food is amazing', 
              'I enjoy cuisine italian']

### Average W2V vectors

In [52]:
# tokenize
tokens = phrase_model[utils.simple_preprocess(test_sents[0])]

In [53]:
# filter out only those words that are part of the vocab
tokens = [t for t in tokens if t in model.wv.key_to_index.keys()]

In [54]:
# create average-vectors
avg_vec = np.average([model.wv[t] for t in tokens], axis=0)

Let's package this process up into a vectorizer function.

In [55]:
def w2v_vectorize(text):
  tokens = phrase_model[utils.simple_preprocess(text)] # preprocess just as model inputs
  tokens = [t for t in tokens if t in model.wv.key_to_index.keys()] # filter only tokens that are in vocab
  return np.average([model.wv[t] for t in tokens], axis=0) # calculate avg vector
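The vectorizer above assumes that at least one token of the text is in the vocabulary. For very short or unusual texts that may not hold, and then there is nothing to average. Below is a sketch of a guard for that case, written in plain Python with made-up 3-dimensional vectors so that it runs on its own; returning a zero vector is just one possible choice (skipping the text would be another).

```python
# hypothetical 3-dimensional "embeddings", for illustration only
toy_vectors = {
    'chicken': [1.0, 0.0, 2.0],
    'soy':     [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of all in-vocabulary tokens; zero vector if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) walks through the vectors dimension by dimension
    return [sum(column) / len(known) for column in zip(*known)]

avg_known = average_vector(['love', 'chicken', 'soy'], toy_vectors, dim=3)
avg_unknown = average_vector(['zzz'], toy_vectors, dim=3)
print(avg_known, avg_unknown)
```

Keep in mind that a zero vector has no direction, so cosine similarity with it is not meaningful and such texts are best filtered out before a search.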

In [56]:
# it's a good idea to stack them using numpy into a matrix
vecs = np.vstack([w2v_vectorize(s) for s in test_sents])

In [57]:
# quick look at how similar the vectors are (not really part of the pipeline)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(vecs)

array([[0.99999994, 0.33225888, 0.19143194, 0.27539983, 0.33984554],
       [0.33225888, 0.9999998 , 0.1990554 , 0.64899683, 0.6214321 ],
       [0.19143194, 0.1990554 , 0.9999999 , 0.21739675, 0.28216696],
       [0.27539983, 0.64899683, 0.21739675, 0.9999996 , 0.70198625],
       [0.33984554, 0.6214321 , 0.28216696, 0.70198625, 1.0000002 ]],
      dtype=float32)

### TFIDF weighted W2V Embeddings

Very similar to avg-embeddings; however, here we will use sklearn's TfidfVectorizer (that one we already know) to weight our vecs.
The approach is a bit "hacky" but efficient.

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [59]:
# function that does absolutely nothing...
# because gensim does preprocessing and tokenization in one step, the vectorizer's tokenizer only has to pass the tokens through
def dummy_fun(doc):
    return doc

In [60]:
[phrase_model[utils.simple_preprocess(text)] for text in test_sents]

[['love', 'chicken', 'super', 'much', 'with', 'soy'],
 ['enjoy', 'asian', 'food', 'especially', 'chicken'],
 ['give', 'me', 'cake'],
 ['mexican', 'food', 'is', 'amazing'],
 ['enjoy', 'cuisine', 'italian']]

In [61]:
# we define a preprocessing function to pass into the TfidfVectorizer
def gensim_prepro(doc):
  return phrase_model[utils.simple_preprocess(doc)]

In [62]:
# we turn off any preprocessing and align vocabulary with the one
# used by our embeddings
# that will allow us to use TFIDF vectors to weight the embeddings

tfidf_new_text = TfidfVectorizer(
    vocabulary=model.wv.key_to_index.keys(), # here using the W2V vocab
    tokenizer=dummy_fun,
    preprocessor=gensim_prepro,
    token_pattern=None)  

In [63]:
# create TFIDF matrix (we could also just use that one for search)
new_tfidf = tfidf_new_text.fit_transform(test_sents)

In [64]:
new_tfidf

<5x4857 sparse matrix of type '<class 'numpy.float64'>'
	with 21 stored elements in Compressed Sparse Row format>

Here is a cool little trick: since the number of columns in the TFIDF matrix equals the number of rows in our word-embedding matrix (both follow the W2V vocabulary), we can simply take a dot product.
Another cool feature: this can be done sequentially, in chunks, for large datasets that do not fit in RAM.

In [65]:
# calculate TFIDF-weighted embeddings (per document: a TFIDF-weighted sum of its word vectors)
test_w2v_tfidf = new_tfidf @ model.wv.vectors

In [66]:
cosine_similarity(test_w2v_tfidf)

array([[1.        , 0.28100529, 0.20314774, 0.31770869, 0.34759816],
       [0.28100529, 1.        , 0.18598448, 0.62465747, 0.64506126],
       [0.20314774, 0.18598448, 1.        , 0.20873881, 0.24964549],
       [0.31770869, 0.62465747, 0.20873881, 1.        , 0.71836449],
       [0.34759816, 0.64506126, 0.24964549, 0.71836449, 1.        ]])
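
The remark above about doing this sequentially can be sketched as follows. The arrays are random toy stand-ins (dense NumPy arrays instead of the sparse TFIDF matrix), and `weighted_embeddings_in_chunks` and the chunk size are assumptions for illustration only. The point is simply that multiplying row blocks one at a time gives the same result as one big dot product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical sizes): 6 documents, 5 vocabulary terms, 4 dimensions.
weights = rng.random((6, 5))      # plays the role of the TFIDF matrix
embeddings = rng.random((5, 4))   # plays the role of model.wv.vectors

def weighted_embeddings_in_chunks(weights, embeddings, chunk_size):
    """Multiply row blocks one at a time and stack the results."""
    parts = []
    for start in range(0, weights.shape[0], chunk_size):
        block = weights[start:start + chunk_size]
        parts.append(block @ embeddings)
    return np.vstack(parts)

full = weights @ embeddings
chunked = weighted_embeddings_in_chunks(weights, embeddings, chunk_size=4)

# Each output row is the weighted sum of the embedding rows.
row0_by_hand = sum(weights[0, j] * embeddings[j] for j in range(5))
```

In a real setting each block could be written to disk (or appended to an index) before the next one is computed, so only one block has to live in memory at a time.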

## Using these embeddings for semantic search
We can use such embeddings (and others) for semantic search (similarity maximization) and also downstream in unsupervised/supervised tasks.

In [67]:
# create TFIDF matrix for all texts (this refits the vectorizer on the full corpus)
tfidf_all = tfidf_new_text.fit_transform(data['text'])

In [68]:
# get vecs by dot-product
tfidf_w2v_all = tfidf_all @ model.wv.vectors

In [69]:
# make query and transform it into same vector-space

query = 'Steak egg'

tfidf_q = tfidf_new_text.transform([query]) 
tfidf_w2v_q = tfidf_q @ model.wv.vectors

In [70]:
# calculate cos-sim between the query and all vecs (despite its name, `distances` holds similarities)

distances = cosine_similarity(tfidf_w2v_q,tfidf_w2v_all)

In [71]:
# get document ids sorted from most to least similar
ids = np.flip(np.argsort(distances))[0]
ids

array([ 950, 5679, 3546, ..., 5576,  798, 6482])

In [72]:
# print the 10 most similar texts
for ix in ids[:10]:
  print(data['text'].values[ix])

Malaysian-styled burger aka Ramly Burger. Basically beef patty wrapped in egg.
Teriyaki steak bowls!!
NJ porkroll egg and cheese on a Kaiser roll. Egg has to be runny.
Steak
A poached or sunny side up egg with a runny yolk is great on buttered toast. The yolk should not be watery, but slightly thickened, like cheese sauce. I also enjoy egg in soup. It cooks into threads. Stracciatella (sp) or chicken soup are good soups to add egg to.
Gravy over rice and peas is what I serve with chicken fried steak. Shit is delicious.
Oven baked lasagna is redundant.
Egg-in-window.
Frittatas Egg salad sandwiches Deviled eggs Crack 4 eggs in a water bottle (use a funnel),shake it up and freeze it for later Cobb salad Potato salad
German oven pancakes! 6 eggs!
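
The search steps above (embed the query, compute cosine similarities, sort, take the top hits) can be wrapped into one small helper. This is a minimal sketch on toy vectors in plain NumPy; `top_k_similar`, `docs` and `query` are hypothetical names, and it assumes no vector is all zeros.

```python
import numpy as np

def top_k_similar(query_vec, doc_vecs, k=3):
    """Return indices and cosine similarities of the k rows closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                     # cosine similarity of every document to the query
    order = np.argsort(-sims)[:k]    # indices sorted from most to least similar
    return order, sims[order]

docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0],
                 [-1.0, 0.0]])
query = np.array([1.0, 0.2])
top_ids, top_sims = top_k_similar(query, docs, k=2)
```

Returning the similarity scores along with the ids makes it easier to judge how strong a match actually is, for example to cut off weak hits.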


### Serialization

Gensim models can (and should) be saved to disk after training.

In [73]:
phrase_model.save('bigram_model.m')

2022-11-01 15:08:09,719 : INFO : Phrases lifecycle event {'fname_or_handle': 'bigram_model.m', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-11-01T15:08:09.719783', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'saving'}
2022-11-01 15:08:09,818 : INFO : saved bigram_model.m


In [74]:
model.save('w2v_food.m')

2022-11-01 15:08:09,834 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'w2v_food.m', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-11-01T15:08:09.834835', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'saving'}
2022-11-01 15:08:09,840 : INFO : not storing attribute cum_table
2022-11-01 15:08:09,874 : INFO : saved w2v_food.m


In [75]:
g = Word2Vec.load('/content/w2v_food.m')

2022-11-01 15:08:09,885 : INFO : loading Word2Vec object from /content/w2v_food.m
2022-11-01 15:08:09,907 : INFO : loading wv recursively from /content/w2v_food.m.wv.* with mmap=None
2022-11-01 15:08:09,914 : INFO : setting ignored attribute cum_table to None
2022-11-01 15:08:09,982 : INFO : Word2Vec lifecycle event {'fname': '/content/w2v_food.m', 'datetime': '2022-11-01T15:08:09.982258', 'gensim': '4.2.0', 'python': '3.7.15 (default, Oct 12 2022, 19:14:55) \n[GCC 7.5.0]', 'platform': 'Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic', 'event': 'loaded'}


In [76]:
g.wv.most_similar('garlic')

[('cloves', 0.819964587688446),
 ('celery', 0.7911463379859924),
 ('minced', 0.7880942225456238),
 ('carrots', 0.7792809009552002),
 ('chopped', 0.7780672907829285),
 ('onions', 0.7706136703491211),
 ('ginger', 0.770179271697998),
 ('cilantro', 0.769708514213562),
 ('unpeeled', 0.7607511878013611),
 ('olive_oil', 0.750318706035614)]