### Imports and logging

In [1]:
# imports needed and set up logging
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


### Dataset 

Let's unzip and read the dataset.

In [2]:
data_file = "reviews_data.txt.gz"

# print the first raw line (as bytes) to see what a review looks like
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break


b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

### Read files into a list

In [3]:

def read_input_file(input_file):
    """This method reads the input file which is in gzip format"""
    
    logging.info("reading file {0}...this may take a while".format(input_file))
    
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f): 

            if (i%10000==0):
                logging.info ("read {0} reviews".format (i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess (line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list (read_input_file (data_file))
logging.info ("Done reading data file")    

2023-01-21 11:41:54,267 : INFO : reading file reviews_data.txt.gz...this may take a while
2023-01-21 11:41:54,272 : INFO : read 0 reviews
2023-01-21 11:41:58,767 : INFO : read 10000 reviews
2023-01-21 11:42:03,159 : INFO : read 20000 reviews
2023-01-21 11:42:08,565 : INFO : read 30000 reviews
2023-01-21 11:42:13,677 : INFO : read 40000 reviews
2023-01-21 11:42:19,695 : INFO : read 50000 reviews
2023-01-21 11:42:26,373 : INFO : read 60000 reviews
2023-01-21 11:42:31,423 : INFO : read 70000 reviews
2023-01-21 11:42:36,421 : INFO : read 80000 reviews
2023-01-21 11:42:40,678 : INFO : read 90000 reviews
2023-01-21 11:42:45,049 : INFO : read 100000 reviews
2023-01-21 11:42:50,067 : INFO : read 110000 reviews
2023-01-21 11:42:54,083 : INFO : read 120000 reviews
2023-01-21 11:42:58,916 : INFO : read 130000 reviews
2023-01-21 11:43:03,931 : INFO : read 140000 reviews
2023-01-21 11:43:08,420 : INFO : read 150000 reviews
2023-01-21 11:43:13,259 : INFO : read 160000 reviews
2023-01-21 11:43:17,697
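
The cell above materializes every tokenized review in a list because a plain generator can only be consumed once, while training needs several passes over the data. If the corpus does not fit in memory, one alternative is a restartable iterable. The following is a minimal sketch under stated assumptions: `ReviewCorpus`, `whitespace_tokenize`, and the demo file are hypothetical names introduced for illustration, and the whitespace tokenizer is only a stand-in for `gensim.utils.simple_preprocess`.

```python
import gzip
import os
import tempfile


class ReviewCorpus:
    """Restartable iterable over a gzipped file with one review per line."""

    def __init__(self, path, tokenize):
        self.path = path
        self.tokenize = tokenize

    def __iter__(self):
        # A fresh file handle is opened on every pass, so the corpus can be
        # iterated as many times as needed.
        with gzip.open(self.path, "rt", encoding="utf-8") as f:
            for line in f:
                yield self.tokenize(line)


def whitespace_tokenize(line):
    # Placeholder tokenizer (assumption); the notebook uses
    # gensim.utils.simple_preprocess instead.
    return line.lower().split()


# Tiny demo file standing in for reviews_data.txt.gz
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("Nice trendy hotel\nThe room was dirty\n")

corpus = ReviewCorpus(demo_path, whitespace_tokenize)
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would be exhausted here
```

Both passes yield the same token lists, which is the property a multi-pass consumer relies on.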

## Training the Word2Vec model

## Understanding some of the parameters
Training the model requires setting a few parameters. Let's try to understand what some of them mean. For reference, this is the command that we use to train the model.

```
model = gensim.models.Word2Vec (documents, vector_size=150, window=10, min_count=2, workers=10)
```

### `vector_size`
The size of the dense vector that represents each token or word. If you have very limited data, then `vector_size` should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me.

### `window`
The maximum distance between the target word and a neighboring word. Words that lie farther than `window` positions to the left or right of the target are not treated as being related to it. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
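
To make the role of `window` concrete, here is a simplified sketch of how (target, context) pairs could be collected from one tokenized review. The helper `context_pairs` is a hypothetical name, and the sketch deliberately ignores details of the real training loop such as random window shrinking and subsampling of frequent words.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # leftmost neighbor index
        hi = min(len(tokens), i + window + 1)   # one past the rightmost neighbor
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "room", "was", "very", "clean"]
narrow = context_pairs(tokens, window=1)   # only direct neighbors
wide = context_pairs(tokens, window=10)    # every other word in the review
```

With `window=1`, `the` is paired only with `room`; with `window=10`, it is also paired with `clean` at the other end of the sentence.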

### `min_count`
Minimum frequency count of words. The model ignores words whose total count is below `min_count`. Extremely infrequent words are usually unimportant, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.
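
The effect of this threshold can be illustrated with a plain frequency count. This is a sketch only: `prune_vocabulary` and the toy documents are made up for illustration and are not part of gensim.

```python
from collections import Counter


def prune_vocabulary(documents, min_count):
    """Keep only the words that occur at least `min_count` times overall."""
    counts = Counter(token for doc in documents for token in doc)
    return {token: n for token, n in counts.items() if n >= min_count}


docs = [
    ["clean", "room", "clean", "bed"],
    ["dirty", "room"],
    ["clean", "matress"],  # a one-off misspelling
]
vocab = prune_vocabulary(docs, min_count=2)
```

Here only `clean` (3 occurrences) and `room` (2 occurrences) survive; the one-off words, including the misspelling, are dropped.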

### `workers`
The number of worker threads to use for training.


In [5]:
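# passing `documents` to the constructor builds the vocabulary and already trains the model
model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10)
# this call then runs 10 more training epochs over the same data
model.train(documents, total_examples=len(documents), epochs=10)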

2023-01-21 11:50:49,321 : INFO : collecting all words and their counts
2023-01-21 11:50:49,323 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-01-21 11:50:50,160 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2023-01-21 11:50:51,080 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2023-01-21 11:50:53,258 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2023-01-21 11:50:55,858 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2023-01-21 11:50:58,189 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2023-01-21 11:50:59,905 : INFO : PROGRESS: at sentence #60000, processed 11013727 words, keeping 76787 word types
2023-01-21 11:51:01,561 : INFO : PROGRESS: at sentence #70000, processed 12637529 words, keeping 83200 word types
2023-01-21 11:51:02,775 : INFO : PROG

2023-01-21 11:51:57,797 : INFO : EPOCH 1 - PROGRESS: at 51.80% examples, 611605 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:51:58,827 : INFO : EPOCH 1 - PROGRESS: at 53.76% examples, 611225 words/s, in_qsize 17, out_qsize 2
2023-01-21 11:51:59,836 : INFO : EPOCH 1 - PROGRESS: at 56.16% examples, 613074 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:52:00,846 : INFO : EPOCH 1 - PROGRESS: at 58.34% examples, 613846 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:52:01,873 : INFO : EPOCH 1 - PROGRESS: at 60.64% examples, 614945 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:52:02,897 : INFO : EPOCH 1 - PROGRESS: at 62.48% examples, 611866 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:52:03,970 : INFO : EPOCH 1 - PROGRESS: at 63.78% examples, 602399 words/s, in_qsize 15, out_qsize 4
2023-01-21 11:52:05,113 : INFO : EPOCH 1 - PROGRESS: at 64.44% examples, 586892 words/s, in_qsize 20, out_qsize 1
2023-01-21 11:52:06,117 : INFO : EPOCH 1 - PROGRESS: at 66.26% examples, 586222 words/s,

2023-01-21 11:53:00,330 : INFO : worker thread finished; awaiting finish of 6 more threads
2023-01-21 11:53:00,343 : INFO : worker thread finished; awaiting finish of 5 more threads
2023-01-21 11:53:00,345 : INFO : worker thread finished; awaiting finish of 4 more threads
2023-01-21 11:53:00,349 : INFO : worker thread finished; awaiting finish of 3 more threads
2023-01-21 11:53:00,355 : INFO : worker thread finished; awaiting finish of 2 more threads
2023-01-21 11:53:00,357 : INFO : worker thread finished; awaiting finish of 1 more threads
2023-01-21 11:53:00,367 : INFO : EPOCH 2 - PROGRESS: at 100.00% examples, 791524 words/s, in_qsize 0, out_qsize 1
2023-01-21 11:53:00,369 : INFO : worker thread finished; awaiting finish of 0 more threads
2023-01-21 11:53:00,371 : INFO : EPOCH - 2 : training on 41519359 raw words (30350373 effective words) took 38.3s, 791441 effective words/s
2023-01-21 11:53:01,388 : INFO : EPOCH 3 - PROGRESS: at 2.34% examples, 730838 words/s, in_qsize 18, out_qsiz

2023-01-21 11:53:57,659 : INFO : EPOCH 4 - PROGRESS: at 42.40% examples, 781854 words/s, in_qsize 18, out_qsize 1
2023-01-21 11:53:58,675 : INFO : EPOCH 4 - PROGRESS: at 45.31% examples, 781374 words/s, in_qsize 18, out_qsize 1
2023-01-21 11:53:59,679 : INFO : EPOCH 4 - PROGRESS: at 48.07% examples, 782851 words/s, in_qsize 20, out_qsize 0
2023-01-21 11:54:00,685 : INFO : EPOCH 4 - PROGRESS: at 50.79% examples, 781999 words/s, in_qsize 20, out_qsize 2
2023-01-21 11:54:01,694 : INFO : EPOCH 4 - PROGRESS: at 53.44% examples, 783434 words/s, in_qsize 18, out_qsize 1
2023-01-21 11:54:02,711 : INFO : EPOCH 4 - PROGRESS: at 56.21% examples, 782747 words/s, in_qsize 20, out_qsize 1
2023-01-21 11:54:03,712 : INFO : EPOCH 4 - PROGRESS: at 59.04% examples, 784449 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:54:04,712 : INFO : EPOCH 4 - PROGRESS: at 61.54% examples, 782282 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:54:05,713 : INFO : EPOCH 4 - PROGRESS: at 64.50% examples, 782748 words/s,

2023-01-21 11:54:59,425 : INFO : worker thread finished; awaiting finish of 6 more threads
2023-01-21 11:54:59,429 : INFO : worker thread finished; awaiting finish of 5 more threads
2023-01-21 11:54:59,436 : INFO : worker thread finished; awaiting finish of 4 more threads
2023-01-21 11:54:59,445 : INFO : EPOCH 5 - PROGRESS: at 99.93% examples, 750768 words/s, in_qsize 3, out_qsize 1
2023-01-21 11:54:59,449 : INFO : worker thread finished; awaiting finish of 3 more threads
2023-01-21 11:54:59,449 : INFO : worker thread finished; awaiting finish of 2 more threads
2023-01-21 11:54:59,451 : INFO : worker thread finished; awaiting finish of 1 more threads
2023-01-21 11:54:59,454 : INFO : worker thread finished; awaiting finish of 0 more threads
2023-01-21 11:54:59,455 : INFO : EPOCH - 5 : training on 41519359 raw words (30347747 effective words) took 40.4s, 751152 effective words/s
2023-01-21 11:54:59,456 : INFO : Word2Vec lifecycle event {'msg': 'training on 207596795 raw words (151741010 

2023-01-21 11:55:46,934 : INFO : EPOCH 2 - PROGRESS: at 16.20% examples, 764077 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:55:47,934 : INFO : EPOCH 2 - PROGRESS: at 18.42% examples, 770191 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:55:48,934 : INFO : EPOCH 2 - PROGRESS: at 20.42% examples, 771223 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:55:49,940 : INFO : EPOCH 2 - PROGRESS: at 22.88% examples, 776496 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:55:50,959 : INFO : EPOCH 2 - PROGRESS: at 25.07% examples, 777742 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:55:51,967 : INFO : EPOCH 2 - PROGRESS: at 27.99% examples, 776939 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:55:52,976 : INFO : EPOCH 2 - PROGRESS: at 30.73% examples, 776955 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:55:53,983 : INFO : EPOCH 2 - PROGRESS: at 33.48% examples, 775347 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:55:55,001 : INFO : EPOCH 2 - PROGRESS: at 36.17% examples, 775118 words/s,

2023-01-21 11:56:50,800 : INFO : EPOCH 3 - PROGRESS: at 83.15% examples, 788587 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:56:51,810 : INFO : EPOCH 3 - PROGRESS: at 85.81% examples, 789672 words/s, in_qsize 20, out_qsize 1
2023-01-21 11:56:52,827 : INFO : EPOCH 3 - PROGRESS: at 88.69% examples, 789415 words/s, in_qsize 18, out_qsize 1
2023-01-21 11:56:53,838 : INFO : EPOCH 3 - PROGRESS: at 91.51% examples, 789852 words/s, in_qsize 20, out_qsize 0
2023-01-21 11:56:54,843 : INFO : EPOCH 3 - PROGRESS: at 94.28% examples, 790373 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:56:55,853 : INFO : EPOCH 3 - PROGRESS: at 97.09% examples, 790987 words/s, in_qsize 18, out_qsize 1
2023-01-21 11:56:56,843 : INFO : worker thread finished; awaiting finish of 9 more threads
2023-01-21 11:56:56,846 : INFO : worker thread finished; awaiting finish of 8 more threads
2023-01-21 11:56:56,846 : INFO : worker thread finished; awaiting finish of 7 more threads
2023-01-21 11:56:56,850 : INFO : worker thr

2023-01-21 11:57:45,619 : INFO : EPOCH 5 - PROGRESS: at 26.01% examples, 797646 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:57:46,628 : INFO : EPOCH 5 - PROGRESS: at 28.88% examples, 794904 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:57:47,654 : INFO : EPOCH 5 - PROGRESS: at 31.64% examples, 790501 words/s, in_qsize 20, out_qsize 0
2023-01-21 11:57:48,678 : INFO : EPOCH 5 - PROGRESS: at 34.35% examples, 789438 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:57:49,693 : INFO : EPOCH 5 - PROGRESS: at 37.01% examples, 786940 words/s, in_qsize 20, out_qsize 0
2023-01-21 11:57:50,702 : INFO : EPOCH 5 - PROGRESS: at 39.94% examples, 789164 words/s, in_qsize 18, out_qsize 1
2023-01-21 11:57:51,708 : INFO : EPOCH 5 - PROGRESS: at 43.04% examples, 792276 words/s, in_qsize 18, out_qsize 1
2023-01-21 11:57:52,715 : INFO : EPOCH 5 - PROGRESS: at 46.06% examples, 793316 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:57:53,738 : INFO : EPOCH 5 - PROGRESS: at 49.08% examples, 796458 words/s,

2023-01-21 11:58:49,469 : INFO : EPOCH 6 - PROGRESS: at 96.01% examples, 803011 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:58:50,475 : INFO : EPOCH 6 - PROGRESS: at 98.78% examples, 802805 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:58:50,807 : INFO : worker thread finished; awaiting finish of 9 more threads
2023-01-21 11:58:50,819 : INFO : worker thread finished; awaiting finish of 8 more threads
2023-01-21 11:58:50,823 : INFO : worker thread finished; awaiting finish of 7 more threads
2023-01-21 11:58:50,826 : INFO : worker thread finished; awaiting finish of 6 more threads
2023-01-21 11:58:50,842 : INFO : worker thread finished; awaiting finish of 5 more threads
2023-01-21 11:58:50,847 : INFO : worker thread finished; awaiting finish of 4 more threads
2023-01-21 11:58:50,851 : INFO : worker thread finished; awaiting finish of 3 more threads
2023-01-21 11:58:50,860 : INFO : worker thread finished; awaiting finish of 2 more threads
2023-01-21 11:58:50,869 : INFO : worker thre

2023-01-21 11:59:44,668 : INFO : EPOCH 8 - PROGRESS: at 32.05% examples, 691684 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:59:45,693 : INFO : EPOCH 8 - PROGRESS: at 34.33% examples, 688956 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:59:46,696 : INFO : EPOCH 8 - PROGRESS: at 36.91% examples, 692031 words/s, in_qsize 20, out_qsize 0
2023-01-21 11:59:47,704 : INFO : EPOCH 8 - PROGRESS: at 39.26% examples, 689733 words/s, in_qsize 18, out_qsize 1
2023-01-21 11:59:48,709 : INFO : EPOCH 8 - PROGRESS: at 41.76% examples, 688943 words/s, in_qsize 20, out_qsize 1
2023-01-21 11:59:49,724 : INFO : EPOCH 8 - PROGRESS: at 44.35% examples, 689999 words/s, in_qsize 20, out_qsize 0
2023-01-21 11:59:50,730 : INFO : EPOCH 8 - PROGRESS: at 46.91% examples, 691860 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:59:51,743 : INFO : EPOCH 8 - PROGRESS: at 49.70% examples, 696270 words/s, in_qsize 19, out_qsize 0
2023-01-21 11:59:52,752 : INFO : EPOCH 8 - PROGRESS: at 52.44% examples, 700563 words/s,

2023-01-21 12:00:48,277 : INFO : EPOCH 9 - PROGRESS: at 97.75% examples, 795070 words/s, in_qsize 16, out_qsize 3
2023-01-21 12:00:49,026 : INFO : worker thread finished; awaiting finish of 9 more threads
2023-01-21 12:00:49,037 : INFO : worker thread finished; awaiting finish of 8 more threads
2023-01-21 12:00:49,037 : INFO : worker thread finished; awaiting finish of 7 more threads
2023-01-21 12:00:49,061 : INFO : worker thread finished; awaiting finish of 6 more threads
2023-01-21 12:00:49,072 : INFO : worker thread finished; awaiting finish of 5 more threads
2023-01-21 12:00:49,074 : INFO : worker thread finished; awaiting finish of 4 more threads
2023-01-21 12:00:49,075 : INFO : worker thread finished; awaiting finish of 3 more threads
2023-01-21 12:00:49,090 : INFO : worker thread finished; awaiting finish of 2 more threads
2023-01-21 12:00:49,092 : INFO : worker thread finished; awaiting finish of 1 more threads
2023-01-21 12:00:49,093 : INFO : worker thread finished; awaiting f

(303490987, 415193590)

## Now, let's look at some output 
This first example shows a simple case of looking up words similar to the word `dirty`. All we need to do here is to call the `most_similar` function and provide the word `dirty` as the positive example. This returns the top 10 similar words. 

In [6]:

w1 = "dirty"
model.wv.most_similar (positive=w1)


[('filthy', 0.8609764575958252),
 ('stained', 0.7824174165725708),
 ('unclean', 0.7755457162857056),
 ('smelly', 0.7609288692474365),
 ('grubby', 0.7539528012275696),
 ('dusty', 0.7529140710830688),
 ('soiled', 0.7389745116233826),
 ('dingy', 0.7325567007064819),
 ('disgusting', 0.7289584875106812),
 ('gross', 0.7164105176925659)]

That looks pretty good, right? Let's look at a few more. Let's look at similarity for `polite`, `france` and `shocked`. 

In [7]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar (positive=w1,topn=6)


[('courteous', 0.9211625456809998),
 ('friendly', 0.8233624696731567),
 ('cordial', 0.8014512062072754),
 ('professional', 0.7983894944190979),
 ('attentive', 0.7798492312431335),
 ('curteous', 0.7637271285057068)]

In [8]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar (positive=w1,topn=6)


[('canada', 0.6712582111358643),
 ('germany', 0.6440682411193848),
 ('spain', 0.6323268413543701),
 ('barcelona', 0.6108295917510986),
 ('gaulle', 0.6051010489463806),
 ('england', 0.5976824164390564)]

In [9]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar (positive=w1,topn=6)


[('horrified', 0.8302863240242004),
 ('amazed', 0.783433735370636),
 ('stunned', 0.7692832350730896),
 ('dismayed', 0.7649892568588257),
 ('appalled', 0.755466878414154),
 ('astonished', 0.7495095133781433)]

That's nice. You can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered as related. In the example below we are asking for all items that *relate to bed* only:

In [10]:
# get everything related to stuff on the bed
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
model.wv.most_similar (positive=w1,negative=w2,topn=10)


[('duvet', 0.721473217010498),
 ('blanket', 0.7098276615142822),
 ('quilt', 0.6970281600952148),
 ('mattress', 0.6804217100143433),
 ('matress', 0.678648054599762),
 ('sheets', 0.644561767578125),
 ('pillowcase', 0.6398433446884155),
 ('pillows', 0.6327794194221497),
 ('foam', 0.6251879930496216),
 ('comforter', 0.6060689687728882)]
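
Conceptually, positive and negative examples are combined into a single query vector, and the remaining vocabulary is ranked by cosine similarity to it. The sketch below illustrates that idea with hand-made two-dimensional vectors; `rank_similar`, `unit`, and the toy vectors are assumptions for illustration, not the library's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += weight * x
    query = unit(query)

    exclude = set(positive) | set(negative)  # never return the input words
    scores = [
        (word, sum(a * b for a, b in zip(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Toy vectors: first axis ~ "bedding", second axis ~ "seating"
toy_vectors = {
    "bed": [1.0, 0.2],
    "pillow": [0.9, 0.3],
    "couch": [0.2, 1.0],
    "duvet": [1.0, 0.1],
    "sofa": [0.1, 1.0],
    "cushion": [0.6, 0.6],
}
ranked = rank_similar(toy_vectors, positive=["bed", "pillow"], negative=["couch"])
```

In this toy setup `duvet` ranks first and `sofa` last, because subtracting `couch` pushes the query away from the seating direction.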

### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

In [11]:
# similarity between two different words
model.wv.similarity(w1="dirty",w2="smelly")

0.760929

In [12]:
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")

1.0

In [13]:
# similarity between two words with opposite meanings
model.wv.similarity(w1="dirty",w2="clean")

0.2639729

Under the hood, the three snippets above compute the cosine similarity between the two specified words using their word vectors. From the scores, it makes sense that `dirty` is highly similar to `smelly` and much less similar to `clean`. Comparing a word with itself gives a score of 1.0, which is the maximum, since the cosine similarity score always lies in the range [-1.0, 1.0]. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
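
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, which yields values between -1.0 and 1.0. A minimal standalone sketch with made-up vectors (no gensim involved):

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([3.0, 4.0], [3.0, 4.0])        # identical direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])  # opposite direction
```

Identical directions score 1, orthogonal directions score 0, and exactly opposite directions score -1.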

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [14]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])

'france'

In [15]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])


'shower'
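
One way to think about the odd-one-out lookup is: average the normalized vectors of all the words, then pick the word that is least similar to that average. The sketch below shows this idea on hand-made vectors; `odd_one_out`, `unit`, and the toy values are assumptions for illustration and only approximate what the library does internally.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = [sum(u[k] for u in units) / len(units) for k in range(dim)]
    scores = {
        word: sum(a * b for a, b in zip(u, mean))
        for word, u in zip(words, units)
    }
    return min(scores, key=scores.get)


# Toy vectors: first axis ~ "animal", second axis ~ "country"
toy_vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "france": [0.1, 0.9],
}
odd = odd_one_out(toy_vectors, ["cat", "dog", "france"])
```

With these toy vectors the result is `france`, since it points in a different direction from the two animal words that dominate the mean.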