# Word2vec Using Gensim
## loading training data

gensim's `simple_preprocess` prepares each review by tokenizing and lowercasing the text
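
As a rough illustration of what this preprocessing step produces, here is a minimal standard-library sketch. It is not gensim's implementation: the helper name `simple_tokenize` and the ASCII-only regular expression are assumptions made for illustration, while the 2 to 15 character length bounds mirror the defaults of `gensim.utils.simple_preprocess`.

```python
import re

def simple_tokenize(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens within the length bounds."""
    if isinstance(text, bytes):
        # gzip files opened in 'rb' mode yield bytes, so decode first
        text = text.decode("utf-8", errors="ignore")
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(simple_tokenize(b"Nice trendy hotel, location not too bad."))
```

Single-character tokens and digits are dropped, so the review text shrinks to a list of lowercase words.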

In [1]:
import gzip
import gensim 
import logging
with gzip.open ('reviews_data.txt.gz', 'rb') as f:
    for i,line in enumerate (f):
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

In [1]:
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

data_file='reviews_data.txt.gz'

def read_input(input_file):
    
    logging.info(f"reading file {input_file}...this may take a while")
    
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f): 

            if (i%10000==0):
                logging.info (f"read {i} reviews")
            yield gensim.utils.simple_preprocess (line) # tokenization, lowercasing

documents = list (read_input (data_file))
logging.info ("Done reading data file")

2020-09-15 02:05:37,391 : INFO : reading file reviews_data.txt.gz...this may take a while
2020-09-15 02:05:37,398 : INFO : read 0 reviews
2020-09-15 02:05:43,904 : INFO : read 10000 reviews
2020-09-15 02:05:48,989 : INFO : read 20000 reviews
2020-09-15 02:05:52,930 : INFO : read 30000 reviews
2020-09-15 02:05:59,645 : INFO : read 40000 reviews
2020-09-15 02:06:06,970 : INFO : read 50000 reviews
2020-09-15 02:06:14,347 : INFO : read 60000 reviews
2020-09-15 02:06:20,367 : INFO : read 70000 reviews
2020-09-15 02:06:26,116 : INFO : read 80000 reviews
2020-09-15 02:06:31,846 : INFO : read 90000 reviews
2020-09-15 02:06:37,169 : INFO : read 100000 reviews
2020-09-15 02:06:42,567 : INFO : read 110000 reviews
2020-09-15 02:06:48,074 : INFO : read 120000 reviews
2020-09-15 02:06:54,012 : INFO : read 130000 reviews
2020-09-15 02:06:59,610 : INFO : read 140000 reviews
2020-09-15 02:07:01,599 : INFO : read 150000 reviews
2020-09-15 02:07:03,474 : INFO : read 160000 reviews
2020-09-15 02:07:06,059
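
Calling `list(...)` on the generator keeps every tokenized review in memory at once. A possible alternative, sketched below under stated assumptions, is a restartable iterable that re-reads the file on each pass. The class name `ReviewCorpus` and its `tokenize` parameter are hypothetical, and a plain `str.split` stands in for `gensim.utils.simple_preprocess` so that the block runs with only the standard library.

```python
import gzip
import os
import tempfile

class ReviewCorpus:
    """Restartable iterable: re-reads the gzip file each time it is iterated."""

    def __init__(self, path, tokenize=None):
        self.path = path
        # Stand-in tokenizer (assumption); gensim.utils.simple_preprocess could be passed instead.
        self.tokenize = tokenize or (lambda line: line.decode("utf-8").lower().split())

    def __iter__(self):
        with gzip.open(self.path, "rb") as f:
            for line in f:
                yield self.tokenize(line)

# Tiny demo file so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.txt.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(b"Nice hotel\nDirty room\n")

corpus = ReviewCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would already be exhausted here
print(first_pass)
```

The design point is that training makes several passes over the data, so the corpus object has to be iterable more than once; a bare generator is consumed after the first pass.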

## training models
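|parameter|description|
|---|---|
|size|dimensionality of the dense word vectors (renamed `vector_size` in gensim 4.0)|
|window|maximum distance between the current word and a neighboring context word|
|min_count|minimum frequency count; words that occur fewer times are ignored|
|workers|number of worker threads used for training|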
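
To make `window` and `min_count` more concrete, the following sketch lists which (center, context) word pairs fall inside a window once rare words are removed. It is a simplified illustration, not gensim's training loop: the function name `context_pairs` and the toy sentences are assumptions, and the real trainer does more (for example, subsampling of frequent words, which the `sample=0.001` setting in the training log refers to).

```python
from collections import Counter

def context_pairs(sentences, window=2, min_count=1):
    """Return (center, context) pairs after dropping words rarer than min_count."""
    counts = Counter(word for sent in sentences for word in sent)
    pairs = []
    for sent in sentences:
        kept = [w for w in sent if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

toy = [["the", "room", "was", "clean"], ["the", "room", "was", "dirty"]]
pairs = context_pairs(toy, window=1, min_count=2)
print(pairs)
```

With `min_count=2`, "clean" and "dirty" each occur only once and are dropped before any pairs are formed; with `window=1`, only directly adjacent words are paired.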

In [2]:
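# Passing documents to the constructor builds the vocabulary and already trains (5 epochs in the log below).
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=10)
# This second call continues training the same model for 10 more epochs.
model.train(documents,total_examples=len(documents),epochs=10)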

2020-09-15 02:14:22,533 : INFO : collecting all words and their counts
2020-09-15 02:14:22,535 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-09-15 02:14:23,338 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2020-09-15 02:14:24,177 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2020-09-15 02:14:25,324 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2020-09-15 02:14:26,363 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2020-09-15 02:14:27,302 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2020-09-15 02:14:28,222 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2020-09-15 02:14:29,044 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2020-09-15 02:14:29,819 : INFO : PROG

2020-09-15 02:15:23,694 : INFO : EPOCH 1 - PROGRESS: at 85.27% examples, 696798 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:15:24,708 : INFO : EPOCH 1 - PROGRESS: at 88.28% examples, 700350 words/s, in_qsize 20, out_qsize 0
2020-09-15 02:15:25,708 : INFO : EPOCH 1 - PROGRESS: at 91.06% examples, 702531 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:15:26,710 : INFO : EPOCH 1 - PROGRESS: at 93.25% examples, 701362 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:15:27,733 : INFO : EPOCH 1 - PROGRESS: at 95.56% examples, 699956 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:15:28,743 : INFO : EPOCH 1 - PROGRESS: at 97.97% examples, 699883 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:15:29,461 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-09-15 02:15:29,461 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-09-15 02:15:29,469 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-09-15 02:15:29,471 : INFO : worker thr

2020-09-15 02:16:19,468 : INFO : EPOCH 3 - PROGRESS: at 14.98% examples, 699002 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:16:20,491 : INFO : EPOCH 3 - PROGRESS: at 16.86% examples, 691480 words/s, in_qsize 20, out_qsize 2
2020-09-15 02:16:21,500 : INFO : EPOCH 3 - PROGRESS: at 18.58% examples, 684859 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:16:22,516 : INFO : EPOCH 3 - PROGRESS: at 20.20% examples, 679469 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:16:23,530 : INFO : EPOCH 3 - PROGRESS: at 22.37% examples, 681731 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:16:24,546 : INFO : EPOCH 3 - PROGRESS: at 24.09% examples, 680759 words/s, in_qsize 20, out_qsize 1
2020-09-15 02:16:25,578 : INFO : EPOCH 3 - PROGRESS: at 26.40% examples, 678304 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:16:26,579 : INFO : EPOCH 3 - PROGRESS: at 28.76% examples, 676212 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:16:27,594 : INFO : EPOCH 3 - PROGRESS: at 31.23% examples, 675796 words/s,

2020-09-15 02:17:23,833 : INFO : EPOCH 4 - PROGRESS: at 61.65% examples, 696499 words/s, in_qsize 15, out_qsize 4
2020-09-15 02:17:24,854 : INFO : EPOCH 4 - PROGRESS: at 64.58% examples, 699140 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:17:25,877 : INFO : EPOCH 4 - PROGRESS: at 66.76% examples, 697994 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:17:26,878 : INFO : EPOCH 4 - PROGRESS: at 69.04% examples, 696798 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:17:27,884 : INFO : EPOCH 4 - PROGRESS: at 71.61% examples, 699943 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:17:28,887 : INFO : EPOCH 4 - PROGRESS: at 74.21% examples, 701172 words/s, in_qsize 20, out_qsize 2
2020-09-15 02:17:29,898 : INFO : EPOCH 4 - PROGRESS: at 76.75% examples, 704037 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:17:30,906 : INFO : EPOCH 4 - PROGRESS: at 78.90% examples, 703292 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:17:31,909 : INFO : EPOCH 4 - PROGRESS: at 81.31% examples, 704149 words/s,

2020-09-15 02:18:22,867 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-09-15 02:18:22,872 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-09-15 02:18:22,876 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-09-15 02:18:22,879 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-09-15 02:18:22,880 : INFO : EPOCH - 5 : training on 41519358 raw words (30353382 effective words) took 43.1s, 704699 effective words/s
2020-09-15 02:18:22,880 : INFO : training on a 207596790 raw words (151752319 effective words) took 216.7s, 700302 effective words/s
2020-09-15 02:18:22,883 : INFO : training model with 10 workers on 70537 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
2020-09-15 02:18:23,898 : INFO : EPOCH 1 - PROGRESS: at 2.29% examples, 717705 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:18:24,928 : INFO : EPOCH 1 - PROGRESS: at 4.83% examples, 730549 words/s, in_qsize 1

2020-09-15 02:19:21,431 : INFO : EPOCH 2 - PROGRESS: at 33.70% examples, 723766 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:19:22,441 : INFO : EPOCH 2 - PROGRESS: at 36.22% examples, 723211 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:19:23,461 : INFO : EPOCH 2 - PROGRESS: at 38.79% examples, 722624 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:19:24,469 : INFO : EPOCH 2 - PROGRESS: at 41.38% examples, 721850 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:19:25,489 : INFO : EPOCH 2 - PROGRESS: at 43.91% examples, 719905 words/s, in_qsize 20, out_qsize 3
2020-09-15 02:19:26,496 : INFO : EPOCH 2 - PROGRESS: at 46.56% examples, 720763 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:19:27,524 : INFO : EPOCH 2 - PROGRESS: at 49.16% examples, 720775 words/s, in_qsize 15, out_qsize 4
2020-09-15 02:19:28,525 : INFO : EPOCH 2 - PROGRESS: at 51.81% examples, 723222 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:19:29,541 : INFO : EPOCH 2 - PROGRESS: at 54.10% examples, 722359 words/s,

2020-09-15 02:20:25,553 : INFO : EPOCH 3 - PROGRESS: at 89.64% examples, 730150 words/s, in_qsize 20, out_qsize 0
2020-09-15 02:20:26,567 : INFO : EPOCH 3 - PROGRESS: at 91.80% examples, 727132 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:20:27,567 : INFO : EPOCH 3 - PROGRESS: at 93.97% examples, 724907 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:20:28,590 : INFO : EPOCH 3 - PROGRESS: at 96.15% examples, 722316 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:20:29,591 : INFO : EPOCH 3 - PROGRESS: at 98.26% examples, 719752 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:20:30,232 : INFO : worker thread finished; awaiting finish of 9 more threads
2020-09-15 02:20:30,257 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-09-15 02:20:30,257 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-09-15 02:20:30,265 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-09-15 02:20:30,282 : INFO : worker thread finished; awaiting 

2020-09-15 02:21:21,736 : INFO : EPOCH 5 - PROGRESS: at 15.67% examples, 730446 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:21:22,741 : INFO : EPOCH 5 - PROGRESS: at 17.52% examples, 725188 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:21:23,760 : INFO : EPOCH 5 - PROGRESS: at 19.37% examples, 721307 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:21:24,782 : INFO : EPOCH 5 - PROGRESS: at 21.74% examples, 726942 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:21:25,809 : INFO : EPOCH 5 - PROGRESS: at 23.60% examples, 725078 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:21:26,827 : INFO : EPOCH 5 - PROGRESS: at 25.57% examples, 717282 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:21:27,847 : INFO : EPOCH 5 - PROGRESS: at 28.18% examples, 714852 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:21:28,849 : INFO : EPOCH 5 - PROGRESS: at 30.59% examples, 712646 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:21:29,864 : INFO : EPOCH 5 - PROGRESS: at 33.03% examples, 708992 words/s,

2020-09-15 02:22:25,770 : INFO : EPOCH 6 - PROGRESS: at 54.88% examples, 674120 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:22:26,773 : INFO : EPOCH 6 - PROGRESS: at 57.24% examples, 673184 words/s, in_qsize 16, out_qsize 3
2020-09-15 02:22:27,784 : INFO : EPOCH 6 - PROGRESS: at 59.56% examples, 673130 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:22:28,785 : INFO : EPOCH 6 - PROGRESS: at 62.08% examples, 675447 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:22:29,793 : INFO : EPOCH 6 - PROGRESS: at 64.96% examples, 678312 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:22:30,801 : INFO : EPOCH 6 - PROGRESS: at 67.59% examples, 682477 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:22:31,814 : INFO : EPOCH 6 - PROGRESS: at 69.90% examples, 682442 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:22:32,827 : INFO : EPOCH 6 - PROGRESS: at 72.39% examples, 684987 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:22:33,842 : INFO : EPOCH 6 - PROGRESS: at 74.96% examples, 686127 words/s,

2020-09-15 02:23:28,431 : INFO : worker thread finished; awaiting finish of 8 more threads
2020-09-15 02:23:28,431 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-09-15 02:23:28,440 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-09-15 02:23:28,448 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-09-15 02:23:28,452 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-09-15 02:23:28,452 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-09-15 02:23:28,453 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-09-15 02:23:28,463 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-09-15 02:23:28,467 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-09-15 02:23:28,472 : INFO : EPOCH - 7 : training on 41519358 raw words (30347590 effective words) took 44.1s, 687475 effective words/s
2020-09-15 02:23:29,509 : INFO : EPOCH 8 

2020-09-15 02:24:25,486 : INFO : EPOCH 9 - PROGRESS: at 19.68% examples, 659886 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:24:26,503 : INFO : EPOCH 9 - PROGRESS: at 21.74% examples, 660996 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:24:27,525 : INFO : EPOCH 9 - PROGRESS: at 23.37% examples, 658021 words/s, in_qsize 16, out_qsize 3
2020-09-15 02:24:28,545 : INFO : EPOCH 9 - PROGRESS: at 25.43% examples, 659853 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:24:29,548 : INFO : EPOCH 9 - PROGRESS: at 27.84% examples, 657980 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:24:30,572 : INFO : EPOCH 9 - PROGRESS: at 30.14% examples, 657406 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:24:31,578 : INFO : EPOCH 9 - PROGRESS: at 32.56% examples, 656754 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:24:32,593 : INFO : EPOCH 9 - PROGRESS: at 34.81% examples, 657623 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:24:33,611 : INFO : EPOCH 9 - PROGRESS: at 36.99% examples, 654883 words/s,

2020-09-15 02:25:29,901 : INFO : EPOCH 10 - PROGRESS: at 59.35% examples, 646074 words/s, in_qsize 17, out_qsize 2
2020-09-15 02:25:30,898 : INFO : EPOCH 10 - PROGRESS: at 61.59% examples, 646253 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:25:31,944 : INFO : EPOCH 10 - PROGRESS: at 64.08% examples, 646364 words/s, in_qsize 18, out_qsize 1
2020-09-15 02:25:32,969 : INFO : EPOCH 10 - PROGRESS: at 66.31% examples, 646442 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:25:33,978 : INFO : EPOCH 10 - PROGRESS: at 68.44% examples, 645610 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:25:34,983 : INFO : EPOCH 10 - PROGRESS: at 70.62% examples, 646723 words/s, in_qsize 20, out_qsize 0
2020-09-15 02:25:35,977 : INFO : EPOCH 10 - PROGRESS: at 72.87% examples, 646869 words/s, in_qsize 19, out_qsize 0
2020-09-15 02:25:36,995 : INFO : EPOCH 10 - PROGRESS: at 75.02% examples, 646271 words/s, in_qsize 20, out_qsize 1
2020-09-15 02:25:38,003 : INFO : EPOCH 10 - PROGRESS: at 77.07% examples, 646451

(303497832, 415193580)

## using models
the trained word vectors, accessed through `model.wv`, support the following methods:
- most_similar
- similarity
- doesnt_match

note that the similarity scores returned by these methods are cosine similarities
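
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up vectors (the function name `cosine_similarity` is chosen here for illustration and is not a gensim call):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Because the score depends only on direction, scaling a vector does not change its similarity to other vectors.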

In [17]:
w1 = 'happy'
model.wv.most_similar (positive=w1,topn=6)

[('pleased', 0.8064697980880737),
 ('satisfied', 0.7442551851272583),
 ('thrilled', 0.6671044230461121),
 ('delighted', 0.6617661714553833),
 ('impressed', 0.6447084546089172),
 ('disappointed', 0.5832072496414185)]

In [18]:
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
model.wv.most_similar (positive=w1,negative=w2,topn=10)

[('duvet', 0.7176166772842407),
 ('mattress', 0.7037608623504639),
 ('blanket', 0.698083221912384),
 ('matress', 0.6931204199790955),
 ('quilt', 0.6749095916748047),
 ('pillowcase', 0.6509799957275391),
 ('foam', 0.6488490700721741),
 ('sheets', 0.6427589654922485),
 ('pillows', 0.6399072408676147),
 ('quilts', 0.5985084772109985)]
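
The `positive` and `negative` arguments can be read as vector arithmetic: unit vectors of the positive words are added, those of the negative words are subtracted, and the remaining vocabulary is ranked by cosine similarity to the result. The sketch below illustrates that idea in plain Python; the 2-dimensional `toy_vectors` are invented for the example and are unrelated to the trained model, and this local `most_similar` function is a simplification rather than gensim's code.

```python
import math

def unit(v):
    """Scale v to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative) of unit vectors."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += sign * x
    query = unit(query)
    used = set(positive) | set(negative)  # query words are excluded from the results
    scores = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in used
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

toy_vectors = {  # made-up vectors, for illustration only
    "bed": [1.0, 0.1],
    "pillow": [0.9, 0.2],
    "couch": [0.2, 1.0],
    "duvet": [1.0, 0.0],
    "sofa": [0.1, 1.0],
}
ranked = most_similar(toy_vectors, positive=["bed", "pillow"], negative=["couch"], topn=2)
print(ranked)
```

In this toy space "duvet" ranks above "sofa" because subtracting "couch" pushes the query away from the direction the two seating words share.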

In [19]:
model.wv.similarity(w1="dirty",w2="smelly")

0.75950706

In [20]:
model.wv.doesnt_match(["cat","dog","france"])


'france'
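
One way to picture `doesnt_match` is as a centrality check: average the unit vectors of the given words and return the word that is least similar to that average. The following plain-Python sketch shows the idea with invented 2-dimensional vectors; the local function and `toy_vectors` are illustrative assumptions, not the library's code or the trained model's values.

```python
import math

def unit(v):
    """Scale v to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the group's mean direction."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[k] for u in units.values()) / len(units) for k in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

toy_vectors = {  # made-up vectors, for illustration only
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "france": [0.1, 1.0],
}
odd_one = doesnt_match(toy_vectors, ["cat", "dog", "france"])
print(odd_one)
```

Since "cat" and "dog" point in nearly the same direction, they dominate the mean, and "france" ends up farthest from it.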

## saving and loading models

In [None]:
model.save('review.model.bin')
model.wv.save_word2vec_format('review.model.txt', binary=False)

from gensim.models import Word2Vec
# A full model written with model.save() is reloaded with Word2Vec.load()
model2 = Word2Vec.load('review.model.bin')
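
The `save_word2vec_format(..., binary=False)` call above writes a plain-text file: a header line with the vocabulary size and vector dimension, followed by one line per word with its space-separated vector values. As a hedged illustration of that layout, here is a small standard-library reader; the function name `read_word2vec_text` and the sample content are made up for the example, and in practice gensim's own `KeyedVectors.load_word2vec_format` would be the usual way to read such a file.

```python
import io

def read_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vd' per line."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header count does not match number of rows")
    return vectors

# Invented sample content standing in for a real exported file.
sample = io.StringIO("2 3\nhappy 0.1 0.2 0.3\npleased 0.1 0.25 0.3\n")
loaded = read_word2vec_text(sample)
print(loaded["happy"])
```

The text format stores only the word vectors, so a file like this can be used for lookups and similarity queries but not to continue training.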

timestamp sep 14, 2020