# Getting started with Word2vec in Gensim 

The idea behind word2vec is the assumption that the meaning of a word can be inferred from the words around it.
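To make the "words around it" idea concrete, here is a minimal sketch of how (center word, context word) pairs can be collected from a sentence with a fixed window. The function name `context_pairs` and the toy sentence are illustrative assumptions; this is not how Gensim implements it internally, only the general idea.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for tokens within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


# toy sentence, made up for illustration
tokens = ["the", "hotel", "room", "was", "dirty"]
pairs = list(context_pairs(tokens, window=1))
print(pairs)
```

A larger `window` pairs each word with more distant neighbours, which is what the `window` parameter controls when we train the model later.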

The training algorithms in the `gensim` package were ported from the original <b>Word2vec</b> implementation by <b>Google</b>.

In [1]:
# imports needed and logging 
import gzip
import gensim
import logging 

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


## Dataset 

We will be using the OpinRank dataset, which contains user reviews of cars and hotels.

In [2]:
data_file = "reviews_data.txt.gz"

# peek at the first line (one review) of the gzipped file
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

## Reading files into a list 

We read the file into a list so that we can pass it to the Word2Vec model.
We also do some minor preprocessing with `gensim.utils.simple_preprocess(line)`, which lowercases and tokenizes each line, among other things.
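To give a feel for what this preprocessing does, here is a rough standard-library approximation. The name `rough_preprocess` and the regular expression are assumptions made for illustration; the real `simple_preprocess` may differ in details (to the best of my knowledge it keeps alphabetic tokens between 2 and 15 characters long by default).

```python
import re

# runs of letters/underscores with no digits (an approximation only)
TOKEN_RE = re.compile(r"[^\W\d]+")


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop very short or very long tokens."""
    if isinstance(text, bytes):
        text = text.decode("utf8", errors="ignore")
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


# start of the first review line shown above
sample = b"Oct 12 2009 \tNice trendy hotel location not too bad."
print(rough_preprocess(sample))
```

Note how the numbers and punctuation disappear and everything is lowercased, so each review becomes a plain list of words.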

In [3]:
def read_input(input_file):
    """Read the gzipped input file and yield one list of tokens per review."""

    logging.info("reading file {0}...this may take a while".format(input_file))

    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):

            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(data_file))
logging.info("Done reading data file")

2019-06-10 10:24:43,424 : INFO : reading file reviews_data.txt.gz...this may take a while
2019-06-10 10:24:43,426 : INFO : read 0 reviews
2019-06-10 10:24:45,654 : INFO : read 10000 reviews
2019-06-10 10:24:47,754 : INFO : read 20000 reviews
2019-06-10 10:24:50,231 : INFO : read 30000 reviews
2019-06-10 10:24:52,446 : INFO : read 40000 reviews
2019-06-10 10:24:54,875 : INFO : read 50000 reviews
2019-06-10 10:24:57,257 : INFO : read 60000 reviews
2019-06-10 10:24:59,249 : INFO : read 70000 reviews
2019-06-10 10:25:01,376 : INFO : read 80000 reviews
2019-06-10 10:25:03,307 : INFO : read 90000 reviews
2019-06-10 10:25:05,200 : INFO : read 100000 reviews
2019-06-10 10:25:07,308 : INFO : read 110000 reviews
2019-06-10 10:25:09,308 : INFO : read 120000 reviews
2019-06-10 10:25:11,236 : INFO : read 130000 reviews
2019-06-10 10:25:13,272 : INFO : read 140000 reviews
2019-06-10 10:25:15,150 : INFO : read 150000 reviews
2019-06-10 10:25:17,081 : INFO : read 160000 reviews
2019-06-10 10:25:19,481

## Training the Word2Vec model 

Training the model is fairly straightforward. You pass in the reviews that we read in the previous step, so we are essentially passing a list of lists, where each inner list contains the tokens from one user review. Word2Vec uses all these tokens to build a vocabulary internally, and by vocabulary we mean the set of unique words.
After the vocabulary is built, we just have to call the `train` function. Note that when Gensim is given the documents in the constructor, it also runs an initial training pass, so the explicit `train` call continues training from there.
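As a rough illustration of what "building a vocabulary" means, the sketch below counts tokens across a few toy documents and keeps only the words that appear at least `min_count` times. The helper `build_vocab` and the toy data are assumptions for this example; Gensim's own vocabulary building does more than this.

```python
from collections import Counter


def build_vocab(documents, min_count=2):
    """Count every token and keep those seen at least `min_count` times."""
    counts = Counter(token for doc in documents for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


# toy list of lists, shaped like `documents` above
toy_documents = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
    ["clean", "lobby"],
]
vocab = build_vocab(toy_documents, min_count=2)
print(sorted(vocab))
```

Words that occur only once are dropped, which is the effect of `min_count=2` in the training call.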

In [4]:
# note: Gensim 4.0 renamed the `size` parameter to `vector_size`
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
model.train(documents, total_examples=len(documents), epochs=10)

2019-06-10 10:25:42,975 : INFO : collecting all words and their counts
2019-06-10 10:25:42,975 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-06-10 10:25:43,291 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2019-06-10 10:25:43,602 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2019-06-10 10:25:43,972 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2019-06-10 10:25:44,322 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2019-06-10 10:25:44,706 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2019-06-10 10:25:45,074 : INFO : PROGRESS: at sentence #60000, processed 11013723 words, keeping 76781 word types
2019-06-10 10:25:45,386 : INFO : PROGRESS: at sentence #70000, processed 12637525 words, keeping 83194 word types
2019-06-10 10:25:45,675 : INFO : PROG

2019-06-10 10:26:21,862 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-10 10:26:21,863 : INFO : EPOCH - 1 : training on 41519355 raw words (30347619 effective words) took 28.7s, 1058170 effective words/s
2019-06-10 10:26:22,880 : INFO : EPOCH 2 - PROGRESS: at 2.93% examples, 908960 words/s, in_qsize 20, out_qsize 1
2019-06-10 10:26:23,886 : INFO : EPOCH 2 - PROGRESS: at 6.06% examples, 931025 words/s, in_qsize 20, out_qsize 1
2019-06-10 10:26:24,896 : INFO : EPOCH 2 - PROGRESS: at 9.05% examples, 936436 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:26:25,900 : INFO : EPOCH 2 - PROGRESS: at 11.60% examples, 938646 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:26:26,908 : INFO : EPOCH 2 - PROGRESS: at 14.29% examples, 940532 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:26:27,912 : INFO : EPOCH 2 - PROGRESS: at 17.01% examples, 940803 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:26:28,921 : INFO : EPOCH 2 - PROGRESS: at 19.50% examples, 940600 words/s

2019-06-10 10:27:24,518 : INFO : EPOCH 3 - PROGRESS: at 81.46% examples, 884232 words/s, in_qsize 20, out_qsize 0
2019-06-10 10:27:25,522 : INFO : EPOCH 3 - PROGRESS: at 84.45% examples, 884870 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:27:26,536 : INFO : EPOCH 3 - PROGRESS: at 87.59% examples, 885225 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:27:27,536 : INFO : EPOCH 3 - PROGRESS: at 90.95% examples, 887402 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:27:28,542 : INFO : EPOCH 3 - PROGRESS: at 94.19% examples, 889026 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:27:29,545 : INFO : EPOCH 3 - PROGRESS: at 97.44% examples, 890617 words/s, in_qsize 20, out_qsize 2
2019-06-10 10:27:30,260 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-06-10 10:27:30,279 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-06-10 10:27:30,288 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-06-10 10:27:30,290 : INFO : worker thr

2019-06-10 10:28:19,915 : INFO : EPOCH 5 - PROGRESS: at 48.47% examples, 940782 words/s, in_qsize 20, out_qsize 0
2019-06-10 10:28:20,926 : INFO : EPOCH 5 - PROGRESS: at 51.86% examples, 942280 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:28:21,933 : INFO : EPOCH 5 - PROGRESS: at 54.63% examples, 937329 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:28:22,945 : INFO : EPOCH 5 - PROGRESS: at 57.79% examples, 934157 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:28:23,946 : INFO : EPOCH 5 - PROGRESS: at 61.05% examples, 934476 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:28:24,949 : INFO : EPOCH 5 - PROGRESS: at 64.51% examples, 934945 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:28:25,953 : INFO : EPOCH 5 - PROGRESS: at 67.63% examples, 935251 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:28:26,957 : INFO : EPOCH 5 - PROGRESS: at 70.72% examples, 935782 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:28:27,959 : INFO : EPOCH 5 - PROGRESS: at 74.02% examples, 936058 words/s,

2019-06-10 10:29:11,822 : INFO : EPOCH 2 - PROGRESS: at 2.93% examples, 900593 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:29:12,836 : INFO : EPOCH 2 - PROGRESS: at 6.03% examples, 919315 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:29:13,847 : INFO : EPOCH 2 - PROGRESS: at 8.99% examples, 923470 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:29:14,858 : INFO : EPOCH 2 - PROGRESS: at 11.56% examples, 927308 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:29:15,870 : INFO : EPOCH 2 - PROGRESS: at 14.25% examples, 930915 words/s, in_qsize 20, out_qsize 1
2019-06-10 10:29:16,876 : INFO : EPOCH 2 - PROGRESS: at 16.98% examples, 933403 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:29:17,880 : INFO : EPOCH 2 - PROGRESS: at 19.46% examples, 934915 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:29:18,881 : INFO : EPOCH 2 - PROGRESS: at 22.24% examples, 936918 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:29:19,894 : INFO : EPOCH 2 - PROGRESS: at 24.67% examples, 936223 words/s, in

2019-06-10 10:30:15,437 : INFO : EPOCH 3 - PROGRESS: at 92.31% examples, 870318 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:30:16,443 : INFO : EPOCH 3 - PROGRESS: at 95.30% examples, 870675 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:30:17,462 : INFO : EPOCH 3 - PROGRESS: at 98.37% examples, 871064 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:30:17,919 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-06-10 10:30:17,938 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-06-10 10:30:17,939 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-06-10 10:30:17,952 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-06-10 10:30:17,954 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-06-10 10:30:17,960 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-06-10 10:30:17,971 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-06-10 10:30:17,9

2019-06-10 10:31:10,978 : INFO : EPOCH 5 - PROGRESS: at 52.56% examples, 853742 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:31:11,994 : INFO : EPOCH 5 - PROGRESS: at 55.84% examples, 858366 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:31:13,001 : INFO : EPOCH 5 - PROGRESS: at 59.10% examples, 862217 words/s, in_qsize 17, out_qsize 2
2019-06-10 10:31:14,011 : INFO : EPOCH 5 - PROGRESS: at 62.39% examples, 866432 words/s, in_qsize 20, out_qsize 1
2019-06-10 10:31:15,011 : INFO : EPOCH 5 - PROGRESS: at 65.78% examples, 870028 words/s, in_qsize 20, out_qsize 0
2019-06-10 10:31:16,019 : INFO : EPOCH 5 - PROGRESS: at 69.02% examples, 873240 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:31:17,038 : INFO : EPOCH 5 - PROGRESS: at 72.10% examples, 875989 words/s, in_qsize 17, out_qsize 2
2019-06-10 10:31:18,040 : INFO : EPOCH 5 - PROGRESS: at 75.42% examples, 879321 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:31:19,058 : INFO : EPOCH 5 - PROGRESS: at 78.38% examples, 881357 words/s,

2019-06-10 10:32:07,017 : INFO : EPOCH 7 - PROGRESS: at 24.31% examples, 924121 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:32:08,018 : INFO : EPOCH 7 - PROGRESS: at 27.68% examples, 925397 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:32:09,031 : INFO : EPOCH 7 - PROGRESS: at 31.17% examples, 928134 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:32:10,036 : INFO : EPOCH 7 - PROGRESS: at 34.51% examples, 930312 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:32:11,053 : INFO : EPOCH 7 - PROGRESS: at 37.88% examples, 931733 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:32:12,060 : INFO : EPOCH 7 - PROGRESS: at 41.33% examples, 932766 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:32:13,066 : INFO : EPOCH 7 - PROGRESS: at 44.84% examples, 933959 words/s, in_qsize 17, out_qsize 2
2019-06-10 10:32:14,068 : INFO : EPOCH 7 - PROGRESS: at 48.15% examples, 935795 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:32:15,082 : INFO : EPOCH 7 - PROGRESS: at 51.54% examples, 936648 words/s,

2019-06-10 10:33:02,391 : INFO : EPOCH - 8 : training on 41519355 raw words (30349022 effective words) took 32.1s, 945475 effective words/s
2019-06-10 10:33:03,403 : INFO : EPOCH 9 - PROGRESS: at 2.98% examples, 920054 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:33:04,404 : INFO : EPOCH 9 - PROGRESS: at 6.08% examples, 935058 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:33:05,414 : INFO : EPOCH 9 - PROGRESS: at 9.01% examples, 932041 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:33:06,415 : INFO : EPOCH 9 - PROGRESS: at 11.56% examples, 934229 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:33:07,418 : INFO : EPOCH 9 - PROGRESS: at 14.18% examples, 933870 words/s, in_qsize 18, out_qsize 1
2019-06-10 10:33:08,427 : INFO : EPOCH 9 - PROGRESS: at 16.93% examples, 935448 words/s, in_qsize 20, out_qsize 0
2019-06-10 10:33:09,427 : INFO : EPOCH 9 - PROGRESS: at 19.40% examples, 937183 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:33:10,450 : INFO : EPOCH 9 - PROGRESS: at 22.16% ex

2019-06-10 10:34:06,841 : INFO : EPOCH 10 - PROGRESS: at 99.52% examples, 937693 words/s, in_qsize 19, out_qsize 0
2019-06-10 10:34:06,911 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-06-10 10:34:06,921 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-06-10 10:34:06,932 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-06-10 10:34:06,944 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-06-10 10:34:06,946 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-06-10 10:34:06,955 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-06-10 10:34:06,959 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-06-10 10:34:06,961 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-10 10:34:06,965 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-10 10:34:06,976 : INFO : worker thread finished; awaiting 

(303495536, 415193550)

In [5]:
# find the words most similar to "dirty"
w1 = "dirty"
model.wv.most_similar(positive=w1)

2019-06-10 10:38:07,122 : INFO : precomputing L2-norms of word weight vectors


[('filthy', 0.8673031330108643),
 ('stained', 0.7836593389511108),
 ('unclean', 0.7768709659576416),
 ('dusty', 0.7650618553161621),
 ('smelly', 0.7534812092781067),
 ('grubby', 0.7373911142349243),
 ('mouldy', 0.7345187067985535),
 ('gross', 0.7317819595336914),
 ('dingy', 0.7306115627288818),
 ('disgusting', 0.7217835187911987)]

As you can see, the results are all close in meaning to "dirty".

In [6]:
w1 = "polite"
model.wv.most_similar(positive=w1)

[('courteous', 0.9225897192955017),
 ('friendly', 0.8366171717643738),
 ('cordial', 0.8151143789291382),
 ('professional', 0.7802613973617554),
 ('attentive', 0.7662550210952759),
 ('curteous', 0.7622577548027039),
 ('curtious', 0.7530476450920105),
 ('freindly', 0.7513997554779053),
 ('gracious', 0.7293314337730408),
 ('personable', 0.7278640866279602)]

In [7]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar(positive=w1, topn=6)

[('horrified', 0.8050983548164368),
 ('amazed', 0.804564893245697),
 ('astonished', 0.7981277704238892),
 ('dismayed', 0.7543410658836365),
 ('appalled', 0.7456772923469543),
 ('stunned', 0.742326021194458)]

You can even specify several positive examples to get words related to that context, and provide negative examples to say what should not be considered.

In [8]:
w1 = ["bed", "pillow", "sheet"]
w2 = ["couch"]
model.wv.most_similar(positive=w1, negative=w2, topn=10)

[('duvet', 0.7107797861099243),
 ('blanket', 0.7099823951721191),
 ('quilt', 0.7027401328086853),
 ('mattress', 0.6842578649520874),
 ('pillowcase', 0.6793761849403381),
 ('matress', 0.6710195541381836),
 ('pillows', 0.6448855996131897),
 ('sheets', 0.6411908268928528),
 ('pillowcases', 0.6406018137931824),
 ('foam', 0.6265773177146912)]
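One way to picture how positive and negative examples combine: add up the unit-length vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The sketch below does this on made-up two-dimensional vectors. The function and the toy vectors are assumptions for illustration, and Gensim's actual implementation may differ in details.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for k, x in enumerate(unit(vectors[word])):
            query[k] += sign * x
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# made-up 2-d vectors: first axis ~ "bedding", second axis ~ "seating"
toy_vectors = {
    "bed": [1.0, 0.1],
    "pillow": [0.9, 0.2],
    "couch": [0.2, 1.0],
    "duvet": [1.0, 0.0],
    "sofa": [0.1, 1.0],
}
ranked = most_similar(toy_vectors, positive=["bed", "pillow"], negative=["couch"], topn=2)
print(ranked)
```

The negative example pushes the query away from the "seating" direction, so "duvet" ranks well above "sofa".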

### Find out the similarity between two words 

In [9]:
w1 = "dirty"
w2 = "smelly"
model.wv.similarity(w1, w2)

0.75348127

In [10]:
model.wv.similarity(w1, w2="dirty")

0.99999994

In [11]:
model.wv.similarity(w1, w2="clean")

0.2729992
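The similarity score here is the cosine similarity between the two word vectors. Below is a minimal sketch of that formula on made-up three-dimensional vectors (the numbers are assumptions, not taken from the trained model). The 0.99999994 we got for "dirty" against itself is most likely just floating-point rounding of 1.0.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up vectors for illustration only
dirty = [0.9, 0.1, 0.3]
smelly = [0.8, 0.2, 0.4]
clean = [-0.2, 0.9, 0.1]

print(cosine_similarity(dirty, smelly))
print(cosine_similarity(dirty, dirty))
print(cosine_similarity(dirty, clean))
```

Vectors pointing in nearly the same direction score close to 1, and unrelated or opposing directions score near or below 0.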

### Find the odd one out 


In [12]:
# which is the odd one out in the list?
words = ["cat", "dog", "france"]
model.wv.doesnt_match(words)

'france'
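A simple way to think about the odd one out: average the unit-length vectors of all the words, then pick the word whose vector is least similar to that average. The sketch below does this on made-up vectors. `odd_one_out` and the numbers are assumptions for illustration, and this is my understanding of the general approach rather than a copy of Gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = [sum(u[k] for u in units) / len(units) for k in range(dim)]
    scores = {w: sum(m * x for m, x in zip(mean, u)) for w, u in zip(words, units)}
    return min(scores, key=scores.get)


# made-up 2-d vectors: first axis ~ "animal", second axis ~ "place"
toy_vectors = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "france": [0.1, 0.9]}
print(odd_one_out(toy_vectors, ["cat", "dog", "france"]))
```

Because "cat" and "dog" point in a similar direction, they dominate the average, and "france" ends up furthest from it.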