# Getting started with Word2vec in Gensim 

The idea behind word2vec is that we make this assumption that the meaning of the word depends on the words arround it 

The training algo in the `gensim package` were actually ported from the original <b>Word2vec</b> implementation by <b>Google</b> 

In [2]:
# imports needed and logging 
import gzip
import gensim
import logging 

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


## Dataset 

we will be using the OpinRank dataset. it has the data about reviews of cars and hotels from users 

In [4]:
data_file = "reviews_data.txt.gz"

with gzip.open ('reviews_data.txt.gz', 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

## Reading files into a list 

we can read the file into a list so that we can pass this into the word2vec model.
we would also be doing some minor preprocessing using the `gensim.utils.simple_preprocess(line)` which basically does is lowercasing, tokenization etc. 

In [5]:
def read_input(input_file):
    logging.info("reading file {0}... this may take a while".format(input_file))
    
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            if (i%10000==0):
                logging.info("read {0} reviews".format(i))
                
                yield gensim.utils.simple_preprocess(line)

                
# read the tokenized reviews into a list 
# each review item becomes a series of words 
# so this becomes a list of lists 

documents = list(read_input(data_file))
logging.info("Done reading the data file")

2019-06-10 01:41:18,317 : INFO : reading file reviews_data.txt.gz... this may take a while
2019-06-10 01:41:18,321 : INFO : read 0 reviews
2019-06-10 01:41:18,432 : INFO : read 10000 reviews
2019-06-10 01:41:18,533 : INFO : read 20000 reviews
2019-06-10 01:41:19,123 : INFO : read 30000 reviews
2019-06-10 01:41:19,313 : INFO : read 40000 reviews
2019-06-10 01:41:19,432 : INFO : read 50000 reviews
2019-06-10 01:41:19,547 : INFO : read 60000 reviews
2019-06-10 01:41:19,650 : INFO : read 70000 reviews
2019-06-10 01:41:19,741 : INFO : read 80000 reviews
2019-06-10 01:41:19,836 : INFO : read 90000 reviews
2019-06-10 01:41:19,930 : INFO : read 100000 reviews
2019-06-10 01:41:20,040 : INFO : read 110000 reviews
2019-06-10 01:41:20,154 : INFO : read 120000 reviews
2019-06-10 01:41:20,268 : INFO : read 130000 reviews
2019-06-10 01:41:20,377 : INFO : read 140000 reviews
2019-06-10 01:41:20,474 : INFO : read 150000 reviews
2019-06-10 01:41:20,572 : INFO : read 160000 reviews
2019-06-10 01:41:20,66

## Training the Word2Vec model 

Training the model is fairly straightforward. You pass the reviews that we read in the previous step. So, we are essentially passing on a lost of list s where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary and by vocabulary we mean set of unique words.
Afer building vocabulary we just have to call in train function 

In [7]:
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=10)
model.train(documents,total_examples=len(documents),epochs=10)

2019-06-10 02:12:30,556 : INFO : collecting all words and their counts
2019-06-10 02:12:30,556 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-06-10 02:12:30,558 : INFO : collected 1310 word types from a corpus of 5022 raw words and 26 sentences
2019-06-10 02:12:30,559 : INFO : Loading a fresh vocabulary
2019-06-10 02:12:30,561 : INFO : effective_min_count=2 retains 556 unique words (42% of original 1310, drops 754)
2019-06-10 02:12:30,563 : INFO : effective_min_count=2 leaves 4268 word corpus (84% of original 5022, drops 754)
2019-06-10 02:12:30,567 : INFO : deleting the raw counts dictionary of 1310 items
2019-06-10 02:12:30,568 : INFO : sample=0.001 downsamples 62 most-common words
2019-06-10 02:12:30,569 : INFO : downsampling leaves estimated 2791 word corpus (65.4% of prior 4268)
2019-06-10 02:12:30,571 : INFO : estimated required memory for 556 words and 150 dimensions: 945200 bytes
2019-06-10 02:12:30,572 : INFO : resetting layer weights
2019-06-1

2019-06-10 02:12:30,715 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-06-10 02:12:30,717 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-06-10 02:12:30,718 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-06-10 02:12:30,720 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-10 02:12:30,722 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-10 02:12:30,726 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-10 02:12:30,727 : INFO : EPOCH - 2 : training on 5022 raw words (2772 effective words) took 0.0s, 162505 effective words/s
2019-06-10 02:12:30,732 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-06-10 02:12:30,733 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-06-10 02:12:30,734 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-06-10 02:12:30,735 : INFO : worker thread fin

2019-06-10 02:12:30,881 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-06-10 02:12:30,882 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-06-10 02:12:30,883 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-06-10 02:12:30,884 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-06-10 02:12:30,885 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-10 02:12:30,885 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-10 02:12:30,887 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-10 02:12:30,888 : INFO : EPOCH - 10 : training on 5022 raw words (2771 effective words) took 0.0s, 267719 effective words/s
2019-06-10 02:12:30,889 : INFO : training on a 50220 raw words (27861 effective words) took 0.2s, 137648 effective words/s


(27861, 50220)