# Training Word2Vec Model

In [4]:
#!pip install gensim

In [5]:
#!pip install python-Levenshtein

### Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [33]:
#gitbash on the file run "gunzip file address"

In [6]:
import gensim
import pandas as pd### Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [9]:
df = pd.read_json("C:/Users/Owner/reviews_Cell_Phones_and_Accessories_5.json", lines= True)# lines parameter: one line one json object
df.head()reviews_Sports_and_Outdoors_5.json

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


In [10]:
df.shape # huge data, enough to train our model

(194439, 9)

In [13]:
df.reviewText[0] # we are interested in with this

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

### Simple Preprocessing & Tokenization
The first thing to do for any data science task is to clean the data.
For NLP, we apply various processing like converting all the words to lower case, trimming spaces, removing punctuations. 
This is something we will do over here too.

Additionally, we can also remove stop words like 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms like 'running' to 'run'.

In [None]:
# gensim is a nlp library and has preprocessing function: 
                                            # not perfect, using some simple heuristic rules
                                            # tokenizing, convert to lower case, punct removed.
                                            # but very good in building Word2Vewc model, that's why Dhaval prefer it. 

In [16]:
gensim.utils.simple_preprocess("They look good and stick good! ")

['they', 'look', 'good', 'and', 'stick', 'good']

In [18]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text # generates pandas series. 

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

### Training the Word2Vec Model

Train the model for reviews. Use a window of size 10 i.e. 10 words before the present word and 10 words ahead. A sentence with at least 2 words should only be considered, configure this using min_count parameter.

Workers define how many CPU threads to be used.

### initialize gensim model 

In [20]:
model = gensim.models.Word2Vec( # identify parameters
window=10, # means 10 words before target, 10 words after target
min_count =2, #if you have a sentence which has only one word, dont use taht sentence, at least 2 words need to present
 workers=4, #how many cpu threads to use to train model, if cpu has 4 cores, write 4.    
)

### building vocabulary

In [21]:
model.build_vocab(review_text, progress_per=1000) # progress_per: indicates how many words to process before showing/updating the progress
                # after how many words you wanna see. 

In [23]:
model.epochs #default 5, how many times to iterate through the entire dataset. 

5

### Actual training

In [24]:
model.corpus_count

194439

In [25]:
model.train(review_text, total_examples=model.corpus_count,
           epochs =model.epochs) 

(61503769, 83868975)

In [27]:
#trained model usually saved to be used later
model.save("C:/Users/Owner/Word2Vec_Model")

### Experiment model

### Finding Similar Words and Similarity between words
https://radimrehurek.com/gensim/models/word2vec.html

In [28]:
model.wv.most_similar("bad")  # Finding Similar Words and Similarity between words

[('terrible', 0.649594783782959),
 ('shabby', 0.6098870038986206),
 ('horrible', 0.5888733863830566),
 ('good', 0.5765827894210815),
 ('crappy', 0.5566805005073547),
 ('awful', 0.543475866317749),
 ('legit', 0.5284345149993896),
 ('disappointing', 0.5170247554779053),
 ('okay', 0.5054886341094971),
 ('weird', 0.5051954388618469)]

In [29]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.5427461

In [30]:
model.wv.similarity(w1="great", w2="good") #high similarity

0.7885885

In [32]:
model.wv.similarity(w1="great", w2="iphone") #less similarity

0.10154014

### Further Reading

You can read about gensim more at https://radimrehurek.com/gensim/models/word2vec.html

Explore other Datasets related to Amazon Reviews: http://jmcauley.ucsd.edu/data/amazon/

## Exercise

Train a word2vec model on the [Sports & Outdoors Reviews Dataset](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz)
Once you train a model on this, find the words most similar to 'awful' and find similarities between the following word tuples: ('good', 'great'), ('slow','steady')

Click here for [solution](https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/42_word2vec_gensim/42_word2vec_gensim_exercise_solution.ipynb).

In [34]:
df = pd.read_json("C:/Users/Owner/reviews_Sports_and_Outdoors_5.json", lines= True)# lines parameter: one line one json object
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,AIXZKN4ACSKI,1881509818,David Briner,"[0, 0]",This came in on time and I am veru happy with ...,5,Woks very good,1390694400,"01 26, 2014"
1,A1L5P841VIO02V,1881509818,Jason A. Kramer,"[1, 1]",I had a factory Glock tool that I was using fo...,5,Works as well as the factory tool,1328140800,"02 2, 2012"
2,AB2W04NI4OEAD,1881509818,J. Fernald,"[2, 2]",If you don't have a 3/32 punch or would like t...,4,"It's a punch, that's all.",1330387200,"02 28, 2012"
3,A148SVSWKTJKU6,1881509818,"Jusitn A. Watts ""Maverick9614""","[0, 0]",This works no better than any 3/32 punch you w...,4,It's a punch with a Glock logo.,1328400000,"02 5, 2012"
4,AAAWJ6LW9WMOO,1881509818,Material Man,"[0, 0]",I purchased this thinking maybe I need a speci...,4,"Ok,tool does what a regular punch does.",1366675200,"04 23, 2013"


In [35]:
df.shape

(296337, 9)

In [37]:
df.reviewText[0]

'This came in on time and I am veru happy with it, I haved used it already and it makes taking out the pins in my glock 32 very easy'

In [39]:
df["review_text"] = df.reviewText.apply(gensim.utils.simple_preprocess)

In [40]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,review_text
0,AIXZKN4ACSKI,1881509818,David Briner,"[0, 0]",This came in on time and I am veru happy with ...,5,Woks very good,1390694400,"01 26, 2014","[this, came, in, on, time, and, am, veru, happ..."
1,A1L5P841VIO02V,1881509818,Jason A. Kramer,"[1, 1]",I had a factory Glock tool that I was using fo...,5,Works as well as the factory tool,1328140800,"02 2, 2012","[had, factory, glock, tool, that, was, using, ..."
2,AB2W04NI4OEAD,1881509818,J. Fernald,"[2, 2]",If you don't have a 3/32 punch or would like t...,4,"It's a punch, that's all.",1330387200,"02 28, 2012","[if, you, don, have, punch, or, would, like, t..."
3,A148SVSWKTJKU6,1881509818,"Jusitn A. Watts ""Maverick9614""","[0, 0]",This works no better than any 3/32 punch you w...,4,It's a punch with a Glock logo.,1328400000,"02 5, 2012","[this, works, no, better, than, any, punch, yo..."
4,AAAWJ6LW9WMOO,1881509818,Material Man,"[0, 0]",I purchased this thinking maybe I need a speci...,4,"Ok,tool does what a regular punch does.",1366675200,"04 23, 2013","[purchased, this, thinking, maybe, need, speci..."


In [41]:
#initialize model
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4
)

In [42]:
#building vocabulary
model.build_vocab(df["review_text"], progress_per=1000)

In [43]:
#training the model
model.train(review_text, 
            total_examples=model.corpus_count,
           epochs = model.epochs)

(62255381, 83868975)

In [46]:
#experiment the model
model.wv.most_similar("awful")

[('horrible', 0.7680966854095459),
 ('terrible', 0.7518805265426636),
 ('crappy', 0.5919217467308044),
 ('atrocious', 0.5878327488899231),
 ('stupid', 0.564454972743988),
 ('okay', 0.5599114894866943),
 ('weird', 0.5443432331085205),
 ('amazing', 0.5352024435997009),
 ('ugly', 0.5350634455680847),
 ('alright', 0.5318759679794312)]

In [45]:
model.wv.similarity(w1="good", w2= "great")

0.77068

In [47]:
model.wv.similarity(w1="slow", w2="steady")

0.16485931