* This is a practice notebook for learning how to train a gensim Word2Vec model and use it to find similar words in Amazon product reviews

* The dataset is a subset of Amazon reviews of cell phones and accessories, downloaded from http://jmcauley.ucsd.edu/data/amazon/
It is in json.gz format (gzip-compressed JSON)

* To decompress the data into a JSON file, run:
***!gunzip reviews_Cell_Phones_and_Accessories_5.json.gz***
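
If the `gunzip` command is not available (for example on Windows), the same step can be done from Python with the standard library. This is only a sketch: the helper name `gunzip_file` is made up for illustration, and unlike `gunzip` it leaves the original `.gz` file in place.

```python
import gzip
import shutil

def gunzip_file(src_path, dst_path):
    """Decompress a .gz file to dst_path, streaming so the file is never fully in memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Hypothetical usage with the file name from this notebook:
# gunzip_file("reviews_Cell_Phones_and_Accessories_5.json.gz",
#             "reviews_Cell_Phones_and_Accessories_5.json")
```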

In [1]:
!ls

Amazon_mobile_review_embedding.ipynb
reviews_Cell_Phones_and_Accessories_5.json


In [3]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.1.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.1.2 smart-open-5.2.1


In [None]:
#!pip install python-Levenshtein

In [2]:
import gensim
import pandas as pd

In [3]:
df = pd.read_json('reviews_Cell_Phones_and_Accessories_5.json', lines=True)
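
The `lines=True` argument tells pandas that the file holds one JSON object per line instead of a single JSON array. A minimal sketch of what that layout means, using only the standard library; the two sample records and the helper `read_json_lines` are invented for illustration and are not taken from the dataset:

```python
import io
import json

def read_json_lines(handle):
    """Parse one JSON object per non-empty line and return them as a list of dicts."""
    return [json.loads(line) for line in handle if line.strip()]

# Two made-up records that imitate the layout of the review file.
sample = io.StringIO(
    '{"asin": "X1", "reviewText": "Works great", "overall": 5}\n'
    '{"asin": "X2", "reviewText": "Stopped charging", "overall": 2}\n'
)
records = read_json_lines(sample)
review_texts = [record["reviewText"] for record in records]
```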

In [4]:
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |


In [5]:
df.shape

(194439, 9)

#### Here we will only use the reviewText column for word embedding and model training

In [6]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

In [7]:
gensim.utils.simple_preprocess(df.reviewText[0])

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
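
Note what the output above dropped: punctuation, capital letters, and one-letter tokens such as "I", "a" and the "t" left over from "don't". The sketch below imitates that behaviour with a regular expression. It is an approximation written for this notebook, not gensim's implementation; the name `simple_tokens` is hypothetical, and the length limits of 2 and 15 mirror what I understand to be the `simple_preprocess` defaults.

```python
import re

# Runs of letters: word characters that are not digits (an approximation).
TOKEN_RE = re.compile(r"[^\W\d]+")

def simple_tokens(text, min_len=2, max_len=15):
    """Lowercase the text, split it into letter runs, and drop very short or very long tokens."""
    return [t for t in TOKEN_RE.findall(text.lower()) if min_len <= len(t) <= max_len]

review = ("They look good and stick good! I just don't like the rounded shape "
          "because I was always bumping it and Siri kept popping up and it was "
          "irritating. I just won't buy a product like this again")
tokens = simple_tokens(review)
print(tokens[:8])
```

On this review the sketch produces the same token list as the cell above, but that agreement is not guaranteed for every input.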

### Preprocessing all the documents in the df.reviewText column with the gensim.utils.simple_preprocess method

In [8]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [9]:
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

### Building Model

In [10]:
model = gensim.models.Word2Vec(window=10, min_count=2, workers=2)
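
In this call, `window=10` is the maximum distance between a word and the neighbours used as its context, `min_count=2` ignores words that occur fewer than two times, and `workers=2` sets the number of training threads. To make the window idea concrete, here is a small sketch that lists the context words around each position. The function `context_windows` is invented for illustration, it uses a fixed window, and gensim's own training loop may sample a smaller effective window.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position, using a fixed window size."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = ["they", "look", "good", "and", "stick", "good"]
pairs = list(context_windows(tokens, window=2))
for center, context in pairs:
    print(center, context)
```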

### Building Vocabulary

In [11]:
model.build_vocab(review_text, progress_per=100)
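
Conceptually, building the vocabulary means counting every token in the corpus and keeping only those that meet the `min_count` threshold. A rough sketch under that assumption, with a made-up three-document corpus and a hypothetical helper `count_vocab` (gensim's real vocabulary also stores indices and sampling information that are omitted here):

```python
from collections import Counter

def count_vocab(corpus, min_count):
    """Count tokens over all documents and keep those seen at least min_count times."""
    counts = Counter(token for doc in corpus for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

# Made-up tokenized documents, not taken from the dataset.
corpus = [
    ["great", "phone", "case"],
    ["great", "case"],
    ["flimsy", "charger"],
]
vocab = count_vocab(corpus, min_count=2)
print(vocab)
```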

### Training Model

In [12]:
model.epochs

5

In [13]:
model.corpus_count

194439

In [14]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61505050, 83868975)
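
As far as I understand gensim's behaviour, the tuple returned by `train` is (effective word count, raw word count), summed over all epochs. The raw count includes every token, while the effective count excludes tokens dropped because they are below `min_count` or were removed by frequent-word downsampling. The arithmetic below only reuses the numbers printed above:

```python
# Values copied from the model.train output above.
effective_words, raw_words = 61505050, 83868975
epochs = 5

# Tokens seen in a single pass over the corpus.
raw_words_per_epoch = raw_words // epochs
# Share of tokens that actually contributed to training.
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch, round(kept_fraction, 3))
```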

### Finding similar words and similarity scores

In [18]:
model.wv.most_similar('bad')

[('terrible', 0.7007719874382019),
 ('shabby', 0.631822407245636),
 ('horrible', 0.5899611711502075),
 ('good', 0.5831224322319031),
 ('awful', 0.5656654238700867),
 ('okay', 0.532368004322052),
 ('sad', 0.5312588810920715),
 ('cheap', 0.5217819213867188),
 ('crappy', 0.5210874080657959),
 ('poor', 0.5069063305854797)]
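
Note that 'good' appears in this list even though it is the opposite of 'bad': Word2Vec places words close together when they occur in similar contexts, and antonyms often do. The ranking itself is a cosine similarity between the query vector and every other word vector, sorted from highest to lowest. The sketch below shows that idea on made-up 2-dimensional vectors; the function names and the numbers are illustrative only and are not values from the trained model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors; a real model here would use 100 dimensions by default.
toy_vectors = {
    "bad": [1.0, 0.1],
    "terrible": [0.9, 0.2],
    "okay": [0.5, 0.5],
    "phone": [-0.2, 1.0],
}
ranking = most_similar("bad", toy_vectors, topn=2)
print(ranking)
```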

In [19]:
model.wv.similarity(w1="good", w2="nice")

0.6966757
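
The number returned by `similarity` is the cosine similarity of the two word vectors. For vectors $u$ and $v$ it is defined as follows and always lies between -1 and 1, where values near 1 mean the vectors point in nearly the same direction and values near 0 or below mean the words are used in unrelated contexts:

```latex
\operatorname{sim}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad -1 \le \operatorname{sim}(u, v) \le 1
```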

In [20]:
model.wv.similarity(w1="horrible", w2="phone")

-0.116661906

In [21]:
model.wv.similarity(w1="okay", w2="poor")

0.26974562

In [22]:
model.wv.similarity(w1="great", w2="excellent")

0.71155477

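### So the model gives high similarity scores to near-synonyms (about 0.7 for good/nice and great/excellent) and low scores to unrelated words (horrible/phone)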