Train a word2vec model with the Python gensim library using Amazon product reviews

In [1]:
!pip install gensim
!pip install python-Levenshtein


Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=149862 sha256=fe06cd566faa6647b1a4d25392ee0b5591a24538e4be3cdc0f536cb5a2308ef4
  Stored in directory: /root/.cache/pip/wheels/05/5f/ca/7c4367734892581bb5ff896f15027a932c551080b2abd3e00d
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2


In [2]:
import gensim
import pandas as pd

# Reading and Exploring the Dataset

The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a line-delimited JSON file (one review per line), which pandas can read with `lines=True`.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
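To make the "one JSON object per line" layout concrete, here is a minimal standard-library sketch of reading such a file by hand. The helper name `read_json_lines`, the temporary file, and the two sample records are all made up for illustration; the notebook itself relies on `pd.read_json(..., lines=True)`.

```python
import gzip
import json
import os
import tempfile


def read_json_lines(path):
    """Return a list of dicts, one per non-empty line of a JSON-lines file.

    Files ending in .gz are decompressed on the fly.
    """
    opener = gzip.open if path.endswith(".gz") else open
    records = []
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Tiny made-up sample with a few of the review columns (hypothetical values).
sample = [
    {"reviewerID": "R1", "asin": "B000TEST01", "reviewText": "Works great", "overall": 5},
    {"reviewerID": "R2", "asin": "B000TEST02", "reviewText": "Stopped charging", "overall": 2},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for record in sample:
        handle.write(json.dumps(record) + "\n")

reviews = read_json_lines(sample_path)
print(len(reviews), reviews[0]["reviewText"])
```

Each line is parsed independently, which is why a single `json.load` on the whole file would fail and why pandas needs `lines=True`.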

In [3]:
# downloading the data
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
!gunzip reviews_Cell_Phones_and_Accessories_5.json.gz

--2022-01-20 18:06:35--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45409631 (43M) [application/x-gzip]
Saving to: ‘reviews_Cell_Phones_and_Accessories_5.json.gz’


2022-01-20 18:06:40 (9.35 MB/s) - ‘reviews_Cell_Phones_and_Accessories_5.json.gz’ saved [45409631/45409631]



In [4]:
df = pd.read_json("reviews_Cell_Phones_and_Accessories_5.json", lines=True)
df.head()

,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


In [5]:
df.shape

(194439, 9)

In [6]:
# column of interest
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

# Preprocessing

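Before training, it is common to apply preprocessing such as converting all words to lowercase, trimming whitespace, and removing punctuation. We do the same here with `gensim.utils.simple_preprocess`, which by default also drops tokens shorter than 2 or longer than 15 characters.

Additionally, we can remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms, for example 'running' to 'run'. We skip both of these steps in this notebook.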
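The two optional steps can be sketched in plain Python as follows. This is only an illustration: the stop-word set is the short list from the text, and `naive_stem` is a toy suffix rule invented here, not a real stemmer. NLP libraries ship much longer stop-word lists and proper stemming algorithms.

```python
# Illustrative stop-word list taken from the text above (real lists are much longer).
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


def naive_stem(token):
    """Strip a trailing 'ing' and undo consonant doubling (toy rule, not a real stemmer)."""
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        # 'runn' -> 'run': drop one letter of a doubled final consonant.
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    return token


tokens = ["the", "phone", "is", "running", "and", "charging"]
filtered = remove_stop_words(tokens)
stemmed = [naive_stem(token) for token in filtered]
print(filtered)  # ['phone', 'running', 'charging']
print(stemmed)   # ['phone', 'run', 'charg']
```

Note that stems such as 'charg' need not be dictionary words; they only have to be consistent across inflected forms.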

In [7]:
gensim.utils.simple_preprocess("They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again")

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
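The output above shows that "I" and "a" are gone and that "don't" became just 'don'. A rough pure-Python approximation of what `simple_preprocess` does with its default settings explains why. This sketch assumes the defaults `min_len=2` and `max_len=15`; the function name `approx_simple_preprocess` and the regular expression are written for this illustration and are not gensim's own code.

```python
import re

# Runs of letters: word characters that are not digits.
TOKEN_PATTERN = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or very long ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [
        token
        for token in tokens
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]


sentence = "I just don't like the rounded shape"
result = approx_simple_preprocess(sentence)
print(result)  # ['just', 'don', 'like', 'the', 'rounded', 'shape']
```

The apostrophe splits "don't" into 'don' and 't', and the one-letter tokens 'i' and 't' fall below the minimum length.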

In [8]:
# apply the same preprocessing to the entire reviewText column
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [9]:
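# initialize the gensim model
model = gensim.models.Word2Vec(
    window=10,    # maximum distance between the target word and its context words
    min_count=2,  # ignore words that appear fewer than 2 times in the corpus
    workers=4,    # number of worker threads used for training
)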
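To make the `window` parameter concrete, the sketch below lists the (target, context) pairs that a fixed window produces for one tokenized review. It is a simplification under stated assumptions: the helper `context_pairs` is hypothetical, and gensim's actual training loop also varies the effective window size per word, which is not modelled here.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["item", "arrived", "in", "great", "time"]
pairs = context_pairs(tokens, window=1)
print(pairs)
```

With `window=1` each word is paired only with its immediate neighbours; with `window=10`, as in this notebook, most words in a short review end up in each other's context.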

In [10]:
# build a vocab
model.build_vocab(review_text, progress_per=1000)
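Conceptually, building the vocabulary means counting every token in the corpus and keeping those that occur at least `min_count` times. The following is a minimal sketch of that idea with `collections.Counter`; the function `build_vocab_sketch` and the three toy sentences are invented for illustration and do not reproduce gensim's internal data structures.

```python
from collections import Counter


def build_vocab_sketch(sentences, min_count):
    """Count tokens over all sentences and keep those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: count for token, count in counts.items() if count >= min_count}


sentences = [
    ["great", "phone", "case"],
    ["great", "cable"],
    ["phone", "charger", "works", "great"],
]
vocab = build_vocab_sketch(sentences, min_count=2)
print(vocab)  # {'great': 3, 'phone': 2}
```

Words that appear only once, such as 'case' or 'cable' in this toy corpus, get no vector and are skipped during training.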

In [11]:
# number of training epochs (gensim's default)
model.epochs

5

In [12]:
model.corpus_count

194439

In [13]:
# actual training
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61503167, 83868975)
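The tuple returned by `train` can be read with a little arithmetic. This sketch assumes gensim's documented return value of (effective word count, total raw word count); the variable names are ours.

```python
# Values copied from the training output above.
effective_words, raw_words = 61503167, 83868975
epochs = 5

# Raw words are counted once per epoch, so this is the corpus size in tokens.
words_per_epoch = raw_words // epochs

# Share of raw words that actually contributed to training.
kept_fraction = effective_words / raw_words

print(words_per_epoch)
print(round(kept_fraction, 3))
```

The effective count is lower than the raw count because words below `min_count` are ignored and very frequent words are randomly downsampled during training.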

In [14]:
# save the model
model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

# Finding Similar Words and Similarity between words

See the gensim Word2Vec documentation: https://radimrehurek.com/gensim/models/word2vec.html

In [15]:
model.wv.most_similar("bad")

[('shabby', 0.6804478168487549),
 ('terrible', 0.6548770070075989),
 ('horrible', 0.5846167206764221),
 ('good', 0.5814749002456665),
 ('poor', 0.5312163233757019),
 ('keen', 0.519131064414978),
 ('awful', 0.5112630724906921),
 ('crappy', 0.5012902021408081),
 ('disappointing', 0.4987128973007202),
 ('okay', 0.49233919382095337)]
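The ranking idea behind `most_similar` can be sketched as follows: score every other word by the cosine similarity of its vector to the query vector and return the top results. The 2-D vectors below are invented for illustration, and `most_similar_sketch` is a hypothetical helper, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_sketch(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-D vectors; real word2vec embeddings have many more dimensions.
toy_vectors = {
    "bad": [1.0, 0.1],
    "terrible": [0.9, 0.2],
    "good": [0.6, 0.8],
    "cable": [-0.2, 1.0],
}
ranked = most_similar_sketch("bad", toy_vectors, topn=2)
print([word for word, _ in ranked])  # ['terrible', 'good']
```

Because word2vec places words that occur in similar contexts close together, an antonym such as 'good' can still rank highly for 'bad', as it does in the real output above.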

In [17]:
# similarity score
model.wv.similarity(w1="cheap", w2="inexpensive")

0.50886714

In [18]:
model.wv.similarity(w1="great", w2="good")

0.7785346

In [19]:
model.wv.similarity(w1="great", w2="product")

-0.051206265

In [20]:
model.wv.similarity(w1="great", w2="awesome")

0.72914594

In [21]:
model.wv.similarity(w1="great", w2="great")

1.0

In [27]:
model.wv.similarity(w1="great", w2="famous")

-0.036790576
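The scores above are cosine similarities between the two word vectors. For reference, with $u$ and $v$ denoting the vectors of the two words (our notation, not the notebook's):

```latex
\operatorname{similarity}(u, v)
  = \cos\theta
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
```

The value always lies between -1 and 1. A vector compared with itself gives exactly 1, which matches the "great" / "great" result, while values close to 0, such as "great" / "product", mean the two vectors are roughly orthogonal.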

# Further Reading

You can read more about gensim's Word2Vec at https://radimrehurek.com/gensim/models/word2vec.html

Explore other datasets related to Amazon reviews: http://jmcauley.ucsd.edu/data/amazon/

# Exercise


Train a word2vec model on the [Sports & Outdoors](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz) reviews dataset. Once the model is trained, find the words most similar to 'awful' and compute the similarity for each of the following word pairs: ('good', 'great'), ('slow', 'steady').

# [Solution](https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/42_word2vec_gensim/42_word2vec_gensim_exercise_solution.ipynb)