<div style="background-color:#8B0000; border-radius:5px; border:#ffffff solid; padding: 20px; font-size:120%; text-align:center">
    <h1 align="center" style="color:#ffffff;"><b>Word2vec Tutorial</b></h1>
</div>


Required Libraries

In [1]:
# !pip install gensim
# !pip install python-Levenshtein

<div style="background-color:#8B0000; border-radius:5px; border:#ffffff solid; padding: 20px; font-size:120%; text-align:center">
    <h1 align="center" style="color:#ffffff;"><b>Exploring the Dataset </b></h1>
</div>


The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import gensim

In [4]:
# Load the reviews into a DataFrame (lines=True: one JSON object per line)
df = pd.read_json('/kaggle/input/cell-phones-and-accessories-5/Cell_Phones_and_Accessories_5.json',lines=True)
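The `lines=True` argument tells pandas that the file holds one JSON object per line (the JSON Lines layout) rather than a single JSON array. A minimal standard-library sketch of that parsing is shown here; the two records are invented, and only their field names mirror the real columns.

```python
import io
import json

# Two invented records in JSON Lines layout: one JSON object per line.
sample = io.StringIO(
    '{"reviewerID": "A1", "reviewText": "Great case", "overall": 5}\n'
    '{"reviewerID": "A2", "reviewText": "Stopped charging", "overall": 2}\n'
)

# Parse each non-empty line independently, which is what lines=True implies.
records = [json.loads(line) for line in sample if line.strip()]

# Pull out the one field this tutorial cares about.
review_texts = [r["reviewText"] for r in records]
print(len(records), review_texts)
```

Each parsed line becomes one row of the DataFrame, and each key becomes a column.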

In [5]:
df.head(1)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"


Since this is a tutorial, we will apply word2vec only to the reviewText column.

In [6]:
df.shape

(194439, 9)

In [7]:
df['reviewText'][0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

<div style="background-color:#8B0000; border-radius:5px; border:#ffffff solid; padding: 20px; font-size:120%; text-align:center">
    <h1 align="center" style="color:#ffffff;"><b>Preprocessing on reviewText Column</b></h1>
</div>


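The first step in any data science task is cleaning the data. For NLP, this typically means converting all words to lower case, trimming whitespace, and removing punctuation. We do the same here with gensim.utils.simple_preprocess, which lowercases the text, splits it into tokens, and by default drops tokens shorter than 2 or longer than 15 characters.

Additionally, one can remove stop words like 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms, such as 'running' to 'run'. This tutorial skips both steps, so stop words remain in the token lists below.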

In [8]:
rewiew_text = df['reviewText'].apply(gensim.utils.simple_preprocess)

In [9]:
rewiew_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object
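The optional stop-word removal mentioned earlier can be sketched with the standard library alone. The `rough_preprocess` helper below is only a rough approximation of what `simple_preprocess` does, not gensim's actual implementation, and `STOP_WORDS` is a hypothetical, deliberately tiny list. Conversion to root forms is not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens, drop very short or very long ones."""
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def drop_stop_words(tokens):
    """Remove tokens that appear in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = rough_preprocess("They look good and stick good! I just don't like it")
filtered = drop_stop_words(tokens)
print(tokens)
print(filtered)
```

Whether dropping stop words helps depends on the task; word2vec already downweights very frequent words during training, so it is often left out.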

<div style="background-color:#8B0000; border-radius:5px; border:#ffffff solid; padding: 20px; font-size:120%; text-align:center">
    <h2 align="center" style="color:#ffffff;"><b>Initialize Gensim Model With Word2vec</b></h2>
</div>


Next we configure the model for the reviews. We use a window of size 10, i.e. up to 10 words before and 10 words after the current word count as its context. The min_count parameter filters out rare words: a word must occur at least 2 times in the whole corpus to be kept in the vocabulary.

The workers parameter sets how many CPU threads are used for training.

In [10]:
model = gensim.models.Word2Vec(
    window=10,   # maximum distance between the current word and a context word (10 before, 10 after)
    workers=4,   # number of CPU threads used for training
    min_count=2  # ignore words that occur fewer than 2 times in the corpus
)
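To make the `window` parameter concrete, here is a small sketch that lists which (target, context) word pairs fall inside a given window. The `context_pairs` helper is hypothetical and is not how gensim is implemented internally; real training typically samples a smaller effective window at random for each target, so this shows the upper bound of the context.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["item", "arrived", "in", "great", "time"]
pairs = context_pairs(tokens, window=1)
print(pairs)
```

With `window=10`, every word in this five-word sentence would be paired with all four of the others, since the sentence is shorter than the window.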


<div style="background-color:#8B0000; border-radius:5px; border:#ffffff solid; padding: 20px; font-size:120%; text-align:center">
    <h1 align="center" style="color:#ffffff;"><b>Build Vocabulary</b></h1>
</div>


To learn how to build the vocabulary with the gensim library, see the documentation:
https://radimrehurek.com/gensim/models/word2vec.html

In [11]:
model.build_vocab(rewiew_text, progress_per=1000)
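Building the vocabulary is where `min_count` takes effect: words whose total frequency in the corpus is below the threshold are discarded. A rough sketch of that idea with `collections.Counter` follows; `prune_vocab` is a hypothetical helper and the two sentences are invented.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Count every token in the corpus and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Invented token lists standing in for preprocessed reviews.
sentences = [
    ["great", "case", "great", "price"],
    ["case", "fits", "well"],
]
vocab = prune_vocab(sentences, min_count=2)
print(sorted(vocab))
```

Note that the count is taken over the whole corpus, not per sentence, so a word appearing once in each of two reviews is kept.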

<div style="background-color:#8B0000; border-radius:5px; border:#ffffff solid; padding: 20px; font-size:120%; text-align:center">
    <h1 align="center" style="color:#ffffff;"><b>Train Model</b></h1>
</div>


In [14]:
%%time
model.train(rewiew_text, total_examples=model.corpus_count,epochs=10)

CPU times: user 7min 19s, sys: 1.69 s, total: 7min 21s
Wall time: 1min 55s


(123013472, 167737950)
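The tuple printed after training can be read with a little arithmetic. This sketch assumes the pair is (effective word count, raw word count), as gensim's documentation describes: the raw count is every token seen across all epochs, and the effective count excludes words that were dropped, for example by `min_count` or by the downsampling of very frequent words.

```python
# Values copied from the training cell output above.
effective_words, raw_words = 123013472, 167737950
epochs = 10

# Raw tokens seen in a single pass over the corpus.
tokens_per_epoch = raw_words // epochs

# Share of raw tokens that actually contributed to training.
kept_fraction = effective_words / raw_words

# CPU time vs. wall time gives a rough idea of how well the 4 workers were used.
cpu_seconds = 7 * 60 + 21
wall_seconds = 1 * 60 + 55
parallel_speedup = cpu_seconds / wall_seconds

print(tokens_per_epoch, round(kept_fraction, 3), round(parallel_speedup, 2))
```

A speedup close to the number of workers suggests the threads were kept busy for most of the run.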

<div style="background-color:#8B0000; border-radius:5px; border:#ffffff solid; padding: 20px; font-size:120%; text-align:center">
    <h1 align="center" style="color:#ffffff;"><b>Save The Model</b></h1>
</div>


In [15]:
model.save('word2vec.model')

# Finding Similar Words and Similarity between words


In [16]:
model.wv.most_similar('nice')

[('cool', 0.7521339058876038),
 ('good', 0.7082912921905518),
 ('great', 0.6975357532501221),
 ('neat', 0.6902515292167664),
 ('lovely', 0.6657590270042419),
 ('attractive', 0.6540594100952148),
 ('classy', 0.6526955366134644),
 ('snazzy', 0.6308637857437134),
 ('fantastic', 0.6264909505844116),
 ('stylish', 0.6015927195549011)]

The model has now learned which words are used in similar contexts in these reviews, so related words end up close together.

In [20]:
model.wv.most_similar('laptop')

[('computer', 0.8820147514343262),
 ('pc', 0.826290488243103),
 ('netbook', 0.7586874961853027),
 ('desktop', 0.7159069776535034),
 ('notebook', 0.7099999785423279),
 ('tablet', 0.6924258470535278),
 ('imac', 0.6749047040939331),
 ('mac', 0.6696504950523376),
 ('thinkpad', 0.6475188732147217),
 ('roku', 0.6392170786857605)]

In [21]:
model.wv.similarity(w1="great", w2="good")

0.7878794
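The scores returned by `similarity` and `most_similar` are cosine similarities between word vectors. A minimal standard-library sketch of that computation follows; the 3-dimensional vectors are invented for illustration (real word2vec vectors have 100 dimensions by default), and `rank_similar` is a hypothetical helper that mimics the ranking idea, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors, not taken from the trained model.
toy_vectors = {
    "great": [0.9, 0.1, 0.3],
    "good": [0.8, 0.2, 0.4],
    "charger": [0.1, 0.9, 0.2],
}

def rank_similar(word, vectors):
    """Sort all other words by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_similar("great", toy_vectors)
print(ranking)
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 means the words are used in largely unrelated contexts.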

<div style="background-color:#8B0000; border-radius:5px; border:#ffffff solid; padding: 20px; font-size:120%; text-align:center">
    <h1 align="center" style="color:#ffffff;"><b>Further Reading</b></h1>
</div>


You can read more about gensim's word2vec implementation at https://radimrehurek.com/gensim/models/word2vec.html

Explore other datasets related to Amazon reviews: http://jmcauley.ucsd.edu/data/amazon/