## Gensim

To make it easier to create the word embeddings, we are going to use an external library: `gensim`.

Gensim is an **open-source** library for **unsupervised** topic modeling, document indexing, retrieval by similarity, and other natural language processing functionalities.

It's designed to **handle** large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.

After computing the word embeddings, we can load them into an `nn.Embedding` layer in order to solve an NLP task in PyTorch.


For more information about gensim: https://radimrehurek.com/gensim/auto_examples/index.html

In [4]:
# For this notebook to work we need:
!pip install gensim -q
!pip install python-Levenshtein -q

In [5]:
import gensim
import pandas as pd

from os import cpu_count

## Loading the Dataset

In [6]:
# We have to unzip the dataset: 'reviews_Cell_Phones_and_Accessories_5.json.gz'
import gzip
from pathlib import Path
import shutil

# Setting the path of the zip file
zip_path = Path("/content/reviews_Cell_Phones_and_Accessories_5.json.gz")
dest_path = Path("/content/reviews_Cell_Phones_and_Accessories_5.json")

if not dest_path.is_file():
    with gzip.open(zip_path, "rb") as zip_ref:
        print(f"[INFO] Unzipping dataset `{zip_path}` to `{dest_path}`...")
        with open(dest_path, "wb") as un_zip_ref:
            shutil.copyfileobj(zip_ref, un_zip_ref)

    print(f"[INFO] Dataset successfully unzipped to `{dest_path}`...")
else:
    print(f"[INFO] Dataset `{dest_path}` already exists...")

[INFO] Unzipping dataset `/content/reviews_Cell_Phones_and_Accessories_5.json.gz` to `/content/reviews_Cell_Phones_and_Accessories_5.json`...
[INFO] Dataset successfully unzipped to `/content/reviews_Cell_Phones_and_Accessories_5.json`...


In [7]:
# Discovering the Dataset
df = pd.read_json(dest_path, lines=True)

df.head(3)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |


## Problem: We are going to create a Word2Vec embedding using only the `reviewText` field of the dataset.

In [8]:
# The shape of the dataframe
print(df.shape)

(194439, 9)


In [9]:
# The field we are interested in is the following:
print(df.reviewText[0])
print(df.reviewText[9])

They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again
This is a fantastic case. Very stylish and protects my phone. Easy access to all buttons and features, without any loss of phone reception. But most importantly, it double power, just as promised. Great buy


### Preprocessing

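The first step when working with large amounts of text data is preprocessing. The steps we follow are:
1. Removing punctuation and very short tokens.
2. Converting the corpus to lower case.
3. Tokenizing the corpus.

All of these steps can be done with the function `gensim.utils.simple_preprocess()`. Note that it does not remove stop words: in the example output below, words such as `they`, `and` and `the` are kept, while single-character tokens such as `I` and `a` are dropped.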

In [10]:
# Let's see an example:
processed_seq = gensim.utils.simple_preprocess("They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again.")

print(processed_seq)

['they', 'look', 'good', 'and', 'stick', 'good', 'just', 'don', 'like', 'the', 'rounded', 'shape', 'because', 'was', 'always', 'bumping', 'it', 'and', 'siri', 'kept', 'popping', 'up', 'and', 'it', 'was', 'irritating', 'just', 'won', 'buy', 'product', 'like', 'this', 'again']
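
To make the three steps concrete, here is a rough, dependency-free approximation of what `simple_preprocess` appears to do with its default settings. The function name `approx_simple_preprocess` and the regular expression are our own assumptions for illustration; this is not gensim's actual implementation.

```python
import re

# Rough stand-in for gensim.utils.simple_preprocess (NOT gensim's own code):
# lowercase the text, keep runs of letters, drop tokens outside the length bounds.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Return lowercase alphabetic tokens whose length is within [min_len, max_len]."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


print(approx_simple_preprocess("I just don't like the rounded shape!"))
# ['just', 'don', 'like', 'the', 'rounded', 'shape']
```

The apostrophe splits `don't` into `don` and `t`, and the one-letter tokens `i` and `t` are then discarded by the length filter, which matches the behaviour seen in the output above.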


### Applying the preprocess function to the DataFrame

In [11]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

print(review_text)

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object
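
The Series above keeps every tokenized review in memory, which is fine for this dataset. Since gensim is built around streamed corpora, a larger dataset could instead be fed through a restartable iterable that reads one review at a time. The sketch below is one possible way to do that; the class name `JsonLinesCorpus` and its parameters are hypothetical and not part of gensim.

```python
import gzip
import json


class JsonLinesCorpus:
    """Restartable iterable that yields one token list per JSON line.

    Hypothetical helper: `path`, `text_field` and `tokenize` are illustrative
    names chosen for this sketch, not gensim APIs.
    """

    def __init__(self, path, text_field="reviewText", tokenize=str.split):
        self.path = path
        self.text_field = text_field
        self.tokenize = tokenize

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # several times (once for the vocabulary, once per training epoch).
        with gzip.open(self.path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip empty lines
                record = json.loads(line)
                yield self.tokenize(record.get(self.text_field, ""))
```

Under these assumptions, `JsonLinesCorpus(zip_path, tokenize=gensim.utils.simple_preprocess)` could be passed wherever `review_text` is used in the following cells, without unzipping the file or building a DataFrame first.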


### Creating a Word2Vec model from `gensim`

For more information: https://radimrehurek.com/gensim/models/word2vec.html

In [12]:
model = gensim.models.Word2Vec(
    window=10,           # The window size.
    min_count=2,         # Ignore words that appear fewer than 2 times in the corpus.
    workers=cpu_count(), # The number of worker threads used to train the model.
)
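
To get an intuition for the `window` parameter, the following simplified sketch lists the (center, context) word pairs that a window of a given size produces for one sentence. The helper `context_pairs` is our own illustrative function; gensim's real training loop does more than this (for example, it may shrink the window randomly and subsample frequent words).

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]


pairs = list(context_pairs(["great", "phone", "case"], window=1))
print(pairs)
# [('great', 'phone'), ('phone', 'great'), ('phone', 'case'), ('case', 'phone')]
```

With `window=10`, as used above, each word is paired with up to ten neighbours on each side, so most words of a short review end up in each other's context.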

### Creating the vocabulary of the model

In [13]:
model.build_vocab(review_text, progress_per=1000)
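
Conceptually, building the vocabulary means counting every word in the corpus and keeping only the words that reach `min_count`. The toy function below sketches that idea; `toy_build_vocab` and `toy_corpus` are made-up names for illustration, and gensim's own `build_vocab` also prepares internal structures that are not shown here.

```python
from collections import Counter


def toy_build_vocab(sentences, min_count):
    """Keep only the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_corpus = [["good", "phone"], ["good", "case"], ["bad", "cable"]]
print(toy_build_vocab(toy_corpus, min_count=2))
# {'good': 2}
```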

In [14]:
# The number of examples (here, reviews) seen while building the vocabulary:
print(model.corpus_count)

194439


In [15]:
# The number of training epochs (the default value, since we did not set it):
print(model.epochs)

5


### Training the model

In [16]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61507235, 83868975)
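
As far as we understand the gensim documentation, the tuple returned by `train` holds the number of effective words (those actually used for training, after out-of-vocabulary words and randomly downsampled frequent words are dropped) and the number of raw words processed over all epochs. Under that assumption, a little arithmetic on the values printed above gives a feel for the corpus size:

```python
effective_words, raw_words = 61507235, 83868975  # tuple printed by model.train above
epochs = 5                                       # value of model.epochs above

raw_words_per_epoch = raw_words // epochs
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch)      # 16773795
print(round(kept_fraction, 3))  # 0.733
```

So the corpus contains roughly 16.8 million tokens, and about 73% of them contributed to training in each pass.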

### Saving the model

In [17]:
model.save("/content/word2vec_amazon_cell_accessories_reviews.model")

### Evaluating the model

In [18]:
# Getting the words most similar to `bad` (`wv` stands for word vectors)
model.wv.most_similar("bad")

[('terrible', 0.696999728679657),
 ('shabby', 0.6316559314727783),
 ('horrible', 0.6010767817497253),
 ('good', 0.5787262320518494),
 ('awful', 0.5774445533752441),
 ('okay', 0.5509426593780518),
 ('ok', 0.5356559753417969),
 ('poor', 0.5151407718658447),
 ('sad', 0.5067817568778992),
 ('disappointing', 0.5061890482902527)]

In [19]:
# We can see the cosine similarity of two words
model.wv.similarity(w1="good", w2="great")

0.78583175

In [20]:
model.wv.similarity(w1="good", w2="product")

-0.028856898

## How can the model tell how similar two words are?

The answer is a mathematical notion called cosine similarity.

Cosine similarity is calculated using the formula:
* $similarity = \cfrac{emb1 \cdot emb2}{\|emb1\| \, \|emb2\|}$

where `emb1` denotes the embedding vector of the first word, `emb2` the embedding vector of the second word, and $\|\cdot\|$ is the Euclidean norm.

Based on this notion, we can define the distance between two words as:
* $distance = 1 - similarity$

In [26]:
import numpy as np

emb1 = model.wv["good"]
emb2 = model.wv["great"]

In [27]:
similarity = (emb1 @ emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

similarity

0.78583175

This result is the same as the one returned by the method `wv.similarity()`.
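
The same formula also gives the distance defined above, and ranking all words by similarity is presumably what a call like `most_similar` does in vectorized form. Here is a minimal, dependency-free sketch on made-up 2-dimensional vectors; `toy_vectors` and the helper functions are illustrative assumptions, not trained embeddings or gensim code.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance as defined above: 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)


# Toy 2-dimensional "embeddings" (made up for illustration, not trained vectors).
toy_vectors = {
    "good": [1.0, 0.2],
    "great": [0.9, 0.3],
    "cable": [-0.1, 1.0],
}


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print([w for w, _ in most_similar("good", toy_vectors)])
# ['great', 'cable']
```

Vectors pointing in nearly the same direction get a similarity close to 1 (distance close to 0), while unrelated directions give a similarity near 0, as we saw for `good` and `product`.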