<a href="https://colab.research.google.com/github/GuCuChiara/Word2Vec-Model/blob/main/word2vec_gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec Model

In this Jupyter notebook we use the Gensim library to experiment with word2vec.

**This notebook focuses on the intuition behind the concepts rather than on implementation details.**

**Word2Vec** is a widely used algorithm based on neural networks, a family of methods commonly referred to as “deep learning” (though word2vec itself is rather shallow).

Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. 

The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like:

* vec(“king”) - vec(“man”) + vec(“woman”) =~ vec(“queen”)

* vec(“Montreal Canadiens”) - vec(“Montreal”) + vec(“Toronto”) =~ vec(“Toronto Maple Leafs”)
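To make the vector arithmetic above concrete, here is a minimal sketch using hand-made 2-D vectors. The vectors, the extra words, and the helper names `cosine` and `analogy` are assumptions for illustration only; a trained model learns vectors with far more dimensions from text.

```python
import math

# Hand-made toy vectors (hypothetical, not learned from data):
# axis 0 roughly encodes "royalty", axis 1 roughly encodes "gender".
toy_vectors = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
    "apple":  [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

The input words are excluded from the candidates because the result of the arithmetic usually stays close to one of them.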

**Word2vec is very useful in automatic text tagging, recommender systems and machine translation.**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/',force_remount=True)

Mounted at /content/gdrive/


# Installation and Loading the Model

In [None]:
!pip install gensim
!pip install python-Levenshtein

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-Levenshtein
  Downloading python_Levenshtein-0.20.8-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.20.8
  Downloading Levenshtein-0.20.8-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (174 kB)
Collecting rapidfuzz<3.0.0,>=2.3.0
  Downloading rapidfuzz-2.13.7-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Installing collected packages: rapidfuzz, Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.20.8 python-Levenshtein-0.20.8 rapidfuzz-2.13.7


# Importing the Required Libraries

In [None]:
import sys
import os
import gensim
import pandas as pd

# Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. 

The data is stored as a JSON file and can be read using pandas.
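The `lines=True` argument passed to `pd.read_json` when loading tells pandas that the file holds one JSON object per line. As a rough illustration of that layout, the sketch below parses two made-up records with the standard `json` module. The records and their values are invented; only the field names `reviewText` and `overall` come from the dataset.

```python
import io
import json

# Two invented records in the "one JSON object per line" layout (not real data).
raw = io.StringIO(
    '{"reviewText": "Great case, fits well.", "overall": 5}\n'
    '{"reviewText": "Stopped charging after a week.", "overall": 2}\n'
)

# Each non-empty line is parsed on its own into a dict.
records = [json.loads(line) for line in raw if line.strip()]

print(len(records))
print(records[0]["reviewText"])
```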

In [None]:
df = pd.read_json("/content/gdrive/MyDrive/Colab Notebooks/NLP/word2vec_gensim/Cell_Phones_and_Accessories_5.json", lines=True)
df

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"
...,...,...,...,...,...,...,...,...,...
194434,A1YMNTFLNDYQ1F,B00LORXVUE,eyeused2loveher,"[0, 0]",Works great just like my original one. I reall...,5,This works just perfect!,1405900800,"07 21, 2014"
194435,A15TX8B2L8B20S,B00LORXVUE,Jon Davidson,"[0, 0]",Great product. Great packaging. High quality a...,5,Great replacement cable. Apple certified,1405900800,"07 21, 2014"
194436,A3JI7QRZO1QG8X,B00LORXVUE,Joyce M. Davidson,"[0, 0]","This is a great cable, just as good as the mor...",5,Real quality,1405900800,"07 21, 2014"
194437,A1NHB2VC68YQNM,B00LORXVUE,Nurse Farrugia,"[0, 0]",I really like it becasue it works well with my...,5,I really like it becasue it works well with my...,1405814400,"07 20, 2014"


In [None]:
df.shape

(194439, 9)

# Simple Preprocessing & Tokenization
* The first step in any data science task is to clean the data.

* For NLP, this typically means converting all words to lower case, trimming whitespace, and removing punctuation.

* We do the same here, using gensim's `simple_preprocess`.

* Additionally, we could remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms, e.g. 'running' to 'run'. These two steps are not applied in this notebook.
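The cleaning steps listed above can be sketched in plain Python. This is only an approximation for building intuition, not gensim's implementation: `basic_preprocess`, its parameters, and the tiny `STOP_WORDS` set are hypothetical names introduced here.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

def basic_preprocess(text, min_len=2, max_len=15, remove_stop_words=False):
    """Lowercase, split on anything that is not a letter, drop very short or long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

sample = "They look good and stick good! I just don't like the rounded shape"

print(basic_preprocess(sample))
# ['they', 'look', 'good', 'and', 'stick', 'good', 'just', 'don', 'like', 'the', 'rounded', 'shape']

print(basic_preprocess(sample, remove_stop_words=True))
# ['they', 'look', 'good', 'stick', 'good', 'just', 'don', 'like', 'rounded', 'shape']
```

Note how "don't" becomes "don": the apostrophe splits the word, and the leftover single-letter token "t" is dropped by the minimum-length filter.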

In [None]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [None]:
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [None]:
review_text.loc[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

In [None]:
df.reviewText.loc[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

# Training the Word2Vec Model
* Train the model on the reviews.

* Use a window of size 10, i.e. up to 10 words before the current word and 10 words after it.

* Only words that appear at least 2 times in the corpus are kept in the vocabulary; this is configured with the `min_count` parameter.

* `workers` sets how many CPU threads are used for training.
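To build intuition for `window` and `min_count`, here is a simplified sketch in plain Python. It is not how gensim is implemented internally; the helper names `build_vocab` and `context_pairs` and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only words whose total frequency across all sentences is >= min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

def context_pairs(sentence, window=2):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

# Toy corpus (invented): "phone", "bad" and "charger" occur only once.
sentences = [["great", "phone", "case"], ["great", "case"], ["bad", "charger"]]
print(sorted(build_vocab(sentences, min_count=2)))  # ['case', 'great']

# With window=1, each word is paired only with its immediate neighbours.
print(context_pairs(["case", "fits", "phone"], window=1))
# [('case', 'fits'), ('fits', 'case'), ('fits', 'phone'), ('phone', 'fits')]
```

A larger window tends to capture more topical relationships, while a smaller one focuses on words that appear right next to each other.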

# Initialize the model

In [None]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

# Build Vocabulary

In [None]:
model.build_vocab(review_text, progress_per=1000)

# Default model epochs:

In [None]:
model.epochs

5

# Train the Word2Vec Model

In [None]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61506018, 83868975)

# Save the Model
Save the model so that it can be reused in other applications.

In [None]:
model.save("/content/gdrive/MyDrive/Colab Notebooks/NLP/word2vec_gensim/word2vec-amazon-cell-accessories-reviews-short.model")

# Finding Similar Words and Similarity Between Words
Reference: https://radimrehurek.com/gensim/models/word2vec.html

### Now we will see how to find the words most similar to a given word.

In [None]:
model.wv.most_similar("bad")

[('shabby', 0.6660996079444885),
 ('terrible', 0.6644800901412964),
 ('horrible', 0.5914134979248047),
 ('good', 0.5757290720939636),
 ('crappy', 0.5301653742790222),
 ('disappointing', 0.524520993232727),
 ('crummy', 0.516400158405304),
 ('poor', 0.5126922130584717),
 ('okay', 0.5112333297729492),
 ('cheap', 0.5081875920295715)]

In [None]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.53316593

In [None]:
model.wv.similarity(w1="great", w2="good")

0.78210694
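The scores above are cosine similarities between word vectors. The sketch below shows the idea on hand-made 3-D vectors; the vectors and the helper names `cosine_similarity` and `most_similar` are assumptions for illustration, not the trained model or gensim's code.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-made toy vectors (hypothetical, not learned from the reviews).
toy_vectors = {
    "bad":      [1.0, 0.2, 0.0],
    "terrible": [0.9, 0.1, 0.1],
    "good":     [0.6, 0.8, 0.0],
    "cable":    [0.0, 0.1, 1.0],
}

for other, score in most_similar("bad", toy_vectors):
    print(other, round(score, 3))
```

In the real output above, 'good' ranks among the words most similar to 'bad'. A likely reason is that antonyms tend to appear in similar contexts ("a good case", "a bad case"), and word2vec learns from context rather than from meaning directly.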



---



# Further Reading
You can read more about gensim's Word2Vec implementation at https://radimrehurek.com/gensim/models/word2vec.html

Explore other datasets related to Amazon reviews: http://jmcauley.ucsd.edu/data/amazon/