- [Word2Vec Explained](https://israelg99.github.io/2017-03-23-Word2Vec-Explained/)
- [Word2vec on Wikipedia](https://en.wikipedia.org/wiki/Word2vec#Dimensionality)

- [What is Word2Vec? A Simple Explanation](https://www.youtube.com/watch?v=hQwFeIupNP0)
- [Word2Vec - Skipgram and CBOW (YouTube)](https://www.youtube.com/watch?v=UqRCEmrv1gQ)

- Each word is represented as a vector of 32 or more dimensions instead of a single number
- The vectors also preserve semantic information and the relationships between words
- [Gensim word2vec](https://radimrehurek.com/gensim/models/word2vec.html)
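
To make the "vector instead of a single number" idea concrete, here is a minimal sketch using hand-picked 3-dimensional vectors. The values are toy numbers invented for illustration, not learned embeddings, and `cosine_similarity` is a hypothetical helper written here. Cosine similarity is the usual way to compare two word vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration only (assumption, not model output).
toy_vectors = {
    "cheap": [0.9, 0.1, 0.0],
    "inexpensive": [0.8, 0.2, 0.1],
    "charger": [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["cheap"], toy_vectors["inexpensive"])
sim_unrelated = cosine_similarity(toy_vectors["cheap"], toy_vectors["charger"])

# Related words point in similar directions, so their cosine is higher.
print(sim_related > sim_unrelated)
```

With real embeddings the vectors are learned from text, but the comparison works the same way.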

In [1]:
from gensim.models import word2vec, keyedvectors
import pandas as pd
import nltk
import numpy as np


### Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [2]:
import gensim

In [3]:
import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords

In [4]:

df = pd.read_json("reviews_cell_phones_and_accessories_5.json", lines=True)
df


| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 194434 | A1YMNTFLNDYQ1F | B00LORXVUE | eyeused2loveher | [0, 0] | Works great just like my original one. I reall... | 5 | This works just perfect! | 1405900800 | 07 21, 2014 |
| 194435 | A15TX8B2L8B20S | B00LORXVUE | Jon Davidson | [0, 0] | Great product. Great packaging. High quality a... | 5 | Great replacement cable. Apple certified | 1405900800 | 07 21, 2014 |
| 194436 | A3JI7QRZO1QG8X | B00LORXVUE | Joyce M. Davidson | [0, 0] | This is a great cable, just as good as the mor... | 5 | Real quality | 1405900800 | 07 21, 2014 |
| 194437 | A1NHB2VC68YQNM | B00LORXVUE | Nurse Farrugia | [0, 0] | I really like it becasue it works well with my... | 5 | I really like it becasue it works well with my... | 1405814400 | 07 20, 2014 |


In [5]:
df.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"


### Simple Preprocessing & Tokenization
Cleaning the text is the first step in most NLP tasks: converting words to lower case, trimming whitespace, and removing punctuation. Here we use `gensim.utils.simple_preprocess`, which lowercases each review, splits it into tokens, and by default discards tokens shorter than 2 or longer than 15 characters.

In [6]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object

In [7]:
review_text[0]

['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']
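
The token list above can be roughly reproduced with a few lines of plain Python. This is only an approximation of what `simple_preprocess` does (ASCII letters only, no accent handling), and `approx_simple_preprocess` is a hypothetical helper, not gensim's implementation. It does show why `don't` turns into `don` and why the single-letter `I` disappears.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (ASCII letters only)."""
    # Lowercase, then keep runs of letters; punctuation and digits act as separators.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop tokens that are too short or too long.
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = approx_simple_preprocess(
    "They look good and stick good! I just don't like the rounded shape"
)
print(tokens)
```

The apostrophe splits `don't` into `don` and `t`, and the length filter then removes `t` along with `I`.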

### Training the Word2Vec Model
Train the model on the review tokens. `window=10` uses up to 10 words before and 10 words after the current word as context. `min_count=2` drops every word that appears fewer than 2 times in the whole corpus; it filters words, not sentences.

`workers` sets the number of CPU threads used for training.
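
As a rough illustration of what `min_count` does, the sketch below counts words over a toy corpus and keeps only those that occur often enough. `filter_vocab` and the toy sentences are hypothetical names and data made up for this example; gensim's own vocabulary building does more than this.

```python
from collections import Counter

def filter_vocab(sentences, min_count=2):
    """Keep words whose total count across all sentences is >= min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Toy tokenized reviews, invented for illustration.
toy_sentences = [
    ["great", "phone", "case"],
    ["great", "charger"],
    ["phone", "cable"],
]

vocab = filter_vocab(toy_sentences, min_count=2)
print(sorted(vocab))  # ['great', 'phone']
```

Words such as `case` and `cable` appear only once, so they never get a vector.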

#### Initialize the model

In [8]:

model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

# sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).
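
To see how CBOW and skip-gram differ, here is a simplified sketch of how training examples are framed from a context window. It is an assumption-laden illustration: the helper names are hypothetical, the window is fixed, and nothing here reflects how gensim samples windows or optimizes the model internally.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of tokens[index], excluding the word itself."""
    start = max(0, index - window)
    return tokens[start:index] + tokens[index + 1:index + 1 + window]

def skipgram_pairs(tokens, window):
    """(center, context) pairs: predict each context word from the center word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

def cbow_examples(tokens, window):
    """(context_words, center) examples: predict the center word from its context."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]

tokens = ["these", "stickers", "work", "well"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with the whole context as input.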

#### Build Vocabulary

In [9]:

model.build_vocab(review_text, progress_per=1000)

In [10]:
model.epochs

5

In [11]:
model.corpus_count

194439

In [12]:
df.shape

(194439, 9)

In [13]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61508184, 83868975)
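
As far as we understand the gensim API, `train` returns a pair: the number of words actually used for training (after rare words are dropped and very frequent words are downsampled) and the raw word count summed over all epochs. Under that assumption, a quick arithmetic check on the numbers above:

```python
# Values copied from the output of model.train above.
effective_words, raw_words = 61508184, 83868975
epochs = 5  # value of model.epochs shown earlier

# Raw words seen in a single pass over the corpus.
words_per_epoch = raw_words // epochs

# Share of raw words that actually contributed to training.
kept_fraction = effective_words / raw_words

print(words_per_epoch)
print(round(kept_fraction, 3))
```

So roughly three quarters of the raw tokens were used in training, if the interpretation of the tuple is right.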

In [14]:
model.save("./word2vec-amazon-cell-accessories-reviews-short.model")

In [15]:
model.wv.most_similar("man")

[('woman', 0.7144899368286133),
 ('girl', 0.6376656889915466),
 ('guy', 0.6234941482543945),
 ('women', 0.6160207986831665),
 ('student', 0.5694995522499084),
 ('young', 0.5678049325942993),
 ('men', 0.567389190196991),
 ('boy', 0.5562182664871216),
 ('lbs', 0.5527745485305786),
 ('toy', 0.5512212514877319)]
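
Conceptually, a query like the one above ranks every other word in the vocabulary by cosine similarity to the query word. The sketch below shows that idea as a brute-force search over toy 2-dimensional vectors. The vectors are invented for illustration and `most_similar` here is a hypothetical stand-in, not gensim's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, made up for illustration only.
toy_vectors = {
    "man": [1.0, 0.1],
    "woman": [0.9, 0.3],
    "guy": [0.7, 0.5],
    "cable": [0.0, 1.0],
}

ranking = most_similar("man", toy_vectors, topn=2)
print(ranking)
```

The query word itself is excluded, which is why `man` does not appear in its own result list.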

In [16]:
model.wv.similarity(w1="cheap", w2="inexpensive")

0.51627254

In [17]:
model.wv.similarity(w1="cheap", w2="cheap")

1.0

In [18]:
model.wv.similarity(w1="great", w2="good")

0.779374

### Another Example

In [5]:
import nltk

from gensim.models import Word2Vec
from nltk.corpus import stopwords

import re

In [6]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

In [7]:
processed_paragraph = paragraph.lower()
processed_paragraph = re.sub('[^a-zA-Z]', ' ', processed_paragraph)
processed_paragraph = re.sub(r'\s+', ' ', processed_paragraph)

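# Preparing the dataset
# Note: punctuation was already stripped above, so sent_tokenize has no
# sentence boundaries left and returns the whole paragraph as one sentence.
all_sentences = nltk.sent_tokenize(processed_paragraph)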

all_words = [nltk.word_tokenize(sent) for sent in all_sentences]
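
The order of cleaning and sentence splitting matters. The sketch below compares the two orders using a very naive regex splitter as a stand-in for `nltk.sent_tokenize`. `naive_sentence_split` and `clean` are hypothetical helpers, and the splitter is far cruder than NLTK's, but the effect is the same: once punctuation is removed, there is nothing left to split on.

```python
import re

def naive_sentence_split(text):
    """Very rough stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean(text):
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

text = "I have three visions for India. We have not conquered anyone. Why?"

# Order used in the cell above: clean first, then split.
clean_then_split = naive_sentence_split(clean(text))

# Alternative order: split first, then clean each sentence.
split_then_clean = [clean(s) for s in naive_sentence_split(text)]

print(len(clean_then_split), len(split_then_clean))
```

Splitting first keeps one token list per sentence, so context windows would not cross sentence boundaries during training.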

In [8]:
all_words

[['i',
  'have',
  'three',
  'visions',
  'for',
  'india',
  'in',
  'years',
  'of',
  'our',
  'history',
  'people',
  'from',
  'all',
  'over',
  'the',
  'world',
  'have',
  'come',
  'and',
  'invaded',
  'us',
  'captured',
  'our',
  'lands',
  'conquered',
  'our',
  'minds',
  'from',
  'alexander',
  'onwards',
  'the',
  'greeks',
  'the',
  'turks',
  'the',
  'moguls',
  'the',
  'portuguese',
  'the',
  'british',
  'the',
  'french',
  'the',
  'dutch',
  'all',
  'of',
  'them',
  'came',
  'and',
  'looted',
  'us',
  'took',
  'over',
  'what',
  'was',
  'ours',
  'yet',
  'we',
  'have',
  'not',
  'done',
  'this',
  'to',
  'any',
  'other',
  'nation',
  'we',
  'have',
  'not',
  'conquered',
  'anyone',
  'we',
  'have',
  'not',
  'grabbed',
  'their',
  'land',
  'their',
  'culture',
  'their',
  'history',
  'and',
  'tried',
  'to',
  'enforce',
  'our',
  'way',
  'of',
  'life',
  'on',
  'them',
  'why',
  'because',
  'we',
  'respect',
  'the',
 

In [9]:
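# Removing stop words (requires the NLTK list: nltk.download('stopwords'))
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
all_words = [[word for word in sentence if word not in stop_words] for sentence in all_words]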

In [10]:
all_words

[['three',
  'visions',
  'india',
  'years',
  'history',
  'people',
  'world',
  'come',
  'invaded',
  'us',
  'captured',
  'lands',
  'conquered',
  'minds',
  'alexander',
  'onwards',
  'greeks',
  'turks',
  'moguls',
  'portuguese',
  'british',
  'french',
  'dutch',
  'came',
  'looted',
  'us',
  'took',
  'yet',
  'done',
  'nation',
  'conquered',
  'anyone',
  'grabbed',
  'land',
  'culture',
  'history',
  'tried',
  'enforce',
  'way',
  'life',
  'respect',
  'freedom',
  'others',
  'first',
  'vision',
  'freedom',
  'believe',
  'india',
  'got',
  'first',
  'vision',
  'started',
  'war',
  'independence',
  'freedom',
  'must',
  'protect',
  'nurture',
  'build',
  'free',
  'one',
  'respect',
  'us',
  'second',
  'vision',
  'india',
  'development',
  'fifty',
  'years',
  'developing',
  'nation',
  'time',
  'see',
  'developed',
  'nation',
  'among',
  'top',
  'nations',
  'world',
  'terms',
  'gdp',
  'percent',
  'growth',
  'rate',
  'areas',
  '

### Creating Word2Vec Model

In [11]:
model = Word2Vec(all_words, min_count=2)

In [14]:
# Finding Word Vectors
vector = model.wv['see']

In [15]:
vector

array([-9.58062708e-03,  8.96992814e-03,  4.17919783e-03,  9.26031265e-03,
        6.64846459e-03,  2.91303615e-03,  9.83883068e-03, -4.40194784e-03,
       -6.83826813e-03,  4.20083292e-03,  3.73480050e-03, -5.68703748e-03,
        9.72128659e-03, -3.56320175e-03,  9.55601316e-03,  8.33498780e-04,
       -6.30994933e-03, -1.98007398e-03, -7.39609497e-03, -3.03064659e-03,
        1.04605174e-03,  9.49276332e-03,  9.37271584e-03, -6.62563136e-03,
        3.45783960e-03,  2.27852142e-03, -2.50939489e-03, -9.22383461e-03,
        1.03023998e-03, -8.15419760e-03,  6.33127009e-03, -5.81249176e-03,
        5.53191779e-03,  9.82042495e-03, -1.82000702e-04,  4.54787537e-03,
       -1.80061534e-03,  7.36754015e-03,  3.93243926e-03, -9.01106931e-03,
       -2.36555771e-03,  3.61610227e-03, -1.03208884e-04, -1.19464030e-03,
       -1.04304112e-03, -1.67205918e-03,  5.97835984e-04,  4.15182533e-03,
       -4.24516387e-03, -3.82727943e-03, -3.96006581e-05,  2.60016386e-04,
       -1.65811914e-04, -

In [16]:

# Most similar words
similar = model.wv.most_similar("see")
similar

[('world', 0.2029840648174286),
 ('believe', 0.09815169870853424),
 ('us', 0.07644196599721909),
 ('vision', 0.06272962689399719),
 ('freedom', 0.04679276421666145),
 ('years', 0.039732933044433594),
 ('first', 0.038035809993743896),
 ('nation', 0.03507174924015999),
 ('india', 0.032317694276571274),
 ('respect', 0.0273970328271389)]

### Another Example
- [Most Useful Article](https://stackabuse.com/implementing-word2vec-with-gensim-library-in-python/)

In [17]:
import bs4 as bs
import urllib.request
import re
import nltk

scrapped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scrapped_data.read()

parsed_article = bs.BeautifulSoup(article, 'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text

- `urllib`: we first download the Wikipedia article using the `urlopen` method.
- `BeautifulSoup`: we then read the article content and parse it into a `BeautifulSoup` object.
- Wikipedia stores the text content of the article inside `p` tags. We use the `find_all` function of the `BeautifulSoup` object to fetch every paragraph tag and concatenate the text.
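
The paragraph-extraction step can be tried without network access or extra packages. The sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup and runs on a small inline HTML snippet. Both the snippet and the `ParagraphTextExtractor` class are made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphTextExtractor(HTMLParser):
    """Collects the text found inside <p>...</p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside a <p> element
        self.paragraphs = []  # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self.depth > 0:
            self.paragraphs[-1] += data

# Inline HTML, invented for illustration (assumption, not the real article).
html = ("<html><body><h1>Title</h1>"
        "<p>First <b>bold</b> paragraph.</p><p>Second one.</p>"
        "</body></html>")

extractor = ParagraphTextExtractor()
extractor.feed(html)
print(extractor.paragraphs)
```

The heading text is skipped and only paragraph content is kept, which mirrors what `find_all('p')` followed by `p.text` gives in the cell above.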



### Preprocessing

In [19]:
# Cleaning the text
processed_article = article_text.lower()
processed_article = re.sub('[^a-zA-Z]', ' ', processed_article)
processed_article = re.sub(r'\s+', ' ', processed_article)

# Preparing the dataset
# Note: punctuation is already gone at this point, so sent_tokenize
# returns the whole article as a single sentence.
all_sentences = nltk.sent_tokenize(processed_article)

all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# Removing stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
all_words = [[word for word in sentence if word not in stop_words] for sentence in all_words]

In [20]:
from gensim.models import Word2Vec

word2vec = Word2Vec(all_words, min_count=2)

In [22]:
all_words

[['artificial',
  'intelligence',
  'ai',
  'intelligence',
  'demonstrated',
  'machines',
  'opposed',
  'natural',
  'intelligence',
  'displayed',
  'humans',
  'animals',
  'leading',
  'ai',
  'textbooks',
  'define',
  'field',
  'study',
  'intelligent',
  'agents',
  'system',
  'perceives',
  'environment',
  'takes',
  'actions',
  'maximize',
  'chance',
  'achieving',
  'goals',
  'popular',
  'accounts',
  'use',
  'term',
  'artificial',
  'intelligence',
  'describe',
  'machines',
  'mimic',
  'cognitive',
  'functions',
  'humans',
  'associate',
  'human',
  'mind',
  'learning',
  'problem',
  'solving',
  'however',
  'definition',
  'rejected',
  'major',
  'ai',
  'researchers',
  'b',
  'c',
  'ai',
  'applications',
  'include',
  'advanced',
  'web',
  'search',
  'engines',
  'e',
  'google',
  'recommendation',
  'systems',
  'used',
  'youtube',
  'amazon',
  'netflix',
  'understanding',
  'human',
  'speech',
  'siri',
  'alexa',
  'self',
  'driving',
  

In [24]:
sim_words = word2vec.wv.most_similar('intelligence')
sim_words

[('winter', 0.3759722411632538),
 ('ai', 0.3491349518299103),
 ('actually', 0.33578574657440186),
 ('programs', 0.3161846399307251),
 ('field', 0.3098253905773163),
 ('systems', 0.30423590540885925),
 ('data', 0.3011187016963959),
 ('often', 0.2987501621246338),
 ('machine', 0.2976626753807068),
 ('experience', 0.28699710965156555)]

In [27]:
sim_words = word2vec.wv.most_similar('machine')
sim_words

[('whether', 0.43519946932792664),
 ('fully', 0.4198515713214874),
 ('however', 0.41358375549316406),
 ('humans', 0.3742591142654419),
 ('c', 0.36555925011634827),
 ('software', 0.35991308093070984),
 ('drones', 0.35991284251213074),
 ('research', 0.3560183644294739),
 ('find', 0.34959396719932556),
 ('known', 0.34892621636390686)]

### Google Pre-Trained Model

In [2]:
import gensim
import gensim.downloader as api
from gensim.models import KeyedVectors


In [3]:
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

In [None]:
model.most_similar("Man")

### Now you can play :)