# Word2Vec Analysis

**Problems with the BOW and TF-IDF models**<br/>
**Semantic information about words is not stored.** A binary BOW model stores each word as 1 (word present) or 0 (word absent). TF-IDF does a little better because it gives more weight to uncommon words. Still, although TF-IDF improves on BOW, neither model stores semantic information. Both treat words individually (whether a word appears or not) and completely ignore whether a word appears together with another word, or how likely one word is to appear given that another appears. The **Word2Vec model** addresses this issue.

In the **Word2Vec model**, a word is not represented as a single number (as in BOW or TF-IDF) but as a vector. This helps preserve the relationships between words. <br/>
For example, suppose the word 'king' has the vector (2, 6), 'queen' has (2.2, 6.3), and 'life' has (8, 3). If we plot them, 'king' and 'queen' lie close together, which shows they are more closely related to each other than to 'life'. In fact, with a big corpus we can do all sorts of arithmetic on these vectors. Research at Google showed that a trained **Word2Vec model** gives `king - man + woman ≈ queen`, which shows how much semantic information the model stores.
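One common way to quantify "more related" is cosine similarity, which is also the score gensim's `most_similar` reports. The sketch below applies it to the toy 2-D vectors from the example above; these numbers are illustrative only and are not the output of a trained model.

```python
import math

# Toy 2-D vectors taken from the example above (not real Word2Vec output).
vectors = {
    "king": (2.0, 6.0),
    "queen": (2.2, 6.3),
    "life": (8.0, 3.0),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king_queen = cosine_similarity(vectors["king"], vectors["queen"])
king_life = cosine_similarity(vectors["king"], vectors["life"])

print(f"king vs queen: {king_queen:.4f}")
print(f"king vs life:  {king_life:.4f}")
```

With these toy values, 'king' scores much higher against 'queen' than against 'life', which matches what the plot would show.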

Building a **Word2Vec model**<br/>
* Go through a huge dataset (for example the whole Wikipedia corpus, i.e. all Wikipedia articles).
* Then create a matrix of the unique words in the dataset that contains the co-occurrence relations between the words. <br/>
For example, if we have 3 sentences<br/>
1.'it is going to rain today'<br/>
2.'today I am not going outside'<br/>
3.'I am going to watch the season premier'<br/>
the word `going` appears in 3 different sentences. The word `going` appears with `to` in 2 different sentences, and with `today` in 2 different sentences. This way the matrix preserves the relationships between the words.
* Next, we split the matrix into 2 matrices, one of which is a transposed version of the other. Each matrix has `n` dimensions, and each word has a specific value for each dimension. Together these values form the word vector. For example, with 300 dimensions, the word `going` has 300 values, one per dimension.
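The co-occurrence counts for the three example sentences can be sketched in a few lines. This is only an illustration of the counting idea described above: it counts words that share a sentence, whereas an actual Word2Vec implementation learns from a sliding context window with a shallow neural network instead of building this matrix explicitly. The helper name `together` is made up for this example.

```python
from collections import Counter
from itertools import combinations

sentences = [
    "it is going to rain today",
    "today I am not going outside",
    "I am going to watch the season premier",
]

# Count, for every pair of distinct words, how many sentences contain both.
cooccurrence = Counter()
for sentence in sentences:
    unique_words = sorted(set(sentence.lower().split()))
    for first, second in combinations(unique_words, 2):
        cooccurrence[(first, second)] += 1

def together(word_a, word_b):
    """Number of sentences in which both words appear (hypothetical helper)."""
    return cooccurrence[tuple(sorted((word_a, word_b)))]

print("going + to:   ", together("going", "to"))
print("going + today:", together("going", "today"))
print("rain + watch: ", together("rain", "watch"))
```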

In [6]:
# Word2Vec model visualization

# Install gensim - pip install gensim
import nltk
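import urllib.request # import the submodule explicitly; plain "import urllib" does not guarantee urllib.request is loaded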
import bs4 as bs
import re
from gensim.models import Word2Vec # this is required to build word2vec

# Getting the data source
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Global_warming').read()
source[:400]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Global warming - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName'

In [5]:
# Parsing the data/ creating BeautifulSoup object
soup = bs.BeautifulSoup(source,'lxml')

# Fetching the data
text = ""
for paragraph in soup.find_all('p'): # getting text that is written in HTML 'p' tag
    text += paragraph.text
    
text[:400]

"\n\n\nGlobal warming is a long-term rise in the average temperature of the Earth's climate system, an aspect of climate change shown by temperature measurements and by multiple effects of the warming.[2][3] The term commonly refers to the mainly human-caused observed warming since pre-industrial times and its projected continuation,[4] though there were also much earlier periods of global warming.[5]"

In [7]:
# Preprocessing the data
text = re.sub(r'\[[0-9]*\]',' ',text) # remove references
text = re.sub(r'\s+',' ',text) # remove extra spaces
text = text.lower()
text = re.sub(r'\W',' ',text) # remove non-word
text = re.sub(r'\d',' ',text) # remove digits
text = re.sub(r'\s+',' ',text) # remove extra spaces
text[:400]

' global warming is a long term rise in the average temperature of the earth s climate system an aspect of climate change shown by temperature measurements and by multiple effects of the warming the term commonly refers to the mainly human caused observed warming since pre industrial times and its projected continuation though there were also much earlier periods of global warming in the modern con'

In the text summarizer, we used both `clean_text` and `text`; we do not need that here.

In [15]:
# Preparing the dataset
sentences = nltk.sent_tokenize(text)
#print(sentences[:5]) # returns the whole article as one sentence because we removed all punctuation
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
#print(sentences[:2]) # returns a list of all the words in each sentence, i.e. the entire article here

In [19]:
# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)

words = model.wv.vocab # gensim < 4.0; in gensim >= 4.0 use model.wv.key_to_index instead
type(words)
print(list(words.items())[:5])

[('global', <gensim.models.keyedvectors.Vocab object at 0x11a61c6d8>), ('warming', <gensim.models.keyedvectors.Vocab object at 0x11a61ce10>), ('is', <gensim.models.keyedvectors.Vocab object at 0x11a61cb00>), ('a', <gensim.models.keyedvectors.Vocab object at 0x11a61ce80>), ('long', <gensim.models.keyedvectors.Vocab object at 0x11a61c278>)]


`min_count=1`: the model ignores all words whose total frequency is lower than `min_count` (here 1). Since every word in the corpus appears at least once, no word is ignored, so we are considering all words here. If we set `min_count` to 5 instead, the model ignores all words whose total frequency is below 5.
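The effect of `min_count` on the vocabulary can be mimicked with a plain frequency count. The sketch below is not gensim's implementation; the token list and the helper `filter_vocabulary` are made up for illustration.

```python
from collections import Counter

# Made-up token list, standing in for a tokenized corpus.
tokens = "global warming is a global problem and warming is global".split()
frequencies = Counter(tokens)

def filter_vocabulary(frequencies, min_count):
    """Keep only the words whose total frequency is at least min_count."""
    return {word for word, count in frequencies.items() if count >= min_count}

vocab_all = filter_vocabulary(frequencies, min_count=1)
vocab_frequent = filter_vocabulary(frequencies, min_count=2)

print(sorted(vocab_all))       # every word survives with min_count=1
print(sorted(vocab_frequent))  # rare words are dropped with min_count=2
```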

In [23]:
# Vector representation of a particular word: finding Word Vectors
# To find vector for the word 'global'
vector = model.wv['global']
vector.shape 

(100,)

The vector has size 100, i.e. 100 dimensions, which is the default when no size is passed to `Word2Vec`. The word `global` has a value for each of the 100 dimensions.

In [22]:
# Most similar words
similar = model.wv.most_similar('global')
similar

[('and', 0.9810046553611755),
 ('the', 0.9802556037902832),
 ('to', 0.9784972667694092),
 ('of', 0.9782954454421997),
 ('is', 0.976692795753479),
 ('climate', 0.9750581979751587),
 ('a', 0.9737613797187805),
 ('that', 0.9728396534919739),
 ('warming', 0.9725962281227112),
 ('in', 0.9714340567588806)]

The result is dominated by stopwords such as `and`, `the`, and `to`. We can fix this with the additional preprocessing shown in previous notebooks, so the code is updated next. Even so, some meaningful neighbors of `global` for this article already appear, such as `climate` and `warming`.

In [38]:
import nltk
import urllib.request # import the submodule explicitly so urllib.request.urlopen is available
import bs4 as bs
import re
from gensim.models import Word2Vec
from nltk.corpus import stopwords
nltk.download('stopwords')

# Getting the data source
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/Global_warming').read()
# Parsing the data/ creating BeautifulSoup object
soup = bs.BeautifulSoup(source,'lxml')
# Fetching the data
text = ""
for paragraph in soup.find_all('p'):
    text += paragraph.text

# Preprocessing the data
text = re.sub(r'\[[0-9]*\]',' ',text)
text = re.sub(r'\s+',' ',text)
text = text.lower()
text = re.sub(r'[@#,\$%&\*\(\)\<\>\?\'\":;"\[\]-]', ' ', text)
text = re.sub(r'\d',' ',text)
text = re.sub(r'\s+',' ',text)
text[:450]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sayantan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


' global warming is a long term rise in the average temperature of the earth s climate system an aspect of climate change shown by temperature measurements and by multiple effects of the warming. the term commonly refers to the mainly human caused observed warming since pre industrial times and its projected continuation though there were also much earlier periods of global warming. in the modern context the terms are commonly used interchangeably'

In [40]:
# Preparing the dataset
sentences = nltk.sent_tokenize(text)
print(sentences[:3]) # Now we can see sentences rather than entire article

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

stop_words = set(stopwords.words('english')) # build the stopword set once instead of once per word
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]
    
# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)
words = model.wv.vocab # gensim < 4.0; in gensim >= 4.0 use model.wv.key_to_index instead

# Finding Word Vectors
vector = model.wv['global']
print(vector.shape)

# Most similar words
similar = model.wv.most_similar('global')
similar

[' global warming is a long term rise in the average temperature of the earth s climate system an aspect of climate change shown by temperature measurements and by multiple effects of the warming.', 'the term commonly refers to the mainly human caused observed warming since pre industrial times and its projected continuation though there were also much earlier periods of global warming.', 'in the modern context the terms are commonly used interchangeably but global warming more specifically relates to worldwide surface temperature increases while climate change is any regional or global statistically identifiable persistent change in the state of climate which lasts for decades or longer including warming or cooling.']
(100,)


[('cover', 0.368155300617218),
 ('increased', 0.34049203991889954),
 ('cycle', 0.33821025490760803),
 ('temperature', 0.32580241560935974),
 ('warming', 0.318177729845047),
 ('gases', 0.3173825144767761),
 ('burning', 0.31577667593955994),
 ('.', 0.312172532081604),
 ('dubbed', 0.3120083212852478),
 ('called', 0.3098583221435547)]

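The vector size is again 100, i.e. the word `global` has a value for each of 100 dimensions. The similar words are now more meaningful, because removing stopwords lets the model focus on content words. A stray `.` token still appears because this version keeps sentence-ending periods so that `sent_tokenize` can split the text into sentences.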

In [None]:
# Install gensim - pip install gensim
from gensim.models import KeyedVectors

filename = 'GoogleNews-vectors-negative300.bin'

model = KeyedVectors.load_word2vec_format(filename, binary=True) # binary file

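# load_word2vec_format returns a KeyedVectors object, so we query it directly (no .wv needed)
model.most_similar('king')

# king - man + woman
model.most_similar(positive=['king','woman'], negative=['man'])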
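The `positive`/`negative` query above can be pictured as vector arithmetic followed by a cosine-similarity ranking. The sketch below shows that idea on hand-made 3-D vectors; the vectors, the dimension meanings, and the `analogy` helper are all hypothetical, and the real gensim implementation may differ in details such as normalization.

```python
import math

# Hypothetical toy vectors: (royalty, male, female). Not from a trained model.
toy_vectors = {
    "king": (0.9, 0.8, 0.1),
    "queen": (0.9, 0.1, 0.8),
    "man": (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.0, 0.2, 0.2),
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Skip the query words themselves when ranking.
    exclude = set(positive) | set(negative)
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(ranking)
```

With these toy values, the top-ranked word is `queen`, mirroring the `king - man + woman` result described at the start of the notebook.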

  

