### Word2Vec - Easily Explained

=> The main drawback of BoW (Bag of Words) and TF-IDF is that they ignore semantic information (i.e. the meaning of a word) and do not preserve the order in which words occur in a sentence.

=> Both also produce sparse vectors as long as the vocabulary, which might lead to overfitting.
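
To make the word-order point concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` is hypothetical and uses a naive whitespace split; it is not how any particular library builds its counts.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each lowercase, whitespace-separated token, ignoring its position."""
    return Counter(sentence.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The two sentences mean different things, yet their bags of words are identical.
print(a == b)  # True
```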

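Thus, Word2Vec is used to convert words into dense vectors, typically with 32 or more dimensions (gensim defaults to 100), by learning from the neighbouring words inside a context window. This captures semantic similarity and local context, although the exact word order inside the window is not kept.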
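
Word2Vec learns from the words that surround each target word. The sketch below shows how (target, context) pairs could be generated for the skip-gram variant. The function `context_pairs` and the window size are illustrative assumptions, not gensim's internals.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = context_pairs(["death", "is", "the", "destination"], window=1)
for pair in pairs:
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbours; a larger window pulls in more distant context.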

For example, take a corpus that contains the words "man" and "woman".
Represented in a 2D space, the word "man" might have the vector (8, 6).
Represented in a 2D space, the word "woman" might have the vector (8.2, 6.4).
Because "man" and "woman" tend to appear in similar contexts, their vectors lie close together and the difference is minimal. Word2Vec treats each word as a whole and does not look at sub-words, so the letters the two words share play no role. An unrelated word such as "hello" might sit further away, at (10, 14).
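
The gap between these example vectors can be checked with a few lines of Python. This is only a sketch over the made-up 2D coordinates from the example, not over vectors from a trained model.

```python
import math

# Illustrative 2D vectors taken from the example in the text
man = (8.0, 6.0)
woman = (8.2, 6.4)
hello = (10.0, 14.0)

# Euclidean distance between the points (math.dist needs Python 3.8+)
d_man_woman = math.dist(man, woman)
d_man_hello = math.dist(man, hello)

print(round(d_man_woman, 3))  # 0.447
print(round(d_man_hello, 3))  # 8.246
```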

STEPS TO CREATE WORD2VEC

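* Tokenize the sentences
* Create histograms (count how often each word occurs)
* Identify the most frequent words (rare words can be dropped, e.g. with `min_count`)
* Build a matrix with all unique words (one row, i.e. one vector, per word), which training then adjusts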
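
The vocabulary-building steps above can be sketched with the standard library alone. The two toy sentences and the naive tokenizer are assumptions for illustration; gensim performs the equivalent bookkeeping internally when the model is created.

```python
from collections import Counter

toy_sentences = ["No one wants to die.", "No one has ever escaped it."]

# 1. Tokenize (naive: lowercase, drop full stops, split on whitespace)
tokenized = [s.lower().replace(".", "").split() for s in toy_sentences]

# 2. Create a histogram of word counts
counts = Counter(word for sentence in tokenized for word in sentence)

# 3. Identify the most frequent words
most_frequent = counts.most_common(2)

# 4. Give every unique word an index, i.e. its row in the embedding matrix
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}

print(most_frequent)
print(len(vocab))  # 9
```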

In [1]:
!pip install gensim #topic modelling, similarity search



In [2]:
import nltk
nltk.download('punkt')

# implementing Word2Vec using the gensim module
from gensim.models import Word2Vec
from nltk.corpus import stopwords


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [30]:
paragraph = """
              When I was 17, I read a quote that went something like: “If you live each day as if it was your last,
              someday you’ll most certainly be right.” It made an impression on me, and since then, for the past 33 years,
              I have looked in the mirror every morning and asked myself: “If today were the last day of my life,
              would I want to do what I am about to do today?” And whenever the answer has been “No” for too many days in a row,
               I know I need to change something.
               No one wants to die. Even people who want to go to heaven don’t want to
               die to get there.
               And yet death is the destination we all share.
               No one has ever escaped it. And that is as it should be, because
               Death is very likely the single best invention of Life. It is Life’s change agent.
               It clears out the old to make way for the new. Right now the new is you, but
               someday not too long from now,
               you will gradually become the old and be cleared away. Sorry to be so dramatic,
               but it is quite true.
              Your time is limited, so don’t waste it living someone else’s life.
              Don’t be trapped by dogma — which is living with the results of other people’s thinking.
              Don’t let the noise of others’ opinions drown out your own inner voice.
              And most important, have the courage to follow your heart and intuition.
              They somehow already know what you truly want to become. Everything else is secondary.
              When I was young, there was an amazing publication called The Whole Earth Catalog,
               which was one of the bibles of my generation.
               It was created by a fellow named Stewart Brand not far from here in Menlo Park, and
               he brought it to life with his poetic touch. This was in the late 1960s, before
                personal computers and desktop publishing, so it was all made with typewriters,
                scissors and Polaroid cameras.
                It was sort of like Google in paperback form, 35 years before Google came along:
                It was idealistic, and overflowing with neat tools and great notions.
              """

In [32]:
import re
text = re.sub(r'\s+', ' ', paragraph)  # collapse newlines and runs of whitespace into single spaces
text = text.lower()                    # convert text to lower case
text = re.sub(r'\d', ' ', text)        # replace each digit with a space
text = re.sub(r'\s+', ' ', text)       # collapse the extra spaces left by digit removal

In [33]:
text

' when i was , i read a quote that went something like: “if you live each day as if it was your last, someday you’ll most certainly be right.” it made an impression on me, and since then, for the past years, i have looked in the mirror every morning and asked myself: “if today were the last day of my life, would i want to do what i am about to do today?” and whenever the answer has been “no” for too many days in a row, i know i need to change something. no one wants to die. even people who want to go to heaven don’t want to die to get there. and yet death is the destination we all share. no one has ever escaped it. and that is as it should be, because death is very likely the single best invention of life. it is life’s change agent. it clears out the old to make way for the new. right now the new is you, but someday not too long from now, you will gradually become the old and be cleared away. sorry to be so dramatic, but it is quite true. your time is limited, so don’t waste it living 

In [34]:
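sentences = nltk.sent_tokenize(text)  # split the text into sentences

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]  # split each sentence into tokens

# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]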

In [35]:
sentences

[[',',
  'read',
  'quote',
  'went',
  'something',
  'like',
  ':',
  '“',
  'live',
  'day',
  'last',
  ',',
  'someday',
  '’',
  'certainly',
  'right.',
  '”',
  'made',
  'impression',
  ',',
  'since',
  ',',
  'past',
  'years',
  ',',
  'looked',
  'mirror',
  'every',
  'morning',
  'asked',
  ':',
  '“',
  'today',
  'last',
  'day',
  'life',
  ',',
  'would',
  'want',
  'today',
  '?',
  '”',
  'whenever',
  'answer',
  '“',
  '”',
  'many',
  'days',
  'row',
  ',',
  'know',
  'need',
  'change',
  'something',
  '.'],
 ['one', 'wants', 'die', '.'],
 ['even', 'people', 'want', 'go', 'heaven', '’', 'want', 'die', 'get', '.'],
 ['yet', 'death', 'destination', 'share', '.'],
 ['one', 'ever', 'escaped', '.'],
 [',', 'death', 'likely', 'single', 'best', 'invention', 'life', '.'],
 ['life', '’', 'change', 'agent', '.'],
 ['clears', 'old', 'make', 'way', 'new', '.'],
 ['right',
  'new',
  ',',
  'someday',
  'long',
  ',',
  'gradually',
  'become',
  'old',
  'cleared',
  '
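
The token lists above still contain punctuation such as `','`, `'.'` and curly quotes, which then become "words" in the model. A minimal sketch of one way to filter them out follows. The helpers `is_punctuation` and `drop_punctuation` are hypothetical and are not applied anywhere in this notebook, so the outputs shown here keep their punctuation.

```python
import string

# Curly quotes and the long dash appear in the text but not in string.punctuation
EXTRA_PUNCT = "\u201c\u201d\u2018\u2019\u2014"

def is_punctuation(token):
    """True if every character of the token is a punctuation mark."""
    return all(ch in string.punctuation or ch in EXTRA_PUNCT for ch in token)

def drop_punctuation(tokenized_sentences):
    """Remove tokens that consist only of punctuation."""
    return [[tok for tok in sent if not is_punctuation(tok)]
            for sent in tokenized_sentences]

sample = [[',', 'death', 'likely', 'single', 'best', 'invention', 'life', '.'],
          ['life', '\u2019', 'change', 'agent', '.']]
cleaned = drop_punctuation(sample)
print(cleaned)
```

Tokens that merely contain punctuation, such as `'right.'`, are left untouched by this filter and would need separate handling.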

In [41]:
# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)  # min_count=1 keeps every word; words seen fewer than min_count times are ignored

In [45]:
# sentences

In [44]:
model.wv.similar_by_word('would')

[('everything', 0.277738481760025),
 ('”', 0.19782152771949768),
 ('looked', 0.19341064989566803),
 ('old', 0.19333408772945404),
 ('ever', 0.17128680646419525),
 ('right.', 0.1430920511484146),
 ('mirror', 0.13816764950752258),
 ('.', 0.13567233085632324),
 ('every', 0.13381516933441162),
 ('menlo', 0.13212113082408905)]
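
The scores returned by `similar_by_word` are cosine similarities between word vectors. The low values here (all below 0.3) likely reflect how little text the model was trained on. A minimal sketch of the same ranking idea over the made-up 2D vectors from the introduction follows. The helpers `cosine` and `most_similar` are hypothetical stand-ins, not gensim code.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

toy_vectors = {"man": (8.0, 6.0), "woman": (8.2, 6.4), "hello": (10.0, 14.0)}
ranking = most_similar("man", toy_vectors)
print(ranking)
```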

In [46]:
model.wv.words_closer_than('would', 'want')


[',',
 '.',
 '’',
 'life',
 '“',
 '”',
 'one',
 'know',
 'made',
 'change',
 'today',
 'years',
 'die',
 'else',
 'living',
 'death',
 'old',
 'people',
 'new',
 'someday',
 'become',
 'last',
 'day',
 'something',
 'google',
 'like',
 'destination',
 'heaven',
 'get',
 'gradually',
 'yet',
 'right',
 'share',
 'ever',
 'make',
 'likely',
 'single',
 'best',
 'invention',
 'agent',
 'clears',
 'go',
 'escaped',
 'days',
 'even',
 'mirror',
 'read',
 'quote',
 'went',
 'live',
 'certainly',
 'right.',
 'impression',
 'since',
 'past',
 'looked',
 'every',
 'morning',
 '?',
 'whenever',
 'answer',
 'many',
 'away',
 'row',
 'need',
 'cleared',
 'notions',
 'sorry',
 'far',
 'touch',
 'park',
 'menlo',
 'brand',
 'earth',
 'stewart',
 'named',
 'fellow',
 'created',
 'generation',
 'bibles',
 'computers',
 'desktop',
 'publishing',
 'typewriters',
 'scissors',
 'polaroid',
 'cameras',
 'sort',
 'paperback',
 'form',
 'came',
 'along',
 'idealistic',
 'neat',
 'tools',
 'catalog',
 'whole'