## First of all, let's try to understand what Word2Vec is and why we use it 🤔🤔

In both the Bag of Words and TF-IDF approaches (check out my tutorial in this repo in case you missed it), semantic information isn't stored, i.e. these techniques do not capture any relationship between the different words in the corpus (text data). TF-IDF does give more weight to uncommon words, but that still says nothing about how words relate to each other. One more drawback of these kinds of approaches is that the resulting models are prone to overfitting. To solve these kinds of problems, we use the Word2Vec model.
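To make the "importance to uncommon words" point concrete, here is a minimal sketch of the inverse document frequency (IDF) part of TF-IDF. The three toy documents, the `idf` helper and the plain `log(N / df)` formula are assumptions for illustration; libraries such as scikit-learn use smoothed variants of this formula.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens
docs = [
    ["madrid", "won", "the", "game"],
    ["zidane", "praised", "the", "team"],
    ["the", "team", "won", "again"],
]

def idf(term, documents):
    """Plain inverse document frequency: log(N / number of docs containing term)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df)

idf_the = idf("the", docs)        # appears in every document
idf_team = idf("team", docs)      # appears in two documents
idf_zidane = idf("zidane", docs)  # appears in one document

print(idf_the, idf_team, idf_zidane)
```

The rarer the word, the larger its weight, yet nothing in these numbers says that `zidane` and `team` have anything to do with each other.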

In the Word2Vec model, each word is represented as a dense vector of 32 or more dimensions instead of a single value such as the 1s and 0s used in Bag of Words and TF-IDF. The semantic information and the relationships between different words are preserved as well, which is why it is used more widely than Bag of Words and TF-IDF. Now, let's dig into the application part, shall we? 🤓
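Before diving in, here is a quick sketch of the difference between the two representations. Everything in it is made up for illustration: the five-word `vocab`, the hand-written 3-dimensional `dense` vectors (real Word2Vec vectors are learned and much wider), and the `one_hot` and `dot` helpers.

```python
# Hypothetical 5-word vocabulary
vocab = ["madrid", "zidane", "coach", "game", "goal"]

def one_hot(word):
    """Vector with a single 1 at the word's index; its length equals the vocabulary size."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Made-up dense vectors; a trained model would learn these values
dense = {
    "zidane": [0.9, 0.1, 0.3],
    "coach":  [0.8, 0.2, 0.4],
    "goal":   [-0.1, 0.9, -0.5],
}

one_hot_overlap = dot(one_hot("zidane"), one_hot("coach"))  # different words never overlap
dense_related = dot(dense["zidane"], dense["coach"])
dense_unrelated = dot(dense["zidane"], dense["goal"])

print(one_hot("zidane"))
print(one_hot_overlap, dense_related, dense_unrelated)
```

Two different one-hot vectors always have a dot product of 0, so they cannot express that two words are related. Dense vectors can place related words close together and unrelated words far apart.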

In [2]:
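# Import the essential libraries
import re

import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords

# One-time downloads used by the tokenizers and the stop-word list
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')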

In [3]:
# Corpus
paragraph = """ Madrid have been ruthless since the resumption, winning all eight games and conceding 
                just two goals over that stretch.But debate rages over the merits of this Madrid team,
                who arguably lack the panache of previous Los Blancos sides but have shown themselves to be a 
                winning machine, unlike some of their more celebrated forebears. "It doesn't bother me or 
                surprise me, it's always the same debate," Zidane told a news conference on Sunday."We prove in 
                every game and every training session that we are good. We have to show it. Everyone will give 
                their opinion on what they think of Real Madrid because it is the most important club in history 
                and this will never change."Zidane, once the unpredictable flair player in Madrid's midfield, 
                has become the straight man of the media room, a serious figure and a coach who has brought the 
                best out of many of his players.Rather than consider the prospect of another championship, with 
                Villarreal on Thursday and Leganes on Sunday the other remaining games for Madrid, Zidane's focus
                does not deviate from the immediate task at hand."La Liga and the Champions League are the goal 
                and what we fight for, but it is useless if we look beyond tomorrow's game," he said. "This is the 
                last week and there are three games. It is the most difficult, but the most important. All the 
                teams have things to play for and we want to put all our energy into tomorrow's game."Zidane 
                clearly finds the suggestion his team have had the better of VAR decisions over recent weeks to 
                be thoroughly tedious, amid claims Barcelona have had a relatively raw deal."Everyone can give 
                their opinion, I don't mess with the opinions of others," he said, when asked about the VAR talk.
                "What we are doing is giving everything on the pitch and putting in a great effort every day.
                That is what I'm interested in." """

In [5]:
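# Pre-processing the data
text = re.sub(r'\[[0-9]*\]', ' ', paragraph)  # drop bracketed reference markers such as [1]
text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace into one space
text = text.lower()                           # lowercase everything
text = re.sub(r'\d', ' ', text)               # replace every digit with a space
text = re.sub(r'\s+', ' ', text)              # collapse the gaps left by the digits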
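To see what each substitution does, here is the same sequence of steps applied to a short made-up string. The `sample` text is hypothetical and was chosen so that every regex has something to match.

```python
import re

# Hypothetical sample string with a reference marker, digits and uneven spacing
sample = "Madrid won   8 games[1] and conceded\n just 2 goals."

cleaned = re.sub(r'\[[0-9]*\]', ' ', sample)  # "[1]" becomes a space
cleaned = re.sub(r'\s+', ' ', cleaned)        # collapse whitespace, including the newline
cleaned = cleaned.lower()
cleaned = re.sub(r'\d', ' ', cleaned)         # each digit becomes a space
cleaned = re.sub(r'\s+', ' ', cleaned)        # collapse the gaps left by the digits

print(cleaned)  # madrid won games and conceded just goals.
```

The paragraph used in this notebook spells its numbers out and has no bracketed markers, so on this corpus the visible effects are mostly the lowercasing and the whitespace collapsing.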

In [9]:
# Preparing the dataset
sentences = nltk.sent_tokenize(text)

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]
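Note that removing stop words leaves punctuation tokens in place, so they end up in the training data too. One possible extra step, which is not part of the original notebook, is to keep only tokens that contain at least one letter. The `tokenized` list and the `keep_wordlike` helper below are hypothetical and only mimic the shape of the `nltk.word_tokenize` output.

```python
# Hypothetical tokenized sentences, shaped like the output of nltk.word_tokenize
tokenized = [
    ["madrid", "ruthless", "since", "resumption", ",", "winning", "eight", "games", "."],
    ["``", "always", "debate", ",", "''", "zidane", "told", "news", "conference"],
]

def keep_wordlike(tokens):
    """Keep only tokens that contain at least one letter."""
    return [tok for tok in tokens if any(ch.isalpha() for ch in tok)]

filtered = [keep_wordlike(sentence) for sentence in tokenized]
print(filtered)
```

Contraction fragments such as `'m` contain a letter, so they would survive this filter; handling them would need a stricter rule.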

In [11]:
# Training the Word2Vec model
# min_count=1 keeps every word, even those that appear only once in the corpus
model = Word2Vec(sentences, min_count=1)
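Word2Vec learns from which words appear near each other. As a rough sketch of that idea (not gensim's internal code), the snippet below lists the (center, context) pairs a skip-gram style model would look at for one sentence. The `context_pairs` helper and the window sizes are assumptions for illustration; gensim itself defaults to the CBOW variant with a window of 5.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["madrid", "ruthless", "since", "resumption"]
pairs = context_pairs(sentence, window=1)
print(pairs)
```

With a window of 1, each word is paired only with its direct neighbours. A larger window produces more pairs per sentence and pulls in looser, more topical context.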

In [15]:
# model.wv.vocab is the gensim 3.x API; gensim 4.x removed it in favour of model.wv.key_to_index
words = model.wv.vocab
words

{'madrid': <gensim.models.keyedvectors.Vocab at 0x7f9a204b8dc0>,
 'ruthless': <gensim.models.keyedvectors.Vocab at 0x7f9a204b8eb0>,
 'since': <gensim.models.keyedvectors.Vocab at 0x7f9a204b80a0>,
 'resumption': <gensim.models.keyedvectors.Vocab at 0x7f9a202de400>,
 ',': <gensim.models.keyedvectors.Vocab at 0x7f9a202de340>,
 'winning': <gensim.models.keyedvectors.Vocab at 0x7f9a202de2e0>,
 'eight': <gensim.models.keyedvectors.Vocab at 0x7f9a202de1f0>,
 'games': <gensim.models.keyedvectors.Vocab at 0x7f9a202de100>,
 'conceding': <gensim.models.keyedvectors.Vocab at 0x7f9a202de1c0>,
 'two': <gensim.models.keyedvectors.Vocab at 0x7f9a202de130>,
 'goals': <gensim.models.keyedvectors.Vocab at 0x7f9a202de040>,
 'stretch.but': <gensim.models.keyedvectors.Vocab at 0x7f9a202de220>,
 'debate': <gensim.models.keyedvectors.Vocab at 0x7f9a202de490>,
 'rages': <gensim.models.keyedvectors.Vocab at 0x7f9a204b13d0>,
 'merits': <gensim.models.keyedvectors.Vocab at 0x7f9a204b11f0>,
 'team': <gensim.models

In [25]:
# Finding the vector and its shape for any word present in our text corpus
vectors = model.wv['zidane']
print(vectors)
print('\n Shape:',vectors.shape)

[ 1.1201098e-03 -3.7991805e-03  1.0027046e-03 -1.7689388e-04
 -3.2557251e-03  2.7372090e-03  3.8887151e-03 -3.0097296e-03
  1.3236266e-04  2.5157051e-03  3.8946446e-03 -3.1804652e-03
 -4.9434984e-03  2.1939385e-03 -1.3319448e-03 -2.9347437e-03
  2.9161039e-05 -2.5931676e-04 -3.6120864e-03  1.2433172e-03
 -3.7393270e-03 -7.0426444e-04  3.6719195e-03  2.3540715e-04
 -3.5613176e-04  4.4214949e-03  2.4982304e-03 -4.1793282e-03
  4.9951570e-03 -3.3476399e-03  4.8449202e-03  1.4846472e-03
 -3.5903414e-03 -3.8403061e-03 -1.1529298e-04 -3.2067921e-03
 -4.9406663e-03 -2.8246532e-03 -4.0935874e-03  3.3758450e-03
 -4.6605202e-03  4.4073304e-04  2.9095442e-03  3.9463080e-03
 -4.3793595e-03  1.9753310e-04 -4.3580853e-03 -4.7668824e-03
 -1.4448996e-03 -4.5487345e-03 -3.7319696e-04 -1.8238208e-03
 -3.0922391e-03  1.9204508e-03  3.7903769e-03  4.8648543e-03
  3.9430580e-04  3.4690206e-03 -1.8686753e-04  2.4903398e-03
  2.9838332e-03  3.9719781e-03 -4.4967751e-03 -4.1384646e-03
  3.5127637e-03  3.16333

In [28]:
# Most similar words
similar = model.wv.most_similar('zidane')
similar

[("'m", 0.24366071820259094),
 ('always', 0.2218143343925476),
 ('championship', 0.2087278664112091),
 ('merits', 0.18880674242973328),
 ('goal', 0.14808715879917145),
 ('better', 0.1468127965927124),
 ('training', 0.14571413397789001),
 ('everything', 0.14134114980697632),
 ('relatively', 0.13222773373126984),
 ('hand', 0.12835456430912018)]
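`most_similar` ranks words by the cosine similarity of their vectors. Here is a minimal pure-Python sketch of that ranking on made-up 3-dimensional vectors. The `toy_vectors` values and the `cosine` and `most_similar_sketch` helpers are hypothetical and stand in for the trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors; a real model would supply model.wv[word] instead
toy_vectors = {
    "zidane": [1.0, 0.0, 0.0],
    "coach":  [0.8, 0.6, 0.0],
    "madrid": [0.6, 0.0, 0.8],
    "var":    [0.0, 1.0, 0.0],
}

def most_similar_sketch(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar_sketch("zidane", toy_vectors)
print(ranking)
```

The scores in the real output above are all below 0.25, which is most likely a consequence of training on a single paragraph: with so little text, the vectors have little chance to move away from their random starting values.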

### So, that was the Word2Vec model. Now it's your turn to try it out yourself. Till then, PEACE...✌️