# NLP: Intro to Word2Vec (Word to Vector Conversion)
> Description:

* Word2Vec is a popular natural language processing (NLP) technique that represents each word as a dense vector, so that words used in similar contexts lie close together in the vector space. It is based on the idea that the meaning of a word can be inferred from the context in which it appears, and it trains a shallow neural network to learn a vector for each word in a text corpus.

* The main difference from the traditional bag-of-words and TF-IDF approaches is that Word2Vec captures semantic relationships between words. In the bag-of-words model, each word is an independent dimension of a count vector, so the representation says nothing about how words relate to each other. TF-IDF reweights those counts by how often a word occurs in a document and how rare it is across the entire corpus, but it still treats words as unrelated.

* Word2Vec, by contrast, learns each word's vector from the words that surround it in the corpus. Words that occur in similar contexts receive similar vectors, which lets the model capture relationships such as synonymy. Note that antonyms also tend to share contexts, so they can end up close together too. These vectors are often useful for tasks such as language modeling, semantic similarity, and text classification.
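
To make "the context in which a word appears" concrete, here is a minimal sketch of how (center, context) training pairs can be drawn from a tokenized sentence with a sliding window, as in the skip-gram variant of Word2Vec. The function name `context_pairs` and the window size are illustrative assumptions, not part of gensim's API, and the real training procedure adds further details that are omitted here.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["cosmic", "microwave", "background", "radiation"]
pairs = context_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is one training example: the model is adjusted so that the vector of the center word becomes predictive of its context word.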

In [1]:
# Importing libraries:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from gensim.models import Word2Vec

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [4]:
# We will work with the same paragraph we have been using so far:

paragraph  = '''The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe. It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.

The CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey. They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.

The CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius). However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000. These tiny fluctuations are thought to be the result of slight density variations in the early universe, which were stretched out by cosmic expansion to form the large-scale structures we see today, such as galaxies and clusters of galaxies.

Studying the CMB has been crucial to our understanding of the universe and its evolution. It has provided strong evidence for the Big Bang theory, as well as for the existence of dark matter and dark energy. It has also allowed astronomers to measure the age, size, and composition of the universe with unprecedented accuracy.

In recent years, the study of the CMB has entered a new era, with a number of high-precision experiments, such as the Planck satellite and the Atacama Cosmology Telescope, providing even more detailed maps of the CMB and shedding light on some of the universe's deepest mysteries.
'''
paragraph

"The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe. It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.\n\nThe CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey. They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.\n\nThe CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius). However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000. These tiny fluctuations are thought to be the result of slight density variations in the early universe, which were st

In [16]:
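## CLEANING THE PARAGRAPH

plain_text = re.sub(r'\[[0-9]*\]', ' ', paragraph)   # drop citation markers such as [1]
plain_text = re.sub(r'\s+', ' ', plain_text)         # collapse whitespace and newlines
plain_text = re.sub(r'\d', ' ', plain_text).lower()  # replace each digit with a space, then lowercase
plain_text = re.sub(r'\s+', ' ', plain_text)         # collapse the spaces left by digit removal

plain_text = plain_text.replace('(', '').replace(')', '')  # remove parentheses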

plain_text


"the cosmic microwave background cmb is a form of electromagnetic radiation that pervades the entire universe. it is thought to be the afterglow of the big bang, the event that marks the beginning of the universe as we know it. the cmb was first discovered in by two radio astronomers, arno penzias and robert wilson, who were working at bell labs in new jersey. they were using a large horn-shaped antenna to study radio waves emitted by the milky way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. after ruling out a number of possible explanations, they realized that they had stumbled upon the cmb. the cmb is incredibly faint, with a temperature of just . kelvin - . degrees celsius. however, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in , . these tiny fluctuations are thought to be the result of slight density variations in the early universe, which were stretched out by cosmic expa

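The output above shows a side effect of deleting digits one character at a time: `2.7` becomes a lone `.`, which the sentence tokenizer may later treat as a sentence boundary. One possible variant, sketched below under the assumption that numbers should be dropped as whole tokens, removes the entire number (including decimal points and thousands separators) with a single pattern. The helper name `clean_text` and the regular expression are illustrative and not part of the original notebook.

```python
import re

def clean_text(text):
    """Lowercase text and drop bracketed citations, whole numbers, and parentheses."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)          # citation markers such as [12]
    text = re.sub(r'-?\d+(?:[.,]\d+)*', ' ', text)   # 1964, 2.7, -270.45, 100,000
    text = text.replace('(', '').replace(')', '')    # remove parentheses
    text = re.sub(r'\s+', ' ', text)                 # collapse whitespace
    return text.lower().strip()

sample = "The CMB has a temperature of just 2.7 Kelvin (-270.45 degrees Celsius) [1]."
print(clean_text(sample))
```

Only the sentence-final period survives in this sample, so the sentence is no longer split in the middle.
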
In [17]:
# Preparing the dataset : 

# First we will perform sentence tokenization:
sentences = nltk.sent_tokenize(plain_text)
# sentences 

# Now, we will tokenize every word in every sentence :
sentences2word = [nltk.word_tokenize(sentence) for sentence in sentences]
# sentences2word

# Now we will use NLTK's stopword list to remove uninformative words.
# Building the set once avoids re-reading the list for every token:
stop_words = set(stopwords.words('english'))

for i in range(len(sentences2word)):
    sentences2word[i] = [word for word in sentences2word[i] if word not in stop_words]


sentences2word

[['cosmic',
  'microwave',
  'background',
  'cmb',
  'form',
  'electromagnetic',
  'radiation',
  'pervades',
  'entire',
  'universe',
  '.'],
 ['thought',
  'afterglow',
  'big',
  'bang',
  ',',
  'event',
  'marks',
  'beginning',
  'universe',
  'know',
  '.'],
 ['cmb',
  'first',
  'discovered',
  'two',
  'radio',
  'astronomers',
  ',',
  'arno',
  'penzias',
  'robert',
  'wilson',
  ',',
  'working',
  'bell',
  'labs',
  'new',
  'jersey',
  '.'],
 ['using',
  'large',
  'horn-shaped',
  'antenna',
  'study',
  'radio',
  'waves',
  'emitted',
  'milky',
  'way',
  ',',
  'kept',
  'detecting',
  'mysterious',
  'signal',
  'seemed',
  'coming',
  'directions',
  'sky',
  '.'],
 ['ruling',
  'number',
  'possible',
  'explanations',
  ',',
  'realized',
  'stumbled',
  'upon',
  'cmb',
  '.'],
 ['cmb', 'incredibly', 'faint', ',', 'temperature', '.'],
 ['kelvin', '-', '.'],
 ['degrees', 'celsius', '.'],
 ['however',
  ',',
  'remarkably',
  'uniform',
  'across',
  'entire'
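
The token lists above still contain punctuation such as `','`, `'.'`, and `'-'`, which then ends up in the model's vocabulary. If that is not wanted, one option is to drop tokens that contain no letters during the same filtering pass. The sketch below uses a tiny hand-written stopword set as a stand-in for `stopwords.words('english')` so that it runs without NLTK data; the helper name `filter_tokens` is hypothetical.

```python
def filter_tokens(tokens, stop_words):
    """Keep tokens that are not stopwords and contain at least one letter."""
    return [
        tok for tok in tokens
        if tok not in stop_words and any(ch.isalpha() for ch in tok)
    ]

# Stand-in for set(stopwords.words('english')); assumed here to keep the example self-contained.
stop_words = {"the", "is", "a", "of", "with", "just"}

tokens = ["the", "cmb", "is", "incredibly", "faint", ",", "with", "a",
          "temperature", "of", "just", ".", "kelvin", "-", "."]
filtered = filter_tokens(tokens, stop_words)
print(filtered)
```

Hyphenated words such as `horn-shaped` are kept, because they contain letters; only pure punctuation is removed.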

## Training the Word2Vec model:

In [21]:
# min_count=1 keeps every word, including those that appear only once
model = Word2Vec(sentences2word, min_count=1)
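
Setting `min_count` to 1 matters here because gensim's default is 5, which would discard most of the words in a single paragraph. The plain-Python sketch below illustrates the idea of frequency-based vocabulary pruning. It is not gensim's implementation, and `build_vocab` is a hypothetical helper.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Return the set of tokens whose total frequency is at least `min_count`."""
    counts = Counter(tok for sentence in sentences for tok in sentence)
    return {tok for tok, n in counts.items() if n >= min_count}

sentences = [
    ["cmb", "form", "electromagnetic", "radiation"],
    ["cmb", "first", "discovered"],
    ["cmb", "incredibly", "faint"],
]

print(sorted(build_vocab(sentences, min_count=1)))  # every distinct token is kept
print(sorted(build_vocab(sentences, min_count=2)))  # only tokens seen at least twice
```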



In [25]:
# Getting the words found by the Word2Vec model:

words = model.wv.vocab  # gensim < 4.0; in gensim >= 4.0 this attribute was removed, use model.wv.key_to_index
words.keys()  # These are the words in the model's vocabulary.

dict_keys(['cosmic', 'microwave', 'background', 'cmb', 'form', 'electromagnetic', 'radiation', 'pervades', 'entire', 'universe', '.', 'thought', 'afterglow', 'big', 'bang', ',', 'event', 'marks', 'beginning', 'know', 'first', 'discovered', 'two', 'radio', 'astronomers', 'arno', 'penzias', 'robert', 'wilson', 'working', 'bell', 'labs', 'new', 'jersey', 'using', 'large', 'horn-shaped', 'antenna', 'study', 'waves', 'emitted', 'milky', 'way', 'kept', 'detecting', 'mysterious', 'signal', 'seemed', 'coming', 'directions', 'sky', 'ruling', 'number', 'possible', 'explanations', 'realized', 'stumbled', 'upon', 'incredibly', 'faint', 'temperature', 'kelvin', '-', 'degrees', 'celsius', 'however', 'remarkably', 'uniform', 'across', 'variations', 'parts', 'tiny', 'fluctuations', 'result', 'slight', 'density', 'early', 'stretched', 'expansion', 'large-scale', 'structures', 'see', 'today', 'galaxies', 'clusters', 'studying', 'crucial', 'understanding', 'evolution', 'provided', 'strong', 'evidence', '

In [26]:
# If we want to see the vector for a particular word:

vector = model.wv['cosmic']
vector   # The coordinates of 'cosmic' in the n-dimensional embedding space; n is the length of this array (gensim's default is 100).

array([-3.98301892e-03, -2.70775706e-03, -3.80598451e-03,  4.71882340e-05,
       -3.22667765e-03,  4.08722553e-03,  2.28740904e-03,  2.05308339e-03,
        3.49071901e-03, -1.83201884e-03, -2.32052902e-04, -2.49124737e-03,
        4.20600205e-04,  2.66864290e-03, -8.52546305e-04,  1.39543740e-03,
       -1.83397683e-03, -1.13969676e-04,  1.54595205e-03, -3.82068893e-03,
        2.41259951e-03, -4.15557157e-03,  1.38507306e-03, -3.91696952e-03,
       -4.84730443e-03,  1.48971693e-03,  3.61641729e-03,  1.53856853e-03,
       -6.72453956e-04,  2.20203144e-03, -1.73191517e-03,  3.13505181e-03,
        1.65485602e-03, -1.23908417e-03, -3.92137282e-03, -3.71263525e-03,
       -1.90628401e-03, -3.74478218e-03,  2.94097350e-04,  1.13700912e-03,
        2.57054786e-03, -1.06859964e-03,  2.26941571e-04,  2.40902256e-04,
        2.47177272e-03, -1.72763446e-03,  1.13886863e-03,  1.02468452e-03,
        3.10960575e-03,  4.64215642e-03,  3.60821933e-03,  1.89407682e-03,
        3.84033960e-03,  

In [28]:
# If we want to find the words most similar to a particular word:

model.wv.most_similar('milky')  # Each score is the cosine similarity between 'milky' and that word (higher means more similar); it is not a distance.

[('event', 0.2361035943031311),
 ('know', 0.20579682290554047),
 ('wilson', 0.19070184230804443),
 ('detecting', 0.19009076058864594),
 ('ruling', 0.18866844475269318),
 ('stretched', 0.18445973098278046),
 ('crucial', 0.18304233253002167),
 ('galaxies', 0.17966637015342712),
 ('understanding', 0.17718477547168732),
 ('structures', 0.16532006859779358)]
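
The scores returned by `most_similar` are cosine similarities. As a minimal sketch of what that number means, the function below computes cosine similarity for two plain Python lists. The three-dimensional vectors are made-up values for illustration, not vectors from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only.
u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]   # same direction as u
w = [0.0, 0.0, 3.0]   # orthogonal to u

print(cosine_similarity(u, v))  # close to 1
print(cosine_similarity(u, w))  # 0
```

Scores in the 0.16 to 0.24 range, as in the output above, indicate only weak similarity. With a single paragraph of training text, the model has likely seen too little context for these neighbours to be meaningful.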