# NLP: Intro to Word Embedding Techniques Using the Embedding Layer in Keras
> Description:

* Word embedding is a technique used in natural language processing (NLP) to represent words as dense vectors of real numbers, typically with far fewer dimensions than the vocabulary size. Each word is mapped to one vector, so that words used in similar contexts end up with similar vectors.

* Word embeddings are learned from large collections of text. During training, an algorithm iteratively adjusts the vector of each word so that the vectors become more useful for predicting a target in a given task, such as neighbouring words or a class label.

* The resulting word embeddings can be used in a wide range of NLP applications, including text classification, language translation, and sentiment analysis. They allow machine learning models to better understand the meaning and context of words, which can lead to more accurate and effective analysis of text data.
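
To make the idea of "vectors that capture meaning" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three 4-dimensional vectors are hand-picked, hypothetical values chosen only for illustration; they are not learned from any data and do not come from this notebook.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "galaxy": np.array([0.9, 0.1, 0.8, 0.0]),
    "universe": np.array([0.8, 0.2, 0.9, 0.1]),
    "antenna": np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cosine_similarity(toy_embeddings["galaxy"], toy_embeddings["universe"])
sim_unrelated = cosine_similarity(toy_embeddings["galaxy"], toy_embeddings["antenna"])

print(f"galaxy vs universe: {sim_related:.3f}")
print(f"galaxy vs antenna:  {sim_unrelated:.3f}")
```

With trained embeddings, a higher cosine similarity is usually read as "these words appear in similar contexts". With the toy values above it only reflects how the numbers were chosen.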

In [None]:
# importing libraries 
import numpy as np
import nltk
import re
from tensorflow.keras.preprocessing.text import one_hot


# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [None]:
paragraph =  '''The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe. It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.

The CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey. They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.

The CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius). However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000. These tiny fluctuations are thought to be the result of slight density variations in the early universe, which were stretched out by cosmic expansion to form the large-scale structures we see today, such as galaxies and clusters of galaxies.

Studying the CMB has been crucial to our understanding of the universe and its evolution. It has provided strong evidence for the Big Bang theory, as well as for the existence of dark matter and dark energy. It has also allowed astronomers to measure the age, size, and composition of the universe with unprecedented accuracy.

In recent years, the study of the CMB has entered a new era, with a number of high-precision experiments, such as the Planck satellite and the Atacama Cosmology Telescope, providing even more detailed maps of the CMB and shedding light on some of the universe's deepest mysteries.
'''
paragraph

"The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe. It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.\n\nThe CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey. They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.\n\nThe CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius). However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000. These tiny fluctuations are thought to be the result of slight density variations in the early universe, which were st

In [None]:
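## CLEANING THE PARAGRAPH

# Drop bracketed citation markers such as [1], then collapse runs of whitespace.
plain_text = re.sub(r'\[[0-9]*\]', ' ', paragraph)
plain_text = re.sub(r'\s+', ' ', plain_text)
# Replace every digit with a space and lowercase the text.
# Note: a decimal such as "2.7" leaves a stray "." behind, which later looks like a sentence end.
plain_text = re.sub(r'\d', ' ', plain_text).lower()
plain_text = re.sub(r'\s+', ' ', plain_text)

# Remove parentheses and hyphens (this also joins hyphenated words: "horn-shaped" -> "hornshaped").
plain_text = plain_text.replace('(', '').replace(')', '').replace('-', '')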

plain_text

"the cosmic microwave background cmb is a form of electromagnetic radiation that pervades the entire universe. it is thought to be the afterglow of the big bang, the event that marks the beginning of the universe as we know it. the cmb was first discovered in by two radio astronomers, arno penzias and robert wilson, who were working at bell labs in new jersey. they were using a large hornshaped antenna to study radio waves emitted by the milky way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. after ruling out a number of possible explanations, they realized that they had stumbled upon the cmb. the cmb is incredibly faint, with a temperature of just . kelvin  . degrees celsius. however, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in , . these tiny fluctuations are thought to be the result of slight density variations in the early universe, which were stretched out by cosmic expans

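The cleaned text above contains fragments such as `just . kelvin  . degrees celsius`, because each digit is replaced on its own and the decimal points stay behind. The sketch below reproduces the notebook's steps on one sentence and compares them with a hypothetical variant that removes each number as a whole. The sample sentence (including the `[1]` marker), both function names and the number regex are assumptions made for this illustration.

```python
import re

# Sample adapted from the paragraph; the trailing "[1]" is a made-up citation marker.
sample = "The CMB has a temperature of just 2.7 Kelvin (-270.45 degrees Celsius) [1]."

def clean_as_in_notebook(text):
    """Same steps as the cleaning cell, applied in the same order."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\d', ' ', text).lower()
    text = re.sub(r'\s+', ' ', text)
    return text.replace('(', '').replace(')', '').replace('-', '')

def clean_dropping_whole_numbers(text):
    """Hypothetical variant: remove each number, decimals included, as one unit."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'-?\d+(?:[.,]\d+)*', ' ', text).lower()
    text = text.replace('(', '').replace(')', '').replace('-', '')
    return re.sub(r'\s+', ' ', text).strip()

print(clean_as_in_notebook(sample))
print(clean_dropping_whole_numbers(sample))
```

Whether numbers should be dropped at all depends on the task. The variant only shows one way to avoid leaving isolated periods that a sentence tokenizer may treat as sentence boundaries.
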
In [None]:
# Split the cleaned text into sentences. The stray periods left by digit removal cause a few spurious splits (e.g. 'kelvin  .').
sentences = nltk.sent_tokenize(plain_text)
sentences

['the cosmic microwave background cmb is a form of electromagnetic radiation that pervades the entire universe.',
 'it is thought to be the afterglow of the big bang, the event that marks the beginning of the universe as we know it.',
 'the cmb was first discovered in by two radio astronomers, arno penzias and robert wilson, who were working at bell labs in new jersey.',
 'they were using a large hornshaped antenna to study radio waves emitted by the milky way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky.',
 'after ruling out a number of possible explanations, they realized that they had stumbled upon the cmb.',
 'the cmb is incredibly faint, with a temperature of just .',
 'kelvin  .',
 'degrees celsius.',
 'however, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in , .',
 'these tiny fluctuations are thought to be the result of slight density variations in the early universe, whic

In [None]:
# Vocabulary (dictionary) size: the range of integer indices that words are mapped into
voc_size  = 10000

# One-Hot Representation

In [None]:
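# Despite its name, one_hot() returns one integer index per word (obtained by hashing
# the word into the range [1, voc_size)), not a one-hot vector.
one_hot_rep = [one_hot(sentence, voc_size) for sentence in sentences]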
print(one_hot_rep)

[[5688, 5917, 6831, 1452, 7079, 2986, 9412, 1738, 3518, 6414, 7685, 4293, 9645, 5688, 4783, 3085], [5406, 2986, 7735, 6702, 5331, 5688, 8088, 3518, 5688, 4112, 301, 5688, 7269, 4293, 9253, 5688, 1151, 3518, 5688, 3085, 9844, 5383, 6341, 5406], [5688, 7079, 3437, 5200, 3322, 5582, 4608, 3399, 9307, 6281, 2556, 3817, 6633, 577, 6012, 5650, 5128, 8766, 9771, 1767, 2954, 5582, 3517, 4603], [1836, 5128, 5094, 9412, 1290, 5393, 1855, 6702, 3615, 9307, 9194, 7063, 4608, 5688, 50, 1547, 9528, 1836, 988, 7535, 9412, 743, 6602, 4293, 9419, 6702, 5331, 9392, 8812, 6216, 7874, 5582, 5688, 6937], [4382, 9759, 6792, 9412, 9544, 3518, 2964, 6585, 1836, 8930, 4293, 1836, 8283, 976, 3746, 5688, 7079], [5688, 7079, 2986, 4918, 8663, 5173, 9412, 3899, 3518, 2138], [7153], [9037, 1475], [2791, 5406, 2986, 9678, 6740, 4976, 5688, 4783, 6937, 5173, 3899, 2263, 3518, 2138, 9412, 8578, 7646, 5582], [3426, 1328, 4102, 2501, 7735, 6702, 5331, 5688, 7793, 3518, 1635, 9064, 2263, 5582, 5688, 1231, 3085, 1132, 512
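
In the output above the index 5688 appears many times, which is consistent with every occurrence of "the" mapping to the same integer. The following is a minimal sketch of that hashing idea in plain Python. It is an illustrative stand-in, not Keras' implementation: the function name `stable_hashing_trick` is hypothetical, `md5` is used only to get repeatable results, and the words are split on whitespace without the punctuation filtering Keras applies, so the indices will not match the notebook's.

```python
import hashlib

def stable_hashing_trick(sentence, n):
    """Map each whitespace-separated word to an integer in [1, n - 1] via a hash.

    Illustrative stand-in for the idea behind one_hot, not its implementation.
    """
    indices = []
    for word in sentence.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)  # index 0 stays free for padding
    return indices

sentence = "the cmb is the afterglow of the big bang"
encoded = stable_hashing_trick(sentence, 10000)
print(encoded)

# The same word always receives the same index ("the" is at positions 0, 3 and 6).
assert encoded[0] == encoded[3] == encoded[6]

# With a very small index space, different words are forced to share indices.
tiny = stable_hashing_trick(sentence, 4)  # only indices 1..3 are available
print(tiny)
```

Because the mapping is a hash rather than a fitted vocabulary, two different words can collide on one index. A larger `voc_size` makes collisions less likely but also makes the embedding table larger.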

# Word Embedding Representation (Embedding Matrix):

In [None]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential


In [None]:
# Pad every sentence with trailing zeros up to the length of the longest sentence:
max_length = max(len(x) for x in one_hot_rep)
embedded_text = pad_sequences(one_hot_rep, padding='post', maxlen=max_length)
print(embedded_text[:5])

[[5688 5917 6831 1452 7079 2986 9412 1738 3518 6414 7685 4293 9645 5688
  4783 3085    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0]
 [5406 2986 7735 6702 5331 5688 8088 3518 5688 4112  301 5688 7269 4293
  9253 5688 1151 3518 5688 3085 9844 5383 6341 5406    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0]
 [5688 7079 3437 5200 3322 5582 4608 3399 9307 6281 2556 3817 6633  577
  6012 5650 5128 8766 9771 1767 2954 5582 3517 4603    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0]
 [1836 5128 5094 9412 1290 5393 1855 6702 3615 9307 9194 7063 4608 5688
    50 1547 9528 1836  988 7535 9412  743 6602 4293 9419 6702 5331 9392
  8812 6216 7874 5582 5688 6937    0    0    0    0    0    0    0    0
     0    0    0    0    0]
 [4382 9759 6792 9412 95
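
The padding step can be sketched in a few lines of NumPy. This is a simplified stand-in for what `padding='post'` does, assuming integer sequences and a padding value of 0. The helper name `pad_post` and the toy sequences are made up for the example, and its truncation rule is a simplification that does not matter here because `maxlen` equals the longest sequence.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad integer sequences with `value` up to `maxlen` (longer ones are cut at the end)."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        padded[row, :len(trimmed)] = trimmed
    return padded

toy_sequences = [[5688, 7079, 2986], [7153], [9037, 1475]]
toy_max_length = max(len(seq) for seq in toy_sequences)

toy_padded = pad_post(toy_sequences, toy_max_length)
print(toy_padded)
```

Padding gives every sentence the same length, so the whole batch fits in one rectangular integer array that the embedding layer can index.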

In [None]:
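# Build a Sequential model whose only layer is an Embedding with 15 dimensions per word:
dim = 15

model = Sequential(name='Embedding_Model')
# Embedding layer: a trainable lookup table with voc_size rows and dim columns.
# Note: newer Keras releases may warn that input_length is deprecated.
model.add(Embedding(voc_size, dim, input_length=max_length))
model.compile('adam', 'mse')
model.summary()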

Model: "Embedding_Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 47, 15)            150000    
                                                                 
Total params: 150,000
Trainable params: 150,000
Non-trainable params: 0
_________________________________________________________________
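
The numbers in the summary can be checked by hand: an embedding layer stores one `dim`-sized vector per vocabulary index, and it outputs one such vector per token position. A short arithmetic sketch using the values from this notebook (47 is the padded length shown in the summary's output shape):

```python
voc_size = 10000   # vocabulary size used in the notebook
dim = 15           # embedding dimension used in the notebook
max_length = 47    # padded sentence length reported in the summary

# One row of `dim` weights per vocabulary index.
n_params = voc_size * dim

# One `dim`-sized vector per token position in a padded sentence.
output_shape_per_sentence = (max_length, dim)

print(f"trainable parameters: {n_params:,}")
print(f"output shape per sentence: {output_shape_per_sentence}")
```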


In [None]:
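# Inspect the vectors the (still untrained) embedding layer returns for one padded sentence.
# embedded_text[1:2] selects the second sentence (index 1) while keeping the batch axis that
# predict() expects; [0] then drops that axis, leaving one 15-dimensional vector per position.
model.predict(embedded_text[1:2])[0]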



array([[ 1.85492523e-02, -3.85978706e-02, -1.19758137e-02,
         2.69536041e-02,  1.81919336e-03,  8.90851021e-04,
        -4.19114940e-02, -4.37314436e-03, -4.87363711e-02,
         4.70311679e-02, -4.22126651e-02, -3.30915675e-02,
        -4.86287251e-02,  3.76777537e-02, -4.48207967e-02],
       [-2.04117429e-02,  3.89749743e-02,  4.73494567e-02,
         4.99094389e-02,  4.17676233e-02, -2.32914686e-02,
         2.05688812e-02,  7.57981464e-03, -9.25841182e-03,
        -3.22735198e-02,  4.86930124e-02,  2.15953924e-02,
         1.26311220e-02,  3.65111120e-02, -1.75625570e-02],
       [-4.31043990e-02,  8.80445167e-03,  1.14710107e-02,
         2.36505158e-02, -3.60427052e-03, -1.09831095e-02,
         2.30864435e-03,  2.92596556e-02, -4.78147641e-02,
        -1.45435221e-02, -3.91037352e-02,  2.63318904e-02,
        -1.92705281e-02,  3.16648148e-02, -3.22998688e-03],
       [-2.68684030e-02,  4.13108133e-02, -1.52742974e-02,
        -4.23475392e-02, -2.65886318e-02, -2.01598760
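
Conceptually, the layer's forward pass is a row lookup: each integer index selects one row of the weight matrix. The NumPy sketch below illustrates this with a random matrix standing in for the untrained weights. The random values, the seed and the three example indices are assumptions for illustration and will not reproduce the numbers printed above.

```python
import numpy as np

rng = np.random.default_rng(0)
voc_size, dim, max_length = 10000, 15, 47

# Stand-in for the layer's weight matrix: small random values, assumed here to
# resemble an untrained layer (the real initial values will differ).
weights = rng.uniform(-0.05, 0.05, size=(voc_size, dim))

# A padded sentence: three word indices followed by zeros.
padded_sentence = np.zeros(max_length, dtype=np.int64)
padded_sentence[:3] = [5688, 7079, 2986]

# Integer-array indexing performs one row lookup per index.
vectors = weights[padded_sentence]
print(vectors.shape)

# Every padded position (index 0) receives the same row of the table.
print(np.array_equal(vectors[3], vectors[-1]))
```

Since no training has happened, the returned vectors carry no meaning yet. They only become informative once the layer is trained as part of a model with a real objective.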