<a href="https://colab.research.google.com/github/GuptAmit725/NLP/blob/main/Embedding_intutition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2><I>Importing all the necessary libraries</I></h2>

In [12]:
import nltk
import numpy as np
import re
import os
import string
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Embedding  # same Keras package as tf.keras.models used below
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [42]:
text = """
          In human society, family (from Latin: familia) is a 
          group of people related either by consanguinity 
          (by recognized birth) or affinity (by marriage or other relationship).
          The purpose of families is to maintain the well-being of its members 
          and of society. Ideally, families would offer predictability, structure, 
          and safety as members mature and participate in the community.
          [1] In most societies, it is within families that children acquire 
          socialization for life outside the family, and acts as the primary 
          source of attachment, nurturing, and socialization for humans.
          [2][3] Additionally, as the basic unit for meeting the basic needs 
          of its members, it provides a sense of boundaries for performing tasks 
          in a safe environment, ideally builds a person into a functional adult, 
          transmits culture, and ensures continuity of humankind with precedents of knowledge.
      """

<h2><I>Defining the text preprocessing block</I></h2>

In [43]:
import inflect  # assumption: `inf` below refers to an inflect engine (pip install inflect)

inf = inflect.engine()

def number_to_text(text):
  """Replace standalone digit tokens with their spelled-out form."""
  words = text.split()

  for i, word in enumerate(words):
    if word.isdigit():
      words[i] = inf.number_to_words(word)
  return " ".join(words)

def remove_punc(text):
  """Strip every character listed in string.punctuation."""
  translator = str.maketrans("", "", string.punctuation)
  return text.translate(translator)

def remove_stopwords(text):
  """Drop English stop words such as 'the', 'and', 'this'."""
  eng_stopwords = set(stopwords.words('english'))
  word_tokens = word_tokenize(text)
  kept = [token for token in word_tokens if token not in eng_stopwords]

  return " ".join(kept)

def preprocessing(text):
  text = text.lower()  # lower-case the whole text
  text = number_to_text(text)  # spell out standalone numbers before digits are stripped
  text = re.sub(r'\d+', '', text)  # remove any digits that remain, e.g. inside "[1]"
  text = remove_punc(text)  # remove all punctuation
  text = remove_stopwords(text)  # remove stop words, which carry little meaning in this context

  return text

text = preprocessing(text)
text

'human society family latin familia group people related either consanguinity recognized birth affinity marriage relationship purpose families maintain wellbeing members society ideally families would offer predictability structure safety members mature participate community societies within families children acquire socialization life outside family acts primary source attachment nurturing socialization humans additionally basic unit meeting basic needs members provides sense boundaries performing tasks safe environment ideally builds person functional adult transmits culture ensures continuity humankind precedents knowledge'
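The same pipeline can be tried on a single sentence without NLTK. This is only a minimal sketch: `TINY_STOPWORDS` is a hypothetical, hand-written stand-in for `stopwords.words('english')`, whitespace splitting replaces `word_tokenize`, and `simple_preprocess` is an illustrative name that does not appear in the notebook.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stop-word list (assumption: far smaller).
TINY_STOPWORDS = {"the", "of", "is", "to", "its", "and", "a", "in"}

def simple_preprocess(text):
    text = text.lower()                                                # lower-case
    text = re.sub(r"\d+", "", text)                                    # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    tokens = [tok for tok in text.split() if tok not in TINY_STOPWORDS]
    return " ".join(tokens)

sample = "The purpose of families is to maintain the well-being of its members [1]."
cleaned = simple_preprocess(sample)
print(cleaned)
```

Note that removing punctuation fuses "well-being" into "wellbeing", which is also what happens in the output above.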

In [44]:
len(text)

632

In [48]:
X = '<start> ' + text
Y = text + ' <end>'

X,Y

('<start> human society family latin familia group people related either consanguinity recognized birth affinity marriage relationship purpose families maintain wellbeing members society ideally families would offer predictability structure safety members mature participate community societies within families children acquire socialization life outside family acts primary source attachment nurturing socialization humans additionally basic unit meeting basic needs members provides sense boundaries performing tasks safe environment ideally builds person functional adult transmits culture ensures continuity humankind precedents knowledge',
 'human society family latin familia group people related either consanguinity recognized birth affinity marriage relationship purpose families maintain wellbeing members society ideally families would offer predictability structure safety members mature participate community societies within families children acquire socialization life outside family a

In [50]:
len(X.split(' ')),len(Y.split(' '))

(75, 75)
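Prefixing `<start>` to the input and appending `<end>` to the target shifts the two sequences by one position, so the token at position i in `X` is paired with the token that follows it. A small sketch on a three-word string (the names `X_demo`, `Y_demo` and `pairs` are illustrative only):

```python
short_text = "human society family"

X_demo = "<start> " + short_text   # input sequence
Y_demo = short_text + " <end>"     # target sequence, shifted by one token

# Pair each input token with the token the model should predict next.
pairs = list(zip(X_demo.split(" "), Y_demo.split(" ")))
for current, following in pairs:
    print(f"{current!r} -> {following!r}")
```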

In [None]:
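# All unique words in the corpus. X and Y are joined with a space so that the last
# word of X and the first word of Y are not fused into one token; sorting keeps the
# indices reproducible across runs.
unique_words = sorted(set((X + ' ' + Y).split(' ')))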
unique_words

In [None]:
word_indexing = {word: i for i, word in enumerate(unique_words)}  # map each unique word to an integer index
word_indexing
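A sketch of the vocabulary step on a tiny corpus may help here. It assumes sorted ordering for reproducible indices, and the names `vocab`, `word_to_index` and `index_to_word` are illustrative. It also shows what happens when the two strings are concatenated without a separating space.

```python
X_demo = "<start> human society family"
Y_demo = "human society family <end>"

# Without a separator the last word of X fuses with the first word of Y.
fused = (X_demo + Y_demo).split(" ")
print("familyhuman" in fused)  # True

# Joining with a space keeps the two tokens apart.
tokens = (X_demo + " " + Y_demo).split(" ")
vocab = sorted(set(tokens))

word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for word, i in word_to_index.items()}
print(word_to_index)
```

The inverse mapping is handy later for turning a predicted index back into a word.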

In [None]:
# One-hot encode the input tokens and index-encode the target tokens.
def one_hot_encoding(X, Y, word_indexing):
  x_tokens = X.split(' ')
  y_tokens = Y.split(' ')

  # One row per input token, one column per vocabulary word.
  ohe_matrix = np.zeros((len(x_tokens), len(word_indexing)))
  for i, word in enumerate(x_tokens):
    ohe_matrix[i, word_indexing[word]] = 1

  # Targets stay as integer indices; they are one-hot encoded in a later cell.
  target = [word_indexing[word] for word in y_tokens]

  return ohe_matrix, np.asarray(target)

x,y = one_hot_encoding(X,Y,word_indexing)
x,y

In [None]:
x.shape, y.shape
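The one-hot matrix has one row per input token and one column per vocabulary word. As an alternative sketch, not the notebook's function, the same matrix can be built by indexing an identity matrix; the small `word_to_index` mapping below is made up for illustration.

```python
import numpy as np

# Hypothetical five-word vocabulary.
word_to_index = {"<end>": 0, "<start>": 1, "family": 2, "human": 3, "society": 4}
X_tokens = ["<start>", "human", "society", "family"]

indices = np.array([word_to_index[token] for token in X_tokens])

# Row k of the identity matrix is the one-hot vector for index k.
one_hot = np.eye(len(word_to_index))[indices]
print(one_hot.shape)
print(one_hot)
```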

In [57]:
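y = tf.keras.utils.to_categorical(y, num_classes=len(word_indexing))  # fix the width to the vocabulary size instead of inferring it from max(y) + 1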

In [None]:
y.shape
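When no class count is given, `to_categorical` infers it as the largest label plus one. The NumPy sketch below mimics that behaviour (`to_one_hot` is a hypothetical helper, not a Keras function) to show how the width of `y` can come out smaller than the vocabulary when the highest index never occurs as a target.

```python
import numpy as np

def to_one_hot(labels, num_classes=None):
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = int(labels.max()) + 1   # inferred from the data
    out = np.zeros((labels.size, num_classes), dtype="float32")
    out[np.arange(labels.size), labels] = 1.0
    return out

labels = [3, 0, 2]   # suppose index 4 exists in the vocabulary but is never a target

inferred = to_one_hot(labels)
explicit = to_one_hot(labels, num_classes=5)
print(inferred.shape, explicit.shape)
```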

This is a dummy model; its purpose is to show how an embedding is learned and how algorithms such as Word2Vec work. The same idea carries over, loosely rather than exactly, to BERT-style models.
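The core intuition can be sketched in a few lines: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, and that row acts as the word's vector. The example below is an illustration under simplifying assumptions (a single bias-free layer and random weights `W` standing in for trained ones), not the model defined in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 5, 3

# Random stand-in for a trained weight matrix (assumption).
W = rng.normal(size=(vocab_size, embedding_size))

# One-hot vector for the word with index 2.
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0

# The matrix product picks out row 2 of W.
projected = one_hot @ W
print(np.allclose(projected, W[2]))  # True
```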

In [None]:
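embedding_size = 10

inputs = Input(shape=(x.shape[1],))  # one-hot vector over the vocabulary
o = Dense(x.shape[0])(inputs)  # hidden layer with one unit per input token
o = Dense(embedding_size)(o)  # layer treated as the embedding; its kernel has shape (x.shape[0], embedding_size)
o = Dense(y.shape[1], activation='softmax')(o)  # probability of each vocabulary word being the next word

model = tf.keras.models.Model(inputs=inputs, outputs=o)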

model.summary()

In [87]:
model.compile(optimizer='adam', loss = 'categorical_crossentropy', metrics=['accuracy'])

In [None]:
model.fit(x=x , y = y , batch_size = 1 , epochs=10)

In [None]:
len(model.get_weights())  # 3 Dense layers, each with a kernel and a bias, so get_weights() returns 6 arrays

In [None]:
model.get_weights()[0]

In [None]:
# Print the shape of every weight array: a kernel and a bias for each of the three layers.
for weights in model.get_weights():
  print(weights.shape)
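The shapes printed above follow directly from the layer sizes: each Dense layer contributes a kernel of shape (inputs, units) and a bias of shape (units,). A small sketch of that bookkeeping, where `dense_weight_shapes` is a hypothetical helper and the vocabulary size of 67 is an assumed value for illustration:

```python
def dense_weight_shapes(input_dim, units_per_layer):
    """Return the kernel and bias shapes for a stack of Dense layers, in order."""
    shapes = []
    fan_in = input_dim
    for units in units_per_layer:
        shapes.append((fan_in, units))  # kernel
        shapes.append((units,))         # bias
        fan_in = units
    return shapes

# Assumed sizes: 67-word vocabulary, 75 input tokens, embedding_size of 10.
shapes = dense_weight_shapes(67, [75, 10, 67])
for shape in shapes:
    print(shape)
```

Under these assumptions, index 2 in the list is the (75, 10) kernel of the middle layer.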

In [None]:
# Take the kernel of the layer just before the output layer (index 2 in get_weights()). Why?
# Because I have defined this layer as the embedding layer, which gives me the flexibility
# to change the dimension of the embedding matrix.
word_to_vec = model.get_weights()[2]
word_to_vec.shape

In [None]:
len(X.split(' ')) 
# This is the total number of words in my corpus.
# The next step gets a vector for each word, trained with a
# neural network in the spirit of Word2Vec.

In [103]:
# We have the weights for each word from model.get_weights().
# Now store them as vectors in a dictionary, like GloVe vectors.
embedding_matrix = model.get_weights()[2]
print('The shape of the embedding matrix: ', embedding_matrix.shape)
word_to_vec = {word: embedding_matrix[i] for i, word in enumerate(X.split(' '))}
word_to_vec['adult']

The shape of the embedding matrix:  (75, 10)


array([ 2.1157727e-01,  9.1998696e-02,  4.2216954e-01, -3.8947314e-02,
        3.0391186e-01,  3.4180036e-01,  7.0480369e-02,  3.4955460e-01,
        3.6477763e-04, -3.9406410e-01], dtype=float32)

The size of each vector is determined by the embedding_size parameter passed to the model defined above.
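A sketch of the dictionary step with a random matrix standing in for the trained weights (an assumption, as are the names `matrix` and `word_to_vec_demo`). It shows that every vector has length `embedding_size`, and that a word occurring more than once in the token list keeps only the row of its last occurrence.

```python
import numpy as np

tokens = ["<start>", "human", "society", "human"]   # "human" appears twice
embedding_size = 4

rng = np.random.default_rng(0)
# Random stand-in for model.get_weights()[2]: one row per token.
matrix = rng.normal(size=(len(tokens), embedding_size))

# Later rows overwrite earlier ones for repeated words.
word_to_vec_demo = {word: matrix[i] for i, word in enumerate(tokens)}

print(len(word_to_vec_demo))            # number of distinct words
print(word_to_vec_demo["human"].shape)  # (embedding_size,)
```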