
# GloVe Word Embedding in NLP


1. Importing Libraries

We import the libraries needed for text processing and numerical operations.

Tokenizer (from tensorflow.keras.preprocessing.text) splits text into tokens and maps each token to an integer index. pad_sequences (from tensorflow.keras.preprocessing.sequence) pads or truncates sequences of token indices to a common length.

numpy handles the numerical operations, in particular creating and filling the embedding matrix.





In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
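
pad_sequences is imported above but not used in the later steps, so here is a rough sketch of what padding does. The helper pad_left is hypothetical and not part of Keras; it only mimics the default behaviour of pad_sequences (zeros added at the front), assuming no truncation is needed.

```python
# Hypothetical helper, not a Keras function: left-pads integer sequences with 0
# so that every sequence has the length of the longest one.
def pad_left(sequences, pad_value=0):
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]

# Made-up sequences of token indices
sequences = [[1, 2], [3, 4, 5, 6], [2]]
padded = pad_left(sequences)
print(padded)
```

Because 0 is used as the filler value, it should not also stand for a real word, which is why the embedding matrix built later keeps row 0 free.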

2. Creating Vocabulary

We define a list of words, texts, from which to build the vocabulary. This list is our small sample corpus, which the tokenizer processes in the next step.






In [None]:
texts = ['text', 'the', 'leader', 'prime', 'natural', 'language']

3. Initializing and Fitting the Tokenizer

We initialize a Tokenizer object and fit it on the texts corpus to build a dictionary that maps each unique word to an integer index.

The fit_on_texts method processes the corpus and generates the word-to-index mapping. Indices start at 1; index 0 is reserved and is never assigned to a word.

tokenizer.word_index holds the resulting dictionary.

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

print("Number of unique words in dictionary = ", len(tokenizer.word_index))
print("Dictionary is = " , tokenizer.word_index)

Number of unique words in dictionary =  6
Dictionary is =  {'text': 1, 'the': 2, 'leader': 3, 'prime': 4, 'natural': 5, 'language': 6}
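
To make the mapping less of a black box, the sketch below rebuilds a similar dictionary in plain Python. build_word_index is a hypothetical helper, not the Keras implementation: it assumes lowercasing, whitespace splitting, and ranking by frequency, and it ignores the punctuation filtering that Tokenizer also applies.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: map each word to a 1-based index, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

texts = ['text', 'the', 'leader', 'prime', 'natural', 'language']
word_index = build_word_index(texts)
print(word_index)
```

Every word here occurs once, so the indices simply follow the order of the list, starting at 1.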


4. Defining a Function to Create Embedding Matrix

We define the function embedding_for_vocab, which loads pre-trained GloVe word vectors and builds an embedding matrix for the vocabulary. It takes three arguments:

filepath: Path to the GloVe file.

word_index: The dictionary created by the tokenizer, mapping words to indices.

embedding_dim: The dimensionality of the word vectors (e.g., 50-dimensional vectors).

Inside the function:

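We initialize a matrix of zeros with shape (vocab_size, embedding_dim), where vocab_size is the number of words plus one. The extra row is row 0, which no word uses because tokenizer indices start at 1; it is conventionally left for the padding token.

We read the GloVe file line by line. Each line holds a word followed by its vector components. If the word is in the tokenizer's word index, we copy its vector into the matrix row given by that index. Words that never appear in the GloVe file keep their all-zero rows.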






In [None]:
def embedding_for_vocab(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]
    return embedding_matrix
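
Before running the function on the full GloVe file, it can be tried on a tiny file in the same "word followed by numbers" layout. The sketch below repeats the function so that it runs on its own; the three-dimensional vectors, the toy word index, and the temporary file are made up for illustration.

```python
import os
import tempfile
import numpy as np

def embedding_for_vocab(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]
    return embedding_matrix

# Made-up 3-dimensional vectors in GloVe text layout
toy_lines = ["text 0.1 0.2 0.3\n", "the 0.4 0.5 0.6\n", "unused 0.7 0.8 0.9\n"]
# "leader" is deliberately missing from the toy file
toy_word_index = {"text": 1, "the": 2, "leader": 3}

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.writelines(toy_lines)
    toy_path = tmp.name

toy_matrix = embedding_for_vocab(toy_path, toy_word_index, embedding_dim=3)
os.remove(toy_path)

print(toy_matrix.shape)
print(toy_matrix)
```

Row 0 (padding) and row 3 ("leader", absent from the toy file) stay at zero, while rows 1 and 2 receive the vectors for "text" and "the".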

5. Downloading GloVe File

We download the GloVe dataset from Stanford's NLP download server. The glove.6B.zip archive contains pre-trained word embeddings in several dimensionalities; we use the 50-dimensional file (glove.6B.50d.txt).

!wget is used to download the file.

!unzip is used to extract the zipped GloVe file.





In [None]:
!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip

--2025-12-25 19:38:05--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2025-12-25 19:40:44 (5.17 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]
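
Since only glove.6B.50d.txt is needed, one alternative to unzipping the whole archive is to extract a single member with Python's zipfile module. This is a sketch, not the method used in this notebook: extract_member is a hypothetical helper, and the demo builds a tiny stand-in archive so that it runs without the 822 MB download.

```python
import os
import tempfile
import zipfile

def extract_member(zip_path, member, dest_dir="."):
    """Hypothetical helper: extract one named file from a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.extract(member, path=dest_dir)

# Stand-in archive with made-up contents, mimicking the GloVe file names
with tempfile.TemporaryDirectory() as workdir:
    demo_zip = os.path.join(workdir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("glove.6B.50d.txt", "the 0.1 0.2\n")
        archive.writestr("glove.6B.100d.txt", "the 0.1 0.2 0.3\n")
    extracted_path = extract_member(demo_zip, "glove.6B.50d.txt", dest_dir=workdir)
    extracted_names = sorted(n for n in os.listdir(workdir) if n.endswith(".txt"))

print(extracted_names)
```

With the real download, the call would presumably be extract_member('glove.6B.zip', 'glove.6B.50d.txt').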



6. Loading GloVe Embeddings and Creating a Matrix

We set the embedding dimension to 50, matching glove.6B.50d.txt, and pass the path to that file to embedding_for_vocab, which loads the GloVe vectors and builds the embedding matrix for our vocabulary.






In [None]:
embedding_dim = 50  # must match the dimensionality of glove.6B.50d.txt
embedding_matrix = embedding_for_vocab('glove.6B.50d.txt', tokenizer.word_index, embedding_dim)

7. Accessing Embedding Vector for a Word

We access the embedding vector for a specific word through its tokenizer index. Here we look up index 1, which corresponds to the word "text" in the vocabulary.






In [None]:
first_word_index = 1
print("Dense vector for word with index 1=> ", embedding_matrix[first_word_index])

Dense vector for word with index 1=>  [ 0.32615     0.36686    -0.0074905  -0.37553     0.66715002  0.21646
 -0.19801    -1.10010004 -0.42221001  0.10574    -0.31292     0.50953001
  0.55774999  0.12019     0.31441    -0.25042999 -1.06369996 -1.32130003
  0.87797999 -0.24627     0.27379    -0.51091999  0.49324     0.52243
  1.16359997 -0.75322998 -0.48052999 -0.11259    -0.54595    -0.83920997
  2.98250008 -1.19159997 -0.51958001 -0.39365    -0.1419     -0.026977
  0.66295999  0.16574    -1.1681      0.14443     1.63049996 -0.17216
 -0.17436001 -0.01049    -0.17794     0.93076003  1.0381      0.94265997
 -0.14805    -0.61109   ]
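
Looking a vector up by word rather than by a hard-coded index can be less error-prone, and checking for all-zero rows shows which words were not found in the GloVe file. The sketch below uses small made-up stand-ins for tokenizer.word_index and embedding_matrix; vector_for is a hypothetical helper.

```python
import numpy as np

# Made-up stand-ins for tokenizer.word_index and embedding_matrix
word_index = {"text": 1, "the": 2, "leader": 3}
embedding_matrix = np.array([
    [0.0, 0.0, 0.0],  # row 0: padding
    [0.1, 0.2, 0.3],  # "text"
    [0.4, 0.5, 0.6],  # "the"
    [0.0, 0.0, 0.0],  # "leader": pretend it was missing from the vector file
])

def vector_for(word, word_index, embedding_matrix):
    """Hypothetical helper: return the row for `word`, or None if it is not indexed."""
    idx = word_index.get(word)
    return None if idx is None else embedding_matrix[idx]

# Words whose rows are still all zeros received no pre-trained vector
missing = [w for w, i in word_index.items() if not embedding_matrix[i].any()]

print(vector_for("text", word_index, embedding_matrix))
print(missing)
```

With the real objects from this notebook, the same helper would be called as vector_for("text", tokenizer.word_index, embedding_matrix).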
