<a href="https://colab.research.google.com/github/Swap1984/GloVe-Embeddings-NLP-Vectorization-Pipeline/blob/main/GloVe_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **GloVe Embedding Method**

**What Are GloVe Embeddings?**

GloVe (Global Vectors for Word Representation) is a **word embedding technique developed at Stanford, for which widely used pre-trained vectors are available.** It is based on the idea that the meaning of a word is captured by how often it co-occurs with other words in a large corpus. **GloVe embeddings are trained on a global word co-occurrence matrix, which lets them reflect both local context and corpus-wide statistics.**

**Advantages of GloVe Embeddings:**

Efficient Pre-trained Embeddings:
You can load pre-trained embeddings to save time, rather than training from scratch.

Captures Word Meaning Well:
GloVe embeddings capture semantic relationships, e.g., the vector for "king" - "man" + "woman" is close to the vector for "queen."

Available in Multiple Sizes:
Pre-trained models are available with different dimensions (e.g., 50D, 100D, 300D), allowing you to choose based on your requirements.

Uses Global Context:
GloVe is trained on corpus-wide co-occurrence counts, whereas Word2Vec learns from local context windows. It is still a static embedding, so it does not model context the way contextual models (like BERT) do.
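The analogy property mentioned above can be sketched with a few hand-made vectors. The values below are invented for illustration and are not real GloVe vectors; `solve_analogy` and `cosine_similarity` are hypothetical helper names.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only (not real GloVe values)
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

print(solve_analogy("king", "man", "woman", toy_vectors))  # queen
```

With real pre-trained vectors the same arithmetic applies, but the search runs over the whole vocabulary rather than five words.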

**Disadvantages of GloVe Embeddings:**

Static Embeddings:
GloVe assigns the same vector to a word, irrespective of the context. For example, the word "bank" will have the same vector whether referring to a financial institution or a riverbank.

Pre-trained on Specific Corpora:
 The pre-trained GloVe embeddings are trained on certain corpora (e.g., Wikipedia, Common Crawl), which may not always match your dataset's vocabulary.

Cannot Handle Out-of-Vocabulary (OOV) Words:
Words that were not present in the training data have no entry in the pre-trained GloVe embeddings, so domain-specific terms may have no vector at all.
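One way to gauge how much the out-of-vocabulary limitation matters for a given dataset is to measure the share of tokens that have no vector. The sketch below assumes the vocabulary is available as a set or dictionary of words; `oov_report` and the toy inputs are hypothetical.

```python
def oov_report(tokens, vocabulary):
    """Return (oov_rate, sorted list of distinct out-of-vocabulary tokens)."""
    missing = [t for t in tokens if t not in vocabulary]
    rate = len(missing) / len(tokens) if tokens else 0.0
    return rate, sorted(set(missing))

# Hypothetical vocabulary and tokens, for illustration only
toy_vocab = {"the", "patient", "was", "given"}
toy_tokens = ["the", "patient", "was", "given", "pembrolizumab"]

rate, missing = oov_report(toy_tokens, toy_vocab)
print(f"OOV rate: {rate:.0%}, missing: {missing}")
```

The same function works with a loaded GloVe dictionary as `vocabulary`, since membership tests on a dictionary check its keys.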

**Applications of GloVe Embeddings:**

Text Classification:
Word embeddings from GloVe can be used as input features for text classifiers.

Semantic Search:
Use embeddings to compute similarity between queries and documents.

Named Entity Recognition (NER):
 Word embeddings help in recognizing entities by capturing semantic relationships between words.

Sentiment Analysis:
Pre-trained embeddings can improve sentiment classifiers by giving them word meaning learned from a much larger corpus than the labelled data.
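As a rough illustration of the semantic search use case, a query and each document can be reduced to the average of their word vectors and then ranked by cosine similarity. Averaging is only one simple pooling choice, and the 2-dimensional vectors and helper names below are made up for the example.

```python
import numpy as np

# Toy 2-dimensional vectors invented for illustration (not real GloVe values)
toy_embeddings = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([0.9, 0.2]),
    "stock": np.array([0.1, 1.0]),
    "market": np.array([0.2, 0.9]),
}

def sentence_vector(tokens, embeddings, dim=2):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def cosine_similarity(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

documents = {"pets": ["cat", "dog"], "finance": ["stock", "market"]}
query_vec = sentence_vector(["dog"], toy_embeddings)

# Score every document against the query and pick the best match
scores = {
    name: cosine_similarity(query_vec, sentence_vector(tokens, toy_embeddings))
    for name, tokens in documents.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```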




# **GloVe Embeddings Pretrained model**

In [63]:
# Import the required libraries
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize  # Ensure you have NLTK installed
import string
nltk.download('punkt')  # Download the punkt tokenizer
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [33]:
# Use a paragraph as the input text
data = """Yes, life is full, there is life even underground,” he began again. “You wouldn’t believe, Alexey, how I want to live now, what a thirst for existence and consciousness has sprung up in me within these peeling walls… And what is suffering? I am not afraid of it, even if it were beyond reckoning. I am not afraid of it now. I was afraid of it before… And I seem to have such strength in me now, that I think I could stand anything, any suffering, only to be able to say and to repeat to myself every moment, ‘I exist.’ In thousands of agonies — I exist. I’m tormented on the rack — but I exist! Though I sit alone on a pillar — I exist! I see the sun, and if I don’t see the sun, I know it’s there. And there’s a whole life in that, in knowing that the sun is there."""

In [3]:
# Download the GloVe 6B vectors (around 822 MB)
!wget http://nlp.stanford.edu/data/glove.6B.zip


--2025-11-20 08:08:07--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-11-20 08:08:07--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-11-20 08:08:08--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [4]:
# Unzip the file
!unzip glove.6B.zip

Archive:  glove.6B.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of glove.6B.zip or
        glove.6B.zip.zip, and cannot find glove.6B.zip.ZIP, period.
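The `unzip` message above (no end-of-central-directory signature) usually points to an incomplete or corrupted download. A minimal sketch of a check that could run before extraction, using Python's standard `zipfile` module; the helper name `is_complete_zip` and the temporary files are illustrative stand-ins for `glove.6B.zip`.

```python
import os
import tempfile
import zipfile

def is_complete_zip(path):
    """Return True if the file exists and can be recognised as a zip archive."""
    return os.path.exists(path) and zipfile.is_zipfile(path)

# Build a small valid archive in a temporary directory
tmp_dir = tempfile.mkdtemp()
good_zip = os.path.join(tmp_dir, "good.zip")
with zipfile.ZipFile(good_zip, "w") as zf:
    zf.writestr("vectors.txt", "the 0.1 0.2\n")

# Simulate an interrupted download by keeping only the first bytes of the archive
truncated_zip = os.path.join(tmp_dir, "truncated.zip")
with open(good_zip, "rb") as src, open(truncated_zip, "wb") as dst:
    dst.write(src.read()[:20])

print(is_complete_zip(good_zip))       # True
print(is_complete_zip(truncated_zip))  # False
```

If the check fails for the real archive, downloading the file again before unzipping is the usual remedy.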


In [18]:
# Define the file path for GloVe pre-trained embeddings
glove_file = r'/content/glove.6B.200d.txt'  # Example for 200-dimensional embeddings


In [64]:
# Load GloVe embeddings
def load_glove_embeddings(glove_file_path):
    embeddings_index = {}
    with open(glove_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip() == "":
                continue
            values = line.split()
            word = values[0]
            try:
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
            except ValueError:
                continue
    return embeddings_index
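To make the file layout concrete: each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below runs a compact variant of the loader on a tiny made-up file with 3 dimensions instead of 200; the name `load_glove_text` and the sample values are hypothetical.

```python
import os
import tempfile
import numpy as np

def load_glove_text(path):
    """Parse a GloVe-style text file into a {word: vector} dictionary."""
    index = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            if len(values) < 2:
                continue  # skip blank or malformed lines
            index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index

# Tiny made-up file in the same layout as the real one
sample = "the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

toy = load_glove_text(tmp.name)
os.remove(tmp.name)
print(sorted(toy), toy["cat"].shape)
```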




In [65]:
# Function to convert tokens to embeddings
def text_to_embeddings(tokens, glove_embeddings, embedding_dim=200):
    embeddings = []
    for word in tokens:
        if word in glove_embeddings:
            embeddings.append(glove_embeddings[word])
        else:
            # Append a zero vector of the matching dimension if the word is not found
            embeddings.append(np.zeros((embedding_dim,)))
    return np.array(embeddings, dtype='float32')

In [66]:
# Example usage
glove_file_path = '/content/glove.6B.200d.txt'  # Replace with the path to your GloVe file
glove_embeddings = load_glove_embeddings(glove_file_path)

In [67]:
# Tokenize the input data
tokens = word_tokenize(data.lower())  # Tokenize and convert to lowercase
print(f"Tokens: {tokens}")

Tokens: ['yes', ',', 'life', 'is', 'full', ',', 'there', 'is', 'life', 'even', 'underground', ',', '”', 'he', 'began', 'again', '.', '“', 'you', 'wouldn', '’', 't', 'believe', ',', 'alexey', ',', 'how', 'i', 'want', 'to', 'live', 'now', ',', 'what', 'a', 'thirst', 'for', 'existence', 'and', 'consciousness', 'has', 'sprung', 'up', 'in', 'me', 'within', 'these', 'peeling', 'walls…', 'and', 'what', 'is', 'suffering', '?', 'i', 'am', 'not', 'afraid', 'of', 'it', ',', 'even', 'if', 'it', 'were', 'beyond', 'reckoning', '.', 'i', 'am', 'not', 'afraid', 'of', 'it', 'now', '.', 'i', 'was', 'afraid', 'of', 'it', 'before…', 'and', 'i', 'seem', 'to', 'have', 'such', 'strength', 'in', 'me', 'now', ',', 'that', 'i', 'think', 'i', 'could', 'stand', 'anything', ',', 'any', 'suffering', ',', 'only', 'to', 'be', 'able', 'to', 'say', 'and', 'to', 'repeat', 'to', 'myself', 'every', 'moment', ',', '‘', 'i', 'exist.', '’', 'in', 'thousands', 'of', 'agonies', '—', 'i', 'exist', '.', 'i', '’', 'm', 'tormented

In [68]:
# Convert tokens to embeddings
embeddings_array = text_to_embeddings(tokens, glove_embeddings)

In [69]:
# Check the shape of the embeddings
print(f"Shape of the embeddings array: {embeddings_array.shape}")

Shape of the embeddings array: (193, 200)


In [70]:
# Printing the first 5 embeddings for the first 5 tokens of the data
print(embeddings_array[:5])


[[ 2.2042e-01  2.4924e-01  1.7393e-01 -1.6275e-02 -3.0916e-01  2.6610e-01
  -7.4482e-01  8.3853e-02  3.2016e-01  6.5026e-01 -3.0558e-01  1.6842e-02
   4.8508e-02 -2.9494e-01 -7.8175e-01  4.1424e-01 -2.1490e-01  5.3132e-01
   4.8179e-01  1.5397e-01  3.0971e-01  1.4896e+00 -8.2240e-02  3.2086e-01
   2.8866e-02 -2.9613e-01 -1.3558e-01 -7.4538e-02  2.3826e-01 -5.3784e-01
  -4.1572e-01 -1.3477e-01  1.5444e-01 -1.2929e-01 -3.4317e-01 -1.5537e-01
   3.3685e-02 -3.5089e-01 -2.2403e-01  4.9218e-01 -1.4750e-01  3.4514e-03
  -2.4207e-01 -1.0924e-01  5.7027e-03  1.2135e-01  6.0403e-01 -1.7868e-01
  -3.9604e-01  5.6209e-02  5.1824e-01 -3.9824e-01  1.3078e-01  1.3691e-01
  -1.5311e-01 -4.9687e-02  4.9536e-02 -3.3107e-02 -1.2196e-02 -3.2973e-01
   2.5021e-01 -3.2097e-01  1.4448e-01  5.5550e-02  6.1817e-02 -7.5844e-02
  -1.9434e-01  6.2882e-01  8.3724e-01  1.1561e-01  1.9471e-01  3.9750e-03
   1.7576e-02 -3.6561e-01 -4.1066e-01 -4.7927e-01 -5.2158e-01  2.1736e-01
  -5.6578e-01 -1.5558e-01  3.5471e-01 

In [71]:
# Function to retrieve the embedding (falls back to a zero vector for unknown tokens)
def get_embedding(token, glove_model, embedding_dim=200):  # must match the loaded 200-d vectors
    return glove_model.get(token, np.zeros(embedding_dim))


In [72]:
# Retrieve embeddings for the tokens
embeddings_array = [get_embedding(token, glove_embeddings) for token in tokens]

In [73]:
# Check the output
for token, embedding in zip(tokens, embeddings_array):
    print(f"Token: '{token}', Embedding: {embedding[:5]}")  # Print first 5 dimensions of the embedding

Token: 'yes', Embedding: [ 0.22042   0.24924   0.17393  -0.016275 -0.30916 ]
Token: ',', Embedding: [ 0.17651    0.29208   -0.0020768 -0.37523    0.0049139]
Token: 'life', Embedding: [ 0.34098   0.41888  -0.31878   0.031399  0.047223]
Token: 'is', Embedding: [ 0.32928   0.25526   0.26753  -0.084809  0.29764 ]
Token: 'full', Embedding: [ 0.29667 -0.19041 -0.16734 -0.23067  0.03546]
Token: ',', Embedding: [ 0.17651    0.29208   -0.0020768 -0.37523    0.0049139]
Token: 'there', Embedding: [ 0.66193   0.16192  -0.090129 -0.59287   0.15391 ]
Token: 'is', Embedding: [ 0.32928   0.25526   0.26753  -0.084809  0.29764 ]
Token: 'life', Embedding: [ 0.34098   0.41888  -0.31878   0.031399  0.047223]
Token: 'even', Embedding: [ 0.44802   0.16025  -0.23372  -0.054205 -0.067149]
Token: 'underground', Embedding: [-0.20773 -0.26716 -0.64454 -0.14187  0.3224 ]
Token: ',', Embedding: [ 0.17651    0.29208   -0.0020768 -0.37523    0.0049139]
Token: '”', Embedding: [ 0.10706   0.25534   0.036386 -0.01086  -

**The output shows that NLTK treats punctuation marks as separate tokens, and that these tokens receive specific, non-zero vectors rather than the zero-vector fallback. This means the pre-trained GloVe vocabulary contains entries for punctuation such as `,` and `”`; the zero vector is returned only for tokens that are missing from the vocabulary.**

To work with word tokens only, we remove the punctuation first.
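One caveat for punctuation removal: Python's `string.punctuation` covers only ASCII characters, so a filter based on it would leave the curly quotes, ellipsis, and dash that appear in this text. A minimal sketch of a Unicode-aware check using the standard `unicodedata` module; the helper name `is_punctuation` and the sample token list are illustrative.

```python
import unicodedata

def is_punctuation(token):
    """True if every character in the token is a Unicode punctuation character."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

# A few tokens of the kind produced by word_tokenize on the paragraph
sample_tokens = ["yes", ",", "life", "”", "i", "’", "m", "—", "exist", "."]
word_tokens = [t for t in sample_tokens if not is_punctuation(t)]
print(word_tokens)  # ['yes', 'life', 'i', 'm', 'exist']
```

Tokens that fuse a word with punctuation, such as `walls…` in the tokenizer output above, are kept by this check because they also contain letters.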

In [74]:
# Cleaning the data: split the paragraph into sentences, then drop punctuation tokens
from nltk.tokenize import sent_tokenize

cleaned_data = []
for sentence in sent_tokenize(data):  # iterating over `data` directly would yield single characters
    tokens = word_tokenize(sentence)
    tokens = [w for w in tokens if w not in string.punctuation]
    cleaned_data.append(" ".join(tokens))

In [75]:
cleaned_data = [sentence.lower() for sentence in cleaned_data]


In [76]:
# Passing the cleaned data to the tokenizer
tokens = word_tokenize(cleaned_data)
print(f"Tokens: {tokens}")

TypeError: expected string or bytes-like object, got 'list'

This fails because word_tokenize() expects a single string, but cleaned_data is a list of sentences.

In [77]:
# Build a single list of all tokens
all_tokens = []

for sentence in cleaned_data:
    all_tokens.extend(word_tokenize(sentence))

print(all_tokens)

A simple way to build a vocabulary from the text is the Keras Tokenizer, which maps each word to an integer index.

In [78]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])   # expects a list of texts; a bare string would be read character by character
word_index = tokenizer.word_index

In [79]:
# Printing the tokens with their index
for word, index in word_index.items():
    print(word, "→", index)


A simple way to inspect the tokens of each sentence:

In [80]:
for sent in cleaned_data:
    print("Sentence:", sent)
    print("Tokens:", word_tokenize(sent))
    print()

In [81]:
# Mapping words to GloVe embeddings
def get_embedding_matrix(word_index, glove_embeddings, embedding_dim):
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    for word, i in word_index.items():
        embedding_vector = glove_embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

In [82]:
embedding_dim = 200
embedding_matrix = get_embedding_matrix(word_index, glove_embeddings, embedding_dim)
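To see what the embedding matrix contains, here is a small sketch with made-up inputs. It repeats the function so that the block is self-contained; the toy word index and 2-dimensional vectors are invented for illustration. Row 0 stays all zeros because Keras word indices start at 1, and words without a GloVe vector also keep a zero row.

```python
import numpy as np

def get_embedding_matrix(word_index, glove_embeddings, embedding_dim):
    vocab_size = len(word_index) + 1  # index 0 is reserved (e.g. for padding)
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = glove_embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

# Toy inputs invented for illustration; "alexey" has no vector here
toy_word_index = {"life": 1, "sun": 2, "alexey": 3}
toy_glove = {"life": np.array([0.1, 0.2]), "sun": np.array([0.3, 0.4])}

matrix = get_embedding_matrix(toy_word_index, toy_glove, embedding_dim=2)

# Looking up a sequence of word indices returns one row per token
token_ids = [1, 3, 2]
sequence_vectors = matrix[token_ids]
print(matrix.shape, sequence_vectors.shape)  # (4, 2) (3, 2)
```

A matrix of this shape is what an embedding layer typically takes as its initial weights, with row `i` holding the vector for the word whose index is `i`.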


In [83]:
# Printing the first 5 embeddings for the first 5 tokens of the data
print(embeddings_array[:5])


[array([ 0.22042  ,  0.24924  ,  0.17393  , -0.016275 , -0.30916  ,
        0.2661   , -0.74482  ,  0.083853 ,  0.32016  ,  0.65026  ,
       -0.30558  ,  0.016842 ,  0.048508 , -0.29494  , -0.78175  ,
        0.41424  , -0.2149   ,  0.53132  ,  0.48179  ,  0.15397  ,
        0.30971  ,  1.4896   , -0.08224  ,  0.32086  ,  0.028866 ,
       -0.29613  , -0.13558  , -0.074538 ,  0.23826  , -0.53784  ,
       -0.41572  , -0.13477  ,  0.15444  , -0.12929  , -0.34317  ,
       -0.15537  ,  0.033685 , -0.35089  , -0.22403  ,  0.49218  ,
       -0.1475   ,  0.0034514, -0.24207  , -0.10924  ,  0.0057027,
        0.12135  ,  0.60403  , -0.17868  , -0.39604  ,  0.056209 ,
        0.51824  , -0.39824  ,  0.13078  ,  0.13691  , -0.15311  ,
       -0.049687 ,  0.049536 , -0.033107 , -0.012196 , -0.32973  ,
        0.25021  , -0.32097  ,  0.14448  ,  0.05555  ,  0.061817 ,
       -0.075844 , -0.19434  ,  0.62882  ,  0.83724  ,  0.11561  ,
        0.19471  ,  0.003975 ,  0.017576 , -0.36561  , -0.410

Thus we have successfully generated GloVe embeddings for our input data.