
# **GloVe Embedding Method**

**What Are GloVe Embeddings?**

GloVe (Global Vectors for Word Representation) is a **popular word embedding technique developed at Stanford, with pre-trained vectors freely available.** It is based on the idea that the meaning of a word is captured by the co-occurrence probabilities of that word with other words in a large corpus. **GloVe embeddings are trained on a global word co-occurrence matrix, which lets them capture both local context and corpus-wide statistics.**
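To make the co-occurrence idea concrete, here is a minimal sketch that counts how often word pairs appear near each other. The function name `cooccurrence_counts`, the toy sentence, and the window size are illustrative assumptions. Real GloVe training also down-weights distant pairs and then fits vectors to these counts, both of which this sketch omits.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each (word, context) pair appears within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (hypothetical); a real matrix is built from billions of tokens
toy_tokens = "ice is cold and steam is hot".split()
counts = cooccurrence_counts(toy_tokens, window=2)

print(counts[("ice", "cold")])  # pair seen inside the window
print(counts[("ice", "hot")])   # pair never seen together
```

Words that share many context words end up with similar rows in this matrix, which is the signal the embedding training exploits.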

**Advantages of GloVe Embeddings:**

Efficient Pre-trained Embeddings:
You can load pre-trained embeddings to save time, rather than training from scratch.

Captures Word Meaning Well:
GloVe embeddings capture semantic relationships, e.g., the vector for "king" – "man" + "woman" is close to "queen."

Available in Multiple Sizes:
Pre-trained models are available with different dimensions (e.g., 50D, 100D, 300D), allowing you to choose based on your requirements.

Handles Context:
While not as powerful as contextual embeddings (like BERT), GloVe is trained on corpus-wide co-occurrence statistics, whereas Word2Vec learns from local context windows only.
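The "king" – "man" + "woman" relationship mentioned above can be sketched with plain vector arithmetic and cosine similarity. The 3-dimensional vectors below are hand-made toy values, not real GloVe vectors, and `analogy` is a hypothetical helper. With real embeddings you would pass the loaded dictionary instead.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c], excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Toy vectors (assumed values chosen only for illustration)
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

print(analogy("king", "man", "woman", toy_vectors))
```

Excluding the three input words from the candidates is a common convention, since the nearest neighbour of the target is otherwise often one of the inputs.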

**Disadvantages of GloVe Embeddings:**

Static Embeddings:
GloVe assigns the same vector to a word, irrespective of the context. For example, the word "bank" will have the same vector whether referring to a financial institution or a riverbank.

Pre-trained on Specific Corpora:
 The pre-trained GloVe embeddings are trained on certain corpora (e.g., Wikipedia, Common Crawl), which may not always match your dataset's vocabulary.

Cannot Handle Out-of-Vocabulary (OOV) Words:
Words that were not present in the training data are not included in the pre-trained GloVe embeddings, making them harder to handle for domain-specific text.
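Before relying on pre-trained vectors for domain-specific text, it can help to measure how much of your vocabulary they cover. The sketch below uses a hypothetical helper `split_by_coverage` and a tiny stand-in dictionary. In practice you would pass the dictionary loaded from a GloVe file.

```python
def split_by_coverage(tokens, embeddings):
    """Separate tokens into those with a pre-trained vector and those without."""
    known = [t for t in tokens if t in embeddings]
    unknown = [t for t in tokens if t not in embeddings]
    return known, unknown

# Stand-in for a loaded GloVe dictionary (values are arbitrary)
toy_embeddings = {"the": [0.1, 0.2], "patient": [0.3, 0.4], "received": [0.5, 0.6]}
toy_tokens = ["the", "patient", "received", "pembrolizumab"]

known, unknown = split_by_coverage(toy_tokens, toy_embeddings)
coverage = len(known) / len(toy_tokens)
print(f"Coverage: {coverage:.0%}, OOV tokens: {unknown}")
```

A low coverage figure suggests either a different pre-trained set or a fallback strategy for unknown words, such as a shared "unknown" vector.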

**Applications of GloVe Embeddings:**

Text Classification:
Word embeddings from GloVe can be used as input features for text classifiers.

Semantic Search:
Use embeddings to compute similarity between queries and documents.

Named Entity Recognition (NER):
 Word embeddings help in recognizing entities by capturing semantic relationships between words.

Sentiment Analysis:
 Pre-trained embeddings improve the performance of sentiment classifiers by capturing meaning.
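As one illustration of these applications, the sketch below ranks documents against a query by averaging word vectors and comparing the averages with cosine similarity. Mean pooling is a simple baseline chosen here for brevity, not the only option. The 2-dimensional vectors, document names, and helper names are assumptions made for the example.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)  # no known tokens: fall back to a zero vector
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Toy embeddings (assumed values, not real GloVe vectors)
toy_embeddings = {
    "river": np.array([0.9, 0.1]),
    "water": np.array([0.8, 0.2]),
    "money": np.array([0.1, 0.9]),
    "loan":  np.array([0.2, 0.8]),
}
documents = {"doc_nature": ["river", "water"], "doc_finance": ["money", "loan"]}

query = sentence_vector(["water"], toy_embeddings, dim=2)
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(
        query, sentence_vector(documents[name], toy_embeddings, dim=2)
    ),
    reverse=True,
)
print(ranked)  # most similar document first
```

The same pooled vectors could also serve as input features for a text or sentiment classifier.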




# **GloVe Embeddings Pretrained model**

In [2]:
# initialising the libraries
import numpy as np
import re
from nltk.tokenize import word_tokenize  # Ensure you have NLTK installed
import nltk
nltk.download('punkt')  # Download the punkt tokenizer


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
# Use a paragraph of text as the input data
data = """Yes, life is full, there is life even underground,” he began again. “You wouldn’t believe, Alexey, how I want to live now, what a thirst for existence and consciousness has sprung up in me within these peeling walls… And what is suffering? I am not afraid of it, even if it were beyond reckoning. I am not afraid of it now. I was afraid of it before… And I seem to have such strength in me now, that I think I could stand anything, any suffering, only to be able to say and to repeat to myself every moment, ‘I exist.’ In thousands of agonies — I exist. I’m tormented on the rack — but I exist! Though I sit alone on a pillar — I exist! I see the sun, and if I don’t see the sun, I know it’s there. And there’s a whole life in that, in knowing that the sun is there."""

In [19]:
# Download the GloVe 6B vectors (around 822 MB)
!wget http://nlp.stanford.edu/data/glove.6B.zip


--2024-10-01 17:13:00--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-10-01 17:13:00--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-10-01 17:13:01--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip.1’


2

In [21]:
# Unzip the file
!unzip glove.6B.zip.1

Archive:  glove.6B.zip.1
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [22]:
# Define the file path for GloVe pre-trained embeddings
glove_file = r'/content/glove.6B.200d.txt'  # Example for 200-dimensional embeddings


In [23]:
# Load GloVe embeddings
def load_glove_embeddings(glove_file_path):
    embeddings_index = {}
    with open(glove_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip() == "":
                continue
            values = line.split()
            word = values[0]
            try:
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs
            except ValueError:
                continue
    return embeddings_index
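Loading the full file keeps every vector in memory. If only a small vocabulary is needed, one option is to keep just the matching lines. The sketch below assumes the same text format as the GloVe files (a word followed by its values on each line). `load_glove_subset` is a hypothetical helper, and the temporary file stands in for a real GloVe file.

```python
import os
import tempfile
import numpy as np

def load_glove_subset(glove_file_path, wanted_words):
    """Load vectors only for words in `wanted_words` to keep memory use low."""
    wanted = set(wanted_words)
    subset = {}
    with open(glove_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Split off the word first so unwanted lines are skipped cheaply
            word, _, rest = line.rstrip().partition(' ')
            if word in wanted:
                subset[word] = np.asarray(rest.split(), dtype='float32')
    return subset

# Tiny stand-in file in the same layout (values are made up)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("life 0.1 0.2 0.3\nsun 0.4 0.5 0.6\nrack 0.7 0.8 0.9\n")
    toy_path = tmp.name

subset = load_glove_subset(toy_path, ["life", "sun", "alexey"])
os.remove(toy_path)

print(sorted(subset))  # only requested words that exist in the file
```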




In [24]:
# Function to convert tokens to embeddings
def text_to_embeddings(tokens, glove_embeddings):
    # Infer the vector size from the loaded embeddings instead of hard-coding it
    embedding_dim = len(next(iter(glove_embeddings.values())))
    embeddings = []
    for word in tokens:
        if word in glove_embeddings:
            embeddings.append(glove_embeddings[word])
        else:
            # Append a zero vector of the same dimension if the word is not found
            embeddings.append(np.zeros(embedding_dim, dtype='float32'))
    return np.array(embeddings, dtype='float32')

In [25]:
# Example usage
glove_file_path = '/content/glove.6B.200d.txt'  # Replace with the path to your GloVe file
glove_embeddings = load_glove_embeddings(glove_file_path)

In [26]:
# Tokenize the input data
tokens = word_tokenize(data.lower())  # Tokenize and convert to lowercase
print(f"Tokens: {tokens}")

Tokens: ['yes', ',', 'life', 'is', 'full', ',', 'there', 'is', 'life', 'even', 'underground', ',', '”', 'he', 'began', 'again', '.', '“', 'you', 'wouldn', '’', 't', 'believe', ',', 'alexey', ',', 'how', 'i', 'want', 'to', 'live', 'now', ',', 'what', 'a', 'thirst', 'for', 'existence', 'and', 'consciousness', 'has', 'sprung', 'up', 'in', 'me', 'within', 'these', 'peeling', 'walls…', 'and', 'what', 'is', 'suffering', '?', 'i', 'am', 'not', 'afraid', 'of', 'it', ',', 'even', 'if', 'it', 'were', 'beyond', 'reckoning', '.', 'i', 'am', 'not', 'afraid', 'of', 'it', 'now', '.', 'i', 'was', 'afraid', 'of', 'it', 'before…', 'and', 'i', 'seem', 'to', 'have', 'such', 'strength', 'in', 'me', 'now', ',', 'that', 'i', 'think', 'i', 'could', 'stand', 'anything', ',', 'any', 'suffering', ',', 'only', 'to', 'be', 'able', 'to', 'say', 'and', 'to', 'repeat', 'to', 'myself', 'every', 'moment', ',', '‘', 'i', 'exist.', '’', 'in', 'thousands', 'of', 'agonies', '—', 'i', 'exist', '.', 'i', '’', 'm', 'tormented

In [27]:
# 3. Create the Embedding Matrix
def get_embedding_matrix(word_index, glove_embeddings, embedding_dim):
    vocab_size = len(word_index) + 1  # +1 for padding or unknowns
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    for word, i in word_index.items():
        embedding_vector = glove_embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix


In [28]:
# Convert tokens to embeddings
embeddings_array = text_to_embeddings(tokens, glove_embeddings)

In [29]:
# Check the shape of the embeddings
print(f"Shape of the embeddings array: {embeddings_array.shape}")

Shape of the embeddings array: (193, 200)


In [30]:
# Printing the first 5 embeddings for the first 5 tokens of the data
print(embeddings_array[:5])


[[ 2.2042e-01  2.4924e-01  1.7393e-01 -1.6275e-02 -3.0916e-01  2.6610e-01
  -7.4482e-01  8.3853e-02  3.2016e-01  6.5026e-01 -3.0558e-01  1.6842e-02
   4.8508e-02 -2.9494e-01 -7.8175e-01  4.1424e-01 -2.1490e-01  5.3132e-01
   4.8179e-01  1.5397e-01  3.0971e-01  1.4896e+00 -8.2240e-02  3.2086e-01
   2.8866e-02 -2.9613e-01 -1.3558e-01 -7.4538e-02  2.3826e-01 -5.3784e-01
  -4.1572e-01 -1.3477e-01  1.5444e-01 -1.2929e-01 -3.4317e-01 -1.5537e-01
   3.3685e-02 -3.5089e-01 -2.2403e-01  4.9218e-01 -1.4750e-01  3.4514e-03
  -2.4207e-01 -1.0924e-01  5.7027e-03  1.2135e-01  6.0403e-01 -1.7868e-01
  -3.9604e-01  5.6209e-02  5.1824e-01 -3.9824e-01  1.3078e-01  1.3691e-01
  -1.5311e-01 -4.9687e-02  4.9536e-02 -3.3107e-02 -1.2196e-02 -3.2973e-01
   2.5021e-01 -3.2097e-01  1.4448e-01  5.5550e-02  6.1817e-02 -7.5844e-02
  -1.9434e-01  6.2882e-01  8.3724e-01  1.1561e-01  1.9471e-01  3.9750e-03
   1.7576e-02 -3.6561e-01 -4.1066e-01 -4.7927e-01 -5.2158e-01  2.1736e-01
  -5.6578e-01 -1.5558e-01  3.5471e-01 

In [31]:
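# Function to retrieve the embedding
# embedding_dim must match the loaded GloVe file (200 here), otherwise the
# zero vectors for missing tokens would have a different length
def get_embedding(token, glove_model, embedding_dim=200):
    return glove_model.get(token, np.zeros(embedding_dim, dtype='float32'))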


In [32]:
# Retrieve embeddings for the tokens
embeddings_array = [get_embedding(token, glove_embeddings) for token in tokens]

In [33]:
# Check the output
for token, embedding in zip(tokens, embeddings_array):
    print(f"Token: '{token}', Embedding: {embedding[:5]}")  # Print first 5 dimensions of the embedding

Token: 'yes', Embedding: [ 0.22042   0.24924   0.17393  -0.016275 -0.30916 ]
Token: ',', Embedding: [ 0.17651    0.29208   -0.0020768 -0.37523    0.0049139]
Token: 'life', Embedding: [ 0.34098   0.41888  -0.31878   0.031399  0.047223]
Token: 'is', Embedding: [ 0.32928   0.25526   0.26753  -0.084809  0.29764 ]
Token: 'full', Embedding: [ 0.29667 -0.19041 -0.16734 -0.23067  0.03546]
Token: ',', Embedding: [ 0.17651    0.29208   -0.0020768 -0.37523    0.0049139]
Token: 'there', Embedding: [ 0.66193   0.16192  -0.090129 -0.59287   0.15391 ]
Token: 'is', Embedding: [ 0.32928   0.25526   0.26753  -0.084809  0.29764 ]
Token: 'life', Embedding: [ 0.34098   0.41888  -0.31878   0.031399  0.047223]
Token: 'even', Embedding: [ 0.44802   0.16025  -0.23372  -0.054205 -0.067149]
Token: 'underground', Embedding: [-0.20773 -0.26716 -0.64454 -0.14187  0.3224 ]
Token: ',', Embedding: [ 0.17651    0.29208   -0.0020768 -0.37523    0.0049139]
Token: '”', Embedding: [ 0.10706   0.25534   0.036386 -0.01086  -

**We see that NLTK treats punctuation marks as tokens. We expected GloVe to have vectors only for words and not for punctuation marks, yet even though we asked for a zero vector for tokens missing from the GloVe embeddings, the punctuation tokens come back with specific non-zero vectors.**

# Code with word index from tokens

In [39]:
# Create a word index (mapping words to unique indices)
word_index = {word: i for i, word in enumerate(set(tokens), start=0)}
print(f"Word Index: {word_index}")  # Debugging: See word index



Word Index: {'.': 0, 'rack': 1, 'pillar': 2, 'sun': 3, 't': 4, 'believe': 5, 'thousands': 6, 'full': 7, 'life': 8, 'such': 9, 'even': 10, 'knowing': 11, 'live': 12, 'suffering': 13, 'yes': 14, 'sprung': 15, 'consciousness': 16, 'anything': 17, 'you': 18, 'for': 19, '‘': 20, 'began': 21, 'reckoning': 22, 'i': 23, 'be': 24, 'stand': 25, 'alexey': 26, 'me': 27, 'on': 28, 'has': 29, 'up': 30, 'agonies': 31, 'sit': 32, 'though': 33, '?': 34, 'again': 35, 'peeling': 36, 'not': 37, 'was': 38, 'there': 39, 'know': 40, 'he': 41, ',': 42, 'exist': 43, '“': 44, 'seem': 45, 'these': 46, 'but': 47, 'see': 48, 'if': 49, 'were': 50, 'repeat': 51, 'wouldn': 52, 'any': 53, 'beyond': 54, 'every': 55, 'able': 56, 'want': 57, 'a': 58, 'of': 59, 'existence': 60, 'within': 61, '”': 62, 'don': 63, '’': 64, 'have': 65, 'strength': 66, 's': 67, '—': 68, 'myself': 69, 'that': 70, 'think': 71, 'in': 72, 'how': 73, 'it': 74, 'walls…': 75, 'exist.': 76, 'underground': 77, 'thirst': 78, 'am': 79, 'now': 80, 'afraid
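The word index above starts at 0 and its order depends on how `set` happens to iterate, so it can change between runs. The sketch below shows one alternative under stated assumptions: sort the vocabulary for reproducibility, start at 1 so that row 0 stays free for padding, and look up a padded sequence by integer indexing. The toy tokens, random matrix values, and `max_len` are illustrative only.

```python
import numpy as np

toy_tokens = ["i", "exist", ".", "i", "see", "the", "sun", "."]

# Sorted vocabulary gives a reproducible mapping; index 0 is reserved for padding
word_index = {word: i for i, word in enumerate(sorted(set(toy_tokens)), start=1)}

# Stand-in embedding matrix: row i holds the vector for the word with index i
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
embedding_matrix[1:] = rng.normal(size=(len(word_index), embedding_dim))

# Convert tokens to ids, pad with 0 to a fixed length, then index the matrix
max_len = 10
ids = [word_index[t] for t in toy_tokens]
padded = ids + [0] * (max_len - len(ids))
sequence_embeddings = embedding_matrix[padded]

print(sequence_embeddings.shape)
```

With this layout, `len(word_index) + 1` rows are all used: one padding row plus one row per word.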

In [35]:
# Create the embedding matrix using the word index
embedding_dim = 200  # Set according to your GloVe vectors
embedding_matrix = get_embedding_matrix(word_index, glove_embeddings, embedding_dim)

In [43]:
# Map words to GloVe embeddings
def get_embedding_matrix(word_index, glove_embeddings, embedding_dim):
    vocab_size = len(word_index) + 1  # +1 for padding or unknowns
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    for word, i in word_index.items():
        embedding_vector = glove_embeddings.get(word)  # Lookup the embedding
        if embedding_vector is not None:
            print(f"Embedding found for word '{word}'")  # Debugging: Word found in GloVe
            embedding_matrix[i] = embedding_vector
        else:
            print(f"No embedding for word '{word}', assigning zero vector")  # Debugging: Word not found
    return embedding_matrix


In [45]:
# Load the GloVe embeddings
glove_file_path = '/content/glove.6B.100d.txt'  # Update with your file path
glove_embeddings = load_glove_embeddings(glove_file_path)


In [48]:
# Create the embedding matrix
embedding_dim = 100  # Set according to your GloVe vectors
embedding_matrix = get_embedding_matrix(word_index, glove_embeddings, embedding_dim)

Embedding found for word '.'
Embedding found for word 'rack'
Embedding found for word 'pillar'
Embedding found for word 'sun'
Embedding found for word 't'
Embedding found for word 'believe'
Embedding found for word 'thousands'
Embedding found for word 'full'
Embedding found for word 'life'
Embedding found for word 'such'
Embedding found for word 'even'
Embedding found for word 'knowing'
Embedding found for word 'live'
Embedding found for word 'suffering'
Embedding found for word 'yes'
Embedding found for word 'sprung'
Embedding found for word 'consciousness'
Embedding found for word 'anything'
Embedding found for word 'you'
Embedding found for word 'for'
Embedding found for word '‘'
Embedding found for word 'began'
Embedding found for word 'reckoning'
Embedding found for word 'i'
Embedding found for word 'be'
Embedding found for word 'stand'
Embedding found for word 'alexey'
Embedding found for word 'me'
Embedding found for word 'on'
Embedding found for word 'has'
Embedding found for w

In [49]:
# Check the embedding for a punctuation mark (comma)
comma_embedding = embedding_matrix[word_index[',']]
print(f"Embedding for ',': {comma_embedding}")  # Should return a zero vector if not found

Embedding for ',': [-0.10767     0.11053     0.59811997 -0.54360998  0.67395997  0.10663
  0.038867    0.35481     0.06351    -0.094189    0.15786    -0.81664997
  0.14172     0.21939     0.58504999 -0.52157998  0.22782999 -0.16642
 -0.68228     0.35870001  0.42568001  0.19021     0.91962999  0.57555002
  0.46184999  0.42363    -0.095399   -0.42749    -0.16566999 -0.056842
 -0.29595     0.26036999 -0.26605999 -0.070404   -0.27662     0.15820999
  0.69825     0.43081     0.27952    -0.45436999 -0.33801001 -0.58183998
  0.22363999 -0.57779998 -0.26862001 -0.20424999  0.56393999 -0.58524001
 -0.14365    -0.64218003  0.0054697  -0.35247999  0.16162001  1.1796
 -0.47674    -2.75530005 -0.1321     -0.047729    1.06550002  1.10339999
 -0.2208      0.18669     0.13177     0.15117     0.71310002 -0.35214999
  0.91347998  0.61782998  0.70991999  0.23954999 -0.14571001 -0.37858999
 -0.045959   -0.47367999  0.2385      0.20536    -0.18996     0.32506999
 -1.11119998 -0.36341     0.98679    -0.0847

We see that we are still getting a non-zero vector for ','.

In [50]:
# Check whether the GloVe vocabulary contains an embedding for ','
print(list(glove_embeddings.keys())[:20])  # Print the first 20 keys in GloVe

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as']


**Inference**

So it's confirmed that the GloVe embedding vocabulary we have used here contains embeddings for punctuation marks as well as words.

# Forcing the embedding matrix to hold 0 vectors for the punctuation marks

In [51]:
# Force punctuation (like commas) to have zero vectors
punctuation = [',', '.', '!', '?', ':', ';']  # Add any other punctuation if needed

for punc in punctuation:
    if punc in word_index:
        embedding_matrix[word_index[punc]] = np.zeros(embedding_dim)

In [52]:
comma_embedding = embedding_matrix[word_index[',']]
print(f"Embedding for ',': {comma_embedding}")  # Should return a zero vector

Embedding for ',': [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]
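The hand-written punctuation list covers ASCII marks only, while the token output earlier also contains curly quotes and dashes. One possible way to catch those is to test Unicode character categories with the standard `unicodedata` module. `is_punctuation` is a hypothetical helper and the token list is a small sample taken from the output above.

```python
import unicodedata

def is_punctuation(token):
    """True if every character in the token belongs to a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith('P') for ch in token
    )

# Sample of tokens produced by the tokenizer earlier in the notebook
toy_tokens = ['yes', ',', '”', 'he', '’', 't', '—', 'exist.', '?']
punct = [t for t in toy_tokens if is_punctuation(t)]
print(punct)
```

The same check could replace the fixed list when zeroing rows, for example by looping over `word_index` and resetting the row of every token for which `is_punctuation` returns `True`. Mixed tokens such as 'exist.' are left alone because they contain letters.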
