EXPLAINING MODEL BUILDING WITH EMPHASIS ON LEARNING THE EMBEDDING RATHER THAN USING A PRETRAINED EMBEDDING MODEL

In [1]:
import tensorflow as tf  # Import TensorFlow library
from tensorflow import keras  # Import Keras module from TensorFlow
from tensorflow.keras import layers  # Import layers module from Keras

# --- Data Preparation (Dummy Data for Demonstration) ---
# Define example parameters for the data
max_tokens = 10000  # Example vocabulary size (number of unique words/tokens)
input_length = 50   # Example sequence length (maximum length of input sequences)
batch_size = 32     # Batch size for training

# Create dummy training dataset using tf.data.Dataset
int_train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform(shape=(1000, input_length), minval=0, maxval=max_tokens, dtype=tf.int64),  # Input sequences (random integers)
     tf.random.uniform(shape=(1000,), minval=0, maxval=2, dtype=tf.int64))  # Target labels (random 0 or 1)
).batch(batch_size)  # Batch the dataset into batches of size batch_size

# Create dummy validation dataset using tf.data.Dataset
int_val_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform(shape=(200, input_length), minval=0, maxval=max_tokens, dtype=tf.int64),  # Input sequences (random integers)
     tf.random.uniform(shape=(200,), minval=0, maxval=2, dtype=tf.int64))  # Target labels (random 0 or 1)
).batch(batch_size)  # Batch the dataset into batches of size batch_size

# --- Model Definition ---
# Define the input layer, taking integer sequences as input
inputs = keras.Input(shape=(None,), dtype="int64")  # shape=(None,) allows variable sequence lengths

# Embedding layer: Converts integer sequences to dense vectors (embeddings)
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)  # 256-dimensional embeddings

# Bidirectional LSTM layer: Processes the embedded sequences in both forward and backward directions
x = layers.Bidirectional(layers.LSTM(32))(embedded)  # 32 units in each LSTM layer

# Dropout layer: Applies dropout regularization to prevent overfitting
x = layers.Dropout(0.5)(x)  # Dropout rate of 0.5 (50%)

# Output layer: Produces a probability for binary classification (0 or 1)
outputs = layers.Dense(1, activation="sigmoid")(x)  # Sigmoid activation for binary output

# Create the Keras model by specifying the input and output layers
model = keras.Model(inputs, outputs)

# --- Model Compilation ---

# Compile the model by specifying the optimizer, loss function, and metrics
model.compile(optimizer="rmsprop",  # RMSprop optimizer
              loss="binary_crossentropy",  # Binary cross-entropy loss for binary classification
              metrics=["accuracy"])  # Track accuracy during training

# --- Model Summary ---

# Print a summary of the model architecture
model.summary()

# --- Callbacks ---

# Define callbacks to be used during training
callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",  # File path to save the model
                                    save_best_only=True)  # Save only the best model (based on validation loss)
]

# --- Model Training ---

# Train the model using the training data and validate on the validation data
model.fit(int_train_ds,  # Training dataset
          validation_data=int_val_ds,  # Validation dataset
          epochs=10,  # Number of epochs to train for
          callbacks=callbacks)  # Apply the defined callbacks

# --- Load Best Model ---

# Load the best saved model from the specified file path
model = keras.models.load_model("embeddings_bidir_gru.keras")

# --- Model Evaluation ---

# Evaluate the model on the validation dataset and print the results
loss, accuracy = model.evaluate(int_val_ds)
print(f"Validation loss: {loss}, Validation accuracy: {accuracy}")

Epoch 1/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 10s 152ms/step - accuracy: 0.4853 - loss: 0.6930 - val_accuracy: 0.4800 - val_loss: 0.6967
Epoch 2/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 6s 202ms/step - accuracy: 0.5542 - loss: 0.6816 - val_accuracy: 0.4800 - val_loss: 0.7014
Epoch 3/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 5s 163ms/step - accuracy: 0.6541 - loss: 0.6535 - val_accuracy: 0.5550 - val_loss: 0.7005
Epoch 4/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 6s 188ms/step - accuracy: 0.8959 - loss: 0.5060 - val_accuracy: 0.4750 - val_loss: 1.0188
Epoch 5/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 11s 227ms/step - accuracy: 0.9808 - loss: 0.0763 - val_accuracy: 0.5150 - val_loss: 1.5288
Epoch 6/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 6s 193ms/step - accuracy: 1.0000 - loss: 0.0100 - val_accuracy: 0.5000 - val_loss: 1.8010
Epoch 7/10
32/32

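To make the role of the Embedding layer concrete, here is a minimal sketch in plain Python (no TensorFlow) of what an embedding lookup does. It assumes a toy vocabulary of 6 tokens and 4-dimensional vectors; the names `embedding_table` and `embed` are illustrative and not part of Keras. In the real layer the table is a weight matrix that is updated during training like any other weight.

```python
import random

random.seed(0)  # Make the toy example reproducible

vocab_size = 6   # Toy vocabulary size (the model above uses max_tokens = 10000)
embed_dim = 4    # Toy embedding size (the model above uses output_dim = 256)

# One row of small random numbers per token id; in a real layer these rows are learned
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Replace each integer token id with its row of the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]

sequence = [3, 1, 3, 5]               # An integer-encoded sentence
vectors = embed(sequence)

print(len(vectors), len(vectors[0]))  # Sequence length and embedding size
print(vectors[0] == vectors[2])       # The same token id always maps to the same vector

# Number of weights in the Embedding layer of the model above: one row per token
embedding_params = 10000 * 256
print(embedding_params)
```

The last line shows why the embedding table usually dominates the parameter count of a small text model: every token in the vocabulary owns a full row.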

HERE, EXPLAINING THE PRETRAINED EMBEDDING MODELS, WORD2VEC AND GLOVE

In [None]:
import gensim.downloader as api  # Import the Gensim downloader for pre-trained models
from sklearn.manifold import TSNE  # Import t-SNE for dimensionality reduction (visualization)
import matplotlib.pyplot as plt  # Import Matplotlib for plotting

# --- 1. Load Pre-trained Word2Vec Model ---

# Load the pre-trained Word2Vec model trained on Google News dataset (300-dimensional vectors)
# 'word2vec-google-news-300' is the model identifier in Gensim's downloader
wv = api.load('word2vec-google-news-300')  # wv stands for word vectors


In [None]:
# --- 2. Example 1: Semantic Similarity ---

# Calculate and print the cosine similarity between word vectors
# Cosine similarity measures how similar two vectors are (between -1 and 1)
print(wv.similarity('king', 'queen'))  # Similarity between 'king' and 'queen' (should be high)
print(wv.similarity('man', 'woman'))  # Similarity between 'man' and 'woman' (should be high)
print(wv.similarity('king', 'man'))  # Similarity between 'king' and 'man' (related, but typically lower than the two pairs above)
print(wv.similarity('king', 'car'))  # Similarity between 'king' and 'car' (should be low)
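As a reminder of what the similarity score is, the following is a small sketch of cosine similarity written in plain Python. The function name `cosine` is just for illustration; it shows the formula (dot product divided by the product of the vector lengths) rather than how gensim computes it internally.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))   # Same direction: close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))   # Orthogonal: 0
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # Opposite direction: -1
```

Because the score depends only on direction, two vectors of very different lengths can still have a similarity near 1.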

In [None]:
# --- 3. Example 2: Analogy ---

# Find words most similar to 'king' + 'woman' - 'man' (should be close to 'queen')
# This demonstrates the ability of Word2Vec to solve analogy problems
result = wv.most_similar(positive=['king', 'woman'], negative=['man'])
print(result)
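The analogy query can be pictured as vector arithmetic followed by a nearest-neighbour search. Below is a rough sketch on hand-made 3-dimensional vectors; the values and the helper names `cosine` and `analogy` are invented for illustration. gensim's `most_similar` differs in detail (for example, it works with length-normalized vectors), so treat this as the idea only.

```python
import math

# Hand-made toy vectors: [royalty, masculinity, fruitiness]. Invented values.
toy_vectors = {
    "king":  [0.9, 0.8, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man":   [0.1, 0.8, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)  # Do not return the query words themselves
    candidates = [(w, cosine(target, vec)) for w, vec in vectors.items() if w not in excluded]
    return max(candidates, key=lambda pair: pair[1])

best_word, best_score = analogy(["king", "woman"], ["man"], toy_vectors)
print(best_word, round(best_score, 3))
```

Excluding the query words matters: without it, the nearest neighbour of the combined vector is often one of the inputs.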

In [None]:
# --- 3b. Example 2 (variation): Analogy with Several Terms ---

# Find words most similar to 'king' + 'girl' + 'young' - 'man' - 'adult'
# Several positive and negative words can be combined in a single query

result = wv.most_similar(positive=['king', 'girl', 'young'], negative=['man', 'adult'])
print(result)

In [None]:

# --- 4. Example 3: Getting the Vector for a Word ---

# Retrieve the vector representation of the word 'king'
vector_king = wv['king']
print(vector_king)  # Print the vector (300 numbers)

In [None]:
# --- 5. Example 4: Checking if a Word Exists in the Vocabulary ---

# Check if the word 'cat' exists in the Word2Vec vocabulary
if 'cat' in wv:
    print("Cat is in the vocabulary")
else:
    print("Cat is not in the vocabulary")
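The membership check is worth doing before any lookup, because indexing a gensim model with a word it does not know (`wv[word]`) raises a `KeyError`. A small sketch of splitting a word list up front is shown below; a plain dict with invented values stands in for the loaded model, and `split_by_vocabulary` is a hypothetical helper name.

```python
# Stand-in for a loaded model: a dict from word to vector (toy, invented values)
toy_vectors = {"cat": [0.2, 0.7], "dog": [0.3, 0.6]}

def split_by_vocabulary(words, vectors):
    """Separate words that have a vector from words that do not."""
    known = [word for word in words if word in vectors]
    unknown = [word for word in words if word not in vectors]
    return known, unknown

known, unknown = split_by_vocabulary(["cat", "dog", "catdogish"], toy_vectors)
print(known)    # Words that are safe to look up
print(unknown)  # Words that would raise KeyError on lookup
```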

In [None]:
# --- 6. Example 5: Visualizing Embeddings (using t-SNE) ---

# List of words to visualize
words = ['king', 'queen', 'man', 'woman', 'prince', 'princess', 'cat', 'dog', 'library', 'table', 'throne', 'chair']

# Keep only the words that are in the vocabulary, so that row i of the
# embedding array always belongs to words[i]
words = [word for word in words if word in wv]

# Get the vector representations for the remaining words
embeddings = [wv[word] for word in words]

# Convert the list of embeddings to a 2D NumPy array
import numpy as np  # Import NumPy for array manipulation
embeddings = np.array(embeddings)

# Reduce the dimensionality of the vectors to 2D for visualization using t-SNE
# t-SNE (t-Distributed Stochastic Neighbor Embedding) is a technique for visualizing high-dimensional data
# Perplexity must be smaller than the number of samples (12 words in this case)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)  # Reduce to 2 dimensions, perplexity=5
embeddings_2d = tsne.fit_transform(embeddings)

# Create a scatter plot of the 2D embeddings
plt.figure(figsize=(8, 6))  # Set the figure size
for i, word in enumerate(words):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1])  # Plot the point
    plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]))  # Add the word label
plt.title('Word2Vec Embeddings Visualization (t-SNE)')  # Set the plot title
plt.show()  # Display the plot
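With word lists this short, the perplexity setting is the part most likely to go wrong: scikit-learn rejects a perplexity that is not smaller than the number of samples, and the default of 30 is already too large for a dozen words. One possible way to guard against this is sketched below; `safe_perplexity` is a hypothetical helper, not a scikit-learn function.

```python
def safe_perplexity(n_samples, preferred=30.0):
    """Return the preferred perplexity, capped so it stays below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return min(preferred, n_samples - 1)

print(safe_perplexity(12, preferred=5))  # 12 words, perplexity 5 is already valid
print(safe_perplexity(8))                # 8 words, the default of 30 is capped
```

The result could then be passed as `TSNE(n_components=2, perplexity=safe_perplexity(len(words)), random_state=42)`.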

In [None]:
# --- 6b. Example 5 (continued): Visualizing a Larger Word List (using t-SNE) ---

# List of words to visualize, grouped by theme
words = ['king', 'queen', 'prince', 'princess', 'man', 'woman',
         'cat', 'dog', 'kitten', 'puppy',
         'book', 'library', 'page', 'author',
         'chair', 'table', 'furniture', 'sofa']

# Keep only the words that are in the vocabulary, so that row i of the
# embedding array always belongs to words[i]
words = [word for word in words if word in wv]

# Get the vector representations for the remaining words
embeddings = [wv[word] for word in words]

# Convert the list of embeddings to a 2D NumPy array
import numpy as np  # Import NumPy for array manipulation
embeddings = np.array(embeddings)

tsne = TSNE(n_components=2, perplexity=5, random_state=42)  # Reduce to 2 dimensions, perplexity=5
embeddings_2d = tsne.fit_transform(embeddings)

# Create a scatter plot of the 2D embeddings
plt.figure(figsize=(8, 6))  # Set the figure size
for i, word in enumerate(words):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1])  # Plot the point
    plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]))  # Add the word label
plt.title('Word2Vec Embeddings Visualization (t-SNE)')  # Set the plot title
plt.show()  # Display the plot

NOW LET'S USE THE GLOVE TECHNIQUE

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
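Each unzipped GloVe file is plain text with one token per line, followed by that token's vector components separated by spaces. The sketch below parses two made-up lines in that layout (the numbers are invented, and 4 dimensions are used instead of 100) using only the standard library, to show what the loading code has to do.

```python
import io

# Two invented lines in the same layout as a GloVe text file
fake_glove_file = io.StringIO(
    "the 0.1 -0.2 0.3 0.4\n"
    "cat 0.5 0.6 -0.7 0.8\n"
)

embeddings_index = {}
for line in fake_glove_file:
    # Split once: the token first, then the rest of the line with the numbers
    word, coefs = line.split(maxsplit=1)
    embeddings_index[word] = [float(value) for value in coefs.split()]

print(len(embeddings_index))    # Number of word vectors read
print(embeddings_index["cat"])  # The parsed vector for 'cat'
```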

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# 1. Prepare your training and validation sentences
sentences_train = ["This is a sample sentence.", "Another sentence with more words.", "Train data example one", "Train data example two"]
sentences_val = ["Validation sentence one.", "Validation sentence two."]
all_sentences = sentences_train + sentences_val  # Combine for consistent vocabulary

# Dummy binary labels, one per sentence (arbitrary placeholder values, for demonstration only)
labels_train = [1, 0, 1, 0]
labels_val = [1, 0]

# 2. Create a TextVectorization layer
max_tokens = 10000  # Upper bound on the vocabulary size (this toy corpus uses far fewer tokens)
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=10  # Adjust according to your data
)

# 3. Adapt the TextVectorization layer to all sentences
text_vectorization.adapt(all_sentences)

# 4. Load GloVe vectors
path_to_glove_file = "glove.6B.100d.txt"
embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:  # Explicit encoding: the file contains non-ASCII tokens
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs
print(f"Found {len(embeddings_index)} word vectors.")

# 5. Create the embedding matrix using the consistent vocabulary
embedding_dim = 100
vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))
embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        # Words without a GloVe vector keep their all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

# 6. Create the Embedding layer
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)
# 7. Vectorize your training and validation data
int_train_ds = tf.data.Dataset.from_tensor_slices(
    (text_vectorization(sentences_train), labels_train)  # Use text_vectorization to transform text to indices
).batch(2)

int_val_ds = tf.data.Dataset.from_tensor_slices(
    (text_vectorization(sentences_val), labels_val)  # Use text_vectorization to transform text to indices
).batch(2)

# 8. Build, compile, and train the model (same architecture as before, now using the frozen GloVe embedding layer)
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
# Note: 'int_train_ds' and 'int_val_ds' use the same token indices as the rows of the embedding matrix

# --- Modified Functions using GloVe Vocabulary ---

def get_glove_vector(word, embeddings_index):
    """Retrieves the GloVe vector for a word."""
    return embeddings_index.get(word)

def glove_similarity(word1, word2, embeddings_index):
    """Calculates cosine similarity between two words."""
    vec1 = get_glove_vector(word1, embeddings_index)
    vec2 = get_glove_vector(word2, embeddings_index)
    if vec1 is not None and vec2 is not None:
        return cosine_similarity([vec1], [vec2])[0, 0]
    else:
        return None

def find_most_similar(word, embeddings_index, top_n=5):
    """Finds the top N most similar words to a given word."""
    word_vector = get_glove_vector(word, embeddings_index)
    if word_vector is None:
        return None

    similarities = []
    for vocab_word, vector in embeddings_index.items():
        if vocab_word != word:
            similarity = cosine_similarity([word_vector], [vector])[0, 0]
            similarities.append((vocab_word, similarity))

    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]

def check_word_vector(word, embeddings_index):
    """Checks if a word has a GloVe vector."""
    if get_glove_vector(word, embeddings_index) is not None:
        print(f"'{word}' has a GloVe vector.")
    else:
        print(f"'{word}' does not have a GloVe vector.")

def get_word_vector(word, embeddings_index):
    """Prints the GloVe vector for a word, or a message if it has none."""
    vector = get_glove_vector(word, embeddings_index)
    if vector is not None:
        print(f"Vector for '{word}': {vector}")
    else:
        print(f"'{word}' has no vector.")

def visualize_embeddings(words, embeddings_index):
    """Visualizes GloVe embeddings using t-SNE."""
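    # Keep only the words that have a GloVe vector so labels and rows stay aligned
    words_filtered = [word for word in words if get_glove_vector(word, embeddings_index) is not None]
    embeddings = np.array([get_glove_vector(word, embeddings_index) for word in words_filtered])

    if len(words_filtered) < 2:
        print("Need at least two embeddings to visualize.")
        return

    # t-SNE requires perplexity < number of samples; the default of 30 fails on short word lists
    perplexity = min(30, len(words_filtered) - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings_2d = tsne.fit_transform(embeddings)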

    plt.figure(figsize=(8, 6))
    for i, word in enumerate(words_filtered):
        plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1])
        plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]))
    plt.title('GloVe Embeddings Visualization (t-SNE)')
    plt.show()
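Step 5 of the cell above (building the embedding matrix) is the part that ties the GloVe vectors to the model's token indices, so it may help to see it in isolation. The sketch below uses plain Python lists and invented toy values; the vocabulary follows the layout that `TextVectorization.get_vocabulary()` returns in "int" mode, where index 0 is the padding token `''` and index 1 is `'[UNK]'`. The counters `hits` and `misses` are additions for illustration.

```python
# Toy stand-ins with invented values
vocabulary = ["", "[UNK]", "the", "cat", "zzyzx"]   # Index = token id
embeddings_index = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}
embedding_dim = 3
max_tokens = 10

# Start from all zeros: row i will hold the pretrained vector of token id i
embedding_matrix = [[0.0] * embedding_dim for _ in range(max_tokens)]

hits, misses = 0, 0
for i, word in enumerate(vocabulary):
    if i >= max_tokens:
        break
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = list(vector)  # Copy the pretrained vector into row i
        hits += 1
    else:
        misses += 1                         # Row stays all zeros (padding, [UNK], rare words)

print(f"Converted {hits} words ({misses} misses)")
print(embedding_matrix[3])  # Row for 'cat'
```

Counting misses is a cheap sanity check: a very high miss rate usually means the tokenization of the corpus does not match the tokenization the pretrained vectors were built with.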


In [None]:
# --- Usage Examples ---

# Examples of Semantic Similarity
print(f"Similarity ('king', 'queen'): {glove_similarity('king', 'queen', embeddings_index)}")
print(f"Similarity ('man', 'woman'): {glove_similarity('man', 'woman', embeddings_index)}")
print(f"Similarity ('king', 'man'): {glove_similarity('king', 'man', embeddings_index)}")
print(f"Similarity ('king', 'car'): {glove_similarity('king', 'car', embeddings_index)}")
print(f"Similarity ('cat', 'dog'): {glove_similarity('cat', 'dog', embeddings_index)}")


In [None]:
# Examples of Finding Similar Words (Continued)
print(f"Most similar to 'cat': {find_most_similar('cat', embeddings_index)}")
print(f"Most similar to 'book': {find_most_similar('book', embeddings_index)}")

In [None]:
# Examples of Checking Word Vector Existence
check_word_vector("king", embeddings_index)
check_word_vector("randomword", embeddings_index)

In [None]:
# Examples of Getting the Vector for a Word
get_word_vector("queen", embeddings_index)
get_word_vector("anotherword", embeddings_index)

In [None]:
# Example of Visualization
#words_to_visualize = ['king', 'queen', 'man', 'woman', 'cat', 'dog', 'book', 'library']
#visualize_embeddings(words_to_visualize, embeddings_index)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

def visualize_embeddings(words, embeddings_index):
    """Visualizes GloVe embeddings using t-SNE."""
    embeddings = [get_glove_vector(word, embeddings_index) for word in words if get_glove_vector(word, embeddings_index) is not None]
    words_filtered = [word for word in words if get_glove_vector(word, embeddings_index) is not None]

    if not embeddings:
        print("No embeddings to visualize.")
        return

    # Convert the list of embeddings to a NumPy array
    embeddings = np.array(embeddings)

    # Lower the perplexity to be significantly less than the number of samples
    tsne = TSNE(n_components=2, perplexity=3, random_state=42)  # Reduced perplexity to 3
    embeddings_2d = tsne.fit_transform(embeddings)

    plt.figure(figsize=(8, 6))
    for i, word in enumerate(words_filtered):
        plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1])
        plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]))
    plt.title('GloVe Embeddings Visualization (t-SNE)')
    plt.show()

words_to_visualize = ['king', 'queen', 'man', 'woman', 'cat', 'dog', 'book', 'library']
visualize_embeddings(words_to_visualize, embeddings_index)

In [None]:
words_to_visualize = ['king', 'queen', 'prince', 'princess', 'man', 'woman',
         'cat', 'dog', 'kitten', 'puppy',
         'book', 'library', 'page', 'author',
         'chair', 'table', 'furniture', 'sofa']

visualize_embeddings(words_to_visualize, embeddings_index)


--- Use Cases for Word2Vec and GloVe ---

1. Semantic Similarity and Relatedness:
- Finding synonyms and related words.
- Measuring the semantic distance between words or documents.
- Example: Building a search engine that understands the meaning of queries.

2. Analogy Tasks:
- Solving analogy problems such as "king - man + woman ≈ queen" with vector arithmetic.
- Example: Building a question-answering system that can reason about relationships between words.

3. Feature Engineering for NLP Models:
- Using pre-trained embeddings as input features for deep learning models.
- Example: Improving the performance of sentiment analysis, text classification, or machine translation models.

4. Information Retrieval:
- Finding documents that are semantically similar to a query.
- Example: Building a document retrieval system that understands the meaning of documents.

5. Word Sense Disambiguation:
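- Helping to identify the intended meaning of a word by comparing its context words with candidate senses. Word2Vec and GloVe assign a single vector per word, so they do not separate senses on their own.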
- Example: Building a system that can understand the different meanings of polysemous words.

6. Recommendation Systems:
- Recommending items based on their semantic similarity.
- Example: Building a movie recommendation system that suggests movies similar to those a user has watched.

7. Text Summarization:
- Identifying the most important sentences or phrases in a document, for example by scoring how close each sentence's word vectors are to those of the whole document.
- Example: Building a system that can generate concise summaries of long documents.

8. Machine Translation:
- Representing words in different languages in a shared embedding space.
- Example: Building a system that can translate text from one language to another while preserving meaning.
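Several of the use cases above (semantic similarity, information retrieval, recommendation) rest on the same simple recipe: represent a text by the average of its word vectors, then compare texts with cosine similarity. A minimal sketch with invented 2-dimensional vectors follows; `average_vector` and `cosine` are hypothetical helper names, and averaging is only one of several ways to pool word vectors.

```python
import math

# Invented toy vectors: first axis ~ "cinema", second axis ~ "finance"
toy_vectors = {
    "film":   [0.90, 0.10],
    "movie":  [0.85, 0.15],
    "great":  [0.50, 0.50],
    "stock":  [0.10, 0.90],
    "market": [0.15, 0.85],
}

def average_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query = average_vector(["great", "film"], toy_vectors)
doc_a = average_vector(["great", "movie"], toy_vectors)
doc_b = average_vector(["stock", "market"], toy_vectors)

sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
print(round(sim_a, 3), round(sim_b, 3))  # doc_a should score higher than doc_b
```

The query and `doc_a` share only one word ("great"), yet they score as close because "film" and "movie" have similar vectors, which is the behaviour a keyword-only search would miss.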

Key Differences (Word2Vec vs. GloVe):

Word2Vec:
- Trains a shallow neural network to predict context words from a target word (skip-gram) or a target word from its context (CBOW).
- Learns from local context windows, one window at a time.
- Words that appear in similar contexts end up with similar vectors.

GloVe:
- Fits word vectors to global word co-occurrence counts gathered over the whole corpus.
- Uses corpus-wide statistics directly rather than individual windows.
- Neither method is uniformly better; which one works best depends on the corpus and the task.
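The difference between the two training signals can be sketched with a tiny corpus. Word2Vec (skip-gram) consumes individual (target, context) pairs drawn from a sliding window, while GloVe first aggregates such pairs into corpus-wide co-occurrence counts and then fits vectors to those counts. The code below only builds the two kinds of input; it trains nothing, and it omits details of the real methods such as distance weighting and subsampling. The helper name `skipgram_pairs` is illustrative.

```python
from collections import Counter

# Tiny invented corpus: two tokenized sentences
corpus = [
    "the cat sat".split(),
    "the cat ran".split(),
]
window = 1  # Number of words to look at on each side of the target

def skipgram_pairs(tokens, window):
    """List every (target, context) pair within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Word2Vec-style input: a stream of individual training pairs
pairs = [pair for sentence in corpus for pair in skipgram_pairs(sentence, window)]
print(pairs[:4])

# GloVe-style input: the same pairs aggregated into global counts
cooccurrence = Counter(pairs)
print(cooccurrence[("the", "cat")])  # 'the' and 'cat' co-occur in both sentences
```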