1. The Embedding layer maps discrete word indices (integers) to dense vector representations (embeddings).

input_dim (Vocabulary Size):

The input to the embedding layer consists of word indices from the corpus. The vocabulary size (vocab_size) is the number of unique words in the entire dataset, plus one for the reserved padding index 0.
This parameter defines how many different word indices the model can handle.
output_dim (Embedding Dimensions):

The output_dim is the size of the dense vector that represents each word.
In this example, embedding_size = 10, meaning each word will be represented as a 10-dimensional vector in continuous space.
These dense vectors are learned during training and aim to capture semantic relationships between words (i.e., words with similar meanings will have similar embeddings).
input_length (Context Length):

The input_length is the number of context words fed into the model for each training example.
In the CBOW model, the input is the set of 2 * window_size context words around a target word. With window_size = 2, input_length = 4, so the model takes 4 context words to predict the target word. The code below uses window_size = 1, so input_length = 2.
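
To make the three parameters concrete, the lookup can be imitated with plain NumPy indexing. This is only a sketch of the idea, not how Keras implements the layer. The sizes, the random table, and the names embedding_table and context_indices are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration
vocab_size = 8         # input_dim: number of rows in the embedding table
embedding_size = 10    # output_dim: length of each word vector
window_size = 2        # words taken on each side of the target
input_length = 2 * window_size  # context words per training example

# A randomly initialized table standing in for the layer's trainable weights
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_size))

# A batch of 3 examples, each holding 4 context word indices
context_indices = np.array([[1, 2, 4, 5],
                            [2, 3, 5, 6],
                            [3, 4, 6, 7]])

# The lookup is row indexing: every index is replaced by its row of the table
context_vectors = embedding_table[context_indices]
print(context_vectors.shape)  # (3, 4, 10)
```

The output shape is (batch size, input_length, output_dim), which is the shape the next layer receives.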

Why is the Embedding Layer needed?

Convert Words into Vectors: Words in a corpus are typically represented as strings or categorical variables, but neural networks cannot work with such raw categorical data. Instead, they require numeric input. The embedding layer transforms words into continuous, dense vectors, making them suitable for processing by the neural network.

Semantic Relationships: Through training, the embedding layer learns the relationships between words. For instance, the words "dog" and "cat" may end up with similar embeddings because they are likely to appear in similar contexts in the text. This is the core idea of word embeddings: capturing semantic meaning in a lower-dimensional space.

Parameter Efficiency: Unlike one-hot encoding, which creates very sparse, high-dimensional vectors (each word is represented by a vector of size equal to the vocabulary size, with only one non-zero element), the embedding layer provides a more efficient representation by using dense, smaller vectors. This reduces both the memory requirements and the computational complexity of the model.
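
The link between one-hot encoding and an embedding lookup can be checked numerically. In the sketch below, multiplying a one-hot vector by the embedding table selects a single row, which is what the lookup does directly. The sizes and the random table are assumptions made for the example.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration
vocab_size = 8
embedding_size = 3

rng = np.random.default_rng(1)
embedding_table = rng.normal(size=(vocab_size, embedding_size))

# One-hot representation of the word with index 5
word_index = 5
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying the one-hot vector by the table picks out row 5
via_matmul = one_hot @ embedding_table
via_lookup = embedding_table[word_index]

print(np.allclose(via_matmul, via_lookup))  # True
print(one_hot.size, via_lookup.size)        # 8 3
```

The one-hot vector grows with the vocabulary, while the embedding stays at embedding_size values per word.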



2. The Lambda layer performs a custom operation: in this case, computing the mean of the embeddings of all the context words.

Why is this important in CBOW?

In the Continuous Bag of Words (CBOW) model, we do not use individual context word embeddings directly. Instead, we average the embeddings of all context words to get a single vector that represents the entire context.

tf.reduce_mean(x, axis=1): This operation averages over axis 1, the context-word axis of the (batch, context words, embedding_size) tensor produced by the Embedding layer. The result is a single vector of length embedding_size per example, representing the context. This vector is then used to predict the target word.

For example, if the context words "cat", "sat", and "on" had the one-dimensional embeddings 0.3, 0.5, and 0.7, the Lambda layer would average them to 0.5. With real embeddings the average is taken element-wise, giving one vector of length embedding_size. This aggregation helps the model generalize over the context and not be too sensitive to any individual word.
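
The averaging step can be sketched with NumPy, whose mean over axis 1 mirrors tf.reduce_mean(x, axis=1). The 3-dimensional embedding values below are made up for illustration and are not learned values.

```python
import numpy as np

# Made-up 3-dimensional embeddings for the context words "cat", "sat", "on"
context_embeddings = np.array([[
    [0.3, 0.1, 0.9],   # "cat"
    [0.5, 0.2, 0.6],   # "sat"
    [0.7, 0.3, 0.3],   # "on"
]])  # shape: (batch=1, context_words=3, embedding_size=3)

# Average over axis 1 (the context-word axis), element by element
context_vector = context_embeddings.mean(axis=1)

print(context_vector.shape)  # (1, 3)
print(context_vector)        # approximately [[0.5 0.2 0.6]]
```

The context-word axis disappears, leaving one vector per example regardless of how many context words were averaged.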



3. The Dense layer at the end of the model produces the final output, which is a probability distribution over the entire vocabulary.

Softmax Activation:

The softmax activation function transforms the output into a probability distribution, where each value corresponds to the probability of a word being the target, given the context.
The output layer has as many units as the number of words in the vocabulary (vocab_size), and each unit represents a word in the vocabulary.

Why is this necessary?

After averaging the context word embeddings, the model needs to predict the target word. The Dense layer outputs a probability for each word in the vocabulary, and the word with the highest probability is selected as the predicted target word.
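
As a rough illustration of the last step, the sketch below applies softmax to a vector of raw scores and picks the most probable index. The scores are invented, and the softmax helper is written here for the example rather than taken from Keras.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Invented raw scores for a hypothetical 5-word vocabulary
logits = np.array([0.2, 2.0, -1.0, 0.5, 1.0])

probs = softmax(logits)
predicted_index = int(np.argmax(probs))

print(probs.sum())       # approximately 1.0
print(predicted_index)   # 1
```

The index with the largest score also gets the largest probability, so the predicted word is the one at index 1 in this toy vocabulary.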

Categorical Cross-Entropy Loss:
During training, the categorical cross-entropy loss is used to minimize the difference between the predicted probabilities and the actual target word (which is represented as a one-hot vector).
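
For a single example, the loss can be worked out by hand. The sketch below assumes an invented 4-word vocabulary and made-up predicted probabilities. With a one-hot target, only the probability assigned to the true word contributes to the loss.

```python
import numpy as np

# Made-up predicted distribution over a hypothetical 4-word vocabulary
predicted = np.array([0.1, 0.7, 0.1, 0.1])

# One-hot target: the true word has index 1
target_index = 1
one_hot_target = np.zeros_like(predicted)
one_hot_target[target_index] = 1.0

# Categorical cross-entropy: -sum(target * log(predicted))
loss = -np.sum(one_hot_target * np.log(predicted))

print(loss)  # equals -log(0.7), about 0.357
```

The loss shrinks toward 0 as the probability of the true word approaches 1.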

In [1]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.decomposition import PCA
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer

# Define the corpus (collection of sentences)
corpus = ["I love sunflowers",
          "Sunflowers fill my heart with joy",
          "I love to look into the garden and see the flowers",
          "Flowers especially sunflowers are the most beautiful"]

# Step 1: Tokenize the corpus and convert text to sequences of integers
# Tokenizer: assigns a unique integer to each word in the corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)  # Builds the vocabulary and word-index mapping
sequences = tokenizer.texts_to_sequences(corpus)  # Converts each sentence to a sequence of word indices

print("After converting the corpus into a sequence of integers:")
print(sequences)

# Vocabulary size: Tokenizer assigns word indices starting at 1, so add 1 for the reserved index 0 (padding)
vocab_size = len(tokenizer.word_index) + 1
embedding_size = 10  # The dimensionality of word embeddings
window_size = 1      # Context window size (number of words on each side of the target word)

# Step 2: Generate context-target pairs for the CBOW model
# For each word in a sentence, use its neighboring words (window_size) as the context
contexts = []  # Stores context words
targets = []   # Stores target words

for sequence in sequences:
    for i in range(window_size, len(sequence) - window_size):
        # Context words: words in the window (excluding the center word)
        context = sequence[i - window_size:i] + sequence[i + 1:i + window_size + 1]
        target = sequence[i]  # Center word (target)

        # Append context and target to the lists
        contexts.append(context)
        targets.append(target)

        # Debug output for context and target pairs
        print(f"Context: {context}, Target: {target}")

# Step 3: Convert context and target pairs to numpy arrays
X = np.array(contexts)  # Context words (input)
y = to_categorical(targets, num_classes=vocab_size)  # Convert targets to one-hot encoded vectors

# Step 4: Define the Continuous Bag of Words (CBOW) model
model = Sequential()

# Embedding layer: maps input word indices to dense word vectors (embeddings)
model.add(Embedding(input_dim=vocab_size,  # Size of the vocabulary
                    output_dim=embedding_size,  # Embedding dimensions
                    input_length=2 * window_size))  # Number of context words per example (2 here, since window_size = 1)

# Lambda layer: averages the embeddings of all context words
model.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))

# Dense layer: outputs a probability distribution over the vocabulary
model.add(Dense(units=vocab_size, activation='softmax'))

# Step 5: Compile the model
# Loss: categorical cross-entropy (as it's a multi-class classification problem)
# Optimizer: Adam (adaptive learning rate optimization)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Step 6: Train the CBOW model
print("\nTraining the CBOW model...")
model.fit(X, y, epochs=100, verbose=0)  # Train for 100 epochs in silent mode
print("Training complete!")

# Step 7: Extract the learned word embeddings
# The Embedding layer is the first layer; its only weight is the embedding matrix
embedding_layer = model.layers[0]
embeddings = embedding_layer.get_weights()[0]  # Shape: (vocab_size, embedding_size)

# Print vocabulary size and shape of the embedding matrix
print(f"\nNumber of words in the vocabulary (including padding): {len(embeddings)}")
print(f"Shape of embeddings: {embeddings.shape}")

# Print the embedding vectors for all words in the vocabulary
print("\nWord Embeddings:")
for word, idx in tokenizer.word_index.items():  # Iterate over word-index pairs
    print(f"Word: '{word}' -> Embedding: {embeddings[idx]}")

# Print the embedding for the padding index (index 0), which Tokenizer never assigns to a word
if len(embeddings) > 0:
    print(f"\nEmbedding for padding (index 0): {embeddings[0]}")

# Perform PCA to reduce the dimensionality of the embeddings for visualization
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)

# Visualize the embeddings
plt.figure(figsize=(5, 5))
for word, idx in tokenizer.word_index.items():
    px, py = reduced_embeddings[idx]  # px, py avoid overwriting y, the training targets
    plt.scatter(px, py)
    plt.annotate(word, xy=(px, py), fontsize=10)
plt.title("Word Embeddings Visualized")
plt.show()


ImportError: DLL load failed while importing _c_internal_utils: The specified module could not be found.