"CBOW Implementation for Sentence Embedding Using Neural Networks"

This notebook presents a Python implementation of the Continuous Bag of Words (CBOW) model using TensorFlow/Keras. CBOW is a word embedding technique: it learns a dense vector for each word by training a small neural network to predict a target word from the words that surround it. Here the model is trained on a single example sentence with a context window of two words on each side, and the learned embeddings are then compared using cosine similarity. The example is intended as a foundation for understanding CBOW and applying it in natural language processing tasks.

In [10]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [20]:
sentence = "The quick brown fox jumps over the lazy dog"

In [21]:
# Tokenize the text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts([sentence])

# Vocabulary size: +1 because Tokenizer indices start at 1 (index 0 is reserved)
vocab_size = len(tokenizer.word_index) + 1

# Convert the text data to sequences of word indices
sequences = tokenizer.texts_to_sequences([sentence])[0]

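# Build (context, target) pairs: each target word is predicted from the
# context_size words before it and the context_size words after it.
# Positions without a full window on both sides are skipped.
context_size = 2
word_pairs = []
for i in range(context_size, len(sequences) - context_size):
    context_words = sequences[i - context_size:i] + sequences[i + 1:i + context_size + 1]
    target_word = sequences[i]
    word_pairs.append((context_words, target_word))
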
# Split the data into training and validation sets
split = int(0.8 * len(word_pairs))
train_data = word_pairs[:split]
val_data = word_pairs[split:]
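To make the windowing concrete, here is a minimal sketch of the same pair-building logic applied to the words themselves rather than to their indices. The helper name `make_cbow_pairs` is hypothetical and not part of the notebook; the sentence is lowercased by hand to mirror what `Tokenizer` does by default.

```python
def make_cbow_pairs(tokens, context_size=2):
    """Return (context, target) pairs for every position with a full window."""
    pairs = []
    for i in range(context_size, len(tokens) - context_size):
        left = tokens[i - context_size:i]            # words before the target
        right = tokens[i + 1:i + context_size + 1]   # words after the target
        pairs.append((left + right, tokens[i]))
    return pairs


tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = make_cbow_pairs(tokens, context_size=2)
for context, target in pairs:
    print(context, "->", target)
# ['the', 'quick', 'fox', 'jumps'] -> brown
# ['quick', 'brown', 'jumps', 'over'] -> fox
# ['brown', 'fox', 'over', 'the'] -> jumps
# ['fox', 'jumps', 'the', 'lazy'] -> over
# ['jumps', 'over', 'lazy', 'dog'] -> the
```

The nine-word sentence yields only five pairs, so the 80/20 split leaves four pairs for training and one for validation.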

In [22]:
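# Define the model: embed each context word, average the embeddings,
# then predict the target word with a softmax over the vocabulary
model = tf.keras.models.Sequential([
    # Lookup table of shape (vocab_size, 100); its weights are the word embeddings
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=100, input_length=context_size*2),
    # Average the context word vectors into a single 100-dimensional vector
    tf.keras.layers.GlobalAveragePooling1D(),
    # Probability distribution over the vocabulary for the target word
    tf.keras.layers.Dense(units=vocab_size, activation='softmax')
])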

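# Compile the model; sparse_categorical_crossentropy takes integer class
# labels, so the target word indices need no one-hot encoding
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])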
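The three layers above amount to a lookup, a mean, and a softmax. The following NumPy sketch reproduces that forward pass under stated assumptions: the names `E`, `W`, `b`, and `cbow_forward` are hypothetical, the weights are random rather than trained, the initialization scale is arbitrary, and the context indices are illustrative values in the range the tokenizer would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 9, 100  # 8 distinct words plus the reserved index 0

# Randomly initialized parameters (assumption: stand-ins for trained weights)
E = rng.normal(scale=0.05, size=(vocab_size, embed_dim))  # embedding table
W = rng.normal(scale=0.05, size=(embed_dim, vocab_size))  # dense layer kernel
b = np.zeros(vocab_size)                                  # dense layer bias


def cbow_forward(context_ids):
    """Average the context embeddings and return softmax probabilities."""
    vectors = E[context_ids]        # (num_context_words, embed_dim)
    hidden = vectors.mean(axis=0)   # (embed_dim,), like GlobalAveragePooling1D
    logits = hidden @ W + b         # (vocab_size,)
    logits = logits - logits.max()  # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


probs = cbow_forward(np.array([1, 2, 4, 5]))
print(probs.shape, probs.argmax())
```

Because the context vectors are averaged, word order inside the window has no effect on the prediction, which is where the "bag of words" in the name comes from.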

In [23]:
# Convert the (context, target) pairs into NumPy arrays
X_train = np.array([context_words for context_words, _ in train_data])
y_train = np.array([target_word for _, target_word in train_data])
X_val = np.array([context_words for context_words, _ in val_data])
y_val = np.array([target_word for _, target_word in val_data])

# Train the model for 100 epochs without printing progress (verbose=0)
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val), verbose=0)

In [26]:
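# Inspect the learned embeddings: the first weight array is the
# Embedding layer's lookup table, with one row per vocabulary index
word_embeddings = model.get_weights()[0]
word_index = tokenizer.word_index
reverse_word_index = {v: k for k, v in word_index.items()}

def get_word_vector(word):
    # Look up the word's row in the embedding matrix
    return word_embeddings[word_index[word]]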

test_word = "quick"
test_word_vector = get_word_vector(test_word)
# Cosine similarity to every other word (index 0 is reserved, so start at 1)
most_similar_words = []
for i in range(1, vocab_size):
    if i != word_index[test_word]:
        word = reverse_word_index[i]
        word_vector = get_word_vector(word)
        similarity = np.dot(test_word_vector, word_vector) / (np.linalg.norm(test_word_vector) * np.linalg.norm(word_vector))
        most_similar_words.append((word, similarity))
most_similar_words = sorted(most_similar_words, key=lambda x: x[1], reverse=True)
print("Most similar words to '{}' are:".format(test_word))
for word, similarity in most_similar_words[:5]:
    print("{} ({:.2f})".format(word, similarity))

Most similar words to 'quick' are:
jumps (0.48)
brown (0.16)
the (0.15)
over (0.15)
fox (0.12)
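
The similarity loop above can also be written as a single matrix product. The sketch below is one way to do that, assuming only NumPy; `cosine_rank` and the `toy` matrix are hypothetical names used for illustration, and the toy vectors are made up rather than taken from the trained model. In the notebook's setting, row 0 of `word_embeddings` is the reserved index and would need to be sliced off before ranking.

```python
import numpy as np


def cosine_rank(embeddings, query_idx, top_k=5):
    """Rank rows of `embeddings` by cosine similarity to row `query_idx`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero rows
    sims = unit @ unit[query_idx]                    # cosine similarity per row
    order = np.argsort(-sims)                        # most similar first
    ranked = [(int(i), float(sims[i])) for i in order if i != query_idx]
    return ranked[:top_k]


# Made-up 2-D vectors: row 1 points nearly the same way as row 0,
# row 2 is orthogonal to it, and row 3 points the opposite way
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
ranked = cosine_rank(toy, query_idx=0, top_k=3)
print([i for i, _ in ranked])
# [1, 2, 3]
```

Note that the notebook sets no random seed and trains on only four context windows from one sentence, so the similarity scores printed above will differ from run to run and should not be read as meaningful semantic relationships.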
