# Tutorial: Word Embeddings using CBOW
In this tutorial, we'll learn how to create word embeddings using the Continuous Bag of Words (CBOW) model. Word embeddings are dense vector representations of words, learned so that words appearing in similar contexts end up with similar vectors. CBOW is a simple neural network model that learns these vectors by predicting a target word from its surrounding context words.

## Step 1: Importing Required Libraries


In [17]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

## Step 2: Defining the Corpus
Let's define a small corpus of three sentences to train our CBOW model.

In [18]:
corpus = ['The cat sat on the mat',
          'The dog ran in the park',
          'The bird sang in the tree']

## Step 3: Tokenizing the Corpus
We convert each sentence in the corpus into a sequence of integer word indices using Keras' `Tokenizer`. By default, `Tokenizer` lowercases the text, so `The` and `the` map to the same index. This step is necessary because the neural network operates on numbers, not on raw text.

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
print('Corpus converted to sequences of integer word indices:')
print(sequences)
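
To make the mapping concrete, here is a minimal pure-Python sketch that approximates what `Tokenizer` does with its default settings on this corpus: lowercase the text, split on whitespace, and number the words from 1 in order of decreasing frequency. The helper names `build_word_index` and `to_sequences` are made up for illustration and are not part of Keras; the real class also strips punctuation and has options that are not modeled here.

```python
def build_word_index(texts):
    """Map each word to an integer index, most frequent word first, starting at 1."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep first-appearance order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Replace every word in every sentence with its integer index."""
    return [[word_index[w] for w in text.lower().split()] for text in texts]


corpus = ['The cat sat on the mat',
          'The dog ran in the park',
          'The bird sang in the tree']

word_index = build_word_index(corpus)
sequences = to_sequences(corpus, word_index)
print(word_index)
print(sequences)
```

In this sketch index 0 is never assigned to a word, which mirrors the reason the vocabulary size in the next step is one more than the number of distinct words.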

## Step 4: Defining Model Parameters
Next, we define some parameters for our CBOW model:
- `vocab_size`: The size of the vocabulary. This is the number of unique words plus one, because `Tokenizer` starts word indices at 1 and reserves index 0.
- `embedding_size`: The number of dimensions in each word embedding vector.
- `window_size`: The number of context words to consider on each side of the target word.

In [20]:
vocab_size = len(tokenizer.word_index) + 1
embedding_size = 10
window_size = 2

## Step 5: Generating Context-Target Pairs
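We generate context-target pairs for training the CBOW model. For each target word, the context is the `window_size` words before it and the `window_size` words after it. The loop only visits positions that have a full window on both sides, so the first and last `window_size` words of each sentence never serve as targets. `X` holds the context word indices, and `y` holds the one-hot encoded targets.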

In [21]:
contexts = []
targets = []
for sequence in sequences:
    for i in range(window_size, len(sequence) - window_size):
        context = sequence[i - window_size:i] + sequence[i + 1:i + window_size + 1]
        target = sequence[i]
        contexts.append(context)
        targets.append(target)
X = np.array(contexts)
y = tf.keras.utils.to_categorical(targets, num_classes=vocab_size)
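
The same windowing can be traced by hand on the first sentence, using words instead of indices so the pairs are easy to read. This is an illustrative sketch: `context_target_pairs` and `one_hot` are hypothetical helper names, and `one_hot` only mimics the effect of `to_categorical` for a single label.

```python
def context_target_pairs(tokens, window_size):
    """Return (context, target) for every position with a full window on both sides."""
    pairs = []
    for i in range(window_size, len(tokens) - window_size):
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


tokens = "the cat sat on the mat".split()
pairs = context_target_pairs(tokens, window_size=2)
for context, target in pairs:
    print(context, "->", target)
# ['the', 'cat', 'on', 'the'] -> sat
# ['cat', 'sat', 'the', 'mat'] -> on

# A target with index 4 in a 13-word vocabulary becomes a 13-element vector.
print(one_hot(4, num_classes=13))
```

With three six-word sentences and `window_size = 2`, each sentence yields two pairs, so `X` has shape `(6, 4)` and `y` has shape `(6, 13)`.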

## Step 6: Building the CBOW Model
We use Keras' Sequential API to build the CBOW model. The model consists of an `Embedding` layer to learn word embeddings, a `Lambda` layer to average the embeddings of context words, and a `Dense` layer with a softmax activation function to predict the target word.

In [None]:
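model = Sequential()
# Look up a trainable vector for each of the 2 * window_size context word indices.
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=2*window_size))
# Average the context vectors into a single vector of size embedding_size.
model.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))
# Score every vocabulary word and turn the scores into probabilities.
model.add(Dense(units=vocab_size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model, then save the learned weights for Step 7
model.fit(X, y, epochs=100, verbose=1)
model.save_weights('cbow_model.weights.h5')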
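
To see what the three layers compute for a single training example, the following pure-Python sketch runs one forward pass with randomly initialized parameters. The function name `cbow_forward`, the initialization ranges, and the example context `[1, 3, 5, 1]` are assumptions chosen for illustration; the Keras model carries out equivalent steps on batched tensors.

```python
import math
import random


def cbow_forward(context_ids, embeddings, weights, bias):
    """Return softmax probabilities over the vocabulary for one context."""
    dim = len(embeddings[0])
    # Average the context word vectors (the role of the Lambda layer).
    hidden = [sum(embeddings[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    # Project the averaged vector to one score per vocabulary word (the Dense layer).
    logits = [sum(hidden[d] * weights[d][v] for d in range(dim)) + bias[v]
              for v in range(len(bias))]
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
vocab_size, embedding_size = 13, 10
# Hypothetical random parameters standing in for the untrained model weights.
embeddings = [[random.uniform(-0.05, 0.05) for _ in range(embedding_size)]
              for _ in range(vocab_size)]
weights = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)]
           for _ in range(embedding_size)]
bias = [0.0] * vocab_size

probs = cbow_forward([1, 3, 5, 1], embeddings, weights, bias)
predicted_index = max(range(vocab_size), key=lambda v: probs[v])
print(len(probs), predicted_index)
```

Training adjusts `embeddings`, `weights`, and `bias` so that the probability assigned to the true target word increases; the rows of `embeddings` are the word vectors the tutorial is after.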

## Step 7: Loading Pre-trained Weights and Extracting Word Embeddings
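We reload the weights saved in Step 6 and extract the word embeddings from the model. `model.get_weights()[0]` is the weight matrix of the `Embedding` layer, with shape `(vocab_size, embedding_size)`. Row `i` is the vector for the word whose index in `tokenizer.word_index` is `i`; row 0 corresponds to the reserved index 0 and belongs to no word.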

In [23]:
model.load_weights('cbow_model.weights.h5')
embeddings = model.get_weights()[0]
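
A word's vector is the row at its `word_index` entry, and two words can be compared with cosine similarity. The sketch below uses a tiny hand-written matrix and word index (hypothetical values, not trained output) so the arithmetic is easy to check; `vector_for` and `cosine_similarity` are illustrative helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional embeddings for a three-word vocabulary.
word_index = {"the": 1, "cat": 2, "dog": 3}
embeddings = [
    [0.0, 0.0],  # row 0: reserved index, no word
    [1.0, 0.0],  # the
    [0.6, 0.8],  # cat
    [0.8, 0.6],  # dog
]


def vector_for(word):
    """Look up a word's embedding by its integer index, not by list position."""
    return embeddings[word_index[word]]


cat_dog = cosine_similarity(vector_for("cat"), vector_for("dog"))
the_cat = cosine_similarity(vector_for("the"), vector_for("cat"))
print(round(cat_dog, 2), round(the_cat, 2))
```

With the real model, the same lookup would read `embeddings[tokenizer.word_index[word]]`.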

## Step 8: Visualizing Word Embeddings
To visualize the word embeddings, we use Principal Component Analysis (PCA) to reduce their dimensionality to 2D. This allows us to plot the embeddings on a 2D plane.

In [None]:
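pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)
plt.figure(figsize=(5, 5))
# word_index maps each word to its row in the embedding matrix (indices start at 1),
# so look rows up by index rather than by enumeration order.
for word, idx in tokenizer.word_index.items():
    px, py = reduced_embeddings[idx]  # px, py avoid overwriting the training targets y
    plt.scatter(px, py)
    plt.annotate(word, xy=(px, py), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()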
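
As a rough sketch of what `PCA(n_components=2)` computes, the projection can be written in three steps. The notation is introduced here for illustration and is not from the original tutorial: $E$ is the embedding matrix with $n$ rows, $\mu$ is the mean of its rows, $\mathbf{1}$ is a column of $n$ ones, and $W_2$ is the first two columns of $W$.

```latex
\begin{aligned}
\bar{E} &= E - \mathbf{1}\,\mu^{\top}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} E_{i} \\
\bar{E} &= U \Sigma W^{\top} \\
Z &= \bar{E}\, W_{2}
\end{aligned}
```

Each row of $Z$ is the 2D coordinate plotted for the corresponding row of the embedding matrix. The two retained directions are those along which the centered embeddings vary the most.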

[Source](https://www.geeksforgeeks.org/continuous-bag-of-words-cbow-in-nlp/) for this tutorial.