# Introduction

CBOW (Continuous Bag of Words) is a neural network architecture that learns word embeddings by predicting a target word from its surrounding context words. Given a fixed-size window of words before and after a missing or central word, the model combines the context word vectors (typically by averaging them) and uses the result to predict the most probable word for the center position. It treats the context as a “bag of words,” meaning it ignores the order of the surrounding words and uses only their presence. CBOW is computationally efficient and tends to work well for frequent words. It captures semantic relationships by learning vectors in which words that appear in similar contexts end up close together in the vector space.
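To make the "bag of words" idea concrete, here is a minimal sketch of a single CBOW forward pass. The toy vocabulary, the random weights, and the names `W_in`, `W_out`, and `cbow_forward` are illustrative assumptions, not the notebook's trained model. It shows that shuffling the context words leaves the prediction unchanged.

```python
import numpy as np

# Toy setup (hypothetical values, not the notebook's trained weights)
toy_vocab = ["the", "cat", "sat", "on", "mat"]
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}

rng = np.random.default_rng(0)
emb_dim = 3
W_in = rng.normal(size=(len(toy_vocab), emb_dim))   # one embedding row per word
W_out = rng.normal(size=(emb_dim, len(toy_vocab)))  # projection back to vocabulary scores

def cbow_forward(context):
    """Return a probability distribution over the vocabulary for the center word."""
    idx = [toy_word2idx[w] for w in context]
    h = W_in[idx].mean(axis=0)        # average the context embeddings
    scores = h @ W_out                # one score per vocabulary word
    scores = scores - scores.max()    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

p1 = cbow_forward(["the", "cat", "on", "mat"])
p2 = cbow_forward(["mat", "on", "cat", "the"])  # same words, different order
print(np.allclose(p1, p2))  # True: word order does not matter
```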

# 📚 Table of Contents

**1. Text with Blanks**  
Define the paragraph containing missing words (represented as `____`) that we want the model to fill.

**2. Original Clean Text**  
Provide the full, original version of the text to train the CBOW model before attempting to fill in the blanks.

**3. Preprocessing and Vocabulary Construction**  
Convert the text into lowercase, remove punctuation, tokenize it into words, and build the vocabulary with word-to-index mappings.

**4. CBOW (Continuous Bag of Words) Training Data Creation**  
Generate pairs of context words and their corresponding target word based on a fixed context window size.

**5. Utility Functions**  
Define helper functions such as one-hot encoding, softmax, cosine similarity, etc., which are used throughout the model.

**6. Weight Initialization**  
Randomly initialize two weight matrices (W1 and W2) that will be learned during model training.

**7. CBOW Training Loop**  
Implement the training process using forward pass, loss calculation (cross-entropy), and backpropagation for weight updates.

**8. Word Embedding Extraction**  
Retrieve the final learned word vector (embedding) for any word in the vocabulary after training.

**9. Missing Word Prediction Function**  
Define a function that takes context words and predicts the most probable missing word using the trained model.

**10. Context Extraction for Blanks**  
For each blank (`____`), extract the surrounding words within a defined window to serve as context for prediction.

**11. Model Training**  
Train the CBOW model on the clean original text using a defined number of epochs and learning rate.

**12. Blank Filling Procedure**  
Iterate through the text with blanks, use the model to predict missing words, and replace each `____` with the predicted word.

**13. Final Reconstructed Text**  
Rebuild the final paragraph by joining all words (including the predicted ones) into a single completed string.


### Import Libraries

In [1]:
import numpy as np
import re
from collections import Counter
from numpy.linalg import norm
import pandas as pd

### 1. Text with Blanks

In [2]:
text_with_blanks = """In a ____ where knowledge holds immense ____,
one must strive to seek the truth. You shall know the truth,
and it is through this ____ that you shall find freedom.
The path to ____ begins with awareness,
and awareness begins with a deep understanding of what is real."""

### 2. Original Clean Text

In [3]:
original_text = """In a world where knowledge holds immense power,
one must strive to seek the truth. You shall know the truth,
and it is through this truth that you shall find freedom.
The path to liberation begins with awareness,
and awareness begins with a deep understanding of what is real."""

### 3. Preprocessing and Vocabulary Construction

In [4]:
words = re.findall(r'\w+', original_text.lower())
print(words)

['in', 'a', 'world', 'where', 'knowledge', 'holds', 'immense', 'power', 'one', 'must', 'strive', 'to', 'seek', 'the', 'truth', 'you', 'shall', 'know', 'the', 'truth', 'and', 'it', 'is', 'through', 'this', 'truth', 'that', 'you', 'shall', 'find', 'freedom', 'the', 'path', 'to', 'liberation', 'begins', 'with', 'awareness', 'and', 'awareness', 'begins', 'with', 'a', 'deep', 'understanding', 'of', 'what', 'is', 'real']


In [5]:
vocab = list(set(words))
print(vocab)

['seek', 'path', 'strive', 'a', 'world', 'and', 'in', 'must', 'find', 'awareness', 'what', 'understanding', 'where', 'freedom', 'immense', 'truth', 'power', 'you', 'liberation', 'with', 'real', 'one', 'is', 'knowledge', 'shall', 'the', 'this', 'know', 'begins', 'deep', 'it', 'through', 'holds', 'to', 'that', 'of']


In [6]:
word2idx = {w: i for i, w in enumerate(vocab)}
print(word2idx)

{'seek': 0, 'path': 1, 'strive': 2, 'a': 3, 'world': 4, 'and': 5, 'in': 6, 'must': 7, 'find': 8, 'awareness': 9, 'what': 10, 'understanding': 11, 'where': 12, 'freedom': 13, 'immense': 14, 'truth': 15, 'power': 16, 'you': 17, 'liberation': 18, 'with': 19, 'real': 20, 'one': 21, 'is': 22, 'knowledge': 23, 'shall': 24, 'the': 25, 'this': 26, 'know': 27, 'begins': 28, 'deep': 29, 'it': 30, 'through': 31, 'holds': 32, 'to': 33, 'that': 34, 'of': 35}


In [7]:
idx2word = {i: w for w, i in word2idx.items()}
print(idx2word)

{0: 'seek', 1: 'path', 2: 'strive', 3: 'a', 4: 'world', 5: 'and', 6: 'in', 7: 'must', 8: 'find', 9: 'awareness', 10: 'what', 11: 'understanding', 12: 'where', 13: 'freedom', 14: 'immense', 15: 'truth', 16: 'power', 17: 'you', 18: 'liberation', 19: 'with', 20: 'real', 21: 'one', 22: 'is', 23: 'knowledge', 24: 'shall', 25: 'the', 26: 'this', 27: 'know', 28: 'begins', 29: 'deep', 30: 'it', 31: 'through', 32: 'holds', 33: 'to', 34: 'that', 35: 'of'}


In [8]:
vocab_size = len(vocab)
vocab_size

36
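One caveat about the vocabulary step: `list(set(words))` does not guarantee a stable order, so the word-to-index mapping can differ between Python sessions. The sketch below shows one possible alternative, assuming a sorted vocabulary is acceptable. It also uses `Counter` (imported above but otherwise unused) to inspect word frequencies. The `demo_`-prefixed names and the `_sorted` variables are hypothetical and do not replace the notebook's own.

```python
import re
from collections import Counter

# A single sentence from the original text, used only for illustration
demo_text = "You shall know the truth, and it is through this truth that you shall find freedom."
demo_tokens = re.findall(r"\w+", demo_text.lower())

demo_counts = Counter(demo_tokens)          # word -> frequency
vocab_sorted = sorted(demo_counts)          # deterministic, alphabetical order
word2idx_sorted = {w: i for i, w in enumerate(vocab_sorted)}
idx2word_sorted = {i: w for w, i in word2idx_sorted.items()}

print(len(vocab_sorted))
print(demo_counts["truth"])
print(word2idx_sorted)
```

With a sorted vocabulary, the same text always produces the same indices, which makes printed weights and embeddings easier to compare across runs.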

### 4. CBOW Training Data Creation

In [9]:
window_size = 2
training_data = []

for i in range(window_size, len(words) - window_size): 
    context = (
        [words[i - j] for j in range(window_size, 0, -1)] +
        [words[i + j] for j in range(1, window_size + 1)]
    )
    target = words[i] 
    training_data.append([context, target])

In [10]:
training_data

[[['in', 'a', 'where', 'knowledge'], 'world'],
 [['a', 'world', 'knowledge', 'holds'], 'where'],
 [['world', 'where', 'holds', 'immense'], 'knowledge'],
 [['where', 'knowledge', 'immense', 'power'], 'holds'],
 [['knowledge', 'holds', 'power', 'one'], 'immense'],
 [['holds', 'immense', 'one', 'must'], 'power'],
 [['immense', 'power', 'must', 'strive'], 'one'],
 [['power', 'one', 'strive', 'to'], 'must'],
 [['one', 'must', 'to', 'seek'], 'strive'],
 [['must', 'strive', 'seek', 'the'], 'to'],
 [['strive', 'to', 'the', 'truth'], 'seek'],
 [['to', 'seek', 'truth', 'you'], 'the'],
 [['seek', 'the', 'you', 'shall'], 'truth'],
 [['the', 'truth', 'shall', 'know'], 'you'],
 [['truth', 'you', 'know', 'the'], 'shall'],
 [['you', 'shall', 'the', 'truth'], 'know'],
 [['shall', 'know', 'truth', 'and'], 'the'],
 [['know', 'the', 'and', 'it'], 'truth'],
 [['the', 'truth', 'it', 'is'], 'and'],
 [['truth', 'and', 'is', 'through'], 'it'],
 [['and', 'it', 'through', 'this'], 'is'],
 [['it', 'is', 'this', '
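Note that the loop above starts at `window_size` and stops `window_size` positions before the end, so the first and last few words are never used as targets. The sketch below wraps the same pairing logic in a function with an optional `keep_edges` flag. Both the function name `make_cbow_pairs` and the flag are assumptions added for illustration.

```python
def make_cbow_pairs(tokens, window=2, keep_edges=False):
    """Build (context, target) pairs; optionally keep truncated edge contexts."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        context = left + right
        if len(context) < 2 * window and not keep_edges:
            continue                               # skip positions without a full window
        pairs.append((context, target))
    return pairs

demo_tokens = ["in", "a", "world", "where", "knowledge", "holds"]
full_only = make_cbow_pairs(demo_tokens, window=2)
with_edges = make_cbow_pairs(demo_tokens, window=2, keep_edges=True)

print(full_only[0])                       # (['in', 'a', 'where', 'knowledge'], 'world')
print(len(full_only), len(with_edges))    # 2 6
```

Keeping the edge positions gives every word a chance to appear as a target, at the cost of some contexts being shorter than `2 * window`.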

### 5. Utility Functions

In [11]:
def word_to_onehot(word): 
    vec = np.zeros(vocab_size) 
    vec[word2idx[word]] = 1 
    return vec

In [12]:
word_to_onehot("and")

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.])
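One-hot vectors are mostly a teaching device: multiplying an averaged one-hot vector by the embedding matrix gives the same result as averaging the corresponding rows directly. The check below uses a small random matrix as a stand-in for `W1`. The sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 6, 4                          # hypothetical vocabulary and embedding sizes
W_demo = rng.normal(size=(V, D))     # stand-in for W1
context_ids = [0, 2, 3, 5]           # indices of the context words

# Route 1: average the one-hot vectors, then project with W.T
x = np.zeros(V)
for i in context_ids:
    x[i] += 1.0
x /= len(context_ids)
h_onehot = W_demo.T @ x

# Route 2: look up the rows and average them directly
h_lookup = W_demo[context_ids].mean(axis=0)

print(np.allclose(h_onehot, h_lookup))  # True
```

For a vocabulary this small the difference is negligible, but the row-lookup form avoids building length-`V` vectors for every training example.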

### Softmax

![description](softmax.png)

In [13]:
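def softmax(x): 
    # Subtract the max before exponentiating so np.exp cannot overflow;
    # the result is unchanged because softmax is shift-invariant.
    e_x = np.exp(x - np.max(x)) 
    return e_x / np.sum(e_x)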
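Softmax gives the same result when a constant is added to every score, so a common safeguard is to subtract the maximum score before exponentiating. The sketch below compares a plain version with a shifted one. The function names `naive_softmax` and `stable_softmax` are used only for this comparison.

```python
import numpy as np

def naive_softmax(x):
    e_x = np.exp(x)
    return e_x / np.sum(e_x)

def stable_softmax(x):
    shifted = x - np.max(x)          # largest entry becomes 0, so exp() stays <= 1
    e_x = np.exp(shifted)
    return e_x / np.sum(e_x)

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])   # same gaps between scores, huge offset

# For moderate scores both versions agree
print(np.allclose(naive_softmax(small), stable_softmax(small)))   # True

# For large scores exp() overflows to inf and inf / inf yields nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_softmax(large)
print(np.isnan(naive_large).any())                                # True

# The shifted version is unaffected by the offset
print(np.allclose(stable_softmax(large), stable_softmax(small)))  # True
```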

### Cosine Similarity

![description](cs.png)

In [14]:
def cosine_similarity(v1, v2): 
    return np.dot(v1, v2) / (norm(v1) * norm(v2))
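Cosine similarity is 1 for vectors pointing in the same direction, 0 for orthogonal vectors, and -1 for opposite vectors. The formula divides by the product of the norms, so a zero vector causes a division by zero. The variant below guards against that case by returning 0.0, which is an assumed convention rather than a mathematical definition. The name `safe_cosine_similarity` is hypothetical.

```python
import numpy as np
from numpy.linalg import norm

def safe_cosine_similarity(v1, v2, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector has (near) zero length."""
    denom = norm(v1) * norm(v2)
    if denom < eps:
        return 0.0
    return float(np.dot(v1, v2) / denom)

a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])

same_direction = safe_cosine_similarity(a, 3 * a)        # 1.0
orthogonal = safe_cosine_similarity(a, b)                # 0.0
opposite = safe_cosine_similarity(a, -a)                 # -1.0
with_zero = safe_cosine_similarity(a, np.zeros(2))       # 0.0 by the assumed convention

print(same_direction, orthogonal, opposite, with_zero)
```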

### 6. Weight Initialization

In [15]:
embedding_size = 10
W1 = np.random.randn(vocab_size, embedding_size)
W2 = np.random.randn(embedding_size, vocab_size)
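Because `np.random.randn` is called without a seed, the weights, the loss curve, and the embeddings printed in this notebook will differ from run to run. One possible way to make runs repeatable is a seeded generator, sketched below. The helper `init_weights`, the seed value, and the smaller `scale` are assumptions, not the settings used to produce the outputs shown here.

```python
import numpy as np

def init_weights(vocab_size, embedding_size, seed=42, scale=0.1):
    """Return (W_in, W_out) drawn from a seeded normal distribution."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(0.0, scale, size=(vocab_size, embedding_size))
    W_out = rng.normal(0.0, scale, size=(embedding_size, vocab_size))
    return W_in, W_out

W1_demo, W2_demo = init_weights(vocab_size=36, embedding_size=10)
W1_again, W2_again = init_weights(vocab_size=36, embedding_size=10)

print(W1_demo.shape, W2_demo.shape)          # (36, 10) (10, 36)
print(np.array_equal(W1_demo, W1_again))     # True: same seed, same weights
```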

In [16]:
for w in W1:
    print(w)
    print()

[ 0.33778226 -1.20460252 -0.5313687  -0.36628509 -0.68015166 -0.06997364
  1.34723843 -0.09789417  0.68917642  1.7539125 ]

[-0.83231963 -0.41646749  1.15329715  0.50040134 -1.10089516  0.31658547
 -0.69460133  1.71040067  0.15381943 -0.17266875]

[-0.61882282 -0.39091241  0.19502879 -0.38320889  1.97655631 -0.68166635
  0.64021659 -0.54026835  0.27952969 -1.00849724]

[-0.08207348 -1.70132408  0.23804193 -0.33589806 -0.55364757 -0.37070888
  0.0181981   1.64853284  1.02826813  0.21967641]

[ 1.2828925   0.36965927  0.48294673  0.15121616  0.66058636  0.70987494
 -1.14913248  1.03277768 -1.16062833  1.60662147]

[-2.28836707 -0.14432359  1.26624947  1.56844574  0.33793155 -0.45084976
  0.0097666  -0.16272751  0.78512274  1.49104151]

[ 0.56378539  0.75050187  1.06056114  0.86032958 -0.45497603  0.43333374
 -0.68998786  0.72670877 -0.86775053  1.76098891]

[-0.48173927 -0.1882952  -0.264587    0.13236801  2.29796084  0.30730652
 -1.76528193 -0.52643456  1.48445105  1.78128927]

[-0.0066

In [17]:
for w in W2:
    print(w)
    print()

[ 0.67860269 -1.71731239 -0.95920496  0.65372007 -2.27636258 -0.75345074
  1.01327858 -0.42046474  1.55656135  1.67101768 -0.26108949 -0.27586962
 -0.6635004  -0.81682508  0.66086686  0.18423991 -0.21168071 -0.36131349
  0.1381996   0.09283225 -1.80581129  0.04715965  0.02367211 -0.58177224
 -0.44752913 -0.16753873 -1.52476586  1.68174267 -0.92328954  1.84862946
  1.1286734  -0.66261733 -0.61345421  0.21081274 -0.20792785  1.21997304]

[ 1.11952494  0.05677692  1.11166742 -0.52759849  0.54124325 -1.32689728
  0.41932367  0.63977648  1.30053631 -0.24608426  2.29484367 -1.30022039
 -0.47764942 -2.31774093 -0.01753492 -0.33328329  0.89359865  0.00496097
  0.81624922  2.44405771  0.42178988 -0.81443968  0.5069484   0.13551767
  1.0832627  -1.52108374  2.03424915 -1.02623829  0.58468899 -0.64494097
  0.91619116  2.84815956 -0.20814604  1.19380529  1.94938628  0.23557442]

[ 1.04094784 -0.60265714  1.03585258 -0.82540771  0.51885367  0.75682983
  0.23507265  0.35138716  1.9827737  -1.8321390

### 7. CBOW Training Loop

#### Gradient Descent

![description](gd.png)
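For reference, the quantities computed in the training loop that follows can be written out as below. This is a sketch in the notation of the code: $x$ is the averaged one-hot context vector, $y$ the one-hot target, $t$ the index of the target word, and $\eta$ the learning rate. The symbol $e$ for the output error is introduced here for convenience.

```latex
\begin{aligned}
h &= W_1^{\top} x, \qquad u = W_2^{\top} h, \qquad \hat{y} = \operatorname{softmax}(u), \\
L &= -\log \hat{y}_t, \\
e &= \frac{\partial L}{\partial u} = \hat{y} - y, \\
\frac{\partial L}{\partial W_2} &= h\, e^{\top}, \qquad
\frac{\partial L}{\partial W_1} = x\, (W_2\, e)^{\top}, \\
W_1 &\leftarrow W_1 - \eta\, \frac{\partial L}{\partial W_1}, \qquad
W_2 \leftarrow W_2 - \eta\, \frac{\partial L}{\partial W_2}.
\end{aligned}
```

These correspond to `error`, `dW2 = np.outer(h, error)`, and `dW1 = np.outer(x, np.dot(W2, error))` in the code.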

In [18]:
def train_cbow(epochs=1000, learning_rate=0.01):
    global W1, W2
    for epoch in range(epochs):
        total_loss = 0
        for context_words, target_word in training_data:
            x = np.sum([word_to_onehot(w) for w in context_words], axis=0) / len(context_words)
            y_true = word_to_onehot(target_word)

            # forward
            h = np.dot(W1.T, x)
            u = np.dot(W2.T, h)
            y_pred = softmax(u)

            # loss & backprop
            loss = -np.log(y_pred[np.argmax(y_true)] + 1e-7)
            error = y_pred - y_true
            dW2 = np.outer(h, error)
            dW1 = np.outer(x, np.dot(W2, error))

            # update
            W1 -= learning_rate * dW1
            W2 -= learning_rate * dW2

            total_loss += loss

        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {total_loss:.4f}")
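A quick way to gain confidence in hand-written backpropagation is to compare the analytic gradients with finite-difference estimates. The sketch below does this for one training example on a tiny random model. All `_demo` names, the sizes, and the helper `numeric_grad` are assumptions made for the check and are independent of the notebook's `W1` and `W2`.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 3                                   # hypothetical vocabulary and embedding sizes
W1_demo = rng.normal(size=(V, D))
W2_demo = rng.normal(size=(D, V))
x_demo = np.zeros(V)
x_demo[[0, 2]] = 0.5                          # averaged one-hot of two context words
target_demo = 3                               # index of the target word

def loss_fn():
    """Cross-entropy loss of the tiny model for the single demo example."""
    h = W1_demo.T @ x_demo
    u = W2_demo.T @ h
    u = u - u.max()
    log_probs = u - np.log(np.exp(u).sum())   # log-softmax
    return -log_probs[target_demo]

# Analytic gradients, using the same formulas as the training loop
h = W1_demo.T @ x_demo
u = W2_demo.T @ h
y_pred = np.exp(u - u.max())
y_pred /= y_pred.sum()
error = y_pred.copy()
error[target_demo] -= 1.0                     # y_pred - y_true
dW2 = np.outer(h, error)
dW1 = np.outer(x_demo, W2_demo @ error)

def numeric_grad(W, eps=1e-6):
    """Central-difference gradient of loss_fn with respect to W (perturbed in place)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = loss_fn()
        W[idx] = old - eps
        minus = loss_fn()
        W[idx] = old                          # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

diff_W1 = np.abs(dW1 - numeric_grad(W1_demo)).max()
diff_W2 = np.abs(dW2 - numeric_grad(W2_demo)).max()
print(diff_W1, diff_W2)   # both should be very close to zero
```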

### 8. Word Embedding Extraction

In [19]:
def get_embedding(word): 
    return W1[word2idx[word]]

get_embedding("truth")

array([-1.16281397,  1.37917331,  0.10931854,  0.42393616,  0.32976903,
        0.19071628,  0.0541985 ,  1.01008177,  1.09628051,  0.42954136])
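Once embeddings are available, a natural next step is to look up a word's nearest neighbors by cosine similarity. The sketch below does this on a hand-made embedding matrix. The function `most_similar` and the toy vectors are hypothetical. On a corpus as small as this notebook's, neighbors found in the trained `W1` should not be expected to be semantically meaningful.

```python
import numpy as np

def most_similar(word, embeddings, word2idx, top_n=3):
    """Return the top_n (word, cosine similarity) pairs closest to `word`."""
    idx2word = {i: w for w, i in word2idx.items()}
    query = embeddings[word2idx[word]]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    sims = embeddings @ query / np.maximum(norms, 1e-12)   # avoid division by zero
    sims[word2idx[word]] = -np.inf                         # exclude the query word itself
    best = np.argsort(sims)[::-1][:top_n]
    return [(idx2word[int(i)], float(sims[i])) for i in best]

# Toy embeddings chosen by hand so the ranking is easy to verify
demo_word2idx = {"truth": 0, "freedom": 1, "path": 2, "the": 3}
demo_embeddings = np.array([
    [1.0, 0.0],    # truth
    [0.9, 0.1],    # freedom: almost the same direction as truth
    [0.0, 1.0],    # path: orthogonal to truth
    [-1.0, 0.0],   # the: opposite direction
])

neighbors = most_similar("truth", demo_embeddings, demo_word2idx, top_n=2)
print(neighbors)
```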

### 9. Missing Word Prediction Function

In [20]:
def predict_cbow(context_words): 
    valid_context = [w for w in context_words if w in word2idx] 
    if not valid_context: 
        return "____" 
    x = np.sum([word_to_onehot(w) for w in valid_context], axis=0) / len(valid_context) 
    h = np.dot(W1.T, x) 
    u = np.dot(W2.T, h) 
    y_pred = softmax(u) 
    predicted_idx = np.argmax(y_pred) 
    return idx2word[predicted_idx]
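Returning only the arg-max word hides how confident the model is. A variant that returns the top few candidates with their probabilities can be sketched as follows. The function `predict_top_k` takes the weights and vocabulary as arguments and is demonstrated on random weights, so both the name and the demo values are assumptions rather than the notebook's trained model.

```python
import numpy as np

def predict_top_k(context_words, W_in, W_out, word2idx, k=3):
    """Return the k most probable center words as (word, probability) pairs."""
    idx2word = {i: w for w, i in word2idx.items()}
    valid = [word2idx[w] for w in context_words if w in word2idx]
    if not valid:
        return []                                # nothing to condition on
    h = W_in[valid].mean(axis=0)                 # averaged context embedding
    u = h @ W_out
    u = u - u.max()                              # numerical stability
    probs = np.exp(u) / np.exp(u).sum()
    top = np.argsort(probs)[::-1][:k]
    return [(idx2word[int(i)], float(probs[i])) for i in top]

# Demo with random, untrained weights (illustrative only)
rng = np.random.default_rng(7)
demo_vocab = ["seek", "the", "truth", "you", "shall"]
demo_word2idx = {w: i for i, w in enumerate(demo_vocab)}
W_in_demo = rng.normal(size=(5, 4))
W_out_demo = rng.normal(size=(4, 5))

top3 = predict_top_k(["seek", "the", "you", "unknownword"], W_in_demo, W_out_demo, demo_word2idx)
empty = predict_top_k(["unknownword"], W_in_demo, W_out_demo, demo_word2idx)
print(top3)
print(empty)   # []
```

Words missing from the vocabulary are skipped, mirroring the `valid_context` filter in `predict_cbow`.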

### 10. Context Extraction for Blanks

In [21]:
def get_context(words, index, window=2): 
    context = [] 
    for i in range(index - window, index + window + 1): 
        if i != index and 0 <= i < len(words) and words[i] != "____":
            context.append(words[i]) 
    return context
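The behaviour of `get_context` at the edges of the text and next to other blanks is easiest to see on small inputs. The snippet below repeats the function so that it runs on its own. The token lists are made-up examples.

```python
def get_context(words, index, window=2):
    """Collect up to `window` words on each side of `index`, skipping blanks."""
    context = []
    for i in range(index - window, index + window + 1):
        if i != index and 0 <= i < len(words) and words[i] != "____":
            context.append(words[i])
    return context

# Blank in the middle: a full window of four context words
middle = get_context(["in", "a", "____", "where", "knowledge"], 2)

# Two adjacent blanks: the neighboring blank is excluded from the context
adjacent = get_context(["the", "path", "to", "____", "____", "with"], 3)

# Blank at the very start: only words to the right are available
start = get_context(["____", "begins", "with"], 0)

print(middle)     # ['in', 'a', 'where', 'knowledge']
print(adjacent)   # ['path', 'to', 'with']
print(start)      # ['begins', 'with']
```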

### 11. Model Training

In [22]:
train_cbow(epochs=3001, learning_rate=0.01)

Epoch 0, Loss: 199.3310
Epoch 100, Loss: 69.9717
Epoch 200, Loss: 30.1245
Epoch 300, Loss: 15.4481
Epoch 400, Loss: 9.0437
Epoch 500, Loss: 5.9789
Epoch 600, Loss: 4.3182
Epoch 700, Loss: 3.3132
Epoch 800, Loss: 2.6536
Epoch 900, Loss: 2.1938
Epoch 1000, Loss: 1.8581
Epoch 1100, Loss: 1.6040
Epoch 1200, Loss: 1.4059
Epoch 1300, Loss: 1.2478
Epoch 1400, Loss: 1.1191
Epoch 1500, Loss: 1.0126
Epoch 1600, Loss: 0.9231
Epoch 1700, Loss: 0.8470
Epoch 1800, Loss: 0.7816
Epoch 1900, Loss: 0.7249
Epoch 2000, Loss: 0.6752
Epoch 2100, Loss: 0.6315
Epoch 2200, Loss: 0.5926
Epoch 2300, Loss: 0.5580
Epoch 2400, Loss: 0.5269
Epoch 2500, Loss: 0.4988
Epoch 2600, Loss: 0.4734
Epoch 2700, Loss: 0.4502
Epoch 2800, Loss: 0.4291
Epoch 2900, Loss: 0.4097
Epoch 3000, Loss: 0.3919


### 12. Blank Filling Procedure

In [23]:
words_with_blanks = re.findall(r'\w+|____', text_with_blanks.lower())
for i, word in enumerate(words_with_blanks): 
    if word == "____": 
        context = get_context(words_with_blanks, i, window=2) 
        prediction = predict_cbow(context) 
        words_with_blanks[i] = prediction

### 13. Final Reconstructed Text

In [24]:
completed_text = ' '.join(words_with_blanks)
print("Original text:\n", original_text)
print()
print("Text with blanks:\n", text_with_blanks)
print()
print("Reconstructed Text:\n", completed_text)

Original text:
 In a world where knowledge holds immense power,
one must strive to seek the truth. You shall know the truth,
and it is through this truth that you shall find freedom.
The path to liberation begins with awareness,
and awareness begins with a deep understanding of what is real.

Text with blanks:
 In a ____ where knowledge holds immense ____,
one must strive to seek the truth. You shall know the truth,
and it is through this ____ that you shall find freedom.
The path to ____ begins with awareness,
and awareness begins with a deep understanding of what is real.

Reconstructed Text:
 in a world where knowledge holds immense power one must strive to seek the truth you shall know the truth and it is through this truth that you shall find freedom the path to liberation begins with awareness and awareness begins with a deep understanding of what is real
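To go beyond eyeballing the result, the filled-in words can be compared with the originals position by position. The sketch below does this for a short example. The function `score_filled_blanks` and the `demo_filled` string (which deliberately contains one wrong guess) are hypothetical and are not output from the trained model. Because the model here is trained on the very text it is asked to restore, a perfect reconstruction mostly reflects memorization of a tiny corpus rather than generalization.

```python
import re

def tokenize(text):
    # "_" counts as a word character, so "____" is kept as a single token
    return re.findall(r"\w+", text.lower())

def score_filled_blanks(original_tokens, blanked_tokens, filled_tokens, blank="____"):
    """Compare predictions with the original words at each blank position."""
    results = []
    for orig, blanked, filled in zip(original_tokens, blanked_tokens, filled_tokens):
        if blanked == blank:
            results.append((orig, filled, orig == filled))
    correct = sum(ok for _, _, ok in results)
    accuracy = correct / len(results) if results else 0.0
    return results, accuracy

demo_original = "In a world where knowledge holds immense power"
demo_blanked = "In a ____ where knowledge holds immense ____"
demo_filled = "in a world where knowledge holds immense truth"   # hypothetical predictions

results, accuracy = score_filled_blanks(
    tokenize(demo_original), tokenize(demo_blanked), tokenize(demo_filled)
)
print(results)    # [('world', 'world', True), ('power', 'truth', False)]
print(accuracy)   # 0.5
```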
