<a href="https://colab.research.google.com/github/Reema-h/T5_project_w7/blob/main/Task1_CBOW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Continuous Bag of Words (CBOW) Model

In this exam, we will create a CBOW model using a sample Arabic traffic corpus. The corpus consists of sentences describing various traffic scenarios. The goal of the CBOW model is to predict a target word based on its surrounding context words.
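
In symbols, the prediction task can be written as follows. This is a sketch of the usual CBOW formulation, not taken from the exercise text; the notation ($m$ for the window size, $\mathbf{v}_w$ for the embedding of word $w$, $\mathbf{W}$ and $\mathbf{b}$ for the output layer) is an assumption introduced here.

$$
\mathbf{h} = \frac{1}{2m} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \mathbf{v}_{w_{t+j}},
\qquad
p(w_t \mid w_{t-m}, \dots, w_{t+m}) = \operatorname{softmax}(\mathbf{W}\mathbf{h} + \mathbf{b})_{w_t}
$$

Training then minimizes $-\log p(w_t \mid w_{t-m}, \dots, w_{t+m})$ averaged over all target positions $t$, which is the categorical cross-entropy for a one-hot target.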

# Importing Required Libraries

In this step, we import the necessary libraries to build and train the Continuous Bag of Words (CBOW) model.

- **TensorFlow and Keras**: Used to build the neural network model from layers such as `Embedding`, `Dense`, and `Lambda`.
- **Tokenizer**: A utility from Keras for tokenizing and processing text data.
- **NumPy**: Used for handling numerical operations, particularly for processing arrays and data manipulation.

These libraries will provide the essential tools for text preprocessing and model development in the upcoming steps.

Add more if needed!


In [22]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda

# Preparing the Corpus

In this step, we initialize the corpus that will be used for training the Continuous Bag of Words (CBOW) model. The corpus consists of Arabic sentences, each of which describes different traffic scenarios.

- **Corpus**: A collection of traffic-related sentences in Arabic.

This step sets up the text data that we will use in the upcoming stages of tokenization and model training.

In [23]:
corpus = [
    "الطريق مزدحم اليوم بسبب الحادث المروري الذي حدث صباحاً ويؤدي إلى تأخير كبير في الوصول",
    "كان الطريق مزدحما للغاية والسيارات متوقفة تقريباً نتيجة الازدحام الشديد والحركة بطيئة جداً ولا تتحسن",
    "أنا أحب الذهاب إلى السوق في الصباح الباكر لتفادي الازدحام وشراء الخضروات الطازجة دون الانتظار في الطريق",
    "السيارات بطيئة بسبب الازدحام المروري في الشارع الرئيسي والتأخير في حركة المرور خلال ساعات الذروة",
    "هناك ازدحام مروري في الشارع بسبب أعمال البناء والحفريات التي تعطل حركة السيارات وتتسبب في تأخير كبير",
    "ازدحام السيارات يزداد في المساء عندما يبدأ الجميع بالعودة إلى منازلهم من العمل وتتوقف حركة المرور بالكامل",
    "الطريق السريع يشهد ازدحاما مستمرا خلال فترة الظهيرة بسبب الشاحنات الكبيرة التي تبطئ حركة السير",
    "الحافلات والسيارات عالقة في الازدحام المروري في المنطقة التجارية مما يؤدي إلى تأخير وصول الناس إلى وجهاتهم",
    "حركة المرور مزدحمة اليوم بسبب الفعاليات التي تقام في وسط المدينة مما يزيد من صعوبة الوصول إلى هناك",
    "كان من الصعب جدا القيادة على الطريق الرئيسي اليوم بسبب الازدحام الخانق الذي استمر طوال اليوم",
    "الطريق إلى المطار مزدحم اليوم بسبب الحوادث المتكررة والتأخيرات الكبيرة في حركة المرور على الطريق السريع",
    "الشارع مزدحم بالسيارات والحافلات الكبيرة مما يجعل التنقل بطيئًا جدًا ويزيد من وقت الوصول إلى العمل",
    "ازدحام السيارات في المدينة أصبح مشكلة كبيرة خاصة خلال ساعات الذروة حيث يصعب التحرك بسرعة",
    "تفاقم الازدحام في الطرق الجانبية بسبب إغلاق الطريق الرئيسي المؤدي إلى وسط المدينة لصيانة الجسر",
    "ازدحام مروري خانق يواجه السكان يوميًا خلال تنقلهم من وإلى العمل على الطرق السريعة المؤدية إلى المدينة",
    "التأخيرات المرورية اليوم ناجمة عن سوء الأحوال الجوية والضباب الذي يعيق الرؤية ويبطئ حركة السيارات",
    "حوادث السير المتكررة على الطريق الزراعي تؤدي إلى ازدحام مروري شديد وتأخير كبير في وصول السيارات",
    "كانت حركة السير اليوم غير منتظمة بسبب تنظيم حدث رياضي كبير أدى إلى إغلاق بعض الشوارع الرئيسية",
    "الأعمال الإنشائية في الشارع الرئيسي تسببت في اختناق مروري كامل وتباطؤ في حركة السيارات خلال النهار",
    "تراكم السيارات عند تقاطع الطرق الرئيسية أدى إلى ازدحام شديد وزيادة كبيرة في مدة الانتظار للوصول إلى الجهة المطلوبة"
]

# Defining Vocabulary and Model Parameters

In this step, we define key parameters that will be used to configure the CBOW model.

- **Vocabulary size**: We calculate the size of the vocabulary from the number of unique words in the corpus. The `vocab_size` is the number of unique tokens (words) plus one, because the Keras tokenizer assigns word indices starting at 1 and index 0 is reserved for padding.
  
- **Embedding size**: The `embedding_size` defines the dimensionality of the word embeddings. In this case, we set the embedding size to 10, meaning each word will be represented as a 10-dimensional vector in the embedding layer.

- **Window size**: The `window_size` defines how many words to the left and right of the target word are considered as context. Here, a window size of 2 means that two words before and two words after the target word will be used as context.

These parameters will play an essential role in shaping the CBOW model architecture.
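
To make the three parameters concrete, here is a minimal sketch on a toy English corpus. The corpus and the names `toy_corpus`, `word_index`, `context_length`, and `embedding_table_shape` are hypothetical and used only for illustration; the notebook itself works on the Arabic corpus.

```python
# Toy corpus (hypothetical), used only to illustrate the parameters.
toy_corpus = ["the road is busy today", "the road is closed"]

# Give each unique word an integer id starting at 1; id 0 stays free for padding.
word_index = {}
for sentence in toy_corpus:
    for word in sentence.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

vocab_size = len(word_index) + 1      # unique words + 1 for the padding id 0
embedding_size = 10                   # length of each word vector
window_size = 2                       # words taken on each side of the target

context_length = 2 * window_size                      # ids per full context
embedding_table_shape = (vocab_size, embedding_size)  # one row per id

print(vocab_size, context_length, embedding_table_shape)
```

The embedding table needs one row for every id the model can receive, which is why the padding id is counted in `vocab_size`.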


In [40]:
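# Unique whitespace-separated words, plus 1 because index 0 is reserved for padding.
vocab_size = len(set(" ".join(corpus).split())) + 1
embedding_size = 10  # dimensionality of each word vector
window_size = 2      # context words taken on each side of the target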


# Preparing Context-Target Pairs for CBOW

In this step, we generate the context-target pairs from the tokenized sequences to train the CBOW model.

- **Context words**: For each word in a sequence, the surrounding words (within the window size) are considered as context. The context consists of the words immediately before and after the target word.
  
- **Target word**: The word in the middle of the context window is treated as the target word that the model will learn to predict.

We iterate through each sequence, collecting the context words and corresponding target words:
- For each word in a sequence, we gather the surrounding words based on the defined window size.
- The middle word is the target, and the surrounding words form the context.

Finally:
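- **`X`**: A list of context windows, each a list of word indices. Windows at the start and end of a sentence contain fewer than `2 * window_size` indices, so they must be padded to a fixed length before training.
- **`y`**: The target words are one-hot encoded, so each target is represented as a vector whose length equals the vocabulary size.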

These context-target pairs will be used to train the CBOW model to predict a target word based on its context.
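
The pairing logic can be sketched in plain Python. This variant also pads short contexts with id 0 so that every context has exactly `2 * window_size` entries; the function name `make_padded_pairs` and the `pad_id` parameter are hypothetical, and the toy sequence reuses the first five token ids of the first tokenized sentence.

```python
def make_padded_pairs(sequences, window_size, pad_id=0):
    """Return (contexts, targets); each context is padded to 2 * window_size ids."""
    contexts, targets = [], []
    for seq in sequences:
        for i, target in enumerate(seq):
            left = seq[max(0, i - window_size):i]      # up to window_size ids before
            right = seq[i + 1:i + 1 + window_size]     # up to window_size ids after
            context = left + right
            # Post-pad contexts that were cut short at a sentence boundary.
            context = context + [pad_id] * (2 * window_size - len(context))
            contexts.append(context)
            targets.append(target)
    return contexts, targets


toy_sequences = [[3, 19, 7, 4, 47]]
contexts, targets = make_padded_pairs(toy_sequences, window_size=2)
for context, target in zip(contexts, targets):
    print(context, "->", target)
```

Only the middle position has a full four-word context here; the other four are filled up with the padding id.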


In [41]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# convert text to sequences
sequences = tokenizer.texts_to_sequences(corpus)


tokenized_sequences = [sequence for sequence in sequences if len(sequence) > 0]  # filter empty sequences


In [42]:
def generate_context_target_pairs(sequences, window_size):
    X, y = [], []
    for sequence in sequences:
        for i, word in enumerate(sequence):

            start = max(0, i - window_size)
            end = min(len(sequence), i + window_size + 1)

            context = [sequence[j] for j in range(start, end) if j != i]
            target = word

            X.append(context)
            y.append(target)

    return X, y

In [43]:
X, y = generate_context_target_pairs(tokenized_sequences, window_size)

In [44]:
tokenized_sequences

[[3, 19, 7, 4, 47, 20, 21, 30, 48, 49, 2, 22, 12, 1, 23],
 [31, 3, 50, 51, 32, 52, 53, 54, 8, 55, 56, 33, 57, 58, 59],
 [60, 61, 62, 2, 63, 1, 64, 65, 66, 8, 67, 68, 69, 70, 34, 1, 3],
 [6, 33, 4, 8, 20, 1, 13, 14, 71, 1, 5, 15, 10, 35, 36],
 [37, 9, 16, 1, 13, 4, 72, 73, 74, 24, 75, 5, 6, 76, 1, 22, 12],
 [9, 6, 77, 1, 78, 79, 80, 81, 82, 2, 83, 11, 25, 84, 5, 15, 85],
 [3, 38, 86, 87, 88, 10, 89, 90, 4, 91, 26, 24, 92, 5, 27],
 [93, 32, 94, 1, 8, 20, 1, 95, 96, 28, 97, 2, 22, 39, 98, 2, 99],
 [5, 15, 100, 7, 4, 101, 24, 102, 1, 40, 17, 28, 103, 11, 104, 23, 2, 37],
 [31, 11, 105, 106, 107, 18, 3, 14, 7, 4, 8, 108, 21, 109, 110, 7],
 [3, 2, 111, 19, 7, 4, 112, 41, 113, 26, 1, 5, 15, 18, 3, 38],
 [13, 19, 114, 115, 26, 28, 116, 117, 118, 119, 120, 11, 121, 23, 2, 25],
 [9, 6, 1, 17, 122, 123, 42, 124, 10, 35, 36, 125, 126, 127, 128],
 [129, 8, 1, 29, 130, 4, 43, 3, 14, 131, 2, 40, 17, 132, 133],
 [9, 16, 134, 135, 136, 137, 10, 138, 11, 139, 25, 18, 29, 140, 141, 2, 17],
 [142, 143, 7,

In [45]:
X

[[19, 7],
 [3, 7, 4],
 [3, 19, 4, 47],
 [19, 7, 47, 20],
 [7, 4, 20, 21],
 [4, 47, 21, 30],
 [47, 20, 30, 48],
 [20, 21, 48, 49],
 [21, 30, 49, 2],
 [30, 48, 2, 22],
 [48, 49, 22, 12],
 [49, 2, 12, 1],
 [2, 22, 1, 23],
 [22, 12, 23],
 [12, 1],
 [3, 50],
 [31, 50, 51],
 [31, 3, 51, 32],
 [3, 50, 32, 52],
 [50, 51, 52, 53],
 [51, 32, 53, 54],
 [32, 52, 54, 8],
 [52, 53, 8, 55],
 [53, 54, 55, 56],
 [54, 8, 56, 33],
 [8, 55, 33, 57],
 [55, 56, 57, 58],
 [56, 33, 58, 59],
 [33, 57, 59],
 [57, 58],
 [61, 62],
 [60, 62, 2],
 [60, 61, 2, 63],
 [61, 62, 63, 1],
 [62, 2, 1, 64],
 [2, 63, 64, 65],
 [63, 1, 65, 66],
 [1, 64, 66, 8],
 [64, 65, 8, 67],
 [65, 66, 67, 68],
 [66, 8, 68, 69],
 [8, 67, 69, 70],
 [67, 68, 70, 34],
 [68, 69, 34, 1],
 [69, 70, 1, 3],
 [70, 34, 3],
 [34, 1],
 [33, 4],
 [6, 4, 8],
 [6, 33, 8, 20],
 [33, 4, 20, 1],
 [4, 8, 1, 13],
 [8, 20, 13, 14],
 [20, 1, 14, 71],
 [1, 13, 71, 1],
 [13, 14, 1, 5],
 [14, 71, 5, 15],
 [71, 1, 15, 10],
 [1, 5, 10, 35],
 [5, 15, 35, 36],
 [15, 1

In [46]:
y

[3, 19, 7, 4, 47, 20, 21, 30, 48, 49, 2, 22, 12, 1, 23,
 31, 3, 50, 51, 32, 52, 53, 54, 8, 55, 56, 33, 57, 58, 59,
 60, 61, 62, 2, 63, 1, 64, 65, 66, 8, 67, 68, 69, 70, 34, 1, 3,
 6, 33, 4, 8, 20, 1, 13, 14, 71, 1, 5, 15, 10, 35, 36,
 37, 9, 16, 1, 13, 4, 72, 73, 74, 24, 75, 5, 6, 76, 1, 22, 12,
 9, 6, 77, 1, 78, 79, 80, 81, 82, 2, 83, 11, 25, 84, 5, 15, 85,
 3, 38, 86, 87, 88, 10, 89, 90, 4, 91, 26, 24, 92, 5, 27,
 93, 32, 94, 1, 8, 20, 1, 95, 96, 28, 97, 2, 22, 39, 98, 2, 99,
 5, 15, 100, 7, 4, 101, 24, 102, 1, 40, 17, 28, 103, 11, 104, 23, 2, 37,
 31, 11, 105, 106, 107, 18, 3, 14, 7, 4, 8, 108, 21, 109, 110, 7,
 3, 2, 111, 19, 7, 4, 112, 41, 113, 26, 1, 5, 15, 18, 3, 38,
 13, 19, 114, 115, 26, 28, 116, 117, 118, 119, 120, 11, 121, 23, 2, 25,
 9, 6, 1, 17, 122, 123, 42, 124, 10, 35, 36, 125,
 1

In [47]:
# One-hot encode the integer targets directly.
# A second Tokenizer is not needed here: fitting one on integers raises
# "AttributeError: 'int' object has no attribute 'lower'", and re-fitting on
# strings would overwrite `tokenizer`, whose word_index is needed later to map
# predicted indices back to words.
y_encoded = tf.keras.utils.to_categorical(y, num_classes=vocab_size)
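
One-hot encoding a list of integer targets means that row `i` holds a single 1 in the column equal to the id of target `i`. The sketch below shows this with NumPy alone; the helper `one_hot` and the toy values are hypothetical. `tf.keras.utils.to_categorical(y, num_classes=vocab_size)` produces the same kind of matrix.

```python
import numpy as np


def one_hot(targets, vocab_size):
    """Return a (len(targets), vocab_size) matrix with one 1.0 per row."""
    encoded = np.zeros((len(targets), vocab_size), dtype="float32")
    encoded[np.arange(len(targets)), targets] = 1.0
    return encoded


toy_targets = [3, 19, 7]   # integer word ids
toy_vocab_size = 20        # must be larger than the largest id

encoded = one_hot(toy_targets, toy_vocab_size)
print(encoded.shape)
print(encoded.argmax(axis=1))
```

Because the column position is the word id itself, the targets stay in the same index space as the context ids in `X`.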

# Building and Training the CBOW Model

In this step, we build and train the Continuous Bag of Words (CBOW) model using the context-target pairs created earlier.

1. **Model architecture**:
   - **Embedding layer**: This layer transforms the input context word indices into dense vector representations (embeddings) of size `embedding_size`. It must be the first layer, because it expects integer indices rather than floating-point values. The `input_dim` is set to the vocabulary size, and each input sample holds `2 * window_size` indices (the context words from both sides of the target).
   
   - **Lambda layer**: This layer computes the mean of the context word embeddings. It averages the embeddings of the context words to generate a single representation that will be used to predict the target word.
   
   - **Dense layer**: This fully connected layer outputs a probability distribution over the vocabulary, using the softmax activation function. It predicts the most likely target word based on the context word embeddings.

2. **Compilation**:
   The model is compiled using the Adam optimizer and categorical cross-entropy as the loss function, which is suitable for multi-class classification tasks. Accuracy is used as a metric to evaluate the model's performance during training.

3. **Training the model**:
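   The model is trained on the context-target pairs for a fixed number of epochs. The exercise specifies 500 epochs; the training cell in this notebook passes `epochs=5`, so raise that value to match. During each epoch, the model learns to predict the target word based on the context, refining its weights to improve accuracy.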

4. **Saving the model weights**:
   After training, the model weights are saved to a file (`cbow_model.weights.h5`) for future use. This allows us to load the trained model later without retraining.

By the end of this step, the CBOW model will have learned to predict target words based on their surrounding context from the given corpus.
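
The three layers described above amount to a lookup, an average, and a softmax. The following NumPy sketch traces one forward pass and the cross-entropy loss on random weights. All sizes, the names `E`, `W`, `b`, and `cbow_forward`, and the toy ids are assumptions for illustration; this is not the Keras model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size, window_size = 12, 10, 2   # toy sizes

E = rng.normal(size=(vocab_size, embedding_size))   # embedding table, one row per id
W = rng.normal(size=(embedding_size, vocab_size))   # dense layer weights
b = np.zeros(vocab_size)                            # dense layer bias


def cbow_forward(context_ids):
    """Map (batch, 2 * window_size) integer ids to (batch, vocab_size) probabilities."""
    vectors = E[context_ids]                 # embedding lookup
    hidden = vectors.mean(axis=1)            # average over the context positions
    logits = hidden @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)            # softmax


context_ids = np.array([[3, 5, 7, 9], [1, 2, 4, 0]])   # second row ends with padding id 0
target_ids = np.array([6, 3])

probs = cbow_forward(context_ids)
# Categorical cross-entropy with one-hot targets reduces to -log p(target).
loss = -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()
print(probs.shape, probs.sum(axis=1), loss)
```

Note that a plain mean also averages in the embedding of the padding id 0 for contexts that were padded.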

In [48]:
model = Sequential([
    tf.keras.Input(shape=(2 * window_size,)),  # each sample holds 2 * window_size context indices
    # The Embedding layer must come first because it expects integer word indices.
    # A Dense layer placed in front of it feeds floats into the lookup and causes
    # "InvalidArgumentError: indices[...] is not in [0, 179)" during training.
    Embedding(input_dim=vocab_size, output_dim=embedding_size),
    Lambda(lambda x: tf.reduce_mean(x, axis=1)),  # average the context embeddings
    Dense(vocab_size, activation='softmax')
])

In [49]:
model.summary()

In [50]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [51]:
# Contexts at sentence edges have fewer than 2 * window_size words, so X is a ragged
# list that Keras rejects ("ValueError: Unrecognized data type ... of type <class 'list'>").
# Pad every context with the reserved index 0 to a fixed length.
# y_encoded is already a rectangular matrix and must not be padded.
X_padded = pad_sequences(X, maxlen=2 * window_size, padding="post")

model.fit(X_padded, np.array(y_encoded), epochs=5, verbose=1)

In [38]:
model.save_weights('cbow_model.weights.h5')

# Predicting a Word Using the Trained CBOW Model

In this step, we define a function to predict a word based on a given context using the trained CBOW model.

1. **Function: `predict_word`**:
   - **Input**: The function takes a list of context words as input. The number of context words should match the expected size (2 times the window size).
   - **Context sequence conversion**: The input context words are tokenized into a sequence of integers using the same tokenizer that was used during training.
   - **Input validation**: The function checks whether the length of the context sequence matches the expected size (2 times the window size). If not, it prints an error message.
   - **Prediction**: The tokenized context is fed into the trained CBOW model, which predicts the probability distribution over the vocabulary.
   - **Retrieve predicted word**: The predicted word is the one with the highest probability. The function retrieves the word corresponding to the predicted index from the tokenizer's word index.

2. **Example**:
   - We provide a sample context: `['الحادث', 'بسبب', 'مزدحم', 'الطريق']`.
   - The function predicts the word that fits best in this context, based on the model's learned weights.
   - The predicted word is printed along with the input context.

This function allows us to test the CBOW model by predicting words based on their surrounding context from the corpus.
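
A minimal sketch of such a function is shown below. To keep it self-contained it does not depend on Keras: the word-to-index mapping and the prediction function are passed in as arguments, and the demo uses a hypothetical toy vocabulary and a stub in place of the trained model. The signature, the unknown-word check, and the stub `toy_predict` are assumptions, not part of the exercise.

```python
import numpy as np


def predict_word(context_words, word_index, predict_fn, window_size):
    """Return the most probable word for the context, or None if the input is invalid."""
    expected = 2 * window_size
    if len(context_words) != expected:
        print(f"Expected {expected} context words, got {len(context_words)}.")
        return None
    unknown = [w for w in context_words if w not in word_index]
    if unknown:
        print(f"Words not in the vocabulary: {unknown}")
        return None

    ids = np.array([[word_index[w] for w in context_words]])   # shape (1, 2 * window_size)
    probs = np.asarray(predict_fn(ids))[0]                      # distribution over the vocabulary
    predicted_id = int(np.argmax(probs))

    index_word = {index: word for word, index in word_index.items()}
    return index_word.get(predicted_id)   # None if the padding id 0 is predicted


# Hypothetical stand-ins for the tokenizer vocabulary and the trained model.
toy_word_index = {"road": 1, "busy": 2, "today": 3, "because": 4, "accident": 5}


def toy_predict(ids):
    probs = np.zeros((len(ids), 6))
    probs[:, 3] = 1.0   # always favors id 3 ("today")
    return probs


result = predict_word(["road", "busy", "because", "accident"],
                      toy_word_index, toy_predict, window_size=2)
print(result)
```

With the notebook's objects, the call would look like `predict_word(context, tokenizer.word_index, model.predict, window_size)`, assuming `tokenizer` is still the one fitted on the corpus.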