5. CBOW stages
Objective: to learn and understand CBOW
CBOW, which stands for "Continuous Bag of Words," is a type of word embedding model used in natural language processing and deep learning. It is designed to learn distributed representations (word vectors) of words in a large text corpus. CBOW is one of two popular architectures for word embeddings, with the other being Skip-gram.

The primary goal of CBOW is to predict a target word based on the context words that surround it. It operates as follows:

1. **Data Preparation:** To train a CBOW model, a large text corpus is required. The corpus is split into sentences, and each sentence is tokenized into words or subword units.

2. **Context Window:** CBOW defines a context window around each target word. The context window size specifies how many words before and after the target word are considered as context words. For example, with a context window size of 2 and the target word "cat" in the sentence "The quick brown cat jumps," the context words are "quick," "brown," and "jumps." Only one word follows the target, so the window is truncated at the sentence boundary.

3. **Word Embeddings:** Each word in the vocabulary is associated with a unique word embedding vector. These word embeddings are learned during the training process, and their dimensions are typically set as hyperparameters.

4. **Model Architecture:** The CBOW model architecture consists of an input layer, a hidden (projection) layer that holds the word embeddings, and an output layer. The input layer encodes the context words, and the output layer aims to predict the target word. The hidden layer is the average (or, in some formulations, the sum) of the word embeddings for the context words.

5. **Training Objective:** The CBOW model is trained to minimize the difference between its predicted output and the actual target word. This is typically done using a softmax activation function and a categorical cross-entropy loss function. The goal is to make the model's predictions for the target word as accurate as possible given the context words.

6. **Word Vector Learning:** During training, the word embeddings are updated using backpropagation and stochastic gradient descent (SGD) to minimize the loss function. As training progresses, the word embeddings become more representative of the words' meanings based on their co-occurrence patterns in the corpus.
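
The steps above can be traced in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the Keras model used later: the helper names (`context_indices`, `cbow_step`), the 8-dimensional embeddings, the learning rate, and the five-word toy vocabulary are all illustrative choices. It averages the context embeddings, applies a full softmax, and takes plain SGD steps on a single (context, target) pair.

```python
import numpy as np

def context_indices(tokens, idx, window):
    """Return the tokens within `window` positions of tokens[idx], excluding it."""
    lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
    return [tokens[i] for i in range(lo, hi) if i != idx]

def softmax(z):
    z = z - z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_step(W_in, W_out, context, target, lr=0.1):
    """One SGD step on a single pair; returns the loss measured before the update."""
    h = W_in[context].mean(axis=0)        # hidden layer: average of context embeddings
    probs = softmax(h @ W_out)            # predicted distribution over the vocabulary
    loss = -np.log(probs[target])         # cross-entropy against the one-hot target
    grad_scores = probs.copy()
    grad_scores[target] -= 1.0            # d(loss)/d(scores) = probs - one_hot
    grad_h = W_out @ grad_scores          # backpropagate to the hidden layer
    W_out -= lr * np.outer(h, grad_scores)
    # each context word receives an equal share of the hidden-layer gradient
    np.subtract.at(W_in, context, lr * grad_h / len(context))
    return loss

sentence = ["the", "quick", "brown", "cat", "jumps"]
vocab = {w: i for i, w in enumerate(sentence)}
ids = [vocab[w] for w in sentence]

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(len(vocab), 8))    # input embeddings (toy size)
W_out = rng.normal(scale=0.1, size=(8, len(vocab)))   # output weights

context = context_indices(ids, 3, window=2)           # context of "cat"
losses = [cbow_step(W_in, W_out, context, ids[3]) for _ in range(50)]
print(context, losses[0], losses[-1])
```

Repeating the step on the same pair should drive the loss down, which is the behaviour the training objective describes. Practical implementations usually replace the full softmax with negative sampling or a hierarchical softmax to scale to large vocabularies.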

The learned word embeddings capture semantic relationships between words. Words with similar meanings or similar usage tend to have similar word vectors, which makes them useful for natural language processing tasks such as sentiment analysis, text classification, and machine translation. Embeddings pre-trained on large corpora can be used as features in downstream NLP models. Word2Vec provides both the CBOW and Skip-gram architectures, and FastText extends them with subword information; GloVe, by contrast, is trained on global word co-occurrence statistics rather than on either architecture.
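
"Similar word vectors" is usually measured with cosine similarity. The sketch below shows the idea on hand-picked vectors; the three 3-dimensional vectors and the `most_similar` helper are hypothetical and exist only for illustration, they are not learned embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means the same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors, chosen by hand for illustration only.
vectors = {
    "virus":    np.array([0.9, 0.1, 0.0]),
    "pathogen": np.array([0.8, 0.2, 0.1]),
    "banana":   np.array([0.0, 0.1, 0.9]),
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("virus", vectors)
print(ranking)
```

Here "pathogen" ranks first because its vector points in nearly the same direction as the one for "virus", while "banana" is almost orthogonal to it.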

In [1]:
import numpy as np
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from keras import utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import gensim

In [2]:
#pip install keras.utils

In [3]:
data=open('corona.txt','r')

In [4]:
corona_data = [text for text in data if text.count(' ') >= 2]

In [5]:
vectorize = Tokenizer()
vectorize.fit_on_texts(corona_data)

In [6]:
corona_data = vectorize.texts_to_sequences(corona_data)
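
To make the two tokenizer calls above less opaque, here is a rough pure-Python approximation of what they do: build a word index ranked by frequency, starting at 1, then map each text to a list of integer ids. The functions `build_word_index` and `texts_to_ids` are hypothetical stand-ins, and this sketch skips the punctuation filtering that the Keras `Tokenizer` applies by default.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank by descending frequency, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # index 0 is left unused so it can serve as the padding value
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace every known word with its integer id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

texts = ["the virus spreads", "the virus mutates"]
word_index = build_word_index(texts)
ids = texts_to_ids(texts, word_index)
print(word_index)
print(ids)
```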

In [7]:
word_count = len(vectorize.word_index) + 1  # vocabulary size; index 0 is reserved for padding
total_vocab = word_count                    # size of the embedding table and of the softmax output
window_size = 2

In [8]:
def cbow_model(data, window_size, total_vocab):
    # Yield one (context, one-hot target) training pair per word in the corpus.
    total_length = window_size*2
    for text in data:
        text_len = len(text)
        for idx, word in enumerate(text):
            begin = idx - window_size
            end = idx + window_size + 1
            # Neighbours inside the window, skipping the target and out-of-range positions.
            context_word = [[text[i] for i in range(begin, end) if 0 <= i < text_len and i != idx]]
            target = [word]
            # pad_sequences takes `maxlen`; shorter contexts are left-padded with 0.
            contextual = sequence.pad_sequences(context_word, maxlen=total_length)
            final_target = utils.to_categorical(target, total_vocab)
            yield (contextual, final_target)
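
The pairs such a generator is meant to produce can be checked by hand on a toy sequence. The following dependency-free sketch mirrors the windowing and left-padding logic; `cbow_pairs` and the id sequence `[5, 8, 3, 9]` are hypothetical, and the target is left as a plain integer rather than one-hot encoded.

```python
def cbow_pairs(sequence_ids, window_size):
    """Yield (padded_context, target) for every position in one sequence."""
    width = window_size * 2
    for idx, target in enumerate(sequence_ids):
        lo = max(0, idx - window_size)
        hi = min(len(sequence_ids), idx + window_size + 1)
        context = [sequence_ids[i] for i in range(lo, hi) if i != idx]
        # left-pad with 0 so every context has the same length
        padded = [0] * (width - len(context)) + context
        yield padded, target

pairs = list(cbow_pairs([5, 8, 3, 9], window_size=2))
for padded, target in pairs:
    print(padded, "->", target)
```

Words near the start or end of a sequence have fewer neighbours, so their contexts carry padding zeros. This is why index 0 must not be assigned to a real word.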

In [9]:
model = Sequential()
model.add(Embedding(input_dim=total_vocab, output_dim=100, input_length=window_size*2))
# Average the context embeddings to form the hidden layer.
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(100,)))
model.add(Dense(total_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
for i in range(10):
    cost = 0
    # Iterate over the tokenized sequences, not the exhausted file handle `data`.
    for x, y in cbow_model(corona_data, window_size, total_vocab):
        cost += model.train_on_batch(x, y)
    print(i, cost)


In [10]:
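dimensions=100
vect_file = open('vectors.txt','w')
# Header of the word2vec text format: number of vectors written below, then their dimension.
vect_file.write('{} {}\n'.format(len(vectorize.word_index), dimensions))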


8

In [11]:
weights = model.get_weights()[0]
for text, i in vectorize.word_index.items():
    final_vec = ' '.join(map(str, list(weights[i, :])))
    vect_file.write('{} {}\n'.format(text, final_vec))
vect_file.close()
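
The file written above follows the word2vec text layout: a header line with the vector count and dimension, then one line per word holding the word and its components. The round trip below sketches that layout on an in-memory buffer. The helper names, the two-word vocabulary, and the weight values are hypothetical, and the reader is a simplified stand-in rather than gensim's loader.

```python
import io
import numpy as np

def write_word2vec_text(handle, word_index, weights):
    """Write a 'count dim' header, then one 'word v1 v2 ...' line per word."""
    handle.write('{} {}\n'.format(len(word_index), weights.shape[1]))
    for word, i in word_index.items():
        handle.write('{} {}\n'.format(word, ' '.join(map(str, weights[i]))))

def read_word2vec_text(handle):
    """Parse the header, then exactly `count` vector lines."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for _ in range(count):
        parts = handle.readline().split()
        vectors[parts[0]] = np.array(parts[1:], dtype=float)
        assert vectors[parts[0]].shape == (dim,)
    return vectors

word_index = {"virus": 1, "mask": 2}     # hypothetical toy vocabulary
weights = np.array([[0.0, 0.0],          # row 0 is reserved for padding
                    [0.5, -1.0],
                    [0.25, 2.0]])

buffer = io.StringIO()
write_word2vec_text(buffer, word_index, weights)
buffer.seek(0)
loaded = read_word2vec_text(buffer)
print(buffer.getvalue())
```

Because the reader trusts the header, the count written there has to match the number of vector lines that follow.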

In [12]:
cbow_output = gensim.models.KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
cbow_output.most_similar(positive=['virus'])

[('are', 0.21292082965373993),
 ('both', 0.15492649376392365),
 ('shorter', 0.1516490876674652),
 ('to', 0.14281314611434937),
 ('context', 0.13967569172382355),
 ('understood', 0.13876615464687347),
 ('pre', 0.13234691321849823),
 ('does', 0.1282172054052353),
 ('estimated', 0.11985184252262115),
 ('there', 0.11273766309022903)]