Comparing the two Word2Vec architectures, CBOW (Continuous Bag of Words) and Skip-Gram, normally means training a model, for example with the Gensim library. The model loaded in this notebook, 'glove-wiki-gigaword-100', is different: it was pre-trained with GloVe, a separate word-embedding algorithm developed at Stanford that has no CBOW or Skip-Gram variant. The functions below therefore only imitate the two prediction directions on top of fixed GloVe vectors; they do not train or run either architecture.

| Feature                 | CBOW (Continuous Bag of Words)                         | Skip-Gram                                            |
| ----------------------- | ------------------------------------------------------ | ---------------------------------------------------- |
| **Goal**                | Predict the target word from surrounding context words | Predict surrounding context words from a target word |
| **Training speed**      | Faster (more efficient on large data)                  | Slower (more computations)                           |
| **Performs better for** | Frequent words                                         | Rare words                                           |
| **Use case**            | General NLP tasks with sufficient data                 | Learning better representations of infrequent words  |
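
The difference in the **Goal** row is easiest to see in the training examples each architecture would consume. The following is a minimal sketch in plain Python, not Gensim's implementation; the helper name `training_pairs` and the toy sentence are made up for illustration.

```python
def training_pairs(tokens, window=2):
    """Build (context -> target) examples for CBOW and
    (target -> context word) examples for Skip-Gram."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if not context:
            continue
        # CBOW: all context words together predict one target word.
        cbow.append((context, target))
        # Skip-Gram: the target word predicts each context word separately.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


sentence = "the king rules the land".split()  # toy sentence, illustration only
cbow_examples, skipgram_examples = training_pairs(sentence, window=1)

print(cbow_examples[1])       # (['the', 'rules'], 'king')
print(skipgram_examples[:3])  # [('the', 'king'), ('king', 'the'), ('king', 'rules')]
```

Skip-Gram produces one example per (target, context word) pair, so the same sentence yields more training examples than CBOW, which is consistent with the slower training noted in the table.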


In [1]:
#!pip install gensim


In [1]:
import gensim.downloader as api
import numpy as np


In [2]:
# List available models
print("Available models:")
print(api.info()['models'].keys())
# Load the pre-trained 100-dimensional GloVe vectors (downloaded and cached on first use)
model = api.load("glove-wiki-gigaword-100")

Available models:
dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])


CBOW vs. Skip-Gram

In [3]:
def cbow_predict(context_words, model, topn=5):
    """
    Approximate a CBOW-style prediction: average the vectors of the
    context words and return the top-n words closest to that average.
    """
    context_vectors = []
    for word in context_words:
        if word in model:
            context_vectors.append(model[word])
        else:
            print(f"'{word}' not in vocabulary.")

    if not context_vectors:
        return []

    # Average the vectors to get the context representation
    avg_vector = np.mean(context_vectors, axis=0)
    # Use similar_by_vector to find the top-n words closest to the context vector
    # (the context words themselves are not excluded from the results)
    similar_words = model.similar_by_vector(avg_vector, topn=topn)

    return similar_words


In [4]:
context = ["king", "man"]
predicted_words = cbow_predict(context, model)
print("Predicted words:", predicted_words)


Predicted words: [('king', 0.8817193508148193), ('man', 0.8566084504127502), ('father', 0.8132981061935425), ('brother', 0.8037790656089783), ('son', 0.7959659695625305)]
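
The context words 'king' and 'man' top the list because the lookup ranks every vocabulary word by cosine similarity to the averaged vector and does not exclude the query words. The sketch below reproduces that ranking step with NumPy on a hypothetical four-word vocabulary; the 3-dimensional vectors and the helper `rank_by_cosine` are invented for illustration and are not taken from the GloVe model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented purely for illustration.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.7, 0.9, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "apple": np.array([0.0, 0.1, 0.9]),
}


def rank_by_cosine(query, vectors, topn=3):
    """Return the topn (word, cosine similarity) pairs closest to query."""
    q = query / np.linalg.norm(query)
    scores = {
        word: float(np.dot(q, vec / np.linalg.norm(vec)))
        for word, vec in vectors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]


# Average the context vectors, as cbow_predict does, then rank all words.
avg = np.mean([toy_vectors["king"], toy_vectors["man"]], axis=0)
ranking = rank_by_cosine(avg, toy_vectors)
print(ranking)
```

With these toy vectors the two context words rank first, followed by 'queen', which mirrors the pattern in the real output above.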


In [5]:
def skipgram_predict(target_word, model, topn=5):
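    """
    Approximate a Skip-Gram-style prediction by returning the words most
    similar to the target word. These are nearest neighbours in embedding
    space, not the output of a trained Skip-Gram model.
    """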
    if target_word not in model:
        print(f"'{target_word}' not in vocabulary.")
        return []

    # Get most similar words to the target word
    similar_words = model.most_similar(target_word, topn=topn)

    return similar_words


In [6]:
target = "king"
predicted_context = skipgram_predict(target, model)
print("Predicted context words:", predicted_context)


Predicted context words: [('prince', 0.7682328820228577), ('queen', 0.7507690787315369), ('son', 0.7020888328552246), ('brother', 0.6985775232315063), ('monarch', 0.6977890729904175)]
