
# Load pre-trained GloVe model

GloVe (Global Vectors for Word Representation) is a method for converting words into vectors of numbers. It is trained on how often words appear together in text, and it assigns each word a vector so that words used in similar contexts end up with similar vectors. This lets programs measure how related two words are and find relationships between them, like "king" being close to "queen."
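To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure commonly used to compare word embeddings. The three-dimensional vectors are invented for illustration and are not real GloVe values; `cosine_similarity` and `toy_vectors` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.3],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

In this toy setup "king" scores higher against "queen" than against "apple" because the first two vectors point in nearly the same direction.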

Gensim is an open-source Python library for topic modeling, document indexing, and similarity retrieval over large corpora. It is widely used in natural language processing (NLP) and is best known for its implementations of word embedding models such as Word2Vec, Doc2Vec, and FastText. Its `gensim.downloader` module, used below, fetches pre-trained embeddings by name so they can be used without any training.

In [1]:
import gensim.downloader as api

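# Load pre-trained GloVe model: 100-dimensional vectors with a lowercased
# vocabulary, trained on Wikipedia + Gigaword (downloaded on first use)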
model = api.load("glove-wiki-gigaword-100")





# Find analogy using word embeddings

In [2]:
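# Function to find analogy using word embeddings
def find_analogy(word1, word2, word3, topn=3):
    """Return the top matches for word1 - word2 + word3 (e.g. king - man + woman)."""
    try:
        # Vector arithmetic: add word1 and word3, subtract word2
        analogy = model.most_similar(positive=[word1, word3], negative=[word2], topn=topn)
        return analogy
    except KeyError as e:
        print(f"Word not found in vocabulary: {e}")
        return None

# Example analogy

In [3]:
word1 = "king"
word2 = "man"
word3 = "woman"
# Find the analogy: king - man + woman
analogy = find_analogy(word1, word2, word3)
analogy

Argument order matters here. With `positive=[word1, word3], negative=[word2]` the call computes king - man + woman, for which "queen" is the expected top match. Passing `positive=[word1, word2], negative=[word3]` instead computes king + man - woman and returns male-royalty neighbours:

[('prince', 0.6826401948928833),
 ('brother', 0.6500723958015442),
 ('ii', 0.6345422267913818)]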
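To see why the contents of the `positive` and `negative` lists matter, the following is a simplified sketch of the arithmetic behind an analogy query: normalize each vector, add the positive words, subtract the negative ones, then rank the remaining words by cosine similarity. It is not gensim's actual implementation, and the three-dimensional vectors (roughly "royalty", "male", "female") are invented for illustration. `toy_analogy` and `toy` are hypothetical names.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity via the dot product of unit vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def toy_analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, normalize(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, normalize(vectors[word]))]
    # Query words are excluded from the candidates
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical vectors, invented for illustration: (royalty, male, female)
toy = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "prince": [0.6, 1.0, 0.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
}

# king - man + woman
print(toy_analogy(toy, positive=["king", "woman"], negative=["man"]))
# king + man - woman
print(toy_analogy(toy, positive=["king", "man"], negative=["woman"]))
```

With these toy vectors the first query ranks "queen" first and the second ranks "prince" first, which mirrors the kind of result shown above when "man" is added and "woman" is subtracted.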


# Better Model
The word2vec-google-news-300 model is a pre-trained word embedding model trained with the word2vec algorithm on part of the Google News dataset (about 100 billion words). It provides 300-dimensional vectors for roughly 3 million words and phrases, compared with the 100-dimensional GloVe vectors used above, and the download is correspondingly large. The vectors capture semantic relationships between words from the contexts in which they appear in the training data, which makes them useful for natural language processing tasks such as word similarity, analogy detection, and text classification.

In [4]:
import gensim.downloader

# Download and load pre-trained word embeddings model
# (this rebinds `model`, replacing the GloVe vectors loaded earlier)
model = gensim.downloader.load('word2vec-google-news-300')



# Define analogy relationship

In [10]:
# Read the three words of the analogy: origin - related + target
origin = input('Enter the origin word: ')
related = input('Enter the word related to the origin: ')
target = input('Enter the word the answer should relate to: ')

# Calculate analogy
try:
    result = model.most_similar(positive=[origin, target], negative=[related], topn=1)
    print(f"'{origin}' - '{related}' + '{target}' ≈ '{result[0][0]}'")
except KeyError:
    print("One or more words not found in vocabulary.")

Enter the origin word: tokyo
Enter the word related to the origin: japan
Enter the word the answer should relate to: germany
'tokyo' - 'japan' + 'germany' ≈ 'sweden'
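The lowercase inputs above returned 'sweden' rather than a capital city. One likely factor is spelling: the Google News vocabulary is case-sensitive and stores multi-word phrases with underscores, so 'tokyo' and 'Tokyo' are separate entries. The sketch below lists which spellings of a word are present in a vocabulary before querying. `spellings_in_vocab` and `toy_vocab` are hypothetical names, and the toy vocabulary is a stand-in for the real one.

```python
def spellings_in_vocab(word, vocab):
    """Return the case/underscore variants of `word` that exist in `vocab`."""
    joined = word.strip().replace(" ", "_")
    candidates = [
        joined,               # as typed
        joined.lower(),       # tokyo
        joined.capitalize(),  # Tokyo
        joined.title(),       # New_York
        joined.upper(),       # USA
    ]
    found = []
    for candidate in candidates:
        # Skip duplicates while keeping the original order
        if candidate in vocab and candidate not in found:
            found.append(candidate)
    return found

# Toy stand-in for a model vocabulary (assumption, for illustration only)
toy_vocab = {"Tokyo", "tokyo", "Japan", "New_York"}

print(spellings_in_vocab("tokyo", toy_vocab))
print(spellings_in_vocab("new york", toy_vocab))
print(spellings_in_vocab("berlin", toy_vocab))
```

With gensim 4.x, `model.key_to_index` can be passed as `vocab`, for example `spellings_in_vocab("tokyo", model.key_to_index)`, to choose which spelling to feed into the analogy.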


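Try the analogies below, typing each word with the capitalization shown. A phrase such as "Social media" contains a space and will likely need an underscore-joined spelling to be found in the vocabulary.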

"Paris" - "France" + "Italy" ≈ ""

"Tokyo" - "Japan" + "Germany" ≈ ""

"Apple" - "Fruit" + "Electronics" ≈ ""

"Facebook" - "Social media" + "Search" ≈ ""

"Cat" - "Pet" + "Bird" ≈ ""