
## Introduction to Word Embeddings

Word embeddings represent words as dense, continuous vectors, typically with tens to hundreds of dimensions.

They map words to vectors of real numbers so that words with similar meanings have similar vector representations.
Unlike traditional representations such as Bag of Words (BoW), word embeddings capture semantic relationships between words by placing similar words closer together in the vector space.
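
To make the contrast concrete, here is a minimal sketch of a one-hot encoding, the per-word building block behind count-based representations such as BoW. The three-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for illustration. Every word gets its own axis, so the dot product between any two different words is zero and the encoding carries no notion of relatedness.

```python
# Hypothetical three-word vocabulary, for illustration only.
vocabulary = ["king", "queen", "banana"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's index and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

king = one_hot("king", vocabulary)
queen = one_hot("queen", vocabulary)
banana = one_hot("banana", vocabulary)

# 'king' is exactly as unrelated to 'queen' as it is to 'banana'.
print(dot(king, queen))
print(dot(king, banana))
```

Dense embeddings avoid this by letting related words share directions in the space.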

### Key Concepts

1. **Word Embedding**: A dense vector representation of a word. Meaning is spread across the dimensions, which are usually not interpretable one by one.
2. **Pre-trained Embeddings**: Embeddings learned in advance from large corpora with methods such as Word2Vec, GloVe, and FastText.
3. **Semantic Similarity**: Words with similar meanings will have similar embeddings, making it easier to perform tasks like word similarity and analogy.
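
Similarity between embeddings is commonly measured with cosine similarity, the cosine of the angle between two vectors. The sketch below uses hand-picked, hypothetical 3-dimensional vectors (not taken from any trained model) and a helper named `cosine_similarity` defined here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen by hand so that 'cat' and 'dog' point
# in a similar direction while 'car' points elsewhere.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these toy values, the `cat`/`dog` score comes out higher than the `cat`/`car` score, which is the behaviour expected of real embeddings for related versus unrelated words.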

In [None]:
from gensim.models import KeyedVectors


# Load pre-trained Word2Vec model (Google News vectors)
# Note: This model is quite large. For demonstration, use a smaller or different model as needed.
# model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)

In [1]:
# For demonstration, we'll use a smaller pre-trained model available in gensim
from gensim.downloader import load
model = load('glove-wiki-gigaword-50')
# GloVe (Global Vectors for Word Representation) model trained on the Wikipedia and Gigaword corpus with 50-dimensional vectors.
# So each word is represented by a 50-dimensional vector.



In [2]:
# Example words
words = ['king', 'queen', 'man', 'woman']

In [3]:
# Get embeddings
embeddings = {word: model[word] for word in words}

Each value is a float giving the word's coordinate along one dimension of the embedding space.

The values are real numbers and can be positive, negative, or zero. Their range depends on the embedding model and how it was trained; in the output below, the components shown lie between about -2.3 and 2.0.
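
One way to get a feel for these values is to summarise a vector numerically. The sketch below hard-codes the first seven components of the `king` vector as printed in this notebook's output, so it runs without downloading the model; the helper `summarize` is an assumption introduced for illustration. With the real model, the same function could be applied to `model['king']`.

```python
import math

# First seven components of the 'king' vector, copied from the printed output.
king_head = [0.50451, 0.68607, -0.59517, -0.022801, 0.60046, -0.13498, -0.08813]

def summarize(values):
    """Return basic statistics for a list of floats."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "negatives": sum(1 for v in values if v < 0),
        "l2_norm": math.sqrt(sum(v * v for v in values)),
    }

stats = summarize(king_head)
for name, value in stats.items():
    print(f"{name}: {value}")
```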

In [4]:
# Display embeddings
for word, vector in embeddings.items():
    print(f"Word: {word}\nEmbedding: {vector}\n")

Word: king
Embedding: [ 0.50451   0.68607  -0.59517  -0.022801  0.60046  -0.13498  -0.08813
  0.47377  -0.61798  -0.31012  -0.076666  1.493    -0.034189 -0.98173
  0.68229   0.81722  -0.51874  -0.31503  -0.55809   0.66421   0.1961
 -0.13495  -0.11476  -0.30344   0.41177  -2.223    -1.0756   -1.0783
 -0.34354   0.33505   1.9927   -0.04234  -0.64319   0.71125   0.49159
  0.16754   0.34344  -0.25663  -0.8523    0.1661    0.40102   1.1685
 -1.0137   -0.21585  -0.15155   0.78321  -0.91241  -1.6106   -0.64426
 -0.51042 ]

Word: queen
Embedding: [ 0.37854    1.8233    -1.2648    -0.1043     0.35829    0.60029
 -0.17538    0.83767   -0.056798  -0.75795    0.22681    0.98587
  0.60587   -0.31419    0.28877    0.56013   -0.77456    0.071421
 -0.5741     0.21342    0.57674    0.3868    -0.12574    0.28012
  0.28135   -1.8053    -1.0421    -0.19255   -0.55375   -0.054526
  1.5574     0.39296   -0.2475     0.34251    0.45365    0.16237
  0.52464   -0.070272  -0.83744   -1.0326     0.45946    0.2530

In [None]:
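# Find the five words closest to 'computer'
similar_words = model.most_similar('computer', topn=5)

# Display similar words; most_similar returns a list of (word, similarity) pairs
for word, score in similar_words:
    print(f"{word}: {score:.4f}")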
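
Conceptually, a nearest-neighbour query ranks every other word in the vocabulary by cosine similarity to the query word. The following is a minimal pure-Python sketch of that idea, using hypothetical 2-dimensional vectors and a hypothetical helper `most_similar_toy`; it is not gensim's implementation, which is vectorised and may differ in detail.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar_toy(query, vectors, topn=2):
    """Rank all words except the query by cosine similarity to it."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, chosen by hand for illustration.
toy_vectors = {
    "computer": [0.9, 0.1],
    "laptop": [0.8, 0.2],
    "keyboard": [0.7, 0.4],
    "banana": [0.1, 0.9],
}

neighbors = most_similar_toy("computer", toy_vectors, topn=2)
for word, score in neighbors:
    print(f"{word}: {score:.4f}")
```

The result has the same shape as the list returned by `model.most_similar`: pairs of word and similarity score, sorted from most to least similar.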

In [None]:
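# Solve analogy: queen - king + man ≈ ?
analogy_result = model.most_similar(positive=['queen', 'man'], negative=['king'], topn=1)

# Display result; analogy_result is a list holding one (word, similarity) pair
print(analogy_result)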
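
The analogy query can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. Below is a simplified sketch with hypothetical 2-dimensional vectors, where the first coordinate loosely stands for "royalty" and the second for "gender". The helper `solve_analogy` is an assumption for illustration and works on raw vectors; gensim's own implementation differs in details such as normalisation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(positive, negative, vectors):
    """Add positive vectors, subtract negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Input words are excluded from the candidates.
    excluded = set(positive) | set(negative)
    candidates = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical toy vectors: [royalty, gender], chosen by hand.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.1],
}

result = solve_analogy(positive=["queen", "man"], negative=["king"], vectors=toy_vectors)
print(result)
```

In this toy space, `queen - king + man` lands on the vector for `woman`, so that word is returned as the nearest candidate.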