
# Comparing Word Embedding Approaches: From One-Hot Encoding to OpenAI Embeddings

This notebook demonstrates and compares different word embedding approaches, ranging from simple one-hot encoding to advanced embeddings like OpenAI's.

## Objectives
1. Understand the evolution of word embeddings.
2. Implement and compare:
   - One-Hot Encoding
   - Word2Vec
   - GloVe
   - BERT (contextual embeddings)
   - OpenAI Embeddings
3. Evaluate the embeddings on a semantic similarity task.
    

In [None]:
%pip install gensim
%pip install transformers
%pip install matplotlib
%pip install scikit-learn
%pip install torch
%pip install openai python-dotenv

In [None]:

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from gensim.models import Word2Vec
from transformers import pipeline

# Example sentences
sentences = [
    "I love natural language processing",
    "Deep learning is a key technology for AI",
    "Word embeddings capture semantic meaning",
    "OpenAI embeddings are state-of-the-art",
    "Machine learning is evolving rapidly"
]

# Tokenize sentences into words for embedding methods
tokenized_sentences = [sentence.lower().split() for sentence in sentences]
    


## 1. One-Hot Encoding

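One-hot encoding represents each word as a binary vector whose length equals the vocabulary size, with a single 1 at the word's index. Because the vectors of any two distinct words are orthogonal, the encoding captures no semantic relationships between words.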
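
To make the limitation concrete, here is a minimal sketch using a made-up three-word vocabulary (`toy_vocab` and the helper `cosine_similarity` are illustrative names, not part of the notebook's later cells). It shows that "cat" is exactly as dissimilar to "dog" as it is to "car".

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration
toy_vocab = ["cat", "dog", "car"]
toy_one_hot = np.eye(len(toy_vocab))  # row i is the one-hot vector for toy_vocab[i]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, car = toy_one_hot
sim_cat_dog = cosine_similarity(cat, dog)  # distinct words: orthogonal
sim_cat_car = cosine_similarity(cat, car)  # distinct words: orthogonal
sim_cat_cat = cosine_similarity(cat, cat)  # identical words: parallel
print(sim_cat_dog, sim_cat_car, sim_cat_cat)
```

Every pair of different words scores 0 and every word scores 1 against itself, so one-hot vectors carry identity but no notion of relatedness.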
    

In [None]:

# One-Hot Encoding
vocabulary = sorted(set(word for sentence in tokenized_sentences for word in sentence))
identity = np.eye(len(vocabulary))  # build the identity matrix once; row i encodes vocabulary[i]
one_hot_vectors = {word: identity[i] for i, word in enumerate(vocabulary)}

# Display one-hot encoding for a few words
print("Vocabulary:", vocabulary)
print("One-Hot Encoding Example:", one_hot_vectors['learning'])
    


## 2. Word2Vec

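Word2Vec learns dense embeddings from local context windows. The skip-gram variant used below (`sg=1`) predicts the surrounding words from a center word, so words that appear in similar contexts end up with similar vectors. With only five training sentences, the vectors produced here are illustrative rather than meaningful.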
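
The sliding-window idea can be sketched by listing the (center, context) pairs a skip-gram model would be trained on. This is a simplified illustration: `skipgram_pairs` is a hypothetical helper, and real implementations add details such as negative sampling and frequent-word subsampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

example_tokens = "machine learning is evolving rapidly".split()
example_pairs = skipgram_pairs(example_tokens, window=2)
print(len(example_pairs))
print(example_pairs[:4])
```

Each pair becomes one prediction task: given the center word, raise the score of the context word.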
    

In [None]:

# Train a skip-gram Word2Vec model (sg=1); min_count=1 keeps every word of this tiny corpus
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=10, window=2, min_count=1, workers=1, sg=1)
word2vec_vectors = {word: word2vec_model.wv[word] for word in vocabulary}

# Display Word2Vec embeddings for a few words
print("Word2Vec Embedding Example (learning):", word2vec_vectors['learning'])
    


## 3. GloVe

GloVe learns embeddings from global co-occurrence statistics: it fits word vectors so that their dot products approximate the logarithm of how often pairs of words co-occur across the whole corpus.
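
As a rough sketch of the statistics involved, the snippet below counts windowed co-occurrences over a made-up two-sentence corpus, weighting each pair by the inverse of the distance between the words. The helper `cooccurrence_counts` and the corpus are assumptions for illustration; the step that fits vectors to these counts is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(tokenized_corpus, window=2):
    """Symmetric co-occurrence counts, each pair weighted by 1 / distance."""
    counts = defaultdict(float)
    for tokens in tokenized_corpus:
        for i, word in enumerate(tokens):
            # Look back at most `window` positions; add both directions
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

# Hypothetical toy corpus
toy_corpus = [
    ["deep", "learning", "is", "fun"],
    ["machine", "learning", "is", "fun"],
]
toy_counts = cooccurrence_counts(toy_corpus, window=2)
print(toy_counts[("learning", "is")])
print(toy_counts[("deep", "is")])
```

Pairs that are adjacent in both sentences accumulate the largest counts, while words that never share a window, such as "deep" and "machine", have no entry at all.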
    

In [None]:

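# Placeholder only: random vectors stand in for GloVe so the notebook runs without a download.
# For real vectors, load a pre-trained GloVe file (for example via gensim.downloader,
# api.load("glove-wiki-gigaword-50")) and look up each vocabulary word.
rng = np.random.default_rng(0)  # fixed seed so the placeholder values are repeatable
glove_vectors = {word: rng.random(10) for word in vocabulary}
print("GloVe Placeholder Example (learning):", glove_vectors['learning'])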
    


## 4. BERT (Contextual Embeddings)

BERT generates contextual embeddings: each token's vector is computed from the whole input sentence, so the same word can have different embeddings depending on its context.
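
The effect can be illustrated without downloading a model. The sketch below is not BERT: it applies a single self-attention step with no learned weights to hand-picked two-dimensional vectors (`toy_static` and `toy_contextualize` are hypothetical names). It only shows how mixing in neighbouring tokens makes the vector for "bank" depend on its context.

```python
import numpy as np

# Hand-picked static vectors, assumed purely for illustration
toy_static = {
    "river": np.array([1.0, 0.0]),
    "money": np.array([0.0, 1.0]),
    "bank": np.array([1.0, 1.0]),
}

def toy_contextualize(tokens):
    """One unparameterized self-attention step over static vectors."""
    X = np.stack([toy_static[t] for t in tokens])   # (seq_len, dim)
    scores = X @ X.T / np.sqrt(X.shape[1])          # pairwise similarity scores
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X                              # each row mixes all tokens

bank_near_river = toy_contextualize(["river", "bank"])[1]
bank_near_money = toy_contextualize(["money", "bank"])[1]
vectors_differ = not np.allclose(bank_near_river, bank_near_money)
print(bank_near_river, bank_near_money, vectors_differ)
```

The static entry for "bank" is the same in both calls, yet the outputs differ because each is pulled toward its neighbour. BERT stacks many such layers with learned projections.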
    

In [None]:

# Use BERT for embeddings
bert_pipeline = pipeline('feature-extraction', model='bert-base-uncased', tokenizer='bert-base-uncased')

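# Generate embeddings for a sentence; the result holds one vector per token
bert_embedding = bert_pipeline("Deep learning is a key technology for AI")[0]
# Index 0 is the special [CLS] token, so index 1 is the first word-piece ("deep")
print("BERT Embedding Shape (first word):", np.array(bert_embedding[1]).shape)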
    


## 5. OpenAI Embeddings

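OpenAI embedding models are accessed through an API and return one dense vector for each input text. The cell below calls an embedding model deployed on Azure OpenAI; it expects `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` to be set, for example in a `.env` file.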
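
Vectors like these are typically compared with cosine similarity. The sketch below ranks a few candidates against a query; the three-dimensional vectors are made up for illustration and are not real API output, and `rank_by_cosine` is a hypothetical helper.

```python
import numpy as np

def rank_by_cosine(query_vec, candidates):
    """Sort candidate labels by cosine similarity to the query, highest first."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scored = []
    for label, vec in candidates.items():
        v = np.asarray(vec, dtype=float)
        scored.append((label, float(q @ (v / np.linalg.norm(v)))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up vectors standing in for sentence embeddings
toy_candidates = {
    "deep learning": [0.9, 0.1, 0.0],
    "machine learning": [0.8, 0.3, 0.1],
    "cooking recipes": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

ranking = rank_by_cosine(toy_query, toy_candidates)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

The same function works unchanged on real embedding vectors, since it only assumes that the query and all candidates have the same length.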
    

In [None]:
%load_ext dotenv
%dotenv

In [None]:

# Request an embedding from an Azure OpenAI deployment
import os
from openai import AzureOpenAI
client = AzureOpenAI(
  api_key = os.getenv("AZURE_OPENAI_API_KEY"),  
  api_version = "2024-02-01",
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
)
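# On Azure, `model` is the name of your embedding deployment
response = client.embeddings.create(input="Deep learning is a key technology for AI", model="demo-cosmos-rag-emb")
openai_embedding = response.data[0].embedding
# Print the size and a short preview instead of the full vector
print("OpenAI Embedding Length:", len(openai_embedding))
print("OpenAI Embedding Example (first 5 values):", openai_embedding[:5])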
    


## 6. Conclusion

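This notebook walks through the progression of embedding techniques, from sparse one-hot vectors through static embeddings (Word2Vec, GloVe) to contextual and API-served embeddings (BERT, OpenAI), highlighting their strengths and limitations. Advanced embeddings like OpenAI's generally provide stronger representations, but simpler methods like Word2Vec are still useful for certain tasks, for example when a model must be trained and run locally.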
    