
# Comparing Word Embedding Approaches: From One-Hot Encoding to OpenAI Embeddings

This notebook demonstrates and compares different word embedding approaches, ranging from simple one-hot encoding to advanced embeddings like OpenAI's.

## Objectives
1. Understand the evolution of word embeddings.
2. Implement and compare:
   - One-Hot Encoding
   - Word2Vec
   - GloVe
   - BERT (contextual embeddings)
   - OpenAI Embeddings
3. Evaluate the embeddings on a semantic similarity task.
    

In [1]:
%pip install gensim
%pip install transformers
%pip install matplotlib
%pip install scikit-learn
%pip install torch openai python-dotenv

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [3]:

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from gensim.models import Word2Vec
from transformers import pipeline

# Example sentences
sentences = [
    "I love natural language processing",
    "Deep learning is a key technology for AI",
    "Word embeddings capture semantic meaning",
    "OpenAI embeddings are state-of-the-art",
    "Machine learning is evolving rapidly"
]

# Tokenize sentences into words for embedding methods
tokenized_sentences = [sentence.lower().split() for sentence in sentences]
    


## 1. One-Hot Encoding

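One-hot encoding represents each word as a binary vector with a single 1 at the word's index in the vocabulary and 0 everywhere else. Because every pair of distinct vectors is orthogonal, it captures no semantic relationships between words, and the vector length grows with the vocabulary size.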
    

In [4]:

# One-Hot Encoding
vocabulary = sorted(set(word for sentence in tokenized_sentences for word in sentence))
identity = np.eye(len(vocabulary))  # build the identity matrix once; row i encodes word i
one_hot_vectors = {word: identity[i] for i, word in enumerate(vocabulary)}

# Display the vocabulary and the one-hot vector for one word
print("Vocabulary:", vocabulary)
print("One-Hot Encoding Example:", one_hot_vectors['learning'])
    

Vocabulary: ['a', 'ai', 'are', 'capture', 'deep', 'embeddings', 'evolving', 'for', 'i', 'is', 'key', 'language', 'learning', 'love', 'machine', 'meaning', 'natural', 'openai', 'processing', 'rapidly', 'semantic', 'state-of-the-art', 'technology', 'word']
One-Hot Encoding Example: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
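
To make the limitation concrete, the following minimal sketch computes pairwise cosine similarities between one-hot vectors. It uses a small hypothetical vocabulary (`toy_vocabulary`) rather than the notebook's variables, and the helper `cosine_similarity` is defined here for illustration only.

```python
import numpy as np

# Small illustrative vocabulary (a subset of the words used above)
toy_vocabulary = ["deep", "learning", "machine", "word"]
toy_one_hot = np.eye(len(toy_vocabulary))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairwise similarities: 1.0 on the diagonal, 0.0 everywhere else
similarity_matrix = np.array([
    [cosine_similarity(u, v) for v in toy_one_hot] for u in toy_one_hot
])
print(similarity_matrix)
```

Every word is exactly as dissimilar from "learning" as every other word, which is why one-hot vectors cannot express that "machine" is closer to "learning" than "word" is.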



## 2. Word2Vec

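Word2Vec learns dense embeddings from a sliding window over text: the skip-gram variant (`sg=1`, used below) predicts the surrounding context words from the center word. Trained on a large corpus, it places words that occur in similar contexts, such as synonyms, close together. The five-sentence corpus here is far too small for that, so the vectors below are only illustrative.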
    

In [5]:

# Train a skip-gram Word2Vec model (sg=1) on the toy corpus.
# min_count=1 keeps every word; vector_size=10 keeps the printed vectors short.
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=10, window=2, min_count=1, workers=1, sg=1)
word2vec_vectors = {word: word2vec_model.wv[word] for word in vocabulary}

# Display the Word2Vec embedding for one word
print("Word2Vec Embedding Example (learning):", word2vec_vectors['learning'])
    

Word2Vec Embedding Example (learning): [-0.00536227  0.00236431  0.0510335   0.09009273 -0.0930295  -0.07116809
  0.06458873  0.08972988 -0.05015428 -0.03763372]
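
The sliding-window idea can be sketched in plain Python. The function below, `skipgram_pairs`, is a hypothetical helper written for this illustration and is not part of gensim; it lists the (target, context) pairs a skip-gram model would be trained on with a fixed window. Real implementations add details such as negative sampling and frequent-word subsampling, which are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

example_tokens = "deep learning is a key technology for ai".split()
example_pairs = skipgram_pairs(example_tokens, window=2)
print(example_pairs[:5])
```

Each pair is one training example: given the target word, the model is asked to assign high probability to the context word.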



## 3. GloVe

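GloVe learns embeddings from global co-occurrence statistics: it counts how often word pairs appear together across the whole corpus, then fits word vectors whose dot products approximate the logarithm of those counts, which amounts to a weighted factorization of the co-occurrence matrix.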
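
As a rough illustration of the statistics GloVe starts from, the sketch below builds a symmetric co-occurrence matrix for a toy corpus. The function name `cooccurrence_matrix` and the two-sentence corpus are assumptions made for this example. It uses raw counts, whereas the GloVe reference implementation down-weights more distant pairs, and it stops before the fitting step.

```python
import numpy as np

def cooccurrence_matrix(tokenized_corpus, window=2):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({word for sentence in tokenized_corpus for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sentence in tokenized_corpus:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[sentence[j]]] += 1
    return vocab, counts

toy_corpus = [
    "deep learning is a key technology for ai".split(),
    "machine learning is evolving rapidly".split(),
]
toy_vocab, toy_counts = cooccurrence_matrix(toy_corpus, window=2)
row = toy_vocab.index("learning")
col = toy_vocab.index("is")
print(toy_counts[row, col])  # "learning" and "is" are adjacent in both sentences
```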
    

In [6]:

# Placeholder only: these are random vectors, not real GloVe embeddings.
# In practice, download pre-trained GloVe vectors and load them here instead.
glove_vectors = {word: np.random.rand(10) for word in vocabulary}
print("GloVe Embedding Example (learning):", glove_vectors['learning'])
    

GloVe Embedding Example (learning): [0.24667367 0.77424196 0.23182241 0.59957729 0.4058091  0.41627447
 0.28488116 0.64496648 0.83806805 0.6268004 ]
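
To replace the random placeholder with real vectors, one option is to parse a downloaded GloVe text file, where each line holds a word followed by its space-separated vector components. The sketch below assumes that format; `load_glove_text` is a hypothetical helper, and the in-memory sample stands in for a real file so the example runs without a download.

```python
import io
import numpy as np

def load_glove_text(file_obj):
    """Parse GloVe's plain-text format: one word per line followed by its vector."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Hypothetical three-dimensional sample standing in for a real vectors file
sample_file = io.StringIO("learning 0.1 -0.2 0.3\nmachine 0.4 0.5 -0.6\n")
sample_vectors = load_glove_text(sample_file)
print(sample_vectors["learning"])
```

With a real download, the same function can be called on a file opened with `open(path, encoding="utf-8")`, and `glove_vectors` can then be built by looking up each vocabulary word that is present in the result.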



## 4. BERT (Contextual Embeddings)

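BERT generates contextual embeddings: each token's vector is computed from the whole sentence, so the same word can have different embeddings depending on its context. For example, "bank" in "river bank" and in "bank account" receives two different vectors, whereas Word2Vec and GloVe assign it a single one.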
    

In [7]:

# Use BERT for embeddings
bert_pipeline = pipeline('feature-extraction', model='bert-base-uncased', tokenizer='bert-base-uncased')

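# Generate token-level embeddings for a sentence: the pipeline returns one
# 768-dimensional vector per token, with the special [CLS] token at index 0
bert_embedding = bert_pipeline("Deep learning is a key technology for AI")[0]
# Index 1 is the first token after [CLS], i.e. the first word of the sentence
print("BERT Embedding Shape (first word):", np.array(bert_embedding[1]).shape)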
    


BERT Embedding Shape (first word): (768,)
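
Because BERT returns one vector per token, comparing whole sentences requires collapsing them into a single vector first. Mean pooling is one common choice, sketched below. The array `fake_token_embeddings` is random data standing in for the pipeline output (10 is an illustrative token count), and `mean_pool` is a helper defined for this example, not a transformers function.

```python
import numpy as np

def mean_pool(token_embeddings):
    """Average per-token vectors into a single fixed-size sentence vector."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    return token_embeddings.mean(axis=0)

# Stand-in for the pipeline output: 10 tokens x 768 dimensions of random values
rng = np.random.default_rng(0)
fake_token_embeddings = rng.normal(size=(10, 768))
sentence_vector = mean_pool(fake_token_embeddings)
print(sentence_vector.shape)
```

In the notebook, `mean_pool(bert_embedding)` would apply the same averaging to the real token vectors, special tokens included.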



## 5. OpenAI Embeddings

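OpenAI's embedding models are accessed through an API and return a single dense vector for the whole input text rather than one vector per word. The cell below calls an embedding model deployed on Azure OpenAI.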
    

In [8]:
%load_ext dotenv
%dotenv

cannot find .env file


In [18]:

# Request an embedding from an Azure OpenAI deployment.
# Requires AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT in the environment.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)
# With Azure, `model` is the name of your own embedding deployment
response = client.embeddings.create(input="Deep learning is a key technology for AI", model="demo-cosmos-rag-emb")
openai_embedding = response.data[0].embedding
print("OpenAI Embedding Example:", openai_embedding)
    

OpenAI Embedding Example: [-0.008931445889174938, 0.003937659319490194, 0.01644168235361576, -0.01835835725069046, -0.004455943591892719, 0.015033513307571411, -0.019440561532974243, 0.016141794621944427, -0.0014611388323828578, -0.021930936723947525, 0.020926963537931442, 0.04360109940171242, 0.010541713796555996, -0.002995619783177972, -0.01961006410419941, 0.006877864710986614, 0.026272792369127274, 0.0052121831104159355, 0.009857186116278172, -0.012732199393212795, -0.03223143517971039, 0.012275847606360912, 0.019975144416093826, -0.02745930477976799, -0.010945910587906837, 0.004902516026049852, 0.01730223000049591, -0.03955913335084915, -0.015581134706735611, -0.01788896881043911, 0.03363960608839989, -0.008325150236487389, -0.004273403435945511, -0.007621065713465214, -0.0003041662566829473, -0.013331974856555462, 0.0041984315030276775, -0.007405928336083889, 0.018019353970885277, 0.015463787131011486, 0.023534685373306274, 0.01863216795027256, -0.0011001324746757746,
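
Once each method yields one vector per sentence, a semantic similarity comparison reduces to ranking candidates by cosine similarity to a query vector. The sketch below shows that step with hypothetical three-dimensional vectors. `rank_by_similarity` and `toy_embeddings` are names introduced for this example, and no results from the models above are implied.

```python
import numpy as np

def rank_by_similarity(query_vector, candidate_vectors):
    """Return (name, cosine similarity) pairs sorted from most to least similar."""
    query = np.asarray(query_vector, dtype=float)
    scores = {}
    for name, vector in candidate_vectors.items():
        vector = np.asarray(vector, dtype=float)
        scores[name] = float(
            np.dot(query, vector) / (np.linalg.norm(query) * np.linalg.norm(vector))
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical 3-dimensional embeddings, used only to exercise the function
toy_embeddings = {
    "sentence_a": [1.0, 0.0, 0.0],
    "sentence_b": [0.9, 0.1, 0.0],
    "sentence_c": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.0, 0.0], toy_embeddings)
print(ranking)
```

The same function works for any of the embedding types in this notebook, as long as the query and the candidates come from the same model and have the same dimensionality.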


## 6. Conclusion

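This notebook walks through the progression of word embedding techniques, from sparse one-hot vectors to static embeddings (Word2Vec, GloVe) and contextual models (BERT, OpenAI), highlighting their strengths and limitations. Advanced embeddings like OpenAI's provide strong general-purpose representations, but simpler methods like Word2Vec remain useful when you need to train on your own corpus, run offline, or keep vectors small.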
    