#  **3. Text Embeddings**

## 📌 Why Do We Need Text Embeddings?

Text embeddings are essential for **converting human language into a numerical format** that AI models can understand and process efficiently. They enable **semantic search, retrieval-augmented generation (RAG), recommendation systems, and more**.

---

## **🔹 What Are Text Embeddings?**
✅ **Text embeddings are vector representations of words, sentences, or documents.**  
✅ They capture the **meaning, relationships, and context** of text in a high-dimensional space.  
✅ **Similar texts have similar vector representations**, making it possible to compare and retrieve them efficiently.

🔹 **Example:**  
- The words **"dog"** and **"puppy"** will have embeddings that are **closer** together than **"dog"** and **"car"** because they are more semantically related.

---

## **🔹 Why Are Text Embeddings Needed?**

### **1️⃣ Enables Semantic Search**
✅ Instead of matching **exact keywords**, embeddings **find relevant content based on meaning**.  
✅ **Improves search accuracy** for AI-powered search engines.

🔹 **Example Use Case:**  
- A user searches for **"healthy meal ideas"**, and the system retrieves results about **"nutritious recipes"**, even if the exact words don’t match.
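
🔹 **Minimal sketch:** the ranking idea behind semantic search, assuming tiny hand-written 3-dimensional vectors in place of real embeddings. The names `toy_index` and `query_vec` and all the numbers are hypothetical; a real system would get its vectors from an embedding model.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

# Hypothetical toy "embeddings", hand-written for illustration only
toy_index = {
    "nutritious recipes": [0.9, 0.8, 0.1],
    "car maintenance tips": [0.1, 0.0, 0.9],
    "quick workout plans": [0.5, 0.3, 0.4],
}
query_vec = [0.8, 0.9, 0.0]  # stands in for the embedding of "healthy meal ideas"

# Rank titles by similarity to the query, most similar first
ranked = sorted(
    toy_index,
    key=lambda title: cosine_similarity(query_vec, toy_index[title]),
    reverse=True,
)
print(ranked)
```

The query shares no keywords with "nutritious recipes", yet that title ranks first because its vector points in nearly the same direction.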

---

### **2️⃣ Powers Retrieval-Augmented Generation (RAG)**
✅ **LLMs (like GPT-4) have token limits**, so embeddings help retrieve **only the most relevant information**.  
✅ Used in **vector databases (e.g., ChromaDB, FAISS, Pinecone)** to fetch data before generating responses.

🔹 **Example Use Case:**  
- A chatbot answering medical questions retrieves **relevant medical research papers** using embeddings.
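
🔹 **Minimal sketch:** the "retrieve, then generate" step, assuming similarity scores have already been computed. The function `build_prompt`, the scores, and the placeholder chunk texts are all hypothetical; this only shows how the most relevant chunks could be kept and placed in front of the question.

```python
def build_prompt(question, scored_chunks, top_k=2):
    """Keep the top_k highest-scoring chunks and put them ahead of the question."""
    best = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:top_k]
    context = "\n".join(f"* {text}" for _, text in best)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical (similarity score, chunk text) pairs
scored_chunks = [
    (0.82, "Placeholder abstract A (closely related to the question)."),
    (0.15, "Placeholder abstract B (unrelated to the question)."),
    (0.77, "Placeholder abstract C (related to the question)."),
]

prompt = build_prompt("What does the research say?", scored_chunks)
print(prompt)
```

Only the two best-scoring chunks reach the prompt, which is how retrieval keeps the input within the model's token limit.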

---

## **🚀 Why Use Text Embeddings?**
| **Problem** | **Solution with Text Embeddings** |
|------------|----------------------------------|
| Keyword search is too basic | **Embeddings enable semantic search** |
| LLMs forget context | **Embeddings help retrieve relevant data (RAG)** |
| Recommendations are not personalized | **Embeddings find similar items efficiently** |
| AI chatbots struggle with memory | **Embeddings store & retrieve past interactions** |
| Searching in multiple languages is difficult | **Embeddings support multi-language retrieval** |

📌 **Text embeddings are the backbone of AI search, chatbots, recommendations, and intelligent retrieval!** 🚀😊


- https://python.langchain.com/docs/how_to/embed_text/
- https://platform.openai.com/docs/guides/embeddings/embedding-models

In [None]:
import os

# Don't forget to set up your API key!
os.environ["OPENAI_API_KEY"] = "your key here"
api_key = os.getenv("OPENAI_API_KEY")

# Confirm the key is set without displaying it in the notebook output
print("API key set:", bool(api_key))

### **Run the code below every time you need `tiktoken`**

In [51]:
import os
import hashlib


blobpath = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()


# path to the directory where you saved the downloaded file (the file must be named after cache_key)
tiktoken_cache_dir = r"C:\Users\Seyed Barabadi\Downloads\Gen AI\tiktoken" 
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir

# validate
assert os.path.exists(os.path.join(tiktoken_cache_dir, cache_key))


In [52]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

## **1. Embed Query**
- What kind of embedding does OpenAI use? word2vec, GloVe, or ...

In [20]:
sent_1 = "Vegan food is not only plant-based but also packed with nutrients, offering delicious alternatives to traditional meals while being environmentally friendly."

In [21]:
embedded_sent_1 = embeddings_model.embed_query(sent_1)

type(embedded_sent_1), len(embedded_sent_1)

(list, 1536)

In [22]:
embedded_sent_1[:10]

[-0.01908184496125671,
 0.00998435307674029,
 -0.01609702889641673,
 0.027654847661957836,
 0.0259383401458056,
 0.007447736445026071,
 0.0009804358234051609,
 0.0006871991460122172,
 0.0612602539681854,
 0.011214517448575165]

In [40]:
sent_2 = "From hearty lentil stews to creamy cashew-based desserts, vegan cuisine proves that ethical eating can be both flavorful and satisfying."
sent_3 = "Meat-based dishes, from smoky barbecued ribs to tender steak, offer rich flavors and high protein content, making them a staple in many cuisines worldwide."

# sent_2 = "He swung the bat and hit a home run."
# sent_3 = "A bat flew out of the cave at dusk."

In [41]:
embedded_sent_2 = embeddings_model.embed_query(sent_2)
embedded_sent_3 = embeddings_model.embed_query(sent_3)

In [42]:
import numpy as np

def cosine_similarity(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

In [44]:
similarity = cosine_similarity(embedded_sent_1, embedded_sent_2)
print(f'Cosine Similarity (sent_1 vs sent_2): {similarity:.4f}')

similarity = cosine_similarity(embedded_sent_1, embedded_sent_3)
print(f'Cosine Similarity (sent_1 vs sent_3): {similarity:.4f}')

Cosine Similarity (sent_1 vs sent_2): 0.5976
Cosine Similarity (sent_1 vs sent_3): 0.3639
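
🔹 **Side note with a small sketch:** OpenAI's documentation describes its embeddings as normalized to length 1. If that holds for the vectors you receive, the denominator in the cosine formula is 1 and a plain dot product gives the same value. The helper `normalize` and the 2-dimensional vectors below are hypothetical, chosen so the arithmetic is easy to follow.

```python
import numpy as np

def normalize(vec):
    """Scale a vector to unit (L2) length."""
    vec = np.array(vec, dtype=float)
    return vec / np.linalg.norm(vec)

a = normalize([3.0, 4.0])  # becomes [0.6, 0.8]
b = normalize([4.0, 3.0])  # becomes [0.8, 0.6]

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)

print(f"cosine={cosine:.4f}, dot={dot:.4f}")  # both are 0.9600
```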


## **2. Embed Documents**

## 📌 How Does Document Embedding Work?

**Document embeddings** convert entire documents into **numerical vector representations** so that AI models can understand and compare them. These embeddings help in **semantic search, AI-powered retrieval, summarization, and chatbot memory.**

---

## **🔹 What Are Document Embeddings?**
✅ **A document embedding is a high-dimensional vector** that represents the meaning of an entire document.  
✅ Similar documents have **closer embeddings**, even if they use different words.  
✅ Used in **retrieval-augmented generation (RAG), search engines, document clustering, and AI assistants.**  

🔹 **Example:**  
- A **legal contract** and an **NDA** will have similar embeddings because they share related legal concepts.

---

## **🔹 How Do Document Embeddings Work?**
1. **Tokenization** → The document is split into sentences or words.  
2. **Sentence Embedding** → Each sentence is converted into a vector.  
3. **Aggregation** → Sentences are combined into a single document vector (mean pooling, attention mechanisms, etc.).  
4. **Normalization** → The final vector is adjusted for better comparisons.  
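
🔹 **Minimal sketch of steps 3 and 4:** mean pooling followed by L2 normalization, assuming the sentence vectors already exist. The function `pool_document` and the 2-dimensional inputs are hypothetical. Hosted models such as the ones called through `embed_documents` return one vector per input text directly, so this step is only needed when you start from sentence-level vectors yourself.

```python
import numpy as np

def pool_document(sentence_vectors):
    """Combine sentence vectors into one unit-length document vector."""
    matrix = np.array(sentence_vectors, dtype=float)
    doc_vec = matrix.mean(axis=0)              # aggregation: mean pooling
    return doc_vec / np.linalg.norm(doc_vec)   # normalization: unit length

# Hypothetical sentence vectors for a two-sentence document
sentence_vectors = [[1.0, 0.0], [0.0, 1.0]]

doc_vector = pool_document(sentence_vectors)
print(doc_vector)  # approximately [0.7071, 0.7071]
```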

🔹 **Popular Models for Document Embeddings:**  
- **OpenAI Embeddings (text-embedding-ada-002)** → Optimized for search and retrieval.  
- **Sentence-BERT (SBERT)** → Creates efficient, meaningful document embeddings.  
- **Universal Sentence Encoder (USE)** → Fast document embeddings for NLP tasks.

---


In [45]:
from langchain_openai import OpenAIEmbeddings

model = OpenAIEmbeddings()

embedded_docs = model.embed_documents([
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!"
])

embedded_docs

[[-0.02032531998333742,
  -0.007096723237236881,
  -0.02283900619589423,
  -0.026279457096853462,
  -0.03752757262548842,
  ...

##  📌 OpenAI Embedding Models Comparison

| **Model**                  | **~ Pages per Dollar** | **Performance on MTEB Eval** | **Max Input Tokens** |
|----------------------------|-----------------------|-----------------------------|----------------------|
| **text-embedding-3-small** | 62,500               | 62.3%                       | 8,191               |
| **text-embedding-3-large** | 9,615                | 64.6%                       | 8,191               |
| **text-embedding-ada-002** | 12,500               | 61.0%                       | 8,191               |

🔹 **Pages per Dollar** → Higher means more cost-efficient embeddings.  
🔹 **MTEB Eval (%)** → Higher means better embedding quality for semantic tasks.  
🔹 **Max Input Tokens** → Maximum number of tokens allowed in a single input text.
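
🔹 **Worked example:** a rough cost estimate from the "Pages per Dollar" column. The corpus size `num_pages` is a hypothetical figure, and the result is only as accurate as the approximate values in the table.

```python
# Approximate pages-per-dollar values copied from the table above
pages_per_dollar = {
    "text-embedding-3-small": 62_500,
    "text-embedding-3-large": 9_615,
    "text-embedding-ada-002": 12_500,
}

num_pages = 10_000  # hypothetical corpus size

# cost in dollars = pages to embed / pages per dollar
estimated_cost = {model: num_pages / ppd for model, ppd in pages_per_dollar.items()}

for model, cost in estimated_cost.items():
    print(f"{model}: ~${cost:.2f}")
```

For 10,000 pages this gives about $0.16, $1.04, and $0.80 respectively.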




## **Combining Document Loaders, Splitters and Embeddings**

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


loader = PyPDFLoader('some_data/marvel_superheroes.pdf')


# Use a text splitter to break large text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(encoding_name="cl100k_base", 
                                                                     chunk_size=200,
                                                                     chunk_overlap=50)


pages = loader.load_and_split(text_splitter)
pages

[Document(metadata={'source': 'some_data/marvel_superheroes.pdf', 'page': 0}, page_content='Spider-Man\nSpider-Man, also known as Peter Parker, is one of Marvel\'s most iconic superheroes. Created by\nwriter Stan Lee and artist Steve Ditko, he first appeared in Amazing Fantasy #15 in 1962. As a\nteenager, Peter Parker was bitten by a radioactive spider, which granted him extraordinary powers,\nincluding superhuman strength, agility, the ability to cling to walls, and a "spider-sense" that warns\nhim of impending danger.\nDespite his incredible abilities, Peter\'s life is fraught with hardship. After the tragic murder of his\nUncle Ben, Peter learns a valuable lesson: "With great power comes great responsibility." This\nphilosophy shapes his journey as he fights crime and protects the citizens of New York City from\nnotorious villains like the Green Goblin, Doctor Octopus, and Venom. \nAs Spider-Man, Peter has been a key member of teams like the Avengers and the Fantastic Four.'),
 ...

In [66]:
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

# input is a list of documents
embedded_docs = embeddings_model.embed_documents([page.page_content for page in pages])
print(len(embedded_docs))

4


In [61]:
len(embedded_docs[0])

1536
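
🔹 **Possible next step (sketch):** once every chunk has a vector, a query vector can be compared against all of them at once. The function `top_k_indices` and the 2-dimensional toy vectors are hypothetical. With the real data above you would pass `embeddings_model.embed_query(...)` as the query and `embedded_docs` as the documents, then look up `pages[i].page_content` for each returned index.

```python
import numpy as np

def top_k_indices(query_vec, doc_vecs, k=2):
    """Return (index, cosine score) for the k vectors most similar to the query."""
    docs = np.array(doc_vecs, dtype=float)
    query = np.array(query_vec, dtype=float)
    # Cosine similarity between the query and every row of docs
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    order = np.argsort(scores)[::-1][:k]  # highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical toy vectors standing in for embedded_docs and an embedded query
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

results = top_k_indices(toy_query, toy_docs)
print(results)  # indices 0 and 2 score highest
```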