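# VectorStore-backed Retriever

## Overview
This tutorial is a guide to building and tuning a **VectorStore-backed retriever** with LangChain. It covers the basic steps of creating a vector store with FAISS (Facebook AI Similarity Search) and explores advanced retrieval strategies for improving search accuracy and efficiency.

A **VectorStore-backed retriever** is a document retrieval system that uses a vector store to search for documents based on their vector representations. This approach enables efficient similarity-based search over unstructured data.

### RAG (Retrieval-Augmented Generation) Workflow
![rag-flow.png](./assets/01-vectorstore-retriever-rag-flow.png)

The diagram above illustrates the **document search and response generation** workflow in a RAG system.

The indexing steps are:

1. Document loading: Import the raw documents.
2. Text chunking: Split the text into manageable chunks.
3. Vector embedding: Convert the text into numerical vectors using an embedding model.
4. Storing in a vector database: Store the generated embeddings in a vector database for efficient retrieval.

During the query phase:
- Steps: User query → Embedding → Search in the vector store → Retrieve relevant chunks → LLM generates a response
- The user query is converted into an embedding vector using an embedding model.
- The query embedding is compared against the document vectors stored in the vector database to **retrieve the most relevant results**.
- The retrieved chunks are passed to a large language model (LLM), which generates the final response based on the retrieved information.

This tutorial explores and tunes the "vector store → retrieve relevant chunks → LLM generates a response" stage. It covers advanced retrieval techniques for improving the accuracy and relevance of responses.

## Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Initializing and Using VectorStoreRetriever](#initializing-and-using-vectorstoreretriever)
- [Dynamic Configuration (Using ConfigurableField)](#dynamic-configuration-using-configurablefield)
- [Using Separate Query & Passage Embedding Models](#using-separate-query--passage-embedding-models)

## References

- [How to use a vector store as a retriever](https://python.langchain.com/docs/how_to/vectorstore_retriever/)
- [Maximum Marginal Relevance (MMR)](https://community.fullstackretrieval.com/retrieval-methods/maximum-marginal-relevance)
- [Upstage-Embeddings](https://console.upstage.ai/docs/capabilities/embeddings)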

---

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions, and utilities for tutorials. 
- You can check out the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_opentutorial",
        "langchain_openai",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_core",
        "langchain_upstage",
        "faiss-cpu"
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        # "OPENAI_API_KEY": "",
        # "LANGCHAIN_API_KEY": "",
        # "UPSTAGE_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "VectorStore Retriever"
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Configuration file to manage the API KEY as an environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv(override=True)

True

## Initializing and Using VectorStoreRetriever

This section demonstrates how to load and split documents, embed the resulting chunks with OpenAI embeddings, and create a vector database using FAISS.

Once the vector database is created, it can be queried using retrieval methods such as **Similarity Search** and **Maximal Marginal Relevance (MMR)** to search for relevant text within the vector store.

📌 **Creating a Vector Store (Using FAISS)**

In [5]:
from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Load the file using TextLoader
loader = TextLoader("./data/01-vectorstore-retriever-appendix-keywords.txt", encoding="utf-8")
documents = loader.load()

# split the text into chunks
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
split_docs = text_splitter.split_documents(documents) # Split into smaller chunks

# Initialize the OpenAI embedding model
embeddings = OpenAIEmbeddings()

# Create a FAISS vector database
db = FAISS.from_documents(split_docs, embeddings)

Created a chunk of size 351, which is longer than the specified 300
Created a chunk of size 343, which is longer than the specified 300
Created a chunk of size 307, which is longer than the specified 300
Created a chunk of size 316, which is longer than the specified 300
Created a chunk of size 341, which is longer than the specified 300
Created a chunk of size 321, which is longer than the specified 300
Created a chunk of size 303, which is longer than the specified 300
Created a chunk of size 325, which is longer than the specified 300
Created a chunk of size 315, which is longer than the specified 300
Created a chunk of size 304, which is longer than the specified 300
Created a chunk of size 385, which is longer than the specified 300
Created a chunk of size 349, which is longer than the specified 300
Created a chunk of size 376, which is longer than the specified 300


📌 **1. Initializing and Using VectorStoreRetriever (```as_retriever``` )**

The ```as_retriever``` method allows you to convert a vector database into a retriever, enabling efficient document search and retrieval from the vector store.

**How It Works**:
* The ```as_retriever()``` method transforms a vector store (like FAISS) into a retriever object, making it compatible with LangChain's retrieval workflows.
* This retriever can then be directly used with RAG pipelines or combined with Large Language Models (LLMs) for building intelligent search systems.

In [6]:
# Basic Retriever Creation (Similarity Search)
retriever = db.as_retriever()

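**Advanced Retriever Configuration**

The ```as_retriever``` method lets you configure advanced retrieval strategies such as **similarity search**, **MMR (Maximal Marginal Relevance)**, and **similarity score threshold-based filtering**.

**Parameters:**

- ```**kwargs```: Keyword arguments passed to the retrieval function:
   - ```search_type```: Specifies the search method.
     - ```"similarity"```: Returns the most relevant documents by vector similarity (the exact metric, such as cosine similarity or L2 distance, depends on how the vector store is configured).
     - ```"mmr"```: Uses the Maximal Marginal Relevance algorithm to balance **relevance** and **diversity**.
     - ```"similarity_score_threshold"```: Returns documents whose similarity score is above the specified threshold.
   - ```search_kwargs```: Additional search options for fine-tuning the results:
     - ```k```: Number of documents to return (default: ```4```).
     - ```score_threshold```: Minimum similarity score for the ```"similarity_score_threshold"``` search type (e.g., ```0.8```).
     - ```fetch_k```: Number of documents initially retrieved during an MMR search (default: ```20```).
     - ```lambda_mult```: Controls diversity in MMR results (```0``` = maximum diversity, ```1``` = maximum relevance, default: ```0.5```).
     - ```filter```: Metadata filter for selective document retrieval.

**Returns:**

- ```VectorStoreRetriever```: An initialized retriever object that can be queried directly for document search tasks.

**Notes:**
- Supports multiple search strategies (```similarity```, ```mmr```, ```similarity_score_threshold```).
- MMR improves result diversity by reducing redundancy while preserving relevance.
- Metadata filtering enables selective document retrieval based on document attributes.
- The ```tags``` parameter can be used to label retrievers for better organization and easier identification.

**Cautions:**
- Diversity control with MMR:
  - Tune ```fetch_k``` (number of documents initially retrieved) and ```lambda_mult``` (diversity control factor) carefully to reach a good balance.
  - ```lambda_mult```
    - Lower values (< 0.5) → prioritize diversity.
    - Higher values (> 0.5) → prioritize relevance.
  - Set ```fetch_k``` higher than ```k``` so that diversity control has candidates to choose from.
- Threshold settings:
  - A ```score_threshold``` that is too high (e.g., 0.95) may return zero results.
- Metadata filtering:
  - Make sure the metadata structure is well defined before applying filters.
- Balanced configuration:
  - Keep the ```search_type``` and ```search_kwargs``` settings consistent with each other for good retrieval performance.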

In [7]:
retriever = db.as_retriever(
    search_type="similarity_score_threshold", 
    search_kwargs={
        "k": 5,  # Return the top 5 most relevant documents
        "score_threshold": 0.7  # Only return documents with a similarity score of 0.7 or higher
    }
)
# Perform the search
query = "Explain the concept of vector search."
results = retriever.invoke(query)

# Display search results
for doc in results:
    print(doc.page_content)

Semantic Search
VectorStore

Definition: A vector store is a system for storing data in vector format, often used for search, classification, and data analysis tasks.
Example: Storing word embeddings in a database for fast retrieval of similar words.
Related Keywords: Embedding, Database, Vectorization
Definition: Semantic search is a method of retrieving results based on the meaning of the user's query, going beyond simple keyword matching.
Example: If a user searches for "solar system planets," the search returns information about related planets like Jupiter and Mars.
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining
Definition: Keyword search is the process of finding information based on specific keywords entered by the user. It is commonly used in search engines and database systems as a fundamental search method.
Example: If a user searches for "coffee shop in Seoul," the search engine returns a list of related coffee shops.
Related Keywords: Search E

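### The Retriever's ```invoke()``` Method

The ```invoke()``` method is the main entry point for interacting with a retriever. It searches for and retrieves relevant documents based on a given query.

**How It Works**:
1. Query submission: The user query is provided as input.
2. Embedding generation: The query is converted into a vector representation (if necessary).
3. Search process: The retriever searches the vector database using the specified search strategy (similarity, MMR, etc.).
4. Result return: The method returns a list of relevant document chunks.

**Parameters:**
- ```input``` (required):
   - The query string provided by the user.
   - The query is converted into a vector and compared with the stored document vectors for similarity-based retrieval.

- ```config``` (optional):
   - Allows fine-grained control over the retrieval run.
   - Can be used to attach **tags and metadata**, and to select search settings when the retriever exposes configurable fields (see the Dynamic Configuration section).

- ```**kwargs``` (optional):
   - Intended for passing search options directly for advanced configuration. Whether these are actually applied depends on the installed ```langchain-core``` version, so verify the behavior before relying on it.
   - Example options include:
     - ```k```: Number of documents to return.
     - ```score_threshold```: Minimum similarity score for a document to be included.
     - ```fetch_k```: Number of documents initially retrieved in an MMR search.

**Returns:**
- ```List[Document]```:
   - Returns a list of document objects containing the retrieved text and metadata.
   - Each document object includes:
     - ```page_content```: The main content of the document.
     - ```metadata```: Metadata associated with the document (e.g., source, tags).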
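**Usage Notes:**

### Core Features of the invoke() Method

**1. Simple retrieval interface**
```python
# Basic usage
results = retriever.invoke("query text")

# Usage with a config (the tags are attached to the run for tracing)
# NOTE: whether extra keyword arguments such as k or score_threshold override
# the retriever's search_kwargs depends on the langchain-core version.
results = retriever.invoke(
    "query text",
    config={"tags": ["search"]},
    k=5,
    score_threshold=0.7
)
```

**2. Flexible parameter control**
- **On-the-fly adjustment**: Search parameters can be adjusted without re-initializing the retriever (for example, through configurable fields).
- **Dynamic configuration**: The retrieval strategy can be adjusted according to the type of query.
- **Metadata usage**: Document metadata can be used for more precise retrieval.

**3. Uniform return format**
- All retrievers return results in the same `Document` format.
- This simplifies downstream processing and chaining.
- It keeps retrievers consistent with the rest of the LangChain ecosystem.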
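
The four steps that ```invoke()``` goes through can be sketched without any external service. The block below is a toy illustration only: ```Doc```, ```embed```, and ```ToyRetriever``` are hypothetical stand-ins invented for this sketch (a bag-of-words counter replaces a real embedding model), not LangChain classes.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (not a real embedding model)."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyRetriever:
    def __init__(self, docs, k=4):
        self.k = k
        # Index time: embed every document once and keep the vectors.
        self.index = [(doc, embed(doc.page_content)) for doc in docs]

    def invoke(self, query: str) -> list:
        query_vec = embed(query)                                   # step 2: embed the query
        scored = [(cosine(query_vec, vec), doc) for doc, vec in self.index]  # step 3: search
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]                # step 4: return documents


toy_docs = [
    Doc("An embedding converts text into a vector.", {"source": "notes.txt"}),
    Doc("A tokenizer splits text into tokens.", {"source": "notes.txt"}),
    Doc("Pandas is a Python library for data analysis.", {"source": "notes.txt"}),
]
toy_retriever = ToyRetriever(toy_docs, k=2)
toy_results = toy_retriever.invoke("What is an embedding?")        # step 1: submit the query
for doc in toy_results:
    print(doc.page_content, doc.metadata)
```

A real retriever delegates steps 2 and 3 to the embedding model and the vector store, but the shape of the result is the same: a ranked list of document objects, each with ```page_content``` and ```metadata```.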

**Usage Example 1: Basic Usage (Synchronous Search)**

In [8]:
docs = retriever.invoke("What is an embedding?")

for doc in docs:
    print(doc.page_content)
    print("=========================================================")

Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into continuous low-dimensional vectors. This allows computers to understand and process text.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing, Vectorization, Deep Learning
Semantic Search
Deep Learning


**Usage Example 2: Search with Options** ( ```search_kwargs``` )

In [9]:
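# search options: top 5 results with a similarity score ≥ 0.7
# NOTE: these values match the settings the retriever was created with above.
# A nested `search_kwargs` passed to invoke() may not override the retriever's
# own settings; set them in as_retriever() or use ConfigurableField to be sure.
docs = retriever.invoke(
    "What is a vector database?",
    search_kwargs={"k": 5, "score_threshold": 0.7}
)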
for doc in docs:
    print(doc.page_content)
    print("=========================================================")

VectorStore

Definition: A vector store is a system for storing data in vector format, often used for search, classification, and data analysis tasks.
Example: Storing word embeddings in a database for fast retrieval of similar words.
Related Keywords: Embedding, Database, Vectorization


**Usage Example 3: Using** ```config``` **and** ```**kwargs``` **(Advanced Configuration)**

In [10]:
from langchain_core.runnables.config import RunnableConfig

# Create a RunnableConfig with tags and metadata
config = RunnableConfig(
    tags=["retrieval", "faq"],  # Adding tags for query categorization
    metadata={"project": "vectorstore-tutorial"}  # Project-specific metadata for traceability
)
# Perform a query using advanced configuration settings
# NOTE: `config` (tags, metadata) is attached to this run for tracing. The nested
# `search_kwargs` below may not override the retriever's own settings
# (k=5, score_threshold=0.7), depending on the langchain-core version.
docs = retriever.invoke(
    input="What is a DataFrame?", 
    config=config,  # Applying the config with tags and metadata
    search_kwargs={
        "k": 3,                   
        "score_threshold": 0.8   
    }
)
# Display the search results
for idx, doc in enumerate(docs):
    print(f"\n🔍 [Search Result {idx + 1}]")
    print("📄 Document Content:", doc.page_content)
    print("🗂️ Metadata:", doc.metadata)
    print("=" * 60)


🔍 [Search Result 1]
📄 Document Content: Definition: A DataFrame is a tabular data structure with rows and columns, commonly used for data analysis and manipulation.
Example: Pandas DataFrame can store data like an Excel sheet and perform operations like filtering and grouping.
Related Keywords: Data Analysis, Pandas, Data Manipulation
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}

🔍 [Search Result 2]
📄 Document Content: Schema

Definition: A schema defines the structure of a database or file, describing how data is stored and organized.
Example: A database schema can specify table columns, data types, and constraints.
Related Keywords: Database, Data Modeling, Data Management

DataFrame
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}

🔍 [Search Result 3]
📄 Document Content: Pandas

Definition: Pandas is a Python library for data analysis and manipulation, offering tools for working with structured data.
Example: Panda

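## Maximal Marginal Relevance (MMR)

The **Maximal Marginal Relevance (MMR)** search method is a document retrieval algorithm that reduces redundancy by balancing relevance and diversity, which tends to give more useful search results.

**How MMR Works:**
Unlike basic similarity search, which returns the most relevant documents based on similarity score alone, MMR considers two key factors:
1. Relevance: Measures how well a document matches the user's query.
2. Diversity: Ensures the retrieved documents differ from one another, avoiding repetitive results.

**Key Parameters:**
- ```search_type="mmr"```: Enables the MMR retrieval strategy.
- ```k```: Number of documents returned after diversity filtering is applied (default: ```4```).
- ```fetch_k```: Number of documents initially retrieved before diversity filtering is applied (default: ```20```).
- ```lambda_mult```: Diversity control factor (```0``` = maximum diversity, ```1``` = maximum relevance, default: ```0.5```).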
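
To make the relevance/diversity trade-off concrete, here is a minimal pure-Python sketch of greedy MMR selection. It is an illustration under stated assumptions, not the library's implementation: ```mmr_select``` is a hypothetical helper, the vectors are made-up 2-D toy values, and each candidate is scored as ```lambda_mult * relevance - (1 - lambda_mult) * redundancy```, where redundancy is the highest similarity to any document already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k=4, lambda_mult=0.5):
    """Return the indices of up to k candidates, chosen greedily by MMR score."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query_vec, candidate_vecs[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


toy_query_vec = [1.0, 0.0]
toy_candidates = [
    [1.0, 0.1],   # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of candidate 0
    [0.6, 0.8],   # 2: less relevant, but points in a different direction
]
print(mmr_select(toy_query_vec, toy_candidates, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(toy_query_vec, toy_candidates, k=2, lambda_mult=0.3))  # [0, 2]
```

With ```lambda_mult=1.0``` the selection reduces to plain similarity ranking and keeps the near-duplicate; with ```lambda_mult=0.3``` the second pick switches to the less similar but more distinct candidate. In this sketch, ```candidate_vecs``` plays the role of the ```fetch_k``` documents fetched before diversity filtering.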

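---

## My Insights

MMR is a very practical algorithm in information retrieval, especially in scenarios where duplicate results need to be avoided. It addresses an important weakness of plain similarity search: when a user searches for a topic, many of the returned documents tend to have near-identical content, which makes the results less useful.

The core value of MMR lies in its balancing mechanism: through the ```lambda_mult``` parameter, users can adjust the weight given to relevance versus diversity according to their needs.

## Additional Learning Points

**Practical use cases:**
- Recommender systems: avoid recommending near-identical items
- Search engines: provide diversified search results
- Document summarization: select representative passages

**Parameter tuning strategies:**
- For exploratory search: lower ```lambda_mult``` (more diversity)
- For precise lookup: raise ```lambda_mult``` (more relevance)
- As a rule of thumb, setting ```fetch_k``` to about 3-5 times ```k``` tends to work well

**Cautions:**
- MMR is more expensive to compute than plain similarity search and may affect response time
- It relies on a suitable embedding model to work effectively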

In [11]:
# MMR Retriever Configuration (Balancing Relevance and Diversity)
retriever = db.as_retriever(
    search_type="mmr", 
    search_kwargs={
        "k": 3,                
        "fetch_k": 10,           
        "lambda_mult": 0.6  # Balancing relevance and diversity (0.6: slight emphasis on relevance)
    }
)

query = "What is an embedding?"
docs = retriever.invoke(query)

#  Display the search results
print(f"\n🔎 [Query]: {query}\n")
for idx, doc in enumerate(docs):
    print(f"📄 [Document {idx + 1}]")
    print("📖 Document Content:", doc.page_content)
    print("🗂️ Metadata:", doc.metadata)
    print("=" * 60)


🔎 [Query]: What is an embedding?

📄 [Document 1]
📖 Document Content: Embedding
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}
📄 [Document 2]
📖 Document Content: Definition: Embedding is the process of converting text data, such as words or sentences, into continuous low-dimensional vectors. This allows computers to understand and process text.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing, Vectorization, Deep Learning
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}
📄 [Document 3]
📖 Document Content: TF-IDF (Term Frequency-Inverse Document Frequency)
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}


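## Similarity Score Threshold Search

**Similarity score threshold search** is a retrieval method that returns only documents whose similarity score exceeds a predefined value. This helps filter out low-relevance results so that the returned documents are closely related to the query.

**Key Features:**
- Relevance filtering: Returns only documents whose similarity score is above the specified threshold.
- Configurable precision: The threshold can be adjusted with the ```score_threshold``` parameter.
- Search type activation: Enabled by setting ```search_type="similarity_score_threshold"```.

This search method is well suited to tasks that need **highly precise** results, such as fact-checking or answering technical questions.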
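
The filtering step itself is simple, and a small sketch shows why a threshold that is too high can return nothing. The helper ```filter_by_threshold``` and the scores below are hypothetical values invented for illustration; they are not measured outputs of any embedding model.

```python
def filter_by_threshold(scored_docs, score_threshold, k=4):
    """Keep at most k (text, score) pairs whose score is >= score_threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:k] if score >= score_threshold]


# Hypothetical (document title, similarity score) pairs for one query.
toy_scored = [
    ("Word2Vec", 0.86),
    ("Embedding", 0.71),
    ("TF-IDF", 0.64),
    ("Tokenizer", 0.41),
]

print(filter_by_threshold(toy_scored, score_threshold=0.6, k=5))
# [('Word2Vec', 0.86), ('Embedding', 0.71), ('TF-IDF', 0.64)]
print(filter_by_threshold(toy_scored, score_threshold=0.95, k=5))
# []
```

Unlike a plain top-k search, the number of results varies with the query: ```k``` is only an upper bound, and the threshold decides how many of those candidates survive.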

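---

## My Insights

Where MMR focuses on balance, threshold search focuses on quality control. It follows a "better none than poor" strategy, making sure every returned result meets a minimum quality bar. This is particularly valuable in applications that demand high accuracy.

The strength of this approach is predictability: users know that every result meets the configured relevance standard, which avoids the problem of a plain top-k search returning low-quality results just to fill the list.

## Additional Learning Points

**Suitable scenarios:**
- Medical diagnosis support: requires highly accurate information
- Legal document retrieval: precision is critical
- Technical documentation lookup: avoids misleading information

**Threshold setting strategies:**
- High threshold (0.8-0.9): aims for very high precision, possibly at the cost of recall
- Medium threshold (0.6-0.8): balances precision and recall
- Low threshold (0.4-0.6): ensures basic relevance and improves recall

**Comparison with other methods:**
- vs. top-k: may return a different number of results
- vs. MMR: ignores diversity and relies purely on relevance
- Can be combined with MMR in a custom pipeline: filter by threshold first, then apply MMR to add diversity

**Implementation considerations:**
- The threshold needs to be tuned to the characteristics of the embedding model
- It is advisable to run small-scale tests first to find a suitable threshold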

In [12]:
# Retriever Configuration (Similarity Score Threshold Search)
retriever = db.as_retriever(
    search_type="similarity_score_threshold",  
    search_kwargs={
        "score_threshold": 0.6,  
        "k": 5                
    }
)
# Execute the query
query = "What is Word2Vec?"
docs = retriever.invoke(query)

# Display the search results 
print(f"\n🔎 [Query]: {query}\n")
if docs:
    for idx, doc in enumerate(docs):
        print(f"📄 [Document {idx + 1}]")
        print("📖 Document Content:", doc.page_content)
        print("🗂️ Metadata:", doc.metadata)
        print("=" * 60)
else:
    print("⚠️ No relevant documents found. Try lowering the similarity score threshold.")


🔎 [Query]: What is Word2Vec?

📄 [Document 1]
📖 Document Content: Word2Vec

Definition: Word2Vec is a technique in NLP that maps words into a vector space, representing their semantic relationships based on context.
Example: In Word2Vec, "king" and "queen" would be represented by vectors close to each other.
Related Keywords: NLP, Embeddings, Semantic Similarity
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}
📄 [Document 2]
📖 Document Content: Definition: Embedding is the process of converting text data, such as words or sentences, into continuous low-dimensional vectors. This allows computers to understand and process text.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing, Vectorization, Deep Learning
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}
📄 [Document 3]
📖 Document Content: TF-IDF (Term Frequency-Inverse Document Frequency)
🗂️ Metad

### Configuring Top-K with ```k``` (Adjusting the Number of Returned Documents)

- The parameter ```k``` specifies the number of documents returned during a vector search. It determines how many of the **top-ranked** documents (based on similarity score) will be retrieved from the vector database.

- The number of documents retrieved can be adjusted by setting the ```k``` value within the ```search_kwargs```.  
- For example, setting ```k=1``` will return only the **top 1 most relevant document** based on similarity.

In [13]:
# Retriever Configuration (Return Only the Top 1 Document)
retriever = db.as_retriever(
    search_kwargs={
        "k": 1  # Return only the top 1 most relevant document
    }
)

query = "What is an embedding?"
docs = retriever.invoke(query)

#  Display the search results 
print(f"\n🔎 [Query]: {query}\n")
if docs:
    for idx, doc in enumerate(docs):
        print(f"📄 [Document {idx + 1}]")
        print("📖 Document Content:", doc.page_content)
        print("🗂️ Metadata:", doc.metadata)
        print("=" * 60)
else:
    print("⚠️ No relevant documents found. Try increasing the `k` value.")


🔎 [Query]: What is an embedding?

📄 [Document 1]
📖 Document Content: Embedding
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}


## Dynamic Configuration (Using ```ConfigurableField``` )

The ```ConfigurableField``` feature in LangChain allows for **dynamic adjustment** of search configurations, providing flexibility during query execution.

**Key Features:**
- Runtime Search Configuration: Adjust search settings without modifying the core retriever setup.
- Enhanced Traceability: Assign unique identifiers, names, and descriptions to each parameter for improved readability and debugging.
- Flexible Control with ```config```: Search configurations can be passed dynamically using the ```config``` parameter as a dictionary.


**Use Cases:**
- Switching Search Strategies: Dynamically adjust the search type (e.g., ```"similarity"```, ```"mmr"``` ).
- Real-Time Parameter Adjustments: Modify search parameters like ```k``` , ```score_threshold``` , and ```fetch_k``` during query execution.
- Experimentation: Easily test different search strategies and parameter combinations without rewriting code.
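
The override behavior can be pictured with a small sketch. This is an assumption-laden illustration rather than LangChain's implementation: ```resolve_settings``` and ```TOY_DEFAULTS``` are hypothetical names, and the sketch assumes that a value supplied under ```"configurable"``` replaces the corresponding field as a whole for that call, while unknown keys are ignored.

```python
# Settings the retriever was created with (mirrors as_retriever(search_kwargs={"k": 1})).
TOY_DEFAULTS = {"search_type": "similarity", "search_kwargs": {"k": 1}}


def resolve_settings(defaults, config=None):
    """Return the settings for one call: configured fields replace defaults wholesale."""
    overrides = (config or {}).get("configurable", {})
    # Only fields that were exposed as configurable are considered.
    return {name: overrides.get(name, value) for name, value in defaults.items()}


# No config: the defaults apply.
print(resolve_settings(TOY_DEFAULTS))
# {'search_type': 'similarity', 'search_kwargs': {'k': 1}}

# Override only the search kwargs.
print(resolve_settings(TOY_DEFAULTS, {"configurable": {"search_kwargs": {"k": 3}}}))
# {'search_type': 'similarity', 'search_kwargs': {'k': 3}}

# Override both fields; note that the original {"k": 1} is not merged in.
print(resolve_settings(
    TOY_DEFAULTS,
    {"configurable": {
        "search_type": "similarity_score_threshold",
        "search_kwargs": {"score_threshold": 0.8},
    }},
))
# {'search_type': 'similarity_score_threshold', 'search_kwargs': {'score_threshold': 0.8}}
```

If replacement (rather than merging) is indeed how the installed version behaves, a per-call ```search_kwargs``` should list every option it needs, including ```k```, instead of relying on values set when the retriever was created.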

In [14]:
from langchain_core.runnables import ConfigurableField 

# Retriever Configuration Using ConfigurableField
retriever = db.as_retriever(search_kwargs={"k": 1}).configurable_fields(
    search_type=ConfigurableField(
        id="search_type", 
        name="Search Type",  # Name for the search strategy
        description="The search type to use",  # Description of the search strategy
    ),
    search_kwargs=ConfigurableField(
        id="search_kwargs",  
        name="Search Kwargs",  # Name for the search parameters
        description="The search kwargs to use",  # Description of the search parameters
    ),
)

The following examples demonstrate how to apply dynamic search settings using ```ConfigurableField``` in LangChain.


In [15]:
# ✅ Search Configuration 1: Basic Search (Top 3 Documents)

config_1 = {"configurable": {"search_kwargs": {"k": 3}}}

# Execute the query
docs = retriever.invoke("What is an embedding?", config=config_1)

# Display the search results
print("\n🔎 [Search Results - Basic Configuration (Top 3 Documents)]")
for idx, doc in enumerate(docs):
    print(f"📄 [Document {idx + 1}]")
    print(doc.page_content)
    print("=" * 60)


🔎 [Search Results - Basic Configuration (Top 3 Documents)]
📄 [Document 1]
Embedding
📄 [Document 2]
Definition: Embedding is the process of converting text data, such as words or sentences, into continuous low-dimensional vectors. This allows computers to understand and process text.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing, Vectorization, Deep Learning
📄 [Document 3]
Semantic Search


In [16]:
# ✅ Search Configuration 2: Similarity Score Threshold (≥ 0.8)

config_2 = {
    "configurable": {
        "search_type": "similarity_score_threshold",
        "search_kwargs": {
            "score_threshold": 0.8,  # Only return documents with a similarity score of 0.8 or higher
        },
    }
}

# Execute the query
docs = retriever.invoke("What is Word2Vec?", config=config_2)

# Display the search results
print("\n🔎 [Search Results - Similarity Score Threshold ≥ 0.8]")
for idx, doc in enumerate(docs):
    print(f"📄 [Document {idx + 1}]")
    print(doc.page_content)
    print("=" * 60)


🔎 [Search Results - Similarity Score Threshold ≥ 0.8]
📄 [Document 1]
Word2Vec

Definition: Word2Vec is a technique in NLP that maps words into a vector space, representing their semantic relationships based on context.
Example: In Word2Vec, "king" and "queen" would be represented by vectors close to each other.
Related Keywords: NLP, Embeddings, Semantic Similarity


In [17]:
# ✅ Search Configuration 3: MMR Search (Diversity and Relevance Balanced)

config_3 = {
    "configurable": {
        "search_type": "mmr",
        "search_kwargs": {
            "k": 2,            # Return the top 2 most diverse and relevant documents
            "fetch_k": 10,     # Initially fetch the top 10 documents before filtering for diversity
            "lambda_mult": 0.6 # Balance factor: 0.6 (0 = maximum diversity, 1 = maximum relevance)
        },
    }
}
# Execute the query using MMR search
docs = retriever.invoke("What is Word2Vec?", config=config_3)

#  Display the search results
print("\n🔎 [Search Results - MMR (Diversity and Relevance Balanced)]")
for idx, doc in enumerate(docs):
    print(f"📄 [Document {idx + 1}]")
    print(doc.page_content)
    print("=" * 60)


🔎 [Search Results - MMR (Diversity and Relevance Balanced)]
📄 [Document 1]
Word2Vec

Definition: Word2Vec is a technique in NLP that maps words into a vector space, representing their semantic relationships based on context.
Example: In Word2Vec, "king" and "queen" would be represented by vectors close to each other.
Related Keywords: NLP, Embeddings, Semantic Similarity
📄 [Document 2]
Tokenizer


## Using Separate Query & Passage Embedding Models

By default, a retriever uses the **same embedding model** for both queries and documents. However, certain scenarios can benefit from using different models tailored to the specific needs of queries and documents.

### Why Use Separate Embedding Models?
Using different models for queries and documents can improve retrieval accuracy and search relevance by optimizing each model for its intended purpose:
- Query Embedding Model: Fine-tuned for understanding short and concise search queries.
- Document (Passage) Embedding Model: Optimized for longer text spans with richer context.
  
For instance, **Upstage Embeddings** provides the capability to use distinct models for:  
- Query Embeddings (```solar-embedding-1-large-query```)  
- Document (Passage) Embeddings (```solar-embedding-1-large-passage```)  

In such cases, the query is embedded using the query embedding model, while the documents are embedded using the document embedding model. 
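
One way to picture this routing is a thin wrapper that exposes the two methods an embedding object is expected to provide, ```embed_documents``` and ```embed_query```, and forwards each to a different model. The sketch below is hypothetical: ```DualEmbeddings``` and ```FakeEmbedder``` are names invented here, and the fake model returns tagged toy vectors so the routing is visible without an API key. In real use the wrapper would likely need to subclass ```langchain_core.embeddings.Embeddings``` and wrap two ```UpstageEmbeddings``` instances.

```python
class DualEmbeddings:
    """Route passages and queries to two different embedding models (sketch)."""

    def __init__(self, passage_embedder, query_embedder):
        self.passage_embedder = passage_embedder
        self.query_embedder = query_embedder

    def embed_documents(self, texts):
        # Called at indexing time: documents go to the passage model.
        return self.passage_embedder.embed_documents(texts)

    def embed_query(self, text):
        # Called at search time: the query goes to the query model.
        return self.query_embedder.embed_query(text)


class FakeEmbedder:
    """Stand-in model: the first vector component is a tag identifying the model."""

    def __init__(self, tag):
        self.tag = tag

    def embed_documents(self, texts):
        return [[self.tag, float(len(text))] for text in texts]

    def embed_query(self, text):
        return [self.tag, float(len(text))]


dual = DualEmbeddings(passage_embedder=FakeEmbedder(1.0), query_embedder=FakeEmbedder(2.0))
print(dual.embed_documents(["a passage"]))  # [[1.0, 9.0]] -> produced by the passage model
print(dual.embed_query("a query"))          # [2.0, 7.0]   -> produced by the query model
```

The design point is that both models must map into the same vector space for the similarity search to be meaningful, which is the case for a matched query/passage pair such as the two Upstage models named above.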

✅ **How to Issue an Upstage API Key**  
- Sign Up & Log In: 
   - Visit [Upstage](https://upstage.ai/) and log in (sign up if you don't have an account).  

- Open API Key Page:
   - Go to the menu bar, select "Dashboards", then navigate to "API Keys".

- Generate API Key:  
   - Click **"Create new key"** → Enter a name for your key (e.g., ```LangChain-Tutorial```)

- Copy & Store Safely:  
   - Copy the generated key and keep it secure.  

<img src="./assets/01-vectorstore-retriever-get-upstage-api-key.png" alt="Description" width="1000">


In [18]:
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_upstage import UpstageEmbeddings

# ✅ 1. Data Loading and Document Splitting
loader = TextLoader("./data/01-vectorstore-retriever-appendix-keywords.txt", encoding="utf-8")
documents = loader.load()

# Split the loaded documents into text chunks 
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
split_docs = text_splitter.split_documents(documents)

# ✅ 2. Document Embedding
doc_embedder = UpstageEmbeddings(model="solar-embedding-1-large-passage")

# ✅ 3. Create a Vector Database
db = FAISS.from_documents(split_docs, doc_embedder)

Created a chunk of size 351, which is longer than the specified 300
Created a chunk of size 343, which is longer than the specified 300
Created a chunk of size 307, which is longer than the specified 300
Created a chunk of size 316, which is longer than the specified 300
Created a chunk of size 341, which is longer than the specified 300
Created a chunk of size 321, which is longer than the specified 300
Created a chunk of size 303, which is longer than the specified 300
Created a chunk of size 325, which is longer than the specified 300
Created a chunk of size 315, which is longer than the specified 300
Created a chunk of size 304, which is longer than the specified 300
Created a chunk of size 385, which is longer than the specified 300
Created a chunk of size 349, which is longer than the specified 300
Created a chunk of size 376, which is longer than the specified 300


The following example demonstrates the process of generating an Upstage embedding for a query, converting the query sentence into a vector, and conducting a vector similarity search.

In [19]:
# ✅ 4. Query Embedding
query_embedder = UpstageEmbeddings(model="solar-embedding-1-large-query")

# Convert the query into a vector using the query embedding model
query_vector = query_embedder.embed_query("What is an embedding?")

# ✅ 5. Vector Similarity Search (Return Top 2 Documents)
results = db.similarity_search_by_vector(query_vector, k=2)

# ✅ 6. Display the Search Results
print("\n🔎 [Query]: What is an embedding?\n")
for idx, doc in enumerate(results):
    print(f"📄 [Document {idx + 1}]")
    print("📖 Document Content:", doc.page_content)
    print("🗂️ Metadata:", doc.metadata)
    print("=" * 60)


🔎 [Query]: What is an embedding?

📄 [Document 1]
📖 Document Content: Embedding
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}
📄 [Document 2]
📖 Document Content: Definition: Embedding is the process of converting text data, such as words or sentences, into continuous low-dimensional vectors. This allows computers to understand and process text.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing, Vectorization, Deep Learning
🗂️ Metadata: {'source': './data/01-vectorstore-retriever-appendix-keywords.txt'}
