[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_csv_rag.ipynb)

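# Simple RAG (Retrieval-Augmented Generation) System for CSV Files

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## CSV File Structure and Use Case
The CSV file contains dummy customer data with attributes such as first name, last name, and company. This dataset is used for a RAG use case: building a question-answering system over customer information.

## Key Components

1. Loading and splitting the CSV file.
2. Creating a vector store using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings.
3. Setting up a retriever to query the processed documents.
4. Creating a question-answering function over the CSV data.

## Method Details

### Document Preprocessing

1. The CSV file is loaded using LangChain's `CSVLoader`.
2. The data is split into chunks.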
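
The preprocessing step can be pictured as turning each CSV row into one small text document. The sketch below uses only the standard library rather than LangChain; the function name `rows_to_documents` and the sample rows are assumptions made for illustration, and the "column: value" layout simply mimics the `page_content` printed later in this notebook.

```python
import csv
import io


def rows_to_documents(csv_text: str) -> list[str]:
    """Turn each CSV row into one 'column: value' text block (one document per row)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        # One line per column, in header order
        lines = [f"{column}: {value}" for column, value in row.items()]
        documents.append("\n".join(lines))
    return documents


# Hypothetical two-row sample with a subset of the real columns
sample_csv = (
    "Index,First Name,Last Name,Company\n"
    "1,Sheryl,Baxter,Rasmussen Group\n"
    "2,Preston,Lozano,Vega-Gentry\n"
)
documents = rows_to_documents(sample_csv)
print(documents[0])
```

Because every row becomes its own document, a question about one customer can be answered from a single retrieved chunk.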


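### Vector Store Creation

1. OpenAI embeddings are used to create vector representations of the text chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.

### Retriever Setup

1. A retriever is configured to fetch the most relevant chunks for a given query.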
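
Conceptually, retrieval against a flat L2 index means comparing the query vector with every stored vector and keeping the closest ones. The following is a minimal pure-Python sketch of that idea, not FAISS itself; the two-dimensional toy vectors and the helper names `squared_l2` and `top_k` are assumptions for illustration (the embeddings in this notebook have 2560 dimensions).

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k=2):
    """Return (distance, position) pairs for the k stored vectors closest to the query."""
    scored = [(squared_l2(query, vector), position) for position, vector in enumerate(vectors)]
    return heapq.nsmallest(k, scored)


# Toy "embeddings"; real ones come from the embedding model
stored = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
query = [1.0, 1.0]

results = top_k(query, stored, k=2)
print([position for _, position in results])
```

A smaller distance means a closer match, so the retriever returns the documents whose positions come first in this ranking.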

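## Benefits of This Approach

1. Scalability: large documents can be handled by processing them in chunks.
2. Flexibility: parameters such as chunk size and the number of retrieved results are easy to adjust.
3. Efficiency: FAISS provides fast similarity search in high-dimensional spaces.
4. Integration with advanced NLP: OpenAI embeddings provide state-of-the-art text representations.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it retrieves relevant information efficiently in response to queries. The approach is particularly useful for applications that need quick access to specific information in CSV files.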

Import libraries

# Package Installation and Imports

The cell below installs all the packages required to run this notebook.


In [85]:
# Install required packages
%pip install faiss-cpu langchain langchain-community langchain-openai pandas python-dotenv

Looking in indexes: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
Note: you may need to restart the kernel to use updated packages.


In [86]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from pathlib import Path
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

# The chat model is DeepSeek, reached through an OpenAI-compatible endpoint.
# DEEPSEEK_BASE_URL and DEEPSEEK_API_KEY must be set in the environment or the .env file.
llm = ChatOpenAI(
    model="deepseek-chat",
    base_url=os.environ["DEEPSEEK_BASE_URL"],
    api_key=os.environ["DEEPSEEK_API_KEY"],
)
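
Since the cell above reads its settings with `os.environ[...]`, a missing variable surfaces as a `KeyError`. As an optional sketch, a small check can report all missing names at once before any client is built. The helper name `missing_env_vars` is an assumption; the variable names are the ones this notebook reads.

```python
import os

# Environment variables read elsewhere in this notebook
REQUIRED_VARS = ("DEEPSEEK_BASE_URL", "DEEPSEEK_API_KEY", "SF_BASE_URL", "SF_API_KEY")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```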


In [87]:
# # Download required data files
# import os
# os.makedirs('data', exist_ok=True)

# # Download the PDF document used in this notebook
# !wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf
# !wget -O data/customers-100.csv https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/customers-100.csv


In [88]:
import pandas as pd

file_path = "../data/customers-100.csv"  # path to the CSV file
data = pd.read_csv(file_path)

# Preview the CSV file
data.head()

Unnamed: 0,Index,Customer Id,First Name,Last Name,Company,City,Country,Phone 1,Phone 2,Email,Subscription Date,Website
0,1,DD37Cf93aecA6Dc,Sheryl,Baxter,Rasmussen Group,East Leonard,Chile,229.077.5154,397.884.0519x718,zunigavanessa@smith.info,2020-08-24,http://www.stephenson.com/
1,2,1Ef7b82A4CAAD10,Preston,Lozano,Vega-Gentry,East Jimmychester,Djibouti,5153435776,686-620-1820x944,vmata@colon.com,2021-04-23,http://www.hobbs.com/
2,3,6F94879bDAfE5a6,Roy,Berry,Murillo-Perry,Isabelborough,Antigua and Barbuda,+1-539-402-0259,(496)978-3969x58947,beckycarr@hogan.com,2020-03-25,http://www.lawrence.com/
3,4,5Cef8BFA16c5e3c,Linda,Olsen,"Dominguez, Mcmillan and Donovan",Bensonview,Dominican Republic,001-808-617-6467x12895,+1-813-324-8756,stanleyblackwell@benson.org,2020-06-02,http://www.good-lyons.com/
4,5,053d585Ab6b3159,Joanna,Bender,"Martin, Lang and Andrade",West Priscilla,Slovakia (Slovak Republic),001-234-203-0635x76146,001-199-446-3860x3486,colinalvarado@miles.net,2021-04-17,https://goodwin-ingram.com/


Load and process the CSV data

In [89]:
loader = CSVLoader(file_path=file_path)
docs = loader.load_and_split()

In [90]:
print(docs[0].page_content)

Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: zunigavanessa@smith.info
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/
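
The output above shows that each document is plain text with one "column: value" pair per line. If structured access is needed later, such text can be parsed back into a dictionary. This is a small sketch under that assumption; `parse_page_content` is a hypothetical helper, not part of LangChain.

```python
def parse_page_content(page_content: str) -> dict[str, str]:
    """Split 'column: value' lines back into a dictionary."""
    record = {}
    for line in page_content.splitlines():
        # Split on the first ': ' only, so values such as URLs stay intact
        key, separator, value = line.partition(": ")
        if separator:
            record[key] = value
    return record


sample = (
    "Index: 1\n"
    "First Name: Sheryl\n"
    "Last Name: Baxter\n"
    "Website: http://www.stephenson.com/"
)
record = parse_page_content(sample)
print(record["Website"])
```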


Initialize the FAISS vector store and the embedding model

In [91]:
import faiss
import os
import sys
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

sys.path.append(os.path.dirname(os.path.abspath('.')))
from helper_functions import get_langchain_embedding_provider, EmbeddingProvider

embeddings = get_langchain_embedding_provider(
    EmbeddingProvider.OPENAI,
    model_id="Qwen/Qwen3-Embedding-4B",
    base_url=os.environ["SF_BASE_URL"],
    api_key=os.environ["SF_API_KEY"]
)
# 1. Create a FAISS index whose dimension matches the embedding size
embedding_dim = len(embeddings.embed_query(" "))
index = faiss.IndexFlatL2(embedding_dim)
print(embedding_dim)

# 2. Create the FAISS vector store
vector_store = FAISS(
    embedding_function=embeddings,  # embedding function
    index=index,                    # vector index
    docstore=InMemoryDocstore(),    # document store
    index_to_docstore_id={}         # maps index positions to docstore IDs
)

2560
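
The vector store constructed above ties three pieces together: the index holding the vectors, a document store keyed by ID, and a mapping from index position to document ID. The toy class below is only a conceptual sketch of that bookkeeping, assuming a plain list in place of the FAISS index; `TinyVectorStore` and its methods are hypothetical names, not LangChain APIs.

```python
import uuid


class TinyVectorStore:
    """Toy stand-in showing how positions, IDs, and documents relate."""

    def __init__(self):
        self.vectors = []               # stands in for the FAISS index
        self.docstore = {}              # document ID -> document text
        self.index_to_docstore_id = {}  # position in the index -> document ID

    def add(self, vector, text):
        """Store a vector and its text, returning the generated document ID."""
        doc_id = str(uuid.uuid4())
        position = len(self.vectors)
        self.vectors.append(vector)
        self.docstore[doc_id] = text
        self.index_to_docstore_id[position] = doc_id
        return doc_id

    def get_by_position(self, position):
        """Resolve an index position (as returned by a search) to its text."""
        return self.docstore[self.index_to_docstore_id[position]]


store = TinyVectorStore()
first_id = store.add([0.0, 1.0], "First Name: Sheryl")
store.add([1.0, 0.0], "First Name: Preston")
print(store.get_by_position(1))
```

A similarity search returns positions in the index; the mapping is what turns those positions back into documents.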


Add the split CSV data to the vector store

In [92]:
vector_store.add_documents(documents=docs)

['d82768a6-3a9c-4121-81ea-01f9d1732d2b',
 'b80ad63b-522d-43dd-9cb7-c100088837cd',
 '4abf4092-a45a-46e6-acd4-a1f36c0311e4',
 'a76f9e4b-66d6-4ac1-95f4-ad9ea8cab6f4',
 '2c998edb-e74f-4a9b-9e81-f874ea872afc',
 'e153b8ed-6bd8-41a7-8091-be4dd2ed4253',
 '4b076ae4-b3d5-41c8-882f-71821a05e781',
 '70b7f839-1abf-4685-8d34-baaea3faf4d6',
 'a6dca6d1-79c4-4f5e-b9b9-2a269c063cc8',
 'ab40ef88-6cae-4f30-88a8-8152cd0bce67',
 'fe665ba9-38cc-4b9a-a18c-4e7918e038a9',
 '9699470f-035b-443c-b94f-75a981555918',
 '04cf9fa2-e450-4856-b521-9ea0b13df381',
 '4433df5b-d0a7-4b0e-b121-fd49bd8f1907',
 '0c109b39-02d7-483f-9113-822743ecb04e',
 'b3c02556-451c-4159-aebd-d71540fb4777',
 '6ffd8d4f-e1ce-43bc-87bd-351d1c9e3022',
 '9021cfa2-24c7-4b1d-85d6-459280515115',
 '2a2d0cff-7d2a-4c1a-a261-d08e8030d440',
 '1564d55c-92a2-48b8-8aa0-cca88cfd7000',
 '610f5419-e032-44ea-a9eb-3480e89963ea',
 'a9df4d66-610b-4379-9024-a3c3c9c785ce',
 '2449f431-b864-400e-b231-dc3ab11c4e52',
 'abeff939-b3d9-4cd4-a6aa-fc0882d8a80b',
 '4c689eb7-7373-

Create the retrieval chain

In [93]:
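# Import the core LangChain components
from langchain_core.prompts import ChatPromptTemplate      # chat prompt template
from langchain.chains import create_retrieval_chain        # builds the retrieval chain
from langchain.chains.combine_documents import create_stuff_documents_chain  # builds the document-combining chain

# Step 1: create the retriever by wrapping the vector store
# The retriever finds the document chunks most relevant to the user's query
retriever = vector_store.as_retriever()

# Step 2: define the system prompt that sets the assistant's behavior
system_prompt = (
    "You are an assistant for question-answering tasks. "     # assistant role
    "Use the following pieces of retrieved context to answer " # instruct the model to use the retrieved context
    "the question. If you don't know the answer, say that you " # what to do when the answer is unknown
    "don't know. Use three sentences maximum and keep the "     # limit the answer length
    "answer concise."                                          # ask for a concise answer
    "\n\n"
    "{context}"                                               # placeholder: retrieved context is inserted here
)

# Step 3: create the chat prompt template that defines the conversation format
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),    # system message: behavior and context
    ("human", "{input}"),         # human message: the user's question (placeholder)
])

# Step 4: create the question-answering chain that combines retrieved documents with the question
# 'stuff' strategy: all retrieved documents are placed into a single prompt sent to the LLM
question_answer_chain = create_stuff_documents_chain(llm, prompt)

# Step 5: create the full RAG chain by combining the retriever and the question-answering chain
# Flow: user question -> retriever -> relevant documents -> question-answering chain -> final answer
rag_chain = create_retrieval_chain(retriever, question_answer_chain)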
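
To make the "stuff" strategy concrete, the sketch below builds the final messages by hand: the retrieved texts are joined and substituted into the `{context}` placeholder, and the question becomes the human message. It uses plain strings instead of LangChain objects; `stuff_documents`, the shortened template, and the two sample texts are assumptions for illustration.

```python
# Shortened stand-in for the system prompt used in this notebook
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question.\n\n"
    "{context}"
)


def stuff_documents(doc_texts, question):
    """Place all retrieved texts into one prompt, separated by blank lines."""
    context = "\n\n".join(doc_texts)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = stuff_documents(
    [
        "First Name: Sheryl\nCompany: Rasmussen Group",
        "First Name: Roy\nCompany: Murillo-Perry",
    ],
    "which company does sheryl Baxter work for?",
)
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

The trade-off of this strategy is simplicity against prompt size: every retrieved document is sent in full, so the number of retrieved documents directly affects how long the prompt becomes.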

Query the RAG bot with a question based on the CSV data

In [100]:
answer = rag_chain.invoke({"input": "which company does sheryl Baxter work for?"})

answer['answer']

'Sheryl Baxter works for Rasmussen Group.'

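## 🔍 RAG Debugging and Monitoring

Because the stages of a RAG system are tightly coupled, we add detailed debugging helpers that let you inspect the actual data exchanged at each stage.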

In [95]:
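# ===== Debugging and monitoring helpers =====
# To inspect the actual data exchanged at each stage, we build a debug version of the RAG pipeline

def debug_rag_pipeline(question: str, retriever, llm, prompt, k=4):
    """
    Debug version of the RAG pipeline that prints detailed information for each stage.

    Relies on the notebook-level `embeddings` and `vector_store` objects.

    Args:
        question: the user's question
        retriever: the retriever
        llm: the language model
        prompt: the prompt template
        k: number of documents to score in the similarity-score check
           (the retriever itself uses its own configured setting)

    Returns:
        A dictionary containing the information from all intermediate steps
    """
    print("=" * 80)
    print("🔍 RAG pipeline debug information")
    print("=" * 80)

    # Step 1: show the original question
    print(f"\n📝 Step 1 - User question:")
    print(f"Question: {question}")
    print(f"Question length: {len(question)} characters")

    # Step 2: embed the question (shown for inspection only)
    print(f"\n🔢 Step 2 - Question embedding:")
    try:
        question_embedding = embeddings.embed_query(question)
        print(f"Vector dimension: {len(question_embedding)}")
        print(f"First 5 values: {question_embedding[:5]}")
    except Exception as e:
        print(f"Embedding failed: {e}")

    # Step 3: document retrieval
    print(f"\n🔍 Step 3 - Document retrieval:")
    try:
        # Use the retriever to fetch the relevant documents
        retrieved_docs = retriever.invoke(question)
        print(f"Number of documents retrieved: {len(retrieved_docs)}")

        # Show each retrieved document
        for i, doc in enumerate(retrieved_docs):
            print(f"\n  📄 Document {i+1}:")
            print(f"    Content: {doc.page_content[:200]}...")
            print(f"    Content length: {len(doc.page_content)} characters")
            if hasattr(doc, 'metadata') and doc.metadata:
                print(f"    Metadata: {doc.metadata}")

        # Also show similarity scores if the vector store supports them
        # (with an L2 index these are distances, so lower means more similar)
        try:
            docs_with_scores = vector_store.similarity_search_with_score(question, k=k)
            print(f"\n  📊 Similarity scores:")
            for i, (doc, score) in enumerate(docs_with_scores):
                print(f"    Document {i+1} score: {score:.4f}")
        except Exception:
            print("    (similarity scores unavailable)")

    except Exception as e:
        print(f"Document retrieval failed: {e}")
        retrieved_docs = []

    # Step 4: build the context
    print(f"\n📋 Step 4 - Build context:")
    try:
        # Merge the retrieved documents into a single context string
        context = "\n\n".join([doc.page_content for doc in retrieved_docs])
        print(f"Total context length: {len(context)} characters")
        print(f"Context preview: {context[:300]}...")
    except Exception as e:
        print(f"Failed to build context: {e}")
        context = ""

    # Step 5: build the final prompt
    print(f"\n💬 Step 5 - Build final prompt:")
    try:
        # Fill the prompt template to produce the final messages
        final_prompt = prompt.format_messages(
            context=context,
            input=question
        )
        print(f"Number of prompt messages: {len(final_prompt)}")
        for i, msg in enumerate(final_prompt):
            print(f"  Message {i+1} ({msg.type}): {str(msg.content)[:200]}...")
    except Exception as e:
        print(f"Failed to build prompt: {e}")
        final_prompt = []

    # Step 6: generate the answer with the LLM
    print(f"\n🤖 Step 6 - LLM answer generation:")
    try:
        # Call the LLM to generate the answer
        response = llm.invoke(final_prompt)
        print(f"LLM response type: {type(response)}")
        if hasattr(response, 'content'):
            answer = response.content
            print(f"Answer length: {len(answer)} characters")
            print(f"Answer: {answer}")
        else:
            answer = str(response)
            print(f"Answer: {answer}")
    except Exception as e:
        print(f"LLM generation failed: {e}")
        answer = "Error while generating the answer"

    # Step 7: return the full result
    print(f"\n✅ Step 7 - Done")
    print("=" * 80)

    return {
        'question': question,
        'retrieved_docs': retrieved_docs,
        'context': context,
        'final_prompt': final_prompt,
        'answer': answer,
        'num_docs_retrieved': len(retrieved_docs)
    }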

# A simplified debug function dedicated to testing retrieval
def debug_retrieval_only(question: str, retriever, k=4):
    """
    Test the retrieval stage only.

    Note: `k` is currently unused; the retriever's own setting determines
    how many documents are returned.
    """
    print(f"\n🔍 Retrieval test - Question: {question}")
    print("-" * 50)

    # Run the retrieval
    docs = retriever.invoke(question)
    print(f"Documents retrieved: {len(docs)}")

    for i, doc in enumerate(docs):
        print(f"\nDocument {i+1}:")
        print(f"Content: {doc.page_content}")
        if hasattr(doc, 'metadata'):
            print(f"Metadata: {doc.metadata}")

    return docs

# A function for trying different retrieval parameters
def test_retrieval_parameters(question: str, vector_store, k_values=(1, 3, 5)):
    """
    Run the same query with several values of k.
    """
    print(f"\n🧪 Retrieval parameter test - Question: {question}")
    print("=" * 60)

    for k in k_values:
        print(f"\n📊 Results for k={k}:")
        print("-" * 30)

        try:
            docs = vector_store.similarity_search(question, k=k)
            print(f"Documents retrieved: {len(docs)}")

            for i, doc in enumerate(docs):
                print(f"  {i+1}. {doc.page_content[:100]}...")

        except Exception as e:
            print(f"Retrieval failed: {e}")

### Testing the RAG pipeline with the debug helpers

In [96]:
# Test 1: debug the full RAG pipeline
question = "which company does Ralph Haas work for?  what is her email?"
debug_result = debug_rag_pipeline(question, retriever, llm, prompt)

# The individual parts of the debug result are accessible by key
print(f"\n📊 Debug result summary:")
print(f"Number of documents retrieved: {debug_result['num_docs_retrieved']}")
print(f"Final answer: {debug_result['answer']}")

🔍 RAG pipeline debug information

📝 Step 1 - User question:
Question: which company does Ralph Haas work for?  what is her email?
Question length: 59 characters

🔢 Step 2 - Question embedding:
Vector dimension: 2560
First 5 values: [-0.0005412720493040979, -0.0016420504543930292, 0.05914836749434471, -0.017320705577731133, -0.0027831364423036575]

🔍 Step 3 - Document retrieval:
Number of documents retrieved: 4

  📄 Document 1:
    Content: Index: 56
Customer Id: C6F2Fc6a7948a4e
First Name: Ralph
Last Name: Haas
Company: Montes PLC
City: Lake Ellenchester
Country: Palestinian Territory
Phone 1: 2239271999
Phone 2: 001-962-434-0867x649
Em...
    Content length: 285 characters
    Metadata: {'source': '../data/customers-100.csv', 'row': 55}

  📄 Document 2:
    Content: Index: 7
Customer Id: EA4d384DfDbBf77
First Name: Darren
Last Name: Peck
Company: Lester, Woodard and Mitchell
City: Lake Ana
Country: Pitcairn Islands
Phone 1: (496)452-6181x3291
Phone 2: +1-247-266-...
    Content length: 295 characters
    Metadata: {'source': '../data/customers-100.csv', 'row': 6}

  📄 Document 3:
    Content: Index: 42
Customer Id: 6e5ad5a5e2bB5Ca
First Name: Bryan
Last Name: Dunn
Company: Kaufman and Sons

In [97]:
# Test 2: test the retrieval stage only
debug_retrieval_only("sheryl baxter", retriever)


🔍 Retrieval test - Question: sheryl baxter
--------------------------------------------------
Documents retrieved: 4

Document 1:
Content: Index: 42
Customer Id: 6e5ad5a5e2bB5Ca
First Name: Bryan
Last Name: Dunn
Company: Kaufman and Sons
City: North Jimstad
Country: Burkina Faso
Phone 1: 001-710-802-5565
Phone 2: 078.699.8982x13881
Email: woodwardandres@phelps.com
Subscription Date: 2021-09-08
Website: http://www.butler.com/
Metadata: {'source': '../data/customers-100.csv', 'row': 41}

Document 2:
Content: Index: 76
Customer Id: d40e89dCade7b2F
First Name: Debra
Last Name: Riddle
Company: Chang, Aguirre and Leblanc
City: Colinhaven
Country: United States Virgin Islands
Phone 1: +1-768-182-6014x14336
Phone 2: (303)961-4491
Email: shieldskerry@robles.com
Subscription Date: 2020-07-11
Website: http://kaiser.info/
Metadata: {'source': '../data/customers-100.csv', 'row': 75}

Document 3:
Content: Index: 7
Customer Id: EA4d384DfDbBf77
First Name: Darren
Last Name: Peck
Company: Lester, Woodard and Mitchell
City: Lake Ana
Country: Pitcairn Islands
Phone 1: (496

[Document(id='c2c6ecbe-0b17-4f20-a250-34ad98807df0', metadata={'source': '../data/customers-100.csv', 'row': 41}, page_content='Index: 42\nCustomer Id: 6e5ad5a5e2bB5Ca\nFirst Name: Bryan\nLast Name: Dunn\nCompany: Kaufman and Sons\nCity: North Jimstad\nCountry: Burkina Faso\nPhone 1: 001-710-802-5565\nPhone 2: 078.699.8982x13881\nEmail: woodwardandres@phelps.com\nSubscription Date: 2021-09-08\nWebsite: http://www.butler.com/'),
 Document(id='1ba43e10-fbc7-4668-a8b2-9280d2a1bb2d', metadata={'source': '../data/customers-100.csv', 'row': 75}, page_content='Index: 76\nCustomer Id: d40e89dCade7b2F\nFirst Name: Debra\nLast Name: Riddle\nCompany: Chang, Aguirre and Leblanc\nCity: Colinhaven\nCountry: United States Virgin Islands\nPhone 1: +1-768-182-6014x14336\nPhone 2: (303)961-4491\nEmail: shieldskerry@robles.com\nSubscription Date: 2020-07-11\nWebsite: http://kaiser.info/'),
 Document(id='4b076ae4-b3d5-41c8-882f-71821a05e781', metadata={'source': '../data/customers-100.csv', 'row': 6}, pag

In [98]:
# Test 3: try different retrieval parameters
test_retrieval_parameters("sheryl baxter", vector_store, k_values=[1, 2, 4, 6])


🧪 Retrieval parameter test - Question: sheryl baxter

📊 Results for k=1:
------------------------------
Documents retrieved: 1
  1. Index: 42
Customer Id: 6e5ad5a5e2bB5Ca
First Name: Bryan
Last Name: Dunn
Company: Kaufman and Sons
C...

📊 Results for k=2:
------------------------------
Documents retrieved: 2
  1. Index: 42
Customer Id: 6e5ad5a5e2bB5Ca
First Name: Bryan
Last Name: Dunn
Company: Kaufman and Sons
C...
  2. Index: 76
Customer Id: d40e89dCade7b2F
First Name: Debra
Last Name: Riddle
Company: Chang, Aguirre a...

📊 Results for k=4:
------------------------------
Documents retrieved: 4
  1. Index: 42
Customer Id: 6e5ad5a5e2bB5Ca
First Name: Bryan
Last Name: Dunn
Company: Kaufman and Sons
C...
  2. Index: 76
Customer Id: d40e89dCade7b2F
First Name: Debra
Last Name: Riddle
Company: Chang, Aguirre a...
  3. Index: 7
Customer Id: EA4d384DfDbBf77
First Name: Darren
Last Name: Peck
Company: Lester, Woodard an...
  4. Index: 90
Customer Id: 37Ec4B395641c1E
First Name: Lori
Last Name: Flowers
Company: Decker-Mcknight
...

📊 Results for k=6:
----------------------------

### Comparing the original RAG chain with the debug version

In [99]:
# Compare the original RAG chain with the debug version
question = "which company does sheryl Baxter work for?"

print("🔄 Original RAG chain result:")
original_result = rag_chain.invoke({"input": question})
print(f"Answer: {original_result['answer']}")
print(f"Number of context documents: {len(original_result.get('context', []))}")

print("\n" + "="*60)

print("🔍 Debug version result:")
debug_result = debug_rag_pipeline(question, retriever, llm, prompt)

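🔄 Original RAG chain result:
Answer: Sheryl Baxter works for Rasmussen Group.
Number of context documents: 4

🔍 Debug version result:
🔍 RAG pipeline debug information

📝 Step 1 - User question:
Question: which company does sheryl Baxter work for?
Question length: 42 characters

🔢 Step 2 - Question embedding:
Vector dimension: 2560
First 5 values: [-0.00054713967256248, 0.022464854642748833, 0.02155950851738453, -0.02430623583495617, -0.002863731700927019]

🔍 Step 3 - Document retrieval:
Number of documents retrieved: 4

  📄 Document 1:
    Content: Index: 34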
Customer Id: A09AEc6E3bF70eE
First Name: Kaitlyn
Last Name: Santana
Company: Herrera Group
City: New Kaitlyn
Country: United States of America
Phone 1: 6303643286
Phone 2: 447-710-6202x07313...
    Content length: 284 characters
    Metadata: {'source': '../data/customers-100.csv', 'row': 33}

  📄 Document 2:
    Content: Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: zunigavanes...
    Content length: 279 characters
    Metadata: {'source': '../data/customers-100.csv', 'row': 0}

  📄 Document 3:
    Content: Index: 90
Customer Id: 37Ec4B395641c1E
Fir