## Phase 1
	•	Set up the environment and dependencies (LangChain + Torch)
	•	Standardize paths and the directory structure
	•	Import shared utility functions

In [22]:
# 1. Import the required libraries
import os
import re
import sys

import torch
from langchain.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader, UnstructuredHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA  # RAG question-answering chain

# Make sure sys.path includes the project root
if '..' not in sys.path:
    sys.path.append('..')

# Import shared utilities: set_seed (from train_eval_utils.py) and filter_documents (from mism_doc_loader.py)
from utils.model_training.train_eval_utils import set_seed
from utils.text_processing.mism_doc_loader import filter_documents


# 2. Device selection
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS device for embedding model and LLM.")
else:
    device = torch.device("cpu")
    print("MPS device not available, using CPU.")

# 3. Set the random seed
SEED = 66
set_seed(SEED)

# 4. Define paths
KNOWLEDGE_BASE_DIR = '../data/cmu_mism_docs'  # root directory of the knowledge-base documents
CHROMA_PERSIST_DIR = '../vectorstore/cmu_mism_chroma'  # directory where ChromaDB stores its data

# Make sure the ChromaDB directory exists
os.makedirs(CHROMA_PERSIST_DIR, exist_ok=True)


Using MPS device for embedding model and LLM.
Random seed set to 66
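
The notebook imports `set_seed` from `utils/model_training/train_eval_utils.py`, whose body is not shown here. As a rough illustration only, a helper of that kind might look like the sketch below. The name `set_seed_sketch` and the choice of which generators to seed are assumptions, not the project's actual implementation.

```python
import os
import random


def set_seed_sketch(seed: int) -> None:
    """Seed every random number generator that happens to be installed."""
    random.seed(seed)
    # Only affects interpreters started after this point, not the current one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's default generator
    except ImportError:
        pass  # PyTorch is optional in this sketch
    print(f"Random seed set to {seed}")
```

Calling it twice with the same seed should make `random.random()` return the same value both times, which is a quick way to confirm the seeding took effect.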


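## Phase 2
    •	Path setup: define and check the directory for each document type (HTML, PDF, TXT)
    •	Document loading: use LangChain's DirectoryLoader with the matching loader_cls
    •	Type statistics: record how many files of each type were loaded
    •	Content preview: print the metadata and a text preview for the first 3 documents of each type
    •	Cleaning checks: run regex-based page-quality checks (leftover HTML tags, parsing failures, etc.)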

In [23]:
from utils.text_processing.category_map import category_map
from utils.text_processing.mism_doc_loader import filter_documents
from langchain.docstore.document import Document
import os

# --- Knowledge base construction ---

print(f"\n--- Starting document loading from {KNOWLEDGE_BASE_DIR} ---")

documents = []
loaded_counts_by_type = {'html': 0, 'pdf': 0, 'txt_forums': 0, 'txt_info': 0}
all_loaded_docs_by_type = {'html': [], 'pdf': [], 'txt_forums': [], 'txt_info': []}


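def load_and_tag_documents(path, glob_pattern, loader_cls, category_key):
    """Load files under `path` matching `glob_pattern`; tag each document with a relative source path and a category."""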
    if os.path.exists(path):
        print(f"Attempting to load {category_key.upper()} from: {path}")
        loader = DirectoryLoader(path, glob=glob_pattern, loader_cls=loader_cls)
        try:
            raw_docs = loader.load()
            for doc in raw_docs:
                rel_path = os.path.relpath(doc.metadata.get('source', ''), KNOWLEDGE_BASE_DIR).replace("\\", "/")
                doc.metadata["source"] = rel_path
                doc.metadata["category"] = category_map.get(rel_path, category_map.get(category_key, "unknown"))
            print(f"  Successfully loaded {len(raw_docs)} {category_key.upper()} documents.")
            return raw_docs
        except Exception as e:
            print(f"  ERROR: Failed to load {category_key.upper()} documents from {path}. Error: {e}")
            return []
    else:
        print(f"  WARNING: {category_key.upper()} documents directory not found: {path}.")
        return []


# --- Load HTML files ---
html_docs_path = os.path.join(KNOWLEDGE_BASE_DIR, 'websites_pages')
html_docs = load_and_tag_documents(html_docs_path, "**/*.html", UnstructuredHTMLLoader, 'websites_pages')
documents.extend(html_docs)
all_loaded_docs_by_type['html'] = html_docs
loaded_counts_by_type['html'] = len(html_docs)

# --- Load PDF files ---
pdf_docs_path = os.path.join(KNOWLEDGE_BASE_DIR, 'handbooks')
pdf_docs = load_and_tag_documents(pdf_docs_path, "**/*.pdf", PyPDFLoader, 'handbooks')
documents.extend(pdf_docs)
all_loaded_docs_by_type['pdf'] = pdf_docs
loaded_counts_by_type['pdf'] = len(pdf_docs)

# --- Load forum TXT files ---
forums_txt_path = os.path.join(KNOWLEDGE_BASE_DIR, 'forums')
forums_txt_docs = load_and_tag_documents(forums_txt_path, "**/*.txt", TextLoader, 'forums')
documents.extend(forums_txt_docs)
all_loaded_docs_by_type['txt_forums'] = forums_txt_docs
loaded_counts_by_type['txt_forums'] = len(forums_txt_docs)

# --- Load text_info TXT files ---
text_info_path = os.path.join(KNOWLEDGE_BASE_DIR, 'text_info')
text_info_docs = load_and_tag_documents(text_info_path, "**/*.txt", TextLoader, 'text_info')
documents.extend(text_info_docs)
all_loaded_docs_by_type['txt_info'] = text_info_docs
loaded_counts_by_type['txt_info'] = len(text_info_docs)

# --- Clean the documents (returns the kept documents and the removed ones) ---
documents, filtered_documents = filter_documents(documents)



--- Starting document loading from ../data/cmu_mism_docs ---
Attempting to load WEBSITES_PAGES from: ../data/cmu_mism_docs/websites_pages
  Successfully loaded 8 WEBSITES_PAGES documents.
Attempting to load HANDBOOKS from: ../data/cmu_mism_docs/handbooks
  Successfully loaded 44 HANDBOOKS documents.
Attempting to load FORUMS from: ../data/cmu_mism_docs/forums
  Successfully loaded 1 FORUMS documents.
Attempting to load TEXT_INFO from: ../data/cmu_mism_docs/text_info
  Successfully loaded 3 TEXT_INFO documents.

📊 Document cleaning complete:
  - Original documents: 56
  - Valid documents kept: 52
  - Invalid/garbled documents removed: 4
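
The cleaning step is delegated to `filter_documents` in `mism_doc_loader.py`, which is not shown. The sketch below is a guess at the general shape of such a filter, assuming it rejects documents that are too short or dominated by symbols. The `Doc` stand-in class, the function name, and both thresholds are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document, to keep the sketch dependency-free."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_documents_sketch(docs, min_chars=100, min_text_ratio=0.5):
    """Split docs into (kept, removed) using two illustrative heuristics."""
    kept, removed = [], []
    for doc in docs:
        text = doc.page_content.strip()
        # Share of characters that are letters, digits, or whitespace.
        texty = sum(ch.isalnum() or ch.isspace() for ch in text)
        ratio = texty / len(text) if text else 0.0
        if len(text) < min_chars or ratio < min_text_ratio:
            removed.append(doc)
        else:
            kept.append(doc)
    return kept, removed
```

Like the call in the notebook, the sketch returns two lists, so the removed documents stay available for manual inspection.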


Check the loaded documents (count statistics and content-quality checks)

In [24]:

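# Document counts (note: `documents` holds the post-cleaning total, while the per-type counts were recorded before cleaning)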
print(f"\nTotal documents loaded: {len(documents)}")
for doc_type, count in loaded_counts_by_type.items():
    print(f"  {doc_type.replace('_', ' ').title()} documents loaded: {count}")



# --- Detailed content verification ---
print("\n--- Detailed Content Verification (Preview of first 3 docs of each type) ---")

for doc_type, docs_list in all_loaded_docs_by_type.items():
    if not docs_list:
        print(f"No {doc_type.replace('_', ' ').title()} documents to verify.")
        continue

    print(f"\nVerifying {doc_type.replace('_', ' ').title()} documents (first {min(3, len(docs_list))} samples):")
    for i, doc in enumerate(docs_list[:3]):  # check only the first 3 samples to keep the output short
        source = doc.metadata.get('source', 'N/A')
        # PDF metadata may include 'page'
        page_info = f" Page {doc.metadata.get('page')}" if 'page' in doc.metadata else ""
        content_preview = doc.page_content.strip()[:500]  # first 500 characters as a preview

        print(f"\n  {doc_type.replace('_', ' ').title()} Document {i+1} from {source}{page_info}:")
        print(f"    Content length: {len(doc.page_content)} characters")
        print(f"    Content preview:\n\"\"\"\n{content_preview}\n\"\"\"")  # wrap in triple quotes to preserve formatting

        # --- Quality checks ---
        # Empty or very short content usually means parsing failed
        if len(doc.page_content) < 100:
            print(f"    WARNING: Content length {len(doc.page_content)} seems too short. Possible parsing issue for {source}{page_info}. Consider manually converting to TXT/MD.")

        # Check for leftover HTML tags (UnstructuredHTMLLoader may not strip everything)
        if '<' in doc.page_content[:500] and '>' in doc.page_content[:500] and 'html' in doc.page_content[:500].lower():
            print(f"    WARNING: HTML tags or remnants detected in content. This might affect embedding quality.")

        # Check for long runs of non-alphanumeric characters (e.g. JS code or complex CSS)
        if re.search(r'[^a-zA-Z0-9\s]{20,}', doc.page_content[:500]):  # 20+ consecutive characters that are neither alphanumeric nor whitespace
            print(f"    WARNING: Suspicious non-alphanumeric character sequences detected. Content might contain code or parsing artifacts.")


Total documents loaded: 52
  Html documents loaded: 8
  Pdf documents loaded: 44
  Txt Forums documents loaded: 1
  Txt Info documents loaded: 3

--- Detailed Content Verification (Preview of first 3 docs of each type) ---

Verifying Html documents (first 3 samples):

  Html Document 1 from websites_pages/career_outcome.html:
    Content length: 2341 characters
    Content preview:
"""
Careers in Information Systems Management

The World's leading companies recruit Heinz College Information Systems Management graduates. Data shown reflects the last three years.

information systems job titles and salary reports

Job titles and salary information for the Information Systems Management program.

Information Systems Career Outcomes

Dedicated Career SUPPORT

Heinz College treats career development as seriously as it treats academics. The Information Systems Management program has
"""

  Html Document 2 from websites_pages/curriculum.html:
    Content length: 8506 characters
    Content p
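
The three inline checks in the verification cell can also be packaged as one function that returns the names of the heuristics that fired, which makes them easy to run over every document rather than only the first three. This is a refactoring sketch of the same conditions; the function name and the warning labels are made up for illustration.

```python
import re


def quality_warnings(text: str, min_chars: int = 100, window: int = 500) -> list[str]:
    """Return the names of the heuristics that fire for one document's text."""
    head = text[:window]
    warnings = []
    if len(text) < min_chars:
        warnings.append("too_short")
    if "<" in head and ">" in head and "html" in head.lower():
        warnings.append("html_remnants")
    if re.search(r"[^a-zA-Z0-9\s]{20,}", head):
        warnings.append("symbol_run")
    return warnings
```

A call such as `quality_warnings(doc.page_content)` returns an empty list for a clean document, so a summary over all loaded documents can be built with a simple loop.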

## Phase 3
    •	Text chunking
    •	Text embedding
    •	Vector storage
    •	Create the retriever
    •	Test the knowledge base

In [25]:

# 6. Text chunking
print("\n--- Splitting documents into chunks ---")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len
)

chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")

# ✅ Check that the first chunk's metadata still carries the category
if chunks:
    meta = chunks[0].metadata
    print("\n--- Example Chunk ---")
    print(f"Source: {meta.get('source')}")
    print(f"Category: {meta.get('category')}")  
    print(f"Page content (first 200 chars):\n{chunks[0].page_content[:200]}...")


--- Splitting documents into chunks ---
Split into 515 chunks.

--- Example Chunk ---
Source: websites_pages/career_outcome.html
Category: career
Page content (first 200 chars):
Careers in Information Systems Management

The World's leading companies recruit Heinz College Information Systems Management graduates. Data shown reflects the last three years.

information systems ...


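Manually delete the old vector store folder
Use this when you have:
	•	changed the text-cleaning logic;
	•	changed the chunking strategy;
	•	switched to a different embedding model;
	•	or want to rebuild the index from a clean state.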

In [26]:
# import shutil
# import os

# CHROMA_PERSIST_DIR = "../vectorstore/cmu_mism_chroma"

# if os.path.exists(CHROMA_PERSIST_DIR):
#     print("Deleting existing vectorstore...")
#     shutil.rmtree(CHROMA_PERSIST_DIR)
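
Instead of remembering to delete the folder by hand, one option is to store a fingerprint of the indexing settings next to the index and rebuild when it changes. The sketch below is one possible approach, not something the notebook does; the marker file name and all function names are assumptions. It covers settings that can be expressed as values (chunk size, overlap, model name) but cannot detect edits to the cleaning code itself.

```python
import hashlib
import json
from pathlib import Path

MARKER_NAME = "index_config.sha256"  # hypothetical marker file name


def index_fingerprint(config: dict) -> str:
    """Hash the settings in a key-order-independent way."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_rebuild(persist_dir: str, config: dict) -> bool:
    """True when no marker exists or it was written for different settings."""
    marker = Path(persist_dir) / MARKER_NAME
    return not marker.exists() or marker.read_text().strip() != index_fingerprint(config)


def record_fingerprint(persist_dir: str, config: dict) -> None:
    """Write the marker; call this only after the index was built successfully."""
    marker = Path(persist_dir) / MARKER_NAME
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(index_fingerprint(config))
```

Because the marker lives inside the persist directory, it makes that directory non-empty, so it should be written only after the vector store has been created.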

In [27]:
# 7. Text embedding
print("\n--- Creating embeddings ---")
# Embed with a Sentence-Transformers model.
# "all-MiniLM-L6-v2" is a lightweight general-purpose sentence-embedding model.

embeddings_model_name = "all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(
    model_name=embeddings_model_name,
    model_kwargs={'device': str(device)}
)
print(f"Using embedding model: {embeddings_model_name} on device: {device}")


--- Creating embeddings ---
Using embedding model: all-MiniLM-L6-v2 on device: mps


In [28]:

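# 8. Vector storage
print(f"\n--- Storing embeddings in ChromaDB at {CHROMA_PERSIST_DIR} ---")
# Load the vector database if it already exists; otherwise build it from scratch.
# Note: an existing store is reused as-is, so delete it first after changing the cleaning, chunking, or embedding model.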
if os.path.exists(CHROMA_PERSIST_DIR) and len(os.listdir(CHROMA_PERSIST_DIR)) > 0:
    print("ChromaDB already exists. Loading existing vectorstore.")
    vectordb = Chroma(persist_directory=CHROMA_PERSIST_DIR, embedding_function=embeddings)
else:
    print("ChromaDB not found or empty. Creating new vectorstore.")

    # ✅ Important: confirm that every Document in chunks carries metadata (including category)
    sample_meta = chunks[0].metadata if chunks else {}
    print(f"Sample document metadata: {sample_meta}")

    vectordb = Chroma.from_documents(
        documents=chunks,  # chunks are the metadata-tagged document pieces produced above
        embedding=embeddings,
        persist_directory=CHROMA_PERSIST_DIR
    )
    vectordb.persist()
    print("Vectorstore created and persisted.")

print("\n✅ Knowledge base construction complete.")



--- Storing embeddings in ChromaDB at ../vectorstore/cmu_mism_chroma ---
ChromaDB already exists. Loading existing vectorstore.

✅ Knowledge base construction complete.


In [29]:
# 9. Create the retriever
# The retriever fetches the most relevant document chunks from the vector database
retriever = vectordb.as_retriever(search_kwargs={"k": 10})  # k=10: return the 10 most relevant chunks
print(f"Retriever created, set to retrieve top {retriever.search_kwargs['k']} chunks.")

Retriever created, set to retrieve top 10 chunks.


Test the knowledge base

In [30]:
# Example query 1: general information retrieval
query1 = "What are the admission requirements for the MISM program?"
print(f"\n--- Testing Retriever with Query: '{query1}' ---")
docs_with_scores1 = vectordb.similarity_search_with_score(query1, k=5)
print(f"Retrieved {len(docs_with_scores1)} documents.")
for i, (doc, score) in enumerate(docs_with_scores1):
    print(f"\n--- Document {i+1} ---")
    print(f"Score: {score:.4f}")  # distance score: lower means more relevant
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
    print(f"Content (first 50 chars):\n{doc.page_content[:50]}...")


--- Testing Retriever with Query: 'What are the admission requirements for the MISM program?' ---
Retrieved 5 documents.

--- Document 1 ---
Score: 0.5168
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
the
MISM
program
in
three
semesters.
Note,
the
MIS...

--- Document 2 ---
Score: 0.6145
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
the
MISM
degree
requirements.
If
a
course
is
cross...

--- Document 3 ---
Score: 0.6224
Source: websites_pages/apply.html
Content (first 50 chars):
MISM-BIDA requires that you have a background in t...

--- Document 4 ---
Score: 0.6224
Source: text_info/admission.txt
Content (first 50 chars):
MISM-BIDA requires that you have a background in t...

--- Document 5 ---
Score: 0.6988
Source: text_info/admission.txt
Content (first 50 chars):
Home    Admissions   Information Systems Managemen...
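
In the results above, Documents 3 and 4 have the same score and the same preview but come from different sources (`websites_pages/apply.html` and `text_info/admission.txt`), which suggests the same passage was indexed twice. A small post-processing step can drop such repeats before the chunks are passed to the LLM. This is a sketch, assuming each result is a `(doc, score)` pair whose `doc` has a `page_content` attribute; the function name is made up.

```python
def dedupe_by_content(docs_with_scores):
    """Keep the first (best-ranked) result for each distinct chunk text."""
    seen = set()
    unique = []
    for doc, score in docs_with_scores:
        # Normalize whitespace and case so trivially different copies collapse.
        key = " ".join(doc.page_content.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append((doc, score))
    return unique
```

Since the input is already ordered by score, keeping the first occurrence preserves the ranking.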


In [31]:
# Example query 2: more specific information retrieval
query2 = "What kind of careers do MISM graduates pursue?"
print(f"\n--- Testing Retriever with Query: '{query2}' ---")
docs_with_scores2 = vectordb.similarity_search_with_score(query2, k=10)
print(f"Retrieved {len(docs_with_scores2)} documents.")
for i, (doc, score) in enumerate(docs_with_scores2):
    print(f"\n--- Document {i+1} ---")
    print(f"Score: {score:.4f}")  # lower means more relevant
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
    print(f"Content (first 50 chars):\n{doc.page_content[:50]}...")



--- Testing Retriever with Query: 'What kind of careers do MISM graduates pursue?' ---
Retrieved 10 documents.

--- Document 1 ---
Score: 0.6626
Source: websites_pages/EL.html
Content (first 50 chars):
Apply your newly earned knowledge from class to im...

--- Document 2 ---
Score: 0.7286
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
Advisor,
Career
Advisor
for
MISM
16
and
MISM
BIDA
...

--- Document 3 ---
Score: 0.7683
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
the
MISM
degree
requirements.
If
a
course
is
cross...

--- Document 4 ---
Score: 0.7871
Source: websites_pages/businessAndTechnology.html
Content (first 50 chars):
Master's Degree in Information Systems Management
...

--- Document 5 ---
Score: 0.8396
Source: text_info/mism-16.txt
Content (first 50 chars):
Home    Programs    Master of Information Systems ...

--- Document 6 ---
Score: 0.8514
Source: websites_pages/16m_pathway.html
Content (first 50 chars):
Students in the MIS

In [32]:
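# Example query 3: information that is not in the knowledge base (nearest-neighbor search still returns k chunks, so expect large distances rather than an empty result)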
query3 = "What is the capital of France?"
print(f"\n--- Testing Retriever with Query: '{query3}' ---")
docs_with_scores3 = vectordb.similarity_search_with_score(query3, k=10)
print(f"Retrieved {len(docs_with_scores3)} documents.")
if not docs_with_scores3:
    print("No relevant documents found for this query (expected).")
else:
    for i, (doc, score) in enumerate(docs_with_scores3):
        print(f"\n--- Document {i+1} ---")
        print(f"Score: {score:.4f}")
        print(f"Source: {doc.metadata.get('source', 'N/A')}")
        print(f"Content (first 50 chars):\n{doc.page_content[:50]}...")


--- Testing Retriever with Query: 'What is the capital of France?' ---
Retrieved 10 documents.

--- Document 1 ---
Score: 1.7469
Source: text_info/admission.txt
Content (first 50 chars):
students who have worked or studied for five or mo...

--- Document 2 ---
Score: 1.7699
Source: websites_pages/curriculum.html
Content (first 50 chars):
(95-723)Accounting and Finance Analytics (95-719)D...

--- Document 3 ---
Score: 1.7769
Source: handbooks/heinz-college-student-handbook.pdf
Content (first 50 chars):
their
degree
completion
or
for
a
grade.
1 7
M a s ...

--- Document 4 ---
Score: 1.7931
Source: websites_pages/apply.html
Content (first 50 chars):
Required Essay...

--- Document 5 ---
Score: 1.7938
Source: websites_pages/apply.html
Content (first 50 chars):
If your native language (mother tongue) is not Eng...

--- Document 6 ---
Score: 1.7982
Source: websites_pages/16m_pathway.html
Content (first 50 chars):
Systems Capstone ProjectPLUSRequired Summer Intern...

--- Document 7 ---
Scor
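
The off-topic query still returns 10 chunks, but its best distance (1.7469) is far above the best distances for the on-topic queries (0.5168 and 0.6626). One way to get the "no relevant documents" branch to fire is to discard results above a distance cutoff. The sketch below assumes `(doc, score)` pairs as returned by `similarity_search_with_score`; the cutoff of 1.0 is a placeholder that would need tuning on real queries.

```python
def filter_by_distance(docs_with_scores, max_distance=1.0):
    """Drop results whose distance exceeds max_distance (lower distance = closer match)."""
    return [(doc, score) for doc, score in docs_with_scores if score <= max_distance]
```

With this filter applied before the emptiness check, an unrelated question would yield an empty list instead of ten weak matches.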

## Phase 4
    •	Get the API key
    •	Instantiate the language model
    •	Build the RAG question-answering chain
    •	Create the question classifier


In [33]:
# Import the LLM module
from langchain_openai import ChatOpenAI  # if using OpenAI
# from langchain_google_genai import ChatGoogleGenerativeAI  # if using Google Gemini
# from langchain_community.llms import Ollama  # if using Ollama

# Import the QA chain
from langchain.chains import RetrievalQA

# Import the prompt template class (for a custom prompt)
from langchain_core.prompts import ChatPromptTemplate

# --- Import the load_key function ---
import sys
if '..' not in sys.path:
    sys.path.append('..')
from utils.load_key import load_key
# -----------------------------


# --- Core logic of the QA system ---

# 1. Configure and instantiate the LLM
print("\n--- Configuring and Instantiating LLM ---")

# Get the API key: prefer the environment variable (if the user set one);
# otherwise load_key reads keys.json or prompts the user
openai_api_key_val = os.environ.get("OPENAI_API_KEY")
if not openai_api_key_val:
    print("OPENAI_API_KEY not found in environment variables.")
    openai_api_key_val = load_key("OPENAI_API_KEY")  # fall back to load_key

# Instantiate the OpenAI LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, api_key=openai_api_key_val)  # temperature=0 favors deterministic answers

# --- To use the Google Gemini API instead, use the block below and comment out the OpenAI part above ---
# google_api_key_val = os.environ.get("GOOGLE_API_KEY")
# if not google_api_key_val:
#     print("GOOGLE_API_KEY not found in environment variables.")
#     google_api_key_val = load_key("GOOGLE_API_KEY")
# from langchain_google_genai import ChatGoogleGenerativeAI # imported here to avoid an unnecessary import
# llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0, google_api_key=google_api_key_val)
# --------------------------------------------------------------------------------------

# --- To use Ollama instead, use the block below and comment out the API parts above ---
# from langchain_community.llms import Ollama # imported here to avoid an unnecessary import
# try:
#     # Check that the Ollama service is running
#     ollama_llm_test = Ollama(model="llama2")
#     ollama_llm_test.invoke("Hi") # try a simple call
#     llm = ollama_llm_test
#     print("Ollama (llama2) LLM connected.")
# except Exception as e:
#     print(f"Error connecting to Ollama: {e}. Please ensure Ollama service is running and model 'llama2' is downloaded.")
#     print("Falling back to a different LLM if available or re-prompting for API key.")
#     # If the Ollama connection fails, consider falling back to an API LLM here, or exit
#     raise # or handle with fallback logic
# --------------------------------------------------------------------------------------


print(f"LLM instantiated: {llm.__class__.__name__} with model: {llm.model_name if hasattr(llm, 'model_name') else llm.model}")


--- Configuring and Instantiating LLM ---
OPENAI_API_KEY not found in environment variables.
Successfully loaded OPENAI_API_KEY from keys.json.
LLM instantiated: ChatOpenAI with model: gpt-3.5-turbo
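
The lookup order used above is: environment variable first, then `keys.json` via `load_key`. The body of `load_key` is not shown, so the sketch below only illustrates that order under the assumption that `keys.json` is a flat JSON object mapping key names to values. The function name `resolve_api_key` is hypothetical, and unlike the real helper it never prompts the user.

```python
import json
import os
from pathlib import Path
from typing import Optional


def resolve_api_key(name: str, keys_file: str = "keys.json") -> Optional[str]:
    """Return the key from the environment, else from keys_file, else None."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(keys_file)
    if path.is_file():
        # Assumed layout: {"OPENAI_API_KEY": "...", ...}
        return json.loads(path.read_text(encoding="utf-8")).get(name)
    return None
```

Returning `None` rather than raising leaves it to the caller to decide whether to prompt, fall back to another provider, or stop.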


Build the QA chain

In [34]:

# 2. Build the RAG question-answering chain
print("\n--- Building Retrieval-Augmented Generation (RAG) Chain ---")
# Import the required LangChain components
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Custom prompt template
template = """You are an assistant that answers questions only based on provided context about the CMU MISM program.
When the question is about application requirements, look specifically for information in admissions pages or files and **provide a comprehensive and detailed list of all relevant requirements**.
When the question is about career paths, refer to student handbooks or career services sections and **describe them thoroughly**.
Do not infer from general knowledge if the answer is not in the documents.

{context}

Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = ChatPromptTemplate.from_template(template)



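print("\n--- Building the flexible QA chain (LLM inference step) ---")
# A more flexible chain that takes 'context' and 'question' as inputs;
# 'context' is built dynamically in the retrieval step and passed in
flexible_qa_chain = (
    QA_CHAIN_PROMPT
    | llm
    | StrOutputParser()  # StrOutputParser ensures the output is a plain string
)
print("Flexible QA chain built successfully.")


--- Building Retrieval-Augmented Generation (RAG) Chain ---

--- Building the flexible QA chain (LLM inference step) ---
Flexible QA chain built successfully.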
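
Before the LLM sees anything, the prompt step fills the `{context}` and `{question}` placeholders in the template. The plain-Python sketch below mimics just that substitution so the final prompt text can be inspected without calling a model. The template here is abbreviated (the instruction lines are shortened), and `render_prompt` is a made-up helper, not a LangChain function.

```python
# Abbreviated copy of the notebook's template; the instruction text is shortened.
TEMPLATE = """You are an assistant that answers questions only based on provided context about the CMU MISM program.

{context}

Question: {question}
Helpful Answer:"""


def render_prompt(chunks, question, template=TEMPLATE):
    """Join retrieved chunks with blank lines and fill the template placeholders."""
    return template.format(context="\n\n".join(chunks), question=question)
```

Printing `render_prompt([...], "...")` for a real query is a quick way to check how much text the retrieval step is sending to the model.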


Create the user query classifier

In [None]:
# Import the query classifier
from utils.text_processing.query_classifier import classify_query, CATEGORIES
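
The classifier itself lives in `utils/text_processing/query_classifier.py` and is not shown. To illustrate the contract the rest of the notebook relies on (a question goes in, a category name or `"unknown"` comes out), here is a keyword-matching sketch. The category names `admission` and `career` appear in the notebook's outputs; the keyword lists and the function name are invented for this example, and the real classifier may work quite differently.

```python
# Hypothetical keyword lists; only the category names come from the notebook.
CATEGORY_KEYWORDS = {
    "admission": ("admission", "apply", "application", "requirement"),
    "career": ("career", "job", "salary", "internship"),
}


def classify_query_sketch(question: str) -> str:
    """Return the category with the most keyword hits, or 'unknown' if none match."""
    text = question.lower()
    scores = {
        category: sum(keyword in text for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Returning `"unknown"` when nothing matches lets the caller fall back to searching every document instead of filtering on a wrong category.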

## Phase 5
    Debug the question-answering system

In [None]:

# --- Test the QA system ---

print("\n=== 🧪 Testing RAG Assistant ===")

# Shared run-and-print helper to avoid repeating code.
# Note: flexible_qa_chain is passed in as an argument because it is built beforehand.
def run_and_print_qa(question: str, vectordb, llm, flexible_qa_chain):
    print(f"\n📌 Question: {question}")

    # --- Classify the query ---
    print(f"🔍 Classifying query: '{question}'")
    predicted_category = classify_query(question)
    print(f"✅ Query classified as: '{predicted_category}'")

    # --- Build the metadata filter from the predicted category ---
    if predicted_category != "unknown":
        filter_condition = {"category": predicted_category}
        print(f"✨ Restricting retrieval to category: '{predicted_category}'")
    else:
        filter_condition = None  # no filter; some Chroma versions reject an empty dict
        print("❗ Query category is 'unknown'; retrieving from all documents.")

    # --- Retrieval: query vectordb directly ---
    # Call vectordb.similarity_search with the filter and k
    retrieved_docs = vectordb.similarity_search(
        query=question,
        k=10,  # keep in sync with the retriever's k (10)
        filter=filter_condition
    )
    print(f"✅ Retrieved {len(retrieved_docs)} documents.")

    # --- Prepare the context for the LLM ---
    # Join the page_content of every retrieved document into one context string
    context_strings = []
    source_documents_for_display = []  # kept for printing the source documents
    for doc in retrieved_docs:
        context_strings.append(doc.page_content)
        source_documents_for_display.append(doc)

    context_str = "\n\n".join(context_strings)

    if not context_str.strip():
        print("Warning: no usable context could be extracted from the retrieved documents.")
        # Fall back to a canned "sorry" message
        response_result = "Sorry, I could not find relevant information in my knowledge base to answer your question."
    else:
        # --- Call the flexible LLM chain ---
        # Pass the prepared context and the question to the prebuilt flexible_qa_chain
        response_result = flexible_qa_chain.invoke({
            "context": context_str,
            "question": question
        })

    # --- Print the answer ---
    print(f"\n💬 Answer:\n{response_result}")

    # --- Print the source documents ---
    print("\n📚 Source Documents:")
    if not source_documents_for_display:
        print("  No source documents.")
    for i, doc in enumerate(source_documents_for_display):
        source = doc.metadata.get('source', 'N/A')
        page = doc.metadata.get('page', 'N/A')
        category = doc.metadata.get('category', 'unknown')

        print(f"\n--- Document {i+1} ---")
        print(f"📄 Source: {source}")
        print(f"📂 Category: {category}")
        print(f"📎 Page: {page}")
        print(f"📑 Content Preview:\n{doc.page_content[:100]}...")



=== 🧪 Testing RAG Assistant ===


In [36]:

# --- Run the tests (passing flexible_qa_chain) ---
# Question 1: admission requirements for the MISM program
run_and_print_qa("What are the admission requirements for the CMU MISM program?", vectordb, llm, flexible_qa_chain)

# Question 2: career paths of MISM graduates
run_and_print_qa("What kind of careers do MISM graduates pursue?", vectordb, llm, flexible_qa_chain)

# Question 3: a question outside the knowledge base (retrieval still returns the nearest chunks; the prompt tells the LLM not to answer from general knowledge)
run_and_print_qa("What is the capital of France?", vectordb, llm, flexible_qa_chain)




📌 Question: What are the admission requirements for the CMU MISM program?
🔍 Classifying query: 'What are the admission requirements for the CMU MISM program?'
✅ Query classified as: 'admission'
✨ Restricting retrieval to category: 'admission'
✅ Retrieved 10 documents.

💬 Answer:
The admission requirements for the CMU MISM program are as follows:

1. Students can apply online for free in the fall of their junior year for admission at the start of their senior year.
2. Fast Track applicants are required to submit the application materials and meet the admission criteria required of all Information Systems Management and Business Intelligence and Data Analytics applicants.
3. The Admissions Committee takes a holistic approach to reviewing applications.
4. Detailed information on the application checklist can be found in the "Application Checklist" section.
5. Students must have a background in programming, database, and statistics for the MISM-BIDA program.
6. Exchange students have specific application requirements outlined on the Exchange Progr