## Stage 1
	•	Set up the environment and dependencies (LangChain + Torch)
	•	Standardize paths and directory structure
	•	Import shared utility functions

In [1]:
# 1. Import required libraries
import os
import torch
import re
from langchain.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader, UnstructuredHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA # RAG question-answering chain

# --- Load the re-ranking model ---
from sentence_transformers import CrossEncoder

print("\n--- Loading CrossEncoder Re-ranker Model ---")
# 'cross-encoder/ms-marco-MiniLM-L-6-v2' is a widely used cross-encoder
# that scores the relevance of a (query, document) pair
reranker_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("CrossEncoder Re-ranker Model loaded successfully.")

# Make sure sys.path contains the project root
import sys
sys.path.append('..')

# Import shared utilities: set_seed (from train_eval_utils.py) and filter_documents (from mism_doc_loader.py)
from utils.model_training.train_eval_utils import set_seed
from utils.text_processing.mism_doc_loader import filter_documents


# 2. Device setup
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS device for embedding model and LLM.")
else:
    device = torch.device("cpu")
    print("MPS device not available, using CPU.")

# 3. Set the random seed
SEED = 66
set_seed(SEED)

# 4. Define paths
KNOWLEDGE_BASE_DIR = '../data/cmu_mism_docs' # root directory of the knowledge-base documents
CHROMA_PERSIST_DIR = '../vectorstore/cmu_mism_chroma' # directory where ChromaDB stores its data

# Make sure the ChromaDB directory exists
os.makedirs(CHROMA_PERSIST_DIR, exist_ok=True)





--- Loading CrossEncoder Re-ranker Model ---
CrossEncoder Re-ranker Model loaded successfully.
Using MPS device for embedding model and LLM.
Random seed set to 66
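
The `set_seed` helper is imported from `utils/model_training/train_eval_utils.py`, which is not shown in this notebook. As a rough idea of what such a helper usually does, here is a minimal sketch. Its body is an assumption, not the project's actual implementation; NumPy and PyTorch are seeded only if they happen to be installed.

```python
import os
import random


def set_seed(seed: int) -> None:
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    # Only affects Python subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's default generators
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
    print(f"Random seed set to {seed}")


# Seeding twice with the same value reproduces the same draw.
set_seed(66)
first = random.random()
set_seed(66)
second = random.random()
```

Seeding makes sampling-based steps repeatable, but some accelerator kernels can still be nondeterministic, so identical seeds do not guarantee bit-identical embeddings.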


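## Stage 2
    •	Path setup: define and check the directory for each document type (HTML, PDF, TXT)
    •	Document loading: use LangChain's DirectoryLoader with the matching loader_cls
    •	Type statistics: record how many files of each type were loaded
    •	Content preview: print the metadata and a text preview for the first 3 documents of each type
    •	Cleaning check: run regex-based quality checks on each page (e.g. leftover HTML tags, failed parsing)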

In [2]:
from utils.text_processing.category_map import category_map
from utils.text_processing.mism_doc_loader import filter_documents
from langchain.docstore.document import Document
import os

# --- Knowledge-base construction pipeline ---

print(f"\n--- Starting document loading from {KNOWLEDGE_BASE_DIR} ---")

documents = []
loaded_counts_by_type = {'html': 0, 'pdf': 0, 'txt_forums': 0, 'txt_info': 0}
all_loaded_docs_by_type = {'html': [], 'pdf': [], 'txt_forums': [], 'txt_info': []}


def load_and_tag_documents(path, glob_pattern, loader_cls, category_key):
    """Load all files matching glob_pattern under path and tag each document with a relative source and a category."""
    if os.path.exists(path):
        print(f"Attempting to load {category_key.upper()} from: {path}")
        loader = DirectoryLoader(path, glob=glob_pattern, loader_cls=loader_cls)
        try:
            raw_docs = loader.load()
            for doc in raw_docs:
                # Store the source relative to the knowledge-base root, with forward slashes
                rel_path = os.path.relpath(doc.metadata.get('source', ''), KNOWLEDGE_BASE_DIR).replace("\\", "/")
                doc.metadata["source"] = rel_path
                # Prefer a per-file category; fall back to the folder-level category
                doc.metadata["category"] = category_map.get(rel_path, category_map.get(category_key, "unknown"))
            print(f"  Successfully loaded {len(raw_docs)} {category_key.upper()} documents.")
            return raw_docs
        except Exception as e:
            print(f"  ERROR: Failed to load {category_key.upper()} documents from {path}. Error: {e}")
            return []
    else:
        print(f"  WARNING: {category_key.upper()} documents directory not found: {path}.")
        return []


# --- Load HTML files ---
html_docs_path = os.path.join(KNOWLEDGE_BASE_DIR, 'websites_pages')
html_docs = load_and_tag_documents(html_docs_path, "**/*.html", UnstructuredHTMLLoader, 'websites_pages')
documents.extend(html_docs)
all_loaded_docs_by_type['html'] = html_docs
loaded_counts_by_type['html'] = len(html_docs)

# --- Load PDF files ---
pdf_docs_path = os.path.join(KNOWLEDGE_BASE_DIR, 'handbooks')
pdf_docs = load_and_tag_documents(pdf_docs_path, "**/*.pdf", PyPDFLoader, 'handbooks')
documents.extend(pdf_docs)
all_loaded_docs_by_type['pdf'] = pdf_docs
loaded_counts_by_type['pdf'] = len(pdf_docs)

# --- Load Forums TXT files ---
forums_txt_path = os.path.join(KNOWLEDGE_BASE_DIR, 'forums')
forums_txt_docs = load_and_tag_documents(forums_txt_path, "**/*.txt", TextLoader, 'forums')
documents.extend(forums_txt_docs)
all_loaded_docs_by_type['txt_forums'] = forums_txt_docs
loaded_counts_by_type['txt_forums'] = len(forums_txt_docs)

# --- Load Text_info TXT files ---
text_info_path = os.path.join(KNOWLEDGE_BASE_DIR, 'text_info')
text_info_docs = load_and_tag_documents(text_info_path, "**/*.txt", TextLoader, 'text_info')
documents.extend(text_info_docs)
all_loaded_docs_by_type['txt_info'] = text_info_docs
loaded_counts_by_type['txt_info'] = len(text_info_docs)

# --- Clean the documents (returns the kept documents and the rejected ones) ---
documents, filtered_documents = filter_documents(documents)



--- Starting document loading from ../data/cmu_mism_docs ---
Attempting to load WEBSITES_PAGES from: ../data/cmu_mism_docs/websites_pages
  Successfully loaded 8 WEBSITES_PAGES documents.
Attempting to load HANDBOOKS from: ../data/cmu_mism_docs/handbooks




  Successfully loaded 44 HANDBOOKS documents.
Attempting to load FORUMS from: ../data/cmu_mism_docs/forums
  Successfully loaded 1 FORUMS documents.
Attempting to load TEXT_INFO from: ../data/cmu_mism_docs/text_info
  Successfully loaded 3 TEXT_INFO documents.

📊 Document cleaning complete:
  - Original document count: 56
  - Valid documents kept: 52
  - Invalid/garbled documents removed: 4
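
`filter_documents` comes from `utils/text_processing/mism_doc_loader.py` and its rules are not shown here. The sketch below illustrates one plausible way to split documents into kept and rejected lists. The `Doc` class, the `is_low_quality` heuristics, and the 100-character threshold are all assumptions made for this example, not the project's real logic.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for LangChain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def is_low_quality(text: str, min_chars: int = 100) -> bool:
    """Assumed heuristics: text is too short, or has a long run of symbol-only characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    return re.search(r"[^a-zA-Z0-9\s]{20,}", stripped) is not None


def filter_documents_sketch(docs):
    """Split docs into (kept, removed) according to is_low_quality."""
    kept = [d for d in docs if not is_low_quality(d.page_content)]
    removed = [d for d in docs if is_low_quality(d.page_content)]
    return kept, removed


sample = [
    Doc("The MISM program requires transcripts, test scores, and essays. " * 3),
    Doc("too short"),
    Doc("x" * 120 + "#" * 25),  # long symbol run, e.g. a parsing artifact
]
kept, removed = filter_documents_sketch(sample)
print(f"kept={len(kept)} removed={len(removed)}")
```

Returning the rejected documents alongside the kept ones, as the notebook's call does, makes it easy to inspect what was dropped before trusting the filter.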


Check the loaded documents (count statistics and content quality checks)

In [3]:

# Document count statistics (the total is after cleaning; per-type counts are from before cleaning)
print(f"\nTotal documents loaded: {len(documents)}")
for doc_type, count in loaded_counts_by_type.items():
    print(f"  {doc_type.replace('_', ' ').title()} documents loaded: {count}")



# --- Detailed content verification ---
print("\n--- Detailed Content Verification (Preview of first 3 docs of each type) ---")

for doc_type, docs_list in all_loaded_docs_by_type.items():
    if not docs_list:
        print(f"No {doc_type.replace('_', ' ').title()} documents to verify.")
        continue

    print(f"\nVerifying {doc_type.replace('_', ' ').title()} documents (first {min(3, len(docs_list))} samples):")
    for i, doc in enumerate(docs_list[:3]): # check only the first 3 samples to keep the output short
        source = doc.metadata.get('source', 'N/A')
        # PDF metadata may include 'page'
        page_info = f" Page {doc.metadata.get('page')}" if 'page' in doc.metadata else ""
        content_preview = doc.page_content.strip()[:500] # first 500 characters for the preview

        print(f"\n  {doc_type.replace('_', ' ').title()} Document {i+1} from {source}{page_info}:")
        print(f"    Content length: {len(doc.page_content)} characters")
        print(f"    Content preview:\n\"\"\"\n{content_preview}\n\"\"\"") # wrap in triple quotes to keep the formatting

        # --- Quality checks ---
        # Empty or very short content usually means parsing failed
        if len(doc.page_content) < 100:
            print(f"    WARNING: Content length {len(doc.page_content)} seems too short. Possible parsing issue for {source}{page_info}. Consider manually converting to TXT/MD.")

        # Leftover HTML tags (UnstructuredHTMLLoader may not strip everything)
        if '<' in doc.page_content[:500] and '>' in doc.page_content[:500] and 'html' in doc.page_content[:500].lower():
            print("    WARNING: HTML tags or remnants detected in content. This might affect embedding quality.")

        # Long runs of non-alphanumeric characters (e.g. JS code or complex CSS)
        if re.search(r'[^a-zA-Z0-9\s]{20,}', doc.page_content[:500]): # 20+ consecutive characters that are not letters, digits, or whitespace
            print("    WARNING: Suspicious non-alphanumeric character sequences detected. Content might contain code or parsing artifacts.")


Total documents loaded: 52
  Html documents loaded: 8
  Pdf documents loaded: 44
  Txt Forums documents loaded: 1
  Txt Info documents loaded: 3

--- Detailed Content Verification (Preview of first 3 docs of each type) ---

Verifying Html documents (first 3 samples):

  Html Document 1 from websites_pages/career_outcome.html:
    Content length: 2341 characters
    Content preview:
"""
Careers in Information Systems Management

The World's leading companies recruit Heinz College Information Systems Management graduates. Data shown reflects the last three years.

information systems job titles and salary reports

Job titles and salary information for the Information Systems Management program.

Information Systems Career Outcomes

Dedicated Career SUPPORT

Heinz College treats career development as seriously as it treats academics. The Information Systems Management program has
"""

  Html Document 2 from websites_pages/curriculum.html:
    Content length: 8506 characters
    Content p

## Stage 3
    •	Text chunking
    •	Text embedding
    •	Vector storage
    •	Create the retriever
    •	Test the knowledge base (optional)

In [4]:

# 6. Text chunking
print("\n--- Splitting documents into chunks ---")

# Requires: pip install langchain-experimental sentence-transformers
from langchain_experimental.text_splitter import SemanticChunker

# A simple tokenizer used for BM25 indexing and querying
def simple_tokenizer(text):
    """Lowercase the text and split it into word tokens (for BM25 indexing and querying)."""
    return re.findall(r'\w+', text.lower())

# Initialize the embedding model used for semantic chunking.
# The same `embeddings` instance is reused for the vector store in step 8.
embeddings_model_name = "all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(
    model_name=embeddings_model_name,
    model_kwargs={'device': str(device)}
)

# Instantiate SemanticChunker
# Tune breakpoint_threshold_amount to change chunk size and count
# (for the "percentile" strategy the library default is 95)
text_splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")

chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")

# Also check that the first chunk's metadata kept the category
if chunks:
    meta = chunks[0].metadata
    print("\n--- Example Chunk ---")
    print(f"Source: {meta.get('source')}")
    print(f"Category: {meta.get('category')}")
    print(f"Page content (first 200 chars):\n{chunks[0].page_content[:200]}...")


--- Splitting documents into chunks ---
Split into 135 chunks.

--- Example Chunk ---
Source: websites_pages/career_outcome.html
Category: career
Page content (first 200 chars):
Careers in Information Systems Management

The World's leading companies recruit Heinz College Information Systems Management graduates. Data shown reflects the last three years. information systems j...
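
To make the "percentile" breakpoint strategy concrete, the sketch below splits a list of sentences wherever the cosine distance between neighbouring embeddings exceeds a chosen percentile of all neighbour distances. It is a simplified illustration of the idea with toy 2-D vectors. It is not `SemanticChunker`'s actual implementation, which may differ in details such as how sentences are grouped before embedding.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in [0, 100])."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_by_percentile(sentences, vectors, pct=95):
    """Start a new chunk wherever the neighbour distance exceeds the pct-th percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["Apply online.", "Submit transcripts.", "Graduates join tech firms.", "Many become consultants."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]  # toy 2-D "embeddings"
semantic_chunks = split_by_percentile(sentences, vectors, pct=60)
print(semantic_chunks)
```

A lower percentile produces more breakpoints and therefore smaller chunks; a higher one merges more sentences into each chunk.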


In [5]:
# Sparse retrieval with the rank_bm25 library
# --- Build the rank_bm25 index ---
from rank_bm25 import BM25Okapi
import re # for simple tokenization

print("\n--- Building BM25 Index (using rank_bm25) ---")

# BM25 needs the corpus as a list of tokenized documents.
# A simple regex tokenizer is used here; swap in NLTK or spaCy if you need something more robust
def simple_tokenizer(text):
    return re.findall(r'\w+', text.lower())

# Tokenize every chunk
tokenized_corpus = [simple_tokenizer(doc.page_content) for doc in chunks]

# Build the BM25 index
bm25_index = BM25Okapi(tokenized_corpus)

# Keep the chunk objects in the same order so BM25 scores can be mapped back to them
bm25_doc_list = chunks

print("BM25 Index built successfully.")


--- Building BM25 Index (using rank_bm25) ---
BM25 Index built successfully.
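
For readers unfamiliar with what the index computes, here is a small pure-Python sketch of Okapi BM25 scoring. It uses a common smoothed IDF variant with assumed defaults `k1=1.5` and `b=0.75`; `rank_bm25`'s own IDF handling differs in detail, so treat this as an illustration of the formula rather than a reimplementation of the library.

```python
import math
import re
from collections import Counter


def simple_tokenizer(text):
    return re.findall(r"\w+", text.lower())


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25 formula."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Number of documents containing each term
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))
    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len  # penalizes long documents
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "Applicants must submit GRE or GMAT scores and transcripts.",
    "Graduates pursue careers in consulting and software engineering.",
    "The handbook lists required courses.",
]
tokenized = [simple_tokenizer(text) for text in corpus]
scores = bm25_scores(simple_tokenizer("GRE scores"), tokenized)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Because BM25 rewards exact term overlap, it complements the dense retriever on queries containing rare tokens such as "GRE" or course codes.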


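Manually delete the old vector store folder
Do this when you have:
	•	changed the text-cleaning logic;
	•	changed the chunking strategy;
	•	switched to a different embedding model; or
	•	decided to rebuild the index from a clean state.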
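
The conditions listed above can also be detected automatically by storing a fingerprint of the build settings next to the index. The sketch below is one possible approach. The marker file name `build_config.sha256` and the keys in `config` are hypothetical and not part of this project.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_fingerprint(config: dict) -> str:
    """Stable hash of the settings that determine the index contents."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_rebuild(persist_dir: Path, config: dict) -> bool:
    """True when no fingerprint is stored or the stored one differs from the current config."""
    marker = persist_dir / "build_config.sha256"  # hypothetical marker file name
    return not marker.exists() or marker.read_text().strip() != config_fingerprint(config)


def record_build(persist_dir: Path, config: dict) -> None:
    """Write the fingerprint after a successful (re)build."""
    persist_dir.mkdir(parents=True, exist_ok=True)
    (persist_dir / "build_config.sha256").write_text(config_fingerprint(config))


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "cmu_mism_chroma"
    config = {"embedding_model": "all-MiniLM-L6-v2", "splitter": "semantic-percentile", "cleaning_version": 1}
    before = needs_rebuild(store, config)      # nothing recorded yet
    record_build(store, config)
    unchanged = needs_rebuild(store, config)   # same settings
    changed = needs_rebuild(store, {**config, "cleaning_version": 2})
    print(before, unchanged, changed)
```

In this notebook the marker would be written only after the vector store has been created, since step 8 treats any non-empty directory as an existing store.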

In [6]:
import shutil
import os

CHROMA_PERSIST_DIR = "../vectorstore/cmu_mism_chroma"

if os.path.exists(CHROMA_PERSIST_DIR):
    print("Deleting existing vectorstore...")
    shutil.rmtree(CHROMA_PERSIST_DIR)

Deleting existing vectorstore...


In [7]:
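# 7. Text embedding
print("\n--- Creating embeddings ---")
# Embeddings come from a Sentence-Transformers model;
# "all-MiniLM-L6-v2" is lightweight yet performs well.
# The `embeddings` object itself was already created in step 6; this cell only reports the configuration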

print(f"Using embedding model: {embeddings_model_name} on device: {device}")


--- Creating embeddings ---
Using embedding model: all-MiniLM-L6-v2 on device: mps


In [8]:

# 8. Vector storage
print(f"\n--- Storing embeddings in ChromaDB at {CHROMA_PERSIST_DIR} ---")
# Load the vector database if it already exists; otherwise build it from scratch
if os.path.exists(CHROMA_PERSIST_DIR) and len(os.listdir(CHROMA_PERSIST_DIR)) > 0:
    print("ChromaDB already exists. Loading existing vectorstore.")
    vectordb = Chroma(persist_directory=CHROMA_PERSIST_DIR, embedding_function=embeddings)
else:
    print("ChromaDB not found or empty. Creating new vectorstore.")

    # ✅ Important: confirm that each Document in chunks carries metadata (including category)
    sample_meta = chunks[0].metadata if chunks else {}
    print(f"Sample document metadata: {sample_meta}")

    vectordb = Chroma.from_documents(
        documents=chunks,  # the metadata-tagged chunks produced earlier
        embedding=embeddings,
        persist_directory=CHROMA_PERSIST_DIR
    )
    vectordb.persist()  # emits a deprecation warning on recent Chroma versions, which persist automatically
    print("Vectorstore created and persisted.")

print("\n✅ Knowledge base construction complete.")



--- Storing embeddings in ChromaDB at ../vectorstore/cmu_mism_chroma ---
ChromaDB not found or empty. Creating new vectorstore.
Sample document metadata: {'source': 'websites_pages/career_outcome.html', 'category': 'career'}
Vectorstore created and persisted.

✅ Knowledge base construction complete.




In [9]:
# 9. Create the retriever
# It fetches the relevant chunks from the vector database
retriever = vectordb.as_retriever(search_kwargs={"k": 10}) # k=10: return the 10 most relevant chunks
print(f"Retriever created, set to retrieve top {retriever.search_kwargs['k']} chunks.")

Retriever created, set to retrieve top 10 chunks.


Test the finished knowledge base

In [10]:
# Example query 1: general information retrieval
query1 = "What are the admission requirements for the MISM program?"
print(f"\n--- Testing Retriever with Query: '{query1}' ---")
docs_with_scores1 = vectordb.similarity_search_with_score(query1, k=5)
print(f"Retrieved {len(docs_with_scores1)} documents.")
for i, (doc, score) in enumerate(docs_with_scores1):
    print(f"\n--- Document {i+1} ---")
    print(f"Score: {score:.4f}")  # distance score (lower usually means more relevant)
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
    print(f"Content (first 50 chars):\n{doc.page_content[:50]}...")


--- Testing Retriever with Query: 'What are the admission requirements for the MISM program?' ---
Retrieved 5 documents.

--- Document 1 ---
Score: 0.5880
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
3 . Master
of
Information
Systems
Management
(MISM...

--- Document 2 ---
Score: 0.6349
Source: text_info/admission.txt
Content (first 50 chars):
Apply now to the MISM program
Review the informati...

--- Document 3 ---
Score: 0.6968
Source: websites_pages/apply.html
Content (first 50 chars):
Upon admission, you will receive more details abou...

--- Document 4 ---
Score: 0.7246
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
Required
Courses
The
following
table
lists
all
the...

--- Document 5 ---
Score: 0.7730
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
1 . Introduction
This
handbook
provides
speciﬁc
in...


In [11]:
# Example query 2: a more specific information need
query2 = "What kind of careers do MISM graduates pursue?"
print(f"\n--- Testing Retriever with Query: '{query2}' ---")
docs_with_scores2 = vectordb.similarity_search_with_score(query2, k=10)
print(f"Retrieved {len(docs_with_scores2)} documents.")
for i, (doc, score) in enumerate(docs_with_scores2):
    print(f"\n--- Document {i+1} ---")
    print(f"Score: {score:.4f}")  # lower means more relevant
    print(f"Source: {doc.metadata.get('source', 'N/A')}")
    print(f"Content (first 50 chars):\n{doc.page_content[:50]}...")



--- Testing Retriever with Query: 'What kind of careers do MISM graduates pursue?' ---
Retrieved 10 documents.

--- Document 1 ---
Score: 0.8730
Source: websites_pages/businessAndTechnology.html
Content (first 50 chars):
Master's Degree in Information Systems Management
...

--- Document 2 ---
Score: 0.8997
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
Specializations
MISM
students
who
wish
to
complete...

--- Document 3 ---
Score: 0.8998
Source: text_info/mism-16.txt
Content (first 50 chars):
For Future Technology Leaders
Our World-Renowned I...

--- Document 4 ---
Score: 0.9480
Source: handbooks/mism-student-handbook.pdf
Content (first 50 chars):
c m u . e d u
Sean
Beggs
Director
of
the
MISM
Prog...

--- Document 5 ---
Score: 0.9774
Source: handbooks/heinz-college-student-handbook.pdf
Content (first 50 chars):
2024-25
Masterʼs
Student
Handbook
Degree
Programs
...

--- Document 6 ---
Score: 0.9907
Source: text_info/admission.txt
Content (first 50 chars):
Apply

In [12]:
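# Example query 3: information absent from the knowledge base (vector search still returns the k nearest chunks, so expect unrelated hits with large distances)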
query3 = "What is the capital of France?"
print(f"\n--- Testing Retriever with Query: '{query3}' ---")
docs_with_scores3 = vectordb.similarity_search_with_score(query3, k=10)
print(f"Retrieved {len(docs_with_scores3)} documents.")
if not docs_with_scores3:
    print("No relevant documents found for this query (expected).")
else:
    for i, (doc, score) in enumerate(docs_with_scores3):
        print(f"\n--- Document {i+1} ---")
        print(f"Score: {score:.4f}")
        print(f"Source: {doc.metadata.get('source', 'N/A')}")
        print(f"Content (first 50 chars):\n{doc.page_content[:50]}...")


--- Testing Retriever with Query: 'What is the capital of France?' ---
Retrieved 10 documents.

--- Document 1 ---
Score: 1.5666
Source: handbooks/heinz-college-student-handbook.pdf
Content (first 50 chars):
2. Degrees
Oﬀered
2 ....

--- Document 2 ---
Score: 1.7842
Source: websites_pages/16m_pathway.html
Content (first 50 chars):
Compare...

--- Document 3 ---
Score: 1.7852
Source: handbooks/heinz-college-student-handbook.pdf
Content (first 50 chars):
8 ....

--- Document 4 ---
Score: 1.7901
Source: handbooks/heinz-college-student-handbook.pdf
Content (first 50 chars):
1 1 ....

--- Document 5 ---
Score: 1.7981
Source: handbooks/heinz-college-student-handbook.pdf
Content (first 50 chars):
e d u / s t u d e n t - a ﬀ a i r s / t h e w o r ...

--- Document 6 ---
Score: 1.8459
Source: handbooks/heinz-college-student-handbook.pdf
Content (first 50 chars):
9 ....

--- Document 7 ---
Score: 1.8502
Source: handbooks/heinz-college-student-handbook.pdf
Content (first 50 chars):
These
should


## Stage 4
    •	Obtain the API key
    •	Instantiate the language model
    •	Build the RAG question-answering chain
    •	Create the question classifier


In [13]:
# Import the LLM module
from langchain_openai import ChatOpenAI # when using OpenAI
# from langchain_google_genai import ChatGoogleGenerativeAI # when using Google Gemini
# from langchain_community.llms import Ollama # when using Ollama

# Import the QA chain
from langchain.chains import RetrievalQA

# Import the prompt template class (for a custom prompt)
from langchain_core.prompts import ChatPromptTemplate

# --- Import the load_key function ---
import sys
if '..' not in sys.path:
    sys.path.append('..')
from utils.load_key import load_key
# -----------------------------


# --- Core logic of the QA system ---

# 1. Configure and instantiate the LLM
print("\n--- Configuring and Instantiating LLM ---")

# Read the API key from the environment first (if the user set it manually);
# otherwise load_key reads it from keys.json or prompts the user
openai_api_key_val = os.environ.get("OPENAI_API_KEY")
if not openai_api_key_val:
    print("OPENAI_API_KEY not found in environment variables.")
    openai_api_key_val = load_key("OPENAI_API_KEY") # fetch the key via load_key

# Instantiate the OpenAI LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, api_key=openai_api_key_val) # temperature=0 favors deterministic answers

# --- To use the Google Gemini API instead, uncomment this block and comment out the OpenAI part above ---
# google_api_key_val = os.environ.get("GOOGLE_API_KEY")
# if not google_api_key_val:
#     print("GOOGLE_API_KEY not found in environment variables.")
#     google_api_key_val = load_key("GOOGLE_API_KEY")
# from langchain_google_genai import ChatGoogleGenerativeAI # imported here to avoid an unnecessary import
# llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0, google_api_key=google_api_key_val)
# --------------------------------------------------------------------------------------

# --- To use Ollama instead, uncomment this block and comment out the API part above ---
# from langchain_community.llms import Ollama # imported here to avoid an unnecessary import
# try:
#     # Check that the Ollama service is running
#     ollama_llm_test = Ollama(model="llama2")
#     ollama_llm_test.invoke("Hi") # try a simple call
#     llm = ollama_llm_test
#     print("Ollama (llama2) LLM connected.")
# except Exception as e:
#     print(f"Error connecting to Ollama: {e}. Please ensure Ollama service is running and model 'llama2' is downloaded.")
#     print("Falling back to a different LLM if available or re-prompting for API key.")
#     # If the Ollama connection fails, fall back to an API LLM here or exit
#     raise # or replace with fallback logic
# --------------------------------------------------------------------------------------


print(f"LLM instantiated: {llm.__class__.__name__} with model: {llm.model_name if hasattr(llm, 'model_name') else llm.model}")


--- Configuring and Instantiating LLM ---
OPENAI_API_KEY not found in environment variables.
Successfully loaded OPENAI_API_KEY from keys.json.
LLM instantiated: ChatOpenAI with model: gpt-3.5-turbo
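
`load_key` lives in `utils/load_key.py` and is not shown. Based on the comment in the cell (environment first, then `keys.json`, then a prompt), a helper with that lookup order could look like the sketch below. The function name `load_key_sketch`, its parameters, and the demo key are illustrative assumptions, not the project's code.

```python
import json
import os
import tempfile
from getpass import getpass
from pathlib import Path


def load_key_sketch(name: str, keys_path: Path, allow_prompt: bool = False) -> str:
    """Look up a secret in the environment, then in a JSON file, then (optionally) prompt."""
    value = os.environ.get(name)
    if value:
        return value
    if keys_path.exists():
        value = json.loads(keys_path.read_text()).get(name)
        if value:
            print(f"Successfully loaded {name} from {keys_path.name}.")
            return value
    if allow_prompt:
        return getpass(f"Enter {name}: ")  # input is hidden while typing
    raise KeyError(f"{name} not found in environment or {keys_path}")


with tempfile.TemporaryDirectory() as tmp:
    keys_file = Path(tmp) / "keys.json"
    keys_file.write_text(json.dumps({"DEMO_API_KEY": "sk-demo-not-a-real-key"}))
    os.environ.pop("DEMO_API_KEY", None)  # make sure the file branch is exercised
    loaded = load_key_sketch("DEMO_API_KEY", keys_file)
```

Whatever the real implementation does, `keys.json` should stay out of version control (for example via `.gitignore`).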


Build the QA chain

In [14]:

# 2. Build the RAG question-answering chain
print("\n--- Building Retrieval-Augmented Generation (RAG) Chain ---")
# Import the required LangChain components
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Custom prompt template
template = """You are an assistant that answers questions only based on provided context about the CMU MISM program.
When the question is about application requirements, look specifically for information in admissions pages or files and **provide a comprehensive and detailed list of all relevant requirements**.
When the question is about career paths, refer to student handbooks or career services sections and **describe them thoroughly**.
Do not infer from general knowledge if the answer is not in the documents.

{context}

Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = ChatPromptTemplate.from_template(template)



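print("\n--- Building the flexible QA chain (LLM inference part) ---")
# A more flexible chain that takes 'context' and 'question' as inputs;
# 'context' is assembled dynamically during the retrieval step and passed in
flexible_qa_chain = (
    QA_CHAIN_PROMPT
    | llm
    | StrOutputParser() # StrOutputParser makes sure the output is a plain string
)
print("Flexible QA chain built successfully.")


--- Building Retrieval-Augmented Generation (RAG) Chain ---

--- Building the flexible QA chain (LLM inference part) ---
Flexible QA chain built successfully.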


Create the user query classifier

In [15]:
# Import the query classifier
from utils.text_processing.query_classifier import classify_query, CATEGORIES

Successfully loaded OPENAI_API_KEY from keys.json.
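
The implementation of `classify_query` is not shown; the output above indicates only that importing it loads the OpenAI key, which suggests an LLM-based classifier. For offline testing, a keyword-based stand-in such as the one below can play the same role. The keyword sets are hypothetical; only the `admission`, `career`, and `unknown` labels appear in this notebook.

```python
import re

# Hypothetical keyword lists, written for this example only.
KEYWORDS = {
    "admission": {"admission", "apply", "application", "gre", "gmat", "toefl", "requirements"},
    "career": {"career", "careers", "job", "jobs", "salary", "internship", "employers"},
}


def classify_query_sketch(question: str) -> str:
    """Return the category with the most keyword hits, or "unknown" when nothing matches."""
    tokens = set(re.findall(r"\w+", question.lower()))
    hits = {category: len(tokens & words) for category, words in KEYWORDS.items()}
    best_category, best_hits = max(hits.items(), key=lambda item: item[1])
    return best_category if best_hits > 0 else "unknown"


labels = [
    classify_query_sketch("What are the admission requirements for the MISM program?"),
    classify_query_sketch("What kind of careers do MISM graduates pursue?"),
    classify_query_sketch("What is the capital of France?"),
]
print(labels)
```

A keyword matcher is brittle compared with an LLM classifier, but it is deterministic and free, which makes it convenient for unit tests of the retrieval pipeline.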


## Stage 5
    Debug the QA system

In [18]:
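# --- Test the QA system ---

# Shared helper that runs one question end to end and prints the result, to avoid repeating code
# Note: flexible_qa_chain, bm25_index, and bm25_doc_list are passed in as arguments
def run_and_print_qa(question: str, vectordb, llm, flexible_qa_chain, bm25_index, bm25_doc_list, reranker_model):
    print(f"\n📌 Question: {question}")

    # --- Query classification ---
    print(f"🔍 Classifying query: '{question}'")
    # classify_query must already be imported, e.g. from utils.text_processing.query_classifier import classify_query
    predicted_category = classify_query(question)
    print(f"✅ Query classified as: '{predicted_category}'")

    # --- Build the metadata filter from the predicted category ---
    if predicted_category != "unknown":
        filter_condition = {"category": predicted_category}
        print(f"✨ Restricting retrieval to category: '{predicted_category}'")
    else:
        filter_condition = None  # no filter; an empty dict may be rejected by some Chroma versions
        print("❗ Query category is 'unknown'; retrieving from all documents (no category filter).")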

    # --- Hybrid retrieval ---
    # 1. Dense (vector) retrieval with the category filter applied
    print("⏳ Running dense (vector) retrieval...")
    # k can be tuned as needed; 10 is used here
    vector_results = vectordb.similarity_search(
        query=question,
        k=10,
        filter=filter_condition # apply the category filter
    )
    print(f"✅ Vector retrieval done, {len(vector_results)} documents returned.")

    # 2. Sparse (keyword) retrieval with rank_bm25
    # Note: unlike the vector search, BM25 scores the whole corpus, with no category filter
    print("⏳ Running sparse (keyword) retrieval (BM25)...")
    # Tokenize the query; simple_tokenizer must already be defined
    tokenized_query = simple_tokenizer(question)

    bm25_results = []

    # Only run BM25 when the tokenized query is non-empty
    if tokenized_query:
        doc_scores = bm25_index.get_scores(tokenized_query)

        top_n_bm25 = 10 # number of documents to take from sparse retrieval; tune as needed

        # Pair each score with its original index, sort by score (descending),
        # and map the top indices back to the chunk objects
        ranked = sorted(enumerate(doc_scores), key=lambda x: x[1], reverse=True)
        bm25_results = [bm25_doc_list[idx] for idx, _ in ranked[:top_n_bm25]]
    else:
        print("Warning: the tokenized query is empty; skipping BM25 retrieval.")

    print(f"✅ BM25 retrieval done, {len(bm25_results)} documents returned.")

    # 3. Merge and deduplicate
    # Key a dict by page_content so each chunk text appears only once
    unique_docs_map = {}
    for doc in vector_results:
        unique_docs_map[doc.page_content] = doc
    for doc in bm25_results:
        # Add a BM25 hit only if the vector results did not already contain it
        if doc.page_content not in unique_docs_map:
            unique_docs_map[doc.page_content] = doc

    retrieved_docs = list(unique_docs_map.values())
    print(f"✅ Hybrid retrieval (after deduplication) returned {len(retrieved_docs)} documents.")
    
    # --- Re-rank the context ---
    if retrieved_docs:
        print("⏳ Re-ranking the retrieved documents...")
        # CrossEncoder input: (query, document_content) pairs
        pairs = [(question, doc.page_content) for doc in retrieved_docs]

        # predict returns a NumPy array with one relevance score per pair
        rerank_scores = reranker_model.predict(pairs)

        # Pair each document with its score and sort by score (x[1]), descending
        ranked_docs_with_scores = sorted(zip(retrieved_docs, rerank_scores), key=lambda x: x[1], reverse=True)

        # Keep the sorted documents and drop the scores
        retrieved_docs = [doc for doc, score in ranked_docs_with_scores]

        print(f"✅ Re-ranking done. Documents after re-ranking: {len(retrieved_docs)}")
    else:
        print("❗ No documents retrieved; skipping re-ranking.")



    # --- Prepare the context for the LLM ---
    context_strings = []
    source_documents_for_display = [] # used to print the source document info
    for doc in retrieved_docs:
        context_strings.append(doc.page_content)
        source_documents_for_display.append(doc)

    context_str = "\n\n".join(context_strings)

    if not context_str.strip():
        print("Warning: could not extract usable context from the retrieved documents.")
        response_result = "Sorry, I could not find relevant information in my knowledge base to answer your question."
    else:
        # --- Call the flexible LLM chain ---
        response_result = flexible_qa_chain.invoke({
            "context": context_str,
            "question": question
        })

    # --- Print the answer ---
    print(f"\n💬 Answer:\n{response_result}")

    # --- Print the source documents ---
    print("\n📚 Source Documents:")
    if not source_documents_for_display:
        print("  No source documents.")
    for i, doc in enumerate(source_documents_for_display):
        source = doc.metadata.get('source', 'N/A')
        page = doc.metadata.get('page', 'N/A')
        category = doc.metadata.get('category', 'unknown')

        print(f"\n--- Document {i+1} ---")
        print(f"📄 Source: {source}")
        print(f"📂 Category: {category}")
        print(f"📎 Page: {page}")
        print(f"📑 Content Preview:\n{doc.page_content[:100]}...")
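
Two details of the function above are worth a second look: the category filter is applied to the vector search but not to the BM25 hits, and every re-ranked document is passed to the LLM. The sketch below shows one way to make both steps stricter. `Doc`, `merge_hybrid`, and `top_n_by_score` are illustrative names written for this example, and the scores are made-up stand-ins for cross-encoder outputs.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_hybrid(vector_results, bm25_results, category=None):
    """Union of both lists, deduplicated by text; BM25 hits outside `category` are dropped."""
    merged = {doc.page_content: doc for doc in vector_results}
    for doc in bm25_results:
        if category is not None and doc.metadata.get("category") != category:
            continue  # keep sparse results consistent with the dense filter
        merged.setdefault(doc.page_content, doc)
    return list(merged.values())


def top_n_by_score(docs, scores, n=5):
    """Sort docs by descending score and keep the first n."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:n]]


vector_hits = [Doc("GRE or GMAT scores are required.", {"category": "admission"})]
bm25_hits = [
    Doc("GRE or GMAT scores are required.", {"category": "admission"}),  # duplicate
    Doc("Transcripts must be official.", {"category": "admission"}),
    Doc("Graduates work in consulting.", {"category": "career"}),        # wrong category
]
candidates = merge_hybrid(vector_hits, bm25_hits, category="admission")
final_docs = top_n_by_score(candidates, scores=[0.2, 0.9], n=1)  # made-up reranker scores
print([d.page_content for d in final_docs])
```

Capping the context after re-ranking keeps the prompt short and reduces the chance that low-scoring chunks distract the model; whether filtering BM25 hits helps depends on how reliable the classifier is.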


In [19]:

# --- Run the tests (passing flexible_qa_chain) ---
print("\n=== 🧪 Testing RAG Assistant ===")

# Question 1: admission requirements for the MISM program
run_and_print_qa("What are the admission requirements for the CMU MISM program?", vectordb, llm, flexible_qa_chain, bm25_index, bm25_doc_list, reranker_model)
# Question 2: career paths of MISM graduates
run_and_print_qa("What kind of careers do MISM graduates pursue?", vectordb, llm, flexible_qa_chain, bm25_index, bm25_doc_list, reranker_model)
# Question 3: a question outside the knowledge base (expect irrelevant retrievals and no grounded answer)
run_and_print_qa("What is the capital of France?", vectordb, llm, flexible_qa_chain, bm25_index, bm25_doc_list, reranker_model)



=== 🧪 Testing RAG Assistant ===

📌 Question: What are the admission requirements for the CMU MISM program?
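🔍 Classifying query: 'What are the admission requirements for the CMU MISM program?'
✅ Query classified as: 'admission'
✨ Restricting retrieval to category: 'admission'
⏳ Running dense (vector) retrieval...
✅ Vector retrieval done, 10 documents returned.
⏳ Running sparse (keyword) retrieval (BM25)...
✅ BM25 retrieval done, 10 documents returned.
✅ Hybrid retrieval (after deduplication) returned 13 documents.
⏳ Re-ranking the retrieved documents...
✅ Re-ranking done. Documents after re-ranking: 13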

💬 Answer:
The admission requirements for the CMU MISM program are as follows:

1. Online Application Form: Submit a completed application form via the online application system.

2. Transcripts: Official transcripts of undergraduate studies must be submitted directly to Carnegie Mellon University’s Heinz College. Course-by-course transcript evaluation reports are required for applicants who completed their Bachelor’s degree outside the United States.

3. GRE/GMAT Scores: Applicants are required to submit GRE/GMAT scores. Waivers are available for applicants who demonstrate evidence of prior quantitative cou