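# Knowledge Question Answering with RAG

@Author tang rui

Data source: arXiv

Task:

1. Take a user question as input
2. Search the vector store for the paper abstracts that match it

Outline:

1. Initialization
2. Wrapping vector database search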

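## Initialization

1. LLM: the Qwen2.5-14B model

2. Embedding
>`sentence-transformers/all-MiniLM-L12-v2` is a popular pretrained model from the `sentence-transformers` project. It converts sentence-level text into dense vectors that capture the meaning of the sentence, and the vectors are small enough (384 dimensions, as the `len(encoded)` check later in this notebook shows) to keep compute and storage costs reasonable.
>
>The model performs well on text-similarity tasks such as text clustering, semantic search, and question-answer matching, where its vectors are used to find semantically similar text quickly.

3. Data: download the arXiv dataset (only needed if you build your own local, offline vector database)

4. Vector database: Milvus
>Milvus is an open-source database management system built for large volumes of vector data. It stores, indexes, and retrieves high-dimensional vectors efficiently, and supports applications based on vector similarity, such as semantic search in natural language processing and image-feature matching in image recognition. Unlike a traditional relational database, it focuses on vectors as the data type and optimizes storage and query performance for them.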

In [94]:
# Initialize the LLM
from langchain.chat_models import ChatOpenAI
import os
llm_model = "Qwen2.5-14B"
os.environ["OPENAI_API_KEY"] = "None"
os.environ["OPENAI_API_BASE"] = "http://10.58.0.2:8000/v1"
# llm_completion = OpenAI(model_name="Qwen2.5-14B") 
# llm_chat = OpenAIChat(model_name="Qwen2.5-14B")
llm = ChatOpenAI(temperature=0, model=llm_model)


In [95]:
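# Initialize the embedding model
# !pip install sentence_transformers
# Downloading the model from the Hugging Face Hub on every run requires a proxy here, so load a local offline copy instead: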
from langchain.embeddings import HuggingFaceEmbeddings
# embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")
embedding = HuggingFaceEmbeddings(model_name="./all-MiniLM-L12-v2")

In [96]:
# Download the dataset (only needed if you build your own vector database)
# import kagglehub
# # Download latest version
# path = kagglehub.dataset_download("Cornell-University/arxiv")
# print("Path to dataset files:", path)

In [97]:
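# Initialize the vector database
# Use the provided hosted vector database rather than a local in-memory one
# !pip install protobuf
# !pip install pymilvus
# At query time the input text is first converted to a vector by this embedding function, then similar vectors are looked up in the database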

# from langchain.vectorstores import Milvus
# db = Milvus(embedding_function=embedding,collection_name='arXiv',connection_args={"host": "10.58.0.2", "port": "19530"})

from pymilvus import connections
connections.connect(
  host='10.58.0.2',
  port='19530'
)

from pymilvus import utility
print(utility.has_collection("arXiv"))
from langchain.vectorstores import Milvus
db = Milvus(embedding_function=embedding,collection_name='arXiv',connection_args={"host": "10.58.0.2", "port": "19530"})


True


## Wrapping Vector Database Search

Query the vector database for related documents

In [98]:
# Embed the question manually
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("./all-MiniLM-L12-v2")
sentences = "什么是大语言模型？"
encoded = model.encode(sentences)
len(encoded)
# similarities = model.similarity(embeddings, embeddings)
# print(similarities.shape)
# [4, 4]

384

In [99]:
from pymilvus import MilvusClient

client = MilvusClient(
    uri="http://10.58.0.2:19530"
)

res = client.search(
    collection_name="arXiv",
    anns_field="vector",
    data=[encoded],
    limit=3,
    search_params={"metric_type": "L2", "params": {"nprobe": 10}}  # search parameters, not field names
)

for hits in res:
    for hit in hits:
        print(hit)
res = client.get(
    collection_name="arXiv",
    ids=[630919, 1319478, 1944931],
    output_fields=["id", "text"]
)
print(res)

{'id': 630919, 'distance': 1.40438711643219, 'entity': {}}
{'id': 1319478, 'distance': 1.4082231521606445, 'entity': {}}
{'id': 1944931, 'distance': 1.4109312295913696, 'entity': {}}
data: ["{'id': 630919, 'text': '  We study an equation $Qu=g$, where $Q$ is a continuous quadratic operator\\nacting from one normed space to another normed space. Obviously, if $u$ is a\\nsolution of such equation then $-u$ is also a solution. We find conditions\\nimplying that there are no other solutions and apply them to the study of the\\nDirichlet boundary value problem for the partial differential equation $u\\\\Delta\\nu =g$.\\n'}", '{\'id\': 1319478, \'text\': "  An important theme is how to maximize the cooperation of employees when\\ndealing with crisis measures taken by the company. Therefore, to find out what\\nkind of employees have cooperated with the company\'s measures in the current\\ncorona (COVID-19) crisis, and what effect the cooperation has had to these\\nemployees/companies to get h

In [100]:
client.list_collections()

['hello_milvus', 'arxiv_abstracts', 'arXiv', 'LangChainCollection', 'arxiv']

In [101]:
from pymilvus import Collection
collection = Collection("arXiv")      # Get an existing collection.
collection.load()
search_params = {
    "metric_type": "L2", 
    "offset": 5, 
    "ignore_growing": False, 
    "params": {"nprobe": 10}
}
results = collection.search(
    data=[encoded],                 # query vector(s)
    anns_field="vector",            # vector field to search; it holds the embeddings of the paper abstracts
    # the sum of `offset` in `param` and `limit` 
    # should be less than 16384.
    param=search_params,            
    limit=10,                       # maximum number of results to return
    expr=None,                      # optional scalar filter expression
    # set the names of the fields you want to 
    # retrieve from the search result.
    output_fields=['title','authors','text','categories'],        # fields to return
    consistency_level="Strong"      # consistency level for the search
)

# get the IDs of all returned hits
results[0].ids

# get the distances to the query vector from all returned hits
results[0].distances

# get the value of an output field specified in the search request.
hit = results[0][0]
hit.entity.get('title')

'Radial biharmonic $k-$Hessian equations: The critical dimension'
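
The search calls above rank stored vectors by their L2 distance to the query vector and return a window of the ranking controlled by `offset` and `limit`. The following is a minimal pure-Python sketch of that idea, assuming an exhaustive scan over a toy set of 2-dimensional vectors. Milvus itself uses an index for approximate search, and the distance values it reports may follow a different convention, so `l2_distance`, `brute_force_search`, and `toy_vectors` are illustrative names only and not part of pymilvus.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, vectors, limit=3, offset=0):
    """Return (id, distance) pairs sorted by increasing distance.

    `vectors` maps an id to its vector. The first `offset` hits are
    skipped, then at most `limit` hits are returned.
    """
    scored = [(vec_id, l2_distance(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[offset:offset + limit]

# Toy data standing in for the stored abstract embeddings (hypothetical values)
toy_vectors = {
    1: [0.0, 0.0],
    2: [1.0, 0.0],
    3: [0.0, 2.0],
    4: [3.0, 4.0],
}

# Skip the single nearest hit, then return the next two
hits = brute_force_search([0.0, 0.0], toy_vectors, limit=2, offset=1)
print(hits)
```

With `offset=5` and `limit=10`, as in the cell above, the same windowing would skip the five nearest hits and return the ten that follow.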

In [102]:
import json
def db_search(question,limits=3):
    return db.similarity_search(question,k=limits)
def display(results,indent=4):
    dict_result = [doc.to_dict() if hasattr(doc, 'to_dict') else vars(doc) for doc in results]
    print(json.dumps(dict_result, indent=indent))
display(db_search("什么是大语言模型？",10))


[
    {
        "id": null,
        "metadata": {
            "id": 630919,
            "access_id": "1506.02474",
            "authors": "Victor Alexandrov",
            "title": "On the number of solutions of a quadratic equation in a normed space",
            "comments": "6 pages",
            "journal_ref": "Journal of Natural Science of Heilongjiang University, 33, no. 1\n  (2016), 1-5. [ISSN: 1001-7011]",
            "doi": "10.13482/j.issn1001-7011.2015.12.299",
            "categories": "math.FA math.AP"
        },
        "page_content": "  We study an equation $Qu=g$, where $Q$ is a continuous quadratic operator\nacting from one normed space to another normed space. Obviously, if $u$ is a\nsolution of such equation then $-u$ is also a solution. We find conditions\nimplying that there are no other solutions and apply them to the study of the\nDirichlet boundary value problem for the partial differential equation $u\\Delta\nu =g$.\n",
        "type": "Document"
    },
    {
  

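## QA Application

Approach

1. Build a preprocessing chain made of two chains: the first translates the question, the second expands and optimizes it and states which field it belongs to
2. Search the vector database with the optimized question and pick the best-matching abstract
3. Build a router chain that classifies the question into one of 6 fields and selects the most suitable chain, which answers the question using the retrieved material
4. Have the LLM tidy up the question and answer, use an output parser to obtain a structured question and answer, then pass both to the LLM to get a true/false verdict on whether the answer is off-topic
5. If the answer is off-topic, start again from step 1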
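
The five steps can be sketched as one control loop. This is only an outline under stated assumptions: the four callables stand in for the chains built in the cells that follow, and the `max_attempts` cap is an addition for safety that the steps above do not mention. `answer_with_retry` and every parameter name are hypothetical.

```python
from typing import Callable, Tuple

def answer_with_retry(
    question: str,
    preprocess: Callable[[str], str],
    retrieve: Callable[[str], Tuple[str, str]],
    answer: Callable[[str, str], str],
    is_off_topic: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> dict:
    """Run steps 1-4 and repeat them (step 5) while the answer is off-topic."""
    result = {}
    for attempt in range(1, max_attempts + 1):
        optimized = preprocess(question)          # step 1: translate and expand
        source, abstract = retrieve(optimized)    # step 2: vector search
        reply = answer(abstract, optimized)       # step 3: route and answer
        result = {
            "question": optimized,
            "source": source,
            "answer": reply,
            "attempts": attempt,
        }
        if not is_off_topic(reply, question):     # step 4: relevance check
            break
    return result

# Stub components so the loop can run without an LLM or a database
calls = {"n": 0}

def flaky_check(reply, question):
    """Report the first answer as off-topic and accept the second one."""
    calls["n"] += 1
    return calls["n"] < 2

demo = answer_with_retry(
    "什么是大模型？",
    preprocess=lambda q: "What is a large language model?",
    retrieve=lambda q: ("title: demo paper", "demo abstract"),
    answer=lambda abstract, q: f"Answer to '{q}' based on: {abstract}",
    is_off_topic=flaky_check,
)
print(demo["attempts"])
```

Passing the steps in as callables keeps the retry logic separate from the LangChain objects, so it can be exercised with stubs like the ones shown here.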

In [103]:
# 1. Preprocessing chain
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain.chains import SimpleSequentialChain

# prompt template 1
user_question_prompt = ChatPromptTemplate.from_template(
    "This is the user's Chinese question. Please translate it into an English question.:"
    "\n\n{Question}"
)
chain_translate = LLMChain(llm=llm, prompt=user_question_prompt)


# prompt template 2
optimize_prompt = ChatPromptTemplate.from_template(
    "Optimize and expand the user's questions, and specify which field the questions belong to. \
    Question:{question}"
)
# chain 2
chain_optimize = LLMChain(llm=llm, prompt=optimize_prompt)
preproceed = SimpleSequentialChain(chains=[chain_translate, chain_optimize],verbose=True)
optimized_question = preproceed.run("什么是大模型？")


[chain/start] [chain:SimpleSequentialChain] Entering Chain run with input:
{
  "input": "什么是大模型？"
}
[chain/start] [chain:SimpleSequentialChain > chain:LLMChain] Entering Chain run with input:
{
  "Question": "什么是大模型？"
}
[llm/start] [chain:SimpleSequentialChain > chain:LLMChain > llm:ChatOpenAI] Entering LLM run with input:
{
  "prompts": [
    "Human: This is the user's Chinese question. Please translate it into an English question.:\n\n什么是大模型？"
  ]
}
[llm/end] [chain:SimpleSequentialChain > chain:LLMChain > llm:ChatOpenAI] [320ms] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "What is a large model?",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        },
        "type": "ChatGeneration",
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
    

In [104]:
# 2. Query the vector database and package the result
def search_for_doc(question):
    # Search with the question passed in rather than a global variable
    doc = db_search(question,1)
    if not doc:
        # Return an empty pair so the caller can always unpack two values
        return "",""
    access_id = doc[0].metadata['access_id']
    authors = doc[0].metadata['authors']
    title = doc[0].metadata['title']
    source = f'authors:{authors}\ntitle:{title}\nwebsite:https://arxiv.org/abs/{access_id}\n'
    return source,doc[0].page_content
source,abstract_doc=search_for_doc(optimized_question)
print(source)

authors:Conor Houghton, Nina Kazanina, Priyanka Sukumaran
title:Beyond the limitations of any imaginable mechanism: large language
  models and psycholinguistics
website:https://arxiv.org/abs/2303.00077
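
The answering step appends the retrieved abstract to a prompt template. arXiv abstracts often contain LaTeX, and literal braces in the appended text are read as template variables by format-style templates. The sketch below illustrates the problem and one way around it using plain `str.format`, on the assumption that the prompt template follows the same brace rules. `escape_braces` is a hypothetical helper and the abstract text is made up for the example.

```python
def escape_braces(text: str) -> str:
    """Double every brace so str.format treats it as a literal character."""
    return text.replace("{", "{{").replace("}", "}}")

# Made-up abstract text containing LaTeX braces
abstract = r"We bound the norm $\|u\|_{L^2}$ of the solution."

template_unsafe = "Question: {input}\nPaper Abstract: " + abstract
template_safe = "Question: {input}\nPaper Abstract: " + escape_braces(abstract)

# The unescaped template fails because the LaTeX braces look like a variable
try:
    template_unsafe.format(input="What is bounded?")
    unsafe_error = None
except (KeyError, IndexError, ValueError) as exc:
    unsafe_error = type(exc).__name__

# The escaped template renders, and the abstract comes out unchanged
rendered = template_safe.format(input="What is bounded?")
print(unsafe_error)
print(rendered)
```

Escaping only the retrieved text leaves the `{input}` placeholder intact, so the question is still substituted as intended.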



In [None]:
from langchain.chains.router import MultiPromptChain
from langchain.chains.router.llm_router import LLMRouterChain,RouterOutputParser
from langchain.prompts import PromptTemplate
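# Classify the question by field: each field has its own template, combined with the retrieved abstract
def choose_field_answer_based_doc(abstract,question):
    # Escape literal braces (common in LaTeX-heavy abstracts) so the template parser does not read them as variables
    abstract = abstract.replace("{", "{{").replace("}", "}}")
    baseTemplate = """And must answer the questions based on a paper abstract provided to you.
    Questions:{input}
    Paper Abstract:
    """+abstract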

    NLP_template = """
    Please provide a detailed and comprehensive explanation of [question topic, e.g., large language models, their scaling laws, or instruction tuning]. Include its fundamental concepts, working mechanisms, typical applications, and any recent advancements or challenges in the field of natural language processing. If relevant, compare it with other similar techniques and discuss its future prospects.
    """+baseTemplate
    SE_template = """
    Describe [question topic, like formal software engineering, code review goals, or software engineering adaptation across fields] in-depth. Explain the key principles, methodologies, and best practices. Elaborate on how it fits into the overall software development life cycle, its importance for ensuring software quality and maintainability, and any industry standards or trends related to it.
    """+baseTemplate
    DL_template = """
    Analyze the impact of [question topic, such as duplicate data on In-context Learning] from multiple perspectives. Discuss the technical implications on learning algorithms, data processing pipelines, and model performance. Mention any strategies to mitigate negative effects and leverage positive aspects, if applicable, along with relevant case studies or research findings in the realm of data and machine learning
    """+baseTemplate
    BC_template= """
    Give a thorough account of how [question topic, e.g., blockchain ensures security] by detailing the underlying cryptographic techniques, consensus mechanisms, and network architectures involved. Explain the security threats it aims to counter and how it compares to traditional security models. Provide real-world examples of blockchain implementations with enhanced security features
    """+baseTemplate
    QC_template = """
    Elucidate the principle of [question topic, like ion-trap computers] in the context of quantum computing. Describe the quantum mechanical phenomena utilized, the hardware components and their functions, and the computational advantages it offers over classical computing. Discuss the current state-of-the-art research and potential future breakthroughs in this area.
    """+baseTemplate
    MSP_template = """
    Explain what [question topic, i.e., artificial atoms] are, covering their physical properties, synthetic methods, and potential applications. Compare them with natural atoms in terms of structure, behavior, and functionality. Cite relevant scientific literature or experimental results to support your explanation and discuss any emerging research directions
    """+baseTemplate
    prompt_infos = [
        {
            "name": "NLP", 
            "description": "Good for answering questions about Natural Language Processing and machine learning.", 
            "prompt_template": NLP_template
        },
        {
            "name": "SE", 
            "description": "Good for answering Software Engineering Field questions", 
            "prompt_template": SE_template
        },
        {
            "name": "DL", 
            "description": "Good for answering Data and Learning Field questions", 
            "prompt_template": DL_template
        },
        {
            "name": "BC", 
            "description": "Good for answering Blockchain Field questions", 
            "prompt_template": BC_template
        },
        {
            "name": "QC", 
            "description": "Good for answering Quantum Computing Field questions", 
            "prompt_template": QC_template
        },
        {
            "name": "MSP", 
            "description": "Good for answering Materials Science or Physics Field questions", 
            "prompt_template": MSP_template
        }
    ]

    # Build every destination chain the router can choose from
    destination_chains = {}
    for p_info in prompt_infos:
        # Read the prompt template for this field
        name = p_info["name"]
        prompt_template = p_info["prompt_template"]
        prompt = ChatPromptTemplate.from_template(template=prompt_template)
        chain = LLMChain(llm=llm, prompt=prompt)
        destination_chains[name] = chain  
        
    # Give the router each chain's name and description so it can decide which name fits the current question best
    destinations = [f"{p['name']}: {p['description']}" for p in prompt_infos]
    destinations_str = "\n".join(destinations)
    print(destinations_str)
    # Fall back to the default chain when the router finds no suitable destination
    default_prompt = ChatPromptTemplate.from_template("{input}")
    default_chain = LLMChain(llm=llm, prompt=default_prompt)


    # The router needs two inputs: the destination chains with their descriptions, and the current task

    MULTI_PROMPT_ROUTER_TEMPLATE = """Given a raw text input and relating document to a \
    language model select the model prompt best suited for the input. \
    You will be given the names of the available prompts and a \
    description of what the prompt is best suited for. \
    You may also revise the original input if you think that revising\
    it will ultimately lead to a better response from the language model.

    << FORMATTING >>
    Return a markdown code snippet with a JSON object formatted to look like:
    ```json
    {{{{
        "destination": string \ name of the prompt to use or "DEFAULT"
        "next_inputs": string \ a potentially modified version of the original input
    }}}}
    ```

    REMEMBER: "destination" MUST be one of the candidate prompt \
    names specified below OR it can be "DEFAULT" if the input is not\
    well suited for any of the candidate prompts.
    REMEMBER: "next_inputs" can just be the original input \
    if you don't think any modifications are needed.

    << CANDIDATE PROMPTS >>
    {destinations}

    << INPUT >>
    {{input}}

    << OUTPUT (remember to include the ```json)>>"""

    router_template = MULTI_PROMPT_ROUTER_TEMPLATE.format(
        destinations=destinations_str,
    )
    router_prompt = PromptTemplate(
        template=router_template,
        input_variables=["input"],
        output_parser=RouterOutputParser(),
    )

    router_chain = LLMRouterChain.from_llm(llm, router_prompt)

    chain = MultiPromptChain(router_chain=router_chain, 
                            destination_chains=destination_chains, 
                            default_chain=default_chain,
                            verbose=True
                            )

    answer = chain.run(question)
    return answer

final_answer = choose_field_answer_based_doc(abstract_doc,optimized_question)

NLP: Good for answering questions about Natural Language Processing and machine learning.
SE: Good for answering Software Engineering Field questions
DL: Good for answering Data and Learning Field questions
BC: Good for answering Blockchain Field questions
QC: Good for answering Quantum Computing Field questions
MSP: Good for answering Materials Science or Physics Field questions
[chain/start] [chain:MultiPromptChain] Entering Chain run with input:
{
  "input": "**Field:** Artificial Intelligence / Machine Learning\n\n**Optimized and Expanded Question:** What is a large language model, and how does it differ from smaller models in terms of its architecture, training process, and performance capabilities? Additionally, could you provide some examples of large models and their applications in various industries?"
}
[chain/start] [chain:MultiPromptChain > chain:LLMRouterChain] Entering Chain run with input:
{
  "input": "**Field:** Artific

[llm/end] [chain:MultiPromptChain > chain:LLMRouterChain > chain:LLMChain > llm:ChatOpenAI] [1.81s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "```json\n{\n    \"destination\": \"NLP\",\n    \"next_inputs\": \"What is a large language model, and how does it differ from smaller models in terms of its architecture, training process, and performance capabilities? Additionally, could you provide some examples of large models and their applications in various industries?\"\n}\n```",
        "generation_info": {
          "finish_reason": "stop",
          "logprobs": null
        },
        "type": "ChatGeneration",
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
            "schema",
            "messages",
            "AIMessage"
          ],
          "kwargs": {
            "content": "```json\n{\n    \"destination\": \"NLP\",\n    \"next_inputs\": \"What is 

In [107]:
# Use an output parser to structure the LLM output and return it as a dict
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

def format_result(final_answer):
       # Define the response schema: a description and the allowed values for each field
       field_schema = ResponseSchema(name="question field",
                                   description="What field is this about? \
                                          choose from the following fields:\
                                          Natural Language Processing Field\
                                          or Software Engineering Field\
                                          or Data and Learning Field\
                                          or Blockchain Field\
                                          or Quantum Computing Field\
                                          or Materials Science or Physics Field")
       answer_schema = ResponseSchema(name="final answer",
                                          description="the final answer of the question.")
       question_schema = ResponseSchema(name="question",
                                          description="what is the quesiton of the text.")
       response_schemas = [question_schema,field_schema, answer_schema]

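       # Define the output parser
       output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
       # Get the format instructions to embed in the prompt
       format_instructions = output_parser.get_format_instructions()
       print(format_instructions)
       # Prompt that includes the parser's format instructions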
       review_template_2 = """\
       For the following text, extract the following information:

       question field: What field is this about? \
              choose from the following fields:\
              Natural Language Processing Field\
              or Software Engineering Field\
              or Data and Learning Field\
              or Blockchain Field\
              or Quantum Computing Field\
              or Materials Science or Physics Field

       question: what is the question of the text

       final answer: Extract the final answer to the question from the text, and give a simplified version of it as the final answer.

       text: {final_answer}

       {format_instructions}
       """

       prompt = ChatPromptTemplate.from_template(template=review_template_2)
       messages = prompt.format_messages(final_answer=final_answer,format_instructions=format_instructions)
       # The prompt now carries the format instructions; call the model
       response = llm(messages)
       return output_parser.parse(response.content) 



result_dict = format_result(final_answer)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"question": string  // what is the quesiton of the text.
	"question field": string  // What field is this about?                                           choose from the following fields:                                          Natural Language Processing Field                                          or Software Engineering Field                                          or Data and Learning Field                                          or Blockchain Field                                          or Quantum Computing Field                                          or Materials Science or Physics Field
	"final answer": string  // the final answer of the question.
}
```
[32;1m[1;3m[llm/start][0m [1m[llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human:        For the following text, extract the following informat

In [111]:
print(result_dict)

{'question': 'What is an overview of large language models (LLMs)?', 'question field': 'Natural Language Processing Field', 'final answer': 'Large language models (LLMs) are advanced AI systems that understand and generate human-like text. They are characterized by their large size, extensive training on diverse datasets, and ability to perform various NLP tasks with high accuracy. LLMs use transformer architecture and undergo pre-training and fine-tuning. They excel in contextual understanding, generative capabilities, and transfer learning. Examples include GPT-3, BERT, and T5. Applications span healthcare, finance, education, and customer service.'}
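
The format instructions printed above ask the model to reply with a fenced JSON block, which the parser then turns into the dict shown here. The following is a rough stand-alone sketch of that extraction idea, not LangChain's actual implementation. `parse_fenced_json`, `FENCE`, and the sample reply are hypothetical.

```python
import json
import re

# Three backticks, built at runtime so this example's own fence stays intact
FENCE = "`" * 3

def parse_fenced_json(text: str, expected_keys: list) -> dict:
    """Extract the first fenced JSON block from `text` and check its keys."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("no fenced JSON block found")
    data = json.loads(match.group(1))
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

# A made-up model reply in the requested shape
reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"question": "What is an LLM?", '
    + '"question field": "Natural Language Processing Field", '
    + '"final answer": "A large neural language model."}\n'
    + FENCE
)

parsed = parse_fenced_json(reply, ["question", "question field", "final answer"])
print(parsed["question field"])
```

Checking for the expected keys up front turns a malformed model reply into an explicit error instead of a `KeyError` later in the pipeline.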


In [112]:
optimized_question

'**Field:** Artificial Intelligence / Machine Learning\n\n**Optimized and Expanded Question:** What is a large language model, and how does it differ from smaller models in terms of its architecture, training process, and performance capabilities? Additionally, could you provide some examples of large models and their applications in various industries?'

In [113]:
def check_is_bad_answer(answer,question):

    answer_schema = ResponseSchema(name="answer",description="\
    The original question is as follows: xx \
    The original answer is as follows: xx \
    The streamlined answer is as follows: xx \
    The optimized question is as follows: xx \
    Whether the optimized question and answer deviate too much from the original question. Answer True if yes,\
                             False if not or unknown.")
   
    response_schemas = [answer_schema]

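    # Define the output parser
    output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
    # Get the format instructions to embed in the prompt
    format_instructions = output_parser.get_format_instructions()
    print(format_instructions)
    # Prompt that includes the parser's format instructions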
    review_template_2 = """\
    For the following text, extract the following information:

    answer: Whether the optimized question and answer deviate too much from the original question. Answer True if yes,\
                             False if not or unknown.


    original question: {question}

    answer:{answer}
    
    {format_instructions}
    """

    prompt = ChatPromptTemplate.from_template(template=review_template_2)
    messages = prompt.format_messages(answer=answer,question=question,format_instructions=format_instructions)
    # The prompt now carries the format instructions; call the model
    response = llm(messages)
    return output_parser.parse(response.content) 

is_bad = check_is_bad_answer(final_answer,optimized_question)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"answer": string  //     The original question is as follows: xx     The original answer is as follows: xx     The streamlined answer is as follows: xx     The optimized question is as follows: xx     Whether the optimized question and answer deviate too much from the original question. Answer True if yes,                             False if not or unknown.
}
```
[32;1m[1;3m[llm/start][0m [1m[llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human:     For the following text, extract the following information:\n\n    answer: Whether the optimized question and answer deviate too much from the original question. Answer True if yes,                             False if not or unknown.\n\n\n    original question: **Field:** Artificial Intelligence / Machine Learning\n\n**Optimized and Expanded Question:** What is a lar

In [115]:
print(is_bad['answer']=='False') # 'False' means the answer was judged to be on topic

True
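
The check above compares the parsed value with the exact string `'False'`, which depends on the model's capitalization and spacing. A slightly more tolerant reading of the flag can be sketched as follows. `parse_flag` is a hypothetical helper, and treating anything that is not clearly true as `False` is an assumption that mirrors the prompt's "False if not or unknown" rule.

```python
def parse_flag(value) -> bool:
    """Interpret an LLM-produced flag such as 'True', ' false ' or True.

    Anything that is not clearly true is treated as False.
    """
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "yes"}

# Same shape as the parser output used in the cell above
is_bad_example = {"answer": "False"}
print(parse_flag(is_bad_example["answer"]))
```

With this helper, the retry decision reads `if parse_flag(is_bad['answer']): ...` instead of relying on string equality.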


## Wrapper Class

Wrap all of the methods above in a single class

In [160]:
from langchain.chat_models import ChatOpenAI
import os
from langchain.embeddings import HuggingFaceEmbeddings
import json
from langchain.vectorstores import Milvus
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain.chains import SimpleSequentialChain
from langchain.chains.router import MultiPromptChain
from langchain.chains.router.llm_router import LLMRouterChain,RouterOutputParser
from langchain.prompts import PromptTemplate
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

class ArXivQA:
    def __init__(self,language='Chinese',embedding=None,llm_model = "Qwen2.5-14B",key="None",api_base="http://10.58.0.2:8000/v1",collection_name='arXiv'
                 ,milvus_host="10.58.0.2",port="19530"):
        self.language = language
        os.environ["OPENAI_API_KEY"] = key
        os.environ["OPENAI_API_BASE"] = api_base
        self.answer = None
        self.optimized_question = None
        self.original_question=None
        self.llm = ChatOpenAI(temperature=0, model=llm_model)
        if embedding is None:
            embedding = HuggingFaceEmbeddings(model_name="./all-MiniLM-L12-v2")
        self.db = Milvus(embedding_function=embedding,collection_name=collection_name,connection_args={"host": milvus_host, "port": port})
        
    def db_search(self,question,limits=3):
        return self.db.similarity_search(question,k=limits)
    
    def display(self,results,indent=4):
        dict_result = [doc.to_dict() if hasattr(doc, 'to_dict') else vars(doc) for doc in results]
        print(json.dumps(dict_result, indent=indent))

    def search_for_doc(self):
        doc = self.db_search(self.optimized_question,1)
        if not doc:
            # No hit: store empty values so later steps can still read these attributes
            self.source = ""
            self.abstract_doc = ""
            return
        access_id = doc[0].metadata['access_id']
        authors = doc[0].metadata['authors']
        title = doc[0].metadata['title']
        source = f'authors:{authors}\ntitle:{title}\nwebsite:https://arxiv.org/abs/{access_id}\n'
        self.source=source
        self.abstract_doc = doc[0].page_content

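    def choose_field_answer_based_doc(self):
        # Escape literal braces (common in LaTeX-heavy abstracts) so the template parser does not read them as variables
        escaped_abstract = self.abstract_doc.replace("{", "{{").replace("}", "}}")
        baseTemplate = """And must answer the questions based on a paper abstract provided to you.
        Questions:{input}
        Paper Abstract:
        """+escaped_abstract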

        NLP_template = """
        Please provide a detailed and comprehensive explanation of [question topic, e.g., large language models, their scaling laws, or instruction tuning]. Include its fundamental concepts, working mechanisms, typical applications, and any recent advancements or challenges in the field of natural language processing. If relevant, compare it with other similar techniques and discuss its future prospects.
        """+baseTemplate
        SE_template = """
        Describe [question topic, like formal software engineering, code review goals, or software engineering adaptation across fields] in-depth. Explain the key principles, methodologies, and best practices. Elaborate on how it fits into the overall software development life cycle, its importance for ensuring software quality and maintainability, and any industry standards or trends related to it.
        """+baseTemplate
        DL_template = """
        Analyze the impact of [question topic, such as duplicate data on In-context Learning] from multiple perspectives. Discuss the technical implications on learning algorithms, data processing pipelines, and model performance. Mention any strategies to mitigate negative effects and leverage positive aspects, if applicable, along with relevant case studies or research findings in the realm of data and machine learning
        """+baseTemplate
        BC_template= """
        Give a thorough account of how [question topic, e.g., blockchain ensures security] by detailing the underlying cryptographic techniques, consensus mechanisms, and network architectures involved. Explain the security threats it aims to counter and how it compares to traditional security models. Provide real-world examples of blockchain implementations with enhanced security features
        """+baseTemplate
        QC_template = """
        Elucidate the principle of [question topic, like ion-trap computers] in the context of quantum computing. Describe the quantum mechanical phenomena utilized, the hardware components and their functions, and the computational advantages it offers over classical computing. Discuss the current state-of-the-art research and potential future breakthroughs in this area.
        """+baseTemplate
        MSP_template = """
        Explain what [question topic, i.e., artificial atoms] are, covering their physical properties, synthetic methods, and potential applications. Compare them with natural atoms in terms of structure, behavior, and functionality. Cite relevant scientific literature or experimental results to support your explanation and discuss any emerging research directions
        """+baseTemplate
        prompt_infos = [
            {
                "name": "NLP", 
                "description": "Good for answering questions about Natural Language Processing and machine learning.", 
                "prompt_template": NLP_template
            },
            {
                "name": "SE", 
                "description": "Good for answering Software Engineering Field questions", 
                "prompt_template": SE_template
            },
            {
                "name": "DL", 
                "description": "Good for answering Data and Learning Field questions", 
                "prompt_template": DL_template
            },
            {
                "name": "BC", 
                "description": "Good for answering Blockchain Field questions", 
                "prompt_template": BC_template
            },
            {
                "name": "QC", 
                "description": "Good for answering Quantum Computing Field questions", 
                "prompt_template": QC_template
            },
            {
                "name": "MSP", 
                "description": "Good for answering Materials Science or Physics Field questions", 
                "prompt_template": MSP_template
            }
        ]

        # 准备所有子链给路由链决定
        destination_chains = {}
        for p_info in prompt_infos:
            # 读取出对应的prompt模板
            name = p_info["name"]
            prompt_template = p_info["prompt_template"]
            prompt = ChatPromptTemplate.from_template(template=prompt_template)
            chain = LLMChain(llm=llm, prompt=prompt)
            destination_chains[name] = chain  
            
        # 给定名字和子链的描述，让路由链决定哪个name更合适回答当前的问题
        destinations = [f"{p['name']}: {p['description']}" for p in prompt_infos]
        destinations_str = "\n".join(destinations)
        print(destinations_str)
        # Fall back to the default chain when the router finds no suitable destination
        default_prompt = ChatPromptTemplate.from_template("{input}")
        default_chain = LLMChain(llm=llm, prompt=default_prompt)


        # The router prompt needs two things: the destination names with descriptions, and the current input

        MULTI_PROMPT_ROUTER_TEMPLATE = """Given a raw text input and its related document to a \
        language model, select the model prompt best suited for the input. \
        You will be given the names of the available prompts and a \
        description of what the prompt is best suited for. \
        You may also revise the original input if you think that revising\
        it will ultimately lead to a better response from the language model.

        << FORMATTING >>
        Return a markdown code snippet with a JSON object formatted to look like:
        ```json
        {{{{
            "destination": string \\ name of the prompt to use or "DEFAULT"
            "next_inputs": string \\ a potentially modified version of the original input
        }}}}
        ```

        REMEMBER: "destination" MUST be one of the candidate prompt \
        names specified below OR it can be "DEFAULT" if the input is not\
        well suited for any of the candidate prompts.
        REMEMBER: "next_inputs" can just be the original input \
        if you don't think any modifications are needed.

        << CANDIDATE PROMPTS >>
        {destinations}

        << INPUT >>
        {{input}}

        << OUTPUT (remember to include the ```json)>>"""

        router_template = MULTI_PROMPT_ROUTER_TEMPLATE.format(
            destinations=destinations_str,
        )
        router_prompt = PromptTemplate(
            template=router_template,
            input_variables=["input"],
            output_parser=RouterOutputParser(),
        )

        router_chain = LLMRouterChain.from_llm(llm, router_prompt)

        chain = MultiPromptChain(router_chain=router_chain, 
                                destination_chains=destination_chains, 
                                default_chain=default_chain,
                                verbose=True
                                )

        self.answer = chain.run(self.optimized_question)

    def format_result(self):
        # Define a schema for each field of the reply: its name and a description of the expected value
        field_schema = ResponseSchema(name="question field",
                                    description="What field is this about? \
                                            choose from the following fields:\
                                            Natural Language Processing Field\
                                            or Software Engineering Field\
                                            or Data and Learning Field\
                                            or Blockchain Field\
                                            or Quantum Computing Field\
                                            or Materials Science or Physics Field")
        answer_schema = ResponseSchema(name="final answer",
                                            description="the final answer of the question.")
        question_schema = ResponseSchema(name="question",
                                            description="what is the question of the text.")
        response_schemas = [question_schema,field_schema, answer_schema]

        # Build the output parser
        output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
        # Get the format instructions to embed in the prompt
        format_instructions = output_parser.get_format_instructions()
        print(format_instructions)
        # Prompt that uses the parser's format instructions
        review_template_2 = """\
        For the following text, extract the following information:

        question field: What field is this about? \
                choose from the following fields:\
                Natural Language Processing Field\
                or Software Engineering Field\
                or Data and Learning Field\
                or Blockchain Field\
                or Quantum Computing Field\
                or Materials Science or Physics Field

        question: what is the question of the text

        final answer: Extract the final answer to the question from the text, then give a simplified version of it as the final answer.

        text: {final_answer}

        {format_instructions}
        """

        prompt = ChatPromptTemplate.from_template(template=review_template_2)
        messages = prompt.format_messages(final_answer=self.answer,format_instructions=format_instructions)
        # The prompt now carries the format instructions; call the model
        response = llm(messages)
        parse = output_parser.parse(response.content) 
        self.result_dict=format_result(parse)

    def check_is_bad_answer(self):

        answer_schema = ResponseSchema(name="answer",description="\
        The original question is as follows: xx \
        The original answer is as follows: xx \
        The streamlined answer is as follows: xx \
        The optimized question is as follows: xx \
        Whether the optimized question and answer deviate too much from the original question. Answer True if yes,\
                                False if not or unknown.")
    
        response_schemas = [answer_schema]

        # Build the output parser
        output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
        # Get the format instructions to embed in the prompt
        format_instructions = output_parser.get_format_instructions()
        print(format_instructions)
        # Prompt that uses the parser's format instructions
        review_template_2 = """\
        For the following text, extract the following information:

        answer: Whether the optimized question and answer deviate too much from the original question. Answer True if yes,\
                                False if not or unknown.


        original question: {question}

        answer:{answer}
        
        {format_instructions}
        """

        prompt = ChatPromptTemplate.from_template(template=review_template_2)
        messages = prompt.format_messages(answer=self.answer,question=self.optimized_question,format_instructions=format_instructions)
        # The prompt now carries the format instructions; call the model
        response = llm(messages)
        self.is_bad = output_parser.parse(response.content) 

    # Preprocess the question: translate it, then optimize it
    def preproceed_chain(self):
        user_question_prompt = ChatPromptTemplate.from_template(
            "The following is the user's question in Chinese. Translate it into English:"
            "\n\n{Question}"
        )
        chain_translate = LLMChain(llm=llm, prompt=user_question_prompt)

        optimize_prompt = ChatPromptTemplate.from_template(
            "Optimize and expand the user's question, and specify which field it belongs to. \
            Question: {question}"
        )
        chain_optimize = LLMChain(llm=llm, prompt=optimize_prompt)
        
        self._preproceed_chain = SimpleSequentialChain(chains=[chain_translate, chain_optimize],verbose=True)
        self.optimized_question = self._preproceed_chain.run(self.original_question)

    def do_answer_question(self,question):
        self.original_question=question
        # Preprocess
        self.preproceed_chain()
        print(f"optimized-question:{self.optimized_question}\n")
        # Search for documents
        self.search_for_doc()
        print(f'abstract:{self.abstract_doc}\n source:{self.source}\n')
        # The router chain picks a field template; answer with the matching domain-specific template
        self.choose_field_answer_based_doc()
        print(f'raw answer:{self.answer}\n')
        # Format the answer: extract the question and its field, and simplify the answer
        self.format_result()
        print(f'formatted answer with extracted information as JSON:\n {self.result_dict}')
        # Check whether the answer is acceptable
        self.check_is_bad_answer()
        print(f'is bad answer:{self.is_bad}')
    
    def translateToChinese(self):
        prompt = ChatPromptTemplate.from_template(
            """Translate the English text to Chinese. The output should contain only the Chinese text:
            <<< TEXT >>>
                {answer}
            <<< TEXT END>>>
            """
        )
        chain = LLMChain(llm=llm, prompt=prompt)
        self.answer = chain.run(self.answer)
        self.optimized_question = chain.run(self.optimized_question)
        self.result_dict['final answer']=chain.run(self.result_dict['final answer'])
    
    def encapsule(self):
        res = {}
        res['original question']=self.original_question
        res['optimized question']=self.optimized_question
        res['question field']=self.result_dict['question field']
        res['answer']=self.answer
        res['simplified answer']=self.result_dict['final answer']
        res['source'] = self.source
        return res

    def answer_question(self,question):
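        # Answer the question; if the answer is unsatisfactory or drifts too far from the question, retry up to 5 times
        self.do_answer_question(question)
        count = 0
        while count < 5:
            # The parser may return the verdict as a bool or as the string 'True'
            if self.is_bad is None or str(self.is_bad['answer']) == 'True':
                print(f"Retry attempt {count + 1}")
                count += 1
                self.do_answer_question(question)
            else:
                print("The answer does not deviate from the question")
                break
        # # Finally, decide whether to translate the result back to Chinese
        # if self.language == 'Chinese':
        #     self.translateToChinese()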
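
The retry policy in `answer_question` (one initial attempt, then up to five retries while the checker flags the answer) can be looked at separately from the LLM calls. The following is a minimal sketch rather than the notebook's implementation: `verdict_is_bad`, `run_with_retries`, and the `attempt` callable are hypothetical names, and the verdict is normalized on the assumption that the structured-output parser may return either a bool or a string such as `'True'`.

```python
from typing import Any, Callable, Optional, Tuple


def verdict_is_bad(verdict: Optional[dict]) -> bool:
    """Treat a missing verdict or a true-like 'answer' field as a bad answer."""
    if verdict is None:
        return True
    return str(verdict.get("answer", "")).strip().lower() == "true"


def run_with_retries(attempt: Callable[[], Tuple[Any, Optional[dict]]],
                     max_retries: int = 5) -> Tuple[Any, int]:
    """Call attempt() once, then retry up to max_retries times while flagged bad.

    Returns the last answer and the number of retries that were used.
    """
    answer, verdict = attempt()
    retries = 0
    while retries < max_retries and verdict_is_bad(verdict):
        retries += 1
        answer, verdict = attempt()
    return answer, retries


# Stand-in for the LLM pipeline (hypothetical data): bad, missing, then good.
_verdicts = iter([{"answer": "True"}, None, {"answer": "False"}])
demo_answer, demo_retries = run_with_retries(
    lambda: ("some answer", next(_verdicts))
)
print(demo_answer, demo_retries)
```

Keeping the loop in a small function like this makes the upper bound on LLM calls explicit: at most `max_retries + 1` attempts per question.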

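The router prompt built in `choose_field_answer_based_doc` asks the model to reply with a fenced `json` snippet containing `destination` and `next_inputs`. As a rough illustration of what the parsing step has to do, here is a standalone standard-library sketch. It is not LangChain's `RouterOutputParser`; `parse_router_reply` and the sample reply are hypothetical.

```python
import json
import re
from typing import Optional, Set, Tuple

FENCE = "`" * 3  # built at runtime so this example does not nest literal fences

# Matches the first fenced json snippet in a model reply.
_JSON_BLOCK = re.compile(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def parse_router_reply(reply: str, known: Set[str]) -> Tuple[Optional[str], str]:
    """Return (destination, next_inputs); destination is None for DEFAULT or unknown names."""
    match = _JSON_BLOCK.search(reply)
    if match is None:
        raise ValueError("no fenced json snippet found in router reply")
    payload = json.loads(match.group(1))
    destination = payload["destination"]
    if destination == "DEFAULT" or destination not in known:
        destination = None  # the caller would fall back to the default chain
    return destination, payload["next_inputs"]


# Hypothetical model reply in the format the router template requests.
sample_reply = (
    "Here is my choice:\n"
    + FENCE + "json\n"
    + '{"destination": "NLP", "next_inputs": "What is a large language model?"}\n'
    + FENCE
)
names = {"NLP", "SE", "DL", "BC", "QC", "MSP"}
print(parse_router_reply(sample_reply, names))
```

Mapping an unrecognized destination to `None` mirrors the role of `default_chain` in the class: a reply that names no known field still gets answered.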

## Batch-answering questions in JSON format



In [161]:
import langchain
langchain.debug = False

qa = ArXivQA()
qa.answer_question(question="什么是大模型")




> Entering new SimpleSequentialChain chain...
What is a large model?
**Field:** Artificial Intelligence / Machine Learning

**Optimized and Expanded Question:** What is a large language model, and how does it differ from smaller models in terms of capabilities, training requirements, and performance in natural language processing tasks? Additionally, could you provide some examples of large models and their applications in real-world scenarios?

> Finished chain.
optimized-question:**Field:** Artificial Intelligence / Machine Learning

**Optimized and Expanded Question:** What is a large language model, and how does it differ from smaller models in terms of capabilities, training requirements, and performance in natural language processing tasks? Additionally, could you provide some examples of large models and their applications in real-world scenarios?

abstract:  Large language models are not detailed models of human linguistic pro

In [162]:
print(qa.encapsule())
qa.translateToChinese()
qa.optimized_question

{'original question': '什么是大模型', 'optimized question': '**Field:** Artificial Intelligence / Machine Learning\n\n**Optimized and Expanded Question:** What is a large language model, and how does it differ from smaller models in terms of capabilities, training requirements, and performance in natural language processing tasks? Additionally, could you provide some examples of large models and their applications in real-world scenarios?', 'question field': 'Natural Language Processing Field', 'answer': 'Large language models (LLMs) are sophisticated artificial intelligence systems designed to understand and generate human language. These models are characterized by their vast size, often containing billions of parameters, which allows them to capture complex patterns and nuances in language data. In contrast to smaller models, LLMs offer a broader and deeper understanding of language, enabling them to perform a wide range of natural language processing (NLP) tasks with high accuracy and ef

'领域：人工智能/机器学习\n\n优化和扩展的问题：什么是大型语言模型，它在能力、训练需求和自然语言处理任务性能方面与较小的模型有何不同？此外，能否提供一些大型模型及其在实际场景中的应用示例？'

In [164]:
import json

# Open the JSON file; json.load expects a single JSON document, e.g. an array of questions
with open('questions.jsonl', 'r', encoding='utf-8') as file:
    data = json.load(file)

for item in data:
    qa.answer_question(question=item['question'])
    qa.translateToChinese()
    item['answer'] = qa.encapsule()

with open('answer.json', 'w') as new_file:
    json.dump(data, new_file, indent=4)


print(data)



> Entering new SimpleSequentialChain chain...
What is a large language model?
**Field:** Artificial Intelligence / Machine Learning

**Optimized and Expanded Question:** What is a large language model, and how does it differ from smaller language models? Could you also explain the key technologies and techniques that enable large language models to perform natural language processing tasks effectively, and discuss some of the challenges and limitations associated with these models?

> Finished chain.
optimized-question:**Field:** Artificial Intelligence / Machine Learning

**Optimized and Expanded Question:** What is a large language model, and how does it differ from smaller language models? Could you also explain the key technologies and techniques that enable large language models to perform natural language processing tasks effectively, and discuss some of the challenges and limitations associated with these models?

abstract:  La
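
The batch cell above reads `questions.jsonl` with `json.load`, which expects the file to hold one JSON document (for example, a single array of objects). If the file were instead line-delimited JSON, as the `.jsonl` extension usually implies, `json.load` would fail at the second record. A tolerant loader might look like the sketch below; `load_questions` is a hypothetical helper, the sample strings are made up, and only the `question` key is taken from the loop above.

```python
import json
from typing import List


def load_questions(text: str) -> List[dict]:
    """Parse either a JSON array or line-delimited JSON (one object per line)."""
    try:
        data = json.loads(text)
        # A single object is wrapped so callers can always iterate.
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to JSONL: parse each non-empty line on its own.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


array_text = '[{"question": "什么是大模型"}, {"question": "What is RAG?"}]'
jsonl_text = '{"question": "什么是大模型"}\n{"question": "What is RAG?"}\n'

questions_from_array = load_questions(array_text)
questions_from_jsonl = load_questions(jsonl_text)
print(len(questions_from_array), len(questions_from_jsonl))
```

With a real file, the text would come from something like `open('questions.jsonl', encoding='utf-8').read()`.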

In [167]:
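# The first write had an encoding problem: Chinese text was stored as \uXXXX escapes
# Reopen the JSON file and rewrite it as UTF-8 without ASCII escaping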
with open('answer.json', 'r') as file:  
    data = json.load(file)

with open('answer.json', 'w',encoding='utf-8') as new_file:
    json.dump(data, new_file,ensure_ascii=False, indent=4)
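
The rewrite in this cell is needed because `json.dump` escapes non-ASCII characters by default, so the Chinese text in the first `answer.json` was stored as `\uXXXX` sequences. The data is valid JSON either way; `ensure_ascii=False` only changes how it appears on disk. A small standard-library demonstration, using a sample record rather than the real output file:

```python
import json

record = {"question": "什么是大模型"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)

# Both forms decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == record
```

When `ensure_ascii=False` is used, the file should be opened with an explicit `encoding='utf-8'`, as the cell above does, so the result does not depend on the platform's default encoding.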

