# Using Confluence Requirement Docs as an External Knowledge Base for RAG

### Load the documents
I created a dedicated Wiki space and used ConfluenceLoader to read every document in it. Images were not loaded.


In [1]:
# Load docs
import os
from langchain.document_loaders import WebBaseLoader, ConfluenceLoader, FigmaFileLoader

# Credentials are read from environment variables instead of being hard-coded.
# The variable names below are an assumption; use whatever your setup defines.
Confluence_Username = os.environ["CONFLUENCE_USERNAME"]
Confluence_API = os.environ["CONFLUENCE_API_TOKEN"]
loader = ConfluenceLoader(
    url="https://truckerpath.atlassian.net/wiki", username=Confluence_Username, api_key=Confluence_API
)
# include_attachments=False skips attachments, so images are not read
documents = loader.load(space_key="AndrewLang", include_attachments=False, limit=2)
# WebContent = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/").load()
# FigmaContent = FigmaFileLoader(
#     access_token=os.environ["FIGMA_ACCESS_TOKEN"],
#     ids='7034-12303',
#     key='XgmWvO4rVBaCIrFqYftPjw',
# ).load()



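### Split the documents
`Recursive_Character_Split` cuts the text into chunks of up to 700 characters, and adjacent chunks overlap by 50 characters so that meaning is not lost at the cut points.<br>
In theory `MarkdownHeaderTextSplitter` would work better: it splits on Markdown headings, which keeps each chunk tightly tied to a single topic. But LangChain's ConfluenceLoader cannot export the original Markdown, so this plan is shelved.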


In [2]:
# Split docs
from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter
Recursive_Character_Split = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=50)
all_splits = Recursive_Character_Split.split_documents(documents)
# Shelved: needs raw Markdown text, which ConfluenceLoader does not provide
# Markdown_Character_Split = MarkdownHeaderTextSplitter()
# all_splits = Markdown_Character_Split.split_text()
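
To make the chunk size and overlap concrete, here is a minimal pure-Python sketch of a sliding window with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to cut on separators such as paragraphs and lines); `split_with_overlap` is a hypothetical helper that only illustrates the 700/50 idea.

```python
def split_with_overlap(text: str, chunk_size: int = 700, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Made-up sample text of 1500 characters
sample = "".join(str(i % 10) for i in range(1500))
chunks = split_with_overlap(sample)
# The last 50 characters of one chunk are the first 50 of the next
print(len(chunks), chunks[0][-50:] == chunks[1][:50])
```

A larger overlap loses less context at the boundaries but stores and embeds more duplicated text.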

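### Embed the text and store the vectors
This uses temporary in-memory storage, so every run embeds all the documents again. If there are many documents or the service needs to run long-term, use a persistent vector database instead.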

In [3]:
# Store splits
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
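
One way to avoid paying for the same embeddings on every run is to cache them on disk. The sketch below is an assumption-laden illustration, not part of this notebook: `embed_with_cache` and the JSON cache file are hypothetical, and `embed_fn` stands for any function that maps a list of texts to a list of vectors (for example `OpenAIEmbeddings().embed_documents`). Chroma can also persist its own index through the `persist_directory` argument, which is probably the simpler route in practice.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def embed_with_cache(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    cache_path: str,
) -> list[list[float]]:
    """Return one vector per text, calling embed_fn only for texts not seen before."""
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Key each text by a hash of its content, so unchanged chunks hit the cache
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = {k: t for k, t in zip(keys, texts) if k not in cache}
    if missing:
        vectors = embed_fn(list(missing.values()))
        cache.update(zip(missing.keys(), vectors))
        path.write_text(json.dumps(cache), encoding="utf-8")
    return [cache[k] for k in keys]
```

On the second run only new or edited chunks reach `embed_fn`; everything else is read back from the cache file.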

### Define the RAG prompt
The prompt is pulled from LangChain Hub. Its full text is listed below; to change it, edit the contents of the triple-quoted string.

In [4]:
# RAG prompt
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")
prompt.messages[0].prompt.template = '''You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:
'''
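
To show what the two placeholders end up containing, here is a minimal sketch that fills the same template by hand. `build_prompt` is a hypothetical helper, and joining the chunks with a blank line is my assumption about how the retrieved text is combined; the chain may format it differently.

```python
RAG_TEMPLATE = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template: the user question plus all retrieved chunks as one context string."""
    context = "\n\n".join(chunks)  # assumed separator between retrieved chunks
    return RAG_TEMPLATE.format(question=question, context=context)


# Made-up chunks, only to show the shape of the final prompt
print(build_prompt("What is a Verified POI?", ["chunk one text", "chunk two text"]))
```

Everything the LLM knows about the Wiki comes through `{context}`, so the quality of the answer depends on which chunks get retrieved.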

### Define the LLM

In [5]:
# LLM
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [6]:
# RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents = True
)
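
Conceptually, the retriever ranks the stored chunks by similarity to the question and passes the top few to the prompt. The toy sketch below swaps real embeddings for word counts so that it runs without an API key; `bow`, `cosine` and `retrieve` are hypothetical names, the sample chunks are made up, and `k=4` is what I assume the default retriever returns.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = bow(query)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]


# Made-up sample chunks
chunks = [
    "verified poi has a geofence",
    "custom poi is added by the user",
    "routing options for trucks",
]
print(retrieve("what is a verified poi", chunks, k=2))
```

With real embeddings the ranking reflects meaning rather than shared words, but the flow is the same: embed the question, score every chunk, keep the top `k`.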

### Define the chat box and the returned information

In [7]:
# Quick test
question = "What does Verified POI mean? Also what is the difference between custom POI and Verified POI? "
result = qa_chain({"query": question})
print('The answer is:\n' + result['result'])

# Tell the user which knowledge-base documents the LLM's answer came from
sources = []
for doc in result['source_documents']:
    sources.append(doc.metadata['source'])
print(f'The source is:\n {set(sources)}')

The answer is:
A Verified POI refers to a point of interest or facility that is already included in the database. It has a detail page and a geofence, and any changes made to the geofence will be visible to drivers. On the other hand, a Custom POI is a user-added point of interest that is not included in the database and does not have a geofence or a detail page.
The source is:
 {'https://truckerpath.atlassian.net/wiki/spaces/AndrewLang/pages/2424340922/Customize+POI+and+last+mile'}


In [8]:
# Gradio front end
import gradio as gr
def generate(message,history=''):
    result = qa_chain({"query": message})
    answer = result['result']
    source = []
    for doc in result['source_documents']:
        source.append(doc.metadata['source'])
    source_info = "\n".join(set(source))
    full_answer = answer + "\nSources:\n" + source_info
    return full_answer
# demo = gr.Interface(
#     fn=generate,
#     inputs = gr.Textbox(lines=2, placeholder = "Ask here"),
#     outputs = "text",
# )
# Add short-term memory

demo = gr.ChatInterface(
    fn = generate,
    examples = ["Tell me the background of scan document project","What's the difference between COMMAND routing and Truckerpath routing","How to create a custom poi?"]
)

demo.launch(share=True)


Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://dcfa9eda4b8135759f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
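
One small thing about the source list: `set(...)` removes duplicates but does not keep a stable order, so the same answer can list its sources differently between runs. A minimal sketch of an order-preserving alternative, with `format_answer` as a hypothetical helper:

```python
def format_answer(answer: str, sources: list[str]) -> str:
    """Append de-duplicated source URLs to the answer, keeping first-seen order."""
    unique_sources = list(dict.fromkeys(sources))  # dict keys keep insertion order
    return answer + "\nSources:\n" + "\n".join(unique_sources)


# Made-up URLs, only to show the output shape
print(format_answer("Some answer.", ["https://example.com/a", "https://example.com/b", "https://example.com/a"]))
```

Inside `generate`, this could replace the `set(source)` join, with the first source being the top-ranked chunk's page.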




Follow-up TODO for this project:
1. Export Teams chat logs as part of the knowledge base
2. ~Build a front-end chatbox page~
3. Evaluation
4. Integrate an Agent