# Building a PDF Chatbot from Scratch with LangChain and LlamaIndex
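The steps are:
- Load the documents (PDF, HTML, plain text, databases, and so on).
- Split the data into chunks and build an embedding index over them, so that a vector search tool can run semantic search.
- For each question, search the index and the embedded data to fetch the information relevant to that question.
- Pass the question and the relevant data to the LLM. This series uses OpenAI's LLMs.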
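The steps above can be sketched end to end in plain Python. This is a toy illustration only: `embed` is a bag-of-words counter standing in for a real embedding model, `fake_llm` is a hypothetical stand-in for the OpenAI call, and none of the function names come from LangChain or LlamaIndex.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    """Step 2a: split the text into sentence-sized chunks."""
    return re.split(r"(?<=\.)\s+", text.strip())


def embed(text):
    """Step 2b: toy embedding, a bag-of-words count vector (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, index, k=1):
    """Step 3: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(range(len(chunks)), key=lambda i: cosine(q, index[i]), reverse=True)
    return [chunks[i] for i in ranked[:k]]


def fake_llm(prompt):
    """Hypothetical stand-in for the LLM call in step 4."""
    return f"(model reply based on a prompt of {len(prompt)} characters)"


# Step 1: "load" a document (a hard-coded string here instead of a PDF).
document = (
    "Retrieval-augmented generation retrieves passages before generating text. "
    "Machine translation can reuse a translation memory. "
    "Dialogue systems can retrieve past responses."
)
chunks = split_into_chunks(document)
index = [embed(c) for c in chunks]

question = "How does machine translation use translation memory?"
top_chunks = retrieve(question, chunks, index, k=1)
context = "\n\n".join(top_chunks)
reply = fake_llm(f"Context:\n{context}\n\nQuestion: {question}")
print(top_chunks)
print(reply)
```

The rest of the notebook replaces each toy piece with a library component: a PDF loader, a text splitter, an embedding model with a vector store, and a chat model.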

## 1. Preparation

Because this walkthrough uses ChatGPT, you need an OpenAI API key. Sign up at [OpenAI](https://beta.openai.com/signup/) to get one.

Set up the configuration:

In [11]:

import logging
import sys
import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables from the nearest .env file
load_dotenv(find_dotenv())

api_key = os.getenv("OPENAI_API_KEY")
api_base = os.getenv("OPENAI_API_BASE")
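`os.getenv` returns `None` without complaint when a variable is missing, so a forgotten key only surfaces later as an authentication error. A small guard like the one below fails early instead. This is a sketch: `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Hypothetical variable, set here only so the demo runs anywhere.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same check would be applied to `OPENAI_API_KEY` right after `load_dotenv`.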

In [2]:
## load the PDF using pypdf
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# load the data
loader = PyPDFLoader('/Users/huangxinzhe/论文阅读/RAG/2202.01110.pdf')

# The paper is long, so we need to split it into multiple chunks.
# This text splitter is the recommended one for generic text. It is parameterized by a list of characters.
# It tries to split on them in order until the chunks are small enough.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
data = loader.load()
texts = text_splitter.split_documents(data)

# view the first chunk
texts[0]

Document(page_content='A Survey on Retrieval-Augmented Text Generation\nHuayang Li~;\x03Yixuan Su\x7f;\x03Deng Cai};\x03Yan Wang|;\x03Lemao Liu|;\x03\n~Nara Institute of Science and Technology\x7fUniversity of Cambridge\n}The Chinese University of Hong Kong|Tencent AI Lab\nli.huayang.lh6@is.naist.jp, ys484@cam.ac.uk\nthisisjcykcd@gmail.com, brandenwang@tencent.com\nlemaoliu@gmail.com\nAbstract\nRecently, retrieval-augmented text generation\nattracted increasing attention of the compu-\ntational linguistics community. Compared\nwith conventional generation models, retrieval-\naugmented text generation has remarkable ad-\nvantages and particularly has achieved state-of-\nthe-art performance in many NLP tasks. This\npaper aims to conduct a survey about retrieval-\naugmented text generation. It ﬁrstly highlights\nthe generic paradigm of retrieval-augmented\ngeneration, and then it reviews notable ap-\nproaches according to different tasks including\ndialogue response generation, machine tr
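The "split on a list of characters, in order, until the chunks are small enough" idea can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: it ignores `chunk_overlap`, and the separator order `"\n\n"`, `"\n"`, `" "`, `""` is assumed here as a sensible default. `recursive_split` is a hypothetical name.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Pieces are merged greedily on the first separator; a piece that is still
    too long is split again with the next separator in the list.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\nccc ddd\n\neee fff ggg"
pieces = recursive_split(sample, chunk_size=7)
print(pieces)
```

Paragraph breaks are tried first, then line breaks, then spaces, so chunks tend to end at natural boundaries rather than in the middle of a word.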

In [3]:
# import Chroma and an embedding model
from langchain.vectorstores import Chroma
# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# initialize the embedding model (HuggingFaceEmbeddings here; the OpenAI variant is left commented out)
# embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')
embeddings = HuggingFaceEmbeddings()

# use Chroma to create in-memory embedding database from the doc
docsearch = Chroma.from_documents(texts, embeddings,  metadatas=[{"source": str(i)} for i in range(len(texts))])

## perform search based on the question
query = "What is the operating income?"
docs = docsearch.similarity_search(query)

Using embedded DuckDB without persistence: data will be transient


In [4]:
docs

[Document(page_content='decoding is that it does not require to change the\ntranslation model (including phrase table and pa-\nrameters) and can be applied in a plug-and-play\nway. This approach is successful when xis highly\nsimilar to xr; otherwise its performance is de-\ngraded largely, because it explicitly isolates TMmatching and SMT decoding and reuses the results\ninxror not in a deterministic way.\nPhrase Table Aggregation with TM There are\nalso notable efforts to augment the phrase table\nfor SMT by extracting translation rules from the\nretrieved bilingual sentences fhxr;yrig. Then\nthey re-tune the parameters for the SMT model\nwhich makes use of translation knowledge from\nfhxr;yrigin a implicit way when translating x.\nFor example, Biçici and Dymetman (2008); Simard\nand Isabelle (2009) directly combine the extracted\ntranslation rules into the phrase table in a shallow\ncombination way. They introduce an additional fea-\nture to indicate that whether translation rule is 

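LangChain provides four pre-built question-answering chains:

- Question answering: load_qa_chain

- Question answering with sources: load_qa_with_sources_chain

- Retrieval question answering: RetrievalQA

- Retrieval question answering with sources: RetrievalQAWithSourcesChain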

In [5]:
## importing necessary framework
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain

from langchain.chat_models import ChatOpenAI


### Question answering: load_qa_chain

In [None]:
## use the LLM to answer the question
chain = load_qa_chain(ChatOpenAI(temperature=0.2,model_name='gpt-3.5-turbo'), 
                      chain_type="stuff")
query = "What is the operating income?"
chain.run(input_documents=docs, question=query)
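With `chain_type="stuff"`, all the supplied documents are placed into a single prompt together with the question. The sketch below illustrates that idea under stated assumptions: the prompt wording, the function name `build_stuff_prompt`, and the `max_chars` budget are hypothetical and are not LangChain's actual template or limits.

```python
def build_stuff_prompt(chunks, question, max_chars=12000):
    """Put every chunk into one prompt, followed by the question.

    max_chars is a hypothetical budget standing in for the model's context
    limit; "stuffing" only works while everything fits in one request.
    """
    context = "\n\n".join(chunks)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    if len(prompt) > max_chars:
        raise ValueError(
            f"prompt has {len(prompt)} characters, over the budget of {max_chars}"
        )
    return prompt


demo_chunks = ["Chunk one about retrieval.", "Chunk two about generation."]
demo_prompt = build_stuff_prompt(demo_chunks, "What is retrieval-augmented generation?")
print(demo_prompt)
```

Because every chunk travels in one request, this approach is simple and needs a single LLM call, but it stops working once the retrieved text exceeds the model's context window.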

### Question answering with sources: load_qa_with_sources_chain

In [None]:
chain = load_qa_with_sources_chain(ChatOpenAI(temperature=0.2,model_name='gpt-3.5-turbo'), 
                                   chain_type="stuff")
query = "What is the operating income?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

### Retrieval question answering: RetrievalQA

In [None]:
qa=RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0.2,model_name='gpt-3.5-turbo'), 
                               chain_type="stuff", 
                               retriever=docsearch.as_retriever())
query = "What is the operating income?"
qa.run(query)


### Retrieval question answering with sources: RetrievalQAWithSourcesChain

In [None]:
chain=RetrievalQAWithSourcesChain.from_chain_type(ChatOpenAI(temperature=0.2,model_name='gpt-3.5-turbo'), 
                                                  chain_type="stuff", 
                                                  retriever=docsearch.as_retriever())
chain({"question": "What is the operating income?"}, return_only_outputs=True)
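Chains that report sources generally ask the model to end its reply with a line listing the sources it used. The sketch below assumes a reply that ends with `SOURCES:` followed by a comma-separated list; that format, the sample reply, and the function name `split_answer_and_sources` are assumptions for illustration, not output captured from this notebook.

```python
import re


def split_answer_and_sources(reply):
    """Separate the answer text from a trailing 'SOURCES: a, b' line, if present."""
    parts = re.split(r"SOURCES:\s*", reply, maxsplit=1)
    answer = parts[0].strip()
    sources = []
    if len(parts) == 2:
        sources = [s.strip() for s in parts[1].split(",") if s.strip()]
    return answer, sources


# Made-up reply; the labels mimic the {"source": str(i)} metadata set earlier.
sample_reply = "Retrieved translation memories can guide decoding.\nSOURCES: 3, 7"
answer, sources = split_answer_and_sources(sample_reply)
print(answer)
print(sources)
```

The source labels are only as useful as the metadata attached to each chunk, which is why the vector store was built with a `source` entry per chunk.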

LlamaIndex is a library for building and querying vector indexes, which makes fast similarity search over large datasets possible.

In [34]:
import logging
import sys

## setup your OpenAI Key
import os
# os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

# enable logs to see what happens underneath
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

### Loading the documents

In [38]:
from llama_index import GPTVectorStoreIndex
from llama_index import download_loader

# we will use UnstructuredReader to read the PDF file
UnstructuredReader = download_loader('UnstructuredReader', refresh_cache=True)
loader = UnstructuredReader()
# load the data
data = loader.load_data('/Users/huangxinzhe/论文阅读/RAG/2202.01110.pdf', split_documents=False)

JSONDecodeError: Extra data: line 1 column 4 (char 3)