<a href="https://colab.research.google.com/github/sugarforever/wtf-langchain/blob/main/10_Example/10_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Complete Example

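This is the final lesson in this minimal introduction to `LangChain`. Drawing on what the previous nine lessons covered, we build a fully functional LLM application. Built on the `LangChain` framework, it uses the contents of a `PDF` file as its knowledge base and answers user questions about that file.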

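We use `LangChain`'s QA chain together with `Chroma` to run semantic search over the PDF document. The example code uses the [AWS Serverless Developer Guide](https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf). The PDF had 84 pages when this tutorial was written; its length may differ by the time you download it (the run recorded in step 4 loads 113 pages).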

1. Install the required `Python` packages

In [1]:
!pip install -q langchain==0.1.0 openai chromadb pymupdf tiktoken


2. Set up the OpenAI environment

In [2]:
import os
# os.environ['OPENAI_API_KEY'] = ''
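
The cell above leaves the key assignment commented out, which works when `OPENAI_API_KEY` is already set in the environment. As a minimal sketch of one way to avoid hard-coding the key in the notebook, the helper below reads the variable and prompts only when it is missing. The name `ensure_openai_key` and its parameters are hypothetical, introduced here for illustration.

```python
import os
from getpass import getpass


def ensure_openai_key(env=os.environ, prompt=getpass):
    """Return the OpenAI API key, prompting only if the variable is unset or empty."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Hypothetical fallback: ask interactively so the key never lands in the notebook source.
        key = prompt("OpenAI API key: ").strip()
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `ensure_openai_key()` once before creating the embeddings or the LLM should be enough, since both read the key from the same environment variable.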

3. Download the AWS Serverless Developer Guide PDF

In [3]:
!wget https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf

PDF_NAME = 'serverless-core.pdf'

--2024-03-30 11:25:33--  https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf
Resolving docs.aws.amazon.com (docs.aws.amazon.com)... 18.154.132.72, 18.154.132.52, 18.154.132.103, ...
Connecting to docs.aws.amazon.com (docs.aws.amazon.com)|18.154.132.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4829940 (4.6M) [application/pdf]
Saving to: ‘serverless-core.pdf.2’

2024-03-30 11:25:37 (3.42 MB/s) - ‘serverless-core.pdf.2’ saved [4829940/4829940]
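
The log shows the file being saved as `serverless-core.pdf.2`, which is what `wget` does when a file with the target name already exists, while `PDF_NAME` still points at the first copy. A minimal sketch of a download step that skips the request when the file is already present, using only the standard library; the helper name `download_once` is hypothetical.

```python
import os
import urllib.request


def download_once(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Usage with the URL and file name from the cell above (left commented out
# so that the block has no side effects when it is merely defined):
# PDF_NAME = download_once(
#     "https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf",
#     "serverless-core.pdf",
# )
```

Note the trade-off: an existing file is never refreshed, so delete it first if you want the latest version of the guide.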



4. Load the PDF file

In [4]:
from langchain.document_loaders import PyMuPDFLoader
docs = PyMuPDFLoader(PDF_NAME).load()

print(f'There are {len(docs)} document(s) in {PDF_NAME}.')
print(f'There are {len(docs[0].page_content)} characters in the first page of your document.')

There are 113 document(s) in serverless-core.pdf.
There are 112 characters in the first page of your document.


5. Split the documents and store the text-embedding vectors

In [5]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

embeddings = OpenAIEmbeddings()

vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name="serverless_guide")

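
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch: a fixed-width sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function `split_with_overlap` is hypothetical and only shows how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.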


6. Create a QA chain based on OpenAI

In [7]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

llm = OpenAI(temperature=0)
chain = load_qa_chain(llm, chain_type="stuff")

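
The `"stuff"` chain type places all input documents into a single prompt and sends it to the LLM in one call. The sketch below shows the idea with plain strings. The function `build_stuff_prompt` and the prompt wording are assumptions made for illustration, not the template that LangChain ships.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block and append the question."""
    context = "\n\n".join(documents)
    # Illustrative wording only; not LangChain's actual prompt template.
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Lambda runs code without servers.", "DynamoDB stores data."],
    "What runs code without servers?",
)
print(prompt)
```

This approach is simple and needs only one model call, but it works only while the retrieved chunks together fit within the model's context window.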


7. Run a similarity search for the question

In [9]:
query = "What is the use case of AWS Serverless?"
similar_docs = vectorstore.similarity_search(query, k=3)

In [10]:
similar_docs

[Document(page_content='Serverless\nDeveloper Guide\n• Mobile applications – Suppose you have a custom mobile application that produces events. \nYou can create a Lambda function to process events published by your custom application. For \nexample, you can conﬁgure a Lambda function to process the clicks within your custom mobile \napplication.\nServices you’ll likely use:\n• AWS Lambda for compute processing tasks\n• Amazon API Gateway for connecting and scaling inbound requests\n• AWS Step Functions for managing and orchestrating microservice workﬂows\n• Amazon DynamoDB & S3 for storing and retrieving data and ﬁles\n• Amazon Cognito for authentication and authorization of users\nStreaming\nStreaming data allows you to gather analytical insights and act upon them, but also presents a \nunique set of design and architectural challenges.\nLambda and Amazon Kinesis can process real-time streaming data for application activity tracking,', metadata={'author': 'AWS', 'creationDate': 'D:202
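
Conceptually, a similarity search embeds the query, compares it with the stored chunk vectors, and returns the `k` closest chunks. The sketch below imitates this with a toy word-count "embedding" and cosine similarity. Both are stand-ins chosen for illustration: the real pipeline uses OpenAI embeddings, and the distance metric Chroma applies depends on how the collection is configured. The names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, texts, k=3):
    """Return the k texts most similar to the query, best match first."""
    q = embed(query)
    return sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]


corpus = [
    "lambda processes events from mobile applications",
    "dynamodb stores and retrieves data",
    "kinesis handles real-time streaming data",
]
print(top_k("which service stores data", corpus, k=1))
# ['dynamodb stores and retrieves data']
```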

8. Answer the question with the QA chain, using the relevant documents

In [11]:
chain.run(input_documents=similar_docs, question=query)



' The use case of AWS Serverless is to provide a platform for developers to build and deploy applications without having to manage servers or infrastructure. This allows for more efficient and cost-effective development and scaling of applications.'
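
Steps 7 and 8 can be wrapped into one function so that each new question takes a single call. This is a minimal sketch: the function name `answer_question` is hypothetical, and it relies only on the two calls already used above, `similarity_search` and `chain.run`.

```python
def answer_question(vectorstore, chain, question, k=3):
    """Retrieve the k most similar chunks, then let the QA chain answer from them."""
    similar_docs = vectorstore.similarity_search(question, k=k)
    return chain.run(input_documents=similar_docs, question=question)
```

With the objects built in the earlier cells, `answer_question(vectorstore, chain, "What is the use case of AWS Serverless?")` should reproduce the flow shown above.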