
In [None]:
%pip install langchain==0.3.7
%pip install langchain-chroma==0.1.4
%pip install langchain-community==0.3.5
%pip install langchain-core==0.3.18
%pip install langchain-huggingface==0.1.2
%pip install langchain-ollama==0.2.0
%pip install langchain-openai==0.2.8
%pip install langchain-text-splitters==0.3.2

In [1]:
!wget -P ./data https://raw.githubusercontent.com/WTFAcademy/WTF-Langchain/main/01_Hello_Langchain/README.md

--2024-11-08 11:40:34--  https://raw.githubusercontent.com/WTFAcademy/WTF-Langchain/main/01_Hello_Langchain/README.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... ^C


## Loading Documents

In [1]:
from langchain_community.document_loaders.text import TextLoader

loader = TextLoader("./data/README.md")  # wget -P ./data saved the file here
docs = loader.load()

In [2]:
docs

[Document(metadata={'source': './data/README.md'}, page_content='---\ntitle: 01. Hello Langchain\ntags:\n  - zhipuai\n  - llm\n  - langchain\n---\n\n# Langchain极简入门: 01. Hello Langchain\n\n最近在学习Langchain框架，顺手写一个“WTF Langchain极简入门”，供小白们使用（编程大佬可以另找教程）。本教程默认以下前提：\n- 使用Python版本的[Langchain](https://github.com/hwchase17/langchain)\n- LLM使用ChatZhipuAI的模型\n- Langchain目前还处于快速发展阶段，版本迭代频繁，为避免示例代码失效，本教程统一使用版本 **0.3.7**\n\n根据Langchain的[代码约定](https://github.com/hwchase17/langchain/blob/v0.0.235/pyproject.toml#L14C1-L14C24)，Python版本 ">=3.8.1,<4.0"。\n\n推特：[@verysmallwoods](https://twitter.com/verysmallwoods)\n\n所有代码和教程开源在github: [github.com/sugarforever/wtf-langchain](https://github.com/sugarforever/wtf-langchain)\n\n-----\n\n## Langchain 简介\n\n大型语言模型（LLM）正在成为一种具有变革性的技术，使开发人员能够构建以前无法实现的应用程序。然而，仅仅依靠LLM还不足以创建一个真正强大的应用程序。它还需要其他计算资源或知识来源。\n\n`Langchain` 旨在帮助开发这些类型应用程序，比如：\n- 基于文档数据的问答\n- 聊天机器人\n- 代理\n\n## ZhipuAI 简介\n\n`ZhipuAI` 是LLM生态的模型层最大的玩家之一。大家目前熟知的 *GLM-3*，*GLM-4* 等模型都是ZhipuAI的产品。它的API允许开发人员通过简单的API

## Splitting Documents

### Splitting by Character

In [4]:
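from langchain_text_splitters.character import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",      # split on blank lines (paragraph boundaries)
    chunk_size=1000,       # maximum chunk length, measured by length_function
    chunk_overlap=200,     # text shared between neighbouring chunks
    length_function=len,   # measure length in characters
)

split_docs = text_splitter.split_documents(docs)
print(len(docs[0].page_content))  # length of the original document
for split_doc in split_docs:
    print(len(split_doc.page_content))  # length of each chunk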

3738
999
977
963
999
396
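
The chunk lengths above add up to 4334 characters, more than the 3738 characters in the original document, because neighbouring chunks share some text (`chunk_overlap=200`). The sketch below illustrates the overlap idea only. It uses a fixed-width sliding window, which is a simpler technique than the separator-aware splitting `CharacterTextSplitter` performs, and `sliding_chunks` is a hypothetical helper name, not part of LangChain.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    # Stop once a window has reached the end of the text.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
print(sum(len(c) for c in chunks))  # 16, more than the 10 input characters
```

Each chunk repeats the last two characters of the previous one, so context that straddles a chunk boundary is not lost.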


### Splitting Code

In [5]:
from langchain_text_splitters.character import RecursiveCharacterTextSplitter, Language

PYTHON_CODE = """
def hello_langchain():
    print("Hello, Langchain!")

# Call the function
hello_langchain()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(metadata={}, page_content='def hello_langchain():'),
 Document(metadata={}, page_content='print("Hello, Langchain!")'),
 Document(metadata={}, page_content='# Call the function\nhello_langchain()')]

### Splitting Markdown Documents

In [7]:
from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter

markdown_document = "# Chapter 1\n\n    ## Section 1\n\nHi this is the 1st section\n\nWelcome\n\n ### Module 1 \n\n Hi this is the first module \n\n ## Section 2\n\n Hi this is the 2nd section"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
splits = splitter.split_text(markdown_document)

splits

[Document(metadata={'Header 1': 'Chapter 1', 'Header 2': 'Section 1'}, page_content='Hi this is the 1st section  \nWelcome'),
 Document(metadata={'Header 1': 'Chapter 1', 'Header 2': 'Section 1', 'Header 3': 'Module 1'}, page_content='Hi this is the first module'),
 Document(metadata={'Header 1': 'Chapter 1', 'Header 2': 'Section 2'}, page_content='Hi this is the 2nd section')]
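
Each split above carries the headers it sits under as metadata. The following pure-Python sketch shows one way such header tracking could work. It is a simplified illustration under my own assumptions, not the implementation of `MarkdownHeaderTextSplitter`, and `split_by_headers` is a hypothetical helper name.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Return (metadata, content) pairs, one per header-delimited section."""
    # Check longer prefixes first so "##" is not mistaken for "#".
    headers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    sections, current_meta, current_lines = [], {}, []

    def flush():
        content = "\n".join(current_lines).strip()
        if content:
            sections.append((dict(current_meta), content))
        current_lines.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        for prefix, name in headers:
            if line.startswith(prefix + " "):
                flush()
                # A new header replaces headers at the same or a deeper level.
                for other_prefix, other_name in headers:
                    if len(other_prefix) >= len(prefix):
                        current_meta.pop(other_name, None)
                current_meta[name] = line[len(prefix):].strip()
                break
        else:
            if line:
                current_lines.append(line)
    flush()
    return sections


doc = "# Chapter 1\n\n## Section 1\n\nIntro text\n\n### Module 1\n\nModule text\n\n## Section 2\n\nMore text"
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for metadata, content in split_by_headers(doc, headers):
    print(metadata, repr(content))
```

Note how starting `## Section 2` drops the `Header 3` entry, so the last section is tagged with `Chapter 1` and `Section 2` only.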

### Recursive Character Splitting

In [8]:
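from langchain_text_splitters.character import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,        # maximum chunk length, measured by length_function
    chunk_overlap=20,      # text shared between neighbouring chunks
    length_function=len,   # measure length in characters
)
texts = text_splitter.split_documents(docs)
print(len(docs[0].page_content))  # length of the original document
for split_doc in texts:
    print(len(split_doc.page_content))  # length of each chunk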

3738
74
36
71
86
59
99
30
15
56
17
86
22
91
70
97
97
64
65
77
57
75
98
19
35
98
56
8
79
28
3
56
65
74
63
48
81
14
8
94
97
89
81
3
58
8
79
28
3
97
87
83
80
76
87
31
66
37
91
21
95
94
13
90
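
Every chunk above stays within 100 characters because the recursive splitter tries a list of separators from coarse to fine (by default `"\n\n"`, `"\n"`, `" "`, `""`) and only falls back to a finer one when a piece is still too long. The sketch below shows that idea in plain Python. It is a simplified illustration: it ignores `chunk_overlap` and drops separators at chunk boundaries, and `recursive_split` is a hypothetical helper name rather than the library's code.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Title\n\nA short line.\nA considerably longer line that will not fit.\n\nEnd"
for chunk in recursive_split(sample, chunk_size=20):
    print(len(chunk), repr(chunk))
```

The long line is broken on spaces only after splitting on blank lines and on single newlines has failed to bring it under the limit.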


### Splitting by Token

In [None]:
!pip install -q tiktoken

In [9]:
from langchain_text_splitters.character import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)

split_docs



[Document(metadata={'source': './data/README.md'}, page_content='---\ntitle: 01. Hello Langchain\ntags:\n  - zhipuai\n  - llm\n  - langchain\n---\n\n# Langchain极简入门: 01. Hello Langchain'),
 Document(metadata={'source': './data/README.md'}, page_content='最近在学习Langchain框架，顺手写一个“WTF Langchain极简入门”，供小白们使用（编程大佬可以另找教程）。本教程默认以下前提：\n- 使用Python版本的[Langchain](https://github.com/hwchase17/langchain)\n- LLM使用ChatZhipuAI的模型\n- Langchain目前还处于快速发展阶段，版本迭代频繁，为避免示例代码失效，本教程统一使用版本 **0.3.7**'),
 Document(metadata={'source': './data/README.md'}, page_content='根据Langchain的[代码约定](https://github.com/hwchase17/langchain/blob/v0.0.235/pyproject.toml#L14C1-L14C24)，Python版本 ">=3.8.1,<4.0"。'),
 Document(metadata={'source': './data/README.md'}, page_content='推特：[@verysmallwoods](https://twitter.com/verysmallwoods)\n\n所有代码和教程开源在github: [github.com/sugarforever/wtf-langchain](https://github.com/sugarforever/wtf-langchain)\n\n-----\n\n## Langchain 简介'),
 Document(metadata={'source': './data/README.md'}, page_content='大
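
`from_tiktoken_encoder` keeps the same paragraph-based splitting but measures chunk length in tokens instead of characters. The sketch below shows how swapping the length function changes where chunks end. `count_tokens` is a hypothetical whitespace tokenizer standing in for tiktoken (real token counts will differ), and `split_by_length` is an illustrative helper, not LangChain code.

```python
def count_tokens(text):
    # Hypothetical stand-in tokenizer: one token per whitespace-separated word.
    # tiktoken uses byte-pair encoding, so real counts will differ.
    return len(text.split())


def split_by_length(text, chunk_size, length_function, separator="\n\n"):
    """Merge separator-delimited pieces while length_function stays <= chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "one two three\n\nfour five\n\nsix seven eight nine"
print(split_by_length(text, 12, len))          # limit of 12 characters
print(split_by_length(text, 5, count_tokens))  # limit of 5 "tokens"
```

With a character limit every paragraph becomes its own chunk, while the token limit lets the first two paragraphs (five words in total) share one chunk.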

## Embedding Document Chunks

In [12]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
model_name = "shibing624/text2vec-base-chinese"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

embeddings = embeddings_model.embed_documents(
    [
        "你好!",       # "Hello!"
        "Langchain!",
        "你真棒！"     # "You are great!"
    ]
)
len(embeddings[0])

768
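
Each text is mapped to a 768-dimensional vector, and texts with similar meaning are expected to get vectors that point in similar directions. A common way to compare two embeddings is cosine similarity. The sketch below computes it by hand on made-up 3-dimensional vectors; the numbers are illustrative and are not output from the model above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # same direction as v1, different length
v3 = [-3.0, 0.0, 1.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # close to 0: unrelated direction
```

Because the cell above sets `normalize_embeddings` to `False`, the vectors are not unit length, which is why the norms appear in the denominator here.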

## Vector Store

### Storing

In [None]:
!pip install -q chromadb

In [13]:
from langchain_community.document_loaders.text import TextLoader
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters.character import CharacterTextSplitter
from langchain_chroma import Chroma  # provided by the langchain-chroma package installed above

# Load the file
loader = TextLoader("./data/README.md")
docs = loader.load()

# Split the file into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(docs)

# Embed the split documents
# (Hub model ID, as in the embedding cell above; a local path to a downloaded copy also works)
model_name = "shibing624/text2vec-base-chinese"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# Add the chunks to the vector store
db = Chroma.from_documents(documents, embeddings_model)

### Retrieval

In [14]:
query = "什么是WTF Langchain？"  # "What is WTF Langchain?"
docs = db.similarity_search(query)
docs

[Document(metadata={'source': './data/README.md'}, page_content='import os\nos.environ[\'ZHIPUAI_API_KEY\'] = \'您的有效ZhipuAI API Key\'\n\nchat = ChatZhipuAI(temperature=0, model="glm-4")\nresponse = chat([ HumanMessage(content="Hello Langchain!") ])\nprint(response)\n```\n\n你应该能看到类似这样的输出：\n\n```shell\ncontent=\'Hello! How can I assist you today? If you have any questions or need information on a topic, feel free to ask.\' additional_kwargs={} response_metadata={\'token_usage\': {\'completion_tokens\': 28, \'prompt_tokens\': 9, \'total_tokens\': 37}, \'model_name\': \'glm-4\', \'finish_reason\': \'stop\'} id=\'run-e0f8ccbf-9518-4ac4-ba44-946ca817fc14-0\'\n```\n\n我们拆解程序，学习该代码的结构：\n\n1. 以下系统命令安装必要的Python包，langchain和ZhipuAI。\n\n  ```shell\n  pip install langchain==0.3.7 zhipuai==2.1.5.20230904 langchain-community==0.3.5 langchain-core==0.3.15 -q -U\n  ```\n\n2. 以下代码将ZhipuAI的API Key设置在环境变量中。默认情况下，Langchain会从环境变量 `ZhipuAI_API_KEY` 中读取API Key。注意，在代码中直接嵌入API Key明文并不安全，切勿将API Key直接提交到代码仓库。我们建议利用

In [15]:
docs = db.similarity_search_with_score(query)
docs

[(Document(metadata={'source': './data/README.md'}, page_content='import os\nos.environ[\'ZHIPUAI_API_KEY\'] = \'您的有效ZhipuAI API Key\'\n\nchat = ChatZhipuAI(temperature=0, model="glm-4")\nresponse = chat([ HumanMessage(content="Hello Langchain!") ])\nprint(response)\n```\n\n你应该能看到类似这样的输出：\n\n```shell\ncontent=\'Hello! How can I assist you today? If you have any questions or need information on a topic, feel free to ask.\' additional_kwargs={} response_metadata={\'token_usage\': {\'completion_tokens\': 28, \'prompt_tokens\': 9, \'total_tokens\': 37}, \'model_name\': \'glm-4\', \'finish_reason\': \'stop\'} id=\'run-e0f8ccbf-9518-4ac4-ba44-946ca817fc14-0\'\n```\n\n我们拆解程序，学习该代码的结构：\n\n1. 以下系统命令安装必要的Python包，langchain和ZhipuAI。\n\n  ```shell\n  pip install langchain==0.3.7 zhipuai==2.1.5.20230904 langchain-community==0.3.5 langchain-core==0.3.15 -q -U\n  ```\n\n2. 以下代码将ZhipuAI的API Key设置在环境变量中。默认情况下，Langchain会从环境变量 `ZhipuAI_API_KEY` 中读取API Key。注意，在代码中直接嵌入API Key明文并不安全，切勿将API Key直接提交到代码仓库。我们建议利
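
`similarity_search_with_score` pairs each returned document with a score. For Chroma that score is a distance, so a lower value means a closer match. The sketch below shows the underlying idea as a brute-force search over a tiny in-memory store. The vectors and document labels are made up, squared L2 distance is my assumption about the metric, and `top_k` is a hypothetical helper rather than anything Chroma exposes.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, store, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy store: (text, embedding) pairs with made-up 2-dimensional embeddings.
store = [
    ("doc about splitting", [0.9, 0.1]),
    ("doc about embeddings", [0.1, 0.9]),
    ("doc about retrieval", [0.5, 0.5]),
]

for text, distance in top_k([0.2, 0.8], store, k=2):
    print(f"{distance:.2f}  {text}")
```

A real vector store replaces the linear scan with an approximate nearest-neighbour index, but the ranking principle is the same.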