<a href="https://colab.research.google.com/github/sugarforever/wtf-langchain/blob/main/03_Data_Connections/03_Data_Connections.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q langchain==0.1.0  openai


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [4]:
!wget https://raw.githubusercontent.com/WTFAcademy/WTF-Langchain/main/01_Hello_Langchain/README.md

zsh:1: command not found: wget


## Load documents

In [5]:
from langchain.document_loaders import TextLoader

loader = TextLoader("./README.md")
docs = loader.load()

In [6]:
docs

[Document(page_content='---\ntitle: 03. 数据连接\ntags:\n  - openai\n  - llm\n  - langchain\n---\n\n# WTF Langchain极简入门: 03. 数据连接\n\n最近在学习Langchain框架，顺手写一个“WTF Langchain极简入门”，供小白们使用（编程大佬可以另找教程）。本教程默认以下前提：\n- 使用Python版本的[Langchain](https://github.com/hwchase17/langchain)\n- LLM使用OpenAI的模型\n- Langchain目前还处于快速发展阶段，版本迭代频繁，为避免示例代码失效，本教程统一使用版本 **0.1.0 **\n\n根据Langchain的[代码约定](https://github.com/hwchase17/langchain/blob/v0.1.0 /pyproject.toml#L14C1-L14C24)，Python版本 ">=3.8.1,<4.0"。\n\n推特：[@verysmallwoods](https://twitter.com/verysmallwoods)\n\n所有代码和教程开源在github: [github.com/sugarforever/wtf-langchain](https://github.com/sugarforever/wtf-langchain)\n\n-----\n\n## 什么是数据连接？\n\nLLM应用往往需要用户特定的数据，而这些数据并不属于模型的训练集。`LangChain` 的数据连接概念，通过提供以下组件，实现用户数据的加载、转换、存储和查询：\n\n- 文档加载器：从不同的数据源加载文档\n- 文档转换器：拆分文档，将文档转换为问答格式，去除冗余文档，等等\n- 文本嵌入模型：将非结构化文本转换为浮点数数组表现形式，也称为向量\n- 向量存储：存储和搜索嵌入数据（向量）\n- 检索器：提供数据查询的通用接口\n\n我们通过下一段落的实践，来介绍这些组件的使用。\n## 数据连接实践\n\n在LLM应用连接用户数据时，通常我们会以如下步骤完成：\n1. 加载文档\n2. 拆分文档\n3. 向量化文档分块\n4. 向量数据存储\n\n

## Split documents

### Split by character

In [7]:
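from langchain.text_splitter import CharacterTextSplitter

# Split on blank lines, then merge the pieces into chunks of roughly 1000
# characters, repeating up to 200 characters between neighbouring chunks.
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

split_docs = text_splitter.split_documents(docs)
print(len(docs[0].page_content))  # length of the whole document
for split_doc in split_docs:
    print(len(split_doc.page_content))  # length of each chunk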

6200
934
999
969
729
958
986
908
467
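The merge step behind `chunk_size` and `chunk_overlap` can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation; the function name `split_by_separator` is hypothetical. It splits on the separator, greedily packs pieces into a chunk, and carries trailing pieces over as overlap.

```python
def split_by_separator(text, separator="\n\n", chunk_size=1000, chunk_overlap=200):
    """Greedily merge separator-delimited pieces into chunks (simplified sketch)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap, dropping from the front until
            # they fit in chunk_overlap and leave room for the new piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n\n".join(["a" * 40, "b" * 40, "c" * 40, "d" * 40])
for chunk in split_by_separator(sample, chunk_size=100, chunk_overlap=50):
    print(len(chunk))
```

Note that a single piece longer than `chunk_size` is emitted whole in this sketch, because splitting happens only at the separator.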


### Split code

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

PYTHON_CODE = """
def hello_langchain():
    print("Hello, Langchain!")

# Call the function
hello_langchain()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='def hello_langchain():', metadata={}),
 Document(page_content='print("Hello, Langchain!")', metadata={}),
 Document(page_content='# Call the function\nhello_langchain()', metadata={})]

### Split Markdown documents

In [10]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = "# Chapter 1\n\n    ## Section 1\n\nHi this is the 1st section\n\nWelcome\n\n ### Module 1 \n\n Hi this is the first module \n\n ## Section 2\n\n Hi this is the 2nd section"
print(markdown_document)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
splits = splitter.split_text(markdown_document)

splits

# Chapter 1

    ## Section 1

Hi this is the 1st section

Welcome

 ### Module 1 

 Hi this is the first module 

 ## Section 2

 Hi this is the 2nd section


[Document(page_content='Hi this is the 1st section  \nWelcome', metadata={'Header 1': 'Chapter 1', 'Header 2': 'Section 1'}),
 Document(page_content='Hi this is the first module', metadata={'Header 1': 'Chapter 1', 'Header 2': 'Section 1', 'Header 3': 'Module 1'}),
 Document(page_content='Hi this is the 2nd section', metadata={'Header 1': 'Chapter 1', 'Header 2': 'Section 2'})]
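To make the header-based grouping concrete, here is a minimal standard-library sketch of the idea: track the headers currently in effect and attach them as metadata to the text that follows. It is an illustration under simplifying assumptions, not the library's code; `split_markdown_by_headers` is a hypothetical name, and sections are plain dicts rather than `Document` objects.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Group lines under the headers active at that point (simplified sketch)."""
    # Try the longest markers first so "###" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}    # header level -> (metadata key, header title)
    sections = []  # list of {"content": str, "metadata": dict}
    buffer = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            sections.append({"content": content, "metadata": metadata})
        buffer.clear()

    for raw_line in text.split("\n"):
        line = raw_line.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                for lvl in [k for k in active if k >= level]:
                    del active[lvl]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            if line:
                buffer.append(line)
    flush()
    return sections


doc = "# Chapter 1\n\n## Section 1\n\nIntro text\n\n## Section 2\n\nMore text"
for section in split_markdown_by_headers(doc, [("#", "Header 1"), ("##", "Header 2")]):
    print(section)
```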

### Split recursively by character

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Try coarse separators first and fall back to finer ones until each chunk
# fits within 100 characters; neighbouring chunks share up to 20 characters.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)
texts = text_splitter.split_documents(docs)
print(len(docs[0].page_content))
print(texts[0].page_content)
print(texts[1].page_content)
for split_doc in texts:
    print(len(split_doc.page_content))

6200
---
title: 03. 数据连接
tags:
  - openai
  - llm
  - langchain
---

# WTF Langchain极简入门: 03. 数据连接
最近在学习Langchain框架，顺手写一个“WTF Langchain极简入门”，供小白们使用（编程大佬可以另找教程）。本教程默认以下前提：
93
71
81
78
99
30
15
56
17
86
18
94
88
38
33
64
82
90
36
75
49
59
59
93
88
43
58
99
45
67
68
67
93
50
63
95
80
41
67
58
79
92
72
97
95
97
78
51
12
84
47
76
98
71
63
99
37
60
67
98
66
93
65
82
67
39
59
72
89
29
11
81
86
59
98
38
52
98
97
36
59
56
99
71
47
61
93
72
31
79
50
13
90
43
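The recursive strategy can be sketched without the library: split on the coarsest separator, and recurse with finer separators only on pieces that are still too long. This is a simplified illustration that ignores `chunk_overlap`; `recursive_split` is a hypothetical name, and the default separator order (`"\n\n"`, `"\n"`, `" "`, `""`) mirrors the defaults documented for `RecursiveCharacterTextSplitter`.

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator, recursing on oversized pieces (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = "alpha beta\n\n" + " ".join(["word"] * 30)
for chunk in recursive_split(sample, chunk_size=50):
    print(len(chunk), repr(chunk))
```

Because finer separators are only used when needed, short paragraphs stay intact while long ones are broken at line or word boundaries.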


### Split by token

In [17]:
!pip install -q tiktoken


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [18]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)

split_docs

Created a chunk of size 258, which is longer than the specified 100
Created a chunk of size 158, which is longer than the specified 100
Created a chunk of size 290, which is longer than the specified 100
Created a chunk of size 119, which is longer than the specified 100
Created a chunk of size 154, which is longer than the specified 100
Created a chunk of size 223, which is longer than the specified 100
Created a chunk of size 209, which is longer than the specified 100
Created a chunk of size 259, which is longer than the specified 100
Created a chunk of size 216, which is longer than the specified 100
Created a chunk of size 112, which is longer than the specified 100
Created a chunk of size 192, which is longer than the specified 100
Created a chunk of size 112, which is longer than the specified 100
Created a chunk of size 140, which is longer than the specified 100
Created a chunk of size 174, which is longer than the specified 100
Created a chunk of size 133, which is longer tha

[Document(page_content='---\ntitle: 03. 数据连接\ntags:\n  - openai\n  - llm\n  - langchain\n---\n\n# WTF Langchain极简入门: 03. 数据连接', metadata={'source': './README.md'}),
 Document(page_content='最近在学习Langchain框架，顺手写一个“WTF Langchain极简入门”，供小白们使用（编程大佬可以另找教程）。本教程默认以下前提：\n- 使用Python版本的[Langchain](https://github.com/hwchase17/langchain)\n- LLM使用OpenAI的模型\n- Langchain目前还处于快速发展阶段，版本迭代频繁，为避免示例代码失效，本教程统一使用版本 **0.1.0 **', metadata={'source': './README.md'}),
 Document(page_content='根据Langchain的[代码约定](https://github.com/hwchase17/langchain/blob/v0.1.0 /pyproject.toml#L14C1-L14C24)，Python版本 ">=3.8.1,<4.0"。', metadata={'source': './README.md'}),
 Document(page_content='推特：[@verysmallwoods](https://twitter.com/verysmallwoods)\n\n所有代码和教程开源在github: [github.com/sugarforever/wtf-langchain](https://github.com/sugarforever/wtf-langchain)\n\n-----', metadata={'source': './README.md'}),
 Document(page_content='## 什么是数据连接？', metadata={'source': './README.md'}),
 Document(page_content='LLM应用往往需要用户特定的数据，而这些数据并不属于模型的训
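The "Created a chunk of size ..." warnings above arise because `CharacterTextSplitter` cuts only at its separator: a paragraph that measures more than `chunk_size` tokens cannot be split further and is kept whole. The sketch below shows that check in plain Python. `approx_token_count` is a hypothetical stand-in for a real tokenizer such as tiktoken (it counts words, punctuation marks, and individual non-ASCII characters), so its numbers will not match the warnings.

```python
import re


def approx_token_count(text):
    """Hypothetical stand-in for a tokenizer: words, punctuation, and CJK characters."""
    return len(re.findall(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]", text))


def oversized_pieces(text, chunk_size, separator="\n\n",
                     length_function=approx_token_count):
    """Return (length, piece) for every piece that cannot fit in one chunk."""
    return [
        (length_function(piece), piece)
        for piece in text.split(separator)
        if length_function(piece) > chunk_size
    ]


sample = "short one\n\n" + "tok " * 12
for length, piece in oversized_pieces(sample, chunk_size=5):
    print(f"piece of size {length} is longer than the specified 5")
```

If every chunk must stay under the limit, one option is `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, which falls back to finer separators for oversized pieces.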

## Embed document chunks

In [None]:
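from langchain.embeddings import OpenAIEmbeddings

# The key is read from the OPENAI_API_KEY environment variable unless
# openai_api_key is passed explicitly.
embeddings_model = OpenAIEmbeddings(
    # openai_api_key=""
)
embeddings = embeddings_model.embed_documents(
    [
        "你好!",       # "Hello!"
        "Langchain!",
        "你真棒！",    # "You are awesome!"
    ]
)
embeddings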

In [34]:
!echo OPENAI_API_KEY

OPENAI_API_KEY
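The cell above echoes the literal text `OPENAI_API_KEY` because the variable name is not prefixed with `$`. A possible way to check that the key is available without printing the secret itself is sketched below; `api_key_status` is a hypothetical helper, not part of LangChain or the OpenAI SDK.

```python
import os


def api_key_status(env=os.environ, name="OPENAI_API_KEY"):
    """Report whether the variable is set, without revealing its value."""
    value = env.get(name, "")
    if not value:
        return f"{name} is not set"
    return f"{name} is set ({len(value)} characters)"


print(api_key_status())
```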


## Vector data storage

### Store

In [43]:
!pip install -q chromadb


  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Checking for Rust toolchain....
      
      Cargo, the Rust package manager, is not installed or is not on PATH.
      This package requires Rust and Cargo to compile extensions. Install it through
      the system's package manager or via https://rustup.rs/
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Reuse `docs` loaded by TextLoader above and split it into chunks without overlap.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(docs)
# Embed each chunk and store it in Chroma. The API key is read from the
# OPENAI_API_KEY environment variable unless openai_api_key is passed.
db = Chroma.from_documents(documents, OpenAIEmbeddings())

### Retrieve

In [None]:
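query = "什么是WTF Langchain？"  # "What is WTF Langchain?"
# Use a new name so the loaded `docs` from earlier cells is not overwritten.
results = db.similarity_search(query)
results

In [None]:
# Returns (Document, score) pairs instead of bare documents.
results_with_scores = db.similarity_search_with_score(query)
results_with_scores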
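Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The sketch below shows that ranking with cosine similarity over toy two-dimensional vectors. It is an illustration only: `cosine_similarity` and `top_k` are hypothetical helpers, the vectors are made up, and a real store such as Chroma uses an index and may report a distance (lower means closer) rather than a similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, store, k=2):
    """store is a list of (text, vector); return the k most similar (text, score) pairs."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: in practice the vectors would come from an embedding model.
toy_store = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 1.0])]
for text, score in top_k([1.0, 0.0], toy_store):
    print(text, round(score, 3))
```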