# [LangChain](https://python.langchain.com/en/latest/use_cases/code/code-analysis-deeplake.html)

Use `LangChain`, `Deep Lake`, and `GPT` to analyze the source code of the `LangChain` library.

**Note:** [Deep Lake](https://www.deeplake.ai/) is a vector database, similar to [Pinecone](https://www.pinecone.io/).

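The workflow is as follows:

+ Prepare the data
    - Use `TextLoader` to load every source file of the LangChain GitHub repository; each loaded file is called a `document`
    - Use `CharacterTextSplitter` to split each `document` into chunks
    - Use `OpenAIEmbeddings` to convert each chunk into an embedding vector, and store the vectors in `Deep Lake`
+ Design the QA
    - Build a `Chain` from `ChatOpenAI` and `ConversationalRetrievalChain`
    - Ask a question
    - Get an answer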
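
To make the workflow concrete, here is a minimal, standard-library-only sketch of the same idea: chunks are turned into vectors, stored, and the chunk closest to a question is retrieved. The bag-of-words `embed` function, the in-memory `store`, and the two sample sentences are illustrative assumptions; they stand in for `OpenAIEmbeddings` and `Deep Lake` and are not how those libraries work internally.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Prepare the data: documents -> chunks -> vectors kept in an in-memory "store".
documents = [
    "class LLMChain runs a prompt through a language model",
    "class TextLoader reads a text file into a document",
]
chunks = documents  # each toy document is already one small chunk
store = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Ask: embed the question and retrieve the most similar chunk(s).
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top_chunk = retrieve("which class reads a text file")[0]
print(top_chunk)
```

In the real pipeline the retrieved chunks are then handed to the chat model, which writes the answer.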

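## 1. Installation

+ Install `Python`, and install the `Python` and `Jupyter` extensions in `VS Code`;
+ Prepare an HTTP proxy and make sure VS Code can reach it;
+ Get an `OpenAI` `API Key` and store it in the `OPENAI_API_KEY` environment variable on your machine.

Install three Python libraries: `langchain`, `deeplake`, and `openai`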

``` bash
pip3 install --upgrade langchain

pip3 install --upgrade openai

pip3 install --upgrade deeplake
```

In [5]:
#!pip3 install --upgrade langchain deeplake openai

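### 1.1. Setting up Deep Lake

Deep Lake is a vector database.

+ Register an account [here](https://app.activeloop.ai/), then replace the value of `DEEPLAKE_ACCOUNT_NAME` below with your account name
+ Request an API key and store it in the `ACTIVELOOP_TOKEN` environment variable
+ Run the following command: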

``` bash
activeloop login -t YOUR_ACTIVELOOP_TOKEN
```
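
Before running the notebook, it can help to confirm that both environment variables mentioned so far are actually visible to Python. The helper below is a small sketch; `missing_env_vars` is a hypothetical name, not part of any of the libraries used here.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

REQUIRED = ["OPENAI_API_KEY", "ACTIVELOOP_TOKEN"]
missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If a variable was added after VS Code started, restarting VS Code (or the Jupyter kernel) is usually needed before it shows up.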

In [1]:
# Note: use your own Deep Lake account name here
DEEPLAKE_ACCOUNT_NAME = "XXX"

Make sure the `LangChain` version is 0.0.188 or later.

In [6]:
import langchain

# Make sure langchain is version 0.0.188 or later

langchain.__version__

'0.0.188'
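
Checking the version by eye works, but a programmatic check avoids a subtle trap: comparing version strings directly is alphabetical, so `"0.0.99"` would compare as larger than `"0.0.188"`. The sketch below compares numeric tuples instead; it assumes plain dotted numeric versions (no suffixes such as `rc1`), and the helper names are hypothetical.

```python
def parse_version(version):
    """Turn a dotted version string such as '0.0.188' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def is_at_least(version, minimum):
    """True if `version` is greater than or equal to `minimum`, compared numerically."""
    return parse_version(version) >= parse_version(minimum)

print(is_at_least("0.0.188", "0.0.188"))
print(is_at_least("0.0.99", "0.0.188"))
```

In the notebook this would be called as `is_at_least(langchain.__version__, "0.0.188")`.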

## 1. Loading the Files

First, clone the LangChain GitHub repository locally:

``` bash
git clone https://github.com/hwchase17/langchain.git
```

If you do not have git installed, download the zip from [here](https://codeload.github.com/hwchase17/langchain/zip/refs/heads/master) and extract it.

Use LangChain's `TextLoader` utility class to turn each file into documents.

In [15]:
import os
from langchain.document_loaders import TextLoader

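# Local clone directory of the LangChain repository
root_dir = './langchain/'

docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        # Only process Python files, and skip anything inside a .venv directory
        if file.endswith('.py') and '/.venv/' not in dirpath:
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')

                # load_and_split() may return several documents per file
                document_list = loader.load_and_split()

                # Flatten document_list into docs
                docs.extend(document_list)
            except Exception as e:
                # Files that fail to load (e.g. decoding errors) are skipped silently
                pass

print(f'Loaded {len(docs)} documents from Python files')

Loaded 1565 documents from Python files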


Document(page_content='"""Configuration file for the Sphinx documentation builder."""\n# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common options. For a full\n# list see the documentation:\n# https://www.sphinx-doc.org/en/master/usage/configuration.html\n\n# -- Path setup --------------------------------------------------------------\n\n# If extensions (or modules to document with autodoc) are in another directory,\n# add these directories to sys.path here. If the directory is relative to the\n# documentation root, use os.path.abspath to make it absolute, like shown here.\n#\n# import os\n# import sys\n# sys.path.insert(0, os.path.abspath(\'.\'))\n\nimport toml\n\nwith open("../pyproject.toml") as f:\n    data = toml.load(f)\n\n# -- Project information -----------------------------------------------------\n\nproject = "🦜🔗 LangChain"\ncopyright = "2023, Harrison Chase"\nauthor = "Harrison Chase"\n\nversion = data["tool
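
The loading loop above skips unreadable files silently. As a hedged alternative, the standard-library sketch below separates "find the files" from "read the files" and returns the failures instead of hiding them. It reads plain text rather than building LangChain documents, and the helper names and the throwaway demo directory are assumptions made for illustration.

```python
import os
import tempfile

def find_python_files(root_dir, skip_dirs=(".venv",)):
    """Yield paths of .py files under root_dir, skipping directories named in skip_dirs."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield os.path.join(dirpath, name)

def read_files(paths):
    """Read each file as UTF-8; return (contents, failures) instead of hiding errors."""
    contents, failures = {}, []
    for path in paths:
        try:
            with open(path, encoding="utf-8") as f:
                contents[path] = f.read()
        except (OSError, UnicodeDecodeError) as exc:
            failures.append((path, repr(exc)))
    return contents, failures

# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "pkg"))
    os.makedirs(os.path.join(root, ".venv"))
    with open(os.path.join(root, "pkg", "ok.py"), "w", encoding="utf-8") as f:
        f.write("print('hi')\n")
    with open(os.path.join(root, "pkg", "bad.py"), "wb") as f:
        f.write(b"\xff\xfe not utf-8")
    with open(os.path.join(root, ".venv", "skipped.py"), "w", encoding="utf-8") as f:
        f.write("x = 1\n")

    paths = list(find_python_files(root))
    contents, failures = read_files(paths)
    loaded_names = sorted(os.path.basename(p) for p in contents)
    failed_names = sorted(os.path.basename(p) for p, _ in failures)
    print("loaded:", loaded_names, "failed:", failed_names)
```

Pruning `dirnames` in place also works on Windows, where the `'/.venv/'` substring test would not match backslash-separated paths.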

## 2. Splitting the Files

This demo splits the documents as plain text:

In [8]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

texts = text_splitter.split_documents(docs)

Created a chunk of size 1006, which is longer than the specified 1000
Created a chunk of size 1213, which is longer than the specified 1000
Created a chunk of size 1782, which is longer than the specified 1000
Created a chunk of size 1620, which is longer than the specified 1000
Created a chunk of size 1485, which is longer than the specified 1000
Created a chunk of size 1919, which is longer than the specified 1000
Created a chunk of size 3515, which is longer than the specified 1000
Created a chunk of size 1852, which is longer than the specified 1000
Created a chunk of size 1533, which is longer than the specified 1000
Created a chunk of size 1331, which is longer than the specified 1000
Created a chunk of size 2549, which is longer than the specified 1000
Created a chunk of size 1696, which is longer than the specified 1000
Created a chunk of size 1086, which is longer than the specified 1000
Created a chunk of size 1647, which is longer than the specified 1000
Created a chunk of s

In [9]:
print(f"Split into {len(texts)} chunks")

Split into 4765 chunks
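
The warnings above ("Created a chunk of size 1006, which is longer than the specified 1000") are most likely a consequence of how separator-based splitting works: the text is cut only at the separator (assumed here to be a blank line, `"\n\n"`), and a piece with no separator inside it cannot be made smaller. The toy function below is a simplified sketch of that behaviour, not LangChain's actual implementation.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece may itself exceed chunk_size; it is never cut further.
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Two short paragraphs followed by one long paragraph without any blank line.
text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 120
sizes = [len(chunk) for chunk in split_on_separator(text, chunk_size=100)]
print(sizes)
```

The two short paragraphs are merged into one chunk, while the long paragraph stays whole and exceeds the limit, which is the situation the warnings report.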


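### 2.1. Advanced: Splitting as Code

Alternatively, you can [split by code structure](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/code_splitter.html); this can be tried in a later experiment.

+ Note: the `Language` class is only available in LangChain 0.0.188 and later
+ Note: code-based splitting does not currently support C#; the reason is unclear.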

In [10]:
# Note: the Language class requires langchain 0.0.188 or later
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

In [11]:
# Currently supported languages
[e.value for e in Language]

['cpp',
 'go',
 'java',
 'js',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html']

In [2]:
# Show the separators used to split Python code

RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

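NameError: name 'RecursiveCharacterTextSplitter' is not defined

The error above appears because this cell (`In [2]`) was executed before the import cell (`In [10]`); run the import first, then rerun this cell to see the separator list.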

In [17]:
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)

python_docs = python_splitter.create_documents([PYTHON_CODE])

python_docs

[Document(page_content='def hello_world():\n    print("Hello, World!")', metadata={}),
 Document(page_content='# Call the function\nhello_world()', metadata={})]
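
To give an intuition for how language-aware splitting can produce the two chunks above, here is a simplified recursive splitter. It tries separators from coarse to fine and only moves to a finer separator when a piece is still too long. The separator list is an assumption loosely modelled on Python syntax, and unlike the real splitter this sketch never merges small neighbouring pieces.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` with the first separator; recurse with finer ones on oversized pieces."""
    stripped = text.strip()
    if not stripped:
        return []
    if len(stripped) <= chunk_size:
        return [stripped]
    if not separators:
        # No separators left: fall back to fixed-width cuts.
        return [stripped[i:i + chunk_size] for i in range(0, len(stripped), chunk_size)]
    separator, remaining = separators[0], separators[1:]
    pieces = text.split(separator)
    # Re-attach the separator so keywords such as "def" are not lost.
    pieces = [pieces[0]] + [separator + piece for piece in pieces[1:]]
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks

# Assumed separator order, from coarse (class/function boundaries) to fine (spaces).
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

chunks = recursive_split(PYTHON_CODE, PYTHON_SEPARATORS, chunk_size=50)
for chunk in chunks:
    print(repr(chunk))
```

On this input the sketch yields the same two chunks as the `python_splitter` example, because the blank line between the function and the call is the first separator that brings both pieces under 50 characters.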

## 3. Embedding and Uploading to Deep Lake

+ This step takes a few minutes to run
+ Make sure your OpenAI API account has billing set up with a credit card

In [18]:
from langchain.embeddings.openai import OpenAIEmbeddings

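# The `client` value is replaced internally (see the repr below); the model defaults to text-embedding-ada-002
embeddings = OpenAIEmbeddings(client="davinci")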

embeddings

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None)

In [21]:
from langchain.vectorstores import DeepLake

db = DeepLake.from_documents(texts, embeddings, dataset_path=f"hub://{DEEPLAKE_ACCOUNT_NAME}/langchain-code")

db

Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/myy412001799/langchain-code

hub://myy412001799/langchain-code loaded successfully.

Evaluating ingest: 100%|██████████| 5/5 [03:14<00:00

Dataset(path='hub://myy412001799/langchain-code', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype      shape       dtype  compression
  -------   -------    -------     -------  ------- 
 embedding  generic  (4765, 1536)  float32   None   
    ids      text     (4765, 1)      str     None   
 metadata    json     (4765, 1)      str     None   
   text      text     (4765, 1)      str     None   

<langchain.vectorstores.deeplake.DeepLake at 0x1dc168e9390>
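
The `embedding` tensor in the summary above has shape `(4765, 1536)` and dtype `float32`, so its raw size can be estimated with simple arithmetic. The sketch below ignores the `text`, `ids`, and `metadata` tensors as well as any storage overhead, so it is a lower bound rather than the actual dataset size.

```python
num_vectors = 4765      # rows of the `embedding` tensor in the summary above
dimensions = 1536       # columns of the `embedding` tensor in the summary above
bytes_per_float32 = 4   # a float32 value occupies 4 bytes

total_bytes = num_vectors * dimensions * bytes_per_float32
total_mib = total_bytes / (1024 * 1024)
print(f"{total_bytes} bytes, about {total_mib:.1f} MiB")
```

At roughly 28 MiB of raw vectors, the upload itself is small; the few minutes of waiting are more plausibly spent on the embedding API calls.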

## 4. QA: Answering Questions

+ Load the dataset
+ Build the retriever
+ Build the conversational retrieval chain

Load the Deep Lake dataset:

In [22]:
# Load the dataset

db = DeepLake(
    dataset_path=f"hub://{DEEPLAKE_ACCOUNT_NAME}/langchain-code", 
    read_only=True, 
    embedding_function=embeddings
)

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/myy412001799/langchain-code

hub://myy412001799/langchain-code loaded successfully.

Deep Lake Dataset in hub://myy412001799/langchain-code already exists, loading from the storage
Dataset(path='hub://myy412001799/langchain-code', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype      shape       dtype  compression
  -------   -------    -------     -------  ------- 
 embedding  generic  (4765, 1536)  float32   None   
    ids      text     (4765, 1)      str     None   
 metadata    json     (4765, 1)      str     None   
   text      text     (4765, 1)      str     None   

Build the retriever:

In [23]:
retriever = db.as_retriever()

retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 20
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 20
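
The settings above ask for cosine distance and maximal marginal relevance (MMR). As a rough illustration of what MMR does, the sketch below picks results one at a time, rewarding similarity to the query and penalising similarity to results already picked. The vectors, the `lambda_mult` weight, and the function names are illustrative assumptions; the library's own implementation may differ in details.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query, candidates, k, lambda_mult=0.5):
    """Pick up to k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],    # 0: very relevant
    [1.0, 0.11],   # 1: near-duplicate of candidate 0
    [0.8, -0.6],   # 2: less relevant, but points in a different direction
]

by_relevance = sorted(range(3), key=lambda i: cosine(query, candidates[i]), reverse=True)[:2]
by_mmr = mmr(query, candidates, k=2)
print("plain similarity:", by_relevance, "MMR:", by_mmr)
```

Plain similarity returns the two near-duplicates, while MMR swaps the second one for the more distinct candidate. Note that when `k` equals the number of fetched candidates, as with `fetch_k = 20` and `k = 20` above, MMR presumably has nothing left to exclude and can only change the order.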

You can also retrieve with a custom filter:

In [24]:
def filter(x):
    # filter based on source code
    if 'something' in x['text'].data()['value']:
        return False
    
    # filter based on path e.g. extension
    metadata =  x['metadata'].data()['value']
    return 'only_this' in metadata['source'] or 'also_that' in metadata['source']

### turn on below for custom filtering
# retriever.search_kwargs['filter'] = filter
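
The placeholders `'something'`, `'only_this'`, and `'also_that'` in the filter above are meant to be replaced. The sketch below shows the same two-step logic (reject by text, then accept by source path) on plain dictionaries so it can be tried without a dataset. The sample records and paths are hypothetical, and real Deep Lake samples are accessed through `.data()['value']` as in the cell above rather than by direct indexing.

```python
def make_filter(excluded_text, allowed_source_parts):
    """Return a predicate that keeps a sample only if its text lacks `excluded_text`
    and its source path contains one of `allowed_source_parts`."""
    def keep(sample):
        if excluded_text in sample["text"]:
            return False
        source = sample["metadata"]["source"]
        return any(part in source for part in allowed_source_parts)
    return keep

# Hypothetical samples, shaped as plain dicts for illustration only.
samples = [
    {"text": "class LLMChain(Chain): ...", "metadata": {"source": "langchain/chains/llm.py"}},
    {"text": "def test_llm_chain(): ...", "metadata": {"source": "tests/unit_tests/test_llm.py"}},
    {"text": "# deprecated helper", "metadata": {"source": "langchain/chains/old.py"}},
]

keep = make_filter(excluded_text="deprecated", allowed_source_parts=["langchain/chains/"])
kept_sources = [s["metadata"]["source"] for s in samples if keep(s)]
print(kept_sources)
```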

Build the retrieval chain:

In [25]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(client="chatgpt", model='gpt-3.5-turbo') # other chat models such as 'gpt-4' can be used instead
qa = ConversationalRetrievalChain.from_llm(model,retriever=retriever)

Start asking questions:

In [30]:
questions = [
    "What classes are derived from the Chain class?",
    # "What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?",
    # "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?",
] 
chat_history = []

for question in questions:  
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What classes are derived from the Chain class? 

**Answer**: The classes derived from the Chain class are:
- APIChain
- AnalyzeDocumentChain
- BaseCombineDocumentsChain
- BaseQAWithSourcesChain
- ChatVectorDBChain
- ConstitutionalChain
- ConversationalRetrievalChain
- ConversationChain
- FlareChain
- GraphCypherQAChain
- GraphQAChain
- HypotheticalDocumentEmbedder
- LLMChain
- LLMBashChain
- LLMCheckerChain
- LLMMathChain
- LLMRequestsChain
- LLMSummarizationCheckerChain
- MapReduceChain
- OpenAIModerationChain
- PALChain
- QAGenerationChain
- QAWithSourcesChain
- RetrievalQA
- RetrievalQAWithSourcesChain
- RouterChain
- SequentialChain
- SimpleSequentialChain
- SQLDatabaseChain
- SQLDatabaseSequentialChain
- TransformChain
- VectorDBQA
- VectorDBQAWithSourcesChain 
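
The question loop passes `chat_history` into every call and appends each `(question, answer)` pair afterwards, which is what lets a follow-up question refer back to earlier turns. The sketch below shows that bookkeeping with a stand-in function instead of the real chain; `fake_qa` and the second question are hypothetical and only echo how many earlier turns were supplied.

```python
def fake_qa(inputs):
    """Stand-in for the real chain: reports how many earlier turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"(answer to {inputs['question']!r}, given {turns} earlier turn(s))"}

questions = [
    "What classes are derived from the Chain class?",
    "Which of them use a retriever?",  # hypothetical follow-up question
]
chat_history = []

for question in questions:
    result = fake_qa({"question": question, "chat_history": chat_history})
    # Store the turn so the next question can build on it.
    chat_history.append((question, result["answer"]))
    print(result["answer"])
```

Because the answer comes from a language model working on retrieved chunks, a list such as the one above is worth checking against the source code before relying on it.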

