## 1. Document Querying in LangChain

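After documents have been loaded and split, we need a way to retrieve the fragments relevant to a question. To do that, we run each fragment through an embedding model to compute its feature vector, then store the vectors in a vector database.

LangChain implements document querying through its embedding classes (`embeddings`) and vector store classes (`vectorstores`). This article walks through how to use them to query documents in txt and Markdown format.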
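
As a rough illustration of the embed, store, and search flow described above, here is a minimal pure-Python sketch. The `toy_embed` function (plain character counts) and the `TinyVectorStore` class are hypothetical stand-ins invented for this example. They are not LangChain APIs, and a real embedding model captures meaning rather than character overlap.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: one count per vocabulary character."""
    counts = Counter(text)
    return [float(counts[ch]) for ch in vocabulary]


def cosine(v1, v2):
    """Plain cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs and scans all of them."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.entries = []

    def add(self, text):
        # Embed once at insert time and keep the vector next to the text.
        self.entries.append((toy_embed(text, self.vocabulary), text))

    def search(self, query, k=1):
        # Embed the query the same way, then rank stored vectors by similarity.
        q = toy_embed(query, self.vocabulary)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]


vocabulary = "abcdefghijklmnopqrstuvwxyz"
store = TinyVectorStore(vocabulary)
for fragment in ["faiss stores vectors",
                 "markdown files use headings",
                 "tar extracts archives"]:
    store.add(fragment)

print(store.search("vector store", k=1))
```

The rest of the article follows the same three steps, with text2vec-large-chinese as the embedding model and FAISS as the store.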



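## 2. Hands-on Walkthrough
🔹 This case must run on a P100 or higher resource specification. Make sure your specification matches; you can switch it as shown in the figure below.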

![](https://modelarts-labs-bj4-v2.obs.cn-north-4.myhuaweicloud.com/case_zoo/chatglm3/image/1.png)

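🔹 Clicking Run in ModelArts opens ModelArts CodeLab, where you need to sign in with a Huawei Cloud account. If you do not have one, register and complete real-name verification by following [《ModelArts准备工作_简易版》 (ModelArts Preparation, Simplified Edition)](https://developer.huaweicloud.com/develop/aigallery/article/detail?id=4ce709d6-eb25-4fa4-b214-e2e5d6b7919c). After signing in, wait a moment for the CodeLab runtime environment to open.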

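🔹 If you hit an Out Of Memory error, check whether your parameter settings are too high. Lower them and restart the kernel, or switch to a higher resource specification ❗❗❗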

### 2.1 Download the Model and Data

Download the nltk_data archive and the sample documents:

In [1]:
import os
import moxing as mox

work_dir = '/home/ma-user/work'
obs_path = 'obs://dtse-models/tar-models/nltk_data.tar'
ma_path = os.path.join(work_dir, 'nltk_data.tar')
mox.file.copy(obs_path, ma_path)

mox.file.copy_parallel('obs://modelarts-labs-bj4-v2/case_zoo/langchain-ChatGLM/file/docs','/home/ma-user/work/docs')

INFO:root:Using MoXing-v2.1.0.5d9c87c8-5d9c87c8

INFO:root:Using OBS-Python-SDK-3.20.9.1


Change to the working directory and extract the data archive:

In [2]:
os.chdir(work_dir)
!pwd
!tar -xvf nltk_data.tar

/home/ma-user/work

nltk_data/

nltk_data/misc/

nltk_data/misc/mwa_ppdb.zip

nltk_data/misc/perluniprops.xml

nltk_data/misc/mwa_ppdb.xml

nltk_data/misc/perluniprops.zip

nltk_data/tokenizers/

nltk_data/tokenizers/punkt/

nltk_data/tokenizers/punkt/french.pickle

nltk_data/tokenizers/punkt/polish.pickle

nltk_data/tokenizers/punkt/.DS_Store

nltk_data/tokenizers/punkt/portuguese.pickle

nltk_data/tokenizers/punkt/german.pickle

nltk_data/tokenizers/punkt/swedish.pickle

nltk_data/tokenizers/punkt/malayalam.pickle

nltk_data/tokenizers/punkt/estonian.pickle

nltk_data/tokenizers/punkt/finnish.pickle

nltk_data/tokenizers/punkt/spanish.pickle

nltk_data/tokenizers/punkt/PY3/

nltk_data/tokenizers/punkt/PY3/french.pickle

nltk_data/tokenizers/punkt/PY3/polish.pickle

nltk_data/tokenizers/punkt/PY3/portuguese.pickle

nltk_data/tokenizers/punkt/PY3/german.pickle

nltk_data/tokenizers/punkt/PY3/swedish.pickle

nltk_data/tokenizers/punkt/PY3/malayalam.pickle

nltk_data/tokenizers/punkt/PY3

Download the text2vec-large-chinese model, which is used for general-purpose Chinese semantic matching:

In [3]:
import os
import moxing as mox

obs_path = 'obs://dtse-models/tar-models/text2vec-large-chinese.tar'
ma_path = os.path.join(work_dir, 'text2vec-large-chinese.tar')
mox.file.copy(obs_path, ma_path)

Change to the working directory and extract the model archive:

In [4]:
os.chdir(work_dir)
!pwd
!tar -xvf text2vec-large-chinese.tar

/home/ma-user/work

text2vec-large-chinese/

text2vec-large-chinese/.gitattributes

text2vec-large-chinese/README.md

text2vec-large-chinese/config.json

text2vec-large-chinese/eval_results.txt

text2vec-large-chinese/models--GanymedeNil--text2vec-large-chinese/

text2vec-large-chinese/models--GanymedeNil--text2vec-large-chinese/blobs/

text2vec-large-chinese/models--GanymedeNil--text2vec-large-chinese/blobs/eaf5cb71c0eeab7db3c5171da504e5867b3f67a78e07bdba9b52d334ae35adb3.lock

text2vec-large-chinese/models--GanymedeNil--text2vec-large-chinese/refs/

text2vec-large-chinese/models--GanymedeNil--text2vec-large-chinese/refs/main

text2vec-large-chinese/models--GanymedeNil--text2vec-large-chinese/snapshots/

text2vec-large-chinese/models--GanymedeNil--text2vec-large-chinese/snapshots/064717f2acfd7253bea91079d59b82e50b58c886/

text2vec-large-chinese/pytorch_model.bin

text2vec-large-chinese/special_tokens_map.json

text2vec-large-chinese/tmpqlu9nxcm

text2vec-large-chinese/tokenizer.json

t
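
The `!tar -xvf` commands above rely on the notebook's shell escape. Where that is not available, the same extraction can be sketched with Python's standard `tarfile` module. The helper name `extract_archive` is made up for this example; the paths are the ones used in the cells above, and each archive is extracted only if it is actually present.

```python
import os
import tarfile


def extract_archive(tar_path, dest_dir):
    """Extract tar_path into dest_dir and return the sorted member names."""
    with tarfile.open(tar_path) as archive:
        names = sorted(archive.getnames())
        archive.extractall(dest_dir)
    return names


# Paths taken from the cells above; missing archives are skipped silently.
work_dir = '/home/ma-user/work'
for name in ('nltk_data.tar', 'text2vec-large-chinese.tar'):
    tar_path = os.path.join(work_dir, name)
    if os.path.exists(tar_path):
        print(extract_archive(tar_path, work_dir)[:5])
```

Only extract archives from a source you trust, since a tar file can contain paths that point outside the destination directory.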

### 2.2 Configure the Environment

This case requires Python 3.10.10 or later, so we first create a virtual environment:

In [5]:
!/home/ma-user/anaconda3/bin/conda create -n python-3.10.10 python=3.10.10 -y --override-channels --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
!/home/ma-user/anaconda3/envs/python-3.10.10/bin/pip install ipykernel



Collecting package metadata (current_repodata.json): done

Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.

Collecting package metadata (repodata.json): done

Solving environment: done



## Package Plan ##



  environment location: /home/ma-user/anaconda3/envs/python-3.10.10



  added / updated specs:

    - python=3.10.10





The following NEW packages will be INSTALLED:



  _libgcc_mutex      anaconda/pkgs/main/linux-64::_libgcc_mutex-0.1-main

  _openmp_mutex      anaconda/pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu

  bzip2              anaconda/pkgs/main/linux-64::bzip2-1.0.8-h7b6447c_0

  ca-certificates    anaconda/pkgs/main/linux-64::ca-certificates-2023.08.22-h06a4308_0

  ld_impl_linux-64   anaconda/pkgs/main/linux-64::ld_impl_linux-64-2.38-h1181459_1

  libffi             anaconda/pkgs/main/linux-64::libffi-3.4.4-h6a678d5_0

  libgcc-ng          anaconda/pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1

  libg

In [6]:
import json
import os

data = {
   "display_name": "python-3.10.10",
   "env": {
      "PATH": "/home/ma-user/anaconda3/envs/python-3.10.10/bin:/home/ma-user/anaconda3/envs/python-3.7.10/bin:/modelarts/authoring/notebook-conda/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ma-user/modelarts/ma-cli/bin:/home/ma-user/modelarts/ma-cli/bin:/home/ma-user/anaconda3/envs/PyTorch-1.8/bin"
   },
   "language": "python",
   "argv": [
      "/home/ma-user/anaconda3/envs/python-3.10.10/bin/python",
      "-m",
      "ipykernel",
      "-f",
      "{connection_file}"
   ]
}

# Create the kernel spec directory if it does not exist yet
os.makedirs("/home/ma-user/anaconda3/share/jupyter/kernels/python-3.10.10/", exist_ok=True)

with open('/home/ma-user/anaconda3/share/jupyter/kernels/python-3.10.10/kernel.json', 'w') as f:
    json.dump(data, f, indent=4)

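Once the kernel has been created, wait a moment or refresh the page, then click the kernel selector in the upper-right corner and choose python-3.10.10.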

![](https://modelarts-labs-bj4-v2.obs.cn-north-4.myhuaweicloud.com/case_zoo/chatglm3/image/2.png)
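
After switching kernels, it can be worth confirming that the notebook is really running on the new interpreter. The check below is a small sketch; the helper name `meets_minimum` and the `(3, 10)` threshold are choices made for this example, based on the Python 3.10.10 requirement stated above.

```python
import sys


def meets_minimum(version_info, minimum=(3, 10)):
    """Return True if the (major, minor) part of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# On the new kernel, sys.executable should point into the python-3.10.10 environment.
print(sys.executable)
print(sys.version_info[:3], meets_minimum(sys.version_info))
```

If the second line reports `False`, the notebook is most likely still attached to the original kernel.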

### 2.3 Install Dependencies

In [1]:
!pip install transformers==4.30.2
!pip install sentencepiece==0.1.99
!pip install torch==2.0.1
!pip install markdown==3.4.3
!pip install faiss-gpu==1.7.2
!pip install langchain==0.0.329 -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install nltk==3.8.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install unstructured==0.10.24 -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install sentence-transformers==2.2.2

Looking in indexes: http://repo.myhuaweicloud.com/repository/pypi/simple

Collecting transformers==4.30.2

  Using cached http://repo.myhuaweicloud.com/repository/pypi/packages/5b/0b/e45d26ccd28568013523e04f325432ea88a442b4e3020b757cf4361f0120/transformers-4.30.2-py3-none-any.whl (7.2 MB)

Collecting filelock (from transformers==4.30.2)

  Using cached http://repo.myhuaweicloud.com/repository/pypi/packages/00/45/ec3407adf6f6b5bf867a4462b2b0af27597a26bd3cd6e2534cb6ab029938/filelock-3.12.2-py3-none-any.whl (10 kB)

Collecting huggingface-hub<1.0,>=0.14.1 (from transformers==4.30.2)

  Using cached http://repo.myhuaweicloud.com/repository/pypi/packages/7f/c4/adcbe9a696c135578cabcbdd7331332daad4d49b7c43688bc2d36b3a47d2/huggingface_hub-0.16.4-py3-none-any.whl (268 kB)

Collecting numpy>=1.17 (from transformers==4.30.2)

  Using cached http://repo.myhuaweicloud.com/repository/pypi/packages/71/3c/3b1981c6a1986adc9ee7db760c0c34ea5b14ac3da9ecfcf1ea2a4ec6c398/numpy-1.25.2-cp310-cp310-manylinux_2
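
Because the versions above are pinned, a quick comparison against what actually got installed can catch a partial or failed install. The sketch below uses the standard library's `importlib.metadata`; the `PINNED` table simply repeats the versions from the install cell, and `find_mismatches` is a helper name made up for this example. Some builds report a local suffix (torch, for instance, may show something like `2.0.1+cu117`), which this exact comparison would flag even though the install is fine.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cell above.
PINNED = {
    "transformers": "4.30.2",
    "sentencepiece": "0.1.99",
    "torch": "2.0.1",
    "markdown": "3.4.3",
    "faiss-gpu": "1.7.2",
    "langchain": "0.0.329",
    "nltk": "3.8.1",
    "unstructured": "0.10.24",
    "sentence-transformers": "2.2.2",
}


def find_mismatches(pinned):
    """Return {package: (expected, installed)} for packages that are missing or differ."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


print(find_mismatches(PINNED))
```

An empty dictionary means every pinned package is present at the expected version.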

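## 3. Chinese Text Matching

LangChain supports many embedding models, for example [text2vec-large-chinese](https://github.com/shibing624/text2vec), [m3e-large](https://github.com/wangyingdong/m3e-base), and [bge-large-zh](https://github.com/jsonzhuwei/bge-large-zh). This article uses text2vec-large-chinese.

We convert each piece of text to be matched into an embedding vector with text2vec-large-chinese, then match the texts by computing the similarity between the two vectors.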

In [2]:
import os
import nltk
work_dir = '/home/ma-user/work'
docs_path = os.path.join(work_dir, 'docs')
nltk.data.path.append(os.path.join(work_dir, 'nltk_data'))

import numpy as np
from nltk import data
from langchain.vectorstores import FAISS
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter,MarkdownTextSplitter
from langchain.document_loaders import UnstructuredFileLoader,UnstructuredMarkdownLoader

In [3]:
embedding_model = 'text2vec-large-chinese'

# Cosine similarity between two vectors, rescaled from [-1, 1] to [0, 1]
def get_cos_similar(v1: list, v2: list):
    num = float(np.dot(v1, v2))  # dot product of the two vectors
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)  # product of the two norms
    return 0.5 + 0.5 * (num / denom) if denom != 0 else 0

# Load the text2vec-large-chinese model
def load_embeddings():
    embedding_model_path = os.path.join(work_dir, embedding_model)
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model_path)
    return embeddings

# Compute the similarity between two pieces of text
def get_embedding_sim(s1, s2, embeddings):
    embedding1 = embeddings.embed_query(s1)  # convert the text to a vector
    print('embedding1: ', len(embedding1))
    embedding2 = embeddings.embed_query(s2)
    sim = get_cos_similar(embedding1, embedding2)
    print('sim of \'{0}\' and \'{1}\' is : {2}'.format(s1, s2, sim))
    return sim

In [4]:
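sentence1 = "我今天心情很差"  # "I'm in a bad mood today"
sentence2 = "我今天很不开心"  # "I'm very unhappy today"
sentence3 = "what are you弄啥嘞"  # mixed English and dialect, roughly "what are you doing"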
embeddings = load_embeddings()
get_embedding_sim(sentence1, sentence2, embeddings)
get_embedding_sim(sentence1, sentence3, embeddings)


  from .autonotebook import tqdm as notebook_tqdm

No sentence-transformers model found with name /home/ma-user/work/text2vec-large-chinese. Creating a new one with MEAN pooling.


embedding1:  1024

sim of '我今天心情很差' and '我今天很不开心' is : 0.8813112714604454

embedding1:  1024

sim of '我今天心情很差' and 'what are you弄啥嘞' is : 0.5937208396102376


0.5937208396102376
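
Note that `get_cos_similar` returns `0.5 + 0.5 * cos`, so the printed scores lie in [0, 1] rather than in the usual cosine range of [-1, 1]. The short sketch below inverts that mapping for the two scores printed above; `to_raw_cosine` is a helper name made up for this example.

```python
def to_raw_cosine(rescaled):
    """Invert the 0.5 + 0.5 * cos mapping used by get_cos_similar."""
    return 2.0 * rescaled - 1.0


# Scores copied from the output above.
scores = [
    ("sentence1 vs sentence2", 0.8813112714604454),
    ("sentence1 vs sentence3", 0.5937208396102376),
]
for label, score in scores:
    print(label, round(to_raw_cosine(score), 4))
```

Read this way, the two similar sentences have a raw cosine of about 0.76, while the unrelated pair sits at about 0.19, so the gap between them is wider than the rescaled scores suggest.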

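## 4. Document Querying
Document querying follows the same idea: we first convert the split document fragments into embedding vectors, store them in a vector database, and then search the database with the query.

LangChain supports many vector databases, for example [FAISS](https://github.com/facebookresearch/faiss), [Milvus](https://github.com/milvus-io/milvus), and [PGVector](https://github.com/pgvector/pgvector). This article uses FAISS.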
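
To make the idea of searching by distance concrete, here is a brute-force sketch in plain Python that ranks stored vectors by squared Euclidean (L2) distance, where a smaller value means a closer match. This only illustrates the metric and is not how FAISS is implemented. The three-dimensional vectors and the `chunk-*` labels are made-up values, and the assumption that a flat L2 index reports squared distances should be checked against the FAISS documentation.

```python
def squared_l2(v1, v2):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2))


def nearest(query_vec, stored, k=2):
    """stored is a list of (label, vector); returns k (label, distance) pairs, closest first."""
    scored = [(label, squared_l2(query_vec, vec)) for label, vec in stored]
    return sorted(scored, key=lambda item: item[1])[:k]


# Made-up vectors standing in for embedded document chunks.
stored = [
    ("chunk-a", [1.0, 0.0, 2.0]),
    ("chunk-b", [0.0, 3.0, 1.0]),
    ("chunk-c", [1.0, 1.0, 2.0]),
]
print(nearest([1.0, 0.0, 1.0], stored))
```

A vector database does the same ranking, but with index structures that avoid scanning every stored vector.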

In [9]:
# Load a txt file
def load_txt_file(txt_file):    
    loader = UnstructuredFileLoader(os.path.join(work_dir, txt_file))
    docs = loader.load()
    return docs

# Load a Markdown file
def load_md_file(md_file):    
    loader = UnstructuredMarkdownLoader(os.path.join(work_dir, md_file))
    docs = loader.load()
    return docs

# Load a txt file and split it into chunks
def load_txt_splitter(txt_file, chunk_size=100, chunk_overlap=20):
    docs = load_txt_file(txt_file)
    text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    split_docs = text_splitter.split_documents(docs)
    return split_docs

# Load a Markdown file and split it into chunks
def load_md_splitter(md_file, chunk_size=100, chunk_overlap=20):
    docs = load_md_file(md_file)
    text_splitter = MarkdownTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    split_docs = text_splitter.split_documents(docs)
    return split_docs

# Split the files under docs_path, embed the chunks, and put them into a FAISS vector store
def load_vector_store(docs_path):
    split_docs = []
    for doc in os.listdir(docs_path):
        doc_path = f'{docs_path}/{doc}'
        if doc_path.endswith('.txt'):
            docs = load_txt_splitter(doc_path)
            split_docs.extend(docs)
        elif doc_path.endswith('.md'):
            docs = load_md_splitter(doc_path)
            split_docs.extend(docs)
        else:
            print('不支持的文件类型:', doc_path)
            continue
    embeddings = load_embeddings()
    vector_store = FAISS.from_documents(split_docs, embeddings)
    return vector_store

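# Query the vector store
def sim_search(query, vector_store):
    # similarity_search_with_score returns the matching documents together with
    # the distance between the query and each document.
    # The score is an L2 distance, so lower is better.
    results = vector_store.similarity_search_with_score(query)
    print('query result: ', results)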
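
The `chunk_size` and `chunk_overlap` parameters used above can be illustrated with a simplified fixed-window splitter. This is a sketch of the general idea only: as far as we understand, LangChain's `CharacterTextSplitter` first splits on a separator and then merges the pieces, so a single piece longer than `chunk_size` is kept whole, which this version does not imitate. The function name `split_fixed` is made up for this example.

```python
def split_fixed(text, chunk_size=100, chunk_overlap=20):
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears intact in at least one chunk.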

In [10]:
query = "ModelBox支持哪两种方式"  # "Which two modes does ModelBox support?"
vector_store = load_vector_store(docs_path)
sim_search(query, vector_store)

Created a chunk of size 146, which is longer than the specified 100

No sentence-transformers model found with name /home/ma-user/work/text2vec-large-chinese. Creating a new one with MEAN pooling.


不支持的文件类型: /home/ma-user/work/docs/.ipynb_checkpoints

query result:  [(Document(page_content='ModelBox支持两种方式运行，一种是服务化，一种是SDK，开发者可以按照下表选择相关的开发模式。', metadata={'source': '/home/ma-user/work/docs/modelbox.txt'}), 420.72437), (Document(page_content='2. SDK：ModelBox提供了ModelBox开发库，使用于扩展现有应用支持高性能AI推理，专注AI推理业务，支持c++，Python集成', metadata={'source': '/home/ma-user/work/docs/modelbox.txt'}), 587.23193), (Document(page_content='如果是第一次创建工程，在ModelBox', metadata={'source': '/home/ma-user/work/docs/第一个应用.md'}), 598.4208), (Document(page_content='也就是说，ModelBox的Pipeline模式，首先需要将应用的流程图构建出来，再分别实现图中的每个模块（ModelBox中称为“功能单元”），对于上面的视频应用：读取摄像头并输出原始画面，对应的', metadata={'source': '/home/ma-user/work/docs/第一个应用.md'}), 646.28265)]
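
The result is a list of `(Document, score)` pairs ordered from closest to farthest. The first hit answers the question directly, while the later ones are only loosely related, so in practice it can help to drop hits beyond some distance. The sketch below shows one way to do that. The `Doc` namedtuple is a hypothetical stand-in for LangChain's `Document` so the example runs on its own, the page contents are shortened, and the cutoff of 500.0 is an arbitrary value chosen for illustration, not a recommended setting.

```python
from collections import namedtuple

# Hypothetical stand-in for langchain's Document, so the sketch runs without langchain.
Doc = namedtuple("Doc", ["page_content", "metadata"])


def filter_by_distance(results, max_distance):
    """Keep (doc, score) pairs whose distance is at most max_distance, closest first."""
    kept = [(doc, score) for doc, score in results if score <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Scores and sources copied from the output above; page contents shortened.
results = [
    (Doc("ModelBox支持两种方式运行...", {"source": "/home/ma-user/work/docs/modelbox.txt"}), 420.72437),
    (Doc("2. SDK：ModelBox提供了ModelBox开发库...", {"source": "/home/ma-user/work/docs/modelbox.txt"}), 587.23193),
    (Doc("如果是第一次创建工程，在ModelBox", {"source": "/home/ma-user/work/docs/第一个应用.md"}), 598.4208),
    (Doc("也就是说，ModelBox的Pipeline模式...", {"source": "/home/ma-user/work/docs/第一个应用.md"}), 646.28265),
]

for doc, score in filter_by_distance(results, max_distance=500.0):
    print(round(score, 2), doc.metadata["source"])
```

A suitable cutoff depends on the embedding model and the data, so it is best chosen by inspecting scores for a few known-good and known-bad queries.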
