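# Self-Query Retriever
A self-query retriever, as the name suggests, is able to query itself.
Given a natural-language query, the retriever uses a query-constructing LLM chain to create a structured query. It then runs that structured query against its underlying vector store. This lets it both compare the user's query with the stored documents by semantic similarity and extract filters on document metadata from the user's query and apply them.
## Key points
1. Self-querying operates on the documents' metadata.
2. A query-constructing LLM chain generates the query parameters, which are then translated into a structured query specific to the underlying vector store.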
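
To make the second point concrete, here is a minimal sketch of the translation step in plain Python. The `Comparison` and `Operation` classes and the `to_where_clause` function are hypothetical stand-ins written for this illustration, not LangChain's own classes. The `$`-prefixed operator dictionaries are modeled on the filter syntax Chroma accepts, so treat the exact output shape as an assumption.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Comparison:
    comparator: str                  # e.g. "eq", "gt", "lt"
    attribute: str                   # metadata field name
    value: Union[str, int, float]    # value to compare against


@dataclass
class Operation:
    operator: str                    # "and" or "or"
    arguments: list                  # nested Comparison / Operation objects


def to_where_clause(node):
    """Recursively convert a structured filter into a nested dict."""
    if isinstance(node, Comparison):
        return {node.attribute: {f"${node.comparator}": node.value}}
    if isinstance(node, Operation):
        return {f"${node.operator}": [to_where_clause(arg) for arg in node.arguments]}
    raise TypeError(f"Unsupported filter node: {node!r}")


# "An animated movie released after 1990" expressed as a structured filter
structured_filter = Operation(
    operator="and",
    arguments=[
        Comparison("eq", "genre", "animated"),
        Comparison("gt", "year", 1990),
    ],
)

where = to_where_clause(structured_filter)
print(where)
```

The free-text part of the query (for example "toys") would still go through ordinary similarity search; only the metadata constraints end up in the filter.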

In [1]:
# !pip install -q -U langchain openai chromadb tiktoken lark
# !pip install -q -U lark


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
# CloudflareWorkersAI
from dotenv import load_dotenv
import os
from langchain_community.llms.cloudflare_workersai import CloudflareWorkersAI

load_dotenv(override=True)

account_id = os.getenv('CF_ACCOUNT_ID')
api_token = os.getenv('CF_API_TOKEN')

# Confirm the credentials were loaded without printing the secrets themselves
print("CF_ACCOUNT_ID set:", bool(account_id))
print("CF_API_TOKEN set:", bool(api_token))

model = '@cf/meta/llama-3-8b-instruct'
cf_llm = CloudflareWorkersAI(account_id=account_id, api_token=api_token, model=model)

# cloudflare_workersai
from langchain_community.embeddings.cloudflare_workersai import (
    CloudflareWorkersAIEmbeddings,
)

# Embedding dimension: 384
embeddings = CloudflareWorkersAIEmbeddings(
    account_id=account_id,
    api_token=api_token,
    model_name="@cf/baai/bge-small-en-v1.5",
)

CF_ACCOUNT_ID set: True
CF_API_TOKEN set: True


In [3]:
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

docs = [
    Document(page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
             metadata={"year": 1993, "rating": 7.7, "genre": "action"}),
    Document(page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
             metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2}),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6}),
    Document(page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
             metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3}),
    Document(page_content="Toys come alive and have a blast doing so", metadata={"year": 1995, "genre": "animated"}),
    Document(page_content="Three men walk into the Zone, three men walk out of the Zone",
             metadata={"year": 1979, "rating": 9.9, "director": "Andrei Tarkovsky", "genre": "thriller"})
]
vectorstore = Chroma.from_documents(
    docs, embeddings, collection_name="self_querying"
)

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float"
    ),
]

In [4]:
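# Short description of what each document contains, passed to the query constructor
document_content_description = "Brief summary of a movie"

# Retriever that uses the vector store and an LLM to generate vector store queries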
retriever = SelfQueryRetriever.from_llm(cf_llm, vectorstore,
                                        document_content_description, metadata_field_info,
                                        verbose=True)

In [7]:
# Ask a plain query with no metadata filter
retriever.get_relevant_documents("What are some movies about dinosaurs")

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'action', 'rating': 7.7, 'year': 1993}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995}),
 Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006})]

In [9]:
# Ask with a filter

retriever.get_relevant_documents("I want to watch a movie rated lower than 8")

[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'action', 'rating': 7.7, 'year': 1993})]

In [11]:
# Ask with a query containing a filter
retriever.get_relevant_documents("Has Greta Gerwig directed any movies about women")


[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019})]

In [13]:
# Ask with a composite filter
# ss is the Chinese-language version of the query in aa; ss is the one sent below
ss = '有什么评分高（8.5 分以上）的惊悚片？'
aa = "What's a highly rated (above 8.5) thriller film?"
retriever.get_relevant_documents(ss)


OutputParserException: Parsing text
I'm here to assist you with structuring your query to match the request schema. Let's get started!

Please provide the data source and user query, and I'll help you create a structured request in the required format.

For Example 3, the user query is "有什么评分高（8.5 分以上）的惊悚片" which translates to "What movies have a rating of 8.5 or higher in the horror genre?"

Here's the structured request:
```json
{
    "query": "",
    "filter": "and(eq(\"rating\", gte(8.5)), eq(\"genre\", \"horror\"))"
}
```
Please let me know if you have any further questions or need assistance with structuring other queries!
 raised following error:
Received invalid attributes 8.5. Allowed attributes are ['genre', 'year', 'director', 'rating']
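
The failure above comes from the LLM itself: it emitted a malformed filter (`eq("rating", gte(8.5))`, where the comparator is nested in the value position) and also rendered "thriller" as "horror", so the parser rejected the reply. One way to keep the notebook usable when this happens is to fall back to an unfiltered search. The sketch below shows the pattern with plain callables. `retrieve_with_fallback` and both stub functions are hypothetical, and real code would narrow the caught exception to the parser's own exception class instead of `Exception`.

```python
def retrieve_with_fallback(structured_search, plain_search, query):
    """Try the self-querying path first; fall back to plain search on failure."""
    try:
        return structured_search(query), "structured"
    except Exception as exc:  # sketch only: catch a narrower exception in real code
        print(f"Structured query failed ({exc}); falling back to plain search")
        return plain_search(query), "plain"


# Stub standing in for a retriever whose LLM returned an unparsable filter
def failing_structured_search(query):
    raise ValueError("Received invalid attributes 8.5")


# Stub standing in for an unfiltered similarity search
def plain_similarity_search(query):
    return [f"top match for: {query}"]


results, path = retrieve_with_fallback(
    failing_structured_search,
    plain_similarity_search,
    "highly rated thriller",
)
print(path, results)
```

The fallback loses the metadata filter, so its results may include documents that do not satisfy the rating or genre constraint; returning the path label makes that visible to the caller.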

In [14]:
# Ask with a query and composite filter
retriever.get_relevant_documents("What's an animated movie that's all about toys and after 1990")


[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]

In [16]:
# query_constructor is the query-constructor chain that generates the vector store query
retriever.query_constructor.invoke({"query": "What's an animated movie that's all about toys and after 1990"})


StructuredQuery(query='toys', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='animated'), Comparison(comparator=<Comparator.GT: 'gt'>, attribute='year', value=1990)]), limit=None)

In [17]:
retriever.query_constructor.invoke({"query": "Show me one movie that's rated higher than 8"})


StructuredQuery(query=' ', filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8), limit=None)

In [18]:
# Enable the limit
# The enable_limit parameter of SelfQueryRetriever.from_llm turns on limits, so the number of records to retrieve can be specified
retriever = SelfQueryRetriever.from_llm(
    cf_llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True,
    enable_limit=True)

In [20]:
retriever.query_constructor.invoke({"query": "Show me one movie that's rated higher than 8"})


OutputParserException: Parsing text
I'm happy to help! I'll structure the user's query to match the request schema.

Please go ahead and ask your question, and I'll respond with a JSON object formatted in the schema provided.
 raised following error:
Got invalid JSON object. Error: Expecting value: line 1 column 1 (char 0)

In [21]:
retriever.get_relevant_documents("Show me one movie that's rated higher than 8")


OutputParserException: Parsing text
I'm ready to assist you. Please go ahead and ask your questions, and I'll respond with a JSON object in the requested schema.

If you don't have a specific question yet, you can start with a general query, and I'll help you structure it into a request.

Go ahead and ask away!
 raised following error:
Got invalid JSON object. Error: Expecting value: line 1 column 1 (char 0)
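
Both failures with `enable_limit=True` share a cause: the model replied with conversational text instead of the JSON object the parser expects. A cheap diagnostic is to check whether a reply contains any parsable JSON object before handing it to the parser. The helper below, `extract_json_object`, is a hypothetical utility written for this illustration and is not part of LangChain; the sample replies are shortened stand-ins for the outputs shown above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to avoid nesting code fences
# Matches a fenced block (optionally tagged "json") and captures the object inside
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def extract_json_object(reply):
    """Return the first JSON object found in an LLM reply, or None."""
    match = FENCED_JSON.search(reply)
    candidate = match.group(1) if match else reply.strip()
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# A purely conversational reply, like the ones that caused the errors above
chatty_reply = "I'm ready to assist you. Please go ahead and ask your questions."

# A reply that wraps a JSON object in a fenced block
payload = {"query": "toys", "filter": 'gt("year", 1990)'}
fenced_reply = (
    "Here's the structured request:\n"
    + FENCE + "json\n" + json.dumps(payload) + "\n" + FENCE
)

chatty_result = extract_json_object(chatty_reply)
fenced_result = extract_json_object(fenced_reply)
print(chatty_result)
print(fenced_result)
```

When the helper returns `None`, the reply carries no structured request at all, which points at the prompt or the model rather than at the filter syntax.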