<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/WeaviateIndex_metadata_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Weaviate Vector Store Metadata Filter

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [1]:
%pip install llama-index-vector-stores-weaviate

Collecting llama-index-vector-stores-weaviate
  Downloading llama_index_vector_stores_weaviate-1.1.3-py3-none-any.whl.metadata (717 bytes)
Collecting llama-index-core<0.12.0,>=0.11.0 (from llama-index-vector-stores-weaviate)
  Downloading llama_index_core-0.11.23-py3-none-any.whl.metadata (2.5 kB)
Collecting weaviate-client<5.0.0,>=4.5.7 (from llama-index-vector-stores-weaviate)
  Downloading weaviate_client-4.9.3-py3-none-any.whl.metadata (3.6 kB)
Collecting dataclasses-json (from llama-index-core<0.12.0,>=0.11.0->llama-index-vector-stores-weaviate)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index-core<0.12.0,>=0.11.0->llama-index-vector-stores-weaviate)
  Downloading dirtyjson-1.0.8-py3-none-any.whl.metadata (11 kB)
Collecting filetype<2.0.0,>=1.2.0 (from llama-index-core<0.12.0,>=0.11.0->llama-index-vector-stores-weaviate)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting tenacity

In [2]:
!pip install llama-index weaviate-client

Collecting llama-index
  Downloading llama_index-0.11.23-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.4.0,>=0.3.4 (from llama-index)
  Downloading llama_index_agent_openai-0.3.4-py3-none-any.whl.metadata (728 bytes)
Collecting llama-index-cli<0.4.0,>=0.3.1 (from llama-index)
  Downloading llama_index_cli-0.3.1-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-embeddings-openai<0.3.0,>=0.2.4 (from llama-index)
  Downloading llama_index_embeddings_openai-0.2.5-py3-none-any.whl.metadata (686 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.3.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.4.2-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_legacy-0.9.48.post4-py3-none-any.whl.metadata (8.5 kB)
Collecting llama-index-llms-openai<0.3.0,>=0.2.10 (from llama-index)
  Downloading llama_index_llms_openai-0.2.16-py3-none-any.whl.metadata (3.3 k

#### Creating a Weaviate Client

In [3]:
import os
import openai

os.environ["OPENAI_API_KEY"] = ""
openai.api_key = os.environ["OPENAI_API_KEY"]

In [4]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [5]:
import weaviate

# cloud
cluster_url = ""  # your Weaviate Cloud cluster URL
api_key = ""  # your Weaviate API key (do not commit real credentials)

client = weaviate.connect_to_wcs(
    cluster_url=cluster_url,
    auth_credentials=weaviate.auth.AuthApiKey(api_key),
)

# local
# client = weaviate.connect_to_local()

#### Load documents, build the VectorStoreIndex

In [6]:
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from IPython.display import Markdown, display

## Metadata Filtering

Let's insert a few dummy nodes with metadata, then filter so that only the matching node is returned.

In [8]:
from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
            "year": 1994,
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
            "year": 1972,
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
            "theme": "Fiction",
            "year": 2010,
        },
    ),
    TextNode(
        text="To Kill a Mockingbird",
        metadata={
            "author": "Harper Lee",
            "theme": "Mafia",
            "year": 1960,
        },
    ),
    TextNode(
        text="1984",
        metadata={
            "author": "George Orwell",
            "theme": "Totalitarianism",
            "year": 1949,
        },
    ),
    TextNode(
        text="The Great Gatsby",
        metadata={
            "author": "F. Scott Fitzgerald",
            "theme": "The American DreamChina",
            "year": 1925,
        },
    ),
    TextNode(
        text="Harry Potter and the Sorcerer's Stone",
        metadata={
            "author": "J.K. Rowling",
            "theme": "Fiction",
            "year": 1997,
        },
    ),
]

In [9]:
from llama_index.core import StorageContext

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="LlamaIndex_filter"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)

In [None]:
retriever = index.as_retriever()
retriever.retrieve("What is inception?")

In [11]:
from weaviate.classes.query import Filter

LlamaIndex_filter = client.collections.get("LlamaIndex_filter")
response = LlamaIndex_filter.query.fetch_objects(
    filters=Filter.by_property("theme").like("*China*"),
    limit=1
)

for o in response.objects:
    print(o.properties)

{'text': 'The Great Gatsby', 'year': 1925.0, '_node_type': 'TextNode', 'document_id': 'None', 'director': None, 'ref_doc_id': 'None', 'relationships': None, 'theme': 'The American DreamChina', 'author': 'F. Scott Fitzgerald', '_node_content': '{"id_": "9db013d9-2d1f-4223-ba5c-ba6338cc670b", "embedding": null, "metadata": {"author": "F. Scott Fitzgerald", "theme": "The American DreamChina", "year": 1925}, "excluded_embed_metadata_keys": [], "excluded_llm_metadata_keys": [], "relationships": {}, "text": "", "mimetype": "text/plain", "start_char_idx": null, "end_char_idx": null, "text_template": "{metadata_str}\\n\\n{content}", "metadata_template": "{key}: {value}", "metadata_seperator": "\\n", "class_name": "TextNode"}', 'doc_id': 'None', 'node_info': None}
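
The `like` filter above uses `*` as a wildcard, so `*China*` matches any `theme` value that contains `China`. As a rough mental model only, the same pattern can be reproduced in plain Python with `fnmatch`. Weaviate evaluates the filter server-side against its own index, so its tokenization and case handling may differ from this sketch. The `themes` list and the `like` helper are hypothetical names used only for this illustration.

```python
from fnmatch import fnmatchcase

# Theme values copied from the nodes defined earlier in this notebook.
themes = [
    "Friendship",
    "Mafia",
    "Fiction",
    "Totalitarianism",
    "The American DreamChina",
]


def like(value: str, pattern: str) -> bool:
    """Hypothetical local stand-in for a wildcard match; `*` matches any run of characters."""
    return fnmatchcase(value, pattern)


# Keep only the themes that match the wildcard pattern.
matches = [theme for theme in themes if like(theme, "*China*")]
print(matches)
```

Only the deliberately altered "The American DreamChina" value matches, which is consistent with the single object returned by the query above.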


In [16]:
import logging
# Set up the logging configuration
logging.basicConfig(level=logging.INFO, format="%(levelname)s:%(name)s:%(message)s")
logger = logging.getLogger(__name__)

In [22]:
java_directory = 'javacode'
summary_directory = 'summaries'

In [19]:
!pip install javalang

Collecting javalang
  Downloading javalang-0.13.0-py3-none-any.whl.metadata (805 bytes)
Downloading javalang-0.13.0-py3-none-any.whl (22 kB)
Installing collected packages: javalang
Successfully installed javalang-0.13.0


In [20]:
import javalang


# Extract class, method, and variable information from Java code
def extract_java_features(java_code):
    features = {
        "classes": [],
        "methods": [],
        "variables": []
    }
    try:
        # Parse the Java source into a syntax tree
        tree = javalang.parse.parse(java_code)

        # Walk the tree and collect class, method, and variable declarations
        for path, node in tree:
            if isinstance(node, javalang.tree.ClassDeclaration):
                features["classes"].append(node.name)
            elif isinstance(node, javalang.tree.MethodDeclaration):
                features["methods"].append(node.name)
            elif isinstance(node, javalang.tree.VariableDeclarator):
                features["variables"].append(node.name)
    except Exception as e:
        # On a parse failure, log the error and return whatever was collected
        logger.error(f"Error parsing Java code: {e}")

    return features

In [23]:
nodes = []
logger.info("Starting to process summary files.")
for filename in os.listdir(summary_directory):
    if filename.endswith("_summary.txt"):
        logger.info(f"Processing file: {filename}")
        # Build the path to the summary file
        summary_file_path = os.path.join(summary_directory, filename)

        # Build the path to the corresponding Java file
        java_file_path = os.path.join(java_directory, filename.replace(".java_summary.txt", ".java"))

        # Read the summary file
        with open(summary_file_path, 'r', encoding='utf-8') as summary_file:
            summary_content = summary_file.read()

        # Read the original Java file
        with open(java_file_path, 'r', encoding='utf-8') as java_file:
            original_code = java_file.read()

        # Extract code features with javalang
        java_features = extract_java_features(original_code)

        # Convert the class, method, and variable lists to comma-separated strings
        classes_str = ",".join(java_features["classes"])
        methods_str = ",".join(java_features["methods"])
        variables_str = ",".join(java_features["variables"])

        # Create a TextNode whose metadata holds the original code, the code features, and other metadata
        nodes.append(
            TextNode(
                text=summary_content,  # the summary is the TextNode's main content
                metadata={
                    "category": "Code Summary",
                    "filename": filename.replace("_summary.txt", ""),  # Java file name
                    "original_code": original_code,  # original Java code
                    "classes": classes_str,  # class names as a comma-separated string
                    "methods": methods_str,  # method names as a comma-separated string
                    "variables": variables_str  # variable names as a comma-separated string
                }
            )
        )
        # Write the collected content to a text file
        output_file_path = os.path.join(summary_directory, f"{filename.replace('_summary.txt', '')}_features.txt")
        with open(output_file_path, 'w', encoding='utf-8') as output_file:
            output_file.write(f"Summary:\n{summary_content}\n\n")
            output_file.write(f"Original Code:\n{original_code}\n\n")
            output_file.write(f"Classes:\n{classes_str}\n\n")
            output_file.write(f"Methods:\n{methods_str}\n\n")
            output_file.write(f"Variables:\n{variables_str}\n")
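
The loop above relies on a naming convention: each summary file is named after its Java source plus a `_summary.txt` suffix, for example `DataStorage.java_summary.txt` for `DataStorage.java`. Here is a minimal sketch of that convention in isolation, so it can be checked without touching the file system. The helper name `paths_for_summary` and the `SUMMARY_SUFFIX` constant are hypothetical; the directory names are the ones assigned earlier in this notebook. Unlike the loop, the sketch strips the suffix by slicing, so a summary file without `.java` in its name maps differently.

```python
import os

SUMMARY_SUFFIX = "_summary.txt"


def paths_for_summary(filename, java_directory="javacode", summary_directory="summaries"):
    """Return (java_name, java_path, features_path) for a summary filename.

    Returns None when the filename does not follow the `_summary.txt` convention.
    """
    if not filename.endswith(SUMMARY_SUFFIX):
        return None
    # Strip the suffix to recover the Java file name, e.g. "DataStorage.java".
    java_name = filename[: -len(SUMMARY_SUFFIX)]
    java_path = os.path.join(java_directory, java_name)
    features_path = os.path.join(summary_directory, f"{java_name}_features.txt")
    return java_name, java_path, features_path


print(paths_for_summary("DataStorage.java_summary.txt"))
print(paths_for_summary("notes.txt"))
```

Checking the mapping up front makes it easier to spot a summary file whose Java source is missing before `open` raises `FileNotFoundError` partway through the loop.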


In [26]:
from llama_index.core import StorageContext

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Java_Vec_DB"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
logger.info("Creating vector index and storing it in Weaviate.")
index = VectorStoreIndex(nodes, storage_context=storage_context)
logger.info("Vector index created and stored in Weaviate.")

In [28]:
from weaviate.classes.query import Filter

# Fetch the collection that backs the Java summary index
java_collection = client.collections.get("Java_Vec_DB")
response = java_collection.query.fetch_objects(
    filters=Filter.by_property("classes").like("*OpenHelper*"),
    limit=1,
)

for o in response.objects:
    print(o.properties)

{'_node_type': 'TextNode', 'document_id': 'None', 'relationships': None, 'classes': 'DataStorage,OpenHelper', 'ref_doc_id': 'None', 'filename': 'DataStorage.java', 'doc_id': 'None', 'text': " This Java code defines a `DataStorage` class that is responsible for managing data storage in the context of an Android application, specifically using SQLite database. The primary functionality of this class is to store, retrieve, and manage data related to delays (represented as a number and value). Here's a summary of the main methods and behaviors:\n\n  1. `DataStorage(Context context2)`: Constructor that initializes an instance of DataStorage with the provided Android Context object.\n\n  2. `SendSavedMessages()`: Sends saved messages to a remote server by querying all records from the `delay_data` table, sending each message using the `WebManager.MakeHttpRequest` function, and deleting the sent message after it has been processed. The method returns the number of successfully sent messages.\

In [41]:
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo


vector_store_info = VectorStoreInfo(
    content_info="Summarized Java code snippets with detailed feature information stored in metadata.",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description="Indicates the type of document, specifically as a 'Code Summary'."
        ),
        MetadataInfo(
            name="filename",
            type="str",
            description="Stores the name of the original Java file associated with the summary."
        ),
        MetadataInfo(
            name="original_code",
            type="str",
            description="The full, original Java code as stored in the metadata field."
        ),
        MetadataInfo(
            name="classes",
            type="str",
            description="Class names found within the Java file, as a comma-separated string."
        ),
        MetadataInfo(
            name="methods",
            type="str",
            description="Method names found within the Java file, as a comma-separated string."
        ),
        MetadataInfo(
            name="variables",
            type="str",
            description="Variable names found within the Java file, as a comma-separated string."
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index, vector_store_info=vector_store_info
)

In [43]:
query = "Please provide the content that is relevant to HTTP requests"

logger.info(f"Executing query: {query}")
response = retriever.retrieve(query)
print(response[0])

ValueError: Filter operator contains not supported
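
The error above suggests that the auto-retriever inferred a `contains` filter, which this version of the Weaviate vector store integration does not translate into a Weaviate filter. One possible workaround, sketched here under assumptions, is to retrieve without metadata filters and then apply the substring check in Python. The sketch works on plain dictionaries shaped like the metadata built earlier. `filter_by_contains` is a hypothetical helper, not a LlamaIndex or Weaviate API, and the second record is invented purely for illustration.

```python
def filter_by_contains(records, key, needle):
    """Keep records whose comma-separated metadata field `key` has an item containing `needle`.

    The comparison is case-insensitive; records without `key` are dropped.
    """
    needle = needle.lower()
    kept = []
    for record in records:
        items = record.get(key, "").split(",")
        if any(needle in item.strip().lower() for item in items):
            kept.append(record)
    return kept


# Metadata dictionaries shaped like the ones built earlier in this notebook.
records = [
    # Values taken from the fetch_objects output shown above.
    {"filename": "DataStorage.java", "classes": "DataStorage,OpenHelper"},
    # Hypothetical second record, included only so the filter has something to exclude.
    {"filename": "Example.java", "classes": "Example"},
]

hits = filter_by_contains(records, "classes", "openhelper")
print([record["filename"] for record in hits])
```

With real results, the same check could be applied to `result.node.metadata` for each item returned by an unfiltered `index.as_retriever().retrieve(query)` call. The trade-off is that filtering happens after the top-k cut, so matching nodes ranked below it are missed.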