# OceanBase

> [OceanBase Database](https://github.com/oceanbase/oceanbase) is a distributed relational database. It is developed entirely by Ant Group. The OceanBase Database is built on a common server cluster. Based on the Paxos protocol and its distributed structure, the OceanBase Database provides high availability and linear scalability. The OceanBase Database is not dependent on specific hardware architectures.

This notebook describes in detail how to use the OceanBase vector store functionality.

## Setup

First, download the partner package:

In [None]:
%pip install --upgrade --quiet pyobvector

Then you can deploy a standalone OceanBase server with `docker`:

In [None]:
!docker run --name=ob433 -e MODE=slim -p 2881:2881 -d oceanbase/oceanbase-ce:4.3.3.0-100000132024100711
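Because the container is started with `-d`, the command returns immediately and the server may not accept connections yet when the next cell runs. A minimal sketch of a readiness check follows. The helper `wait_for_port` is hypothetical (it is not part of `pyobvector`), and an open TCP port is only a rough signal: it does not guarantee that the database has finished bootstrapping.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example call for the container above (2881 is the published port):
# wait_for_port("127.0.0.1", 2881)
```

If the call returns `False`, inspecting the container logs is usually the next step before retrying.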

Check the connection to OceanBase and set the memory usage ratio for vector data:

In [None]:
from pyobvector import ObVecClient

tmp_client = ObVecClient()
tmp_client.perform_raw_text_sql(
    "ALTER SYSTEM ob_vector_memory_limit_percentage = 30"
)

## Instantiation

Configure the API key of the embedding model. Here we use `DashScopeEmbeddings` as an example. If you deployed `OceanBase` with the Docker image as described above, use the script below as-is to set the `host`, `port`, `user`, `password`, and `db_name`. For other deployment methods, set these parameters to match your environment.

In [1]:
import os

DASHSCOPE_API = os.environ.get("DASHSCOPE_API_KEY", "")
connection_args = {
    "host": "127.0.0.1",
    "port": "2881",
    "user": "root@test",
    "password": "",
    "db_name": "test",
}
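For deployments other than the local Docker container, one option is to read the connection parameters from environment variables instead of hard-coding them. The sketch below is only an illustration: the helper `connection_args_from_env` and the variable names `OB_HOST`, `OB_PORT`, `OB_USER`, `OB_PASSWORD`, and `OB_DB_NAME` are assumptions made for this example, not names defined by LangChain or `pyobvector`. The defaults mirror the Docker values shown above.

```python
import os


def connection_args_from_env(env=os.environ) -> dict:
    """Build connection_args, falling back to the local Docker defaults."""
    return {
        "host": env.get("OB_HOST", "127.0.0.1"),
        "port": env.get("OB_PORT", "2881"),
        "user": env.get("OB_USER", "root@test"),
        "password": env.get("OB_PASSWORD", ""),
        "db_name": env.get("OB_DB_NAME", "test"),
    }


# With no variables set, this reproduces the dictionary used in this notebook.
connection_args = connection_args_from_env()
```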

Create the embedding model, then load the sample document and split it into chunks:

In [2]:
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.vectorstores import OceanBase
from langchain_text_splitters import CharacterTextSplitter

embeddings = DashScopeEmbeddings(
    model="text-embedding-v1", dashscope_api_key=DASHSCOPE_API
)
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

Connect to the `OceanBase` server and set the percentage of memory that the vector index may occupy:

In [4]:
DEMO_TABLE_NAME = "demo_ann"
ob = OceanBase(
    embedding_function=embeddings,
    table_name=DEMO_TABLE_NAME,
    connection_args=connection_args,
    drop_old=True,
    normalize=True,
)
ob.obvector.perform_raw_text_sql("ALTER SYSTEM ob_vector_memory_limit_percentage = 30")

<sqlalchemy.engine.cursor.CursorResult at 0x1145ea2e0>

## Manage vector store

### Add items to OceanBase

In [5]:
res = ob.add_documents(documents=docs)
id_for_deletes = res[:10]

### Delete items from OceanBase

In [6]:
print(id_for_deletes)
res = ob.obvector.perform_raw_text_sql(f"SELECT COUNT(*) from {DEMO_TABLE_NAME}")
print(f"Before delete: {[r for r in res][0]}")
ob.delete(ids=id_for_deletes)
res = ob.obvector.perform_raw_text_sql(f"SELECT COUNT(*) from {DEMO_TABLE_NAME}")
print(f"After delete: {[r for r in res][0]}")

['06ea5b80-af4d-42eb-aef2-9b25c50cc264', 'ea589a9b-953d-4361-bc53-cd931f46f861', '3d2fe189-8d8e-4b27-9e0b-ff6e64368eba', 'caee93e6-95f3-480d-bdf6-7952fc2ce629', '5f8cd9d4-2bff-45b8-8af3-a92c512ede44', '88ddcd62-d740-4272-a911-9dd799bdd21b', '7b9eafff-2ba5-4aa5-b864-74c78c47c33c', '5e5b5d1b-562d-4dda-b121-a4780b312ce0', '2b967c6f-a3d9-47b3-beea-9a369f1cce02', '062566b4-9302-4993-b083-2c4ae420af09']
Before delete: (42,)
After delete: (32,)


### Query from OceanBase

Note that `OceanBase` currently only supports two vector distance functions: Euclidean distance (`l2`) and inner product distance (`ip`), and uses Euclidean distance by default.
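To make the two distance functions concrete, here is a small pure-Python sketch that does not touch the database. It assumes that `normalize=True` in the constructor means vectors are scaled to unit length; under that assumption the squared Euclidean distance and the inner product are related by `||a - b||^2 = 2 - 2 * <a, b>`, so both produce the same ranking. The helper names are illustrative only.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit vectors the two quantities below agree up to rounding error.
lhs = l2_distance(a, b) ** 2
rhs = 2 - 2 * inner_product(a, b)
print(lhs, rhs)
```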

In [7]:
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = ob.similarity_search_with_score(query, k=3)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  1.204783671324283
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------

### Filter with metadata

> Note that `OceanBase` currently supports only post-filtering: the metadata filter is applied after the approximate nearest neighbor search has run.

When using `OceanBase` as a vector store, you can write the filter directly as a SQL-compatible boolean expression.

In [8]:
ob.add_texts(
    texts=[
        "OceanBase Database is a native, enterprise-level distributed database developed independently by the OceanBase team.",
        "OceanBase Database is highly compatible with most general features of Oracle and MySQL, and supports advanced features such as procedural language and triggers.",
    ],
    metadatas=[
        {"id": 111},
        {"id": 222},
    ],
)

['2799546c-28fc-4924-a909-91af22c726bd',
 '43099ff6-f0b1-47ab-ab1a-44b2fdb368fc']

In [9]:
docs_with_score = ob.similarity_search_with_score(
    "What is OceanBase", fltr="metadata->'$.id' = 111"
)
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.5886369054575459
OceanBase Database is a native, enterprise-level distributed database developed independently by the OceanBase team.
--------------------------------------------------------------------------------
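One practical consequence of post-filtering is that a query can return fewer than `k` results, because the filter only sees the rows that the nearest neighbor search already selected. The toy sketch below illustrates the idea in plain Python. It uses exact search over hypothetical in-memory rows rather than OceanBase's approximate index, so it shows the order of operations, not the actual implementation.

```python
import math

# Hypothetical toy rows: (row id, vector, metadata)
rows = [
    ("a", [0.0, 0.0], {"id": 111}),
    ("b", [0.1, 0.0], {"id": 222}),
    ("c", [0.2, 0.0], {"id": 222}),
    ("d", [0.9, 0.0], {"id": 111}),
]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_filter_search(query, k, predicate):
    """Take the k nearest rows first, then drop those failing the predicate."""
    nearest = sorted(rows, key=lambda r: l2(r[1], query))[:k]
    return [r[0] for r in nearest if predicate(r[2])]


# Row "d" matches the filter but is not among the 3 nearest rows,
# so only "a" survives.
hits = post_filter_search([0.0, 0.0], k=3, predicate=lambda m: m["id"] == 111)
print(hits)  # ['a']
```

If a filtered query returns too few rows, increasing `k` gives the filter more candidates to work with.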


### Using as a Retriever

You can turn the `OceanBase` vector store into a retriever for broader use within LangChain.

In [13]:
retriever = ob.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 2, "score_threshold": 0.4},
)
output = retriever.invoke("What is OceanBase")
for r in output:
    print("-" * 80)
    print(r.page_content)
    # print(r.metadata)
    print("-" * 80)

--------------------------------------------------------------------------------
OceanBase Database is a native, enterprise-level distributed database developed independently by the OceanBase team.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
OceanBase Database is highly compatible with most general features of Oracle and MySQL, and supports advanced features such as procedural language and triggers.
--------------------------------------------------------------------------------
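The `similarity_score_threshold` search type keeps only results whose relevance score reaches `score_threshold`, up to `k` documents. The sketch below shows just that filtering step with hypothetical scores, assuming relevance values in [0, 1] where higher means more relevant. How `OceanBase` distances are mapped to relevance scores is not covered here, and `filter_by_threshold` is an illustrative helper, not a LangChain function.

```python
def filter_by_threshold(scored_docs, score_threshold, k):
    """Keep at most k (doc, relevance) pairs with relevance >= score_threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(doc, s) for doc, s in ranked[:k] if s >= score_threshold]


# Hypothetical relevance scores; higher means more relevant.
candidates = [("doc A", 0.71), ("doc B", 0.55), ("doc C", 0.32)]
kept = filter_by_threshold(candidates, score_threshold=0.4, k=2)
print(kept)  # [('doc A', 0.71), ('doc B', 0.55)]
```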


## Advanced Usage

You can access more functions of the `pyobvector` SDK through `ob.obvector`. For details, see [pyobvector](https://github.com/oceanbase/pyobvector).