# Example of Embedding

This example generates text embeddings with OpenAI, stores them in TiDB, and searches them by vector distance, using `tidb_vector_python` as its library.

## Install Dependencies

In [None]:
%%capture
%pip install openai peewee pymysql tidb_vector

## Prepare the Environment

> **Note:**
>
> - You can get the `OPENAI_API_KEY` from [OpenAI](https://platform.openai.com/docs/quickstart).
> - You can get the `TIDB_HOST`, `TIDB_USERNAME`, and `TIDB_PASSWORD` from the TiDB Cloud console, as described in the [Prerequisites](../README.md#prerequisites) section.

Set the embedding model to `text-embedding-3-small` and the number of embedding dimensions to `1536`.

In [None]:
import getpass

OPENAI_API_KEY = getpass.getpass("Enter your OpenAI API key: ")
TIDB_HOST = input("Enter your TiDB host: ")
TIDB_USERNAME = input("Enter your TiDB username: ")
TIDB_PASSWORD = getpass.getpass("Enter your TiDB password: ")

embedding_model = "text-embedding-3-small"
embedding_dimensions = 1536
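If you prefer not to type credentials interactively, the same settings could be read from environment variables instead. The following is a minimal sketch, not part of the original notebook: the helper name `read_setting` is hypothetical, and it assumes the variables use the same names as in the note above (`OPENAI_API_KEY`, `TIDB_HOST`, `TIDB_USERNAME`, `TIDB_PASSWORD`).

```python
import os


def read_setting(name, environ=None):
    """Return a required setting from a mapping (os.environ by default).

    `read_setting` is a hypothetical helper for illustration only.
    It raises KeyError when the variable is missing or empty, so a
    misconfigured environment fails early with a clear message.
    """
    source = os.environ if environ is None else environ
    value = source.get(name, "")
    if not value:
        raise KeyError(f"Missing required setting: {name}")
    return value


# Demonstration with a stand-in mapping so no real credentials are needed.
fake_environ = {"TIDB_HOST": "example-host", "TIDB_USERNAME": "example-user"}
demo_host = read_setting("TIDB_HOST", fake_environ)
```

In the notebook, `TIDB_HOST = read_setting("TIDB_HOST")` would then take the place of the corresponding `input()` call.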

## Initialize the OpenAI and Database Clients

In [None]:
from openai import OpenAI
from peewee import Model, MySQLDatabase, TextField, SQL
from tidb_vector.peewee import VectorField

client = OpenAI(api_key=OPENAI_API_KEY)
db = MySQLDatabase(
    'test',
    user=TIDB_USERNAME,
    password=TIDB_PASSWORD,
    host=TIDB_HOST,
    port=4000,
    ssl_verify_cert=True,
    ssl_verify_identity=True
)
db.connect()

## Prepare the Context

In this example, the contexts are the documents. Use the OpenAI embedding model to get the embeddings of the documents, and then store them in TiDB.

In [None]:
documents = [
    "TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.",
    "TiFlash is the key component that makes TiDB essentially a Hybrid Transactional/Analytical Processing (HTAP) database. As a columnar storage extension of TiKV, TiFlash provides both good isolation level and strong consistency guarantee.",
    "TiKV is a distributed and transactional key-value database, which provides transactional APIs with ACID compliance. With the implementation of the Raft consensus algorithm and consensus state stored in RocksDB, TiKV guarantees data consistency between multiple replicas and high availability.",
]

class DocModel(Model):
    text = TextField()
    embedding = VectorField(dimensions=embedding_dimensions)

    class Meta:
        database = db
        table_name = "openai_embedding_test"

    def __str__(self):
        return self.text

db.drop_tables([DocModel])
db.create_tables([DocModel])

embeddings = [
    r.embedding
    for r in client.embeddings.create(
        input=documents, model=embedding_model
    ).data
]
data_source = [
    {"text": doc, "embedding": emb}
    for doc, emb in zip(documents, embeddings)
]
DocModel.insert_many(data_source).execute()
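Because the `embedding` column is declared with a fixed number of dimensions, it can help to check the vectors before inserting them. The following is a minimal sketch of such a check, using toy data rather than real OpenAI output. The helper name `build_rows` is hypothetical and is not part of `tidb_vector` or `peewee`.

```python
def build_rows(documents, embeddings, dimensions):
    """Pair each document with its embedding and validate the shapes.

    `build_rows` is a hypothetical helper for illustration only. It
    returns a list of dicts in the same shape as `data_source` above.
    """
    if len(documents) != len(embeddings):
        raise ValueError(
            f"Got {len(documents)} documents but {len(embeddings)} embeddings"
        )
    rows = []
    for doc, emb in zip(documents, embeddings):
        if len(emb) != dimensions:
            raise ValueError(
                f"Expected {dimensions} dimensions, got {len(emb)}"
            )
        rows.append({"text": doc, "embedding": list(emb)})
    return rows


# Toy 3-dimensional vectors, made up for illustration.
toy_docs = ["first document", "second document"]
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_rows = build_rows(toy_docs, toy_embeddings, dimensions=3)
```

With the real data, `build_rows(documents, embeddings, embedding_dimensions)` would produce the same list that `data_source` holds, or raise before anything is written to the table.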

## Initialize the Question Vector

Ask a question, and then use the OpenAI embedding model to get the embedding of the question.

In [None]:
question = "what is TiKV?"
question_embedding = client.embeddings.create(input=question, model=embedding_model).data[0].embedding

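## Retrieve by Cosine Distance of Vectors

Retrieve the relevant documents from TiDB by comparing the embedding of the question with the embeddings of the documents. A smaller cosine distance means a closer match, so the results are sorted by distance in ascending order.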

In [None]:
related_docs = DocModel.select(
    DocModel.text, DocModel.embedding.cosine_distance(question_embedding).alias("distance")
).order_by(SQL("distance")).limit(3)

print("Question:", question)
print("Related documents:")
for doc in related_docs:
    print(doc.distance, doc.text)
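To make the ranking concrete, the following is a minimal pure-Python sketch of cosine distance, defined as one minus the cosine similarity of two vectors. It only illustrates the metric on toy vectors and is not how TiDB computes it internally. The function and variable names are chosen for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional vectors, made up for illustration.
query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending by distance, mirroring the ORDER BY in the query above.
ranked = sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

The distance depends only on the angle between the vectors, not on their lengths, which is why `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.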

## Cleanup

In [None]:
db.close()