## Vector Database setup

Remove any old embedded Weaviate data files so the notebook starts from a clean state.

In [None]:
!rm -rf ~/.local/share/weaviate


### Step 1 - Download sample data

In [None]:
import requests
import json

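# Download the sample data
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
resp.raise_for_status()  # fail early if the download did not succeed
data = json.loads(resp.text)  # parse the JSON payload

# Preview the type and size of the parsed data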
print(type(data), len(data))

def json_print(data):
    print(json.dumps(data, indent=4))

json_print(data[0])
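
Before importing, it can help to confirm that every record carries the three fields used later (`Answer`, `Question`, `Category`). The following is a minimal sketch; `find_incomplete_records` is a hypothetical helper, and the sample records are made up to mimic the shape of the downloaded data.

```python
# Hypothetical helper: report records that lack a field needed for the import.
REQUIRED_KEYS = {"Answer", "Question", "Category"}

def find_incomplete_records(records, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for records lacking a required key."""
    problems = []
    for index, record in enumerate(records):
        missing = set(required_keys) - set(record)
        if missing:
            problems.append((index, sorted(missing)))
    return problems

# Made-up records; the second one deliberately lacks "Answer"
sample = [
    {"Category": "Culture", "Question": "Leonardo da Vinci was born in this country.", "Answer": "Italy"},
    {"Category": "Culture", "Question": "A record without an answer"},
]
print(find_incomplete_records(sample))
```

An empty list means every record has the expected fields, so `find_incomplete_records(data)` could be run on the real data right after the download.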

### Step 2 - Create an embedded instance of Weaviate vector database

In [None]:
import weaviate, os
from weaviate import EmbeddedOptions
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-BaseURL": os.environ['OPENAI_API_BASE'],
        "X-OpenAI-Api-Key": openai.api_key  # key loaded from OPENAI_API_KEY above
    }
)
print(f"Client created? {client.is_ready()}")
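
The cell above reads `OPENAI_API_KEY` and `OPENAI_API_BASE` straight from `os.environ`, which raises a bare `KeyError` when a variable is missing. A small guard can give a clearer message. This is only a sketch: `require_env` is a hypothetical helper and `DEMO_SETTING` is a made-up variable used purely for demonstration.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value

# DEMO_SETTING is a made-up variable, set here only so the example runs anywhere
os.environ["DEMO_SETTING"] = "example-value"
print(require_env("DEMO_SETTING"))
```

In the notebook, `require_env("OPENAI_API_KEY")` could stand in for `os.environ['OPENAI_API_KEY']`.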

Print the instance metadata, which includes the enabled modules (the available vectorizers among them).

In [None]:
json_print(client.get_meta())
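
The metadata is a plain dictionary, so the module names can be pulled out directly instead of being read from the full dump. The sketch below assumes the modules sit under a `"modules"` key; `list_modules` is a hypothetical helper and `example_meta` is a made-up, heavily trimmed stand-in for the real output.

```python
def list_modules(meta):
    """Return the sorted module names found under the "modules" key of a meta dict."""
    return sorted(meta.get("modules", {}))

# Made-up stand-in for the dict returned by client.get_meta()
example_meta = {
    "hostname": "http://localhost",
    "modules": {"text2vec-openai": {}, "generative-openai": {}},
    "version": "x.y.z",
}
print(list_modules(example_meta))
```

With a live client this would be called as `list_modules(client.get_meta())`.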

### Step 3 - Create the Question collection

In [None]:
# Reset the schema. CAUTION: this deletes the existing "Question" collection and all its objects
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # Use OpenAI as the vectorizer
    "moduleConfig": {  # settings for the OpenAI embedding model
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "baseURL": os.environ["OPENAI_API_BASE"]
        }
    }
}

client.schema.create_class(class_obj)

### Step 4 - Load sample data and generate vector embeddings

In [None]:
# reminder for the data structure
json_print(data[0])

Add all the data to the DB.

**NB** The vectors aren't passed in as part of the `properties` of each data object; the vectorizer generates them automatically on import. **Weaviate** generates one vector embedding per object, built from **all of the object's text properties**.<br>
The following strategy is used by default:<br>
* Only vectorize properties that use the text data type.
* Sort properties in alphabetical (a-z) order before concatenating values.
* If `vectorizePropertyName` is true (false by default), prepend the property name to each property value.
* Join the (prepended) property values with spaces.
* Prepend the class name (unless `vectorizeClassName` is false).
* Convert the produced string to lowercase.

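So in this case, as **all the properties are strings**:
1. Sort them alphabetically by property name
2. Concatenate the values with spaces
3. Prepend the class name (`vectorizeClassName` is not disabled in the schema above)
4. Lowercase the resulting string and send it for vectorisation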
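
The default rules listed above can be sketched in a few lines of Python. This is an illustration of the described strategy only, not Weaviate's actual implementation; `build_vectorization_text` and its parameters are hypothetical names, and the sample object reuses values that appear later in this notebook.

```python
def build_vectorization_text(class_name, properties,
                             vectorize_class_name=True,
                             vectorize_property_name=False):
    """Build the string that would be embedded, following the rules listed above."""
    parts = []
    for name in sorted(properties):          # a-z order by property name
        value = properties[name]
        if not isinstance(value, str):       # only text properties are used
            continue
        # optionally prepend the property name to its value
        parts.append(f"{name} {value}" if vectorize_property_name else value)
    text = " ".join(parts)                   # join values with spaces
    if vectorize_class_name:
        text = f"{class_name} {text}"        # prepend the class name
    return text.lower()                      # lowercase the final string

example = {
    "answer": "Italy",
    "question": "Leonardo da Vinci was born in this country.",
    "category": "Culture",
}
print(build_vectorization_text("Question", example))
# question italy culture leonardo da vinci was born in this country.
```

The property order in the dictionary does not matter, because the names are sorted before the values are joined.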

In [None]:
with client.batch.configure(batch_size=5) as batch:
    for i, d in enumerate(data):  # Batch import data
        
        print(f"importing question: {i+1}")

        # build the data object
        # note that no vector is supplied here: the vectorizer
        # generates it on the fly for each "Question" object
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        # add it to the "Question" collection
        batch.add_data_object(
            data_object=properties,
            class_name="Question"
        )
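
As far as I understand the v3 client, `batch_size=5` means the queued objects are sent whenever five have accumulated, and any remainder is sent when the `with` block exits. The grouping can be pictured with a plain chunking helper. This sketch only mimics that grouping and does not talk to Weaviate; `chunked` and the `records` list are hypothetical.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Made-up placeholder objects
records = [f"object-{n}" for n in range(12)]
for number, chunk in enumerate(chunked(records, batch_size=5), start=1):
    print(f"request {number}: {len(chunk)} objects")
```

A larger batch size means fewer round trips to the database and to the embedding service, at the cost of bigger requests.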

Check the total number of records in the DB.

In [None]:
count = client.query.aggregate("Question").with_meta_count().do()
json_print(count)
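
The aggregate response is a nested dictionary, so the count itself can be read out with a small accessor. A minimal sketch: `extract_count` is a hypothetical helper, and `example_response` is a made-up response whose count value is illustrative only.

```python
def extract_count(aggregate_response, class_name):
    """Return the object count from an Aggregate response with a meta count."""
    return aggregate_response["data"]["Aggregate"][class_name][0]["meta"]["count"]

# Made-up response in the shape printed above; the number is illustrative
example_response = {
    "data": {"Aggregate": {"Question": [{"meta": {"count": 10}}]}}
}
print(extract_count(example_response, "Question"))
```

With the live client, `extract_count(count, "Question")` would return the number as an integer, which is convenient for comparisons before and after a delete.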

## Let's extract the vector that represents each question!

In [None]:
# write a query to extract the vector for a question
result = (client.query
          .get("Question", ["category", "question", "answer"])
          .with_additional("vector")
          .with_limit(1)
          .do())

json_print(result)
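
The vector sits under `_additional.vector` of each returned object. The sketch below pulls the vectors out of a response and reports their length; `extract_vectors` is a hypothetical helper, and the four-number vector is made up to keep the example short (real `ada-002` embeddings have 1536 dimensions).

```python
def extract_vectors(get_response, class_name):
    """Return the list of vectors from a Get response that requested `vector`."""
    objects = get_response["data"]["Get"][class_name]
    return [obj["_additional"]["vector"] for obj in objects]

# Made-up response; the vector is shortened to four numbers for readability
example_response = {
    "data": {"Get": {"Question": [
        {"answer": "Italy", "_additional": {"vector": [0.1, -0.2, 0.3, 0.05]}},
    ]}}
}
vectors = extract_vectors(example_response, "Question")
print(len(vectors), "vector(s) of dimension", len(vectors[0]))
```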

## Query time
What is the distance between the `query`: `biology` and the returned objects?

**NB** Since the distance is `cosine distance`, **LOWER IS BETTER**
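
Cosine distance is one minus the cosine similarity of two vectors, so it is 0 for vectors pointing the same way and grows as they diverge. A minimal pure-Python sketch, with `cosine_distance` as a hypothetical helper and tiny made-up vectors:

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The value depends only on the angle between the vectors, not on their lengths.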

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": "biology"})  # semantic search for the concept "biology"
    # this method always requires a dict with "concepts" as the key and a
    # str or list of str as the value
    .with_additional('distance')
    .with_limit(2)
    .do()
)

json_print(response)

Checking the distances for all the objects in the DB

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["animals"]})
    .with_limit(10)
    .with_additional(["distance"])
    .do()
)

json_print(response)

## We can tell the vector database to drop results beyond a threshold distance!

Consider only results within a distance of 0.24

In [None]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["animals"], "distance": 0.24})
    .with_limit(10)
    .with_additional(["distance"])
    .do()
)

json_print(response)
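
The same cut-off can be mimicked on the client side by filtering an unthresholded response on `_additional.distance`. This is a sketch for illustration: `filter_by_distance` is a hypothetical helper, the inclusive comparison is an assumption, and the response with its distances is made up.

```python
def filter_by_distance(get_response, class_name, max_distance):
    """Keep only the objects whose distance does not exceed `max_distance`."""
    objects = get_response["data"]["Get"][class_name]
    return [obj for obj in objects
            if obj["_additional"]["distance"] <= max_distance]

# Made-up response; the distances are illustrative values only
example_response = {
    "data": {"Get": {"Question": [
        {"answer": "closer object", "_additional": {"distance": 0.19}},
        {"answer": "farther object", "_additional": {"distance": 0.27}},
    ]}}
}
kept = filter_by_distance(example_response, "Question", max_distance=0.24)
print([obj["answer"] for obj in kept])
```

Letting the database apply the threshold is usually preferable, since fewer objects travel over the wire.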

## Vector database support for CRUD operations

### `C`reate a new record

In [None]:
# Create an object
object_uuid = client.data_object.create(
    data_object={
        'question': "Leonardo da Vinci was born in this country.",
        'answer': "Italy",
        'category': "Culture"
    },
    class_name="Question"
)

In [None]:
print(object_uuid)

### `R`ead the new record

In [None]:
data_object = client.data_object.get_by_id(object_uuid, class_name="Question")
json_print(data_object)

In [None]:
data_object = client.data_object.get_by_id(
    object_uuid,
    class_name='Question',
    with_vector=True
)

json_print(data_object)

### `U`pdate the new record

In [None]:
# update the answer of the new record (the other properties are left unchanged)
client.data_object.update(
    uuid=object_uuid,
    class_name="Question",
    data_object={
        'answer': "Florence, Italy"
    }
)
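
Only `answer` is passed to the update, yet the record keeps its `question` and `category`. My understanding is that `data_object.update` merges the given properties into the stored object, whereas the client's `data_object.replace` would overwrite the object with exactly what is passed. The difference can be pictured with plain dictionaries; `merge_update` and `replace_update` are hypothetical helpers that only mimic those semantics.

```python
def merge_update(existing, changes):
    """Mimic a merge: keep existing properties, overwrite only the given ones."""
    return {**existing, **changes}

def replace_update(existing, changes):
    """Mimic a replace: the stored properties become exactly `changes`."""
    return dict(changes)

stored = {
    "question": "Leonardo da Vinci was born in this country.",
    "answer": "Italy",
    "category": "Culture",
}
changes = {"answer": "Florence, Italy"}

print(merge_update(stored, changes))
print(replace_update(stored, changes))
```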

Print the updated record

In [None]:
data_object = client.data_object.get_by_id(
    object_uuid,
    class_name='Question',
)

json_print(data_object)

### `D`elete the new record

In [None]:
json_print(client.query.aggregate("Question").with_meta_count().do())

In [None]:
client.data_object.delete(uuid=object_uuid, class_name="Question")

In [None]:
json_print(client.query.aggregate("Question").with_meta_count().do())