<a href="https://colab.research.google.com/github/Nov05/Google-Colaboratory/blob/master/L4_Objects_Vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* 2023-12-05 changed by nov05  
* go to [the lesson](https://learn.deeplearning.ai/vector-databases-embeddings-applications/lesson/5/vector-databases)  
* watch [the video](https://dft3h5i221ap1.cloudfront.net/Weaviate/C1/video/Weaviate_L4.mp4)  

* read [this Medium post on not leaking secrets in Google Colab](https://medium.com/@maxwell.langford/its-2023-stop-leaking-secrets-in-google-colab-part-2-8215d47a76f2)  

In [None]:
!pip install python-dotenv
!pip install --pre -U "weaviate-client==4.*"
!pip install openai
## Successfully installed h11-0.14.0 httpcore-1.0.2 httpx-0.25.2 openai-1.3.7

In [35]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [36]:
from dotenv import load_dotenv
path = '/content/drive/MyDrive/config/20230507_chatgpt.env'
load_dotenv(path)

True

In [None]:
import os
os.getenv('OPENAI_API_KEY') is not None  # confirm the key is loaded without echoing it
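
Echoing a key in a notebook cell stores it in the saved output. A minimal sketch of a safer check, assuming it is acceptable to reveal only the length and the last four characters; `describe_secret` is a hypothetical helper, not part of `python-dotenv`:

```python
import os


def describe_secret(name: str) -> str:
    """Report whether an environment variable is set without printing its value."""
    value = os.environ.get(name)
    if not value:
        return f"{name} is not set"
    # Reveal only the length and the last four characters (assumed to be acceptable).
    tail = value[-4:] if len(value) > 8 else "****"
    return f"{name} is set ({len(value)} chars, ends with ...{tail})"


print(describe_secret("OPENAI_API_KEY"))
```

The cell output then records that the key exists, not what it is.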

## **Vector Database setup**  

Remove old Weaviate DB files

In [38]:
!rm -rf ~/.local/share/weaviate


### **Step 1 - Download sample data**  

In [39]:
import requests
import json

# Download the data
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
data = json.loads(resp.text)  # Load data

# Parse the JSON and preview it
print(type(data), len(data))

def json_print(data):
    print(json.dumps(data, indent=2))

json_print(data[0])

<class 'list'> 10
{
  "Category": "SCIENCE",
  "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
  "Answer": "Liver"
}


### Step 2 - Create an embedded instance of Weaviate vector database

In [40]:
os.environ['OPENAI_API_KEY'] = 'eyJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJhcHAiLCJzdWIiOiIyNjgzNiIsImF1ZCI6IldFQiIsImlhdCI6MTcwMTY4MDQzMCwiZXhwIjoxNzAyMjg1MjMwfQ.TMHaYugUHh29zu4R1RrxMau_68qP8uNc7sRjIkiEsW4'
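# NOTE: the token above and the proxy below are copied from the course environment.
# The proxy host does not resolve from Colab, which causes the "no such host"
# errors in the cells that follow. To run here, keep your own key from the .env
# file and drop this override (or point it at an endpoint you can reach).
os.environ['OPENAI_API_BASE'] = 'http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate'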

In [41]:
import weaviate, os
from weaviate import EmbeddedOptions
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-BaseURL": os.environ['OPENAI_API_BASE'], ## 'http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate'
        "X-OpenAI-Api-Key": openai.api_key  # Replace this with your actual key
    }
)
print(f"Client created? {client.is_ready()}")
## embedded weaviate is already listening on port 8079

embedded weaviate is already listening on port 8079
Client created? True
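
`client.is_ready()` only confirms that the embedded server is up. It says nothing about whether the vectorizer endpoint passed in `X-OpenAI-BaseURL` is reachable. A rough pre-flight check, using only the standard library, could test whether the hostname resolves at all; `host_resolves` is a hypothetical helper introduced here for illustration:

```python
import socket
from urllib.parse import urlparse


def host_resolves(url: str) -> bool:
    """Return True if the hostname in `url` resolves to at least one address."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        # Name lookup failed, e.g. "no such host".
        return False


# The embedded instance listens on a literal IP, so this needs no DNS lookup.
print(host_resolves("http://127.0.0.1:8079"))
```

Calling it with `os.environ['OPENAI_API_BASE']` before importing data would surface a DNS problem up front. A resolving hostname does not guarantee the service behind it works.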


In [42]:
json_print(client.get_meta())

{
  "hostname": "http://127.0.0.1:8079",
  "modules": {
    "generative-openai": {
      "documentationHref": "https://platform.openai.com/docs/api-reference/completions",
      "name": "Generative Search - OpenAI"
    },
    "qna-openai": {
      "documentationHref": "https://platform.openai.com/docs/api-reference/completions",
      "name": "OpenAI Question & Answering Module"
    },
    "ref2vec-centroid": {},
    "reranker-cohere": {
      "documentationHref": "https://txt.cohere.com/rerank/",
      "name": "Reranker - Cohere"
    },
    "text2vec-cohere": {
      "documentationHref": "https://docs.cohere.ai/embedding-wiki/",
      "name": "Cohere Module"
    },
    "text2vec-huggingface": {
      "documentationHref": "https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task",
      "name": "Hugging Face Module"
    },
    "text2vec-openai": {
      "documentationHref": "https://platform.openai.com/docs/guides/embeddings/what-are-embeddings",
      "nam

### Step 3 - Create Question collection

In [43]:
# Reset the schema. CAUTION: this deletes the existing "Question" collection.
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

# Define the collection and let OpenAI vectorize its text properties
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # Use OpenAI as the vectorizer
    "moduleConfig": {
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "baseURL": os.environ["OPENAI_API_BASE"]
        }
    }
}
client.schema.create_class(class_obj)
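
The class definition above always sets `baseURL` from `OPENAI_API_BASE`. A small sketch of building the same dictionary with the override made optional, on the assumption that the module falls back to the provider's default endpoint when `baseURL` is omitted; `build_question_class` is a hypothetical helper and the example URL is a placeholder:

```python
from typing import Optional


def build_question_class(base_url: Optional[str] = None) -> dict:
    """Build the "Question" class definition, adding baseURL only when given."""
    module_config = {
        "model": "ada",
        "modelVersion": "002",
        "type": "text",
    }
    if base_url:
        # Only override the endpoint when one is explicitly supplied.
        module_config["baseURL"] = base_url
    return {
        "class": "Question",
        "vectorizer": "text2vec-openai",
        "moduleConfig": {"text2vec-openai": module_config},
    }


print(build_question_class())
print(build_question_class("http://proxy.example/openai"))  # placeholder URL
```

The result can be passed to `client.schema.create_class(...)` in place of `class_obj`.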

### Step 4 - Load sample data and generate vector embeddings

In [44]:
# reminder for the data structure
json_print(data[0])

{
  "Category": "SCIENCE",
  "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
  "Answer": "Liver"
}
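
The source records use capitalized keys (`Category`, `Question`, `Answer`), while the import cell stores them under lower-case property names. A minimal sketch of that mapping as a reusable function; `to_properties` is a hypothetical helper, not part of the Weaviate client:

```python
def to_properties(record: dict) -> dict:
    """Lower-case the first letter of every key, e.g. 'Category' -> 'category'."""
    return {key[:1].lower() + key[1:]: value for key, value in record.items()}


sample = {
    "Category": "SCIENCE",
    "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
    "Answer": "Liver",
}
print(to_properties(sample))
```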


In [45]:
with client.batch.configure(batch_size=5) as batch:
    for i, d in enumerate(data, start=1):  # Batch import data
        print(f"importing question: {i}")
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        batch.add_data_object(
            data_object=properties,
            class_name="Question"
        )
## {'error': [{'message': 'update vector: send POST request:
## Post "http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings":
## dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no such host'}]}

importing question: 1
importing question: 2
importing question: 3
importing question: 4
importing question: 5
{'error': [{'message': 'update vector: send POST request: Post "http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings": dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no such host'}]}
{'error': [{'message': 'update vector: send POST request: Post "http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings": dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no such host'}]}
{'error': [{'message': 'update vector: send POST request: Post "http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings": dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no such host'}]}
{'error': [{'message': 'update vector: send POST request: Post "http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings": dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no 
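
The first errors appear only after five "importing question" lines, which is consistent with `batch_size=5`: objects are queued and sent in groups. A plain-Python illustration of that grouping follows. It is not the client's internal implementation, and `chunked` is a hypothetical helper:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


question_numbers = list(range(1, 11))  # stands in for the 10 sample questions
batches = list(chunked(question_numbers, 5))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

With 10 objects and a batch size of 5, this produces two groups, so a failure in the vectorizer shows up once per object in each group as it is flushed.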

In [46]:
count = client.query.aggregate("Question").with_meta_count().do()
json_print(count)

{
  "data": {
    "Aggregate": {
      "Question": [
        {
          "meta": {
            "count": 0
          }
        }
      ]
    }
  }
}
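
The count sits several levels deep in the aggregate response. A short sketch of pulling it out, using the response shape shown above; `extract_count` is a hypothetical helper:

```python
def extract_count(aggregate_response: dict, class_name: str) -> int:
    """Return the object count from an Aggregate response with a meta count."""
    groups = aggregate_response["data"]["Aggregate"][class_name]
    return groups[0]["meta"]["count"]


# Same structure as the output above.
response = {"data": {"Aggregate": {"Question": [{"meta": {"count": 0}}]}}}
print(extract_count(response, "Question"))
```

A count of 0 here confirms that none of the 10 objects were stored, because every insert failed at the vectorization step.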


## Let's extract the vector that represents each question!

In [47]:
# write a query to extract the vector for a question
result = (client.query
          .get("Question", ["category", "question", "answer"])
          .with_additional("vector")
          .with_limit(1)
          .do())

json_print(result)

{
  "data": {
    "Get": {
      "Question": []
    }
  }
}


## Query time
What is the distance between the query `biology` and each returned object?
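
To make "distance" concrete, here is a minimal sketch with toy 2-dimensional vectors. It assumes cosine distance (1 minus cosine similarity), which is my understanding of the default metric when the class does not configure one; the vectors are made up and are not real embeddings:

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]                          # toy "query" vector
print(cosine_distance(query, [1.0, 0.0]))   # same direction
print(cosine_distance(query, [0.0, 1.0]))   # orthogonal
print(cosine_distance(query, [-1.0, 0.0]))  # opposite direction
```

Smaller values mean the object is semantically closer to the query.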

In [48]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_additional(["distance"])
    .with_limit(2)
    .do()
)
json_print(response)

{
  "data": {
    "Get": {
      "Question": null
    }
  },
  "errors": [
    {
      "locations": [
        {
          "column": 6,
          "line": 1
        }
      ],
      "message": "explorer: get class: vectorize params: vectorize params: vectorize params: vectorize keywords: remote client vectorize: send POST request: Post \"http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings\": dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no such host",
      "path": [
        "Get",
        "Question"
      ]
    }
  ]
}
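
A failed query does not raise here: the response comes back with `"Question": null` and a separate `errors` list. A small sketch of a guard that turns this into an exception, based on the response shape shown above; `get_objects` is a hypothetical helper:

```python
def get_objects(response: dict, class_name: str) -> list:
    """Return the objects from a Get response, raising if it reports errors."""
    errors = response.get("errors")
    if errors:
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise RuntimeError(f"Query failed: {messages}")
    # A missing or null result is treated as "no objects".
    return response["data"]["Get"][class_name] or []


failed = {
    "data": {"Get": {"Question": None}},
    "errors": [{"message": "remote client vectorize: no such host"}],  # shortened
}
try:
    get_objects(failed, "Question")
except RuntimeError as exc:
    print(exc)
```

Wrapping each query result this way makes a vectorizer outage fail loudly instead of looking like an empty result set.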


In [49]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["animals"]})
    .with_limit(10)
    .with_additional(["distance"])
    .do()
)
json_print(response)

{
  "data": {
    "Get": {
      "Question": null
    }
  },
  "errors": [
    {
      "locations": [
        {
          "column": 6,
          "line": 1
        }
      ],
      "message": "explorer: get class: vectorize params: vectorize params: vectorize params: vectorize keywords: remote client vectorize: send POST request: Post \"http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings\": dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no such host",
      "path": [
        "Get",
        "Question"
      ]
    }
  ]
}


## We can tell the vector database to drop results beyond a distance threshold!

In [50]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_near_text({"concepts": ["animals"], "distance": 0.24})
    .with_limit(10)
    .with_additional(["distance"])
    .do()
)
json_print(response)

{
  "data": {
    "Get": {
      "Question": null
    }
  },
  "errors": [
    {
      "locations": [
        {
          "column": 6,
          "line": 1
        }
      ],
      "message": "explorer: get class: vectorize params: vectorize params: vectorize params: vectorize keywords: remote client vectorize: send POST request: Post \"http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings\": dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no such host",
      "path": [
        "Get",
        "Question"
      ]
    }
  ]
}
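
Because the query above could not run, the effect of the `distance` threshold is sketched here on made-up data. The objects mimic the `_additional.distance` field of a Get response. The distances are illustrative values, not results from this dataset, and treating the boundary as inclusive is an assumption:

```python
def within_distance(objects, max_distance):
    """Keep only objects whose reported distance is at most `max_distance`."""
    return [o for o in objects if o["_additional"]["distance"] <= max_distance]


# Hypothetical results with invented distances, for illustration only.
objects = [
    {"answer": "first", "_additional": {"distance": 0.19}},
    {"answer": "second", "_additional": {"distance": 0.23}},
    {"answer": "third", "_additional": {"distance": 0.31}},
]
kept = within_distance(objects, 0.24)
print([o["answer"] for o in kept])
```

Passing the threshold in the query lets the database do this filtering before the `limit` is applied on the client side.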


## Vector database support for CRUD operations

### Create

In this run the create call below fails, because creating an object also triggers vectorization and the vectorizer endpoint is unreachable:

`UnexpectedStatusCodeException: Creating object! Unexpected status code: 500, with response body: {'error': [{'message': 'update vector: send POST request: Post "http://jupyter-api-proxy.internal.dlai/rev-proxy/for_weaviate/v1/embeddings": dial tcp: lookup jupyter-api-proxy.internal.dlai on 127.0.0.11:53: no such host'}]}.`

In [51]:
# Create an object
object_uuid = client.data_object.create(
    data_object={
        'question': "Leonardo da Vinci was born in this country.",
        'answer': "Italy",
        'category': "Culture"
    },
    class_name="Question"
)

UnexpectedStatusCodeException: ignored

In [None]:
print(object_uuid)

### Read

In [None]:
data_object = client.data_object.get_by_id(object_uuid, class_name="Question")
json_print(data_object)

In [None]:
data_object = client.data_object.get_by_id(
    object_uuid,
    class_name='Question',
    with_vector=True
)

json_print(data_object)

### Update

In [None]:
client.data_object.update(
    uuid=object_uuid,
    class_name="Question",
    data_object={
        'answer': "Florence, Italy"
    }
)

In [None]:
data_object = client.data_object.get_by_id(
    object_uuid,
    class_name='Question',
)

json_print(data_object)

### Delete

In [None]:
json_print(client.query.aggregate("Question").with_meta_count().do())

In [None]:
client.data_object.delete(uuid=object_uuid, class_name="Question")

In [None]:
json_print(client.query.aggregate("Question").with_meta_count().do())
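
Since the CRUD cells could not complete in this run, the intended lifecycle is sketched below with an in-memory stand-in. `InMemoryStore` is hypothetical and is not the Weaviate API. Its `update` merges the given properties into the stored object, which is my assumption about how `client.data_object.update` behaves:

```python
import uuid


class InMemoryStore:
    """Toy stand-in for a collection, keyed by generated UUID strings."""

    def __init__(self):
        self._objects = {}

    def create(self, properties: dict) -> str:
        object_id = str(uuid.uuid4())
        self._objects[object_id] = dict(properties)
        return object_id

    def get_by_id(self, object_id: str):
        obj = self._objects.get(object_id)
        return dict(obj) if obj is not None else None

    def update(self, object_id: str, properties: dict) -> None:
        # Merge: properties not mentioned keep their current values (assumed).
        self._objects[object_id].update(properties)

    def delete(self, object_id: str) -> None:
        self._objects.pop(object_id, None)

    def count(self) -> int:
        return len(self._objects)


store = InMemoryStore()
object_id = store.create({
    "question": "Leonardo da Vinci was born in this country.",
    "answer": "Italy",
    "category": "Culture",
})
store.update(object_id, {"answer": "Florence, Italy"})
print(store.get_by_id(object_id))
print(store.count())
store.delete(object_id)
print(store.count())
```

The count goes from 1 to 0 across the delete, which is what the two aggregate cells above are meant to show once the vectorizer is reachable.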