<a href="https://colab.research.google.com/github/Decoding-Data-Science/airesidency/blob/main/weavite_Vector_embedding23_jan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Weaviate quickstart guide (as a notebook!)

This notebook will guide you through the basics of Weaviate. You can find the full documentation [on our site here](https://weaviate.io/developers/weaviate/quickstart).


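You will need the Weaviate Python client. If you don't have it installed yet, you can install it with the command below. The version range pins the v3 client, whose API (`weaviate.Client`, `client.query.get(...)`) is used throughout this notebook.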

In [1]:
pip install "weaviate-client>=3.26.7,<4.0.0"

Collecting weaviate-client<4.0.0,>=3.26.7
  Downloading weaviate_client-3.26.7-py3-none-any.whl.metadata (3.4 kB)
Downloading weaviate_client-3.26.7-py3-none-any.whl (120 kB)
Installing collected packages: weaviate-client
  Attempting uninstall: weaviate-client
    Found existing installation: weaviate-client 4.10.4
    Uninstalling weaviate-client-4.10.4:
      Successfully uninstalled weaviate-client-4.10.4
Successfully installed weaviate-client-3.26.7


### Weaviate instance

For this, you will need a working instance of Weaviate somewhere. We recommend either:
- Creating a free sandbox instance on Weaviate Cloud Services (https://console.weaviate.cloud/), or
- Using [Embedded Weaviate](https://weaviate.io/developers/weaviate/installation/embedded).

Instantiate the client using **one** of the following code examples:

#### For using WCS

NOTE: Before you do this, you need to create the instance in WCS and get the credentials. Please refer to the [WCS Quickstart guide](https://weaviate.io/developers/wcs/quickstart).
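
The notebook itself continues with Embedded Weaviate, so no WCS cell is shown. As a rough sketch, a v3 client connection to a WCS sandbox might look like the following. The environment variable names `WCS_URL`, `WCS_API_KEY`, and `OPENAI_API_KEY` and both helper functions are hypothetical, and `weaviate` is imported lazily so that the settings helper can be checked without a running instance.

```python
import os


def wcs_settings(env=os.environ):
    """Collect connection settings from environment variables (names are assumptions)."""
    missing = [key for key in ("WCS_URL", "WCS_API_KEY") if not env.get(key)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    headers = {}
    if env.get("OPENAI_API_KEY"):
        # Same header that the embedded example passes for the inference API
        headers["X-OpenAI-Api-Key"] = env["OPENAI_API_KEY"]
    return {"url": env["WCS_URL"], "api_key": env["WCS_API_KEY"], "headers": headers}


def connect_wcs(settings):
    """Create a v3 client for a WCS instance; requires weaviate-client<4."""
    import weaviate  # imported here so wcs_settings() works without the package

    return weaviate.Client(
        url=settings["url"],
        auth_client_secret=weaviate.AuthApiKey(api_key=settings["api_key"]),
        additional_headers=settings["headers"],
    )
```

Calling `connect_wcs(wcs_settings())` would then return a client that can be used in place of the embedded one in the cells below.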

#### For using Embedded Weaviate

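This will spin up a Weaviate instance in the background. The next cells first install the legacy `openai` 0.28 SDK and use its `openai.ChatCompletion` interface to confirm that your OpenAI API key works, since the same key is passed to Weaviate afterwards.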

In [2]:
pip install openai==0.28



In [3]:
import os
import openai

openai.api_key = os.environ.get("OPENAI_API_KEY", "sk-proj-")  # set OPENAI_API_KEY or paste your own OpenAI key here

openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": "Hello ChatGPT, does this work?"}
  ]
  )

<OpenAIObject chat.completion id=chatcmpl-AstTAsJQaIjLjqXJEuOlcK7XSnn2b at 0x7c7282f2e3f0> JSON: {
  "id": "chatcmpl-AstTAsJQaIjLjqXJEuOlcK7XSnn2b",
  "object": "chat.completion",
  "created": 1737646300,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! Yes, it works. How can I assist you today?",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 15,
    "total_tokens": 31,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": null
}

In [4]:
OPENAI_APIKEY = openai.api_key

In [5]:
# For using embedded
import weaviate
from weaviate.embedded import EmbeddedOptions
import json
import os

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers = {
        "X-OpenAI-Api-Key": openai.api_key # Replace with your inference API key
    }
)

Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.23.0/weaviate-v1.23.0-Linux-amd64.tar.gz
Started /root/.cache/weaviate-embedded: process ID 8903


### Create a class

In [6]:
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

In [7]:
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {},
        "generative-openai": {}
    }
}

client.schema.create_class(class_obj)

In [8]:
class_obj

{'class': 'Question',
 'vectorizer': 'text2vec-openai',
 'moduleConfig': {'text2vec-openai': {}, 'generative-openai': {}}}
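
The class above declares no properties, so Weaviate infers them from the imported objects (auto-schema). If you prefer to declare them up front, a sketch along these lines could be used. The `build_question_class` helper is hypothetical, and the property names are the ones used in the import step of this notebook.

```python
def build_question_class(property_names=("answer", "question", "category")):
    """Return a class definition dict with explicit text properties."""
    return {
        "class": "Question",
        "vectorizer": "text2vec-openai",
        "moduleConfig": {
            "text2vec-openai": {},
            "generative-openai": {},
        },
        "properties": [
            {"name": name, "dataType": ["text"]} for name in property_names
        ],
    }


class_obj_explicit = build_question_class()
# With a connected client: client.schema.create_class(class_obj_explicit)
```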

### Add objects

We'll add objects to our Weaviate instance using a batch import process.

We show you two options, where you can either:
- Have Weaviate create vectors, or
- Specify custom vectors.

#### Have Weaviate create vectors (with `text2vec-openai`)

In [9]:
# Load data
import requests
url = 'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json'
resp = requests.get(url)
data = json.loads(resp.text)

# Configure a batch process
with client.batch(
    batch_size=100
) as batch:
    # Batch import all Questions
    for i, d in enumerate(data):
        print(f"importing question: {i+1}")

        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }

        batch.add_data_object(
            properties,
            "Question",
        )

importing question: 1
importing question: 2
importing question: 3
importing question: 4
importing question: 5
importing question: 6
importing question: 7
importing question: 8
importing question: 9
importing question: 10
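
With `batch_size=100`, the client buffers objects and sends them once the buffer is full, then flushes whatever remains when the `with` block exits. The grouping can be pictured with this small stand-alone sketch. `chunked` is a hypothetical helper for illustration, not part of the Weaviate client.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


questions = [f"question {i + 1}" for i in range(10)]
sizes = [len(chunk) for chunk in chunked(questions, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

With only 10 objects and a batch size of 100, the import above goes out as a single request when the block closes.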


### Extract one question and get the length of its vector

In [10]:
result = client.query.get("Question", ['category', 'question', 'answer']).with_additional('vector').with_limit(1).do()



In [11]:
result

{'data': {'Get': {'Question': [{'_additional': {'vector': [-0.0037278521,
       0.014763771,
       -0.0030701933,
       -0.009871594,
       0.0026373463,
       0.017703103,
       -0.010213845,
       -0.018884204,
       -0.02151484,
       -0.025487637,
       -0.020038463,
       0.012656578,
       -0.011106382,
       -0.00867036,
       0.015971716,
       0.034144577,
       0.029366482,
       0.008771022,
       0.027138496,
       -0.0036204793,
       -0.017984957,
       0.029366482,
       0.0014218518,
       -0.017555466,
       0.004586835,
       0.02351466,
       0.008039544,
       -0.012012341,
       -0.021045085,
       0.027205603,
       -0.008287844,
       0.026534522,
       -0.008851551,
       -0.020548485,
       -0.01650858,
       -0.018293655,
       0.0042546503,
       -0.03127235,
       0.022185922,
       -0.025246048,
       0.018991578,
       0.011992209,
       -0.002504808,
       -0.0055599017,
       -0.0138511015,
       -0.00783822,


In [12]:
result['data']['Get']['Question'][0]['question']

"2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"

In [13]:

result['data']['Get']['Question'][0]['answer']

'species'

In [14]:
len(result['data']['Get']['Question'][0]['_additional']['vector'])

1536
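
Indexing deep into the response dict raises a `KeyError` or `IndexError` if a query returns nothing. A small helper, sketched here with the hypothetical name `first_object`, keeps the lookups in one place. The sample response below is made up and only mirrors the nesting shown above.

```python
def first_object(response, class_name):
    """Return the first object of a Get response, or None if there is none."""
    data = response.get("data") or {}
    objects = (data.get("Get") or {}).get(class_name) or []
    return objects[0] if objects else None


# Made-up response with the same nesting as the query result above
sample = {
    "data": {
        "Get": {
            "Question": [
                {
                    "answer": "species",
                    "_additional": {"vector": [0.1, -0.2, 0.3]},
                }
            ]
        }
    }
}

obj = first_object(sample, "Question")
print(obj["answer"])                      # species
print(len(obj["_additional"]["vector"]))  # 3
print(first_object({"data": {"Get": {"Question": []}}}, "Question"))  # None
```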

In [15]:
result['data']['Get']['Question'][0]['_additional']['vector']

[-0.0037278521,
 0.014763771,
 -0.0030701933,
 -0.009871594,
 0.0026373463,
 0.017703103,
 -0.010213845,
 -0.018884204,
 -0.02151484,
 -0.025487637,
 -0.020038463,
 0.012656578,
 -0.011106382,
 -0.00867036,
 0.015971716,
 0.034144577,
 0.029366482,
 0.008771022,
 0.027138496,
 -0.0036204793,
 -0.017984957,
 0.029366482,
 0.0014218518,
 -0.017555466,
 0.004586835,
 0.02351466,
 0.008039544,
 -0.012012341,
 -0.021045085,
 0.027205603,
 -0.008287844,
 0.026534522,
 -0.008851551,
 -0.020548485,
 -0.01650858,
 -0.018293655,
 0.0042546503,
 -0.03127235,
 0.022185922,
 -0.025246048,
 0.018991578,
 0.011992209,
 -0.002504808,
 -0.0055599017,
 -0.0138511015,
 -0.00783822,
 -0.006090055,
 -0.005986038,
 -0.008046255,
 0.012629734,
 0.005519637,
 0.004083525,
 -0.028883304,
 -0.014307436,
 -0.015434851,
 -0.00799928,
 0.010905058,
 -0.013891366,
 -0.003435932,
 -0.0012347881,
 -0.014119534,
 -0.0045633474,
 -0.027594829,
 0.02909805,
 -0.009958834,
 -0.008153628,
 -0.01676359,
 -0.008039544,
 0.0

In [16]:
import json

In [17]:
def json_print(json_data):
    """Pretty-print a JSON-serializable object with two-space indentation."""
    print(json.dumps(json_data, indent=2))


In [18]:
json_print(data)

[
  {
    "Category": "SCIENCE",
    "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
    "Answer": "Liver"
  },
  {
    "Category": "ANIMALS",
    "Question": "It's the only living mammal in the order Proboseidea",
    "Answer": "Elephant"
  },
  {
    "Category": "ANIMALS",
    "Question": "The gavial looks very much like a crocodile except for this bodily feature",
    "Answer": "the nose or snout"
  },
  {
    "Category": "ANIMALS",
    "Question": "Weighing around a ton, the eland is the largest species of this animal in Africa",
    "Answer": "Antelope"
  },
  {
    "Category": "ANIMALS",
    "Question": "Heaviest of all poisonous snakes is this North American rattlesnake",
    "Answer": "the diamondback rattler"
  },
  {
    "Category": "SCIENCE",
    "Question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification",
    "Answer": "species"
  },
  {
    "Category": "SCIENCE",
   

In [19]:
json_print(client.query.aggregate("Question").with_meta_count().do())

{
  "data": {
    "Aggregate": {
      "Question": [
        {
          "meta": {
            "count": 10
          }
        }
      ]
    }
  }
}


In [21]:
json_print(client.query.get("Question",["question","answer"]).with_limit(1).do())

{
  "data": {
    "Get": {
      "Question": [
        {
          "answer": "species",
          "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
        }
      ]
    }
  }
}


#### Specify "custom" vectors (i.e. generated outside of Weaviate)

In [None]:
# # Load data
# import requests
# fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"  # This file includes pre-generated vectors
# url = f'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}'
# resp = requests.get(url)
# data = json.loads(resp.text)

# # Configure a batch process
# with client.batch(
#     batch_size=100
# ) as batch:
#     # Batch import all Questions
#     for i, d in enumerate(data):
#         print(f"importing question: {i+1}")

#         properties = {
#             "answer": d["Answer"],
#             "question": d["Question"],
#             "category": d["Category"],
#         }

#         custom_vector = d["vector"]
#         client.batch.add_data_object(
#             properties,
#             "Question",
#             vector=custom_vector  # Add custom vector
#         )
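
When you supply your own vectors, every vector in a class must have the same number of dimensions. A quick check before importing might look like this. `check_dimensions` is a hypothetical helper, the toy records are made up, and the `vector` key follows the commented cell above.

```python
def check_dimensions(records, key="vector"):
    """Return the shared vector length, or raise if the lengths differ."""
    lengths = {len(record[key]) for record in records}
    if len(lengths) != 1:
        raise ValueError(f"expected one vector length, found {sorted(lengths)}")
    return lengths.pop()


# Toy records for illustration only
toy = [
    {"Answer": "Liver", "vector": [0.1, 0.2, 0.3]},
    {"Answer": "Elephant", "vector": [0.4, 0.5, 0.6]},
]
print(check_dimensions(toy))  # 3
```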

### Queries

#### Semantic search

Let's try a similarity search. We'll use a `nearText` search to look for quiz objects most similar to "biology".

In [22]:
nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text(nearText)
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "answer": "DNA",
                    "category": "SCIENCE",
                    "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
                },
                {
                    "answer": "species",
                    "category": "SCIENCE",
                    "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
                }
            ]
        }
    }
}


The response includes the top 2 objects (due to the limit set) whose vectors are most similar to the word biology.

Notice that even though the word biology does not appear anywhere, Weaviate returns biology-related entries.

This example shows why vector searches are powerful. Vectorized data objects allow for searches based on degrees of similarity, as shown here.
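
To make "degrees of similarity" concrete: Weaviate's default distance metric is cosine distance, which is one minus the cosine similarity of two vectors. The sketch below computes it for made-up three-dimensional vectors; the real `text2vec-openai` vectors in this notebook have 1536 dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors for illustration
query = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]
far = [0.0, 1.0, 0.0]

print(cosine_distance(query, close) < cosine_distance(query, far))  # True
print(cosine_distance(query, far))  # 1.0
```

A search then amounts to ranking stored vectors by this distance to the query vector and returning the closest ones.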

#### Semantic search with a filter
You can add a Boolean filter to your search. For example, let's run the same search, but only look in objects that have a "category" value of "ANIMALS".

In [28]:
nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category",   "_additional { distance }" ])
    .with_near_text(nearText)
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueText": "ANIMALS"
    })
    .with_limit(1)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "_additional": {
                        "distance": 0.22143602
                    },
                    "answer": "the nose or snout",
                    "category": "ANIMALS",
                    "question": "The gavial looks very much like a crocodile except for this bodily feature"
                }
            ]
        }
    }
}


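The response includes only the single closest object (due to the limit of 1) whose vector is most similar to the word biology, taken only from the "ANIMALS" category. The `_additional { distance }` field reports how far that object's vector is from the query.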

Using a Boolean filter allows you to combine the flexibility of vector search with the precision of where filters.
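
Filters can also be combined. In the `where` syntax used by the v3 client, an `And` operator takes a list of `operands`. The helpers below (`equal_text` and `all_of`, both hypothetical names) only build the dict; passing it to `.with_where(...)` is left as a comment because it needs a live client.

```python
def equal_text(prop, value):
    """Build an equality condition on a text property."""
    return {"path": [prop], "operator": "Equal", "valueText": value}


def all_of(*conditions):
    """Combine conditions so that every one of them must match."""
    return {"operator": "And", "operands": list(conditions)}


where_filter = all_of(
    equal_text("category", "ANIMALS"),
    equal_text("answer", "Elephant"),
)
# With a connected client: .with_where(where_filter)
```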

#### Generative search (single prompt)

Next, let's try a generative search, where search results are processed with a large language model (LLM).

Here, we use a `single prompt` query and ask the model to explain each answer in plain terms.

In [26]:
nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text(nearText)
    .with_generate(single_prompt="Explain {answer} as you might to a five-year-old.")
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "_additional": {
                        "generate": {
                            "error": null,
                            "singleResult": "DNA is like a recipe book that tells our bodies how to grow and work. It is made up of tiny instructions that are passed down from our parents to us. Just like how a recipe tells you how to make a cake, DNA tells our bodies how to make us!"
                        }
                    },
                    "answer": "DNA",
                    "category": "SCIENCE",
                    "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
                },
                {
                    "_additional": {
                        "generate": {
                            "error": null,
                            "singleResult": "A species is a group of animals or plants that are simi

We see that Weaviate has retrieved the same results as before. But now it includes an additional, generated text with a plain-language explanation of each answer.
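
In a single prompt, placeholders in curly braces such as `{answer}` are filled with each object's property values, and the model is called once per result. The following stand-alone sketch imitates only the substitution step to show what text each object produces. `fill_prompt` is a hypothetical helper, not Weaviate's implementation.

```python
import re


def fill_prompt(template, properties):
    """Replace each {name} in `template` with the matching property value."""
    def substitute(match):
        name = match.group(1)
        if name not in properties:
            raise KeyError(f"property {name!r} not found in object")
        return str(properties[name])

    return re.sub(r"\{(\w+)\}", substitute, template)


template = "Explain {answer} as you might to a five-year-old."
for obj in [{"answer": "DNA"}, {"answer": "species"}]:
    print(fill_prompt(template, obj))
# Explain DNA as you might to a five-year-old.
# Explain species as you might to a five-year-old.
```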

#### Generative search (grouped task)

In the next example, we will use a grouped task prompt instead to combine all search results and send them to the LLM with a prompt. We ask the LLM to write a tweet about all of these search results.

In [27]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_generate(grouped_task="Write a tweet with emojis about these facts.")
    .with_limit(2)
    .do()
)

print(response["data"]["Get"]["Question"][0]["_additional"]["generate"]["groupedResult"])

🧬 In 1953 Watson & Crick built a model of the molecular structure of DNA, the gene-carrying substance! 🧬

🦉 2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new species of this classification! 🦉 #ScienceFacts #DNA #SpeciesClassification


Generative search sends retrieved data from Weaviate to a large language model, or LLM. This allows you to go beyond simple data retrieval and transform the data into a more useful form, without ever leaving Weaviate.
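
The two generative modes return their text in different places: `singleResult` appears on every object, while `groupedResult` is read from the first object, as in the cell above. A hedged sketch of a helper that collects both from a response follows. `generated_texts` is a hypothetical name and the sample response is made up.

```python
def generated_texts(response, class_name):
    """Return (list of singleResult strings, groupedResult or None)."""
    data = response.get("data") or {}
    objects = (data.get("Get") or {}).get(class_name) or []
    singles, grouped = [], None
    for obj in objects:
        generate = (obj.get("_additional") or {}).get("generate") or {}
        if generate.get("error"):
            raise RuntimeError(f"generation failed: {generate['error']}")
        if generate.get("singleResult") is not None:
            singles.append(generate["singleResult"])
        if grouped is None:
            grouped = generate.get("groupedResult")
    return singles, grouped


# Made-up response shaped like a grouped-task result
sample = {"data": {"Get": {"Question": [
    {"_additional": {"generate": {"error": None, "groupedResult": "A tweet"}}},
    {"_additional": {"generate": None}},
]}}}
print(generated_texts(sample, "Question"))  # ([], 'A tweet')
```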

Well done! In just a few short minutes, you have:

- Created your own vector database with Weaviate (an embedded instance in this notebook),
- Populated it with data objects,
    - Using an inference API, or
    - Using custom vectors,
- Performed searches, including:
    - Semantic search,
    - Semantic search with a filter, and
    - Generative search.

## Next

You can do much more with Weaviate. We suggest trying:

- Examples from our [search how-to](https://weaviate.io/developers/weaviate/search) guides for [keyword](https://weaviate.io/developers/weaviate/search/bm25), [similarity](https://weaviate.io/developers/weaviate/search/similarity), [hybrid](https://weaviate.io/developers/weaviate/search/hybrid), [generative](https://weaviate.io/developers/weaviate/search/generative) searches and [filters](https://weaviate.io/developers/weaviate/search/filters) or
- Learning [how to manage data](https://weaviate.io/developers/weaviate/manage-data), like [reading](https://weaviate.io/developers/weaviate/manage-data/read), [batch importing](https://weaviate.io/developers/weaviate/manage-data/import), [updating](https://weaviate.io/developers/weaviate/manage-data/update), [deleting](https://weaviate.io/developers/weaviate/manage-data/delete) objects or [bulk exporting](https://weaviate.io/developers/weaviate/manage-data/read-all-objects) data.

For more holistic learning, try [Weaviate Academy](https://weaviate.io/developers/academy). We have built free courses for you to learn about Weaviate and the world of vector search.

You can also try a larger, [1,000 row](https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/jeopardy_1k.json) version of the Jeopardy! dataset, or [this tiny set of 50 wine reviews](https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/winemag_tiny.csv).