<a href="https://colab.research.google.com/github/Decoding-Data-Science/airesidency/blob/main/weavite_Vector_embedding29mar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Weaviate quickstart guide (as a notebook!)

This notebook will guide you through the basics of Weaviate. You can find the full documentation [on our site here](https://weaviate.io/developers/weaviate/quickstart).

You will need the Weaviate Python client. If you don't have it installed yet, install it with:

In [28]:
pip install "weaviate-client>=3.26.7,<4.0.0"



### Weaviate instance

For this, you will need a working instance of Weaviate somewhere. We recommend either:
- Creating a free sandbox instance on Weaviate Cloud Services (https://console.weaviate.cloud/), or
- Using [Embedded Weaviate](https://weaviate.io/developers/weaviate/installation/embedded).

Instantiate the client using **one** of the following code examples:

#### For using WCS

NOTE: Before you do this, you need to create the instance in WCS and get the credentials. Please refer to the [WCS Quickstart guide](https://weaviate.io/developers/wcs/quickstart).
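
A minimal sketch of the WCS connection with the v3 Python client installed above. The helper name `connect_to_wcs` and the placeholder URL and keys are illustrative assumptions; substitute the values shown in your own WCS console.

```python
def connect_to_wcs(cluster_url, weaviate_api_key, openai_api_key):
    """Return a v3 weaviate.Client connected to a WCS instance.

    connect_to_wcs is a hypothetical helper, not part of the Weaviate client.
    """
    if not (cluster_url and weaviate_api_key and openai_api_key):
        raise ValueError("cluster URL and both API keys are required")

    # Imported here so the argument check above runs even without the package.
    import weaviate

    return weaviate.Client(
        url=cluster_url,
        auth_client_secret=weaviate.AuthApiKey(api_key=weaviate_api_key),
        additional_headers={"X-OpenAI-Api-Key": openai_api_key},
    )


# Example call (placeholder values, replace with your own):
# client = connect_to_wcs(
#     "https://YOUR-WCS-CLUSTER-URL",
#     "YOUR-WEAVIATE-API-KEY",
#     "YOUR-OPENAI-API-KEY",
# )
```

If you connect this way, skip the Embedded Weaviate client cell and use the returned `client` for the rest of the notebook.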

#### For using Embedded Weaviate

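This will spin up a Weaviate instance in the background. First, install the OpenAI client and load your API key, which Weaviate needs for vectorization and generative search.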

In [29]:
pip install openai==0.28



In [30]:
import openai
from google.colab import userdata  # Secure way to handle secrets

# Fetch the secret key from Colab's secure storage
openai.api_key = userdata.get("openai")

# Now make the API call
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Hello ChatGPT, does this work?"}
    ]
)

print(response.choices[0].message['content'])


Hello! Yes, that works. How can I assist you today?


In [31]:
OPENAI_APIKEY = openai.api_key

In [32]:
# For using embedded
import weaviate
from weaviate.embedded import EmbeddedOptions
import json

client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-Api-Key": openai.api_key  # Inference API key loaded in the cell above
    }
)

embedded weaviate is already listening on port 8079


### Create a class

In [33]:
# Start from a clean slate: delete the class if it already exists
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

In [34]:
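class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # Use OpenAI to vectorize objects and queries
    "moduleConfig": {
        "text2vec-openai": {},   # Default vectorizer settings
        "generative-openai": {}  # Enables the generative searches used later
    }
}

client.schema.create_class(class_obj)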

In [None]:
class_obj

{'class': 'Question',
 'vectorizer': 'text2vec-openai',
 'moduleConfig': {'text2vec-openai': {}, 'generative-openai': {}}}

### Add objects

We'll add objects to our Weaviate instance using a batch import process.

You have two options. You can either:
- Have Weaviate create vectors, or
- Specify custom vectors.

#### Have Weaviate create vectors (with `text2vec-openai`)

In [35]:
# Load data
import requests
url = 'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json'
resp = requests.get(url)
data = json.loads(resp.text)

# Configure a batch process
with client.batch(
    batch_size=100
) as batch:
    # Batch import all Questions
    for i, d in enumerate(data):
        print(f"importing question: {i+1}")

        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }

        # Queue the object on the batch opened by the `with` statement
        batch.add_data_object(
            properties,
            "Question",
        )

importing question: 1
importing question: 2
importing question: 3
importing question: 4
importing question: 5
importing question: 6
importing question: 7
importing question: 8
importing question: 9
importing question: 10


### Extract one question and get the length of its vector

In [38]:
result = (
    client.query
    .get("Question", ["category", "question", "answer"])
    .with_additional("vector")
    .with_limit(1)
    .do()
)



In [39]:
result

{'data': {'Get': {'Question': [{'_additional': {'vector': [-0.010655958,
       -0.010828686,
       -0.013964355,
       -0.028486753,
       0.00031369142,
       ...

In [40]:
result['data']['Get']['Question'][0]['question']

"It's the only living mammal in the order Proboseidea"

In [41]:

result['data']['Get']['Question'][0]['answer']

'Elephant'

In [42]:
len(result['data']['Get']['Question'][0]['_additional']['vector'])

1536
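
The nested lookups above can be wrapped in a small helper. This is a sketch under stated assumptions: `first_object` is a hypothetical name, and the sample response is a toy dict with a made-up 3-dimensional vector that only mirrors the shape of `result`.

```python
def first_object(result, class_name):
    """Return the first object of class_name from a v3 Get response dict."""
    if "errors" in result:
        raise RuntimeError(f"query failed: {result['errors']}")
    objects = result["data"]["Get"][class_name]
    if not objects:
        raise LookupError(f"no {class_name} objects returned")
    return objects[0]


# Toy response with the same shape as `result` above. The vector is made up
# for illustration; the real vectors in this notebook have 1536 dimensions.
sample = {
    "data": {"Get": {"Question": [{
        "_additional": {"vector": [0.1, -0.2, 0.3]},
        "answer": "Elephant",
    }]}}
}

obj = first_object(sample, "Question")
print(obj["answer"], len(obj["_additional"]["vector"]))
```

Checking for an `errors` key first gives a readable message instead of a `KeyError` when a query fails.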

In [43]:
result['data']['Get']['Question'][0]['_additional']['vector']

[-0.010655958,
 -0.010828686,
 -0.013964355,
 -0.028486753,
 0.00031369142,
 ...

In [44]:
import json

In [45]:
def json_print(json_data):
    """Pretty-print a JSON-serializable object."""
    print(json.dumps(json_data, indent=2))


In [46]:
json_print(data)

[
  {
    "Category": "SCIENCE",
    "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
    "Answer": "Liver"
  },
  {
    "Category": "ANIMALS",
    "Question": "It's the only living mammal in the order Proboseidea",
    "Answer": "Elephant"
  },
  {
    "Category": "ANIMALS",
    "Question": "The gavial looks very much like a crocodile except for this bodily feature",
    "Answer": "the nose or snout"
  },
  {
    "Category": "ANIMALS",
    "Question": "Weighing around a ton, the eland is the largest species of this animal in Africa",
    "Answer": "Antelope"
  },
  {
    "Category": "ANIMALS",
    "Question": "Heaviest of all poisonous snakes is this North American rattlesnake",
    "Answer": "the diamondback rattler"
  },
  {
    "Category": "SCIENCE",
    "Question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification",
    "Answer": "species"
  },
  {
    "Category": "SCIENCE",
   

In [47]:
json_print(client.query.aggregate("Question").with_meta_count().do())

{
  "data": {
    "Aggregate": {
      "Question": [
        {
          "meta": {
            "count": 10
          }
        }
      ]
    }
  }
}


In [48]:
json_print(client.query.get("Question",["question","answer"]).with_limit(1).do())

{
  "data": {
    "Get": {
      "Question": [
        {
          "answer": "Elephant",
          "question": "It's the only living mammal in the order Proboseidea"
        }
      ]
    }
  }
}


#### Specify "custom" vectors (i.e. generated outside of Weaviate)
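
A minimal sketch of the custom-vector import, assuming each row carries its own `"vector"` list next to the `"Answer"`, `"Question"` and `"Category"` keys used earlier. The helper name `import_with_custom_vectors` is hypothetical; the `vector` argument of `add_data_object` is the part that hands your own vector to Weaviate.

```python
def import_with_custom_vectors(client, rows, class_name="Question"):
    """Batch-import rows, passing each row's own vector to Weaviate.

    Hypothetical helper; assumes every row has a "vector" list alongside
    the "Answer", "Question" and "Category" keys.
    """
    with client.batch(batch_size=100) as batch:
        for row in rows:
            properties = {
                "answer": row["Answer"],
                "question": row["Question"],
                "category": row["Category"],
            }
            # `vector=` stores the supplied vector for this object instead of
            # asking the vectorizer module to create one.
            batch.add_data_object(properties, class_name, vector=row["vector"])
    return len(rows)
```

If the class still uses `text2vec-openai` at query time, the custom vectors should come from the same model so that `nearText` queries compare like with like.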

### Queries

#### Semantic search

Let's try a similarity search. We'll use nearText search to look for quiz objects most similar to biology.

In [49]:
nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text(nearText)
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "answer": "DNA",
                    "category": "SCIENCE",
                    "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
                },
                {
                    "answer": "species",
                    "category": "SCIENCE",
                    "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
                }
            ]
        }
    }
}


The response lists the top 2 objects (due to the limit set) whose vectors are most similar to the word biology.

Notice that even though the word biology does not appear anywhere, Weaviate returns biology-related entries.

This example shows why vector searches are powerful. Vectorized data objects allow for searches based on degrees of similarity, as shown here.
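
To make "degrees of similarity" concrete, here is a pure-Python sketch of cosine distance, assuming the class uses Weaviate's default cosine metric. The 3-dimensional vectors are made up for illustration and stand in for the 1536-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors, for illustration only.
query = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]
far = [0.0, 1.0, 0.0]

# Rank candidates by distance to the query, nearest first.
ranked = sorted(
    [("far", far), ("close", close)],
    key=lambda item: cosine_distance(query, item[1]),
)
print([name for name, _ in ranked])
```

A vector search does the same ranking at scale: the query text is embedded, and the objects with the smallest distance to it are returned first.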

#### Semantic search with a filter

You can add a Boolean filter to the search. For example, let's run the same search, but only look in objects that have a "category" value of "ANIMALS".

In [51]:
nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_additional(["distance"])
    .with_near_text(nearText)
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueText": "ANIMALS"
    })
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "_additional": {
                        "distance": 0.22143602
                    },
                    "answer": "the nose or snout",
                    "category": "ANIMALS",
                    "question": "The gavial looks very much like a crocodile except for this bodily feature"
                }
            ]
        }
    }
}


The response lists the objects (at most 2, due to the limit set) whose vectors are most similar to the word biology, restricted to the "ANIMALS" category.

Using a Boolean filter allows you to combine the flexibility of vector search with the precision of where filters.
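
Conditions can also be combined. The sketch below builds a `where` filter that requires two conditions at once with the `And` operator; the helper names `equal_text` and `all_of` are hypothetical conveniences, and the second condition is only an example.

```python
def equal_text(prop, value):
    """Build an Equal condition on a text property (hypothetical helper)."""
    return {"path": [prop], "operator": "Equal", "valueText": value}


def all_of(*conditions):
    """Combine conditions so that every one of them must match."""
    return {"operator": "And", "operands": list(conditions)}


where_filter = all_of(
    equal_text("category", "ANIMALS"),
    equal_text("answer", "Elephant"),
)
print(where_filter)

# Pass it to the query builder in place of the single condition:
#     .with_where(where_filter)
```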

#### Generative search (single prompt)

Next, let's try a generative search, where search results are processed with a large language model (LLM).

Here, we use a `single prompt` query and ask the model to explain each answer in plain terms.

In [52]:
nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text(nearText)
    .with_generate(single_prompt="Explain {answer} as you might to a five-year-old.")
    .with_limit(1)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "_additional": {
                        "generate": {
                            "error": null,
                            "singleResult": "DNA is like a set of instructions that tells our bodies how to grow and work. It's like a recipe book that tells our cells what to do and how to make us who we are. Just like how a recipe tells you how to make a cake, DNA tells our bodies how to make us unique and special."
                        }
                    },
                    "answer": "DNA",
                    "category": "SCIENCE",
                    "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
                }
            ]
        }
    }
}


We see that Weaviate has retrieved the same top result as before, but the response now includes an additional, generated text with a plain-language explanation of the answer.
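
The `{answer}` placeholder is filled in once per retrieved object. The sketch below illustrates that idea in plain Python; it is not Weaviate's internal implementation, and `fill_prompt` is a hypothetical name.

```python
import re


def fill_prompt(template, properties):
    """Replace each {name} placeholder with the object's property value."""
    return re.sub(r"\{(\w+)\}", lambda m: str(properties[m.group(1)]), template)


template = "Explain {answer} as you might to a five-year-old."
objects = [{"answer": "DNA"}, {"answer": "species"}]

# One prompt per object, which is why each result gets its own generated text.
for obj in objects:
    print(fill_prompt(template, obj))
```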

#### Generative search (grouped task)

In the next example, we use a `grouped task` prompt instead, which combines all search results and sends them to the LLM with a single prompt. We ask the LLM to write a tweet about all of these search results.

In [54]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_generate(grouped_task="Write a tweet with emojis about these facts.")
    .with_limit(1)
    .do()
)

print(response["data"]["Get"]["Question"][0]["_additional"]["generate"]["groupedResult"])

Did you know that in 1953 Watson & Crick discovered the molecular structure of DNA? 🔬🧬 #Science #DNA #WatsonAndCrick


Generative search sends retrieved data from Weaviate to a large language model, or LLM. This allows you to go beyond simple data retrieval and transform the data into a more useful form, without ever leaving Weaviate.
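
Generated text sits under `_additional.generate` next to an `error` field, as the outputs above show. The following sketch collects the generated strings and raises if the query or the generation step reports an error; `generated_texts` is a hypothetical helper, and the sample dict only mirrors the response shape.

```python
def generated_texts(response, class_name="Question"):
    """Collect singleResult/groupedResult strings from a generative response."""
    if "errors" in response:
        raise RuntimeError(f"query failed: {response['errors']}")
    texts = []
    for obj in response["data"]["Get"][class_name]:
        generate = obj.get("_additional", {}).get("generate") or {}
        if generate.get("error"):
            raise RuntimeError(f"generation failed: {generate['error']}")
        for key in ("singleResult", "groupedResult"):
            if generate.get(key):
                texts.append(generate[key])
    return texts


# Toy response with the same shape as the single-prompt output above.
sample_response = {
    "data": {"Get": {"Question": [{
        "_additional": {"generate": {
            "error": None,
            "singleResult": "DNA is like a set of instructions.",
        }},
        "answer": "DNA",
    }]}}
}

print(generated_texts(sample_response))
```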

Well done! In just a few short minutes, you have:

- Created your own vector database with Weaviate,
- Populated it with data objects,
    - Using an inference API, or
    - Using custom vectors,
- Performed searches, including:
    - Semantic search,
    - Semantic search with a filter, and
    - Generative search.

## Next

You can do much more with Weaviate. We suggest trying:

- Examples from our [search how-to](https://weaviate.io/developers/weaviate/search) guides for [keyword](https://weaviate.io/developers/weaviate/search/bm25), [similarity](https://weaviate.io/developers/weaviate/search/similarity), [hybrid](https://weaviate.io/developers/weaviate/search/hybrid), [generative](https://weaviate.io/developers/weaviate/search/generative) searches and [filters](https://weaviate.io/developers/weaviate/search/filters) or
- Learning [how to manage data](https://weaviate.io/developers/weaviate/manage-data), like [reading](https://weaviate.io/developers/weaviate/manage-data/read), [batch importing](https://weaviate.io/developers/weaviate/manage-data/import), [updating](https://weaviate.io/developers/weaviate/manage-data/update), [deleting](https://weaviate.io/developers/weaviate/manage-data/delete) objects or [bulk exporting](https://weaviate.io/developers/weaviate/manage-data/read-all-objects) data.

For more holistic learning, try [Weaviate Academy](https://weaviate.io/developers/academy). We have built free courses for you to learn about Weaviate and the world of vector search.

You can also try a larger, [1,000 row](https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/jeopardy_1k.json) version of the Jeopardy! dataset, or [this tiny set of 50 wine reviews](https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/winemag_tiny.csv).