## Weaviate quickstart guide (as a notebook!)

This notebook will guide you through the basics of Weaviate. You can find the full documentation [on our site here](https://weaviate.io/developers/weaviate/quickstart).

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/quickstart/blob/main/quickstart_end_to_end.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

You will need the Weaviate Python client. If you don't yet have it installed - you can do so with:

In [1]:
# !pip install -U weaviate-client

### Weaviate instance

For this, you will need a working instance of Weaviate somewhere. We recommend either:
- Creating a free sandbox instance on Weaviate Cloud Services (https://console.weaviate.cloud/), or
- Using [Embedded Weaviate](https://weaviate.io/developers/weaviate/installation/embedded).

Instantiate the client using **one** of the following code examples:

#### For using WCS

NOTE: Before you do this, you need to create the instance in WCS and get the credentials. Please refer to the [WCS Quickstart guide](https://weaviate.io/developers/wcs/quickstart).

In [1]:
# # For using WCS
import json
import os
import weaviate
from weaviate import Client
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

auth_config = weaviate.AuthApiKey(api_key=os.environ['WEAVIATE_API_KEY'])

client = weaviate.Client(
  url="https://test-cluster-nov8-jvhnpt7n.weaviate.network",
  auth_client_secret=auth_config
)

In [2]:
client.is_live(), client.is_ready()

(True, True)

#### For using Embedded Weaviate

This will spin up a Weaviate instance in the background.

In [7]:
# For using embedded
# import weaviate
# from weaviate.embedded import EmbeddedOptions
# import json
# import os

# client = weaviate.Client(
#     embedded_options=EmbeddedOptions(),
#     additional_headers = {
#         "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
#     }
# )

### Import Data

In [3]:
import pandas as pd
import json
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
from preprocessing import FileIO
from typing import Union, List
# data = pd.read_csv('jeopardy_questions.csv', nrows=50000)
# with open('vector_search_applications/data/impact_theory_updated_Nov1.json') as f:
#     data = json.load(f)

In [5]:
data = FileIO().load_parquet('./practice_data/impact_theory_minilm_196.parquet')

Shape of data: (37007, 17)
Memory Usage: 4.55+ MB


In [6]:
# questions = data['question'].values.tolist()
data[0].keys()

dict_keys(['author', 'title', 'video_id', 'playlist_id', 'channel_id', 'description', 'keywords', 'length', 'publish_date', 'thumbnail_url', 'views', 'age_restricted', 'episode_num', 'content', 'content_embedding', 'doc_id', 'episode_url'])

In [7]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [66]:
# vectors = model.encode(questions, show_progress_bar=True, device='cuda:0', batch_size=64)

In [67]:
# data['vectors'] = pd.Series(vectors.tolist())

In [32]:
data_dict = data.to_dict()

In [46]:
data_dict['question'][0]
data_dict['vectors'][0]

[0.014312311075627804,
 -0.023840470239520073,
 0.03354779630899429,
 -0.014589335769414902,
 0.06778962910175323,
 -0.04618564620614052,
 -0.029401959851384163,
 0.008369481191039085,
 -0.09274967759847641,
 -0.021655932068824768,
 0.04514612630009651,
 0.07262410968542099,
 0.010997667908668518,
 0.03870929032564163,
 -0.012866034172475338,
 -0.03947143629193306,
 -0.09030212461948395,
 0.0016469808761030436,
 -0.04250699281692505,
 0.04687390848994255,
 0.009863625280559063,
 0.027880573645234108,
 -0.0014345893869176507,
 0.01776115968823433,
 0.0377359576523304,
 0.02495153434574604,
 0.03829507157206535,
 -0.06030397117137909,
 0.01684751734137535,
 0.05068695545196533,
 0.04673828184604645,
 -0.030108792707324028,
 0.004148408304899931,
 -0.0534181147813797,
 -0.0599793903529644,
 0.03245760127902031,
 0.028064200654625893,
 -0.02351801097393036,
 0.026382220908999443,
 -0.027201006188988686,
 0.0028127955738455057,
 -0.08360270410776138,
 0.061894841492176056,
 0.04155516996979

In [33]:
data_dict = {k:v for k,v in data_dict.items() if k in ['question', 'answer', 'vectors', 'category']}

In [34]:
data_dict.keys()

dict_keys(['category', 'question', 'answer', 'vectors'])

### Create a class

In [13]:
if client.schema.exists("ImpactTheory"):
    client.schema.delete_class("ImpactTheory")

In [14]:
class_obj = {
    "class": "ImpactTheory",
    "vectorizer": "none",  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also. # Ensure the `generative-openai` module is used for generative queries
}

client.schema.create_class(class_obj)

In [None]:
def __init__(self,
                 hosts: Union[str, List[dict]]=[{"host": "localhost", "port": 9200}],
                 http_auth: Tuple[str, str]=('admin', 'admin'),
                 model_name_or_path: str='sentence-transformers/all-MiniLM-L6-v2',
                 use_ssl: bool = True,
                 verify_certs = False,
                 ssl_assert_hostname = False,
                 ssl_show_warn = False,
                 timeout: int=30):
        super().__init__(hosts=hosts,
                         http_auth=http_auth,
                         use_ssl=use_ssl,
                         verify_certs = verify_certs,
                         ssl_assert_hostname = ssl_assert_hostname,
                         ssl_show_warn = ssl_show_warn,
                         timeout=timeout)

Client(
  url="https://test-cluster-nov8-jvhnpt7n.weaviate.network",
  auth_client_secret=auth_config
)

In [138]:
class Weaviate(Client):
    
    def __init__(self, 
                 api_key: str=os.environ['WEAVIATE_API_KEY'],
                 endpoint: str=os.environ['WEAVIATE_ENDPOINT']
                ):
        auth_config = weaviate.AuthApiKey(api_key=api_key)
        super().__init__(auth_client_secret=auth_config,
                         url=endpoint)    
        self.fields = ["title", "content", "docid"]
        
    def show_classes(self):
        return [d['class'] for d in self.cluster.get_nodes_status()[0]['shards']]

    def show_class_info(self):
        return [d for d in self.cluster.get_nodes_status()[0]['shards']]

    def _format_response(self, 
                         response: dict,
                         class_: str
                         ) -> List[dict]:
        results = []
        hits = response['data']['Get'][class_]
        for d in hits:
            temp = {k:v for k,v in d.items() if k != '_additional'}
            if d.get('_additional'):
                for key in d['_additional']:
                    temp[key] = d['_additional'][key]
            results.append(temp)
        return results
        
    def keyword_search(self,
                       query: str,
                       class_: str,
                       properties: List[str]=['content'],
                       limit: int=10,
                       return_raw: bool=False) -> Union[dict, List[dict]]:
        '''
        Executes Keyword (BM25) search. 
        '''
        response = (self.query
                    .get(class_,self.fields)
                    .with_bm25(query=query, properties=properties)
                    .with_additional(['score', "id"])
                    .with_limit(limit)
                    .do()
                    )
        if return_raw:
            return response
        else: 
            return self._format_response(response, class_)

    def hybrid_search(self,
                      query: str,
                      class_: str,
                      properties: List[str]=['content'],
                      limit: int=10,
                      return_raw: bool=False
                     ) -> Union[dict, List[dict]]:
        pass

### Add objects

We'll add objects to our Weaviate instance using a batch import process.

We shows you two options, where you can either:
- Have Weaviate create vectors, or
- Specify custom vectors.

#### Have Weaviate create vectors (with `text2vec-openai`)

In [6]:
# # Load data
# import requests
# url = 'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json'
# resp = requests.get(url)
# data = json.loads(resp.text)

# # Configure a batch process
# with client.batch(
#     batch_size=100
# ) as batch:
#     # Batch import all Questions
#     for i, d in enumerate(data):
#         if i % 2500:
#             print(f"importing question: {i+1}")

#         properties = {
#             "answer": d["Answer"],
#             "question": d["Question"],
#             "category": d["Category"],
#         }

#         client.batch.add_data_object(
#             properties,
#             "Question",
#         )

importing question: 1
importing question: 2
importing question: 3
importing question: 4
importing question: 5
importing question: 6
importing question: 7
importing question: 8
importing question: 9
importing question: 10


#### Specify "custom" vectors (i.e. generated outside of Weaviate)

In [35]:
%%time
# # Load data
# import requests
# fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"  # This file includes pre-generated vectors
# url = f'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}'
# resp = requests.get(url)
# data = json.loads(resp.text)

# Configure a batch process
with client.batch(batch_size=250) as batch:
    # Batch import all Questions
    for i in tqdm(range(len(data_dict['question']))):

        properties = {
            "answer": data_dict["answer"][i],
            "question": data_dict["question"][i],
            "category": data_dict["category"][i],
        }

        custom_vector = data_dict["vectors"][i]
        batch.add_data_object(
            properties,
            "Question",
            vector=custom_vector  # Add custom vector
        )

            Please instead use the `client.batch.configure()` method to configure your batch and `client.batch` to enter the context manager.
            See https://weaviate.io/developers/weaviate/client-libraries/python for details.


  0%|          | 0/50000 [00:00<?, ?it/s]

CPU times: user 19.1 s, sys: 1.07 s, total: 20.2 s
Wall time: 2min 42s


In [16]:
%%time
# # Load data
# import requests
# fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"  # This file includes pre-generated vectors
# url = f'https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}'
# resp = requests.get(url)
# data = json.loads(resp.text)

# Configure a batch process
with client.batch(batch_size=250) as batch:
    # Batch import all Questions
    for i, d in enumerate(tqdm(data)):

        properties = {
            "title": d['title'],
            "content": d['content'],
            "docid": d['doc_id'],
        }

        custom_vector = d['content_embedding']
        batch.add_data_object(
            properties,
            "ImpactTheory",
            vector=custom_vector  # Add custom vector
        )

  0%|          | 0/37007 [00:00<?, ?it/s]

CPU times: user 14.3 s, sys: 870 ms, total: 15.2 s
Wall time: 2min 31s


### Queries

#### Basic Query


In [84]:
response = client.query.get(
    "Question",
    ["question", "answer", 'category']
).with_limit(2).do()

# print(json.dumps(response['data']['Get']['Question']))


[{'answer': 'Fanny Hill',
  'category': 'IMPRISONED AUTHORS',
  'question': 'To escape debtor\'s prison, John Cleland wrote this bawdy book about "A Woman of Pleasure" in 1748'},
 {'answer': 'Key West',
  'category': 'FLORIDA',
  'question': 'Harry Truman chose this Florida island locale for his "Little White House" in 1946'}]

#### BM25 Search

Let's try a similarity search. We'll use nearText search to look for quiz objects most similar to biology.

In [10]:
query = "how do i quit social media"
client.cluster.get_nodes_status()
index = 'Impact_theory_minilm_256'

In [15]:
properties = ['content']
response = (
    client.query
    .get(index,["title", "content", "doc_id", 'thumbnail_url'])
    .with_bm25(query=query, properties=properties)
    .with_additional(['score', "id"])
    .with_limit(3)
    .do()
)

In [16]:
response

{'data': {'Get': {'Impact_theory_minilm_256': [{'_additional': {'id': '92b37c1a-6018-4f9e-b1c9-b6aa420a9eba',
      'score': '6.62025'},
     'content': "in electrical engineering and computer science from MIT. He's the author of six books. His work has been published in more than 20 languages. And he's written 60 peer-reviewed papers that have been cited more than 3,500 times. He's a provost, distinguished professor of computer science at Georgetown University. But, ironically, he's most famous for his views on digital minimalism. His TED Talk on why you should quit social media has been viewed over 5 million times. And his work on this topic and related topics such as doing deep work and avoiding a life of distraction have been read by millions of people around the world. His insights and expertise have made him one of the most sought-after minds on the subject. And he's been featured in most major publications, including The New York Times, The Wall Street Journal, The New Yorker, T

In [95]:
res  = client.keyword_search(query, index)
res

### Vector Search

In [45]:
nearVector = {"vector": model.encode(query)}
properties = ['content']
response = (
    client.query
    .get("ImpactTheory",["title", "content", "docid"])
    .with_near_vector(nearVector)
    .with_additional('score')
    .with_limit(3)
    .do()
)

In [46]:
print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "ImpactTheory": [
                {
                    "_additional": {
                        "score": "0"
                    },
                    "content": "Like if it's not working, if I don't like the way this is working, I should be able to just send everyone an email saying, sorry, you know, you guys broke this place. I don't like the conversation. This is over, right? That's actually something I told Jack Dorsey when he was still running Twitter, that he should just delete it and he'd win the Nobel Peace Prize and he would deserve it. So I think you should be able to, you should be free to delete your social media account if you in fact own it. And you should be free to decide, okay, these are the standards of conduct in this space. Like it's, you know, this is true if you open a restaurant or if you open a movie theater, if you open any public space, it doesn't change if it's merely digital. You should be able to set the terms 

The response includes a list of top 2 (due to the limit set) objects whose vectors are most similar to the word biology.

Notice that even though the word biology does not appear anywhere, Weaviate returns biology-related entries.

This example shows why vector searches are powerful. Vectorized data objects allow for searches based on degrees of similarity, as shown here.

#### Semantic search with a filter
You can add a Boolean filter to your example. For example, let's run the same search, but only look in objects that have a "category" value of "ANIMALS".

In [18]:
nearText = {"concepts": ["biology"]}
ask = {
  "question": "What is an erythrocite",
  "properties": ["question", "answer"]
}
response = (
    client.query
    .get("Question", ["question", "answer", "category", "_additional {answer {hasAnswer property result} }"])
    .with_ask(ask)
    # .with_where({
    #     "path": ["category"],
    #     "operator": "Equal",
    #     "valueText": "ANIMALS"
    # })
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": null
        }
    },
    "errors": [
        {
            "locations": [
                {
                    "column": 6,
                    "line": 1
                }
            ],
            "message": "VectorFromParams was called without any known params present",
            "path": [
                "Get",
                "Question"
            ]
        }
    ]
}


The response includes a list of top 2 (due to the limit set) objects whose vectors are most similar to the word biology - but only from the "ANIMALS" category.

Using a Boolean filter allows you to combine the flexibility of vector search with the precision of where filters.

#### Generative search (single prompt)

Next, let's try a generative search, where search results are processed with a large language model (LLM).

Here, we use a `single prompt` query, and the model to explain each answer in plain terms.

In [10]:
nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text(nearText)
    .with_generate(single_prompt="Explain {answer} as you might to a five-year-old.")
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "_additional": {
                        "generate": {
                            "error": null,
                            "singleResult": "DNA is like a special code that tells our bodies how to grow and work. It's like a recipe book that has all the instructions for making you who you are. Just like a recipe book has different recipes for different foods, DNA has different instructions for making different parts of your body, like your eyes, hair, and even your personality! It's really amazing because it's what makes you unique and special."
                        }
                    },
                    "answer": "DNA",
                    "category": "SCIENCE",
                    "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
                },
                {
                    "_additional": {
            

We see that Weaviate has retrieved the same results as before. But now it includes an additional, generated text with a plain-language explanation of each answer.

#### Generative search (grouped task)

In the next example, we will use a grouped task prompt instead to combine all search results and send them to the LLM with a prompt. We ask the LLM to write a tweet about all of these search results.

In [12]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_generate(grouped_task="Write a tweet with emojis about these facts.")
    .with_limit(2)
    .do()
)

print(response["data"]["Get"]["Question"][0]["_additional"]["generate"]["groupedResult"])

🧬 Did you know? In 1953, Watson & Crick 🧪 built a model of the molecular structure of DNA, the gene-carrying substance! 🧬 #ScienceFacts

🐦🌿 Exciting news! In 2000, a new species 🆕 of sage grouse, the Gunnison sage grouse, was discovered. It's not just another northern sage grouse, but a unique classification! 🦆🌿 #ScienceDiscoveries


Generative search sends retrieved data from Weaviate to a large language model, or LLM. This allows you to go beyond simple data retrieval, but transform the data into a more useful form, without ever leaving Weaviate.

Well done! In just a few short minutes, you have:

- Created your own cloud-based vector database with Weaviate,
- Populated it with data objects,
    - Using an inference API, or
    - Using custom vectors,
- Performed searches, including:
    - Semantic search,
    - Sementic search with a filter and
    - Generative search.