## Preparation

In [2]:
!pip install -U weaviate-client

Defaulting to user installation because normal site-packages is not writeable
Collecting weaviate-client
  Downloading weaviate_client-4.7.1-py3-none-any.whl.metadata (3.3 kB)
Collecting validators==0.33.0 (from weaviate-client)
  Downloading validators-0.33.0-py3-none-any.whl.metadata (3.8 kB)
Collecting authlib<2.0.0,>=1.2.1 (from weaviate-client)
  Downloading Authlib-1.3.2-py2.py3-none-any.whl.metadata (3.9 kB)
Collecting pydantic<3.0.0,>=2.5.0 (from weaviate-client)
  Downloading pydantic-2.8.2-py3-none-any.whl.metadata (125 kB)
Collecting grpcio<2.0.0,>=1.57.0 (from weaviate-client)
  Downloading grpcio-1.66.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting grpcio-tools<2.0.0,>=1.57.0 (from weaviate-client)
  Downloading grpcio_tools-1.66.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting grpcio-health-checking<2.0.0,>=1.57.0 (from weaviate-client)
  Downloading grpcio_health_checking-1.66.0-py3-none-any.w

### Get the data

We'll use a subset of the Jeopardy! quiz dataset:
https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/jeopardy_1k.json


In [3]:
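import json
import os

import requests

DATA_URL = "https://raw.githubusercontent.com/databyjp/wv_demo_uploader/main/weaviate_datasets/data/jeopardy_1k.json"
DATA_FILE = "jeopardy_1k.json"

def load_data():
    # Read a local copy of the dataset
    with open(DATA_FILE, "r") as f:
        return f.read()

def download_data():
    # Fetch the dataset from GitHub and fail loudly on an HTTP error
    response = requests.get(DATA_URL)
    response.raise_for_status()
    return response.text

# Use the local file if it exists, otherwise download the data
json_data = load_data() if os.path.exists(DATA_FILE) else download_data()

# Parse the JSON and preview it
data = json.loads(json_data)
print(type(data), len(data))
print(json.dumps(data[0], indent=2))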

<class 'list'> 1000
{
  "Air Date": "2006-11-08",
  "Round": "Double Jeopardy!",
  "Value": 800,
  "Category": "AMERICAN HISTORY",
  "Question": "Abraham Lincoln died across the street from this theatre on April 15, 1865",
  "Answer": "Ford's Theatre (the Ford Theatre accepted)"
}


## Step 1: Create a Weaviate instance (database)

We'll use Embedded Weaviate - this is a quick way to create a Weaviate database. 

> You can also use:
> - A free sandbox with Weaviate Cloud Services
> - Open-source Weaviate directly, available cross-platform with Docker
> - Or use Kubernetes in production :) 

In [4]:

import os

# Read the OpenAI API key from the environment rather than hard-coding it in the notebook
openai_key = os.environ["OPENAI_APIKEY"]

In [5]:
import weaviate

client = weaviate.Client(
    embedded_options=weaviate.EmbeddedOptions(),
    additional_headers={
        "X-OpenAI-Api-Key": openai_key
    }
)


In [6]:
def jprint(data_in):
    print(json.dumps(data_in, indent=2))

Retrieve Weaviate instance information to check our configuration.

In [9]:
jprint(client.get_meta())

{
  "hostname": "http://127.0.0.1:6666",
  "modules": {
    "generative-openai": {
      "documentationHref": "https://beta.openai.com/docs/api-reference/completions",
      "name": "Generative Search - OpenAI"
    },
    "qna-openai": {
      "documentationHref": "https://beta.openai.com/docs/api-reference/completions",
      "name": "OpenAI Question & Answering Module"
    },
    "ref2vec-centroid": {},
    "text2vec-cohere": {
      "documentationHref": "https://docs.cohere.ai/embedding-wiki/",
      "name": "Cohere Module"
    },
    "text2vec-huggingface": {
      "documentationHref": "https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task",
      "name": "Hugging Face Module"
    },
    "text2vec-openai": {
      "documentationHref": "https://beta.openai.com/docs/guides/embeddings/what-are-embeddings",
      "name": "OpenAI Module"
    }
  },
  "version": "1.19.12"
}


## Step 2: Add data to Weaviate

### Add class definition

The equivalent of a SQL "table" or a NoSQL "collection" is called a "class" in Weaviate.

#### Delete existing data

In case a "Question" class already exists from an earlier run, let's delete it first.

In [10]:
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

And create a new class definition here.
We'll set up a class called "Question" with:
- A "vectorizer" -> which will convert data to vectors, which represent meaning,
- A "generative" module -> which will allow us to use LLMs with our data, and
- Properties to save our quiz data (which are like SQL columns).
    - Just the question and answer for now

#### Tip

> You can get example class definitions in our documentation:
> - https://weaviate.io/developers/weaviate/manage-data/classes#example-class-configurations

In [11]:
class_definition = {
    "class": "Question",
    "vectorizer": "text2vec-openai",
    "vectorIndexConfig": {
        "distance": "cosine",
    },
    "moduleConfig": {
        "generative-openai": {}
    },
    "properties": [
        {
            "name": "question",
            "dataType": ["text"]
        },
        {
            "name": "answer",
            "dataType": ["text"]
        },
    ],
}

client.schema.create_class(class_definition)



Was our class created successfully? Let's take a look.

In [13]:
jprint(client.schema.get("Question"))

{
  "class": "Question",
  "invertedIndexConfig": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    },
    "cleanupIntervalSeconds": 60,
    "stopwords": {
      "additions": null,
      "preset": "en",
      "removals": null
    }
  },
  "moduleConfig": {
    "generative-openai": {},
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text",
      "vectorizeClassName": true
    }
  },
  "properties": [
    {
      "dataType": [
        "text"
      ],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        }
      },
      "name": "question",
      "tokenization": "word"
    },
    {
      "dataType": [
        "text"
      ],
      "indexFilterable": true,
      "indexSearchable": true,
      "moduleConfig": {
        "text2vec-openai": {
          "skip": false,
          "vectorizePropertyName": false
        

### Add data

Now we'll add the actual objects (the equivalent of SQL rows) to our class.

First, let's build objects to add - and take a look at a couple.

In [14]:
for row in data[:2]:
    data_obj = {
        "question": row["Question"],
        "answer": row["Answer"]
    }
    print(data_obj)

{'question': 'Abraham Lincoln died across the street from this theatre on April 15, 1865', 'answer': "Ford's Theatre (the Ford Theatre accepted)"}
{'question': 'Any pigment on the wall so faded you can barely see it', 'answer': 'faint paint'}


> If it all looks fine - let's add objects:
> - https://weaviate.io/developers/weaviate/manage-data/import

In [15]:
with client.batch() as batch:
    for row in data:
        data_obj = {
            "question": row["Question"],
            "answer": row["Answer"]
        }
        batch.add_data_object(
            data_object=data_obj,
            class_name="Question"
        )  

In [16]:
len(data)

1000
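
The row-to-object mapping used in the import loop can also be factored into a small helper. This is only a sketch: `build_objects` is a hypothetical name, not part of the Weaviate client, and skipping incomplete rows is an assumption about how you might want to treat missing fields.

```python
def build_objects(rows):
    """Yield {"question", "answer"} dicts for rows that contain both fields."""
    for row in rows:
        question = row.get("Question")
        answer = row.get("Answer")
        if not question or not answer:
            # Assumption: skip incomplete rows instead of importing empty properties
            continue
        yield {"question": question, "answer": answer}


# Toy rows in the same shape as the Jeopardy! data
sample_rows = [
    {"Question": "Any pigment on the wall so faded you can barely see it", "Answer": "faint paint"},
    {"Question": "A row with no answer"},
]
objects = list(build_objects(sample_rows))
print(objects)
```

Each dict yielded by the helper could then be passed as `data_object` to `batch.add_data_object`, exactly as in the loop above.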

#### Confirm data load

Do we have data? 

Let's get an object count

In [17]:
client.query.aggregate("Question").with_meta_count().do()

{'data': {'Aggregate': {'Question': [{'meta': {'count': 1000}}]}}}

Does the data look right?

Let's grab a few objects from Weaviate!

In [18]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "answer": "Adam",
          "question": "Created from the earth, this man's name in Hebrew means \"red earth\""
        },
        {
          "answer": "buffalo",
          "question": "Animal that was the main staple of the Plains Indians economy"
        },
        {
          "answer": "Rock Band",
          "question": "You can form your own rock band & tour the world with this 2007 video game from MTV games"
        }
      ]
    }
  }
}


Let's pause for a second - because we've done a lot!

#### What did we just do?

Here is a conceptual diagram

![img](https://github.com/weaviate-tutorials/intro-workshop/blob/main/images/object_import_process_full.png?raw=1)

## Step 3: Work with the data

Let's try a few more involved queries

### Filtering (similar to WHERE filter in SQL)

Let's find objects that meet a particular condition.

In [19]:
where_filter = {
    "path": ["answer"],
    "operator": "Like",
    "valueText": "*history*"
}

response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_where(where_filter)
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "answer": "\"A Brief History Of Time In A Bottle\"",
          "question": "Stephen Hawking's 1988 bio of the universe that was a No. 1 hit for Jim Croce"
        }
      ]
    }
  }
}


We can also use multiple filters

In [20]:
where_filter = {
    "operator": "Or",
    "operands": [
        {
            "path": ["answer"],
            "operator": "Like",
            "valueText": "*history*"
        },
        {
            "path": ["question"],
            "operator": "Like",
            "valueText": "*history*"
        },        
    ]
}

response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_where(where_filter)
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "answer": "the Field Museum",
          "question": "What was once the Chicago Natural History Museum is now called this, after its founder"
        },
        {
          "answer": "Greyhound",
          "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
        },
        {
          "answer": "the draft",
          "question": "You're in the Army now--in 1940 FDR instituted the first peacetime one of these in U.S. history"
        }
      ]
    }
  }
}
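
Nested filter dictionaries get verbose quickly, so it can help to build them with small functions. The helpers below (`like_filter` and `combine`) are hypothetical names introduced for illustration; they only produce plain dicts in the same shape as the filters used above.

```python
def like_filter(prop, pattern):
    """Build a single 'Like' condition on one text property."""
    return {"path": [prop], "operator": "Like", "valueText": pattern}


def combine(operator, *operands):
    """Combine several conditions with a logical operator such as "Or"."""
    return {"operator": operator, "operands": list(operands)}


# Rebuild the two-condition filter from the cell above
where_filter = combine(
    "Or",
    like_filter("answer", "*history*"),
    like_filter("question", "*history*"),
)
print(where_filter)
```

The resulting dict can be passed to `.with_where(where_filter)` in the same way as the hand-written version.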


But this does not rank the results in any meaningful way.

For that, we need a keyword search (as opposed to a keyword *filter*).

### Keyword search

Unlike a keyword filter, a keyword search ranks the matching objects. Here we use BM25, which scores each object by how often the keyword appears, adjusted for the length of the text.

In [21]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_bm25(query="history")
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "answer": "\"A Brief History Of Time In A Bottle\"",
          "question": "Stephen Hawking's 1988 bio of the universe that was a No. 1 hit for Jim Croce"
        },
        {
          "answer": "Oil",
          "question": "The Drake Well Museum in Titusville, Penn. is dedicated to the history of this industry"
        },
        {
          "answer": "the Field Museum",
          "question": "What was once the Chicago Natural History Museum is now called this, after its founder"
        }
      ]
    }
  }
}
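
To get a feel for how a BM25 score is computed, here is a minimal pure-Python sketch of the standard Okapi BM25 formula. It uses `k1=1.2` and `b=0.75`, the values shown in the class schema earlier. The tokenizer, the IDF variant and the toy documents are assumptions made for illustration, and Weaviate's own scoring may differ in detail.

```python
import math
import re


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric words
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in tokenize(query):
            tf = tokens.count(term)  # term frequency in this document
            if tf == 0:
                continue
            df = sum(1 for other in tokenized if term in other)  # document frequency
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer documents need more matches to score as high
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the history of oil",
    "a long sentence that mentions history only once among many other words",
    "nothing relevant here",
]
scores = bm25_scores("history", docs)
for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(round(score, 3), doc)
```

The two documents that contain the keyword receive a positive score, with the shorter one ranked higher, while the third scores zero. A filter would only tell us that the first two match.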


### Semantic search

A semantic search, on the other hand, finds objects based on similarity of meaning rather than on exact keyword matches.

In [22]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_near_text({"concepts": ["history"]})
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "answer": "Greyhound",
          "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
        },
        {
          "answer": "The Rijksmuseum",
          "question": "This Dutch national art museum had its origins in one founded by Louis Bonaparte in 1808"
        },
        {
          "answer": "Shinto",
          "question": "Compiled in 712, the Kojiki, \"Records of Ancient Matters\", is one of this religion's oldest texts"
        }
      ]
    }
  }
}


#### How does this work?

- Under the hood, this uses a vector search. It looks for objects which are the most similar to a text input.
- We can inspect the similarity along with the results.

In [23]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_near_text({"concepts": ["history"]})
    .with_additional("distance")
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "_additional": {
            "distance": 0.19912618
          },
          "answer": "Greyhound",
          "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
        },
        {
          "_additional": {
            "distance": 0.20578152
          },
          "answer": "The Rijksmuseum",
          "question": "This Dutch national art museum had its origins in one founded by Louis Bonaparte in 1808"
        },
        {
          "_additional": {
            "distance": 0.20843762
          },
          "answer": "Shinto",
          "question": "Compiled in 712, the Kojiki, \"Records of Ancient Matters\", is one of this religion's oldest texts"
        }
      ]
    }
  }
}
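
Because the class was configured with the `cosine` distance metric, the `distance` values above are cosine distances between the query vector and each object vector. As a rough sketch, cosine distance is one minus cosine similarity. The two-dimensional vectors below are invented for illustration; real embeddings have far more dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for the same direction, 2 for opposite directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


# Toy "embeddings" (assumption: made-up values, not real model output)
query_vector = [1.0, 0.0]
object_vectors = {
    "close": [0.9, 0.1],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}

# Rank objects from most to least similar, as a vector search would
ranked = sorted(object_vectors, key=lambda name: cosine_distance(query_vector, object_vectors[name]))
print(ranked)
```

A smaller distance means the object is closer in meaning to the query, which is why the results above are ordered by increasing distance.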


This is where "vectors" come in. 

Each object in Weaviate includes a vector - like so:

In [24]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_near_text({"concepts": ["history"]})
    .with_additional("vector")
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "_additional": {
            "vector": [
              -0.018947069,
              -0.011887303,
              -0.0006647947,
              -0.034424767,
              -0.022093708,
              0.011335968,
              0.0020069908,
              -0.01394472,
              -0.009177697,
              -0.031036079,
              -0.028185278,
              0.016956888,
              -0.0030558705,
              -0.023599792,
              -0.030820925,
              0.005439382,
              0.013171508,
              0.01306393,
              0.013010141,
              0.010609821,
              -0.008397761,
              0.0015993734,
              0.010771187,
              -0.008424655,
              0.0010581246,
              0.0012102458,
              0.026571618,
              -0.008888583,
              -0.0010161021,
              0.01598869,
              -0.004740129,
              -0.0072278567,
     

These vector representations come from deep learning models that are similar to those that power LLMs. They capture meaning, and are called vector "embeddings".

### Generative search

A generative search transforms your data at retrieval time. 

In [25]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_near_text({"concepts": ["history"]})
    .with_generate(
        single_prompt="write a tweet with emojis explaining the {question} and {answer}"
    )
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "\ud83d\ude8c\ud83c\udfde\ufe0f Discover the fascinating story of a bus company that started in Hibbing, Minn. in 1914! \ud83d\ude8d\u2728 The museum takes you on a journey through time, showcasing Hupmobiles and Greyhound buses \ud83d\ude90\ud83d\udc15. Don't miss this unique opportunity to explore the rich history of transportation! \ud83c\udf1f #HibbingMuseum #BusCompanyHistory #TransportationJourney"
            }
          },
          "answer": "Greyhound",
          "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
        },
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "\ud83c\udfa8 Did you know? The Rijksmuseum, \ud83c\uddf3\ud83c\uddf1 Netherlands' national art museum, 

You can see here ⬆️ that each object has been transformed by the LLM based on our prompt.
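
Conceptually, a single prompt is a template: the `{question}` and `{answer}` placeholders are filled with each object's properties before the text is sent to the LLM. The sketch below imitates that substitution step in plain Python. `fill_prompt` is a hypothetical helper written to illustrate the idea; it is not Weaviate's implementation.

```python
import re


def fill_prompt(template, properties):
    """Replace each {name} placeholder with the matching property value."""
    def substitute(match):
        # Raises KeyError if the template names a property the object lacks
        return str(properties[match.group(1)])

    return re.sub(r"\{(\w+)\}", substitute, template)


template = "write a tweet with emojis explaining the {question} and {answer}"
retrieved_objects = [
    {"question": "Animal that was the main staple of the Plains Indians economy", "answer": "buffalo"},
    {"question": "Any pigment on the wall so faded you can barely see it", "answer": "faint paint"},
]

# One prompt per retrieved object, which is why each result gets its own "singleResult"
prompts = [fill_prompt(template, obj) for obj in retrieved_objects]
for prompt in prompts:
    print(prompt)
```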

#### LLMs are linguistically very flexible

In [26]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_near_text({"concepts": ["history"]})
    .with_generate(
        single_prompt="translate the {question} into French"
    )
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "Un mus\u00e9e \u00e0 Hibbing, dans le Minnesota, retrace l'histoire de cette compagnie de bus fond\u00e9e en 1914 en utilisant des Hupmobiles."
            }
          },
          "answer": "Greyhound",
          "question": "A Hibbing, Minn. museum traces the history of this bus company founded there in 1914 using Hupmobiles"
        },
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "Ce mus\u00e9e national d'art n\u00e9erlandais trouve ses origines dans celui fond\u00e9 par Louis Bonaparte en 1808."
            }
          },
          "answer": "The Rijksmuseum",
          "question": "This Dutch national art museum had its origins in one founded by Louis Bonaparte in 1808"
        },
        {
          "_additional": {
            "generate

In fact, this LLM is multi-lingual!

#### Generative search - Grouped task

You can also send groups of results to the LLM with Weaviate.

In [27]:
response = (
    client.query
    .get("Question", ["question", "answer"])
    .with_limit(3)
    .with_near_text({"concepts": ["history"]})
    .with_generate(
        grouped_task="write a short poem about these facts"
    )
    .do()
)

jprint(response)

{
  "data": {
    "Get": {
      "Question": [
        {
          "_additional": {
            "generate": {
              "error": null,
              "groupedResult": "In Hibbing, a town so grand,\nA bus company took its stand.\nGreyhound, born in 1914,\nUsing Hupmobiles, its history gleamed.\n\nIn the land of tulips and canals,\nThe Rijksmuseum proudly stands tall.\nOriginating from Bonaparte's decree,\nA Dutch national art museum, for all to see.\n\nCenturies ago, in ancient Japan,\nShinto's roots began to span.\nThe Kojiki, a sacred script,\nRecords of Ancient Matters, tightly gripped.\n\nThrough time and space, these facts unfold,\nStories of history, precious and bold.\nFrom Greyhound's humble start,\nTo Rijksmuseum's artistic heart.\n\nAnd Shinto's ancient wisdom, so divine,\nThese facts intertwine, like a sacred shrine.\nLet us cherish the past, embrace its lore,\nFor knowledge and beauty, forevermore."
            }
          },
          "answer": "Greyhound",
          "qu

The output for a grouped task is contained in the first response object. 

So let's take a closer look at that one :) 

In [28]:
print(response["data"]["Get"]["Question"][0]["_additional"]["generate"]["groupedResult"])

In Hibbing, a town so grand,
A bus company took its stand.
Greyhound, born in 1914,
Using Hupmobiles, its history gleamed.

In the land of tulips and canals,
The Rijksmuseum proudly stands tall.
Originating from Bonaparte's decree,
A Dutch national art museum, for all to see.

Centuries ago, in ancient Japan,
Shinto's roots began to span.
The Kojiki, a sacred script,
Records of Ancient Matters, tightly gripped.

Through time and space, these facts unfold,
Stories of history, precious and bold.
From Greyhound's humble start,
To Rijksmuseum's artistic heart.

And Shinto's ancient wisdom, so divine,
These facts intertwine, like a sacred shrine.
Let us cherish the past, embrace its lore,
For knowledge and beauty, forevermore.
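
Indexing five levels deep into the response raises a `KeyError` as soon as one key is missing, for example when the class name does not match the query. A more defensive accessor might look like the sketch below. `get_grouped_result` is a hypothetical helper, and the mock response only imitates the shape of the output shown above.

```python
def get_grouped_result(response, class_name):
    """Return the grouped generative result, or None if it is not present."""
    data = response.get("data") or {}
    objects = (data.get("Get") or {}).get(class_name) or []
    if not objects:
        return None
    # The grouped result lives on the first object only
    generate = (objects[0].get("_additional") or {}).get("generate") or {}
    if generate.get("error"):
        raise RuntimeError(f"Generation failed: {generate['error']}")
    return generate.get("groupedResult")


# Mock response in the same shape as the grouped-task output
mock_response = {
    "data": {
        "Get": {
            "Question": [
                {
                    "_additional": {"generate": {"error": None, "groupedResult": "A short poem"}},
                    "answer": "Greyhound",
                }
            ]
        }
    }
}
print(get_grouped_result(mock_response, "Question"))
print(get_grouped_result(mock_response, "GitBookChunk"))
```

Asking for a class that is not in the response returns `None` instead of raising an exception.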


Look how far we've come in a short time, and we can do much more than that!

Here's something I prepared earlier.

## What more can we do with Weaviate?

Here is a demo instance that you can connect to and try out. 

Like many of our production clusters, we have a read-only API key set up that you can use.

In [30]:
import os

api_headers = {
    "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"],
}

# Instantiate the client with the auth config
client = weaviate.Client(
    url="https://edu-demo.weaviate.network",
    auth_client_secret=weaviate.AuthApiKey(
        api_key="learn-weaviate"
    ),
    additional_headers=api_headers
)

This instance is populated with the first two chapters of the "Pro Git" book.

In [None]:
response = (
    client.query
    .get("GitBookChunk", ["chunk"])
    .with_limit(3)
    .with_near_text({"concepts": ["history of git"]})
    .with_generate(
        grouped_task="write a cute short story for children about this topic, use emojis for cuteness"
    )
    .do()
)


Using Weaviate, we can talk to this book!

Let's read the story that the LLM generated from the book's account of the history of Git.

In [32]:
print(response["data"]["Get"]["GitBookChunk"][0]["_additional"]["generate"]["groupedResult"])

📚 Once upon a time, in a land called Linux 🐧, there was a little kernel named Git. It had a big dream of making things easier for everyone. But Git's journey wasn't always smooth sailing. It started with a bit of creative destruction and fiery controversy. 🔥

In the early years, changes to the software were passed around as patches and archived files. But then, in 2002, a new tool called BitKeeper came into the picture. It was like a shiny treasure for the Linux kernel project. 🌟

However, as time went on, the relationship between the Linux community and BitKeeper broke down. The tool's free-of-charge status was taken away, leaving the Linux developers in a tough spot. 😔

But fear not, for a hero named Linus Torvalds stepped forward. He was the creator of Linux and had a brilliant idea. Linus and the Linux development community decided to create their own tool based on the lessons they learned from BitKeeper. 🌈

They wanted this new tool, Git, to be fast ⚡, simple 🌟, and able to handle

Take a look at the results as we've done before

And the information that this is based on:

You can do strange and wonderful things - like this:

And a lot more. 

Weaviate makes it easy for you to work with your data and these AI models, at scale. As a vector database, it handles data stores with tens or hundreds of millions of objects!