<a href="https://colab.research.google.com/github/LorenzoBellomo/InformationRetrieval/blob/main/notebooks/4_SearchEngineWeaviate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Weaviate as a Search Engine

In [2]:
!pip install -U weaviate-client
import weaviate
import weaviate.classes.config as wc

Collecting weaviate-client
  Downloading weaviate_client-4.10.4-py3-none-any.whl.metadata (3.6 kB)
Collecting validators==0.34.0 (from weaviate-client)
  Downloading validators-0.34.0-py3-none-any.whl.metadata (3.8 kB)
Collecting authlib<1.3.2,>=1.2.1 (from weaviate-client)
  Downloading Authlib-1.3.1-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting grpcio-tools<2.0.0,>=1.66.2 (from weaviate-client)
  Downloading grpcio_tools-1.70.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting grpcio-health-checking<2.0.0,>=1.66.2 (from weaviate-client)
  Downloading grpcio_health_checking-1.70.0-py3-none-any.whl.metadata (1.1 kB)
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-health-checking<2.0.0,>=1.66.2->weaviate-client)
  Downloading protobuf-5.29.3-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Downloading weaviate_client-4.10.4-py3-none-any.whl (330 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m330.4/330.4 kB[0m [31m4.6

In [3]:
import weaviate
from weaviate.classes.query import MetadataQuery
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.classes.query import Filter

client = weaviate.connect_to_embedded()

INFO:weaviate-client:Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.26.6/weaviate-v1.26.6-Linux-amd64.tar.gz
INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 908


In [4]:
client.collections.delete_all()
client.collections.create(
    name="TestCollection",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT),
    ]
)

<weaviate.collections.collection.sync.Collection at 0x7e50fc33ff50>

In [5]:
sample_docs = [
    {"text": "Trump u.s.a. NATO"},
    {"text": "trump usa N.A.T.O."},
    {"text": "the cat sleeps"}
]

In [6]:
documents = client.collections.get("TestCollection")
for doc in sample_docs:
    documents.data.insert(doc)

In [7]:
# retrieve the elements
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties)

2a29f72e-2a9e-4e53-b5ac-b6c29fc1ebb9  -  {'text': 'trump usa N.A.T.O.'}
b06b09ec-ff3e-4b00-a146-9db7bfafa56b  -  {'text': 'the cat sleeps'}
e87af5a5-48ef-47a3-846f-e6d1508ad252  -  {'text': 'Trump u.s.a. NATO'}


In [8]:
query = "sleep"
response = documents.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["text"]))

Words are not stemmed. This is on the roadmap of features that Weaviate plans to support in the future.

In [9]:
def print_query_results(query, prop_name, collection):
  response = collection.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
  for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties[prop_name]))

In [10]:
print_query_results("TRUMP", "text", documents)

0.21 - Trump u.s.a. NATO
0.19 - trump usa N.A.T.O.


In [11]:
print_query_results("Trump", "text", documents)

0.21 - Trump u.s.a. NATO
0.19 - trump usa N.A.T.O.


In [28]:
print_query_results("the", "text", documents)

0.45 - the cat sleeps


In [20]:
def example_queries(prop_name, collection):
    queries = ["She is sleeping", "I am sleeping", "the usa", "I live in the u.s.a.", "TRUMP"]
    for query in queries:
      print("QUERY: ", query)
      print_query_results(query, prop_name, collection)
      print("===============================================================")
      print()

In [21]:
example_queries("text", documents)

QUERY:  She is sleeping

QUERY:  I am sleeping

QUERY:  the usa
0.4 - trump usa N.A.T.O.

QUERY:  I live in the u.s.a.
0.87 - Trump u.s.a. NATO

QUERY:  TRUMP
0.21 - Trump u.s.a. NATO
0.19 - trump usa N.A.T.O.



But how is the input really treated? How is it tokenized?

In [22]:
client.collections.create(
    name="TestWhitespace",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT, tokenization=Tokenization.WHITESPACE),
    ],
)

<weaviate.collections.collection.sync.Collection at 0x7e511b015dd0>

In [23]:
documents = client.collections.get("TestWhitespace")
for doc in sample_docs:
    documents.data.insert(doc)

In [26]:
print_query_results("the", "text", documents)

0.45 - the cat sleeps


In [27]:
print_query_results("Trump", "text", documents)

0.45 - Trump u.s.a. NATO


In [29]:
print_query_results("TRUMP", "text", documents)

In [30]:
example_queries("text", documents)

QUERY:  She is sleeping

QUERY:  I am sleeping

QUERY:  the usa
0.45 - trump usa N.A.T.O.
0.45 - the cat sleeps

QUERY:  I live in the u.s.a.
0.45 - the cat sleeps
0.45 - Trump u.s.a. NATO

QUERY:  TRUMP



**TOKENIZATION OPTIONS**
* word: alphanumeric, lowercased tokens
* lowercase: lowercased tokens
* whitespace: whitespace-separated, case-sensitive tokens
* the entire value of the property is treated as a single token

## Properties
Let's now add some simple properties to our index. As of now we only handled the "text" property, containing some simple textual snippets.

In [8]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
import json

with open("5articles.json", 'r') as f:
  articles = json.load(f)

--2025-02-14 16:06:53--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12566 (12K) [text/plain]
Saving to: ‘5articles.json’


2025-02-14 16:06:54 (27.0 MB/s) - ‘5articles.json’ saved [12566/12566]



In [33]:
articles[0]

{'title': 'American Airlines orders 60 Overture supersonic jets',
 'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show.",
 'date': '2022-08-18',
 'source': 'The New York Times'}

In [38]:
client.collections.create(
    name="TestProperties",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
)

<weaviate.collections.collection.sync.Collection at 0x7e50f3e30810>

In [39]:
documents = client.collections.get("TestProperties")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"]})

In [40]:
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties)

386adf7a-a5e1-4364-b965-8f01f96426fc  -  {'maintext': 'Charles Leclerc\nCharles Leclerc registered the maiden win of his Formula One career after romping to victory at the Belgian Grand Prix.\nLess than 24 hours after Leclerc\'s French motor racing contemporary, Anthoine Hubert, was killed at the Spa-Francorchamps venue, the young Monegasque driver delivered a dominant display to take the chequered flag in his friend\'s honour.\nLewis Hamilton finished second after fighting his way past Sebastian Vettel with 12 laps remaining.\nHamilton\'s Mercedes team-mate Valtteri Bottas also managed to see off Vettel after the Ferrari driver was forced to make an additional stop for tyres.\nHamilton extended his lead over Bottas in the championship to 65 points.\n"This one is for Anthoine," said an emotional Leclerc on the radio.\n"It feels good but it is difficult to enjoy a weekend like this.\n"On one hand I have realised a dream, but on the other hand it has been a difficult weekend.\n"I have lo

In [47]:
print_query_results("mother", "title", documents)

0.52 - 'One-punch killer's sentence will make others think twice'
0.3 - Leclerc dedicates win to Hubert


In [48]:
print_query_results("car", "title", documents)

1.87 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder


In [49]:
print_query_results("cars", "title", documents)

0.48 - Leclerc dedicates win to Hubert


In [61]:
print_query_results("victory", "title", documents)

Say that you now want to consider some words as "stopwords", that the system does not consider as such by default

In [59]:
documents.config.update(inverted_index_config=wc.Reconfigure.inverted_index(stopwords_additions=["victory"]))

In [65]:
print_query_results("victory", "title", documents)

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


But fields in the query are not all "born equal". Some are more important than others (e.g., title)

In [67]:
response = documents.query.bm25(
    query="race",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FIELD BOOSTING: (query = race)")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

BEFORE FIELD BOOSTING: (query = race)
1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


In [66]:
response = documents.query.bm25(
    query="race",
    query_properties=["title^2", "maintext"],
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FIELD BOOSTING: (query = race)")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

AFTER FIELD BOOSTING: (query = race)
1.43 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


Add some basic filtering

In [75]:
response = documents.query.bm25(
    query="mother",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FILTERING: (query = mother)")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

BEFORE FILTERING: (query = mother)
0.52 - 'One-punch killer's sentence will make others think twice'
0.3 - Leclerc dedicates win to Hubert


In [77]:
response = documents.query.bm25(
    query="mother",
    filters=Filter.by_property("title").contains_any(["Leclerc", "formula"]),
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FILTERING: (query = mother)")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

AFTER FILTERING: (query = mother)
0.3 - Leclerc dedicates win to Hubert


Let's see what happens when we also add dates as properties

In [78]:
client.collections.create(
    name="TestDate",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
        wc.Property(name="date", data_type=wc.DataType.DATE)
    ]
)

<weaviate.collections.collection.sync.Collection at 0x7e50e72df090>

[All property types](https://weaviate.io/developers/weaviate/config-refs/datatypes)

In [81]:
from datetime import timezone, datetime
documents = client.collections.get("TestDate")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"], "date": datetime.strptime(doc["date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)})

In [82]:
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties)

23c9584b-847a-42bb-b501-9568f0d102fc  -  {'title': 'American Airlines orders 60 Overture supersonic jets', 'date': datetime.datetime(2022, 8, 18, 0, 0, tzinfo=datetime.timezone.utc), 'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show."}
390f41e7-c0f2-41da-9994-15e87945138f  -  {'maintext': 'Antonio Conte. Pic: PA\nHead coach Antonio Conte does not think Chelsea are in the race to sign Arsenal forward Alexis Sanchez.\nSanchez is out of contract this summer and seemed certain to join Manchester City this month.\nBut the Premier League leaders on Monday

In [84]:
response = documents.query.bm25(
    query="mother",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FILTERING: (query = mother)")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

BEFORE FILTERING: (query = mother)
0.52 - 'One-punch killer's sentence will make others think twice'
0.3 - Leclerc dedicates win to Hubert


In [85]:
response = documents.query.bm25(
    query="mother",
    filters=Filter.by_property("date").greater_or_equal(datetime.strptime("2019-08-15", "%Y-%m-%d").replace(tzinfo=timezone.utc)),
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FILTERING: (query = mother)")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

AFTER FILTERING: (query = mother)
0.3 - Leclerc dedicates win to Hubert


Some advanced features

In [86]:
response = documents.query.bm25(query="sport", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

[https://aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey)

In [10]:
# Unfortunately, we cannot use all the vectorizer modules that are present in Weaviate. Here is a list of the ones that are available
client.get_meta()

{'hostname': 'http://127.0.0.1:8079',
 'modules': {'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'Generative Search - OpenAI'},
  'qna-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'OpenAI Question & Answering Module'},
  'ref2vec-centroid': {},
  'reranker-cohere': {'documentationHref': 'https://txt.cohere.com/rerank/',
   'name': 'Reranker - Cohere'},
  'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
   'name': 'Cohere Module'},
  'text2vec-huggingface': {'documentationHref': 'https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task',
   'name': 'Hugging Face Module'},
  'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
   'name': 'OpenAI Module'}},
 'version': '1.26.6'}

Let's use COHERE [https://dashboard.cohere.com/api-keys](https://dashboard.cohere.com/api-keys)

In [35]:
client.close()
cohere_key = "E7TICEFGJEFVWtc40FRafeQ4z8ZlcDEDbjHWedSX"
studio_key = "AIzaSyB3CLDPRhYlfWbIXfQMXp9UM4lDzsWmpE8"
headers = {
    "X-Cohere-Api-Key": cohere_key,
    "X-Goog-Studio-Api-Key": studio_key,
}
client = weaviate.connect_to_embedded(headers=headers)

INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 9647


In [41]:
client.collections.delete_all()
client.collections.create(
    name="TestVectorizer",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="maintext_vector",
            source_properties=["maintext"],
            #model="embed-multilingual-light-v3.0"
        )
    ],
    #generative_config=Configure.Generative.google(project_id = "827288372753")
)

<weaviate.collections.collection.sync.Collection at 0x7d198ab64f50>

In [42]:
documents = client.collections.get("TestVectorizer")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"]})

In [43]:
print("pure syntactical search")
response = documents.query.bm25(query="sport", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure syntactical search


In [44]:
print("pure vector search")
# NOTE THAT WE ALSO NEED THE PARAMETER DISTANCE
response = documents.query.near_text(query="sport", return_metadata=MetadataQuery(score=True, distance=True), limit=3)
for o in response.objects:
  print("{} - {}".format(round(o.metadata.distance*100)/100, o.properties["title"]))

pure vector search
0.61 - Leclerc dedicates win to Hubert
0.61 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder
0.65 - Conte: 'Chelsea are not in the race to sign Sanchez'


In [45]:
print("pure vector search")
response = documents.query.near_text(query="car", return_metadata=MetadataQuery(score=True, distance=True), limit=3)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure vector search
0.0 - Leclerc dedicates win to Hubert
0.0 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder
0.0 - American Airlines orders 60 Overture supersonic jets


In [46]:
print("hybrid search")
response = documents.query.hybrid(query="car", alpha=0.5, return_metadata=MetadataQuery(score=True, explain_score=True), limit=3)
for o in response.objects:
  print("{} - {} [{}]".format(round(o.metadata.score*100)/100, o.properties["title"],  o.metadata.explain_score.strip().replace("\n", '')))

hybrid search
0.95 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder [Hybrid (Result Set keyword,bm25) Document 8e9c89cf-5db3-4e39-b7b7-676c95fd40b5: original score 1.8679439, normalized score: 0.5 - Hybrid (Result Set vector,hybridVector) Document 8e9c89cf-5db3-4e39-b7b7-676c95fd40b5: original score 0.42209542, normalized score: 0.44946226]
0.5 - Leclerc dedicates win to Hubert [Hybrid (Result Set vector,hybridVector) Document acd84eaf-e48a-4882-a938-6a8f669a9913: original score 0.43642688, normalized score: 0.5]
0.22 - American Airlines orders 60 Overture supersonic jets [Hybrid (Result Set vector,hybridVector) Document 178d588f-6f26-4333-a57a-cf878f30e955: original score 0.35618782, normalized score: 0.21704914]


Now let's try to include some generative AI prompts to this query (let's add context to the entities in the news, or let's translate them in Italian)