<a href="https://colab.research.google.com/github/LorenzoBellomo/InformationRetrieval/blob/main/notebooks/4_SearchEngineWeaviate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Weaviate as a Search Engine

In [1]:
!pip install -U weaviate-client
import weaviate
import weaviate.classes.config as wc

Collecting weaviate-client
  Downloading weaviate_client-4.11.0-py3-none-any.whl.metadata (3.6 kB)
Collecting validators==0.34.0 (from weaviate-client)
  Downloading validators-0.34.0-py3-none-any.whl.metadata (3.8 kB)
Collecting authlib<1.3.2,>=1.2.1 (from weaviate-client)
  Downloading Authlib-1.3.1-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting grpcio-tools<2.0.0,>=1.66.2 (from weaviate-client)
  Downloading grpcio_tools-1.70.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting grpcio-health-checking<2.0.0,>=1.66.2 (from weaviate-client)
  Downloading grpcio_health_checking-1.70.0-py3-none-any.whl.metadata (1.1 kB)
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-health-checking<2.0.0,>=1.66.2->weaviate-client)
  Downloading protobuf-5.29.3-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Downloading weaviate_client-4.11.0-py3-none-any.whl (350 kB)

In [2]:
import weaviate
from weaviate.classes.query import MetadataQuery
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.classes.query import Filter

client = weaviate.connect_to_embedded()

INFO:weaviate-client:Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.26.6/weaviate-v1.26.6-Linux-amd64.tar.gz
INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 893


Let's create a simple collection with a single text field.

In [3]:
client.collections.delete_all()
client.collections.create(
    name="TestCollection",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT),
    ]
)

<weaviate.collections.collection.sync.Collection at 0x79ec38b66110>

Here is a small list of documents that is useful for testing some simple queries.

In [4]:
sample_docs = [
    {"text": "Trump u.s.a. NATO"},
    {"text": "trump usa N.A.T.O."},
    {"text": "trump u s a NATO"},
    {"text": "the cat sleeps"},
    {"text": "u are a star"}
]

Now we retrieve the collection we just created and insert the samples.

In [5]:
documents = client.collections.get("TestCollection")
for doc in sample_docs:
    documents.data.insert(doc)

Here is how to iterate over all documents in the collection

In [6]:
# retrieve the elements
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties)

2c9eb6ac-a3dd-429a-9a9e-d7b2091033b8  -  {'text': 'the cat sleeps'}
7c1422fe-bb47-44c5-b940-d6f775b004b9  -  {'text': 'trump usa N.A.T.O.'}
a17fe3ca-4d5b-4425-9a54-3f6569e39987  -  {'text': 'trump u s a NATO'}
f2976550-d198-4f56-b77c-4da6631912c3  -  {'text': 'Trump u.s.a. NATO'}
f3e29f0f-d3b2-45bb-8799-1457f4dbda85  -  {'text': 'u are a star'}


Let's try some simple queries. BM25 is the keyword-based ranking function that we saw in lecture 2 (generally more effective than TF-IDF). This means that the following query is matched against the text tokens of the documents, not against vector embeddings.

In [7]:
query = "sleep"
response = documents.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["text"]))

The query returns nothing: words are lowercased but, unfortunately, not stemmed, so "sleep" does not match "sleeps". Stemming is on the roadmap of features that Weaviate plans to support in the future.
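To make the scoring concrete, here is a minimal pure-Python sketch of BM25 over the sample documents. It is not Weaviate's implementation: the tokenizer (lowercased alphanumeric runs), the `k1` and `b` values, and the helper names `simple_tokens` and `bm25_scores` are assumptions chosen for illustration, so the absolute scores will not match the ones Weaviate prints.

```python
import math
import re

def simple_tokens(text):
    # Assumption: lowercase the text and split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document (0.0 when no query term matches)."""
    tokenized = [simple_tokens(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = [0.0] * n_docs
    for term in simple_tokens(query):
        # document frequency: in how many documents the term occurs
        df = sum(1 for toks in tokenized if term in toks)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for i, toks in enumerate(tokenized):
            tf = toks.count(term)
            if tf:
                # length normalization: longer documents are penalized
                norm = k1 * (1 - b + b * len(toks) / avg_len)
                scores[i] += idf * tf * (k1 + 1) / (tf + norm)
    return scores

docs = [
    "Trump u.s.a. NATO",
    "trump usa N.A.T.O.",
    "trump u s a NATO",
    "the cat sleeps",
    "u are a star",
]
print(bm25_scores("sleep", docs))  # all zeros: "sleep" is not the token "sleeps"
print(bm25_scores("TRUMP", docs))  # non-zero only for the three "trump" documents
```

Without stemming, "sleep" and "sleeps" are different tokens, so nothing matches. For "TRUMP", the two shorter documents score the same and slightly higher than the longer one, which is the effect of BM25's length normalization.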

Let's also define a function that properly prints the results of a query

In [8]:
def print_query_results(query, prop_name, collection):
  print("QUERY:: {}\n".format(query))
  response = collection.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
  for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties[prop_name]))

In [9]:
print_query_results("TRUMP", "text", documents) #the words are lowercased

QUERY:: TRUMP

0.24 - Trump u.s.a. NATO
0.24 - trump u s a NATO
0.22 - trump usa N.A.T.O.


In [10]:
print_query_results("Trump", "text", documents) #the words are lowercased

QUERY:: Trump

0.24 - Trump u.s.a. NATO
0.24 - trump u s a NATO
0.22 - trump usa N.A.T.O.


In [11]:
print_query_results("the", "text", documents) # stopwords are not indexed (an English stopword list is assumed)

QUERY:: the



Now we define a function that runs a few very basic queries.

In [12]:
def example_queries(prop_name, collection):
    queries = ["She is sleeping", "I sleep", "the usa", "I live in the u.s.a.", "TRUMP"]
    for query in queries:
      print_query_results(query, prop_name, collection)
      print("===============================================================")
      print()

In [13]:
print(sample_docs)
print("\n")
example_queries("text", documents)

[{'text': 'Trump u.s.a. NATO'}, {'text': 'trump usa N.A.T.O.'}, {'text': 'trump u s a NATO'}, {'text': 'the cat sleeps'}, {'text': 'u are a star'}]


QUERY:: She is sleeping


QUERY:: I sleep


QUERY:: the usa

0.56 - trump usa N.A.T.O.

QUERY:: I live in the u.s.a.

0.62 - Trump u.s.a. NATO
0.62 - trump u s a NATO
0.26 - u are a star

QUERY:: TRUMP

0.24 - Trump u.s.a. NATO
0.24 - trump u s a NATO
0.22 - trump usa N.A.T.O.



But how is the input really treated? How is it tokenized?

**TOKENIZATION OPTIONS**
* word: split on non-alphanumeric characters, lowercased tokens (default tokenizer for Weaviate)
* lowercase: whitespace-separated, lowercased tokens
* whitespace: whitespace-separated, case-sensitive tokens
* field: the entire value of the property is treated as a single token
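The options above can be approximated in a few lines of plain Python. This is only a sketch to build intuition: the function `tokenize` and its regular expression are assumptions, not Weaviate's actual tokenizer code.

```python
import re

def tokenize(text, method="word"):
    """Approximate the four tokenization options (illustrative only)."""
    if method == "word":
        # keep runs of letters/digits, then lowercase them
        return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    if method == "lowercase":
        # split on whitespace, lowercase, keep punctuation
        return text.lower().split()
    if method == "whitespace":
        # split on whitespace, keep case and punctuation
        return text.split()
    if method == "field":
        # the whole (trimmed) value is one token
        return [text.strip()]
    raise ValueError(f"unknown tokenization method: {method}")

for method in ("word", "lowercase", "whitespace", "field"):
    print(method, tokenize("Trump u.s.a. NATO", method))
```

With `word`, "u.s.a." becomes the three tokens `u`, `s`, `a`, which is why the query "I live in the u.s.a." earlier also matched "u are a star". With `whitespace`, "u.s.a." stays a single case-sensitive token.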

In [14]:
client.collections.create(
    name="TestWhitespace",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT, tokenization=Tokenization.WHITESPACE),
    ],
)

<weaviate.collections.collection.sync.Collection at 0x79ec38ca9b90>

In [15]:
documents = client.collections.get("TestWhitespace")
for doc in sample_docs:
    documents.data.insert(doc)

In [16]:
print_query_results("the", "text", documents) # stopword is found

QUERY:: the

0.68 - the cat sleeps


In [17]:
print_query_results("Trump", "text", documents) # no lowercasing, so "trump" is not matched

QUERY:: Trump

0.68 - Trump u.s.a. NATO


In [18]:
print_query_results("TRUMP", "text", documents) # no lowercasing, so neither "trump" nor "Trump" is matched

QUERY:: TRUMP



In [19]:
print_query_results("u", "text", documents) # whitespace does not split "u.s.a." which is one token

QUERY:: u

0.38 - u are a star
0.34 - trump u s a NATO


In [20]:
print_query_results("u.s.a.", "text", documents)

QUERY:: u.s.a.

0.68 - Trump u.s.a. NATO


In [21]:
example_queries("text", documents)

QUERY:: She is sleeping


QUERY:: I sleep


QUERY:: the usa

0.68 - trump usa N.A.T.O.
0.68 - the cat sleeps

QUERY:: I live in the u.s.a.

0.68 - Trump u.s.a. NATO
0.68 - the cat sleeps

QUERY:: TRUMP




## Properties
Let's now add some more properties to our index. So far we have only handled the "text" property, containing some short textual snippets.

In [22]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
import json

with open("5articles.json", 'r') as f:
  articles = json.load(f)

--2025-02-27 15:39:48--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12566 (12K) [text/plain]
Saving to: ‘5articles.json’


2025-02-27 15:39:48 (101 MB/s) - ‘5articles.json’ saved [12566/12566]



In [23]:
articles[0]

{'title': 'American Airlines orders 60 Overture supersonic jets',
 'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show.",
 'date': '2022-08-18',
 'source': 'The New York Times'}

In [24]:
client.collections.create(
    name="TestProperties",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
)

<weaviate.collections.collection.sync.Collection at 0x79ec3844c510>

In [25]:
documents = client.collections.get("TestProperties")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"]})

In [26]:
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties)

4cfa5cc5-6fa2-43b4-abcf-b4ecf98b7955  -  {'maintext': 'Hamid Sanambar\nGardai are hunting for a gunman who opened fire on a car in north Dublin - just metres from where Hamid Sanambar was gunned down last week.\nEmergency services were alerted to reports of gunfire in Kilmore Road in the Artane area of the capital shortly before 9pm on Wednesday.\nGardai believe a number of rounds were fired at the car before the gunman and the vehicle fled the scene.\nFled\nDetectives investigating the shooting are probing if the gunman interacted with the car driver before he opened fire.\nIt is understood the gunman fled the area on foot.\nThe incident happened just a few hundred metres from Kilbarron Avenue where Sanambar (41) was shot dead on Wednesday of last week.\nGardai said investigations into that shooting are still ongoing.\n"Gardai are investigating reports of an alleged shooting incident on the Kilmore Road, Artane, Dublin 5," a spokeswoman said.\n"The incident occurred on June 5, 2019, a

In [27]:
print_query_results("mother", "title", documents) # prints the score and the title of the retrieved article

QUERY:: mother

0.52 - 'One-punch killer's sentence will make others think twice'
0.3 - Leclerc dedicates win to Hubert


In [28]:
print_query_results("cars", "title", documents) # there is no stemming, so the article with "car" in the title (see next query) is not returned

QUERY:: cars

0.48 - Leclerc dedicates win to Hubert


In [29]:
print_query_results("car", "title", documents) # The score can be larger than 1

QUERY:: car

1.87 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder


Say that you now want to treat some words as "stopwords" even though the system does not consider them as such by default.

In [30]:
print_query_results("victory", "title", documents) # "victory" matches here; below we add it to the stopwords

documents.config.update(inverted_index_config=wc.Reconfigure.inverted_index(stopwords_additions=["victory"]))

print("\n")
print_query_results("victory", "title", documents)

QUERY:: victory

0.71 - Leclerc dedicates win to Hubert


QUERY:: victory
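Conceptually, once a word is in the stopword list it is ignored when the query is matched, so a query made only of stopwords has nothing left to search for. A minimal sketch of that idea follows; the stopword set is a tiny made-up subset and `searchable_terms` is a hypothetical helper, not part of Weaviate.

```python
# Tiny illustrative subset of English stopwords (not Weaviate's actual list).
DEFAULT_STOPWORDS = {"the", "a", "an", "is", "in"}

def searchable_terms(query, additions=()):
    """Return the query terms that survive stopword removal."""
    stopwords = DEFAULT_STOPWORDS | {w.lower() for w in additions}
    return [t for t in query.lower().split() if t not in stopwords]

print(searchable_terms("victory"))                         # ['victory']
print(searchable_terms("victory", additions=["victory"]))  # []
print(searchable_terms("the usa"))                         # ['usa']
```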



Fields are not all "born equal", though: some are more important than others (e.g., the title). Let's boost the importance of the "title" field by doubling the weight of its score (the `^2` suffix in `query_properties`).

In [31]:
response = documents.query.bm25(
    query="race",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FIELD BOOSTING: (query = race)\n")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

BEFORE FIELD BOOSTING: (query = race)

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


In [32]:
response = documents.query.bm25(
    query="race",
    query_properties=["title^2", "maintext"],
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FIELD BOOSTING: (query = race)\n")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

AFTER FIELD BOOSTING: (query = race)

1.43 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


The boosted score is not double the original score, because:
- Weaviate uses BM25 rather than TF-IDF, and BM25 scales slightly differently
- "race" is also present in the maintext of the article, and that part of the score is not boosted
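The second point can be seen with a little arithmetic. The sketch below assumes the final score is a plain weighted sum of per-field scores, which is a simplification (Weaviate's actual multi-field BM25 combination may differ), and the per-field values are hypothetical numbers, not taken from the output above.

```python
def combined_score(field_scores, weights):
    """Weighted sum of per-field scores; fields without a weight count once."""
    return sum(weights.get(field, 1.0) * s for field, s in field_scores.items())

# Hypothetical per-field scores for one document (illustrative values only).
field_scores = {"title": 0.4, "maintext": 0.6}

before = combined_score(field_scores, {})
after = combined_score(field_scores, {"title": 2.0})
print(before, after, after / before)
```

As long as the unboosted field contributes something, the ratio `after / before` stays below 2.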

In [41]:
response.objects[0].properties["maintext"].count("race") # indeed it appears once

1

Now let's add some basic filtering.

In [44]:
response = documents.query.bm25(
    query="race",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FILTERING: (query = race)\n")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

BEFORE FILTERING: (query = race)

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


In [45]:
response = documents.query.bm25(
    query="race",
    filters=Filter.by_property("title").contains_any(["Leclerc", "formula"]),
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FILTERING: (query = race)\n")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

AFTER FILTERING: (query = race)

0.54 - Leclerc dedicates win to Hubert


Let's see what happens when we also add dates as properties

In [46]:
client.collections.create(
    name="TestDate",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
        wc.Property(name="date", data_type=wc.DataType.DATE)
    ]
)

<weaviate.collections.collection.sync.Collection at 0x79ec38b9ba50>

[All property types](https://weaviate.io/developers/weaviate/config-refs/datatypes)

In [47]:
from datetime import timezone, datetime
documents = client.collections.get("TestDate")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"], "date": datetime.strptime(doc["date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)})

In [48]:
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties['date'], '  ', doc.properties['title'])
  # print(doc.uuid, " - ", doc.properties)

650aa442-33a9-4511-a5a9-fa9bf411ef8c  -  2018-01-23 00:00:00+00:00    Conte: 'Chelsea are not in the race to sign Sanchez'
8410ab1a-5780-43ff-a8e3-246730dfe17d  -  2019-06-07 00:00:00+00:00    Gunman opens fire on car just metres from scene of Hamid Sanambar murder
a35dc66b-bdd9-4997-aa5f-92a2132ba922  -  2019-06-29 00:00:00+00:00    'One-punch killer's sentence will make others think twice'
e182cea0-1285-4277-a6e3-669b0e47dc6c  -  2019-09-01 00:00:00+00:00    Leclerc dedicates win to Hubert
e8114e5c-6778-494d-96f2-067dbcc9d468  -  2022-08-18 00:00:00+00:00    American Airlines orders 60 Overture supersonic jets


In [52]:
response = documents.query.bm25(
    query="race",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FILTERING: (query = race)\n")
for o in response.objects:
    print("{} - {} ({})".format(round(o.metadata.score*100)/100, o.properties["title"], o.properties["date"]))

BEFORE FILTERING: (query = race)

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez' (2018-01-23 00:00:00+00:00)
0.54 - Leclerc dedicates win to Hubert (2019-09-01 00:00:00+00:00)


In [53]:
response = documents.query.bm25(
    query="race",
    filters=Filter.by_property("date").greater_or_equal(datetime.strptime("2019-08-15", "%Y-%m-%d").replace(tzinfo=timezone.utc)),
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FILTERING: (query = race)\n")
for o in response.objects:
    print("{} - {} ({})".format(round(o.metadata.score*100)/100, o.properties["title"], o.properties["date"]))

AFTER FILTERING: (query = race)

0.54 - Leclerc dedicates win to Hubert (2019-09-01 00:00:00+00:00)
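Conceptually, the date filter keeps only the objects whose date is on or after the cutoff, and the BM25 ranking then applies to what is left. Here is a plain-Python sketch of the filtering step alone, using the dates and titles printed for the TestDate collection; `parse_day` is a helper name introduced here for illustration.

```python
from datetime import datetime, timezone

def parse_day(day):
    # Build a timezone-aware datetime (UTC), as done when inserting the articles.
    return datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)

# (date, title) pairs as printed for the TestDate collection.
dated_titles = [
    ("2018-01-23", "Conte: 'Chelsea are not in the race to sign Sanchez'"),
    ("2019-06-07", "Gunman opens fire on car just metres from scene of Hamid Sanambar murder"),
    ("2019-06-29", "'One-punch killer's sentence will make others think twice'"),
    ("2019-09-01", "Leclerc dedicates win to Hubert"),
    ("2022-08-18", "American Airlines orders 60 Overture supersonic jets"),
]

cutoff = parse_day("2019-08-15")
kept = [title for day, title in dated_titles if parse_day(day) >= cutoff]
print(kept)
```

Two articles pass the date filter, but only the Leclerc one also matches "race", which is why it is the single result of the filtered query.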


Now for some more advanced features: let's try some vector (semantic) queries. Assume we want to find all articles that are "related to sport". In the current collection, the word "sport" does not appear in any title or maintext, so a keyword query returns nothing.

In [54]:
response = documents.query.bm25(query="sport", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

Let's configure a text vectorizer so that we can run semantic search queries.

In [55]:
# Unfortunately, not all Weaviate vectorizer modules are available here; get_meta() lists the ones that are
client.get_meta()

{'hostname': 'http://127.0.0.1:8079',
 'modules': {'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'Generative Search - OpenAI'},
  'qna-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'OpenAI Question & Answering Module'},
  'ref2vec-centroid': {},
  'reranker-cohere': {'documentationHref': 'https://txt.cohere.com/rerank/',
   'name': 'Reranker - Cohere'},
  'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
   'name': 'Cohere Module'},
  'text2vec-huggingface': {'documentationHref': 'https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task',
   'name': 'Hugging Face Module'},
  'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
   'name': 'OpenAI Module'}},
 'version': '1.26.6'}

Let's use Cohere as the text vectorizer (API keys can be created at [https://dashboard.cohere.com/api-keys](https://dashboard.cohere.com/api-keys)). As the module list above shows, in Colab we have only a few options for vectorization (OpenAI, Cohere, Hugging Face), and only one generative module is available (OpenAI).
Cohere provides free trial API keys; OpenAI does not.

In [56]:
# You first need to create an API key and store it as a Colab secret named COHERE_KEY
from google.colab import userdata

client.close()
cohere_key = userdata.get('COHERE_KEY') # MAKE SURE YOU CREATED A KEY
headers = {
    "X-Cohere-Api-Key": cohere_key,
}
client = weaviate.connect_to_embedded(headers=headers)

INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 4184


See here for how to integrate model providers: [https://weaviate.io/developers/weaviate/model-providers](https://weaviate.io/developers/weaviate/model-providers)

Now we create the example collection. Note that here we set only the vectorizer (Cohere); the generative module (OpenAI, only available with a paid API key) is added later, for the RAG experiment.

In [57]:
client.collections.delete_all()
client.collections.create(
    name="TestVectorizer",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="maintext_vector",
            source_properties=["maintext"],
            #model="embed-multilingual-light-v3.0"
        )
    ]
)

<weaviate.collections.collection.sync.Collection at 0x79ec702a0210>

In [58]:
documents = client.collections.get("TestVectorizer")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"]}) # here weaviate performs the vectorization

In [63]:
print("pure syntactical search (ordered by decreasing similarity score): 'sport'\n")
response = documents.query.bm25(query="sport", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure syntactical search (ordered by decreasing similarity score): 'sport'



In [64]:
print("pure vector search (ordered by increasing distance): 'sport'\n")
# NOTE THAT WE ALSO NEED THE PARAMETER DISTANCE
response = documents.query.near_text(query="sport", return_metadata=MetadataQuery(score=True, distance=True), limit=3)
for o in response.objects:
  print("{} - {} (score is {})".format(round(o.metadata.distance*100)/100, o.properties["title"], round(o.metadata.score*100)/100))

pure vector search (ordered by increasing distance): 'sport'

0.61 - Leclerc dedicates win to Hubert (score is 0.0)
0.61 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder (score is 0.0)
0.65 - Conte: 'Chelsea are not in the race to sign Sanchez' (score is 0.0)
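The numbers printed first are distances, so smaller means more similar; the score field stays at 0.0 in this output because the ranking here is by distance only. Assuming the collection uses cosine distance (one minus the cosine similarity of the two embeddings), the computation can be sketched as follows. The vectors are toy values invented for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (hypothetical values).
query_vec = [1.0, 0.0, 0.0]
print(cosine_distance(query_vec, [2.0, 0.0, 0.0]))  # same direction
print(cosine_distance(query_vec, [0.0, 1.0, 0.0]))  # orthogonal
print(cosine_distance(query_vec, [1.0, 1.0, 0.0]))  # 45 degrees apart
```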


In [65]:
print("pure syntactical search (ordered by decreasing similarity score): 'race'\n")
response = documents.query.bm25(query="race", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure syntactical search (ordered by decreasing similarity score): 'race'

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


In [66]:
print("pure vector search (ordered by increasing distance): 'race'\n")
# NOTE THAT WE ALSO NEED THE PARAMETER DISTANCE
response = documents.query.near_text(query="race", return_metadata=MetadataQuery(score=True, distance=True), limit=3)
for o in response.objects:
  print("{} - {} (score is {})".format(round(o.metadata.distance*100)/100, o.properties["title"], round(o.metadata.score*100)/100))

pure vector search (ordered by increasing distance): 'race'

0.61 - Leclerc dedicates win to Hubert (score is 0.0)
0.61 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder (score is 0.0)
0.69 - Conte: 'Chelsea are not in the race to sign Sanchez' (score is 0.0)


In [67]:
print("hybrid search (ordered by decreasing score): 'race'")
response = documents.query.hybrid(query="race", alpha=0.5, return_metadata=MetadataQuery(score=True, explain_score=True), limit=3)
for o in response.objects:
  print("{} - {} [{}]".format(round(o.metadata.score*100)/100, o.properties["title"],  o.metadata.explain_score.strip().replace("\n", '')))

hybrid search (ordered by decreasing score): 'race'
0.59 - Conte: 'Chelsea are not in the race to sign Sanchez' [Hybrid (Result Set keyword,bm25) Document 983e85a7-245d-4a1f-9abd-eb2d6bd78a5f: original score 1.2714014, normalized score: 0.5 - Hybrid (Result Set vector,hybridVector) Document 983e85a7-245d-4a1f-9abd-eb2d6bd78a5f: original score 0.31188643, normalized score: 0.09419514]
0.5 - Leclerc dedicates win to Hubert [Hybrid (Result Set keyword,bm25) Document 0229274f-29f6-4575-bc1d-7412ca1b98a9: original score 0.5364737, normalized score: 0 - Hybrid (Result Set vector,hybridVector) Document 0229274f-29f6-4575-bc1d-7412ca1b98a9: original score 0.3915854, normalized score: 0.5]
0.48 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder [Hybrid (Result Set vector,hybridVector) Document b1bd8dae-f31d-430e-84ea-65a5eb114e53: original score 0.3867706, normalized score: 0.47548437]


[Description of how scoring works](https://weaviate.io/developers/weaviate/concepts/search/hybrid-search)
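The "normalized score" values in the explain output are consistent with min-max normalizing each result set and then weighting it by `alpha` (vector side) or `1 - alpha` (keyword side). Below is a minimal sketch of that fusion, under the assumption that this is how the two sets are combined; the helpers `min_max` and `fuse` are names made up here. The keyword scores are the ones printed above. The vector scores are the three shown in the output, but the real vector result set evidently contained further documents, so the vector-side numbers of this sketch differ from the notebook's.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a set with a single distinct value maps to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(keyword, vector, alpha=0.5):
    """alpha weights the vector side, (1 - alpha) the keyword side."""
    kw, vec = min_max(keyword), min_max(vector)
    docs = set(kw) | set(vec)
    return {d: (1 - alpha) * kw.get(d, 0.0) + alpha * vec.get(d, 0.0) for d in docs}

keyword = {"Conte": 1.2714014, "Leclerc": 0.5364737}
vector = {"Conte": 0.31188643, "Leclerc": 0.3915854, "Gunman": 0.3867706}

fused = fuse(keyword, vector, alpha=0.5)
# sort by decreasing score, breaking ties alphabetically
for doc, score in sorted(fused.items(), key=lambda kv: (-kv[1], kv[0])):
    print(round(score, 2), doc)
```

On the keyword side this reproduces the values in the output: the best BM25 match gets 0.5 and the worst gets 0. A document missing from one result set (here "Gunman" in the keyword set) simply contributes 0 from that side.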

## A new method, RAG
RAG stands for Retrieval Augmented Generation. This is a recent trend in Information Retrieval that aims at reducing the problem of "hallucinations" for Large Language Model generation, and returns better answers based on local document archives.
- Traditional queries go as follows: the user makes a query to a search engine; the search engine returns, in some predefined format, the answer to that query.
- LLM queries: the user makes a query to a Large Language Model (LLM); the LLM creates an answer based on the (often unspecified) training data that was originally used to train it. The LLM often hallucinates, returning wrong answers.
- RAG: the user makes a query to a search engine; the search engine runs the query and gets its results. Before returning the results to the user, they are sent to an LLM that "processes" them and generates a textual response that is more convenient for the user to read, but that (ideally) does not contain hallucinated information, because it relies on the retrieved results.
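The RAG flow in the last bullet can be sketched as three steps: retrieve, build a prompt from the retrieved text, generate. Everything in this sketch is a stand-in so that it runs offline: `retrieve` is a toy word-overlap ranker rather than a real search engine, and `generate` is a hypothetical stub where a real LLM call would go.

```python
def retrieve(query, documents, limit=2):
    """Toy retriever: rank documents by how many query words they contain."""
    words = query.lower().split()
    ranked = sorted(
        documents,
        key=lambda d: sum(w in d.lower() for w in words),
        reverse=True,
    )
    return ranked[:limit]

def build_prompt(question, contexts):
    """Put the retrieved snippets in front of the question."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {question}"

def generate(prompt):
    # Placeholder for a real LLM call (hypothetical stub).
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

corpus = [
    "Leclerc dedicates win to Hubert",
    "American Airlines orders 60 Overture supersonic jets",
    "Gunman opens fire on car just metres from scene of Hamid Sanambar murder",
]
question = "Who did Leclerc dedicate his win to?"
contexts = retrieve(question, corpus)
prompt = build_prompt(question, contexts)
print(prompt)
print(generate(prompt))
```

The key design point is that the LLM only sees what the retriever returned, so the quality of the answer depends directly on the quality of the retrieval step.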

Now let's try to add some generative AI prompts to this query (for example, to add context about the entities in the news, or to translate them into Italian).
Note that this query will only work for those who have a paid OpenAI API key.

In [68]:
client.close()
cohere_key = userdata.get('COHERE_KEY')
openai_key = userdata.get("OPENAI_KEY2")
headers = {
    "X-Cohere-Api-Key": cohere_key,
    "X-OpenAI-Api-Key": openai_key
}
client = weaviate.connect_to_embedded(headers=headers)

INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 11228


In [69]:
client.collections.delete_all()
client.collections.create(
    name="TestVectorizer",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="maintext_vector",
            source_properties=["maintext"],
            #model="embed-multilingual-light-v3.0"
        )
    ],
    generative_config=Configure.Generative.openai(model="gpt-4") # added generation module
)

<weaviate.collections.collection.sync.Collection at 0x79ec39b5ae90>

In [70]:
documents = client.collections.get("TestVectorizer")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"]}) # here weaviate performs the vectorization

In [71]:
response = documents.generate.near_text(
    query="sport",  # The model provider integration will automatically vectorize the query
    single_prompt="Write a short summary of maximum 100 characters in Italian of {maintext}",
    limit=2 # apply LLM to the top 2 results
)

In [72]:
for obj in response.objects:
    print(obj.properties["title"])
    print(f"Generated output: {obj.generated}")  # Note that the generated output is per object
    print("====================================================")
    print()

Leclerc dedicates win to Hubert
Generated output: Charles Leclerc ha ottenuto la sua prima vittoria in Formula Uno al Gran Premio del Belgio, dedicandola all'amico Anthoine Hubert, morto in un incidente.

Gunman opens fire on car just metres from scene of Hamid Sanambar murder
Generated output: La polizia cerca un uomo armato che ha sparato su un'auto a Dublino, vicino al luogo dove Hamid Sanambar è stato ucciso.



The code above implements RAG using an external LLM module (OpenAI), invoked via the internal Weaviate module. We can also implement RAG by calling the LLM directly: Cohere handles both the vectorization (inside Weaviate) and the generation (directly, with an API call). This way we do not need to pay for an OpenAI API key.

In [73]:
!pip install cohere

Collecting cohere
  Downloading cohere-5.13.12-py3-none-any.whl.metadata (3.4 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting httpx-sse==0.4.0 (from cohere)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere)
  Downloading types_requests-2.32.0.20241016-py3-none-any.whl.metadata (1.9 kB)
Downloading cohere-5.13.12-py3-none-any.whl (252 kB)
Downloading httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)
Downloading fastavro-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Downloading types_requests-2.32.0.20241016-py3-none-any.whl (15 kB)

In [74]:
print("hybrid search: 'race'")
response = documents.query.hybrid(query="race", alpha=0.5, return_metadata=MetadataQuery(score=True, explain_score=True), limit=3)
for o in response.objects:
  print("{} - {} [{}]".format(round(o.metadata.score*100)/100, o.properties["title"],  o.metadata.explain_score.strip().replace("\n", '')))

hybrid search: 'race'
0.59 - Conte: 'Chelsea are not in the race to sign Sanchez' [Hybrid (Result Set keyword,bm25) Document ad04ee5d-b50d-45a2-a0f5-002111e0d0c5: original score 1.2714014, normalized score: 0.5 - Hybrid (Result Set vector,hybridVector) Document ad04ee5d-b50d-45a2-a0f5-002111e0d0c5: original score 0.31191516, normalized score: 0.09438221]
0.5 - Leclerc dedicates win to Hubert [Hybrid (Result Set keyword,bm25) Document 25062050-154b-49fd-aa58-cd219f6e6249: original score 0.5364737, normalized score: 0 - Hybrid (Result Set vector,hybridVector) Document 25062050-154b-49fd-aa58-cd219f6e6249: original score 0.39161777, normalized score: 0.5]
0.48 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder [Hybrid (Result Set vector,hybridVector) Document b43b5d85-8a63-431f-bb1b-8db57f93686c: original score 0.38678777, normalized score: 0.47541943]


In [76]:
import cohere

co = cohere.ClientV2(api_key=cohere_key)
# `response` holds the results of the previous hybrid query; we summarize the top one
prompt = (
    "Write a short summary (100 characters at most), in Italian, of the article "
    "provided below:\n\n{}".format(response.objects[0].properties["maintext"])
)
res = co.chat(
    model="command-r-plus-08-2024", # this is a Cohere model
    messages=[{"role": "user", "content": prompt}],
)

print(res.message.content[0].text)

Conte nega l'interesse del Chelsea per Sanchez, mentre il Norwich si prepara per la sfida.
