<a href="https://colab.research.google.com/github/EMbeDS-education/ComputingDataAnalysisModeling20242025/blob/main/ISE/SearchEngineWeaviate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Weaviate as a Search Engine

##################### TO START

First, get a Cohere API key (needed to vectorize texts) from

https://dashboard.cohere.com/api-keys.

Then store the key in the Colab "Secrets" panel (click the key icon in the left sidebar) under the name "COHERE_APIKEY".

The full Weaviate documentation is available at:

https://weaviate.io/developers/weaviate

###########################################################

In [2]:
!pip install -U weaviate-client
import weaviate
import weaviate.classes.config as wc



In [3]:
# Create the Weaviate client

import weaviate
from weaviate.classes.query import MetadataQuery
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.classes.query import Filter

client = weaviate.connect_to_embedded()

INFO:weaviate-client:Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.26.6/weaviate-v1.26.6-Linux-amd64.tar.gz
INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 2302


In [4]:
!wget https://raw.githubusercontent.com/EMbeDS-education/ComputingDataAnalysisModeling20242025/refs/heads/main/ISE/data/5articles.json
import json

with open("5articles.json", 'r') as f:
  articles = json.load(f)

articles[0]

--2025-03-30 12:58:26--  https://raw.githubusercontent.com/EMbeDS-education/ComputingDataAnalysisModeling20242025/refs/heads/main/ISE/data/5articles.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12566 (12K) [text/plain]
Saving to: ‘5articles.json’


2025-03-30 12:58:27 (31.5 MB/s) - ‘5articles.json’ saved [12566/12566]



{'title': 'American Airlines orders 60 Overture supersonic jets',
 'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show.",
 'date': '2022-08-18',
 'source': 'The New York Times'}

Let's create a simple collection called "TestCollection" with three fields of type TEXT (each using a different tokenization) and one field of type DATE.

**TOKENIZATION OPTIONS**
* word: alphanumeric, lowercased tokens, with stopwords filtering (default tokenizer for Weaviate)
* lowercase: lowercased tokens
* whitespace: whitespace-separated, case-sensitive tokens
* field: the entire value of the property is treated as a single token

[All property types](https://weaviate.io/developers/weaviate/config-refs/datatypes)
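
To make the four options concrete, here is a rough pure-Python approximation of the tokens each one produces. The function `tokenize` is a hypothetical helper written for illustration, not part of the Weaviate client. It ignores stopword filtering and non-ASCII text, and Weaviate's real tokenizers may differ in edge cases.

```python
import re

def tokenize(text, method):
    """Approximate the tokens produced by each tokenization option (illustrative only)."""
    if method == "word":
        # split on anything that is not a letter or digit, then lowercase
        return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    if method == "lowercase":
        # split on whitespace only, then lowercase (punctuation is kept)
        return text.lower().split()
    if method == "whitespace":
        # split on whitespace only, case is preserved
        return text.split()
    if method == "field":
        # the whole (trimmed) value is one single token
        return [text.strip()]
    raise ValueError("unknown tokenization: {}".format(method))

sample = "One-punch killer's sentence"
for method in ("word", "lowercase", "whitespace", "field"):
    print(method, "->", tokenize(sample, method))
```

With "field" tokenization a value such as "The New York Times" stays a single token, so a query for just "The" cannot match it.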

In [5]:
client.collections.delete_all() # To remove anything created before, if any

client.collections.create(
    name="TestCollection",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
        wc.Property(name="source", data_type=wc.DataType.TEXT, tokenization=Tokenization.FIELD),
        wc.Property(name="date", data_type=wc.DataType.DATE)
    ],
)

# Just typing "wc.DataType." you can explore other data types...

<weaviate.collections.collection.sync.Collection at 0x7cff09a76b50>

**CREATE A TEST COLLECTION**

Now we retrieve the collection named "TestCollection" and insert the "articles" into it. Each date string is converted to a timezone-aware (UTC) datetime before insertion.

In [6]:
from datetime import timezone, datetime

documents = client.collections.get("TestCollection")

for doc in articles:
    documents.data.insert({
        "maintext": doc["maintext"],
        "title": doc["title"],
        "source": doc["source"],
        "date": datetime.strptime(doc["date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    })


**ITERATE over all documents in the collection**

Notice that the collection behaves like a "map": objects are not necessarily listed in the order in which they were inserted (in the output below they appear sorted by UUID).

In [7]:
# retrieve the elements
for i, doc in enumerate(documents.iterator()):
  print(doc.uuid, " - ", doc.properties["title"], " - ", doc.properties["source"], " - ", doc.properties["date"])

234c4d4c-f7c7-4c06-a3bf-b89fa6152c1b  -  American Airlines orders 60 Overture supersonic jets  -  The New York Times  -  2022-08-18 00:00:00+00:00
3663b81b-886e-45af-a4d3-10bfee980dc8  -  'One-punch killer's sentence will make others think twice'  -  The Herald-ir  -  2019-06-29 00:00:00+00:00
9c6570f1-036b-4407-9284-ded9c38269e4  -  Leclerc dedicates win to Hubert  -  The Herald-ir  -  2019-09-01 00:00:00+00:00
da23c776-c05b-424d-83f5-b8afe65ba18c  -  Conte: 'Chelsea are not in the race to sign Sanchez'  -  The Herald-ir  -  2018-01-23 00:00:00+00:00
f71210d6-ff98-468b-8153-33b35747efad  -  Gunman opens fire on car just metres from scene of Hamid Sanambar murder  -  The Herald-ir  -  2019-06-07 00:00:00+00:00


**QUERYING THE COLLECTION**

Let's try some simple queries. **bm25** is the sparse, keyword-based ranking function (generally more effective than plain TF-IDF) that Weaviate uses for text search.

Notice that it lowercases the parsed tokens, but it does not stem them. Stemming is on the roadmap of features that Weaviate plans to support in the future.
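
To see what a BM25 score is made of, here is a minimal sketch of the textbook Okapi BM25 formula over pre-tokenized documents. It assumes the common parameter values k1=1.2 and b=0.75 and a single text field. Weaviate's own implementation scores several properties together, so the numbers will not match the ones printed by the queries in this notebook. The toy documents are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document; docs is a list of token lists."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency: in how many documents each term occurs
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # exact token match only: no stemming
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# toy, already tokenized and lowercased documents (illustrative only)
toy_docs = [
    ["chelsea", "race", "sanchez"],
    ["leclerc", "race", "race", "race", "win", "belgium", "grand", "prix", "formula", "one"],
    ["gunman", "car", "dublin"],
]
print(bm25_scores(["race"], toy_docs))
print(bm25_scores(["races"], toy_docs))  # no stemming, so "races" matches nothing
```

The second call illustrates the missing stemming: the plural "races" does not match the indexed token "race".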

In [10]:
query = "race"
response = documents.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))
    print("query term occurs: ", o.properties["maintext"].count("race"), " in maintext and ", o.properties["title"].count("race"), " in title\n")

1.18 - Conte: 'Chelsea are not in the race to sign Sanchez'
query term occurs:  1  in maintext and  1  in title

0.46 - Leclerc dedicates win to Hubert
query term occurs:  3  in maintext and  0  in title



In [13]:
# hyphens also act as token separators ("One-punch" matches the query "punch")

query = "punch"
response = documents.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

0.7 - 'One-punch killer's sentence will make others think twice'


In [12]:
# tokens are lowercased, so the query "sanchez" matches "Sanchez"

query = "sanchez"
response = documents.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

0.77 - Conte: 'Chelsea are not in the race to sign Sanchez'


In [14]:
# "source" uses FIELD tokenization, so the query "The" does not match "The New York Times" in the article on "American Airlines"

query = "The"
response = documents.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

0.97 - Conte: 'Chelsea are not in the race to sign Sanchez'


In [15]:
# English stopwords are filtered out of the field "maintext" (WORD tokenization), but they are kept in the field "title" (LOWERCASE tokenization)

query = "the"
response = documents.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

0.97 - Conte: 'Chelsea are not in the race to sign Sanchez'


**CHANGING STOPWORDS**

Suppose you now want to treat some additional words as "stopwords", beyond the ones the system considers as such by default:

documents.config.update(inverted_index_config=wc.Reconfigure.inverted_index(stopwords_additions=["victory"]))
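
The effect of adding a stopword can be sketched in plain Python. The stopword set below is a tiny illustrative subset chosen for this example, not Weaviate's actual English preset, and `index_terms` is a hypothetical helper, not a Weaviate function.

```python
import re

# tiny illustrative subset, NOT Weaviate's real English stopword preset
BASE_STOPWORDS = {"the", "a", "an", "will", "to", "in"}

def index_terms(text, additions=()):
    """Tokenize roughly like WORD tokenization, then drop stopwords."""
    stop = BASE_STOPWORDS | {w.lower() for w in additions}
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return [t for t in tokens if t not in stop]

text = "The victory will last"
print(index_terms(text))                         # "victory" is still searchable
print(index_terms(text, additions=["victory"]))  # "victory" is now ignored
```

Once "victory" is in the stopword list, a keyword query for it behaves like a query for "the": the term is ignored.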



**SEARCH ON A SPECIFIC PROPERTY**

In [23]:
# "maintext" is tokenized as WORD, so stopwords are removed (hence "will" is ignored and nothing is returned)

response = documents.query.bm25(
    query="will",
    query_properties=["maintext"], # this is the line to add
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))


In [24]:
# "title" is tokenized as LOWERCASE, so stopwords are NOT removed


response = documents.query.bm25(
    query="will",
    query_properties=["title"], # this is the line to add
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))
    print("query term occurs: ", o.properties["title"].count("will"), "\n")

0.65 - 'One-punch killer's sentence will make others think twice'
query term occurs:  1 



**AFTER FIELD BOOSTING**

Boosting a field with "^2" does not simply double the score: Weaviate applies normalization and other formulas when it combines the fields.
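
One plausible way to see why the score does not double is term-frequency saturation. The sketch below is a simplified BM25F-style model in which the boost multiplies the term frequency of a field before saturation is applied. This is an assumption made for illustration (with idf=1 and no length normalization), not Weaviate's exact formula.

```python
def saturated_tf(weighted_tf, k1=1.2):
    """BM25-style saturation: grows with tf but flattens out."""
    return weighted_tf * (k1 + 1) / (weighted_tf + k1)

def bm25f_like(field_tfs, field_weights, idf=1.0, k1=1.2):
    """Combine per-field term frequencies with weights, then saturate once."""
    combined = sum(field_weights[f] * tf for f, tf in field_tfs.items())
    return idf * saturated_tf(combined, k1)

# the query term occurs once in the title and once in the maintext
tfs = {"title": 1, "maintext": 1}
plain = bm25f_like(tfs, {"title": 1, "maintext": 1})
boosted = bm25f_like(tfs, {"title": 2, "maintext": 1})
print(plain, boosted, boosted / plain)
```

In this model the boosted score is higher than the plain one, but well below twice its value, because the saturation curve flattens as the weighted frequency grows.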

In [26]:
print("BEFORE BOOSTING:\n")
response = documents.query.bm25(
    query="race",
    query_properties=["title", "maintext"],
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

print("\n\nAFTER BOOSTING:\n")
response = documents.query.bm25(
    query="race",
    query_properties=["title^2", "maintext"],
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

BEFORE BOOSTING:

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


AFTER BOOSTING:

1.43 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


**APPLY FIELD FILTERING**

In [27]:
response = documents.query.bm25(
    query="race",
    query_properties=["title^2", "maintext"],
    filters=Filter.by_property("title").contains_any(["Leclerc", "formula"]),
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

0.54 - Leclerc dedicates win to Hubert


**FILTERING BY DATE or MANY PROPERTIES**

In [29]:
print("BEFORE FILTERING BY DATE:\n")
response = documents.query.bm25(
    query="race",
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {} ({})".format(round(o.metadata.score*100)/100, o.properties["title"], o.properties["date"]))

print("\n\nAFTER FILTERING BY DATE:\n")
reference_date = datetime.strptime("2018-08-15", "%Y-%m-%d").replace(tzinfo=timezone.utc)
response = documents.query.bm25(
    query="race",
    filters=Filter.by_property("date").greater_or_equal(reference_date),
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {} ({})".format(round(o.metadata.score*100)/100, o.properties["title"], o.properties["date"]))


print("\n\nAPPLY MANY FILTERS OVER MANY PROPERTIES:\n")
reference_date = datetime.strptime("2018-08-15", "%Y-%m-%d").replace(tzinfo=timezone.utc)
response = documents.query.bm25(
    query="race",
    filters=( ## use of &, but also |
        Filter.by_property("date").greater_or_equal(reference_date) &
        Filter.by_property("title").contains_any(["won", "formula"])
        ),
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {} ({})".format(round(o.metadata.score*100)/100, o.properties["title"], o.properties["date"]))


BEFORE FILTERING BY DATE:

1.18 - Conte: 'Chelsea are not in the race to sign Sanchez' (2018-01-23 00:00:00+00:00)
0.46 - Leclerc dedicates win to Hubert (2019-09-01 00:00:00+00:00)


AFTER FILTERING BY DATE:

0.46 - Leclerc dedicates win to Hubert (2019-09-01 00:00:00+00:00)


APPLY MANY FILTERS OVER MANY PROPERTIES:
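
The way filters combine with "&" and "|" can be mimicked in plain Python with composable predicates. The class `Pred` and the helpers `date_ge` and `title_contains_any` are hypothetical names used only for this sketch; they are not part of the Weaviate client. The two documents reuse titles and dates from the collection above.

```python
from datetime import datetime, timezone

class Pred:
    """A filter condition that can be combined with & (and) and | (or)."""
    def __init__(self, test):
        self.test = test
    def __and__(self, other):
        return Pred(lambda d: self.test(d) and other.test(d))
    def __or__(self, other):
        return Pred(lambda d: self.test(d) or other.test(d))

def date_ge(ref):
    return Pred(lambda d: d["date"] >= ref)

def title_contains_any(words):
    wanted = {w.lower() for w in words}
    # lowercase + whitespace split, similar to LOWERCASE tokenization
    return Pred(lambda d: bool(wanted & set(d["title"].lower().split())))

def utc(day):
    return datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)

docs = [
    {"title": "Conte: 'Chelsea are not in the race to sign Sanchez'", "date": utc("2018-01-23")},
    {"title": "Leclerc dedicates win to Hubert", "date": utc("2019-09-01")},
]

both = date_ge(utc("2018-08-15")) & title_contains_any(["won", "formula"])
either = date_ge(utc("2018-08-15")) | title_contains_any(["race"])
print([d["title"] for d in docs if both.test(d)])    # no title contains "won" or "formula"
print([d["title"] for d in docs if either.test(d)])
```

The "&" combination is empty for the same reason the last query above returns nothing: the only article recent enough has neither "won" nor "formula" in its title.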



**VECTORIZATION:: DENSE EMBEDDINGS**

Now for some more advanced features: let's try vector-based queries. Assume we want to find all articles that are "related to sport". In the current collection, the word "sport" does not occur in any title or maintext.
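
The idea behind dense retrieval is that texts and queries are mapped to vectors, and results are ranked by vector distance rather than by shared words. The sketch below uses cosine distance on hand-made 3-dimensional toy vectors. Real embeddings come from a model such as Cohere's and have hundreds of dimensions, so the vectors and labels here are purely illustrative assumptions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def nearest(query_vec, vectors, limit=3):
    """Rank items by increasing distance from the query vector."""
    ranked = sorted((cosine_distance(query_vec, v), name) for name, v in vectors.items())
    return ranked[:limit]

# hand-made toy vectors (illustrative only, not real embeddings)
toy_vectors = {
    "formula one race": [0.9, 0.1, 0.0],
    "football transfer": [0.8, 0.3, 0.1],
    "airline order": [0.1, 0.2, 0.9],
}
sport_query = [1.0, 0.2, 0.0]  # pretend embedding of the query "sport"

for distance, name in nearest(sport_query, toy_vectors):
    print(round(distance, 2), "-", name)
```

None of the toy labels contains the word "sport", yet the two sport-related items end up closest to the query.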

In [30]:
# The term "sport" does not occur and so BM25 does not return any result

response = documents.query.bm25(query="sport", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

In [31]:
# Unfortunately, we cannot use all the vectorizer modules that are present in Weaviate.
# Here is a list of the ones that are available
client.get_meta()

{'hostname': 'http://127.0.0.1:8079',
 'modules': {'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'Generative Search - OpenAI'},
  'qna-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'OpenAI Question & Answering Module'},
  'ref2vec-centroid': {},
  'reranker-cohere': {'documentationHref': 'https://txt.cohere.com/rerank/',
   'name': 'Reranker - Cohere'},
  'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
   'name': 'Cohere Module'},
  'text2vec-huggingface': {'documentationHref': 'https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task',
   'name': 'Hugging Face Module'},
  'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
   'name': 'OpenAI Module'}},
 'version': '1.26.6'}

Let's use Cohere as the text vectorizer (API keys: [https://dashboard.cohere.com/api-keys](https://dashboard.cohere.com/api-keys)). As the module list above shows, on Colab only a few vectorization options are available (OpenAI, Cohere, Hugging Face), and only one generative module (OpenAI).
Cohere provides free trial API keys. OpenAI does not.

This page explains how to integrate model providers: [https://weaviate.io/developers/weaviate/model-providers](https://weaviate.io/developers/weaviate/model-providers)

In [32]:
## You need first to create a KEY !!!!
from google.colab import userdata

client.close()
cohere_key = userdata.get('COHERE_APIKEY') # MAKE SURE YOU CREATED A KEY
headers = {
    "X-Cohere-Api-Key": cohere_key,
}
client = weaviate.connect_to_embedded(headers=headers)

INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 5532


**CREATE A VECTOR DB**


In [33]:
client.collections.delete_all()
client.collections.create(
    name="TestVectorizer",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
        wc.Property(name="source", data_type=wc.DataType.TEXT, tokenization=Tokenization.FIELD),
        wc.Property(name="date", data_type=wc.DataType.DATE)
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="maintext_vector",
            source_properties=["maintext"],
            #model="embed-multilingual-light-v3.0"
        )
    ]
)

<weaviate.collections.collection.sync.Collection at 0x7cfefc95a410>

In [34]:
from datetime import timezone, datetime

documents = client.collections.get("TestVectorizer")

for doc in articles:
    documents.data.insert({
        "maintext": doc["maintext"],
        "title": doc["title"],
        "source": doc["source"],
        "date": datetime.strptime(doc["date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    })


In [35]:
print("pure syntactical search (ordered by decreasing similarity score): 'sport'\n")
response = documents.query.bm25(query="sport", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure syntactical search (ordered by decreasing similarity score): 'sport'



In [36]:
print("pure vector search (ordered by increasing distance): 'sport'\n")
# NOTE: we also request distance=True; for a pure vector search the score is reported as 0.0
response = documents.query.near_text(query="sport", return_metadata=MetadataQuery(score=True, distance=True), limit=3)
for o in response.objects:
  print("{} - {} (score is {})".format(round(o.metadata.distance*100)/100, o.properties["title"], round(o.metadata.score*100)/100))

pure vector search (ordered by increasing distance): 'sport'

0.6 - Leclerc dedicates win to Hubert (score is 0.0)
0.61 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder (score is 0.0)
0.65 - Conte: 'Chelsea are not in the race to sign Sanchez' (score is 0.0)


**HYBRID SEARCH**

Let us consider a query that admits results for both syntactic and vectorized search

In [37]:
print("pure syntactical search (ordered by decreasing similarity score): 'race'\n")
response = documents.query.bm25(query="race", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure syntactical search (ordered by decreasing similarity score): 'race'

1.18 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.46 - Leclerc dedicates win to Hubert


In [None]:
print("pure vector search (ordered by increasing distance): 'race'\n")
# NOTE: we also request distance=True; for a pure vector search the score is reported as 0.0
response = documents.query.near_text(query="race", return_metadata=MetadataQuery(score=True, distance=True), limit=3)
for o in response.objects:
  print("{} - {} (score is {})".format(round(o.metadata.distance*100)/100, o.properties["title"], round(o.metadata.score*100)/100))

pure vector search (ordered by increasing distance): 'race'

0.6 - Leclerc dedicates win to Hubert (score is 0.0)
0.61 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder (score is 0.0)
0.69 - Conte: 'Chelsea are not in the race to sign Sanchez' (score is 0.0)


In [39]:
# alpha: 0 = pure syntactic (bm25) search, 1 = pure vector search
print("hybrid search (ordered by decreasing score): 'race'")
response = documents.query.hybrid(query="race", alpha=0.5, return_metadata=MetadataQuery(score=True, explain_score=True), limit=3)
for o in response.objects:
  print("{} - {} [{}]".format(round(o.metadata.score*100)/100, o.properties["title"],  o.metadata.explain_score.strip().replace("\n", '')))

hybrid search (ordered by decreasing score): 'race'
0.6 - Conte: 'Chelsea are not in the race to sign Sanchez' [Hybrid (Result Set keyword,bm25) Document 8a72ce3c-68e2-4fa3-b99b-359a551dd75d: original score 1.1829562, normalized score: 0.5 - Hybrid (Result Set vector,hybridVector) Document 8a72ce3c-68e2-4fa3-b99b-359a551dd75d: original score 0.31254542, normalized score: 0.100039184]
0.5 - Leclerc dedicates win to Hubert [Hybrid (Result Set keyword,bm25) Document 1badb2ce-59fe-4920-9c1f-d9df5242069f: original score 0.46418247, normalized score: 0 - Hybrid (Result Set vector,hybridVector) Document 1badb2ce-59fe-4920-9c1f-d9df5242069f: original score 0.39784843, normalized score: 0.5]
0.46 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder [Hybrid (Result Set vector,hybridVector) Document 3ec1bb3f-81e4-4429-a556-2e854e7fd3f0: original score 0.38973743, normalized score: 0.46196988]


[Description of how scoring works](https://weaviate.io/developers/weaviate/concepts/search/hybrid-search)
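
The "explain_score" output above is consistent with a relative score fusion: each result set is min-max normalized, weighted by alpha, and the two parts are summed. Here is a minimal sketch of that idea, using the original scores printed above. The entry "other" with vector score 0.2912 is an assumed value standing in for the lowest-scoring document of the vector result set, which is not shown in the output. The function names are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1] within one result set."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def relative_score_fusion(keyword, vector, alpha=0.5):
    """alpha weighs the vector part, (1 - alpha) the keyword part."""
    kw, vec = min_max(keyword), min_max(vector)
    fused = {
        doc: (1 - alpha) * kw.get(doc, 0.0) + alpha * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# original scores copied from the explain_score output above
keyword_scores = {"conte": 1.1829562, "leclerc": 0.46418247}
vector_scores = {
    "leclerc": 0.39784843,
    "gunman": 0.38973743,
    "conte": 0.31254542,
    "other": 0.2912,  # ASSUMED lowest vector score, not present in the output
}

for doc, score in relative_score_fusion(keyword_scores, vector_scores, alpha=0.5):
    print(round(score, 2), "-", doc)
```

Under these assumptions the fused ranking matches the order of the hybrid query: the Conte article first, then Leclerc, then the Gunman article.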

## A new method, RAG

RAG stands for Retrieval Augmented Generation. This is a recent trend in Information Retrieval that aims at reducing the problem of "hallucinations" in Large Language Model generation, and at returning better answers grounded in local document archives.
- Traditional queries: the user submits a query to a search engine; the search engine returns, in some predefined format, the results for that query.
- LLM queries: the user submits a query to a Large Language Model (LLM); the LLM creates an answer based on the (often unspecified) data that was originally used to train it. The LLM often hallucinates, returning wrong answers.
- RAG: the user submits a query to a search engine; the search engine runs the query and gets its results. Before being returned to the user, the results are sent to an LLM, which processes them and generates a textual response that is more convenient for the user to read but (ideally) contains no hallucinated information, because it is based on the retrieved results.
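
The RAG flow described above can be sketched in a few lines of plain Python. Everything here is a simplified assumption: `retrieve` ranks by naive word overlap instead of bm25 or vectors, and `generate` is any callable standing in for a real LLM call (a stub is used below so the sketch runs offline).

```python
def retrieve(query, docs, limit=2):
    """Toy retrieval: rank documents by the number of words shared with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(d["maintext"].lower().split())), d) for d in docs]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:limit]]

def build_prompt(task, hits):
    """Put the retrieved texts into the prompt so the LLM answers from them."""
    context = "\n\n".join("[{}] {}".format(i + 1, d["maintext"]) for i, d in enumerate(hits))
    return "{}\n\nUse only the articles below.\n\n{}".format(task, context)

def rag_answer(query, task, docs, generate, limit=2):
    """Retrieve first, then let the generator work on the retrieved texts."""
    hits = retrieve(query, docs, limit)
    return generate(build_prompt(task, hits))

# toy documents and a stub generator (illustrative only)
toy_docs = [
    {"maintext": "Leclerc won the race in Belgium"},
    {"maintext": "An airline ordered new supersonic jets"},
]
stub_llm = lambda prompt: "LLM would answer from: " + prompt

print(rag_answer("race", "Summarize in one sentence.", toy_docs, stub_llm))
```

Swapping `stub_llm` for a function that calls a real model turns this skeleton into an actual RAG pipeline. The retrieval step decides what the model is allowed to see.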

https://weaviate.io/developers/weaviate/model-providers

**GENERATIVE AI with OPENAI GPT-4**

Now let's add some generative AI prompts to this query (for example, to add context to the entities in the news, or to translate the news into Italian).
Note that this query will only work for those who have a paid OpenAI API key.

In [50]:
client.close()

cohere_key = userdata.get('COHERE_APIKEY')
openai_key = userdata.get("OPEN_APIKEY2")
headers = {
    "X-Cohere-Api-Key": cohere_key,
    "X-OpenAI-Api-Key": openai_key
}

client = weaviate.connect_to_embedded(headers=headers)

INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 9157


In [51]:
client.collections.delete_all()

client.collections.create(
    name="TestVectorizer",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
        wc.Property(name="source", data_type=wc.DataType.TEXT, tokenization=Tokenization.FIELD),
        wc.Property(name="date", data_type=wc.DataType.DATE)
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="maintext_vector",
            source_properties=["maintext"],
            #model="embed-multilingual-light-v3.0"
        )
    ],
    generative_config=Configure.Generative.openai(model="gpt-4") # added generation module
)

<weaviate.collections.collection.sync.Collection at 0x7cff09b19410>

In [52]:
documents = client.collections.get("TestVectorizer")
for doc in articles:
    documents.data.insert({
        "maintext": doc["maintext"],
        "title": doc["title"]
        }) # here weaviate performs the vectorization

In [53]:
response = documents.generate.near_text(
    query="sport",  # The model provider integration will automatically vectorize the query
    single_prompt="Write a short summary of maximum 100 characters in Italian of {maintext}",
    limit=2 # apply LLM to the top 2 results
)

In [54]:
for obj in response.objects:
    print(obj.properties["title"])
    print(f"Generated output: {obj.generated}")  # Note that the generated output is per object
    print("====================================================")
    print()

Leclerc dedicates win to Hubert
Generated output: Charles Leclerc ha ottenuto la sua prima vittoria in Formula Uno al Gran Premio del Belgio, dedicandola ad Anthoine Hubert.

Gunman opens fire on car just metres from scene of Hamid Sanambar murder
Generated output: La polizia cerca un uomo armato che ha sparato su un'auto a Dublino, vicino al luogo dove Hamid Sanambar è stato ucciso.



**GENERATIVE AI with an external COHERE**

The code above implements RAG using an external LLM (OpenAI), invoked via the internal Weaviate module. We can also implement RAG by calling the LLM directly, using Cohere both for the vectorization (inside Weaviate) and for the generation (directly with an API call). This way we do not need to pay for an OpenAI API key.

In [45]:
!pip install cohere

Collecting cohere
  Downloading cohere-5.14.0-py3-none-any.whl.metadata (3.4 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Downloading fastavro-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting httpx-sse==0.4.0 (from cohere)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere)
  Downloading types_requests-2.32.0.20250328-py3-none-any.whl.metadata (2.3 kB)
Downloading cohere-5.14.0-py3-none-any.whl (253 kB)
Downloading httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)
Downloading fastavro-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Downloading types_requests-2.32.0.20250328-py3-none-any.whl (20 kB)
Inst

In [46]:
print("hybrid search: 'race'")
response = documents.query.hybrid(query="race", alpha=0.5, return_metadata=MetadataQuery(score=True, explain_score=True), limit=3)
for o in response.objects:
  print("{} - {} [{}]".format(round(o.metadata.score*100)/100, o.properties["title"],  o.metadata.explain_score.strip().replace("\n", '')))

hybrid search: 'race'
0.6 - Conte: 'Chelsea are not in the race to sign Sanchez' [Hybrid (Result Set keyword,bm25) Document 1fbd403d-8c83-4033-b991-9f302296a2f9: original score 1.1795324, normalized score: 0.5 - Hybrid (Result Set vector,hybridVector) Document 1fbd403d-8c83-4033-b991-9f302296a2f9: original score 0.31253493, normalized score: 0.10006716]
0.5 - Leclerc dedicates win to Hubert [Hybrid (Result Set keyword,bm25) Document 4ba90a36-2af6-4c47-b84e-565e5d9b6183: original score 0.46129686, normalized score: 0 - Hybrid (Result Set vector,hybridVector) Document 4ba90a36-2af6-4c47-b84e-565e5d9b6183: original score 0.3978603, normalized score: 0.5]
0.46 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder [Hybrid (Result Set vector,hybridVector) Document 85a95143-ed92-4458-ba96-1c614389ae63: original score 0.389758, normalized score: 0.46202332]


In [47]:
import cohere

co = cohere.ClientV2(api_key=cohere_key)
res = co.chat(
    model="command-r-plus-08-2024", # this is a cohere model
    messages=[
        {
            "role": "user",
            "content": "Write a short summary (100 characters at max), in Italian of the textual article \
            provided below: \n\n {}".format(response.objects[1].properties["maintext"]),
        } # response includes all the results returned by the previous hybrid query
    ],
)

print(res.message.content[0].text)

Charles Leclerc vince il Gran Premio del Belgio, dedicando la vittoria all'amico Anthoine Hubert, tragicamente scomparso.


In [48]:
res = co.chat(
    model="command-r-plus-08-2024", # this is a cohere model
    messages=[
        {
            "role": "user",
            "content": "Write a one sentence summary in French of the textual article \
            provided below: \n\n {}".format(response.objects[0].properties["maintext"]),
        }
    ],
)

print(res.message.content[0].text)

Antonio Conte, entraîneur de Chelsea, a déclaré qu'il ne pensait pas que le club était en lice pour signer Alexis Sanchez, l'attaquant d'Arsenal, et a évité de discuter du marché des transferts, affirmant qu'il préférait laisser ces questions au club.
