## **Weaviate**

[Official Tutorials](https://academy.weaviate.io/courses/wa101t-py)

`Weaviate` is an open-source vector database that stores, indexes, and searches high-dimensional vector embeddings. It is designed for large-scale datasets and efficient similarity search.

### **Key Features of Weaviate:**

- **Vector Search:** Weaviate allows you to perform similarity searches based on vector embeddings, making it suitable for applications like recommendation systems, semantic search, and more.

- **Schema Flexibility:** You can define custom schemas for your data, allowing you to structure and organize your information as needed.

- **GraphQL API:** Weaviate provides a GraphQL API for querying your data, alongside REST and gRPC interfaces, making it easy to integrate with various applications.

- **Scalability:** Weaviate is designed to scale horizontally, allowing you to handle large datasets and high query loads.

<hr>
<hr>
<hr>

## **Setting up Weaviate with Docker Compose**

To set up Weaviate using Docker Compose, you can create a `docker-compose.yml` file with the following content:

```yaml

services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    # Replace `1.33.1` with your desired Weaviate version
    image: cr.weaviate.io/semitechnologies/weaviate:1.33.1
    ports:
      - 8080:8080
      - 50051:50051
    restart: on-failure:0
    volumes:
      - weaviate_data:/var/lib/weaviate
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_API_BASED_MODULES: 'true'
      BACKUP_FILESYSTEM_PATH: '/var/lib/weaviate/backups'
      CLUSTER_HOSTNAME: 'node1'
volumes:
  weaviate_data:

```

### **Creating Client to Interact with Weaviate**

You can use the `weaviate-client` library to interact with your Weaviate instance. Here's an example of how to create a client:

```python

import weaviate
import os

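# Read the Cohere API key from an environment variable rather than hard-coding it.
# The variable name COHERE_API_KEY is an assumption; use whichever name you export.
headers = {
    "X-Cohere-Api-Key": os.environ["COHERE_API_KEY"]
}

client = weaviate.connect_to_local(headers=headers)
print(client.is_ready())  # True once the instance is reachable

client.close()  # Release the connection when done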
``` 

<hr>
<hr>
<hr>


## **Data Types in Weaviate**

[Documentation](https://academy.weaviate.io/courses/wa050-py)

### **Objects**

An `Object` is a single entity in a vector database, comparable to a `row` in a relational database. Each object has the following components:

- `ID`: A unique identifier for the object.

- `Properties`: Key-value pairs that store additional information about the object.

- `Vector`: A high-dimensional vector representation of the object, typically generated using embedding models.

<img src="../Notes_Images/Object.png" alt="Weaviate Object" width="800"/>

<hr>

### **Collections**

A `Collection` organizes objects into distinct sets of information. 

Each collection includes metadata (configuration), indexes, and an object store.

**Metadata/configurations**: Define the structure and behavior of the collection, including schema definitions, indexing strategies, and other settings.

**Indexes**: Facilitate efficient searching and retrieval of objects.

**Object Store**: The actual storage location for the objects within the collection.

<img src="../Notes_Images/Collection.png" alt="Weaviate Collection" width="800"/>

<hr>

### **Vectors**

A vector embedding captures the meaning of an object as a series of numbers. This representation enables similarity-based search.

An "embedding model" transforms your input into vectors. For example, the text `"You're a wizard, Harry!"` might become the vector `[0.0134, 0.8723, -0.4532, …, 0.5842]`. This is also called putting an object into a "high-dimensional vector space", because each vector has many numbers (dimensions), and thus a vector can be thought of as a point in this space.

Similar concepts produce similar vectors. Vectors for `"You're a wizard, Harry!"` and `"You can do magic, Henry!"` would be mathematically close to each other. This enables Weaviate to understand they're related even though the exact words differ.

<img src="../Notes_Images/Vectors.png" alt="Weaviate Vector Space" width="800"/>
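To make "mathematically close" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings (not produced by a real model)
wizard = [0.9, 0.8, 0.1]  # "You're a wizard, Harry!"
magic = [0.8, 0.9, 0.2]   # "You can do magic, Henry!"
taxes = [0.1, 0.2, 0.9]   # an unrelated sentence

sim_related = cosine_similarity(wizard, magic)
sim_unrelated = cosine_similarity(wizard, taxes)

print(f"wizard vs magic: {sim_related:.3f}")
print(f"wizard vs taxes: {sim_unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores much lower.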

<hr>
<hr>
<hr>



## **Indexes**

`Indexes` in Weaviate are data structures that speed up the retrieval of objects, so that a search does not have to scan every object in a collection.

In a vector database, each collection has its own indexes that keep searches fast and accurate.

Each index may speed up specific operations, such as vector search, keyword search, or filtering based on object properties.

<img src="../Notes_Images/Indexes.png" alt="Weaviate Indexes" width="800"/>
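As a rough illustration of why an index helps, the sketch below builds a toy inverted index, the kind of structure typically used for keyword search and filtering. It maps each token to the IDs of the objects that contain it, so a lookup avoids scanning every object. The data and helper name are hypothetical, and this is not Weaviate's actual implementation.

```python
from collections import defaultdict

# Hypothetical objects keyed by ID
objects = {
    1: "a history of the roman empire",
    2: "dystopian future with robots",
    3: "the future of history lessons",
}


def build_inverted_index(docs):
    """Map each lowercase token to the set of object IDs containing it."""
    index = defaultdict(set)
    for obj_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(obj_id)
    return index


index = build_inverted_index(objects)

print(sorted(index["history"]))  # [1, 3]
print(sorted(index["future"]))   # [2, 3]
```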


<hr>
<hr>
<hr>



## **Weaviate API**

The `Weaviate API` lets you interact with the Weaviate vector database programmatically: managing collections and objects, and performing searches.

It offers three interfaces:

**RESTful API**: A RESTful interface for performing CRUD operations on collections and objects.

**gRPC API**: A gRPC interface for high-performance communication with Weaviate.

**GraphQL API**: A GraphQL interface for querying data in Weaviate.
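As a small example of talking to the REST interface directly, the sketch below checks the readiness endpoint using only the standard library. It assumes the instance from the Docker Compose file is listening on `localhost:8080`; the helper names `endpoint` and `is_ready` are made up for this example.

```python
import urllib.request

BASE_URL = "http://localhost:8080"  # REST port from the Docker Compose file (assumed local setup)


def endpoint(path):
    """Build a full URL for a REST path such as /v1/meta."""
    return f"{BASE_URL}{path}"


def is_ready(timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(endpoint("/v1/.well-known/ready"), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors (URLError subclasses OSError)
        return False


if __name__ == "__main__":
    print("Weaviate ready:", is_ready())
```

In day-to-day use the Python client wraps these calls, so `client.is_ready()` is usually the more convenient option.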

<hr>
<hr>
<hr>


## **Searching in Weaviate**

### **Keyword Search**

Keyword search uses a traditional text-matching algorithm (`BM25`) for exact term matching.

<img src="../Notes_Images/Keyword.png" alt="Weaviate Keyword Search" width="800"/>

`BM25` (Best Matching 25) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic information retrieval model and is widely used in information retrieval systems.
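The following is a minimal sketch of the textbook BM25 formula, included to show how term frequency, document length, and term rarity combine into a score. The parameter values `k1=1.2` and `b=0.75` are common defaults and are assumptions here; Weaviate's own tokenization and scoring details may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` using textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Number of documents containing the term
            doc_freq = sum(1 for t in tokenized if term in t)
            if doc_freq == 0:
                continue
            # Rarer terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
            freq = term_freq[term]
            # Longer-than-average documents are penalized via `b`
            denom = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical mini-corpus
docs = [
    "a history of rome",
    "history history and more history",
    "robots in the future",
]
scores = bm25_scores("history", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The second document ranks highest because the query term appears three times, and the third scores zero because it never mentions the term.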

### **Vector Search**

Vector search finds objects based on their vector similarity to a given query vector. It uses distance metrics such as cosine distance or Euclidean distance, where a smaller distance means a closer match.

<img src="../Notes_Images/Vector_S.png" alt="Weaviate Vector Search" width="800"/>
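Conceptually, a vector search ranks objects by their distance to the query vector and returns the closest few. The sketch below does this by brute force over made-up vectors. Production systems, Weaviate included, rely on an approximate nearest-neighbor index instead of comparing against every vector.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest(query_vec, vectors, limit=2):
    """Return the `limit` closest (name, distance) pairs, nearest first."""
    ranked = sorted(vectors.items(), key=lambda item: cosine_distance(query_vec, item[1]))
    return [(name, round(cosine_distance(query_vec, vec), 3)) for name, vec in ranked[:limit]]


# Hypothetical 3-dimensional embeddings
vectors = {
    "Blade Runner": [0.9, 0.1, 0.2],
    "Gladiator": [0.1, 0.9, 0.1],
    "The Matrix": [0.8, 0.2, 0.3],
}
query_vec = [1.0, 0.0, 0.2]  # stands in for the embedding of "dystopian future"

results = nearest(query_vec, vectors)
for name, distance in results:
    print(name, distance)
```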

### **Hybrid Search**

Hybrid search combines keyword and vector search to provide more comprehensive results. It runs both searches and fuses their results into a single ranking, so an object can rank well through exact term matches, semantic similarity, or both.

<img src="../Notes_Images/Hybrid_S.png" alt="Weaviate Hybrid Search" width="800"/>
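One way to picture the fusion step is a weighted sum of normalized scores. The sketch below is a simplified illustration, not Weaviate's exact algorithm: it min-max normalizes each score set and blends them with a weight `alpha`, where `alpha=0` means keyword only and `alpha=1` means vector only. The scores and IDs are made up.

```python
def normalize(scores):
    """Min-max normalize a dict of scores into the range 0..1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(keyword, vector, alpha=0.5):
    """Blend keyword and vector scores; missing entries count as 0."""
    kw, vec = normalize(keyword), normalize(vector)
    ids = set(kw) | set(vec)
    return {i: (1 - alpha) * kw.get(i, 0.0) + alpha * vec.get(i, 0.0) for i in ids}


# Hypothetical per-object scores (higher is better in both)
keyword = {"A": 2.0, "B": 1.0, "C": 0.0}  # e.g. BM25 scores
vector = {"A": 0.2, "B": 0.9, "D": 0.6}   # e.g. vector similarities

fused = hybrid_scores(keyword, vector, alpha=0.5)
best = max(fused, key=fused.get)
print(best, round(fused[best], 3))
```

Object `B` wins here because it does reasonably well in both searches, even though `A` has the best keyword score.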

<hr>
<hr>
<hr>



## **Overview of Weaviate Modules**

<img src="../Notes_Images/Overview.png" alt="Weaviate Modules" width="800"/>

<hr>
<hr>
<hr>


## **Creating Collection**

```python

import weaviate
from weaviate.classes.config import Configure, Property, DataType
import os


# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

client.collections.create(
    name="Movies",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="overview", data_type=DataType.TEXT),
        Property(name="vote_average", data_type=DataType.NUMBER),
        Property(name="genre_ids", data_type=DataType.INT_ARRAY),
        Property(name="release_date", data_type=DataType.DATE),
        Property(name="tmdb_id", data_type=DataType.INT),
    ],
    # Define the vectorizer module
    vector_config=Configure.Vectors.text2vec_cohere(model="embed-v4.0"),
    # Define the generative module
    generative_config=Configure.Generative.cohere(model="command-a-03-2025")
)

client.close()

```

In the above code, we create a collection named `Movies` with various properties such as `title`, `overview`, `vote_average`, etc. 

We've defined the schema explicitly by specifying the properties and their data types. Weaviate also supports implicit schema creation (auto-schema), where property types are inferred automatically from the data you insert.

**Vector Configuration:** We specify the vectorization module to use `Cohere's embed-v4.0` model for generating text embeddings.

**Generative Configuration:** We also specify the generative module to use `Cohere's command-a-03-2025` model for generating text based on prompts.

This way we can create a collection in Weaviate with a defined schema, vectorization, and generative capabilities.

<hr>
<hr>
<hr>


## **Using the Collection to Load Data**

```python

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Configure collection object
movies = client.collections.use("Movies")

# Enter context manager
with movies.batch.fixed_size(batch_size=200) as batch:
    # Loop through the data
    for i, movie in tqdm(df.iterrows()):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.fromisoformat(movie["release_date"]).replace(tzinfo=timezone.utc)
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])

        # Build the object payload
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()
```

This way we can load data into the `Movies` collection in Weaviate using a batch process for efficiency.

Here, we've used `Static Batching`, where we define a fixed batch size of `200` objects. The batcher automatically sends the batches when the specified size is reached.

We can also use `Dynamic Batching`, where the batcher automatically adjusts the batch size based on the system's performance and resource availability.
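To show what fixed-size batching means mechanically, here is a toy batcher that buffers objects and flushes whenever the buffer reaches `batch_size`, plus once more on exit. The class `FixedSizeBatcher` and its `send` callback are hypothetical stand-ins and do not reflect the client's internals.

```python
class FixedSizeBatcher:
    """Toy context manager that sends objects in fixed-size chunks."""

    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send  # callable that receives a list of objects
        self.buffer = []

    def add_object(self, obj):
        self.buffer.append(obj)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()  # send whatever is left over
        return False


sent_sizes = []
with FixedSizeBatcher(batch_size=200, send=lambda b: sent_sizes.append(len(b))) as batch:
    for i in range(450):
        batch.add_object({"tmdb_id": i})

print(sent_sizes)  # [200, 200, 50]
```

The final partial batch is why the real import loop also runs inside a `with` block: leaving the block sends the remainder.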

<hr>
<hr>
<hr>


## **Semantic Search with Weaviate**

Now that we've stored the data in `Weaviate`, we can perform semantic searches using vector embeddings.

```python 
import weaviate
from weaviate.classes.query import Filter, MetadataQuery
import os


# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.query.near_text(
    query="dystopian future",
    limit=5,
    return_metadata=MetadataQuery(distance=True)
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"], o.properties["release_date"].year)  # Print the title and release year (note the release date is a datetime object)
    print(f"Distance to query: {o.metadata.distance:.3f}\n")  # Print the distance of the object from the query

client.close()
```

Here, the `collection` is configured with one vector per object. It is also possible to attach multiple vectors to a single object, allowing for choices in how to search the object.


When using multiple vectors, you must specify which vector to use for the search by setting the `target_vector` parameter in the query.
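The idea of choosing a target vector can be sketched without a running instance. In the toy example below, each object carries two named vectors, and the same query vector returns a different best match depending on which one is searched. The data, vector names, and `search` helper are all hypothetical.

```python
# Hypothetical objects, each with two named 2-dimensional vectors
objects = [
    {"title": "Movie A", "vectors": {"title": [1.0, 0.0], "overview": [0.0, 1.0]}},
    {"title": "Movie B", "vectors": {"title": [0.0, 1.0], "overview": [1.0, 0.0]}},
]


def search(query_vec, target_vector, limit=1):
    """Rank objects by squared Euclidean distance on the chosen named vector."""

    def squared_distance(obj):
        vec = obj["vectors"][target_vector]
        return sum((q - v) ** 2 for q, v in zip(query_vec, vec))

    return [obj["title"] for obj in sorted(objects, key=squared_distance)[:limit]]


by_title = search([1.0, 0.0], target_vector="title")
by_overview = search([1.0, 0.0], target_vector="overview")

print(by_title)     # ['Movie A']
print(by_overview)  # ['Movie B']
```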

<hr>
<hr>



## **Keyword Search**

We can also perform traditional keyword searches in Weaviate using the `bm25` query, which ranks objects by how well their text matches the query terms.

```python
import weaviate
from weaviate.classes.query import Filter, MetadataQuery
import os


# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.query.bm25(
    query="history", limit=5, return_metadata=MetadataQuery(score=True)
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"], o.properties["release_date"].year)  # Print the title and release year (note the release date is a datetime object)
    print(f"BM25 score: {o.metadata.score:.3f}\n")  # Print the BM25 score of the object from the query

client.close()
```

<hr>
<hr>
<hr>


## **Hybrid Search**

We can combine both semantic search and keyword search in Weaviate using hybrid queries.

```python
import weaviate
from weaviate.classes.query import Filter, MetadataQuery
import os


# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.query.hybrid(
    query="history", limit=5, return_metadata=MetadataQuery(score=True)
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"], o.properties["release_date"].year)  # Print the title and release year (note the release date is a datetime object)
    print(f"Hybrid score: {o.metadata.score:.3f}\n")  # Print the hybrid search score of the object from the query

client.close()
```

<hr>
<hr>
<hr>


## **Filtering While Searching**

We can apply filters to our searches in Weaviate to narrow down the results based on specific criteria.

```python
import weaviate
from weaviate.classes.query import Filter, MetadataQuery
import os

from datetime import datetime, timezone


# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.query.near_text(
    query="dystopian future",
    limit=5,
    return_metadata=MetadataQuery(distance=True),
    # Keep only movies released after 1 Jan 2020 (timezone-aware to avoid ambiguity)
    filters=Filter.by_property("release_date").greater_than(datetime(2020, 1, 1, tzinfo=timezone.utc))
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"], o.properties["release_date"].year)  # Print the title and release year (note the release date is a datetime object)
    print(f"Distance to query: {o.metadata.distance:.3f}\n")  # Print the distance of the object from the query

client.close()
```
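Conceptually, a filtered search narrows the candidate set first and then ranks what remains. The sketch below mimics that with made-up movies and precomputed distances. It is only an illustration of the idea, not of how Weaviate evaluates filters internally.

```python
from datetime import datetime, timezone

# Hypothetical candidates with precomputed distances to the query vector
candidates = [
    {"title": "Old Dystopia", "release_date": datetime(1999, 3, 31, tzinfo=timezone.utc), "distance": 0.10},
    {"title": "New Dystopia", "release_date": datetime(2021, 6, 1, tzinfo=timezone.utc), "distance": 0.25},
    {"title": "New Comedy", "release_date": datetime(2022, 2, 14, tzinfo=timezone.utc), "distance": 0.80},
]


def filtered_search(items, cutoff, limit=5):
    """Keep items released after `cutoff`, then return the nearest ones."""
    allowed = [m for m in items if m["release_date"] > cutoff]
    return sorted(allowed, key=lambda m: m["distance"])[:limit]


cutoff = datetime(2020, 1, 1, tzinfo=timezone.utc)
results = [m["title"] for m in filtered_search(candidates, cutoff)]

print(results)  # ['New Dystopia', 'New Comedy']
```

Note that "Old Dystopia" is the closest match overall but is excluded because it fails the date filter.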

<hr>
<hr>
<hr>


## **Single Prompt Generations**

Until now, we've focused on searching and retrieving data from Weaviate. However, Weaviate also supports generative capabilities, allowing us to generate text based on prompts using integrated language models.

```python
import os
import weaviate

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.generate.near_text(
    query="dystopian future",
    limit=5,
    single_prompt="Translate this into French: {title}"
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"])  # Print the title
    print(o.generated)  # Print the generated text (the title, in French)

client.close()
```
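With a single prompt, the template is filled in once per retrieved object, so five results yield five generations. The sketch below imitates only the templating step with Python's `str.format`. The helper name and sample data are hypothetical, and no language model is called.

```python
def build_single_prompts(template, objects):
    """Fill the {property} placeholders once for each object's properties."""
    return [template.format(**props) for props in objects]


# Hypothetical search results (properties only)
retrieved = [
    {"title": "Blade Runner", "overview": "A blade runner hunts replicants."},
    {"title": "The Matrix", "overview": "A hacker learns the truth about reality."},
]

prompts = build_single_prompts("Translate this into French: {title}", retrieved)
for prompt in prompts:
    print(prompt)
```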

<hr>
<hr>
<hr>


## **Group Task Generations**

We can also perform grouped task generations in Weaviate, where a single prompt is applied to the whole set of retrieved objects and produces one generated output for the group.

```python

import os
import weaviate

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.generate.near_text(
    query="dystopian future",
    limit=5,
    grouped_task="What do these movies have in common?",
    # grouped_properties=["title", "overview"]  # Optional parameter; for reducing prompt length
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"])  # Print the title
print(response.generated)  # Print the generated text (the commonalities between them)

client.close()
```
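In contrast to a single prompt, a grouped task sends one prompt covering all retrieved objects. The sketch below shows how limiting the included properties shortens that combined prompt. The prompt layout and the helper `build_grouped_prompt` are assumptions for illustration; the format Weaviate actually sends to the model is not specified here.

```python
import json


def build_grouped_prompt(task, objects, grouped_properties=None):
    """Combine the task with one JSON line per object, optionally limiting properties."""
    rows = [
        {k: v for k, v in props.items() if grouped_properties is None or k in grouped_properties}
        for props in objects
    ]
    return task + "\n" + "\n".join(json.dumps(row) for row in rows)


# Hypothetical search results (properties only)
retrieved = [
    {"title": "Blade Runner", "overview": "A blade runner hunts replicants.", "tmdb_id": 78},
    {"title": "The Matrix", "overview": "A hacker learns the truth about reality.", "tmdb_id": 603},
]

full_prompt = build_grouped_prompt("What do these movies have in common?", retrieved)
short_prompt = build_grouped_prompt(
    "What do these movies have in common?", retrieved, grouped_properties=["title"]
)

print(short_prompt)
print(len(full_prompt), len(short_prompt))
```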