## **Weaviate**

[Official Tutorials](https://academy.weaviate.io/courses/wa101t-py)

`Weaviate` is an open-source vector search engine that allows you to store, index, and search through high-dimensional vector embeddings. It is designed to handle large-scale datasets and provides efficient similarity search capabilities.

### **Key Features of Weaviate:**

- **Vector Search:** Weaviate allows you to perform similarity searches based on vector embeddings, making it suitable for applications like recommendation systems, semantic search, and more.

- **Schema Flexibility:** You can define custom schemas for your data, allowing you to structure and organize your information as needed.

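- **GraphQL, REST, and gRPC APIs:** Weaviate exposes GraphQL, REST, and gRPC APIs for querying and managing your data, making it easy to integrate with various applications. The Python client used below talks to the server over HTTP (port `8080`) and gRPC (port `50051`).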

- **Scalability:** Weaviate is designed to scale horizontally, allowing you to handle large datasets and high query loads.

<hr>
<hr>
<hr>

## **Setting up Weaviate with Docker Compose**

To set up Weaviate using Docker Compose, you can create a `docker-compose.yml` file with the following content:

```yaml

services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    # Replace `1.33.1` with your desired Weaviate version
    image: cr.weaviate.io/semitechnologies/weaviate:1.33.1
    ports:
      - 8080:8080
      - 50051:50051
    restart: on-failure:0
    volumes:
      - weaviate_data:/var/lib/weaviate
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_API_BASED_MODULES: 'true'
      BACKUP_FILESYSTEM_PATH: '/var/lib/weaviate/backups'
      CLUSTER_HOSTNAME: 'node1'
volumes:
  weaviate_data:

```
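Once the container has been started (for example with `docker compose up -d`), it can take a few seconds before Weaviate accepts requests. The following is a minimal sketch of a readiness probe using only the standard library. It assumes the port mapping from the compose file above (`8080`) and Weaviate's `/v1/.well-known/ready` REST endpoint; the helper names `ready_url` and `is_ready` are made up for this illustration.

```python
import urllib.request


def ready_url(host: str = "localhost", port: int = 8080) -> str:
    """Build the URL of the readiness endpoint (host and port are assumptions)."""
    return f"http://{host}:{port}/v1/.well-known/ready"


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeouts, and HTTP errors all end up here,
        # because urllib's URLError and HTTPError are subclasses of OSError.
        return False
```

Calling `is_ready(ready_url())` in a short retry loop before connecting avoids confusing connection errors while the server is still starting.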

### **Creating Client to Interact with Weaviate**

You can use the `weaviate-client` library to interact with your Weaviate instance. Here's an example of how to create a client:

```python

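import os

import weaviate

# Read the Cohere API key from the environment instead of hardcoding it.
# COHERE_APIKEY is an assumed variable name; use whatever your setup defines.
headers = {"X-Cohere-Api-Key": os.environ["COHERE_APIKEY"]}

client = weaviate.connect_to_local(headers=headers)
print(client.is_ready())  # True once the server is up

client.close()  # Release the connection when you are done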
``` 

<hr>
<hr>
<hr>


## **Core Concepts in Weaviate**

### **Collection**

A `Collection` is a group of related data objects in Weaviate. It serves as a container for organizing and managing similar types of data.

### **Data Object**

A `Data Object` is an individual item or record within a collection. Each data object contains specific attributes and values that define its properties.

### **Vector**

A `Vector` is a numerical representation of data in a multi-dimensional space. In Weaviate, vectors are used to represent the semantic meaning of data objects, enabling similarity searches and comparisons.
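To make "similarity" concrete, here is a small sketch of cosine distance, a metric commonly used to compare embeddings. The three-dimensional vectors are hand-made toy values, not output from a real embedding model, and the variable names are purely illustrative.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings" (real models produce hundreds or thousands of dimensions)
space_opera = [0.9, 0.1, 0.0]
sci_fi_epic = [0.8, 0.2, 0.1]
rom_com = [0.0, 0.2, 0.9]

print(cosine_distance(space_opera, sci_fi_epic))  # small: similar direction
print(cosine_distance(space_opera, rom_com))      # close to 1: nearly unrelated
```

A smaller distance means the two objects are closer in meaning, which is how the `distance` values returned by a vector search can be read.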

### **Schema**

A `Schema` defines the structure of the data within a Weaviate instance. It specifies each collection, the properties of its data objects and their data types, and collection-level settings such as the vectorizer, as well as any relationships between collections.

<hr>
<hr>
<hr>



## **Creating Collection**

```python

import weaviate
from weaviate.classes.config import Configure, Property, DataType
import os


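# Connect to the local instance started with Docker Compose
# (or use weaviate.connect_to_weaviate_cloud(...) for a cloud instance).
# COHERE_APIKEY is an assumed environment variable name.
client = weaviate.connect_to_local(
    headers={"X-Cohere-Api-Key": os.environ["COHERE_APIKEY"]}
)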

client.collections.create(
    name="Movies",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="overview", data_type=DataType.TEXT),
        Property(name="vote_average", data_type=DataType.NUMBER),
        Property(name="genre_ids", data_type=DataType.INT_ARRAY),
        Property(name="release_date", data_type=DataType.DATE),
        Property(name="tmdb_id", data_type=DataType.INT),
    ],
    # Define the vectorizer module
    vector_config=Configure.Vectors.text2vec_cohere(model="embed-v4.0"),
    # Define the generative module
    generative_config=Configure.Generative.cohere(model="command-a-03-2025")
)

client.close()

```

In the above code, we create a collection named `Movies` with various properties such as `title`, `overview`, `vote_average`, etc. 

Here the schema is defined explicitly: every property and its data type is listed up front. Weaviate also supports implicit schema creation (auto-schema), where the properties and their types are inferred from the data you insert.
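The idea behind implicit schema creation can be illustrated with a toy type-inference function. This is only a sketch of the concept, not Weaviate's actual inference logic: the real auto-schema works on the incoming JSON, is configurable, and may pick different types than this heuristic (for whole numbers, for instance). The function name `infer_data_type` is hypothetical.

```python
from datetime import datetime


def infer_data_type(value) -> str:
    """Guess a Weaviate-style data type name for a Python value (toy heuristic)."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "number"
    if isinstance(value, datetime):
        return "date"
    if isinstance(value, str):
        return "text"
    if isinstance(value, list) and value:
        # Use the first element to decide the array type
        return infer_data_type(value[0]) + "[]"
    raise TypeError(f"cannot infer a data type for {value!r}")


print(infer_data_type("A title"))   # text
print(infer_data_type([28, 878]))   # int[]
```

Because inferred types can be surprising (a numeric ID stored as a floating-point number, say), defining the schema explicitly, as in the code above, is the more predictable choice.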

**Vector Configuration:** The vectorizer module is set to Cohere's `embed-v4.0` model, which generates the text embeddings.

**Generative Configuration:** The generative module is set to Cohere's `command-a-03-2025` model, which generates text from prompts.

This way we can create a collection in Weaviate with a defined schema, vectorization, and generative capabilities.

<hr>
<hr>
<hr>


## **Using the Collection to Load Data**

```python

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

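# Connect to the local instance (or use weaviate.connect_to_weaviate_cloud(...)).
# Objects are vectorized with Cohere on import, so the API key header is needed;
# COHERE_APIKEY is an assumed environment variable name.
client = weaviate.connect_to_local(
    headers={"X-Cohere-Api-Key": os.environ["COHERE_APIKEY"]}
)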

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
resp.raise_for_status()  # Stop early if the download failed
df = pd.DataFrame(resp.json())

# Configure collection object
movies = client.collections.use("Movies")

# Enter context manager
with movies.batch.fixed_size(batch_size=200) as batch:
    # Loop through the data
    for _, movie in tqdm(df.iterrows(), total=len(df)):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.fromisoformat(movie["release_date"]).replace(tzinfo=timezone.utc)
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])

        # Build the object payload
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()
```

This way we can load data into the `Movies` collection in Weaviate using a batch process for efficiency.
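The per-row conversions inside the loop above can be pulled into a small helper and checked without a running Weaviate instance. The sketch below makes two assumptions: the sample row is made up (it only mirrors the field names of the dataset), and a plain `uuid.uuid5` over the ID stands in for `weaviate.util.generate_uuid5`, so the resulting UUIDs are not guaranteed to match the library's.

```python
import json
import uuid
from datetime import datetime, timezone


def prepare_movie(row: dict) -> tuple[dict, str]:
    """Convert one raw row into (properties, uuid) ready for a batch insert."""
    properties = {
        "title": row["title"],
        "overview": row["overview"],
        "vote_average": row["vote_average"],
        # The genre list arrives as a JSON string such as "[28, 878]"
        "genre_ids": json.loads(row["genre_ids"]),
        # Attach UTC so the timestamp is unambiguous
        "release_date": datetime.fromisoformat(row["release_date"]).replace(tzinfo=timezone.utc),
        "tmdb_id": row["id"],
    }
    # Deterministic ID: the same source id always yields the same UUID.
    # Assumption: stand-in for weaviate.util.generate_uuid5.
    object_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, str(row["id"])))
    return properties, object_id


# Hypothetical sample row, for illustration only
sample = {
    "id": 1,
    "title": "Example Movie",
    "overview": "A made-up overview.",
    "vote_average": 7.5,
    "genre_ids": "[28, 878, 12]",
    "release_date": "2010-07-15",
}
props, object_id = prepare_movie(sample)
print(props["release_date"].isoformat())      # 2010-07-15T00:00:00+00:00
print(props["genre_ids"])                     # [28, 878, 12]
print(object_id == prepare_movie(sample)[1])  # True
```

Deriving the UUID from the source ID means that re-running the import targets the same objects again rather than creating duplicates under fresh random IDs.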

Here we use fixed-size (static) batching: the batch size is set to `200` objects, and the batcher sends a batch automatically each time that size is reached.

Dynamic batching is also available: the batcher then adjusts the batch size automatically based on the system's performance and resource availability.
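Conceptually, fixed-size batching just groups the incoming objects into chunks and sends one request per chunk. The following sketch shows that grouping with the standard library only; `fixed_size_batches` is a hypothetical helper, not part of the Weaviate client, and it leaves out everything the real batcher does around sending, retries, and error tracking.

```python
from itertools import islice
from typing import Iterable, Iterator


def fixed_size_batches(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


sizes = [len(batch) for batch in fixed_size_batches(range(450), batch_size=200)]
print(sizes)  # [200, 200, 50]
```

The last, partially filled chunk corresponds to the objects that are flushed when the `with` block exits. In the Python client, the dynamic counterpart of `movies.batch.fixed_size(...)` is entered with `movies.batch.dynamic()`.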

<hr>
<hr>
<hr>


## **Semantic Search with Weaviate**

Now that we've stored the data in `Weaviate`, we can perform semantic searches using vector embeddings.

```python 
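import os

import weaviate
from weaviate.classes.query import MetadataQuery

# Connect to the local instance (or use weaviate.connect_to_weaviate_cloud(...)).
# near_text embeds the query with Cohere, so the API key header is needed;
# COHERE_APIKEY is an assumed environment variable name.
client = weaviate.connect_to_local(
    headers={"X-Cohere-Api-Key": os.environ["COHERE_APIKEY"]}
)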

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.query.near_text(
    query="dystopian future",
    limit=5,
    return_metadata=MetadataQuery(distance=True)
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"], o.properties["release_date"].year)  # Print the title and release year (note the release date is a datetime object)
    print(f"Distance to query: {o.metadata.distance:.3f}\n")  # Print the distance of the object from the query

client.close()
```

Here, the collection is configured with one vector per object. It is also possible to attach multiple (named) vectors to a single object, so the same object can be searched in different ways, for example by its title or by its overview.

When using multiple vectors, you must specify which vector to use for the search by setting the `target_vector` parameter in the query.
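The effect of choosing a target vector can be shown with a toy in-memory search. Everything here is an illustrative assumption: the two objects, their two-dimensional vectors, and the vector names `title` and `overview` are made up, and the brute-force ranking stands in for Weaviate's vector index.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


# Each object carries two hypothetical named vectors
objects = [
    {"title": "Movie A", "vectors": {"title": [1.0, 0.0], "overview": [0.0, 1.0]}},
    {"title": "Movie B", "vectors": {"title": [0.0, 1.0], "overview": [1.0, 0.0]}},
]


def nearest(query_vector: list[float], target_vector: str, limit: int = 1) -> list[str]:
    """Rank objects by distance, comparing only against the chosen named vector."""
    ranked = sorted(
        objects,
        key=lambda o: cosine_distance(query_vector, o["vectors"][target_vector]),
    )
    return [o["title"] for o in ranked[:limit]]


print(nearest([1.0, 0.1], target_vector="title"))     # ['Movie A']
print(nearest([1.0, 0.1], target_vector="overview"))  # ['Movie B']
```

The same query vector returns different top results depending on which named vector it is compared against, which is why the query has to name one. With the client, this corresponds to passing `target_vector="..."` to `near_text`.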

<hr>
<hr>
<hr>



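## **Keyword Search**

We can also perform traditional keyword searches in Weaviate with the BM25 ranking algorithm, which scores objects by how often and how distinctively the query terms appear in their text.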

```python
import weaviate
from weaviate.classes.query import MetadataQuery

# Connect to the local instance (or use weaviate.connect_to_weaviate_cloud(...)).
# BM25 is a pure keyword search, so no vectorizer API key is needed here.
client = weaviate.connect_to_local()

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.query.bm25(
    query="history", limit=5, return_metadata=MetadataQuery(score=True)
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"], o.properties["release_date"].year)  # Print the title and release year (note the release date is a datetime object)
    print(f"BM25 score: {o.metadata.score:.3f}\n")  # Print the BM25 score of the object from the query

client.close()
```
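To give a feel for where the BM25 score comes from, here is a minimal sketch of the classic BM25 formula over whitespace-tokenized strings. It is not Weaviate's implementation, which differs in details such as tokenization and how multiple properties are weighted. The parameters `k1 = 1.2` and `b = 0.75` are commonly used defaults, and the three documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue  # term appears nowhere, contributes nothing
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
            tf = counts[term]
            # Term frequency saturates via k1; b normalizes for document length
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "a history of ancient rome",
    "history repeats itself and history is remembered",
    "a romantic comedy set in paris",
]
for doc, score in zip(docs, bm25_scores("history", docs)):
    print(f"{score:.3f}  {doc}")
```

Documents that do not contain the query term score zero, and repeated occurrences raise the score with diminishing returns. Unlike the distance in a vector search, a higher BM25 score means a better match.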

<hr>
<hr>
<hr>


## **Hybrid Search**

Hybrid queries combine semantic (vector) search and keyword (BM25) search in Weaviate, merging both result sets into a single ranked list.

```python
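import os

import weaviate
from weaviate.classes.query import MetadataQuery

# Connect to the local instance (or use weaviate.connect_to_weaviate_cloud(...)).
# The vector half of a hybrid query embeds the query with Cohere, so the API key
# header is needed; COHERE_APIKEY is an assumed environment variable name.
client = weaviate.connect_to_local(
    headers={"X-Cohere-Api-Key": os.environ["COHERE_APIKEY"]}
)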

# Configure collection object
movies = client.collections.use("Movies")

# Perform query
response = movies.query.hybrid(
    query="history", limit=5, return_metadata=MetadataQuery(score=True)
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"], o.properties["release_date"].year)  # Print the title and release year (note the release date is a datetime object)
    print(f"Hybrid score: {o.metadata.score:.3f}\n")  # Print the hybrid search score of the object from the query

client.close()
```
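How the two result sets are merged can be sketched as a weighted blend of normalized scores. This is a simplified illustration of score-based fusion, not Weaviate's exact algorithm. The per-object scores are made up, and the weighting parameter is named `alpha` after the parameter of the same name on hybrid queries, where `1` means pure vector search and `0` means pure keyword search.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def hybrid_scores(
    vector: dict[str, float], keyword: dict[str, float], alpha: float = 0.5
) -> dict[str, float]:
    """Blend normalized scores: alpha=1 is pure vector, alpha=0 is pure keyword."""
    vec, kw = normalize(vector), normalize(keyword)
    return {
        key: alpha * vec.get(key, 0.0) + (1 - alpha) * kw.get(key, 0.0)
        for key in vec.keys() | kw.keys()
    }


# Hypothetical per-object scores (higher is better in both lists)
vector_scores = {"A": 0.9, "B": 0.6, "C": 0.3}
keyword_scores = {"B": 4.0, "C": 2.0, "D": 1.0}

combined = hybrid_scores(vector_scores, keyword_scores, alpha=0.5)
for key, score in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(key, round(score, 3))
```

Object `B` ranks first because it does reasonably well in both lists, even though `A` has the best vector score. Normalizing first matters because raw BM25 scores and vector similarities live on different scales.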