# Ingestion API Usage

This notebook demonstrates how to use the ingestion APIs to upload and index documents for retrieval-augmented generation (RAG) applications. It shows how to create a collection and upload documents to it, using Milvus as the vector database, and how to manage uploaded documents and existing collections.



- Before executing the notebook, make sure the ingestor-server container is running by [following the steps in the readme](../docs/quickstart.md#start-the-containers-for-ingestion-microservices).
- Replace `BASE_URL` with the actual server URL if the API is hosted on another system.
- Replace the directory path (`../data/multimodal`) with the location of your dataset if it differs.
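
The server URL mentioned above can also be made configurable without editing the notebook. The following is a minimal sketch, assuming two hypothetical environment variables, `INGESTOR_HOST` and `INGESTOR_PORT` (these names are chosen for illustration and are not defined by the ingestor server). The defaults mirror the `localhost` and `8082` values used in the configuration cell of this notebook.

```python
import os


def resolve_base_url(env=None):
    """Build the ingestor base URL from environment-style settings.

    INGESTOR_HOST and INGESTOR_PORT are hypothetical names used only in this
    sketch; the fallbacks match the notebook's localhost:8082 setup.
    """
    env = os.environ if env is None else env
    host = env.get("INGESTOR_HOST", "localhost")
    port = env.get("INGESTOR_PORT", "8082")
    return f"http://{host}:{port}"


# An empty mapping forces the defaults, so this prints http://localhost:8082
print(resolve_base_url({}))
```

Passing a mapping explicitly keeps the function easy to test; calling `resolve_base_url()` with no argument reads the real process environment.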


#### 1. Install Dependencies and Import Required Modules

In [None]:
!pip install aiohttp
import aiohttp
import os
import json

#### 2. Setup Base Configuration

In [None]:
IPADDRESS = "localhost" #Replace this with the correct IP address
INGESTOR_SERVER_PORT = "8082"
BASE_URL = f"http://{IPADDRESS}:{INGESTOR_SERVER_PORT}"  # Replace with your server URL

async def print_response(response):
    """Helper to print API response."""
    try:
        response_json = await response.json()
        print(json.dumps(response_json, indent=2))
    except (aiohttp.ClientResponseError, json.JSONDecodeError):
        # Body is not JSON (wrong content type or malformed): print it as raw text
        print(await response.text())
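
The fallback logic in `print_response` (try JSON first, then fall back to raw text) can be illustrated without a running server. This is a minimal sketch that uses only the standard library; `format_body` is a hypothetical helper name, and the sample strings are illustrative inputs, not actual server responses.

```python
import json


def format_body(raw_text):
    """Pretty-print raw_text as JSON when it parses, else return it unchanged."""
    try:
        return json.dumps(json.loads(raw_text), indent=2)
    except json.JSONDecodeError:
        # Not valid JSON: keep the original text so nothing is lost
        return raw_text


# Illustrative inputs only; real responses depend on the server
print(format_body('{"example": "value"}'))
print(format_body("plain text body"))
```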

#### 3. Health Check Endpoint

**Purpose:**
This endpoint performs a health check on the server. It returns a 200 status code if the server is operational.

In [None]:
async def fetch_health_status():
    """Fetch health status asynchronously."""
    url = f"{BASE_URL}/v1/health"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            await print_response(response)

# Run the async function
await fetch_health_status()

#### 4. Create Collection Endpoint

**Purpose:**
This endpoint creates one or more collections in the vector store.

In [None]:
async def create_collections(
    collection_names: list = None,
    collection_type: str = "text",
    embedding_dimension: int = 2048
):

    params = {
        "collection_type": collection_type,
        "embedding_dimension": embedding_dimension
    }

    HEADERS = {"Content-Type": "application/json"}

    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(f"{BASE_URL}/v1/collections", params=params, json=collection_names, headers=HEADERS) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            # Report connection errors the same way as the other cells
            print(f"Error: {e}")

await create_collections(collection_names=["multimodal_data"])
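
The cell above sends the collection type and embedding dimension as query parameters and the list of collection names as the JSON body. The sketch below makes that layout explicit and adds basic input checks before any request is sent. `build_create_collections_request` is a hypothetical helper, and the checks are assumptions made for illustration, not validation rules documented by the server.

```python
def build_create_collections_request(
    base_url,
    collection_names,
    collection_type="text",
    embedding_dimension=2048,
):
    """Return the URL, query parameters, and JSON body for creating collections."""
    if not collection_names:
        raise ValueError("collection_names must contain at least one name")
    if embedding_dimension <= 0:
        raise ValueError("embedding_dimension must be positive")
    return {
        "url": f"{base_url}/v1/collections",
        # Sent as query parameters, as in the notebook cell
        "params": {
            "collection_type": collection_type,
            "embedding_dimension": embedding_dimension,
        },
        # Sent as the JSON request body
        "json": list(collection_names),
    }


request = build_create_collections_request("http://localhost:8082", ["multimodal_data"])
print(request["url"])
```

The returned dictionary can be unpacked into the call as `session.post(**request)`, since `url`, `params`, and `json` are all parameters accepted by `aiohttp`'s `session.post`.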

#### 5. Upload Documents Endpoint

**Purpose:**
This endpoint uploads new documents to the vector store. 
1. Specify the name of the collection in which the documents should be stored.
2. The target collection must already exist in the vector database.
3. The uploaded documents must not already exist in the collection. To re-ingest files that already exist in the collection, replace `session.post(...)` with `session.patch(...)`.
4. To speed up ingestion, pass multiple files in a single request, as shown below.
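
The rule in point 3 implies that each file needs either a `POST` (new document) or a `PATCH` (re-ingest). The sketch below splits a list of local paths accordingly. It assumes that `existing_names` is a list of filenames already present in the collection; how to obtain that list from the Get Documents response is not shown here, because the response schema is not described in this notebook. `plan_uploads` is a hypothetical helper name.

```python
import os


def plan_uploads(local_paths, existing_names):
    """Split local files into new uploads (POST) and re-ingests (PATCH).

    Files are matched against the collection by base filename.
    """
    existing = set(existing_names)
    to_post, to_patch = [], []
    for path in local_paths:
        target = to_patch if os.path.basename(path) in existing else to_post
        target.append(path)
    return to_post, to_patch


# Illustrative file names only
new_files, reingest_files = plan_uploads(
    ["data/a.pdf", "data/b.pdf", "data/c.pdf"],
    existing_names=["b.pdf"],
)
print(new_files)       # files to send with session.post
print(reingest_files)  # files to send with session.patch
```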

In [None]:
DATA_DIR = "../data/multimodal"
async def upload_documents(collection_name: str = ""):
    files = [os.path.join(DATA_DIR, f) for f in os.listdir(DATA_DIR) if os.path.isfile(os.path.join(DATA_DIR, f))]

    data = {
        "collection_name": collection_name,
        "extraction_options": {
            "extract_text": True,
            "extract_tables": True,
            "extract_charts": True,
            "extract_images": False, # Set to True to extract images; requires the VLM model to be deployed
            "extract_method": "pdfium",
            "text_depth": "page",
        },
        "split_options": {
            "chunk_size": 1024,
            "chunk_overlap": 150
        }
    }

    form_data = aiohttp.FormData()
    for file_path in files:
        form_data.add_field("documents", open(file_path, "rb"), filename=os.path.basename(file_path), content_type="application/pdf")

    form_data.add_field("data", json.dumps(data), content_type="application/json")

    async with aiohttp.ClientSession() as session:
        try:
            async with session.post(f"{BASE_URL}/v1/documents", data=form_data) as response: # Replace with session.patch for reingesting
                await print_response(response)
        except aiohttp.ClientError as e:
            print(f"Error: {e}")

await upload_documents(collection_name="multimodal_data")
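
The upload cell labels every file as `application/pdf`. If the data directory holds other file types, the content type could instead be derived from the file extension. This is a minimal sketch using the standard `mimetypes` module; `describe_upload` is a hypothetical helper, and whether the server relies on the per-file content type is not stated in this notebook.

```python
import mimetypes
import os


def describe_upload(file_path):
    """Return (filename, content_type) for one multipart `documents` field."""
    content_type, _encoding = mimetypes.guess_type(file_path)
    # Fall back to a generic binary type when the extension is unknown
    return os.path.basename(file_path), content_type or "application/octet-stream"


print(describe_upload("../data/multimodal/report.pdf"))
```

The two returned values map onto the `filename` and `content_type` arguments of `form_data.add_field(...)` in the cell above.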


#### 6. Get Documents Endpoint

**Purpose:**
This endpoint retrieves a list of documents ingested into the vector store for a specified collection.

In [None]:
async def fetch_documents(collection_name: str = ""):
    url = f"{BASE_URL}/v1/documents"
    params = {"collection_name": collection_name}
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(url, params=params) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            print(f"Error: {e}")

await fetch_documents(collection_name="multimodal_data")

#### 7. Delete Documents Endpoint

**Purpose:**
This endpoint deletes the specified documents from the vector store. Each document is identified by its filename.

In [None]:
from typing import List, Optional

async def delete_documents(collection_name: str = "", file_names: Optional[List[str]] = None):
    file_names = file_names or []  # Avoid a shared mutable default argument
    url = f"{BASE_URL}/v1/documents"
    params = {"collection_name": collection_name}
    async with aiohttp.ClientSession() as session:
        try:
            async with session.delete(url, params=params, json=file_names) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            print(f"Error: {e}")

await delete_documents(collection_name="multimodal_data", file_names=["embedded_table.pdf", "table_test.pdf"])
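
Since documents are identified by filename, local paths need to be reduced to base filenames before they are passed as `file_names`. The sketch below shows one way to do that while dropping duplicates; `to_file_names` is a hypothetical helper introduced for illustration.

```python
import os


def to_file_names(paths):
    """Map paths to unique base filenames, preserving first-seen order."""
    seen = set()
    names = []
    for path in paths:
        name = os.path.basename(path)
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


print(to_file_names(["../data/multimodal/embedded_table.pdf", "table_test.pdf"]))
```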

#### 8. Get Collections Endpoint

**Purpose:**
This endpoint retrieves a list of all collection names available on the server. Collections are used to organize documents in the vector store.

In [None]:
async def fetch_collections():
    url = f"{BASE_URL}/v1/collections"
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(url) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            print(f"Error: {e}")

await fetch_collections()

#### 9. Delete Collections Endpoint

**Purpose:**
This endpoint deletes the provided list of collections from the vector database server.

In [None]:
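from typing import List, Optional

async def delete_collections(collection_names: Optional[List[str]] = None):
    collection_names = collection_names or []  # Default to an empty list, not an empty string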
    url = f"{BASE_URL}/v1/collections"
    async with aiohttp.ClientSession() as session:
        try:
            async with session.delete(url, json=collection_names) as response:
                await print_response(response)
        except aiohttp.ClientError as e:
            print(f"Error: {e}")

await delete_collections(collection_names=["multimodal_data"])