# Ingestion API Usage

- Replace `BASE_URL` with the actual server URL where the API is hosted.
- Replace the directory path (`../data/dataset`) with the correct location of your dataset.
- Follow the steps in the README under [Data Ingestion](../docs/quickstart.md#data-ingestion) to retrieve the dataset, including installing Git LFS and pulling the files with it.
- Replace file paths (`example_document.pdf`) with appropriate files for the Upload and Delete endpoints.
- Modify the `collection_name` accordingly for the Upload and Delete endpoints.
- Ensure the server is running before executing the notebook.

#### 1. Install Dependencies

In [None]:
!pip install requests

#### 2. Setup Base Configuration

In [None]:
import requests
import json
from typing import Dict, Any

IPADDRESS = "localhost"  # Replace this with the correct IP address
RAG_PORT = "8081"
BASE_URL = f"http://{IPADDRESS}:{RAG_PORT}"  # Replace with your server URL

def print_response(response: requests.Response):
    """Helper to print API response."""
    print(f"Status Code: {response.status_code}")
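    try:
        print(json.dumps(response.json(), indent=2))
    except ValueError:
        # The body is not valid JSON; every JSON decode error raised by
        # response.json() is a ValueError subclass, so fall back to raw text.
        print(response.text)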

#### 3. Health Check Endpoint

**Purpose:**
This endpoint performs a health check on the server. It returns a 200 status code if the server is operational.

In [None]:
# GET /health
url = f"{BASE_URL}/health"
response = requests.get(url)
print_response(response)
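
While the server is still starting, a single health check can fail even though the service becomes ready moments later. The sketch below polls until a 200 status code is returned. It is an illustration, not part of the API: `wait_until_healthy` and its `check`, `attempts`, and `delay` parameters are hypothetical names, and the request is passed in as a callable so the helper does not depend on how the call is made.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], int], attempts: int = 5, delay: float = 1.0) -> bool:
    """Call `check` until it returns 200 or the attempts run out.

    `check` is a hypothetical callable that performs the health request
    and returns the HTTP status code.
    """
    for attempt in range(1, attempts + 1):
        try:
            if check() == 200:
                return True
        except OSError:
            # Connection failures surface as OSError subclasses
            # (the exceptions raised by `requests` derive from it too).
            pass
        if attempt < attempts:
            time.sleep(delay)  # wait before the next attempt
    return False
```

With the configuration above, it could be called as `wait_until_healthy(lambda: requests.get(f"{BASE_URL}/health", timeout=5).status_code)`.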

#### 4. Upload Document Endpoint

**Purpose:**
This endpoint uploads a document to the vector store. You can specify the collection name where the document should be stored. To speed up ingestion, the example uploads files in parallel using a `ThreadPoolExecutor` from `concurrent.futures`.

In [None]:
# Extract the Dataset
!unzip ../data/dataset.zip -d ../data

##### Upload multiple files with `concurrent.futures`

In [None]:
import os
from concurrent.futures import ThreadPoolExecutor

def upload_file(file_path: str, collection_name: str):
    """Upload a single file to the given collection."""
    url = f"{BASE_URL}/documents"
    params = {"collection_name": collection_name}
    print(f"Uploading {os.path.basename(file_path)}...")
    # Open the file in a context manager so the handle is closed after the request
    with open(file_path, "rb") as f:
        response = requests.post(url, files={"file": f}, params=params)
    print_response(response)

directory_path = "../data/dataset"  # Replace with your directory path
collection_name = "nvidia_blogs"

file_paths = [os.path.join(directory_path, f) for f in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, f))]

with ThreadPoolExecutor() as executor:
    # Consume the iterator so exceptions raised in worker threads are surfaced
    list(executor.map(lambda file: upload_file(file, collection_name), file_paths))
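
The upload cell prints each response but does not record which files failed. One possible way to collect an outcome per file is sketched below, using `submit` and `as_completed` from `concurrent.futures`. The names `upload_all`, `upload`, and `max_workers=4` are assumptions made for illustration; `upload` stands for any callable that uploads one file and returns something useful, such as a status code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Iterable


def upload_all(paths: Iterable[str], upload: Callable[[str], Any], max_workers: int = 4) -> Dict[str, Any]:
    """Run `upload` for every path in parallel and map each path to its outcome.

    The outcome is the return value of `upload`, or the exception it raised.
    """
    results: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future belongs to which path
        futures = {executor.submit(upload, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going so one failure does not hide the rest
                results[path] = exc
    return results
```

To use this with the notebook's `upload_file`, that function would need to return a value (for example `response.status_code`); the failed paths are then the entries whose outcome is an exception or a non-success code.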

#### 5. Get Documents Endpoint

**Purpose:**
This endpoint retrieves a list of documents ingested into the vector store for a specified collection.

In [None]:
# GET /documents
url = f"{BASE_URL}/documents"
params = {"collection_name": "nvidia_blogs"}

response = requests.get(url, params=params)
print_response(response)
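
The document list can be used to verify an ingestion run by comparing it with the files on disk. The sketch below assumes documents are identified by their base filename, as the Delete endpoint suggests. `missing_uploads` is a hypothetical helper, and how the ingested names are extracted from the response depends on the response schema, which is not shown here.

```python
import os
from typing import Iterable, List


def missing_uploads(local_paths: Iterable[str], ingested_names: Iterable[str]) -> List[str]:
    """Return the local paths whose base filename is not among the ingested names."""
    ingested = set(ingested_names)
    return sorted(p for p in local_paths if os.path.basename(p) not in ingested)
```

Passing `file_paths` from the upload cell together with the names taken from the response would yield the files that still need to be uploaded.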

#### 6. Delete Document Endpoint

**Purpose:**
This endpoint deletes a specified document from the vector store. The document is identified by its filename.

In [None]:
# DELETE /documents
url = f"{BASE_URL}/documents"
params = {
    "filename": "example_document.pdf",  # Replace with the file you want to delete
    "collection_name": "nvidia_blogs"
}

response = requests.delete(url, params=params)
print_response(response)

#### 7. Get Collections Endpoint

**Purpose:**
This endpoint retrieves a list of all collection names available on the server. Collections are used to organize documents in the vector store.

In [None]:
# GET /collections
url = f"{BASE_URL}/collections"
response = requests.get(url)
print_response(response)