# Document Collections

The IONOS AI Model Hub API allows you to access vector databases to persist your document collections and find semantically
similar documents. 

The vector database persists documents in document collections. A document can be any plain text. A
document collection stores not only the input text but also its embedding, a vector of numbers computed from the text.
Semantically similar input texts have similar embeddings. A similarity search on a document collection finds the
embeddings most similar to that of a given input text and returns them to the user together with the
corresponding input texts.
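The idea of "similar embeddings" can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure, which is a common choice but is not confirmed for the AI Model Hub in this tutorial. The three-dimensional vectors are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real embeddings have far more dimensions.
toy_embeddings = {
    "IONOS hosts your AI workloads": [0.9, 0.1, 0.2],
    "Cows graze on green meadows": [0.1, 0.9, 0.3],
}
# Hypothetical embedding standing in for the query "What does IONOS do?"
query_embedding = [0.8, 0.2, 0.1]

# Rank the stored texts by similarity to the query, most similar first.
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_embedding, toy_embeddings[text]),
    reverse=True,
)
most_similar = ranked[0]
```

In this toy example, the first text ranks highest because its vector points in almost the same direction as the query vector.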

## Overview

This tutorial is intended for developers. It assumes you have basic knowledge of:

* REST APIs and how to call them
* A programming language to handle REST API endpoints (for illustration purposes, this tutorial uses Python and Bash scripting)

By the end of this tutorial, you'll be able to:

* Create, delete and query a document collection in the IONOS vector database
* Save, delete and modify documents in the document collection
* Answer customer queries using the document collection

## Background

* The IONOS AI Model Hub API offers a vector database that you can use to persist text in document collections
  without having to manage corresponding hardware yourself.
* Our AI Model Hub API provides all required functionality without your data being transferred out of Germany.


## Prerequisites

We strongly suggest that you save your IONOS API token as an environment variable in your operating system. You can then access it using the following lines of code:


In [1]:
from dotenv import load_dotenv
import os

load_dotenv()
IONOS_API_TOKEN = os.getenv('IONOS_API_TOKEN')

header = {
    "Authorization": f"Bearer {IONOS_API_TOKEN}", 
    "Content-Type": "application/json"
}
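If the environment variable is missing, `os.getenv` returns `None` and the header silently contains `Bearer None`. The following sketch shows one way to fail early instead. The helper name `build_header` is hypothetical and not part of any IONOS library.

```python
def build_header(token):
    """Build the request header and raise early if the token is missing."""
    if not token:
        raise RuntimeError("IONOS_API_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


# Placeholder token for illustration only; in practice pass
# os.getenv("IONOS_API_TOKEN") as shown above.
example_header = build_header("example-token")
```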

## Manage document collections

In this section you learn how to create a document collection. In the next step, you will fill this collection with the data from your
knowledge base.

To help you check whether something went wrong, this section also shows how to:
* List existing document collections
* Remove document collections
* Get metadata of a document collection

### Create document collections

To create a document collection, you have to specify the name of the collection and a description and invoke the 
endpoint to generate document collections:

In [2]:
import requests

COLLECTION_NAME = "my collection"
COLLECTION_DESCRIPTION = "test collection to check the functionality of IONOS collections"
CHUNK_OVERLAP = 50
CHUNK_SIZE = 128
EMBEDDING_MODEL = "BAAI/bge-m3"
DATA_BACKEND = "pgvector"

endpoint = "https://inference.de-txl.ionos.com/collections"
body = {
    "properties": {
        "name": COLLECTION_NAME,
        "description": COLLECTION_DESCRIPTION,
        "chunking": {
            "enabled": True,
            "strategy": {
                "config": {
                    "chunk_overlap": CHUNK_OVERLAP,
                    "chunk_size": CHUNK_SIZE
                }
            }
        },
        "embedding": {
            "model": EMBEDDING_MODEL
        },
        "engine": {
            "db_type": DATA_BACKEND
        }
    }
}
response = requests.post(endpoint, json=body, headers=header)
response.status_code

201

In [3]:
collection_meta_data = response.json()
collection_meta_data

{'href': 'https://inference.de-txl.ionos.com/collections',
 'id': '6ed916fa-8dea-4c78-9596-c8a27216b258',
 'metadata': {'createdDate': '2025-02-03T10:44:16Z',
  'lastModifiedDate': '2025-02-03T10:44:16Z'},
 'properties': {'chunking': {'enabled': True,
   'strategy': {'config': {'chunk_overlap': 50, 'chunk_size': 128},
    'name': 'fixed_size'}},
  'description': 'test collection to check the functionality of IONOS collections',
  'documentsCount': 0,
  'embedding': {'model': 'BAAI/bge-m3'},
  'engine': {'db_type': 'pgvector'},
  'labels': {},
  'name': 'my collection',
  'totalTokens': 0},
 'type': 'collection'}

You can specify the following parameters when you create your document collection:

**Chunking**

The AI Model Hub supports fixed-length chunking. If you apply chunking, long documents are
split before being uploaded into the document collection. This is beneficial in the following
cases:

* If your document exceeds the maximum text length your embedding model can handle.
* If your document covers several distinct semantic topics.

You can control chunking using:

* **CHUNK_OVERLAP**: The number of overlapping tokens in two subsequent chunks.
* **CHUNK_SIZE**: The maximum number of tokens per chunk.
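How the two parameters interact can be sketched locally. The example below is only an illustration of fixed-size chunking with overlap. It assumes whitespace-separated words as tokens, whereas the service uses its own tokenizer, and its exact handling of chunk boundaries may differ. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into chunks of at most chunk_size tokens.

    Two subsequent chunks share chunk_overlap tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words stand in for tokens.
tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
# chunks == [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

A larger overlap keeps more context across chunk boundaries at the cost of storing more chunks per document.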

**Embedding model**

The AI Model Hub supports different embedding models. You can use any of them when saving
your documents to the document collection by setting the parameter **EMBEDDING_MODEL**.

**Database engine**

The AI Model Hub supports different databases in the backend to persist your data. Upon 
creation of your collection, you can choose which of them to use by setting **DATA_BACKEND** to:

* **pgvector**: PGVector uses a PostgreSQL database as the backend to persist the document collection. We provide the corresponding PostgreSQL instance as a Database as a Service (DBaaS) offering. This allows you to scale as your demands grow by switching from the managed PostgreSQL to your own DBaaS instance.
* **chromadb**: ChromaDB is a database optimized for persisting document collections. Unlike PostgreSQL, it does not support relational database features.

If you remove the **chunking**, **embedding**, and **engine** sections from the body of your
request, we will create a document collection with default parameters: no chunking is applied, a sentence transformer is used as the embedding model, and the data is stored in our managed ChromaDB.
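A small helper can make the optional sections explicit. The sketch below builds the request body in the same shape as the example above and omits every section that is not specified, so that the defaults apply. The function `build_collection_body` and its parameter names are hypothetical.

```python
def build_collection_body(name, description, chunk_size=None, chunk_overlap=None,
                          embedding_model=None, db_type=None):
    """Build the request body for creating a document collection.

    Optional sections left as None are omitted from the body.
    """
    properties = {"name": name, "description": description}
    if chunk_size is not None and chunk_overlap is not None:
        properties["chunking"] = {
            "enabled": True,
            "strategy": {
                "config": {
                    "chunk_overlap": chunk_overlap,
                    "chunk_size": chunk_size,
                }
            },
        }
    if embedding_model is not None:
        properties["embedding"] = {"model": embedding_model}
    if db_type is not None:
        properties["engine"] = {"db_type": db_type}
    return {"properties": properties}


# Body that relies on the default parameters.
default_body = build_collection_body("my collection", "uses the defaults")

# Body that reproduces the settings used in this tutorial.
custom_body = build_collection_body(
    "my collection",
    "test collection",
    chunk_size=128,
    chunk_overlap=50,
    embedding_model="BAAI/bge-m3",
    db_type="pgvector",
)
```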

If the creation of the document collection was successful, the status code of the request is 201 and it returns a JSON document with all relevant information concerning the document collection.

To modify the document collection you need its identifier. You can extract it from the field **id** of the returned JSON document.

In [4]:
COLLECTION_ID = collection_meta_data['id']
COLLECTION_ID

'6ed916fa-8dea-4c78-9596-c8a27216b258'

### List existing document collections

To ensure that the previous step went as expected, you can list the existing document collections.

To retrieve a list of all document collections saved by you:

In [5]:
import requests

endpoint = "https://inference.de-txl.ionos.com/collections"
response = requests.get(endpoint, headers=header)
response.json()

{'href': 'https://inference.de-txl.ionos.com/collections',
 'id': '153c8eca-bdc9-549b-bbd9-b3fe40ddcba7',
 'items': [{'href': 'https://inference.de-txl.ionos.com/collections',
   'id': '6ed916fa-8dea-4c78-9596-c8a27216b258',
   'metadata': {'createdDate': '2025-02-03T10:44:16Z',
    'lastModifiedDate': '2025-02-03T10:44:16Z'},
   'properties': {'chunking': {'enabled': True,
     'strategy': {'config': {'chunk_overlap': 50, 'chunk_size': 128},
      'name': 'fixed_size'}},
    'description': 'test collection to check the functionality of IONOS collections',
    'documentsCount': 0,
    'embedding': {'model': 'BAAI/bge-m3'},
    'engine': {'db_type': 'pgvector'},
    'labels': {},
    'name': 'my collection',
    'totalTokens': 0},
   'type': 'collection'},
  {'href': 'https://inference.de-txl.ionos.com/collections',
   'id': '24b23c66-e616-437e-a1cb-f4867fe4c915',
   'metadata': {'createdDate': '2025-02-03T10:11:49Z',
    'lastModifiedDate': '2025-02-03T10:11:49Z'},
   'properties': {'c

This query returns a JSON document consisting of your document collections and corresponding meta information.

The result consists of 8 attributes per collection of which 3 are relevant for you:
* **id**: The identifier of the document collection
* **properties.description**: The textual description of the document collection
* **properties.documentsCount**: The number of documents persisted in the document collection

If you have not created a collection yet, the field **items** is an empty list.
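To reduce the listing to the three fields named above, you could post-process the response as in the following sketch. The function `summarize_collections` is hypothetical. The sample dictionary is a shortened copy of the output shown above and replaces a live `response.json()` call so that the example runs offline.

```python
def summarize_collections(listing):
    """Extract id, description and document count for each collection."""
    return [
        {
            "id": item["id"],
            "description": item["properties"]["description"],
            "documentsCount": item["properties"]["documentsCount"],
        }
        for item in listing.get("items", [])
    ]


# Shortened sample of the listing response shown above.
sample_listing = {
    "items": [
        {
            "id": "6ed916fa-8dea-4c78-9596-c8a27216b258",
            "properties": {
                "description": "test collection to check the functionality of IONOS collections",
                "documentsCount": 0,
                "name": "my collection",
            },
            "type": "collection",
        }
    ]
}

summary = summarize_collections(sample_listing)
```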

### Get metadata for a document collection

If you are interested in the metadata of a collection, you can extract it by invoking:

In [6]:
import requests

endpoint = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
result = requests.get(endpoint, headers=header)
result.status_code

200

This query returns a status code which indicates whether the collection exists:
* 200: The collection exists
* 404: The collection does not exist
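The two status codes can be mapped to a boolean as in the sketch below. The function name `collection_exists` is hypothetical. Any other status code is treated as unexpected here, because this tutorial does not document further codes for this endpoint.

```python
def collection_exists(status_code):
    """Map the status code of the metadata request to a boolean."""
    if status_code == 200:
        return True
    if status_code == 404:
        return False
    # Assumption: every other code signals a problem with the request itself.
    raise RuntimeError(f"Unexpected status code: {status_code}")
```

With the request above, you would call it as `collection_exists(result.status_code)`.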

In [7]:
result.json()

{'href': 'https://inference.de-txl.ionos.com/collections/6ed916fa-8dea-4c78-9596-c8a27216b258',
 'id': '6ed916fa-8dea-4c78-9596-c8a27216b258',
 'metadata': {'createdDate': '2025-02-03T10:44:16Z',
  'lastModifiedDate': '2025-02-03T10:44:16Z'},
 'properties': {'chunking': {'enabled': True,
   'strategy': {'config': {'chunk_overlap': 50, 'chunk_size': 128},
    'name': 'fixed_size'}},
  'description': 'test collection to check the functionality of IONOS collections',
  'documentsCount': 0,
  'embedding': {'model': 'BAAI/bge-m3'},
  'engine': {'db_type': 'pgvector'},
  'labels': {},
  'name': 'my collection',
  'totalTokens': 0},
 'type': 'collection'}

The body of the response consists of all metadata of the document collection.

## Manage documents in document collection

In this section, you learn how to add documents to the newly created document collection. To validate your insertion, this section
also shows how to:

* List the documents in the document collection
* Get metadata for a document
* Update an existing document
* Prune a document collection

### Add documents to document collection

To add an entry to the document collection, you need to at least specify the **content**, the **name** of the content and the **contentType**:

In [8]:
import requests
import base64

CONTENT = 'IONOS grows cows!'
NAME = 'IONOS know how'

content_base64 = base64.b64encode(CONTENT.encode('utf-8')).decode("utf-8")
body = { 
    "items": [{ 
        "properties": { 
            "name": NAME, 
            "contentType": "text/plain", 
            "content": content_base64
        }
    }]
}
endpoint = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}/documents"
response = requests.put(endpoint, json=body, headers=header)
response.status_code

200

**Note:** You need to encode your content using base64 before adding it to the document collection. This is done in line 7 of the source code above. We impose a limit of 65535 characters for each document you upload. Please ensure that your documents do not exceed this limit.

This request returns a status code 200 if adding the document to the document collection was successful.
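The encoding step and the length limit can be combined in one helper, as sketched below. The function `build_document_item` and the constant `MAX_DOCUMENT_CHARS` are hypothetical. The sketch assumes that the limit of 65535 characters applies to the text before base64 encoding, which the tutorial does not state explicitly.

```python
import base64

# Limit stated in the note above; assumed to apply to the unencoded text.
MAX_DOCUMENT_CHARS = 65535


def build_document_item(name, content, content_type="text/plain"):
    """Validate the length and base64-encode one document for upload."""
    if len(content) > MAX_DOCUMENT_CHARS:
        raise ValueError(
            f"Document has {len(content)} characters, limit is {MAX_DOCUMENT_CHARS}"
        )
    encoded = base64.b64encode(content.encode("utf-8")).decode("utf-8")
    return {
        "properties": {
            "name": name,
            "contentType": content_type,
            "content": encoded,
        }
    }


# Same document as in the request above.
example_body = {"items": [build_document_item("IONOS know how", "IONOS grows cows!")]}
```

Documents that exceed the limit raise a `ValueError` locally before any request is sent.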

In [9]:
response.json()

{'href': 'https://inference.de-txl.ionos.com/collections/6ed916fa-8dea-4c78-9596-c8a27216b258/documents',
 'id': 'f6f26e8b-afeb-5a90-a17f-2cedcc18ee75',
 'items': [{'href': 'https://inference.de-txl.ionos.com/collections/6ed916fa-8dea-4c78-9596-c8a27216b258/documents',
   'id': '6305e201-f94b-48aa-875a-ff5167062c3a',
   'metadata': {'createdDate': '2025-02-03T10:44:30Z',
    'lastModifiedDate': '2025-02-03T10:44:30Z'},
   'properties': {'content': 'SU9OT1MgZ3Jvd3MgY293cyE=',
    'contentType': 'text/plain',
    'description': '',
    'labels': {'number_of_tokens': '5'},
    'name': 'IONOS know how'},
   'type': 'document'}],
 'type': 'collection'}

The body of the response consists of the uploaded document and some meta information. This meta information includes the **ID** of the document which you can use to manipulate the document:

In [10]:
DOCUMENT_ID = response.json()['items'][0]['id']

### List existing documents in document collection

To ensure that the previous step went as expected, you can list the existing documents of your document collection.

To retrieve a list of all documents in the document collection saved by you:

In [11]:
import requests

endpoint_coll = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
endpoint = f"{endpoint_coll}/documents"

response = requests.get(endpoint, headers=header)
response.status_code

200

If you have not created the collection yet, the request will return a status code 404. It will return a JSON document with the 
field **items** set to an empty list if no documents were added yet.

In [12]:
response.json()

{'href': 'https://inference.de-txl.ionos.com/collections/6ed916fa-8dea-4c78-9596-c8a27216b258/documents',
 'id': 'f6f26e8b-afeb-5a90-a17f-2cedcc18ee75',
 'items': [{'href': 'https://inference.de-txl.ionos.com/collections/6ed916fa-8dea-4c78-9596-c8a27216b258/documents',
   'id': '6305e201-f94b-48aa-875a-ff5167062c3a',
   'metadata': {'createdDate': '2025-02-03T10:44:30Z',
    'lastModifiedDate': '2025-02-03T10:44:30Z'},
   'properties': {'content': 'SU9OT1MgZ3Jvd3MgY293cyE=',
    'contentType': 'text/plain',
    'description': '',
    'labels': {'number_of_tokens': '5'},
    'name': 'IONOS know how'},
   'type': 'document'}],
 'type': 'collection'}

This query returns a JSON document consisting of your documents in the document collection and corresponding meta information.

The result has a field **items** with all documents in the collection. This field consists of 10 attributes per entry of which 5 are relevant for you:
* **id**: The identifier of the document 
* **properties.content**: The base64 encoded content of the document
* **properties.name**: The name of the document
* **properties.description**: The description of the document
* **properties.labels.number_of_tokens**: The number of tokens in the document 
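Because the content is returned base64 encoded, you typically decode it before inspecting it. The following sketch extracts the fields listed above and decodes the content. The function `decode_documents` is hypothetical. The sample dictionary is a shortened copy of the output shown above and replaces a live `response.json()` call.

```python
import base64


def decode_documents(listing):
    """Return id, name, decoded text and token count for each document."""
    decoded = []
    for item in listing.get("items", []):
        props = item["properties"]
        decoded.append(
            {
                "id": item["id"],
                "name": props["name"],
                "text": base64.b64decode(props["content"]).decode("utf-8"),
                # The token count is returned as a string in the response.
                "number_of_tokens": int(props["labels"]["number_of_tokens"]),
            }
        )
    return decoded


# Shortened sample of the document listing shown above.
sample_documents = {
    "items": [
        {
            "id": "6305e201-f94b-48aa-875a-ff5167062c3a",
            "properties": {
                "content": "SU9OT1MgZ3Jvd3MgY293cyE=",
                "contentType": "text/plain",
                "description": "",
                "labels": {"number_of_tokens": "5"},
                "name": "IONOS know how",
            },
            "type": "document",
        }
    ]
}

documents = decode_documents(sample_documents)
```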

### Get metadata for a document

If you are interested in the metadata of a document, you can extract it by invoking: 

In [13]:
import requests

endpoint_coll = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
endpoint = f"{endpoint_coll}/documents/{DOCUMENT_ID}"

response = requests.get(endpoint, headers=header)
response.status_code

200

This query returns a status code which indicates whether the document exists:
* 200: The document exists
* 404: The document does not exist

In [14]:
response.json()

{'href': 'https://inference.de-txl.ionos.com/collections/6ed916fa-8dea-4c78-9596-c8a27216b258/documents/6305e201-f94b-48aa-875a-ff5167062c3a',
 'id': '6305e201-f94b-48aa-875a-ff5167062c3a',
 'metadata': {'createdDate': '2025-02-03T10:44:30Z',
  'lastModifiedDate': '2025-02-03T10:44:30Z'},
 'properties': {'content': 'SU9OT1MgZ3Jvd3MgY293cyE=',
  'contentType': 'text/plain',
  'description': '',
  'labels': {'number_of_tokens': '5'},
  'name': 'IONOS know how'},
 'type': 'document'}

The body of the response consists of all metadata of the document.

### Update a document

If you want to update a document, invoke: 

In [15]:
CONTENT = 'IONOS hosts your AI workloads'
NAME = 'True IONOS know how'

endpoint_coll = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
endpoint = f"{endpoint_coll}/documents/{DOCUMENT_ID}"
content_base64 = base64.b64encode(CONTENT.encode('utf-8')).decode("utf-8")
body = { 
    "properties": { 
        "id": DOCUMENT_ID, 
        "name": NAME, 
        "contentType": "text/plain", 
        "content": content_base64
    }
}

response = requests.put(endpoint, json=body, headers=header)
response.status_code

200

This replaces the existing entry with the given **id** in the document collection with the payload of this request. It returns the status code 200 on success.

In [16]:
response.json()

{'href': 'https://inference.de-txl.ionos.com/collections/6ed916fa-8dea-4c78-9596-c8a27216b258/documents/6305e201-f94b-48aa-875a-ff5167062c3a',
 'id': '6305e201-f94b-48aa-875a-ff5167062c3a',
 'metadata': {'createdDate': '2025-02-03T10:44:30Z',
  'lastModifiedDate': '2025-02-03T10:44:40Z'},
 'properties': {'content': 'SU9OT1MgaG9zdHMgeW91ciBBSSB3b3JrbG9hZHM=',
  'contentType': 'text/plain',
  'description': '',
  'labels': {'number_of_tokens': '7'},
  'name': 'True IONOS know how'},
 'type': 'document'}

The body of the response consists of the updated document.

## Query documents in the document collection

Finally, this section shows how to use the document collection and the contained documents to answer a user query.

To retrieve the documents relevant for answering the user query, invoke the **query** endpoint as follows:

In [17]:
import requests
import base64

USER_QUERY = "What does IONOS do?"
NUM_OF_DOCUMENTS = 1

endpoint = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}/query"
body = {"query": USER_QUERY, "limit": NUM_OF_DOCUMENTS }

relevant_documents = requests.post(endpoint, json=body, headers=header)

In [18]:
[
    base64.b64decode(entry['document']['properties']['content']).decode()
    for entry in relevant_documents.json()['properties']['matches']
]

['IONOS hosts your AI workloads']

This will return a list of the **NUM_OF_DOCUMENTS** most relevant documents in your document collection for answering the user query. 
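To answer a user query, the retrieved texts are typically passed to a language model as context. The sketch below decodes the matches and assembles them into a prompt. The functions `extract_texts` and `build_prompt` and the prompt wording are hypothetical, and calling a language model is outside the scope of this tutorial. The sample dictionary mimics only the fields of the query response used above.

```python
import base64


def extract_texts(query_result):
    """Decode the content of all matches in a query response."""
    return [
        base64.b64decode(entry["document"]["properties"]["content"]).decode("utf-8")
        for entry in query_result["properties"]["matches"]
    ]


def build_prompt(user_query, context_texts):
    """Assemble a prompt from the user query and the retrieved texts."""
    context = "\n".join(f"- {text}" for text in context_texts)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {user_query}"
    )


# Reduced sample of the query response; stands in for relevant_documents.json().
sample_result = {
    "properties": {
        "matches": [
            {
                "document": {
                    "properties": {
                        "content": "SU9OT1MgaG9zdHMgeW91ciBBSSB3b3JrbG9hZHM="
                    }
                }
            }
        ]
    }
}

texts = extract_texts(sample_result)
prompt = build_prompt("What does IONOS do?", texts)
```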

## Prune a document collection

If you want to remove all documents from a document collection, invoke:

In [19]:
import requests

endpoint_coll = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
endpoint = f"{endpoint_coll}/documents"
response = requests.delete(endpoint, headers=header)
response.status_code

204

This query returns the status code 204 if pruning the document collection was successful.

## Remove a document collection

If the list of document collections contains collections you no longer need, you can
remove a document collection by invoking:

In [20]:
import requests

endpoint = f"https://inference.de-txl.ionos.com/collections/{COLLECTION_ID}"
response = requests.delete(endpoint, headers=header)
response.status_code

204

This query returns a status code which indicates whether the deletion was successful:
* 204: The collection was deleted successfully
* 404: The collection did not exist

## Summary

In this tutorial you learned how to use the IONOS AI Model Hub API to conduct semantic similarity searches using
our vector database.

Namely, you learned how to:

* Create a document collection in the vector database and modify it
* Insert your documents into the document collection and modify the documents
* Conduct semantic similarity searches using your document collection.