# 3. Data setup - Setting up the document ingestion pipeline
<a id="data-setup"></a>

This section describes how to establish a complete document ingestion pipeline in PhariaAI. The ingestion pipeline is a crucial foundation for RAG applications, as it transforms source documents into searchable, AI-ready processed documents.

## Pipeline components

The pipeline consists of several interconnected components:

- **Stage**: Provides an entry-point storage area for source documents
- **Trigger**: Defines the processing workflow that runs when documents are uploaded
- **Transformation**: Converts raw files into structured, searchable content
- **Repository**: Stores the source documents and processed documents
- **Collection**: Groups processed documents in a searchable container with unified access patterns and shared indexes
- **Index**: Enables efficient semantic search across your documents

The document ingestion workflow we will build transforms source documents into searchable processed documents in the following steps:
1. **Load**: uploading to the stage and triggering the transformation
2. **Transform**: applying transformations and storing in the repository
3. **Search**: storing in the collections and indexing for search


<img src="../Visualizations/E2E-Tutorial-data-pipeline.png" alt="Ingestion workflow" style="width:85%"/>


## What you will learn

1. How to configure your environment and connection parameters
2. How to create an ingestion pipeline with the PhariaData API
3. How to upload documents and monitor their processing
4. How to interact with your processed content through search and retrieval


## Prerequisites

Before starting, ensure you have the following:

- **API token**: A valid Aleph Alpha API token with appropriate permissions
- **API URLs**: Access to running instances of `pharia-data-api` and `document-index-api`
- **Permissions**: The *StudioUser* permission, as described in [User Setup](1.%20Introduction%20-%20Getting%20Started.ipynb#user-setup)


<br>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
<br>

## Procedure

This section guides you through setting up your document ingestion pipeline. You will first import necessary libraries, configure your environment, and then build the essential components for document processing. The workflow follows a systematic approach of creating a repository, setting up a document staging area, configuring an index, and establishing triggers for automated document transformation.

Below, you can see all concepts involved in the creation of the pipeline and their relationships.

<img src="../Visualizations/E2E-Tutorial-data-pipeline-relationships.png" alt="Resources relationships" style="width:70%;"/>

### 1. Import dependencies and configure the environment

We begin by importing necessary dependencies and setting up the environment. We use standard Python libraries such as `requests` for API communication, `pandas` for data handling, as well as specialised libraries such as `tenacity` for robust error handling with retry mechanisms.

The environment configuration establishes connections to two key PhariaAI services:
- The PhariaData API for managing document transformations and storage
- The PhariaDocument Index API for creating searchable indexes

Create a `.env` file from the provided sample using the following command, then add your PhariaAI token to it:

```bash
cp .env.sample .env
```

The code below imports the libraries used in this workflow and disables warnings to keep the notebook output clean:

In [1]:
import json
import requests
import os
import pandas as pd
import warnings
import concurrent.futures
from dotenv import load_dotenv
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)
warnings.filterwarnings("ignore")

Next, we configure the essential parameters that provide authentication and identify your workspace:

- **TOKEN**: Your Aleph Alpha API authentication token, loaded from your environment file
- **NAMESPACE**: The organisational namespace where your collections are stored ("Studio")
- **COLLECTION**: The name of the document collection for this tutorial ("pharia-tutorial-rag")

**Note:** The namespace identifier depends on your specific PhariaAI setup and permission level. The collection name can be freely chosen to help you organise and separate different RAG projects. Using descriptive collection names (like "legal-contracts" or "product-documentation") can help you manage multiple document sets within the same namespace.

In [2]:
## Setups
load_dotenv(".env", override=True)

TOKEN = os.getenv("PHARIA_AI_TOKEN") #<your-token>
NAMESPACE = os.getenv("PHARIA_DATA_NAMESPACE")
COLLECTION = os.getenv("PHARIA_DATA_COLLECTION")

Finally, we define the API endpoints that connect to PhariaAI's core document services:

- **DATA_PLATFORM_URL**: The endpoint for the PhariaData API service that manages document storage and transformations
- **DOCUMENT_INDEX_API_URL**: The endpoint for the PhariaDocument Index service that enables vector search capabilities

Both endpoints are derived from the `PHARIA_API_BASE_URL` environment variable and stored as module-level constants, making them accessible to all the helper functions we create throughout this notebook.

In [3]:
PHARIA_API_BASE_URL = os.getenv("PHARIA_API_BASE_URL")

DATA_PLATFORM_URL = f"{PHARIA_API_BASE_URL}/studio/data"
DOCUMENT_INDEX_API_URL = f"{PHARIA_API_BASE_URL}/studio/search"
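
Because `os.getenv` returns `None` for unset variables, a missing entry in `.env` only surfaces later as a malformed URL or a failed request. The following is a minimal sketch of an early check. The helper name `find_missing_settings` and the tuple `REQUIRED_SETTINGS` are hypothetical and not part of PhariaAI; the variable names are the ones read in the cells above.

```python
import os
from typing import List, Mapping, Sequence

# Variable names read by the cells above; extend the tuple as needed.
REQUIRED_SETTINGS = (
    "PHARIA_AI_TOKEN",
    "PHARIA_API_BASE_URL",
    "PHARIA_DATA_NAMESPACE",
    "PHARIA_DATA_COLLECTION",
)


def find_missing_settings(
    required: Sequence[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not (env.get(name) or "").strip()]


# Run this after load_dotenv() so values from .env are visible.
missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    print(f"Missing settings in .env: {', '.join(missing)}")
```

Passing the environment as a parameter keeps the check easy to try with a plain dictionary instead of the real process environment.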

#### Different Deployment Environments

Aleph Alpha operates multiple deployment environments for different teams and use cases. It's crucial that you use the correct environment URLs to avoid access issues during this tutorial.

**Product Team Environment (default for this tutorial)**

If you're part of the Product team or working on product-related tasks, use this base URL pattern:

```text
https://pharia-{service}.stage.product.pharia.com/
```

**Customer Team Environment**

If you're part of the Customer team or working on customer-related tasks, use this base URL pattern:

```text
https://pharia-{service}.customer.pharia.com/
```

### 2. Create a document repository and collection

A repository in PhariaData is a storage container that organises processed documents. In this tutorial, the repository name is read from the `REPOSITORY_NAME` environment variable (for example, "RAG_Tutorial_Repository").

The `get_or_create_repository` function checks if a repository with the specified name already exists and creates one if it does not. The function returns the repository ID, which is referenced in later steps when configuring the ingestion pipeline.

Similarly, we create a collection to act as the search container for our processed documents.
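
Both helpers follow the same idempotent "get or create" pattern: look the resource up by name, and create it only if the lookup finds nothing. The sketch below isolates that pattern from the HTTP calls. The names `get_or_create`, `find_in_store`, and `create_in_store` are hypothetical, and the in-memory dictionary merely stands in for the remote service.

```python
from typing import Callable, Dict, Optional


def get_or_create(
    name: str,
    find: Callable[[str], Optional[str]],
    create: Callable[[str], str],
) -> str:
    """Return the ID of the resource called `name`, creating it only if absent."""
    existing_id = find(name)
    if existing_id is not None:
        return existing_id
    return create(name)


# In-memory stand-in for the remote service, for illustration only.
store: Dict[str, str] = {}


def find_in_store(name: str) -> Optional[str]:
    return store.get(name)


def create_in_store(name: str) -> str:
    store[name] = f"id-{len(store) + 1}"
    return store[name]


first = get_or_create("demo", find_in_store, create_in_store)
second = get_or_create("demo", find_in_store, create_in_store)
print(first == second)  # the second call reuses the existing resource
```

Because repeated calls return the same ID, the notebook cells can be re-run without creating duplicate resources.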

In [4]:
REPOSITORY_NAME = os.getenv("REPOSITORY_NAME")

In [5]:
## Helper function 

def get_or_create_repository(repository: dict) -> str:
    """Get or create a repository in the Data Platform."""
    dataplatform_base_url = DATA_PLATFORM_URL
    name = repository["name"]
    url = f"{dataplatform_base_url}/repositories?name={name}"

    token = TOKEN
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    page = response.json()

    if page["total"] > 0:
        return page["repositories"][0]["repositoryId"]
    else:
        url = f"{dataplatform_base_url}/repositories"
        response = requests.post(
            url=url,
            json=repository,
            headers={"Authorization": f"Bearer {token}"},
            verify=False,
        )
        response.raise_for_status()
        repo_created = response.json()
        return repo_created["repositoryId"]
    
def get_or_create_collection(namespace: str, collection: str) -> str:
    """Get or create a collection in the Document Index."""
    try:
        di_base_url = DOCUMENT_INDEX_API_URL
        url = f"{di_base_url}/collections/{namespace}"
        token = TOKEN
        response = requests.get(
            url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
        )
        response.raise_for_status()
        collections_list = response.json()
        
        if len(collections_list) == 0 or collection not in collections_list:
            url = f"{di_base_url}/collections/{namespace}/{collection}"
            response = requests.put(
                url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
            )
            response.raise_for_status()
            return f"{collection} created"
        else:
            return f"{collection} exists"
    except Exception as e:
        # `response` is unbound if the first request itself failed, so report only the error
        return f"Collection setup failed: {e}"

In [None]:
## Create the repository

repository_payload = {
    "name": REPOSITORY_NAME,
    "mediaType": "jsonlines",
    "modality": "text",
    "schema": None,
}

repository_id = get_or_create_repository(repository_payload)
print(f"Repository ID: {repository_id}")

collection_id = get_or_create_collection(NAMESPACE, COLLECTION)
print(f"Collection: {collection_id}")


### 3. Configure a document upload stage

A stage provides temporary storage for source documents before they are processed. In this step, we create a stage named "DocumentStorageTutorialTest" that uses the "DocumentToMarkdown" transformation to convert source documents.

The stage configuration includes a trigger that defines what happens when source documents are uploaded. This trigger specifies the transformation to apply and where to store the results.

The `get_or_create_stage` function returns a stage ID that is used when uploading documents in later steps.

In [7]:
## Environment Variables
STAGE_NAME = os.getenv("STAGE_NAME")
TRANSFORMATION_NAME = os.getenv("TRANSFORMATION_NAME")
TRIGGER_NAME = os.getenv("TRIGGER_NAME")

In [8]:
## Helper function 

def get_or_create_stage(stage: dict) -> str:
    """Get or create a stage in the Data Platform."""
    dataplatform_base_url = DATA_PLATFORM_URL
    name = stage["name"]
    url = f"{dataplatform_base_url}/stages?name={name}"

    token = TOKEN
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    page = response.json()

    if page["total"] > 0:
        return page["stages"][0]["stageId"]
    else:
        url = f"{dataplatform_base_url}/stages"
        response = requests.post(
            url=url,
            json=stage,
            headers={"Authorization": f"Bearer {token}"},
            verify=False,
        )
        response.raise_for_status()
        stage_created = response.json()
        return stage_created["stageId"]

In [None]:
## Setup stage

stage_payload = {
    "name": STAGE_NAME,
    "triggers": [
        {
            "transformationName": TRANSFORMATION_NAME,
            "destinationType": "DataPlatform:Repository",
            "connectorType": "DocumentIndex:Collection",
            "name": TRIGGER_NAME,
        }
    ],
}

stage_id = get_or_create_stage(stage_payload)
print(f"Stage ID: {stage_id}")

### 4. Create and assign a searchable index for documents

An index enables efficient searching of your document content. The `create_index_and_assign_to_collection` function creates an index with specified parameters and assigns it to your collection.

The key parameters include:
- `chunk_size`: Controls how documents are divided into searchable segments (256 tokens)
- `chunk_overlap`: Defines the overlap between chunks to maintain context (10 tokens)
- `embedding_type`: Specifies the vector embedding approach ("asymmetric")

Once the index is assigned to your collection, any ingested documents are automatically processed according to these settings.
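
To make the effect of `chunk_size` and `chunk_overlap` concrete, the following sketch splits a token list into overlapping windows. It is an illustration only: the function name `chunk_tokens` is hypothetical, and the Document Index's own tokenizer and chunking logic are not shown in this tutorial and may differ.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int = 256, chunk_overlap: int = 10
) -> List[List[str]]:
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks


# Ten whitespace-separated "tokens", windows of 4 with 1 token shared between neighbours.
words = "a b c d e f g h i j".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With the tutorial defaults (256 and 10), each new chunk starts 246 tokens after the previous one, so a sentence that straddles a chunk boundary is still partly visible in both chunks.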

In [10]:
INDEX = os.getenv("INDEX")

In [11]:
## Helper function
def create_index_and_assign_to_collection(
    index_name: str,
    collection_name: str,
    namespace: str,
    chunk_size: int = 256,
    chunk_overlap: int = 10,
    embedding_type: str = "asymmetric",
) -> None:
    """Create an index in the Document Index and assign it to a collection."""
    token = TOKEN
    document_index_base_url = DOCUMENT_INDEX_API_URL
    url = f"{document_index_base_url}/indexes/{namespace}/{index_name}"
    payload = {
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "embedding_type": embedding_type
    }
    response = requests.put(url, json=payload, headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    print(f"Index created: {index_name}")

    # Assign the index to the collection
    url = f"{document_index_base_url}/collections/{namespace}/{collection_name}/indexes/{index_name}"
    response = requests.put(url, headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    print(f"Index '{index_name}' assigned to collection '{collection_name}' ")

In [None]:
create_index_and_assign_to_collection(index_name=INDEX, collection_name=COLLECTION, namespace=NAMESPACE)

### 5. Set up automated document processing

The trigger configuration defines what happens when source documents are uploaded to the stage. The `ingestion_context` object combines three key elements:

1. The trigger name that identifies which trigger to activate
2. The destination repository where processed documents are stored
3. The collection and namespace where processed documents are indexed

This context is included with source document uploads to instruct the system on how to process each document. When a source document is uploaded, the specified trigger automatically applies the transformation and indexes the processed document.
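
Since the context is assembled from environment variables and from the result of an earlier call, checking it before the first upload can be useful. The sketch below is one possible check, not part of the PhariaData API: the helper `missing_context_fields` is hypothetical, and the required keys are the ones used in this tutorial's `ingestion_context`.

```python
from typing import Any, Dict, List, Tuple

# Key paths used by the ingestion context in this tutorial.
REQUIRED_PATHS: Tuple[Tuple[str, ...], ...] = (
    ("triggerName",),
    ("destinationContext", "repositoryId"),
    ("connectorContext", "collection"),
    ("connectorContext", "namespace"),
)


def missing_context_fields(context: Dict[str, Any]) -> List[str]:
    """Return dotted paths of required fields that are absent or empty."""
    missing = []
    for path in REQUIRED_PATHS:
        node: Any = context
        for key in path:
            # Walk one level down; stop as soon as a level is missing.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if not node:
            missing.append(".".join(path))
    return missing


# Hypothetical values, for illustration only.
example_context = {
    "triggerName": "my-trigger",
    "destinationContext": {"repositoryId": "repo-123"},
    "connectorContext": {"collection": "my-collection", "namespace": None},
}
print(missing_context_fields(example_context))
```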

In [None]:
ingestion_context = {
    "triggerName": TRIGGER_NAME,
    "destinationContext": {"repositoryId": repository_id},
    "connectorContext": {
        "collection": COLLECTION,
        "namespace": NAMESPACE,
    },
}
print(f"Ingestion context: {ingestion_context}")

### 6. Upload and process documents

With our infrastructure setup complete (repository, stage, index, and trigger), we can now upload source documents to the PhariaAI platform. This section demonstrates how to upload source documents and initiate the document ingestion process.

The document ingestion workflow transforms source documents into searchable processed documents through several steps: uploading to the stage, applying transformations, storing in the repository, and indexing for search.

The `ingest_all_documents` helper function returns a DataFrame with details on each upload attempt, making it easy to track successes and failures.
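
The upload helper is wrapped in a `tenacity` retry decorator, so transient request failures are retried a limited number of times with growing pauses. To show the idea without the library, here is a simplified standard-library sketch. It is not tenacity's implementation, the name `call_with_backoff` is hypothetical, and the exact wait times tenacity computes may differ.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    min_wait: float = 2.0,
    max_wait: float = 10.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on `retry_on` with exponentially growing waits."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Double the wait on each attempt, clamped to [min_wait, max_wait].
            wait = min(max(min_wait, 2.0 ** (attempt - 1)), max_wait)
            sleep(wait)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

The `sleep` parameter exists so that the waiting can be replaced in a quick experiment, for example with `list.append`, to record the delays instead of actually pausing.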

In [14]:
## Helper functions

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type(requests.RequestException),
)
def ingest_document(
    document_path: str, ingestion_context: dict, name: str, stage_id: str
) -> dict:
    """Attempts to ingest a document and returns the ingestion result."""
    with open(document_path, mode="rb") as file_reader:
        dataplatform_base_url = DATA_PLATFORM_URL
        url = f"{dataplatform_base_url}/stages/{stage_id}/files"
        token = TOKEN
        response = requests.post(
            url=url,
            headers={"Authorization": f"Bearer {token}"},
            verify=False,
            files={
                "name": name,
                "sourceData": file_reader,
                "ingestionContext": json.dumps(ingestion_context),
            },
        )
        response.raise_for_status()

        file_uploaded = response.json()
        return {
            "file_id": file_uploaded["fileId"],
            "status": "Success",
            "error_type": None,
            "error_message": None,
        }
    


def ingest_all_documents(
    directory_path: str, ingestion_context: dict, stage_id: str, max_workers: int = 3
):
    """Ingest all files in a directory concurrently and store results in a DataFrame."""

    results = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(
                ingest_document,
                os.path.join(directory_path, file),
                ingestion_context,
                file,
                stage_id,
            ): file
            for file in os.listdir(directory_path)
        }

        for future in concurrent.futures.as_completed(future_to_file):
            file_name = future_to_file[future]
            file_path = os.path.join(directory_path, file_name)
            try:
                result = future.result()
                results.append(
                    {
                        "file_path": file_path,
                        "file_id": result["file_id"],
                        "status": result["status"],
                        "error_type": result["error_type"],
                        "error_message": result["error_message"],
                    }
                )
            except Exception as e:
                print(f"An error occurred while ingesting {file_path}: {e}")
                results.append(
                    {
                        "file_path": file_path,
                        "file_id": None,
                        "status": "Ingestion Failed",
                        # Use the same columns as the success case
                        "error_type": type(e).__name__,
                        "error_message": str(e),
                    }
                )

    df_results = pd.DataFrame(results)
    return df_results

In [None]:
# Ingesting the files
directory_path = "files_to_upload"
df_results = ingest_all_documents(directory_path, ingestion_context, stage_id)
df_results

### 7. Monitor the source document processing status

After uploading source documents, you need to verify their processing status. The code in this section does the following:

1. Extracts IDs of successfully uploaded source documents
2. Retrieves the transformation ID
3. Checks the status of each source document's transformation
4. Extracts dataset IDs from completed transformations

The `check_files_status` function combines all this information into a comprehensive report that shows which files completed processing and which encountered errors. The dataset IDs are particularly important as they are used to access your processed documents in subsequent operations.
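
Transformations run asynchronously, so a status check made right after upload may show a run that has not finished yet. One way to handle this is to poll until a terminal status appears. The sketch below assumes a caller-supplied `fetch_status` function; the helper name, the polling parameters, and every status name other than "completed" (which the code in this section filters on) are assumptions, not documented PhariaData values.

```python
import time
from typing import Callable, Sequence


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    terminal_statuses: Sequence[str] = ("completed", "failed"),  # "failed" is assumed
    max_polls: int = 10,
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `fetch_status` until it returns a terminal status or polls run out."""
    status = None
    for poll in range(max_polls):
        status = fetch_status()
        if status in terminal_statuses:
            return status
        if poll < max_polls - 1:
            sleep(interval_seconds)  # pause between polls, not after the last one
    raise TimeoutError(
        f"No terminal status after {max_polls} polls (last status: {status})"
    )
```

In this notebook, `fetch_status` could be a small lambda that calls `check_status_of_ingestion(transformation_id, file_id)` and returns the run's `status` field.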


In [18]:
def get_successful_document_ids(df: pd.DataFrame) -> list:
    """Retrieve a list of successful file_ids from the DataFrame."""
    return df[df["status"] == "Success"]["file_id"].tolist()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type(requests.RequestException),
)
def check_status_of_ingestion(transformation_id: str, file_id: str) -> dict:
    """Query the status of the ingestion for a given transformation and file_id."""
    dataplatform_base_url = DATA_PLATFORM_URL
    url = f"{dataplatform_base_url}/transformations/{transformation_id}/runs?file_id={file_id}"

    token = TOKEN
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    page = response.json()

    assert page["total"] > 0
    return page["runs"][0]

def get_transformation_id(name: str) -> str:
    """Get the transformation ID from the Data Platform."""
    dataplatform_base_url = DATA_PLATFORM_URL
    url = f"{dataplatform_base_url}/transformations?name={name}"

    token = TOKEN
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    page = response.json()

    assert page["total"] > 0
    return page["transformations"][0]["transformationId"]

def check_files_status(transformation_id: str, df: pd.DataFrame, max_workers: int = 3):
    """Check the status of ingested files and store the results in a DataFrame."""

    successful_file_ids = get_successful_document_ids(df)
    status_results = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(
                check_status_of_ingestion, transformation_id, file_id
            ): file_id
            for file_id in successful_file_ids
        }

        for future in concurrent.futures.as_completed(future_to_file):
            file_id = future_to_file[future]
            try:
                run = future.result()
                output = json.dumps(run.get("output", {}), indent=4)
                status_results.append(
                    {
                        "file_id": file_id,
                        "run_id": run["runId"],
                        "status": run["status"],
                        "output": output,
                        "error": run["errors"],
                    }
                )
            except Exception as e:
                # `run` is not available when the status request itself failed
                status_results.append(
                    {
                        "file_id": file_id,
                        "run_id": None,
                        "status": "status check failed",
                        "output": None,
                        "error": str(e),
                    }
                )

    return df.merge(
        pd.DataFrame(status_results),
        on="file_id",
        how="left",
        suffixes=("_ingestion", ""),
    )

def get_successful_dataset_ids(df: pd.DataFrame) -> list:
    """Retrieve a list of successful dataset_ids from the DataFrame."""
    # Iterate over the values: a filtered DataFrame keeps its original index labels,
    # so positional lookups such as df["output"][i] can raise a KeyError
    return [json.loads(output).get("datasetId") for output in df["output"]]

In [None]:
transformation_id = get_transformation_id(TRANSFORMATION_NAME)
status_df = check_files_status(transformation_id, df_results)
status_df.to_csv("ingestion_status.csv", index=False)
successful_dataset_ids = get_successful_dataset_ids(status_df[status_df["status"] == "completed"])
status_df

### 8. Interact with processed documents

With source documents ingested and processed, you can now interact with your data in various ways:

1. **Search operation**: The `search_text` function demonstrates semantic search against your indexed processed documents, finding content based on meaning rather than exact keyword matches

2. **Document and metadata retrieval**: The `get_document_from_document_index` function retrieves a complete processed document and its metadata using the dataset ID

3. **Text display**: The `display_processed_document_text` function shows how to access the actual content extracted from your source documents, helping you verify the quality of text extraction

These operations showcase the fundamental ways to interact with your processed documents in PhariaAI.

#### 8.1. Searching document content

After successfully ingesting documents, one of the most valuable operations is searching through your content. This section demonstrates how to perform semantic searches against your indexed documents.

The `search_text` function sends a query to the PhariaDocument Index API, which uses vector embeddings to find semantically relevant content. Unlike traditional keyword search, this approach can identify conceptually related information even when exact terms do not match.

In this example, we search for content related to "what is attention?" and retrieve matches ranked by relevance. The results include document chunks that semantically align with the query, along with confidence scores indicating match quality.
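
As background for how embedding-based ranking works in general, the toy sketch below scores a few hand-written vectors against a query vector using cosine similarity, a common choice for comparing embeddings. It is a conceptual illustration only: the vectors are made up, the function names are hypothetical, and this tutorial does not state which similarity measure the Document Index uses internally.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vector: Sequence[float], chunk_vectors: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (chunk name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in chunk_vectors.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up two-dimensional "embeddings"; real embeddings have many more dimensions.
toy_chunks = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
ranked = rank_by_similarity([1.0, 0.0], toy_chunks)
print([name for name, _ in ranked])
```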


In [25]:
# Helper functions

def search_text(namespace: str, collection: str, text: str, index: str) -> dict:
    di_base_url = DOCUMENT_INDEX_API_URL
    url = f"{di_base_url}/collections/{namespace}/{collection}/indexes/{index}/search"

    token = TOKEN
    payload = {"query": [{"modality": "text", "text": text}]}
    response = requests.post(
        url=url,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        verify=False,
    )
    response.raise_for_status()
    return response.json()

In [None]:
text_to_search = "what is attention?"
search_result = search_text(
    NAMESPACE, COLLECTION, text_to_search, index=INDEX
)
print(json.dumps(search_result, indent=4))

#### 8.2. Retrieving complete documents and metadata

While searching helps find specific information, sometimes you need to retrieve a complete document along with its metadata. This operation is useful when you want to examine a document's full context or access its associated properties.

The `get_document_from_document_index` function retrieves a document using its dataset ID (obtained during the ingestion process). The response includes both the document content and additional metadata such as creation time, source information, and any custom properties attached during processing.

This example retrieves every successfully processed document from our previously ingested set and prints the first one, demonstrating how to access specific documents directly when you know their IDs.

In [27]:
# Helper functions

def get_document_from_document_index(namespace, collection, dataset_id) -> dict:
    di_base_url = DOCUMENT_INDEX_API_URL
    url = f"{di_base_url}/collections/{namespace}/{collection}/docs/{dataset_id}"

    token = TOKEN
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    return response.json()

In [None]:
all_documents = []

for dataset_id in successful_dataset_ids:  # avoid shadowing the built-in `id`
    document_from_di = get_document_from_document_index(
        NAMESPACE, COLLECTION, dataset_id
    )
    all_documents.append(document_from_di)

print(json.dumps(all_documents[0], indent=4))

#### 8.3. Viewing extracted document text

To inspect the actual content extracted from your documents, you can retrieve and display the text chunks stored in the repository. This is useful for verifying extraction quality and understanding how your documents were segmented.

The `display_processed_document_text` function connects to the PhariaData repository and retrieves text chunks from a specific document. It displays each chunk sequentially, showing how the document was divided during processing.

This operation helps you validate that your documents were properly processed and that the extracted text accurately represents the original content. It can be particularly valuable when troubleshooting search issues or refining your ingestion parameters.
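
The repository in this tutorial was created with the `jsonlines` media type, so its contents can be read as one JSON object per line. The sketch below shows that parsing step in isolation on an in-memory sample. The helper name `parse_json_lines` and the `text` key in the sample are hypothetical; the real datapoint schema is not documented here.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_json_lines(lines: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines, such as keep-alive newlines in a stream
        yield json.loads(raw.decode("utf-8"))


# Hypothetical sample standing in for a streamed HTTP response body.
sample_lines = [b'{"text": "first chunk"}', b"", b'{"text": "second chunk"}']
for datapoint in parse_json_lines(sample_lines):
    print(datapoint["text"])
```

Because the function is a generator, it processes a streamed response line by line without loading the whole dataset into memory.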

In [None]:
# Helper function
def display_processed_document_text(repository_id: str, dataset_id: str) -> None:
    dataplatform_base_url = DATA_PLATFORM_URL
    url = f"{dataplatform_base_url}/repositories/{repository_id}/datasets/{dataset_id}/datapoints"

    token = TOKEN
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False, stream=True
    )
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines in the streamed response
        datapoint = json.loads(line.decode())
        print(datapoint)


display_processed_document_text(repository_id, successful_dataset_ids[0])

## Summary

In this section, you successfully set up the complete document ingestion pipeline:

✅ **Configured the environment** with connections to both the PhariaData and PhariaDocument Index APIs

✅ **Built the foundation infrastructure**:
   - Created a repository for storing processed documents
   - Set up a stage for temporary source document storage
   - Configured an index for enabling semantic search
   - Established triggers for automating document processing

✅ **Implemented document operations** with:
   - Concurrent source document uploads with error handling
   - Status monitoring for transformation processes
   - Multiple ways to interact with processed documents

Your source document collection is now properly ingested, processed, and ready for semantic search operations. This data foundation will serve as the basis for retrieval-augmented generation in the subsequent sections of this tutorial.