# 3. Data Setup - Setting Up the Document Ingestion Pipeline

This section guides you through establishing a complete document ingestion pipeline in Pharia. The ingestion pipeline is a crucial foundation for RAG applications, as it transforms source documents into searchable, AI-ready processed documents.

## Pipeline Components

The pipeline consists of several interconnected components:

- **Repository**: Stores the processed documents produced by the transformation
- **Collection**: Groups processed documents in a searchable container with unified access patterns and shared indexes
- **Stage**: Provides temporary storage for source documents
- **Transformation**: Converts raw files into structured, searchable content
- **Index**: Enables efficient semantic search across your documents
- **Trigger**: Automates the processing workflow when documents are uploaded

The document ingestion workflow we will be building transforms source documents into searchable processed documents through several steps: uploading to the stage, applying transformations, storing in the repository, and indexing for search.


## What You'll Learn

In this section, you'll learn how to:

1. Configure your environment and connection parameters
2. Create an ingestion pipeline in the Data Platform
3. Upload documents and monitor their processing
4. Interact with your processed content through search and retrieval


## Prerequisites

Before starting, ensure you have the following:

- **API Token**: A valid Aleph Alpha API token with appropriate permissions
- **API URL**: Access to running instances of pharia-data-api and document-index-api
- **Permissions**: StudioUser permission as described in the User Setup section

<br>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
<br>

## Installation Options

This tutorial uses various Python packages for API interaction and data processing. You have two options for setting up your environment:

### Using Poetry (Recommended)

If you're working with the complete project repository that includes the `pyproject.toml` file:

1. Install [poetry](https://python-poetry.org/docs/#installing-with-pipx) using `pipx` following the official instructions
2. Run `poetry install` in the project directory to automatically set up all dependencies

### Custom Installation with Poetry

If you prefer using your own virtual environment manager instead of poetry's default:

```
      poetry config virtualenvs.prefer-active-python true
```

You can append `--local` or `--global` to this command to apply the setting locally or globally.

### Manual Installation

If you're just using this notebook without the full project structure, you can install the required packages manually:

- python = "~=3.11"
- requests = "^2.32.3"
- aiohttp = "^3.10.5"
- urllib3 = "^2.2.2"
- pandas = "2.1.4"
- tenacity = "^8.2.2"
- python-dotenv = "^1.0.0"
- ipykernel = "^6.29.5"

The following package is needed later in the tutorial, so install it now as well:

- pharia-skill = "^0.14.0"
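
After a manual installation, it can help to confirm that the packages are actually available in the active environment. The following is a minimal sketch using only the standard library. The helper name `find_missing_packages` is an illustrative assumption and not part of the tutorial code, and the check only confirms that a distribution is installed, not that its version satisfies the constraints listed above.

```python
from importlib import metadata

# Distribution names taken from the manual installation list above.
REQUIRED_PACKAGES = [
    "requests", "aiohttp", "urllib3", "pandas",
    "tenacity", "python-dotenv", "ipykernel", "pharia-skill",
]


def find_missing_packages(names: list[str]) -> list[str]:
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is not installed
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_packages(REQUIRED_PACKAGES)
    print("Missing packages:", missing or "none")
```

If the list is not empty, install the reported packages before continuing with the notebook.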


<br>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
<br>

## Getting Started

This section guides you through setting up your document ingestion pipeline. You'll first import necessary libraries, configure your environment, and then build the essential components for document processing. The workflow follows a systematic approach of creating a repository, setting up a document staging area, configuring an index, and establishing triggers for automated document transformation.

### 1. Importing Dependencies and Configuring the Environment

Let's begin by importing the necessary dependencies and setting up our environment. We'll use standard Python libraries like `requests` for API communication, `pandas` for data handling, and specialized libraries like `tenacity` for robust error handling with retry mechanisms.

The environment configuration establishes connections to Pharia's two key services:
- The Data Platform API for managing document transformations and storage
- The Document Index API for creating searchable indexes

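The code below imports the libraries used throughout the document processing workflow and disables warnings to keep the notebook output clean. Note that this also hides the `InsecureRequestWarning` that would otherwise appear because the helper functions call the APIs with `verify=False`: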

In [1]:
import json
import requests
import os
import pandas as pd
import warnings
import concurrent.futures
from dotenv import load_dotenv
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)
warnings.filterwarnings("ignore")

Next, we'll configure the essential parameters that provide authentication and identify your workspace:

- **TOKEN**: Your Pharia API authentication token, loaded from your environment file
- **NAMESPACE**: The organizational namespace where your collections are stored ("Studio")
- **COLLECTION**: The name of the document collection for this tutorial ("pharia-tutorial-rag")

> **Note:** The namespace identifier depends on your specific Pharia setup and permission level. The collection name can be freely chosen to help you organize and separate different RAG projects. Using descriptive collection names (like "legal-contracts" or "product-documentation") helps you manage multiple document sets within the same namespace.


In [2]:
## Setups
load_dotenv("rag_tutorial/.env", override=True)

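TOKEN = os.getenv("PHARIA_AI_TOKEN")  # <your-token>
if not TOKEN:
    # Fail early: assigning None to os.environ below would raise a less helpful TypeError.
    raise ValueError("PHARIA_AI_TOKEN is not set; add it to rag_tutorial/.env")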
NAMESPACE = "Studio"
COLLECTION = "pharia-tutorial-rag"

os.environ["NAMESPACE"] = NAMESPACE
os.environ["COLLECTION"] = COLLECTION
os.environ["TOKEN"] = TOKEN

Finally, we define the API endpoints that connect to Pharia's core document services:

- **DATA_PLATFORM_URL**: The endpoint for the Data Platform service that manages document storage and transformations
- **DOCUMENT_INDEX_API_URL**: The endpoint for the Document Index service that enables vector search capabilities

These endpoints are stored as environment variables, making them accessible to all the helper functions we'll create throughout this notebook.

In [3]:
## URLS
DATA_PLATFORM_URL = "https://pharia-data-api.product.pharia.com"
DOCUMENT_INDEX_API_URL = "https://document-index.product.pharia.com"

os.environ["DATA_PLATFORM_URL"] = DATA_PLATFORM_URL
os.environ["DOCUMENT_INDEX_API_URL"] = DOCUMENT_INDEX_API_URL

### 2. Creating a Document Repository

A repository in Pharia's Data Platform is a storage container that organizes processed documents. In this tutorial, we create a repository named "RAG_Tutorial_Repository".

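The `get_or_create_repository` function checks if a repository with the specified name already exists and creates one if needed. The function returns the repository ID, which will be referenced in later steps when configuring the ingestion pipeline. The same cell defines `get_or_create_collection`, which creates the collection in the Document Index if it does not exist yet and returns a short status message.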
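
The lookup passes the name as a query parameter in the URL. As a minimal sketch, assuming names may contain spaces or other reserved characters, the query string could be encoded with the standard library instead of being formatted directly into the URL. The helper name `build_lookup_url` and the `example.invalid` host are illustrative assumptions.

```python
from urllib.parse import urlencode


def build_lookup_url(base_url: str, resource: str, name: str) -> str:
    """Build a name-lookup URL with a properly encoded query string."""
    query = urlencode({"name": name})  # encodes spaces, '&', '#', and similar characters
    return f"{base_url}/api/v1/{resource}?{query}"


print(build_lookup_url("https://example.invalid", "repositories", "My Repo & Co"))
```

When calling the API with `requests`, passing `params={"name": name}` to `requests.get` achieves the same encoding without building the URL by hand.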

In [4]:
## Environment Variables

REPOSITORY_NAME = "RAG_Tutorial_Repository"
os.environ["REPOSITORY_NAME"] = REPOSITORY_NAME

In [5]:
## Helper functions

def get_or_create_repository(repository: dict) -> str:
    """Get or create a repository in the Data Platform."""
    dataplatform_base_url = os.getenv("DATA_PLATFORM_URL")
    name = repository["name"]
    url = f"{dataplatform_base_url}/api/v1/repositories?name={name}"

    token = os.getenv("TOKEN")
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    page = response.json()

    if page["total"] > 0:
        return page["repositories"][0]["repositoryId"]
    else:
        url = f"{dataplatform_base_url}/api/v1/repositories"
        response = requests.post(
            url=url,
            json=repository,
            headers={"Authorization": f"Bearer {token}"},
            verify=False,
        )
        response.raise_for_status()
        repo_created = response.json()
        return repo_created["repositoryId"]
    
def get_or_create_collection(namespace: str, collection: str) -> str:
    """Get or create a collection in the Document Index."""
    try:
        di_base_url = os.getenv("DOCUMENT_INDEX_API_URL")
        url = f"{di_base_url}/collections/{namespace}"
        token = os.getenv("TOKEN")
        response = requests.get(
            url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
        )
        response.raise_for_status()
        collections_list = response.json()
        
        if len(collections_list) == 0 or collection not in collections_list:
            url = f"{di_base_url}/collections/{namespace}/{collection}"
            response = requests.put(
                url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
            )
            response.raise_for_status()
            return f"{collection} created"
        else:
            return f"{collection} exists"
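    except Exception as e:
        # Do not reference `response` here: it is unbound if the first request itself failed.
        return f"Failed to get or create collection '{collection}': {e}"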

In [6]:
## Create the Repository

repository_payload = {
    "name": os.getenv("REPOSITORY_NAME"),
    "mediaType": "jsonlines",
    "modality": "text",
    "schema": None,
}

repository_id = get_or_create_repository(repository_payload)
print(f"Repository ID: {repository_id}")

# get_or_create_collection returns a status message, not an ID
collection_status = get_or_create_collection(os.getenv("NAMESPACE"), os.getenv("COLLECTION"))
print(f"Collection: {collection_status}")


Repository ID: 0196b9dc-0d15-4a9f-8280-442bf681785f
Collection: pharia-tutorial-rag exists


### 3. Configuring a Document Upload Stage

A stage provides temporary storage for source documents before they're processed. In this step, we create a stage named "DocumentStorageTutotialTest" that will use the "DocumentToMarkdown" transformation to convert source documents.

The stage configuration includes a trigger that defines what happens when source documents are uploaded. This trigger specifies the transformation to apply and where to store the results.

The `get_or_create_stage` function returns a stage ID that will be used when uploading documents in later steps.

In [7]:
## Environment Variables
STAGE_NAME = "DocumentStorageTutotialTest"
TRANSFORMATION_NAME = "DocumentToMarkdown" # You can check the other transformations in the documentation https://pharia-data-api.product.pharia.com/api/v1/transformations
TRIGGER_NAME = "testTrigger - DocumentStorageTutorial"

os.environ["STAGE_NAME"] = STAGE_NAME
os.environ["TRANSFORMATION_NAME"] = TRANSFORMATION_NAME
os.environ["TRIGGER_NAME"] = TRIGGER_NAME

In [8]:
## Helper function

def get_or_create_stage(stage: dict) -> str:
    """Get or create a stage in the Data Platform."""
    dataplatform_base_url = os.getenv("DATA_PLATFORM_URL")
    name = stage["name"]
    url = f"{dataplatform_base_url}/api/v1/stages?name={name}"

    token = os.getenv("TOKEN")
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    page = response.json()

    if page["total"] > 0:
        return page["stages"][0]["stageId"]
    else:
        url = f"{dataplatform_base_url}/api/v1/stages"
        response = requests.post(
            url=url,
            json=stage,
            headers={"Authorization": f"Bearer {token}"},
            verify=False,
        )
        response.raise_for_status()
        stage_created = response.json()
        return stage_created["stageId"]

In [9]:
## Setup Stage

stage_payload = {
    "name": os.getenv("STAGE_NAME"),
    "triggers": [
        {
            "transformationName": os.getenv("TRANSFORMATION_NAME"),
            "destinationType": "DataPlatform:Repository",
            "connectorType": "DocumentIndex:Collection",
            "name": TRIGGER_NAME,
        }
    ],
}

stage_id = get_or_create_stage(stage_payload)
print(f"Stage ID: {stage_id}")

Stage ID: 072470c8-8d4d-44fd-9fab-83b19ec3b96c


### 4. Creating and Assigning a Searchable Index for Documents

An index enables efficient searching of your document content. The `create_index_and_assign_to_collection` function creates an index with specified parameters and assigns it to your collection.

The key parameters include:
- `chunk_size`: Controls how documents are divided into searchable segments (256 tokens)
- `chunk_overlap`: Defines overlap between chunks to maintain context (10 tokens)
- `embedding_type`: Specifies the vector embedding approach ("asymmetric")

Once the index is assigned to your collection, any ingested documents will be automatically processed according to these settings.
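
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of overlapping chunking. It is an illustration only: the Document Index counts tokens with its own tokenizer and its actual chunking algorithm is not described in this section, so the sketch uses whitespace-separated words as a stand-in, and the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into chunks of chunk_size where neighbors share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the input
    return chunks


# Whitespace-separated words stand in for real tokens (assumption).
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk starts with the last word of the previous one, which is the role the 10-token overlap plays for the 256-token chunks configured in this tutorial.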

In [10]:
## Helper function

def create_index_and_assign_to_collection(
    index_name: str,
    collection_name: str,
    namespace: str = "Studio",
    chunk_size: int = 256,
    chunk_overlap: int = 10,
    embedding_type: str = "asymmetric",
) -> None:
    """Create an index in the Document Index and assign it to a collection."""
    token = os.getenv("TOKEN")
    document_index_base_url = os.getenv("DOCUMENT_INDEX_API_URL")
    url = f"{document_index_base_url}/indexes/{namespace}/{index_name}"
    payload = {
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "embedding_type": embedding_type
    }
    response = requests.put(url, json=payload, headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    print(f"Index created: {index_name}")

    # Assign the index to the collection
    url = f"{document_index_base_url}/collections/{namespace}/{collection_name}/indexes/{index_name}"
    response = requests.put(url, headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    print(f"Index '{index_name}' assigned to collection '{collection_name}' ")

In [11]:
create_index_and_assign_to_collection(index_name="studio-tutorial-index", collection_name=os.getenv("COLLECTION"))

Index created: studio-tutorial-index
Index 'studio-tutorial-index' assigned to collection 'pharia-tutorial-rag' 


### 5. Setting Up Automated Document Processing

The trigger configuration defines what happens when source documents are uploaded to the stage. The `ingestion_context` object combines three key elements:

1. The trigger name that identifies which trigger to activate
2. The destination repository where processed documents will be stored
3. The collection and namespace where processed documents will be indexed

This context will be included with source document uploads to instruct the system on how to process each document. When a source document is uploaded, the specified trigger automatically applies the transformation and indexes the processed document.
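
Because the ingestion context refers to the trigger by name, a typo in the name is easy to miss. The following is a minimal local sanity check, sketched under the assumption that the stage payload and the ingestion context have the shapes used in this tutorial. The helper `trigger_is_defined` is hypothetical, the repository ID is a placeholder, and how the platform itself reacts to an unknown trigger name is not covered here.

```python
def trigger_is_defined(stage_payload: dict, ingestion_context: dict) -> bool:
    """Return True if the context's triggerName matches a trigger defined on the stage."""
    defined_names = {trigger["name"] for trigger in stage_payload.get("triggers", [])}
    return ingestion_context.get("triggerName") in defined_names


# Example values mirroring the payloads used in this tutorial.
example_stage = {
    "name": "DocumentStorageTutotialTest",
    "triggers": [
        {
            "transformationName": "DocumentToMarkdown",
            "destinationType": "DataPlatform:Repository",
            "connectorType": "DocumentIndex:Collection",
            "name": "testTrigger - DocumentStorageTutorial",
        }
    ],
}
example_context = {
    "triggerName": "testTrigger - DocumentStorageTutorial",
    "destinationContext": {"repositoryId": "<repository-id>"},  # placeholder
    "connectorContext": {"collection": "pharia-tutorial-rag", "namespace": "Studio"},
}

print(trigger_is_defined(example_stage, example_context))
```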

In [12]:
## Environment Variables

TEST_TRIGGER = os.environ["TRIGGER_NAME"]
os.environ["TEST_TRIGGER"] = TEST_TRIGGER

In [13]:
ingestion_context = {
    "triggerName": os.getenv("TEST_TRIGGER"),
    "destinationContext": {"repositoryId": repository_id},
    "connectorContext": {
        "collection": os.getenv("COLLECTION"),
        "namespace": os.getenv("NAMESPACE"),
    },
}
print(f"Ingestion context: {ingestion_context}")

Ingestion context: {'triggerName': 'testTrigger - DocumentStorageTutorial', 'destinationContext': {'repositoryId': '0196b9dc-0d15-4a9f-8280-442bf681785f'}, 'connectorContext': {'collection': 'pharia-tutorial-rag', 'namespace': 'Studio'}}


### 6. Uploading and Processing Documents

With our infrastructure set up (repository, stage, index, and trigger), we can now upload source documents to the Pharia platform. This section demonstrates how to upload source documents and initiate the document ingestion process.

The document ingestion workflow transforms source documents into searchable processed documents through several steps: uploading to the stage, applying transformations, storing in the repository, and indexing for search.

The `ingest_all_documents` helper function returns a DataFrame with details on each upload attempt, making it easy to track successes and failures.
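
The upload helper is wrapped in a retry policy of three attempts with exponentially growing waits. To make that behavior concrete, here is a hand-rolled stand-in written with the standard library only. It is not `tenacity`'s implementation: the names `backoff_delay` and `call_with_retry` are hypothetical, and the delay formula is an assumption chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, multiplier: float = 1, minimum: float = 2, maximum: float = 10) -> float:
    """Delay after a failed attempt: multiplier * 2**(attempt - 1), clamped to [minimum, maximum]."""
    return max(minimum, min(multiplier * 2 ** (attempt - 1), maximum))


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts is reached, sleeping between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

One difference worth knowing: this sketch re-raises the original exception once the attempts are exhausted, whereas `tenacity` by default raises its own `RetryError` unless `reraise=True` is passed to the decorator.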

In [14]:
## Helper functions

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type(requests.RequestException),
)
def ingest_document(
    document_path: str, ingestion_context: dict, name: str, stage_id: str
) -> dict:
    """Attempts to ingest a document and returns the ingestion result."""
    with open(document_path, mode="rb") as file_reader:
        dataplatform_base_url = os.getenv("DATA_PLATFORM_URL")
        url = f"{dataplatform_base_url}/api/v1/stages/{stage_id}/files"
        token = os.getenv("TOKEN")
        response = requests.post(
            url=url,
            headers={"Authorization": f"Bearer {token}"},
            verify=False,
            files={
                "name": name,
                "sourceData": file_reader,
                "ingestionContext": json.dumps(ingestion_context),
            },
        )
        response.raise_for_status()

        file_uploaded = response.json()
        return {
            "file_id": file_uploaded["fileId"],
            "status": "Success",
            "error_type": None,
            "error_message": None,
        }
    


def ingest_all_documents(
    directory_path: str, ingestion_context: dict, stage_id: str, max_workers: int = 3
):
    """Ingest all files in a directory concurrently and store results in a DataFrame."""

    results = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(
                ingest_document,
                os.path.join(directory_path, file),
                ingestion_context,
                file,
                stage_id,
            ): file
            for file in os.listdir(directory_path)
        }

        for future in concurrent.futures.as_completed(future_to_file):
            file_name = future_to_file[future]
            file_path = os.path.join(directory_path, file_name)
            try:
                result = future.result()
                results.append(
                    {
                        "file_path": file_path,
                        "file_id": result["file_id"],
                        "status": result["status"],
                        "error_type": result["error_type"],
                        "error_message": result["error_message"],
                    }
                )
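            except Exception as e:
                print(f"An error occurred while ingesting {file_path}: {e}")
                # Use the same keys as the success case so the DataFrame columns stay consistent.
                results.append(
                    {
                        "file_path": file_path,
                        "file_id": None,
                        "status": "Ingestion Failed",
                        "error_type": type(e).__name__,
                        "error_message": str(e),
                    }
                )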

    df_results = pd.DataFrame(results)
    return df_results
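
The concurrency pattern used by `ingest_all_documents` can be studied in isolation. The sketch below reproduces it with the standard library and a fake upload function, so it runs without network access. `run_concurrently` and `fake_upload` are hypothetical names introduced only for this illustration.

```python
import concurrent.futures


def run_concurrently(items, worker, max_workers=3):
    """Run worker(item) for every item in a thread pool; return one record per item."""
    records = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its input so failures can be attributed to an item.
        future_to_item = {executor.submit(worker, item): item for item in items}
        for future in concurrent.futures.as_completed(future_to_item):
            item = future_to_item[future]
            try:
                result = future.result()  # re-raises any exception from the worker
                records.append({"item": item, "status": "Success", "result": result, "error": None})
            except Exception as exc:
                records.append({"item": item, "status": "Failed", "result": None, "error": str(exc)})
    return records


def fake_upload(name: str) -> str:
    """Hypothetical stand-in for an upload call; fails for names ending in '.tmp'."""
    if name.endswith(".tmp"):
        raise ValueError(f"unsupported file: {name}")
    return f"id-for-{name}"


records = run_concurrently(["a.pdf", "b.tmp", "c.pdf"], fake_upload)
for record in sorted(records, key=lambda r: r["item"]):
    print(record)
```

Because `as_completed` yields futures in completion order, the records are sorted before printing. One failing item does not prevent the others from finishing, which is why each upload gets its own record.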

In [15]:
# Ingesting the files
directory_path = "files_to_upload"
df_results = ingest_all_documents(directory_path, ingestion_context, stage_id)
df_results

|   | file_path | file_id | status | error_type | error_message |
|---|-----------|---------|--------|------------|---------------|
| 0 | files_to_upload/RAG.pdf | 6a9f5c37-3268-423b-8202-18a30446ca1f | Success | | |
| 1 | files_to_upload/What is RAG_ - Retrieval-Augme... | 488c59c8-26e9-4c3e-8557-bff5bff8daf0 | Success | | |
| 2 | files_to_upload/Azure Cognitive Search_ Outper... | 9e4d6f4c-ca8d-4a09-a718-f546f8969544 | Success | | |
| 3 | files_to_upload/paper.pdf | 14fc1d5c-8b47-4358-9b23-d63bedfc5fdc | Success | | |


### 7. Monitoring Source Document Processing Status

After uploading source documents, you need to verify their processing status. The code in this section:

1. Extracts IDs of successfully uploaded source documents
2. Retrieves the transformation ID
3. Checks the status of each source document's transformation
4. Extracts dataset IDs from completed transformations

The `check_files_status` function combines all this information into a comprehensive report that shows which files completed processing and which encountered errors. The dataset IDs are particularly important as they're used to access your processed documents in subsequent operations.
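
Transformations run asynchronously, so a run may not have finished when its status is first checked. The following is a minimal polling sketch, assuming a run eventually reaches a terminal status. Only `"completed"` appears in this tutorial's output; the `"failed"` status, the function name `wait_for_terminal_status`, and the default interval are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_terminal_status(
    get_status: Callable[[], str],
    terminal_statuses: frozenset[str] = frozenset({"completed", "failed"}),  # "failed" is assumed
    poll_interval: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status repeatedly until it returns a terminal status or max_polls is reached."""
    for poll in range(max_polls):
        status = get_status()
        if status in terminal_statuses:
            return status
        if poll < max_polls - 1:
            sleep(poll_interval)  # wait before asking again
    raise TimeoutError(f"no terminal status after {max_polls} polls")
```

With the helpers from this section, `get_status` could be a small function such as `lambda: check_status_of_ingestion(transformation_id, file_id)["status"]`, provided a run already exists for the file.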


In [16]:
def get_successful_document_ids(df: pd.DataFrame) -> list:
    """Retrieve a list of successful file_ids from the DataFrame."""
    return df[df["status"] == "Success"]["file_id"].tolist()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type(requests.RequestException),
)
def check_status_of_ingestion(transformation_id: str, file_id: str) -> dict:
    """Query the status of the ingestion for a given transformation and file_id."""
    dataplatform_base_url = os.getenv("DATA_PLATFORM_URL")
    url = f"{dataplatform_base_url}/api/v1/transformations/{transformation_id}/runs?file_id={file_id}"

    token = os.getenv("TOKEN")
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    page = response.json()

    assert page["total"] > 0
    return page["runs"][0]

def get_transformation_id(name: str) -> str:
    """Get the transformation ID from the Data Platform."""
    dataplatform_base_url = os.getenv("DATA_PLATFORM_URL")
    url = f"{dataplatform_base_url}/api/v1/transformations?name={name}"

    token = os.getenv("TOKEN")
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    page = response.json()

    assert page["total"] > 0
    return page["transformations"][0]["transformationId"]

def check_files_status(transformation_id: str, df: pd.DataFrame, max_workers: int = 3):
    """Check the status of ingested files and store the results in a DataFrame."""

    successful_file_ids = get_successful_document_ids(df)
    status_results = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(
                check_status_of_ingestion, transformation_id, file_id
            ): file_id
            for file_id in successful_file_ids
        }

        for future in concurrent.futures.as_completed(future_to_file):
            file_id = future_to_file[future]
            try:
                run = future.result()
                output = json.dumps(run.get("output", {}), indent=4)
                status_results.append(
                    {
                        "file_id": file_id,
                        "run_id": run["runId"],
                        "status": run["status"],
                        "output": output,
                        "error": run["errors"],
                    }
                )
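            except Exception as e:
                # `run` is not available when the status request failed, so do not read from it here.
                status_results.append(
                    {
                        "file_id": file_id,
                        "run_id": None,
                        "status": "status check failed",
                        "output": None,
                        "error": str(e),
                    }
                )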

    return df.merge(
        pd.DataFrame(status_results),
        on="file_id",
        how="left",
        suffixes=("_ingestion", ""),
    )

def get_successful_dataset_ids(df: pd.DataFrame) -> list:
    """Retrieve the dataset IDs from the run outputs in the DataFrame."""
    # Iterate over the values instead of positions: a filtered DataFrame keeps its
    # original index labels, so df["output"][i] can raise a KeyError.
    return [json.loads(output).get("datasetId") for output in df["output"].dropna()]




In [20]:
transformation_id = get_transformation_id(os.getenv("TRANSFORMATION_NAME"))
status_df = check_files_status(transformation_id, df_results)
status_df.to_csv("ingestion_status.csv", index=False)
successful_dataset_ids = get_successful_dataset_ids(status_df[status_df["status"] == "completed"])
status_df

|   | file_path | file_id | status_ingestion | error_type | error_message | run_id | status | output | error |
|---|-----------|---------|------------------|------------|---------------|--------|--------|--------|-------|
| 0 | files_to_upload/RAG.pdf | 6a9f5c37-3268-423b-8202-18a30446ca1f | Success | | | d62c29e1-88f4-4c9f-9524-163ec765d51a | completed | `{\n "type": "DataPlatform:Repository:Datase...` | |
| 1 | files_to_upload/What is RAG_ - Retrieval-Augme... | 488c59c8-26e9-4c3e-8557-bff5bff8daf0 | Success | | | 4600f1bc-7b43-4635-a5d6-aa63e5fec545 | completed | `{\n "type": "DataPlatform:Repository:Datase...` | |
| 2 | files_to_upload/Azure Cognitive Search_ Outper... | 9e4d6f4c-ca8d-4a09-a718-f546f8969544 | Success | | | 8bf74798-8205-44e9-b666-c8b6c38e9478 | completed | `{\n "type": "DataPlatform:Repository:Datase...` | |
| 3 | files_to_upload/paper.pdf | 14fc1d5c-8b47-4358-9b23-d63bedfc5fdc | Success | | | e89da729-437a-456c-b1be-a2c650451312 | completed | `{\n "type": "DataPlatform:Repository:Datase...` | |


### 8. Working with Processed Documents

With source documents ingested and processed, you can now interact with your data in various ways:

1. **Search Operation**: The `search_text` function demonstrates semantic search against your indexed processed documents, finding content based on meaning rather than exact keyword matches.

2. **Document & Metadata Retrieval**: The `get_document_from_document_index` function retrieves a complete processed document and its metadata using the dataset ID.

3. **Text Display**: The `display_processed_document_text` function shows how to access the actual content extracted from your source documents, helping you verify the quality of text extraction.

These operations showcase the fundamental ways to interact with your processed documents in the Pharia platform.

#### 8.1. Searching Document Content

After successfully ingesting documents, one of the most valuable operations is searching through your content. This section demonstrates how to perform semantic searches against your indexed documents.

The `search_text` function sends a query to the Document Index API, which uses vector embeddings to find semantically relevant content. Unlike traditional keyword search, this approach can identify conceptually related information even when exact terms don't match.

In this example, we search for content related to "what is attention?" and retrieve matches ranked by relevance. The results include document chunks that semantically align with the query, along with confidence scores indicating match quality.
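
The search response is a list of hits, each with a `document_path` and a `section` list of text parts, as the example output in this section shows. As a minimal sketch, assuming only that shape, the hits could be condensed into short previews. The helper `extract_snippets` and the sample data are illustrative assumptions, and any additional fields in the real response are ignored.

```python
def extract_snippets(search_result: list[dict], max_chars: int = 80) -> list[tuple[str, str]]:
    """Return (document name, text preview) pairs for every text part in the search hits."""
    snippets = []
    for hit in search_result:
        name = hit.get("document_path", {}).get("name", "<unknown>")
        for part in hit.get("section", []):
            if part.get("modality") == "text":
                snippets.append((name, part["text"][:max_chars]))
    return snippets


# Hand-written sample in the shape of the response (not real API output).
sample_result = [
    {
        "document_path": {
            "namespace": "Studio",
            "collection": "pharia-tutorial-rag",
            "name": "example-document",
        },
        "section": [
            {"modality": "text", "text": "Attention mechanisms have become an integral part of sequence modeling."}
        ],
    }
]

for name, preview in extract_snippets(sample_result, max_chars=40):
    print(f"{name}: {preview}")
```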


In [21]:
# Helper Functions

def search_text(namespace: str, collection: str, text: str, index: str) -> dict:
    di_base_url = os.getenv("DOCUMENT_INDEX_API_URL")
    url = f"{di_base_url}/collections/{namespace}/{collection}/indexes/{index}/search"

    token = os.getenv("TOKEN")
    payload = {"query": [{"modality": "text", "text": text}]}
    response = requests.post(
        url=url,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        verify=False,
    )
    response.raise_for_status()
    return response.json()

In [22]:
text_to_search = "what is attention?"
search_result = search_text(
    os.getenv("NAMESPACE"), os.getenv("COLLECTION"), text_to_search, index="studio-tutorial-index"
)
print(json.dumps(search_result, indent=4))

[
    {
        "document_path": {
            "namespace": "Studio",
            "collection": "pharia-tutorial-rag",
            "name": "7b12d07f-2b2e-402f-8cb4-de0424eea255"
        },
        "section": [
            {
                "modality": "text",
                "text": "Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2 , 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.\n\nIn this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours 

#### 8.2. Retrieving Complete Documents and Metadata

While searching helps find specific information, sometimes you need to retrieve a complete document along with its metadata. This operation is useful when you want to examine a document's full context or access its associated properties.

The `get_document_from_document_index` function retrieves a document using its dataset ID (obtained during the ingestion process). The response includes both the document content and additional metadata such as creation time, source information, and any custom properties attached during processing.

This example retrieves every successfully processed document from our previously ingested set and prints the first one, demonstrating how to access specific documents directly when you know their IDs.
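
A retrieved document carries its content as a list of parts under `contents`, each tagged with a `modality`. As a minimal sketch, assuming the shape visible in this section's example output, the text parts could be joined into a single string for inspection. The helper `document_text` and the sample document are hypothetical.

```python
def document_text(document: dict) -> str:
    """Join all text parts of a retrieved document into a single string."""
    parts = [
        content["text"]
        for content in document.get("contents", [])
        if content.get("modality") == "text"
    ]
    return "\n\n".join(parts)


# Hand-written sample in the shape of the response (not real API output).
sample_document = {
    "schema_version": "V1",
    "contents": [
        {"modality": "text", "text": "## Abstract"},
        {"modality": "text", "text": "Large pre-trained language models store factual knowledge."},
    ],
}

print(document_text(sample_document))
```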

In [23]:
# Helper Functions

def get_document_from_document_index(namespace, collection, dataset_id) -> dict:
    di_base_url = os.getenv("DOCUMENT_INDEX_API_URL")
    url = f"{di_base_url}/collections/{namespace}/{collection}/docs/{dataset_id}"

    token = os.getenv("TOKEN")
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False
    )
    response.raise_for_status()
    return response.json()

In [24]:
all_documents = []

# Use `dataset_id` rather than `id` to avoid shadowing the Python built-in.
for dataset_id in successful_dataset_ids:
    document_from_di = get_document_from_document_index(
        os.getenv("NAMESPACE"), os.getenv("COLLECTION"), dataset_id
    )
    all_documents.append(document_from_di)

print(json.dumps(all_documents[0], indent=4))

{
    "schema_version": "V1",
    "contents": [
        {
            "modality": "text",
            "text": "## Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks\n\nPatrick Lewis\u2020\u2021, Ethan Perez ? ,\n\nAleksandra Piktus\u2020, Fabio Petroni\u2020, Vladimir Karpukhin\u2020, Naman Goyal\u2020, Heinrich K\u00fcttler\u2020 ,\n\nMike Lewis\u2020, Wen-tau Yih\u2020, Tim Rockt\u00e4schel\u2020\u2021, Sebastian Riedel\u2020\u2021, Douwe Kiela\u2020\n\n\u2020 Facebook AI Research; \u2021University College London; ? New York University; plewis@fb.com\n\n## Abstract\n\nLarge pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their dec

#### 8.3. Viewing Extracted Document Text

To inspect the actual content extracted from your documents, you can retrieve and display the text chunks stored in the repository. This is useful for verifying extraction quality and understanding how your documents were segmented.

The `display_processed_document_text` function connects to the Data Platform repository and retrieves text chunks from a specific document. It displays each chunk sequentially, showing how the document was divided during processing.

This operation helps you validate that your documents were properly processed and that the extracted text accurately represents the original content. It can be particularly valuable when troubleshooting search issues or refining your ingestion parameters.
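
The datapoints are read line by line as JSON, one object per line. The following is a minimal sketch of parsing such a stream defensively, assuming each non-empty line is a JSON object with a `text` field as in the output of this section. The helpers `iter_datapoints` and `preview` are hypothetical and operate on any iterable of byte strings, so they can be tried without calling the API.

```python
import json
from typing import Iterable, Iterator


def iter_datapoints(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines, which would otherwise fail to parse
        yield json.loads(raw.decode("utf-8"))


def preview(datapoint: dict, max_chars: int = 100) -> str:
    """Return the datapoint's text, shortened to max_chars for display."""
    text = datapoint.get("text", "")
    return text if len(text) <= max_chars else text[:max_chars] + "..."


# Hand-written sample lines (not real API output).
sample_lines = [b'{"text": "## Title"}', b"", b'{"text": "Second chunk of the document."}']
for datapoint in iter_datapoints(sample_lines):
    print(preview(datapoint, max_chars=12))
```

In the notebook, the same generator could be fed with `response.iter_lines()` from a streamed `requests` response.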

In [28]:
# Helper Function
def display_processed_document_text(repository_id: str, dataset_id: str) -> None:
    number_of_pages = 10
    dataplatform_base_url = os.getenv("DATA_PLATFORM_URL")
    url = f"{dataplatform_base_url}/api/v1/repositories/{repository_id}/datasets/{dataset_id}/datapoints?size={number_of_pages}"

    token = os.getenv("TOKEN")
    response = requests.get(
        url=url, headers={"Authorization": f"Bearer {token}"}, verify=False, stream=True
    )
    response.raise_for_status()
    for line in response.iter_lines():
        print("# starting new page ...")
        datapoint = json.loads(line.decode())
        # print(f"{datapoint['text'][:100]}...")
        print(datapoint)


display_processed_document_text(repository_id, successful_dataset_ids[0])

# starting new page ...
{'text': '## Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks\n\nPatrick Lewis†‡, Ethan Perez ? ,\n\nAleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler† ,\n\nMike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†\n\n† Facebook AI Research; ‡University College London; ? New York University; plewis@fb.com\n\n## Abstract\n\nLarge pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pretrained models with a differentiable access mechanism to explicit non-parametric memory hav

## Summary

In this section, you've successfully set up the complete document ingestion pipeline:

✅ **Configured the environment** with connections to both the Data Platform and Document Index APIs

✅ **Built the foundation infrastructure**:
   - Created a repository for storing processed documents
   - Set up a stage for temporary source document storage
   - Configured an index for enabling semantic search
   - Established triggers for automating document processing

✅ **Implemented document operations** with:
   - Concurrent source document uploads with error handling
   - Status monitoring for transformation processes
   - Multiple ways to interact with processed documents

Your source document collection is now properly ingested, processed, and ready for semantic search operations. This data foundation will serve as the basis for retrieval-augmented generation in the subsequent sections of this tutorial.