# Automating Trustworthy Document Processing with Cleanlab and Unstructured

## Introduction

Applications built on large language models are only as useful as their responses are reliable and accurate. This notebook demonstrates how to build a trustworthy document processing pipeline by combining Cleanlab's Trustworthy Language Model (TLM) with Unstructured's document parsing capabilities.

Retrieval-Augmented Generation (RAG) improves large language model performance by grounding responses in retrieved information. However, RAG effectiveness depends heavily on the quality and accessibility of the underlying data. Many valuable sources exist in complex formats such as PDFs, spreadsheets, and images, which must be preprocessed before a RAG pipeline can use them.

This notebook addresses three key challenges:

1. **Document Preprocessing**: Converting diverse formats into structured text while preserving semantic relationships
2. **Hallucination Mitigation**: Reducing false information generation by grounding LLM responses in verified data
3. **Trustworthiness Assessment**: Quantifying response reliability for appropriate human oversight

In [None]:
# Install required packages
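# cleanlab-studio provides the `cleanlab_studio` module imported below;
# openai and requests are used in later cells
!pip install -q cleanlab-studio llama-index llama-index-embeddings-huggingface unstructured_client openai requests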

### API Key Configuration

To use this notebook, you'll need to obtain API keys for both Cleanlab and Unstructured:
- Cleanlab TLM API key: Available at [Cleanlab Studio](https://cleanlab.ai/)
- Unstructured API key: Available at [Unstructured.io](https://unstructured.io/)

Enter your API keys in the cell below:


In [None]:
import os

os.environ["CLEANLAB_API_KEY"] = ""  # Replace with your API key
os.environ["UNSTRUCTURED_API_KEY"] = ""  # Replace with your API key
os.environ["UNSTRUCTURED_API_URL"] = "https://api.unstructured.io/general/v0/general"
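Before moving on, it can help to confirm that the keys above were actually filled in, since an empty string only fails later at the first API call. The following is a minimal sketch; `find_missing_keys` and `REQUIRED_KEYS` are hypothetical helper names, not part of either SDK.

```python
import os

# Environment variables this notebook sets in the cell above
REQUIRED_KEYS = ("CLEANLAB_API_KEY", "UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")


def find_missing_keys(env, required=REQUIRED_KEYS):
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


missing = find_missing_keys(os.environ)
if missing:
    print(f"Set these environment variables before continuing: {', '.join(missing)}")
```

The check takes the environment as an argument, so it can also be run against a plain dictionary.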

## TLM Integration with LlamaIndex

Cleanlab's Trustworthy Language Model (TLM) returns each response together with a trustworthiness score that indicates how much confidence to place in it. This is valuable in RAG systems, where response accuracy is critical.

In this section, we'll integrate TLM with LlamaIndex by:

1. Initializing a connection to Cleanlab's TLM
2. Creating a custom wrapper for use with LlamaIndex
3. Setting up an embedding model for semantic search
4. Defining a function to create query engines

The trustworthiness scores help identify potential hallucinations and low-confidence responses, so that applications can handle them differently from trusted ones.

In [None]:
import json
import os
from typing import Any

from cleanlab_studio import Studio
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.base.llms.types import (
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback
from llama_index.core.llms.custom import CustomLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Connect to Cleanlab and create a TLM client
studio = Studio(os.environ["CLEANLAB_API_KEY"])
tlm = studio.TLM()

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


class TLMWrapper(CustomLLM):
    context_window: int = 16000
    num_output: int = 256
    model_name: str = "TLM"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # Prompt tlm for a response and trustworthiness score
        response = tlm.prompt(prompt)
        output = json.dumps(response)
        return CompletionResponse(text=output)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # Prompt tlm for a response and trustworthiness score
        response = tlm.prompt(prompt)
        output = json.dumps(response)

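        # TLM returns the full response at once, so simulate streaming
        # by yielding the serialized output one character at a time
        output_str = ""
        for char in output:
            output_str += char
            yield CompletionResponse(text=output_str, delta=char)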


def tlm_query_engine(documents: list[str]):
    return VectorStoreIndex(
        [Document(text=document, metadata={"source": "tlm"}) for document in documents]
    ).as_query_engine(llm=TLMWrapper())
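Because `TLMWrapper.complete` returns `json.dumps(response)`, the text that comes back is a JSON string rather than a plain answer. The sketch below shows one way to split it back into the answer and the score. It assumes the JSON contains the `response` and `trustworthiness_score` keys used elsewhere in this notebook and that the text reaches the caller unchanged; `parse_tlm_output` is a hypothetical helper name.

```python
import json
from typing import Optional, Tuple


def parse_tlm_output(text: str) -> Tuple[str, Optional[float]]:
    """Split a JSON-encoded TLM result into (answer, trustworthiness score).

    Falls back to (text, None) when the text is not a JSON object.
    """
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return text, None
    if not isinstance(payload, dict):
        return text, None

    raw_score = payload.get("trustworthiness_score")
    score = float(raw_score) if raw_score is not None else None
    return str(payload.get("response", "")), score
```

A typical call would be `answer, score = parse_tlm_output(str(query_engine.query(query)))`.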

## Unstructured API Setup

Unstructured is a document processing platform that extracts structured information from formats including PDFs, Word documents, and HTML. It handles complex layouts, tables, and other structured elements that are challenging for traditional parsers.

We'll set up the Unstructured API client to access these capabilities:

1. **Document Partitioning**: Breaking documents into meaningful chunks
2. **Table Extraction**: Accurately identifying and extracting tabular data
3. **Layout Analysis**: Understanding document structure and element relationships
4. **Multi-format Support**: Processing various document formats consistently

These capabilities are essential for RAG systems leveraging diverse document sources.

In [None]:
import os
import unstructured_client
from unstructured_client.models import operations, shared
import requests


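client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    # Use the endpoint configured earlier instead of the SDK default
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)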

## Document Processing and Table Extraction

A key challenge in RAG systems is extracting structured information from complex formats like PDFs. Here, we'll use Unstructured's API to process a PDF document and extract tables, which are particularly difficult to handle with traditional methods.

We'll work with data from the NFL Record & Fact Book to demonstrate how Unstructured preserves table structure for downstream RAG applications.

The process involves:
1. Downloading the PDF document
2. Sending it to Unstructured's API for partitioning
3. Extracting tables and their associated titles
4. Formatting the extracted data for our RAG pipeline

The same approach applies to other document types and layouts, not only PDFs with tables.

In [None]:
file_url = "https://storage.googleapis.com/nfl-2024-record/nfl.pdf"


response = requests.get(file_url)
response.raise_for_status()


file_content = response.content

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=file_content,
            file_name="nfl.pdf",
        ),
        strategy=shared.Strategy.VLM,
        vlm_model=shared.PartitionParametersStrategy.GPT_4O,
        vlm_model_provider=shared.PartitionParametersSchemasStrategy.OPENAI,
        languages=["eng"],
        split_pdf_page=True,
        split_pdf_allow_failed=True,
        split_pdf_concurrency_level=15,
    ),
)


result = client.general.partition(request=req)


if result.elements is None:
    raise Exception("No elements found in the response")
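Before filtering for tables, it can help to see which element types the partition call returned. This is a minimal sketch, assuming each element is a dictionary with a `type` key, as the next cell does. `count_element_types` and the sample data are hypothetical and do not come from the NFL document.

```python
from collections import Counter


def count_element_types(elements):
    """Count how many elements of each type a partition result contains."""
    return Counter(element.get("type", "Unknown") for element in elements)


# Hypothetical elements, shaped like the dictionaries used in this notebook
sample_elements = [
    {"type": "Title", "text": "Example heading"},
    {"type": "NarrativeText", "text": "Example paragraph."},
    {"type": "Table", "text": "a b", "metadata": {"text_as_html": "<table></table>"}},
    {"type": "Table", "text": "c d", "metadata": {"text_as_html": "<table></table>"}},
]

for element_type, count in count_element_types(sample_elements).most_common():
    print(f"{element_type}: {count}")
```

On the real result, the call would be `count_element_types(result.elements)`.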



In [None]:
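# Keep only the elements that Unstructured classified as tables
raw_tables = [item for item in result.elements if item["type"] == "Table"]  # type: ignore

# Treat elements whose text starts with "TOP" or contains "COACHES" as table titles
titles = [
    item
    for item in result.elements  # type: ignore
    if item["text"].startswith("TOP") or "COACHES" in item["text"]
]

# Pair titles with tables by position. This assumes both lists are in the same
# order; zip() silently drops any unmatched trailing items.
tables = [
    {"title": title["text"], "table": table["metadata"]["text_as_html"]}
    for title, table in zip(titles, raw_tables)
]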
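Pairing titles and tables by position works only when the two lists line up exactly. A possible alternative, sketched below, walks the elements in document order and attaches each table to the most recent title seen. It assumes the same element shape as the cell above (`type`, `text`, and `metadata["text_as_html"]`); `pair_tables_with_titles`, the `is_title` predicate, and the sample data are hypothetical.

```python
def pair_tables_with_titles(elements, is_title):
    """Attach each table to the most recent preceding title, in document order."""
    pairs = []
    current_title = None
    for element in elements:
        if element.get("type") == "Table":
            html = element.get("metadata", {}).get("text_as_html")
            if html:  # skip tables that came back without HTML
                pairs.append({"title": current_title or "Untitled table", "table": html})
        elif is_title(element.get("text", "")):
            current_title = element["text"]
    return pairs


# Hypothetical elements for illustration only
sample = [
    {"type": "Title", "text": "TOP SCORERS"},
    {"type": "NarrativeText", "text": "Intro text"},
    {"type": "Table", "text": "a b", "metadata": {"text_as_html": "<table>1</table>"}},
    {"type": "Table", "text": "c", "metadata": {}},
    {"type": "Title", "text": "HEAD COACHES"},
    {"type": "Table", "text": "d", "metadata": {"text_as_html": "<table>2</table>"}},
]

paired = pair_tables_with_titles(
    sample, is_title=lambda text: text.startswith("TOP") or "COACHES" in text
)
```

With this approach an extra or missing title shifts nothing: a table without a preceding title is labeled "Untitled table" instead of being paired with the wrong heading.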

## Query Engine Setup

Now that we've extracted tables, we need to make this information searchable and retrievable. The query engine will:

1. Index the document content (our extracted tables)
2. Retrieve relevant information based on user queries
3. Provide context to the language model for accurate responses

We'll create a query engine using our extracted tables, formatting each with its title for better context. This ensures that when users ask questions about specific statistics, the system retrieves the most relevant tables and provides accurate answers.

Our query engine uses the embedding model we set up earlier for semantic search, finding relevant content based on meaning rather than just keyword matching.

In [None]:
query_engine = tlm_query_engine(
    [f"<h1>{table['title']}</h1>\n{table['table']}" for table in tables]
)

query = input('Enter your query:')
print(query_engine.query(query))
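To illustrate what retrieval "based on meaning rather than just keyword matching" amounts to, the sketch below ranks documents by the cosine similarity between embedding vectors. It is a toy illustration: the three-dimensional vectors and document names are made up, and the real index relies on the HuggingFace embedding model and LlamaIndex's own similarity search rather than this code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document names ordered from most to least similar to the query."""
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up embeddings: the query points in nearly the same direction as coaching_table
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "scoring_table": [0.1, 1.0, 0.0],
    "coaching_table": [0.9, 0.1, 0.8],
}
ranking = rank_by_similarity(query_vec, doc_vecs)
```

Here `ranking` places `coaching_table` first because its vector is closest in direction to the query, regardless of whether the two share any literal words.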

## Alternate TLM Method

If we prefer to use a different LLM while still relying on Cleanlab's trustworthiness assessment, we can use TLM's `get_trustworthiness_score` method. This approach lets us score responses from any LLM without being tied to a specific implementation. Let's set this up using OpenAI:

In [None]:
from openai import OpenAI

# Use a distinct name so the Unstructured `client` defined earlier is not overwritten,
# and pass the key at construction time (OpenAI() raises if no key is available)
openai_client = OpenAI(api_key="<OPENAI API KEY>")  # Replace with your API key


class LLMWrapper(CustomLLM):
    context_window: int = 128_000
    num_output: int = 256
    model_name: str = "gpt-4o-mini"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        response = openai_client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=self.num_output,
        )
        output = response.choices[0].message.content or ""

        # Score the OpenAI response with TLM
        trustworthiness = tlm.get_trustworthiness_score(prompt, output)

        return CompletionResponse(
            text=json.dumps(
                {
                    "response": output,
                    "trustworthiness_score": trustworthiness["trustworthiness_score"],  # type: ignore
                }
            )
        )

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        response = openai_client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=self.num_output,
            stream=True,
        )

        output = ""

        for chunk in response:
            # Some chunks may arrive without any choices; skip them
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            output += delta

            yield CompletionResponse(
                text=json.dumps(
                    {
                        "response": output,
                    }
                ),
                delta=delta,
            )

        # The score can only be computed once the full response is available
        trustworthiness = tlm.get_trustworthiness_score(prompt, output)

        yield CompletionResponse(
            text=json.dumps(
                {
                    "response": output,
                    "trustworthiness_score": trustworthiness["trustworthiness_score"],  # type: ignore
                }
            )
        )


def llm_query_engine(documents: list[str]):
    return VectorStoreIndex(
        [Document(text=document, metadata={"source": "tlm"}) for document in documents]
    ).as_query_engine(llm=LLMWrapper())

We can run this query engine the same way as the previous one.

In [None]:
llm_qe = llm_query_engine(
    [f"<h1>{table['title']}</h1>\n{table['table']}" for table in tables]
)

query = input('Enter your query:')
print(llm_qe.query(query))
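Once a trustworthiness score is available, a simple way to act on it is to route each response by threshold. The sketch below is one possible policy, not something prescribed by Cleanlab: the `triage` function is hypothetical, and the cutoffs 0.8 and 0.5 are placeholder values that would need tuning for a real application.

```python
def triage(score, accept_at=0.8, review_at=0.5):
    """Map a trustworthiness score in [0, 1] to an action.

    The thresholds are illustrative placeholders, not recommended values.
    """
    if score is None:
        return "review"  # no score available: let a human decide
    if score >= accept_at:
        return "accept"
    if score >= review_at:
        return "review"
    return "reject"


for example_score in (0.93, 0.62, 0.2, None):
    print(example_score, "->", triage(example_score))
```

Keeping the thresholds as parameters makes it easy to tighten them for high-stakes use cases and loosen them where occasional errors are acceptable.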

## Conclusion

This notebook has demonstrated an integrated approach to building trustworthy document processing systems by combining Cleanlab's TLM with Unstructured's document parsing. The key takeaways are:

1. **Enhanced Document Processing**: We've shown how Unstructured transforms complex documents into structured, machine-readable content while preserving semantic relationships and tables. This preprocessing maintains data integrity in RAG systems.

2. **Trustworthiness Assessment**: By integrating Cleanlab's TLM, we've implemented a mechanism to quantify response reliability. These scores provide valuable metrics for assessing when to trust model outputs and when human intervention might be necessary.

3. **Practical Implementation**: The notebook provides a complete framework that practitioners can adapt to specific domains, from document ingestion to query processing and response evaluation.

Integrating trustworthiness metrics into RAG systems is a practical step toward more reliable AI applications. Quantitative confidence measures enable more informed decisions about AI-generated content, which is particularly valuable in high-stakes domains.

- [Cleanlab Studio](https://cleanlab.ai/) - Platform for building trustworthy AI systems
- [Unstructured Platform](https://unstructured.io/) - Tools for extracting structured information from documents
- [Hugging Face Hub](https://huggingface.co/) - Repository of pre-trained models, including embeddings used in this notebook