# Automating Trustworthy Document Processing with Cleanlab and Unstructured

## Introduction

Reliable, accurate responses matter in any AI system whose output people act on. This notebook demonstrates how to build a trustworthy document processing system by combining two tools: Cleanlab's Trustworthy Language Model (TLM) and Unstructured's document parsing capabilities.

Retrieval-Augmented Generation (RAG) improves large language model (LLM) output by grounding responses in retrieved information. A RAG system is only as good as the data it can retrieve, however, and much valuable information lives in PDFs, spreadsheets, and images. These formats require preprocessing before a RAG pipeline can use them.

This notebook addresses three key challenges in building reliable AI systems:

1. **Document Preprocessing**: Converting diverse document formats into structured, machine-readable text while preserving semantic relationships and tabular information.

2. **Hallucination Mitigation**: Implementing mechanisms to reduce the generation of false or misleading information by grounding LLM responses in verified data.

3. **Trustworthiness Assessment**: Quantifying the reliability of AI-generated responses to enable appropriate human oversight and intervention.


## Setup and Installation

First, let's install the required packages and set up our environment. We'll need these libraries:

- `cleanlab-tlm`: For accessing Cleanlab's Trustworthy Language Model
- `llama-index`: For building the RAG pipeline
- `llama-index-embeddings-huggingface`: For running the embedding model used in semantic search
- `unstructured_client`: For document processing and extraction


In [30]:
# Install required packages
!pip install -q cleanlab-tlm llama-index llama-index-embeddings-huggingface unstructured_client

### API Key Configuration

To use this notebook, you'll need to obtain API keys for both Cleanlab and Unstructured:
- Cleanlab TLM API key: Available at [Cleanlab Studio](https://cleanlab.ai/)
- Unstructured API key: Available at [Unstructured.io](https://unstructured.io/)

Enter your API keys in the cell below:


In [None]:
import os

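os.environ["CLEANLAB_API_KEY"] = ""  # Replace with your Cleanlab TLM API key
os.environ["UNSTRUCTURED_API_KEY"] = ""  # Replace with your Unstructured API key
# The client created later in this notebook does not pass this URL, so it uses the SDK's default server
os.environ["UNSTRUCTURED_API_URL"] = "https://api.unstructured.io/general/v0/general"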
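
Leaving either key blank only surfaces later as an authentication error, so it can help to check up front. The following is a minimal sketch, not part of the original notebook; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration.

```python
import os

# Names of the environment variables this notebook sets above
REQUIRED_KEYS = ("CLEANLAB_API_KEY", "UNSTRUCTURED_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


# Check a sample mapping; pass os.environ to check the real environment.
sample_env = {"CLEANLAB_API_KEY": "abc123", "UNSTRUCTURED_API_KEY": ""}
print(missing_keys(sample_env))  # ['UNSTRUCTURED_API_KEY']
```

Calling `missing_keys(os.environ)` right after the configuration cell gives an empty list when both keys are filled in.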

## TLM Integration with LlamaIndex

Cleanlab's Trustworthy Language Model (TLM) is designed to provide reliable responses with a trustworthiness score that indicates the model's confidence in its output. This feature is particularly valuable in RAG systems, where ensuring the accuracy of AI-generated responses is critical.

In this section, we'll integrate TLM with LlamaIndex, a framework for building RAG applications. This integration involves:

1. Initializing a connection to Cleanlab's TLM
2. Creating a custom wrapper to use TLM with LlamaIndex
3. Setting up an embedding model for semantic search
4. Defining a function to create query engines from document collections

The trustworthiness scores provided by TLM help identify potential hallucinations or low-confidence responses, enabling more reliable AI applications.


In [None]:
import json
import os
from typing import Any

# The `cleanlab-tlm` package installed above exposes the TLM client directly
from cleanlab_tlm import TLM
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.base.llms.types import (
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback
from llama_index.core.llms.custom import CustomLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


# Connect to Cleanlab's TLM once; both wrappers in this notebook reuse this object
tlm = TLM(api_key=os.environ["CLEANLAB_API_KEY"])

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


class TLMWrapper(CustomLLM):
    context_window: int = 16000
    num_output: int = 256
    model_name: str = "TLM"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # Prompt tlm for a response and trustworthiness score
        response = tlm.prompt(prompt)
        output = json.dumps(response)
        return CompletionResponse(text=output)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # Prompt tlm for a response and trustworthiness score
        response = tlm.prompt(prompt)
        output = json.dumps(response)

        # Stream the output
        output_str = ""
        for token in output:
            output_str += token
            yield CompletionResponse(text=output_str, delta=token)


def tlm_query_engine(documents: list[str]):
    return VectorStoreIndex(
        [Document(text=document, metadata={"source": "tlm"}) for document in documents]
    ).as_query_engine(llm=TLMWrapper())

INFO: Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
INFO: 2 prompts are loaded, with the keys: ['query', 'text']
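
TLM returns its whole answer in one call, so `stream_complete` in the wrapper above simulates streaming by replaying the finished JSON string one character at a time. Each yielded item carries the text accumulated so far plus the newly added piece. The sketch below isolates that pattern without any LlamaIndex types; `stream_text` is a hypothetical helper used only for illustration.

```python
from typing import Iterator, Tuple


def stream_text(full_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (text_so_far, delta) pairs, one character at a time."""
    so_far = ""
    for ch in full_text:
        so_far += ch
        yield so_far, ch


chunks = list(stream_text("ok!"))
print(chunks)  # [('o', 'o'), ('ok', 'k'), ('ok!', '!')]
```

Concatenating the deltas always reproduces the full text, which is the property a streaming consumer relies on.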


## Unstructured API Setup

Unstructured is a powerful document processing platform that can extract structured information from various document formats, including PDFs, Word documents, HTML, and more. It's particularly effective at handling complex layouts, tables, and other structured elements that are challenging to parse with traditional methods.

In this section, we'll set up the Unstructured API client to process documents. The Unstructured Platform API provides several key capabilities:

1. **Document Partitioning**: Breaking documents into meaningful chunks (paragraphs, tables, lists, etc.)
2. **Table Extraction**: Accurately identifying and extracting tabular data
3. **Layout Analysis**: Understanding the document's structure and relationships between elements
4. **Multi-format Support**: Processing various document formats with a consistent API

These capabilities are essential for building effective RAG systems that can leverage information from diverse document sources.
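
To make the partitioning output concrete, the sketch below works on a small hand-written list shaped like the elements this notebook reads later: dictionaries with `type`, `text`, and `metadata` keys, where tables carry an HTML rendering in `metadata["text_as_html"]`. The sample values are made up for illustration, and `count_by_type` and `tables_as_html` are hypothetical helpers, not part of the Unstructured SDK.

```python
from collections import Counter

# Made-up elements in the same shape the notebook accesses later
sample_elements = [
    {"type": "Title", "text": "TOP 20 RUSHERS", "metadata": {}},
    {
        "type": "Table",
        "text": "Player Yards",
        "metadata": {"text_as_html": "<table><tr><td>x</td></tr></table>"},
    },
    {"type": "NarrativeText", "text": "An explanatory sentence.", "metadata": {}},
]


def count_by_type(elements):
    """Count how many elements of each type the partitioner returned."""
    return Counter(el["type"] for el in elements)


def tables_as_html(elements):
    """Collect the HTML rendering of every table element."""
    return [el["metadata"]["text_as_html"] for el in elements if el["type"] == "Table"]


print(count_by_type(sample_elements))
print(tables_as_html(sample_elements))
```

A quick count by type is a cheap sanity check that the partitioner found the tables you expect before anything is indexed.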


In [33]:
import os
import unstructured_client
from unstructured_client.models import operations, shared
import requests


client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

## Document Processing and Table Extraction

One of the key challenges in building effective RAG systems is extracting structured information from complex document formats like PDFs. In this section, we'll demonstrate how to use Unstructured's API to process a PDF document and extract tables, which are particularly challenging to handle with traditional text extraction methods.

We'll work with data from the NFL Record & Fact Book, which contains various tables with statistics and records. This example illustrates how Unstructured can accurately extract tabular data while preserving its structure, making it suitable for downstream RAG applications.

The process involves:
1. Downloading the PDF document
2. Sending it to Unstructured's API for partitioning
3. Extracting tables and their associated titles
4. Formatting the extracted data for use in our RAG pipeline

This approach can be extended to handle various document types and structures, making it a versatile solution for document processing in RAG systems.
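
Step 3 relies on each table element carrying an HTML rendering of the table. This notebook passes that HTML to the query engine as-is, but if you need the cells as plain Python lists (for validation or export, say), the standard library is enough. The following is a rough sketch under that assumption; `TableRows` and `html_table_to_rows` are hypothetical names, and the sample table is invented.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed HTML


def html_table_to_rows(html):
    """Parse an HTML table string into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table><tr><th>Player</th><th>Yards</th></tr>"
    "<tr><td>A. Runner</td><td>1,234</td></tr></table>"
)
print(html_table_to_rows(sample))
# [['Player', 'Yards'], ['A. Runner', '1,234']]
```

This sketch does not handle nested tables or `rowspan`/`colspan`; those would need extra bookkeeping.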

Note: Partitioning the PDF occasionally fails. If it does, rerun the cell below.


In [37]:
file_url = "https://storage.googleapis.com/richard-xiong-366.appspot.com/nfl.pdf"


response = requests.get(file_url)
response.raise_for_status()


file_content = response.content

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=file_content,
            file_name="nfl.pdf",
        ),
        strategy=shared.Strategy.VLM,
        vlm_model=shared.PartitionParametersStrategy.GPT_4O,
        vlm_model_provider=shared.PartitionParametersSchemasStrategy.OPENAI,
        languages=["eng"],
        split_pdf_page=True,
        split_pdf_allow_failed=True,
        split_pdf_concurrency_level=15,
    ),
)


result = client.general.partition(request=req)


if result.elements is None:
    raise Exception("No elements found in the response")



INFO: HTTP Request: GET https://api.unstructuredapp.io/general/docs "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
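
Because partitioning can fail intermittently, an alternative to rerunning the cell by hand is to wrap the call in a small retry helper. This is a minimal sketch with exponential backoff; `with_retries` and its parameters are hypothetical and not part of the Unstructured SDK.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or `attempts` calls have failed.

    Waits base_delay, then 2 * base_delay, and so on between tries,
    and re-raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# In this notebook the call could be wrapped as:
# result = with_retries(lambda: client.general.partition(request=req))
```

The `sleep` argument exists so the helper can be exercised without real delays. In practice you may want to catch only the exception types that signal a transient failure.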


In [None]:
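# Keep only the elements that Unstructured classified as tables
raw_tables = [item for item in result.elements if item["type"] == "Table"]  # type: ignore

# Table titles in this PDF start with "TOP" or contain "COACHES"
titles = [
    item
    for item in result.elements  # type: ignore
    if item["text"].startswith("TOP") or "COACHES" in item["text"]
]

# Pair titles with tables by position. This assumes both lists are in the
# same order; zip() silently drops extras if their lengths differ.
tables = [
    {"title": title["text"], "table": table["metadata"]["text_as_html"]}
    for title, table in zip(titles, raw_tables)
]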
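
Pairing by position works when every table has exactly one matching title. If a table lacks a title, or a title has no table, the pairs shift out of alignment. One alternative, sketched below as an assumption rather than the notebook's method, walks the elements in document order and attaches to each table the most recent title seen before it. `is_table_title` and `pair_titles_with_tables` are hypothetical helpers, and the sample elements are invented.

```python
def is_table_title(text):
    """Same title heuristic as the cell above."""
    return text.startswith("TOP") or "COACHES" in text


def pair_titles_with_tables(elements):
    """Attach to each Table the most recent matching title that precedes it."""
    pairs = []
    current_title = None
    for el in elements:
        if el["type"] == "Table":
            pairs.append(
                {
                    "title": current_title or "Untitled table",
                    "table": el["metadata"]["text_as_html"],
                }
            )
            current_title = None  # a title labels only the next table
        elif is_table_title(el["text"]):
            current_title = el["text"]
    return pairs


# Invented sample: the second table has no title of its own
sample = [
    {"type": "Title", "text": "TOP 20 RUSHERS", "metadata": {}},
    {"type": "Table", "text": "", "metadata": {"text_as_html": "<table>A</table>"}},
    {"type": "Table", "text": "", "metadata": {"text_as_html": "<table>B</table>"}},
    {"type": "Title", "text": "WINNINGEST COACHES", "metadata": {}},
    {"type": "Table", "text": "", "metadata": {"text_as_html": "<table>C</table>"}},
]
for pair in pair_titles_with_tables(sample):
    print(pair["title"], "->", pair["table"])
```

With positional pairing, the same sample would label table B with the coaches title and drop table C entirely.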

## Query Engine Setup

Now that we have extracted tables from our document, we need to make this information searchable and retrievable. This is where the query engine comes into play. The query engine is responsible for:

1. Indexing the document content (in this case, our extracted tables)
2. Retrieving relevant information based on user queries
3. Providing context to the language model for generating accurate responses

In this section, we'll create a query engine using the tables we extracted in the previous step. We'll format each table with its title to provide better context for the retrieval system. This approach ensures that when users ask questions about specific records or statistics, the system can retrieve the most relevant tables and provide accurate answers.

The query engine leverages the embedding model we set up earlier to perform semantic search, finding the most relevant content based on the meaning of the query rather than just keyword matching.
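
To illustrate what "semantic search" means here: each document and each query is mapped to a vector, and documents are ranked by how close their vectors are to the query's. Cosine similarity is a common closeness measure and is assumed in this sketch. The three-dimensional vectors are toy values, not real embeddings, and `rank_by_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of two tables and one query
doc_vecs = {
    "rushing table": [0.9, 0.1, 0.0],
    "coaching table": [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]
print(rank_by_similarity(query_vec, doc_vecs)[0][0])  # rushing table
```

Real embeddings have hundreds of dimensions, but the ranking step has the same shape.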


In [None]:
query_engine = tlm_query_engine(
    [f"<h1>{table['title']}</h1>\n{table['table']}" for table in tables]
)

query = input('Enter your query:')
print(query_engine.query(query))



{"response": "Based on the provided statistics, Derrick Henry stands out as the best rusher to bet on. He has the highest total yards (9,502) and touchdowns (90) among active rushers, with 8 years of experience. His consistent performance and ability to score make him a strong candidate for betting.", "trustworthiness_score": 0.9493786239709094}
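
Since the wrapper returns the answer and its score as one JSON string, downstream code has to split them apart before acting on the score. The sketch below shows one way to do that and to flag low-scoring answers for review. The 0.8 cut-off is an arbitrary assumption to tune for your application, and `parse_tlm_output` and `needs_review` are hypothetical helpers.

```python
import json

REVIEW_THRESHOLD = 0.8  # assumed cut-off, not a recommended value


def parse_tlm_output(raw):
    """Split the wrapper's JSON string into (answer, trustworthiness score)."""
    data = json.loads(raw)
    return data["response"], float(data["trustworthiness_score"])


def needs_review(score, threshold=REVIEW_THRESHOLD):
    """Flag answers whose score falls below the threshold."""
    return score < threshold


# A shortened stand-in for the output above
raw = '{"response": "Example answer.", "trustworthiness_score": 0.95}'
answer, score = parse_tlm_output(raw)
print(answer, score, needs_review(score))

# In this notebook, the raw string would come from the query engine, e.g.
# answer, score = parse_tlm_output(str(query_engine.query(query)))
```

Answers flagged by `needs_review` can be routed to a person or answered with a fallback message instead of being shown directly.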


## Alternate TLM Method

To generate responses with a different LLM and use Cleanlab's TLM only for scoring, call TLM's `get_trustworthiness_score` method on each prompt and response pair. This decouples the trustworthiness assessment from the model that produces the answer. Let's set this up using OpenAI:

In [None]:
from openai import OpenAI

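# Pass the key to the constructor: OpenAI() raises an error when no key is
# given and the OPENAI_API_KEY environment variable is unset.
# Note: this rebinds `client`, which previously held the Unstructured client.
client = OpenAI(api_key="<OPENAI API KEY>")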

class LLMWrapper(CustomLLM):
    context_window: int = 128_000
    num_output: int = 256
    model_name: str = "gpt-4o-mini"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=self.num_output,
        )
        output = response.choices[0].message.content

        trustworthiness = tlm.get_trustworthiness_score(prompt, output or "")

        return CompletionResponse(
            text=json.dumps(
                {
                    "response": output,
                    "trustworthiness_score": trustworthiness["trustworthiness_score"],  # type: ignore
                }
            )
        )

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=self.num_output,
            stream=True,
        )

        output = ""

        for token in response:
            output += token.choices[0].delta.content or ""

            yield CompletionResponse(
                text=json.dumps(
                    {
                        "response": output,
                    }
                ),
                delta=token.choices[0].delta.content or "",
            )

        trustworthiness = tlm.get_trustworthiness_score(prompt, output)

        yield CompletionResponse(
            text=json.dumps(
                {
                    "response": output,
                    "trustworthiness_score": trustworthiness["trustworthiness_score"],  # type: ignore
                }
            )
        )


def llm_query_engine(documents: list[str]):
    return VectorStoreIndex(
        [Document(text=document, metadata={"source": "tlm"}) for document in documents]
    ).as_query_engine(llm=LLMWrapper())

This query engine runs the same way as the previous one.

In [None]:
llm_qe = llm_query_engine(
    [f"<h1>{table['title']}</h1>\n{table['table']}" for table in tables]
)

query = input('Enter your query:')
print(llm_qe.query(query))

## Conclusion

This notebook demonstrated an integrated approach to building trustworthy document processing systems by combining Cleanlab's TLM with Unstructured's document parsing capabilities. The key takeaways are:

1. **Enhanced Document Processing Pipeline**: We've shown how Unstructured's API can transform complex document formats like PDFs into structured, machine-readable content while preserving semantic relationships and tabular information. This preprocessing step is crucial for maintaining data integrity in RAG systems.

2. **Trustworthiness Assessment Framework**: By integrating Cleanlab's TLM, we've implemented a mechanism to quantify the reliability of AI-generated responses. The trustworthiness scores provide a valuable metric for assessing when to trust model outputs and when human intervention might be necessary.

3. **Practical Implementation Strategy**: The notebook provides a complete implementation framework that practitioners can adapt to their specific domains, from document ingestion to query processing and response evaluation.

### Implications for AI Reliability

Attaching a trustworthiness score to each RAG response gives users a quantitative basis for deciding when to rely on AI-generated content and when to verify it. This is particularly valuable in high-stakes domains where accuracy is critical.

### Additional Resources

- [Cleanlab Studio](https://cleanlab.ai/) - Platform for building and deploying trustworthy AI systems
- [Unstructured Platform](https://unstructured.io/) - Tools for extracting structured information from unstructured documents
- [Hugging Face Hub](https://huggingface.co/) - Repository of pre-trained models, including embedding models used in this notebook
