# PDF processing with Unstructured and querying with HuggingChat

This sample notebook sends a PDF file to [Unstructured API services](https://docs.unstructured.io/api-reference/api-services/overview) for processing. Unstructured processes the PDF and extracts the PDF's content. The notebook then sends some of the content to [HuggingChat](https://huggingface.co/chat/), Hugging Face's open-source AI chatbot, along with some queries about this content.

## Step 1: Install the Unstructured and HuggingChat libraries

---



In [None]:
%pip install -q unstructured-ingest
%pip install "unstructured-ingest[remote]"
%pip install -q hugchat

Collecting unstructured-client>=0.23.0 (from unstructured-ingest[remote])
  Downloading unstructured_client-0.25.5-py3-none-any.whl.metadata (13 kB)
Collecting deepdiff>=6.0 (from unstructured-client>=0.23.0->unstructured-ingest[remote])
  Downloading deepdiff-7.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx>=0.27.0 (from unstructured-client>=0.23.0->unstructured-ingest[remote])
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting jsonpath-python>=1.0.6 (from unstructured-client>=0.23.0->unstructured-ingest[remote])
  Downloading jsonpath_python-1.0.6-py3-none-any.whl.metadata (12 kB)
Collecting pypdf>=4.0 (from unstructured-client>=0.23.0->unstructured-ingest[remote])
  Downloading pypdf-4.3.1-py3-none-any.whl.metadata (7.4 kB)
Collecting ordered-set<4.2.0,>=4.1.0 (from deepdiff>=6.0->unstructured-client>=0.23.0->unstructured-ingest[remote])
  Downloading ordered_set-4.1.0-py3-none-any.whl.metadata (5.3 kB)
Collecting httpcore==1.* (from httpx>=0.27.0->uns

## Step 2: Set imports

---

In [None]:
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

import os, json
from hugchat import hugchat
from hugchat.login import Login
from google.colab import userdata

## Step 3: Set your Unstructured API key and API URL

---

Get a key and URL:

- Pay-as-you-go unlimited version: https://docs.unstructured.io/api-reference/api-services/saas-api-development-guide#get-started
- Limited free version: https://docs.unstructured.io/api-reference/api-services/free-api#get-an-api-key

Set the following secrets:

- `UNSTRUCTURED_API_KEY` to your Unstructured API key.
- `UNSTRUCTURED_API_URL` to your Unstructured API URL.

To set these:

1. On the left sidebar, click the **Secrets** icon.
2. Enter each name/value pair above.
3. Switch on the **Notebook access** toggle for each name/value pair.
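
Before running the later steps, it can help to confirm that both secrets resolve to a value. The following is a minimal sketch, not part of the original notebook: `find_missing_secrets` is a hypothetical helper, and the dictionary below only stands in for Colab's secrets store so that the sketch runs anywhere.

```python
from typing import Callable, Iterable, List


def find_missing_secrets(names: Iterable[str], getter: Callable[[str], object]) -> List[str]:
    """Return the secret names that `getter` cannot resolve to a non-empty value."""
    missing = []
    for name in names:
        try:
            value = getter(name)
        except Exception:
            # A secrets store may raise instead of returning None when a
            # secret is absent or notebook access is switched off.
            value = None
        if not value:
            missing.append(name)
    return missing


# Stand-in for the secrets store (hypothetical values).
fake_secrets = {"UNSTRUCTURED_API_KEY": "example-key"}

print(find_missing_secrets(
    ["UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL"],
    fake_secrets.get,
))
# → ['UNSTRUCTURED_API_URL']
```

In this notebook, you would pass `userdata.get` (imported in Step 2) as the `getter` argument instead of `fake_secrets.get`.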

## Step 4: Set your Hugging Face account's email address and account password

---

Get a Hugging Face account: https://huggingface.co/join

Set the following secrets:

- `HUGGING_FACE_EMAIL` to your Hugging Face account's email address.
- `HUGGING_FACE_PASSWORD` to your Hugging Face account's password.

To set these:

1. On the left sidebar, click the **Secrets** icon.
2. Enter each name/value pair above.
3. Switch on the **Notebook access** toggle for each name/value pair.

## Step 5: Upload a PDF file for Unstructured to process

---

Upload a PDF file before continuing.

For example, you can run the following cell to download a sample PDF file containing the text of the United States Constitution, from https://constitutioncenter.org/media/files/constitution.pdf, into Google Colab session storage.

Or, you can upload a different file into Google Colab session storage:

1. On the left sidebar, click the **Files** icon.
2. Click the **Upload to session storage** icon.

If you upload a different file, use its filename in place of `constitution.pdf` in Step 9.

In [None]:
!wget https://constitutioncenter.org/media/files/constitution.pdf

--2024-08-23 17:11:41--  https://constitutioncenter.org/media/files/constitution.pdf
Resolving constitutioncenter.org (constitutioncenter.org)... 172.67.42.106, 104.22.22.181, 104.22.23.181, ...
Connecting to constitutioncenter.org (constitutioncenter.org)|172.67.42.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 413949 (404K) [application/pdf]
Saving to: ‘constitution.pdf’


2024-08-23 17:11:41 (80.4 MB/s) - ‘constitution.pdf’ saved [413949/413949]
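
Before sending the file for processing, a quick sanity check can confirm that it is in session storage and at least starts like a PDF. This is a minimal sketch under that assumption: `looks_like_pdf` is a hypothetical helper that only checks for the `%PDF-` header bytes and does not validate the rest of the file.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if `path` is an existing file that begins with the PDF header bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        # PDF files start with the marker "%PDF-" followed by a version number.
        return handle.read(5) == b"%PDF-"


# Prints False if the file has not been downloaded or uploaded yet.
print(looks_like_pdf("constitution.pdf"))
```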



## Step 6: Provide a function to process the document

---



In [None]:
def generate_json_from_local(
        input_path: str,
        output_dir: str,
        parition_by_api: bool = False,
        api_key: str = None,
        partition_endpoint: str = None,
        split_pdf_page: bool = True,
        split_pdf_allow_failed: bool = True,
        split_pdf_concurrency_level: int = 15
    ):
    """Partition the local file at input_path and write the resulting JSON to output_dir.

    When parition_by_api is True, the file is sent to the Unstructured API at
    partition_endpoint, authenticated with api_key.
    """
    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Source: read the input file from local storage.
        indexer_config=LocalIndexerConfig(input_path=input_path),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=parition_by_api,
            api_key=api_key,
            partition_endpoint=partition_endpoint,
            # PDF page-splitting options, passed through to the partition request.
            additional_partition_args={
                "split_pdf_page": split_pdf_page,
                "split_pdf_allow_failed": split_pdf_allow_failed,
                "split_pdf_concurrency_level": split_pdf_concurrency_level
            }
        ),
        # Destination: write the processed JSON to local storage.
        uploader_config=LocalUploaderConfig(output_dir=output_dir)
    ).run()

## Step 7: Provide a function to extract matching texts from the processed data

---

In [None]:
def extract_matching_texts_from_local(input_json_file_path: str, text_to_match: str) -> str:
    """Concatenate the text of every element that contains text_to_match (case-sensitive)."""
    matching_texts = ""

    with open(input_json_file_path, 'r') as file:
        file_elements = json.load(file)

    for element in file_elements:
        if text_to_match in element["text"]:
            matching_texts += " " + element["text"]

    return matching_texts
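
The matching above is case-sensitive, so searching for "vot" finds "voting" but not "Votes". The following sketch shows one possible case-insensitive variant, run against a small sample file. The function name `extract_matching_texts_ignoring_case` and the sample elements are hypothetical; real Unstructured output carries more fields per element, such as `metadata`.

```python
import json
import os
import tempfile

# Made-up sample in the same shape as the processed output: a JSON list of elements.
sample_elements = [
    {"type": "NarrativeText", "element_id": "a1", "text": "The Votes shall be counted."},
    {"type": "NarrativeText", "element_id": "b2", "text": "Citizens may vote at age eighteen."},
    {"type": "Title", "element_id": "c3", "text": "Article I"},
]


def extract_matching_texts_ignoring_case(input_json_file_path: str, text_to_match: str) -> str:
    """Join the text of every element that contains text_to_match, ignoring case."""
    needle = text_to_match.lower()

    with open(input_json_file_path, "r", encoding="utf-8") as file:
        file_elements = json.load(file)

    # Elements without a "text" field are treated as empty rather than raising KeyError.
    texts = (element.get("text", "") for element in file_elements)
    return " ".join(text for text in texts if needle in text.lower())


# Write the sample to a temporary file, then search it.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as file:
        json.dump(sample_elements, file)

    print(extract_matching_texts_ignoring_case(sample_path, "vot"))
    # → The Votes shall be counted. Citizens may vote at age eighteen.
```

Unlike the notebook's function, this variant joins the texts with single spaces and returns no leading space.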

## Step 8: Provide a function to log in to your Hugging Face account

---

In [None]:
def log_in_to_hugging_face(email: str, passwd: str, cookie_dir_path: str) -> hugchat.ChatBot:
    sign = Login(
        email=email,
        passwd=passwd
    )

    cookies = sign.login(cookie_dir_path=cookie_dir_path)

    return hugchat.ChatBot(cookies=cookies.get_dict())

## Step 9: Process the PDF and chat about it with HuggingChat

---

This code:

1. Sends the PDF to Unstructured for processing. Unstructured then sends the processed data back, and it is saved as a JSON file.
2. Logs in to your Hugging Face account.
3. Gathers all texts from the processed data that contain the substring "vot", which matches words such as "vote", "voted", and "voting".
4. Sends the matching texts to HuggingChat along with a query about the texts, and then sends a follow-up query.
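
The matching texts can be long, so it may be worth trimming them before building the query. The sketch below is one possible approach and is not part of the original notebook: `build_prompt` is a hypothetical helper, and the default budget of 4,000 characters is an arbitrary assumption, not a documented HuggingChat limit.

```python
def build_prompt(question: str, context: str, max_context_chars: int = 4000) -> str:
    """Combine a question with context, trimming the context to a character budget."""
    # Collapse runs of whitespace so the budget is not spent on padding.
    context = " ".join(context.split())

    if len(context) > max_context_chars:
        # Cut at the last space within the budget so that no word is split.
        cut = context.rfind(" ", 0, max_context_chars + 1)
        context = context[: cut if cut > 0 else max_context_chars].rstrip()

    return f"{question} {context}"


print(build_prompt("What is the minimum voting age?", "  aaa   bbb ccc ", max_context_chars=7))
# → What is the minimum voting age? aaa bbb
```

The trimmed prompt could then be passed to `chatbot.chat(text=...)` in place of the f-string used in the cell that follows.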

In [None]:
generate_json_from_local(
    input_path="constitution.pdf",
    output_dir=".",
    parition_by_api=True,
    api_key=userdata.get("UNSTRUCTURED_API_KEY"),
    partition_endpoint=userdata.get("UNSTRUCTURED_API_URL")
)

chatbot = log_in_to_hugging_face(
    email=userdata.get("HUGGING_FACE_EMAIL"),
    passwd=userdata.get("HUGGING_FACE_PASSWORD"),
    cookie_dir_path="./cookies/"
)

voting_texts = extract_matching_texts_from_local(
    input_json_file_path="constitution.pdf.json",
    text_to_match="vot"
)

print("\n-----\n")
print("Querying HuggingChat...")
print("\n-----\n")

req = f"Given the following information, what is the minimum voting age in the United States? {voting_texts}"
print(req)
print("\n-----\n")
print(chatbot.chat(text=req))

print("\n-----\n")
print("Querying HuggingChat again...")
print("\n-----\n")

follow_up = "And when were women given the right to vote in the United States?"
print(follow_up)
print("\n-----\n")

print(chatbot.chat(text=follow_up))

2024-08-23 17:22:03,114 MainProcess INFO     Created index with configs: {"input_path": "constitution.pdf", "recursive": false}, connection configs: {"access_config": "**********"}
2024-08-23 17:22:03,117 MainProcess INFO     Created download with configs: {"download_dir": null}, connection configs: {"access_config": "**********"}
2024-08-23 17:22:03,120 MainProcess INFO     Created partition with configs: {"strategy": "auto", "ocr_languages": null, "encoding": null, "additional_partition_args": {"split_pdf_page": true, "split_pdf_allow_failed": true, "split_pdf_concurrency_level": 15}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructured.io/general/v0/general", "partition_by_api": true, "api_key": "*******", "hi_res_model_name": null}
2024-08-23 17:22:03,123 MainProcess INFO     Created upload with configs: {"


-----

Querying HuggingChat...

-----

Given the following information, what is the minimum voting age in the United States?  Every Bill which shall have passed the House of Represen- tatives and the Senate, shall, before it become a Law, be presented to the President of the United States; If he ap- prove he shall sign it, but if not he shall return it, with his Objections to that House in which it shall have originated, who shall enter the Objections at large on their Journal, and proceed to reconsider it. If after such Reconsideration two thirds of that House shall agree to pass the Bill, it shall be sent, together with the Objections, to the other House, by which it shall likewise be reconsidered, and if approved by two thirds of that House, it shall become a Law. But in all such Cases the Votes of both Houses shall be determined by Yeas and Nays, and the Names of the Persons voting for and against the Bill shall be entered on the Journal of each House respectively, If any Bill sha