# Use `Uniflow` to Extract PDF and Ingest into OpenSearch

### Before running the code

You will need a `uniflow` conda environment to run this notebook. You can set it up by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [AWS CLI profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to run the code. Set one up by running `aws configure --profile <profile_name>` in your terminal and entering your AWS Access Key ID and AWS Secret Access Key when prompted. Both are available in the [Security Credentials](https://console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials) section of the AWS console.

```bash
$ aws configure --profile <profile_name>
AWS Access Key ID [None]: <your_access_key_id>
AWS Secret Access Key [None]: <your_secret_access_key>
Default region name [None]: us-west-2
Default output format [None]: json
```

Make sure to set `Default output format` to `json`.
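To confirm that the profile was written, one option is to inspect the shared credentials file directly. The sketch below assumes the default location `~/.aws/credentials` and the key names that `aws configure` writes; `aws_profile_exists` is a hypothetical helper for illustration, not part of Uniflow or the AWS CLI.

```python
import configparser
from pathlib import Path


def aws_profile_exists(profile_name, credentials_path=None):
    """Return True if `profile_name` has both access keys in the credentials file."""
    # Assumption: credentials live in the default ~/.aws/credentials location.
    if credentials_path is None:
        credentials_path = Path.home() / ".aws" / "credentials"
    # Disable interpolation so that '%' characters in a secret are never parsed.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(credentials_path)  # a missing file simply yields no sections
    if not parser.has_section(profile_name):
        return False
    section = parser[profile_name]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section
```

For example, `aws_profile_exists("default")` checks the profile name used later in this notebook.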

> Note: If the AWS CLI is not installed, you will get a `command not found: aws` error. Follow the installation instructions [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to install it.

Before you run the code, set up the resources and environment variables using `setup_resources.ipynb`.

Then check the `.env` file to confirm that the environment variables are set correctly.
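A quick programmatic check can complement reading the `.env` file by eye. The sketch below lists variables that are unset or empty. The variable names are the ones this notebook reads with `os.getenv`; whether `setup_resources.ipynb` writes exactly this set is an assumption, and `missing_env_vars` is a hypothetical helper.

```python
import os

# Names read via os.getenv elsewhere in this notebook.
REQUIRED_VARS = ["OPENSEARCH_URL", "ES_USERNAME", "ES_PASSWORD", "S3_BUCKET", "S3_PREFIX"]


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` after `load_dotenv()` should return an empty list if everything is in place.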

### Import dependencies
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import json
import os
import pprint
import tempfile
import warnings
from pathlib import Path
from typing import Any, Dict, List, Optional

from dotenv import load_dotenv

from uniflow.flow.client import ExtractClient
from uniflow.flow.config import ExtractImageConfig
from uniflow.op.model.model_config import LayoutModelConfig

from utils.aws_session import AWSSession
from utils.bedrock_client import BedrockEmbeddingClient
from utils.es_client import ElasticSearchClient
from utils.s3_client import S3Client

load_dotenv()


True

### Install Extra Libraries

In [3]:
!{sys.executable} -m pip install -q boto3
!{sys.executable} -m pip install -q easyocr
!{sys.executable} -m pip install -q pdf2image
!{sys.executable} -m pip install -q onnxruntime
!{sys.executable} -m pip install -q opensearch-py
!{sys.executable} -m pip install -q onnxruntime-gpu
!{sys.executable} -m pip install -q requests-aws4auth

In [4]:
from pdf2image import convert_from_path

### Useful Functions
Below, we define helper functions that convert a PDF into images and then extract text from those images.

The `convert_pdf_to_pngs` function converts a PDF file into a series of PNG images, with one image per page of the PDF. 

Here's a breakdown of the function parameters:

- `pdf_file`: This is the path to the PDF file that you want to convert.
- `dest_dir`: This is the directory where the PNG images will be saved.
- `dpi`: This is the resolution in dots per inch. The default value is 200.
- `fmt`: This is the format of the output images. The default value is "png".

In [5]:
def convert_pdf_to_pngs(pdf_file, dest_dir, dpi=200, fmt="png"):
    """
    Convert a PDF file to a directory of PNG images.
    """
    os.makedirs(dest_dir, exist_ok=True)
    images = convert_from_path(
        pdf_file,
        dpi=dpi,
        output_folder=dest_dir,
        fmt=fmt,
        output_file=f"{Path(pdf_file).stem}-",
    )

    return dest_dir
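To get a feel for the `dpi` parameter, the following sketch estimates the pixel size of a rendered page as inches times dots per inch. The US Letter page size is an assumption chosen for illustration, and the renderer may round slightly differently; `page_size_in_pixels` is a hypothetical helper.

```python
def page_size_in_pixels(width_in, height_in, dpi=200):
    """Approximate rendered size: each inch of the page becomes `dpi` pixels."""
    return round(width_in * dpi), round(height_in * dpi)


# A US Letter page (8.5 x 11 inches) at the default 200 dpi.
print(page_size_in_pixels(8.5, 11))  # (1700, 2200)
```

Higher values give the OCR step more detail to work with, at the cost of larger files and slower processing.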

The `extract_image_pipeline` function is designed to extract text from an image. 

In [6]:
def extract_image_pipeline(local_file_path):
    """
    Extracts text from an image.

    Args:
        local_file_path (str): The path to the local image file.

    Returns:
        list: The extraction output returned by `ExtractClient.run`.
    """
    data = [
        {"filename": local_file_path},
    ]

    config = ExtractImageConfig(model_config=LayoutModelConfig())
    layout_client = ExtractClient(config)
    output = layout_client.run(data)

    return output
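The ingestion loop later in this notebook reads the text blocks with `extract_result[0]["output"][0]["text"]`. The sketch below flattens a result of that shape into a single list. The nested structure is inferred only from that indexing expression, not from the Uniflow documentation, and `texts_from_extract_output` and the sample data are hypothetical.

```python
def texts_from_extract_output(output):
    """Collect every text block from a nested extraction result."""
    texts = []
    for item in output:
        # Assumed shape: each item holds an "output" list of {"text": [...]} dicts.
        for result in item.get("output", []):
            texts.extend(result.get("text", []))
    return texts


# Hand-written sample mimicking the assumed shape.
sample_output = [{"output": [{"text": ["First paragraph.", "Second paragraph."]}]}]
print(texts_from_extract_output(sample_output))
```

Using `.get` with defaults keeps the helper from raising on pages where no text was detected.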

The next code block initializes several clients that interact with different AWS services.

1. **AWSSession**: The `AWSSession` object is initialized with a dictionary containing the AWS profile name. The `session` attribute of this object is then accessed to create a session that can be used to authenticate with AWS services.

2. **S3Client**: The `S3Client` object is initialized with the AWS session and a dictionary containing the AWS region. This client can be used to interact with the S3 service in the specified region.

3. **ElasticSearchClient (OpenSearch Client)**: The `ElasticSearchClient` object is initialized with the AWS session and a dictionary containing the OpenSearch URL, AWS region, and OpenSearch username and password. These are fetched from environment variables. This client can be used to interact with the OpenSearch service.

4. **BedrockEmbeddingClient**: The `BedrockEmbeddingClient` object is initialized with the AWS session and a dictionary containing the model ID and an empty dictionary for model kwargs. This client can be used to interact with the Bedrock service for embedding tasks.

Each of these clients is used to interact with a different AWS service, and they are all authenticated using the same AWS session.

In [7]:
aws_session = AWSSession(
    {
        "aws_profile": "default",
    }
).session
custom_s3_client = S3Client(
    aws_session,
    {
        "aws_region": "us-west-2",
    },
)
opensearch_client = ElasticSearchClient(
    aws_session,
    {
        "opensearch_url": os.getenv("OPENSEARCH_URL"),
        "aws_region": "us-west-2",
        "es_username": os.getenv("ES_USERNAME"),
        "es_password": os.getenv("ES_PASSWORD"),
    },
)
bedrock_client = BedrockEmbeddingClient(
    aws_session,
    {
        "model_id": "amazon.titan-embed-text-v1",
        "model_kwargs": {},
    },
)
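The internals of `BedrockEmbeddingClient` are not shown in this notebook. As a rough sketch, assuming it wraps the boto3 `bedrock-runtime` client's `invoke_model` call, an embedding request for a Titan text embedding model could look like the code below. `titan_embed` and `FakeBedrockRuntime` are hypothetical names, and the fake client exists only so the example runs offline.

```python
import io
import json


def titan_embed(bedrock_runtime, text, model_id="amazon.titan-embed-text-v1"):
    """Request an embedding for `text` and return it as a list of floats."""
    response = bedrock_runtime.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
        accept="application/json",
        contentType="application/json",
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    return payload["embedding"]


class FakeBedrockRuntime:
    """Offline stand-in that mimics the response shape (demonstration only)."""

    def invoke_model(self, modelId, body, accept, contentType):
        text = json.loads(body)["inputText"]
        fake = {"embedding": [float(len(text)), 0.0], "inputTextTokenCount": 1}
        return {"body": io.BytesIO(json.dumps(fake).encode("utf-8"))}


print(titan_embed(FakeBedrockRuntime(), "abc"))  # [3.0, 0.0]
```

With real credentials, the first argument would be something like `aws_session.client("bedrock-runtime")` instead of the fake.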

### Demo 1: Ingest One File into OpenSearch

As the first step, we download the PDF file from the S3 bucket.

In [8]:
s3_bucket = os.getenv("S3_BUCKET")
s3_prefix = os.getenv("S3_PREFIX")

sample_files_dir = "es_sample_files"
pdf_dir = os.path.join(sample_files_dir, "pdf")
local_file_path = custom_s3_client.download_file_from_s3(s3_bucket, s3_prefix, pdf_dir)

Downloading file from S3 to es_sample_files/pdf/nike-paper.pdf
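The implementation of `download_file_from_s3` is not shown here. A minimal sketch of an equivalent step with a plain boto3 S3 client might look like the following, assuming `S3_PREFIX` holds the full object key of the PDF. `local_path_for_key` and `download_pdf` are hypothetical helpers.

```python
import os
import posixpath


def local_path_for_key(s3_key, dest_dir):
    """Map an S3 object key to a file path inside `dest_dir`."""
    # S3 keys always use "/" as the separator, so split with posixpath.
    return os.path.join(dest_dir, posixpath.basename(s3_key))


def download_pdf(s3, bucket, s3_key, dest_dir):
    """Download one object with a boto3 S3 client and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = local_path_for_key(s3_key, dest_dir)
    s3.download_file(bucket, s3_key, local_path)  # boto3: Bucket, Key, Filename
    return local_path
```

Here `s3` would be created with something like `aws_session.client("s3")`.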


Then, we convert the file to a series of PNG images and upload them to the S3 bucket. 

In [9]:
png_dir = os.path.join(sample_files_dir, "png", Path(local_file_path).stem)
convert_pdf_to_pngs(local_file_path, png_dir)
s3_png_prefix = os.path.join("uniflow-es-sample", "png", Path(local_file_path).stem)
s3_png_paths = custom_s3_client.upload_files_to_s3(s3_bucket, f"{s3_png_prefix}/", png_dir)

Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-14.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-06.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-09.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-13.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-12.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-05.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-01.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-10.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-04.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-03.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-07.png
Uploading file to S3 from es_sample_files/png/nike-paper/nike-paper-0001-08.png
Uploading file to S3 from es_sample_file
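One portability detail: the cell above builds the S3 prefix with `os.path.join`, which uses backslashes on Windows, while S3 keys use forward slashes. A small sketch of a platform-independent alternative is shown below; `s3_key_for` is a hypothetical helper, not part of the notebook's `S3Client`.

```python
import posixpath
from pathlib import Path


def s3_key_for(prefix, local_file):
    """Build an S3 object key from a key prefix and a local file path."""
    # posixpath.join always joins with "/", regardless of the operating system.
    return posixpath.join(prefix, Path(local_file).name)


key = s3_key_for(
    "uniflow-es-sample/png/nike-paper",
    "es_sample_files/png/nike-paper/nike-paper-0001-01.png",
)
print(key)  # uniflow-es-sample/png/nike-paper/nike-paper-0001-01.png
```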

Finally, we ingest the file into OpenSearch.

In [None]:
index_name = "uniflow-es-sample-index"

# Sort the filenames so that ids follow page order (os.listdir order is arbitrary).
for file_id, file in enumerate(sorted(os.listdir(png_dir))):
    data = []
    local_png_path = os.path.join(png_dir, file)
    extract_result = extract_image_pipeline(local_png_path)
    texts = extract_result[0]["output"][0]["text"]
    for text in texts:
        print(text, local_png_path)
        embedding = bedrock_client.get_text_embedding(text)
        metadata = {
            "s3_path": f"s3://{s3_bucket}/{s3_png_prefix}/{file}",
            "filename": file,
        }
        one_page_data = {
            "id": file_id,
            "metadata": metadata,
            "text": text,
            "vector_field": embedding,
        }
        data.append(one_page_data)

    # Write this page's records to Amazon OpenSearch
    opensearch_client.bulk_ingest_elasticsearch(index_name, data)
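How `bulk_ingest_elasticsearch` sends the records is not shown in this notebook. One plausible approach, sketched below as an assumption, is to wrap each record in a bulk action and pass the list to `opensearchpy.helpers.bulk`. `build_bulk_actions` is a hypothetical helper. Note that no `_id` is set, which matches the auto-generated document ids seen in the search results.

```python
def build_bulk_actions(index_name, docs):
    """Wrap each record in the action format expected by bulk helpers."""
    return [{"_index": index_name, "_source": doc} for doc in docs]


docs = [{"id": 0, "metadata": {"filename": "page.png"}, "text": "hello", "vector_field": [0.1, 0.2]}]
actions = build_bulk_actions("uniflow-es-sample-index", docs)
print(actions[0]["_index"])  # uniflow-es-sample-index

# With a configured opensearch-py client, the actions could then be sent with:
#   from opensearchpy import helpers
#   helpers.bulk(client, actions)
```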

### Test the Ingestion
Once ingestion finishes, we can verify it by querying the OpenSearch index.

In [11]:
embedding = bedrock_client.get_text_embedding(
    "How much a runner’s marathon time can be expected to improve after switching to Vaporfly shoes?"
)
search_result = opensearch_client.knn_search(index_name, embedding)
for hit in search_result:
    print(hit["_index"])
    print(hit["_id"])
    print(hit["_score"])
    print(hit["_source"]["text"])
    print(hit["_source"]["metadata"]["filename"])

uniflow-es-sample-index
klhAMo0B0g7dU03EYF6r
0.8391172
By collecting data on marathon times and identifying shoes WOrn in a systematic sample of elite and sub-elite marathon runners , we studied how much a runner s marathon time can be lexpected to improve after switching to Vaporfly shoes: For men the improvement is most likely somewhere between 2.0 and 3.9 minutes, o1' between 1.4% and 2.8%. For women it is likely between 0.8 and 3.5 minutes; or between 0.6% and 2.2%. To these numbers into perspective,elite marathon runners cover more than half a mile in 3 minutes. put
nike-paper-0001-08.png
uniflow-es-sample-index
l1hAMo0B0g7dU03EYF6r
0.8216595
It is possible that athletes are more likely to switch to Vaporfly shoes when they know are ready to turn in marathon_performance Inversely some athletes might not be they good
nike-paper-0001-08.png
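For reference, a k-NN search like the one above is typically expressed as an OpenSearch query body of the following form. This is a sketch of what `knn_search` might send, not its confirmed implementation. The field name `vector_field` comes from the ingestion step, `k=2` is an assumption based on the two hits shown, and `build_knn_query` is a hypothetical helper.

```python
def build_knn_query(vector, k=2, field="vector_field"):
    """Build an OpenSearch k-NN query body for the given embedding."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


query = build_knn_query([0.1, 0.2, 0.3])
print(query["query"]["knn"]["vector_field"]["k"])  # 2

# With an opensearch-py client, the hits printed above would come from:
#   client.search(index=index_name, body=query)["hits"]["hits"]
```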


## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>
