# Data Prep Kit - Hello World (Ray)

This notebook guides you through running your first data preparation transformation using the data-prep-kit. In this example, we will demonstrate a transformation that takes PDF files as input and extracts their content.

This notebook uses the Ray framework. For a pure Python version, see [Run_your_first_transform_python.ipynb](Run_your_first_transform_python.ipynb).

[Ray](https://docs.ray.io/en/latest/index.html) is a framework that enables parallelization while still running efficiently on a local machine, such as your laptop.

## Step-1: Setting up Python Dev Environment

Please follow the instructions in the [Getting started section](../../README.md#gettingstarted) to set up your Python development environment.

## Step-2: Get Data

For this example, we will show the PDF processing capabilities of DPK. We will download and use these PDF documents:

- [IBM Granite model](https://arxiv.org/abs/2405.04324)
- [Attention is all you need](https://arxiv.org/pdf/1706.03762) - seminal paper on transformer/attention architecture

The code below will download the PDFs. Feel free to try your own PDFs as well.

In [1]:
INPUT_DIR = 'Input-Test-Data'
OUTPUT_DIR = 'Output-Test-Data'

In [2]:
## This cell will download the input files

import os
import shutil
import requests
from humanfriendly import format_size

def download_file(url, local_file, chunk_size=1024*1024):
    # Check if the local file already exists
    if os.path.exists(local_file):
        file_size = format_size(os.path.getsize(local_file))
        print(f"Local file '{local_file}' ({file_size}) already exists. Skipping download.")
        return

    # Create the directory if it doesn't exist
    os.makedirs(os.path.dirname(local_file), exist_ok=True)

    # Stream the file download
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_file, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)
        print()
        file_size = format_size(os.path.getsize(local_file))
        print(f"{local_file} ({file_size}) downloaded successfully.")
## --- end: download_file ------

## setup input/output directories
os.makedirs(INPUT_DIR, exist_ok=True)
shutil.rmtree(OUTPUT_DIR, ignore_errors=True)  # start from a clean output folder
os.makedirs(OUTPUT_DIR, exist_ok=True)

## Download PDF files
download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(INPUT_DIR, 'Granite_code_models.pdf' ))
download_file (url = 'https://arxiv.org/pdf/1706.03762', local_file = os.path.join(INPUT_DIR, 'attention_is_all_you_need.pdf' ))

Local file 'Input-Test-Data/Granite_code_models.pdf' (1.27 MB) already exists. Skipping download.
Local file 'Input-Test-Data/attention_is_all_you_need.pdf' (2.22 MB) already exists. Skipping download.
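Before running the transform, it can be useful to confirm that each downloaded file really is a PDF (a proxy or redirect could, for example, save an HTML page under a `.pdf` name). The sketch below is not part of DPK; `looks_like_pdf` and `find_non_pdfs` are hypothetical helpers that check for the `%PDF-` signature at the start of each file.

```python
import os

PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def find_non_pdfs(folder):
    """Return sorted names of '.pdf' files in `folder` that lack the PDF signature."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
        and not looks_like_pdf(os.path.join(folder, name))
    )
```

In this notebook, `find_non_pdfs(INPUT_DIR)` would be expected to return an empty list when both downloads succeeded.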


## Step-3: Extract Text from PDF 

This step sets up a data transformation that extracts text from PDF files. The output is saved in Parquet format.

In [3]:
import sys
import ast

# Utilities from the data-prep-kit's data-processing-lib library provide functions and classes for parameter management.
from pdf2parquet_transform import (pdf2parquet_contents_type_cli_param, pdf2parquet_contents_types,)
from data_processing.utils import GB, ParamsUtils


ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

# num_cpus_available: Determines the number of CPUs to use for parallel processing. It's set to one-fourth of the total available CPUs on the machine.
# worker_options: Specifies the resources each worker will use, including the number of CPUs (num_cpus) and the memory (memory), set to 2 gigabytes (using a utility function GB).

num_cpus_available =  os.cpu_count()/4
worker_options = {"num_cpus" : num_cpus_available, "memory": 2 * GB}

#local_conf: A dictionary specifying the local input and output folders where the PDF files will be read from and the transformed data will be saved.
local_conf = {
    "input_folder": INPUT_DIR,
    "output_folder": OUTPUT_DIR,
}

#params: A dictionary containing various runtime parameters for the transformation.
#run_locally: A flag indicating that the transformation should run locally.
#data_local_config: Configuration for local data access, such as input and output folders, converted into a format compatible with the transformation using ParamsUtils.convert_to_ast.
#data_files_to_use: Specifies that only PDF files (['.pdf']) will be used as input data.
#runtime_worker_options: Specifies worker configuration options for Ray, taken from worker_options above (CPUs per worker and two gigabytes of memory per worker).
#runtime_num_workers: Number of workers to be used for the transformation.
#runtime_pipeline_id and runtime_job_id: Identifiers for the pipeline and job, respectively.
#runtime_code_location: Optional metadata about the code location, such as its repository and commit details, formatted with ParamsUtils.convert_to_ast. It is not set in this example.

params = {
    "run_locally": True,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 2,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
}
sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
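The last line of the cell replaces `sys.argv`, so the launcher later reads its configuration as if it had been passed on the command line. As a rough illustration only, the sketch below assumes each dictionary entry becomes one `--key=value` argument. `dict_to_argv` is a hypothetical stand-in, not the `ParamsUtils.dict_to_req` implementation, and the real argument format may differ.

```python
def dict_to_argv(params, program=""):
    """Hypothetical stand-in: turn a dict into a list of '--key=value' arguments."""
    argv = [program]  # sys.argv[0] conventionally holds the program name
    for key, value in params.items():
        argv.append(f"--{key}={value}")
    return argv


# A few entries in the same shape as the `params` dictionary above
example = {
    "run_locally": True,
    "runtime_num_workers": 2,
    "data_files_to_use": [".pdf"],
}
print(dict_to_argv(example))
```

Note also that `params | ingest_config` uses the dictionary union operator, which requires Python 3.9 or later.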

### 3 - Execute

Now it's time to run the transformation.

This will launch a *local Ray cluster* to execute our code in parallel using 2 workers. You can view the Ray dashboard at the URL printed in the output below.

E.g. http://127.0.0.1:8265 

You will notice that the code downloads models to execute the transformation. These models are used to process the PDFs.

In [4]:
%%time 

from data_processing_ray.runtime.ray import RayTransformLauncher
from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration

launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())
return_code = launcher.launch()

if return_code == 0:
    print("✅ Transformation run completed successfully")
else:
    raise Exception ("❌ Transformation run failed")


00:34:57 INFO - Running locally
00:34:57 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': False}
00:34:57 INFO - data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data
00:34:57 INFO - data factory data_ max_files -1, n_sample -1
00:34:57 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
00:34:57 INFO - pipeline id pipeline_id
00:34:57 INFO - code location None
00:34:57 INFO - number of workers 2 worker options {'num_cpus': 4.0, 'memory': 2147483648, 'max_restarts': -1}
00:34:57 INFO - actor creation delay 0
00:34:57 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
2024-09-04 00:34:59,763	INFO worker.py:1744 -- Started a local Ray instanc

✅ Transformation run completed successfully
CPU times: user 511 ms, sys: 264 ms, total: 775 ms
Wall time: 1min 30s
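The log above reports 2 workers, each requesting 4.0 CPUs and 2147483648 bytes of memory. Given `num_cpus_available = os.cpu_count()/4`, that corresponds to a 16-CPU machine, with the two workers together asking for half of its CPUs. The sketch below reproduces this arithmetic; `worker_request` is a hypothetical helper, and `GB` is assumed to equal `1024**3` bytes, as the logged memory value suggests.

```python
import os

GB = 1024 ** 3  # assumption: consistent with the 2147483648 bytes logged for 2 * GB


def worker_request(total_cpus, num_workers=2, cpu_fraction=0.25, memory_gb=2):
    """Return per-worker and total resource requests for a local run."""
    per_worker_cpus = total_cpus * cpu_fraction
    return {
        "num_cpus": per_worker_cpus,
        "memory": memory_gb * GB,
        "total_cpus_requested": per_worker_cpus * num_workers,
    }


# os.cpu_count() can return None, so fall back to 1 in that case
total = os.cpu_count() or 1
print(worker_request(total))
```

On machines with few cores, the per-worker CPU request becomes fractional (for example 0.5 on a 2-core machine), so it is worth checking these numbers before raising the worker count.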


## Step-4: Inspect the generated output

We will use pandas to read the Parquet files and display them.

You should see one entry per input PDF file.

In [5]:
import glob
import pandas as pd

## Reads parquet files in a folder into a pandas dataframe 
def read_parquet_files_as_df (parquet_dir):
    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')

    # read each parquet file into a DataFrame and store in a list
    dfs = [pd.read_parquet (f) for f in parquet_files]

    # Concatenate all DataFrames into a single DataFrame
    data_df = pd.concat(dfs, ignore_index=True)
    return data_df
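
Note that `pd.concat` raises a `ValueError` when given an empty list, which is what happens if the output folder contains no `.parquet` files (for instance, because the transform failed). A minimal sketch of a guard, using only the standard library; `list_parquet_files` is a hypothetical helper, not part of DPK:

```python
import glob
import os


def list_parquet_files(parquet_dir):
    """Return sorted '.parquet' paths in `parquet_dir`, or raise if there are none."""
    files = sorted(glob.glob(os.path.join(parquet_dir, "*.parquet")))
    if not files:
        raise FileNotFoundError(f"No .parquet files found in '{parquet_dir}'")
    return files
```

Sorting the paths also makes the row order of the combined DataFrame reproducible across runs, since `glob.glob` does not guarantee any particular order.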

In [6]:
output_df = read_parquet_files_as_df(OUTPUT_DIR)
output_df

| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | attention_is_all_you_need.pdf | `{"_name":"","type":"pdf-document","description...` | 15 | 4 | 193 | 39684ed7-3033-47fb-ae8f-26f470932cd3 | pdf | fee309974aabb59c48dbfaeb011ee8a2c78f2e492747b9... | 131167 | 2024-09-04T00:35:41.547110 | 20.117517 | attention_is_all_you_need.pdf |
| 1 | Granite_code_models.pdf | `{"_name":"","type":"pdf-document","description...` | 28 | 17 | 320 | 204e53cb-c39f-4ea6-81c4-177adda365b3 | pdf | 254309b69f2010ecff8af0907d3ae643daba6e8dfa1250... | 584822 | 2024-09-04T00:36:16.451014 | 55.192019 | Granite_code_models.pdf |


In [7]:
# Inspect contents

import json
import pprint

column_list = output_df['contents'].tolist()
column_json = json.dumps(column_list, indent=4)
pprint.pprint(column_json[:2000]) # display the first 2000 characters

('[\n'
 '    '
 '"{\\"_name\\":\\"\\",\\"type\\":\\"pdf-document\\",\\"description\\":{\\"logs\\":[]},\\"file-info\\":{\\"filename\\":\\"attention_is_all_you_need.pdf\\",\\"document-hash\\":\\"bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697\\",\\"#-pages\\":15,\\"page-hashes\\":[{\\"hash\\":\\"8834a09ad99e9297886c9f8ad786c2784b7dc66dc6e6adfeff6bf2c1f07926d6\\",\\"model\\":\\"default\\",\\"page\\":1},{\\"hash\\":\\"72ded7022ad3cbfa9b5c4377a9c9b44511251f9489973956c23d2f3321e6307e\\",\\"model\\":\\"default\\",\\"page\\":2},{\\"hash\\":\\"38733274891513257d051950018621d95f73d05d5c70bfd7331def2f1194973d\\",\\"model\\":\\"default\\",\\"page\\":3},{\\"hash\\":\\"699ed16bf81021d0f86374d05c7b4b2b1049e63a28d2951ec1fb930747d755b9\\",\\"model\\":\\"default\\",\\"page\\":4},{\\"hash\\":\\"a17e6b313bdd51eff07a824253eff394d78ae1d6ebc985de3580bdfece38d2e1\\",\\"model\\":\\"default\\",\\"page\\":5},{\\"hash\\":\\"b3e9b63f2e8728fa83a5b7d911df2827585cf6040d2a4734cb3b44be264da6b6\\",\\"mo
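
Each value in the `contents` column is itself a JSON string, so calling `json.dumps` on the list escapes it a second time, which is why the output above is full of escaped quotes. Parsing each value with `json.loads` gives a dictionary that is easier to explore. The sketch below uses a shortened sample built only from the keys visible in the output above (the hash strings are placeholders, not real values), and `summarize_contents` is a hypothetical helper.

```python
import json

# Shortened sample with the same keys as the output above;
# the hash strings are placeholders, not real values.
sample_contents = json.dumps({
    "_name": "",
    "type": "pdf-document",
    "description": {"logs": []},
    "file-info": {
        "filename": "attention_is_all_you_need.pdf",
        "document-hash": "<placeholder>",
        "#-pages": 15,
        "page-hashes": [{"hash": "<placeholder>", "model": "default", "page": 1}],
    },
})


def summarize_contents(contents_json):
    """Parse one `contents` value and return a few fields from 'file-info'."""
    doc = json.loads(contents_json)
    info = doc.get("file-info", {})
    return {
        "type": doc.get("type"),
        "filename": info.get("filename"),
        "pages": info.get("#-pages"),
        "page_hashes": len(info.get("page-hashes", [])),
    }


print(summarize_contents(sample_contents))
```

With the real data, you would pass a value such as `output_df['contents'][0]` instead of `sample_contents`.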