# Data Prep Kit - Hello World (Ray)

This notebook guides you through running your first data preparation transformation using the data-prep-kit. In this example, we will demonstrate a transformation that takes PDF files as input and extracts their content.

[Ray](https://docs.ray.io/en/latest/index.html) is a framework for parallel and distributed computing. It scales out to clusters, yet also runs efficiently on a single local machine, such as your laptop.


**Notebook versions:**

- Pure python (run locally): [Run_your_first_transform_python.ipynb](Run_your_first_transform_python.ipynb)
- Ray version (run locally): this notebook
- Google Colab friendly notebook: [Run_your_first_transform_colab.ipynb](Run_your_first_transform_colab.ipynb)  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb)

## Step-1: Setting up Python Dev Environment

Follow the instructions in the [Getting started section](../../README.md#gettingstarted) to set up your Python development environment.

## Step-2: Inspect Data

This example demonstrates the PDF processing capabilities of Data Prep Kit (DPK). The sample PDF documents are in the [Input-Test-Data](Input-Test-Data) folder.

- [IBM Granite model paper](Input-Test-Data/GraniteCodePaper.pdf) ([arXiv link](https://arxiv.org/abs/2405.04324))
- [Attention is all you need paper](Input-Test-Data/attention_is_all_you_need.pdf) ([arXiv link](https://arxiv.org/pdf/1706.03762)), the seminal paper on the transformer/attention architecture


**Try your own PDFs:** You can drop any PDF documents into the [Input-Test-Data](Input-Test-Data) folder and run this code.

In [1]:
INPUT_DIR = 'Input-Test-Data'
OUTPUT_DIR = 'Output-Test-Data'
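
Before running the transform, it can help to confirm which PDFs are actually in the input folder, especially after dropping in your own files. The helper below is a minimal sketch using only the standard library; `list_pdfs` is a hypothetical name, not part of DPK, and its case-insensitive match on the `.pdf` suffix is a choice of this sketch that may differ from how the transform selects files.

```python
from pathlib import Path


def list_pdfs(input_dir):
    """Return the sorted names of PDF files located directly inside input_dir."""
    folder = Path(input_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Input folder not found: {folder}")
    # Case-insensitive suffix match (an assumption of this sketch).
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `list_pdfs(INPUT_DIR)` in a notebook cell should list the two sample papers plus any PDFs you added.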

## Step-3: Extract Text from PDF 

The code below configures a transformation that extracts text from PDF files. The output is saved in Parquet format.

In [2]:
import os, sys
import ast

# Content-type parameter name and allowed values for the pdf2parquet transform
from pdf2parquet_transform import pdf2parquet_contents_type_cli_param, pdf2parquet_contents_types
# Utilities from data-prep-kit's data-processing-lib, used here for parameter management
from data_processing.utils import GB, ParamsUtils


ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

# num_cpus_available: number of CPUs each worker may use, set to one-fourth of the CPUs on the machine.
# worker_options: resources for each worker, namely the number of CPUs (num_cpus) and the memory (memory), set to 2 gigabytes via the GB constant.

num_cpus_available = os.cpu_count() / 4
worker_options = {"num_cpus": num_cpus_available, "memory": 2 * GB}

# local_conf: local input and output folders, i.e. where the PDF files are read from and where the transformed data is saved.
local_conf = {
    "input_folder": INPUT_DIR,
    "output_folder": OUTPUT_DIR,
}

# params: runtime parameters for the transformation.
# run_locally: flag indicating that the transformation should run locally.
# data_local_config: local data access configuration (input and output folders), converted with ParamsUtils.convert_to_ast into the format the launcher expects.
# data_files_to_use: restricts the input to PDF files (['.pdf']).
# runtime_worker_options: Ray worker configuration (see worker_options above: a quarter of the CPUs and 2 gigabytes of memory per worker).
# runtime_num_workers: number of workers used for the transformation.
# runtime_pipeline_id and runtime_job_id: identifiers for the pipeline and the job, respectively.
# runtime_code_location (not set in this example): metadata about the code location, such as repository and commit details, also formatted with ParamsUtils.convert_to_ast.

params = {
    "run_locally": True,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 2,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
}
sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
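
The last line of the cell merges the two dictionaries and hands the result to the launcher through `sys.argv`, as if the transform had been started from the command line. The notebook does not show what `ParamsUtils.dict_to_req` returns, so the sketch below only illustrates the general idea under the assumption of a plain `--key=value` form; `dict_to_cli_args` is a hypothetical helper, and the real function may format its output differently.

```python
def dict_to_cli_args(params):
    """Hypothetical helper: render each dict entry as a '--key=value' string."""
    return [f"--{key}={value}" for key, value in params.items()]


# Illustrative values only, loosely modeled on the params dictionary above.
runtime_params = {"run_locally": True, "runtime_num_workers": 2}
transform_params = {"data_files_to_use": [".pdf"]}

# The | operator (Python 3.9+) merges two dicts; on duplicate keys the right side wins.
merged = runtime_params | transform_params
cli_args = dict_to_cli_args(merged)
print(cli_args)
# ['--run_locally=True', '--runtime_num_workers=2', "--data_files_to_use=['.pdf']"]
```

Note that the `params | ingest_config` expression in the cell relies on the dict union operator, which requires Python 3.9 or later.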

### 3 - Execute

Now it's time to run the transformation.

This launches a *local Ray cluster* that executes our code in parallel using 2 workers. You can view the Ray dashboard at the URL printed in the output below.

E.g. http://127.0.0.1:8265

You will notice that the code downloads models, which it then uses to process the PDFs.
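
The worker settings chosen in Step-3 determine how much of the machine the local Ray cluster asks for. The arithmetic is sketched below as a rough illustration: `plan_workers` is a hypothetical helper, `GB` is assumed to be 1024³ bytes (consistent with the `2147483648` reported for `2 * GB` in the sample log further down), and the 16-CPU figure is only an example that reproduces the `4.0` CPUs per worker seen in that log.

```python
import os

GB = 1024 ** 3  # assumed value in bytes; 2 * GB == 2147483648


def plan_workers(num_workers, total_cpus=None, cpu_fraction=0.25, memory_per_worker=2 * GB):
    """Estimate the total CPU and memory requested by all workers together."""
    if total_cpus is None:
        # os.cpu_count() can return None when the count is undeterminable.
        total_cpus = os.cpu_count() or 1
    cpus_per_worker = total_cpus * cpu_fraction
    return {
        "cpus_per_worker": cpus_per_worker,
        "total_cpus_requested": cpus_per_worker * num_workers,
        "total_memory_bytes": memory_per_worker * num_workers,
    }


plan = plan_workers(num_workers=2, total_cpus=16)
print(plan)
# {'cpus_per_worker': 4.0, 'total_cpus_requested': 8.0, 'total_memory_bytes': 4294967296}
```

With two workers at a quarter of the CPUs each, the run requests about half of the machine's CPUs and 4 GB of memory in total, which leaves headroom for the notebook itself.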

In [3]:
%%time 

from data_processing_ray.runtime.ray import RayTransformLauncher
from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration

launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())
return_code = launcher.launch()

if return_code == 0:
    print("✅ Transformation run completed successfully")
else:
    raise Exception("❌ Transformation run failed")


13:28:13 INFO - Running locally
13:28:13 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': False}
13:28:13 INFO - data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data
13:28:13 INFO - data factory data_ max_files -1, n_sample -1
13:28:13 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
13:28:13 INFO - pipeline id pipeline_id
13:28:13 INFO - code location None
13:28:13 INFO - number of workers 2 worker options {'num_cpus': 4.0, 'memory': 2147483648, 'max_restarts': -1}
13:28:13 INFO - actor creation delay 0
13:28:13 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
2024-09-04 13:28:16,713	INFO worker.py:1744 -- Started a local Ray instanc

✅ Transformation run completed successfully
CPU times: user 485 ms, sys: 281 ms, total: 766 ms
Wall time: 1min 30s


## Step-4: Inspect the generated output

We will use pandas to read the Parquet files and display their contents.

You should see one entry per input PDF file.

In [4]:
import glob
import pandas as pd

def read_parquet_files_as_df(parquet_dir):
    """Read all Parquet files in a folder into a single pandas DataFrame."""
    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')

    # Read each Parquet file into a DataFrame and store them in a list
    dfs = [pd.read_parquet(f) for f in parquet_files]

    # Concatenate all DataFrames into a single DataFrame
    data_df = pd.concat(dfs, ignore_index=True)
    return data_df

In [5]:
output_df = read_parquet_files_as_df(OUTPUT_DIR)
output_df

| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GraniteCodePaper.pdf | {"_name":"","type":"pdf-document","description... | 28 | 17 | 320 | 6769e793-6e20-4dfc-8fe6-12ecc1944aac | pdf | d1f30543fce4e18d223cf4ba7728733e169201b4df5d5d... | 584819 | 2024-09-04T13:29:32.380832 | 53.872901 | GraniteCodePaper.pdf |
| 1 | attention_is_all_you_need.pdf | {"_name":"","type":"pdf-document","description... | 15 | 4 | 193 | 20925bb2-4e14-4e58-8ae1-0ce971dfc2bf | pdf | fee309974aabb59c48dbfaeb011ee8a2c78f2e492747b9... | 131167 | 2024-09-04T13:28:57.169495 | 19.068189 | attention_is_all_you_need.pdf |
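
The expectation of one output entry per input PDF can be checked programmatically rather than by eye. The sketch below is one possible way to do that with the standard library; `missing_outputs` is a hypothetical helper, and it assumes the `filename` column holds the bare file name of each input PDF, as in the output shown above.

```python
from pathlib import Path


def missing_outputs(input_dir, output_filenames):
    """Return the names of input PDFs that have no matching output entry."""
    produced = set(output_filenames)
    expected = sorted(p.name for p in Path(input_dir).glob("*.pdf"))
    return [name for name in expected if name not in produced]
```

In the notebook, `missing_outputs(INPUT_DIR, output_df["filename"])` should return an empty list when every PDF was converted.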


In [6]:
# Inspect contents

import json
import pprint

column_list = output_df['contents'].tolist()
column_json = json.dumps(column_list, indent=4)
pprint.pprint(column_json[:2000])  # display the first 2000 characters

('[\n'
 '    '
 '"{\\"_name\\":\\"\\",\\"type\\":\\"pdf-document\\",\\"description\\":{\\"logs\\":[]},\\"file-info\\":{\\"filename\\":\\"GraniteCodePaper.pdf\\",\\"document-hash\\":\\"cd6f480d8c03baf8fc4fc5a874dfa726764028fd6fb032532b6807522da6df48\\",\\"#-pages\\":28,\\"page-hashes\\":[{\\"hash\\":\\"2683abedc4e9b0a7ecffbba4e696db4cdeeaaa770de105f14b46cf7d0fef935a\\",\\"model\\":\\"default\\",\\"page\\":1},{\\"hash\\":\\"6b67e3f7d141634b469b81edae327289f954805104902cb51eced075585ecdf5\\",\\"model\\":\\"default\\",\\"page\\":2},{\\"hash\\":\\"d6c88ae80a41c48e99bacf83121051d01010c0edd5489fbc1c34b1f11f7d45c8\\",\\"model\\":\\"default\\",\\"page\\":3},{\\"hash\\":\\"a342fd3b69659664f36d83fba38be16f7a78bd8fa31cc3216c884a856c42c842\\",\\"model\\":\\"default\\",\\"page\\":4},{\\"hash\\":\\"bd8ab5e38ea21b04545dd2c9a73cd82d17e6e732f497125aca4e9d8468891e7d\\",\\"model\\":\\"default\\",\\"page\\":5},{\\"hash\\":\\"b24778e5dd56e510671b10b05e099b3ce72178a6a49cc6241dcf52ac1aaea27a\\",\\"model\\":\\
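
The output above is hard to read because each `contents` value is already a JSON string, so `json.dumps` escapes it a second time. Decoding a value with `json.loads` gives a regular dictionary instead. The sketch below assumes the key names visible in the output (`type`, `file-info`, `filename`, `#-pages`); `summarize_document` is a hypothetical helper, and the sample string is a shortened stand-in for a real `contents` value.

```python
import json


def summarize_document(contents):
    """Decode one 'contents' JSON string and pull out a few descriptive fields."""
    doc = json.loads(contents)
    info = doc.get("file-info", {})
    return {
        "type": doc.get("type"),
        "filename": info.get("filename"),
        "pages": info.get("#-pages"),
        "top_level_keys": sorted(doc),
    }


# Shortened sample shaped like the output above (not a complete document).
sample = (
    '{"_name":"","type":"pdf-document","description":{"logs":[]},'
    '"file-info":{"filename":"GraniteCodePaper.pdf","#-pages":28}}'
)
print(summarize_document(sample))
# {'type': 'pdf-document', 'filename': 'GraniteCodePaper.pdf', 'pages': 28,
#  'top_level_keys': ['_name', 'description', 'file-info', 'type']}
```

In the notebook, `output_df["contents"].apply(summarize_document)` would produce one such summary per document.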