This notebook guides you through running your first data preparation transformation with the data-prep-kit. The example transformation takes PDF files as input, extracts their content, and writes the result to Parquet files. It runs on Ray, which parallelizes the work while still running efficiently on a local machine such as a laptop.





In [21]:
#This cell sets up a transformation that converts PDF files to Parquet using a parallelized approach.
# The transformation can run both locally and in a distributed environment using Ray, a framework for 
# parallel and distributed computing.

#Import necessary libraries
import ast
import os
import shutil
import sys

# The pdf2parquet modules provide the transform configurations and the contents-type parameter names.
# The data-prep-kit's data-processing-lib library provides utilities for parameter management and memory sizes,
# and launchers for running transformations either in pure Python or using Ray.
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration
from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration
from pdf2parquet_transform import (pdf2parquet_contents_type_cli_param, pdf2parquet_contents_types,)
from data_processing.utils import GB, ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from data_processing.runtime.pure_python import PythonTransformLauncher


#Set the configuration parameters for the run. These control how the work is parallelized.

#num_cpus_available: The number of CPUs each worker may use, set to one-fourth of the CPUs available on the machine. Because / is true division, the value is a float.
#worker_options: Specifies the resources each worker will use, including the number of CPUs (num_cpus) and the memory (memory), set to 2 gigabytes using the GB constant. Note that params below builds its own worker options instead of using this dictionary.
#code_location: Contains metadata about the code location, such as the GitHub repository, commit hash, and file path.
#ingest_config: Selects the format of the extracted contents, set to JSON using names from the pdf2parquet_transform module.

num_cpus_available =  os.cpu_count()/4
worker_options = {"num_cpus" : num_cpus_available, "memory": 2 * GB}
code_location = {"github": "NA", "commit_hash": "0", "path": "path"}
ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

#local_conf: A dictionary specifying the local input and output folders where the PDF files will be read from and the transformed data will be saved.
local_conf = {
    "input_folder": "Input-Test-Data",
    "output_folder": "Output-Test-Data",
}

#params: A dictionary containing various runtime parameters for the transformation.
#run_locally: A flag indicating that the transformation should run locally.
#data_local_config: Configuration for local data access, such as input and output folders, converted into a format compatible with the transformation using ParamsUtils.convert_to_ast.
#data_files_to_use: Specifies that only PDF files (['.pdf']) will be used as input data.
#runtime_worker_options: Specifies worker configuration options for Ray, with half a gigabyte of memory per worker.
#runtime_num_workers: Number of workers to be used for the transformation.
#runtime_pipeline_id and runtime_job_id: Identifiers for the pipeline and job, respectively.
#runtime_code_location: Provides metadata about the code location, such as its repository and commit details, using the ParamsUtils.convert_to_ast function to format it correctly.

params = {
    "run_locally": True,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
    "runtime_worker_options": ParamsUtils.convert_to_ast({"num_cpus" : num_cpus_available, "memory": .5 * GB}),
    "runtime_num_workers": 2,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}
sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
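
The last line of the cell above hands the parameters to the launcher through sys.argv. As a rough illustration of that idea, and not the library's actual code, the sketch below flattens a dictionary into `--key=value` arguments and parses them back with argparse. The helper name `dict_to_argv_sketch` and the empty program-name placeholder are assumptions made for this example; `ParamsUtils.dict_to_req` may format its output differently.

```python
import argparse


def dict_to_argv_sketch(d: dict) -> list[str]:
    """Hypothetical stand-in: flatten a dict into --key=value arguments."""
    argv = [""]  # placeholder for the program name, like sys.argv[0]
    for key, value in d.items():
        argv.append(f"--{key}={value}")
    return argv


# Two of the runtime parameters from the cell above, used as demo input.
demo_params = {"runtime_num_workers": 2, "runtime_job_id": "job_id"}
demo_argv = dict_to_argv_sketch(demo_params)

# Parse the arguments back, the way a command-line program would.
parser = argparse.ArgumentParser()
parser.add_argument("--runtime_num_workers", type=int)
parser.add_argument("--runtime_job_id", type=str)
parsed = parser.parse_args(demo_argv[1:])

print(demo_argv)
print(parsed.runtime_num_workers, parsed.runtime_job_id)
```

Passing parameters this way lets the same transform be driven from a notebook or from a command line without changing its code.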

The next code cell launches the transformation with Ray, a parallel computing framework. RayTransformLauncher runs data transformation jobs on Ray, and Pdf2ParquetRayTransformConfiguration defines the settings for converting PDFs to Parquet. The launcher combines the two, picks up the parameters placed in sys.argv by the previous cell, and returns a code that the cell checks to report a failed run.

In [22]:
launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())
return_code = launcher.launch()

if return_code != 0:
    print(f"Computation failed with return code {return_code}")



16:35:19 INFO - Running locally
16:35:19 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': False}
16:35:19 INFO - data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data
16:35:19 INFO - data factory data_ max_files -1, n_sample -1
16:35:19 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
16:35:19 INFO - pipeline id pipeline_id
16:35:19 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
16:35:19 INFO - number of workers 2 worker options {'num_cpus': 4.0, 'memory': 2147483648, 'max_restarts': -1}
16:35:19 INFO - actor creation delay 0
16:35:19 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
2024-09-02 16:35:2

✅ Stage:2 completed successfully
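
After a successful run, the output folder should contain the transformed files. A minimal sketch for listing them with the standard library follows. The helper name `list_outputs` is an assumption for this example, and this notebook does not confirm exactly which files the transform writes alongside the Parquet files.

```python
from pathlib import Path


def list_outputs(folder: str, suffix: str = ".parquet") -> list[str]:
    """Return the sorted names of files in `folder` that end with `suffix`."""
    root = Path(folder)
    if not root.is_dir():
        return []  # folder missing: nothing was written
    return sorted(
        p.name for p in root.iterdir() if p.is_file() and p.suffix == suffix
    )


# "Output-Test-Data" is the output_folder configured in local_conf above.
for name in list_outputs("Output-Test-Data"):
    print(name)
```

An empty listing after a run that reported no failure would suggest checking that the input folder actually contains files ending in .pdf.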
