# Data Prep Kit - Hello World (Pure Python)

This notebook guides you through running your first data preparation transformation using the data-prep-kit. In this example, we will demonstrate a transformation that takes PDF files as input and extracts their content.

**Notebook versions:**

- Pure python (run locally): this notebook 
- Ray version (run locally): [Run_your_first_transform_ray.ipynb](Run_your_first_transform_ray.ipynb)
- Google Colab friendly notebook: [Run_your_first_transform_colab.ipynb](Run_your_first_transform_colab.ipynb)  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb)

## Step-1: Setting up Python Dev Environment

Please follow the instructions in the [Getting started section](../../README.md#gettingstarted) to set up your Python development environment.

## Step-2: Inspect Data

This example demonstrates the PDF processing capabilities of the data-prep-kit (DPK). The sample PDF documents are in the [Input-Test-Data](Input-Test-Data) folder:

- [IBM Granite model paper](Input-Test-Data/GraniteCodePaper.pdf) ([arXiv link](https://arxiv.org/abs/2405.04324))
- [Attention is all you need paper](Input-Test-Data/attention_is_all_you_need.pdf) ([arXiv link](https://arxiv.org/pdf/1706.03762)): the seminal paper on the transformer/attention architecture


**Try your own PDFs:** You can drop any PDF documents into the [Input-Test-Data](Input-Test-Data) folder and run this code.

In [1]:
INPUT_DIR = 'Input-Test-Data'
OUTPUT_DIR = 'Output-Test-Data'
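
Before running the transform, it can help to confirm which PDFs will be picked up from the input folder. The following is a minimal sketch using only the standard library; the helper name `list_pdfs` is hypothetical and is not part of DPK.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return (file name, size in bytes) for each PDF directly inside folder, sorted by name."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"  # match .pdf and .PDF alike
    )
```

Calling `list_pdfs(INPUT_DIR)` on the sample folder should list the two papers described above, plus any PDFs you added yourself.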

## Step-3: Extract Text from PDF 

The cell below configures a transform that extracts text from PDF files and saves the output in Parquet format.

In [2]:
import sys
import ast

# pdf2parquet transform: the CLI parameter name for the output content type and its allowed values
from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
# ParamsUtils, from the data-prep-kit's data-processing-lib, provides helpers for parameter management
from data_processing.utils import ParamsUtils


ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

# local_conf: local input and output folders. PDFs are read from input_folder
# and the transformed data is written to output_folder.
local_conf = {
    "input_folder": INPUT_DIR,
    "output_folder": OUTPUT_DIR,
}

# params: runtime parameters for the transformation.
#   data_local_config: local data access configuration (input and output folders),
#                      converted with ParamsUtils.convert_to_ast into a form the launcher can parse.
#   data_files_to_use: restricts the input to PDF files (['.pdf']).

params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
}
sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
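
The launcher reads its configuration from `sys.argv`, so the merged dictionary has to be flattened into command-line style arguments. The sketch below imitates that flattening with a hypothetical stand-in, `dict_to_argv`. It is an assumption about the general `--key=value` shape, not the actual `ParamsUtils.dict_to_req` implementation, and the `example_option` key is made up for illustration.

```python
def dict_to_argv(params):
    """Hypothetical stand-in: flatten a dict into ['', '--key=value', ...]."""
    # The leading empty string plays the role of the program name in sys.argv[0].
    return [""] + [f"--{key}={value}" for key, value in params.items()]


local_conf = {"input_folder": "Input-Test-Data", "output_folder": "Output-Test-Data"}
params = {
    "data_local_config": str(local_conf),  # assumption: nested config travels as a string
    "data_files_to_use": [".pdf"],
}
extra = {"example_option": "value"}  # hypothetical transform-specific option

# The | operator (Python 3.9+) merges two dicts; keys on the right win on conflict.
argv = dict_to_argv(params | extra)
for arg in argv:
    print(repr(arg))
```

Passing everything through `sys.argv` lets the same transform code run unchanged from a notebook or from a command line.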

### 3 - Execute

Now it's time to run the transformation.

On the first run, the code downloads the models it needs to process the PDFs.

In [3]:
%%time 

from data_processing.runtime.pure_python import PythonTransformLauncher
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration


launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
return_code = launcher.launch()

if return_code == 0:
    print("✅ Transformation run completed successfully")
else:
    raise Exception("❌ Transformation run failed")


13:23:41 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': False}
13:23:41 INFO - pipeline id pipeline_id
13:23:41 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'pure python', 'job id': 'job_id'}
13:23:41 INFO - code location None
13:23:41 INFO - data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data
13:23:41 INFO - data factory data_ max_files -1, n_sample -1
13:23:41 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
13:23:41 INFO - orchestrator pdf2parquet started at 2024-09-04 13:23:41
13:23:41 INFO - Number of files is 2, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 3.3272628784179688}

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

13:24:18 INFO - Completed 1 files (50.0%) in 0.4262009263038635 min
13:24:29 INFO - Completed 2 files (100.0%) in 0.5942506710688273 min
13:24:29 INFO - done flushing in 6.4373016357421875e-06 sec
13:24:29 INFO - Completed execution in 0.7952145417531331 min, execution result 0


✅ Transformation run completed successfully
CPU times: user 1min 56s, sys: 1.5 s, total: 1min 58s
Wall time: 47.8 s


## Step-4: Inspect the generated output

We will use pandas to read the generated parquet files and display them.

You should see one entry per input PDF file.

In [4]:
import glob
import pandas as pd

# Read all parquet files in a folder into a single pandas DataFrame
def read_parquet_files_as_df(parquet_dir):
    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')
    if not parquet_files:
        # pd.concat fails on an empty list, so report the missing files explicitly
        raise FileNotFoundError(f'No parquet files found in {parquet_dir}')

    # Read each parquet file into a DataFrame and collect them in a list
    dfs = [pd.read_parquet(f) for f in parquet_files]

    # Concatenate all DataFrames into a single DataFrame
    data_df = pd.concat(dfs, ignore_index=True)
    return data_df

In [5]:
output_df = read_parquet_files_as_df(OUTPUT_DIR)
output_df

| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GraniteCodePaper.pdf | `{"_name":"","type":"pdf-document","description...` | 28 | 17 | 320 | 7852a79f-18ee-4675-95d0-3c7293e41496 | pdf | 36fa2268f52ed97f426768769749c4df37fef90bc9b2f8... | 584823 | 2024-09-04T13:24:18.920445 | 25.537883 | GraniteCodePaper.pdf |
| 1 | attention_is_all_you_need.pdf | `{"_name":"","type":"pdf-document","description...` | 15 | 4 | 193 | 8b9e4b7b-8909-4c30-90db-bce60749c553 | pdf | e8417f232bdadc1760dd998dd64ee650f6140493f1685e... | 131173 | 2024-09-04T13:24:29.020833 | 10.077004 | attention_is_all_you_need.pdf |
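
Each `contents` value is a JSON string, so it can be decoded with `json.loads` for structured access instead of being read as raw text. The sketch below uses a shortened, hand-written sample in place of a real row. Its keys are taken from the raw `contents` output printed later in this notebook, and the rest of the document is omitted.

```python
import json

# Shortened stand-in for one value of output_df['contents']; a real value is much longer.
sample_contents = (
    '{"_name":"","type":"pdf-document","description":{"logs":[]},'
    '"file-info":{"filename":"GraniteCodePaper.pdf","#-pages":28}}'
)

doc = json.loads(sample_contents)  # JSON string -> nested Python dicts and lists
file_info = doc["file-info"]

print(doc["type"])                                   # pdf-document
print(file_info["filename"], file_info["#-pages"])   # GraniteCodePaper.pdf 28
```

With the real DataFrame, the equivalent call would be `json.loads(output_df['contents'].iloc[0])`.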


In [6]:
# Inspect contents

import json
import pprint

column_list = output_df['contents'].tolist()
column_json = json.dumps(column_list, indent=4)
pprint.pprint(column_json[:2000]) # display first few lines

('[\n'
 '    '
 '"{\\"_name\\":\\"\\",\\"type\\":\\"pdf-document\\",\\"description\\":{\\"logs\\":[]},\\"file-info\\":{\\"filename\\":\\"GraniteCodePaper.pdf\\",\\"document-hash\\":\\"cd6f480d8c03baf8fc4fc5a874dfa726764028fd6fb032532b6807522da6df48\\",\\"#-pages\\":28,\\"page-hashes\\":[{\\"hash\\":\\"2683abedc4e9b0a7ecffbba4e696db4cdeeaaa770de105f14b46cf7d0fef935a\\",\\"model\\":\\"default\\",\\"page\\":1},{\\"hash\\":\\"6b67e3f7d141634b469b81edae327289f954805104902cb51eced075585ecdf5\\",\\"model\\":\\"default\\",\\"page\\":2},{\\"hash\\":\\"d6c88ae80a41c48e99bacf83121051d01010c0edd5489fbc1c34b1f11f7d45c8\\",\\"model\\":\\"default\\",\\"page\\":3},{\\"hash\\":\\"a342fd3b69659664f36d83fba38be16f7a78bd8fa31cc3216c884a856c42c842\\",\\"model\\":\\"default\\",\\"page\\":4},{\\"hash\\":\\"bd8ab5e38ea21b04545dd2c9a73cd82d17e6e732f497125aca4e9d8468891e7d\\",\\"model\\":\\"default\\",\\"page\\":5},{\\"hash\\":\\"b24778e5dd56e510671b10b05e099b3ce72178a6a49cc6241dcf52ac1aaea27a\\",\\"model\\":\\