# Data Prep Kit - Hello World (Pure Python)

This notebook guides you through running your first data preparation transformation using the data-prep-kit. In this example, we will demonstrate a transformation that takes PDF files as input and extracts their content.

This notebook is the pure Python version. For the Ray version, see [Run_your_first_transform_ray.ipynb](Run_your_first_transform_ray.ipynb).

## Step-1: Setting up Python Dev Environment

Please follow the instructions in the [Getting started section](../../README.md#gettingstarted) to set up your Python development environment.

## Step-2: Get Data

This example demonstrates the PDF processing capabilities of DPK. We will download and use the following PDF documents:

- [IBM Granite model](https://arxiv.org/abs/2405.04324)
- [Attention is all you need](https://arxiv.org/pdf/1706.03762) - seminal paper on transformer/attention architecture

The code below downloads the PDFs. Feel free to try your own PDFs as well.

In [1]:
INPUT_DIR = 'Input-Test-Data'
OUTPUT_DIR = 'Output-Test-Data'

In [2]:
## This cell will download the input files

import os
import shutil
import requests
from humanfriendly import format_size

def download_file(url, local_file, chunk_size=1024*1024):
    # Check if the local file already exists
    if os.path.exists(local_file):
        file_size = format_size(os.path.getsize(local_file))
        print(f"Local file '{local_file}' ({file_size}) already exists. Skipping download.")
        return

    # Create the directory if it doesn't exist
    os.makedirs(os.path.dirname(local_file), exist_ok=True)

    # Stream the file download
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_file, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)
        print()
        file_size = format_size(os.path.getsize(local_file))
        print(f"{local_file} ({file_size}) downloaded successfully.")
## --- end: download_file ------

## setup input/output directories
os.makedirs(INPUT_DIR, exist_ok=True)
shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

## Download PDF files
download_file(url='https://arxiv.org/pdf/2405.04324', local_file=os.path.join(INPUT_DIR, 'Granite_code_models.pdf'))
download_file(url='https://arxiv.org/pdf/1706.03762', local_file=os.path.join(INPUT_DIR, 'attention_is_all_you_need.pdf'))


Input-Test-Data/Granite_code_models.pdf (1.27 MB) downloaded successfully.

Input-Test-Data/attention_is_all_you_need.pdf (2.22 MB) downloaded successfully.
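
Before running the transform, it can be worth confirming that each download is really a PDF and not, for example, an HTML error page. The following is a minimal sketch that relies only on the fact that PDF files begin with the `%PDF-` marker. The helper names `looks_like_pdf` and `check_pdfs` are hypothetical and are not part of DPK.

```python
import os

def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker '%PDF-'."""
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'

def check_pdfs(input_dir):
    """Map each *.pdf file name in input_dir to whether it looks like a PDF."""
    results = {}
    for name in sorted(os.listdir(input_dir)):
        if name.lower().endswith('.pdf'):
            results[name] = looks_like_pdf(os.path.join(input_dir, name))
    return results
```

In this notebook, `check_pdfs(INPUT_DIR)` would return a dictionary with one entry per downloaded file.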


## Step-3: Extract Text from PDF 

This step sets up a data transformation that extracts text from PDF files. The output is saved in Parquet format.

In [3]:
import sys
import ast

from pdf2parquet_transform import pdf2parquet_contents_type_cli_param, pdf2parquet_contents_types
# Utilities from the data-prep-kit's data-processing-lib library provide functions and classes for parameter management.
from data_processing.utils import GB, ParamsUtils


ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

# local_conf: A dictionary specifying the local input and output folders where the PDF files will be read from and the transformed data will be saved.
local_conf = {
    "input_folder": INPUT_DIR,
    "output_folder": OUTPUT_DIR,
}

# params: A dictionary containing various runtime parameters for the transformation.
# data_local_config: Configuration for local data access, such as input and output folders, converted into a format compatible with the transformation using ParamsUtils.convert_to_ast.
# data_files_to_use: Specifies that only PDF files (['.pdf']) will be used as input data.

params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
}
sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
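
Assigning to `sys.argv` suggests that the launcher reads its configuration as command-line arguments, which is why the parameter dictionaries are converted to that form. The sketch below illustrates the general idea with a hypothetical stand-in, `dict_to_argv`. It is not the DPK implementation of `ParamsUtils.dict_to_req`, and the exact argument layout DPK produces may differ.

```python
import ast

def dict_to_argv(params, program="transform"):
    """Hypothetical helper: flatten a parameter dict into argv-style strings."""
    argv = [program]
    for key, value in params.items():
        argv.append(f"--{key}={value}")
    return argv

local_conf = {"input_folder": "Input-Test-Data", "output_folder": "Output-Test-Data"}
params = {
    # Nested values are passed as Python-literal strings ...
    "data_local_config": str(local_conf),
    "data_files_to_use": ['.pdf'],
}
argv = dict_to_argv(params)
for arg in argv:
    print(arg)

# ... which the receiving side can turn back into objects with ast.literal_eval
recovered = ast.literal_eval(argv[1].split("=", 1)[1])
assert recovered == local_conf
```

Passing nested structures as literal strings keeps the command line flat while still allowing dictionaries and lists to round-trip safely, since `ast.literal_eval` only evaluates Python literals.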

### 3 - Execute

Now it's time to run the transformation.

You will notice that the code downloads models before executing the transformation. These models are used to process the PDFs.

In [4]:
%%time 

from data_processing.runtime.pure_python import PythonTransformLauncher
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration


launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
return_code = launcher.launch()

if return_code == 0:
    print("✅ Transformation run completed successfully")
else:
    raise Exception("❌ Transformation run failed")


00:21:36 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': False}
00:21:36 INFO - pipeline id pipeline_id
00:21:36 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'pure python', 'job id': 'job_id'}
00:21:36 INFO - code location None
00:21:36 INFO - data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data
00:21:36 INFO - data factory data_ max_files -1, n_sample -1
00:21:36 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
00:21:36 INFO - orchestrator pdf2parquet started at 2024-09-04 00:21:36
00:21:36 INFO - Number of files is 2, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 3.3272628784179688}

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

00:22:11 INFO - Completed 1 files (50.0%) in 0.4056213855743408 min
00:22:21 INFO - Completed 2 files (100.0%) in 0.5788233876228333 min
00:22:21 INFO - done flushing in 5.9604644775390625e-06 sec
00:22:22 INFO - Completed execution in 0.7664824485778808 min, execution result 0


✅ Transformation run completed successfully
CPU times: user 1min 56s, sys: 1.1 s, total: 1min 57s
Wall time: 46 s
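
The log reports elapsed time in minutes, while `%%time` reports seconds. Converting the figures from the run above makes them easier to compare. The values below are copied from that log, and the helper name `minutes_to_seconds` is hypothetical.

```python
def minutes_to_seconds(minutes):
    """Convert a duration in minutes to seconds."""
    return minutes * 60

first_file_min = 0.4056213855743408   # "Completed 1 files (50.0%)"
both_files_min = 0.5788233876228333   # "Completed 2 files (100.0%)"
total_min = 0.7664824485778808        # "Completed execution"

first_s = minutes_to_seconds(first_file_min)
second_s = minutes_to_seconds(both_files_min - first_file_min)
total_s = minutes_to_seconds(total_min)

print(f"first file : {first_s:.1f} s")
print(f"second file: {second_s:.1f} s")
print(f"total      : {total_s:.1f} s")
```

The total of roughly 46 s agrees with the reported wall time, and the per-file figures are close to the `pdf_convert_time` values recorded for each document in the output.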


## Step-4: Inspect the generated output

We will use pandas to read the Parquet files and display them.

You should see one entry per input PDF file.

In [5]:
import glob
import pandas as pd

## Reads parquet files in a folder into a pandas dataframe
def read_parquet_files_as_df(parquet_dir):
    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')

    # Read each parquet file into a DataFrame and store in a list
    dfs = [pd.read_parquet(f) for f in parquet_files]

    # Concatenate all DataFrames into a single DataFrame
    data_df = pd.concat(dfs, ignore_index=True)
    return data_df
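
`pd.concat` raises a `ValueError` when given an empty list, which is what happens if the output folder contains no Parquet files. The following is a minimal sketch of a guard that uses only the standard library. The helper name `list_parquet_files` is hypothetical.

```python
import glob
import os

def list_parquet_files(parquet_dir):
    """Return the sorted *.parquet paths in parquet_dir, or raise if none exist."""
    parquet_files = sorted(glob.glob(os.path.join(parquet_dir, '*.parquet')))
    if not parquet_files:
        raise FileNotFoundError(f"No parquet files found in '{parquet_dir}'")
    return parquet_files
```

The returned list could stand in for the `glob.glob` call when reading the files, giving a clearer error message and a stable file order.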

In [6]:
output_df = read_parquet_files_as_df(OUTPUT_DIR)
output_df

| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | attention_is_all_you_need.pdf | {"_name":"","type":"pdf-document","description... | 15 | 4 | 193 | 1a75f9df-e478-43b6-bab6-075e6e4cc52c | pdf | e8417f232bdadc1760dd998dd64ee650f6140493f1685e... | 131173 | 2024-09-04T00:22:21.740447 | 10.386293 | attention_is_all_you_need.pdf |
| 1 | Granite_code_models.pdf | {"_name":"","type":"pdf-document","description... | 28 | 17 | 320 | aaf7c1f6-592b-4b5e-94d3-d79d887e6be2 | pdf | 153d0ed14d3c71894252d0e8584479ec71d793a8d9d7ea... | 584826 | 2024-09-04T00:22:11.343103 | 24.315634 | Granite_code_models.pdf |
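
Each value in the `contents` column is a JSON string, so it can be decoded with `json.loads` to navigate the document structure. The sketch below uses a shortened stand-in string built from keys that appear in this notebook's output. In practice the input would be something like `output_df['contents'][0]`.

```python
import json

# Shortened stand-in for one 'contents' value (the real string is much longer)
sample_contents = (
    '{"_name":"","type":"pdf-document","description":{"logs":[]},'
    '"file-info":{"filename":"attention_is_all_you_need.pdf","#-pages":15}}'
)

# Decode the JSON string into nested Python dicts and lists
doc = json.loads(sample_contents)

print(sorted(doc.keys()))
print(doc["file-info"]["filename"], doc["file-info"]["#-pages"])
```

Working with the decoded dictionary avoids the doubly escaped quotes that appear when the raw strings are printed directly.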


In [7]:
# Inspect contents

import json
import pprint

column_list = output_df['contents'].tolist()
column_json = json.dumps(column_list, indent=4)
pprint.pprint(column_json[:2000])  # display the first 2000 characters

('[\n'
 '    '
 '"{\\"_name\\":\\"\\",\\"type\\":\\"pdf-document\\",\\"description\\":{\\"logs\\":[]},\\"file-info\\":{\\"filename\\":\\"attention_is_all_you_need.pdf\\",\\"document-hash\\":\\"bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697\\",\\"#-pages\\":15,\\"page-hashes\\":[{\\"hash\\":\\"8834a09ad99e9297886c9f8ad786c2784b7dc66dc6e6adfeff6bf2c1f07926d6\\",\\"model\\":\\"default\\",\\"page\\":1},{\\"hash\\":\\"72ded7022ad3cbfa9b5c4377a9c9b44511251f9489973956c23d2f3321e6307e\\",\\"model\\":\\"default\\",\\"page\\":2},{\\"hash\\":\\"38733274891513257d051950018621d95f73d05d5c70bfd7331def2f1194973d\\",\\"model\\":\\"default\\",\\"page\\":3},{\\"hash\\":\\"699ed16bf81021d0f86374d05c7b4b2b1049e63a28d2951ec1fb930747d755b9\\",\\"model\\":\\"default\\",\\"page\\":4},{\\"hash\\":\\"a17e6b313bdd51eff07a824253eff394d78ae1d6ebc985de3580bdfece38d2e1\\",\\"model\\":\\"default\\",\\"page\\":5},{\\"hash\\":\\"b3e9b63f2e8728fa83a5b7d911df2827585cf6040d2a4734cb3b44be264da6b6\\",\\"mo