# Data Prep Kit - Hello World (Colab friendly)

This notebook guides you through running your first data preparation transformation using the data-prep-kit. In this example, we will demonstrate a transformation that takes PDF files as input and extracts their content.

**Notebook versions:**

- Pure python (run locally): [Run_your_first_transform_python.ipynb](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_python.ipynb)
- Ray version (run locally): [Run_your_first_transform_ray.ipynb](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_ray.ipynb)
- Google Colab friendly notebook: this notebook  |  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb)

## Step-1: Install data-prep-kit dependencies



In [None]:
! pip install --default-timeout=100 'data-prep-toolkit-transforms[pdf2parquet]==0.2.2.dev1'


## Restart Runtime

After installing dependencies, be sure to <font color="red">restart the runtime</font> so that the newly installed libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then continue from **Step-2** below
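
After the restart, one way to confirm that the package is visible to the new session is to look up its installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption made for illustration and is not part of data-prep-kit.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package name taken from the install cell in Step-1
package = "data-prep-toolkit-transforms"
found = installed_version(package)
if found is None:
    print(f"{package} is not installed; rerun the install cell in Step-1.")
else:
    print(f"{package} {found} is installed.")
```

If the package is reported as missing, rerunning the install cell and restarting the session again is the first thing to try.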

## Step-2: Get Data

This example demonstrates the PDF processing capabilities of Data Prep Kit (DPK). We will download and use the following PDF documents:

- [IBM Granite model](https://arxiv.org/abs/2405.04324)
- [Attention is all you need](https://arxiv.org/pdf/1706.03762) - seminal paper on transformer/attention architecture

The code below will download the PDFs. Feel free to try your own PDFs as well.
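
If you add your own files, it can help to check that each one really is a PDF before running the transform, since a failed download may leave an HTML error page saved under a `.pdf` name. The sketch below assumes that a valid PDF starts with the bytes `%PDF-`, which holds for typical files; the helper `looks_like_pdf` is hypothetical and not part of DPK.

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF header bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstrate on two throwaway files: one with a PDF header, one without
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.pdf")
    bad_path = os.path.join(tmp_dir, "bad.pdf")
    with open(good_path, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(bad_path, "wb") as f:
        f.write(b"<html>not a pdf</html>")

    good_result = looks_like_pdf(good_path)
    bad_result = looks_like_pdf(bad_path)

print(good_result, bad_result)
```

The check reads only the first five bytes, so it is cheap to run over every file in the input folder.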

In [1]:
INPUT_DIR = 'Input-Test-Data'
OUTPUT_DIR = 'Output-Test-Data'

In [2]:
## This cell will download the input files

import os
import shutil
import requests
from humanfriendly import format_size

def download_file(url, local_file, chunk_size=1024*1024):
    # Check if the local file already exists
    if os.path.exists(local_file):
        file_size = format_size(os.path.getsize(local_file))
        print(f"Local file '{local_file}' ({file_size}) already exists. Skipping download.")
        return

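    # Create the parent directory if the path has one and it doesn't exist
    parent_dir = os.path.dirname(local_file)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)

    # Stream the file download (the timeout avoids hanging on a stalled connection)
    with requests.get(url, stream=True, timeout=60) as r: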
        r.raise_for_status()
        with open(local_file, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)
        print()
        file_size = format_size(os.path.getsize(local_file))
        print(f"{local_file} ({file_size}) downloaded successfully.")
## --- end: download_file ------

## Set up input/output directories; any previous output is cleared
os.makedirs(INPUT_DIR, exist_ok=True)
shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

## Download PDF files
download_file(url='https://arxiv.org/pdf/2405.04324', local_file=os.path.join(INPUT_DIR, 'Granite_code_models.pdf'))
download_file(url='https://arxiv.org/pdf/1706.03762', local_file=os.path.join(INPUT_DIR, 'attention_is_all_you_need.pdf'))


Input-Test-Data/Granite_code_models.pdf (1.27 MB) downloaded successfully.

Input-Test-Data/attention_is_all_you_need.pdf (2.22 MB) downloaded successfully.


## Step-3: Extract Text from PDF

This step configures a transform that extracts text from the PDF files and saves the output in Parquet format.

In [3]:
import sys
import ast

# ParamsUtils, from data-prep-kit's data-processing library, provides helpers for parameter management
from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
from data_processing.utils import GB, ParamsUtils


ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

# local_conf: the local input and output folders. PDF files are read from the input folder,
# and the transformed data is saved to the output folder.
local_conf = {
    "input_folder": INPUT_DIR,
    "output_folder": OUTPUT_DIR,
}

# params: runtime parameters for the transformation.
#   data_local_config: the local input/output folders, converted with ParamsUtils.convert_to_ast
#                      into a format the transformation accepts.
#   data_files_to_use: only PDF files (['.pdf']) are used as input data.

params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
}
sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
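
The last line of the cell above merges two dictionaries with the `|` operator (available from Python 3.9) and overwrites `sys.argv`, so that the launcher can read the settings as if they had been passed on the command line. The sketch below illustrates that general pattern with the standard library only. The helper `to_argv_sketch`, the option name `example_option`, and the `--key=value` layout are assumptions made for illustration; the exact format produced by `ParamsUtils.dict_to_req` is not shown in this notebook.

```python
import argparse
import ast


def to_argv_sketch(options: dict) -> list:
    """Hypothetical helper: flatten a dict into an argv-style list of --key=value strings."""
    argv = [""]  # placeholder for the program name normally found in argv[0]
    for key, value in options.items():
        argv.append(f"--{key}={value}")
    return argv


params = {"data_files_to_use": [".pdf"]}
extra = {"example_option": "application/json"}  # hypothetical option name

# Dict union: keys from the right-hand dict win on conflict
merged = params | extra
argv = to_argv_sketch(merged)
print(argv)

# A CLI-style consumer can then recover the values from the argument list
parser = argparse.ArgumentParser()
parser.add_argument("--data_files_to_use")
parser.add_argument("--example_option")
args = parser.parse_args(argv[1:])

# The list arrives as text, so it is parsed back into a Python list
extensions = ast.literal_eval(args.data_files_to_use)
print(extensions, args.example_option)
```

Because every value is rendered as text in this scheme, structured values such as lists and dictionaries have to be parsed back on the receiving side.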

### 3 - Execute

Now it's time to run the transformation.

On the first run, the code downloads the models that the transform uses to process PDFs, so expect this step to take several minutes.

In [4]:
%%time

from data_processing.runtime.pure_python import PythonTransformLauncher
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration


launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
return_code = launcher.launch()

if return_code == 0:
    print("✅ Transformation run completed successfully")
else:
    raise Exception("❌ Transformation run failed")


20:39:40 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': False}
INFO:pdf2parquet_transform:pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': False}
20:39:40 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
20:39:40 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
20:39:40 INFO - data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data
INFO:data_processing.data_access.data_access_factory_basee9e219d6-eeb0-4e90-a0e6-12ea6b0bc0b7:data factory data_ is using local data access: input_folder - Input-Test-Data output_folder - Output-Test-Data
20:39:40 INFO - data factory data_ max_files -1, n_sample -

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

.gitignore:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

(…)del_artifacts/tableformer/tm_config.json:   0%|          | 0.00/7.09k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

otslp_all_fast.check:   0%|          | 0.00/146M [00:00<?, ?B/s]

model.pt:   0%|          | 0.00/169M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/41.0 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.60k [00:00<?, ?B/s]

20:46:16 INFO - Completed 1 files (50.0%) in 5.423 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 1 files (50.0%) in 5.423 min
20:47:47 INFO - Completed 2 files (100.0%) in 6.94 min
INFO:data_processing.runtime.pure_python.transform_orchestrator:Completed 2 files (100.0%) in 6.94 min
20:47:47 INFO - Done processing 2 files, waiting for flush() completion.
INFO:data_processing.runtime.pure_python.transform_orchestrator:Done processing 2 files, waiting for flush() completion.
20:47:47 INFO - done flushing in 0.0 sec
INFO:data_processing.runtime.pure_python.transform_orchestrator:done flushing in 0.0 sec
20:47:48 INFO - Completed execution in 8.119 min, execution result 0
INFO:data_processing.runtime.pure_python.transform_launcher:Completed execution in 8.119 min, execution result 0


✅ Transformation run completed successfully
CPU times: user 10min 47s, sys: 10 s, total: 10min 57s
Wall time: 8min 7s


## Step-4: Inspect the generated output

We will use pandas to read the generated Parquet files and display them.

You should see one entry per input PDF file.

In [5]:
import glob
import pandas as pd

## Reads parquet files in a folder into a pandas dataframe
def read_parquet_files_as_df(parquet_dir):
    parquet_files = glob.glob(f'{parquet_dir}/*.parquet')
    if not parquet_files:
        # pd.concat raises an unhelpful ValueError on an empty list
        raise FileNotFoundError(f"No parquet files found in '{parquet_dir}'")

    # Read each parquet file into a DataFrame and store them in a list
    dfs = [pd.read_parquet(f) for f in parquet_files]

    # Concatenate all DataFrames into a single DataFrame
    data_df = pd.concat(dfs, ignore_index=True)
    return data_df

In [6]:
output_df = read_parquet_files_as_df(OUTPUT_DIR)
output_df

| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | attention_is_all_you_need.pdf | `{"_name":"","type":"pdf-document","description...` | 15 | 4 | 193 | 1371115a-5421-4289-bbae-43882017b531 | pdf | 69c4e62b4bffc298b2231e31e392cb51a514bcb03507b8... | 108568 | 2024-09-04T20:47:47.345013 | 91.005103 | attention_is_all_you_need.pdf |
| 1 | Granite_code_models.pdf | `{"_name":"","type":"pdf-document","description...` | 28 | 17 | 320 | 5ec63530-a806-4779-a4ca-24d92e5ee47a | pdf | 66a506489ca7abed61a40dd7e3fc73302e33f2169fab73... | 310628 | 2024-09-04T20:46:16.208916 | 325.244253 | Granite_code_models.pdf |
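
The `contents` column appears to hold each converted document as a JSON string, so decoding a single value with `json.loads` gives a dictionary that is easier to explore than the raw text. The sketch below uses a shortened stand-in string whose field names are copied from this notebook's own output; it is an illustration rather than the full document structure.

```python
import json

# Shortened stand-in for one value of the `contents` column.
# The field names come from the notebook output; the real value is much longer.
sample_contents = (
    '{"_name":"","type":"pdf-document","description":{"logs":[]},'
    '"file-info":{"filename":"attention_is_all_you_need.pdf","#-pages":15}}'
)

# Decode the JSON string into a regular Python dictionary
doc = json.loads(sample_contents)

print(sorted(doc.keys()))
print(doc["type"])
print(doc["file-info"]["filename"], doc["file-info"]["#-pages"])
```

With the real dataframe, the same call would be applied to a single cell, for example `json.loads(output_df.loc[0, "contents"])`, assuming the column holds JSON strings as the output above suggests.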


In [7]:
# Inspect contents

import json
import pprint

column_list = output_df['contents'].tolist()
column_json = json.dumps(column_list, indent=4)
pprint.pprint(column_json[:2000]) # display the first 2,000 characters

('[\n'
 '    '
 '"{\\"_name\\":\\"\\",\\"type\\":\\"pdf-document\\",\\"description\\":{\\"logs\\":[]},\\"file-info\\":{\\"filename\\":\\"attention_is_all_you_need.pdf\\",\\"document-hash\\":\\"bdfaa68d8984f0dc02beaca527b76f207d99b666d31d1da728ee0728182df697\\",\\"#-pages\\":15,\\"page-hashes\\":[{\\"hash\\":\\"8834a09ad99e9297886c9f8ad786c2784b7dc66dc6e6adfeff6bf2c1f07926d6\\",\\"model\\":\\"default\\",\\"page\\":1},{\\"hash\\":\\"72ded7022ad3cbfa9b5c4377a9c9b44511251f9489973956c23d2f3321e6307e\\",\\"model\\":\\"default\\",\\"page\\":2},{\\"hash\\":\\"38733274891513257d051950018621d95f73d05d5c70bfd7331def2f1194973d\\",\\"model\\":\\"default\\",\\"page\\":3},{\\"hash\\":\\"699ed16bf81021d0f86374d05c7b4b2b1049e63a28d2951ec1fb930747d755b9\\",\\"model\\":\\"default\\",\\"page\\":4},{\\"hash\\":\\"a17e6b313bdd51eff07a824253eff394d78ae1d6ebc985de3580bdfece38d2e1\\",\\"model\\":\\"default\\",\\"page\\":5},{\\"hash\\":\\"b3e9b63f2e8728fa83a5b7d911df2827585cf6040d2a4734cb3b44be264da6b6\\",\\"mo