### Open this notebook in Google Colab

Click the badge to open this notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/pdf2parquet_colab-version.ipynb)

### Create necessary directories for transform on Google Colab

In [None]:
!mkdir -p 'input/solar-system'
!wget -O 'input/solar-system/earth.pdf'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/input/solar-system/earth.pdf'
!wget -O 'input/solar-system/mars.pdf'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/input/solar-system/mars.pdf'
!wget -O 'my_utils.py'  'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/my_utils.py'

### Install dependencies for Google Colab environment

In [None]:
! pip install data-prep-connector
! pip install 'data-prep-toolkit>=0.2.2.dev2'
! pip install 'data-prep-toolkit-transforms[pdf2parquet]>=0.2.2.dev3'
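
After the install cell finishes (and after the runtime restart), it can help to confirm what actually got installed. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are simply the ones from the `pip install` commands above.

```python
from importlib import metadata

# Distribution names copied from the pip commands above.
PACKAGES = [
    "data-prep-connector",
    "data-prep-toolkit",
    "data-prep-toolkit-transforms",
]


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


for pkg in PACKAGES:
    print(f"{pkg}: {installed_version(pkg) or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, re-run the install cell before continuing.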

### Restart runtime

After installing the dependencies, be sure to <font color="red">restart the runtime</font> so that the newly installed libraries are loaded.

To do this, go to **`Runtime --> Restart Session`**.

Then continue with the next step; there is no need to re-run the cells above.

### Import required classes and modules

In [None]:
import ast
import os
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration

### Configure class for managing input and output directories

In [None]:
## Configuration
class MyConfig:
    pass

MY_CONFIG = MyConfig()

MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'

MY_CONFIG.OUTPUT_FOLDER = "output"
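
Before launching the transform, it may be worth checking that the input folder really contains the downloaded files. The following is a small sketch, not part of the toolkit: `list_input_files` is a hypothetical helper, and the path is written out literally to mirror `MY_CONFIG.INPUT_DATA_DIR` so the snippet runs on its own.

```python
import os


def list_input_files(input_dir, extensions=(".pdf",)):
    """Return sorted file names in input_dir that end with one of the extensions."""
    if not os.path.isdir(input_dir):
        return []
    return sorted(
        name
        for name in os.listdir(input_dir)
        if name.lower().endswith(tuple(extensions))
    )


# Same location as MY_CONFIG.INPUT_DATA_DIR above.
print(list_input_files("input/solar-system"))
```

An empty list here suggests that the `wget` downloads did not complete or that the notebook is running from a different working directory.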

### Add parent directory to path

In [None]:
import os
import sys

# Make modules in the parent of the current working directory importable
this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append(os.path.abspath(parent_dir))

### Configure transform parameters

In [None]:
# create parameters
input_folder = MY_CONFIG.INPUT_DATA_DIR
output_folder = MY_CONFIG.OUTPUT_FOLDER
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf','.docx','.pptx','.zip']"),
    # execution info
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    # pdf2parquet params
    "pdf2parquet_double_precision": 0,
}
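
The `data_files_to_use` entry above is built with `ast.literal_eval`, which safely parses a string containing a Python literal. The sketch below shows what that call yields, and that a dictionary of simple values survives a round trip through its string form. Whether `ParamsUtils.convert_to_ast` serializes `local_conf` in exactly this way is an assumption; the snippet only illustrates the general idea with the standard library.

```python
import ast

# The same string literal used for "data_files_to_use" above.
extensions = ast.literal_eval("['.pdf','.docx','.pptx','.zip']")
print(extensions)  # a plain Python list of four extension strings

# A dict of simple values can be turned into text and parsed back unchanged.
local_conf = {"input_folder": "input/solar-system", "output_folder": "output"}
as_text = repr(local_conf)            # string form, e.g. for a command-line argument
restored = ast.literal_eval(as_text)  # back to a dict
print(restored == local_conf)
```

Unlike `eval`, `ast.literal_eval` accepts only literals (strings, numbers, lists, dicts, and similar), so it does not execute arbitrary code.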

### Invoke transform with python runtime

In [4]:
%%capture
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=Pdf2ParquetPythonTransformConfiguration())
launcher.launch()


15:13:18 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 0}
15:13:18 INFO - pipeline id pipeline_id
15:13:18 INFO - code location None
15:13:18 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output
15:13:18 INFO - data factory data_ max_files -1, n_sample -1
15:13:18 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.docx', '.pptx', '.zip'], files to checkpoint ['.parquet']
15:13:18 INFO - orchestrator pdf2parquet started at 2024-11-20 15:13:18
15:13:18 INFO - Number of files is 2, source profile {'max_file_size': 0.30131721496

### Print files in output folder

In [None]:
import glob
output_files = glob.glob('output/*')
print(output_files)
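
For a slightly more informative listing, the sketch below reports each Parquet file in the output folder together with its size. It assumes the transform writes `.parquet` files directly into `output`; `summarize_output` is a hypothetical helper, not part of the toolkit.

```python
import glob
import os


def summarize_output(folder, pattern="*.parquet"):
    """Return (file name, size in bytes) pairs for files in folder matching pattern."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    return [(os.path.basename(p), os.path.getsize(p)) for p in paths]


for name, size in summarize_output("output"):
    print(f"{name}: {size} bytes")
```

If pandas and a Parquet engine are installed, `pandas.read_parquet(path)` can then load one of these files for inspection.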