<div style="background-color: #04D7FD; padding: 20px; text-align: left;">
    <h1 style="color: #000000; font-size: 36px; margin: 0;">Data Processing for RAG with Data Prep Kit (Python)</h1>
    
</div>


## Before Running the notebook

Please complete [setting up the Python dev environment](./setup-python-dev-env.md) first.

## Overview

This notebook processes PDF documents as part of a RAG pipeline.

![](media/rag-overview-2.png)

This notebook performs steps 1, 2 and 3 of the RAG pipeline.

Here are the processing steps:

- **pdf2parquet**: Extract text from the PDFs and convert it into parquet files
- **Chunk documents**: Split the documents into 'meaningful sections' (paragraphs, sentences, etc.)
- **Doc_ID generation**: Each chunk is assigned a unique id, based on a hash of its content
- **Exact Dedup**: Chunks with exactly the same content are filtered out
- **Text encoder**: Convert chunks into vectors using embedding models
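
The steps above form a linear chain: each stage reads the folder that the previous stage wrote. As a rough sketch of that wiring, the snippet below builds the input/output pairs for the five stages. The folder names match the ones used later in this notebook; `plan_stages` is a hypothetical helper written for illustration and is not part of Data Prep Kit.

```python
import os

# Output folder names, one per stage, as used later in this notebook.
STAGES = [
    "01_parquet_out",
    "02_chunk_out",
    "03_docid_out",
    "04_exact_dedupe_out",
    "05_embeddings_out",
]

def plan_stages(input_dir, output_root, stages=STAGES):
    """Return (stage_number, input_folder, output_folder) for each stage.

    Hypothetical helper: it only computes paths, it does not run anything.
    """
    plan = []
    current_input = input_dir
    for number, name in enumerate(stages, start=1):
        output_folder = os.path.join(output_root, name)
        plan.append((number, current_input, output_folder))
        # The next stage reads what this stage wrote.
        current_input = output_folder
    return plan

for number, src, dst in plan_stages("input", "output"):
    print(f"STAGE-{number}: input='{src}' --> output='{dst}'")
```

Keeping every intermediate folder makes it easy to inspect the data after each stage, which is what the "Inspect Generated output" cells in this notebook do.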

## Step-1: Configuration

In [1]:
from my_config import MY_CONFIG

## Step-2:  Data

We will use white papers about LLMs:

- [Granite Code Models](https://arxiv.org/abs/2405.04324)
- [Attention is all you need](https://arxiv.org/abs/1706.03762)

You can of course substitute your own data below.

### 2.1 - Download data

In [2]:
import os, sys
import shutil
from utils import download_file

## Download the data files
os.makedirs(MY_CONFIG.INPUT_DATA_DIR, exist_ok=True)

download_file (url = 'https://arxiv.org/pdf/1706.03762', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'attension.pdf' ))

download_file (url = 'https://arxiv.org/pdf/2405.04324', local_file = os.path.join(MY_CONFIG.INPUT_DATA_DIR, 'granite.pdf' ))


Local file 'input/attension.pdf' (2.22 MB) already exists. Skipping download.
Local file 'input/granite.pdf' (1.27 MB) already exists. Skipping download.


### 2.2 - Set input/output path variables for the pipeline

In [3]:
import os, sys
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')
output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')
output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')
output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')
output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: pdf2parquet -  Convert data from PDF to Parquet

This step reads all PDF files in the input folder and ingests them into a parquet table using the [Docling package](https://github.com/DS4SD/docling).
The documents are converted into a JSON format, which makes them easy to chunk in later steps.
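
To make the JSON representation concrete, here is a minimal sketch of reading one `contents` value with the standard `json` module. The sample document is hand-written for illustration: the top-level `_name` and `type` keys appear in the output of this step, `main-text` and `tables` appear in the `doc_jsonpath` values of the chunking step, and the `text` and `page` item fields are assumptions rather than the exact Docling schema.

```python
import json

# Hand-written stand-in for one value of the `contents` column.
# The item fields ("text", "page") are assumptions for illustration only.
sample_contents = json.dumps({
    "_name": "",
    "type": "pdf-document",
    "main-text": [
        {"text": "Section heading", "page": 1},
        {"text": "Paragraph text", "page": 1},
    ],
    "tables": [],
})

# Each row stores the whole document as a JSON string, so parse it first.
doc = json.loads(sample_contents)

print("Top-level keys:", sorted(doc.keys()))
for index, item in enumerate(doc["main-text"]):
    print(f"$.main-text[{index}] (page {item['page']}): {item['text']}")
```

Because the document keeps its structure (text elements, tables, page numbers), a later step can split it along layout boundaries instead of cutting raw text at arbitrary offsets.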



### 3.1 -  Set Input/output Folder

In [4]:
STAGE = 1 

input_folder = MY_CONFIG.INPUT_DATA_DIR
output_folder =  output_parquet_dir

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-1: Processing input='input' --> output='output/01_parquet_out'


### 3.2 - Execute 

In [5]:
%%time 

import ast
import os
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils

from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration


# create parameters
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
}


sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
# create launcher
launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")


00:23:40 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}
00:23:40 INFO - pipeline id pipeline_id
00:23:40 INFO - code location None
00:23:40 INFO - data factory data_ is using local data access: input_folder - input output_folder - output/01_parquet_out
00:23:40 INFO - data factory data_ max_files -1, n_sample -1
00:23:40 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
00:23:40 INFO - orchestrator pdf2parquet started at 2024-10-02 00:23:40
00:23:40 INFO - Number of files is 2, source profile {'max_file_size': 2.112621307373047, 'min_file_size': 1.2146415710449219, 'total_file_size': 3.3272628784179688}
00:23:40 INFO - Initializing models


Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

model.pt:   0%|          | 0.00/201M [00:00<?, ?B/s]

00:24:14 INFO - Completed 1 files (50.0%) in 0.3 min
00:24:48 INFO - Completed 2 files (100.0%) in 0.871 min
00:24:48 INFO - Done processing 2 files, waiting for flush() completion.
00:24:48 INFO - done flushing in 0.0 sec
00:24:48 INFO - Completed execution in 1.137 min, execution result 0


✅ Stage:1 completed successfully
CPU times: user 2min 9s, sys: 3.15 s, total: 2min 12s
Wall time: 1min 11s


### 3.3 -  Inspect Generated output

Here we should see one entry per input file processed

In [6]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
# output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (2, 12)


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename
0,granite.pdf,"{""_name"":"""",""type"":""pdf-document"",""description...",28,17,348,4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3,pdf,0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...,654989,2024-10-02T00:24:48.959612,34.22392,granite.pdf
1,attension.pdf,"{""_name"":"""",""type"":""pdf-document"",""description...",15,4,193,f275d75a-a072-4836-8a55-6a65f0d34577,pdf,6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...,135814,2024-10-02T00:24:14.713654,18.004455,attension.pdf


##  Step-4: Doc chunks

Split the documents into chunks according to their layout segmentation.
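
Each chunk produced by this step records where it came from in a `doc_jsonpath` column, with values such as `$.main-text[55]` or `$.tables[7]`. The sketch below shows one way to trace such a path back into a parsed document. `resolve_simple_jsonpath` is a hypothetical helper that only handles the `$.key[index]` form seen in this notebook's output; it is not a general JSONPath implementation and not part of the doc_chunk transform.

```python
import re

def resolve_simple_jsonpath(document, path):
    """Resolve a path of the form '$.key[index]' against a parsed JSON document."""
    match = re.fullmatch(r"\$\.([\w-]+)\[(\d+)\]", path)
    if match is None:
        raise ValueError(f"Unsupported path: {path!r}")
    key, index = match.group(1), int(match.group(2))
    return document[key][index]

# Toy document; the element contents are placeholders.
document = {
    "main-text": [{"text": "first element"}, {"text": "second element"}],
    "tables": [{"text": "a table"}],
}

print(resolve_simple_jsonpath(document, "$.main-text[1]"))
print(resolve_simple_jsonpath(document, "$.tables[0]"))
```

Together with the `page_number` and `bbox` columns, this lets a retrieved chunk be linked back to its location in the source PDF.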

### 4.1 - Set Input/output Folder

In [7]:
STAGE = 2

input_folder = output_parquet_dir # previous output folder is the input folder for the current stage
output_folder =  output_chunk_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'


### 4.2 - Execute 

In [8]:
%%time 

from data_processing.runtime.pure_python import PythonTransformLauncher
from doc_chunk_transform_python import DocChunkPythonTransformConfiguration

# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # doc_chunk arguments
    # ...
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

00:24:50 INFO - doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}
00:24:50 INFO - pipeline id pipeline_id
00:24:50 INFO - code location None
00:24:50 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out
00:24:50 INFO - data factory data_ max_files -1, n_sample -1
00:24:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:24:50 INFO - orchestrator doc_chunk started at 2024-10-02 00:24:50
00:24:50 INFO - Number of files is 2, source profile {'max_file_size

✅ Stage:2 completed successfully
CPU times: user 1.07 s, sys: 95.1 ms, total: 1.16 s
Wall time: 1.19 s


### 4.3 - Inspect Generated output

We should see that the documents have been split into many chunks.

In [9]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print (f"Files processed : {input_df.shape[0]:,}")
print (f"Chunks created : {output_df.shape[0]:,}")

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.sample(min(3, output_df.shape[0]))

Files processed : 2
Chunks created : 211
Input data dimensions (rows x columns)=  (2, 12)
Output data dimensions (rows x columns)=  (211, 16)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id
87,granite.pdf,28,17,348,pdf,0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...,654989,2024-10-02T00:24:48.959612,34.22392,granite.pdf,4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3,6.3 Code Editing and Translation\nTable 12: Pa...,$.main-text[189],16,"[106.69820404, 190.24554443, 504.00320435, 211...",f28d8c9a4fe81f0baf801daf9a95ddaf152a4ac5e8b8ac...
154,attension.pdf,15,4,193,pdf,6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...,135814,2024-10-02T00:24:14.713654,18.004455,attension.pdf,f275d75a-a072-4836-8a55-6a65f0d34577,3.2.2 Multi-Head Attention\nMulti-head attenti...,$.main-text[55],5,"[107.46644592, 669.41210938, 503.99703979, 690...",da79f02a5f19c2f07de7a6f1da9df8db00f01a477582ac...
67,granite.pdf,28,17,348,pdf,0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...,654989,2024-10-02T00:24:48.959612,34.22392,granite.pdf,4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3,"6.1.5 RepoBench, CrossCodeEval: Repository-Lev...",$.main-text[153],12,"[106.97065735, 224.31654358, 505.74191284, 290...",cd5bd4537bde007298a91de7fa2fb4b56516d2f1d31262...


## Step-5:  DOC ID generation

This transform annotates documents with document "ids". It supports the following transformations of the original data:

 - Adding a document hash: adds a hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set the hash column (passed as `doc_id_hash_column` in the cell below) to the name of the column where you want to store it.
 - Adding an integer document id: adds an integer id that is unique across all rows in all tables provided to the `transform()` method. To enable this annotation, set the integer id column (passed as `doc_id_int_column` in the cell below) to the name of the column where you want to store it. **This is a prerequisite for fuzzy dedup** in the pipeline.
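
The two annotations can be sketched in a few lines of plain Python. The hash expression is the one quoted above; the column names `chunk_hash` and `chunk_id` and the `start_id` of 0 match the configuration used in this notebook. `annotate_with_ids` is a hypothetical helper for illustration, not the transform's actual code.

```python
import hashlib

def annotate_with_ids(docs, start_id=0):
    """Attach a content hash and a running integer id to each document."""
    rows = []
    for offset, doc in enumerate(docs):
        rows.append({
            "contents": doc,
            # Same hash expression as described above.
            "chunk_hash": hashlib.sha256(doc.encode("utf-8")).hexdigest(),
            # Integer ids are unique because they simply count upward.
            "chunk_id": start_id + offset,
        })
    return rows

rows = annotate_with_ids(["first chunk", "second chunk", "first chunk"])
for row in rows:
    print(row["chunk_id"], row["chunk_hash"][:12], row["contents"])
```

Note that identical content always yields the same hash, while the integer id stays unique per row. The exact dedup step relies on the first property.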

### 5.1 - Set Input/output Folder

In [10]:

STAGE  = 3

input_folder = output_chunk_dir # previous output folder is the input folder for the current stage
output_folder =  output_docid_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'


### 5.2 - Execute 

In [11]:
%%time 

from data_processing.runtime.pure_python import PythonTransformLauncher
from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # doc id configuration
    "doc_id_doc_column": "contents",
    "doc_id_hash_column": "chunk_hash",
    "doc_id_int_column": "chunk_id",
}
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

00:24:50 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}
00:24:50 INFO - pipeline id pipeline_id
00:24:50 INFO - code location None
00:24:50 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out
00:24:50 INFO - data factory data_ max_files -1, n_sample -1
00:24:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:24:50 INFO - orchestrator doc_id started at 2024-10-02 00:24:50
00:24:50 INFO - Number of files is 2, source profile {'max_file_size': 0.06398963928222656, 'min_file_size': 0.028062820434570312, 'total_file_size': 0.09205245971679688}
00:24:50 INFO - Completed 1 files (50.0%) in 0.0 min
00:24:50 INFO - Completed 2 files (100.0%) in 0.0 min
00:24:50 INFO - Done processing 2 files, waiting for flush() completion.
00:24

✅ Stage:3 completed successfully
CPU times: user 13.4 ms, sys: 4.83 ms, total: 18.3 ms
Wall time: 14.7 ms


### 5.3 - Inspect Generated output

In [12]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.sample(min(3, output_df.shape[0]))

Input data dimensions (rows x columns)=  (211, 16)
Output data dimensions (rows x columns)=  (211, 18)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_hash,chunk_id
192,attension.pdf,15,4,193,pdf,6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...,135814,2024-10-02T00:24:14.713654,18.004455,attension.pdf,f275d75a-a072-4836-8a55-6a65f0d34577,"6.2 Model Variations\nIn Table 3 rows (A), we ...",$.main-text[118],9,"[107.27760315, 318.93438721, 505.24127197, 350...",70948f748c6f275b39c70652e29d60dfd53c545e0d6d92...,70948f748c6f275b39c70652e29d60dfd53c545e0d6d92...,69
71,granite.pdf,28,17,348,pdf,0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...,654989,2024-10-02T00:24:48.959612,34.22392,granite.pdf,4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3,"6.1.5 RepoBench, CrossCodeEval: Repository-Lev...",$.tables[7],13,"[109.39778137, 486.89639282, 502.1010437, 679....",b7497dcda69d88caa6b7c3a462edb925ffa97ce5e42c52...,b7497dcda69d88caa6b7c3a462edb925ffa97ce5e42c52...,159
196,attension.pdf,15,4,193,pdf,6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...,135814,2024-10-02T00:24:14.713654,18.004455,attension.pdf,f275d75a-a072-4836-8a55-6a65f0d34577,6.3 English Constituency Parsing\nWe performed...,$.main-text[123],9,"[106.96768951, 69.592453, 504.24859619, 101.62...",93e01b0e6bafcfe5fcd113d1a3dfedad27d12f81038ff5...,93e01b0e6bafcfe5fcd113d1a3dfedad27d12f81038ff5...,73


## Step-6: Exact Dedup

Remove chunks that have identical content, so that duplicates do not bias the data. A SHA256 hash is computed on the content of each chunk,
and records with identical hashes are de-duplicated.
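
As a minimal sketch of hash-based exact de-duplication (assuming a plain list of strings instead of parquet tables), the snippet below keeps the first occurrence of each distinct content and drops the rest. `exact_dedup` is a hypothetical function for illustration and not the ededup transform itself.

```python
import hashlib

def exact_dedup(chunks):
    """Keep the first chunk for every distinct SHA256 content hash."""
    seen_hashes = set()
    kept, removed = [], []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            removed.append(chunk)   # exact duplicate of an earlier chunk
        else:
            seen_hashes.add(digest)
            kept.append(chunk)
    return kept, removed

chunks = ["alpha", "beta", "alpha", "gamma", "beta"]
kept, removed = exact_dedup(chunks)
print(f"Input chunks   : {len(chunks)}")
print(f"Kept chunks    : {len(kept)}")
print(f"Removed chunks : {len(removed)}")
```

Comparing fixed-size hashes instead of full texts keeps the set of seen values small, and only byte-for-byte identical content is removed. Near-duplicates are left for fuzzy dedup.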

### 6.1 - Set Input/output Folder

In [13]:
STAGE  = 4

input_folder = output_docid_dir # previous output folder is the input folder for the current stage
output_folder =  output_exact_dedupe_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'


### 6.2 - Execute 

In [14]:
%%time

from data_processing.runtime.pure_python import PythonTransformLauncher

# Import ededup transform configuration
from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # ededup parameters
    "ededup_doc_column": "contents",
    "ededup_doc_id_column": "chunk_hash",
    
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

00:24:50 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}
00:24:50 INFO - pipeline id pipeline_id
00:24:50 INFO - code location None
00:24:50 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out
00:24:50 INFO - data factory data_ max_files -1, n_sample -1
00:24:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:24:50 INFO - orchestrator ededup started at 2024-10-02 00:24:50
00:24:50 INFO - Number of files is 2, source profile {'max_file_size': 0.06945991516113281, 'min_file_size': 0.03227043151855469, 'total_file_size': 0.1017303466796875}
00:24:50 INFO - Starting from the beginning
00:24:50 INFO - Completed 1 files (50.0%) in 0.0 min
00:24:50 INFO - Completed 2 files (100.0%) in 0.0 min
00:24:50 INFO - Do

✅ Stage:4 completed successfully
CPU times: user 22.1 ms, sys: 5.79 ms, total: 27.9 ms
Wall time: 23.5 ms


### 6.3 - Inspect Generated output

In [15]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input chunks before exact dedupe : {input_df.shape[0]:,}")
print (f"Output chunks after exact dedupe : {output_df.shape[0]:,}")
print ("Duplicate chunks removed :  ", (input_df.shape[0] - output_df.shape[0]))

output_df.sample(min(3, output_df.shape[0]))

Input data dimensions (rows x columns)=  (211, 18)
Output data dimensions (rows x columns)=  (211, 19)
Input chunks before exact dedupe : 211
Output chunks after exact dedupe : 211
Duplicate chunks removed :   0


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_hash,chunk_id,removed
194,attension.pdf,15,4,193,pdf,6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...,135814,2024-10-02T00:24:14.713654,18.004455,attension.pdf,f275d75a-a072-4836-8a55-6a65f0d34577,6.3 English Constituency Parsing\nTo evaluate ...,$.main-text[121],9,"[107.15766144, 167.93530273, 504.10968018, 210...",10c85ade191100c9586ffb4e5ded4944bc4fd865d0919f...,10c85ade191100c9586ffb4e5ded4944bc4fd865d0919f...,71,[]
101,granite.pdf,28,17,348,pdf,0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...,654989,2024-10-02T00:24:48.959612,34.22392,granite.pdf,4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3,6.5 Math Reasoning\nTable 15: Performance on 4...,$.main-text[219],19,"[118.49487305, 699.65753174, 492.17700195, 710...",c39e0817c8d1edf1d322cef0535b5a63b80d2b2b4d1852...,c39e0817c8d1edf1d322cef0535b5a63b80d2b2b4d1852...,189,[]
206,attension.pdf,15,4,193,pdf,6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...,135814,2024-10-02T00:24:14.713654,18.004455,attension.pdf,f275d75a-a072-4836-8a55-6a65f0d34577,7 Conclusion\nAcknowledgements We are grateful...,$.main-text[135],10,"[107.4437561, 212.26509094, 504.00241089, 232....",855fdc0d15cb042a43d799b9a38d4339ae1e25b2df99c4...,855fdc0d15cb042a43d799b9a38d4339ae1e25b2df99c4...,83,[]


## Fuzzy Dedup

**Fuzzy dedup is currently available in the Ray version only**, so we skip it here.

## Step-7:   Text encoding

Encode the text of each chunk into an embedding vector, ready to be loaded into vector storage.
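
Embeddings are useful because similar texts map to nearby vectors, and vector stores commonly compare them with cosine similarity. The sketch below ranks chunks against a query using that measure. The 3-dimensional vectors and the chunk names are made up for illustration; real embedding vectors are much longer and come from the embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (assumed values, not produced by a real model).
query = [1.0, 0.0, 1.0]
chunk_embeddings = {
    "chunk-a": [1.0, 0.0, 1.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [1.0, 1.0, 0.0],
}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_embeddings,
    key=lambda name: cosine_similarity(query, chunk_embeddings[name]),
    reverse=True,
)
for name in ranked:
    score = cosine_similarity(query, chunk_embeddings[name])
    print(f"{name}: {score:.3f}")
```

In a RAG pipeline the query vector would be produced by the same embedding model as the chunks, so that both live in the same vector space.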

### 7.1 - Set Input/output Folder

In [16]:
STAGE  = 5

input_folder = output_exact_dedupe_dir
output_folder =  output_embeddings_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-5: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'


### 7.2 - Execute

In [17]:
%%time 

from data_processing.runtime.pure_python import PythonTransformLauncher
from text_encoder_transform_python import TextEncoderPythonTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # text_encoder
    "text_encoder_model_name": MY_CONFIG.EMBEDDING_MODEL,
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())
# launch

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Job failed")

00:24:50 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}
00:24:50 INFO - pipeline id pipeline_id
00:24:50 INFO - code location None
00:24:50 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out
00:24:50 INFO - data factory data_ max_files -1, n_sample -1
00:24:50 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
00:24:50 INFO - orchestrator text_encoder started at 2024-10-02 00:24:50
00:24:50 INFO - Number of files is 2, source profile {'max_file_size': 0.06981945037841797, 'min_file_size': 0.032629966735839844, 'total_file_size': 0.10244941711425781}
00:24:52 INFO - Completed 1 files (50.0%) in 0.008 min
00:24:53 INFO - Completed 2 files (100.0%) in 0.02 min
00:24:53 

✅ Stage:5 completed successfully
CPU times: user 1.78 s, sys: 103 ms, total: 1.88 s
Wall time: 3.09 s


### 7.3 - Inspect Generated output

In [18]:
from utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.sample(min(3, output_df.shape[0]))

Input data dimensions (rows x columns)=  (211, 19)
Output data dimensions (rows x columns)=  (211, 20)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_hash,chunk_id,removed,embeddings
193,attension.pdf,15,4,193,pdf,6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...,135814,2024-10-02T00:24:14.713654,18.004455,attension.pdf,f275d75a-a072-4836-8a55-6a65f0d34577,"6.2 Model Variations\nIn Table 3 rows (B), we ...",$.main-text[119],9,"[107.44257355, 248.49208069, 505.24127197, 312...",6b79d74f59d1218fa3cdff6d13b504c8bf80558f3e2522...,6b79d74f59d1218fa3cdff6d13b504c8bf80558f3e2522...,70,[],"[-0.0049973284, -0.10789071, 0.02143236, -0.02..."
210,attension.pdf,15,4,193,pdf,6fe23d4f932c725077dfc8334f3f4da4e3aaf908d2aa23...,135814,2024-10-02T00:24:14.713654,18.004455,attension.pdf,f275d75a-a072-4836-8a55-6a65f0d34577,Attention Visualizations Input-Input Layer5\nF...,$.main-text[190],15,"[107.43354034, 157.36341858, 504.06988525, 189...",67626adb815bf2b27871df24d538ddc10ae68a3fbbd238...,67626adb815bf2b27871df24d538ddc10ae68a3fbbd238...,87,[],"[0.01508544, -0.015680796, 0.039181348, 0.0084..."
46,granite.pdf,28,17,348,pdf,0650e590f33356ab8581c7eb0c23f1b928f0cfe1659587...,654989,2024-10-02T00:24:48.959612,34.22392,granite.pdf,4a32ba4c-8fdb-4eeb-a06b-d28493efe8e3,6.1.1 HumanEvalSynthesize: Multilingual Code G...,$.main-text[117],9,"[107.46860504, 613.84277344, 456.97003174, 624...",3d5d963f59d4ecb05d1ec2d014747459e01cabe2944bba...,3d5d963f59d4ecb05d1ec2d014747459e01cabe2944bba...,134,[],"[-0.029933447, 0.031515192, -0.04598905, -0.01..."


## Step-8: Copy output to final output dir

In [19]:
import shutil

shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)
shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)

print (f"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'")

✅ Copied output from 'output/05_embeddings_out' --> 'output/output_final'
