# Data Prep Kit Demo 1 - Ray Version

This notebook introduces Data Prep Kit (DPK) and showcases some of its capabilities.

Here is the workflow:

![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)



## How to run this notebook

Two options:

- **Option 1 - Google Colab:** Easiest option; no setup required. Click this link to open the notebook in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_ray.ipynb)
- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started)

The notebook works as-is in both environments.

## Step-1: Inspect the Data

We will use simple PDFs about the Solar System. The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)

- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)
- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)


## Step-2: Figure out Runtime Environment

### 2.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 2.2 - Download Data if running on Google Colab

In [2]:
if RUNNING_IN_COLAB:
    !mkdir -p 'input/solar-system'
    !wget -O 'input/solar-system/earth.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'
    !wget -O 'input/solar-system/mars.pdf'  'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'
    !wget -O 'my_utils.py'  'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'

### 2.3 - Install dependencies if running on Google Colab

In [3]:
if RUNNING_IN_COLAB:
    ! pip install  --default-timeout=100  \
        data-prep-toolkit==0.2.1 \
        data-prep-toolkit-transforms==0.2.1 \
        data-prep-toolkit-transforms-ray==0.2.1 \
        deepsearch-toolkit

### 2.4 - Restart Runtime

After installing dependencies, be sure to <font color="red">restart the runtime</font> so that the libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the notebook)

## Step-2: Configuration

### 2.1 - Basic Config

In [4]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


In [5]:
import os

## Configuration
class MyConfig:
    pass

MY_CONFIG = MyConfig ()

MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'

MY_CONFIG.OUTPUT_FOLDER = "output"
MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , "output_final")

## Embedding model
MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'

## RAY CONFIGURATION
### For local runs, we can use more parallelism
### For google colab, be conservative

if RUNNING_IN_COLAB:
  MY_CONFIG.RAY_RUNTIME_WORKERS = 2
  MY_CONFIG.RAY_NUM_CPUS =  0.3
  MY_CONFIG.RAY_MEMORY_GB = 2  # GB
else:  # local run
  num_cpus_available =  os.cpu_count()
  # print (num_cpus_available)

  MY_CONFIG.RAY_RUNTIME_WORKERS = 2
  MY_CONFIG.RAY_NUM_CPUS =  0.8
  MY_CONFIG.RAY_MEMORY_GB = 2  # GB
  # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3

print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)
print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)
print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)


MY_CONFIG.RAY_RUNTIME_WORKERS: 2
MY_CONFIG.RAY_NUM_CPUS: 0.8
MY_CONFIG.RAY_MEMORY_GB: 2
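
To make these numbers concrete, here is a minimal sketch of how the settings translate into the per-worker resource options that the later stages hand to Ray. `build_worker_options` is a hypothetical helper, not part of DPK, and `GB` is assumed to be 1024**3 bytes (inferred from the `'memory': 2147483648` value in the Stage 1 log for `RAY_MEMORY_GB = 2`).

```python
from typing import Optional

# Assumption: 1 GB = 1024**3 bytes, inferred from the 'memory': 2147483648
# value that the pdf2parquet log reports for RAY_MEMORY_GB = 2.
GB = 1024 ** 3


def build_worker_options(num_cpus: float, memory_gb: Optional[float] = None) -> dict:
    """Hypothetical helper: build the per-worker resource dict for one stage."""
    options = {"num_cpus": num_cpus}
    if memory_gb is not None:
        options["memory"] = int(memory_gb * GB)
    return options


workers = 2
options = build_worker_options(num_cpus=0.8, memory_gb=2)

# Each worker reserves `num_cpus`, so the stage asks for workers * num_cpus in total.
total_cpus = workers * options["num_cpus"]

print(options)
print(f"Total CPUs requested across {workers} workers: {total_cpus}")
```

With fractional `num_cpus`, two workers can share fewer than two full cores, which is why the Colab settings use a smaller value.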


In [6]:
## Add parent dir to path
import os,sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

### 2.2 - Setup input/output directories

In [7]:
import os, sys
import shutil

if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):
    raise Exception (f"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found")

output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')
output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')
output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')
output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')
output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')
output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')

## clear output folder
shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)
os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)

print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: pdf2parquet -  Convert data from PDF to Parquet

This step reads all PDF files in the input folder and ingests them into a Parquet table using the [Docling package](https://github.com/DS4SD/docling).
The documents are converted into a JSON format, which makes them easy to chunk in later steps.



### 3.1 - Set Input/output Folder

In [8]:
STAGE = 1

input_folder = MY_CONFIG.INPUT_DATA_DIR
output_folder =  output_parquet_dir

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'


### 3.2 - Execute

In [9]:
%%time

import ast
import os
import sys

from pdf2parquet_transform import (
    pdf2parquet_contents_type_cli_param,
    pdf2parquet_contents_types,
)
from data_processing_ray.runtime.ray import RayTransformLauncher
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration
from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration

from data_processing.utils import GB, ParamsUtils


# create parameters
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS, "memory": MY_CONFIG.RAY_MEMORY_GB * GB}
ingest_config = {
    pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,
}

params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.pdf']"),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
}


sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
# create launcher
launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())
# launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")


13:30:44 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.JSON: 'application/json'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}
13:30:44 INFO - pipeline id pipeline_id
13:30:44 INFO - code location None
13:30:44 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1}
13:30:44 INFO - actor creation delay 0
13:30:44 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
13:30:44 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out
13:30:44 INFO - data factory data_ max_files -1, n_sample -1
13:30:44 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
13:30:44 INFO - Running locally
2024-10-18 13:30:47,436	INFO worker.py:1744 

✅ Stage:1 completed successfully
CPU times: user 4.41 s, sys: 1.39 s, total: 5.8 s
Wall time: 31.1 s
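
The cell above configures the launcher by writing a list of strings to `sys.argv`, as if the transform had been started from the command line. The exact format produced by `ParamsUtils.dict_to_req` is not shown in this notebook, so the sketch below only illustrates the general pattern, assuming a `--key=value` convention. `dict_to_argv` and the parser are hypothetical stand-ins, not DPK code.

```python
import argparse
import ast


def dict_to_argv(params: dict, program: str = "demo") -> list:
    """Hypothetical stand-in: render a dict as ['program', '--key=value', ...]."""
    return [program] + [f"--{key}={value}" for key, value in params.items()]


params = {
    "run_locally": True,
    "data_local_config": {
        "input_folder": "input/solar-system",
        "output_folder": "output/01_parquet_out",
    },
    "runtime_num_workers": 2,
}
argv = dict_to_argv(params)

# The receiving side parses the strings back into Python values.
parser = argparse.ArgumentParser()
parser.add_argument("--run_locally", type=ast.literal_eval)
parser.add_argument("--data_local_config", type=ast.literal_eval)
parser.add_argument("--runtime_num_workers", type=int)
args = parser.parse_args(argv[1:])

print(argv)
print(args.data_local_config["output_folder"])
```

Passing nested settings as string literals and parsing them with `ast.literal_eval` keeps the command line flat while still allowing dictionaries and lists as values.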


### 3.3 - Inspect Generated output

Here we should see one entry per input file processed.

In [10]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Output dimensions (rows x columns)= ", output_df.shape)

output_df.head(5)

## To display certain columns
#output_df[['column1', 'column2', 'column3']].head(5)

Output dimensions (rows x columns)=  (2, 12)


Unnamed: 0,filename,contents,num_pages,num_tables,num_doc_elements,document_id,ext,hash,size,date_acquired,pdf_convert_time,source_filename
0,mars.pdf,"{""_name"":"""",""type"":""pdf-document"",""description...",1,0,11,62e5639f-f922-4ccc-a041-3cb02f1cfd83,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf
1,earth.pdf,"{""_name"":"""",""type"":""pdf-document"",""description...",1,0,11,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf



### 3.4 - Understand the output

Here are some interesting attributes to note:

- **filename** : original filename
- **contents** : document content as Docling JSON (text plus layout information)
- **document_id**: unique id (UUID) assigned to this document
- **hash** : hash of document
- **pdf_convert_time** : time to convert this pdf in seconds

Let's inspect the **contents** column.  See how the text is being divided up!

In [11]:
import pprint
import json

pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))
# json.loads(output_df.iloc[0, ]['contents'])

{'_name': '',
 'description': {'logs': []},
 'equations': [],
 'figures': [],
 'file-info': {'#-pages': 1,
               'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',
               'filename': 'mars.pdf',
               'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',
                                'model': 'default',
                                'page': 1}]},
 'footnotes': [],
 'main-text': [{'name': 'Section-header',
                'prov': [{'bbox': [133.35137939,
                                   654.45184326,
                                   169.88169861,
                                   667.98492432],
                          'page': 1,
                          'span': [0, 4]}],
                'text': 'Mars',
                'type': 'subtitle-level-1'},
               {'name': 'Section-header',
                'prov': [{'bbox': [133.09541321,
                                   630.681

In [12]:
pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))

{'_name': '',
 'description': {'logs': []},
 'equations': [],
 'figures': [],
 'file-info': {'#-pages': 1,
               'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',
               'filename': 'earth.pdf',
               'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',
                                'model': 'default',
                                'page': 1}]},
 'footnotes': [],
 'main-text': [{'name': 'Section-header',
                'prov': [{'bbox': [133.30961609,
                                   654.45184326,
                                   174.04208374,
                                   667.93347168],
                          'page': 1,
                          'span': [0, 5]}],
                'text': 'Earth',
                'type': 'subtitle-level-1'},
               {'name': 'Section-header',
                'prov': [{'bbox': [133.12528992,
                                   630.6
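
The JSON above can be navigated like any Python dictionary. The sketch below lists the `main-text` elements of one document. `sample_contents` is a hand-written, abbreviated stand-in that reuses the keys visible in the output (`name`, `type`, `text`, `prov`); its `'Text'`/`'paragraph'` entry is an assumption, because the real output is truncated before any paragraph appears. With real data you would pass `output_df.iloc[0]['contents']` instead.

```python
import json

# Hand-written, abbreviated stand-in for one value of the 'contents' column.
# The 'Section-header' entry mirrors the output above; the 'Text' entry is an
# assumption about what a paragraph element looks like.
sample_contents = json.dumps({
    "file-info": {"filename": "mars.pdf", "#-pages": 1},
    "main-text": [
        {"name": "Section-header", "type": "subtitle-level-1", "text": "Mars",
         "prov": [{"page": 1, "span": [0, 4]}]},
        {"name": "Text", "type": "paragraph",
         "text": "Mars, the fourth planet from the Sun, is a cold, desert world.",
         "prov": [{"page": 1}]},
    ],
})


def list_main_text(contents_json: str) -> list:
    """Return (page, element name, text) for every entry in 'main-text'."""
    doc = json.loads(contents_json)
    rows = []
    for item in doc.get("main-text", []):
        prov = item.get("prov") or [{}]
        rows.append((prov[0].get("page"), item.get("name"), item.get("text", "")))
    return rows


elements = list_main_text(sample_contents)
for page, name, text in elements:
    print(f"page {page} | {name:15} | {text}")
```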

##  Step-4: Doc chunks

In the previous step, we extracted text from our PDFs. But the content of each file is stored as a single row in our Parquet output.

In this step, we are going to split the documents into chunks, according to their layout segmentation.

This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`
to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.
It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: "application/json"`,
which provides the required JSON structure.
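
As a rough illustration of layout-based chunking, the sketch below emits one chunk per text element and prefixes it with the most recent section header, which resembles the `heading\ntext` shape of the chunks this stage produces. It is a simplified stand-in and not Quackling's `HierarchicalChunker` implementation; the function name `chunk_by_layout`, the `'Text'` element name, and the sample elements are assumptions made for the example.

```python
def chunk_by_layout(main_text: list) -> list:
    """Emit one chunk per non-header element, prefixed by its section header."""
    chunks = []
    current_header = None
    for item in main_text:
        text = item.get("text", "").strip()
        if not text:
            continue
        if item.get("name") == "Section-header":
            # Remember the header, but do not emit it as a chunk of its own.
            current_header = text
        else:
            chunks.append(f"{current_header}\n{text}" if current_header else text)
    return chunks


# Sample elements, loosely modelled on the 'main-text' list shown earlier.
sample_elements = [
    {"name": "Section-header", "text": "Solar System"},
    {"name": "Text", "text": "Our solar system is a vast and fascinating expanse."},
    {"name": "Text", "text": "For more details about the Solar system see Chapter 1."},
    {"name": "Section-header", "text": "Mars"},
    {"name": "Text", "text": "Mars is a cold, desert world."},
]

chunks = chunk_by_layout(sample_elements)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

Keeping the header with each paragraph means a chunk still carries its topic when it is later embedded or retrieved on its own.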

### 4.1 - Set Input/output Folder

In [13]:
STAGE = 2

input_folder = output_parquet_dir # previous output folder is the input folder for the current stage
output_folder =  output_chunk_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'


### 4.2 - Execute

In [14]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from doc_chunk_transform_ray import DocChunkRayTransformConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # doc_chunk arguments
    # ...
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

13:31:12 INFO - doc_chunk parameters are : {'chunking_type': <chunking_types.DL_JSON: 'dl_json'>, 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}
13:31:12 INFO - pipeline id pipeline_id
13:31:12 INFO - code location None
13:31:12 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
13:31:12 INFO - actor creation delay 0
13:31:12 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}
13:31:12 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out
13:31:12 INFO - data factory data_ max_files -1, n_sample -1
13:31:12 INFO - data factory data_ Not using data sets

✅ Stage:2 completed successfully
CPU times: user 982 ms, sys: 291 ms, total: 1.27 s
Wall time: 18.9 s


### 4.3 - Inspect Generated output

We should see that the documents are split into several chunks.

In [15]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print (f"Files processed : {input_df.shape[0]:,}")
print (f"Chunks created : {output_df.shape[0]:,}")

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Files processed : 2
Chunks created : 8
Input data dimensions (rows x columns)=  (2, 12)
Output data dimensions (rows x columns)=  (8, 16)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...


### 4.4 - Understanding the Output

Here we see that the 2 PDF files are split into 8 chunks. The documents are split along 'natural boundaries': paragraphs and bullet points.

See how the original document's id is carried along in **source_document_id**, while each chunk gets its own **document_id**. This helps us trace every chunk back to its original document.

Also note **contents** is now plain text (not JSON as before).

In [16]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,Solar System\nOur solar system is a vast and f...
1,mars.pdf,Solar System\nFor more details about the Solar...
2,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
3,mars.pdf,Basic facts about Mars:\n· Distance from the S...
4,earth.pdf,Solar System\nOur solar system is a vast and f...
5,earth.pdf,Solar System\nFor more details about our Solar...
6,earth.pdf,Earth\nEarth is the third planet from the Sun....
7,earth.pdf,Earth\nBasic facts about Earth:\n· Distance fr...


In [17]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Solar System
For more details about the Solar system see Chapter 1.
-------
-------Chunk 2------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 3------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celes

## Step-5:  DOC ID generation

This transform annotates documents with document "ids". It supports the following transformations of the original data:

 - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column, where you want to store it.
 - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column, where you want to store it.

**This is a pre-requisite for fuzzy dedup** in the pipeline.
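
A minimal sketch of the two annotations, using the hashing call quoted above. The helper name `annotate_chunks` and the sample chunks are assumptions for illustration; the real transform additionally keeps integer ids unique across all the tables it processes.

```python
import hashlib


def annotate_chunks(chunks: list, start_id: int = 0) -> list:
    """Attach a SHA-256 content hash and a sequential integer id to each chunk."""
    rows = []
    for offset, text in enumerate(chunks):
        rows.append({
            "contents": text,
            "chunk_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "chunk_id": start_id + offset,
        })
    return rows


rows = annotate_chunks([
    "Solar System\nOur solar system is vast.",
    "Mars\nMars is the fourth planet from the Sun.",
    "Solar System\nOur solar system is vast.",  # same text as the first chunk
])

for row in rows:
    print(row["chunk_id"], row["chunk_hash"][:12], repr(row["contents"][:20]))
```

Identical text always yields the same hash, so the first and third rows share a `chunk_hash` while keeping distinct `chunk_id` values.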

### 5.1 - Set Input/output Folder

In [18]:

# Input for this stage is the output of the doc chunk stage
# The output of this stage makes it possible for the fuzzy dedup (fdedup) component to run on the data.

STAGE  = 3

input_folder = output_chunk_dir
output_folder =  output_docid_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'


### 5.2 - Execute

In [19]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from doc_id_transform_ray import DocIDRayTransformRuntimeConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # doc id configuration
    "doc_id_doc_column": "contents",
    "doc_id_hash_column": "chunk_hash",
    "doc_id_int_column": "chunk_id",
}
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

13:31:29 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}
13:31:29 INFO - pipeline id pipeline_id
13:31:29 INFO - code location None
13:31:29 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
13:31:29 INFO - actor creation delay 0
13:31:29 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}
13:31:29 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out
13:31:29 INFO - data factory data_ max_files -1, n_sample -1
13:31:29 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:31:29 INFO - Running locally
2024-10-18 13:31:31,792	INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

✅ Stage:3 completed successfully
CPU times: user 123 ms, sys: 145 ms, total: 267 ms
Wall time: 15.2 s


### 5.3 - Inspect Generated output

You will notice we have two extra columns:

- **chunk_hash** (the hash column)
- **chunk_id** (the integer id column)

But still the same number of rows as before.

In [20]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Input data dimensions (rows x columns)=  (8, 16)
Output data dimensions (rows x columns)=  (8, 18)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_hash,chunk_id
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,4
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,5
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,a31663e06fac41470ecc459f5a58658a3f9997d7801053...,6
3,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,0
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,1
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,2
7,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,3


## Step-6: Exact Dedup



### 6.1 - Set Input/output Folder

In [21]:
STAGE  = 4

input_folder = output_docid_dir # previous output folder is the input folder for the current stage
output_folder =  output_exact_dedupe_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'


### 6.2 - Execute

In [22]:
%%time

from data_processing_ray.runtime.ray import RayTransformLauncher
from ededup_transform_ray import EdedupRayTransformRuntimeConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # ededup parameters
    "ededup_hash_cpu": 0.5,
    "ededup_num_hashes": 2,
    "ededup_doc_column": "contents",
    "ededup_doc_id_column": "chunk_hash",
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())
# launch
return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

13:31:45 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}
13:31:45 INFO - pipeline id pipeline_id
13:31:45 INFO - code location None
13:31:45 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
13:31:45 INFO - actor creation delay 0
13:31:45 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
13:31:45 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out
13:31:45 INFO - data factory data_ max_files -1, n_sample -1
13:31:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:31:45 INFO - Running locally
2024-10-18 13:31:47,001	INFO worker.py:1744 -- Started a local Ray instance. View the da

✅ Stage:4 completed successfully
CPU times: user 136 ms, sys: 154 ms, total: 289 ms
Wall time: 15.2 s


### 6.3 - Inspect Generated output

In [23]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print (f"Input chunks before exact dedupe : {input_df.shape[0]:,}")
print (f"Output chunks after exact dedupe : {output_df.shape[0]:,}")
print ("Duplicate chunks removed :  ", (input_df.shape[0] - output_df.shape[0]))

output_df.head(10)

Input data dimensions (rows x columns)=  (8, 18)
Output data dimensions (rows x columns)=  (7, 19)
Input chunks before exact dedupe : 8
Output chunks after exact dedupe : 7
Duplicate chunks removed :   1


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_hash,chunk_id,removed
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Solar System\nFor more details about the Solar...,$.main-text[3],1,"[133.18510437, 570.83258057, 374.99838257, 581...",dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...,5,[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,a31663e06fac41470ecc459f5a58658a3f9997d7801053...,6,[]
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7,[]
3,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.87112427, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,0,[]
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,1,[]
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,2,[]
6,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,3,[]


In [24]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,Solar System\nFor more details about the Solar...
1,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
2,mars.pdf,Basic facts about Mars:\n· Distance from the S...
3,earth.pdf,Solar System\nOur solar system is a vast and f...
4,earth.pdf,Solar System\nFor more details about our Solar...
5,earth.pdf,Earth\nEarth is the third planet from the Sun....
6,earth.pdf,Earth\nBasic facts about Earth:\n· Distance fr...


In [25]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Solar System
For more details about the Solar system see Chapter 1.
-------
-------Chunk 1------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 2------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Solar System
For more details about our Solar system see Chapter 1.
-------
-------Chunk 2------
Earth
Earth is the third planet from the Sun. It's our home plan

### 6.4 - Understanding the output

Remember we had 8 chunks initially.  Now we have 7!  One duplicate chunk was removed.

If you look at the PDFs, the following paragraph appears in both `earth.pdf` and `mars.pdf`.  Exact dedupe removed it from one of the documents (here the `mars.pdf` copy, while the `earth.pdf` copy is kept).  Pretty neat, eh!

```text
## Solar System

Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
```
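To make the idea concrete, here is a minimal sketch of exact deduplication by content hash. It is not DPK's implementation: the `exact_dedupe` helper is hypothetical, and the chunk texts are shortened stand-ins for the real ones.

```python
import hashlib


def exact_dedupe(chunks):
    """Keep the first occurrence of each distinct text, drop later identical copies."""
    seen = set()
    kept = []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # byte-for-byte duplicate of an earlier chunk
        seen.add(digest)
        kept.append(text)
    return kept


# Shortened stand-ins for the chunks in this notebook (not the full text)
chunks = [
    "Solar System\nOur solar system is a vast and fascinating expanse.",
    "Solar System\nFor more details about our Solar system see Chapter 1.",
    "Solar System\nOur solar system is a vast and fascinating expanse.",
    "Solar System\nFor more details about the Solar system see Chapter 1.",
]

kept = exact_dedupe(chunks)
print(len(chunks), "->", len(kept))  # 4 -> 3
```

Note that the two "For more details ..." chunks both survive: they differ by one word, so their hashes differ. Catching those is the job of fuzzy dedupe in the next step.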

## Step-7: Fuzzy Dedup

After exact deduplication, fuzzy deduplication is applied with the goal of removing items that have only **slight variations**, thereby reducing bias in the data further.

Small variations are quite common.  In code data they show up as changed variable values, added logging statements, etc.  In documents like ours they show up as a changed word or two.

### 7.1 - Set Input/output Folder

In [26]:
## Input to this component is the output of doc_id generator component.

STAGE  = 5

input_folder = output_docid_dir # previous output folder is the input folder for the current stage
output_folder =  output_fuzzy_dedupe_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-5: Processing input='output/03_docid_out' --> output='output/05_fuzzy_dedupe_out'


### 7.2 - Execute

In [27]:
%%time

import os
import sys

from data_processing.utils import ParamsUtils
from fdedup_transform_ray import FdedupRayTransformConfiguration
from data_processing_ray.runtime.ray import RayTransformLauncher

# create parameters

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # Orchestration parameters
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # columns used
    "fdedup_doc_column": "contents",
    "fdedup_id_column": "chunk_id",
    "fdedup_cluster_column": "chunk_hash",
    # infrastructure
    "fdedup_bucket_cpu": 0.3,
    "fdedup_doc_cpu": 0.3,
    "fdedup_mhash_cpu": 0.3,
    "fdedup_num_doc_actors": 1,
    "fdedup_num_bucket_actors": 1,
    "fdedup_num_minhash_actors": 1,
    "fdedup_num_preprocessors": 1,
    # fuzzy parameters
    "fdedup_num_permutations": 64,
    "fdedup_threshold": 0.7, # (default 0.8)
    "fdedup_shingles_size": 5,
    "fdedup_delimiters": " "
}

# Pass commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = RayTransformLauncher(FdedupRayTransformConfiguration())

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

13:32:00 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}
13:32:00 INFO - pipeline id pipeline_id
13:32:00 INFO - code location None
13:32:00 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
13:32:00 INFO - actor creation delay 0
13:32:00 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}
13:32:00 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out
13:32:00 INFO - data factory data_ max_fi

✅ Stage:5 completed successfully
CPU times: user 270 ms, sys: 200 ms, total: 470 ms
Wall time: 36.6 s


### 7.3 - Inspect Generated output

In [28]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)
print ("Duplicate chunks removed  by fuzzy-dedupe:  ", (input_df.shape[0] - output_df.shape[0]))

output_df.head(10)

Input data dimensions (rows x columns)=  (8, 18)
Output data dimensions (rows x columns)=  (6, 18)
Duplicate chunks removed  by fuzzy-dedupe:   2


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_id,chunk_hash
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,4,-1
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,6,-1
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7,-1
3,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,1,5
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,2,-1
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,3,-1
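To see which chunks a dedupe stage dropped, one option is to compare `chunk_id` values before and after. The sketch below uses plain Python sets. The `removed_ids` helper is hypothetical, and the input ids 0 through 7 are an assumption inferred from the ids visible in the outputs above.

```python
def removed_ids(input_ids, output_ids):
    """Return the ids present in the input but missing from the output, sorted."""
    return sorted(set(input_ids) - set(output_ids))


# Assumption: the doc_id stage produced chunk ids 0..7 (8 rows)
input_chunk_ids = list(range(8))

# chunk_id columns as shown in the exact-dedupe and fuzzy-dedupe outputs
after_exact_dedupe = [5, 6, 7, 0, 1, 2, 3]
after_fuzzy_dedupe = [4, 6, 7, 1, 2, 3]

removed_by_exact = removed_ids(input_chunk_ids, after_exact_dedupe)
removed_by_fuzzy = removed_ids(input_chunk_ids, after_fuzzy_dedupe)

print("exact dedupe removed:", removed_by_exact)  # [4]
print("fuzzy dedupe removed:", removed_by_fuzzy)  # [0, 5]
```

With the DataFrames in this notebook, the same comparison could be written as `input_df[~input_df['chunk_id'].isin(output_df['chunk_id'])]`.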


In [29]:
output_df[['filename', 'contents']]

Unnamed: 0,filename,contents
0,mars.pdf,Solar System\nOur solar system is a vast and f...
1,mars.pdf,"Mars\nMars, the fourth planet from the Sun, is..."
2,mars.pdf,Basic facts about Mars:\n· Distance from the S...
3,earth.pdf,Solar System\nFor more details about our Solar...
4,earth.pdf,Earth\nEarth is the third planet from the Sun....
5,earth.pdf,Earth\nBasic facts about Earth:\n· Distance fr...


In [30]:
for f in output_df['filename'].unique():
    print ('==========' , f, '===========')
    chunks = output_df[output_df['filename'] == f]['contents']
    for idx , chunk in enumerate(chunks):
        print (f'-------Chunk {idx}------\n{chunk}\n-------')

-------Chunk 0------
Solar System
Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.
-------
-------Chunk 1------
Mars
Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.
-------
-------Chunk 2------
Basic facts about Mars:
· Distance from the Sun: Average of 228 million kilometers (142 million miles)
· Rotation Period: 24.6 hours (one Martian day - called a "sol")
· Moons: Two small moons, Phobos and Deimos.
-------
-------Chunk 0------
Solar System
For more details about our Solar system see Chapter 1.
-------
-------Chunk 1------
Earth
Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.
-------
-------Chunk 2------
Earth
Basic facts 

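### 7.4 - Understanding the output

We started with 8 rows and ended up with 6, so fuzzy dedupe removed two chunks.  The input to this stage is the doc_id output, not the exact-dedupe output, so one of the two is a copy of the identical 'Solar System' paragraph that exact dedupe also caught in Step-6.  The other is one of the following **very similar** chunks.

The two chunks are identical except for the words 'the' and 'our'.  The `earth.pdf` version was kept and the `mars.pdf` version was removed.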

**earth.pdf**

`For more details about *our* Solar system see Chapter 1.`

**mars.pdf**

`For more details about *the* Solar system see Chapter 1.`

Pretty neat, eh? 👏
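One common way to put a number on "very similar" is Jaccard similarity over word shingles (runs of consecutive words). The sketch below is a generic illustration, not DPK's code: the `shingles` and `jaccard` helpers are hypothetical, and it compares only the two sentences above rather than the full chunk text, so the scores need not match what DPK computes internally.

```python
def shingles(text, size, delimiter=" "):
    """Split text on the delimiter and return the set of word shingles of the given size."""
    tokens = text.split(delimiter)
    count = max(1, len(tokens) - size + 1)
    return {delimiter.join(tokens[i:i + size]) for i in range(count)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


earth = "For more details about our Solar system see Chapter 1."
mars = "For more details about the Solar system see Chapter 1."

for size in (1, 2, 5):
    score = jaccard(shingles(earth, size), shingles(mars, size))
    print(f"shingle size {size}: {score:.3f}")
# shingle size 1: 0.818
# shingle size 2: 0.636
# shingle size 5: 0.091
```

A single changed word touches every shingle that contains it, so larger shingles make short texts look less alike. This is why shingle size and similarity threshold are usually tuned together.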

### Configuring Fuzzy de-dupe

You can tune fuzzy dedupe by adjusting the following parameters:

```python
# fuzzy parameters
"fdedup_num_permutations": 64,
"fdedup_threshold": 0.7,  # (default 0.8)
"fdedup_shingles_size": 5,
"fdedup_delimiters": " "
```

In our case, we set the `fdedup_threshold` parameter to 0.7, which is lower than the default of 0.8.  A lower threshold treats less similar chunks as duplicates, so more chunks are removed.
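The parameter names (`fdedup_num_permutations`, `fdedup_mhash_cpu`) suggest a MinHash-style similarity estimate. The following is a minimal sketch of that general technique under that assumption. It is not DPK's implementation, and `minhash_signature` and `estimated_similarity` are hypothetical helpers. Seeded SHA-256 hashes stand in for random permutations.

```python
import hashlib


def shingles(text, size, delimiter=" "):
    """Set of word shingles of the given size."""
    tokens = text.split(delimiter)
    count = max(1, len(tokens) - size + 1)
    return {delimiter.join(tokens[i:i + size]) for i in range(count)}


def minhash_signature(items, num_permutations):
    """For each seed, keep the smallest hash value over all items (items must be non-empty)."""
    signature = []
    for seed in range(num_permutations):
        smallest = min(
            int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        )
        signature.append(smallest)
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; this approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


earth = "For more details about our Solar system see Chapter 1."
mars = "For more details about the Solar system see Chapter 1."

sig_earth = minhash_signature(shingles(earth, 1), num_permutations=64)
sig_mars = minhash_signature(shingles(mars, 1), num_permutations=64)

print("signature length:", len(sig_earth))
print("estimated similarity:", estimated_similarity(sig_earth, sig_mars))
```

More permutations give a longer signature and a steadier estimate, at the cost of more hashing per chunk. A pair is then treated as a near-duplicate when its estimate reaches the chosen threshold.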


## Step-8: Text encoding

Encode the text of each chunk as a vector (embedding) so that it can be stored in a vector store.

### 8.1 - Set Input/output Folder

In [31]:
STAGE  = 6

input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage
output_folder =  output_embeddings_dir

input_df = read_parquet_files_as_df(input_folder)  ## for debug purposes

print (f"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'")

🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'


### 8.2 - Execute

In [32]:
%%time

from text_encoder_transform_ray import TextEncoderRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    # text_encoder
    "text_encoder_model_name": MY_CONFIG.EMBEDDING_MODEL,
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration())
# Launch the ray actor(s) to process the input

return_code = launcher.launch()

if return_code == 0:
    print (f"✅ Stage:{STAGE} completed successfully")
else:
    raise Exception ("❌ Ray job failed")

13:32:37 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}
13:32:37 INFO - pipeline id pipeline_id
13:32:37 INFO - code location None
13:32:37 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
13:32:37 INFO - actor creation delay 0
13:32:37 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}
13:32:37 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out
13:32:37 INFO - data factory data_ max_files -1, n_sample -1
13:32:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:32:37 INFO - Running locally
2024-10-18 13:32:39,609	INFO worker.py:1744 -- Started a local Ray instanc

✅ Stage:6 completed successfully
CPU times: user 607 ms, sys: 226 ms, total: 833 ms
Wall time: 22.1 s


### 8.3 - Inspect Generated output

You will see a column called `embeddings` added at the end.  This is the text content converted into vectors, or embeddings.  We used the model `sentence-transformers/all-MiniLM-L6-v2`.

In [33]:
from my_utils import read_parquet_files_as_df

output_df = read_parquet_files_as_df(output_folder)

print ("Input data dimensions (rows x columns)= ", input_df.shape)
print ("Output data dimensions (rows x columns)= ", output_df.shape)

output_df.head(10)

Input data dimensions (rows x columns)=  (6, 18)
Output data dimensions (rows x columns)=  (6, 19)


Unnamed: 0,filename,num_pages,num_tables,num_doc_elements,ext,hash,size,date_acquired,pdf_convert_time,source_filename,source_document_id,contents,doc_jsonpath,page_number,bbox,document_id,chunk_id,chunk_hash,embeddings
0,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Solar System\nOur solar system is a vast and f...,$.main-text[2],1,"[132.84518433, 588.96014404, 479.40917969, 623...",44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...,4,-1,"[0.0077404897, -0.020559434, 0.026426662, 0.01..."
1,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,"Mars\nMars, the fourth planet from the Sun, is...",$.main-text[5],1,"[132.87440491, 500.84011841, 477.48345947, 534...",a31663e06fac41470ecc459f5a58658a3f9997d7801053...,6,-1,"[0.07728298, 0.024971062, -0.04318075, 0.05809..."
2,mars.pdf,1,0,11,pdf,8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...,2800,2024-10-18T13:30:59.490007,2.011138,mars.pdf,62e5639f-f922-4ccc-a041-3cb02f1cfd83,Basic facts about Mars:\n· Distance from the S...,$.main-text[6],1,"[133.2026062, 482.90710449, 237.04431152, 493....",7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...,7,-1,"[0.1059802, 0.025460616, 0.02362733, 0.0390564..."
3,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Solar System\nFor more details about our Solar...,$.main-text[3],1,"[133.20942688, 570.81555176, 375.57919312, 581...",d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...,1,5,"[-0.062105577, -0.0053322953, 0.03127779, 0.04..."
4,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nEarth is the third planet from the Sun....,$.main-text[5],1,"[132.91053772, 512.46295166, 477.84887695, 534...",7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...,2,-1,"[0.0724358, -0.058001805, -0.01977186, -0.0243..."
5,earth.pdf,1,0,11,pdf,18713f970989055625bef22209b6f4b6830b9ca22046bf...,2686,2024-10-18T13:30:59.494027,2.015123,earth.pdf,f3c0ac2e-1de2-472b-8216-2043f3b3e9d1,Earth\nBasic facts about Earth:\n· Distance fr...,$.main-text[6],1,"[133.30151367, 494.86206055, 240.17156982, 505...",189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...,3,-1,"[0.091821924, 0.015197907, 0.07716932, 0.01711..."
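Embeddings are typically compared with cosine similarity, which is how a vector store ranks chunks against a query. The sketch below shows the calculation with the standard library only. The 3-dimensional vectors and their labels are made-up stand-ins for illustration; real vectors from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (not produced by the model)
chunk_vectors = {
    "mars facts": [0.9, 0.1, 0.0],
    "earth facts": [0.1, 0.9, 0.0],
    "solar system": [0.5, 0.5, 0.7],
}
query_vector = [0.8, 0.2, 0.1]  # hypothetical embedding of a question about Mars

# Rank chunks from most to least similar to the query
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)  # ['mars facts', 'solar system', 'earth facts']
```

With the real data, each row's `embeddings` value would take the place of the toy vectors, and the query would be encoded with the same model.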


## Step-9: Copy output to final output dir

In [34]:
import shutil

shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)
shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)

print (f"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'")

✅ Copied output from 'output/06_embeddings_out' --> 'output/output_final'
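The cell above deletes the destination before copying. The sketch below shows why that order matters, using a temporary directory so it is safe to run anywhere. The `replace_dir` helper and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def replace_dir(src, dst):
    """Make dst an exact copy of src, discarding whatever dst held before."""
    shutil.rmtree(dst, ignore_errors=True)  # no error if dst does not exist yet
    shutil.copytree(src, dst)               # by default, fails if dst already exists
    return sorted(p.relative_to(dst).as_posix() for p in Path(dst).rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "06_embeddings_out"
    dst = Path(tmp) / "output_final"
    src.mkdir()
    (src / "mars.parquet").write_text("placeholder")  # hypothetical file name

    first = replace_dir(src, dst)

    # Simulate a leftover file from an earlier run, then copy again
    (dst / "stale.txt").write_text("left over from an earlier run")
    second = replace_dir(src, dst)

    print(first, second)  # ['mars.parquet'] ['mars.parquet']
```

Passing `dirs_exist_ok=True` to `shutil.copytree` (Python 3.8+) would also allow re-running, but it merges into the existing folder, so stale files from earlier runs would remain.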
