<div style="background-color: #04D7FD; padding: 20px; text-align: left;">
    <h1 style="color: #000000; font-size: 36px; margin: 0;">Demo: Data Prep Kit</h1>
    
</div>


## Overview
Welcome to the demo notebook! It contains an end-to-end sample pipeline for processing code datasets, starting from GitHub repositories (.zip files) and ending with processed data. The notebook applies the following transforms, in order:

- [Ingest2parquet](#item1)
- [Exact Dedup](#item2)
- [Doc_ID generation](#item3)
- [Fuzzy Dedup](#item4)
- [Programming Language Select](#item5)
- [Code quality](#item6)
- [Filtering](#item7)
- [Tokenization](#item8)

### Getting started

To try this pipeline on your own data, first download your GitHub repositories as .zip files, following the steps below. You can also try it on sample data by downloading a few repositories of interest.

Here's how to download a GitHub repository in ZIP format:

1. Go to the desired repository on GitHub.
2. Click the "Code" button near the top right corner of the repository.
3. Click the "Download ZIP" button.

This will download a ZIP archive of the entire repository to your computer.

Repeat these steps to download a few repositories from GitHub into a single folder. Your data is now ready.

This folder serves as the input to the pipeline. Assign its path to the variable `zip_input_folder` in the cell below.


### Import Common python modules

In [1]:

import os
import sys

from data_processing_ray.runtime.ray import RayTransformLauncher
from data_processing.utils import ParamsUtils

### Set input/output path variables for the pipeline

In [2]:
# Example
# We can set input paths here
zip_input_folder = "input_data"

if not os.path.exists(zip_input_folder):
    print("NO INPUT DATA")
    print("Please set `zip_input_folder` variable to path containing data")

# make sure the paths are correct
data_base_path = "test-data"

parquet_data_output = os.path.join(data_base_path, "parquet_input")

ededup_out =  os.path.join(data_base_path, "ededup_out")

doc_id_out =  os.path.join(data_base_path, "doc_id_out")
fdedup_out = os.path.join(data_base_path, "fdedup_out")

lang_out =  os.path.join(data_base_path,"lang_out")
cq_out = os.path.join(data_base_path,"cq_out")

filter_out = os.path.join(data_base_path ,"filter_out")
tokensization_out = os.path.join(data_base_path ,"tokenization_out")



NO INPUT DATA
Please set `zip_input_folder` variable to path containing data


## <span style="color: green"> 1. Convert data to parquet using ingest2parquet [<-](#top)<a class="anchor" id="item1"></a>
_zip_ to _parquet_ </span>

Raw code data files in zip format are converted to parquet files, where each row of a parquet file corresponds to a separate code file. Apart from the contents of the code file, every row also contains a unique document id, file URL, name of the repository, source of the data, date of acquisition and license of the repository. A language field, detected from the filename extension, is also added for every code file.
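
As a rough illustration of this step (not the transform's actual implementation), the sketch below reads a zip archive and emits one row per file, guessing the language from the file extension. The `EXT_TO_LANG` map, the `zip_to_rows` helper and the column names are assumptions made for this example.

```python
import io
import os
import uuid
import zipfile

# Hypothetical extension map, used only for this illustration.
EXT_TO_LANG = {".py": "Python", ".c": "C", ".md": "Markdown"}


def zip_to_rows(zip_bytes: bytes, repo_name: str) -> list[dict]:
    """Return one row (dict) per file found in the zip archive."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            ext = os.path.splitext(info.filename)[1].lower()
            rows.append(
                {
                    "document_id": str(uuid.uuid4()),
                    "title": info.filename,
                    "repo_name": repo_name,
                    "contents": archive.read(info).decode("utf-8", errors="replace"),
                    "programming_language": EXT_TO_LANG.get(ext, "unknown"),
                }
            )
    return rows


# Build a tiny in-memory archive to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("repo/main.py", "print('hello')\n")
    archive.writestr("repo/README.md", "# Demo\n")

rows = zip_to_rows(buffer.getvalue(), repo_name="repo")
for row in rows:
    print(row["title"], row["programming_language"])
```

Each dictionary here plays the role of one parquet row; a real pipeline would write these rows to a parquet table rather than print them.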




### Set Input/output Folder

In [4]:
# For this stage, the input folder contains the zip files; each zip file contains a GitHub repo.

input_folder = zip_input_folder
output_folder =  parquet_data_output

### Execute 

In [None]:
import ast
import os
import sys

from code2parquet_transform import (
    detect_programming_lang_cli_key,
    supported_langs_file_cli_key,
)
from code2parquet_transform_ray import CodeToParquetRayConfiguration
from data_processing.utils import GB, ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher


# create parameters
supported_languages_file = os.path.abspath(
    "../../../transforms/code/code2parquet/python/test-data/languages/lang_extensions.json"
)

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8, "memory": 2 * GB}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
ingest_config = {
    supported_langs_file_cli_key: supported_languages_file,
    detect_programming_lang_cli_key: True,
}

params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.zip']"),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}


sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))
# create launcher
launcher = RayTransformLauncher(CodeToParquetRayConfiguration())
# launch
launcher.launch()

##  <span style="color: green">   2. Exact Dedup [<-](#top)<a class="anchor" id="item2"></a> </span>

Remove documents with identical code to reduce bias in the training data. A SHA256 hash is computed on the content of each document, and records with identical hashes are then de-duplicated.
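
The idea can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not the Ray-based transform; the `exact_dedup` helper and the sample documents are made up for the example.

```python
import hashlib


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document."""
    seen = set()
    unique = []
    for doc in docs:
        # Identical contents always produce identical SHA256 digests.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["print(1)", "print(2)", "print(1)"]
unique_docs = exact_dedup(docs)
print(unique_docs)
```

Storing digests rather than full documents keeps the memory needed for the `seen` set small and independent of document size.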

### Set Input/output Folder

In [5]:
## For this stage the input is the folder containing parquet data which is output from the ingest2parquet tool

input_folder = parquet_data_output
output_folder = ededup_out

print(input_folder)
print(output_folder)

test-data/parquet_input
test-data/ededup_out


### Execute 

In [6]:
# Import ededup transform configuration
from ededup_transform_ray import EdedupRayTransformConfiguration


# Prepare the commandline params
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    # ededup parameters
    "ededup_hash_cpu": 0.5,
    "ededup_num_hashes": 2,
    "ededup_doc_column": "contents",
}

# Pass the commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
ededup_launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
# launch
ededup_launcher.launch()

23:31:45 INFO - Running locally
23:31:45 INFO - exact dedup params are {'hash_cpu': 0.5, 'num_hashes': 2, 'doc_column': 'contents'}
23:31:45 INFO - data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out
23:31:45 INFO - data factory data_ max_files -1, n_sample -1
23:31:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:31:45 INFO - pipeline id pipeline_id
23:31:45 INFO - code location None
23:31:45 INFO - number of workers 3 worker options {'num_cpus': 0.8}
23:31:45 INFO - actor creation delay 0
23:31:45 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
2024-06-19 23:31:47,958	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=52070) 23:31:48 INFO - orchestra

0

## <span style="color: green">  3. DOC ID generation [<-](#top)<a class="anchor" id="item3"></a> </span>

This transform annotates documents with document "ids". It supports the following transformations of the original data:

 - Adding a document hash: this adds a hash-based document id to the data. The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set hash_column to the name of the column in which to store it.
 - Adding an integer document id: this adds an integer document id that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set int_id_column to the name of the column in which to store it. **This is a prerequisite for fuzzy dedup** in the pipeline.
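
A minimal sketch of the two annotations on an in-memory list of documents follows. The column names match the ones configured in the cell below; the sample documents and the use of `enumerate` for the integer id are assumptions made for the example.

```python
import hashlib

docs = ["def f(): pass", "x = 1"]

rows = [
    {
        "contents": doc,
        # Hash-based id, computed with the expression quoted above.
        "hash_column": hashlib.sha256(doc.encode("utf-8")).hexdigest(),
        # Sequential integer id, unique within this small example.
        "int_id_column": int_id,
    }
    for int_id, doc in enumerate(docs)
]

for row in rows:
    print(row["int_id_column"], row["hash_column"][:12], row["contents"])
```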

In [7]:
# Input for this stage is the output of the exact dedup component.
# The output of this component makes it possible for the fdedup component to run on the data.

input_folder = ededup_out
output_folder = doc_id_out

print(input_folder)
print(output_folder)


test-data/ededup_out
test-data/doc_id_out


In [8]:
from doc_id_transform_ray import DocIDRayTransformConfiguration
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    # doc id configuration
    "doc_id_doc_column": "contents",
    "doc_id_hash_column": "hash_column",
    "doc_id_int_column": "int_id_column",
}
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

launcher = RayTransformLauncher(DocIDRayTransformConfiguration())
launcher.launch()

23:32:01 INFO - Running locally
23:32:01 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'hash_column', 'int_column': 'int_id_column'}
23:32:01 INFO - data factory data_ is using local data access: input_folder - test-data/ededup_out output_folder - test-data/doc_id_out
23:32:01 INFO - data factory data_ max_files -1, n_sample -1
23:32:01 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:32:01 INFO - pipeline id pipeline_id
23:32:01 INFO - code location None
23:32:01 INFO - number of workers 3 worker options {'num_cpus': 0.8}
23:32:01 INFO - actor creation delay 0
23:32:01 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}
2024-06-19 23:32:03,187	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=52123)

0

## <span style="color: green">  4. Fuzzy Dedup [<-](#top)<a class="anchor" id="item4"></a> </span>

After exact deduplication, fuzzy deduplication is applied to remove code files that differ only slightly, further reducing bias in the data. Small variations are common in code data, for example different variable values or added logging statements. This step finds and removes such near-duplicates.
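
To make "near-duplicate" concrete, the sketch below compares documents by the Jaccard similarity of their word shingles. The parameter names used in the cell further down (`fdedup_num_permutations`, `fdedup_shingles_size`, `fdedup_threshold`) suggest a MinHash-based approach; this example swaps in exact Jaccard similarity, which is the quantity MinHash approximates. The helper names and sample documents are assumptions made for illustration.

```python
def shingles(text: str, size: int = 5, delimiter: str = " ") -> set[tuple[str, ...]]:
    """Return the set of word n-grams (shingles) of the given size."""
    words = [w for w in text.split(delimiter) if w]
    if len(words) <= size:
        return {tuple(words)}
    return {tuple(words[i : i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


THRESHOLD = 0.8  # same value as fdedup_threshold in the cell below

doc_a = "def add ( a , b ) : result = a + b return result"
doc_b = doc_a + " # done"  # same code with a trailing comment
doc_c = "import os print ( os . getcwd ( ) )"

sim_ab = jaccard(shingles(doc_a), shingles(doc_b))
sim_ac = jaccard(shingles(doc_a), shingles(doc_c))
print(f"a vs b: {sim_ab:.2f}, near-duplicate: {sim_ab >= THRESHOLD}")
print(f"a vs c: {sim_ac:.2f}, near-duplicate: {sim_ac >= THRESHOLD}")
```

Comparing every pair of documents this way scales quadratically, which is why large pipelines typically rely on hashing schemes such as MinHash to find candidate pairs instead.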

### Set Input/output Folder

In [9]:
## Input to this component is the output of doc_id generator component. 

input_folder = doc_id_out
output_folder = fdedup_out

print(input_folder)
print(output_folder)

test-data/doc_id_out
test-data/fdedup_out


### Execute 

In [10]:
import os
import sys

from data_processing.utils import ParamsUtils
from fdedup_transform_ray import FdedupRayTransformConfiguration

# create parameters

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # Orchestration parameters
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    # columns used
    "fdedup_doc_column": "contents",
    "fdedup_id_column": "int_id_column",
    "fdedup_cluster_column": "hash_column",
    # infrastructure
    "fdedup_bucket_cpu": 0.5,
    "fdedup_doc_cpu": 0.5,
    "fdedup_mhash_cpu": 0.5,
    "fdedup_num_doc_actors": 2,
    "fdedup_num_bucket_actors": 1,
    "fdedup_num_minhash_actors": 1,
    "fdedup_num_preprocessors": 2,
    # fuzzy parameters
    "fdedup_num_permutations": 64,
    "fdedup_threshold": 0.8,
    "fdedup_shingles_size": 5,
    "fdedup_delimiters": " "
}

# Pass commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch

fdedup_launcher = RayTransformLauncher(FdedupRayTransformConfiguration())
fdedup_launcher.launch()

23:32:16 INFO - Running locally
23:32:16 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'int_id_column', 'cluster_column': 'hash_column', 'bucket_cpu': 0.5, 'mhash_cpu': 0.5, 'doc_cpu': 0.5, 'num_doc_actors': 2, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 2, 'num_permutations': 64, 'threshold': 0.8, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}
23:32:16 INFO - data factory data_ is using local data access: input_folder - test-data/doc_id_out output_folder - test-data/fdedup_out
23:32:16 INFO - data factory data_ max_files -1, n_sample -1
23:32:16 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:32:16 INFO - pipeline id pipeline_id
23:32:16 INFO - code location None
23:32:16 INFO - number of wo

0

## <span style="color: green">  5. Programming language annotation [<-](#top)<a class="anchor" id="item5"></a> </span>

The raw data may contain many programming languages, of which we want to retain a prioritised list. This component takes a file containing newline-separated names of the languages to select and annotates the data with a new boolean column. The filter component can then use this column to select the required languages.
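
The annotation itself amounts to a set-membership test per row. The sketch below uses made-up file contents and rows; only the column names `programming_language` and `lang_selected` come from this notebook.

```python
# Hypothetical contents of the allowed-languages file (one name per line).
allowed_languages_text = "Python\nJava\nC\n"
allowed = {line.strip() for line in allowed_languages_text.splitlines() if line.strip()}

rows = [
    {"title": "main.py", "programming_language": "Python"},
    {"title": "notes.tex", "programming_language": "TeX"},
]

# Add a boolean column recording whether each row's language is allowed.
for row in rows:
    row["lang_selected"] = row["programming_language"] in allowed

for row in rows:
    print(row["title"], row["lang_selected"])
```

Note that no rows are removed at this stage; the column only marks them for the later filtering step.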

### Set Input/output Folder

In [11]:

input_folder = fdedup_out
output_folder = lang_out 
selected_languages_file = "./test-data/allowed-code-languages.txt"


### Execute 

In [12]:
import os
import sys

from data_processing.utils import ParamsUtils
from proglang_select_transform_ray import ProgLangSelectRayConfiguration
from proglang_select_transform import (
    lang_allowed_langs_file_key,
    lang_lang_column_key,
    lang_output_column_key,
)

# create parameters
language_column_name = "programming_language"
annotated_column_name = "lang_selected"

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
langselect_config = {
    lang_allowed_langs_file_key: selected_languages_file,
    lang_lang_column_key: language_column_name,
    lang_output_column_key: annotated_column_name,
}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 1,
    # language selection specific parameters
    **langselect_config,
}

sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(ProgLangSelectRayConfiguration())
launcher.launch()


23:33:05 INFO - Running locally
23:33:05 INFO - data factory proglang_select_ is using local configuration without input/output path
23:33:05 INFO - data factory proglang_select_ max_files -1, n_sample -1
23:33:05 INFO - data factory proglang_select_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:33:05 INFO - data factory data_ is using local data access: input_folder - test-data/fdedup_out output_folder - test-data/lang_out
23:33:05 INFO - data factory data_ max_files -1, n_sample -1
23:33:05 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:33:05 INFO - pipeline id pipeline_id
23:33:05 INFO - code location None
23:33:05 INFO - number of workers 1 worker options {'num_cpus': 0.8}
23:33:05 INFO - actor creation delay 0
23:33:05 INFO - job details {'job category': 'preprocessing', 'job

0

## <span style="color: green">  6. Code Quality [<-](#top)<a class="anchor" id="item6"></a> </span>

We experimented with various code quality metrics but finally retained
the four used by Li et al. (2023) to balance the trade-off between
code quality and data volume.


### Set Input/output Folder

In [13]:
input_folder = lang_out
output_folder = cq_out

print(input_folder)
print(output_folder)

test-data/lang_out
test-data/cq_out


### Execute 

In [14]:
import os
import sys
from pathlib import Path

from code_quality_transform_ray import CodeQualityRayTransformConfiguration
from data_processing.utils import ParamsUtils

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

language_column_name = "programming_language"

worker_options = {"num_cpus": 0.8}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    # code quality configuration
    "cq_contents_column_name": "contents",
    "cq_language_column_name": language_column_name,
}


Path(output_folder).mkdir(parents=True, exist_ok=True)

sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(CodeQualityRayTransformConfiguration())
# launch
launcher.launch()

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
23:33:20 INFO - Running locally
23:33:20 INFO - data factory data_ is using local data access: input_folder - test-data/lang_out output_folder - test-data/cq_out
23:33:20 INFO - data factory data_ max_files -1, n_sample -1
23:33:20 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:33:20 INFO - pipeline id pipeline_id
23:33:20 INFO - code location None
23:33:20 INFO - number of workers 3 worker options {'num_cpus': 0.8}
23:33:20 INFO - actor creation delay 0
23:33:20 INFO - job details {'job category': 'preprocessing', 'job name': 'code_quality', 'job type': 'ray', 'job id': 'job_id'}
2024-06-19 23:33:22,278	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

0

## <span style="color: green">  7. Filtering [<-](#top)<a class="anchor" id="item7"></a> </span>

Filter out documents that do not meet the quality threshold for each annotation. The thresholds are based on a distributional analysis as well as manual inspection of samples, maintaining the balance between data quality and data volume.
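
As a plain-Python sketch of what such filtering does, the example below keeps rows that satisfy every criterion and then drops helper columns. The criteria mirror the example ones used in this notebook (`total_num_lines` between 10 and 90, `lang_selected` set); the rows and the lambda-based predicates are assumptions, since the transform itself takes its criteria as strings.

```python
rows = [
    {"title": "a.py", "total_num_lines": 42, "lang_selected": True, "hash_column": "h1"},
    {"title": "b.py", "total_num_lines": 5, "lang_selected": True, "hash_column": "h2"},
    {"title": "c.tex", "total_num_lines": 30, "lang_selected": False, "hash_column": "h3"},
]

# Each criterion is a predicate over one row.
criteria = [
    lambda r: 10 < r["total_num_lines"] < 90,
    lambda r: r["lang_selected"],
]
columns_to_drop = ["lang_selected", "hash_column"]

# Keep rows that satisfy every criterion (logical AND), then drop helper columns.
filtered = [
    {k: v for k, v in row.items() if k not in columns_to_drop}
    for row in rows
    if all(check(row) for check in criteria)
]
print(filtered)
```

Replacing `all` with `any` would give the behaviour of a logical OR across criteria.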

### Set Input/output Folder

In [15]:
input_folder = cq_out
output_folder = filter_out

### Execute 

In [16]:
import os

from data_processing.data_access import DataAccessLocal
from filter_transform import (
    filter_columns_to_drop_cli_param,
    filter_criteria_cli_param,
    filter_logical_operator_cli_param,
)
from filter_transform_ray import FilterRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

# These are just example filter criteria
filter_criteria = [
    "total_num_lines > 10 AND total_num_lines < 90",
    "lang_selected = 1",
]
filter_logical_operator = "AND"
filter_columns_to_drop = ["lang_selected", "hash_column"]

filter_params = {
    filter_criteria_cli_param: filter_criteria,
    filter_columns_to_drop_cli_param: filter_columns_to_drop,
    filter_logical_operator_cli_param: filter_logical_operator,
}

worker_options = {"num_cpus": 0.8}
launcher_params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 5,
}


sys.argv = ParamsUtils.dict_to_req(launcher_params | filter_params)
# Create the launcher for the filter transform.
launcher = RayTransformLauncher(FilterRayTransformConfiguration())
# Launch the ray actor(s) to process the input
launcher.launch()

23:33:47 INFO - Running locally
23:33:47 INFO - data factory data_ is using local data access: input_folder - test-data/cq_out output_folder - test-data/filter_out
23:33:47 INFO - data factory data_ max_files -1, n_sample -1
23:33:47 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:33:47 INFO - pipeline id pipeline_id
23:33:47 INFO - code location None
23:33:47 INFO - number of workers 5 worker options {'num_cpus': 0.8}
23:33:47 INFO - actor creation delay 0
23:33:47 INFO - job details {'job category': 'preprocessing', 'job name': 'filter', 'job type': 'ray', 'job id': 'job_id'}
2024-06-19 23:33:49,730	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=52332) 23:33:50 INFO - orchestrator started at 2024-06-19 23:33:50
(orchestrate pid=52332) 23:33:50 INFO - Number of files is 5, s

0

## <span style="color: green">  8. Tokenization [<-](#top)<a class="anchor" id="item8"></a> </span>

The data tokenization transform maps a (non-empty) input table to an output table using a pre-trained tokenizer. The input table must contain at least two columns, by default named document_id and contents. The transform tokenizes each row of the input table (each row is assumed to be a document) and writes a corresponding row to the output table.

A pre-trained tokenizer must be specified through the --tkn_tokenizer parameter, which can be the name of a ready-for-download tokenizer from Hugging Face, such as hf-internal-testing/llama-tokenizer or bigcode/starcoder, or any other tokenizer that can be loaded by the Hugging Face `AutoTokenizer` class.
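
The row-wise mapping can be sketched without downloading a model. In the example below a toy whitespace tokenizer stands in for the pre-trained one; `toy_tokenize`, the output column names `tokens` and `token_count`, and the sample rows are all assumptions made for illustration.

```python
def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Stand-in tokenizer: map whitespace-separated words to integer ids."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


input_rows = [
    {"document_id": "doc-0", "contents": "x = x + 1"},
    {"document_id": "doc-1", "contents": "print ( x )"},
]

vocab: dict[str, int] = {}
output_rows = []
for row in input_rows:
    # One output row per input document, keyed by the same document id.
    tokens = toy_tokenize(row["contents"], vocab)
    output_rows.append(
        {"document_id": row["document_id"], "tokens": tokens, "token_count": len(tokens)}
    )

for row in output_rows:
    print(row)
```

A pre-trained tokenizer would use a fixed subword vocabulary instead of building one on the fly, but the shape of the mapping (one document in, one row of token ids out) is the same.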


In [17]:
input_folder = filter_out
output_folder = tokensization_out

In [18]:
from tokenization_transform_ray import TokenizationRayConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 5,
}

sys.argv = ParamsUtils.dict_to_req(d=params)
# create launcher
launcher = RayTransformLauncher(TokenizationRayConfiguration())
# Launch the ray actor(s) to process the input
launcher.launch()


23:34:02 INFO - Running locally
23:34:02 INFO - data factory data_ is using local data access: input_folder - test-data/filter_out output_folder - test-data/tokenization_out
23:34:02 INFO - data factory data_ max_files -1, n_sample -1
23:34:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
23:34:02 INFO - pipeline id pipeline_id
23:34:02 INFO - code location None
23:34:02 INFO - number of workers 5 worker options {'num_cpus': 0.8}
23:34:02 INFO - actor creation delay 0
23:34:02 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}
2024-06-19 23:34:04,667	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=52380) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configur

0

## Repo Level Ordering Transform

In [None]:
from repo_level_order_transform import RepoLevelOrderRayTransformConfiguration

input_folder = "../../../transforms/code/repo_level_ordering/ray/test-data/input"
output_folder = "./output"

import tempfile

with tempfile.TemporaryDirectory() as tmpdirname:

    # create parameters
    local_conf = {
        "input_folder": input_folder,
        "output_folder": output_folder,
    }

    worker_options = {"num_cpus": 0.8}
    code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
    params = {
        # where to run
        "run_locally": True,
        # Data access. Only required parameters are specified
        "data_local_config": ParamsUtils.convert_to_ast(local_conf),
        # orchestrator
        "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
        "runtime_num_workers": 2,
        "runtime_pipeline_id": "pipeline_id",
        "runtime_job_id": "job_id",
        "runtime_creation_delay": 0,
        "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
    }


    repo_level_params = {
        "repo_lvl_sorting_algo": "SORT_SEMANTIC_NORMALISED",
        "repo_lvl_store_type": "local",
        "repo_lvl_store_backend_dir": tmpdirname,
        "repo_lvl_output_by_langs": True,
        "repo_lvl_combine_rows": True,
        "repo_lvl_sorting_enabled": True,
        "data_local_config": ParamsUtils.convert_to_ast(local_conf)
    }

    sys.argv = ParamsUtils.dict_to_req(d=params | repo_level_params)
    launcher = RayTransformLauncher(RepoLevelOrderRayTransformConfiguration())
    # Launch the ray actor(s) to process the input
    launcher.launch()

12:08:28 INFO - Running locally
12:08:28 INFO - data factory data_ is using local data access: input_folder - ../../../transforms/code/repo_level_ordering/ray/test-data/input output_folder - ./output
12:08:28 INFO - data factory data_ max_files -1, n_sample -1
12:08:28 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:08:28 INFO - pipeline id pipeline_id
12:08:28 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
12:08:28 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
12:08:28 INFO - actor creation delay 0
12:08:28 INFO - job details {'job category': 'preprocessing', 'job name': 'repo_lvl', 'job type': 'ray', 'job id': 'job_id'}


Creating Store Params


2024-09-02 12:08:29,866	INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

(orchestrate pid=8886) Init Store params

(orchestrate pid=8886) 12:08:30 INFO - orchestrator started at 2024-09-02 12:08:30
(orchestrate pid=8886) 12:08:30 INFO - Number of files is 2, source profile {'max_file_size': 0.043808937072753906, 'min_file_size': 0.04120159149169922, 'total_file_size': 0.08501052856445312}
(orchestrate pid=8886) 12:08:30 INFO - Cluster resources: {'cpus': 16, 'gpus': 0, 'memory': 28.01360473688692, 'object_store': 2.0}
(orchestrate pid=8886) 12:08:30 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each
(orchestrate pid=8886) 12:08:30 INFO - => get_transform_config started
(orchestrate pid=8886) 12:08:30 INFO - dict_keys(['store_backend_dir', 'store_type', 's3_creds'])
(orchestrate pid=8886) 12:08:30 INFO - <= get_transform_config
(orchestrate pid=8886) 12:08:30 INFO - Completed 0 files (0.0%)  in 0.0 min. Waiting for completion

(RayTransformFileProcessor pid=8895) Creating local store.

(orchestrate pid=8886) 12:08:31 INFO - Completed processing 2 files in 0.01 min
(orchestrate pid=8886) 12:08:31 INFO - done flushing in 0.001 sec
(orchestrate pid=8886) 12:08:31 INFO - Store Backend is None
(orchestrate pid=8886) 12:08:31 INFO - Stage 1 Finished in 0:00:00.604581.
(orchestrate pid=8886) 12:08:31 I - Repo level sorting is enabled. Algo: SORT_SEMANTIC_NORMALISED
(orchestrate pid=8886) 12:08:31 I - normalised semantic sort enabled
(orchestrate pid=8886) 12:08:31 I - Output by language enabled.
(orchestrate pid=8886) 12:08:31 I - Combine rows enabled.
(orchestrate pid=8886) 12:08:31 I - Processing 2 repos with 2 workers
(orchestrate pid=8886) 12:08:33 I - Finished the transform in 0:00:02.707978 
(GroupByRepoActor pid=8898) 12:08:33 I - Write unknown/repo2, tables: 1

(GroupByRepoActor pid=8898) Most promiment languages:  [unknown ,Tex]
(GroupByRepoActor pid=8898) returning from the end of function. chosen language: unknown
(GroupByRepoActor pid=8899) Most promiment languages:  [unknown ,Markdown]

12:08:43 INFO - Completed execution in 0.253 min, execution result 0
(GroupByRepoActor pid=8899) 12:08:33 I - Write unknown/SchapplM%2Frobotics-paper_ark2022_3T1R, tables: 1

(orchestrate pid=8886) Creating local store. [repeated 2x across cluster]
(GroupByRepoActor pid=8899) returning from the end of function. chosen language: unknown
