# **Demo on building data prep pipeline for model fine tuning** 

<a href="https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/code/sample-notebook.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This demo notebook shows how to use [data-prep-kit](https://github.com/IBM/data-prep-kit) to build a data preparation pipeline that can be used for fine tuning or extended pre-training. We will discuss the various data preparation steps to process raw data (code repositories), tokenise it that can then be fine tuned using any popular code models. We will also discuss a novel recipe for semantic ordering of files in a repository which has shown to enhance model training. Please see our [paper](https://arxiv.org/abs/2407.13739) here for more details. For this demo, we will use the [codeparrot/github-code](https://huggingface.co/datasets/codeparrot/github-code) dataset hosted on Hugging Face datasets. 



## Setup

Install data-prep-toolkit and datasets library. This notebook requires atleast 8 cpus. 
To run on google colab, it is recommended to change the runtime to TPUs to get the required number of cpus.


In [1]:
%%capture logpip --no-stderr
!pip install data-prep-toolkit-transforms-ray==0.2.1.dev1
!pip install datasets

We use parallel processing capability using Ray, so that beyond the demo, a user can also use this for actual production runs on larger datasets, with minor code changes. Please read [here](https://github.com/IBM/data-prep-kit?tab=readme-ov-file#-about-) on various features of data-prep-kit that includes flexibility of compute to run from laptop to cluster.  There are three parameters, that the user can change, as per usecase:

`runtime_num_worker`: number of parallel workers to be used

`num_cpus`: number of cpus to be used per worker

`run_locally: True` start a ray cluster for parallel computation


In [17]:
from data_processing_ray.runtime.ray import RayTransformLauncher
from data_processing.utils import ParamsUtils
import sys

#Default parameters for computation
worker_options = {"num_cpus": 0.8}
common_config_params = {
        "run_locally": True,
        "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
        "runtime_num_workers": 2,
    }




We will do all the processing in `sample_data` folder. This concludes our setup section. 

In [18]:
!mkdir -p sample_data
!mkdir -p sample_data/hf_2_parquet

## Data Preparation Steps

We now discuss the various data preparation steps to transform the raw data to a tokenised format post cleaning and transforming the data. We use the [parquet data format](https://parquet.apache.org/) for all our operations. This helps to efficiently scale the data for actual production runs, beyond the demo. 

1. HuggingFace2Parquet: Read the dataset from HF and convert into parquet format. 
2. Exact Deduplication: Remove exact duplicates. 
3. Fuzzy Deduplication: Remove near duplicates. 
4. Programming Lang Selection: Select the programming languages to be used for the analysis.
5. Code Quality Annotations: Annotate whether a given code file is of high quality or not using various rules.
6. Filtering: Filter dataset to retain only programming language of interest. 
7. Semantic Ordering: Organise code files by their semantic dependencies.  
8. Tokenization: Tokenise the data for model fine tuning.

The data processing pipeline is organised such that the output of the previous transform is used as input to the next one. Refer to the papers [here](https://arxiv.org/pdf/2405.04324) and [here](https://arxiv.org/abs/2407.13739) for complete details for each of the above steps. 

## 1. Huggingface datasets to Parquet

This is the first component of this pipeline. It ingests a dataset `codeparrot/github-code` from huggingface and converts it into
parquet files for consumption by the next steps in this data processing pipeline.

For this demo we are trying to process a few records. The following fields can be updated in case you want to use more data.
_total_files_ = 10 <br/>
_rows_per_file_ = 10

The output of this stage of the pipeline would be written to `sample_data/hf_2_parquet`.

In [19]:
import os
import pyarrow as pa
import pyarrow.parquet as pq

from datasets import load_dataset

import uuid
from data_processing.utils import TransformUtils
from collections import defaultdict

DATASET_NAME='codeparrot/github-code'

ds = load_dataset(DATASET_NAME, 
                  streaming=True, 
                  split="train",
                  trust_remote_code=True)

def row_mapper(row):
    return {
            'ext': TransformUtils.get_file_extension(row['path'])[1],
            'document_id': str(uuid.uuid4())
            }

parquet_data_output = "sample_data/hf_2_parquet"

def hf_dataset_to_parquet(ds, skip, nrows, file_name, mapper=None, renamed_columns=[]):
    dst_ = ds.skip(skip).take(nrows)
    data_dict = defaultdict(list)

    dst = dst_.map(mapper)

    for data in dst:
        for k, v in data.items():
            data_dict[k].append(v)

    for old, new in renamed_columns:
        data_dict[new] = data_dict[old]
        del data_dict[old]

    table = pa.Table.from_pydict(data_dict)
    pq.write_table(table, file_name)


## Create parquet files 

total_files = 10
rows_per_file = 10
for num in range(total_files):
    file_name = os.path.join(
        f"{parquet_data_output}",
        f"data_{num}.parquet"
    )
    print (f"Writing {file_name}")
    hf_dataset_to_parquet(ds, 
                          1 * rows_per_file,
                          rows_per_file,
                          file_name=file_name,
                          mapper=row_mapper,
                          renamed_columns=[("code", "contents"),
                                           ("path", "title")])

Writing sample_data/hf_2_parquet/data_0.parquet
Writing sample_data/hf_2_parquet/data_1.parquet
Writing sample_data/hf_2_parquet/data_2.parquet
Writing sample_data/hf_2_parquet/data_3.parquet
Writing sample_data/hf_2_parquet/data_4.parquet
Writing sample_data/hf_2_parquet/data_5.parquet
Writing sample_data/hf_2_parquet/data_6.parquet
Writing sample_data/hf_2_parquet/data_7.parquet
Writing sample_data/hf_2_parquet/data_8.parquet
Writing sample_data/hf_2_parquet/data_9.parquet


## 2. Exact deduplication

This step will find exact duplicates in the 'content' column and remove them. This is done by computing SHA256 hash on the code files and remove records having identical hashes.

The transform specific params for exact deduplication are: <br/>
 _ededup_hash_cpu_ -  Number of cpus per worker <br/>
 _ededup_num_hashes_ - Number of workers used to store hashes <br/>
 _ededup_doc_column_ - Name of column which has to be checked for deduplication <br/>


In [20]:
import os
import sys
from ededup_transform_ray import EdedupRayTransformConfiguration

input_folder = parquet_data_output # Output of previous stage is used as input.
output_folder = "sample_data/ededup_out"

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

ededup_params = {
    # ededup parameters
    "ededup_hash_cpu": 0.5,
    "ededup_num_hashes": 2,
    "ededup_doc_column": "contents",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

params = common_config_params | ededup_params
sys.argv = ParamsUtils.dict_to_req(d=params)
ededup_launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
ededup_launcher.launch()

18:41:16 INFO - Running locally
18:41:16 INFO - exact dedup params are {'doc_column': 'contents', 'hash_cpu': 0.5, 'num_hashes': 2}
18:41:16 INFO - exact dedup params are {'doc_column': 'contents', 'hash_cpu': 0.5, 'num_hashes': 2}
18:41:16 INFO - data factory data_ is using local data access: input_folder - sample_data/hf_2_parquet output_folder - sample_data/ededup_out
18:41:16 INFO - data factory data_ max_files -1, n_sample -1
18:41:16 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
18:41:16 INFO - pipeline id pipeline_id
18:41:16 INFO - code location None
18:41:16 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
18:41:16 INFO - actor creation delay 0
18:41:16 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
2024-08-21 18:41:18,293	INFO worker.py:1744 -- Started a local Ray in

0

## 3. Fuzzy Deduplication

This step will find near duplicates and remove them. The code is broken into two code cells, one for adding document ids to the parquet file and then running fuzzy dedup. Document id addition is a prerequisite for fuzzy dedup. 

We first add the document ids as an additional column to the parquet files. <br/>
_doc_column_ - specifies name of the column containing the document (required for ID generation) <br/>
_hash_column_ - specifies name of the column created to hold the string document id, if None, id is not generated <br/>
_int_id_column_ - specifies name of the column created to hold the integer document id, if None, id is not generated <br/>
At least one of hash_column or int_id_column must be specified.



In [21]:
input_folder = "sample_data/ededup_out"
output_folder = "sample_data/docid_out"


from doc_id_transform_ray import DocIDRayTransformConfiguration
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

doc_id_params = {
    # doc id configuration
    "doc_id_doc_column": "contents",
    "doc_id_hash_column": "hash_column",
    "doc_id_int_column": "int_id_column",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

params = doc_id_params | common_config_params
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = RayTransformLauncher(DocIDRayTransformConfiguration())
launcher.launch()

Post adding the document ids, the next step is to run fuzzy deduplication. We apply a two-step method for this: (1) compute MinHashes of all the documents and then utilize Locally Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints, (2) measure Jaccard similarity between each pair of documents
in the same bucket and annotate documents except one as duplicates based on a similarity
threshold.  

Some important transform specific params are: <br/>
_fdedup_doc_column_ - Column to be used for deduplication <br/>
_fdedup_threshold_ - specifies the Jaccard similarity threshold (default is 0.7)

In [22]:
input_folder = "sample_data/docid_out"
output_folder = "sample_data/fdedup_out"

import os
import sys

from data_processing.utils import ParamsUtils
from fdedup_transform_ray import FdedupRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
fdedup_params = {
    # columns used
    "fdedup_doc_column": "contents",
    "fdedup_id_column": "int_id_column",
    "fdedup_cluster_column": "hash_column",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

params = common_config_params| fdedup_params

# Pass commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch
fdedup_launcher = RayTransformLauncher(FdedupRayTransformConfiguration())
fdedup_launcher.launch()

## 4. Programming Language Selection

This module helps retain the code files for language of interest which can be specified using selected_languages_file. Post this step, a new column is added, that contains the programming language name. One can use the code in the Filtering step to do analytics on how many files are found for which languages and thereby selectively filter. 

The important parameters used by this transform are: <br/>
_lang_allowed_langs_file_key_ - A file with a list of allowed languages. <br/>
_lang_lang_column_key_ - The name of column which has programming language. <br/>
_lang_output_column_key_ - The name of annotation column. <br/>

For this demo, we will use this [file](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/test-data/languages/allowed-code-languages.txt) to specify languages of interest and the module will add a new column called "language_of_interest" which can have two values 0/1. 1 is added for all rows that have code files belonging to programming language specified in the list.

In [23]:
input_folder = "sample_data/fdedup_out"
output_folder = "sample_data/ps_out"

# download allowed-code-languages.txt
!wget https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/proglang_select/python/test-data/languages/allowed-code-languages.txt
selected_languages_file = "./allowed-code-languages.txt"

from proglang_select_transform_ray import ProgLangSelectRayConfiguration
from proglang_select_transform import (
    lang_allowed_langs_file_key,
    lang_lang_column_key,
    lang_output_column_key,
)

# create parameters
language_column_name = "language"
annotated_column_name = "language_of_interest"

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

langselect_config = {
    lang_allowed_langs_file_key: selected_languages_file,
    lang_lang_column_key: language_column_name,
    lang_output_column_key: annotated_column_name,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

params = common_config_params| langselect_config

sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(ProgLangSelectRayConfiguration())
launcher.launch()

## 5. Code Quality

We experiment with various code quality metrics but finally retain the four code quality metrics used by (Li et al., 2023) to balance the tradeoff between code quality versus data volume.

In [24]:
input_folder = "sample_data/ps_out"
output_folder = "sample_data/cq_out"

from code_quality_transform_ray import CodeQualityRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
language_column_name = "language"
params = {
    "cq_contents_column_name": "contents",
    "cq_language_column_name": language_column_name,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}

params = common_config_params| params
sys.argv = ParamsUtils.dict_to_req(d=params)

# create launcher
launcher = RayTransformLauncher(CodeQualityRayTransformConfiguration())
# launch
launcher.launch()

18:42:36 INFO - Running locally
18:42:36 INFO - data factory data_ is using local data access: input_folder - sample_data/ps_out output_folder - sample_data/cq_out
18:42:36 INFO - data factory data_ max_files -1, n_sample -1
18:42:36 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
18:42:36 INFO - pipeline id pipeline_id
18:42:36 INFO - code location None
18:42:36 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
18:42:36 INFO - actor creation delay 0
18:42:36 INFO - job details {'job category': 'preprocessing', 'job name': 'code_quality', 'job type': 'ray', 'job id': 'job_id'}
2024-08-21 18:42:38,257	INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at [1m[32m127.0.0.1:8265 [39m[22m
[36m(orchestrate pid=50820)[0m 18:42:39 INFO - orchestrator started at 2024-08-21 18:42:39
[36m(orchestrate pid=50820)[0m 18:42:39 INFO

0

## 6. Filtering

This step can be used to filter the code files based on our chosen conditions. In this demo example, we have only used one annotation of adding programming language names for each code file. To demonstrate the utility, we will use this module to retain only code files of interest.

In [25]:
input_folder = "sample_data/cq_out"
output_folder = "sample_data/filter_out"


from filter_transform import (
    filter_columns_to_drop_cli_param,
    filter_criteria_cli_param,
    filter_logical_operator_cli_param,
)
from filter_transform_ray import FilterRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

# This is just an example criteria to filter
filter_criteria = [
    "language_of_interest = 1",
    "total_num_lines > 10 AND total_num_lines < 90"
]
filter_logical_operator = "AND"
filter_columns_to_drop = ["language_of_interest", "hash_column"]

filter_params = {
    filter_criteria_cli_param: filter_criteria,
    filter_columns_to_drop_cli_param: filter_columns_to_drop,
    filter_logical_operator_cli_param: filter_logical_operator,
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}


sys.argv = ParamsUtils.dict_to_req(common_config_params| filter_params)
launcher = RayTransformLauncher(FilterRayTransformConfiguration())
launcher.launch()


18:42:52 INFO - Running locally
18:42:52 INFO - data factory data_ is using local data access: input_folder - sample_data/cq_out output_folder - sample_data/filter_out
18:42:52 INFO - data factory data_ max_files -1, n_sample -1
18:42:52 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
18:42:52 INFO - pipeline id pipeline_id
18:42:52 INFO - code location None
18:42:52 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
18:42:52 INFO - actor creation delay 0
18:42:52 INFO - job details {'job category': 'preprocessing', 'job name': 'filter', 'job type': 'ray', 'job id': 'job_id'}
2024-08-21 18:42:54,490	INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at [1m[32m127.0.0.1:8265 [39m[22m
[36m(orchestrate pid=50927)[0m 18:42:55 INFO - orchestrator started at 2024-08-21 18:42:55
[36m(orchestrate pid=50927)[0m 18:42:55 INFO -

0

## 7. Semantic Ordering of Code Files

In this step, we order the code files such that we pack files from the same repository together, arranging them to prioritize semantic dependencies. We identify these dependencies by analyzing file imports and create a directed acyclic graph, where each file is a node and edges represent API imports between files. After breaking any cycles in the graph, we perform a topological sort to establish an ordering of files based on their semantic dependencies. We then organize the files in a repository by placing documentation and build files first, followed by the ordered set of files with semantic dependencies, and finally the remaining non-connected files. These non-connected files are arranged according to their folder structure, using a depth-first search to traverse the repository. Finally, we determine the dominant programming language of a repository based on file extensions and presence of build files, to organise repo-ordered files by programming languages.


This transform has following parameters:  <br/>
 _repo_lvl_sorting_enabled_ - If True, the repo level output is sorted using _repo_lvl_sorting_algo_ <br/>
 _repo_lvl_sorting_algo_ - Select the sorting algorithm to be used for repo level sorting. Use SORT_SEMANTIC_NORMALISED to organise by semantic dependencies or SORT_BY_PATH to arrange files based on folder structure in a repository.  <br/>
 _repo_lvl_store_backend_dir_ -  Directory to use for local store. Needed only when repo_lvl_store_type=local <br/>
 _repo_lvl_output_by_langs_ - If True, it organises output into folders of programming language. <br/>
 _repo_lvl_combine_rows_ - If True, it combines the contents of repo into a single row. <br/>



In [26]:
input_folder = "sample_data/filter_out"
output_folder = "sample_data/rlo_out"

import tempfile
from repo_level_order_transform import RepoLevelOrderRayTransformConfiguration
with tempfile.TemporaryDirectory() as tmpdirname:

    # create parameters
    local_conf = {
        "input_folder": input_folder,
        "output_folder": output_folder,
     }

    worker_options = {"num_cpus": 0.8}
    code_location = {"github": "github", "commit_hash": "12345", "path": "path"}

    repo_level_params = {
        "repo_lvl_sorting_algo": "SORT_SEMANTIC_NORMALISED",
        "repo_lvl_store_type": "local",
        "repo_lvl_store_backend_dir": tmpdirname,
        "repo_lvl_output_by_langs": True,
        "repo_lvl_combine_rows": True,
        "repo_lvl_sorting_enabled": True,
        "data_local_config": ParamsUtils.convert_to_ast(local_conf)
    }

    
    sys.argv = ParamsUtils.dict_to_req(d= common_config_params| repo_level_params)
    launcher = RayTransformLauncher(RepoLevelOrderRayTransformConfiguration())
    launcher.launch()

18:43:07 INFO - Running locally
18:43:07 INFO - data factory data_ is using local data access: input_folder - sample_data/filter_out output_folder - sample_data/rlo_out
18:43:07 INFO - data factory data_ max_files -1, n_sample -1
18:43:07 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
18:43:07 INFO - pipeline id pipeline_id
18:43:07 INFO - code location None
18:43:07 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
18:43:07 INFO - actor creation delay 0
18:43:07 INFO - job details {'job category': 'preprocessing', 'job name': 'repo_lvl', 'job type': 'ray', 'job id': 'job_id'}


Creating Store Params


2024-08-21 18:43:08,706	INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at [1m[32m127.0.0.1:8265 [39m[22m
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - orchestrator started at 2024-08-21 18:43:09
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.010923385620117188, 'min_file_size': 0.004130363464355469, 'total_file_size': 0.015053749084472656}
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - Cluster resources: {'cpus': 16, 'gpus': 0, 'memory': 27.229208374395967, 'object_store': 2.0}
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - => get_transform_config started
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - dict_keys(['store_backend_dir', 'store_type', 's3_creds'])
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - <= get_transform_config
[36m(orchestrate pid=51009)[0m 18:43:09 I

[36m(orchestrate pid=51009)[0m Init Store params
[36m(RayTransformFileProcessor pid=51025)[0m Creating local store.


[36m(orchestrate pid=51009)[0m 18:43:09 INFO - Completed processing 2 files in 0.00863101085027059 min
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - done flushing in 0.0005919933319091797 sec
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - Store Backend is None
[36m(orchestrate pid=51009)[0m 18:43:09 INFO - Stage 1 Finished in 0:00:00.524131.
[36m(orchestrate pid=51009)[0m 18:43:10 I - Repo level sorting is enabled. Algo: SORT_SEMANTIC_NORMALISED
[36m(orchestrate pid=51009)[0m 18:43:10 I - normalised semantic sort enabled
[36m(orchestrate pid=51009)[0m 18:43:10 I - Output by language enabled.
[36m(orchestrate pid=51009)[0m 18:43:10 I - Combine rows enabled.
[36m(orchestrate pid=51009)[0m 18:43:10 I - Processing 2 repos with 2 workers
[36m(orchestrate pid=51009)[0m 18:43:11 I - Finished the transform in 0:00:02.268307 
[36m(GroupByRepoActor pid=51030)[0m 18:43:11 I - Write C/wvuRc2%2Frc2client, tables: 1
18:43:21 INFO - Completed execution in 0.2429994503657023 min

[36m(orchestrate pid=51009)[0m Creating local store.[32m [repeated 2x across cluster][0m


## 8. Tokenization

Next, we tokenize the data to be used for fine tuning. 



In [27]:
input_folder = "sample_data/rlo_out"
output_folder = "sample_data/tokenize_out"

from tokenization_transform_ray import TokenizationRayConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

tf_params= {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf)
}
sys.argv = ParamsUtils.dict_to_req(d=common_config_params| tf_params)
# create launcher
launcher = RayTransformLauncher(TokenizationRayConfiguration())
# Launch the ray actor(s) to process the input
launcher.launch()

18:43:23 INFO - Running locally
18:43:23 INFO - data factory data_ is using local data access: input_folder - sample_data/rlo_out output_folder - sample_data/tokenize_out
18:43:23 INFO - data factory data_ max_files -1, n_sample -1
18:43:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
18:43:23 INFO - pipeline id pipeline_id
18:43:23 INFO - code location None
18:43:23 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}
18:43:23 INFO - actor creation delay 0
18:43:23 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}
2024-08-21 18:43:24,730	INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at [1m[32m127.0.0.1:8265 [39m[22m
[36m(orchestrate pid=51142)[0m 18:43:25 INFO - orchestrator started at 2024-08-21 18:43:25
[36m(orchestrate pid=51142)[0m 18:43:

0

**The data is now ready for extended pretraining or fine tuning using any open source code models.**