<div style="background-color: #04D7FD; padding: 20px; text-align: left;">
    <h1 style="color: #000000; font-size: 36px; margin: 0;">Demo: Data Prep Kit</h1>
    
</div>


## Overview
Welcome to the demo notebook! Inside, you will find an end-to-end sample data pipeline for processing code datasets. It starts from GitHub repositories downloaded as .zip files and ends with processed data. The notebook runs the following transforms, in this order:

- [Ingest2parquet](#item1)
- [Exact Dedup](#item2)
- [Doc_ID generation](#item3)
- [Fuzzy Dedup](#item4)
- [Programming Language Select](#item5)
- [Code quality](#item6)
- [Filtering](#item7)
- [Tokenization](#item8)

### Getting started

To try this pipeline on your own data, download the GitHub repositories you want to process as .zip files, following the steps below. You can also try it on sample data by downloading a few repositories of interest.

Here's how to download a GitHub repository in ZIP format:

1. Go to the desired repository on GitHub.
2. Click the "Code" button near the top right corner of the repository.
3. Click the "Download ZIP" button.

This will download a ZIP archive of the entire repository to your computer.

Repeat these steps for each repository you want to process and place the ZIP files in a single folder.

That folder serves as the input to the pipeline. Assign its path to the variable `zip_input_folder` in the "Set input/output path variables for the pipeline" cell below.
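
The manual download steps above can also be scripted. The following is a minimal sketch, assuming the repository is public and that GitHub serves branch archives under `/archive/refs/heads/<branch>.zip`; the helper names `github_zip_url` and `download_repo_zip` are hypothetical and not part of Data Prep Kit, and the branch name must match a branch that actually exists in the repository.

```python
import urllib.request
from pathlib import Path


def github_zip_url(owner: str, repo: str, branch: str = "main") -> str:
    """Build the archive URL for one branch of a public GitHub repository."""
    return f"https://github.com/{owner}/{repo}/archive/refs/heads/{branch}.zip"


def download_repo_zip(owner: str, repo: str, dest_folder: str, branch: str = "main") -> Path:
    """Download one repository archive into dest_folder and return its path."""
    folder = Path(dest_folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{repo}-{branch}.zip"
    urllib.request.urlretrieve(github_zip_url(owner, repo, branch), str(target))
    return target


# Example (needs network access); replace the arguments with a repository
# and branch of your choice:
# download_repo_zip("<owner>", "<repo>", "input_data", branch="<branch>")
```

The destination folder in the commented example matches the default value of `zip_input_folder` used later in this notebook.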


### Import Common python modules

In [1]:
import os
import ast

from data_processing_ray.runtime.ray import execute_ray_transform
from data_processing.runtime.pure_python import execute_python_transform
from data_processing.utils import TransformsConfiguration, ParamsUtils, GB

### To keep the code short, we use the same Ray runtime settings for all transforms

In [2]:
worker_options = {"num_cpus": 0.8, "memory": 2 * GB}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}

runtime_ray_params = {
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}
runtime_python_params = {
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}
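
The Execute cells in this notebook combine these shared runtime dictionaries with per-transform settings using the `|` operator, which requires Python 3.9 or later. A minimal sketch of that merge, with illustrative stand-in values rather than the real parameters:

```python
# Stand-ins for the shared and per-transform dictionaries (values are illustrative).
shared_params = {"runtime_pipeline_id": "pipeline_id", "runtime_job_id": "job_id"}
transform_params = {"ededup_doc_column": "contents", "runtime_job_id": "ededup_job"}

# `a | b` returns a new dict; on duplicate keys the right operand wins.
merged = shared_params | transform_params
print(merged)

# The inputs are left unchanged, so the shared dict can be reused for every transform.
print(shared_params["runtime_job_id"])
```

Because the merge produces a new dictionary each time, one transform's settings do not leak into the next.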

### Set input/output path variables for the pipeline

In [3]:
# Example
# We can set input paths here
zip_input_folder = "input_data"

if not os.path.exists(zip_input_folder):
    print("NO INPUT DATA")
    print("Please set `zip_input_folder` variable to path containing data")

# make sure the paths are correct
data_base_path = "test-data"

parquet_data_output = os.path.join(data_base_path, "parquet_input")

ededup_out =  os.path.join(data_base_path, "ededup_out")

doc_id_out =  os.path.join(data_base_path, "doc_id_out")
fdedup_out = os.path.join(data_base_path, "fdedup_out")

lang_out =  os.path.join(data_base_path,"lang_out")
cq_out = os.path.join(data_base_path,"cq_out")

filter_out = os.path.join(data_base_path ,"filter_out")
tokensization_out = os.path.join(data_base_path ,"tokenization_out")



### Finally, let's print the list of available transforms

In [4]:
t_configuration = TransformsConfiguration()
transforms = t_configuration.get_available_transforms()
print(transforms)

08:50:34 INFO - loading from transforms configuration from /Users/borisl/Projects/data-prep-kit/data-processing-lib/python/src/data_processing/utils/transform_configuration.json


['code2parquet', 'code_quality', 'malware', 'proglang_select', 'lang_id', 'doc_id', 'ededup', 'fdedup', 'filter', 'noop', 'profiler', 'resize', 'tokenization']


## <span style="color: green"> 1. Convert data to parquet using ingest2parquet [<-](#top)<a class="anchor" id="item1"></a> _zip_ to _parquet_ Python transform </span>

This step (run below as the `code2parquet` transform) converts raw code data in zip format to parquet files, where each row of a parquet file corresponds to a separate code file. Apart from the contents of the code file, every row also contains a unique document id, file URL, name of the repository, source of the data, date of acquisition and license of the repository. A language field is also added for every code file; the language is detected from the filename extension.
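
To make the row-per-file idea concrete, here is a minimal sketch using only the standard library. It is not the transform's implementation: the extension map is a hypothetical stand-in for the `lang_extensions.json` file used below, the `title` and `repo_name` column names are assumptions, and real rows carry more fields than shown here.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical extension map; the notebook loads the real one from lang_extensions.json.
EXTENSION_TO_LANGUAGE = {".py": "Python", ".java": "Java", ".c": "C"}


def zip_to_rows(zip_bytes: bytes, repo_name: str) -> list[dict]:
    """Turn each file inside a repository ZIP into one row (a plain dict)."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no code
            contents = archive.read(info).decode("utf-8", errors="replace")
            extension = PurePosixPath(info.filename).suffix.lower()
            rows.append(
                {
                    "title": info.filename,
                    "repo_name": repo_name,
                    "contents": contents,
                    "programming_language": EXTENSION_TO_LANGUAGE.get(extension, "unknown"),
                }
            )
    return rows


# Build a tiny in-memory archive to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/main.py", "print('hello')\n")
    archive.writestr("demo/README", "plain text\n")

rows = zip_to_rows(buffer.getvalue(), repo_name="demo")
for row in rows:
    print(row["title"], row["programming_language"])
```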




### Set Input/output Folder

In [5]:
# For this stage input folder contains the zip files, each zip file contains a github repo.

input_folder = os.path.abspath(zip_input_folder)
output_folder =  os.path.abspath(parquet_data_output)
supported_languages_file = os.path.abspath("../../../transforms/code/code2parquet/python/test-data/languages/lang_extensions.json")
print(input_folder)
print(output_folder)
print(supported_languages_file)

/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/input_data
/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/parquet_input
/Users/borisl/Projects/data-prep-kit/transforms/code/code2parquet/python/test-data/languages/lang_extensions.json


### Execute 

In [6]:
# create parameters
ingest_config = {
    "data_files_to_use": [".zip"],
    "code2parquet_supported_langs_file": supported_languages_file,
    "code2parquet_detect_programming_lang": True,
}

execute_python_transform(
    configuration = t_configuration,
    name="code2parquet",
    input_folder=input_folder,
    output_folder=output_folder,
    params=runtime_python_params | ingest_config
)    

  Running command git clone --filter=blob:none --quiet https://github.com/IBM/data-prep-kit.git /private/var/folders/7l/54q_29q57dv5vwgqw0h3btlm0000gn/T/pip-install-zsc1hsbw/dpk-code2parquet-transform-python_b4ece9cd00d5447dae0162c1b5265d6b


Looking in indexes: https://pypi.org/simple, https://blublinsky%40ibm.com:****@na.artifactory.swg-devops.com/artifactory/api/pypi/res-data-engineering-team-pypi-local/simple
Collecting dpk_code2parquet_transform_python
  Cloning https://github.com/IBM/data-prep-kit.git to /private/var/folders/7l/54q_29q57dv5vwgqw0h3btlm0000gn/T/pip-install-zsc1hsbw/dpk-code2parquet-transform-python_b4ece9cd00d5447dae0162c1b5265d6b
  Resolved https://github.com/IBM/data-prep-kit.git to commit 29e83ed88c942317bdf17f4934d2847b6fc8d1fc
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting argparse (from data-prep-toolkit==0.2.1.dev0->dpk_code2parquet_transform_python)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metada

08:50:43 INFO - Using local data
08:50:43 INFO - data factory code2parquet_ is using local configuration without input/output path
08:50:43 INFO - data factory code2parquet_ max_files -1, n_sample -1
08:50:43 INFO - data factory code2parquet_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:50:43 INFO - pipeline id pipeline_id
08:50:43 INFO - job details {'job category': 'preprocessing', 'job name': 'code2parquet', 'job type': 'pure python', 'job id': 'job_id'}
08:50:43 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
08:50:43 INFO - data factory data_ is using local data access: input_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/input_data output_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/parquet_input
08:50:43 INFO - data factory data_ max_files -1, n_sample -1
08:50:43 INFO - data factory data_ Not usi

Found existing installation: dpk_code2parquet_transform_python 0.2.1.dev0
Uninstalling dpk_code2parquet_transform_python-0.2.1.dev0:
  Successfully uninstalled dpk_code2parquet_transform_python-0.2.1.dev0
Found existing installation: parameterized 0.9.0
Uninstalling parameterized-0.9.0:
  Successfully uninstalled parameterized-0.9.0
Found existing installation: pandas 2.2.2
Uninstalling pandas-2.2.2:
  Successfully uninstalled pandas-2.2.2


True

##  <span style="color: green">   2. Exact Dedup [<-](#top)<a class="anchor" id="item2"></a> Using Ray transform</span>

Remove documents that contain identical code, to reduce bias in the training data. A SHA256 hash is computed over the contents of each document, and records with identical hashes are de-duplicated.
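
The idea can be sketched in a few lines of plain Python. This is an in-memory illustration under the assumption that rows are dictionaries with a `contents` field; the function name `exact_dedup` is hypothetical, and the actual transform distributes this work across Ray actors.

```python
import hashlib


def exact_dedup(rows: list[dict], doc_column: str = "contents") -> list[dict]:
    """Keep the first row for each distinct SHA-256 digest of doc_column."""
    seen_hashes = set()
    unique_rows = []
    for row in rows:
        digest = hashlib.sha256(row[doc_column].encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique_rows.append(row)
    return unique_rows


docs = [
    {"title": "a.py", "contents": "x = 1\n"},
    {"title": "copy_of_a.py", "contents": "x = 1\n"},
    {"title": "b.py", "contents": "x = 2\n"},
]
deduped = exact_dedup(docs)
print([row["title"] for row in deduped])
```

Storing only digests keeps the memory footprint proportional to the number of distinct documents rather than their total size.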

### Set Input/output Folder

In [7]:
## For this stage the input is the folder containing parquet data which is output from the ingest2parquet tool

input_folder = output_folder
output_folder = os.path.abspath(ededup_out)

print(input_folder)
print(output_folder)

/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/parquet_input
/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/ededup_out


### Execute 

In [8]:
# Prepare the commandline params
ededup_config = {
    "ededup_hash_cpu": 0.5,
    "ededup_num_hashes": 2,
    "ededup_doc_column": "contents",
}

execute_ray_transform(
        configuration = t_configuration,
        name="ededup",
        input_folder=input_folder,
        output_folder=output_folder,
        params=runtime_ray_params | ededup_config
)    




08:50:59 INFO - Using local data
08:50:59 INFO - Running locally
08:50:59 INFO - exact dedup params are {'hash_cpu': 0.5, 'num_hashes': 2, 'doc_column': 'contents'}
08:50:59 INFO - data factory data_ is using local data access: input_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/parquet_input output_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/ededup_out
08:50:59 INFO - data factory data_ max_files -1, n_sample -1
08:50:59 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:50:59 INFO - pipeline id pipeline_id
08:50:59 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
08:50:59 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1}
08:50:59 INFO - actor creation delay 0
08:50:59 INFO - job details {'job category': 'preprocessing', 



True

## <span style="color: green">  3. DOC ID generation [<-](#top)<a class="anchor" id="item3"></a> Ray Transform</span>

This transform annotates documents with document "ids". It supports the following transformations of the original data:

 - Adding document hash: adds a hash-based document id to the data. The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To enable this annotation, set `doc_id_hash_column` to the name of the column where you want to store it.
 - Adding integer document id: adds an integer document id that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set `doc_id_int_column` to the name of the column where you want to store it. **This is a prerequisite for fuzzy dedup** in the pipeline.
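
Both annotations can be sketched as follows. This is a simplified illustration, not the transform's code: the function name `add_doc_ids` and the `start_id` parameter are hypothetical, and the default column names mirror the values configured in the cell below.

```python
import hashlib


def add_doc_ids(rows, doc_column="contents", hash_column="hash_column",
                int_column="int_id_column", start_id=0):
    """Return copies of rows with a hash-based id and a running integer id."""
    annotated = []
    for offset, row in enumerate(rows):
        new_row = dict(row)
        new_row[hash_column] = hashlib.sha256(row[doc_column].encode("utf-8")).hexdigest()
        new_row[int_column] = start_id + offset
        annotated.append(new_row)
    return annotated


# Two "tables": the second continues numbering where the first stopped,
# so integer ids stay unique across both.
table_1 = add_doc_ids([{"contents": "x = 1\n"}, {"contents": "x = 2\n"}])
table_2 = add_doc_ids([{"contents": "x = 3\n"}], start_id=len(table_1))
print([row["int_id_column"] for row in table_1 + table_2])
```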

In [9]:
# Input for this stage is the output of the exact dedup component.
# The output of this component makes it possible for the fdedup component to run on the data.

input_folder = output_folder
output_folder = os.path.abspath(doc_id_out)

print(input_folder)
print(output_folder)


/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/ededup_out
/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/doc_id_out


In [10]:
docid_config = {
    "doc_id_doc_column": "contents",
    "doc_id_hash_column": "hash_column",
    "doc_id_int_column": "int_id_column",
}
execute_ray_transform(
        configuration = t_configuration,
        name="doc_id",
        input_folder=input_folder,
        output_folder=output_folder,
        params=runtime_ray_params | docid_config
)




08:51:27 INFO - Using local data
08:51:27 INFO - Running locally
08:51:27 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'hash_column', 'int_column': 'int_id_column'}
08:51:27 INFO - data factory data_ is using local data access: input_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/ededup_out output_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/doc_id_out
08:51:27 INFO - data factory data_ max_files -1, n_sample -1
08:51:27 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:51:27 INFO - pipeline id pipeline_id
08:51:27 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
08:51:27 INFO - number of workers 3 worker options {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1}
08:51:27 INFO - actor creation delay 0
08:51:27 INFO - job details {'job cat



True

## <span style="color: green">  4. Fuzzy Dedup [<-](#top)<a class="anchor" id="item4"></a> Ray transform</span>

After exact deduplication, fuzzy deduplication is applied to remove code files that differ only slightly, which reduces bias in the data further. Small variations are common in code data, for example different variable values or added logging statements. This transform finds such near-duplicates and removes them.
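
A common way to find near-duplicates is to compare sets of word shingles with Jaccard similarity and to approximate that similarity with MinHash signatures. The sketch below illustrates the general technique in plain Python; it is not the transform's implementation. The function names are hypothetical, seeded SHA-256 hashes stand in for random permutations, and the defaults (shingle size 5, 64 permutations, space delimiter) simply mirror the configuration values used in this notebook.

```python
import hashlib


def word_shingles(text: str, size: int = 5, delimiter: str = " ") -> set[str]:
    """Split text on the delimiter and return the set of `size`-word shingles."""
    words = [w for w in text.split(delimiter) if w]
    if len(words) <= size:
        return {delimiter.join(words)}
    return {delimiter.join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def minhash_signature(shingles: set[str], num_permutations: int = 64) -> list[int]:
    """One minimum per seeded hash function; seeds stand in for permutations."""
    signature = []
    for seed in range(num_permutations):
        signature.append(
            min(
                int.from_bytes(
                    hashlib.sha256(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big"
                )
                for s in shingles
            )
        )
    return signature


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


original = "a b c d e f g h i j"
edited = "a b c d e f g h i z"  # only the last token differs

shingles_a = word_shingles(original)
shingles_b = word_shingles(edited)
exact = jaccard(shingles_a, shingles_b)  # 5 shared shingles out of 7 distinct ones
estimate = estimated_similarity(
    minhash_signature(shingles_a), minhash_signature(shingles_b)
)
print(f"exact Jaccard: {exact:.3f}, MinHash estimate: {estimate:.3f}")
```

Two documents would be treated as near-duplicates when the similarity reaches the configured threshold (0.8 in this notebook). The signatures let that check run on fixed-size vectors instead of full shingle sets.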

### Set Input/output Folder

In [11]:
## Input to this component is the output of doc_id generator component. 

input_folder = output_folder
output_folder = os.path.abspath(fdedup_out)

print(input_folder)
print(output_folder)

/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/doc_id_out
/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/fdedup_out


### Execute 

In [12]:
# create parameters
fuzzy_config = {
    # columns used
    "fdedup_doc_column": "contents",
    "fdedup_id_column": "int_id_column",
    "fdedup_cluster_column": "hash_column",
    # infrastructure
    "fdedup_bucket_cpu": 0.5,
    "fdedup_doc_cpu": 0.5,
    "fdedup_mhash_cpu": 0.5,
    "fdedup_num_doc_actors": 2,
    "fdedup_num_bucket_actors": 1,
    "fdedup_num_minhash_actors": 1,
    "fdedup_num_preprocessors": 2,
    # fuzzy parameters
    "fdedup_num_permutations": 64,
    "fdedup_threshold": 0.8,
    "fdedup_shingles_size": 5,
    "fdedup_delimiters": " "
}

execute_ray_transform(
        configuration = t_configuration,
        name="fdedup",
        input_folder=input_folder,
        output_folder=output_folder,
        params=runtime_ray_params | fuzzy_config
)




08:51:52 INFO - Using local data
08:52:00 INFO - Running locally
08:52:00 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'int_id_column', 'cluster_column': 'hash_column', 'bucket_cpu': 0.5, 'mhash_cpu': 0.5, 'doc_cpu': 0.5, 'num_doc_actors': 2, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 2, 'num_permutations': 64, 'threshold': 0.8, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8, 'memory': 2147483648}}
08:52:00 INFO - data factory data_ is using local data access: input_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/doc_id_out output_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/fdedup_out
08:52:00 INFO - data factory data_ max_files -1, n_sample -1
08:52:00 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, rando



True

## <span style="color: green">  5. Programming language annotation [<-](#top)<a class="anchor" id="item5"></a> Python transform </span>

The raw data may contain many programming languages, of which we want to retain only a prioritized list. This component takes a file with newline-separated names of the languages to select and annotates the data with a new boolean column. The filter component can then use this column to keep only the required languages.
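
A minimal sketch of this annotation, assuming rows are dictionaries and that language names are matched exactly (the real transform may normalize names differently). The function name `annotate_selected_languages` is hypothetical; the default column names match the configuration used in the Execute cell below.

```python
def annotate_selected_languages(rows, allowed_languages_text,
                                language_column="programming_language",
                                output_column="lang_selected"):
    """Add a boolean column that is True when the row's language is allowed."""
    allowed = {
        line.strip() for line in allowed_languages_text.splitlines() if line.strip()
    }
    return [{**row, output_column: row[language_column] in allowed} for row in rows]


# The allowed-languages file holds one language name per line.
allowed_file_contents = "Python\nJava\n"
sample_rows = [
    {"title": "a.py", "programming_language": "Python"},
    {"title": "b.rb", "programming_language": "Ruby"},
]
annotated_rows = annotate_selected_languages(sample_rows, allowed_file_contents)
print([(row["title"], row["lang_selected"]) for row in annotated_rows])
```

Annotating instead of dropping rows at this stage keeps the decision reversible: the filter step decides later which rows to keep.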

### Set Input/output Folder

In [13]:

input_folder = output_folder
output_folder = os.path.abspath(lang_out) 
selected_languages_file = os.path.abspath("../../../transforms/code/proglang_select/python/test-data/languages/allowed-code-languages.txt")
print(input_folder)
print(output_folder)
print(selected_languages_file)

/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/fdedup_out
/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/lang_out
/Users/borisl/Projects/data-prep-kit/transforms/code/proglang_select/python/test-data/languages/allowed-code-languages.txt


### Execute 

In [14]:
# create parameters
langselect_config = {
    "proglang_select_allowed_langs_file": selected_languages_file,
    "proglang_select_language_column": "programming_language",
    "proglang_select_output_column": "lang_selected",
}

execute_python_transform(
    configuration = t_configuration,
    name="proglang_select",
    input_folder=input_folder,
    output_folder=output_folder,
    params=runtime_python_params | langselect_config
) 


08:52:44 INFO - Using local data
08:52:44 INFO - data factory proglang_select_ is using local configuration without input/output path
08:52:44 INFO - data factory proglang_select_ max_files -1, n_sample -1
08:52:44 INFO - data factory proglang_select_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:52:44 INFO - pipeline id pipeline_id
08:52:44 INFO - job details {'job category': 'preprocessing', 'job name': 'proglang_select', 'job type': 'pure python', 'job id': 'job_id'}
08:52:44 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
08:52:44 INFO - data factory data_ is using local data access: input_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/fdedup_out output_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/lang_out
08:52:44 INFO - data factory data_ max_files -1, n_sample -1
08:52:44 INFO - data fact



True

## <span style="color: green">  6. Code Quality [<-](#top)<a class="anchor" id="item6"></a> Python Transform</span>

We experimented with various code quality metrics but finally retained the four code quality metrics used by Li et al. (2023) to balance the tradeoff between code quality and data volume.
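
The sketch below does not reproduce the transform's metrics; it only illustrates the kind of per-file annotation a code quality step can add. Of the column names used, only `total_num_lines` appears elsewhere in this notebook (in the filter criteria); `mean_line_length`, `max_line_length` and `alphanum_fraction` are hypothetical, as is the function name `annotate_basic_stats`.

```python
def annotate_basic_stats(row, contents_column="contents"):
    """Return a copy of the row with simple line and character statistics."""
    text = row[contents_column]
    lines = text.splitlines()
    line_lengths = [len(line) for line in lines]
    alnum_chars = sum(ch.isalnum() for ch in text)
    return {
        **row,
        "total_num_lines": len(lines),
        "mean_line_length": sum(line_lengths) / len(lines) if lines else 0.0,
        "max_line_length": max(line_lengths, default=0),
        "alphanum_fraction": alnum_chars / len(text) if text else 0.0,
    }


stats = annotate_basic_stats({"title": "a.py", "contents": "x = 1\ny = 22\n"})
print(stats["total_num_lines"], stats["mean_line_length"], stats["max_line_length"])
```

Annotations like these are cheap to compute per file and give the later filter step numeric columns to threshold on.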


### Set Input/output Folder

In [15]:
input_folder = output_folder
output_folder = os.path.abspath(cq_out)

print(input_folder)
print(output_folder)

/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/lang_out
/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/cq_out


### Execute 

In [16]:
cq_config = {
    "cq_contents_column_name": "contents",
    "cq_language_column_name": "programming_language",
}

execute_python_transform(
    configuration = t_configuration,
    name="code_quality",
    input_folder=input_folder,
    output_folder=output_folder,
    params=runtime_python_params | cq_config
) 


08:52:57 INFO - Using local data
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
08:52:57 INFO - pipeline id pipeline_id
08:52:57 INFO - job details {'job category': 'preprocessing', 'job name': 'code_quality', 'job type': 'pure python', 'job id': 'job_id'}
08:52:57 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
08:52:57 INFO - data factory data_ is using local data access: input_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/lang_out output_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/cq_out
08:52:57 INFO - data factory data_ max_files -1, n_sample -1
08:52:57 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:52:57 INFO - orchestrator code_quality started at 2024



True

## <span style="color: green">  7. Filtering [<-](#top)<a class="anchor" id="item7"></a> Python Transform</span>

Filter out documents that do not meet the quality threshold for each annotation. The thresholds are chosen based on a distributional analysis and manual inspection of samples, balancing data quality against data volume.
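
The filter criteria in this notebook are written as SQL-style conditions that are combined with a logical operator, after which selected columns are dropped. The following sketch reproduces that behaviour with the standard-library `sqlite3` module purely for illustration; the transform itself uses its own query engine, and the function name `filter_rows` is hypothetical. Criteria strings are interpolated into the query, so this approach is only suitable for trusted input.

```python
import sqlite3


def filter_rows(rows, criteria, logical_operator="AND", columns_to_drop=()):
    """Keep rows matching the combined criteria, then drop the listed columns."""
    columns = list(rows[0].keys())
    connection = sqlite3.connect(":memory:")
    connection.row_factory = sqlite3.Row
    try:
        column_defs = ", ".join(f'"{name}"' for name in columns)
        connection.execute(f"CREATE TABLE docs ({column_defs})")
        placeholders = ", ".join("?" for _ in columns)
        connection.executemany(
            f"INSERT INTO docs VALUES ({placeholders})",
            [tuple(row[name] for name in columns) for row in rows],
        )
        # Wrap each criterion in parentheses before joining them.
        where = f" {logical_operator} ".join(f"({c})" for c in criteria)
        kept = connection.execute(f"SELECT * FROM docs WHERE {where}").fetchall()
        return [
            {name: record[name] for name in columns if name not in columns_to_drop}
            for record in kept
        ]
    finally:
        connection.close()


sample = [
    {"title": "a.py", "total_num_lines": 50, "lang_selected": True, "hash_column": "h1"},
    {"title": "b.py", "total_num_lines": 5, "lang_selected": True, "hash_column": "h2"},
    {"title": "c.rb", "total_num_lines": 40, "lang_selected": False, "hash_column": "h3"},
]
filtered = filter_rows(
    sample,
    criteria=["total_num_lines > 10 AND total_num_lines < 90", "lang_selected = 1"],
    logical_operator="AND",
    columns_to_drop=["lang_selected", "hash_column"],
)
print(filtered)
```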

### Set Input/output Folder

In [17]:
input_folder = output_folder
output_folder = os.path.abspath(filter_out)
print(input_folder)
print(output_folder)

/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/cq_out
/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/filter_out


### Execute 

In [18]:
# These are just example filter criteria
filter_criteria = [
    "total_num_lines > 10 AND total_num_lines < 90",
    "lang_selected = 1",
]
filter_logical_operator = "AND"
filter_columns_to_drop = ["lang_selected", "hash_column"]

filter_config = {
    "filter_criteria_list": filter_criteria,
    "filter_columns_to_drop": filter_columns_to_drop,
    "filter_logical_operator": filter_logical_operator,
}

execute_python_transform(
    configuration = t_configuration,
    name="filter",
    input_folder=input_folder,
    output_folder=output_folder,
    params=runtime_python_params | filter_config
) 


08:53:08 INFO - Using local data
08:53:09 INFO - pipeline id pipeline_id
08:53:09 INFO - job details {'job category': 'preprocessing', 'job name': 'filter', 'job type': 'pure python', 'job id': 'job_id'}
08:53:09 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
08:53:09 INFO - data factory data_ is using local data access: input_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/cq_out output_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/filter_out
08:53:09 INFO - data factory data_ max_files -1, n_sample -1
08:53:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:53:09 INFO - orchestrator filter started at 2024-07-22 08:53:09
08:53:09 INFO - Number of files is 1, source profile {'max_file_size': 0.6735477447509766, 'min_file_size': 0.6735477447509766, 'total_file_size': 0.6735

Found existing installation: dpk_filter_transform_python 0.2.1.dev0
Uninstalling dpk_filter_transform_python-0.2.1.dev0:
  Successfully uninstalled dpk_filter_transform_python-0.2.1.dev0
Found existing installation: duckdb 0.10.1
Uninstalling duckdb-0.10.1:
  Successfully uninstalled duckdb-0.10.1


True
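
To make the effect of `filter_criteria`, `filter_logical_operator` and `filter_columns_to_drop` concrete, here is a minimal sketch of how such SQL-style criteria could be combined and applied. The install log above shows that the filter transform depends on `duckdb`, which suggests the criteria are evaluated as SQL predicates; this sketch uses the standard-library `sqlite3` module as a stand-in so that it runs without extra packages. The `apply_filter` helper and the sample rows are hypothetical, and the transform's own implementation may differ.

```python
import sqlite3

# Criteria copied from the cell above; each entry is a SQL-style predicate.
filter_criteria = [
    "total_num_lines > 10 AND total_num_lines < 90",
    "lang_selected = 1",
]
filter_logical_operator = "AND"
filter_columns_to_drop = ["lang_selected", "hash_column"]

# Hypothetical sample rows standing in for the parquet input.
rows = [
    {"document_id": "a", "total_num_lines": 5, "lang_selected": 1, "hash_column": "h1"},
    {"document_id": "b", "total_num_lines": 42, "lang_selected": 1, "hash_column": "h2"},
    {"document_id": "c", "total_num_lines": 60, "lang_selected": 0, "hash_column": "h3"},
    {"document_id": "d", "total_num_lines": 120, "lang_selected": 1, "hash_column": "h4"},
]


def apply_filter(rows, criteria, operator, columns_to_drop):
    """Keep rows matching the combined criteria, then drop the listed columns.

    Illustrative helper (not part of Data Prep Kit), backed by in-memory SQLite.
    """
    columns = list(rows[0].keys())
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(f"CREATE TABLE docs ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO docs VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in rows],
    )
    # Parenthesize each criterion so the joining operator applies to whole criteria.
    where = f" {operator} ".join(f"({c})" for c in criteria)
    kept_columns = [c for c in columns if c not in columns_to_drop]
    cursor = conn.execute(f"SELECT {', '.join(kept_columns)} FROM docs WHERE {where}")
    result = [dict(r) for r in cursor.fetchall()]
    conn.close()
    return result


filtered = apply_filter(rows, filter_criteria, filter_logical_operator, filter_columns_to_drop)
print(filtered)
```

With `AND`, a row must satisfy every criterion to survive; switching `filter_logical_operator` to `"OR"` keeps any row that satisfies at least one of them.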

## 8. <span style="color: green">  Tokenization [<-](#top)<a class="anchor" id="item8"></a> Python transform</span>

The data tokenization transform maps a (non-empty) input table to an output table using a pre-trained tokenizer. The input table must contain at least two columns, by default named `document_id` and `contents`. The transform tokenizes each row of the input table (each row is assumed to hold one document) and writes the result as a row of the output table.

A pre-trained tokenizer must be specified through the `--tkn_tokenizer` parameter. Its value can be the name of a ready-to-download tokenizer from Hugging Face, such as `hf-internal-testing/llama-tokenizer` or `bigcode/starcoder`, or any other tokenizer that can be loaded by the Hugging Face `AutoTokenizer` class.
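
The row-by-row mapping described above can be sketched as follows. This is only an illustration of the table-in, table-out shape: the toy whitespace tokenizer stands in for a pre-trained Hugging Face tokenizer, and the helper names (`make_toy_tokenizer`, `tokenize_table`) and the output column names `tokens` and `token_count` are assumptions made for this example, not the transform's actual schema.

```python
from typing import Callable, Dict, List


def make_toy_tokenizer() -> Callable[[str], List[int]]:
    """Return a stand-in tokenizer that maps whitespace-separated words to integer ids.

    A real run would use a pre-trained tokenizer instead; this toy version
    simply assigns the next free id to every previously unseen word.
    """
    vocab: Dict[str, int] = {}

    def tokenize(text: str) -> List[int]:
        return [vocab.setdefault(word, len(vocab)) for word in text.split()]

    return tokenize


def tokenize_table(rows: List[dict], tokenize: Callable[[str], List[int]]) -> List[dict]:
    """Map each input row (one document) to one output row of token ids."""
    output = []
    for row in rows:
        token_ids = tokenize(row["contents"])
        output.append(
            {
                "document_id": row["document_id"],
                "tokens": token_ids,  # hypothetical output column
                "token_count": len(token_ids),  # hypothetical output column
            }
        )
    return output


# Hypothetical input table with the two default columns.
input_rows = [
    {"document_id": "doc-1", "contents": "def add(a, b): return a + b"},
    {"document_id": "doc-2", "contents": "return a"},
]

tokenized = tokenize_table(input_rows, make_toy_tokenizer())
for row in tokenized:
    print(row)
```

Because the tokenizer is passed in as a plain callable, any function that turns a string into a list of token ids could be substituted for the toy one.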


In [19]:
input_folder = output_folder
output_folder = os.path.abspath(tokensization_out)
print(input_folder)
print(output_folder)

/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/filter_out
/Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/tokenization_out


In [20]:
execute_python_transform(
    configuration = t_configuration,
    name="tokenization",
    input_folder=input_folder,
    output_folder=output_folder,
    params=runtime_python_params
) 

Looking in indexes: https://pypi.org/simple, https://blublinsky%40ibm.com:****@na.artifactory.swg-devops.com/artifactory/api/pypi/res-data-engineering-team-pypi-local/simple
Collecting dpk_tokenization_transform_python
  Cloning https://github.com/IBM/data-prep-kit.git to /private/var/folders/7l/54q_29q57dv5vwgqw0h3btlm0000gn/T/pip-install-ph8qsu3y/dpk-tokenization-transform-python_46ce6bf92cfe45a3b64ac11a29af82db


  Running command git clone --filter=blob:none --quiet https://github.com/IBM/data-prep-kit.git /private/var/folders/7l/54q_29q57dv5vwgqw0h3btlm0000gn/T/pip-install-ph8qsu3y/dpk-tokenization-transform-python_46ce6bf92cfe45a3b64ac11a29af82db


  Resolved https://github.com/IBM/data-prep-kit.git to commit 29e83ed88c942317bdf17f4934d2847b6fc8d1fc
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting transformers==4.38.0 (from dpk_tokenization_transform_python)
  Using cached transformers-4.38.0-py3-none-any.whl.metadata (131 kB)
Collecting argparse (from data-prep-toolkit==0.2.1.dev0->dpk_tokenization_transform_python)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached transformers-4.38.0-py3-none-any.whl (8.5 MB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Building wheels for collected packages: dpk_tokenization_transform_python
  Building wheel for dpk_tokenization_transform_python (pyproject.toml): 

08:53:20 INFO - Using local data
08:53:20 INFO - pipeline id pipeline_id
08:53:20 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'pure python', 'job id': 'job_id'}
08:53:20 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
08:53:20 INFO - data factory data_ is using local data access: input_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/filter_out output_folder - /Users/borisl/Projects/data-prep-kit/examples/notebooks/code/test-data/tokenization_out
08:53:20 INFO - data factory data_ max_files -1, n_sample -1
08:53:20 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
08:53:20 INFO - orchestrator Tokenization started at 2024-07-22 08:53:20
08:53:20 INFO - Number of files is 1, source profile {'max_file_size': 0.004254341125488281, 'min_file_size': 0.004254341125488281,

Found existing installation: dpk_tokenization_transform_python 0.2.1.dev0
Uninstalling dpk_tokenization_transform_python-0.2.1.dev0:
  Successfully uninstalled dpk_tokenization_transform_python-0.2.1.dev0
Found existing installation: transformers 4.38.0
Uninstalling transformers-4.38.0:
  Successfully uninstalled transformers-4.38.0


True