add doc_quality transform #282
@daw3rd I'm having trouble building the image. It seems to require more disk space than I expected.
I have tried two solutions, but neither was successful.
Can you think of any workaround for this?
@dtsuzuku-ibm I know it is the middle of the night in Japan now! But, tomorrow, could you please look at David's comment? We are anxious to merge this module. Thanks.
I think the disk space problem is because the model is being copied into the image. When I saw this image earlier it was 4 GB. This really should be changed so that the model is loaded from a mounted (local) file system or S3. This will a) keep the image size down and b) allow us to have one image for many languages (I hope), instead of one image per language.
I agree that the model file size is too large (I also saw that it grew to around 4 GB).
I agree that this is a feasible strategy. Let me try it that way.
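As a rough illustration of the approach discussed above (resolving the model at run time instead of copying it into the image), a minimal Python sketch might look like the following. The helper name `resolve_model_path`, the `s3://bucket/key` URI convention, and the cache directory are assumptions made for illustration and are not taken from this PR. The S3 branch relies on boto3's `download_file` client method and is only reached when an `s3://` URI is given and the file is not already cached.

```python
import os
from urllib.parse import urlparse


def resolve_model_path(model_uri: str, cache_dir: str) -> str:
    """Return a local file path for the model (hypothetical helper).

    - A plain path is assumed to live on a mounted file system and is used as-is.
    - An s3://bucket/key URI is downloaded once into cache_dir and reused afterwards.
    """
    parsed = urlparse(model_uri)
    if parsed.scheme == "s3":
        local_path = os.path.join(cache_dir, os.path.basename(parsed.path))
        if not os.path.exists(local_path):
            import boto3  # imported lazily so local-only use does not need the AWS SDK

            os.makedirs(cache_dir, exist_ok=True)
            # download_file(Bucket, Key, Filename)
            boto3.client("s3").download_file(
                parsed.netloc, parsed.path.lstrip("/"), local_path
            )
        return local_path
    if not os.path.isfile(model_uri):
        raise FileNotFoundError(f"model file not found: {model_uri}")
    return model_uri
```

With something along these lines, the image would only need the transform code, and the model location could be passed in as a parameter per language.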
Signed-off-by: Daiki Tsuzuku <dtsuzuku@jp.ibm.com>
@dtsuzuku-ibm From your point of view, is this finished?
@shahrokhDaijavad Yes. @daw3rd Could you review again?
.make.defaults
Outdated
@@ -476,7 +476,7 @@ endif
 pip install $$extra_url -r requirements.txt; \
 elif [ -e pyproject.toml ]; then \
 echo Installing from pyproject.toml; \
-pip install $$extra_url -e .; \
+pip install $$extra_url -e .[dev]; \
Why is this necessary, and what are the implications for publishing a transform Python wheel to pypi.org?
This is to use moto in testing.
It seems that this is required for pip to install the project.optional-dependencies defined in pyproject.toml.
In other words, data-prep-kit does not currently install project.optional-dependencies.
> what are the implications for the publishing a transform python wheel to pypi.org

I suppose it does not affect the package, since these are still recognized as extra dependencies.
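One way to check this claim against built metadata is to look at the `Requires-Dist` entries: requirements belonging to an extra carry an `extra == "dev"` marker, while core requirements do not. The following is a minimal sketch using the standard-library `email.parser`. The metadata text is a shortened sample in the style of the PKG-INFO attached to this comment, and `split_requirements` is a hypothetical helper with a deliberately naive marker check.

```python
from email.parser import Parser

# A few lines in the style of the PKG-INFO attached to this comment (shortened).
PKG_INFO = """\
Metadata-Version: 2.1
Name: data_prep_toolkit
Version: 0.2.1.dev0
Requires-Dist: pyarrow==16.1.0
Requires-Dist: boto3==1.34.69
Provides-Extra: dev
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
"""


def split_requirements(metadata_text: str) -> tuple[list[str], list[str]]:
    """Split Requires-Dist entries into (always installed, extra-only)."""
    msg = Parser().parsestr(metadata_text)
    core, extra_only = [], []
    for req in msg.get_all("Requires-Dist", []):
        # Naive substring check; a real tool would evaluate the marker properly.
        (extra_only if "extra ==" in req else core).append(req)
    return core, extra_only


core, extra_only = split_requirements(PKG_INFO)
print("core:", core)
print("extra-only:", extra_only)
```

If the extra-only list contains the test tooling and the core list does not, a plain install of the published wheel should not pull in the `dev` requirements.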
output of make build
% make -C data-processing-lib build
/Applications/Xcode.app/Contents/Developer/usr/bin/make RULE=build .recurse
SUB_MAKE_DIRS=doc/ python/ ray/ spark/
No Makefile found in doc/. Skipping.
Using recursive build rule in python/
Checks passed
/Applications/Xcode.app/Contents/Developer/usr/bin/make TOML_VERSION=0.2.1.dev0 .defaults.update-toml
if [ -e pyproject.toml ]; then \
/Applications/Xcode.app/Contents/Developer/usr/bin/make TOML_VERSION=0.2.1.dev0 .defaults.__set-toml-version; \
/Applications/Xcode.app/Contents/Developer/usr/bin/make .defaults.__update-toml-lib-dep-versions; \
fi
if [ -e pyproject.toml ]; then \
cat pyproject.toml | sed -e \
's/^version[ ]*=.*/version = "'0.2.1.dev0'"/' \
> tt.toml; \
mv tt.toml pyproject.toml; \
fi
rm -rf dist || true
rm -rf src/*egg-info || true
python -m pip install --upgrade build
Collecting build
Using cached build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Requirement already satisfied: packaging>=19.1 in /Users/dtsuzuku/.pyenv/versions/instruct-lab/lib/python3.11/site-packages (from build) (23.2)
Collecting pyproject_hooks (from build)
Using cached pyproject_hooks-1.1.0-py3-none-any.whl.metadata (1.3 kB)
Using cached build-1.2.1-py3-none-any.whl (21 kB)
Using cached pyproject_hooks-1.1.0-py3-none-any.whl (9.2 kB)
Installing collected packages: pyproject_hooks, build
Successfully installed build-1.2.1 pyproject_hooks-1.1.0
[notice] A new release of pip is available: 24.0 -> 24.1.2
[notice] To update, run: pip install --upgrade pip
python -m build
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
- setuptools>=68.0.0
- setuptools_scm[toml]>=7.1.0
- wheel
* Getting build dependencies for sdist...
running egg_info
creating src/data_prep_toolkit.egg-info
writing src/data_prep_toolkit.egg-info/PKG-INFO
writing dependency_links to src/data_prep_toolkit.egg-info/dependency_links.txt
writing requirements to src/data_prep_toolkit.egg-info/requires.txt
writing top-level names to src/data_prep_toolkit.egg-info/top_level.txt
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
* Building sdist...
running sdist
running egg_info
writing src/data_prep_toolkit.egg-info/PKG-INFO
writing dependency_links to src/data_prep_toolkit.egg-info/dependency_links.txt
writing requirements to src/data_prep_toolkit.egg-info/requires.txt
writing top-level names to src/data_prep_toolkit.egg-info/top_level.txt
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
running check
creating data_prep_toolkit-0.2.1.dev0
creating data_prep_toolkit-0.2.1.dev0/src
creating data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
creating data_prep_toolkit-0.2.1.dev0/src/data_processing
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/data_access
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/launch
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
creating data_prep_toolkit-0.2.1.dev0/test-data
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds1
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds2
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/output
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/output/ds1
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input_multiple
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected/subdir
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/input
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/input/subdir
creating data_prep_toolkit-0.2.1.dev0/test
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch/pure_python
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/transform
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/util
copying files to data_prep_toolkit-0.2.1.dev0...
copying Makefile -> data_prep_toolkit-0.2.1.dev0
copying README.md -> data_prep_toolkit-0.2.1.dev0
copying pyproject.toml -> data_prep_toolkit-0.2.1.dev0
copying src/data_prep_toolkit.egg-info/PKG-INFO -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_prep_toolkit.egg-info/SOURCES.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_prep_toolkit.egg-info/dependency_links.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_prep_toolkit.egg-info/requires.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_prep_toolkit.egg-info/top_level.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_processing/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing
copying src/data_processing/data_access/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/arrow_s3.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access_factory.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access_factory_base.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access_local.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access_s3.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/runtime/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/execution_configuration.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/runtime_configuration.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/transform_file_processor.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/transform_launcher.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/pure_python/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/runtime_configuration.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_file_processor.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_launcher.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_orchestrator.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/test_support/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support
copying src/data_processing/test_support/abstract_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support
copying src/data_processing/test_support/data_access/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/data_access
copying src/data_processing/test_support/data_access/data_access_factory_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/data_access
copying src/data_processing/test_support/launch/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/launch
copying src/data_processing/test_support/launch/transform_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/launch
copying src/data_processing/test_support/transform/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
copying src/data_processing/test_support/transform/binary_transform_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
copying src/data_processing/test_support/transform/noop_transform.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
copying src/data_processing/test_support/transform/table_transform_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
copying src/data_processing/transform/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/abstract_transform.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/binary_transform.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/table_transform.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/transform_configuration.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/transform_statistics.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/utils/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/cli_utils.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/config.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/log.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/params_utils.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/transform_utils.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying test-data/data_processing/daf/input/ds1/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds1
copying test-data/data_processing/daf/input/ds1/sample2.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds1
copying test-data/data_processing/daf/input/ds2/sample3.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds2
copying test-data/data_processing/daf/output/ds1/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/output/ds1
copying test-data/data_processing/input/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input
copying test-data/data_processing/input_multiple/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input_multiple
copying test-data/data_processing/input_multiple/sample2.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input_multiple
copying test-data/data_processing/input_multiple/sample3.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input_multiple
copying test-data/data_processing/python/noop/expected/metadata.json -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected
copying test-data/data_processing/python/noop/expected/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected
copying test-data/data_processing/python/noop/expected/subdir/test1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected/subdir
copying test-data/data_processing/python/noop/input/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/input
copying test-data/data_processing/python/noop/input/subdir/test1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/input/subdir
copying test/data_processing_tests/data_access/daf_local_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
copying test/data_processing_tests/data_access/data_access_local_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
copying test/data_processing_tests/data_access/data_access_s3_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
copying test/data_processing_tests/data_access/sample_input_data_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
copying test/data_processing_tests/launch/pure_python/launcher_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch/pure_python
copying test/data_processing_tests/launch/pure_python/multi_launcher_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch/pure_python
copying test/data_processing_tests/launch/pure_python/test_noop_launch.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch/pure_python
copying test/data_processing_tests/transform/test_noop.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/transform
copying test/data_processing_tests/util/transform_utils_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/util
copying src/data_prep_toolkit.egg-info/SOURCES.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
Writing data_prep_toolkit-0.2.1.dev0/setup.cfg
Creating tar archive
removing 'data_prep_toolkit-0.2.1.dev0' (and everything under it)
* Building wheel from sdist
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
- setuptools>=68.0.0
- setuptools_scm[toml]>=7.1.0
- wheel
* Getting build dependencies for wheel...
running egg_info
writing src/data_prep_toolkit.egg-info/PKG-INFO
writing dependency_links to src/data_prep_toolkit.egg-info/dependency_links.txt
writing requirements to src/data_prep_toolkit.egg-info/requires.txt
writing top-level names to src/data_prep_toolkit.egg-info/top_level.txt
ERROR setuptools_scm._file_finders.git listing git files failed - pretending there aren't any
reading manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
* Building wheel...
running bdist_wheel
running build
running build_py
creating build
creating build/lib
creating build/lib/data_processing
copying src/data_processing/__init__.py -> build/lib/data_processing
creating build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access_factory_base.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/arrow_s3.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access_s3.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access_factory.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/__init__.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access_local.py -> build/lib/data_processing/data_access
creating build/lib/data_processing/runtime
copying src/data_processing/runtime/__init__.py -> build/lib/data_processing/runtime
copying src/data_processing/runtime/execution_configuration.py -> build/lib/data_processing/runtime
copying src/data_processing/runtime/transform_launcher.py -> build/lib/data_processing/runtime
copying src/data_processing/runtime/runtime_configuration.py -> build/lib/data_processing/runtime
copying src/data_processing/runtime/transform_file_processor.py -> build/lib/data_processing/runtime
creating build/lib/data_processing/utils
copying src/data_processing/utils/params_utils.py -> build/lib/data_processing/utils
copying src/data_processing/utils/cli_utils.py -> build/lib/data_processing/utils
copying src/data_processing/utils/config.py -> build/lib/data_processing/utils
copying src/data_processing/utils/log.py -> build/lib/data_processing/utils
copying src/data_processing/utils/transform_utils.py -> build/lib/data_processing/utils
copying src/data_processing/utils/__init__.py -> build/lib/data_processing/utils
creating build/lib/data_processing/test_support
copying src/data_processing/test_support/abstract_test.py -> build/lib/data_processing/test_support
copying src/data_processing/test_support/__init__.py -> build/lib/data_processing/test_support
creating build/lib/data_processing/transform
copying src/data_processing/transform/binary_transform.py -> build/lib/data_processing/transform
copying src/data_processing/transform/table_transform.py -> build/lib/data_processing/transform
copying src/data_processing/transform/__init__.py -> build/lib/data_processing/transform
copying src/data_processing/transform/transform_configuration.py -> build/lib/data_processing/transform
copying src/data_processing/transform/abstract_transform.py -> build/lib/data_processing/transform
copying src/data_processing/transform/transform_statistics.py -> build/lib/data_processing/transform
creating build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_orchestrator.py -> build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/__init__.py -> build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_launcher.py -> build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/runtime_configuration.py -> build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_file_processor.py -> build/lib/data_processing/runtime/pure_python
creating build/lib/data_processing/test_support/data_access
copying src/data_processing/test_support/data_access/__init__.py -> build/lib/data_processing/test_support/data_access
copying src/data_processing/test_support/data_access/data_access_factory_test.py -> build/lib/data_processing/test_support/data_access
creating build/lib/data_processing/test_support/launch
copying src/data_processing/test_support/launch/transform_test.py -> build/lib/data_processing/test_support/launch
copying src/data_processing/test_support/launch/__init__.py -> build/lib/data_processing/test_support/launch
creating build/lib/data_processing/test_support/transform
copying src/data_processing/test_support/transform/__init__.py -> build/lib/data_processing/test_support/transform
copying src/data_processing/test_support/transform/binary_transform_test.py -> build/lib/data_processing/test_support/transform
copying src/data_processing/test_support/transform/noop_transform.py -> build/lib/data_processing/test_support/transform
copying src/data_processing/test_support/transform/table_transform_test.py -> build/lib/data_processing/test_support/transform
running egg_info
writing src/data_prep_toolkit.egg-info/PKG-INFO
writing dependency_links to src/data_prep_toolkit.egg-info/dependency_links.txt
writing requirements to src/data_prep_toolkit.egg-info/requires.txt
writing top-level names to src/data_prep_toolkit.egg-info/top_level.txt
ERROR setuptools_scm._file_finders.git listing git files failed - pretending there aren't any
reading manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
installing to build/bdist.macosx-14.0-arm64/wheel
running install
running install_lib
creating build/bdist.macosx-14.0-arm64
creating build/bdist.macosx-14.0-arm64/wheel
creating build/bdist.macosx-14.0-arm64/wheel/data_processing
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access_factory_base.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/arrow_s3.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access_s3.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access_factory.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access_local.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/execution_configuration.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/transform_launcher.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/runtime_configuration.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/transform_file_processor.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/transform_orchestrator.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/transform_launcher.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/runtime_configuration.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/transform_file_processor.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/params_utils.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/cli_utils.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/config.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/log.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/transform_utils.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/data_access
copying build/lib/data_processing/test_support/data_access/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/data_access
copying build/lib/data_processing/test_support/data_access/data_access_factory_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/data_access
copying build/lib/data_processing/test_support/abstract_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/launch
copying build/lib/data_processing/test_support/launch/transform_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/launch
copying build/lib/data_processing/test_support/launch/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/launch
copying build/lib/data_processing/test_support/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
copying build/lib/data_processing/test_support/transform/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
copying build/lib/data_processing/test_support/transform/binary_transform_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
copying build/lib/data_processing/test_support/transform/noop_transform.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
copying build/lib/data_processing/test_support/transform/table_transform_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/binary_transform.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/table_transform.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/transform_configuration.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/abstract_transform.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/transform_statistics.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
running install_egg_info
Copying src/data_prep_toolkit.egg-info to build/bdist.macosx-14.0-arm64/wheel/data_prep_toolkit-0.2.1.dev0-py3.11.egg-info
running install_scripts
creating build/bdist.macosx-14.0-arm64/wheel/data_prep_toolkit-0.2.1.dev0.dist-info/WHEEL
creating '/Users/dtsuzuku/granite/data-prep-kit/data-processing-lib/python/dist/.tmp-3_4fi23h/data_prep_toolkit-0.2.1.dev0-py3-none-any.whl' and adding 'build/bdist.macosx-14.0-arm64/wheel' to it
adding 'data_processing/__init__.py'
adding 'data_processing/data_access/__init__.py'
adding 'data_processing/data_access/arrow_s3.py'
adding 'data_processing/data_access/data_access.py'
adding 'data_processing/data_access/data_access_factory.py'
adding 'data_processing/data_access/data_access_factory_base.py'
adding 'data_processing/data_access/data_access_local.py'
adding 'data_processing/data_access/data_access_s3.py'
adding 'data_processing/runtime/__init__.py'
adding 'data_processing/runtime/execution_configuration.py'
adding 'data_processing/runtime/runtime_configuration.py'
adding 'data_processing/runtime/transform_file_processor.py'
adding 'data_processing/runtime/transform_launcher.py'
adding 'data_processing/runtime/pure_python/__init__.py'
adding 'data_processing/runtime/pure_python/runtime_configuration.py'
adding 'data_processing/runtime/pure_python/transform_file_processor.py'
adding 'data_processing/runtime/pure_python/transform_launcher.py'
adding 'data_processing/runtime/pure_python/transform_orchestrator.py'
adding 'data_processing/test_support/__init__.py'
adding 'data_processing/test_support/abstract_test.py'
adding 'data_processing/test_support/data_access/__init__.py'
adding 'data_processing/test_support/data_access/data_access_factory_test.py'
adding 'data_processing/test_support/launch/__init__.py'
adding 'data_processing/test_support/launch/transform_test.py'
adding 'data_processing/test_support/transform/__init__.py'
adding 'data_processing/test_support/transform/binary_transform_test.py'
adding 'data_processing/test_support/transform/noop_transform.py'
adding 'data_processing/test_support/transform/table_transform_test.py'
adding 'data_processing/transform/__init__.py'
adding 'data_processing/transform/abstract_transform.py'
adding 'data_processing/transform/binary_transform.py'
adding 'data_processing/transform/table_transform.py'
adding 'data_processing/transform/transform_configuration.py'
adding 'data_processing/transform/transform_statistics.py'
adding 'data_processing/utils/__init__.py'
adding 'data_processing/utils/cli_utils.py'
adding 'data_processing/utils/config.py'
adding 'data_processing/utils/log.py'
adding 'data_processing/utils/params_utils.py'
adding 'data_processing/utils/transform_utils.py'
adding 'data_prep_toolkit-0.2.1.dev0.dist-info/METADATA'
adding 'data_prep_toolkit-0.2.1.dev0.dist-info/WHEEL'
adding 'data_prep_toolkit-0.2.1.dev0.dist-info/top_level.txt'
adding 'data_prep_toolkit-0.2.1.dev0.dist-info/RECORD'
removing build/bdist.macosx-14.0-arm64/wheel
Successfully built data_prep_toolkit-0.2.1.dev0.tar.gz and data_prep_toolkit-0.2.1.dev0-py3-none-any.whl
PKG-INFO
% cat data_prep_toolkit-0.2.1.dev0/PKG-INFO
Metadata-Version: 2.1
Name: data_prep_toolkit
Version: 0.2.1.dev0
Summary: Data Preparation Toolkit Library
Author-email: David Wood <dawood@us.ibm.com>, Boris Lublinsky <blublinsky@ibm.com>
License: Apache-2.0
Keywords: data,data preprocessing,data preparation,llm,generative,ai,fine-tuning,llmapps
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pyarrow==16.1.0
Requires-Dist: boto3==1.34.69
Requires-Dist: argparse
Requires-Dist: mmh3
Provides-Extra: dev
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: pytest-dotenv>=0.5.2; extra == "dev"
Requires-Dist: pytest-env>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.3.2; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
Requires-Dist: markupsafe==2.0.1; extra == "dev"
# Data Processing Library
This provides a python framework for developing _transforms_
on data stored in files - currently parquet files are supported -
and running them in a [ray](https://www.ray.io/) cluster.
Data files may be stored in the local file system or COS/S3.
For more details see the [documentation](../doc/overview.md).
### Virtual Environment
The project uses `pyproject.toml` and a Makefile for operations.
To do development you should establish the virtual environment
```shell
make venv
```
and then either activate
```shell
source venv/bin/activate
```
or set up your IDE to use the venv directory when developing in this project
## Library Artifact Build and Publish
To test, build and publish the library
```shell
make test build publish
```
To bump the version number, edit the Makefile to change VERSION and rerun
the above. This will require committing both the `Makefile` and the
automatically updated `pyproject.toml` file.
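To make the effect of `-e .` versus `-e .[dev]` concrete, the following is a minimal sketch that splits the `Requires-Dist` entries of a PKG-INFO file into base requirements and per-extra requirements. The helper name `split_requirements` is hypothetical, the metadata is an abbreviated copy of the dump above, and the marker handling is deliberately simplified (it only recognises a bare `extra == "name"` marker).

```python
from email import message_from_string

# Abbreviated copy of the PKG-INFO shown above (core metadata uses RFC 822 style headers).
PKG_INFO = """\
Metadata-Version: 2.1
Name: data_prep_toolkit
Version: 0.2.1.dev0
Requires-Dist: pyarrow==16.1.0
Requires-Dist: boto3==1.34.69
Requires-Dist: argparse
Requires-Dist: mmh3
Provides-Extra: dev
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
"""


def split_requirements(pkg_info_text):
    """Return (base_requirements, {extra_name: [requirements]}).

    Hypothetical helper for illustration only; it handles just the
    simple `extra == "name"` marker form seen in the dump above.
    """
    msg = message_from_string(pkg_info_text)
    base, extras = [], {}
    for req in msg.get_all("Requires-Dist", []):
        spec, _, marker = req.partition(";")
        marker = marker.strip()
        if marker.startswith("extra =="):
            name = marker.split("==", 1)[1].strip().strip("\"'")
            extras.setdefault(name, []).append(spec.strip())
        else:
            base.append(req.strip())
    return base, extras


base, extras = split_requirements(PKG_INFO)
print("installed by `pip install -e .`:", base)
print("added by `pip install -e .[dev]`:", extras.get("dev", []))
```

In this metadata `moto` appears only under the `dev` extra, so a plain `pip install -e .` would not bring it in.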
This is a global change that will affect all transforms, which makes me uncomfortable. Also, we don't install like this in the images, so there would be a difference between the local venv and what gets installed in the image. This latter point makes this seem like a bad idea. What problem are you trying to solve with this change?
I want to test if we can load bad words from s3. To do that, I want to use the mock of s3, which is provided by moto.
> And we don't install like this in the images so there would be a difference between the local venv and what gets installed in the image
I suppose this is checked in data-prep-kit's build system since it executes pytest with the image it built.
Good idea to add the tests for s3, but the only place s3 is actually tested now is in the library tests.
So, let's back out this change and, if you want, submit an issue to enable s3 testing with moto in the transforms.
Ok. reverted.
transforms/language/doc_quality/python/src/doc_quality_transform.py
Please add this new transform to the table in the top README.md
…oc_quality Signed-off-by: Daiki Tsuzuku <dtsuzuku@jp.ibm.com>
0000d4a to 05ad588
transforms/language/doc_quality/python/src/doc_quality_transform.py
Signed-off-by: Daiki Tsuzuku <dtsuzuku@jp.ibm.com>
c264a38 to bdc9a10
Signed-off-by: Daiki Tsuzuku <dtsuzuku@jp.ibm.com>
LGTM. Thanks for all the changes.
Why are these changes needed?
Related issue number (if any).
#200