add doc_quality transform #282

Merged
merged 78 commits into from
Jul 25, 2024

Conversation

dtsuzuku-ibm
Collaborator

@dtsuzuku-ibm dtsuzuku-ibm commented Jun 17, 2024

Why are these changes needed?

Related issue number (if any).

#200

@dtsuzuku-ibm dtsuzuku-ibm force-pushed the add-doc_quality branch 2 times, most recently from d24dedb to 10a21b7 Compare June 17, 2024 12:51
@dtsuzuku-ibm dtsuzuku-ibm changed the title add doc_quality transform [WIP] add doc_quality transform Jun 17, 2024
@dtsuzuku-ibm dtsuzuku-ibm force-pushed the add-doc_quality branch 11 times, most recently from 39ed015 to 8473caf Compare June 19, 2024 12:34
@dtsuzuku-ibm
Collaborator Author

@daw3rd I'm having trouble building the image. It seems to require more disk space than I expected.


build-language
System.IO.IOException: No space left on device : '/home/runner/runners/2.317.0/_diag/Worker_20240619-123448-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
System.IO.IOException: No space left on device : '/home/runner/runners/2.317.0/_diag/Worker_20240619-123448-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Common.Tracing.Error(Exception exception)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/runners/2.317.0/_diag/Worker_20240619-123448-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)

I have tried two solutions, but neither was successful.

  1. Put the KenLM model in the doc_quality/ directory and make a symlink
    -> Failed due to a "too many symlinks" error
  2. Use the KenLM model placed in the doc_quality/ directory
    -> Failed since it is not visible in the context of each python/ray directory during docker build.

Can you think of any workaround for this?
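
For diagnosing this kind of "No space left on device" failure, a small free-space check can be run before the image build. This is only a sketch using the standard library; the function names `free_gib` and `has_room_for` are hypothetical, and any threshold would have to be chosen for the actual runner.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Return free space, in GiB, on the file system containing `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room_for(required_gib: float, path: str = ".") -> bool:
    """True when at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib


# Report the free space at the current working directory.
print(f"free space: {free_gib():.1f} GiB")
```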

@shahrokhDaijavad
Member

@dtsuzuku-ibm I know it is the middle of the night in Japan now! But, tomorrow, could you please look at David's comment? We are anxious to merge this module. Thanks.

@daw3rd
Member

daw3rd commented Jun 19, 2024

I think the disk space problem is because the model is being copied into the image. When I saw this image earlier it was 4gb. This really should be changed to have the model loaded from a mounted (local) file system or s3. This will a) keep the image size down and b) allow us to have 1 image for many languages (I hope), instead of 1 image for each language.
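
The idea of loading the model at run time instead of copying it into the image could look roughly like the sketch below. It is not the code from this PR: `resolve_model_path` and the `fetch_s3` callable are hypothetical names, and the actual S3 download (for example via an S3 client library) is left to whatever callable is passed in.

```python
from pathlib import Path
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical signature: fetch(bucket, key, destination) downloads one object.
S3Fetcher = Callable[[str, str, Path], None]


def resolve_model_path(
    location: str, cache_dir: Path, fetch_s3: Optional[S3Fetcher] = None
) -> Path:
    """Return a local path to the model, downloading from S3 only when needed."""
    parsed = urlparse(location)
    if parsed.scheme == "s3":
        if fetch_s3 is None:
            raise ValueError("an S3 fetcher is required for s3:// locations")
        bucket, key = parsed.netloc, parsed.path.lstrip("/")
        destination = cache_dir / Path(key).name
        if not destination.exists():  # reuse a previously downloaded copy
            cache_dir.mkdir(parents=True, exist_ok=True)
            fetch_s3(bucket, key, destination)
        return destination
    # Anything else is treated as a path on a mounted (local) file system.
    local = Path(location)
    if not local.exists():
        raise FileNotFoundError(f"model not found at {local}")
    return local
```

With this shape the image carries no model file, and one image can serve several languages by pointing `location` at a different model per run.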

@dtsuzuku-ibm
Collaborator Author

I agree that the model file size is too large (I also saw that it grew to around 4 GB).

the model loaded from a mounted (local) file system or s3

I agree that this is a feasible strategy we can take. Let me try that way.

@dtsuzuku-ibm dtsuzuku-ibm force-pushed the add-doc_quality branch 8 times, most recently from 5c70d6e to a9b7807 Compare June 21, 2024 05:59
Signed-off-by: Daiki Tsuzuku <dtsuzuku@jp.ibm.com>
@dtsuzuku-ibm dtsuzuku-ibm marked this pull request as ready for review June 21, 2024 06:41
@dtsuzuku-ibm dtsuzuku-ibm requested a review from daw3rd July 22, 2024 03:56
@shahrokhDaijavad
Member

@dtsuzuku-ibm From your point of view, is this finished?

@dtsuzuku-ibm
Collaborator Author

@dtsuzuku-ibm From your point of view, is this finished?

@shahrokhDaijavad Yes.

@daw3rd Could you review again?

.make.defaults Outdated
@@ -476,7 +476,7 @@ endif
pip install $$extra_url -r requirements.txt; \
elif [ -e pyproject.toml ]; then \
echo Installing from pyproject.toml; \
-pip install $$extra_url -e .; \
+pip install $$extra_url -e .[dev]; \
Member

Why is this necessary, and what are the implications for publishing a transform Python wheel to pypi.org?

Collaborator Author

This is to use moto in testing.
It seems that this is required for pip to install project.optional-dependencies defined in pyproject.toml.

Collaborator Author

@dtsuzuku-ibm dtsuzuku-ibm Jul 24, 2024

In other words, data-prep-kit currently does not install project.optional-dependencies.

Collaborator Author

@dtsuzuku-ibm dtsuzuku-ibm Jul 24, 2024

what are the implications for the publishing a transform python wheel to pypi.org

I suppose it does not affect the package, since these are still recognized as extra dependencies.
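
To illustrate the point about extras: in a built wheel, entries from project.optional-dependencies are recorded as `Requires-Dist` lines carrying an `extra == ...` marker, so a plain install skips them and only an install that requests the extra (such as `.[dev]`) pulls them in. The sketch below parses a hypothetical METADATA excerpt with the standard library; the package name and requirement names are illustrative and are not taken from this repository's actual metadata.

```python
from email.parser import Parser

# Hypothetical METADATA excerpt; names and requirements are illustrative only.
SAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: example-transform
Version: 0.0.1
Provides-Extra: dev
Requires-Dist: pyarrow
Requires-Dist: moto; extra == "dev"
"""


def split_requirements(metadata_text: str):
    """Separate unconditional requirements from those gated behind an extra."""
    headers = Parser().parsestr(metadata_text)
    base, extras = [], []
    for requirement in headers.get_all("Requires-Dist") or []:
        # Requirements from optional-dependencies carry an `extra == ...` marker.
        _, _, marker = requirement.partition(";")
        (extras if "extra ==" in marker else base).append(requirement.strip())
    return base, extras


base, extras = split_requirements(SAMPLE_METADATA)
print("installed by default:", base)
print("installed only with [dev]:", extras)
```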

output of make build
% make -C data-processing-lib build
/Applications/Xcode.app/Contents/Developer/usr/bin/make RULE=build .recurse
SUB_MAKE_DIRS=doc/ python/ ray/ spark/
No Makefile found in doc/. Skipping.
Using recursive build rule in python/
Checks passed
/Applications/Xcode.app/Contents/Developer/usr/bin/make TOML_VERSION=0.2.1.dev0 .defaults.update-toml
if [ -e pyproject.toml ]; then                                          \
            /Applications/Xcode.app/Contents/Developer/usr/bin/make TOML_VERSION=0.2.1.dev0 .defaults.__set-toml-version;    \
            /Applications/Xcode.app/Contents/Developer/usr/bin/make .defaults.__update-toml-lib-dep-versions;       \
        fi
if [ -e pyproject.toml ]; then                                  \
            cat pyproject.toml | sed -e                                 \
                's/^version[ ]*=.*/version = "'0.2.1.dev0'"/'   \
                > tt.toml;                                              \
            mv tt.toml pyproject.toml;                                  \
        fi
rm -rf dist || true
rm -rf src/*egg-info || true
python -m pip install --upgrade build
Collecting build
  Using cached build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Requirement already satisfied: packaging>=19.1 in /Users/dtsuzuku/.pyenv/versions/instruct-lab/lib/python3.11/site-packages (from build) (23.2)
Collecting pyproject_hooks (from build)
  Using cached pyproject_hooks-1.1.0-py3-none-any.whl.metadata (1.3 kB)
Using cached build-1.2.1-py3-none-any.whl (21 kB)
Using cached pyproject_hooks-1.1.0-py3-none-any.whl (9.2 kB)
Installing collected packages: pyproject_hooks, build
Successfully installed build-1.2.1 pyproject_hooks-1.1.0

[notice] A new release of pip is available: 24.0 -> 24.1.2
[notice] To update, run: pip install --upgrade pip
python -m build
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - setuptools>=68.0.0
  - setuptools_scm[toml]>=7.1.0
  - wheel
* Getting build dependencies for sdist...
running egg_info
creating src/data_prep_toolkit.egg-info
writing src/data_prep_toolkit.egg-info/PKG-INFO
writing dependency_links to src/data_prep_toolkit.egg-info/dependency_links.txt
writing requirements to src/data_prep_toolkit.egg-info/requires.txt
writing top-level names to src/data_prep_toolkit.egg-info/top_level.txt
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
* Building sdist...
running sdist
running egg_info
writing src/data_prep_toolkit.egg-info/PKG-INFO
writing dependency_links to src/data_prep_toolkit.egg-info/dependency_links.txt
writing requirements to src/data_prep_toolkit.egg-info/requires.txt
writing top-level names to src/data_prep_toolkit.egg-info/top_level.txt
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
running check
creating data_prep_toolkit-0.2.1.dev0
creating data_prep_toolkit-0.2.1.dev0/src
creating data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
creating data_prep_toolkit-0.2.1.dev0/src/data_processing
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/data_access
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/launch
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
creating data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
creating data_prep_toolkit-0.2.1.dev0/test-data
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds1
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds2
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/output
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/output/ds1
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input_multiple
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected/subdir
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/input
creating data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/input/subdir
creating data_prep_toolkit-0.2.1.dev0/test
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch/pure_python
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/transform
creating data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/util
copying files to data_prep_toolkit-0.2.1.dev0...
copying Makefile -> data_prep_toolkit-0.2.1.dev0
copying README.md -> data_prep_toolkit-0.2.1.dev0
copying pyproject.toml -> data_prep_toolkit-0.2.1.dev0
copying src/data_prep_toolkit.egg-info/PKG-INFO -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_prep_toolkit.egg-info/SOURCES.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_prep_toolkit.egg-info/dependency_links.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_prep_toolkit.egg-info/requires.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_prep_toolkit.egg-info/top_level.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
copying src/data_processing/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing
copying src/data_processing/data_access/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/arrow_s3.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access_factory.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access_factory_base.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access_local.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/data_access/data_access_s3.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/data_access
copying src/data_processing/runtime/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/execution_configuration.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/runtime_configuration.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/transform_file_processor.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/transform_launcher.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime
copying src/data_processing/runtime/pure_python/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/runtime_configuration.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_file_processor.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_launcher.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_orchestrator.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/runtime/pure_python
copying src/data_processing/test_support/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support
copying src/data_processing/test_support/abstract_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support
copying src/data_processing/test_support/data_access/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/data_access
copying src/data_processing/test_support/data_access/data_access_factory_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/data_access
copying src/data_processing/test_support/launch/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/launch
copying src/data_processing/test_support/launch/transform_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/launch
copying src/data_processing/test_support/transform/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
copying src/data_processing/test_support/transform/binary_transform_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
copying src/data_processing/test_support/transform/noop_transform.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
copying src/data_processing/test_support/transform/table_transform_test.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/test_support/transform
copying src/data_processing/transform/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/abstract_transform.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/binary_transform.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/table_transform.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/transform_configuration.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/transform/transform_statistics.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/transform
copying src/data_processing/utils/__init__.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/cli_utils.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/config.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/log.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/params_utils.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying src/data_processing/utils/transform_utils.py -> data_prep_toolkit-0.2.1.dev0/src/data_processing/utils
copying test-data/data_processing/daf/input/ds1/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds1
copying test-data/data_processing/daf/input/ds1/sample2.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds1
copying test-data/data_processing/daf/input/ds2/sample3.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/input/ds2
copying test-data/data_processing/daf/output/ds1/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/daf/output/ds1
copying test-data/data_processing/input/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input
copying test-data/data_processing/input_multiple/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input_multiple
copying test-data/data_processing/input_multiple/sample2.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input_multiple
copying test-data/data_processing/input_multiple/sample3.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/input_multiple
copying test-data/data_processing/python/noop/expected/metadata.json -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected
copying test-data/data_processing/python/noop/expected/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected
copying test-data/data_processing/python/noop/expected/subdir/test1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/expected/subdir
copying test-data/data_processing/python/noop/input/sample1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/input
copying test-data/data_processing/python/noop/input/subdir/test1.parquet -> data_prep_toolkit-0.2.1.dev0/test-data/data_processing/python/noop/input/subdir
copying test/data_processing_tests/data_access/daf_local_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
copying test/data_processing_tests/data_access/data_access_local_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
copying test/data_processing_tests/data_access/data_access_s3_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
copying test/data_processing_tests/data_access/sample_input_data_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/data_access
copying test/data_processing_tests/launch/pure_python/launcher_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch/pure_python
copying test/data_processing_tests/launch/pure_python/multi_launcher_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch/pure_python
copying test/data_processing_tests/launch/pure_python/test_noop_launch.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/launch/pure_python
copying test/data_processing_tests/transform/test_noop.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/transform
copying test/data_processing_tests/util/transform_utils_test.py -> data_prep_toolkit-0.2.1.dev0/test/data_processing_tests/util
copying src/data_prep_toolkit.egg-info/SOURCES.txt -> data_prep_toolkit-0.2.1.dev0/src/data_prep_toolkit.egg-info
Writing data_prep_toolkit-0.2.1.dev0/setup.cfg
Creating tar archive
removing 'data_prep_toolkit-0.2.1.dev0' (and everything under it)
* Building wheel from sdist
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - setuptools>=68.0.0
  - setuptools_scm[toml]>=7.1.0
  - wheel
* Getting build dependencies for wheel...
running egg_info
writing src/data_prep_toolkit.egg-info/PKG-INFO
writing dependency_links to src/data_prep_toolkit.egg-info/dependency_links.txt
writing requirements to src/data_prep_toolkit.egg-info/requires.txt
writing top-level names to src/data_prep_toolkit.egg-info/top_level.txt
ERROR setuptools_scm._file_finders.git listing git files failed - pretending there aren't any
reading manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
* Building wheel...
running bdist_wheel
running build
running build_py
creating build
creating build/lib
creating build/lib/data_processing
copying src/data_processing/__init__.py -> build/lib/data_processing
creating build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access_factory_base.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/arrow_s3.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access_s3.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access_factory.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/__init__.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access.py -> build/lib/data_processing/data_access
copying src/data_processing/data_access/data_access_local.py -> build/lib/data_processing/data_access
creating build/lib/data_processing/runtime
copying src/data_processing/runtime/__init__.py -> build/lib/data_processing/runtime
copying src/data_processing/runtime/execution_configuration.py -> build/lib/data_processing/runtime
copying src/data_processing/runtime/transform_launcher.py -> build/lib/data_processing/runtime
copying src/data_processing/runtime/runtime_configuration.py -> build/lib/data_processing/runtime
copying src/data_processing/runtime/transform_file_processor.py -> build/lib/data_processing/runtime
creating build/lib/data_processing/utils
copying src/data_processing/utils/params_utils.py -> build/lib/data_processing/utils
copying src/data_processing/utils/cli_utils.py -> build/lib/data_processing/utils
copying src/data_processing/utils/config.py -> build/lib/data_processing/utils
copying src/data_processing/utils/log.py -> build/lib/data_processing/utils
copying src/data_processing/utils/transform_utils.py -> build/lib/data_processing/utils
copying src/data_processing/utils/__init__.py -> build/lib/data_processing/utils
creating build/lib/data_processing/test_support
copying src/data_processing/test_support/abstract_test.py -> build/lib/data_processing/test_support
copying src/data_processing/test_support/__init__.py -> build/lib/data_processing/test_support
creating build/lib/data_processing/transform
copying src/data_processing/transform/binary_transform.py -> build/lib/data_processing/transform
copying src/data_processing/transform/table_transform.py -> build/lib/data_processing/transform
copying src/data_processing/transform/__init__.py -> build/lib/data_processing/transform
copying src/data_processing/transform/transform_configuration.py -> build/lib/data_processing/transform
copying src/data_processing/transform/abstract_transform.py -> build/lib/data_processing/transform
copying src/data_processing/transform/transform_statistics.py -> build/lib/data_processing/transform
creating build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_orchestrator.py -> build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/__init__.py -> build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_launcher.py -> build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/runtime_configuration.py -> build/lib/data_processing/runtime/pure_python
copying src/data_processing/runtime/pure_python/transform_file_processor.py -> build/lib/data_processing/runtime/pure_python
creating build/lib/data_processing/test_support/data_access
copying src/data_processing/test_support/data_access/__init__.py -> build/lib/data_processing/test_support/data_access
copying src/data_processing/test_support/data_access/data_access_factory_test.py -> build/lib/data_processing/test_support/data_access
creating build/lib/data_processing/test_support/launch
copying src/data_processing/test_support/launch/transform_test.py -> build/lib/data_processing/test_support/launch
copying src/data_processing/test_support/launch/__init__.py -> build/lib/data_processing/test_support/launch
creating build/lib/data_processing/test_support/transform
copying src/data_processing/test_support/transform/__init__.py -> build/lib/data_processing/test_support/transform
copying src/data_processing/test_support/transform/binary_transform_test.py -> build/lib/data_processing/test_support/transform
copying src/data_processing/test_support/transform/noop_transform.py -> build/lib/data_processing/test_support/transform
copying src/data_processing/test_support/transform/table_transform_test.py -> build/lib/data_processing/test_support/transform
running egg_info
writing src/data_prep_toolkit.egg-info/PKG-INFO
writing dependency_links to src/data_prep_toolkit.egg-info/dependency_links.txt
writing requirements to src/data_prep_toolkit.egg-info/requires.txt
writing top-level names to src/data_prep_toolkit.egg-info/top_level.txt
ERROR setuptools_scm._file_finders.git listing git files failed - pretending there aren't any
reading manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
writing manifest file 'src/data_prep_toolkit.egg-info/SOURCES.txt'
installing to build/bdist.macosx-14.0-arm64/wheel
running install
running install_lib
creating build/bdist.macosx-14.0-arm64
creating build/bdist.macosx-14.0-arm64/wheel
creating build/bdist.macosx-14.0-arm64/wheel/data_processing
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access_factory_base.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/arrow_s3.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access_s3.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access_factory.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
copying build/lib/data_processing/data_access/data_access_local.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/data_access
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/execution_configuration.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/transform_launcher.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/runtime_configuration.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
copying build/lib/data_processing/runtime/transform_file_processor.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/transform_orchestrator.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/transform_launcher.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/runtime_configuration.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/runtime/pure_python/transform_file_processor.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/runtime/pure_python
copying build/lib/data_processing/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/params_utils.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/cli_utils.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/config.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/log.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/transform_utils.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
copying build/lib/data_processing/utils/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/utils
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/data_access
copying build/lib/data_processing/test_support/data_access/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/data_access
copying build/lib/data_processing/test_support/data_access/data_access_factory_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/data_access
copying build/lib/data_processing/test_support/abstract_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/launch
copying build/lib/data_processing/test_support/launch/transform_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/launch
copying build/lib/data_processing/test_support/launch/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/launch
copying build/lib/data_processing/test_support/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
copying build/lib/data_processing/test_support/transform/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
copying build/lib/data_processing/test_support/transform/binary_transform_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
copying build/lib/data_processing/test_support/transform/noop_transform.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
copying build/lib/data_processing/test_support/transform/table_transform_test.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/test_support/transform
creating build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/binary_transform.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/table_transform.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/__init__.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/transform_configuration.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/abstract_transform.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
copying build/lib/data_processing/transform/transform_statistics.py -> build/bdist.macosx-14.0-arm64/wheel/data_processing/transform
running install_egg_info
Copying src/data_prep_toolkit.egg-info to build/bdist.macosx-14.0-arm64/wheel/data_prep_toolkit-0.2.1.dev0-py3.11.egg-info
running install_scripts
creating build/bdist.macosx-14.0-arm64/wheel/data_prep_toolkit-0.2.1.dev0.dist-info/WHEEL
creating '/Users/dtsuzuku/granite/data-prep-kit/data-processing-lib/python/dist/.tmp-3_4fi23h/data_prep_toolkit-0.2.1.dev0-py3-none-any.whl' and adding 'build/bdist.macosx-14.0-arm64/wheel' to it
adding 'data_processing/__init__.py'
adding 'data_processing/data_access/__init__.py'
adding 'data_processing/data_access/arrow_s3.py'
adding 'data_processing/data_access/data_access.py'
adding 'data_processing/data_access/data_access_factory.py'
adding 'data_processing/data_access/data_access_factory_base.py'
adding 'data_processing/data_access/data_access_local.py'
adding 'data_processing/data_access/data_access_s3.py'
adding 'data_processing/runtime/__init__.py'
adding 'data_processing/runtime/execution_configuration.py'
adding 'data_processing/runtime/runtime_configuration.py'
adding 'data_processing/runtime/transform_file_processor.py'
adding 'data_processing/runtime/transform_launcher.py'
adding 'data_processing/runtime/pure_python/__init__.py'
adding 'data_processing/runtime/pure_python/runtime_configuration.py'
adding 'data_processing/runtime/pure_python/transform_file_processor.py'
adding 'data_processing/runtime/pure_python/transform_launcher.py'
adding 'data_processing/runtime/pure_python/transform_orchestrator.py'
adding 'data_processing/test_support/__init__.py'
adding 'data_processing/test_support/abstract_test.py'
adding 'data_processing/test_support/data_access/__init__.py'
adding 'data_processing/test_support/data_access/data_access_factory_test.py'
adding 'data_processing/test_support/launch/__init__.py'
adding 'data_processing/test_support/launch/transform_test.py'
adding 'data_processing/test_support/transform/__init__.py'
adding 'data_processing/test_support/transform/binary_transform_test.py'
adding 'data_processing/test_support/transform/noop_transform.py'
adding 'data_processing/test_support/transform/table_transform_test.py'
adding 'data_processing/transform/__init__.py'
adding 'data_processing/transform/abstract_transform.py'
adding 'data_processing/transform/binary_transform.py'
adding 'data_processing/transform/table_transform.py'
adding 'data_processing/transform/transform_configuration.py'
adding 'data_processing/transform/transform_statistics.py'
adding 'data_processing/utils/__init__.py'
adding 'data_processing/utils/cli_utils.py'
adding 'data_processing/utils/config.py'
adding 'data_processing/utils/log.py'
adding 'data_processing/utils/params_utils.py'
adding 'data_processing/utils/transform_utils.py'
adding 'data_prep_toolkit-0.2.1.dev0.dist-info/METADATA'
adding 'data_prep_toolkit-0.2.1.dev0.dist-info/WHEEL'
adding 'data_prep_toolkit-0.2.1.dev0.dist-info/top_level.txt'
adding 'data_prep_toolkit-0.2.1.dev0.dist-info/RECORD'
removing build/bdist.macosx-14.0-arm64/wheel
Successfully built data_prep_toolkit-0.2.1.dev0.tar.gz and data_prep_toolkit-0.2.1.dev0-py3-none-any.whl
PKG-INFO
% cat data_prep_toolkit-0.2.1.dev0/PKG-INFO
Metadata-Version: 2.1
Name: data_prep_toolkit
Version: 0.2.1.dev0
Summary: Data Preparation Toolkit Library
Author-email: David Wood <dawood@us.ibm.com>, Boris Lublinsky <blublinsky@ibm.com>
License: Apache-2.0
Keywords: data,data preprocessing,data preparation,llm,generative,ai,fine-tuning,llmapps
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pyarrow==16.1.0
Requires-Dist: boto3==1.34.69
Requires-Dist: argparse
Requires-Dist: mmh3
Provides-Extra: dev
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: pytest-dotenv>=0.5.2; extra == "dev"
Requires-Dist: pytest-env>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.3.2; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
Requires-Dist: markupsafe==2.0.1; extra == "dev"

# Data Processing Library
This provides a python framework for developing _transforms_
on data stored in files - currently parquet files are supported -
and running them in a [ray](https://www.ray.io/) cluster.
Data files may be stored in the local file system or COS/S3.
For more details see the [documentation](../doc/overview.md).

### Virtual Environment
The project uses `pyproject.toml` and a Makefile for operations.
To do development you should establish the virtual environment
```shell
make venv
```
and then either activate
```shell
source venv/bin/activate
```
or set up your IDE to use the venv directory when developing in this project

## Library Artifact Build and Publish
To test, build and publish the library
```shell
make test build publish
```
To up the version number, edit the Makefile to change VERSION and rerun
the above.  This will require committing both the `Makefile` and the
automatically updated `pyproject.toml` file.
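
Since the review thread that follows turns on which requirements are only pulled in by the `dev` extra (for example `moto`), here is a minimal sketch of how one might split the `Requires-Dist` entries of a PKG-INFO like the one above into base and per-extra lists. The function name `split_requirements` is made up for illustration, the metadata is abridged, and the marker handling is deliberately simplistic: it only recognizes markers of the form `extra == "name"`.

```python
from email.parser import Parser

# Abridged copy of the PKG-INFO headers shown above.
PKG_INFO = """\
Metadata-Version: 2.1
Name: data_prep_toolkit
Version: 0.2.1.dev0
Requires-Dist: pyarrow==16.1.0
Requires-Dist: boto3==1.34.69
Requires-Dist: argparse
Requires-Dist: mmh3
Provides-Extra: dev
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
"""


def split_requirements(pkg_info_text):
    """Return (base, extras); extras maps an extra name to its requirements.

    Hypothetical helper: only understands `extra == "name"` markers.
    """
    # PKG-INFO uses RFC 822 style headers, so the stdlib email parser can read it.
    headers = Parser().parsestr(pkg_info_text, headersonly=True)
    base, extras = [], {}
    for line in headers.get_all("Requires-Dist", []):
        requirement, _, marker = line.partition(";")
        marker = marker.strip()
        if marker.startswith("extra =="):
            name = marker.split("==", 1)[1].strip().strip("\"'")
            extras.setdefault(name, []).append(requirement.strip())
        else:
            base.append(requirement.strip())
    return base, extras


base, extras = split_requirements(PKG_INFO)
print("base:", base)
print("dev :", extras.get("dev", []))
```

With this metadata, `moto` appears only under `dev`, so it is installed only when the `dev` extra is requested.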

Member

@daw3rd daw3rd Jul 25, 2024

this is a global change that will affect all transforms, which makes me uncomfortable. And we don't install like this in the images so there would be a difference between the local venv and what gets installed in the image. This latter point makes this seem like a bad idea. What problem are you trying to solve with this change?

Collaborator Author

I want to test if we can load bad words from s3. To do that, I want to use the mock of s3, which is provided by moto.

Collaborator Author

And we don't install like this in the images so there would be a difference between the local venv and what gets installed in the image

I suppose this is checked in data-prep-kit's build system since it executes pytest with the image it built.

Member

@daw3rd daw3rd Jul 25, 2024

Good idea to add the tests for s3, but the only place s3 is actually tested now is in the library tests.
So, let's back out this change and, if you want, submit an issue to enable s3 testing with moto in the transforms.
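
Until S3 testing with moto is available in the transforms, one way to exercise bad-word loading without S3 or moto is to inject an in-memory stand-in for the reader. This is only a sketch under stated assumptions: `InMemoryDataAccess`, `load_bad_words`, the `get_file` method returning bytes, and the path are all hypothetical names for illustration, not the doc_quality transform's or the library's actual API, and the one-word-per-line file format is assumed.

```python
class InMemoryDataAccess:
    """Hypothetical stand-in for an S3-backed reader (not part of data-prep-kit)."""

    def __init__(self, files):
        # Maps a path string to the raw bytes stored at that path.
        self._files = dict(files)

    def get_file(self, path):
        return self._files[path]


def load_bad_words(data_access, path):
    """Read one word per line, skip blank lines, and lower-case each entry."""
    raw = data_access.get_file(path)
    return {
        line.strip().lower()
        for line in raw.decode("utf-8").splitlines()
        if line.strip()
    }


# Assumed file layout: one bad word per line, possibly with stray whitespace.
fake = InMemoryDataAccess({"s3://bucket/badwords/en.txt": b"Foo\nbar\n\n baz \n"})
words = load_bad_words(fake, "s3://bucket/badwords/en.txt")
print(sorted(words))
```

This only checks the parsing logic; it does not cover credentials, bucket access, or anything else that a moto-based test against boto3 would.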

Collaborator Author

Ok. reverted.

Member

@daw3rd daw3rd left a comment

Please add this new transform to the table in the top README.md

…oc_quality

Signed-off-by: Daiki Tsuzuku <dtsuzuku@jp.ibm.com>
@dtsuzuku-ibm dtsuzuku-ibm requested a review from daw3rd July 24, 2024 22:34
Signed-off-by: Daiki Tsuzuku <dtsuzuku@jp.ibm.com>
@dtsuzuku-ibm dtsuzuku-ibm requested a review from daw3rd July 25, 2024 00:12
Signed-off-by: Daiki Tsuzuku <dtsuzuku@jp.ibm.com>
Member

@daw3rd daw3rd left a comment

LGTM. Thanks for all the changes.

@daw3rd daw3rd merged commit 0b93935 into IBM:dev Jul 25, 2024
21 checks passed
@dtsuzuku-ibm dtsuzuku-ibm deleted the add-doc_quality branch July 25, 2024 01:28
4 participants