This repo contains the Alluxio Python API for interacting with Alluxio servers, bridging the gap between computation frameworks and underlying storage systems. The module provides a convenient interface for file system operations such as reading, writing, and listing files in an Alluxio cluster.
- Directory listing and file status fetching
- Writing data to the Alluxio system cache and reading it back (including range reads)
- Alluxio system load operations with progress tracking
- Dynamic Alluxio worker membership, using either a periodically refreshed ETCD lookup or manually specified worker hosts
The Alluxio Python library reads only data that is already cached in Alluxio. The data must first be either:
- Loaded into Alluxio servers via load operations, or
- Put into Alluxio servers via the write_page operation.
If you need to read from storage systems directly with Alluxio's on-demand caching capabilities, please use alluxiofs instead.
Install from source
cd alluxio-python-library
python setup.py sdist bdist_wheel
pip install dist/alluxio_python_library-0.1-py3-none-any.whl
Import and initialize the AlluxioFileSystem class:
# Import path is assumed from the package name; adjust it if your installation differs
from alluxio import AlluxioFileSystem

# Minimum setup for Alluxio with ETCD membership service
alluxio_fs = AlluxioFileSystem(etcd_hosts="localhost")

# Minimum setup for Alluxio with user-defined worker list
alluxio_fs = AlluxioFileSystem(worker_hosts="worker_host1,worker_host2")

# Minimum setup for Alluxio with self-defined page size
alluxio_fs = AlluxioFileSystem(
    etcd_hosts="localhost",
    options={"alluxio.worker.page.store.page.size": "20MB"}
)

# Minimum setup for Alluxio with ETCD membership service with username/password
options = {
    "alluxio.etcd.username": "my_user",
    "alluxio.etcd.password": "my_password",
    "alluxio.worker.page.store.page.size": "20MB"  # Any other options should be included here
}
alluxio_fs = AlluxioFileSystem(
    etcd_hosts="localhost",
    options=options
)
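The worker_hosts argument above is a single comma-separated string. If your host names live in a Python list, a small helper can build that string. The sketch below is only an illustration: `format_worker_hosts` is a hypothetical name, not part of the library, and the trimming and de-duplication are conveniences assumed here rather than anything the library requires.

```python
from typing import Iterable


def format_worker_hosts(hosts: Iterable[str]) -> str:
    """Join host names into the comma-separated form used by worker_hosts.

    Hypothetical helper, not part of the Alluxio Python library.
    """
    cleaned = []
    for host in hosts:
        host = host.strip()  # drop accidental surrounding whitespace
        if host and host not in cleaned:  # skip blanks and duplicates, keep order
            cleaned.append(host)
    if not cleaned:
        raise ValueError("at least one worker host is required")
    return ",".join(cleaned)
```

The result can then be passed as `worker_hosts=format_worker_hosts(my_hosts)`, where `my_hosts` is your own list.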
Dataset metadata and data in the Alluxio under storage must be loaded into the Alluxio system cache before end users can read them. Run the load operations before executing the read commands.
# Start a load operation
load_success = alluxio_fs.load('s3://mybucket/mypath/file')
print('Load successful:', load_success)
# Check load progress
progress = alluxio_fs.load_progress('s3://mybucket/mypath/file')
print('Load progress:', progress)
# Stop a load operation
stop_success = alluxio_fs.stop_load('s3://mybucket/mypath/file')
print('Stop successful:', stop_success)
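Because a load runs asynchronously and is tracked through load_progress, callers often want to wait until it finishes. The sketch below shows one generic way to do that. `poll_until` is a hypothetical helper, not part of the library, and since the exact shape of the value returned by load_progress is not described here, the completion check is left to the caller.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> T:
    """Call fetch() repeatedly until is_done(result) is true or the timeout expires.

    Hypothetical helper, not part of the Alluxio Python library.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"not finished after {timeout_s} s; last result: {result!r}"
            )
        time.sleep(interval_s)
```

A possible use is `poll_until(lambda: alluxio_fs.load_progress(path), my_check)`, where `my_check` is a function you write to recognize a finished load from whatever load_progress returns in your version.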
The Alluxio system cache can also be used as a key-value cache.
Data can be written to the Alluxio system cache with the write_page command, after which it can be read back from the cache (an alternative to load operations).
success = alluxio_fs.write_page('s3://mybucket/mypath/file', page_index, page_bytes)
print('Write successful:', success)
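The write_page call takes a page index and the bytes for that page, so a whole file has to be cut into pages first. The sketch below shows one way to do the splitting. It assumes that page indices start at 0 and that every page except the last has the same size; neither assumption is confirmed by this document. `split_into_pages` is a hypothetical helper, not part of the library.

```python
from typing import Iterator, Tuple


def split_into_pages(data: bytes, page_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (page_index, page_bytes) pairs covering data in order.

    Hypothetical helper; assumes zero-based page indices and fixed-size pages.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for index, start in enumerate(range(0, len(data), page_size)):
        # The final slice may be shorter than page_size
        yield index, data[start:start + page_size]
```

Each pair can then be passed to `alluxio_fs.write_page(path, page_index, page_bytes)`. The page size would presumably need to match the configured `alluxio.worker.page.store.page.size`, which is again an assumption to verify against your deployment.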
List the contents of a directory:
contents = alluxio_fs.listdir('s3://mybucket/mypath/dir')
print(contents)
Retrieve the status of a file or directory:
status = alluxio_fs.get_file_status('s3://mybucket/mypath/file')
print(status)
Read the entire content of a file:
"""
Reads a file.

Args:
    file_path (str): The full UFS file path to read data from
Returns:
    file content (str): The full file content
"""
content = alluxio_fs.read('s3://mybucket/mypath/file')
print(content)
Read a specific range of a file:
content = alluxio_fs.read_range('s3://mybucket/mypath/file', offset, length)
print(content)
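For large files, reading in fixed-size chunks with read_range can keep memory use bounded. The sketch below only computes the (offset, length) pairs. It assumes the file size is already known, since how to obtain it from get_file_status is not shown in this document. `iter_ranges` is a hypothetical helper, not part of the library.

```python
from typing import Iterator, Tuple


def iter_ranges(file_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover a file of file_size bytes.

    Hypothetical helper, not part of the Alluxio Python library.
    """
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, file_size, chunk_size):
        # The last chunk is clipped so it never runs past the end of the file
        yield offset, min(chunk_size, file_size - offset)
```

Each pair can be passed to `alluxio_fs.read_range(path, offset, length)` in turn.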
See Contributions for guidelines around making new contributions and reviewing them.