# Working with Archive Objects in AIStore

This notebook demonstrates how to extract files from tar archives stored in AIStore. The example uses the WebDataset format, a common ML dataset format in which related files (images, labels, metadata) share a base key.

### 1. Initialize Client and Create a Bucket

In [10]:
import os
import tarfile
from io import BytesIO
from pathlib import Path
from aistore import Client
from aistore.sdk.archive_config import ArchiveConfig, ArchiveMode

# Create client and bucket
client = Client(os.getenv("AIS_ENDPOINT", "http://172.25.0.7:51080"))
bucket = client.bucket("wds-demo").create(exist_ok=True)
print(f"Using bucket: {bucket.name}")

Using bucket: wds-demo


### 2. Create a WebDataset Archive

In [11]:
# Create a sample WebDataset-style tar archive
wds_path = Path("webdataset.tar")

def add_file_to_tar(tar, filename, content):
    """Helper to add a file to tar archive."""
    data = content.encode() if isinstance(content, str) else content
    info = tarfile.TarInfo(name=filename)
    info.size = len(data)
    tar.addfile(info, BytesIO(data))

with tarfile.open(wds_path, "w") as tar:
    for i in range(3):
        base_name = f"sample_{i:03d}"
        add_file_to_tar(tar, f"{base_name}.jpg", f"Image data for {base_name}\n")
        add_file_to_tar(tar, f"{base_name}.txt", f"Caption for {base_name}\n")
        add_file_to_tar(tar, f"{base_name}.json", f'{{"id": {i}, "label": "class_{i}"}}\n')

# Upload archive to AIStore
wds_obj = bucket.object("webdataset.tar")
wds_obj.get_writer().put_file(wds_path)
wds_path.unlink()
print("WebDataset archive uploaded")

WebDataset archive uploaded
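
The base-key convention can be sketched locally with the standard library alone. The helper below, `split_wds_name` (a hypothetical name, not part of the AIStore SDK), assumes the key ends at the first dot of the basename, which is how the WebDataset convention is commonly described. It groups the nine member names from the archive above by key.

```python
from collections import defaultdict
import posixpath


def split_wds_name(name):
    """Split 'dir/sample_000.jpg' into ('dir/sample_000', 'jpg').

    Hypothetical helper: assumes the key ends at the first dot of the basename.
    """
    dirname, base = posixpath.split(name)
    stem, _, ext = base.partition(".")
    key = posixpath.join(dirname, stem) if dirname else stem
    return key, ext


# Same member names as the archive built above.
names = [f"sample_{i:03d}.{ext}" for i in range(3) for ext in ("jpg", "txt", "json")]

groups = defaultdict(list)
for name in names:
    key, ext = split_wds_name(name)
    groups[key].append(ext)

for key, exts in groups.items():
    print(key, exts)
```

Each key ends up with the three extensions `jpg`, `txt`, and `json`, which is the grouping that key-based extraction relies on.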


### 3. List Files in the Archive

In [12]:
print("Files in archive:")
for entry in bucket.list_archive("webdataset.tar", props="name,size"):
    print(f"  {entry.name} ({entry.size} bytes)")

Files in archive:
  webdataset.tar/sample_000.jpg (26 bytes)
  webdataset.tar/sample_000.json (30 bytes)
  webdataset.tar/sample_000.txt (23 bytes)
  webdataset.tar/sample_001.jpg (26 bytes)
  webdataset.tar/sample_001.json (30 bytes)
  webdataset.tar/sample_001.txt (23 bytes)
  webdataset.tar/sample_002.jpg (26 bytes)
  webdataset.tar/sample_002.json (30 bytes)
  webdataset.tar/sample_002.txt (23 bytes)
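
Note that the listing prefixes each entry with the archive name, while `archpath` in the next section takes the path inside the archive. A minimal sketch of the conversion, using a hypothetical helper `to_archpath` and two of the names printed above:

```python
ARCHIVE = "webdataset.tar"


def to_archpath(listed_name, archive=ARCHIVE):
    """Strip the leading '<archive>/' from a listed entry name."""
    prefix = archive + "/"
    if not listed_name.startswith(prefix):
        raise ValueError(f"{listed_name!r} is not inside {archive!r}")
    return listed_name[len(prefix):]


listed = ["webdataset.tar/sample_000.jpg", "webdataset.tar/sample_000.json"]
print([to_archpath(n) for n in listed])
```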


### 4. Extract a Single File

`as_file()` returns an `ObjectFileReader`, a file-like object that streams the content and retries automatically on network errors. The `max_resume` argument caps the number of resume attempts.

In [13]:
# Extract a single file by path
config = ArchiveConfig(archpath="sample_000.jpg")

with wds_obj.get_reader(archive_config=config).as_file(max_resume=3) as f:
    content = f.read()
    print(f"Extracted: {content.decode()}")

Extracted: Image data for sample_000
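
To make the retry behavior concrete, here is a rough, standard-library-only illustration of the resume-on-failure idea: when a read fails, reopen the source at the current byte offset and continue, giving up after a fixed number of resumes. This is not the SDK's implementation. `read_with_resume`, `open_at`, and `FlakyStream` are hypothetical names, and the simulated failure stands in for a network error.

```python
from io import BytesIO

DATA = b"Image data for sample_000\n"


class FlakyStream:
    """Simulated stream over DATA that raises once after a few bytes."""

    failures_left = 1  # shared across instances: only the first attempt fails

    def __init__(self, offset):
        self._buf = BytesIO(DATA[offset:])
        self._served = 0

    def read(self, size):
        if FlakyStream.failures_left and self._served >= 8:
            FlakyStream.failures_left -= 1
            raise ConnectionError("simulated network drop")
        chunk = self._buf.read(size)
        self._served += len(chunk)
        return chunk


def read_with_resume(open_at, max_resume=3, chunk_size=4):
    """Read a stream fully, reopening at the current offset after a failure."""
    out = bytearray()
    resumes = 0
    stream = open_at(0)
    while True:
        try:
            chunk = stream.read(chunk_size)
        except ConnectionError:
            if resumes >= max_resume:
                raise
            resumes += 1
            stream = open_at(len(out))  # resume after the last good byte
            continue
        if not chunk:
            return bytes(out), resumes
        out += chunk


content, resumes = read_with_resume(FlakyStream)
print(content.decode().strip(), f"(resumes: {resumes})")
```

The caller sees one uninterrupted byte string even though the underlying stream failed once partway through.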



### 5. Extract All Files for a WebDataset Key

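Use `ArchiveMode.WDSKEY` to extract all files (jpg, txt, json) for a specific sample. Multi-file matches come back as a tar stream, which is why this cell and the ones after it read the result with `tarfile`.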

In [14]:
# Extract all files for key "sample_001"
config = ArchiveConfig(regex="sample_001", mode=ArchiveMode.WDSKEY)

with wds_obj.get_reader(archive_config=config).as_file(max_resume=3) as f:
    with tarfile.open(fileobj=f, mode="r|*") as tar:
        print("Files for sample_001:")
        for member in tar:
            if member.isfile():
                content = tar.extractfile(member).read()
                print(f"  {member.name}: {content.decode().strip()}")

Files for sample_001:
  sample_001.jpg: Image data for sample_001
  sample_001.txt: Caption for sample_001
  sample_001.json: {"id": 1, "label": "class_1"}
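
In a training loop, the members of one key are usually assembled into a single sample record. The sketch below does this with the standard library only: it builds a small in-memory tar that mimics the `sample_001` stream shown above (so it runs without a cluster) and collects its members into a dict keyed by extension. `read_sample` is a hypothetical helper name. With a live cluster, the same function could presumably be given the file object returned by `as_file()`.

```python
import json
import tarfile
from io import BytesIO


def read_sample(fileobj):
    """Collect the regular files of a tar stream into {extension: bytes}."""
    sample = {}
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                ext = member.name.rsplit(".", 1)[-1]
                sample[ext] = tar.extractfile(member).read()
    return sample


# Stand-in for the server response: a tar holding the three sample_001 files.
buf = BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, text in [
        ("sample_001.jpg", "Image data for sample_001\n"),
        ("sample_001.txt", "Caption for sample_001\n"),
        ("sample_001.json", '{"id": 1, "label": "class_1"}\n'),
    ]:
        data = text.encode()
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, BytesIO(data))
buf.seek(0)

sample = read_sample(buf)
meta = json.loads(sample["json"])
print(sorted(sample), meta["label"])
```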


### 6. Extract Files by Prefix (PREFIX)

In [15]:
# Extract all files starting with "sample_00"
config = ArchiveConfig(regex="sample_00", mode=ArchiveMode.PREFIX)

with wds_obj.get_reader(archive_config=config).as_file(max_resume=3) as f:
    with tarfile.open(fileobj=f, mode="r|*") as tar:
        print("Files with prefix 'sample_00':")
        for member in tar:
            if member.isfile():
                print(f"  {member.name}")

Files with prefix 'sample_00':
  sample_000.jpg
  sample_000.txt
  sample_000.json
  sample_001.jpg
  sample_001.txt
  sample_001.json
  sample_002.jpg
  sample_002.txt
  sample_002.json


### 7. Extract Files by Extension (SUFFIX)

In [16]:
# Extract all JSON files
config = ArchiveConfig(regex=".json", mode=ArchiveMode.SUFFIX)

with wds_obj.get_reader(archive_config=config).as_file(max_resume=3) as f:
    with tarfile.open(fileobj=f, mode="r|*") as tar:
        print("All JSON files:")
        for member in tar:
            if member.isfile():
                content = tar.extractfile(member).read()
                print(f"  {member.name}: {content.decode().strip()}")

All JSON files:
  sample_000.json: {"id": 0, "label": "class_0"}
  sample_001.json: {"id": 1, "label": "class_1"}
  sample_002.json: {"id": 2, "label": "class_2"}


### 8. Extract Files by Substring (SUBSTR)

In [17]:
# Extract files containing "002" anywhere in the name
config = ArchiveConfig(regex="002", mode=ArchiveMode.SUBSTR)

with wds_obj.get_reader(archive_config=config).as_file(max_resume=3) as f:
    with tarfile.open(fileobj=f, mode="r|*") as tar:
        print("Files containing '002':")
        for member in tar:
            print(f"  {member.name}")

Files containing '002':
  sample_002.jpg
  sample_002.txt
  sample_002.json


### 9. Extract Files by Regular Expression (REGEXP)

In [18]:
# Extract text files matching a pattern
config = ArchiveConfig(regex="sample_00[1-2]\\.txt$", mode=ArchiveMode.REGEXP)

with wds_obj.get_reader(archive_config=config).as_file(max_resume=3) as f:
    with tarfile.open(fileobj=f, mode="r|*") as tar:
        print("Matching text files:")
        for member in tar:
            if member.isfile():
                content = tar.extractfile(member).read()
                print(f"  {member.name}: {content.decode().strip()}")

Matching text files:
  sample_001.txt: Caption for sample_001
  sample_002.txt: Caption for sample_002
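
Before sending a pattern to the cluster, it can help to try it locally. The sketch below checks the pattern against the nine member names with Python's `re` module and shows that a raw string avoids the doubled backslash. One caveat: AIStore is implemented in Go, so server-side matching may follow Go's regular-expression syntax rather than Python's. For a simple pattern like this one the two should agree, but that is an assumption, not something verified here.

```python
import re

# Same member names as the archive built earlier in this notebook.
names = [f"sample_{i:03d}.{ext}" for i in range(3) for ext in ("jpg", "txt", "json")]

# Raw string: same value as "sample_00[1-2]\\.txt$", easier to read.
pattern = r"sample_00[1-2]\.txt$"
assert pattern == "sample_00[1-2]\\.txt$"

matches = [n for n in names if re.search(pattern, n)]
print(matches)
```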


### 10. Cleanup

In [19]:
bucket.delete()
print("Cleanup complete")

Cleanup complete


## Summary

### ArchiveMode Options

| Option | Description | Example |
|--------|-------------|---------|
| `archpath` | Extract a single file by exact path (an `ArchiveConfig` parameter, not an `ArchiveMode`) | `ArchiveConfig(archpath="sample_000.jpg")` |
| `PREFIX` | Match files whose names start with the `regex` value | `regex="sample_00"` matches every file in this archive |
| `SUFFIX` | Match files whose names end with the `regex` value | `regex=".json"` matches all JSON files |
| `SUBSTR` | Match files whose names contain the `regex` value | `regex="002"` matches `sample_002.*` |
| `REGEXP` | Match files by regular expression | `regex="sample_00[1-2]\\.txt$"` matches `sample_001.txt` and `sample_002.txt` |
| `WDSKEY` | Match files by WebDataset key | `regex="sample_001"` matches all `sample_001.*` files |
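
As a reading aid for the table, the following sketch approximates each mode with a plain Python predicate and applies it to the nine member names. This is a local approximation for building intuition, not AIStore's server-side implementation. `MATCHERS` and `select` are hypothetical names, and the `wdskey` rule (the name with its last extension removed equals the key) is an assumption that agrees with the outputs shown in this notebook.

```python
import re

names = [f"sample_{i:03d}.{ext}" for i in range(3) for ext in ("jpg", "txt", "json")]

# Local stand-ins for the matching modes (assumed semantics, see lead-in).
MATCHERS = {
    "prefix": lambda name, s: name.startswith(s),
    "suffix": lambda name, s: name.endswith(s),
    "substr": lambda name, s: s in name,
    "regexp": lambda name, s: re.search(s, name) is not None,
    # Assumption: the key is the name without its final extension.
    "wdskey": lambda name, s: name.rsplit(".", 1)[0] == s,
}


def select(mode, value):
    """Return the member names that the given mode/value pair would keep."""
    return [n for n in names if MATCHERS[mode](n, value)]


for mode, value in [
    ("prefix", "sample_00"),
    ("suffix", ".json"),
    ("substr", "002"),
    ("regexp", r"sample_00[1-2]\.txt$"),
    ("wdskey", "sample_001"),
]:
    print(f"{mode:7s} {value!r}: {select(mode, value)}")
```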


### Benefits for ML Training

- Extract only the samples you need without downloading the entire archive
- Resilient streaming via `ObjectFileReader` retries interrupted reads, up to `max_resume` times, without extra code in the training loop