# How to Use `attachments` to Split Your Documents Easily

In this notebook, we explore how to use the `attachments` library to split documents into smaller chunks automatically. `attachments` processes a wide range of file types and URLs and prepares their content for Large Language Models (LLMs). We cover the splitting functionality through both the simple API and the more advanced pipeline system.
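
Before running the library, it may help to see what "splitting" means in the simplest case. The sketch below uses only the standard library and two hypothetical helpers, `split_paragraphs` and `split_sentences`. It illustrates the idea only and is not how `attachments` implements `split.paragraphs` or `split.sentences`; the naive sentence rule will mis-split abbreviations such as "e.g.".

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on one or more blank lines and drop empty chunks (illustrative helper)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text: str) -> list[str]:
    """Naive rule: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


sample = "AI is a field of study. It has many subfields.\n\nSearch is one of them."
paragraphs = split_paragraphs(sample)
sentences = split_sentences(sample)
print(len(paragraphs), "paragraphs,", len(sentences), "sentences")
```

Each chunk produced this way is small enough to inspect or send to a model on its own, which is the same motivation behind the library's split steps used in the cells that follow.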


In [1]:
from attachments import Attachments, attach, processors
from attachments.data import get_sample_path

In [2]:
from attachments import load, present, modify

# Load the page, keep only the <p> elements named by [select: p], then render to images.
res = (attach("https://en.wikipedia.org/wiki/Artificial_intelligence[select: p]")
       | load.url_to_bs4
       | modify.select
       | present.images
)
res

Attachment(path='https://en.wikipedia.org/wiki/Artificial_intelligence', text=0 chars, images=[1 imgs: data:image/png;base64,iVBORw0K...lFTkSuQmCC], pipeline=[])

In [3]:
res.images[0][:100]

'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABQAAAQNpCAIAAAAXBwOWAAAAAXNSR0IArs4c6QAAIABJREFUeJzs3W'

In [None]:
from IPython.display import HTML
HTML(f"<img src='{res.images[0]}' alt='Image' />")



In [3]:
from attachments import load, present, modify, split

# Same pipeline, but render the selection as markdown text and split it into paragraph chunks.
res = (attach("https://en.wikipedia.org/wiki/Artificial_intelligence[select: p]")
       | load.url_to_bs4
       | modify.select
       | present.markdown
       | split.paragraphs
)

In [None]:
len(res)

In [None]:
len(res[0].text)

In [None]:
res.images



In [6]:
from attachments import __version__
__version__

'0.12.0'

In [2]:
from attachments import Attachments

# The bracketed suffixes are DSL commands; the log below shows which ones are accessed.
res = Attachments("https://en.wikipedia.org/wiki/Artificial_intelligence[bad: dsl][images: false][select: p][split: sentences]")

res

[Attachments] Parsed commands for 'https://en.wikipedia.org/wiki/Artificial_intelligence[bad: dsl][images: false][select: p][split: sentences]': {'bad': 'dsl', 'images': 'false', 'select': 'p', 'split': 'sentences'}
[Attachments] Accessing command: 'split' = 'sentences'
[Attachments] Running primary processor 'webpage_to_llm' for https://en.wikipedia.org/wiki/Artificial_intelligence
[Attachments]   Accessing command: 'images' = 'false'
[Attachments]   Applying step 'load.url_to_bs4' to https://en.wikipedia.org/wiki/Artificial_intelligence
[Attachments]   Applying step 'modify.select' to https://en.wikipedia.org/wiki/Artificial_intelligence
[Attachments]     Accessing command: 'select' = 'p'
[Attachments]   Running AdditivePipeline(present.markdown + present.metadata)
[Attachments]     Applying additive step 'present.markdown' to https://en.wikipedia.org/wiki/Artificial_intelligence
[Attachments]     Applying additive step 'present.metadata' to https://en.wikipedia.org/wiki/Artificial_i

Attachments([org/wiki/artificial_intelligence#sentence-1(269chars, 0imgs), org/wiki/artificial_intelligence#sentence-2(634chars, 0imgs), org/wiki/artificial_intelligence#sentence-3(352chars, 0imgs), org/wiki/artificial_intelligence#sentence-4(1331chars, 0imgs), org/wiki/artificial_intelligence#sentence-5(313chars, 0imgs), org/wiki/artificial_intelligence#sentence-6(93chars, 0imgs), org/wiki/artificial_intelligence#sentence-7(107chars, 0imgs), org/wiki/artificial_intelligence#sentence-8(685chars, 0imgs), org/wiki/artificial_intelligence#sentence-9(126chars, 0imgs), org/wiki/artificial_intelligence#sentence-10(149chars, 0imgs), org/wiki/artificial_intelligence#sentence-11(352chars, 0imgs), org/wiki/artificial_intelligence#sentence-12(570chars, 0imgs), org/wiki/artificial_intelligence#sentence-13(526chars, 0imgs), org/wiki/artificial_intelligence#sentence-14(295chars, 0imgs), org/wiki/artificial_intelligence#sentence-15(127chars, 0imgs), org/wiki/artificial_intelligence#sentence-16(174cha

In [5]:
len(res)

261
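
The log above shows the bracketed suffixes being turned into a command dictionary before any processing runs. A minimal sketch of that parsing step follows, assuming a hypothetical `parse_dsl` helper built on a regular expression. It is not the parser `attachments` uses, and it only handles `[key: value]` pairs, not bare forms such as `[3-5]`.

```python
import re

# Hypothetical pattern for illustration: "[key: value]" with no nested brackets.
_COMMAND = re.compile(r"\[([^\[\]:]+):\s*([^\[\]]*)\]")


def parse_dsl(source: str) -> tuple[str, dict[str, str]]:
    """Return the bare path and a dict of the [key: value] commands found in it."""
    commands = {key.strip(): value.strip() for key, value in _COMMAND.findall(source)}
    path = _COMMAND.sub("", source).strip()
    return path, commands


path, commands = parse_dsl(
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
    "[bad: dsl][images: false][select: p][split: sentences]"
)
print(path)
print(commands)
```

Note that every pair is kept, including `bad`, which no step in the log ever accesses. Parsing and using a command are separate concerns.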

In [None]:
for i, att in enumerate(res):
    print(f"{i}: {att.text[:500]}")

In [None]:
print(res.images)

In [None]:
ctx = Attachments("https://en.wikipedia.org/wiki/Artificial_intelligence[images: false][select: p][split: paragraphs]")

In [None]:
len(ctx[0].text)

In [None]:
len(ctx)

In [None]:
len(ctx[0].images)

In [None]:
len(ctx[0].images[0])

In [None]:
len(ctx[0].images[0][0])

In [None]:
for i, att in enumerate(ctx.attachments):
    print(f"{i}: {att.text[:100]}")

In [None]:
print(ctx.attachments[2].text)

In [None]:
print(str(ctx))

In [None]:
print(len(ctx.images))

In [None]:
# Option 1: Use included sample files (works offline)
pdf_path = get_sample_path("sample.pdf")
txt_path = get_sample_path("sample.txt")
ctx = Attachments(pdf_path, txt_path)

print(str(ctx))      # Pretty text view
print(len(ctx.images))  # Number of extracted images

# Try different file types
docx_path = get_sample_path("test_document.docx")
csv_path = get_sample_path("test.csv")
json_path = get_sample_path("sample.json")

ctx = Attachments(docx_path, csv_path, json_path)
print(f"Processed {len(ctx)} files: Word doc, CSV data, and JSON")

# Option 2: Use URLs (same API, works with any URL)
ctx = Attachments(
    "https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf",
    "https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample_multipage.pptx"
)

print(str(ctx))      # Pretty text view  
print(len(ctx.images))  # Number of extracted images

### Advanced usage with DSL

```python
from attachments import Attachments

a = Attachments(
    "https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/"
    "sample_multipage.pptx[3-5]"
)
print(a)           # pretty text view
len(a.images)      # 👉 base64 PNG list
```

In [1]:
from attachments import set_verbose, attach, load, modify, present, refine, split

# Enable verbose logging to trace each pipeline step and DSL command access
set_verbose(True)

# The same pipeline, with a refiner appended that reports unused DSL commands
res = (
    attach("https://en.wikipedia.org/wiki/Artificial_intelligence[images: false][select: p][split: sentences][unused: command]")
    | load.url_to_bs4
    | modify.select
    | present.markdown
    | split.sentences
    | refine.report_unused_commands  # <-- Add this to see the report
)

print(f"Split into {len(res)} sentences.")

[Attachments] Parsed commands for 'https://en.wikipedia.org/wiki/Artificial_intelligence[images: false][select: p][split: sentences][unused: command]': {'images': 'false', 'select': 'p', 'split': 'sentences', 'unused': 'command'}
[Attachments] Applying step 'load.url_to_bs4' to https://en.wikipedia.org/wiki/Artificial_intelligence
[Attachments] Applying step 'modify.select' to https://en.wikipedia.org/wiki/Artificial_intelligence
[Attachments] Accessing command: 'select' = 'p'


Split into 261 sentences.


[Attachments] Applying step 'present.markdown' to https://en.wikipedia.org/wiki/Artificial_intelligence
[Attachments] Accessing command: 'images' = 'false'
[Attachments] Applying step 'split.sentences' to https://en.wikipedia.org/wiki/Artificial_intelligence
[Attachments] Accessing command: 'split' = 'sentences'
[Attachments] Unused commands for 'https://en.wikipedia.org/wiki/Artificial_intelligence' (split into 261 chunks): ['unused']
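
The final log line reports `unused` because no step ever read that command. One way such a report could work is to record every key that gets accessed and compare against the full set at the end. The sketch below assumes a hypothetical `TrackedCommands` wrapper; it illustrates the bookkeeping only and is not the implementation behind `refine.report_unused_commands`.

```python
from typing import Optional


class TrackedCommands:
    """Hypothetical dict wrapper that remembers which command keys were read."""

    def __init__(self, commands: dict[str, str]):
        self._commands = dict(commands)
        self._accessed: set[str] = set()

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        # Record the access only when the key was actually supplied.
        if key in self._commands:
            self._accessed.add(key)
        return self._commands.get(key, default)

    def unused(self) -> list[str]:
        """Commands that were supplied but never read, in their original order."""
        return [key for key in self._commands if key not in self._accessed]


cmds = TrackedCommands(
    {"images": "false", "select": "p", "split": "sentences", "unused": "command"}
)
# Simulate the pipeline steps from the log reading the commands they need.
for key in ("select", "images", "split"):
    cmds.get(key)

unused = cmds.unused()
print(unused)
```

Reporting at the end of the pipeline, rather than at parse time, lets a command be judged by whether any step consumed it, so typos in a DSL string surface without a fixed list of valid keys.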


In [1]:
from attachments import get_dsl_info
import json

dsl_data = get_dsl_info()
print(json.dumps(dsl_data, indent=2))

{
  "ignore": [
    {
      "used_in": "loader.directory_to_structure",
      "type": "loader",
      "docstring": "Load directory or glob pattern structure and file list.\n    \n    DSL: [files:true] = process individual files, [files:false] = structure + metadata only (default)\n    ",
      "source_file": "/home/maxime/Projects/attachments/src/attachments/core.py",
      "source_line": 13
    },
    {
      "used_in": "loader.git_repo_to_structure",
      "type": "loader",
      "docstring": "Load Git repository structure and file list.\n    \n    DSL: [files:true] = process individual files, [files:false] = structure + metadata only (default)\n         [mode:content|metadata|structure] = processing mode\n    ",
      "source_file": "/home/maxime/Projects/attachments/src/attachments/core.py",
      "source_line": 12
    }
  ],
  "max_files": [
    {
      "used_in": "loader.directory_to_structure",
      "type": "loader",
      "docstring": "Load directory or glob pattern structure 