<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/simple_directory_reader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Directory Reader over a Remote FileSystem

The `SimpleDirectoryReader` is the most commonly used data connector that _just works_.  
By default, it parses a variety of file types on your local filesystem into a list of `Document` objects.
Additionally, it can be configured to read from a remote filesystem just as easily! This is made possible through [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/index.html), a common interface for local and remote filesystems.

This notebook will take you through an example of using `SimpleDirectoryReader` to load documents from an S3 bucket. You can either run this against an actual S3 bucket, or a locally emulated S3 bucket via [LocalStack](https://www.localstack.cloud/).

### Get Started

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙, along with `s3fs` and `boto3` for S3 access.

In [2]:
!pip install llama-index s3fs boto3 



Download Data

In [None]:
# On Windows, run the download commands through WSL (skip this cell on Linux, macOS, or Colab)
!wsl mkdir -p 'data/paul_graham/'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'


In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'

In [1]:
# Start LocalStack in the background; port 4566 is the S3 endpoint used below
!docker run -d -p 4566:4566 -p 4571:4571 localstack/localstack


e74de896149447b409982f81cd6cb12da49cd31a5999cf56a1aa01a8286029b3


In [3]:
# create a test-bucket in S3
import boto3

endpoint_url = (
    "http://localhost:4566"  # use this line if you are using S3 via localstack
)
# endpoint_url = None  # use this line if you are using real AWS S3
bucket_name = "llama-index-test-bucket"
# Set up the boto3 resource with dummy credentials (sufficient for LocalStack)
s3 = boto3.resource(
    "s3",
    endpoint_url=endpoint_url,
    aws_access_key_id="dummy_access_key",
    aws_secret_access_key="dummy_secret_key",
)
s3.create_bucket(Bucket=bucket_name)
bucket = s3.Bucket(bucket_name)
# put the paul graham essays in the test-bucket in various subdirectories
bucket.upload_file(
    "data/paul_graham/paul_graham_essay1.txt", "essays/paul_graham_essay1.txt"
)
bucket.upload_file(
    "data/paul_graham/paul_graham_essay2.txt",
    "essays/more_essays/paul_graham_essay2.txt",
)
bucket.upload_file(
    "data/paul_graham/paul_graham_essay3.txt",
    "essays/even_more_essays/paul_graham_essay3.txt",
)

In [13]:
from llama_index.core import SimpleDirectoryReader

In [14]:
# create the filesystem using s3fs
from s3fs import S3FileSystem

s3_fs = S3FileSystem(
    anon=False,
    key="dummy_access_key",  # dummy access key for LocalStack
    secret="dummy_secret_key",  # dummy secret key for LocalStack
    endpoint_url=endpoint_url,
)

Load all files from the bucket (recursively)

In [15]:
reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,  # recursively searches all subdirectories
)

In [16]:
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 3 docs


In [17]:
docs

[Document(id_='56a6b4d6-6d15-47f6-8447-1ad659a03698', embedding=None, metadata={'file_path': 'llama-index-test-bucket/essays/even_more_essays/paul_graham_essay3.txt', 'file_name': 'paul_graham_essay3.txt', 'file_type': 'text/plain', 'file_size': 75042}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school distri

Load all (top-level) files from directory

In [19]:
reader = SimpleDirectoryReader(input_dir="./data/paul_graham/")

In [20]:
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# show the metadata of each document
for idx, doc in enumerate(docs):
    print(f"{idx} - {doc.metadata}")

Loaded 3 docs
0 - {'file_path': 'e:\\Learn2\\workspace2\\git_area\\Mastering_LlamaIndex\\2-Stage-Loading\\data\\paul_graham\\paul_graham_essay1.txt', 'file_name': 'paul_graham_essay1.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-11-21', 'last_modified_date': '2024-11-21'}
1 - {'file_path': 'e:\\Learn2\\workspace2\\git_area\\Mastering_LlamaIndex\\2-Stage-Loading\\data\\paul_graham\\paul_graham_essay2.txt', 'file_name': 'paul_graham_essay2.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-11-21', 'last_modified_date': '2024-11-21'}
2 - {'file_path': 'e:\\Learn2\\workspace2\\git_area\\Mastering_LlamaIndex\\2-Stage-Loading\\data\\paul_graham\\paul_graham_essay3.txt', 'file_name': 'paul_graham_essay3.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-11-21', 'last_modified_date': '2024-11-21'}


Create an iterator to load files and process them as they load

In [21]:
reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,
)

all_docs = []
for docs in reader.iter_data():
    for doc in docs:
        # do something with the doc
        doc.text = doc.text.upper()
        all_docs.append(doc)

print(len(all_docs))

3


In [22]:
all_docs

[Document(id_='2422b879-8c2e-4dc0-b8d5-3a1ac0b9ef29', embedding=None, metadata={'file_path': 'llama-index-test-bucket/essays/even_more_essays/paul_graham_essay3.txt', 'file_name': 'paul_graham_essay3.txt', 'file_type': 'text/plain', 'file_size': 75042}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nWHAT I WORKED ON\n\nFEBRUARY 2021\n\nBEFORE COLLEGE THE TWO MAIN THINGS I WORKED ON, OUTSIDE OF SCHOOL, WERE WRITING AND PROGRAMMING. I DIDN\'T WRITE ESSAYS. I WROTE WHAT BEGINNING WRITERS WERE SUPPOSED TO WRITE THEN, AND PROBABLY STILL ARE: SHORT STORIES. MY STORIES WERE AWFUL. THEY HAD HARDLY ANY PLOT, JUST CHARACTERS WITH STRONG FEELINGS, WHICH I IMAGINED MADE THEM DEEP.\n\nTHE FIRST PROGRAMS I TRIED WRITING WERE ON THE IBM 1401 THAT OUR SCHOOL DISTRI
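
The nested loop above implies that each iteration yields a list of documents rather than a single document. As a minimal sketch of that consumption pattern, the snippet below flattens such batches lazily with `itertools.chain`. Note that `fake_iter_data` is a hypothetical stand-in generator that yields plain strings, so the sketch runs without an S3 bucket; with a real reader you would pass `reader.iter_data()` instead.

```python
from itertools import chain


def fake_iter_data():
    """Hypothetical stand-in for an iterator that yields one list of documents per file."""
    yield ["essay1 text"]
    yield ["essay2 text"]
    yield ["essay3 text"]


# Flatten the per-file batches lazily and transform each item as it arrives.
all_docs = [doc.upper() for doc in chain.from_iterable(fake_iter_data())]
print(len(all_docs))
```

Flattening this way avoids the explicit inner loop while still handling one batch at a time.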

Exclude specific patterns on the remote FS

In [24]:
reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,
    exclude=["essays/more_essays/*"],
)

all_docs = []
for docs in reader.iter_data():
    for doc in docs:
        # do something with the doc
        doc.text = doc.text.upper()
        all_docs.append(doc)

print(len(all_docs))
all_docs

2


[Document(id_='bbbabd94-7cf4-422b-8d07-e289a20d4ef3', embedding=None, metadata={'file_path': 'llama-index-test-bucket/essays/even_more_essays/paul_graham_essay3.txt', 'file_name': 'paul_graham_essay3.txt', 'file_type': 'text/plain', 'file_size': 75042}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nWHAT I WORKED ON\n\nFEBRUARY 2021\n\nBEFORE COLLEGE THE TWO MAIN THINGS I WORKED ON, OUTSIDE OF SCHOOL, WERE WRITING AND PROGRAMMING. I DIDN\'T WRITE ESSAYS. I WROTE WHAT BEGINNING WRITERS WERE SUPPOSED TO WRITE THEN, AND PROBABLY STILL ARE: SHORT STORIES. MY STORIES WERE AWFUL. THEY HAD HARDLY ANY PLOT, JUST CHARACTERS WITH STRONG FEELINGS, WHICH I IMAGINED MADE THEM DEEP.\n\nTHE FIRST PROGRAMS I TRIED WRITING WERE ON THE IBM 1401 THAT OUR SCHOOL DISTRI
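
To see which object keys a pattern like `essays/more_essays/*` is meant to filter out, the stdlib-only sketch below applies `fnmatch` to the three keys uploaded earlier. It only illustrates glob-style matching and is not how `SimpleDirectoryReader` resolves `exclude` internally; `split_by_pattern` is a hypothetical helper introduced for this example.

```python
from fnmatch import fnmatch

# The three object keys uploaded to the test bucket earlier in this notebook.
keys = [
    "essays/paul_graham_essay1.txt",
    "essays/more_essays/paul_graham_essay2.txt",
    "essays/even_more_essays/paul_graham_essay3.txt",
]


def split_by_pattern(keys, patterns):
    """Hypothetical helper: partition keys into (kept, excluded) by glob patterns."""
    kept, excluded = [], []
    for key in keys:
        if any(fnmatch(key, pattern) for pattern in patterns):
            excluded.append(key)
        else:
            kept.append(key)
    return kept, excluded


kept, excluded = split_by_pattern(keys, ["essays/more_essays/*"])
print(f"kept={len(kept)} excluded={len(excluded)}")
```

Only the key under `essays/more_essays/` matches, which is consistent with the two documents loaded above.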

Async execution is available through `aload_data`

In [25]:
import nest_asyncio

nest_asyncio.apply()

reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,
)

all_docs = await reader.aload_data()

print(len(all_docs))

3
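
A notebook cell can `await` at the top level, but a plain Python script cannot. A minimal sketch of the script-style equivalent wraps the call in a coroutine and runs it with `asyncio.run`. Here `FakeReader` is a hypothetical stand-in that returns placeholder strings so the example runs without S3; in practice you would construct the `SimpleDirectoryReader` shown above in its place.

```python
import asyncio


class FakeReader:
    """Hypothetical stand-in for a reader exposing an async `aload_data` method."""

    async def aload_data(self):
        await asyncio.sleep(0)  # placeholder for real network I/O
        return ["doc1", "doc2", "doc3"]


async def main():
    # In a real script, construct the SimpleDirectoryReader here instead.
    reader = FakeReader()
    return await reader.aload_data()


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
all_docs = asyncio.run(main())
print(len(all_docs))
```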
