<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/simple_directory_reader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Directory Reader over a Remote FileSystem

The `SimpleDirectoryReader` is the most commonly used data connector that _just works_.  
By default, it can be used to parse a variety of file-types on your local filesystem into a list of `Document` objects.
Additionally, it can be configured to read from a remote filesystem just as easily. This is made possible through the [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/index.html) filesystem interface.

This notebook will take you through an example of using `SimpleDirectoryReader` to load documents from an S3 bucket. You can either run this against an actual S3 bucket, or a locally emulated S3 bucket via [LocalStack](https://www.localstack.cloud/).

### Get Started

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
!pip install llama-index s3fs boto3 

Download Data

In [None]:
# On Windows, run these commands through WSL to download the three essay copies
!wsl mkdir -p 'data/paul_graham/'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'


In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'

In [None]:
# start LocalStack in the background (skip this cell if you are using real AWS S3)
!docker run -d -p 4566:4566 -p 4571:4571 localstack/localstack
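
Because `docker run -d` returns as soon as the container is started, the next cell can run before the emulated S3 endpoint answers requests. The helper below is a minimal sketch for waiting until the endpoint responds. `wait_for_endpoint` is a hypothetical name, not part of LlamaIndex, boto3, or LocalStack, and it only checks that some HTTP response comes back, not that every LocalStack service has finished initializing.

```python
import http.client
import time
import urllib.error
import urllib.request

# Ignore any proxy configured in the environment; the endpoint is local.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_endpoint(
    url: str,
    timeout: float = 30.0,
    interval: float = 0.5,
    request_timeout: float = 2.0,
) -> bool:
    """Poll `url` until any HTTP response arrives; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            _opener.open(url, timeout=request_timeout).close()
            return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status code.
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset, or timed out: not ready yet.
            time.sleep(interval)
    return False
```

For the LocalStack setup in this notebook, a call such as `wait_for_endpoint("http://localhost:4566")` before creating the bucket would be one way to use it.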


In [None]:
# create a test-bucket in S3
import boto3

endpoint_url = (
    "http://localhost:4566"  # use this line if you are using S3 via localstack
)
# endpoint_url = None  # use this line if you are using real AWS S3
bucket_name = "llama-index-test-bucket"
# LocalStack accepts any credentials, so dummy values are enough here.
# For real AWS S3, drop the two key arguments so boto3 uses your configured credentials.
s3 = boto3.resource(
    "s3",
    endpoint_url=endpoint_url,
    aws_access_key_id="dummy_access_key",
    aws_secret_access_key="dummy_secret_key",
)
s3.create_bucket(Bucket=bucket_name)
bucket = s3.Bucket(bucket_name)
# put the paul graham essays in the test-bucket in various subdirectories
bucket.upload_file(
    "data/paul_graham/paul_graham_essay1.txt", "essays/paul_graham_essay1.txt"
)
bucket.upload_file(
    "data/paul_graham/paul_graham_essay2.txt",
    "essays/more_essays/paul_graham_essay2.txt",
)
bucket.upload_file(
    "data/paul_graham/paul_graham_essay3.txt",
    "essays/even_more_essays/paul_graham_essay3.txt",
)

In [None]:
from llama_index.core import SimpleDirectoryReader

In [None]:
# create the filesystem using s3fs
from s3fs import S3FileSystem

s3_fs = S3FileSystem(
    anon=False,
    key="dummy_access_key",  # dummy access key for LocalStack
    secret="dummy_secret_key",  # dummy secret key for LocalStack
    endpoint_url=endpoint_url,
)

Load all files from the bucket, including subdirectories

In [None]:
reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,  # recursively searches all subdirectories
)

In [None]:
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Load all (top-level) files from a local directory

In [None]:
reader = SimpleDirectoryReader(input_dir="./data/paul_graham/")

In [None]:
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# show the metadata of each document
for idx, doc in enumerate(docs):
    print(f"{idx} - {doc.metadata}")

Create an iterator to load files and process them as they load

In [None]:
reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,
)

all_docs = []
for docs in reader.iter_data():
    for doc in docs:
        # do something with the doc
        doc.text = doc.text.upper()
        all_docs.append(doc)

print(len(all_docs))

Exclude specific patterns on the remote FS

In [None]:
reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,
    exclude=["essays/more_essays/*"],
)

all_docs = []
for docs in reader.iter_data():
    for doc in docs:
        # do something with the doc
        doc.text = doc.text.upper()
        all_docs.append(doc)

print(len(all_docs))
all_docs
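
To get a feel for which uploaded keys a pattern such as `essays/more_essays/*` covers, the sketch below applies shell-style glob matching to the three keys uploaded earlier. It uses the standard library's `fnmatch` purely as an illustration; it is not how `SimpleDirectoryReader` resolves `exclude` internally, and the assumption that keys are matched relative to the bucket root is ours.

```python
from fnmatch import fnmatchcase

# The three object keys uploaded to the test bucket earlier in this notebook.
keys = [
    "essays/paul_graham_essay1.txt",
    "essays/more_essays/paul_graham_essay2.txt",
    "essays/even_more_essays/paul_graham_essay3.txt",
]
exclude = ["essays/more_essays/*"]

# Keep every key that matches none of the exclude patterns.
kept = [
    key
    for key in keys
    if not any(fnmatchcase(key, pattern) for pattern in exclude)
]

for key in kept:
    print(key)
```

Note that in `fnmatch` a `*` also matches `/`, so a broader pattern such as `essays/*` would match all three keys in this sketch.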

Async execution is available through `aload_data`

In [None]:
import nest_asyncio

nest_asyncio.apply()

reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,
)

all_docs = await reader.aload_data()

print(len(all_docs))
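
The cell above relies on the notebook's top-level `await`. In a plain Python script there is no running event loop, so the call has to be wrapped in a coroutine and started with `asyncio.run`. The sketch below shows that shape only: `fake_aload_data` and `fake_load_file` are hypothetical stand-ins that mimic an awaitable loader, so the example runs without an S3 bucket. In real code, `main` would await `reader.aload_data()` instead.

```python
import asyncio


async def fake_load_file(key: str) -> str:
    """Hypothetical stand-in for reading one remote file."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return f"contents of {key}"


async def fake_aload_data(keys: list[str]) -> list[str]:
    """Hypothetical stand-in for reader.aload_data()."""
    return await asyncio.gather(*(fake_load_file(key) for key in keys))


async def main() -> list[str]:
    keys = [
        "essays/paul_graham_essay1.txt",
        "essays/more_essays/paul_graham_essay2.txt",
        "essays/even_more_essays/paul_graham_essay3.txt",
    ]
    return await fake_aload_data(keys)


# asyncio.run creates the event loop, runs main() to completion, and closes the loop.
all_docs = asyncio.run(main())
print(len(all_docs))
```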