<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/simple_directory_reader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Directory Reader

The `SimpleDirectoryReader` is the most commonly used data connector that _just works_.  
Simply pass in an input directory or a list of files.  
It selects the best file reader for each file based on its extension.  
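
To make the extension-based selection concrete, here is a minimal standard-library sketch of the idea. It is not LlamaIndex's implementation: the `READERS` registry, `pick_reader`, and both reader functions are hypothetical names invented for illustration.

```python
from pathlib import Path
from typing import Callable, Dict


def read_plain_text(path: Path) -> str:
    """Fallback reader: decode the file as UTF-8 text."""
    return path.read_text(encoding="utf-8")


def read_markdown(path: Path) -> str:
    """Hypothetical Markdown reader: strip leading '#' heading markers."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return "\n".join(line.lstrip("#").strip() for line in lines)


# Hypothetical registry: lowercase file extension -> reader function.
READERS: Dict[str, Callable[[Path], str]] = {".md": read_markdown}


def pick_reader(path: Path) -> Callable[[Path], str]:
    """Choose a reader by extension, falling back to plain text."""
    return READERS.get(path.suffix.lower(), read_plain_text)
```

In this sketch, unknown extensions fall back to plain text; the real reader's defaults may differ.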

### Get Started

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [42]:
%pip install llama-index




In [43]:
%pip install wget

Note: you may need to restart the kernel to use updated packages.


Download Data

In [None]:
# Use this on Windows (via WSL) to download the files
!wsl mkdir -p 'data/paul_graham/'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wsl wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'


In [None]:
# Use this in a Linux environment
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'


In [44]:
from llama_index.core import SimpleDirectoryReader

Load specific files 

In [45]:
reader = SimpleDirectoryReader(
    input_files=["data/paul_graham/paul_graham_essay1.txt"]
)

In [46]:
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 1 docs


Load all (top-level) files from directory

In [47]:
reader = SimpleDirectoryReader(input_dir="data/paul_graham/")

In [48]:
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 3 docs


Load all (recursive) files from directory 

In [49]:
# only load markdown files
required_exts = [".md"]

reader = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=required_exts,
    recursive=True,
)

In [51]:
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 1 docs
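
Before loading, it can help to preview which files a given combination of `recursive` and `required_exts` would pick up. The following is a standard-library sketch under stated assumptions, not the reader's own discovery logic: `preview_files` is a hypothetical helper, it matches extensions exactly (case-sensitive), and it ignores `exclude` and `exclude_hidden`.

```python
from pathlib import Path
from typing import List, Optional


def preview_files(
    input_dir: str,
    required_exts: Optional[List[str]] = None,
    recursive: bool = False,
) -> List[Path]:
    """List files under input_dir, optionally descending into subdirectories."""
    root = Path(input_dir)
    # rglob walks the whole tree; glob only looks at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    files = [p for p in candidates if p.is_file()]
    if required_exts is not None:
        # Assumption: exact, case-sensitive match on the final extension.
        files = [p for p in files if p.suffix in required_exts]
    return sorted(files)
```

For example, `preview_files("./data", required_exts=[".md"], recursive=True)` returns only the Markdown files anywhere under `./data`.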


Create an iterator to load files and process them as they load

In [52]:
reader = SimpleDirectoryReader(
    input_dir="./data",
    recursive=True,
)

all_docs = []
for docs in reader.iter_data():
    for doc in docs:
        # do something with the doc
        doc.text = doc.text.upper()
        all_docs.append(doc)

print(len(all_docs))

4
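
One reason to iterate rather than load everything at once is to hand documents downstream in bounded batches. The sketch below shows one way to do that; `in_batches` is a hypothetical helper, not part of LlamaIndex, and it assumes only that the iterator yields lists of items, as in the loop above.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(groups: Iterable[List[T]], batch_size: int) -> Iterator[List[T]]:
    """Flatten an iterable of lists and regroup the items into fixed-size batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for group in groups:
        for item in group:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
    # Emit whatever is left over as a final, possibly shorter, batch.
    if batch:
        yield batch
```

With the reader above, this could be used as `for batch in in_batches(reader.iter_data(), 2): ...`, so that at most two documents are held per batch.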


Async execution is available through `aload_data`

In [53]:
import nest_asyncio

nest_asyncio.apply()

reader = SimpleDirectoryReader(
    input_dir="./data",
    recursive=True,
)

all_docs = await reader.aload_data()

print(len(all_docs))

4


## Full Configuration

This is the full list of arguments that can be passed to the `SimpleDirectoryReader`:

```python
class SimpleDirectoryReader(BaseReader):
    """Simple directory reader.

    Load files from file directory.
    Automatically select the best file reader given file extensions.

    Args:
        input_dir (str): Path to the directory.
        input_files (List): List of file paths to read
            (Optional; overrides input_dir, exclude)
        exclude (List): glob of python file paths to exclude (Optional)
        exclude_hidden (bool): Whether to exclude hidden files (dotfiles).
        encoding (str): Encoding of the files.
            Default is utf-8.
        errors (str): how encoding and decoding errors are to be handled,
              see https://docs.python.org/3/library/functions.html#open
        recursive (bool): Whether to recursively search in subdirectories.
            False by default.
        filename_as_id (bool): Whether to use the filename as the document id.
            False by default.
        required_exts (Optional[List[str]]): List of required extensions.
            Default is None.
        file_extractor (Optional[Dict[str, BaseReader]]): A mapping of file
            extension to a BaseReader class that specifies how to convert that file
            to text. If not specified, use default from DEFAULT_FILE_READER_CLS.
        num_files_limit (Optional[int]): Maximum number of files to read.
            Default is None.
        file_metadata (Optional[Callable[[str], Dict]]): A function that takes
            in a filename and returns a Dict of metadata for the Document.
            Default is None.
        fs (Optional[fsspec.AbstractFileSystem]): File system to use. Defaults
            to using the local file system. Can be changed to use any remote file system
            exposed via the fsspec interface.
    """
```
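
As an illustration of the `file_metadata` argument, the sketch below defines a function with the shape the docstring describes: it takes a filename and returns a dict. The function name and the metadata keys are hypothetical choices, not ones the library requires.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def basic_file_metadata(file_path: str) -> Dict[str, object]:
    """Return a small metadata dict for one file (keys are illustrative)."""
    path = Path(file_path)
    stat = path.stat()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": path.name,
        "extension": path.suffix,
        "size_bytes": stat.st_size,
        # ISO date (UTC) of the last modification, e.g. "2024-01-31".
        "last_modified": modified.date().isoformat(),
    }
```

It would be passed as `SimpleDirectoryReader(input_dir="data/paul_graham/", file_metadata=basic_file_metadata)`, after which the returned values should appear in each loaded document's metadata.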
