# Parallel Processing SimpleDirectoryReader

In this notebook, we demonstrate how to use parallel processing when loading data with `SimpleDirectoryReader`. Parallel processing can be useful for heavier workloads, e.g., loading from a directory that contains many files. (NOTE: on Windows, you may see smaller gains from parallel loading. This is due to differences in how multiprocessing works on Linux/macOS and on Windows; see, e.g., [here](https://pythonforthelab.com/blog/differences-between-multiprocessing-windows-and-linux/) or [here](https://stackoverflow.com/questions/52465237/multiprocessing-slower-than-serial-processing-in-windows-but-not-in-linux).)
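To see which situation applies on your machine, you can check the start method that Python's `multiprocessing` module uses. This is a minimal sketch using only the standard library; the helper name `start_method_summary` is hypothetical and not part of LlamaIndex. Windows always uses `spawn`, which starts a fresh interpreter for each worker, while the default on other platforms depends on the Python version.

```python
import multiprocessing as mp
import sys


def start_method_summary() -> str:
    """Return the platform name and the multiprocessing start method in use."""
    # get_start_method() returns "fork", "spawn", or "forkserver".
    method = mp.get_start_method()
    return f"{sys.platform}: {method}"


print(start_method_summary())
```

With `spawn`, each worker has to start a new interpreter and re-import modules, which is one plausible reason for the smaller gains mentioned in the note.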

In [1]:
import cProfile, pstats
from pstats import SortKey

In this demo, we'll use the `PatronusAIFinanceBenchDataset` llama-dataset from [llamahub](https://llamahub.ai). This dataset is based on a set of 32 PDF files, which are included in the download from llamahub.

In [2]:
!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data_parallel

Successfully downloaded PatronusAIFinanceBenchDataset to ./data_parallel




In [1]:
from llama_index.core import SimpleDirectoryReader

# define our reader with the directory containing the 32 pdf files
reader = SimpleDirectoryReader(input_dir="./data_parallel/source_files")

### Sequential Load

Sequential loading is the default behaviour and can be executed via the `load_data()` method.

In [5]:
documents = reader.load_data(show_progress=True)
len(documents)

Loading files: 100%|██████████| 32/32 [05:26<00:00, 10.21s/file]


4306

In [6]:
cProfile.run("reader.load_data()", "oldstats")
p = pstats.Stats("oldstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Thu Nov 21 23:19:25 2024    oldstats

         1875499942 function calls (1872026563 primitive calls) in 795.273 seconds

   Ordered by: cumulative time
   List reduced from 370 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  795.273  795.273 {built-in method builtins.exec}
        1    0.007    0.007  795.273  795.273 <string>:1(<module>)
        1    0.000    0.000  795.266  795.266 base.py:645(load_data)
       32    0.000    0.000  795.262   24.852 base.py:474(load_file)
       32    0.000    0.000  795.254   24.852 __init__.py:328(wrapped_f)
       32    0.000    0.000  795.253   24.852 __init__.py:465(__call__)
       32    0.057    0.002  795.250   24.852 base.py:36(load_data)
     4306    3.372    0.001  786.607    0.183 _page.py:2268(extract_text)
4444/4306   16.172    0.004  783.233    0.182 _page.py:1825(_extract_text)
     4444    0.013    0.000  505.669    0.114 _data_structures.py:1401(ope

<pstats.Stats at 0x1a2231ebb10>

### Parallel Load

To load using parallel processes, we set `num_workers` to a positive integer value.
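One way to choose a value is to cap the worker count by both the number of CPU cores and the number of files, since extra workers beyond either limit have nothing useful to do. The sketch below is an illustration only: `pick_num_workers` is a hypothetical helper, not part of the `SimpleDirectoryReader` API, and the default cap of 10 simply mirrors the value used in this notebook.

```python
import os


def pick_num_workers(n_files: int, max_workers: int = 10) -> int:
    """Pick a worker count no larger than the file count, CPU count, or cap.

    Hypothetical helper for illustration; always returns at least 1.
    """
    n_cpus = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(max_workers, n_cpus, n_files))


# Example for the 32 PDF files used in this notebook (assumed usage):
# documents = reader.load_data(num_workers=pick_num_workers(32))
print(pick_num_workers(32))
```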

In [None]:
documents = reader.load_data(num_workers=10, show_progress=True)

In [None]:
len(documents)

4306

In [None]:
cProfile.run("reader.load_data(num_workers=10)", "newstats")
p = pstats.Stats("newstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Wed Jan 10 13:05:13 2024    newstats

         12539 function calls in 31.319 seconds

   Ordered by: cumulative time
   List reduced from 212 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   31.319   31.319 {built-in method builtins.exec}
        1    0.003    0.003   31.319   31.319 <string>:1(<module>)
        1    0.000    0.000   31.316   31.316 base.py:367(load_data)
       24    0.000    0.000   31.139    1.297 threading.py:589(wait)
       23    0.000    0.000   31.139    1.354 threading.py:288(wait)
      155   31.138    0.201   31.138    0.201 {method 'acquire' of '_thread.lock' objects}
        1    0.000    0.000   31.133   31.133 pool.py:369(starmap)
        1    0.000    0.000   31.133   31.133 pool.py:767(get)
        1    0.000    0.000   31.133   31.133 pool.py:764(wait)
        1    0.000    0.000    0.155    0.155 context.py:115(Pool)
        1    0.000    0.000    0.155    0.155 pool

<pstats.Stats at 0x29408ab30>
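Note that `cProfile` only instruments the process it runs in. The parallel profile above records 12539 function calls, almost all of the time spent waiting on a lock, because the actual parsing happens in the worker processes. Deterministic profiling also adds overhead to every function call. For a plain end-to-end comparison, wall-clock timing may therefore be a better fit. The following is a minimal sketch; `time_call` is a hypothetical helper, not a LlamaIndex function.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Assumed usage with the reader defined earlier in this notebook:
# docs_seq, t_seq = time_call(reader.load_data)
# docs_par, t_par = time_call(reader.load_data, num_workers=10)
total, seconds = time_call(sum, range(1_000_000))
print(f"sum={total}, elapsed={seconds:.4f}s")
```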

### In Conclusion

In [None]:
391 / 30

13.033333333333333

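As the 391 / 30 ratio computed above shows, parallel loading is roughly 13x faster than sequential loading (a speed increase of about 1200%) when loading from a directory with many files. Note that the two profiles printed earlier were recorded on different dates (the sequential one reports 795.273 seconds, the parallel one 31.319 seconds), so their totals are not directly comparable with the figures used in this ratio.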
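The relationship between the "x times faster" figure and the percentage increase can be made explicit with a small calculation. This is a sketch under assumptions: the helper names are hypothetical, and `total_seconds` relies on the `total_tt` attribute of `pstats.Stats`, which holds the total profiled time of a stats file such as `oldstats` or `newstats`.

```python
import pstats


def total_seconds(stats_path: str) -> float:
    """Read the total profiled time (in seconds) from a cProfile stats file."""
    return pstats.Stats(stats_path).total_tt


def speedup(sequential_s: float, parallel_s: float) -> float:
    """Return how many times faster the parallel run is."""
    return sequential_s / parallel_s


def percent_increase(sequential_s: float, parallel_s: float) -> float:
    """Return the speed increase as a percentage (2x faster is +100%)."""
    return (speedup(sequential_s, parallel_s) - 1.0) * 100.0


# Figures from the calculation in this notebook.
print(speedup(391, 30), percent_increase(391, 30))

# Assumed usage with the stats files written earlier:
# print(speedup(total_seconds("oldstats"), total_seconds("newstats")))
```

For a fair comparison, both stats files should come from the same machine and the same session.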