# **Building custom sources with Filesystems**

We will use dlt's `filesystem` resource to build custom sources.

The filesystem source loads files from **remote locations (AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage, SFTP server)** or from the **local filesystem**.

It natively supports CSV, Parquet, and JSONL files, and it can be customized to load any other type of structured file.



In [1]:
%%capture
!pip install pymysql duckdb dlt

In [2]:
!mkdir -p local_data && wget -O local_data/userdata.parquet https://www.timestored.com/data/sample/userdata.parquet

--2025-04-23 16:50:34--  https://www.timestored.com/data/sample/userdata.parquet
Resolving www.timestored.com (www.timestored.com)... 139.162.217.116
Connecting to www.timestored.com (www.timestored.com)|139.162.217.116|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113629 (111K)
Saving to: ‘local_data/userdata.parquet’


2025-04-23 16:50:35 (177 KB/s) - ‘local_data/userdata.parquet’ saved [113629/113629]



*We will use a parquet file from TimeStored*

## 1. Load and Read File From Local Filesystem

The filesystem source loads data in two steps:

1. Access the files and their metadata. This step only lists the files; it does not read their content yet.

2. Read the content and yield records using a transformer.
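
The two steps above can be pictured with a small standard-library sketch. This is not how dlt implements the filesystem source; `list_files`, `read_csv_records`, and the `path` field are hypothetical names chosen for illustration, and a tiny CSV file stands in for real data.

```python
import csv
import tempfile
from pathlib import Path


def list_files(root, pattern):
    """Step 1: yield one metadata dict per matching file, without reading it."""
    for path in sorted(Path(root).glob(pattern)):
        yield {
            "file_name": path.name,
            "size_in_bytes": path.stat().st_size,
            "path": path,  # hypothetical helper field, used by step 2 below
        }


def read_csv_records(files):
    """Step 2: open each listed file and yield its rows as dicts."""
    for item in files:
        with item["path"].open(newline="") as f:
            yield from csv.DictReader(f)


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "users.csv").write_text("id,name\n1,Amanda\n2,Albert\n")
    listing = list(list_files(tmp, "*.csv"))   # metadata only
    records = list(read_csv_records(listing))  # content is read here
```

Keeping the listing separate from the reading is what makes it possible to filter or copy files based on metadata before any content is parsed.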


### *dlt Transformer Reminder*
Transformers are a type of dlt resource that takes input from other resources and returns transformed or enriched data.

In other words, dlt first uses the `filesystem` resource to access the files, and a transformer such as `read_csv` or `read_parquet` then reads their content.

The `|` pipe operator applies the transformer to the resource.
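
For readers curious how a `|` operator can chain data producers in Python at all, here is a toy model built on `__or__`. It is only a sketch of the general idea, not dlt's implementation; `Resource`, `list_files`, and `read_rows` are hypothetical names.

```python
class Resource:
    """Toy stand-in for a resource: wraps a function that returns an iterable."""

    def __init__(self, gen_fn):
        self._gen_fn = gen_fn

    def __iter__(self):
        return iter(self._gen_fn())

    def __or__(self, transformer):
        # `resource | transformer` builds a new resource whose items come from
        # feeding this resource's items into the transformer function
        return Resource(lambda: transformer(iter(self)))


def list_files():
    # stands in for the file-listing step
    yield {"file_name": "a.csv", "content": "1,2"}
    yield {"file_name": "b.csv", "content": "3,4"}


def read_rows(files):
    # stands in for a reader such as read_csv or read_parquet
    for f in files:
        yield {
            "source": f["file_name"],
            "values": [int(v) for v in f["content"].split(",")],
        }


piped = Resource(list_files) | read_rows
rows = list(piped)
```

Nothing runs until `piped` is iterated, which mirrors the lazy behavior described in the two-step loading model.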

In [4]:
import dlt
from dlt.sources.filesystem import filesystem, read_parquet

# point to the local file directory
fs = filesystem(bucket_url="./local_data", file_glob="**/*.parquet")

# add a transformer
parquet_data = fs | read_parquet()

# create the pipeline
pipeline = dlt.pipeline(
    pipeline_name="filesystem_pipeline",
    destination="duckdb",
)
# get the userdata.parquet file that we downloaded earlier
load_info = pipeline.run(parquet_data.with_name("userdata"))
print(load_info)

pipeline.dataset().userdata.df().head()

Pipeline filesystem_pipeline load step completed in 0.43 seconds
1 load package(s) were loaded to destination duckdb and into dataset filesystem_pipeline_dataset
The duckdb destination used duckdb:////content/filesystem_pipeline.duckdb location to store data
Load package 1745428008.4847226 is LOADED and contains no failed jobs


Unnamed: 0,registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title,comments,_dlt_load_id,_dlt_id
0,2016-02-03 07:55:29+00:00,1,Amanda,Jordan,ajordan0@com.com,Female,1.197.201.2,6759521864920116.0,Indonesia,3/8/1971,49756.53,Internal Auditor,100.0,1745428008.4847226,YHBCidwqr4oo9Q
1,2016-02-03 17:04:03+00:00,2,Albert,Freeman,afreeman1@is.gd,Male,218.111.175.34,,Canada,1/16/1968,150280.17,Accountant IV,,1745428008.4847226,psL2JmMrYh/xDg
2,2016-02-03 01:09:31+00:00,3,Evelyn,Morgan,emorgan2@altervista.org,Female,7.161.136.94,6767119071901597.0,Russia,2/1/1960,144972.51,Structural Engineer,,1745428008.4847226,A6jZNNkdvzJkFQ
3,2016-02-03 00:36:21+00:00,4,Denise,Riley,driley3@gmpg.org,Female,140.35.109.83,3576031598965625.0,China,4/8/1997,90263.05,Senior Cost Accountant,,1745428008.4847226,JVPJPGLGdSvkgg
4,2016-02-03 05:05:31+00:00,5,Carlos,Burns,cburns4@miitbeian.gov.cn,,169.113.235.40,5602256255204850.0,South Africa,,,,,1745428008.4847226,ZS5Oa0mrGXEGXw


In [5]:
# check out the numbers below and answer 👀
df = pipeline.dataset().userdata.df()
df.groupby("gender").describe()

gender,id count,id mean,id std,id min,id 25%,id 50%,id 75%,id max,salary count,salary mean,salary std,salary min,salary 25%,salary 50%,salary 75%,salary max
,67.0,567.328358,286.010033,5.0,300.5,601.0,834.0,998.0,0.0,,,,,,,
Female,482.0,505.215768,291.745516,1.0,259.25,507.5,753.75,1000.0,482.0,143737.761411,79935.529577,12380.49,77752.6275,131819.49,214969.8575,286592.99
Male,451.0,485.532151,285.123062,2.0,234.5,475.0,730.5,996.0,450.0,154647.536444,79324.863123,13268.99,87388.205,157575.99,223065.0225,286061.25


## 2. Enrich Files with Metadata

Here we add the file name to each record so that its origin can be traced back.

In [6]:
@dlt.transformer()
def read_parquet_with_filename(files):
    import pyarrow.parquet as pq

    for file_item in files:
        with file_item.open() as f:
            # read the whole file into a pandas DataFrame
            df = pq.read_table(f).to_pandas()
            # record which file each row came from
            df["source_file"] = file_item["file_name"]
            yield df.to_dict(orient="records")

fs = filesystem(bucket_url="./local_data", file_glob="*.parquet")
pipeline = dlt.pipeline("meta_pipeline", destination="duckdb")

load_info = pipeline.run((fs | read_parquet_with_filename()).with_name("userdata"))
print(load_info)

Pipeline meta_pipeline load step completed in 0.42 seconds
1 load package(s) were loaded to destination duckdb and into dataset meta_pipeline_dataset
The duckdb destination used duckdb:////content/meta_pipeline.duckdb location to store data
Load package 1745428266.329879 is LOADED and contains no failed jobs


## 3. Using Metadata to Filter Files

In [8]:
# only loading files that match certain logic

# create the filesystem resource
fs = filesystem(bucket_url="./local_data", file_glob="**/*.parquet")

# only fetch files that have 'user' in name and are less than 1MB
fs.add_filter(lambda f: "user" in f["file_name"] and f["size_in_bytes"] < 1_000_000)

pipeline = dlt.pipeline("filtered_pipeline", destination="duckdb")

load_info = pipeline.run((fs | read_parquet()).with_name("userdata_filtered"))
print(load_info)

Pipeline filtered_pipeline load step completed in 0.39 seconds
1 load package(s) were loaded to destination duckdb and into dataset filtered_pipeline_dataset
The duckdb destination used duckdb:////content/filtered_pipeline.duckdb location to store data
Load package 1745429138.2150872 is LOADED and contains no failed jobs
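
The lambda passed to `add_filter` above can also be written as a named predicate, which is easier to test in isolation on plain dictionaries. The sketch below assumes only the `file_name` and `size_in_bytes` keys used in the cell; `is_small_user_file`, `max_bytes`, and the two extra sample entries are hypothetical.

```python
def is_small_user_file(file_item, max_bytes=1_000_000):
    """True for files with 'user' in the name that are smaller than max_bytes."""
    return (
        "user" in file_item["file_name"]
        and file_item["size_in_bytes"] < max_bytes
    )


# plain dicts standing in for file items; only the first one is from this notebook
candidates = [
    {"file_name": "userdata.parquet", "size_in_bytes": 113_629},
    {"file_name": "orders.parquet", "size_in_bytes": 2_048},
    {"file_name": "users_full.parquet", "size_in_bytes": 5_000_000},
]

selected = [f["file_name"] for f in candidates if is_small_user_file(f)]
```

Such a function could then replace the lambda, for example as `fs.add_filter(is_small_user_file)`.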


## 4. Loading Files Incrementally

For this we use `apply_hints`, which passes additional instructions to dlt, such as the write disposition, the merge key, or an incremental cursor.

```python
import dlt
from dlt.sources.filesystem import filesystem, read_csv

filesystem_pipe = filesystem(bucket_url="file:///Users/admin/Documents/csv_files", file_glob="*.csv") | read_csv()

# tell dlt to merge new data into the existing table, using `date` as the merge key
filesystem_pipe.apply_hints(write_disposition="merge", merge_key="date")
```

In [10]:
fs = filesystem(bucket_url="./local_data", file_glob="**/*.parquet")

# this tells dlt to check each file's `modification_date` metadata and load only files modified since the last load
fs.apply_hints(incremental=dlt.sources.incremental("modification_date"))

parquet_data = (fs | read_parquet()).with_name("userdata_filtered")

pipeline = dlt.pipeline("incremental_pipeline", destination="duckdb")
load_info = pipeline.run(parquet_data)
print(load_info)

Pipeline incremental_pipeline load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:////content/incremental_pipeline.duckdb location to store data
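
As a rough mental model of what an incremental cursor on `modification_date` does, the following standard-library sketch keeps the latest date seen and selects only newer files on the next run. It is a simplification under stated assumptions: `select_new_files` and the sample file entries are hypothetical, and dlt's real implementation also persists the cursor in pipeline state and handles edge cases (such as files sharing the same timestamp) that this sketch ignores.

```python
from datetime import datetime, timezone


def select_new_files(files, last_seen=None):
    """Return (files modified after last_seen, updated cursor value)."""
    new_files = [
        f for f in files
        if last_seen is None or f["modification_date"] > last_seen
    ]
    # advance the cursor only if something new was found
    cursor = max((f["modification_date"] for f in new_files), default=last_seen)
    return new_files, cursor


files = [
    {"file_name": "a.parquet", "modification_date": datetime(2025, 4, 1, tzinfo=timezone.utc)},
    {"file_name": "b.parquet", "modification_date": datetime(2025, 4, 2, tzinfo=timezone.utc)},
]

first_run, cursor = select_new_files(files)           # everything is new
second_run, cursor = select_new_files(files, cursor)  # nothing has changed

files.append(
    {"file_name": "c.parquet", "modification_date": datetime(2025, 4, 3, tzinfo=timezone.utc)}
)
third_run, cursor = select_new_files(files, cursor)   # only the newly added file
```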


## 5. Creating a Custom Transformer

dlt provides the following transformers to read file contents from a filesystem resource:

1. `read_csv()` - processes CSV files using Pandas
2. `read_jsonl()` - processes JSONL files chunk by chunk
3. `read_parquet()` - processes Parquet files using PyArrow
4. `read_csv_duckdb()` - processes CSV files using DuckDB, which is usually faster than Pandas
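
To illustrate what reading "chunk by chunk" can mean for a JSONL file, here is a minimal standard-library sketch. It is not dlt's `read_jsonl()` code; `read_jsonl_chunks` and its `chunk_size` parameter are hypothetical, and an in-memory string stands in for an opened file.

```python
import io
import json
from itertools import islice


def read_jsonl_chunks(file_obj, chunk_size=2):
    """Yield lists of parsed records, with at most chunk_size records per list."""
    # skip blank lines so that they do not break json.loads
    lines = (line for line in file_obj if line.strip())
    while True:
        chunk = [json.loads(line) for line in islice(lines, chunk_size)]
        if not chunk:
            return
        yield chunk


sample = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
chunks = list(read_jsonl_chunks(sample))
```

Yielding bounded chunks keeps memory use proportional to the chunk size rather than to the file size.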

If your files are in a different format, you can create your own custom transformer.

In [12]:
# creating a custom transformer to read JSON files
# the standalone flag marks this as a top-level function rather than a nested one, so dlt preserves the function's docstring and signature
# it also allows the function to accept config values such as secrets
@dlt.transformer(standalone=True)
def read_json(items):
    from dlt.common import json
    for file_obj in items:
        with file_obj.open() as f:
            yield json.load(f)

# Download a JSON file
!wget -O local_data/sample.json https://jsonplaceholder.typicode.com/users

fs = filesystem(bucket_url="./local_data", file_glob="sample.json")
pipeline = dlt.pipeline("json_pipeline", destination="duckdb")

load_info = pipeline.run((fs | read_json()).with_name("users"))
print(load_info)

--2025-04-23 17:45:16--  https://jsonplaceholder.typicode.com/users
Resolving jsonplaceholder.typicode.com (jsonplaceholder.typicode.com)... 104.21.64.1, 104.21.112.1, 104.21.16.1, ...
Connecting to jsonplaceholder.typicode.com (jsonplaceholder.typicode.com)|104.21.64.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘local_data/sample.json’

2025-04-23 17:45:16 (63.8 MB/s) - ‘local_data/sample.json’ saved [5645]

Pipeline json_pipeline load step completed in 0.14 seconds
1 load package(s) were loaded to destination duckdb and into dataset json_pipeline_dataset
The duckdb destination used duckdb:////content/json_pipeline.duckdb location to store data
Load package 1745430316.6578174 is LOADED and contains no failed jobs


In [13]:
pipeline.dataset().users.df().head()

Unnamed: 0,id,name,username,email,address__street,address__suite,address__city,address__zipcode,address__geo__lat,address__geo__lng,phone,website,company__name,company__catch_phrase,company__bs,_dlt_load_id,_dlt_id
0,1,Leanne Graham,Bret,Sincere@april.biz,Kulas Light,Apt. 556,Gwenborough,92998-3874,-37.3159,81.1496,1-770-736-8031 x56442,hildegard.org,Romaguera-Crona,Multi-layered client-server neural-net,harness real-time e-markets,1745429681.313094,lArCPMt+LuzxOg
1,2,Ervin Howell,Antonette,Shanna@melissa.tv,Victor Plains,Suite 879,Wisokyburgh,90566-7771,-43.9509,-34.4618,010-692-6593 x09125,anastasia.net,Deckow-Crist,Proactive didactic contingency,synergize scalable supply-chains,1745429681.313094,X1FzjWK5hLWm6A
2,3,Clementine Bauch,Samantha,Nathan@yesenia.net,Douglas Extension,Suite 847,McKenziehaven,59590-4157,-68.6102,-47.0653,1-463-123-4447,ramiro.info,Romaguera-Jacobson,Face to face bifurcated interface,e-enable strategic applications,1745429681.313094,Lnk7wCH1JQB0Bw
3,4,Patricia Lebsack,Karianne,Julianne.OConner@kory.org,Hoeger Mall,Apt. 692,South Elvis,53919-4257,29.4572,-164.299,493-170-9623 x156,kale.biz,Robel-Corkery,Multi-tiered zero tolerance productivity,transition cutting-edge web services,1745429681.313094,cAZWcAeXzA2Gug
4,5,Chelsey Dietrich,Kamren,Lucio_Hettinger@annie.ca,Skiles Walks,Suite 351,Roscoeview,33263,-31.8129,62.5342,(254)954-1289,demarco.info,Keebler LLC,User-centric fault-tolerant solution,revolutionize end-to-end systems,1745429681.313094,q7NQmd+Z8lVlhA


## 6. Copy Files Locally Before Loading

Copying files to a local directory before loading is useful for backups and post-processing.

In [14]:
import os
from dlt.sources.filesystem import filesystem
from dlt.common.storages.fsspec_filesystem import FileItemDict  # dict-like file item that also exposes the fsspec filesystem client

def copy_local(item: FileItemDict) -> FileItemDict:
    local_path = os.path.join("copied", item["file_name"])
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    item.fsspec.download(item["file_url"], local_path)
    return item

# the add_map method takes a function to apply to each file
fs = filesystem(bucket_url="./local_data", file_glob="**/*.parquet").add_map(copy_local)
pipeline = dlt.pipeline("copy_pipeline", destination="duckdb")
load_info = pipeline.run(fs.with_name("copied_files"))
print(load_info)

Pipeline copy_pipeline load step completed in 0.30 seconds
1 load package(s) were loaded to destination duckdb and into dataset copy_pipeline_dataset
The duckdb destination used duckdb:////content/copy_pipeline.duckdb location to store data
Load package 1745431101.0865488 is LOADED and contains no failed jobs


## 7. Creating a Transformer for XML Files

In [16]:
%%capture
!pip install xmltodict

In [25]:
!wget -O local_data/cd_catalog.xml https://www.w3schools.com/xml/cd_catalog.xml

--2025-04-23 18:10:31--  https://www.w3schools.com/xml/cd_catalog.xml
Resolving www.w3schools.com (www.w3schools.com)... 104.116.243.162, 104.116.243.121, 2600:1417:e800::17d9:b39, ...
Connecting to www.w3schools.com (www.w3schools.com)|104.116.243.162|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4866 (4.8K) [text/xml]
Saving to: ‘local_data/cd_catalog.xml’


2025-04-23 18:10:32 (146 MB/s) - ‘local_data/cd_catalog.xml’ saved [4866/4866]



In [33]:
# use the xmltodict Python library to create a dlt transformer
from collections.abc import Iterator

import dlt
from dlt.common.storages.fsspec_filesystem import FileItemDict
from dlt.common.typing import TDataItems
from dlt.sources.filesystem import filesystem

@dlt.transformer(standalone=True)
def read_xml(items: Iterator[FileItemDict]) -> Iterator[TDataItems]:
  import xmltodict

  # iterate for each file
  for file_obj in items:
    # open the file
    with file_obj.open() as f:
      # parse the xml file and convert to dictionary
      yield xmltodict.parse(f.read())

fs = filesystem(bucket_url="./local_data", file_glob="**/*.xml")

# use the custom transformer to read XML files
xml_data = (fs | read_xml()).with_name("cd_catalog")

pipeline = dlt.pipeline("xml_pipeline", destination="duckdb", dataset_name="xml_data")
load_info = pipeline.run(xml_data)
print(load_info)

Pipeline xml_pipeline load step completed in 0.07 seconds
1 load package(s) were loaded to destination duckdb and into dataset xml_data
The duckdb destination used duckdb:////content/xml_pipeline.duckdb location to store data
Load package 1745431952.5275524 is LOADED and contains no failed jobs


In [34]:
import duckdb

conn = duckdb.connect("xml_pipeline.duckdb")

conn.execute("SET SCHEMA = xml_data")
conn.execute("SHOW TABLES").fetch_df()

Unnamed: 0,name
0,_dlt_loads
1,_dlt_pipeline_state
2,_dlt_version
3,cd_catalog
4,cd_catalog__catalog__cd


In [35]:
conn.execute("SELECT * FROM cd_catalog__catalog__cd").fetch_df()

Unnamed: 0,title,artist,country,company,price,year,_dlt_parent_id,_dlt_list_idx,_dlt_id
0,Empire Burlesque,Bob Dylan,USA,Columbia,10.9,1985,d/I4zCETcKGheA,0,BQVMzeIhLe8PqA
1,Hide your heart,Bonnie Tyler,UK,CBS Records,9.9,1988,d/I4zCETcKGheA,1,XXgVT3z5SQGNdw
2,Greatest Hits,Dolly Parton,USA,RCA,9.9,1982,d/I4zCETcKGheA,2,1vA18aa8dVrtHQ
3,Still got the blues,Gary Moore,UK,Virgin records,10.2,1990,d/I4zCETcKGheA,3,fwOmF6IGW43EIw
4,Eros,Eros Ramazzotti,EU,BMG,9.9,1997,d/I4zCETcKGheA,4,BhzR8bitFRU//A
5,One night only,Bee Gees,UK,Polydor,10.9,1998,d/I4zCETcKGheA,5,OqTCHb1t0xIZOw
6,Sylvias Mother,Dr.Hook,UK,CBS,8.1,1973,d/I4zCETcKGheA,6,CTuEIJncaE/Zuw
7,Maggie May,Rod Stewart,UK,Pickwick,8.5,1990,d/I4zCETcKGheA,7,Fe+IYWKJeyljLQ
8,Romanza,Andrea Bocelli,EU,Polydor,10.8,1996,d/I4zCETcKGheA,8,ixbCbXwektiETw
9,When a man loves a woman,Percy Sledge,USA,Atlantic,8.7,1987,d/I4zCETcKGheA,9,ycO/K+tUWYpO+A
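
The child table `cd_catalog__catalog__cd` appears because each yielded item is a whole document that contains a list of CD entries. One alternative, sketched here with the standard library's `xml.etree.ElementTree` instead of `xmltodict`, is to yield one flat record per `<CD>` element. This is only an illustration: `iter_cd_records` is a hypothetical helper, and the inline sample with uppercase tag names is assumed to mirror the structure of the downloaded catalog.

```python
import io
import xml.etree.ElementTree as ET

# small inline sample, assumed to follow the same layout as cd_catalog.xml
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST><YEAR>1985</YEAR></CD>
  <CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST><YEAR>1988</YEAR></CD>
</CATALOG>
"""


def iter_cd_records(file_obj):
    """Yield one flat dict per <CD> element, with lowercased tag names as keys."""
    tree = ET.parse(file_obj)
    for cd in tree.getroot().iter("CD"):
        yield {child.tag.lower(): child.text for child in cd}


records = list(iter_cd_records(io.BytesIO(SAMPLE_XML)))
```

A transformer built around such a generator would produce a single table of CDs, at the cost of hard-coding the element name it looks for.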
