# `dlt` Demo

Test data: 3 columns with different data types (some may have the wrong type).

1. How does `dlt` handle each of them?
2. How can data be filtered (e.g., by time, to load only new data)?
3. How does `dlt` evolve with new data?
4. How can the data format change (e.g., JSON input, CSV output)?

## Basic

`dlt` can be used in a Jupyter notebook or from the command line (via config).

You can create your own [transformer](https://dlthub.com/docs/dlt-ecosystem/verified-sources/filesystem/advanced#example-read-data-from-excel-files) to load Excel files.

In [1]:
import dlt
import duckdb

# for data validation
from pydantic import BaseModel, Field
from datetime import datetime
from typing import List, Literal
from decimal import Decimal

# using read_csv_duckdb is much more efficient than read_csv, which uses pandas
from dlt.sources.filesystem import filesystem, read_csv_duckdb, read_jsonl

In [2]:
class BusinessRecord(BaseModel):
    """Represents a single business record in the dataset."""
    id: int = Field(gt=0, description="Unique identifier for the record")
    value: Decimal = Field(decimal_places=2, description="Business metric value")
    timestamp: datetime = Field(description="Timestamp of the record")
    description: str = Field(min_length=1, description="Description of the business activity")
    category: Literal["Finance", "Sales", "Customer Service", "Marketing", "HR", "IT"] = Field(
        description="Business department category"
    )
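
To see what this model accepts before handing it to `dlt`, it can help to validate a couple of rows directly with pydantic. The sketch below repeats the model (without descriptions) so it runs on its own; the sample rows and the `validation_errors` helper are made up for illustration, and it only shows pydantic's behaviour, not how `dlt` itself reacts to invalid rows.

```python
from datetime import datetime
from decimal import Decimal
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError


class BusinessRecord(BaseModel):
    """Same constraints as the model above, descriptions omitted."""
    id: int = Field(gt=0)
    value: Decimal = Field(decimal_places=2)
    timestamp: datetime
    description: str = Field(min_length=1)
    category: Literal["Finance", "Sales", "Customer Service", "Marketing", "HR", "IT"]


def validation_errors(row: dict) -> List[str]:
    """Hypothetical helper: names of the fields that fail validation (empty if valid)."""
    try:
        BusinessRecord(**row)
    except ValidationError as exc:
        return sorted({str(err["loc"][0]) for err in exc.errors()})
    return []


# Made-up rows: CSV readers often deliver strings, which pydantic coerces where it can.
good_row = {
    "id": "1",
    "value": "157.23",
    "timestamp": "2024-01-15T09:30:00",
    "description": "Quarterly report",
    "category": "Finance",
}
# id must be > 0 and category must be one of the listed departments.
bad_row = {**good_row, "id": "0", "category": "Legal"}

print(validation_errors(good_row))
print(validation_errors(bad_row))
```

Checking rows this way makes it easier to tell whether a failure comes from the data or from an overly strict constraint in the model.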

In [3]:
filesystem_resource_topic = filesystem(
    bucket_url='file:data/normal',
    file_glob='*.csv'
) 

You can add filters (e.g., by file name or size) at this stage.

In [4]:
filesystem_resource_topic.add_filter(lambda item: item['file_name'] != 'normal3.csv')

<dlt.extract.resource.DltResource at 0x106d87bc0>
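
As a rough illustration of combining name and size filters, here is a plain-Python sketch. The sample items and the helpers `not_named`, `smaller_than`, and `all_of` are hypothetical; it assumes file items expose `file_name` and `size_in_bytes` keys, and real items carry more fields than shown.

```python
# Hypothetical file items standing in for what the filesystem resource yields.
files = [
    {"file_name": "normal1.csv", "size_in_bytes": 1_200},
    {"file_name": "normal2.csv", "size_in_bytes": 48_000},
    {"file_name": "normal3.csv", "size_in_bytes": 900},
]


def not_named(excluded):
    """Predicate: keep items whose file name differs from `excluded`."""
    return lambda item: item["file_name"] != excluded


def smaller_than(max_bytes):
    """Predicate: keep items strictly smaller than `max_bytes`."""
    return lambda item: item["size_in_bytes"] < max_bytes


def all_of(*predicates):
    """Combine predicates: an item is kept only if every predicate accepts it."""
    return lambda item: all(p(item) for p in predicates)


keep = all_of(not_named("normal3.csv"), smaller_than(10_000))
kept_names = [f["file_name"] for f in files if keep(f)]
print(kept_names)
```

A predicate built this way could presumably be passed to `add_filter` in place of the lambda above.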

In [5]:
filesystem_pipe_topic = filesystem_resource_topic | read_csv_duckdb()

You can apply hints at this stage, e.g., enable [incremental loading](https://dlthub.com/docs/general-usage/incremental-loading) (load only the new data), set the table name, or specify the table schema.

In [6]:
filesystem_pipe_topic.apply_hints(write_disposition='replace', table_name='normal', columns=BusinessRecord)

<dlt.extract.resource.DltResource at 0x107de4c20>
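
The idea behind incremental loading can be sketched in plain Python: remember the highest cursor value seen so far and keep only rows beyond it. This illustrates the concept only, not `dlt`'s implementation (in `dlt` this is configured through `dlt.sources.incremental`); the `select_new` helper and the sample rows are hypothetical.

```python
from datetime import datetime


def select_new(rows, last_seen, cursor_field="timestamp"):
    """Return (rows newer than `last_seen`, updated cursor value)."""
    fresh = [r for r in rows if last_seen is None or r[cursor_field] > last_seen]
    # If nothing is new, the cursor stays where it was.
    new_cursor = max((r[cursor_field] for r in fresh), default=last_seen)
    return fresh, new_cursor


# Made-up data: the second run sees the old rows again plus one new row.
rows_day1 = [
    {"id": 1, "timestamp": datetime(2025, 1, 15, 9, 0)},
    {"id": 2, "timestamp": datetime(2025, 1, 16, 9, 0)},
]
rows_day2 = rows_day1 + [{"id": 3, "timestamp": datetime(2025, 1, 17, 9, 0)}]

first_batch, cursor = select_new(rows_day1, last_seen=None)
second_batch, cursor = select_new(rows_day2, last_seen=cursor)
new_ids = [r["id"] for r in second_batch]
print(new_ids)
```

Note that incremental loading usually pairs with an `append` or `merge` write disposition; with `replace`, as used in this demo, the table is rewritten on each run.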

The code below generates an `example.duckdb` file. This file can be used in dbt via dbt-duckdb; see [this doc](https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb).

In [7]:
pipeline_topic = dlt.pipeline(
    pipeline_name='csv_load', 
    destination=dlt.destinations.duckdb('example.duckdb'), 
    dataset_name='mydata', dev_mode=True
)

load_info = pipeline_topic.run(filesystem_pipe_topic) # the hints can be passed here as well

In [8]:
# print(load_info)

In [9]:
print(pipeline_topic.default_schema.to_pretty_yaml())

version: 2
version_hash: kIJkEwsqD4hyhNTH5Foe2+2EZdk/ww6glM0dcOkA+1E=
engine_version: 11
name: csv_load
tables:
  _dlt_version:
    columns:
      version:
        data_type: bigint
        nullable: false
      engine_version:
        data_type: bigint
        nullable: false
      inserted_at:
        data_type: timestamp
        nullable: false
      schema_name:
        data_type: text
        nullable: false
      version_hash:
        data_type: text
        nullable: false
      schema:
        data_type: text
        nullable: false
    write_disposition: skip
    resource: _dlt_version
    description: Created by DLT. Tracks schema updates
  _dlt_loads:
    columns:
      load_id:
        data_type: text
        nullable: false
      schema_name:
        data_type: text
        nullable: true
      status:
        data_type: bigint
        nullable: false
      inserted_at:
        data_type: timestamp
        nullable: false
      schema_version_hash:
        data_type: text


In [10]:
db = duckdb.connect(database='example.duckdb')

In [11]:
db.sql('DESCRIBE;')

┌──────────┬───────────────────────┬─────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐
│ database │        schema         │        name         │                                           column_names                                           │                                         column_types                                         │ temporary │
│ varchar  │        varchar        │       varchar       │                                            varchar[]                                             │                                          varchar[]                                           │  boolean  │
├──────────┼───────────────────────┼─────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────

In [12]:
db.sql('SELECT id, COUNT(*) FROM mydata_20250117034742.normal GROUP BY id;')

┌───────┬──────────────┐
│  id   │ count_star() │
│ int64 │    int64     │
├───────┼──────────────┤
│     1 │            1 │
│     2 │            1 │
│     3 │            1 │
│     4 │            1 │
│     5 │            1 │
│     6 │            1 │
│     7 │            1 │
│     8 │            1 │
│     9 │            1 │
│    10 │            1 │
│     · │            · │
│     · │            · │
│     · │            · │
│    21 │            1 │
│    22 │            1 │
│    23 │            1 │
│    24 │            1 │
│    25 │            1 │
│    26 │            1 │
│    27 │            1 │
│    28 │            1 │
│    29 │            1 │
│    30 │            1 │
├───────┴──────────────┤
│  30 rows (20 shown)  │
└──────────────────────┘

In [13]:
db.close()

If you want to use Pandas, the data can be accessed via [`ReadableDataset`](https://dlthub.com/docs/general-usage/dataset-access/dataset).

Besides transforming data with `duckdb` directly, you can also use [the `dlt` SQL client](https://dlthub.com/docs/dlt-ecosystem/transformations/sql).

In [14]:
with pipeline_topic.sql_client() as p:
    ans = p.execute_sql('SELECT category, COUNT(*) FROM mydata_20250117034742.normal GROUP BY category;')

ans

[('Marketing', 7),
 ('Customer Service', 5),
 ('IT', 3),
 ('HR', 4),
 ('Finance', 5),
 ('Sales', 6)]
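
Because the pipeline was created with `dev_mode=True`, the dataset name gets a timestamp suffix, so hard-coded names like the one above go stale when the pipeline is recreated. One possible way around this is to build the qualified table name from the pipeline's `dataset_name` attribute. The `qualified` helper is hypothetical, and a `SimpleNamespace` stands in for the real pipeline so the sketch runs without a database.

```python
from types import SimpleNamespace


def qualified(pipeline, table):
    """Build a quoted `dataset.table` name from the pipeline's current dataset."""
    return f'"{pipeline.dataset_name}"."{table}"'


# Stand-in for a real pipeline object; only `dataset_name` is needed here.
fake_pipeline = SimpleNamespace(dataset_name="mydata_20250117034742")

query = (
    f"SELECT category, COUNT(*) FROM {qualified(fake_pipeline, 'normal')} "
    "GROUP BY category;"
)
print(query)
```

Quoting the identifiers also helps with table names that collide with SQL keywords.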

## Join Transformation

In [15]:
filesystem_resource_task = filesystem(
    bucket_url='file:data/joinable',
    file_glob='*.csv'
) 

filesystem_resource_task.add_filter(lambda item: item['file_name'] != 'j03.csv')

filesystem_pipe_task = filesystem_resource_task | read_csv_duckdb()

filesystem_pipe_task.apply_hints(write_disposition='replace', table_name='join')

pipeline_task = dlt.pipeline(
    pipeline_name='csv_load_join', 
    destination=dlt.destinations.duckdb('example.duckdb'), 
    dataset_name='mydata', dev_mode=True
)

load_info_task = pipeline_task.run(filesystem_pipe_task) # the hints can be passed here as well

In [24]:
with pipeline_task.sql_client() as p:
    ans = p.execute_sql('SELECT n.id, value, category, assigned_to, status FROM mydata_20250117035246.normal AS n JOIN mydata_20250117035805.join AS j ON n.id=j.id;')

print(ans)

[(1, Decimal('157.230000000'), 'Finance', 'john.doe@company.com', 'completed'), (2, Decimal('293.450000000'), 'Sales', 'alice.smith@company.com', 'completed'), (3, Decimal('432.180000000'), 'Customer Service', 'bob.wilson@company.com', 'in_progress'), (4, Decimal('567.890000000'), 'Marketing', 'sarah.jones@company.com', 'completed'), (5, Decimal('123.450000000'), 'HR', 'mike.brown@company.com', 'pending'), (6, Decimal('789.120000000'), 'Marketing', 'emma.davis@company.com', 'completed'), (7, Decimal('234.560000000'), 'Finance', 'james.miller@company.com', 'completed'), (8, Decimal('678.900000000'), 'Sales', 'olivia.wilson@company.com', 'in_progress'), (9, Decimal('345.670000000'), 'Customer Service', 'william.taylor@company.com', 'completed'), (10, Decimal('891.230000000'), 'Marketing', 'sophia.anderson@company.com', 'pending'), (11, Decimal('456.780000000'), 'HR', 'alexander.thomas@company.com', 'completed'), (12, Decimal('912.340000000'), 'Finance', 'isabella.martin@company.com', 'in

## Schema

## Privacy Preserving

### Pseudonymizing Columns

### Removing Columns

## Incremental Config