# How does `dlt` Work

## `dlt` Pipeline
- The main building block of `dlt` is the pipeline object, created with `dlt.pipeline()`
- The `pipeline.run()` method encompasses three steps: extract, normalize, load

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    progress="log",  # enables detailed logging
)

load_info = pipeline.run(
    [
        {"id": 1},
        {"id": 2},
        {"id": 3, "nested": [{"id": 1}, {"id": 2}]},
    ],
    table_name="items",
)
print(load_info)
```

```
---------------------------------- Extract my ----------------------------------
Resources: 0/1 (0.0%) | Time: 0.00s | Rate: 0.00/s
Memory usage: 170.01 MB (47.30%) | CPU usage: 0.00%

---------------------------------- Extract my ----------------------------------
Resources: 0/1 (0.0%) | Time: 0.02s | Rate: 0.00/s
items: 1  | Time: 0.00s | Rate: 0.00/s
Memory usage: 170.27 MB (47.30%) | CPU usage: 0.00%

---------------------------------- Extract my ----------------------------------
Resources: 1/1 (100.0%) | Time: 0.03s | Rate: 29.62/s
items: 3  | Time: 0.02s | Rate: 179.14/s
Memory usage: 170.37 MB (47.30%) | CPU usage: 0.00%

---------------------- Normalize my in 1740475667.9840188 ----------------------
Files: 0/1 (0.0%) | Time: 0.00s | Rate: 0.00/s
Memory usage: 170.69 MB (47.30%) | CPU usage: 0.00%

---------------------- Normalize my in 1740475667.9840188 ----------------------
Files: 0/1 (0.0%) | Time: 0.02s | Rate: 0.00/s
Items: 0  | Time: 0.00s | Rate: 0.00/s
Memory usage: 
```

1. **Extract** - Fully extracts the data from the source to your hard drive
2. **Normalize** - Inspects the data to compute a schema and unnests nested fields
3. **Load** - Loads the data into the destination and runs schema migrations if necessary
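
To make the unnesting in the normalize step more concrete, here is a minimal pure-Python sketch. It is not `dlt`'s normalizer: the function `unnest` and the helper columns `_row_id`, `_parent_id`, and `_list_idx` are names made up for this illustration, and `dlt` keeps its own bookkeeping columns. The sketch only shows the idea of moving a nested list into a child table named `<parent>__<field>` and linking each child row back to its parent.

```python
from itertools import count


def unnest(rows, table_name):
    """Toy unnesting: move nested lists into child tables linked to their parent rows."""
    tables = {}
    row_ids = count(1)  # simple surrogate keys, for illustration only

    def visit(row, table, parent_id=None, list_idx=None):
        flat = {"_row_id": next(row_ids)}
        if parent_id is not None:
            flat["_parent_id"] = parent_id
            flat["_list_idx"] = list_idx
        tables.setdefault(table, []).append(flat)
        for key, value in row.items():
            if isinstance(value, list):
                # each list element becomes a row in the child table
                for idx, child in enumerate(value):
                    child_row = child if isinstance(child, dict) else {"value": child}
                    visit(child_row, f"{table}__{key}", flat["_row_id"], idx)
            else:
                # scalars (and, in this sketch, nested dicts) stay on the parent row
                flat[key] = value

    for row in rows:
        visit(row, table_name)
    return tables


data = [
    {"id": 1},
    {"id": 2},
    {"id": 3, "nested": [{"id": 1}, {"id": 2}]},
]
tables = unnest(data, "items")
for name, table_rows in tables.items():
    print(name, table_rows)
```

With the sample data from the pipeline run above, this produces an `items` table with three rows and an `items__nested` table with two rows that both point at the parent row for `id` 3.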

## Extract Stage `pipeline.extract(data)`

1. Data is first collected in an in-memory buffer (by default it holds up to 5000 items). Think of it as a bucket that stores data as it is read from the source.
2. When the bucket is full, its contents are written to an intermediary temp file on disk and the bucket is emptied so it can collect more data.
3. If a maximum size is specified for intermediary files and the current file reaches it, a new intermediary file is opened for further data.
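
The three steps above can be modelled with a small toy writer. This is a sketch of the described behaviour, not `dlt`'s implementation: the class `BufferedJsonlWriter`, the helper `extract_to_files`, and the file naming scheme are assumptions for illustration, while `buffer_max_items` and `file_max_items` simply mirror the setting names used in these notes.

```python
import json
import tempfile
from pathlib import Path


class BufferedJsonlWriter:
    """Toy model: collect items in memory, flush to disk when full, optionally rotate files."""

    def __init__(self, out_dir, buffer_max_items=5000, file_max_items=None):
        self.out_dir = Path(out_dir)
        self.buffer_max_items = buffer_max_items
        self.file_max_items = file_max_items  # None means "never rotate"
        self.buffer = []
        self.file_index = 0
        self.items_in_file = 0

    def write(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.buffer_max_items:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        path = self.out_dir / f"items.{self.file_index}.jsonl"
        with path.open("a", encoding="utf-8") as f:
            for item in self.buffer:
                f.write(json.dumps(item) + "\n")
        self.items_in_file += len(self.buffer)
        self.buffer.clear()
        # rotation is checked per flush, so later flushes go to a new file
        if self.file_max_items is not None and self.items_in_file >= self.file_max_items:
            self.file_index += 1
            self.items_in_file = 0


def extract_to_files(items, out_dir, buffer_max_items=5000, file_max_items=None):
    writer = BufferedJsonlWriter(out_dir, buffer_max_items, file_max_items)
    for item in items:
        writer.write(item)
    writer.flush()  # write whatever is left in the bucket
    return sorted(p.name for p in Path(out_dir).glob("*.jsonl"))


with tempfile.TemporaryDirectory() as tmp:
    files = extract_to_files(
        ({"id": i} for i in range(12)), tmp, buffer_max_items=5, file_max_items=10
    )
    print(files)  # ['items.0.jsonl', 'items.1.jsonl']
```

Leaving `file_max_items` at `None` in this sketch yields a single file no matter how many items pass through, which matches the no-rotation default described below.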


### Default Behaviour
- The in-memory buffer is set to 5000 items.
- By default, intermediary files are not rotated. Unless you explicitly set a limit (for example, `file_max_items=100000`), `dlt` creates a single file per resource regardless of how many records it contains, even if that is millions.
- By default, intermediary files at the extract stage use a custom version of the JSONL format.

## Normalize Stage `pipeline.normalize()`

- Depends on data having been extracted
- Intermediary files from the `extract` stage are passed on to the `normalize` stage
- Intermediary files are processed one at a time, each in its own in-memory buffer
- As in the extract stage, when the buffer is full, the data is written to an intermediary file and the buffer is cleared
- If an intermediary file reaches the specified size threshold, a new file is opened (by default intermediary files are not rotated, so everything is written to a single file)

## Load Stage `pipeline.load()`

- Depends on the normalize stage having completed
- All intermediary files from a single source are combined into a single load package
- All load packages are then loaded to the destination

### Default behaviour
- Loading happens in 20 threads, each loading a single file.
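
The "one file per thread" idea can be sketched with a thread pool from the standard library. This is only a model of the behaviour described above: `load_file` is a hypothetical stand-in for a real load job, and `load_package` is a name invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def load_file(path):
    """Hypothetical load job; a real one would copy the file into the destination."""
    return f"loaded {path}"


def load_package(file_paths, workers=20):
    # each worker thread picks up one file at a time until none are left
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_file, file_paths))


results = load_package([f"items.{i}.parquet" for i in range(50)])
print(len(results), results[0])
```

With 50 files and 20 workers, at most 20 files are in flight at once; the remaining files wait until a thread becomes free.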

## Intermediary File - Formats

Intermediary files at the extract stage use a custom version of the JSONL format, while the loader files - files created at the normalize stage - can take 4 different formats.

### **JSONL**
- JSONL (JSON Lines, also called newline-delimited JSON) is a file format that stores several JSON documents in one file, with the documents separated by newlines
    - Used by: BigQuery, Snowflake, Filesystem
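
As a quick illustration of the format itself (plain standard-library Python, independent of `dlt`), the sketch below writes two documents as JSONL and reads them back. The helper names `dump_jsonl` and `load_jsonl` are made up for this example.

```python
import io
import json


def dump_jsonl(documents, stream):
    # one JSON document per line
    for doc in documents:
        stream.write(json.dumps(doc) + "\n")


def load_jsonl(stream):
    # skip blank lines, parse every other line as its own JSON document
    return [json.loads(line) for line in stream if line.strip()]


buffer = io.StringIO()
dump_jsonl([{"id": 1}, {"id": 2, "tags": ["a", "b"]}], buffer)
print(buffer.getvalue())

buffer.seek(0)
docs = load_jsonl(buffer)
print(docs)
```

Because every line is a complete document, a reader can process the file line by line without loading all of it into memory.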

#### Configuration
Option 1 - specify in pipeline run
`info = pipeline.run(some_source(), loader_file_format="jsonl")`

Option 2 - config.toml or secrets.toml
```
[normalize]
loader_file_format="jsonl"
```

Option 3 - Via ENV Variables
`export NORMALIZE__LOADER_FILE_FORMAT="jsonl"`


Option 4 - Specify in resource decorator
```python
@dlt.resource(file_format='jsonl')
def generate_rows():
    ...  # yield rows here
```


### **Parquet**
- Apache Parquet is a free, open-source, column-oriented data storage format from the Apache Hadoop ecosystem.
- To use this format you need the `pyarrow` package, which is also available as a `dlt` extra: `pip install "dlt[parquet]"`

#### Configuration
Option 1 - specify in pipeline run
`info = pipeline.run(some_source(), loader_file_format="parquet")`

Option 2 - config.toml or secrets.toml
```
[normalize]
loader_file_format="parquet"
```

Option 3 - Via ENV Variables
`export NORMALIZE__LOADER_FILE_FORMAT="parquet"`


Option 4 - Specify in resource decorator
```python
@dlt.resource(file_format='parquet')
def generate_rows():
    ...  # yield rows here
```

**Destination AutoConfig**:

`dlt` automatically configures the Parquet writer based on the destination's capabilities:

- Selects the appropriate decimal type and sets the correct precision and scale for accurate numeric data storage, including handling very small units like Wei.

- Adjusts the timestamp resolution (seconds, microseconds, or nanoseconds) to match what the destination supports.
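
Why precision and scale matter for amounts such as Wei (1 Wei is 10^-18 Ether) can be shown with the standard library alone. The sketch below is not how `dlt` picks its decimal type: the helper `fits`, the sample balance, and the `(38, ...)` precision values are assumptions chosen for illustration.

```python
from decimal import Decimal


def fits(value, precision, scale):
    """Check whether a Decimal fits a DECIMAL(precision, scale) column without rounding."""
    _, digits, exponent = value.as_tuple()
    fractional_digits = max(0, -exponent)
    integral_digits = max(0, len(digits) + exponent)
    return fractional_digits <= scale and integral_digits <= precision - scale


balance_wei = 123_456_789 * 10**18 + 1                  # 27-digit integer amount in Wei
balance_eth = Decimal("123456789.000000000000000001")   # the same amount in Ether

print(int(float(balance_wei)) == balance_wei)  # False: a float cannot hold all 27 digits
print(fits(Decimal(balance_wei), 38, 0))       # True: enough integer digits
print(fits(balance_eth, 38, 18))               # True: 18 fractional digits are kept
print(fits(balance_eth, 38, 9))                # False: scale 9 would round the last Wei away
```

The takeaway is that the writer needs a decimal type whose precision and scale are large enough for the destination column, otherwise the smallest units are silently lost.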


**Writer settings:**

`dlt` uses the pyarrow Parquet writer for file creation. You can adjust the writer's behavior with the following options:

- `flavor` adjusts schema and compatibility settings for different target systems. Defaults to None (pyarrow default).
- `version` selects Parquet logical types based on the Parquet format version. Defaults to "2.6".
- `data_page_size` sets the target size for data pages within a column chunk (in bytes). Defaults to None.
- `timestamp_timezone` specifies the timezone; defaults to UTC.
- `coerce_timestamps` sets the timestamp resolution (s, ms, us, ns).
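- `allow_truncated_timestamps` controls what happens when coercing to a coarser resolution would lose precision: when disabled (the pyarrow default), an error is raised; when enabled, the timestamps are truncated without an error.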

  **Example configurations:**

  - In `config.toml` or `secrets.toml`:
    ```toml
    [normalize.data_writer]
    # example overrides (these are not the defaults listed above)
    flavor="spark"
    version="2.4"
    data_page_size=1048576
    timestamp_timezone="Europe/Berlin"
    ```

  - Via environment variables:
    ```sh
    export NORMALIZE__DATA_WRITER__FLAVOR="spark"
    ```
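
How `coerce_timestamps` and `allow_truncated_timestamps` relate can be sketched without pyarrow. The function below is a toy model of the semantics, not a call into the Parquet writer: `coerce_timestamp` and `MICROS_PER_UNIT` are names invented here, and `ns` is left out because Python's `datetime` only has microsecond resolution.

```python
from datetime import datetime

# microseconds per unit; "ns" is omitted because datetime stops at microseconds
MICROS_PER_UNIT = {"s": 1_000_000, "ms": 1_000, "us": 1}


def coerce_timestamp(value, unit, allow_truncated=False):
    """Round a datetime down to the given unit, or raise if that would lose precision."""
    lost = value.microsecond % MICROS_PER_UNIT[unit]
    if lost and not allow_truncated:
        raise ValueError(
            f"coercing {value.isoformat()} to '{unit}' would lose {lost} microseconds"
        )
    return value.replace(microsecond=value.microsecond - lost)


ts = datetime(2024, 1, 1, 12, 0, 0, 123456)
print(coerce_timestamp(ts, "us"))                        # nothing to truncate
print(coerce_timestamp(ts, "ms", allow_truncated=True))  # microseconds cut to 123000
```

Calling `coerce_timestamp(ts, "ms")` without `allow_truncated=True` raises in this sketch, which is the strict behaviour you get when truncation is not explicitly allowed.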


**Timestamps and timezones**

`dlt` adds UTC adjustments to all timestamps, creating timezone-aware timestamp columns in destinations (except DuckDB).

**Disable timezone/UTC adjustments:**

- Set `flavor` to `spark` to use the deprecated `int96` timestamp type without logical adjustments.

- Set `timestamp_timezone` to an empty string (`DATA_WRITER__TIMESTAMP_TIMEZONE=""`) to generate logical timestamps without UTC adjustment.

By default, pyarrow converts timezone-aware DateTime objects to UTC and stores them in Parquet without timezone information.
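
The effect of "convert to UTC, then drop the timezone" can be reproduced with the standard library. This is only an illustration of the arithmetic, not pyarrow code; the fixed `+01:00` offset stands in for a Berlin winter time and is an assumption of the example.

```python
from datetime import datetime, timedelta, timezone

# fixed +01:00 offset, used here instead of a timezone database lookup
berlin_winter = timezone(timedelta(hours=1))
local = datetime(2024, 1, 1, 12, 0, tzinfo=berlin_winter)

as_utc = local.astimezone(timezone.utc)  # same instant, expressed in UTC
stored = as_utc.replace(tzinfo=None)     # timezone information dropped

print(local.isoformat())   # 2024-01-01T12:00:00+01:00
print(as_utc.isoformat())  # 2024-01-01T11:00:00+00:00
print(stored.isoformat())  # 2024-01-01T11:00:00
```

The stored value still identifies the same instant as long as every reader knows it is in UTC; the original offset itself is not recoverable.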


### **CSV**

#### Configuration

- Directly in the `pipeline.run()`:

  ```py
  info = pipeline.run(some_source(), loader_file_format="csv")
  ```

- In `config.toml` or `secrets.toml`:

  ```toml
  [normalize]
  loader_file_format="csv"
  ```

- Via environment variables:

  ```sh
  export NORMALIZE__LOADER_FILE_FORMAT="csv"
  ```

- Specify directly in the resource decorator:

  ```py
  @dlt.resource(file_format="csv")
  def generate_rows():
    ...
  ```


**Two implementations:**

1. `pyarrow` csv writer - very fast, multithreaded writer for arrow tables
   - binary columns are supported only if they contain valid UTF-8 characters
   - complex (nested, struct) types are not supported
2. `python stdlib writer` - a csv writer included in the Python standard library, used for Python objects
   - binary columns are supported only if they contain valid UTF-8 characters (easy to add more encodings)
   - complex columns are dumped with `json.dumps`
   - `None` values are always quoted

**Default settings:**

- separators are commas
- quotes are `"` and are escaped as `""`
- NULL values are represented both as empty strings and as empty tokens
- UNIX new lines are used
- dates are represented as ISO 8601
- quoting style is "when needed"

**Adjustable settings:**

- `delimiter`: change the delimiting character (default: ',')
- `include_header`: include the header row (default: True)
- `quoting`: `quote_all` - all values are quoted, `quote_needed` - quote only values that need quoting (default: `quote_needed`)

  ```toml
  [normalize.data_writer]
  delimiter="|"
  include_header=false
  quoting="quote_all"
  ```

  or

  ```sh
  export NORMALIZE__DATA_WRITER__DELIMITER="|"
  export NORMALIZE__DATA_WRITER__INCLUDE_HEADER=False
  export NORMALIZE__DATA_WRITER__QUOTING=quote_all
  ```
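
To see what the default and adjusted settings look like in an actual file, the sketch below mimics them with Python's standard `csv` module. It is not `dlt`'s writer: `write_csv` and the `QUOTING` mapping are names made up for this example, and only the parameter names `delimiter`, `include_header`, and `quoting` are taken from the settings above.

```python
import csv
import io
from datetime import date

# map the setting values used in these notes onto stdlib quoting modes
QUOTING = {"quote_needed": csv.QUOTE_MINIMAL, "quote_all": csv.QUOTE_ALL}


def write_csv(header, rows, delimiter=",", include_header=True, quoting="quote_needed"):
    out = io.StringIO()
    writer = csv.writer(
        out,
        delimiter=delimiter,
        quotechar='"',
        doublequote=True,     # a quote inside a value is escaped as ""
        lineterminator="\n",  # UNIX new lines
        quoting=QUOTING[quoting],
    )
    if include_header:
        writer.writerow(header)
    for row in rows:
        # dates are written in ISO 8601 form; None is passed through to the writer
        writer.writerow([v.isoformat() if isinstance(v, date) else v for v in row])
    return out.getvalue()


header = ["id", "note", "day"]
rows = [[1, 'say "hi"', date(2024, 1, 31)], [2, None, None]]

print(write_csv(header, rows))
print(write_csv(header, rows, delimiter="|", include_header=False, quoting="quote_all"))
```

With the defaults, the `None` values come out as empty tokens (`2,,`); with `quote_all`, the same values are written as quoted empty strings (`"2"|""|""`).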

### **SQL INSERT FILE FORMAT**
- The file contains `INSERT VALUES` statements to be executed on the destination
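
As a rough picture of what such a file holds, the sketch below renders a few rows as a single `INSERT ... VALUES` statement. The exact layout `dlt` writes may differ; `sql_literal` and `insert_values` are hypothetical helpers, and the quoting rules shown are a simplified assumption covering only `None`, booleans, numbers, and strings.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (None, bool, numbers, and strings only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"  # escape single quotes by doubling


def insert_values(table, columns, rows):
    column_list = ", ".join(f'"{c}"' for c in columns)
    value_rows = ",\n".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f'INSERT INTO "{table}" ({column_list})\nVALUES\n{value_rows};'


statement = insert_values("items", ["id", "name"], [(1, "O'Brien"), (2, None)])
print(statement)
```

Since the statements are executed as-is on the destination, this format suits SQL destinations that have no bulk-load path for the other file formats.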

#### Configuration

- Directly in the `pipeline.run()`:

  ```py
  info = pipeline.run(some_source(), loader_file_format="insert_values")
  ```

- In `config.toml` or `secrets.toml`:

  ```toml
  [normalize]
  loader_file_format="insert_values"
  ```

- Via environment variables:

  ```sh
  export NORMALIZE__LOADER_FILE_FORMAT="insert_values"
  ```

- Specify directly in the resource decorator:

  ```py
  @dlt.resource(file_format="insert_values")
  def generate_rows():
    ...
  ```