In [7]:
import dlt
import duckdb

In [3]:
# Sample data containing pokemon details
data = [
    {"id": "1", "name": "bulbasaur", "size": {"weight": 6.9, "height": 0.7}},
    {"id": "4", "name": "charmander", "size": {"weight": 8.5, "height": 0.6}},
    {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}},
]

In [5]:
# Set pipeline name, destination, and dataset name
pipeline = dlt.pipeline(
    pipeline_name="quick_start",
    destination="duckdb",
    dataset_name="mydata",
)

You instantiate a pipeline by calling the `dlt.pipeline` function with the following arguments:

* **`pipeline_name`**: This is the name you give to your pipeline. It helps you track and monitor your pipeline, and lets dlt restore its state and data structures on future runs. If you don't provide a name, dlt will use the name of the Python file you're running as the pipeline name.
* **`destination`**: The name of the destination to which dlt will load the data. It can also be passed to the pipeline's `run` method instead.
* **`dataset_name`**: This is the name of the group of tables (or dataset) where your data will be sent. You can think of a dataset like a folder that holds many files, or a schema in a relational database. You can also specify this later when you run or load the pipeline. If you don't provide a name, it will default to the name of your pipeline.
* **`dev_mode`**: If you set this to True, dlt will add a timestamp to your dataset name every time you create a pipeline. This means a new dataset will be created each time you create a pipeline.

There are more arguments, but they are for advanced use, so we skip them for now.

In [10]:
load_info = pipeline.run(data, table_name='pokemon')
print(load_info)

Pipeline quick_start load step completed in 1.13 seconds
1 load package(s) were loaded to destination duckdb and into dataset mydata
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\Apache_Spark\dlt\quick_start.duckdb location to store data
Load package 1740221796.9537327 is LOADED and contains no failed jobs


In [11]:
# Connect to the DuckDB database
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# Set search path to the dataset
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")

# Describe the dataset
conn.sql("DESCRIBE").df()

|   | database | schema | name | column_names | column_types | temporary |
|---|---|---|---|---|---|---|
| 0 | quick_start | mydata | _dlt_loads | [load_id, schema_name, status, inserted_at, sc... | [VARCHAR, VARCHAR, BIGINT, TIMESTAMP WITH TIME... | False |
| 1 | quick_start | mydata | _dlt_pipeline_state | [version, engine_version, pipeline_name, state... | [BIGINT, BIGINT, VARCHAR, VARCHAR, TIMESTAMP W... | False |
| 2 | quick_start | mydata | _dlt_version | [version, engine_version, inserted_at, schema_... | [BIGINT, BIGINT, TIMESTAMP WITH TIME ZONE, VAR... | False |
| 3 | quick_start | mydata | pokemon | [id, name, size__weight, size__height, _dlt_lo... | [VARCHAR, VARCHAR, DOUBLE, DOUBLE, VARCHAR, VA... | False |


Commonly used arguments for `pipeline.run`:

* **`data`** (the first argument) may be a dlt source, resource, generator function, or any iterator or iterable (e.g., a list or the result of the `map` function).
* **`write_disposition`** controls how to write data to a table. Defaults to the value "append".
  * `append` will always add new data at the end of the table.
  * `replace` will replace existing data with new data.
  * `skip` will prevent data from loading.
  * `merge` will deduplicate and merge data based on `primary_key` and `merge_key` hints.
* **`table_name`**: Specify this when the table name cannot be inferred from a resource or from the name of a generator function, for example when you pass a plain list as in this notebook.
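To make the four `write_disposition` values concrete, here is a toy in-memory model of what each one does to a table. It is only a sketch of the semantics described in the list, not dlt's implementation: the `write` helper is hypothetical, the `merge` branch assumes a single `primary_key` column and ignores `merge_key`, and the weight `9.0` is made-up data.

```python
def write(table, rows, write_disposition="append", primary_key="id"):
    """Return the table contents after loading `rows` (toy model, not dlt code)."""
    if write_disposition == "append":
        # New rows are added after the existing ones
        return table + rows
    if write_disposition == "replace":
        # Existing rows are discarded in favour of the new ones
        return list(rows)
    if write_disposition == "skip":
        # Nothing is loaded; the table is unchanged
        return list(table)
    if write_disposition == "merge":
        # Rows sharing a primary key are overwritten, others are added
        merged = {row[primary_key]: row for row in table}
        merged.update({row[primary_key]: row for row in rows})
        return list(merged.values())
    raise ValueError(f"unknown write_disposition: {write_disposition}")


existing = [{"id": "1", "weight": 6.9}, {"id": "4", "weight": 8.5}]
incoming = [{"id": "4", "weight": 9.0}, {"id": "25", "weight": 6.0}]  # made-up update

results = {}
for disposition in ("append", "replace", "skip", "merge"):
    results[disposition] = write(existing, incoming, disposition)
    print(disposition, [(row["id"], row["weight"]) for row in results[disposition]])
```

With `append`, the row for id `4` appears twice, which is why `merge` with a key hint is the usual choice when the same records may be loaded more than once.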

In [13]:
# Fetch all data from 'pokemon' as a DataFrame
table = conn.sql("SELECT * FROM pokemon").df()

# Display the DataFrame
table

|   | id | name | size__weight | size__height | _dlt_load_id | _dlt_id |
|---|---|---|---|---|---|---|
| 0 | 1 | bulbasaur | 6.9 | 0.7 | 1740221796.9537327 | 4sB3KJFnF3Hp0w |
| 1 | 4 | charmander | 8.5 | 0.6 | 1740221796.9537327 | fnuHa8r1tf8ROw |
| 2 | 25 | pikachu | 6.0 | 0.4 | 1740221796.9537327 | /4vrcF2jZGWaLQ |
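Note that the nested `size` dictionary from the sample data shows up as the columns `size__weight` and `size__height`. The sketch below reproduces that naming pattern in plain Python to show where the column names come from. The `flatten` helper is hypothetical and is not how dlt is implemented; it only handles nested dictionaries and does not add the `_dlt_load_id` and `_dlt_id` columns.

```python
def flatten(record, parent_key="", sep="__"):
    """Flatten nested dicts into one level, joining keys with a double underscore."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the parent column name along
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


row = {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}}
flat_row = flatten(row)
print(flat_row)
```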


In [14]:
conn.close()