In [1]:
!pip install -q "dlt[qdrant]" "qdrant-client[fastembed]"




In [2]:
!dlt --version

dlt 1.15.0


In [4]:
import requests 
import dlt
from dlt.destinations import qdrant

## What dlt is

**dlt** is a Python library for building **ELT/ETL** pipelines with minimal boilerplate. You write a **resource** (a function that produces the data as rows, data frames, or chunks) and a **pipeline** (which says where to load it). dlt takes care of schema inference, typing, creating tables, and loading into a destination (here: **Qdrant**).

* `@dlt.resource(...)` turns the decorated function (here `zoomcamp_data`) into a **resource**: a producer of data for dlt to load.
* **`name="zoomcamp_table"`**: the **resource name**. By default this becomes the **table name** in the destination.
* **`write_disposition="replace"`**: loading behavior. On each run, **drop/recreate** the table and load fresh data. (Dangerous if you expect to accumulate history; use `append` to keep growing, or `merge` when you have keys.)
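To make the difference between the three write dispositions concrete, here is a toy model in plain Python. It is not dlt's implementation: the `load` helper, the list standing in for a table, and the `id` key are all hypothetical, and the merge branch is simplified to a plain upsert by key.

```python
# Toy model of the three write dispositions (NOT dlt's implementation).
# `table` is a plain list standing in for a destination table.

def load(table, rows, write_disposition="append", primary_key=None):
    """Return the new table contents after loading `rows`."""
    if write_disposition == "replace":
        # Previous contents are discarded; only the new rows remain.
        return list(rows)
    if write_disposition == "append":
        # New rows are added after the existing ones, duplicates included.
        return table + list(rows)
    if write_disposition == "merge":
        if primary_key is None:
            raise ValueError("merge needs a primary_key in this toy model")
        # Upsert: a new row overwrites an existing row with the same key.
        merged = {row[primary_key]: row for row in table}
        for row in rows:
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"unknown write_disposition: {write_disposition}")


# Two hypothetical runs that overlap on id == 1.
run_1 = [{"id": 1, "text": "old"}, {"id": 2, "text": "keep"}]
run_2 = [{"id": 1, "text": "new"}, {"id": 3, "text": "added"}]

replaced = load(load([], run_1, "replace"), run_2, "replace")
appended = load(load([], run_1, "append"), run_2, "append")
merged = load(load([], run_1, "merge", "id"), run_2, "merge", "id")

print(len(replaced), len(appended), len(merged))  # prints: 2 4 3
```

After two runs, `replace` keeps only the second batch, `append` keeps every row from both batches (including both versions of `id == 1`), and `merge` keeps one row per key with the newest values.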

### Why **`yield`** and not `return`?

Short version: **streaming, robustness, and scale**.

* **Streaming / memory-safety**: With `yield`, you can emit data in **chunks** (e.g., pages from an API, monthly partitions) so your process doesn’t hold everything in RAM.
* **Resumability**: Progress can be tracked chunk by chunk. Combined with dlt’s incremental-loading state, a rerun after a failure can skip data that was already loaded instead of starting from scratch.
* **Throughput**: dlt can process each chunk as it arrives rather than waiting for the whole dataset to be built first.
* **Schema inference**: dlt infers the schema from the data as it is processed and evolves it when new fields appear, so you don’t need the full dataset up front to define the tables.

Returning everything at once (for example `return df`) is fine for tiny datasets, but it becomes a foot-gun once data grows. Use `yield` by default so memory use stays bounded as the data scales.
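The memory argument can be seen without dlt at all. The sketch below compares a function that returns everything with a generator that yields one page at a time. `fake_api_page`, `PAGE_SIZE`, and `TOTAL_ROWS` are hypothetical stand-ins for a real paginated API.

```python
# Hypothetical paged source: fake_api_page stands in for an HTTP call.
PAGE_SIZE = 3
TOTAL_ROWS = 8
pages_fetched = []  # records which pages were actually requested


def fake_api_page(page):
    """Return one page of rows, or an empty list past the end."""
    pages_fetched.append(page)
    start = page * PAGE_SIZE
    stop = min(start + PAGE_SIZE, TOTAL_ROWS)
    return [{"id": i} for i in range(start, stop)]


def rows_with_return():
    """Fetch every page first, then hand back one big list."""
    all_rows = []
    page = 0
    while True:
        rows = fake_api_page(page)
        if not rows:
            return all_rows
        all_rows.extend(rows)
        page += 1


def rows_with_yield():
    """Hand back each page as soon as it is fetched."""
    page = 0
    while True:
        rows = fake_api_page(page)
        if not rows:
            return
        yield rows
        page += 1


stream = rows_with_yield()
first_chunk = next(stream)                # only page 0 has been fetched so far
pages_after_first = list(pages_fetched)   # snapshot: [0]
remaining_chunks = list(stream)           # now the rest is fetched

print(first_chunk)        # [{'id': 0}, {'id': 1}, {'id': 2}]
print(pages_after_first)  # [0]
```

With the generator, the consumer receives the first page before any later page is requested, so at most one page needs to be held in memory at a time. A dlt resource can yield pages (lists of dicts) in the same way.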

In [5]:
@dlt.resource(write_disposition='replace', name='zoomcamp_table')
def zoomcamp_data():
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    documents_raw = docs_response.json()

    for course in documents_raw:
        course_name = course['course']

        for doc in course['documents']:
            doc['course'] = course_name
            yield doc
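The flattening step in the resource above can be tried offline on a small inline sample. This is a sketch: `sample_raw` and the `flatten_documents` helper are made up for illustration, and the `question` and `text` fields are assumptions about the inner documents. Only the `course` and `documents` keys come from the code above.

```python
# Inline sample shaped like the nested JSON: one entry per course,
# each holding a list of documents. The inner field names are assumptions.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"question": "Q1", "text": "A1"},
            {"question": "Q2", "text": "A2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [{"question": "Q3", "text": "A3"}],
    },
]


def flatten_documents(documents_raw):
    """Yield one flat dict per document, tagged with its course name."""
    for course in documents_raw:
        for doc in course["documents"]:
            # Build a new dict instead of mutating the input document.
            yield {**doc, "course": course["course"]}


flat = list(flatten_documents(sample_raw))
for row in flat:
    print(row)
```

Each yielded dict becomes one row, so the three nested documents turn into three rows that each carry a `course` column.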

The next cell creates a **pipeline** that specifies where and how to load the data.

  * **`pipeline_name`**: a unique id for the pipeline; used for state and local files.
  * **`destination=qdrant_destination`**: load into a local **Qdrant** database stored on disk at the `qd_path` location (`db.qdrant`).
  * **`dataset_name="zoomcamp_schema"`**: logical **namespace** in the destination. Qdrant has no schemas, so dlt uses the dataset name as a prefix for collection names; your table is stored as the collection `zoomcamp_schema_zoomcamp_table`.

---

```python
load_info = pipeline.run(zoomcamp_data()) 
```
**`load_info`**: structured result (load id, tables written, row counts, any state updates). Useful for logging/CI.

---

```python
print(pipeline.last_trace)
```
Prints the diagnostic trace of the **last run**—handy for debugging (tables created, rows loaded, timings, warnings, errors).


In [6]:
qdrant_destination = qdrant(
  qd_path="db.qdrant", 
)

pipeline = dlt.pipeline(
    pipeline_name='zoomcamp_pipeline', 
    destination=qdrant_destination, 
    dataset_name='zoomcamp_schema'
)

load_info = pipeline.run(zoomcamp_data())
print(load_info)
print(pipeline.last_trace)

Fetching 5 files: 100%|██████████| 5/5 [00:03<00:00,  1.40it/s]


Pipeline zoomcamp_pipeline load step completed in 4.36 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_schema
The qdrant destination used /workspaces/LLM-Zoomcamp/dlt-workshop/db.qdrant location to store data
Load package 1755468689.8987837 is LOADED and contains no failed jobs
Run started at 2025-08-17 22:11:23.995365+00:00 and COMPLETED in 10.83 seconds with 4 steps.
Step extract COMPLETED in 0.47 seconds.

Load package 1755468689.8987837 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.08 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- zoomcamp_table: 948 row(s)

Load package 1755468689.8987837 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 4.37 seconds.
Pipeline zoomcamp_pipeline load step completed in 4.36 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_