# From REST to reasoning: ingest, index, and query with dlt and Cognee

* Video: https://www.youtube.com/watch?v=MNt_KK32gys
* Homework solution: TBA

# Resources

* [Slides](https://docs.google.com/presentation/d/1oHQilxEVqGGW4S2ctNEE0wHY2LgcjYLaRUziAoinsis/edit?usp=sharing)
* [Colab Notebook](https://colab.research.google.com/drive/1vBA9OIGChcKjjg8r5hHduR0v3A5D6rmH?usp=sharing) 

--- 



# Homework

## Question 1. dlt Version

In this homework, we will load the data from our FAQ into Qdrant.

Let's install dlt with Qdrant support and Qdrant client:

```bash
pip install -q "dlt[qdrant]" "qdrant-client[fastembed]"
```

What's the version of dlt that you installed?

In [50]:
import dlt
import requests
import pandas as pd
from datetime import datetime
from qdrant_client import QdrantClient, models

In [37]:
dlt.__version__

'1.13.0'
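
As an alternative to `dlt.__version__`, the installed version can also be read from the package metadata, which works even when a package does not expose a `__version__` attribute. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if dlt is installed in this environment, else None.
print(installed_version("dlt"))
```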

## dlt Resource

For reading the FAQ data, we have this helper function:

```python
def zoomcamp_data():
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    documents_raw = docs_response.json()

    for course in documents_raw:
        course_name = course['course']

        for doc in course['documents']:
            doc['course'] = course_name
            yield doc
```

Annotate it with `@dlt.resource`. We will use it when creating
a dlt pipeline.
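
Before adding the decorator, it can help to check the flattening logic on its own, without the network call. The sketch below repeats the loop from the helper on a tiny in-memory sample. The function name `flatten_courses`, the sample course names, and the `question`/`text` fields are invented for illustration. Unlike the helper above, it yields a copy of each document instead of modifying it in place.

```python
from typing import Any, Dict, Iterator, List


def flatten_courses(documents_raw: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per document, tagged with its course name."""
    for course in documents_raw:
        course_name = course["course"]
        for doc in course["documents"]:
            # Copy the document so the input structure is left untouched.
            yield {**doc, "course": course_name}


# Hypothetical sample with the same two top-level keys: "course" and "documents".
sample = [
    {
        "course": "course-a",
        "documents": [
            {"question": "q1", "text": "t1"},
            {"question": "q2", "text": "t2"},
        ],
    },
    {"course": "course-b", "documents": [{"question": "q3", "text": "t3"}]},
]

rows = list(flatten_courses(sample))
print(len(rows))          # number of flattened records
print(rows[0]["course"])  # course name attached to the first record
```

Each yielded record is what ends up as one row in the destination, so the number of records should equal the total number of documents across all courses.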

In [7]:
!wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json

--2025-07-13 19:23:47--  https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json [following]
--2025-07-13 19:23:48--  https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘documents.json’


2025-07-13 19:23:48 (15.1 MB/s) - ‘documents.json’ saved [658332/658332]



In [22]:
!cat documents.json | jq ".[].documents | length"

435
375
138


In [23]:
!cat documents.json | jq ".[] | length"

2
2
2


In [38]:
@dlt.resource(write_disposition="replace", name="zoomcamp_data")
def zoomcamp_data():
    docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
    docs_response = requests.get(docs_url)
    documents_raw = docs_response.json()

    for course in documents_raw:
        course_name = course['course']

        for doc in course['documents']:
            doc['course'] = course_name
            yield doc

## Question 2. dlt pipeline

Now let's create a pipeline. 

We need to define a destination for that. Let's use the `qdrant` one:

```python
from dlt.destinations import qdrant

qdrant_destination = qdrant(
  qd_path="db.qdrant", 
)
```

In this case, we tell dlt (and Qdrant) to create a folder for
our data, named `db.qdrant`.

Let's run it:

```python
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data"

)
load_info = pipeline.run(zoomcamp_data())
print(pipeline.last_trace)
```

How many rows were inserted into the `zoomcamp_data` collection?

Look for `"Normalized data for the following tables:"` in the trace output.
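
If you prefer not to read the trace by eye, the row counts can be pulled out of the printed text. The following is a rough sketch that assumes the trace lists tables as `- <table>: <n> row(s)` directly under the `Normalized data for the following tables:` line. The helper name `normalized_row_counts` and the sample text are invented for this example.

```python
import re
from typing import Dict

# Matches lines such as "- some_table: 12 row(s)".
ROW_LINE = re.compile(r"^- (?P<table>\S+): (?P<rows>\d+) row\(s\)$")
SECTION_HEADER = "Normalized data for the following tables:"


def normalized_row_counts(trace_text: str) -> Dict[str, int]:
    """Collect per-table row counts from the normalize section of a trace."""
    counts: Dict[str, int] = {}
    in_section = False
    for line in trace_text.splitlines():
        stripped = line.strip()
        if stripped == SECTION_HEADER:
            in_section = True
            continue
        if in_section:
            match = ROW_LINE.match(stripped)
            if match is None:
                # The first non-matching line ends the table list.
                in_section = False
                continue
            counts[match["table"]] = int(match["rows"])
    return counts


# Made-up trace fragment; real values come from str(pipeline.last_trace).
sample_trace = """\
Step normalize COMPLETED in 0.10 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- example_table: 3 row(s)

Step load COMPLETED in 1.00 seconds.
"""

print(normalized_row_counts(sample_trace))
```

With a real run, pass `str(pipeline.last_trace)` instead of `sample_trace` and look up the `zoomcamp_data` key.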

In [39]:
from dlt.destinations import qdrant

qdrant_destination = qdrant(
  qd_path="db.qdrant", 
)

In [40]:
# Step 2: Create and run the pipeline
pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination=qdrant_destination,
    dataset_name="zoomcamp_tagged_data"

)
load_info = pipeline.run(zoomcamp_data())
print(pipeline.last_trace)

Run started at 2025-07-13 18:21:43.986160+00:00 and COMPLETED in 11.58 seconds with 4 steps.
Step extract COMPLETED in 0.83 seconds.

Load package 1752430904.9536664 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.10 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- zoomcamp_data: 948 row(s)

Load package 1752430904.9536664 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 9.68 seconds.
Pipeline zoomcamp_pipeline load step completed in 9.65 seconds
1 load package(s) were loaded to destination qdrant and into dataset zoomcamp_tagged_data
The qdrant destination used /home/emmuzoo/llm-zoomcamp-2025/workshops/dlt/db.qdrant location to store data
Load package 1752430904.9536664 is LOADED and contains no failed jobs

Step run COMPLETED in 11.58 seconds.
Pipeline zoomcamp_pipeline load step completed in 9.65 seconds
1 load package(s) were loade

In [43]:
#dataset = pipeline.dataset().zoomcamp_data.df()

In [44]:
#dataset

## Question 3. Embeddings

When inserting the data, an embedding model was used. Which one?

You can find this out by inspecting the `meta.json` file in the target
folder: when the data is inserted, a folder named `db.qdrant` is created, and `meta.json` is located inside it.
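
The same inspection can be done in Python. This sketch assumes `meta.json` has the layout `collections -> <collection name> -> vectors -> <vector name> -> size`, which is what the local store wrote in this run. Other Qdrant setups may store a single unnamed vector config instead. The helper name `vector_configs` and the sample values are invented for illustration.

```python
import json
from typing import Any, Dict


def vector_configs(meta: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each collection to its named vectors and their dimensions."""
    result: Dict[str, Dict[str, Any]] = {}
    for collection_name, collection in meta.get("collections", {}).items():
        vectors = collection.get("vectors") or {}
        result[collection_name] = {
            name: config.get("size")
            for name, config in vectors.items()
            # Skip anything that is not a named-vector config object.
            if isinstance(config, dict)
        }
    return result


# Made-up sample; for the real file use:
#   with open("db.qdrant/meta.json") as f:
#       meta = json.load(f)
sample_meta = json.loads(
    '{"collections": {"example_collection": {"vectors": '
    '{"example-model": {"size": 8, "distance": "Cosine"}}}}}'
)

print(vector_configs(sample_meta))
```

The keys of the inner dictionary are the vector names, which is where the embedding model name shows up.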



In [46]:
!ls -lh db.qdrant

total 8.0K
drwxr-xr-x 7 emmuzoo emmuzoo 4.0K Jul 13 20:21 collection
-rw-r--r-- 1 emmuzoo emmuzoo 2.6K Jul 13 20:21 meta.json


In [49]:
!cat db.qdrant/meta.json | jq

{
  "collections": {
    "zoomcamp_tagged_data": {
      "vectors": {
        "fast-bge-small-en": {
          "size": 384,
          "distance": "Cosine",
          "hnsw_config": null,
          "quantization_config": null,
          "on_disk": null,
          "datatype": null,
          "multivector_config": null
        }
      },
      "shard_number": null,
      "sharding_method": null,
      "replication_fac


## Submit the results

* Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2025/homework/dlt