# **Homework: Speed up your pipeline**

### **Goal**

Use the public **Jaffle Shop API** to build a `dlt` pipeline and apply everything you've learned about performance:

- Chunking
- Parallelism
- Buffer control
- File rotation
- Worker tuning

Your task is to **make the pipeline as fast as possible**, while keeping the results correct.



### **What you’ll need**

- API base: `https://jaffle-shop.scalevector.ai/api/v1`
- Docs: [https://jaffle-shop.scalevector.ai/docs](https://jaffle-shop.scalevector.ai/docs)
- Start with these endpoints:
  - `/customers`
  - `/orders`
  - `/products`

Each of them returns **paged responses** — so you'll need to handle pagination.
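As a rough illustration of what header-based pagination involves, the sketch below pulls the `rel="next"` URL out of an HTTP `Link` header. It assumes the API advertises the next page this way (the notebook's use of `HeaderLinkPaginator` suggests it does). The `next_link` helper and the `example.test` URLs are made up for this example and are not part of `dlt`.

```python
import re
from typing import Optional

# Matches one "<url>; param=value; ..." entry of a Link header.
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')


def next_link(link_header: str, rel: str = "next") -> Optional[str]:
    """Return the URL whose rel matches `rel`, or None if there is none."""
    for url, params in _LINK_RE.findall(link_header):
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.lower() == "rel" and rel in value.strip('"').split():
                return url
    return None


# Hypothetical header values, for illustration only.
header = (
    '<https://example.test/api/v1/orders?page=2>; rel="next", '
    '<https://example.test/api/v1/orders?page=9>; rel="last"'
)
print(next_link(header))
print(next_link('<https://example.test/api/v1/orders?page=9>; rel="last"'))
```

A paginator repeats this step after every response and stops when no `next` link is left, so each page can only be requested after the previous one has arrived.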



### **What to implement**

1. **Extract** from the API using `dlt`
   - Use `dlt.resource` and [`RESTClient`](https://dlthub.com/docs/devel/general-usage/http/rest-client) with proper pagination

2. **Apply all performance techniques**
   - Group resources into sources
   - Yield **chunks/pages**, not single rows
   - Use `parallelized=True`
   - Set `EXTRACT__WORKERS`, `NORMALIZE__WORKERS`, and `LOAD__WORKERS`
   - Tune buffer sizes and enable **file rotation**

3. **Measure performance**
   - Time the extract, normalize, and load stages separately
   - Compare a naive version vs. optimized version
   - Log thread info or `pipeline.last_trace` if helpful
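
One possible way to time the stages separately is a small context manager around each step. The sketch below uses only the standard library. `timed`, `timings`, and `fake_stage` are hypothetical names, and the sleeps stand in for real work. In an actual run the three `with` bodies would instead call the pipeline's extract, normalize, and load steps, which as far as I know `dlt` exposes as `pipeline.extract(...)`, `pipeline.normalize()`, and `pipeline.load()`.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Collected wall-clock durations, keyed by stage name.
timings: Dict[str, float] = {}


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Record how long the body of the `with` block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def fake_stage(seconds: float) -> None:
    """Hypothetical stand-in for one pipeline stage."""
    time.sleep(seconds)


for stage, seconds in [("extract", 0.03), ("normalize", 0.01), ("load", 0.02)]:
    with timed(stage):
        fake_stage(seconds)

for stage, elapsed in timings.items():
    print(f"{stage:<10}{elapsed:.3f}s")
```

Running the naive and the optimized pipeline through the same wrapper gives directly comparable numbers per stage.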


### **Deliverables**

Share your code as a Google Colab or [GitHub Gist](https://gist.github.com/) in the Homework Google Form. **This step is required for certification.**


It should include:
- Working pipeline for at least 2 endpoints
- Before/after timing comparison
- A short explanation of which changes made the biggest difference, if any





In [1]:
%%capture
!pip install "dlt[sql_database, duckdb]"
!pip install pymysql
!pip install pyyaml

## **Base Version Pipeline - No Optimization**

In [21]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator
import logging
import sys

# Create a logger (the resources below log through it)
logger = logging.getLogger('dlt')

client = RESTClient(
    base_url="https://jaffle-shop.scalevector.ai/api/v1",
    paginator=HeaderLinkPaginator(links_next_key="next")
)


@dlt.resource(table_name="customers", write_disposition="replace")
def get_customers():
    logger.info("Starting extraction of customers data")
    paginator = client.paginate("customers", params={"page":1, "page_size":1000})
    for page in paginator:
        yield page
    logger.info("Completed extraction of customers data")

@dlt.resource(table_name="orders", write_disposition="replace")
def get_orders():
    logger.info("Starting extraction of orders data")
    paginator = client.paginate("orders", params={"page":1, "page_size":500})
    for page in paginator:
        yield page
    logger.info("Completed extraction of orders data")

@dlt.resource(table_name="products", write_disposition="replace")
def get_products():
    logger.info("Starting extraction of products data")
    paginator = client.paginate("products", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of products data")

def main():
    try:
        pipeline = dlt.pipeline(
            pipeline_name="jaffle_shop_pipeline_v1",
            destination="duckdb",
            dataset_name="jaffle_shop",
            #progress="log"
        )

        load_info = pipeline.run([get_customers, get_orders, get_products])

        print(f"{pipeline.last_trace}")

    except Exception as e:
        logger.error(f"Pipeline failed with error: {str(e)}", exc_info=True)
        raise

if __name__ == "__main__":
    main()

Run started at 2025-05-08 09:58:49.103423+00:00 and COMPLETED in 9 minutes and 16.99 seconds with 4 steps.
Step extract COMPLETED in 8 minutes and 49.63 seconds.

Load package 1746698329.2022405 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 12.09 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- customers: 935 row(s)
- orders: 61948 row(s)
- orders__items: 90900 row(s)
- products: 10 row(s)

Load package 1746698329.2022405 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 15.17 seconds.
Pipeline jaffle_shop_pipeline_v1 load step completed in 15.15 seconds
1 load package(s) were loaded to destination duckdb and into dataset jaffle_shop
The duckdb destination used duckdb:////content/jaffle_shop_pipeline_v1.duckdb location to store data
Load package 1746698329.2022405 is LOADED and contains no failed jobs

Step run COMPLETED in 9 minutes an

*The non-optimized version of the pipeline takes more than 9 minutes, and about 8 minutes 50 seconds of that is the extract step!*

## **Optimized Pipelines**

### **V2 - Using Parallelized Resources and Grouping Resources**

In [22]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator
import logging

# Create a logger
logger = logging.getLogger('dlt')

client = RESTClient(
    base_url="https://jaffle-shop.scalevector.ai/api/v1",
    paginator=HeaderLinkPaginator(links_next_key="next")
)

def limit_pages(paginator, resource_name, limit=None):
    """Helper function to limit the number of pages returned from pagination

    If limit is None, returns all pages from the paginator.
    """
    page_count = 0
    for page in paginator:
        logger.info(f"Retrieved page {page_count + 1} for {resource_name}")
        yield page
        page_count += 1
        # Only check limit if it's not None
        if limit is not None and page_count >= limit:
            logger.info(f"Reached page limit of {limit} for {resource_name}")
            break

    logger.info(f"Total pages processed for {resource_name}: {page_count}")

@dlt.resource(table_name="customers", write_disposition="replace", parallelized=True)
def get_customers():
    logger.info("Starting extraction of customers data")
    paginator = client.paginate("customers", params={"page":1, "page_size":1000})
    for page in paginator:
        yield page
    logger.info("Completed extraction of customers data")

@dlt.resource(table_name="orders", write_disposition="replace", parallelized=True)
def get_orders():
    logger.info("Starting extraction of orders data")
    paginator = client.paginate("orders", params={"page":1, "page_size":500})
    for page in paginator:
        yield page
    logger.info("Completed extraction of orders data")

@dlt.resource(table_name="products", write_disposition="replace", parallelized=True)
def get_products():
    logger.info("Starting extraction of products data")
    paginator = client.paginate("products", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of products data")

@dlt.source
def jaffle_shop_source():
    logger.info("Initializing jaffle shop data source")
    return get_customers, get_orders, get_products

def main():
    try:
        pipeline = dlt.pipeline(
            pipeline_name="jaffle_shop_pipeline_v2",
            destination="duckdb",
            dataset_name="jaffle_shop",
            #progress="log"
        )


        load_info = pipeline.run(jaffle_shop_source())

        print(f"{pipeline.last_trace}")

    except Exception as e:
        logger.error(f"Pipeline failed with error: {str(e)}", exc_info=True)
        raise

if __name__ == "__main__":
    main()

Run started at 2025-05-08 10:08:06.522638+00:00 and COMPLETED in 9 minutes and 8.03 seconds with 4 steps.
Step extract COMPLETED in 8 minutes and 42.12 seconds.

Load package 1746698886.6162472 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 12.09 seconds.
Normalized data for the following tables:
- customers: 935 row(s)
- orders: 61948 row(s)
- orders__items: 90900 row(s)
- products: 10 row(s)

Load package 1746698886.6162472 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 13.76 seconds.
Pipeline jaffle_shop_pipeline_v2 load step completed in 13.72 seconds
1 load package(s) were loaded to destination duckdb and into dataset jaffle_shop
The duckdb destination used duckdb:////content/jaffle_shop_pipeline_v2.duckdb location to store data
Load package 1746698886.6162472 is LOADED and contains no failed jobs

Step run COMPLETED in 9 minutes and 8.03 seconds.
Pipeline jaffle_s

*V2 is only slightly faster than V1: 9 minutes 8 seconds versus 9 minutes 17 seconds.*
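
A back-of-the-envelope calculation may explain why parallelizing across resources helps so little. The sketch below takes the row counts from the run output and the `page_size` values from the code, and assumes the API honours the requested page size so that every page except the last is full. Under that assumption, 124 of the 126 requests belong to `orders`, and those are still fetched one after another inside a single resource.

```python
import math

# Row counts from the normalize step and page sizes from the resources above.
rows = {"customers": 935, "orders": 61948, "products": 10}
page_size = {"customers": 1000, "orders": 500, "products": 100}

# Assumption: the API returns full pages of the requested size.
pages = {name: math.ceil(rows[name] / page_size[name]) for name in rows}
total_requests = sum(pages.values())

extract_seconds = 8 * 60 + 42.12  # V2 extract step, from the trace above
seconds_per_request = extract_seconds / total_requests

for name, count in pages.items():
    print(f"{name:<10}{count:>4} page(s)")
print(f"average time per request: {seconds_per_request:.2f}s")
```

If this estimate is roughly right, running the three resources side by side can save at most the time of the two single-page resources, which matches the few seconds gained between V1 and V2.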

### **V3 - Larger In-memory Buffer + Increased Workers**

In [25]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator
import logging
import sys
import os

os.environ['EXTRACT__WORKERS'] = '8'
os.environ['NORMALIZE__WORKERS'] = '2'
os.environ['DATA_WRITER__BUFFER_MAX_ITEMS'] = '15000'

# Create a logger
logger = logging.getLogger('dlt')

# Configure the logger for both console and file output
logger.setLevel(logging.INFO)

# Create a file handler with more detailed formatting
file_handler = logging.FileHandler('dlt.log')
file_handler.setLevel(logging.INFO)
file_formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
file_handler.setFormatter(file_formatter)

# Create a console handler
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(logging.INFO)
console_formatter = logging.Formatter('%(levelname)s: %(message)s')
console_handler.setFormatter(console_formatter)

# Add the handlers to the logger
logger.addHandler(file_handler)
logger.addHandler(console_handler)

client = RESTClient(
    base_url="https://jaffle-shop.scalevector.ai/api/v1",
    paginator=HeaderLinkPaginator(links_next_key="next")
)

def limit_pages(paginator, resource_name, limit=None):
    """Helper function to limit the number of pages returned from pagination

    If limit is None, returns all pages from the paginator.
    """
    page_count = 0
    for page in paginator:
        logger.info(f"Retrieved page {page_count + 1} for {resource_name}")
        yield page
        page_count += 1
        # Only check limit if it's not None
        if limit is not None and page_count >= limit:
            logger.info(f"Reached page limit of {limit} for {resource_name}")
            break

    logger.info(f"Total pages processed for {resource_name}: {page_count}")

@dlt.resource(table_name="customers", write_disposition="replace", parallelized=True)
def get_customers():
    logger.info("Starting extraction of customers data")
    paginator = client.paginate("customers", params={"page":1, "page_size":1000})
    for page in paginator:
        yield page
    logger.info("Completed extraction of customers data")

@dlt.resource(table_name="orders", write_disposition="replace", parallelized=True)
def get_orders():
    logger.info("Starting extraction of orders data")
    paginator = client.paginate("orders", params={"page":1, "page_size":500})
    for page in paginator:
        yield page
    logger.info("Completed extraction of orders data")

@dlt.resource(table_name="products", write_disposition="replace", parallelized=True)
def get_products():
    logger.info("Starting extraction of products data")
    paginator = client.paginate("products", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of products data")

@dlt.source
def jaffle_shop_source():
    logger.info("Initializing jaffle shop data source")
    return get_customers, get_orders, get_products

def main():
    try:
        pipeline = dlt.pipeline(
            pipeline_name="jaffle_shop_pipeline_v3",
            destination="duckdb",
            dataset_name="jaffle_shop",
            #progress="log"
        )


        load_info = pipeline.run(jaffle_shop_source())

        print(f"{pipeline.last_trace}")

    except Exception as e:
        logger.error(f"Pipeline failed with error: {str(e)}", exc_info=True)
        raise

if __name__ == "__main__":
    main()

Run started at 2025-05-08 10:43:47.284910+00:00 and COMPLETED in 9 minutes and 1.21 seconds with 4 steps.
Step extract COMPLETED in 8 minutes and 35.31 seconds.

Load package 1746701027.3618095 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 11.52 seconds.
Normalized data for the following tables:
- customers: 935 row(s)
- orders: 61948 row(s)
- orders__items: 90900 row(s)
- products: 10 row(s)

Load package 1746701027.3618095 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 14.31 seconds.
Pipeline jaffle_shop_pipeline_v3 load step completed in 13.83 seconds
1 load package(s) were loaded to destination duckdb and into dataset jaffle_shop
The duckdb destination used duckdb:////content/jaffle_shop_pipeline_v3.duckdb location to store data
Load package 1746701027.3618095 is LOADED and contains no failed jobs

Step run COMPLETED in 9 minutes and 1.21 seconds.
Pipeline jaffle_s
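
All three runs spend between roughly 8.5 and 9 minutes in the extract step, so worker counts and buffer sizes do not appear to be the bottleneck here. One direction that might help is requesting several pages of the same endpoint concurrently instead of following `next` links one by one. The sketch below shows the idea with a plain thread pool from the standard library rather than any `dlt` feature. `fetch_page` is a hypothetical stand-in for one HTTP request, and the approach assumes the page numbers are known up front and that the API accepts a `page` parameter, as the `params` in the notebook suggest.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fetch_page(page: int) -> List[Dict[str, int]]:
    """Hypothetical stand-in for one slow HTTP request."""
    time.sleep(0.05)  # simulated network latency
    return [{"page": page, "row": i} for i in range(3)]


def fetch_all(pages: range, workers: int) -> List[List[Dict[str, int]]]:
    """Fetch the given pages concurrently; results keep page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_page, pages))


start = time.perf_counter()
serial = [fetch_page(p) for p in range(1, 9)]
serial_s = time.perf_counter() - start

start = time.perf_counter()
parallel = fetch_all(range(1, 9), workers=8)
parallel_s = time.perf_counter() - start

print(f"serial:   {serial_s:.2f}s")
print(f"parallel: {parallel_s:.2f}s")
```

Whether the real API tolerates this many concurrent requests is a separate question, so the worker count would need to be tuned against its rate limits.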

## Deployment

In [26]:
%%capture
!pip install "dlt[cli]"