# **Homework: Speed up your pipeline**

### **Goal**

Use the public **Jaffle Shop API** to build a `dlt` pipeline and apply everything you've learned about performance:

- Chunking
- Parallelism
- Buffer control
- File rotation
- Worker tuning

Your task is to **make the pipeline as fast as possible**, while keeping the results correct.
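
Of the techniques listed above, chunking is the easiest to show in isolation. The following is a minimal sketch in plain Python; `chunked` is a hypothetical helper (not part of `dlt`) that groups single rows into lists, so a resource can yield one list per batch instead of one row at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group an iterable of rows into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Five fake rows grouped into chunks of two
rows = ({"id": i} for i in range(5))
chunks = list(chunked(rows, 2))
print(chunks)
```

When an API already returns pages, yielding each page as a whole has the same effect and needs no extra helper.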



### **What you’ll need**

- API base: `https://jaffle-shop.scalevector.ai/api/v1`
- Docs: [https://jaffle-shop.scalevector.ai/docs](https://jaffle-shop.scalevector.ai/docs)
- Start with these endpoints:
  - `/customers`
  - `/orders`
  - `/products`

Each of them returns **paged responses** — so you'll need to handle pagination.
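
`RESTClient` paginators take care of this for you. For intuition only, here is a sketch of what header-based pagination involves, assuming the API advertises the next page in a `Link` response header with `rel="next"`. `next_link` is a hypothetical helper and the example URLs are made up.

```python
import re
from typing import Optional

# Matches one `<url>; params` entry of a Link header
_LINK_RE = re.compile(r'<([^>]*)>\s*;\s*([^,]*)')


def next_link(link_header: str, rel: str = "next") -> Optional[str]:
    """Return the URL whose rel matches `rel`, or None if there is none."""
    rel_re = re.compile(rf'\brel\s*=\s*"?{re.escape(rel)}\b')
    for url, params in _LINK_RE.findall(link_header):
        if rel_re.search(params):
            return url
    return None


header = (
    '<https://example.com/orders?page=2>; rel="next", '
    '<https://example.com/orders?page=9>; rel="last"'
)
print(next_link(header))
```

A client would keep requesting the returned URL until `next_link` yields `None`.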



### **What to implement**

1. **Extract** from the API using `dlt`
   - Use `dlt.resource` and [`RESTClient`](https://dlthub.com/docs/devel/general-usage/http/rest-client) with proper pagination

2. **Apply all performance techniques**
   - Group resources into sources
   - Yield **chunks/pages**, not single rows
   - Use `parallelized=True`
   - Set `EXTRACT__WORKERS`, `NORMALIZE__WORKERS`, and `LOAD__WORKERS`
   - Tune buffer sizes and enable **file rotation**

3. **Measure performance**
   - Time the extract, normalize, and load stages separately
   - Compare a naive version vs. optimized version
   - Log thread info or `pipeline.last_trace` if helpful
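
`pipeline.run()` performs extract, normalize, and load in one call. One way to time the stages separately is to call them one by one and wrap each call in a timer. The sketch below uses a hypothetical `time_stages` helper; the stand-in steps only sleep so the example runs without `dlt` or network access.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def time_stages(stages: Iterable[Tuple[str, Callable[[], object]]]) -> Dict[str, float]:
    """Run each named step in order and record its wall-clock duration in seconds."""
    timings: Dict[str, float] = {}
    for name, step in stages:
        start = time.perf_counter()
        step()
        timings[name] = time.perf_counter() - start
    return timings


# Stand-in steps; with dlt these could be, for example,
# lambda: pipeline.extract(source), pipeline.normalize, pipeline.load
timings = time_stages([
    ("extract", lambda: time.sleep(0.03)),
    ("normalize", lambda: time.sleep(0.01)),
    ("load", lambda: time.sleep(0.01)),
])
for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f}s")
```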


### **Deliverables**

Share your code as a Google Colab notebook or a [GitHub Gist](https://gist.github.com/) in the Homework Google Form. **This step is required for certification.**


It should include:
- Working pipeline for at least 2 endpoints
- Before/after timing comparison
- A short explanation of which changes made the biggest difference, if any
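
For the before/after comparison, it helps to report the absolute saving alongside the ratio. A small sketch, with `compare_runs` as a hypothetical helper; the example numbers are the approximate 90-second base run and the 79.05-second V2 run reported later in this notebook.

```python
from typing import Dict


def compare_runs(baseline_s: float, optimized_s: float) -> Dict[str, float]:
    """Summarize how much faster the optimized run is than the baseline."""
    if baseline_s <= 0 or optimized_s <= 0:
        raise ValueError("durations must be positive")
    saved = baseline_s - optimized_s
    return {
        "saved_s": saved,
        "speedup": baseline_s / optimized_s,
        "reduction_pct": 100 * saved / baseline_s,
    }


result = compare_runs(90.0, 79.05)
print(
    f"saved {result['saved_s']:.2f}s, "
    f"{result['speedup']:.2f}x, "
    f"{result['reduction_pct']:.1f}% less time"
)
```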





In [1]:
%%capture
!pip install "dlt[sql_database, duckdb]"
!pip install pymysql
!pip install pyyaml

## **Base Version Pipeline - No Optimization**

In [5]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator
import logging

# Reuse dlt's own logger so these messages appear alongside dlt's output
logger = logging.getLogger('dlt')

# Emit INFO-level messages (only the level is set here; no handlers are added)
logger.setLevel(logging.INFO)

client = RESTClient(
    base_url="https://jaffle-shop.scalevector.ai/api/v1",
    paginator=HeaderLinkPaginator(links_next_key="next")
)


@dlt.resource(table_name="customers", write_disposition="replace")
def get_customers():
    logger.info("Starting extraction of customers data")
    paginator = client.paginate("customers", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of customers data")


@dlt.resource(table_name="orders", write_disposition="replace")
def get_orders():
    logger.info("Starting extraction of orders data")
    paginator = client.paginate("orders", params={"page":1, "page_size":200, "start_date" : "2017-08-01T00:00:00"})
    for page in paginator:
        yield page
    logger.info("Completed extraction of orders data")


@dlt.resource(table_name="products", write_disposition="replace")
def get_products():
    logger.info("Starting extraction of products data")
    paginator = client.paginate("products", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of products data")

def main():
    try:
        pipeline = dlt.pipeline(
            pipeline_name="jaffle_shop_pipeline_v1",
            destination="duckdb",
            dataset_name="jaffle_shop",
            progress="log"
        )

        load_info = pipeline.run([get_customers, get_orders, get_products])

        print(f"{pipeline.last_trace}")

    except Exception as e:
        logger.error(f"Pipeline failed with error: {str(e)}", exc_info=True)
        raise

if __name__ == "__main__":
    main()

----------------------- Extract jaffle_shop_pipeline_v1 ------------------------
Resources: 0/3 (0.0%) | Time: 0.00s | Rate: 0.00/s
Memory usage: 296.56 MB (10.00%) | CPU usage: 0.00%

----------------------- Extract jaffle_shop_pipeline_v1 ------------------------
Resources: 0/3 (0.0%) | Time: 0.47s | Rate: 0.00/s
customers: 100  | Time: 0.00s | Rate: 11037642.11/s
Memory usage: 296.56 MB (9.80%) | CPU usage: 0.00%

----------------------- Extract jaffle_shop_pipeline_v1 ------------------------
Resources: 0/3 (0.0%) | Time: 2.21s | Rate: 0.00/s
customers: 100  | Time: 1.74s | Rate: 57.44/s
orders: 200  | Time: 0.00s | Rate: 18641351.11/s
Memory usage: 296.56 MB (9.80%) | CPU usage: 0.00%

----------------------- Extract jaffle_shop_pipeline_v1 ------------------------
Resources: 0/3 (0.0%) | Time: 2.41s | Rate: 0.00/s
customers: 100  | Time: 1.94s | Rate: 51.66/s
orders: 200  | Time: 0.19s | Rate: 1027.59/s
products: 10  | Time: 0.00s | Rate: 1023000.98/s
Memory usage: 296.56 MB (9.8

*The non-optimized pipeline takes about 1 minute 30 seconds to run.*

## **Optimized Pipelines**

### **V2 - Using Parallelized Resources and Grouping Resources**

In [6]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator
import logging

# Create a logger
logger = logging.getLogger('dlt')

client = RESTClient(
    base_url="https://jaffle-shop.scalevector.ai/api/v1",
    paginator=HeaderLinkPaginator(links_next_key="next")
)

def limit_pages(paginator, resource_name, limit=None):
    """Yield pages from a paginator, stopping after `limit` pages.

    If limit is None, yields all pages. The resources below do not call
    this helper; wrap `client.paginate(...)` with it for quick test runs.
    """
    page_count = 0
    for page in paginator:
        logger.info(f"Retrieved page {page_count + 1} for {resource_name}")
        yield page
        page_count += 1
        # Only check limit if it's not None
        if limit is not None and page_count >= limit:
            logger.info(f"Reached page limit of {limit} for {resource_name}")
            break

    logger.info(f"Total pages processed for {resource_name}: {page_count}")

@dlt.resource(table_name="customers", write_disposition="replace", parallelized=True)
def get_customers():
    logger.info("Starting extraction of customers data")
    paginator = client.paginate("customers", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of customers data")

@dlt.resource(table_name="orders", write_disposition="replace", parallelized=True)
def get_orders():
    logger.info("Starting extraction of orders data")
    paginator = client.paginate("orders", params={"page":1, "page_size":200, "start_date" : "2017-08-01T00:00:00"})
    for page in paginator:
        yield page
    logger.info("Completed extraction of orders data")

@dlt.resource(table_name="products", write_disposition="replace", parallelized=True)
def get_products():
    logger.info("Starting extraction of products data")
    paginator = client.paginate("products", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of products data")

@dlt.source
def jaffle_shop_source():
    logger.info("Initializing jaffle shop data source")
    return get_customers, get_orders, get_products

def main():
    try:
        pipeline = dlt.pipeline(
            pipeline_name="jaffle_shop_pipeline_v2",
            destination="duckdb",
            dataset_name="jaffle_shop",
            #progress="log"
        )


        load_info = pipeline.run(jaffle_shop_source())

        print(f"{pipeline.last_trace}")

    except Exception as e:
        logger.error(f"Pipeline failed with error: {str(e)}", exc_info=True)
        raise

if __name__ == "__main__":
    main()

Run started at 2025-05-09 12:22:23.951411+00:00 and COMPLETED in 1 minute and 19.05 seconds with 4 steps.
Step extract COMPLETED in 1 minute and 14.27 seconds.

Load package 1746793344.0427039 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 2.30 seconds.
Normalized data for the following tables:
- products: 10 row(s)
- _dlt_pipeline_state: 1 row(s)
- customers: 935 row(s)
- orders: 9389 row(s)
- orders__items: 13202 row(s)

Load package 1746793344.0427039 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 2.38 seconds.
Pipeline jaffle_shop_pipeline_v2 load step completed in 2.35 seconds
1 load package(s) were loaded to destination duckdb and into dataset jaffle_shop
The duckdb destination used duckdb:////content/jaffle_shop_pipeline_v2.duckdb location to store data
Load package 1746793344.0427039 is LOADED and contains no failed jobs

Step run COMPLETED in 1 minute and 19.05

*V2 is only slightly faster: about 10 seconds less than the base version.*
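
The trace above shows extract taking 1 minute 14.27 seconds of the 1 minute 19.05 second run, so nearly all the time goes into waiting on the API. `parallelized=True` lets the three resources run side by side, but each generator still requests its own pages one after another, which may be why the gain is small. One possible next step, sketched here under the assumption that pages can be requested by number independently of each other, is to fetch several pages concurrently with a thread pool. `fetch_page` is a hypothetical stand-in that sleeps instead of calling the API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List


def fetch_page(page: int, page_size: int = 3) -> List[Dict[str, int]]:
    """Hypothetical stand-in for one API request; sleeps to mimic network latency."""
    time.sleep(0.02)
    start = (page - 1) * page_size
    return [{"id": i} for i in range(start, start + page_size)]


def fetch_pages_concurrently(
    page_numbers: Iterable[int], max_workers: int = 4
) -> List[List[Dict[str, int]]]:
    """Fetch the given pages in parallel threads, keeping them in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs
        return list(pool.map(fetch_page, page_numbers))


pages = fetch_pages_concurrently(range(1, 5))
print([len(p) for p in pages], pages[0][0], pages[-1][-1])
```

Each returned page could then be yielded from a resource as a whole, just as the cells above do.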

### **V3 - Larger In-memory Buffer + Increased Workers**
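
As a rough mental model of what buffer size and file rotation control (a toy, not `dlt`'s actual writer): rows accumulate in an in-memory buffer, a full buffer is flushed to the current file, and a file that has reached its item limit is closed so a new one starts. `RotatingWriter` and its parameter names are hypothetical and chosen only to mirror the roles of the settings.

```python
from typing import Any, List


class RotatingWriter:
    """Toy model of a buffered writer that rotates files by item count."""

    def __init__(self, buffer_max_items: int, file_max_items: int) -> None:
        self.buffer_max_items = buffer_max_items
        self.file_max_items = file_max_items
        self.buffer: List[Any] = []
        self.files: List[List[Any]] = [[]]  # each inner list stands in for one file
        self.flushes = 0

    def write(self, rows: List[Any]) -> None:
        self.buffer.extend(rows)
        if len(self.buffer) >= self.buffer_max_items:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        self.files[-1].extend(self.buffer)
        self.buffer = []
        self.flushes += 1
        # Rotation is checked only after a flush, so a file can overshoot its limit
        if len(self.files[-1]) >= self.file_max_items:
            self.files.append([])

    def close(self) -> List[List[Any]]:
        self.flush()
        if not self.files[-1]:
            self.files.pop()  # drop the empty trailing file
        return self.files


def simulate(buffer_max_items: int) -> tuple:
    """Write five pages of 100 rows and report file sizes and flush count."""
    writer = RotatingWriter(buffer_max_items, file_max_items=200)
    for page in range(5):
        writer.write([{"id": page * 100 + i} for i in range(100)])
    sizes = [len(f) for f in writer.close()]
    return sizes, writer.flushes


small = simulate(buffer_max_items=100)
large = simulate(buffer_max_items=250)
print(small, large)
```

In this toy, the larger buffer means fewer flushes, while the file limit decides how many separate files later stages have to work with.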

In [10]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator
import logging
import os

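# Tuning knobs, set through environment variables before the pipeline runs
os.environ['EXTRACT__WORKERS'] = '12'  # threads available to parallelized resources
os.environ['NORMALIZE__WORKERS'] = '4'  # worker processes for the normalize step
os.environ['DATA_WRITER__BUFFER_MAX_ITEMS'] = '5000'  # items held in memory before writing to disk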

# Create a logger
logger = logging.getLogger('dlt')

client = RESTClient(
    base_url="https://jaffle-shop.scalevector.ai/api/v1",
    paginator=HeaderLinkPaginator(links_next_key="next")
)

def limit_pages(paginator, resource_name, limit=None):
    """Yield pages from a paginator, stopping after `limit` pages.

    If limit is None, yields all pages. The resources below do not call
    this helper; wrap `client.paginate(...)` with it for quick test runs.
    """
    page_count = 0
    for page in paginator:
        logger.info(f"Retrieved page {page_count + 1} for {resource_name}")
        yield page
        page_count += 1
        # Only check limit if it's not None
        if limit is not None and page_count >= limit:
            logger.info(f"Reached page limit of {limit} for {resource_name}")
            break

    logger.info(f"Total pages processed for {resource_name}: {page_count}")

@dlt.resource(table_name="customers",  parallelized=True)
def get_customers():
    logger.info("Starting extraction of customers data")
    paginator = client.paginate("customers", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of customers data")

@dlt.resource(table_name="orders",  parallelized=True)
def get_orders():
    logger.info("Starting extraction of orders data")
    paginator = client.paginate("orders", params={"page":1, "page_size":500, "start_date" : "2017-08-01T00:00:00"})
    for page in paginator:
        yield page
    logger.info("Completed extraction of orders data")

@dlt.resource(table_name="products",  parallelized=True)
def get_products():
    logger.info("Starting extraction of products data")
    paginator = client.paginate("products", params={"page":1, "page_size":100})
    for page in paginator:
        yield page
    logger.info("Completed extraction of products data")

@dlt.source
def jaffle_shop_source():
    logger.info("Initializing jaffle shop data source")
    return get_customers, get_orders, get_products

def main():
    try:
        pipeline = dlt.pipeline(
            pipeline_name="jaffle_shop_pipeline_v3",
            destination="duckdb",
            dataset_name="jaffle_shop",
            progress="log"
        )


        load_info = pipeline.run(jaffle_shop_source())

        print(f"{pipeline.last_trace}")

    except Exception as e:
        logger.error(f"Pipeline failed with error: {str(e)}", exc_info=True)
        raise

if __name__ == "__main__":
    main()

-------------------------- Extract jaffle_shop_source --------------------------
Resources: 0/3 (0.0%) | Time: 0.00s | Rate: 0.00/s
Memory usage: 306.57 MB (10.50%) | CPU usage: 0.00%

-------------------------- Extract jaffle_shop_source --------------------------
Resources: 0/3 (0.0%) | Time: 0.34s | Rate: 0.00/s
customers: 100  | Time: 0.00s | Rate: 5447148.05/s
Memory usage: 306.77 MB (10.50%) | CPU usage: 0.00%

-------------------------- Extract jaffle_shop_source --------------------------
Resources: 0/3 (0.0%) | Time: 0.42s | Rate: 0.00/s
customers: 100  | Time: 0.08s | Rate: 1199.92/s
products: 10  | Time: 0.00s | Rate: 1823610.43/s
Memory usage: 306.92 MB (10.50%) | CPU usage: 0.00%

-------------------------- Extract jaffle_shop_source --------------------------
Resources: 1/3 (33.3%) | Time: 1.44s | Rate: 0.69/s
customers: 500  | Time: 1.10s | Rate: 452.80/s
products: 10  | Time: 1.02s | Rate: 9.80/s
Memory usage: 307.13 MB (10.50%) | CPU usage: 0.00%

---------------------

*V3 is 24 seconds faster than the base version*
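
V3 sets the extract and normalize workers and the buffer size, but not `LOAD__WORKERS` or file rotation, both of which the assignment lists. A possible next configuration to try is sketched below. The values are illustrative guesses, not measured optima, and `DATA_WRITER__FILE_MAX_ITEMS` is assumed here to be the setting that rotates files by item count.

```python
import os

# Illustrative values only; tune them against your own timings
settings = {
    "EXTRACT__WORKERS": "12",
    "NORMALIZE__WORKERS": "4",
    "LOAD__WORKERS": "4",
    "DATA_WRITER__BUFFER_MAX_ITEMS": "5000",
    "DATA_WRITER__FILE_MAX_ITEMS": "5000",  # assumed rotation threshold per file
}

# Apply the settings before running the pipeline so they are picked up
os.environ.update(settings)

for key in settings:
    print(f"{key}={os.environ[key]}")
```

Rotation matters for the worker settings because several smaller files give the normalize and load workers separate units of work, whereas a single large file per table leaves little to parallelize.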