# **Building custom sources with [dlt REST API source](https://dlthub.com/docs/devel/dlt-ecosystem/verified-sources/rest_api/basic) and [RESTClient](https://dlthub.com/docs/devel/general-usage/http/rest-client)**

In [22]:
%%capture
!pip install dlt[duckdb]

`rest_api_source` -> higher level; provides a declarative way to configure sources.

`RESTClient` -> lower level; provides more granular control over requests, authentication, and pagination.

## 1. Ways to Work with REST API Sources in dlt

### 1.1. Building Sources with Low-level dlt Decorators

In [2]:
import os
import dlt
from dlt.sources.helpers import requests
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator
from google.colab import userdata

os.environ["ACCESS_TOKEN"] = userdata.get("SECRET_KEY")

@dlt.source
def github_source(access_token=dlt.secrets.value):
    client = RESTClient(
        base_url="https://api.github.com",
        auth=BearerTokenAuth(token=access_token),
        paginator=HeaderLinkPaginator(),
    )

    @dlt.resource
    def github_events():
        for page in client.paginate("orgs/dlt-hub/events"):
            yield page

    @dlt.resource
    def github_stargazers():
        for page in client.paginate("repos/dlt-hub/dlt/stargazers"):
            yield page

    return github_events, github_stargazers


pipeline = dlt.pipeline(
    pipeline_name="rest_client_github",
    destination="duckdb",
    dataset_name="rest_client_data",
    dev_mode=True,
)

load_info = pipeline.run(github_source())
print(load_info)

Pipeline rest_client_github load step completed in 7.27 seconds
1 load package(s) were loaded to destination duckdb and into dataset rest_client_data_20250421040302
The duckdb destination used duckdb:////content/rest_client_github.duckdb location to store data
Load package 1745251382.6464946 is LOADED and contains no failed jobs


### 1.2. Building with `rest_api` Source

In [3]:
import dlt
from dlt.sources.rest_api import RESTAPIConfig, rest_api_source

config: RESTAPIConfig = {
    "client": {
        "base_url": "https://api.github.com",
        "auth": {
            "token": dlt.secrets["access_token"],  # Access token configured above
        },
        "paginator": "header_link"
    },
    "resources": [
        {
            "name": "issues",
            "endpoint": {
                "path": "repos/dlt-hub/dlt/issues",
                "params": {"state": "open"},
            },
        },
        {
            "name": "issue_comments",
            "endpoint": {
                "path": "repos/dlt-hub/dlt/issues/{issue_number}/comments",
                "params": {
                    "issue_number": {
                        "type": "resolve",
                        "resource": "issues",
                        "field": "number",
                    },
                },
            },
        },
        {
            "name": "contributors",
            "endpoint": {"path": "repos/dlt-hub/dlt/contributors"},
        },
    ],
}

github_source = rest_api_source(config)

pipeline = dlt.pipeline(
    pipeline_name="rest_api_github",
    destination="duckdb",
    dataset_name="rest_api_data",
    dev_mode=True,
)

load_info = pipeline.run(github_source)
print(load_info)

Pipeline rest_api_github load step completed in 2.98 seconds
1 load package(s) were loaded to destination duckdb and into dataset rest_api_data_20250421040800
The duckdb destination used duckdb:////content/rest_api_github.duckdb location to store data
Load package 1745251680.8373253 is LOADED and contains no failed jobs


## 2. RESTClient
The `RESTClient` class offers a Pythonic interface for interacting with RESTful APIs, including features like:

- automatic pagination,
- various authentication mechanisms,
- customizable request/response handling.

### 2.1. Creating RESTClient Instance

```python
client = RESTClient(
    base_url="https://api.github.com",  # all requests are made relative to this URL
    headers={"User-Agent": "MyApp/1.0"},  # default headers sent with every request
    auth=BearerTokenAuth(dlt.secrets["access_token"]),  # authentication strategy
    paginator=HeaderLinkPaginator(),  # pagination strategy; follows the `Link` response header
    data_selector="data",  # JSONPath selector that locates the records in each JSON response
    session=MyCustomSession(),  # optional custom session; `MyCustomSession` is a placeholder
)
```

In [5]:
import os

import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator
from google.colab import userdata


os.environ["ACCESS_TOKEN"] = userdata.get('SECRET_KEY')

client = RESTClient(
    base_url="https://api.github.com",
    headers={"User-Agent": "MyApp/1.0"},
    auth=BearerTokenAuth(dlt.secrets["access_token"]),
    paginator=HeaderLinkPaginator(),
    data_selector="data",
    # session=MyCustomSession()
)

response = client.get("repos/dlt-hub/dlt/issues").json()

print(response)



### 2.2. Authenticating

The **available authentication methods** are defined in the `dlt.sources.helpers.rest_client.auth` module:

- [BearerTokenAuth](https://dlthub.com/docs/devel/general-usage/http/rest-client#bearer-token-authentication) ➡ Auth token is sent via headers
- [APIKeyAuth](https://dlthub.com/docs/devel/general-usage/http/rest-client#api-key-authentication) ➡ Sends the API key as a custom header or query parameter
- [HttpBasicAuth](https://dlthub.com/docs/devel/general-usage/http/rest-client#http-basic-authentication) ➡ Username and password sent via headers
- [OAuth2ClientCredentials](https://dlthub.com/docs/devel/general-usage/http/rest-client#oauth-20-authorization) ➡ REST client acts as the OAuth client, which obtains a temporary access token from the authorization server.
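
To make the differences between the first three methods concrete, the sketch below shows where each credential ends up in an outgoing request. It uses only the standard library and is not dlt's implementation; `build_auth` and its arguments are hypothetical names introduced for illustration.

```python
import base64


def build_auth(kind, **cred):
    """Return (headers, params) showing where each credential travels.

    Illustrative only: a hypothetical helper, not part of dlt.
    """
    headers, params = {}, {}
    if kind == "bearer":
        # Bearer token: sent in the Authorization header
        headers["Authorization"] = f"Bearer {cred['token']}"
    elif kind == "basic":
        # HTTP Basic: "username:password", base64-encoded, in the Authorization header
        raw = f"{cred['username']}:{cred['password']}".encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode("ascii")
    elif kind == "api_key":
        # API key: a named header or a named query parameter
        target = params if cred.get("location", "header") == "query" else headers
        target[cred["name"]] = cred["api_key"]
    else:
        raise ValueError(f"unknown auth kind: {kind}")
    return headers, params


print(build_auth("bearer", token="t0k3n"))
print(build_auth("basic", username="user", password="pass"))
print(build_auth("api_key", name="apiKey", api_key="abc", location="query"))
```

The last call mirrors the NewsAPI setup used in this notebook, where the key travels as a query parameter rather than a header.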


In [7]:
import os
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import APIKeyAuth
from google.colab import userdata

api_key = userdata.get('news_api_key')

client = RESTClient(
    base_url="https://newsapi.org/v2/",
    auth=APIKeyAuth(name="apikey", api_key=api_key, location="query")
)

response = client.get("top-headlines", params={"q":"python", "page":1})

print(response.json())

{'status': 'ok', 'totalResults': 1, 'articles': [{'source': {'id': 'hacker-news', 'name': 'Hacker News'}, 'author': 'Seth Michael Larson', 'title': 'Tests aren’t enough: Case study after adding type hints to urllib3', 'description': 'Since Python 3.5 was released in 2015 including PEP 484 and the typing module type hints have grown from a nice-to-have to an expectation for popular packages.  To fulfill this expectation our team...', 'url': 'https://sethmlarson.dev/blog/2021-10-18/tests-arent-enough-case-study-after-adding-types-to-urllib3', 'urlToImage': 'http://sethmlarson.dev/static/avatar.jpeg', 'publishedAt': '2021-10-18T16:52:18.1226466Z', 'content': 'Since Python 3.5 was released in 2015 including PEP 484 and the typing module type hints have grown from a nice-to-have to an expectation for popular packages. To fulfill this expectation our team ha… [+11991 chars]'}]}


### 2.3. Pagination

Specify the paginator in one of two ways:
1. with the `paginator` parameter of the `RESTClient`, or
2. directly in the `paginate()` method.

The **available pagination strategies** are defined in the `dlt.sources.helpers.rest_client.paginators` module and cover the most common pagination patterns used in REST APIs:

- [`PageNumberPaginator`](https://dlthub.com/docs/general-usage/http/rest-client#pagenumberpaginator) – uses `page=N`, optionally with `pageSize` or `limit`
- [`OffsetPaginator`](https://dlthub.com/docs/general-usage/http/rest-client#offsetpaginator) – uses `offset` and `limit`
- [`JSONLinkPaginator`](https://dlthub.com/docs/general-usage/http/rest-client#jsonresponsepaginator) – follows a `next` URL in the response body
- [`HeaderLinkPaginator`](https://dlthub.com/docs/general-usage/http/rest-client#headerlinkpaginator) – follows a `Link` header (used by GitHub and others)
- [`JSONResponseCursorPaginator`](https://dlthub.com/docs/general-usage/http/rest-client#jsonresponsecursorpaginator) – uses a cursor from the response body
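
As an illustration of the `Link`-header strategy in the list above, the sketch below extracts the `rel="next"` URL from a header value. It is a simplified, standard-library parser written for this example, not the code dlt uses; `next_url_from_link_header` is a hypothetical name and the `api.example.com` URLs are placeholders.

```python
import re

# Matches entries of the form: <URL>; rel="name"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_url_from_link_header(link_header):
    """Return the URL tagged rel="next", or None when there is no next page."""
    for url, rel in _LINK_RE.findall(link_header or ""):
        if rel == "next":
            return url
    return None


header = (
    '<https://api.example.com/items?page=2>; rel="next", '
    '<https://api.example.com/items?page=5>; rel="last"'
)
print(next_url_from_link_header(header))
print(next_url_from_link_header('<https://api.example.com/items?page=1>; rel="first"'))
```

A paginator built on this idea keeps requesting the returned URL until the header no longer contains a `next` entry.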

Each paginator knows how to update the request to get the next page of results, and will continue until:

- no more pages are available,
- a configurable `maximum_page` or `maximum_offset` is reached,
- or the API response is empty (depending on paginator behavior).
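
The stop conditions above can be pictured with a small, library-independent loop. This is a sketch, not dlt's implementation: `paginate_by_page_number`, `fake_fetch`, and `FAKE_API` are hypothetical names, and treating `maximum_page` as an exclusive bound is an assumption chosen to match the page counts recorded in this notebook's outputs.

```python
def paginate_by_page_number(fetch_page, base_page=1, maximum_page=None,
                            stop_after_empty_page=True):
    """Yield pages until one is empty or the page counter reaches maximum_page.

    Note: with stop_after_empty_page=False and maximum_page=None this loop
    never ends, so at least one stop condition should be set.
    """
    page = base_page
    while True:
        items = fetch_page(page)
        if stop_after_empty_page and not items:
            return  # empty response: nothing more to read
        yield items
        page += 1
        if maximum_page is not None and page >= maximum_page:
            return  # configured limit reached (assumed exclusive)


# Hypothetical API with three pages of data
FAKE_API = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page):
    return FAKE_API.get(page, [])


print(list(paginate_by_page_number(fake_fetch)))
print(list(paginate_by_page_number(fake_fetch, maximum_page=3)))
```

In the first call the loop ends on the empty fourth page; in the second it ends as soon as the counter reaches the limit.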

#### 2.3.1 Using the `paginate()` Method and Its `PageData` Objects
- If a `paginator` is not specified, the `paginate()` method will attempt to **automatically detect** the pagination mechanism used by the API.

- `client.paginate()` returns a generator that yields one `PageData` object per page. Each `PageData` holds the page's records and also exposes the request, response, paginator, and auth used to fetch it.

```python
# paginate() returns a generator; each item it yields is a PageData object
pages = client.paginate("everything", params={"q": "python", "page": 1})

# Fetch the first page once, then inspect its metadata.
# (Calling next() before every print would advance to a new page each time.)
first_page = next(pages)

# 1. The original request object
print(first_page.request)

# 2. The raw HTTP response
print(first_page.response)

# 3. The paginator that was used
print(first_page.paginator)

# 4. The authentication class used
print(first_page.auth)
```

In [11]:
response = client.paginate("everything", params={"q": "python", "page": 1})
print("Paginator\n", next(response).paginator)




Paginator
 SinglePagePaginator at 788c5faf5b90


In [20]:
# Using PageNumberPaginator with NewsAPI

from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import APIKeyAuth
from google.colab import userdata

api_key = userdata.get('news_api_key')


client = RESTClient(
    base_url="https://newsapi.org/v2/",
    auth=APIKeyAuth(
        name="apiKey",
        api_key=api_key,
        location="query"
    ),
    paginator=PageNumberPaginator(
        base_page=1,                 # NewsAPI starts paging from 1
        page_param="page",           # This needs to match as per the API Spec - for NewsAPI it is page
        total_path=None,             # JSONPath to the total page count in the response; None means keep requesting until another stop condition is met
        stop_after_empty_page=True,  # Stop if no articles returned
        maximum_page=4               # Optional limit for dev/testing
    )
)

# Label each title as <page number>.<position within the page>
for page_no, page in enumerate(
    client.paginate("everything", params={"q": "python", "pageSize": 5, "language": "en"}),
    start=1,
):
    for article_no, article in enumerate(page, start=1):
        print(f" {page_no}.{article_no} : {article['title']}")

 1.1 : Microsoft adds ‘deep reasoning’ Copilot AI for research and data analysis
 1.2 : Python's PyPI Finally Gets Closer to Adding 'Organization Accounts' and SBOMs
 1.3 : The Best Programming Language for the End of the World
 1.4 : Burmese pythons are adapting, evolving and slithering around these parts of Florida
 1.5 : MCP Run Python
 2.1 : Haskelling My Python
 2.2 : Nvidia adds native Python support to CUDA
 2.3 : OpenAI Unveils o3 and o4-mini Models
 2.4 : Cut the head off this invasive python-looking fish if you see it, conservationists say
 2.5 : Pluto’s Not a Planet, But It Is a Spectrum Analyzer
 3.1 : SSLyze – SSL configuration scanning library and CLI tool
 3.2 : Understanding R1-Zero-Like Training: A Critical Perspective
 3.3 : Inside arXiv—the Most Transformative Platform in All of Science
 3.4 : Show HN: Cloud-Ready Postgres MCP Server
 3.5 : Show HN: FastOpenAPI – automated docs for many Python frameworks


### 2.4. Using `@dlt.resource` and `@dlt.source` Decorators to Create dlt Pipeline

In [6]:
import os
import dlt
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import APIKeyAuth
from google.colab import userdata

os.environ["API_KEY"] = userdata.get('news_api_key')


# resource to get all articles
@dlt.resource(write_disposition="replace", name="python_articles")
def get_articles(api_key: str = dlt.secrets.value):
    client = RESTClient(
        base_url="https://newsapi.org/v2/",
        auth=APIKeyAuth(
            name="apiKey",
            api_key=api_key,
            location="query"
        ),
        paginator=PageNumberPaginator(
            base_page=1,
            page_param="page",
            total_path=None,
            stop_after_empty_page=True,
            maximum_page=4
        ),
    )

    for page in client.paginate("everything", params={"q": "python", "pageSize": 5, "language": "en"}):
        yield page


# resource to get top headlines
@dlt.resource(write_disposition="replace", name="top_articles")
def get_top_articles(api_key: str = dlt.secrets.value):
    client = RESTClient(
        base_url="https://newsapi.org/v2/",
        auth=APIKeyAuth(
            name="apiKey",
            api_key=api_key,
            location="query"
        ),
        paginator=PageNumberPaginator(
            base_page=1,
            page_param="page",
            total_path=None,
            stop_after_empty_page=True,
            maximum_page=4
        ),
    )

    for page in client.paginate("top-headlines", params={"pageSize": 5, "language": "en"}):
        yield page

In [7]:
@dlt.source
def newsapi_source(api_key: str = dlt.secrets.value):
    return [get_articles(api_key=api_key), get_top_articles(api_key=api_key)]

In [9]:
pipeline = dlt.pipeline(
    pipeline_name="newsapi_pipeline",
    destination="duckdb",
    dataset_name="news_data"
)

info = pipeline.run(newsapi_source())
print(info)

Pipeline newsapi_pipeline load step completed in 0.17 seconds
1 load package(s) were loaded to destination duckdb and into dataset news_data
The duckdb destination used duckdb:////content/newsapi_pipeline.duckdb location to store data
Load package 1745260267.135623 is LOADED and contains no failed jobs


### 2.5. Exploring Loaded Data

In [10]:
pipeline.dataset(dataset_type='default').python_articles.df().head()

Unnamed: 0,source__id,source__name,author,title,description,url,url_to_image,published_at,content,_dlt_load_id,_dlt_id
0,the-verge,The Verge,Richard Lawler,Microsoft adds ‘deep reasoning’ Copilot AI for...,After Google and OpenAI offered up AI news on ...,https://www.theverge.com/microsoft/636089/micr...,https://platform.theverge.com/wp-content/uploa...,2025-03-26 03:04:28+00:00,Multi-step reasoning AI is coming to Microsoft...,1745260267.135623,BWZyftICRnxlhQ
1,,Slashdot.org,EditorDavid,Python's PyPI Finally Gets Closer to Adding 'O...,Back in 2023 Python's infrastructure director ...,https://developers.slashdot.org/story/25/04/05...,https://a.fsdn.com/sd/topics/python_64.png,2025-04-05 16:34:00+00:00,Back in 2023 Python's infrastructure director ...,1745260267.135623,AvjyeKaenWnjLQ
2,wired,Wired,Tiffany Ng,The Best Programming Language for the End of t...,"Once the grid goes down, an old programming la...",https://www.wired.com/story/forth-collapse-os-...,https://media.wired.com/photos/67d88e905c0123e...,2025-03-26 10:00:00+00:00,Coding in Forth reminded me of the lawless dys...,1745260267.135623,paFjCeF6aO017g
3,,Palm Beach Post,"Timothy O'Hara, Treasure Coast Newspapers","Burmese pythons are adapting, evolving and sli...",There’s mounting evidence Everglades pythons c...,https://www.palmbeachpost.com/story/news/local...,https://media.zenfs.com/en/palm_beach_post_nat...,2025-04-05 10:00:53+00:00,"In Palm Beach County, 69 Burmese pythons have ...",1745260267.135623,/e9OuFQICDcfNA
4,,Github.com,pydantic,MCP Run Python,Agent Framework / shim to use Pydantic with LL...,https://github.com/pydantic/pydantic-ai/tree/m...,https://opengraph.githubassets.com/41f3837bc6f...,2025-04-15 11:09:30+00:00,Model Context Protocol server to run Python co...,1745260267.135623,M15vwUVwyOSaDQ


## 3. Creating Custom Source Using `rest_api` dlt source

### 3.1. Define Source Config

In [13]:
import dlt
from dlt.sources.rest_api import rest_api_source

news_api_config = {
    "client":{
        "base_url" : "https://newsapi.org/v2/"
    },
    # resources will be the tables when loading data
    "resources" : [
        {
        "name" : "news_articles",
        "endpoint" : {
            "path" : "everything",
            "params" : {
                "q" : "python"
            }
        }
    }
    ]
}

news_api_source = rest_api_source(news_api_config)

pipeline = dlt.pipeline(
  pipeline_name="news_pipeline",
  destination="duckdb",
  dataset_name="news"
)

load_info = pipeline.run(news_api_source)
print(pipeline.last_trace)

PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1745260836.1065605 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe news_articles: extraction of resource news_articles in generator paginate_resource caused an exception: 401 Client Error: Unauthorized for url: https://newsapi.org/v2/everything?q=python

### 3.2. Adding Authentication using dlt's `api_key` method

In [15]:
import dlt
from dlt.sources.rest_api import rest_api_source
from google.colab import userdata

api_key = userdata.get('news_api_key')

news_api_config = {
    "client":{
        "base_url" : "https://newsapi.org/v2/",
        # adding api key to the request
        "auth" : {
            "type": "api_key",
            "name": "apiKey",
            "api_key":api_key,
            "location":"query"
        }

    },

    # resources will be the tables when loading data
    "resources" : [
        {
        "name" : "news_articles",
        "endpoint" : {
            "path" : "everything",
            "params" : {
                "q" : "python"
            }
        }
    }
    ]
}

news_api_source = rest_api_source(news_api_config)

pipeline = dlt.pipeline(
  pipeline_name="news_pipeline",
  destination="duckdb",
  dataset_name="news"
)

load_info = pipeline.run(news_api_source)
print(pipeline.last_trace)



Run started at 2025-04-21 18:44:28.865110+00:00 and COMPLETED in 2.25 seconds with 4 steps.
Step extract COMPLETED in 1.84 seconds.

Load package 1745261068.9452825 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.05 seconds.
Normalized data for the following tables:
- _dlt_pipeline_state: 1 row(s)
- news_articles: 100 row(s)

Load package 1745261068.9452825 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 0.29 seconds.
Pipeline news_pipeline load step completed in 0.26 seconds
1 load package(s) were loaded to destination duckdb and into dataset news
The duckdb destination used duckdb:////content/news_pipeline.duckdb location to store data
Load package 1745261068.9452825 is LOADED and contains no failed jobs

Step run COMPLETED in 2.25 seconds.
Pipeline news_pipeline load step completed in 0.26 seconds
1 load package(s) were loaded to destination duckdb and into dataset n

### 3.3. Adding Pagination Logic to the Custom Source

In [16]:
import dlt
from dlt.sources.rest_api import rest_api_source
from google.colab import userdata

api_key = userdata.get('news_api_key')

news_api_config = {
    "client":{
        "base_url" : "https://newsapi.org/v2/",
        # adding api key to the request
        "auth" : {
            "type": "api_key",
            "name": "apiKey",
            "api_key":api_key,
            "location":"query"
        },

        # adding paginator - paging stops once the page counter reaches maximum_page, so pages 1 and 2 are requested
        "paginator": {
            "base_page": 1,
            "type": "page_number",
            "page_param": "page",
            "total_path": None,
            "maximum_page": 3,
        }

    },

    # resources will be the tables when loading data
    "resources" : [
        {
        "name" : "news_articles",
        "endpoint" : {
            "path" : "everything",
            "params" : {
                "q" : "python"
            }
        }
    }
    ]
}

news_api_source = rest_api_source(news_api_config)

pipeline = dlt.pipeline(
  pipeline_name="news_pipeline",
  destination="duckdb",
  dataset_name="news"
)

load_info = pipeline.run(news_api_source)
print(pipeline.last_trace)

Run started at 2025-04-21 18:46:42.133354+00:00 and COMPLETED in 2.64 seconds with 4 steps.
Step extract COMPLETED in 2.35 seconds.

Load package 1745261202.2358189 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.06 seconds.
Normalized data for the following tables:
- news_articles: 200 row(s)

Load package 1745261202.2358189 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 0.16 seconds.
Pipeline news_pipeline load step completed in 0.14 seconds
1 load package(s) were loaded to destination duckdb and into dataset news
The duckdb destination used duckdb:////content/news_pipeline.duckdb location to store data
Load package 1745261202.2358189 is LOADED and contains no failed jobs

Step run COMPLETED in 2.64 seconds.
Pipeline news_pipeline load step completed in 0.14 seconds
1 load package(s) were loaded to destination duckdb and into dataset news
The duckdb destination used 

In [17]:
pipeline.dataset(dataset_type="default").news_articles.df().head()

Unnamed: 0,source__id,source__name,author,title,description,url,url_to_image,published_at,content,_dlt_load_id,_dlt_id
0,the-verge,The Verge,Richard Lawler,Microsoft adds ‘deep reasoning’ Copilot AI for...,After Google and OpenAI offered up AI news on ...,https://www.theverge.com/microsoft/636089/micr...,https://platform.theverge.com/wp-content/uploa...,2025-03-26 03:04:28+00:00,Multi-step reasoning AI is coming to Microsoft...,1745261068.9452825,9jgrme5XjETm8w
1,,Slashdot.org,EditorDavid,Python's PyPI Finally Gets Closer to Adding 'O...,Back in 2023 Python's infrastructure director ...,https://developers.slashdot.org/story/25/04/05...,https://a.fsdn.com/sd/topics/python_64.png,2025-04-05 16:34:00+00:00,Back in 2023 Python's infrastructure director ...,1745261068.9452825,FT++myp+5oLrOQ
2,wired,Wired,Tiffany Ng,The Best Programming Language for the End of t...,"Once the grid goes down, an old programming la...",https://www.wired.com/story/forth-collapse-os-...,https://media.wired.com/photos/67d88e905c0123e...,2025-03-26 10:00:00+00:00,Coding in Forth reminded me of the lawless dys...,1745261068.9452825,4PplSmSvJCSjEQ
3,,Palm Beach Post,"Timothy O'Hara, Treasure Coast Newspapers","Burmese pythons are adapting, evolving and sli...",There’s mounting evidence Everglades pythons c...,https://www.palmbeachpost.com/story/news/local...,https://media.zenfs.com/en/palm_beach_post_nat...,2025-04-05 10:00:53+00:00,"In Palm Beach County, 69 Burmese pythons have ...",1745261068.9452825,k+28ypAhSsnSMg
4,,Github.com,pydantic,MCP Run Python,Agent Framework / shim to use Pydantic with LL...,https://github.com/pydantic/pydantic-ai/tree/m...,https://opengraph.githubassets.com/41f3837bc6f...,2025-04-15 11:09:30+00:00,Model Context Protocol server to run Python co...,1745261068.9452825,3WlbYRGY+yAe0A


### 3.4. Working with API Params for Filtering and Ordering Results

In [18]:
import dlt
from dlt.sources.rest_api import rest_api_source
from google.colab import userdata

api_key = userdata.get('news_api_key')

news_api_config = {
    "client":{
        "base_url" : "https://newsapi.org/v2/",
        # adding api key to the request
        "auth" : {
            "type": "api_key",
            "name": "apiKey",
            "api_key":api_key,
            "location":"query"
        },

        # adding paginator - paging stops once the page counter reaches maximum_page, so pages 1 and 2 are requested
        "paginator": {
            "base_page": 1,
            "type": "page_number",
            "page_param": "page",
            "total_path": None,
            "maximum_page": 3,
        }

    },

    # resources will be the tables when loading data
    "resources" : [
        {
        "name" : "news_articles",
        "endpoint" : {
            "path" : "everything",
            "params": {
                    "q": "python",     # search keyword
                    "language": "en",  # article language
                    "pageSize": 20     # number of articles per page
                }
        }
    }
    ]
}

news_api_source = rest_api_source(news_api_config)

pipeline = dlt.pipeline(
  pipeline_name="news_pipeline",
  destination="duckdb",
  dataset_name="news"
)

load_info = pipeline.run(news_api_source)
print(pipeline.last_trace)

Run started at 2025-04-21 18:48:52.258060+00:00 and COMPLETED in 1.95 seconds with 4 steps.
Step extract COMPLETED in 1.69 seconds.

Load package 1745261332.3255434 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.03 seconds.
Normalized data for the following tables:
- news_articles: 40 row(s)

Load package 1745261332.3255434 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 0.18 seconds.
Pipeline news_pipeline load step completed in 0.16 seconds
1 load package(s) were loaded to destination duckdb and into dataset news
The duckdb destination used duckdb:////content/news_pipeline.duckdb location to store data
Load package 1745261332.3255434 is LOADED and contains no failed jobs

Step run COMPLETED in 1.95 seconds.
Pipeline news_pipeline load step completed in 0.16 seconds
1 load package(s) were loaded to destination duckdb and into dataset news
The duckdb destination used d

### 3.5. Converting to Incrementally Loaded Pipeline

Although NewsAPI does not support true incremental loading via cursors, you can simulate it using the `from` or `to` date filters and dlt's `incremental` loader:

```python
"from": {
    "type": "incremental",
    "cursor_path": "publishedAt",
    "initial_value": "2024-01-01T00:00:00Z",
}
```

This setup means:
- dlt will remember the last `publishedAt` seen
- On the next run, it will only request articles newer than that
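
The two points above can be simulated without dlt or NewsAPI. The sketch below is an assumption-laden illustration of the idea, not dlt's incremental loader: `run_incremental`, `fake_fetch`, `ARTICLES`, and the `state` dictionary are hypothetical, and the fake API is assumed to treat `from` as an inclusive bound.

```python
def run_incremental(fetch, state, initial_value):
    """Request items from the stored cursor onward, then advance the cursor."""
    since = state.get("last_value", initial_value)
    items = fetch(since)
    if items:
        # Remember the largest publishedAt seen so far for the next run
        state["last_value"] = max(item["publishedAt"] for item in items)
    return items


ARTICLES = [
    {"title": "old", "publishedAt": "2024-01-05T00:00:00Z"},
    {"title": "new", "publishedAt": "2024-02-01T00:00:00Z"},
]


def fake_fetch(since):
    # ISO-8601 UTC timestamps in one fixed format compare correctly as strings.
    # Inclusive (>=) filtering is an assumption of this fake API.
    return [a for a in ARTICLES if a["publishedAt"] >= since]


state = {}
first = run_incremental(fake_fetch, state, "2024-01-01T00:00:00Z")

# A newer article appears between the two runs
ARTICLES.append({"title": "newer", "publishedAt": "2024-03-01T00:00:00Z"})
second = run_incremental(fake_fetch, state, "2024-01-01T00:00:00Z")

print([a["title"] for a in first])
print([a["title"] for a in second])
```

With an inclusive filter the item sitting exactly on the cursor is returned again on the next run, which is why this pattern is commonly paired with a primary key or another deduplication step.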


In [19]:
import dlt
from dlt.sources.rest_api import rest_api_source
from google.colab import userdata

api_key = userdata.get('news_api_key')

news_api_config = {
    "client":{
        "base_url" : "https://newsapi.org/v2/",
        # adding api key to the request
        "auth" : {
            "type": "api_key",
            "name": "apiKey",
            "api_key":api_key,
            "location":"query"
        },

        # adding paginator - paging stops once the page counter reaches maximum_page, so pages 1 and 2 are requested
        "paginator": {
            "base_page": 1,
            "type": "page_number",
            "page_param": "page",
            "total_path": None,
            "maximum_page": 3,
        }

    },

    # resources will be the tables when loading data
    "resources" : [
        {
        "name" : "news_articles",
        "endpoint" : {
            "path" : "everything",
            "params": {
                    "q": "python",     # search keyword
                    "language": "en",  # article language
                    "pageSize": 20,     # number of articles per page
                    "from":{
                        "type":"incremental",
                        "cursor_path":"publishedAt",
                        "initial_value": "2025-04-15T00:00:00Z" # only fetch articles published after this date
                    }
                }
        }
    }
    ]
}

news_api_source = rest_api_source(news_api_config)

pipeline = dlt.pipeline(
  pipeline_name="news_pipeline",
  destination="duckdb",
  dataset_name="news"
)

load_info = pipeline.run(news_api_source)
print(pipeline.last_trace)

Run started at 2025-04-21 18:52:40.385457+00:00 and COMPLETED in 2.28 seconds with 4 steps.
Step extract COMPLETED in 1.88 seconds.

Load package 1745261560.4548607 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.05 seconds.
Normalized data for the following tables:
- news_articles: 40 row(s)
- _dlt_pipeline_state: 1 row(s)

Load package 1745261560.4548607 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 0.29 seconds.
Pipeline news_pipeline load step completed in 0.26 seconds
1 load package(s) were loaded to destination duckdb and into dataset news
The duckdb destination used duckdb:////content/news_pipeline.duckdb location to store data
Load package 1745261560.4548607 is LOADED and contains no failed jobs

Step run COMPLETED in 2.28 seconds.
Pipeline news_pipeline load step completed in 0.26 seconds
1 load package(s) were loaded to destination duckdb and into dataset ne

In [21]:
# Run the pipeline one more time to see if any new data got loaded
load_info = pipeline.run(news_api_source)
print(pipeline.last_trace)

Run started at 2025-04-21 18:53:04.701413+00:00 and COMPLETED in 0.97 seconds with 4 steps.
Step extract COMPLETED in 0.74 seconds.

Load package 1745261584.7287204 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.03 seconds.
Normalized data for the following tables:
- news_articles: 39 row(s)
- _dlt_pipeline_state: 1 row(s)

Load package 1745261584.7287204 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 0.18 seconds.
Pipeline news_pipeline load step completed in 0.16 seconds
1 load package(s) were loaded to destination duckdb and into dataset news
The duckdb destination used duckdb:////content/news_pipeline.duckdb location to store data
Load package 1745261584.7287204 is LOADED and contains no failed jobs

Step run COMPLETED in 0.97 seconds.
Pipeline news_pipeline load step completed in 0.16 seconds
1 load package(s) were loaded to destination duckdb and into dataset ne

### 3.6. Setting Resource Defaults

In [25]:
import dlt
from dlt.sources.rest_api import rest_api_source
from google.colab import userdata

api_key = userdata.get('news_api_key')

news_api_config = {
    "client":{
        "base_url" : "https://newsapi.org/v2/",
        # adding api key to the request
        "auth" : {
            "type": "api_key",
            "name": "apiKey",
            "api_key":api_key,
            "location":"query"
        },

        # adding paginator - paging stops once the page counter reaches maximum_page, so pages 1 and 2 are requested
        "paginator": {
            "base_page": 1,
            "type": "page_number",
            "page_param": "page",
            "total_path": None,
            "maximum_page": 3,
        }

    },

    # setting the resource defaults
    "resource_defaults": {
    "primary_key": "id",
    "write_disposition": "append",
    "endpoint": {
        "params": {
            "language": "en",
            "pageSize" : 20
          }
        }
    },

    # resources will be the tables when loading data
    "resources" : [
        {
        "name" : "news_articles",
        "endpoint" : {
            "path" : "everything",
            "params": {
                    "q": "python",     # search keyword
                    "language": "en",  # article language
                    "pageSize": 20,     # number of articles per page
                    "from":{
                        "type":"incremental",
                        "cursor_path":"publishedAt",
                        "initial_value": "2025-04-15T00:00:00Z" # only fetch articles published after this date
                    }
                }
        }
    }
    ]
}
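
One way to picture what `resource_defaults` does is a recursive dictionary merge in which resource-level values take precedence. The sketch below illustrates that idea only; `merge_defaults` is a hypothetical helper, not dlt's actual merge code, and the precedence rule is an assumption.

```python
import copy


def merge_defaults(defaults, resource):
    """Recursively merge two dicts; values in `resource` win over `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in resource.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key
            merged[key] = merge_defaults(merged[key], value)
        else:
            # Otherwise the resource-level value replaces the default
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "write_disposition": "append",
    "endpoint": {"params": {"language": "en", "pageSize": 20}},
}
resource = {
    "name": "news_articles",
    "endpoint": {"path": "everything", "params": {"q": "python"}},
}

print(merge_defaults(defaults, resource))
```

Under that assumption, a resource only needs to declare what differs from the defaults, such as `q` here, while `language` and `pageSize` are inherited.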

### 3.7. Adding More Endpoints

In [27]:
import dlt
from dlt.sources.rest_api import rest_api_source
from google.colab import userdata

api_key = userdata.get('news_api_key')

news_api_config = {
    "client":{
        "base_url" : "https://newsapi.org/v2/",
        # adding api key to the request
        "auth" : {
            "type": "api_key",
            "name": "apiKey",
            "api_key":api_key,
            "location":"query"
        },

        # adding paginator - paging stops once the page counter reaches maximum_page, so pages 1 and 2 are requested
        "paginator": {
            "base_page": 1,
            "type": "page_number",
            "page_param": "page",
            "total_path": None,
            "maximum_page": 3,
        }

    },

    # setting the resource defaults
    "resource_defaults": {
        "write_disposition": "append",
        "endpoint": {
            "params": {
                "language": "en",
                "pageSize": 20
            }
        }
    },

    # resources will be the tables when loading data
    "resources" : [
        # resource #1 - News Articles
        {
            "name": "news_articles",
            "endpoint": {
                "path": "everything",
                "params": {
                    "q": "python",      # search keyword
                    "language": "en",   # article language
                    "pageSize": 20,     # number of articles per page
                    "from": {
                        "type": "incremental",
                        "cursor_path": "publishedAt",
                        "initial_value": "2025-04-15T00:00:00Z"  # only fetch articles published after this date
                    }
                }
            }
        },
        # resource #2 - Top Headlines
        {
            "name":"top_headlines",
            "endpoint" : {
                "path" : "top-headlines",
                "params" : {
                    "country" : "us"
                }
            }
        }
    ]
}

news_api_source = rest_api_source(news_api_config)

pipeline = dlt.pipeline(
  pipeline_name="news_pipeline",
  destination="duckdb",
  dataset_name="news"
)

load_info = pipeline.run(news_api_source)
print(pipeline.last_trace)


pipeline.dataset(dataset_type='default').top_headlines.df().head()

Run started at 2025-04-21 20:15:15.030801+00:00 and COMPLETED in 3.37 seconds with 4 steps.
Step extract COMPLETED in 3.02 seconds.

Load package 1745266515.0958173 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.04 seconds.
Normalized data for the following tables:
- news_articles: 34 row(s)
- _dlt_pipeline_state: 1 row(s)
- top_headlines: 33 row(s)

Load package 1745266515.0958173 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 0.26 seconds.
Pipeline news_pipeline load step completed in 0.24 seconds
1 load package(s) were loaded to destination duckdb and into dataset news
The duckdb destination used duckdb:////content/news_pipeline.duckdb location to store data
Load package 1745266515.0958173 is LOADED and contains no failed jobs

Step run COMPLETED in 3.37 seconds.
Pipeline news_pipeline load step completed in 0.24 seconds
1 load package(s) were loaded to destination

Unnamed: 0,source__id,source__name,author,title,description,url,url_to_image,published_at,content,_dlt_load_id,_dlt_id
0,espn,ESPN,Associated Press,Acuna calls out Braves for not disciplining Ke...,Ronald Acuna Jr. took to social media Sunday t...,https://www.espn.com/mlb/story/_/id/44767829/a...,https://a3.espncdn.com/combiner/i?img=%2Fphoto...,2025-04-20 17:53:00+00:00,"Apr 20, 2025, 01:53 PM ET\r\nATLANTA -- Ronald...",1745266515.095817,2GFvNO9nEvQZzg
1,,CNBC,Erin Doherty,Trump draft executive order would make sweepin...,"The changes, outlined in a 16-page draft order...",https://www.cnbc.com/2025/04/20/trump-state-de...,https://image.cnbcfm.com/api/v1/image/10812039...,2025-04-20 17:49:10+00:00,The Trump administration could soon roll out s...,1745266515.095817,3rJux0zP+YCeCg
2,,Financial Times,Claire Jones,Republican senator backs Powell over Trump att...,Banking committee member John Kennedy says US ...,https://www.ft.com/content/8fc4279c-7098-48d9-...,https://www.ft.com/__origami/service/image/v2/...,2025-04-20 17:37:48+00:00,White House Watch newsletter\r\nSign up for yo...,1745266515.095817,B0wVaGRf/jpxGg
3,associated-press,Associated Press,,Florida State classes resume Monday after fata...,Classes will resume at Florida State Universit...,https://apnews.com/article/florida-state-shoot...,https://dims.apnews.com/dims4/default/e629f4e/...,2025-04-20 17:18:00+00:00,"TALLAHASSEE, Fla. (AP) Classes will resume at ...",1745266515.095817,+qHPm3is3NpAJg
4,politico,Politico,Gregory Svirnovskiy,Burgum rues ‘war on mining’ ahead of tariff ne...,"Still, there's no one better than the commande...",https://www.politico.com/news/2025/04/20/burgu...,https://static.politico.com/8e/4a/23616cac49a7...,2025-04-20 17:16:13+00:00,"In the meantime, the administration is looking...",1745266515.095817,ayDXKd59OptJrw
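
In the config above, `top_headlines` only sets `country`, while `resource_defaults` also declares `language` and `pageSize`. A rough sketch of how such defaults could combine with per-resource params is shown below. It is an illustration in plain Python with a hypothetical `merge_params` helper, and dlt's own merging logic may differ in detail.

```python
def merge_params(defaults: dict, overrides: dict) -> dict:
    """Combine default params with per-resource params; the resource wins on conflicts."""
    merged = dict(defaults)   # copy so the defaults stay untouched
    merged.update(overrides)  # per-resource values override defaults
    return merged


default_params = {"language": "en", "pageSize": 20}
top_headlines_params = {"country": "us"}

effective = merge_params(default_params, top_headlines_params)
print(effective)
```

Under this assumption, a resource that repeats a default key (as `news_articles` does with `language` and `pageSize`) ends up with the same values either way.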


## 4. Assignment Solution

Your task is to create a `rest_api_source` configuration for the public **Jaffle Shop API**. This exercise will help you apply what you’ve learned.

### API details:
- **Base URL:** `https://jaffle-shop.scalevector.ai/api/v1`
- **Docs:** [https://jaffle-shop.scalevector.ai/docs](https://jaffle-shop.scalevector.ai/docs)

### Endpoints to load:
- `/orders`

### Requirements:
1. Use `rest_api_source` to define your source config.
2. This API uses **pagination**. Figure out what type it has.
3. Add incremental loading to `orders`, starting from `2017-08-01` and using `ordered_at` as the cursor.
4. Add `processing_steps` to `orders`:
  - Remove records from `orders` for which `order_total` > 500.
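
In dlt, a `filter` processing step keeps the records for which the function returns `True`, so removing orders above 500 means keeping those with `order_total <= 500`. The sketch below shows that predicate on made-up records using Python's built-in `filter`. The sample values and the `keep_order` name are hypothetical, and it assumes `order_total` is expressed in the same unit as the threshold.

```python
def keep_order(order: dict) -> bool:
    """Keep orders whose total is at most 500; larger orders are dropped."""
    return float(order["order_total"]) <= 500.0


# Hypothetical records; order_total may arrive as a number or a numeric string.
sample_orders = [
    {"id": "o1", "order_total": 636},
    {"id": "o2", "order_total": 500},
    {"id": "o3", "order_total": "120"},
]

kept = list(filter(keep_order, sample_orders))
print([order["id"] for order in kept])
```

The same callable could be passed as `{"filter": keep_order}` inside `processing_steps`.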



### Question:
How many rows does the resulting `orders` table contain?


In [41]:
import dlt
from dlt.sources.rest_api import rest_api_source


# create the config
jaffle_shop_config = {
    "client" : {
        "base_url": "https://jaffle-shop.scalevector.ai/api/v1",

        # no auth required
        # "auth":{},

        # this api uses PageNumberPaginator
        "paginator": {
            "base_page" : 1,
            "type" : "page_number",
            "page_param" : "page",
            "total_path" : None,
            "maximum_page" : 3
        }

    },

    "resource_defaults" : {
        "write_disposition": "replace",
        "endpoint": {
            "params": {
                "page_size": 100
            }
        }
    },
    "resources" : [
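        {
            "name": "orders",
            "processing_steps" : [
              # NOTE: `filter` keeps the records for which the function returns True,
              # so this keeps orders with order_total > 500 (as the output below shows).
              # To remove those orders, as the requirement states, use `<= 500.0` instead.
              {"filter" : lambda x: float(x["order_total"]) > 500.0}
            ],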
            "endpoint" : {
                "path" : "orders",
                "params" : {
                    "start_date":{
                        "type": "incremental",
                        "cursor_path" : "ordered_at",
                        "initial_value": "2017-08-01T10:39:00"
                    }
                }
            }
        }
    ]
}


jaffle_shop_source = rest_api_source(jaffle_shop_config)

pipeline = dlt.pipeline(
    pipeline_name="jaffle_shop_assignment",
    destination="duckdb",
    dataset_name="jaffle_shop"
)

load_info = pipeline.run(jaffle_shop_source)
print(pipeline.last_trace)


pipeline.dataset(dataset_type="default").orders.df().head()

Run started at 2025-04-21 20:45:30.514917+00:00 and COMPLETED in 2.18 seconds with 4 steps.
Step extract COMPLETED in 1.89 seconds.

Load package 1745268330.587174 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.04 seconds.
Normalized data for the following tables:
- orders: 72 row(s)
- orders__items: 118 row(s)

Load package 1745268330.587174 is NORMALIZED and NOT YET LOADED to the destination and contains no failed jobs

Step load COMPLETED in 0.19 seconds.
Pipeline jaffle_shop_assignment load step completed in 0.17 seconds
1 load package(s) were loaded to destination duckdb and into dataset jaffle_shop
The duckdb destination used duckdb:////content/jaffle_shop_assignment.duckdb location to store data
Load package 1745268330.587174 is LOADED and contains no failed jobs

Step run COMPLETED in 2.18 seconds.
Pipeline jaffle_shop_assignment load step completed in 0.17 seconds
1 load package(s) were loaded to destination duckdb

Unnamed: 0,id,customer_id,store_id,ordered_at,subtotal,tax_paid,order_total,_dlt_load_id,_dlt_id
0,e6273b98-0975-411f-aa71-5f98a82dd908,d9464bcd-0a19-4058-8fc1-bea843e6c339,4b6c2304-2b9e-41e4-942a-cf11a1819378,2017-08-01 12:08:00+00:00,600,36,636,1745268330.587174,lZgaWmBPDt9MCg
1,880b45ec-3a69-4f6f-b592-ab1d8b7f2f4c,50a2d1c4-d788-4498-a6f7-dd75d4db588f,4b6c2304-2b9e-41e4-942a-cf11a1819378,2017-08-01 15:16:00+00:00,500,30,530,1745268330.587174,ZywqBTfpMicwlw
2,45bdb4e4-37b9-4b3e-b0f4-ec2cf11d851e,5a589b0f-69ac-4256-aea8-df379845f417,4b6c2304-2b9e-41e4-942a-cf11a1819378,2017-08-01 13:03:00+00:00,2400,144,2544,1745268330.587174,XcWo/92Bgc1WPQ
3,285d1cfc-e058-4723-8024-2eb6cee1e233,f8486fce-bc07-4a4f-a6e9-ed6a06ba996c,4b6c2304-2b9e-41e4-942a-cf11a1819378,2017-08-01 10:39:00+00:00,500,30,530,1745268330.587174,SD13fgCQyP2oeQ
4,be6a5404-e15d-49ee-98ff-466d94e4f6f7,94e0e084-e14c-4061-80da-7e6030bf14d5,4b6c2304-2b9e-41e4-942a-cf11a1819378,2017-08-01 11:38:00+00:00,3900,234,4134,1745268330.587174,1VDkEX8Yub6ElA


In [43]:
pipeline.dataset(dataset_type="default").orders.df().count()

Unnamed: 0,0
id,72
customer_id,72
store_id,72
ordered_at,72
subtotal,72
tax_paid,72
order_total,72
_dlt_load_id,72
_dlt_id,72
