# Pagination, Authentication and dlt Configuration

## Pagination
- Pagination limits how much data an API sends in a single response.
- If an endpoint supports the `per_page` query parameter, you can choose how many results to process at a time.

In [2]:
import requests

# GitHub provides two query parameters:
# per_page - number of results per page
# page - page number to retrieve results from

response = requests.get("https://api.github.com/orgs/dlt-hub/events?per_page=10&page=1")
response.links

{}
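`response.links` is the `Link` response header parsed by `requests`, so the empty dict above means this particular response carried no `Link` header. When a `next` link is present, pages can be followed by hand. The following is a minimal sketch of that loop, not how dlt implements it: `iter_pages`, `fetch_with_requests`, and `FAKE_API` are hypothetical names introduced here, and the in-memory `FAKE_API` stands in for a real endpoint so the example runs offline.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = List[dict]
# A fetch function returns the page's items and the next page's URL (None on the last page).
Fetch = Callable[[str], Tuple[Page, Optional[str]]]


def iter_pages(fetch: Fetch, first_url: str, max_pages: int = 5) -> Iterator[Page]:
    """Follow 'next' links until they run out or max_pages is reached."""
    url: Optional[str] = first_url
    fetched = 0
    while url is not None and fetched < max_pages:
        items, url = fetch(url)
        fetched += 1
        yield items


def fetch_with_requests(url: str) -> Tuple[Page, Optional[str]]:
    """Real-network variant: reads the next URL from the parsed Link header."""
    import requests  # imported lazily so the offline demo below needs no network

    response = requests.get(url)
    response.raise_for_status()
    next_url = response.links.get("next", {}).get("url")
    return response.json(), next_url


# Hypothetical two-page API used only for this demonstration.
FAKE_API = {
    "page1": ([{"id": 1}, {"id": 2}], "page2"),
    "page2": ([{"id": 3}], None),
}

pages = list(iter_pages(FAKE_API.__getitem__, "page1"))
print(pages)
```

Passing `fetch_with_requests` instead of `FAKE_API.__getitem__` would run the same loop against a live URL, subject to the API's rate limits.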

### dlt RESTClient
- dlt's `RESTClient` helper takes care of repetitive API tasks such as
    - authentication
    - query parameter handling
    - pagination

In [3]:
from dlt.sources.helpers.rest_client import RESTClient
client = RESTClient(base_url="https://api.github.com")

for i, page in enumerate(client.paginate("orgs/dlt-hub/events")):
    if i >= 5:  # print only the first 5 pages, then stop requesting more
        break
    print(page)

HTTPError: 403 Client Error: rate limit exceeded for url: https://api.github.com/orgs/dlt-hub/events
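The 403 above appears to be GitHub's rate limit for requests sent without a token. GitHub reports the limit state in `X-RateLimit-*` response headers, with `X-RateLimit-Reset` given in UTC epoch seconds. As a rough sketch, the wait until the limit resets can be computed as follows; `seconds_until_reset` is a hypothetical helper, the header values are copied from a rate-limited response captured in this notebook, and `now` is fixed only to keep the example reproducible.

```python
from datetime import datetime, timezone

# Sample header values from a rate-limited GitHub response.
headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1740304296",  # epoch seconds, UTC
}


def seconds_until_reset(headers: dict, now: datetime) -> float:
    """Return how long to wait before the rate limit window resets."""
    reset_at = datetime.fromtimestamp(int(headers["X-RateLimit-Reset"]), tz=timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())


# Fixed timestamp (assumption for the demo); use datetime.now(timezone.utc) in practice.
now = datetime(2025, 2, 23, 9, 8, 52, tzinfo=timezone.utc)
wait = seconds_until_reset(headers, now)
print(f"remaining={headers['X-RateLimit-Remaining']}, retry in {wait:.0f}s")
```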

There are several pagination types. In the `RESTClient` example above, dlt inferred the type automatically, but it can also be specified explicitly:

- JSONLinkPaginator - link to the next page is included in the JSON response.
- HeaderLinkPaginator - link to the next page is included in the response headers.
- OffsetPaginator - pagination based on offset and limit query parameters.
- PageNumberPaginator - pagination based on page numbers.
- JSONResponseCursorPaginator - pagination based on a cursor in the JSON response.
- HeaderCursorPaginator - pagination based on a cursor in the response headers.
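To make the difference between two of these styles concrete, here is a small in-memory sketch of page-number and offset pagination. It does not use dlt; `RECORDS`, `get_by_page`, and `get_by_offset` are hypothetical stand-ins for an API, and the parameter names in the docstrings are only typical examples.

```python
from typing import List

RECORDS = list(range(1, 26))  # 25 fake records standing in for an API's data


def get_by_page(page: int, per_page: int) -> List[int]:
    """Page-number style, e.g. ?page=2&per_page=10 (pages start at 1)."""
    start = (page - 1) * per_page
    return RECORDS[start:start + per_page]


def get_by_offset(offset: int, limit: int) -> List[int]:
    """Offset style, e.g. ?offset=10&limit=10."""
    return RECORDS[offset:offset + limit]


# The same slice of data, addressed in two ways.
same_slice = get_by_page(page=2, per_page=10) == get_by_offset(offset=10, limit=10)

# Walk all pages until an empty page signals the end.
page, collected = 1, []
while batch := get_by_page(page, per_page=10):
    collected.extend(batch)
    page += 1

print(same_slice, len(collected), page - 1)
```

Link-based and cursor-based styles differ in that the server, not the client, tells you where the next page is.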

In [14]:
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator

client = RESTClient(
    base_url="https://api.github.com",
    paginator=HeaderLinkPaginator()
)

### Exercise 1: Pagination with RESTClient
Question: What type of pagination should we use for the GitHub API?

In [18]:
response = requests.get("https://api.github.com/orgs/dlt-hub/events?per_page=10&page=1")
print(response.headers)

{'Date': 'Sun, 23 Feb 2025 09:08:52 GMT', 'Server': 'Varnish', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'deny', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "default-src 'none'; style-src 'unsafe-inline'", 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset', 'Content-Type': 'application/json; charset=utf-8', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'X-GitHub-Media-Type': 'github.v3; format=json', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '0', 'X-RateLimit-Reset': '1740304296', 'X-RateLimit-Resource': 'core', 'X-RateLimit-Used': '60', 'Content-Length': '278', 'X-GitH

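The response above was rate-limited (`X-RateLimit-Remaining` is `0`), so it carries no `Link` header. The headers of a successful response, shown below, include a `Link` header pointing to the next page, so the right choice for GitHub is `HeaderLinkPaginator`.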

```
{'Date': 'Sun, 23 Feb 2025 09:09:31 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept,Accept-Encoding, Accept, X-Requested-With', 'ETag': 'W/"b1c22a97c4cacc94cff289841fb952e9b6c9293f7838d5a87900d5e1ae651c97"', 'Last-Modified': 'Sun, 23 Feb 2025 08:48:57 GMT', 'X-Poll-Interval': '60', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Link': '<https://api.github.com/organizations/89419010/events?per_page=10&page=2>; rel="next", <https://api.github.com/organizations/89419010/events?per_page=10&page=29>; rel="last"', 'x-github-api-version-selected': '2022-11-28', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Content-Encoding': 'gzip', 'Server': 'github.com', 'Accept-Ranges': 'bytes', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '59', 'X-RateLimit-Reset': '1740305371', 'X-RateLimit-Resource': 'core', 'X-RateLimit-Used': '1', 'Transfer-Encoding': 'chunked', 'X-GitHub-Request-Id': 'C8B6:3C37A6:1340157:2706F30:67BAE5CB'}
```
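The `Link` value in the headers above can be split into its `rel` entries with a few lines of standard-library code. This is only an illustrative sketch: `parse_link_header` is a hypothetical helper, not part of dlt or `requests`, and the regular expression assumes the simple `<url>; rel="name"` form seen in this response.

```python
import re
from typing import Dict


def parse_link_header(value: str) -> Dict[str, str]:
    """Map each rel value (e.g. 'next', 'last') to its URL."""
    pattern = r'<([^>]+)>\s*;\s*rel="([^"]+)"'
    return {rel: url for url, rel in re.findall(pattern, value)}


# Link header value copied from the response headers shown above.
link = (
    '<https://api.github.com/organizations/89419010/events?per_page=10&page=2>; rel="next", '
    '<https://api.github.com/organizations/89419010/events?per_page=10&page=29>; rel="last"'
)

links = parse_link_header(link)
print(links["next"])
print(links["last"])
```

A header-link paginator follows the `next` URL on each response and stops once no `next` entry is returned.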


## Authentication

In [4]:
import os

github_token = os.getenv("GITHUB_TOKEN")

if github_token:
    print("GitHub token loaded successfully.")
else:
    print("GitHub token not found. Check your environment variables.")

GitHub token loaded successfully.


In [6]:
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth

client = RESTClient(
    base_url="https://api.github.com",
    auth=BearerTokenAuth(github_token)
)

for i, page in enumerate(client.paginate("repos/dlt-hub/dlt/stargazers")):
    if i >= 5:  # print only the first 5 pages, then stop requesting more
        break
    print(page)


[{'login': 'lalitpagaria', 'id': 19303690, 'node_id': 'MDQ6VXNlcjE5MzAzNjkw', 'avatar_url': 'https://avatars.githubusercontent.com/u/19303690?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/lalitpagaria', 'html_url': 'https://github.com/lalitpagaria', 'followers_url': 'https://api.github.com/users/lalitpagaria/followers', 'following_url': 'https://api.github.com/users/lalitpagaria/following{/other_user}', 'gists_url': 'https://api.github.com/users/lalitpagaria/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/lalitpagaria/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/lalitpagaria/subscriptions', 'organizations_url': 'https://api.github.com/users/lalitpagaria/orgs', 'repos_url': 'https://api.github.com/users/lalitpagaria/repos', 'events_url': 'https://api.github.com/users/lalitpagaria/events{/privacy}', 'received_events_url': 'https://api.github.com/users/lalitpagaria/received_events', 'type': 'User', 'user_view_type': 'public', '

## dlt Configuration and Secrets
- Configuration values are non-sensitive and define pipeline behaviour: file paths, database hosts, timeouts, API URLs, and performance settings.
- Secrets are sensitive (passwords, API keys, auth tokens) and should never be hard-coded.
    - These can be set up in several ways:
        - Environment variables
        - Within code using `dlt.secrets` and `dlt.config`
        - Configuration files (`secrets.toml` and `config.toml`)
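The options above can coexist, and dlt's documentation describes environment variables as being consulted before the TOML files. The sketch below mimics only that idea in plain Python. It is a simplified illustration, not dlt's implementation: `resolve_secret` is a hypothetical helper, the dictionaries stand in for the real environment and a parsed `secrets.toml`, and dlt's section-based key naming is ignored.

```python
import os
from typing import Mapping, Optional


def resolve_secret(
    key: str,
    file_values: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Look up a secret in the environment first, then in file-based values."""
    env_value = env.get(key.upper())
    if env_value:
        return env_value
    return file_values.get(key)


# Stand-ins for the process environment and a parsed secrets file (assumed values).
fake_env = {"GITHUB_TOKEN": "token-from-env"}
fake_file = {"github_token": "token-from-file", "other_key": "only-in-file"}

from_env = resolve_secret("github_token", fake_file, env=fake_env)
from_file = resolve_secret("other_key", fake_file, env=fake_env)
missing = resolve_secret("not_set", fake_file, env=fake_env)
print(from_env, from_file, missing)
```

Either way the token stays out of the source code, which is the point of the rule against hard-coding.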

In [12]:
import dlt
import os
from dlt.sources.helpers import requests
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
import toml
# Load the secrets file
secrets = toml.load(r"C:\Users\HP\OneDrive\Desktop\Data Engg\dlt\secrets.toml")  # Adjust the path if needed

# Method 1: read the token from an environment variable
# dlt.secrets['github_token'] = os.getenv('GITHUB_TOKEN')

# Method 2: read the token from the secrets file loaded above
dlt.secrets['github_token'] = secrets.get("GITHUB_TOKEN")

@dlt.source
def github_source():
    client = RESTClient(
        base_url='https://api.github.com',
        auth=BearerTokenAuth(dlt.secrets['github_token'])  # use the secret set above
    )

    @dlt.resource  # if no table name is specified, the function name is used
    def github_events():
        for page in client.paginate("orgs/dlt-hub/events"):
            yield from page

    @dlt.resource
    def github_stargazers():
        for page in client.paginate("repos/dlt-hub/dlt/stargazers"):
            yield from page

    return github_events, github_stargazers

In [14]:
# define a new pipeline
pipeline = dlt.pipeline(pipeline_name="github_pipeline", destination="duckdb", dataset_name="github.db")

# run the pipeline with the source defined above
load_info = pipeline.run(github_source())
print(load_info)



Pipeline github_pipeline load step completed in 4.26 seconds
1 load package(s) were loaded to destination duckdb and into dataset github_db
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\github_pipeline.duckdb location to store data
Load package 1740305706.5711625 is LOADED and contains no failed jobs


### Exercise 2: Run pipeline with dlt.secrets.value
Question: Who has id=17202864 in the stargazers table? Use sql_client.

In [19]:
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM github_stargazers WHERE id = 17202864") as cursor:
        print(cursor.fetchall())



[('rudolfix', 17202864, 'MDQ6VXNlcjE3MjAyODY0', 'https://avatars.githubusercontent.com/u/17202864?v=4', '', 'https://api.github.com/users/rudolfix', 'https://github.com/rudolfix', 'https://api.github.com/users/rudolfix/followers', 'https://api.github.com/users/rudolfix/following{/other_user}', 'https://api.github.com/users/rudolfix/gists{/gist_id}', 'https://api.github.com/users/rudolfix/starred{/owner}{/repo}', 'https://api.github.com/users/rudolfix/subscriptions', 'https://api.github.com/users/rudolfix/orgs', 'https://api.github.com/users/rudolfix/repos', 'https://api.github.com/users/rudolfix/events{/privacy}', 'https://api.github.com/users/rudolfix/received_events', 'User', 'public', False, '1740305706.5711625', 'pnAFJfN2P9fUQA')]
