# Write Disposition and Incremental Loading

## `dlt` Write Dispositions
- A write disposition defines how data is written to the destination. Three are available:
1. `append`: the default disposition; appends the new data to the existing data.
2. `replace`: deletes all existing data and recreates the schema before loading the new data.
3. `merge`: merges the new data into the existing data; you need to specify a primary key for the resource.
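
To make the difference concrete, here is a minimal pure-Python sketch that models the three dispositions on an in-memory list of rows. It is not how `dlt` implements them; the `write` helper and the sample rows are hypothetical and exist only for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def write(table: List[Row], incoming: List[Row],
          disposition: str = "append", primary_key: str = "id") -> List[Row]:
    """Hypothetical helper: return the table contents after one load."""
    if disposition == "append":
        # Keep everything that was there and add the new rows at the end.
        return table + incoming
    if disposition == "replace":
        # Drop the old rows entirely and keep only the new ones.
        return list(incoming)
    if disposition == "merge":
        # Upsert by primary key: matching rows are overwritten, new keys are added.
        merged = {row[primary_key]: row for row in table}
        for row in incoming:
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"unknown write disposition: {disposition!r}")


# Made-up rows, used only to show the row counts each disposition produces.
existing = [{"id": "1", "name": "bulbasaur"}, {"id": "25", "name": "pikachu"}]
incoming = [{"id": "25", "name": "raichu"}, {"id": "4", "name": "charmander"}]

appended = write(existing, incoming, "append")
replaced = write(existing, incoming, "replace")
merged = write(existing, incoming, "merge")
print(len(appended), len(replaced), len(merged))  # 4 2 3
```

With `merge`, the row with `id` 25 is overwritten rather than duplicated, which is why the merged table has three rows instead of four.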

### Specifying write disposition

Method 1: set it on the resource decorator
```python
@dlt.resource(write_disposition='replace')
def my_resource():
    yield data
```

Method 2: pass it to `pipeline.run()`
```python
load_info = pipeline.run(my_resource, write_disposition="replace")
```

**A write disposition passed to the `run()` method overrides any disposition specified earlier, such as the one set on the resource.**
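
The precedence can be pictured with a small sketch. The `effective_disposition` function below is a hypothetical helper, not part of `dlt`; it only models the rule that the `run()` argument wins, then the resource setting, then the `append` default.

```python
from typing import Optional

DEFAULT_DISPOSITION = "append"  # the default disposition


def effective_disposition(resource_setting: Optional[str],
                          run_argument: Optional[str] = None) -> str:
    """Hypothetical helper: pick the disposition that would take effect."""
    if run_argument is not None:
        return run_argument        # run() has the last word
    if resource_setting is not None:
        return resource_setting    # otherwise use the resource's own setting
    return DEFAULT_DISPOSITION     # nothing specified anywhere


print(effective_disposition("append", "replace"))  # replace
print(effective_disposition("merge"))              # merge
print(effective_disposition(None))                 # append
```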

In [1]:
# Sample data containing pokemon details
data = [
    {"id": "1", "name": "bulbasaur", "size": {"weight": 6.9, "height": 0.7}},
    {"id": "4", "name": "charmander", "size": {"weight": 8.5, "height": 0.6}},
    {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}},
]

In [3]:
import dlt

pipeline = dlt.pipeline(
    pipeline_name='lesson_5',
    destination='duckdb',
    dataset_name='pokemon_lesson_5'
)

@dlt.resource(table_name='pokemon_lesson_5', write_disposition='append')
def pokemon():
    yield data

load_info = pipeline.run(pokemon)
print(load_info)

# explore loaded data
pipeline.dataset(dataset_type="default").pokemon_lesson_5.df()

Pipeline lesson_5 load step completed in 0.13 seconds
1 load package(s) were loaded to destination duckdb and into dataset pokemon_lesson_5
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\lesson_5.duckdb location to store data
Load package 1740396526.489626 is LOADED and contains no failed jobs


| | `id` | `name` | `size__weight` | `size__height` | `_dlt_load_id` | `_dlt_id` |
|---|---|---|---|---|---|---|
| 0 | 1 | bulbasaur | 6.9 | 0.7 | 1740396495.8345797 | U58wwAxOPdzyMQ |
| 1 | 4 | charmander | 8.5 | 0.6 | 1740396495.8345797 | VbF3Mzg3HaNLKg |
| 2 | 25 | pikachu | 6.0 | 0.4 | 1740396495.8345797 | C1HmpQsyzIUxYw |
| 3 | 1 | bulbasaur | 6.9 | 0.7 | 1740396526.489626 | eJ0jqQZbkhmSyw |
| 4 | 4 | charmander | 8.5 | 0.6 | 1740396526.489626 | isoN2IHrDPu/XA |
| 5 | 25 | pikachu | 6.0 | 0.4 | 1740396526.489626 | 5h5t+CHzZbDp/w |


*The output above was captured after running the cell twice, which is why every record appears twice.*

In [4]:
import dlt

pipeline = dlt.pipeline(
    pipeline_name='lesson_5',
    destination='duckdb',
    dataset_name='pokemon_lesson_5'
)

@dlt.resource(table_name='pokemon_lesson_5', write_disposition='replace')
def pokemon():
    yield data

load_info = pipeline.run(pokemon)
print(load_info)

# explore loaded data
pipeline.dataset(dataset_type="default").pokemon_lesson_5.df()

Pipeline lesson_5 load step completed in 1.19 seconds
1 load package(s) were loaded to destination duckdb and into dataset pokemon_lesson_5
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\lesson_5.duckdb location to store data
Load package 1740396592.2682698 is LOADED and contains no failed jobs


| | `id` | `name` | `size__weight` | `size__height` | `_dlt_load_id` | `_dlt_id` |
|---|---|---|---|---|---|---|
| 0 | 1 | bulbasaur | 6.9 | 0.7 | 1740396592.2682698 | RAB6aFuxlhvB4g |
| 1 | 4 | charmander | 8.5 | 0.6 | 1740396592.2682698 | BLjNhi+IGC5T4w |
| 2 | 25 | pikachu | 6.0 | 0.4 | 1740396592.2682698 | Zh1Gm3ZNsaRDDA |


**merge disposition**
- Updates existing records that have changed and adds new ones, without creating duplicates and without clearing the whole table.
- Uses the primary key for deduplication.
- If you are dealing with Slowly Changing Dimensions (SCD), where the attributes of a record change over time and you want to keep a history of those changes, you can use the `merge` write disposition with the `scd2` strategy.
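
The idea behind keeping history can be sketched in plain Python. This is a simplified model of an SCD type 2 table, not `dlt`'s implementation: the `scd2_apply` helper, the `valid_from` / `valid_to` column names and the sample loads are assumptions made for illustration, and the sketch ignores records that disappear from the source.

```python
from typing import Dict, List


def scd2_apply(history: List[Dict], snapshot: List[Dict],
               load_time: str, key: str = "id") -> List[Dict]:
    """Hypothetical helper: retire changed versions and open new ones."""
    result = [dict(row) for row in history]
    # Only rows without an end date are the currently valid versions.
    active = {row[key]: row for row in result if row["valid_to"] is None}
    for record in snapshot:
        current = active.get(record[key])
        if current is not None:
            unchanged = all(current.get(k) == v for k, v in record.items())
            if unchanged:
                continue  # nothing new to record for this key
            current["valid_to"] = load_time  # close the old version
        # Open a new version that is valid from this load onwards.
        result.append({**record, "valid_from": load_time, "valid_to": None})
    return result


# Two made-up loads: pikachu's weight changes between them.
history = scd2_apply([], [{"id": "25", "name": "pikachu", "weight": 6.0}], "2024-12-01")
history = scd2_apply(history, [{"id": "25", "name": "pikachu", "weight": 9.0}], "2024-12-16")
for row in history:
    print(row)
```

After the second load the table holds two rows for Pikachu: the old version, closed at the second load time, and the new version, which is still open.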

In [6]:
@dlt.resource(table_name='pokemon_lesson_5', write_disposition='merge', primary_key='id')
def pokemon():
    yield data

pipeline = dlt.pipeline(
    pipeline_name='lesson_5',
    destination='duckdb',
    dataset_name='pokemon_lesson_5'
)

load_info = pipeline.run(pokemon)
print(load_info)

# explore loaded data
pipeline.dataset(dataset_type="default").pokemon_lesson_5.df()

Pipeline lesson_5 load step completed in 0.21 seconds
1 load package(s) were loaded to destination duckdb and into dataset pokemon_lesson_5
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\lesson_5.duckdb location to store data
Load package 1740396871.573211 is LOADED and contains no failed jobs


| | `id` | `name` | `size__weight` | `size__height` | `_dlt_load_id` | `_dlt_id` |
|---|---|---|---|---|---|---|
| 0 | 25 | pikachu | 6.0 | 0.4 | 1740396871.573211 | Cy0A41TEJZBa7w |
| 1 | 1 | bulbasaur | 6.9 | 0.7 | 1740396871.573211 | d6Noib2wj83h2g |
| 2 | 4 | charmander | 8.5 | 0.6 | 1740396871.573211 | ZhCMooKbWn5DJA |


## Incremental Loading with `dlt`

In [7]:
# first we need a timestamp field in our data

# We added `created_at` field to the data
data = [
    {
        "id": "1",
        "name": "bulbasaur",
        "size": {"weight": 6.9, "height": 0.7},
        "created_at": "2024-12-01"    # <------- new field
    },
    {
        "id": "4",
        "name": "charmander",
        "size": {"weight": 8.5, "height": 0.6},
        "created_at": "2024-09-01"    # <------- new field
    },
    {
        "id": "25",
        "name": "pikachu",
        "size": {"weight": 6, "height": 0.4},
        "created_at": "2023-06-01"    # <------- new field
    }
]

In [9]:
# Step 2: Defining the Incremental Logic

cursor_date = dlt.sources.incremental("created_at", initial_value="2024-01-01")

This tells dlt:
- Start value: January 1, 2024 (`initial_value`).
- Field to track: `created_at` (our timestamp).
- State between runs: as you run the pipeline repeatedly, dlt keeps track of the latest `created_at` value it has processed and skips records older than that value in future runs.
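
The bookkeeping described above can be modelled in a few lines of plain Python. This is a simplified sketch, not `dlt`'s implementation: `CursorState` and `incremental_filter` are hypothetical names, and the way rows sitting exactly on the cursor value are deduplicated here (by remembering their ids) is an assumption. ISO-formatted date strings compare correctly as plain strings, which is why no date parsing is needed.

```python
from typing import Dict, List, Set


class CursorState:
    """Hypothetical stand-in for the state that is remembered between runs."""

    def __init__(self, initial_value: str) -> None:
        self.last_value = initial_value
        self.seen_at_boundary: Set[str] = set()


def incremental_filter(rows: List[Dict], cursor_path: str,
                       state: CursorState, key: str = "id") -> List[Dict]:
    """Keep only rows not loaded before, then advance the cursor."""
    kept = []
    for row in rows:
        value = row[cursor_path]
        if value < state.last_value:
            continue  # older than the stored cursor value
        if value == state.last_value and row[key] in state.seen_at_boundary:
            continue  # sits exactly on the cursor and was already loaded
        kept.append(row)
    if kept:
        newest = max(row[cursor_path] for row in kept)
        if newest > state.last_value:
            state.seen_at_boundary = set()
        state.last_value = newest
        state.seen_at_boundary |= {row[key] for row in kept if row[cursor_path] == newest}
    return kept


rows = [
    {"id": "1", "name": "bulbasaur", "created_at": "2024-12-01"},
    {"id": "4", "name": "charmander", "created_at": "2024-09-01"},
    {"id": "25", "name": "pikachu", "created_at": "2023-06-01"},
]
state = CursorState(initial_value="2024-01-01")

first_run = incremental_filter(rows, "created_at", state)
second_run = incremental_filter(rows, "created_at", state)
print([row["name"] for row in first_run])   # ['bulbasaur', 'charmander']
print([row["name"] for row in second_run])  # []
print(state.last_value)                     # 2024-12-01
```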

In [12]:
@dlt.resource(table_name='pokemon_lesson_5', write_disposition='append')
def pokemon(cursor_date=dlt.sources.incremental("created_at", initial_value="2024-01-01")):
    yield data

In [13]:
pipeline = dlt.pipeline(
    pipeline_name='lesson_5',
    destination='duckdb',
    dataset_name='pokemon_lesson_5'
)

load_info = pipeline.run(pokemon)
print(load_info)

# explore loaded data
pipeline.dataset(dataset_type="default").pokemon_lesson_5.df()

Pipeline lesson_5 load step completed in 0.24 seconds
1 load package(s) were loaded to destination duckdb and into dataset pokemon_lesson_5
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\lesson_5.duckdb location to store data
Load package 1740397592.7813795 is LOADED and contains no failed jobs


| | `id` | `name` | `size__weight` | `size__height` | `_dlt_load_id` | `_dlt_id` | `created_at` |
|---|---|---|---|---|---|---|---|
| 0 | 25 | pikachu | 6.0 | 0.4 | 1740396871.573211 | Cy0A41TEJZBa7w | |
| 1 | 1 | bulbasaur | 6.9 | 0.7 | 1740396871.573211 | d6Noib2wj83h2g | |
| 2 | 4 | charmander | 8.5 | 0.6 | 1740396871.573211 | ZhCMooKbWn5DJA | |
| 3 | 1 | bulbasaur | 6.9 | 0.7 | 1740397592.7813797 | gMvLt0fdlc/fGA | 2024-12-01 |
| 4 | 4 | charmander | 8.5 | 0.6 | 1740397592.7813797 | ma/UGPSQFxmK+w | 2024-09-01 |


This:
1. Loads **only Charmander and Bulbasaur**, whose `created_at` is after 2024-01-01.
2. Skips Pikachu because it's old news: its `created_at` (2023-06-01) is before the initial value.

If you run the pipeline again now, no rows are added because there is no new data. However, the pipeline is still not complete: it does not handle updates to existing records.
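
A short sketch shows why a `created_at` cursor cannot see updates. It assumes each record also carries a hypothetical `updated_at` field and that the cursor stored after the first run is `2024-12-01`; the `newer_than` helper is made up for illustration and is not a `dlt` function.

```python
from typing import Dict, List

# Made-up records: pikachu was created long ago but changed recently.
rows = [
    {"name": "bulbasaur", "created_at": "2024-12-01", "updated_at": "2024-12-01"},
    {"name": "charmander", "created_at": "2024-09-01", "updated_at": "2024-09-01"},
    {"name": "pikachu", "created_at": "2023-06-01", "updated_at": "2024-12-16"},
]
last_value = "2024-12-01"  # assumed cursor value remembered from the first run


def newer_than(rows: List[Dict], field: str, last_value: str) -> List[str]:
    """Hypothetical helper: names of rows whose cursor field is past last_value."""
    return [row["name"] for row in rows if row[field] > last_value]


by_created = newer_than(rows, "created_at", last_value)
by_updated = newer_than(rows, "updated_at", last_value)
print(by_created)  # []
print(by_updated)  # ['pikachu']
```

Tracking creation time misses the changed record entirely, while tracking the update time picks it up. Loading that record with `append` would still create a second Pikachu row, which is where `merge` with a primary key comes in.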

In [16]:
# Add an `updated_at` field so that changes to existing records can be tracked
data = [
    {
        "id": "1",
        "name": "bulbasaur",
        "size": {"weight": 6.9, "height": 0.7},
        "created_at": "2024-12-01",
        "updated_at": "2024-12-01"    # <------- new field
    },
    {
        "id": "4",
        "name": "charmander",
        "size": {"weight": 8.5, "height": 0.6},
        "created_at": "2024-09-01",
        "updated_at": "2024-09-01"    # <------- new field
    },
    {
        "id": "25",
        "name": "pikachu",
        "size": {"weight": 9, "height": 0.4}, # <----- pikachu gained weight from 6 to 9
        "created_at": "2023-06-01",
        "updated_at": "2024-12-16"    # <------- new field, information about pikachu has updated
    },
]

In [21]:
@dlt.resource(
    name="pokemon_with_updated_at",
    write_disposition="merge",  # <--- change write disposition from 'append' to 'merge'
    primary_key="id",  # <--- set a primary key
)
def pokemon(cursor_date=dlt.sources.incremental("updated_at", initial_value="2024-01-01")):  # <--- change the cursor name from 'created_at' to 'updated_at'
    yield data

In [22]:
pipeline = dlt.pipeline(
    pipeline_name='lesson_5',
    destination='duckdb',
    dataset_name='pokemon_lesson_5'
)

load_info = pipeline.run(pokemon)
print(load_info)

# explore loaded data
pipeline.dataset(dataset_type="default").pokemon_with_updated_at.df()

Pipeline lesson_5 load step completed in 0.33 seconds
1 load package(s) were loaded to destination duckdb and into dataset pokemon_lesson_5
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\lesson_5.duckdb location to store data
Load package 1740397929.4161534 is LOADED and contains no failed jobs


| | `id` | `name` | `size__weight` | `size__height` | `created_at` | `updated_at` | `_dlt_load_id` | `_dlt_id` |
|---|---|---|---|---|---|---|---|---|
| 0 | 25 | pikachu | 7.5 | 0.4 | 2023-06-01 | 2024-12-23 | 1740397929.4161534 | Jf4fpFAUYwsXSQ |
| 1 | 1 | bulbasaur | 6.9 | 0.7 | 2024-12-01 | 2024-12-01 | 1740397929.4161534 | po+MjV/QjMVg/g |
| 2 | 4 | charmander | 8.5 | 0.6 | 2024-09-01 | 2024-09-01 | 1740397929.4161534 | EpXbaWvVWdpu/g |

*The output above shows Pikachu with weight 7.5 and `updated_at` 2024-12-23, which are the values from the next data cell rather than the one before this run. It appears to have been captured after that later data had already been loaded, which would also explain why the run further down reports 0 load packages.*


In [23]:
# Pikachu's record is updated once more
data = [
    {
        "id": "1",
        "name": "bulbasaur",
        "size": {"weight": 6.9, "height": 0.7},
        "created_at": "2024-12-01",
        "updated_at": "2024-12-01"
    },
    {
        "id": "4",
        "name": "charmander",
        "size": {"weight": 8.5, "height": 0.6},
        "created_at": "2024-09-01",
        "updated_at": "2024-09-01"
    },
    {
        "id": "25",
        "name": "pikachu",
        "size": {"weight": 7.5, "height": 0.4}, # <--- pikachu lost weight
        "created_at": "2023-06-01",
        "updated_at": "2024-12-23"  # <--- data about his weight was updated a week later
    },
]

In [24]:
load_info = pipeline.run(pokemon)
print(load_info)

# explore loaded data
pipeline.dataset(dataset_type="default").pokemon_with_updated_at.df()

Pipeline lesson_5 load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\lesson_5.duckdb location to store data


| | `id` | `name` | `size__weight` | `size__height` | `created_at` | `updated_at` | `_dlt_load_id` | `_dlt_id` |
|---|---|---|---|---|---|---|---|---|
| 0 | 25 | pikachu | 7.5 | 0.4 | 2023-06-01 | 2024-12-23 | 1740397929.4161534 | Jf4fpFAUYwsXSQ |
| 1 | 1 | bulbasaur | 6.9 | 0.7 | 2024-12-01 | 2024-12-01 | 1740397929.4161534 | po+MjV/QjMVg/g |
| 2 | 4 | charmander | 8.5 | 0.6 | 2024-09-01 | 2024-09-01 | 1740397929.4161534 | EpXbaWvVWdpu/g |


## **Exercise 1: Make the GitHub API pipeline incremental**

In the previous lessons, you built a pipeline to pull data from the GitHub API. Now, let’s level it up by making it incremental, so it fetches only new or updated data.


Transform your GitHub API pipeline to use incremental loading. This means:

* Implement a new `dlt.resource` for the `pulls/comments` (list comments for pull requests) endpoint.
* Fetch only pull request comments updated after the last pipeline run.
* Use the `updated_at` field from the GitHub API as the incremental cursor.
* [Endpoint documentation](https://docs.github.com/en/rest/pulls/comments?apiVersion=2022-11-28#list-review-comments-in-a-repository)
* Endpoint URL: `https://api.github.com/repos/OWNER/REPO/pulls/comments`
* Pass the cursor's `last_value` as the `since` parameter, which returns only results that were last updated after the given time.
* Set `initial_value` to `2024-12-01`.
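
As a rough picture of how the `since` parameter and the cursor fit together, the sketch below builds the request URL and advances the cursor using only the standard library. The `pull_comments_url` and `next_cursor` helpers and the sample timestamps are hypothetical; in the actual exercise `dlt.sources.incremental` does this bookkeeping for you.

```python
from typing import Dict, List
from urllib.parse import urlencode

API_BASE = "https://api.github.com"


def pull_comments_url(owner: str, repo: str, since: str) -> str:
    """Hypothetical helper: endpoint URL with the `since` filter applied."""
    query = urlencode({"since": since})
    return f"{API_BASE}/repos/{owner}/{repo}/pulls/comments?{query}"


def next_cursor(comments: List[Dict], current: str) -> str:
    """Hypothetical helper: newest `updated_at` seen so far (ISO strings sort as text)."""
    return max([current] + [comment["updated_at"] for comment in comments])


url = pull_comments_url("dlt-hub", "dlt", "2024-12-01")
print(url)

# Made-up response items, only to show how the cursor moves forward.
page = [
    {"id": 1, "updated_at": "2024-12-05T10:00:00Z"},
    {"id": 2, "updated_at": "2024-12-03T08:30:00Z"},
]
cursor = next_cursor(page, "2024-12-01")
print(cursor)  # 2024-12-05T10:00:00Z
```

On the next run, the stored cursor value would be sent as `since`, so only comments updated after it come back.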


### Question

How many columns does the `comments` table have?

In [None]:
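import dlt
import os
from google.colab import userdata  # Colab secrets store; assumes the notebook runs in Google Colab
from dlt.sources.helpers import requests
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth

# Expose the GitHub token so dlt can resolve `access_token` from the environment
os.environ["SOURCES__ACCESS_TOKEN"] = userdata.get('SECRET_KEY')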

@dlt.source
def github_source(access_token=dlt.secrets.value):
    client = RESTClient(
        base_url='https://api.github.com',
        auth=BearerTokenAuth(access_token)
    )

    @dlt.resource # if no table name specified it will use the function name
    def github_events():
        for page in client.paginate("orgs/dlt-hub/events"):
            yield from page

    @dlt.resource
    def github_stargazers():
        for page in client.paginate("repos/dlt-hub/dlt/stargazers"):
            yield from page
    
    @dlt.resource(
        name="pull_comments",
        write_disposition="merge",
        primary_key="id"
    )
    def github_pull_comments(cursor_date=dlt.sources.incremental("updated_at", initial_value="2024-12-01")):
        # last_value is the initial value on the first run, then the newest `updated_at` seen so far
        print(cursor_date.last_value)
        params = {
            "since": cursor_date.last_value
        }
        for page in client.paginate("repos/dlt-hub/dlt/pulls/comments", params=params):
            yield from page
    
    
    return github_events, github_stargazers, github_pull_comments



# define dlt pipeline
pipeline = dlt.pipeline(
    pipeline_name="github_incr_exercise",
    destination="duckdb"
)


# run the pipeline with the new resource
load_info = pipeline.run(github_source())
print(load_info)