# Lesson 2: dlt sources and resources: create your first dlt pipeline

## 1. `dlt.resource`
- A resource is a function that yields data.
- To create a resource, apply the `@dlt.resource` decorator. (A decorator is a function that takes another function as input and returns a modified version of it, changing its behaviour without editing its code.)
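To make the decorator idea concrete, here is a minimal plain-Python sketch. The `resource` decorator below is a hypothetical stand-in written for illustration; it is not how `dlt` implements `@dlt.resource`, it only shows a function being taken as input and returned with extra behaviour.

```python
import functools


def resource(table_name):
    """Hypothetical stand-in for a decorator such as @dlt.resource."""
    def decorator(func):
        @functools.wraps(func)  # keep the original function's name and docstring
        def wrapper(*args, **kwargs):
            # Add behaviour around the original generator without editing its code:
            # every yielded item is tagged with the table it belongs to.
            for item in func(*args, **kwargs):
                yield {"table": table_name, "row": item}
        return wrapper
    return decorator


@resource(table_name="pokemon")
def pokemon():
    yield {"id": "1", "name": "bulbasaur"}
    yield {"id": "4", "name": "charmander"}


rows = list(pokemon())
print(rows)
```

The body of `pokemon` is untouched; only the decorated version wraps each yielded item.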

Commonly used arguments of the `@dlt.resource` decorator:
- `name`: the resource name; defaults to the name of the decorated function and is also used as the table name unless `table_name` is set
- `table_name`: name of the table this resource loads into
- `write_disposition`: how the data is loaded: `append` (default), `replace`, or `merge`
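The three write dispositions can be pictured with a small plain-Python model. This is a simplified sketch and not `dlt`'s loader: the `load` function and its `primary_key` parameter are hypothetical names for this illustration, and the sketch's `merge` is reduced to a simple upsert by key.

```python
def load(table, rows, write_disposition="append", primary_key=None):
    """Return the table contents after loading `rows` (simplified model)."""
    if write_disposition == "append":
        # Keep existing rows and add the new ones at the end
        return table + rows
    if write_disposition == "replace":
        # Throw away existing rows and keep only the new ones
        return list(rows)
    if write_disposition == "merge":
        if primary_key is None:
            raise ValueError("this sketch needs a primary_key to merge")
        # Rows with a known key overwrite the old row, new keys are added
        merged = {row[primary_key]: row for row in table}
        merged.update({row[primary_key]: row for row in rows})
        return list(merged.values())
    raise ValueError(f"unknown write_disposition: {write_disposition}")


existing = [{"id": 1, "name": "bulbasaur"}, {"id": 4, "name": "charmander"}]
incoming = [{"id": 4, "name": "charmander (updated)"}, {"id": 25, "name": "pikachu"}]

appended = load(existing, incoming, "append")
replaced = load(existing, incoming, "replace")
merged = load(existing, incoming, "merge", primary_key="id")
print(len(appended), len(replaced), len(merged))  # 4 2 3
```

With `append` the row for `id` 4 appears twice, with `replace` only the incoming rows survive, and with `merge` the row for `id` 4 is updated in place.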

> **Why is it a better way?** Wrapping your data in a resource lets you use the `dlt` features that follow data engineering best practices, such as incremental loading and data contracts.

In [3]:
import dlt

# Sample data containing pokemon details
data = [
    {"id": "1", "name": "bulbasaur", "size": {"weight": 6.9, "height": 0.7}},
    {"id": "4", "name": "charmander", "size": {"weight": 8.5, "height": 0.6}},
    {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}},
]

pipeline = dlt.pipeline(
    pipeline_name='my_pipeline',
    destination='duckdb',
    dataset_name='mydata'
)
    
# create a dlt resource
@dlt.resource(table_name='pokemon_new')
def my_dict_list():
    yield data

load_info = pipeline.run(my_dict_list)
print(load_info)

Pipeline my_pipeline load step completed in 0.31 seconds
1 load package(s) were loaded to destination duckdb and into dataset mydata
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\my_pipeline.duckdb location to store data
Load package 1740225433.9481218 is LOADED and contains no failed jobs


In [4]:
pipeline.dataset(dataset_type='default').pokemon_new.df()

Unnamed: 0,id,name,size__weight,size__height,_dlt_load_id,_dlt_id
0,1,bulbasaur,6.9,0.7,1740225433.9481218,JLosD3e0d5/X4Q
1,4,charmander,8.5,0.6,1740225433.9481218,SvKhsA/LDRgs0A
2,25,pikachu,6.0,0.4,1740225433.9481218,wnwuWBa53H0tyw


Instead of a list of dicts, the data can also be:
- a DataFrame
- a database query result
- an API response
- anything you can transform into JSON/dict format

In [8]:
# Using DataFrame

import pandas as pd

# Define a resource to load data from a CSV
@dlt.resource(name='df_data')
def my_df():
    sample_df = pd.read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv")
    yield sample_df


# Run the pipeline with the defined resource
load_info = pipeline.run(my_df, write_disposition="replace")
print(load_info)

# Query the loaded data from 'df_data'
pipeline.dataset(dataset_type="default").df_data.df().head()

Pipeline my_pipeline load step completed in 0.22 seconds
1 load package(s) were loaded to destination duckdb and into dataset mydata
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\my_pipeline.duckdb location to store data
Load package 1740225749.6962357 is LOADED and contains no failed jobs


Unnamed: 0,index,height_inchesx,_weight_poundsx
0,1,65.78,112.99
1,2,71.52,136.49
2,3,69.4,153.03
3,4,68.22,142.34
4,5,67.79,144.3


#### Using Database Connection to create Resource

In [9]:
pip install pymysql


In [11]:
!pip install sqlalchemy


In [12]:
from sqlalchemy import create_engine

In [13]:
# Define a resource to fetch genome data from the public Rfam database
@dlt.resource(table_name='genome_data')
def get_genome_data():
    engine = create_engine("mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam")
    with engine.connect() as conn:
        query = "SELECT * FROM genome LIMIT 1000"
        # Stream the result in chunks of 100 rows instead of fetching everything at once
        rows = conn.execution_options(yield_per=100).exec_driver_sql(query)
        # Convert each SQLAlchemy row into a plain dict
        yield from map(lambda row: dict(row._mapping), rows)

# Run the pipeline with the genome data resource
load_info = pipeline.run(get_genome_data)
print(load_info)

# Query the loaded data from 'genome_data'
pipeline.dataset(dataset_type="default").genome_data.df().head()

Pipeline my_pipeline load step completed in 0.56 seconds
1 load package(s) were loaded to destination duckdb and into dataset mydata
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\my_pipeline.duckdb location to store data
Load package 1740226387.038616 is LOADED and contains no failed jobs


Unnamed: 0,upid,description,total_length,ncbi_id,scientific_name,kingdom,num_rfam_regions,num_families,is_reference,is_representative,...,assembly_acc,assembly_version,assembly_level,ungapped_length,assembly_name,study_ref,circular,wgs_acc,wgs_version,common_name
0,RG000000001,Potato spindle tuber viroid,4591,12892,Potato spindle tuber viroid,viroids,0,0,1,0,...,,,,,,,,,,
1,RG000000002,Columnea latent viroid,370,12901,Columnea latent viroid,viroids,0,0,1,0,...,,,,,,,,,,
2,RG000000003,Tomato apical stunt viroid-S,360,53194,Tomato apical stunt viroid-S,viroids,0,0,1,0,...,,,,,,,,,,
3,RG000000004,Tomato apical stunt viroid,360,12885,Tomato apical stunt viroid,viroids,0,0,1,0,...,,,,,,,,,,
4,RG000000005,Cucumber yellows virus,7899,32618,Cucumber yellows virus,viruses,0,0,1,0,...,,,,,,,,,,


#### REST API dlt Resource

In [15]:
from dlt.sources.helpers import requests

# define a resource to fetch from PokeAPI
@dlt.resource(table_name='pokemon_api')
def get_pokemon():
    url = 'https://pokeapi.co/api/v2/pokemon'
    response = requests.get(url)
    yield response.json()['results']

# Run the pipeline with the PokeAPI resource
load_info = pipeline.run(get_pokemon)
print(load_info)

# Query the loaded data from 'pokemon_api' table
pipeline.dataset(dataset_type="default").pokemon_api.df().head()

Pipeline my_pipeline load step completed in 0.24 seconds
1 load package(s) were loaded to destination duckdb and into dataset mydata
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\my_pipeline.duckdb location to store data
Load package 1740226587.076649 is LOADED and contains no failed jobs


Unnamed: 0,name,url,_dlt_load_id,_dlt_id
0,bulbasaur,https://pokeapi.co/api/v2/pokemon/1/,1740226587.076649,lOSsn8QQ10PQkw
1,ivysaur,https://pokeapi.co/api/v2/pokemon/2/,1740226587.076649,EWUtBmg/pXj31A
2,venusaur,https://pokeapi.co/api/v2/pokemon/3/,1740226587.076649,DelYuRj/CMibFw
3,charmander,https://pokeapi.co/api/v2/pokemon/4/,1740226587.076649,A3WAE0OimjQThg
4,charmeleon,https://pokeapi.co/api/v2/pokemon/5/,1740226587.076649,Kjwb9apjzZUI5A


In [16]:
# List all table names from the database
with pipeline.sql_client() as client:
    with client.execute_query("SELECT table_name FROM information_schema.tables") as table:
        print(table.df())

            table_name
0              df_data
1          genome_data
2          pokemon_api
3          pokemon_new
4           _dlt_loads
5  _dlt_pipeline_state
6         _dlt_version


## 2. `dlt.source`

- A source is a logical grouping of resources.
- You declare a source by adding the `@dlt.source` decorator to a function that returns one or more resources.

In [20]:
@dlt.source
def all_data():
    return my_df, get_genome_data, get_pokemon

In [21]:
# use the source to load all the above data in another database
pipeline = dlt.pipeline(
    pipeline_name = 'resource_source_new',
    destination = 'duckdb',
    dataset_name = 'all_data'
)

load_info = pipeline.run(all_data())

# Print load info
print(load_info)

Pipeline resource_source_new load step completed in 0.82 seconds
1 load package(s) were loaded to destination duckdb and into dataset all_data
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\resource_source_new.duckdb location to store data
Load package 1740227055.2556074 is LOADED and contains no failed jobs


**Why does this matter?**
- It is more efficient than running your resources separately.
- It organizes both your schema and your code. 🙂
- It enables the option for parallelization.

## 3. `dlt.transformer`

- A transformer is a special kind of resource that takes its input from another resource. For instance, one resource provides all the IDs needed to make API requests, and the transformer makes one request per ID.
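Conceptually, the parent resource yields batches of items and the transformer is called with each batch. The plain-Python sketch below mimics that flow without `dlt` and without network access. `FAKE_API`, `enrich`, and `pipe` are hypothetical names for this illustration, and `pipe` only approximates the wiring that `data_from` sets up.

```python
# Stand-in for a remote API: maps an ID to the details we want to add
FAKE_API = {"1": "bulbasaur", "4": "charmander", "25": "pikachu"}


def pokemon_ids():
    """Parent 'resource': yields one batch of records that only carry IDs."""
    yield [{"id": "1"}, {"id": "4"}, {"id": "25"}]


def enrich(items):
    """'Transformer': receives one batch from the parent and yields enriched records."""
    for item in items:
        yield {**item, "name": FAKE_API[item["id"]]}


def pipe(parent, transformer):
    """Feed every batch yielded by the parent into the transformer."""
    for batch in parent():
        yield from transformer(batch)


enriched = list(pipe(pokemon_ids, enrich))
print(enriched)
```

Only the transformer's output is collected here; the parent's records serve purely as its input.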

In [22]:
# this is our resource 1
@dlt.resource(table_name='pokemon')
def my_dict_list():
    yield data

In [23]:
# define a transformer - to enrich pokemon data
@dlt.transformer(data_from=my_dict_list, table_name='pokemon_detailed_info')
def poke_details(items): # <--- `items` contains the data yielded by the `my_dict_list` resource
    for item in items:
        print(f"{item} \n") # <-- print the data we get from the `my_dict_list` resource

        pokemon_id = item['id'] # avoid shadowing the built-in `id`
        url = f"https://pokeapi.co/api/v2/pokemon/{pokemon_id}"
        response = requests.get(url)
        details = response.json()

        print(f"Details: {details} \n") # <-- print the data we get from PokeAPI

        yield details

In [24]:
load_info = pipeline.run(poke_details())
print(load_info)

# Query the 'pokemon_detailed_info' table and convert the result to a DataFrame
pipeline.dataset(dataset_type="default").pokemon_detailed_info.df()

{'id': '1', 'name': 'bulbasaur', 'size': {'weight': 6.9, 'height': 0.7}} 

Details: {'abilities': [{'ability': {'name': 'overgrow', 'url': 'https://pokeapi.co/api/v2/ability/65/'}, 'is_hidden': False, 'slot': 1}, {'ability': {'name': 'chlorophyll', 'url': 'https://pokeapi.co/api/v2/ability/34/'}, 'is_hidden': True, 'slot': 3}], 'base_experience': 64, 'cries': {'latest': 'https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/latest/1.ogg', 'legacy': 'https://raw.githubusercontent.com/PokeAPI/cries/main/cries/pokemon/legacy/1.ogg'}, 'forms': [{'name': 'bulbasaur', 'url': 'https://pokeapi.co/api/v2/pokemon-form/1/'}], 'game_indices': [{'game_index': 153, 'version': {'name': 'red', 'url': 'https://pokeapi.co/api/v2/version/1/'}}, {'game_index': 153, 'version': {'name': 'blue', 'url': 'https://pokeapi.co/api/v2/version/2/'}}, {'game_index': 153, 'version': {'name': 'yellow', 'url': 'https://pokeapi.co/api/v2/version/3/'}}, {'game_index': 1, 'version': {'name': 'gold', 'url': 'h

Unnamed: 0,base_experience,cries__latest,cries__legacy,height,id,is_default,location_area_encounters,name,order,species__name,...,sprites__versions__generation_v__black_white__back_shiny_female,sprites__versions__generation_v__black_white__front_female,sprites__versions__generation_v__black_white__front_shiny_female,sprites__versions__generation_vi__omegaruby_alphasapphire__front_female,sprites__versions__generation_vi__omegaruby_alphasapphire__front_shiny_female,sprites__versions__generation_vi__x_y__front_female,sprites__versions__generation_vi__x_y__front_shiny_female,sprites__versions__generation_vii__ultra_sun_ultra_moon__front_female,sprites__versions__generation_vii__ultra_sun_ultra_moon__front_shiny_female,sprites__versions__generation_viii__icons__front_female
0,64,https://raw.githubusercontent.com/PokeAPI/crie...,https://raw.githubusercontent.com/PokeAPI/crie...,7,1,True,https://pokeapi.co/api/v2/pokemon/1/encounters,bulbasaur,1,bulbasaur,...,,,,,,,,,,
1,62,https://raw.githubusercontent.com/PokeAPI/crie...,https://raw.githubusercontent.com/PokeAPI/crie...,6,4,True,https://pokeapi.co/api/v2/pokemon/4/encounters,charmander,5,charmander,...,,,,,,,,,,
2,112,https://raw.githubusercontent.com/PokeAPI/crie...,https://raw.githubusercontent.com/PokeAPI/crie...,4,25,True,https://pokeapi.co/api/v2/pokemon/25/encounters,pikachu,35,pikachu,...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...,https://raw.githubusercontent.com/PokeAPI/spri...


---
### Reduce the nesting level of generated tables
You can limit how deep dlt goes when generating nested tables and flattening dicts into columns. By default, the library will descend and generate nested tables for all nested lists, without limit.

You can set nesting level for all resources on the source level:

```python
@dlt.source(max_table_nesting=1)
def all_data():
    return my_df, get_genome_data, get_pokemon
```

or for each resource separately:

```python
@dlt.resource(table_name='pokemon_new', max_table_nesting=1)
def my_dict_list():
    yield data
```

In the example above, we want only 1 level of nested tables to be generated (so there are no nested tables of a nested table). Typical settings:

* `max_table_nesting=0` will not generate nested tables and will not flatten dicts into columns at all. All nested data will be represented as JSON.
* `max_table_nesting=1` will generate nested tables of root tables and nothing more. All nested data in nested tables will be represented as JSON.
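The flattening of dicts into columns (as in the `size__weight` and `size__height` columns of the `pokemon_new` table) can be sketched in plain Python. This is a simplified illustration and not `dlt`'s normalizer: `flatten_row` and its `flatten_dicts` flag are hypothetical names, the flag loosely imitates the `max_table_nesting=0` case, and nested lists are out of scope for the sketch.

```python
import json


def flatten_row(row, flatten_dicts=True, sep="__", _prefix=""):
    """Flatten nested dicts into `parent__child` columns (simplified model)."""
    flat = {}
    for key, value in row.items():
        column = f"{_prefix}{sep}{key}" if _prefix else key
        if isinstance(value, dict) and flatten_dicts:
            # Descend into the dict and prefix its keys with the parent column name
            flat.update(flatten_row(value, True, sep, column))
        elif isinstance(value, dict):
            # No flattening requested: keep the nested dict as JSON text
            flat[column] = json.dumps(value)
        else:
            # Scalars are copied as-is; lists are left untouched in this sketch
            flat[column] = value
    return flat


pikachu = {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}}

print(flatten_row(pikachu))
print(flatten_row(pikachu, flatten_dicts=False))
```

The first call produces one column per nested key, the second keeps `size` as a single JSON column.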

## 4.1. Exercise 1: Create a pipeline for GitHub API - repos endpoint

In this exercise, you'll build a dlt pipeline to fetch data from the GitHub REST API. The goal is to learn how to use `dlt.pipeline`, `dlt.resource`, and `dlt.source` to extract and load data into a destination.

## Instructions

1. **Explore the GitHub API**

  Visit the [GitHub REST API Docs](https://docs.github.com/en/rest) to understand the endpoint to [list public repositories](https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28) for an organization:

  GET https://api.github.com/orgs/{org}/repos

2. **Build the Pipeline**

  Write a script to:

  * Fetch repositories for the **dlt-hub** organization.
  * Use `dlt.resource` to define the data extraction logic.
  * Combine all resources into a single `@dlt.source`.
  * Load the data into a DuckDB database.

3. **Look at the data**

  Use a `duckdb` connection, `sql_client`, or `pipeline.dataset()`.

> **Note**: For this exercise you don't need to use Auth and Pagination.

In [30]:
pipeline = dlt.pipeline(
    pipeline_name = 'exercise_1',
    destination = 'duckdb',
    dataset_name = 'exercise_data'
)

@dlt.resource(table_name = 'dlt_git_repos')
def github_repos():
    org = 'dlt-hub'
    url = f'https://api.github.com/orgs/{org}/repos'
    response = requests.get(url)
    yield response.json()

@dlt.resource(table_name = 'dlt_git_events')
def github_events():
    org = 'dlt-hub'
    url = f'https://api.github.com/orgs/{org}/events'
    response = requests.get(url)
    yield response.json()


@dlt.source
def get_all_data():
    return github_repos, github_events

load_info = pipeline.run(get_all_data(), write_disposition='replace')
print(load_info)

pipeline.dataset(dataset_type='default').dlt_git_repos.df().head()

Pipeline exercise_1 load step completed in 0.71 seconds
1 load package(s) were loaded to destination duckdb and into dataset exercise_data
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\exercise_1.duckdb location to store data
Load package 1740237540.4036763 is LOADED and contains no failed jobs


Unnamed: 0,id,node_id,name,full_name,private,owner__login,owner__id,owner__node_id,owner__avatar_url,owner__gravatar_id,...,permissions__admin,permissions__maintain,permissions__push,permissions__triage,permissions__pull,_dlt_load_id,_dlt_id,description,homepage,license__url
0,427664318,R_kgDOGX2jvg,rasa_semantic_schema,dlt-hub/rasa_semantic_schema,False,dlt-hub,89419010,MDEyOk9yZ2FuaXphdGlvbjg5NDE5MDEw,https://avatars.githubusercontent.com/u/894190...,,...,False,False,False,False,True,1740237540.4036765,JNChFHfXsADzew,,,
1,438622757,R_kgDOGiTaJQ,.github,dlt-hub/.github,False,dlt-hub,89419010,MDEyOk9yZ2FuaXphdGlvbjg5NDE5MDEw,https://avatars.githubusercontent.com/u/894190...,,...,False,False,False,False,True,1740237540.4036765,GuL7HW0orrCSYA,,,
2,452221115,R_kgDOGvRYuw,dlt,dlt-hub/dlt,False,dlt-hub,89419010,MDEyOk9yZ2FuaXphdGlvbjg5NDE5MDEw,https://avatars.githubusercontent.com/u/894190...,,...,False,False,False,False,True,1740237540.4036765,dMMMwwoPhH1hzw,data load tool (dlt) is an open source Python ...,https://dlthub.com/docs,https://api.github.com/licenses/apache-2.0
3,462711174,R_kgDOG5Rphg,rasa_semantic_schema_customization,dlt-hub/rasa_semantic_schema_customization,False,dlt-hub,89419010,MDEyOk9yZ2FuaXphdGlvbjg5NDE5MDEw,https://avatars.githubusercontent.com/u/894190...,,...,False,False,False,False,True,1740237540.4036765,InSpqQhqyLgOmg,Template repository to customize and execute R...,,
4,465315657,R_kgDOG7wnSQ,metabase_data_api,dlt-hub/metabase_data_api,False,dlt-hub,89419010,MDEyOk9yZ2FuaXphdGlvbjg5NDE5MDEw,https://avatars.githubusercontent.com/u/894190...,,...,False,False,False,False,True,1740237540.4036765,JzsCRS9PrQejpw,Metabase data api python wrapper for notebooks...,,https://api.github.com/licenses/mit


In [31]:
print(f"The dataset has: {len(pipeline.dataset(dataset_type='default').dlt_git_repos.df().columns)} column(s)")

The dataset has: 106 column(s)


## 4.2. Exercise 2: Create a pipeline for GitHub API - stargazers endpoint

Create a `dlt.transformer` for the "stargazers" endpoint
https://api.github.com/repos/OWNER/REPO/stargazers for the `dlt-hub` organization.

Use the `github_repos` resource as the parent resource for the transformer:
1. Get all `dlt-hub` repositories.
2. Feed these repository names to the dlt transformer and get the stargazers of every `dlt-hub` repository.

In [32]:
@dlt.transformer(data_from=github_repos, table_name='repo_stargazers')
def github_stargazer(repos):
    for repo in repos:
        repo_name = repo['name']  # keep the loop variable intact
        url = f'https://api.github.com/repos/dlt-hub/{repo_name}/stargazers'
        response = requests.get(url)
        yield response.json()


load_info = pipeline.run(github_stargazer())
print(load_info)

pipeline.dataset(dataset_type='default').repo_stargazers.df().head()

        

Pipeline exercise_1 load step completed in 0.67 seconds
1 load package(s) were loaded to destination duckdb and into dataset exercise_data
The duckdb destination used duckdb:///c:\Users\HP\OneDrive\Desktop\Data Engg\dlt\exercise_1.duckdb location to store data
Load package 1740238367.2740192 is LOADED and contains no failed jobs


Unnamed: 0,login,id,node_id,avatar_url,gravatar_id,url,html_url,followers_url,following_url,gists_url,...,subscriptions_url,organizations_url,repos_url,events_url,received_events_url,type,user_view_type,site_admin,_dlt_load_id,_dlt_id
0,indam23,32034278,MDQ6VXNlcjMyMDM0Mjc4,https://avatars.githubusercontent.com/u/320342...,,https://api.github.com/users/indam23,https://github.com/indam23,https://api.github.com/users/indam23/followers,https://api.github.com/users/indam23/following...,https://api.github.com/users/indam23/gists{/gi...,...,https://api.github.com/users/indam23/subscript...,https://api.github.com/users/indam23/orgs,https://api.github.com/users/indam23/repos,https://api.github.com/users/indam23/events{/p...,https://api.github.com/users/indam23/received_...,User,public,False,1740238367.2740192,hO4mWTVtt+AM1g
1,Ai-Yukino,87879276,MDQ6VXNlcjg3ODc5Mjc2,https://avatars.githubusercontent.com/u/878792...,,https://api.github.com/users/Ai-Yukino,https://github.com/Ai-Yukino,https://api.github.com/users/Ai-Yukino/followers,https://api.github.com/users/Ai-Yukino/followi...,https://api.github.com/users/Ai-Yukino/gists{/...,...,https://api.github.com/users/Ai-Yukino/subscri...,https://api.github.com/users/Ai-Yukino/orgs,https://api.github.com/users/Ai-Yukino/repos,https://api.github.com/users/Ai-Yukino/events{...,https://api.github.com/users/Ai-Yukino/receive...,User,public,False,1740238367.2740192,8DK1/4A10eIKuQ
2,lalitpagaria,19303690,MDQ6VXNlcjE5MzAzNjkw,https://avatars.githubusercontent.com/u/193036...,,https://api.github.com/users/lalitpagaria,https://github.com/lalitpagaria,https://api.github.com/users/lalitpagaria/foll...,https://api.github.com/users/lalitpagaria/foll...,https://api.github.com/users/lalitpagaria/gist...,...,https://api.github.com/users/lalitpagaria/subs...,https://api.github.com/users/lalitpagaria/orgs,https://api.github.com/users/lalitpagaria/repos,https://api.github.com/users/lalitpagaria/even...,https://api.github.com/users/lalitpagaria/rece...,User,public,False,1740238367.2740192,vMycJPOUVDkUOg
3,nikitavoloboev,6391776,MDQ6VXNlcjYzOTE3NzY=,https://avatars.githubusercontent.com/u/639177...,,https://api.github.com/users/nikitavoloboev,https://github.com/nikitavoloboev,https://api.github.com/users/nikitavoloboev/fo...,https://api.github.com/users/nikitavoloboev/fo...,https://api.github.com/users/nikitavoloboev/gi...,...,https://api.github.com/users/nikitavoloboev/su...,https://api.github.com/users/nikitavoloboev/orgs,https://api.github.com/users/nikitavoloboev/repos,https://api.github.com/users/nikitavoloboev/ev...,https://api.github.com/users/nikitavoloboev/re...,User,public,False,1740238367.2740192,wIipclWaEoSp2w
4,gerrykou,13572514,MDQ6VXNlcjEzNTcyNTE0,https://avatars.githubusercontent.com/u/135725...,,https://api.github.com/users/gerrykou,https://github.com/gerrykou,https://api.github.com/users/gerrykou/followers,https://api.github.com/users/gerrykou/followin...,https://api.github.com/users/gerrykou/gists{/g...,...,https://api.github.com/users/gerrykou/subscrip...,https://api.github.com/users/gerrykou/orgs,https://api.github.com/users/gerrykou/repos,https://api.github.com/users/gerrykou/events{/...,https://api.github.com/users/gerrykou/received...,User,public,False,1740238367.2740192,xfvd8oTXsR2oug


In [41]:
df = pipeline.dataset(dataset_type='default').repo_stargazers.df()

df.count()

login                  148
id                     148
node_id                148
avatar_url             148
gravatar_id            148
url                    148
html_url               148
followers_url          148
following_url          148
gists_url              148
starred_url            148
subscriptions_url      148
organizations_url      148
repos_url              148
events_url             148
received_events_url    148
type                   148
user_view_type         148
site_admin             148
_dlt_load_id           148
_dlt_id                148
dtype: int64