## Data Engineering with Python & AI:
### Data Loading 

1. Extracting Data from APIs & Handling API Challenges - Work with REST APIs, authentication, rate limits, retries, and pagination to extract data efficiently.
2. Schema Management & Automatic Normalization - Use dlt to infer schemas, flatten nested JSON, extract lists into child tables, and handle schema evolution automatically.
3. Incremental Data Extraction & State Tracking - Load only new or modified records, avoiding unnecessary reprocessing and improving pipeline efficiency.
4. Loading Data Into Various Destinations - Store data in DuckDB, Postgres, BigQuery, Snowflake, or a data lake while ensuring efficient schema mapping and performance.
5. Automating and Orchestrating Pipelines - Deploy, schedule, and maintain ingestion workflows with Dagster, GitHub Actions, and cron jobs.
6. Scaling Data Pipelines Efficiently - Handle large-scale data ingestion while optimizing performance, retries, and parallel execution.

After this exercise, you won't just know ingestion: you'll be able to build a resilient, efficient API ingestion pipeline that `auto-detects schema changes`, `retries intelligently`, and `scales with demand`.

# Data Ingestion
Data ingestion starts with extracting data from a source. It often also includes normalizing and cleaning the data and adding metadata.

## Extracting Data: 
Data Streaming and Batching
* Batching - Processing data in chunks at scheduled intervals. It's suitable for scheduled tasks and reduces system load.
* Streaming - Processing data continuously as it arrives. It's ideal for real-time data processing and immediate insights.

Choosing the right approach depends on factors like `data volume`, `latency requirements` and `system architecture`.
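
To make the difference concrete, here is a minimal sketch in plain Python. The `batches` and `stream` helpers and the sensor readings are hypothetical names invented for illustration; real pipelines would read from a database, file, or message broker rather than an in-memory list.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Batching: group records into fixed-size chunks before processing."""
    iterator = iter(records)
    # islice pulls at most `batch_size` items; an empty list ends the loop.
    while chunk := list(islice(iterator, batch_size)):
        yield chunk


def stream(records: Iterable[dict]) -> Iterator[dict]:
    """Streaming: hand over each record as soon as it is available."""
    for record in records:
        yield record


# Hypothetical sensor readings, used only for illustration.
readings = [{"sensor": "s1", "temp": 20 + i} for i in range(5)]

batch_sizes = [len(chunk) for chunk in batches(readings, batch_size=2)]
streamed_temps = [record["temp"] for record in stream(readings)]

print(batch_sizes)     # [2, 2, 1]
print(streamed_temps)  # [20, 21, 22, 23, 24]
```

The batch version touches downstream systems once per chunk, while the streaming version reacts to every record individually, which is where the latency and cost trade-offs come from.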


1. Batch processing
Batch processing is best when you can wait for data to accumulate before processing it in large chunks. It is `cost-efficient` and works well for `non-time-sensitive` workloads.
- Common use cases
    * Nightly database updates.
    * Generating daily or weekly reports.

2. Streaming data processing
Streaming is useful when you need to `process data in real-time` or `with minimal delay`. Instead of waiting for a batch, events are processed continuously.
- Common use cases
    * Fraud detection (e.g. analyzing transactions in real-time)
    * IoT device monitoring (e.g. temperature sensors)
    * Event-driven applications (e.g. user activity tracking)
    * Log and telemetry data ingestion

3. When to use Batch vs Streaming

|Factor      |        Batch processing      |              Streaming processing        |
|------------|------------------------------|------------------------------------------|
|Latency     |   High (minutes, hours)      |   Low (milliseconds, seconds)            |
|Data volume |   Large batches              |   Continuous small events                |
|Use case    |   Reports, ETL, backups      |   Real-time analytics, event-driven apps |
|Complexity  |   Easier to manage           |   Requires event-driven architecture     |
|Cost        |   Lower for periodic runs    |   Higher for always-on processing        |

4. Tools
Many tools support both `batch` and `streaming` data extraction. Some tools are optimized for one approach, while others provide flexibility for both.

`Message queues & Event streaming`
These tools enable real-time data ingestion and processing but can also buffer data for mini-batch processing.

  * `Apache Kafka` - Distributed event streaming platform for real-time and batch workloads.
  * `RabbitMQ` - Message broker that supports real-time message passing.
  * `AWS Kinesis` - Cloud-native alternative to Kafka for real-time ingestion.
  * `Google Pub/Sub` - Managed messaging service for real-time and batch workloads.
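
The idea of buffering events for mini-batch processing can be sketched with Python's standard-library `queue.Queue` standing in for a real broker. This is only an illustration of the pattern: `drain_in_mini_batches` is a hypothetical helper, not part of the Kafka, RabbitMQ, Kinesis, or Pub/Sub client APIs.

```python
import queue


def drain_in_mini_batches(events: "queue.Queue[dict]", max_batch: int) -> list[list[dict]]:
    """Pull buffered events off a queue and group them into mini-batches."""
    mini_batches: list[list[dict]] = []
    current: list[dict] = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break  # buffer is drained
        current.append(event)
        if len(current) == max_batch:
            mini_batches.append(current)
            current = []
    if current:
        # Flush the final, partially filled batch.
        mini_batches.append(current)
    return mini_batches


# An in-memory queue plays the role of the broker's buffer (assumption).
buffer: "queue.Queue[dict]" = queue.Queue()
for i in range(7):
    buffer.put({"event_id": i})

result = drain_in_mini_batches(buffer, max_batch=3)
print([len(batch) for batch in result])  # [3, 3, 1]
```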

`ETL & ELT Pipelines`
These tools handle extraction, transformation and loading (ETL) for both batch and streaming pipelines.
  
  * `Apache Spark` - Supports batch processing and structured streaming.
  * `dbt (Data Build Tool)` - Focuses on batch transformations but can be used with streaming inputs.
  * `Flink` - Real-time stream processing but can also handle mini-batch workloads.
  * `NiFi` - A data flow tool for moving and transforming data in real time or batch.
  * `AWS Glue` - Serverless ETL service for batch workloads, with limited streaming support.
  * `Google Cloud Dataflow` - Managed ETL platform supporting both batch and streaming.
  * `dlt` - Automates API extraction, incremental ingestion, and schema evolution for both batch and streaming pipelines. 

## Working with REST APIs

#### APIs as a data source: Batch vs. Streaming approaches

APIs are a major source of data ingestion. Depending on how APIs provide data, they can be used in both `batch` and `streaming` workflows.

1. `APIs for batch extraction`
Some APIs return large datasets at once. This data is often fetched on a schedule or as part of an ETL process.

**Common batch API examples:**
* **CRM APIs (Salesforce, HubSpot)** - Export customer data daily.
* **E-commerce APIs (Shopify, Amazon)** - Download product catalogs or sales reports periodically.
* **Public APIs (Weather, Financial Data)** - Retrieve daily stock market updates.

**How batch API extraction works:**
1. Call an API at **scheduled intervals** (e.g. every hour or day).
2. Retrieve all available data (e.g. last 24 hours of records).
3. Store results in a database, data warehouse, or file storage.
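
The three steps above can be sketched end to end without touching the network. In this sketch `fetch_records` is a hypothetical stand-in that returns canned records instead of calling a real API, and an in-memory SQLite database plays the role of the destination; the table name and record fields are assumptions as well.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone


def fetch_records(since: datetime, until: datetime) -> list[dict]:
    """Hypothetical stand-in for an API call; returns canned records."""
    return [
        {"id": 1, "created_at": since.isoformat(), "amount": 10.5},
        {"id": 2, "created_at": until.isoformat(), "amount": 7.0},
    ]


def run_batch(conn: sqlite3.Connection, now: datetime) -> int:
    """One scheduled run: fetch the last 24 hours and store the results."""
    since = now - timedelta(hours=24)
    records = fetch_records(since, now)

    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    # INSERT OR REPLACE keeps reruns from duplicating rows with the same id.
    conn.executemany(
        "INSERT OR REPLACE INTO reports (id, payload) VALUES (?, ?)",
        [(record["id"], json.dumps(record)) for record in records],
    )
    conn.commit()
    return len(records)


conn = sqlite3.connect(":memory:")
loaded = run_batch(conn, now=datetime.now(timezone.utc))
row_count = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
print(loaded, row_count)  # 2 2
```

In a real deployment the scheduling itself (step 1) would come from cron or an orchestrator that invokes `run_batch` at the chosen interval.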

In [4]:
#%pip install requests

In [None]:
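import json

import requests


def fetch_batch_data():
    """Download the daily report and save it to a local JSON file."""
    url = "https://api.example.com/daily_reports"  # placeholder endpoint
    response = requests.get(url, timeout=30)  # avoid hanging on a stalled connection
    response.raise_for_status()  # stop on 4xx/5xx instead of saving an error body
    data = response.json()

    with open("daily_report.json", "w") as file:
        json.dump(data, file)


fetch_batch_data()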
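
A single request like the one above fails outright on a temporary network error. One common way to soften that is to retry with exponential backoff. The following is a minimal sketch, not a definitive implementation: `call_with_retries`, its parameters, and the `flaky_request` demo are hypothetical names, and the request is passed in as a callable so the example runs without network access.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying with exponential backoff on selected errors."""
    attempt = 1
    while True:
        try:
            return request()
        except retry_on:
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1


# Demo with a fake request that fails twice, then succeeds (assumption).
attempts = {"count": 0}
delays: list[float] = []


def flaky_request() -> dict:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"status": "ok"}


result = call_with_retries(flaky_request, sleep=delays.append)
print(result, delays)  # {'status': 'ok'} [0.5, 1.0]
```

With `requests`, the callable could be `lambda: requests.get(url, timeout=30)` combined with `retry_on=(requests.exceptions.RequestException,)`, since `requests` raises its own exception types rather than the built-in `ConnectionError`.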

In [6]:
import requests


url = "https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp/events"

response = requests.get(url)
response.json()

[{'id': '5366579318',
  'type': 'WatchEvent',
  'actor': {'id': 244678782,
   'login': 'minuscati',
   'display_login': 'minuscati',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/minuscati',
   'avatar_url': 'https://avatars.githubusercontent.com/u/244678782?'},
  'repo': {'id': 419661684,
   'name': 'DataTalksClub/data-engineering-zoomcamp',
   'url': 'https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp'},
  'payload': {'action': 'started'},
  'public': True,
  'created_at': '2025-12-17T17:20:25Z',
  'org': {'id': 72699292,
   'login': 'DataTalksClub',
   'gravatar_id': '',
   'url': 'https://api.github.com/orgs/DataTalksClub',
   'avatar_url': 'https://avatars.githubusercontent.com/u/72699292?'}},
 {'id': '5366401983',
  'type': 'WatchEvent',
  'actor': {'id': 241587839,
   'login': 'yenminh0903',
   'display_login': 'yenminh0903',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/yenminh0903',
   'avatar_url': 'https://avatars.githubuserc