# Module 02: Data Sources and Extraction

**Estimated Time:** 45-60 minutes

## Learning Objectives

By the end of this module, you will:
- Extract data from various file formats (CSV, JSON, Excel, Parquet)
- Connect to and query databases using SQLAlchemy
- Make API requests to extract data from REST APIs
- Implement error handling and retry logic
- Understand best practices for data extraction

---

## 1. Types of Data Sources

Data engineers work with various data sources:

### Common Data Sources

1. **Files**
   - CSV, TSV (Comma/Tab Separated Values)
   - JSON (JavaScript Object Notation)
   - Parquet (Columnar format)
   - Excel (.xlsx, .xls)
   - XML
   - Log files

2. **Databases**
   - Relational (PostgreSQL, MySQL, SQL Server)
   - NoSQL (MongoDB, Cassandra, DynamoDB)
   - Data Warehouses (Snowflake, BigQuery, Redshift)

3. **APIs**
   - REST APIs (most common)
   - GraphQL
   - SOAP (legacy)

4. **Streaming**
   - Kafka, Kinesis, Pub/Sub
   - WebSockets
   - Message queues (RabbitMQ, SQS)

5. **Cloud Storage**
   - S3, Google Cloud Storage, Azure Blob

In this module, we'll focus on files, databases, and APIs.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import json
from datetime import datetime
import requests
from sqlalchemy import create_engine, text
import time
import os

print("[OK] Libraries imported successfully")

---

## 2. Extracting Data from Files

### 2.1 CSV Files

In [None]:
# First, let's create some sample CSV data
sample_csv_data = """user_id,name,email,signup_date,country
1,Alice Smith,alice@example.com,2024-01-15,USA
2,Bob Jones,bob@example.com,2024-01-16,UK
3,Carol Davis,carol@example.com,2024-01-17,Canada
4,David Wilson,david@example.com,2024-01-18,Australia
5,Eve Martinez,eve@example.com,2024-01-19,Spain"""

# Save to file (create the target directory first if it does not exist)
os.makedirs("../data/raw", exist_ok=True)
with open("../data/raw/users.csv", "w") as f:
    f.write(sample_csv_data)

print("[OK] Sample CSV file created")

In [None]:
# Extract data from CSV
def extract_csv(file_path):
    """
    Extract data from a CSV file
    """
    try:
        df = pd.read_csv(file_path)
        print(f"[OK] Successfully read {len(df)} records from {file_path}")
        print(f"   Columns: {list(df.columns)}")
        return df
    except FileNotFoundError:
        print(f"[FAIL] File not found: {file_path}")
        return None
    except Exception as e:
        print(f"[FAIL] Error reading CSV: {e}")
        return None


# Extract
users_df = extract_csv("../data/raw/users.csv")
users_df.head()

In [None]:
# Advanced CSV reading options
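def extract_csv_advanced(file_path, **kwargs):
    """
    Advanced CSV extraction. Defaults can be overridden through kwargs:
    - parse_dates: Automatically parse date columns
    - dtype: Specify data types
    - usecols: Only read specific columns

    Note: chunksize makes pd.read_csv return an iterator of DataFrames
    instead of a single DataFrame, so this helper does not accept it.
    """
    if "chunksize" in kwargs:
        raise ValueError("chunksize returns an iterator; iterate over pd.read_csv directly")

    # setdefault avoids a "multiple values for keyword argument" TypeError
    # when the caller passes parse_dates or dtype explicitly
    kwargs.setdefault("parse_dates", ["signup_date"])  # Parse date columns
    kwargs.setdefault("dtype", {"user_id": int})  # Specify data types

    df = pd.read_csv(file_path, **kwargs)
    print(f"[OK] Read {len(df)} records with advanced options")
    print(f"   Data types: {df.dtypes.to_dict()}")
    return df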


users_df_advanced = extract_csv_advanced("../data/raw/users.csv")
print("\nData preview:")
users_df_advanced.head(3)
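
Passing `chunksize` changes what `pd.read_csv` returns: instead of a single DataFrame you get an iterator of DataFrames, each holding at most `chunksize` rows. The sketch below aggregates a file chunk by chunk so the whole file never has to fit in memory. The helper name `count_rows_by_country` and the in-memory sample data are illustrative assumptions, not part of the module's files.

```python
import io

import pandas as pd

# Illustrative in-memory CSV so the sketch runs without touching ../data/raw
csv_text = """user_id,name,country
1,Alice,USA
2,Bob,UK
3,Carol,USA
4,David,Canada
5,Eve,USA
"""


def count_rows_by_country(source, chunksize=2):
    """Aggregate a CSV chunk by chunk (hypothetical helper)."""
    totals = {}
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    for chunk in pd.read_csv(source, chunksize=chunksize):
        for country, count in chunk["country"].value_counts().items():
            totals[country] = totals.get(country, 0) + int(count)
    return totals


country_counts = count_rows_by_country(io.StringIO(csv_text))
print(country_counts)
```

For a real file, pass the path instead of the `StringIO` object and pick a chunk size that fits comfortably in memory.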

### 2.2 JSON Files

In [None]:
# Create sample JSON data
sample_json_data = [
    {
        "product_id": "P001",
        "name": "Laptop",
        "price": 999.99,
        "category": "Electronics",
        "in_stock": True,
    },
    {
        "product_id": "P002",
        "name": "Mouse",
        "price": 29.99,
        "category": "Electronics",
        "in_stock": True,
    },
    {
        "product_id": "P003",
        "name": "Keyboard",
        "price": 79.99,
        "category": "Electronics",
        "in_stock": False,
    },
    {
        "product_id": "P004",
        "name": "Monitor",
        "price": 299.99,
        "category": "Electronics",
        "in_stock": True,
    },
]

with open("../data/raw/products.json", "w") as f:
    json.dump(sample_json_data, f, indent=2)

print("[OK] Sample JSON file created")

In [None]:
# Extract JSON data
def extract_json(file_path):
    """
    Extract data from a JSON file
    """
    try:
        # Method 1: Using pandas
        df = pd.read_json(file_path)
        print(f"[OK] Successfully read {len(df)} records from JSON")
        return df
    except Exception as e:
        print(f"[FAIL] Error reading JSON: {e}")
        return None


products_df = extract_json("../data/raw/products.json")
products_df

In [None]:
# For nested JSON (common with APIs)
nested_json_data = {
    "metadata": {"timestamp": "2024-01-20T10:00:00Z", "source": "sales_api"},
    "data": [
        {"order_id": 1, "customer": {"id": 101, "name": "Alice"}, "total": 150.00},
        {"order_id": 2, "customer": {"id": 102, "name": "Bob"}, "total": 200.00},
    ],
}

with open("../data/raw/orders.json", "w") as f:
    json.dump(nested_json_data, f, indent=2)


# Extract nested JSON
def extract_nested_json(file_path):
    """
    Extract nested JSON and flatten it
    """
    with open(file_path, "r") as f:
        data = json.load(f)

    # Extract the data array and flatten nested structures
    df = pd.json_normalize(data["data"])
    print(f"[OK] Extracted and flattened {len(df)} records")
    return df


orders_df = extract_nested_json("../data/raw/orders.json")
orders_df

### 2.3 Excel Files

In [None]:
# Create sample Excel file
sales_data = {
    "date": pd.date_range("2024-01-01", periods=10),
    "product": ["A", "B", "C", "A", "B"] * 2,
    "quantity": np.random.randint(1, 50, 10),
    "revenue": np.random.uniform(100, 1000, 10).round(2),
}

sales_df = pd.DataFrame(sales_data)
# Reading and writing .xlsx files requires the openpyxl package
sales_df.to_excel("../data/raw/sales.xlsx", sheet_name="Sales", index=False)

print("[OK] Sample Excel file created")

In [None]:
# Extract from Excel
def extract_excel(file_path, sheet_name=0):
    """
    Extract data from Excel file

    sheet_name: can be sheet name (str) or index (int)
    """
    try:
        df = pd.read_excel(file_path, sheet_name=sheet_name)
        print(f"[OK] Successfully read {len(df)} records from Excel")
        return df
    except Exception as e:
        print(f"[FAIL] Error reading Excel: {e}")
        return None


sales_df_extracted = extract_excel("../data/raw/sales.xlsx", sheet_name="Sales")
sales_df_extracted.head()

### 2.4 Parquet Files (Columnar Format)

In [None]:
# Create sample Parquet file
large_data = {
    "id": range(1000),
    "value": np.random.randn(1000),
    "category": np.random.choice(["A", "B", "C"], 1000),
}

large_df = pd.DataFrame(large_data)
# Parquet support requires the pyarrow (or fastparquet) package
large_df.to_parquet("../data/raw/large_dataset.parquet", compression="snappy")

print("[OK] Sample Parquet file created")
print(f"   Records: {len(large_df):,}")

In [None]:
# Extract from Parquet
def extract_parquet(file_path, columns=None):
    """
    Extract data from Parquet file

    Parquet advantages:
    - Columnar format (fast for analytical queries)
    - Built-in compression
    - Can read specific columns only
    """
    try:
        df = pd.read_parquet(file_path, columns=columns)
        print(f"[OK] Successfully read {len(df):,} records from Parquet")
        return df
    except Exception as e:
        print(f"[FAIL] Error reading Parquet: {e}")
        return None


# Read all columns
parquet_df = extract_parquet("../data/raw/large_dataset.parquet")

# Read specific columns only (more efficient)
parquet_df_subset = extract_parquet("../data/raw/large_dataset.parquet", columns=["id", "category"])

print("\nFull data shape:", parquet_df.shape)
print("Subset data shape:", parquet_df_subset.shape)

---

## 3. Extracting Data from Databases

We'll use SQLite for this example (no server needed), but the same principles apply to PostgreSQL, MySQL, etc.

In [None]:
# Create a sample SQLite database
from sqlalchemy import create_engine, text

# Create engine
db_path = "../data/raw/sample_db.sqlite"
engine = create_engine(f"sqlite:///{db_path}")

# Create sample table
customers_data = {
    "customer_id": range(1, 11),
    "name": [f"Customer {i}" for i in range(1, 11)],
    "email": [f"customer{i}@example.com" for i in range(1, 11)],
    "country": np.random.choice(["USA", "UK", "Canada", "Australia"], 10),
    "lifetime_value": np.random.uniform(100, 10000, 10).round(2),
}

customers_df = pd.DataFrame(customers_data)
customers_df.to_sql("customers", engine, if_exists="replace", index=False)

print("[OK] Sample database created with 'customers' table")

In [None]:
# Extract from database using SQL query
def extract_from_database(engine, query):
    """
    Extract data from database using SQL query
    """
    try:
        df = pd.read_sql(query, engine)
        print(f"[OK] Successfully extracted {len(df)} records from database")
        return df
    except Exception as e:
        print(f"[FAIL] Database extraction error: {e}")
        return None


# Simple query
query1 = "SELECT * FROM customers"
result1 = extract_from_database(engine, query1)
result1.head()

In [None]:
# More complex query with filtering and aggregation
query2 = """
SELECT 
    country,
    COUNT(*) as customer_count,
    AVG(lifetime_value) as avg_lifetime_value,
    MAX(lifetime_value) as max_lifetime_value
FROM customers
GROUP BY country
ORDER BY avg_lifetime_value DESC
"""

result2 = extract_from_database(engine, query2)
result2

In [None]:
# Extract with parameters (prevents SQL injection)
def extract_with_parameters(engine, query, params):
    """
    Extract data using parameterized queries (secure)
    """
    try:
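        # text() lets SQLAlchemy translate :name placeholders for any backend,
        # not only drivers (like sqlite3) that understand them natively
        df = pd.read_sql(text(query), engine, params=params)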
        print(f"[OK] Extracted {len(df)} records using parameterized query")
        return df
    except Exception as e:
        print(f"[FAIL] Error: {e}")
        return None


# Secure parameterized query
query = "SELECT * FROM customers WHERE country = :country AND lifetime_value > :min_value"
params = {"country": "USA", "min_value": 1000}

filtered_customers = extract_with_parameters(engine, query, params)
filtered_customers

---

## 4. Extracting Data from APIs

### 4.1 Simple GET Request

In [None]:
# Extract data from a public API
def extract_from_api(url, params=None):
    """
    Extract data from a REST API
    """
    try:
        # A timeout keeps the call from hanging forever on an unresponsive server
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()  # Raise exception for bad status codes

        data = response.json()
        print("[OK] Successfully fetched data from API")
        print(f"   Status code: {response.status_code}")

        return data
    except requests.exceptions.RequestException as e:
        print(f"[FAIL] API request error: {e}")
        return None


# Example: JSONPlaceholder (fake API for testing)
api_url = "https://jsonplaceholder.typicode.com/posts"
posts_data = extract_from_api(api_url)

if posts_data:
    # Convert to DataFrame
    posts_df = pd.DataFrame(posts_data)
    print(f"\nExtracted {len(posts_df)} posts")
    # A bare expression inside an if block is not rendered, so print explicitly
    print(posts_df.head(3))

### 4.2 API with Authentication and Headers

In [None]:
# Extract from API with authentication
def extract_from_api_with_auth(url, api_key=None, headers=None, params=None):
    """
    Extract data from API with authentication
    """
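    # Build headers on a copy so the caller's dict is not mutated
    headers = dict(headers or {})

    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    # Accept tells the server which response format we want; Content-Type
    # describes a request body, which a GET request does not have
    headers.setdefault("Accept", "application/json")

    try:
        response = requests.get(url, headers=headers, params=params, timeout=10)
        response.raise_for_status()

        print("[OK] API call successful")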
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"[FAIL] API error: {e}")
        return None


# Example (without real API key)
# api_data = extract_from_api_with_auth(
#     url="https://api.example.com/data",
#     api_key="your-api-key-here",
#     params={'limit': 100}
# )

### 4.3 Paginated API Requests

In [None]:
# Extract data from paginated API
def extract_paginated_api(base_url, max_pages=5):
    """
    Extract data from paginated API endpoints
    """
    all_data = []

    for page in range(1, max_pages + 1):
        # Let requests build the query string instead of formatting it by hand
        params = {"_page": page, "_limit": 10}

        try:
            response = requests.get(base_url, params=params, timeout=10)
            response.raise_for_status()

            data = response.json()

            if not data:  # No more data
                break

            all_data.extend(data)
            print(f"  Page {page}: Fetched {len(data)} records")

            # Be nice to the API - add delay
            time.sleep(0.1)

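        # Network/HTTP errors, plus ValueError for a body that is not valid JSON
        except (requests.exceptions.RequestException, ValueError) as e: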
            print(f"[FAIL] Error on page {page}: {e}")
            break

    print(f"\n[OK] Total records fetched: {len(all_data)}")
    return all_data


# Fetch paginated data
api_url = "https://jsonplaceholder.typicode.com/posts"
all_posts = extract_paginated_api(api_url, max_pages=3)

posts_df = pd.DataFrame(all_posts)
posts_df.head()

---

## 5. Error Handling and Retry Logic

Production data pipelines need robust error handling.

In [None]:
# Implement retry logic with exponential backoff
import time
from functools import wraps


def retry_with_backoff(max_retries=3, initial_delay=1, backoff_factor=2, exceptions=(Exception,)):
    """
    Decorator to retry a function with exponential backoff

    exceptions: tuple of exception types that trigger a retry;
    anything else propagates immediately
    """

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay

            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_retries - 1:
                        print(f"[FAIL] Failed after {max_retries} attempts: {e}")
                        raise

                    print(f"[WARNING] Attempt {attempt + 1} failed: {e}")
                    print(f"   Retrying in {delay} seconds...")
                    time.sleep(delay)
                    delay *= backoff_factor

        return wrapper

    return decorator


# Retry only on network/HTTP errors raised by requests
@retry_with_backoff(max_retries=3, initial_delay=1, exceptions=(requests.exceptions.RequestException,))
def extract_with_retry(url):
    """
    Extract data with automatic retry
    """
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()


# Test with a valid URL
try:
    data = extract_with_retry("https://jsonplaceholder.typicode.com/users/1")
    print("[OK] Successfully extracted data with retry logic")
    print(json.dumps(data, indent=2)[:200], "...")
except Exception as e:
    print(f"Failed: {e}")

---

## 6. Best Practices for Data Extraction

### 1. Always Use Connection Pooling
- Reuse database connections
- Don't create new connections for each query
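
As a minimal sketch of connection reuse with SQLAlchemy: create the engine once, let it own the pool, and borrow a connection per query. The pool sizes, the temporary SQLite file, and the helper `run_scalar` are illustrative assumptions; suitable pool settings depend on your database and workload.

```python
import os
import tempfile

from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

# Temporary SQLite file so the sketch is self-contained (path is illustrative)
db_file = os.path.join(tempfile.mkdtemp(), "pool_demo.sqlite")

# Create the engine once; it owns the pool. Pool sizes are example values.
pooled_engine = create_engine(
    f"sqlite:///{db_file}",
    poolclass=QueuePool,
    pool_size=5,
    max_overflow=2,
    pool_pre_ping=True,  # check that a pooled connection is alive before use
)

with pooled_engine.begin() as conn:
    conn.execute(text("CREATE TABLE customers (customer_id INTEGER, country TEXT)"))
    conn.execute(
        text("INSERT INTO customers VALUES (:id, :country)"),
        [{"id": 1, "country": "USA"}, {"id": 2, "country": "UK"}],
    )


def run_scalar(engine, sql, params=None):
    """Borrow a pooled connection, run one query, and return it to the pool."""
    with engine.connect() as conn:
        return conn.execute(text(sql), params or {}).scalar()


total = run_scalar(pooled_engine, "SELECT COUNT(*) FROM customers")
usa = run_scalar(pooled_engine, "SELECT COUNT(*) FROM customers WHERE country = :c", {"c": "USA"})
print(total, usa)

pooled_engine.dispose()  # close pooled connections when the pipeline finishes
```

The pattern to avoid is calling `create_engine` inside every extraction function, since each new engine starts with an empty pool.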

### 2. Implement Proper Error Handling
- Catch specific exceptions
- Log errors with enough context to debug them (source, parameters, stack trace)
- Use retry logic for transient failures
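
One way to apply these points to file extraction is to handle each known failure separately and let anything unexpected propagate. This is a sketch only; the helper name `read_csv_or_none` and the logger name are assumptions made for illustration.

```python
import logging
import os
import tempfile

import pandas as pd

logger = logging.getLogger("extraction")


def read_csv_or_none(file_path):
    """Read a CSV, handling each known failure separately (hypothetical helper)."""
    try:
        return pd.read_csv(file_path)
    except FileNotFoundError:
        logger.error("File not found: %s", file_path)
    except pd.errors.EmptyDataError:
        logger.error("File is empty: %s", file_path)
    except pd.errors.ParserError as exc:
        logger.error("Malformed CSV %s: %s", file_path, exc)
    # Any other exception type is unexpected and is allowed to propagate
    return None


# Illustrative files in a temporary directory
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good.csv")
empty_path = os.path.join(tmp_dir, "empty.csv")

with open(good_path, "w") as f:
    f.write("user_id,name\n1,Alice\n")
open(empty_path, "w").close()

good_df = read_csv_or_none(good_path)
empty_result = read_csv_or_none(empty_path)
missing_result = read_csv_or_none(os.path.join(tmp_dir, "missing.csv"))
```

Each branch produces a distinct log message, which makes it easier to tell a missing file apart from a corrupt one when reading pipeline logs.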

### 3. Be Mindful of API Rate Limits
- Add delays between requests
- Implement exponential backoff
- Cache responses when appropriate
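
A fixed `time.sleep` after every request is the simplest delay, but it waits even when the previous call was already slow. The sketch below enforces a minimum interval between calls instead. The `Throttle` class is a hypothetical helper; the `clock` and `sleep` parameters exist only so the behaviour can be tested without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (illustrative helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to keep calls `min_interval` seconds apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for page in range(1, 4):
    throttle.wait()  # a real pipeline would make its API request after this line
elapsed = time.monotonic() - start
print(f"3 throttled calls took {elapsed:.2f}s")
```

The first call goes through immediately; only the second and third wait. If an API documents its rate limit, derive `min_interval` from that rather than guessing.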

### 4. Extract Incrementally When Possible
- Use timestamps or IDs to track what's been extracted
- Don't re-extract all data every time
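
Incremental extraction usually relies on a "watermark": the highest ID (or latest timestamp) seen so far. The following sketch uses an in-memory SQLite table; the `orders` table, its columns, and the function `extract_new_orders` are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# In-memory source table standing in for a real database (illustrative data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 150.0), (2, 200.0), (3, 75.5)])
conn.commit()


def extract_new_orders(connection, last_seen_id):
    """Return rows with order_id above the watermark, plus the new watermark."""
    df = pd.read_sql_query(
        "SELECT order_id, total FROM orders WHERE order_id > ? ORDER BY order_id",
        connection,
        params=(last_seen_id,),
    )
    new_watermark = int(df["order_id"].max()) if not df.empty else last_seen_id
    return df, new_watermark


# First run: nothing extracted yet, so the watermark starts at 0
first_batch, watermark = extract_new_orders(conn, last_seen_id=0)

# New rows arrive in the source, and the second run picks up only those
conn.execute("INSERT INTO orders VALUES (4, 20.0)")
conn.commit()
second_batch, watermark = extract_new_orders(conn, watermark)
print(len(first_batch), len(second_batch), watermark)
```

In a real pipeline the watermark would be persisted between runs (for example in a small state file or a metadata table) and updated only after the extracted batch has been stored successfully.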

### 5. Validate Data Early
- Check for expected columns
- Verify data types
- Count records
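
These three checks can be combined into one small function that runs right after extraction. This is a sketch under assumptions: `validate_dataframe` is a hypothetical helper, and it compares NumPy dtype "kind" codes rather than exact dtypes so that `int32` and `int64` both count as integers.

```python
import pandas as pd


def validate_dataframe(df, required_columns, expected_dtypes=None, min_records=1):
    """Return a list of validation problems; an empty list means the checks passed."""
    problems = []

    # 1. Expected columns
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # 2. Data types, as dtype kind codes: "i" integer, "f" float, "M" datetime
    for col, kind in (expected_dtypes or {}).items():
        if col in df.columns and df[col].dtype.kind != kind:
            problems.append(f"{col}: expected kind '{kind}', got '{df[col].dtype.kind}'")

    # 3. Record count
    if len(df) < min_records:
        problems.append(f"expected at least {min_records} records, got {len(df)}")

    return problems


users = pd.DataFrame({"user_id": [1, 2], "email": ["a@example.com", "b@example.com"]})

ok = validate_dataframe(users, ["user_id", "email"], {"user_id": "i"})
bad = validate_dataframe(users, ["user_id", "country"], {"user_id": "f"}, min_records=5)

print("passing frame:", ok)
for problem in bad:
    print("problem:", problem)
```

Returning a list instead of raising on the first failure lets the caller log every problem at once and then decide whether to stop the pipeline.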

### 6. Use Appropriate File Formats
- CSV: Simple, human-readable
- Parquet: Large datasets, analytical workloads
- JSON: Nested/hierarchical data
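
One practical consequence of choosing CSV is that it stores only text, so column types have to be inferred or declared again on every read. A small round-trip sketch with illustrative data shows the effect on a date column:

```python
import io

import pandas as pd

original = pd.DataFrame(
    {
        "user_id": [1, 2],
        "signup_date": pd.to_datetime(["2024-01-15", "2024-01-16"]),
    }
)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Reading back without hints: the date column is no longer a datetime column
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading back with parse_dates restores the datetime type
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["signup_date"])

print("without parse_dates:", plain["signup_date"].dtype)
print("with parse_dates:   ", parsed["signup_date"].dtype)
```

Parquet stores the schema alongside the data, which is one reason it suits larger analytical datasets.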

### 7. Monitor and Log
- Track extraction times
- Log record counts
- Alert on failures
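
A context manager is one convenient way to capture timing, record counts, and failures in a single place. The sketch below assumes a hypothetical helper `track_extraction` and a plain dict as the statistics store; a real deployment would more likely forward these numbers to a monitoring system.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger("extraction.monitor")


@contextmanager
def track_extraction(name, stats):
    """Time a block of extraction code and record the outcome in `stats`."""
    entry = {"status": "running", "records": 0}
    stats[name] = entry
    start = time.perf_counter()
    try:
        yield entry
        entry["status"] = "ok"
    except Exception:
        entry["status"] = "failed"
        logger.exception("Extraction '%s' failed", name)  # includes the stack trace
        raise
    finally:
        entry["duration_seconds"] = time.perf_counter() - start
        logger.info(
            "%s: %s, %d records, %.3fs",
            name,
            entry["status"],
            entry["records"],
            entry["duration_seconds"],
        )


run_stats = {}
with track_extraction("users_csv", run_stats) as entry:
    rows = [{"user_id": 1}, {"user_id": 2}]  # stand-in for a real extraction
    entry["records"] = len(rows)

print(run_stats)
```

Because the failure branch re-raises, an alerting hook can be attached either inside the `except` block or around the `with` statement.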

In [None]:
# Complete extraction example with best practices
import logging
from datetime import datetime

# Setup logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


class DataExtractor:
    """
    Data extractor that applies the practices above: logging, timing,
    and per-source statistics (latest run per source type)
    """

    def __init__(self):
        self.extraction_stats = {}

    def extract(self, source_type, **kwargs):
        """
        Extract data from various sources
        """
        start_time = datetime.now()
        logger.info(f"Starting extraction from {source_type}")

        try:
            if source_type == "csv":
                df = pd.read_csv(kwargs["file_path"])
            elif source_type == "json":
                df = pd.read_json(kwargs["file_path"])
            elif source_type == "database":
                df = pd.read_sql(kwargs["query"], kwargs["engine"])
            elif source_type == "api":
                # Timeout so a stalled API cannot block the whole pipeline
                response = requests.get(kwargs["url"], timeout=10)
                response.raise_for_status()
                df = pd.DataFrame(response.json())
            else:
                raise ValueError(f"Unsupported source type: {source_type}")

            # Calculate stats
            duration = (datetime.now() - start_time).total_seconds()
            record_count = len(df)

            self.extraction_stats[source_type] = {
                "records": record_count,
                "duration_seconds": duration,
                "timestamp": datetime.now().isoformat(),
            }

            logger.info(f"[OK] Extracted {record_count:,} records in {duration:.2f}s")
            return df

        except Exception as e:
            logger.error(f"[FAIL] Extraction failed: {e}")
            raise


# Use the extractor
extractor = DataExtractor()

# Extract from CSV
users = extractor.extract("csv", file_path="../data/raw/users.csv")

# Extract from JSON
products = extractor.extract("json", file_path="../data/raw/products.json")

# View extraction statistics
print("\nExtraction Statistics:")
for source, stats in extractor.extraction_stats.items():
    print(f"  {source}: {stats['records']:,} records in {stats['duration_seconds']:.2f}s")

---

## 7. Practice Exercise

Create a unified extractor function that can handle multiple source types and includes:
1. Error handling
2. Logging
3. Data validation
4. Statistics tracking

Try implementing it below:

In [None]:
# Your code here
# Create a function that extracts data from any source type
# and validates that it has at least 1 record

---

## 8. Key Takeaways

[OK] **File Extraction**: CSV, JSON, Excel, Parquet - each has different use cases

[OK] **Database Extraction**: Use SQLAlchemy for database-agnostic queries

[OK] **API Extraction**: Handle pagination, authentication, and rate limits

[OK] **Error Handling**: Retry logic with exponential backoff is crucial

[OK] **Best Practices**: Log everything, validate early, extract incrementally

### Next Steps

In **Module 03: Data Transformation and Cleaning**, we'll take the extracted data and:
- Clean and validate it
- Handle missing values
- Transform data types
- Merge and aggregate datasets

---

**Ready to transform data?** Open `03_data_transformation_cleaning.ipynb`!