# 🌐 Calling APIs - Real-Time Data Ingestion

Welcome to the third tutorial in our **Data Ingestion Pipeline** series! In this hands-on notebook, you'll learn how to collect data from APIs (Application Programming Interfaces) - the backbone of modern data integration.

## 🎯 Learning Objectives

By the end of this tutorial, you will:
- ✅ Understand what APIs are and how they work
- ✅ Make HTTP requests to fetch data from APIs
- ✅ Handle API authentication and rate limiting
- ✅ Process JSON responses and handle errors
- ✅ Build a robust API data ingestion system
- ✅ Work with real-world API examples

---

## 🛠️ Setup and Imports

Let's start by importing the libraries we'll need for API interactions:

In [None]:
# Essential imports for API interactions
import requests
import pandas as pd
import numpy as np
import json
import time
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# For visualizations
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('default')
sns.set_palette("husl")

# For handling URLs and HTTP status codes
from urllib.parse import urljoin, urlparse
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

print("📦 All libraries imported successfully!")
print(f"🌐 Requests version: {requests.__version__}")
print(f"📊 Pandas version: {pd.__version__}")
print(f"⏰ Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 🔍 Understanding APIs

An **API (Application Programming Interface)** is a set of rules that lets different software applications communicate with each other. Think of it as a waiter in a restaurant:

- 🍽️ **You (Client)** → Order food (make request)
- 👨‍🍳 **Kitchen (Server)** → Prepare food (process request)
- 🚶‍♂️ **Waiter (API)** → Takes your order and brings back food (handles communication)

### 🌐 **Common API Types:**
- **REST APIs** - Most common, uses HTTP methods (GET, POST, PUT, DELETE)
- **GraphQL APIs** - Flexible query language for APIs
- **WebSocket APIs** - Real-time bidirectional communication
- **Webhook APIs** - Event-driven data push
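
To make the REST style concrete, here is a minimal sketch that builds (but never sends) one request per HTTP method using only the standard library. The endpoint `https://api.example.com/orders` and the JSON bodies are hypothetical placeholders, not a real service.

```python
from urllib.request import Request

# Hypothetical endpoint, used only to illustrate request anatomy.
BASE = "https://api.example.com/orders"

# (method, url, body): GET and DELETE carry no body, POST and PUT do.
requests_to_build = [
    ("GET", BASE, None),
    ("POST", BASE, b'{"item": "book"}'),
    ("PUT", f"{BASE}/1", b'{"status": "shipped"}'),
    ("DELETE", f"{BASE}/1", None),
]

built = []
for method, url, body in requests_to_build:
    # Constructing a Request object does not open a network connection.
    req = Request(url, data=body, method=method,
                  headers={"Accept": "application/json"})
    built.append(req)
    has_body = "yes" if req.data is not None else "no"
    print(f"{req.get_method():6} {req.full_url}  body: {has_body}")
```

Each request is the same three ingredients (method, URL, optional body); the method alone tells the server which operation is intended.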

In [None]:
# Let's explore API concepts with a simple example
print("🌐 API Concepts Overview")
print("=" * 25)

api_concepts = {
    'HTTP Method': ['GET', 'POST', 'PUT', 'DELETE'],
    'Purpose': [
        'Retrieve data',
        'Create new data', 
        'Update existing data',
        'Delete data'
    ],
    'Example Use Case': [
        'Get list of orders',
        'Create new order',
        'Update order status',
        'Cancel order'
    ],
    'Data Direction': [
        'Server → Client',
        'Client → Server',
        'Client → Server',
        'Client → Server'
    ]
}

df_concepts = pd.DataFrame(api_concepts)
display(df_concepts)

print("\n📊 HTTP Status Codes (Common ones):")
status_codes = {
    200: "✅ OK - Request successful",
    201: "✅ Created - Resource created successfully", 
    400: "❌ Bad Request - Invalid request format",
    401: "🔐 Unauthorized - Authentication required",
    403: "🚫 Forbidden - Access denied",
    404: "🔍 Not Found - Resource doesn't exist",
    429: "⏰ Too Many Requests - Rate limit exceeded",
    500: "💥 Internal Server Error - Server problem"
}

for code, description in status_codes.items():
    print(f"  {code}: {description}")

## 🚀 Making Your First API Call

Let's start with a simple API call using JSONPlaceholder - a free fake REST API for testing and prototyping.
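
Before touching the network, it can help to see what "parsing a JSON response" means in isolation. The sketch below assumes a payload shaped like a single post record with `userId`, `id`, `title`, and `body` fields; the values are placeholders, not real API output.

```python
import json

# Placeholder payload shaped like one post record (values are made up).
raw_body = '{"userId": 1, "id": 1, "title": "example title", "body": "example body"}'

post = json.loads(raw_body)      # JSON object becomes a Python dict
print(type(post).__name__)
print(sorted(post.keys()))
print(post["title"])

# A server can return HTML or an empty body instead of JSON,
# so it is worth catching the decode error explicitly.
try:
    json.loads("<html>not json</html>")
    parse_error = None
except json.JSONDecodeError as exc:
    parse_error = f"Invalid JSON: {exc.msg}"
    print(parse_error)
```

`response.json()` in the `requests` library performs the same conversion on the response body.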

In [None]:
# Simple API call example
print("🚀 Making Your First API Call")
print("=" * 30)

# JSONPlaceholder is a free fake REST API
api_url = "https://jsonplaceholder.typicode.com/posts/1"

print(f"📡 Making GET request to: {api_url}")

try:
    # Make the API call (the timeout keeps the request from hanging indefinitely)
    response = requests.get(api_url, timeout=10)
    
    print(f"📊 Response Status Code: {response.status_code}")
    print(f"⏱️ Response Time: {response.elapsed.total_seconds():.3f} seconds")
    print(f"📏 Response Size: {len(response.content)} bytes")
    
    # Check if request was successful
    if response.status_code == 200:
        print("✅ Request successful!")
        
        # Parse JSON response
        data = response.json()
        
        print(f"\n📋 Response Data:")
        print(json.dumps(data, indent=2))
        
        print(f"\n🔍 Data Analysis:")
        print(f"  Data Type: {type(data)}")
        print(f"  Keys: {list(data.keys()) if isinstance(data, dict) else 'Not a dictionary'}")
        
    else:
        print(f"❌ Request failed with status code: {response.status_code}")
        
except requests.exceptions.RequestException as e:
    print(f"💥 Request failed: {e}")
except Exception as e:
    print(f"💥 Unexpected error: {e}")

## 📊 Fetching Multiple Records

In real-world scenarios, you'll often need to fetch multiple records. Let's learn how to handle collections of data from APIs.
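
Collections are usually narrowed with query parameters. As a rough illustration, the sketch below shows how a parameter dictionary turns into the query string that is appended to the URL. It uses only the standard library and sends nothing; `_page` and `_limit` are the parameter names this tutorial uses with JSONPlaceholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://jsonplaceholder.typicode.com/posts"
params = {"_page": 2, "_limit": 5}

# urlencode joins key=value pairs with "&" and escapes special characters.
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)

# Going the other way: recover the parameters from a URL.
# parse_qs returns every value as a list of strings.
parsed = parse_qs(urlsplit(full_url).query)
print(parsed)
```

Passing `params=` to `requests.get` does this encoding for you, which is why `response.url` shows the full query string.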

In [None]:
# Fetching multiple records
print("📊 Fetching Multiple Records from API")
print("=" * 35)

def fetch_posts(limit=10):
    """
    Fetch multiple posts from JSONPlaceholder API
    
    Args:
        limit (int): Number of posts to fetch
    
    Returns:
        tuple: (success, data, info)
    """
    api_url = "https://jsonplaceholder.typicode.com/posts"
    
    try:
        print(f"📡 Fetching {limit} posts from API...")
        start_time = time.time()
        
        # Add query parameters
        params = {'_limit': limit}
        response = requests.get(api_url, params=params, timeout=10)
        
        end_time = time.time()
        response_time = end_time - start_time
        
        if response.status_code == 200:
            data = response.json()
            
            info = {
                'status_code': response.status_code,
                'response_time': response_time,
                'records_count': len(data),
                'response_size_bytes': len(response.content),
                'url': response.url
            }
            
            return True, data, info
        else:
            return False, f"HTTP {response.status_code}: {response.reason}", None
            
    except requests.exceptions.RequestException as e:
        return False, f"Request error: {str(e)}", None
    except Exception as e:
        return False, f"Unexpected error: {str(e)}", None

# Test the function
success, posts_data, info = fetch_posts(limit=5)

if success:
    print(f"✅ Successfully fetched data!")
    print(f"📊 API Response Info:")
    for key, value in info.items():
        if key == 'response_time':
            print(f"  {key}: {value:.3f} seconds")
        else:
            print(f"  {key}: {value}")
    
    print(f"\n📋 Sample Post:")
    print(json.dumps(posts_data[0], indent=2))
    
    # Convert to DataFrame for analysis
    df_posts = pd.DataFrame(posts_data)
    print(f"\n📊 Posts DataFrame:")
    print(f"  Shape: {df_posts.shape}")
    print(f"  Columns: {list(df_posts.columns)}")
    
    display(df_posts.head())
    
else:
    print(f"❌ Failed to fetch data: {posts_data}")

## 🔐 Handling API Authentication

Most real-world APIs require authentication. Let's learn about different authentication methods and how to implement them.
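
To show what actually travels over the wire, here is a small sketch of how Basic and Bearer credentials end up in the `Authorization` header. The helper names (`basic_auth_header`, `bearer_header`, `mask`) are hypothetical, and the credentials are placeholders.

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP Basic Authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}

def bearer_header(token: str) -> dict:
    """Build the Authorization header for a bearer token."""
    return {"Authorization": f"Bearer {token}"}

def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the first few characters of a secret for logging."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

header = basic_auth_header("username", "password")
print(header)
print(bearer_header(mask("your_bearer_token_here")))
```

Base64 is an encoding, not encryption: anyone who sees the header can decode it, so Basic Authentication should only be sent over HTTPS.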

In [None]:
# API Authentication Examples
print("🔐 API Authentication Methods")
print("=" * 30)

# Common authentication methods
auth_methods = {
    'Method': [
        'API Key (Header)',
        'API Key (Query Parameter)',
        'Bearer Token',
        'Basic Authentication',
        'OAuth 2.0'
    ],
    'Implementation': [
        'headers={"X-API-Key": "your_key"}',
        'params={"api_key": "your_key"}',
        'headers={"Authorization": "Bearer token"}',
        'auth=("username", "password")',
        'Complex token exchange flow'
    ],
    'Security Level': [
        'Medium',
        'Low',
        'High',
        'Medium',
        'Very High'
    ],
    'Common Use': [
        'Simple APIs',
        'Public APIs',
        'Modern APIs',
        'Legacy systems',
        'Enterprise APIs'
    ]
}

df_auth = pd.DataFrame(auth_methods)
display(df_auth)

print("\n🔑 Authentication Examples:")

# Example 1: API Key in Header
def api_call_with_header_key(url, api_key):
    """
    Example of API call with API key in header
    """
    headers = {
        'X-API-Key': api_key,
        'Content-Type': 'application/json'
    }
    
    # This is just an example - we won't make the actual call
    print(f"📡 Would call: {url}")
    print(f"🔑 With headers: {headers}")
    
    # return requests.get(url, headers=headers)

# Example 2: Bearer Token
def api_call_with_bearer_token(url, token):
    """
    Example of API call with Bearer token
    """
    headers = {
        'Authorization': f'Bearer {token}',
        'Content-Type': 'application/json'
    }
    
    print(f"📡 Would call: {url}")
    print(f"🔑 With Authorization: Bearer {token[:10]}...")
    
    # return requests.get(url, headers=headers)

# Example 3: Basic Authentication
def api_call_with_basic_auth(url, username, password):
    """
    Example of API call with Basic Authentication
    """
    from requests.auth import HTTPBasicAuth
    
    print(f"📡 Would call: {url}")
    print(f"🔑 With Basic Auth: {username}:{'*' * len(password)}")
    
    # return requests.get(url, auth=HTTPBasicAuth(username, password))

# Demonstrate the examples
print("\n1️⃣ API Key in Header:")
api_call_with_header_key("https://api.example.com/data", "your_api_key_here")

print("\n2️⃣ Bearer Token:")
api_call_with_bearer_token("https://api.example.com/data", "your_bearer_token_here")

print("\n3️⃣ Basic Authentication:")
api_call_with_basic_auth("https://api.example.com/data", "username", "password")

## ⏰ Rate Limiting and Retry Logic

APIs often have rate limits to prevent abuse. Let's learn how to handle rate limiting and implement retry logic for robust API interactions.
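
The arithmetic behind both ideas is small enough to check by hand. The sketch below computes the spacing needed to stay under a per-minute limit and the wait times produced by exponential backoff. The function names are hypothetical, and the optional `jitter` parameter is an addition for illustration (random extra delay so that many clients do not retry in lockstep).

```python
import random

def min_interval(calls_per_minute):
    """Seconds to leave between calls to stay under a per-minute limit."""
    return 60.0 / calls_per_minute

def backoff_delays(max_retries=3, delay=1.0, backoff=2.0, jitter=0.0, seed=None):
    """Return the wait before each retry: delay, delay*backoff, delay*backoff**2, ...

    jitter adds a random extra of up to `jitter * current_delay` seconds.
    """
    rng = random.Random(seed)
    delays = []
    current = delay
    for _ in range(max_retries):
        extra = rng.uniform(0, jitter * current) if jitter else 0.0
        delays.append(current + extra)
        current *= backoff
    return delays

print(min_interval(30))                      # 30 calls/minute
print(backoff_delays())                      # no jitter
print(backoff_delays(jitter=0.5, seed=42))   # with up to 50% jitter
```

With the defaults, three retries wait 1, 2, and 4 seconds, so a call that never succeeds gives up after roughly 7 seconds of waiting.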

In [None]:
# Rate limiting and retry logic
print("⏰ Rate Limiting and Retry Logic")
print("=" * 35)

import time
from functools import wraps

class APIRateLimiter:
    """
    Simple rate limiter for API calls
    """
    
    def __init__(self, calls_per_minute=60):
        self.calls_per_minute = calls_per_minute
        self.min_interval = 60.0 / calls_per_minute  # seconds between calls
        self.last_call_time = 0
        self.call_count = 0
    
    def wait_if_needed(self):
        """Wait if necessary to respect rate limit"""
        current_time = time.time()
        time_since_last_call = current_time - self.last_call_time
        
        if time_since_last_call < self.min_interval:
            sleep_time = self.min_interval - time_since_last_call
            print(f"⏳ Rate limiting: sleeping for {sleep_time:.2f} seconds")
            time.sleep(sleep_time)
        
        self.last_call_time = time.time()
        self.call_count += 1

def retry_api_call(max_retries=3, delay=1, backoff=2):
    """
    Decorator for retrying API calls with exponential backoff
    
    Args:
        max_retries (int): Maximum number of retry attempts
        delay (float): Initial delay between retries (seconds)
        backoff (float): Backoff multiplier for delay
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            current_delay = delay
            
            for attempt in range(max_retries + 1):
                try:
                    result = func(*args, **kwargs)
                    
                    # Check if it's a successful response
                    if hasattr(result, 'status_code'):
                        if result.status_code == 200:
                            return result
                        elif result.status_code == 429:  # Rate limited
                            print(f"⏰ Rate limited (429), attempt {attempt + 1}/{max_retries + 1}")
                        elif result.status_code >= 500:  # Server error
                            print(f"💥 Server error ({result.status_code}), attempt {attempt + 1}/{max_retries + 1}")
                        else:
                            # Any other status (e.g. 201, 3xx, 4xx): retrying won't change the outcome
                            return result
                    else:
                        return result
                        
                except requests.exceptions.RequestException as e:
                    print(f"🌐 Network error, attempt {attempt + 1}/{max_retries + 1}: {e}")
                
                # Wait before retry (except on last attempt)
                if attempt < max_retries:
                    print(f"⏳ Waiting {current_delay:.1f} seconds before retry...")
                    time.sleep(current_delay)
                    current_delay *= backoff
            
            print(f"❌ All {max_retries + 1} attempts failed")
            return None
        
        return wrapper
    return decorator

# Example usage
rate_limiter = APIRateLimiter(calls_per_minute=30)  # 30 calls per minute

@retry_api_call(max_retries=3, delay=1, backoff=2)
def robust_api_call(url, **kwargs):
    """
    Robust API call with rate limiting and retry logic
    """
    rate_limiter.wait_if_needed()
    
    print(f"📡 Making API call #{rate_limiter.call_count} to: {url}")
    kwargs.setdefault('timeout', 10)  # avoid hanging forever on an unresponsive server
    response = requests.get(url, **kwargs)
    
    return response

# Test the robust API call
print("🧪 Testing Robust API Call")
print("=" * 25)

# Make a few API calls to demonstrate rate limiting
test_urls = [
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
    "https://jsonplaceholder.typicode.com/posts/3"
]

responses = []
start_time = time.time()

for url in test_urls:
    response = robust_api_call(url)
    if response and response.status_code == 200:
        responses.append(response.json())
        print(f"  ✅ Success: Got post ID {response.json()['id']}")
    else:
        print(f"  ❌ Failed to get data from {url}")

total_time = time.time() - start_time
print(f"\n⏱️ Total time for {len(test_urls)} calls: {total_time:.2f} seconds")
print(f"📊 Successful responses: {len(responses)}/{len(test_urls)}")

## 📄 Handling Pagination

Many APIs return large datasets in pages. Let's learn how to handle pagination to get all available data.
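
The stopping rules for page-based pagination can be tested without a network. In this sketch, `fake_fetch_page` is a hypothetical stand-in for an HTTP call that slices an in-memory list, and `fetch_all` stops on an empty page, a short page, or a safety cap.

```python
def fake_fetch_page(page, per_page, dataset):
    """Stand-in for an HTTP call: return one 1-indexed page of a list."""
    start = (page - 1) * per_page
    return dataset[start:start + per_page]

def fetch_all(fetch_page, per_page=10, max_pages=100):
    """Collect pages until one is empty, one is short, or max_pages is hit."""
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        if not batch:                 # empty page: nothing left
            break
        records.extend(batch)
        if len(batch) < per_page:     # short page: this was the last one
            break
    return records

dataset = [{"id": i} for i in range(1, 24)]   # 23 records -> pages of 10, 10, 3

def fetch(page, per_page):
    return fake_fetch_page(page, per_page, dataset)

all_records = fetch_all(fetch, per_page=10)
capped = fetch_all(fetch, per_page=10, max_pages=2)
print(len(all_records), len(capped))
```

The short-page check saves one request when the total is not a multiple of the page size; the empty-page check covers the case where it is.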

In [None]:
# Handling API pagination
print("📄 Handling API Pagination")
print("=" * 25)

def fetch_all_posts_paginated(base_url="https://jsonplaceholder.typicode.com/posts", 
                             per_page=10, max_pages=5):
    """
    Fetch all posts using pagination
    
    Args:
        base_url (str): Base API URL
        per_page (int): Number of records per page
        max_pages (int): Maximum pages to fetch (safety limit)
    
    Returns:
        tuple: (success, all_data, pagination_info)
    """
    all_data = []
    page = 1
    total_requests = 0
    
    pagination_info = {
        'pages_fetched': 0,
        'total_records': 0,
        'total_requests': 0,
        'average_response_time': 0,
        'errors': []
    }
    
    response_times = []
    
    print(f"📡 Starting paginated fetch (max {max_pages} pages, {per_page} per page)")
    
    while page <= max_pages:
        try:
            print(f"\n📄 Fetching page {page}...")
            
            # Prepare parameters for pagination
            params = {
                '_page': page,
                '_limit': per_page
            }
            
            start_time = time.time()
            response = requests.get(base_url, params=params, timeout=10)
            response_time = time.time() - start_time
            response_times.append(response_time)
            
            total_requests += 1
            
            if response.status_code == 200:
                page_data = response.json()
                
                if not page_data:  # Empty response means no more data
                    print(f"  📭 No more data on page {page}")
                    break
                
                all_data.extend(page_data)
                pagination_info['pages_fetched'] += 1
                
                print(f"  ✅ Page {page}: {len(page_data)} records ({response_time:.3f}s)")
                
                # Check if we got fewer records than requested (last page)
                if len(page_data) < per_page:
                    print(f"  🏁 Last page detected (got {len(page_data)} < {per_page})")
                    break
                
                page += 1
                
                # Small delay to be nice to the API
                time.sleep(0.1)
                
            else:
                error_msg = f"HTTP {response.status_code} on page {page}"
                pagination_info['errors'].append(error_msg)
                print(f"  ❌ {error_msg}")
                break
                
        except requests.exceptions.RequestException as e:
            error_msg = f"Request error on page {page}: {str(e)}"
            pagination_info['errors'].append(error_msg)
            print(f"  💥 {error_msg}")
            break
    
    # Calculate final statistics
    pagination_info['total_records'] = len(all_data)
    pagination_info['total_requests'] = total_requests
    pagination_info['average_response_time'] = np.mean(response_times) if response_times else 0
    
    success = len(all_data) > 0 and len(pagination_info['errors']) == 0
    
    return success, all_data, pagination_info

# Test pagination
success, all_posts, info = fetch_all_posts_paginated(per_page=5, max_pages=3)

if success:
    print(f"\n🎉 Pagination completed successfully!")
    print(f"📊 Pagination Summary:")
    print(f"  Pages fetched: {info['pages_fetched']}")
    print(f"  Total records: {info['total_records']:,}")
    print(f"  Total requests: {info['total_requests']}")
    print(f"  Average response time: {info['average_response_time']:.3f}s")
    
    # Convert to DataFrame
    df_all_posts = pd.DataFrame(all_posts)
    print(f"\n📋 Combined Data:")
    print(f"  DataFrame shape: {df_all_posts.shape}")
    print(f"  Unique users: {df_all_posts['userId'].nunique()}")
    
    # Show sample
    display(df_all_posts[['id', 'userId', 'title']].head(10))
    
else:
    print(f"\n❌ Pagination failed")
    if info['errors']:
        print(f"Errors: {info['errors']}")

## 🔄 Real-World API Integration Example

Let's build a more realistic example by creating a system that fetches data from multiple API endpoints and combines them.
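
Combining endpoints comes down to joining records on a shared key. The sketch below does this in plain Python with made-up records; the `ORD-`/`CUST-` ID formats follow the ones used in this section, while the names and amounts are placeholders.

```python
# Placeholder records, as two separate endpoints might return them.
orders = [
    {"order_id": "ORD-0001", "customer_id": "CUST-001", "total_amount": 120.0},
    {"order_id": "ORD-0002", "customer_id": "CUST-002", "total_amount": 35.5},
    {"order_id": "ORD-0003", "customer_id": "CUST-999", "total_amount": 10.0},
]
customers = [
    {"customer_id": "CUST-001", "name": "Alice"},
    {"customer_id": "CUST-002", "name": "Bob"},
]

# Index customers by ID once so each order lookup is O(1).
customers_by_id = {c["customer_id"]: c for c in customers}

# Left join: every order is kept; unmatched customers get a name of None.
enriched = [
    {**order, "customer_name": customers_by_id.get(order["customer_id"], {}).get("name")}
    for order in orders
]

for row in enriched:
    print(row["order_id"], row["customer_name"], row["total_amount"])
```

This is the same operation as a pandas `merge(..., how='left')`, and it shows why a consistent key format across endpoints matters.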

In [None]:
# Real-world API integration example
print("🔄 Real-World API Integration Example")
print("=" * 40)

class ECommerceAPIClient:
    """
    A comprehensive API client for e-commerce data ingestion
    """
    
    def __init__(self, base_url="https://jsonplaceholder.typicode.com"):
        self.base_url = base_url.rstrip('/')
        self.session = requests.Session()
        self.rate_limiter = APIRateLimiter(calls_per_minute=30)
        
        # Set default headers
        self.session.headers.update({
            'User-Agent': 'ECommerce-Data-Ingestion/1.0',
            'Accept': 'application/json',
            'Content-Type': 'application/json'
        })
        
        print(f"🏪 E-commerce API Client initialized")
        print(f"🌐 Base URL: {self.base_url}")
    
    @retry_api_call(max_retries=3, delay=1, backoff=2)
    def _make_request(self, endpoint, params=None):
        """Make a rate-limited API request"""
        self.rate_limiter.wait_if_needed()
        
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        return self.session.get(url, params=params, timeout=10)
    
    def fetch_orders(self, limit=20):
        """
        Fetch orders (using posts as proxy for orders)
        
        Args:
            limit (int): Number of orders to fetch
        
        Returns:
            tuple: (success, orders_data, metadata)
        """
        print(f"📦 Fetching {limit} orders...")
        
        response = self._make_request('posts', params={'_limit': limit})
        
        if response and response.status_code == 200:
            posts = response.json()
            
            # Transform posts into order-like data
            orders = []
            for post in posts:
                order = {
                    'order_id': f"ORD-{post['id']:04d}",
                    'customer_id': f"CUST-{post['userId']:03d}",
                    'product_name': post['title'][:50],  # Use title as product name
                    'product_description': post['body'][:100],
                    'quantity': np.random.randint(1, 5),  # Random quantity
                    'unit_price': round(np.random.uniform(10, 500), 2),  # Random price
                    'order_date': datetime.now().strftime('%Y-%m-%d'),
                    'status': np.random.choice(['pending', 'processing', 'shipped', 'delivered']),
                    'source': 'api',
                    'api_id': post['id']
                }
                
                # Calculate total
                order['total_amount'] = round(order['quantity'] * order['unit_price'], 2)
                
                orders.append(order)
            
            metadata = {
                'endpoint': 'orders',
                'records_fetched': len(orders),
                'response_time': response.elapsed.total_seconds(),
                'timestamp': datetime.now().isoformat()
            }
            
            return True, orders, metadata
        
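        # A Response is falsy for 4xx/5xx statuses, so compare against None explicitly
        status = response.status_code if response is not None else 'No response'
        return False, f"Failed to fetch orders: {status}", None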
    
    def fetch_customers(self, limit=10):
        """
        Fetch customer data (using users as customers)
        
        Args:
            limit (int): Number of customers to fetch
        
        Returns:
            tuple: (success, customers_data, metadata)
        """
        print(f"👥 Fetching {limit} customers...")
        
        response = self._make_request('users', params={'_limit': limit})
        
        if response and response.status_code == 200:
            users = response.json()
            
            # Transform users into customer data
            customers = []
            for user in users:
                customer = {
                    'customer_id': f"CUST-{user['id']:03d}",
                    'name': user['name'],
                    'email': user['email'],
                    'phone': user['phone'],
                    'username': user['username'],
                    'website': user.get('website', ''),
                    'company': user.get('company', {}).get('name', ''),
                    'address': self._format_address(user.get('address', {})),
                    'registration_date': (datetime.now() - timedelta(days=np.random.randint(1, 365))).strftime('%Y-%m-%d'),
                    'customer_tier': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum']),
                    'source': 'api',
                    'api_id': user['id']
                }
                
                customers.append(customer)
            
            metadata = {
                'endpoint': 'customers',
                'records_fetched': len(customers),
                'response_time': response.elapsed.total_seconds(),
                'timestamp': datetime.now().isoformat()
            }
            
            return True, customers, metadata
        
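        # A Response is falsy for 4xx/5xx statuses, so compare against None explicitly
        status = response.status_code if response is not None else 'No response'
        return False, f"Failed to fetch customers: {status}", None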
    
    def _format_address(self, address_data):
        """Format address from API data"""
        if not address_data:
            return ""
        
        parts = []
        if address_data.get('street'):
            parts.append(address_data['street'])
        if address_data.get('city'):
            parts.append(address_data['city'])
        if address_data.get('zipcode'):
            parts.append(address_data['zipcode'])
        
        return ", ".join(parts)
    
    def fetch_all_data(self, orders_limit=15, customers_limit=8):
        """
        Fetch all data from multiple endpoints
        
        Returns:
            dict: Combined data from all endpoints
        """
        print(f"🔄 Fetching all data from multiple endpoints")
        print("=" * 45)
        
        results = {
            'orders': {'success': False, 'data': [], 'metadata': None},
            'customers': {'success': False, 'data': [], 'metadata': None},
            'summary': {
                'total_api_calls': 0,
                'total_records': 0,
                'total_time': 0,
                'successful_endpoints': 0,
                'failed_endpoints': 0
            }
        }
        
        start_time = time.time()
        
        # Fetch orders
        success, orders_data, orders_meta = self.fetch_orders(orders_limit)
        results['orders']['success'] = success
        results['orders']['data'] = orders_data if success else []
        results['orders']['metadata'] = orders_meta
        
        if success:
            results['summary']['successful_endpoints'] += 1
            results['summary']['total_records'] += len(orders_data)
            print(f"  ✅ Orders: {len(orders_data)} records")
        else:
            results['summary']['failed_endpoints'] += 1
            print(f"  ❌ Orders: {orders_data}")
        
        # Fetch customers
        success, customers_data, customers_meta = self.fetch_customers(customers_limit)
        results['customers']['success'] = success
        results['customers']['data'] = customers_data if success else []
        results['customers']['metadata'] = customers_meta
        
        if success:
            results['summary']['successful_endpoints'] += 1
            results['summary']['total_records'] += len(customers_data)
            print(f"  ✅ Customers: {len(customers_data)} records")
        else:
            results['summary']['failed_endpoints'] += 1
            print(f"  ❌ Customers: {customers_data}")
        
        # Calculate summary
        results['summary']['total_time'] = time.time() - start_time
        results['summary']['total_api_calls'] = self.rate_limiter.call_count
        
        return results
    
    def close(self):
        """Close the session"""
        self.session.close()
        print("🔒 API client session closed")

# Test the comprehensive API client
api_client = ECommerceAPIClient()

try:
    # Fetch all data
    all_data = api_client.fetch_all_data(orders_limit=10, customers_limit=5)
    
    print(f"\n📊 API Integration Summary:")
    print(f"  Successful endpoints: {all_data['summary']['successful_endpoints']}")
    print(f"  Failed endpoints: {all_data['summary']['failed_endpoints']}")
    print(f"  Total records: {all_data['summary']['total_records']:,}")
    print(f"  Total API calls: {all_data['summary']['total_api_calls']}")
    print(f"  Total time: {all_data['summary']['total_time']:.2f} seconds")
    
    # Display sample data
    if all_data['orders']['success']:
        df_orders = pd.DataFrame(all_data['orders']['data'])
        print(f"\n📦 Sample Orders:")
        display(df_orders[['order_id', 'customer_id', 'product_name', 'quantity', 'unit_price', 'total_amount', 'status']].head())
    
    if all_data['customers']['success']:
        df_customers = pd.DataFrame(all_data['customers']['data'])
        print(f"\n👥 Sample Customers:")
        display(df_customers[['customer_id', 'name', 'email', 'customer_tier', 'company']].head())

finally:
    api_client.close()

## 📊 Data Analysis from API

Now let's analyze the data we've collected from the API to gain insights!

In [None]:
# Analyze API data
if all_data['orders']['success'] and all_data['customers']['success']:
    print("📊 API Data Analysis")
    print("=" * 20)
    
    # Create DataFrames
    df_orders = pd.DataFrame(all_data['orders']['data'])
    df_customers = pd.DataFrame(all_data['customers']['data'])
    
    # Basic statistics
    print(f"📈 Business Metrics:")
    total_revenue = df_orders['total_amount'].sum()
    avg_order_value = df_orders['total_amount'].mean()
    total_orders = len(df_orders)
    unique_customers = df_orders['customer_id'].nunique()
    
    print(f"  Total Revenue: ${total_revenue:,.2f}")
    print(f"  Average Order Value: ${avg_order_value:.2f}")
    print(f"  Total Orders: {total_orders:,}")
    print(f"  Unique Customers: {unique_customers}")
    print(f"  Orders per Customer: {total_orders/unique_customers:.1f}")
    
    # Order status analysis
    print(f"\n📦 Order Status Distribution:")
    status_counts = df_orders['status'].value_counts()
    for status, count in status_counts.items():
        percentage = (count / len(df_orders)) * 100
        print(f"  {status.title()}: {count} ({percentage:.1f}%)")
    
    # Customer tier analysis
    print(f"\n👥 Customer Tier Distribution:")
    tier_counts = df_customers['customer_tier'].value_counts()
    for tier, count in tier_counts.items():
        percentage = (count / len(df_customers)) * 100
        print(f"  {tier}: {count} ({percentage:.1f}%)")
    
    # Create visualizations
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
    
    # 1. Order value distribution
    ax1.hist(df_orders['total_amount'], bins=10, alpha=0.7, edgecolor='black')
    ax1.set_title('Order Value Distribution')
    ax1.set_xlabel('Order Value ($)')
    ax1.set_ylabel('Frequency')
    
    # 2. Order status pie chart
    ax2.pie(status_counts.values, labels=status_counts.index, autopct='%1.1f%%')
    ax2.set_title('Order Status Distribution')
    
    # 3. Customer tier distribution
    tier_counts.plot(kind='bar', ax=ax3)
    ax3.set_title('Customer Tier Distribution')
    ax3.set_xlabel('Customer Tier')
    ax3.set_ylabel('Number of Customers')
    ax3.tick_params(axis='x', rotation=45)
    
    # 4. Revenue by customer tier (if we can join the data)
    # Merge orders with customers to get tier information
    df_merged = df_orders.merge(df_customers[['customer_id', 'customer_tier']], on='customer_id', how='left')
    revenue_by_tier = df_merged.groupby('customer_tier')['total_amount'].sum()
    
    revenue_by_tier.plot(kind='bar', ax=ax4)
    ax4.set_title('Revenue by Customer Tier')
    ax4.set_xlabel('Customer Tier')
    ax4.set_ylabel('Total Revenue ($)')
    ax4.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    # Show merged data sample
    print(f"\n🔗 Merged Orders + Customers Data:")
    display(df_merged[['order_id', 'customer_id', 'product_name', 'total_amount', 'customer_tier', 'status']].head())
    
else:
    print("❌ Cannot perform analysis - API data fetch failed")

## 🛡️ Error Handling and Best Practices

Let's learn about common API errors and how to handle them gracefully.
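
Two decisions come up in almost every error handler: is this status worth retrying, and how long should we wait? The sketch below is one possible approach, using hypothetical helper names. It treats 429 and 5xx as retryable and reads a `Retry-After` value that may be either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_retryable(status_code):
    """429 (rate limited) and 5xx (server errors) may succeed on a later try."""
    return status_code == 429 or 500 <= status_code < 600

def parse_retry_after(value, now=None):
    """Return seconds to wait, or None if the header is missing or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():                      # delta-seconds form, e.g. "120"
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)   # HTTP-date form
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:              # treat naive dates as UTC
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max((retry_at - now).total_seconds(), 0.0)

fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(is_retryable(503), is_retryable(404))
print(parse_retry_after("120"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
print(parse_retry_after("soon"))
```

Returning `None` for an unreadable header lets the caller fall back to its own backoff schedule instead of crashing.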

In [None]:
# API error handling examples
print("🛡️ API Error Handling and Best Practices")
print("=" * 45)

def comprehensive_api_call(url, **kwargs):
    """
    Comprehensive API call with detailed error handling
    
    Args:
        url (str): API endpoint URL
        **kwargs: Additional arguments for requests
    
    Returns:
        dict: Detailed response information
    """
    result = {
        'success': False,
        'data': None,
        'status_code': None,
        'error_type': None,
        'error_message': None,
        'response_time': 0,
        'retry_recommended': False,
        'headers': {},
        'url': url
    }
    
    start_time = time.time()
    
    try:
        # Set timeout if not provided
        if 'timeout' not in kwargs:
            kwargs['timeout'] = 30
        
        response = requests.get(url, **kwargs)
        result['response_time'] = time.time() - start_time
        result['status_code'] = response.status_code
        result['headers'] = dict(response.headers)
        
        # Handle different status codes
        if response.status_code == 200:
            try:
                result['data'] = response.json()
                result['success'] = True
            except ValueError:  # JSON decode errors subclass ValueError (json and simplejson alike)
                result['error_type'] = 'json_decode_error'
                result['error_message'] = 'Response is not valid JSON'
                result['data'] = response.text[:200]  # First 200 chars
        
        elif response.status_code == 400:
            result['error_type'] = 'bad_request'
            result['error_message'] = 'Bad request - check your parameters'
            result['retry_recommended'] = False
        
        elif response.status_code == 401:
            result['error_type'] = 'unauthorized'
            result['error_message'] = 'Authentication required or invalid'
            result['retry_recommended'] = False
        
        elif response.status_code == 403:
            result['error_type'] = 'forbidden'
            result['error_message'] = 'Access forbidden - check permissions'
            result['retry_recommended'] = False
        
        elif response.status_code == 404:
            result['error_type'] = 'not_found'
            result['error_message'] = 'Resource not found'
            result['retry_recommended'] = False
        
        elif response.status_code == 429:
            result['error_type'] = 'rate_limited'
            result['error_message'] = 'Rate limit exceeded'
            result['retry_recommended'] = True
            
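            # Check for Retry-After header (either seconds or an HTTP-date)
            retry_after = response.headers.get('Retry-After')
            if retry_after:
                try:
                    result['retry_after'] = int(retry_after)
                except ValueError:
                    # Not a number of seconds: keep the raw HTTP-date string
                    result['retry_after'] = retry_after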
        
        elif 500 <= response.status_code < 600:
            result['error_type'] = 'server_error'
            result['error_message'] = f'Server error: {response.status_code}'
            result['retry_recommended'] = True
        
        else:
            result['error_type'] = 'unknown_status'
            result['error_message'] = f'Unknown status code: {response.status_code}'
            result['retry_recommended'] = False
    
    except requests.exceptions.Timeout:
        result['error_type'] = 'timeout'
        result['error_message'] = f'Request timed out after {kwargs.get("timeout", 30)} seconds'
        result['retry_recommended'] = True
        result['response_time'] = time.time() - start_time
    
    except requests.exceptions.ConnectionError:
        result['error_type'] = 'connection_error'
        result['error_message'] = 'Failed to connect to the server'
        result['retry_recommended'] = True
        result['response_time'] = time.time() - start_time
    
    except requests.exceptions.RequestException as e:
        result['error_type'] = 'request_exception'
        result['error_message'] = f'Request failed: {str(e)}'
        result['retry_recommended'] = True
        result['response_time'] = time.time() - start_time
    
    except Exception as e:
        result['error_type'] = 'unexpected_error'
        result['error_message'] = f'Unexpected error: {str(e)}'
        result['retry_recommended'] = False
        result['response_time'] = time.time() - start_time
    
    return result

# Test error handling with different scenarios
test_scenarios = [
    {
        'name': 'Valid API call',
        'url': 'https://jsonplaceholder.typicode.com/posts/1'
    },
    {
        'name': 'Not found (404)',
        'url': 'https://jsonplaceholder.typicode.com/posts/99999'
    },
    {
        'name': 'Invalid domain',
        'url': 'https://this-domain-does-not-exist-12345.com/api'
    },
    {
        'name': 'Timeout test',
        'url': 'https://httpbin.org/delay/5',
        'timeout': 2
    }
]

print("🧪 Testing Different Error Scenarios:")
print("=" * 35)

for i, scenario in enumerate(test_scenarios, 1):
    print(f"\n{i}. {scenario['name']}")
    print(f"   URL: {scenario['url']}")
    
    # Extract URL and other parameters without mutating the scenario dict,
    # so this cell can be re-run without a KeyError
    url = scenario['url']
    kwargs = {k: v for k, v in scenario.items() if k not in ('name', 'url')}
    
    result = comprehensive_api_call(url, **kwargs)
    
    if result['success']:
        print(f"   ✅ Success ({result['response_time']:.3f}s)")
        if isinstance(result['data'], dict):
            print(f"   📊 Data keys: {list(result['data'].keys())[:3]}...")
    else:
        print(f"   ❌ Failed: {result['error_type']}")
        print(f"   💬 Message: {result['error_message']}")
        print(f"   🔄 Retry recommended: {result['retry_recommended']}")
        if result['status_code']:
            print(f"   📊 Status code: {result['status_code']}")
    
    print(f"   ⏱️ Response time: {result['response_time']:.3f}s")

print(f"\n💡 API Best Practices Summary:")
best_practices = [
    "Always set timeouts for API calls",
    "Implement retry logic with exponential backoff",
    "Handle rate limiting gracefully",
    "Log all API interactions for debugging",
    "Validate API responses before processing",
    "Use connection pooling for multiple calls",
    "Implement circuit breakers for failing APIs",
    "Cache responses when appropriate",
    "Monitor API performance and errors",
    "Have fallback strategies for critical APIs"
]

for i, practice in enumerate(best_practices, 1):
    print(f"  {i:2d}. {practice}")
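
The list above mentions circuit breakers, which this notebook does not otherwise demonstrate. Below is a minimal sketch of the idea, assuming a single-threaded caller: after a number of consecutive failures the breaker "opens" and blocks calls until a cool-down has passed. The class name `CircuitBreaker` and its parameters (`failure_threshold`, `reset_timeout`, `clock`) are hypothetical names chosen for illustration and are not part of `requests`.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch (hypothetical helper, not part of requests)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # Consecutive failures before opening
        self.reset_timeout = reset_timeout          # Cool-down in seconds while open
        self.clock = clock                          # Injectable clock, handy for testing
        self.failure_count = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow_request(self):
        """Return True if a call may go out right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, allow a trial request (the "half-open" state)
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        """A successful call closes the circuit and resets the failure count."""
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure and (re)open the circuit once the threshold is reached."""
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = self.clock()
```

One way to use it with the function above: check `breaker.allow_request()` before calling `comprehensive_api_call(url)`, then call `breaker.record_success()` or `breaker.record_failure()` depending on `result['success']`. When `allow_request()` returns `False`, skip the call and fall back to cached or default data.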

## 🎯 Building Your API Ingestion System

Let's put everything together and build a complete API ingestion system that you can use in real projects!

In [None]:
# Complete API ingestion system
print("🎯 Building Complete API Ingestion System")
print("=" * 45)

class APIIngestionSystem:
    """
    A complete, production-ready API ingestion system
    """
    
    def __init__(self, base_url, rate_limit=60, timeout=30):
        self.base_url = base_url.rstrip('/')
        self.session = requests.Session()
        self.rate_limiter = APIRateLimiter(calls_per_minute=rate_limit)
        self.timeout = timeout
        
        # Statistics tracking
        self.stats = {
            'total_calls': 0,
            'successful_calls': 0,
            'failed_calls': 0,
            'total_records': 0,
            'total_time': 0,
            'errors': []
        }
        
        # Set default headers
        self.session.headers.update({
            'User-Agent': 'API-Ingestion-System/1.0',
            'Accept': 'application/json'
        })
        
        print(f"🚀 API Ingestion System initialized")
        print(f"🌐 Base URL: {self.base_url}")
        print(f"⏰ Rate limit: {rate_limit} calls/minute")
        print(f"⏱️ Timeout: {timeout} seconds")
    
    def set_authentication(self, auth_type, **auth_params):
        """
        Set authentication for API calls
        
        Args:
            auth_type (str): Type of authentication ('api_key', 'bearer', 'basic')
            **auth_params: Authentication parameters
        """
        if auth_type == 'api_key':
            header_name = auth_params.get('header_name', 'X-API-Key')
            api_key = auth_params.get('api_key')
            self.session.headers[header_name] = api_key
            print(f"🔑 API Key authentication set ({header_name})")
        
        elif auth_type == 'bearer':
            token = auth_params.get('token')
            self.session.headers['Authorization'] = f'Bearer {token}'
            print(f"🔑 Bearer token authentication set")
        
        elif auth_type == 'basic':
            from requests.auth import HTTPBasicAuth
            username = auth_params.get('username')
            password = auth_params.get('password')
            self.session.auth = HTTPBasicAuth(username, password)
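            print(f"🔑 Basic authentication set for user: {username}")
        
        else:
            # Fail loudly instead of silently leaving the session unauthenticated
            raise ValueError(f"Unsupported auth_type: {auth_type!r}")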
    
    def fetch_data(self, endpoint, params=None, max_retries=3):
        """
        Fetch data from API endpoint with comprehensive error handling
        
        Args:
            endpoint (str): API endpoint
            params (dict): Query parameters
            max_retries (int): Maximum retry attempts
        
        Returns:
            dict: Response data and metadata
        """
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        
        for attempt in range(max_retries + 1):
            try:
                # Apply rate limiting
                self.rate_limiter.wait_if_needed()
                
                # Make the request
                start_time = time.time()
                response = self.session.get(url, params=params, timeout=self.timeout)
                response_time = time.time() - start_time
                
                # Update statistics
                self.stats['total_calls'] += 1
                self.stats['total_time'] += response_time
                
                if response.status_code == 200:
                    data = response.json()
                    self.stats['successful_calls'] += 1
                    
                    # Count records
                    if isinstance(data, list):
                        record_count = len(data)
                    elif isinstance(data, dict) and 'data' in data:
                        record_count = len(data['data']) if isinstance(data['data'], list) else 1
                    else:
                        record_count = 1
                    
                    self.stats['total_records'] += record_count
                    
                    return {
                        'success': True,
                        'data': data,
                        'records_count': record_count,
                        'response_time': response_time,
                        'status_code': response.status_code,
                        'url': url,
                        'attempt': attempt + 1
                    }
                
                elif response.status_code == 429:  # Rate limited
                    if attempt < max_retries:
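                        # Retry-After may be an HTTP-date instead of seconds; fall back to 60s in that case
                        retry_after_header = response.headers.get('Retry-After', '')
                        retry_after = int(retry_after_header) if retry_after_header.strip().isdigit() else 60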
                        print(f"⏰ Rate limited, waiting {retry_after} seconds...")
                        time.sleep(retry_after)
                        continue
                
                elif response.status_code >= 500:  # Server error
                    if attempt < max_retries:
                        wait_time = (2 ** attempt)  # Exponential backoff
                        print(f"💥 Server error {response.status_code}, retrying in {wait_time}s...")
                        time.sleep(wait_time)
                        continue
                
                # Non-retryable error
                self.stats['failed_calls'] += 1
                error_msg = f"HTTP {response.status_code}: {response.reason}"
                self.stats['errors'].append({
                    'endpoint': endpoint,
                    'error': error_msg,
                    'timestamp': datetime.now().isoformat()
                })
                
                return {
                    'success': False,
                    'error': error_msg,
                    'status_code': response.status_code,
                    'response_time': response_time,
                    'url': url,
                    'attempt': attempt + 1
                }
            
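            except requests.exceptions.RequestException as e:
                # Count attempts that never produced a response, so that
                # failed_calls cannot exceed total_calls in the statistics
                self.stats['total_calls'] += 1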
                if attempt < max_retries:
                    wait_time = (2 ** attempt)
                    print(f"🌐 Network error, retrying in {wait_time}s: {str(e)}")
                    time.sleep(wait_time)
                    continue
                
                self.stats['failed_calls'] += 1
                error_msg = f"Network error: {str(e)}"
                self.stats['errors'].append({
                    'endpoint': endpoint,
                    'error': error_msg,
                    'timestamp': datetime.now().isoformat()
                })
                
                return {
                    'success': False,
                    'error': error_msg,
                    'url': url,
                    'attempt': attempt + 1
                }
        
        return {
            'success': False,
            'error': 'Max retries exceeded',
            'url': url,
            'attempt': max_retries + 1
        }
    
    def fetch_paginated_data(self, endpoint, per_page=20, max_pages=10, page_param='_page', limit_param='_limit'):
        """
        Fetch all data using pagination
        
        Args:
            endpoint (str): API endpoint
            per_page (int): Records per page
            max_pages (int): Maximum pages to fetch
            page_param (str): Page parameter name
            limit_param (str): Limit parameter name
        
        Returns:
            dict: All paginated data
        """
        all_data = []
        page = 1
        
        print(f"📄 Starting paginated fetch from {endpoint}")
        
        while page <= max_pages:
            params = {
                page_param: page,
                limit_param: per_page
            }
            
            result = self.fetch_data(endpoint, params=params)
            
            if result['success']:
                page_data = result['data']
                
                if not page_data or (isinstance(page_data, list) and len(page_data) == 0):
                    print(f"  📭 No more data on page {page}")
                    break
                
                if isinstance(page_data, list):
                    all_data.extend(page_data)
                    print(f"  📄 Page {page}: {len(page_data)} records")
                    
                    if len(page_data) < per_page:
                        print(f"  🏁 Last page detected")
                        page += 1  # Count this final page in pages_fetched
                        break
                else:
                    all_data.append(page_data)
                    print(f"  📄 Page {page}: 1 record")
                
                page += 1
            else:
                print(f"  ❌ Failed to fetch page {page}: {result['error']}")
                break
        
        return {
            'success': len(all_data) > 0,
            'data': all_data,
            'total_records': len(all_data),
            'pages_fetched': page - 1
        }
    
    def get_statistics(self):
        """
        Get ingestion statistics
        
        Returns:
            dict: Statistics summary
        """
        stats = self.stats.copy()
        
        if stats['total_calls'] > 0:
            stats['success_rate'] = (stats['successful_calls'] / stats['total_calls']) * 100
            stats['average_response_time'] = stats['total_time'] / stats['total_calls']
        else:
            stats['success_rate'] = 0
            stats['average_response_time'] = 0
        
        return stats
    
    def close(self):
        """Close the session and print final statistics"""
        stats = self.get_statistics()
        
        print(f"\n📊 Final API Ingestion Statistics:")
        print(f"  Total API calls: {stats['total_calls']}")
        print(f"  Successful calls: {stats['successful_calls']}")
        print(f"  Failed calls: {stats['failed_calls']}")
        print(f"  Success rate: {stats['success_rate']:.1f}%")
        print(f"  Total records: {stats['total_records']:,}")
        print(f"  Average response time: {stats['average_response_time']:.3f}s")
        print(f"  Total time: {stats['total_time']:.2f}s")
        
        if stats['errors']:
            print(f"  Errors: {len(stats['errors'])}")
        
        self.session.close()
        print(f"🔒 API session closed")

# Test the complete API ingestion system
print("\n🧪 Testing Complete API Ingestion System")
print("=" * 40)

# Initialize the system
api_system = APIIngestionSystem(
    base_url="https://jsonplaceholder.typicode.com",
    rate_limit=30,
    timeout=10
)

try:
    # Test single endpoint
    print("\n1️⃣ Testing single endpoint fetch:")
    result = api_system.fetch_data('posts', params={'_limit': 5})
    
    if result['success']:
        print(f"  ✅ Success: {result['records_count']} records in {result['response_time']:.3f}s")
    else:
        print(f"  ❌ Failed: {result['error']}")
    
    # Test paginated fetch
    print("\n2️⃣ Testing paginated fetch:")
    paginated_result = api_system.fetch_paginated_data('posts', per_page=5, max_pages=3)
    
    if paginated_result['success']:
        print(f"  ✅ Success: {paginated_result['total_records']} total records from {paginated_result['pages_fetched']} pages")
        
        # Convert to DataFrame for analysis
        df_paginated = pd.DataFrame(paginated_result['data'])
        print(f"  📊 DataFrame shape: {df_paginated.shape}")
        
        # Show sample
        print(f"\n📋 Sample of paginated data:")
        display(df_paginated[['id', 'userId', 'title']].head())
    else:
        print(f"  ❌ Paginated fetch failed")
    
    # Test multiple endpoints
    print("\n3️⃣ Testing multiple endpoints:")
    endpoints = ['users', 'albums', 'photos']
    
    for endpoint in endpoints:
        result = api_system.fetch_data(endpoint, params={'_limit': 3})
        if result['success']:
            print(f"  ✅ {endpoint}: {result['records_count']} records")
        else:
            print(f"  ❌ {endpoint}: {result['error']}")

finally:
    # Always close the system
    api_system.close()
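
`fetch_data` above implements its retry loop by hand. As an alternative sketch, `requests` can delegate retries to urllib3's `Retry` class through an `HTTPAdapter` mounted on the session. The helper name `build_retrying_session` and its defaults are assumptions made for this example; the status list mirrors the retryable codes used in this notebook.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(total_retries=3, backoff_factor=1.0):
    """Return a Session that retries transient failures (hypothetical helper)."""
    retry_policy = Retry(
        total=total_retries,                         # Maximum number of retries per request
        backoff_factor=backoff_factor,               # Scales the growing delay between retries
        status_forcelist=(429, 500, 502, 503, 504),  # Statuses treated as retryable
    )
    adapter = HTTPAdapter(max_retries=retry_policy)

    session = requests.Session()
    # Apply the same retry policy to both URL schemes
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = build_retrying_session()
```

With this approach the retry delays and `Retry-After` handling are left to urllib3, which by default only retries idempotent methods such as GET. One difference to keep in mind: once retries are exhausted on a listed status, `requests` raises `requests.exceptions.RetryError` instead of returning the response, so the calling code should catch it.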

## 🎯 Key Takeaways

Congratulations! You've completed the API integration tutorial. Here's what you've mastered:

### ✅ **Core API Skills**
- **🌐 HTTP Requests**: Making GET requests with proper parameters
- **🔐 Authentication**: Implementing API keys, Bearer tokens, and Basic auth
- **⏰ Rate Limiting**: Respecting API limits and implementing delays
- **🔄 Retry Logic**: Handling failures with exponential backoff
- **📄 Pagination**: Fetching large datasets across multiple pages
- **🛡️ Error Handling**: Graceful handling of various HTTP errors

### ✅ **Production-Ready Features**
- **📊 Statistics Tracking**: Monitoring API performance and success rates
- **🔗 Session Management**: Efficient connection pooling
- **⏱️ Timeout Handling**: Preventing hanging requests
- **📝 Error Tracking**: Recording failed calls with endpoint, message, and timestamp for debugging
- **🎯 Flexible Configuration**: Customizable rate limits and retry policies

### ✅ **Real-World Applications**
- **📈 Business Intelligence**: Fetching sales, customer, and product data
- **🔄 Data Synchronization**: Keeping systems in sync with external APIs
- **📊 Analytics Pipelines**: Real-time data ingestion for dashboards
- **🤖 Automation**: Automated data collection and processing

---

## 🚀 What's Next?

In the next tutorial, **"04_data_validation.ipynb"**, you'll learn:
- 🔍 How to validate data quality and completeness
- 📋 Schema validation and business rule checking
- 📊 Data quality scoring and reporting
- ⚠️ Handling validation failures gracefully
- 🎯 Building robust validation pipelines

### 🎯 **Practice Exercise**

Before moving to the next tutorial, try this exercise:

1. **Find a public API** (like OpenWeatherMap, GitHub, or News API)
2. **Register for an API key** if required
3. **Use our APIIngestionSystem** to fetch data from that API
4. **Implement custom data transformation** for that API's response format
5. **Add error handling** specific to that API's error responses
6. **Create visualizations** of the data you collected
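
As a possible starting point for step 4, here is a small sketch that flattens a nested JSON record into a single-level dict, which is easier to load into a DataFrame. The function name `flatten_record` and the sample payload are made up for illustration; a real API's response will have a different shape, and this sketch leaves lists untouched.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep` (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys
            flat.update(flatten_record(value, full_key, sep))
        else:
            # Scalars and lists are kept as-is
            flat[full_key] = value
    return flat


# Made-up payload for illustration only
sample_response = {
    "id": 1,
    "name": "Example City",
    "main": {"temp": 21.5, "humidity": 40},
    "coord": {"lat": 10.0, "lon": 20.0},
}

flat = flatten_record(sample_response)
print(flat)
```

A list of records flattened this way can be passed to `pd.DataFrame(...)`. pandas also offers `pd.json_normalize`, which performs a similar flattening and may be the simpler choice for many responses.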

### 💡 **Recommended Public APIs for Practice:**
- **JSONPlaceholder** (https://jsonplaceholder.typicode.com/) - No auth required
- **OpenWeatherMap** (https://openweathermap.org/api) - Free tier available
- **GitHub API** (https://docs.github.com/en/rest) - Public endpoints available
- **News API** (https://newsapi.org/) - Free tier available
- **REST Countries** (https://restcountries.com/) - No auth required

---

**Excellent work mastering API integration! 🎉**

You now have the skills to integrate with virtually any REST API. In the next tutorial, we'll focus on ensuring the data you collect is high-quality and reliable.

**Happy API Calling! 🚀**