# Day 2, Block B: API Basics & JSON Navigation

**Duration:** 30-35 minutes  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data  
**Instructor:** Eduardo Ariño de la Rubia

---

## Learning Objectives

By the end of this session, you will be able to:

1. **Understand** REST API fundamentals (GET requests, status codes, JSON responses)
2. **Fetch data** from a public API using Python's `requests` library
3. **Navigate** JSON structures (dicts, arrays, nested objects)
4. **Access** nested JSON data safely using the `.get()` method
5. **Handle** missing keys gracefully to prevent crashes

---


## Part 1: REST API Fundamentals (⏱️ 10-12 minutes)

### Why APIs Matter for Business

> **"Modern businesses run on APIs. Every app, every dashboard, every analytics pipeline starts with data from somewhere else."**

**Real-world examples:**
- **Stripe API:** Process payments, track transactions
- **Shopify API:** Pull order data, inventory levels, customer information
- **Salesforce API:** CRM data for customer analytics
- **Google Analytics API:** Website traffic and user behavior
- **Twitter/LinkedIn APIs:** Social media sentiment analysis

**As a data professional, you'll spend significant time working with APIs.** This is how modern data pipelines start.

---


### What is REST?

**REST** = **RE**presentational **S**tate **T**ransfer

It's the most common style of web API for accessing data. Think of it as "requesting information from a specific web address."

**Key concepts:**

1. **URL/Endpoint:** The web address that returns data
   - Example: `https://dummyjson.com/products`

2. **HTTP Methods:** What you want to do
   - `GET` - Retrieve data (most common for data pipelines)
   - `POST` - Send data
   - `PUT` - Update data
   - `DELETE` - Remove data

3. **Status Codes:** Did it work?
   - `200` - Success!
   - `404` - Not Found
   - `429` - Too Many Requests (you hit the API's rate limit)
   - `500` - Internal Server Error (the problem is on the API's side)

4. **Response Format:** Usually JSON (JavaScript Object Notation)
   - Structured data that's easy to parse
   - Readable by humans and machines
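
The status codes listed above fall into ranges: 2xx means success, 4xx means the client did something wrong, and 5xx means the server failed. The sketch below uses Python's standard `http.HTTPStatus` to look up the official phrase for each code. `describe_status` is a hypothetical helper written for illustration; it is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard phrase, and a rough category."""
    # HTTPStatus(code) raises ValueError for codes it does not know
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 429, 500):
    print(describe_status(code))
```

With `requests`, the same number is available as `response.status_code`.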

---


### Today's API: DummyJSON

We'll use [DummyJSON](https://dummyjson.com) - a free, public API that simulates an e-commerce product catalog.

**Why DummyJSON?**
- ✅ No authentication required (perfect for learning)
- ✅ Reliable and fast
- ✅ Real-world data structure (products, reviews, categories)
- ✅ Similar to Shopify/Amazon product APIs

**Available endpoints:**
- `/products` - All products
- `/products/1` - Single product by ID
- `/products?limit=10` - Limit results
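
Each endpoint above is just the base URL plus a path, optionally followed by `?key=value` query parameters. As a rough sketch of how those pieces combine, the hypothetical helper `build_url` below assembles them with the standard library. When you pass `params=` to `requests.get`, the library builds the query string for you in a similar way.

```python
from urllib.parse import urlencode

BASE_URL = "https://dummyjson.com"


def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and optional query parameters."""
    url = f"{BASE_URL}{path}"
    if params:
        # urlencode turns {'limit': 10} into 'limit=10' and escapes special characters
        url += "?" + urlencode(params)
    return url


print(build_url("/products"))
print(build_url("/products/1"))
print(build_url("/products", limit=10))
```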

Let's fetch some data!

---


In [None]:
# Setup: Import libraries
import requests
import json
from pprint import pprint

print("✅ Libraries imported successfully")

In [None]:
# OPTION 1: Use live API (DEFAULT)
# This is the standard way to fetch data from an API

DUMMYJSON_URL = "https://dummyjson.com/products"

# Make HTTP GET request
response = requests.get(DUMMYJSON_URL, params={'limit': 10}, timeout=10)

# Check if request was successful
response.raise_for_status()

# Parse JSON response
products_data = response.json()

print(f"✅ Successfully fetched {len(products_data['products'])} products from API")
print(f"   Status code: {response.status_code}")
print(f"   Response size: {len(response.content)} bytes")

In [None]:
# OPTION 2: Use backup file (if API is down)
# Uncomment these lines and skip the cell above if DummyJSON is unavailable

# import json
# with open('../../data/day2/block_b/products_backup.json') as f:
#     products_data = json.load(f)
# 
# print(f"✅ Loaded {len(products_data['products'])} products from backup file")

### What Did We Just Get?

We made an HTTP GET request and received a JSON response. Let's inspect the structure:

**We now have two objects:**
- `response`: the object returned by `requests.get()` (status code, headers, raw bytes)
- `products_data`: a Python dict parsed from the JSON body

Let's look at what's inside:

---


In [None]:
# Inspect the response structure
print("Top-level keys in response:")
print(products_data.keys())

print("\nMetadata:")
print(f"  Total products available: {products_data['total']}")
print(f"  Returned in this response: {products_data['limit']}")
print(f"  Skipped (pagination): {products_data['skip']}")

print(f"\nActual products array has {len(products_data['products'])} items")

---
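
The `total`, `limit`, and `skip` fields printed above describe pagination: the API returns one page of `limit` records at a time, starting after `skip` records. A minimal sketch of how you might plan the requests needed to cover every record follows. `page_offsets` is a hypothetical helper, and it assumes the API accepts `skip` as a query parameter alongside `limit`, which this notebook does not demonstrate.

```python
def page_offsets(total: int, limit: int) -> list[int]:
    """Return the `skip` value for each page needed to cover `total` records."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return list(range(0, total, limit))


# 25 records fetched 10 at a time need three requests
for skip in page_offsets(total=25, limit=10):
    print({"limit": 10, "skip": skip})
```

Each printed dict could be passed as `params=` in a separate request, under the assumption stated above.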

## Part 2: JSON Structure Deep Dive (⏱️ 10-12 minutes)

### JSON Basics

**JSON** (JavaScript Object Notation) is a text format for storing and transporting data.

**JSON has three building blocks:**

1. **Objects** (Python dicts) - Key-value pairs in curly braces
   ```json
   {"name": "Widget", "price": 9.99}
   ```

2. **Arrays** (Python lists) - Ordered lists in square brackets
   ```json
   ["red", "blue", "green"]
   ```

3. **Primitives** - Numbers, strings, booleans, null
   ```json
   42, "hello", true, null
   ```

**The power and the challenge:** JSON can be **nested** (objects within objects, arrays within objects).
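
To see how the building blocks above map onto Python types, the short example below parses a made-up JSON string with the standard `json` module. The sample record is invented for illustration and is not taken from the API.

```python
import json

# A small, invented JSON document containing every building block
raw_text = '{"name": "Widget", "price": 9.99, "tags": ["red", "blue"], "in_stock": true, "discount": null}'

parsed = json.loads(raw_text)  # JSON text -> Python objects

for key, value in parsed.items():
    print(f"{key:10s} -> {type(value).__name__}")
```

JSON `true`/`false` become Python `True`/`False`, and JSON `null` becomes `None`.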

---


### Exploring a Product's Structure

Let's look at a single product to understand the nesting:

---


In [None]:
# Get the first product
first_product = products_data['products'][0]

# Display all fields
print("Product fields and their types:")
print("-" * 50)
for key, value in first_product.items():
    print(f"{key:20s} : {type(value).__name__:10s}")

print("\n" + "=" * 50)
print("Full product data:")
print("=" * 50)
pprint(first_product)

### Key Observations

Notice the nested structures:

1. **Nested object (dict):**
   ```json
   "dimensions": {
       "width": 23.17,
       "height": 14.43,
       "depth": 28.01
   }
   ```

2. **Array of strings (list):**
   ```json
   "tags": ["beauty", "mascara"]
   ```

3. **Array of objects (one-to-many relationship):**
   ```json
   "reviews": [
       {"rating": 5, "comment": "Great!", "reviewerName": "Alice"},
       {"rating": 4, "comment": "Good", "reviewerName": "Bob"}
   ]
   ```

**This nesting is why we need normalization** (we'll tackle that in the next notebook).

---


### Accessing Data at Different Levels

Let's practice navigating this structure:

---


In [None]:
# Access top-level fields (simple)
product_id = first_product['id']
product_title = first_product['title']
product_price = first_product['price']
product_category = first_product['category']

print("Top-level access:")
print(f"  Product: {product_title}")
print(f"  Price: ${product_price}")
print(f"  Category: {product_category}")
print(f"  ID: {product_id}")

In [None]:
# Access nested object (dict within dict)
dimensions = first_product['dimensions']
width = dimensions['width']
height = dimensions['height']
depth = dimensions['depth']

# Or access directly in one line:
width_direct = first_product['dimensions']['width']

print("Nested object access:")
print(f"  Dimensions: {width} × {height} × {depth} cm")
print(f"  Width (direct access): {width_direct} cm")

In [None]:
# Access array of strings
tags = first_product['tags']

print("Array access:")
print(f"  Tags ({len(tags)}): {', '.join(tags)}")
print(f"  First tag: {tags[0]}")
print(f"  Last tag: {tags[-1]}")

In [None]:
# Access array of objects (reviews)
reviews = first_product['reviews']

print(f"Reviews ({len(reviews)} total):")
print("-" * 60)

for i, review in enumerate(reviews, 1):
    rating = review['rating']
    comment = review['comment']
    reviewer = review['reviewerName']
    
    print(f"Review {i}: {rating}⭐ - \"{comment}\" by {reviewer}")

---

## Part 3: Safe JSON Navigation (⏱️ 8-10 minutes)

### The Problem: Missing Keys Cause Crashes

**Real-world APIs are messy.** Not all records have all fields.

**What happens if we try to access a field that doesn't exist?**

---


In [None]:
# Demonstrate the problem: KeyError
try:
    # Try to access a field that doesn't exist
    fake_field = first_product['this_field_does_not_exist']
    print(f"Value: {fake_field}")
except KeyError as e:
    print(f"❌ KeyError: {e}")
    print("\n💡 The key doesn't exist, so Python raised a KeyError!")
    print("   Uncaught in a production pipeline, this would stop your entire job.")

### The Solution: `.get()` Method

Python dicts have a `.get(key, default)` method that:
- Returns the value if the key exists
- Returns the default value if the key doesn't exist (or `None` if you give no default)
- **Never raises `KeyError`!**

**This is the production-ready pattern for APIs.**
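
The same idea extends to nested data, with two pitfalls worth knowing. The sketch below uses an invented product record (the `meta` and `barcode` names are illustrative) to show chained `.get()` calls and what happens when a key exists but holds JSON `null`.

```python
# Invented sample record for illustration
product = {
    "title": "Widget",
    "dimensions": {"width": 23.17},
    "brand": None,  # JSON null: the key exists, but has no value
}

# Chain .get() calls, using an empty dict as the default for a missing level
width = product.get("dimensions", {}).get("width", "n/a")
barcode = product.get("meta", {}).get("barcode", "n/a")

# Pitfall: the default applies only when the key is absent, not when it is null
brand_raw = product.get("brand", "Unknown")      # returns None
# `or` swaps in the fallback for None (and for other falsy values such as 0 or "")
brand_safe = product.get("brand") or "Unknown"

print(width, barcode, brand_raw, brand_safe)
```

If a nested level itself can be `null`, `product.get("dimensions", {})` would return `None` and the next `.get()` would fail; `(product.get("dimensions") or {}).get("width")` guards against that case.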

---


In [None]:
# Safe access with .get()
brand = first_product.get('brand', 'Unknown')
warranty = first_product.get('warrantyInformation', 'No warranty info')
fake_field = first_product.get('this_field_does_not_exist', 'Field not available')

print("Safe access with .get():")
print(f"  Brand: {brand}")  # exists, returns actual value
print(f"  Warranty: {warranty}")  # exists, returns actual value
print(f"  Fake field: {fake_field}")  # doesn't exist, returns default

print("\n✅ No crash! The script continues running.")

### Practical Example: Handling Optional Fields

Let's process multiple products and handle missing data gracefully:

---


In [None]:
# Process the first 5 products, handling missing fields
print("Product Summary (handling missing data):")
print("=" * 70)

for product in products_data['products'][:5]:  # First 5 products
    # Required fields (we know these exist)
    title = product['title']
    price = product['price']
    
    # Optional fields (might not exist - use .get())
    brand = product.get('brand', 'Generic')
    discount = product.get('discountPercentage', 0)
    stock = product.get('stock', 'Unknown')
    
    # Calculate discounted price
    final_price = price * (1 - discount / 100)
    
    print(f"\n{title}")
    print(f"  Brand: {brand}")
    print(f"  Price: ${price:.2f} → ${final_price:.2f} (after {discount}% discount)")
    print(f"  Stock: {stock}")

print("\n" + "=" * 70)
print("✅ Processed 5 products without crashes!")

---

## Summary & What's Next

### What We Learned

✅ **REST API Fundamentals**
- APIs are how modern businesses access data
- HTTP GET requests fetch data from URLs
- Status codes tell us if the request succeeded

✅ **JSON Structure**
- JSON has objects (dicts), arrays (lists), and primitives
- Nesting creates complex structures (and challenges!)
- Accessing nested data requires navigating levels

✅ **Safe Navigation**
- Direct access (`product['key']`) raises `KeyError` if the key is missing
- `.get(key, default)` provides safe access
- Always use `.get()` for optional fields in production code

### Production Checklist

When working with APIs in real projects:

- [ ] Use `requests` library for HTTP calls
- [ ] Set `timeout` parameter (don't wait forever!)
- [ ] Check `response.status_code` or use `.raise_for_status()`
- [ ] Use `.get(key, default)` for optional fields
- [ ] Test with actual API before writing pipeline logic

---

## What's Next: Normalization & DuckDB

We now know how to:
1. Fetch JSON from an API ✅
2. Navigate nested structures ✅

**Next notebook:** We'll learn to:
3. **Normalize** nested JSON into tidy tables (one-to-many relationships)
4. **Persist** to DuckDB for SQL analysis
5. **Join** tables to answer business questions

**This is the complete modern data pipeline: API → Normalize → DuckDB → SQL → Insights**

Let's take a short break, then continue to Notebook 2!

---

### Bonus: Production Patterns (Reference)

For production systems, you'll also want to learn:
- **`requests.Session()`** for connection pooling (reuses connections, which is usually faster when you make many requests to the same host)
- **`tenacity`** library for automatic retry logic (handle transient failures)
- **Rate limiting** strategies (respect API limits)
- **Authentication** (API keys, OAuth tokens)

**See:** `references/api_pipeline_quick_reference.md` for production patterns.
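
To give a feel for what retry logic does, here is a hand-rolled sketch of retrying with exponential backoff. It is a simplified stand-in, not the `tenacity` API: `retry_with_backoff` and `flaky_fetch` are hypothetical names, and real code would catch a narrower exception type than `Exception`.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky API call: fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return {"status": "ok"}


# Record the delays instead of actually sleeping, so the demo runs instantly
delays = []
result = retry_with_backoff(flaky_fetch, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter is a design choice that keeps the function easy to test; in a pipeline you would leave it at the default `time.sleep`.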

---
