# Reading JSON Files with Pandas - Complete Guide

This notebook demonstrates various techniques for reading and processing JSON data from:
- **Local files** - JSON files stored on your computer
- **Remote APIs** - JSON data from web services and APIs

## Table of Contents:
1. **Libraries Import** - Essential Python libraries for JSON handling
2. **Reading Local JSON Files** - Loading JSON from disk using pd.read_json()
3. **Reading from APIs with HTTP Requests** - Fetching JSON from web endpoints
4. **Parsing Nested JSON** - Converting complex JSON structures to DataFrames using json_normalize()

---

In [None]:
# Section 1: Import Essential Libraries
# ======================================

# pandas - For reading JSON and manipulating data in DataFrames
import pandas as pd

# matplotlib.pyplot - For creating visualizations and plots
import matplotlib.pyplot as plt

# numpy - For numerical computing and array operations
import numpy as np

## Section 2: Reading JSON from Local Files

You can read JSON files stored on your computer directly with `pd.read_json()`. This method automatically:
- Parses the JSON format
- Converts it into a pandas DataFrame
- Keeps nested objects and arrays as Python dicts and lists inside cells rather than flattening them

**JSON File Formats:**
- **Flat JSON** - Simple key-value pairs that convert directly to DataFrame columns
- **Nested JSON** - Complex structures that may require preprocessing (flattening)
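
As a minimal sketch of the difference between the two layouts, the snippet below reads two small in-memory JSON strings through `io.StringIO` instead of files. The sample records (`flat_json`, `nested_json`) are made up for illustration and are not taken from `train.json`.

```python
import io

import pandas as pd

# Hypothetical sample data, not from train.json
flat_json = '[{"id": 1, "cuisine": "greek"}, {"id": 2, "cuisine": "indian"}]'
nested_json = '[{"id": 1, "info": {"cuisine": "greek", "spicy": false}}]'

# Flat records: each key becomes a column
flat_df = pd.read_json(io.StringIO(flat_json))
print(flat_df.columns.tolist())

# Nested records: the inner object stays a dict inside a single "info" column
nested_df = pd.read_json(io.StringIO(nested_json))
print(type(nested_df.loc[0, "info"]))
```

A nested column like `info` usually needs a flattening step such as `pd.json_normalize()` before it is convenient to analyze.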

In [None]:
# Example: Reading a JSON file from disk
# ========================================

# pd.read_json() reads a JSON file and converts it directly into a pandas DataFrame
# The file path can be relative (in current directory) or absolute (full path)
df = pd.read_json("train.json")

# Display the DataFrame to inspect its rows and columns (use df.dtypes for the data types)
df

| | id | cuisine | ingredients |
|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... |
| ... | ... | ... | ... |
| 39769 | 29109 | irish | [light brown sugar, granulated sugar, butter, ... |
| 39770 | 11462 | italian | [KRAFT Zesty Italian Dressing, purple onion, b... |
| 39771 | 2238 | irish | [eggs, citrus fruit, raisins, sourdough starte... |
| 39772 | 41882 | chinese | [boneless chicken skinless thigh, minced garli... |


## Section 3: Fetching JSON from Remote APIs

Many online services provide data via REST APIs that return JSON. To fetch JSON from a URL:
1. Use the `requests` library to make HTTP GET requests
2. Parse the JSON response using `.json()`
3. Convert nested structures to DataFrames using `pd.json_normalize()`

**Why use the `requests` library instead of `pd.read_json()` with URLs?**
- Some APIs require custom headers (User-Agent, API keys)
- Better error handling and control over the request
- Can handle authentication
- More flexible for complex API interactions
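
The distinction between query parameters and headers can be checked without sending anything over the network. The sketch below builds a request with `requests.Request(...).prepare()` and inspects the result; the URL is the World Bank endpoint used in this notebook, and the `User-Agent` value is just an example.

```python
import requests

url = "https://api.worldbank.org/v2/countries/IND/indicators/NY.GDP.MKTP.CD"

# params become the query string; headers travel separately from the URL
request = requests.Request(
    "GET",
    url,
    params={"per_page": 5000, "format": "json"},
    headers={"User-Agent": "Mozilla/5.0"},
)
prepared = request.prepare()  # builds the request without sending it

print(prepared.url)
print(prepared.headers["User-Agent"])
```

Note that in `requests.get(url, some_dict)` the second positional argument is `params`, so headers should always be passed by keyword as `headers=...`.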

**Common API Response Structures:**
- Simple array: `[{data}, {data}]` → directly converts to DataFrame
- Nested object: `{"results": [{data}]}` → need to extract the array first
- Multi-level nesting: `{"metadata": {...}, "data": [{...}]}` → use json_normalize()
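
The three response shapes above can be sketched with small in-memory payloads. All field names and values here are invented for illustration; a real API will use its own keys.

```python
import pandas as pd

# Hypothetical payloads illustrating the three shapes
simple_array = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
nested_object = {"results": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}
multi_level = {
    "metadata": {"source": "example"},
    "data": [{"id": 1, "owner": {"name": "a", "city": "x"}}],
}

# Simple array: pass the list of records straight to the constructor
df_simple = pd.DataFrame(simple_array)

# Nested object: pull out the array under its key first
df_nested = pd.DataFrame(nested_object["results"])

# Multi-level nesting: record_path selects the array, meta copies outer fields
df_multi = pd.json_normalize(
    multi_level,
    record_path="data",
    meta=[["metadata", "source"]],
)
print(df_multi.columns.tolist())
```

In the last case the nested `owner` object is expanded into dotted column names, and the `metadata.source` value is repeated on every row.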

In [None]:
# Example 1: Fetching GDP Data from World Bank API
# ==================================================

# Import the requests library to make HTTP requests
import requests

# Make a GET request to the World Bank API
# URL parameters:
#   - IND = India country code
#   - NY.GDP.MKTP.CD = GDP (current US$) indicator code
#   - per_page=5000 = Request up to 5000 records per page
#   - format=json = Request response in JSON format
# The User-Agent header identifies the client; some APIs require this
# Pass it with headers=...; the second positional argument of requests.get() is params

response = requests.get(
    "https://api.worldbank.org/v2/countries/IND/indicators/NY.GDP.MKTP.CD?per_page=5000&format=json",
    headers={'User-Agent': 'Mozilla/5.0'}
)

# Parse the JSON body of the response
# response.json() converts the JSON text into Python objects (here, a list)
data = response.json()

# World Bank API returns data in a list with 2 elements:
#   [0] = metadata (info about the response)
#   [1] = actual data array (the records we want)

# pd.json_normalize() flattens nested JSON structures into a flat DataFrame
# This is useful when the JSON has nested objects or lists
df = pd.json_normalize(data[1])  # data[1] contains the actual GDP records

# Display the resulting DataFrame
df

| | countryiso3code | date | value | unit | obs_status | decimal | indicator.id | indicator.value | country.id | country.value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IND | 2024 | 3.909892e+12 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
| 1 | IND | 2023 | 3.638489e+12 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
| 2 | IND | 2022 | 3.346107e+12 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
| 3 | IND | 2021 | 3.167271e+12 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
| 4 | IND | 2020 | 2.674852e+12 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | IND | 1964 | 5.648029e+10 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
| 61 | IND | 1963 | 4.842192e+10 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
| 62 | IND | 1962 | 4.216148e+10 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
| 63 | IND | 1961 | 3.923244e+10 | | | 0 | NY.GDP.MKTP.CD | GDP (current US$) | IN | India |
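
Because the records live in the second element, `data[1]` raises an `IndexError` if a response ever has fewer than two elements (for example, an error message instead of data). A minimal sketch of a defensive check follows; `split_payload` is a hypothetical helper, and the sample payload is invented to mimic the two-element layout rather than copied from a real response.

```python
import pandas as pd


def split_payload(payload):
    """Return (metadata, records) from a [metadata, records] style payload."""
    if not isinstance(payload, list) or len(payload) < 2:
        raise ValueError(f"Unexpected payload layout: {payload!r}")
    metadata, records = payload[0], payload[1]
    if not isinstance(records, list):
        raise ValueError("Expected a list of records in the second element")
    return metadata, records


# Invented sample in the same shape as the response above
sample = [
    {"page": 1, "pages": 1, "total": 2},
    [
        {"date": "2024", "value": 2.0, "country": {"id": "IN", "value": "India"}},
        {"date": "2023", "value": 1.0, "country": {"id": "IN", "value": "India"}},
    ],
]

metadata, records = split_payload(sample)
df_sample = pd.json_normalize(records)
print(metadata["total"], df_sample.columns.tolist())
```

Checking the layout first turns a confusing indexing error into a message that shows what the API actually returned.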


In [None]:
# Example 2: Fetching Historical Events Data
# ============================================

# Import requests library for HTTP communication
import requests

# Make a GET request to the Vizgr Historical Events API
# URL parameters:
#   - format=json = Request JSON format
#   - begin_date=-3000000 = Start date; read in the same YYYYMMDD layout as end_date, this would be the year 300 BCE
#   - end_date=20151231 = End at December 31, 2015
#   - lang=en = Return English language descriptions
# User-Agent header is required by this API to avoid rejection

response = requests.get(
    "https://www.vizgr.org/historical-events/search.php?format=json&begin_date=-3000000&end_date=20151231&lang=en",
    headers={'User-Agent': 'Mozilla/5.0'}
)

# Parse the JSON response into Python objects
data = response.json()

# The indexing below assumes the same two-element layout as the World Bank API:
#   [0] = metadata about the response
#   [1] = list of historical event records
# This layout is not confirmed here; inspect type(data) and its keys first and adjust if it differs

# Use json_normalize() to flatten the event records into a DataFrame
# Nested objects within each event become separate columns with dotted names
df = pd.json_normalize(data[1])  # data[1] is assumed to contain the list of historical events

# Display the DataFrame with all historical events
df

## Key Concepts Summary

### JSON Data Structures:
- **Simple JSON** - Direct key-value pairs: `{"name": "John", "age": 30}`
- **JSON Array** - Multiple records: `[{...}, {...}, {...}]`
- **Nested JSON** - Complex hierarchies: `{"user": {"name": "John", "address": {...}}}`

### Pandas Methods:
- **`pd.read_json()`** - Simple way to read JSON from files or URLs
  - Works well for flat JSON structures
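  - Keeps nested objects and arrays as dicts and lists inside cells, so they need a separate flattening step
  
- **`pd.json_normalize()`** - Flattens nested JSON structures
  - Converts nested objects into separate columns with dotted names (e.g. `indicator.id`)
  - Leaves arrays within objects as lists unless you expand them with `record_path` and `meta`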
  - Essential for APIs that return complex nested responses
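
To illustrate how `pd.json_normalize()` treats nested objects versus arrays, here is a small sketch on invented records (the field names are hypothetical). Nested objects are expanded into dotted columns, while a list stays in one cell unless `record_path` is given.

```python
import pandas as pd

# Invented records: "author" is a nested object, "tags" is an array of objects
records = [
    {"id": 1, "author": {"name": "Ann"}, "tags": [{"label": "x"}, {"label": "y"}]},
    {"id": 2, "author": {"name": "Bob"}, "tags": [{"label": "z"}]},
]

# Default call: one row per record, "tags" remains a Python list
wide = pd.json_normalize(records)
print(wide.columns.tolist())

# record_path expands the array to one row per tag; meta repeats outer fields
tall = pd.json_normalize(
    records,
    record_path="tags",
    meta=["id", ["author", "name"]],
)
print(tall)
```

The `wide` frame has two rows and the `tall` frame has three, one per tag, which is the trade-off to keep in mind when choosing between the two calls.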

### Best Practices:
1. **Always handle errors** - APIs can fail, URLs can be unreachable
2. **Include User-Agent headers** - Some APIs reject requests without one; pass them with the `headers=` keyword
3. **Explore the response first** - Print `data` to understand the structure before normalizing
4. **Use appropriate methods**:
   - Flat JSON → `pd.read_json()`
   - Nested JSON → `pd.json_normalize()`
5. **Check API documentation** - Understand rate limits, authentication, and response format
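
The first three practices can be combined in a small helper. This is only a sketch: `fetch_json` and `describe` are hypothetical names, the 10-second timeout is an arbitrary choice, and the optional `session` argument exists so the function can be exercised with a stand-in object instead of a live API.

```python
import requests


def fetch_json(url, params=None, timeout=10, session=None):
    """Fetch a URL and return its parsed JSON, raising on failures."""
    if session is None:
        session = requests.Session()
    response = session.get(
        url,
        params=params,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=timeout,
    )
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    # Raises a ValueError subclass if the body is not valid JSON
    return response.json()


def describe(data):
    """Summarize the top-level structure before normalizing."""
    if isinstance(data, dict):
        return f"dict with keys {sorted(data)}"
    if isinstance(data, list):
        return f"list of {len(data)} items"
    return type(data).__name__
```

A call such as `fetch_json(url, params={"format": "json"})` would typically be wrapped in `try`/`except requests.RequestException` so that timeouts, connection errors, and HTTP errors are reported instead of stopping the notebook, followed by `print(describe(data))` to see the structure before calling `pd.json_normalize()`.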