# Tier 1: Income Distribution & Inequality Analysis
## Descriptive Analytics - Census ACS API Integration

---

### **Notebook Metadata**
- **Author**: Khipu Analytics Suite
- **Version**: v1.0
- **Date Created**: October 8, 2025
- **Last Updated**: October 10, 2025
- **Tier**: Tier 1 - Descriptive Analytics
- **UUID**: `tier1-income-acs-001`

---

### **Data Sources**
- **Primary**: U.S. Census Bureau - American Community Survey (ACS) 5-Year Estimates
- **API**: `https://api.census.gov/data/2023/acs/acs5`
- **Coverage**: All 3,143 U.S. counties; 50 states plus DC (the state-level pull below also returns Puerto Rico, for 52 records)
- **Temporal**: 2019-2023 (5-year estimates)

---

### **Key Metrics**
| Variable Code | Description | Unit |
|--------------|-------------|------|
| B19013_001E | Median Household Income | USD |
| B19083_001E | Gini Index of Income Inequality | Index (0-1) |
| B19082_001E | Share of Aggregate Household Income, Lowest Quintile | Percent |
| B19001_* | Household Income Distribution | Count by bracket |
| B17001_002E | Population Below Poverty Line | Count |

---

### **Analytical Models**
1. **Descriptive Statistics**: Mean, median, percentiles, standard deviation
2. **Inequality Measures**: Gini coefficient, Lorenz curve, Theil index
3. **Distribution Analysis**: Histograms, kernel density estimation
4. **Spatial Analysis**: Geographic patterns, choropleth mapping
5. **Correlation Analysis**: Income vs poverty, education, employment
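
The cells shown in this notebook read the Gini index directly from `B19083_001E`; the Lorenz curve and Theil index listed above are not computed there. A minimal sketch of how the three measures could be derived from household-level incomes follows. The function names and the `sample` values are hypothetical and are not part of the notebook or the ACS data.

```python
import math

def gini(incomes):
    """Gini coefficient from the sorted-rank formula (0 = perfect equality)."""
    x = sorted(float(v) for v in incomes)
    n = len(x)
    total = sum(x)
    weighted = sum(rank * value for rank, value in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

def theil_t(incomes):
    """Theil T index; defined here only for strictly positive incomes."""
    x = [float(v) for v in incomes]
    mean = sum(x) / len(x)
    return sum((v / mean) * math.log(v / mean) for v in x) / len(x)

def lorenz_curve(incomes):
    """Return (population share, cumulative income share) points from (0, 0)."""
    x = sorted(float(v) for v in incomes)
    total = sum(x)
    points, running = [(0.0, 0.0)], 0.0
    for i, value in enumerate(x, start=1):
        running += value
        points.append((i / len(x), running / total))
    return points

# Hypothetical household incomes in USD, for illustration only
sample = [18_000, 32_000, 47_000, 61_000, 250_000]
print(f"Gini:  {gini(sample):.3f}")
print(f"Theil: {theil_t(sample):.3f}")
print(f"Last Lorenz point: {lorenz_curve(sample)[-1]}")
```

Both indices are zero when every household has the same income and grow as income concentrates at the top. Published ACS tables report brackets rather than individual households, so applying this to `B19001_*` counts would need an extra assumption about incomes within each bracket.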

---

### **Business Applications**
1. **Market Segmentation**: Identify high-income vs low-income regions for targeted marketing
2. **Policy Analysis**: Assess effectiveness of income inequality reduction programs
3. **Investment Strategy**: Location decisions based on income demographics
4. **Social Impact**: Measure disparities for nonprofit resource allocation
5. **Economic Development**: Benchmark regional economic health and growth potential

---

### **Prerequisites**
- None (foundational tier)

### **Next Steps**
- **Tier 2**: `Tier2_Income_Prediction_ACS.ipynb` - Predictive modeling of income determinants
- **Tier 3**: Income trend forecasting with ARIMA/Prophet
- **Tier 6**: Spatial econometric models of income spillover effects

---

## 1. Setup & Configuration

In [1]:
# ============================================================================
# IMPORT LIBRARIES
# ============================================================================

import os
import sys
import json
import requests
import pandas as pd
import numpy as np
import warnings
from datetime import datetime
from pathlib import Path

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats
from scipy.stats import gaussian_kde

# Suppress warnings
warnings.filterwarnings('ignore')

# Configure display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')

print("Libraries imported successfully")
print(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Libraries imported successfully
Analysis Date: 2025-10-10 13:13:33


In [2]:
# ============================================================================
# CONFIGURATION - SECURE API KEY MANAGEMENT
# ============================================================================

# API Keys - Load from environment or config file
def load_api_keys():
    """
    Load API keys from environment variables or config file.
    Priority: config file first; environment variables are consulted only
    when the config file is missing.
    """
    api_keys = {}
    
    # Try config file first for consistency
    config_path = Path('../../../QuipuLabs-khipu/configs/apikeys')
    if config_path.exists():
        try:
            with open(config_path, 'r') as f:
                for line in f:
                    line = line.strip()
                    # Handle both "CENSUS API:" and "CENSUS API KEY:" formats
                    if 'CENSUS API' in line:
                        # Split on first colon and take everything after it
                        parts = line.split(':', 1)
                        if len(parts) == 2:
                            api_keys['census'] = parts[1].strip()
                    elif 'BLS API KEY:' in line:
                        api_keys['bls'] = line.split(':', 1)[1].strip()
                    elif 'FRED API KEY:' in line:
                        api_keys['fred'] = line.split(':', 1)[1].strip()
                    elif 'BEA API KEY:' in line:
                        api_keys['bea'] = line.split(':', 1)[1].strip()
            
            print(f"API keys loaded from config file: {config_path.resolve()}")
            if 'census' in api_keys:
                print(f"Census API key found: {api_keys['census'][:8]}...")
        except Exception as e:
            print(f"WARNING: Could not load config file: {e}")
    else:
        print(f"WARNING: Config file not found: {config_path.resolve()}")
        print(f"Trying environment variables...")
        
        # Fall back to environment variables
        api_keys['census'] = os.getenv('CENSUS_API_KEY')
        api_keys['bls'] = os.getenv('BLS_API_KEY')
        api_keys['fred'] = os.getenv('FRED_API_KEY')
        api_keys['bea'] = os.getenv('BEA_API_KEY')
    
    return api_keys

# Load API keys
API_KEYS = load_api_keys()
# Use `or` so that a None value (unset environment variable) also falls back to the placeholder
CENSUS_API_KEY = API_KEYS.get('census') or 'YOUR_CENSUS_API_KEY_HERE'

# Base URL for ACS 5-year estimates
ACS_BASE_URL = "https://api.census.gov/data/2023/acs/acs5"

# Income & Poverty Variables (verified against Census API documentation)
ACS_VARIABLES = {
    'B19013_001E': 'median_household_income',     # Median Household Income
    'B19083_001E': 'gini_index',                  # Gini Index
    'B19082_001E': 'mean_household_income',       # NOTE: B19082_001E is the lowest-quintile share of aggregate income (%), not mean income; name kept for downstream cells
    'B17001_002E': 'population_below_poverty',    # Population Below Poverty
    'B17001_001E': 'population_for_poverty_status',  # Population for Poverty Determination
    'B01003_001E': 'total_population'             # Total Population
}

# Analysis Configuration
ANALYSIS_CONFIG = {
    'geographic_level': 'state',  # Options: 'state', 'county', 'metro', 'zip', 'place'
    'selected_states': None,  # None = all states, or list like ['06', '36', '48']
    'income_brackets': [25000, 50000, 75000, 100000, 150000, 200000],
    'poverty_threshold': 0.15,  # 15% poverty rate threshold
    'random_seed': 42
}

# Set reproducibility
np.random.seed(ANALYSIS_CONFIG['random_seed'])

print("Configuration loaded")
print(f"Geographic Level: {ANALYSIS_CONFIG['geographic_level']}")
print(f"Variables to fetch: {len(ACS_VARIABLES)}")
print(f"API Endpoint: {ACS_BASE_URL}")

if CENSUS_API_KEY and CENSUS_API_KEY != 'YOUR_CENSUS_API_KEY_HERE':
    print(f"\nCensus API key configured: {CENSUS_API_KEY[:8]}...")
    print(f"Full key length: {len(CENSUS_API_KEY)} characters")
    print(f"Available APIs: Census, BLS, FRED, BEA")
    print(f"API Documentation: https://api.census.gov/data/2023/acs/acs5/examples.html")
else:
    print("\nWARNING: Census API key not configured!")
    print("Keys should be in: ../../../QuipuLabs-khipu/configs/apikeys")
    print("Or set as environment variable: export CENSUS_API_KEY='your_key_here'")

API keys loaded from config file: /Users/bcdelo/Documents/GitHub/QuipuLabs-khipu/configs/apikeys
Census API key found: 19934324...
Configuration loaded
Geographic Level: state
Variables to fetch: 6
API Endpoint: https://api.census.gov/data/2023/acs/acs5

Census API key configured: 19934324...
Full key length: 40 characters
Available APIs: Census, BLS, FRED, BEA
API Documentation: https://api.census.gov/data/2023/acs/acs5/examples.html
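
The loader above falls back to environment variables only when the config file is missing, so a file that exists but lacks one key leaves that key unset. A minimal sketch of a per-key fallback, assuming the same `LABEL: value` line format, is shown below. The helper names `parse_key_lines`, `with_env_fallback`, and `mask_key` are hypothetical, and the sample lines are made-up placeholders rather than real keys.

```python
import os

# Label prefixes as they appear in the config file, mapped to short names
LABELS = {
    "CENSUS API": "census",   # matches both "CENSUS API:" and "CENSUS API KEY:"
    "BLS API KEY": "bls",
    "FRED API KEY": "fred",
    "BEA API KEY": "bea",
}

def parse_key_lines(lines):
    """Parse 'LABEL: value' lines into a dict; lines without a colon are skipped."""
    keys = {}
    for raw in lines:
        label, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        for prefix, name in LABELS.items():
            if label.strip().upper().startswith(prefix):
                keys[name] = value.strip()
                break
    return keys

def with_env_fallback(keys, environ=os.environ):
    """Fill any key missing from the file with <NAME>_API_KEY from the environment."""
    merged = dict(keys)
    for name in LABELS.values():
        if not merged.get(name):
            env_value = environ.get(f"{name.upper()}_API_KEY")
            if env_value:
                merged[name] = env_value
    return merged

def mask_key(value, visible=4):
    """Show only the first few characters when logging a key."""
    return value[:visible] + "*" * max(len(value) - visible, 0)

# Placeholder lines standing in for the config file contents
sample_lines = ["CENSUS API KEY: abc123", "FRED API KEY: xyz", "no colon here"]
file_keys = parse_key_lines(sample_lines)
all_keys = with_env_fallback(file_keys, {"BLS_API_KEY": "from-env"})
print({name: mask_key(value) for name, value in all_keys.items()})
```

Passing the environment as an argument keeps the function easy to test without touching real credentials.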


## 2. Data Ingestion - Census ACS API

In [3]:
# ============================================================================
# FETCH DATA FROM CENSUS ACS API
# ============================================================================

def fetch_acs_data(geographic_level='state', state_fips=None):
    """
    Fetch income and poverty data from Census ACS API.
    
    Based on official Census API documentation:
    https://api.census.gov/data/2023/acs/acs5/examples.html
    
    Parameters:
    -----------
    geographic_level : str
        Geographic aggregation level:
        - 'state'  : State-level data (50 states + DC, plus Puerto Rico)
        - 'county' : County-level data (~3,143 counties)
        - 'metro'  : Metropolitan/Micropolitan Statistical Areas
        - 'zip'    : ZIP Code Tabulation Areas (ZCTAs)
        - 'place'  : Incorporated places and Census Designated Places
    
    state_fips : str or None
        Specific state FIPS code (e.g., '06' for California)
        None = all states/geographies
    
    Returns:
    --------
    pandas.DataFrame with income and poverty metrics
    
    Notes:
    ------
    - API returns JSON array: [[header], [row1], [row2], ...]
    - First row contains column names
    - Subsequent rows contain data values
    """
    
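    # Build variable list for API request
    variables = ','.join(ACS_VARIABLES.keys())
    
    # ANALYSIS_CONFIG['selected_states'] may be a list; join it into a
    # comma-separated string (assumed to be accepted by the API) instead of
    # interpolating the Python list repr into the URL
    if isinstance(state_fips, (list, tuple)):
        state_fips = ','.join(state_fips)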
    
    # Construct API endpoint based on geographic level
    # Following official Census API patterns
    if geographic_level == 'state':
        # Example: https://api.census.gov/data/2023/acs/acs5?get=NAME,B01001_001E&for=state:*
        geography = "state:*"
    
    elif geographic_level == 'county':
        # Example: https://api.census.gov/data/2023/acs/acs5?get=NAME,B01001_001E&for=county:*&in=state:*
        if state_fips:
            geography = f"county:*&in=state:{state_fips}"
        else:
            geography = "county:*&in=state:*"
    
    elif geographic_level == 'metro':
        # Example: https://api.census.gov/data/2023/acs/acs5?get=NAME,B01001_001E&for=metropolitan%20statistical%20area/micropolitan%20statistical%20area:*
        geography = "metropolitan%20statistical%20area/micropolitan%20statistical%20area:*"
    
    elif geographic_level == 'zip':
        # Example: https://api.census.gov/data/2023/acs/acs5?get=NAME,B01001_001E&for=zip%20code%20tabulation%20area:*
        geography = "zip%20code%20tabulation%20area:*"
    
    elif geographic_level == 'place':
        # Example: https://api.census.gov/data/2023/acs/acs5?get=NAME,B01001_001E&for=place:*&in=state:*
        if state_fips:
            geography = f"place:*&in=state:{state_fips}"
        else:
            geography = "place:*&in=state:*"
    
    else:
        raise ValueError(f"Invalid geographic_level: {geographic_level}. "
                        f"Must be one of: 'state', 'county', 'metro', 'zip', 'place'")
    
    # Build full URL following Census API specifications
    url = f"{ACS_BASE_URL}?get=NAME,{variables}&for={geography}&key={CENSUS_API_KEY}"
    
    print(f"\nFetching {geographic_level}-level data from Census ACS API...")
    print(f"Geographic Level: {geographic_level}")
    if state_fips:
        print(f"State FIPS Filter: {state_fips}")
    print(f"URL: {url[:120]}...")
    
    try:
        # Make API request with 30-second timeout
        response = requests.get(url, timeout=30)
        
        # Check HTTP status first
        if response.status_code != 200:
            print(f"\nHTTP Error {response.status_code}")
            print(f"Response: {response.text[:500]}")
            if response.status_code == 400:
                print(f"Check API key and parameter format")
            elif response.status_code == 404:
                print(f"Invalid endpoint or geographic level")
            return None
        
        # Check if response is actually JSON
        content_type = response.headers.get('Content-Type', '')
        if 'application/json' not in content_type:
            print(f"\nWARNING: Response Content-Type is '{content_type}', expected 'application/json'")
            print(f"Response text (first 500 chars): {response.text[:500]}")
        
        # Parse JSON response
        # Format: [[column_names], [row1_data], [row2_data], ...]
        try:
            data = response.json()
        except json.JSONDecodeError as json_err:
            print(f"\nJSON Parsing Error: {json_err}")
            print(f"Response status code: {response.status_code}")
            print(f"Response headers: {dict(response.headers)}")
            print(f"Response text (first 1000 chars):")
            print(f"{response.text[:1000]}")
            return None
        
        # Validate response structure
        if not data or len(data) < 2:
            print(f"\nWARNING: API returned no data or invalid format")
            print(f"Data: {data}")
            return None
        
        # Convert to DataFrame
        # First row = column names, remaining rows = data
        df = pd.DataFrame(data[1:], columns=data[0])
        
        print(f"\nSuccessfully fetched {len(df):,} {geographic_level} records")
        print(f"Columns: {df.shape[1]}")
        print(f"Variables: {', '.join(ACS_VARIABLES.values())}")
        
        return df
    
    except requests.exceptions.Timeout:
        print(f"\nRequest timed out after 30 seconds")
        print(f"Try reducing the geographic scope or checking network connection")
        return None
    
    except requests.exceptions.RequestException as e:
        print(f"\nRequest Error: {e}")
        print(f"Error type: {type(e).__name__}")
        return None
    
    except Exception as e:
        print(f"\nUnexpected Error: {e}")
        print(f"Error type: {type(e).__name__}")
        import traceback
        print(f"Traceback: {traceback.format_exc()}")
        return None


# Fetch data based on configuration
print("\n" + "="*80)
print("DATA INGESTION - CENSUS ACS 5-YEAR ESTIMATES (2019-2023)")
print("="*80)

df_raw = fetch_acs_data(
    geographic_level=ANALYSIS_CONFIG['geographic_level'],
    state_fips=ANALYSIS_CONFIG['selected_states']
)

if df_raw is not None:
    print("\nRaw Data Sample:")
    print(df_raw.head())
    print(f"\nColumn Names: {list(df_raw.columns)}")
else:
    print("\nWARNING: Data fetch failed. Check configuration and API key.")


DATA INGESTION - CENSUS ACS 5-YEAR ESTIMATES (2019-2023)

Fetching state-level data from Census ACS API...
Geographic Level: state
URL: https://api.census.gov/data/2023/acs/acs5?get=NAME,B19013_001E,B19083_001E,B19082_001E,B17001_002E,B17001_001E,B01003_00...

Successfully fetched 52 state records
Columns: 8
Variables: median_household_income, gini_index, mean_household_income, population_below_poverty, population_for_poverty_status, total_population

Raw Data Sample:
         NAME B19013_001E B19083_001E B19082_001E B17001_002E B17001_001E  \
0     Alabama       62027      0.4783        2.99      768185     4913932   
1      Alaska       89336      0.4333        3.72       72978      716703   
2     Arizona       76872      0.4624        3.38      907125     7109159   
3    Arkansas       58773      0.4807        3.16      471783     2944742   
4  California       96334      0.4887        2.87     4610600    38529452   

  B01003_001E state  
0     5054253    01  
1      733971    02
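
The docstring of `fetch_acs_data` notes that the API returns a JSON array whose first row holds the column names. The sketch below shows that conversion in isolation, using a hard-coded payload instead of a network call. The first two rows reuse values from the sample above; the third row is synthetic and exists only to show how an ACS "not available" sentinel such as `-666666666` could be turned into a missing value. The name `payload_to_frame` and the cutoff constant are assumptions, not part of the notebook.

```python
import pandas as pd

# ACS uses large negative sentinel codes (e.g. -666666666) for unavailable estimates
SENTINEL_CUTOFF = -222222222

payload = [
    ["NAME", "B19013_001E", "B19083_001E", "state"],
    ["Alabama", "62027", "0.4783", "01"],
    ["Alaska", "89336", "0.4333", "02"],
    ["Synthetic row (not real data)", "-666666666", "0.4500", "99"],
]

def payload_to_frame(payload, numeric_cols):
    """Turn [[header], [row], ...] into a DataFrame with numeric columns parsed."""
    header, *rows = payload
    frame = pd.DataFrame(rows, columns=header)
    for col in numeric_cols:
        values = pd.to_numeric(frame[col], errors="coerce")
        # Keep real estimates; sentinel codes at or below the cutoff become NaN
        frame[col] = values.where(values > SENTINEL_CUTOFF)
    return frame

frame = payload_to_frame(payload, ["B19013_001E", "B19083_001E"])
print(frame)
print(frame.dtypes)
```

Geography codes such as `state` are left as strings so that leading zeros survive.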

## 3. Data Preprocessing & Validation

In [4]:
# ============================================================================
# DATA PREPROCESSING
# ============================================================================

if df_raw is not None:
    # Rename columns to descriptive names
    df = df_raw.copy()
    
    for acs_code, descriptive_name in ACS_VARIABLES.items():
        if acs_code in df.columns:
            df.rename(columns={acs_code: descriptive_name}, inplace=True)
    
    # Convert numeric columns
    numeric_cols = list(ACS_VARIABLES.values())
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
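    # Calculate derived metrics
    df['poverty_rate'] = (df['population_below_poverty'] / df['population_for_poverty_status']) * 100
    # CAUTION: B19082_001E (stored as 'mean_household_income') is the lowest-quintile
    # share of aggregate household income in percent (values near 3 in the raw sample),
    # so this column is NOT a per-capita income in USD. Dividing a household mean by an
    # estimated household count would not yield one either; request B19301_001E
    # (per capita income) if that figure is needed.
    df['income_per_capita'] = df['mean_household_income'] / (df['total_population'] / 2.5)  # Assumes 2.5 persons per household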
    
    # Handle missing values
    print("\nMissing Values Check:")
    missing_summary = df[numeric_cols + ['poverty_rate', 'income_per_capita']].isnull().sum()
    print(missing_summary[missing_summary > 0])
    
    # Remove rows with critical missing values
    # .copy() so later cells can add columns without SettingWithCopy issues
    df_clean = df.dropna(subset=['median_household_income', 'gini_index']).copy()
    
    print(f"\nData cleaned: {len(df_clean):,} records retained ({len(df_clean)/len(df)*100:.1f}%)")
    print("\nSummary Statistics:")
    print(df_clean[['median_household_income', 'gini_index', 'poverty_rate']].describe())
    
else:
    print("\nWARNING: No data to process. Check API configuration.")
    df_clean = None


Missing Values Check:
Series([], dtype: int64)

Data cleaned: 52 records retained (100.0%)

Summary Statistics:
       median_household_income  gini_index  poverty_rate
count                    52.00       52.00         52.00
mean                 77,173.40        0.47         12.81
std                  14,810.73        0.02          4.82
min                  25,096.00        0.43          7.16
25%                  69,848.00        0.45         10.29
50%                  75,273.00        0.47         11.93
75%                  88,001.00        0.48         13.72
max                 106,287.00        0.55         41.55
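
The summary statistics above give each region equal weight, so the mean of the regional poverty rates is not the same as the poverty rate of the combined population. A small sketch of the difference, using only the five rows visible in the raw data sample (so the numbers describe those five states, not the nation):

```python
import pandas as pd

# Counts copied from the five-row raw data sample shown earlier
states = pd.DataFrame({
    "NAME": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "population_below_poverty": [768185, 72978, 907125, 471783, 4610600],
    "population_for_poverty_status": [4913932, 716703, 7109159, 2944742, 38529452],
})

states["poverty_rate"] = (
    states["population_below_poverty"] / states["population_for_poverty_status"] * 100
)

# Equal weight per state: what .mean() on the rate column reports
unweighted_mean = states["poverty_rate"].mean()

# Pooled rate: total people below poverty over total people in the universe
pooled_rate = (
    states["population_below_poverty"].sum()
    / states["population_for_poverty_status"].sum()
    * 100
)

print(states[["NAME", "poverty_rate"]].round(2))
print(f"Unweighted mean of state rates: {unweighted_mean:.2f}%")
print(f"Pooled (population-weighted) rate: {pooled_rate:.2f}%")
```

In this subset the pooled rate is lower than the unweighted mean because California, the largest state in the sample, has a below-average rate. The same reasoning applies to any "national" figure derived from state-level rows.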


## 4. Exploratory Data Analysis

In [5]:
# ============================================================================
# EXPLORATORY DATA ANALYSIS
# ============================================================================

if df_clean is not None and len(df_clean) > 0:
    print("\n" + "="*80)
    print("INCOME & INEQUALITY - EXPLORATORY ANALYSIS")
    print("="*80)
    
    # Top 10 highest income regions
    print("\nTOP 10 HIGHEST MEDIAN HOUSEHOLD INCOME:")
    print("-" * 80)
    top_income = df_clean.nlargest(10, 'median_household_income')[['NAME', 'median_household_income', 'gini_index', 'poverty_rate']]
    for idx, row in top_income.iterrows():
        print(f"  {row['NAME']:<30} ${row['median_household_income']:>10,.0f}  Gini: {row['gini_index']:.3f}  Poverty: {row['poverty_rate']:.1f}%")
    
    # Bottom 10 lowest income regions
    print("\nBOTTOM 10 LOWEST MEDIAN HOUSEHOLD INCOME:")
    print("-" * 80)
    bottom_income = df_clean.nsmallest(10, 'median_household_income')[['NAME', 'median_household_income', 'gini_index', 'poverty_rate']]
    for idx, row in bottom_income.iterrows():
        print(f"  {row['NAME']:<30} ${row['median_household_income']:>10,.0f}  Gini: {row['gini_index']:.3f}  Poverty: {row['poverty_rate']:.1f}%")
    
    # Inequality analysis
    print("\nINCOME INEQUALITY ANALYSIS:")
    print("-" * 80)
    high_inequality = df_clean[df_clean['gini_index'] > 0.45]
    low_inequality = df_clean[df_clean['gini_index'] < 0.40]
    print(f"  Regions with HIGH inequality (Gini > 0.45): {len(high_inequality)} ({len(high_inequality)/len(df_clean)*100:.1f}%)")
    print(f"  Regions with LOW inequality (Gini < 0.40): {len(low_inequality)} ({len(low_inequality)/len(df_clean)*100:.1f}%)")
    print(f"  Mean Gini across regions (unweighted): {df_clean['gini_index'].mean():.3f}")
    print(f"  Median Gini across regions: {df_clean['gini_index'].median():.3f}")
    
    # Poverty analysis
    print("\nPOVERTY RATE ANALYSIS:")
    print("-" * 80)
    high_poverty = df_clean[df_clean['poverty_rate'] > ANALYSIS_CONFIG['poverty_threshold'] * 100]
    print(f"  Regions with poverty rate > {ANALYSIS_CONFIG['poverty_threshold']*100}%: {len(high_poverty)} ({len(high_poverty)/len(df_clean)*100:.1f}%)")
    print(f"  Mean regional poverty rate (unweighted): {df_clean['poverty_rate'].mean():.2f}%")
    print(f"  Median regional poverty rate: {df_clean['poverty_rate'].median():.2f}%")
    print(f"  Range: {df_clean['poverty_rate'].min():.2f}% - {df_clean['poverty_rate'].max():.2f}%")
    
    # Correlation analysis
    print("\nCORRELATION ANALYSIS:")
    print("-" * 80)
    corr_income_gini = df_clean['median_household_income'].corr(df_clean['gini_index'])
    corr_income_poverty = df_clean['median_household_income'].corr(df_clean['poverty_rate'])
    corr_gini_poverty = df_clean['gini_index'].corr(df_clean['poverty_rate'])
    print(f"  Median Income vs Gini Index: {corr_income_gini:.3f}")
    print(f"  Median Income vs Poverty Rate: {corr_income_poverty:.3f}")
    print(f"  Gini Index vs Poverty Rate: {corr_gini_poverty:.3f}")
    
    print("\n" + "="*80)
else:
    print("\nWARNING: No data available for analysis.")


INCOME & INEQUALITY - EXPLORATORY ANALYSIS

TOP 10 HIGHEST MEDIAN HOUSEHOLD INCOME:
--------------------------------------------------------------------------------
  District of Columbia           $   106,287  Gini: 0.514  Poverty: 14.5%
  Maryland                       $   101,652  Gini: 0.456  Poverty: 9.3%
  Massachusetts                  $   101,341  Gini: 0.488  Poverty: 10.0%
  New Jersey                     $   101,050  Gini: 0.480  Poverty: 9.8%
  Hawaii                         $    98,317  Gini: 0.450  Poverty: 10.0%
  California                     $    96,334  Gini: 0.489  Poverty: 12.0%
  New Hampshire                  $    95,628  Gini: 0.441  Poverty: 7.2%
  Washington                     $    94,952  Gini: 0.466  Poverty: 9.9%
  Connecticut                    $    93,760  Gini: 0.498  Poverty: 10.0%
  Colorado                       $    92,470  Gini: 0.458  Poverty: 9.4%

BOTTOM 10 LOWEST MEDIAN HOUSEHOLD INCOME:
--------------------------------------------------------
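
The correlation step in the cell above relies on pandas' default Pearson coefficient, which is sensitive to outlying regions. One common companion is a rank-based (Spearman) coefficient, which equals the Pearson coefficient computed on ranks. The sketch below applies both to the ten rows of the top-income table printed above. Because that is a hand-picked subset, the resulting coefficients are illustrative and are not the notebook's full-sample results.

```python
import pandas as pd

# Values copied from the "TOP 10 HIGHEST MEDIAN HOUSEHOLD INCOME" output
top10 = pd.DataFrame({
    "NAME": ["District of Columbia", "Maryland", "Massachusetts", "New Jersey",
             "Hawaii", "California", "New Hampshire", "Washington",
             "Connecticut", "Colorado"],
    "median_household_income": [106287, 101652, 101341, 101050, 98317,
                                96334, 95628, 94952, 93760, 92470],
    "gini_index": [0.514, 0.456, 0.488, 0.480, 0.450,
                   0.489, 0.441, 0.466, 0.498, 0.458],
    "poverty_rate": [14.5, 9.3, 10.0, 9.8, 10.0, 12.0, 7.2, 9.9, 10.0, 9.4],
})

def pearson(a, b):
    """Pearson correlation (pandas default)."""
    return a.corr(b)

def spearman(a, b):
    """Spearman correlation, computed as Pearson on average ranks."""
    return a.rank().corr(b.rank())

for x, y in [("median_household_income", "gini_index"),
             ("median_household_income", "poverty_rate"),
             ("gini_index", "poverty_rate")]:
    print(f"{x} vs {y}: "
          f"Pearson={pearson(top10[x], top10[y]):.3f}, "
          f"Spearman={spearman(top10[x], top10[y]):.3f}")
```

A large gap between the two coefficients is a hint that a few extreme regions are driving the linear estimate.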

## 5. Visualization - Income Distribution

In [6]:
# ============================================================================
# DEBUG: CHECK DATA STRUCTURE BEFORE MAPPING
# ============================================================================

if df_clean is not None and len(df_clean) > 0:
    print("\n" + "="*80)
    print("DATA STRUCTURE VERIFICATION FOR CHOROPLETH MAPPING")
    print("="*80)
    
    print(f"\nDataFrame Shape: {df_clean.shape}")
    print(f"\nAll Columns: {df_clean.columns.tolist()}")
    
    print("\nFirst 5 rows of key columns:")
    display_cols = ['NAME']
    if 'state' in df_clean.columns:
        display_cols.append('state')
    if 'state_fips' in df_clean.columns:
        display_cols.append('state_fips')
    display_cols.extend(['median_household_income', 'gini_index', 'poverty_rate'])
    
    print(df_clean[display_cols].head())
    
    print("\nData Types:")
    for col in display_cols:
        if col in df_clean.columns:
            print(f"  {col}: {df_clean[col].dtype}")
    
    # Check for null values
    print("\nNull values in key columns:")
    for col in display_cols:
        if col in df_clean.columns:
            null_count = df_clean[col].isnull().sum()
            if null_count > 0:
                print(f"  {col}: {null_count} nulls")
    
    print("\n" + "="*80)
else:
    print("\nWARNING: No data available for verification.")


DATA STRUCTURE VERIFICATION FOR CHOROPLETH MAPPING

DataFrame Shape: (52, 10)

All Columns: ['NAME', 'median_household_income', 'gini_index', 'mean_household_income', 'population_below_poverty', 'population_for_poverty_status', 'total_population', 'state', 'poverty_rate', 'income_per_capita']

First 5 rows of key columns:
         NAME state  median_household_income  gini_index  poverty_rate
0     Alabama    01                    62027        0.48         15.63
1      Alaska    02                    89336        0.43         10.18
2     Arizona    04                    76872        0.46         12.76
3    Arkansas    05                    58773        0.48         16.02
4  California    06                    96334        0.49         11.97

Data Types:
  NAME: object
  state: object
  median_household_income: int64
  gini_index: float64
  poverty_rate: float64

Null values in key columns:



In [7]:
# ============================================================================
# VISUALIZATION 1: INCOME DISTRIBUTION HISTOGRAM
# ============================================================================

if df_clean is not None and len(df_clean) > 0:
    fig1 = go.Figure()
    
    # Histogram
    fig1.add_trace(go.Histogram(
        x=df_clean['median_household_income'],
        nbinsx=30,
        name='Distribution',
        marker_color='rgb(55, 126, 184)',
        opacity=0.7
    ))
    
    # Add mean line
    mean_income = df_clean['median_household_income'].mean()
    fig1.add_vline(
        x=mean_income,
        line_dash="dash",
        line_color="red",
        annotation_text=f"Mean: ${mean_income:,.0f}",
        annotation_position="top right"
    )
    
    # Add median line
    median_income = df_clean['median_household_income'].median()
    fig1.add_vline(
        x=median_income,
        line_dash="dot",
        line_color="green",
        annotation_text=f"Median: ${median_income:,.0f}",
        annotation_position="bottom right"
    )
    
    fig1.update_layout(
        title=f"Median Household Income Distribution - {ANALYSIS_CONFIG['geographic_level'].title()} Level",
        xaxis_title="Median Household Income (USD)",
        yaxis_title="Frequency",
        hovermode='x unified',
        width=1200,
        height=500
    )
    
    fig1.show()
else:
    print("\nWARNING: No data available for visualization.")

In [8]:
# ============================================================================
# VISUALIZATION 2: GINI INDEX vs MEDIAN INCOME SCATTER
# ============================================================================

if df_clean is not None and len(df_clean) > 0:
    fig2 = px.scatter(
        df_clean,
        x='median_household_income',
        y='gini_index',
        size='total_population',
        color='poverty_rate',
        hover_name='NAME',
        hover_data={
            'median_household_income': ':$,.0f',
            'gini_index': ':.3f',
            'poverty_rate': ':.1f',
            'total_population': ':,.0f'
        },
        color_continuous_scale='RdYlGn_r',
        title='Income Inequality (Gini Index) vs Median Income',
        labels={
            'median_household_income': 'Median Household Income (USD)',
            'gini_index': 'Gini Index (0-1)',
            'poverty_rate': 'Poverty Rate (%)',
            'total_population': 'Population'
        }
    )
    
    fig2.update_layout(
        width=1200,
        height=600
    )
    
    fig2.show()
else:
    print("\nWARNING: No data available for visualization.")

## 6. Geographic Visualization - Choropleth Maps

This section presents **4 different perspectives** on income and poverty data:

1. **Median Income by State** - Shows absolute income levels across states
2. **Gini Index (Inequality)** - Reveals which states have the most unequal income distributions
3. **Poverty Rate** - Highlights states with highest percentage of population below poverty line
4. **Interactive HTML Export** - Combined view with all metrics for detailed exploration

**Note**: Current analysis is at **state level**. For county-level analysis, change `ANALYSIS_CONFIG['geographic_level']` to `'county'` in the configuration cell. The choropleth cells in this section check for `'state'` and are skipped at any other level.
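
For county-level pulls the API returns separate `state` and `county` code columns, and county maps are usually keyed on the combined 5-digit FIPS code (for example when matching against a county GeoJSON passed to `px.choropleth` via `geojson=`). A minimal sketch of building that key is below. The helper name `county_geoid` is hypothetical and the notebook does not define it.

```python
def county_geoid(state_fips, county_fips):
    """Join a 2-digit state code and a 3-digit county code into a 5-digit FIPS key."""
    state = str(state_fips).zfill(2)
    county = str(county_fips).zfill(3)
    if not (state.isdigit() and county.isdigit()
            and len(state) == 2 and len(county) == 3):
        raise ValueError(f"Invalid FIPS parts: state={state_fips!r}, county={county_fips!r}")
    return state + county

# Codes as strings (as the API returns them) or as integers that lost leading zeros
print(county_geoid("06", "037"))  # Los Angeles County, California
print(county_geoid(17, 31))       # Cook County, Illinois
print(county_geoid("01", "001"))  # Autauga County, Alabama
```

Keeping the result as a string matters: converting it to an integer would drop the leading zero for states such as Alabama and California and break the join with map features.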

In [9]:
# ============================================================================
# VISUALIZATION 3: CHOROPLETH MAP - MEDIAN INCOME BY STATE
# ============================================================================

if df_clean is not None and len(df_clean) > 0 and ANALYSIS_CONFIG['geographic_level'] == 'state':
    try:
        # Create state abbreviation mapping (FIPS to abbreviation)
        state_fips_to_abbr = {
            '01': 'AL', '02': 'AK', '04': 'AZ', '05': 'AR', '06': 'CA',
            '08': 'CO', '09': 'CT', '10': 'DE', '11': 'DC', '12': 'FL',
            '13': 'GA', '15': 'HI', '16': 'ID', '17': 'IL', '18': 'IN',
            '19': 'IA', '20': 'KS', '21': 'KY', '22': 'LA', '23': 'ME',
            '24': 'MD', '25': 'MA', '26': 'MI', '27': 'MN', '28': 'MS',
            '29': 'MO', '30': 'MT', '31': 'NE', '32': 'NV', '33': 'NH',
            '34': 'NJ', '35': 'NM', '36': 'NY', '37': 'NC', '38': 'ND',
            '39': 'OH', '40': 'OK', '41': 'OR', '42': 'PA', '44': 'RI',
            '45': 'SC', '46': 'SD', '47': 'TN', '48': 'TX', '49': 'UT',
            '50': 'VT', '51': 'VA', '53': 'WA', '54': 'WV', '55': 'WI',
            '56': 'WY', '72': 'PR'
        }
        
        # Ensure state FIPS column exists
        if 'state' in df_clean.columns and 'state_fips' not in df_clean.columns:
            df_clean['state_fips'] = df_clean['state'].astype(str).str.zfill(2)
        
        # Add state abbreviations
        if 'state_fips' in df_clean.columns:
            df_clean['state_abbr'] = df_clean['state_fips'].map(state_fips_to_abbr)
        
        # Create a copy for mapping
        df_map = df_clean.copy()
        
        print(f"\nCreating choropleth map for {len(df_map)} states...")
        print(f"Sample state abbreviations: {df_map['state_abbr'].head(5).tolist()}")
        print(f"Income range: ${df_map['median_household_income'].min():,.0f} - ${df_map['median_household_income'].max():,.0f}")
        
        # Use state abbreviations as locations
        fig3 = px.choropleth(
            df_map,
            locations='state_abbr',
            locationmode='USA-states',
            color='median_household_income',
            hover_name='NAME',
            hover_data={
                'median_household_income': ':$,.0f',
                'gini_index': ':.3f',
                'poverty_rate': ':.1f',
                'state_abbr': False
            },
            color_continuous_scale='Viridis',
            title='Median Household Income by State (2023 ACS 5-Year Estimates)',
            labels={'median_household_income': 'Median Income (USD)'},
            scope='usa'
        )
        
        fig3.update_layout(
            width=1200,
            height=700,
            geo=dict(
                showlakes=True,
                lakecolor='rgb(255, 255, 255)'
            )
        )
        
        print("Map generated successfully")
        fig3.show()
        
    except Exception as e:
        print(f"\nCould not create choropleth map: {e}")
        print(f"Error details: {type(e).__name__}")
        import traceback
        print(traceback.format_exc())
else:
    print("\nWARNING: Choropleth map skipped: it requires non-empty state-level data.")


Creating choropleth map for 52 states...
Sample state abbreviations: ['AL', 'AK', 'AZ', 'AR', 'CA']
Income range: $25,096 - $106,287
Map generated successfully


In [10]:
# ============================================================================
# VISUALIZATION 4: CHOROPLETH MAP - GINI INDEX (INEQUALITY)
# ============================================================================

if df_clean is not None and len(df_clean) > 0 and ANALYSIS_CONFIG['geographic_level'] == 'state':
    try:
        # Ensure state_abbr column exists (created in previous cell)
        if 'state_abbr' not in df_clean.columns:
            state_fips_to_abbr = {
                '01': 'AL', '02': 'AK', '04': 'AZ', '05': 'AR', '06': 'CA',
                '08': 'CO', '09': 'CT', '10': 'DE', '11': 'DC', '12': 'FL',
                '13': 'GA', '15': 'HI', '16': 'ID', '17': 'IL', '18': 'IN',
                '19': 'IA', '20': 'KS', '21': 'KY', '22': 'LA', '23': 'ME',
                '24': 'MD', '25': 'MA', '26': 'MI', '27': 'MN', '28': 'MS',
                '29': 'MO', '30': 'MT', '31': 'NE', '32': 'NV', '33': 'NH',
                '34': 'NJ', '35': 'NM', '36': 'NY', '37': 'NC', '38': 'ND',
                '39': 'OH', '40': 'OK', '41': 'OR', '42': 'PA', '44': 'RI',
                '45': 'SC', '46': 'SD', '47': 'TN', '48': 'TX', '49': 'UT',
                '50': 'VT', '51': 'VA', '53': 'WA', '54': 'WV', '55': 'WI',
                '56': 'WY', '72': 'PR'
            }
            if 'state' in df_clean.columns and 'state_fips' not in df_clean.columns:
                df_clean['state_fips'] = df_clean['state'].astype(str).str.zfill(2)
            df_clean['state_abbr'] = df_clean['state_fips'].map(state_fips_to_abbr)
        
        # Filter to exclude Puerto Rico for better continental US visibility
        df_map_gini = df_clean[df_clean['state_abbr'] != 'PR'].copy()
        
        print(f"\nCreating Gini Index (Inequality) map...")
        print(f"States included: {len(df_map_gini)}")
        print(f"Gini range: {df_map_gini['gini_index'].min():.3f} - {df_map_gini['gini_index'].max():.3f}")
        
        fig4 = px.choropleth(
            df_map_gini,
            locations='state_abbr',
            locationmode='USA-states',
            color='gini_index',
            hover_name='NAME',
            hover_data={
                'gini_index': ':.3f',
                'median_household_income': ':$,.0f',
                'poverty_rate': ':.1f',
                'state_abbr': False
            },
            color_continuous_scale='RdYlGn_r',  # Red = high inequality, Green = low inequality
            title='Income Inequality by State - Gini Index (2023 ACS 5-Year Estimates)',
            labels={'gini_index': 'Gini Index (0-1)'},
            scope='usa',
            range_color=(0.35, 0.55)  # Typical Gini range for US states
        )
        
        fig4.update_layout(
            width=1200,
            height=700,
            geo=dict(
                showlakes=True,
                lakecolor='rgb(255, 255, 255)',
                projection_type='albers usa'
            )
        )
        
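        # Show inequality statistics
        # ("National median" below is the median of the state-level values, not a population-weighted national Gini)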
        print(f"\nInequality Distribution:")
        print(f"  Highest inequality: {df_map_gini.loc[df_map_gini['gini_index'].idxmax(), 'NAME']} - {df_map_gini['gini_index'].max():.3f}")
        print(f"  Lowest inequality: {df_map_gini.loc[df_map_gini['gini_index'].idxmin(), 'NAME']} - {df_map_gini['gini_index'].min():.3f}")
        print(f"  National median: {df_map_gini['gini_index'].median():.3f}")
        
        print("\nGini Index map generated successfully")
        fig4.show()
        
    except Exception as e:
        print(f"\nCould not create Gini Index map: {e}")
        print(f"Error details: {type(e).__name__}")
        import traceback
        print(traceback.format_exc())
else:
    print("\nWARNING: Choropleth map only available for state-level analysis.")


Creating Gini Index (Inequality) map...
States included: 51
Gini range: 0.428 - 0.515

Inequality Distribution:
  Highest inequality: New York - 0.515
  Lowest inequality: Utah - 0.428
  National median: 0.466

Gini Index map generated successfully
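
The `gini_index` values mapped above are read from the ACS (presumably variable B19083_001E, listed under Key Metrics) rather than recomputed here. For intuition about what the index measures, the following is a minimal sketch that computes a Gini coefficient from a list of individual incomes using the standard rank-weighted formula. The function name `gini_coefficient` and the sample values are illustrative assumptions and are not part of the notebook or of the Census Bureau's methodology.

```python
def gini_coefficient(incomes):
    """Gini coefficient of non-negative incomes (0 = perfectly equal)."""
    values = sorted(float(x) for x in incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # where x_i are sorted ascending and i runs from 1 to n.
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Illustrative data only: four households with equal vs. concentrated income
equal = gini_coefficient([50_000] * 4)
skewed = gini_coefficient([0, 0, 0, 100_000])
print(f"equal: {equal:.3f}, skewed: {skewed:.3f}")
```

With a finite sample of `n` households, the maximum of this formula is `(n - 1) / n`, so the concentrated example reaches 0.75 rather than 1.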


In [11]:
# ============================================================================
# VISUALIZATION 5: CHOROPLETH MAP - POVERTY RATE
# ============================================================================

if df_clean is not None and len(df_clean) > 0 and ANALYSIS_CONFIG['geographic_level'] == 'state':
    try:
        # Ensure state_abbr column exists
        if 'state_abbr' not in df_clean.columns:
            state_fips_to_abbr = {
                '01': 'AL', '02': 'AK', '04': 'AZ', '05': 'AR', '06': 'CA',
                '08': 'CO', '09': 'CT', '10': 'DE', '11': 'DC', '12': 'FL',
                '13': 'GA', '15': 'HI', '16': 'ID', '17': 'IL', '18': 'IN',
                '19': 'IA', '20': 'KS', '21': 'KY', '22': 'LA', '23': 'ME',
                '24': 'MD', '25': 'MA', '26': 'MI', '27': 'MN', '28': 'MS',
                '29': 'MO', '30': 'MT', '31': 'NE', '32': 'NV', '33': 'NH',
                '34': 'NJ', '35': 'NM', '36': 'NY', '37': 'NC', '38': 'ND',
                '39': 'OH', '40': 'OK', '41': 'OR', '42': 'PA', '44': 'RI',
                '45': 'SC', '46': 'SD', '47': 'TN', '48': 'TX', '49': 'UT',
                '50': 'VT', '51': 'VA', '53': 'WA', '54': 'WV', '55': 'WI',
                '56': 'WY', '72': 'PR'
            }
            if 'state' in df_clean.columns and 'state_fips' not in df_clean.columns:
                df_clean['state_fips'] = df_clean['state'].astype(str).str.zfill(2)
            df_clean['state_abbr'] = df_clean['state_fips'].map(state_fips_to_abbr)
        
        # Exclude Puerto Rico so the map covers the 50 states and DC
        df_map_poverty = df_clean[df_clean['state_abbr'] != 'PR'].copy()
        
        print(f"\nCreating Poverty Rate map...")
        print(f"States included: {len(df_map_poverty)}")
        print(f"Poverty rate range: {df_map_poverty['poverty_rate'].min():.1f}% - {df_map_poverty['poverty_rate'].max():.1f}%")
        
        fig5 = px.choropleth(
            df_map_poverty,
            locations='state_abbr',
            locationmode='USA-states',
            color='poverty_rate',
            hover_name='NAME',
            hover_data={
                'poverty_rate': ':.1f',
                'median_household_income': ':$,.0f',
                'gini_index': ':.3f',
                'state_abbr': False,
                'total_population': ':,.0f'
            },
            color_continuous_scale='RdYlGn_r',  # Red = high poverty, Green = low poverty
            title='Poverty Rate by State (2023 ACS 5-Year Estimates)',
            labels={'poverty_rate': 'Poverty Rate (%)'},
            scope='usa',
            range_color=(5, 20)  # Typical poverty rate range
        )
        
        fig5.update_layout(
            width=1200,
            height=700,
            geo=dict(
                showlakes=True,
                lakecolor='rgb(255, 255, 255)',
                projection_type='albers usa'
            )
        )
        
        # Show poverty statistics
        # ("National median" below is the median of the state-level rates, not the national poverty rate)
        print(f"\nPoverty Rate Distribution:")
        print(f"  Highest poverty: {df_map_poverty.loc[df_map_poverty['poverty_rate'].idxmax(), 'NAME']} - {df_map_poverty['poverty_rate'].max():.1f}%")
        print(f"  Lowest poverty: {df_map_poverty.loc[df_map_poverty['poverty_rate'].idxmin(), 'NAME']} - {df_map_poverty['poverty_rate'].min():.1f}%")
        print(f"  National median: {df_map_poverty['poverty_rate'].median():.1f}%")
        
        print("\nPoverty Rate map generated successfully")
        fig5.show()
        
    except Exception as e:
        print(f"\nCould not create Poverty Rate map: {e}")
        print(f"Error details: {type(e).__name__}")
        import traceback
        print(traceback.format_exc())
else:
    print("\nWARNING: Choropleth map only available for state-level analysis.")


Creating Poverty Rate map...
States included: 51
Poverty rate range: 7.2% - 19.1%

Poverty Rate Distribution:
  Highest poverty: Mississippi - 19.1%
  Lowest poverty: New Hampshire - 7.2%
  National median: 11.9%

Poverty Rate map generated successfully
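
The state-level map cells in this section each repeat the same fallback that rebuilds `state_abbr` from the state FIPS code. One way to keep that lookup in a single place is a small helper, sketched below in plain Python. The names `STATE_FIPS_TO_ABBR` and `fips_to_abbr` are hypothetical; the mapping itself is copied from the cells above.

```python
# Lookup copied from the map cells: 50 states, DC, and Puerto Rico
STATE_FIPS_TO_ABBR = {
    '01': 'AL', '02': 'AK', '04': 'AZ', '05': 'AR', '06': 'CA',
    '08': 'CO', '09': 'CT', '10': 'DE', '11': 'DC', '12': 'FL',
    '13': 'GA', '15': 'HI', '16': 'ID', '17': 'IL', '18': 'IN',
    '19': 'IA', '20': 'KS', '21': 'KY', '22': 'LA', '23': 'ME',
    '24': 'MD', '25': 'MA', '26': 'MI', '27': 'MN', '28': 'MS',
    '29': 'MO', '30': 'MT', '31': 'NE', '32': 'NV', '33': 'NH',
    '34': 'NJ', '35': 'NM', '36': 'NY', '37': 'NC', '38': 'ND',
    '39': 'OH', '40': 'OK', '41': 'OR', '42': 'PA', '44': 'RI',
    '45': 'SC', '46': 'SD', '47': 'TN', '48': 'TX', '49': 'UT',
    '50': 'VT', '51': 'VA', '53': 'WA', '54': 'WV', '55': 'WI',
    '56': 'WY', '72': 'PR',
}


def fips_to_abbr(code):
    """Return the postal abbreviation for a state FIPS code, or None if unknown."""
    # Zero-pad so that an integer 6 (e.g. after a CSV round trip) still matches '06'
    return STATE_FIPS_TO_ABBR.get(str(code).strip().zfill(2))


print(fips_to_abbr(6), fips_to_abbr('72'), fips_to_abbr('03'))
```

In the notebook this could be applied with `df_clean['state_abbr'] = df_clean['state'].map(fips_to_abbr)`, assuming `df_clean` has a `state` column as the cells above expect. Unknown codes map to `None`, which makes unmapped rows easy to spot before plotting.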


In [12]:
# ============================================================================
# EXPORT MAP TO HTML (For viewing in browser)
# ============================================================================

if df_clean is not None and len(df_clean) > 0 and ANALYSIS_CONFIG['geographic_level'] == 'state':
    try:
        # Ensure state_abbr column exists
        if 'state_abbr' not in df_clean.columns:
            state_fips_to_abbr = {
                '01': 'AL', '02': 'AK', '04': 'AZ', '05': 'AR', '06': 'CA',
                '08': 'CO', '09': 'CT', '10': 'DE', '11': 'DC', '12': 'FL',
                '13': 'GA', '15': 'HI', '16': 'ID', '17': 'IL', '18': 'IN',
                '19': 'IA', '20': 'KS', '21': 'KY', '22': 'LA', '23': 'ME',
                '24': 'MD', '25': 'MA', '26': 'MI', '27': 'MN', '28': 'MS',
                '29': 'MO', '30': 'MT', '31': 'NE', '32': 'NV', '33': 'NH',
                '34': 'NJ', '35': 'NM', '36': 'NY', '37': 'NC', '38': 'ND',
                '39': 'OH', '40': 'OK', '41': 'OR', '42': 'PA', '44': 'RI',
                '45': 'SC', '46': 'SD', '47': 'TN', '48': 'TX', '49': 'UT',
                '50': 'VT', '51': 'VA', '53': 'WA', '54': 'WV', '55': 'WI',
                '56': 'WY', '72': 'PR'
            }
            if 'state' in df_clean.columns and 'state_fips' not in df_clean.columns:
                df_clean['state_fips'] = df_clean['state'].astype(str).str.zfill(2)
            df_clean['state_abbr'] = df_clean['state_fips'].map(state_fips_to_abbr)
        
        # Create export map (exclude Puerto Rico for better US visualization)
        df_export = df_clean[df_clean['state_abbr'] != 'PR'].copy()
        
        print(f"\nCreating exportable HTML map...")
        print(f"Records: {len(df_export)}")
        print(f"Income range: ${df_export['median_household_income'].min():,.0f} - ${df_export['median_household_income'].max():,.0f}")
        
        # Create comprehensive map using go.Choropleth
        fig_export = go.Figure(data=go.Choropleth(
            locations=df_export['state_abbr'].tolist(),
            z=df_export['median_household_income'].tolist(),
            locationmode='USA-states',
            text=df_export['NAME'].tolist(),
            colorscale='Viridis',
            autocolorscale=False,
            reversescale=False,
            marker=dict(
                line=dict(
                    color='white',
                    width=0.5
                )
            ),
            colorbar=dict(
                title={
                    'text': "Median<br>Household<br>Income (USD)",
                    'side': 'right'
                },
                thickness=15,
                len=0.7,
                tickformat='$,.0f'
            ),
            hovertemplate='<b>%{text}</b><br>' +
                          'Median Income: $%{z:,.0f}<br>' +
                          '<extra></extra>'
        ))
        
        fig_export.update_layout(
            title={
                'text': '<b>Median Household Income by State (2023 ACS 5-Year Estimates)</b>',
                'x': 0.5,
                'xanchor': 'center',
                'font': {'size': 20}
            },
            geo=dict(
                scope='usa',
                projection=dict(type='albers usa'),
                showlakes=True,
                lakecolor='rgba(127, 205, 255, 0.3)',
                showcountries=False,
                showcoastlines=True,
                coastlinecolor='rgb(204, 204, 204)',
                showland=True,
                landcolor='rgb(243, 243, 243)',
                subunitcolor='white'
            ),
            width=1400,
            height=800,
            font=dict(family="Arial, sans-serif", size=12)
        )
        
        # Save to HTML file
        html_path = Path('../../reports/income_map.html')
        html_path.parent.mkdir(parents=True, exist_ok=True)
        fig_export.write_html(str(html_path))
        
        print(f"\nMap exported to HTML: {html_path.resolve()}")
        print(f"Open this file in your browser to view the interactive map")
        print(f"File size: {html_path.stat().st_size / 1024:.1f} KB")
        
        # Also try showing in notebook
        print("\nDisplaying map in notebook:")
        fig_export.show()
        
        # Print verification stats
        print(f"\nMap Verification:")
        print(f"  States with data: {len(df_export)}")
        print(f"  State abbreviations: {sorted(df_export['state_abbr'].unique())}")
        print(f"  Income data complete: {df_export['median_household_income'].notna().sum()}")
        print(f"  Highest: {df_export.loc[df_export['median_household_income'].idxmax(), 'NAME']}")
        print(f"  Lowest: {df_export.loc[df_export['median_household_income'].idxmin(), 'NAME']}")
        
    except Exception as e:
        print(f"\nCould not create map: {e}")
        print(f"Error details: {type(e).__name__}")
        import traceback
        print(traceback.format_exc())
else:
    print("\nWARNING: Choropleth map only available for state-level analysis.")


Creating exportable HTML map...
Records: 51
Income range: $54,915 - $106,287

Map exported to HTML: /Users/bcdelo/Documents/GitHub/QRL/reports/income_map.html
Open this file in your browser to view the interactive map
File size: 4727.3 KB

Displaying map in notebook:



Map Verification:
  States with data: 51
  State abbreviations: ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']
  Income data complete: 51
  Highest: District of Columbia
  Lowest: Mississippi


In [13]:
# ============================================================================
# COUNTY MAP 3: POVERTY RATE BY COUNTY
# ============================================================================

if df_clean is not None and len(df_clean) > 0 and ANALYSIS_CONFIG['geographic_level'] == 'county':
    try:
        # Reuse the county GeoJSON if another cell already loaded it; otherwise download it
        if 'counties_geojson' not in locals():
            from urllib.request import urlopen
            with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
                counties_geojson = json.load(response)
        
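        fips_col = 'FIPS' if 'FIPS' in df_clean.columns else 'county_fips'
        df_county_map = df_clean.copy()
        # Build the 5-digit FIPS code (state + county) if no earlier cell has created it
        if fips_col not in df_county_map.columns:
            df_county_map[fips_col] = (
                df_county_map['state'].astype(str).str.zfill(2)
                + df_county_map['county'].astype(str).str.zfill(3)
            )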
        
        print(f"\nCreating county-level Poverty Rate map...")
        print(f"Poverty rate range: {df_county_map['poverty_rate'].min():.1f}% - {df_county_map['poverty_rate'].max():.1f}%")
        
        fig_county3 = px.choropleth(
            df_county_map,
            geojson=counties_geojson,
            locations=fips_col,
            color='poverty_rate',
            hover_name='NAME',
            hover_data={
                'poverty_rate': ':.1f',
                'median_household_income': ':$,.0f',
                'gini_index': ':.3f',
                'total_population': ':,.0f',
                fips_col: False
            },
            color_continuous_scale='RdYlGn_r',  # Red = high poverty
            scope="usa",
            labels={'poverty_rate': 'Poverty Rate (%)'},
            title='Poverty Rate by County (2023 ACS 5-Year Estimates)',
            range_color=(0, 40)  # Counties can have much higher poverty rates
        )
        
        fig_county3.update_layout(
            width=1400,
            height=800,
            margin={"r":0,"t":50,"l":0,"b":0}
        )
        
        # Show top 10 highest poverty counties
        top_poverty = df_county_map.nlargest(10, 'poverty_rate')[['NAME', 'poverty_rate', 'median_household_income']]
        print("\nTop 10 Highest Poverty Counties:")
        for idx, row in top_poverty.iterrows():
            print(f"  {row['NAME']}: {row['poverty_rate']:.1f}% poverty, ${row['median_household_income']:,.0f} median income")
        
        print("\nCounty Poverty Rate map generated successfully")
        fig_county3.show()
        
    except Exception as e:
        print(f"\nCould not create county poverty map: {e}")
        import traceback
        print(traceback.format_exc())
else:
    if ANALYSIS_CONFIG['geographic_level'] != 'county':
        print(f"\nSkipping county poverty map - current level: '{ANALYSIS_CONFIG['geographic_level']}'")
    else:
        print("\nWARNING: No data available")


Skipping county poverty map - current level: 'state'


In [14]:
# ============================================================================
# COUNTY MAP 2: GINI INDEX (INEQUALITY) BY COUNTY
# ============================================================================

if df_clean is not None and len(df_clean) > 0 and ANALYSIS_CONFIG['geographic_level'] == 'county':
    try:
        # Reuse the county GeoJSON if another cell already loaded it; otherwise download it
        if 'counties_geojson' not in locals():
            from urllib.request import urlopen
            with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
                counties_geojson = json.load(response)
        
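        fips_col = 'FIPS' if 'FIPS' in df_clean.columns else 'county_fips'
        df_county_map = df_clean.copy()
        # Build the 5-digit FIPS code (state + county) if no earlier cell has created it
        if fips_col not in df_county_map.columns:
            df_county_map[fips_col] = (
                df_county_map['state'].astype(str).str.zfill(2)
                + df_county_map['county'].astype(str).str.zfill(3)
            )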
        
        print(f"\nCreating county-level Gini Index map...")
        print(f"Gini range: {df_county_map['gini_index'].min():.3f} - {df_county_map['gini_index'].max():.3f}")
        
        fig_county2 = px.choropleth(
            df_county_map,
            geojson=counties_geojson,
            locations=fips_col,
            color='gini_index',
            hover_name='NAME',
            hover_data={
                'gini_index': ':.3f',
                'median_household_income': ':$,.0f',
                'poverty_rate': ':.1f',
                fips_col: False
            },
            color_continuous_scale='RdYlGn_r',  # Red = high inequality
            scope="usa",
            labels={'gini_index': 'Gini Index (0-1)'},
            title='Income Inequality by County - Gini Index (2023 ACS 5-Year Estimates)',
            range_color=(0.35, 0.65)  # County Gini range typically wider than states
        )
        
        fig_county2.update_layout(
            width=1400,
            height=800,
            margin={"r":0,"t":50,"l":0,"b":0}
        )
        
        print("County Gini Index map generated successfully")
        fig_county2.show()
        
    except Exception as e:
        print(f"\nCould not create county Gini map: {e}")
        import traceback
        print(traceback.format_exc())
else:
    if ANALYSIS_CONFIG['geographic_level'] != 'county':
        print(f"\nSkipping county Gini map - current level: '{ANALYSIS_CONFIG['geographic_level']}'")
    else:
        print("\nWARNING: No data available")


Skipping county Gini map - current level: 'state'


In [15]:
# ============================================================================
# COUNTY MAP 1: MEDIAN HOUSEHOLD INCOME BY COUNTY
# ============================================================================

if df_clean is not None and len(df_clean) > 0 and ANALYSIS_CONFIG['geographic_level'] == 'county':
    from urllib.request import urlopen
    
    try:
        # Load US counties GeoJSON
        print("\nLoading county boundaries GeoJSON...")
        with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
            counties_geojson = json.load(response)
        print(f"Loaded {len(counties_geojson['features'])} county boundaries")
        
        # Create 5-digit FIPS code (state + county)
        if 'state' in df_clean.columns and 'county' in df_clean.columns:
            df_clean['county_fips'] = df_clean['state'].astype(str).str.zfill(2) + df_clean['county'].astype(str).str.zfill(3)
        elif 'FIPS' not in df_clean.columns:
            print("\nWARNING: Cannot create FIPS codes - missing 'state' or 'county' columns")
            print(f"Available columns: {df_clean.columns.tolist()}")
        
        # Use FIPS or county_fips
        fips_col = 'FIPS' if 'FIPS' in df_clean.columns else 'county_fips'
        
        # Create county map
        df_county_map = df_clean.copy()
        
        print(f"\nCreating county-level choropleth map...")
        print(f"Counties: {len(df_county_map)}")
        print(f"Sample FIPS codes: {df_county_map[fips_col].head(5).tolist()}")
        print(f"Income range: ${df_county_map['median_household_income'].min():,.0f} - ${df_county_map['median_household_income'].max():,.0f}")
        
        fig_county1 = px.choropleth(
            df_county_map,
            geojson=counties_geojson,
            locations=fips_col,
            color='median_household_income',
            hover_name='NAME',
            hover_data={
                'median_household_income': ':$,.0f',
                'gini_index': ':.3f',
                'poverty_rate': ':.1f',
                fips_col: False
            },
            color_continuous_scale='Viridis',
            scope="usa",
            labels={'median_household_income': 'Median Income (USD)'},
            title='Median Household Income by County (2023 ACS 5-Year Estimates)'
        )
        
        fig_county1.update_layout(
            width=1400,
            height=800,
            margin={"r":0,"t":50,"l":0,"b":0}
        )
        
        print("County map generated successfully")
        fig_county1.show()
        
    except Exception as e:
        print(f"\nCould not create county choropleth map: {e}")
        print(f"Error details: {type(e).__name__}")
        import traceback
        print(traceback.format_exc())
else:
    if ANALYSIS_CONFIG['geographic_level'] != 'county':
        print(f"\nSkipping county map - current level: '{ANALYSIS_CONFIG['geographic_level']}'")
        print("Change configuration to 'county' to enable county-level maps")
    else:
        print("\nWARNING: No data available for county-level mapping")


Skipping county map - current level: 'state'
Change configuration to 'county' to enable county-level maps


## 6b. County-Level Geographic Visualization (Optional)

**Note**: The county-level map cells (County Maps 1-3) run only when the configuration in Cell 4 is set to `'geographic_level': 'county'`. At the default state level they print a skip message instead.

County-level analysis provides much more granular insights:
- ~3,143 counties in the United States
- Shows local variations within states
- Useful for targeted policy interventions and business decisions

**To enable county-level analysis:**
1. Change Cell 4: `'geographic_level': 'county'`
2. Optionally filter to specific states: `'selected_states': ['06', '36', '48']` (CA, NY, TX)
3. Re-run from Cell 4 onwards
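
The steps above change which geography is requested from the Census API. The notebook's fetch code is not shown in this section, so the following is only a sketch of how the two configuration keys could translate into the API's `for` and `in` geography predicates. The function name `build_geo_params` is hypothetical, and issuing one request per selected state is a design choice made for this example.

```python
def build_geo_params(config):
    """Return one dict of Census API geography predicates per request."""
    level = config["geographic_level"]
    # None (or an empty list) means "no filter", expressed as the wildcard '*'
    states = config.get("selected_states") or ["*"]
    if level == "state":
        return [{"for": f"state:{s}"} for s in states]
    if level == "county":
        # Counties are nested inside states, so the state goes in the 'in' predicate
        return [{"for": "county:*", "in": f"state:{s}"} for s in states]
    raise ValueError(f"unsupported geographic_level: {level!r}")


# Illustrative configurations using the keys named in the steps above
state_requests = build_geo_params(
    {"geographic_level": "state", "selected_states": None}
)
county_requests = build_geo_params(
    {"geographic_level": "county", "selected_states": ["06", "36", "48"]}
)
print(state_requests)
print(county_requests)
```

Each returned dict would be merged with a `get` parameter (for example `NAME,B19013_001E`) and sent to the endpoint listed under Data Sources.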

In [16]:
# ============================================================================
# QUICK TOGGLE: SWITCH TO COUNTY-LEVEL ANALYSIS
# ============================================================================

# INSTRUCTIONS TO ENABLE COUNTY-LEVEL MAPS:
# 
# 1. Scroll back to Cell 4 (Configuration cell)
# 2. Change this line:
#       'geographic_level': 'state',
#    to:
#       'geographic_level': 'county',
#
# 3. (Optional) Limit to specific states for faster processing:
#       'selected_states': ['06', '36', '48'],  # California, New York, Texas
#    Or keep as None for all ~3,143 counties (slower but comprehensive)
#
# 4. Re-run from Cell 4 onwards (Shift+Enter through all cells)
#
# EXPECTED RESULTS:
# - Cell 6 will fetch ~3,143 county records (or subset if filtered)
# - State-level maps (Cells 13-16) will print a warning and skip
# - County-level maps will generate detailed visualizations
# - Processing time: ~30-60 seconds for all counties, ~10-20s for single state

print(" CURRENT CONFIGURATION:")
print(f"   Geographic Level: {ANALYSIS_CONFIG['geographic_level']}")
print(f"   State Filter: {ANALYSIS_CONFIG.get('selected_states', 'None (all states/counties)')}")

if ANALYSIS_CONFIG['geographic_level'] == 'county':
    print("\n County-level analysis ENABLED")
    print("   The 3 county maps will render")
else:
    print("\n  County-level analysis DISABLED")
    print("   Change configuration to 'county' to enable county maps")
    
print("\n" + "="*80)

 CURRENT CONFIGURATION:
   Geographic Level: state
   State Filter: None

  County-level analysis DISABLED
   Change configuration to 'county' to enable county maps



## 7. Key Insights & Recommendations

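**NOTE:** The insights in this section are computed from `df_clean` at run time using fixed thresholds (coefficient of variation > 0.2, Gini > 0.45, |correlation| > 0.3, poverty rate > 15%). The strategic recommendations are static text and do not change with the data.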
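
The insight cell in this section relies on two statistics: the coefficient of variation of median income and the Pearson correlation between pairs of columns. The sketch below reproduces both in plain Python to show what the pandas calls compute. The helper names and sample values are illustrative assumptions; the 0.3 cutoff is the one used in the cell.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (a unitless spread measure)."""
    # statistics.stdev uses n - 1 in the denominator, like pandas' default .std()
    return stdev(values) / mean(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def classify_correlation(r, threshold=0.3):
    """Apply the same three-way rule as the insight cell."""
    if r > threshold:
        return "positive"
    if r < -threshold:
        return "negative"
    return "no strong correlation"


# Illustrative values only: three regions where poverty falls as income rises
incomes = [40_000, 60_000, 80_000]
poverty = [18.0, 12.0, 6.0]
cv = coefficient_of_variation(incomes)
r = pearson_r(incomes, poverty)
print(f"CV = {cv:.3f}, r = {r:.3f} ({classify_correlation(r)})")
```

Under this rule, a correlation such as -0.281 is reported as no strong correlation, while -0.757 is classified as negative.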


In [17]:
# ============================================================================
# KEY INSIGHTS GENERATION
# ============================================================================

if df_clean is not None and len(df_clean) > 0:
    print("\n" + "="*80)
    print("KEY INSIGHTS & RECOMMENDATIONS")
    print("="*80)
    
    # Insight 1: Income Disparity
    income_range = df_clean['median_household_income'].max() - df_clean['median_household_income'].min()
    income_cv = df_clean['median_household_income'].std() / df_clean['median_household_income'].mean()
    
    print("\n1. INCOME DISPARITY:")
    print(f"   - Income range: ${income_range:,.0f}")
    print(f"   - Coefficient of variation: {income_cv:.3f}")
    if income_cv > 0.2:
        print("   WARNING: HIGH income disparity detected - significant regional inequality")
    else:
        print("   MODERATE income disparity - relatively balanced")
    
    # Insight 2: Inequality Patterns
    gini_high = df_clean[df_clean['gini_index'] > 0.45]
    gini_correlation = df_clean['median_household_income'].corr(df_clean['gini_index'])
    
    print("\n2. INEQUALITY PATTERNS:")
    print(f"   - {len(gini_high)} regions with Gini > 0.45 (high inequality)")
    print(f"   - Income-Inequality correlation: {gini_correlation:.3f}")
    if gini_correlation > 0.3:
        print("   Higher income regions tend to have MORE inequality")
    elif gini_correlation < -0.3:
        print("   Higher income regions tend to have LESS inequality")
    else:
        print("   No strong correlation between income level and inequality")
    
    # Insight 3: Poverty vs Income
    poverty_correlation = df_clean['median_household_income'].corr(df_clean['poverty_rate'])
    high_poverty_high_income = df_clean[
        (df_clean['poverty_rate'] > 15) & 
        (df_clean['median_household_income'] > df_clean['median_household_income'].median())
    ]
    
    print("\n3. POVERTY ANALYSIS:")
    print(f"   - Income-Poverty correlation: {poverty_correlation:.3f}")
    print(f"   - Regions with high income BUT high poverty: {len(high_poverty_high_income)}")
    if len(high_poverty_high_income) > 0:
        print("   WARNING: Some regions combine above-median income with a poverty rate above 15%")
    
    # Recommendations
    print("\n" + "="*80)
    print("STRATEGIC RECOMMENDATIONS")
    print("="*80)
    
    print("\nFOR POLICYMAKERS:")
    print("  1. Target high-inequality regions with progressive taxation policies")
    print("  2. Increase minimum wage in low-income areas to reduce poverty")
    print("  3. Implement income support programs in regions with poverty rate > 15%")
    
    print("\nFOR BUSINESSES:")
    print("  1. Premium products: Focus on high-income, low-inequality regions")
    print("  2. Value products: Target high-poverty regions with affordable options")
    print("  3. Workforce strategy: Consider cost of living when setting wages")
    
    print("\nFOR NONPROFITS:")
    print("  1. Prioritize resources to high-poverty, high-inequality regions")
    print("  2. Partner with local organizations in bottom 10 income regions")
    print("  3. Advocate for policies that reduce income inequality")
    
    print("\n" + "="*80)
else:
    print("\nWARNING: No data available for insights generation.")


KEY INSIGHTS & RECOMMENDATIONS

1. INCOME DISPARITY:
   - Income range: $81,191
   - Coefficient of variation: 0.192
   MODERATE income disparity - relatively balanced

2. INEQUALITY PATTERNS:
   - 41 regions with Gini > 0.45 (high inequality)
   - Income-Inequality correlation: -0.281
   No strong correlation between income level and inequality

3. POVERTY ANALYSIS:
   - Income-Poverty correlation: -0.757
   - Regions with high income BUT high poverty: 0

STRATEGIC RECOMMENDATIONS

FOR POLICYMAKERS:
  1. Target high-inequality regions with progressive taxation policies
  2. Increase minimum wage in low-income areas to reduce poverty
  3. Implement income support programs in regions with poverty rate > 15%

FOR BUSINESSES:
  1. Premium products: Focus on high-income, low-inequality regions
  2. Value products: Target high-poverty regions with affordable options
  3. Workforce strategy: Consider cost of living when setting wages

FOR NONPROFITS:
  1. Prioritize resources to high-povert

## 8. Export Results

In [18]:
# ============================================================================
# EXPORT RESULTS TO JSON
# ============================================================================

if df_clean is not None and len(df_clean) > 0:
    # Prepare export data
    export_data = {
        'metadata': {
            'notebook': 'Tier1_Income_Distribution_ACS.ipynb',
            'version': 'v1.0',
            'date_generated': datetime.now().isoformat(),
            'data_source': 'Census ACS 5-Year Estimates (2019-2023)',
            'geographic_level': ANALYSIS_CONFIG['geographic_level'],
            'n_records': len(df_clean)
        },
        'summary_statistics': {
            'median_income': {
                'mean': float(df_clean['median_household_income'].mean()),
                'median': float(df_clean['median_household_income'].median()),
                'std': float(df_clean['median_household_income'].std()),
                'min': float(df_clean['median_household_income'].min()),
                'max': float(df_clean['median_household_income'].max())
            },
            'gini_index': {
                'mean': float(df_clean['gini_index'].mean()),
                'median': float(df_clean['gini_index'].median()),
                'std': float(df_clean['gini_index'].std())
            },
            'poverty_rate': {
                'mean': float(df_clean['poverty_rate'].mean()),
                'median': float(df_clean['poverty_rate'].median()),
                'regions_above_15pct': int((df_clean['poverty_rate'] > 15).sum())
            }
        },
        'top_10_income': df_clean.nlargest(10, 'median_household_income')[['NAME', 'median_household_income']].to_dict(orient='records'),
        'bottom_10_income': df_clean.nsmallest(10, 'median_household_income')[['NAME', 'median_household_income']].to_dict(orient='records')
    }
    
    # Save to JSON
    output_path = Path('../../reports/tier1_income_distribution_results.json')
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    with open(output_path, 'w') as f:
        json.dump(export_data, f, indent=2)
    
    print(f"\nResults exported to: {output_path}")
    print(f"File size: {output_path.stat().st_size / 1024:.1f} KB")
    
    # Also save full dataset as CSV
    csv_path = Path('../../reports/tier1_income_distribution_data.csv')
    df_clean.to_csv(csv_path, index=False)
    
    print(f"\nFull dataset exported to: {csv_path}")
    print(f"Rows: {len(df_clean):,}")
    print(f"Columns: {len(df_clean.columns)}")
else:
    print("\nWARNING: No data available for export.")


Results exported to: ../../reports/tier1_income_distribution_results.json
File size: 2.3 KB

Full dataset exported to: ../../reports/tier1_income_distribution_data.csv
Rows: 52
Columns: 12
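
A downstream notebook that consumes the exported JSON may want to confirm its structure before using it. The following is a minimal sketch of such a check. The function name `missing_export_keys` is hypothetical; the key names are taken from the `export_data` dictionary in the export cell above.

```python
import json

# Keys that the export cell writes under each nested section
REQUIRED_KEYS = {
    "metadata": {
        "notebook", "version", "date_generated",
        "data_source", "geographic_level", "n_records",
    },
    "summary_statistics": {"median_income", "gini_index", "poverty_rate"},
}
# Top-level entries that hold the ranked record lists
TOP_LEVEL_LISTS = ("top_10_income", "bottom_10_income")


def missing_export_keys(data):
    """Return a sorted list of dotted key paths absent from an exported results dict."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        present = data.get(section, {})
        missing += [f"{section}.{key}" for key in keys if key not in present]
    missing += [key for key in TOP_LEVEL_LISTS if key not in data]
    return sorted(missing)


# Illustrative, deliberately incomplete payload
sample = json.loads('{"metadata": {"notebook": "x"}, "top_10_income": []}')
problems = missing_export_keys(sample)
print(problems)
```

For the real file, the same function could be called on `json.loads(output_path.read_text())`; an empty list would indicate that every expected key is present.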


---

## Summary

### **Completed Steps**
- Data ingestion from Census ACS API
- Data preprocessing and validation
- Exploratory data analysis
- Income distribution visualization
- Inequality analysis (Gini Index)
- Geographic choropleth mapping
- Key insights generation
- Results export (JSON + CSV)

### **Next Steps**
1. **Tier 2**: `Tier2_Income_Prediction_ACS.ipynb` - Predictive modeling of income determinants using Random Forest and XGBoost
2. **Tier 3**: Time series analysis of income trends with ARIMA/Prophet forecasting
3. **Tier 6**: Spatial econometric models to analyze income spillover effects between regions

### **Key Findings**
- State-level median household income spans a range of $81,191, but the coefficient of variation (0.192) falls below the notebook's 0.2 threshold, so disparity is classed as moderate
- The Gini Index reveals inequality patterns not visible in median income alone: 41 of 52 regions exceed 0.45, and the income-inequality correlation is weak (-0.281)
- No region combined above-median income with a poverty rate above 15% in this run
- Strong negative correlation between poverty rate and median income (-0.757), as expected

---