# EITC Analysis: Childless Families by Phase-in/Phase-out Status

## Overview
This notebook analyzes **childless tax units** (those with no EITC-qualifying children) across all 50 US states + DC, categorizing them by where they fall on the EITC schedule.

## What This Notebook Does
1. **Loads state-specific microdata** from PolicyEngine's HuggingFace repository
2. **Filters to childless households** (eitc_child_count == 0)
3. **Categorizes each household** into one of 5 EITC phase statuses
4. **Calculates weighted counts and percentages** by state
5. **Exports summary and detailed data** to CSV files

## EITC Phase Status Categories
| Status | Description |
|--------|-------------|
| **No income** | No/minimal earned income ($100 or less), not receiving EITC |
| **Pre-phase-in** | Earning income but haven't reached maximum credit yet |
| **Full amount** | At the plateau - receiving maximum credit |
| **Partially phased out** | In phase-out range, receiving reduced credit |
| **Fully phased out** | Income too high, EITC reduced to $0 |

## Data Source
- **State datasets**: `hf://policyengine/policyengine-us-data/states/{STATE}.h5`
- Each state has its own dataset with representative household microdata
- Data is weighted to represent the actual population

## Output Files
- `eitc_childless_phase_status_summary_{year}.csv` - Aggregated by state and phase status
- `eitc_childless_families_{year}.csv` - Detailed household-level data (large files, ~125MB each)

## Years Analyzed
- 2024 and 2025

## Setup and Imports

In [None]:
# =============================================================================
# IMPORTS AND CONFIGURATION
# =============================================================================
# 
# policyengine_us: PolicyEngine's US tax-benefit microsimulation model
#   - Microsimulation: Class for running simulations on survey microdata
#   - Loads datasets, calculates tax/benefit variables for each household
#
# pandas/numpy: Standard data manipulation libraries
# =============================================================================

from policyengine_us import Microsimulation
import pandas as pd
import numpy as np

# Configure pandas display options for better output formatting
pd.set_option('display.max_columns', None)      # Show all columns
pd.set_option('display.width', None)            # Don't wrap output
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')  # Format numbers with commas

## EITC Phase Status Classification

The Earned Income Tax Credit (EITC) follows a trapezoidal schedule:

```
Credit
Amount
   ^
   |      ___________
   |     /           \
   |    /             \
   |   /               \
   |  /                 \
   | /                   \
   |/_____________________\____> Earned Income
     Phase-in  Plateau  Phase-out
```

### How We Classify Households

We use PolicyEngine's calculated variables to determine where each household falls:

| Variable | Description |
|----------|-------------|
| `eitc` | Final EITC amount received (after all calculations) |
| `eitc_maximum` | Maximum possible EITC for this household's filing status |
| `eitc_phased_in` | Amount "earned" based on phase-in rate × earned income |
| `eitc_reduction` | Amount reduced due to being in phase-out range |
| `tax_unit_earned_income` | Total earned income for the tax unit |

### Classification Logic
1. **No income**: Earned income ≤ $100 AND eitc = 0
2. **Pre-phase-in**: Receiving EITC but eitc_phased_in < eitc_maximum
3. **Full amount**: eitc_phased_in ≥ eitc_maximum AND eitc_reduction = 0
4. **Partially phased out**: Receiving EITC AND eitc_reduction > 0
5. **Fully phased out**: eitc = 0 AND (has reduction OR phased_in ≥ maximum)

In [None]:
# =============================================================================
# EITC PHASE STATUS CLASSIFICATION FUNCTION
# =============================================================================
# This function takes a DataFrame of households and classifies each one into
# one of 5 EITC phase statuses based on their income and EITC calculations.
#
# Uses numpy's np.select() for efficient vectorized conditional logic.
# =============================================================================

def determine_eitc_phase_status_vectorized(df):
    """
    Classify each household into an EITC phase status category.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Must contain columns: tax_unit_earned_income, eitc, eitc_reduction,
        eitc_phased_in, eitc_maximum
    
    Returns:
    --------
    numpy.ndarray
        Array of status strings, one per row in df
    
    Categories:
    -----------
    - No income: No/minimal earned income, not receiving EITC
    - Pre-phase-in: Earning but haven't reached maximum credit yet
    - Full amount: At maximum credit (plateau region)
    - Partially phased out: In phase-out region, still receiving some credit
    - Fully phased out: Income too high, EITC reduced to $0
    """
    
    # Define conditions in priority order (first match wins)
    # Each condition is a boolean array the same length as df
    conditions = [
        # CONDITION 1: No income
        # Household has little/no earned income AND isn't receiving EITC
        (df['tax_unit_earned_income'] <= 100) & (df['eitc'] <= 0),
        
        # CONDITION 2: Fully phased out (with reduction)
        # Not receiving EITC, but has earned income and would have had reduction
        (df['eitc'] <= 0) & (df['tax_unit_earned_income'] > 100) & (df['eitc_reduction'] > 0),
        
        # CONDITION 3: Fully phased out (hit max then reduced to zero)
        # Not receiving EITC, but phased_in amount reached/exceeded maximum
        (df['eitc'] <= 0) & (df['eitc_phased_in'] >= df['eitc_maximum']),
        
        # CONDITION 4: Pre-phase-in
        # Receiving EITC, but haven't earned enough to hit maximum yet
        (df['eitc'] > 0) & (df['eitc_phased_in'] < df['eitc_maximum']),
        
        # CONDITION 5: Partially phased out
        # Receiving EITC, but some reduction has been applied
        (df['eitc'] > 0) & (df['eitc_reduction'] > 0),
        
        # CONDITION 6: Full amount (plateau)
        # Receiving EITC at maximum, no reduction applied
        (df['eitc'] > 0) & (df['eitc_phased_in'] >= df['eitc_maximum']) & (df['eitc_reduction'] <= 0)
    ]
    
    # Labels corresponding to each condition above
    choices = [
        'No income',
        'Fully phased out',
        'Fully phased out',
        'Pre-phase-in',
        'Partially phased out',
        'Full amount'
    ]
    
    # np.select applies conditions in order, returns first matching choice
    # Default 'No income' catches any edge cases
    return np.select(conditions, choices, default='No income')

## Data Loading Functions

The following cell defines two key functions:

### `run_state_eitc_analysis(state_abbr, year)`
Loads and processes data for a single state:
1. Loads the state's microdata from HuggingFace
2. Calculates all relevant EITC and household variables
3. Filters to childless households only
4. Classifies each household by EITC phase status
5. Returns a DataFrame with one row per household

### `run_all_states_analysis(year)`
Orchestrates the full analysis:
1. Loops through all 51 states/DC
2. Calls `run_state_eitc_analysis()` for each
3. Combines all results into a single DataFrame
4. Reports progress and totals

### Variables Calculated
| Variable | Description |
|----------|-------------|
| `tax_unit_weight` | Survey weight (how many real households this record represents) |
| `eitc` | Federal EITC amount received |
| `state_eitc` | State EITC amount (if state has a program) |
| `eitc_child_count` | Number of EITC-qualifying children (we filter to 0) |
| `filing_status` | Tax filing status (Single, Joint, etc.) |
| `age_head` | Age of primary filer |
| `adjusted_gross_income` | AGI for the tax unit |

In [None]:
# =============================================================================
# STATE LIST AND DATA LOADING FUNCTIONS
# =============================================================================

# All US states + DC (51 total)
# Modify this list to analyze a subset of states
ALL_STATES = [
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 
    'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 
    'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 
    'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 
    'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'
]

# Order for sorting phase statuses (follows the EITC schedule from left to right)
PHASE_ORDER = ['No income', 'Pre-phase-in', 'Full amount', 'Partially phased out', 'Fully phased out']


def run_state_eitc_analysis(state_abbr, year):
    """
    Load and analyze EITC data for a single state.
    
    Parameters:
    -----------
    state_abbr : str
        Two-letter state abbreviation (e.g., 'CA', 'NY', 'TX')
    year : int
        Tax year to analyze (e.g., 2024, 2025)
    
    Returns:
    --------
    pandas.DataFrame or None
        DataFrame with one row per childless tax unit, or None if error
    """
    try:
        # -----------------------------------------------------------------
        # STEP 1: Load the state's microdata from HuggingFace
        # -----------------------------------------------------------------
        # Each state has its own .h5 file with representative household data
        # The data is weighted to represent the state's actual population
        dataset_path = f"hf://policyengine/policyengine-us-data/states/{state_abbr}.h5"
        sim = Microsimulation(dataset=dataset_path)
        
        # -----------------------------------------------------------------
        # STEP 2: Calculate required variables using PolicyEngine
        # -----------------------------------------------------------------
        # These are "tax unit" level variables (a tax unit = people filing together)
        # sim.calculate() returns a weighted array of values
        data = {}
        
        tax_unit_vars = [
            'tax_unit_id',              # Unique identifier for each tax unit
            'tax_unit_weight',          # Survey weight (represents X real households)
            'eitc',                     # Federal EITC amount (final, after all calculations)
            'eitc_maximum',             # Max possible EITC for this filing status
            'eitc_phased_in',           # Amount "earned" via phase-in calculation
            'eitc_reduction',           # Amount reduced due to phase-out
            'eitc_child_count',         # Number of EITC-qualifying children
            'state_eitc',               # State EITC amount (0 if no state program)
            'adjusted_gross_income',    # AGI
            'tax_unit_earned_income',   # Total earned income
            'filing_status',            # 1=Single, 2=Joint, 3=Separate, 4=HoH, 5=Widow
            'age_head',                 # Age of primary filer
            'age_spouse',               # Age of spouse (0 if single)
        ]
        
        # Calculate each variable and extract the numpy array
        for var in tax_unit_vars:
            result = sim.calculate(var, period=year)
            # .values extracts the underlying numpy array from PolicyEngine's result
            data[var] = result.values if hasattr(result, 'values') else np.array(result)
        
        # Create DataFrame from the calculated values
        df = pd.DataFrame(data)
        df['state'] = state_abbr  # Add state identifier
        
        # -----------------------------------------------------------------
        # STEP 3: Filter to childless households only
        # -----------------------------------------------------------------
        # We want ALL childless households, not just those receiving EITC
        # This lets us calculate percentages that sum to 100%
        childless_mask = df['eitc_child_count'] == 0
        df_childless = df[childless_mask].copy()
        
        if len(df_childless) == 0:
            return None
        
        # -----------------------------------------------------------------
        # STEP 4: Classify each household by EITC phase status
        # -----------------------------------------------------------------
        df_childless['eitc_phase_status'] = determine_eitc_phase_status_vectorized(df_childless)
        
        # -----------------------------------------------------------------
        # STEP 5: Add readable labels for filing status
        # -----------------------------------------------------------------
        df_childless['year'] = year
        
        filing_status_map = {
            1: 'Single',
            2: 'Joint',
            3: 'Separate',
            4: 'Head of Household',
            5: 'Widow(er)'
        }
        df_childless['filing_status_label'] = df_childless['filing_status'].map(filing_status_map).fillna('Unknown')
        
        return df_childless
        
    except Exception as e:
        print(f"  Error processing {state_abbr}: {e}")
        return None


def run_all_states_analysis(year, states=None):
    """
    Run EITC analysis for all states and combine results.
    
    Parameters:
    -----------
    year : int
        Tax year to analyze
    states : list, optional
        List of state abbreviations. Defaults to ALL_STATES (all 51).
    
    Returns:
    --------
    pandas.DataFrame
        Combined DataFrame with all states' data
    """
    if states is None:
        states = ALL_STATES
    
    print(f"\n{'='*60}")
    print(f"Running analysis for {year}")
    print(f"{'='*60}")
    
    all_results = []
    
    # Process each state
    for i, state in enumerate(states):
        print(f"Processing {state} ({i+1}/{len(states)})...", end=" ")
        result = run_state_eitc_analysis(state, year)
        
        if result is not None and len(result) > 0:
            # Report: raw record count and weighted population count
            weighted_count = result['tax_unit_weight'].sum()
            print(f"{len(result):,} records, {weighted_count:,.0f} weighted")
            all_results.append(result)
        else:
            print("No data found")
    
    # Combine all state DataFrames
    if all_results:
        combined = pd.concat(all_results, ignore_index=True)
        print(f"\nTotal: {len(combined):,} records, {combined['tax_unit_weight'].sum():,.0f} weighted tax units")
        return combined
    else:
        return pd.DataFrame()

## Run Analysis for 2024 and 2025

In [None]:
# =============================================================================
# RUN ANALYSIS FOR 2024
# =============================================================================
# This cell processes all 51 states/DC for tax year 2024.
# 
# Output:
#   df_2024 - DataFrame containing all childless tax units from all states
#            with EITC calculations and phase status classification
#
# Processing time: Approximately 5-10 minutes depending on internet speed
#                  (downloads ~50MB of data from HuggingFace)
# =============================================================================

df_2024 = run_all_states_analysis(2024)

In [None]:
# =============================================================================
# RUN ANALYSIS FOR 2025
# =============================================================================
# Same analysis as above but for tax year 2025.
# PolicyEngine uses inflation-adjusted parameters for future years.
#
# Output:
#   df_2025 - DataFrame containing all childless tax units for 2025
# =============================================================================

df_2025 = run_all_states_analysis(2025)

In [None]:
# =============================================================================
# COMBINE BOTH YEARS INTO SINGLE DATASET
# =============================================================================
# Creates a unified dataset with both years for cross-year comparisons.
# The 'year' column distinguishes records from each tax year.
#
# Note: This combined dataset is primarily for exploratory analysis.
#       The exports are done separately by year for cleaner output files.
# =============================================================================

df_combined = pd.concat([df_2024, df_2025], ignore_index=True)
print(f"\nCombined dataset: {len(df_combined):,} records")

## Summary Statistics

### EITC Phase Status Distribution

In [None]:
# =============================================================================
# PHASE STATUS SUMMARY BY STATE
# =============================================================================
# This function creates the main summary output: for each state, what
# percentage of childless households fall into each EITC phase status?
#
# Key outputs per state × phase status:
#   - weighted_households: Actual population count (using survey weights)
#   - pct_of_state: What % of that state's childless households are in this phase
#   - avg_federal_eitc: Average federal EITC for households receiving EITC
#   - avg_state_eitc: Average state EITC (for states with programs)
#
# The percentages should sum to 100% for each state since we include ALL
# childless households (not just EITC recipients).
# =============================================================================

def create_phase_status_summary(df, year_label):
    """
    Create summary of EITC phase status by state with weighted counts and percentages.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Household-level data from run_all_states_analysis()
    year_label : str
        Label for display (e.g., "2024")
    
    Returns:
    --------
    pandas.DataFrame
        Summary with columns: state, eitc_phase_status, weighted_households,
        pct_of_state, avg_federal_eitc, avg_state_eitc
    """
    print(f"\n{'='*70}")
    print(f"EITC Phase Status by State - {year_label}")
    print(f"{'='*70}")
    
    # Step 1: Calculate weighted counts by state and phase status
    # tax_unit_weight is summed to get population-representative counts
    summary = df.groupby(['state', 'eitc_phase_status']).agg({
        'tax_unit_weight': 'sum',
    }).reset_index()
    
    summary.columns = ['state', 'eitc_phase_status', 'weighted_households']
    
    # Step 2: Calculate state totals for percentage calculation
    state_totals = summary.groupby('state')['weighted_households'].sum().reset_index()
    state_totals.columns = ['state', 'state_total']
    
    # Step 3: Merge to compute percentages
    summary = summary.merge(state_totals, on='state')
    summary['pct_of_state'] = (summary['weighted_households'] / summary['state_total'] * 100).round(1)
    
    # Step 4: Add average EITC amounts (only computed for households receiving EITC)
    # This uses weighted averages: sum(value × weight) / sum(weight)
    avg_eitc = df[df['eitc'] > 0].groupby(['state', 'eitc_phase_status']).apply(
        lambda x: pd.Series({
            'avg_federal_eitc': (x['eitc'] * x['tax_unit_weight']).sum() / x['tax_unit_weight'].sum(),
            'avg_state_eitc': (x['state_eitc'] * x['tax_unit_weight']).sum() / x['tax_unit_weight'].sum(),
        })
    ).reset_index()
    
    summary = summary.merge(avg_eitc, on=['state', 'eitc_phase_status'], how='left')
    summary['avg_federal_eitc'] = summary['avg_federal_eitc'].fillna(0)
    summary['avg_state_eitc'] = summary['avg_state_eitc'].fillna(0)
    
    # Step 5: Clean up columns and sort
    summary = summary[['state', 'eitc_phase_status', 'weighted_households', 'pct_of_state', 
                       'avg_federal_eitc', 'avg_state_eitc']]
    
    # Sort by state alphabetically, then by phase status in logical order
    summary['phase_sort'] = summary['eitc_phase_status'].map({p: i for i, p in enumerate(PHASE_ORDER)})
    summary = summary.sort_values(['state', 'phase_sort']).drop('phase_sort', axis=1)
    
    return summary

# Generate summaries for both years
summary_2024 = create_phase_status_summary(df_2024, "2024")
summary_2025 = create_phase_status_summary(df_2025, "2025")

# Preview the results
print("\n2024 Summary (first 20 rows):")
print(summary_2024.head(20).to_string(index=False))
print("\n2025 Summary (first 20 rows):")
print(summary_2025.head(20).to_string(index=False))

### Distribution by Filing Status (Marital Status)

In [None]:
# =============================================================================
# EXAMPLE HOUSEHOLDS BY PHASE STATUS
# =============================================================================
# Shows concrete examples of households in each phase status to help
# understand what kinds of households fall into each category.
#
# This is useful for validation and for explaining the analysis to stakeholders.
# =============================================================================

def show_example_households(df, year_label, n_examples=3):
    """
    Show example households from each phase status with key characteristics.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Household-level data
    year_label : str
        Label for display
    n_examples : int
        Number of examples per phase status (default 3)
    
    Returns:
    --------
    pandas.DataFrame
        Sample households with key characteristics
    """
    print(f"\n{'='*70}")
    print(f"Example Households by Phase Status - {year_label}")
    print(f"{'='*70}")
    
    examples = []
    
    # Only show examples for phases where households receive some EITC
    # (No income and Fully phased out receive $0, so less interesting as examples)
    for phase in ['Pre-phase-in', 'Full amount', 'Partially phased out']:
        phase_df = df[df['eitc_phase_status'] == phase]
        if len(phase_df) > 0:
            # Random sample with fixed seed for reproducibility
            sample = phase_df.sample(min(n_examples, len(phase_df)), random_state=42)
            for _, row in sample.iterrows():
                examples.append({
                    'phase_status': phase,
                    'state': row['state'],
                    'marital_status': row['filing_status_label'],
                    'age_head': int(row['age_head']),
                    'agi': row['adjusted_gross_income'],
                    'earned_income': row['tax_unit_earned_income'],
                    'federal_eitc': row['eitc'],
                    'state_eitc': row['state_eitc'],
                })
    
    examples_df = pd.DataFrame(examples)
    return examples_df

# Show examples for 2024
examples_2024 = show_example_households(df_2024, "2024")
print(examples_2024.to_string(index=False))

### Distribution by State

In [None]:
# =============================================================================
# SUMMARY BY STATE - TOP STATES BY POPULATION
# =============================================================================
# Shows the states with the largest childless tax unit populations,
# along with total and average EITC amounts.
#
# Useful for understanding which states contribute most to the national totals.
# =============================================================================

def summary_by_state(df, year_label, top_n=15):
    """
    Create summary by state showing top N by number of childless tax units.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Household-level data
    year_label : str
        Label for display
    top_n : int
        Number of top states to show (default 15)
    
    Returns:
    --------
    pandas.DataFrame
        State-level summary sorted by weighted tax unit count
    """
    print(f"\n{'='*60}")
    print(f"Top {top_n} States by EITC Recipients - {year_label}")
    print(f"{'='*60}")
    
    # Calculate state-level aggregates using weighted sums/averages
    summary = df.groupby('state').apply(
        lambda x: pd.Series({
            # Total weighted tax units in state
            'Tax Units (Weighted)': x['tax_unit_weight'].sum(),
            # Total federal EITC distributed (weight × eitc amount)
            'Total Federal EITC': (x['eitc'] * x['tax_unit_weight']).sum(),
            # Total state EITC distributed
            'Total State EITC': (x['state_eitc'] * x['tax_unit_weight']).sum(),
            # Weighted average federal EITC per tax unit
            'Avg Federal EITC': (x['eitc'] * x['tax_unit_weight']).sum() / x['tax_unit_weight'].sum(),
            # Weighted average state EITC per tax unit
            'Avg State EITC': (x['state_eitc'] * x['tax_unit_weight']).sum() / x['tax_unit_weight'].sum(),
            # Boolean: does this state have a state EITC program?
            'Has State EITC': (x['state_eitc'] * x['tax_unit_weight']).sum() > 0,
        })
    ).reset_index()
    
    # Sort by number of tax units (largest states first)
    summary = summary.sort_values('Tax Units (Weighted)', ascending=False).head(top_n)
    
    return summary

# Generate and display for both years
state_2024 = summary_by_state(df_2024, "2024")
print(state_2024.to_string(index=False))

state_2025 = summary_by_state(df_2025, "2025")
print(state_2025.to_string(index=False))

### Cross-tabulation: Phase Status by Filing Status

In [None]:
# =============================================================================
# CROSS-TABULATION: PHASE STATUS × FILING STATUS
# =============================================================================
# Creates a pivot table showing how phase status varies by filing status.
#
# Note: Due to data limitations in the state datasets, filing status may
# show as "Unknown" for many records.
# =============================================================================

def crosstab_phase_by_filing(df, year_label):
    """
    Create cross-tabulation of phase status by filing status (marital status).
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Household-level data
    year_label : str
        Label for display
    
    Returns:
    --------
    pandas.DataFrame
        Pivot table with phase status as rows, filing status as columns
    """
    print(f"\n{'='*60}")
    print(f"Phase Status by Filing Status (Weighted Tax Units) - {year_label}")
    print(f"{'='*60}")
    
    # Create pivot table: rows = phase status, columns = filing status
    # Values are weighted tax unit counts
    pivot = df.pivot_table(
        values='tax_unit_weight',
        index='eitc_phase_status',
        columns='filing_status_label',
        aggfunc='sum',
        fill_value=0
    )
    
    # Add row and column totals for context
    pivot['Total'] = pivot.sum(axis=1)
    pivot.loc['Total'] = pivot.sum()
    
    return pivot

# Generate for both years
crosstab_2024 = crosstab_phase_by_filing(df_2024, "2024")
print(crosstab_2024.to_string())

crosstab_2025 = crosstab_phase_by_filing(df_2025, "2025")
print(crosstab_2025.to_string())

### Age Distribution

In [None]:
# =============================================================================
# AGE DISTRIBUTION ANALYSIS
# =============================================================================
# Shows how childless tax units are distributed by age of the head of household.
#
# Key insight: The childless EITC has age restrictions (25-64 for 2024 under
# current law), so we expect most EITC recipients to fall within that range.
# =============================================================================

def age_distribution(df, year_label):
    """
    Create age group distribution for heads of household.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Household-level data
    year_label : str
        Label for display
    
    Returns:
    --------
    pandas.DataFrame
        Summary by age group with weighted counts and averages
    """
    print(f"\n{'='*60}")
    print(f"Age Distribution of Head of Household - {year_label}")
    print(f"{'='*60}")
    
    # Create age groups using pd.cut
    df_copy = df.copy()
    df_copy['age_group'] = pd.cut(
        df_copy['age_head'],
        bins=[0, 25, 35, 45, 55, 65, 100],
        labels=['Under 25', '25-34', '35-44', '45-54', '55-64', '65+']
    )
    
    # Calculate weighted statistics by age group
    summary = df_copy.groupby('age_group').apply(
        lambda x: pd.Series({
            'Tax Units (Weighted)': x['tax_unit_weight'].sum(),
            'Avg Federal EITC': (x['eitc'] * x['tax_unit_weight']).sum() / x['tax_unit_weight'].sum() if x['tax_unit_weight'].sum() > 0 else 0,
            'Avg Earned Income': (x['tax_unit_earned_income'] * x['tax_unit_weight']).sum() / x['tax_unit_weight'].sum() if x['tax_unit_weight'].sum() > 0 else 0,
        })
    ).reset_index()
    
    # Add percentage of total
    total_units = summary['Tax Units (Weighted)'].sum()
    summary['% of Total'] = (summary['Tax Units (Weighted)'] / total_units * 100).round(1)
    
    return summary

# Generate for both years
age_2024 = age_distribution(df_2024, "2024")
print(age_2024.to_string(index=False))

age_2025 = age_distribution(df_2025, "2025")
print(age_2025.to_string(index=False))

### States with State EITC Programs

In [None]:
# =============================================================================
# STATE EITC PROGRAM ANALYSIS
# =============================================================================
# Shows which states have state EITC programs and how generous they are.
#
# State EITCs are typically calculated as a percentage of the federal EITC,
# ranging from ~3% (Montana) to ~125% (South Carolina).
# =============================================================================

def state_eitc_summary(df, year_label):
    """
    Summary of states with state EITC programs.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Household-level data
    year_label : str
        Label for display
    
    Returns:
    --------
    pandas.DataFrame or None
        Summary for states with state EITC programs, sorted by total distributed
    """
    print(f"\n{'='*60}")
    print(f"States with State EITC Benefits - {year_label}")
    print(f"{'='*60}")
    
    # Filter to only households actually receiving state EITC
    df_with_state_eitc = df[df['state_eitc'] > 0]
    
    if len(df_with_state_eitc) == 0:
        print("No state EITC benefits found in the data.")
        return None
    
    # Calculate state-level summaries
    summary = df_with_state_eitc.groupby('state').apply(
        lambda x: pd.Series({
            # Number of recipients (weighted)
            'Tax Units (Weighted)': x['tax_unit_weight'].sum(),
            # Total state EITC distributed
            'Total State EITC': (x['state_eitc'] * x['tax_unit_weight']).sum(),
            # Average per recipient
            'Avg State EITC': (x['state_eitc'] * x['tax_unit_weight']).sum() / x['tax_unit_weight'].sum(),
            # State EITC as percentage of federal (indicates program generosity)
            'State EITC as % of Fed': ((x['state_eitc'] * x['tax_unit_weight']).sum() / 
                                       (x['eitc'] * x['tax_unit_weight']).sum() * 100) if (x['eitc'] * x['tax_unit_weight']).sum() > 0 else 0,
        })
    ).reset_index()
    
    # Sort by total distributed (largest programs first)
    summary = summary.sort_values('Total State EITC', ascending=False)
    
    return summary

# Generate for both years
state_eitc_2024 = state_eitc_summary(df_2024, "2024")
if state_eitc_2024 is not None:
    print(state_eitc_2024.to_string(index=False))

state_eitc_2025 = state_eitc_summary(df_2025, "2025")
if state_eitc_2025 is not None:
    print(state_eitc_2025.to_string(index=False))

## Export Data to CSV

In [None]:
# =============================================================================
# EXPORT DETAILED HOUSEHOLD DATA
# =============================================================================
# Exports the full household-level dataset with all calculated variables.
#
# WARNING: These files are large (~125MB each) and are excluded from git
# via .gitignore. They are generated locally when the notebook runs.
#
# Use cases:
#   - Detailed analysis in external tools (Excel, Stata, R)
#   - Validation of the summary statistics
#   - Custom filtering/aggregation not provided in this notebook
# =============================================================================

def export_household_data(df, year):
    """
    Export household-level data to CSV, sorted by state and phase status.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Household-level data from run_all_states_analysis()
    year : int
        Tax year (used in filename)
    
    Returns:
    --------
    pandas.DataFrame
        The exported data (same as written to file)
    
    Output File:
        eitc_childless_families_{year}.csv
    """
    
    # Select columns for export (excluding eitc_maximum per user request)
    export_columns = [
        'state',                    # State abbreviation
        'eitc_phase_status',        # Classification result
        'tax_unit_id',              # Unique identifier
        'tax_unit_weight',          # Survey weight
        'eitc',                     # Federal EITC amount
        'state_eitc',               # State EITC amount
        'eitc_phased_in',           # Phase-in calculation
        'eitc_reduction',           # Phase-out reduction
        'tax_unit_earned_income',   # Total earned income
        'adjusted_gross_income',    # AGI
        'filing_status_label',      # Marital/filing status
        'age_head',                 # Age of primary filer
        'age_spouse',               # Age of spouse (0 if none)
    ]
    
    # Only include columns that exist in the DataFrame
    available_columns = [col for col in export_columns if col in df.columns]
    df_export = df[available_columns].copy()
    
    # Rename columns for clarity in external tools
    df_export = df_export.rename(columns={
        'eitc': 'federal_eitc',
        'filing_status_label': 'marital_status',
    })
    
    # Sort by state (alphabetically) then by phase status (in logical EITC order)
    df_export['phase_sort'] = df_export['eitc_phase_status'].map({p: i for i, p in enumerate(PHASE_ORDER)})
    df_export = df_export.sort_values(['state', 'phase_sort']).drop('phase_sort', axis=1)
    
    # Write to CSV
    filename = f'eitc_childless_families_{year}.csv'
    df_export.to_csv(filename, index=False)
    print(f"Exported {len(df_export):,} rows to: {filename}")
    
    return df_export

# Export both years to separate files
df_export_2024 = export_household_data(df_2024, 2024)
df_export_2025 = export_household_data(df_2025, 2025)

In [43]:
# Preview the data
print("\nSample of 2024 export data:")
df_export_2024.head(10)


Sample of 2024 export data:


Unnamed: 0,state,eitc_phase_status,tax_unit_id,tax_unit_weight,federal_eitc,state_eitc,eitc_phased_in,eitc_reduction,tax_unit_earned_income,adjusted_gross_income,marital_status,age_head,age_spouse
25751,AK,No income,0,0.8,0.0,0.0,0.0,0.0,0.0,3923.64,Unknown,79,0
25753,AK,No income,3,0.28,0.0,0.0,0.0,10068.1,0.0,148859.19,Unknown,76,74
25754,AK,No income,5,12.27,0.0,0.0,194.41,0.0,2541.26,3945.09,Unknown,64,0
25757,AK,No income,11,4387.35,0.0,0.0,0.0,3368.61,0.0,61284.13,Unknown,85,82
25761,AK,No income,15,639.52,0.0,0.0,0.0,992.74,0.0,23307.04,Unknown,85,0
25763,AK,No income,18,1114.78,0.0,0.0,0.0,0.0,0.0,1403.83,Unknown,83,0
25767,AK,No income,22,0.82,0.0,0.0,0.0,0.0,0.0,2153.92,Unknown,85,0
25769,AK,No income,24,792.77,0.0,0.0,0.0,20.54,0.0,10598.54,Unknown,81,0
25770,AK,No income,25,1.06,0.0,0.0,0.0,0.0,0.0,1403.83,Unknown,85,0
25771,AK,No income,27,1.04,0.0,0.0,0.0,0.0,0.0,1403.83,Unknown,64,0


In [44]:
# CSVs already exported in previous cell
# Files created:
# - eitc_childless_families_2024.csv
# - eitc_childless_families_2025.csv
print("Household data exported to separate files above.")

Household data exported to separate files above.


## Summary Statistics Export

In [None]:
# =============================================================================
# EXPORT SUMMARY DATA
# =============================================================================
# Exports the aggregated summary by state and phase status.
#
# These files are small (~10KB) and ARE included in git commits.
# This is the primary output for sharing with stakeholders.
#
# Output Files:
#   - eitc_childless_phase_status_summary_2024.csv
#   - eitc_childless_phase_status_summary_2025.csv
# =============================================================================

def export_summary(summary_df, year):
    """
    Export phase status summary to CSV, sorted by state and phase status.
    
    Parameters:
    -----------
    summary_df : pandas.DataFrame
        Summary from create_phase_status_summary()
    year : int
        Tax year (used in filename)
    
    Returns:
    --------
    pandas.DataFrame
        The exported data
    """
    df_export = summary_df.copy()
    
    # Sort by state (alphabetically) then phase status (logical EITC order)
    df_export['phase_sort'] = df_export['eitc_phase_status'].map({p: i for i, p in enumerate(PHASE_ORDER)})
    df_export = df_export.sort_values(['state', 'phase_sort']).drop('phase_sort', axis=1)
    
    # Write to CSV
    filename = f'eitc_childless_phase_status_summary_{year}.csv'
    df_export.to_csv(filename, index=False)
    print(f"Exported summary to: {filename}")
    return df_export

# Export both years
summary_2024_export = export_summary(summary_2024, 2024)
summary_2025_export = export_summary(summary_2025, 2025)

## Grand Totals

In [None]:
# =============================================================================
# NATIONAL TOTALS BY PHASE STATUS
# =============================================================================
# Aggregates across all states to show the national distribution of
# childless tax units by EITC phase status.
#
# Key insights:
#   - Most childless tax units (~62%) are "Fully phased out" (too much income)
#   - About 35% have "No income" (no earned income = no EITC)
#   - Only ~2% actually receive EITC (Pre-phase-in + Full amount + Partially)
# =============================================================================

def national_totals(df, year):
    """
    Calculate national totals by phase status.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Household-level data
    year : int
        Tax year (for output column)
    
    Returns:
    --------
    pandas.DataFrame
        National summary with weighted counts and percentages
    """
    totals = df.groupby('eitc_phase_status').agg({
        'tax_unit_weight': 'sum',
    }).reset_index()
    totals.columns = ['eitc_phase_status', 'weighted_households']
    
    # Calculate percentage of total
    total_all = totals['weighted_households'].sum()
    totals['pct_of_total'] = (totals['weighted_households'] / total_all * 100).round(1)
    totals['year'] = year
    return totals

# Display national totals
print("National Totals by Phase Status:")
print("\n2024:")
nat_2024 = national_totals(df_2024, 2024)
print(nat_2024.to_string(index=False))
print(f"\nTotal childless tax units: {nat_2024['weighted_households'].sum():,.0f}")

print("\n2025:")
nat_2025 = national_totals(df_2025, 2025)
print(nat_2025.to_string(index=False))
print(f"\nTotal childless tax units: {nat_2025['weighted_households'].sum():,.0f}")

## Notes

### Data Interpretation
- **Tax unit weights** represent the number of actual tax units each record represents in the population
- All monetary values are weighted averages/totals reflecting the full population
- The enhanced CPS dataset has ~42,000 household records that are weighted to represent the US population

### EITC Phase Status Definitions
1. **Pre-phase-in**: Earned income is below the level needed to receive the maximum credit. The credit amount equals (earned income × phase-in rate).
2. **Full amount**: Earned income is sufficient to receive the maximum credit, and income is below the phase-out threshold.
3. **Partially phased out**: Income is above the phase-out threshold, resulting in a reduced credit.
4. **Fully phased out**: Income is too high; credit is reduced to $0.

### State EITC Programs
Not all states have state EITC programs. States with programs typically calculate their EITC as a percentage of the federal EITC amount.

### Childless Worker EITC
The federal EITC for childless workers is significantly smaller than for workers with children. Key parameters (2024):
- Maximum credit: ~$632
- Phase-in rate: 7.65%
- Phase-out starts at: ~$9,800 (single), ~$16,400 (married)
- Phase-out rate: 7.65%