## Import Sparse ECPS vs full ECPS

In [37]:
import h5py
import numpy as np
import pandas as pd
from policyengine_us import Microsimulation
from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_core.data.dataset import Dataset

sparse_filename = "sparse_enhanced_cps_2024.h5"
full_filename = "enhanced_cps_2024.h5"

In [55]:
# If you need to directly access the h5 file:
def inspect_dataset(file_name):
    with h5py.File(STORAGE_FOLDER / file_name, "r") as f:
        # Note: this header is hard-coded, so it also says "sparse" when inspecting the full dataset.
        print("\nVariables in sparse dataset:")
        for variable in list(f.keys())[-10:]:  # Show last 10
            print(f"  - {variable}")
        
        # Check household count
        household_ids = f["household_id"]["2024"][:]
        unique_households = len(np.unique(household_ids))
        print(f"\nUnique households: {unique_households:,}")
        
        # Check weights (the sum of household weights counts weighted households, not people)
        weights = f["household_weight"]["2024"][:]
        print(f"Weight range: {weights.min():.2f} - {weights.max():.2f}")
        print(f"Total weighted population: {weights.sum():,.0f}")


        # Better state distribution metrics
        statefip = f["state_fips"]["2024"][:]
        df = pd.DataFrame({"household_id": household_ids, "state_fips": statefip})
        households_by_state = df.drop_duplicates("household_id")["state_fips"].value_counts()
        
        # Key metrics
        print("\nState distribution:")
        print(f"  Min households in a state: {households_by_state.min()}")
        print(f"  Median households per state: {households_by_state.median():.0f}")
        print(f"  Max households in a state: {households_by_state.max()}")
        print(f"  States with <50 households: {(households_by_state < 50).sum()}")
        print(f"  Coefficient of variation: {households_by_state.std() / households_by_state.mean():.3f}")
        
        # Net worth distribution (Gini); the printed label below still says "Income"
        incomes = f["net_worth"]["2024"][:]
        weights = f["household_weight"]["2024"][:]
        # Gini calculation
        def weighted_gini(x, w):
            # Sort by x
            sorted_idx = np.argsort(x)
            x = x[sorted_idx]
            w = w[sorted_idx]
            cumw = np.cumsum(w)
            cumxw = np.cumsum(x * w)
            gini = 1 - 2 * np.sum((cumxw - x * w / 2) * w) / (cumxw[-1] * cumw[-1])
            return gini
        gini = weighted_gini(incomes, weights)
        print(f"Income distribution (Gini): {gini:.4f}")


inspect_dataset(sparse_filename)
inspect_dataset(full_filename)


Variables in sparse dataset:
  - wv_social_security_benefits_subtraction_eligible
  - wv_social_security_benefits_subtraction_person
  - wv_subtractions
  - wv_taxable_income
  - wv_taxable_property_value
  - wv_withheld_income_tax
  - year_deceased
  - year_of_retirement
  - years_in_military
  - zip_code

Unique households: 5,100
Weight range: 3.03 - 1129660.25
Total weighted population: 143,483,712

State distribution:
  Min households in a state: 35
  Median households per state: 99
  Max households in a state: 191
  States with <50 households: 3
  Coefficient of variation: 0.361
Income distribution (Gini): 0.7756

Variables in sparse dataset:
  - traditional_ira_contributions
  - unadjusted_basis_qualified_property
  - unemployment_compensation
  - unrecaptured_section_1250_gain
  - unreimbursed_business_employee_expenses
  - unreported_payroll_tax
  - veterans_benefits
  - w2_wages_from_qualified_business
  - weekly_hours_worked
  - workers_compensation

Unique households: 41,310

### Interesting observations!


| Metric | Full ECPS | Sparse ECPS | Notes |
|--------|-----------|-------------|-------|
| **Dataset Characteristics** | | | |
| Number of households | 41,310 | 5,100 | |
| **Population Coverage** | | | |
| Total weighted population (sum of household weights) | 149,890,112 | 143,483,712 | Only a 4.3% difference! |
| Range in HH weights (min, max) | (0.36, 708130.50) | (3.03, 1129660.25) | A reminder that the Sparse ECPS goes through reweighting, which substantially increases the weight of some HHs. |
| Net worth inequality measure (Gini) | 0.7672 | 0.7756 | |
| **State Distribution** | | | |
| Range of households in a state (min, max) | (104, 4300) | (35, 191) | |
| Median households per state | 574 | 99 | |
| States with <50 households | 0 | 3 | |
| Coefficient of variation (SD/Mean) | 1.024 | 0.361 | |
| **Benefits** | | | |
| Households on SNAP | | | |
| Households on Medicaid | | | |
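
The Gini row comes from the `weighted_gini` helper defined inside `inspect_dataset`. As a sanity check on that formula, here is a minimal sketch of the same calculation in plain Python (so it runs without NumPy). The inputs are toy values made up for illustration, not data from either file:

```python
def weighted_gini(values, weights):
    """Weighted Gini using the same trapezoid formula as the notebook helper."""
    # Sort observations by value, keeping each weight attached to its value.
    pairs = sorted(zip(values, weights))
    total_w = sum(w for _, w in pairs)
    total_xw = sum(x * w for x, w in pairs)

    cum_xw = 0.0  # running weighted sum of values
    area = 0.0    # accumulates (cum_xw - x*w/2) * w
    for x, w in pairs:
        cum_xw += x * w
        area += (cum_xw - x * w / 2) * w
    return 1 - 2 * area / (total_xw * total_w)


# Toy inputs (hypothetical, for illustration only).
equal = weighted_gini([3, 3, 3], [1, 2, 1])             # identical holdings
two_point = weighted_gini([0, 1], [1, 1])               # one of two units holds everything
replicated = weighted_gini([1, 1, 2, 2], [1, 1, 1, 1])  # four units, unit weights
weighted = weighted_gini([1, 2], [2, 2])                # same population, expressed via weights
```

Identical holdings should give 0, and a weight of 2 should behave like two identical households, so `replicated` and `weighted` should agree. If `net_worth` contains negative values (not checked here), a Gini computed this way is harder to interpret.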


Miscellaneous notes:
- The variables/columns seem to differ between the two datasets.
  - I wonder why the full ECPS doesn't contain the variable `zip_code`.
- I want to see how the reweighting happens, as the max HH weight is much greater in the sparse dataset than in the full one.
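
Before digging into the reweighting itself, one possible way to put a number on how concentrated the weights are is the Kish effective sample size, (Σw)² / Σw², together with the share of total weight held by the single largest household. This is only a sketch: `weight_concentration` is a hypothetical helper, not something from `policyengine_us_data`, and the weights below are invented.

```python
def weight_concentration(weights):
    """Return (Kish effective sample size, share of total weight held by the largest weight)."""
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    effective_n = total * total / sum_sq
    top_share = max(weights) / total
    return effective_n, top_share


# Hypothetical weights, for illustration only (not taken from either file).
even_n, even_top = weight_concentration([25.0, 25.0, 25.0, 25.0])
skew_n, skew_top = weight_concentration([1.0, 1.0, 1.0, 97.0])

print(f"Even weights:   effective n = {even_n:.2f}, top share = {even_top:.2%}")
print(f"Skewed weights: effective n = {skew_n:.2f}, top share = {skew_top:.2%}")
```

For the real files, the `household_weight` values read in `inspect_dataset` could be passed in instead of the toy lists. An effective sample size far below the household count would suggest that a few households dominate the weighted totals.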

## Investigation of dataframe cell:

In [59]:
full_filename = "enhanced_cps_2024.h5"
sparse_filename = "sparse_enhanced_cps_2024.h5"

with h5py.File(STORAGE_FOLDER / full_filename, "r") as f:
    # Get all variable names
    all_variables = list(f.keys())
    
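    # Despite the "income" naming, this looks for variable names containing "code" (e.g. zip_code)
    income_vars = [var for var in all_variables if 'code' in var.lower()]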
    print("Income variables found:")
    for var in income_vars:
        print(f"  - {var}")

Income variables found:
  - detailed_occupation_recode
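
Searching for a substring only answers the question for one variable at a time. A more direct check is a set difference over the two lists of variable names. Here is a minimal sketch, where `diff_variables` is a hypothetical helper and the lists are made up, so the result says nothing about the actual files:

```python
def diff_variables(full_vars, sparse_vars):
    """Return (names only in full, names only in sparse), each sorted."""
    full_set, sparse_set = set(full_vars), set(sparse_vars)
    return sorted(full_set - sparse_set), sorted(sparse_set - full_set)


# Hypothetical variable lists, for illustration only.
full_example = ["household_id", "household_weight", "workers_compensation"]
sparse_example = ["household_id", "household_weight", "zip_code"]

only_full, only_sparse = diff_variables(full_example, sparse_example)
print("Only in full:", only_full)
print("Only in sparse:", only_sparse)
```

With the real files, `list(f.keys())` from each open `h5py.File` would replace the toy lists.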


## Saving methods for the future, so I can debug the latest one first.