# Comparison

The table below defines the inequality, poverty, and weight-distribution metrics used to compare the datasets:

| Metric | Description | Formula | Notes |
| --- | --- | --- | --- |
| Gini | Gini coefficient | $1 - 2 \sum_{i=1}^n \frac{w_i}{W} \left( \frac{W - w_i}{W} \right) \frac{r_i}{R}$ | $w_i$ is the weight of the $i$-th household, $W$ is the sum of all weights, $r_i$ is the income of the $i$-th household, $R$ is the sum of all incomes |
| Top 10% share | Share of total income received by the top 10% | $\sum_{i=1}^{n_{10}} w_i r_i / R$ | Households are indexed from highest to lowest income; $n_{10}$ is the number of households in the top 10% |
| Top 1% share | Share of total income received by the top 1% | $\sum_{i=1}^{n_{1}} w_i r_i / R$ | Households are indexed from highest to lowest income; $n_{1}$ is the number of households in the top 1% |
| SPM poverty rate | Share of individuals with income below the SPM poverty threshold | $n_{\text{poor}} / n$ | $n_{\text{poor}}$ is the number of individuals with income below the SPM poverty threshold, $n$ is the total number of individuals |
| Mean weight | Mean of the weights | $\sum_{i=1}^n w_i / n$ | $n$ is the total number of households |
| Median weight | Median of the weights | | |
| Weight standard deviation | Standard deviation of the weights | | |
| Nonzero weight share | Share of households with nonzero weight | $n_{\text{nonzero}} / n$ | $n_{\text{nonzero}}$ is the number of households with nonzero weight, $n$ is the total number of households |
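
To make the weighted inequality metrics concrete, here is a minimal NumPy sketch. It is not the implementation used by the simulation library: `weighted_gini` uses the standard Lorenz-curve (trapezoid) estimator rather than the formula in the table, and `top_share` assumes the top group is defined by weighted population, with the household at the boundary split proportionally. Both function names are hypothetical.

```python
import numpy as np


def weighted_gini(incomes, weights):
    """Gini coefficient from the weighted Lorenz curve (trapezoid rule)."""
    incomes = np.asarray(incomes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(incomes)  # lowest income first
    inc, w = incomes[order], weights[order]
    # Cumulative population and income shares, each starting at 0
    pop = np.concatenate([[0.0], np.cumsum(w) / w.sum()])
    lorenz = np.concatenate([[0.0], np.cumsum(w * inc) / (w * inc).sum()])
    # Twice the area under the Lorenz curve
    area = np.sum(np.diff(pop) * (lorenz[1:] + lorenz[:-1]))
    return float(1.0 - area)


def top_share(incomes, weights, top_fraction=0.1):
    """Share of weighted income held by the top `top_fraction` of weighted households."""
    incomes = np.asarray(incomes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-incomes)  # highest income first
    inc, w = incomes[order], weights[order]
    threshold = top_fraction * w.sum()
    weight_before = np.cumsum(w) - w
    # Portion of each household's weight that falls inside the top group
    w_in_top = np.clip(threshold - weight_before, 0.0, w)
    return float((w_in_top * inc).sum() / (w * inc).sum())


# Toy data: one household holds all income
print(weighted_gini([0, 0, 0, 1], [1, 1, 1, 1]))
print(top_share([10, 20, 30, 40], [1, 1, 1, 1], top_fraction=0.5))
```

Replacing three unit-weight zero-income households with a single household of weight 3 gives the same Gini, which is the property that makes the weighted estimator appropriate for survey microdata.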

The first table below reports these metrics at the household level for the CPS and the Enhanced CPS (ECPS).

In [None]:
from policyengine_us import Microsimulation
from policyengine_us_data import PUF_2024
import pandas as pd

# One simulation per dataset; the default-dataset simulation is the one labeled "CPS" below
cps = Microsimulation()
ecps = Microsimulation(dataset="enhanced_cps_2024")
puf = Microsimulation(dataset=PUF_2024)

In [69]:
YEAR = 2024

def net_income(sim):
    # Household net income as a weighted series
    return sim.calculate("household_net_income", YEAR)

def weights(sim):
    # Raw household weights as a plain (unweighted) pandas Series
    return pd.Series(sim.calculate("household_weight", YEAR).values)

# Metric name -> function that computes it from a Microsimulation
metric_funcs = {
    "Gini": lambda sim: net_income(sim).gini().round(3),
    "Top 10% share": lambda sim: net_income(sim).top_10_pct_share().round(3),
    "Top 1% share": lambda sim: net_income(sim).top_1_pct_share().round(3),
    "SPM poverty rate": lambda sim: sim.calculate("in_poverty", YEAR, map_to="person").mean().round(3),
    "Mean weight": lambda sim: weights(sim).mean(),
    "Median weight": lambda sim: weights(sim).median(),
    "Weight standard deviation": lambda sim: weights(sim).std(),
    # Weights at or below 0.1 are treated as zero
    "Nonzero weight share": lambda sim: (weights(sim) > 0.1).mean(),
}

# One long-format row per (dataset, metric) pair
rows = [
    {"Dataset": dataset, "Metric": metric, "Value": func(sim)}
    for dataset, sim in [("CPS", cps), ("PUF", puf), ("Enhanced CPS", ecps)]
    for metric, func in metric_funcs.items()
]

df = pd.DataFrame(rows)
# The household-level table omits the PUF
df = df[df.Dataset != "PUF"].pivot(index="Dataset", columns="Metric", values="Value")
df.round(3).T

| Metric | CPS | Enhanced CPS |
| --- | --- | --- |
| Gini | 0.449 | 0.556 |
| Mean weight | 2379.074 | 1290.472 |
| Median weight | 2260.32 | 1.037 |
| Nonzero weight share | 1.0 | 0.538 |
| SPM poverty rate | 0.127 | 0.249 |
| Top 1% share | 0.073 | 0.154 |
| Top 10% share | 0.323 | 0.416 |
| Weight standard deviation | 1422.479 | 11868.958 |
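
The table above is built by collecting one row per dataset and metric, then pivoting to a wide layout and transposing. A minimal sketch of that reshaping on a toy frame, with hypothetical dataset names and values, looks like this:

```python
import pandas as pd

# Hypothetical long-format records: one row per (dataset, metric) pair
rows = [
    {"Dataset": "A", "Metric": "Gini", "Value": 0.4},
    {"Dataset": "A", "Metric": "Mean weight", "Value": 10.0},
    {"Dataset": "B", "Metric": "Gini", "Value": 0.5},
    {"Dataset": "B", "Metric": "Mean weight", "Value": 20.0},
]
long_df = pd.DataFrame(rows)

# Pivot to one row per dataset, then transpose so metrics become rows
wide = long_df.pivot(index="Dataset", columns="Metric", values="Value").T
print(wide)
```

`pivot` sorts the row and column labels, which is why the metrics in the output appear alphabetically rather than in the order they were computed.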


The table below shows the same metrics computed over tax units rather than households, with the IRS SOI PUF added. The SPM poverty rate is omitted.

In [68]:
YEAR = 2024
ENTITY = "tax_unit"

def net_income(sim):
    # Household net income mapped to tax units, as a weighted series
    return sim.calculate("household_net_income", YEAR, map_to=ENTITY)

def weights(sim):
    # Household weights mapped to tax units, as a plain (unweighted) pandas Series
    return pd.Series(sim.calculate("household_weight", YEAR, map_to=ENTITY).values)

# Metric name -> function that computes it from a Microsimulation
metric_funcs = {
    "Gini": lambda sim: net_income(sim).gini().round(3),
    "Top 10% share": lambda sim: net_income(sim).top_10_pct_share().round(3),
    "Top 1% share": lambda sim: net_income(sim).top_1_pct_share().round(3),
    "SPM poverty rate": lambda sim: sim.calculate("in_poverty", YEAR, map_to="person").mean().round(3),
    "Mean weight": lambda sim: weights(sim).mean(),
    "Median weight": lambda sim: weights(sim).median(),
    "Weight standard deviation": lambda sim: weights(sim).std(),
    # Weights at or below 0.1 are treated as zero
    "Nonzero weight share": lambda sim: (weights(sim) > 0.1).mean(),
}

# One long-format row per (dataset, metric) pair
rows = [
    {"Dataset": dataset, "Metric": metric, "Value": func(sim)}
    for dataset, sim in [("CPS", cps), ("PUF", puf), ("Enhanced CPS", ecps)]
    for metric, func in metric_funcs.items()
]

df = pd.DataFrame(rows)
# Drop the SPM poverty rate from the tax-unit table
df = df[df.Metric != "SPM poverty rate"].pivot(index="Dataset", columns="Metric", values="Value")
df.round(3).T

| Metric | CPS | Enhanced CPS | PUF |
| --- | --- | --- | --- |
| Gini | 0.495 | 0.572 | 0.57 |
| Mean weight | 1728.428 | 937.545 | 776.056 |
| Median weight | 1388.468 | 0.641 | 353.541 |
| Nonzero weight share | 1.0 | 0.534 | 1.0 |
| Top 1% share | 0.085 | 0.154 | 0.15 |
| Top 10% share | 0.361 | 0.425 | 0.41 |
| Weight standard deviation | 1347.706 | 9124.232 | 720.281 |
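
The weight statistics in these tables are unweighted summaries of the weight vector itself. The sketch below uses hypothetical toy weights, not values from any of the datasets, to show how a vector with many near-zero weights and a few large ones yields a median far below the mean, the pattern seen in the Enhanced CPS column. The 0.1 cutoff for "nonzero" mirrors the threshold used in the code above.

```python
import pandas as pd

# Hypothetical weights: several at or near zero, a few very large
toy_weights = pd.Series([0.0, 0.0, 0.05, 0.5, 2.0, 4000.0, 6000.0, 8000.0])

summary = {
    "mean": toy_weights.mean(),
    "median": toy_weights.median(),
    # pandas computes the sample standard deviation (ddof=1) by default
    "std": toy_weights.std(),
    # Share of records whose weight exceeds the 0.1 cutoff
    "nonzero_share": (toy_weights > 0.1).mean(),
}
print(summary)
```

Here the few large weights pull the mean into the thousands while the median stays close to 1, and only five of the eight records pass the cutoff.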


## Weight distributions

Below, we compare the household weight distributions of the three datasets by plotting each dataset's weight quantiles on a logarithmic scale.

In [94]:
import numpy as np
import plotly.express as px
from policyengine_core.charts import *

# 1,001 evenly spaced quantiles from 0% to 100%
quantiles = np.linspace(0, 1, 1001)
sims = {"CPS": cps, "Enhanced CPS": ecps, "PUF": puf}

# Unweighted quantiles of each dataset's household weights, one column per dataset
df = pd.DataFrame({
    name: pd.Series(sim.calculate("household_weight").values).quantile(quantiles)
    for name, sim in sims.items()
})
df["Quantile"] = quantiles

fig = px.line(df, x="Quantile", y=["CPS", "Enhanced CPS", "PUF"], title="Household weight distribution", labels={"value": "Weight", "variable": "Dataset"}, log_y=True, color_discrete_sequence=px.colors.qualitative.T10)
fig.update_layout(
    width=800,
    height=600,
    xaxis=dict(
        title="Quantile",
        tickformat=".0%",
    ),
    yaxis=dict(
        title="Weight",
        type="log",
    ),
)
fig