# Validation

We primarily validate tax model input datasets by comparing them against the IRS Statistics of Income (SOI) data releases. For each data year, we have access to between 1,000 and 2,000 individual statistical targets (e.g. "total taxable Social Security for filers with AGI between 30k and 40k"). For each dataset produced by this repo, we can attempt to reproduce these statistics.

There are valid reasons why datasets might not be able to reproduce SOI statistics:

1. The PUF sample size is too low (e.g. "estate losses" are in the tens of thousands for some AGI breakdowns).
2. The PUF is missing granular tax form information (e.g. some capital gains tax details that form the totals are not in the PUF data).
3. Datasets built for a tax model contain model-calculated tax output variables, which can differ from the values reported in the PUF (e.g. Tax-Calculator or PolicyEngine might have a different EITC value than is reported in the PUF data).

Even so, the comparison is still useful, because we should be able to reproduce most of the SOI statistics.

## Measuring quality

One way to measure the quality of fit against SOI targets would be to take the mean relative deviation. However, the SOI statistics that fall under category (1), and to some extent the others, tend to blow up this mean. Those SOI statistics might not be particularly important, and we want the quality-of-fit indicator to be informative, so this is not ideal.
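
To make the problem concrete, here is a minimal sketch using made-up targets and estimates (none of these numbers come from the SOI releases). Three of the four statistics are within a few percent of their targets, but a single small statistic with a large relative miss dominates the mean.

```python
import numpy as np

# Hypothetical targets and estimates, for illustration only (not SOI values).
# The last statistic is small in absolute terms, like a thinly sampled item.
targets = np.array([1.0e12, 5.0e11, 2.0e10, 4.0e4])
estimates = np.array([1.02e12, 4.9e11, 2.05e10, 4.0e5])

# Relative deviation of each estimate from its target.
relative_deviation = np.abs(estimates - targets) / np.abs(targets)

# The mean is pulled up by the single outlier.
mean_relative_deviation = relative_deviation.mean()

# Share of statistics within 5 percent of their target.
share_within_5_percent = (relative_deviation < 0.05).mean()

print(relative_deviation)
print(mean_relative_deviation)
print(share_within_5_percent)
```

In this toy case the mean relative deviation is above 200 percent even though three quarters of the statistics are reproduced closely, which is why a share-based indicator is easier to interpret.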

Instead, this validation exercise marks an SOI statistic as "OK" if the relative deviation is less than 5 percent, or if the absolute deviation is less than 1 million for filer count statistics and 1 billion for aggregate statistics.

With that definition, we can report the percentage of SOI statistics (in the relevant year) that are "OK" for each dataset below.

In [1]:
from tax_microdata_benchmarking.utils.soi_replication import *
from tax_microdata_benchmarking.storage import STORAGE_FOLDER
from tax_microdata_benchmarking.datasets import *
import numpy as np  # used by np.where in soi_statistic_passes_quality_test
import pandas as pd

INPUTS = STORAGE_FOLDER / "input"
OUTPUTS = STORAGE_FOLDER / "output"

puf_2015 = pd.read_csv(INPUTS / "puf_2015.csv")
tc_puf_2015 = pd.read_csv(OUTPUTS / "tc_puf_2015.csv")

soi_from_puf_2015 = compare_soi_replication_to_soi(
    puf_to_soi(puf_2015, 2015), 2015
)
soi_from_pe_puf_2015 = compare_soi_replication_to_soi(
    pe_to_soi(PUF_2015, 2015), 2015
)
soi_from_tc_puf_2015 = compare_soi_replication_to_soi(
    tc_to_soi(tc_puf_2015, 2015), 2015
)


def soi_statistic_passes_quality_test(df):
    # Relative error lower than this => OK
    RELATIVE_ERROR_THRESHOLD = 0.05

    # Absolute error lower than this for filer counts => OK
    COUNT_ABSOLUTE_ERROR_THRESHOLD = 1e6

    # Absolute error lower than this for aggregates => OK
    AGGREGATE_ABSOLUTE_ERROR_THRESHOLD = 1e9

    relative_error_ok = (
        df["Absolute relative error"] < RELATIVE_ERROR_THRESHOLD
    )
    absolute_error_threshold = np.where(
        df.Count,
        COUNT_ABSOLUTE_ERROR_THRESHOLD,
        AGGREGATE_ABSOLUTE_ERROR_THRESHOLD,
    )
    absolute_error_ok = df["Absolute error"] < absolute_error_threshold

    return relative_error_ok | absolute_error_ok


# 2021 datasets

puf_2021 = pd.read_csv(OUTPUTS / "puf_2021.csv")
tc_puf_2021 = pd.read_csv(OUTPUTS / "tc_puf_2021.csv")
tmd_2021 = pd.read_csv(OUTPUTS / "tmd_2021.csv")

soi_from_puf_2021 = compare_soi_replication_to_soi(
    puf_to_soi(puf_2021, 2021), 2021
)
soi_from_pe_puf_2021 = compare_soi_replication_to_soi(
    pe_to_soi(PUF_2021, 2021), 2021
)
soi_from_tc_puf_2021 = compare_soi_replication_to_soi(
    tc_to_soi(tc_puf_2021, 2021), 2021
)
soi_from_tmd_2021 = compare_soi_replication_to_soi(
    tc_to_soi(tmd_2021, 2021), 2021
)

dataset_soi_comparisons = [
    soi_from_puf_2015,
    soi_from_pe_puf_2015,
    soi_from_tc_puf_2015,
    soi_from_puf_2021,
    soi_from_pe_puf_2021,
    soi_from_tc_puf_2021,
    soi_from_tmd_2021,
]

for dataset in dataset_soi_comparisons:
    dataset["OK"] = soi_statistic_passes_quality_test(dataset)

dataset_names = [
    "PUF (2015)",
    "PE PUF (2015)",
    "TC PUF (2015)",
    "PUF (2021)",
    "PE PUF (2021)",
    "TC PUF (2021)",
    "TMD (2021)",
]

comparison_df = pd.DataFrame(
    {
        "Dataset": dataset_names,
        "SOI match score": [
            (df["OK"].mean() * 100).round(1) for df in dataset_soi_comparisons
        ],
    }
)

comparison_df

| Dataset | SOI match score |
| --- | --- |
| PUF (2015) | 96.1 |
| PE PUF (2015) | 80.7 |
| TC PUF (2015) | 89.7 |
| PUF (2021) | 60.3 |
| PE PUF (2021) | 64.5 |
| TC PUF (2021) | 65.1 |
| TMD (2021) | 66.0 |


Note that, for 2015, the datasets whose tax output variables are calculated by a tax model (PE PUF and TC PUF) score lower than the PUF with its reported tax output values. This is because both tax models produce different estimates for tax variables, including adjusted gross income, which is used to bracket some of the SOI statistics. So even if, for example, tax-exempt pension income is copied directly into the dataset, a tax model that produces different adjusted gross incomes for some records can yield tax-exempt pension income statistics by AGI bracket that differ from those in the SOI releases.
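
The bracketing effect can be sketched with a toy example. The records, column names (`reported_agi`, `modeled_agi`) and bracket edges below are hypothetical and are not taken from the PUF or from this repo's code. The pension income column is identical in both cases, and only the AGI used for bracketing changes.

```python
import pandas as pd

# Hypothetical records; all values and column names are illustrative only.
records = pd.DataFrame(
    {
        "tax_exempt_pension_income": [10_000.0, 20_000.0, 5_000.0],
        "reported_agi": [35_000.0, 39_000.0, 45_000.0],
        "modeled_agi": [35_500.0, 41_000.0, 45_000.0],
    }
)

# Illustrative AGI bracket edges and labels.
agi_bins = [30_000, 40_000, 50_000]
agi_labels = ["30k-40k", "40k-50k"]


def pension_total_by_agi_bracket(agi_column):
    """Sum pension income within AGI brackets defined by the given column."""
    brackets = pd.cut(records[agi_column], bins=agi_bins, labels=agi_labels)
    grouped = records.groupby(brackets, observed=False)
    return grouped["tax_exempt_pension_income"].sum()


reported = pension_total_by_agi_bracket("reported_agi")
modeled = pension_total_by_agi_bracket("modeled_agi")

print(reported)
print(modeled)
```

The overall pension total is the same in both cases, but the second record moves from the lower bracket to the upper one under the modeled AGI, so the per-bracket totals no longer match.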

## SOI score by variable

We can break out this score further, by variable.

In [2]:
soi_from_tmd_2021.to_csv(OUTPUTS / "soi_from_puf_tmd_2021.csv", index=False)

In [3]:
score_by_dataset = pd.DataFrame(
    {
        dataset_name: (dataset.groupby("Variable").OK.mean() * 100).round(1)
        for dataset_name, dataset in zip(
            dataset_names, dataset_soi_comparisons
        )
    }
).fillna(
    100
)  # Fillna because some variables aren't in the 2021 SOI releases.
score_by_dataset.sort_values("TMD (2021)")

| Variable | PUF (2015) | PE PUF (2015) | TC PUF (2015) | PUF (2021) | PE PUF (2021) | TC PUF (2021) | TMD (2021) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| adjusted_gross_income | 89.8 | 63.9 | 86.7 | 19.9 | 23.8 | 29.3 | 46.1 |
| income_tax_after_credits | 100.0 | 38.1 | 64.3 | 41.7 | 29.2 | 33.3 | 48.6 |
| employment_income | 95.2 | 83.3 | 92.9 | 37.5 | 43.1 | 45.8 | 50.0 |
| rent_and_royalty_net_income | 88.1 | 95.2 | 47.6 | 65.3 | 79.2 | 54.2 | 54.2 |
| business_net_profits | 100.0 | 97.6 | 97.6 | 55.6 | 54.2 | 55.6 | 54.2 |
| unemployment_compensation | 100.0 | 100.0 | 100.0 | 65.3 | 58.3 | 56.9 | 55.6 |
| state_and_local_tax_deductions | 100.0 | 52.1 | 100.0 | 54.2 | 50.0 | 62.5 | 56.2 |
| total_income_tax | 100.0 | 47.6 | 76.2 | 41.7 | 38.9 | 33.3 | 56.9 |
| capital_gains_gross | 97.6 | 66.7 | 66.7 | 63.9 | 59.7 | 56.9 | 61.1 |
| income_tax_before_credits | 66.7 | 38.1 | 95.2 | 41.7 | 54.2 | 56.9 | 62.5 |
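
One thing to keep in mind when reading this table is that the `fillna(100)` step makes a variable that is absent from a year's comparison look the same as a perfectly matched one. The sketch below uses made-up comparison tables (`var_a` and `var_b` are hypothetical names) to show the same groupby pattern and where the filled value comes from.

```python
import pandas as pd

# Hypothetical comparison tables; variable names and OK flags are made up.
comparison_2015 = pd.DataFrame(
    {
        "Variable": ["var_a", "var_a", "var_b", "var_b"],
        "OK": [True, False, True, True],
    }
)
# var_b has no targets in this hypothetical 2021 comparison.
comparison_2021 = pd.DataFrame(
    {"Variable": ["var_a", "var_a"], "OK": [True, True]}
)

# Per-variable share of OK statistics, aligned on the variable name.
scores = pd.DataFrame(
    {
        "2015": comparison_2015.groupby("Variable")["OK"].mean() * 100,
        "2021": comparison_2021.groupby("Variable")["OK"].mean() * 100,
    }
)

# Before filling, the missing variable shows up as NaN.
missing_before_fill = scores.isna()

# After filling, it is indistinguishable from a 100 percent score.
filled = scores.fillna(100)

print(scores)
print(filled)
```

Inspecting something like `missing_before_fill` before filling is one way to tell a genuine 100 from a variable that simply has no targets in that year.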
