# Relationship between age and income

The goal is to investigate the relationship between age and income using US census data. The target of inference is the coefficient on age in a linear regression of yearly income (in dollars) on age, controlling for sex. The data, covering California in 2019, are downloaded through the Folktables interface (1). Income predictions come from a gradient-boosted tree model trained with XGBoost (2) on the previous year’s data.

1. F. Ding, M. Hardt, J. Miller, L. Schmidt, “Retiring adult: New datasets for fair machine learning” in Advances in Neural Information Processing Systems 34 (2021), pp. 6478–6490.
2. T. Chen, C. Guestrin, “XGBoost: A scalable tree boosting system” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), pp. 785–794.
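
To make the target of inference concrete, here is a minimal sketch of the age coefficient from an ordinary least squares fit of income on age and sex. It runs on synthetic data, not the census data. The variable names, the simulated slope of 900, and the intercept column are assumptions made for illustration; the notebook's own estimate comes from `analyze_dataset`.

```python
import numpy as np

# Synthetic stand-in for the census variables (assumption: not the real data).
rng = np.random.default_rng(0)
n = 5_000
age = rng.uniform(18, 80, size=n)
sex = rng.integers(0, 2, size=n).astype(float)

# Simulated income with an arbitrary "true" age slope of 900 dollars per year.
income = 900.0 * age + 5_000.0 * sex + rng.normal(0.0, 10_000.0, size=n)

# Design matrix: intercept, age, sex (the intercept is an assumption of this sketch).
X = np.column_stack([np.ones(n), age, sex])

# Ordinary least squares via a least-squares solve.
theta, *_ = np.linalg.lstsq(X, income, rcond=None)
age_coef = theta[1]  # the coefficient on age, controlling for sex

print(f"estimated age coefficient: {age_coef:.1f}")
```

Including sex as a second column is what "controlling for sex" means here: the age coefficient is read from the joint fit rather than from a regression on age alone.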

### Import necessary packages

In [1]:
import numpy as np
from datasets import load_dataset
from FL_cpp_method import analyze_dataset, plot_cpp

In [2]:
# Example call
dataset_name = "census_income"
data = load_dataset('../data/', dataset_name)
Y_total = data["Y"]
Yhat_total = data["Yhat"]
X_total = data["X"]

alpha = 0.05

method = "linear"

dataset_dist = 'IID'
# dataset_dist = 'Non-IID'

# num_ratio = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
num_ratio = [1, 1, 1, 1, 1]  # balanced sample sizes
# num_ratio = [1, 2, 3, 3, 1]  # imbalanced sample sizes

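# Compute the true value from the labels, the CPP intervals on each node,
# the interval on the combined data, and the averaged CPP interval after FL
true_theta, cpp_intervals, ppi_ci_combined, mean_cpp = analyze_dataset(
    alpha, X_total, Y_total, Yhat_total, dataset_dist, num_ratio, method, grid=None
)
# Plot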
file_name = dataset_dist + '-' + dataset_name + '.pdf'
xlim = [0, 2600]
ylim = [0, 1.0]
title = "OLS coeff"
plot_cpp(true_theta, cpp_intervals, ppi_ci_combined, mean_cpp, file_name, xlim, ylim, title)

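Group: 1
Labeled sample size: 7601
Unlabeled sample size: 68418
Group: 2
Labeled sample size: 7601
Unlabeled sample size: 68417
Group: 3
Labeled sample size: 7601
Unlabeled sample size: 68417
Group: 4
Labeled sample size: 7601
Unlabeled sample size: 68417
Group: 5
Labeled sample size: 7601
Unlabeled sample size: 68417
Labeled sample size: 38005
Unlabeled sample size: 342086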
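
The group sizes listed above are consistent with an even split of 38005 labeled and 342086 unlabeled samples across five groups. The sketch below shows one way a `num_ratio` list could be turned into such a partition. `split_by_ratio` is a hypothetical helper, not the routine inside `FL_cpp_method`, and which group receives the leftover sample may differ from the output above.

```python
import numpy as np

def split_by_ratio(n_samples, num_ratio, seed=0):
    """Hypothetical helper: split shuffled indices in proportion to num_ratio."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    ratio = np.asarray(num_ratio, dtype=np.int64)
    # Integer cut points avoid floating-point rounding; the remainder
    # ends up in the later groups.
    cuts = np.cumsum(ratio)[:-1] * n_samples // ratio.sum()
    return np.split(shuffled, cuts)

# Balanced case, using the totals reported in the output.
labeled_sizes = [len(g) for g in split_by_ratio(38005, [1, 1, 1, 1, 1])]
unlabeled_sizes = [len(g) for g in split_by_ratio(342086, [1, 1, 1, 1, 1])]

# Imbalanced case, using the commented-out ratio from the cell above.
uneven_sizes = [len(g) for g in split_by_ratio(38005, [1, 2, 3, 3, 1])]

print(labeled_sizes, unlabeled_sizes, uneven_sizes)
```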

Final results:
True theta: 937.5318947805291
CPP intervals: [array([ 888.33379315, 1006.00173535]), array([865.46299665, 990.72776535]), array([879.65428206, 996.24751686]), array([ 889.23679495, 1007.64406727]), array([868.07338288, 995.10683764])]
Confidence interval on the combined data: [911.56403282 965.73379052]
Confidence interval after federated aggregation: [911.56919665 965.72868261]
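
As a quick sanity check on the reported numbers, the sketch below tests whether each interval covers the true coefficient and compares interval widths. The values are copied from the output above; the helper names `covers` and `width` are introduced here for illustration only.

```python
import numpy as np

# Values copied from the printed results.
true_theta = 937.5318947805291
node_intervals = np.array([
    [888.33379315, 1006.00173535],
    [865.46299665, 990.72776535],
    [879.65428206, 996.24751686],
    [889.23679495, 1007.64406727],
    [868.07338288, 995.10683764],
])
combined_ci = np.array([911.56403282, 965.73379052])
federated_ci = np.array([911.56919665, 965.72868261])

def covers(ci, value):
    """True if value lies inside the closed interval ci = [lower, upper]."""
    return bool(ci[0] <= value <= ci[1])

def width(ci):
    """Length of the interval ci = [lower, upper]."""
    return float(ci[1] - ci[0])

node_cover = [covers(ci, true_theta) for ci in node_intervals]
node_widths = [width(ci) for ci in node_intervals]
endpoint_gap = float(np.max(np.abs(combined_ci - federated_ci)))

print("per-node coverage:", node_cover)
print("per-node widths:", np.round(node_widths, 2))
print("combined width:", round(width(combined_ci), 2))
print("federated width:", round(width(federated_ci), 2))
print("largest endpoint gap, combined vs federated:", round(endpoint_gap, 4))
```

In this run every interval contains the true value. The per-node intervals are roughly 117 to 127 dollars wide, while the combined and federated intervals are both about 54 dollars wide and agree to within about 0.005 at each endpoint.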
