# Galaxy classification

The goal is to determine the demographics of galaxies with spiral arms. Spiral arms are correlated with star formation in the discs of low-redshift galaxies, so these demographics contribute to the understanding of star formation in the Local Universe. A large citizen science initiative called Galaxy Zoo 2 (1) has collected human annotations of roughly 300,000 images of galaxies from the Sloan Digital Sky Survey (2) with the goal of measuring these demographics. The target of inference is the fraction of galaxies with spiral arms. This notebook shows that prediction-powered inference reduces the number of human-annotated galaxies required by imputing labels with a computer vision model.

1. K. W. Willett, C. J. Lintott, S. P. Bamford, K. L. Masters, B. D. Simmons, K. R. V. Casteels, E. M. Edmondson, L. F. Fortson, S. Kaviraj, W. C. Keel, T. Melvin, R. C. Nichol, M. J. Raddick, K. Schawinski, R. J. Simpson, R. A. Skibba, A. M. Smith, D. Thomas, Galaxy Zoo 2: detailed morphological classifications for 304 122 galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society 435(4), 2835–2860 (2013).
2. D. G. York, J. Adelman, J. E. Anderson Jr, S. F. Anderson, J. Annis, N. A. Bahcall, …, N. Yasuda, The Sloan digital sky survey: Technical summary. The Astronomical Journal 120(3), 1579 (2000).
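
The internals of `analyze_dataset` are not shown in this notebook, so the following is only a minimal sketch of the standard prediction-powered confidence interval for a mean, which is the kind of estimate described above. It assumes a normal approximation and synthetic data; the function names `ppi_mean_ci` and `classical_mean_ci`, and the simulated predictions, are hypothetical and not taken from the notebook's code.

```python
import numpy as np
from statistics import NormalDist


def ppi_mean_ci(Y_lab, Yhat_lab, Yhat_unlab, alpha=0.1):
    """Prediction-powered CI for a mean (normal approximation)."""
    n, N = len(Y_lab), len(Yhat_unlab)
    # Point estimate: mean of imputed labels plus the rectifier,
    # i.e. the average prediction error measured on labeled data.
    theta_pp = Yhat_unlab.mean() + (Y_lab - Yhat_lab).mean()
    # Variance contributions of the imputed mean and of the rectifier.
    imputed_var = Yhat_unlab.var(ddof=1) / N
    rectifier_var = (Y_lab - Yhat_lab).var(ddof=1) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * np.sqrt(imputed_var + rectifier_var)
    return theta_pp - half, theta_pp + half


def classical_mean_ci(Y_lab, alpha=0.1):
    """Classical CI that uses the labeled data only."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * np.sqrt(Y_lab.var(ddof=1) / len(Y_lab))
    return Y_lab.mean() - half, Y_lab.mean() + half


# Synthetic stand-in for the galaxy data (assumption, not the real dataset):
# binary labels and fairly accurate predicted probabilities.
rng = np.random.default_rng(0)
n, N = 1670, 15073
Y = rng.binomial(1, 0.26, size=n + N).astype(float)
Yhat = np.clip(0.8 * Y + 0.1 + 0.05 * rng.standard_normal(n + N), 0, 1)

ppi_lo, ppi_hi = ppi_mean_ci(Y[:n], Yhat[:n], Yhat[n:], alpha=0.1)
cl_lo, cl_hi = classical_mean_ci(Y[:n], alpha=0.1)
print("PPI interval:      ", ppi_lo, ppi_hi)
print("Classical interval:", cl_lo, cl_hi)
```

When the predictions are accurate, the rectifier has small variance, so the prediction-powered interval is narrower than the classical one built from the same labeled galaxies. That is the sense in which fewer human annotations are needed.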

In [1]:
import numpy as np
from datasets import load_dataset 
from FL_cpp_method import analyze_dataset, plot_cpp

In [2]:
# Example call
dataset_name = "galaxies"
data = load_dataset('../data/', dataset_name)
Y_total = data["Y"]
Yhat_total = data["Yhat"]

alpha = 0.1

method = "mean"

dataset_dist = 'IID'
# dataset_dist = 'Non-IID'

# num_ratio = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# num_ratio = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # balanced data sizes across nodes
num_ratio = [1,1,1,1,1]
# num_ratio = [1,1,1,1,4]  # imbalanced data sizes across nodes
# num_ratio = [4,1,1,1,1]

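# Compute the true value from the labels, the CPP interval on each node,
# the interval on the combined data, and the mean CPP interval after FL aggregation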
true_theta, cpp_intervals, ppi_ci_combined, mean_cpp = analyze_dataset(alpha, None, Y_total, Yhat_total, dataset_dist,
                                                                            num_ratio, method, grid=None)
# Plot the results
file_name = dataset_dist + '-' + dataset_name + '.pdf'
xlim = [0, 0.85]  # max0.85
ylim = [0, 1.0]
# title = "frequency of spiral galaxies \n with partition [4:1:1:1:1]"
title = "frequency of spiral galaxies"
plot_cpp(true_theta, cpp_intervals, ppi_ci_combined, mean_cpp, file_name, xlim, ylim, title)

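Group: 1
Labeled sample size: 334
Unlabeled sample size: 3015
Group: 2
Labeled sample size: 334
Unlabeled sample size: 3015
Group: 3
Labeled sample size: 334
Unlabeled sample size: 3015
Group: 4
Labeled sample size: 334
Unlabeled sample size: 3014
Group: 5
Labeled sample size: 334
Unlabeled sample size: 3014
imputed var: [5.57442815e-06]
rectifier var [5.63809851e-05]
Labeled sample size: 1670
Unlabeled sample size: 15073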

Final results:
True theta: 0.2592725318043361
CPP intervals: [array([0.24716544, 0.30725822]), array([0.19624974, 0.25446946]), array([0.20803219, 0.26522186]), array([0.22407075, 0.28147366]), array([0.22159328, 0.27810148])]
Confidence interval on combined data: [0.23702794 0.26293777]
Confidence interval after federated aggregation: [0.23541632 0.26131019]
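
To make these numbers easier to read, the short check below compares the width and coverage of each interval. It copies the values printed above into arrays; the helper `covers` and the variable names `combined_ci` and `federated_ci` are hypothetical and not part of `FL_cpp_method`.

```python
import numpy as np

# Values copied from the printed results above.
true_theta = 0.2592725318043361
cpp_intervals = np.array([
    [0.24716544, 0.30725822],
    [0.19624974, 0.25446946],
    [0.20803219, 0.26522186],
    [0.22407075, 0.28147366],
    [0.22159328, 0.27810148],
])
combined_ci = np.array([0.23702794, 0.26293777])
federated_ci = np.array([0.23541632, 0.26131019])


def covers(interval, theta):
    """Return True if theta lies inside the closed interval."""
    return bool(interval[0] <= theta <= interval[1])


node_widths = cpp_intervals[:, 1] - cpp_intervals[:, 0]
node_coverage = [covers(ci, true_theta) for ci in cpp_intervals]

for i, (w, c) in enumerate(zip(node_widths, node_coverage), start=1):
    print(f"group {i}: width={w:.4f}, covers true theta: {c}")

for name, ci in [("combined", combined_ci), ("federated", federated_ci)]:
    print(f"{name}: width={ci[1] - ci[0]:.4f}, "
          f"covers true theta: {covers(ci, true_theta)}")
```

In this run, each per-node interval is roughly 0.057 to 0.060 wide, while the combined and federated intervals are both about 0.026 wide and both contain the true value. The interval for group 2 misses the true value, which is not surprising for a single run at `alpha = 0.1`.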
