### Data and setup
- Load the cleaned REF datasets via the shared helpers in `statistics_helpers`.
- Prepare institution-level and unit-of-assessment (UoA) aggregates for the summaries and tests below.


In [1]:

from pathlib import Path
import importlib
import sys

import pandas as pd

# Ensure local src/ is importable: use this file's directory when run as a
# script, or the current working directory when run from the notebook
THIS_DIR = Path(__file__).resolve().parent if '__file__' in globals() else Path.cwd()
SRC_DIR = THIS_DIR if (THIS_DIR / 'statistics_summary.py').exists() else THIS_DIR / 'src'
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

import statistics_helpers
importlib.reload(statistics_helpers)
from statistics_helpers import (
    load_statistics_data,
    build_descriptive_summary,
    build_inference_summary,
)

pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 200)

# Load raw and aggregated tables used across the summaries
df_output, df_ics, df_uoa_m, df_uni_m, df_uniuoa_m = load_statistics_data()


### Descriptive statistics
Overall female shares, plus the institutions with the highest and lowest female proportions, for impact case studies (ICS) versus Outputs.


In [2]:

report = build_descriptive_summary(df_ics, df_uoa_m, df_uni_m, df_output)
print(report)


DESCRIPTIVE SUMMARY OF FEMALE REPRESENTATION IN ICS & OUTPUTS

Overall female share:
  • Outputs: 295,846.0 women / 854,366.0 total = 34.63% female
  • ICS:     6,274 women / 16,447 total = 38.15% female

All-female ICS submissions (excluding unknowns): 1,298 (20.41% of all ICS cases)

Universities with highest female Impact (ICS) proportions:
                    Institution name  pct_female_ics  pct_female_output
      Stranmillis University College        1.000000           0.741379
              Royal College of Music        0.800000           0.622642
               Leeds Arts University        0.750000           0.636364
Queen Margaret University, Edinburgh        0.714286           0.624060
          Courtauld Institute of Art        0.714286           0.750000

Universities with lowest female Impact (ICS) proportions:
                               Institution name  pct_female_ics  pct_female_output
              Guildhall School of Music & Drama        0.000000           0.2000

### Inference: ICS vs Outputs
One-sided tests of whether the female share in ICS exceeds the female share in Outputs, at multiple levels of aggregation.


In [3]:

inference_report = build_inference_summary(df_ics, df_output, df_uoa_m, df_uni_m, df_uniuoa_m)
print(inference_report)


Hypothesis across all levels: female proportion in ICS exceeds Outputs (one-sided tests).

RAW pooled female shares:
  ICS   : p̂ = 0.3815  (95% CI [0.3740, 0.3889]), n = 16447
  Output: p̂ = 0.3463  (95% CI [0.3453, 0.3473]), n = 854366

Two-proportion z-test (RAW):
  H0: p_ICS = p_Output   vs   H1: p_ICS > p_Output
  z = 9.392, p = 0 (H0: p1 = p2 vs H1: p1 > p2). Observed difference p1−p2 = +0.0352 [+0.0277, +0.0427] (95% CI, Wald, unpooled). Result is statistically significant at α=0.05; the estimated difference is positive.
  Interpretation: This tests the overall female share across all observations. A significant result supports higher ICS share.
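
As a cross-check on the RAW test above, the sketch below recomputes the statistic from the counts printed in the descriptive summary. It is not the code inside `build_inference_summary`, which is not shown here. It assumes a pooled standard error for z and an unpooled (Wald) standard error for the interval, a choice that happens to reproduce the reported figures. The function name `two_proportion_ztest` is hypothetical.

```python
import math


def two_proportion_ztest(x1, n1, x2, n2, z_crit=1.959964):
    """One-sided two-proportion z-test (H1: p1 > p2) with a Wald CI for p1 - p2.

    Assumption: pooled SE for the test statistic, unpooled SE for the CI.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Pooled proportion under H0: p1 = p2
    pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_pooled

    # Upper-tail normal probability; erfc stays accurate for large z
    p_value = 0.5 * math.erfc(z / math.sqrt(2))

    # Unpooled (Wald) standard error for the confidence interval
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, diff, ci


# Counts taken from the descriptive summary: ICS first, then Outputs
z, p_value, diff, ci = two_proportion_ztest(6274, 16447, 295846, 854366)
print(f"z = {z:.3f}, p = {p_value:.2e}, diff = {diff:+.4f}, "
      f"95% CI = [{ci[0]:+.4f}, {ci[1]:+.4f}]")
```

The reported `p = 0` is a display artefact: the one-sided p-value is extremely small but strictly positive, so printing it in scientific notation is more informative.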


— University level: paired analysis for Δ = (ICS − Output)
  Descriptives: n = 155, mean(Δ) = 0.0105 (95% CI [-0.0096, 0.0305]), sd = 0.1273, Cohen's dz = 0.082
  t-test (mean Δ > 0): t = 1.023, p = 0.154 → not significant at α=0.05 (mean Δ positive).
  Wilcoxon (median Δ > 0): W = 7525.000, p = 0.0041 → significant (median Δ positive)
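
The university-level descriptives can be checked in the same spirit. The sketch below recomputes the t statistic and Cohen's dz from the rounded values printed above (n, mean and sd of Δ), so small discrepancies in the last digit are expected. The p-value uses a normal approximation to the t distribution, which is an assumption made here to keep the example dependency-free. The Wilcoxon statistic cannot be recovered from summary values and is not reproduced.

```python
import math
from statistics import NormalDist

# Rounded summary values copied from the university-level output
n, mean_delta, sd_delta = 155, 0.0105, 0.1273

# Standard error of the mean difference and one-sample t statistic
se = sd_delta / math.sqrt(n)
t_stat = mean_delta / se

# Cohen's dz for paired data: mean difference over sd of the differences
cohens_dz = mean_delta / sd_delta

# One-sided p-value via a normal approximation (reasonable at df = 154)
p_approx = 1 - NormalDist().cdf(t_stat)

print(f"t = {t_stat:.3f}, dz = {cohens_dz:.3f}, approx. p = {p_approx:.3f}")
```

One possible reading of the disagreement between the two tests: the t-test targets the mean of Δ, while the Wilcoxon signed-rank test works on the ranks of |Δ| and is therefore less affected by a few large negative differences. With sd ≈ 0.13 against a mean of about 0.01, a small number of institutions with strongly negative Δ could pull the mean toward zero while most institutions still show a positive Δ.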