<img src="./Images/Illustration_general_methods.png" alt="General Methods" style="width: 60%; display: block; margin: auto;">

<style>
.reveal {
    font-size: 10%;  /* Adjust this value to control the overall font size */
}
</style>
# GREG Estimator

$$
\hat{Y}_{i[\text{GREG}]} = X_i'\hat{\beta} + \sum_{k \in s_i} d_{ik}(y_{ik} - x_{ik}'\hat{\beta})
$$

$$
\hat{\beta} = \left( \sum_{k \in s_i} d_{ik}x_{ik}x_{ik}' \right)^{-1} \left( \sum_{k \in s_i} d_{ik}x_{ik}y_{ik} \right)
$$

Fit a regression model on the survey data ($Y$ and $X$), then apply it to the known census auxiliary totals $X_i$ for the small area.

The design weights $d$ enter both steps: they weight the regression fit, and they weight the sample residuals $y_{ik} - x_{ik}'\hat{\beta}$ that are added back to capture the area's idiosyncratic part.
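A minimal NumPy sketch of the two formulas above, under stated assumptions: the function name `greg_total`, the simulated sample, and the census totals in `X_total` are all hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def greg_total(y, X, d, X_census_total):
    """GREG estimate of an area total.

    y: (n,) sampled outcomes, X: (n, p) sampled covariates,
    d: (n,) design weights, X_census_total: (p,) known totals X_i.
    """
    # Design-weighted least squares: beta = (sum d x x')^-1 (sum d x y)
    XtD = X.T * d
    beta = np.linalg.solve(XtD @ X, XtD @ y)
    synthetic = X_census_total @ beta          # X_i' beta
    correction = np.sum(d * (y - X @ beta))    # design-weighted residual total
    return synthetic + correction, beta

# Hypothetical data: 50 sampled units, each representing 20 population units
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 1, n)
d = np.full(n, 20.0)
X_total = np.array([1000.0, 5200.0])           # assumed census totals (N, sum of x)

estimate, beta = greg_total(y, X, d, X_total)
print(estimate, beta)
```

Rearranging terms shows the same number equals the weighted sample total of $y$ plus $(X_i - \hat{X}_i)'\hat{\beta}$, which is the form used by the modified direct estimator.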

# GREG Estimator (Part 2)

- Can be negative in small areas when it overestimates $Y$ (since $Y$ is a total, it must not be negative).
- When the area has no sample data, the correction term vanishes and GREG reduces to the model-based synthetic regression estimator $X_i'\hat{\beta}$.
- Approximately design unbiased.
- Model unbiased only when area-specific auxiliary information is available under the assumption of linear association (relevance).
- Not consistent (high residuals).

# GREG Estimator (Revised Weights)

A more general form can be derived if we take revised weights ($w_d \times w_e$):

- Consistency with covariate totals but not on the estimation of $Y$ at the aggregated level.
- Provides a good estimate when covariates $X$ are consistent for the area.

# Modified Direct Estimator
$$ \hat{Y}_{i[\text{MDE}]} = \hat{Y}_i + (X_i - \hat{X}_i)'\hat{\beta} \qquad;\qquad \hat{\beta} = \left( \sum_{k \in s} d_k x_k x_k' \right)^{-1} \left( \sum_{k \in s} d_k x_k y_k \right) $$

- **Borrow strength** over small areas (SA) for estimating regression coefficients (using regression to remove systematic biases from the Horvitz-Thompson estimator).
- **Improve estimator reliability**.
- Approximately design unbiased as the overall sample size increases (although less effective than indirect small area estimators or design-based model-assisted estimators).

# Modified Direct Estimator (Part 2)
It's important to note:
- $X_i$ is the true data from the census, and $\hat{X}_i$ is the estimated data through HT or another basic estimator (similar to $\hat{Y}_i$).
- The adjustment is needed because, if $\hat{X}_i$ is overestimated, $\hat{Y}_i$ will likely be overestimated too. The regression corrects this overestimation.
- Since $\hat{Y}_i$ and the adjustment term are negatively correlated, this reduces the variance.
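A small worked example of the adjustment, as a sketch only: the two sampled units, the census total `X_census`, and the pooled coefficients `beta` are made-up numbers, and `modified_direct` is a hypothetical helper name.

```python
import numpy as np

def modified_direct(y_i, X_i_sample, d_i, X_i_census, beta):
    """Y_hat_i + (X_i - X_hat_i)' beta, with HT estimates for Y_hat_i and X_hat_i."""
    y_ht = np.sum(d_i * y_i)       # Horvitz-Thompson estimate of the area total of y
    x_ht = X_i_sample.T @ d_i      # Horvitz-Thompson estimate of the covariate totals
    return y_ht + (X_i_census - x_ht) @ beta

y_i = np.array([10.0, 14.0])
X_i_sample = np.array([[1.0, 2.0],
                       [1.0, 4.0]])   # intercept and one covariate
d_i = np.array([5.0, 5.0])
X_census = np.array([10.0, 28.0])     # assumed true census totals
beta = np.array([4.0, 2.0])           # assumed coefficients pooled over all areas

print(modified_direct(y_i, X_i_sample, d_i, X_census, beta))  # 116.0
```

Here the HT estimate of the covariate total (30) exceeds the census value (28), so the HT estimate of $Y_i$ (120) is pulled down by $2 \times 2 = 4$.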

# Bias-Adj Synthetic Estimator

## Synthetic Estimator:
$$ \hat{Y}_{i, \text{syn}} = X_i' \hat{\beta} $$
- **Assumption:** The relationship between $Y$ and $X$ is consistent across all areas.
- **Issue:** If the true regression coefficients differ between areas, the synthetic estimator may have a large bias for area $i$.

## Bias in Synthetic Estimator:
For $x_{ij1} = 1$ (including an intercept),
$$ \mathbb{E}_D(\hat{Y}_{i, \text{syn}} - Y_i) \neq 0 $$
- **Explanation:** The bias arises because $\hat{\beta}$, estimated from all areas, may not reflect the true relationship in area $i$.

# Bias-Adj Synthetic Estimator (Part 2)
## Bias Adjustment:
To address this bias, an adjustment term is added:
$$ \hat{Y}_{i, \text{adj}} = \hat{Y}_{i, \text{syn}} + (\bar{y}_i - \bar{x}_i' \hat{\beta}) $$
- $\bar{y}_i$: Sample mean of $Y$ in area $i$.
- $\bar{x}_i$: Sample mean of $X$ in area $i$.
- The term $(\bar{y}_i - \bar{x}_i' \hat{\beta})$ estimates the bias.
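The adjustment can be illustrated with a short simulation. This is a sketch under assumptions: $X_i$ is taken to be the area population mean of $x$ so that all terms are on the mean scale, and the coefficients and data are invented. Area $i$ is given a true intercept of 3 while the pooled model uses 1.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_hat = np.array([1.0, 2.0])      # assumed pooled coefficients (intercept, slope)
X_bar_pop = np.array([1.0, 5.0])     # assumed population mean of x in area i
true_mean = 3.0 + 2.0 * 5.0          # area i follows y = 3 + 2x, so its mean is 13

# Small sample from area i
n_i = 8
x_s = np.column_stack([np.ones(n_i), rng.uniform(3, 7, n_i)])
y_s = 3.0 + 2.0 * x_s[:, 1] + rng.normal(0, 0.5, n_i)

y_syn = X_bar_pop @ beta_hat                           # synthetic estimate: 11.0
bias_hat = y_s.mean() - x_s.mean(axis=0) @ beta_hat    # sample estimate of the bias
y_adj = y_syn + bias_hat                               # bias-adjusted estimate

print(y_syn, y_adj, true_mean)
```

The synthetic estimate misses the true mean by exactly 2, while the adjusted estimate misses only by the sample mean of the noise, which is the source of its $O(1/n_i)$ variance.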

# Bias-Adj Synthetic Estimator (Part 3)
## Variance Impact:
- **Variance Increases:** Incorporating $\bar{y}_i$ increases the variance of the adjusted estimator:
$$ \text{Var}_D(\hat{Y}_{i, \text{adj}}) = O\left(\frac{1}{n_i}\right) $$
- Small areas with low sample sizes ($n_i$) will have larger variances.

## Implication:
- **Bias-Variance Trade-off:** 
    - Adjusting for bias reduces bias but increases variance.
    - Small $n_i$ leads to high variance, making the estimate less reliable.

# Bias-Adj Synthetic Estimator (Part 4)
## Summary:
- The synthetic estimator may have large bias if area-specific relationships differ.
- Adjusting for bias improves accuracy but can result in high variance, especially in small areas.
- **Trade-off:** Reducing bias can increase variance, and alternative estimators may better balance the two.

# Pros and Cons of Design-Based SAE

<div style="height: 300px; overflow-y: scroll; font-size: 10px; padding: 10px; border: 1px solid #ccc;">

| **Advantages** | **Disadvantages** |
| --- | --- |
| **Model Independence:** Estimation is less dependent on an assumed model, although models assist in constructing the estimators. | **Large Variance in Direct Estimators:** Due to small sample sizes, direct estimators have high variance. |
| **Unbiased and Consistent:** Estimators are approximately unbiased and consistent for large sample sizes within areas. This protects against possible model misspecification. | **Variable Survey Regression Estimators:** Survey regression estimators, though unbiased, may be too variable. |
| **Protects Against Bias:** Especially effective in large areas, reducing the risk of bias from incorrect model assumptions. | **Bias in Synthetic Estimators:** While they have small variance, synthetic estimators are generally biased. |
|  | **Complex Composite Estimators:** These have smaller bias but larger variance than synthetic estimators. Choosing the correct weights is challenging. |
|  | **Conditional Inference Limitations:** Design-based methods do not easily lend themselves to conditional inference, which can inflate estimator variance. |
|  | **Lack of Theory for No-Sample Areas:** There is no established theory for areas with no samples. Predictions in these areas are difficult without additional information. |
|  | **Sample Size Issues:** Large sample normality assumptions may not hold in small areas, making confidence interval computations unreliable. |

**Note:** Design-based methods are powerful but may not be effective for all small area estimations, especially where sample sizes are limited or nonexistent.
</div>

# Model-Based Methods in SAE


# Explicit Model Approach

**Basic Area**: You have area-specific survey data and census data, then use, e.g., the Fay and Herriot model to estimate $Y$ (aggregate information) you are interested in.

- **Linking model**: $\theta_i \sim N(x_i'\beta, \sigma_e^2)$, where $i = 1, 2, \ldots, n$ and $\sigma_e^2$ is the variance of the chosen model.
- **Matching sampling model**: $\hat{\theta}_i | \theta_i \sim N(\theta_i, \omega_i^2)$, where $i = 1, 2, \ldots, n$ and $\omega_i^2$ is the known sampling variance, which reflects the precision of the survey.
- Combining the two gives a linear mixed model.
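Under the two models above, the best linear unbiased predictor shrinks each direct estimate toward the regression prediction with weight $\gamma_i = \sigma_e^2 / (\sigma_e^2 + \omega_i^2)$. The sketch below assumes the model variance is known; in practice it is estimated, which this example does not cover. The function name and the four-area data are hypothetical.

```python
import numpy as np

def fay_herriot_blup(theta_hat, X, omega2, sigma2_e):
    """BLUP of area parameters for a known model variance sigma2_e."""
    v = sigma2_e + omega2                               # marginal variance of each direct estimate
    XtV = X.T / v
    beta = np.linalg.solve(XtV @ X, XtV @ theta_hat)    # generalized least squares
    synth = X @ beta                                    # regression (synthetic) prediction
    gamma = sigma2_e / v                                # weight on the direct estimate
    return gamma * theta_hat + (1 - gamma) * synth, gamma, synth

theta_hat = np.array([10.0, 14.0, 9.0, 20.0])           # direct survey estimates
X = np.column_stack([np.ones(4), [1.0, 2.0, 1.0, 3.0]])
omega2 = np.array([1.0, 4.0, 0.25, 9.0])                # known sampling variances
blup, gamma, synth = fay_herriot_blup(theta_hat, X, omega2, sigma2_e=1.0)

print(gamma)   # [0.5 0.2 0.8 0.1]
print(blup)
```

Areas with noisy direct estimates (large $\omega_i^2$) get a small $\gamma_i$ and lean mostly on the regression prediction.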

**Basic Unit**: You have individual-specific survey data and census data, then consider both area error and individual error (nested error linear regression).

$$
y_{ij} = x_{ij}'\beta + \epsilon_i + e_{ij}
$$

# Generalized Linear Mixed Model

Key assumption: the random effects $\epsilon$ and the errors $e$ are $N(0,\sigma^2\Phi)$ and uncorrelated with each other.

$$y = X\beta + Z\epsilon + e$$
- **Example of GLMM (binomial logistic mixed model)**

$$ LMM(y_{ik} | \epsilon_i) = P(y_{ik} = 1 | \epsilon_i) = \frac{\exp[x'_{ik}(\beta + \epsilon_i)]}{1 + \exp[x'_{ik}(\beta + \epsilon_i)]}, \quad k \in E_i, \forall i $$

$$ \hat{y}_{ik} = \frac{\exp[x'_{ik}(\beta + \epsilon_i)]}{1 + \exp[x'_{ik}(\beta + \epsilon_i)]}, \quad \forall k \in E_i $$

# Methodologies in MMT

MMT is essentially a way to simulate microdata in small areas to have reliable data to make estimates.

- Create synthetic spatial microdata set (possible solution).
- Synthetic reconstruction: construct synthetic micropopulation such that all small area-level constraints are reproduced.
- Statistical data matching or fusion.
- Iterative Proportional Fitting (IPF) with MCMC.
- Reweighting: calibrates the sampling design weights to a new set of weights by minimizing a distance measure.


# Statistical Data Matching or Fusion

Records from two datasets, $A$ and $B$, are matched on a common variable (exact matching or uncertainty matching) to form a combined dataset $C = A \cup B$.

Empirical steps:

1. Adjust available data and variable transformation.
2. Choose matching variable.
3. Select matching method and distance function.
4. Validate.

Matching is useful to have a more complete microdataset, and possibly $B$ is census data while $A$ is survey data.
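As one possible instance of steps 1 to 3, the sketch below uses nearest-neighbour matching on standardized common variables with Euclidean distance. The choice of method, the helper name `nearest_neighbour_match`, and the toy records (age, income, tenure) are assumptions for illustration, not a method prescribed here.

```python
import numpy as np

def nearest_neighbour_match(A_common, B_common):
    """For each recipient row in A, return the index of the closest donor row in B."""
    # Step 1: put the matching variables on a common scale
    pooled = np.vstack([A_common, B_common])
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0)
    A_z, B_z = (A_common - mu) / sd, (B_common - mu) / sd
    # Step 3: Euclidean distance between every recipient and every donor
    dist = np.linalg.norm(A_z[:, None, :] - B_z[None, :, :], axis=2)
    return dist.argmin(axis=1), dist.min(axis=1)

# Step 2: matching variables are (age, income); B also carries a tenure variable
A_common = np.array([[25.0, 30.0], [60.0, 80.0]])                 # survey records
B_common = np.array([[24.0, 32.0], [58.0, 75.0], [40.0, 50.0]])   # donor records
B_tenure = np.array(["rent", "own", "rent"])

donor, distance = nearest_neighbour_match(A_common, B_common)
A_tenure = B_tenure[donor]          # variable fused onto the survey records
print(donor, A_tenure)              # [0 1] ['rent' 'own']
```

Step 4 (validation) would then compare the distribution of the fused variable in the matched file against its distribution in the donor file.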


# IPF (Iterative Proportional Fitting)

## Overview
- **Purpose:** Adjust cell frequencies in contingency tables to match known marginal totals.
- **Foundation:** Based on contingency table analysis and probability theory, the goal is to estimate $p(x, y)$.

## Key Concepts
- **Expected Marginal Totals:** The process uses known marginal totals and iteratively adjusts the cell values to ensure consistency.
- **MCMC Sampling:** Used to reach a stationary distribution that reflects the true population structure.
- **Fractional Weights:** The algorithm generates fractional weights for cells, which can be crucial for accurate representation of population estimates.


## Important Considerations
- **Number of Iterations:** Choose an appropriate number of iterations to ensure convergence without overfitting.
- **Validation:** Both internal and external validation are needed to assess the accuracy and reliability of the generated data.
- **Application Issues:** If the fractional weights are not integerized, they may not be suitable for agent-based models or other applications requiring discrete data.

## IPF Process
1. **Initialize Cell Estimates:**
   - Start with initial estimates for each cell in the contingency table.
   
2. **Iterative Adjustment:**
   - Adjust cell values iteratively by normalizing:
     - **Over Columns:** 
     $$ p_{ij}^{(k+1)} = \frac{p_{ij}^{(k)} \times Q_i}{\sum_j p_{ij}^{(k)}} $$
     - **Over Rows:** 
     $$ p_{ij}^{(k+2)} = \frac{p_{ij}^{(k+1)} \times Q_j}{\sum_i p_{ij}^{(k+1)}} $$




3. **Convergence Check:**
   - The process stops when the row and column sums of the updated cell values match the predefined marginal totals (in practice, to within a small tolerance):
   $$ \sum_j p_{ij}^{(m)} = Q_i \quad \text{and} \quad \sum_i p_{ij}^{(m)} = Q_j $$

4. **Validation:**
   - Validate the final cell values against the known marginal totals and ensure the output aligns with the expected distributions.
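The process above can be sketched for a two-way table as follows. The function name `ipf`, the tolerance, and the seed table are assumptions; the sketch also assumes a strictly positive seed and margins that share the same grand total.

```python
import numpy as np

def ipf(seed, row_totals, col_totals, tol=1e-8, max_iter=1000):
    """Iterative proportional fitting for a 2-D table."""
    p = seed.astype(float).copy()
    for it in range(max_iter):
        p *= (row_totals / p.sum(axis=1))[:, None]   # scale each row to match Q_i
        p *= (col_totals / p.sum(axis=0))[None, :]   # scale each column to match Q_j
        # Columns are exact after the second step, so only rows need checking
        if np.max(np.abs(p.sum(axis=1) - row_totals)) < tol:
            return p, it + 1
    raise RuntimeError("IPF did not converge within max_iter iterations")

seed = np.array([[1.0, 2.0],
                 [3.0, 4.0]])            # initial cell estimates, e.g. from a survey
row_totals = np.array([40.0, 60.0])      # known margins, e.g. from a census
col_totals = np.array([35.0, 65.0])

fitted, n_iter = ipf(seed, row_totals, col_totals)
print(fitted, n_iter)
```

Because each step only rescales whole rows or whole columns, the fitted table keeps the odds ratio of the seed (here $1 \times 4 / (2 \times 3) = 2/3$) while matching the new margins. The fitted cells are fractional.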


## IPF Application
- **Synthetic Data Creation:** Used to create synthetic spatial microdata for analysis and modeling.
- **Microsimulation Modeling:** Provides fractional weights crucial for microsimulation modeling and policy evaluations.

**Note:** While IPF is robust for adjusting cell frequencies to known totals, issues arise when these weights need to be used in models that require discrete values. Methodological advancements are ongoing to address these limitations and improve its application scope.


# Repeated Weighting Method (RWM)

## Overview
- **Purpose:** Estimate well-defined table sets from a large database, ensuring numerical consistency across estimates.
- **Context:** Estimates drawn from the Social Statistical Database (SSD) can be mutually inconsistent because the SSD combines different data sources.

## Key Features
- **GREG Estimator:** Based on repeated use of the Generalized Regression (GREG) estimator due to its strong calibration properties.
- **Consistency:** The method ensures that all estimated tables are numerically consistent with each other.

## Steps in the Repeated Weighting Process
1. **Estimate Tables from SSD:**
   - Begin by estimating tables from a large, comprehensive database (SSD), using all available data sources.

2. **Identify Common Margins:**
   - For each table, determine which margins are in common with previously estimated tables in the set.

3. **Calibrate on Common Margins:**
   - Estimate the table by calibrating on these common margins, ensuring numerical consistency across the entire set of tables.

## Benefits
- **Reduced Variance:** This method helps to lower variance in estimates by utilizing common margins effectively.
- **Maximized Information Use:** Makes optimal use of all available data and auxiliary information.

## Limitations
- **Not Universally Applicable:** May not be suitable for all situations, particularly when rapid estimates or estimates for multiple subpopulations are required.
- **Inconsistencies with SSD:** Estimates may vary due to differences in data sources, potentially leading to inconsistencies in large-scale statistical dissemination.

**Note:** This method is most effective when numerical consistency is crucial and when working with well-defined, structured datasets.

# CO (Combinatorial Optimisation)

- Fit an appropriate combination of individuals given known benchmarks.
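- Iterative process: randomly select an individual and replace it with another, keeping the swap when it improves the fit to the small-area (SA) benchmarks.
- Techniques: hill climbing, simulated annealing, genetic algorithms (an exhaustive search over all combinations is otherwise combinatorially large).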
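A minimal simulated-annealing sketch of this search, under assumptions: the fit measure is total absolute error against the benchmarks, and the function name `co_select`, the cooling schedule, the binary survey indicators, and the benchmark values are all hypothetical.

```python
import numpy as np

def co_select(X, benchmarks, n_select, n_iter=5000, temp=1.0, cooling=0.999, seed=0):
    """Search for n_select survey records whose column totals fit the benchmarks."""
    rng = np.random.default_rng(seed)
    chosen = rng.integers(0, len(X), size=n_select)      # random start, with replacement
    totals = X[chosen].sum(axis=0)
    error = np.abs(totals - benchmarks).sum()            # total absolute error
    initial_error = error
    best_chosen, best_error = chosen.copy(), error
    for _ in range(n_iter):
        pos = rng.integers(n_select)                     # slot to replace
        cand = rng.integers(len(X))                      # candidate replacement record
        new_totals = totals - X[chosen[pos]] + X[cand]
        new_error = np.abs(new_totals - benchmarks).sum()
        # Always accept improvements; accept worse swaps with a shrinking probability
        if new_error <= error or rng.random() < np.exp((error - new_error) / temp):
            chosen[pos], totals, error = cand, new_totals, new_error
            if error < best_error:
                best_chosen, best_error = chosen.copy(), error
        temp *= cooling
    weights = np.bincount(best_chosen, minlength=len(X))  # integer weight per record
    return weights, best_error, initial_error

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 3))      # 200 survey households, 3 binary indicators
benchmarks = np.array([30, 12, 20])        # assumed small-area counts
weights, best_error, initial_error = co_select(X, benchmarks, n_select=40)
print(best_error, initial_error)
```

The result is an integer weight for each survey record, which is what distinguishes CO from reweighting methods that return fractional weights. Setting `temp` close to zero turns the search into plain hill climbing.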


# GREGWT Approach

## Overview
- **What is GREGWT?**
  - GREGWT (Generalized Regression Weighting Tool) is an extension of the GREG estimator.
  - Unlike GREG, it uses aggregate information for calibration instead of individual-level data ($X$).

## Key Features
- **Iterative Calibration:**
  - Uses an iterative GREG algorithm, often implemented in SAS, to adjust survey weights.
  - Calibrates survey estimates to match known benchmarks, ensuring consistency with external totals.

- **Auxiliary Information:**
  - Incorporates external sources of data (auxiliary information) to improve the accuracy and reliability of survey estimates.

- **Constrained Distance Function:**
  - Minimizes a truncated chi-squared distance function, subject to calibration equations for each small area.
  - Known as "truncated linear regression" or "restricted modified chi-squared" method.





## Steps in the GREGWT Process
1. **Define Calibration Equation:**
   - Set up a calibration equation for each small area:
   $$ \sum_{k \in s} w_k x_k = T_x $$

2. **Adjust Weights:**
   - Weights are adjusted using Lagrange multipliers:
   $$ w_k = d_k + d_k x_k' \lambda $$
   where $\lambda$ is the vector of Lagrange multipliers calculated iteratively.

3. **Minimize Distance Function:**
   - Minimize the truncated chi-squared distance function using the Newton-Raphson method:
   $$ \lambda = \left( \sum_{k \in s} d_k x_k x_k' \right)^{-1} (T_x - t_{x,s}) $$
   where $t_{x,s} = \sum_{k \in s} d_k x_k$ is the design-weighted sample total of $x$.

4. **Iterate Until Convergence:**
   - Repeat the process until the weights converge to satisfy the calibration equation for each small area.
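The weight update in steps 1 to 3 can be sketched as below. This covers only the untruncated linear case, where a single step already satisfies the calibration equation exactly; the boundary conditions that make GREGWT iterative are deliberately left out. The function name `linear_calibration` and the six-unit example are assumptions.

```python
import numpy as np

def linear_calibration(d, X, T_x):
    """Calibrate design weights d so that sum_k w_k x_k equals T_x (no bounds)."""
    t_xs = X.T @ d                           # design-weighted sample totals t_{x,s}
    M = (X.T * d) @ X                        # sum_k d_k x_k x_k'
    lam = np.linalg.solve(M, T_x - t_xs)     # Lagrange multipliers
    return d * (1.0 + X @ lam)               # w_k = d_k + d_k x_k' lambda

d = np.full(6, 10.0)                         # initial design weights
X = np.array([[1, 1], [1, 0], [1, 1],
              [1, 0], [1, 0], [1, 1]], dtype=float)   # intercept and one indicator
T_x = np.array([70.0, 30.0])                 # assumed small-area benchmarks

w = linear_calibration(d, X, T_x)
print(w)          # indicator = 1 units keep 10.0, the others rise to about 13.33
print(X.T @ w)    # reproduces the benchmarks
```

Once weights are forced to stay inside lower and upper bounds, some units are held at a bound and the multipliers must be recomputed for the remaining units, which is why the full procedure repeats until convergence.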

## Applications
- **Survey Estimation:**
  - Used for adjusting survey weights to match known benchmarks, improving the representativeness of survey data.

- **Small Area Estimation:**
  - Particularly useful for small area estimation where individual-level data is unavailable, but aggregate benchmarks are known.

## Limitations
- **Boundary Conditions:**
  - Weights must lie within predefined boundary conditions, which can be restrictive.
  
- **Complex Implementation:**
  - Requires complex iterative algorithms and may not converge in certain scenarios without adjustments.

**Summary:**
- GREGWT is a powerful tool for calibrating survey estimates using auxiliary information. It ensures consistency with benchmarks through an iterative process that minimizes a constrained distance function.

# GREGWT vs. CO Reweighting Methodologies
<div style="height: 300px; overflow-y: scroll; font-size: 10px; padding: 10px; border: 1px solid #ccc;">

| **Aspect** | **GREGWT** | **CO** |
| --- | --- | --- |
| **Process Type** | Iterative process using the Newton-Raphson method. | Iterative process with a stochastic approach. |
| **Basis** | Minimizes a distance function based on known benchmarks. | Based on a combination of households to best fit known benchmarks. |
| **Minimization Tools** | Uses Lagrange multipliers for minimizing the distance function. | Uses CO techniques as intelligent searching tools for optimizing household combinations. |
| **Weights** | Fractional weights. | Integer weights. |
| **Boundary Conditions** | Applies new weights for boundary conditions. | No boundary conditions applied. |
| **Benchmark Constraints** | Fixed for the algorithm; sensitive to disagreements between benchmarks. | Flexible; insensitive to benchmark disagreements and can split the difference between constraints. |
| **Focus** | Simulating microdata at small area levels; aggregation possible at larger domains. | Allows for mutually consistent analysis at any aggregation or sophistication level. |
| **Standard Errors** | Obtained via a group jackknife approach. | No known method for standard error estimation. |
| **Convergence** | May fail to converge in some cases; requires boundary adjustments. | No convergence issues; however, the final household combination may still fail to fit benchmark constraints. |
| **Statistical Reliability** | No standard index for checking statistical reliability. | No standard index for statistical reliability. |
| **Stability** | Iteration may be unstable near a horizontal asymptote or local extremum. | Iteration may avoid local extrema, but can be deceiving at local solutions. |
</div>

**Summary:**
- **GREGWT** is highly sensitive to benchmarks and can be unstable in some cases, but is effective for simulating microdata at small area levels.
- **CO** offers greater flexibility with integer weights and is less sensitive to benchmark disagreements, but lacks established methods for standard errors and statistical reliability.

# How to Estimate Standard Error (JACKKNIFE Approach)

The SA estimates have their own standard errors, and GREGWT can calculate these using a group jackknife approach:

1. Divide the survey into groups (subsamples). For each group, recompute the estimate from the total sample with that group left out (leave-one-group-out).
2. Take the differences between these replicate estimates and the original full-sample estimate, and combine the squared differences to estimate the variance $\rightarrow$ standard error.
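The two steps can be sketched for a weighted total as follows. Several choices here are assumptions rather than GREGWT's documented behaviour: groups are formed at random, the retained weights are simply rescaled instead of being recalibrated for each replicate, and the variance uses the common delete-a-group factor $(G-1)/G$.

```python
import numpy as np

def group_jackknife_se(y, w, n_groups=10, seed=0):
    """Delete-a-group jackknife standard error of the weighted total sum(w * y)."""
    rng = np.random.default_rng(seed)
    groups = rng.permutation(len(y)) % n_groups       # random groups of near-equal size
    full = np.sum(w * y)                              # original full-sample estimate
    reps = np.empty(n_groups)
    for g in range(n_groups):
        keep = groups != g                            # leave group g out
        # Rescale retained weights so they still sum to the full weight total
        w_g = w[keep] * w.sum() / w[keep].sum()
        reps[g] = np.sum(w_g * y[keep])
    var = (n_groups - 1) / n_groups * np.sum((reps - full) ** 2)
    return full, np.sqrt(var)

rng = np.random.default_rng(3)
y = rng.normal(50.0, 10.0, size=200)    # hypothetical survey variable
w = np.full(200, 25.0)                  # hypothetical survey weights
estimate, se = group_jackknife_se(y, w)
print(estimate, se)
```

For a small-area application, the estimate inside the loop would be replaced by the full reweighting run on the reduced sample.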
