# Intuition

Confounder


# Notations

## Confounder definitions

A **confounder** is a variable that influences both the independent variable (genotype) and the dependent variable (trait), creating a spurious relationship between the two. If not controlled for, a confounder can distort the true relationship between the genotype and the trait.

Confounding can cause bias in the estimation of the relationship between the genotype and the trait. To avoid this bias, confounders **must be controlled** in the analysis, typically by including them as covariates in regression models.
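
To see where the bias comes from, a minimal linear sketch may help. The symbols here are introduced only for illustration and are not part of the original notation: $y$ is the trait, $x$ the genotype, $a$ the confounder, $\beta$ and $\gamma$ their effects, and $\varepsilon$ a noise term assumed to be uncorrelated with $x$. Suppose the trait is generated as

$$ y = \beta x + \gamma a + \varepsilon $$

Since $\mathrm{Cov}(x, y) = \beta \, \mathrm{Var}(x) + \gamma \, \mathrm{Cov}(x, a)$ under that assumption, the slope from regressing $y$ on $x$ alone targets

$$ \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \beta + \gamma \, \frac{\mathrm{Cov}(x, a)}{\mathrm{Var}(x)} $$

The second term is the confounding bias. It vanishes only if the confounder has no effect on the trait ($\gamma = 0$) or is uncorrelated with the genotype ($\mathrm{Cov}(x, a) = 0$).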

## Example in Statistical Genetics
  Suppose we are studying the effect of a genetic variant (genotype) on a trait (e.g., height). If **genetic ancestry** influences both the genotype and height, it is a confounder. Failing to control for ancestry could lead to biased estimates of the genetic effect on the trait.

**Graphical representation:**
$$ \text{Genotype} \leftarrow \textbf{Ancestry} \to \text{Height} $$

Here, **ancestry** represents the confounder that affects both the genotype and the trait (height).


Ancestry is a confounder in the relationship between SNPs and height because it affects both variables independently. 
- Different ancestral populations have different genetic backgrounds, which means SNP frequencies vary across populations. 
- At the same time, ancestry is also associated with height due to both genetic and environmental factors (e.g., nutrition, living conditions). 

Because ancestry influences both the presence of specific SNPs and height but is not part of the direct causal pathway, failing to account for it can create a misleading association between the SNP and height that is actually driven by ancestry differences rather than a true biological effect of the SNP.
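
The following sketch illustrates the extreme case described above, where the SNP has no effect on height and any association comes entirely from ancestry. All values (sample size, allele frequencies, effect size, seed) are hypothetical and chosen only for illustration.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

n <- 2000                                     # hypothetical sample size
ancestry <- rbinom(n, size = 1, prob = 0.5)   # two populations, coded 0/1

# Assumed allele frequencies: 0.2 in population 0, 0.6 in population 1
allele_freq <- ifelse(ancestry == 1, 0.6, 0.2)
snp <- rbinom(n, size = 2, prob = allele_freq)  # genotype = allele count (0, 1, 2)

# Height depends on ancestry only; the SNP has no effect at all
height <- 2 * ancestry + rnorm(n)

# Without ancestry, the SNP picks up the ancestry signal
fit_unadjusted <- lm(height ~ snp)

# With ancestry as a covariate, the SNP coefficient should be close to zero
fit_adjusted <- lm(height ~ snp + ancestry)

coef(summary(fit_unadjusted))["snp", ]
coef(summary(fit_adjusted))["snp", ]
```

In a setup like this, the unadjusted model is expected to report a clearly non-zero SNP coefficient even though the simulated effect is zero, while the adjusted model should not.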

# Example

In [30]:
rm(list=ls())
set.seed(21)  # For reproducibility

# Genotype matrix for 100 individuals and 3 variants
N <- 100  # Number of individuals
M <- 3    # Number of SNPs (variants)

# Simulate ancestry as a binary variable (two populations)
ancestry <- sample(0:1, N, replace = TRUE)

# Define genotype probabilities (for genotypes 0, 1, 2) in each population.
# The same probabilities are applied to every variant, so all three variants
# (not only Variant 2) differ in frequency between the two populations.
# Population 0: genotype 0 is most common; Population 1: genotype 2 is most common
allele_freqs_pop_0 <- c(0.6, 0.3, 0.1)  # P(genotype = 0, 1, 2) in population 0
allele_freqs_pop_1 <- c(0.2, 0.3, 0.5)  # P(genotype = 0, 1, 2) in population 1

# Create genotype matrix (differentiating populations)
X_raw <- matrix(NA, nrow = N, ncol = M)

# Simulate SNPs for each individual based on their ancestry
for (i in 1:N) {
  if (ancestry[i] == 0) {
    X_raw[i, ] <- sample(0:2, M, replace = TRUE, prob = c(allele_freqs_pop_0))
  } else {
    X_raw[i, ] <- sample(0:2, M, replace = TRUE, prob = c(allele_freqs_pop_1))
  }
}

# Add row and column names
rownames(X_raw) <- paste("Individual", 1:N)
colnames(X_raw) <- paste("Variant", 1:M)

# Standardize genotype matrix (mean = 0, sd = 1 for each SNP)
X <- scale(X_raw, scale = TRUE)

# Select Variant 2 as the true causal variant
true_causal_variant <- X[, 2]  # Second column of X

# Define the true causal effect for ancestry and SNP (Variant 2)
true_effect_ancestry <- 5    # True effect of ancestry on height
true_effect_snp <- 3         # True effect of SNP (Variant 2) on height

# Simulate height based on ancestry and the true causal variant
# Height is first generated on its original (unscaled) scale, with both true effects included
height_ori <- true_effect_ancestry * ancestry + true_effect_snp * true_causal_variant + rnorm(N, mean = 0, sd = 5)

# Scale the height (make it standardized)
height <- scale(height_ori)

# Express the true effects on the standardized-height scale
# Dividing by sd(height_ori) converts an effect on raw height into an effect on scaled height
scaled_true_effect_ancestry <- true_effect_ancestry / sd(height_ori)  # Adjusting effect size for ancestry
scaled_true_effect_snp <- true_effect_snp / sd(height_ori)  # Adjusting effect size for SNP

# Create data frame
data <- data.frame(Individual = 1:N, ancestry = ancestry, height = height)
data <- cbind(data, X)  # Add genotype data

In [31]:
scaled_true_effect_snp

In [32]:
# Model 1: Ignoring ancestry
model1 <- lm(height ~ `Variant 2`, data = data)
summary(model1)


Call:
lm(formula = height ~ `Variant 2`, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.96561 -0.42824  0.00881  0.50760  2.40963 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.567e-18  7.815e-02   0.000        1    
`Variant 2` 6.289e-01  7.854e-02   8.007 2.45e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7815 on 98 degrees of freedom
Multiple R-squared:  0.3955,	Adjusted R-squared:  0.3893 
F-statistic: 64.11 on 1 and 98 DF,  p-value: 2.453e-12


In [33]:
# Model 2: Considering ancestry as a covariate
model2 <- lm(height ~ `Variant 2` + ancestry, data = data)
summary(model2)



Call:
lm(formula = height ~ `Variant 2` + ancestry, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76124 -0.40659  0.07453  0.46985  1.80624 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.39452    0.10605   -3.72 0.000334 ***
`Variant 2`  0.44204    0.07994    5.53 2.71e-07 ***
ancestry     0.78904    0.15907    4.96 3.00e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7015 on 97 degrees of freedom
Multiple R-squared:  0.5178,	Adjusted R-squared:  0.5078 
F-statistic: 52.08 on 2 and 97 DF,  p-value: 4.336e-16


## Summary

| **Metric**                  | **Model 1 (Ignoring Ancestry)** | **Model 2 (Including Ancestry)** |
|-----------------------------|---------------------------------|----------------------------------|
| **Estimate for Variant 2**   | 0.6289                          | 0.4420                           |
| **R-squared**                | 0.3955                          | 0.5178                           |
| **P-value for Variant 2**    | 2.45e-12                        | 2.71e-07                         |
| **Estimate for Ancestry**    | Not included                    | 0.7890                           |
| **Residual Standard Error**  | 0.7815                          | 0.7015                           |


- **Model 2** includes ancestry as a covariate and gives an estimate for Variant 2 (0.4420) that is close to the true scaled effect (0.4263).
- **Model 1** omits ancestry and overestimates the SNP effect (0.6289), because part of the effect of ancestry on height is attributed to Variant 2.
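
The size of the gap between the two models can be checked numerically. For ordinary least squares with an intercept, the unadjusted SNP coefficient equals the adjusted SNP coefficient plus the ancestry coefficient multiplied by the slope from regressing ancestry on the SNP. The sketch below verifies this on a fresh, self-contained simulation. The variable names and parameter values are assumptions made for illustration and are separate from the example above.

```r
set.seed(2)  # arbitrary seed, used only for reproducibility

n <- 500                                               # hypothetical sample size
ancestry <- rbinom(n, 1, 0.5)                          # two populations, coded 0/1
snp <- rbinom(n, 2, ifelse(ancestry == 1, 0.6, 0.2))   # allele count depends on ancestry
height <- 0.5 * snp + 1.5 * ancestry + rnorm(n)        # both have a real effect

short <- lm(height ~ snp)             # omits ancestry
long  <- lm(height ~ snp + ancestry)  # includes ancestry

# Slope from regressing the omitted covariate (ancestry) on the SNP
delta <- cov(snp, ancestry) / var(snp)

# Bias predicted from the adjusted model vs. the observed difference
bias_predicted <- unname(coef(long)["ancestry"] * delta)
bias_observed  <- unname(coef(short)["snp"] - coef(long)["snp"])

all.equal(bias_predicted, bias_observed)
```

The two quantities should agree up to floating-point error, since the relationship is an algebraic property of least squares and not a feature of this particular simulation.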
