# Intuition

![figure](./cartoons/4_1.svg)

# Notations

> slide 18-20 from Gao Wang's slides

In statistical genetics, **odds** express the probability that an event occurs relative to the probability that it does not occur. They are commonly used to describe a genetic outcome or a disease status in genetic association studies, such as **case-control studies**.

## Odds

The **odds** of an event occurring are calculated as the ratio of the probability of the event happening to the probability of it not happening. Mathematically, this is expressed as:

$$
\text{Odds} = \frac{P(\text{Event})}{1 - P(\text{Event})}
$$

Where:
- $P(\text{Event})$ is the probability of the event occurring.
- $1 - P(\text{Event})$ is the probability of the event **not** occurring.
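
As a quick numeric check, the conversion between probability and odds can be sketched in R. The helper names `prob_to_odds` and `odds_to_prob` are hypothetical, introduced only for this illustration.

```r
# Convert a probability to odds, and odds back to a probability.
# Both helper names are hypothetical and not part of any package.
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

p <- c(0.1, 0.5, 0.8)
odds <- prob_to_odds(p)
odds                # a probability of 0.5 corresponds to odds of 1
odds_to_prob(odds)  # recovers the original probabilities
```

Odds below 1 mean the event is less likely than not, and odds above 1 mean it is more likely than not.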

## Odds Ratio (OR)

In case-control studies, the **odds ratio** (OR) is commonly used to quantify the strength of the association between a genetic variant and a disease or trait. It compares the odds of an event (e.g., disease) in individuals who carry a particular genotype to the odds of the same event in individuals who do not carry it.

Mathematically, the odds ratio can be written as:

$$
OR = \frac{\text{Odds of disease in carriers}}{\text{Odds of disease in non-carriers}}
$$

For a variant with two alleles (e.g., $A$ and $a$, where $A$ is the risk allele), the odds ratio compares the odds of disease in individuals with genotype $AA$ or $Aa$ (carriers of the risk allele) to the odds of disease in individuals with genotype $aa$ (non-carriers).
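
The carrier versus non-carrier comparison can be sketched from a 2x2 table of counts. The counts below are made up purely for illustration and do not come from any study.

```r
# Hypothetical case-control counts (made up for illustration only)
counts <- matrix(c(30, 70,
                   10, 90),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(genotype = c("carrier", "non_carrier"),
                                 status   = c("case", "control")))

# Odds of disease within each genotype group: cases divided by controls
odds_carrier     <- counts["carrier", "case"] / counts["carrier", "control"]
odds_non_carrier <- counts["non_carrier", "case"] / counts["non_carrier", "control"]

# Odds ratio: (30 / 70) / (10 / 90) = 27 / 7
or_2x2 <- odds_carrier / odds_non_carrier
or_2x2
```

An odds ratio above 1, as in this toy table, means the odds of disease are higher among carriers than among non-carriers.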

### Logistic Regression and Odds Ratio

In **logistic regression**, the dependent variable is binary (e.g., presence or absence of a disease), and the model estimates the **log-odds** of the event occurring based on predictor variables (such as genetic variants or other covariates).

Mathematically, the logistic regression model is given by:

$$
\log\left(\frac{P(\text{Event})}{1 - P(\text{Event})}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k
$$

Where:
- $P(\text{Event})$ is the probability of the event occurring (e.g., having the disease).
- $\frac{P(\text{Event})}{1 - P(\text{Event})}$ is the **odds** of the event occurring.
- $\beta_0$ is the intercept.
- $\beta_1, \beta_2, \dots, \beta_k$ are the coefficients corresponding to the predictor variables ($X_1, X_2, \dots, X_k$).
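
The mapping between the linear predictor and the event probability can be sketched with base R's `qlogis()` (the logit) and `plogis()` (its inverse). The coefficient values here are assumptions chosen only for illustration.

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -1.0
beta1 <- 0.5
x <- 0:2                       # e.g., number of copies of an allele

log_odds <- beta0 + beta1 * x  # linear predictor on the log-odds scale
prob <- plogis(log_odds)       # inverse logit: 1 / (1 + exp(-log_odds))
prob

# qlogis() is the logit, so it maps the probabilities back to the log-odds
all.equal(qlogis(prob), log_odds)
```

The log-odds scale is unbounded, while the resulting probabilities always stay strictly between 0 and 1.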

The key point here is that the **exponentiated coefficient** (i.e., $e^{\beta}$) of a predictor variable in the logistic regression model represents the **odds ratio** for a one-unit change in that predictor variable, with the other predictors held fixed.

For example, if a genetic variant is coded as $X$ (with 1 indicating the presence of the risk allele and 0 indicating its absence), the coefficient $\beta$ for $X$ is the difference in the log-odds of disease between individuals with the risk allele and those without it, that is, the log odds ratio. The **odds ratio** associated with this genetic variant is given by:

$$
OR = e^{\beta}
$$

This means that for every one-unit increase in the predictor variable (e.g., presence of the risk allele), the odds of the disease occurring are multiplied by a factor of $e^{\beta}$.
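
To see where this comes from, consider a model with a single binary predictor $X$ (a simplifying assumption for this sketch). Evaluating the model at $X = 1$ and $X = 0$ and subtracting gives:

$$
\begin{aligned}
\log \text{Odds}(X = 1) &= \beta_0 + \beta \\
\log \text{Odds}(X = 0) &= \beta_0 \\
\log(OR) = \log \text{Odds}(X = 1) - \log \text{Odds}(X = 0) &= \beta
\end{aligned}
$$

Exponentiating both sides of the last line recovers $OR = e^{\beta}$. The intercept cancels, which is why the odds ratio does not depend on $\beta_0$.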

## Conclusion

Odds are a key concept in statistical genetics, especially when interpreting the results of genetic association studies. They provide a convenient way to compare the relative likelihood of different outcomes, particularly when dealing with binary traits (e.g., disease vs. no disease). The **odds ratio** is commonly used to assess genetic risk factors and their association with diseases or other traits of interest. In **logistic regression**, the odds ratio is directly related to the regression coefficients and serves as an important measure of association between genetic variants and disease risk.


# Example

In [7]:
rm(list=ls())

# Toy genotype and binary trait values for five individuals
genotype <- c(1, 2, 0, 1, 2)  # Number of minor alleles: 0 = homozygous major, 1 = heterozygous, 2 = homozygous minor
trait <- c(1, 1, 0, 1, 0)     # Trait values: 0 (disease absent) and 1 (disease present) for each individual

# Combine the genotypes and trait values into one data frame
n <- length(genotype)
data <- data.frame(genotype = genotype, trait = trait)
data


| genotype `<dbl>` | trait `<dbl>` |
| --- | --- |
| 1 | 1 |
| 2 | 1 |
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |


In [8]:

# Fit a logistic regression model
logit_model <- glm(trait ~ genotype, data = data, family = "binomial")

# Summarize the model
summary(logit_model)

# Calculate the odds ratio (exp(coef)) for the genotype variable
odds_ratio <- exp(coef(logit_model)[2])
odds_ratio



Call:
glm(formula = trait ~ genotype, family = "binomial", data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.3025     1.7188  -0.176    0.860
genotype      0.6050     1.2597   0.480    0.631

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.7301  on 4  degrees of freedom
Residual deviance: 6.4907  on 3  degrees of freedom
AIC: 10.491

Number of Fisher Scoring iterations: 4
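
Exponentiating the `genotype` estimate above gives $e^{0.6050} \approx 1.83$. Because the genotype is coded as 0, 1, or 2 copies of the minor allele, this is the estimated odds ratio per additional copy. With only five individuals the standard error is large, so the estimate is very imprecise. The sketch below refits the same toy model and adds a Wald interval using `confint.default()`. The choice of a Wald interval is an assumption of this sketch and is not part of the original example.

```r
# Same toy data as in the example above
data <- data.frame(genotype = c(1, 2, 0, 1, 2),
                   trait    = c(1, 1, 0, 1, 0))
logit_model <- glm(trait ~ genotype, data = data, family = "binomial")

# Odds ratio per additional copy of the minor allele
or_hat <- exp(coef(logit_model)["genotype"])

# Wald 95% interval on the log-odds scale, exponentiated to the OR scale
or_ci <- exp(confint.default(logit_model)["genotype", ])

round(c(OR = unname(or_hat), or_ci), 2)

# Fitted disease probability for each genotype class (0, 1, 2 minor alleles)
predict(logit_model, newdata = data.frame(genotype = 0:2), type = "response")
```

The interval spans 1 by a wide margin, consistent with the non-significant p-value reported in the model summary.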
