## Question 1: Accident Data Analysis

This question uses a 2x2 contingency table to evaluate the relationship between safety-equipment use and injury severity. The data are cross-sectional (compiled from records), so the proportions describe prevalence rather than incidence; measures such as relative risk (RR) and odds ratio (OR) can still be used to assess association.

#### Step 1: Understanding the Data and Table Setup
- **Rows (Exposure)**: Safety equipment – "None" (no seat belt) vs. "Seat belt".
- **Columns (Outcome)**: Injury – "Fatal" (bad outcome) vs. "Non-fatal" (reference).
- Table:

| Safety Equipment | Fatal | Non-fatal | Total |
|------------------|-------|-----------|-------|
| None             | 189   | 10843     | 11032 |
| Seat belt        | 104   | 10933     | 11037 |
| **Total**        | **293** | **21776**   | **22069** |

- Key observation: Fatal injuries are rare (293/22069 ≈ 1.33% prevalence). This rarity is why OR ≈ RR (explained below).
- Assumptions: Observations are independent and the records are treated as a representative sample for this exercise, although compiled records can carry biases such as underreporting. No confounders are given, so all measures are unadjusted.

#### (a) Compute and Interpret Odds Ratio, Relative Risk, and Difference in Proportions (8 marks)
- **Concepts**:
  - **Odds Ratio (OR)**: Ratio of the odds of fatal injury in the exposed (no seat belt) vs. unexposed (seat belt) group. Odds = P(event) / (1 - P(event)). Formula: OR = (a/c) / (b/d) = (a * d) / (b * c), where a=189 (fatal, none), b=104 (fatal, seat belt), c=10843 (non-fatal, none), d=10933 (non-fatal, seat belt). Interpreted as the multiplicative change in odds.
  - **Relative Risk (RR)**: Ratio of probabilities (risks) of fatal injury in exposed vs. unexposed. RR = [a / (a+c)] / [b / (b+d)]. Interprets multiplicative increase in risk.
  - **Risk Difference (RD, or Difference in Proportions)**: Absolute difference in risks. RD = p1 - p2. Interprets additive difference.
  - **Why OR ≈ RR?** For rare events (low prevalence), the denominator (1 - p) ≈ 1 in both groups, so odds ≈ risk and OR ≈ RR. Mathematically: RR = OR * [(1 - p_exposed) / (1 - p_unexposed)], and when both p << 1 the bracketed factor is ≈ 1. Here, p_fatal ≈ 0.013 overall, so the approximation holds (the values are close: 1.83 vs. 1.82).

- **Calculations** (run in Jupyter after the setup cell that defines `table`, `total_no`, and `total_seat`):
  ```python
  # Odds Ratio
  or_val = (table[0,0] * table[1,1]) / (table[0,1] * table[1,0])  # 1.8324

  # Relative Risk
  rr = (table[0,0] / total_no) / (table[1,0] / total_seat)  # 1.8181

  # Risk Difference
  rd = (table[0,0] / total_no) - (table[1,0] / total_seat)  # 0.0077

  print(f"Odds Ratio: {or_val:.4f}")
  print(f"Relative Risk: {rr:.4f}")
  print(f"Risk Difference: {rd:.4f}")
  ```

- **Results and Interpretations**:
  - **Odds Ratio = 1.8324**: The odds of a fatal injury are 1.83 times as high (83% higher) for individuals not using seat belts as for those using them. The OR always lies further from 1 than the RR; the gap is small here because the outcome is rare and widens as the outcome becomes more common.
  - **Relative Risk = 1.8181**: The risk of fatal injury is 1.82 times as high (82% higher) for non-seat belt users. RR is the natural measure for prospective (cohort) data, and it can also be computed here because each row gives the fatal proportion within an exposure group.
  - **Risk Difference = 0.0077**: The proportion of fatal injuries is 0.77 percentage points higher for non-seat belt users. This absolute scale is useful for public health impact (e.g., number needed to treat: 1/RD ≈ 130, i.e., about 130 people would need to wear seat belts to prevent one fatal injury).
  - **Why OR ≈ RR**: Fatal injuries are rare (1.33% overall), so the approximation holds. The small difference (1.832 vs. 1.818) arises because the risks, though small, are not zero; it is negligible for interpretation.
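
A quick numerical check can make the rare-outcome argument concrete. The sketch below is illustrative only: `or_and_rr` is a hypothetical helper, and the second table (600/1000 vs. 400/1000 events) is made-up data chosen to show a common outcome; it is not part of the accident records. It relies on the identity RR = OR * (1 - p_exposed) / (1 - p_unexposed), which follows from the definitions of odds and risk.

```python
def or_and_rr(a, b, c, d):
    """Return (OR, RR, p_exposed, p_unexposed) for a 2x2 table.

    Cell labels follow the text: a = exposed with event, b = unexposed with
    event, c = exposed without event, d = unexposed without event.
    """
    p_exposed = a / (a + c)
    p_unexposed = b / (b + d)
    odds_ratio = (a * d) / (b * c)
    relative_risk = p_exposed / p_unexposed
    return odds_ratio, relative_risk, p_exposed, p_unexposed


# Seat-belt data from the table above (rare outcome).
or_rare, rr_rare, p1, p2 = or_and_rr(189, 104, 10843, 10933)
rr_from_or = or_rare * (1 - p1) / (1 - p2)  # identity linking OR and RR

# Hypothetical common outcome (assumed numbers, for contrast only).
or_common, rr_common, q1, q2 = or_and_rr(600, 400, 400, 600)

print(f"Rare outcome:   OR = {or_rare:.4f}, RR = {rr_rare:.4f}")
print(f"RR recovered from OR: {rr_from_or:.4f}")
print(f"Common outcome: OR = {or_common:.4f}, RR = {rr_common:.4f}")
```

With the rare outcome the two measures nearly coincide, while in the assumed common-outcome table the OR moves well away from the RR.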

#### (b) Construct Confidence Intervals for Risk Difference, Relative Risk, and Odds Ratio (9 marks)
- **Concepts** (95% CIs, using the large-sample normal approximation):
  - **RD CI**: RD ± 1.96 * SE_RD, where SE_RD = √[p1(1-p1)/n1 + p2(1-p2)/n2].
  - **RR CI**: Built on the log scale because the sampling distribution of RR is skewed: log(RR) ± 1.96 * SE_logRR, where SE_logRR = √[(1-p1)/(p1 n1) + (1-p2)/(p2 n2)]. Exponentiate the limits to return to the RR scale.
  - **OR CI**: Same approach as RR: log(OR) ± 1.96 * SE_logOR, where SE_logOR = √[1/a + 1/b + 1/c + 1/d]. Exponentiate the limits.
  - All three CIs exclude the null value (0 for RD, 1 for RR/OR), indicating a statistically significant association at the 5% level. The large samples (over 11,000 per group) support the normal approximation.

- **Calculations** (run in Jupyter after part (a), which defines `rd`, `rr`, and `or_val`):
  ```python
  p1 = table[0,0] / total_no
  p2 = table[1,0] / total_seat

  # RD CI
  se_rd = np.sqrt(p1*(1-p1)/total_no + p2*(1-p2)/total_seat)
  ci_rd = (rd - 1.96*se_rd, rd + 1.96*se_rd)  # (0.0047, 0.0107)

  # RR CI
  se_log_rr = np.sqrt((1-p1)/(p1 * total_no) + (1-p2)/(p2 * total_seat))
  log_rr = np.log(rr)
  ci_log_rr = (log_rr - 1.96*se_log_rr, log_rr + 1.96*se_log_rr)
  ci_rr = (np.exp(ci_log_rr[0]), np.exp(ci_log_rr[1]))  # (1.4333, 2.3063)

  # OR CI
  se_log_or = np.sqrt(1/table[0,0] + 1/table[0,1] + 1/table[1,0] + 1/table[1,1])
  log_or = np.log(or_val)
  ci_log_or = (log_or - 1.96*se_log_or, log_or + 1.96*se_log_or)
  ci_or = (np.exp(ci_log_or[0]), np.exp(ci_log_or[1]))  # (1.4403, 2.3312)

  print(f"Risk Difference: {rd:.4f} (95% CI: {ci_rd[0]:.4f}, {ci_rd[1]:.4f})")
  print(f"Relative Risk: {rr:.4f} (95% CI: {ci_rr[0]:.4f}, {ci_rr[1]:.4f})")
  print(f"Odds Ratio: {or_val:.4f} (95% CI: {ci_or[0]:.4f}, {ci_or[1]:.4f})")
  ```

- **Results**:
  - Risk Difference: 0.0077 (95% CI: 0.0047, 0.0107)
  - Relative Risk: 1.8181 (95% CI: 1.4333, 2.3063)
  - Odds Ratio: 1.8324 (95% CI: 1.4403, 2.3312)
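
The three interval formulas can also be wrapped in one function so the confidence level is a parameter instead of a hard-coded 1.96. This is a minimal sketch using only the standard library; `wald_intervals` is a hypothetical name, and the cell labels follow part (a) (a, b = fatal counts without and with a seat belt; c, d = the matching non-fatal counts). The critical value comes from `statistics.NormalDist`, so at the 95% level the limits should agree with those above to roughly four decimal places.

```python
import math
from statistics import NormalDist


def wald_intervals(a, b, c, d, level=0.95):
    """Large-sample (Wald) intervals for RD, RR and OR from a 2x2 table."""
    n1, n2 = a + c, b + d          # exposed and unexposed group sizes
    p1, p2 = a / n1, b / n2        # risk of the event in each group
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-sided critical value

    rd = p1 - p2
    se_rd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

    # (1-p)/(p*n) simplifies to 1/events - 1/n for each group.
    log_rr = math.log(p1 / p2)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)

    log_or = math.log((a * d) / (b * c))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    return {
        "RD": (rd - z * se_rd, rd + z * se_rd),
        "RR": (math.exp(log_rr - z * se_log_rr), math.exp(log_rr + z * se_log_rr)),
        "OR": (math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or)),
    }


ci95 = wald_intervals(189, 104, 10843, 10933)
ci99 = wald_intervals(189, 104, 10843, 10933, level=0.99)
for name in ("RD", "RR", "OR"):
    lo95, hi95 = ci95[name]
    lo99, hi99 = ci99[name]
    print(f"{name}: 95% ({lo95:.4f}, {hi95:.4f})  99% ({lo99:.4f}, {hi99:.4f})")
```

Raising the level to 99% widens every interval, which is a simple sensitivity check on the significance conclusion.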

#### (c) Conduct a Chi-Square Test for Association (alpha = 0.05) (8 marks)
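- **Concepts**: Tests H0: no association (safety equipment and injury severity are independent) vs. HA: an association exists. Pearson's chi-square statistic = Σ[(O - E)^2 / E], df=1; a p-value < 0.05 rejects H0. Assumption: all expected counts are large (>5, which holds here), so Fisher's exact test is not needed. Note that `stats.chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, so the statistic it reports is slightly smaller than the uncorrected Σ[(O - E)^2 / E].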

- **Calculations** (run in Jupyter; uses `stats` and `table` from the setup cell):
  ```python
  # correction=True is the default, so Yates' continuity correction is applied to this 2x2 table
  chi2, p, dof, expected = stats.chi2_contingency(table)
  print(f"Chi-square statistic: {chi2:.4f}")
  print(f"p-value: {p:.4e}")
  print(f"Degrees of freedom: {dof}")
  print("Expected frequencies:\n", expected)
  ```

- **Results**:
  - Chi-square statistic: 24.4445 (Yates-corrected; the uncorrected Pearson statistic is 25.0295)
  - p-value: 7.6480e-07 (uncorrected: 5.6459e-07)
  - Degrees of freedom: 1
  - Expected frequencies:
    ```
    [[  146.47 10885.53]
     [  146.53 10890.47]]
    ```
- **Interpretation**: The p-value is far below 0.05, so we reject H0. There is strong evidence of an association between safety equipment use and injury severity. Specifically, not using a seat belt is associated with a higher proportion of fatal injuries (consistent with OR and RR > 1).
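
To see where the statistic comes from, Σ[(O - E)^2 / E] can be computed by hand, with and without Yates' continuity correction. The following is a sketch using only the standard library; `chi_square_2x2` and `p_value_df1` are hypothetical helper names. The p-value shortcut is valid only for df = 1, where a chi-square variable is the square of a standard normal, so P(X > x) = erfc(√(x/2)).

```python
import math


def chi_square_2x2(observed, yates=False):
    """Chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected under H0
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)  # shrink each deviation by 0.5
            stat += diff ** 2 / e
    return stat


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


observed = [[189, 10843], [104, 10933]]
pearson = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)

print(f"Pearson chi2 = {pearson:.4f}, p = {p_value_df1(pearson):.4e}")
print(f"Yates chi2   = {corrected:.4f}, p = {p_value_df1(corrected):.4e}")
```

Both versions lead to the same decision here; the correction matters more when expected counts are small.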

For presentation: organize the answer as a report-style section with tables, bold key results, and brief explanations. Optionally, add a bar plot (this uses `plt`, `labels`, `prop_fatal_no`, and `prop_fatal_seat` from the setup cell):
```python
# Optional visualization
proportions = [prop_fatal_no, prop_fatal_seat]
plt.bar(labels, proportions)
plt.ylabel('Proportion Fatal')
plt.title('Proportion of Fatal Injuries by Safety Equipment')
plt.show()
```

This completes Question 1. Let me know if we proceed to Question 2 now, or if you want refinements (e.g., more derivations or sensitivity checks).

In [1]:
# In Jupyter, start by defining the data:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt  # Optional for visualizations

# Define the contingency table
table = np.array([[189, 10843], [104, 10933]])
labels = ['None', 'Seat belt']

# Row totals (exposure groups)
total_no = table[0].sum()  # 11032
total_seat = table[1].sum()  # 11037

# Proportions for quick check
prop_fatal_no = table[0,0] / total_no  # ~0.0171
prop_fatal_seat = table[1,0] / total_seat  # ~0.0094
print(f"Proportion fatal (no seat belt): {prop_fatal_no:.4f}")
print(f"Proportion fatal (seat belt): {prop_fatal_seat:.4f}")

Proportion fatal (no seat belt): 0.0171
Proportion fatal (seat belt): 0.0094


#### (a) Compute and Interpret Odds Ratio, Relative Risk, and Difference in Proportions (8 marks)

**Calculations**

In [2]:
# Odds Ratio
or_val = (table[0,0] * table[1,1]) / (table[0,1] * table[1,0])  # 1.8324

# Relative Risk
rr = (table[0,0] / total_no) / (table[1,0] / total_seat)  # 1.8181

# Risk Difference
rd = (table[0,0] / total_no) - (table[1,0] / total_seat)  # 0.0077

print(f"Odds Ratio: {or_val:.4f}")
print(f"Relative Risk: {rr:.4f}")
print(f"Risk Difference: {rd:.4f}")

Odds Ratio: 1.8324
Relative Risk: 1.8181
Risk Difference: 0.0077


#### (b) Construct Confidence Intervals for Risk Difference, Relative Risk, and Odds Ratio (9 marks)

**Calculations**

In [5]:
p1 = table[0,0] / total_no
p2 = table[1,0] / total_seat

# RD CI
se_rd = np.sqrt(p1*(1-p1)/total_no + p2*(1-p2)/total_seat)
ci_rd = (rd - 1.96*se_rd, rd + 1.96*se_rd)  # (0.0047, 0.0107)

# RR CI
se_log_rr = np.sqrt((1-p1)/(p1 * total_no) + (1-p2)/(p2 * total_seat))
log_rr = np.log(rr)
ci_log_rr = (log_rr - 1.96*se_log_rr, log_rr + 1.96*se_log_rr)
ci_rr = (np.exp(ci_log_rr[0]), np.exp(ci_log_rr[1]))  # (1.4333, 2.3063)

# OR CI
se_log_or = np.sqrt(1/table[0,0] + 1/table[0,1] + 1/table[1,0] + 1/table[1,1])
log_or = np.log(or_val)
ci_log_or = (log_or - 1.96*se_log_or, log_or + 1.96*se_log_or)
ci_or = (np.exp(ci_log_or[0]), np.exp(ci_log_or[1]))  # (1.4403, 2.3312)

print(f"Risk Difference: {rd:.4f} (95% CI: {ci_rd[0]:.4f}, {ci_rd[1]:.4f})")
print(f"Relative Risk: {rr:.4f} (95% CI: {ci_rr[0]:.4f}, {ci_rr[1]:.4f})")
print(f"Odds Ratio: {or_val:.4f} (95% CI: {ci_or[0]:.4f}, {ci_or[1]:.4f})")

Risk Difference: 0.0077 (95% CI: 0.0047, 0.0107)
Relative Risk: 1.8181 (95% CI: 1.4333, 2.3063)
Odds Ratio: 1.8324 (95% CI: 1.4403, 2.3312)


#### (c) Conduct a Chi-Square Test for Association (alpha = 0.05) (8 marks)

**Calculations**

In [6]:
# correction=True is the default, so this is the Yates-corrected statistic for a 2x2 table
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p:.4e}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies:\n", expected)

Chi-square statistic: 24.4445
p-value: 7.6480e-07
Degrees of freedom: 1
Expected frequencies:
 [[  146.46680865 10885.53319135]
 [  146.53319135 10890.46680865]]


In [4]:
# Run this in a Jupyter cell
import math
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

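# Data (cells labeled row by row here, so b and c are swapped relative to the formulas in part (a))
a = 189    # no belt, fatal
b = 10843  # no belt, non-fatal
c = 104    # seat belt, fatal
d = 10933  # seat belt, non-fatal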

# Totals
n1 = a + b
n2 = c + d
grand = n1 + n2

# Risks
p1 = a / n1
p2 = c / n2
RD = p1 - p2
OR = (a * d) / (b * c)
RR = p1 / p2

# RD CI
var1 = p1 * (1 - p1) / n1
var2 = p2 * (1 - p2) / n2
se_rd = math.sqrt(var1 + var2)
z = 1.96
rd_ci = (RD - z * se_rd, RD + z * se_rd)

# RR CI (log method)
se_log_rr = math.sqrt((1 / a - 1 / n1) + (1 / c - 1 / n2))
ln_rr = math.log(RR)
rr_ci = (math.exp(ln_rr - z * se_log_rr), math.exp(ln_rr + z * se_log_rr))

# OR CI (log method)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ln_or = math.log(OR)
or_ci = (math.exp(ln_or - z * se_log_or), math.exp(ln_or + z * se_log_or))

# Chi-square tests
table = np.array([[a, b],
                  [c, d]])
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(table, correction=True)
fisher_or, fisher_p = fisher_exact(table, alternative='two-sided')

# Print nicely
print("2x2 table:")
print(f"         Fatal   Non-fatal   Row total")
print(f"No belt   {a:6d}   {b:9d}   {n1:9d}")
print(f"Seat belt {c:6d}   {d:9d}   {n2:9d}")
print(f"Grand total = {grand}\n")

print("Proportions / measures:")
print(f"Risk (no belt)   p1 = {p1:.10f}  ({p1*100:.4f}%)")
print(f"Risk (seat belt) p2 = {p2:.10f}  ({p2*100:.4f}%)")
print(f"Risk difference  RD = {RD:.10f}  ({RD*100:.4f}%)  95% CI = ({rd_ci[0]:.10f}, {rd_ci[1]:.10f})\n")

print(f"Relative risk    RR = {RR:.6f}  95% CI = ({rr_ci[0]:.6f}, {rr_ci[1]:.6f})")
print(f"Odds ratio       OR = {OR:.6f}  95% CI = ({or_ci[0]:.6f}, {or_ci[1]:.6f})\n")

print("Chi-square tests:")
print(f"Expected counts (rows x cols):\n{expected}")
print(f"Pearson chi2 = {chi2:.6f}, df = {dof}, p = {p_value:.6e}")
print(f"Yates corrected chi2 = {chi2_yates:.6f}, p = {p_yates:.6e}")
print(f"Fisher exact OR = {fisher_or:.6f}, Fisher p = {fisher_p:.6e}\n")

print('Interpretation:')
print('- OR and RR are ~1.8 (similar because fatality is rare).')
print('- RD is ~0.00771 => ~0.77 percentage points (≈7.71 / 1000).')
print('- Chi-square p << 0.05 -> evidence of association between seat-belt use and fatality.')


2x2 table:
         Fatal   Non-fatal   Row total
No belt      189       10843       11032
Seat belt    104       10933       11037
Grand total = 22069

Proportions / measures:
Risk (no belt)   p1 = 0.0171319797  (1.7132%)
Risk (seat belt) p2 = 0.0094228504  (0.9423%)
Risk difference  RD = 0.0077091293  (0.7709%)  95% CI = (0.0046904515, 0.0107278070)

Relative risk    RR = 1.818131  95% CI = (1.433285, 2.306312)
Odds ratio       OR = 1.832392  95% CI = (1.440301, 2.331220)

Chi-square tests:
Expected counts (rows x cols):
[[  146.46680865 10885.53319135]
 [  146.53319135 10890.46680865]]
Pearson chi2 = 25.029541, df = 1, p = 5.645865e-07
Yates corrected chi2 = 24.444529, p = 7.648039e-07
Fisher exact OR = 1.832392, Fisher p = 5.021416e-07

Interpretation:
- OR and RR are ~1.8 (similar because fatality is rare).
- RD is ~0.00771 => ~0.77 percentage points (≈7.71 / 1000).
- Chi-square p << 0.05 -> evidence of association between seat-belt use and fatality.
