## Part 1b - Simpson's paradox

Let’s imagine we are in a laboratory and want to test the effectiveness of a new treatment for a disease that affects both humans and dogs. Our goal is to see whether the treatment helps improve the chances of survival.

In the experiment, we treat one dog and four people. After the treatment, the dog and one person survive, while the other three people die. In the group that does not receive treatment, there are four dogs and one person; of these, three dogs recover, but one dog and the person die.

If we look at the results by species, the treatment seems effective. Among the treated dogs, 100% survive, while 75% of the untreated dogs recover. Among the treated humans, 25% survive, while none of the untreated humans survive. In both cases, the treatment appears to increase the chances of recovery.

However, when the data are pooled, the picture reverses. Of the five treated patients (one dog and four humans), only two survive (40%), while three of the five untreated patients (four dogs and one human) recover (60%). Taken alone, the pooled figures give the impression that the treatment actually worsens the chances of survival.
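
The per-species and pooled rates quoted above can be checked with a few lines of R. This is a minimal sketch that uses only the counts from the example; the data frame layout and the column names (`species`, `treated`, `n`, `survived`) are chosen here for illustration.

```r
# Counts taken directly from the dog/human example in the text
trial <- data.frame(
  species  = c("dog", "human", "dog", "human"),
  treated  = c(TRUE, TRUE, FALSE, FALSE),
  n        = c(1, 4, 4, 1),
  survived = c(1, 1, 3, 0),
  stringsAsFactors = FALSE
)

# Survival rate within each species and treatment arm
trial$rate <- trial$survived / trial$n

# Pooled survival rate per treatment arm, ignoring species
pooled <- aggregate(cbind(survived, n) ~ treated, data = trial, FUN = sum)
pooled$rate <- pooled$survived / pooled$n

print(trial)
print(pooled)
```

Within each species the treated rate is the higher one, while in `pooled` the untreated arm comes out ahead.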

The apparent contradiction comes from how the patients are distributed across the two arms, not from the treatment itself. Four of the five treated patients are humans, who survive at a low rate with or without treatment, while four of the five untreated patients are dogs, who survive at a high rate either way. One possible explanation is that the disease is more severe in humans, which makes them more likely to receive treatment; another is that humans simply have greater access to treatment than dogs. In either case, the pooled results reflect this difference in group composition rather than the true effectiveness of the treatment.
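
One way to see the role of group composition is to write each pooled rate as a weighted average of the species-specific rates. The notation is introduced here for illustration: $S$ stands for survival and $T$ for receiving the treatment; all numbers come from the example.

$$
\begin{aligned}
P(S \mid T) &= P(\text{dog} \mid T)\,P(S \mid T, \text{dog}) + P(\text{human} \mid T)\,P(S \mid T, \text{human}) = \tfrac{1}{5}(1.00) + \tfrac{4}{5}(0.25) = 0.40 \\
P(S \mid \neg T) &= P(\text{dog} \mid \neg T)\,P(S \mid \neg T, \text{dog}) + P(\text{human} \mid \neg T)\,P(S \mid \neg T, \text{human}) = \tfrac{4}{5}(0.75) + \tfrac{1}{5}(0.00) = 0.60
\end{aligned}
$$

The treated average puts most of its weight on the low human rate, and the untreated average puts most of its weight on the high dog rate, which is enough to reverse the comparison.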

This example illustrates Simpson’s paradox, in which the combined results of several groups show a trend opposite to the one observed within each group. To interpret the data correctly, it is essential to consider the context and the causal factors that influence both who is treated and how they fare.
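
One common way to take group composition into account is direct standardization: both arms are re-weighted to the same species mix before they are compared. The sketch below is illustrative only. It assumes the overall mix of the experiment (five dogs and five humans) as the reference population, which is a choice made here, and with counts this small the result should not be read as an estimate of the treatment effect.

```r
# Species-specific survival rates from the example
rates <- data.frame(
  species   = c("dog", "human"),
  treated   = c(1.00, 0.25),
  untreated = c(0.75, 0.00),
  stringsAsFactors = FALSE
)

# Assumed reference population: the overall species mix (5 dogs, 5 humans)
weights <- c(dog = 5, human = 5) / 10

# Standardized rates: each arm evaluated on the same species mix
adj_treated   <- sum(weights[rates$species] * rates$treated)
adj_untreated <- sum(weights[rates$species] * rates$untreated)

cat("Standardized survival, treated:  ", adj_treated, "\n")
cat("Standardized survival, untreated:", adj_untreated, "\n")
```

Under this assumed weighting the treated arm has the higher standardized rate, in line with the within-species comparison.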

In [None]:
# Full R version of the Simpson's paradox simulation

# Install ggplot2 if it is missing, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)

# Seed for reproducibility
set.seed(42)

# Simulate the data: n observations per group
n <- 50

# Group A (positive relationship)
x_a <- runif(n, 1, 5)
y_a <- 2 + 1.0 * x_a + rnorm(n, 0, 1)
group_a <- data.frame(x = x_a, y = y_a, group = "A")

# Group B (positive relationship, shifted along the x-axis, lower intercept)
x_b <- runif(n, 4, 8)
y_b <- -4 + 1.0 * x_b + rnorm(n, 0, 1)
group_b <- data.frame(x = x_b, y = y_b, group = "B")

# Combine the data
data <- rbind(group_a, group_b)

# Fit linear models: one per group, and one on the pooled data
model_a <- lm(y ~ x, data = subset(data, group == "A"))
model_b <- lm(y ~ x, data = subset(data, group == "B"))
model_all <- lm(y ~ x, data = data)

# Prepare the regression lines: fitted values from each model on a common x grid
x_vals <- seq(min(data$x), max(data$x), length.out = 100)
pred_grid <- data.frame(x = x_vals)
lines_df <- data.frame(
  x = x_vals,
  y_a = predict(model_a, newdata = pred_grid),
  y_b = predict(model_b, newdata = pred_grid),
  y_all = predict(model_all, newdata = pred_grid)
)

# Plot the points by group, the two within-group fits, and the pooled fit (dashed).
# The line layers read from lines_df, so they do not depend on nrow(data)
# matching length(x_vals).
p <- ggplot(data, aes(x = x, y = y, color = group)) +
  geom_point(alpha = 0.7) +
  geom_line(data = lines_df, aes(x = x, y = y_a), color = "red", inherit.aes = FALSE) +
  geom_line(data = lines_df, aes(x = x, y = y_b), color = "blue", inherit.aes = FALSE) +
  geom_line(data = lines_df, aes(x = x, y = y_all), color = "black", linetype = "dashed", inherit.aes = FALSE) +
  labs(title = "Simpson's paradox: positive within groups, negative overall",
       x = "X (explanatory variable)",
       y = "Y (outcome)") +
  theme_minimal()

print(p)

# Show the slope of each model
cat("Slope, group A:", round(coef(model_a)[2], 3), "\n")
cat("Slope, group B:", round(coef(model_b)[2], 3), "\n")
cat("Slope, pooled:", round(coef(model_all)[2], 3), "\n")
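
As a follow-up to the simulation, the sketch below compares the pooled fit with a fit that includes `group` as a covariate. It regenerates data with the same intercepts, slopes, and x ranges so that it runs on its own. Two things are assumptions made here and not part of the code above: the larger sample size (`n <- 500`, for more stable estimates) and the names `sim`, `pooled_fit`, and `adjusted_fit`.

```r
set.seed(42)
n <- 500  # assumption: larger than the n = 50 used above, for stable estimates

# Same data-generating process: slope 1 in both groups, group B shifted
sim <- data.frame(
  x = c(runif(n, 1, 5), runif(n, 4, 8)),
  group = rep(c("A", "B"), each = n),
  stringsAsFactors = FALSE
)
sim$y <- ifelse(sim$group == "A", 2, -4) + 1.0 * sim$x + rnorm(2 * n, 0, 1)

# Pooled model ignores the grouping; adjusted model gives each group its own intercept
pooled_fit   <- lm(y ~ x, data = sim)
adjusted_fit <- lm(y ~ x + group, data = sim)

cat("Pooled slope:  ", round(coef(pooled_fit)[["x"]], 3), "\n")
cat("Adjusted slope:", round(coef(adjusted_fit)[["x"]], 3), "\n")
```

The pooled slope is negative because group B sits further right and lower, while the adjusted slope should land close to the within-group value of 1.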