In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf


## Questions 1 & 2 

In this example, Z = the likelihood of a biking accident, Y = speed, and X = trail difficulty. We assume that X decreases Y causally because people decrease their speed on difficult trails. In addition, Y and X both increase Z causally because fast biking on difficult trails leads to accidents. Difficulty will be on a scale from 0 to 1, speed in miles per hour, and likelihood of an accident also on a scale from 0 to 1. (Based on the numbers, I'd say these trails are quite challenging!)

num = 100 

difficulty = np.random.uniform(0, 1, (num,)) 

speed = np.maximum(np.random.normal(15, 5, (num, )) - difficulty * 10, 0) 

accident = np.minimum(np.maximum(0.03 * speed + 0.4 * difficulty + np.random.normal(0, 0.3, (num,)), 0), 1) 

df = pd.DataFrame({'difficulty': difficulty, 'speed': speed, 'accident': accident}) 

### Question 1

Use X to predict Y many times via regression with different data sets. Use many samples in each prediction. Which is closest to the average coefficient of X if you do the experiment enough times?

In [6]:
# Compare average coefficient for 1000, 10000, and 100000 simulations
sim_counts = [1000, 10000, 100000]
num = 100
results = {}

for n_sim in sim_counts:
    coefs = []
    for _ in range(n_sim):
        difficulty = np.random.uniform(0, 1, (num,))
        speed = np.maximum(np.random.normal(15, 5, (num, )) - difficulty * 10, 0)
        df_sim = pd.DataFrame({'difficulty': difficulty, 'speed': speed})
        model = smf.ols('speed ~ difficulty', data=df_sim).fit()
        coefs.append(model.params['difficulty'])
    avg_coef = np.mean(coefs)
    results[n_sim] = avg_coef
    print(f"Average coefficient for {n_sim} simulations: {avg_coef:.3f}")

# Optionally, show all results together
display(results)

Average coefficient for 1000 simulations: -9.642
Average coefficient for 10000 simulations: -9.694
Average coefficient for 100000 simulations: -9.670


{1000: np.float64(-9.641783560366243),
 10000: np.float64(-9.694185500300403),
 100000: np.float64(-9.67006571450228)}
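The average settles near -9.67 rather than the -10 used to generate `speed`. One likely reason is the floor at zero: on hard trails some speeds would be negative and are clipped up to 0, which flattens the fitted line. The sketch below is a rough check of that idea, not part of the original assignment. It uses NumPy only, a hypothetical helper `ols_slopes`, and smaller illustrative simulation sizes, and it compares the average slope with and without the floor on the same random draws.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
n_sim, num = 2000, 100           # illustrative sizes, smaller than the runs above

difficulty = rng.uniform(0, 1, (n_sim, num))
raw_speed = rng.normal(15, 5, (n_sim, num)) - 10 * difficulty
clipped_speed = np.maximum(raw_speed, 0)  # same floor at 0 as in the notebook


def ols_slopes(x, y):
    """Slope of y ~ x (with intercept) for each row, via cov(x, y) / var(x)."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / (xc ** 2).sum(axis=1)


slope_raw = ols_slopes(difficulty, raw_speed).mean()
slope_clipped = ols_slopes(difficulty, clipped_speed).mean()

print(f"average slope without the floor: {slope_raw:.3f}")
print(f"average slope with the floor:    {slope_clipped:.3f}")
```

If the floor is the cause, the first number should sit close to -10 and the second should be noticeably less negative.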

### Question 2

Then use X and Z to predict Y many times via regression with different datasets. Which of these is closest to the average coefficient of X?

In [7]:
# Simulate regression of speed on both difficulty (X) and accident (Z), average coefficient of X
sim_counts = [1000, 10000, 100000]
num = 100
results_xz = {}

for n_sim in sim_counts:
    coefs = []
    for _ in range(n_sim):
        difficulty = np.random.uniform(0, 1, (num,))
        speed = np.maximum(np.random.normal(15, 5, (num, )) - difficulty * 10, 0)
        accident = np.minimum(np.maximum(0.03 * speed + 0.4 * difficulty + np.random.normal(0, 0.3, (num,)), 0), 1)
        df_sim = pd.DataFrame({'difficulty': difficulty, 'speed': speed, 'accident': accident})
        model = smf.ols('speed ~ difficulty + accident', data=df_sim).fit()
        coefs.append(model.params['difficulty'])
    avg_coef = np.mean(coefs)
    results_xz[n_sim] = avg_coef
    print(f"Average coefficient of difficulty (X) for {n_sim} simulations (predicting speed with X and Z): {avg_coef:.3f}")

display(results_xz)

Average coefficient of difficulty (X) for 1000 simulations (predicting speed with X and Z): -10.262
Average coefficient of difficulty (X) for 10000 simulations (predicting speed with X and Z): -10.323
Average coefficient of difficulty (X) for 100000 simulations (predicting speed with X and Z): -10.326


{1000: np.float64(-10.262074587810785),
 10000: np.float64(-10.323436362870163),
 100000: np.float64(-10.326446892597016)}
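Accident (Z) is a collider in this setup, since both difficulty (X) and speed (Y) point into it. Controlling for a collider opens the path X → Z ← Y, which is a plausible explanation for the coefficient of X moving further from zero once Z is added. The sketch below illustrates the shift with a single large sample instead of many small ones (a simplification), using NumPy's least-squares solver and a hypothetical helper `ols` in place of statsmodels.

```python
import numpy as np

rng = np.random.default_rng(1)   # seed chosen only for reproducibility
n = 200_000                      # one large sample (assumption: stands in for many small ones)

# Same data-generating process as the notebook
difficulty = rng.uniform(0, 1, n)
speed = np.maximum(rng.normal(15, 5, n) - 10 * difficulty, 0)
accident = np.clip(0.03 * speed + 0.4 * difficulty + rng.normal(0, 0.3, n), 0, 1)


def ols(y, *predictors):
    """Least-squares coefficients [intercept, b1, b2, ...] of y on the predictors."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


coef_x = ols(speed, difficulty)[1]                      # speed ~ difficulty
coef_x_given_z = ols(speed, difficulty, accident)[1]    # speed ~ difficulty + accident

print(f"coefficient of X alone:         {coef_x:.3f}")
print(f"coefficient of X controlling Z: {coef_x_given_z:.3f}")
```

The first regression estimates the effect of difficulty on speed; the second is distorted by conditioning on a variable that speed itself causes.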

## Reflection Questions

1. Draw a diagram for the following negative feedback loop:

    Sweating causes body temperature to decrease.  High body temperature causes sweating.

    A negative feedback loop means that one thing increases another while the second thing decreases the first.

    Remember that we are using directed acyclic graphs where two things cannot directly cause each other.

    ### DAG for Sweating and Body Temperature

   
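    Body Temperature (time t) ──► Sweating (time t)   (positive effect)
    Sweating (time t) ──► Body Temperature (time t+1)   (negative effect)

    Indexing the nodes by time keeps the graph acyclic: temperature now causes sweating now, and sweating now lowers temperature at the next time step.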

    When body temp rises, it eventually leads to lowering itself through sweating.
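As a rough illustration of the loop unrolled over time, the sketch below simulates a body that starts too warm and cools through sweating. The set point, gain, and cooling rate are hypothetical values chosen only to make the damping visible.

```python
set_point = 37.0   # hypothetical normal body temperature (°C)
gain = 1.0         # hypothetical: sweat produced per degree above the set point
cooling = 0.3      # hypothetical: degrees lost per unit of sweat

temps = [39.0]     # start two degrees too warm
sweat = []

for t in range(20):
    # Temperature at time t causes sweating at time t (positive effect)
    s = gain * max(temps[-1] - set_point, 0.0)
    sweat.append(s)
    # Sweating at time t lowers temperature at time t+1 (negative effect)
    temps.append(temps[-1] - cooling * s)

print([round(x, 2) for x in temps[:5]])
print(round(temps[-1], 3))
```

Each step removes a fixed fraction of the excess temperature, so the series moves back toward the set point instead of running away.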


2. Describe an example of a positive feedback loop.  This means that one thing increases another while the second thing also increases the first.

   ### Eating ice cream and craving
   - Eating ice cream → increases → dopamine / pleasure in brain
   - Increased pleasure → increases → craving for more ice cream
   - Craving for more ice cream → leads to → more eating ice cream

   Eating ice cream ───▶ Pleasure / craving ───▶ More eating ice cream

   Why is this positive feedback?
   - More ice cream → more craving
   - More craving → more ice cream

   It amplifies rather than stabilizes (unlike negative feedback loops, which dampen change).




3. Draw a diagram for the following situation:

    Lightning storms frighten away deer and bears, decreasing their population, and cause flowers to grow, increasing their population.
    Bears eat deer, decreasing their population.
    Deer eat flowers, decreasing their population.

   #### DAG for Lightning Storms, Deer, Bears, and Flowers

```markdown
Lightning ──(−)──► Deer
Lightning ──(−)──► Bears
Lightning ──(+)──► Flowers
Bears     ──(−)──► Deer
Deer      ──(−)──► Flowers
```


Write a dataset that simulates this situation.  (Show the code.) Include noise / randomness in all cases.

In [3]:
# Set random seed for reproducibility
np.random.seed(42)

# Parameters
n_steps = 100
initial_deer = 100
initial_bears = 30
initial_flowers = 200

# Initialize arrays
deer = [initial_deer]
bears = [initial_bears]
flowers = [initial_flowers]
lightning = np.random.binomial(1, 0.2, n_steps)  # 20% chance of lightning each step

for t in range(1, n_steps):
    prev_deer = deer[-1]
    prev_bears = bears[-1]
    prev_flowers = flowers[-1]
    light = lightning[t]

    # Effects with noise
    deer_change = (
        -5 * light                     # lightning scares deer
        - 0.3 * prev_bears             # bears eat deer
        + np.random.normal(0, 2)      # noise
    )
    bears_change = (
        -2 * light                     # lightning scares bears
        + np.random.normal(0, 1)      # noise
    )
    flowers_change = (
        +10 * light                    # lightning grows flowers
        - 0.2 * prev_deer             # deer eat flowers
        + np.random.normal(0, 5)      # noise
    )

    # Update populations (keep ≥ 0)
    new_deer = max(prev_deer + deer_change, 0)
    new_bears = max(prev_bears + bears_change, 0)
    new_flowers = max(prev_flowers + flowers_change, 0)

    deer.append(new_deer)
    bears.append(new_bears)
    flowers.append(new_flowers)

# Create DataFrame
df = pd.DataFrame({
    'time': np.arange(n_steps),
    'lightning': lightning,
    'deer': deer,
    'bears': bears,
    'flowers': flowers
})

print(df.head())


   time  lightning        deer      bears     flowers
0     0          0  100.000000  30.000000  200.000000
1     1          1   86.174094  27.700993  190.458804
2     2          0   73.888659  27.481321  175.009548
3     3          0   68.600050  26.963051  156.189348
4     4          0   59.507621  27.878453  144.113094


Identify a backdoor path with one or more confounders for the relationship between deer and flowers.

### Backdoor Paths 

Deer ← Lightning → Flowers

Confounder: Lightning

Lightning affects both Deer and Flowers, so it opens a non-causal association between them.

Deer ← Bears ← Lightning → Flowers

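Confounder: Lightning (its effect on Deer here runs through Bears)

Lightning reduces Bears, which reduces predation on Deer, while Lightning also increases Flowers directly. This path enters Deer through an incoming arrow, so it is a backdoor path rather than part of the causal Deer → Flowers effect. Both backdoor paths pass through Lightning, so adjusting for Lightning blocks them.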
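The effect of adjusting for the confounder can be sketched with a cross-sectional version of the same diagram. This is not the time-series simulation above: every coefficient below is a hypothetical value, with the true Deer → Flowers effect set to -0.5 so that the two estimates can be compared against it. The helper `ols` is also hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Hypothetical structural equations that follow the arrows in the diagram
lightning = rng.binomial(1, 0.2, n)
bears = 30 - 5 * lightning + rng.normal(0, 2, n)
deer = 100 - 10 * lightning - 1.0 * bears + rng.normal(0, 5, n)
flowers = 200 + 20 * lightning - 0.5 * deer + rng.normal(0, 5, n)  # true effect of deer: -0.5


def ols(y, *predictors):
    """Least-squares coefficients [intercept, b1, b2, ...] of y on the predictors."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


naive = ols(flowers, deer)[1]                 # backdoor paths left open
adjusted = ols(flowers, deer, lightning)[1]   # Lightning blocks both backdoor paths

print(f"flowers ~ deer:             {naive:.3f}")
print(f"flowers ~ deer + lightning: {adjusted:.3f}")
```

Under these assumptions the naive coefficient overstates how much deer reduce flowers, while the adjusted one lands close to -0.5. Bears do not need to be added, because the path through Bears is already blocked at Lightning.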


4. Draw a diagram for a situation of your own invention.  The diagram should include at least four nodes, one confounder, and one collider.  Be sure that it is acyclic (no loops).  Which node would you say is most like a treatment (X)?  Which is most like an outcome (Y)?

```markdown
Diet ──► Exercise
Diet ──► Blood Pressure
Exercise ──► Blood Pressure
Exercise ──► Stress ──► Blood Pressure
Exercise ──► Health Markers ◄── Genetics
```


Nodes:

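- Exercise (X, the treatment)
- Blood Pressure (Y, the outcome)
- Diet (confounder: it affects both Exercise and Blood Pressure)
- Stress (mediator between Exercise and Blood Pressure)
- Health Markers (collider: both Exercise and Genetics point into it)
- Genetics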

Causal relationships:
- Exercise → Blood Pressure (we want to estimate this effect)
- Exercise → Stress (exercise reduces stress)
- Diet → Exercise (healthier people tend to exercise more)
- Diet → Blood Pressure (better diet lowers blood pressure)
- Stress → Blood Pressure (stress increases blood pressure)
- Exercise → Health Markers ← Genetics
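The edge list above can be checked mechanically. The sketch below uses Python's standard `graphlib` module to confirm the graph is acyclic, then applies two simplified rules that are assumptions of this example: a confounder is a direct parent of both the treatment and the outcome, and a collider is any node other than the outcome with two or more parents.

```python
from graphlib import TopologicalSorter, CycleError

# Edges copied from the list above, written as (cause, effect)
edges = [
    ("Exercise", "Blood Pressure"),
    ("Exercise", "Stress"),
    ("Diet", "Exercise"),
    ("Diet", "Blood Pressure"),
    ("Stress", "Blood Pressure"),
    ("Exercise", "Health Markers"),
    ("Genetics", "Health Markers"),
]
treatment, outcome = "Exercise", "Blood Pressure"

# Map each node to the set of its direct causes (parents)
parents = {}
for cause, effect in edges:
    parents.setdefault(effect, set()).add(cause)
    parents.setdefault(cause, set())

# A topological order exists only if the graph has no cycles
try:
    order = list(TopologicalSorter(parents).static_order())
    is_acyclic = True
except CycleError:
    order, is_acyclic = [], False

# Simplified rules (assumptions of this sketch, see the paragraph above)
confounders = parents[treatment] & parents[outcome]
colliders = {node for node, p in parents.items() if len(p) >= 2 and node != outcome}

print("acyclic:", is_acyclic)
print("confounders:", confounders)
print("colliders:", colliders)
```

With these edges the check reports Diet as the confounder and Health Markers as the collider, which matches the roles in the node list.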