# Notebook 1: **Probability** basics exercises

<a target="_blank" href="https://colab.research.google.com/github/DavideScassola/PML2024/blob/main/Notebooks/01_exercises.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

### **Exercise 1**

We would like to estimate the effect of a drug D (D=1: the patient took the drug, D=0: the patient did not) on heart attack H (H=1: had a heart attack, H=0: did not) by looking at an observational dataset that also contains the sex of the patients (S = M/F).

- Control group (D=0):


|        | H = 1 | H = 0 |
|--------|-------|-------|
| Female |   1   |   19  |
| Male   |   12  |   28  |
| Total  |   13  |   47  |

 - Treatment group (D=1):

|        | H = 1 | H = 0 |
|--------|-------|-------|
| Female |   3   |   37  |
| Male   |   8   |   12  |
| Total  |   11  |   49  |

- Among males, what is the difference between the probability of having a heart attack given that the patient took the drug and given that he did not? Is the treatment working?
- What about females?
- What happens if you consider the whole population? 
- How can you explain that?

#### Solution

- P(H=1|D=0, S=M) - P(H=1|D=1, S=M) = 12/(12+28) - 8/(8+12) = -0.1 -> Among males, the treatment increases the risk of heart attack
- P(H=1|D=0, S=F) - P(H=1|D=1, S=F) = 1/(1+19) - 3/(3+37) = -0.025 -> Among females, the treatment also increases the risk of heart attack
- P(H=1|D=0) - P(H=1|D=1) = 13/(13+47) - 11/(11+49) ≈ 0.033 -> In the whole population, the treatment appears to prevent heart attacks
- This is an example of Simpson's paradox: the treatment looks harmful for males and harmful for females, yet beneficial for the population as a whole. The reversal comes from the composition of the groups: 40 of the 60 control patients are male, against only 20 of the 60 treated patients, and males have a higher risk of heart attack in both groups. To resolve the paradox, we have to look at the data-generating process: sex affects both the decision to take the drug and the risk of heart attack, so it has to be handled as a confounder.
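
The same arithmetic can be checked in a few lines of Python. This is only a minimal sketch: the `counts` dictionary and the `risk` helper are names introduced here for illustration, and the numbers are copied from the two tables above. It also computes the share of males in each arm, which is the imbalance behind the reversal.

```python
# Counts of (H=1, H=0) for each (sex, D) group, copied from the tables above.
counts = {
    ("F", 0): (1, 19),
    ("M", 0): (12, 28),
    ("F", 1): (3, 37),
    ("M", 1): (8, 12),
}


def risk(groups):
    """Estimate P(H=1) after pooling the given (sex, D) groups."""
    attacks = sum(counts[g][0] for g in groups)
    total = sum(sum(counts[g]) for g in groups)
    return attacks / total


# Differences P(H=1 | D=0, ...) - P(H=1 | D=1, ...), in the order of the solution.
diff_male = risk([("M", 0)]) - risk([("M", 1)])
diff_female = risk([("F", 0)]) - risk([("F", 1)])
diff_all = risk([("F", 0), ("M", 0)]) - risk([("F", 1), ("M", 1)])

# Share of males in each arm: sex is unevenly distributed between the arms.
male_share = {
    d: sum(counts[("M", d)]) / (sum(counts[("M", d)]) + sum(counts[("F", d)]))
    for d in (0, 1)
}

print(f"males:      {diff_male:+.3f}")
print(f"females:    {diff_female:+.3f}")
print(f"population: {diff_all:+.3f}")
print(f"share of males (control, treated): {male_share[0]:.2f}, {male_share[1]:.2f}")
```

The two stratified differences are negative while the pooled one is positive, and the male share drops from two thirds in the control group to one third in the treatment group.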

### **Exercise 2**

Donated blood is screened for AIDS. Suppose the test has 99% accuracy,
and that one in ten thousand people in your age group is HIV positive. The
test also has a 5% false positive rate. Suppose the test screens you as
positive. What is the probability you have AIDS? Is it 99%? (Hint: 99% refers
to P(test positive|you have AIDS). You want to find P(you have AIDS|test
is positive).)

#### Solution
$A := \text{you have AIDS}$

$T := \text{the test is positive}$

$$P(A|T) = \frac{P(T|A)P(A)}{P(T)}$$

$$P(T|A) = 0.99 \ \text{(the test has 99\% accuracy)}$$




$$P(A) = 0.0001 \ \text{(one in ten thousand people in your age group are HIV positive)}$$ 

$$P(T) = P(T | A)P(A) + P(T | \lnot A)P(\lnot A) \ \text{(by marginalization/partition law)}$$

$$P(T | \lnot A) = 0.05 \ \text{(the test has a 5\% false positive rate)} $$

$$P(T) = P(T | A)P(A) + P(T | \lnot A)P(\lnot A) = 0.99 \cdot 0.0001 + 0.05 \cdot (1-0.0001)$$

Then

$$P(A|T) = \frac{P(T|A)P(A)}{P(T)} = \frac{0.99 \cdot     0.0001}{0.99 \cdot 0.0001 + 0.05 \cdot (1-0.0001)} \approx 0.002$$
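
As a quick numerical check, Bayes' rule can be wrapped in a small function. This is a sketch under the same reading of the exercise as above; the function name `posterior` and the argument names (`prior`, `sensitivity`, `false_positive_rate`) are chosen here for illustration and do not come from the exercise text.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(A | T) using Bayes' rule.

    prior               = P(A)
    sensitivity         = P(T | A)
    false_positive_rate = P(T | not A)
    """
    # P(T) by the partition law
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


p_a_given_t = posterior(prior=1e-4, sensitivity=0.99, false_positive_rate=0.05)
print(f"P(A | T) = {p_a_given_t:.4f}")

# The posterior depends strongly on the prior: try a few base rates.
for prior in (1e-4, 1e-3, 1e-2, 1e-1):
    print(prior, round(posterior(prior, 0.99, 0.05), 4))
```

Trying larger base rates shows that the small posterior is driven by the rarity of the condition, not by the quality of the test.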




### **Exercise 3**

You are given either a fair coin (with probability $0.5$) or an unfair coin having $P(\text{head})=0.8$ (with probability $0.5$). You then toss it two times, with results $H_1$ and $H_2$.

Let's call $C$ the random variable describing whether the coin is fair or not.
1. Is $H_2$ independent of $H_1$?
2. Factorize $p(h_1, h_2 | c)$ (write it as a product of simpler terms)
3. Factorize $p(h_1, h_2, c)$
4. Compute $p(h_1)$
5. Compute $p(c | h_1)$
6. Compute $p(h_2 | h_1)$


#### Solution

1. $H_1$ and $H_2$ are not independent, but they are conditionally independent given $C$
2. $p(h_1, h_2 | c) = p(h_1|c)p(h_2|c)$,
   where $p(h_1|c) = Bern(h_1; c)$ and $p(h_2|c) = Bern(h_2; c)$
3. $p(h_1, h_2, c) = p(c)p(h_1, h_2|c) = p(c)p(h_1|c)p(h_2|c)$,
   where $p(c) = Bern(c; 0.5)$

4.

\begin{align*}
p(h_1) &= \sum_{c \in \{ \text{fair}, \text{unfair}\}} p(h_1, c) = \sum_{c \in \{ \text{fair}, \text{unfair}\}} p(h_1 | c)p(c) \\
&= P(C \text{ is fair})p(h_1 | C \text{ is fair}) + P(C \text{ is unfair})p(h_1 | C \text{ is unfair}) \\
&= \frac{1}{2}p(h_1 | C \text{ is fair}) + \frac{1}{2}p(h_1 | C \text{ is unfair})
\end{align*}

\begin{align*}
P(H_1 \text{ is head}) &= \frac{1}{2}P(H_1 \text{ is head} | C \text{ is fair}) + \frac{1}{2}P(H_1 \text{ is head} | C \text{ is unfair}) \\
&= 0.5 \cdot (0.5 + 0.8) = 0.65
\end{align*}

then $p(h_1) = Bern(h_1; 0.65)$

5.

\begin{align*}
P(C \text{ is fair} | H_1 \text{ is head} ) = \frac{P(H_1 \text{ is head} | C \text{ is fair})P(C \text{ is fair})}{P(H_1 \text{ is head})} = \frac{0.5 \cdot 0.5}{0.65} \approx 0.38
\end{align*}

\begin{align*}
P(C \text{ is fair} | H_1 \text{ is tail} ) &= \frac{0.5 \cdot 0.5}{0.35} \approx 0.71
\end{align*}

then 
\begin{align*}
p(c | h_1) = Bern(c; \sim 0.38) \text{ if } h_1 \text{=head, else } Bern(c; \sim 0.71)
\end{align*}

6.

\begin{align*}
p(h_2 | h_1) &=  \sum_{c \in \{ \text{fair}, \text{unfair}\}} p(h_2, c | h_1) = \sum_{c \in \{ \text{fair}, \text{unfair}\}} p(h_2 | c, h_1)p(c|h_1) = \sum_{c \in \{ \text{fair}, \text{unfair}\}} p(h_2 | c) p(c|h_1)
\end{align*}

where the last step uses the conditional independence of $H_1$ and $H_2$ given $C$.

\begin{align*}
P(H_2 \text{ is head} | H_1 \text{ is head}) &= \sum_{c \in \{ \text{fair, unfair}\}} P(H_2 \text{ is head}| c) p(c|H_1 \text{ is head}) \\
&= P(H_2 \text{ is head}| C \text{ is fair} ) P(C \text{ is fair}|H_1 \text{ is head}) + P(H_2 \text{ is head}| C \text{ is unfair}) P(C \text{ is unfair}|H_1 \text{ is head}) \\
&= 0.5 \cdot 0.385 + 0.8 \cdot (1-0.385) \approx 0.68
\end{align*}


\begin{align*}
P(H_2 \text{ is head} | H_1 \text{ is tail}) &= 0.5 \cdot 0.71 + 0.8 \cdot (1-0.71) \approx 0.59
\end{align*}

then
\begin{align*}
p(h_2 | h_1) &= Bern(h_2;0.68)  \text{ if } h_1 = \text{head} \\
             &= Bern(h_2;0.59)  \text{ if } h_1 = \text{tail}
\end{align*}
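
The whole solution can also be verified by brute force, enumerating the joint $p(h_1, h_2, c)$ from the factorization in point 3 and summing the relevant entries. The sketch below uses only the standard library; the names `p_coin`, `p_head`, `p_toss`, `joint` and `prob` are introduced here for illustration. Since it never rounds the intermediate posterior, its last digits can differ slightly from a hand computation.

```python
from itertools import product

p_coin = {"fair": 0.5, "unfair": 0.5}  # p(c)
p_head = {"fair": 0.5, "unfair": 0.8}  # P(head | c)


def p_toss(h, c):
    """p(h | c): Bernoulli with head probability p_head[c]."""
    return p_head[c] if h == "head" else 1 - p_head[c]


# Joint p(h1, h2, c) = p(c) p(h1 | c) p(h2 | c)
sides = ("head", "tail")
joint = {
    (h1, h2, c): p_coin[c] * p_toss(h1, c) * p_toss(h2, c)
    for h1, h2, c in product(sides, sides, p_coin)
}


def prob(**fixed):
    """Marginal probability of the fixed values, e.g. prob(h1="head", c="fair")."""
    return sum(
        p
        for (h1, h2, c), p in joint.items()
        if all({"h1": h1, "h2": h2, "c": c}[k] == v for k, v in fixed.items())
    )


p_h1_head = prob(h1="head")                                        # point 4
p_fair_given_head = prob(c="fair", h1="head") / p_h1_head          # point 5
p_fair_given_tail = prob(c="fair", h1="tail") / prob(h1="tail")
p_h2_head_given_head = prob(h1="head", h2="head") / p_h1_head      # point 6
p_h2_head_given_tail = prob(h1="tail", h2="head") / prob(h1="tail")

# Point 1: H1 and H2 are dependent, since P(H2=head | H1=head) != P(H2=head).
print(prob(h2="head"), p_h2_head_given_head)
```

Comparing the two printed numbers makes the dependence in point 1 concrete: observing a head on the first toss raises the probability of a head on the second.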



### **Exercise 4**

Given $p(x,y)=$

|   | Y=0 | Y=1 |
|---|-----|-----|
| X=0 | 0.2  | 0.1  |
| X=1 | 0.15 | 0.0  |
| X=2 | 0.25  | 0.3 |



Calculate
- $p(y)$
- $p(x)$
- $p(x|y)$
- $p(y|x)$
- $\mathbb{E}[x]$
- $\mathbb{E}[y]$
- $\mathbb{E}[x|y]$
- $\text{cov}[x,y]$

You can do it by hand on a piece of paper, but I also suggest doing it with the `numpy` library: each of these computations takes a single line of code.

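```python
import numpy as np

# Joint probability table p(x, y): rows are X=0,1,2, columns are Y=0,1
p = np.array([[0.2, 0.1],
              [0.15, 0.0],
              [0.25, 0.3]])

# Values taken by X and Y
x = np.array([0, 1, 2])
y = np.array([0, 1])
```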
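
As a hint for the first few items, here is one possible way to get the marginals and one conditional with array operations. It is a sketch, not the only approach: it repeats the arrays so that it runs on its own, and the variable names `p_x`, `p_y`, `p_x_given_y` and `E_x` are chosen here for illustration. The remaining quantities follow the same pattern.

```python
import numpy as np

# Joint table p(x, y): rows index X=0,1,2 and columns index Y=0,1.
p = np.array([[0.2, 0.1],
              [0.15, 0.0],
              [0.25, 0.3]])
x = np.array([0, 1, 2])
y = np.array([0, 1])

p_y = p.sum(axis=0)   # marginalize over x (sum down the rows)
p_x = p.sum(axis=1)   # marginalize over y (sum across the columns)

# p(x | y) = p(x, y) / p(y): broadcasting divides column j by p_y[j],
# so each column of the result is a distribution over x.
p_x_given_y = p / p_y

E_x = x @ p_x         # E[x] = sum_x x p(x)

print(p_y, p_x, E_x)
print(p_x_given_y)
```

A useful sanity check is that every conditional distribution sums to one along the axis of the variable it describes.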

### **Exercise 5**


Suppose we are interested in the relation between an exposure A (has been affected by coronavirus: 0 no, 1 yes) and an outcome Y (has myocarditis: 0 no, 1 yes). We conduct an observational study on a representative population and obtain the following proportions (N.B. this data is made up and does not come from a study):

|A \ Y| 0 | 1 |
|-----|---|---|
| 0 |0.75|0.07|
| 1 |0.15|0.03|

So, for example, 3% of all subjects have been affected by coronavirus and have myocarditis.

- Among the exposed subjects, what is the proportion of individuals that have the outcome?
- Are A and Y independent?

### **Exercise 6**

Given the distribution

$$p(x,y) = \begin{cases} x + y & \text{if } x \in [0,1],\ y \in [0,1] \\ 0 & \text{otherwise} \end{cases}$$

Calculate
- $\mathbb{E}[x|y]$
- $\rho[X,Y]$

This time, I suggest trying the `sympy` library, which can help you with symbolic computations.

```python
from sympy import symbols, integrate, log, sqrt

# Define the symbols
x, y = symbols('x y')

# Define the joint distribution
p_xy = (x + y)

# Example of integral: the normalization constant over the unit square
Z = integrate(p_xy, (x, 0, 1), (y, 0, 1))

print("\nZ:", Z)
```

```
Z: 1
```


### **Exercise 7**

Two alternative definitions of conditional independence were given. Prove that they are equivalent:
$$p(a | b, c)=p(a | c) \Longleftrightarrow p(a, b | c)=p(a|c)p(b|c)$$

### **Exercise 8**

Compute the variance of the following unnormalized distribution (numerically, using `scipy.integrate`):
$$p(x) \propto \sin^2(x)\, e^{-|x|}$$

### **Exercise 9**

You are offered the following game. You toss a coin a first time: if tail appears, you win 1€ and the game ends; if head appears, you win 2€ and you can continue playing. From the second toss on, if tail appears you stop playing; if head appears, the amount you have already won doubles and you can keep playing.

Example:
1) You toss the coin and **head** appears (you are winning 2€)
2) You toss the coin again and **head** appears (you are winning 4€)
3) You toss the coin again and **head** appears (you are winning 8€)
4) You toss the coin again and **tail** appears (the game ends and you won a total amount of 8€)

Let's call $X$ the amount of money you win playing this game.
1. What is the expected amount you win?
2. What is the expected value of $\log_2(X)$?
3. How much would you pay for playing this game?

Hint: $\sum_{i=1}^{\infty} i q^i = \frac{q}{(1-q)^2}$ for $|q| < 1$

#### Solution

Let $H$ be the number of heads obtained before the first tail. Then

$$P(H = h) = 2^{-h-1}$$

1. What is the expected amount you win?

$$f(h) = 2^{h} \text{ (amount won)} $$ 
$$\mathbb{E}_{h \sim p(h)}[f(h)] = \sum_{h=0}^{\infty} p(h)f(h) = \sum_{h=0}^{\infty} 2^{-h-1} 2^{h} = \frac{1}{2}\sum_{h=0}^{\infty} 1 = \infty$$

2. What is the expected value of $\log_2(X)$?

$$\mathbb{E}_{h \sim p(h)}[\log_2{f(h)}] = \sum_{h=0}^{\infty} h2^{-h-1} = \frac{1}{2}\sum_{h=1}^{\infty} h2^{-h} = \frac{1}{2}\frac{\frac{1}{2}}{(1-\frac{1}{2})^2} = 1$$

3. The answer is subjective, as in any gambling scenario. The choice is hard here: although the expected value is infinite, winning a large amount is still improbable. So the expected value is not always a good statistic to describe a random variable, or to base decisions on.
Anyway, such a game is impossible in a real scenario, since there is a practical limit to the amount one can actually win.

For more information check [St. Petersburg paradox](https://www.wikiwand.com/en/St._Petersburg_paradox) and [Geometric distribution](https://www.wikiwand.com/en/Geometric_distribution).
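
The two expectations behave very differently when the sums are truncated, and a short simulation shows what typical outcomes look like. This is a sketch using only the standard library; `partial_expectations` and `play` are helper names introduced here, and the simulation seed and sample size are arbitrary choices.

```python
import random


def partial_expectations(n_terms):
    """Partial sums of E[X] and E[log2 X] over h = 0 .. n_terms - 1,
    with P(H=h) = 2**(-h-1) and X = 2**h."""
    e_x = sum(2.0 ** (-h - 1) * 2.0 ** h for h in range(n_terms))
    e_log = sum(2.0 ** (-h - 1) * h for h in range(n_terms))
    return e_x, e_log


def play(rng):
    """Play the game once: count heads before the first tail, return 2**heads."""
    heads = 0
    while rng.random() < 0.5:  # head with probability 0.5
        heads += 1
    return 2 ** heads


# The partial sum of E[X] grows linearly in the number of terms,
# while the partial sum of E[log2 X] converges to 1.
for n in (10, 20, 40):
    print(n, partial_expectations(n))

# Simulated winnings: the median stays small even though E[X] diverges.
rng = random.Random(0)
wins = sorted(play(rng) for _ in range(100_000))
median_win = wins[len(wins) // 2]
print("median:", median_win, "max:", wins[-1])
```

The gap between the small median and the occasional very large win is what makes the expected value a poor guide to a fair price for this game.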

### **Exercise 10**

It's night and you are looking into the sky waiting to see a falling star. A friend of yours tells you that the waiting time $T$ (hours) is distributed exponentially: $p(t) = 2e^{-2t}$.
1. What is the probability of seeing the first falling star within 1 hour? How much time do you expect to wait?
2. You have not seen anything in one hour. What is the probability of seeing the first falling star in the next hour? (Justify your answer.)
3. Is the waiting time dependent on how much you have already waited? Is the answer the same for any distribution?
4. What is the probability of seeing at least two falling stars in the first hour?
5. Let's say the distribution is instead $p(t) = Uniform(0, 4)$ (for example, there is a known comet expected to show up at a certain point). How do the answers to questions 1 and 2 change?