---

# EXERCISE 1: Weather probabilities
You are given a (fake) <a href="https://drive.google.com/file/d/1LjZLE9ozaHcBwiCl90mHaS1nXKcglfr4/view">padua_weather.csv</a>
of historical records for Padua's weather. The weather, which can be either rainy (= 1 in the dataset), misty (= 2), or sunny (= 3), is reported for each day of the week, for a whole year (52 weeks).

After formalising the problem (i.e. identifying the random variables and the necessary mathematical formulae), write a Python program that reads the dataset and computes the following:
- probability of being sunny during the weekend (one or both days);
- expected weather for each day of the week (*);
- suppose you don't know which day of the week it is: although very unrealistic, how could you guess the day based only on the weather?

(\*) An expected value of, for example, 2.5 can be interpreted as "a mix of misty and sunny weather".
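
One possible formalisation is sketched below. It rests on two assumptions the exercise does not state explicitly: the conditional probabilities are estimated as relative frequencies over the 52 weeks, and every day of the week is equally likely a priori. Let $D \in \{1, \dots, 7\}$ be the day of the week and $W \in \{1, 2, 3\}$ the weather; $n_{w,d}$ is a symbol introduced here for the number of weeks in which weather $w$ was recorded on day $d$.

$$
\begin{aligned}
P(W = w \mid D = d) &\approx \frac{n_{w,d}}{52} \\
P(\text{sun on Sat} \cup \text{sun on Sun}) &= P(W = 3 \mid D = 6) + P(W = 3 \mid D = 7) - P(\text{sun on Sat} \cap \text{sun on Sun}) \\
\mathbb{E}[W \mid D = d] &= \sum_{w=1}^{3} w \, P(W = w \mid D = d) \\
P(D = d \mid W = w) &= \frac{P(W = w \mid D = d)\, P(D = d)}{\sum_{d'=1}^{7} P(W = w \mid D = d')\, P(D = d')} = \frac{P(W = w \mid D = d)}{\sum_{d'=1}^{7} P(W = w \mid D = d')}
\end{aligned}
$$

The joint term $P(\text{sun on Sat} \cap \text{sun on Sun})$ can be estimated directly from the data; if Saturday and Sunday are treated as independent, it becomes the product $P(W = 3 \mid D = 6)\, P(W = 3 \mid D = 7)$. The last equality in the Bayes step holds because $P(D = d) = 1/7$ cancels.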




In [2]:
from itertools import product
import numpy as np
import pandas as pd

# dependency graph: day of the week -> weather

# Conditional Probability Table (CPT) of the problem, P(W|D)
P_W_d = np.array([[0., 0., 0., 0., 0., 0., 0.],  # P(w=1|d=1) P(w=1|d=2) P(w=1|d=3) P(w=1|d=4) P(w=1|d=5) P(w=1|d=6) P(w=1|d=7)
                  [0., 0., 0., 0., 0., 0., 0.],  # P(w=2|d=1) P(w=2|d=2) P(w=2|d=3) P(w=2|d=4) P(w=2|d=5) P(w=2|d=6) P(w=2|d=7)
                  [0., 0., 0., 0., 0., 0., 0.]]) # P(w=3|d=1) P(w=3|d=2) P(w=3|d=3) P(w=3|d=4) P(w=3|d=5) P(w=3|d=6) P(w=3|d=7)

# read the data from the CSV file at https://drive.google.com/file/d/1ln-HVEchVF31S86FRgl2B5V6KQA3JiVJ/view
file_id = "1ln-HVEchVF31S86FRgl2B5V6KQA3JiVJ"
url = f"https://drive.google.com/uc?id={file_id}&export=download"
df = pd.read_csv(url)

# values taken by the random variables
day = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
weather = (1, 2, 3)

# compute the CPT: P(w|d) is the fraction of weeks in which day d had weather w
for d, w in product(day, weather):
    P_W_d[weather.index(w), day.index(d)] = (df[d] == w).mean()

# probability of sun during the weekend (Saturday, Sunday, or both), treating the two days as independent:
# P(w=3|d=6) + P(w=3|d=7) - P(w=3|d=6) * P(w=3|d=7)
p_sat = P_W_d[weather.index(3), day.index("Saturday")]
p_sun = P_W_d[weather.index(3), day.index("Sunday")]
print("probability of sun during the weekend: ", p_sat + p_sun - p_sat * p_sun, "\n")

# expected weather for each day of the week: 1*P(w=1|d) + 2*P(w=2|d) + 3*P(w=3|d)
for d in day:
    print("expected weather on", d , ":", P_W_d[weather.index(1), day.index(d)] + 2*P_W_d[weather.index(2), day.index(d)] + 3*P_W_d[weather.index(3), day.index(d)])
print()

# probability of being on a specific day given the weather, P(d|w)
# 1. define the inverse conditional probability table P(d|w):
P_D_w = np.zeros((len(day), len(weather)))

# 2. compute P(d|w) = P(w|d) * P(d) / P(w)
#                   = P(w|d) * P(d) / sum over d' of [P(w|d') * P(d')]
#                   = P(w|d) / sum over d' of P(w|d')    (P(d) = 1/7 is uniform and cancels)
for d, w in product(day, weather):
    P_D_w[day.index(d), weather.index(w)] = P_W_d[weather.index(w), day.index(d)] / sum(P_W_d[weather.index(w)])

# 3. find the most likely day given that it is raining
print("the most likely day given that it is raining is", day[np.argmax(P_D_w[:,weather.index(1)])])

probability of sun during the weekend:  0.4378698224852071 

expected weather on Monday : 2.076923076923077
expected weather on Tuesday : 1.9807692307692306
expected weather on Wednesday : 2.0384615384615383
expected weather on Thursday : 1.9423076923076923
expected weather on Friday : 1.9615384615384617
expected weather on Saturday : 1.8461538461538463
expected weather on Sunday : 1.75

the most likely day given that it is raining is Sunday
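
The weekend probability above combines the Saturday and Sunday marginals, which is exact only if the two days are independent. As a minimal sketch of an alternative, the joint event can be counted directly, week by week. The `toy` DataFrame below is hypothetical data invented for illustration (it is not `padua_weather.csv`), and it assumes the same layout as the solution: one row per week, one column per day.

```python
import numpy as np
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weather = [1, 2, 3]  # 1 = rainy, 2 = misty, 3 = sunny

# Hypothetical toy data: 4 weeks, one row per week (NOT the real dataset)
toy = pd.DataFrame(
    [[1, 2, 3, 1, 2, 3, 3],
     [2, 2, 3, 1, 1, 1, 1],
     [3, 1, 2, 2, 3, 3, 3],
     [1, 3, 1, 2, 2, 1, 1]],
    columns=days,
)

# CPT P(W|D): rows are weather values, columns are days
cpt = np.array([[(toy[d] == w).mean() for d in days] for w in weather])

# Empirical estimate: fraction of weeks with sun on Saturday, Sunday, or both
sunny_weekend_empirical = ((toy["Saturday"] == 3) | (toy["Sunday"] == 3)).mean()

# Estimate under the independence assumption, from the marginals only
p_sat, p_sun = cpt[2, days.index("Saturday")], cpt[2, days.index("Sunday")]
sunny_weekend_independent = p_sat + p_sun - p_sat * p_sun

# Expected weather per day: sum over w of w * P(w|d)
expected = np.array(weather) @ cpt

# Posterior P(D|W) with a uniform prior over days: normalise each weather row
posterior = cpt / cpt.sum(axis=1, keepdims=True)

print(sunny_weekend_empirical, sunny_weekend_independent)
```

In this toy data Saturday and Sunday always share the same weather, so the two estimates disagree; on data where the days really are independent they would be close.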


---

# EXERCISE 2: Broad Street cholera outbreak

The following is a simplified version of an example in Judea Pearl's *The Book of Why*. It refers to a cholera epidemic, caused by contaminated water, which killed hundreds of people in London between 1853 and 1854. The diagram below illustrates some of the key factors explaining this epidemic, in particular:
- $X$ indicates whether the water company's intake was downstream of London's sewers;
- $W$ indicates whether the water was contaminated or not;
- $Z$ indicates the presence of other external factors (e.g. poverty, miasma, etc.);
- $Y$ indicates the outbreak of cholera.

<img src='https://drive.google.com/uc?id=10O10x_nuuxF55rqRk0TpanHV_7Q819MA'>

(please note the probabilities in the diagram are fake)

> - Formalise the problem using appropriate mathematical notation and derive an expression for the probability distribution of cholera given that the water company's intake is upstream (i.e. what is the query? how can it be decomposed?)
> - Write a Python program that computes the actual probabilities of the above distribution using the information from the given CPTs.
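
A possible decomposition of the query is sketched below. It assumes the dependency structure implied by the CPTs used in the solution ($X \to W$, $Z \to W$, $W \to Y$, $Z \to Y$, with $X$ and $Z$ having no parents). Since $X$ is true when the intake is downstream, an upstream intake corresponds to $X = \neg x$. The query is $P(Y \mid \neg x)$; for a generic value $x$ of $X$:

$$
\begin{aligned}
P(Y \mid x) &= \sum_{w, z} P(Y, w, z \mid x) \\
            &= \sum_{w, z} P(Y \mid w, z, x)\, P(w \mid x, z)\, P(z \mid x) \\
            &= \sum_{w, z} P(Y \mid w, z)\, P(w \mid x, z)\, P(z)
\end{aligned}
$$

The last step uses two independencies read from the assumed graph: $Y$ is independent of $X$ given $W$ and $Z$, and $Z$ is independent of $X$.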

In [6]:
# create the CPTs for the variables of the problem

P_X = np.array([0.5, 0.5]) # P(X=f) , P(X=t)
P_Z = np.array([0.75, 0.25]) # P(Z=f), P(Z=t)
P_W_xz = np.array([[[0.98, 0.9], [0.15, 0.1]],  # P(¬w|¬x,¬z), P(¬w|¬x,z), P(¬w|x,¬z), P(¬w|x,z)
                   [[0.02, 0.1], [0.85, 0.9]]]) # P(w|¬x,¬z), P(w|¬x,z), P(w|x,¬z), P(w|x,z)
P_Y_wz = np.array([[[0.95, 0.85], [0.25, 0.2]],  # P(¬y|¬w,¬z), P(¬y|¬w,z), P(¬y|w,¬z), P(¬y|w,z)
                   [[0.05, 0.15], [0.75, 0.8]]]) # P(y|¬w,¬z), P(y|¬w,z), P(y|w,¬z), P(y|w,z)

# compute P(Y|X) = sum over w, z of P(Y|w,z) * P(w|X,z) * P(z)
P_Y_x = np.zeros((2, 2))
for y, w, z, x in product(range(2),range(2), range(2),range(2)):
    P_Y_x[y,x] += P_Y_wz[y,w,z] * P_W_xz[w,x,z] * P_Z[z]

# X = false (index 0) means the intake is upstream, which is the case asked for
print("P(Y|¬x) = ", P_Y_x[:,0])

P(Y|¬x) =  [0.89825 0.10175]
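
As a cross-check, the same marginalisation can be sketched as a single `np.einsum` contraction. The CPT values are copied from the cell above; the name `P_Y_x_vec` is introduced here for illustration only.

```python
import numpy as np

# CPTs copied from the cell above
P_Z = np.array([0.75, 0.25])                        # indexed [z]
P_W_xz = np.array([[[0.98, 0.9], [0.15, 0.1]],
                   [[0.02, 0.1], [0.85, 0.9]]])     # indexed [w, x, z]
P_Y_wz = np.array([[[0.95, 0.85], [0.25, 0.2]],
                   [[0.05, 0.15], [0.75, 0.8]]])    # indexed [y, w, z]

# P(Y|X)[y, x] = sum over w, z of P(y|w,z) * P(w|x,z) * P(z)
P_Y_x_vec = np.einsum("ywz,wxz,z->yx", P_Y_wz, P_W_xz, P_Z)

# Column 0 is X = false (upstream intake); it should match the loop-based result
print("P(Y|¬x) = ", P_Y_x_vec[:, 0])
```

Each column of `P_Y_x_vec` is a distribution over $Y$, so checking that the columns sum to 1 is a quick sanity test on the axis order of the CPTs.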
