# AAI Workshop 2

Below are two examples and one exercise, to be completed by the given deadline (read the text).

---

## EXAMPLE: Probability of disease given symptoms
Let's assume we want to calculate the probability of having COVID-19 based on three possible symptoms: high fever, continuous cough, loss of taste/smell.
We know there is a 1% chance of being infected, and the probabilities of developing each symptom, with and without the infection, are as follows:
- high fever: 10% with COVID, only 2% without
- continuous cough: 20% with COVID, only 5% without
- loss of taste/smell: 15% with COVID, only 0.1% without

> __QUESTION:__ What is the probability of having COVID given the presence of all three symptoms?

### Solution

We have four random variables, which can be either *true* or *false*:
- COVID disease $D$
- fever $F$
- cough $C$
- loss of taste/smell $L$

For simplicity, we will use lower-case letters to indicate single events (e.g. $d$ means $D = true$ and $¬d$ means $D = false$).
The task then consists in computing the probability $P(d | f, c, l)$.

We can write the following probability distributions, given by the problem:

$
P(d) = 0.01 \\
P(¬d) = 0.99
$

or simply

$
{\bf P}(D) = \langle 0.01, 0.99 \rangle
$

Similarly, for the conditional probabilities of the symptoms:

$
{\bf P}(F | d) = \langle 0.1, 0.9 \rangle \\
{\bf P}(F | ¬d) = \langle 0.02, 0.98 \rangle \\
~\\
{\bf P}(C | d) = \langle 0.2, 0.8 \rangle \\
{\bf P}(C | ¬d) = \langle 0.05, 0.95 \rangle \\
~\\
{\bf P}(L | d) = \langle 0.15, 0.85 \rangle \\
{\bf P}(L | ¬d) = \langle 0.001, 0.999 \rangle.
$
<br><br>
Let's put these in some Python arrays:

In [2]:
import numpy as np

p_disease = np.array([0.01, 0.99])
print('P(D) = ', p_disease)

p_fever_disease = np.array([[0.1, 0.9], [0.02, 0.98]])
print('\nP(F|D) =\n', p_fever_disease)

p_cough_disease = np.array([[0.2, 0.8], [0.05, 0.95]])
print('\nP(C|D) =\n', p_cough_disease)

p_loss_disease = np.array([[0.15, 0.85], [0.001, 0.999]])
print('\nP(L|D) =\n', p_loss_disease)

P(D) =  [0.01 0.99]

P(F|D) =
 [[0.1  0.9 ]
 [0.02 0.98]]

P(C|D) =
 [[0.2  0.8 ]
 [0.05 0.95]]

P(L|D) =
 [[0.15  0.85 ]
 [0.001 0.999]]


<br><br>
Applying Bayes rule, we can write

$$P(d | f, c, l) = \dfrac{P(f, c, l | d) P(d)}{P(f, c, l)}.$$

In the numerator, we can exploit the fact that the three symptoms are conditionally independent given the disease (i.e. if I know I have COVID, my chances of having a fever do not change if I also have a cough or loss of taste/smell). Therefore

$$P(f, c, l | d) P(d) = P(f|d) P(c|d) P(l|d) P(d).$$

We can also apply the law of total probability to the denominator and write

$$
\begin{align}
P(f, c, l) &= \sum_{x \in D} P(f, c, l | x) P(x) \nonumber\\
 &= P(f, c, l | d) P(d) + P(f, c, l | ¬d) P(¬d) \nonumber\\
 &= P(f|d) P(c|d) P(l|d) P(d) + P(f|¬d) P(c|¬d) P(l|¬d) P(¬d).\nonumber
\end{align}
$$

By substituting the above expressions for the numerator and denominator, we can write the answer to the original question:

$$P(d | f, c, l) = \dfrac{P(f | d) P(c | d) P(l | d) P(d)}{P(f|d) P(c|d) P(l|d) P(d) + P(f|¬d) P(c|¬d) P(l|¬d) P(¬d)}.$$

<br><br>
Let's implement this in Python to compute the actual probability value, starting from the numerator:

In [3]:
numerator = p_fever_disease[0,0] * p_cough_disease[0,0] * p_loss_disease[0,0] * p_disease[0]
print('P(f,c,l|d) P(d) =\n', numerator)

P(f,c,l|d) P(d) =
 3.0000000000000004e-05


The denominator is the sum of the joint probabilities of the three symptoms with and without the disease:

In [4]:
denominator = p_fever_disease[0,0] * p_cough_disease[0,0] * p_loss_disease[0,0] * p_disease[0] +\
    p_fever_disease[1,0] * p_cough_disease[1,0] * p_loss_disease[1,0] * p_disease[1]
print('P(f,c,l|d)P(d) + P(f,c,l|¬d)P(¬d) =\n', denominator)

P(f,c,l|d)P(d) + P(f,c,l|¬d)P(¬d) =
 3.0990000000000007e-05


Finally, the probability of the disease given that all the symptoms are present is:

In [5]:
result = numerator / denominator
print('P(d|f,c,l) =\n', result)

P(d|f,c,l) =
 0.9680542110358179


So the actual probability of having COVID, given the presence of all three symptoms, is almost **97%**!
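The same computation can also be written in vectorised form, which yields the whole posterior distribution ${\bf P}(D | f, c, l)$ at once. This is only a minimal sketch that reuses the arrays defined earlier (repeated here so the block runs on its own); the names `unnormalised` and `posterior` are introduced for illustration.

```python
import numpy as np

# Prior P(D) and likelihoods from the example: row 0 is d, row 1 is ¬d,
# and column 0 is "symptom present".
p_disease = np.array([0.01, 0.99])
p_fever_disease = np.array([[0.1, 0.9], [0.02, 0.98]])
p_cough_disease = np.array([[0.2, 0.8], [0.05, 0.95]])
p_loss_disease = np.array([[0.15, 0.85], [0.001, 0.999]])

# Unnormalised posterior for both values of D at once:
# P(f|D) P(c|D) P(l|D) P(D), evaluated element-wise over [d, ¬d].
unnormalised = (p_fever_disease[:, 0] * p_cough_disease[:, 0]
                * p_loss_disease[:, 0] * p_disease)

# Dividing by the sum is the law of total probability used for the denominator.
posterior = unnormalised / unnormalised.sum()
print('P(D|f,c,l) =', posterior)
```

The first element of `posterior` should agree with the value of about 0.968 computed above, and the two elements sum to 1.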

**NOTE**: You can find other examples and a more elegant way to represent probability distributions in Python in the file [aima_ch13.zip](aima_ch13.zip), which is extracted and adapted from Russell and Norvig's book.

---

## EXAMPLE: Probability from data

Let's see in practice how you could extract some probabilities from data.

Consider the following table, listing 10 students and their final grades (either A, B, or C) obtained in Year 1, Year 2 and Year 3:

| Student | Grade Y1 | Grade Y2 | Grade Y3 |
| --- | --- | --- | --- |
| John | A | A | B |
| Sarah | C | C | B |
| Eric | A | B | B |
| Paul | B | C | A |
| Susanne | A | A | A |
| Beth | B | A | B |
| Jack | B | C | B |
| Rachel | B | A | A |
| Tom | B | C | C |
| Jenny | B | A | B |

Assuming three random variables, $G_1$, $G_2$, and $G_3$, to represent the grades at each year, how can we compute some useful probabilities from this table like, for example, $P(G_1 = A)$ or $P(G_1 = B, G_2 = C)$?

Let's count the instances in the table:
- in Year 1 there are three A grades out of ten, therefore $P(G_1 = A) = 3/10 = 0.3$
- we can also see there are only three cases in which $G_1 = B$ and $G_2 = C$, therefore $P(G_1 = B, G_2 = C) = 0.3$.
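The counting above can be reproduced with pandas. The sketch below is one possible way to do it, not the only one; the `grades` DataFrame and the variable names are introduced here for illustration and simply transcribe the table.

```python
import pandas as pd

# The grades table from the example, one row per student.
grades = pd.DataFrame(
    {
        "G1": ["A", "C", "A", "B", "A", "B", "B", "B", "B", "B"],
        "G2": ["A", "C", "B", "C", "A", "A", "C", "A", "C", "A"],
        "G3": ["B", "B", "B", "A", "A", "B", "B", "A", "C", "B"],
    },
    index=["John", "Sarah", "Eric", "Paul", "Susanne",
           "Beth", "Jack", "Rachel", "Tom", "Jenny"],
)

# The mean of a boolean mask is the fraction of rows where it is True.
p_g1_a = (grades["G1"] == "A").mean()
p_g1_b_g2_c = ((grades["G1"] == "B") & (grades["G2"] == "C")).mean()

print("P(G1 = A) =", p_g1_a)
print("P(G1 = B, G2 = C) =", p_g1_b_g2_c)
```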

Extracting a conditional probability is a little more complicated. For example, what about $P(G_3 = A | G_2 = C)$?
- there are four cases in which $G_2 = C$ (Sarah, Paul, Jack, and Tom)
- among these, only Paul got an A in Year 3, therefore $P(G_3 = A | G_2 = C) = 1/4 = 0.25$.

Let's see if this could also be calculated differently:
- there is only one case (Paul) in which $G_2 = C$ and $G_3 = A$, therefore $P(G_2 = C, G_3 = A) = 1/10$
- also, the probability $P(G_2 = C) = 4/10$
- therefore, $P(G_3 = A | G_2 = C) = \dfrac{P(G_2 = C, G_3 = A)}{P(G_2 = C)} = \dfrac{1/10}{4/10} = 0.25$, as expected.
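Both routes to the conditional probability can be checked in code. The following is a small sketch under the same assumption as the text (the table is the whole population); the `grades` DataFrame and the mask names are introduced here for illustration.

```python
import pandas as pd

# Year 2 and Year 3 grades from the table, in the same student order.
grades = pd.DataFrame(
    {
        "G2": ["A", "C", "B", "C", "A", "A", "C", "A", "C", "A"],
        "G3": ["B", "B", "B", "A", "A", "B", "B", "A", "C", "B"],
    },
    index=["John", "Sarah", "Eric", "Paul", "Susanne",
           "Beth", "Jack", "Rachel", "Tom", "Jenny"],
)

is_c_y2 = grades["G2"] == "C"
is_a_y3 = grades["G3"] == "A"

# Route 1: restrict to the rows with G2 = C, then count the A grades in Year 3.
p_direct = is_a_y3[is_c_y2].mean()

# Route 2: definition of conditional probability, joint divided by marginal.
p_ratio = (is_c_y2 & is_a_y3).mean() / is_c_y2.mean()

print("P(G3 = A | G2 = C), direct count:", p_direct)
print("P(G3 = A | G2 = C), joint / marginal:", p_ratio)
```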


**NOTE**: the above example is based on the assumption that the given data captures the real probability distribution of student grades ${\bf P}(G_1, G_2, G_3)$. In general this is not true though, since even a small change in our data could cause significant changes in the probabilities (try to recompute the above in case Jack had a B in Year 2: what do you get?). Typically, the more data you have, the better it is, as long as the samples cover sufficiently well the underlying probability distribution.

---

## EXERCISE: Weather's probability

You are given a (fake) [dataset](lincoln_weather.csv) of historical records for Lincoln's weather. The weather, which can be either rainy (= 1 in the dataset), misty (= 2), or sunny (= 3), is reported for each day of the week, for a whole year (52 weeks).

After you have formalised the problem (i.e. identified the random variables and the necessary mathematical formulae), write a Python program that reads the dataset and computes the following:
- probability of being sunny during the weekend (one or both days);
- expected weather for each day of the week (*);
- suppose you don't know which day of the week it is today: although very unrealistic, how could you guess the day based only on the weather?

(\*) An expected value of, for example, 2.5 can be interpreted as "a mix of misty and sunny weather".

Write a short document (PDF, max 1 page) or Jupyter Notebook file (preferred) describing your solution and send it to **nbellotto@lincoln.ac.uk** with subject *AAI Workshop 2 - NAME SURNAME*. Please submit your work by the <u>22nd Oct 2021</u>. **It will not be graded, but only used by the lecturer to check the progress of the class**.



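We can formalise the problem with random variables for the weather $W \in \{rainy, misty, sunny\}$ and the weekday $D \in \{mo, tu, we, th, fr, sa, su\}$.
The probability of a sunny weekend (one or both days) can then be expressed as $P(W_{sa} = sunny \lor W_{su} = sunny)$, where $W_{sa}$ and $W_{su}$ denote the weather on Saturday and Sunday of the same week.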
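With $W_{sa}$ and $W_{su}$ denoting the weather on Saturday and Sunday of the same week (per-day variables assumed here for illustration), the "one or both days" event can be expanded by inclusion-exclusion, or equivalently through its complement:

$$
\begin{align}
P(W_{sa} = sunny \lor W_{su} = sunny) &= P(W_{sa} = sunny) + P(W_{su} = sunny) - P(W_{sa} = sunny, W_{su} = sunny) \nonumber\\
 &= 1 - P(W_{sa} \neq sunny, W_{su} \neq sunny). \nonumber
\end{align}
$$

Each term can be estimated as the fraction of the 52 weeks in which the corresponding event occurs.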

In [2]:
import pandas as pd
from IPython.display import display, HTML

# Read the dataset: one row per week, one column per weekday
# (1 = rainy, 2 = misty, 3 = sunny).
df = pd.read_csv("lincoln_weather.csv")

# Fraction of weeks in which Saturday, Sunday, or both are sunny.
total_num_weekends = len(df)
num_weekends_with_sunny_sat_or_sun = len(df.query("Saturday == 3 or Sunday == 3"))
prob_sunny_weekend = num_weekends_with_sunny_sat_or_sun / total_num_weekends

print("Probability of a sunny weekend is {prob}".format(prob=prob_sunny_weekend))

# Expected weather per weekday: the column mean of the numeric weather codes.
mean_df = df.mean().to_frame()
mean_df.columns = ["Expected Weather"]
display(HTML(mean_df.to_html()))

Probability of a sunny weekend is 0.4423076923076923


|  | Expected Weather |
| --- | --- |
| Monday | 2.076923 |
| Tuesday | 1.980769 |
| Wednesday | 2.038462 |
| Thursday | 1.942308 |
| Friday | 1.961538 |
| Saturday | 1.846154 |
| Sunday | 1.75 |


In [3]:
df

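|  | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2 | 1 | 1 | 1 | 2 | 1 |
| 1 | 2 | 1 | 2 | 1 | 1 | 2 | 1 |
| 2 | 2 | 1 | 2 | 2 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 3 | 3 | 2 | 1 |
| 4 | 3 | 3 | 3 | 3 | 1 | 3 | 2 |
| 5 | 3 | 1 | 1 | 1 | 3 | 2 | 1 |
| 6 | 2 | 3 | 2 | 2 | 1 | 2 | 2 |
| 7 | 1 | 3 | 1 | 1 | 3 | 2 | 2 |
| 8 | 2 | 3 | 3 | 2 | 1 | 3 | 2 |
| 9 | 1 | 3 | 1 | 1 | 3 | 1 | 2 |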
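The expected weather is the probability-weighted sum of the weather codes, $E[W] = \sum_{w} w \, P(W = w)$, which for empirical frequencies coincides with the column mean. The sketch below illustrates this for Monday using only the ten rows displayed above, not the full 52-week dataset, so the result differs from the full-year values; the names `sample`, `p_monday` and `expected_monday` are introduced here for illustration.

```python
import pandas as pd

# Only the ten rows shown above, NOT the full 52-week dataset.
sample = pd.DataFrame(
    [[1, 2, 1, 1, 1, 2, 1],
     [2, 1, 2, 1, 1, 2, 1],
     [2, 1, 2, 2, 1, 1, 1],
     [1, 1, 1, 3, 3, 2, 1],
     [3, 3, 3, 3, 1, 3, 2],
     [3, 1, 1, 1, 3, 2, 1],
     [2, 3, 2, 2, 1, 2, 2],
     [1, 3, 1, 1, 3, 2, 2],
     [2, 3, 3, 2, 1, 3, 2],
     [1, 3, 1, 1, 3, 1, 2]],
    columns=["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"],
)

# Empirical distribution P(W = w) on Mondays: relative frequency of each code.
p_monday = sample["Monday"].value_counts(normalize=True)

# Expected value: sum over w of w * P(W = w).
expected_monday = sum(w * p for w, p in p_monday.items())

print("E[W] on Monday (weighted sum):", expected_monday)
print("Column mean on Monday:        ", sample["Monday"].mean())
```

Both printed values should match, which is why `df.mean()` is enough to answer the second question of the exercise.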


The weekday can be estimated from the weather condition.
If the current weather is $w_{curr}$, we select the day $d^*$ with the highest probability given that weather, i.e. $P(d^*|w_{curr}) \geq P(\hat{d}|w_{curr}) \; \forall \hat{d} \in D$.

In [4]:
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

weather_states = ["Rainy", "Misty", "Sunny"]

@interact
def get_most_prob_day_dep_on_w(weather=weather_states):
    # Map the selected label to its numeric code in the dataset (1, 2 or 3).
    curr_weather = weather_states.index(weather) + 1

    # For each weekday column, count how often the selected weather occurs.
    # Unlike a dictionary lookup, this gives 0 (not a KeyError) for days
    # on which that weather never occurs.
    res = (df == curr_weather).sum()

    # P(d | w) = count(d, w) / count(w): the denominator is the same for every day,
    # so comparing the absolute occurrences is equivalent to comparing the probabilities.
    index_of_highest_prob = res.argmax()
    return df.columns[index_of_highest_prob]

interactive(children=(Dropdown(description='weather', options=('Rainy', 'Misty', 'Sunny'), value='Rainy'), Out…
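The interactive cell above returns only the most probable day. As a complement, the sketch below computes the whole distribution ${\bf P}(D | w_{curr})$ by normalising the per-day counts, which is valid because every weekday appears equally often in the data (a uniform prior over $D$). It uses only the ten rows displayed earlier, not the full dataset, and the helper `day_posterior` is a hypothetical name introduced here.

```python
import pandas as pd

# Only the ten rows displayed earlier, NOT the full 52-week dataset.
sample = pd.DataFrame(
    [[1, 2, 1, 1, 1, 2, 1],
     [2, 1, 2, 1, 1, 2, 1],
     [2, 1, 2, 2, 1, 1, 1],
     [1, 1, 1, 3, 3, 2, 1],
     [3, 3, 3, 3, 1, 3, 2],
     [3, 1, 1, 1, 3, 2, 1],
     [2, 3, 2, 2, 1, 2, 2],
     [1, 3, 1, 1, 3, 2, 2],
     [2, 3, 3, 2, 1, 3, 2],
     [1, 3, 1, 1, 3, 1, 2]],
    columns=["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"],
)

def day_posterior(data, weather_code):
    """Return P(day | weather) for every weekday column of `data`."""
    counts = (data == weather_code).sum()  # occurrences of the weather per day
    # Normalise over the days; this is NaN if the weather never occurs at all.
    return counts / counts.sum()

posterior_sunny = day_posterior(sample, 3)  # 3 = sunny
print(posterior_sunny)
print("Most probable day when sunny:", posterior_sunny.idxmax())
```

Looking at the full distribution also shows how confident the guess is: when several days have similar probabilities, the weather carries little information about the day.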