# The `Bayes Theorem` 

In [1]:
import pandas as pd
import numpy as np

Do you remember this theorem from the lecture?

* The Bayes Theorem allows you to compute `a conditional probability`.
* It is widely used in Machine Learning to `update your knowledge`
* Despite its pretty simple formula, it can `highlight unexpected insights`

What is the `Bayes Theorem`? According to [Brilliant](https://brilliant.org/wiki/bayes-theorem/):

> Bayes' theorem is a formula that describes how to update the probabilities of hypotheses (A) when given evidence (Data).

The formula is the following:

$$ \mathbb{P}(A | Data) =  \mathbb{P}(A) \times \frac{\mathbb{P}(Data | A) }{\mathbb{P}(Data)}$$
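
The formula follows from the definition of conditional probability. A short derivation, assuming $\mathbb{P}(A) > 0$ and $\mathbb{P}(Data) > 0$:

$$ \mathbb{P}(A | Data) = \frac{\mathbb{P}(A \cap Data)}{\mathbb{P}(Data)} \quad \text{and} \quad \mathbb{P}(Data | A) = \frac{\mathbb{P}(A \cap Data)}{\mathbb{P}(A)} $$

The second equality gives $\mathbb{P}(A \cap Data) = \mathbb{P}(A) \times \mathbb{P}(Data | A)$. Substituting it into the first one yields:

$$ \mathbb{P}(A | Data) = \mathbb{P}(A) \times \frac{\mathbb{P}(Data | A)}{\mathbb{P}(Data)} $$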

## Challenge context: Should we play sport outside given the weather conditions?

* In this challenge, we'll recompute each term of this formula from data.

* We have a dataset with `weather conditions` (Rainy, Sunny, Overcast) and `play` (Yes, No) indicating whether a sport game was played under those weather conditions.

In [7]:
weather_data_example = ['Sunny', 'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Sunny',
'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']

play_data_example = ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']

data = {'weather': weather_data_example, 'play': play_data_example}

df = pd.DataFrame(data = data)
display(df)

count = df['weather'].value_counts()['Sunny']
print("count:", count)

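|    | weather  | play |
|----|----------|------|
| 0  | Sunny    | No   |
| 1  | Overcast | Yes  |
| 2  | Rainy    | Yes  |
| 3  | Sunny    | Yes  |
| 4  | Sunny    | Yes  |
| 5  | Overcast | Yes  |
| 6  | Rainy    | No   |
| 7  | Rainy    | No   |
| 8  | Sunny    | Yes  |
| 9  | Rainy    | Yes  |
| 10 | Sunny    | No   |
| 11 | Overcast | Yes  |
| 12 | Overcast | Yes  |
| 13 | Rainy    | No   |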


count: 5


Let's compute $ \large P(play \mid weather) = P(play) \times \frac{P(weather \mid play)}{P(weather)} $

## Warm-up : understanding the numbers with a `frequency table`

Grab a pen + a piece of paper and complete the **`frequency table`**:

| Weather  | Played | No | Total |
|----------|------|----|-------|
| Sunny    |     |   |      |
| Overcast |     |   |      |
| Rainy    |     |   |      |
| Total    |     |   |   14  |


    
Solution:

| Weather  | Played | No | Total |
|----------|------|----|-------|
| Sunny    | 3    | 2  | 5     |
| Overcast | 4    | 0  | 4     |
| Rainy    | 2    | 3  | 5     |
| Total    | 9    | 5  | 14    |     
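
If you want to double-check your hand-made table, one possible way is `pd.crosstab`. The sketch below rebuilds the same dataset so that it runs on its own; the variable name `frequency_table` is only illustrative.

```python
import pandas as pd

weather_data_example = ['Sunny', 'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Overcast', 'Rainy',
                        'Rainy', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play_data_example = ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No',
                     'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']

df = pd.DataFrame({'weather': weather_data_example, 'play': play_data_example})

# Count every (weather, play) pair; margins=True appends the row and column totals
frequency_table = pd.crosstab(df['weather'], df['play'],
                              margins=True, margins_name='Total')
print(frequency_table)
```

Note that pandas sorts the labels alphabetically, so the columns come out as `No`, `Yes`, `Total` rather than in the order used in the table above.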


## Prior probability : $ \mathbb{P}(play)$

What is the probability of a game being played, regardless of the weather?

Look at the numbers in your previous table.


| Weather  | Played | No | Total |
|----------|------|----|-------|
| Total    | 9    | 5  | 14    |     
    
$ \mathbb{P}(played) = \frac{9}{14} \approx 64.29 \% $

Code the `prior_probability` function to compute the result.

In [9]:
def prior_probability(event_name: str, observations: list) -> float:
    '''
    Returns P(played)
    '''
    counter = 0
    for i in observations:
        if i == event_name:
            counter += 1
    return float(counter/len(observations))
    
# Run the following to test your code.
# If nothing shows, your function works. Otherwise, inspect your code to fix it!
assert(round(prior_probability("Yes", play_data_example),4) == 0.6429)

In [11]:
# A PYTHONIC SOLUTION 
def prior_probability_pythonic(event_name: str, observations: list) -> float:
    return sum([element == event_name for element in observations])/len(observations)

# Run the following to test your code.
# If nothing shows, your function works. Otherwise, inspect your code to fix it!
assert(round(prior_probability_pythonic("Yes", play_data_example),4) == 0.6429)

In [12]:
# AN EVEN SHORTER SOLUTION WITH NUMPY
def prior_probability_numpy_ic(event_name, observations):
    # The mean of a boolean mask is the fraction of observations equal to event_name
    return np.mean(np.array(observations) == event_name)

# Run the following to test your code.
# If nothing shows, your function works. Otherwise, inspect your code to fix it!
assert(round(prior_probability_numpy_ic("Yes", play_data_example),4) == 0.6429)

FYI: These strange notations
```python
def prior_probability(event_name: str, observations: list) -> float:
```
are called **type hints**.

They are optional in Python and tell the reader which types of arguments a function expects and which type it returns. Python itself does not check them at runtime.

Some Python libraries and tools (for example the static type checker `mypy`) check these types and raise an error when they are not respected.
It's good practice to use these hints in very large projects to make sure nothing breaks silently.
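
To see that the hints are purely informative at runtime, here is a small sketch. It reuses the `prior_probability` signature from this notebook; the tuple passed in is a made-up input chosen only for the demonstration.

```python
def prior_probability(event_name: str, observations: list) -> float:
    return sum(element == event_name for element in observations) / len(observations)

# The hint says `list`, yet Python accepts a tuple without any complaint
result = prior_probability("Yes", ("Yes", "No", "Yes", "Yes"))
print(result)  # 0.75

# The hints are simply stored on the function object for tools to inspect
print(prior_probability.__annotations__)
```

A static checker would flag the tuple argument before the code runs, which is where the hints pay off.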



## Likelihood :  $ \mathbb{P}(weather | play)$

In [13]:
def likelihood(weather, played, weather_data, play_data):
    '''Return P(weather|play)'''
    total = 0        # number of games whose outcome equals `played`
    occurrences = 0  # among those, number of games with the given weather
    for i in range(len(play_data)):
        if play_data[i] == played:
            total += 1
            if weather_data[i] == weather:
                occurrences += 1
    return float(occurrences/total)

# Run the following to test your code.
# If nothing shows, your function works. Otherwise, inspect your code to fix it!
assert(likelihood("Rainy", "No", weather_data_example, play_data_example) == 0.60)

In [14]:
# ANOTHER SOLUTION : GOOD PYTHON STYLE
def likelihood_pythonic(weather, played, weather_data, play_data):
    '''Return P(weather|play)'''
    count_intersection = sum([x == weather and y == played for x,y in zip(weather_data, play_data)])
    count_known_data = sum([y == played for y in play_data])
    return count_intersection / count_known_data

# Run the following to test your code.
# If nothing shows, your function works. Otherwise, inspect your code to fix it!
assert(likelihood_pythonic("Rainy", "No", weather_data_example, play_data_example) == 0.60)

In [15]:
# AN EVEN SHORTER SOLUTION: EVEN FASTER WITH NUMPY
def likelihood_numpy_ic(weather, played, weather_data, play_data):
    return np.mean(np.array(weather_data)[np.array(play_data)==played]==weather)

# Run the following to test your code.
# If nothing shows, your function works. Otherwise, inspect your code to fix it!
assert(likelihood_numpy_ic("Rainy", "No", weather_data_example, play_data_example) == 0.60)
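
The functions above return one likelihood at a time. As a complementary sketch, pandas can return the likelihood of every weather condition for a given outcome at once. The helper name `likelihood_table` is hypothetical and not part of the challenge; the dataset is rebuilt so that the block runs on its own.

```python
import pandas as pd

weather_data_example = ['Sunny', 'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Overcast', 'Rainy',
                        'Rainy', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play_data_example = ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No',
                     'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']

df = pd.DataFrame({'weather': weather_data_example, 'play': play_data_example})

def likelihood_table(played):
    """Relative frequency of each weather among games whose outcome equals `played`."""
    # Keep only the rows with the requested outcome, then normalize the counts
    return df.loc[df['play'] == played, 'weather'].value_counts(normalize=True)

likelihood_given_no = likelihood_table('No')
print(likelihood_given_no)
```

Weather conditions that never occur with the chosen outcome (here `Overcast` for `No`) are absent from the result instead of showing up as 0, and the remaining likelihoods sum to 1.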

## Posterior probability : $ \large P(play \mid weather ) = P(play) \times \frac{P(weather \mid play)}{P(weather)} $

We can finally compute the posterior probability as: 

$$\large \text{posterior probability} = \text{prior probability} \times \text{likelihood} \times \beta $$ 

where $ \large \beta = \frac{1}{P(weather)} $ is a _normalization factor_.
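
As a worked example, take $play = Yes$ and $weather = Sunny$, using the counts from the warm-up frequency table: the prior is $P(Yes) = \frac{9}{14}$, the likelihood is $P(Sunny \mid Yes) = \frac{3}{9}$, and the normalization factor is $\beta = \frac{1}{P(Sunny)} = \frac{14}{5}$.

$$ P(Yes \mid Sunny) = \frac{9}{14} \times \frac{3}{9} \times \frac{14}{5} = \frac{3}{5} = 0.6 $$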
 

Expected results

Remember the table that you completed earlier?

| Weather  | Played | No | Total |
|----------|------|----|-------|
| Sunny    | 3    | 2  | 5     |
| Overcast | 4    | 0  | 4     |
| Rainy    | 2    | 3  | 5     |
| Total    | 9    | 5  | 14    |   
    
Based on this table, we can compute $ \mathbb{P}(play | weather) $ directly by dividing each count by the total of its row:
    
| Weather  | Proba(Played\|Weather) | Proba(No\|Weather) |
|----------|----------------------|--------------------|
| Sunny    | 3/5 = 0.6                  | 2/5 = 0.4                |
| Overcast | 4/4 = 1.0                  | 0/4 = 0.0                |
| Rainy    | 2/5 = 0.4                  | 3/5 = 0.6                |
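
The same row-wise division can be sketched with `pd.crosstab` and `normalize='index'`, which divides every count by its row total. This is only a cross-check of the expected values, not the solution the challenge asks for; `posterior_table` is an illustrative name.

```python
import pandas as pd

weather_data_example = ['Sunny', 'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Overcast', 'Rainy',
                        'Rainy', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play_data_example = ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No',
                     'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']

df = pd.DataFrame({'weather': weather_data_example, 'play': play_data_example})

# normalize='index' turns each row of counts into proportions that sum to 1
posterior_table = pd.crosstab(df['weather'], df['play'], normalize='index')
print(posterior_table)
```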
    


In [17]:
def posterior_probability(played, weather, weather_data, play_data):
    '''TO DO: return P(play|weather)
    '''
    p_played = prior_probability(played, play_data)
    p_weather = prior_probability(weather, weather_data)
    p_likelihood = likelihood(weather, played, weather_data, play_data)
    return p_played * p_likelihood / p_weather

In [20]:
# Run the following cell to test your code
assert(np.isclose(posterior_probability("Yes", "Sunny", weather_data_example, play_data_example), 0.6))
assert(np.isclose(posterior_probability("No", "Sunny", weather_data_example, play_data_example), 0.4))
assert(np.isclose(posterior_probability("Yes", "Overcast", weather_data_example, play_data_example), 1.0))
assert(np.isclose(posterior_probability("No", "Overcast", weather_data_example, play_data_example), 0))
assert(np.isclose(posterior_probability("Yes", "Rainy", weather_data_example, play_data_example), 0.4))
assert(np.isclose(posterior_probability("No", "Rainy", weather_data_example, play_data_example), 0.6))
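
The tests use `np.isclose` because floating-point arithmetic may not give exactly 0.6. As a side check, the sketch below redoes the computation with exact fractions and also verifies that the denominator $P(weather)$ agrees with the law of total probability, $P(weather) = \sum_{play} P(weather \mid play) \times P(play)$. The helpers `relative_frequency` and `conditional` are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction

weather_data = ['Sunny', 'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Overcast', 'Rainy',
                'Rainy', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
play_data = ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No',
             'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']

def relative_frequency(event, observations):
    """Exact fraction of `observations` equal to `event`."""
    return Fraction(sum(x == event for x in observations), len(observations))

def conditional(weather, played):
    """Exact P(weather | play), counted over the games with outcome `played`."""
    given = [w for w, p in zip(weather_data, play_data) if p == played]
    return relative_frequency(weather, given)

# Law of total probability: sum P(Sunny | play) * P(play) over both outcomes
p_sunny_total = sum(conditional('Sunny', p) * relative_frequency(p, play_data)
                    for p in ('Yes', 'No'))
p_sunny_direct = relative_frequency('Sunny', weather_data)

# Bayes' theorem with exact arithmetic
posterior_yes_sunny = (relative_frequency('Yes', play_data)
                       * conditional('Sunny', 'Yes') / p_sunny_direct)
print(p_sunny_total, p_sunny_direct, posterior_yes_sunny)
```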

## Taking a step back to understand what you've done

Thanks to what you've learned in this challenge, could you answer these questions:

1. _"Matches are more likely to be played than not if the weather is sunny."_ Is this statement correct?
2. If you know for sure that it will be raining during the next game, what is your best guess (probability) that the game will be canceled?

In [25]:
posterior_probability("Yes", "Sunny", weather_data_example, play_data_example)

0.6

In [24]:
posterior_probability("No", "Rainy", weather_data_example, play_data_example)

0.6