# Probability-Based Learning

In this lab we will implement functions for calculating probability, joint probability and working out the **maximum a posteriori (MAP)** prediction for a given dataset. 

## Calculating Probability

In order to build a Naive Bayes model, we need to be able to calculate the probability of an event given evidence. This is a straightforward calculation. Say we want to calculate the probability that a patient has a headache; our *evidence* is the dataset of patient records contained in `meningitis.csv`. The probability that a patient has a headache is then the number of rows where `Headache` is true divided by the total number of rows:


$\frac{NROW(Headache)}{NROW(All)}$

In the example below, we use the pandas `.loc` indexer to select the rows where `Headache` is true, and from that calculate the probability that a patient has a headache.


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('meningitis.csv')
df = df.set_index('ID')
df

| ID | Headache | Fever | Vomiting | Meningitis |
|---|---|---|---|---|
| 1 | True | True | False | False |
| 2 | False | True | False | False |
| 3 | True | False | True | False |
| 4 | True | False | True | False |
| 5 | False | True | False | True |
| 6 | True | False | True | False |
| 7 | True | False | True | False |
| 8 | True | False | True | True |
| 9 | False | True | False | False |
| 10 | True | False | True | True |


In [2]:
nrow_headache = len(df.loc[df['Headache']==True])
nrow_all = len(df)
prob_headache = nrow_headache / nrow_all
print(prob_headache) # 0.7

0.7


Now that we've calculated the probability for a single column, we can write a function which will do this for any column we choose. For the moment, we'll assume that any given column is a boolean column, and we'll return the probability of that column having the value `True`.

In [3]:
def calculate_probability(df: pd.DataFrame, column: str) -> float:
    return len(df.loc[df[column]==True]) / len(df)

print(f"P(Headache): {calculate_probability(df, 'Headache')}") # 0.7
print(f"P(Fever): {calculate_probability(df, 'Fever')}") # 0.4
print(f"P(Vomiting): {calculate_probability(df, 'Vomiting')}") # 0.6

P(Headache): 0.7
P(Fever): 0.4
P(Vomiting): 0.6
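
As an aside, pandas treats `True` as 1 and `False` as 0 when averaging a boolean column, so the same probability can also be read off the column mean. The sketch below assumes the column is boolean with no missing values. It rebuilds two columns of the table inline instead of reading `meningitis.csv`, and `calculate_probability_mean` and `df_demo` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Headache and Fever columns copied from the table above (IDs 1 to 10)
df_demo = pd.DataFrame({
    'Headache': [True, False, True, True, False, True, True, True, False, True],
    'Fever': [True, True, False, False, True, False, False, False, True, False],
})

def calculate_probability_mean(df: pd.DataFrame, column: str) -> float:
    # the mean of a boolean column is the fraction of rows that are True
    return float(df[column].mean())

print(calculate_probability_mean(df_demo, 'Headache'))
print(calculate_probability_mean(df_demo, 'Fever'))
```

The row-counting version used in this lab makes the definition of probability explicit, which is why it is kept as the main implementation.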


## Building a Factor Table

If we wanted to build up a full contingency table of all the possible probabilities, we'd need at least one row representing every possible combination of features. Because of the curse of dimensionality, even then most combinations would be backed by only a handful of rows, so our estimates would still be very susceptible to outliers. Naive Bayes works around this by assuming **conditional independence**. That is, we ignore any interaction between the feature columns, and we're only interested in the probability of each feature **given the target variable**.

For the meningitis dataset above, this means that we only need to calculate the probability of a patient having a headache *when they have meningitis* and the probability of a patient having a headache *when they don't have meningitis*. To do this we can use our `calculate_probability` function, above. To calculate the factors for *headache* we first split our dataset into two: rows of patients who have meningitis and rows of patients who don't have meningitis. We then work out the probability of headache for each of these subsets.

Each feature column has 2 entries in the factor table, one for `meningitis=True` and one for `meningitis=False`. It makes sense, then, for our function to return two values rather than one. Python allows you to return multiple values from a function by separating them with commas, which packs them into a single tuple. For example, the following function returns the most common word in a string of text along with the number of occurrences.

In [4]:
from typing import Tuple

def get_mode_and_count(text: str) -> Tuple[str, int]:
    word_dict = dict()
    for word in text.split(' '):
        word_dict[word] = word_dict.get(word, 0) + 1
    
    most_common_word = max(word_dict, key=word_dict.get)
    word_count = word_dict[most_common_word]
    return most_common_word, word_count

mode, count = get_mode_and_count('the quick brown fox jumped over the lazy dog')
mode, count

('the', 2)

Notice that when we want to use a type hint for a function with multiple return values we use the *Tuple* type. A tuple is like a fixed-size list with definite types (it's essentially the same thing as a database row). We can use the same approach to return a tuple of probabilities for each feature column, one for each possible target value.

In [5]:
from typing import Tuple

def calculate_factors(df: pd.DataFrame, column: str, target_column: str) -> Tuple[float, float]:
    return (
        calculate_probability(df.loc[df[target_column]==True], column),
        calculate_probability(df.loc[df[target_column]==False], column),
    )

print(f"Factors(Headache): {calculate_factors(df, 'Headache', 'Meningitis')}") # 0.667, 0.714
print(f"Factors(Fever): {calculate_factors(df, 'Fever', 'Meningitis')}") # 0.333, 0.429
print(f"Factors(Vomiting): {calculate_factors(df, 'Vomiting', 'Meningitis')}") # 0.667, 0.571

Factors(Headache): (0.6666666666666666, 0.7142857142857143)
Factors(Fever): (0.3333333333333333, 0.42857142857142855)
Factors(Vomiting): (0.6666666666666666, 0.5714285714285714)
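
The whole factor table can also be produced in one step by grouping on the target column and averaging each boolean feature. This is only a sketch of an alternative, not the approach this lab builds on: it assumes every feature column is boolean, and it rebuilds the table inline under the hypothetical name `df_demo` instead of reading `meningitis.csv`.

```python
import pandas as pd

# the full meningitis table from earlier in this lab (IDs 1 to 10)
df_demo = pd.DataFrame({
    'Headache':   [True, False, True, True, False, True, True, True, False, True],
    'Fever':      [True, True, False, False, True, False, False, False, True, False],
    'Vomiting':   [False, False, True, True, False, True, True, True, False, True],
    'Meningitis': [False, False, False, False, True, False, False, True, False, True],
})

# one row per target value, one column per feature;
# each cell is P(feature=True | Meningitis=<row label>)
factor_table = df_demo.groupby('Meningitis').mean()
print(factor_table)
```

Each column of `factor_table` holds the same two numbers that `calculate_factors` returns for that feature, with the rows ordered `False` then `True`.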


## Calculating *A Posteriori* Probabilities from Factors

We've looked at how to calculate factors from a dataset. The next thing we need to build our Naive Bayes is some way to calculate *a posteriori* probabilities given a factor table. Let's take another look at a basic factors table to see how this works

![Factors](factors.png)

We can see our table contains one factor for each feature, and another for the overall probability of the target. We can see from the table above that the probability of a headache given that a patient has meningitis is 0.667. We didn't make it explicit, but it follows that the probability of *not* having a headache given that a patient has meningitis is 0.333, because the probabilities of an event and its complement must sum to 1.

How do we work out the probability that a patient who has headache, no fever and is vomiting has meningitis?

1. First, we calculate the *a posteriori* value for meningitis=True
    * This is the prior $P(meningitis=True)$ multiplied by
    * $P(headache \mid meningitis=True)$ multiplied by
    * $1 - P(fever \mid meningitis=True)$ multiplied by
    * $P(vomiting \mid meningitis=True)$
2. Then, we do the same for meningitis=False (using the second column of our factor table)
3. Finally, we return the target value with the maximum *a posteriori* value
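
The three steps above can be sketched as plain arithmetic. The numbers are the factors computed earlier in this lab, rounded to four decimal places, plus the priors 0.3 and 0.7 (3 of the 10 patients have meningitis). The variable names are hypothetical and used only for this worked example.

```python
# rounded factors P(feature=True | target) and priors, from the dataset above
p_m, p_not_m = 0.3, 0.7                # P(meningitis), P(not meningitis)
p_h_m, p_h_not_m = 0.6667, 0.7143      # headache
p_f_m, p_f_not_m = 0.3333, 0.4286      # fever
p_v_m, p_v_not_m = 0.6667, 0.5714      # vomiting

# query: headache=True, fever=False, vomiting=True
# step 1: a posteriori value for meningitis=True
score_true = p_m * p_h_m * (1 - p_f_m) * p_v_m
# step 2: the same for meningitis=False
score_false = p_not_m * p_h_not_m * (1 - p_f_not_m) * p_v_not_m

# step 3: predict the target value with the larger score
prediction = score_true > score_false
print(score_true, score_false, prediction)
```

With these rounded factors the meningitis=False score is the larger of the two, so the prediction for this query is `False`.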

The first thing that becomes clear is that we can easily separate our table into two. In step 1 we're only interested in the probabilities for `meningitis=True`; in step 2 we're only interested in the probabilities for `meningitis=False`.

The next thing we need to be able to do is find the factor value for any given column. We're associating each factor in this table with a column in our dataframe. The easiest way to tie two values together in Python like this is to use a dictionary. In the cell below, create a dictionary containing the factors for `headache`, `fever` and `vomiting`, along with the prior for `meningitis`. Each entry is a tuple: the first element is the value for `meningitis=True` and the second is the value for `meningitis=False`.

To get back the probability of a patient having a headache given that they have meningitis you would use 

```python
factor_dict['headache'][0]
```



In [6]:
factor_dict = dict()
factor_dict['meningitis'] = (0.3, 0.7)
factor_dict['headache'] = (0.6667, 0.7143)
factor_dict['fever'] = (0.3333, 0.4286)
factor_dict['vomiting'] = (0.6667, 0.5714)

print(factor_dict) # {'meningitis': (0.3, 0.7), 'headache': (0.6667, 0.7143), 'fever': (0.3333, 0.4286), 'vomiting': (0.6667, 0.5714)}

{'meningitis': (0.3, 0.7), 'headache': (0.6667, 0.7143), 'fever': (0.3333, 0.4286), 'vomiting': (0.6667, 0.5714)}


Now that we're storing our factor table as a dictionary, we need a function which will calculate conditional probability using that dictionary. Remember that the factor table stores the probability that the descriptive feature is true. To find the probability that the descriptive feature is false, we subtract that value from 1.

In [7]:
def calculate_column_conditional_probability(factor_dict: dict, target_value: bool, column: str, value: bool) -> float:
    # our factor comes back as a tuple, the first element for target=True, second for target=False
    factor = factor_dict[column][0] if target_value == True else factor_dict[column][1]
    # we only store the probability of our descriptive feature being True, subtract from 1 for False
    return factor if value == True else 1 - factor

print(f"P(h|m): {calculate_column_conditional_probability(factor_dict, True, 'headache', True)}") # 0.6667
print(f"P(¬h|m): {calculate_column_conditional_probability(factor_dict, True, 'headache', False)}") # 0.3333
print(f"P(f|m): {calculate_column_conditional_probability(factor_dict, True, 'fever', True)}") # 0.3333
print(f"P(¬f|m): {calculate_column_conditional_probability(factor_dict, True, 'fever', False)}") # 0.6667
print(f"P(v|m): {calculate_column_conditional_probability(factor_dict, True, 'vomiting', True)}") # 0.6667
print(f"P(¬v|m): {calculate_column_conditional_probability(factor_dict, True, 'vomiting', False)}") # 0.3333
print(f"P(h|¬m): {calculate_column_conditional_probability(factor_dict, False, 'headache', True)}") # 0.7143
print(f"P(¬h|¬m): {calculate_column_conditional_probability(factor_dict, False, 'headache', False)}") # 0.2857
print(f"P(f|¬m): {calculate_column_conditional_probability(factor_dict, False, 'fever', True)}") # 0.4286
print(f"P(¬f|¬m): {calculate_column_conditional_probability(factor_dict, False, 'fever', False)}") # 0.5714
print(f"P(v|¬m): {calculate_column_conditional_probability(factor_dict, False, 'vomiting', True)}") # 0.5714
print(f"P(¬v|¬m): {calculate_column_conditional_probability(factor_dict, False, 'vomiting', False)}") # 0.4286

P(h|m): 0.6667
P(¬h|m): 0.33330000000000004
P(f|m): 0.3333
P(¬f|m): 0.6667000000000001
P(v|m): 0.6667
P(¬v|m): 0.33330000000000004
P(h|¬m): 0.7143
P(¬h|¬m): 0.28569999999999995
P(f|¬m): 0.4286
P(¬f|¬m): 0.5714
P(v|¬m): 0.5714
P(¬v|¬m): 0.4286


## Calculating A Posteriori Values of a Row

The *a posteriori* values are the conditional probabilities based on the value of our query (red) multiplied by the prior probability for the target value (blue). The divisor is a normalisation term which allows us to convert our *a posteriori* values into actual probabilities. You'll notice that the *a posteriori* values for True and False don't necessarily sum to 1; once each is divided by the divisor, they will. We can often ignore the divisor, as we're usually interested only in which prediction the model will make rather than the actual probability the model assigns to it. Dividing both values by the same term doesn't change which of them is larger, so for a two-valued target the larger *a posteriori* value always corresponds to a probability greater than 50%.

![Bayes Theorem](bayes_theorem_1.png)
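
To make the role of the divisor concrete, here is a minimal sketch of the normalisation for a single query (headache=True, fever=True, vomiting=False). It uses the rounded factors from this lab and hypothetical variable names, and it assumes the target has only the two values True and False, so the divisor is simply the sum of the two *a posteriori* values.

```python
# unnormalised a posteriori values: prior * P(h|target) * P(f|target) * P(not v|target)
score_true = 0.3 * 0.6667 * 0.3333 * (1 - 0.6667)
score_false = 0.7 * 0.7143 * 0.4286 * (1 - 0.5714)

# the divisor is the total over every possible target value
evidence = score_true + score_false

prob_true = score_true / evidence
prob_false = score_false / evidence

print(prob_true, prob_false, prob_true + prob_false)
```

The ordering of the two values is unchanged by the division, which is why the prediction can be made without ever computing the divisor.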

We've now got a function which will take a column name and a value and give us back the conditional probability. In order to make a prediction, we need to take a pandas row and calculate the conditional probability for each column value. The datatype of a pandas row is a *Series*. A Series lets us extract values using a column name, but unlike a DataFrame it can only ever contain one value per column. Notice that when we want to get the column names from a pandas Series we use the **.index** property, unlike a DataFrame where we would use **.columns**.
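
The difference between `.index` and `.columns` is easy to check on a small example. The two-row DataFrame below is made up purely for illustration (`df_demo` and `row` are hypothetical names).

```python
import pandas as pd

# a tiny illustrative DataFrame with two patient rows
df_demo = pd.DataFrame(
    {'headache': [True, False], 'fever': [True, True], 'vomiting': [False, False]},
    index=[1, 2],
)

row = df_demo.loc[1]            # selecting a single row gives back a Series
print(type(row).__name__)
print(list(row.index))          # on a Series, the column names live in .index
print(list(df_demo.columns))    # on a DataFrame, they are in .columns
print(row['fever'])             # values are still looked up by column name
```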

In [8]:
def calculate_conditional_probs(factor_dict: dict, row: pd.Series, target_value: bool) -> list:
    conditional_probs = [calculate_column_conditional_probability(factor_dict, target_value, column, row[column]) for column in row.index]
    return conditional_probs

query = pd.Series({'headache': True, 'fever': True, 'vomiting': False})
calculate_conditional_probs(factor_dict, query, True)
        

[0.6667, 0.3333, 0.33330000000000004]

## Reducer Functions

In the example above, we've used a list comprehension to calculate the conditional probability for each value in the query. The last thing we need to do is multiply all of these probabilities together. If we were adding all of the values together we could use the built-in Python function `sum()`. Summing is a very common operation, which is why it has been provided to us out of the box. The sum() function takes a list of numbers, applies an operation to them in turn and returns a single number as output. It takes multiple numbers and reduces them down to a single number; this is where the term **reducer** comes from.

If we want to create our own custom reducer in Python we can do so using the **reduce()** function. Let's take a look at how to rewrite the sum() function using reduce().

In [9]:
from functools import reduce

numbers = [1, 2, 3, 4, 5]

def custom_sum(accumulator, nxt):
    print(f"accumulator: {accumulator}, next: {nxt}, sum: {accumulator + nxt}")
    return accumulator + nxt

reduce(custom_sum, numbers, 0)

accumulator: 0, next: 1, sum: 1
accumulator: 1, next: 2, sum: 3
accumulator: 3, next: 3, sum: 6
accumulator: 6, next: 4, sum: 10
accumulator: 10, next: 5, sum: 15


15

We've created a function here, custom_sum, which takes two numbers, adds them together and returns the result. When we call the reduce() function we're telling Python to take the list of numbers and pass each of them to the custom_sum function. Notice that our function takes two parameters, though. The reduce function uses an *accumulator*. We call our function on the first item in the list, and whatever comes back is passed in as the first parameter when we move on to the second item. The return value of this call is passed in with the third item, *etc.* The final parameter to reduce() sets the initial value for the accumulator.
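
The effect of the initial value is easiest to see by leaving it out. In the sketch below (variable names are illustrative), omitting it makes reduce() use the first item of the list as the starting accumulator, and supplying it also gives reduce() something to return for an empty list.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# with an initial value, the accumulator starts at 0
with_initial = reduce(lambda acc, nxt: acc + nxt, numbers, 0)

# without one, the first item (1) becomes the starting accumulator
without_initial = reduce(lambda acc, nxt: acc + nxt, numbers)

# for an empty list, the initial value is returned unchanged
empty_result = reduce(lambda acc, nxt: acc + nxt, [], 0)

print(with_initial, without_initial, empty_result)
```

Calling reduce() on an empty list without an initial value raises a `TypeError`, so passing one is the safer habit when the input might be empty.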

Try it yourself. Use the reduce function to take a list of words and output a single string with each word separated by a space.

In [10]:
def paste(words):
    return reduce(lambda acc, nxt: acc + ' ' + nxt, words, '')[1:]

# you can use str[1:] to remove the first character of a string
paste(['The', 'quick', 'brown', 'fox']) #'The quick brown fox'

'The quick brown fox'

We can now put all of this together to calculate the a posteriori values. Implement the calculate_a_posteriori function below. The answers you are expecting are included in the comments.

1. Find the prior probability (in the factor_dict using the target_column name as key)
2. Find the conditional probabilities for each column in the row (using the calculate_column_conditional_probability function)
3. Reduce the conditional probabilities by multiplying each of them together
4. Multiply the result by the prior

To make a prediction we calculate the *a posteriori* value for each possible target value and predict the target value with the highest one.

In [None]:
def calculate_a_posteriori(factor_dict: dict, row: pd.Series, target_column: str, target_value: bool) -> float:
    prior = factor_dict[target_column][0] if target_value == True else factor_dict[target_column][1]
    conditional_probs = [calculate_column_conditional_probability(factor_dict, target_value, column, row[column]) for column in row.index]
    return reduce(lambda acc, nxt: acc * nxt, conditional_probs, 1) * prior

query = pd.Series({'headache': True, 'fever': True, 'vomiting': False})
print(f"A Posteriori True: {calculate_a_posteriori(factor_dict, query, 'meningitis', True)}") # 0.0222188888889
print(f"A Posteriori False: {calculate_a_posteriori(factor_dict, query, 'meningitis', False)}") # 0.0918508169796
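
Multiplying a list of numbers together is common enough that Python 3.8 and later ship a ready-made reducer for it, `math.prod()`. The sketch below compares it with the reduce() version on the rounded conditional probabilities for the query above; `via_reduce` and `via_prod` are illustrative names.

```python
import math
from functools import reduce

# conditional probabilities for headache=True, fever=True, vomiting=False
# given meningitis=True, rounded to four decimal places
conditional_probs = [0.6667, 0.3333, 0.3333]
prior = 0.3

# the hand-written reducer used in this lab
via_reduce = reduce(lambda acc, nxt: acc * nxt, conditional_probs, 1) * prior

# the built-in equivalent
via_prod = math.prod(conditional_probs) * prior

print(via_reduce, via_prod)
```

Writing the reducer by hand is still worthwhile here, since it shows exactly what `math.prod()` does on our behalf.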


## Further Exploration

1. We've looked at how to calculate the *a posteriori* values for a given query using a factor table. This is essentially the code that we would run for `model.predict()`. What would we need to do when `model.train()` was called? Can you implement it?
2. So far we've only looked at how to determine which value a model should predict. scikit-learn usually provides a `model.predict_proba()` function returning the expected probability of a given value. How would you implement predict_proba for a Naive Bayes? Try it.
3. Laplacian smoothing allows us to smooth probabilities using a parameter *k*. How would you update the calculate_probability() function from the beginning of this notebook to allow for Laplacian smoothing? You may assume all columns are boolean.
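
As a starting point for the third question, here is one possible sketch of a smoothed probability. It assumes the usual Laplace formula, $(count + k) / (n + k \times |Domain|)$, with a domain size of 2 because the column is boolean. `calculate_probability_smoothed` and `df_demo` are hypothetical names and are not part of the notebook's own code.

```python
import pandas as pd

def calculate_probability_smoothed(df: pd.DataFrame, column: str, k: float = 1.0) -> float:
    # a boolean column has two possible values, True and False
    domain_size = 2
    count_true = int((df[column] == True).sum())
    # add k pseudo-observations for each possible value
    return (count_true + k) / (len(df) + k * domain_size)

# Headache column copied from the table at the start of this lab
df_demo = pd.DataFrame({
    'Headache': [True, False, True, True, False, True, True, True, False, True],
})

print(calculate_probability_smoothed(df_demo, 'Headache', k=0))  # k=0 is the unsmoothed estimate
print(calculate_probability_smoothed(df_demo, 'Headache', k=1))
```

Larger values of *k* pull the estimate further towards 0.5. With any positive *k*, a subset with no rows at all still yields a defined probability instead of a division by zero.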
