# Marginal, joint and conditional probabilities review

#### Marginal probability P(A):
Probability of an event irrespective of the outcomes of other random variables. <br> 
If the variable is independent of the others -> the probability of the event is used directly <br>
If the variable is dependent on others -> the joint probability is summed over all outcomes of the other variables (sum rule): P(A) = sum over all b of P(A, B=b)

#### Joint Probability
Probability of two events occurring together: P(A and B) or P(A,B)

#### Conditional probability P(A given B) or P(A|B)
Probability of one or more events given the occurrence of another event

### Relationships
P(A,B) = P(A|B)*P(B) <br>
P(A,B) = P(B,A)<br>
P(A|B) = P(A,B) / P(B)<br>
P(A|B) != P(B|A) in general<br>
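
The relationships above can be checked numerically on a small joint distribution. The sketch below is only an illustration: the probabilities in `joint` are hypothetical values chosen to sum to 1, and the helper names (`marginal_a`, `marginal_b`, `conditional_a_given_b`, `conditional_b_given_a`) are not part of the original notes.

```python
import math

# Hypothetical joint distribution P(A, B) over two binary events (sums to 1).
joint = {
    (True, True): 0.10,
    (True, False): 0.20,
    (False, True): 0.30,
    (False, False): 0.40,
}

def marginal_a(a):
    """Sum rule: P(A=a) is the joint summed over every outcome of B."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    """Sum rule: P(B=b) is the joint summed over every outcome of A."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def conditional_a_given_b(a, b):
    """P(A=a | B=b) = P(A=a, B=b) / P(B=b)."""
    return joint[(a, b)] / marginal_b(b)

def conditional_b_given_a(b, a):
    """P(B=b | A=a) = P(A=a, B=b) / P(A=a)."""
    return joint[(a, b)] / marginal_a(a)

p_a = marginal_a(True)
p_b = marginal_b(True)
p_a_given_b = conditional_a_given_b(True, True)
p_b_given_a = conditional_b_given_a(True, True)

# Product rule: P(A,B) = P(A|B) * P(B)
assert math.isclose(p_a_given_b * p_b, joint[(True, True)])
# The two conditionals differ here because P(A) != P(B).
assert not math.isclose(p_a_given_b, p_b_given_a)
```

With these made-up numbers, P(A) is about 0.3 and P(B) about 0.4, so the two conditional probabilities come out different even though both are built from the same joint entry P(A,B).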


### Calculate conditional probability from other conditional probability
P(A|B) = P(B|A) * P(A) / P(B) <br>
P(B|A) = P(A|B) * P(B) / P(A)

# Bayes Theorem
It is often the case that we don't have access to the joint probability. <br>
Bayes' Theorem allows us to calculate a conditional probability without the joint probability. <br>
It also allows us to calculate a conditional probability without direct access to P(B), by expanding P(B) over A and not A: <br>
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A) <br>
P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A)) <br>
with P(not A) = 1-P(A) and P(B|not A) = 1 - P(not B|not A)
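
As a minimal sketch, the expanded form of the theorem can be wrapped in a small function. The function name `posterior` and its parameter names are assumptions made for illustration; the inputs are P(A), P(B|A) and P(not B|not A), exactly the three quantities used in the formulas above.

```python
def posterior(prior, likelihood, true_negative_rate):
    """Return P(A|B) given P(A), P(B|A) and P(not B|not A)."""
    false_positive_rate = 1.0 - true_negative_rate   # P(B|not A)
    not_prior = 1.0 - prior                          # P(not A)
    # Evidence P(B), expanded over A and not A.
    evidence = likelihood * prior + false_positive_rate * not_prior
    return likelihood * prior / evidence

# Hypothetical inputs: P(A) = 0.01, P(B|A) = 0.9, P(not B|not A) = 0.9
example = posterior(prior=0.01, likelihood=0.9, true_negative_rate=0.9)
print(f"{example:.4f}")
```

With these hypothetical inputs the evidence is 0.009 + 0.099 = 0.108, so the posterior is 0.009 / 0.108, roughly 0.083, even though P(B|A) is 0.9.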



# Naming
Naming convention depends on context: <br>
P(A|B): Posterior probability <br>
P(A): Prior probability <br>
P(B|A): Likelihood <br>
P(B): Evidence

What is the probability that there is fire given that there is smoke? <br>
Where P(Fire|Smoke) is the Posterior, P(Fire) is the Prior, P(Smoke|Fire) is the Likelihood, and P(Smoke) is the Evidence:<br>
P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)


## Examples

### Cancer patient test
#### Initial Conditions
1/ Of all people with cancer tested, 85% will test positive <br>
P(Test=Positive | Cancer=True) = 0.85<br>
<br>
2/ 1 person in 5000 has cancer<br>
P(Cancer=True) = 1/5000 = 0.02% <br>
<br>
3/ Of all people without cancer tested, 95% will test negative <br>
P(Test=Negative | Cancer=False) = 0.95

#### Formulation
P(A|B) = P(B|A) * P(A) / P(B) <br>
P(cancer=True | test = Positive) = P(test=Positive | Cancer = True) * P(cancer=True) / P(test = Positive)

#### Resolution
1/ P(cancer=True | test = Positive) = 0.85 * 0.02% / P(test = Positive) <br>
<br>
2/ We need to determine P(test = Positive) or P(B) <br>
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A) <br>
P(test = Positive) = P(test = Positive | Cancer = True) * P(cancer = True) + P(test = Positive | Cancer = False) * P(cancer = False) <br>
<br>
P(B) = (0.85 * 0.02%) + (1-0.95) * (1-0.02%)

In [5]:
PB = (0.85 * 0.02/100) + (1-0.95) * (1-0.02/100)
PB

0.05016000000000005

Which gives the following resolution for P(A|B), i.e. <br>
P(cancer=True | test = Positive) = P(test=Positive | Cancer = True) * P(cancer=True) / P(test = Positive)

In [8]:
P_of_cancer_if_positive = 0.85 * (0.02/100) / PB
print( "{:.2%}".format(P_of_cancer_if_positive))

0.34%
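
One way to sanity-check this result is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical population of 1,000,000 tested people (the population size and variable names are illustrative only) and uses exact fractions to avoid rounding noise.

```python
from fractions import Fraction

population = 1_000_000             # hypothetical number of people tested
prior = Fraction(1, 5000)          # P(Cancer=True)
sensitivity = Fraction(85, 100)    # P(Test=Positive | Cancer=True)
specificity = Fraction(95, 100)    # P(Test=Negative | Cancer=False)

with_cancer = population * prior
without_cancer = population - with_cancer
true_positives = with_cancer * sensitivity
false_positives = without_cancer * (1 - specificity)

# Among everyone who tests positive, the share that actually has cancer.
p_cancer_if_positive = true_positives / (true_positives + false_positives)
print("{:.2%}".format(float(p_cancer_if_positive)))
```

In this hypothetical population, 200 people have cancer and 170 of them test positive, while 49,990 of the 999,800 people without cancer also test positive. The false positives vastly outnumber the true positives, which is why the posterior stays so low despite the 85% detection rate.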


# Applied to multiple variables

In practice, it is very challenging to apply the full Bayes Theorem for classification.<br>
The priors for the class, P(class), and for the data, P(data), are easy to estimate from a training dataset, provided the dataset is suitably representative of the broader problem.<br>
The conditional probability of the observation given the class, P(data|class), is not feasible to estimate unless the number of examples is extraordinarily large, i.e. large enough to estimate the probability distribution over all possible combinations of input values. This is almost never the case: we will not have sufficient coverage of the domain.

### Naive Bayes

The solution to using Bayes Theorem for a conditional probability classification model is to simplify the calculation.
<br><br>
Applied in full, Bayes Theorem makes no independence assumption, so P(data|class) has to account for every possible dependence between input variables. This is a cause of complexity in the calculation. We can instead assume that each input variable is independent of the others given the class.
<br><br>
This changes the model from a dependent conditional probability model to an independent conditional probability model and dramatically simplifies the calculation.
<br><br>
This means that we calculate P(data|class) for each input variable separately and multiply the results together, for example:
<br><br>
P(class | X1, X2, …, Xn) = P(X1|class) * P(X2|class) * … * P(Xn|class) * P(class) / P(data)
<br><br>
We can also drop the probability of observing the data, as it is the same constant for every class. The result is then no longer a probability but a score proportional to it, which is enough to pick the most likely class, for example:
<br><br>
P(class | X1, X2, …, Xn) ∝ P(X1|class) * P(X2|class) * … * P(Xn|class) * P(class)
<br><br>
This simplification of Bayes Theorem is common and widely used for classification predictive modeling problems and is generally referred to as Naive Bayes.
<br><br>
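
The calculation described above can be sketched with plain counting on categorical inputs. Everything in this block is an illustrative assumption: the toy dataset `rows`, its feature values and the helper names `score` and `predict` are hypothetical, and no smoothing is applied, so a feature value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: ((outlook, windy), play)
rows = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
    (("rainy", "yes"), "no"),
]

# Number of training examples per class, used for P(class).
class_counts = Counter(label for _, label in rows)

# feature_counts[(feature_index, label)][value] = number of matching rows,
# used for P(Xi = value | class).
feature_counts = defaultdict(Counter)
for features, label in rows:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def score(features, label):
    """Unnormalized posterior: P(class) * product of P(Xi | class)."""
    s = class_counts[label] / len(rows)
    for i, value in enumerate(features):
        s *= feature_counts[(i, label)][value] / class_counts[label]
    return s

def predict(features):
    """Return the class with the highest unnormalized score."""
    return max(class_counts, key=lambda label: score(features, label))

print(predict(("sunny", "no")))
```

Because P(data) is dropped, the scores do not sum to 1 across classes. Dividing each score by the sum of all class scores would recover proper posterior probabilities if they are needed.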