## Topic 11: Probability


### Permutation

In a permutation, order matters.
> The number of **permutations** of **n** objects taken **r** at a time is given by the formula:
>
> $$\large P(n,r) = \frac{n!}{(n - r)!}$$


Example: How many possible orders are there for first/second/third place in a race with 30 contestants?

In [1]:
import math

# Integer division keeps the count an exact integer
placing_orders = math.factorial(30) // math.factorial(30 - 3)
placing_orders

24360
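
As a cross-check on the formula, the sketch below uses `math.perm` (available in Python 3.8+) and then counts the orderings by brute force with `itertools.permutations`. The five-runner field is a made-up example, kept small so the full enumeration is cheap.

```python
import math
from itertools import permutations

# math.perm(n, r) computes n! / (n - r)! with exact integer arithmetic
race_orders = math.perm(30, 3)
print(race_orders)  # 24360

# Brute-force check on a small, hypothetical field of 5 runners
runners = ["A", "B", "C", "D", "E"]
podiums = list(permutations(runners, 3))
print(len(podiums), math.perm(5, 3))  # 60 60
```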

### Combination

In a combination, order does not matter; you only care about which items are in the set.

> The number of **combinations** of **n** objects taken **r** at a time is given by the formula:
>
> $$\large C(n,r) = \frac{n!}{r!(n - r)!}$$

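Example: How many possible codes are there for a standard padlock? Despite the name "combination lock", this is not a combination: order matters and numbers can repeat, so a 3-number code on a 40-number dial has $40^3$ possibilities.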

In [2]:
codes = 40**3
codes

64000

Example: How many unique 3 topping pizzas can you make from 8 ingredients?

In [3]:
# Integer division keeps the count an exact integer
pizza_toppings = math.factorial(8) // (math.factorial(3) * math.factorial(8 - 3))
pizza_toppings

56
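
The same count can be reproduced with `math.comb` (Python 3.8+) and verified by enumerating the sets with `itertools.combinations`. This is a minimal sketch; the ingredient names are placeholders, not part of the original example.

```python
import math
from itertools import combinations

# math.comb(n, r) computes n! / (r! * (n - r)!) exactly
n_pizzas = math.comb(8, 3)
print(n_pizzas)  # 56

# Hypothetical ingredient names, used only to enumerate the sets
ingredients = ["ham", "olives", "onion", "pepper",
               "mushroom", "bacon", "corn", "basil"]
pizzas = list(combinations(ingredients, 3))
print(len(pizzas))  # 56

# Each unordered set of 3 toppings can be arranged in 3! orders,
# so P(8, 3) = C(8, 3) * 3!
print(math.perm(8, 3), n_pizzas * math.factorial(3))  # 336 336
```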

### Probability Theory

**General Addition Rule**

The probability that either $A$ or $B$ will occur can be calculated by adding each individual probability, and then subtracting the probability that both occur together:

$$\large P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Remember, $P(A \cap B)$ expresses the overlap between the two events - if you don't subtract that overlap, then you double count the instances when **both** $A$ and $B$ occur!
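
To make the double counting concrete, here is a small sketch with one roll of a fair six-sided die. The events (A: the roll is even, B: the roll is greater than 3) are chosen for illustration only. The rule is compared against a direct count of the union.

```python
from fractions import Fraction

sample_space = set(range(1, 7))                # one roll of a fair die
A = {x for x in sample_space if x % 2 == 0}    # even: {2, 4, 6}
B = {x for x in sample_space if x > 3}         # greater than 3: {4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# General addition rule: subtract the overlap {4, 6} so it is counted once
p_union_rule = prob(A) + prob(B) - prob(A & B)

# Direct count of the union {2, 4, 5, 6}
p_union_direct = prob(A | B)

print(p_union_rule, p_union_direct)  # 2/3 2/3
```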

Check: Are the outcomes independent of one another?

> Formally, $A$ and $B$ are *independent* if and only if the probability that *both* $A$ *and* $B$ happen is:
> 
> $$\large P(A \cap B) = P(A) \cdot P(B)$$
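
The independence check can be run directly on a small sample space. The sketch below again assumes one roll of a fair die, with illustrative events chosen so that one pair passes the check and one pair fails it.

```python
from fractions import Fraction

sample_space = set(range(1, 7))   # one roll of a fair die
A = {2, 4, 6}                     # roll is even
B = {1, 2, 3, 4}                  # roll is at most 4
C = {4, 5, 6}                     # roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def independent(x, y):
    """True if and only if P(x and y) equals P(x) * P(y)."""
    return prob(x & y) == prob(x) * prob(y)

print(independent(A, B))  # True:  1/3 == 1/2 * 2/3
print(independent(A, C))  # False: 1/3 != 1/2 * 1/2
```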

### Conditional Probability

The general multiplication rule, which also holds for events that are not independent, is:

$$\large P(A\cap B) = P(A | B) \cdot P(B)$$
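
A short worked example, assuming one roll of a fair die with two dependent events (A: even, B: greater than 3). The conditional probability is computed by restricting the sample space to B, and the product is then compared with a direct count of the intersection.

```python
from fractions import Fraction

sample_space = set(range(1, 7))   # one roll of a fair die
A = {2, 4, 6}                     # roll is even
B = {4, 5, 6}                     # roll is greater than 3

p_B = Fraction(len(B), len(sample_space))               # 1/2

# P(A | B): among the outcomes in B, the fraction that are also in A
p_A_given_B = Fraction(len(A & B), len(B))              # 2/3

# Multiplication rule versus a direct count of the intersection
p_both_rule = p_A_given_B * p_B
p_both_direct = Fraction(len(A & B), len(sample_space))

print(p_both_rule, p_both_direct)  # 1/3 1/3
```

Here $P(A | B) = 2/3$ differs from $P(A) = 1/2$, which is another way of seeing that these two events are not independent.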

## Topic 14: Hypothesis Testing


**Errors and Sampling**

A Type 1 Error is a False Positive: you reject a null hypothesis that is actually true.
* Example: Doctor tells an old man he is pregnant

A Type 2 Error is a False Negative: you fail to reject a null hypothesis that is actually false.
* Example: Doctor tells a woman in labor she is not pregnant

For the test results to be trustworthy, the sample should satisfy these assumptions:
* Sample is independent, meaning the value of one observation does not affect the other observations
* Sample is collected randomly, so selection happens by chance instead of choice
* Sample is approximately normally distributed
* Sample size is appropriately large


**P-Values**

The p-value is a statistical summary of how compatible the observed data are with what you would expect to see if the null hypothesis and the rest of the statistical model were correct. It is the probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true.
* If the p-value is lower than the significance threshold **($p < \alpha$)**, we reject the null hypothesis. For a test of means, this says the observed sample mean is significantly different from the population mean.
* If the p-value is higher than the significance threshold **($p > \alpha$)**, we fail to reject the null hypothesis: the sample mean is not significantly different from the population mean.

**Significance Threshold, or alpha**

Alpha is compared against the p-value to determine whether a finding is significant. Typically, alpha is set to 0.05.

Choosing a lower alpha makes the test stricter, so you are less likely to reject the null hypothesis (and rejecting it is usually the outcome you are hoping for). Choosing a higher alpha makes it more likely that you reject the null hypothesis. The downside of a higher alpha is a higher risk of a Type 1 error: falsely concluding that there is a real difference when there actually isn't one.
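
The trade-off can be illustrated with a small simulation. This is a sketch under stated assumptions: every sample is drawn from a population where the null hypothesis is true (mean 0, known standard deviation 1), a two-sided z-test is applied, and the function name, sample size, trial count and seed are arbitrary choices. The fraction of rejections should land close to the chosen alpha.

```python
import math
import random
from statistics import NormalDist, fmean

def false_positive_rate(alpha, trials=5000, n=30, seed=0):
    """Fraction of true-null samples that a two-sided z-test rejects."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    std_normal = NormalDist()       # standard normal, mean 0 and sd 1
    mu_0, sigma = 0.0, 1.0          # the null hypothesis is true by construction
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu_0) / (sigma / math.sqrt(n))
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1         # a Type 1 error
    return rejections / trials

rate_05 = false_positive_rate(0.05)
rate_10 = false_positive_rate(0.10)
print(rate_05, rate_10)  # each should be near its alpha
```

Because both calls use the same seed, they test the same simulated samples, so the rate at alpha = 0.10 can never be lower than the rate at alpha = 0.05.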

**One Sample Z-test**

The one-sample $z$-test is used when you want to know whether your sample comes from a particular population. Use case: you have a large sample and you know the population standard deviation $\sigma$.

$$ \large \text{z-statistic} = \dfrac{\bar x - \mu_0}{{\sigma}/{\sqrt{n}}} $$

Once you obtain the z-statistic, pass it to the standard normal CDF to determine the probability.


In [3]:
# Using stats.norm.cdf to find the probability for a z-score
import scipy.stats as stats

# Probability of a value up to a z-score of 1.5
print(stats.norm.cdf(1.5))

# Probability of a value greater than a z-score of 1.34
print(1 - stats.norm.cdf(1.34))

0.9331927987311419
0.09012267246445238
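
Putting the formula, the CDF and the significance threshold together, a one-sample z-test might look like the sketch below. All numbers (sample mean 103, population mean 100, population standard deviation 15, sample size 50) are invented for illustration. `NormalDist().cdf` from the standard library is used here in the same role as `stats.norm.cdf`.

```python
import math
from statistics import NormalDist

# Hypothetical inputs, not taken from real data
x_bar = 103     # observed sample mean
mu_0 = 100      # population mean under the null hypothesis
sigma = 15      # known population standard deviation
n = 50          # sample size
alpha = 0.05    # significance threshold

# z-statistic: distance of the sample mean from mu_0 in standard errors
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Two-sided p-value: probability of a result at least this extreme
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_null = p_value < alpha
print(round(z, 4), round(p_value, 4), reject_null)
```

With these inputs the p-value is above 0.05, so the null hypothesis is not rejected.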


**Example Problem**

**ANOVA Testing**

ANOVA testing can determine whether the means of three or more samples differ. Limitation: it cannot tell you where the difference lies.
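
As a rough sketch of what a one-way ANOVA computes, the F-statistic below compares the variation between group means with the variation within groups. The three groups are made-up data, and the calculation is done by hand with the standard library. In practice `scipy.stats.f_oneway` reports this statistic together with a p-value.

```python
from statistics import fmean

# Hypothetical measurements for three groups
groups = {
    "a": [1, 2, 3],
    "b": [2, 3, 4],
    "c": [5, 6, 7],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = fmean(all_values)

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())

# Variation of the observations around their own group mean
ss_within = sum((x - fmean(g)) ** 2 for g in groups.values() for x in g)

df_between = len(groups) - 1                # number of groups minus 1
df_within = len(all_values) - len(groups)   # observations minus groups

f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(round(f_statistic, 4))  # 13.0
```

A large F-statistic suggests at least one group mean differs, but it does not say which one. That is the limitation noted above.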