# Probability and combinatorics

> "... probability tells us how often something is likely to occur when an experiment is repeated..." (Sarah Boslaugh, Statistics in a Nutshell)

Probability theory describes what properties our sample should have, given the properties of the underlying population. It is a purely theoretical discipline, telling us how likely an event is to happen and does not require data. 

There are two related terms:
- **Probability** is the proportion of times an event would occur over infinitely many repetitions; it quantifies predictions about events yet to happen (the future);
- **Likelihood** measures how plausible a model or parameter value is, given events that have already occurred (the past).

---

- Relative frequency / empirical probability / experimental probability: probability based on past experience, over a **finite number of events**; empirical probability estimates probabilities from experience and observation.
  - Relative frequency: $\frac{3}{9}, \frac{6}{9}$
  - Empirical probability: $0.333, 0.667$
- Probability (classic definition): probability over an **infinite number of events**; the limit of relative frequency as the number of events approaches infinity;

---

Some important terms:
| Term | Definition | Example |
| - | - | - |
| **Trial** (experiment) | A single run of a random process whose outcome is unknown in advance. | |
| **Event (E)** | A set of one or more outcomes of a trial.| The event that the sum of two dice is 11, $E = \{(5,6), (6,5)\}$ |
| **Sample space** (S, set) | The set of all possible outcomes of a trial. | For a roll of a six-sided die, $S = \{1,2,3,4,5,6 \}$ |

When all outcomes are equally likely, the probability of an event is the number of favourable outcomes divided by the total number of outcomes in the sample space:
$$ P(E) = \frac{n(E)}{n(S)} $$


**Theoretical probability** is what we expect from reasoning about the experiment itself, such as flipping a coin. E.g. the probability of a coin landing on Heads is $P(H) = 0.5$; for a 6-sided die, the probability of rolling 3 or more is $P(\ge3) = 4/6 = 2/3$.

**Experimental probability** is an estimate we make based on previous experience. E.g. if we played 16 games in the past, and we make a histogram with number of points and count for each bin, we can later make a prediction for the 17th game based on past data, for example, the probability of obtaining a score that is more than a certain number.

In probability, **The Law of Large Numbers** states that experimental probability gets closer to the theoretical probability with a large number of experiments. 
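
The Law of Large Numbers can be illustrated with a small simulation. This is only a sketch: the function name `heads_fraction`, the fixed seed, and the chosen sample sizes are illustrative assumptions, not part of the original notes.

```python
import random

def heads_fraction(n_flips, seed=0):
    """Fraction of heads in n_flips simulated fair coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The experimental probability should drift towards the theoretical 0.5
for n in (10, 1_000, 100_000):
    print(n, heads_fraction(n))
```

With only 10 flips the fraction can be far from 0.5; with 100,000 flips it should sit very close to it.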

Types of probabilities:
- **Marginal probability**: probability of a single event $P(X)$
- **Joint probabilities**: probability of two or more events happening at the same time: $P(A \text{ AND } B)$

Example problems:

---

A cereal promotion puts 1 of 6 toys in each box. Estimate how many boxes, on average, it takes to collect all 6 toys. 
- A) Randomly generate digits 1-6 until every digit has appeared at least once, and count how many draws (boxes) that took; 
- B) Repeat step A many times; 
- C) The mean of the resulting distribution of box counts approximates the answer.
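
Steps A to C can be sketched as a short simulation. It assumes each toy is equally likely to be in any box; the function names, the seed, and the number of runs are illustrative choices.

```python
import random

def boxes_until_complete(n_toys, rng):
    """Step A: draw random toys until all n_toys have been seen."""
    seen = set()
    boxes = 0
    while len(seen) < n_toys:
        seen.add(rng.randint(1, n_toys))  # toy found in this box
        boxes += 1
    return boxes

def estimate_mean_boxes(n_runs=20_000, n_toys=6, seed=0):
    """Steps B and C: repeat many times and average the box counts."""
    rng = random.Random(seed)
    total = sum(boxes_until_complete(n_toys, rng) for _ in range(n_runs))
    return total / n_runs

# Exact answer for comparison: 6 * (1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6) = 14.7
print(estimate_mean_boxes())
```

The estimate should land close to the exact value of 14.7 boxes.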

# Odds

Odds are the number of times something occurred divided by the number of times it didn't occur. They can also be thought of as the probability of winning divided by the probability of losing.

Odds are meaningful: odds of 2 mean that an event is two times more likely to happen than not to happen.
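
As a minimal sketch, odds and probability can be converted into each other with $O = \frac{P}{1 - P}$ and $P = \frac{O}{1 + O}$. The helper names below are hypothetical.

```python
def prob_to_odds(p):
    """Odds in favour of an event with probability p (requires p < 1)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Probability of an event whose odds in favour are `odds`."""
    return odds / (1 + odds)

print(prob_to_odds(0.75))  # a horse that wins 3 races out of 4
print(odds_to_prob(3))     # odds of 3 to 1
```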

> Examples:
> 
> Odds that a team will win are 3 to 1 -> probability of winning = 3/(3+1) = 3/4
>
> The probability of obtaining 1 when we roll a die is $P = \cfrac{1}{6}$, but the odds are $\text{Odds} = \cfrac{1}{5}$
>
> If a horse wins 3 out of every 4 races, the probability of that horse winning a race is $P = \cfrac{3}{4}$, but odds are $\text{Odds} = \cfrac{3}{1} = 3$
>
> While probability is a number between 0 and 1, odds are a number between 0 and infinity

Odds and probability can be interchanged:
> $P(X) = 0.7, P(Y) = 0.3$
>
> Odds: $O(X) = \cfrac{7}{3}$
>
> $P(X) = \cfrac{O(X)}{1 + O(X)}$
>
> $O(X) = \cfrac{P(X)}{1 - P(X)}$



# Rules of Probability

$\cup \text{ = OR} \\ 
\cap \text{ = AND}$

A and B are **Mutually Exclusive** if they cannot occur at the same time.

Two events are **Independent** if the knowledge of occurrence of one does not change the probability that the other occurs. 


## Rule of Complementary Probabilities (Complement Rule)

For an event $E$, the probability of a complement of the event ($E_{c}$) is 1 minus the probability of $E$:
$$ P(E_{c}) = 1 - P(E) $$

> For example, if $A$ is the event that a person is a man, then $A_{c}$ is the event that a person is NOT a man (in a universe with two sexes, a woman), and $P(A_{c}) = 1 - P(A)$


## Addition rule

> Addition rule; sum rule
> 
> Probability of A **OR** B

**Addition (sum) rule for probability**: the probability of either of two events occurring (in an OR manner, so event 1, event 2, or both at the same time) equals the sum of the probabilities of each event minus the probability of both events occurring at the same time (otherwise the intersection would be counted twice). IOW, P(A or B) = P(A) + P(B) - P(A and B):

$$ \text{ Sum rule of probability (general form): } P(A \cup B) = P(A \text{ OR } B) = P(A) + P(B) - P(A \cap B) $$

To choose the variation of the rule to apply, you'll need to check if two events are mutually exclusive. 
- Two events are mutually exclusive if $A \cap B = \emptyset$
- Otherwise, the events are not mutually exclusive:

| Exclusivity of events | Explanation | Example | OR (The Addition Rule) |
| - | - | - | - |
| Not Mutually Exclusive (nonmutually exclusive) | A and B are **not mutually exclusive** if they intersect - have some degree of overlap.<br><img src="Media/not-mutually-exclusive-events.png" width=150> <br><br>An intersection exists. | On a d6 die, probability of rolling more than 1 or rolling odd ($P(>1 \text{ OR odd})$). <br> Probability of getting a heads (on a coin flip) OR a 6 (on a d6 die). <br><br>Out of a deck of playing cards, pulling out an ace (A) OR a spade: $P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2) = \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{4}{13}$ | $$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$ |
| Mutually Exclusive | A and B are **mutually exclusive** if the two events cannot occur at the same time. E.g. on a d6 die, P (even OR odd) $ = P(even) + P(odd) = 0.5 + 0.5 = 1 $ <br><img src="Media/mutually-exclusive-events.png" width=150> <br><br>There is no intersection: $A \cap B = \emptyset$, so $P(A \cap B) = 0$ | Probability of getting a 4 or a 6 on a die roll. <br><br>Out of a deck of playing cards, pulling out an ace OR a king: $P(E_1 \cup E_2) = P(E_1) + P(E_2) = \frac{1}{13} + \frac{1}{13} = \frac{2}{13}$ | The Sum rule for disjoint probabilities: $$ P(A \cup B) = P(A) + P(B) $$ |

Probability of the Union of 3 or more sets:

$$
P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)
$$
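
The addition rule can be checked by brute force on a small sample space. The sketch below uses a d6 die and Python sets; the events chosen for the three-set check are arbitrary illustrations.

```python
from fractions import Fraction

SAMPLE_SPACE = set(range(1, 7))  # faces of a d6 die

def prob(event):
    """Classical probability: favourable outcomes / all outcomes."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

# Two events: "more than 1" OR "odd"
more_than_one = {x for x in SAMPLE_SPACE if x > 1}
odd = {x for x in SAMPLE_SPACE if x % 2 == 1}
two_direct = prob(more_than_one | odd)
two_by_rule = prob(more_than_one) + prob(odd) - prob(more_than_one & odd)

# Three events (arbitrary choices): even, greater than 3, at most 2
a, b, c = {2, 4, 6}, {4, 5, 6}, {1, 2}
three_direct = prob(a | b | c)
three_by_rule = (prob(a) + prob(b) + prob(c)
                 - prob(a & b) - prob(a & c) - prob(b & c)
                 + prob(a & b & c))

print(two_direct, two_by_rule)
print(three_direct, three_by_rule)
```

Counting the union directly and applying the formula should give the same fraction in both cases.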

## Multiplication rule

> Multiplication rule; product rule
>
> Probability of A **AND** B

**Multiplication / product rule**: the probability of event 1 AND event 2 occurring is the probability of event 1 multiplied by the probability of event 2 given event 1:
$$
\text{ Product rule (general form): } P(A\cap B) = P(A \text{ AND } B) = P(A) * P(B | A)
\\~\\
\text{ Same but for three events: } p(A \cap B \cap C) = p(A) * p(B|A) * p(C|A \cap B)
\\
\text{ Can also be written as: } p(x_1, x_2, x_3) = p(x_1) p(x_2, x_3 | x_1) = p(x_1) p(x_2 | x_1 ) p(x_3 | x_2, x_1)
$$

Before applying the rule, you'll need to know if the events are independent or not. 
- A and B are independent events if the following holds: $P(A \cap B) = P(A) * P(B)$

| Independence of events | Explanation | Example | AND (The Multiplication / Product Rule) | 
| - | - | - | - |
| Independent events | Events are independent if P of one event is not affected by the occurrence of the other event (occurrence of one doesn't influence in any way the occurrence of the other one). | E.g. flipping a coin and getting H/T every trial. P of getting T on the first trial doesn't affect the subsequent probabilities. | In independent events, $P(B\|A)=P(B)$, therefore, *Product rule of independent events*: $$P(A\cap B) = P(A)*P(B)$$ | 
| Dependent (conditional) events | Events are dependent when the occurrence of one changes the probability of the other. | E.g. consecutively taking marbles of a particular colour out of a bag full of different coloured marbles, without putting them back. | *Product rule of dependent events*: $$P(A\cap B) = P(A) * P(B\| A)$$ |

Examples of independent events:
- Events of "fifth coin flip lands on heads" and "sixth coin flip lands on tails": $P(E_{1} \cap E_{2}) = P(E_1) P(E_2)$
- $P(\text{A is alive in 20 years}) = 0.7, P(\text{B is alive in 20 years}) = 0.5; \text{Probability that both are alive in 20 years: } P(A \cap B) = 0.7*0.5 = 0.35 $

<img src="Media/independent-events.png" width=900>

**Examples**:

> Roll a die three times. What is the probability of at least one 6?
```txt
P(at least one 6 in three rolls) =
1 - P(no 6 in three rolls) = 
1 - P( (no 6 in first roll) AND (no 6 in second roll) AND (no 6 in third roll) ) = 
1 - 5/6 * 5/6 * 5/6 = 91/216
```
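
The same result can be cross-checked by enumerating all $6^3 = 216$ outcomes. This is just a verification sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every possible result of three die rolls
rolls = list(product(range(1, 7), repeat=3))

# Direct count of outcomes containing at least one 6
favourable = sum(1 for roll in rolls if 6 in roll)
p_enumerated = Fraction(favourable, len(rolls))

# Complement rule combined with the product rule for independent rolls
p_complement = 1 - Fraction(5, 6) ** 3

print(p_enumerated, p_complement)
```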

## Difference

$\text{The general law of subtraction of probabilities: } P(A-B) = P(A) - P(A \cap B)$

For example:

> In a container there are red balls numbered 1 to 10 and blue balls numbered 1 to 5. What is the probability of randomly pulling out a ball that is red and does NOT have a number 5? 

$P(R - 5) = P(R) - P(R \cap 5) = \cfrac{10}{15} - \cfrac{1}{15} = \cfrac{9}{15}$

> The probability that Antonio wins a game of tennis is 2/5 and the probability that Juan wins is 1/4. What is the probability that Antonio wins the game if he plays against Juan? 

- Antonio wins: $P(A) = 2/5$
- Juan wins: $P(B) = 1/4$

You can use the formula given in this section, assuming the two events are independent: $P(A -B) = P(A) - P(A \cap B) = P(A) - P(A)*P(B) = 2/5 - 2/5 * 1/4 = 0.3$

Alternatively, you can use a slightly different approach, which to me seems more logical and intuitive:

$P(A - B) = P(A \cap B') = P(A) * P(B') = 2/5 * (1- 1/4) = 2/5 * 3/4 = 0.3$


## Examples

> Example 1: What is the probability of getting three sixes on three consecutive rolls of a six-sided die?

$ P = \left( \cfrac{1}{6} \right)^3 = \cfrac{1}{216}$

> What is the probability of rolling 6 on a d6 die OR flipping heads on a coin? 

$P(heads \text{ OR } 6) = 1/2 + 1/6 - (1/2 * 1/6) = 7/12 \approx 0.583$

> From the deck of 36 cards one card is pulled. Let event A = pulling a spade and B = pulling an ace. 
> Q1. Are these events independent? Q2. Are these events mutually exclusive? 

<u>Answer Q1:</u>

Remember that A and B are independent events if the following holds: $P(A \cap B) = P(A) * P(B)$

$$
P(A) = \cfrac{9}{36}
\\
P(B) = \cfrac{4}{36}
\\
\text{There is only one card that is an ace of spades: } P(A \cap B) = \cfrac{1}{36}
\\
P(A)*P(B) = \cfrac{1}{36}
\\
P(A)*P(B) = P(A \cap B) \Leftrightarrow \text{events are independent.}
$$

<u>Answer Q2:</u>

Remember that two events are mutually exclusive if $A \cap B = \emptyset$

There is an outcome that contains both events (an ace of spades card), so these events are NOT mutually exclusive.
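
Both answers can be confirmed by listing the deck. The sketch assumes the 36-card deck holds ranks 6 through ace in four suits, which the problem statement does not spell out.

```python
from fractions import Fraction
from itertools import product

# Assumption: a 36-card deck has ranks 6..A in each of the four suits
RANKS = ["6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
DECK = list(product(RANKS, SUITS))

def prob(predicate):
    """Probability that a randomly drawn card satisfies the predicate."""
    return Fraction(sum(1 for card in DECK if predicate(card)), len(DECK))

p_spade = prob(lambda card: card[1] == "spades")
p_ace = prob(lambda card: card[0] == "A")
p_both = prob(lambda card: card == ("A", "spades"))

independent = p_both == p_spade * p_ace  # P(A and B) = P(A) * P(B)?
mutually_exclusive = p_both == 0         # is the intersection empty?

print(independent, mutually_exclusive)
```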


# Conditional Probability

Conditional probability is used to calculate the probability of an event when we know another event has occurred. 

*Conditional probability* - the probability of an event A given that another event B has occurred:

$$ p(A|B) = \frac{p(A\cap B)}{p(B)} $$

- In independent events, $ p(A|B) = \cfrac{p(A\cap B)}{p(B)} = \cfrac{p(A)p(B)}{p(B)} = p(A) $
- However, in dependent events, this equality doesn't hold

**Bayes' theorem** reverses the direction of the dependencies (this formula is useful for flipping probabilities):
$$ p(A|B) = \frac{p(A) p(B|A) }{p(B)} = \frac{p(A) p(B|A) }{ p(A) p(B|A) + p(A_{c}) p(B|A_{c}) } $$

> Bayes' theorem is based on the premise that the more information we gather about an event, the better we can estimate its probability. E.g. if a person is thinking of an animal (a dog), compare a) the probability that the animal is a dog without any further information (hard to estimate, and probably pretty low) with b) the probability that the animal is a dog given some hints: it is a domestic animal, small to large in size, that likes to play with people.  

Bayes' theorem allows revising the predicted probability of an event based on new information; that is, it calculates the conditional probability of the event. 

In the formula above, 
- $p(A)$ - the *prior probability* of $A$ (probability of an event before introduction of the new data; you could see it as the initial probability without the additional data that helps us make a better estimate);
- $p(A|B)$, $p(B|A)$ - conditional probabilities; these come from events: something occurs that gives us more information, e.g. people telling us more about the case;
- $p(A|B)$ - the *posterior probability* of $A$ given $B$ (the updated and more accurate probability based on new information). 

IOW:
- $p(A)$ - the prior probability; 
- $p(B)$ - the probability of the new information (the evidence) $B$; 
- $p(A|B)$ - the posterior probability of A given the new information B.

![image.png](attachment:image.png)

*Rule of complementary probabilities* - for an event E, its complement ($E_{c}$) has the probability of 1 minus p of E: $p(E_{c}) = 1 - p(E)$


<img src="Media/Bayes.png" width="400px">

$ p(B\cap A) = p(B|A) * p(A) = 0.3*0.2 = 0.06 $

$ p(B|A) = 0.3 $


$ p(A|B) = \frac{p(B|A)p(A)}{p(B)} = \frac{p(B|A) p(A)}{p(B|A) p(A) + p(B|A_{c}) p(A_{c})} = \frac{0.3*0.2}{0.3*0.2 + 0.05*0.8} = 0.6$

---

<u>Example problems for conditional probability</u>:

There are 8 coins in a bag, 3 of which are unfair (60% chance of H) and the rest are fair. If you randomly choose one coin from the bag and flip it 2 times, what is the probability of getting 2H?
- Upon drawing a coin, we have $p=5/8$ of drawing a fair coin and $p=3/8$ of drawing an unfair coin. 
- $P(Fair \cap HH) = P(Fair) * P(HH | Fair) = 5/8 * 0.5^{2} = 5/8 * 0.25 = 0.15625$
- $P(Unfair \cap HH) = P(Unfair) * P(HH | Unfair) = 3/8 * 0.6^{2} = 3/8 * 0.36 = 0.135$
- $P(HH) = 0.15625 + 0.135 = 0.29125 = 29.125$%
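
The calculation above follows the law of total probability: sum $P(\text{coin}) \cdot P(HH \mid \text{coin})$ over the coin types. A minimal sketch with exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

# (probability of drawing this coin type, probability of heads on one flip)
coin_types = [
    (Fraction(5, 8), Fraction(1, 2)),  # fair coins
    (Fraction(3, 8), Fraction(3, 5)),  # unfair coins, 60% heads
]

# Two flips of the same coin are independent given the coin type,
# so P(HH | coin) = P(H | coin) ** 2
p_hh = sum(p_coin * p_heads ** 2 for p_coin, p_heads in coin_types)

print(p_hh, float(p_hh))
```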


## Examples

> *Example from "Grokking Machine Learning" (L. Serrano), p. 207*
>
> **What is the probability of a person having the disease, given that he tested positive on a test? Test is 99% accurate; on average, 1 out of every 10,000 people have the disease.** 

**Solution 1: using logic**

First, let's establish some nomenclature and calculate some probabilities:
- Person having the disease: $p(d+)$
- Person not having the disease: $p(d-)$
- Positive prediction of the test: $p(t+)$
- Negative prediction of the test: $p(t-)$
- Probability of having a disease is $p(d+) = \frac{1}{10,000} = 10^{-4}$. 
- Probability of not having a disease is $p(d-) = \frac{9,999}{10,000} = 0.9999$

<u>If a person is tested positive, there are two possibilities / two possible outcomes</u>:
- He tested positive and he has the disease; this means that the test is correct ($99\%$) and he has the disease ($10^{-4}$), so $p(d+ \cap t+) = 0.99 * 10^{-4} = 9.9 * 10^{-5}$. This is the outcome where he <u>has the disease and the test is positive</u>;
- He tested positive but he doesn't have the disease; this means that the test is wrong ($1\%$) and he doesn't have the disease ($0.9999$), so $p(d- \cap t+) = 0.01 * 0.9999 = 0.009999$. This is the outcome where he <u>doesn't have the disease and the test is positive</u>.

Out of these two possible outcomes, let's calculate the share in which the person has the disease; this share is the conditional probability $p(d+|t+)$:

$$p(d+|t+) = \cfrac{p(d+ \cap t+)}{p(d+ \cap t+) + p(d- \cap t+)} = \cfrac{9.9*10^{-5}}{9.9*10^{-5} + 0.009999} = 0.0098$$

Answer: probability of a person to have the disease, given that he tested positive, is $0.0098$, or $0.98\%$. 

**Solution 2: using the Bayes' theorem**

If we use the Bayes' theorem, we will derive the same equation, and as a result, come to the same final calculation: 

$$p(d+|t+) = \cfrac{p(d+) p(t+|d+)}{p(t+)} = \cfrac{ p(d+) p(t+|d+) }{ p(d+) p(t+|d+) + p(d-) p(t+|d-) } = \cfrac{10^{-4} * 0.99}{10^{-4} * 0.99 + 0.9999*0.01} = 0.0098$$

<br><br>

> **Example from Luis Serrano's "Grokking Machine Learning" book**
>
> An office has two clerks, Aisha and Beto. Aisha works there three days a week, and Beto works the other two days. While this arrangement always stays true, the exact days for each worker change every week. 
>
> The two workers have a preference for wearing red. Aisha wears red one day out of her three working days, while Beto wears red one day out of his two days. 
>
> Knowing that the clerk working today is wearing red, what is the probability that it is Aisha? 

$$
P(\text{Aisha}) = \cfrac{3}{5}, P(\text{Beto}) = \cfrac{2}{5}
\\~\\
P(\text{red|Aisha}) = \cfrac{1}{3}, P(\text{red|Beto}) = \cfrac{1}{2}
\\~\\
P(\text{Aisha|red}) = \cfrac{ P(\text{red|Aisha}) * P(\text{Aisha}) }{ p(\text{red}) } = \cfrac{ P(\text{red|Aisha}) * P(\text{Aisha}) }{ P(\text{red|Aisha}) P(\text{Aisha}) + P(\text{red|Beto}) P(\text{Beto}) }
\\~\\
P(\text{Aisha|red}) = \cfrac{1/3 * 3/5}{1/3 * 3/5 + 1/2 * 2/5} = 0.5
$$


> **Example**
>
> There are 100 emails, 20 of which are spam and 80 - ham. The number of emails containing the word 'sale' is 6 for spam emails and 4 for ham. 

```text
                'sale' (6/20)
                /
      spam (0.2) 
    /          \
  /             no word 'sale' (14/20)
/
\
  \               'sale' (4/80)
    \           /
      ham (0.8)
                \
                  no word 'sale' (76/80)
```

- $P('sale'|spam) = \cfrac{P('sale' \cap spam)}{P(spam)} = \cfrac{0.2 * 6/20}{0.2} = 6/20$
- $P('sale'|ham) = \cfrac{P('sale' \cap ham)}{P(ham)} = \cfrac{0.8 * 4/80}{0.8} = 4/80$
- $P(spam|'sale') = \cfrac{P(spam \cap 'sale')}{P('sale')} = \cfrac{0.2*6/20}{0.2*6/20 + 0.8*4/80} = 0.6$
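
The three Bayes calculations above (disease test, clerks, spam) share the same two-hypothesis structure, so they can be reproduced with one small helper. The function name `posterior` and its parameter names are hypothetical; for the disease example, "99% accurate" is taken to mean both a 99% true positive rate and a 1% false positive rate, as in the solution above.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem for a hypothesis H and its complement.

    prior                  -- p(H)
    p_evidence_given_h     -- p(E | H)
    p_evidence_given_not_h -- p(E | not H)
    Returns p(H | E).
    """
    numerator = prior * p_evidence_given_h
    evidence = numerator + (1 - prior) * p_evidence_given_not_h  # p(E)
    return numerator / evidence

print(posterior(1e-4, 0.99, 0.01))      # disease given a positive test
print(posterior(3 / 5, 1 / 3, 1 / 2))   # Aisha given red
print(posterior(0.2, 6 / 20, 4 / 80))   # spam given the word 'sale'
```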



> In a random experiment there are 10 equally-possible elementary events / outcomes, shown on the Euler diagram below.
>
> ![image.png](attachment:image.png)
> 
> Calculate conditional probabilities of events A and B. 

$$
P(B|A) = \cfrac{2}{5} = 0.4
\\
P(A|B) = \cfrac{2}{4} = 0.5
$$


# Probability with combinatorics

Combinatorics is useful when the number of possible outcomes / events is too large to count directly. 


> We have 8 coin flips. What's the probability of having exactly 3 coins land as Heads?
- $p(\frac{3}{8} H) = ?$
- Total number of events = $2^8 = 256$
- Combinations 8 choose 3 = $\frac{8!}{3! (8-3)!} = 56$
- $p(\frac{3}{8} H) = \frac{56}{256} = 0.219$


> Probability of making exactly 3 out of 5 freethrows? $p(FT) = 80\%$
- Total number of combinations in which we have 3 out of 5 freethrows: $C = \frac{5!}{3! *2!} = 10$
- Probability of one sequence with 3 out of 5 freethrows: $p = 0.8^{3} * 0.2^{2}$
- Probability of all possible sequences with 3/5 freethrows: $p(3/5) = 10 * 0.8^3 * 0.2^2 = 0.2048 = 20.48\%$


> Probability of making at least 3 out of 5 freethrows (3/5 or more)? $p(FT) = 80\%$
- 5 choose 3 = 10;
- 5 choose 4 = $\frac{5!}{4!*1!} = 5$
- $p(\ge 3/5) = 10*0.8^{3}*0.2^{2} + 5*0.8^{4}*0.2^{1} + 1*0.8^{5} = 0.942 $
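
The coin and freethrow problems above all follow the binomial pattern: (number of combinations) × (probability of one sequence). A minimal sketch using `math.comb` from the standard library (Python 3.8+); the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_at_least(k_min, n, p):
    """Probability of k_min or more successes in n independent trials."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

print(binom_pmf(3, 8, 0.5))       # exactly 3 heads in 8 flips
print(binom_pmf(3, 5, 0.8))       # exactly 3 of 5 freethrows
print(binom_at_least(3, 5, 0.8))  # at least 3 of 5 freethrows
```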


> Each card in a standard deck of 52 playing cards is unique and belongs to 1 of 4 suits: 13 cards are clubs, 13 are diamonds, 13 are hearts, and 13 are spades. Suppose that Luisa randomly draws 4 cards without replacement. What is the probability that Luisa gets 2 diamonds and 2 hearts (in any order)?
- Any order - combinations; 
- Combinations (2 diamonds) = $_{13}C_{2}$
- Combinations (2 hearts) = $_{13}C_{2}$
- Total number of 4-card combinations = $_{52}C_{4}$
- Probability: $p = \frac{(_{13}C_{2}) (_{13}C_{2})}{_{52}C_{4}}$


> Declan's friend Luka claims that he can read minds. To test Luka's abilities, Declan draws 5 cards without replacement from a standard deck of 52 playing cards. Declan then asks Luka to identify in any order which 5 cards he drew without looking. Assume that Luka has no special abilities and is randomly guessing the cards.
> 
> What is the probability that Luka correctly identifies all 5 cards in any order?
- There is only 1 correct set available to make from the 52 cards;
- Any order, so combinations; 
- Total number of combinations of cards = $_{52}C_{5}$
- Therefore, $p = \frac{1}{_{52}C_{5}}$


> A club of 9 people wants to choose a board of three officers: President, VP, and Secretary. Assuming the officers are chosen at random, what is the probability that the people chosen for the roles are Marsha for President, Sabita for VP, and Robert for Secretary?
- $p = 1/9 * 1/8 * 1/7 = 1/504$


> Nia is 1 of 24 students in a class. Every month, the teacher randomly selects 4 students from their class to act as president, vp, secretary, and treasurer. Students cannot hold different positions at once. What's the probability that Nia is chosen as president in a given month?
- Order matters - permutations; 
- Total number of 4-student arrangements: $_{24} P _{4}$
- All the arrangements that include Nia as the president, which is equivalent to how many arrangements of 3 students are possible from Nia's 23 classmates: $_{23} P _{3}$ 
- Answer = $\frac{_{23} P_{3}}{_{24} P_{4}} = \frac{1}{24}$


> What is the probability of guessing a 4-digit passcode consisting of non-repeating digits (0-9)?
- Answer = $\frac{1}{10*9*8*7} = \frac{1}{_{10} P _{4}}$
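
The counting answers in this section can be evaluated numerically with `math.comb` and `math.perm` (Python 3.8+). The variable names below are illustrative; the expressions mirror the formulas given above.

```python
from fractions import Fraction
from math import comb, perm

# 2 diamonds and 2 hearts in a 4-card draw
p_two_diamonds_two_hearts = Fraction(comb(13, 2) * comb(13, 2), comb(52, 4))

# Guessing one specific 5-card set, order ignored
p_guess_five_cards = Fraction(1, comb(52, 5))

# One specific President / VP / Secretary assignment out of 9 people
p_officers = Fraction(1, perm(9, 3))

# Nia as president among 24 students filling 4 ordered roles
p_nia_president = Fraction(perm(23, 3), perm(24, 4))

# One specific 4-digit passcode with non-repeating digits
p_passcode = Fraction(1, perm(10, 4))

for p in (p_two_diamonds_two_hearts, p_guess_five_cards,
          p_officers, p_nia_president, p_passcode):
    print(p, float(p))
```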


## Permutations

> Paul and his nine friends want to form a line. What is the probability that Paul will be at the front? 

Order matters here, so we use the formula for permutations.

$$
n(\text{total}) = \cfrac{10!}{(10-10)!} = 10!
\\
n(\text{sequences where Paul is at the front }) = \cfrac{9!}{(9-9)!} = 9!
\\
P(\text{sequence where Paul is at the front}) = \cfrac{9!}{10!} = 0.1
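$$

> Ten friends want to form a line. What is the probability that the first four places will be occupied by Paul, Antonio, Juan, and Pedro, in that order?

The first four places are fixed, so only the remaining six friends can be arranged freely: $P(\text{first four places in that order}) = \cfrac{6!}{10!} = 0.000198$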


# Expected value

In essence, the expected value is the average outcome per action after a very large number of turns.

Expected value of a discrete random variable:

If we define $X$ as a discrete random variable that can take on values $X_1, X_2, ..., X_K$ with the respective probabilities $p_1, p_2, ..., p_K$, where $p_1 + p_2 + ... + p_K = 1$, the expected value of $X$ is denoted as $E(X)$ and is calculated as follows:
$$E(X) = p_1 X_1 + p_2 X_2 + ... + p_K X_K = \sum_{k=1}^{K} p_k X_k$$

---

$X$ - number of workouts in a week. Below is the probability distribution of this variable:

| $X$ | $P(X)$ |
| - | - |
| 0 | 0.1 |
| 1 | 0.15 |
| 2 | 0.4 |
| 3 | 0.25 |
| 4 | 0.1 |

In this case (discrete random variable), it's also equal to the weighted mean (weighted sum):

$E(X) = 0*0.1 + 1*0.15 + 2*0.4 + 3*0.25 + 4*0.1 = 2.1 $

---

Another example - betting:

| | Win | Lose |
| - | - | - |
| X (net gain) | \$35 | -\$1 |
| P(X) | 1/38 | 37/38 |

Expected value of a player's net gain on a \$1 bet on a single slot: $E(X) = 35 * \frac{1}{38} + (-1) * \frac{37}{38} = -0.053$ dollars. 

We could also interpret it as follows: over many bets, the average return would be about -\$0.053 per bet. 


> There is a lottery with 10,000 possible selections. The lottery pays \$4500 on a \$1 bet if all 4 digits of a selection match the lottery result. Calculate the expected net gain $E(X)$, where $X$ is a player's net gain on a \$1 straight bet.

- If he wins (probability 1/10000), he net gains \$4499;
- If he loses (probability 9999/10000), he net gains -\$1;
- $E(X) = 4499 * \frac{1}{10000} + (-1) * \frac{9999}{10000} = -0.55$
- So if we play 10000 times, we pay \$10000 and expect to win \$4500: a net gain of -\$5500.


> An electronics store gives customers the option of purchasing a protection plan when they buy a new refrigerator. The customer pays \$125 for the plan, and if their refrigerator is damaged or stops working, the store will replace it for no additional charge. The store knows that 3% of customers who buy this plan end up needing a replacement that costs the store \$1500 each. Calculate the store's expected net gain $E(X)$ from one of these plans. 

- Replacement, probability = 0.03, net gain = -\$1375; 
- No replacement, probability = 0.97, net gain = \$125;
- $E(X) = 0.03*(-1375) + 0.97*(125) = 80$ dollars
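
The expected values in this section can be reproduced with a small helper that takes (value, probability) pairs. The function name and the sanity check on the probabilities are illustrative additions.

```python
def expected_value(distribution):
    """Expected value of a discrete random variable.

    distribution -- iterable of (value, probability) pairs.
    """
    pairs = list(distribution)
    total_probability = sum(p for _, p in pairs)
    if abs(total_probability - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in pairs)

workouts = [(0, 0.1), (1, 0.15), (2, 0.4), (3, 0.25), (4, 0.1)]
single_slot_bet = [(35, 1 / 38), (-1, 37 / 38)]
lottery = [(4499, 1 / 10000), (-1, 9999 / 10000)]
protection_plan = [(-1375, 0.03), (125, 0.97)]

for name, dist in [("workouts", workouts), ("bet", single_slot_bet),
                   ("lottery", lottery), ("plan", protection_plan)]:
    print(name, expected_value(dist))
```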

> On average, how many flips of a fair coin would you need to get 2 of the same flip in a row (either 2 H in a row or 2 T)?

**You can solve by calculating expected value:** 

Let's consider the tree below:

```txt
         .
        / \
       /   \
      /     \
     H       T
    / \     / \
   H   T   H   T
```

Let $x$ be the expected number of flips, then:
- Throwing HH has a probability of 0.25 and it's the desired outcome, so 2 flips is all you need;
- same for TT;
- Throwing HT has a probability of 0.25; it's not the desired outcome, but the last flip (T) can start a new pair, which puts you in the same position as after the very first flip. From there you expect $(x-1)$ more flips, so you need $2 + (x-1) = (x+1)$ flips in total;
- same for TH.

$$
x = 0.25 * 2 + 0.25 (x+1) + 0.25 (x+1) + 0.25*2
\\
x = 3
$$

**You can also solve it using logic**

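Probability that the first pair of equal flips is completed at flip $n$ (the first $n-1$ flips must alternate, and flip $n$ must repeat flip $n-1$, which leaves 2 of the $2^n$ possible sequences): 

$P(x=n) = 2 * \left( \cfrac{1}{2} \right) ^{n} = 2^{1-n} = \cfrac{1}{2^{n-1}}$

Calculate the expected value:

$E[X] = \sum\limits^{\infty}_{n=2} n P(x=n) = \sum\limits^{\infty}_{n=2} \cfrac{n}{2^{n-1}} = 3$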

> On average, how many flips of a fair coin do you need to make to get three heads in a row (HHH)?

https://math.stackexchange.com/questions/1839496/expected-number-of-tosses-to-get-3-consecutive-heads

```txt
                   .
                  / \
                 /   \
                /     \
               /       \
              /         \
             /           \
            H             T
           / \           / \
          /   \         /   \
         /     \       /     \
        H       T     H       T
       / \     / \   / \     / \
      H   T   H   T H   T   H   T

```

Contributions to the expected value (probability of the branch multiplied by the number of flips it needs): 
- Let $x$ be the expected number of flips to achieve HHH; 
- HHH = success, three H in a row, three flips: contributes $\cfrac{1}{8} * 3$
- HHT = failure, so have to start again after three flips: contributes $\cfrac{1}{8} (x+3)$
- HT = failure, so have to start again after two flips: contributes $\cfrac{1}{4} (x+2)$
- T = failure, so have to start again after one flip: contributes $\cfrac{1}{2} (x+1)$

$$
x = \cfrac{1}{8} * 3 + \cfrac{1}{8} (x+3) + \cfrac{1}{4} (x+2) + \cfrac{1}{2} (x+1)
\\~\\
x = 14 
$$
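
Both streak results (3 flips for two equal flips in a row, 14 flips for HHH) can be sanity-checked with a simulation. This is a sketch: the function names, the `heads_only` flag, the seed, and the number of runs are illustrative assumptions.

```python
import random

def flips_until_run(run_length, rng, heads_only):
    """Flip a fair coin until run_length equal faces appear in a row.

    heads_only=True  -> the run must consist of heads (e.g. HHH)
    heads_only=False -> a run of either face counts (e.g. HH or TT)
    """
    flips = 0
    streak = 0
    previous = None
    while streak < run_length:
        face = rng.choice("HT")
        flips += 1
        # extend the streak if the face repeats, otherwise restart at 1
        streak = streak + 1 if face == previous else 1
        previous = face
        if heads_only and face == "T":
            streak = 0  # tails never count towards a heads-only run
    return flips

def mean_flips(run_length, heads_only, n_runs=20_000, seed=0):
    """Average number of flips over many simulated experiments."""
    rng = random.Random(seed)
    total = sum(flips_until_run(run_length, rng, heads_only)
                for _ in range(n_runs))
    return total / n_runs

print(mean_flips(2, heads_only=False))  # theory: 3
print(mean_flips(3, heads_only=True))   # theory: 14
```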

# Pascal's triangle

Row $N$ of Pascal's triangle lists, for a binary event repeated $N$ times, how many distinct orderings (permutations) produce each possible number of successes. In other words, entry $k$ of row $N$ is the number of combinations $_{N}C_{k}$.

<img src="Media/math/pascal-triangle2.jpg" width=500>

![image.png](attachment:image.png)
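
As a minimal sketch, each row of Pascal's triangle can be built from the previous one by adding neighbouring entries, and the result matches the binomial coefficients. Rows are counted from 0 here, which is an assumption about the indexing; the function name is illustrative.

```python
from math import comb

def pascal_row(n):
    """Row n of Pascal's triangle (row 0 is [1])."""
    row = [1]
    for _ in range(n):
        # pad with zeros on each side and add neighbours pairwise
        row = [left + right for left, right in zip([0] + row, row + [0])]
    return row

for n in range(5):
    print(pascal_row(n))

# Entry k of row n equals "n choose k": 3 heads in 8 flips -> 56 sequences
print(pascal_row(8)[3], comb(8, 3))
```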


# Stochastic / random processes / models

A stochastic or random process is a mathematical object usually defined as a sequence of random variables, where the index of the sequence has the interpretation of time.

> In simple terms, a stochastic process (also called a random process) is a way to describe something that changes over time in a random or unpredictable way, or at least that involves randomness.

Examples:
- The price of a stock going up and down
- A person walking randomly in different directions ("random walk")

## Wiener / Brownian motion

Brownian motion (the Wiener process) is like a random walk, but in continuous time. 

![image.png](attachment:image.png)

## Markov chains

### Markov Property

> Markov property: the future behavior of a process depends only on its current state, regardless of how it reached that state. 

- This characteristic makes Markov processes different from other processes that may depend on the entire history of past events.

$$
\large
\text{Let $x_{t}$ be the token (word) at index $t$ in a sentence consisting of $T$ words.}
\\ 
\text{Note that each token / unit can also be an N-gram, but here, for the sake of simplicity, we'll consider words.}
\\
\text{Sequence: } \{ x_1 , x_2 , ... , x_{T} \}
\\
\text{Probability of sequence: } p(x_1, x_2, ..., x_{T})
\\
\text{Probability of each word, given the preceding sequence: } p(x_{t} | x_{t-1}, x_{t-2}, ..., x_1)
\\~\\
\text{The Markov Property: } p(x_{t} | x_{t-1}, x_{t-2}, ..., x_1) = p(x_{t} | x_{t-1})
\\
\text{in other words, given $x_{t-1}$, $x_t$ is independent of $x_{t-2}, x_{t-3}, ...$} 
$$

In reality, of course, each term depends on **all** preceding terms, so a probabilistic model of real-life text should factorise a sentence with the full chain rule: $p(x_1, ..., x_{T}) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ... p(x_{T} | x_{T-1}, ..., x_1)$. This quickly becomes intractable. If we choose the 2000 most common words and want to model a sentence of 10 words, the last factor alone is $p(x_{10} | x_9, x_8, ..., x_1)$; each $x$ has 2000 possible values (words), so we would need to consider $2000^{10}$ probabilities, which is too many: we probably don't even have sufficient data to estimate them. 

Obviously, the Markov property doesn't hold in all scenarios, e.g. in "I like green eggs and ham" and "I like to code in C++ and Python", "ham" and "Python" do not depend ONLY on the previous word, which is "and". 

**Property 2**: the sum of probabilities of all outgoing arrows from any state equals 1. 

We can simulate $n \rightarrow \infty$ steps of a random walk, which leads to a probability distribution (the probability of each state, or how often it was observed) called the Stationary Distribution (the equilibrium state), which no longer changes with time.
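
As a rough sketch of the stationary distribution, the code below repeatedly pushes a state distribution through a transition matrix until it stops changing. Note that this propagates the distribution directly instead of sampling a random walk, and the 2-state matrix `A` is a made-up example, not taken from the figure.

```python
def step(dist, A):
    """One step: new_dist[j] = sum_i dist[i] * A[i][j]."""
    n = len(A)
    return [sum(dist[i] * A[i][j] for i in range(n)) for j in range(n)]

def stationary_distribution(A, n_steps=1000):
    """Approximate the stationary distribution by repeated stepping."""
    dist = [1.0 / len(A)] * len(A)  # start from a uniform distribution
    for _ in range(n_steps):
        dist = step(dist, A)
    return dist

# Hypothetical transition matrix; each row sums to 1 (Property 2)
A = [
    [0.9, 0.1],
    [0.5, 0.5],
]

pi_stationary = stationary_distribution(A)
print(pi_stationary)                # close to [5/6, 1/6]
print(step(pi_stationary, A))       # one more step changes almost nothing
```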

<img src="Media/math/markov-chains.png">

### Markov Model

- Works with sequences of **states (S)**, e.g. categorical symbols, words;
  - e.g. if we are modelling weather, states can be "sunny", "rainy", "cloudy". 
- A Markov Model is described by two distributions: 
  1) ⭐ State Transition Matrix $A$
  2) ⭐ Initial State Distribution $\pi$

Markov Model can answer the following questions:
1. Given $A$ and $\pi$, and sequence $\{S_1, S_2, ..., S_T\}$, what's the probability of seeing that sequence?

Answer: 

$$
\large
P(S_{1...T}) = P(S_1) \prod_{t=2}^{T} P(S_t | S_{t-1}) = \pi_{S_1} \prod_{t=2}^{T} A_{S_{t-1}, S_t}
$$

2. Given the current state, what is the most likely next state (e.g. predict the next word)? 

**State (S)**
- Notation of state at time $t$: $S(t) = S_t$
  - $t \in \mathbb{Z}, t \ge 1$
- $M$ - total number of possible states.
  - E.g. $M = 3$ for states "sunny", "rainy", "cloudy"
- We use $i$ or $j$ to index the state space, e.g. $P(S_t = i)$ - means probability that state at time $t$ is $i$

**State distribution: $P(S_t)$**

- $P(S_t) = [P(S_t = 1), P(S_t = 2), ... , P(S_t = M)]$
  - E.g. $P(S_{\text{Monday}} = \text{rainy})$

**State transitions**

- $P(S_t = j | S_{t-1} = i)$, a.k.a. probability that the state at time $t$ is $j$, given that the state at time $(t-1)$ was $i$. 
- It's a conditional probability
- Since both $i$ and $j$ can take on values $\in [1, M]$, there are $M^2$ values for state transitions

⭐ **State Transition Matrix (A)**

$$
\large
A_{i j} = P(S_t = j | S_{t-1} = i), \forall i = 1...M, j = 1...M
$$
where
- $i$ = previous state
- $j$ = next state
- $A$ = an $M \times M$ matrix

Note that in this case, $A$ doesn't depend on $t$, which makes this a **time-homogeneous Markov process**.

Example:

<img src="Media/markov-chains-example-1.png" width=600>

Training $A$: each entry is the number of times we transition from state $i$ to state $j$, divided by the total number of times we were in state $i$:

$$
\large
\hat{A_{ij}} = \cfrac{count(i \to j)}{count(i)}
$$


**Initial State**

- Enables us to get the first state in a sequence (there is no previous state for this state).

⭐ **Initial State Distribution ($\pi$)**

$\pi_i = P(S_1 = i), i \in [1, M]$

Training $\pi$: 

$$
\large
\hat{\pi_{i}} = \cfrac{count(S_1 = i)}{N}
\\~\\
\text{ , where}
\\
\text{$\hat{\pi_{i}}$ - estimate of $\pi_{i}$ , }
\\
\text{$N$ - number of sequences in the dataset}
$$
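
The two training formulas and the sequence-probability formula can be sketched together as follows. The toy weather sequences are invented for illustration, the function names are hypothetical, and $count(i)$ is taken as the number of transitions leaving state $i$ (so the last state of each sequence is not counted).

```python
from collections import Counter, defaultdict

def fit_markov(sequences):
    """Estimate pi and A from a list of state sequences by counting."""
    first_counts = Counter(seq[0] for seq in sequences)
    pi = {state: c / len(sequences) for state, c in first_counts.items()}

    transition_counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transition_counts[prev][nxt] += 1

    A = {
        prev: {nxt: c / sum(counts.values()) for nxt, c in counts.items()}
        for prev, counts in transition_counts.items()
    }
    return pi, A

def sequence_probability(seq, pi, A):
    """P(S_1..S_T) = pi[S_1] * product of A[S_{t-1}][S_t]."""
    p = pi.get(seq[0], 0.0)
    for prev, nxt in zip(seq, seq[1:]):
        p *= A.get(prev, {}).get(nxt, 0.0)  # unseen transition -> 0
    return p

# Invented training data
sequences = [
    ["sunny", "sunny", "rainy"],
    ["sunny", "cloudy", "sunny"],
    ["rainy", "rainy", "sunny"],
    ["cloudy", "sunny", "sunny"],
]

pi, A = fit_markov(sequences)
print(pi)
print(A)
print(sequence_probability(["sunny", "sunny", "rainy"], pi, A))
```

Transitions that never occur in the training data get probability 0 here; a practical model would usually smooth these counts.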


Another example of Markov chains (a very simple one). Let's consider this simple sentence:

`You you like me like me like you me`. 

Consider every possible bigram, which represents which word comes after each other word, e.g. `you you`, `you like`, `like me`, etc., and represent the states / probabilities of each word transitioning into another word (including itself):

<img src="Media/markov-chains-1.jpg" width=300>

> Visualisation above made in Miro

We can even calculate it by hand. 

For instance, let's consider all bigrams beginning with `you`:
- `you you`
- `you like`
- `you me`

Each of these bigrams occurs once in the sentence, so the probabilities are:
- `you` -> `you`: 1/3
- `you` -> `like`: 1/3
- `you` -> `me`: 1/3
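
The hand calculation can be checked with a short script. This is an illustrative sketch; the variable names `bigram_counts` and `transition_probs` are assumptions.

```python
from collections import Counter, defaultdict

sentence = "You you like me like me like you me"
words = sentence.lower().split()

# Count each bigram (current word -> next word).
bigram_counts = defaultdict(Counter)
for cur, nxt in zip(words, words[1:]):
    bigram_counts[cur][nxt] += 1

# Normalise the counts of each current word into probabilities.
transition_probs = {
    cur: {nxt: c / sum(counts.values()) for nxt, c in counts.items()}
    for cur, counts in bigram_counts.items()
}

for cur, probs in transition_probs.items():
    print(cur, probs)
```

Besides the three equal probabilities for `you`, this gives `like` -> `me` with probability 2/3, `like` -> `you` with probability 1/3, and `me` -> `like` with probability 1.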


### Markov sentence generator 1

We will build a probabilistic model that generates sentences from an input string, where each unit (state) is a word.

In [81]:
"""
This is solution 1. A state is not associated with explicit probabilities. Instead, each word stores a list
of every word observed right after it (repeats included). Picking uniformly at random from that list
is equivalent to sampling the next word in proportion to how often it followed the current word.

Features:
- The first word is selected randomly.

A similar implementation can be found here: https://benhoyt.com/writings/markov-chain/
"""
import random
import copy
import json

# Inputs
text = """
To be, or not to be: that is the question:
Whether ’tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, ’tis a consummation
Devoutly to be wish’d. To die, to sleep;
To sleep: perchance to dream: ay, there’s the rub;
For in that sleep of death what dreams may come
When we have shuffled off this mortal coil,
Must give us pause: there’s the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor’s wrong, the proud man’s contumely,
The pangs of despised love, the law’s delay,
The insolence of office and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover’d country from whose bourn
No traveller returns, puzzles the will
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all;
And thus the native hue of resolution
Is sicklied o’er with the pale cast of thought,
And enterprises of great pith and moment
With this regard their currents turn awry,
And lose the name of action.–Soft you now!
The fair Ophelia! Nymph, in thy orisons
Be all my sins remember’d.
"""
# text = "you you like me like me like you me"

# How long a sentence you want to generate
seqlen = 100

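def proc_text(text: str) -> list:
    """Lowercase the text, replace selected punctuation with spaces, and split it into words."""
    text = text.strip().lower()
    # Only these characters are removed; others (e.g. ';', '?', '’') stay attached to the words
    for char in '\n,\':.':
        text = text.replace(char, ' ')
    # split() without arguments splits on runs of whitespace and drops empty strings
    return text.split()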

text_proc = proc_text(text)


def generate_markov_dict(markov_dict_init: dict,
                         text_proc: list
                         ) -> dict:
    """Map each word to the list of words that follow it (repeats are kept)."""
    # Work on a copy so that the caller's dictionary is not mutated
    markov_dict_output = copy.deepcopy(markov_dict_init)
    # Pair every word with its successor; the last word has no successor
    for word, next_word in zip(text_proc, text_proc[1:]):
        markov_dict_output.setdefault(word, []).append(next_word)
    return markov_dict_output

markov_dict = generate_markov_dict(dict(),
                                   text_proc)
# Save the dictionary as JSON (create the output directory if it is missing)
import os
os.makedirs('output', exist_ok=True)
with open('output/markov_generate_sentence_1.json', 'w', encoding='utf-8') as f:
    json.dump(
        markov_dict,
        f,
        indent=4,
        ensure_ascii=False  # keep non-ASCII characters (e.g. ’) readable in the JSON
    )


output_seq = ''
# # first choice of the word is completely random
# word_choice = random.choice(list(markov_dict.keys()))
# output_seq += f"{word_choice.capitalize()} "

def get_next_state(markov_dict, cur_state: str) -> str:
    """
    Given a Markov dictionary that maps each state to the list of observed next states,
    and the current state, return a randomly sampled next state.
    Return '.' if the current state has no recorded successors.
    """
    # After a dead end ('.'), restart from a random known state
    if cur_state == '.':
        cur_state = random.choice([i for i in markov_dict if i != '.'])
    try:
        return random.choice(markov_dict[cur_state])
    except KeyError:
        return '.'

# The first word is chosen uniformly at random from all known states
word_choice = random.choice(list(markov_dict.keys()))
output_seq += f"{word_choice.capitalize()} "

# Each following word is sampled from the successors of the previous one,
# until the sequence contains seqlen words
for _ in range(seqlen - 1):
    word_choice = get_next_state(markov_dict, word_choice)
    output_seq += f"{word_choice} "


# The output follows the same word-to-word (bigram) statistics as the text
# the Markov model was trained on, but it carries no meaning
output_seq

'More; and the spurns that flesh is sicklied o’er with a sea of outrageous fortune or to grunt and thus conscience does make cowards of outrageous fortune or not of? thus the whips and scorns of office and by a sleep to grunt and thus conscience does make with a weary life but that sleep to sleep of so long life; for who would fardels bear those ills we have shuffled off this mortal coil must give us all; and by a consummation devoutly to take arms against a sleep to be wish’d to die to ’tis a bare '
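
The successor lists above store repeated words instead of explicit probabilities. As a hedged sketch of the equivalent probability view, one such list can be converted to a probability table and sampled with `random.choices`. The helper names `to_probabilities` and `sample_next`, and the toy successor list, are hypothetical.

```python
import random
from collections import Counter

def to_probabilities(successors):
    """Turn a list of observed successors (with repeats) into {word: probability}."""
    counts = Counter(successors)
    total = len(successors)
    return {word: c / total for word, c in counts.items()}

def sample_next(probs, rng=random):
    """Sample one successor according to its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Toy successor list in the same shape as one entry of the Markov dictionary.
successors = ["be", "be", "sleep", "die"]
probs = to_probabilities(successors)
print(probs)
print(sample_next(probs))
```

Sampling uniformly from the repeated list and sampling from the weighted table give the same distribution; the list form is simpler to build, while the table form is more compact when a word has many repeated successors.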

In [None]:
text_proc[:10]

['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

In [None]:
"""
This is solution 2.
"""


## Monte Carlo simulations

# Probability problems

> A new medical test for a virus has been created. 1% of the population has the virus. 99% of sick people with the virus test positive, indicating they have the virus. 99% of healthy individuals test negative for the virus. If a patient tested positive, what is the probability that they have the virus?

There are two approaches to the solution.

**Approach 1**

We reason directly: among all positive tests, what fraction comes from people who actually have the virus?
- Probability of having the virus and testing positive: $a = 0.01 \times 0.99$
- Probability of testing positive (true positives plus false positives): $b = 0.01 \times 0.99 + 0.99 \times 0.01$

$p = \cfrac{a}{b} = \cfrac{0.01 \times 0.99}{0.01 \times 0.99 + 0.99 \times 0.01} = 0.5$

**Approach 2**

Using Bayes' theorem: $p(A \mid B) = \cfrac{p(B \mid A) \, p(A)}{p(B)}$

$p(\text{virus} \mid \text{positive}) = \cfrac{ p(\text{positive} \mid \text{virus}) \times p(\text{virus}) }{p(\text{positive})} = \cfrac{0.99 \times 0.01 }{0.01 \times 0.99 + 0.99 \times 0.01} = 0.5$
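
The same calculation can be checked with exact fractions. This is a minimal sketch; the function name `posterior` and its parameter names (`prior`, `sensitivity`, `specificity`) are assumptions introduced for illustration.

```python
from fractions import Fraction

def posterior(prior, sensitivity, specificity):
    """P(virus | positive) via Bayes' theorem.

    P(positive) is expanded with the law of total probability:
    true positives plus false positives.
    """
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

# 1% prevalence, 99% of sick people test positive, 99% of healthy people test negative.
p = posterior(Fraction(1, 100), Fraction(99, 100), Fraction(99, 100))
print(p, float(p))  # 1/2 0.5
```

The result is exactly 1/2 because the true positives (1% of people, 99% detected) and the false positives (99% of people, 1% misclassified) are equally numerous.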


> There are 10 blue, 9 red and 6 green markers in a box. 
> 
> Two markers are chosen at random. 
> 
> Find the probability that one blue and one red marker will be chosen.

$p(\text{one blue, one red}) = p(\text{blue, then red}) + p(\text{red, then blue}) = \cfrac{10}{25} \times \cfrac{9}{24} + \cfrac{9}{25} \times \cfrac{10}{24} = 0.3$
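
As a sanity check, the result can be reproduced by enumerating every possible pair of markers. This is an illustrative sketch; the variable names are assumptions.

```python
from fractions import Fraction
from itertools import combinations

markers = ["blue"] * 10 + ["red"] * 9 + ["green"] * 6

# Every unordered pair of distinct markers is equally likely.
pairs = list(combinations(range(len(markers)), 2))
favourable = sum(
    1 for a, b in pairs if {markers[a], markers[b]} == {"blue", "red"}
)

p_enumerated = Fraction(favourable, len(pairs))
p_formula = Fraction(10, 25) * Fraction(9, 24) + Fraction(9, 25) * Fraction(10, 24)
print(p_enumerated, p_formula)  # 3/10 3/10
```

There are 300 possible pairs, of which $10 \times 9 = 90$ contain one blue and one red marker, which gives the same value of 0.3.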
