# Some approximations (Mental maths)
### Approximating logs
Commonly used natural logarithms
- $\ln{2}\approx 0.69$
- $\ln{3}\approx 1.10$
- $\ln{5}\approx 1.61$
- $\ln{7}\approx 1.95$

General approach to approximating base-$b$ logarithms of $x$ ($\log_{b}{x}$):
- Change of base: $\log_{b}{x}=\ln{x}/\ln{b}$
- $\log_b({1+\epsilon})\approx \frac{\epsilon}{\ln{b}}$ for small $\epsilon$.

### Number of digits in a large power
General formula for the number of digits of $a^b$ in base 10:
\begin{equation*}
\lfloor\log_{10}{a^b}\rfloor + 1 = \lfloor b\log_{10}{a}\rfloor + 1
\end{equation*}
Some good to know log base 10 values:
- $\log_{10}{2}\approx 0.30$
- $\log_{10}{3}\approx 0.48$
- $\log_{10}{5}\approx 0.70$
- $\log_{10}{7}\approx 0.85$
- $\log_{10}{11}\approx 1.04$
- $\log_{10}{13}\approx 1.11$
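A minimal sketch of the digit count, assuming the count is taken as $\lfloor b\log_{10}{a}\rfloor + 1$; the function name `num_digits_of_power` is made up for illustration. It compares the logarithm shortcut against Python's exact big-integer arithmetic.

```python
import math


def num_digits_of_power(a: int, b: int) -> int:
    """Number of base-10 digits of a**b (a >= 1, b >= 0) via logarithms."""
    return math.floor(b * math.log10(a)) + 1


# Mental version: log10(2) ~ 0.30, so 2**100 has floor(30.1) + 1 = 31 digits.
print(num_digits_of_power(2, 100), len(str(2**100)))
print(num_digits_of_power(7, 50), len(str(7**50)))
```

Floating point rounding can matter when $a^b$ is an exact power of 10, so treat this as a sanity check rather than a robust routine.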


### Number of zeroes in a factorial in base 10, base 2, etc.
Given a base $b$, the number of trailing zeroes of $n!$ written in base $b$ is the largest $m$ such that $b^m$ divides $n!$.

General steps:
- Break your base $b$ into its prime factorization $b=p_1^{a_1}...p_k^{a_k}$.
- For each $p_j$, find the number of times $p_j$ divides $n!$ using the sum $\lfloor n/p_j\rfloor + \lfloor n/p_j^2\rfloor + ...$ (Legendre's formula), then floor-divide the total by $a_j$.
- The number of trailing zeroes is then the minimum of all the results from the previous step.

Example 1: Base 10
The limiting factor is clearly 5 in this case. So we need to find the number of times $5$ divides $n!$. 

Example 2: Base 2
The only limiting factor is 2, so find the number of times 2 divides $n!$.

Example 3: Base 6
The limiting factor is clearly 3, so find the number of times 3 divides $n!$.
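The general steps can be sketched as follows. This is an illustrative implementation, not a reference one: the helper names are made up, the prime factorization uses plain trial division, and the prime-power count uses Legendre's formula ($\lfloor n/p\rfloor + \lfloor n/p^2\rfloor + ...$).

```python
def prime_factorization(b: int) -> dict[int, int]:
    """Return {prime: exponent} for b >= 2 using trial division."""
    factors: dict[int, int] = {}
    p = 2
    while p * p <= b:
        while b % p == 0:
            factors[p] = factors.get(p, 0) + 1
            b //= p
        p += 1
    if b > 1:
        factors[b] = factors.get(b, 0) + 1
    return factors


def prime_power_in_factorial(n: int, p: int) -> int:
    """Exponent of the prime p in n! (Legendre's formula)."""
    total, power = 0, p
    while power <= n:
        total += n // power
        power *= p
    return total


def trailing_zeroes(n: int, base: int) -> int:
    """Trailing zeroes of n! written in the given base."""
    return min(
        prime_power_in_factorial(n, p) // a
        for p, a in prime_factorization(base).items()
    )


print(trailing_zeroes(100, 10))  # limited by the factor 5
print(trailing_zeroes(10, 2))    # limited by the factor 2
print(trailing_zeroes(10, 6))    # limited by the factor 3
```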

### Distributions/probabilities using the central limit theorem
Binomial distribution: Suppose we have a binomial distribution with $n$ trials and probability of success $p$. Since a binomial RV is the sum of $n$ independent Bernoulli RVs, it tends towards a normal distribution for large $n$, in particular $\mathcal{N}(np,np(1-p))$. 

Poisson distribution: Suppose we have a Poisson variable with rate $\lambda$. By the additivity property of Poisson RVs, for integer $\lambda$ we can write $Poi(\lambda) = \sum_{i=1}^{\lambda} Poi(1)$ as a sum of independent $Poi(1)$ RVs. Thus, for large $\lambda$, we can approximate $Poi(\lambda)$ by $\mathcal{N}(\lambda,\lambda)$.

Negative binomial distribution: Suppose we have a negative binomial (counts the number of total trials) with the number of required successes $r$ and probability $p$ of success. Since a negative binomial is essentially the sum of $r$ independent $Geom(p)$ RVs, when $r$ is large we can approximate it by $\mathcal{N}(\frac{r}{p},\frac{r(1-p)}{p^2})$.

Hypergeometric distribution: Suppose we have a hypergeometric RV with population $N$, successful population $K$, and number of draws $n$. For a very large $N$, the effect of drawing without replacement is negligible, so we can approximate the hypergeometric distribution by $Binom(n,p=\frac{K}{N})$. From here, if $n$ is large while being significantly smaller than $N$ (i.e. $n\ll N$), we can approximate with $\mathcal{N}(n\frac{K}{N},n\frac{K}{N}\frac{N-K}{N})$ or more accurately $\mathcal{N}(n\frac{K}{N},n\frac{K}{N}\frac{N-K}{N}\frac{N-n}{N-1})$. 

Chi square: Suppose we have a chi-square RV with k degrees of freedom. Since it is defined as the sum of $k$ squared standard normal RVs, we can approximate for large $k$ with $\mathcal{N}(k,2k)$.

Example problems:
- Flip a fair coin 10000 times. What is the probability that we see $\leq 5100$ heads? (Answer: $\Phi(2)\approx 0.975$, since $\sigma=50$ puts 5100 at $2\sigma$ above the mean; $\Phi()$ is the cdf of the standard normal)
- Cars arrive at average rate $\lambda=10000$ per day. Find the probability that we see $\leq 9800$ cars. (Answer: 0.025)
- A call center resolves calls 50% of the time. The call center stops work once they resolve 5000 calls. What is the probability that they go through less than 10200 calls that day? (Answer: 0.975)
- A factory produces 10 million products, with 50% of them being defective. An inspector randomly picks out 40000 of them. What is the probability that the inspector sees more than 20200 defective products? (Answer: 0.025)
- A factory produces metal beams. We sample 5000 of them and find the sample variance of length to be 0.05. Provide a 95% confidence interval for the true variance. 
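The example problems can be checked numerically with a small sketch like the one below. It uses the normal cdf from `math.erf` with no continuity correction; the helper names are made up, and the chi-square interval assumes normally distributed beam lengths so that $(n-1)s^2/\sigma^2$ is chi-square with $n-1$ degrees of freedom, approximated by $\mathcal{N}(n-1, 2(n-1))$ with a $\pm 2$ standard deviation band.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clt_cdf(x: float, mean: float, variance: float) -> float:
    """P(X <= x) when X is approximated by N(mean, variance)."""
    return normal_cdf((x - mean) / math.sqrt(variance))


# Binomial(10000, 0.5): P(heads <= 5100)
coin = clt_cdf(5100, 10_000 * 0.5, 10_000 * 0.5 * 0.5)
# Poisson(10000): P(cars <= 9800)
cars = clt_cdf(9800, 10_000, 10_000)
# Negative binomial, r = 5000, p = 0.5: P(calls < 10200)
calls = clt_cdf(10_200, 5000 / 0.5, 5000 * 0.5 / 0.5**2)
# Hypergeometric with finite population correction: P(defects > 20200)
n, N, K = 40_000, 10_000_000, 5_000_000
hyper_var = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
defects = 1.0 - clt_cdf(20_200, n * K / N, hyper_var)

# Chi-square: rough 95% interval for the true variance (assumption: normal data)
dof, sample_var = 5000 - 1, 0.05
half_width = 2 * math.sqrt(2 * dof)
var_low = dof * sample_var / (dof + half_width)
var_high = dof * sample_var / (dof - half_width)

print(coin, cars, calls, defects)
print(var_low, var_high)
```

The printed probabilities differ slightly from 0.975 and 0.025 because the 68-95-99.7 rule rounds the two-sigma mass to 95%.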

### Sums
Known sums:
- Arithmetic sum: $\frac{n}{2}(2a+(n-1)*d)$ where $a$ is the first term and $d$ is the common difference. The sequence has $n$ terms total.
- Sum of squares: $\sum_{i=1}^ni^2=\frac{n(n+1)(2n+1)}{6}$
- Sum of cubes: $\sum_{i=1}^ni^3=\frac{(n(n+1))^2}{4}$ 
- Geometric sum: $\frac{a(1-r^n)}{1-r}$ where $a$ is the first term and $r$ is the common ratio. The sequence has $n$ terms total.
- Infinite geometric sum: $\frac{a}{1-r}$ where $a$ is the first term and $|r|<1$ is the common ratio.
- Derivative of infinite geometric sum: $\sum_{i=1}^\infty i x^{i-1}=\frac{d}{dx}\sum_{i=0}^\infty x^{i} = \frac{1}{(1-x)^2}$ for $|x|<1$
- Infinite geometric sum with increasing coefficient: $\sum_{i=1}^\infty i x^i = \frac{x}{(1-x)^2}$
- Integral of infinite geometric sum: $\sum_{i=1}^\infty \frac{x^{i}}{i}=\int_0^x \sum_{i=1}^\infty t^{i-1}dt = -\ln{(1-x)}$ for $|x|<1$

Integral approach examples
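- Harmonic sum: $\sum_{i=1}^{n}\frac{1}{i}\approx \int_{1}^{n}\frac{1}{x}dx=\ln{n}$. The integral is an underestimate (the gap tends to $\gamma\approx 0.58$).
- Sum of roots: $\sum_{i=1}^{n}\sqrt{i}\approx \int_{1}^{n}\sqrt{x}dx = \frac{2}{3}(n^{3/2}-1)$. The integral is also an underestimate.

Identifying if your integral approximation is an over or under estimate:
- For a positive monotone $f$, $\int_1^n f(x)dx$ always underestimates $\sum_{i=1}^n f(i)$: the sum exceeds it by at least $f(1)$ if $f$ is increasing and by at least $f(n)$ if $f$ is decreasing.
- Adding the endpoint correction $\frac{f(1)+f(n)}{2}$ tightens the estimate. If your function $f$ is convex (increasing gradient) over the integration interval, then the corrected value is still an underestimate.
- If your function $f$ is concave (decreasing gradient) over the integration interval, then the corrected value is an overestimate.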
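To see the direction of the error for the two integral examples, a quick numerical comparison helps. This is a sketch with made-up variable names; the "corrected" values add the endpoint term $\frac{f(1)+f(n)}{2}$ (the trapezoid-style correction) to each integral.

```python
import math

n = 100

# Exact sums
harmonic_exact = sum(1.0 / i for i in range(1, n + 1))
roots_exact = sum(math.sqrt(i) for i in range(1, n + 1))

# Plain integral approximations over [1, n]
harmonic_integral = math.log(n)
roots_integral = (2.0 / 3.0) * (n**1.5 - 1.0)

# Integral plus endpoint correction (f(1) + f(n)) / 2
harmonic_corrected = harmonic_integral + 0.5 * (1.0 + 1.0 / n)
roots_corrected = roots_integral + 0.5 * (1.0 + math.sqrt(n))

print(harmonic_exact, harmonic_integral, harmonic_corrected)
print(roots_exact, roots_integral, roots_corrected)
```

For $n=100$ both plain integrals come out below the true sums; the corrected value lands below the harmonic sum ($1/x$ is convex) and above the root sum ($\sqrt{x}$ is concave).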

### Approximating using taylor expansions
Common taylor expansions:
- Maclaurins:
    - $\sin{x} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - ...$
    - $\cos{x} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - ...$
    - $\arctan{x} = x - \frac{x^3}{3} + \frac{x^5}{5} - ...$
    - $e^x = \sum_{n=0}^\infty \frac{x^n}{n!}$
    - $a^x = \sum_{n=0}^\infty \frac{(x\ln{(a)})^n}{n!}$
    - $a^{-x} = \sum_{n=0}^\infty \frac{(-x\ln{(a)})^n}{n!}$
    - $\ln({1 + x})=\sum_{n=1}^\infty (-1)^{n+1}\frac{x^n}{n}$ 

- Centered around 1:
    - $\sqrt{x} = 1 + \frac{1}{2}(x-1) - \frac{1}{8}(x-1)^2 + \frac{1}{16}(x-1)^3 + ...$
    - $x^{1/n} = 1 + \frac{1}{n}(x-1)-\frac{1}{2n}(1-\frac{1}{n})(x-1)^2 + ...$
    - $x^x = 1 + (x-1) + (x-1)^2 + \frac{1}{2}(x-1)^3+...$
    - $x^{1/x} = 1 + (x-1) - (x-1)^2 + \frac{1}{2}(x-1)^3 + ...$

Values of the above can be approximated to arbitrary accuracy, within the radius of convergence of each series, by truncating the Taylor expansion to $k$ terms.

### Approximating normal distribution probabilities (68-95-99.7 rule and others)
The 68-95-99.7 rule for normally distributed random variables states that $68\%$ of realizations lie within 1 standard deviation of the mean, $95\%$ within 2 standard deviations and $99.7\%$ within 3 standard deviations.

Let's say we want to find the probability that the realization of $Z\sim\mathcal{N}(0,1)$ is within $k$ of 0 where $k$ is small. Then we can use the following approximation:
\begin{array}{rl}
\mathbb{P}[|Z|\leq k] &= \displaystyle 2\int_0^k\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} dx \\\\
&= \displaystyle \sqrt{\frac{2}{\pi}}\int_0^k 1 + O(x^2) dx \text{ (Taylor expansion)} \\\\
&\approx \displaystyle \sqrt{\frac{2}{\pi}}\int_0^k 1 dx = \sqrt{\frac{2}{\pi}}k\\\\
&\approx 0.8 k
\end{array}
The above works for small $k$ since the $O(x^2)$ component of the integrand contributes only $O(k^3)$ to the integral.

Now let's consider another situation, where instead we want the probability that the realization of $Z\sim\mathcal{N}(0,1)$ is outside the range $\pm k$ of 0, with $k$ being large. 
\begin{array}{rl}
\mathbb{P}[|Z| > k] &= \displaystyle 2\int_k^\infty\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} dx \\\\
&= \displaystyle 2\int_k^\infty\phi(x)dx\\\\
&= \displaystyle 2\int_k^\infty\frac{-\phi'(x)}{x}dx\\\\
&= \displaystyle 2(\frac{\phi(k)}{k}-\int_k^\infty\frac{\phi(x)}{x^2}dx) \text{ (By parts integration)}\\\\
&\approx \displaystyle 2\frac{\phi(k)}{k} \text{ (Leading contribution)} \\\\
&\displaystyle= \sqrt{\frac{2}{\pi}}\frac{e^{-k^2/2}}{k} \approx 0.8\frac{e^{-k^2/2}}{k}
\end{array}
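Both approximations can be compared with exact values using the error function. A minimal sketch, with made-up function names: the small-$k$ rule is $\sqrt{2/\pi}\,k\approx 0.8k$ and the large-$k$ rule is the leading contribution $2\phi(k)/k$ from the integration by parts.

```python
import math


def central_exact(k: float) -> float:
    """P(|Z| <= k) for a standard normal Z."""
    return math.erf(k / math.sqrt(2.0))


def central_small_k(k: float) -> float:
    """Small-k approximation sqrt(2/pi) * k, roughly 0.8 * k."""
    return math.sqrt(2.0 / math.pi) * k


def tail_exact(k: float) -> float:
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2.0))


def tail_large_k(k: float) -> float:
    """Large-k approximation 2 * phi(k) / k, with phi the standard normal pdf."""
    phi_k = math.exp(-k * k / 2.0) / math.sqrt(2.0 * math.pi)
    return 2.0 * phi_k / k


for k in (0.1, 0.25, 0.5):
    print(k, central_exact(k), central_small_k(k))
for k in (2.0, 3.0, 4.0):
    print(k, tail_exact(k), tail_large_k(k))
```

The tail approximation always overshoots slightly, because the dropped integral term is positive, and the relative error shrinks as $k$ grows.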

### Calculating variance quickly
Generally speaking, it is often easier/quicker to find the variance via its raw moments, using the following formula:
\begin{equation*}
    Var(X) = \mathbb{E}[X^2] -\mathbb{E}[X]^2
\end{equation*}
Take for instance the variance of the product $XY$ of two independent dice rolls $X$ and $Y$. We can use the above, together with independence, and easily see that:
\begin{equation*}
    Var(XY) = \mathbb{E}[X^2Y^2] - \mathbb{E}[XY]^2 = \frac{(91)^2}{36} - 3.5^4 \approx 80
\end{equation*}
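The dice example can be verified exactly by enumerating the 36 outcomes. A short sketch using exact fractions (variable names are illustrative):

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
products = [x * y for x, y in product(faces, repeat=2)]  # all 36 outcomes

mean = Fraction(sum(products), len(products))
second_moment = Fraction(sum(v * v for v in products), len(products))
variance = second_moment - mean**2  # Var = E[(XY)^2] - E[XY]^2

print(mean, second_moment, variance, float(variance))
```

The second raw moment factorizes as $\mathbb{E}[X^2]\mathbb{E}[Y^2]=(91/6)^2$ because the two rolls are independent, which is what makes the mental calculation quick.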

Similarly, on the off chance you need to do kurtosis and skew calculations, you can use the following:
\begin{array}{rl}
 Skew(X) &=\displaystyle \frac{\mathbb{E}[X^3]-3\mu\mathbb{E}[X^2]+2\mu^3}{\sigma^3} \\\\
 Kurtosis(X) &= \displaystyle \frac{\mathbb{E}[X^4]-4\mu\mathbb{E}[X^3]+6\mu^2\mathbb{E}[X^2]-3\mu^4}{\sigma^4}
\end{array}

### Approximating the geometric average of two numbers
Suppose we have numbers $a$ and $b$. Their geometric mean $\sqrt{ab}$ can be approximated with the following:
- If $a$ and $b$ are close, then use their arithmetic mean.
- For $a$ and $b$ that are far apart such that $b = ka$, the GM is $a\sqrt{k}$, so we only need a rough value of $\sqrt{k}$.
- The GM should always be closer to the smaller value of the two.

### Intuition behind approximating square roots
Since mentally calculating the taylor expansion around 1 for $\sqrt{x}$ is not always feasible, we can use the following heuristics: 
For $x<1$:
- Slope is steep closer to 0
- Slope flattens closer to 1
- Use linear approximations:
\begin{cases}
    \sqrt{x} \approx 10x & 0 < x \leq 0.01 \\\\
    \sqrt{x} \approx \frac{5}{3}x + \frac{1}{12} & 0.01 < x \leq 0.25  \\\\
    \sqrt{x} \approx \frac{1}{3}(2x+1)& 0.25 < x < 1
\end{cases}
- General trend: $\sqrt{x}>x$ with the difference being bigger for smaller values.
For $x>1$: We can rewrite $\sqrt{x}$ as $\sqrt{N+\delta}$ where $N$ is the closest perfect square and $\delta = x-N$. Then a first-order expansion gives
\begin{equation*}
\sqrt{x} \approx \sqrt{N} + \frac{x-N}{2\sqrt{N}}
\end{equation*}
E.g. $\sqrt{110} \approx 10 + \frac{1}{2}$ with the actual value being 10.49. $\sqrt{220} \approx 15 - \frac{1}{6}\approx 14.83$ (using $N=225$) with the actual value being 14.83.
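The heuristics can be sketched in code to see how far off they are. The function names are made up; the piecewise lines for $x<1$ are the chords through $(0,0)$, $(0.01,0.1)$, $(0.25,0.5)$ and $(1,1)$, and the $x>1$ branch expands around the nearest perfect square.

```python
import math


def sqrt_below_one(x: float) -> float:
    """Piecewise-linear approximation of sqrt(x) for 0 < x < 1."""
    if not 0 < x < 1:
        raise ValueError("expected 0 < x < 1")
    if x <= 0.01:
        return 10 * x
    if x <= 0.25:
        return (5 / 3) * x + 1 / 12
    return (2 * x + 1) / 3


def sqrt_above_one(x: float) -> float:
    """First-order approximation of sqrt(x) around the nearest perfect square."""
    if x < 1:
        raise ValueError("expected x >= 1")
    r = math.isqrt(math.floor(x))
    if (r + 1) ** 2 - x < x - r**2:  # the next square up is closer
        r += 1
    return r + (x - r * r) / (2 * r)


for x in (0.005, 0.1, 0.64):
    print(x, sqrt_below_one(x), math.sqrt(x))
for x in (110, 220):
    print(x, sqrt_above_one(x), math.sqrt(x))
```

Since $\sqrt{x}$ is concave, the chords sit below the curve and the tangent-style estimate for $x>1$ sits above it.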

### Approximating arc lengths of known curves
General approach/heuristic: Suppose we are given some function $f(x)$ and bounds $[a,b]$ to calculate an arc length over.
- If $f'(x)^2$ is likely to be very large (>>1) over the interval, then we can roughly approximate with $f(b)-f(a)$. E.g. for $x^x$, we can approximate the arc length between 2 and 3 to be 23.
- In the case that $(f'(x))^2$ is large, easy to compute and of the form $g(x)^2$ with $g(x)$ easily integrable, we can approximate with $\int_a^b g(x) dx$. E.g. the arc length of $e^x$ between 2 and 3 is approximately $e^3-e^2\approx 12.7$.
- If $f'(x)^2$ is likely to be small (<<1) over the interval, then we can approximate by taylor expansion: $(b-a) + \frac{1}{2}\int_a^b(f'(x))^2 dx$. But generally $(b-a)$ is a good enough approximation.
- If $f'(x)^2$ is likely close to 1 over the interval, we can approximate using $\sqrt{(b-a)^2+(f(b)-f(a))^2}$ 

### Good to remember values for coins/cards/dice
Dice:
- Single dice events:
    - EV of one dice: 3.5
    - Variance of one dice: 2.92
    - Sum of possible outcomes: 21
- Two dice events:
    - EV of product of two dice: 12.25
    - Variance of product of dice: 79.97
    - Probability of seeing 1
    - Probability of seeing a sum 7
    - Probability of seeing sum < 7
    - Probability of seeing prime sum
    - Probability of seeing consecutive numbers
    - Probability of seeing a sum divisible by 3
- Infinite rolls:
    - EV of rolls till the sum exceeds 10
    - EV of rolls to see all sides
    - EV of rolls given 1 roll = 2 tosses till you see all sides
    - EV of rolls given 1 roll = 3 tosses till you see all sides
    - Variance of rolls to see all sides
    - Variance of rolls given 1 roll = 2 tosses till you see all sides
    - Variance of rolls given 1 roll = 3 tosses till you see all sides
    - EV rolls to see a sum $\geq 7$
    - EV rolls (two at a time) to see odd only pair.

Cards:
- Single card events:
    - EV of card from a regular deck, ace = 1: $7$
    - EV of card from a regular deck, ace = 14: $8$
    - EV of card from deck with negative JKQA, ace = 14: 
    - EV of card with 6 and 10 missing, ace = 1: $7 - 2/11=6+9/11$
    - EV of card with 6 and 10 missing, ace = 14: $8 + 0/11$
    - EV of card with 4 swapped with 10, ace = 1: $7+6/13$
    - EV of card with 4 swapped with 10, ace = 14: $8+6/13$
    - EV of card from even only deck, ace = 1: 
    - EV of card from even only deck, ace = 14: 
    - EV of card from odd only deck, ace = 1: 
    - EV of card from odd only deck, ace = 14: 
    - EV of card with values multiplied by 10 and negative JKQA, ace = 14
- Two draws:
    - Probability of at least one JKQ
    - Probability of at least one JKQ with hearts or spades
    - Probability of ace
    - Probability of red only
    - Probability of red and black
    - Probability of seeing consecutive values
    - EV number of black cards
    - EV number of even cards
    - Probability that the hand sum exceeds 20 given ace = 14.
    - Probability that the hand sum < 10 given ace = 14.
    - Probability of forming a pair.
- 3 draws:
    - 
- Infinite draws:
    - Number of draws with replacement until you see all 4 suits
    - Number of draws without replacement until you see all 4 suits
    - EV draws until the hand sum exceeds 30
    - 
- Others:
    - Probability of a flush/straight/full house/four of a kind/three of a kind/two pair/pair.
    - Probability of getting 21 in blackjack.
    - Probability of busting after hitting once on a starting hand of 15 in blackjack.
    - EV and variance of the blackjack hand if you stop $\geq$ 17.
    - EV and variance of the draws till a bust in blackjack
    - Probability of reaching a hand-sum exceeding n before drawing an Ace.
    - Probability that the 4th Ace is beyond position i in the deck.
    - Probability that the first spade is in position m
    - Probability that the first red card appears before the first black card.
    - Probability that no two Aces are adjacent in a shuffled deck.
    - Probability of seeing a J, K or Q before the first ace.
    - Probability that a suited card appears immediately before each Ace.
    - Probability that the first three cards are all different suits.

Coin tosses:
- 2 Tosses:
    - Probability of HH: $1/4$
    - Probability of TH or HT: $1/2$
- 3 Tosses:
    - Probability of seeing H>T: $1/2$
    - Probability of seeing H only: $1/8$
    - Probability of seeing one H: $3/8$
    - Probability of seeing an even number of H: $1/2$
    - Probability of seeing alternating sequences: $1/4$
    - Probability of last(first) coin being H: $1/2$
- Infinite tosses (basic markov chains)
    - Expected tosses till HH: $6$
    - Expected tosses till HT (equivalently TH): $4$
    - Expected tosses till HHH: $14$
    - Expected tosses till THH: 
    - Expected tosses till HTH: 
    - Probability of HH before TT: 
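For the waiting-time entries in the infinite tosses list, one shortcut is the prefix/suffix overlap rule (often attributed to Conway): for a fair coin, add $2^k$ for every $k$ where the first $k$ symbols of the pattern equal its last $k$ symbols. This is a swapped-in technique rather than the Markov chain setup the heading refers to, and the function name is made up for illustration.

```python
def expected_tosses_until(pattern: str) -> int:
    """Expected fair-coin tosses until `pattern` (e.g. "HTH") first appears."""
    return sum(
        2**k
        for k in range(1, len(pattern) + 1)
        if pattern[:k] == pattern[-k:]  # prefix of length k matches suffix
    )


# Values listed above: HH -> 6, HT -> 4, HHH -> 14
for pattern in ("HH", "HT", "HHH", "THH", "HTH"):
    print(pattern, expected_tosses_until(pattern))
```

Patterns that overlap with themselves (like HH or HTH) take longer on average than patterns of the same length that do not (like HT or THH).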

### Bet sizing
**Kelly criterion**: Suppose we have a bet to make with probability of winning $p$, probability of loss $q$ and $b$ odds (return per unit bet). Then the kelly fraction $f^*$ which is the proportion of our wealth to bet to maximize our expected log utility is given by:
\begin{equation*}
    f^*=p-\frac{q}{b}
\end{equation*}

**Fractional kelly criterion**: Fractional kelly can be used when we are more risk-averse, for instance in the case where we are not entirely certain about our win and loss probabilities. This is simply the process of multiplying the kelly fraction $f^*$ by a value $u\in[0,1]$, where $u=1$ is the regular kelly and the smaller $u$ gets the more uncertain you are about your parameters.

**Dealing with fixed lot sizes**: Suppose we are in a situation where bets come in minimum increments of \$1 and we only have \$10. In this case, the kelly fraction will often yield an amount we can't bet. If we denote our wealth by $N$ units and the units we bet by $k$, then we check the expected log utility below for each candidate $k$ (the integers from rounding the kelly amount up vs rounding it down) to find the best fixed lot to bet.
\begin{equation*}
p\ln\left(1+b\frac{k}{N}\right)+(1-p)\ln\left(1-\frac{k}{N}\right)
\end{equation*}
Another general heuristic, compared to brute forcing the above, is to just round down to the nearest integer (or other smallest increment) since it's safer to underbet (this just follows the fractional kelly criterion) than to overbet and potentially lose your wealth.
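A minimal sketch of the fixed-lot check, assuming a binary bet with win probability $p$ and net odds $b$. The function names are illustrative, and instead of only comparing the two roundings it simply scans every $k$ from 0 to $N-1$, which is cheap for small $N$ and gives the same answer because the expected log utility is concave in $k$.

```python
import math


def kelly_fraction(p: float, b: float) -> float:
    """Continuous Kelly fraction f* = p - q / b."""
    return p - (1.0 - p) / b


def expected_log_growth(k: int, wealth_units: int, p: float, b: float) -> float:
    """Expected log utility of betting k out of wealth_units units."""
    f = k / wealth_units
    return p * math.log(1.0 + b * f) + (1.0 - p) * math.log(1.0 - f)


def best_lot(wealth_units: int, p: float, b: float) -> int:
    """Integer number of units (0 .. wealth_units - 1) with the highest expected log utility."""
    return max(
        range(wealth_units),
        key=lambda k: expected_log_growth(k, wealth_units, p, b),
    )


# $10 of wealth, $1 increments
print(kelly_fraction(0.6, 1.0), best_lot(10, 0.6, 1.0))
print(kelly_fraction(0.55, 1.0), best_lot(10, 0.55, 1.0))
print(kelly_fraction(0.4, 1.0), best_lot(10, 0.4, 1.0))
```

Betting all $N$ units is excluded from the scan since $\ln(0)$ is undefined, which matches the idea that a full-wealth bet risks ruin.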

**Tackling trading edge given bid-ask quotes and bid sizing**: Let's suppose we are in a situation where, instead of knowing exact probabilities of winning and losing, we are given a market (bid and ask) where we know the fair price (EV) and the variance of the settlement value. We can deal with this by calculating the fraction $f^*$ of our wealth to bet which maximizes our log utility given our edge. Let $V$ be the random settlement value, with fair value $\mu=\mathbb{E}[V]$ and variance $\sigma^2$. Suppose we are offered the opportunity to buy at a price $S<\mu$.
\begin{array}{rl}
X &:= \displaystyle V-S,\quad \mathbb{E}[X]=\mu-S \text{ (Trading edge)} \\\\
\mathbb{E}[\text{log utility}|f] &=\displaystyle \mathbb{E}[\ln({1+fX})] \\\\
&=\displaystyle \mathbb{E}[fX - \frac{(fX)^2}{2} + \frac{(fX)^3}{3} - ...] \text{ (Taylor expansion)} \\\\
&=\displaystyle f(\mu-S) - \frac{f^2\mathbb{E}[X^2]}{2} + ... \\\\
&\approx\displaystyle f(\mu-S) - \frac{f^2\mathbb{E}[X^2]}{2} \\\\
&=\displaystyle f(\mu-S) - f^2\frac{\sigma^2+(\mu-S)^2}{2} \text{ (Second raw moment)} \\\\
\implies f^* &=\displaystyle \frac{(\mu-S)}{\sigma^2+(\mu-S)^2} \text{ (Differentiating w.r.t f and maximizing)}
\end{array}
Thus we approximate the optimal fraction of our wealth to bet to be $\frac{(\mu-S)}{\sigma^2+(\mu-S)^2}$. In other words:
\begin{equation*}
\text{optimal fraction given edge and variance} \approx \frac{edge}{variance + edge^2} \approx \frac{edge}{variance} \text{ for } edge^2 \ll variance
\end{equation*}

### Approximating the number of primes in a given range
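The general approximation for the number of primes up to some number $n$ is given by:
\begin{equation*}
\pi(n)\approx\frac{n}{\ln{n}}
\end{equation*}
For a range $(a,b]$, take the difference $\frac{b}{\ln{b}}-\frac{a}{\ln{a}}$.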
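To get a feel for how rough the estimate is, it can be compared against an exact count from a sieve of Eratosthenes. A small sketch with illustrative function names:

```python
import math


def count_primes_up_to(n: int) -> int:
    """Exact number of primes <= n using a sieve of Eratosthenes."""
    if n < 2:
        return 0
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for p in range(2, math.isqrt(n) + 1):
        if sieve[p]:
            # Cross out multiples of p starting from p * p
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return sum(sieve)


def approx_primes_up_to(n: float) -> float:
    """Prime counting estimate n / ln(n)."""
    return n / math.log(n)


def approx_primes_between(a: float, b: float) -> float:
    """Estimated number of primes in (a, b]."""
    return approx_primes_up_to(b) - approx_primes_up_to(a)


for n in (100, 1000, 10_000):
    print(n, count_primes_up_to(n), approx_primes_up_to(n))
print(count_primes_up_to(2000) - count_primes_up_to(1000),
      approx_primes_between(1000, 2000))
```

The estimate undercounts in this range, so for mental maths it is best read as a ballpark figure.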

### Number of integer solutions of a tuple given constraints
Stars and bars $x_1 + x_2... + x_k = n$:
- Positive integers: $\binom{n-1}{k-1}$
- Non negative integers: $\binom{n+k-1}{k-1}$

Inclusion exclusion principle for bounded stars and bars $x_1 + x_2... + x_k = n$ with non-negative integers $x_i\leq M$ (binomial coefficients with a negative upper entry count as 0):
\begin{equation*}
\sum_{j=0}^{k}(-1)^j\binom{k}{j}\binom{n-j(M+1)+k-1}{k-1}
\end{equation*}
The above is derivable by considering the events $A_i$ for $x_i$ exceeding $M$.
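The bounded count can be checked against brute force for small cases. A sketch, assuming non-negative integer variables with $0\leq x_i\leq M$ and treating terms with $n-j(M+1)<0$ as zero; the function names are made up for illustration.

```python
from itertools import product
from math import comb


def bounded_solutions(n: int, k: int, M: int) -> int:
    """Solutions of x_1 + ... + x_k = n with 0 <= x_i <= M (inclusion exclusion)."""
    total = 0
    for j in range(k + 1):
        remaining = n - j * (M + 1)  # stars left after forcing j variables above M
        if remaining < 0:
            break
        total += (-1) ** j * comb(k, j) * comb(remaining + k - 1, k - 1)
    return total


def bounded_solutions_brute(n: int, k: int, M: int) -> int:
    """Same count by direct enumeration (only for small k and M)."""
    return sum(1 for xs in product(range(M + 1), repeat=k) if sum(xs) == n)


# Three dice summing to 10: shift each die by 1, so x_i in 0..5 and n = 7
print(bounded_solutions(7, 3, 5), bounded_solutions_brute(7, 3, 5))
```

The dice example shows a common use: shifting each variable down by its lower bound turns "faces 1 to 6" into the $0\leq x_i\leq M$ form.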

Non-exhaustive question styles:
- $(x,y,z)$ such that $x+y+z = n$ (Basic stars bars)
- $(x,y,z)$ such that $x+y+z \leq n$ (Add slack variable $s\geq 0$ to reduce the problem to stars and bars on 4 variables)
- $(x,y,z)$ such that $x+ay+bz = n$ (Rearrange to get $ay+bz \leq n$)
- $(x,y,z)$ such that $x+y+z = n$ and $x,y,z\leq M$ (Use the inclusion exclusion formula)
- $(x,y,z)$ such that $x+y+z = n$, $x+ay+bz = m$ (Subtract the first equation from the second to get $(a-1)y+(b-1)z = m-n$ so we can write one in terms of the other)
- $(x,y,z)$ such that $x+ay+bz = n$, $x$ odd, $y$ divisible by $3$ and $z$ even (Substitute $x = 2x'+1$, $y=3y'$, $z=2z'$. Then we go back to the $x+ay+bz=n$ case in the new variables)

# Dealing with odds
### Definition
If you are quoted $a:b$ **odds** on a bet, it means that the person providing the odds is assuming that the probability of the event happening is $\frac{b}{a+b}$, and if the bet goes right, for every $b$ units bet you win $a$ units of profit. Conversely, if you have an event occurring with probability $\frac{b}{a+b}$, then the fair odds are $a:b$.

Odds can also be given in decimal instead of ratio format. Given $a:b$, the equivalent decimal odds are $1+\frac{a}{b}$. This is the amount you get back (stake included) for every 1 unit bet if the bet goes right.

### Edge given fair vs non-fair odds
**Edge** can be understood as the expected gain or loss based on your fair odds against the odds you are given. We can calculate edge as the expected percentage gain/loss on the amount bet. Lets say we have fair decimal odds $d_1$ and quoted decimal odds $d_2$. Then our edge is:
\begin{equation*}
edge = \mathbb{P}[win]*d_2 - 1 = \frac{d_2}{d_1} - 1
\end{equation*}

### Kelly criterion sizing for odds (binary outcome betting): In terms of probabilities and in terms of edge
In terms of probabilities:
\begin{equation*}
f^* = p-\frac{q}{b}
\end{equation*}
where $p$ is the win probability, $q$ is the loss probability, and $b$ is the profit per unit staked (ratio odds value or equivalently decimal odds - 1).

In terms of edge:
\begin{equation*}
f^* = \frac{edge}{b}
\end{equation*}
where $edge$ is in dollars for every dollar bet.

General heuristics:
- The optimal fraction is going to be small if the fair win probability is also small. Intuition: you need to make a lot of bets to realize the edge, hence you reduce the amount bet considerably. Another way to think about it is that low probability outcomes with super high odds payouts have high variance (in terms of your pnl) and are more "risky" in that sense, so you need to reduce your risk of ruin by scaling your bets smaller.
- The optimal fraction is going to be large if the fair win probability is also large. Intuition: you don't need to make a lot of bets to realize the edge so you don't risk ruin. Another way to think about it is that high probability outcomes with moderate odds payouts have low variance (in terms of your pnl) and are less "risky" in that sense, so you can afford to scale your bets larger.
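The conversions and the two Kelly forms above can be sketched together. The function names are made up, $a:b$ odds are read as "win $a$ for every $b$ staked", and the two example bets are hypothetical numbers chosen to have the same edge.

```python
def implied_probability(a: float, b: float) -> float:
    """Win probability implied by a:b odds."""
    return b / (a + b)


def decimal_odds(a: float, b: float) -> float:
    """Decimal odds equivalent to a:b ratio odds."""
    return 1 + a / b


def edge(fair_decimal: float, quoted_decimal: float) -> float:
    """Expected gain per unit bet: P[win] * d2 - 1 = d2 / d1 - 1."""
    return quoted_decimal / fair_decimal - 1


def kelly_from_probability(p: float, quoted_decimal: float) -> float:
    """f* = p - q / b with b = quoted decimal odds - 1."""
    b = quoted_decimal - 1
    return p - (1 - p) / b


def kelly_from_edge(fair_decimal: float, quoted_decimal: float) -> float:
    """f* = edge / b."""
    return edge(fair_decimal, quoted_decimal) / (quoted_decimal - 1)


# Long shot: fair odds 3:1 (p = 0.25), quoted 4:1
long_shot = kelly_from_edge(decimal_odds(3, 1), decimal_odds(4, 1))
# Favourite: fair odds 1:4 (p = 0.8), quoted decimal odds 1.5625, same 25% edge
favourite = kelly_from_edge(decimal_odds(1, 4), 1.5625)
print(long_shot, favourite)
```

With the same 25% edge, the long shot gets a far smaller fraction than the favourite, which is the heuristic described in the list above.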

# Betting games (Examples using dice, cards, coins)
### General strategy for thinking of probabilities
- Dice outcomes:
    - On two fair dice: Think about a 6 by 6 grid and how many cells are coloured in corresponding to the outcome.
        - Examples: 
            - Odd/Even sum
            - Sum = 7
            - Product > 10
            - At least one 6
            - At least one dice $\geq$ 5
            - Prime valued dice only
            - Even dice only

    - On more than two fair dice:

    - Handling perturbations/events:
        - Sum market:
            - 
            - 
        - Product market:
            - 
            - 

- Card outcomes
    - On two fair cards: Think about a 13 by 13 grid if the outcome is related to just the numbers/values you get and not suits and think about colouring of cells like in the fair dice situation. If you need to consider suits, compute the probability manually (not too difficult).
        - Examples:
            - At least one ace : $25/169$
            - Odd sum (ace = 1 or 14) : $84/169$
            - Even sum (ace = 1 or 14) : $85/169$
            - Sum less than 10: $36/169$ (ace = 1), $21/169$ (ace = 14)
            - Two of the same colour: $1/2$
            - Two of the same suit: $1/4$
            - Only number cards: $81/169$
    - Sums in general on an arbitrary number of cards (e.g. 3 or 5):
        - Case 1, n=2: Consider a sorted grid of possible values (i.e. cell $i,j$ represents card $i$ + card $j$). To count the number of occurrences of a specific sum (for targets no larger than the smallest value plus the largest value), find the $j$ such that card 1 + card $j$ = target. Then $j$ is your answer, since the matching cells form an anti-diagonal of length $j$. If we need the number of occurrences of $sum\leq target$, then we take the sum of naturals from 1 to $j$ (i.e. $(j+1)j/2$). Counts for $sum\geq target$ follow by symmetry or by subtracting from 169.
        - Case 2, n>2: We can consider everything on a case by case basis (fix n-2 variables and figure out the coloured cells in a 13 by 13 grid) then repeat.
    - Handling perturbations/events:
        - Card value changes: Consider a standard 52 card deck. If we swap specific values $a$ with $b$ (e.g. 6 now counts as 10), then the EV of the final result is easy to compute. Let $\mu$ be the old EV. Then the new EV is $\mu + (b-a)/13$.
        - Card removal: Consider a standard 52 card deck with $n=13$ distinct values. If we remove all cards with values $a_1,a_2,..,a_k$, then our new EV is $\mu - \frac{\sum_{i=1}^k(a_i-\mu)}{n-k}$
        - Card addition: Consider a standard 52 card deck with $n=13$ distinct values. If we add in cards (1 for each suit) of values $a_1,a_2,...,a_k$ then our new EV is $\mu + \frac{\sum_{i=1}^k(a_i-\mu)}{n+k}$

- Coin outcomes
    - On two/three fair coins:
        

- Mixed game:
    - Combinations of die/cards/coins: Think about independent RVs
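The 13 by 13 grid counts and the card perturbation shortcuts above can be checked by enumeration. A sketch with made-up helper names; like the grid picture, it treats the two cards as independent draws of a value (i.e. with replacement), and it takes J, Q, K as 11, 12, 13.

```python
from fractions import Fraction
from itertools import product


def card_values(ace_high: bool = False) -> list[int]:
    """The 13 distinct values: 1..13 with ace = 1, or 2..14 with ace = 14."""
    return list(range(2, 15)) if ace_high else list(range(1, 14))


def grid_probability(values, event) -> Fraction:
    """Share of (first, second) grid cells for which event(first, second) is true."""
    cells = list(product(values, repeat=2))
    return Fraction(sum(1 for x, y in cells if event(x, y)), len(cells))


def mean(values) -> Fraction:
    return Fraction(sum(values), len(values))


low, high = card_values(False), card_values(True)

odd_sum = grid_probability(low, lambda x, y: (x + y) % 2 == 1)
below_ten_low = grid_probability(low, lambda x, y: x + y < 10)
below_ten_high = grid_probability(high, lambda x, y: x + y < 10)
print(odd_sum, below_ten_low, below_ten_high)

# Removal of 6 and 10 (ace = 1): direct mean vs the shortcut formula
removed = {6, 10}
direct = mean([v for v in low if v not in removed])
shortcut = mean(low) - Fraction(sum(a - mean(low) for a in removed), 13 - len(removed))
# Value swap: every 4 now counts as 10
swapped = mean([10 if v == 4 else v for v in low])
print(direct, shortcut, swapped)
```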

### General strategy for variances
- Dice, cards and other discrete uniform outcomes:

- Perturbed discrete uniform outcomes:

- Binomial counting/coin outcomes

### Heuristics in bet sizing
The same edge on low probability vs high probability translates to betting a smaller size on the low probability event compared to the high probability event.

### Paying for information (revealing of a dice value, card face, etc):
Suppose you are given quotes for the bid and ask of the sum of card values. The fair-price to pay for revealing a value is given by the extra value/edge that information would give you from what you already know (and expect to profit). I.e. the expected edge you expect to get from revealing a card minus the edge you had initially (if there was a misquote). E.g. if you have 44 @ 46 as your market for the sum of 5 cards drawn from a 52-card deck with ace counting as 14, then you would pay 0.15 units to reveal one card.

Trick for EV calculations: Say we wanted to find the EV edge of revealing one card (not subtracting away existing edge), with the previous quotes. Then we can calculate it by going case by case (i.e. what is our edge given i for all i then averaging out). One important trick is that say we wanted to calculate the edge for $i=2$. We can approximate using $2+8*4$ even though $8$ is no longer the EV of the remaining cards since overall, the perturbations to the EV of the remaining cards for every $i$ cancel out.

### Maximum order size
Going long: If you have a total wealth $w$, then the total amount of units you can buy/go long given a price $p$ is $\lfloor w/p \rfloor$. 

Going short: If you want to go short and the price is currently $p$ with a maximum possible price $p^*$, then you calculate the maximum loss you could make per unit, $p^*-p$, and the total amount of units you can sell is $\lfloor w/(p^*-p) \rfloor$ (you need to be able to cover the downside if you lose your bet in the worst case).

# Live trading games
### General format 
- Changing fair values based on events
- Short-lived bids and asks
### Dealing with multiple variables of info
- 
- 

### An example: outcomes of sports matches
- 
- 

Try it yourself: [betfair](https://www.betfair.com/)


# Some algebraic identities
### Geometric and arithmetic sums

### Product-sum identity

### Binomial coefficient identities

### Hockey stick identity

### Sum of even and odd binomial coefficients


# Basic Probability concepts

### Sample space, Sigma algebra, and Probability measures

**Sample space ($\Omega$):** The set of all possible outcomes.

**Sigma algebras:** A sigma algebra $\mathcal{F}$ is a set of subsets $A$ of $\Omega$, each subset corresponding to an event (e.g. if the sample space is the set of all outcomes of 3 coin flips, an event could be all outcomes with an odd number of heads). They obey the following properties/axioms:
- $\emptyset \in \mathcal{F}$.
- Closed under countable unions.
- Closed under complements.

**Probability measures (Kolmogorov's definition):** A probability measure is a function $\mathbb{P}:\mathcal{F}\rightarrow[0,1]$. It obeys the following properties/axioms:
- $\mathbb{P}(A)\geq 0 \forall A$
- $\mathbb{P}(\Omega) = 1$
- The probability of a countable union of pairwise disjoint sets is the sum of their individual probabilities.

Common definitions for probability measures:
- **Discrete sample space** (finite or countable): $\mathbb{P}(A):=\sum_{\omega\in A}{p(\omega)}$ where $p(\cdot)$ is the probability mass function defined for each point in the sample space.
- **Continuous sample space**: $\mathbb{P}(A):=\int_{A}{f(x)dx}$ where $f(x)$ is the probability density function defined for each point in the sample space.
Note that in both cases, the $pmf$ and $pdf$ functions sum/integrate to 1 over the entire sample space.

A **probability space** is a tuple $(\Omega, \mathcal{F}, \mathbb{P})$ of the three terms above. When we have a stochastic process (a random process that evolves over time), we typically define a **filtration** in addition to the 3 things above.

A **filtration** is a family of sigma algebras that are increasing over time, i.e. $\mathcal{F}_s\subseteq\mathcal{F}_t \forall s \leq t$. Each $\mathcal{F}_t$ in the filtration represents the events whose occurrence can be decided by time $t$, i.e. the information available at time $t$. The importance of specifying filtrations comes in when we consider whether we have all the information up till time $t$, less information than that (maybe a delay in info), or more (future foresight). If a random variable is **$\mathcal{F}_t$ measurable**, its value is determined by the process up till time $t$.

### De Morgan's Laws
- Complement of union: $(A\cup B)^c=A^c \cap B^c$
- Complement of intersection: $(A\cap B)^c=A^c \cup B^c$

### Inclusion-Exclusion principle
Suppose $(A_i)_{i\in\mathcal{I}}$ are all subsets of some universal set like $\Omega$. The **inclusion exclusion principle** states:
\begin{equation*}
|\cup_{i\in\mathcal{I}} A_i| = \sum_{i}|A_i| - \sum_{i<j}|A_i\cap A_j| + ... + (-1)^{n+1}|\cap_{i\in\mathcal{I}}A_i| \text{ where } n=|\mathcal{I}|
\end{equation*}

The same identity holds for any probability measure:
\begin{equation*}
\mathbb{P}(\cup_{i\in\mathcal{I}} A_i) = \sum_{i}\mathbb{P}(A_i) - \sum_{i<j}\mathbb{P}(A_i\cap A_j) + ... + (-1)^{n+1}\mathbb{P}(\cap_{i\in\mathcal{I}}A_i) \text{ where } n=|\mathcal{I}|
\end{equation*}

### Combinatorics
Basic definitions:
- **Combinations**: The number of ways to choose $k$ objects from $n$ total objects is given by $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
- **Permutations**: The number of ways to arrange $k$ objects is given by $k!$.
- $k$ length sequence of objects from $n$ choices with replacement: Suppose you had to make an array of length $k$ and for each entry you can choose among $n$ objects. You can make $n^k$ unique arrays. Without replacement (no object reused) the count is $\frac{n!}{(n-k)!}$.

Some common combinatorics/permutations ideas:
- The **classic ballot problem** is described as the following: Suppose you have party A and party B with $a$ votes and $b$ votes, all in some random order and $a>b$. What is the probability that party A is strictly ahead of party B at all times when counting votes? To solve this problem, we reframe it as a random walk (combinatorial perspective solution):
    - Consider a grid with $a$ rows and $b$ columns. We start at the bottom left corner, and at every step move up if we get an A vote and right if we get a B vote. At the end of our path (counting all votes) we always end at the top right corner. 
    - If we draw the line $y=x$ on the grid, we want to find the number of ways we can traverse from the bottom left to the top right while staying strictly above the line.
    - The total number of ways to traverse the grid is given by $\binom{a+b}{a}$
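    - By the reflection principle, there is a bijection between the bad paths that start with an A vote and the paths that start with a B vote (all of which are bad). There are $\binom{a+b-1}{a}$ paths starting with a B vote, so the number of bad paths is $2\binom{a+b-1}{a}$.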

    Thus (after some expanding) the final probability of the event is $\frac{a-b}{a+b}$. 
    
    Below are some variations of the classic ballot problem which are all solvable using the same reflection principle + grid visualization:
    - With a loose inequality: Suppose now that the parties A and B have $a\geq b$ votes respectively. What is the probability that party A is never below party B when counting votes? 
    (Answer: $\frac{a-b+1}{a+1}$)
    - Catalan Numbers/Dyck paths: Suppose we have $n$ votes for A and $n$ votes for B. How many ways can the votes be counted such that the number of votes for A is always at least equal to B? (Answer: $\frac{1}{n+1}\binom{2n}{n}$)
    - With a multiple: What is the probability that party A always has strictly more than $r$ times as many votes as party B when counting votes (given $a>rb$)? (Answer: $\frac{a-rb}{a+b}$)
    - With a multiple and loose inequality: What is the probability that party A always has at least $r$ times as many votes as party B when counting votes (given $a\geq rb$)? 
    (Answer: $\frac{a-rb+1}{a+1}$, which reduces to the loose inequality answer above for $r=1$)
- **Stars and bars**: Suppose we have $n$ stars and we want to find the number of ways they can be split into $k$ boxes where each box contains some number of stars or none. This is equivalent to having $n+k-1$ empty spaces, and choosing $k-1$ of them to have bars and the rest to be filled with stars. I.e., the number of ways is given by $\binom{n+k-1}{k-1}$. Variations:
    - Bounded above stars and bars: Suppose now that each box has to have stars $\leq M$. Find the total number of ways this can be done. (Answer: $\sum_{j\geq 0}(-1)^j\binom{k}{j}\binom{n-j(M+1)+k-1}{k-1}$, dropping terms with $n-j(M+1)<0$) Note that to solve this, simply consider each event $A_i$ that box $i$ holds $x_i>M$ stars. Then we find $\binom{n+k-1}{k-1} - |\cup_i A_i|$ where $|\cup_i A_i|$ can be found via the inclusion exclusion principle (forcing $j$ given boxes over the bound removes $j(M+1)$ stars from the pool).
- **Derangements**: A derangement of $n$ objects is a permutation of the objects such that no object is in its original position. The number of derangements can be deduced via the inclusion exclusion principle. Let $A_i$ denote the event that the $i$th object is in its original position. Then the number of derangements of $n$ objects is $n!-|\cup_i A_i|$. We can expand $|\cup_i A_i|$ by the inclusion exclusion principle: $\sum_{i}|A_i| - \sum_{i<j}|A_i\cap A_j| + ... + (-1)^{n+1}|\cap_{i}A_i|=\sum_{i=1}^n (-1)^{i+1}\binom{n}{i}(n-i)!$
- **Menage problem**: Suppose we have $n$ couples, i.e. $2n$ total people, all seated at a circular table. How many ways are there to seat them so that no two people belonging to the same couple are adjacent while the seating alternates between men and women (i.e. ...MFMF...)? This is similar to a circular version of the derangements problem. We take the total number of possible arrangements and use the inclusion exclusion principle to subtract the arrangements in which at least one couple sits together.
    - The total number of arrangements is given by: $n!(n-1)!$ (Hint: Fix the position of one person, arrange the remaining men, arrange the remaining women)
    - The inclusion exclusion term for $k$ couples sitting together: $(n-1)!\frac{2n}{2n-k}\binom{2n-k}{k}(n-k)!$ (Hint: Keep the men's positions fixed. The $2n$ man-woman adjacencies around the table form a cycle, and seating $k$ wives next to their husbands uses $k$ of these adjacencies, no two of which may be neighbours in the cycle since neighbouring adjacencies share a seat or a man. There are $\frac{2n}{2n-k}\binom{2n-k}{k}$ such choices, and the remaining $n-k$ women can be seated in $(n-k)!$ ways.)

    The rest of the formula then follows via expansion. 
- **Collision probability**: Suppose you have $m$ balls that you want to distribute randomly across $n$ bins. What is the probability that you have a bin with two or more balls? This problem is easily solvable by considering the complement, i.e. the probability that every ball lands in its own bin: $\frac{n(n-1)...(n-m+1)}{n^m}$. The collision probability is one minus this value, with edge cases of $m=1$ which leads to probability 0 and $m>n$ which leads to probability 1 by the **pigeonhole principle**.
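
The closed forms above can be sanity-checked by brute force for small inputs. The sketch below enumerates every counting order for the classic ballot problem and every permutation for derangements; the function names are illustrative and the enumeration is only practical for small $a$, $b$ and $n$.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb, factorial


def ballot_probability_brute_force(a: int, b: int) -> Fraction:
    """P(A strictly ahead throughout the count), by enumerating all orders."""
    total = 0
    good = 0
    # Each counting order is determined by which positions hold the b votes for B.
    for b_positions in combinations(range(a + b), b):
        b_set = set(b_positions)
        lead = 0
        always_ahead = True
        for i in range(a + b):
            lead += -1 if i in b_set else 1
            if lead <= 0:  # A is tied or behind after this vote
                always_ahead = False
                break
        total += 1
        good += always_ahead
    return Fraction(good, total)


def derangements_formula(n: int) -> int:
    """n! minus |union of A_i|, expanded by inclusion exclusion."""
    union = sum(
        (-1) ** (i + 1) * comb(n, i) * factorial(n - i) for i in range(1, n + 1)
    )
    return factorial(n) - union


def derangements_brute_force(n: int) -> int:
    """Count permutations of range(n) with no fixed point."""
    return sum(all(p[i] != i for i in range(n)) for p in permutations(range(n)))


# Compare the enumeration with (a - b) / (a + b) for one small case.
print(ballot_probability_brute_force(5, 3) == Fraction(5 - 3, 5 + 3))
print(derangements_formula(5), derangements_brute_force(5))
```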

Sidenote (More on **Catalan numbers**): In the ballot problem with $n$ votes for both parties and a weak inequality, the solution we arrive at is what is known as the $n$th **Catalan number**. Catalan numbers appear in many combinatorics problems; below are some cases where thinking about Catalan numbers may be useful:
- Number of ways you can order events where one can only occur after another occurs.
- Number of ways to arrange connections to avoid crossings
- Number of ways to form an object whose structure depends on binary splitting

Some extra examples on catalan numbers:
- Suppose you have $n$ processes that you specify a start and end for. You need to find the number of ways to order them such that the end of a process $i$ always comes after the start of process $i$. E.g. so start, end, start, end is valid for two processes and so would start, start, end, end.
- Suppose you have $2n$ points that form $n$ chords in a circle. Find the number of ways to get no intersections.
- Suppose you have $n$ nodes. Find the number of ways to construct a binary tree with $n$ nodes.
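- Suppose you have a random walk that starts at $y=0$ and ends at $y=0$ with $2n$ total steps. Find the number of random walks that never go below the $y=0$ axis. (Requiring the walk to stay strictly above the axis between its endpoints gives the $(n-1)$th Catalan number instead.)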
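
As a quick check of the random walk example, the sketch below counts the $\pm 1$ walks of $2n$ steps that return to 0 and never drop below 0, and compares the count with $\frac{1}{n+1}\binom{2n}{n}$. The function names are illustrative, and the enumeration over $2^{2n}$ walks is only meant for small $n$.

```python
from itertools import product
from math import comb


def catalan(n: int) -> int:
    """n-th Catalan number from the closed form."""
    return comb(2 * n, n) // (n + 1)


def count_nonnegative_walks(n: int) -> int:
    """Count +-1 walks of 2n steps from 0 back to 0 that never go below 0."""
    count = 0
    for steps in product((1, -1), repeat=2 * n):
        height = 0
        for step in steps:
            height += step
            if height < 0:
                break
        else:
            # Loop finished without dipping below 0; keep walks that end at 0.
            if height == 0:
                count += 1
    return count


for n in range(1, 6):
    print(n, catalan(n), count_nonnegative_walks(n))
```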

### Random variables as functions, PMFs/PDFs and CDFs
A random variable is a function $X:\Omega\rightarrow \mathbb{R}$ from the sample space $\Omega$ to the reals. The image of a random variable is the set $\{X(\omega);\omega\in\Omega\}$.
- A **discrete random variable** is one where the image of the random variable $X$ is countable (e.g. it maps to the natural numbers or a finite set). The probabilities associated with a discrete random variable X are defined by a **probability mass function** (pmf) $p_X$ where $\mathbb{P}(X=k) = p_X(k)$ and $\mathbb{P}(X\in A) = \sum_{a\in A} p_X(a)$.
- A **continuous random variable** is one where the image of the random variable $X$ is uncountable (e.g. it maps to a closed interval $[a,b]$). The probabilities associated with a continuous random variable $X$ are defined by a **probability density function** (pdf) $f_X$ where $\mathbb{P}(X\in A)=\int_A f_X(x) dx$. 

The **cumulative distribution function** $F_X(x)$ of a random variable $X$ describes the following probability : $\mathbb{P}(X\leq x)$. If $X$ is continuous, then the CDF and pdf have the following relationship: $\frac{dF_X}{dx}=f_X$. 

When it comes to discrete variables, we can in fact differentiate the CDF as well which gives the **generalized PDF**. To do this, we introduce the **Dirac delta function** $\delta()$ which is defined by:
\begin{equation*}
\delta(x)=0 \text{ for all } x\neq 0 \quad\text{and}\quad \int_{-\infty}^{\infty}\delta(x)dx=1
\end{equation*}
It has many interpretations, the one useful in our case being the "gradient" at 0 for the step function $u(x)$.
\begin{equation*}
u(x)=\begin{cases}
1 & x\geq 0\\
0 & \text{ otherwise}
\end{cases}
\end{equation*}
We can rewrite the CDF of a discrete variable which can take on values $x_k$ as $F_X(x) = \sum_{x_k} p_X(x_k)u(x-x_k)$. Differentiating this, we get:
\begin{equation*}
f_X(x) = \sum_{x_k} p_X(x_k)\delta(x-x_k)
\end{equation*}

Two random variables $X$ and $Y$ are **independent** if the realization of $X$ does not impact the probabilities of the outcomes of $Y$ and vice versa. In probability terms: $\mathbb{P}(X\in A,Y\in B)=\mathbb{P}(X\in A)\mathbb{P}(Y\in B)$ for all sets of values $A$ and $B$ that $X$ and $Y$ can take, respectively.

**Mutual exclusion**: Two events are mutually exclusive if one cannot occur with the other, i.e. their intersection is empty, so the probability of their union is the sum of their probabilities.
- Example problems: 
    - Points on a circle problem: Suppose you have $n$ points placed uniformly at random on a circle. What is the probability of all $n$ points lying on the same semicircle? To solve this problem, we consider $n$ events $(E_i)_{1 \leq i\leq n}$ where $E_i$ is the event that the semicircle starting from point $i$ going clockwise contains all the other $n-1$ points. Each event has probability $\frac{1}{2^{n-1}}$ and the events are mutually exclusive, so $\mathbb{P}(\cup E_i)=\sum \mathbb{P}(E_i) = \frac{n}{2^{n-1}}$
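
The semicircle answer can be checked with a small Monte Carlo simulation. This is a sketch under the assumption that the points are independent and uniform on the circle; the function names, trial count and seed are illustrative choices.

```python
import math
import random


def all_in_semicircle(angles: list) -> bool:
    """True if, for some point, every other point lies within half a turn of it
    when moving in the direction of increasing angle."""
    for start in angles:
        if all((other - start) % (2 * math.pi) <= math.pi for other in angles):
            return True
    return False


def estimate_semicircle_probability(n: int, trials: int = 100_000, seed: int = 0) -> float:
    """Fraction of simulated configurations of n points that fit in a semicircle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        angles = [rng.uniform(0.0, 2 * math.pi) for _ in range(n)]
        hits += all_in_semicircle(angles)
    return hits / trials


n = 4
print(estimate_semicircle_probability(n), n / 2 ** (n - 1))
```

The two printed values should be close but not identical, since the first is a simulation estimate.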

**Jacobian determinant**: Suppose we have a random vector $X=(X_1,...,X_n)$ and an invertible, differentiable transformation $T$ giving the vector $Y=T(X)=(Y_1,...,Y_n)$. Then we have the following relationship
\begin{equation*}
f_X(x)=f_Y(y)|\det(J_{T}(x))|
\end{equation*}
between the joint pdfs $f_X$ and $f_Y$, where $y=T(x)$ and $J_T(x)$ is the Jacobian of $T$. The Jacobian is a function of $x$ given by:
\begin{equation*}
J_T(x)=\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots &  \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial x_1} & \cdots &  \frac{\partial y_n}{\partial x_n}
\end{bmatrix}
\end{equation*}
In the 1 dimensional case, the above is the equivalent of:
\begin{equation*}
f_X(x)=f_Y(y)\left|\frac{dy}{dx}\right|
\end{equation*}
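
As a worked example of the 1 dimensional formula (the choice of transformation is only an illustration), take $X\sim Unif(0,1)$ and $Y=-\ln(X)/\lambda$ for some $\lambda>0$. Then $f_X(x)=1$, $x=e^{-\lambda y}$ and $\left|\frac{dy}{dx}\right|=\frac{1}{\lambda x}$, so:
\begin{equation*}
1=f_Y(y)\frac{1}{\lambda x}\quad\Rightarrow\quad f_Y(y)=\lambda x=\lambda e^{-\lambda y},\ y>0
\end{equation*}
which is the pdf of an exponential random variable with parameter $\lambda$.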

### Conditional Probability
The joint probability of events $A$ and $B$ is denoted by $\mathbb{P}(A\cap B)$, that is the probability of both occurring at the same time. Suppose you have $n$ variables $(X_i)_{1\leq i\leq n}$. Their **joint probability function**, either a pmf $p_{X_1,..,X_n}()$ or pdf $f_{X_1,..,X_n}()$, describes the probability (or probability density) of getting a specific realization $(x_1,..,x_n)$ of the random variables. They follow the same rules of non-negativity and normalization (i.e. sum/integrate to 1) as regular probability functions.

Given a **joint probability function**, the marginal probability function $f_{X_i}$ (or $p_{X_i}$) of a specific random variable $X_i$ is the probability of a specific realisation $x_i$ regardless of the other variables' outcomes, i.e. $f_{X_i}=\int f_{X_1,...,X_n} dx_1...dx_{i-1}dx_{i+1}...dx_n$. When considering multiple variables, they are considered **mutually independent** (different from **pairwise independence**) if the joint probability function of every possible subset of variables is the product of the corresponding marginal probability functions.

The **conditional probability** of an event $A$ occurring given an event $B$ has occurred (with $\mathbb{P}(B)>0$) is:
\begin{equation*}
\mathbb{P}(A|B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}
\end{equation*}
When two events are independent, conditioning has no effect: $\mathbb{P}(A|B)=\mathbb{P}(A)$.

**Bayes' theorem** states that we have the following relationship between conditional probabilities:
\begin{equation*}
\mathbb{P}(A|B) = \frac{\mathbb{P}(B|A)\mathbb{P}(A)}{\mathbb{P}(B)}
\end{equation*}
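
A minimal sketch of Bayes' theorem in code, with $\mathbb{P}(B)$ expanded by the law of total probability. The function name and the numbers in the usage line (a hypothetical test with a 1% prior, 99% true positive rate and 5% false positive rate) are made up for illustration.

```python
def posterior(prior: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) = P(B|A) P(A) + P(B|not A) P(not A)."""
    p_b = p_b_given_a * prior + p_b_given_not_a * (1.0 - prior)
    return p_b_given_a * prior / p_b


# Hypothetical example: A = "has the condition", B = "test is positive".
print(posterior(prior=0.01, p_b_given_a=0.99, p_b_given_not_a=0.05))
```

With these made-up numbers the posterior comes out to roughly $1/6$, far below the test's 99% true positive rate, because the prior is small.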

### Expectation and Variance
**Expectation**: The expectation of a random variable $X$ is a probability weighted average of all the possible outcomes of $X$.
- Discrete expectation: $\mu=\mathbb{E}(X)=\sum_{x\in Im(X)} x p_X(x)$
- Continuous expectation: $\mu=\mathbb{E}(X)=\int_{Im(X)} xf_X(x)dx$

**Linearity of expectation**: For constants $a,b,c$:
\begin{equation*}
\mathbb{E}(aX+bY+c) = a\mathbb{E}(X)+b\mathbb{E}(Y)+c
\end{equation*}

**Variance**: The variance of a random variable $X$ is the expected squared deviation from the mean. The square root of variance is called **standard deviation**.
\begin{equation*}
    Var(X) = \mathbb{E}[(X-\mu)^2] = \mathbb{E}(X^2)-[\mathbb{E}(X)]^2
\end{equation*}
- Discrete variance: $Var(X) = \sum (x -\mu)^2p_X(x)$
- Continuous variance: $Var(X) = \int (x -\mu)^2f_X(x)dx$

**Non-linearity of variance**: For constants $a,b$
\begin{equation*}
Var(aX+b) = a^2Var(X)
\end{equation*}

**Law of the unconscious statistician (LOTUS)**: Suppose you have a random variable $X$ and a random variable $Y$ defined as a function $g()$ of $X$. Then the expectation of $Y$ is given by 
\begin{equation*}
\mathbb{E}(Y)=\begin{cases}\sum g(x)p_X(x) & \text{(discrete)}\\\\
\int g(x)f_X(x)dx & \text{(continuous)}\end{cases}
\end{equation*}

**Law of total expectation**: Given random variables $X$ and $Y$, we have the following expectation formula:
\begin{equation*}
\mathbb{E}(X) = \begin{cases}
\sum \mathbb{E}(X|Y=y)\mathbb{P}(Y=y) \\\\
\int \mathbb{E}(X|Y=y)f_Y(y)dy
\end{cases}
\end{equation*} 

**Law of total variance**: Given random variables $X$ and $Y$, we have the following variance formula:
\begin{equation*}
Var(X) = \mathbb{E}[Var(X|Y)] + Var(\mathbb{E}(X|Y))
\end{equation*}
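
The decomposition $Var(X)=\mathbb{E}[Var(X|Y)]+Var(\mathbb{E}(X|Y))$ can be verified exactly on a small discrete example. The sketch below uses exact fractions; the joint pmf is a made-up illustration and the helper names are not from any library.

```python
from collections import defaultdict
from fractions import Fraction


def expectation(pmf: dict) -> Fraction:
    return sum(value * prob for value, prob in pmf.items())


def variance(pmf: dict) -> Fraction:
    mu = expectation(pmf)
    return sum((value - mu) ** 2 * prob for value, prob in pmf.items())


def total_variance_terms(joint: dict) -> tuple:
    """Return (Var(X), E[Var(X|Y)], Var(E[X|Y])) for a joint pmf {(x, y): prob}."""
    marginal_x = defaultdict(Fraction)
    marginal_y = defaultdict(Fraction)
    for (x, y), prob in joint.items():
        marginal_x[x] += prob
        marginal_y[y] += prob

    expected_cond_var = Fraction(0)
    cond_mean_pmf = defaultdict(Fraction)  # distribution of the random variable E[X|Y]
    for y, p_y in marginal_y.items():
        conditional = {x: prob / p_y for (x, yy), prob in joint.items() if yy == y}
        expected_cond_var += variance(conditional) * p_y
        cond_mean_pmf[expectation(conditional)] += p_y

    return variance(marginal_x), expected_cond_var, variance(cond_mean_pmf)


# Hypothetical joint pmf over (x, y) pairs.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 4),
}
var_x, within, between = total_variance_terms(joint)
print(var_x == within + between)
```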

**Popoviciu inequality on variance**: Suppose we have a random variable $X$ bounded between $a$ and $b$. Then we have the following:
\begin{equation*}
0\leq Var(X)\leq \frac{(b-a)^2}{4}
\end{equation*}
The above is derivable by considering that the maximum variance occurs when $X$ only takes on exactly $a$ or $b$ with equal probability.

**Tail integrals/sums**: Given a random variable $X$ that takes on non-negative values (non-negative integer values for the sum form), its expectation is given by:
\begin{equation*}
\mathbb{E}(X)=
\begin{cases}
\sum_{x=0}^\infty \mathbb{P}(X>x)\\\\
\int_0^\infty \mathbb{P}(X>x)dx
\end{cases}
\end{equation*}
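
A small exact check of the tail sum formula for a non-negative integer random variable, here a fair six-sided die. The helper names are illustrative.

```python
from fractions import Fraction


def expectation_direct(pmf: dict) -> Fraction:
    """E(X) as the probability-weighted average of outcomes."""
    return sum(value * prob for value, prob in pmf.items())


def expectation_tail_sum(pmf: dict) -> Fraction:
    """E(X) as the sum over x >= 0 of P(X > x); terms beyond max(X) are zero."""
    upper = max(pmf)
    return sum(
        sum(prob for value, prob in pmf.items() if value > x)
        for x in range(upper)
    )


die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expectation_direct(die), expectation_tail_sum(die))
```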

**Indicator RVs** are random variables that take on either 0 or 1 to indicate the occurrence of an event. These are useful in problems involving expectations since the expectation of an indicator is the probability of its event, which can reduce a complicated task to a sum of probabilities.

**Geometric sums** refer to sums of geometric sequences with first term $a_0$ and common ratio $r$, which occur in cases where you need to consider long paths or infinitely long paths, e.g. coin tosses. The infinite sum requires $|r|<1$.
\begin{array}{rl}
S_n &= \displaystyle \frac{a_0(1-r^n)}{1-r}\\\\
S_\infty &= \displaystyle\frac{a_0}{1-r}
\end{array}

Example problems:
- Refer to [quantable.io](quantable.io)
- Probability $P(X < Y^2|X<Y)$ given independent uniform $X$ and $Y$.

### Covariance and Correlation
**Covariance**: The covariance between two random variables $X$ and $Y$ measures the linear dependence between the two and is defined by the following expectation:
\begin{equation*}
Cov(X,Y) = \mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y)
\end{equation*}
It can take on negative values. If $X$ and $Y$ are the same, then the above reduces to the regular variance formula. 

**Bilinearity of covariance**: Given constants $a,b,c,d,e$ and random variables $X,Y,Z$
\begin{equation*}
Cov(aX+bY+c,dZ+e) = adCov(X,Z) + bdCov(Y,Z)
\end{equation*}

**Variance-covariance relationship**: Given variables $X$ and $Y$ with constants $a$ and $b$ (and more generally variables $X_i$ with constants $a_i$) we have:
\begin{array}{rl}
Var(aX+bY) &= a^2Var(X) + b^2Var(Y) + 2abCov(X,Y)\\\\
Var(\sum_i a_iX_i) &= \sum_i a_i^2 Var(X_i) + 2\sum_{i<j} a_ia_j Cov(X_i,X_j) 
\end{array}
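
In code, the variance of a linear combination is the double sum $\sum_i\sum_j a_ia_jCov(X_i,X_j)$, which collects the variance terms ($i=j$) and each covariance term twice ($i\neq j$). The sketch below assumes the covariance matrix is given as a nested list; the numbers are made up for illustration.

```python
def linear_combination_variance(weights: list, cov: list) -> float:
    """Var(sum_i a_i X_i) = sum_i sum_j a_i a_j Cov(X_i, X_j)."""
    n = len(weights)
    return sum(
        weights[i] * weights[j] * cov[i][j] for i in range(n) for j in range(n)
    )


# Hypothetical two-asset example: variances 0.04 and 0.09, covariance 0.012.
cov = [[0.04, 0.012], [0.012, 0.09]]
print(linear_combination_variance([0.5, 0.5], cov))
```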

**Law of total covariance**:
\begin{equation*}
Cov(X,Y) = \mathbb{E}[Cov(X,Y|Z)]+Cov(\mathbb{E}(X|Z),\mathbb{E}(Y|Z))
\end{equation*}

**Correlation**: The correlation between two random variables $X$ and $Y$ can be understood as the normalized covariance and is given by:
\begin{equation*}
Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} = \frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}
\end{equation*}
Correlation varies between $-1$ and $1$. Multiple variables are **equicorrelated** if they all have the same pairwise correlation. 

Independence and correlation/covariance:
- If two variables $X$ and $Y$ are independent, then we have $Cov(X,Y)=Corr(X,Y)=0$. The converse is not true for covariance (i.e. zero covariance does not imply independence).
- If two variables have perfect correlation, i.e. $|Corr(X,Y)|=1$, then we have that $Y=aX+b$ (almost surely) for some constants $a\neq 0$ and $b$.

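**Frechet bounds**: If $X\in[a,b]$ and $Y\in[c,d]$, then $XY$ lies between the smallest and largest of the corner products $ac,ad,bc,bd$, which gives the following (crude) bounds:
\begin{equation*}
\frac{\min(ac,ad,bc,bd)-\mu_x\mu_y}{\sigma_x\sigma_y}\leq Corr(X,Y)\leq \frac{\max(ac,ad,bc,bd)-\mu_x\mu_y}{\sigma_x\sigma_y}
\end{equation*}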

**Partial correlation**: The partial correlation between two variables $X$ and $Y$ controlling for a set of random variables $Z_1,..,Z_n$ is the correlation between $X$ and $Y$ after removing the linear effect of $Z_1,...,Z_n$.

Example problems:
- Traditional:
    - Minimum correlation $\rho$ given three pairwise correlated variables (i.e. equicorrelated variables)
    - Upper-lower bounds on a correlation matrix with off diagonals equal to $\rho$ (i.e. an equicorrelated matrix)
    - Maximum/minimum possible correlation between two linear combinations of random variables.
    - Correlation between the minimum and maximum of two random variables
    - Covariance matrix of a linear transformation of a random vector
    - Maximum correlation between two variables of fixed variances
    - Find the expectation $E[XY]$ given a correlation value and the distributions of X and Y (Hint: just use the covariance formula).

- Finance setting:
    - Correlation between two infinitely sized stock indices:
    - Correlation of related returns on stocks
    - min-max correlations between the returns on two stocks
    - Variance of an equally weighted portfolio
    - Correlation between two portfolios constructed using deterministic weights.
    - Expected correlation between two randomly constructed portfolios.
    - Correlation between two indices with overlapping stocks
    - Correlation between two stocks with orthogonality

### Limits for random variables (Not as important besides the last two theorems)

Limit superior and limit inferior of events: Let $(A_i)_{i\geq1}$ be a sequence of events. 
- The **limit superior** of events is the event that occurs infinitely often: $\omega\in\limsup A_n\text{ if }\omega\in A_k \text{ for infinitely many } k$
- The **limit inferior** of events is the event that occurs eventually always: $\omega\in\liminf A_n\text{ if }\exists N; \forall k\geq N, \omega \in A_k$
\begin{array}{rl}
\limsup_{n\rightarrow\infty}A_n &= \cap^{\infty}_{n=1}\cup^{\infty}_{k=n}A_k\\\\
\liminf_{n\rightarrow\infty}A_n &= \cup^{\infty}_{n=1}\cap^{\infty}_{k=n}A_k
\end{array}

Sequences of random variables: We have 3 main definitions for the convergence of sequences of random variables.
- **Almost sure convergence**: A sequence of random variables $X_n$ almost surely converges to $X$ if $\mathbb{P}(\{\omega;\lim_{n\rightarrow\infty}X_n(\omega)=X(\omega)\})=1$. This is the "strongest" definition in that it implies the other two.
- **Convergence in probability**: A sequence of random variables $X_n$ converges in probability to $X$ if $\lim_{n\rightarrow\infty}\mathbb{P}(|X-X_n|>\epsilon)=0$ for all $\epsilon>0$. This is the second strongest definition as it implies convergence in distribution but not almost sure convergence.
- **Convergence in distribution**: A sequence of random variables $X_n$ converges in distribution to $X$ if $\lim_{n\rightarrow\infty}F_{X_n}(x)=F_X(x)$ for all points $x$ where $F_X$ is continuous. This is the weakest definition of convergence.

Applications of the convergence definitions for random variables:
- **Laws of Large Numbers (LLN)**: The law of large numbers states that the average of many independent samples of a finite variance random variable converges to its expected value.
    - Weak law (Convergence in probability): $\lim_{n\rightarrow\infty}\mathbb{P}(|\bar{X}_n-\mu|>\epsilon)=0$
    - Strong law (Converges almost surely): $\mathbb{P}(\lim_{n\rightarrow\infty}\bar{X}_n=\mu)=1$
- **Central Limit Theorem (CLT)**: The central limit theorem states that the standardized mean $\sqrt{n}(\bar{X}_n-\mu)/\sigma$ of iid samples of a random variable with mean $\mu$ and finite variance $\sigma^2$ converges in distribution to a standard normal random variable. This is an application of convergence in distribution.
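
A rough simulation sketch of the CLT, using $Unif(0,1)$ samples (so $\mu=1/2$ and $\sigma^2=1/12$) and standardizing the sample mean by $\sqrt{n}/\sigma$. The sample size, trial count and seed are arbitrary illustrative choices.

```python
import math
import random
import statistics


def standardized_means(n: int, trials: int, seed: int = 0) -> list:
    """Draw `trials` values of sqrt(n) * (sample mean - mu) / sigma for Unif(0, 1)."""
    rng = random.Random(seed)
    mu, sigma = 0.5, math.sqrt(1 / 12)
    values = []
    for _ in range(trials):
        sample_mean = sum(rng.random() for _ in range(n)) / n
        values.append(math.sqrt(n) * (sample_mean - mu) / sigma)
    return values


z = standardized_means(n=30, trials=5000)
inside = sum(abs(v) < 1.96 for v in z) / len(z)
# For a standard normal: mean 0, standard deviation 1, about 95% within +-1.96.
print(statistics.mean(z), statistics.stdev(z), inside)
```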

### Common distributions
**Continuous uniform**: Denoted by $Unif(a,b)$. It is a random variable taking values in the range $[a,b]$ all with equal probability.
- pdf: $f(x)=\frac{1}{b-a}$ for $x\in[a,b]$

- Expectation: $\frac{a+b}{2}$

- Variance: $\frac{(b-a)^2}{12}$

- Some common extensions:
    - Distribution of the sum of 2 uniform random variables (Triangular distribution)
    - Distribution of the sum of n uniform random variables (Tends to the normal distribution by the CLT)
    - Geometric interpretation of formulas of iid $x,y,z$ Unif(0,1) variables.

**Discrete uniform**: It is a random variable taking integer values in the range $[a,b]$ all with equal probability.
- pmf: $p_X(x)=\frac{1}{b-a+1}$

- Expectation: $\frac{a+b}{2}$

- Variance: $\frac{(b-a)(b-a+2)}{12}$

- Common extensions:
    - Only odds, only evens
    - Combine two or more disjoint intervals
    - Utilize a multiset

**Bernoulli**: Denoted by $Bern(p)$. It is a random variable that takes on the value 1 with probability of $p$ and 0 otherwise. It is a random variable representing an event with probability of success of $p$.
- pmf: $p_X(1) = p, p_X(0) = 1-p$

- Expectation: $p$

- Variance: $p(1-p)$

**Binomial**: Denoted by $Binom(n,p)$. It is a random variable that takes on integer values from $0$ up to $n$ and is equivalent to taking the sum of $n$ iid Bernoulli random variables each with probability $p$. It describes the number of successful trials given $n$ total trials with each succeeding with probability $p$.
- pmf: $p_X(x)=\binom{n}{x}p^x(1-p)^{n-x}$

- Expectation: $np$

- Variance: $np(1-p)$

**Geometric**: Denoted by $Geom(p)$. It is a random variable taking on integer values in the range $[1,\infty)$ and describes the number of trials of a Bernoulli variable with probability $p$ up to and including the first success.
- pmf: $p_X(x)=(1-p)^{x-1}p$ (the $x^{th}$ trial is the first success)

- Expectation: $\frac{1}{p}$

- Variance: $\frac{1-p}{p^2}$

**Negative Binomial**: Denoted by $NB(r,p)$. It is a random variable taking on integer values in the range $[r,\infty)$ and describes the number of trials of a Bernoulli variable with probability $p$ up to and including the $r^{th}$ success.
- pmf: $p_X(x)=\binom{x-1}{r-1}(1-p)^{x-r}p^r$

- Expectation: $\frac{r}{p}$ 

- Variance: $\frac{r(1-p)}{p^2}$

**Hypergeometric**: Denoted by $Hypergeom(N,K,n)$. It is a random variable taking on integer values in the range $[\max(0,n-(N-K)),\min(n,K)]$ and describes the number of successes when we sample $n$ times from a finite population without replacement. The population will be one that contains $K$ correct items and $N$ total items (i.e. $K\leq N$).
- pmf: $P(X=k)=\frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$

- Expectation: $\frac{nK}{N}$

- Variance: $\frac{nK(N-K)(N-n)}{N^2(N-1)}$

Note that for the hypergeometric expectation and variance formulae, you can derive it by considering X to be a sum of indicator random variables.
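
A sketch of that derivation: write $X=\sum_{j=1}^n I_j$ where $I_j$ indicates that the $j^{th}$ draw is a success (the symbol $I_j$ is introduced here for illustration). Each draw is marginally a success with probability $K/N$, and two different draws are both successes with probability $\frac{K}{N}\frac{K-1}{N-1}$, so:
\begin{array}{rl}
\mathbb{E}(X) &= \displaystyle\sum_{j=1}^n \mathbb{E}(I_j) = \frac{nK}{N}\\\\
Cov(I_i,I_j) &= \displaystyle\frac{K}{N}\frac{K-1}{N-1}-\frac{K^2}{N^2} = -\frac{K(N-K)}{N^2(N-1)},\quad i\neq j\\\\
Var(X) &= \displaystyle n\frac{K}{N}\left(1-\frac{K}{N}\right) - n(n-1)\frac{K(N-K)}{N^2(N-1)} = \frac{nK(N-K)(N-n)}{N^2(N-1)}
\end{array}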

**Poisson**: Denoted by $Poi(\lambda)$. It is a random variable taking on integer values in the range $[0,\infty)$ and describes the number of events occurring in a fixed interval when events arrive independently at an average rate of $\lambda$ per interval.
- pmf: $p_X(x)=\frac{\lambda^x}{x!}e^{-\lambda}$ 

- Expectation: $\lambda$

- Variance: $\lambda$

**Exponential**: Denoted by $Exp(\lambda)$. It is a random variable taking on continuous values in the range $[0,\infty)$ and describes the waiting time between consecutive events of a Poisson process with rate $\lambda$ (i.e. the number of events up to time $t$ is $Poi(\lambda t)$). It is a special case of the gamma distribution.
- pdf: $f_X(x) = \lambda e^{-\lambda x}$

- Expectation: $\frac{1}{\lambda}$

- Variance: $\frac{1}{\lambda^2}$

**Gamma**: Denoted by $Gamma(\alpha,\theta)$, where $\alpha$ is the shape parameter and $\theta$ is the scale parameter. It ranges between $(0,\infty)$. Note that sometimes $\theta$ is substituted with a rate parameter $\lambda = 1/\theta$. If $\alpha=k$ is a positive integer and $\theta =1/\lambda$, then the gamma variable represents the waiting time between the $h^{th}$ and $(h+k)^{th}$ events of a Poisson process with rate $\lambda$. 
- pdf: $f_X(x)=\frac{1}{\Gamma(\alpha)\theta^\alpha}x^{\alpha-1}e^{-x/\theta}$

- Expectation: $\alpha\theta$

- Variance: $\alpha\theta^2$

**Normal**: Denoted by $\mathcal{N}(\mu,\sigma^2)$ where $\mu$ is the mean and $\sigma^2$ is the variance. The standard normal distribution is given by $\mathcal{N}(0,1)$.
- pdf: $f_X(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

- Expectation: $\mu$

- Variance: $\sigma^2$

- Some common extensions:
    - Bivariate normal: Denoted by $\mathcal{N}((\mu_x,\mu_y),\Sigma)$ where $\Sigma$ is the covariance matrix of the random vector. It describes a random vector $(X,Y)$ where the first and second entries are each normally distributed according to $\mathcal{N}(\mu_x,\sigma_x^2)$ and $\mathcal{N}(\mu_y,\sigma_y^2)$ and have correlation $\rho$. The joint pdf of the two is given by:
    \begin{equation*}
        f_{XY}(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left(-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_x)^2}{\sigma_x^2}+\frac{(y-\mu_y)^2}{\sigma_y^2}-\frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right)\right)
    \end{equation*}
    - Multivariate normal: Denoted by $\mathcal{N}(\mu,\Sigma)$ where $\mu$ is the mean vector and $\Sigma$ the covariance matrix. It describes a random vector of $n$ normal random variables with pairwise correlations $\rho_{ij}$. The joint pdf of the vector is given by:
    \begin{equation*}
        f_X(x)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)
    \end{equation*}
- **Projection theorem**: The projection theorem for jointly normal RVs $Y$ and $X$ states that the conditional expectation of $Y$ given $X$ is the linear projection of $Y$ onto $X$. This also implies that $Y$ can be decomposed into two orthogonal components, the linear projection onto $X$ and the orthogonal (to $X$) residual.
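
Written out for the bivariate case (using $\mu_X,\mu_Y,\sigma_X,\sigma_Y,\rho$ for the means, standard deviations and correlation, and $\varepsilon$ as a label for the residual), the projection takes the following standard form:
\begin{array}{rl}
\mathbb{E}(Y|X) &= \displaystyle\mu_Y+\frac{Cov(X,Y)}{Var(X)}(X-\mu_X) = \mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(X-\mu_X)\\\\
Y &= \mathbb{E}(Y|X)+\varepsilon,\quad Cov(\varepsilon,X)=0,\quad Var(\varepsilon)=\sigma_Y^2(1-\rho^2)
\end{array}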

**Chi-square**: Denoted by $\chi_k^2$, where $k>0$ is the degrees of freedom. It represents the sum of $k$ squared iid standard normal variables and takes on values $\geq 0$.
- pdf: $f_X(x)=\frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2-1}e^{-x/2}$

- Expectation: $k$

- Variance: $2k$

**Student's-t**: Denoted by $t_\nu$ where $\nu$ is the degrees of freedom. It represents the ratio of a standard normal variable and the square root of a scaled chi-square random variable with degree $\nu$. It varies between $\pm\infty$ like the normal distribution.
- Expectation: 0

- Variance: $\frac{\nu}{\nu-2}$ if $\nu > 2$, and $\infty$ if $1<\nu\leq 2$.

- Considerations: Typically $\nu$ is greater than 1. In the case where $\nu\leq 1$, the expectation and variance become undefined.

- Use cases: It is useful for the t-test, where you perform a hypothesis test for the mean using unknown variance.

**F-distribution**: Denoted by $F(d_1,d_2)$, where $d_1$ and $d_2$ are degrees of freedom. It describes the ratio between two scaled chi-square random variables (divided by their degree).
- Expectation: $\frac{d_2}{d_2-2}$

- Variance: $\frac{2d_2^2(d_1+d_2-2)}{d_1(d_2-2)^2(d_2-4)}$

- Use cases: It is useful for hypothesis testing of the joint relevance of explanatory (independent) variables in linear regression.

**Beta**: Denoted by $Beta(\alpha,\beta)$, where $\alpha,\beta>0$. It ranges between $[0,1]$.
- pdf: $f_X(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$

- Expectation: $\frac{\alpha}{\alpha+\beta}$

- Variance: $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$

- Common extensions:
    - **Dirichlet distribution**: A random vector $X$ has a Dirichlet distribution with parameters $\alpha_1,...,\alpha_K$ if $X_i>0$, $\sum_{i=1}^KX_i=1$ and if the joint pdf follows:
    \begin{equation*}
    f(x_1,...,x_K)=\frac{\Gamma(\sum \alpha_i)}{\prod \Gamma(\alpha_i)}\prod_{i=1}^Kx_i^{\alpha_i-1}
    \end{equation*}
    and each of the marginal distributions are equal to a beta distribution. This has expectation $\mathbb{E}(X_i)=\frac{\alpha_i}{\sum \alpha_j}$ and variance $Var(X_i)=\frac{\alpha_i(\sum \alpha_j - \alpha_i)}{(\sum \alpha_j)^2((\sum \alpha_j) +1)}$

Relations between the distributions: 
- Bernoulli and Binomial: $Bin(n,p) = \sum^n_1 Bern(p)$
- Geometric and Negative Binomial: $Geom(p)=NB(1,p)$.
- Binomial and Hypergeometric: A binomial RV is the equivalent of a hypergeometric RV if sampling was done with replacement. Hence if we take $N$ to infinity while keeping $p=K/N$ the same, the hypergeometric RV becomes a binomial RV.
- Poisson and Exponential and Gamma: Suppose we have a random variable $N(t)$ denoting the number of events up to time $t$ distributed according to $Poi(\lambda t)$. Then the waiting time between events, by the memoryless property, is equal to an exponential RV with parameter $\lambda$ (derivable by considering $N(t)=0$). For the waiting time until the $k^{th}$ event, the distribution becomes a gamma distribution with parameters $k$ and $1/\lambda$. 
$Y=\sum_{i=1}^n X_i$ for $Y\sim Gamma(n,1/\lambda)$ and $X_i\sim Exp(\lambda)$ iid.
- Poisson and Binomial: A poisson variable with $\lambda=np$ is the equivalent of a binomial variable with parameters $n$ and $p$ when $n\rightarrow\infty$ and $p\rightarrow 0$ while $\lambda=np$ stays the same.
- Normal and Chi-square: $\chi_\nu^2=\sum_{i=1}^\nu \mathcal{N}(0,1)^2$
- Student's t and normal and chi-square: $t_\nu = \frac{\mathcal{N}(0,1)}{\sqrt{\chi_\nu^2/\nu}}$ (with independent numerator and denominator).
- Chi-square and gamma: $\chi_k^2=Gamma(k/2,2)$
- Beta and gamma: $\frac{Gamma(\alpha,1)}{Gamma(\alpha,1)+Gamma(\beta,1)}=Beta(\alpha,\beta)$
- Chi-square and beta:  $\frac{\chi_p^2}{\chi_p^2+\chi_q^2}=Beta(\frac{p}{2},\frac{q}{2})$
- Uniform and exponential: $-\ln(U)/\lambda \sim Exp(\lambda)$ where $U\sim Unif(0,1)$
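
The uniform and exponential relation is the basis of inverse transform sampling. The sketch below draws exponential samples from uniform ones and compares the sample mean and variance with $1/\lambda$ and $1/\lambda^2$; the function name, sample size and seed are illustrative.

```python
import math
import random
import statistics


def sample_exponential(rate: float, size: int, seed: int = 0) -> list:
    """Draw Exp(rate) samples as -ln(U) / rate with U uniform on (0, 1]."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1); using 1 - rng.random() avoids log(0).
    return [-math.log(1.0 - rng.random()) / rate for _ in range(size)]


samples = sample_exponential(rate=2.0, size=50_000)
# Compare with the theoretical mean 1/2 and variance 1/4.
print(statistics.mean(samples), statistics.variance(samples))
```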

Additivity rules between independent RVs:
- Normal: $N(\mu_1,\sigma_1^2)+N(\mu_2,\sigma_2^2)=N(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2)$
- Poisson: $Poi(\lambda_1)+Poi(\lambda_2)=Poi(\lambda_1+\lambda_2)$
- Binomial: $Bin(n_1,p) + Bin(n_2,p) = Bin(n_1+n_2,p)$
- Negative binomial: $NB(r_1,p)+NB(r_2,p)=NB(r_1+r_2,p)$
- Exponential: $Exp(\lambda) + Exp(\lambda) = Gamma(2,1/\lambda)$
- Gamma: $Gamma(\alpha_1,\theta) + Gamma(\alpha_2,\theta) = Gamma(\alpha_1+\alpha_2,\theta)$
- Chi-square: $\chi_p^2+\chi_q^2=\chi_{p+q}^2$

Shapes of the distributions:
- Normal: Symmetric about the mean (hence mean=median=mode)
- Student's t: Symmetric about 0 with heavier tails than the normal.
- Binomial: Symmetric for $p=0.5$; otherwise it is right skewed if $p<0.5$ and left skewed if $p>0.5$.
- Beta: Symmetric for $\alpha = \beta$. If $\beta<\alpha$ it is left skewed, if $\beta>\alpha$ it is right skewed.
- Gamma: Right-skewed.
- Geometric: Right skewed.
- Hypergeometric: Symmetric for $K/N=0.5$. For $n<N/2$, it is skewed right if the proportion $K/N$ is less than 0.5, and skewed left if it is greater.
- F distribution: Right skewed.
- Exponential: Right skewed.
- Poisson: Right skewed.
- Chi-square: Right skewed.

### Moments of random variables
Kth raw moment: The $k^{th}$ **raw moment** of a random variable $X$ is the following expectation:
\begin{equation*}
\mathbb{E}(X^k)
\end{equation*}

Kth central moment: The $k^{th}$ **central moment** of a random variable $X$ is the following expectation:
\begin{equation*}
\mathbb{E}[(X-\mathbb{E}(X))^k]
\end{equation*}

Skew: The **skew** is the standardized 3rd central moment of a random variable, $\mathbb{E}[(X-\mu)^3]/\sigma^3$. The closer it is to 0, the more symmetric the distribution is around the mean. For a positive skew, the RV's distribution is right-skewed, and vice versa for left skew.

Kurtosis: The **kurtosis** is the standardized 4th central moment of a random variable, $\mathbb{E}[(X-\mu)^4]/\sigma^4$. This describes the shape of the tails of a distribution. A normal distribution has kurtosis 3, so the **excess kurtosis** (kurtosis minus 3) is 0 for distributions with tails similar to a normal distribution. Positive excess kurtosis indicates heavy tails (leptokurtic) and negative excess kurtosis indicates light tails (platykurtic).

MGF: The **moment generating function** of an RV is given by
\begin{equation*}
M_X(t)=\mathbb{E}(e^{tX})=\sum_{k=0}^{\infty}\frac{t^k}{k!}\mathbb{E}(X^k)
\end{equation*}
From the definition above, it is easy to see that the $k^{th}$ moment is the $k^{th}$ derivative of the MGF at $t=0$, hence why it is called moment generating.

PGF: The **probability generating function** of a discrete RV is given by:
\begin{equation*}
G_X(t)=\mathbb{E}(t^X)
\end{equation*}
Similar to the moment generating function, the probability $\mathbb{P}(X=k)$ can be recovered from the $k^{th}$ derivative of $G_X$ at $t=0$ divided by $k!$, i.e. it is the coefficient of $t^k$ in $G_X(t)$.
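
Since $\mathbb{P}(X=k)$ is the coefficient of $t^k$ in $G_X(t)$, a PGF can be stored as a list of coefficients. The sketch below also uses the standard fact (not stated above) that the PGF of a sum of independent variables is the product of their PGFs, so multiplying the polynomials gives the pmf of the sum. The names are illustrative.

```python
from fractions import Fraction


def pgf_product(p: list, q: list) -> list:
    """Coefficients of G_X(t) * G_Y(t): the pmf of X + Y for independent X and Y."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


# Fair die as PGF coefficients: index k holds P(X = k).
DIE = [Fraction(0)] + [Fraction(1, 6)] * 6
TWO_DICE = pgf_product(DIE, DIE)

# P(sum = 7), and the mean recovered as G'(1) = sum of k * P(X = k).
print(TWO_DICE[7], sum(k * prob for k, prob in enumerate(TWO_DICE)))
```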

### Order statistics
Suppose you have a random sample $X_1,...,X_n$ of size $n$, then we can order them based on value to yield $X_{(1)}\leq...\leq X_{(n)}$ where $X_{(k)}$ is the $k^{th}$ **order statistic**. The first order statistic is equivalent to the minimum and the $n^{th}$ the maximum (for odd $n$ the median is given by $X_{(\lceil n/2\rceil)}$).

The pdf of the k-th order statistic is given by:
\begin{equation*}
f_{X_{(k)}}(x) = n\binom{n-1}{k-1}[F(x)]^{k-1}[1-F(x)]^{n-k}f(x), x\in\mathbb{R} 
\end{equation*}
The above formula can be understood as the product of the number of ways to choose which of the $n$ points sits at $x$ (with density $f(x)$), the number of ways to choose the $k-1$ points below it out of the remaining $n-1$ points, the probability that those $k-1$ values are below $x$ (given by $[F(x)]^{k-1}$), and the probability that the other $n-k$ values are above $x$ (given by $[1-F(x)]^{n-k}$).

For the minimum and maximum order statistics, the formula reduces to the following:
\begin{array}{rl}
f_{X_{(1)}}(x)&=n[1-F(x)]^{n-1}f(x) \\ 
f_{X_{(n)}}(x)&=n[F(x)]^{n-1}f(x)
\end{array}

The joint distribution of the order statistics is given by:
\begin{equation*}
f_{X_{(1)},...,X_{(n)}}(x_1,...,x_n)=n!f(x_1)...f(x_n), x_1<...<x_n
\end{equation*}

**Uniform order statistics** (i.e. iid $Unif(0,1)$):
- distribution of the kth order statistic: $X_{(k)}\sim Beta(k,n-k+1)$
- pdf of the kth order statistic: $f_{X_{(k)}}(x)=n\binom{n-1}{k-1}x^{k-1}(1-x)^{n-k}$
- expected value of the kth order statistic: $\frac{k}{n+1}$
- joint pdf: $f_{X_{(1)},...,X_{(n)}}(x_1,..,x_n)=n!$

Spacings: Consider the difference between consecutive uniform order statistics $D_k = X_{(k+1)}-X_{(k)}$ where $X_{(0)}=0,X_{(n+1)}=1$. I.e. $D_0=X_{(1)}$ and $D_n=1-X_{(n)}$. 
- distribution of spacings: $(D_0,...,D_n)\sim Dirichlet(1,1...,1)$ with $D_k\sim Beta(1,n)$
- pdf of any spacing: $n(1-x)^{n-1}$
- joint pdf of the spacings: $f(d_0,...,d_n)=n!$ on the set where $d_k\geq 0$ and $\sum_k d_k=1$
- expected value of a spacing: $\mathbb{E}(D_k)=\frac{1}{n+1}$

Order statistics on spacings (for the $n+1$ spacings $D_0,...,D_n$): 
- Smallest spacing pdf: $f(x)=n(n+1)(1-(n+1)x)^{n-1}$ for $0\leq x\leq\frac{1}{n+1}$
- Smallest spacing expected value: $\frac{1}{(n+1)^2}$
- Largest spacing pdf: $f(x)=\sum_{j=1}^{n+1}(-1)^{j-1}\binom{n+1}{j}nj(1-jx)_+^{n-1}$ where $(y)_+=\max(y,0)$
- Largest spacing expected value: $\frac{H_{n+1}}{n+1}$ where $H_i$ is the $i^{th}$ harmonic number.
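
The expected values listed above can be sanity-checked by simulation. The following is a minimal Monte Carlo sketch, assuming $n$ iid $Unif(0,1)$ points (hence $n+1$ spacings); the function name `simulate_spacings`, the trial count and the seed are illustrative choices, not part of the notes.

```python
import random

def simulate_spacings(n, trials=20000, seed=0):
    """Monte Carlo estimates for n iid Unif(0,1) points (n + 1 spacings)."""
    rng = random.Random(seed)
    sum_min = sum_max = 0.0
    sum_order = [0.0] * n
    for _ in range(trials):
        points = sorted(rng.random() for _ in range(n))
        for k, value in enumerate(points):
            sum_order[k] += value
        # Add the end points 0 and 1, then take consecutive differences.
        edges = [0.0] + points + [1.0]
        spacings = [b - a for a, b in zip(edges, edges[1:])]
        sum_min += min(spacings)
        sum_max += max(spacings)
    order_means = [s / trials for s in sum_order]
    return order_means, sum_min / trials, sum_max / trials

n = 4
order_means, mean_min, mean_max = simulate_spacings(n)
harmonic = sum(1 / i for i in range(1, n + 2))  # H_{n+1}

print(order_means, [k / (n + 1) for k in range(1, n + 1)])  # estimate vs k/(n+1)
print(mean_min, 1 / (n + 1) ** 2)                           # smallest spacing
print(mean_max, harmonic / (n + 1))                         # largest spacing
```

Each printed pair should agree to roughly two decimal places; the agreement tightens as `trials` grows.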

Sidenote: Everything above is entirely derivable from first principles (i.e. just start from definitions of uniform random variables and order statistics and work your way down the derivations).

Example problems:
- Traditional:
    - Probability and expected size of minimum/maximum of n uniform random variables
    - Probability and expected size of median of n uniform random variables
    - Probability and expected size of quantiles of n uniform random variables
    - Probability and Expected value of the kth uniform order statistic
    - Expected size of the smallest/largest section of a stick broken at n points
    - Probability of a triangle given 2 breakpoints on a stick
    - Skyline problem
    - Probability that at least k uniform points fall in an interval (binomial and beta distribution perspectives).
- Finance context:
    - Minimum loss 
    - Median return
    - Maximum return
    - Expected kth largest return
    - Expected maximum drawdown
    - Expected percentile returns
    - Expected number of assets exceeding the median

### Matrices in probability
**Stochastic matrices**: Matrices whose entries are greater than or equal to 0 and whose rows each sum to one (if the columns also each sum to one, the matrix is doubly stochastic).

**Positive definite matrices**: Symmetric matrices such that $x^TMx>0$ for all $x\neq 0$
- Properties:
    - Eigenvalues: All eigenvalues are positive
    - Determinants: The determinant is positive
    - Invertibility: The inverse exists.

**Positive semi-definite matrices**: Symmetric matrices such that $\forall x, x^TMx\geq0$
- Properties:
    - Eigenvalues: All eigenvalues are non-negative
    - Determinants: The determinant is non-negative
    - Invertibility: The inverse may not exist

**Correlation matrix**: Given a random vector, its correlation matrix is a matrix whose $ij^{th}$ entry describes the correlation between $X_i$ and $X_j$. The sample correlation matrix must be positive semi-definite since it is calculated from:
\begin{equation*}
R = \frac{1}{n}XX^T
\end{equation*}
where $X$ is the standardized data matrix (one row per variable, one column per observation, each row centered and scaled to unit variance), so that $\alpha^TR\alpha=\frac{1}{n}||X^T\alpha||^2\geq 0$. It becomes positive definite if none of the variables are linearly dependent.

**Covariance matrix**: Given a random vector, its covariance matrix is a matrix whose $ij^{th}$ entry describes the covariance between $X_i$ and $X_j$. The covariance matrix is also positive semi-definite:
\begin{equation*}
\alpha^T\Sigma \alpha = Var(\alpha^Tx) \geq 0
\end{equation*}
It becomes positive definite if none of the variables are linearly dependent.

Importance of PSD/PD in correlation and covariance matrices:
- Invertibility: If PD, the inverse of the matrix exists, which makes a lot of calculations (e.g. portfolio optimization) easier. Otherwise, other methods (e.g. a pseudo-inverse or regularization) are needed.
- Convexity: If a matrix $A$ is PSD, the function defined by $f(x)=x^TAx$ is convex, and if $A$ is PD it is strictly convex. This is important in optimization problems since any local minimum of a convex function is a global minimum, and a strictly convex function has at most one minimizer.

Some example questions:
- Using the PSD property of the covariance matrix, find the upper and lower bounds of the variables in a given covariance matrix.
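
As an illustration of this kind of question, here is a minimal sketch for a $3\times 3$ correlation matrix in which $\rho_{12}$ and $\rho_{13}$ are known and $\rho_{23}$ is free. It assumes the standard fact that, with a unit diagonal and all $|\rho|\leq 1$, the matrix is PSD exactly when its determinant $1+2\rho_{12}\rho_{13}\rho_{23}-\rho_{12}^2-\rho_{13}^2-\rho_{23}^2$ is non-negative. The function names and the input values 0.9 and 0.8 are hypothetical.

```python
import math

def correlation_bounds(rho_12, rho_13):
    """Range of rho_23 that keeps the 3x3 correlation matrix PSD."""
    # det >= 0 is a quadratic inequality in rho_23; these are its two roots.
    centre = rho_12 * rho_13
    half_width = math.sqrt((1 - rho_12 ** 2) * (1 - rho_13 ** 2))
    return centre - half_width, centre + half_width

def det3_correlation(a, b, c):
    """Determinant of [[1, a, b], [a, 1, c], [b, c, 1]]."""
    return 1 + 2 * a * b * c - a * a - b * b - c * c

low, high = correlation_bounds(0.9, 0.8)
print(low, high)                          # admissible range for rho_23
print(det3_correlation(0.9, 0.8, low))    # approximately 0 at the boundary
```

Values of $\rho_{23}$ strictly inside the printed range give a positive determinant (PD), and values outside give a negative one (not a valid correlation matrix).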

# Important Inequalities

### Basic inequalities:
**AM-GM-HM inequality**: 
\begin{equation*}
HM\leq GM \leq AM
\end{equation*}
where HM is the harmonic mean ($\frac{n}{\sum \frac{1}{x_i}}$), GM is the geometric mean ($(\prod x_i)^{1/n}$) and AM is the arithmetic mean ($\frac{1}{n}\sum x_i$) of a sample of $n$ positive numbers. Equality holds exactly when all $x_i$ are equal.

**Triangle inequality**: Suppose you have vectors $x$ and $y$. Then:
\begin{equation*}
||x+y||\leq||x||+||y||
\end{equation*}
A closely related statement: three side lengths $a,b,c$ can form a (non-degenerate) triangle if and only if the sum of any two sides is strictly greater than the third.

**Cauchy-Schwarz inequality**: Suppose you have vectors $x$ and $y$. Then:
\begin{equation*}
|\langle x,y\rangle|\leq||x||\,||y||
\end{equation*}
where $\langle x,y\rangle$ denotes the inner product (usually $x^Ty=||x||\,||y||\cos{\theta}$). 

### Probability focused:

**Markov inequality**: Given a non-negative random variable $X$ and $a>0$, we can bound the tail probability by
\begin{equation*}
\mathbb{P}(X\geq a)\leq\frac{\mathbb{E}(X)}{a}
\end{equation*}

**Chebyshev inequality**: Given a random variable $X$ with mean $\mu$ and variance $\sigma^2$, and any $k>0$, we can bound the probability by:
\begin{equation*}
\mathbb{P}(|X-\mu|\geq k\sigma)\leq\frac{1}{k^2}
\end{equation*}
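
To see how loose these bounds can be, the sketch below evaluates both sides exactly for a fair six-sided die (an illustrative choice of distribution, not one taken from the notes). For Chebyshev it uses a deviation of $k\sigma=2$, so the bound $\frac{1}{k^2}$ equals $\frac{\sigma^2}{4}$.

```python
from fractions import Fraction

# Fair die: an assumed example distribution.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

def tail_prob(a):
    """P(X >= a)."""
    return sum(p for x, p in pmf.items() if x >= a)

def deviation_prob(d):
    """P(|X - mean| >= d)."""
    return sum(p for x, p in pmf.items() if abs(x - mean) >= d)

markov_lhs, markov_rhs = tail_prob(5), mean / 5          # P(X >= 5) vs E(X)/5
cheb_lhs, cheb_rhs = deviation_prob(2), variance / 4     # P(|X - mu| >= 2) vs sigma^2/4

print(markov_lhs, "<=", markov_rhs)
print(cheb_lhs, "<=", cheb_rhs)
```

Both inequalities hold with plenty of room, which is typical: they use only the mean (and variance), not the full distribution.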

**Convexity and concavity**: 
A function $f(x)$ is **convex** on the interval $[a,b]$ if we have:
\begin{equation*}
    \forall x,y\in[a,b], \forall t\in[0,1], f(tx+(1-t)y)\leq tf(x)+(1-t)f(y)
\end{equation*}
The above inequality formalizes the idea that the line segment between any two points $(x,f(x))$ and $(y,f(y))$ on the graph lies on or above the graph of $f$. Strict convexity occurs when the inequality is strict for all $x\neq y$ and $t\in(0,1)$.

A function $f(x)$ is **concave** on $[a,b]$ if:
\begin{equation*}
    \forall x,y\in[a,b], \forall t\in[0,1], f(tx+(1-t)y)\geq tf(x)+(1-t)f(y)
\end{equation*}
The above states that the line segment between any two points $(x,f(x))$ and $(y,f(y))$ on the graph lies on or below the curve. Strict concavity occurs when the inequality is strict for all $x\neq y$ and $t\in(0,1)$.

**Jensen's inequality** applies convexity/concavity to expectations. Suppose a function $f$ is convex and $X$ is a random variable with finite mean, then we have:
\begin{equation*}
f(\mathbb{E}(X))\leq\mathbb{E}(f(X))
\end{equation*}
For concave $f$ the inequality is reversed.
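
A quick numerical illustration, again assuming a fair six-sided die as the random variable (an example chosen here, not taken from the notes): $f(x)=x^2$ is convex, while $\ln{x}$ is concave, so the inequality flips for the logarithm.

```python
import math
from fractions import Fraction

faces = range(1, 7)                 # fair die, assumed for illustration
mean = Fraction(sum(faces), 6)      # E(X) = 7/2

# Convex case f(x) = x^2: f(E(X)) <= E(f(X))
convex_lhs = mean ** 2
convex_rhs = Fraction(sum(x * x for x in faces), 6)

# Concave case f(x) = ln(x): f(E(X)) >= E(f(X))
concave_lhs = math.log(float(mean))
concave_rhs = sum(math.log(x) for x in faces) / 6

print(convex_lhs, "<=", convex_rhs)
print(concave_lhs, ">=", concave_rhs)
```

The gap in the convex case, $\mathbb{E}(X^2)-\mathbb{E}(X)^2$, is exactly the variance.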

# Lin alg

### Independent vectors, Span, Basis, Dimensionality

### Dot product, Norm, Orthogonality, Gram Schmidt procedure

### Linearity and matrices as linear transformations

### Matrix subspaces : row space, column space, image, null space, rank-nullity

### Determinants, Cofactors, Adjugates, Cramer's rule

### Transpose, Inverse

### Spectral theory: 
- Eigenvalues
- Eigenvectors
- Characteristic polynomial
- Multiplicities
- Diagonalization

### Special square matrices and properties
- Symmetric matrices
- Triangular matrices
- Skew-symmetric matrices

### Other Matrix decompositions (Besides eigenvalue decomposition)
- LU
- QR
- PLU
- Cholesky
- SVD


# Matrix Calculus
### Gradient, Divergence and curl

### Hessian

### Jacobian

### Laplacian

### Differentiation rules


# Linear Recurrence relations
Recurrence relations often appear in Markov chain problems (among others), so it is useful to know the methods for solving specific types of them.

### Definition
A **linear recurrence relation** defines each term as a linear combination of previous terms, possibly plus a term that depends only on $n$. It has general form:
\begin{equation*}
x_n = a_1x_{n-1} + a_2x_{n-2}+...+a_kx_{n-k} + f(n)
\end{equation*}
where $a_1, a_2,...,a_k$ are coefficients and $f(n)$ is the non-recursive term. 

Recurrence relations are classified based on three main things:
- Homogeneity: If $f(n) = 0$, the recurrence relation is homogeneous. Otherwise it is not.
- Constant vs Variable coefficients: if $a_i$ do not vary with $n$ then the coefficients are constant.
- Order: A kth-order relation means that $x_n$ is a function of $x_{n-1}$ up to $x_{n-k}$.

### Solving constant-coefficient, homogeneous linear recurrences:
Given the kth-order relation
\begin{equation*}
x_n = a_1x_{n-1} + a_2x_{n-2}+...+a_kx_{n-k}
\end{equation*}
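we have the following **characteristic equation** (obtained by substituting $x_n=r^n$ and dividing by $r^{n-k}$), whose roots are the roots of the characteristic polynomial: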
\begin{equation*}
r^k = a_1r^{k-1} + a_2r^{k-2}+...+a_k
\end{equation*}

We will denote $x^{(h)}_n$ as the general solution to the homogeneous system. We have three cases of solutions to the characteristic polynomial:
#### Case 1:
If the characteristic equation has $k$ distinct roots $r_1, r_2,...,r_k$, then the general solution is given by:
\begin{equation*}
x_n=\sum_{j=1}^k{A_jr_j^n}
\end{equation*}
where substituting the initial conditions provides the values of the unknown coefficients $A_1,...,A_k$.
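
Case 1 can be sketched in code for a second-order recurrence $x_n=a_1x_{n-1}+a_2x_{n-2}$. This is a minimal illustration that assumes two distinct real roots; `closed_form` and `iterate` are hypothetical helper names, and the Fibonacci recurrence is used as the test case.

```python
import math

def closed_form(a1, a2, x0, x1):
    """Return n -> x_n for x_n = a1*x_{n-1} + a2*x_{n-2} (distinct real roots only)."""
    disc = a1 * a1 + 4 * a2              # discriminant of r^2 = a1*r + a2
    if disc <= 0:
        raise ValueError("this sketch only handles two distinct real roots")
    r1 = (a1 + math.sqrt(disc)) / 2
    r2 = (a1 - math.sqrt(disc)) / 2
    # Initial conditions: A1 + A2 = x0 and A1*r1 + A2*r2 = x1.
    coef1 = (x1 - x0 * r2) / (r1 - r2)
    coef2 = x0 - coef1
    return lambda n: coef1 * r1 ** n + coef2 * r2 ** n

def iterate(a1, a2, x0, x1, n):
    """Compute x_n directly from the recurrence, for comparison."""
    prev, cur = x0, x1
    for _ in range(n):
        prev, cur = cur, a1 * cur + a2 * prev
    return prev

fib = closed_form(1, 1, 0, 1)
print([round(fib(n)) for n in range(10)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The closed form is evaluated in floating point, so it is rounded before comparing with the integer sequence; for large $n$ the rounding error eventually dominates.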

#### Case 2:
If the characteristic equation has repeated roots, then given $r_1,...,r_p$ with multiplicities $m_1,..,m_p$ that sum to $k$, we have the general solution:
\begin{equation*}
x_n=\sum_{j=1}^p\sum_{i=0}^{m_j-1}A_{j,i}n^ir^n_j=A_{1,0}r^n_1+...+A_{1,m_1-1}n^{m_1-1}r^n_1+A_{2,0}r^n_2+...+A_{p,m_p-1}n^{m_p-1}r^n_p
\end{equation*}

#### Case 3:
If the characteristic equation has complex roots, then each of the complex conjugate pairs $\alpha\pm i\beta$ contributes the following term to the general solution (letting $r$ denote $\alpha+i\beta$):
\begin{equation*}
x_n=...+|r|^n(B\cos{(n \arg(r))}+C\sin{(n \arg(r))})
\end{equation*}
where $|r|^2=\alpha^2+\beta^2$ and $\arg(r)=\arctan(\beta/\alpha)$ (for $\alpha>0$; in general $\arg(r)$ is the angle of the point $(\alpha,\beta)$ in the plane).

### Inhomogeneous linear recurrences
In general, the solution to an inhomogeneous system is given by: $x_n^{GS}=x_n^{(h)}+x_n^{(p)}$ where $x^{(h)}_n$ denotes the homogeneous solution and $x_n^{(p)}$ the particular solution. So the main extension is to find the particular solution. This can typically be done by **Ansatz**, which is the process of guessing a solution form (e.g. $x_n=A_0+A_1n+A_2n^2$) based on what $f(n)$ looks like and then substituting it into the recurrence to solve for the coefficients.
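
As a small worked illustration of the Ansatz step (the recurrence below is a made-up example, not one from these notes), take $x_n = 2x_{n-1} + 1$ with $x_0=0$. Since $f(n)=1$ is constant, a constant guess $x_n^{(p)}=A$ is a natural first try:
\begin{array}{rl}
x_n^{(h)} &= C2^n \text{ (characteristic root $r=2$)} \\\\
A &= 2A + 1 \implies x_n^{(p)} = A = -1 \\\\
x_n^{GS} &= C2^n - 1 \\\\
x_0 = 0 &\implies C = 1 \implies x_n = 2^n - 1
\end{array}
Checking the first few terms ($x_1=1$, $x_2=3$, $x_3=7$) against the recurrence confirms the result.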

### Green's function approach for linear recurrences:
Green's function $G$ is the impulse response of a linear operator $L$ (arising from ODEs, difference equations, recurrence relations, etc.). Note that this means it must also obey any fixed boundary conditions imposed by the original problem.
\begin{equation*}
LG=\delta
\end{equation*}
where $\delta$ is the Dirac delta function (zero away from the origin, with $\int_{-k}^k\delta(x)dx=1$ for all $k>0$).

In the case of a linear recurrence relation, we have the equation $Lx_n = f(n)$ that gives us:
\begin{equation*}
LG(n,m) = \delta_{n,m} = \begin{cases} 1 & \text{if } n=m \\ 0 & \text{otherwise} \end{cases}
\end{equation*}
where $\delta_{n,m}$ is the Kronecker delta (the discrete analogue of the Dirac delta). The solution to the linear recurrence is then given by:
\begin{equation*}
x_n = \sum_{m}G(n,m)f(m)
\end{equation*}

Note that when solving for $G(n,m)$, it may be necessary to use separate expressions for $G$ on $n<m$ and $n>m$, matched by assuming continuity at $n=m$. Away from $n=m$, the equation $LG(n,m) = \delta_{n,m}$ is just a homogeneous linear recurrence relation; the equation at $n=m$ then fixes the remaining coefficients.

### Example: 
Suppose you have 1000 yellow cubes and 1 blue cube. At every time step, 2 cubes are picked up at random and one is made to match the color of the other. What is the expected number of steps till the cubes all become the same color?

For starters, we can define the following recurrence in expectations: Let $N_i$ denote the number of steps if we started with $i$ blue cubes and $1001-i$ yellow cubes.
\begin{array}{rl}
\mathbb{E}(N_0)=\mathbb{E}(N_{1001})&=0 \text{ (Boundary conditions/Edge case)}\\\\
\mathbb{E}(N_i)&=\displaystyle 1 + \mathbb{E}(N_{i+1})\frac{i(1001-i)}{1001*1000} + \mathbb{E}(N_{i-1})\frac{i(1001-i)}{1001*1000} + \mathbb{E}(N_i)(1-2\frac{i(1001-i)}{1001*1000}) \\\\
\implies \mathbb{E}(N_i)&=\displaystyle \frac{1001000}{2i(1001-i)} + \frac{1}{2}\mathbb{E}(N_{i-1}) + \frac{1}{2}\mathbb{E}(N_{i+1})
\end{array}

Let the expectation for $N_i$ be denoted by $x_i$. Then we have:
\begin{array}{rl}
x_i &=\displaystyle \frac{1001000}{2i*(1001-i)} + \frac{1}{2}x_{i-1} + \frac{1}{2}x_{i+1}\\\\
\implies 2x_i - x_{i-1} - x_{i+1} &= \displaystyle\frac{1001000}{i*(1001-i)}
\end{array}

The above gives us a linear recurrence relation, with homogeneous part $2x_i - x_{i-1} - x_{i+1} = 0$. This gives us the following general solution via the characteristic equation:
\begin{array}{rl}
2r - 1 - r^2 &= 0 \\\\
\implies r_1,r_2 &= 1 \text{ (repeated root)} \\\\
\implies x_i^{(h)} &= Ar_1^i + Bir_1^i = A + Bi \text{ (A and B constants to be determined)}
\end{array}

This leaves us with finding the particular solution to the inhomogeneous recurrence $x_{i+1} -2x_i + x_{i-1} = -\frac{1001000}{i*(1001-i)}$. Let $f(i)=-\frac{1001000}{i*(1001-i)}$. We now apply Green's function (discrete version):
\begin{array}{rl}
L(x_i) &= x_{i+1} -2x_i + x_{i-1} \text{ (Linear operator)} \\\\
LG(i,j) &= G(i+1,j) - 2G(i,j) + G(i-1,j) = \delta(i,j) \text{ (Applying linear operator to green's function)}\\\\
G(0,j) = G(1001,j) &= 0 \text{ (Enforcing the boundary conditions $x_0=x_{1001}=0$)} \\\\
x_i &= \displaystyle \sum_{j=1}^{1000} G(i,j)f(j) \text{ (Defining $x_i$ in terms of green's function; the recurrence only holds at the interior points $j=1,...,1000$)}
\end{array}

Now we solve the equation $G(i+1,j) - 2G(i,j) + G(i-1,j) = \delta(i,j)$ to determine Green's function. 
\begin{array}{rl}
G(i+1,j) - 2G(i,j) + G(i-1,j) &= 0, \text{ $i\neq j$}\\\\
G(i,j) &= Ai + B \text{ (Homogeneous solution)} \\\\
G(i,j) &= \begin{cases}
A_Li + B_L & i<j \\\\
A_Ri + B_R & i>j
\end{cases}\\\\
G(0,j) &= 0 = B_L \text{ (Applying boundary conditions to determine coefficients)} \\\\
G(1001,j) &= 0 = 1001A_R + B_R \implies B_R = - 1001 A_R \\\\
\implies G(i,j) &= \begin{cases}
A_Li & i<j \\\\
A_R(i - 1001) & i>j
\end{cases}\\\\
A_Lj &= A_R(j - 1001) \text{ (Assuming continuity at $i=j$)} \\\\
\implies A_R &=\displaystyle A_L\frac{j}{j-1001} \\\\
G(i+1,j) - 2G(i,j) + G(i-1,j) &= 1, \text{ ($i=j$, where the delta equals 1)}\\\\
\implies G(j+1,j) - 2G(j,j) + G(j-1,j) &= 1 \\\\
\implies (A_R(j+1)-1001A_R) - 2(A_Lj) + A_L(j-1) &= 1 \\\\
\implies\displaystyle A_L\frac{j(j+1)}{j-1001} - A_L\frac{1001j}{j-1001} - 2(A_Lj) + A_L(j-1) &= 1 \\\\
\implies A_L &= \displaystyle\frac{j-1001}{1001} \\\\
\implies A_R &= \displaystyle\frac{j}{1001} \\\\
\therefore G(i,j) &= \begin{cases}
\displaystyle i\frac{j-1001}{1001} & i\leq j\\\\
\displaystyle j\frac{i-1001}{1001} & i > j
\end{cases}
\end{array}

This allows us to find the value of $\mathbb{E}(N_1)=x_1=\sum_{j=1}^{1000}G(1,j)*f(j)$.
\begin{array}{rl}
G(1,j) &= \displaystyle \frac{j-1001}{1001} \text{ (By formula)}\\\\
\implies G(1,j)f(j) &= \displaystyle -\frac{1001000}{j*(1001-j)}\frac{j-1001}{1001} = \frac{1000}{j} \\\\
\implies x_1 &= 1000 * \sum_{j=1}^{1000} \frac{1}{j} \\\\
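\implies x_1 &\approx 1000 * (\ln{1000} + \gamma) \approx 1000 * (6.908 + 0.577) \text{ (harmonic sum $H_n\approx\ln{n}+\gamma$ with $\gamma\approx 0.577$)}\\\\
\implies x_1 &\approx 7485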
\end{array}
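
The derivation can be checked numerically. The sketch below is a hedged illustration rather than part of the original solution: it generalizes the Green's function to `total` cubes (an assumption made so that an exact check with fractions stays fast), verifies that $x_i=\sum_j G(i,j)f(j)$ satisfies the recurrence for 11 cubes, and then evaluates the harmonic sum for the original 1001 cubes directly.

```python
from fractions import Fraction

def green(i, j, total):
    """Green's function of x[i+1] - 2x[i] + x[i-1] with x[0] = x[total] = 0."""
    if i <= j:
        return Fraction(i * (j - total), total)
    return Fraction(j * (i - total), total)

def expected_steps(total):
    """x[i] = expected number of steps starting from i blue cubes out of `total`."""
    forcing = {j: Fraction(-total * (total - 1), j * (total - j))
               for j in range(1, total)}
    x = [Fraction(0)] * (total + 1)
    for i in range(1, total):
        x[i] = sum(green(i, j, total) * forcing[j] for j in range(1, total))
    return x

small = expected_steps(11)  # 11 cubes keeps the exact check fast
for i in range(1, 11):
    # The recurrence 2x_i - x_{i-1} - x_{i+1} = T(T-1) / (i(T-i)) with T = 11.
    assert 2 * small[i] - small[i - 1] - small[i + 1] == Fraction(110, i * (11 - i))

# Original problem: only x_1 is needed, and G(1, j) f(j) = 1000 / j.
x_1 = float(sum(Fraction(1000, j) for j in range(1, 1001)))
print(x_1)
```

The printed value is $1000\,H_{1000}$, which is about 7485.47.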

### Extension:
Methods for solving linear recurrence relations have many analogues in linear ODE/PDE systems and linear difference equations.

# Concepts in Random walks
### Definition of a random walk


### Definition of brownian motion


### Markov Property


### Martingales


### Stopping time and the Optional Stopping Theorem


### Mean reverting walks


### Random walk on a graph



# Markov chains [[src](https://www.statslab.cam.ac.uk/~rrw1/markov/M.pdf)]

### Definition
For a markov chain, we have the following components:
- The **probability space** $(\Omega, \mathcal{F}, \mathbb{P})$.
- A countable set $\mathcal{I}$ which denotes our set of possible states in the **state-space**
- The sequence of random variables $X_0, X_1,...\in \mathcal{I}$ representing our **markov chain** $(X_n)_{n\geq 0}$. This can be understood as the path the random process takes
- A row vector $\lambda$ representing the initial distribution over $\mathcal{I}$ (i.e. $\mathbb{P}(X_0=i)=\lambda_i$ $\forall i\in\mathcal{I}$).
- A stochastic matrix (meaning all entries are $\geq 0$ and every row sums to 1) known as the **transition matrix** $P$.

Denote a Markov chain following $\lambda$ and $P$ by $Markov(\lambda, P)$. The entry $(P)_{ij}=p_{ij}$ (in row $i$ and column $j$) represents the probability of transitioning from state $i$ to state $j$.

### Probabilities on a Markov Chain
Below are the basic probability rules governing a basic Markov chain
\begin{array}{rl}
    \mathbb{P}(X_0=i_0,...,X_n=i_n) &= \lambda_{i_0}p_{i_0i_1}...p_{i_{n-1}i_n} \\\\
    \mathbb{P}(X_n=i_n|X_0=i_0,...,X_{n-1}=i_{n-1})&=\mathbb{P}(X_n=i_n|X_{n-1}=i_{n-1})=p_{i_{n-1},i_n} \text{ (Markov property)} \\\\
    \mathbb{P}(X_2=j|X_0=i) &= \sum_k{p_{ik}p_{kj}}=(P^2)_{ij} \text{ (conditioned 2-step transition)} \\\\
    \mathbb{P}(X_2=j) &= \sum_{i}\lambda_i\mathbb{P}(X_2=j|X_0=i) = (\lambda P^2)_j \text{ (2-step transition)} \\\\
    \mathbb{P}(X_n=j|X_0=i) &= \sum_{i_1,...,i_{n-1}}p_{ii_1}...p_{i_{n-1}j} = (P^n)_{ij} \text{ (conditioned n-step transition)} \\\\
    \mathbb{P}(X_n=j) &= \sum_{i_0,...,i_{n-1}}\lambda_{i_0}p_{i_0i_1}...p_{i_{n-1}j} = (\lambda P^n)_j \text{ (n-step transition)} \\\\
    (P^{n+m})_{ij} &= \sum_k{(P^n)_{ik}(P^m)_{kj}} \text{ (Chapman Kolmogorov equation)}
\end{array}
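
These rules translate directly into matrix products. Below is a minimal sketch using exact fractions; the two-state transition matrix `P` and the starting distribution `lam` are hypothetical values chosen for illustration, and `mat_mul`/`mat_pow` are small helpers written here rather than library functions.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_pow(p, n):
    """P^n by repeated multiplication (P^0 is the identity)."""
    size = len(p)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result

# Hypothetical two-state chain.
P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]
lam = [[Fraction(1), Fraction(0)]]   # row vector: start in state 0

two_step = mat_pow(P, 2)                        # (P^2)_{ij}
dist_after_3 = mat_mul(lam, mat_pow(P, 3))[0]   # (lambda P^3)_j

print(two_step)
print(dist_after_3)
# Chapman Kolmogorov: P^(2+3) equals P^2 times P^3.
print(mat_pow(P, 5) == mat_mul(mat_pow(P, 2), mat_pow(P, 3)))
```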

### Reducibility, periodicity and stationary distributions
State $j$ is accessible from state $i$ if there exists an $n\geq 0$ such that $(P^n)_{ij}>0$; two states $i$ and $j$ **communicate** if each is accessible from the other. A Markov chain is **irreducible** if every pair of states communicates. Otherwise, it is **reducible** and we can effectively break the state space down into communicating classes.

The **period** $d(i)$ of a state $i$ is the greatest common divisor of all $n\geq 1$ such that $(P^n)_{ii}>0$, so returns to state $i$ can only occur at multiples of $d(i)$ time steps. If $d(i)=1$, then the state is said to be **aperiodic**. If the chain is irreducible, all states share the same period, so the whole chain is aperiodic as long as at least one of its states is aperiodic.

The **stationary distribution** of an irreducible and aperiodic Markov chain with $n$ states is the probability vector $\pi=(\pi_1,...,\pi_n)$ (non-negative entries summing to one) such that $\pi P=\pi$. It represents the long-run distribution over the possible states as we repeatedly take steps in the Markov chain and, by definition, is a left eigenvector of the transition matrix corresponding to eigenvalue 1.

A state $i$ is **absorbing** if the chain never leaves it once it is reached (i.e. $p_{ii}=1$). It is **transient** if, starting from it, there is positive probability of never returning.
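
For a small irreducible, aperiodic chain, the stationary distribution can be approximated by repeatedly applying $P$ to any starting distribution. The sketch below assumes a hypothetical two-state chain; `stationary` is an illustrative helper, and solving $\pi P=\pi$ as a linear system would be an equally valid approach.

```python
def step(dist, p):
    """One step of the chain: the row vector dist times the matrix p."""
    size = len(p)
    return [sum(dist[i] * p[i][j] for i in range(size)) for j in range(size)]

def stationary(p, tol=1e-12, max_iter=10000):
    """Iterate dist -> dist * P from the uniform distribution until it stops moving."""
    dist = [1.0 / len(p)] * len(p)
    for _ in range(max_iter):
        new = step(dist, p)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    raise RuntimeError("no convergence; the chain may be periodic or reducible")

# Hypothetical two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)
print(pi)            # close to (5/6, 1/6)
print(step(pi, P))   # unchanged up to rounding, since pi P = pi
```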

### Markov chains in probability and expectations
Example problems:
- Expected length of coin sequence to see 2 heads in a row
- Probability of seeing 2 heads in a row before HT.
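
Both example problems reduce to small linear systems via first-step analysis. The following is a minimal sketch assuming a fair coin; `solve_linear` is a hypothetical exact Gauss-Jordan helper written for this example, and the state labels (0 = no useful progress, 1 = last flip was H) are one possible modelling choice.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]

def identity_minus(q):
    """Return the matrix I - Q."""
    return [[int(i == j) - q[i][j] for j in range(len(q))] for i in range(len(q))]

half = Fraction(1, 2)

# Expected flips until HH: t = 1 + Q t, i.e. (I - Q) t = 1.
# From state 0: T stays in 0, H moves to 1. From state 1: T returns to 0, H finishes.
Q_hh = [[half, half],
        [half, 0]]
expected_flips = solve_linear(identity_minus(Q_hh), [1, 1])

# P(HH before HT): h = Q h + r, where r is the one-step chance of finishing with HH.
# From state 1 the next flip ends the game: H gives HH, T gives HT.
Q_race = [[half, half],
          [0, 0]]
win_prob = solve_linear(identity_minus(Q_race), [0, half])

print(expected_flips[0])   # 6
print(win_prob[0])         # 1/2
```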

# Dynamic Programming principle (DPP) for problem solving
### Definition of DPP and structure of DPP-able problems
DPP is a method of breaking down a complicated problem by solving simpler subproblems. These subproblems have the same structure as the original but on a smaller scale, and there exists a recursive relation between the solutions of larger problems and smaller problems. The most common example of this would be the Fibonacci numbers.

Typically, the idea is to identify the base cases, then the recursive relation, and then use the relation to build up your answer step by step.

### Common DPP number sequences 
#### Basic
Binomial: Number of ways to choose $k$ items out of $n$: $\binom{n}{k}=\binom{n-1}{k}+\binom{n-1}{k-1}$

Derangements: Number of permutations of $n$ elements with no fixed points, with $D_0=1$ and $D_1=0$.
\begin{equation*}
D_n = (n-1)(D_{n-1} + D_{n-2})
\end{equation*}

Fibonacci: $F_n = F_{n-1} + F_{n-2}$, $F_0=0,F_1=1$

#### Random walks/lattice paths
Catalan numbers: Number of sequences of steps (up (1,1), down (1,-1)) from (0,0) to (2n,0) that never go below the x-axis. Equivalently, the number of paths of unit steps right and up from (0,0) to (n,n) that never cross the diagonal.
\begin{equation*}
C_n = \frac{1}{n+1}\binom{2n}{n} = \sum_{i=0}^{n-1}C_iC_{n-1-i}
\end{equation*}

Motzkin numbers: Number of sequences of steps (up (1,1), down (1,-1), level (1,0)) from (0,0) to (n,0) that never go below the x-axis.
\begin{equation*}
M_n = M_{n-1} + \sum_{k=0}^{n-2}M_kM_{n-2-k}
\end{equation*}

Schroder numbers: Number of sequences of steps (up (1,1), down (1,-1), level (2,0)) from (0,0) to (2n,0) that never go below the x-axis.
\begin{equation*}
S_n = S_{n-1} + \sum_{k=0}^{n-1}S_kS_{n-1-k}
\end{equation*}
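
The three lattice path recurrences above can be tabulated bottom-up. This is a minimal sketch assuming the base cases $C_0=M_0=S_0=1$ (the standard conventions, not stated explicitly in the notes); each function returns the first `count` terms, with `count` at least 1.

```python
from math import comb

def catalan(count):
    c = [1]
    for n in range(1, count):
        c.append(sum(c[i] * c[n - 1 - i] for i in range(n)))
    return c

def motzkin(count):
    m = [1]
    for n in range(1, count):
        # Start with a level step, or an up step matched with a later down step.
        m.append(m[n - 1] + sum(m[k] * m[n - 2 - k] for k in range(n - 1)))
    return m

def schroder(count):
    s = [1]
    for n in range(1, count):
        s.append(s[n - 1] + sum(s[k] * s[n - 1 - k] for k in range(n)))
    return s

print(catalan(6))    # [1, 1, 2, 5, 14, 42]
print(motzkin(7))    # [1, 1, 2, 4, 9, 21, 51]
print(schroder(6))   # [1, 2, 6, 22, 90, 394]
# Cross-check the Catalan recurrence against the closed form.
print(all(c == comb(2 * n, n) // (n + 1) for n, c in enumerate(catalan(10))))
```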

#### Set partitioning
Bell numbers: Number of ways to partition a set of n elements into non-empty subsets.
\begin{equation*}
B_{n+1} = \sum_{k=0}^n \binom{n}{k}B_k 
\end{equation*}

Stirling numbers (of the second kind): The number of ways to partition a set of n elements into k non-empty subsets. Note that the Bell number $B_n = \sum_{k=1}^n S(n,k)$ for $n\geq 1$.
\begin{equation*}
    S(n,k) = k*S(n-1,k) + S(n-1,k-1)
\end{equation*}
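
The two set partitioning recurrences can be tabulated side by side and checked against each other. This sketch assumes the base cases $B_0=1$ and $S(0,0)=1$ (standard conventions); the function names are illustrative.

```python
from math import comb

def stirling2_table(size):
    """s[n][k] = number of partitions of n elements into k non-empty subsets."""
    s = [[0] * (size + 1) for _ in range(size + 1)]
    s[0][0] = 1
    for n in range(1, size + 1):
        for k in range(1, n + 1):
            # Element n joins one of k existing blocks, or forms a new block.
            s[n][k] = k * s[n - 1][k] + s[n - 1][k - 1]
    return s

def bell_numbers(size):
    """[B_0, ..., B_size] from B_{n+1} = sum_k C(n, k) B_k."""
    b = [1]
    for n in range(size):
        b.append(sum(comb(n, k) * b[k] for k in range(n + 1)))
    return b

table = stirling2_table(6)
bell = bell_numbers(6)
print(bell)                                            # [1, 1, 2, 5, 15, 52, 203]
print(all(sum(table[n]) == bell[n] for n in range(7)))  # row sums give the Bell numbers
```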

Partition numbers: The number of ways to write a positive integer $n$ as a sum of positive integers where order does not matter
```python
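def partition(n):
    """Count the partitions of n (order of the parts does not matter)."""
    # dp[j] = number of ways to write j using only the part sizes seen so far
    dp = [0]*(n+1)
    dp[0] = 1  # the empty sum is the only partition of 0
    for i in range(1, n+1):  # allow parts of size i
        for j in range(i, n+1):
            dp[j] += dp[j-i]  # append one part of size i to a partition of j-i
    return dp[n]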
```

Sidenote: Try deriving each of the above recurrences.

### Applying DPP to Expectations
Examples:
- Price a European option where the underlying only doubles or halves at every time step and does this for 3 time steps.
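
The option example can be sketched with backward induction on a recombining tree. Several things here are assumptions, since the notes do not specify them: zero interest rates, a risk-neutral up-probability $q$ chosen so that the underlying is a martingale ($q=\frac{1}{3}$ for doubling/halving), a spot and strike of 100, and the hypothetical function name `price_european`.

```python
from fractions import Fraction

def price_european(spot, strike, steps, up=Fraction(2), down=Fraction(1, 2), call=True):
    """Binomial price under assumed zero rates and risk-neutral probabilities."""
    # Martingale condition: spot = q * up * spot + (1 - q) * down * spot.
    q = (1 - down) / (up - down)
    # Base case: payoffs at expiry, indexed by the number of up moves.
    values = []
    for ups in range(steps + 1):
        terminal = spot * up ** ups * down ** (steps - ups)
        payoff = terminal - strike if call else strike - terminal
        values.append(max(payoff, Fraction(0)))
    # Recursive step: each node is the expected value of its two children.
    for _ in range(steps):
        values = [q * values[j + 1] + (1 - q) * values[j]
                  for j in range(len(values) - 1)]
    return values[0]

call_price = price_european(Fraction(100), Fraction(100), 3)
put_price = price_european(Fraction(100), Fraction(100), 3, call=False)
print(call_price, put_price)   # 1300/27 1300/27
```

With zero rates and the strike equal to the spot, put-call parity forces the call and put prices to coincide, which gives a quick consistency check on the induction.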

### Applying DPP to Probabilities
Examples:
- Probability of coin tosses leading to HH before HT.
