# Randomized algorithm
Prerequisites: Some understanding about [expectation](../statistic/random_variables.ipynb#Expected-value)

Back in the introduction, we briefly defined the [worst case and average running time of an algorithm](./introduction.ipynb#Worst-case-time-and-average-time).

However, for randomized algorithms, we are usually interested in the *worst-case expected* time complexity.

The main difference between this and the average time complexity is that the average time complexity depends on the distribution of possible inputs.

The worst-case expected time complexity instead takes the input that is worst for the algorithm and averages the running time only over the algorithm's own random choices.

So if the worst-case input occurs with only a small probability, the average time complexity can still be small, whereas the worst-case expected time complexity is unaffected by how likely that input is.
For ease of notation, we will refer to this expectation $E(T(n))$ as $\bar T(n)$.

## Deterministic vs random algorithm
In a deterministic algorithm, the output and the running time are functions of only the input.

In a randomized algorithm, the output and the running time can change even when given the same input.
As a layer of abstraction, we can represent this randomness as some random bits that we feed into the algorithm as input on top of our regular input.
Hence, we can say that the output and the running time are functions of input and these random bits.
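
To make this abstraction concrete, here is a minimal sketch in which the random bits are passed in explicitly as a seeded generator. The function name `pick_pivot_index` is hypothetical and only serves as an illustration.

```python
import random

def pick_pivot_index(arr, rng):
    """Return an index into arr, chosen using the supplied random source."""
    return rng.randrange(len(arr))

data = [5, 3, 8, 1]

# Same input and same random bits (same seed) always give the same result,
# so the output is a function of (input, random bits).
first = pick_pivot_index(data, random.Random(42))
second = pick_pivot_index(data, random.Random(42))
assert first == second
```

With a different seed the result may differ, even though `data` is unchanged.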

## Types of randomized algorithm
### Randomized Las Vegas algorithms
The output is always correct, but the running time is a random variable.

Examples: [Randomized quick sort](#Randomized-quick-sort), randomized quick select

### Randomized Monte Carlo algorithms
The output may be incorrect with some probability, but the running time is deterministic.

Example: [Approximate median](#Approximate-median)

## Examples
### Approximate median
Given an array $A[1\dots n]$, and $\epsilon > 0$, we wish to find an element whose rank is in the range of 
$$
[(1-\epsilon) \frac{n}{2}, (1+ \epsilon) \frac{n}{2}]
$$

A random algorithm could be as follows:
1. Select a random sample $S$ of size $O(\frac{1}{\epsilon} \log n)$ from $A$
2. Sort $S$
3. Return the median of $S$

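Using this, we get a running time of $O( \frac{1}{\epsilon} \log n \log \log n)$, which is dominated by sorting the sample.
The returned element is an $\epsilon$-approximate median with probability at least $1 - n^{-2}$, that is, the algorithm fails with probability at most $n^{-2}$.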
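
The three steps above can be sketched as follows. This is only an illustration: the function name `approximate_median`, the constant `c` hidden in the $O(\cdot)$ sample size, and the choice to sample with replacement are all assumptions that the text does not specify.

```python
import math
import random

def approximate_median(arr, eps, c=8, rng=random):
    """Monte Carlo sketch: return the median of a random sample of arr."""
    n = len(arr)
    # Sample size O((1/eps) * log n); the constant c is an assumed value.
    k = max(1, math.ceil(c / eps * math.log(max(n, 2))))
    # Step 1 and 2: draw the sample (with replacement) and sort it.
    sample = sorted(rng.choice(arr) for _ in range(k))
    # Step 3: return the median of the sample.
    return sample[len(sample) // 2]

data = list(range(1, 1001))
m = approximate_median(data, eps=0.1, rng=random.Random(0))
```

The running time depends only on the sample size, not on the random choices, which is what makes this a Monte Carlo algorithm.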

### Matching suits
Suppose that we are a tailor shop which serves $n$ people of different sizes.
We have produced $n$ suits, each suit matching exactly 1 person on our list.

However, one day, we mixed up all our suits and forgot which suit belongs to whom.
To make matters worse, we also lost the records of each person's size.
Hence, what we are left with is a list of people and a set of suits.

Just like in real life, it is hard to determine whether a suit fits a person without trying it on.
Also, it is not possible to compare the suits against each other, because the fabrics differ in elasticity.

Due to confidentiality concerns, our clients do not wish to meet us in person, hence we cannot compare the sizes of the people against each other either.

Thus, the only operation we are permitted is to give a suit to a person, who then tells us whether the suit is too small, too big, or fits.

Our goal is to determine which suit belongs to whom.

#### Sub problem
A simple first step is to pick up a suit and try it on one person after another.
If we select the people in random order, never trying the same person twice, we expect to need about $n / 2$ attempts before we find the correct match, even though it is $n - 1$ in the worst case.
If we analyze the number of attempts $T(n)$ needed to **match a single customer to a given suit**, we obtain the following expected complexity $\bar T(n)$.

$$
\bar T(n) = \sum ^{n-1}_{k=1}  k Pr[T(n) = k]
$$

$$
Pr[T(n) = k] = 
\begin{cases}
1/n \quad \text{if } k < n - 1 \\
2/n \quad \text{if } k = n - 1 \\
\end{cases}
$$

For example, for $n=4$, it would correspond to:
* $1/4$ chance of finding the match on the first try
* $1/4$ chance of the match being the second person we sample
* $2/4$ chance of us needing to sample a third person (when we failed in the first 2 attempts)
    * If it matches, then it belongs to the third person
    * Otherwise, we know that it belongs to the fourth
    * This is why the formula has $2/n$ as an edge case

Simplifying our formula, we get

$$
\begin{align}
\bar T(n) &= \sum ^{n-1}_{k=1}  k Pr[T(n) = k] \\
&= \sum ^{n-2}_{k=1}  k Pr[T(n) = k]+ (n-1) Pr[T(n) = n-1]\\
&= \sum ^{n-2}_{k=1}  \frac{k}{n} + \frac{2n-2}{n}\\
&= \frac{(n-1)(n-2)}{2n} + \frac{2n-2}{n}\\
&= \frac{(n-1)(n+2)}{2n} \\
&= \frac{n^2+n-2}{2n} \\
\end{align}
$$

<details>
    <summary style="color: blue">Alternative derivation (Click to expand)</summary>
    <div style="background: aliceblue">
        <p>
        We can instead solve the problem recursively.
        </p>
        <p>
        When $n=1$, the expected complexity is trivially $\bar T(1) = 0$, since we know that the only suit must match the only customer.
        </p>
        <p>
        Suppose that we have $n > 1$ customers to match.
        Now we randomly pick a customer to match against our given suit.
        If it matches ($\frac{1}{n}$ chance of this occurring), then we would have sampled $1$ person.
        If it didn't ($\frac{n-1}{n}$ chance of this occurring), then we would need to sample $1 + \bar T(n-1)$ people in expectation, because we would still need to find the match among the remaining $n-1$ customers.
        </p>
        <p>
         Hence, the recurrence is defined as $\bar T(n)=\frac{1}{n}(1) + \frac{n-1}{n}\left(1 + \bar T(n-1)\right) = 1 + \frac{n-1}{n}\left(\bar T(n-1)\right)$
        </p>
        <p>
        To solve the recurrence, we introduce the substitution $t(n) = n \bar T(n)$.
        Substituting, we get
        $$
        \begin{align}
        t(n) &= n \bar T(n) \\ 
        &= n (1 + \frac{n-1}{n} \bar T(n-1)) \\
        &= n + (n-1) \bar T(n-1) \\
        &= n + t(n-1)
        \end{align}
        $$
        </p>
        <p>
        This simplifies to a simple summation $$t(n) = \sum ^n _{k=2} k = \frac{n(n+1)}{2} - 1$$
        </p>
        <p>
        $$\bar T(n) = t(n)/n = \frac{n+1}{2} - \frac{1}{n}$$
        </p>
</div>
</details>

Hence, as expected, we need about $n/2$ operations to find 1 match, which gives an expected complexity of $O(n)$.
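
As a sanity check, the distribution $Pr[T(n) = k]$ can be summed exactly and compared with the closed form $\frac{n^2+n-2}{2n}$. The sketch below is illustrative only: the function names are hypothetical, and the simulation assumes that customers are tried in a uniformly random order without repetition.

```python
import random
from fractions import Fraction

def expected_tries(n):
    """Exact expected number of try-ons to match one suit among n customers."""
    total = Fraction(0)
    for k in range(1, n):
        # Pr[T = k] is 1/n for k < n - 1 and 2/n for k = n - 1.
        prob = Fraction(2, n) if k == n - 1 else Fraction(1, n)
        total += k * prob
    return total

def closed_form(n):
    """The simplified formula (n^2 + n - 2) / (2n)."""
    return Fraction(n * n + n - 2, 2 * n)

def simulate_tries(n, rng):
    """Count try-ons in one random run; customer 0 is the true owner."""
    order = list(range(n))
    rng.shuffle(order)
    for tries, person in enumerate(order[:-1], start=1):
        if person == 0:
            return tries
    return n - 1  # n - 1 failures: the last customer must be the owner

print(expected_tries(4), closed_form(4))  # 9/4 9/4
```

Averaging `simulate_tries(n, rng)` over many runs should approach the same value.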

Moving back to the main problem, we can repeat this for each of the $n$ suits to obtain an expected complexity of $O(n^2)$ with the naive approach.

#### Further improvements
Notice that we didn't use the information that the selected customer is too small or too big for the suit; we only used the information as to whether he matched the suit or not.

Thus, we can proceed similarly to quick sort:
1. Choose any suit as the pivot suit
2. Compare this suit against all the customers
3. We now have a match, and the remaining customers are split into 2 sets: the ones smaller and the ones larger
4. Solve the subproblem, recursing into the respective partition to find our match

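Steps 1 to 3 need $2n-1$ comparisons: $n$ to try the pivot suit on every customer, and $n-1$ to try every other suit on the matched customer so that the suits are split into the same 2 sets. In the worst case, we get the following recursion.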
$$
T(n) = 2n -1 + \max _{1 \leq k \leq n}[T(k-1) + T(n-k)] = 2n -1 + T(n-1)
$$

And we know from the [quick sort](./recursion.ipynb#Good-pivot) section that this has complexity $O(n^2)$.
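
The quick-sort-like procedure can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: sizes are modelled as plain numbers, `try_on` and `match_suits` are hypothetical names, the pivot suit is drawn uniformly at random, and every suit is assumed to fit exactly one person. The code only ever compares a suit with a person, never suit with suit or person with person.

```python
import random

def try_on(suit, person):
    """The only permitted operation: -1 if too small, 0 if it fits, 1 if too big."""
    return (suit > person) - (suit < person)

def match_suits(suits, people, rng=random):
    """Return (pairs, comparisons), where pairs is a list of (suit, person)."""
    if not suits:
        return [], 0
    i = rng.randrange(len(suits))              # step 1: pick a pivot suit
    pivot_suit, other_suits = suits[i], suits[:i] + suits[i + 1:]
    comparisons = 0
    owner, smaller_people, larger_people = None, [], []
    for person in people:                      # step 2: n try-ons
        comparisons += 1
        result = try_on(pivot_suit, person)
        if result == 0:
            owner = person
        elif result > 0:                       # suit too big: person is smaller
            smaller_people.append(person)
        else:
            larger_people.append(person)
    smaller_suits, larger_suits = [], []
    for suit in other_suits:                   # n - 1 try-ons on the owner
        comparisons += 1
        if try_on(suit, owner) < 0:
            smaller_suits.append(suit)
        else:
            larger_suits.append(suit)
    # Step 4: recurse into each partition.
    left, left_cost = match_suits(smaller_suits, smaller_people, rng)
    right, right_cost = match_suits(larger_suits, larger_people, rng)
    return left + [(pivot_suit, owner)] + right, comparisons + left_cost + right_cost

pairs, cost = match_suits([3, 1, 4, 2, 5], [5, 2, 1, 4, 3], random.Random(1))
```

Each call performs exactly $2n-1$ try-ons before recursing, matching the recursion above.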



When doing step 3, we "sort" the customers more and more, thus we can use this information to perform binary search using any pivot suit.

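Now, we will analyze the worst-case expected time complexity when the pivot suit is chosen uniformly at random, so that its rank $k$ is equally likely to be any of $1, \dots, n$.
The recurrence is as follows: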

$$
\begin{align}
\bar T(n) &= 2n -1 + E_k [\bar T(k-1) + \bar T(n-k)] \\
&= 2n -1 + \frac{1}{n} \sum _{k=1} ^n [\bar T(k-1) + \bar T(n-k)] \\
&= 2n -1 + \frac{2}{n} \sum _{k=0} ^{n-1} \bar T(k)
\end{align}
$$

Solving this recurrence, we would get $\bar T(n) = O(n \log n)$.
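
The growth rate can also be checked numerically. The sketch below tabulates the recurrence with $\bar T(0) = 0$, using the fact that each $\bar T(j)$ for $j = 0, \dots, n-1$ appears exactly twice in the sum over $k$. The function name `expected_comparisons` is hypothetical, and the ratio printed at the end is only an empirical hint, not a proof.

```python
import math

def expected_comparisons(n_max):
    """Tabulate T(0..n_max) for T(n) = 2n - 1 + (2/n) * sum of T(0..n-1)."""
    t = [0.0]
    prefix = 0.0  # running sum T(0) + ... + T(n-1)
    for n in range(1, n_max + 1):
        prefix += t[n - 1]
        t.append(2 * n - 1 + 2 * prefix / n)
    return t

t = expected_comparisons(1000)
for n in (10, 100, 1000):
    # If T(n) = O(n log n), this ratio should stay bounded as n grows.
    print(n, t[n] / (n * math.log(n)))
```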



### Randomized quick sort

In quick sort, we saw that we hit the worst case if we repeatedly choose a "bad" pivot.
However, if we randomize the pivot selection, we can hope to avoid "bad" pivots most of the time.
Indeed, the worst-case expected complexity of randomized quick sort is $O(n \log n)$, which follows from the same analysis as the previous example.
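
A minimal sketch of the idea is shown below. It is not the `quick_sort` from this notebook's own module; the name `randomized_quick_sort` is hypothetical, and the version here favours clarity over in-place partitioning.

```python
import random

def randomized_quick_sort(arr, rng=random):
    """Return a sorted copy of arr, using a uniformly random pivot."""
    if len(arr) <= 1:
        return list(arr)
    pivot = arr[rng.randrange(len(arr))]
    # Three-way partition so that repeated values are handled correctly.
    smaller = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    larger = [x for x in arr if x > pivot]
    return (randomized_quick_sort(smaller, rng)
            + equal
            + randomized_quick_sort(larger, rng))

print(randomized_quick_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```

The output is always correct regardless of the random choices; only the running time varies, which makes this a Las Vegas algorithm.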

## Benefits of randomized algorithm

After all that we have analyzed, we have seen that randomized algorithms usually achieve the same (expected) time complexity as their deterministic counterparts.
However, notice that their logic is usually simpler, and in many cases they are also faster in practice.

### Examples

#### Sorting
Even though both [merge sort](./recursion.ipynb#Merge-sort) and randomized quick sort are $O(n \log n)$, randomized quick sort usually outperforms merge sort in practice, because its simpler logic requires fewer resources to be allocated.

```python
import numpy as np
from module.sort import quick_sort, merge_sort

SIZE = 10_000
arr = np.random.random(size=SIZE)
```

```python
%%timeit -n 10
quick_sort(arr)
```

```
39.5 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
%%timeit -n 10
merge_sort(arr)
```

```
52.6 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```


#### Pivot selection
The randomized way of finding a good pivot for quick sort/quick select is much simpler than the [deterministic way](./recursion.ipynb#Finding-good-pivot).

#### Primality test
Given an $n$-bit integer, we wish to test if it is prime.

The best deterministic algorithm (AKS01) is complicated and operates in $O(n^6)$, and thus is never used in practice.

In comparison, the best randomized algorithm (Rabin80) is simple and operates in $O(n^2)$.
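
To give a feel for how simple the randomized test is, below is a sketch in the style of the Miller-Rabin test. It is an illustration only: the name `is_probable_prime`, the number of rounds, and the initial trial division by small primes are assumptions, and the sketch makes no claim about matching the complexity quoted above.

```python
import random

def is_probable_prime(n, rounds=20, rng=random):
    """Monte Carlo primality test: a prime is never rejected,
    a composite is accepted only with very small probability."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 = d * 2^s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)        # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                   # a witnesses that n is composite
    return True
```

The running time is fixed by `rounds`, while the answer may be wrong with small probability, so this is a Monte Carlo algorithm.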