## **Binary vs Linear Search**

**Linear search** is the simplest search. It just goes through every single element until finding the target. 

In [10]:
# Import relevant modules

from random import randint, sample, random
from typing import TypedDict, Dict, List, Optional, Callable, Any
from time import perf_counter_ns as t_ns
from binary_search import binary_search
from collections import defaultdict

In [11]:
# Simple implementation of linear search
def linear_search(ls: list, target: int) -> Optional[int]:
    for i, num in enumerate(ls):
        if num == target:
            return i

    return None

### **Analyzing Complexity**

Binary search halves the search range on every step, whereas linear search only decreases the range by one on every step. So, in theory, binary search should be O(log n), while linear search O(n). In this experiment, I'll set out to showcase this. 

This analysis however, is for the worst case. In practice, one may see other results - such as best case, or, when aggregated, an average case of a given sample. So it is worthwhile thinking of these.

For binary search: 
* **Best case**    - Element is at the middle. So it needs 1 operation.
* **Worst case**   - Element is where the target happens to be the last one checked (or not present). It needs log n.
* **Average case** - Everything else is in between these two. On average, this leads to ~ (log n) - 1. Why?

For linear search:
* **Best case**    - Element is at the beginning. So it needs 1 operation.
* **Worst case**   - Element is at the end (or not present). It needs n operations
* **Average case** - Everything else is in between the two. As it's evenly distributed, this leads to ~ n / 2.

In [12]:
# TODO: Explain why average case of binary search

would be if the element is at the exact middle (so it needs 1 operation), while the worst is when it's at one of those tricky spots where the target happens to be the last one checked or it's not on the list (so it needs log n operations). 
Everything else is in between these two values, the average being roughly (log n) - 1. An intuition for this is: if we picture the list as a tree, most elements are found above the deepest level, so on average the number of comparisons is slightly smaller than the height of the tree (log n)


### **Experimental Set Up**

The base set up will be as follows: 
1. For every size in _sizes_, generate a list as a fixed sorted sequence from 1 to _size_.  
2. Then, select a target and search it with both methods. 
3. Repeat this _attempts_ amount of times.

##### **Timing the Searches**

In this case, I use a [wrapper function](./extra/wrapper_functions.ipynb) to time the different searches

In [13]:
# Wrapper that returns the time a function takes to run
def timed(func: Callable[..., Any], *args, **kwargs) -> tuple[Any, int]:
    """
    Runs a function and returns (result, elapsed_time_ns).
    """
    start = t_ns()
    result = func(*args, **kwargs)
    elapsed = t_ns() - start
    return result, elapsed


#### **Defining Parameters**

In [14]:
class SizeResult(TypedDict):                                            # size : {"times": [], "indices": []}
    times:   List[float]
    indices: List[int]

class SearchResults(TypedDict):                                         # search: {size: SizeResult}
    binary:  Dict[int, SizeResult]
    linear:  Dict[int, SizeResult]
    targets: Dict[int, List[int]]

class ExpResults(TypedDict):
    hit_rate: SearchResults

def make_size_result():
    return {"times": [], "indices": []}             

In [24]:
sizes = [8, 32, 128, 1024, 16384, 131072]                               # sizes = 2**n for n = 3, 5, 7, 10, 17. Each 1 order of magnitude bigger
hit_rates = [1]                                                 # ratio of targets inside list (hit) vs outside (miss)
attempts = 100                                                          

##### **Running the Experiment**

In [None]:
def run_experiment(sizes: List[int], hit_rates: List[int], attempts: int) -> ExpResults:
    exp_results = {}
    for rate in hit_rates:
        results: SearchResults = {
            "binary": defaultdict(make_size_result),
            "linear": defaultdict(make_size_result),
            "targets": defaultdict(list)
        }

        # Run experiment for every size
        for size in sizes:
            nums = list(range(1, size + 1))

            # Check first, middle, last, and a miss.
            indices = [1, size // 2, size - 1, -1]

            # Add a combination of hits and misses in proportion to hit rate
            hits = int(rate*attempts)
            misses = attempts - hits
            if misses: indices.extend([-1]*misses)
            if hits: indices.extend([randint(0, size - 1) for _ in range(hits)])

            # Run searches 'attempts' amount of times
            for idx in indices:
                if idx == -1: target = -1
                else: target = nums[idx]

                lin_i, lin_t = timed(linear_search, nums, target)
                bin_i, bin_t = timed(binary_search, nums, target)

                # Check validity - shouldn't be necessary!
                if lin_i != bin_i: # Expects miss
                    if idx == -1:
                        print(f"Expected a miss on both. Got index {lin_i} for linear and {bin_i} for binary.")
                    else:
                        print(f"Expected index {idx}. Got index {lin_i} for linear and {bin_i} for binary.")

                # Save data
                results["linear"][size]["indices"].append(lin_i)
                results["linear"][size]["times"].append(lin_t)
                results["binary"][size]["indices"].append(bin_i)
                results["binary"][size]["times"].append(bin_t)
                results["targets"][size].append(idx)
            

        exp_results[rate] = results
    
    return exp_results
    


In [26]:
exp_results = run_experiment(sizes, hit_rates, attempts)

In [36]:
for size in sizes:
    times_l = exp_results[1]["linear"][size]["times"]
    times_b = exp_results[1]["binary"][size]["times"]

    avg_b = sum(times_b) / (attempts * 10**3) # nanosecond to microsecond
    avg_l = sum(times_l) / (attempts * 10**3)
    print(f"Average time for size {size}: Linear={avg_l: .2f} Binary={avg_b: .2f} [microseconds]")
    print(f"Times for Linear: first={times_l[0]: .2f}, middle={times_l[1]: .2f}, last={times_l[2]: .2f}, miss={times_l[3]: .2f}")
    print(f"Times for Binary: first={times_b[0]: .2f}, middle={times_b[1]: .2f}, last={times_b[2]: .2f}, miss={times_b[3]: .2f}")


Average time for size 8: Linear= 0.27 Binary= 0.28 [microseconds]
Times for Linear: first= 1300.00, middle= 500.00, last= 400.00, miss= 800.00
Times for Binary: first= 1100.00, middle= 500.00, last= 400.00, miss= 700.00
Average time for size 32: Linear= 0.52 Binary= 0.33 [microseconds]
Times for Linear: first= 300.00, middle= 500.00, last= 800.00, miss= 900.00
Times for Binary: first= 400.00, middle= 400.00, last= 400.00, miss= 400.00
Average time for size 128: Linear= 1.44 Binary= 0.44 [microseconds]
Times for Linear: first= 300.00, middle= 1400.00, last= 2500.00, miss= 2500.00
Times for Binary: first= 500.00, middle= 500.00, last= 500.00, miss= 500.00
Average time for size 1024: Linear= 11.34 Binary= 0.78 [microseconds]
Times for Linear: first= 200.00, middle= 10900.00, last= 20900.00, miss= 21500.00
Times for Binary: first= 1200.00, middle= 1500.00, last= 1000.00, miss= 1000.00
Average time for size 16384: Linear= 177.21 Binary= 1.21 [microseconds]
Times for Linear: first= 300.00, m

As expected, binary wins on all but first, as for linear it's the first it checks, while for binary it's among the last.
Also one can notice similar performance for the middle on the first 2 as there's so few elements, they do a similar amount of operations.

In [32]:
from math import log2
for size in sizes:
    print(log2(size))

3.0
5.0
7.0
10.0
14.0
17.0
