# Algorithm Analysis

A simple runtime analysis example. In each version of the algorithm we count the steps, so that we can empirically verify the difference in runtime.

In [2]:
def disjoint1(A, B, C):
    """Return True if there is no element common to all three lists."""
    i=0
    for a in A:
        for b in B:
            for c in C:
                i += 1
                if a == b == c:
                    print('took {} steps'.format(i))
                    return False # we found a common value
    print('took {} steps'.format(i))
    return True

In [3]:
def disjoint2(A, B, C):
    """Return True if there is no element common to all three lists."""
    i=0
    for a in A:
        for b in B:
            i += 1
            if a == b: # only check C if we found match from A and B
                for c in C:
                    i += 1
                    if a == c: # (and thus a == b == c)
                        print('took {} steps'.format(i))
                        return False # we found a common value
    print('took {} steps'.format(i))
    return True # if we reach this, sets are disjoint

`disjoint1` should take $O(n^3)$ steps to run when each list has $n$ elements. Why?

* the outermost loop is $O(n)$
* for each of the $O(n)$ iterations, another loop is run, which is $O(n)$ (so $O(n^2)$ total)
* for each of $O(n^2)$ iterations, a third loop is run, for a total of $O(n^3)$ iterations
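
As a rough empirical check of the cubic bound, the sketch below re-implements the same triple loop but returns the step count instead of printing it. The helper name `count_steps_disjoint1` and the use of pairwise-disjoint ranges as inputs are assumptions made for illustration. With disjoint inputs there is no early return, so the count should be exactly $n^3$.

```python
def count_steps_disjoint1(n):
    """Count innermost iterations of the triple loop on three disjoint lists of length n."""
    A = list(range(0, n))            # hypothetical inputs: three ranges with no overlap
    B = list(range(n, 2 * n))
    C = list(range(2 * n, 3 * n))
    steps = 0
    for a in A:
        for b in B:
            for c in C:
                steps += 1
                if a == b == c:      # never true for these inputs
                    return steps
    return steps

for n in (5, 10, 20):
    print(n, count_steps_disjoint1(n), n ** 3)   # measured count next to n^3
```

For $n = 5$ this agrees with the 125 steps reported for the five-element example.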

In [4]:
disjoint1({1,2,3,4,5},{6,7,8,9,10},{11,12,13,14,15})
#dis.dis(disjoint1)

took 125 steps


True

`disjoint2` should take $O(n^2)$ steps to run when each list has $n$ elements. Why?

* the outermost loop is $O(n)$
* for each of the $O(n)$ iterations, another loop is run, which is $O(n)$ (so $O(n^2)$ total)
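* The innermost loop does not always run. Assuming the elements within each list are distinct (as they are for the sets used here), it runs at most $n$ times, once per pair with $a==b$ (the maximum is reached when $A\subseteq B$ or $B \subseteq A$)
* For each of these (at most) $n$ matches, the third loop runs $O(n)$ iterations, for a total of $O(n^2)$ iterations

Hence the total is $O(n^2)+O(n^2) = O(n^2)$.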
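
The same kind of check can be sketched for `disjoint2`. The helper `count_steps_disjoint2` below is a hypothetical variant that returns the count rather than printing it. The $O(n^2)$ argument relies on the elements within each list being distinct (which holds for sets), so the sketch tries two illustrative inputs: the distinct-element worst case ($A = B$, with $C$ disjoint), and lists made entirely of duplicates, where that assumption fails and the count grows cubically.

```python
def count_steps_disjoint2(A, B, C):
    """Count the steps taken by the disjoint2 loop structure on the given lists."""
    steps = 0
    for a in A:
        for b in B:
            steps += 1
            if a == b:               # only scan C after a match between A and B
                for c in C:
                    steps += 1
                    if a == c:
                        return steps
    return steps

n = 5
distinct = list(range(n))            # [0, 1, 2, 3, 4]
other = list(range(n, 2 * n))        # shares no value with `distinct`

steps_distinct = count_steps_disjoint2(distinct, distinct, other)
steps_duplicates = count_steps_disjoint2([0] * n, [0] * n, other)
print(steps_distinct)     # n^2 pair checks + n matches * n = 2 * n^2
print(steps_duplicates)   # n^2 pair checks + n^2 matches * n = n^2 + n^3
```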

All three sets are disjoint. Notice that the inner loop never executes.

In [5]:
disjoint2({1,2,3,4,5},{6,7,8,9,10},{11,12,13,14,15})

took 25 steps


True

Worst-case scenario. Notice that the inner loop executes all $n$ times.

In [6]:
disjoint2({1,2,3,4,5},{1,2,3,4,5},{11,12,13,14,15})

took 50 steps


True

Best-case scenario. A common value is found immediately.

In [7]:
disjoint2({1,2,3,4,5},{1,2,3,4,5},{1,2,3,4,5})

took 2 steps


False

We can inspect the low-level bytecode using the `dis` module (the cell below is commented out; uncomment it to run)

In [8]:
#import dis
#dis.dis(disjoint1)
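
For reference, a minimal sketch of how `dis` can be used. The function `add_one` is a hypothetical stand-in rather than one of the functions above, and the exact instructions printed depend on the Python version.

```python
import dis

def add_one(x):
    """Toy function used only to illustrate disassembly (hypothetical example)."""
    return x + 1

dis.dis(add_one)   # print a human-readable bytecode listing

# The same instructions are also available programmatically
opnames = [instr.opname for instr in dis.get_instructions(add_one)]
print(opnames)
```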

# Recursion

## Binary Search

Consider the following binary search algorithm

In [35]:
def binary_search(data, target, low, high):
    """
    Return True if target is found in indicated portion of a 
    Python list. The search only considers the portion from 
    data[low] to data[high] inclusive
    """
    if low > high:
        return False
    else:
        mid = (low + high)//2
        print('low : {}, mid: {}, high: {}'.format(low, mid, high))  # track progress
        if target == data[mid]:
            return True
        elif target < data[mid]:
            return binary_search(data, target, low, mid-1)
        else:  # target > data[mid]
            return binary_search(data, target, mid+1, high)

In [36]:
data = [2,4,5,7,8,9,12,14,17,19,22,25,27,28,33,37]
binary_search(data,22,0,15)

low : 0, mid: 7, high: 15
low : 8, mid: 11, high: 15
low : 8, mid: 9, high: 10
low : 10, mid: 10, high: 10


True

See the image for a visualization of the execution of this function

![](images/binary_search.png)
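
As a sanity check, the sketch below uses a quiet copy of the function (no progress printing; the name `binary_search_quiet` is an assumption for illustration) and compares its answer with Python's `in` operator for every integer in a range that covers the data.

```python
def binary_search_quiet(data, target, low, high):
    """Same recursion as binary_search, without the progress output."""
    if low > high:
        return False
    mid = (low + high) // 2
    if target == data[mid]:
        return True
    elif target < data[mid]:
        return binary_search_quiet(data, target, low, mid - 1)
    else:
        return binary_search_quiet(data, target, mid + 1, high)

data = [2, 4, 5, 7, 8, 9, 12, 14, 17, 19, 22, 25, 27, 28, 33, 37]

# Collect every target on which the recursive search disagrees with `in`
mismatches = [t for t in range(0, 40)
              if binary_search_quiet(data, t, 0, len(data) - 1) != (t in data)]
print(mismatches)   # an empty list means the two agree on every target tried
```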

## Disk Utility

Consider the following program to calculate disk usage (like `du` in Unix)

In [63]:
import os

def disk_usage(path):
    """Return the number of bytes used by a file/folder and any descendants."""
    total = os.path.getsize(path)                    # account for direct usage
    if os.path.isdir(path):                          # if this is a directory,
        for filename in os.listdir(path):            # then for each child:
            childpath = os.path.join(path, filename) # compose full path to child
            total += disk_usage(childpath)           # add child’s usage to total
    print('{0:<7}'.format(total), path)               # descriptive output (optional)
    return total                                     # return the grand total

In [64]:
disk_usage('/network/home/sciclunm/research/coding_tests_repo/algorithmic_analysis/')

55326   /network/home/sciclunm/research/coding_tests_repo/algorithmic_analysis/images/du.png
76360   /network/home/sciclunm/research/coding_tests_repo/algorithmic_analysis/images/binary_search.png
131690  /network/home/sciclunm/research/coding_tests_repo/algorithmic_analysis/images
13894   /network/home/sciclunm/research/coding_tests_repo/algorithmic_analysis/algo_analysis_and_recursion.ipynb
72      /network/home/sciclunm/research/coding_tests_repo/algorithmic_analysis/.ipynb_checkpoints/algo_analysis_and_recursion-checkpoint.ipynb
75      /network/home/sciclunm/research/coding_tests_repo/algorithmic_analysis/.ipynb_checkpoints
145664  /network/home/sciclunm/research/coding_tests_repo/algorithmic_analysis/


145664

## Run Time Analysis

We modify the `binary_search` function to count the steps

In [72]:
def binary_search(data, target, low, high):
    """
    Return True if target is found in indicated portion of a 
    Python list. The search only considers the portion from 
    data[low] to data[high] inclusive
    """
    global i
    i += 1
    if low > high:
        return False
    else:
        mid = (low + high)//2
        print('low : {}, mid: {}, high: {}'.format(low, mid, high))  # track progress
        if target == data[mid]:
            return True
        elif target < data[mid]:
            return binary_search(data, target, low, mid-1)
        else:  # target > data[mid]
            return binary_search(data, target, mid+1, high)

`binary_search` should take $O(\log n)$ steps to run. Why?
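* There are at most $\lfloor \log_2 n \rfloor + 1$ steps that inspect an element. After the $r^{th}$ step at most $\frac{n}{2^r}$ candidates remain, so the search must stop by the smallest $r$ such that $\frac{n}{2^r} < 1$, which is $r=\lfloor \log_2 n \rfloor + 1$. An unsuccessful search makes one further call on the resulting empty range, which only returns `False`
* Each step is $O(1)$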
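
The bound can be probed empirically with a sketch like the following. The function `count_calls` is a hypothetical variant that returns the number of calls instead of updating a global counter, and it counts every call, including the final one on an empty range when the target is absent. Under that counting convention the worst case is $\lfloor \log_2 n \rfloor + 2$ calls: at most $\lfloor \log_2 n \rfloor + 1$ that inspect an element, plus one that returns `False`.

```python
import math

def count_calls(data, target, low, high):
    """Return the number of calls binary search makes on data[low..high]."""
    if low > high:
        return 1                      # the call on an empty range
    mid = (low + high) // 2
    if target == data[mid]:
        return 1
    elif target < data[mid]:
        return 1 + count_calls(data, target, low, mid - 1)
    else:
        return 1 + count_calls(data, target, mid + 1, high)

def worst_case_calls(n):
    """Largest call count over present and absent targets for a sorted list of length n."""
    data = list(range(0, 2 * n, 2))   # n sorted even numbers (illustrative data)
    # even targets inside the range are present, odd and out-of-range ones are absent
    return max(count_calls(data, t, 0, n - 1) for t in range(-1, 2 * n + 1))

for n in (1, 2, 16, 100, 1000):
    print(n, worst_case_calls(n), math.floor(math.log2(n)) + 2)
```

The two printed counts should agree on each row. A search that keeps falling into the smaller left half, such as a target below every element, can finish in fewer calls than the worst case.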

In [74]:
i = 0
binary_search(data,-1,0,15)
print('took {} steps'.format(i))

low : 0, mid: 7, high: 15
low : 0, mid: 3, high: 6
low : 0, mid: 1, high: 2
low : 0, mid: 0, high: 0
took 5 steps


We can compare the step count above with $\lfloor \log_2 n \rfloor + 1$ for $n = 16$:

In [85]:
import math
math.floor(math.log2(len(data)))+1

5

The `disk_usage` function is $O(n)$, where $n$ is the number of files/folders.
To prove this, it is enough to show that:

* there are $n$ calls of `disk_usage` in total (the initial call plus the recursive ones)
* the body of the for loop executes $n-1$ times **across** all calls of `disk_usage`

For clarity, consider the below hypothetical filesystem:

![](images/du.png)

To prove that there are $n$ calls of `disk_usage`, we define the _nesting level_ of each file/folder (e.g. `/user/rt/courses/` is level 0, `cs252/` is level 1, ...).

We prove by induction over the _nesting level_ $k$ that `disk_usage` is called exactly once for each file/folder at level $k$:
* For $k=0$, the only entry at _nesting level_ $0$ is the initial path argument to `disk_usage` (`/user/rt/courses/` in the example). `disk_usage` is called once for it, as the main function call, so the claim is satisfied.
* If `disk_usage` is called exactly once for each file/folder at _nesting level_ $k-1$, then it is called exactly once for each file/folder at _nesting level_ $k$, since each such entry appears exactly once in the for loop of its parent folder.

Hence there are $n$ calls of `disk_usage`, as needed. Moreover, every entry except the initial path is reached through exactly one iteration of a for loop, so the loop body executes $n-1$ times in total.


We modify `disk_usage` to record how many files we loop over **across** recursive calls of `disk_usage`. 

We use this to verify that we loop over $n-1$ files in total

In [119]:
import os

def disk_usage(path):
    """Return the number of bytes used by a file/folder and any descendants."""
    global i
    total = os.path.getsize(path)                                                # account for direct usage
    if os.path.isdir(path):                                                      # if this is a directory
        visible_files = [f for f in os.listdir(path) if not f.startswith('.')]   # only include visible files
        i += len(visible_files)                                                  # record number of folders looped over
        print('looped over folders {}'.format(visible_files))
        for filename in visible_files:                                           # then for each child:
            childpath = os.path.join(path, filename)                             # compose full path to child
            total += disk_usage(childpath)                                       # add child’s usage to total
    #print ( '{0:<7}'.format(total), path)                                       # descriptive output (optional)
    return total                                                                 # return the grand total

In [122]:
i=0
disk_usage('courses/')
print('looped over {} folders'.format(i))

looped over folders ['cs016', 'cs252']
looped over folders ['homeworks', 'programs', 'grades']
looped over folders ['hw1.txt', 'hw3.txt', 'hw2.txt']
looped over folders ['pr3.txt', 'pr2.txt', 'pr1.txt']
looped over folders []
looped over folders ['grades', 'projects']
looped over folders []
looped over folders ['papers', 'demos']
looped over folders ['sellhigh.txt', 'buylow.txt']
looped over folders ['market.txt']
looped over 18 folders


Indeed, the tree contains $n = 19$ visible files/folders (counting `courses/` itself), and $n - 1 = 19-1=18$.
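
The same count can be reproduced without relying on an existing directory. The sketch below builds a small throwaway tree with `tempfile` (the folder and file names are made up for illustration) and uses a hypothetical helper, `count_calls_and_iterations`, that follows the traversal of `disk_usage` but returns the number of calls and the number of loop iterations instead of a byte total.

```python
import os
import tempfile

def count_calls_and_iterations(path):
    """Return (calls, loop_iterations) for a disk_usage-style traversal of path."""
    calls, iterations = 1, 0                       # this call itself
    if os.path.isdir(path):
        for filename in os.listdir(path):
            iterations += 1                        # one loop iteration per child
            child = os.path.join(path, filename)
            child_calls, child_iterations = count_calls_and_iterations(child)
            calls += child_calls
            iterations += child_iterations
    return calls, iterations

with tempfile.TemporaryDirectory() as root:
    # hypothetical layout: root/, a/, a/x.txt, a/y.txt, b/ (empty), z.txt
    os.makedirs(os.path.join(root, 'a'))
    os.makedirs(os.path.join(root, 'b'))
    for parts in (('a', 'x.txt'), ('a', 'y.txt'), ('z.txt',)):
        with open(os.path.join(root, *parts), 'w') as f:
            f.write('hello')
    calls, iterations = count_calls_and_iterations(root)

print(calls, iterations)   # n = 6 entries, so 6 calls and 5 loop iterations
```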