# Memory Management & B-Trees

Key topics include:
* Memory Management
* Memory Hierarchies & Caching
* External Searching & B-Trees
* External Memory Sorting

```python
# packages and data
import numpy as np
import pandas as pd
```

## Memory Management

* Computer memory is organized into a sequence of words, each of which typically consists of 4, 8, or 16 bytes (depending on the computer). These memory words are numbered from 0 to N − 1, where N is the number of memory words available to the computer. The number associated with each memory word is known as its memory address
* With Python, all objects are stored in a pool of memory, known as the memory heap or Python heap
* Rather than requiring programmers to free memory explicitly, the designers of Python placed the burden of memory management entirely on the interpreter. The process of detecting “stale” objects, deallocating the space devoted to those objects, and returning the reclaimed space to the free list is known as garbage collection
* Python principally uses two forms of garbage collection:
    * Reference counts: Within the state of every Python object is an integer known as its reference count. This is the count of how many references to the object exist anywhere in the system. Every time a reference is assigned to this object, its reference count is incremented, and every time one of those references is reassigned to something else, the reference count for the former object is decremented
    * Cycle detection: a more advanced method used to reclaim objects that are unreachable, despite their nonzero reference counts
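        * Mark-sweep algorithm: marks every object reachable from the live references, then sweeps the heap to reclaim whatever was left unmarked. It reclaims unused space in time proportional to the number of live objects and their references plus the size of the memory heap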
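
The two mechanisms above can be observed from within a program. The following is a minimal sketch that assumes the CPython interpreter: `sys.getrefcount`, `gc.collect`, and `weakref.ref` are standard-library calls, while the `Node` class and the two helper functions are hypothetical names introduced only for illustration. Absolute reference counts vary between interpreter versions, so the sketch only looks at how the count changes.

```python
import gc
import sys
import weakref


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.other = None


def refcount_delta():
    """Return how much the reported count grows when one more name is bound."""
    a = Node()
    before = sys.getrefcount(a)
    b = a                          # a second reference to the same object
    after = sys.getrefcount(a)
    return after - before


def cycle_demo():
    """Return (alive_after_del, alive_after_collect) for a two-node cycle."""
    was_enabled = gc.isenabled()
    gc.disable()                   # stop the cycle detector from running on its own
    try:
        x, y = Node(), Node()
        x.other, y.other = y, x    # each node keeps the other's count above zero
        probe = weakref.ref(x)     # a weak reference does not raise the count
        del x, y                   # now unreachable, yet both counts are nonzero
        alive_after_del = probe() is not None
        gc.collect()               # explicit cycle-detection pass
        alive_after_collect = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect
```

On a standard CPython build, `refcount_delta()` is expected to report an increase of one, and `cycle_demo()` is expected to show the cycle surviving `del` but not the explicit collection pass. Other Python implementations may manage memory differently.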

## Memory Hierarchies & Caching

* In order to accommodate large data sets, computers have a hierarchy of different kinds of memories, which vary in terms of their size and distance from the CPU
* Hierarchy of memory in computers (by distance from CPU):
    1. CPU
    2. Registers: used by the CPU (fast access but very few of them)
    3. Caches: considerably larger than the register set of a CPU, but accessing it takes longer
    4. Internal memory - a.k.a. main or core memory - larger than caches but requires more time to access
    5. External memory - very large but very slow
    6. Network storage
* Transferring data between levels of the hierarchy while a program is being executed can quickly become a computational bottleneck
* Caching in web browsers: when the cache is full and a new page is requested, a page replacement algorithm decides which cached page to evict. There are 3 main strategies:
    * First in, first out (FIFO): evict the page that has been in the cache the longest
    * Least recently used (LRU): evict the page whose last request occurred furthest in the past
    * Random
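
To make the difference between the three replacement strategies concrete, here is a small simulation sketch. The function name `count_misses`, its parameters, and the request sequence are assumptions made for illustration. It models a cache as an `OrderedDict` and is not how any particular browser implements its cache.

```python
import random
from collections import OrderedDict


def count_misses(requests, capacity, policy="LRU", seed=0):
    """Count cache misses for a request sequence under FIFO, LRU, or RANDOM."""
    rng = random.Random(seed)
    cache = OrderedDict()          # front of the dict = next eviction candidate
    misses = 0
    for page in requests:
        if page in cache:
            if policy == "LRU":
                cache.move_to_end(page)    # a hit refreshes recency under LRU only
            continue
        misses += 1
        if len(cache) >= capacity:
            if policy == "RANDOM":
                del cache[rng.choice(list(cache))]
            else:
                # FIFO: front is the oldest insertion.
                # LRU: front is the least recently requested page.
                cache.popitem(last=False)
        cache[page] = True
    return misses


requests = [1, 2, 3, 1, 4, 1, 5]
fifo_misses = count_misses(requests, capacity=3, policy="FIFO")   # 6
lru_misses = count_misses(requests, capacity=3, policy="LRU")     # 5
```

On this sequence FIFO evicts page 1 even though it was just requested, so the next request for page 1 misses. LRU keeps it because the hit moved it to the back of the eviction order.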

## External Searching & B-Trees

* Consider the problem of maintaining a large collection of items that does not fit in main memory, such as a typical database. To reduce the number of external-memory accesses when searching, we can represent our map using a multiway search tree
* This approach gives rise to a generalization of the (2,4) tree data structure known as the (a,b) tree. An (a,b) tree, with 2 ≤ a ≤ (b+1)/2, is a multiway search tree in which each internal node other than the root has between a and b children and stores between a − 1 and b − 1 entries. The root may have as few as 2 children
* A version of the (a,b) tree data structure, which is the best-known method for maintaining a map in external memory, is called the “B-tree.”
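
The reason a multiway search tree helps is that each node visited costs one external-memory access, and a wide node keeps the tree shallow. The sketch below shows search only, on a small hand-built tree. The `MultiwayNode` class and `search` function are hypothetical names, and insertion, deletion, and the splitting and merging that keep a real (a,b) tree or B-tree balanced are left out.

```python
from bisect import bisect_left


class MultiwayNode:
    """Hypothetical node: sorted keys plus len(keys) + 1 children (none for a leaf)."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def search(root, key):
    """Return (found, nodes_visited); each visit stands in for one block read."""
    node, visited = root, 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)        # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        # Descend into the child covering the gap where key would belong.
        node = node.children[i] if node.children else None
    return False, visited


# A small tree whose root has 3 children, so it satisfies the (2,4) size bounds.
root = MultiwayNode(
    [10, 20],
    [MultiwayNode([2, 5]), MultiwayNode([12, 17]), MultiwayNode([25, 30, 40])],
)
```

With this tree, `search(root, 17)` visits two nodes before finding the key, and `search(root, 11)` visits two nodes before reporting a miss. The work done inside a node is in-memory, so the visit count is what matters for external-memory cost.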

## External Memory Sorting

* In addition to data structures, such as maps, that need to be implemented in external memory, there are many algorithms that must also operate on input sets that are too large to fit entirely into internal memory. In this case, the objective is to solve the algorithmic problem using as few block transfers as possible
    * Multiway merge sort: An efficient way to sort a set S of n objects in external memory amounts to a simple external-memory variation on the familiar merge-sort algorithm
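
The structure of multiway merge sort can be sketched in a few lines. This is an in-memory simulation under stated assumptions: each Python list stands in for a sorted run that would live on disk, `run_size` plays the role of how many items fit in internal memory, and `fan_in` is how many runs are merged at once. The function name and both parameters are hypothetical, and no real block transfers take place.

```python
import heapq


def multiway_merge_sort(items, run_size, fan_in):
    """Sort items by forming sorted runs, then merging fan_in runs per pass."""
    if run_size < 1 or fan_in < 2:
        raise ValueError("run_size must be >= 1 and fan_in must be >= 2")
    # Phase 1: cut the input into memory-sized chunks and sort each one.
    runs = [sorted(items[i:i + run_size]) for i in range(0, len(items), run_size)]
    # Phase 2: repeatedly merge groups of fan_in runs until one run remains.
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)
        ]
    return runs[0] if runs else []


data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
result = multiway_merge_sort(data, run_size=3, fan_in=2)
```

A larger `fan_in` means fewer merge passes over the data, which is the point of merging many runs at once instead of two when each pass costs a full read and write of the input.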