# **CHAPTER 3: COMPUTER SCIENCE FUNDAMENTALS**

*The Engineer's Discipline*

## **Chapter Overview**

Mathematics gives you models; Python gives you tools; Computer Science gives you scalability. This chapter covers the data structures and algorithms that separate prototype hackers from production engineers. Every concept is tied to real ML system challenges: from indexing billions of embeddings to optimizing autoregressive generation.

**Estimated Time:** 40-50 hours (3 weeks)  
**Prerequisites:** Chapters 1-2, basic programming logic

---

## **3.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Select optimal data structures for ML pipelines (O(1) feature lookups, O(log n) neighbor search)
2. Analyze algorithmic complexity of training loops and inference pipelines
3. Implement core algorithms from scratch (sorting, graph traversal, dynamic programming)
4. Design object-oriented ML systems using SOLID principles and design patterns
5. Write comprehensive test suites for stochastic ML code
6. Optimize memory and compute using algorithmic insights

---

## **3.1 Data Structures for ML Systems**

#### **3.1.1 Arrays and Dynamic Arrays (Lists)**

**ML Applications:** Feature vectors, batches, model weights.

**Time Complexity:**
- Access: $O(1)$
- Append (amortized): $O(1)$ 
- Insert/Delete at index: $O(n)$ (elements shift)

**The Amortized Analysis:**
Python lists over-allocate (growth factor ~1.125) to make append $O(1)$ amortized. Critical for online learning buffers.

```python
import sys

# Demonstrate list overallocation
lst = []
for i in range(10):
    print(f"Length: {len(lst)}, Size in memory: {sys.getsizeof(lst)}")
    lst.append(i)
    
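# The size jumps only on some appends: CPython over-allocates in modest steps
# (capacities such as 4, 8, 16, 24, ...; exact values vary by version), not by doubling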
```

**Memory Layout:**
Contiguous memory = cache efficiency. NumPy arrays exploit this with SIMD. Linked lists (non-contiguous) kill cache performance—avoid for numerical data.

#### **3.1.2 Hash Tables (Dictionaries)**

**ML Applications:** 
- Feature hashing (hashing trick for high-cardinality categorical features)
- Vocabulary mappings (word → index)
- Memoization in dynamic programming
- Deduplication in data pipelines

**Collision Resolution:**
Python uses open addressing (not chaining). Load factor > 0.66 triggers resize.

**Feature Hashing Example:**
```python
import zlib

def hash_features(features: dict, n_buckets: int = 1000) -> dict:
    """Hashing trick for memory-efficient feature representation"""
    hashed = {}
    for key, value in features.items():
        # Built-in hash() is salted per process for str keys, so use a
        # stable hash (CRC32 here) to keep buckets reproducible across runs
        bucket = zlib.crc32(key.encode("utf-8")) % n_buckets
        hashed[bucket] = hashed.get(bucket, 0) + value
    return hashed

# Use case: 1M unique words → 1000 dimensional vector
vocab_features = {f"word_{i}": 1.0 for i in range(1000000)}
compressed = hash_features(vocab_features, n_buckets=1000)
print(f"Compressed {len(vocab_features)} features to {len(compressed)}")
```

**Complexity:** Average $O(1)$ lookup, worst $O(n)$ (all collisions). Use `collections.Counter` for frequency counts.
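
As a minimal sketch of the vocabulary-mapping and frequency-counting use cases above, the snippet below builds a word → index table with `collections.Counter`. The helper name `build_vocab`, the special tokens `<PAD>` and `<UNK>`, and the toy sentence are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

def build_vocab(tokens, max_size=None, specials=("<PAD>", "<UNK>")):
    """Map the most frequent tokens to integer ids (hypothetical helper)."""
    counts = Counter(tokens)  # One O(n) pass, O(1) average per update
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in counts.most_common(max_size):
        vocab[tok] = len(vocab)
    return vocab, counts

tokens = "the cat sat on the mat the end".split()
vocab, counts = build_vocab(tokens, max_size=3)

# Unknown words fall back to the <UNK> id
unk = vocab["<UNK>"]
ids = [vocab.get(tok, unk) for tok in "the dog sat".split()]
print(counts["the"], ids)
```

Capping the vocabulary with `max_size` keeps the embedding table bounded; rare words share the `<UNK>` id.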

#### **3.1.3 Linked Lists**

**ML Applications:** 
- LRU Cache for model predictions (eviction policy)
- Gradient accumulation buffers (unbounded growth)
- Implementing custom autograd engines (computation graphs as linked structures)

**When NOT to use:** Random access to training samples. Use arrays/tensors instead.

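**LRU Cache Implementation:** (`OrderedDict` keeps its keys in a doubly linked list internally, so reordering and eviction are $O(1)$.)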
```python
from collections import OrderedDict

class LRUCache:
    """Cache for model predictions (e.g., embedding lookups)"""
    def __init__(self, capacity: int):
        self.cache = OrderedDict()
        self.capacity = capacity
    
    def get(self, key):
        if key not in self.cache:
            return None
        # Move to end (most recent)
        self.cache.move_to_end(key)
        return self.cache[key]
    
    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # Pop first (least recent)
            self.cache.popitem(last=False)

# Usage: Cache expensive embedding computations
cache = LRUCache(capacity=10000)
```

#### **3.1.4 Trees**

**Binary Search Trees (BST):** $O(\log n)$ lookup if balanced. Rarely used directly in ML (pointer chasing and data-dependent branches defeat caching and SIMD vectorization), but essential for understanding:

**Decision Trees:** The fundamental ML algorithm. Each node is a binary split.
- **Random Forests:** Ensembles of trees
- **Gradient Boosting:** Sequential trees (XGBoost, LightGBM)

**Implementation Insight:**
```python
class DecisionNode:
    def __init__(self, feature_idx, threshold, left, right, info_gain):
        self.feature_idx = feature_idx  # Which feature to split
        self.threshold = threshold      # Split value
        self.left = left                # Left subtree (<= threshold)
        self.right = right              # Right subtree (> threshold)
        self.info_gain = info_gain      # Information gain metric

class TreeClassifier:
    def predict(self, x, node):
        """Traverse tree for inference (O(tree depth))"""
        if not isinstance(node, DecisionNode):
            return node  # Leaf value
        
        if x[node.feature_idx] <= node.threshold:
            return self.predict(x, node.left)
        return self.predict(x, node.right)
```

**Segment Trees:** For range queries on time-series data (min/max/mean over intervals). Used in financial ML for rolling statistics.
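
A minimal sketch of a segment tree for range-minimum queries is shown below, assuming a fixed-length series with point updates. The class name `MinSegmentTree` and the half-open `query(left, right)` convention are illustrative choices; the same layout works for max or sum by swapping the combining function.

```python
class MinSegmentTree:
    """Iterative segment tree: O(n) build, O(log n) update and query."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [float("inf")] * (2 * self.n)
        self.tree[self.n:] = list(values)          # Leaves live at indices n..2n-1
        for i in range(self.n - 1, 0, -1):         # Parents combine their two children
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, idx, value):
        """Set values[idx] = value and refresh the ancestors."""
        i = idx + self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, left, right):
        """Minimum over values[left:right] (half-open interval)."""
        result = float("inf")
        lo, hi = left + self.n, right + self.n
        while lo < hi:
            if lo & 1:                 # lo is a right child: take it, move right
                result = min(result, self.tree[lo])
                lo += 1
            if hi & 1:                 # hi is a right boundary: step left, take it
                hi -= 1
                result = min(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result

prices = [5, 3, 8, 6, 2, 7]
st_min = MinSegmentTree(prices)
print(st_min.query(2, 4))  # Minimum of prices[2:4]
st_min.update(4, 9)        # A new tick replaces the value at index 4
print(st_min.query(3, 6))
```

For a pure sliding window with a fixed width, a monotonic deque is simpler; the tree pays off when query intervals are arbitrary.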

#### **3.1.5 Heaps (Priority Queues)**

**ML Applications:**
- **Beam Search:** In sequence generation (NLP), keep top-$k$ hypotheses
- **Top-K Selection:** Finding largest gradients for importance sampling
- **A* Search:** Pathfinding in RL environments

**Properties:** 
- Min-heap: Parent ≤ children. Root is minimum.
- Operations: Insert $O(\log n)$, Extract-min $O(\log n)$, Peek $O(1)$
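
The top-k use case above can be sketched with a size-bounded min-heap: the root is always the smallest of the current top k, so each new value needs only one comparison and at most one $O(\log k)$ replacement. The function name `top_k_stream` and the sample values are assumptions for illustration; `heapq.nlargest` performs the same job in a single call.

```python
import heapq

def top_k_stream(values, k):
    """Return the k largest values from an iterable, largest first."""
    if k <= 0:
        return []
    heap = []  # Min-heap holding the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # New value beats the smallest of the current top k
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

grad_norms = [0.1, 2.5, -1.0, 3.2, 0.7, 2.9]
print(top_k_stream(grad_norms, 3))  # [3.2, 2.9, 2.5]
```

Memory stays at $O(k)$ regardless of stream length, which is the point when the values do not fit in RAM.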

**Beam Search Implementation:**
```python
import heapq

def beam_search_decoder(predict_fn, start_token, beam_width=3, max_len=20):
    """Beam search for sequence generation"""
    # Each beam is (cumulative_log_prob, sequence); higher score is better
    beams = [(0.0, [start_token])]
    
    for _ in range(max_len):
        # Stop early once every beam has finished
        if all(seq[-1] == "<EOS>" for _, seq in beams):
            break
        candidates = []
        for score, seq in beams:
            if seq[-1] == "<EOS>":
                candidates.append((score, seq))  # Carry finished beams forward
                continue
            
            # Get next token probabilities
            probs = predict_fn(seq)  # Dict[token] = log_prob
            
            # Extend this beam with every candidate token
            for token, log_prob in probs.items():
                candidates.append((score + log_prob, seq + [token]))
        
        # Keep only the beam_width best; nlargest uses a heap internally
        # and returns them sorted best-first
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    
    return beams[0][1]  # Best sequence
```

#### **3.1.6 Graphs**

**ML Applications:**
- **Graph Neural Networks (GNNs):** Social networks, molecules, knowledge graphs
- **Nearest Neighbor Search:** K-d trees, Ball trees (scikit-learn)
- **Bayesian Networks:** Probabilistic graphical models
- **Flow Networks:** Max-flow min-cut for image segmentation

**Representations:**
1. **Adjacency Matrix:** $O(V^2)$ space. Good for dense graphs; for the sparse graphs typical of GNN workloads, libraries usually store it in a sparse matrix format instead.
2. **Adjacency List:** $O(V + E)$ space. Better for sparse graphs (social networks).

```python
from collections import defaultdict, deque

class Graph:
    def __init__(self, directed=False):
        self.adj_list = defaultdict(list)
        self.directed = directed
    
    def add_edge(self, u, v, weight=1):
        self.adj_list[u].append((v, weight))
        if not self.directed:
            self.adj_list[v].append((u, weight))
    
    def bfs(self, start):
        """Breadth-first search (shortest path in unweighted graph)"""
        visited = {start}
        queue = deque([(start, 0)])  # (node, distance)
        distances = {start: 0}
        
        while queue:
            node, dist = queue.popleft()
            for neighbor, _ in self.adj_list[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    distances[neighbor] = dist + 1
                    queue.append((neighbor, dist + 1))
        return distances
    
    def dfs(self, start, visited=None):
        """Depth-first search (for topological sort, cycle detection)"""
        if visited is None:
            visited = set()
        visited.add(start)
        print(start)  # Process node
        
        for neighbor, _ in self.adj_list[start]:
            if neighbor not in visited:
                self.dfs(neighbor, visited)
        return visited
```

**Topological Sort:** For neural network computation graphs (forward/backward pass ordering).
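
A minimal sketch of topological ordering with Kahn's algorithm follows, applied to a toy computation graph for `relu(x @ W + b)`. The function name `topological_order` and the node labels are illustrative assumptions; real frameworks track this ordering internally.

```python
from collections import defaultdict, deque

def topological_order(edges):
    """Kahn's algorithm. Each (src, dst) edge means src must run before dst."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1
        nodes.update((src, dst))

    # Start from nodes with no pending inputs (sorted for a reproducible order)
    ready = deque(sorted(n for n in nodes if in_degree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:   # All inputs of nxt are now computed
                ready.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("graph has a cycle; no valid ordering exists")
    return order

# Toy graph for y = relu(x @ W + b)
edges = [("x", "matmul"), ("W", "matmul"), ("matmul", "add"),
         ("b", "add"), ("add", "relu")]
forward = topological_order(edges)
backward = list(reversed(forward))  # Gradients flow in the reverse order
print(forward)
```

The algorithm runs in $O(V + E)$ and doubles as a cycle detector, since a cycle leaves some nodes with a nonzero in-degree.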

#### **3.1.7 Tries (Prefix Trees)**

**ML Applications:**
- **Autocomplete systems:** Language model tokenization
- **String matching:** Bioinformatics (DNA sequence alignment)
- **Efficient vocabulary storage:** Especially for BPE tokenizers

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False
        self.frequency = 0  # For predicting most likely completion

class AutocompleteTrie:
    def __init__(self):
        self.root = TrieNode()
    
    def insert(self, word, freq=1):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end = True
        node.frequency = freq
    
    def search(self, prefix):
        """Return all words with given prefix"""
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        return self._collect(node, prefix)
    
    def _collect(self, node, prefix):
        results = []
        if node.is_end:
            results.append((prefix, node.frequency))
        for char, child in node.children.items():
            results.extend(self._collect(child, prefix + char))
        return results
```

---

## **3.2 Algorithms for ML Optimization**

#### **3.2.1 Sorting and Selection**

**ML Use Cases:**
- Computing percentiles (median absolute deviation for outlier detection)
- Ranking predictions (precision@k, NDCG metrics)
- Finding nearest neighbors (sort by distance)

**Key Algorithms:**
- **Quicksort:** Average $O(n \log n)$, worst $O(n^2)$. Used in `numpy.sort()` (introsort hybrid).
- **Mergesort:** Stable $O(n \log n)$. Used in `sorted()` (Timsort).
- **Heapselect:** $O(n \log k)$ for top-k (more efficient than full sort).

**Partial Sorting for Top-K:**
```python
import numpy as np

# Efficient top-k (faster than full sort)
scores = np.random.randn(1000000)
k = 100

# Method 1: Full sort (O(n log n))
top_k_slow = np.sort(scores)[-k:]

# Method 2: Partition (O(n))
# np.argpartition does introselect (quickselect)
idx = np.argpartition(scores, -k)[-k:]
top_k_fast = scores[idx]
# Note: top_k_fast is not sorted; sort just these k if needed
top_k_sorted = np.sort(top_k_fast)
```

**Quickselect for Median:** $O(n)$ average to find median without sorting.
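
To make the idea concrete, here is a small pure-Python quickselect sketch. The names `quickselect` and `median` and the fixed seed are illustrative assumptions; in practice `np.median` or `np.partition` would be used for numerical arrays.

```python
import random

def quickselect(values, k, rng=None):
    """Return the k-th smallest element (0-indexed) in O(n) average time."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    rng = rng or random.Random(0)  # Seeded only to keep runs reproducible
    items = list(values)
    while True:
        pivot = rng.choice(items)
        lows = [v for v in items if v < pivot]
        n_equal = sum(1 for v in items if v == pivot)
        if k < len(lows):
            items = lows                      # Answer lies below the pivot
        elif k < len(lows) + n_equal:
            return pivot                      # Answer is the pivot itself
        else:
            k -= len(lows) + n_equal          # Answer lies above the pivot
            items = [v for v in items if v > pivot]

def median(values):
    n = len(values)
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return 0.5 * (quickselect(values, n // 2 - 1) + quickselect(values, n // 2))

print(median([7, 1, 5, 3, 9]))  # 5
print(median([4, 1, 3, 2]))     # 2.5
```

Each round discards the side that cannot contain the answer, so the expected work is $n + n/2 + n/4 + \dots = O(n)$.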

#### **3.2.2 Search Algorithms**

**Binary Search:** $O(\log n)$. Used in:
- Hyperparameter search (monotonic validation curves)
- Threshold tuning (finding optimal classification threshold)
- Sparse matrix lookups

```python
def binary_search_threshold(scores, labels, target_precision=0.95):
    """Find the lowest threshold achieving target precision.

    Assumes precision is (roughly) non-decreasing in the threshold.
    """
    low, high = 0.0, 1.0
    best_thresh = 0.5  # Fallback if the target is never reached
    
    for _ in range(20):  # 20 halvings = resolution of 2^-20
        mid = (low + high) / 2
        preds = (scores >= mid).astype(int)
        precision = np.mean(labels[preds == 1]) if np.sum(preds) > 0 else 0
        
        if precision >= target_precision:
            best_thresh = mid
            high = mid  # Target met: try a lower threshold (more positives, higher recall)
        else:
            low = mid   # Precision too low: raise the threshold (fewer, more confident positives)
    
    return best_thresh
```

#### **3.2.3 Dynamic Programming (DP)**

**ML Applications:**
- **Sequence Alignment:** Needleman-Wunsch (bioinformatics)
- **Edit Distance:** Levenshtein distance for fuzzy string matching
- **Viterbi Algorithm:** Finding most likely state sequence in HMMs (CRFs, speech recognition)
- **Knapsack Problem:** Feature selection with budget constraints
- **Coin Change:** Optimal model ensemble selection (weighted majority)

**Edit Distance Example:**
```python
def levenshtein_distance(s1, s2):
    """Minimum edits to transform s1 into s2"""
    m, n = len(s1), len(s2)
    # DP table: dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    # Base cases
    for i in range(m + 1): dp[i][0] = i  # Delete all
    for j in range(n + 1): dp[0][j] = j  # Insert all
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]  # No operation needed
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],     # Deletion
                    dp[i][j-1],     # Insertion
                    dp[i-1][j-1]    # Substitution
                )
    return dp[m][n]

# Use case: Fuzzy matching of OCR errors to dictionary
```

**Space Optimization:** Notice we only need previous row, not full table. Reduce $O(mn)$ space to $O(n)$.
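
A sketch of that row-by-row variant is given below. The function name `levenshtein_two_rows` is an illustrative assumption; swapping the arguments so the shorter string indexes the row is an extra step that brings the space down to $O(\min(m, n))$, which is valid because edit distance is symmetric.

```python
def levenshtein_two_rows(s1: str, s2: str) -> int:
    """Edit distance keeping only the previous DP row in memory."""
    if len(s2) > len(s1):
        s1, s2 = s2, s1  # Keep the shorter string along the row

    prev = list(range(len(s2) + 1))  # Row 0: distance from "" to s2[:j]
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # Column 0: delete all i characters of s1[:i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # Deletion
                curr[j - 1] + 1,     # Insertion
                prev[j - 1] + cost,  # Match or substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein_two_rows("kitten", "sitting"))  # 3
```

The trade-off is that the full table is gone, so the actual edit script can no longer be recovered by backtracking.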

#### **3.2.4 Greedy Algorithms**

**When to use:** When local optimal choices lead to global optimum (e.g., matroids).

**ML Applications:**
- **Decision Tree Splitting:** Greedy information gain maximization
- **Feature Selection:** Greedy forward selection (add feature with highest marginal gain)
- **Set Cover:** Selecting minimal training set that covers all feature space regions

**Greedy vs DP:**
- Greedy: Faster, no guarantee of optimal (but often good enough)
- DP: Optimal but expensive (exponential state space in worst case)
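
The contrast can be seen on a small coin-change instance, sketched below with hypothetical helpers `greedy_coins` and `dp_coins`. With denominations 1, 3, 4 and a target of 6, greedy takes the largest coin first and ends up with three coins, while DP finds the two-coin solution.

```python
def greedy_coins(coins, amount):
    """Always take the largest coin that fits. Fast, but not always optimal."""
    count = 0
    for coin in sorted(coins, reverse=True):
        count += amount // coin
        amount %= coin
    return count if amount == 0 else None

def dp_coins(coins, amount):
    """best[t] = fewest coins summing to t. O(amount * len(coins)) time."""
    INF = float("inf")
    best = [0] + [INF] * amount
    for total in range(1, amount + 1):
        for coin in coins:
            if coin <= total and best[total - coin] + 1 < best[total]:
                best[total] = best[total - coin] + 1
    return best[amount] if best[amount] != INF else None

print(greedy_coins([1, 3, 4], 6))  # 3  (4 + 1 + 1)
print(dp_coins([1, 3, 4], 6))      # 2  (3 + 3)
```

For some coin systems (such as 1, 5, 10, 25) the two agree, which is why greedy is a reasonable first attempt when the problem structure supports it.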

---

## **3.3 Complexity Analysis**

#### **3.3.1 Big O Notation**

**Definitions:**
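- $O(f(n))$: Asymptotic upper bound (grows no faster than $f$)
- $\Omega(f(n))$: Asymptotic lower bound (grows at least as fast as $f$)
- $\Theta(f(n))$: Tight bound (both $O$ and $\Omega$)

These notations bound growth rates and are independent of best, average, or worst case: each case has its own running-time function, and any of the three notations can describe it. For example, quicksort is $\Theta(n \log n)$ on average and $\Theta(n^2)$ in the worst case.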

**Common Complexities in ML:**

| Operation | Complexity | Example |
|-----------|-----------|---------|
| Matrix Multiply | $O(n^3)$ (naive), $O(n^{2.81})$ (Strassen) | Transformer attention |
| Matrix Inversion | $O(n^3)$ | Linear regression normal equations |
| Sorting | $O(n \log n)$ | Ranking predictions |
| K-NN Search | $O(n)$ naive, $O(\log n)$ average with KD-tree (low dimensions only) | Nearest neighbors |
| Gradient Descent | $O(iterations \times data\_size \times params)$ | Neural network training |

#### **3.3.2 Amortized Analysis**

**Example:** Python list append is $O(1)$ amortized, even though occasional resizes are $O(n)$. Over $n$ appends, total cost is $O(n)$, so amortized $O(1)$ per operation.

**ML Application:** Hash table resizing during feature counting. Batch inserts to amortize cost.
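
The amortized argument can be checked empirically with a toy model of a dynamic array that counts how many elements are copied during resizes. This is a simplified simulation under assumed growth factors, not CPython's actual allocator; the name `count_copies` is hypothetical.

```python
def count_copies(n_appends: int, growth: float = 2.0) -> int:
    """Simulate a dynamic array and count element copies caused by resizes."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n_appends):
        if size == capacity:
            # Full: allocate a larger block and copy every existing element
            copies += size
            capacity = max(capacity + 1, int(capacity * growth))
        size += 1
    return copies

for n in (1_000, 10_000, 100_000):
    total = count_copies(n)
    print(f"n={n}: {total} copies, {total / n:.2f} per append")
```

With doubling, the copies form the series $1 + 2 + 4 + \dots < 2n$, so the average cost per append stays below a constant even though a single resize is $O(n)$. A smaller growth factor such as 1.125 raises that constant but keeps it independent of $n$.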

#### **3.3.3 Space Complexity**

**Critical for ML:**
- **Model Parameters:** $O(d \times h)$ for neural network (d=input dim, h=hidden dim)
- **Activations:** $O(b \times h \times l)$ for backprop (batch, hidden, layers) — memory bottleneck
- **Data:** $O(n \times d)$ for dataset

**Trade-offs:**
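- Memoization (DP): Trade space for time. For example, memoized Fibonacci takes $O(n)$ time with an $O(n)$ table, while naive recursion takes $O(2^n)$ time with no table (only the $O(n)$ call stack)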
- Checkpoints in training: Recompute vs Store (memory vs compute tradeoff)
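
The memoization trade-off above can be measured directly by counting calls. The sketch below uses Fibonacci as a stand-in for any recursion with overlapping subproblems; the function names and the call counter are illustrative assumptions.

```python
from functools import lru_cache

def fib_naive(n, counter):
    """Plain recursion: recomputes the same subproblems repeatedly."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

@lru_cache(maxsize=None)
def fib_memo(n):
    """Memoized recursion: each subproblem is computed once and stored."""
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

calls = [0]
print(fib_naive(20, calls), calls[0])            # 6765 after 21891 calls
print(fib_memo(20), fib_memo.cache_info().misses)  # 6765 after 21 computed entries
```

The cache holds one entry per distinct argument, which is the $O(n)$ space paid to remove the exponential blow-up in time.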

#### **3.3.4 Profiling Complexity**

```python
import time

import numpy as np

def measure_time(func, *args, n_trials=5):
    """Empirical complexity measurement"""
    times = []
    for _ in range(n_trials):
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    return np.mean(times)

# Verify O(n log n) sorting
sizes = [100, 1000, 10000, 100000]
times = []
for n in sizes:
    arr = np.random.randn(n)
    t = measure_time(np.sort, arr)
    times.append(t)
    print(f"n={n}, time={t:.4f}, ratio={times[-1]/times[-2] if len(times)>1 else 0:.2f}")
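# For O(n log n), each 10x increase in n should raise the time by slightly more
# than 10x (small n is dominated by fixed call overhead, so trust the larger sizes)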
```

---

## **3.4 Software Engineering for ML**

#### **3.4.1 SOLID Principles**

**S - Single Responsibility:** A class should have one reason to change.
```python
# BAD: Data loading AND augmentation AND model training
class MLSystem:
    def load_data(self): ...
    def augment(self): ...
    def train(self): ...

# GOOD: Separate concerns
class DataLoader: ...
class Augmenter: ...
class Trainer: ...
```

**O - Open/Closed:** Open for extension, closed for modification.
```python
from abc import ABC, abstractmethod

class LossFunction(ABC):
    @abstractmethod
    def compute(self, y_pred, y_true): ...

class MSELoss(LossFunction):
    def compute(self, y_pred, y_true):
        return np.mean((y_pred - y_true)**2)

class CrossEntropyLoss(LossFunction):
    def compute(self, y_pred, y_true):
        return -np.sum(y_true * np.log(y_pred + 1e-8))

# Can add new losses without modifying Trainer
class Trainer:
    def __init__(self, loss_fn: LossFunction):
        self.loss_fn = loss_fn
```

**L - Liskov Substitution:** Subtypes must be substitutable.
```python
class BaseModel:
    def predict(self, X): ...

class NeuralNet(BaseModel):
    def predict(self, X): ...  # Same signature; may accept more inputs (weaker preconditions), never fewer

# Violation: Changing return type from array to list breaks downstream code
```

**I - Interface Segregation:** Clients shouldn't depend on methods they don't use.
```python
# BAD: One huge interface
class Dataset:
    def load_image(self): ...
    def load_text(self): ...
    def load_audio(self): ...

# GOOD: Split interfaces
class ImageDataset: ...
class TextDataset: ...
```

**D - Dependency Inversion:** Depend on abstractions, not concretions.
```python
# BAD
class Trainer:
    def __init__(self):
        self.model = ResNet50()  # Concrete dependency

# GOOD
class Trainer:
    def __init__(self, model: ModelInterface):
        self.model = model  # Injectable dependency
```

#### **3.4.2 Design Patterns in ML**

**Strategy Pattern:** Interchangeable algorithms (optimizers, schedulers).
```python
class OptimizerStrategy(ABC):
    @abstractmethod
    def step(self, gradients, params): ...

class SGDOptimizer(OptimizerStrategy): ...
class AdamOptimizer(OptimizerStrategy): ...
```

**Factory Pattern:** Create models based on config.
```python
class ModelFactory:
    @staticmethod
    def create(model_type: str):
        if model_type == "resnet":
            return ResNet()
        elif model_type == "transformer":
            return Transformer()
        else:
            raise ValueError(f"Unknown model: {model_type}")
```

**Observer Pattern:** Logging, early stopping, model checkpointing.
```python
class TrainerObservable:
    def __init__(self):
        self.observers = []
    
    def register(self, observer):
        self.observers.append(observer)
    
    def notify(self, epoch, metrics):
        for obs in self.observers:
            obs.update(epoch, metrics)

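class EarlyStoppingObserver:
    def __init__(self, patience=3):
        self.best_loss = float("inf")
        self.patience = patience
        self.patience_counter = 0
        self.should_stop = False
    
    def update(self, epoch, metrics):
        if metrics['val_loss'] < self.best_loss:
            self.best_loss = metrics['val_loss']
            self.patience_counter = 0  # Improvement: reset patience
        else:
            self.patience_counter += 1
            self.should_stop = self.patience_counter >= self.patience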
```

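**Composite Pattern:** Data augmentation pipelines (composable transforms). `Compose` is itself a `Transform`, so pipelines can nest; a Decorator, by contrast, wraps a single object to add behavior.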
```python
class Transform(ABC):
    @abstractmethod
    def apply(self, image): ...

class Compose(Transform):
    def __init__(self, transforms):
        self.transforms = transforms
    
    def apply(self, image):
        for t in self.transforms:
            image = t.apply(image)
        return image
```

#### **3.4.3 Testing ML Code**

**Challenges:**
- Stochasticity (random initialization, dropout)
- Large data (slow tests)
- Floating point precision

**Strategies:**

1. **Seed Fixing:**
```python
def test_model_convergence():
    np.random.seed(42)
    torch.manual_seed(42)
    # ... test deterministic behavior
```

2. **Mocking Data:**
```python
import pytest

@pytest.fixture
def dummy_data():
    return np.random.randn(100, 10), np.random.randint(0, 2, 100)

def test_forward_pass(dummy_data):
    X, y = dummy_data
    model = SmallModel()
    out = model(X)
    assert out.shape == (100, 2)
```

3. **Property-Based Testing:**
```python
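from hypothesis import assume, given, strategies as st

# Bounded floats exclude NaN/inf, which would otherwise break the arithmetic
@given(st.lists(st.floats(min_value=-1e3, max_value=1e3), min_size=2))
def test_normalization_mean_zero(data):
    arr = np.array(data)
    assume(np.std(arr) > 1e-3)  # Skip (near-)constant inputs: std ~ 0 divides by zero
    normalized = (arr - np.mean(arr)) / np.std(arr)
    assert abs(np.mean(normalized)) < 1e-6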
```

4. **Regression Tests:** Save model outputs for fixed inputs; ensure they don't change unexpectedly (snapshot testing).

5. **Integration Tests:** End-to-end training for 1 epoch on toy data to catch shape mismatches.

---

## **3.5 Workbook Labs**

### **Lab 1: Custom Autograd Engine**
Implement a simplified version of PyTorch's autograd using computational graphs (linked nodes) and reverse-mode automatic differentiation.

**Requirements:**
- Support operations: add, multiply, ReLU, matrix multiply
- Build computation graph dynamically
- Backward pass using topological sort
- Verify gradients match numerical differentiation

**Deliverable:** `micrograd.py` with test cases showing correct gradients for a small neural network.

### **Lab 2: Approximate Nearest Neighbor (ANN) Search**
Implement a basic Locality Sensitive Hashing (LSH) or KD-Tree for finding similar embeddings.

**Requirements:**
- Index 100k vectors of dimension 128
- Query time <10ms for top-10 neighbors (vs O(n) brute force)
- Recall@10 > 0.95 compared to exact search
- Memory usage <500MB

**Deliverable:** `ann_index.py` with benchmarking against `sklearn.neighbors.KDTree`.

### **Lab 3: Optimal Binning Algorithm**
Implement a dynamic programming algorithm to find optimal binning thresholds for continuous features (maximizing information gain).

**Requirements:**
- O(n log n) complexity using sorted splits
- Support for multi-class targets
- Handles missing values efficiently

**Deliverable:** `optimal_binner.py` integrated with a Decision Tree classifier.

### **Lab 4: Distributed Feature Counter**
Using only standard library (no external databases), implement a disk-based feature counter that can handle 10B features using external merge sort.

**Requirements:**
- Process data in chunks that fit in RAM (1GB limit)
- Exact counts (not approximate like Count-Min Sketch)
- Parallel processing with multiprocessing

**Deliverable:** `external_counter.py` with complexity analysis documentation.

---

## **3.6 Common Pitfalls**

1. **$O(n^2)$ Distance Matrix Computation:**
   ```python
   # BAD: Double loop
   for i in range(n):
       for j in range(n):
           dist[i,j] = np.linalg.norm(X[i] - X[j])
   
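   # GOOD: Vectorized broadcasting (still O(n^2 d) work, but the loops run in C;
   # beware the temporary (n, n, d) array, which can exhaust memory for large n)
   dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
   # Or better: use scipy.spatial.distance.cdist, which avoids that temporary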
   ```

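2. **Repeated Attribute Lookups in Hot Loops:**
   ```python
   # Slow: `self.big_dict` is re-resolved on every iteration
   for key in keys:
       val = self.big_dict[key]
   
   # Faster: bind the dict to a local name once
   big_dict = self.big_dict
   for key in keys:
       val = big_dict[key]
   # Note: replacing big_dict[key] with a bound big_dict.get(key) does not help;
   # the subscript is usually the faster of the two in CPython
   ```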

3. **Recursive DFS on Deep Graphs:** Python recursion limit (~1000). Use iterative stack implementation for deep computation graphs.

4. **Not Using `__slots__` for Small Objects:**
   ```python
   class Node:
       __slots__ = ['feature', 'threshold', 'left', 'right']
       # Saves ~50% memory per node (critical for large trees)
   ```
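
For pitfall 3, an explicit stack removes the dependence on the interpreter's recursion limit. The sketch below assumes the graph is a plain dict mapping each node to a list of neighbors; `dfs_iterative` and the 100,000-node chain are illustrative.

```python
def dfs_iterative(adj, start):
    """Preorder DFS using an explicit stack instead of recursion."""
    visited, order = set(), []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Reverse so the first-listed neighbor is explored first
        stack.extend(reversed(adj.get(node, [])))
    return order

# A chain this deep would exceed the default recursion limit
chain = {i: [i + 1] for i in range(100_000)}
order = dfs_iterative(chain, 0)
print(len(order))  # 100001
```

Raising the limit with `sys.setrecursionlimit` is possible but risks a hard crash from C stack exhaustion, so the explicit stack is usually the safer choice.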

---

## **3.7 Interview Questions**

**Q1:** Why is QuickSort generally preferred over MergeSort for arrays, but MergeSort for linked lists?
*A: QuickSort has better cache locality for arrays (in-place partition) but $O(\log n)$ stack space. MergeSort requires $O(n)$ extra space but works efficiently with linked lists (no random access needed) and is stable (preserves order of equal elements).*

**Q2:** How would you find the median of a stream of integers (online algorithm)?
*A: Use two heaps: max-heap for lower half, min-heap for upper half. Balance sizes so max-heap has equal or one more element. Median is top of max-heap (or average of both tops if even count). Operations $O(\log n)$ per insertion.*
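
A compact sketch of the two-heap answer follows. Python's `heapq` only provides a min-heap, so the lower half stores negated values to behave as a max-heap; the class name `StreamingMedian` is an illustrative assumption.

```python
import heapq

class StreamingMedian:
    def __init__(self):
        self.low = []   # Max-heap of the lower half (values stored negated)
        self.high = []  # Min-heap of the upper half

    def add(self, x):
        heapq.heappush(self.low, -x)
        # Move the largest "low" value up so every low <= every high
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance so low has equal size or exactly one more element
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if not self.low:
            raise ValueError("no data yet")
        if len(self.low) > len(self.high):
            return float(-self.low[0])
        return (-self.low[0] + self.high[0]) / 2

sm = StreamingMedian()
medians = []
for x in [5, 2, 8, 1]:
    sm.add(x)
    medians.append(sm.median())
print(medians)  # [5.0, 3.5, 5.0, 3.5]
```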

**Q3:** Explain why hash table lookup is $O(1)$ average but $O(n)$ worst case.
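*A: Average: uniform hashing distributes keys evenly, few collisions. Worst: all keys collide (one bucket under chaining, or one long probe sequence under open addressing), so a lookup degrades to a linear scan. In practice, randomized keyed hashing (such as the SipHash variant CPython uses for strings) or universal hashing defends against adversarial inputs.*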

**Q4:** Implement a Trie and analyze space complexity vs hash table for string storage.
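*A: A trie shares prefixes (e.g., "cat", "car" share "ca"), so it needs at most $O(\text{total characters})$ nodes, fewer when prefixes overlap heavily, but every node carries child-pointer overhead. A hash table stores each key in full, $O(\text{total characters})$ plus per-entry overhead, and is often smaller in practice when little prefix sharing exists. Trie wins on prefix queries; hash table wins on exact match.*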

**Q5:** How does dynamic programming differ from greedy algorithms? Give an ML example.
*A: DP explores all subproblems and stores solutions (optimal substructure, overlapping subproblems). Greedy makes a locally optimal choice without reconsidering. Example: DP for global sequence alignment (Needleman-Wunsch) is guaranteed optimal, whereas a greedy heuristic that extends the best-scoring match at each step is faster but not guaranteed optimal.*

---

## **3.8 Further Reading**

**Books:**
- *Introduction to Algorithms* (CLRS) - Chapters 6 (Heaps), 15 (DP), 22 (Graphs), 35 (Approximation algorithms)
- *Design Patterns* (Gang of Four) - Strategy, Observer, Factory patterns
- *Clean Code* (Robert Martin) - SOLID principles with examples

**Specific to ML:**
- "Efficient Data Structures for Tiny Machine Learning" (TinyML research)
- "Algorithmic Efficiency in Transformers" (FlashAttention, sparse attention patterns)
- Scikit-learn source code for tree implementations (Cython optimized)

---

## **3.9 Checkpoint Project: High-Performance Feature Store**

Build a feature store that serves pre-computed ML features with low latency (<5ms p99).

**Requirements:**

**Architecture:**
- **Storage Layer:** In-memory hash table + disk-backed SSTables (LSM-tree concept)
- **Serving Layer:** gRPC API (or FastAPI) with batch and single lookup endpoints
- **Ingestion Layer:** Stream processing simulation (Kafka-like queue consumption)

**Data Structures:**
1. **Hash Index:** Feature name → offset in SSTable (O(1) lookup)
2. **LRU Cache:** Hot feature vectors cached in memory (eviction when full)
3. **Bloom Filter:** Check if feature exists before disk lookup (reduce I/O)
4. **Segment Tree:** For time-windowed features (rolling aggregates)
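
As a starting point for item 3, here is a minimal Bloom filter sketch. The bit-array size, the number of hash functions, and the double-hashing scheme built on SHA-256 are assumptions chosen for readability, not tuned values; the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive all positions from two 64-bit halves of one digest
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # Force odd so steps vary
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bloom = BloomFilter()
for i in range(1000):
    bloom.add(f"feature_{i}")

print(bloom.might_contain("feature_42"))  # True: added keys are never missed
```

In the feature store, a `False` answer lets the lookup skip the SSTable read entirely; a `True` answer still requires the disk check because false positives are possible.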

**Algorithms:**
- **Consistent Hashing:** For distributed feature storage (future scaling)
- **Binary Search:** On SSTable index for range queries
- **Top-K:** Maintain most frequently accessed features for cache warming
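
Consistent hashing from the list above can be sketched with a sorted ring of virtual nodes. The class name `ConsistentHashRing`, the node labels, the choice of 100 virtual nodes per server, and SHA-256 as the ring hash are all illustrative assumptions.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # Virtual nodes per server, to even out load
        self._ring = []           # Sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def add_node(self, node: str):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring is empty")
        # First ring point clockwise from the key's hash, wrapping around
        idx = bisect.bisect_right(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

keys = [f"feature_{i}" for i in range(2000)]
ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get_node(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Every key that moves lands on the new node, and the rest stay put; with plain `hash(key) % n_nodes`, adding a server would instead reshuffle most keys.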

**Performance Targets:**
- Single feature lookup: <1ms (in-memory), <10ms (disk)
- Batch lookup (100 features): <5ms
- Ingestion throughput: 10k features/second
- Memory usage: <2GB for 1M features (512-dim vectors); raw float32 storage alone is about 2.05 GB, so this target implies float16 or compressed vectors

**Testing:**
- Unit tests for each data structure
- Integration test with 1M random features
- Load test with Locust (1000 concurrent users)
- Chaos test: random disk failures, verify consistency

**Deliverables:**
- GitHub repo with architecture diagram
- Benchmark report comparing against Redis (baseline)
- Documentation on consistency guarantees (eventual vs strong)

---

**End of Chapter 3**

*You now possess the algorithmic foundations to build scalable ML systems. Chapter 4 will cover Development Environment & Tools (Git, Linux, Docker, Cloud) — the infrastructure layer of AI engineering.*

