# Chapter 35: Choosing the Right Data Structure

> *"The right data structure can make an algorithm sing; the wrong one can make it crawl. Choosing wisely is the hallmark of an expert engineer."* — Anonymous

---

## 35.1 Introduction

In the preceding chapters, we've explored a vast arsenal of data structures—from simple arrays and linked lists to sophisticated trees, heaps, and graphs. Each structure has its own strengths and weaknesses. The key to solving any computational problem efficiently lies not just in the algorithm, but in selecting the appropriate data structure to support it. This chapter provides a framework for making that choice, considering not only asymptotic complexity but also practical factors like cache behavior, memory footprint, code complexity, and the environment (in‑memory vs. external storage).

### 35.1.1 Why the Choice Matters

```
┌─────────────────────────────────────────────────────────────────────┐
│                    IMPORTANCE OF DATA STRUCTURE SELECTION             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. PERFORMANCE: The right structure can reduce an O(n²) algorithm  │
│     to O(n log n) or better.                                        │
│                                                                      │
│  2. SCALABILITY: A structure that works for small datasets may      │
│     collapse under larger loads (e.g., O(n) search in a linked list │
│     vs. O(log n) in a balanced tree).                               │
│                                                                      │
│  3. MAINTAINABILITY: Complex structures increase code complexity    │
│     and bug risk; simplicity should be favored unless performance   │
│     demands otherwise.                                              │
│                                                                      │
│  4. MEMORY EFFICIENCY: Some structures have high overhead (e.g.,    │
│     each node in a tree carries pointers); others are compact       │
│     (e.g., arrays).                                                 │
│                                                                      │
│  5. CACHE LOCALITY: Modern CPUs rely heavily on caches; structures  │
│     that access memory sequentially (arrays) often outperform       │
│     pointer‑chasing structures (linked lists) even when asymptotics │
│     are similar.                                                    │
│                                                                      │
│  6. CONCURRENCY: In multi‑threaded environments, some structures    │
│     are easier to make thread‑safe than others.                     │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 35.2 Criteria for Selection

When faced with a problem, consider the following dimensions:

### 35.2.1 Time Complexity

- **Worst‑case:** Essential for real‑time systems or when adversarial input is possible.
- **Average‑case:** Often more relevant for typical use.
- **Amortized:** Important when occasional expensive operations are offset by many cheap ones (e.g., dynamic array resizing).
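
The amortized case is easy to check numerically. The sketch below counts how many element copies a toy dynamic array performs when it doubles its capacity on overflow. The function name `count_copies` and the doubling policy are assumptions for illustration; real implementations (CPython's `list`, for instance) use different growth factors.

```python
def count_copies(n_appends, growth=2):
    """Count element copies made by a toy dynamic array that grows geometrically."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n_appends):
        if size == capacity:
            # Buffer is full: allocate a larger one and move every element over.
            copies += size
            capacity *= growth
        size += 1
    return copies


total = count_copies(1000)
# With doubling, total copies stay below 2 * n_appends,
# so the average cost per append is bounded by a constant.
print(total, total / 1000)
```

An individual append that triggers a resize is O(n), but the total copying work over n appends is O(n), which is what "amortized O(1)" means.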

Always ask: What operations will be performed most frequently? (insert, delete, search, iteration, etc.)

### 35.2.2 Space Complexity

- Overhead per element (pointers, balance information, etc.).
- Total memory footprint, especially for large datasets.
- If data is too large to fit in RAM, you need an **external memory** structure (e.g., B‑tree).

### 35.2.3 Memory Access Patterns (Cache Behavior)

- **Sequential access:** Arrays, dynamic arrays (good cache locality).
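- **Scattered access:** Pointer‑based trees and chained hash tables jump between unrelated addresses (poor locality, many cache misses); open‑addressing hash tables usually fare somewhat better.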
- **Spatial locality:** Structures that store related data contiguously (e.g., array of structs vs. struct of arrays) matter for performance.

### 35.2.4 Code Complexity and Maintainability

- How hard is it to implement correctly?
- Are there library implementations available? (e.g., Python’s `list`, `dict`, `heapq`, `bisect`; C++ STL; Java Collections)
- Does the team understand the structure?
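
As a brief illustration of leaning on library implementations, the sketch below uses Python's standard `bisect` and `heapq` modules instead of hand‑written structures. The sample data and variable names are made up for the example.

```python
import bisect
import heapq

scores = [15, 42, 8, 23]

# Sorted sequence maintained with bisect: O(log n) search,
# but each insertion still shifts elements, so it is O(n).
ordered = []
for s in scores:
    bisect.insort(ordered, s)
idx = bisect.bisect_left(ordered, 23)
found = idx < len(ordered) and ordered[idx] == 23

# Binary min-heap stored in a plain list: O(log n) push/pop, O(1) peek at heap[0].
heap = list(scores)
heapq.heapify(heap)
smallest = heapq.heappop(heap)

# Convenience helper for "top k" queries.
top_two = heapq.nlargest(2, scores)
```

Both modules operate on ordinary lists, so the team only has to understand one container type plus a handful of functions.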

### 35.2.5 Adaptability and Future Requirements

- Will the data need to support new operations later? Choose a more general structure.
- Is the data size expected to grow? Plan for scalability.

### 35.2.6 Concurrency

- Does the structure need to be thread‑safe?
- Can we use locks, or do we need lock‑free structures?
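
The simplest route to thread safety is usually a single coarse lock around the structure. The sketch below wraps a dictionary of counters this way; `LockedCounter` and `hammer` are hypothetical names for this example, and a lock‑free design would look quite different.

```python
import threading


class LockedCounter:
    """Hypothetical example: a dict of counts guarded by one coarse lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def increment(self, key):
        # The read-modify-write must be atomic, so it happens under the lock.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1

    def get(self, key):
        with self._lock:
            return self._counts.get(key, 0)


def hammer(counter, key, n_threads=4, per_thread=1000):
    """Increment `key` from several threads and return the final count."""
    def work():
        for _ in range(per_thread):
            counter.increment(key)

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.get(key)
```

A coarse lock is easy to reason about but serializes all access; whether that is acceptable depends on how contended the structure is.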

### 35.2.7 External Memory Constraints

- When data exceeds RAM, we need structures optimized for disk I/O (e.g., B‑trees, LSM trees).
- Operations should minimize the number of disk accesses (block reads/writes).

---

## 35.3 Common Data Structures and Their Trade‑offs

The following table summarizes typical use cases for fundamental data structures. (Complexities are for the most common operations; see Appendix B for detailed cheat sheets.)

| Structure               | Strong Points                                           | Weak Points                                           | Typical Use Cases                          |
|-------------------------|--------------------------------------------------------|-------------------------------------------------------|---------------------------------------------|
| **Array**               | O(1) index access, cache‑friendly                      | O(n) search, fixed size (static)                      | Lookup by index, small datasets             |
| **Dynamic Array (list)**| Amortized O(1) append, cache‑friendly, flexible size   | O(n) insert/delete at arbitrary position              | General‑purpose sequence, stack             |
| **Linked List**         | O(1) insert/delete at known position                    | O(n) search, poor cache locality                       | Many inserts/deletes at already‑located nodes |
| **Stack**               | LIFO, O(1) push/pop                                    | Limited access                                        | Expression evaluation, backtracking         |
| **Queue**               | FIFO, O(1) enqueue/dequeue                             | Limited access                                        | BFS, task scheduling                         |
| **Hash Table**          | O(1) average insert/search/delete                      | O(n) worst case, no order, memory overhead            | Dictionaries, caches, set operations        |
| **Binary Search Tree**  | O(log n) search/insert/delete (if balanced)            | Unbalanced → O(n), overhead per node                   | Ordered dynamic set, range queries          |
| **Balanced BST (AVL, Red‑Black)** | Guaranteed O(log n) operations               | Complex implementation, higher constant                | When worst‑case log n needed                |
| **Heap**                | O(log n) insert/extract min/max, O(1) peek             | No search for arbitrary element                        | Priority queues, scheduling                  |
| **Segment Tree**        | O(log n) range queries/updates                         | O(n) space, static size (unless dynamic)               | Range queries (sum, min, etc.)               |
| **Fenwick Tree**        | O(log n) prefix sums, point updates, less memory       | Needs an invertible operation (e.g., sum); range sums come from two prefix queries, but range min/max is not directly supported | Frequency counting, cumulative sums          |
| **Trie**                | O(m) search/insert (m = key length)                    | Memory‑heavy (child pointers per node), especially when keys share few prefixes | Prefix matching, autocomplete, IP routing   |
| **Suffix Array / LCP**  | Powerful string queries (pattern, repeats)             | O(n log n) construction with simple methods, index several times the text size, static | Bioinformatics, text indexing                |
| **Graph (adj. list)**   | Space‑efficient for sparse graphs                       | Edge existence O(degree)                                | Most graph algorithms                        |
| **Graph (adj. matrix)** | O(1) edge existence, good for dense graphs             | O(V²) memory                                           | Small graphs, Floyd‑Warshall                 |
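| **Disjoint Set (Union‑Find)** | Near‑O(1) amortized union/find (inverse Ackermann) with path compression and union by rank | Only connectivity queries; no efficient edge deletion | Kruskal’s, incremental connectivity |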
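
To make one row of the table concrete, here is a minimal Fenwick tree sketch. The class and method names are illustrative, and the externally 0‑indexed interface is a choice made for this example. Note that a sum over an arbitrary range is obtained as the difference of two prefix sums, which works because addition is invertible.

```python
class FenwickTree:
    """Binary indexed tree over n numeric slots, 0-indexed externally."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # internal array is 1-indexed

    def add(self, i, delta):
        """Add delta to slot i in O(log n)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # jump to the next node responsible for slot i

    def prefix_sum(self, i):
        """Sum of slots 0..i inclusive in O(log n); returns 0 for i == -1."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

    def range_sum(self, lo, hi):
        """Sum of slots lo..hi inclusive, as a difference of two prefix sums."""
        return self.prefix_sum(hi) - self.prefix_sum(lo - 1)


ft = FenwickTree(5)
for index, value in enumerate([5, 3, 7, 9, 6]):
    ft.add(index, value)
```

The same trick does not carry over to min or max, since those operations have no inverse; a segment tree is the usual choice there.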

---

## 35.4 Cache‑Oblivious Algorithms

Modern computers have a hierarchy of memory (registers, L1/L2/L3 cache, RAM, disk). **Cache‑oblivious algorithms** are designed to work well without knowing the cache size or block size. They achieve this by using divide‑and‑conquer that naturally exploits spatial and temporal locality.

**Examples:**
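- **Naive matrix multiplication** (three nested loops) ignores the cache; **tiled multiplication** is cache‑aware because the tile size is tuned to a particular cache. **Cache‑oblivious** matrix multiplication recursively splits the matrices into quadrants down to a small base case; at some recursion depth the submatrices fit in each cache level, without the code ever knowing the cache size.
- **Binary search** on a large sorted array has poor cache behavior because its early probes land far apart; a **cache‑oblivious B‑tree** (built on the van Emde Boas layout) stores recursively defined subtrees contiguously to improve locality.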
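
The recursive structure of cache‑oblivious matrix multiplication can be sketched as follows. This is an illustration of the recursion only: the function name, the `base` cutoff, and the power‑of‑two size restriction are assumptions for the example, and Python lists of lists will not show the cache benefit that a contiguous layout in a lower‑level language would.

```python
def matmul_recursive(A, B, base=2):
    """Multiply square matrices by recursive quadrant splitting.

    Requires len(A) to be a power of two. `base` is only a cutoff that
    limits recursion overhead; it is not tuned to any cache size.
    """
    n = len(A)
    if n < 1 or n & (n - 1):
        raise ValueError("matrix size must be a power of two")
    C = [[0] * n for _ in range(n)]

    def rec(ai, aj, bi, bj, ci, cj, size):
        # Computes C[ci.., cj..] += A[ai.., aj..] * B[bi.., bj..] on size x size blocks.
        if size <= base:
            for i in range(size):
                for k in range(size):
                    a = A[ai + i][aj + k]
                    for j in range(size):
                        C[ci + i][cj + j] += a * B[bi + k][bj + j]
            return
        h = size // 2
        # Eight half-size products: C_ij += A_ik * B_kj over all quadrant choices.
        for di in (0, h):
            for dj in (0, h):
                for dk in (0, h):
                    rec(ai + di, aj + dk, bi + dk, bj + dj, ci + di, cj + dj, h)

    rec(0, 0, 0, 0, 0, 0, n)
    return C
```

Because the blocks shrink by half at every level, there is always a level at which the working set fits in a given cache, whatever its size.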

**Takeaway:** For performance‑critical applications, consider how your data structure interacts with the memory hierarchy. For small n, even an O(log n) pointer‑based tree can be slower than an O(n) array scan, because the array is read sequentially from cache while each step down the tree may incur a cache miss.

---

## 35.5 External Memory Algorithms

When data is too large to fit in main memory, we must design algorithms that minimize disk I/O. Disk accesses are many orders of magnitude slower than RAM accesses. The standard external memory model counts **block transfers** between disk and RAM, where N is the number of items, B is the number of items per block, and M is the number of items that fit in memory.

### 35.5.1 B‑Trees

B‑trees (and B+ trees) are the quintessential external memory data structure. They keep height low by sizing each node to fill one disk block, so a node holds Θ(B) keys and has Θ(B) children. Search, insert, and delete each require O(log_B N) disk accesses.

- Used in most relational databases and many file systems.
- B+ trees store all data in leaves and link leaves for efficient range scans.
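
A quick calculation shows why a large fan‑out matters. The helper below counts how many levels a tree needs to address a given number of keys; the fan‑out of 256 is an assumed figure for illustration, not a property of any particular database.

```python
def levels_needed(n_keys, fanout):
    """Smallest number of levels h such that fanout ** h >= n_keys."""
    levels, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


n = 10**9
binary_levels = levels_needed(n, 2)     # binary tree: one key per node
btree_levels = levels_needed(n, 256)    # assumed B-tree fan-out of 256
print(binary_levels, btree_levels)
```

If every level costs one disk access, the wide tree needs a handful of reads where the binary tree needs dozens.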

### 35.5.2 Buffer Trees

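Buffer trees are designed for batched operations (such as bulk inserts). Each internal node carries a buffer of pending updates; when a buffer fills, its contents are pushed down to the children in one batch, so a single block transfer does work for many updates. A sequence of N operations costs O((N/B) log_{M/B} (N/B)) I/Os in total, that is, O((1/B) log_{M/B} (N/B)) amortized per operation.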

### 35.5.3 External Sorting

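When sorting data that doesn't fit in memory, we use an external **k‑way merge sort**:
- Create sorted runs of size M by reading M items at a time, sorting them in memory, and writing each run back to disk.
- Merge up to k ≈ M/B runs at once using a heap, holding one block from each run in memory; repeat passes until a single run remains.
- I/O complexity: O((N/B) log_{M/B} (N/B)).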
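
The two phases can be sketched in memory, with lists standing in for on‑disk runs. The function name and the `run_size` parameter (playing the role of M) are assumptions for this example; a real implementation would write each run to a file and read it back one block at a time.

```python
import heapq


def external_sort_sketch(data, run_size):
    """Sort `data` via sorted runs of at most `run_size` items plus a k-way merge."""
    # Phase 1: each slice stands in for a chunk that fits in memory.
    runs = [sorted(data[i:i + run_size]) for i in range(0, len(data), run_size)]
    # Phase 2: heapq.merge performs a heap-based k-way merge,
    # pulling lazily from each run rather than loading them all at once.
    return list(heapq.merge(*runs))


values = [9, 4, 7, 1, 8, 2, 6, 3, 5, 0]
print(external_sort_sketch(values, run_size=3))
```

Since `heapq.merge` accepts any sorted iterables, the same call works unchanged if each run is a generator that reads from a file.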

### 35.5.4 LSM Trees (Log‑Structured Merge Trees)

LSM trees are used in modern key‑value stores (LevelDB, RocksDB). They consist of a memory‑resident table (memtable) and a series of disk‑resident sorted tables (SSTables). Writes go to the memtable; when it fills, it is flushed to disk as a new immutable SSTable. Reads check the memtable and then the SSTables from newest to oldest (using Bloom filters to skip tables that cannot contain the key). Compactions periodically merge SSTables to discard overwritten entries and reclaim space.

**Trade‑offs:** High write throughput because writes are sequential, at the cost of read amplification (a lookup may need to check multiple tables or levels) and write amplification (compaction rewrites data).
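
The write and read paths can be illustrated with a toy in‑memory model. `TinyLSM` and its `memtable_limit` parameter are hypothetical names for this sketch; it omits Bloom filters, deletion markers, leveling, and durability, and keeps the "SSTables" as sorted lists rather than files.

```python
import bisect


class TinyLSM:
    """Toy LSM model: a dict memtable plus sorted (key, value) lists as SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # oldest first, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable as a sorted, immutable table.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest table first, so the most recent write wins.
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self):
        # Merge oldest to newest so later values overwrite earlier ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed
db.put("a", 3)   # newer value for "a" now shadows the flushed one
```

Reads that miss the memtable may scan several tables before finding a key, which is the read amplification that compaction is meant to keep in check.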

---

## 35.6 Decision Framework

When faced with a problem, follow these steps:

1. **Understand the operations:**
   - List all operations needed (insert, delete, search, min/max, range query, etc.).
   - Identify which operations are most frequent.
   - Determine if the data is static or dynamic.

2. **Consider the environment:**
   - In‑memory or on‑disk?
   - Multi‑threaded?
   - Real‑time constraints?

3. **Evaluate candidate structures:**
   - Use the table above to shortlist structures that support the required operations efficiently.
   - Compare asymptotic complexities (worst‑case, average, amortized).
   - Factor in practical considerations (cache, overhead, ease of implementation).

4. **Prototype and measure if possible:**
   - For performance‑critical applications, benchmark with realistic data.
   - Sometimes the simplest structure is “good enough” and much easier to maintain.

5. **Plan for growth:**
   - Will the data volume increase? Will new operations be added?
   - Choose a structure that scales gracefully or can be migrated.
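
Step 4 does not need heavy tooling. As a rough sketch, the standard `timeit` module can compare two candidates on the operation that matters; here the hypothetical helper `compare_membership` times a membership test on a list against the same test on a set. Absolute numbers depend on the machine, so none are quoted.

```python
import timeit


def compare_membership(n=10_000, repeats=200):
    """Time `x in container` for a list and a set holding the same n integers."""
    items_list = list(range(n))
    items_set = set(items_list)
    target = n - 1  # last element: worst case for the linear list scan
    t_list = timeit.timeit(lambda: target in items_list, number=repeats)
    t_set = timeit.timeit(lambda: target in items_set, number=repeats)
    return {"list": t_list, "set": t_set}


timings = compare_membership()
print(timings)
```

Measurements like this are only as good as the workload they mimic, so the data sizes and access pattern should resemble production as closely as is practical.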

### Example: Implementing a Cache

- **Operations:** insert (key, value), lookup (key), delete (key) – all fast.
- **Requirement:** Must evict least‑recently‑used (LRU) item when full.
- **Candidate structures:** Hash table + doubly linked list.
  - Hash table gives O(1) lookup.
  - Linked list maintains order of use; moving an item to front is O(1) if we have the node pointer (stored in hash table).
- **Why not a balanced tree?** Each operation would cost O(log n) instead of expected O(1), and ordering by access time would mean re‑keying an entry on every access.

Thus the classic combination: hash map + doubly linked list.
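
One possible way to put the two structures together is sketched below. The class and method names are illustrative, and the use of sentinel head and tail nodes is a design choice made here to avoid special cases for an empty list.

```python
class _Node:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None


class LRUCache:
    """Hash map for O(1) lookup plus a doubly linked list ordered by recency."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.index = {}        # key -> node
        self.head = _Node()    # sentinel on the most-recently-used side
        self.tail = _Node()    # sentinel on the least-recently-used side
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.index.get(key)
        if node is None:
            return None
        # Touching an entry makes it the most recently used.
        self._unlink(node)
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.index.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
        else:
            if len(self.index) >= self.capacity:
                lru = self.tail.prev  # least recently used entry
                self._unlink(lru)
                del self.index[lru.key]
            node = _Node(key, value)
            self.index[key] = node
        self._push_front(node)
```

Storing the key inside each node is what lets eviction remove the matching hash map entry in O(1). In Python specifically, `collections.OrderedDict` offers the same behavior with less code.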

---

## 35.7 Summary

```
┌─────────────────────────────────────────────────────────────────────┐
│                    CHOOSING THE RIGHT DATA STRUCTURE                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Key Criteria:                                                      │
│    • Time complexity (worst, average, amortized)                    │
│    • Space complexity                                                │
│    • Cache locality                                                  │
│    • Code complexity and maintainability                            │
│    • Concurrency requirements                                        │
│    • External memory constraints                                     │
│                                                                      │
│  Common Structures and Their Sweet Spots:                           │
│    • Array: index access, fixed size                                 │
│    • Dynamic array: flexible sequence, stack                        │
│    • Linked list: frequent insert/delete at known position           │
│    • Hash table: fast key‑value lookup                               │
│    • Balanced BST: ordered operations                                │
│    • Heap: priority queue                                            │
│    • Segment/Fenwick tree: range queries                             │
│    • Trie: string prefixes                                           │
│    • B‑tree: external memory                                         │
│    • LSM tree: high write throughput                                 │
│                                                                      │
│  Decision Process:                                                   │
│    1. List required operations                                       │
│    2. Identify frequency                                             │
│    3. Consider environment (RAM/disk, concurrency)                   │
│    4. Compare candidate structures                                   │
│    5. Prototype if needed                                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 35.8 Practice Problems

These problems are less about implementing a single structure and more about reasoning about the best choice.

1. **Design a phone book** that supports lookup by name, insertion, deletion, and listing all names in alphabetical order. What data structure(s) would you use?

2. **Implement an LRU cache** (LeetCode 146) – discuss why the chosen combination works.

3. **Design a system** that records and queries the number of visits to a website per minute over the last hour. You need to support updating the count for the current minute and querying the total for the last hour. (Hint: circular buffer or deque.)

4. **Design an autocomplete system** for a search engine. What data structure would you use for storing the dictionary and for retrieving suggestions based on prefix?

5. **External memory sorting:** You have a file of 1 billion integers (4 bytes each, about 4 GB) and 1 GB of RAM. Describe how you would sort the file efficiently.

6. **Database indexing:** Why do most relational databases use B+ trees rather than hash tables for indexing?

7. **Social network friends list:** You need to store each user's friends and support operations like "add friend", "remove friend", "check if two users are friends", and "list all friends of a user". Which structure?

8. **Multi‑level feedback queue** for CPU scheduling: which data structure(s) would you use?

9. **Design a spell checker** that can quickly check if a word is in a dictionary and suggest corrections (edit distance). What data structures and algorithms would you combine?

10. **Real‑time stock price feed:** You receive millions of price updates per second and need to answer queries for the current price of any stock, as well as the top 10 highest prices. What structures?

---

## 35.9 Further Reading

1. **"Introduction to Algorithms" (CLRS)** – Chapters on data structures throughout.
2. **"The Algorithm Design Manual"** by Steven Skiena – Has a catalog of data structures.
3. **"Data Structures and Algorithms in Python"** by Goodrich, Tamassia, Goldwasser – Practical coverage.
4. **"Database System Concepts"** by Silberschatz, Korth, Sudarshan – For B‑trees and external storage.
5. **"Designing Data‑Intensive Applications"** by Martin Kleppmann – Excellent for real‑world storage systems (LSM trees, etc.).
6. **"Cache‑Oblivious Algorithms"** by Frigo, Leiserson, Prokop, and Ramachandran (FOCS 1999) – The paper that introduced the model.

---

> **Coming in Chapter 36**: **DSA in Production Systems** – We'll look at practical applications like database indexing, caching, rate limiting, and probabilistic data structures.

---

**End of Chapter 35**