# Phase 4: Data Structures & Modeling 🏗️

> **Goal:** Store reality correctly.

If your data model is bad, no algorithm can save you.

## 1. Choosing the Right Tool

- **List:** Ordered, mutable, duplicates allowed. (e.g., Todo list)
- **Set:** Unordered, no duplicates. Fast membership checks. (e.g., Admin ID whitelist)
- **Dict:** Key-value pairs. Fast lookups by key. (e.g., User profile)
- **Tuple:** Ordered and immutable. (e.g., GPS coordinates)
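
As a quick illustration of the four choices above, here is a minimal sketch. The variable names and sample values are invented for this example.

```python
# Hypothetical sample data, one value per built-in structure.
todo_list = ["write report", "send email", "write report"]  # list: keeps order and duplicates
admin_ids = {101, 102, 103}                                  # set: unique members, fast membership test
user_profile = {"name": "Alice", "age": 30}                  # dict: values looked up by key
gps_point = (40.7128, -74.0060)                              # tuple: fixed, immutable record

print(todo_list.count("write report"))  # duplicates are preserved in a list
print(102 in admin_ids)                 # membership check against the set
print(user_profile["name"])             # lookup by key

try:
    gps_point[0] = 0.0                  # tuples cannot be modified in place
except TypeError as exc:
    print(f"Tuple is immutable: {exc}")
```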

In [None]:
import time

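# Set vs. list lookup speed
size = 1_000_000
haystack_list = list(range(size))
haystack_set = set(range(size))
needle = size - 1  # Last element: worst case for the list scan

start = time.perf_counter()  # perf_counter is the clock meant for short timings
found_in_list = needle in haystack_list  # O(n): scans the list until it finds a match
print(f"List lookup: {time.perf_counter() - start:.6f}s")

start = time.perf_counter()
found_in_set = needle in haystack_set  # O(1) on average: a single hash lookup
print(f"Set lookup:  {time.perf_counter() - start:.6f}s")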

# Conclusion: Use sets for membership checks!
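
A single timed lookup can be noisy, so the comparison is more trustworthy when each check is repeated. The sketch below uses the standard-library `timeit` module for that. The smaller `size` and the `runs` count are arbitrary choices made here to keep the cell quick, not values from the original benchmark.

```python
import timeit

size = 100_000  # assumption: smaller than the original so repeated runs stay fast
haystack_list = list(range(size))
haystack_set = set(range(size))
needle = size - 1  # last element: worst case for the list scan

runs = 100  # assumption: number of repetitions per structure

# timeit.timeit calls the function `number` times and returns the total seconds.
list_time = timeit.timeit(lambda: needle in haystack_list, number=runs)
set_time = timeit.timeit(lambda: needle in haystack_set, number=runs)

print(f"List lookup: {list_time / runs:.8f}s per check")
print(f"Set lookup:  {set_time / runs:.8f}s per check")
```

The absolute numbers depend on the machine, but the gap between the two should grow as `size` grows.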

## 2. Formatting & Normalization

Never trust input data. Normalize it immediately.

In [None]:
raw_emails = ["  Bob@Example.com", "alice@example.com ", "BOB@example.com"]

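# Bad way: without normalization, all three strings count as different emails
# set(raw_emails) -> {'  Bob@Example.com', ...}

# Engineering way: normalize first (trim whitespace, lowercase)
cleaned_emails = {email.strip().lower() for email in raw_emails}
print(cleaned_emails)  # Two unique emails remain (set order may vary)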
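
If the same rule is needed in more than one place, it may help to wrap it in a function so every caller normalizes the same way. The sketch below is one possible shape. `normalize_email` and `dedupe_emails` are hypothetical helper names, and keeping first-seen order is an extra choice made here, since a plain set does not preserve order.

```python
def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lowercase the address (hypothetical helper)."""
    return raw.strip().lower()


def dedupe_emails(raw_emails):
    """Return unique normalized emails in first-seen order (hypothetical helper)."""
    seen = set()   # fast membership check for emails already emitted
    unique = []    # preserves the order in which emails first appear
    for raw in raw_emails:
        email = normalize_email(raw)
        if email not in seen:
            seen.add(email)
            unique.append(email)
    return unique


raw_emails = ["  Bob@Example.com", "alice@example.com ", "BOB@example.com"]
print(dedupe_emails(raw_emails))  # ['bob@example.com', 'alice@example.com']
```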

## 3. Modeling Reality (Dictionaries)

Don't use parallel lists. Use dictionaries or classes.

In [None]:
# BAD ❌
names = ["Alice", "Bob"]
scores = [85, 92]

# GOOD ✅
students = [
    {"name": "Alice", "score": 85, "active": True},
    {"name": "Bob", "score": 92, "active": False}
]

# Now I can process a single 'student' object without syncing indices.
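
The section mentions classes as an alternative to dictionaries. A minimal sketch of that option, assuming a standard-library `dataclass` is acceptable, is shown below. The `Student` class and the two derived values are illustrative, not part of the original notebook.

```python
from dataclasses import dataclass


@dataclass
class Student:
    """One student record; fields mirror the dictionary keys used above."""
    name: str
    score: int
    active: bool


students = [
    Student(name="Alice", score=85, active=True),
    Student(name="Bob", score=92, active=False),
]

# Each record carries its own fields, so there are no indices to keep in sync.
active_names = [s.name for s in students if s.active]
average_score = sum(s.score for s in students) / len(students)

print(active_names)   # ['Alice']
print(average_score)  # 88.5
```

Compared with plain dictionaries, a dataclass catches a misspelled field name as an `AttributeError` instead of a silent `KeyError` at some later lookup, at the cost of declaring the fields up front.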