# Chapter 12: File I/O and Data Persistence

Most applications must persist data beyond the runtime of the program, whether writing configuration files, processing log data, or serializing objects for network transmission. Python provides a rich ecosystem for file handling, from low-level byte streams to high-level path abstractions and structured data formats.

This chapter examines the context management protocol that ensures resources are properly released, modern file handling techniques using `pathlib`, and serialization strategies for JSON, CSV, and binary formats. We emphasize contemporary best practices—using `pathlib` over string manipulation, explicit encoding declarations, and understanding the security implications of deserialization.

## 12.1 Context Managers: Resource Management with `with`

Resource management is error-prone. Files left open after an exception, locks still held after a crash, and leaked database connections all destabilize a system. **Context managers** solve this through the `with` statement, which guarantees that cleanup code runs whether the block succeeds or fails.

### The Context Management Protocol

Context managers implement two methods:
*   `__enter__()`: Called when entering the `with` block; returns the resource
*   `__exit__(exc_type, exc_val, exc_tb)`: Called when exiting; receives exception info if one occurred

```python
from typing import Optional, Type, Self  # Self requires Python 3.11+

class DatabaseConnection:
    """
    Custom context manager demonstrating the protocol.
    
    Ensures connections are closed even if queries fail.
    """
    
    def __init__(self, connection_string: str, timeout: int = 30) -> None:
        self.connection_string: str = connection_string
        self.timeout: int = timeout
        self.connection: Optional[object] = None
        self.is_connected: bool = False
    
    def __enter__(self) -> Self:
        """Acquire resource and return it for use."""
        print(f"Connecting to {self.connection_string}...")
        # Simulate connection establishment
        self.connection = object()  # Stand-in for actual connection
        self.is_connected = True
        print("Connection established")
        return self  # This becomes the 'as' variable
    
    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[object]
    ) -> Optional[bool]:
        """
        Release resource and handle exceptions if needed.
        
        Args:
            exc_type: Type of exception (if any)
            exc_val: Exception instance (if any)
            exc_tb: Traceback object (if any)
            
        Returns:
            True if exception was handled, False/None to propagate
        """
        if self.connection:
            print("Closing connection...")
            self.connection = None
            self.is_connected = False
        
        # Log exception if one occurred
        if exc_type:
            print(f"Exception during context: {exc_val}")
        # Return False so that any exception propagates to the caller
        return False

# Usage
def query_database() -> None:
    with DatabaseConnection("postgresql://localhost/mydb") as conn:
        print(f"Executing query on {conn.connection_string}")
        # If exception occurs here, __exit__ still runs
        raise RuntimeError("Query failed!")  # Connection still closed

# Output:
# Connecting to postgresql://localhost/mydb...
# Connection established
# Executing query on postgresql://localhost/mydb
# Closing connection...
# Exception during context: Query failed!
# (Exception propagates)
```
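
The return value of `__exit__` decides whether an exception escapes the `with` block. As a minimal sketch (the class name `IgnoreErrors` is hypothetical, introduced here only for illustration), returning `True` for one chosen exception type swallows it, while any other exception still propagates:

```python
from types import TracebackType
from typing import Optional, Type


class IgnoreErrors:
    """Hypothetical context manager that swallows one exception type."""

    def __init__(self, exc_class: Type[BaseException]) -> None:
        self.exc_class = exc_class
        self.caught: Optional[BaseException] = None

    def __enter__(self) -> "IgnoreErrors":
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> bool:
        # A truthy return value tells Python the exception was handled.
        if exc_type is not None and issubclass(exc_type, self.exc_class):
            self.caught = exc_val
            return True
        return False  # No exception, or one we do not handle: propagate


with IgnoreErrors(ZeroDivisionError) as guard:
    1 / 0  # Raised here, suppressed by __exit__

print(type(guard.caught).__name__)  # ZeroDivisionError
```

Suppressing exceptions this way should be rare and narrowly scoped; for the simple case, `contextlib.suppress` already provides the same behavior.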

### The `contextlib` Module

Writing classes for simple cleanup is verbose. The `contextlib` module provides utilities for creating context managers from functions.

#### `@contextmanager` Decorator

Transforms a generator function into a context manager:

```python
from contextlib import contextmanager
from typing import Generator
import tempfile
import os

@contextmanager
def temporary_directory() -> Generator[str, None, None]:
    """
    Create temporary directory that's cleaned up automatically.
    
    Yields the path, then cleanup runs after the block.
    """
    temp_dir: str = tempfile.mkdtemp()
    print(f"Created temp directory: {temp_dir}")
    
    try:
        yield temp_dir  # Value bound to 'as' variable
    finally:
        # Always executes, even if exception occurred
        import shutil
        shutil.rmtree(temp_dir)
        print(f"Cleaned up: {temp_dir}")

# Usage
with temporary_directory() as tmpdir:
    path: str = os.path.join(tmpdir, "data.txt")
    with open(path, 'w') as f:
        f.write("Temporary data")
    # Directory and contents deleted automatically
```
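
For this particular task the standard library already ships an equivalent: `tempfile.TemporaryDirectory` is itself a context manager. A short sketch (the file name `data.txt` is arbitrary):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    data_file = Path(tmpdir) / "data.txt"
    data_file.write_text("Temporary data", encoding="utf-8")
    existed_inside = data_file.exists()

# The directory and its contents are removed when the block exits
existed_after = Path(tmpdir).exists()
print(existed_inside, existed_after)  # True False
```

Writing your own generator-based manager remains useful when the setup or teardown steps are specific to your application.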

#### `contextlib.closing`

Ensures that an object with a `close()` method is closed when the block exits:

```python
from contextlib import closing
from urllib.request import urlopen

def fetch_data(url: str) -> bytes:
    """Ensure connection closes even if parsing fails."""
    with closing(urlopen(url)) as response:
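        # closing() guarantees response.close() runs on exit. Responses from
        # urlopen() already support 'with' in Python 3; closing() matters most
        # for objects that offer close() but no __enter__/__exit__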
        return response.read()
```

#### `contextlib.ExitStack`

Manage multiple context managers dynamically, especially useful when the number of resources isn't known until runtime:

```python
from contextlib import ExitStack
from typing import List, TextIO

def process_line(line: str) -> None: ...  # Placeholder for real processing

def process_multiple_files(filenames: List[str]) -> None:
    """Open variable number of files safely."""
    with ExitStack() as stack:
        files: List[TextIO] = [
            stack.enter_context(open(fname, 'r', encoding='utf-8'))
            for fname in filenames
        ]
        
        # All files open; process them
        for f in files:
            process_line(f.readline())
        
        # All files closed automatically, even if exception occurs

# Alternative: Nested with statements (clumsy for many files)
# with open('a.txt') as f1:
#     with open('b.txt') as f2:
#         with open('c.txt') as f3:
#             ...
```

#### `contextlib.suppress`

Ignore specific exceptions cleanly:

```python
from contextlib import suppress
import os

def remove_if_exists(filepath: str) -> None:
    """Remove file without checking existence first."""
    # Instead of:
    # if os.path.exists(filepath):
    #     os.remove(filepath)
    
    # Cleaner approach:
    with suppress(FileNotFoundError):
        os.remove(filepath)
```

#### `contextlib.redirect_stdout/stderr`

Capture or redirect output streams:

```python
from contextlib import redirect_stdout
import io

def capture_output(func) -> str:
    """Capture print statements from function."""
    buffer: io.StringIO = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()
```

### Async Context Managers (Python 3.5+)

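For asynchronous resources, implement `__aenter__` and `__aexit__` and enter the block with `async with`. The `@asynccontextmanager` decorator (Python 3.7+) generates both methods from an async generator: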

```python
from contextlib import asynccontextmanager
from typing import AsyncGenerator

class AsyncDatabase:
    async def connect(self) -> None: ...
    async def disconnect(self) -> None: ...
    async def query(self, sql: str) -> None: ...  # Placeholder used below

@asynccontextmanager
async def get_db_connection() -> AsyncGenerator[AsyncDatabase, None]:
    """Async context manager for database connections."""
    db: AsyncDatabase = AsyncDatabase()
    await db.connect()
    try:
        yield db
    finally:
        await db.disconnect()

# Usage
async def query() -> None:
    async with get_db_connection() as db:
        await db.query("SELECT * FROM users")
```
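
The same protocol can be implemented directly on a class. The sketch below is illustrative only: `AsyncResource` is a hypothetical name, and `asyncio.sleep(0)` stands in for real awaitable setup and teardown work.

```python
import asyncio
from typing import List


class AsyncResource:
    """Hypothetical async resource that records its lifecycle events."""

    def __init__(self) -> None:
        self.events: List[str] = []

    async def __aenter__(self) -> "AsyncResource":
        await asyncio.sleep(0)  # Stand-in for an awaitable connect step
        self.events.append("enter")
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb) -> bool:
        await asyncio.sleep(0)  # Stand-in for an awaitable cleanup step
        self.events.append("exit")
        return False  # Do not suppress exceptions


async def main() -> List[str]:
    async with AsyncResource() as resource:
        resource.events.append("body")
    return resource.events


events = asyncio.run(main())
print(events)  # ['enter', 'body', 'exit']
```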

## 12.2 File Handling: Text and Binary Modes

Python distinguishes between **text mode**, which decodes bytes to `str` using a character encoding, and **binary mode**, which reads and writes raw `bytes`. Understanding this distinction prevents common encoding errors and data corruption.
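
A small round trip makes the difference concrete. This sketch writes a four-character string in text mode and reads it back both ways; the file name `menu.txt` is arbitrary.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "menu.txt"

    # Text mode: str in, encoded to bytes on disk
    with open(path, "w", encoding="utf-8") as f:
        f.write("café")

    # Binary mode: the raw bytes, exactly as stored
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: bytes decoded back to str
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(len(text), len(raw))  # 4 5
print(raw)                  # b'caf\xc3\xa9'
```

The string has four characters but occupies five bytes, because UTF-8 encodes `é` as two bytes. Decoding those bytes with a different encoding would silently produce different text.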

### Opening Files

```python
from typing import TextIO, BinaryIO

# Text mode (default): reads return str, decoded with the given encoding
file_text: TextIO = open('data.txt', mode='r', encoding='utf-8')

# Binary mode: reads return bytes, no decoding
file_binary: BinaryIO = open('image.png', mode='rb')

# Context manager ensures closure
with open('data.txt', 'r', encoding='utf-8') as f:
    content: str = f.read()
```

**Mode Characters:**
*   `r` - Read (default)
*   `w` - Write (truncate if exists)
*   `x` - Exclusive creation (fail if exists)
*   `a` - Append (write to end)
*   `b` - Binary mode
*   `t` - Text mode (default)
*   `+` - Read and write
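
The modes that most often surprise people are `x`, `a`, and `w`. The following sketch exercises all three on a throwaway file (the name `notes.txt` is arbitrary):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "notes.txt"

    with open(path, "x", encoding="utf-8") as f:   # Create; file must not exist
        f.write("first\n")

    try:
        open(path, "x", encoding="utf-8")          # Second attempt fails
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    with open(path, "a", encoding="utf-8") as f:   # Append keeps old content
        f.write("second\n")
    after_append = path.read_text(encoding="utf-8")

    with open(path, "w", encoding="utf-8") as f:   # Write truncates first
        f.write("third\n")
    after_write = path.read_text(encoding="utf-8")

print(exclusive_failed)      # True
print(repr(after_append))    # 'first\nsecond\n'
print(repr(after_write))     # 'third\n'
```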

### Reading Strategies

```python
from typing import Iterator

def read_strategies(filepath: str) -> None:
    """Demonstrate different reading approaches."""
    
    # Strategy 1: Read entire file (small files only)
    with open(filepath, 'r', encoding='utf-8') as f:
        content: str = f.read()
        # Memory = size of file
    
    # Strategy 2: Read line by line (memory efficient)
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:  # f is iterable, yields lines
            process_line(line.strip())
    
    # Strategy 3: Read fixed chunks (binary processing)
    with open(filepath, 'rb') as f:
        while chunk := f.read(8192):  # 8KB chunks
            process_chunk(chunk)
    
    # Strategy 4: Read all lines into list
    with open(filepath, 'r', encoding='utf-8') as f:
        lines: list[str] = f.readlines()  # Includes newlines
    
    # Strategy 5: Line iterator with enumerate
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            if 'ERROR' in line:
                print(f"Line {line_num}: {line.strip()}")

def process_line(line: str) -> None: ...
def process_chunk(chunk: bytes) -> None: ...
```

**Best Practice:** Iterate over the file object directly (`for line in f`) rather than `f.readlines()`, as the former is memory-efficient (streams one line at a time) while the latter loads the entire file into memory.
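
One way to package the streaming approach is a generator that yields only the lines of interest. This is a sketch; the function name `matching_lines` and the sample log contents are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def matching_lines(filepath: Path, needle: str) -> Iterator[str]:
    """Yield lines containing `needle`, reading one line at a time."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            if needle in line:
                yield line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmpdir:
    log = Path(tmpdir) / "app.log"
    log.write_text(
        "INFO start\nERROR disk full\nINFO done\nERROR timeout\n",
        encoding="utf-8",
    )
    errors = list(matching_lines(log, "ERROR"))

print(errors)  # ['ERROR disk full', 'ERROR timeout']
```

Because the generator holds only the current line, memory use stays roughly constant regardless of file size; only the final `list()` call materializes the matches.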

### Writing Files

```python
from datetime import datetime

def write_data(filepath: str, data: list[dict]) -> None:
    """Write data with proper encoding and newline handling."""
    
    # Text mode with explicit encoding (always specify encoding!)
    with open(filepath, 'w', encoding='utf-8', newline='') as f:
        # newline='' writes '\n' as-is instead of translating it to the
        # platform line separator (required when the csv module writes the file)
        
        for item in data:
            f.write(f"{item['name']}: {item['value']}\n")
    
    # Binary mode (write bytes)
    with open('output.bin', 'wb') as f:
        f.write(b'\x00\x01\x02\x03')  # Raw bytes
        
        # Convert string to bytes
        text: str = "Hello"
        f.write(text.encode('utf-8'))

    # Append mode
    with open('log.txt', 'a', encoding='utf-8') as f:
        f.write(f"{datetime.now()}: Event occurred\n")
```

### File Positioning and Seeking

```python
def random_access(filepath: str) -> None:
    """Read specific portions of file."""
    with open(filepath, 'r+b') as f:  # Read + write binary
        # Get current position
        pos: int = f.tell()
        
        # Seek to specific position
        f.seek(0)           # Beginning
        f.seek(0, 2)        # End (0 offset from end)
        f.seek(-10, 2)      # 10 bytes from end
        f.seek(100)         # Absolute position
        
        # Read/write at position
        f.seek(50)
        data: bytes = f.read(10)
        f.seek(50)
        f.write(b'NEW_DATA')
```
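
A common use of seeking is reading only the end of a file. The helper below is a sketch under simple assumptions (the name `read_tail` is hypothetical, and the file is treated as raw bytes):

```python
import os
import tempfile
from pathlib import Path


def read_tail(filepath: Path, num_bytes: int) -> bytes:
    """Return up to the last `num_bytes` bytes of a file."""
    with open(filepath, "rb") as f:
        f.seek(0, os.SEEK_END)            # Jump to the end to learn the size
        size = f.tell()
        f.seek(max(size - num_bytes, 0))  # Never seek before the start
        return f.read()


with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "data.bin"
    path.write_bytes(b"0123456789")
    tail = read_tail(path, 4)
    whole = read_tail(path, 100)

print(tail)   # b'6789'
print(whole)  # b'0123456789'
```

The named constants `os.SEEK_SET`, `os.SEEK_CUR`, and `os.SEEK_END` are equivalent to the numeric `whence` values 0, 1, and 2 and are easier to read.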

### Memory-Mapped Files (Advanced)

For large files requiring random access without loading entirely into memory:

```python
import mmap

def process_large_file(filepath: str) -> None:
    """Memory-map file for efficient random access."""
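    with open(filepath, 'rb') as f:
        # Memory-map entire file (length 0 = whole file; empty files raise ValueError)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm supports bytes-like operations such as find() and slicing
            if mm.find(b'search_term') != -1:
                print("Found!")
            
            # Slicing copies the selected bytes into a new bytes object;
            # wrap mm in memoryview() when a zero-copy view is needed
            header: bytes = mm[:100]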
```

## 12.3 Modern Path Handling with `pathlib`

The `pathlib` module (Python 3.4+) provides an object-oriented interface for filesystem paths, replacing the fragmented `os.path` functions with intuitive methods and proper operator overloading.

### Path Types

```python
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath

# Concrete path (interacts with actual filesystem)
# Path() instantiates PosixPath or WindowsPath to match the running platform
p: Path = Path('/usr/bin/python')      # Typical Unix path
p_win: Path = Path(r'C:\Users\name')   # Typical Windows path

# Pure paths (path manipulation without filesystem access)
pure: PurePath = PurePosixPath('/etc/hosts')
```
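
Because pure paths never touch the filesystem, they can manipulate paths for a platform other than the one the code runs on. A brief sketch with made-up paths:

```python
from pathlib import PurePosixPath, PureWindowsPath

win = PureWindowsPath(r"C:\Users\name\report.txt")
posix = PurePosixPath("/etc/hosts")

print(win.name)      # report.txt
print(win.drive)     # C:
print(win.parts)     # ('C:\\', 'Users', 'name', 'report.txt')
print(posix.parent)  # /etc

# Windows paths compare case-insensitively; POSIX paths do not
print(PureWindowsPath("C:/Users") == PureWindowsPath("c:/users"))  # True
print(PurePosixPath("/Etc") == PurePosixPath("/etc"))              # False
```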

### Path Construction and Manipulation

```python
from pathlib import Path

# Creating paths
current: Path = Path.cwd()          # Current working directory
home: Path = Path.home()            # User's home directory
file_path: Path = Path('data', 'subfolder', 'file.txt')  # Joining

# Path joining with / operator (clean and intuitive)
base: Path = Path('/home/user')
config: Path = base / 'config' / 'app.ini'  # /home/user/config/app.ini

# Path components
path: Path = Path('/usr/local/bin/python3')
print(path.name)        # python3 (filename)
print(path.suffix)      # '' (empty: 'python3' contains no dot)
print(path.suffixes)    # [] here; ['.tar', '.gz'] for 'archive.tar.gz'
print(path.stem)        # python3 (filename without suffix)
print(path.parent)      # /usr/local/bin (immediate parent)
print(path.parents)     # Sequence of all ancestors
print(path.parts)       # ('/', 'usr', 'local', 'bin', 'python3')
print(path.anchor)      # / (root) or C:\ on Windows
```
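
The suffix-related attributes are easier to see on a name that actually contains dots. This sketch uses a hypothetical archive path:

```python
from pathlib import PurePosixPath

archive = PurePosixPath("/backups/site.tar.gz")

print(archive.suffix)                # .gz
print(archive.suffixes)              # ['.tar', '.gz']
print(archive.stem)                  # site.tar
print(archive.with_suffix(".zip"))   # /backups/site.tar.zip
print(archive.with_name("old.txt"))  # /backups/old.txt

# A name without a dot has an empty suffix
print(PurePosixPath("/usr/local/bin/python3").suffix == "")  # True
```

Note that `with_suffix()` replaces only the last suffix, so a multi-part extension such as `.tar.gz` needs extra handling.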

### File Operations

```python
from pathlib import Path
import shutil

def path_operations() -> None:
    """Modern file operations with pathlib."""
    src: Path = Path('source.txt')
    dst: Path = Path('backup', 'source.txt')
    
    # Existence and type checks
    if src.exists():
        print("File exists")
    if src.is_file():
        print("Is a file")
    if src.is_dir():
        print("Is directory")
    if src.is_symlink():
        print("Is symlink")
    
    # Metadata
    size: int = src.stat().st_size      # Bytes
    mtime: float = src.stat().st_mtime  # Modification time
    
    # Reading (convenient methods)
    text: str = src.read_text(encoding='utf-8')
    data: bytes = src.read_bytes()
    
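    # NOTE: the calls below are independent illustrations, not a sequence;
    # for example, after src.rename(dst) the source path no longer exists.

    # Writing (the parent directory must already exist)
    dst.write_text("Content", encoding='utf-8')
    dst.write_bytes(b'\x00\x01')
    
    # Copying (shutil accepts Path objects)
    shutil.copy(src, dst)
    
    # Moving/Renaming
    src.rename(dst)   # On Windows, raises FileExistsError if dst exists
    src.replace(dst)  # Overwrites an existing dst on every platform
    
    # Deleting
    src.unlink()                 # Raises FileNotFoundError if missing
    src.unlink(missing_ok=True)  # Python 3.8+: ignore a missing file
    
    # Create directory
    dst.parent.mkdir(parents=True, exist_ok=True)  # Like mkdir -p
    
    # Remove directory
    dst.parent.rmdir()         # Must be empty
    shutil.rmtree(dst.parent)  # Recursive delete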

def find_files(directory: Path) -> None:
    """Glob patterns for file discovery."""
    # Glob patterns
    py_files: list[Path] = list(directory.glob('*.py'))  # Immediate children
    all_py: list[Path] = list(directory.rglob('*.py'))   # Recursive
    
    # Pattern matching
    for file in directory.iterdir():  # Like os.listdir but yields Paths
        if file.match('test_*.py'):
            print(file)
    
    # Specific glob with **
    for log in directory.glob('**/*.log'):  # Recursive
        process_log(log)

def process_log(path: Path) -> None: ...  # Placeholder for real processing
```
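
To see how `glob()` and `rglob()` differ, it helps to run them against a known directory tree. The sketch below builds a tiny tree in a temporary directory; all file names are invented for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    (root / "pkg").mkdir()
    (root / "main.py").touch()
    (root / "notes.txt").touch()
    (root / "pkg" / "test_util.py").touch()

    # glob() matches only direct children for a pattern without '**'
    top_level = sorted(p.name for p in root.glob("*.py"))
    # rglob() descends into subdirectories
    recursive = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))

print(top_level)  # ['main.py']
print(recursive)  # ['main.py', 'pkg/test_util.py']
```

Both methods return lazy iterators in arbitrary order, which is why the results are sorted before printing.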

### Path Comparison and Normalization

```python
from pathlib import Path

def path_equality() -> None:
    """Path comparison and resolution."""
    p1: Path = Path('/usr/bin')
    p2: Path = Path('/usr') / 'bin'
    
    # Comparison
    print(p1 == p2)  # True (compares normalized paths)
    
    # Absolute vs relative
    relative: Path = Path('data/file.txt')
    resolved: Path = relative.resolve()    # Absolute, symlinks and '..' resolved
    absolute: Path = relative.absolute()   # Absolute, but not normalized
    
    # Normalization
    messy: Path = Path('/usr//local/../bin/./python')
    clean: Path = messy.resolve()  # /usr/bin/python (if no symlinks are involved)
    
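    # Relative paths
    base: Path = Path('/home/user')
    to_path: Path = Path('/home/user/data/file.txt')
    rel: Path = to_path.relative_to(base)  # data/file.txt
    # relative_to() raises ValueError when the target is not inside the base.
    # For results such as '../data/file.txt', pass walk_up=True (Python 3.12+)
    # or use os.path.relpath().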
```
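
`relative_to()` only strips a leading prefix, so it fails for sibling directories. The sketch below contrasts it with `posixpath.relpath` (the POSIX flavor of `os.path.relpath`), which does insert `..` steps; on Python 3.12 and later, `relative_to(..., walk_up=True)` offers similar behavior. Pure paths and `posixpath` are used so the result is the same on every platform.

```python
import posixpath
from pathlib import PurePosixPath

project = PurePosixPath("/home/user/projects")
inside = PurePosixPath("/home/user/projects/app/main.py")
outside = PurePosixPath("/home/user/data/file.txt")

inside_rel = inside.relative_to(project)
print(inside_rel)  # app/main.py

try:
    outside.relative_to(project)
    raised = False
except ValueError:
    raised = True  # 'outside' does not start with 'project'
print(raised)  # True

# relpath works purely on the strings, without touching the filesystem
sibling_rel = posixpath.relpath(str(outside), str(project))
print(sibling_rel)  # ../data/file.txt
```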

## 12.4 Serialization: JSON, CSV, and Binary Formats

Serialization converts Python objects to formats storable or transmittable, with deserialization reconstructing them. Different formats suit different needs: human-readable vs. compact, schema-flexible vs. typed, secure vs. performant.

### JSON (JavaScript Object Notation)

JSON is the lingua franca of web APIs—human-readable, language-independent, and widely supported.

```python
import json
from typing import Any
from datetime import datetime
from pathlib import Path

class DateTimeEncoder(json.JSONEncoder):
    """Custom encoder for non-serializable types."""
    def default(self, obj: Any) -> Any:
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, set):
            return list(obj)
        return super().default(obj)

def json_operations() -> None:
    """JSON serialization and deserialization."""
    data: dict[str, Any] = {
        'name': 'Alice',
        'age': 30,
        'scores': [85, 92, 78],
        'active': True,
        'created': datetime.now(),
        'tags': {'python', 'developer'}  # Set
    }
    
    # Serialize to string
    json_str: str = json.dumps(data, cls=DateTimeEncoder, indent=2)
    print(json_str)
    
    # Serialize to file
    with open('data.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, cls=DateTimeEncoder, indent=2)
    
    # Deserialize from string
    parsed: dict[str, Any] = json.loads(json_str)
    
    # Deserialize from file
    with open('data.json', 'r', encoding='utf-8') as f:
        loaded: dict[str, Any] = json.load(f)
    
    # Compact format (no whitespace)
    compact: str = json.dumps(data, separators=(',', ':'), cls=DateTimeEncoder)

def load_json_with_types(filepath: Path) -> dict[str, Any]:
    """
    Load JSON with type hints (requires validation in production).
    
    Note: json module returns basic Python types.
    For strict typing, use pydantic or marshmallow.
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        return json.load(f)
```
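
A custom encoder only covers half of the round trip: on loading, an ISO string stays a plain `str`. One possible approach, sketched here, is to tag the value on the way out and restore it with `object_hook` on the way in. The `__datetime__` key and both helper names are hypothetical conventions chosen for this example, not part of the `json` module.

```python
import json
from datetime import datetime
from typing import Any


def encode_extra(obj: Any) -> Any:
    """Tag datetimes so the decoder can recognise them (hypothetical scheme)."""
    if isinstance(obj, datetime):
        return {"__datetime__": obj.isoformat()}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def decode_extra(d: dict[str, Any]) -> Any:
    """object_hook: called for every decoded JSON object."""
    if "__datetime__" in d:
        return datetime.fromisoformat(d["__datetime__"])
    return d


original = {"name": "Alice", "created": datetime(2024, 1, 15, 9, 30)}
text = json.dumps(original, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)

print(restored == original)  # True
```

The `default=` argument of `json.dumps` is a lighter alternative to subclassing `JSONEncoder` when a single function is enough.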

**JSON Type Mappings:**

| Python | JSON |
|--------|------|
| dict | object |
| list, tuple | array |
| str | string |
| int, float | number |
| True | true |
| False | false |
| None | null |
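
These mappings are not symmetric, so a round trip through JSON can change types. A short demonstration with made-up data:

```python
import json

original = {"point": (1, 2), 1: "int key", "ratio": 0.5, "missing": None}
restored = json.loads(json.dumps(original))

print(restored)
# {'point': [1, 2], '1': 'int key', 'ratio': 0.5, 'missing': None}
print(restored == original)  # False
```

The tuple comes back as a list and the integer key comes back as a string, because JSON has a single array type and only allows string keys.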

**Security Note:** `json` is safe to use with untrusted data (unlike `pickle`). It doesn't execute arbitrary code during deserialization.

### CSV (Comma-Separated Values)

CSV remains ubiquitous for tabular data exchange despite its lack of standardization.

```python
import csv
from pathlib import Path
from typing import Any, Iterator, Dict, List

def csv_reading(filepath: Path) -> None:
    """Read CSV files efficiently."""
    
    # Method 1: DictReader (access by column name)
    with open(filepath, 'r', newline='', encoding='utf-8') as f:
        reader: csv.DictReader = csv.DictReader(f)
        for row in reader:  # Each row is a dict (OrderedDict on Python 3.6 and 3.7)
            print(f"{row['name']}: {row['email']}")
    
    # Method 2: Standard reader (access by index)
    with open(filepath, 'r', newline='', encoding='utf-8') as f:
        reader: csv.reader = csv.reader(f)
        header: list[str] = next(reader)  # Skip header
        for row in reader:
            name, email, age = row[0], row[1], row[2]
    
    # Method 3: Reading with type conversion
    def read_users(filepath: Path) -> Iterator[Dict[str, Any]]:
        with open(filepath, 'r', newline='', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            for row in reader:
                yield {
                    'name': row['name'],
                    'age': int(row['age']),
                    'active': row['active'].lower() == 'true'
                }

def csv_writing(data: List[Dict[str, Any]], filepath: Path) -> None:
    """Write data to CSV."""
    if not data:
        return
    
    fieldnames: list[str] = list(data[0].keys())
    
    with open(filepath, 'w', newline='', encoding='utf-8') as f:
        writer: csv.DictWriter = csv.DictWriter(
            f, 
            fieldnames=fieldnames,
            extrasaction='ignore',  # Ignore extra keys in dict
            quoting=csv.QUOTE_MINIMAL
        )
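        writer.writeheader()
        writer.writerows(data)
        
        # Alternative to writerows(): write row by row
        # (use one approach or the other, otherwise every row is written twice)
        # for row in data:
        #     writer.writerow(row)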

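def csv_advanced() -> None:
    """Handle dialects and custom formatting."""
    # Custom dialect for unusual formats. Avoid the names 'excel',
    # 'excel-tab', and 'unix': the csv module already registers those.
    csv.register_dialect('space_separated', delimiter=' ', quoting=csv.QUOTE_NONE)
    
    with open('data.txt', 'r', newline='', encoding='utf-8') as f:
        reader = csv.reader(f, dialect='space_separated')
        for row in reader:
            print(row)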
```

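**Critical CSV Parameter:** Always pass `newline=''` when opening CSV files. Without it, the `\r\n` line endings written by the `csv` module gain an extra `\r` on Windows, which shows up as blank lines between rows, and newlines embedded in quoted fields may be misread.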
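
The quoting rules are easiest to trust after seeing a round trip. This sketch uses an in-memory `io.StringIO` buffer instead of a real file, opened with `newline=''` to mirror how a CSV file should be opened; the rows are invented sample data containing a comma, quotes, and an embedded newline.

```python
import csv
import io

rows = [
    {"name": "Ada Lovelace", "note": "mathematician, writer"},
    {"name": "Alan Turing", "note": 'said "hello"\non two lines'},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "note"])
writer.writeheader()
writer.writerows(rows)

# Fields containing commas, quotes, or newlines are quoted automatically
print(buffer.getvalue())

buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)  # True
```

Parsing such data by splitting lines on commas would break on both rows, which is the main reason to prefer the `csv` module over manual string handling.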

### Pickle: Python-Specific Serialization

**Warning:** Only unpickle data you trust. Pickle can execute arbitrary code during deserialization.

```python
import pickle
from pathlib import Path
from typing import Any

def pickle_operations() -> None:
    """
    Serialize Python objects to binary format.
    
    WARNING: Never unpickle untrusted data. Pickle is not secure.
    """
    data: dict[str, Any] = {
        'complex': 3 + 4j,
        # Lambdas cannot be pickled: functions are stored by qualified name,
        # so only importable, module-level functions work (e.g. len).
        'function': len,
        'nested': {'a': [1, 2, 3]}
    }
    
    # Serialize to bytes
    pickled: bytes = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
    
    # Serialize to file
    with open('data.pkl', 'wb') as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    
    # Deserialize
    with open('data.pkl', 'rb') as f:
        loaded: Any = pickle.load(f)  # Security risk if file is tampered!
    
    # Safer alternative for simple data: use JSON or MessagePack
```
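
To make the warning concrete, the sketch below shows how a pickle payload can name a callable to run at load time. The class `Innocent` is hypothetical, and the payload deliberately evaluates a harmless arithmetic expression; a real attacker would substitute something destructive.

```python
import pickle


class Innocent:
    """Hypothetical payload: __reduce__ tells pickle which callable to run."""

    def __reduce__(self):
        # On load, pickle will call eval("6 * 7").
        # An attacker would supply a malicious expression or os.system here.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Innocent())
result = pickle.loads(payload)  # Runs the callable named in the payload

print(result)  # 42
```

Note that loading did not return an `Innocent` instance at all: the bytes alone decided what code ran. This is why a pickle from an untrusted source must never be loaded.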

**Pickle Protocols:**
*   Protocol 0: ASCII, backward compatible
*   Protocol 1: Binary, backward compatible
*   Protocol 2: Python 2.3+
*   Protocol 3: Python 3.0+ (explicit bytes support)
*   Protocol 4: Python 3.4+ (large objects, framing)
*   Protocol 5: Python 3.8+ (out-of-band buffers)

**When to Use Pickle:**
*   Caching Python objects between program runs (trusted source)
*   Multiprocessing (sending objects between processes)
*   Machine learning model serialization (with caution)

**Never Use Pickle For:**
*   Data from untrusted sources (network, user uploads)
*   Long-term storage (pickles break when class definitions move or change, and newer protocols are unreadable by older Python versions)
*   Cross-language communication

### Alternative Serialization

```python
from typing import Any

# Both packages are third-party: pip install msgpack pyyaml

# MessagePack (binary, cross-language, often faster and more compact than JSON)
import msgpack

packed: bytes = msgpack.packb({'key': 'value'})
obj: Any = msgpack.unpackb(packed)

# YAML (human-readable, supports comments)
import yaml

with open('config.yaml', 'r', encoding='utf-8') as f:
    config: dict = yaml.safe_load(f)  # Use safe_load, not load!
```

## Summary

Persistent data management requires both technical precision and security awareness. You have mastered the **context manager protocol** (`__enter__` and `__exit__`), enabling robust resource management that guarantees cleanup even during exception handling. The `with` statement eliminates resource leaks, while `contextlib` utilities like `@contextmanager`, `ExitStack`, and `suppress` reduce boilerplate for common patterns.

You understand the distinction between **text and binary modes**, the critical importance of explicit encoding declarations (UTF-8), and memory-efficient strategies for processing large files through iteration and chunking. **Pathlib** replaces archaic `os.path` string manipulation with an object-oriented interface where the `/` operator intuitively joins path components, and methods like `read_text()`, `write_bytes()`, and `glob()` streamline filesystem interactions.

For data serialization, **JSON** provides universal interoperability for simple data structures, while **CSV** handles tabular exchange despite its format ambiguities. You recognize that **pickle**, while powerful for Python-specific serialization, carries severe security risks when processing untrusted data—a vulnerability that has led to remote code execution exploits in production systems.

However, file I/O is often the bottleneck in application performance. In the next chapter, we explore concurrency and parallelism—techniques to perform multiple operations simultaneously, from threading for I/O-bound tasks to multiprocessing for CPU-bound workloads and the modern `asyncio` framework for high-performance asynchronous programming.

**Next Chapter**: Chapter 13: Concurrency and Parallelism (Threading, Multiprocessing, and Asyncio).

