# Module 04: Process & Service Management

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 60 minutes

**Prerequisites**: 
- Completed Modules 00-03
- Basic understanding of processes and services
- Familiarity with psutil basics from Module 02

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Monitor** system processes and resource usage
2. **Manage** running processes (list, kill, restart)
3. **Track** ML training jobs and detect issues
4. **Automate** process management tasks
5. **Work with** Windows services programmatically
6. **Build** monitoring dashboards for system health

## Introduction: Why Process Management for Data Scientists?

Data scientists run into process management in several recurring situations.

### Common Scenarios

**1. ML Training Monitoring**
- Training job running for hours/days
- Need to track GPU/CPU usage
- Detect if process is stuck or crashed
- Kill process if memory leak detected

**2. Resource Management**
- Multiple experiments running simultaneously
- Ensure fair resource distribution
- Prevent one job from hogging all resources
- Auto-kill processes exceeding limits

**3. Production Deployment**
- Model serving API needs to stay running
- Auto-restart if service crashes
- Monitor for memory leaks
- Health checks and alerts

**4. Development Workflow**
- Kill hung Jupyter kernels
- Restart stuck processes
- Clean up orphaned processes
- Find processes using specific ports

This module teaches you to automate many of these tasks!

In [None]:
# Setup: Import required libraries
import subprocess
import time
from datetime import datetime, timedelta
from pathlib import Path

# Install psutil if needed
try:
    import psutil
except ImportError:
    import sys
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'psutil', '-q'], check=True)
    import psutil

print(f"psutil version: {psutil.__version__}")
print("Setup complete!")

## 1. Process Discovery and Monitoring

Understanding what processes are running is the first step to managing them.

### Process Attributes

Each process has:
- **PID**: Process ID (unique identifier)
- **Name**: Process name (e.g., "python.exe")
- **Status**: running, sleeping, zombie, etc.
- **CPU%**: CPU usage percentage
- **Memory**: RAM usage
- **Create time**: When process started
- **Cmdline**: Command line arguments

### Why This Matters

- Find specific training jobs by command line arguments
- Identify resource-hungry processes
- Track how long jobs have been running
- Detect zombie/stuck processes
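
As a minimal sketch of reading these attributes for one process, the snippet below inspects the current Python process. `describe_process` is a hypothetical helper name used only for illustration; `Process.as_dict()` is the psutil call doing the work.

```python
import os
from datetime import datetime

import psutil


def describe_process(pid):
    """Return a dict of commonly used attributes for one process (hypothetical helper)."""
    proc = psutil.Process(pid)
    # as_dict() fetches several attributes in a single call
    info = proc.as_dict(attrs=['pid', 'name', 'status', 'memory_info',
                               'create_time', 'cmdline'])
    # Derived values: RSS in megabytes and the start time as a datetime
    info['memory_mb'] = info['memory_info'].rss / (1024 * 1024)
    info['started'] = datetime.fromtimestamp(info['create_time'])
    return info


info = describe_process(os.getpid())
print(f"PID {info['pid']}: {info['name']} ({info['status']})")
print(f"  Memory:  {info['memory_mb']:.1f} MB")
print(f"  Started: {info['started']:%Y-%m-%d %H:%M:%S}")
print(f"  Cmdline: {' '.join(info['cmdline'] or [])}")
```

The sketch inspects its own process, where every attribute is readable. For other processes, psutil may raise `AccessDenied` or `NoSuchProcess`, which is why the functions in this module wrap such calls in `try`/`except`.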

### 1.1 Finding Processes by Name

In [None]:
# Find all processes matching a name
# Essential for finding your training jobs

def find_processes_by_name(name_pattern):
    """
    Find processes whose name contains the pattern.
    
    Args:
        name_pattern: String to search for in process names (case-insensitive)
    
    Returns:
        list: List of matching Process objects
    """
    matching_processes = []
    
    for proc in psutil.process_iter(['pid', 'name', 'cpu_percent', 'memory_info', 'create_time']):
        try:
            # name can be None when psutil is denied access to it
            if name_pattern.lower() in (proc.info['name'] or '').lower():
                matching_processes.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    
    return matching_processes

# Example: Find all Python processes
python_processes = find_processes_by_name('python')

print(f"Found {len(python_processes)} Python process(es):")
print("=" * 70)

for proc in python_processes[:5]:  # Show first 5
    try:
        mem_mb = proc.info['memory_info'].rss / (1024 * 1024)
        create_time = datetime.fromtimestamp(proc.info['create_time'])
        uptime = datetime.now() - create_time
        
        print(f"PID {proc.info['pid']}: {proc.info['name']}")
        print(f"  Memory: {mem_mb:.1f} MB")
        print(f"  Uptime: {uptime}")
        print()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

### 1.2 Finding Processes by Command Line

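Matching on the command line is more precise than matching on the name: every Python job shows up as "python", but only one of them has `train_model.py` in its arguments. This makes it well suited to identifying specific training scripts.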

In [None]:
# Find processes by command line arguments
# Example: Find which process is running "train_model.py"

def find_processes_by_cmdline(pattern):
    """
    Find processes whose command line contains pattern.
    
    Args:
        pattern: String to search for in command line
    
    Returns:
        list: List of (process, cmdline) tuples
    """
    matches = []
    
    for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
        try:
            cmdline = ' '.join(proc.info['cmdline']) if proc.info['cmdline'] else ''
            
            if pattern.lower() in cmdline.lower():
                matches.append((proc, cmdline))
        except (psutil.NoSuchProcess, psutil.AccessDenied, TypeError):
            pass
    
    return matches

# Example: Find Jupyter kernel processes
# (kernels are typically started via ipykernel_launcher, so '.ipynb' rarely appears in their command line)
jupyter_processes = find_processes_by_cmdline('ipykernel')

print(f"Found {len(jupyter_processes)} Jupyter kernel process(es):")
for proc, cmdline in jupyter_processes[:3]:
    print(f"\nPID {proc.pid}: {proc.name()}")
    print(f"  Command: {cmdline[:100]}...")  # Truncate long commands

## 2. Process Resource Monitoring

Track resource usage of specific processes over time - essential for ML training monitoring.

In [None]:
# Monitor a specific process's resource usage
# Useful for tracking training job resources

class ProcessMonitor:
    """
    Monitor a process's resource usage over time.
    
    Example:
        monitor = ProcessMonitor(pid=12345)
        monitor.monitor(duration_seconds=60, interval_seconds=5)
    """
    
    def __init__(self, pid=None, process_name=None):
        """
        Initialize monitor for a process.
        
        Args:
            pid: Process ID to monitor
            process_name: Or find process by name
        """
        if pid is not None:
            self.process = psutil.Process(pid)
        elif process_name:
            procs = find_processes_by_name(process_name)
            if not procs:
                raise ValueError(f"No process found with name: {process_name}")
            self.process = procs[0]  # Use first match
        else:
            raise ValueError("Provide either pid or process_name")
        
        self.history = []
    
    def get_current_stats(self):
        """
        Get current resource usage stats.
        
        Returns:
            dict: Current stats
        """
        try:
            return {
                'timestamp': datetime.now(),
                'cpu_percent': self.process.cpu_percent(interval=1),
                'memory_mb': self.process.memory_info().rss / (1024 * 1024),
                'num_threads': self.process.num_threads(),
                'status': self.process.status()
            }
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            return None
    
    def monitor_once(self):
        """
        Record current stats and add to history.
        
        Returns:
            dict: Current stats or None if process ended
        """
        stats = self.get_current_stats()
        if stats:
            self.history.append(stats)
        return stats
    
    def monitor(self, duration_seconds=60, interval_seconds=5):
        """
        Monitor process for a duration.
        
        Args:
            duration_seconds: How long to monitor
            interval_seconds: Seconds between samples
        """
        print(f"Monitoring PID {self.process.pid} for {duration_seconds}s (interval: {interval_seconds}s)")
        print(f"Process: {self.process.name()}")
        print()
        
        end_time = time.time() + duration_seconds
        
        while time.time() < end_time:
            stats = self.monitor_once()
            
            if stats is None:
                print("Process has ended or is no longer accessible")
                break
            
            # Display current stats
            print(f"[{stats['timestamp'].strftime('%H:%M:%S')}] "
                  f"CPU: {stats['cpu_percent']:5.1f}% | "
                  f"Memory: {stats['memory_mb']:7.1f} MB | "
                  f"Threads: {stats['num_threads']:3d} | "
                  f"Status: {stats['status']}")
            
            time.sleep(interval_seconds)
        
        print(f"\nMonitoring complete. Collected {len(self.history)} samples.")
    
    def get_summary(self):
        """
        Get summary statistics from monitoring history.
        
        Returns:
            dict: Summary stats
        """
        if not self.history:
            return {}
        
        cpu_values = [s['cpu_percent'] for s in self.history]
        mem_values = [s['memory_mb'] for s in self.history]
        
        return {
            'samples': len(self.history),
            'duration': self.history[-1]['timestamp'] - self.history[0]['timestamp'],
            'cpu_avg': sum(cpu_values) / len(cpu_values),
            'cpu_max': max(cpu_values),
            'memory_avg_mb': sum(mem_values) / len(mem_values),
            'memory_max_mb': max(mem_values)
        }

# Example: Monitor current Python process (this notebook)
import os
current_pid = os.getpid()
print(f"Current process PID: {current_pid}")

# Uncomment to actually monitor:
# monitor = ProcessMonitor(pid=current_pid)
# monitor.monitor(duration_seconds=10, interval_seconds=2)
# summary = monitor.get_summary()
# print("\nSummary:", summary)

print("\nProcessMonitor class ready!")
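
The samples in `ProcessMonitor.history` can also be checked for steady memory growth, one possible sign of a leak. The sketch below is an assumption-laden illustration: `memory_growth_mb_per_min` is a hypothetical helper, it only compares the first and last sample, and the data is synthetic.

```python
from datetime import datetime, timedelta


def memory_growth_mb_per_min(history):
    """Estimate memory growth from the first to the last sample, in MB per minute.

    `history` is a list of dicts with 'timestamp' (datetime) and 'memory_mb'
    keys, the same shape ProcessMonitor.history uses.
    """
    if len(history) < 2:
        return 0.0
    elapsed = (history[-1]['timestamp'] - history[0]['timestamp']).total_seconds()
    if elapsed <= 0:
        return 0.0
    delta_mb = history[-1]['memory_mb'] - history[0]['memory_mb']
    return delta_mb / (elapsed / 60)


# Synthetic samples: memory rises by 50 MB every 30 seconds
start = datetime(2024, 1, 1, 12, 0, 0)
samples = [
    {'timestamp': start + timedelta(seconds=30 * i), 'memory_mb': 500 + 50 * i}
    for i in range(5)
]
rate = memory_growth_mb_per_min(samples)
print(f"Growth: {rate:.1f} MB/min")  # prints "Growth: 100.0 MB/min"
```

An endpoint difference is a crude estimate. Training jobs often grow quickly during warm-up and then level off, so any alert threshold should be chosen with the job's normal behavior in mind.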

## 3. Process Management (Kill, Terminate, Restart)

Sometimes you need to forcefully stop a process: a hung kernel, a runaway training job, or a job that is leaking memory.

### Termination Methods

1. **terminate()**: Requests shutdown (SIGTERM on Unix)
   - On Unix the process can handle the signal, clean up, and save state
   - May not work if the process is stuck or ignores the signal
   - On Windows psutil calls TerminateProcess, which stops the process immediately with no cleanup

2. **kill()**: Forceful shutdown (SIGKILL on Unix; on Windows it behaves the same as terminate())
   - Immediate termination
   - No cleanup possible
   - Use as last resort

### Safety Guidelines

- ✅ Always try terminate() first
- ✅ Wait a few seconds before using kill()
- ✅ Verify process actually terminated
- ❌ Don't kill system processes
- ❌ Don't kill processes you don't own (permission errors)
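
The guidelines above can be tried safely on a throwaway child process. This is a minimal sketch, assuming a child that only sleeps and a 3-second grace period chosen for illustration; `psutil.wait_procs()` returns the processes that exited and those still alive.

```python
import subprocess
import sys

import psutil

# Start a harmless child process that only sleeps, so nothing important is at risk
child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
proc = psutil.Process(child.pid)

# Step 1: ask the process to stop
proc.terminate()

# Step 2: wait a few seconds and see what is still alive
gone, alive = psutil.wait_procs([proc], timeout=3)

# Step 3: force-kill anything that ignored the request
for survivor in alive:
    survivor.kill()
    survivor.wait(timeout=3)

# Step 4: verify the process is really gone
still_running = proc.is_running()
child.wait()  # let subprocess finish its own bookkeeping for the child
print(f"Exited after terminate(): {len(gone)}, needed kill(): {len(alive)}, "
      f"still running: {still_running}")
```

Because the child does not block SIGTERM, it normally exits at step 1 and the kill branch is never reached.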

In [None]:
# Safe process termination with fallback to kill
# Essential for cleaning up stuck training jobs

def terminate_process_safe(pid, timeout=5):
    """
    Safely terminate a process with timeout.
    
    Args:
        pid: Process ID to terminate
        timeout: Seconds to wait before force-killing
    
    Returns:
        bool: True if terminated successfully
    """
    try:
        proc = psutil.Process(pid)
        proc_name = proc.name()
        
        print(f"Terminating process: PID {pid} ({proc_name})")
        
        # Try graceful termination first
        proc.terminate()
        print(f"  Sent termination signal, waiting {timeout}s...")
        
        try:
            # Wait for process to terminate
            proc.wait(timeout=timeout)
            print(f"  ✓ Process terminated gracefully")
            return True
        
        except psutil.TimeoutExpired:
            # Process didn't terminate, force kill
            print(f"  ⚠ Process didn't terminate, forcing kill...")
            proc.kill()
            proc.wait()  # Wait for kill to complete
            print(f"  ✓ Process killed")
            return True
    
    except psutil.NoSuchProcess:
        print(f"  ℹ Process {pid} not found (may have already ended)")
        return True
    
    except psutil.AccessDenied:
        print(f"  ✗ Access denied (insufficient permissions)")
        return False
    
    except Exception as e:
        print(f"  ✗ Error: {e}")
        return False

# Example: Terminate by name (find and kill)
def terminate_by_name(process_name, max_processes=None):
    """
    Terminate all processes matching a name.
    
    Args:
        process_name: Name pattern to match
        max_processes: Maximum processes to terminate (safety limit)
    
    Returns:
        int: Number of processes terminated
    """
    procs = find_processes_by_name(process_name)
    
    if not procs:
        print(f"No processes found matching: {process_name}")
        return 0
    
    if max_processes and len(procs) > max_processes:
        print(f"⚠ Found {len(procs)} processes, but max_processes={max_processes}")
        print(f"  Not terminating for safety. Increase max_processes if intentional.")
        return 0
    
    print(f"Found {len(procs)} process(es) matching '{process_name}':")
    
    terminated = 0
    for proc in procs:
        if terminate_process_safe(proc.pid):
            terminated += 1
    
    print(f"\nTerminated {terminated}/{len(procs)} process(es)")
    return terminated

# Example (commented out for safety):
# terminate_by_name('notepad', max_processes=5)

print("Process termination functions ready!")
print("⚠ Use with caution - always verify process before terminating")

## 4. Automated Resource Limiting

Prevent runaway processes from consuming all resources.

In [None]:
# Auto-kill processes exceeding resource limits
# Prevents one job from hogging all resources

class ResourceLimiter:
    """
    Monitor processes and kill those exceeding limits.
    
    Example:
        limiter = ResourceLimiter(max_memory_mb=8000, max_cpu_percent=90)
        limiter.enforce_limits(process_name='python', duration=300)
    """
    
    def __init__(self, max_memory_mb=None, max_cpu_percent=None, max_runtime_hours=None):
        """
        Initialize resource limiter.
        
        Args:
            max_memory_mb: Kill if memory exceeds this (MB)
            max_cpu_percent: Kill if CPU usage in a single 1-second sample exceeds
                this (values can exceed 100 on multi-core systems)
            max_runtime_hours: Kill if running longer than this
        """
        self.max_memory_mb = max_memory_mb
        self.max_cpu_percent = max_cpu_percent
        self.max_runtime_hours = max_runtime_hours
    
    def check_process(self, proc):
        """
        Check if process exceeds limits.
        
        Args:
            proc: psutil.Process object
        
        Returns:
            tuple: (should_kill: bool, reason: str)
        """
        try:
            # Check memory
            if self.max_memory_mb:
                mem_mb = proc.memory_info().rss / (1024 * 1024)
                if mem_mb > self.max_memory_mb:
                    return True, f"Memory {mem_mb:.0f}MB > {self.max_memory_mb}MB"
            
            # Check CPU (blocks for about 1 second per process while sampling)
            if self.max_cpu_percent:
                cpu = proc.cpu_percent(interval=1)
                if cpu > self.max_cpu_percent:
                    return True, f"CPU {cpu:.1f}% > {self.max_cpu_percent}%"
            
            # Check runtime
            if self.max_runtime_hours:
                create_time = datetime.fromtimestamp(proc.create_time())
                runtime = datetime.now() - create_time
                max_runtime = timedelta(hours=self.max_runtime_hours)
                if runtime > max_runtime:
                    return True, f"Runtime {runtime} > {max_runtime}"
            
            return False, ""
        
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            return False, ""
    
    def enforce_limits(self, process_name=None, check_interval=60, duration=None):
        """
        Monitor and enforce limits on processes.
        
        Args:
            process_name: Only check processes matching this name
            check_interval: Seconds between checks
            duration: Total seconds to run (None for infinite)
        """
        print(f"Resource Limiter started")
        print(f"Limits: Memory={self.max_memory_mb}MB, "
              f"CPU={self.max_cpu_percent}%, "
              f"Runtime={self.max_runtime_hours}h")
        print(f"Checking every {check_interval}s")
        print()
        
        start_time = time.time()
        killed_count = 0
        
        try:
            while True:
                # Get processes to check
                if process_name:
                    processes = find_processes_by_name(process_name)
                else:
                    processes = list(psutil.process_iter())
                
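                # Check each process
                own_pid = psutil.Process().pid
                for proc in processes:
                    # Never terminate the process running this limiter
                    # (e.g. the notebook kernel when process_name='python')
                    if proc.pid == own_pid:
                        continue

                    should_kill, reason = self.check_process(proc)

                    if should_kill:
                        try:
                            proc_name = proc.name()
                            proc_pid = proc.pid

                            print(f"[{datetime.now().strftime('%H:%M:%S')}] "
                                  f"Killing PID {proc_pid} ({proc_name}): {reason}")

                            proc.terminate()
                            try:
                                proc.wait(timeout=5)
                            except psutil.TimeoutExpired:
                                # Graceful stop failed, fall back to a force kill
                                proc.kill()
                                proc.wait(timeout=5)
                            killed_count += 1

                        except Exception as e:
                            print(f"  Failed to kill: {e}")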
                
                # Check duration
                if duration and (time.time() - start_time) >= duration:
                    break
                
                time.sleep(check_interval)
        
        except KeyboardInterrupt:
            print("\nStopped by user")
        
        print(f"\nKilled {killed_count} process(es) total")

# Example (commented out):
# limiter = ResourceLimiter(max_memory_mb=8000, max_runtime_hours=24)
# limiter.enforce_limits(process_name='python', check_interval=60, duration=300)

print("ResourceLimiter ready!")
print("Use to prevent runaway processes from consuming all resources")
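
`check_process` above samples CPU once, so a short spike is enough to trigger a kill. One way to require sustained usage is to demand several consecutive samples over the limit. The sketch below assumes a hypothetical `SustainedThreshold` helper and made-up CPU readings; it is not part of psutil or `ResourceLimiter`.

```python
from collections import deque


class SustainedThreshold:
    """Report True only after `required` consecutive samples exceed `limit`.

    Hypothetical helper for illustration.
    """

    def __init__(self, limit, required=3):
        self.limit = limit
        self.recent = deque(maxlen=required)  # keeps only the newest samples

    def update(self, value):
        """Record one sample; return True if the last `required` samples all exceed the limit."""
        self.recent.append(value > self.limit)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


check = SustainedThreshold(limit=90, required=3)
readings = [95, 97, 40, 92, 96, 99]  # made-up CPU percentages
results = [check.update(r) for r in readings]
print(results)  # [False, False, False, False, False, True]
```

The dip to 40 resets the streak, so only the final reading completes three consecutive samples above 90. When watching several processes, each PID would need its own tracker.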

## 5. Practice Exercises

### Exercise 1: Process Dashboard

Create a dashboard that shows:
1. Top 5 processes by CPU usage
2. Top 5 processes by memory usage
3. All Python processes
4. System-wide resource summary
5. Refresh every 5 seconds

**Hint**: Collect rows with `psutil.process_iter()`, then order them with `sorted()` and a `key` function
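
As a starting point for the hint, here is a minimal sketch of one dashboard panel: the top processes by memory. `top_by_memory` is a hypothetical helper name, and the remaining panels and the refresh loop are left for the exercise.

```python
import psutil


def top_by_memory(n=5):
    """Return up to n (pid, name, rss_mb) tuples, largest memory first (hypothetical helper)."""
    rows = []
    for proc in psutil.process_iter(['pid', 'name', 'memory_info']):
        mem = proc.info['memory_info']
        if mem is None:  # psutil stores None when access to the value was denied
            continue
        rows.append((proc.info['pid'], proc.info['name'] or '?', mem.rss / (1024 * 1024)))
    # sorted() with a key function orders rows by the memory column, descending
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


top = top_by_memory(5)
for pid, name, rss_mb in top:
    print(f"{pid:>7}  {name:<25} {rss_mb:8.1f} MB")
```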

In [None]:
# Exercise 1: Your solution here

def process_dashboard(duration_seconds=30, interval_seconds=5):
    """
    Display real-time process dashboard.
    
    Args:
        duration_seconds: How long to run
        interval_seconds: Refresh interval
    """
    # TODO: Implement dashboard
    pass

# Test your dashboard
# process_dashboard(duration_seconds=30, interval_seconds=5)


### Exercise 2: Hung Process Detector

Create a detector for hung processes:
1. Monitor process CPU usage over time
2. If CPU stays at or near 0% (below a small threshold) for 5 consecutive checks, mark as hung
3. If process is hung and using >1GB memory, offer to kill it
4. Log all detections

**Hint**: Track CPU history, identify patterns

In [None]:
# Exercise 2: Your solution here

class HungProcessDetector:
    """
    Detect and optionally kill hung processes.
    """
    
    def __init__(self, cpu_threshold=0.1, consecutive_checks=5):
        # TODO: Initialize detector
        pass
    
    def is_hung(self, pid):
        # TODO: Check if process is hung
        pass
    
    def monitor(self, process_name, duration=300):
        # TODO: Monitor and detect hung processes
        pass

# Test your detector
# detector = HungProcessDetector()
# detector.monitor('python', duration=60)


### Exercise 3: Training Job Manager

Create a manager for ML training jobs:
1. Start training script as subprocess
2. Monitor its resource usage
3. Log metrics to CSV file
4. Auto-kill if exceeds resource limits
5. Send notification when training completes

**Hint**: Combine subprocess.Popen with ProcessMonitor
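
To illustrate the hint, the sketch below starts a child with `subprocess.Popen`, wraps its PID in `psutil.Process`, and logs samples to a CSV file until the child exits. `run_and_log`, the demo file name, and the sleeping stand-in for a training script are all assumptions made for illustration; limits and notifications are left for the exercise.

```python
import csv
import os
import subprocess
import sys
import tempfile
import time
from datetime import datetime

import psutil


def run_and_log(cmd, log_file, interval_seconds=0.2):
    """Start cmd, sample its memory and CPU until it exits, and write rows to a CSV.

    Hypothetical helper; returns (exit_code, number_of_samples).
    """
    child = subprocess.Popen(cmd)
    proc = psutil.Process(child.pid)
    samples = 0
    with open(log_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['timestamp', 'memory_mb', 'cpu_percent'])
        while child.poll() is None:  # poll() returns None while the child is running
            try:
                mem_mb = proc.memory_info().rss / (1024 * 1024)
                # Usage since the previous call; the first reading is always 0.0
                cpu = proc.cpu_percent(interval=None)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                break  # the child exited between poll() and the psutil calls
            writer.writerow([datetime.now().isoformat(), f"{mem_mb:.1f}", f"{cpu:.1f}"])
            samples += 1
            time.sleep(interval_seconds)
    return child.wait(), samples


# Stand-in for a training script: a child process that sleeps for one second
log_path = os.path.join(tempfile.gettempdir(), 'training_metrics_demo.csv')
exit_code, samples = run_and_log(
    [sys.executable, '-c', 'import time; time.sleep(1)'],
    log_path,
)
print(f"Exit code {exit_code}, {samples} samples logged to {log_path}")
```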

In [None]:
# Exercise 3: Your solution here

class TrainingJobManager:
    """
    Manage and monitor ML training jobs.
    """
    
    def __init__(self, max_memory_mb=16000, max_runtime_hours=48):
        # TODO: Initialize manager
        pass
    
    def start_job(self, script_path, *args):
        # TODO: Start training script
        pass
    
    def monitor_job(self, log_file='training_metrics.csv'):
        # TODO: Monitor and log metrics
        pass
    
    def enforce_limits(self):
        # TODO: Check and enforce resource limits
        pass

# Test your manager
# manager = TrainingJobManager(max_memory_mb=8000)
# manager.start_job('train_model.py', '--epochs', '100')
# manager.monitor_job()


## 6. Summary

### Key Concepts

1. **Process Discovery**
   - `psutil.process_iter()` to list all processes
   - Filter by name or command line
   - Access process attributes (PID, name, CPU, memory)

2. **Resource Monitoring**
   - Track CPU, memory, threads over time
   - Calculate average and peak usage
   - Identify resource bottlenecks

3. **Process Management**
   - `terminate()` for graceful shutdown
   - `kill()` for forceful termination
   - Wait with timeout for process end
   - Handle access denied errors

4. **Automation**
   - ProcessMonitor for tracking jobs
   - ResourceLimiter for preventing resource hogging
   - Auto-kill hung or runaway processes

### Real-World Applications

- **ML Training**: Monitor CPU and memory usage, detect stuck jobs
- **Resource Management**: Prevent one job from hogging resources
- **Production**: Auto-restart crashed services
- **Development**: Kill hung Jupyter kernels, clean up orphans

### Safety Checklist

Before terminating processes:
- [ ] Verify it's the correct process (check PID and name)
- [ ] Try graceful termination first
- [ ] Wait reasonable time before force-killing
- [ ] Don't kill system processes
- [ ] Log all terminations for audit
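
Several checklist items can be combined in one guard. This is a sketch under stated assumptions: `verified_terminate` and the `terminations` logger name are hypothetical, and the target is a throwaway child process. It relies on `Process.is_running()`, which also returns False when the PID has been reused by a different process.

```python
import logging
import subprocess
import sys

import psutil

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
log = logging.getLogger('terminations')  # hypothetical logger name


def verified_terminate(proc, expected_name, timeout=3):
    """Terminate proc only if it is still the process we expect (hypothetical helper).

    Returns True if the process was terminated.
    """
    if proc.pid == psutil.Process().pid:
        log.info("Refusing to terminate own process (PID %s)", proc.pid)
        return False
    try:
        # Verify identity: still running under the same PID, with the expected name
        if not proc.is_running() or proc.name() != expected_name:
            log.info("Skipping PID %s: not running or name mismatch", proc.pid)
            return False
        proc.terminate()
        proc.wait(timeout=timeout)
    except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.TimeoutExpired) as exc:
        log.info("Could not terminate PID %s: %r", proc.pid, exc)
        return False
    log.info("Terminated PID %s (%s)", proc.pid, expected_name)
    return True


# Throwaway child process used as the target
child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])
target = psutil.Process(child.pid)
name = target.name()

first = verified_terminate(target, expected_name=name)           # terminates the child
second = verified_terminate(target, expected_name=name)          # already gone, skipped
own = verified_terminate(psutil.Process(), expected_name=name)   # own PID, refused
child.wait()  # let subprocess finish its own bookkeeping for the child
print(first, second, own)
```

Every decision, including refusals, goes through the logger, which gives a simple audit trail.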

### What's Next?

In **Module 05: Registry & System Configuration**, you'll learn:
- Read and write Windows Registry
- Manage environment variables permanently
- Configure system settings programmatically
- Automate software configuration

### Self-Assessment

Before moving on, make sure you can:
- [ ] Find processes by name and command line
- [ ] Monitor process resource usage over time
- [ ] Safely terminate processes with fallback
- [ ] Implement resource limiting automation
- [ ] Build process monitoring tools

---

**Continue to Module 05** when ready!