# Assignment: Log GPU Usage and Model Performance for Tuning

## Objective
This assignment focuses on instrumenting a machine learning training process to monitor and log GPU resource utilization (memory, compute) alongside traditional model performance metrics. Such logging is crucial for optimizing model training, especially for large models or when working with limited hardware resources, and it makes hyperparameter tuning more efficient.

## Part 1: Environment Setup and Basic Model Training (30 Marks)

1.  **Environment Setup:**
    * Create a new Python virtual environment.
    * Install the necessary libraries: `torch` (or `tensorflow`), `torchvision` (for common datasets), `scikit-learn`, `pandas`, `numpy`, `matplotlib`, `psutil` (for CPU/RAM), `pynvml` (for NVIDIA GPU metrics, if available), `GPUtil` (alternative/additional GPU library), and `tqdm` (for progress bars). The `csv` module used for logging is part of the Python standard library and does not need to be installed.
        * Provide a `requirements.txt` file.
    * **Important:** This assignment benefits greatly from a GPU. If you don't have one, state this clearly and focus on CPU usage logging plus a theoretical discussion of GPU considerations.

2.  **Dataset and DataLoader:**
    * Load a common image classification dataset (e.g., CIFAR-10, FashionMNIST) using `torchvision.datasets`.
    * Create `DataLoader` instances for training and validation sets.
    * Print dataset sizes and batch sizes.

3.  **Simple Neural Network Model:**
    * Define a simple Convolutional Neural Network (CNN) using `torch.nn.Module` (or `tf.keras.Model`).
        * It should have at least 2 convolutional layers and at least 1 fully connected layer.
    * Move the model to the appropriate device (GPU if available, else CPU).
    * Print a model summary (in PyTorch, use `torchsummary` if it is installed, or simply `print(model)` to show the model structure).

4.  **Basic Training Loop:**
    * Implement a standard training loop (e.g., 5-10 epochs) that:
        * Iterates through batches.
        * Performs forward pass, loss calculation, backward pass, and optimizer step.
        * Calculates and prints epoch-level training loss and validation accuracy at the end of each epoch.
    * Use a standard loss function (e.g., `CrossEntropyLoss`) and optimizer (e.g., `Adam`).
    * Demonstrate that your model can train and show its basic performance.
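
The epoch-level reporting in the training loop step can be sketched independently of the framework. The following is a minimal sketch, not a required design: the `EpochMetrics` class and the batch values are hypothetical, and it assumes each batch reports a mean loss, a count of correct predictions, and its size.

```python
from dataclasses import dataclass


@dataclass
class EpochMetrics:
    """Accumulates batch results into epoch-level loss and accuracy."""
    loss_sum: float = 0.0   # sum of per-sample losses seen so far
    correct: int = 0        # number of correctly classified samples
    seen: int = 0           # total number of samples seen

    def update(self, batch_mean_loss: float, batch_correct: int, batch_size: int) -> None:
        # Weight the batch mean by its size so that a smaller final batch
        # does not distort the epoch average.
        self.loss_sum += batch_mean_loss * batch_size
        self.correct += batch_correct
        self.seen += batch_size

    @property
    def mean_loss(self) -> float:
        return self.loss_sum / self.seen if self.seen else float("nan")

    @property
    def accuracy(self) -> float:
        return self.correct / self.seen if self.seen else float("nan")


# Hypothetical batch results: (mean loss, correct predictions, batch size).
batches = [(0.9, 40, 64), (0.7, 48, 64), (0.5, 12, 16)]
metrics = EpochMetrics()
for mean_loss, n_correct, n in batches:
    metrics.update(mean_loss, n_correct, n)

print(f"loss={metrics.mean_loss:.4f}  acc={metrics.accuracy:.4f}")
```

In a PyTorch loop, `loss.item()` gives the batch mean when `CrossEntropyLoss` uses its default reduction, and `(logits.argmax(dim=1) == targets).sum().item()` gives the number of correct predictions.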

In [None]:
# Your code for environment setup, dataset/dataloader, model definition, and basic training loop.
# Provide `requirements.txt`.
# Confirm GPU availability (or note CPU-only).
# Show sample training output (losses, accuracies).
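
One way to confirm the environment and GPU availability before training is sketched below. The helper names `check_environment` and `gpu_available` are hypothetical, and the sketch assumes the PyTorch route; it only reports what is importable and does not install anything.

```python
import importlib.util

# Maps the pip package names from the assignment to their import names.
# Note that `scikit-learn` is imported as `sklearn`.
PACKAGES = {
    "torch": "torch",
    "torchvision": "torchvision",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "psutil": "psutil",
    "pynvml": "pynvml",
    "GPUtil": "GPUtil",
    "tqdm": "tqdm",
}


def check_environment(packages=PACKAGES):
    """Return {pip_name: True/False} depending on whether the module is importable."""
    return {pip_name: importlib.util.find_spec(module) is not None
            for pip_name, module in packages.items()}


def gpu_available():
    """Return True only if torch is installed and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check also runs without torch
    return bool(torch.cuda.is_available())


status = check_environment()
for name, ok in status.items():
    print(f"{name:<14}{'installed' if ok else 'MISSING'}")
print("CUDA GPU available:", gpu_available())
```

If `gpu_available()` returns `False`, the notebook should say so explicitly and continue with CPU-only logging.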

## Part 2: GPU/CPU Usage Logging (40 Marks)

1.  **Resource Monitor Class/Function:**
    * Create a Python class or a set of functions (e.g., `ResourceMonitor`) that can:
        * **For GPU (if available):**
            * Initialize `pynvml` (or `GPUtil`).
            * Periodically (e.g., every few seconds or every N batches) query and record:
                * GPU utilization (e.g., the `gpu` field returned by `pynvml.nvmlDeviceGetUtilizationRates`).
                * GPU memory usage (e.g., the `used` field returned by `pynvml.nvmlDeviceGetMemoryInfo`, which is reported in bytes).
            * Handle cases where `pynvml` is not installed or no NVIDIA GPU is found gracefully (e.g., print a warning and skip GPU logging).
        * **For CPU/RAM (always):**
            * Use `psutil` to query and record:
                * CPU utilization (`psutil.cpu_percent`).
                * System RAM usage (`psutil.virtual_memory().percent`).

2.  **Integrate Monitoring into Training Loop:**
    * Modify your training loop from Part 1 to integrate the `ResourceMonitor`.
    * The monitor should:
        * Start before training begins.
        * Log resource usage at regular intervals (e.g., every 10-50 batches, or at fixed time intervals).
        * Stop monitoring after training completes.
    * Store the collected resource data (e.g., timestamps, GPU/CPU usage, memory) in a list of dictionaries.

3.  **Save Logs to CSV:**
    * After training, save the collected resource usage data to a CSV file (e.g., `resource_logs.csv`).
    * The CSV should have the columns `timestamp`, `epoch`, `batch`, `gpu_util_percent`, `gpu_mem_used_mb`, `cpu_util_percent`, and `ram_util_percent`. Write `N/A` in the GPU fields if no GPU is available.

4.  **Visualize Resource Usage:**
    * Load the `resource_logs.csv` into a Pandas DataFrame.
    * Create at least **two plots** using `matplotlib` or `seaborn`:
        * Plot of GPU/CPU Utilization (%) over time/batches during training.
        * Plot of GPU/RAM Memory Usage (MB or %) over time/batches during training.
    * Analyze the plots: Do you see expected patterns (e.g., high GPU usage during forward/backward pass, memory increasing)? Are there any bottlenecks?
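
The monitoring steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the class layout, `LOG_EVERY`, and the stand-in loop are hypothetical, and the sketch falls back to `N/A` for any metric whose library is missing so that it runs on any machine.

```python
import time

try:
    import psutil
except ImportError:  # degrade instead of crashing
    psutil = None

try:
    import pynvml
except ImportError:
    pynvml = None

NA = "N/A"


class ResourceMonitor:
    """Collects CPU/RAM samples when psutil is present and GPU samples when NVML is usable."""

    def __init__(self, gpu_index=0):
        self.records = []
        self.handle = None
        if pynvml is None:
            print("Warning: pynvml not installed; skipping GPU logging.")
            return
        try:
            pynvml.nvmlInit()
            self.handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        except pynvml.NVMLError as err:
            print(f"Warning: no usable NVIDIA GPU ({err}); skipping GPU logging.")

    def sample(self, epoch, batch):
        row = {"timestamp": time.time(), "epoch": epoch, "batch": batch,
               "gpu_util_percent": NA, "gpu_mem_used_mb": NA,
               "cpu_util_percent": NA, "ram_util_percent": NA}
        if self.handle is not None:
            row["gpu_util_percent"] = pynvml.nvmlDeviceGetUtilizationRates(self.handle).gpu
            # NVML reports memory in bytes; convert to MiB for the log.
            row["gpu_mem_used_mb"] = pynvml.nvmlDeviceGetMemoryInfo(self.handle).used / 2**20
        if psutil is not None:
            # interval=None compares against the previous call, so the
            # very first reading is not meaningful.
            row["cpu_util_percent"] = psutil.cpu_percent(interval=None)
            row["ram_util_percent"] = psutil.virtual_memory().percent
        self.records.append(row)
        return row

    def close(self):
        if self.handle is not None:
            pynvml.nvmlShutdown()
            self.handle = None


monitor = ResourceMonitor()
LOG_EVERY = 10  # hypothetical logging interval, in batches
for batch_idx in range(30):  # stands in for iterating over a DataLoader
    if batch_idx % LOG_EVERY == 0:
        monitor.sample(epoch=0, batch=batch_idx)
monitor.close()
print(len(monitor.records), "samples collected")
```

Sampling every N batches keeps the overhead predictable. A background thread that samples at fixed time intervals is an alternative if the monitor should also capture idle periods between batches.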

In [None]:
# Your Python code for `ResourceMonitor` class/functions and its integration into the training loop.
# Code for saving logs to CSV and plotting.
# Include the generated plots and your analysis.
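
Saving the collected records and reading them back can be sketched with the standard-library `csv` module. The helper names and the sample rows below are hypothetical (the values are made up to show a CPU-only run), and the temporary directory only keeps the example self-contained.

```python
import csv
import os
import tempfile

FIELDS = ["timestamp", "epoch", "batch", "gpu_util_percent",
          "gpu_mem_used_mb", "cpu_util_percent", "ram_util_percent"]


def save_logs(records, path):
    """Write a list of dicts to CSV with a fixed column order."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


def load_logs(path):
    """Read the CSV back, turning numbers into floats and "N/A" into None."""
    with open(path, newline="") as f:
        return [{k: (None if v == "N/A" else float(v)) for k, v in row.items()}
                for row in csv.DictReader(f)]


# Hypothetical samples from a CPU-only run (values made up for illustration).
records = [
    {"timestamp": 0.0, "epoch": 0, "batch": 0, "gpu_util_percent": "N/A",
     "gpu_mem_used_mb": "N/A", "cpu_util_percent": 35.0, "ram_util_percent": 41.5},
    {"timestamp": 2.5, "epoch": 0, "batch": 10, "gpu_util_percent": "N/A",
     "gpu_mem_used_mb": "N/A", "cpu_util_percent": 88.0, "ram_util_percent": 43.0},
]

path = os.path.join(tempfile.mkdtemp(), "resource_logs.csv")
save_logs(records, path)
loaded = load_logs(path)
print(loaded[0])
```

When loading with Pandas instead, `pandas.read_csv` treats `N/A` as a missing value by default, so the GPU columns arrive as `NaN` and can be skipped when plotting.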

## Part 3: Model Performance and Tuning (30 Marks)

1.  **Model Complexity Experiment:**
    * Create **two distinct configurations** for your neural network model (from Part 1):
        * **Configuration A (Simpler):** Fewer layers, fewer channels/neurons.
        * **Configuration B (More Complex):** More layers, more channels/neurons.
    * For each configuration:
        * Train the model for the same number of epochs.
        * Log both resource usage (as in Part 2) and epoch-level training/validation metrics (loss, accuracy).
        * Save resource logs to separate CSVs (e.g., `resource_logs_A.csv`, `resource_logs_B.csv`).
        * Save training/validation metrics to separate JSON or CSV files (e.g., `metrics_A.csv`, `metrics_B.csv`).

2.  **Compare Performance & Resource Usage:**
    * Load and compare the training/validation metrics of Configuration A vs. Configuration B. Which model performs better on validation accuracy?
    * Load and compare the resource usage plots (GPU/CPU/RAM) for Configuration A vs. Configuration B.
    * Analyze: How does model complexity impact resource utilization and training time? Is there a trade-off between performance and resource efficiency for your specific task/dataset?

3.  **Discussion for Tuning:**
    * How can logging GPU/CPU usage guide your hyperparameter tuning efforts (e.g., choosing batch size, model architecture, optimizer)?
    * Imagine you have limited GPU memory. How would the logs from this assignment help you identify potential memory bottlenecks and suggest solutions (e.g., gradient accumulation, mixed precision training, model pruning)?
    * What other metrics (beyond accuracy/loss) might be valuable to log for a more comprehensive model evaluation (e.g., inference speed, model size)?
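
For the limited-memory question, a back-of-the-envelope estimate can help interpret the logged numbers. The sketch below is an approximation under explicit assumptions: values are stored in 32-bit floats, the optimizer keeps two extra buffers per parameter (as Adam does), and activations and framework overhead are ignored. The helper names and both layer configurations are hypothetical.

```python
def training_state_mb(num_params, bytes_per_value=4, optimizer_buffers=2):
    """Rough memory for weights + gradients + optimizer state, in MiB.

    Assumptions (not from the assignment): every value takes
    `bytes_per_value` bytes, and the optimizer keeps `optimizer_buffers`
    extra tensors per parameter (2 for Adam's moment estimates).
    Activations and framework overhead are NOT included.
    """
    copies = 1 + 1 + optimizer_buffers  # weights, gradients, optimizer state
    return num_params * bytes_per_value * copies / 2**20


def conv2d_params(in_ch, out_ch, kernel):
    """Parameter count of a square-kernel 2D convolution with bias."""
    return out_ch * (in_ch * kernel * kernel + 1)


def linear_params(in_features, out_features):
    """Parameter count of a fully connected layer with bias."""
    return out_features * (in_features + 1)


# Hypothetical Configuration A: two conv layers and one linear layer.
params_a = (conv2d_params(3, 16, 3) + conv2d_params(16, 32, 3)
            + linear_params(32 * 8 * 8, 10))

# Hypothetical Configuration B: three conv layers and two linear layers.
params_b = (conv2d_params(3, 64, 3) + conv2d_params(64, 128, 3)
            + conv2d_params(128, 256, 3)
            + linear_params(256 * 4 * 4, 512) + linear_params(512, 10))

for name, n in [("A", params_a), ("B", params_b)]:
    print(f"Config {name}: {n:,} params, ~{training_state_mb(n):.2f} MiB of training state")
```

Measured GPU memory is normally higher than this estimate because activations, which grow with batch size, are stored for the backward pass. A large gap between the estimate and the logged usage therefore points at batch size, gradient accumulation, or mixed precision as the first things to try.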

In [None]:
# Your code for implementing two model configurations and running training for each.
# Code for saving metrics and resource logs for both configurations.
# Include plots comparing resource usage and model performance.
# Your discussion and analysis.
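
Before plotting, the two resource logs can be reduced to a few comparable numbers. This is a minimal sketch: the `summarize` helper is hypothetical, it assumes records that are already parsed (numbers, or `None` where no GPU was logged), and the sample values for both configurations are made up for illustration.

```python
from statistics import mean


def summarize(records):
    """Reduce resource samples to a few comparable numbers.

    GPU fields are expected to be numbers, or None when no GPU was logged.
    """
    def column(name):
        return [r[name] for r in records if r[name] is not None]

    gpu_util = column("gpu_util_percent")
    gpu_mem = column("gpu_mem_used_mb")
    timestamps = column("timestamp")
    return {
        "duration_s": max(timestamps) - min(timestamps),
        "mean_cpu_util": mean(column("cpu_util_percent")),
        "mean_gpu_util": mean(gpu_util) if gpu_util else None,
        "peak_gpu_mem_mb": max(gpu_mem) if gpu_mem else None,
    }


def make_rows(times, cpu, gpu, mem):
    """Build hypothetical log rows from parallel lists."""
    return [{"timestamp": t, "cpu_util_percent": c,
             "gpu_util_percent": g, "gpu_mem_used_mb": m}
            for t, c, g, m in zip(times, cpu, gpu, mem)]


# Made-up samples for illustration only, not measured results.
logs_a = make_rows([0.0, 10.0, 20.0], [30.0, 50.0, 40.0],
                   [20.0, 40.0, 30.0], [500.0, 900.0, 900.0])
logs_b = make_rows([0.0, 20.0, 40.0], [35.0, 45.0, 40.0],
                   [70.0, 90.0, 80.0], [1500.0, 2100.0, 2100.0])

summary_a, summary_b = summarize(logs_a), summarize(logs_b)
for key in summary_a:
    print(f"{key:<18}A={summary_a[key]!s:<10}B={summary_b[key]}")
```

Placing these summaries next to the final validation accuracy of each configuration makes the trade-off between performance and resource cost easier to argue than plots alone.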

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Ensure your code is well-commented and easy to understand.
* Provide a `requirements.txt` file listing all dependencies.
* Include all generated plots and log outputs directly in the notebook.
* Clearly explain your observations and analysis for each part.
* If you lack a GPU, clearly state this and explain how your approach adapts to CPU-only logging.