# <font color="dodgerblue">YOLO-Distance Model Fine-Tuning for Normal and Blurred Datasets</font>

**Author:** Teague Sangster
**Date Updated:** May 12, 2025
**Project Goal:** This Jupyter Notebook orchestrates the fine-tuning of two YOLO-Distance models. One model is trained on a standard ("tuned") image dataset, and the other is trained on a version of the same dataset where a depth-of-field blur has been programmatically applied. The objective is to compare the performance of the two models; that comparison is done in the separate compare file (also on GitHub). Both models start from the same initial pre-trained weights and alternate their training in sessions.

I wrote summaries at the top of each block, so please read those for more insight.

---

### <font color="teal">How it all works!</font>


1.  **<font color="darkorange">Cell 1: Initial Setup, Imports, and Core Paths</font>**
    * **Environment Setup:** Imports essential Python libraries (`sys`, `os`, `numpy`, `tensorflow`, `argparse`, `time`, `shutil`, `datetime`). TensorFlow is specifically chosen for YOLO v3 compatibility and potential hardware acceleration (MPS on Mac).
    * **Path Modification:** Adds the path to a cloned `yolo-with-distance` repository to `sys.path`.
    * **Import Verification:** Confirms that custom modules (like `get_classes`) can be successfully imported.
    * **Path Definitions:** Establishes all crucial paths for initial model weights, directories for saving new model runs (separate for "tuned" and "blurred" models), image data, label files, and logging directories.
    * **Directory Creation:** Ensures all specified output and log directories are created.
    * **YOLO & Training Parameters:** Defines common YOLO model parameters and master training control parameters like `MAX_TOTAL_TRAINING_TIME_SECONDS`, `BREAK_TIME_SECONDS`, and `EPOCHS_PER_MAIN_SESSION`.

2.  **<font color="darkorange">Cell 2: Helper Function - `create_aggregated_annotation_file`</font>**
    * **Purpose:** Combines individual KITTI-style label files into a single aggregated annotation file for YOLO training.
    * **Functionality:** Loads class names, maps them to IDs, iterates through label files, matches them to images, parses object data (bounding boxes, class IDs, **distance**), and formats this into an output file.
    * *Note: The project reuses labels for modified images, but separate labels are advised for distinct datasets.*

3.  **<font color="darkorange">Cell 3: Annotation File Generation</font>**
    * Calls `create_aggregated_annotation_file` from Cell 2 for both the "tuned" and "blurred" datasets, producing their respective annotation `.txt` files.

4.  **<font color="darkorange">Cell 4: Helper Function - `run_training_session` (Revised)</font>**
    * **Purpose:** Manages and executes a single training session for either model, with safety and checkpointing.
    * **Safety Callbacks (Keras):**
        * `LossGuardCallback`: Flags epochs with loss > 50 as potentially invalid.
        * `EmergencyStopCallback`: Stops training if loss hits a critical threshold (e.g., 1000).
    * **Log/Checkpoint Helpers:**
        * `get_latest_timestamped_log_subdir`: Finds the most recent log subdirectory.
        * `find_best_epoch_checkpoint_in_session_log_dir`: Scans for the best checkpoint (`.h5` file) by lowest validation loss, filtering high-loss files.
    * **`run_training_session` Functionality:** Sets up model-specific parameters, loads weights, constructs arguments for the external `train.py` script (disabling `eval_online`), configures GPU, calls the training script, and then copies the best session checkpoint to the overall best model path if no loss explosion is detected in the filename.

5.  **<font color="darkorange">Cell 5: Quick Test Run Phase</font>**
    * **Purpose:** Executes a brief (e.g., 1-epoch) "smoke test" for both "tuned" and "blurred" models.
    * **Functionality:** Calls `run_training_session` for each model with minimal epochs, using dedicated test log/model paths to verify the pipeline before full training. Includes a break for resource management.

6.  **<font color="darkorange">Cell 6: Main Alternating Training Loop</font>**
    * **Purpose:** The primary execution block that iteratively trains the "tuned" and "blurred" models in alternating sessions.
    * **Functionality:** Initializes trackers for epochs and validation losses. Loops while `training_active` and within `MAX_TOTAL_TRAINING_TIME_SECONDS`. In each iteration, it sets up parameters for the `current_model_turn`, calls `run_training_session`, updates total epochs and best validation loss, switches models, and takes a `BREAK_TIME_SECONDS` pause. A `finally` block provides a concluding summary.
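
The alternating loop described for Cell 6 can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual loop: `alternate_training` and its `run_session` argument are hypothetical names, and `run_session` stands in for `run_training_session` under the assumption that it returns the session's best validation loss (or `None` on failure). The clock and sleep functions are injectable only so the sketch can be exercised without waiting.

```python
import time


def alternate_training(run_session, max_total_seconds, break_seconds,
                       epochs_per_session, clock=time.monotonic, sleep=time.sleep):
    """Alternate training sessions between two models within a time budget.

    run_session(model_name, epochs_done, epochs_to_run) is a hypothetical
    stand-in for run_training_session; it is assumed to return the best
    validation loss of the session, or None if the session failed.
    """
    start = clock()
    total_epochs = {"tuned": 0, "blurred": 0}
    best_val_loss = {"tuned": float("inf"), "blurred": float("inf")}
    current = "tuned"
    try:
        # Keep going only while we are inside the overall time budget.
        while clock() - start < max_total_seconds:
            val_loss = run_session(current, total_epochs[current], epochs_per_session)
            if val_loss is None:
                break  # A failed session ends the whole run.
            total_epochs[current] += epochs_per_session
            best_val_loss[current] = min(best_val_loss[current], val_loss)
            # Switch models, then pause so memory can settle before the next session.
            current = "blurred" if current == "tuned" else "tuned"
            sleep(break_seconds)
    finally:
        # Concluding summary, printed even if a session raised an exception.
        print(f"Total epochs: {total_epochs} | best val loss: {best_val_loss}")
    return {"epochs": total_epochs, "best_val_loss": best_val_loss}
```

With the values from Cell 1, a call would look like `alternate_training(my_session_fn, MAX_TOTAL_TRAINING_TIME_SECONDS, BREAK_TIME_SECONDS, EPOCHS_PER_MAIN_SESSION)`, where `my_session_fn` is whatever wrapper you write around `run_training_session`.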


### <font color="dodgerblue">Cell 1: Initial Setup, Imports, and Core Paths</font>

This cell focuses on setting up the Python environment and defining essential file paths and parameters for our **YOLO (You Only Look Once) - Distance model**. We have two models we are training, but here we are going to start with the *same weights for both* so they have the same start point.

**<font color="teal">Imports:</font>**
* `sys` and `os`: For system and operating system interactions (like path manipulation and directory creation).
* `numpy`: For numerical operations.
* `tensorflow`: For deep learning tasks (the framework the YOLO model runs on). <font color="chocolate">The main reason we chose YOLO v3 is that its TensorFlow dependency versions let me hardware-accelerate training on my laptop. I know it's not ideal, but when you have 2 days to finish a model you do what you can.</font>
* `argparse`: For creating argument namespaces, suggesting this script might interact with or prepare arguments for another script like `train.py`.
* `time`: For time-related functions (e.g., pausing).
* `shutil`: For file operations like copying.
* `datetime`: <font color="chocolate">Used to time how long each pass takes and to stop training after 9 hours. This file was run a few times with different changes to the `train.py` file so having a timer was key to running it overnight and while I was out.</font>

**<font color="teal">Python Path Modification:</font>**
* It adds a specified repository path (`/Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with-distance/`) to the `sys.path`.
* <font color="chocolate">This is done to allow the script to import custom modules that aren't directly in this section of the repository. Originally I was using `train.py` and the other files provided by https://gitlab.com/EnginCZ/yolo-with-distance. However, to optimize this for Mac, those files had to be edited as well. Either way, the path points to wherever that repository is cloned.</font>

**<font color="teal">Import Verification:</font>**
* It attempts to import `get_classes` from `common.utils` (a module within the appended repository path). This acts as a check to ensure the path modification was successful and the necessary custom code is accessible. If the import fails, an error message is printed, and the script exits.

**<font color="teal">Core Path Definitions:</font>**
* It defines several string variables for crucial file and directory paths:
    * `initial_weights_path`: Path to pre-trained model weights (`trained_final_original.h5`). <font color="chocolate">These weights are the starting point for both models. These are graciously provided by the earlier linked GitLab. We couldn't have done this project without their help, thank you!</font>
    * `runs_base_path`: A base directory for storing model training runs.
    * `tuned_model_save_dir` and `blurred_model_save_dir`: Subdirectories within `runs_base_path` for saving models trained on tuned (normal) and blurred datasets, respectively.
    * `tuned_model_best_h5_path` and `blurred_model_best_h5_path`: Specific paths for the best saved `.h5` model files for each dataset type. <font color="chocolate">These are our checkpoints as we had runaway errors with our weights in the past.</font>
    * `normal_images_base_path`, `blurred_images_base_path`, `labels_base_path`: Paths to the directories containing normal images, pre-blurred images, and their corresponding label files.
    * `log_dir_base_tuned`, `log_dir_base_blurred`, and their `_test` counterparts: Base directories for storing training logs. The script anticipates that a training script (`train.py`) will create timestamped subfolders within these.
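
Because `train.py` creates a new timestamped subfolder per run, later cells need a way to find the newest one. A minimal sketch of that lookup is below; `latest_subdir` is a hypothetical name, and picking the newest folder by modification time (rather than by parsing the timestamp in the folder name) is an assumption, since the folder naming scheme is not shown here.

```python
import os


def latest_subdir(base_dir):
    """Return the most recently modified subdirectory of base_dir, or None.

    Assumption: "latest" is decided by modification time, not by parsing
    the timestamp out of the folder name.
    """
    if not os.path.isdir(base_dir):
        return None
    subdirs = [entry.path for entry in os.scandir(base_dir) if entry.is_dir()]
    if not subdirs:
        return None
    return max(subdirs, key=os.path.getmtime)
```

For example, `latest_subdir(log_dir_base_tuned)` would return the newest run folder for the tuned model, or `None` if no run has been logged yet.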

**<font color="teal">Directory Creation:</font>**
* It iterates through a list of the defined path variables and uses `os.makedirs(path, exist_ok=True)` to ensure all these directories exist, creating them if they don't.

**<font color="teal">Common YOLO Parameters:</font>**
* A dictionary `common_yolo_params` is created to store shared configuration parameters for the YOLO model. This includes:
    * `model_type` (e.g., 'yolo3_xception').
    * Paths to anchor and class definition files.
    * `model_image_size` (e.g., (608, 608)).
    * `elim_grid_sense` (a boolean training parameter).

**<font color="teal">Training Control Parameters:</font>**
**Note:** There is a 2-minute time buffer when switching which model is training. This was to minimize memory compression on my laptop, which uses shared memory; I've had projects fail training sessions in the past due to model changes.
* It defines several constants to control the training process:
    * `MAX_TOTAL_TRAINING_TIME_SECONDS`: Maximum duration for the entire training.
    * `BREAK_TIME_SECONDS`: Pause duration between switching models or training sessions.
    * `EPOCHS_PER_MAIN_SESSION`: Number of epochs to train each model for in a primary training loop.
    * `BASE_EPOCHS_ORIGINAL_MODEL`: The number of epochs the initial weights were trained for.

**<font color="teal">Print Statements:</font>**
* The cell concludes by printing several of the defined paths and control parameters for verification and then a "Cell 1: Setup Complete ---" message.

In [2]:
# Cell 1: Initial Setup, Imports, and Core Paths
# This cell focuses on setting up the Python environment and defining essential
# file paths and parameters for our YOLO (You Only Look Once) - Distance model.
# We have two models we are training but here we are going to start with the
# same weights for both so they have the same start point.

import sys # For system-specific parameters and functions, like modifying the Python path.
import os # For interacting with the operating system (e.g., file paths, directory creation).
import numpy as np # For numerical operations, especially with arrays.
# from PIL import Image # Imported by data.py, may not be directly needed here (commented out as per original)
import tensorflow as tf # For deep learning tasks; YOLO v3 was chosen for TensorFlow dependency versions to ensure hardware acceleration on a Mac.
# import matplotlib.pyplot as plt # Optional: for any inline plotting/debugging (commented out as per original)
import argparse # For creating argument namespaces, useful if this script prepares args for train.py.
import time # For time-related functions (e.g., pausing).
import shutil # For file operations like copying.
from datetime import datetime # Used to time how long each pass takes and to stop training after a set duration.

print("--- Cell 1: Initial Setup ---")

# Add the cloned repository directory to the Python path
# This allows importing custom modules from that specific repository.
# Originally from https://gitlab.com/EnginCZ/yolo-with-distance, but files were edited for Mac optimization.
repo_path = '/Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with-distance/'
if repo_path not in sys.path:
    sys.path.append(repo_path)
    print(f"Appended to sys.path: {repo_path}")
else:
    print(f"{repo_path} already in sys.path.")

# Attempt to import a module from the repo to verify path
# This acts as a check to ensure the path modification was successful.
try:
    from common.utils import get_classes # Used in annotation file creation from the yolo-with-distance repo.
    print("Successfully imported 'get_classes' from common.utils.")
except ModuleNotFoundError as e:
    print(f"ERROR: Could not import 'get_classes'. ModuleNotFoundError: {e}")
    sys.exit("Critical import failed. Ensure common.utils is accessible via repo_path.")
except ImportError as e:
    print(f"ERROR: Could not import 'get_classes' due to ImportError: {e}")
    sys.exit("Critical import failed. Check dependencies or conflicts.")


# --- Core Paths ---
# Defines several string variables for crucial file and directory paths.
initial_weights_path = os.path.join(repo_path, 'weights', 'trained_final_original.h5') # Path to pre-trained model weights (starting point for both models, provided by EnginCZ).
runs_base_path = '/Users/teaguesangster/Code/Python/ComputerVisionFinal/runs' # A base directory for storing model training runs.

# Subdirectories for saving models trained on tuned (normal) and blurred datasets.
tuned_model_save_dir = os.path.join(runs_base_path, "tuned")
blurred_model_save_dir = os.path.join(runs_base_path, "blurred")

# Specific paths for the best saved .h5 model files (checkpoints).
tuned_model_best_h5_path = os.path.join(tuned_model_save_dir, "yolo-distance-tuned_best.h5")
blurred_model_best_h5_path = os.path.join(blurred_model_save_dir, "yolo-distance-blured_best.h5")

# Paths to the directories containing image and label data.
normal_images_base_path = "/Users/teaguesangster/Code/Python/ComputerVisionFinal/TrainingData/data_object_image_2/training/image_2/"
blurred_images_base_path = "/Users/teaguesangster/Code/Python/ComputerVisionFinal/TrainingData/BlurredImages/BlurredTrainingData/"
labels_base_path = "/Users/teaguesangster/Code/Python/ComputerVisionFinal/TrainingData/training/label_2/"

# Base log directories (train.py will create timestamped subfolders within these).
log_dir_base_tuned = os.path.join(repo_path, 'logs_finetune_tuned')
log_dir_base_blurred = os.path.join(repo_path, 'logs_finetune_blurred')
test_log_dir_base_tuned = log_dir_base_tuned + "_test" # For quick test runs
test_log_dir_base_blurred = log_dir_base_blurred + "_test"

# List of paths to ensure existence (will be created if they don't exist).
paths_to_create = [
    runs_base_path, tuned_model_save_dir, blurred_model_save_dir,
    log_dir_base_tuned, log_dir_base_blurred,
    test_log_dir_base_tuned, test_log_dir_base_blurred
]
for path_to_create in paths_to_create: # Iterate and create directories.
    os.makedirs(path_to_create, exist_ok=True) # exist_ok=True prevents error if directory already exists.
    # print(f"Ensured directory exists: {path_to_create}")

# --- Common YOLO Parameters (for train.py and annotation generation) ---
# A dictionary to store shared configuration parameters for the YOLO model.
common_yolo_params = {
    "model_type": 'yolo3_xception', # Specifies the YOLO model architecture.
    "anchors_path": os.path.join(repo_path, 'configs', 'yolo3_anchors.txt'), # Path to anchor definitions.
    "classes_path": os.path.join(repo_path, 'configs', 'kitty_all_except_nodata.txt'), # Path to class definitions.
    "model_image_size": (608, 608), # Input image size for the model.
    "elim_grid_sense": True # A boolean training parameter from the original notebook.
}
# print(f"Common YOLO parameters: {common_yolo_params}")

# --- Training Control Parameters ---
# Defines several constants to control the training process.
MAX_TOTAL_TRAINING_TIME_SECONDS = 9 * 3600 # Maximum duration (9 hours) for the entire training (used for overnight runs).
BREAK_TIME_SECONDS = 120 # Pause duration (2 minutes) between switching models or training sessions.
EPOCHS_PER_MAIN_SESSION = 5 # Number of epochs to train each model for in a primary training loop.
BASE_EPOCHS_ORIGINAL_MODEL = 43 # The number of epochs the initial_weights_path model was originally trained for.

# Print some of the defined paths and control parameters for verification.
print(f"Initial weights path: {initial_weights_path}")
print(f"Max training time: {MAX_TOTAL_TRAINING_TIME_SECONDS / 3600:.2f} hours")
print(f"Break time between model switches: {BREAK_TIME_SECONDS} seconds")
print(f"Epochs per main training session: {EPOCHS_PER_MAIN_SESSION}")
print(f"Base epochs of original model: {BASE_EPOCHS_ORIGINAL_MODEL}")
print("--- Cell 1: Setup Complete ---")

--- Cell 1: Initial Setup ---
Appended to sys.path: /Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with-distance/
Successfully imported 'get_classes' from common.utils.
Initial weights path: /Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with-distance/weights/trained_final_original.h5
Max training time: 9.00 hours
Break time between model switches: 120 seconds
Epochs per main training session: 5
Base epochs of original model: 43
--- Cell 1: Setup Complete ---


### <font color="dodgerblue">Cell 2: Helper Function - Create Aggregated Annotation File</font>

Here we define `create_aggregated_annotation_file`, which combines all of the image labels for each of our datasets. We use the *same labels for both of our datasets* (this is because we are training on modified versions of the same images). The function aggregates all of our KITTI-style label files into a single annotation file, with one line per image listing the image path followed by its objects. <font color="chocolate">Word of advice! If you want to use this code, make sure you make separate label files if you have separate datasets.</font>

**Here's a breakdown of its functionality:**

1.  **<font color="teal">Load Class Names</font>**
    * Class names here actually don't come from the individual label files; they come from a separate class-definition file passed in as `model_classes_file_path` (set from `common_yolo_params` in Cell 1, where it points to `kitty_all_except_nodata.txt`).
    * <font color="chocolate">We do this as we are folding some classes into each other to simplify training. I'm pretty sure this is why our initial code run lost all of its confidence. We didn't freeze most of our layers and as such, we started removing already trained classes by folding labels in on themselves.</font>

2.  **<font color="teal">Create Class ID Map</font>**
    * It then creates a mapping (`name_to_id_map`) from these class names to integer class IDs, which is necessary for formatting the annotations in the way YOLO expects.

3.  **<font color="teal">Directory Checks</font>**
    * It checks if the provided `label_folder` and `image_folder` exist. If either is not found, it prints an error and returns, preventing further execution.
    * <font color="chocolate">If you are running this in Colab, you have to import and mount Google Drive before this step.</font>
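
The class loading and ID mapping from steps 1 and 2 can be sketched as below. Note that `load_class_names` is a hypothetical stand-in for the repository's `get_classes`, written under the assumption that the classes file holds one class name per line; the real helper may behave differently.

```python
def load_class_names(classes_path):
    """Hypothetical stand-in for get_classes: one class name per non-empty line."""
    with open(classes_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def build_name_to_id_map(class_names):
    """Map each class name to its integer ID, in file order."""
    return {name: index for index, name in enumerate(class_names)}
```

The IDs follow the order of the names in the file, so reordering the classes file changes every class ID in the generated annotations.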

**<font color="teal">We then iterate through all label files:</font>**
* It lists all `.txt` files (label files) in the `label_folder` and sorts them for consistent processing order. If no label files are found, it prints an error and returns.
* For each label file:
    * It determines the corresponding image filename (by replacing `.txt` with `.png`). <font color="chocolate">Again, we have 2 datasets here using the same labels, so be careful!</font>
    * It constructs the absolute path to the image file and checks if this image file *actually exists* in the `image_folder`. If not, it increments a `skipped_images_count` and moves to the next label file.
    * It initializes a list `line_parts_for_image` starting with the absolute path to the current image file. This list will eventually form one line in the output annotation file.

    **<font color="darkorange">Parse Individual Label Files:</font>**
    * It opens the current KITTI label file.
    * For each line (representing an object) in the KITTI label file:
        * It splits the line into parts.
        * It extracts the ground truth class name.
        * It checks if the class name is 'DontCare' or if it's not present in the `name_to_id_map` (meaning it's not a class the model is being trained on). If so, the object is skipped.
        * If the object is relevant, it attempts to parse:
            * The 2D bounding box coordinates (`xmin`, `ymin`, `xmax`, `ymax`).
            * <font color="chocolate">The most important thing here for our training was the **distance** of the object (from the 14th element, `obj_parts[13]`, which is the z-coordinate in KITTI). We load this value alongside all the expected values for a YOLO dataset.</font>
        * We then format all of our information into a string: `"int(xmin),int(ymin),int(xmax),int(ymax),class_id_model,distance:.6f"`. The distance is formatted as a float with 6 decimal places. This string is appended to `line_parts_for_image`.
        * Error handling (a `try-except` block) is included to catch issues during the parsing of individual object lines (e.g., `ValueError`, `KeyError`, `IndexError`) and prints a warning if an object is skipped.
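
To make the per-object parsing concrete, here is a small standalone sketch of turning one KITTI label line into the box string described above. `kitti_line_to_box_string` is a hypothetical helper name (the notebook does this inline inside the loop), and the sample line in the usage note is made up for illustration.

```python
def kitti_line_to_box_string(kitti_line, name_to_id_map):
    """Convert one KITTI label line to 'xmin,ymin,xmax,ymax,class_id,distance'.

    Returns None when the object should be skipped (short line, 'DontCare',
    a class the model is not trained on, or unparsable numbers).
    """
    parts = kitti_line.strip().split()
    if len(parts) < 14:
        return None
    class_name = parts[0]
    if class_name == "DontCare" or class_name not in name_to_id_map:
        return None
    try:
        # Fields 4..7 are the 2D bounding box; field 13 is the z-coordinate (distance).
        xmin, ymin, xmax, ymax = (float(value) for value in parts[4:8])
        distance = float(parts[13])
    except ValueError:
        return None
    class_id = name_to_id_map[class_name]
    return f"{int(xmin)},{int(ymin)},{int(xmax)},{int(ymax)},{class_id},{distance:.6f}"
```

For instance, with `{"Car": 0}` as the class map, the line `Car 0.00 0 -1.57 599.41 156.40 629.75 189.25 2.85 2.63 12.34 0.47 1.49 69.44 -1.56` becomes `599,156,629,189,0,69.440000`. Note that `int()` truncates the box coordinates rather than rounding them.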

**<font color="teal">Aggregate Annotations:</font>**
* If `line_parts_for_image` contains more than just the image path (meaning at least one valid object was found for that image), the parts are joined into a single string with spaces and appended to the `aggregated_annotations` list.

**<font color="teal">Report Skipped Images:</font>**
* If any images were skipped because their files weren't found, a warning message is printed.

**<font color="teal">Write Output File:</font>**
* Finally, it opens the specified `output_annotation_file_path` in write mode.
* It writes each line from `aggregated_annotations` to this file, followed by a newline character.
* It prints a confirmation message indicating where the aggregated file was created and how many image entries it contains.

*The cell also includes print statements at the beginning and end to indicate the definition phase of this function.*
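
A quick way to sanity-check the generated file is to read a line back into structured data. The sketch below is not part of the notebook; `parse_annotation_line` is a hypothetical helper, and it assumes image paths contain no spaces, since the output format uses spaces to separate the path from each box.

```python
def parse_annotation_line(line):
    """Split one aggregated annotation line into (image_path, list of box dicts).

    Assumes the image path contains no spaces.
    """
    image_path, *box_strings = line.split()
    boxes = []
    for box_string in box_strings:
        xmin, ymin, xmax, ymax, class_id, distance = box_string.split(",")
        boxes.append({
            "xmin": int(xmin), "ymin": int(ymin),
            "xmax": int(xmax), "ymax": int(ymax),
            "class_id": int(class_id),
            "distance": float(distance),
        })
    return image_path, boxes
```

Running this over the first few lines of an annotation file and checking that every image has at least one box, and that distances are positive, catches most formatting mistakes before a long training run starts.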

In [3]:
# Cell 2: Helper Function - Create Aggregated Annotation File
# Here we define the create_aggregated_annotation_file which handles combining all of the
# image labels for both of our datasets. We are using the same labels for both datasets
# because we are training off modifications made to the images.
# Word of advice! If you want to use this code make sure you make separate label files
# if you have separate datasets.
print("--- Cell 2: Defining create_aggregated_annotation_file function ---")

def create_aggregated_annotation_file(image_folder, label_folder, model_classes_file_path, output_annotation_file_path):
    """
    Aggregates KITTI-style label files into a single annotation file for YOLO training.
    Combines individual label files with their corresponding image paths and formats
    object data (bbox, class_id, distance) as required by the training script.
    """
    # 1. Load Class names
    # Class names here actually don't come from the individual label files; they come from
    # a separate definition in Cell 1 (model_classes_file_path which points to
    # kitty_all_except_nodata.txt via common_yolo_params).
    # We do this as we are folding some classes into each other to simplify training.
    # I'm pretty sure this is why our initial code run lost all of its confidence.
    # We didn't freeze most of our layers and as such, we started removing already
    # trained classes by folding labels in on themselves.
    print(f"Attempting to load model class names from: {model_classes_file_path}")
    model_class_names_list = get_classes(model_classes_file_path) # Uses imported get_classes from common.utils
    # 2. It then creates a mapping (name_to_id_map) from these class names to integer class IDs,
    # which is necessary for formatting the annotations.
    name_to_id_map = {name: i for i, name in enumerate(model_class_names_list)}
    # print(f"Model class map for annotation generation: {name_to_id_map}")

    aggregated_annotations = [] # List to store each line of the final annotation file.

    # 3. Directory Checks:
    # It checks if the provided label_folder and image_folder exist.
    # If you are running this in Colab you have to add an import drive function before.
    if not os.path.isdir(label_folder):
        print(f"ERROR: Label folder not found at {label_folder}")
        return
    if not os.path.isdir(image_folder):
        print(f"ERROR: Image folder not found at {image_folder}")
        return

    # We then iterate through all label files ->
    # It lists all .txt files (label files) in the label_folder and sorts them
    # for consistent processing order.
    label_filenames = sorted([f for f in os.listdir(label_folder) if f.endswith('.txt')])
    if not label_filenames:
        print(f"Error: No label files (.txt) found in {label_folder}")
        return
    # normpath strips a trailing slash so basename does not come back empty.
    print(f"Found {len(label_filenames)} label files in {label_folder} for {os.path.basename(os.path.normpath(image_folder))}. Processing...")

    skipped_images_count = 0 # Counter for images whose files are not found.
    # For each label file:
    for label_filename in label_filenames:
        # It determines the corresponding image filename (by replacing .txt with .png).
        # Again we have 2 datasets here using the same labels so be careful!
        image_filename = os.path.splitext(label_filename)[0] + '.png'  # Swap only the extension.
        # It constructs the absolute path to the image file.
        image_file_path_abs = os.path.join(image_folder, image_filename)
        
        # and checks if this image file actually exists in the image_folder.
        # If not, it increments a skipped_images_count and moves to the next label file.
        if not os.path.exists(image_file_path_abs):
            skipped_images_count += 1
            continue # Skip this label file if the corresponding image doesn't exist.

        kitti_label_file_path = os.path.join(label_folder, label_filename)
        # It initializes a list line_parts_for_image starting with the absolute path
        # to the current image file. This list will eventually form one line
        # in the output annotation file.
        line_parts_for_image = [image_file_path_abs]

        # Parse Individual Label Files:
        # It opens the current KITTI label file.
        with open(kitti_label_file_path, 'r') as f_label:
            # For each line (representing an object) in the KITTI label file:
            for kitti_line in f_label:
                obj_parts = kitti_line.strip().split() # It splits the line into parts.
                # Basic validation of the line format
                if not obj_parts or len(obj_parts) < 14: continue
                # It extracts the ground truth class name.
                class_name_gt = obj_parts[0]
                # It checks if the class name is 'DontCare' or if it's not present in the
                # name_to_id_map (meaning it's not a class the model is being trained on).
                # If so, the object is skipped.
                if class_name_gt == 'DontCare' or class_name_gt not in name_to_id_map: continue
                
                # If the object is relevant, it attempts to parse:
                try:
                    # The 2D bounding box coordinates (xmin, ymin, xmax, ymax).
                    xmin, ymin, xmax, ymax = float(obj_parts[4]), float(obj_parts[5]), float(obj_parts[6]), float(obj_parts[7])
                    # The most important thing here for our training was the distance of the object
                    # (from the 14th element, obj_parts[13], which is the z-coordinate in KITTI).
                    # We load this value alongside all the expected values for a YOLO dataset.
                    distance = float(obj_parts[13])
                    class_id_model = name_to_id_map[class_name_gt] # Get the integer class ID.
                    
                    # We then format all of our information into a string:
                    # "int(xmin),int(ymin),int(xmax),int(ymax),class_id_model,distance:.6f".
                    # The distance is formatted as a float with 6 decimal places. This string is appended to line_parts_for_image.
                    # Ensuring distance is formatted as float, as data.py's get_ground_truth_data expects float for distance
                    box_info_str = f"{int(xmin)},{int(ymin)},{int(xmax)},{int(ymax)},{class_id_model},{distance:.6f}" # Keep precision for distance
                    line_parts_for_image.append(box_info_str)
                # Error handling (try-except block) is included to catch issues during the
                # parsing of individual object lines (e.g., ValueError, KeyError, IndexError)
                # and prints a warning if an object is skipped.
                except (ValueError, KeyError, IndexError) as e:
                    print(f"Warning: Skipping object in {label_filename} for image {image_filename} due to parsing error: '{kitti_line.strip()}' | Error: {e}")
        
        # Aggregate Annotations:
        # If line_parts_for_image contains more than just the image path (meaning at
        # least one valid object was found for that image), the parts are joined into
        # a single string with spaces and appended to the aggregated_annotations list.
        if len(line_parts_for_image) > 1: # Only add if there are objects for this image.
            aggregated_annotations.append(" ".join(line_parts_for_image))

    # Report Skipped Images:
    # If any images were skipped because their files weren't found, a warning message is printed.
    if skipped_images_count > 0:
        print(f"Warning: Skipped {skipped_images_count} entries because corresponding image files were not found in {image_folder}.")

    # Write Output File:
    # Finally, it opens the specified output_annotation_file_path in write mode.
    with open(output_annotation_file_path, 'w') as f_out:
        # It writes each line from aggregated_annotations to this file, followed by a newline character.
        for line in aggregated_annotations:
            f_out.write(line + "\n")
            
    # It prints a confirmation message indicating where the aggregated file was created
    # and how many image entries it contains.
    print(f"Aggregated annotation file created at: {output_annotation_file_path} with {len(aggregated_annotations)} image entries.")

# The cell also includes print statements at the beginning and end to indicate the
# definition phase of this function.
print("--- Cell 2: Definition Complete ---")

--- Cell 2: Defining create_aggregated_annotation_file function ---
--- Cell 2: Definition Complete ---


In [4]:
# Cell 3: Generate Annotation Files for Both Datasets
print("--- Cell 3: Generating Annotation Files ---")

annotation_file_tuned = os.path.join(repo_path, "kitti_train_list_for_keras_yolo_tuned.txt")
print(f"\nCreating annotation file for yolo-distance-tuned using images from: {normal_images_base_path}")
create_aggregated_annotation_file(
    image_folder=normal_images_base_path,
    label_folder=labels_base_path,
    model_classes_file_path=common_yolo_params['classes_path'],
    output_annotation_file_path=annotation_file_tuned
)

annotation_file_blurred = os.path.join(repo_path, "kitti_train_list_for_keras_yolo_blured.txt")
print(f"\nCreating annotation file for yolo-distance-blured using images from: {blurred_images_base_path}")
create_aggregated_annotation_file(
    image_folder=blurred_images_base_path,
    label_folder=labels_base_path,
    model_classes_file_path=common_yolo_params['classes_path'],
    output_annotation_file_path=annotation_file_blurred
)
print("--- Cell 3: Annotation File Generation Complete ---")

--- Cell 3: Generating Annotation Files ---

Creating annotation file for yolo-distance-tuned using images from: /Users/teaguesangster/Code/Python/ComputerVisionFinal/TrainingData/data_object_image_2/training/image_2/
Attempting to load model class names from: /Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with-distance/configs/kitty_all_except_nodata.txt
Found 6784 label files in /Users/teaguesangster/Code/Python/ComputerVisionFinal/TrainingData/training/label_2/ for . Processing...
Aggregated annotation file created at: /Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with-distance/kitti_train_list_for_keras_yolo_tuned.txt with 6080 image entries.

Creating annotation file for yolo-distance-blured using images from: /Users/teaguesangster/Code/Python/ComputerVisionFinal/TrainingData/BlurredImages/BlurredTrainingData/
Attempting to load model class names from: /Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with

### <font color="dodgerblue">Cell 4: Helper Function - `run_training_session` (Revised)</font>

This cell defines a few separate pieces that let me feel safe leaving the training process unattended: **safety callbacks**, helper functions for **log directory management**, and the main **`run_training_session` function**. <font color="chocolate">The callbacks weren't in the original attempt at training but were added after our loss value sadly exploded.</font>

---

#### <font color="teal">Safety Callbacks (Keras Callbacks):</font>

1.  **`LossGuardCallback`**: <font color="darkorange">To make sure our loss doesn't Explode!</font>
    * The main goal here is to flag any batch where the total loss goes over `50`. The raw confidence term could go over `50`, and did in some sections, but because of its low weighting most of our total loss came from the distance calculations.
    * If the total loss goes over `50`, an internal flag (`loss_exploded`) is set.
    * At the end of an epoch, if the `loss_exploded` flag is set, it indicates that the checkpoint for that epoch should be considered invalid or skipped. It then resets the flag. <font color="chocolate">This function doesn't *do* the skipping; it just lets us know if we need to skip / not save due to loss.</font>

2.  **`EmergencyStopCallback`**: <font color="darkorange">Another Loss managing function!</font>
    * Here we are monitoring the training loss at the end of each batch.
    * If the loss exceeds a `critical_threshold` (e.g., `1000.0`), it prints a critical warning and signals the Keras model to stop training (`self.model.stop_training = True`). This is a more drastic measure to prevent runaway training with extremely high losses.
    * <font color="chocolate">In our initial run, I came back in the morning to see a loss of 2000. At that point, there was no reason to continue training, and as such, that run was effectively removed/restarted.</font>
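
The two thresholds can be checked without TensorFlow. The sketch below is a framework-free stand-in, not the notebook's Keras callbacks: `LossMonitorSketch` and `simulate_epoch` are hypothetical names, and the sketch assumes training halts right after the batch that crosses the critical threshold.

```python
class LossMonitorSketch:
    """Framework-free stand-in for the two callbacks (hypothetical, for illustration)."""

    def __init__(self, guard_threshold=50.0, critical_threshold=1000.0):
        self.guard_threshold = guard_threshold        # LossGuardCallback threshold
        self.critical_threshold = critical_threshold  # EmergencyStopCallback threshold
        self.loss_exploded = False   # mirrors LossGuardCallback.loss_exploded
        self.stop_training = False   # mirrors self.model.stop_training

    def on_batch_end(self, logs=None):
        loss = (logs or {}).get("loss")
        if loss is None:
            return  # nothing to check for this batch
        if loss > self.guard_threshold:
            self.loss_exploded = True
        if loss > self.critical_threshold:
            self.stop_training = True

    def on_epoch_end(self):
        """Report whether this epoch's checkpoint is suspect, then reset the flag."""
        suspect = self.loss_exploded
        self.loss_exploded = False
        return suspect


def simulate_epoch(monitor, batch_losses):
    """Feed per-batch losses to the monitor; return (batches_run, checkpoint_suspect)."""
    batches_run = 0
    for loss in batch_losses:
        monitor.on_batch_end({"loss": loss})
        batches_run += 1
        if monitor.stop_training:
            break  # assumed behaviour: stop right after the offending batch
    return batches_run, monitor.on_epoch_end()


monitor = LossMonitorSketch()
print(simulate_epoch(monitor, [3.0, 75.0, 2.5]))    # (3, True): flagged, training continues
print(simulate_epoch(monitor, [3.0, 2000.0, 2.5]))  # (2, True): flagged and stopped early
```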

---

#### <font color="teal">Re-import `train.py`'s `main` function:</font>
* <font color="chocolate">We had issues before with this not being imported correctly, so we call it again just in case.</font> It includes error handling if the import fails.
* <font color="chocolate">We have output statements later on as well, as I was making modifications in subdirectories and wanted to make sure messing with those didn't ruin the training run.</font>

---

#### <font color="teal">Helper Function: `get_latest_timestamped_log_subdir(base_log_dir)`:</font>

* **Purpose:** To find the most recently created timestamped subdirectory within a given `base_log_dir`. The `train.py` script is expected to create log directories with names like `"YYYY_MM_DD_HH_MM_SS"`.
* **Functionality:**
    * Checks if `base_log_dir` exists. It then lists all subdirectories in `base_log_dir`. (So we make sure we are handling all our logs correctly).
    * We then iterate through these subdirectories, attempting to parse their names using the expected timestamp format.
    * We keep track of the subdirectory with the latest modification time whose name matches the timestamp format. This is then returned to be the latest valid timestamped subdirectory.
    * <font color="chocolate">Includes a fallback to return the `base_log_dir` itself if no valid timestamped subdirectory is found, with a comment stating this might happen if `train.py` failed early or if the log structure changed.</font>
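
The notebook's helper ranks matching directories by modification time. As an alternative sketch (not what the cell below does), the parsed directory name itself can be compared, which may be more predictable if directories are copied or touched after training. `latest_dir_by_name` is a hypothetical name, and the directory names in the demo are made up.

```python
import os
import tempfile
from datetime import datetime

TIMESTAMP_FORMAT = "%Y_%m_%d_%H_%M_%S"  # directory naming expected from train.py


def latest_dir_by_name(base_log_dir):
    """Return the subdirectory whose name parses to the latest timestamp, or None."""
    best_path, best_stamp = None, None
    for name in os.listdir(base_log_dir):
        path = os.path.join(base_log_dir, name)
        if not os.path.isdir(path):
            continue
        try:
            stamp = datetime.strptime(name, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # not a timestamped directory, ignore it
        if best_stamp is None or stamp > best_stamp:
            best_path, best_stamp = path, stamp
    return best_path


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as base:
    for name in ("2025_05_10_09_00_00", "2025_05_11_22_30_15", "tensorboard"):
        os.makedirs(os.path.join(base, name))
    latest_name = os.path.basename(latest_dir_by_name(base))

print(latest_name)  # 2025_05_11_22_30_15
```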

---

#### <font color="teal">Helper Function: `find_best_epoch_checkpoint_in_session_log_dir(session_actual_log_dir)`:</font>

* **Purpose:** To scan a specific session's actual log directory (which is timestamped and found by the previous function) and identify the best model checkpoint file (e.g., `epXXX-lossYYY-val_lossZZZ.h5`) based on the **lowest validation loss**. <font color="chocolate">We need this because if the weights exploded in one run and we still had time left over, we could re-run from the best surviving checkpoint and try to improve.</font>
* **Functionality:**
    * Checks if the provided `session_actual_log_dir` is valid.
    * Lists all `.h5` files starting with `"ep"` (epoch checkpoints).
    * Iterates through these files, parsing the filename to extract the `val_loss` value.
    * It includes a safety check to skip checkpoints if their parsed `val_loss` is suspiciously high (e.g., `> 50.0`, matching the `LossGuardCallback` threshold).
    * Keeps track of the file with the minimum valid `val_loss`.
    * Returns the full path to the best checkpoint file found.
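
The selection steps above can be sketched on plain filenames. This is a minimal version that uses a regular expression instead of the split-based parsing in the cell below; the exact filename shape is an assumption based on the `epXXX-lossYYY-val_lossZZZ.h5` example, and `pick_best_checkpoint` is a hypothetical name.

```python
import re

# Assumed filename shape, e.g. ep010-loss12.345-val_loss11.234.h5
CHECKPOINT_PATTERN = re.compile(
    r"^ep(?P<epoch>\d+)-loss(?P<loss>\d+(?:\.\d+)?)-val_loss(?P<val_loss>\d+(?:\.\d+)?)\.h5$"
)


def pick_best_checkpoint(filenames, max_val_loss=50.0):
    """Return (filename, val_loss) with the lowest val_loss at or below max_val_loss."""
    best_name, best_val = None, float("inf")
    for name in filenames:
        match = CHECKPOINT_PATTERN.match(name)
        if match is None:
            continue  # not an epoch checkpoint
        val_loss = float(match.group("val_loss"))
        if val_loss > max_val_loss:
            continue  # suspiciously high, same idea as the 50.0 guard
        if val_loss < best_val:
            best_name, best_val = name, val_loss
    return best_name, best_val


# Made-up filenames for illustration only.
names = [
    "ep001-loss14.200-val_loss13.900.h5",
    "ep002-loss9.800-val_loss10.100.h5",
    "ep003-loss812.000-val_loss640.500.h5",
    "trained_final.h5",
]
print(pick_best_checkpoint(names))  # ('ep002-loss9.800-val_loss10.100.h5', 10.1)
```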

---

In [5]:
# Cell 4: Helper Function - run_training_session (Revised)
# This cell defines several components aimed at managing and executing a single
# training session for a YOLO model, including safety callbacks and helper functions
# for log directory management.
# The callbacks weren't in the original attempt at training but were added
# after our loss value sadly exploded.
print("--- Cell 4: Defining run_training_session function (Revised) ---")


# --- Safety Callbacks (Keras Callbacks) ---
# These were added to help manage training and prevent runaway processes,
# especially when leaving the model to train unattended.

# LossGuardCallback: To make sure our loss doesn't Explode!
# Here the main thing is we wanted to make sure our total loss values never went over 50.
# Confidence could've gone over 50 and did in some sections, but because of its
# weighting most of our loss came from distance calculations.
class LossGuardCallback(tf.keras.callbacks.Callback):
    def __init__(self, threshold=50.0): # Threshold for "exploded" loss.
        super(LossGuardCallback, self).__init__()
        self.threshold = threshold
        self.loss_exploded = False # Flag to indicate if loss has exploded.
        
    def on_batch_end(self, batch, logs=None):
        # At the end of each batch, check if the loss exceeds the threshold.
        # A missing 'loss' entry is ignored, since comparing None to a float raises TypeError.
        loss = (logs or {}).get('loss')
        if loss is not None and loss > self.threshold:
            print(f"\n⚠️ Warning: Loss exploded to {loss}! Marking checkpoint as invalid.")
            self.loss_exploded = True # Set flag if loss is too high.
            
    def on_epoch_end(self, epoch, logs=None):
        # At the end of an epoch, if the loss_exploded flag is set, it indicates that
        # the checkpoint for that epoch should be considered invalid or skipped.
        # This function doesn't do the skipping it just lets us know if we need to skip / not save due to loss.
        if self.loss_exploded:
            print(f"⚠️ Epoch {epoch+1} checkpoint will be skipped due to loss explosion.")
            self.loss_exploded = False  # Reset for next epoch to monitor anew.

# EmergencyStopCallback: Another Loss managing function!
# Here we are monitoring the training loss at the end of each batch.
# If the loss exceeds a critical_threshold, it stops training.
# In our initial run I came back in the morning to see a loss of 2000. At that point
# there was no reason to continue training, so that run was discarded and restarted.
class EmergencyStopCallback(tf.keras.callbacks.Callback):
    def __init__(self, critical_threshold=1000.0): # Higher threshold for emergency stop.
        super(EmergencyStopCallback, self).__init__()
        self.critical_threshold = critical_threshold
        
    def on_batch_end(self, batch, logs=None):
        # If loss exceeds the critical threshold, stop the training.
        # A missing 'loss' entry is ignored, since comparing None to a float raises TypeError.
        loss = (logs or {}).get('loss')
        if loss is not None and loss > self.critical_threshold:
            print(f"\n🛑 CRITICAL: Loss exploded to {loss}! Stopping training.")
            self.model.stop_training = True # Signal Keras model to stop.


# --- Re-import train.py's main function ---
# We had issues before with this not being imported correctly so we call it again just in case.
# We have output statements later on as well, as I was making modifications in subdirectories
# and wanted to make sure messing with those didn't ruin the training run.
try:
    from train import main as run_yolo_training_script # Imported from train.py from the yolo-with-distance repo.
    print("Successfully re-imported 'main' as 'run_yolo_training_script' from train.py for this cell.")
except ImportError:
    print("ERROR: Could not import 'main' from train.py in Cell 4. Ensure train.py is accessible.")
    run_yolo_training_script = None # Set to None if import fails to prevent errors later.


# --- Helper Function: get_latest_timestamped_log_subdir ---
def get_latest_timestamped_log_subdir(base_log_dir):
    """
    Finds the most recent timestamped subdirectory within the base_log_dir.
    The train.py script is expected to create log directories with names like "YYYY_MM_DD_HH_MM_SS".
    """
    if not os.path.isdir(base_log_dir): # Ensure base_log_dir itself exists.
        print(f"Warning: Base log directory {base_log_dir} does not exist.")
        return None
        
    subdirs = [] # List to store paths of subdirectories.
    # Lists all subdirectories in base_log_dir. (So we make sure we are handling all our logs correctly).
    for d_name in os.listdir(base_log_dir):
        d_path = os.path.join(base_log_dir, d_name)
        if os.path.isdir(d_path): # Check if it's a directory.
            subdirs.append(d_path)
            
    if not subdirs:
        print(f"No subdirectories found in {base_log_dir}.")
        return None
    
    latest_subdir_found = None # To store the path of the latest valid subdirectory.
    latest_time = 0 # To store the modification time of the latest valid subdirectory.

    # We then iterate through these subdirectories, attempting to parse their names
    # using the expected timestamp format.
    for subdir_path in subdirs:
        try:
            # Check if dirname matches expected timestamp format from train.py
            datetime.strptime(os.path.basename(subdir_path), "%Y_%m_%d_%H_%M_%S")
            # If format matches, check its modification time.
            mod_time = os.path.getmtime(subdir_path)
            # We keep track of the subdirectory with the latest modification time
            # whose name matches the timestamp format.
            if mod_time > latest_time:
                latest_time = mod_time
                latest_subdir_found = subdir_path
        except ValueError:
            # Subdirectory name doesn't match the timestamp format, ignore it.
            continue
            
    if latest_subdir_found:
        print(f"Identified latest timestamped log directory: {latest_subdir_found}")
    else:
        print(f"Could not find a valid timestamped log directory in {base_log_dir}. Will attempt to scan base dir.")
        # Fallback to return the base_log_dir itself if no valid timestamped subdirectory is found,
        # This might happen if train.py failed early or if the log structure changed.
        return base_log_dir # Or None, depending on how strictly we want to enforce timestamped dirs.
        
    return latest_subdir_found # This is then returned to be the latest valid timestamped subdirectory.


# --- Helper Function: find_best_epoch_checkpoint_in_session_log_dir ---
def find_best_epoch_checkpoint_in_session_log_dir(session_actual_log_dir):
    """
    Scans a specific session's actual log directory (timestamped) for epoch checkpoint files
    (e.g., epXXX-lossYYY-val_lossZZZ.h5) and returns path to the one with lowest val_loss.
    We need this as if weights exploded in one run and we still had time leftover, we could re-run and try to improve.
    """
    # Checks if the provided session_actual_log_dir is valid.
    if not session_actual_log_dir or not os.path.isdir(session_actual_log_dir):
        print(f"Error: Session log directory '{session_actual_log_dir}' is invalid or not provided.")
        return None

    # Lists all .h5 files starting with "ep" (epoch checkpoints).
    checkpoint_files = [f for f in os.listdir(session_actual_log_dir) if f.startswith('ep') and f.endswith('.h5')]
    best_val_loss = float('inf') # Initialize with a very high value.
    best_checkpoint_file_path = None # To store the path of the best checkpoint.

    if not checkpoint_files:
        print(f"No epoch checkpoint files (ep*.h5) found in {session_actual_log_dir}.")
        return None

    # Iterates through these files, parsing the filename to extract the val_loss value.
    for fname in checkpoint_files:
        try:
            parts = fname.replace('.h5', "").split('-') # Split filename by '-' to find val_loss part.
            val_loss_str = None
            for part in parts:
                if part.startswith('val_loss'): # Look for the part starting with 'val_loss'.
                    val_loss_str = part.replace('val_loss', '') # Extract the numeric value.
                    break
            if val_loss_str:
                val_loss = float(val_loss_str)
                
                # It includes a safety check to skip checkpoints if their parsed val_loss is
                # suspiciously high (e.g., > 50.0, matching the LossGuardCallback threshold).
                if val_loss > 50.0:  # Same threshold as in LossGuardCallback.
                    print(f"⚠️ Skipping checkpoint {fname} with suspiciously high loss {val_loss}")
                    continue # Skip this checkpoint.
                
                # Keeps track of the file with the minimum valid val_loss.
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    best_checkpoint_file_path = os.path.join(session_actual_log_dir, fname)
        except Exception as e:
            print(f"Error parsing val_loss from {fname} in {session_actual_log_dir}: {e}")

    if best_checkpoint_file_path:
        print(f"Identified best session checkpoint: {os.path.basename(best_checkpoint_file_path)} with val_loss: {best_val_loss:.3f}")
    else:
        print(f"Could not determine a best session checkpoint file in {session_actual_log_dir} from epXXX.h5 files.")
            
    return best_checkpoint_file_path # Returns the full path to the best checkpoint file found.
# Note: run_training_session itself is defined in the next cell.

--- Cell 4: Defining run_training_session function (Revised) ---
Successfully re-imported 'main' as 'run_yolo_training_script' from train.py for this cell.



#### <font color="darkorange">Now for the Main Function!!! `run_training_session(...)`:</font>
Because we are running two models with different training parameters, we have a single main function that takes the model to train (`model_name_str`, for "tuned" or "blurred"). We also pass the number of epochs to run in that session. It handles loading weights, setting up training arguments, calling the external training script, and then saving the best checkpoint from that session.

* **Parameters:**
    * `model_name_str`: What model we actually want to train! (e.g., `"tuned"`, `"blurred"`).
    * `current_total_fine_tune_epochs_for_model`: The total number of epochs this model has already been fine-tuned for in previous sessions (used to calculate `init_epoch`).
    * `annotation_file_path`: Path to the aggregated annotation file for the dataset.
    * `model_base_log_dir`: The base directory where logs for this model type are stored (e.g., `log_dir_base_tuned`).
    * `target_overall_best_h5_path`: The path where the overall best checkpoint for this model type should be saved (e.g., `tuned_model_best_h5_path`).
    * `epochs_to_run_this_session`: The number of epochs to train for in this specific call.
* **Functionality:**
    * **Weight Loading:** Determines which weights to load. If a `target_overall_best_h5_path` exists, it loads those weights; otherwise, it falls back to `initial_weights_path`.
    * **Directory Creation:** Ensures `model_base_log_dir` exists.
    * **Epoch Calculation:** Calculates `session_init_epoch` (starting epoch number for `train.py`) and `session_total_epoch` (target ending epoch number).
    * **Argument Setup (`argparse.Namespace`):** Creates an `argparse.Namespace` object (`train_args_for_session`) to simulate command-line arguments for `train.py`. This is where all training hyperparameters are set:
        * Uses `common_yolo_params` (from Cell 1) for model type, anchor/class paths, image size.
        * Sets `weights_path` to the determined weights to load.
        * Sets `annotation_file`, `log_dir`.
        * Specifies `batch_size` (modified to `8`), `learning_rate` (very low, `1e-7`), `optimizer` (modified to `'rmsprop'` for ARM compatibility), `clipnorm`, `clipvalue` (added for stability). <font color="chocolate">You can feel free to change this, but again we were limited by training hardware.</font>
        * Sets `freeze_level` (modified to `2`, freeze more layers), `label_smoothing` (added).
        * `elim_grid_sense` is set to `True` (changed to always `True` to prevent confidence loss explosion).
        * Sets `workers` and `max_queue_size` (modified/reduced for ARM).
        * Sets `loss_weights` with custom values to balance confidence, location, class, and distance components.
        * **<font color="red">Important Modification:</font>** `eval_online` is set to `False` to disable online evaluation during training. I couldn't get this working but maybe you can! 
    * **GPU Configuration:** Clears the Keras backend session and attempts to set memory growth for physical GPUs to `True` to avoid TensorFlow allocating all GPU memory at once.
    * **Call Training Script:** Calls `run_yolo_training_script(train_args_for_session)`, which is the imported `main` function from `train.py`.
    * **Post-Training Checkpoint Handling:**
        * After the training script completes, it calls `get_latest_timestamped_log_subdir` to find the actual log directory created by `train.py` for this session.
        * Then, it calls `find_best_epoch_checkpoint_in_session_log_dir` to find the best `.h5` checkpoint within that session's log directory.
        * **Loss Check on Filename:** It performs a basic string check on the filename of the best checkpoint for indicators of very high loss (e.g., `"loss1000"`). If such indicators are found, it prints a warning and does *not* update the `target_overall_best_h5_path`. Again we only want to save the best run!
        * If a valid best checkpoint is found (and no high loss indicators in its name), it copies this checkpoint to `target_overall_best_h5_path` using `shutil.copy2`.
        * Includes warnings if no best checkpoint is found or if the specific log directory couldn't be identified.
    * **Error Handling:** Wraps the training script call and checkpoint handling in a `try-except` block to catch any exceptions during training, print a traceback, and set `session_success` to `False`.
    * **Return Values:** Returns `session_success` (boolean) and `epochs_actually_run` (integer).

*The cell concludes with print statements indicating its definition and noting the "Revised with eval_online=False" status. The key purpose of this cell is to provide a robust wrapper around the external `train.py` script, allowing for iterative training sessions, management of weights and logs, and safety checks against exploding losses.*
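
The epoch calculation and the `argparse.Namespace` setup described above can be sketched in isolation. This is a minimal illustration, not the notebook's function: `session_epoch_window` and `build_session_args` are hypothetical names, and the base epoch count of `100` is a made-up stand-in for `BASE_EPOCHS_ORIGINAL_MODEL`, whose real value is set elsewhere in the notebook.

```python
import argparse


def session_epoch_window(base_epochs, fine_tune_epochs_done, epochs_this_session):
    """Return (init_epoch, total_epoch) for one training session."""
    init_epoch = base_epochs + fine_tune_epochs_done
    total_epoch = init_epoch + epochs_this_session
    return init_epoch, total_epoch


def build_session_args(base_epochs, fine_tune_epochs_done, epochs_this_session, **overrides):
    """Bundle the epoch window and any extra settings the way a CLI parser would."""
    init_epoch, total_epoch = session_epoch_window(
        base_epochs, fine_tune_epochs_done, epochs_this_session
    )
    return argparse.Namespace(init_epoch=init_epoch, total_epoch=total_epoch, **overrides)


# Three back-to-back sessions of 5 epochs each, starting from a hypothetical base of 100.
done = 0
windows = []
for _ in range(3):
    windows.append(session_epoch_window(100, done, 5))
    done += 5
print(windows)  # [(100, 105), (105, 110), (110, 115)]

args = build_session_args(100, 5, 5, batch_size=8, learning_rate=1e-7)
print(args.init_epoch, args.total_epoch, args.batch_size)  # 105 110 8
```

Each session picks up where the previous one ended, so the epoch numbers in checkpoint filenames keep increasing across sessions instead of restarting.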

In [6]:
# Modified run_training_session function
# --- Main Function: run_training_session(...) ---
# Because we are running two models with different training parameters we have a main
# function that can take in which model it needs (model_name_str for "tuned" or "blurred").
# We also pass if we need it for a given number of epochs. It handles loading weights,
# setting up training arguments, calling the external training script, and then
# saving the best checkpoint from that session.
def run_training_session(model_name_str, # What model we actually want to train! (e.g., "tuned", "blurred").
                         current_total_fine_tune_epochs_for_model, # The total number of epochs this model has already been fine-tuned for in previous sessions (used to calculate init_epoch).
                         annotation_file_path, # Path to the aggregated annotation file for the dataset.
                         model_base_log_dir, # The base directory where logs for this model type are stored (e.g., log_dir_base_tuned).
                         target_overall_best_h5_path, # The path where the overall best checkpoint for this model type should be saved (e.g., tuned_model_best_h5_path).
                         epochs_to_run_this_session): # The number of epochs to train for in this specific call.
    if run_yolo_training_script is None: # Check if the training script function was imported successfully.
        print("CRITICAL ERROR: run_yolo_training_script (from train.py) is not defined. Cannot train.")
        return False, 0 # Return failure and 0 epochs run.

    print(f"\n--- Preparing Training Session for: {model_name_str} ---")
    print(f"Will attempt to train for {epochs_to_run_this_session} epochs this session.")
    
    # Weight Loading: Determines which weights to load.
    weights_to_load = initial_weights_path # Default to initial weights.
    # If a target_overall_best_h5_path exists, it loads those weights;
    if os.path.exists(target_overall_best_h5_path):
        weights_to_load = target_overall_best_h5_path # Use previously saved best weights.
        print(f"Loading previously saved best weights for {model_name_str} from: {weights_to_load}")
    else:
        # otherwise, it falls back to initial_weights_path.
        print(f"No previous overall best weights found at {target_overall_best_h5_path}. Loading initial weights: {initial_weights_path}")

    # Directory Creation: Ensures model_base_log_dir exists.
    os.makedirs(model_base_log_dir, exist_ok=True)
    
    # Epoch Calculation: Calculates session_init_epoch (starting epoch number for train.py)
    # and session_total_epoch (target ending epoch number).
    session_init_epoch = BASE_EPOCHS_ORIGINAL_MODEL + current_total_fine_tune_epochs_for_model
    session_total_epoch = session_init_epoch + epochs_to_run_this_session

    # Argument Setup (argparse.Namespace): Creates an argparse.Namespace object
    # (train_args_for_session) to simulate command-line arguments for train.py.
    # This is where all training hyperparameters are set:
    train_args_for_session = argparse.Namespace(
        # Uses common_yolo_params (from Cell 1) for model type, anchor/class paths, image size.
        model_type=common_yolo_params['model_type'],
        weights_path=weights_to_load, # Sets weights_path to the determined weights to load.
        annotation_file=annotation_file_path, # Sets annotation_file, log_dir.
        anchors_path=common_yolo_params['anchors_path'],
        classes_path=common_yolo_params['classes_path'],
        model_image_size=common_yolo_params['model_image_size'],
        # Specifies batch_size (modified to 8), learning_rate (very low, 1e-7),
        # optimizer (modified to 'rmsprop' for ARM compatibility), clipnorm,
        # clipvalue (added for stability). You can feel free to change this but
        # again we were limited by training hardware.
        batch_size=8,  # MODIFIED: Reduced batch size for stability
        init_epoch=session_init_epoch,
        total_epoch=session_total_epoch,
        learning_rate=1e-7,  # Keep this very low
        optimizer='rmsprop',  # MODIFIED: Switch to rmsprop for ARM compatibility
        clipnorm=5.0,  # MODIFIED: Reasonable gradient clipping
        clipvalue=10.0, # ADDED: Also clip by value for additional stability
        log_dir=model_base_log_dir, 
        checkpoint_period=1, 
        val_split=0.1,
        val_annotation_file=None,
        freeze_level=2,  # MODIFIED: Freeze more layers to stabilize training.
        transfer_epoch=0,
        multiscale=False,
        rescale_interval=10,
        enhance_augment=None,
        label_smoothing=0.1,  # MODIFIED: Added some label smoothing
        multi_anchor_assign=False,
        elim_grid_sense=True,  # Changed to always True to prevent confidence loss explosion.
        data_shuffle=True,
        gpu_num=1,
        model_pruning=False,
        eval_online=False, # Important modification: disables online evaluation (EvalCallBack) during training to avoid potential errors.
        eval_epoch_interval=1, 
        save_best_only=True, 
        save_eval_checkpoint=False,
        dataset_working_directory="", 
        decay_type=None,
        lr_patience=5,
        min_lr=1e-8,
        early_stopping_patience=10,
        workers=8,  # MODIFIED: Reduced workers for ARM processors
        use_multiprocessing=True, 
        max_queue_size=32,  # MODIFIED: Reduced queue size
        # Updated loss weights to better balance the confidence, location, class, and distance components.
        loss_weights={'confidence': 0.01, 'location': 1.0, 'class': 1.0, 'dist': 3.0},
    )
    if not train_args_for_session.eval_online:
        print("NOTE: Online evaluation (EvalCallBack) is disabled for this session to avoid potential errors.")
    
    print(f"Effective training epochs for {model_name_str}: {train_args_for_session.init_epoch} to {train_args_for_session.total_epoch - 1}.")

    session_success = False # Flag to track if the session completed without critical errors.
    epochs_actually_run = 0 # Counter for epochs run in this session.
    
    # Error Handling: Wraps the training script call and checkpoint handling in a
    # try-except block to catch any exceptions during training, print a traceback,
    # and set session_success to False.
    try:
        # GPU Configuration: Clears the Keras backend session and attempts to set memory
        # growth for physical GPUs to True to avoid TensorFlow allocating all GPU memory at once.
        print(f"Clearing Keras session and configuring GPU for {model_name_str} training...")
        tf.keras.backend.clear_session() # Clear previous Keras states.
        physical_gpus = tf.config.list_physical_devices('GPU') # Get list of GPUs.
        if physical_gpus:
            try:
                # Attempt to enable memory growth for each GPU.
                for gpu in physical_gpus:
                    tf.config.experimental.set_memory_growth(gpu, True)
            except RuntimeError as e:
                # This error can occur if memory growth is already set or cannot be set.
                print(f"Note: GPU memory growth setting issue (may be already set): {e}")
        else:
            print(f"WARNING: No GPU detected by TensorFlow for {model_name_str} session!")

        # Call Training Script: Calls run_yolo_training_script(train_args_for_session),
        # which is the imported main function from train.py.
        print(f"Calling training script for {model_name_str}...")
        run_yolo_training_script(train_args_for_session) # Execute the training.
        print(f"Training script call completed for {model_name_str}.")
        
        epochs_actually_run = epochs_to_run_this_session # Assume all requested epochs ran if no exception.
        session_success = True # Mark session as successful.

        # Post-Training Checkpoint Handling:
        # Since train.py creates a timestamped subdirectory, we need to find it.
        # After the training script completes, it calls get_latest_timestamped_log_subdir
        # to find the actual log directory created by train.py for this session.
        session_specific_log_dir_actual = get_latest_timestamped_log_subdir(model_base_log_dir)
        
        if session_specific_log_dir_actual:
            print(f"Scanning for best checkpoint in actual session log directory: {session_specific_log_dir_actual}")
            # Then, it calls find_best_epoch_checkpoint_in_session_log_dir to find the
            # best .h5 checkpoint within that session's log directory.
            best_checkpoint_from_this_session = find_best_epoch_checkpoint_in_session_log_dir(session_specific_log_dir_actual)
            
            if best_checkpoint_from_this_session and os.path.exists(best_checkpoint_from_this_session):
                # Loss Check on Filename: It performs a basic string check on the filename
                # of the best checkpoint for indicators of very high loss (e.g., "loss1000").
                checkpoint_filename = os.path.basename(best_checkpoint_from_this_session)
                
                # Simple string check for suspiciously high loss values in filename.
                high_loss_indicators = ["loss100", "loss200", "loss500", "loss1000"]
                # If such indicators are found, it prints a warning and does *not*
                # update the target_overall_best_h5_path.
                if any(indicator in checkpoint_filename for indicator in high_loss_indicators):
                    print(f"⚠️ WARNING: Best checkpoint filename indicates very high loss: {checkpoint_filename}")
                    print(f"The overall best model at {target_overall_best_h5_path} will NOT be updated.")
                else:
                    # If a valid best checkpoint is found (and no high loss indicators in its name),
                    # it copies this checkpoint to target_overall_best_h5_path using shutil.copy2.
                    print(f"Best checkpoint from this session found: {best_checkpoint_from_this_session}")
                    print(f"Copying to target overall best model path: {target_overall_best_h5_path}")
                    shutil.copy2(best_checkpoint_from_this_session, target_overall_best_h5_path) # Copy the best session checkpoint.
                    print(f"Successfully updated overall best model for {model_name_str} at: {target_overall_best_h5_path}")
            else:
                # Includes warnings if no best checkpoint is found or if the specific log
                # directory couldn't be identified.
                print(f"WARNING: No best checkpoint found in {session_specific_log_dir_actual} from this session for {model_name_str}.")
                print(f"The overall best model at {target_overall_best_h5_path} was NOT updated this session.")
        else:
            print(f"WARNING: Could not identify the specific log subdirectory for this session in {model_base_log_dir}. Cannot copy best model.")

    except Exception as e:
        print(f"!!! ERROR during training script execution for {model_name_str}: {e} !!!")
        import traceback
        traceback.print_exc() # Print full traceback for debugging.
        session_success = False # Mark session as failed.
        epochs_actually_run = 0 # No epochs considered run if there was an error.

    print(f"--- Training Session for {model_name_str} Concluded (Success: {session_success}, Epochs run this session: {epochs_actually_run}) ---")
    # Return Values: Returns session_success (boolean) and epochs_actually_run (integer).
    return session_success, epochs_actually_run
# The cell concludes with print statements indicating its definition and noting
# the "Revised with eval_online=False" status. The key purpose of this cell is
# to provide a robust wrapper around the external train.py script, allowing for
# iterative training sessions, management of weights and logs, and safety checks
# against exploding losses.
print("--- Cell 4: Definition Complete (Revised with eval_online=False) ---")

--- Cell 4: Definition Complete (Revised with eval_online=False) ---



### <font color="dodgerblue">Here is our trial run!</font>

Because so much is happening in both of our models, and because the main loop is designed to run for so long, it became clear that a testing cell was required. Its most important job is to confirm that the pipeline works for both the "tuned" (normal images) and "blurred" (blurred images) models. We do this by checking that the `run_training_session` function (defined in Cell 4) can execute for a minimal number of epochs without critical errors, using dedicated test log directories and test model save paths.

Here's the breakdown:

1.  **<font color="teal">Test Run Configuration:</font>**
    * `EPOCHS_FOR_TEST_RUN` is set to `1`. This means each model will only attempt to train for a single epoch during this test phase, making it quick.
    * The cell prints messages to clearly indicate the start of this test phase and the (short) number of epochs to be used.

2.  **<font color="teal">Test `yolo-distance-tuned` (Normal Images Model):</font>**
    * A specific path for where the "best" model from *this particular test run* would be saved is defined (`test_best_h5_path_tuned`). This path is within the `tuned_model_save_dir` (from Cell 1) but has a `_QUICK_TEST_best.h5` suffix to distinguish it from actual training run outputs.
    * Informative messages are printed, noting the start of the test for the "tuned" model, where its test logs will be stored (in `test_log_dir_base_tuned`, also from Cell 1), and the target save path for its test model.
    * The `run_training_session` function is then called with the following key parameters:
        * `model_name_str`: `"yolo-distance-tuned (Test)"`
        * `current_total_fine_tune_epochs_for_model`: `0` <font color="chocolate">(since it's a fresh test, we're not continuing previous fine-tuning for this test sequence; it will usually just load the `initial_weights_path`).</font>
        * `annotation_file_path`: `annotation_file_tuned` <font color="chocolate">(This variable, pointing to the aggregated annotation file for normal images generated in Cell 2, must have been defined in a preceding cell).</font>
        * `model_base_log_dir`: `test_log_dir_base_tuned`
        * `target_overall_best_h5_path`: `test_best_h5_path_tuned`
        * `epochs_to_run_this_session`: `EPOCHS_FOR_TEST_RUN` (which is 1)
    * After the `run_training_session` call, the script checks the `success_test_tuned` flag and the `epochs_done_test_tuned` value.
    * If the session was successful and the epoch completed, a success message is printed. It then also checks if the `test_best_h5_path_tuned` file was actually created (this might not happen if, for example, the validation loss didn't improve in the single epoch, and `save_best_only` is True in `train.py`).
    * If the session was not successful or didn't complete the epoch, a failure message is printed, advising the user to check the logs for errors.

3.  **<font color="teal">Break Time:</font>**
    * A `time.sleep(BREAK_TIME_SECONDS)` is called (using the `BREAK_TIME_SECONDS` constant defined in Cell 1). This introduces a pause before the script proceeds to test the next model. <font color="chocolate">We do this so that I don't get too much memory pressure on my Mac; if you were running this in Colab, I doubt you would need the extra time to avoid a crash or performance issues.</font>

4.  **<font color="teal">Test `yolo-distance-blured` (Blurred Images Model):</font>**
    * This section mirrors the test for the "tuned" model but targets the model configuration for blurred images.
    * A specific path for the "best" model from *this test run* is defined (`test_best_h5_path_blurred`), again using a `_QUICK_TEST_best.h5` suffix.
    * Messages are printed indicating the start of this test, its log directory (`test_log_dir_base_blurred` from Cell 1), and its model save path. There are a lot of comments here, but that's because you never know what will break, and the comments tell us where we were.
    * `run_training_session` is called with parameters tailored for the blurred model:
        * `model_name_str`: `"yolo-distance-blured (Test)"`
        * `current_total_fine_tune_epochs_for_model`: `0`
        * `annotation_file_path`: `annotation_file_blurred` <font color="chocolate">(Similar to `annotation_file_tuned`, this path to the annotation file for the blurred image dataset, generated in Cell 2, must have been defined in a preceding cell, likely Cell 3).</font>
        * `model_base_log_dir`: `test_log_dir_base_blurred`
        * `target_overall_best_h5_path`: `test_best_h5_path_blurred`
        * `epochs_to_run_this_session`: `EPOCHS_FOR_TEST_RUN` (which is 1)
    * The success and epoch completion for this blurred model test run are then checked, with corresponding success/failure messages and verification for the saved test model file.

5.  **<font color="teal">Conclusion and Review Instructions:</font>**
    * Finally, the cell prints a message indicating that the "Quick Test Run Phase" is complete.
    * **<font color="red">Crucially</font>**, it ends with some important reminders for my groupmates:
        * `>>> IMPORTANT: Review the output and logs from this test phase carefully (check the _test log directories). <<<`
        * `>>> Verify that 'Successfully updated best model...' messages appeared if expected, or that checkpoints were created. <<<`
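
One way to carry out the review above is to list whatever checkpoint files the test session left behind. The helper below is only a sketch and is not part of the notebook: `list_checkpoints` is a hypothetical name, and it assumes that `train.py` writes its checkpoints as `.h5` files somewhere under the test log directory.

```python
import os

def list_checkpoints(log_dir, extension=".h5"):
    """Return checkpoint paths found under log_dir (searched recursively), sorted by path.

    Hypothetical helper: assumes checkpoints are saved with the given extension.
    A missing directory simply yields an empty list.
    """
    found = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if name.endswith(extension):
                found.append(os.path.join(root, name))
    return sorted(found)
```

After Cell 5 finishes, calling `list_checkpoints(test_log_dir_base_tuned)` and `list_checkpoints(test_log_dir_base_blurred)` should each return at least one path if the single test epoch saved a checkpoint.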
       

In [None]:
# Cell 5: Quick Test Run
# Here is our trial run!
# Because so much is happening in both of our models, and because the main
# loop is designed to run for so long, it became clear a testing cell was required.
# Its most important job is to confirm that the pipeline works for both the "tuned"
# (normal images) and "blurred" (blurred images) models. We do this by checking
# that the run_training_session function (defined in Cell 4) can execute for a
# minimal number of epochs without critical errors, using dedicated test log
# directories and test model save paths.
print("--- Cell 5: Quick Test Run Phase ---")

# 1. Test Run Configuration:
EPOCHS_FOR_TEST_RUN = 1 # This means each model will only attempt to train for a single epoch during this test phase, making it quick.
print(f"Quick test run will use {EPOCHS_FOR_TEST_RUN} epoch(s) per model.")

# --- Test yolo-distance-tuned (Normal Images Model) ---
# 2. Test `yolo-distance-tuned` (Normal Images Model):
# A specific path for where the "best" model from *this particular test run*
# would be saved is defined (`test_best_h5_path_tuned`). This path is within the
# `tuned_model_save_dir` (from Cell 1) but has a `_QUICK_TEST_best.h5` suffix
# to distinguish it from actual training run outputs.
test_best_h5_path_tuned = os.path.join(tuned_model_save_dir, "yolo-distance-tuned_QUICK_TEST_best.h5")
# Informative messages are printed, noting the start of the test for the "tuned" model,
# where its test logs will be stored (in `test_log_dir_base_tuned`, also from Cell 1),
# and the target save path for its test model.
print(f"\n--- Starting QUICK TEST RUN for yolo-distance-tuned ---")
print(f"Test logs will go into subdirectories of: {test_log_dir_base_tuned}")
print(f"Test best model will attempt to save to: {test_best_h5_path_tuned}")

# The `run_training_session` function is then called with the following key parameters:
# For a test run, current_total_fine_tune_epochs_for_model is 0 (since it's a fresh test,
# we're not continuing previous fine-tuning for this test sequence; it will usually just load
# the `initial_weights_path`).
# `annotation_file_path`: `annotation_file_tuned` (This variable, pointing to the
# aggregated annotation file for normal images generated in Cell 2, must have been defined
# in a preceding cell).
success_test_tuned, epochs_done_test_tuned = run_training_session(
    model_name_str="yolo-distance-tuned (Test)",
    current_total_fine_tune_epochs_for_model=0, 
    annotation_file_path=annotation_file_tuned, # Assumes annotation_file_tuned is defined previously
    model_base_log_dir=test_log_dir_base_tuned, 
    target_overall_best_h5_path=test_best_h5_path_tuned, 
    epochs_to_run_this_session=EPOCHS_FOR_TEST_RUN
)

# After the `run_training_session` call, the script checks the `success_test_tuned` flag
# and the `epochs_done_test_tuned` value.
if success_test_tuned and epochs_done_test_tuned == EPOCHS_FOR_TEST_RUN:
    print("SUCCESS: Quick test run for yolo-distance-tuned appears to have completed the epoch(s).")
    # It then also checks if the `test_best_h5_path_tuned` file was actually created
    # (this might not happen if, for example, the validation loss didn't improve in
    # the single epoch, and `save_best_only` is True in `train.py`).
    if os.path.exists(test_best_h5_path_tuned):
        print(f"  Test best model saved at {test_best_h5_path_tuned}")
    else:
        print(f"  Warning: Test best model for tuned was NOT found at {test_best_h5_path_tuned} (this might be okay if val_loss didn't improve in 1 epoch).")
else:
    # If the session was not successful or didn't complete the epoch, a failure message
    # is printed, advising the user to check the logs for errors.
    print("FAILED: Quick test run for yolo-distance-tuned encountered issues or did not complete epochs. Please check logs above.")

# 3. Break Time:
# A `time.sleep(BREAK_TIME_SECONDS)` is called (using the `BREAK_TIME_SECONDS`
# constant defined in Cell 1). This introduces a pause before the script proceeds
# to test the next model. We do this so that I don't get too much memory pressure on my Mac,
# but if you were running this in Colab I doubt you would need this extra time
# to avoid a crash or performance issues.
print(f"\nTaking a {BREAK_TIME_SECONDS} second break before testing the blurred model...")
time.sleep(BREAK_TIME_SECONDS)

# --- Test yolo-distance-blured (Blurred Images Model) ---
# 4. Test `yolo-distance-blured` (Blurred Images Model):
# This section mirrors the test for the "tuned" model but targets the model
# configuration for blurred images.
# A specific path for the "best" model from *this test run* is defined
# (`test_best_h5_path_blurred`), again using a `_QUICK_TEST_best.h5` suffix.
test_best_h5_path_blurred = os.path.join(blurred_model_save_dir, "yolo-distance-blured_QUICK_TEST_best.h5")
# Messages are printed indicating the start of this test, its log directory
# (`test_log_dir_base_blurred` from Cell 1), and its model save path.
# There's a lot of comments here but that's because you never know what breaks
# and comments tell us where we were at.
print(f"\n--- Starting QUICK TEST RUN for yolo-distance-blured ---")
print(f"Test logs will go into subdirectories of: {test_log_dir_base_blurred}")
print(f"Test best model will attempt to save to: {test_best_h5_path_blurred}")

# `run_training_session` is called with parameters tailored for the blurred model:
# `annotation_file_path`: `annotation_file_blurred` (Similar to `annotation_file_tuned`,
# this path to the annotation file for the blurred image dataset, generated in Cell 2,
# must have been defined in a preceding cell).
success_test_blurred, epochs_done_test_blurred = run_training_session(
    model_name_str="yolo-distance-blured (Test)",
    current_total_fine_tune_epochs_for_model=0,
    annotation_file_path=annotation_file_blurred, # Assumes annotation_file_blurred is defined previously
    model_base_log_dir=test_log_dir_base_blurred, 
    target_overall_best_h5_path=test_best_h5_path_blurred,
    epochs_to_run_this_session=EPOCHS_FOR_TEST_RUN
)

# The success and epoch completion for this blurred model test run are then checked,
# with corresponding success/failure messages and verification for the saved test model file.
if success_test_blurred and epochs_done_test_blurred == EPOCHS_FOR_TEST_RUN:
    print("SUCCESS: Quick test run for yolo-distance-blured appears to have completed the epoch(s).")
    if os.path.exists(test_best_h5_path_blurred):
        print(f"  Test best model saved at {test_best_h5_path_blurred}")
    else:
        print(f"  Warning: Test best model for blurred was NOT found at {test_best_h5_path_blurred}.")
else:
    print("FAILED: Quick test run for yolo-distance-blured encountered issues or did not complete epochs. Please check logs above.")
        
# 5. Conclusion and Review Instructions:
# The cell ends by printing a message indicating that the "Quick Test Run Phase" is complete.
print("\n--- Cell 5: Quick Test Run Phase Complete ---")
# Finally, some important reminders for my groupmates.
print(">>> IMPORTANT: Review the output and logs from this test phase carefully (check the _test log directories). <<<")
print(">>> Verify that 'Successfully updated best model...' messages appeared if expected, or that checkpoints were created. <<<")

--- Cell 5: Quick Test Run Phase ---
Quick test run will use 1 epoch(s) per model.

--- Starting QUICK TEST RUN for yolo-distance-tuned ---
Test logs will go into subdirectories of: /Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with-distance/logs_finetune_tuned_test
Test best model will attempt to save to: /Users/teaguesangster/Code/Python/ComputerVisionFinal/runs/tuned/yolo-distance-tuned_QUICK_TEST_best.h5

--- Preparing Training Session for: yolo-distance-tuned (Test) ---
Will attempt to train for 1 epochs this session.
No previous overall best weights found at /Users/teaguesangster/Code/Python/ComputerVisionFinal/runs/tuned/yolo-distance-tuned_QUICK_TEST_best.h5. Loading initial weights: /Users/teaguesangster/Code/Python/ComputerVisionFinal/ExistingModel/yolo-with-distance/weights/trained_final_original.h5
NOTE: Online evaluation (EvalCallBack) is disabled for this session to avoid potential errors.
Effective training epochs for yolo-distance-tuned (Test)

Process Keras_worker_SpawnPoolWorker-3:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/keras-yolo3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/keras-yolo3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/envs/keras-yolo3/lib/python3.8/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/opt/anaconda3/envs/keras-yolo3/lib/python3.8/multiprocessing/queues.py", line 355, in get
    with self._rlock:
  File "/opt/anaconda3/envs/keras-yolo3/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Process Keras_worker_SpawnPoolWorker-2:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/keras-yolo3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/keras-yolo3/lib/python3.8/mult

### <font color="dodgerblue">Cell 6: Time to Train! Main Alternating Training Loop</font>

Here we implement the real training loop, which alternates between training the "tuned" (normal images) and "blurred" (blurred images) YOLO-Distance models. It manages our overall training time, switches between the models, and calls the `run_training_session` function (from Cell 4) for each. The main addition in this section is tracking the total epochs trained for each model. <font color="chocolate">Again, because loss was our enemy early on, we had to monitor validation loss improvement.</font> The entire loop is designed to run for a predefined maximum duration (defined in Cell 1). <font color="chocolate">We don't have early stopping, as I really just wanted to see how good it could get overnight.</font> We again take periodic breaks between the models to relieve memory pressure.

**Here's a breakdown of its functionality:**

#### <font color="teal">Initialized Variables that control the run:</font>

* `best_val_loss_tuned` and `best_val_loss_blurred`: Initialized to infinity to keep track of the best validation loss achieved for each model during this entire run.
* `epochs_without_improvement_tuned` and `epochs_without_improvement_blurred`: Initialized to `0`, potentially for early stopping logic if we wanted to add this later on.
* `overall_start_time`: Records the starting time of the main loop. (Again, because we only train for a certain duration).
* `total_fine_tune_epochs_on_tuned` and `total_fine_tune_epochs_on_blurred`: Counters for the total number of fine-tuning epochs completed for each model *within this specific execution of the notebook*.
* `training_active`: A boolean flag set to `True` to control the `while` loop.
* `current_model_turn`: A string variable initialized to `"tuned"`, indicating which model starts the training sequence.
* <font color="green">Prints initial messages</font> about the maximum training duration, epochs per session, and break time (using constants from Cell 1).
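
The `epochs_without_improvement_*` counters are tracked but never acted on in this notebook. If early stopping were added later, a patience check could look roughly like the sketch below. `should_stop_early` and `PATIENCE_EPOCHS` are hypothetical names that do not exist in the notebook, and the patience value is an arbitrary placeholder.

```python
def should_stop_early(epochs_without_improvement, patience):
    """Return True once a model has gone `patience` or more epochs without a better val_loss."""
    return epochs_without_improvement >= patience

# Hypothetical constant; the notebook defines no such value.
PATIENCE_EPOCHS = 10

# A model that has stalled for 4 epochs keeps training; one stalled for 12 stops.
keep_going = not should_stop_early(4, PATIENCE_EPOCHS)
stop_now = should_stop_early(12, PATIENCE_EPOCHS)
```

Because the two models alternate, one reasonable design would be to skip a stalled model's turn rather than end the whole run, so the other model can keep using the remaining time budget.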

---

#### <font color="teal">Main `while training_active` Loop:</font>

This loop continues as long as `training_active` is `True`. <font color="chocolate">It is set to `False` once the time limit is exceeded or the run is interrupted.</font>

* **Time Check:** At the beginning of each iteration, it checks if the `elapsed_time_seconds` since `overall_start_time` has exceeded `MAX_TOTAL_TRAINING_TIME_SECONDS` (from Cell 1). If so, it prints a message, sets `training_active` to `False`, and breaks out of the loop. This ensures the training doesn't run indefinitely. <font color="chocolate">By default (or at least right now) it's set to 9 hours.</font>
* **Status Update:** Prints the current iteration status, including elapsed time and total fine-tune epochs for each model so far in this run.
* **Set Up Current Session Parameters:**
    Based on the `current_model_turn` ("tuned" or "blurred"), it sets up session-specific variables:
    * `session_model_name_display`: A string for display purposes.
    * `session_model_annotation_file`: The path to the correct aggregated annotation file (`annotation_file_tuned` or `annotation_file_blurred`, defined in a previous cell).
    * `session_model_base_log_dir`: The correct base log directory (`log_dir_base_tuned` or `log_dir_base_blurred` from Cell 1).
    * `session_model_target_best_h5`: The path where the overall best checkpoint for the current model should be saved (`tuned_model_best_h5_path` or `blurred_model_best_h5_path` from Cell 1).
    * `session_current_total_fine_tune_epochs`: The current accumulated fine-tuning epochs for the model whose turn it is.
* **Run Training Session:**
    * Prints which model is about to be trained.
    * Calls the `run_training_session` function (defined in Cell 4, which internally calls the `train.py` script) with all the prepared session-specific parameters and `EPOCHS_PER_MAIN_SESSION` (from Cell 1).
* **Process Session Results:**
    If the session was successful (`success_this_session` is `True`) and at least one epoch was completed:
    * It attempts to find the best checkpoint from that specific session using `get_latest_timestamped_log_subdir` and `find_best_epoch_checkpoint_in_session_log_dir` (both from Cell 4).
    * If a best checkpoint is found, it tries to parse the `val_loss` from its filename.
    * It updates the total fine-tune epochs for the current model.
    * It compares the `current_val_loss` with the stored `best_val_loss_` for that model. If the current is better, it updates `best_val_loss_` and resets `epochs_without_improvement_`. Otherwise, it increments `epochs_without_improvement_`. <font color="chocolate">This allows for tracking if the model is still improving.</font>
    * If parsing `val_loss` fails or no best checkpoint is found, it still increments the total fine-tune epochs for the model.
    * If the session had issues, a warning is printed, and it's noted that the model will be retried on its next turn if time permits.
* **Switch Model Turn:** The `current_model_turn` is switched from `"tuned"` to `"blurred"`, or vice-versa, for the next iteration.
* **Post-Session Time Check & Break:**
    * Checks the `elapsed_time_seconds` again. If the maximum training time is reached, it exits the loop.
    * If `training_active` is still `True`, it prints a message about taking a break (using `BREAK_TIME_SECONDS` from Cell 1) and which model is up next, then pauses using `time.sleep()`. <font color="chocolate">(Again, handling memory pressure).</font>
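
The filename-parsing step above can be sketched in isolation. This is a minimal example, assuming checkpoint names follow a keras-yolo3 style pattern such as `ep003-loss12.345-val_loss11.222.h5`; that pattern is an assumption about `train.py`, and `parse_val_loss` is a hypothetical helper rather than a function from the notebook.

```python
import os
import re

# Assumed filename pattern: ep003-loss12.345-val_loss11.222.h5
_VAL_LOSS_RE = re.compile(r"val_loss(\d+(?:\.\d+)?)")

def parse_val_loss(checkpoint_path):
    """Return the val_loss encoded in a checkpoint filename, or None if it is absent."""
    match = _VAL_LOSS_RE.search(os.path.basename(checkpoint_path))
    return float(match.group(1)) if match else None
```

A regular expression sidesteps the trailing `.h5`, which would otherwise end up inside the last piece of the name when it is split on `-`.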

---

#### <font color="teal">Loop Termination (`try-except-finally`):</font>

* The `while` loop is wrapped in a `try` block.
* **`KeyboardInterrupt`:** If the user manually interrupts the training (e.g., Ctrl+C), it catches `KeyboardInterrupt`, prints an interruption message, and sets `training_active` to `False`.
* **`finally` Block:** This block executes regardless of how the loop terminated (normally, by time limit, or by interruption).
    * It calculates and prints the `final_elapsed_time_seconds`.
    * It prints a summary of the total fine-tuning epochs completed for each model during this entire run.
    * It indicates the expected path for the final best model for each type and checks if the file exists at that path, printing an appropriate message.
    * It reminds the user to check the log directories for detailed TensorBoard logs and session checkpoints.

* **Cell Completion Message:** Prints `"--- Cell 6: Main Training Loop Execution Complete ---"`.
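
The control flow described above (time budget, alternating turns, and a `finally` summary) can be reduced to a small, self-contained sketch. It is not the notebook's implementation: `run_alternating` and its `run_session` and `clock` parameters are hypothetical stand-ins, `run_session` replaces `run_training_session`, and the break between sessions is left out.

```python
import time

def run_alternating(run_session, max_seconds, clock=time.time):
    """Alternate "tuned"/"blurred" sessions until max_seconds have elapsed.

    run_session(turn) is a stand-in for run_training_session and should
    return the number of epochs completed in that session.
    """
    totals = {"tuned": 0, "blurred": 0}
    turn = "tuned"
    start = clock()
    try:
        while clock() - start < max_seconds:
            totals[turn] += run_session(turn)
            turn = "blurred" if turn == "tuned" else "tuned"
    except KeyboardInterrupt:
        print("Interrupted by user; stopping.")
    finally:
        # Runs on normal exit, on time-out, and on interruption.
        print(f"Epochs this run: {totals}")
    return totals
```

Passing the clock in as a parameter makes the time limit easy to exercise with a fake clock, without waiting for real hours to pass.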

In [None]:
# Cell 6: Main Alternating Training Loop
# Here we implement the real training loop, which alternates between training
# the "tuned" (normal images) and "blurred" (blurred images) YOLO-Distance models.
# It manages our overall training time, switches between these models, and calls
# the run_training_session function (from Cell 4) for each. The main addition
# in this section is tracking the total epochs trained for each model.
# Again, because loss was our enemy early on, we had to monitor validation loss improvement.
# The entire loop is designed to run for a predefined maximum duration (defined in Cell 1).
# We don't have early stopping, as I really just wanted to see how good it could get overnight.
# We again take periodic breaks between the models to relieve memory pressure.
print("--- Cell 6: Main Alternating Training Loop ---")

# --- Initialized Variables that control the run ---
# Add tracking of best validation loss
best_val_loss_tuned = float('inf') # Initialized to infinity to keep track of the best validation loss for 'tuned' model.
best_val_loss_blurred = float('inf') # Initialized to infinity for 'blurred' model.
# Initialized to 0, potentially for early stopping logic if we wanted to add this later on.
epochs_without_improvement_tuned = 0
epochs_without_improvement_blurred = 0

overall_start_time = time.time() # Records the starting time of the main loop. (Again because we only train for a certain duration)
# Counters for the total number of fine-tuning epochs completed for each model
# within this specific execution of the notebook.
total_fine_tune_epochs_on_tuned = 0
total_fine_tune_epochs_on_blurred = 0
training_active = True # A boolean flag set to True to control the while loop.
current_model_turn = "tuned"  # A string variable initialized to "tuned", indicating which model starts the training sequence.

# Prints initial messages about the maximum training duration, epochs per session,
# and break time (using constants from Cell 1).
print(f"\n--- INITIATING MAIN TRAINING LOOP (Max duration: {MAX_TOTAL_TRAINING_TIME_SECONDS / 3600:.2f} hours) ---")
print(f"Epochs per model session: {EPOCHS_PER_MAIN_SESSION}, Break between sessions: {BREAK_TIME_SECONDS}s")

# --- Loop Termination (try-except-finally) ---
# The while loop is wrapped in a try block.
try:
    # --- Main while training_active Loop ---
    # This loop continues as long as training_active is True. It is set to False
    # once the time limit is exceeded or the run is interrupted.
    while training_active:
        # Time Check: At the beginning of each iteration, it checks if the elapsed_time_seconds
        # since overall_start_time has exceeded MAX_TOTAL_TRAINING_TIME_SECONDS (from Cell 1).
        elapsed_time_seconds = time.time() - overall_start_time
        if elapsed_time_seconds >= MAX_TOTAL_TRAINING_TIME_SECONDS:
            # If so, it prints a message, sets training_active to False, and breaks out of the loop.
            # This ensures the training doesn't run indefinitely. By default (Or at least right now) it's set to 9 hours.
            print(f"\nMaximum training time of {MAX_TOTAL_TRAINING_TIME_SECONDS / 3600:.2f} hours reached. Stopping.")
            training_active = False
            break # Exit while loop immediately

        # Status Update: Prints the current iteration status, including elapsed time and
        # total fine-tune epochs for each model so far in this run.
        print(f"\n{'='*20} Main Loop Iteration {'='*20}")
        print(f"Time elapsed: {elapsed_time_seconds / 3600:.2f} / {MAX_TOTAL_TRAINING_TIME_SECONDS / 3600:.2f} hours")
        print(f"Total fine-tune epochs for Tuned model (this run): {total_fine_tune_epochs_on_tuned}")
        print(f"Total fine-tune epochs for Blurred model (this run): {total_fine_tune_epochs_on_blurred}")
        
        # Initialize session-specific variables
        session_model_name_display = ""
        session_model_annotation_file = ""
        session_model_base_log_dir = ""
        session_model_target_best_h5 = ""
        session_current_total_fine_tune_epochs = 0

        # Set Up Current Session Parameters:
        # Based on the current_model_turn ("tuned" or "blurred"), it sets up session-specific variables:
        if current_model_turn == "tuned":
            session_model_name_display = "yolo-distance-tuned" # A string for display purposes.
            session_model_annotation_file = annotation_file_tuned # The path to the correct aggregated annotation file (annotation_file_tuned, defined in a previous cell).
            session_model_base_log_dir = log_dir_base_tuned # The correct base log directory (log_dir_base_tuned from Cell 1).
            session_model_target_best_h5 = tuned_model_best_h5_path # The path where the overall best checkpoint for the current model should be saved (tuned_model_best_h5_path from Cell 1).
            session_current_total_fine_tune_epochs = total_fine_tune_epochs_on_tuned # The current accumulated fine-tuning epochs for the model whose turn it is.
        else: # current_model_turn == "blurred"
            session_model_name_display = "yolo-distance-blured"
            session_model_annotation_file = annotation_file_blurred # (annotation_file_blurred, defined in a previous cell).
            session_model_base_log_dir = log_dir_base_blurred # (log_dir_base_blurred from Cell 1).
            session_model_target_best_h5 = blurred_model_best_h5_path # (blurred_model_best_h5_path from Cell 1).
            session_current_total_fine_tune_epochs = total_fine_tune_epochs_on_blurred
        
        # Run Training Session:
        # Prints which model is about to be trained.
        print(f"\n>>> Current turn: Training {session_model_name_display} <<<")
        
        # Calls the run_training_session function (defined in Cell 4, which internally calls
        # the train.py script) with all the prepared session-specific parameters and
        # EPOCHS_PER_MAIN_SESSION (from Cell 1).
        success_this_session, epochs_completed_this_session = run_training_session(
            model_name_str=session_model_name_display, # For logging within the function
            current_total_fine_tune_epochs_for_model=session_current_total_fine_tune_epochs,
            annotation_file_path=session_model_annotation_file,
            model_base_log_dir=session_model_base_log_dir, # Base log dir for train.py
            target_overall_best_h5_path=session_model_target_best_h5, # Where to copy the session's best
            epochs_to_run_this_session=EPOCHS_PER_MAIN_SESSION
        )

        # Process Session Results:
        # If the session was successful (success_this_session is True) and at least one epoch was completed:
        if success_this_session and epochs_completed_this_session > 0:
            print(f"Session for {session_model_name_display} completed {epochs_completed_this_session} epochs successfully.")
            
            # Get the best validation loss from the session
            # It attempts to find the best checkpoint from that specific session using
            # get_latest_timestamped_log_subdir and find_best_epoch_checkpoint_in_session_log_dir (both from Cell 4).
            session_specific_log_dir = get_latest_timestamped_log_subdir(session_model_base_log_dir)
            best_checkpoint = find_best_epoch_checkpoint_in_session_log_dir(session_specific_log_dir)
            
            # If a best checkpoint is found, it tries to parse the val_loss from its filename.
            if best_checkpoint:
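                checkpoint_basename = os.path.basename(best_checkpoint)
                # Drop a trailing ".h5" first; otherwise a part such as "val_loss11.222.h5" cannot be parsed by float().
                checkpoint_stem = checkpoint_basename[:-3] if checkpoint_basename.endswith('.h5') else checkpoint_basename
                val_loss_parts = [p for p in checkpoint_stem.split('-') if p.startswith('val_loss')]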
                
                if val_loss_parts:
                    try:
                        current_val_loss = float(val_loss_parts[0].replace('val_loss', ''))
                        
                        # It updates the total fine-tune epochs for the current model.
                        # It compares the current_val_loss with the stored best_val_loss_ for that model.
                        # If the current is better, it updates best_val_loss_ and resets epochs_without_improvement_.
                        # Otherwise, it increments epochs_without_improvement_. This allows for tracking if the model is still improving.
                        if current_model_turn == "tuned":
                            total_fine_tune_epochs_on_tuned += epochs_completed_this_session
                            if current_val_loss < best_val_loss_tuned:
                                print(f"New best val_loss for tuned model: {current_val_loss:.4f} (previous: {best_val_loss_tuned:.4f})")
                                best_val_loss_tuned = current_val_loss
                                epochs_without_improvement_tuned = 0
                            else:
                                epochs_without_improvement_tuned += epochs_completed_this_session
                                print(f"No improvement for tuned model. Epochs without improvement: {epochs_without_improvement_tuned}")
                        else: # blurred
                            total_fine_tune_epochs_on_blurred += epochs_completed_this_session
                            if current_val_loss < best_val_loss_blurred:
                                print(f"New best val_loss for blurred model: {current_val_loss:.4f} (previous: {best_val_loss_blurred:.4f})")
                                best_val_loss_blurred = current_val_loss
                                epochs_without_improvement_blurred = 0
                            else:
                                epochs_without_improvement_blurred += epochs_completed_this_session
                                print(f"No improvement for blurred model. Epochs without improvement: {epochs_without_improvement_blurred}")
                    except ValueError:
                        print(f"Could not parse validation loss from checkpoint name: {checkpoint_basename}")
                        # If parsing val_loss fails or no best checkpoint is found,
                        # it still increments the total fine-tune epochs for the model.
                        if current_model_turn == "tuned":
                            total_fine_tune_epochs_on_tuned += epochs_completed_this_session
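                        else: # blurred
                            total_fine_tune_epochs_on_blurred += epochs_completed_this_session
                else: # Checkpoint found, but its filename has no val_loss part
                    if current_model_turn == "tuned":
                        total_fine_tune_epochs_on_tuned += epochs_completed_this_session
                    else: # blurred
                        total_fine_tune_epochs_on_blurred += epochs_completed_this_session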
            else: # No best_checkpoint found this session
                if current_model_turn == "tuned":
                    total_fine_tune_epochs_on_tuned += epochs_completed_this_session
                else: # blurred
                    total_fine_tune_epochs_on_blurred += epochs_completed_this_session
        else: # Session was not successful or no epochs completed
            # If the session had issues, a warning is printed, and it's noted that
            # the model will be retried on its next turn if time permits.
            print(f"WARNING: Session for {session_model_name_display} had issues or completed 0 epochs. It will be retried on its next turn if time permits.")

        # Switch Model Turn: The current_model_turn is switched from "tuned" to "blurred",
        # or vice-versa, for the next iteration.
        if current_model_turn == "tuned":
            current_model_turn = "blurred"
        else:
            current_model_turn = "tuned"

        # Post-Session Time Check & Break:
        # Checks the elapsed_time_seconds again. If the maximum training time is reached, it exits the loop.
        elapsed_time_seconds = time.time() - overall_start_time
        if elapsed_time_seconds >= MAX_TOTAL_TRAINING_TIME_SECONDS:
            print(f"\nMaximum training time reached after session for {session_model_name_display}. Exiting loop.")
            training_active = False # Ensure loop terminates
            break # Exit while loop immediately
            
        # If training is continuing, announce the break (BREAK_TIME_SECONDS from Cell 1)
        # and which model is up next, then pause with time.sleep(). The pause also
        # helps relieve memory pressure between sessions.
        if training_active: # Only take a break if we are not about to exit due to time limit
            print(f"\nTaking a {BREAK_TIME_SECONDS} second break (approx {BREAK_TIME_SECONDS/60:.1f} mins)... Next up: {current_model_turn}")
            time.sleep(BREAK_TIME_SECONDS)

# KeyboardInterrupt: If the user manually interrupts the training (e.g., Ctrl+C),
# it catches KeyboardInterrupt, prints an interruption message, and sets training_active to False.
except KeyboardInterrupt:
    print("\n--- Training Interrupted by User (KeyboardInterrupt) ---")
    training_active = False # Ensure loop terminates if interrupted
# finally Block: This block executes regardless of how the loop terminated
# (normally, by time limit, or by interruption).
finally:
    # Compute the total wall-clock duration of the run; it is printed below.
    final_elapsed_time_seconds = time.time() - overall_start_time
    print(f"\n{'='*20} MAIN TRAINING LOOP FINISHED {'='*20}")
    print(f"Total script duration: {final_elapsed_time_seconds / 3600:.2f} hours.")
    # Summarize the fine-tuning epochs completed for each model during this run,
    # then report whether a best-model file exists at the expected path.
    print(f"Total fine-tuning epochs (this run) for yolo-distance-tuned: {total_fine_tune_epochs_on_tuned}")
    if os.path.exists(tuned_model_best_h5_path):
        print(f"  Final best model for yolo-distance-tuned is expected at: {tuned_model_best_h5_path}")
    else:
        print("  No final best model found for yolo-distance-tuned at the end of the run.")
        
    print(f"Total fine-tuning epochs (this run) for yolo-distance-blurred: {total_fine_tune_epochs_on_blurred}")
    if os.path.exists(blurred_model_best_h5_path):
        print(f"  Final best model for yolo-distance-blurred is expected at: {blurred_model_best_h5_path}")
    else:
        print("  No final best model found for yolo-distance-blurred at the end of the run.")
    # Remind the user where to find detailed TensorBoard logs and session checkpoints.
    print("Please check the respective log directories for detailed TensorBoard logs and checkpoints from each session.")

# Cell completion message.
print("--- Cell 6: Main Training Loop Execution Complete ---")