# UN number detection

## Business Understanding
For this project, we have been tasked with developing a machine learning model capable of recognizing UN number hazard plates. These plates, commonly displayed on freight train wagons, indicate the types of hazardous materials being transported. The successful implementation of this model will contribute to a more efficient and secure railway system across the EU.

<img src="https://interessante-bilder.startbilder.de/1200/zagns-gaskesselwagen-aus-deutschland-vtg-498869.jpg" alt="Hazard Plate" width="500"/>

The hazard plates play a crucial role in ensuring the safety of transportation by providing essential information about the nature of the substances on board, such as flammability, toxicity, or corrosiveness. By automating the recognition process with machine learning, the handling and tracking of these hazardous materials can be streamlined, reducing manual labor and minimizing potential human errors.

### Determine business objectives
#### Background
The specific expectations and objectives of the EU for this project are not yet fully defined, but the initiative's roots are clear. This project is spearheaded by the University of Twente, with researcher Melissa Tijink serving as our supervisor. Our team, composed of pre-master's Computer Science students, has been tasked with developing the machine learning model. Melissa Tijink plays a pivotal role as the intermediary between our team and major stakeholders, including ProRail, the EU, and other experts in the field.

This project is part of a broader initiative aimed at enhancing rail freight operations within Europe, aligning with the EU’s goals for improved efficiency and safety. More information on the initiative can be found on the official project site: [EU Rail FP5](https://projects.rail-research.europa.eu/eurail-fp5/).

Flagship Project 5: *TRANS4M-R aims to establish rail freight as the backbone of a low-emission, resilient European logistics chain that meets end-user needs. It focuses on two main technological clusters: 'Full Digital Freight Train Operation' and 'Seamless Freight Operation', which will develop and demonstrate solutions to increase rail capacity, efficiency, and cross-border coordination. By integrating Digital Automatic Coupler (DAC) solutions with software-defined systems, the project seeks to optimize network management and enhance cooperation among infrastructure managers. The ultimate goal is to create an EU-wide, interoperable rail freight framework with unified technologies and seamless operations across borders and various stakeholders, boosting the EU transport and logistics sector.*

#### Business objectives

**Primary Objective:** Develop an object detection model for UN number hazard plates on freight wagons.

**Sub-objectives:**
1. Detect and identify UN number hazard plates: Ensure the model can accurately locate hazard plates on freight wagons. 
2. Read and interpret the UN numbers: Implement recognition capabilities to accurately read the numbers on the detected plates.
3. Ensure model robustness and accuracy: Train the model to achieve high accuracy and reliability under various conditions (e.g., different lighting, weather).
4. Optimize model for speed: Make sure the model runs efficiently and in real time so that it can process footage of moving trains.
5. Adapt the model for moving environments: Design and test the model to handle the unique challenges of detecting and reading plates on trains in motion. 

### Assess Situation

#### Inventory of resources

**Business Experts:** Our team currently lacks extensive expertise in this area. We can consult Tijink when questions arise.

**Data Mining Team:** 
- Melissa Tijink (Researcher in Data Management & Biometrics/Electrical Engineering, Mathematics, and Computer Science)
- E. Talavera Martínez (Researcher in Data Management & Biometrics)
- Ewaldo Nieuwenhuis (Pre-master student in Computer Science)
- Stanislav Levendeev (Pre-master student in Computer Science)

**Data:**
1. **Video Data of Freight Trains:** This consists of video footage of moving freight trains, where the freight wagons should display the UN numbers.
2. **Line Scan Camera Pictures:** These are high-resolution images of the train, but they are very spread out. It is still uncertain if these will be useful.
3. **Photos of ADR Warning Signs:** These are images of ADR signs on freight trains. However, this is not exactly what we need since our objective is to build a model that recognizes UN numbers.
4. **Public Images of Trucks:** These are unlabeled public images from the internet showing trucks that display UN numbers.

**Computing Resources:** We have access to a cluster from the University of Twente, which we can use to train or fine-tune our model.

**Software:** We will use Python, Jupyter Notebook, Keras, PyTorch, and TensorFlow for analyzing, cleaning, preparing the data, and modeling. For data labeling, we will use [CVAT](https://www.cvat.ai/).

#### Requirements, assumptions, and constraints

##### Requirements
- Object detection capability for UN number hazard plates.
- Text recognition to read and extract UN numbers.
- High accuracy and precision in detection and recognition.
- Robust performance under varying conditions (weather, lighting, speed).
- Speed optimization for fast processing with minimal lag.
- Real-time processing for operation on moving trains.

##### Assumptions
- Consistent access to a high-performance computational cluster for model training and testing.
- The high-performance cluster is necessary due to the heavy processing demands of deep learning models.
- Local machines are not sufficient for the required high computational tasks.
- Project-specific data, including images and videos of freight trains with hazard plates, will be provided as planned.
- Data will include varied conditions (different lighting and weather) to ensure robustness.
- Access to diverse data is essential for creating a model that generalizes well to real-world scenarios.
- The stakeholders will provide timely feedback to guide any changes or adaptations needed in the project.

##### Constraints
- The team has limited experience with advanced object detection methods, which may impact the initial development and refinement of the model.
- Most of the available data is not labeled, presenting a challenge for training supervised machine learning models. Some labeled data exists but belongs to another researcher, and access to it is uncertain.
- The dataset may be skewed with an overrepresentation of specific UN numbers from certain wagons, which could limit the model's ability to generalize across different scenarios.
- The size of the dataset makes it difficult to filter out specific wagons or relevant segments efficiently, posing a challenge for data processing and targeted training.

# Imports

In [None]:
import cv2 
import json
import os
import pandas as pd
import seaborn as sns
import sys
from pathlib import Path
from PIL import Image as PILImage  # aliased: IPython.display.Image (imported below) would otherwise shadow it
import matplotlib.pyplot as plt

from IPython import display
from IPython.display import Image
from IPython.display import display as ipy_display

sys.path.append('..')  # Adds the parent directory to the path

from src.draw.utils import draw_box
from src.annotation.annotation_validator import (
    display_yolo_sample,
    analyze_yolo_label_distribution,
    display_sample_annotations_yolo,
    find_missing_labels,
    visualize_annotations_from_links
)


# Data Understanding

## Collect Initial Data


In [None]:
# Directory containing video files
data_dir = os.environ["PATH_TO_DATA"]
video_directory = data_dir + "/videos"
print(video_directory)


### ProRail Dataset of videos

In [None]:
# Collect all .mp4 filenames in the directory (sorted for a reproducible order)
video_files = sorted(f for f in os.listdir(video_directory) if f.endswith('.mp4'))

# Show an example video file in the notebook
example_video_path = os.path.join(video_directory, video_files[0])
display.Video(example_video_path, embed=True)


In [None]:
print(video_files[0])
df_video = pd.read_csv(f'{data_dir}/video_data_info.csv')
df_video.head()

### HIN text labels
This dataset describes which hazardous materials are being transported.


In [None]:
df_hin = pd.read_csv(f"{data_dir}/un-number-labels.csv")

### COCO dataset ProRail
This is the ProRail data in COCO format, used to train the Faster R-CNN model.

In [None]:
coco_dir = data_dir + "/data_faster_rcnn"
train_dir = coco_dir + "/train"
val_dir = coco_dir + "/val"
test_dir = coco_dir + "/test"

# image files are in the images folder in each of the train, val and test folders
train_image_dir = train_dir + "/images"
val_image_dir = val_dir + "/images"
test_image_dir = test_dir + "/images"

### Ultralytics YOLO format ProRail
This dataset, stored in Ultralytics YOLO format, is used to fine-tune the YOLO model.


In [None]:
# Define the root directory for your YOLO dataset
YOLO_ROOT = Path(f"{data_dir}/yolo")

### HazTruck dataset
This is the public dataset including public images and annotations for trucks with hazmat plates.


In [None]:
df_haztruck = pd.read_csv(f'{data_dir}/haztruck_dataset.csv')
df_haztruck

## Describe Data
### Video dataset ProRail
**Column Description:**

- **Unnamed: 0**: The ID of the video.
- **filename**: The name of the file, including the `.mp4` extension.
- **fps**: Frames per second.
- **frame_count**: The total number of frames in the video.
- **width**: The width of the video in pixels.
- **height**: The height of the video in pixels.
- **resolution**: The video resolution, expressed as `width x height`.
- **duration_seconds**: The video's duration in seconds.
- **hash**: A hash of the video file, used to check whether the file is unique.
- **file_size_mb**: The file size in megabytes.
- **train_detected**: Indicates whether a train was detected using a YOLO model. If the model's confidence score exceeded 10%, a train is considered detected, though this may not always be accurate.
- **confidence**: The confidence score indicating how likely it is that the video contains a train.

In [None]:
print(f" The rows and column size are {df_video.shape} \n")
print(f" The describe of the dataframe is: \n {df_video.describe(include='all')} \n")
# DataFrame.info() prints its report itself and returns None, so call it directly
print(" The info of the dataframe is:")
df_video.info()
print(f"\n The columns of the dataframe are: \n {df_video.columns} \n")
print(f" The number of null values in each column are: \n {df_video.isnull().sum()} \n")

In [None]:
df_video['hash'].value_counts(ascending=False)

All hashes appear to be unique, so there are no byte-identical files. However, a file hash only matches exact copies, so we probably cannot determine duplicates by hash alone.

In [None]:
# Load the MP4 video
video = "1690279852.mp4"
video2 = "1690281303.mp4"
video_path = video_directory+'/'+video
video2_path = video_directory+'/'+video2
# Embed video in the notebook

display.Video(video_path, embed=True)

In [None]:
display.Video(video2_path, embed=True)

**These two videos show the same footage; the second is only one second longer than the first. The dataset therefore contains duplicate videos.**
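
Because the file hashes differ even when two recordings show the same footage, a content-based comparison is one possible way to flag likely duplicates. The sketch below is an illustration only, not a method used in this project: it computes a simple average hash of a grayscale frame with NumPy and compares two hashes by Hamming distance. The function names, the `hash_size` parameter, and the synthetic frames are assumptions made for the example; in practice the input would be a frame read from each video.

```python
import numpy as np


def average_hash(gray_frame, hash_size=8):
    """Return a flat boolean array: True where a block's mean exceeds the overall mean."""
    height, width = gray_frame.shape
    # Crop so the frame divides evenly into hash_size x hash_size blocks
    h_crop = height - height % hash_size
    w_crop = width - width % hash_size
    blocks = gray_frame[:h_crop, :w_crop].astype(float).reshape(
        hash_size, h_crop // hash_size, hash_size, w_crop // hash_size
    )
    block_means = blocks.mean(axis=(1, 3))
    return (block_means > block_means.mean()).flatten()


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two hashes differ."""
    return int(np.count_nonzero(hash_a != hash_b))


# Synthetic stand-ins for video frames (hypothetical data)
rng = np.random.default_rng(0)
frame_a = np.tile(np.linspace(0, 255, 64), (64, 1))   # horizontal gradient
frame_b = frame_a + rng.normal(0, 2, frame_a.shape)   # same content, slight noise
frame_c = frame_a.T                                   # different content

hash_a, hash_b, hash_c = (average_hash(f) for f in (frame_a, frame_b, frame_c))
print("a vs b:", hamming_distance(hash_a, hash_b))
print("a vs c:", hamming_distance(hash_a, hash_c))
```

A small distance suggests the two frames show nearly the same content; the threshold that separates duplicates from distinct videos would have to be chosen by inspecting real pairs.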

In [None]:
# videos sorted by highest to lowest FPS
df_video['fps'] = pd.to_numeric(df_video['fps'], errors='coerce')
df_x = df_video.sort_values(by='fps', ascending=False)
df_x

In [None]:
# min max and avg fps
min_fps = df_video['fps'].min()
max_fps = df_video['fps'].max()
avg_fps = df_video['fps'].mean()
print(f" The minimum fps is {min_fps}, the maximum fps is {max_fps} and the average fps is {avg_fps} \n")

The frames per second (FPS) in your video data can influence the performance of training your model. FPS affects the temporal resolution and the amount of data fed into the model. When training on video data, the FPS determines how many frames are available to capture motion or other temporal patterns, which can influence model performance.

For example, if you use a higher FPS, your model will have more frames to analyze within a given time frame, potentially improving its ability to capture finer details in motion (e.g., in object detection or action recognition tasks). However, processing more frames per second can also lead to higher computational costs and may require more memory and processing power, which might reduce training efficiency unless properly optimized.
[source 1](https://library.fiveable.me/key-terms/deep-learning-systems/frames-per-second-fps)


Conversely, lower FPS can reduce computational demands but may also decrease the temporal resolution of your data, making it harder for your model to accurately capture fast movements or dynamic changes. Depending on your specific use case, you'll need to balance FPS with your model's ability to process the data effectively while managing computational resources.
[source 2](https://paulbridger.com/posts/video-analytics-pipeline-tuning/)

It's also important to consider other factors like video resolution and preprocessing techniques, which could further affect how FPS influences your model's performance.
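
To make the FPS trade-off concrete, the following sketch computes which frame indices would be kept when a video is downsampled to a lower target frame rate. It is a minimal illustration; the function name and the `target_fps` parameter are hypothetical and not part of the project code.

```python
def sampled_frame_indices(frame_count, native_fps, target_fps):
    """Return the frame indices to keep when downsampling to roughly target_fps."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("fps values must be positive")
    # Never sample more densely than the source video allows
    step = max(1, round(native_fps / target_fps))
    return list(range(0, frame_count, step))


# Example: a 100-frame clip recorded at 25 FPS, reduced to about 5 FPS
indices = sampled_frame_indices(frame_count=100, native_fps=25, target_fps=5)
print(len(indices), indices[:4])
```

A larger step lowers the number of frames to label and process, at the cost of fewer views of each plate as the train passes.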

In [None]:
def plot_distribution(df, column, xlabel, ylabel="Frequency", bins=10):
    plt.figure(figsize=(8, 5))
    plt.hist(df[column], bins=bins, color='skyblue', edgecolor='black', alpha=0.7)
    plt.title(f"Distribution of {xlabel}", fontsize=14)
    plt.xlabel(xlabel, fontsize=12)
    plt.ylabel(ylabel, fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

In [None]:
plot_distribution(df_video, 'fps', xlabel='Frames Per Second (FPS)')
plot_distribution(df_video, 'frame_count', xlabel='Total Frames', bins=20)
plot_distribution(df_video, 'duration_seconds', xlabel='Duration (seconds)', bins=20)
plot_distribution(df_video, 'file_size_mb', xlabel='File Size (MB)', bins=15)
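# 'resolution' is a categorical string (width x height), so plot category counts instead of a histogram
df_video['resolution'].value_counts().plot(kind='bar', title='Distribution of Resolution')
plt.show()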

In [None]:
def plot_resolutions(df):
    plt.figure(figsize=(8, 6))
    plt.scatter(df['width'], df['height'], c='orange', alpha=0.7, edgecolors='black')
    plt.title("Resolution Scatter Plot (Width vs Height)", fontsize=14)
    plt.xlabel("Width (pixels)", fontsize=12)
    plt.ylabel("Height (pixels)", fontsize=12)
    plt.grid(linestyle='--', alpha=0.7)
    plt.show()

plot_resolutions(df_video)

### HIN text labels

In [None]:
df_hin

In [None]:
# get amount of unique UN numbers in the dataframe
unique_un_numbers = df_hin['number'].nunique()
print(f" The amount of unique UN numbers in the dataframe is {unique_un_numbers} \n")

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.firesafe.org.uk%2Fwp-content%2Fuploads%2F2011%2F05%2FHazchem_sign3.gif&f=1&nofb=1&ipt=213e14aa1b853778982c7947745401044f12ea17de088b2a1556d3f89f5fd052" alt="Image of UN number plate">

The Hazard Identification Number (HIN) plate is used to identify the type of hazardous material a train is carrying. You can find the HIN on the upper part of the plate (1). The class number is not relevant in this research.

### COCO dataset

We used an 80/10/10 split for our dataset:

- **80% Training:** Used to train the model and learn patterns from the data.
- **10% Validation:** Used to tune model hyperparameters and prevent overfitting.
- **10% Testing:** Used to evaluate the final model performance on unseen data.
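
The notebook does not show how the split was produced, so the following is only a sketch of one way to obtain an 80/10/10 split. The function name, the seed, and the example filenames are hypothetical.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test


frames = [f"frame_{i:03d}.jpg" for i in range(50)]  # made-up filenames
train, val, test = split_80_10_10(frames)
print(len(train), len(val), len(test))
```

Because the images are frames taken from videos, one could instead pass a list of video IDs so that all frames of one video end up in the same split; whether the split used here did this is not stated.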

**Dataset Directory Structure (COCO Format)**

The dataset is organized using a structure inspired by the popular **COCO (Common Objects in Context)** format. This is a standard and highly effective way to manage data for object detection, ensuring that the images and their corresponding labels are kept separate but clearly linked.

The data is split into three distinct sets: **training**, **validation**, and **testing**.

**Visual Layout**

Here is a visual representation of the directory tree:

```
data_faster_rcnn/
├── train/
│   ├── images/
│   │   ├── 1690801380_00384.jpg
│   │   ├── 1690801380_00385.jpg
│   │   └── ...
│   └── annotations/
│       └── instances_train.json
│
├── val/
│   ├── images/
│   │   ├── 1690802500_00100.jpg
│   │   └── ...
│   └── annotations/
│       └── instances_val.json
│
└── test/
    ├── images/
    │   ├── 1690804000_00050.jpg
    │   └── ...
    └── annotations/
        └── instances_test.json
```

**Explanation of Components**

*   **data_faster_rcnn/**: This is the main root folder that contains the entire dataset.

*   **train/, val/, test/**: These three folders represent the primary data splits.
    *   **train**: This folder contains the majority of the data, used for **training** the machine learning model.
    *   **val**: (Validation) This folder holds a smaller set of data used to fine-tune the model during training and prevent it from "memorizing" the training data.
    *   **test**: This folder contains data that the model has never seen before, used for the final **evaluation** of the model's performance.

*   **images/**: Inside each of the `train`, `val`, and `test` folders, this sub-folder contains all the actual image files (e.g., `.jpg`, `.png`).

*   **annotations/**: This sub-folder contains the label data.
    *   **instances_train.json**: This single JSON file contains all the annotations (like bounding boxes and object categories) for **every image** inside the `train/images/` folder. The same logic applies to `instances_val.json` and `instances_test.json`.

**Image Filename Convention**

The images in this dataset are individual frames extracted from the **ProRail video dataset**. Each filename follows a specific pattern that links it back to the original source video and its position within that video.

The format is: **`video_id_frame_id.jpg`**

**Example:** `1690801380_00384.jpg`

*   **1690801380**: This is the unique **Video ID**, which identifies the source video the frame was taken from.
*   **00384**: This is the **Frame ID**, which represents the specific frame number from that video sequence.
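
The naming convention above can be parsed with a few lines of Python. This is a minimal sketch; `parse_frame_filename` is a hypothetical helper, not a function from the project's `src` package.

```python
from pathlib import Path


def parse_frame_filename(filename):
    """Split 'videoid_frameid.ext' into (video_id, frame_number)."""
    stem = Path(filename).stem
    # Split on the last underscore so the frame ID is always the final part
    video_id, separator, frame_id = stem.rpartition("_")
    if not separator or not video_id or not frame_id.isdigit():
        raise ValueError(f"Unexpected filename format: {filename}")
    return video_id, int(frame_id)


video_id, frame_number = parse_frame_filename("1690801380_00384.jpg")
print(video_id, frame_number)
```

The video ID is kept as a string so it can be matched directly against the `filename` column of the video metadata.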

In [None]:
def show_examples(image_dir, n=3, title=""):
    print(title)
    image_files = os.listdir(image_dir)[:n]
    for img_file in image_files:
        img_path = os.path.join(image_dir, img_file)
        ipy_display(Image(filename=img_path, width=300))

show_examples(train_image_dir, n=1, title="Example training images:")
show_examples(val_image_dir, n=1, title="Example validation images:")
show_examples(test_image_dir, n=1, title="Example test images:")

In [None]:
# --- 1. Define the path ---
annotation_file = os.path.join(train_dir, "annotations", "instances_train.json")

# --- 2. Load the JSON annotation file ---
print(f"Loading annotations from: {annotation_file}\n")
with open(annotation_file, 'r') as f:
    data = json.load(f)

# --- 3. Explore the main structure of the JSON file ---
print("The main keys in the JSON file are:")
print(f"-> {list(data.keys())}\n")

# --- 4. Create helpful mappings for easy data lookup ---
# Create a dictionary to map category IDs to category names
categories = {cat['id']: cat['name'] for cat in data['categories']}
print("Found Categories:")
print(f"-> {categories}\n")

# Create a dictionary to map image IDs to their file names
images = {img['id']: img['file_name'] for img in data['images']}

# --- 5. Display the first few annotations ---
print("--- Displaying Details for the First 5 Annotations ---\n")

# Get the list of all annotations
annotations = data['annotations']

for i, ann in enumerate(annotations[:5]): # Loop through the first 5 annotations
    image_id = ann['image_id']
    category_id = ann['category_id']
    bbox = ann['bbox']

    # Use the mappings to get human-readable names
    image_filename = images[image_id]
    category_name = categories[category_id]

    print(f"**Annotation #{i+1}**")
    print(f"  - Image File:    {image_filename}")
    print(f"  - Category:      {category_name}")
    print(f"  - Bounding Box:  {bbox}  (Format: [x, y, width, height])")
    print("-" * 20)

In [None]:
# show amount of annotations
print(f"Total number of annotations: {len(data['annotations'])}")

In [None]:
# show if any are null: x, y, width, height or Image File
null_bboxes = [ann for ann in data['annotations'] if any(coord is None for coord in ann['bbox'])]
null_image_ids = [ann['image_id'] for ann in data['annotations'] if ann['image_id'] not in images]
print(f"Number of annotations with null bounding box coordinates: {len(null_bboxes)}")
print(f"Number of annotations with invalid image IDs: {len(null_image_ids)}")

In [None]:
# --- 2. Select a Sample Annotation ---
# Let's pick the annotation at index 10 (the 11th in the list) as an example
annotation = data['annotations'][10]

# --- 3. Find the Corresponding Image Path ---
image_id = annotation['image_id']
# Use a generator expression to find the image info dict matching the ID
image_info = next(img for img in data['images'] if img['id'] == image_id)
image_path = os.path.join(train_image_dir, image_info['file_name'])

print(f"Found Image Path: {image_path}")

# --- 4. Load the Image using OpenCV ---
# cv2.imread returns None (it does not raise) when the file cannot be read,
# so check the result before using it
image = cv2.imread(image_path)
if image is None:
    raise FileNotFoundError(f"Could not load image at path: {image_path}")

# OpenCV loads images in BGR order; Matplotlib expects RGB
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# --- 5. Extract and Format the Bounding Box ---
# COCO format is [x_min, y_min, width, height]
bbox_coco = annotation['bbox']

# Convert COCO format to the format needed for drawing: (x_min, y_min, x_max, y_max)
x, y, w, h = bbox_coco
ground_truth = (x, y, x + w, y + h)

# Get the category name to use as a label
category_id = annotation['category_id']
category_info = next(cat for cat in data['categories'] if cat['id'] == category_id)
label = category_info['name']

print(f"Bounding Box (x, y, w, h): {bbox_coco}")
print(f"Drawing Box (x1, y1, x2, y2): {ground_truth}")
print(f"Label: {label}")

# --- 6. Draw the Bounding Box on the Image ---
image_with_box = draw_box(image=image, ground_truth=ground_truth)

# --- 7. Display the Final Image ---
# Fall back to `image` in case draw_box draws in place and returns nothing
plt.figure(figsize=(10, 10))
plt.imshow(image_with_box if image_with_box is not None else image)
plt.title(f"Ground Truth Box on Image: {image_info['file_name']}")
plt.axis('off')  # Hide the axes for a cleaner look
plt.tight_layout()
plt.show()


### Ultralytics YOLO dataset

In [None]:
# Display the first sample from the 'train' split
display_yolo_sample(YOLO_ROOT, split="train", image_index=0)

# Display the fifth sample from the 'validation' split (example)
display_yolo_sample(YOLO_ROOT, split="val", image_index=4)

#### **Explanation of YOLO Label File Format**

This section explains the structure and meaning of the YOLO `.txt` label files used for object detection tasks.

---

**1. Filename Convention**

The name of each label file directly corresponds to an image file and often contains metadata about its source.

**Example Filename:** `1690281365_00053.txt`

This name is broken down as follows:
*   **`1690281365`**: This is the **Video ID**, a unique identifier for the video from which the image frame was extracted.
*   **`_`**: A separator character.
*   **`00053`**: This is the **Frame Number**, identifying frame 53 of that specific video.
*   **`.txt`**: The file extension for the label file.

The corresponding image for this label file would be named `1690281365_00053.jpg` (or `.png`, etc.).

---

**2. File Content Structure**

Each `.txt` label file contains one or more lines. **Each line represents a single bounding box** for one object detected in the image.

The format for each line is:

```
[class_id] [x_center] [y_center] [width] [height]
```

All five values are space-separated. Let's break down each component:

| Component | Description |
| :--- | :--- |
| **`class_id`** | An integer representing the object's class. This ID is **zero-indexed** (starts from 0). It maps to the class names defined in your `dataset.yaml` file (e.g., `0` could be 'person', `1` could be 'car'). |
| **`x_center`** | The **horizontal center** of the bounding box. This value is **normalized** by the image's width, so it's a float between 0.0 and 1.0. (e.g., `0.5` means the center is exactly in the middle of the image horizontally). |
| **`y_center`** | The **vertical center** of the bounding box. This value is **normalized** by the image's height, so it's a float between 0.0 and 1.0. (e.g., `0.5` means the center is exactly in the middle of the image vertically). |
| **`width`** | The **width** of the bounding box. This value is also **normalized** by the image's width. |
| **`height`** | The **height** of the bounding box. This value is also **normalized** by the image's height. |

> **What does "normalized" mean?**
> A normalized value is a fraction of the total dimension. To get the actual pixel value, you would multiply the normalized value by the image's dimension:
> * `absolute_x_center_in_pixels = x_center * image_width`
> * `absolute_width_in_pixels = width * image_width`
> * `absolute_y_center_in_pixels = y_center * image_height`
> * `absolute_height_in_pixels = height * image_height`

---

**3. Examples**

**Example 1: Single Bounding Box**

*   **File:** `1690281365_00053.txt`
*   **Content:**
    ```
    0 0.0785078125 0.4359953704 0.072671875 0.0501203704
    ```

*   **Interpretation:**
    *   `0`: This bounding box is for an object of class `0`.
    *   `0.0785...`: The center of the box is located at ~7.85% of the image's width from the left edge.
    *   `0.4359...`: The center of the box is located at ~43.6% of the image's height from the top edge.
    *   `0.0726...`: The width of the box is ~7.27% of the total image width.
    *   `0.0501...`: The height of the box is ~5.01% of the total image height.

**Example 2: Multiple Bounding Boxes**

If an image contains two objects, the label file will have two lines.

*   **Content:**
    ```
    0 0.0785078125 0.4359953704 0.072671875 0.0501203704
    0 0.0784407812 0.5359953704 0.1671875000 0.0501203704
    ```

*   **Interpretation:**
    *   This file describes **two separate objects** found in the corresponding image.
    *   Both objects belong to **class `0`**.
    *   They are located at slightly different positions (note the different `y_center` values) and have different widths (`width`).
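
The normalization described above can be reversed in code. The sketch below converts one label line into pixel corner coordinates; the function name is hypothetical, and the 1920x1080 image size is an assumption chosen only for illustration, since the actual frame size depends on the source video.

```python
def yolo_line_to_pixel_box(line, image_width, image_height):
    """Convert 'class x_center y_center width height' to (class_id, (x1, y1, x2, y2)) in pixels."""
    class_id, x_center, y_center, width, height = line.split()
    # Scale horizontal values by the image width, vertical values by the image height
    x_center, width = float(x_center) * image_width, float(width) * image_width
    y_center, height = float(y_center) * image_height, float(height) * image_height
    box = (
        x_center - width / 2,   # x1 (left)
        y_center - height / 2,  # y1 (top)
        x_center + width / 2,   # x2 (right)
        y_center + height / 2,  # y2 (bottom)
    )
    return int(class_id), box


line = "0 0.0785078125 0.4359953704 0.072671875 0.0501203704"
class_id, box = yolo_line_to_pixel_box(line, image_width=1920, image_height=1080)
print(class_id, [round(value, 1) for value in box])
```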

In [None]:
# Run the analysis
analyze_yolo_label_distribution(YOLO_ROOT)

We use an 80/10/10 split on the YOLO dataset.

In [None]:
# Run the validation check
missing_files = find_missing_labels(YOLO_ROOT)

There are no images without a label.

In [None]:
display_sample_annotations_yolo(YOLO_ROOT, amount=3)


#### Explanation of the YOLO `dataset.yaml` File

The `dataset.yaml` file is the central configuration file for a YOLO dataset. It acts as a map, telling the training script where to find the data and what kind of objects it should learn to detect.

**Example `dataset.yaml`**

```yaml
# ------------------------------------------------------------------
# Root path to the dataset. All other paths are relative to this one.
# ------------------------------------------------------------------
path: yourpath\UN-number-detection\data\annotations\prorail\yolo

# ------------------------------------------------------------------
# Paths to the image sets for each split (train, validation, test)
# ------------------------------------------------------------------
train: images/train  # Training images
val: images/val      # Validation images
test: images/test    # (Optional) Test images

# ------------------------------------------------------------------
# Dataset Class Definitions
# ------------------------------------------------------------------
nc: 1  # Number of classes
names: ['hazmat_plate']  # List of class names
```

**Detailed Breakdown of Each Key**

Here is a detailed explanation of what each line means.

| Key | Example Value | Description |
| :--- | :--- | :--- |
| **`path`** | `yourpath\yolo` | The **absolute path** to the root directory of your YOLO dataset. All other paths in this file (`train`, `val`, `test`) are relative to this one. |
| **`train`** | `images/train` | The **relative path** from the `path` directory to the folder containing your **training images**. |
| **`val`** | `images/val` | The **relative path** from the `path` directory to the folder containing your **validation images**. |
| **`test`** | `images/test` | The **relative path** from the `path` directory to the folder containing your **test images**. This is optional and used for final model evaluation. |
| **`nc`** | `1` | **Number of Classes**. This is a crucial integer that tells the model how many different object categories it needs to learn. |
| **`names`** | `['hazmat_plate']` | A **list of class names**. The order of this list directly maps to the class IDs used in your `.txt` label files. |

**How It Works Together**

**1. Path Resolution:**

The training script combines the `path` with the `train`, `val`, and `test` paths to find the images.

> For example, based on the file above, the full path to the training images would be:
> **`C:\...\prorail\yolo`** + **`images/train`** = **`C:\...\prorail\yolo\images\train`**

**2. Class Mapping:**

The `nc` and `names` fields are critical for the model to understand the labels. The index of the item in the `names` list corresponds to the `class_id` in your label files.

> In this example:
> *   The list `names` has **1** item.
> *   The value of `nc` is **1**.
> *   This means a `class_id` of **`0`** in a `.txt` label file will correspond to the class **`'hazmat_plate'`**.

If you had more classes, the mapping would look like this:

```yaml
nc: 3
names: ['person', 'car', 'dog']
```
*   `class_id` **`0`** = `'person'`
*   `class_id` **`1`** = `'car'`
*   `class_id` **`2`** = `'dog'`

It is essential that `nc` is equal to the number of items in the `names` list.
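
The path resolution and the `nc`/`names` consistency rule can be checked programmatically. The sketch below works on a plain dictionary standing in for the parsed YAML file; the helper name and the `/data/yolo` root are hypothetical, and loading the real file would need a YAML parser.

```python
from pathlib import PurePosixPath


def check_dataset_config(config):
    """Validate nc against names and return the resolved image directory per split."""
    if config["nc"] != len(config["names"]):
        raise ValueError(
            f"nc is {config['nc']} but names lists {len(config['names'])} classes"
        )
    root = PurePosixPath(config["path"])
    # The test split is optional, so only resolve the splits that are present
    return {
        split: root / config[split]
        for split in ("train", "val", "test")
        if split in config
    }


config = {
    "path": "/data/yolo",  # hypothetical dataset root
    "train": "images/train",
    "val": "images/val",
    "test": "images/test",
    "nc": 1,
    "names": ["hazmat_plate"],
}
split_dirs = check_dataset_config(config)
print(split_dirs["train"])
```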

### HazTruck dataset



**Description of the `haztruck_dataset.csv` Master Annotation File**

**Overview**

This file serves as the master annotation log for a dataset created to detect and identify UN Number hazardous material (hazmat) plates on trucks. It is structured as a comprehensive CSV (Comma-Separated Values) file, where each row represents a **single bounding box** for one hazmat plate. This rich format contains not only the location of the plate but also extensive metadata about the image source, quality, and the specific codes on the plate.

**File Format**

The data is stored in a standard CSV format. It is important to note that a single image can be represented by multiple rows if it contains more than one hazmat plate. For instance, if `image_id` 73 contains two plates, there will be two rows with `image_id` set to 73, each describing a different bounding box.
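
Since one image can span several rows, the rows usually need to be grouped by `image_id` before use. The following sketch shows this with the standard library on a small in-memory CSV. Only a subset of the columns is included, and the two rows for image 73 contain made-up coordinates.

```python
import csv
import io
from collections import defaultdict

# Illustrative excerpt; the coordinates for image 73 are invented for the example
sample_csv = """image_id,image_name,box_xtl,box_ytl,box_xbr,box_ybr
73,73.png,10.0,20.0,110.0,80.0
73,73.png,200.0,40.0,300.0,100.0
2,2.png,158.67,186.83,368.87,337.69
"""

boxes_per_image = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    box = tuple(float(row[key]) for key in ("box_xtl", "box_ytl", "box_xbr", "box_ybr"))
    boxes_per_image[row["image_id"]].append(box)

for image_id, boxes in boxes_per_image.items():
    print(f"image {image_id}: {len(boxes)} plate(s)")
```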

**Column-by-Column Breakdown**

The table below describes each of the 15 columns in the `haztruck_dataset.csv` file.

| Column Name | Example Value | Description |
| :--- | :--- | :--- |
| **`image_id`** | `2` | A unique integer that identifies the image file. |
| **`link`** | `https://stock.adobe.com/...` | The URL to the **source page** where the original image is hosted. For Adobe Stock, this is the product/landing page, **not a direct link to the image file itself.** |
| **`website`** | `Adobe Stock` | The name of the website or platform where the image was sourced. |
| **`author`** | `M. Perfectti` | The creator or uploader of the source image. |
| **`resolution`** | `1000x667` | The dimensions of the image in pixels (Width x Height). |
| **`date_accessed`** | `06/02/2025 18:06` | The timestamp when the image was downloaded and logged. |
| **`added_by`** | `Ewaldo Nieuwenhuis` | The name of the annotator who added the data to the dataset. |
| **`image_name`** | `2.png` | The local filename of the image, corresponding to its `image_id`. |
| **`box_label`** | `hazmat_sign` | The class label for the annotated object (in this case, always a hazmat sign). |
| **`box_xtl`** | `158.67` | The **X**-coordinate of the **T**op-**L**eft corner of the bounding box (in pixels). |
| **`box_ytl`** | `186.83` | The **Y**-coordinate of the **T**op-**L**eft corner of the bounding box (in pixels). |
| **`box_xbr`** | `368.87` | The **X**-coordinate of the **B**ottom-**R**ight corner of the bounding box (in pixels). |
| **`box_ybr`** | `337.69` | The **Y**-coordinate of the **B**ottom-**R**ight corner of the bounding box (in pixels). |
| **`issue`** | `low quality`, `rotated` | An optional flag noting any potential quality issues or specific attributes of the annotation or image, such as poor lighting, obstruction, or weather conditions. |
| **`code`** | `33/1993` | The specific identification codes visible on the hazmat plate. |

**Key Concepts and Usage**

**Bounding Box Coordinate System**

The bounding box coordinates are provided in an **absolute pixel format**, defining the box by its top-left `(box_xtl, box_ytl)` and bottom-right `(box_xbr, box_ybr)` corners. Object detection frameworks like YOLO instead expect a normalized, center-based format (`x_center`, `y_center`, `width`, `height`, each divided by the image width or height), so the coordinates must be converted before training.
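As a minimal sketch of that conversion, the snippet below uses the example values from the column table (`resolution` of `1000x667` and the four `box_*` coordinates). The helper names `parse_resolution` and `to_yolo_box` are hypothetical and not part of the project code.

```python
def parse_resolution(resolution: str) -> tuple[int, int]:
    """Split a 'WIDTHxHEIGHT' string such as '1000x667' into two integers."""
    width, height = resolution.lower().split("x")
    return int(width), int(height)


def to_yolo_box(xtl, ytl, xbr, ybr, img_w, img_h):
    """Convert absolute corner coordinates to normalized
    (x_center, y_center, width, height)."""
    x_center = (xtl + xbr) / 2 / img_w
    y_center = (ytl + ybr) / 2 / img_h
    width = (xbr - xtl) / img_w
    height = (ybr - ytl) / img_h
    return x_center, y_center, width, height


img_w, img_h = parse_resolution("1000x667")
yolo_box = to_yolo_box(158.67, 186.83, 368.87, 337.69, img_w, img_h)

# All four values are fractions of the image size, so they fall in [0, 1].
print(" ".join(f"{value:.6f}" for value in yolo_box))
```

Dividing by the image dimensions is what makes the `resolution` column necessary for label generation: without it, the pixel coordinates cannot be normalized.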

**The 'code' Field Explained**

This field captures the text on the hazmat plate. The format `XX/YYYY` follows the European ADR agreement for transporting dangerous goods by road:
*   **Top Number (Hazard Identification Number):** This two- or three-digit code indicates the primary and subsidiary dangers of the substance (e.g., `33` signifies a highly flammable liquid).
*   **Bottom Number (UN Number):** This four-digit code is a globally recognized number identifying the specific hazardous substance being transported (e.g., `1203` is gasoline/petrol).
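
Splitting the field into its two parts could look like the sketch below. It assumes every value has the form "two or three digits, a slash, four digits"; the pattern and the helper name `split_code` are illustrative, and values that deviate from this form are returned as `None` so they can be reviewed by hand.

```python
import re

# Assumed pattern: 2-3 digit hazard identification number, '/', 4-digit UN number.
CODE_PATTERN = re.compile(r"^(\d{2,3})/(\d{4})$")


def split_code(code: str):
    """Return (hazard_number, un_number) as strings, or None if the value
    does not match the assumed pattern."""
    match = CODE_PATTERN.match(code.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_code("33/1993"))
print(split_code("not readable"))
```

Both parts are kept as strings rather than integers so that any leading zeros in a UN number are preserved.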

**Purpose and Application**

This master CSV file is the **single source of truth** for the `haztruck_dataset`. It is designed to be processed by scripts to:
1.  **Generate Labels:** Automatically create label files in formats required by machine learning frameworks (e.g., YOLO `.txt` files or Pascal VOC `.xml` files).
2.  **Data Analysis:** Perform detailed analysis on the dataset, such as calculating class distribution, analyzing image sources, or filtering by quality issues noted in the `issue` column.
3.  **Auditing and Verification:** Provide full traceability for each annotation, linking back to the original source page and the person who added it.
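
The label-generation step could be sketched as follows, grouping rows by image and building one YOLO-style label file per image. The rows are made-up values that mirror the CSV columns, `CLASS_ID = 0` is an assumed index for the single `hazmat_sign` class, and the sketch only builds the file contents in memory rather than writing to disk.

```python
from collections import defaultdict

# Illustrative rows mirroring the CSV columns; values are made up.
rows = [
    {"image_name": "73.png", "resolution": "1000x500",
     "box_xtl": 100.0, "box_ytl": 50.0, "box_xbr": 300.0, "box_ybr": 150.0},
    {"image_name": "73.png", "resolution": "1000x500",
     "box_xtl": 600.0, "box_ytl": 200.0, "box_xbr": 800.0, "box_ybr": 400.0},
]

CLASS_ID = 0  # assumed index for the single `hazmat_sign` class


def yolo_line(row):
    """Format one annotation row as 'class x_center y_center width height'."""
    img_w, img_h = (int(v) for v in row["resolution"].split("x"))
    x_c = (row["box_xtl"] + row["box_xbr"]) / 2 / img_w
    y_c = (row["box_ytl"] + row["box_ybr"]) / 2 / img_h
    w = (row["box_xbr"] - row["box_xtl"]) / img_w
    h = (row["box_ybr"] - row["box_ytl"]) / img_h
    return f"{CLASS_ID} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


# One label file per image, one line per bounding box.
labels = defaultdict(list)
for row in rows:
    label_file = row["image_name"].rsplit(".", 1)[0] + ".txt"
    labels[label_file].append(yolo_line(row))

for name, lines in labels.items():
    print(name)
    print("\n".join(lines))
```

Because all boxes of an image end up in the same file, the multiple-rows-per-image layout of the CSV maps directly onto the one-file-per-image layout that YOLO expects.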

In [None]:
# Shape, summary statistics, column names, and missing values
print(f"Rows and columns: {df_haztruck.shape}\n")
print(f"Summary statistics:\n{df_haztruck.describe(include='all')}\n")
# info() prints its report directly and returns None, so call it outside an f-string
print("DataFrame info:")
df_haztruck.info()
print(f"\nColumns:\n{df_haztruck.columns}\n")
print(f"Null values per column:\n{df_haztruck.isnull().sum()}\n")

In [None]:
# Set plotting style for better aesthetics
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

total_annotations = len(df_haztruck)
unique_images = df_haztruck['image_id'].nunique()

print(f"Total number of annotations (bounding boxes): {total_annotations:,}")
print(f"Total number of unique images: {unique_images:,}")
if unique_images > 0:
    print(f"Average annotations per image: {total_annotations / unique_images:.2f}")

print("\n--- Annotations by Website ---")
website_counts = df_haztruck['website'].value_counts()
print(website_counts)

plt.figure(figsize=(10, 6))
sns.barplot(x=website_counts.index, y=website_counts.values, palette='plasma')
plt.title('Number of Annotations Sourced from Each Website')
plt.xlabel('Website')
plt.ylabel('Number of Annotations')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("\n--- Annotations by Author (Top 10) ---")
author_counts = df_haztruck['author'].value_counts()
top_10_authors = author_counts.head(10)
print(top_10_authors)

plt.figure(figsize=(12, 7))
sns.barplot(x=top_10_authors.values, y=top_10_authors.index, palette='magma', orient='h')
plt.title('Top 10 Authors by Number of Annotations Contributed')
plt.xlabel('Number of Annotations')
plt.ylabel('Author')
plt.tight_layout()
plt.show()


# Treat missing values in the 'issue' column as 'No Issue' for this analysis
issue_counts = df_haztruck['issue'].fillna('No Issue').value_counts()
print("\n--- Annotations by Issue ---")
print(issue_counts)

plt.figure(figsize=(10, 6))
sns.barplot(x=issue_counts.index, y=issue_counts.values, palette='coolwarm')
plt.title('Distribution of Annotation Quality Issues')
plt.xlabel('Issue Type')
plt.ylabel('Number of Annotations')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
visualize_annotations_from_links(df_haztruck, num_samples=2)