# DamagedSigns Dataset Labeling

This notebook reads the DamagedSigns annotation CSVs (one `_classes.csv` per split) and writes a single standardized CSV file for ML training, with a main category and a subcategory for each image.

In [11]:
import pandas as pd
import csv
from pathlib import Path

## Dataset Information

The DamagedSigns dataset contains images of traffic signs with multiple classification labels:
- **Bending**: Signs that are bent or deformed
- **Damage**: Signs with physical damage
- **Healthy**: Signs in good condition
- **Vandalism**: Signs with graffiti or other vandalism
- **Wear**: Signs showing wear and tear

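We'll map these labels to:
- **Main Category**: "Public Cleanliness & Public Property Damage" for vandalism, "Road & Infrastructure Issues" for bending, damage, and wear
- **Sub Category**: "Vandalism" or "Broken/Missing Road Signs", respectively
- **Healthy** signs are skipped, since they do not represent a problem to report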
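
The category mapping that the processing function later in this notebook applies can be sketched as a small pure function. This is only an illustration: `categorize` is a hypothetical helper name that does not appear in the notebook, and the category strings are copied from the processing cell.

```python
def categorize(active_labels):
    """Return (main_category, sub_category), or None when the image is skipped."""
    labels = set(active_labels)
    # Healthy signs, and rows with no active label, are not reported.
    if not labels or "Healthy" in labels:
        return None
    # Vandalism takes precedence over the other damage labels.
    if "Vandalism" in labels:
        return ("Public Cleanliness & Public Property Damage", "Vandalism")
    # Bending, Damage, and Wear all map to the same sub category.
    if labels & {"Bending", "Damage", "Wear"}:
        return ("Road & Infrastructure Issues", "Broken/Missing Road Signs")
    return None


print(categorize(["Damage", "Vandalism"]))
print(categorize(["Wear"]))
print(categorize(["Healthy"]))
```

Keeping the mapping separate from the file handling makes it easy to test the precedence rules without touching the dataset.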

In [29]:
# Define dataset paths
base_path = Path(r"c:\Users\lchat\One Drive-UoM\OneDrive - University of Moratuwa\Datathon-2025-\DATATHON 2025\DamagedSigns")
train_path = base_path / "train"
test_path = base_path / "test"
valid_path = base_path / "valid"

print(f"Base path: {base_path}")
print(f"Train path exists: {train_path.exists()}")
print(f"Test path exists: {test_path.exists()}")
print(f"Valid path exists: {valid_path.exists()}")

Base path: c:\Users\lchat\One Drive-UoM\OneDrive - University of Moratuwa\Datathon-2025-\DATATHON 2025\DamagedSigns
Train path exists: True
Test path exists: True
Valid path exists: True


In [34]:
def process_damaged_signs_csv(csv_file_path, split_name):
    """
    Process DamagedSigns CSV file and convert to standardized format
    
    Args:
        csv_file_path: Path to the _classes.csv file
        split_name: Name of the split (train/test/valid)
    
    Returns:
        List of dictionaries with standardized labels
    """
    labeled_data = []
    
    try:
        # Read the CSV file
        df = pd.read_csv(csv_file_path)
        
        # Clean column names (remove extra spaces)
        df.columns = df.columns.str.strip()
        
        print(f"Processing {split_name} split - Found {len(df)} images")
        print(f"Columns: {list(df.columns)}")
        
        for _, row in df.iterrows():
            # Normalize filename (keep only the basename)
            filename = Path(str(row['filename']).strip()).name
            # Build full path (absolute)
            split_dir = csv_file_path.parent              # e.g. .../DamagedSigns/train
            image_path = (split_dir / filename).resolve() # absolute path
            # If you prefer a project-relative path instead:
            # relative_image_path = image_path.relative_to(base_path)

            # Collect the labels that are active (value == 1)
            label_columns = ['Bending', 'Damage', 'Healthy', 'Vandalism', 'Wear']
            active_labels = [label for label in label_columns if row.get(label, 0) == 1]
            
            # Skip healthy signs as they don't represent infrastructure problems
            if 'Healthy' in active_labels:
                continue
            
            # Vandalism takes precedence over the other damage labels
            if 'Vandalism' in active_labels:
                main_category = "Public Cleanliness & Public Property Damage"
                sub_category = "Vandalism"
            elif any(label in active_labels for label in ['Damage', 'Bending', 'Wear']):
                # Damage, Bending, and Wear all map to the same sub category
                main_category = "Road & Infrastructure Issues"
                sub_category = "Broken/Missing Road Signs"
            else:
                # No problematic labels found: skip this image
                continue
            
            labeled_data.append({
                'image_file': str(image_path),
                'main_category': main_category,
                'sub_category': sub_category,
                'split': split_name,
                'active_labels': ', '.join(active_labels)
            })
    
    except Exception as e:
        print(f"Error processing {csv_file_path}: {str(e)}")
    
    return labeled_data
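
For larger annotation files, the row-by-row loop above could be replaced by boolean masks over whole columns. The following is a minimal sketch of that alternative, not the notebook's method: `label_frame` and the `sample` frame are hypothetical, and the sketch assumes the label columns hold 0/1 integers (possibly with padded column names, as the strip step suggests).

```python
import numpy as np
import pandas as pd

LABELS = ["Bending", "Damage", "Healthy", "Vandalism", "Wear"]


def label_frame(df):
    """Vectorized version of the mapping; returns one row per kept image."""
    df = df.copy()
    df.columns = df.columns.str.strip()  # remove padding around column names
    # Missing label columns and missing values are treated as 0 (inactive).
    flags = df.reindex(columns=LABELS, fill_value=0).fillna(0).astype(int).eq(1)
    healthy = flags["Healthy"]
    vandalism = flags["Vandalism"] & ~healthy
    damaged = flags[["Bending", "Damage", "Wear"]].any(axis=1) & ~healthy & ~vandalism
    keep = vandalism | damaged
    out = df.loc[keep, ["filename"]].copy()
    is_vandalism = vandalism[keep]
    out["main_category"] = np.where(
        is_vandalism,
        "Public Cleanliness & Public Property Damage",
        "Road & Infrastructure Issues",
    )
    out["sub_category"] = np.where(is_vandalism, "Vandalism", "Broken/Missing Road Signs")
    return out.reset_index(drop=True)


# Made-up rows, only to exercise each branch of the mapping.
sample = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    " Bending": [0, 1, 0, 0],
    " Damage": [0, 0, 0, 1],
    " Healthy": [1, 0, 0, 0],
    " Vandalism": [0, 0, 1, 1],
    " Wear": [0, 0, 0, 0],
})
result = label_frame(sample)
print(result)
```

In the sample, `a.jpg` is dropped as healthy and `d.jpg` is labeled as vandalism even though its Damage flag is also set, matching the precedence used in the loop.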

In [35]:
# Process all splits
all_labeled_data = []

# Process train split
train_csv = train_path / "_classes.csv"
if train_csv.exists():
    train_data = process_damaged_signs_csv(train_csv, "train")
    all_labeled_data.extend(train_data)
    print(f"Train split: {len(train_data)} images processed")
else:
    print(f"Train CSV file not found: {train_csv}")

print()

Processing train split - Found 1647 images
Columns: ['filename', 'Bending', 'Damage', 'Healthy', 'Vandalism', 'Wear']
Train split: 858 images processed



In [36]:
# Process test split
test_csv = test_path / "_classes.csv"
if test_csv.exists():
    test_data = process_damaged_signs_csv(test_csv, "test")
    all_labeled_data.extend(test_data)
    print(f"Test split: {len(test_data)} images processed")
else:
    print(f"Test CSV file not found: {test_csv}")

print()

Processing test split - Found 30 images
Columns: ['filename', 'Bending', 'Damage', 'Healthy', 'Vandalism', 'Wear']
Test split: 17 images processed



In [37]:
# Process valid split
valid_csv = valid_path / "_classes.csv"
if valid_csv.exists():
    valid_data = process_damaged_signs_csv(valid_csv, "valid")
    all_labeled_data.extend(valid_data)
    print(f"Valid split: {len(valid_data)} images processed")
else:
    print(f"Valid CSV file not found: {valid_csv}")

print(f"\nTotal images processed: {len(all_labeled_data)}")

Processing valid split - Found 116 images
Columns: ['filename', 'Bending', 'Damage', 'Healthy', 'Vandalism', 'Wear']
Valid split: 67 images processed

Total images processed: 942
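
The three per-split cells above follow the same pattern, so they could be folded into one loop. The sketch below assumes a processing function with the same `(csv_path, split_name)` signature as `process_damaged_signs_csv`; `process_all_splits` is a hypothetical helper, and the demo uses a temporary directory with a stand-in processing function rather than the real dataset.

```python
from pathlib import Path
import tempfile


def process_all_splits(base_path, process_fn, splits=("train", "test", "valid")):
    """Run process_fn on each split's _classes.csv and collect the records."""
    records = []
    for split in splits:
        csv_path = Path(base_path) / split / "_classes.csv"
        if not csv_path.exists():
            print(f"{split}: CSV file not found: {csv_path}")
            continue
        split_records = process_fn(csv_path, split)
        print(f"{split}: {len(split_records)} images processed")
        records.extend(split_records)
    return records


# Demo: only a "train" split exists, and the stand-in function
# returns a single placeholder record per CSV file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train").mkdir()
    (Path(tmp) / "train" / "_classes.csv").write_text("filename,Healthy\n")
    demo = process_all_splits(tmp, lambda path, split: [{"split": split}])

print(demo)
```

In the notebook this would be called as `process_all_splits(base_path, process_damaged_signs_csv)`.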


In [38]:
# Display summary statistics
if all_labeled_data:
    df_summary = pd.DataFrame(all_labeled_data)
    
    print("=== DATASET SUMMARY ===")
    print(f"Total images: {len(df_summary)}")
    print()
    
    print("Distribution by split:")
    print(df_summary['split'].value_counts())
    print()
    
    print("Distribution by main category:")
    print(df_summary['main_category'].value_counts())
    print()
    
    print("Distribution by sub category:")
    print(df_summary['sub_category'].value_counts())
    print()
    
    # Show first few examples
    print("First 10 examples:")
    print(df_summary[['image_file', 'main_category', 'sub_category', 'active_labels']].head(10))
else:
    print("No data was processed!")

=== DATASET SUMMARY ===
Total images: 942

Distribution by split:
split
train    858
valid     67
test      17
Name: count, dtype: int64

Distribution by main category:
main_category
Road & Infrastructure Issues                   669
Public Cleanliness & Public Property Damage    273
Name: count, dtype: int64

Distribution by sub category:
sub_category
Broken/Missing Road Signs    669
Vandalism                    273
Name: count, dtype: int64

First 10 examples:
                                          image_file  \
0  C:\Users\lchat\One Drive-UoM\OneDrive - Univer...   
1  C:\Users\lchat\One Drive-UoM\OneDrive - Univer...   
2  C:\Users\lchat\One Drive-UoM\OneDrive - Univer...   
3  C:\Users\lchat\One Drive-UoM\OneDrive - Univer...   
4  C:\Users\lchat\One Drive-UoM\OneDrive - Univer...   
5  C:\Users\lchat\One Drive-UoM\OneDrive - Univer...   
6  C:\Users\lchat\One Drive-UoM\OneDrive - Univer...   
7  C:\Users\lchat\One Drive-UoM\OneDrive - Univer...   
8  C:\Users\lchat\One Drive-U

In [39]:
# Save the labeled data to CSV
if all_labeled_data:
    output_file = base_path / "damagedsigns_labeled_data.csv"
    
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['image_file', 'main_category', 'sub_category']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        
        writer.writeheader()
        for item in all_labeled_data:
            writer.writerow({
                'image_file': item['image_file'],
                'main_category': item['main_category'],
                'sub_category': item['sub_category']
            })
    
    print(f"\n✅ Successfully created: {output_file}")
    print(f"📊 Total records: {len(all_labeled_data)}")
    print("\n🎯 Ready for ML training!")
else:
    print("❌ No data to save!")


✅ Successfully created: c:\Users\lchat\One Drive-UoM\OneDrive - University of Moratuwa\Datathon-2025-\DATATHON 2025\DamagedSigns\damagedsigns_labeled_data.csv
📊 Total records: 942

🎯 Ready for ML training!
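
As an alternative to `csv.DictWriter`, the same three columns could be written with pandas. This is a sketch under assumptions: the `records` list is made-up placeholder data, and an in-memory buffer stands in for the output file. Passing `index=False` keeps the row index out of the file, so reading it back yields exactly the three expected columns.

```python
import io

import pandas as pd

# Placeholder records with the same keys the notebook produces.
records = [
    {"image_file": "train/a.jpg", "main_category": "Road & Infrastructure Issues",
     "sub_category": "Broken/Missing Road Signs", "split": "train", "active_labels": "Wear"},
    {"image_file": "train/b.jpg", "main_category": "Public Cleanliness & Public Property Damage",
     "sub_category": "Vandalism", "split": "train", "active_labels": "Vandalism"},
]
columns = ["image_file", "main_category", "sub_category"]

buffer = io.StringIO()  # stands in for the output file
pd.DataFrame(records)[columns].to_csv(buffer, index=False)

# Read the data back to confirm the column layout.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
print(len(round_trip))
```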


In [41]:
import pandas as pd
signs_labels = pd.read_csv(base_path / "damagedsigns_labeled_data.csv")

In [42]:
signs_labels

|  | image_file | main_category | sub_category |
| --- | --- | --- | --- |
| 0 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Road & Infrastructure Issues | Broken/Missing Road Signs |
| 1 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Road & Infrastructure Issues | Broken/Missing Road Signs |
| 2 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Road & Infrastructure Issues | Broken/Missing Road Signs |
| 3 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Public Cleanliness & Public Property Damage | Vandalism |
| 4 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Public Cleanliness & Public Property Damage | Vandalism |
| ... | ... | ... | ... |
| 937 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Road & Infrastructure Issues | Broken/Missing Road Signs |
| 938 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Road & Infrastructure Issues | Broken/Missing Road Signs |
| 939 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Public Cleanliness & Public Property Damage | Vandalism |
| 940 | C:\Users\lchat\One Drive-UoM\OneDrive - Univer... | Road & Infrastructure Issues | Broken/Missing Road Signs |


## Summary

This notebook processes the DamagedSigns dataset, whose annotations are stored as CSV files with one binary column per label:

**Input Format**: CSV files with a `filename` column and one 0/1 column per label (Bending, Damage, Healthy, Vandalism, Wear)

**Output Categories**:
- **Healthy signs** → skipped (not written to the output)
- **Vandalized signs** → "Public Cleanliness & Public Property Damage" / "Vandalism"
- **Bent, damaged, or worn signs** → "Road & Infrastructure Issues" / "Broken/Missing Road Signs"

When several labels are active on one image, a Healthy label causes the image to be skipped, Vandalism takes precedence next, and any remaining Bending, Damage, or Wear label maps to "Broken/Missing Road Signs". The output file contains 942 records.
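
The counts printed in the cell outputs above can be cross-checked with a few lines of arithmetic. The numbers below are copied from those outputs; the variable names are only illustrative.

```python
# Images found per split, from the "Found N images" lines.
raw_counts = {"train": 1647, "test": 30, "valid": 116}
# Images kept per split and per main category, from the dataset summary.
kept_by_split = {"train": 858, "valid": 67, "test": 17}
kept_by_category = {
    "Road & Infrastructure Issues": 669,
    "Public Cleanliness & Public Property Damage": 273,
}
total_kept = 942

# Both breakdowns should add up to the reported total.
assert sum(kept_by_split.values()) == total_kept
assert sum(kept_by_category.values()) == total_kept

# Images skipped because they were healthy or had no problem label.
skipped = sum(raw_counts.values()) - total_kept
print(f"Skipped images: {skipped}")  # Skipped images: 851
```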