# Descriptive Statistics for YOLO Training Dataset

This notebook provides visual and quantitative insights into your training dataset.

It processes image and label files (with matching names) from two folders:
- `images/` – contains input images
- `labels/` – contains YOLO-format `.txt` files

📌 **Requirement**: A `labels.txt` file at the root of the training folder, mapping each class index to its class name, for example:

```python
'0': 'class_name0',
'1': 'class_name1',
'2': 'class_name2'
```
These statistics help assess label coverage and dataset consistency prior to training.
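
The `get_labels` helper imported later in this notebook lives in an external module that is not shown here. As a rough illustration only, a file in the format above could be parsed into a dictionary along these lines. `parse_labels_text` is a hypothetical name, and the regular expression assumes single-quoted keys and values exactly as in the example.

```python
import re

# Hypothetical helper, not the notebook's get_labels: it assumes every entry
# looks like  '0': 'class_name0'  with single quotes, as in the example above.
ENTRY = re.compile(r"'([^']+)'\s*:\s*'([^']*)'")

def parse_labels_text(text):
    """Return a dict mapping class index (as a string) to class name."""
    return {index: name for index, name in ENTRY.findall(text)}

sample = "'0': 'class_name0',\n'1': 'class_name1',\n'2': 'class_name2'\n"
labels = parse_labels_text(sample)
print(labels)
```

Keys stay strings in this sketch because the class index read from a YOLO annotation line is also a string until it is converted.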


&copy; 2023 Marion Charpier — use of this notebook requires appropriate citation.

## Environment

In [37]:
import os
import re
import codecs
import shutil

import pandas as pd
import matplotlib.pyplot as plt

import sys
sys.path.append(os.path.join('..', 'modules'))

from class_names_functions import get_labels

## Functions

### Function to create the folder for statistics results

In [38]:
def create_stats_folder(training_folder):
    """
    This function creates a subdirectory named 'dataset_statistics' within the specified training folder.
    The folder is intended to store statistical data related to the dataset used during training, 
    such as data distribution, class frequencies, and other metrics that can provide insights into 
    the dataset composition and quality.

    :param training_folder: 
        - Type: str
        - Description: The absolute path to the training folder where the 'dataset_statistics' subdirectory 
                       will be created.

    :return: 
        - Type: None
        - Description: This function does not return a value. It creates a folder named 'dataset_statistics' 
                       within the specified `training_folder` if it does not already exist.

    This ensures a dedicated space for storing dataset statistics, helping maintain an organized project structure.
    """
    
    if not os.path.exists(os.path.join(training_folder, 'dataset_statistics')):
        os.mkdir(os.path.join(training_folder, 'dataset_statistics'))
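
An equivalent one-call variant, shown as a sketch rather than as the notebook's own code: `os.makedirs` with `exist_ok=True` creates the folder when needed and does nothing when it already exists, which removes the separate existence check. The name `ensure_stats_folder` is an assumption made for this example.

```python
import os
import tempfile

def ensure_stats_folder(training_folder):
    """Create 'dataset_statistics' inside training_folder and return its path."""
    stats_folder = os.path.join(training_folder, 'dataset_statistics')
    os.makedirs(stats_folder, exist_ok=True)  # no error if the folder already exists
    return stats_folder

# Demonstration in a throwaway directory; calling the function twice is safe
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_stats_folder(tmp)
    second = ensure_stats_folder(tmp)
    created = os.path.isdir(first)
```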

### Functions to describe the annotated sources

#### Clean up the names of files annotated with Label Studio

In [39]:
def clean_LS(training_folder, annotated_with_LS):
    """
    This function renames files in the 'images' and 'labels' subdirectories of the specified training folder 
    by removing the prefix added by Label Studio (LS) during annotation. The prefix typically consists of 
    an 8-character alphanumeric string followed by a dash (e.g., 'abcd1234-').

    :param training_folder: 
        - Type: str
        - Description: The absolute path to the training folder containing the 'images' and 'labels' subdirectories.

    :param annotated_with_LS: 
        - Type: bool
        - Description: A boolean flag indicating whether the files were annotated using Label Studio. 
                       If `True`, the function will proceed with renaming the files to remove the LS prefix.

    :return: 
        - Type: None
        - Description: This function does not return a value. It modifies the filenames of the images and labels 
                       in place, making them compatible with the rest of the processing pipeline.
    """
    
    if annotated_with_LS:
        img_folder = os.path.join(training_folder, 'images')
        label_folder = os.path.join(training_folder, 'labels')

        # Label Studio prefix: 8 alphanumeric characters followed by a dash
        ls_prefix = re.compile(r'^[0-9A-Za-z]{8}-')

        # Strip the prefix in 'images' and 'labels'; names without the prefix are left untouched,
        # so running the function twice does not truncate already cleaned names
        for kind, folder in (('image', img_folder), ('label', label_folder)):
            for filename in os.listdir(folder):
                new_filename = ls_prefix.sub('', filename)
                if new_filename == filename:
                    continue

                os.rename(os.path.join(folder, filename), os.path.join(folder, new_filename))
                print(f"Renamed {kind} file : {filename} -> {new_filename}")
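
Because renaming modifies the dataset in place, it can help to list the planned renames before applying them. The sketch below is a dry run under the assumption stated in the docstring above (an 8-character alphanumeric prefix followed by a dash). `preview_ls_renames` is a hypothetical helper, not part of the notebook.

```python
import os
import re
import tempfile

# Assumed prefix shape, taken from the docstring above: 8 alphanumerics + dash
LS_PREFIX = re.compile(r'^[0-9A-Za-z]{8}-')

def preview_ls_renames(folder):
    """Return (old_name, new_name) pairs for files that carry the prefix."""
    pairs = []
    for filename in sorted(os.listdir(folder)):
        new_name = LS_PREFIX.sub('', filename, count=1)
        if new_name != filename:
            pairs.append((filename, new_name))
    return pairs

# Throwaway folder with one prefixed and one clean file name
with tempfile.TemporaryDirectory() as tmp:
    for name in ('abcd1234-page_01.jpg', 'page_02.jpg'):
        open(os.path.join(tmp, name), 'w').close()
    planned = preview_ls_renames(tmp)
print(planned)
```

Nothing is renamed here: the function only reports which names would change, and files without the prefix do not appear in the list.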

### Distribution of annotations

#### Get the annotation files

In [40]:
def get_annotation_files(img_folder, txt_folder):
    """
    This function retrieves the list of '.txt' files containing the annotations corresponding to the 
    images stored in the specified image folder. It matches the annotation files with the image files 
    based on their names, ensuring that only annotations with a corresponding image are included.

    :param img_folder: 
        - Type: str
        - Description: The absolute path to the folder where the images are stored. The function will 
                       look for image files with extensions such as '.jpg', '.jpeg', or '.png'.

    :param txt_folder: 
        - Type: str
        - Description: The absolute path to the folder where the annotation files (`.txt` files) are stored.
                       The function will look for annotation files that match the names of the images 
                       in the `img_folder`.

    :return: 
        - Type: list of str
        - Description: A list containing the absolute paths of all `.txt` annotation files that have a 
                       corresponding image in the `img_folder`.

    This function is useful for ensuring that only annotations with corresponding images are used, 
    which is crucial for maintaining consistency between images and labels during model training.
    """
    
    image_extensions = (".jpg", ".jpeg", ".png")
    image_files = [filename for filename in os.listdir(img_folder) if filename.endswith(image_extensions)]

    annotation_files = []
    
    for image_file in image_files:
        image_name, image_ext = os.path.splitext(image_file)
        annotation_file = os.path.join(txt_folder, image_name + '.txt')
        
        if os.path.exists(annotation_file):
            annotation_files.append(annotation_file)
            
    return annotation_files
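
The matching above only reports labels that have an image. As a complementary sketch, set operations on file stems can also expose images without labels and labels without images. `split_by_pairing` is a hypothetical helper, and unlike the function above it compares extensions case-insensitively, which is an assumption of this example.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def split_by_pairing(img_folder, txt_folder):
    """Return three sorted lists of stems: paired, images only, labels only."""
    image_stems = {os.path.splitext(f)[0] for f in os.listdir(img_folder)
                   if f.lower().endswith(IMAGE_EXTENSIONS)}
    label_stems = {os.path.splitext(f)[0] for f in os.listdir(txt_folder)
                   if f.lower().endswith('.txt')}
    return (sorted(image_stems & label_stems),
            sorted(image_stems - label_stems),
            sorted(label_stems - image_stems))

# Tiny throwaway dataset: 'a' is paired, 'b' has no label, 'c' has no image
with tempfile.TemporaryDirectory() as tmp:
    img_dir = os.path.join(tmp, 'images')
    txt_dir = os.path.join(tmp, 'labels')
    os.makedirs(img_dir)
    os.makedirs(txt_dir)
    for name in ('a.jpg', 'b.PNG'):
        open(os.path.join(img_dir, name), 'w').close()
    for name in ('a.txt', 'c.txt'):
        open(os.path.join(txt_dir, name), 'w').close()
    paired, images_only, labels_only = split_by_pairing(img_dir, txt_dir)
print(paired, images_only, labels_only)
```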

#### Check that all annotation files are utf-8 encoded

In [41]:
def encoding(training_folder):
    """
    This function ensures that all annotation files within the specified training folder are encoded in UTF-8 format, 
    which is required for compatibility with the YOLOv8 model training process. If an annotation file is found to have 
    a different encoding (e.g., ISO-8859-1), the function identifies and logs it for further action.

    :param training_folder: 
        - Type: str
        - Description: The absolute path to the training folder containing the subdirectories 'images' and 'labels'.
                       The function will process the annotation files stored in the 'labels' subdirectory.

    :return: 
        - Type: None
        - Description: This function does not return a value. It checks and logs the encoding of each annotation file 
                       to ensure they are in UTF-8 format.

    This function helps ensure that all annotation files have consistent encoding, preventing errors during 
    the training process with YOLOv8 or other machine learning models that require UTF-8 encoding.
    """
    
    annotations_txt = get_annotation_files(os.path.join(training_folder, 'images'), os.path.join(training_folder, 'labels'))

    for file_path in annotations_txt:
        # get_annotation_files already returns paths that include the 'labels' folder
        with open(file_path, 'rb') as f:
            rawdata = f.read()
        try:
            codecs.decode(rawdata, 'utf-8')
        except UnicodeDecodeError:
            # ISO-8859-1 maps every byte to a character, so decoding with it can never fail.
            # The message therefore means "not UTF-8, probably ISO-8859-1".
            print(f"{file_path} is not UTF-8 encoded (probably ISO-8859-1)")
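
The check above only reports files; the follow-up action is left to the user. One possible follow-up, sketched under the assumption that a flagged file really is ISO-8859-1, is to rewrite it as UTF-8. `reencode_to_utf8` is a hypothetical helper and is not part of the notebook.

```python
import os
import tempfile

def reencode_to_utf8(path, source_encoding='iso-8859-1'):
    """Rewrite the file at path as UTF-8, reading it with source_encoding."""
    # newline='' keeps the original line endings untouched
    with open(path, 'r', encoding=source_encoding, newline='') as f:
        text = f.read()
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(text)

# Throwaway file holding one ISO-8859-1 byte (0xE9, 'é')
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.txt')
    with open(sample_path, 'wb') as f:
        f.write(b'caf\xe9\n')
    reencode_to_utf8(sample_path)
    with open(sample_path, 'rb') as f:
        converted = f.read()
print(converted)
```

This should only be applied to files that failed the UTF-8 check: running it on a file that is already UTF-8 would corrupt any non-ASCII characters.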

#### Function to get the number of images without annotations

In [42]:
def img_without_annotations(img_folder, txt_folder):
    """
    This function identifies images in the specified image folder that do not have corresponding annotation files 
    or have empty annotation files. It helps detect unannotated images, which may cause issues during model training.
    This function helps ensure that the dataset is clean and consistent before starting a training session, 
    preventing potential errors or suboptimal model performance caused by unannotated or empty images.

    :param img_folder: 
        - Type: str
        - Description: The absolute path to the folder where the images are stored. This function will check for 
                       image files with standard image extensions such as `.jpg`, `.jpeg`, and `.png`.

    :param txt_folder: 
        - Type: str
        - Description: The absolute path to the folder where the annotation files are stored. This function will 
                       look for `.txt` files that match the image filenames.

    :return: 
        - Type: int
        - Description: The number of unannotated images found, including those without annotation files 
                       and those with empty annotation files.
    """

    annotation_files = get_annotation_files(img_folder, txt_folder)
    
    image_extensions = (".jpg", ".jpeg", ".png")
    # Use the img_folder argument rather than a global training_folder variable
    image_files = [filename for filename in os.listdir(img_folder) if filename.endswith(image_extensions)]
    
    count = 0
    unannotated_image = []
    
    # Images with no annotation file at all
    for image_file in image_files:
        image_name, image_ext = os.path.splitext(image_file)
        annotation_file = os.path.join(txt_folder, image_name + '.txt')
        if annotation_file not in annotation_files:
            count += 1
            unannotated_image.append(image_file)
            print(f"Image {image_file} has no annotation file")
    
    # Images whose annotation file exists but is empty (annotation_file is already a full path)
    for annotation_file in annotation_files:
        with open(annotation_file, 'r') as f:
            if f.read().strip() == "":
                count += 1

    # Only images without an annotation file are offered for deletion
    if len(unannotated_image) > 0:
        delete = input(f'You have {len(unannotated_image)} images without an annotation file in your dataset. Do you want to delete them? (yes/no) : ')
        if delete == 'yes':
            for image in unannotated_image:
                os.remove(os.path.join(img_folder, image))
                print(f"The image {os.path.join(img_folder, image)} has been deleted")
        else:
            print('Warning! You will start a training session with unannotated images')
    
    return count

#### Get the number of annotations per image

In [43]:
def annotations_per_img(training_folder):
    """
    This function calculates the number of annotations per image in the specified training folder 
    and generates a CSV file containing these results. The CSV file can be used for statistical analysis 
    or to identify images with high or low annotation counts, which might impact training performance.

    :param training_folder: 
        - Type: str
        - Description: The absolute path to the training folder containing 'images' and 'labels' subdirectories. 
                       The function will analyze the annotation files stored in the 'labels' subdirectory.

    :return: 
        - Type: None
        - Description: This function does not return a value. It creates a CSV file named `annotations_per_img.csv` 
                       in the 'dataset_statistics' subdirectory of the training folder.

    This CSV file can be used to identify images with insufficient or excessive annotations, enabling better 
    dataset curation and analysis.
    """

    # Retrieve annotation files (full paths)
    annotation_files = get_annotation_files(os.path.join(training_folder, 'images'), os.path.join(training_folder, 'labels'))

    # Count annotations per image
    lines_per_file = {}

    for annotation_file in annotation_files:
        with open(annotation_file, 'r') as f:
            # One annotation per non-empty line, consistent with total_annotations
            nb_lines = sum(1 for line in f if line.strip())

        image_name = os.path.splitext(os.path.basename(annotation_file))[0]  # File name without folder or extension
        image_path = os.path.join(training_folder, 'images', f'{image_name}.jpg')  # Assume images have .jpg extension, modify as needed
        lines_per_file[image_path] = nb_lines

    # Sort images by decreasing number of annotations
    lines_per_file_sorted = dict(sorted(lines_per_file.items(), key=lambda x: x[1], reverse=True))

    # Create a DataFrame from the results
    df = pd.DataFrame(lines_per_file_sorted.items(), columns=['image_name', 'annotations_nb'])

    # Write the DataFrame to a CSV file with ';' as the separator
    csv_file_path = os.path.join(training_folder, 'dataset_statistics', 'annotations_per_img.csv')
    df.to_csv(csv_file_path, index=False, sep=';')

    print(f'{csv_file_path} created')
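
Once per-image counts are available, a few summary values make outliers easier to spot than a long sorted list. The sketch below uses only the standard library. `summarize_counts` is a hypothetical helper, and the counts are made up for illustration.

```python
import statistics

def summarize_counts(counts):
    """Return basic statistics for a mapping image_name -> annotation count."""
    values = list(counts.values())
    return {
        'images': len(values),
        'min': min(values),
        'max': max(values),
        'mean': statistics.mean(values),
        'median': statistics.median(values),
    }

# Made-up counts, for illustration only
example_counts = {'img_a.jpg': 12, 'img_b.jpg': 3, 'img_c.jpg': 0, 'img_d.jpg': 5}
summary = summarize_counts(example_counts)
print(summary)
```

A mean well above the median, as in this toy example, suggests that a few densely annotated images dominate the dataset.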

#### Get total number of annotations

In [44]:
def total_annotations(img_folder, txt_folder):
    """
    This function calculates the total number of annotations present in the specified training dataset 
    by counting the number of non-empty lines in each annotation file. Each line in a `.txt` annotation file 
    typically represents an individual bounding box or object annotation.

    :param img_folder: 
        - Type: str
        - Description: The absolute path to the folder where the images are stored. 
                       The function will look for image files to identify corresponding annotation files.

    :param txt_folder: 
        - Type: str
        - Description: The absolute path to the folder where the annotation files are stored. 
                       The function will look for `.txt` files containing annotation data.

    :return: 
        - Type: int
        - Description: The total number of annotations across all images in the dataset. This count includes 
                       all valid lines from the `.txt` annotation files, excluding empty lines.

    This function helps provide an overview of the dataset's annotation density,
    which can be useful for dataset analysis and model training considerations.
    """

    # Retrieve Annotation Files
    annotation_files = get_annotation_files(img_folder, txt_folder)

    # Count Annotations
    total_lines = 0

    for annotation_file in annotation_files:
        # annotation_file is already a full path returned by get_annotation_files
        with open(annotation_file, 'r') as f:
            nb_lines = 0
            for line in f:
                if line.strip():  # ignore empty lines
                    nb_lines += 1
            total_lines += nb_lines

    # Print before returning, otherwise the message is never displayed
    print(f"The total number of annotations is {total_lines}.")
    return total_lines
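
Counting non-empty lines assumes that every such line is a well-formed annotation. A stricter check, sketched here for the standard YOLO detection layout (`class_id x_center y_center width height`, with coordinates normalized to the range 0 to 1), parses each line and rejects malformed ones. `parse_yolo_line` is a hypothetical helper, and segmentation labels, which carry more fields, are not covered.

```python
def parse_yolo_line(line):
    """Parse 'class_id x_center y_center width height' into a 5-tuple.

    Raises ValueError if the line does not have five fields or if a
    coordinate falls outside [0, 1].
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    class_id = int(fields[0])
    x_center, y_center, width, height = (float(v) for v in fields[1:])
    for value in (x_center, y_center, width, height):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"coordinate {value} outside [0, 1]: {line!r}")
    return class_id, x_center, y_center, width, height

parsed = parse_yolo_line("2 0.5 0.25 0.1 0.2")
print(parsed)
```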

#### Get the number of annotations for each class

In [45]:
def classes_distribution(training_folder):
    """
    This function calculates the distribution of annotations across different classes in the training dataset.
    It counts the number of annotations per class and saves the results in a CSV file. Additionally, it generates 
    a bar chart to visualize the class distribution and saves it as a PNG file in the 'dataset_statistics' folder.

    :param training_folder: 
        - Type: str
        - Description: The absolute path to the training folder containing the 'labels' and 'dataset_statistics' 
                       subdirectories. The function will analyze the annotation files stored in the 'labels' 
                       subdirectory and use the `labels.txt` file for class names.

    :return: 
        - Type: None
        - Description: This function does not return a value. It creates a CSV file named `class_distribution.csv` 
                       and a PNG image named `class_distribution.png` in the 'dataset_statistics' subdirectory.

    This function helps provide a clear understanding of the class distribution in the dataset, 
    allowing for better insights and analysis before training a model.
    """

    # Get the labels from the labels.txt file
    annotation_classes = get_labels(os.path.join(training_folder, 'labels.txt'))
    annotation_files = get_annotation_files(os.path.join(training_folder, 'images'), os.path.join(training_folder, 'labels'))
    
    annotation_labels = annotation_classes

    # Count Annotations per Class
    occurrences = {}
    for annotation_file in annotation_files:
        # annotation_file is already a full path; files are expected to be UTF-8 (see encoding())
        with open(annotation_file, 'r', encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue  # skip empty lines, on which split()[0] would raise IndexError
                annotation_code = line.split()[0]
                occurrences[annotation_code] = occurrences.get(annotation_code, 0) + 1

    # Map annotation codes to class names
    class_names = [annotation_labels[code].strip() for code in occurrences.keys()]
    
    # Create a DataFrame from the results
    df = pd.DataFrame({'class_name': class_names, 'nb_occurrences': occurrences.values()})

    # Write the DataFrame to a CSV file with ';' as the separator
    csv_file_path = os.path.join(os.path.join(training_folder, 'dataset_statistics'), 'class_distribution.csv')
    df.to_csv(csv_file_path, index=False, sep=';')

    print(f'{csv_file_path} created')
    
    # Creating a horizontal bar chart
    plt.barh(class_names, list(occurrences.values()))

    # Setting axis and title labels
    plt.xlabel('Number of occurrences')
    plt.ylabel('Classes')
    plt.title('Class distribution')

    # Display and save the graph
    plt.savefig(os.path.join(training_folder, 'dataset_statistics', 'class_distribution.png'), bbox_inches='tight')
    plt.show()
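
The per-class counting step can also be written with `collections.Counter`, shown here as a sketch on in-memory lines rather than on the notebook's files. `count_classes` is a hypothetical helper, and the sample lines are invented.

```python
from collections import Counter

def count_classes(lines):
    """Count the first field (class index) of every non-empty line."""
    return Counter(line.split()[0] for line in lines if line.strip())

# Invented annotation lines, including one empty line that is skipped
example_lines = [
    "0 0.5 0.5 0.2 0.2",
    "1 0.3 0.3 0.1 0.1",
    "",
    "0 0.7 0.7 0.2 0.2",
]
counts = count_classes(example_lines)
print(counts.most_common())
```

`most_common()` returns the classes sorted by decreasing frequency, which gives a bar chart a stable, readable order.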


#### Output global statistics

In [46]:
def get_global_results(training_folder):
    """
    This function generates a summary of key dataset metrics and saves the results in a CSV file named `global_data.csv`.
    The metrics included are:
        - The number of images without corresponding annotations.
        - The total number of annotations in the dataset.

    :param training_folder: 
        - Type: str
        - Description: The absolute path to the training folder containing the 'images', 'labels', and 
                       'dataset_statistics' subdirectories. The function will analyze the annotation files 
                       in the 'labels' subdirectory and images in the 'images' subdirectory.

    :return: 
        - Type: None
        - Description: This function does not return a value. It creates a CSV file named `global_data.csv` 
                       in the 'dataset_statistics' subdirectory of the training folder.

    This function provides a quick summary of the dataset's quality and completeness, allowing for easier 
    tracking of dataset issues before starting model training.
    """

    # Calculate the metrics
    metrics = {
        'Number of files without annotations': img_without_annotations(os.path.join(training_folder, 'images'), os.path.join(training_folder, 'labels')),
        'Total number of annotations': total_annotations(os.path.join(training_folder, 'images'), os.path.join(training_folder, 'labels'))
    }

    # Create a DataFrame from the results
    df = pd.DataFrame(metrics.items(), columns=['metric', 'value'])

    # Write the DataFrame to a CSV file with ';' as the separator
    csv_file_path = os.path.join(os.path.join(training_folder, 'dataset_statistics'), 'global_data.csv')
    df.to_csv(csv_file_path, index=False, sep=';')

    print(f'{csv_file_path} created')

## Processing

In [None]:
training_folder = 'TRAINING_FOLDER' # to be modified, absolute path to the folder in which the training session data are stored

In [49]:
# Create the statistics folder
create_stats_folder(training_folder)

In [50]:
# Strip Label Studio prefixes from file names; disabled by default, set annotated_with_LS=True if needed
clean_LS(training_folder, annotated_with_LS=False)

In [51]:
# Check encoding format of annotation files
encoding(training_folder)

In [None]:
# Write a CSV file with the number of annotations per image
annotations_per_img(training_folder)

In [None]:
# Write a CSV file and a bar chart (PNG) with the distribution of classes in the training dataset
classes_distribution(training_folder)

In [None]:
# Write a CSV file with the number of unannotated images and the total number of annotations
get_global_results(training_folder)