<img src='https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/assets/images/logo.png?raw=true'>


## 👋🏽 What's up! It's [Harpreet](https://twitter.com/DataScienceHarp)

Welcome to this tutorial notebook on [DataGradients](https://github.com/Deci-AI/data-gradients) for **object detection datasets**. DG is an open-source Python library, created by [Deci AI](https://deci.ai/), for computer vision dataset analysis. If you're looking for the **segmentation datasets** notebook, you can find that [here.](https://bit.ly/dg-starter-notebook-seg)

I'll be guiding you through this notebook. At any point, if you get stuck or have questions, feel free to [read the docs](https://bit.ly/dg-docs) or get in touch:

1) Send me an email with your issue: harpreet.sahota@deci.ai

2) Hop into the [Deep Learning Daily (powered by Deci) Discord server](https://discord.gg/p9ecgRhDR8), and let me know what your question is.

3) [Open an issue on GitHub](https://github.com/Deci-AI/data-gradients/issues/new)

Let's get to it...


# Introduction

Whether you're working on image classification, object detection, or [semantic segmentation](https://bit.ly/dg-starter-notebook-seg), DataGradients helps you gain insights and analyze your datasets effectively.

This tutorial walks you through the features of DataGradients and shows you how to run a comprehensive data analysis for your computer vision projects.

With DataGradients, you can:

- Analyze image features such as color distribution, brightness, and size.
- Profile object detection datasets with metrics like bounding box area, intersection, and class frequency.
- Understand segmentation datasets using object area, width, height, and class frequency.
- Visualize samples for a better understanding.
- And [much more](https://github.com/Deci-AI/data-gradients/blob/master/documentation/feature_description.md)

Profiling your datasets has never been easier!

## 👨🏽‍🔧 Step 0: Installation

Note: after installation is complete you will need to restart this notebook. Do the following: `Runtime -> Restart runtime`.

Be careful NOT to select `Disconnect and delete runtime`.



In [None]:
%%capture
# data-gradients
!pip install data-gradients

# to get data from roboflow
!pip install roboflow

# for displaying pdfs as images in notebook
!pip install pdf2image
!apt-get install poppler-utils

# for pretty printing json
!pip install Pygments
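
After the restart, you may want to confirm that the installs are still visible to the runtime. Here's a minimal sketch that checks whether the modules imported later in this notebook can be found. The `missing_modules` helper is a hypothetical name for illustration, not part of DataGradients.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names in module_names that the current runtime cannot find."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level import names used later in this notebook
required = ["data_gradients", "roboflow", "pdf2image", "pygments", "yaml"]

missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If anything shows up as missing, re-run the installation cell and restart the runtime again.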

# 🛠️ Utility functions

The `display_pdf_pages` function is a utility that displays each page of a PDF file as images in separate output cells.

Given the path to a PDF file, it converts the PDF into a list of PIL Images using `pdf2image`.

It then iterates through the list and displays each image using `IPython.display`.

This function is useful for visually examining PDF content, such as reviewing pages or checking layouts.

In [None]:
from PIL import Image
from pdf2image import convert_from_path
from IPython.display import display

def display_pdf_pages(pdf_path):
    """
    Display each page of a PDF file as images in separate output cells.

    Args:
        pdf_path (str): The path to the PDF file.

    Raises:
        FileNotFoundError: If the specified PDF file is not found.

    Returns:
        None
    """
    try:
        # Convert PDF to a list of PIL Images
        images = convert_from_path(pdf_path)

        # Display each page image in its own output
        for image in images:
            display(image)

    except FileNotFoundError:
        raise FileNotFoundError("The specified PDF file was not found.")

The `print_pretty_json` function opens a JSON file, formats the data with an indent of 4 spaces using `json.dumps`, applies syntax highlighting with `Pygments`, and prints the pretty-printed JSON data in the cell below.

In [None]:
import json
from pygments import highlight, lexers, formatters
from pathlib import Path

def print_pretty_json(file_path):
    """
    Function to pretty print a JSON file with colorization.

    Args:
        file_path (Union[str, Path]): The path of the JSON file to be pretty-printed.

    Raises:
        FileNotFoundError: If the file at `file_path` doesn't exist.
        json.JSONDecodeError: If the file at `file_path` is not valid JSON.
    """
    try:
        # Open the file
        with open(file_path, 'r') as f:
            # Load the JSON data from the file
            data = json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")
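    except json.JSONDecodeError as err:
        # Re-raise with the file path while keeping the original error position
        raise json.JSONDecodeError(f"Invalid JSON file: {file_path}", err.doc, err.pos) from err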

    # Pretty print the JSON data with an indent of 4 spaces
    formatted_json = json.dumps(data, indent=4)

    # Colorize the pretty-printed JSON data
    colorful_json = highlight(formatted_json,
                              lexers.JsonLexer(),
                              formatters.TerminalFormatter()
                              )

    # Print the colorized, pretty-printed JSON data
    print(colorful_json)


## ⤵️ Step 1: Download Dataset

Before you start, you need to create a Roboflow [account and get your API key](https://app.roboflow.com/login). If you're not sure how to find your API key, [here's how](https://www.loom.com/share/05277274e8d542efaf9bc3f33c1396d3?sid=3a41d4c5-c0c7-4712-bf4b-6a8c7ba51947).

In [None]:
from IPython.display import clear_output
from roboflow import Roboflow
rf = Roboflow(api_key="<your-roboflow-key-here>")
project = rf.workspace("joseph-nelson").project("uno-cards")
dataset = project.version(1).download("yolov8")
clear_output()

The following cell opens the `data.yaml` file that comes with the dataset, reads its contents, and converts them into a Python dictionary using the `safe_load` function from the `yaml` module.

In [None]:
import yaml

# define the path to your YAML file
yaml_file_path = '/content/Uno-Cards-1/data.yaml'

# open the YAML file and load it into a dictionary
with open(yaml_file_path, 'r') as f:
    data_yaml = yaml.safe_load(f)
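
The rest of this notebook reads the class names from `data_yaml['names']` and treats them as a list. Some YOLO-style YAML files store `names` as an index-to-name mapping instead, so here's a small sketch that normalizes both shapes. The `class_names_from_yaml` helper and the example class names are hypothetical.

```python
def class_names_from_yaml(data_yaml):
    """Return class names as a list, ordered by class index."""
    names = data_yaml["names"]
    if isinstance(names, dict):
        # Mapping form, e.g. {0: "skip", 1: "reverse"}: order by index
        return [names[index] for index in sorted(names)]
    # List form: the position in the list is already the class index
    return list(names)


# Hypothetical example using the mapping form
example_yaml = {"names": {1: "reverse", 0: "skip"}}
print(class_names_from_yaml(example_yaml))  # ['skip', 'reverse']
```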

## 💾 Step 2: Instantiating Dataloaders

To use DataGradients, you need to be equipped with a few prerequisites:

- **Dataset**: Includes a Train set and a Validation or a Test set.

- **Class Names**: A list of the unique categories present in your dataset.

- **Iterable**: A method to iterate over your Dataset providing images and labels. This can be any of the following:
  - PyTorch Dataloader
  - PyTorch Dataset
  - Generator that yields image/label pairs
  - Any other iterable you use for model training/validation

DataGradients provides some functionality that allows you to easily load your data without any extra coding needed:
- `YoloFormatDetectionDataset`
- `VOCFormatDetectionDataset`
- `VOCDetectionDataset`

You can learn more about those [here.](https://github.com/Deci-AI/data-gradients/blob/master/documentation/datasets.md)


#### 🚨 If your dataset and annotations are in a custom format, you can use DataGradients Dataset Adapters. Learn more about those [here](https://github.com/Deci-AI/data-gradients#dataset-adapters).


In [None]:
dataset_params = {
    'data_dir':'/content/Uno-Cards-1',
    'train_images_dir':'train/images',
    'train_labels_dir':'train/labels',
    'val_images_dir':'valid/images',
    'val_labels_dir':'valid/labels',
    'test_images_dir':'test/images',
    'test_labels_dir':'test/labels',
    'classes': data_yaml['names']
}
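
Before you instantiate the datasets, it can help to check that the image and label folders line up. The sketch below compares file stems in an images folder and a labels folder. It assumes labels are `.txt` files that share a stem with their image, which is the usual YOLO layout; the `unmatched_stems` helper and the suffix list are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed image extensions; adjust to match your dataset
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def unmatched_stems(images_dir, labels_dir):
    """Return (image stems without a label file, label stems without an image)."""
    image_stems = {p.stem for p in Path(images_dir).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES}
    label_stems = {p.stem for p in Path(labels_dir).iterdir() if p.suffix.lower() == ".txt"}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Tiny throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as root:
    images, labels = Path(root, "images"), Path(root, "labels")
    images.mkdir()
    labels.mkdir()
    (images / "a.jpg").touch()
    (images / "b.jpg").touch()
    (labels / "a.txt").touch()
    print(unmatched_stems(images, labels))  # (['b'], [])
```

In this notebook you would point it at `Path(dataset_params['data_dir']) / dataset_params['train_images_dir']` and the matching labels folder. An image without a label file is not necessarily an error, since it may be a background image, but it is worth knowing about.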

In [None]:
from data_gradients.datasets.detection import YoloFormatDetectionDataset

train_set = YoloFormatDetectionDataset(root_dir=dataset_params['data_dir'],
                                       images_dir=dataset_params['train_images_dir'],
                                       labels_dir=dataset_params['train_labels_dir'])

val_set = YoloFormatDetectionDataset(root_dir=dataset_params['data_dir'],
                                     images_dir=dataset_params['val_images_dir'],
                                     labels_dir=dataset_params['val_labels_dir'])

## 📊 Step 3: Perform Analysis

Now that you have your dataset loaded and ready, it's time to profile it using DataGradients.

In this section, you'll use the `DetectionAnalysisManager` class from DataGradients to analyze your dataset. This will trigger the feature extraction, visualization, and interpretation processes provided by DataGradients.

**The time it takes to analyze your dataset depends on its size. If your dataset is large, it may take 20 minutes or more.**

You instantiate the `DetectionAnalysisManager` with the following arguments:

- `report_title`: A title for the analysis report.

- `train_data` and `val_data`: The dataloaders for the training and validation sets, respectively.

- `class_names`: The list of class names present in the dataset.


**🔘 There are optional parameters that you can adjust as needed**:

- `class_names_to_use`: The subset of class names to analyze, if you're interested in only certain classes.

- `images_extractor` and `labels_extractor`: Custom functions to extract images and labels from the dataset if needed.

- `threshold_soft_labels`: A threshold value for soft labels, converting them to hard labels.

- `batches_early_stop`: The number of batches to analyze before early stopping.

You can find more information about these parameters in the [`DetectionAnalysisManager` class documentation](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/managers/detection_manager.py).
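
To make the optional parameters concrete, here's a minimal sketch that assembles keyword arguments for `DetectionAnalysisManager` using only parameter names from the lists above. The `build_manager_kwargs` helper is hypothetical, and the subset check is an extra sanity check of my own, not something the library is known to require. The actual constructor call is left as a comment so the sketch runs without the dataset.

```python
def build_manager_kwargs(report_title, train_data, val_data, class_names,
                         class_names_to_use=None, batches_early_stop=None):
    """Collect required arguments and any optional ones that were provided."""
    kwargs = {
        "report_title": report_title,
        "train_data": train_data,
        "val_data": val_data,
        "class_names": list(class_names),
    }
    if class_names_to_use is not None:
        # Catch typos early: every requested class should be a known class
        unknown = sorted(set(class_names_to_use) - set(class_names))
        if unknown:
            raise ValueError(f"Unknown class names: {unknown}")
        kwargs["class_names_to_use"] = list(class_names_to_use)
    if batches_early_stop is not None:
        kwargs["batches_early_stop"] = int(batches_early_stop)
    return kwargs


# Placeholder data and class names, for illustration only
kwargs = build_manager_kwargs(
    report_title="Quick look",
    train_data=[],
    val_data=[],
    class_names=["0", "1", "2"],
    class_names_to_use=["0", "2"],
    batches_early_stop=10,
)
print(sorted(kwargs))

# In the notebook, you would then run:
# analyzer = DetectionAnalysisManager(**kwargs)
# analyzer.run()
```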

Once you have instantiated the `DetectionAnalysisManager`, you can run the analysis by calling the `run()` method.

### You'll be prompted for information about your annotations:

- Are your annotations labels-first or labels-last?

- What format are your annotations in?
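
If you're unsure how to answer, it helps to look at what a label line contains. The sketch below assumes the standard YOLO text layout, where each line is a class index followed by a normalized center x, center y, width, and height. That would make the annotations label-first with a normalized center/width/height box. The `yolo_line_to_xyxy` helper is hypothetical, and the exact wording of the prompt options may differ between DataGradients versions.

```python
def yolo_line_to_xyxy(line, image_width, image_height):
    """Convert one YOLO label line to (class_id, (x1, y1, x2, y2)) in pixels."""
    parts = line.split()
    class_id = int(parts[0])  # the label comes first
    cx, cy, w, h = (float(value) for value in parts[1:5])  # normalized to [0, 1]
    x1 = (cx - w / 2) * image_width
    y1 = (cy - h / 2) * image_height
    x2 = (cx + w / 2) * image_width
    y2 = (cy + h / 2) * image_height
    return class_id, (x1, y1, x2, y2)


# A made-up label line for a 640x480 image
print(yolo_line_to_xyxy("3 0.5 0.5 0.25 0.5", 640, 480))
# (3, (240.0, 120.0, 400.0, 360.0))
```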

In [None]:
from data_gradients.managers.detection_manager import DetectionAnalysisManager
import matplotlib

matplotlib.use('Agg') # This line is only for Colab

analyzer = DetectionAnalysisManager(
    report_title="Testing Data-Gradients Object Detection",
    train_data=train_set,
    val_data=val_set,
    class_names=data_yaml['names'],
)

analyzer.run()

## Step 4: View Full PDF Report

When you created the `analyzer`, you passed a value for `report_title`.

The report is saved under a `logs` folder in your current working directory, in a subfolder named after that value (in this notebook, `logs/Testing_Data-Gradients_Object_Detection`).

If you want to save it in a different folder, you can pass the path to `log_dir` in the `DetectionAnalysisManager` constructor.
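
Here's a small sketch of how the report location used in this notebook relates to `report_title`. The rule shown (spaces replaced by underscores, under a `logs` folder) is inferred from the paths in the cells below rather than taken from the library's documentation, so treat it as an assumption. The `expected_report_dir` helper is hypothetical.

```python
from pathlib import PurePosixPath


def expected_report_dir(report_title, log_root="logs"):
    """Guess the report folder, assuming spaces in the title become underscores."""
    return PurePosixPath(log_root) / report_title.replace(" ", "_")


report_dir = expected_report_dir("Testing Data-Gradients Object Detection")
print(report_dir / "Report.pdf")
# logs/Testing_Data-Gradients_Object_Detection/Report.pdf
```

On Colab the working directory is `/content`, which is why the cells below use paths that start with `/content/logs`.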

Inside the log directory, you will find a complete PDF report that summarizes the output of each feature extractor and offers insights on it.

In [None]:
# this function was defined and described at the beginning of this notebook
display_pdf_pages("/content/logs/Testing_Data-Gradients_Object_Detection/Report.pdf")

### ⬇️ Download Report

Since this is a Google Colab notebook, the report isn't saved to your local drive. If you want to download the report you can do so by running the following cell:

In [None]:
from google.colab import files
files.download('/content/logs/Testing_Data-Gradients_Object_Detection/Report.pdf')

### 📃 Accessing JSON report.

You can also access the JSON report. This is helpful if you want to save it and analyze changes in your data characteristics over time.

Note: I left the below cell commented out because it will make the notebook long and messy.

In [None]:
# report_json = '/content/logs/Testing_Data-Gradients_Object_Detection/summary.json'
# print_pretty_json(report_json)
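
If you save JSON reports from different points in time, one way to spot changes is to flatten both and compare the leaf values. This is a generic sketch that makes no assumption about the keys inside `summary.json`. The `flatten` and `changed_values` helpers, and the keys in the example, are hypothetical.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: leaf value}."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix[:-1]] = obj  # drop the trailing dot
    return items


def changed_values(old, new):
    """Return {path: (old value, new value)} for every leaf that differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        path: (old_flat.get(path), new_flat.get(path))
        for path in sorted(set(old_flat) | set(new_flat))
        if old_flat.get(path) != new_flat.get(path)
    }


# Made-up summaries, for illustration only
old_summary = {"stats": {"mean": 1.0, "bins": [1, 2]}}
new_summary = {"stats": {"mean": 1.5, "bins": [1, 2]}}
print(changed_values(old_summary, new_summary))  # {'stats.mean': (1.0, 1.5)}
```

To use it on real reports, load each file with `json.load` and pass the two dictionaries in.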

# 📄 Individual analysis.

If you only need one or a few analyses instead of the entire report, you can run them with the `run_detection_analysis` function.

`run_detection_analysis` is a convenience utility, defined further down in this notebook, that performs an individual analysis (as opposed to the complete report).

It takes various parameters to configure the analysis settings:

- `report_title`: The title of the analysis report.
- `feature_extractors`: The feature extractor, or list of feature extractors, to use for the analysis.


### You can choose from the following options for object detection datasets:

#### 🔍 [ImageColorDistribution](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/common/image_color_distribution.py)

This is a comparison between the RGB and grayscale intensity distributions (0-255) of the entire dataset, assuming the RGB channel ordering.

It helps to detect differences in image characteristics between the two datasets and potential flaws in the augmentation process.

For example, a significant difference in the mean value of a specific color between the two datasets could indicate an issue with augmentation.

#### 🌓 [ImagesAverageBrightness](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/common/image_average_brightness.py)

This graph displays how the brightness of each dataset's images is distributed.

This can reveal discrepancies between the training and validation sets.

For example, it may show that the training set only has daytime images while the validation set only has nighttime images.

#### 📏 [ImagesResolution](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/common/image_resolution.py)

The histograms show how the image height and width are distributed.

If any images were resized or added with padding, the histograms will display the size after these modifications.


#### 📐 [DetectionBoundingBoxArea](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/object_detection/bounding_boxes_area.py)

This graph displays how the bounding box area is distributed among each class.

This can reveal any gaps in object size between the training and validation sets, which could negatively impact the model's performance.

Having an excessive amount of small objects may suggest that the original image was downsized to an inadequate resolution for the objects in question.

#### 🔲 [DetectionBoundingBoxIoU](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/object_detection/bounding_boxes_iou.py)

This chart displays the distribution of Intersection over Union (IoU) values for a given set of boxes.

The heatmap indicates the percentage of boxes that overlap with an IoU value within the range of 0 to T for each class.

Only intersections between boxes belonging to the same class are taken into account.
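
As a reminder of what IoU measures, here's a worked example for two axis-aligned boxes in `(x1, y1, x2, y2)` form. It illustrates the metric itself and is not DataGradients' implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped at zero when boxes don't touch
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1) = 1/7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Identical boxes give an IoU of 1, and boxes that don't overlap give 0.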

#### 📊 [DetectionBoundingBoxPerImageCount](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/object_detection/bounding_boxes_per_image_count.py)

These graphs show how many bounding boxes appear in images.

This is useful to know when you observe a very high number of bounding boxes per image, because some models include a parameter that keeps only the top k results.

#### 📏 [DetectionBoundingBoxSize](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/object_detection/bounding_boxes_resolution.py)

These heat maps illustrate the distribution of bounding box width and height per class.

Large variations in object size can affect the model's ability to accurately recognize objects.

#### 🔢 [DetectionClassFrequency](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/object_detection/classes_frequency.py)

One important thing to consider is the frequency of each class in both the training and validation sets.

This can reveal any differences in class distribution, such as if a certain class only appears in the validation set.

Knowing this in advance can help you understand if your model will be able to accurately predict that class.
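
To illustrate the kind of gap this chart can reveal, here's a minimal sketch that counts class ids in two label lists and reports classes that appear in only one split. It's an illustration of the idea, not DataGradients' implementation; the `class_frequency_gap` helper and the label lists are hypothetical.

```python
from collections import Counter


def class_frequency_gap(train_labels, val_labels):
    """Count classes per split and list classes missing from either one."""
    train_counts, val_counts = Counter(train_labels), Counter(val_labels)
    return {
        "train_counts": dict(train_counts),
        "val_counts": dict(val_counts),
        "only_in_train": sorted(set(train_counts) - set(val_counts)),
        "only_in_val": sorted(set(val_counts) - set(train_counts)),
    }


# Made-up class ids, one entry per bounding box
gap = class_frequency_gap(train_labels=[0, 0, 1, 2], val_labels=[0, 3])
print(gap["only_in_train"], gap["only_in_val"])  # [1, 2] [3]
```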

#### 🌡️ [DetectionClassHeatmap](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/object_detection/classes_heatmap_per_class.py)

The heatmap shows where objects are most densely located in the images, which helps you understand how the objects are distributed in the space.

By looking at the heatmap, you can easily tell if the objects are mainly concentrated in certain areas or spread out evenly across the scene.

This provides useful information to determine if the objects are positioned as expected in the areas of interest.

#### 📊 [DetectionClassesPerImageCount](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/object_detection/classes_frequency_per_image.py)

The graph displays the frequency of each class in an image and indicates if the number of appearances is consistent or varies across images.

#### 👁️ [DetectionSampleVisualization](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/object_detection/sample_visualization.py)

The sample visualization feature shows images and labels in a visual format.

This helps you better understand the contents of your dataset.

In [None]:
def run_detection_analysis(report_title,
                           feature_extractors,
                           train_data,
                           val_data,
                           class_names):
    """
    Run the detection analysis using the provided parameters.

    Args:
        report_title (str): Title of the analysis report.
        feature_extractors (str or list): Name(s) of the feature extractors to be used.
        train_data (Iterable): Training data for the analysis.
        val_data (Iterable): Validation data for the analysis.
        class_names (list): List of class names for the analysis.
    """

    # Create an instance of DetectionAnalysisManager
    analyzer = DetectionAnalysisManager(
        report_title=report_title,
        feature_extractors=feature_extractors,
        train_data=train_data,
        val_data=val_data,
        class_names=class_names
    )

    # Run the analysis
    analyzer.run()


In [None]:
run_detection_analysis(report_title='DetectionBoundingBoxPerImageCount',
                       feature_extractors='DetectionBoundingBoxPerImageCount',
                       train_data=train_set,
                       val_data=val_set,
                       class_names=data_yaml['names'])

In [None]:
display_pdf_pages('/content/logs/DetectionBoundingBoxPerImageCount/Report.pdf')

In [None]:
report_json = '/content/logs/DetectionBoundingBoxPerImageCount/summary.json'
print_pretty_json(report_json)
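
If you want a handful of individual analyses, you could loop over extractor names and produce one report per name. Here's a minimal sketch of that pattern. The `run_individual_analyses` helper is hypothetical, and a stand-in runner is used so the sketch executes without the dataset; in this notebook you would pass `run_detection_analysis` as the runner.

```python
def run_individual_analyses(extractor_names, runner, **shared_kwargs):
    """Call runner once per extractor, using the extractor name as the report title."""
    titles = []
    for name in extractor_names:
        runner(report_title=name, feature_extractors=name, **shared_kwargs)
        titles.append(name)
    return titles


# Stand-in runner that only records what it was asked to do
calls = []

def fake_runner(**kwargs):
    calls.append(kwargs)


run_individual_analyses(
    ["DetectionClassFrequency", "DetectionBoundingBoxArea"],
    fake_runner,
    class_names=["0", "1"],  # placeholder class names
)
print([call["report_title"] for call in calls])
# ['DetectionClassFrequency', 'DetectionBoundingBoxArea']

# In the notebook, something like:
# run_individual_analyses(names, run_detection_analysis,
#                         train_data=train_set, val_data=val_set,
#                         class_names=data_yaml['names'])
```

Keep in mind that each call runs its own analysis pass, so several individual reports may take longer than one combined run.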

# 🔚 You've made it to the end!

If you think DataGradients is useful, head over to GitHub and [⭐️ the repo](https://github.com/Deci-AI/data-gradients).


### Some datasets you might want to try on your own:
- [HuggingFace competition: Ship detection](https://huggingface.co/spaces/competitions/ship-detection)

- [Aquarium dataset on RoboFlow](https://public.roboflow.com/object-detection/aquarium)

- [Vehicles-OpenImages Dataset on RoboFlow](https://public.roboflow.com/object-detection/vehicles-openimages)

- [Winegrape detection](https://github.com/thsant/wgisd)

- [Low light object detection](https://github.com/cs-chan/Exclusively-Dark-Image-Dataset)

- [Infrared person detection](https://camel.ece.gatech.edu/)

- [Pothole detection](https://www.kaggle.com/datasets/chitholian/annotated-potholes-dataset)

- [100k Labeled Road Images | Day, Night](https://www.kaggle.com/datasets/solesensei/solesensei_bdd100k)

- [Deep Fashion dataset](https://github.com/switchablenorms/DeepFashion2)

- [Playing card detection](https://www.kaggle.com/datasets/luantm/playing-card)

- [Anomaly detection in videos](https://www.crcv.ucf.edu/projects/real-world/)

- [Underwater fish recognition](https://www.kaggle.com/datasets/aalborguniversity/brackish-dataset)

- [Document layout detection](https://www.primaresearch.org/datasets/Layout_Analysis)

- [Trash Annotations in Context](http://tacodataset.org/)


# Give [YOLO-NAS](https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS.md) a try!
Now that you've analyzed your object detection dataset, go and train a YOLO-NAS model on it.

[Here's a starter notebook to get you on your way!](https://bit.ly/yolo-nas-starter-notebook)