<img src='https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/assets/images/logo.png?raw=true'>


## 👋🏽 What's up! It's [Harpreet](https://twitter.com/DataScienceHarp)

Welcome to this tutorial notebook on [DataGradients](https://github.com/Deci-AI/data-gradients) for **segmentation datasets**. DG is an open-source Python library for computer vision dataset analysis. If you're looking for the **object detection datasets notebook**, you can find it [here](https://bit.ly/dg-starter-notebook-od).

I'll be guiding you through this notebook. At any point, if you get stuck or have questions, feel free to [read the docs](https://bit.ly/dg-docs) or get in touch:

1) Send me an email with your issue: harpreet.sahota@deci.ai

2) Hop into the [Deep Learning Daily (powered by Deci) Discord server](https://discord.gg/p9ecgRhDR8), and let me know what your question is.

3) [Open an issue on GitHub](https://github.com/Deci-AI/data-gradients/issues/new)


Let's get to it...


# Introduction

Whether you're working on image classification, [object detection](https://bit.ly/dg-starter-notebook-od), or semantic segmentation, DataGradients helps you gain insights and analyze your datasets effectively.

In this tutorial, you'll explore the features of DataGradients and work through a complete data analysis for a computer vision project.

With DataGradients, you can:

- Analyze image features such as color distribution, brightness, and size.
- Profile object detection datasets with metrics like bounding box area, intersection, and class frequency.
- Understand segmentation datasets using object area, width, height, and class frequency.
- Visualize samples for a better understanding.
- And [much more](https://github.com/Deci-AI/data-gradients/blob/master/documentation/feature_description.md)

Profiling your datasets has never been easier!

## 👨🏽‍🔧 Step 0: Installation

Note: after installation is complete you will need to restart this notebook. Do the following: `Runtime -> Restart runtime`.

Be careful NOT to select `Disconnect and delete runtime`.



In [None]:
%%capture
# data-gradients
!pip install -U -q git+https://github.com/Deci-AI/data-gradients.git

# !pip install data-gradients

# to get data from roboflow
!pip install roboflow

# for displaying pdfs as images in notebook
!pip install pdf2image
!apt-get install -y poppler-utils

# for pretty printing json
!pip install Pygments
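
After the runtime restarts, you may want to confirm that everything is importable before moving on. Here's a small sketch that uses only the standard library; `is_installed` is a hypothetical helper name of mine, not part of DataGradients.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

# Top-level module names for the packages installed above
for name in ["data_gradients", "roboflow", "pdf2image", "pygments"]:
    print(f"{name}: {'found' if is_installed(name) else 'MISSING'}")
```

If anything shows up as `MISSING`, re-run the installation cell and restart the runtime again.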

# 🛠️ Utility functions

The `display_pdf_pages` function is a utility that displays each page of a PDF file as images in separate output cells.

Given the path to a PDF file, it converts the PDF into a list of PIL Images using `pdf2image`.

It then iterates through the list and displays each image using `IPython.display`.

This function is useful for visually examining PDF content, such as reviewing pages or checking layouts.

In [None]:
from pathlib import Path

from pdf2image import convert_from_path
from IPython.display import display

def display_pdf_pages(pdf_path):
    """
    Display each page of a PDF file as images in separate output cells.

    Args:
        pdf_path (str): The path to the PDF file.

    Raises:
        FileNotFoundError: If the specified PDF file is not found.

    Returns:
        None
    """
    # Check up front so a missing file gives a clear error message
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"The specified PDF file was not found: {pdf_path}")

    # Convert the PDF to a list of PIL Images, one per page
    images = convert_from_path(pdf_path)

    # Display each page
    for image in images:
        display(image)

The `print_pretty_json` function opens a JSON file, formats the data with an indent of 4 spaces using `json.dumps`, applies syntax highlighting with `Pygments`, and prints the pretty-printed JSON data in the cell below.

In [None]:
import json
from pygments import highlight, lexers, formatters
from pathlib import Path

def print_pretty_json(file_path):
    """
    Function to pretty print a JSON file with colorization.

    Args:
        file_path (Union[str, Path]): The path of the JSON file to be pretty-printed.

    Raises:
        FileNotFoundError: If the file at `file_path` doesn't exist.
        json.JSONDecodeError: If the file at `file_path` is not valid JSON.
    """
    try:
        # Open the file
        with open(file_path, 'r') as f:
            # Load the JSON data from the file
            data = json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")
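    except json.JSONDecodeError as err:
        # Keep the original document and position so the error points at the bad spot
        raise json.JSONDecodeError(f"Invalid JSON file {file_path}: {err.msg}", err.doc, err.pos) from err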

    # Pretty print the JSON data with an indent of 4 spaces
    formatted_json = json.dumps(data, indent=4)

    # Colorize the pretty-printed JSON data
    colorful_json = highlight(formatted_json,
                              lexers.JsonLexer(),
                              formatters.TerminalFormatter()
                              )

    # Print the colorized, pretty-printed JSON data
    print(colorful_json)


## ⤵️ Step 1: Download Dataset

To demonstrate the analysis capabilities of DataGradients, we will work with a portion of the [BDD dataset](https://bdd-data.berkeley.edu/), a popular computer vision dataset for autonomous driving.

In [None]:
import os

BDD_DATASET_DOWNLOAD_PATH="/content"

bdd_dataset_dir_path = BDD_DATASET_DOWNLOAD_PATH + os.path.sep + 'bdd_example'

if os.path.isdir(bdd_dataset_dir_path):
    print('bdd dataset already downloaded...')
else:
    print('Downloading and extracting bdd dataset to: ' + BDD_DATASET_DOWNLOAD_PATH)
    ! mkdir -p $BDD_DATASET_DOWNLOAD_PATH
    %cd $BDD_DATASET_DOWNLOAD_PATH
    ! wget https://deci-pretrained-models.s3.amazonaws.com/bdd_example.zip
    ! unzip -qq bdd_example.zip

## 💾 Step 2: Instantiating Dataloaders

To use DataGradients, you need to be equipped with a few prerequisites:

- **Dataset**: Includes a Train set and a Validation or a Test set.

- **Class Names**: A list of the unique categories present in your dataset.

- **Iterable**: A method to iterate over your Dataset providing images and labels. This can be any of the following:
  - PyTorch Dataloader
  - PyTorch Dataset
  - Generator that yields image/label pairs
  - Any other iterable you use for model training/validation
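
If your images and masks live in two folders, the "generator that yields image/label pairs" option can be sketched like this. It's a minimal example, assuming matching files share a file name stem; `iter_image_label_pairs`, `load_image`, and `load_label` are hypothetical names I made up for illustration.

```python
from pathlib import Path

def iter_image_label_pairs(image_dir, label_dir, load_image=None, load_label=None):
    """Yield (image, label) pairs matched by file name stem.

    `load_image` / `load_label` are optional callables that turn a path into
    an array or tensor; when omitted, the paths themselves are yielded.
    """
    labels_by_stem = {p.stem: p for p in Path(label_dir).iterdir() if p.is_file()}
    for image_path in sorted(Path(image_dir).iterdir()):
        label_path = labels_by_stem.get(image_path.stem)
        if not image_path.is_file() or label_path is None:
            continue  # skip images without a matching label file
        image = load_image(image_path) if load_image else image_path
        label = load_label(label_path) if load_label else label_path
        yield image, label
```

For an actual analysis you would pass loader functions that return image data (for example, a function that opens the file with PIL and converts it to an array), since the analysis needs pixels rather than paths.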

DataGradients provides some functionality that allows you to easily load your data without any extra coding needed:
- `CocoFormatSegmentationDataset`
- `CocoSegmentationDataset`
- `VOCFormatSegmentationDataset`
- `VOCSegmentationDataset`

You can learn more about those [here.](https://github.com/Deci-AI/data-gradients/blob/master/documentation/datasets.md)


#### 🚨 If your dataset and annotations are in a custom format, you can use DataGradients Dataset Adapters. Learn more about those [here](https://github.com/Deci-AI/data-gradients#dataset-adapters).


In [None]:
from torch.utils.data import DataLoader
from torchvision.transforms import Compose, ToTensor

from data_gradients.datasets.bdd_dataset import BDDDataset

# Create torch Dataset
train_dataset = BDDDataset(
    data_folder="/content/bdd_example",
    split="train",
    transform=Compose([ToTensor()]),
    target_transform=Compose([ToTensor()]),
)
val_dataset = BDDDataset(
    data_folder="/content/bdd_example",
    split="val",
    transform=Compose([ToTensor()]),
    target_transform=Compose([ToTensor()]),
)

# Create torch DataLoader
train_loader = DataLoader(train_dataset, batch_size=8)
val_loader = DataLoader(val_dataset, batch_size=8)

## 📊 Step 3: Perform Analysis

Now that you have your dataset loaded and ready, it's time to profile it using DataGradients.

In this section, you'll use the `SegmentationAnalysisManager` class from DataGradients to analyze your dataset. This will trigger the feature extraction, visualization, and interpretation processes provided by DataGradients.

**The time it takes to analyze your dataset depends on its size. If your dataset is large, it may take 20 minutes or more.**

You instantiate the `SegmentationAnalysisManager` with the following arguments:

- `report_title`: A title for the analysis report.

- `train_data` and `val_data`: The dataloaders for the training and validation sets, respectively.

- `class_names`: The list of class names present in the dataset.

**🔘 There are optional parameters that you can adjust as needed**:

- `class_names_to_use`: The subset of class names to analyze, if you're interested in only certain classes.

- `images_extractor` and `labels_extractor`: Custom functions to extract images and labels from the dataset if needed.

- `threshold_soft_labels`: A threshold value for soft labels, converting them to hard labels.

- `batches_early_stop`: The number of batches to analyze before early stopping.

You can find more information about these parameters in the [`SegmentationAnalysisManager` class documentation](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/managers/segmentation_manager.py).
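
To make `images_extractor` and `labels_extractor` a bit more concrete, here's a minimal sketch. I'm assuming your iterable yields dictionaries with the hypothetical keys `"image"` and `"mask"`, and that each extractor receives one item exactly as your iterable yields it; check the class documentation linked above for the exact contract.

```python
def images_extractor(batch):
    """Return the images from one item yielded by the iterable."""
    return batch["image"]

def labels_extractor(batch):
    """Return the segmentation masks from one item yielded by the iterable."""
    return batch["mask"]

# Stand-in batch; real values would be image and mask tensors
batch = {"image": "images tensor", "mask": "masks tensor", "meta": {"id": 7}}
print(images_extractor(batch))
print(labels_extractor(batch))
```

You would then pass these two functions in place of the `None` defaults when creating the manager.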

Once you have instantiated the `SegmentationAnalysisManager`, you can run the analysis by calling the `run()` method.

### You'll be prompted for information about your annotations:


In [None]:
from data_gradients.managers.segmentation_manager import SegmentationAnalysisManager
import matplotlib

matplotlib.use('Agg') # This line is only for Colab

analyzer = SegmentationAnalysisManager(
    report_title="BDD Subset Example",
    train_data=train_loader,
    val_data=val_loader,
    class_names=BDDDataset.CLASS_NAMES,
    class_names_to_use=BDDDataset.CLASS_NAMES[:-1],
    # Optionals
    images_extractor=None,
    labels_extractor=None,
    threshold_soft_labels=0.5,
    batches_early_stop=75,
)

analyzer.run()
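
As a rough sanity check on how much data this run looks at, you can multiply the batch size by `batches_early_stop`. This is back-of-the-envelope arithmetic, assuming the limit applies to each split separately; `max_images_analyzed` is a hypothetical helper and the dataset sizes are made-up numbers.

```python
def max_images_analyzed(dataset_size, batch_size, batches_early_stop=None):
    """Upper bound on images seen when analysis stops after a fixed number of batches."""
    if batches_early_stop is None:
        return dataset_size
    return min(dataset_size, batch_size * batches_early_stop)

# batch_size=8 and batches_early_stop=75 come from the cells above;
# the dataset sizes are made-up numbers for illustration
print(max_images_analyzed(dataset_size=10_000, batch_size=8, batches_early_stop=75))  # 600
print(max_images_analyzed(dataset_size=400, batch_size=8, batches_early_stop=75))     # 400
```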

## Step 4: View Full PDF Report

When you created the `analyzer`, you passed a value for `report_title`.

The report is saved under `logs/` in your current working directory, in a folder named after that value. In this notebook, that folder is `/content/logs/BDD_Subset_Example`.

If you want to save it in a different folder, you can pass the path to `log_dir` in the `SegmentationAnalysisManager` constructor.

Inside the log directory, you will find a complete PDF report that summarizes the results of each feature extractor.
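
If you're not sure what a report folder ended up being called, you can list what was written. Here's a small sketch using only the standard library; `find_reports` is a hypothetical helper, and `/content/logs` is the log location used in this notebook.

```python
from pathlib import Path

def find_reports(log_root="logs"):
    """Map each folder under `log_root` to the PDF and JSON files found in it."""
    root = Path(log_root)
    if not root.is_dir():
        return {}
    reports = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".pdf", ".json"}:
            reports.setdefault(path.parent.name, []).append(str(path))
    return reports

for folder, files in find_reports("/content/logs").items():
    print(folder, files)
```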

In [None]:
# this function was defined and described at the beginning of this notebook
display_pdf_pages("/content/logs/BDD_Subset_Example/Report.pdf")

### ⬇️ Download Report

Since this is a Google Colab notebook, the report isn't saved to your local drive. If you want to download the report you can do so by running the following cell:

In [None]:
from google.colab import files
files.download("/content/logs/BDD_Subset_Example/Report.pdf")

### 📃 Accessing JSON report.

You can also access the JSON report. This is helpful if you want to save it and track changes in your data characteristics over time.

Note: I left the below cell commented out because it will make the notebook long and messy.

In [None]:
# report_json = "/content/logs/BDD_Subset_Example/summary.json"
# print_pretty_json(report_json)
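
To track changes over time, one option is to compare two saved JSON reports key by key. Here's a minimal sketch that makes no assumptions about the report's schema; `flatten` and `diff_reports` are hypothetical helpers, and the two snapshots are made-up data (real ones would come from `json.load()` on two saved summary files).

```python
def flatten(data, prefix=""):
    """Flatten nested dicts/lists into {"a.b.0": value} form."""
    if isinstance(data, dict):
        items = data.items()
    elif isinstance(data, list):
        items = enumerate(data)
    else:
        return {prefix: data}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    return flat

def diff_reports(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        key: (old_flat.get(key), new_flat.get(key))
        for key in sorted(old_flat.keys() | new_flat.keys())
        if old_flat.get(key) != new_flat.get(key)
    }

# Made-up snapshots for illustration
last_month = {"images": 1000, "classes": {"car": 0.6, "road": 0.4}}
this_month = {"images": 1200, "classes": {"car": 0.5, "road": 0.4, "bus": 0.1}}
print(diff_reports(last_month, this_month))
```

A key that exists in only one report shows up with `None` on the other side.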

# 📄 Individual analysis.

If you only need one or a few analyses instead of the entire report, you can use the `run_segmentation_analysis` function, which is defined later in this section.

It is a convenience wrapper that lets you run an individual analysis (as opposed to the complete report) with ease.

It takes various parameters to configure the analysis settings:

- `report_title`: The title of the analysis report.
- `feature_extractors`: A list of feature extractors to use for analysis.


### You can choose from the following options for segmentation datasets:

#### 🔍 [ImageColorDistribution](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/common/image_color_distribution.py)

This is a comparison between the RGB and grayscale intensity distributions (0-255) of the entire dataset, assuming the RGB channel ordering.

It helps to detect differences in image characteristics between the two datasets and potential flaws in the augmentation process.

For example, a significant difference in the mean value of a specific color between the two datasets could indicate an issue with augmentation.

#### 🌓 [ImagesAverageBrightness](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/common/image_average_brightness.py)

This graph displays how the brightness of each dataset's images is distributed.

This can reveal discrepancies between the training and validation sets.

For example, it may show that the training set only has daytime images while the validation set only has nighttime images.
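
To get a feel for what is being compared, here's a toy sketch that computes a per-image average brightness for each split. It assumes brightness is simply the mean pixel value of a grayscale image; DataGradients' own computation may differ, and the tiny images are made up.

```python
from statistics import mean

def average_brightness(image):
    """Mean pixel value of a grayscale image given as rows of 0-255 integers."""
    return mean(pixel for row in image for pixel in row)

# Tiny made-up 2x2 grayscale images standing in for real data
train_images = [[[200, 220], [210, 230]], [[180, 190], [200, 210]]]  # bright, daytime-like
val_images = [[[20, 30], [25, 35]], [[40, 50], [45, 55]]]            # dark, nighttime-like

train_brightness = [average_brightness(img) for img in train_images]
val_brightness = [average_brightness(img) for img in val_images]
print(train_brightness, val_brightness)
```

Two sets of values that barely overlap, like these, are the kind of train/validation gap the graph makes visible.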

#### 📏 [ImagesResolution](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/common/image_resolution.py)

The histograms show how the image height and width are distributed.

If any images were resized or added with padding, the histograms will display the size after these modifications.


#### 📐 [SegmentationBoundingBoxArea](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/segmentation/bounding_boxes_area.py)

This graph shows the distribution of object area for each class.

This can highlight a distribution gap in object size between the training and validation splits, which can harm model performance.

Another thing to keep in mind is that having too many very small objects may indicate that you are downsizing your original images to a resolution that is too low for your objects.
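
Here's a toy sketch of the quantity involved: the area of the tight bounding box around an object's mask, expressed as a percentage of the image area. Reporting it as a percentage is my assumption, and `bounding_box_area_percent` is a hypothetical helper rather than DataGradients code.

```python
def bounding_box_area_percent(mask):
    """Area of the tight bounding box around nonzero pixels, as % of image area.

    `mask` is a binary mask given as rows of 0/1 values. Returns 0.0 for an empty mask.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return 0.0
    box_area = (rows[-1] - rows[0] + 1) * (cols[-1] - cols[0] + 1)
    return 100.0 * box_area / (len(mask) * len(mask[0]))

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(bounding_box_area_percent(mask))  # 2x2 box in a 4x4 image -> 25.0
```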

#### 📏 [SegmentationBoundingBoxResolution](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/segmentation/bounding_boxes_resolution.py)

Object size differences can impact accuracy.

Heat maps show how objects are distributed by width and height for each class.

#### 📊 [SegmentationClassFrequency](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/segmentation/classes_frequency.py)

This bar graph shows how often each class appears, which can help identify differences in distribution between the training and validation data.

If a class only appears in the validation set, it is likely that the model will not be able to accurately predict that class.
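
The same check is easy to sketch by hand: count class occurrences per split, then look for classes that show up in validation but never in training. The class lists below are made up, and `class_frequency` is a hypothetical helper.

```python
from collections import Counter

def class_frequency(labels_per_image):
    """Count how many times each class name appears across all images."""
    return Counter(name for labels in labels_per_image for name in labels)

# Made-up class lists, one inner list per image
train_labels = [["car", "road"], ["car", "person"], ["road"]]
val_labels = [["car", "road"], ["bus", "road"]]

train_freq = class_frequency(train_labels)
val_freq = class_frequency(val_labels)
only_in_val = sorted(set(val_freq) - set(train_freq))
print(train_freq)
print(only_in_val)  # ['bus']
```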


#### 📍 [SegmentationClassHeatmap](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/segmentation/classes_heatmap_per_class.py)


The heatmap shows where there are a lot of objects in the images, giving you an idea of how they are spread out.

By looking at the heatmap, you can see if the objects are mostly grouped together in certain areas or if they are spread out evenly.

This can help you check whether the objects appear in the regions of the image that you care about.

#### 📊 [SegmentationClassesPerImageCount](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/segmentation/classes_frequency_per_image.py)

The graph illustrates the frequency of each class appearing in an image.

It indicates whether the occurrence of each class is consistent across all images or varies from one image to another.

#### 🧮 [SegmentationComponentsConvexity](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/segmentation/components_convexity.py)

This graph displays the distribution of convexity values of objects in both the training and validation sets.

Higher convexity values indicate complex structures that could make accurate segmentation more difficult.


#### 🔍 [SegmentationComponentsErosion](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/segmentation/components_erosion.py)

This evaluates the stability of objects under morphological opening, which is erosion followed by dilation.

If many components are small, the number of components decreases after opening, which may indicate noise in the annotations (such as "sprinkles").
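
To see why the component count drops, here's a toy sketch of morphological opening on a small binary mask in pure Python. The 3x3 neighbourhood, 4-connectivity, and treating pixels outside the image as background are my assumptions for illustration; DataGradients' actual settings may differ.

```python
def count_components(mask):
    """Count 4-connected groups of 1-pixels using an iterative flood fill."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] and not seen[r][c]:
                count += 1
                stack = [(r, c)]
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < height and 0 <= nx < width and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count

def _morph(mask, keep):
    """Apply `keep` (all -> erosion, any -> dilation) over each 3x3 neighbourhood."""
    height, width = len(mask), len(mask[0])

    def neighbourhood(r, c):
        # Pixels outside the image count as background (0)
        return [
            mask[y][x] if 0 <= y < height and 0 <= x < width else 0
            for y in range(r - 1, r + 2)
            for x in range(c - 1, c + 2)
        ]

    return [[int(keep(neighbourhood(r, c))) for c in range(width)] for r in range(height)]

def opening(mask):
    """Morphological opening: erosion followed by dilation."""
    return _morph(_morph(mask, all), any)

# One 3x3 object plus two single-pixel "sprinkles"
mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
print(count_components(mask), count_components(opening(mask)))  # 3 1
```

The large object survives the opening, while the two single-pixel components disappear, so the count goes from 3 to 1.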


#### 📊 [SegmentationComponentsPerImageCount](https://github.com/Deci-AI/data-gradients/blob/master/src/data_gradients/feature_extractors/segmentation/component_frequency_per_image.py)

These graphs show how many objects (components) appear in each image.

This information is useful especially when there are numerous objects in the image, as some models have a feature that filters the top k results.


#### 👁️ SegmentationSampleVisualization

The sample visualization feature presents images and labels in a visual format, which helps you better comprehend the makeup of the dataset.

In [None]:
def run_segmentation_analysis(report_title,
                              feature_extractors,
                              train_data,
                              val_data,
                              class_names,
                              class_names_to_use):
    """
    Run the segmentation analysis using the provided parameters.

    Args:
        report_title (str): Title of the analysis report.
        feature_extractors (str or list): Feature extractor(s) to be used.
        train_data (iterable): Training data for the analysis.
        val_data (iterable): Validation data for the analysis.
        class_names (list): List of class names for the analysis.
        class_names_to_use (list): Subset of class names to analyze.
    """

    # Create an instance of SegmentationAnalysisManager
    analyzer = SegmentationAnalysisManager(
        report_title=report_title,
        feature_extractors=feature_extractors,
        train_data=train_data,
        val_data=val_data,
        class_names=class_names,
        class_names_to_use=class_names_to_use
    )

    # Run the analysis
    analyzer.run()

In [None]:
run_segmentation_analysis(report_title='SegmentationComponentsConvexity',
                          feature_extractors='SegmentationComponentsConvexity',
                          train_data=train_dataset,
                          val_data=val_dataset,
                          class_names=BDDDataset.CLASS_NAMES,
                          class_names_to_use=BDDDataset.CLASS_NAMES[:-1])

In [None]:
# Set this to the path of the Report.pdf written by the analysis above
display_pdf_pages("")

In [None]:
# Set this to the path of the JSON report written by the analysis above
report_json = ""
print_pretty_json(report_json)

# 🔚 You've made it to the end!

If you think DataGradients is useful, head over to GitHub and [⭐️ the repo](https://github.com/Deci-AI/data-gradients).


### Some datasets you might want to try on your own:


# Give SuperGradients a try!
Now that you've analyzed your segmentation dataset, go and train a YOLO-NAS model on it.

[Here's a starter notebook to get you on your way!](https://bit.ly/yolo-nas-starter-notebook)