## Introduction
Whether you're working on image classification, object detection, or semantic segmentation, DataGradients helps you gain insights and analyze your datasets effectively.

This tutorial walks through the main features of DataGradients and applies them to a comprehensive data analysis for a computer vision project.

With DataGradients, you can:

- Analyze image features such as color distribution, brightness, and size.
- Profile object detection datasets with metrics like bounding box area, intersection, and class frequency.
- Understand segmentation datasets using object area, width, height, and class frequency.
- Visualize samples for a better understanding.
- And much more.

Profiling your datasets has never been easier!

## Imports - libraries


In [1]:
!pip install data-gradients

# for displaying pdfs as images in notebook
!pip install pdf2image
!apt-get -y install poppler-utils

# for pretty printing json
!pip install Pygments

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [2]:

import seaborn as sns  # library for visualization

sns.set_style("darkgrid")
import matplotlib.pyplot as plt  # library for visualization
%matplotlib inline

from tqdm import tqdm
tqdm.pandas()
import os
from glob import glob
import random
from datetime import datetime
import pandas

from typing import List, Tuple, Dict, Union

from concurrent.futures import ThreadPoolExecutor, as_completed
import pickle
import warnings
import re

## Set up Google Drive

In [3]:
SET_UP_GOOGLE_DRIVE = True

# Mount Google Drive only when requested (requires a Google Colab runtime)
if SET_UP_GOOGLE_DRIVE:
    from google.colab import drive
    drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## 🛠️ Utility functions

In [4]:
import os

from PIL import Image
from pdf2image import convert_from_path
from IPython.display import display

def display_pdf_pages(pdf_path):
    """
    Display each page of a PDF file as an image in the notebook output.

    Args:
        pdf_path (str): The path to the PDF file.

    Raises:
        FileNotFoundError: If the specified PDF file is not found.

    Returns:
        None
    """
    # Check up front so that a missing file produces a clear error message
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"The specified PDF file was not found: {pdf_path}")

    # Convert the PDF to a list of PIL Images (one per page)
    images = convert_from_path(pdf_path)

    # Display each page
    for image in images:
        display(image)

## Step 1: Prepare Dataset

In [5]:
if SET_UP_GOOGLE_DRIVE:
    DATASETS_DIR_ROOT_PATH = r"/content/gdrive/MyDrive/KESKIA Drive Mlamali/datasets"
    EDA_DATAGRADIENT_OUTPUTS_PATH = r"/content/gdrive/MyDrive/KESKIA Drive Mlamali/CDuPropreMantes/outputs/eda-datagradients"
else:
    # NOTE: "/datasets" is a placeholder (an assumption); point it to your local datasets folder
    DATASETS_DIR_ROOT_PATH = "/datasets"
    EDA_DATAGRADIENT_OUTPUTS_PATH = "/outputs/eda-datagradients"
print(DATASETS_DIR_ROOT_PATH)
print(os.listdir(DATASETS_DIR_ROOT_PATH))
MY_DATASET_PATH = os.path.join(DATASETS_DIR_ROOT_PATH, 'taco-2gb-updated-2023121718')
if not os.path.exists(MY_DATASET_PATH):
    raise FileNotFoundError(f"Dataset directory not found: {MY_DATASET_PATH}")
MY_DATASET_PATH

/content/gdrive/MyDrive/KESKIA Drive Mlamali/datasets
['taco-2gb', 'taco-2gb-updated', 'taco-2gb-updated-2023121620', 'taco-2gb-updated-2023121621', 'taco-2gb-updated-2023121718']


'/content/gdrive/MyDrive/KESKIA Drive Mlamali/datasets/taco-2gb-updated-2023121718'
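
Before going further, it can help to check that each split of the dataset contains what you expect. The sketch below is not part of DataGradients; it is a minimal, hypothetical helper (`summarize_yolo_split`) that assumes the `<split>/images` and `<split>/labels` layout used later in this notebook, with one `.txt` label file per image sharing the same file stem.

```python
from pathlib import Path

# Assumed image extensions; extend this set if your dataset uses other formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarize_yolo_split(dataset_root, split):
    """Count images and label files in one split and list unlabeled images.

    Assumes the layout <dataset_root>/<split>/images and
    <dataset_root>/<split>/labels, with one .txt label file per image
    sharing the image's file stem.
    """
    images_dir = Path(dataset_root) / split / "images"
    labels_dir = Path(dataset_root) / split / "labels"

    image_stems = {
        p.stem for p in images_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    }
    label_stems = {p.stem for p in labels_dir.glob("*.txt")}

    return {
        "images": len(image_stems),
        "labels": len(label_stems),
        "images_without_labels": sorted(image_stems - label_stems),
    }


# Hypothetical usage in this notebook:
# for split in ("train", "val"):
#     print(split, summarize_yolo_split(MY_DATASET_PATH, split))
```

Images listed under `images_without_labels` are not necessarily errors: in many YOLO-style datasets, an image without a label file is simply a background image.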

In [13]:
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.manifold import TSNE
import os

# Load the pre-trained VGG16 model (ImageNet weights, without the classification head)
model = VGG16(weights='imagenet', include_top=False)

# Load and preprocess a single image
def load_and_preprocess_img(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    return x, img

# Load all images of the dataset
def load_dataset_images(dataset_path):
    images = []
    raw_images = []
    for img_file in os.listdir(dataset_path):
        if img_file.lower().endswith(('.jpg', '.png')):
            img_path = os.path.join(dataset_path, img_file)
            processed_img, raw_img = load_and_preprocess_img(img_path)
            images.append(processed_img)
            raw_images.append(raw_img)
    return np.vstack(images), raw_images

# Path to the training images folder
train_images_path = os.path.join(MY_DATASET_PATH, "train","images")

# Load the training images
train_images, raw_train_images = load_dataset_images(train_images_path)
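
The cell above keeps every preprocessed image in memory at once, which works for a dataset of this size but may not scale. As a hedged alternative, the sketch below shows a small, hypothetical helper (`iter_batches`) that splits a list of image paths into fixed-size batches, so that loading and feature extraction could be done batch by batch. It is an illustration, not what this notebook actually runs.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` containing at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical usage with the functions defined above (batch size is an assumption):
# paths = sorted(glob(os.path.join(train_images_path, "*.jpg")))
# for batch_paths in iter_batches(paths, 32):
#     batch = np.vstack([load_and_preprocess_img(p)[0] for p in batch_paths])
#     batch_features = model.predict(batch)
```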

In [14]:
# Extract features with VGG16
features = model.predict(train_images)

# Flatten the features for t-SNE
flattened_features = features.reshape(features.shape[0], -1)

# Dimensionality reduction with t-SNE
tsne = TSNE(n_components=2)
tsne_results = tsne.fit_transform(flattened_features)



In [20]:

# Visualize the results: draw each image thumbnail at its t-SNE coordinates
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

fig, ax = plt.subplots(figsize=(50,50))
for i, img in tqdm(enumerate(raw_train_images), total=len(raw_train_images)):
    x, y = tsne_results[i, 0], tsne_results[i, 1]
    im = OffsetImage(img, zoom=0.125)  # Adjust the zoom if needed
    ab = AnnotationBbox(im, (x, y), xycoords='data', frameon=False)
    ax.add_artist(ab)
ax.set_xlim(tsne_results[:, 0].min() - 1, tsne_results[:, 0].max() + 1)
ax.set_ylim(tsne_results[:, 1].min() - 1, tsne_results[:, 1].max() + 1)
plt.show()


In [10]:
import yaml

# define the path to your YAML file
yaml_file_path = os.path.join(MY_DATASET_PATH, "data.yaml")

# open the YAML file and load it into a dictionary
with open(yaml_file_path, 'r') as f:
    data_yaml = yaml.safe_load(f)

data_yaml

{'train': '../train/images',
 'val': '../val/images',
 'nc': 59,
 'names': {0: 'Aluminium foil',
  1: 'Battery',
  2: 'Aluminium blister pack',
  3: 'Carded blister pack',
  4: 'Other plastic bottle',
  5: 'Clear plastic bottle',
  6: 'Glass bottle',
  7: 'Plastic bottle cap',
  8: 'Metal bottle cap',
  9: 'Broken glass',
  10: 'Food Can',
  11: 'Aerosol',
  12: 'Drink can',
  13: 'Toilet tube',
  14: 'Other carton',
  15: 'Egg carton',
  16: 'Drink carton',
  17: 'Corrugated carton',
  18: 'Meal carton',
  19: 'Pizza box',
  20: 'Paper cup',
  21: 'Disposable plastic cup',
  22: 'Foam cup',
  23: 'Glass cup',
  24: 'Other plastic cup',
  25: 'Food waste',
  26: 'Glass jar',
  27: 'Plastic lid',
  28: 'Metal lid',
  29: 'Other plastic',
  30: 'Magazine paper',
  31: 'Tissues',
  32: 'Wrapping paper',
  33: 'Normal paper',
  34: 'Paper bag',
  35: 'Plastic film',
  36: 'Six pack rings',
  37: 'Garbage bag',
  38: 'Other plastic wrapper',
  39: 'Single-use carrier bag',
  40: 'Polypropyl
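
In this `data.yaml`, `names` is a dictionary mapping class ids to class names. If a downstream API expects a plain list ordered by class id, the mapping can be converted as sketched below. The helper name `class_names_from_yaml` is hypothetical, and the consistency check against `nc` is an added safeguard, not something the original notebook does.

```python
def class_names_from_yaml(data_yaml):
    """Return class names as a list ordered by class id.

    Accepts `names` either as a dict {id: name} or as a list of names,
    and checks the result against `nc` when that key is present.
    """
    names = data_yaml["names"]
    if isinstance(names, dict):
        ids = sorted(names)
        # Class ids are expected to be contiguous and to start at 0
        if ids != list(range(len(ids))):
            raise ValueError("class ids must be contiguous and start at 0")
        ordered = [names[i] for i in ids]
    else:
        ordered = list(names)

    nc = data_yaml.get("nc")
    if nc is not None and nc != len(ordered):
        raise ValueError(f"nc={nc} does not match {len(ordered)} class names")
    return ordered


# Hypothetical usage in this notebook:
# class_names = class_names_from_yaml(data_yaml)
```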

## 💾 Step 2: Instantiating Dataloaders

In [11]:
dataset_params = {
    'data_dir': MY_DATASET_PATH,
    'train_images_dir': 'train/images',
    'train_labels_dir': 'train/labels',
    'val_images_dir': 'val/images',
    'val_labels_dir': 'val/labels',
    'test_images_dir': 'test/images',
    'test_labels_dir': 'test/labels',
    'classes': data_yaml['names']
}

In [12]:
from data_gradients.datasets.detection import YoloFormatDetectionDataset

train_set = YoloFormatDetectionDataset(root_dir=dataset_params['data_dir'],
                                       images_dir=dataset_params['train_images_dir'],
                                       labels_dir=dataset_params['train_labels_dir'])

val_set = YoloFormatDetectionDataset(root_dir=dataset_params['data_dir'],
                                     images_dir=dataset_params['val_images_dir'],
                                     labels_dir=dataset_params['val_labels_dir'])



In [14]:
len(train_set), len(val_set)

(1394, 480)
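
The analysis in the next step is configured with `is_label_first=True` and `bbox_format="cxcywh"`, meaning each annotation starts with the class id, followed by the box center and size. To make that concrete, here is a minimal sketch that parses one label line and converts it to pixel corner coordinates. It assumes the usual YOLO convention of coordinates normalized to the range [0, 1]; the function name `yolo_line_to_xyxy` is hypothetical and not part of DataGradients.

```python
def yolo_line_to_xyxy(line, image_width, image_height):
    """Parse 'class_id cx cy w h' (normalized) into (class_id, (x_min, y_min, x_max, y_max)) in pixels."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")

    class_id = int(parts[0])  # label first
    cx, cy, w, h = (float(v) for v in parts[1:])  # center x, center y, width, height

    # Convert from normalized center/size to pixel corners
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    return class_id, (x_min, y_min, x_max, y_max)


# Example: a box centered in a 100x200 image, a quarter of its width and half of its height
print(yolo_line_to_xyxy("3 0.5 0.5 0.25 0.5", 100, 200))
# → (3, (37.5, 50.0, 62.5, 150.0))
```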

## 📊 Step 3: Perform Analysis

In [25]:
from data_gradients.managers.detection_manager import DetectionAnalysisManager
from data_gradients.feature_extractors.common import ImageDuplicates
from data_gradients.feature_extractors.common.sample_visualization import AbstractSampleVisualization
from data_gradients.utils.data_classes import ImageChannels
import matplotlib

ImageDuplicates(train_image_dir=dataset_params['train_images_dir'],val_image_dir=dataset_params['val_images_dir'])

(data_gradients.feature_extractors.common.image_duplicates.ImageDuplicates,
 data_gradients.feature_extractors.common.sample_visualization.AbstractSampleVisualization)
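
Note that `dataset_params['train_images_dir']` and `dataset_params['val_images_dir']` are paths relative to the dataset root. The `ImageDuplicates` error reported later in the analysis log ("Please provide a valid directory path!") is plausibly caused by such relative paths not resolving from the notebook's working directory, although this is an assumption. The sketch below uses a hypothetical helper (`resolve_split_dirs`) that joins the relative directories with the dataset root and fails early if one of them does not exist.

```python
import os


def resolve_split_dirs(root_dir, relative_dirs):
    """Join each relative directory with `root_dir` and verify that it exists.

    `relative_dirs` maps an arbitrary key to a path relative to `root_dir`.
    Returns a dict with the same keys and the resolved paths.
    """
    resolved = {}
    for key, relative_dir in relative_dirs.items():
        path = os.path.join(root_dir, relative_dir)
        if not os.path.isdir(path):
            raise NotADirectoryError(f"{key}: {path} is not a directory")
        resolved[key] = path
    return resolved


# Hypothetical usage in this notebook:
# image_dirs = resolve_split_dirs(
#     dataset_params['data_dir'],
#     {
#         'train_image_dir': dataset_params['train_images_dir'],
#         'val_image_dir': dataset_params['val_images_dir'],
#     },
# )
```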

In [50]:
REPORT_TITLE = "TACO - Exploratory Data Analysis (Object Detection)"
REPORT_SUBTITLE = f"dataset_path: {MY_DATASET_PATH}"
now_str = datetime.now().strftime("%Y%m%d_%H")
LOG_DIR = f"{EDA_DATAGRADIENT_OUTPUTS_PATH}/{REPORT_TITLE} {now_str}"
LOG_DIR

'/content/gdrive/MyDrive/KESKIA Drive Mlamali/CDuPropreMantes/outputs/eda-datagradients/TACO - Exploratory Data Analysis (Object Detection) 20231218_02'

In [45]:

matplotlib.use('Agg') # This line is only for Colab

analyzer = DetectionAnalysisManager(
    report_title=REPORT_TITLE,
    report_subtitle=REPORT_SUBTITLE,
    train_data=train_set,
    val_data=val_set,
    class_names=dataset_params['classes'],
    log_dir=LOG_DIR,
    is_label_first=True,
    image_channels=ImageChannels.from_str("RGB"),
    bbox_format="cxcywh",
    remove_plots_after_report=False,
    config_path=f"{EDA_DATAGRADIENT_OUTPUTS_PATH}/config.yaml"
)

analyzer.run()

  - Executing analysis with: 
  - batches_early_stop: None 
  - len(train_data): 1394 
  - len(val_data): 480 
  - log directory: /content/gdrive/MyDrive/KESKIA Drive Mlamali/CDuPropreMantes/outputs/eda-datagradients/TACO - Exploratory Data Analysis (Object Detection) 20231218 
  - Archive directory: /content/gdrive/MyDrive/KESKIA Drive Mlamali/CDuPropreMantes/outputs/eda-datagradients/TACO - Exploratory Data Analysis (Object Detection) 20231218/archive_20231218-013107 
  - feature extractor list: {'Image Features': [SummaryStats, ImagesResolution, ImageColorDistribution, ImagesAverageBrightness, ImageDuplicates], 'Object Detection Features': [DetectionSampleVisualization, DetectionClassHeatmap, DetectionBoundingBoxArea, DetectionBoundingBoxPerImageCount, DetectionBoundingBoxSize, DetectionClassFrequency, DetectionClassesPerImageCount, DetectionBoundingBoxIoU, DetectionResizeImpact]}

Analyzing... :   0%|          | 0/1394 [00:00<?, ?it/s]


--------------------------------------------------------------------------------
In which format are your images loaded ?
--------------------------------------------------------------------------------

Options:
[0] | RGB
[1] | BGR
[2] | LAB
[3] | Other

Your selection (Enter the corresponding number) >>> 0
Great! You chose: `RGB`


Analyzing... : 100%|██████████| 1394/1394 [44:50<00:00,  1.93s/it]
Summarizing... :   0%|          | 0/2 [00:00<?, ?it/s]ERROR:data_gradients.managers.abstract_manager:Feature extractor ImageDuplicates error: ['Traceback (most recent call last):\n', '  File "/usr/local/lib/python3.10/dist-packages/data_gradients/managers/abstract_manager.py", line 146, in post_process\n    feature = feature_extractor.aggregate()\n', '  File "/usr/local/lib/python3.10/dist-packages/data_gradients/feature_extractors/common/image_duplicates.py", line 233, in aggregate\n    self._find_duplicates()\n', '  File "/usr/local/lib/python3.10/dist-packages/data_gradients/feature_extractors/common/image_duplicates.py", line 124, in _find_duplicates\n    train_encodings = dhasher.encode_images(self.train_image_dir)\n', '  File "/usr/local/lib/python3.10/dist-packages/imagededup/methods/hashing.py", line 155, in encode_images\n    raise ValueError(\'Please provide a valid directory path!\')\n', 'ValueError: Please p

Dataset successfully analyzed!
Starting to write the report, this may take around 10 seconds...


You can find more information about what happened in /content/gdrive/MyDrive/KESKIA Drive Mlamali/CDuPropreMantes/outputs/eda-datagradients/TACO - Exploratory Data Analysis (Object Detection) 20231218/archive_20231218-013107/errors.json



Your dataset evaluation has been completed!

----------------------------------------------------------------------------------------------------
Training Configuration...
`DetectionDataConfig` cache is not enabled because `cache_path=None` was not set.

----------------------------------------------------------------------------------------------------
Report Location:
    - Temporary Folder (will be overwritten next run):
        └─ /content/gdrive/MyDrive/KESKIA Drive Mlamali/CDuPropreMantes/outputs/eda-datagradients/TACO - Exploratory Data Analysis (Object Detection) 20231218
                ├─ Report.pdf
                └─ summary.json
    - Archive Folder:
        └─ /content/gdrive/MyDrive/KESKIA Drive Mlamali/CDuPropreMantes/outputs/eda-datagradients/TACO - Exploratory Data Analysis (Object Detection) 20231218/archive_20231218-013107
                ├─ Report.pdf
                └─ summary.json

Seen a glitch? Have a suggestion? Visit https://github.com/Deci-AI/data-gradients 
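
According to the log above, the run writes `summary.json` next to `Report.pdf`, and an `errors.json` in the archive folder. A quick way to inspect these files in the notebook is sketched below. The helper names are hypothetical and the sketch makes no assumption about the internal structure of the JSON files: it only loads them and prints an indented, optionally truncated preview.

```python
import json


def load_json_report(path):
    """Load a JSON file (for example summary.json or errors.json) into Python objects."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def preview_json(data, max_chars=2000):
    """Return an indented JSON string, truncated to `max_chars` characters."""
    text = json.dumps(data, indent=2, ensure_ascii=False, default=str)
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n... (truncated)"


# Hypothetical usage in this notebook:
# summary = load_json_report(f"{LOG_DIR}/summary.json")
# print(preview_json(summary))
```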

## Step 4: View Full PDF Report


In [49]:
# this function was defined and described at the beginning of this notebook
display_pdf_pages(f"{LOG_DIR}/Report.pdf")


### ⬇️ Download Report

In [None]:
from IPython.display import FileLink
FileLink(f"{LOG_DIR}/Report.pdf")