In object detection, the quality and characteristics of your training data significantly impact your model's performance.

Data curation is an important, possibly the most important, part of building a high-quality model, especially when it comes to:

#### Improving Data Quality:

 - **Identifying and Removing Errors:** Data curation helps identify and remove mislabeled images, incorrect bounding boxes, or corrupted data that can mislead the model during training.

 - **Handling Class Imbalance:** If certain object classes are underrepresented in the dataset, the model might struggle to detect them accurately. Data curation techniques like oversampling or data augmentation can address this imbalance.

#### Enhancing Dataset Relevance:

 - **Focusing on Relevant Data:** Not all data is created equal. Data curation helps select the most relevant subset of data for the specific task, which can improve model efficiency and performance. For example, if you're building a model to detect cars, focusing on images with cars rather than a general dataset with various objects can lead to a more accurate model.

 - **Filtering Out Noise:** Removing irrelevant or noisy data (e.g., images with poor lighting or occlusions) can help the model focus on the essential features for object detection.

#### Optimizing Dataset Size:

 - **Reducing Training Time:** Large datasets can take a long time to train. Data curation helps create smaller, more efficient datasets that still capture the necessary information for the model to learn effectively. This is the core idea behind the competition you're designing.

 - **Improving Model Generalizability:** A smaller, well-curated dataset can sometimes lead to better generalizability, forcing the model to learn the essential features rather than overfitting specific examples in a large dataset.

#### Data Curation Techniques for Object Detection:

 - **Data Cleaning:** Correcting labelling errors, removing duplicates, and handling missing values.

 - **Data Augmentation:** Creating new training examples by applying transformations like flipping, cropping, rotating, or adding noise to existing images.

 - **Dataset Balancing:** Addressing class imbalance through techniques like oversampling or undersampling.

 - **Hard Example Mining:** Identifying and focusing on examples the model finds difficult to classify correctly.

By applying these techniques, you'll create a cleaner, more relevant, and more efficient dataset, ultimately leading to a better-performing object detection model.

This notebook will show you how you can use the open source library [`fiftyone`](https://github.com/voxel51/fiftyone) to perform the tasks listed above.

Let's get into it!

In [None]:
!pip install fiftyone

In [None]:
!pip install umap-learn

Fine-tuning a pretrained model on a new dataset is a common practice in machine learning, especially when dealing with specific domains or limited data availability. Ensuring that the dataset is of high quality is critical for the success of such an endeavor.

Here’s a systematic approach to assess and improve the quality of a dataset using `fiftyone` before fine-tuning a model:


In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

voc_dataset = foz.load_zoo_dataset("voc-2012", split="train")

In [None]:
voc_dataset


# **Visual Inspection**

Start by visually inspecting the dataset.

- Examine the images and their corresponding annotations

- Verify accuracy and consistency, and identify any apparent errors or anomalies

- Scrutinize annotations for incorrect labels

- Check for missing or inaccurate bounding boxes


You can do this via the `fiftyone` app and the Python SDK - both of which allow you to [tag any image issues you encounter during your exploration](https://docs.voxel51.com/user_guide/app.html#tags-and-tagging)!


In [None]:
session = fo.launch_app(voc_dataset)

You can verify the number of images in the training set:

In [None]:
voc_dataset.count()

# note, you can also use len(voc_dataset)

And the number of detections:

In [None]:
voc_dataset.aggregate(fo.Count("ground_truth.detections"))

There are more detections than images, so at least some images contain multiple detections. It's a good idea to get some statistics about what is in the dataset.

# **Statistical Analysis**

When curating a dataset for an object detection task, performing key statistical analyses helps ensure the quality and effectiveness of your data. By understanding and optimizing your dataset, you can significantly improve the performance and robustness of your object detection models.

Here are some analyses you can conduct:

1. Class Distribution:
   - Analyze the distribution of object classes in your dataset
   - Identify and address class imbalances to prevent biased models
   - Ensure sufficient representation of all classes for accurate detection

2. Bounding Box Size Distribution:
   - Compute the width and height distributions of bounding boxes
   - Identify variations in object sizes and ensure coverage of diverse scales
   - Curate a dataset with a representative range of object sizes

3. Object Density:
   - Calculate the average number of object instances per image
   - Ensure a balanced distribution of single-object and multi-object scenes
   - Curate a dataset that reflects the expected object density in real-world scenarios

4. Image Resolution:
   - Analyze the distribution of image resolutions
   - Ensure a consistent and appropriate resolution for accurate object detection
   - Curate a dataset with images of sufficient detail while considering computational efficiency

5. Object Aspect Ratios:
   - Compute the aspect ratios (width/height) of bounding boxes
   - Identify variations in object shapes and ensure coverage of diverse aspect ratios
   - Curate a dataset with a representative range of object shapes

6. Object Co-occurrence:
   - Analyze the co-occurrence of different object classes within the same image
   - Identify potential correlations or dependencies between classes
   - Curate a dataset that reflects realistic object co-occurrence patterns

By conducting these statistical analyses and curating your dataset accordingly, you can ensure that your object detection models are trained on high-quality, representative data. This data-centric approach leads to improved model performance, increased robustness to real-world variations, and more reliable object detection results.
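As one illustration, here's a minimal sketch of the object co-occurrence analysis (item 6). It uses only the standard library and works on plain per-image label lists, which you could get from `fiftyone` with `voc_dataset.values("ground_truth.detections.label")`. The function name `class_cooccurrence` and the example labels are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def class_cooccurrence(labels_per_image):
    """Count how many images contain each unordered pair of distinct classes."""
    pair_counts = Counter()
    for labels in labels_per_image:
        # Images without detections may show up as None; treat them as empty
        unique_labels = sorted(set(labels or []))
        # Each unordered pair of distinct classes in an image counts once
        pair_counts.update(combinations(unique_labels, 2))
    return pair_counts


# Hypothetical per-image label lists, for illustration only
example_labels = [
    ["person", "dog"],
    ["person", "person", "bicycle"],
    ["dog", "person", "bicycle"],
    None,
    ["car"],
]

for pair, count in class_cooccurrence(example_labels).most_common():
    print(pair, count)
```

Pairs that show up far more often than you'd expect in your deployment setting are worth a closer look, since the model may learn to rely on that context.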

#### You can do this in `fiftyone`!

`fiftyone` allows you to analyze your dataset with [pandas style queries](https://docs.voxel51.com/cheat_sheets/pandas_vs_fiftyone.html) and [dataset views](https://docs.voxel51.com/user_guide/using_views.html#using-views) so that you can perform statistical analysis of your dataset.

I'll show you how to get started.


#### **Distribution of Classes**

Analyze the distribution of classes to identify any imbalance. Highly imbalanced datasets can lead to biased models that perform well on frequent classes but poorly on rare classes.





In [None]:
voc_dataset.distinct("ground_truth.detections.label")

In [None]:
voc_dataset.count_values("ground_truth.detections.label")
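To turn those raw counts into a quick imbalance check, you could do something like the sketch below. It takes a plain `{label: count}` dictionary, which is the shape `count_values` returns. The helper name `imbalance_summary` and the example counts are hypothetical, not real VOC numbers.

```python
def imbalance_summary(label_counts):
    """Return each class's share of all detections and the max/min count ratio."""
    if not label_counts:
        raise ValueError("label_counts must not be empty")
    total = sum(label_counts.values())
    # Fraction of all detections that belong to each class
    shares = {label: count / total for label, count in label_counts.items()}
    # How many times more frequent the largest class is than the smallest
    ratio = max(label_counts.values()) / min(label_counts.values())
    return shares, ratio


# Hypothetical counts, for illustration only
example_counts = {"person": 600, "car": 300, "sheep": 100}
shares, ratio = imbalance_summary(example_counts)

for label, share in sorted(shares.items(), key=lambda item: -item[1]):
    print("%s: %.1f%%" % (label, share * 100))
print("Most/least frequent ratio: %.1f" % ratio)
```

A large ratio is a hint that you may want to rebalance, augment, or collect more data for the rare classes.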

#### **Annotation Metrics**

Calculate metrics such as:

• The number of bounding boxes per image

• Average area of bounding boxes (as a percentage of total image area)


FiftyOne provides [`ViewField`](https://docs.voxel51.com/api/fiftyone.core.expressions.html#fiftyone.core.expressions.ViewField) and [`ViewExpression`](https://docs.voxel51.com/api/fiftyone.core.expressions.html#fiftyone.core.expressions.ViewExpression) that give you the power to write custom queries based on information that exists in your dataset.


In [None]:
from fiftyone import ViewField as F

### Get the number of detections per image

In [None]:
detection_counter = F("ground_truth.detections").length()
num_detections_per_image = voc_dataset.values(detection_counter) #returns a list
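Since the result is a plain Python list, you can summarize it with the standard library. Here's a minimal sketch of that idea. The helper name `summarize_counts` and the example list are made up for illustration, and treating `None` as zero detections is an assumption about how you want to handle images without labels.

```python
import statistics


def summarize_counts(counts):
    """Summarize a list of per-image detection counts."""
    # Assumption: a missing count (None) means the image has no detections
    counts = [c or 0 for c in counts]
    return {
        "min": min(counts),
        "median": statistics.median(counts),
        "mean": statistics.mean(counts),
        "max": max(counts),
        "empty_images": sum(1 for c in counts if c == 0),
    }


# Hypothetical counts, for illustration only
print(summarize_counts([1, 2, 2, 3, 0, None, 12]))
```

With the real data you'd call `summarize_counts(num_detections_per_image)`.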

And you can plot them:

In [None]:
from fiftyone import ViewField as F
from fiftyone.core.plots.views import CategoricalHistogram

CategoricalHistogram(
    init_view=voc_dataset,
    field_or_expr="ground_truth",
    expr=F("detections").length(),
    title="Count of Images by Number of Detections",
    xlabel="Number of Detections per image",
    template={
        "layout": {
            "xaxis": {
                "range": [0, 30]  # This sets the x-axis range from 0 to 30
            }
        }
    }
)

And if you're curious about the samples with more than $n$ detections, you can find them with [filtering and matching](https://docs.voxel51.com/cheat_sheets/filtering_cheat_sheet.html#built-in-filter-and-match-functions).

In [None]:
many_detections_view = voc_dataset.match(detection_counter > 20)

You can programmatically get a sense of what's in these images:

In [None]:
many_detections_view.count_values("ground_truth.detections.label")

And, of course, you can visually inspect them:

In [None]:
fo.launch_app(many_detections_view)

### Analyze area of bounding boxes per image.

First, compute some metadata for each image. Namely, the image height and width.


In [None]:
voc_dataset.compute_metadata()

Then, compute the bounding box areas.

The cell below computes both the absolute and relative bounding box areas. The absolute area is the area of the bounding box in pixels. The relative area is the area of the bounding box as a fraction of the total image area.

Note that bounding boxes are in `[top-left-x, top-left-y, width, height]` format, with coordinates expressed relative to the image dimensions, in the range `[0, 1]`.


In [None]:
rel_bbox_area = F("bounding_box")[2] * F("bounding_box")[3]

im_width, im_height = F("$metadata.width"), F("$metadata.height")

abs_area = rel_bbox_area * im_width * im_height

voc_dataset.set_field("ground_truth.detections.relative_bbox_area", rel_bbox_area).save()

voc_dataset.set_field("ground_truth.detections.absolute_bbox_area", abs_area).save()
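To make the arithmetic concrete, here's the same calculation in plain Python for a single box. The helper name `bbox_stats`, the example box, and the image size are all made up for illustration. The aspect ratio is an extra that isn't computed in the cell above.

```python
def bbox_stats(bounding_box, image_width, image_height):
    """Relative area, absolute (pixel) area, and pixel aspect ratio of a box.

    bounding_box is [top-left-x, top-left-y, width, height] in relative
    coordinates, i.e. each value is a fraction of the image dimensions.
    """
    _x, _y, w, h = bounding_box
    relative_area = w * h
    absolute_area = relative_area * image_width * image_height
    # Width/height of the box measured in pixels
    aspect_ratio = (w * image_width) / (h * image_height)
    return relative_area, absolute_area, aspect_ratio


# Hypothetical box on a hypothetical 500x375 image
rel_area, abs_area_px, aspect = bbox_stats([0.1, 0.2, 0.5, 0.4], 500, 375)
print("Relative area: %.2f" % rel_area)
print("Absolute area: %.0f px" % abs_area_px)
print("Aspect ratio: %.2f" % aspect)
```

The box covers half the width and 40% of the height, so it takes up 20% of the image, which is 37,500 of the 187,500 pixels.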

You can compute the upper and lower [bounds](https://docs.voxel51.com/api/fiftyone.core.aggregations.html#fiftyone.core.aggregations.Bounds) of the bounding box areas as well as the mean for each.

Note: these are relative bounding box areas, so multiplying them by 100 gives the percentage of the total image area.

Here's how you can do that:

In [None]:
labels = voc_dataset.distinct("ground_truth.detections.label")
for label in labels:
    view = voc_dataset.filter_labels("ground_truth", F("label") == label)
    bounds = view.aggregate(fo.Bounds("ground_truth.detections.relative_bbox_area"))
    bounds = (bounds[0]*100, bounds[1]*100)
    area = view.mean("ground_truth.detections.relative_bbox_area")*100
    print("\033[1m%s:\033[0m Min: %.2f, Mean: %.2f, Max: %.2f" % (label, bounds[0], area, bounds[1]))


### Compute overlap over bounding boxes


You can use the [`compute_max_ious`](https://docs.voxel51.com/api/fiftyone.utils.iou.html#fiftyone.utils.iou.compute_max_ious) function to compute the maximum Intersection over Union (IoU) between objects in a dataset.

IoU is a measure of overlap between two objects, commonly used in object detection tasks.
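If you haven't worked with IoU before, here's a minimal sketch of how it's computed for two axis-aligned boxes in `[top-left-x, top-left-y, width, height]` format. This is just for intuition. The function name `box_iou` is made up, and this is not FiftyOne's implementation.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Two hypothetical boxes that partially overlap
print("IoU: %.3f" % box_iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.25, 0.5, 0.5]))
```

An IoU of 1 means the boxes are identical, and 0 means they don't overlap at all.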


In [None]:
from fiftyone.utils import iou
iou.compute_max_ious(voc_dataset, "ground_truth")

And you can see there is now a field in the detections for each object called `max_iou`.

The `max_iou` value indicates the highest degree of overlap between an object and any other object in the same sample or frame. A higher `max_iou` value suggests that the object has a significant overlap with at least one other object.

You can use this as a criterion for filtering or thresholding objects based on their overlap with other objects. For example, you might choose to keep only objects with a `max_iou` value below a certain threshold to remove highly overlapping or duplicate objects.

Or, it can be a starting point for further analysis or processing. For instance, you might use the `max_iou` values to identify objects with high overlap and perform additional operations on those objects.

In [None]:
voc_dataset.first().ground_truth

You can count the number of overlapping objects in the dataset like so. Note that if a detection has no overlap, the `max_iou` value is set to `None`, and `None` values are ignored by `count`.

In [None]:
voc_dataset.count("ground_truth.detections.max_iou", safe=True)

You can get the average `max_iou` across all the detections.

In [None]:
voc_dataset.mean("ground_truth.detections.max_iou")

You can do the same for each label as well:

In [None]:
labels = voc_dataset.distinct("ground_truth.detections.label")
for label in labels:
    view = voc_dataset.filter_labels("ground_truth", F("label") == label)
    bounds = view.aggregate(fo.Bounds("ground_truth.detections.max_iou"))
    bounds = (bounds[0]*100, bounds[1]*100)
    mean_iou = view.mean("ground_truth.detections.max_iou")*100
    print("\033[1m%s:\033[0m Min: %.2f, Mean: %.2f, Max: %.2f" % (label, bounds[0], mean_iou, bounds[1]))

# **Quality Checks**

### **Duplicate Detection**

You can use `fiftyone` to detect and [remove duplicate images](https://docs.voxel51.com/recipes/image_deduplication.html#) or very similar images (which could lead to overfitting).


Start by iterating over the samples and computing their file hashes:


In [None]:
import fiftyone.core.utils as fou

for sample in voc_dataset:
    sample["file_hash"] = fou.compute_filehash(sample.filepath)
    sample.save()

Then, you can use a simple Python statement to locate the duplicate files in the dataset, i.e., those with the same file hashes:

In [None]:
from collections import Counter

filehash_counts = Counter(sample.file_hash for sample in voc_dataset)
dup_filehashes = [k for k, v in filehash_counts.items() if v > 1]

print("Number of duplicate file hashes: %d" % len(dup_filehashes))

Awesome, no duplicates.

But if the above indicated that there are duplicates, you could visually verify them by creating a view that contains only the samples with these duplicate file hashes, using the following pattern:

```python
dup_view = (voc_dataset
    # Extract samples with duplicate file hashes
    .match(F("file_hash").is_in(dup_filehashes))
    # Sort by file hash so duplicates will be adjacent
    .sort_by("file_hash")
)

print("Number of images that have a duplicate: %d" % len(dup_view))
print("Number of duplicates: %d" % (len(dup_view) - len(dup_filehashes)))
```

And then you can inspect this view in the app:

```python
session.view = dup_view
```


Alternatively, you can use the [deduplication plugin](https://github.com/jacobmarks/image-deduplication-plugin), which streamlines this workflow!
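For intuition, the hashing idea itself can be sketched with only the standard library. The version below groups file paths by a SHA-256 hash of their contents. The function names are made up for illustration, and SHA-256 is a choice for this sketch, not necessarily the hash that `fou.compute_filehash` uses.

```python
import hashlib
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(paths):
    """Return groups of paths whose file contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_sha256(path)].append(path)
    # Only hashes shared by more than one file indicate duplicates
    return [group for group in groups.values() if len(group) > 1]
```

You could call `find_duplicate_files(voc_dataset.values("filepath"))` to run it on the dataset. Keep in mind that file hashes only catch exact duplicates. Resized or re-encoded copies of the same image will hash differently.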


### **Image quality**

Poor image quality can hurt a model's ability to learn meaningful features and generalize well.

Some aspects of image quality you want to examine are:

- **Image Resolution:** Check for low-resolution images that might lack sufficient detail for the model to learn effectively.

- **Image Noise:** Look for images with excessive noise, blur, or artifacts that could negatively impact model performance.

- **Lighting Conditions:** Assess the variability of lighting conditions in the images. Extreme lighting variations can pose challenges for the model.
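To make one of these checks concrete, here's a minimal sketch of a brightness score: the average luma of an image's pixels, scaled to `[0, 1]`. It uses the common Rec. 601 luma weights and works on a plain iterable of 8-bit `(R, G, B)` tuples. The function name `mean_brightness` and the example pixels are made up, and this is not necessarily how any particular tool defines brightness.

```python
def mean_brightness(pixels):
    """Average brightness in [0, 1] for an iterable of 8-bit (R, G, B) pixels."""
    total = 0.0
    count = 0
    for r, g, b in pixels:
        # Rec. 601 luma: green contributes most to perceived brightness
        total += 0.299 * r + 0.587 * g + 0.114 * b
        count += 1
    if count == 0:
        raise ValueError("pixels must not be empty")
    return total / (255.0 * count)


# Hypothetical tiny "images", for illustration only
dark_image = [(10, 10, 10)] * 4
bright_image = [(240, 240, 240)] * 4

print("Dark image: %.2f" % mean_brightness(dark_image))
print("Bright image: %.2f" % mean_brightness(bright_image))
```

Scores near 0 or 1 suggest underexposed or overexposed images that deserve a second look.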

`fiftyone` has a robust [plugins ecosystem](https://github.com/voxel51/fiftyone-plugins?tab=readme-ov-file), and you can use the [Image Quality Issues Plugin](https://github.com/jacobmarks/image-quality-issues) to find the following issues:

**📏 Aspect ratio**: find images with weird aspect ratios

**🌫️ Blurriness**: find blurry images

**☀️ Brightness**: find bright and dark images

**🌓 Contrast**: find images with high or low contrast

**🔀 Entropy**: find images with low entropy

**📸 Exposure**: find overexposed and underexposed images

**🕯️ Illumination**: find images with uneven illumination

**🧂 Noise**: find images with high salt and pepper noise

**🌈 Saturation**: find images with low and high saturation

Start by **downloading the plugin:**

In [None]:
!fiftyone plugins download https://github.com/jacobmarks/image-quality-issues/

Note that you can compute image quality issues one at a time, programmatically, like so:

In [None]:
import fiftyone.operators as foo
# Access the operator via its URI (plugin name + operator name)
compute_brightness = foo.get_operator(
    "@jacobmarks/image_issues/compute_brightness"
)

In [None]:
compute_brightness(voc_dataset)

And then inspect them in the app.

Or, you can press the backtick (`) key in the app and choose which aspect of image quality you want to assess. You can do this at the whole-image level or at the patch level. [Check the documentation for more information](https://github.com/jacobmarks/image-quality-issues).

In [None]:
fo.launch_app(voc_dataset)

#### Some ideas for overcoming these issues using Data Augmentation Techniques

• Apply random cropping, flipping, and rotations to the images during training.
This helps the model learn invariances to these transformations.

• Use color jittering to randomly adjust the brightness, contrast, saturation, and hue of the images. This simulates different lighting conditions.

• Add random Gaussian noise to the images to improve the model's robustness to noisy input.

`fiftyone` has an [integration with Albumentations](https://docs.voxel51.com/integrations/albumentations.html) that can help you with this!
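One thing to keep in mind for object detection: geometric augmentations have to transform the bounding boxes along with the pixels. Augmentation libraries generally handle this for you, but here's a minimal sketch of what happens to a box under a horizontal flip, assuming relative `[top-left-x, top-left-y, width, height]` coordinates. The function name `hflip_box` and the example box are made up for illustration.

```python
def hflip_box(bounding_box):
    """Mirror a relative [x, y, width, height] box across the vertical axis."""
    x, y, w, h = bounding_box
    # The box's right edge (x + w) becomes its new left edge, measured
    # from the opposite side of the image
    return [1.0 - x - w, y, w, h]


# Hypothetical box, for illustration only
original = [0.1, 0.2, 0.5, 0.4]
flipped = hflip_box(original)
print("Original:", original)
print("Flipped:", flipped)
```

Only the x-coordinate changes. The width, height, and y-coordinate stay the same, and flipping twice gets you back to the original box.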

# Using embeddings

[Visualizing your dataset in a low-dimensional embedding space](https://docs.voxel51.com/tutorials/image_embeddings.html) can reveal [patterns and clusters](https://docs.voxel51.com/tutorials/clustering.html#Computing-and-Visualizing-Clusters) in your data that can answer important questions about the critical failure modes of your model and how to augment your dataset to address these failures.

The [`fiftyone` model zoo](https://docs.voxel51.com/user_guide/model_zoo/models.html) has several embedding models you can choose from. You can start with the [`clip-vit-base32-torch`](https://docs.voxel51.com/user_guide/model_zoo/models.html#clip-vit-base32-torch) model.


For this, you'll need to use the [`compute_visualization`](https://docs.voxel51.com/api/fiftyone.brain.html#fiftyone.brain.compute_visualization) method of the [`FiftyOne Brain`](https://docs.voxel51.com/user_guide/brain.html#brain-embeddings-visualization).

In [None]:
import fiftyone.brain as fob

res = fob.compute_visualization(
    voc_dataset,
    model="clip-vit-base32-torch",
    embeddings="clip_embeddings",
    method="umap",
    brain_key="clip_vis",
    batch_size=64,
    num_dims=5
)

voc_dataset.set_values("clip_umap", res.current_points)

And once you have this, you can inspect the embeddings. Check out [the documentation](https://docs.voxel51.com/user_guide/brain.html#brain-embeddings-visualization) for more examples of how to use embeddings to understand your data.

In [None]:
session = fo.launch_app(voc_dataset)

### **Outlier Detection**

You can use embeddings to look for outliers or anomalous samples in your dataset. Outliers can include images with unusual objects, extreme lighting conditions, or rare poses. Investigate whether these outliers are valid and informative or if they are noisy and potentially harmful to the model's performance.

Then, you can decide whether to keep, remove, or separately handle the outliers based on their relevance and impact.
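If you want a feel for how embedding-based outlier scoring can work, here's a minimal sketch of one simple approach: score each sample by its average distance to its `k` nearest neighbors in embedding space. This is only one of many possible techniques and not necessarily what any particular tool uses. The function name `knn_outlier_scores` and the toy 2D points are made up for illustration.

```python
import math


def knn_outlier_scores(embeddings, k=3):
    """Score each point by its mean Euclidean distance to its k nearest neighbors.

    Higher scores mean the point sits farther from its neighbors.
    This brute-force version is O(n^2), so it only suits small datasets.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings")
    scores = []
    for i, point in enumerate(embeddings):
        # Distances from this point to every other point, smallest first
        distances = sorted(
            math.dist(point, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        nearest = distances[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


# Toy 2D "embeddings": a tight cluster plus one far-away point
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
scores = knn_outlier_scores(points, k=2)
print("Most outlying index:", scores.index(max(scores)))
```

On real data you'd sort the samples by score and review the top of the list by eye before deciding what to do with them.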

You can use the [outlier detection](https://github.com/danielgural/outlier_detection) plugin to do some interesting analysis!


Note that the visualization earlier computed embeddings at the whole-image level, but you can also compute embeddings at the patch level. Below is the pattern for that:


In [None]:
clip_embeddings_model = foz.load_zoo_model('clip-vit-base32-torch')

In [None]:
gt_patches = voc_dataset.to_patches("ground_truth").clone()

In [None]:
gt_patches.compute_patch_embeddings(
    model=clip_embeddings_model,
    patches_field='ground_truth',
    embeddings_field = 'patch_embeddings',
    batch_size=64,
    progress=True
)

In [None]:
import fiftyone.brain as fob

fob.compute_visualization(
    gt_patches,
    embeddings="patch_embeddings",
    method="umap",
    brain_key="umap_clip",
    num_dims=3
)

In [None]:
fo.launch_app(gt_patches)

# The `FiftyOne Brain` is super powerful!

Using the Brain, you can also compute:

- [Image Uniqueness](https://docs.voxel51.com/user_guide/brain.html#image-uniqueness)

- [Label Mistakes](https://docs.voxel51.com/user_guide/brain.html#label-mistakes)

Give it a try!

----

Remember, in data-centric AI, the quality and composition of your dataset matter more than anything else. By investing time and effort into understanding and optimizing your data, you can unlock the full potential of your object detection models and achieve better results than you would by focusing solely on model architecture and hyperparameters.


#### Actionable Steps

To operationalize these strategies, consider the following steps:

- **Develop a Checklist**: Create a comprehensive checklist based on the points above to ensure thorough review and evaluation of the dataset.

- **Automate What You Can**: Develop or use existing tools to automate parts of the quality check, like corruption checks, duplicate detection, and basic annotation verification.

- **Document Everything**: Keep detailed records of the quality assessment process, findings, and actions taken. This documentation will be crucial for understanding decisions made during dataset curation and model training.

By systematically assessing the data quality through these steps, you can significantly increase the chances of successful model performance when fine-tuning on new datasets.