Data Cleaning Tutorial

This tutorial presents, explains, and critically reviews modern data cleaning approaches, with a focus on emerging tools for image dataset curation. It motivates the automatic detection of data quality issues in ever-larger data collections by reviewing contamination in popular benchmarks and assessing its impact on the training and evaluation of machine learning models. Data cleaning is shown to be complementary to learning with noise, although it is less well known. Particular attention is paid to three issue types: near-duplicate images, which can lead to train-evaluation data leaks; irrelevant samples, which are invalid within their context; and label errors, which corrupt the learning signal. The major repositories containing resources for data cleaning are presented with their strengths and weaknesses and used in guided examples, and participants are encouraged to clean their own datasets in the closing part of the tutorial.
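To make the near-duplicate leak concrete, here is a minimal sketch of a train-evaluation leak check. It assumes each image has already been turned into an embedding vector by some model. The function names, the toy vectors, and the 0.95 threshold are all hypothetical choices for illustration, not the method used by any of the tools in this tutorial.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_leaks(train_embeddings, eval_embeddings, threshold=0.95):
    """Return (train_id, eval_id, similarity) for every cross-split pair
    whose similarity reaches the (hypothetical) threshold."""
    leaks = []
    for train_id, train_vec in train_embeddings.items():
        for eval_id, eval_vec in eval_embeddings.items():
            sim = cosine_similarity(train_vec, eval_vec)
            if sim >= threshold:
                leaks.append((train_id, eval_id, sim))
    return leaks


# Toy 2-D embeddings; real embeddings would come from a pretrained model.
train = {"train_cat_1": [1.0, 0.0], "train_dog_1": [0.0, 1.0]}
evaluation = {"eval_cat_7": [0.99, 0.05]}

for train_id, eval_id, sim in find_leaks(train, evaluation):
    print(f"possible leak: {train_id} ~ {eval_id} (similarity {sim:.3f})")
```

The brute-force double loop compares every training image with every evaluation image, so it only suits small collections. That cost is one reason dedicated tools are worth using on larger datasets.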

Installation Instructions

There are several ways to install the libraries needed for this tutorial, depending on your preferences:

if you use Docker, you can start a Jupyter notebook server by running make start_jupyter
if you use a virtual environment (venv) or want to install locally, you can run pip install -r requirements.txt and launch your own Jupyter notebook server
if you do not want to install anything locally, you can run everything on Google Colab by clicking the button below; remember to change the runtime to GPU
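After a local or Docker install, a quick check like the following can confirm that the main libraries are importable before you open the notebooks. This is a sketch only: the import names below are assumed from the tool names used in this tutorial, and requirements.txt remains the authoritative list.

```python
import importlib.util

# Assumed top-level import names for the three tools covered in the tutorial.
ASSUMED_PACKAGES = ["fastdup", "cleanlab", "selfclean"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(ASSUMED_PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All assumed packages are importable.")
```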

NOTE: We recommend using Google Colab to run the tutorial. We also provide setups for a virtual environment and Docker, but we cannot guarantee that they will work on your machine. The local options may be the better choice if you do not want to upload your datasets, but depending on your hardware and internet connection, you may face longer install times, higher disk space requirements, or slower computation.

Hands On

The first tutorial shows how difficult it can be to clean image datasets with traditional, manual methods. The following tutorials then show how much easier the task becomes with data-centric cleaning frameworks.

00	Traditional (manual) data cleaning: Showcases how manual data cleaning is typically done and calculates the effort required for exhaustive annotation. 🗃️ Dataset: Oxford-IIIT Pet, Imagenette, your own.



01	FastDup: Learn how to analyze and clean datasets using FastDup, the preferred solution for very large data collections. 🗃️ Dataset: Oxford-IIIT Pet, Imagenette, your own.



02	CleanLab: Learn how to analyze and clean datasets using CleanLab (DataLab), the preferred solution for reliable results. 🗃️ Dataset: Oxford-IIIT Pet, Imagenette, your own.



03	SelfClean: Learn how to analyze and clean datasets using SelfClean, the preferred solution for small to medium datasets with an emphasis on the highest data quality. 🗃️ Dataset: Oxford-IIIT Pet, Imagenette, your own.
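As a rough illustration of the effort calculation mentioned for notebook 00, the sketch below estimates how long exhaustive manual review would take. The function name and the per-image and per-pair timings are hypothetical assumptions, and the notebook's own calculation may differ. The key point is that near-duplicate checks need one comparison per image pair, so the effort grows quadratically with dataset size.

```python
def annotation_effort(n_images, seconds_per_image=5.0, seconds_per_pair=2.0):
    """Estimate exhaustive manual-review effort.

    Returns (n_pairs, per_image_hours, pair_hours), where
    - per_image_hours covers checks done once per image
      (e.g. irrelevant samples, label errors), and
    - pair_hours covers near-duplicate checks over all unordered pairs.
    The default timings are hypothetical.
    """
    n_pairs = n_images * (n_images - 1) // 2
    per_image_hours = n_images * seconds_per_image / 3600
    pair_hours = n_pairs * seconds_per_pair / 3600
    return n_pairs, per_image_hours, pair_hours


n_pairs, per_image_hours, pair_hours = annotation_effort(10_000)
print(f"{n_pairs:,} pairs to compare")
print(f"{per_image_hours:.1f} h for per-image checks")
print(f"{pair_hours:,.0f} h for pairwise near-duplicate checks")
```

For a hypothetical dataset of 10,000 images this already yields 49,995,000 pairs, which is why exhaustive pairwise review is impractical without automated ranking.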

Tips

For more detailed tips during the hands-on session, consult our dedicated tips page (data-cleaning-tutorial-tips.md).

Digital-Dermatology/data-cleaning-hands-on
