Data-Centric AI (DCAI) represents the recent shift in focus from modeling to the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal -- painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The DCAI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems.
Here are the simple steps to run the code and find the output:
Go to the following folder.
In there, you will find the four scripts, as well as the data and outputs folders. We have already run the code, and the generated files are all in the outputs folder. However, you can also load the unclassified dataframe into the clustering model and run it again to generate the 3D interactive plot, or run the code on the small sample of images we have provided. To do so, you just need to specify whether you want to call c.generate_clusters_from_csv() or c.generate_clusters_from_imgs().
Should you need to run the module on the same dataset again, and you have access to the images, you will have to place them in the imgs folder. As always, you must run the code from the main.py script.
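The choice between the two entry points can be pictured with the minimal sketch below. It is an illustration only, not the project's actual code: the only names taken from this README are generate_clusters_from_csv() and generate_clusters_from_imgs(). The ClusteringStub class, the run() helper, and the source argument are hypothetical stand-ins, and the stub methods return labels instead of doing any clustering.

```python
class ClusteringStub:
    """Hypothetical stand-in for the clustering object called `c` in this README."""

    def generate_clusters_from_csv(self):
        # Placeholder: the real method is assumed to work from the saved dataframe.
        return "clusters from csv"

    def generate_clusters_from_imgs(self):
        # Placeholder: the real method is assumed to work from the imgs folder.
        return "clusters from imgs"


def run(c, source):
    """Call the method on `c` that matches `source` ('csv' or 'imgs')."""
    if source == "csv":
        return c.generate_clusters_from_csv()
    if source == "imgs":
        return c.generate_clusters_from_imgs()
    raise ValueError(f"unknown source: {source!r} (expected 'csv' or 'imgs')")


if __name__ == "__main__":
    c = ClusteringStub()
    # Switch "csv" to "imgs" to start from the images instead.
    print(run(c, "csv"))
```

In the real project, the equivalent choice would be made inside main.py, where the actual clustering object is created.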
- Leila Bagha - leila.bagha@phaze.ro