Does your dataset have issues? How do you find out, and how do you fix those issues?
I originally pitched this project as:
One dataset with several flawed samples (e.g. an over-represented class, missing values, gender bias), each serving as its own "discover the data problem" exercise.
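As a rough illustration of what one such exercise might involve, here is a minimal Python sketch that checks a small CSV for the three problems named above. Everything in it is an assumption made for illustration: the sample data, the column names (`gender`, `age`, `label`), and the helper functions are hypothetical and not part of this project.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: columns and values are made up for illustration.
SAMPLE_CSV = """id,gender,age,label
1,F,34,approved
2,M,,approved
3,M,45,approved
4,M,29,approved
5,F,51,rejected
6,M,38,approved
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows):
    """Count empty, whitespace-only, or absent cells per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if column is None:
                continue  # extra unnamed cells in an over-long row
            if value is None or value.strip() == "":
                counts[column] += 1
    return dict(counts)


def class_shares(rows, column):
    """Return each distinct value's share of all rows for one column."""
    totals = Counter(row[column] for row in rows)
    return {value: n / len(rows) for value, n in totals.items()}


def rate_by_group(rows, group_column, label_column, positive):
    """Share of rows carrying the positive label within each group."""
    group_sizes = Counter()
    positives = Counter()
    for row in rows:
        group_sizes[row[group_column]] += 1
        if row[label_column] == positive:
            positives[row[group_column]] += 1
    return {g: positives[g] / n for g, n in group_sizes.items()}


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    print("missing values:", missing_counts(rows))
    print("label shares:", class_shares(rows, "label"))
    print("approval rate by gender:",
          rate_by_group(rows, "gender", "label", "approved"))
```

A large gap between groups in the last check is only a hint that the data may be skewed, not proof of bias; deciding what counts as a problem is the point of each exercise.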
I wanted to include others' previous work on parsing CSV files, and data sources in general, to offer as many examples as possible:
- https://github.com/pplonski/datasets-for-start
- https://martinjc.github.io/UK-GeoJSON/
- https://github.com/maxogden/csv-spectrum
Ideally, there will eventually be a data browser where you can programmatically review a dataset and determine its problems.
Open source, MIT License