diff --git a/docs/wiki-guide/HF_DatasetCard_Template_Imageomics.md b/docs/wiki-guide/HF_DatasetCard_Template_Imageomics.md
index 698516a..506d19c 100644
--- a/docs/wiki-guide/HF_DatasetCard_Template_Imageomics.md
+++ b/docs/wiki-guide/HF_DatasetCard_Template_Imageomics.md
@@ -10,6 +10,7 @@ tags:
 - animals
 - CV
 size_categories: # ex: n<1K, 1K<n<10K
 jupytext --set-formats ipynb,py notebook.ipynb  # Pair a notebook to a py script
 jupytext --sync notebook.ipynb                  # Sync the two representations
 ```
 
-#### But wait! ...There's another way to automate it!
+##### But wait! ...There's another way to automate it!
 
 There is a [jupytext pre-commit hook](https://jupytext.readthedocs.io/en/latest/using-pre-commit.html) that can be used to sync your paired files automatically when updating your GitHub repo. To learn more about pre-commit hooks in general, see the [git docs on pre-commit hooks](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks).
 
-## Ruff
+### Marimo
+
+[marimo](https://marimo.io/) functions similarly to a Jupyter Notebook, but has many built-in reproducibility and error-avoidance features, including the fact that it saves as a Python program (similar to the paired file created by [Jupytext](#jupytext)). See the summary in their [README](https://github.com/marimo-team/marimo?tab=readme-ov-file) or explore the [docs](https://docs.marimo.io/) to get started.
+
+## Formatting and Linting
 
-[Ruff](https://github.com/astral-sh/ruff) is a fast python formatter and linter. You can install it with `pip install ruff` or `conda install ruff` in your virtual/conda environment. They also have extensions for [VS Code](https://github.com/astral-sh/ruff-vscode) and [other editors supporting LSP](https://github.com/astral-sh/ruff-lsp).
+Have you found yourself saying, "I just need to clean up my code first"? Make this easier, and do it as you go, with linters! Additionally, formatting can impact code consistency and readability, while altering the display of Markdown and generally adding noise to version control diffs.
+[Ruff](#ruff) and [markdownlint](#markdownlint) are two tools designed to resolve this challenge, for Python and Markdown, respectively.
+
+### Ruff
+
+Fast _Python_ formatter and linter. You can install [astral-sh/ruff](https://github.com/astral-sh/ruff) with `pip install ruff` or `conda install ruff` in your virtual/conda environment. They also have extensions for [VS Code](https://github.com/astral-sh/ruff-vscode) and [other editors supporting LSP](https://github.com/astral-sh/ruff-lsp).
 
 To format a file, run:
 
@@ -39,3 +51,60 @@ ruff check
 ```
 
 Ruff can also be set up as part of a pre-commit hook or GitHub Workflow. See their [Usage section](https://github.com/astral-sh/ruff?tab=readme-ov-file#usage) for more information.
+
+### Markdownlint
+
+Fast _Markdown_ formatter and linter. We use the [DavidAnson/markdownlint](https://github.com/DavidAnson/markdownlint) package for this site; see instructions and an example in the [linting section](https://github.com/Imageomics/Collaborative-distributed-science-guide/blob/main/CONTRIBUTING.md#linting) of our contributing guidelines. It is flexible in configuration and allows for simple checking or even fixing straightforward formatting issues.
+
+## FAIR Data Access and Validation
+
+Don't add to the reproducibility crisis! Are you using existing data accessed through URLs and need to ensure consistency for re-use? Do you have a folder of images with all their metadata documented through their filenames? [Cautious Robot](#cautious-robot) and [Sum Buddy](#sum-buddy) are here to help.
+
+### Cautious Robot
+
+Simple image-from-CSV downloader. The [Imageomics/cautious-robot](https://github.com/Imageomics/cautious-robot) package provides a FAIR and Reproducible method for **downloading a collection of images from URLs**.
+
+- Configurable wait time and max attempts for retry.
+- Names images by a given column with unique values.
+- Logs all successful responses and errors for review after download.
+- Uses [sum-buddy](#sum-buddy) to record checksums of all downloaded images.
+- Performs a minimal check that the number of expected images matches the number sum-buddy counts.
+
+**Optional features:**
+
+- Organize images into subfolders based on any column in the CSV.
+- Create square images for modeling:
+  - Organizes images in a second directory (same format) with copies of the images at a specified size.
+- **Buddy-check:** verifies all expected images downloaded intact (compares given checksums with sum-buddy output).
+
+#### Sample Command
+
+Given a CSV (`example.csv`) with a list of image URLs in a `file_url` column and a `filename` column providing unique IDs for each image, the following snippet will download the images into an `example_images/` directory and validate the contents against the MD5 hashes provided in the `md5` column of the CSV.
+
+```console
+cautious-robot --input-file example.csv --output-dir example_images -v "md5"
+```
+
+To download larger (10-100M image scale), more distributed datasets to HPC systems, please see [Imageomics/distributed-downloader](https://github.com/Imageomics/distributed-downloader).
+
+### Sum Buddy
+
+Simple and flexible checksum calculator, from a single file to an entire directory. The [Imageomics/sum-buddy](https://github.com/Imageomics/sum-buddy) package provides a FAIR and Reproducible method for **duplicate file identification**, efficient **metadata generation**, and general **file integrity and validation** support.
+
+- Input: Folder with things to checksum.
+- Output: CSV or printout of filepaths, filenames, and checksums.
+- Options:
+  - Ignore subfolders and patterns,
+  - Hash algorithm to use,
+  - Avoid hidden files and directories.
+- Usage: Run as a CLI or with exposed Python methods.
+
+#### Sample Use Case
+
+Given a collection of images, e.g., in an `images/` directory, with no accompanying metadata, quickly generate a metadata file listing the filepaths, filenames, and checksums of all images contained in the folder.
+Note the option to include an "ignore file". This operates similarly to a `.gitignore`, allowing one to avoid inclusion of particular files or file types. In this case, let's assume there may be some `.doc` or similar files included with the images. Hidden files and directories (e.g., `.DS_Store`) are ignored by default.
+
+```console
+sum-buddy --output-file metadata.csv --ignore-file .sbignore images/
+```
+
+An added benefit of this method of metadata CSV generation is the ability to quickly and easily check for duplicate images within a collection. See our [data training repo](https://github.com/Imageomics/data-workshop-AH-2024) to learn more about this subject.
diff --git a/docs/wiki-guide/Technical-Infrastructure.md b/docs/wiki-guide/Technical-Infrastructure.md
index 76e6183..421ef6e 100644
--- a/docs/wiki-guide/Technical-Infrastructure.md
+++ b/docs/wiki-guide/Technical-Infrastructure.md
@@ -11,7 +11,7 @@ Overall [Infrastructure Chart](https://docs.google.com/spreadsheets/d/1JSOi5pp2Y
 - [OSC Storage Guidelines (_internal_)](https://github.com/Imageomics/internal-guidelines/wiki/OSC-Storage-Guidelines): recommended usage patterns for each file system on OSC.
 - Imageomics dedicated GPU server: _Internal_ server, hosts our CVAT instance.
   - [Usage and access guide (_internal_)](https://github.com/Imageomics/internal-guidelines/wiki/Imageomics-GPU-Server)
-  - [CVAT user guide](https://github.com/Imageomics/kabr-tools/wiki/CVAT-User-Guide)
+  - [CVAT user guide](https://imageomics.github.io/kabr-tools/cvat/cvat-guide/)
 - NSF ACCESS Accelerate Allocation: NCSA Delta GPU credits
 - [Amazon Web Services (AWS)](https://aws.amazon.com/?nc2=h_lg): Basic to extremely powerful, abundant (though finite) resources, high cost.
   - Used sparingly for urgent deadlines when other compute is not available (generally hasn't been available at those times either, though) or to host projects that cannot be hosted effectively through a Hugging Face Space.
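Reviewer note on the jupytext pre-commit hook mentioned in the first file: a minimal `.pre-commit-config.yaml` along the lines of the jupytext docs might look like the sketch below. The `rev` tag here is an assumption; pin it to the jupytext release you actually use.

```yaml
repos:
  - repo: https://github.com/mwouts/jupytext
    rev: v1.16.4  # assumed version; pin to your installed jupytext release
    hooks:
      - id: jupytext
        args: [--sync]
```

With this in place, `pre-commit install` wires the hook into `.git/hooks`, so each commit re-syncs paired `.ipynb`/`.py` files automatically.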
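Reviewer note on the duplicate-image check described under Sum Buddy: the idea can be illustrated with a short, hypothetical sketch using only the Python standard library. This is not sum-buddy's API; the function names below are invented for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Group non-hidden files under `folder` by checksum; keep groups with >1 file."""
    by_hash = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and not path.name.startswith("."):
            by_hash.setdefault(md5_of_file(path), []).append(str(path))
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Running `find_duplicates("images/")` on a directory like the one in the sample use case surfaces any byte-identical images; this is the same check that sum-buddy's checksum CSV enables at scale.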