Data hosting #100

niksirbi · 2023-07-13T17:18:44Z

Closes #33

Sample projects

I added some sample WAZP projects to an external data repository, to be used for testing, examples and tutorials.
Our hosting platform of choice is called GIN and is maintained by the German Neuroinformatics Node.
GIN has a GitHub-like interface and git-like CLI functionalities.

Project organisation

The projects are stored in folders named after the species - e.g. jewel-wasp (Ampulex compressa).
Each species folder may contain various WAZP sample projects as zipped archives. For example, the jewel-wasp folder contains the following projects:

short-clips_raw.zip - a project containing short ~10 second clips extracted from raw .avi files.
short-clips_compressed.zip - same as above, but compressed using the H.264 codec and saved as .mp4 files.
entire-video_raw.zip - a project containing the raw .avi file of an entire video, ~32 minutes long.
entire-video_compressed.zip - same as above, but compressed using the H.264 codec and saved as an .mp4 file.

Each WAZP sample project has the following structure:

{project-name}.zip
    ├── videos
    │   ├── {video1-name}.{ext}
    │   ├── {video1-name}.metadata.yaml
    │   ├── {video2-name}.{ext}
    │   ├── {video2-name}.metadata.yaml
    │   └── ...
    ├── pose_estimation_results
    │   ├── {video1-name}{model-name}.h5
    │   ├── {video2-name}{model-name}.h5
    │   └── ...
    ├── WAZP_config.yaml
    └── metadata_fields.yaml
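
As a rough illustration of the layout above, an unzipped project folder could be checked for its four top-level entries along these lines. This is only a sketch: the helper name missing_entries is hypothetical and is not part of WAZP.

```python
from __future__ import annotations

from pathlib import Path

# Top-level entries taken from the project layout described above.
EXPECTED_ENTRIES = (
    "videos",
    "pose_estimation_results",
    "WAZP_config.yaml",
    "metadata_fields.yaml",
)


def missing_entries(project_dir: Path) -> list[str]:
    """Return the expected top-level entries that are absent from project_dir.

    Hypothetical helper for illustration; an empty list means the folder
    contains everything the layout asks for.
    """
    return [name for name in EXPECTED_ENTRIES if not (project_dir / name).exists()]
```

An empty return value means the folder matches the expected structure at the top level; the contents of videos and pose_estimation_results are not inspected here.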

Note
To learn more about how the sample projects were generated, see scripts/generate_sample_projects in the WAZP GitHub repository. I thought to "save" the scripts in this repository, in case we need to produce more sample projects from different datasets in the future.

Fetching projects

To fetch the data from GIN, we use the pooch Python package, which can download data from pre-specified URLs and store them locally for all subsequent uses. It also provides some nice utilities, like verification of sha256 hashes and decompression of archives.

The relevant functionality is implemented in the wazp.datasets.py module. The most important parts of this module are:

The sample_projects registry, which contains a list of the zipped projects and their known hashes.
The find_sample_projects() function, which returns the names of available projects per species, in the form of a dictionary.
The get_sample_project() function, which downloads a project (if not already cached locally), unzips it, and returns the path to the unzipped folder.
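
To make the relationship between the registry and the per-species dictionary concrete, here is a minimal sketch. It assumes registry keys of the form "<species>/<project>.zip", which may differ from the actual keys in the module. The hash values are placeholders, and group_projects_by_species is a hypothetical name rather than the module's own function.

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import PurePosixPath

# Assumed registry shape: "<species>/<project>.zip" mapped to a sha256 hash.
# The hash strings are placeholders, not real checksums.
EXAMPLE_REGISTRY = {
    "jewel-wasp/short-clips_raw.zip": "<sha256 placeholder>",
    "jewel-wasp/short-clips_compressed.zip": "<sha256 placeholder>",
    "jewel-wasp/entire-video_raw.zip": "<sha256 placeholder>",
    "jewel-wasp/entire-video_compressed.zip": "<sha256 placeholder>",
}


def group_projects_by_species(registry: dict[str, str]) -> dict[str, list[str]]:
    """Group project names (archive names without .zip) by species folder."""
    projects: dict[str, list[str]] = defaultdict(list)
    for key in registry:
        path = PurePosixPath(key)
        # parent.name is the species folder, stem drops the ".zip" suffix
        projects[path.parent.name].append(path.stem)
    return dict(projects)
```

Grouping on the folder part of each key means a new species only requires new registry entries, with no change to the lookup logic.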

Example usage:

>>> from wazp.datasets import find_sample_projects, get_sample_project

>>> projects_per_species = find_sample_projects()
>>> print(projects_per_species)
{'jewel-wasp': ['short-clips_raw', 'short-clips_compressed', 'entire-video_raw', 'entire-video_compressed']}

>>> project_path = get_sample_project('jewel-wasp', 'short-clips_raw')
>>> print(project_path)
/home/user/.WAZP/sample_data/jewel-wasp/short-clips_raw

Local storage

By default, the projects are stored in the ~/.WAZP/sample_data folder. This can be changed by setting the LOCAL_DATA_DIR variable in the wazp.datasets.py module.
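
For intuition, the cache-then-unzip behaviour can be sketched with the standard library alone. In the actual module this work is delegated to pooch, so the code below is an illustration under that caveat, and unzip_once is a hypothetical name.

```python
import zipfile
from pathlib import Path

# Default location described above; Path.home() keeps it independent of the user.
LOCAL_DATA_DIR = Path.home() / ".WAZP" / "sample_data"


def unzip_once(archive: Path, data_dir: Path = LOCAL_DATA_DIR) -> Path:
    """Extract archive into data_dir/<archive name without .zip>.

    If the target folder already exists, extraction is skipped and the
    existing folder is returned, so repeated calls are cheap.
    """
    target = data_dir / archive.stem
    if not target.is_dir():
        target.mkdir(parents=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
    return target
```

Passing a different data_dir is the sketch's analogue of changing LOCAL_DATA_DIR.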

Adding new projects

Only core WAZP developers may add new projects to the external data repository.
To add a new project, you will need to:

1. Create a GIN account.
2. Ask to be added as a collaborator on the WAZP data repository (if not already).
3. Download the GIN CLI and set it up with your GIN credentials, by running gin login in a terminal.
4. Clone the WAZP data repository to your local machine, by running gin get SainsburyWellcomeCentre/WAZP in a terminal.
5. Add your new projects, followed by git add and git commit, just like you would with a GitHub repository. Make sure to follow the project organisation as described above. Don't forget to modify the README file accordingly.
6. Upload the committed changes to the GIN repository, by running gin upload. Latest changes to the repository can be pulled via gin download. gin sync will synchronise the latest changes bidirectionally.
7. Determine the sha256 checksum hash of each new project archive, by running sha256sum {project-name}.zip in a terminal. Alternatively, you can use pooch to do this for you: python -c "import pooch; print(pooch.file_hash('/path/to/file.zip'))". If you wish to generate a text file containing the hashes of all the files in a given folder, you can use python -c "import pooch; pooch.make_registry('/path/to/folder', 'hash_registry.txt')".
8. Update the wazp.datasets.py module on the WAZP GitHub repository by adding the new projects to the sample_projects registry. Make sure to include the correct sha256 hash, as determined in the previous step. Follow all the usual guidelines for contributing code. Additionally, you may want to update the scripts in scripts/generate_sample_projects, depending on how you generated the new projects. Make sure to test whether the new projects can be fetched successfully (see fetching projects above) before submitting your pull request.

You can also perform steps 3-6 via the GIN web interface, if you prefer to avoid using the CLI.
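
If neither sha256sum nor pooch is at hand, the hashing step can also be reproduced with the standard library. The sketch below is an illustrative alternative with hypothetical function names. It reads each archive in chunks so that large video archives do not need to fit in memory.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def registry_lines(folder: Path) -> list[str]:
    """Return one "<file name> <sha256>" line per .zip archive in folder."""
    return [f"{p.name} {sha256_of_file(p)}" for p in sorted(folder.glob("*.zip"))]
```

The digest is the same value that sha256sum prints in its first column, so either tool can be used to fill in the registry.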

Using sample projects in tests

I think the best way to do that is through pytest fixtures.
For example, I've added one in tests/test_unit/conftest.py:

from pathlib import Path

import pytest

from wazp.datasets import get_sample_project


@pytest.fixture()
def sample_project() -> Path:
    """Get the sample project for testing."""
    return get_sample_project("jewel-wasp", "short-clips_compressed", progressbar=True)

This gets the smallest sample project and returns its local path, to be used in tests.
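
As a sketch of what a test built on this fixture might check, the helper below lists videos that lack a matching metadata file, following the {video-name}.metadata.yaml convention from the project structure. The helper name and the set of video suffixes are assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path

# Assumed video extensions, based on the .avi and .mp4 sample projects.
VIDEO_SUFFIXES = {".avi", ".mp4"}


def videos_without_metadata(project_dir: Path) -> list[str]:
    """Return the names of videos that have no {video-name}.metadata.yaml sibling."""
    videos_dir = project_dir / "videos"
    return sorted(
        video.name
        for video in videos_dir.iterdir()
        if video.suffix in VIDEO_SUFFIXES
        and not (video.parent / f"{video.stem}.metadata.yaml").exists()
    )
```

A test receiving the sample_project fixture could then assert that videos_without_metadata(sample_project) returns an empty list.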

codecov-commenter · 2023-07-14T15:00:51Z

Codecov Report

Merging #100 (f486328) into main (4a7dcbb) will increase coverage by 2.70%.
Report is 1 commits behind head on main.
The diff coverage is 86.04%.

@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   39.94%   42.64%   +2.70%     
==========================================
  Files          12       13       +1     
  Lines         691      734      +43     
==========================================
+ Hits          276      313      +37     
- Misses        415      421       +6

Files	Coverage Δ
wazp/datasets.py	`86.04% <86.04%> (ø)`


samcunliffe

🍉 OK, this all looks really good.

No comments about the structure or code quality or robustness. (It was left hanging for months and still works ✨.) I've checked and it all works perfectly for me in my existing months-old wazp-env conda environment.

I have one noncritical, somewhat major, comment:

The tools in scripts/generate_sample_projects are all untested. Is it worth or feasible to do a dry run of them as part of a testing job? Either tack to the end of the current or a new workflow.

Suuuper rough sketch:

name: test-tools
on:
  push:
    branches: main
  pull_request:

jobs:
  run_sample_project_gen:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout source
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
      - name: Install self
        run: python -m pip install .
      - name: Setup local data
        run: |
            # -p creates the missing parent directories
            mkdir -p ./Code/Data/WAZP
            # assumes the GIN CLI is already installed on the runner
            gin get SainsburyWellcomeCentre/WAZP
      - name: Run sample generation
        run: python scripts/generate_sample_projects/main.py
      - name: Check generation was OK
        run: test -f /path/to/output/file.zip

Now that won't improve our test coverage as reported by pytest but at least we will know that we don't break these tools.

... I also don't insist on this (hence approving). If you think it's worth shooting for right now, that's cool. If it's worth farming this into an issue for future Sam and Niko, that's also cool. If it's too much of a pain then we can chat in the next gemba.


niksirbi · 2023-11-10T19:00:43Z

> The tools in scripts/generate_sample_projects are all untested. Is it worth or feasible to do a dry run of them as part of a testing job? Either tack to the end of the current or a new workflow.

I thought about it, and this would actually be very hard to do as things stand. The main reason is that these samples are sourced from our non-public internal server storage, so they cannot really be tested without access to that source (which the GitHub runners can't have). Downloading the test data from GIN only gives you the output of this pipeline; the input is inaccessible.

Of course, we still have some options. We could:

rewrite the scripts to work with already publicly available animal behaviour datasets
access the source data via our locally hosted GitHub runner
only test smaller parts of the scripts with unit tests, as much as we can without the source data (no end-to-end testing)

Since all the above require considerable work, and this is not a priority right now, I opened an issue for future reference.

niksirbi · 2023-11-10T19:02:30Z

> Since all the above require considerable work, and this is not a priority right now, I opened an issue for future reference.

Apart from the above, I took care of all smaller comments, so I'm going ahead with the merge 🤞🏼

Thanks a ton @samcunliffe!


niksirbi added 15 commits July 6, 2023 11:56

added pooch as dependency

87368c3

removed old sample project

6e21546

added scripts for generating sample projects automatically

4b5c07a

fixed typo in file name

50a6da5

modified scripts for automatic sample project generation

6e67eb3

Added datasets module to handle the fetching of sample datasets

cf2886d

added info about sample datasets in CONTRIBUTING.md

c1ee54a

added pytest fixtures for fetching the sample project

d5e81f8

fixed paths to sample project in tests

6f5d25d

modify assertions in tess for metadatda dfs

ae40627

added expected hashes for sample datasets

fa79654

added tqdm dependency

4d8656f

renamed sample dataset to sample projects

93d6222

delete placeholder unit test

64b3026

moved sample_project fixture to conftest.py

fdb7c1c

niksirbi marked this pull request as ready for review July 14, 2023 15:01

niksirbi requested review from samcunliffe and sfmig July 14, 2023 15:01

sfmig mentioned this pull request Jul 20, 2023

Add benchmarks for the main function [Feature] brainglobe/brainglobe-workflows#105

Closed

niksirbi added 3 commits August 2, 2023 18:13

added caching of test data during gh actions to speed up CI

c56d63d

deleted duplicate word in docstring

81861cc

modified instructions for adding new projects on GIN

95c551c

samcunliffe approved these changes Nov 8, 2023

Review comments were left on the following files:

wazp/datasets.py (resolved)

.github/workflows/test_and_deploy.yml (outdated, resolved)

scripts/generate_sample_projects/main.py (outdated, resolved)

tests/test_unit/test_placeholder.py (outdated, resolved)

niksirbi and others added 5 commits November 10, 2023 17:01

Update path in test_and_deploy.yml worflow

b49262a

Co-authored-by: Sam Cunliffe <samcunliffe@users.noreply.github.com>

added convenience function for downloading all sample projects

bb3d4db

replace my own HOME dir with Path.home()

ea5b037

sphinx ignore missing anchors during linkcheck

fc56690

make linkcheck happy

f486328

niksirbi mentioned this pull request Nov 10, 2023

Refactor sample projects generation scripts into a proper tested utility #102

Open

niksirbi merged commit 4993bbc into main Nov 10, 2023
20 checks passed

niksirbi deleted the data-hosting branch November 10, 2023 19:02

niksirbi added a commit that referenced this pull request Nov 13, 2023

Data hosting (#100)

34404d9

Co-authored-by: Sam Cunliffe <samcunliffe@users.noreply.github.com>
