Data hosting #100

Merged on Nov 10, 2023 (23 commits)

@niksirbi (Member) commented Jul 13, 2023

Closes #33

Sample projects

I added some sample WAZP projects to an external data repository, to be used for testing, examples and tutorials.
Our hosting platform of choice is called GIN and is maintained by the German Neuroinformatics Node.
GIN has a GitHub-like interface and git-like CLI functionalities.

Project organisation

The projects are stored in folders named after the species - e.g. jewel-wasp (Ampulex compressa).
Each species folder may contain various WAZP sample projects as zipped archives. For example, the jewel-wasp folder contains the following projects:

  • short-clips_raw.zip - a project containing short ~10 second clips extracted from raw .avi files.
  • short-clips_compressed.zip - same as above, but compressed using the H.264 codec and saved as .mp4 files.
  • entire-video_raw.zip - a project containing the raw .avi file of an entire video, ~32 minutes long.
  • entire-video_compressed.zip - same as above, but compressed using the H.264 codec and saved as an .mp4 file.

Each WAZP sample project has the following structure:

{project-name}.zip
    ├── videos
    │   ├── {video1-name}.{ext}
    │   ├── {video1-name}.metadata.yaml
    │   ├── {video2-name}.{ext}
    │   ├── {video2-name}.metadata.yaml
    │   └── ...
    ├── pose_estimation_results
    │   ├── {video1-name}{model-name}.h5
    │   ├── {video2-name}{model-name}.h5
    │   └── ...
    ├── WAZP_config.yaml
    └── metadata_fields.yaml
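
As a rough illustration of the layout above, an unzipped project folder could be sanity-checked along these lines. This is only a sketch: the function name find_layout_problems is hypothetical and not part of WAZP, and it checks only the top-level entries and the per-video metadata files, not the naming of the pose estimation results.

```python
from pathlib import Path

# Top-level entries taken from the project layout described above.
REQUIRED_DIRS = ("videos", "pose_estimation_results")
REQUIRED_FILES = ("WAZP_config.yaml", "metadata_fields.yaml")


def find_layout_problems(project_dir: Path) -> list:
    """Return a list of problems found; an empty list means the layout looks fine.

    Hypothetical helper for illustration, not part of the WAZP package.
    """
    problems = []
    for name in REQUIRED_DIRS:
        if not (project_dir / name).is_dir():
            problems.append(f"missing folder: {name}")
    for name in REQUIRED_FILES:
        if not (project_dir / name).is_file():
            problems.append(f"missing file: {name}")

    videos_dir = project_dir / "videos"
    if videos_dir.is_dir():
        for video in sorted(videos_dir.iterdir()):
            # The metadata files themselves are not videos, so skip them.
            if video.name.endswith(".metadata.yaml"):
                continue
            # Each video is expected to have a sibling {video-name}.metadata.yaml.
            metadata = videos_dir / f"{video.stem}.metadata.yaml"
            if not metadata.is_file():
                problems.append(f"missing metadata for video: {video.name}")
    return problems
```

Calling find_layout_problems(Path("/path/to/unzipped/project")) on a well-formed project should return an empty list.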

Note
To learn more about how the sample projects were generated, see scripts/generate_sample_projects in the WAZP GitHub repository. I decided to keep the scripts in this repository, in case we need to produce more sample projects from different datasets in the future.

Fetching projects

To fetch the data from GIN, we use the pooch Python package, which can download data from pre-specified URLs and store them locally for all subsequent uses. It also provides some nice utilities, like verification of sha256 hashes and decompression of archives.
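
To make the "download once, verify, unzip" pattern concrete, here is a minimal standard-library sketch of the idea. It is not how pooch or wazp implement it; the function names sha256_of and fetch_and_unzip are made up for illustration, and pooch handles many details (progress bars, retries, richer caching) that are ignored here.

```python
import hashlib
import urllib.request
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the sha256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_and_unzip(url: str, known_hash: str, cache_dir: Path) -> Path:
    """Download `url` into `cache_dir` unless already cached, verify it, unzip it.

    Illustrative only: a simplified stand-in for what a data-fetching
    library does, not the WAZP implementation.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    archive = cache_dir / url.rsplit("/", 1)[-1]

    # Download only if there is no cached copy with the expected hash.
    if not archive.is_file() or sha256_of(archive) != known_hash:
        urllib.request.urlretrieve(url, archive)
        if sha256_of(archive) != known_hash:
            archive.unlink()
            raise ValueError(f"sha256 mismatch for {archive.name}")

    # Extract into a folder named after the archive (without ".zip").
    target = cache_dir / archive.stem
    if not target.is_dir():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
    return target
```

The second call with the same arguments finds the cached archive and the extracted folder, so nothing is downloaded or unzipped again.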

The relevant functionality is implemented in the wazp/datasets.py module. The most important parts of this module are:

  1. The sample_projects registry, which contains a list of the zipped projects and their known hashes.
  2. The find_sample_projects() function, which returns the names of available projects per species, in the form of a dictionary.
  3. The get_sample_project() function, which downloads a project (if not already cached locally), unzips it, and returns the path to the unzipped folder.

Example usage:

>>> from wazp.datasets import find_sample_projects, get_sample_project

>>> projects_per_species = find_sample_projects()
>>> print(projects_per_species)
{'jewel-wasp': ['short-clips_raw', 'short-clips_compressed', 'entire-video_raw', 'entire-video_compressed']}

>>> project_path = get_sample_project('jewel-wasp', 'short-clips_raw')
>>> print(project_path)
/home/user/.WAZP/sample_data/jewel-wasp/short-clips_raw
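
For intuition, a dictionary like the one returned by find_sample_projects() could be derived from registry keys roughly as follows. This is a sketch under assumptions: the "species/project.zip" key format, the placeholder hashes, and the helper name group_projects_by_species are all hypothetical and may differ from the actual module.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical registry: archive path on the data repository -> sha256 hash.
# Both the key format and the placeholder hash values are assumptions.
EXAMPLE_REGISTRY = {
    "jewel-wasp/short-clips_raw.zip": "<sha256 hash>",
    "jewel-wasp/short-clips_compressed.zip": "<sha256 hash>",
    "jewel-wasp/entire-video_raw.zip": "<sha256 hash>",
    "jewel-wasp/entire-video_compressed.zip": "<sha256 hash>",
}


def group_projects_by_species(registry: dict) -> dict:
    """Group project names (archive names without '.zip') by species folder."""
    projects = defaultdict(list)
    for key in registry:
        path = PurePosixPath(key)
        # The parent folder is the species; the stem is the project name.
        projects[path.parent.name].append(path.stem)
    return dict(projects)
```

With the example registry above, group_projects_by_species(EXAMPLE_REGISTRY) yields one 'jewel-wasp' entry listing the four project names in registry order.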

Local storage

By default, the projects are stored in the ~/.WAZP/sample_data folder. This can be changed by setting the LOCAL_DATA_DIR variable in the wazp/datasets.py module.

Adding new projects

Only core WAZP developers may add new projects to the external data repository.
To add a new project, you will need to:

  1. Create a GIN account
  2. Ask to be added as a collaborator on the WAZP data repository (if not already)
  3. Download the GIN CLI and set it up with your GIN credentials, by running gin login in a terminal.
  4. Clone the WAZP data repository to your local machine, by running gin get SainsburyWellcomeCentre/WAZP in a terminal.
  5. Add your new projects, followed by git add, and git commit, just like you would with a GitHub repository. Make sure to follow the project organisation as described above. Don't forget to modify the README file accordingly.
  6. Upload the committed changes to the GIN repository, by running gin upload. Latest changes to the repository can be pulled via gin download. gin sync will synchronise the latest changes bidirectionally.
  7. Determine the sha256 checksum hash of each new project archive, by running sha256sum {project-name}.zip in a terminal. Alternatively, you can use pooch to do this for you: python -c "import pooch; print(pooch.file_hash('/path/to/file.zip'))". If you wish to generate a text file containing the hashes of all the files in a given folder, you can use python -c "import pooch; pooch.make_registry('/path/to/folder', 'hash_registry.txt')".
  8. Update the wazp/datasets.py module on the WAZP GitHub repository by adding the new projects to the sample_projects registry. Make sure to include the correct sha256 hash, as determined in the previous step. Follow all the usual guidelines for contributing code. Additionally, you may want to update the scripts in scripts/generate_sample_projects, depending on how you generated the new projects. Make sure to test whether the new projects can be fetched successfully (see fetching projects above) before submitting your pull request.

You can also perform steps 3-6 via the GIN web interface, if you prefer to avoid using the CLI.
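
For step 7, if neither sha256sum nor pooch is at hand, the hashes can also be computed with the Python standard library alone. The sketch below is an illustrative alternative, not part of WAZP: the function name write_hash_registry is hypothetical, it only considers .zip archives, and it writes one "relative/path hash" pair per line.

```python
import hashlib
from pathlib import Path


def write_hash_registry(folder: Path, output: Path) -> dict:
    """Hash every .zip under `folder` and write 'relative/path sha256' lines.

    Hypothetical helper for illustration; returns the mapping it wrote.
    """
    hashes = {}
    for archive in sorted(folder.rglob("*.zip")):
        digest = hashlib.sha256()
        # Read in chunks, since raw video archives can be large.
        with archive.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        hashes[archive.relative_to(folder).as_posix()] = digest.hexdigest()

    lines = [f"{name} {value}\n" for name, value in hashes.items()]
    output.write_text("".join(lines))
    return hashes
```

The resulting hashes can then be copied into the sample_projects registry as described in step 8.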

Using sample projects in tests

I think the best way to do that is through pytest fixtures.
For example, I've added one in tests/test_unit/conftest.py:

from pathlib import Path

import pytest

from wazp.datasets import get_sample_project


@pytest.fixture()
def sample_project() -> Path:
    """Get the sample project for testing."""
    return get_sample_project("jewel-wasp", "short-clips_compressed", progressbar=True)

This gets the smallest sample project and returns its local path, to be used in tests.

@codecov-commenter commented Jul 14, 2023

Codecov Report

Merging #100 (f486328) into main (4a7dcbb) will increase coverage by 2.70%.
Report is 1 commits behind head on main.
The diff coverage is 86.04%.

@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   39.94%   42.64%   +2.70%     
==========================================
  Files          12       13       +1     
  Lines         691      734      +43     
==========================================
+ Hits          276      313      +37     
- Misses        415      421       +6     
Files Coverage Δ
wazp/datasets.py 86.04% <86.04%> (ø)

@samcunliffe (Member) left a comment

🍉 OK, this all looks really good.

No comments about the structure or code quality or robustness. (It was left hanging for months and still works ✨.) I've checked and it all works perfectly for me in my existing months-old wazp-env conda environment.

I have one noncritical, somewhat major, comment:

The tools in scripts/generate_sample_projects are all untested. Is it worth or feasible to do a dry run of them as part of a testing job? Either tack to the end of the current or a new workflow.

Suuuper rough sketch:

name: test-tools
on:
  push:
    branches: [main]
  pull_request:

jobs:
  run_sample_project_gen:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout source
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.x"
      - name: Install self
        run: python -m pip install .
      - name: Setup local data
        # Assumes the GIN CLI has already been installed on the runner
        run: |
          mkdir -p ./Code/Data/WAZP
          gin get SainsburyWellcomeCentre/WAZP
      - name: Run sample generation
        run: python scripts/generate_sample_projects/main.py
      - name: Check generation was OK
        # Placeholder path: replace with the actual output archive
        run: test -f /path/to/output/file.zip

Now that won't improve our test coverage as reported by pytest but at least we will know that we don't break these tools.

... I also don't insist on this (hence approving). If you think it's worth shooting for right now, that's cool. If it's worth farming this into an issue for future Sam and Niko, that's also cool. If it's too much of a pain then we can chat in the next gemba.

@niksirbi (Member, Author) commented

The tools in scripts/generate_sample_projects are all untested. Is it worth or feasible to do a dry run of them as part of a testing job? Either tack to the end of the current or a new workflow.

I thought about it, and this would actually be very hard to do as things stand. The main reason is that these samples are sourced from our non-public internal server storage, so they cannot "really" be tested without access to that source (which the GitHub runners can't have). Downloading the test data from GIN only gives you the output of this pipeline; the input is inaccessible.

Of course, we still have some options. We could:

  • rewrite the scripts to work with already publicly available animal behaviour datasets
  • access the source data via our locally hosted GitHub runner
  • only test smaller parts of the scripts with unit tests, as much as we can without the source data (no end-to-end testing)

Since all the above require considerable work, and this is not a priority right now, I opened an issue for future reference.

@niksirbi (Member, Author) commented

Since all the above require considerable work, and this is not a priority right now, I opened an issue for future reference.

Apart from the above, I took care of all smaller comments, so I'm going ahead with the merge 🤞🏼

Thanks a ton @samcunliffe!

@niksirbi niksirbi merged commit 4993bbc into main Nov 10, 2023
20 checks passed
@niksirbi niksirbi deleted the data-hosting branch November 10, 2023 19:02
niksirbi added a commit that referenced this pull request Nov 13, 2023
Co-authored-by: Sam Cunliffe <samcunliffe@users.noreply.github.com>

Successfully merging this pull request may close these issues.

Hosting data for testing
3 participants