Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement annif upload and annif download commands for Hugging Face Hub integration #762

Merged
merged 39 commits into from
Apr 23, 2024

Conversation

juhoinkinen
Copy link
Member

@juhoinkinen juhoinkinen commented Feb 1, 2024

The organization of the uploaded files follows the 4. option of issue #760:

  1. Separate files for each project, vocab, and projects configs

Compress each project directory into its own zip file (in projects/<project-id>.zip in the repo).

Each vocabularies zip is placed in vocabs/<vocab-id>.zip, and each project configuration is placed in <project-id>.cfg in the repo root.

This option is good for caching by preventing unnecessary uploads/downloads of the projects bundle when only one is changed, and for visibility of the repo contents.

The unzip after download places the data directories directly to data/projects/ and data/vocabs/, and the project configurations as separate <project-id>.cfg files in projects.d/ directory, so after download the projects are directly usable (if projects.{cfg,toml} does not exists or by using the --projects option).

Upload

Push a set of selected projects and vocabularies to a Hugging Face Hub repository as zip files and the configurations of the projects as cfg/ini files. For example, the following command uploads all projects with ids matching *-fi to juhoinkinen/Annif-models-upload-testing:

annif upload-projects "*-fi" juhoinkinen/Annif-models-upload-testing --commit-message "[OptMsg] Upload all Finnish projects"

Download

Downloads the project and vocabulary archives and the configuration files of the projects that match the given
and unzip the archives to data/ directory and places the configuration files to projects.d/ directory:

annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing

Note that currently the download will overwrite the project and vocab dirs if they already exist. Edit: By default overwrite does not happen, it can be performed by adding the --force option to the command.

A git revision (commit hash, branch etc.) can be specified with --revision option.

The downloaded files remain in the huggingface_hub client cache dir.


For all uploads and downloads in the case of private repos the user needs to have logged in with huggingface-cli login, or the HF Hub token can be given also with the --token option of this Annif command.

@juhoinkinen

This comment was marked as outdated.

annif/cli_util.py Fixed Show fixed Hide fixed
Copy link

codecov bot commented Feb 6, 2024

Codecov Report

Attention: Patch coverage is 99.69697% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 99.64%. Comparing base (df20982) to head (6f35fff).
Report is 47 commits behind head on main.

Files Patch % Lines
annif/hfh_util.py 99.30% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #762      +/-   ##
==========================================
- Coverage   99.65%   99.64%   -0.02%     
==========================================
  Files          89       91       +2     
  Lines        6404     6768     +364     
==========================================
+ Hits         6382     6744     +362     
- Misses         22       24       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@juhoinkinen juhoinkinen changed the title Implement annif upload-projects command for Hugging Face Hub integration Implement annif upload-projects and download-projects commands for Hugging Face Hub integration Feb 8, 2024
annif/cli_util.py Fixed Show fixed Hide fixed
@juhoinkinen juhoinkinen linked an issue Feb 8, 2024 that may be closed by this pull request
annif/cli.py Outdated Show resolved Hide resolved
@juhoinkinen
Copy link
Member Author

juhoinkinen commented Feb 8, 2024

TODO:

  • Try to find a way to avoid the need to use quotes around projects-id pattern (without them shell expansion kicks in) Edit: Not possible, or at least worth it, see this SO answer.
  • Add --force option to allow overwrite of existing files when downloading; prevent overwrite otherwise
  • Add tests: could use mocking of huggingface_hub methods to assert their calls, e.g. @mock.patch("huggingface_hub.commands.upload.HfApi.upload_file")
  • Speed up CLI startup by moving huggingface_hub imports to functions
  • Update to huggingface-hub 0.21.*
  • Add a page about the HFH integration to Wiki
  • Add upload and download commands to CLI documentation at readthedocs (or check they get automatically included)

@osma could you take a look at the working principles of the commands before I spend more time with this?

The downloaded files now remain in cache for huggingface client, e.g. (/home/local/jmminkin/.cache/huggingface/hub/models--juhoinkinen--Annif-models-upload-testing), should the cache be cleared when successfully unzipped the datadirs and moved the configs?

Could or should these commands be flagged as experimental for now, to allow changing them?

@osma
Copy link
Member

osma commented Feb 29, 2024

I took a look at this (sorry for the delay!) and I think this is a really good feature. Some thoughts:

  1. Agree that these commands should be marked as experimental for now. We may want to change the command options, functionality etc. Maybe in the 1.2 release the experimental status could be dropped.
  2. The commands are uploading/downloading not just the project(s), but also the vocabulary(/ies). Would it make sense to shorten the commands to just upload and download?
  3. Regarding overwriting the vocabulary on download, I think this is a problematic feature and needs some thought. For example if I already have trained models with vocabularies in my data directory, and I download another model from HF Hub, it might come with another version of the same vocabulary and thus rewrite my existing vocabulary and break existing projects. So I agree that there should be some kind of check for this and perhaps an --overwrite or --force option before blindly overwriting existing vocabularies.

I tested this on my local install by running the command annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing; I wanted to see what happens when I already have a local yso vocabulary. The overwrite failed with an exception:

$ annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing
Downding project(s): yake-fi
yake-fi.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.63M/2.63M [00:00<00:00, 7.24MB/s]
yake-fi.cfg: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 119/119 [00:00<00:00, 470kB/s]
yso.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 54.6M/54.6M [00:04<00:00, 11.0MB/s]
Traceback (most recent call last):
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/bin/annif", line 6, in <module>
    sys.exit(cli())
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/xxx/.cache/pypoetry/virtualenvs/annif-3KgT0m3n-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/xxx/git/Annif/annif/cli.py", line 690, in run_download_projects
    cli_util.unzip(vocab_zip_local_cache_path)
  File "/home/xxx/git/Annif/annif/cli_util.py", line 316, in unzip
    zfile.extractall()  # TODO Disallow overwrite
  File "/usr/lib/python3.10/zipfile.py", line 1647, in extractall
    self._extract_member(zipinfo, path, pwd)
  File "/usr/lib/python3.10/zipfile.py", line 1701, in _extract_member
    open(targetpath, "wb") as target:
PermissionError: [Errno 13] Permission denied: '/home/xxx/git/Annif/data/vocabs/yso/subjects.dump.gz'

The downloaded files now remain in cache for huggingface client, e.g. (/home/local/jmminkin/.cache/huggingface/hub/models--juhoinkinen--Annif-models-upload-testing), should the cache be cleared when successfully unzipped the datadirs and moved the configs?

Isn't the caching standard behaviour of HF Hub operations? I think just keeping the cache would be the least surprising thing to do here. Maybe there could be a --no-cache or --no-keep option to control caching? What do other tools with HF Hub integration (e.g. Ludwig) do?

@osma
Copy link
Member

osma commented Feb 29, 2024

About overwriting vocabularies - I think it would be great if Annif would notice, that the downloaded vocabulary differs from what exists locally. It should be no problem to download two different models that use the same vocabulary:

annif download-projects "yake-fi" juhoinkinen/Annif-models-upload-testing
annif download-projects "yso-fasttext-fi" juhoinkinen/Annif-models-upload-testing
annif download-projects "yso-tfidf-fi" juhoinkinen/Annif-models-upload-testing

But when you have a different version of the vocabulary already present, Annif could show a message such as

Your local version of the "yso" vocabulary differs from the version you are downloading. If you are sure you want to overwrite it, run the command again with the `--overwrite` option. This may break projects that use your current vocabulary.

Even better would be to list the projects that may break.

But how to detect that the local vocabulary is different than the downloaded one? Calculate checksums on the files and compare them?

@juhoinkinen
Copy link
Member Author

juhoinkinen commented Mar 1, 2024

But how to detect that the local vocabulary is different than the downloaded one? Calculate checksums on the files and compare them?

Actually the ZipFile objects allow getting ZipInfo of each member of the archive, which in turn contain a CRC-32 checksum (and the name and modification timestamp) of the uncomprossed file. And ZipInfo.from_file() class method allows to get the CRC from the existing, uncompressed files, so the checksum comparisons should be quite easy to implement.

Edit: But CRC of the objects constructed by ZipInfo.from_file() throw AttributeError: 'ZipInfo' object has no attribute 'CRC'. In the compressed archive CRC attribute exists. 😕

This bug?

@osma
Copy link
Member

osma commented Mar 1, 2024

If it's the above bug in zipfile, it's claimed to be

Fixed in 3.9.0

Annif just dropped 3.8 support so maybe we're good? Just need to rebase this branch.

@juhoinkinen
Copy link
Member Author

juhoinkinen commented Mar 1, 2024

The Python versions 3.9.18, 3.10.13 and 3.11.8 all behaved the same.

This seems strange: the code

zinfo = zipfile.ZipInfo.from_file("data/projects/yake-fi/yake-index")
print(dir(zinfo))
print(zinfo.file_size)
print(zinfo.CRC)

outputs

['CRC', 'FileHeader', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_compresslevel', '_decodeExtra', '_encodeFilenameFlags', '_end_offset', '_raw_time', 'comment', 'compress_size', 'compress_type', 'create_system', 'create_version', 'date_time', 'external_attr', 'extra', 'extract_version', 'file_size', 'filename', 'flag_bits', 'from_file', 'header_offset', 'internal_attr', 'is_dir', 'orig_filename', 'reserved', 'volume']
2627360
Traceback (most recent call last):
  File "/home/local/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/bin/annif", line 6, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myusername/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/myusername/git/Annif/annif/cli.py", line 686, in run_download
    cli_util.unzip(project_zip_local_cache_path, force)
  File "/home/local/myusername/git/Annif/annif/cli_util.py", line 325, in unzip
    _is_existing(member)
  File "/home/local/myusername/git/Annif/annif/cli_util.py", line 351, in _is_existing
    print(zinfo.CRC)
          ^^^^^^^^^
AttributeError: 'ZipInfo' object has no attribute 'CRC'

The CRC attribute is created using ZipInfo.__slots__(), but then it is not initialized when running ZipInfo.from_file()?

Also trying to access zinfo.FileHeader() fails similarly, with Traceback ending

...
File "/usr/lib/python3.11/zipfile.py", line 453, in FileHeader
    CRC = self.CRC
          ^^^^^^^^
AttributeError: 'ZipInfo' object has no attribute 'CRC'

Anyway, the CRC-32 checksums for the existing, uncompressed files could be calculated by binascii.crc32().

annif/cli_util.py Fixed Show fixed Hide fixed
@juhoinkinen
Copy link
Member Author

I added --force option to overwrite existing local files (project and vocabulary data files and configs), and modified the unzip/config-copy logic so that when the contents are identical (downloaded vs local), there the unzip/config-copy is skipped.

I did not see a reason to restrict the content comparison to only vocabularies(?) (and it was easier to implement both projects and vocabs 🙂).

Demo with

#!/bin/bash

rm -r data/projects/{fasttext-fi,tfidf-fi,yake-fi}

echo "# Initial download"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
echo "# Second download with identical content, no-op"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing

echo $RANDOM > data/projects/tfidf-fi/file.txt
echo $RANDOM >> projects.d/tfidf-fi.cfg
echo "# Third download over changed content, skips file overwrites with complains"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing

echo "# Forced third download over changed content"
annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing --force
tree data

The output:

# Initial download
Downloading project(s): tfidf-fi
# Second download with identical content, no-op
Downloading project(s): tfidf-fi
# Third download over changed content, skips file overwrites with complains
Downloading project(s): tfidf-fi
Not overwriting data/projects/tfidf-fi/file.txt (use --force to override)
Not overwriting projects.d/tfidf-fi.cfg (use --force to override)
# Forced third download over changed content
Downloading project(s): tfidf-fi
data
├── projects
│   └── tfidf-fi
│       └── file.txt
└── vocabs
    └── yso
        ├── subjects.csv
        ├── subjects.dump.gz
        └── subjects.ttl

4 directories, 4 files

@juhoinkinen juhoinkinen changed the title Implement annif upload-projects and download-projects commands for Hugging Face Hub integration Implement annif upload and annif download commands for Hugging Face Hub integration Mar 6, 2024
@juhoinkinen
Copy link
Member Author

I commented on huggingface-hub repository about git tags, and they promptly created an issue for creating, listing and deleting tags with the huggingface-cli (also wrote a question on HF forum).

If/when it gets implemented, I think it would be most convenient for the case of Annif models if the tagging would be performed after annif upload <projects-pattern> <repo-id> using huggingface-cli, e.g. like huggingface-cli tag <repo-id> <tag-name>.

@juhoinkinen juhoinkinen added this to the 1.1 milestone Mar 25, 2024
@juhoinkinen juhoinkinen changed the base branch from main to release-1.0.2 March 25, 2024 11:31
@juhoinkinen juhoinkinen changed the base branch from release-1.0.2 to main March 25, 2024 11:32
@juhoinkinen
Copy link
Member Author

juhoinkinen commented Mar 27, 2024

Actually multiple files could be included in one commit using CommitOperationAdd() and create_commit(), but it seems that for this all the included files should exist until running create_commit(), and the zips can be quite large.

The preupload_lfs_files() function helped to put all files in one commit but to upload them separately.

When testing with yso-*fi projects of FintoAI (10.6 GBs uncompressed, 8.8 GBs compressed) /usr/bin/time -v reported Maximum resident set size (kbytes): 49616 (and I did not notice much increase in disk space usage); upload took about 20 min at the office.

@juhoinkinen
Copy link
Member Author

While uploading projects to FintoAI-data-YSO with annif upload "*" NatLibFi/FintoAI-data-YSO , this error occurred on two tries:

...
  File "/home/local/jmminkin/git/Annif/annif/cli_util.py", line 259, in prepare_datadir_commit
    fobj = _archive_dir(data_dir)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/jmminkin/git/Annif/annif/cli_util.py", line 285, in _archive_dir
    with zipfile.ZipFile(fp, mode="w") as zfile:
  File "/usr/lib/python3.11/zipfile.py", line 1355, in __exit__
    self.close()
  File "/usr/lib/python3.11/zipfile.py", line 1911, in close
    self.fp.seek(self.start_dir)
OSError: [Errno 28] No space left on device

However I had about 60 GBs of free disk space. And preuploading should free memory of the uploaded objects, so it should not be memory issue either... Anyway, this could be circumvented by uploading only projects of one language by one command.

Copy link
Member

@osma osma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good.

The only thing that caught my eye is that this PR adds a lot of new code especially to cli_util.py, but also cli.py and test_cli.py. Would it make sense to put the HF-specific code in separate files instead, at least in the case of cli_util.py?

tests/test_cli.py Fixed Show fixed Hide fixed
tests/test_hfh_util.py Dismissed Show dismissed Hide dismissed
tests/test_hfh_util.py Dismissed Show dismissed Hide dismissed
Copy link

sonarcloud bot commented Apr 23, 2024

Quality Gate Passed Quality Gate passed

Issues
0 New issues
2 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@juhoinkinen juhoinkinen merged commit d9f3793 into main Apr 23, 2024
15 of 16 checks passed
@juhoinkinen juhoinkinen deleted the issue760-hugging-face-hub-integration branch April 23, 2024 07:17
@osma
Copy link
Member

osma commented Apr 23, 2024

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hugging Face integration
2 participants