Implement annif upload and annif download commands for Hugging Face Hub integration #762

Merged
merged 39 commits into from
Apr 23, 2024
Changes from 4 commits
f6d2b7d
Initial functionality for HF Hub upload
juhoinkinen Feb 1, 2024
ab5e4bf
Use tempfile module and file-like objects for uploads
juhoinkinen Feb 5, 2024
d3dd888
Separate files for each project, vocab and config
juhoinkinen Feb 6, 2024
9d030c6
Catch also HFValidationError in HFH uploads
juhoinkinen Feb 6, 2024
3135114
Initial functionality for HF Hub download
juhoinkinen Feb 7, 2024
038d86d
Upgrade to huggingface-hub 0.21.*
juhoinkinen Feb 29, 2024
5afb251
Drop -projects part from upload/download CLI commands
juhoinkinen Feb 29, 2024
13191fc
Speed up CLI startup by moving imports in functions
juhoinkinen Feb 29, 2024
7666de8
Add --force option to allow overwrite local contents on download
juhoinkinen Mar 1, 2024
301d787
Resolve CodeQL complaint about imports
juhoinkinen Mar 1, 2024
d5b4abe
Restore datafile timestamps after unzipping
juhoinkinen Mar 4, 2024
a1e7605
Add comment to zip file with used Annif version
juhoinkinen Mar 4, 2024
25a46dc
Catch HFH Errors in listing files in repo
juhoinkinen Mar 4, 2024
86714d8
Unzip archive contents to used DATADIR
juhoinkinen Mar 6, 2024
6ba1e08
Add tests
juhoinkinen Mar 7, 2024
4d06be6
Create /.cache/huggingface/ with full access rights in Dockerimage
juhoinkinen Mar 7, 2024
a4f0f6f
Merge branch 'update-dependencies-v1.1' into issue760-hugging-face-hu…
juhoinkinen Mar 8, 2024
7575fff
Fix and improve tests and increase coverage
juhoinkinen Mar 8, 2024
16bacfb
Remove todos
juhoinkinen Mar 8, 2024
2952f64
Create /Annif/projects.d/ for tests in Dockerfile
juhoinkinen Mar 8, 2024
ed3cf2c
Refactor to address quality complains; improve names
juhoinkinen Mar 8, 2024
5b16952
Add docstrings
juhoinkinen Mar 12, 2024
c87675c
Add type hints
juhoinkinen Mar 12, 2024
2fe5b73
Update RTD CLI commands page
juhoinkinen Mar 12, 2024
d7be137
Remove --revision option of download command
juhoinkinen Mar 13, 2024
47f7ee4
Upgrade to huggingface-hub 0.22.*
juhoinkinen Mar 25, 2024
a488d07
Revert "Remove --revision option of download command"
juhoinkinen Mar 26, 2024
0c57bf2
Preupload lfs files
juhoinkinen Mar 26, 2024
df105a3
Fix HF Hub caching in Dockerfile
juhoinkinen Mar 27, 2024
d14ff30
Refactor to address quality complains
juhoinkinen Apr 12, 2024
cc0c989
Again: Refactor & simplify to address quality complains
juhoinkinen Apr 12, 2024
9443c8f
Fix typo in mocked filenames in repo
juhoinkinen Apr 19, 2024
156bbf5
Detect projects present in repo by .cfg files, not .zip files
juhoinkinen Apr 19, 2024
3f60456
Add --revision option to upload command
juhoinkinen Apr 19, 2024
2dd359d
Enable completion of project_id argument in upload command
juhoinkinen Apr 19, 2024
63076cd
Adapt test for adding revision option to upload command
juhoinkinen Apr 19, 2024
a0a3850
Move functions for HuggingFaceHub interactions to own file
juhoinkinen Apr 23, 2024
638aa07
Move unit tests for HuggingFaceHub util fns to own file
juhoinkinen Apr 23, 2024
6f35fff
Make io import conditional to TYPE_CHECKING
juhoinkinen Apr 23, 2024
53 changes: 53 additions & 0 deletions annif/cli.py
@@ -8,6 +8,7 @@
import os.path
import re
import sys
from fnmatch import fnmatch

import click
import click_log
@@ -583,6 +584,58 @@
click.echo("---")


@cli.command("upload-projects")
@click.argument("project_ids_pattern")
@click.argument("repo_id")
@click.option(
    "--token",
    help="""Authentication token, obtained from the Hugging Face Hub.
    Will default to the stored token.""",
)
@click.option(
    "--commit-message",
    help="""The summary / title / first line of the generated commit.""",
)
@cli_util.common_options
def run_upload_projects(project_ids_pattern, repo_id, token, commit_message):
    """
    Upload selected projects to a Hugging Face Hub repository
    \f
    This command zips the project directories and vocabularies of the projects
    that match the given `project_ids_pattern`, and uploads the archives along
    with the projects' configurations to the specified Hugging Face Hub repository.
    An authentication token and commit message can be given with options.
    """
    projects = [
        proj
        for proj in annif.registry.get_projects(min_access=Access.private).values()
        if fnmatch(proj.project_id, project_ids_pattern)
    ]
    click.echo(f"Uploading project(s): {', '.join([p.project_id for p in projects])}")

    commit_message = (
        commit_message
        if commit_message is not None
        else f"Upload project(s) {project_ids_pattern} with Annif"
    )

    project_dirs = {p.datadir for p in projects}
    vocab_dirs = {p.vocab.datadir for p in projects}
    data_dirs = project_dirs.union(vocab_dirs)

    for data_dir in data_dirs:
        zip_path = data_dir.split(os.path.sep, 1)[1] + ".zip"  # TODO Check this
        fobj = cli_util.archive_dir(data_dir)
        cli_util.upload_to_hf_hub(fobj, zip_path, repo_id, token, commit_message)
        fobj.close()

    for project in projects:
        config_path = project.project_id + ".cfg"
        fobj = cli_util.write_config(project)
        cli_util.upload_to_hf_hub(fobj, config_path, repo_id, token, commit_message)
        fobj.close()
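
For readers unfamiliar with `fnmatch`, the following is a minimal sketch of how the `project_ids_pattern` argument selects projects in `run_upload_projects`. The project IDs and the `select_projects` helper are made up for illustration; the command itself filters the values returned by `annif.registry.get_projects(...)`.

```python
from fnmatch import fnmatch


def select_projects(project_ids, pattern):
    """Return the IDs matching a shell-style pattern (hypothetical helper)."""
    return [pid for pid in project_ids if fnmatch(pid, pattern)]


# Made-up project IDs, not taken from any real Annif installation.
project_ids = ["yso-fi", "yso-en", "dummy-fi"]

print(select_projects(project_ids, "yso-*"))   # IDs starting with "yso-"
print(select_projects(project_ids, "*-fi"))    # IDs ending with "-fi"
print(select_projects(project_ids, "yso-en"))  # a plain ID matches only itself
print(select_projects(project_ids, "nomatch")) # no match gives an empty list
```

Judging from the code in this hunk, a pattern that matches nothing would leave `projects` empty, so both upload loops would simply do nothing. Note also that `fnmatch` normalizes case via `os.path.normcase`, so matching is case-insensitive on Windows; `fnmatch.fnmatchcase` would avoid that if it matters.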


@cli.command("completion")
@click.option("--bash", "shell", flag_value="bash")
@click.option("--zsh", "shell", flag_value="zsh")
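
Regarding the `# TODO Check this` next to the `zip_path` expression in `run_upload_projects`: the sketch below shows what dropping the first path component does for a relative and an absolute data directory. The directory names are invented, and `zip_path_by_relpath` is only a hypothetical alternative that assumes the base data directory is known; it is not part of this PR.

```python
import os


def zip_path_by_split(data_dir):
    """Mirror the expression in the diff: drop the first path component."""
    return data_dir.split(os.path.sep, 1)[1] + ".zip"


def zip_path_by_relpath(data_dir, base_dir):
    """Hypothetical alternative: path relative to a known base directory."""
    return os.path.relpath(data_dir, base_dir) + ".zip"


# Invented directory names for illustration.
relative = os.path.join("data", "projects", "yso-fi")
base_dir = os.path.join(os.path.sep, "srv", "annif", "data")
absolute = os.path.join(base_dir, "projects", "yso-fi")

# Relative data dir: the leading "data" component is removed.
print(zip_path_by_split(relative))

# Absolute data dir: the "first component" is the empty string before the
# leading separator, so nothing useful is stripped.
print(zip_path_by_split(absolute))

# Relative to the base directory, the result matches the relative case.
print(zip_path_by_relpath(absolute, base_dir))
```

A data directory without any separator would make the split-based expression raise `IndexError`, which may be another case worth checking.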
53 changes: 52 additions & 1 deletion annif/cli_util.py
@@ -2,17 +2,24 @@
from __future__ import annotations

import collections
import configparser
import io
import itertools
import os
import pathlib
import sys
import tempfile
import zipfile
from typing import TYPE_CHECKING

import click
import click_log
from flask import current_app
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError, HFValidationError

import annif
from annif.exception import ConfigurationException
from annif.exception import ConfigurationException, OperationFailedException
from annif.project import Access

if TYPE_CHECKING:
@@ -230,6 +237,50 @@
return list(itertools.product(limits, thresholds))


def _is_train_file(fname):
    train_file_patterns = ("-train", "tmp-")
    for pat in train_file_patterns:
        if pat in fname:
            return True
    return False


def archive_dir(data_dir):
    fp = tempfile.TemporaryFile()
    path = pathlib.Path(data_dir)
    fpaths = [fpath for fpath in path.glob("**/*") if not _is_train_file(fpath.name)]
    with zipfile.ZipFile(fp, mode="w") as zfile:
        for fpath in fpaths:
            logger.debug(f"Adding {fpath}")
            zfile.write(fpath)
    fp.seek(0)
    return fp
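
As a self-contained illustration of what `archive_dir` does, here is a sketch that zips a throwaway directory into a temporary file and lists the archive members. The file names are made up, and one detail differs from the diff and is an assumption of this sketch: member names are stored relative to the parent of `data_dir` via `arcname`, whereas the diff passes `fpath` to `zfile.write()` unchanged.

```python
import pathlib
import tempfile
import zipfile


def is_train_file(fname):
    """Same substring check as `_is_train_file` in the diff."""
    return any(pat in fname for pat in ("-train", "tmp-"))


def archive_dir(data_dir):
    """Zip data_dir into an anonymous temporary file, skipping training files."""
    fp = tempfile.TemporaryFile()
    root = pathlib.Path(data_dir)
    with zipfile.ZipFile(fp, mode="w") as zfile:
        for fpath in sorted(root.glob("**/*")):
            if not is_train_file(fpath.name):
                # Assumption of this sketch: names relative to data_dir's parent.
                zfile.write(fpath, arcname=str(fpath.relative_to(root.parent)))
    fp.seek(0)  # rewind so the caller can read the archive from the start
    return fp


with tempfile.TemporaryDirectory() as tmp:
    # Made-up project layout: one model file and two files that should be skipped.
    project_dir = pathlib.Path(tmp) / "projects" / "demo"
    project_dir.mkdir(parents=True)
    (project_dir / "model.bin").write_bytes(b"model")
    (project_dir / "vectorizer-train.txt").write_text("training data")
    (project_dir / "tmp-scratch").write_text("scratch")

    with archive_dir(project_dir) as fobj, zipfile.ZipFile(fobj) as zfile:
        names = zfile.namelist()

print(names)
```

Only `model.bin` ends up in the archive here, because the other two file names contain `-train` or `tmp-`.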


def write_config(project):
    fp = tempfile.TemporaryFile(mode="w+t")
    config = configparser.ConfigParser()
    config[project.project_id] = project.config
    config.write(fp)  # This needs tempfile in text mode
    fp.seek(0)
    # But for upload fobj needs to be in binary mode
    return io.BytesIO(fp.read().encode("utf8"))
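
The text-then-binary conversion in `write_config` can also be done entirely in memory. The sketch below assumes `io.StringIO` is an acceptable stand-in for the text-mode temporary file; `config_to_bytes` and the sample settings are hypothetical and only serve to show the round trip.

```python
import configparser
import io


def config_to_bytes(project_id, settings):
    """Serialize one config section to an in-memory binary file object."""
    config = configparser.ConfigParser()
    config[project_id] = settings
    text_buf = io.StringIO()
    config.write(text_buf)  # ConfigParser.write() expects a text-mode file
    # The upload needs bytes, so encode the text and wrap it in BytesIO.
    return io.BytesIO(text_buf.getvalue().encode("utf8"))


# Made-up project settings for illustration only.
settings = {"name": "Demo", "language": "fi", "backend": "tfidf"}
fobj = config_to_bytes("demo-fi", settings)

# Round trip: parse the bytes again to check that the section survived.
parsed = configparser.ConfigParser()
parsed.read_string(fobj.read().decode("utf8"))
print(dict(parsed["demo-fi"]))
```

Whether this is preferable to the temporary file is a judgment call; it mainly avoids leaving a second file object open.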


def upload_to_hf_hub(fileobj, filename, repo_id, token, commit_message):
    api = HfApi()
    try:
        api.upload_file(
            path_or_fileobj=fileobj,
            path_in_repo=filename,
            repo_id=repo_id,
            token=token,
            commit_message=commit_message,
        )
    except (HfHubHTTPError, HFValidationError) as err:
        raise OperationFailedException(str(err))
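
The error handling in `upload_to_hf_hub` converts hub errors into Annif's own exception. The sketch below shows that pattern in isolation, using stand-in classes so it runs without `huggingface_hub`: `FakeHubError`, `failing_upload` and the repository name are invented, and the `from err` chaining is a possible refinement rather than what the diff does.

```python
class OperationFailedException(Exception):
    """Stand-in for annif.exception.OperationFailedException."""


class FakeHubError(Exception):
    """Invented stand-in for hub errors; not a real huggingface_hub class."""


def upload(upload_fn, **kwargs):
    """Call upload_fn and convert hub errors into an application-level error."""
    try:
        return upload_fn(**kwargs)
    except FakeHubError as err:
        # "from err" keeps the original exception as __cause__ for debugging.
        raise OperationFailedException(str(err)) from err


def failing_upload(**kwargs):
    """Simulate a rejected upload."""
    raise FakeHubError(f"401 Unauthorized for {kwargs['repo_id']}")


try:
    upload(failing_upload, repo_id="someuser/somerepo")
except OperationFailedException as exc:
    message = str(exc)
    cause = exc.__cause__

print(message)
print(type(cause).__name__)
```

With chaining, the user-facing message stays the same while the traceback still points at the underlying hub error.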


def _get_completion_choices(
param: Argument,
) -> dict[str, AnnifVocabulary] | dict[str, AnnifProject] | list:
7 changes: 7 additions & 0 deletions docs/source/commands.rst
@@ -66,6 +66,13 @@ Project administration

N/A

.. click:: annif.cli:run_upload_projects
   :prog: annif upload-projects

**REST equivalent**

   N/A

****************************
Subject index administration
****************************
1 change: 1 addition & 0 deletions pyproject.toml
@@ -48,6 +48,7 @@ python-dateutil = "2.8.*"
tomli = { version = "2.0.*", python = "<3.11" }
simplemma = "0.9.*"
jsonschema = "4.17.*"
huggingface-hub = "0.20.*"

fasttext-wheel = {version = "0.9.2", optional = true}
voikko = {version = "0.5.*", optional = true}