Skip to content

Add version information to data locations#105

Merged
nikhilwoodruff merged 9 commits intomainfrom
nikhilwoodruff/issue104
May 26, 2025
Merged

Add version information to data locations#105
nikhilwoodruff merged 9 commits intomainfrom
nikhilwoodruff/issue104

Conversation

@nikhilwoodruff
Copy link
Contributor

Fixes #104

@nikhilwoodruff nikhilwoodruff self-assigned this May 23, 2025
@nikhilwoodruff nikhilwoodruff added the enhancement New feature or request label May 23, 2025
@nikhilwoodruff nikhilwoodruff requested a review from mikesmit May 23, 2025 14:04
Copy link
Contributor Author

@nikhilwoodruff nikhilwoodruff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Get version number as in this PR
  2. Tag HF uploads with that version number
  3. Set gcloud metadata with that version number

@nikhilwoodruff nikhilwoodruff requested a review from anth-volk May 26, 2025 10:58
@nikhilwoodruff
Copy link
Contributor Author

@mikesmit and @anth-volk, this is ready for re-review when you have a second. I've changed the data upload to:

  1. Operate from a single function
  2. For Hugging Face files, we upload the files together in one commit, and then tag that commit with the package version (we can then download the specific version with the HuggingFace API revision= kwarg)
  3. For GCP files, we upload and set the metadata of each file to include the version. We can then list metadata of each blob's versions without downloading it, and download the blob with the specific version.

Copy link
Contributor Author

@nikhilwoodruff nikhilwoodruff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note to double check that this script is run after versioning updates.

Confirmed that it is run after: the dataset build and upload action is triggered by changes to pyproject.toml.

@nikhilwoodruff
Copy link
Contributor Author

nikhilwoodruff commented May 26, 2025

Also separate the HF and GCP logic into two separate functions

@nikhilwoodruff
Copy link
Contributor Author

@anth-volk this PR now has the changes from our in-person review just now.

Copy link

@anth-volk anth-volk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was great to go through this @nikhilwoodruff. Probably 70% of the comments boil down to: you probably want to use GH Actions's needs keyword instead of depending on changes to particular files to avoid strange edge case bugs when sequencing push actions.

Copy link

@anth-volk anth-volk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question before approval, in the comments

Copy link

@anth-volk anth-volk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it

@nikhilwoodruff nikhilwoodruff merged commit b181b10 into main May 26, 2025
3 checks passed
@nikhilwoodruff nikhilwoodruff deleted the nikhilwoodruff/issue104 branch May 26, 2025 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add version information to data locations

2 participants