Consider best strategies for handling versioning of "dumb" remotes #515

Open
MiguelRodo opened this issue Mar 21, 2024 · 1 comment

MiguelRodo commented Mar 21, 2024

So, let's say that we have a local remote (just a folder), and we want to determine whether we should copy anything there.

So, we have in our local manifest a record of what was done in the latest version (which we know is correct).

Then, suppose that, for that remote, what's currently there is just what we previously copied across.

Okay, so, what IS the remote? Is it what's at the path? Yes, right? That's basically the ID. Or is the path the sub-part of the remote? Or does it include the label? Maybe let's look at what the old _proj.yml had, before removing the local archiving.

Okay, this is what that looked like:

build:
  local:
    "archive":
      path: _archive
      content: [data-raw, output, docs]
      structure: version

Okay, yes, so the path was _archive. Then, there was a label, and then there was the version.
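
A minimal sketch of how those three pieces could combine into a remote directory, assuming the path / label / version order described above. `remote_path_get` is a hypothetical helper, not an existing projr function, and the "v" prefix follows the `_archive/data-raw/v<latest>` form used later in this issue.

```r
# Hypothetical helper (not part of projr): build the remote directory
# from the configured path, the content label and, optionally, a version.
remote_path_get <- function(path, label, version = NULL) {
  if (is.null(version)) {
    # "latest" structure: no version sub-directory
    return(file.path(path, label))
  }
  # "version" structure: append v<version>
  file.path(path, label, paste0("v", version))
}

remote_path_get("_archive", "data-raw")
remote_path_get("_archive", "output", "0.0.1")
```

The unversioned form is what the manifest would record as the path, if the path is stored without versioning.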

So, where should the manifest be kept? Well, clearly, it should be kept at the top directory. Then, we're basically assuming that each directory only has the data-raw thing kept once (maybe an issue for OSF, where we might put archive in there as well? Hmmm, I don't know...). I guess we should specify what the path is to it as well.
So, for each, we should keep the manifest at the top level, but then specify within it:

  • What the path is (without versioning, but including the label and any "sub-path")
  • Whether it's versioned or not.
  • Then, we say what's in there.
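
To make that concrete, here is a rough sketch of what such a top-level remote manifest might look like as a table. Only `fn` is a column name taken from this issue; `path` and `versioned` are assumed names for the first two bullet points, and the file names are made up.

```r
# Hypothetical remote manifest: one row per file, stacked across
# content/path combinations. `path` and `versioned` are assumed column
# names; `fn` is the file-name column referred to in this issue.
manifest_remote <- data.frame(
  path = c("_archive/data-raw", "_archive/data-raw", "_archive/output"),
  versioned = c(TRUE, TRUE, TRUE),
  fn = c("a.csv", "b.csv", "fig.png"),
  stringsAsFactors = FALSE
)

# Files recorded for one path
fn_vec_data_raw <- manifest_remote[
  manifest_remote[["path"]] == "_archive/data-raw", "fn"
]
```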

Okay, so if there is no manifest.csv file there, or it has no entry for that path, then:

  • For "latest" remotes, we just compare.
  • For "versioned" remotes, we take it that the manifest corresponds to the latest version.
  • So, in either case it doesn't really matter.
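
One possible reading of that fallback rule, as a sketch. `remote_fn_baseline` and all of its arguments are hypothetical; it assumes the manifest in question for versioned remotes is the local one, and that "just compare" means comparing against a direct file listing of the remote.

```r
# Hypothetical helper: choose which file names to compare against.
# - manifest_remote: data frame with `path` and `fn` columns, or NULL
# - fn_vec_listed: files found by listing the remote directly
# - fn_vec_local_latest: files recorded locally for the latest version
remote_fn_baseline <- function(manifest_remote,
                               path,
                               versioned,
                               fn_vec_listed,
                               fn_vec_local_latest) {
  has_entry <- !is.null(manifest_remote) &&
    any(manifest_remote[["path"]] == path)
  if (has_entry) {
    # the remote manifest knows this path: use what it records
    return(manifest_remote[manifest_remote[["path"]] == path, "fn"])
  }
  # no usable remote manifest: fall back as described above
  if (versioned) fn_vec_local_latest else fn_vec_listed
}
```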

I think, at this stage, we don't keep a record of what was previously at that remote. It's just not relevant. We only keep the active folders, so for latest that's just whatever's at the path. For versioned, it's whatever's been uploaded. So, it's not a record but a live value.

Then, this impacts how we compare what's there. Once we have that, the upload proceeds as normal.

So, what needs to happen?

  • First, when trying to assess what's changed (if we're using the manifest), we need to:
    • Try to download the manifest
      • If it's not there, then we can either:
        • Use file-based versioning, OR
        • assume nothing's there, and upload everything, OR
        • compare to the last version in the manifest.
    • If we have the manifest:
      • Then we just compare the fn column, that's all.
    • Then we do all the transfers.
    • Then, we upload the revised manifest.
      • Well, we don't actually overwrite entirely.
      • Just, for whatever the path was that we uploaded to, we overwrite.
        • So, for example, in the above we add at the remote the following manifest:
          • For the data-raw content:
            • path: _archive/data-raw/v<latest> (if adding)
            • fn: <whatever_was_local>
              • Well, this may depend:
                • If we used upload-all, then we overwrite whatever was local but merge with whatever wasn't
                • If we used upload-missing, then we just add whatever was missing.
                • If we used synchronisation-using-version and synchronisation-using-deletion, then we overwrite
          • For the output and docs content:
            • Same as above, just the path changes.
          • So, we basically just stack, one for each content and path combination (given that we use the projr-specified remote structure).
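
The manifest-update step could be sketched roughly as below. `manifest_remote_update` is a hypothetical function, the strategy names are shortened labels for the cases listed above, and the manifest is assumed to carry only `path` and `fn` columns. With file names alone, upload-all and upload-missing end up recording the same set (they would only differ if something like a hash were also recorded).

```r
# Hypothetical sketch: rewrite only the rows for the path we uploaded to,
# and keep ("stack") the rows for every other path untouched.
manifest_remote_update <- function(manifest_remote, path, fn_vec_local,
                                   strategy) {
  fn_vec_old <- if (is.null(manifest_remote)) {
    character()
  } else {
    manifest_remote[manifest_remote[["path"]] == path, "fn"]
  }
  fn_vec_new <- switch(strategy,
    "upload-all" = ,
    "upload-missing" = union(fn_vec_old, fn_vec_local), # merge with old
    "sync" = fn_vec_local, # overwrite with what is local
    stop("unknown strategy: ", strategy)
  )
  rows_new <- data.frame(
    path = rep(path, length(fn_vec_new)),
    fn = fn_vec_new,
    stringsAsFactors = FALSE
  )
  if (is.null(manifest_remote)) {
    return(rows_new)
  }
  rows_keep <- manifest_remote[
    manifest_remote[["path"]] != path, c("path", "fn"),
    drop = FALSE
  ]
  out <- rbind(rows_keep, rows_new)
  rownames(out) <- NULL
  out
}
```

Calling it once per content and path combination, each time on the result of the previous call, gives the stacked manifest to upload at the end.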

MiguelRodo pinned this issue Mar 22, 2024

MiguelRodo commented Mar 22, 2024

So, the plan (exactly what to send) is only determined for each label within a title - thank goodness! That means we already do the planning as often as we need. We've also got all the info we need to get the remote (and not just remote_final). This is very good news.

So, this is what happens in projr_dest_send_label:

  • Prep:
    • Get the local path to send from
    • Get the remote send instructions
    • Get the final remote
  • Get plan: get general approach to sending things
  • Get plan details: get which files to send and delete
  • Make transfers: determine what to send when

So, I need to add:

  • Prep:
    • Get the "initial" remote (can make it NULL to save time, if we know we don't need it)
  • Get plan details:
    • Try to get the manifest off the remote, and compare to that instead
      • That's where the rule above comes in.

Now, within .projr_dest_send_get_plan_detail, we have, for example, this function:

.projr_dest_send_get_plan_detail_add_missing <- function(path_dir_local,
                                                         remote,
                                                         type) {
  path_dir_local_remote <- .dir_create_tmp_random()
  fn_vec_remote <- .projr_remote_file_ls(type, remote)
  fn_vec_local <- .file_ls(path_dir_local)
  fn_vec_add <- setdiff(fn_vec_local, fn_vec_remote)
  .dir_rm(path_dir_local_remote)
  list("add" = fn_vec_add, "rm" = character())
}
  • Actually, we don't use version_source there - projr always just checks what's on the remote.

Let's rather look where we do use version-source:

.projr_dest_send_get_plan_detail_change <- function(remote,
                                                    type,
                                                    label,
                                                    version_source,
                                                    path_dir_local) {
  change_list <- .projr_change_get(
    label = label,
    path_dir_local = path_dir_local,
    version_source = version_source,
    type = type,
    remote = remote
  )
  list(
    "add" = c(
      change_list[["kept_changed"]][["fn"]] %@@% character(),
      change_list[["added"]][["fn"]] %@@% character()
    ) |>
      as.character(),
    "rm" = change_list[["removed"]][["fn"]] %@@% character() |> as.character()
  )
}
  • Clearly, again we need to add remote_base
  • Then, within .projr_change_get:
    • Also add the argument remote_base
    • Then, we'd look at .projr_change_get_manifest:

At the moment, we're only passing label to it, because it's just comparing the latest two versions.

Here are the contents:

.projr_change_get_manifest <- function(version_post = NULL,
                                       version_pre = NULL,
                                       label = NULL) {
  # this differs from .projr_change_get_hash
  # as it will filter on version and does
  # not assume there is only one label
  # get manifests from previous version and current version
  manifest <- .projr_manifest_read(.dir_proj_get("manifest.csv"))

  if (nrow(manifest) == 0L) {
    return(.projr_zero_list_manifest_get())
  }

  # get version to compare
  version_vec <- .projr_change_get_manifest_version_to_compare(
    version_post = version_post,
    version_pre = version_pre,
    manifest = manifest
  )

  # choose current label only,
  # done after comparing to ensure we get the right comparison
  if (!is.null(label)) {
    manifest <- manifest[manifest[["label"]] == label, ]
  }

  # use zero table if version_pre not found
  manifest_pre <- manifest[manifest[["version"]] == version_vec[["pre"]], ] %@@%
    .projr_zero_tbl_get_manifest()

  manifest_post <- manifest[manifest[["version"]] == version_vec[["post"]], ]

  # compare
  # -----------------

  # can't assume there's only one label
  .projr_change_get_hash(hash_pre = manifest_pre, hash_post = manifest_post)
}

From the version_vec, we'd only need the latest version locally.
Then, we'd need to get that manifest off the remote (maybe we should download that earlier?) and just get the latest version on there.

And then we compare, as before.

Simple!
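
A rough sketch of that comparison, under stated assumptions: both manifests are data frames with `label`, `version`, `fn` and `hash` columns (`hash` is assumed here, it is not shown above), and versions are plain strings such as "0.0.1" that `numeric_version` can parse. `change_get_remote` is a hypothetical stand-in, not the actual `.projr_change_get_manifest`.

```r
# Hypothetical sketch: compare the latest local version of a label with
# the latest version recorded in the manifest downloaded from the remote.
change_get_remote <- function(manifest_local, manifest_remote, label) {
  latest <- function(manifest) {
    manifest <- manifest[manifest[["label"]] == label, , drop = FALSE]
    if (nrow(manifest) == 0L) {
      return(manifest)
    }
    version_vec <- numeric_version(manifest[["version"]])
    manifest[version_vec == max(version_vec), , drop = FALSE]
  }
  pre <- latest(manifest_remote) # what the remote says it has
  post <- latest(manifest_local) # what we have locally
  hash_pre <- setNames(pre[["hash"]], pre[["fn"]])
  hash_post <- setNames(post[["hash"]], post[["fn"]])
  fn_both <- intersect(pre[["fn"]], post[["fn"]])
  # kept files whose hash differs need to be re-sent
  fn_changed <- fn_both[hash_pre[fn_both] != hash_post[fn_both]]
  list(
    "add" = c(fn_changed, setdiff(post[["fn"]], pre[["fn"]])),
    "rm" = setdiff(pre[["fn"]], post[["fn"]])
  )
}
```

The returned list has the same `add` / `rm` shape as the plan-detail functions above, so the transfer step would not need to change.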
