Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow modifying remote/project urls of imported manifests #615

Open
jobasto opened this issue Dec 21, 2022 · 27 comments
Open

Allow modifying remote/project urls of imported manifests #615

jobasto opened this issue Dec 21, 2022 · 27 comments
Labels
Partial imports Incomplete or changing imports are much more complicated than you think performance How long things take

Comments

@jobasto
Copy link

jobasto commented Dec 21, 2022

Introduction

We use west to checkout all source files necessary for building a firmware and use the import feature as pictured below. This way we don't need to specify the exact version of the modules we want to use because that information is taken from the imported manifests.

project_repository (internal git server)
├─ west.yml
    ⇓ imports (filtered by name-allowlist)
    company_library_repository (internal git server)
    ├─ west.yml
        ⇓ imports
        zephyr_repository (https://github.com/zephyrproject-rtos/zephyr)
        ├─ west.yml
             ⇓ imports
             zephyr_module_1 (https://github.com/zephyrproject-rtos/zephyr_module_1)
             zephyr_module_2 (https://github.com/zephyrproject-rtos/zephyr_module_2)
             zephyr_module_3 (https://github.com/zephyrproject-rtos/zephyr_module_3)

Requirement

To ensure long term reproducibility we want to mirror all externally hosted repositories on our internal git server and use this mirror server for cloning.

Possible solutions

User/system specific git config

One way to achieve this would be to use the git url.<base>.insteadOf config option (see also #28 (comment)). However, this option is user and/or system specific but we want that the actually used servers are deducible by examining the project_repository and its imports.

Specifying git config options in manifest

Another way would be to introduce a new manifest file key that allows setting arbitrary git config options. However, west probably shouldn't set user or system wide config options so it would have to set the config option in all imported repositories.

Adding a rewrite-url key to the manifest

Probably the cleanest way would be to introduce a new key to the manifest file, e.g. rewrite-url. An example could look like

manifest:
  defaults:
    remote: zephyr
    rewrite-url:
      - from: "default from regexp"
        to: "default to regexp "

  remotes:
    - name: zephyr
      url-base: https://github.com/zephyrproject-rtos

  projects:
    - name: zephyr
      revision: xxxxxx
      path: modules/zephyr
      import:
        rewrite-url:
          - from: "from regexp 1"
            to: "to regexp 1"
          - from: "from regexp 2"
             to: "to regexp 2"
    - name: module
      revision: yyyyyyyyyyy
      path: modules/lib/module
      import:
         true # uses default rewrite-url

rewrite-url are a list of rewrite rules that will be applied to a url in order. If manifests are imported multiple levels deep the rewrite rules on the lowest level are applied first.

Open points:

  • Is this full fledged implementation necessary or is there a more simple solution?
  • Should we support full regexp or just match the start of the url as git does?
  • Do we need project specific rewrite-urls? Would it make the implementation if we supported only rewrite-url in the defaults section?

Suggestions?

@marc-hb
Copy link
Collaborator

marc-hb commented Dec 21, 2022

@jobasto
Copy link
Author

jobasto commented Dec 21, 2022

Thanks for pointing that out. I've overlooked this issue. But I'm not quite sure what to make of it.

The solution proposed is to use the url.<base>.insteadOf config option of git, which I had considered above. But if I see it correctly this mandates that every developer has to configure their git installation accordingly. And this changes the servers used to checkout without that fact being reflected in some west.yml.

@marc-hb
Copy link
Collaborator

marc-hb commented Dec 22, 2022

But I'm not quite sure what to make of it.

I believe the salient point is "anything 'dynamic' in imports instantly becomes insanely complicated". Quoting @mbolivar-nordic in #565

I thought about how to treat remotes as first class objects you can inherit from sub-manifests that you import. It gets really messy and I couldn't come up with a clean set of rules for it. So the decision I made is that when you import a project, its fetch URL is completely resolved at import time.

Quoting him again in #608

in the presence of this feature, we can't count on a complete workspace, so all commands which require a manifest to work would need to be revisited and correct semantics decided upon (e.g. if I do west init + west update --skip foo + west list, and 'foo' has some imports in it, the manifest will fail to load and west list won't have all the information it needs to do its job. Should it fail? Do we want something else? Etc.)

To be fair, mirroring does not change any git version so it's not actually "dynamic". I'm afraid mirroring is unfortunate "collateral damage" of the more general complexity of changing imports on the fly.

My 2 cents, I'm merely trying (and probably failing) to follow @mbolivar-nordic's expert thoughts on this.

@marc-hb
Copy link
Collaborator

marc-hb commented Dec 22, 2022

Is this full fledged implementation necessary or is there a more simple solution?

I'm instinctively not a fan of regular expressions and "rewrites" for this. It seems vastly overkill to just point at a different git clone. git submodules somewhat support mirrors and they don't need all that. A much simpler layer of remote indirection looks like it would be enough do to the job. It would also solve #565 and #608.

It could/should be something as simple git's insteadOf but implemented in workspace/.west/config to make it project-specific and not pollute unrelated work.

but we want that the actually used servers are deducible by examining the project_repository and its imports.

I don't understand why. Mirroring should make absolutely no functional difference; it should only affect build performance. If mirroring makes any functional difference then it's a bug.

@marc-hb
Copy link
Collaborator

marc-hb commented Dec 22, 2022

but we want that the actually used servers are deducible by examining the project_repository and its imports.

Wait, this requirement is actually much worse than I thought: what if I want to use my own, personal/internal mirrors for maximum performance and fault tolerance? Or because my network is isolated for security reasons? I do NOT want to touch any manifest files in that case because comparing manifests is the very first step to make sure I'm building exactly the same thing as someone else - including for security reasons but not just[*]

[*] https://www.cisa.gov/sbom

@jobasto
Copy link
Author

jobasto commented Dec 22, 2022

but we want that the actually used servers are deducible by examining the project_repository and its imports.

I don't understand why. Mirroring should make absolutely no functional difference; only performance and build reliability. If mirror makes a difference then it's a bug.

It's meant to ensure that we have everything necessary to build a project in-house so that in a couple of years from now no matter what happens we are able to reproduce the binaries. So I want to make it obvious that our internal mirror is used and I want to make it the default by putting the rewrite rule (or whatever method) on the company_library_repository level (because every application project is importing this repository).

@marc-hb
Copy link
Collaborator

marc-hb commented Dec 22, 2022

It's meant to ensure that we have everything necessary to build a project in-house so that in a couple of years from now no matter what happens we are able to reproduce the binaries.

The biggest innovation brought by decentralized version control systems like git and mercurial is that the server does not matter. All clones are equivalent, they all have the same content (except of course for brand new work in progress not synchronized yet).

Hosting and content are now unrelated to each other.

Mirroring and backups come "for free".

In fact this "content-addressable distributed storage" idea is so good that it's being extended beyond git to pretty much everything: https://en.wikipedia.org/wiki/InterPlanetary_File_System

To "ensure you have everything necessary" specific server names do not and must not matter.

So I want to make it obvious that our internal mirror is used and...

If that mattered then there would be a flaw in your design.

@jobasto
Copy link
Author

jobasto commented Dec 23, 2022

It's meant to ensure that we have everything necessary to build a project in-house so that in a couple of years from now no matter what happens we are able to reproduce the binaries.

The biggest innovation brought by decentralized version control systems like git and mercurial is that the server does not matter. All clones are equivalent, they all have the same content (except of course for brand new work in progress not synchronized yet).

That's true as long as one can find the repository somewhere. One could argue that this will be always the case for any valuable piece of source code. But I don't want to rely on somebody else to do the backup for me. Maybe we have different standpoints regarding this subject which explains our discussion.

So how do we proceed? Should we wait for @mbolivar-nordic to have a look at this issue?

@marc-hb
Copy link
Collaborator

marc-hb commented Dec 23, 2022

That's true as long as one can find the repository somewhere.

Yes, and? I don't understand the point you're trying to make here sorry.

But I don't want to rely on somebody else to do the backup for me.

To perform backups I can (and I do) clone my own repos as many times as I like, I don't have to wait for someone else to do it.

@jobasto
Copy link
Author

jobasto commented Dec 23, 2022

I try to summarize it. I hope this makes my thoughts clear.

  1. I want to make sure that the application is buildable without relying on any external servers.
  2. How do I do that? -> I mirror everything using our own servers where I know a backup process is in place.
  3. How do I make sure that every repository I (or any colleague) need is on the mirror server? -> I make using the mirror server the default so that referencing a repo that is not mirrored will fail.
  4. How do I make using the mirror the default (also for colleagues and new application projects)? -> I put a rewrite rule in the company_library_repository .

@marc-hb
Copy link
Collaborator

marc-hb commented Dec 23, 2022

Thanks for making sure there is no misunderstanding.

My point is: 3. and 4. must not be in a manifest. The final mirroring decision must belong to a local configuration file. Having default values and sample configuration files in git is fine but the final mirroring decision must be in an (optional) local file that is not tracked in git.

Even when you and I build from different mirrors we must keep the ability to compare what we build and make sure it's the same thing all the way. This is becoming a legal requirement for obvious security reasons: https://www.cisa.gov/sbom

The very first step (not the only one) to compare our builds is to make sure we build from the same manifests and that our west status and git status are clean. Polluting manifests with mirroring details would change either git hashes or git status and break the ability to compare builds easily.

git (and mercurial, etc.) were a bit of a revolution because they totally decoupled location and version control thanks to the magic of checksumming everything. "Download from anywhere; I don't care". Git submodules kept that critical decoupling: you can mirror submodules and download them from anywhere and they still work. Git submodules come with a default location but it can be redirected in .git/config without polluting anything tracked in git.

West manifests should certainly not regress this wonderful design decision.

I make using the mirror server the default so that referencing a repo that is not mirrored will fail.

This does not prove anything. You could still have a build script downloading something without west and git knowing anything. For instance CMake's "ExternalProject" does that routinely. I'm not 100% sure but I think it has happened in Zephyr: https://github.com/zephyrproject-rtos/zephyr/issues?q=externalproject. There are many other ways to download from the Internet.

The only way to make really sure is to test your mirror while having no internet access.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 2, 2023

This does not prove anything. You could still have a build script downloading something without west and git knowing anything.

Just remembered the Zephyr project has a west blob command (I never used it)

EDIT: here's another example of code being downloaded at build time. It's wrong but not uncommon

@mbolivar-nordic
Copy link
Contributor

mbolivar-nordic commented Jan 3, 2023

Happy new year!

@marc-hb thank you for collecting all the backlogs!

I believe the salient point is "anything 'dynamic' in imports
instantly becomes insanely complicated".

Yep, the way import works internally is more complicated than people
usually expect.

The internal representations west uses to keep everything consistent
and well-defined when information (like project URLs) can come from
multiple places in the manifest or imported manifests (like 'remote'
or 'url' keys in an individual project definition, or the default
remote if neither is provided) can make some things harder to do than
people expect in particular.

To be fair, mirroring does not change any git version so it's not
actually "dynamic". I'm afraid mirroring is unfortunate "collateral
damage" of the more general complexity of changing imports on the
fly.

I thought about it a bit and we may be able to salvage some sort of
project URL override to satisfy this use case.

@jobasto:

manifest:
  defaults:
    remote: zephyr
    rewrite-url:
      - from: "default from regexp"
        to: "default to regexp "

This does not seem to address the core problem, which as I understand
it is how to selectively rewrite the URLs of imported projects, while
leaving everything else the same. I'm also not clear about the
intended semantics. I am inclined to say "no" to this part of a
proposal without a better description of what problem it solves and
how it solves it.

  projects:
    - name: zephyr
      revision: xxxxxx
      path: modules/zephyr
      import:
        rewrite-url:
          - from: "from regexp 1"
            to: "to regexp 1"
          - from: "from regexp 2"
             to: "to regexp 2"

I think that something like this is probably doable, though I share
@marc-hb 's concerns about using regular expressions.

Let me start by saying that while I don't have time to implement this
myself (my main tasks are devicetree-related at the present time), I
would be happy to provide some guidance on how I would like to see
this done.

The function that needs to be hooked is Manifest._load_projects:

def _load_projects(self, manifest: Dict[str, Any],

This is what's responsible for taking the parsed YAML data and
creating a west.manifest.Project object, which has the final url
field that @jobasto wants to override.

We need to track any context-specific information that we may have
inherited from a "parent" manifest (so e.g.
project_repository/west.yml could provide a URL override, which then
needs to be passed downwards as we recursively import
company_library_repository/west.yml and finally
zephyr_repository/west.yml). At this last step, we will call
_load_projects on the manifest data defining the project, and we
need the "parent" rewrite data in project_repository/west.yml to be
available.

The correct place to put this information is _import_ctx:

class _import_ctx(NamedTuple):

The current import context is always available as
west.manifest.Manifest._ctx. You should therefore:

  • add a field to _import_ctx with an internal representation for all
    the rewrite rules you need

  • merge it with any new rewrite rules as you recursively resolve
    manifests (look at how _compose_imap_filters() is defined and used
    for examples of how west updates its import context as it resolves
    imports)

  • make sure _load_projects consults this context when it creates the
    Project object, using the rewrite as necessary

@jobasto is this something you would be willing to take on?

@jobasto
Copy link
Author

jobasto commented Jan 5, 2023

Happy new year!

manifest:
  defaults:
    remote: zephyr
    rewrite-url:
      - from: "default from regexp"
        to: "default to regexp "

This does not seem to address the core problem, which as I understand it is how to selectively rewrite the URLs of imported projects, while leaving everything else the same. I'm also not clear about the intended semantics. I am inclined to say "no" to this part of a proposal without a better description of what problem it solves and how it solves it.

The goal is to have a "catch-all" rewrite rule that applies to all imports if no other rewrite rule is specified.

One could also imagine to put a rule in the remotes section:

manifest:
  defaults:
    remote: upstream

  remotes:
    - name: upstream
      url-base: https://github.com/zephyrproject-rtos
      url-rewrite: [...]

However, I don't need it for my use case as I've only one import where I want to rewrite urls. So we can keep things simple and leave that out.

But I'm not sure how far we should think ahead. For example, I only need one rewrite rule. So

rewrite-url:
    from: "from regexp"
    to: "to regexp"

would be enough for me. But I can imagine that maybe somebody would need multiple rules. So I would prefer to have a list of rules even if I put only one rule in that list.

I think that something like this is probably doable, though I share @marc-hb 's concerns about using regular expressions.

Regarding regular expressions the rationale was a bit like before: Keep it flexible. But I don't need it at the moment. So we could replace the beginning of the url as git does. If the need to have regular expressions should ever arise we could always extend the syntax, e.g.

rewrite-url:
    from: "from regexp"
    to: "to regexp"
    regexp: true

@jobasto is this something you would be willing to take on?

Yes, I would. And thanks a lot for offering guidance.

@marc-hb marc-hb added the Partial imports Incomplete or changing imports are much more complicated than you think label Jan 5, 2023
@mbolivar-nordic
Copy link
Contributor

mbolivar-nordic commented Jan 5, 2023

@marc-hb :

My point is: 3. and 4. must not be in a manifest. The final mirroring decision must belong to a local configuration file. Having default values and sample configuration files in git is fine but the final mirroring decision must be in an (optional) local file that is not tracked in git.

I think this is a separate feature request from what @jobasto is asking for. West has no features for populating configuration variables to certain values by default. Putting this in the manifest is the only way to guarantee that other users are a west init + west update away from getting the repositories from exactly the mirrors you want.

You could certainly additionally add configuration file support for overriding URLs, the same way we have configuration file support for group filters in addition to manifest file syntax for specifying group filters. However that does not seem to address the concrete use case here.

Furthermore upon reflection I am a bit leery of adding configuration file support for this. Anything that allows users to silently get something different when they run west update is a potential footgun and I am very protective of making sure that west update won't randomly get you different code. I see group filters as different because they are useful for getting a subset of a workspace, which is different than getting some slightly different use case workspace. Therefore I'd be leery of accepting such a feature without a compelling use case.

@mbolivar-nordic
Copy link
Contributor

The goal is to have a "catch-all" rewrite rule that applies to all imports if no other rewrite rule is specified.

This feels like overkill to me unless you have some examples where it really saves a lot of typing. If the idea of restricting these rules to live inside an import: foo: <list of rules> map will solve your use case (I think it does, right?), I'd prefer we start there, as long as the design doesn't limit us from adding catch-all rules later.

Remember west is:

Conservative about features: no features will be accepted without strong and compelling motivation.
https://docs.zephyrproject.org/latest/develop/west/why.html#west-design-constraints

So I would prefer to have a list of rules even if I put only one rule in that list.

Agreed; anywhere that one rewrite rule can appear, we should ideally allow several. Though then the semantics of what happens when multiple rules apply needs to be clearly specified. I suggest "first match wins, subsequent matches are ignored". Thoughts?

If the need to have regular expressions should ever arise we could always extend the syntax

Agreed -- please feel free to design a syntax that is future-proof enough to make regular expressions as a future extension, but let's limit this to what you need, and what git can do, at least for now / until we have a clear use case.

@jobasto is this something you would be willing to take on?

Yes, I would. And thanks a lot for offering guidance.

Great, thanks! I'm watching the repository and will review. Feel free to ask questions or chat further on the #west channel in the zephyr discord instance (https://discord.com/invite/Ck7jw53nU2).

@mbolivar-nordic
Copy link
Contributor

I suggest "first match wins, subsequent matches are ignored". Thoughts?

Note that this is a little subtle since we can inherit rules from multiple levels of parents.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 5, 2023

West has no features for populating configuration variables to certain values by default.

I'm afraid we're talking across each other. The default value for mirroring is obviously "no mirroring", so there's no need for west to provide default mirroring configuration variables.

I think this is a separate feature request from what @jobasto is asking for.

Putting this in the manifest is the only way to guarantee that other users are a west init + west update away from getting the repositories from exactly the mirrors you want.

@jobasto and I have clearly different mirroring objectives but it's still mirroring in both cases.

@jobasto wants to make sure different people using the same top-level manifest hit "exactly the mirrors specified".

I believe there is "strong and compelling" (and very different) use case for "invisible mirroring" where mirroring is a local decision; making sure only THE CONTENT is the same no matter where it was downloaded from. In fact it's not me believing that but basically anyone working on content-addressable storage, including the people who designed git and friends. See all the references above.

I think these two different mirroring objectives are not mutually exclusive and I suspect they could share most of their "remote rewriting" implementation besides the user interface.

You could certainly additionally add configuration file support for overriding URLs, the same way we have configuration file support for group filters in addition to manifest file syntax for specifying group filters. However that does not seem to address the concrete use case here.

OK so maybe we're not talking across each other?

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 5, 2023

Furthermore upon reflection I am a bit leery of adding configuration file support for this. Anything that allows users to silently get something different when they run west update is a potential footgun and I am very protective of making sure that west update won't randomly get you different code.

I don't understand this. If the revision is a checksum (true content-addressable storage) then rewriting remotes cannot get any different content. That's how git is designed.

If on the other hand the revision is a moving branch then two people running west update at a different time will get different content anyway before any kind of mirror gets involved at all.

So mirroring will never make any content difference one way or the other - as expected.

@mbolivar-nordic
Copy link
Contributor

mbolivar-nordic commented Jan 6, 2023

If on the other hand the revision is a moving branch then two people running west update at a different time will get different content anyway before any kind of mirror gets involved at all.

These two cases are strikingly different when you are on a support team and you are positive what's in your company's fork of a repository, but have no idea what's in a random customer's mirror.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 6, 2023

So there are two manifest types: the deterministic ones that provide true content-addressable storage, reproducible builds, safe supply chains etc. and the other ones with a non-zero number of "moving parts". I understand there are some situations where you want to use the latter instead of the former but these should be considered the exception and not the rule. In such cases you probably want to ask users to (re)move ALL their custom configuration anyway for good measure - including git's insteadOf which already exists[*], has already been recommended here and that no west design decision can block.
In support situations (I've been there too) you also want to ask for some nice and short west list anyway to be completely sure there hasn't been any "operator error", misunderstanding or unexpected bug.

[*] ... but is not workspace-specific.

@mbolivar-nordic
Copy link
Contributor

including git's insteadOf which already exists, has already been recommended here and that no west design decision can block[*].

That's a fair counterpoint. OK, I'm fine with "someone" (again, not me) adding configuration options for this as well, but I don't think it should block an initial implementation which doesn't have them, especially if that's all the author has time to submit. We can always add more features later as concrete use cases arise.

I am at least a little sympathetic to the "I want to hack something together with my own mirrors and my local config files" -- but if that's the case, insteadOf already works, so it's still not clear what the added value of a west config option is. We already made the decision not to implement any west-specific support for managing secrets to access private repositories since git already natively supports this. This kind of feels similar.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 6, 2023

Thanks, glad we could find some common ground.

but I don't think it should block an initial implementation which doesn't have them, especially if that's all the author has time to submit. We can always add more features later as concrete use cases arise.

Agreed; I'm certainly not expecting anyone to implement something they don't need themselves.

On the other hand, I expect the implementation of one feature not to get in the way of an other, future and very similar feature with which it's likely to share most of its code. That's what I care about.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 6, 2023

insteadOf already works, so it's still not clear what the added value of a west config option is.

It's simply global = in the wrong place for some use cases. It's not because you want to use a mirror for some git clones that you want to use a mirror for all git clones / projects / workspaces.

@mbolivar-nordic
Copy link
Contributor

OK, yep, good point. Thanks.

@jobasto
Copy link
Author

jobasto commented Mar 2, 2023

I just want to let everybody know that I've written a little code and will open a pull request as soon as I can (which might still take some time).

@marc-hb
Copy link
Collaborator

marc-hb commented Mar 3, 2023

Please share code as soon as you can but don't forget @mbolivar-nordic will request some new tests and coverage eventually :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Partial imports Incomplete or changing imports are much more complicated than you think performance How long things take
Projects
None yet
Development

No branches or pull requests

3 participants