Proposal: list all package licenses #10852

Open
wants to merge 3 commits into
base: dev

Conversation

aaronpowell

This proposal introduces a new feature for listing the licenses of all dependent NuGet packages (direct and transitive), so that people can better understand which licenses are used within a project.

It's modelled on a dotnet tool I wrote - https://github.com/aaronpowell/dotnet-delice

@JonDouglas
Contributor

Thank you so much for the contribution to NuGet!

We'll leave this proposal open for the next couple weeks to get feedback from the .NET community & respective teams & we'll do a quick internal review after!

If you're seeing this message, please 👍 or provide your feedback on this proposal in this PR as to why we should or shouldn't do this.

Thank you everyone!

- If a `licenseUrl` is provided, attempt to download the license from the endpoint and compare with the known list
- Fallback to looking at the package feed and see if it provides license information

It may also be worth having the facility to cache the SPDX license information from https://spdx.org/licenses/licenses.json to improve lookup performance

NuGet.org does not allow licenses that aren't listed in NuGet.Packaging (data) to be used in a package's license expression, so the data is already there, though it may be stale with respect to OSI/FSF approval.
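
If the SPDX list were fetched from https://spdx.org/licenses/licenses.json rather than taken from the data already bundled with NuGet.Packaging, a time-based file cache is one way the lookup cost could be kept down. The following is only a sketch in Python, not part of the proposal: the function names, the cache location, and the one-week lifetime are assumptions, and it assumes the JSON document has a top-level `licenses` array whose entries carry a `licenseId` field.

```python
import json
import time
import urllib.request
from pathlib import Path

SPDX_URL = "https://spdx.org/licenses/licenses.json"


def fetch_spdx_json(url=SPDX_URL):
    """Download the raw SPDX license list as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def load_spdx_licenses(cache_path, max_age_seconds=7 * 24 * 3600, fetch=fetch_spdx_json):
    """Return {licenseId: entry}, reusing the cached file while it is fresh.

    The one-week default lifetime is an arbitrary assumption.
    """
    cache_path = Path(cache_path)
    is_fresh = (
        cache_path.exists()
        and time.time() - cache_path.stat().st_mtime < max_age_seconds
    )
    if is_fresh:
        raw = cache_path.read_text(encoding="utf-8")
    else:
        raw = fetch()
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(raw, encoding="utf-8")
    return {entry["licenseId"]: entry for entry in json.loads(raw)["licenses"]}
```

Passing the fetcher in as a parameter keeps the cache logic testable offline and makes it easy to swap in a bundled copy of the list instead of a network call.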


The first technical challenge for this is the inconsistent nature of which licenses are provided by NuGet packages. While the [`licenseUrl` field was deprecated](https://github.com/NuGet/Announcements/issues/32), some projects haven't adopted the new format (or older packages that predate the deprecation are in use), making it difficult to determine what the license of a project is.

The next challenge is how to detect licenses from license files. The ideal approach would be to mirror GitHub's approach, which uses [Licensee](https://licensee.github.io/licensee/) [for detection](https://help.github.com/en/articles/licensing-a-repository#detecting-a-license) (but naturally a dotnet implementation). Essentially, this uses [Sørensen–Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) with a threshold for what is the acceptable level of comparison between the package's license file and license template.

What happens if a package supplies a license file or uses a license URL, but its text does not match any license?

Author


delice, the tool that this proposal is based on, will mark it as "Unable to determine": https://github.com/aaronpowell/dotnet-delice/blob/main/src/DotNetDelice/ConsoleOutput.fs#L60

@JonDouglas
Contributor

Hey @aaronpowell,

We took some time this afternoon to review your proposal. Overall, we really like the direction & problems this proposal seeks to solve. There are a few items that we believe should be considered to take on this proposal. I'll go through a few of those just for the sake of transparency.

  1. We are not quite sure how involved third-party tooling should be in the first-party tooling. In other words, we don't know if this project best fits its current place in the ecosystem, or if it should evolve into a home in the official dotnet SDK. Do you have thoughts on this?
    • For a home in the dotnet SDK, we believe we may need to alter the functionality to live in two locations, such as dotnet list package and a dotnet audit type command. For the sake of listing licenses, we could extend existing list functionality. For the sake of auditing licenses, we could create newer experiences.
    • There exist a number of open source license compliance tools on the market. Would we be re-inventing the wheel by including this?
    • Existing SBOM formats such as SPDX provide license expressions for dependencies. Would having this information in two locations such as generated by a CLI command & inside a generated SBOM be overkill for compliance?
  2. We'd likely opt to not use GitHub's API to check licenses and rely on the NuGet license expression instead. The primary concern would be shipping a third-party API for functionality in the dotnet SDK.
  3. We'd push for only using license metadata and not support deprecated metadata like licenseUrl.

For now we're going to leave this proposal open to continue to iterate & gather more feedback from you and the community. Thanks again for the proposal and we invite anyone reading this comment to provide their feedback on this proposal in addition!

@aaronpowell
Author

Hi @JonDouglas,

Thanks for the feedback and notes on the proposal. It's the first time I've submitted a proposal so I am very welcoming of feedback 😁.

Here are my comments on the items you've raised:

> 1. We are not quite sure how involved third-party tooling should be in the first-party tooling. In other words, we don't know if this project best fits its current place in the ecosystem, or if it should evolve into a home in the official dotnet SDK. Do you have thoughts on this?

I do think it could make a useful command within the dotnet sdk itself, rather than within a sub-tool such as NuGet, but at the same time, I view NuGet as the key tool for interacting with external dependencies, which is why I started the proposal here.

> There exist a number of open source license compliance tools on the market. Would we be re-inventing the wheel by including this?

To a degree, yes, it would be doing something that's already possible with third-party tooling. After all, I've written my own tooling to do this, which is the basis for the spec. Where I see this differing from existing tools, such as Snyk's offering, is that this is a building-block command. It doesn't go to the level of telling you whether you are compliant or not; it just gives you enough information to make decisions around your compliance, or even just an insight into the licenses you are consuming, for transparency.

The command is unopinionated, and I'd argue it should stay that way, so that users of it can make their own opinions based off the information they are provided.

Additionally, with the increase of reliance on third party dependencies, I feel that it's the responsibility of the platform to give you the insights you need, rather than having to get third party tools to do that.

> Existing SBOM formats such as SPDX provide license expressions for dependencies. Would having this information in two locations such as generated by a CLI command & inside a generated SBOM be overkill for compliance?

Sorry, I'm not sure I understand the question here.

> 2. We'd likely opt to not use GitHub's API to check licenses and rely on the NuGet license expression instead. The primary concern would be shipping a third-party API for functionality in the dotnet SDK.
> 3. We'd push for only using license metadata and not support deprecated metadata like licenseUrl.

Understandable, but I'd encourage doing research into the coverage of information that can be obtained without the GitHub API fallback. The reason that I added it to delice was that I was finding that a lot of packages were still using licenseUrl and not the license field of the nuspec to provide license information. This then meant I was needing to either report it as an unknown dependency, or have some way to recover the contents of the file (to then apply the Sørensen–Dice coefficient against) and this meant requesting the file from GitHub. Running this against a suitably large project could exceed the API request limit, which is why I added that option in too.

So dropping a check against licenseUrl will likely also remove the need to access the GitHub API, but the trade off from that is that you might get less valuable data produced, as adoption of the license nuspec field seemed low (at least, when I wrote delice).

@JonDouglas
Contributor

JonDouglas commented Jun 16, 2021

@aaronpowell Thanks for the response! I just wanted to write up the thoughts that the team had just to be fully transparent.

If there's interest, I can help provide some functional designs for how this proposal might integrate into the existing dotnet CLI commands that NuGet manages today, such as dotnet list package. If we were to provide support for machine-readable output, we could accomplish what delice does without having to create a new command.

With regard to the previous thought on an SBOM providing similar information, those reading this comment can check out "what is an SBOM" to help inform their opinions, as license information would likely be included in both the dotnet CLI and an SBOM.

We'll definitely take a look at the percentage of packages on NuGet.org that contain each type of license metadata. Given it's a dotnet CLI feature, the emphasis would be on .NET Core and newer supported scenarios (i.e. PackageReference / core supported packages). We'll see what those numbers look like to help give a better picture, as older packages may also use outdated licenseUrl metadata.

@aaronpowell
Author

If it seems more logical to have it as part of the metadata available off dotnet list package, I'm happy to work on the proposal for there. My only concern would be discoverability and whether that impacts the usefulness of the data. Something that I found useful with delice was that it was license-first, not package-first, in the output, since that was the information you're attempting to view.

So, if the license information is added as part of the output of dotnet list package, being able to generate a list of licenses would require parsing the machine-readable format and generating a new output view. Would that be a barrier to usefulness?

Something I also want to raise is that npm has a proposal for a similar feature - npm/rfcs#182 (it was the inspiration for this proposal), so having an aligned machine-readable output would make it easier to produce a view of licenses across multiple platforms that may be in use by a project.
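
To illustrate the license-first versus package-first point above: if dotnet list package did emit machine-readable output, regrouping it by license would be a small post-processing step. The sketch below is in Python and assumes a purely hypothetical JSON shape (a flat array of objects with `id`, `version`, and `license` keys); no such output format is defined in this thread.

```python
import json
from collections import defaultdict


def group_by_license(packages_json):
    """Turn a package-first JSON list into a license-first mapping.

    The input shape (id / version / license keys) is hypothetical.
    """
    grouped = defaultdict(list)
    for pkg in json.loads(packages_json):
        # Packages without license data fall into a catch-all bucket.
        license_id = pkg.get("license") or "Unable to determine"
        grouped[license_id].append(f'{pkg["id"]} {pkg["version"]}')
    return {lic: sorted(names) for lic, names in sorted(grouped.items())}


if __name__ == "__main__":
    sample = json.dumps([
        {"id": "Newtonsoft.Json", "version": "13.0.1", "license": "MIT"},
        {"id": "Contoso.Legacy", "version": "1.0.0", "license": None},
    ])
    for license_id, packages in group_by_license(sample).items():
        print(license_id, "->", ", ".join(packages))
```

The package `Contoso.Legacy` is a made-up name used only to show the catch-all bucket.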

@markusschaber

@aaronpowell I filed a similar (but not identical) issue: #10993

While both issues point in the same direction, mine has extended requirements, as it focuses on a different use case. I think this proposal will work fine for framework-dependent distributions of desktop or server applications, but does not completely cover use cases like <PublishSingleFile> or client-side Blazor.

  1. Include non-NuGet dependencies, e.g. parts of the framework / runtime which are pulled in with <PublishSingleFile> or client-side Blazor.
  2. Honor the results of the linker, thus only listing those packages whose parts are actually linked into the product, instead of just following all NuGet references, whether they're actually used in the project, are unused references, or are "just" build tools.
  3. Generate user-consumable output including the copyright notices per package - your proposal concentrates only on the licenses and deduplicates them across the packages.
  4. Injecting the result into the binary, so we can be sure it will be "given" to the recipient of the compiled application, as requested by the Apache License (See Feature Request: Provide third party / OSS bill of material and license / copyright list on build #10993 for a more detailed explanation).

Do you think we could unify our proposal by extending yours to include my requirements?

@aaronpowell
Author

@markusschaber I've had a read of your issue, but I'm not sure there's much overlap between what's proposed across these two, other than they are talking about licenses.

The primary objective of this proposal is to surface data that is hard to get: a dump of the list of licenses from the dependencies (including transitive and framework) of a project. It's intended to be unopinionated about what to do with that data, and it's up to the consumer to make decisions around allow lists/deny lists, etc.

I'm also not sure that NuGet would be the right place for a command such as you're describing. What you're describing requires integration through the linker, understanding the different platform targets, etc., which is much more than a package manager handles.

As a result, I don't think extending this proposal is the right approach, it'd add complexity to what is (in theory 😅) a relatively simple proposal.

@markusschaber

@aaronpowell Ok, I agree with that. So we keep the proposals separate.

@jeffkl jeffkl added the Community PRs (and linked Issues) created by someone not in the NuGet team label May 3, 2022

- If a `license` field is in the nuspec, check if it's a SPDX ID, if so, return. If it's a file, use Sørensen–Dice coefficient to compare it to a known list
- If a `licenseUrl` is provided, attempt to download the license from the endpoint and compare with the known list
- Fallback to looking at the package feed and see if it provides license information
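
The three steps listed above can be read as a fallback chain. The following Python sketch shows one possible ordering under stated assumptions: `PackageMetadata`, `detect_license`, and the two callbacks are hypothetical names, text matching and downloading are injected rather than implemented, and only a bare SPDX ID (not a compound license expression) is recognized in the first step. The "Unable to determine" result mirrors what delice reports.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

UNDETERMINED = "Unable to determine"


@dataclass
class PackageMetadata:
    """Hypothetical view of the license-related data for one package."""
    license_expression: Optional[str] = None  # nuspec license, expression form
    license_file_text: Optional[str] = None   # nuspec license, file form
    license_url: Optional[str] = None         # deprecated licenseUrl
    feed_license: Optional[str] = None        # whatever the feed reports


def detect_license(
    pkg: PackageMetadata,
    known_ids: Set[str],
    match_text: Callable[[str], Optional[str]],
    fetch_url: Callable[[str], Optional[str]],
) -> str:
    # Step 1a: the license field already names a known SPDX ID.
    if pkg.license_expression and pkg.license_expression in known_ids:
        return pkg.license_expression
    # Step 1b: the license field is a file; compare it to known templates.
    if pkg.license_file_text:
        matched = match_text(pkg.license_file_text)
        if matched:
            return matched
    # Step 2: download the licenseUrl target and compare it the same way.
    if pkg.license_url:
        downloaded = fetch_url(pkg.license_url)
        if downloaded:
            matched = match_text(downloaded)
            if matched:
                return matched
    # Step 3: fall back to whatever the package feed reports.
    if pkg.feed_license:
        return pkg.feed_license
    return UNDETERMINED
```

Keeping the matcher and the downloader as parameters means the URL step can be switched off simply by passing a fetcher that always returns `None`.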


There should also be a mechanism to exclude packages. We run our own NuGet package feed containing homegrown packages that do not include any license information but don't need to, since they are only used internally and thus don't need to be checked.
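
One way such an exclusion could work is a list of glob patterns matched against package IDs before any license lookup happens. This Python sketch is an illustration only: the function name and the pattern syntax are assumptions, and nothing in the proposal defines an exclusion option. Matching is done case-insensitively because NuGet package IDs are case-insensitive.

```python
from fnmatch import fnmatchcase


def filter_excluded(package_ids, exclude_patterns):
    """Drop package IDs that match any exclusion glob (hypothetical option)."""
    lowered = [pattern.lower() for pattern in exclude_patterns]
    return [
        package_id
        for package_id in package_ids
        if not any(fnmatchcase(package_id.lower(), pattern) for pattern in lowered)
    ]


if __name__ == "__main__":
    # "Contoso.Internal.*" stands in for an internal, homegrown package prefix.
    print(filter_excluded(
        ["Contoso.Internal.Core", "Newtonsoft.Json"],
        ["Contoso.Internal.*"],
    ))
```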

Contributor

@dominoFire dominoFire left a comment


Thanks for your patience! Some comments.


The technical workflow for license detection would follow:

- If a `license` field is in the nuspec, check if it's a SPDX ID, if so, return. If it's a file, use Sørensen–Dice coefficient to compare it to a known list
Contributor


I am curious why you chose the Dice coefficient for text similarity. Have you considered other text similarity techniques? For example

(I did a homework on text similarity during my college years)

Are we expecting large texts so that min-hashing is worth doing?

Also, how will the license text be processed? Will stop words be removed?

Author


The reason that similarity algorithm is specified is because dotnet-delice (which inspired this proposal) is a .NET port of the JavaScript delice tool, and I wanted to have the same level of similarity applied so you could use both tools together and get comparable results.

License text should be processed as provided by the package, removal of stop words or anything else would mean that you're not accurately processing the license as shipped by the package and may give incorrect results.

The same goes for the license as provided by SPDX, there should be no modification of it to avoid incorrect results.
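
For readers unfamiliar with the measure, the Sørensen–Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). The Python sketch below applies it to the sets of whitespace-separated tokens of two texts, leaving case and punctuation untouched in keeping with the "process as provided" point above. It is not the delice or Licensee implementation: the tokenization, the function names, and the 0.9 threshold are all assumptions chosen for illustration.

```python
def dice_coefficient(a, b):
    """Sørensen–Dice coefficient over sets of whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    overlap = len(tokens_a & tokens_b)
    return 2 * overlap / (len(tokens_a) + len(tokens_b))


def best_match(license_text, templates, threshold=0.9):
    """Return (spdx_id, score) for the closest template.

    The ID is None when the best score is below the threshold.
    The 0.9 default is an arbitrary assumption.
    """
    best_id, best_score = None, 0.0
    for spdx_id, template in templates.items():
        score = dice_coefficient(license_text, template)
        if score > best_score:
            best_id, best_score = spdx_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score
```

For example, "a b c d" and "a b c e" share three of their four tokens each, giving 2 × 3 / (4 + 4) = 0.75, which would fall below the assumed threshold and be reported as undetermined.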


### Technical explanation

The first technical challenge for this is the inconsistent nature of which licenses are provided by NuGet packages. While the [`licenseUrl` field was deprecated](https://github.com/NuGet/Announcements/issues/32), some projects haven't adopted the new format (or older packages that predate the deprecation are in use), making it difficult to determine what the license of a project is.
Contributor


  • I would suggest computing similarity as a command option. Making a compliance decision without reading the license can be risky, as just one word can change the whole license meaning.
  • How about scope-splitting this spec into the following scenarios?
    1. first, listing licenses of all packages, e.g. dotnet list package --show-license, just showing license types
    2. then, making 'similarity aggregation' in SPDX licenses, as an option
    3. later, making 'similarity aggregation' license files, with similarity selection
    4. finally, licenseUrl, as an option

I cannot make any guarantees of whether or not this is going to be implemented, but implementing the first two scenarios looks feasible. We are more than happy to review a community PR :)

Author


> • I would suggest computing similarity as a command option. Making a compliance decision without reading the license can be risky, as just one word can change the whole license meaning.

This wouldn't make compliance decisions; it merely surfaces the information that decision makers need to make said decision. A level of similarity is required, which is why something such as the Sørensen–Dice coefficient is used to determine the license (if it's not explicitly set). Otherwise you'd be provided with mostly unhelpful information (when I wrote the initial proposal, usage of SPDX IDs for licensing was low relative to a dedicated URL).

> • How about scope-splitting this spec into the following scenarios?
>   1. first, listing licenses of all packages, e.g. dotnet list package --show-license, just showing license types
>   2. then, making 'similarity aggregation' in SPDX licenses, as an option
>   3. later, making 'similarity aggregation' license files, with similarity selection
>   4. finally, licenseUrl, as an option
>
> I cannot make any guarantees of whether or not this is going to be implemented, but implementing the first two scenarios looks feasible. We are more than happy to review a community PR :)

Dropping license similarity based on file content I feel would really decrease the value of using this, as my past testing has indicated that the usage of SPDX identifiers in NuGet packages for license indication is relatively low, and that's ultimately why I added the feature to dotnet-delice to do template comparisons.

Given that this has already been implemented once, I'm probably overestimating the simplicity of implementing it again (converting the F# to C#), but doing so with some additional branches that allows you to opt-out of SPDX template comparisons wouldn't be a huge overhead in the process.

@dominoFire
Contributor

Team triage meeting: handing off to @JonDouglas; please help drive this proposal. Thanks!

@dominoFire dominoFire assigned JonDouglas and unassigned dominoFire Jul 19, 2022
@ghost ghost added the Status:No recent activity No recent activity. label Sep 26, 2023
@ghost

ghost commented Sep 26, 2023

This PR has been automatically marked as stale because it has no activity for 30 days. It will be closed if no further activity occurs within another 330 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

@cremor

cremor commented Sep 26, 2023

Please do not close this for lack of activity. It would be a very useful feature and I don't see an official "yes" or "no" here.

@ghost ghost removed the Status:No recent activity No recent activity. label Sep 26, 2023
@JonDouglas
Contributor

Hi all,

I know it has been a couple years since this was proposed. It is a good idea (kudos to @aaronpowell on being early on calling it out) and is something that keeps coming up.

One challenge for NuGet at the time was our license adoption for packages. Back in 2021, we didn't have great adoption of best license practices (i.e. expressed licenses). This wasn't really called out here on the proposal but I'm going to call it out now.

2023 is looking like a much better picture and something we need to keep our eyes open for is license auditing as per this proposal.

image

I know the bot recently closed this, but I believe this proposal should be kept open to collect 👍 for awhile longer until we can properly understand how to add this to tooling. This is a common/highly requested ask and I'd like to note that here.

If you're reading this comment, please continue to contribute to this proposal and the ideas of how we can list all package licenses that are expressed in the NuGet tooling.

@aaronpowell
Author

@JonDouglas - I haven't done much with the tool that inspired this proposal for a while, would it be useful for yourselves if I updated it to the latest .NET and ran some tests to see what the output looks like with the current state of NuGet package licenses?

@ghost ghost added the Status:No recent activity No recent activity. label Oct 27, 2023
@ghost

ghost commented Oct 27, 2023

This PR has been automatically marked as stale because it has no activity for 30 days. It will be closed if no further activity occurs within another 330 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

@cremor

cremor commented Oct 27, 2023

Please do not close this for lack of activity. See comment from @JonDouglas above:

> I know the bot recently closed this, but I believe this proposal should be kept open to collect 👍 for awhile longer until we can properly understand how to add this to tooling. This is a common/highly requested ask and I'd like to note that here.

Also, there is an open question from the PR author to @JonDouglas.

Offtopic: Does the bot really create those "no recent activity" comments after 30 days although the actual closing would only happen after another 330 days (a full year of no activity)?

@ghost ghost removed the Status:No recent activity No recent activity. label Oct 27, 2023
@JonDouglas
Contributor

Thank you @cremor. Let me see if we can get this bot disabled entirely here. It is not very helpful in the context of design/proposal PRs. Also, it just disrupts and discourages people from engaging.

Just to be clear with people on this specific issue, there is no Yes/No decision made here. I am suggesting we are in a phase of "not yet" but need to keep collecting sentiment and be transparent about why we're not there yet (i.e. sharing data like I did in September).

Licenses are especially important today i.e. https://www.sonatype.com/state-of-the-software-supply-chain/introduction

@donnie-msft
Contributor

Hi, we have removed our "proposals" folder, so please move this proposal to the "accepted" folder.
See the update here, and let me know if you have questions/concerns: https://github.com/NuGet/Home/blob/b18b5cc1507df04ea9785f8ba613b1ceb2ad93ea/meta/README.md#what-happens-to-a-proposal
Thanks!

@kartheekp-ms kartheekp-ms added the Status:Do not auto close Do not auto close for PRs needs long review process label Nov 28, 2023
@ghost ghost added the Status:No recent activity No recent activity. label Dec 29, 2023
@ghost

ghost commented Dec 29, 2023

This PR has been automatically marked as stale because it has no activity for 30 days. It will be closed if no further activity occurs within another 330 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

@cremor

cremor commented Dec 29, 2023

@aaronpowell Please update this proposal PR as explained by @donnie-msft above.

@ghost ghost removed the Status:No recent activity No recent activity. label Dec 29, 2023
@cremor

cremor commented Dec 29, 2023

@kartheekp-ms Looks like the label Status:Do not auto close that you've added to this issue doesn't work.

@aaronpowell aaronpowell requested a review from a team as a code owner January 1, 2024 20:37
@aaronpowell
Author

> Hi, we have removed our "proposals" folder, so please move this proposal to the "accepted" folder. See the update here, and let me know if you have questions/concerns: https://github.com/NuGet/Home/blob/b18b5cc1507df04ea9785f8ba613b1ceb2ad93ea/meta/README.md#what-happens-to-a-proposal Thanks!

I've moved it to the 2023 proposals folder

@ghost

ghost commented Feb 1, 2024

This PR has been automatically marked as stale because it has no activity for 30 days. It will be closed if no further activity occurs within another 330 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

@sensslen

sensslen commented Feb 1, 2024

I don't think this proposal should be marked as stale or closed, for the following reasons:

  1. There is interest in this feature (people are building tools that do exactly this - see https://github.com/aaronpowell/dotnet-delice and https://github.com/sensslen/nuget-license)
  2. The proposal was updated according to the request

@ghost ghost removed the Status:No recent activity No recent activity. label Feb 1, 2024
@aaronpowell
Author

Given we're coming towards the 3-year mark since this proposal was first put forward, I'd like to know what the stance on it is.

I really do think that this would be a valuable addition to the NuGet CLI, and the viability of it has increased due to the greater adoption of how licenses are stored in NuGet packages (as @JonDouglas pointed out).

I'm still happy to contribute the implementation of this if that would be of value.

@JonDouglas
Contributor

Here are some thoughts. I'll be as transparent as possible. This is a good proposal and something the tooling could generally use to delight people with a helpful command.

Here's where we are though. Today we are working on two major things: generative AI and security, i.e. SBOMs. As one may imagine, both of these truly help the problem of listing and understanding licenses to help one audit true license risk.

Our focus is more on how we can make NuGet better long term for many of these things and on pushing adoption of things like license expressions as a best practice (as seen in previous comments). I think that this specific proposal would need to be championed by the community to the OSS project, and our teams (dotnet and NuGet) can help shepherd this into the formal tooling such as the dotnet and NuGet CLIs.

That however is just my community opinion here with what I know and I believe we will need some further team input here as ultimately it is a team sport to get these things done. @NuGet/nuget-client for example to share some thoughts if anyone on the team would like to add additional perspectives.


The first technical challenge for this is the inconsistent nature of which licenses are provided by NuGet packages. While the [`licenseUrl` field was deprecated](https://github.com/NuGet/Announcements/issues/32), some projects haven't adopted the new format (or older packages that predate the deprecation are in use), making it difficult to determine what the license of a project is.

The next challenge is how to detect licenses from license files. The ideal approach would be to mirror GitHub's approach, which uses [Licensee](https://licensee.github.io/licensee/) [for detection](https://help.github.com/en/articles/licensing-a-repository#detecting-a-license) (but naturally a dotnet implementation). Essentially, this uses [Sørensen–Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) with a threshold for what is the acceptable level of comparison between the package's license file and license template.
Member


fwiw, for anything to ever be a part of the dotnet commands, it needs to be source buildable: https://github.com/dotnet/source-build?tab=readme-ov-file#source-build-goals.

We can't take a non .NET dependency easily.


## Prior Art

I have created a dotnet global tool that does this, [`dotnet-delice`](https://github.com/aaronpowell/dotnet-delice). This proves that it is a technical possibility to implement such a solution.
Member


I love that there's a global tool for this. They're ideal for these types of scenarios, where getting the functionality in the .NET SDK itself would meet some blockers like the license detection above.

@dotnet-policy-service dotnet-policy-service bot added the Status:No recent activity No recent activity. label Mar 7, 2024
Contributor

This PR has been automatically marked as stale because it has no activity for 30 days. It will be closed if no further activity occurs within another 330 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

Labels
Community PRs (and linked Issues) created by someone not in the NuGet team Status:Do not auto close Do not auto close for PRs needs long review process Status:No recent activity No recent activity.