Skip to content

Conversation

@ericphanson
Copy link
Member

@ericphanson ericphanson commented Jan 18, 2021

fixes #261 by using LicenseCheck.jl to wrap the Go library licensecheck in order to find the most likely license file and check what licenses it contains. LicenseCheck should be registered in a few hours; CI will fail until then.

License-file finding strategy: this is implemented in LicenseCheck.jl. The current strategy is: if there are more than 100 toplevel files, then check any that are in a long hardcoded list (COPYING, LICENSE, GPL.txt, etc). If there are fewer than 100, then check every plaintext file at toplevel. For each file we check, licensecheck returns a list of licenses it has found in that file (if any), and the percent of the file covered by those licenses it found (percent_covered). We take the file with the highest percent_covered and assume that is the license file of interest.

License rules: what I've currently implemented is: for each version being registered (not new packages only), there must be a file of which at least 70% of the contents are covered by identified licenses, and all of those licenses must be OSI-approved (determined by taking the licensecheck-obtained identifier, and seeing if it is' on a list of SDPX license identifiers which are OSI-approved, via LicenseCheck.jl).

Lots of choices in all of this, and maybe these rules will lead to too many falsely-flagged packages, I'm not sure.

The license guideline is enabled by a keyword argument to run, which is off by default so that this is not breaking. IMO we should just turn it on for General in it's config.

update: Now also looks for a license key in the Project.toml and takes that as the source of truth, if it exists.


original message

Addresses #261. It doesn’t necessarily close the issue, since this PR just does the first step, which is check if a license file exists at all. Checking the license against a list of OSI compatible licenses via licensee or other tools is left for future work.

The first commit moves a TagBot utility up to be used as a general RegistryCI utility, and the second uses it for a check if a license file exists.

This isn't done yet because it still needs a test for the license check; I'm not totally sure how to do that because it needs an actual PR body to parse and a repo to go look at, and doesn't just care about the commits to the registry itself. I suspect the TagBot tests that use mocking might be what I should look at, but they are currently broken (#322).

This is breaking because I set the license check to default to true, but we could also just set it to default to false and turn it on specifically for General, in which case it would be not breaking. (I did make sure it could be turned off since private registries may not use licenses for all their packages.)

By the way, adding new guidelines to the pipeline is super easy thanks to @GunnarFarneback's recent work!

bors bot added a commit that referenced this pull request Jan 18, 2021
bors bot added a commit that referenced this pull request Jan 18, 2021
bors bot added a commit that referenced this pull request Jan 18, 2021
@ericphanson
Copy link
Member Author

Hmm, just reread #261 (comment) and will need to change the approach here

@ericphanson ericphanson changed the title WIP: Add license check to automerge guidelines Add license check to automerge guidelines Feb 13, 2021
@ericphanson
Copy link
Member Author

@oxinabox suggested using a license field in the package's Project.toml, if it exists, to override the license detection here. That field is mentioned in passing in https://julialang.github.io/Pkg.jl/dev/glossary/ but not in https://julialang.github.io/Pkg.jl/dev/toml-files/#Project.toml.

@ericphanson ericphanson marked this pull request as ready for review February 13, 2021 21:43
@ericphanson
Copy link
Member Author

ericphanson commented Feb 13, 2021

the latest commit adds a check to see if there is a license key in the Project.toml, with either the SDPX 3.10 identifier of an OSI-approved license, or a list of such identifiers (in which case all must be OSI-approved for the check to pass, as currently written). This list can be seen here: https://github.com/ericphanson/LicenseCheck.jl/blob/main/src/OSI_LICENSES.jl. It includes things like "MIT", "GPL-2.0" etc, so it should be fairly intuitive to use.

If the Project.toml has a license key, we do not try to find a license file ourselves; i.e. we take the Project as a source of truth. This can help if the autodetection goes wrong; the solution is to just put a license identifier in the Project.toml, and then there's no way to confuse it. It's helpful to have such an opt-out because the check is done on every version, so the same package could get repeatedly flagged by accident otherwise.

That being said, I do think the autodetection should be pretty good here.

Thanks to @oxinabox for helping decide on this logic to use for this mechanism + pointing to the license key in the first place.

@ericphanson ericphanson reopened this Feb 14, 2021
bors bot added a commit that referenced this pull request Feb 14, 2021
bors bot added a commit that referenced this pull request Feb 14, 2021
@ericphanson
Copy link
Member Author

The license check requires access to the package code in order to look and see if there is a license there (and to parse the Project.toml to look for a license field). This functionality (looking at the package code) would also be needed for #313.

My previous approach (i.e. in the 2nd to last commit) was to add in a process_pkgdir callback to _run_pkg_commands that would run after the commands were run, but before the tempdirs were cleaned up. This would provide an opportunity to inspect the downloaded code. However, this approach meant that either every check that wants access to the pkg code needs to redownload it (which increases the chances of hitting a network failure etc), or all the checks that want access to the pkg code would need to be run within the same "guideline".

In the latest commit, I instead added a Ref field to GitHubAutoMergeData that would allow the Pkg.add step to populate the directory with the tempdir, and pass a cleanup callback to be called after the checks were done. That way, all checks that want access to the package code can re-use it (by accessing this field), while keeping the checks somewhat atomic.

This makes GitHubAutoMergeData stateful, which is not ideal IMO. However, at least it's somewhat local state being passed around explicitly.

@GunnarFarneback, since you designed the new guidelines loop, would you mind taking a look and letting me know what you think?

bors try

bors bot added a commit that referenced this pull request Feb 14, 2021
bors bot added a commit that referenced this pull request Feb 14, 2021
@GunnarFarneback
Copy link
Contributor

I would do it like this:

  1. Add a field for the directory in GitHubAutoMergeData and initialize it up front with mktempdir.
  2. Download the package code when needed.
  3. Don't worry about cleanup. From Julia 1.3, directories created by mktempdir clean up themselves on process exit.

@ericphanson
Copy link
Member Author

Ah ok, that sounds good. So when we download the code, we'll use that field to get the tempdir to download it to, and when we use it, we just use that directory too.

bors bot added a commit that referenced this pull request Feb 14, 2021
@JuliaRegistries JuliaRegistries deleted a comment from bors bot Feb 14, 2021
@JuliaRegistries JuliaRegistries deleted a comment from bors bot Feb 14, 2021
@JuliaRegistries JuliaRegistries deleted a comment from bors bot Feb 14, 2021
@JuliaRegistries JuliaRegistries deleted a comment from bors bot Feb 14, 2021
@JuliaRegistries JuliaRegistries deleted a comment from bors bot Feb 14, 2021
@JuliaRegistries JuliaRegistries deleted a comment from bors bot Feb 14, 2021
@giordano
Copy link
Member

giordano commented Mar 1, 2021

This PR has been open for a while. Here are the steps I would suggest @ericphanson:

  1. Remove the license = "MIT" functionality.
  2. Add the osi_license_url = "https://opensource.org/licenses/MIT" functionality. You can just hardcode the list of OSI approved licenses from the horse's mouth: opensource.org/licenses/alphabetical

Alternatively doing only 1 and trying to do 2 in a later pull request, since the check in Project.toml seems to be the most controversial part?

@DilumAluthge
Copy link
Member

This PR has been open for a while. Here are the steps I would suggest @ericphanson:

  1. Remove the license = "MIT" functionality.
  2. Add the osi_license_url = "https://opensource.org/licenses/MIT" functionality. You can just hardcode the list of OSI approved licenses from the horse's mouth: opensource.org/licenses/alphabetical

Alternatively doing only 1 and trying to do 2 in a later pull request, since the check in Project.toml seems to be the most controversial part?

Yes, I would be comfortable with this as well.

@ericphanson
Copy link
Member Author

ericphanson commented Mar 1, 2021

I am concerned that that might make for a sudden surge in manual work without a simple solution: as soon as we turn this on, new versions will need fixing or manual merging for every package that does not pass the guidelines, which includes ones with ok licenses like Torch.jl. I was hoping we could tell those packages with licenses but who don’t pass the check “just add this field and you won’t need manual merging for every version”, or some other fix that lets them go back to automatic mode.

So I’d rather not start with the most strict option and add more options later; I’d rather start with a looser option and tighten it later. To that end, what about removing the project stuff (since that’s controversial), and to compensate, also relaxing the check in two ways?

  1. Lowering the “threshold of license file recognized” requirement to zero (from an arbitrary 70%). That is, we still require that licensecheck recognizes a license in a top level file in the package, but we don’t require that licensecheck understands anything more in the file than that it has a license. (So it could be at the bottom of a readme, or have a bunch of unrecognized licenses too). This will let eg Torch.jl pass.
  2. Requiring only that one of the licenses found is OSI-approved, not all of them. That will rule out some false positives like CommonMark.jl which put a CC0 license too.

I think these two requirements will cause most of the false positives, and when we have a better way to deal with false positives, we could add them back in (since I think they do have a legitimate function).

@DilumAluthge
Copy link
Member

Sure that sounds fine.

I agree that long-term, we want to add both of those requirements back in. But we can do that after we have implemented the project override feature.

@DilumAluthge
Copy link
Member

And will be a minor frustration when you think you have a good link but then get rejected.

If this happens, then we will just manually merge the PR, and then we'll update the list of acceptable URLs in the RegistryCI.jl source code.

AutoMerge does not need to be perfect or cover every possible edge case. It just needs to cover most cases. And then the remaining cases can be manually handled.

EDIT: Also, this will only occur when both of the following occur:

1. The OSI approves a new license.

2. Someone wants to use this new license for a Julia package.

So it will be a very infrequent event.

Actually, I just realized that LicenseCheck.jl uses a hardcoded list of which SPDX identifiers are OSI-approved: https://github.com/ericphanson/LicenseCheck.jl/blob/main/src/OSI_LICENSES.jl

So, even with the license = "MIT" approach in this PR, if OSI approves a new license, we need to regenerate the hardcoded list in LicenseCheck.jl.

So that disadvantage is shared by both the license = "MIT" and the osi_license_url = "https://opensource.org/licenses/MIT" approach, right?

@ericphanson Is it correct that the list of OSI-approved licenses is hard-coded in LicenseCheck.jl?

@DilumAluthge
Copy link
Member

DilumAluthge commented Mar 1, 2021

Why don't we have a Pkg call tomorrow (2pm eastern time, IIRC) to discuss the license field in the project file?

I've posted in the #pkg-dev channel on Slack.

@DilumAluthge
Copy link
Member

@ericphanson Maybe hold off on making any changes to the PR for now? If we have a Pkg call tomorrow, and in the call it is decided that we go the license = "MIT" approach, then your PR will be good to go.

@ericphanson
Copy link
Member Author

@ericphanson Is it correct that the list of OSI-approved licenses is hard-coded in LicenseCheck.jl?

Yep! The script to do so is in the repo as well so we can update it, but it definitely does need manual updating. Additionally, the list of licenses licensecheck (the underlying Go library) recognizes is shipped with the binary, i.e. we need to manually update that as well for new licenses (hopefully upstream updates and we just need to release a new version of our Yggdrasil binary though).

Why don't we have a Pkg call tomorrow to discuss the license field in the project file?

Sure, I can probably attend that.

@ericphanson Maybe hold off on making any changes to the PR for now? If we have a Pkg call tomorrow, and in the call it is decided that we go the license = "MIT" approach, then your PR will be good to go.

Sounds good, although it probably needs integration tests and some rebasing, and I might not have time to do that until this weekend anyway.

@ericphanson
Copy link
Member Author

ericphanson commented Mar 7, 2021

Since we didn't end up having a package call, I'm thinking the next steps are to

Does that sound good?

@DilumAluthge
Copy link
Member

Yeah.

@ericphanson
Copy link
Member Author

I add some unit tests that try out the various code paths in meets_version_has_osi_license.

What needs to be tested by integration tests then is that the depot_path field gets correctly populated with a depot that we can find the package in during the guideline. So I just turned on check_license for Requires.jl since that passes currently and so should be getting Pkg.add'd successfully, so we should be able to test the codepath about populating the depot and getting the package out of the depot. Let's see if it works.

bors try

bors bot added a commit that referenced this pull request Mar 8, 2021
@ericphanson
Copy link
Member Author

bors try

bors bot added a commit that referenced this pull request Mar 8, 2021
@JuliaRegistries JuliaRegistries deleted a comment from bors bot Mar 8, 2021
@bors
Copy link
Contributor

bors bot commented Mar 8, 2021

try

Build succeeded:

@ericphanson ericphanson requested a review from DilumAluthge March 8, 2021 21:13
@ericphanson
Copy link
Member Author

ericphanson commented Mar 8, 2021

Ok, I think this is good to go from my end! Bors can't squash, right? So I should do some git cleaning-up.

Do the messages look OK? Too verbose?


In case we go back to parsing the Project.toml and my poor git skills ruin the history, here's the code I had a couple commits back:

Old code using Project.toml license field
function parse_licenses_from_project(dir)
    project_path = joinpath(dir, "Project.toml")
    if !isfile(project_path)
        project_path = joinpath(dir, "JuliaProject.toml")
    end
    isfile(project_path) || return String[]
    project = TOML.tryparsefile(project_path)
    project isa TOML.ParserError && return String[]
    project_licenses = get(project, "license", String[])

    # it should either be a string or a vector of strings;
    # if it's a string, make it a vector of strings
    if project_licenses isa AbstractString
        return String[project_licenses]
    elseif project_licenses isa AbstractVector{<:AbstractString}
        return project_licenses
    else
        # what is it?
        @error "`license` field has uncaught type" typeof(project_licenses)
        return String[]
    end
end

function check_license(dir)
    project_licenses = parse_licenses_from_project(dir)

    # only run `check_license` when we don't find
    # a license in the Project.toml
    if isempty(project_licenses)
        file_license = LicenseCheck.find_license(dir)
    else
        file_license = nothing
    end
    return (; project_licenses=project_licenses, file_license=file_license)
end


function pkgdir_from_depot(depot_path::String, pkg::String)
    pkgdirs = readdir(joinpath(depot_path, "packages", pkg); join=true)
    always_assert(length(pkgdirs) == 1)
    return pkgdirs[1]
end

function meets_version_has_osi_license(working_directory::String,
                                        pkg::String,
                                        version::VersionNumber;
                                        registry_deps::Vector{<:AbstractString} = String[],
                                        depot_path)
    pkgdir = pkgdir_from_depot(depot_path, pkg)

    if !isdir(pkgdir)
        return false, "Could not check license because could not access package code. Perhaps the `Pkg.add` step failed earlier."
    end

    license_results = check_license(pkgdir)
    project_licenses = license_results.project_licenses

    # if there's a license in the Project.toml, then we only consider that.
    if !isempty(project_licenses)
        # possibly a list of licenses were passed
        non_osi_licenses = filter(!LicenseCheck.is_osi_approved, lic.licenses)
        if isempty(non_osi_licenses)
            @info "Found license field in Project.toml with OSI-approved license identifier(s)" project_licenses
            return true, ""
        else
            non_osi_licenses_str = join(non_osi_licenses, ", ")
            @error "Found license field in Project.toml but with non-OSI-approved license identifier(s)" project_licenses non_osi_licenses_str
            return false, "Non-OSI-approved license(s) $(non_osi_licenses_str) found in Project.toml"
        end
    end

    # If we're here, then there was no `license` field in the Project.toml.

    lic = license_results.file_license
    percent_covered_threshold = 70

    if LicenseCheck.is_osi_approved(lic) && lic.license_file_percent_covered > percent_covered_threshold
        @info "Found OSI license(s)"
        return true, string("")
    # multiple failure modes:
    # 1. No license file found at all
    #    of the contents of the file, so can't be sure what the license is.
    elseif lic === nothing
        @error "Did not find license file."
        return false, "No license file found."
    # 2. License file found with OSI licenses, but did not recognize more than 30%
    elseif LicenseCheck.is_osi_approved(lic)
        @error "Found OSI-approved license(s) in toplevel file $(lic.license_filename), but the recognized licenses cover less than $(percent_covered_threshold)% of the file." licenses_found=lic.licenses_found license_file_percent_covered=lic.license_file_percent_covered
        return false, "Could not identify at least $(percent_covered_threshold)% of the contents of the license file."
    # 3. License file found, but with one or more non-OSI licenses
    else
        non_osi_licenses = filter(!LicenseCheck.is_osi_approved, lic.licenses)
        @error "Non-OSI-approved license(s) found in toplevel file $(lic.license_filename)." licenses_found=lic.licenses_found license_file_percent_covered=lic.license_file_percent_covered non_osi_licenses
        non_osi_licenses_str = join(non_osi_licenses, ", ")
        return false, "Non-OSI-approved license(s) $(non_osi_licenses_str) found in license file $(lic.license_filename)."
    end
end

Copy link
Member

@DilumAluthge DilumAluthge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go. Yeah I would squash before you bors merge.

@ericphanson
Copy link
Member Author

bors merge

@bors
Copy link
Contributor

bors bot commented Mar 8, 2021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

License checks

7 participants