Test that license texts match SPDX plain license texts #636

Open
mlinksva opened this issue Jan 13, 2019 · 7 comments

@mlinksva
Contributor

We should have a test that each license text in _licenses is the same as the plain text license in the SPDX collection to automate the requirement described at https://github.com/github/choosealicense.com/blob/gh-pages/CONTRIBUTING.md#adding-a-license

The text of the license should match the corresponding text found in spdx/license-list-data. If there are errors there, please fix them in spdx/license-list-XML (from which the plain text version is generated) so as to minimize license text variation and make it easier for choosealicense.com to eventually consume license texts directly from SPDX.

The test could clone spdx/license-list-data and compare each license we have cataloged in this project. Many existing licenses would probably have to be marked as expected failures due to bugs in SPDX output and discrepancies in how this project has cataloged some licenses. But we should address upfront for any new license cataloged here, and continue to chip away at the existing inconsistencies.
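
A minimal sketch of such a test in Python, assuming a local checkout of spdx/license-list-data whose `text/` directory holds one `<SPDX-ID>.txt` file per license, and assuming each file in `_licenses` starts with a YAML front matter block containing an `spdx-id:` key. The function names and the `expected_failures` parameter are hypothetical, not part of this project's existing test suite.

```python
import re
from pathlib import Path

# Assumed layout: a leading '---' ... '---' YAML block with an 'spdx-id:' key.
FRONT_MATTER = re.compile(r"\A---\s*\n(.*?)\n---\s*\n", re.DOTALL)
SPDX_ID = re.compile(r"^spdx-id:\s*(\S+)\s*$", re.MULTILINE)


def split_license_file(raw):
    """Return (spdx_id, body); spdx_id is None if no front matter id is found."""
    match = FRONT_MATTER.match(raw)
    if not match:
        return None, raw
    id_match = SPDX_ID.search(match.group(1))
    spdx_id = id_match.group(1) if id_match else None
    return spdx_id, raw[match.end():]


def normalize(text):
    """Collapse whitespace runs so that line wrapping alone is not a mismatch."""
    return " ".join(text.split())


def unexpected_mismatches(licenses_dir, spdx_text_dir, expected_failures=()):
    """List SPDX ids that differ from (or are missing in) the SPDX text
    directory and are not marked as expected failures."""
    failures = []
    for path in sorted(Path(licenses_dir).glob("*.txt")):
        spdx_id, body = split_license_file(path.read_text(encoding="utf-8"))
        if spdx_id is None or spdx_id in expected_failures:
            continue
        spdx_file = Path(spdx_text_dir) / f"{spdx_id}.txt"
        if not spdx_file.exists():
            failures.append(spdx_id)
        elif normalize(body) != normalize(spdx_file.read_text(encoding="utf-8")):
            failures.append(spdx_id)
    return failures
```

A test would then assert that `unexpected_mismatches("_licenses", "license-list-data/text", expected_failures=known)` returns an empty list, where `known` is the set of licenses not yet reconciled. Placeholder differences between the two collections would still count as mismatches in this sketch, so they would either go into `known` or need an extra normalization step.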

@mlinksva mlinksva changed the title Add test tat license texts match SPDX plain license texts Test that license texts match SPDX plain license texts Jan 13, 2019
@travi

travi commented Apr 13, 2019

The latest SPDX version changes the text of the MIT license slightly, compared to the version currently on choosealicense.com. Do you have plans for how you want to handle old and new versions of licenses that change over time?

I use spdx-license-list when scaffolding new projects and the latest version updates its list to v3.4, which includes the change. Since updating to this version my new projects show unrecognized licenses, such as this one.

@mlinksva
Contributor Author

mlinksva commented Apr 13, 2019

@travi thanks for pointing that out. The change is the optional text added at spdx/license-list-XML@ca17b91#diff-a3960b442eb635386ec51a5d6d15af2d

For better or worse SPDX doesn't AFAIK distinguish between optional but not usual and optional but preferred text, and outputs all optional text in https://github.com/spdx/license-list-data/blob/2d27e4c31441af8f343eba0293d03d27707d9c02/text/MIT.txt

I don't think we can or should move to an MIT text that includes the optional text here. We can't, because that would cause license detection problems for most existing MIT licenses, given the way licensee (which GitHub uses) is tied to the texts curated here (choosealicense.com). We shouldn't, because I'd rather encourage adoption of the most widely used text, which doesn't include the added optional text.

There are tons of variations on the MIT text; I've linked a paper about that a few times.

I don't have a plan to implement, but here's what I'd like to see:

  • choosealicense.com (this repo) has texts reflecting most common/best practice use, to discourage variation/encourage best practice
  • licensee and other license detection tools handle variation (and additional licenses); a couple of obvious strategies:
    • match with knowledge of SPDX markup, including optional sections
    • take advantage of human curation, e.g. license texts corresponding to packages that have been identified as MIT or another particular license, e.g. in ClearlyDefined

Since updating to this version my new projects show unrecognized licenses

I would recommend not including the optional text now published by SPDX.

For anyone who insists on doing that: yes, GitHub will detect that there is a license, but not which one, and will show "View license" rather than "MIT".

Presently the only way for licensee to deal with optional text is to normalize it away before matching, but I think it would need to be a super common variation to justify doing that. Feel free to open an issue in licensee/licensee if you want to pursue there.
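
To illustrate what "normalize it away before matching" could look like, here is a rough Python sketch. It is not licensee's implementation (licensee is a Ruby library); the `OPTIONAL_PHRASES` list and both function names are hypothetical, and the phrase in the list is only an illustrative stand-in for whatever optional text is being stripped.

```python
# Hypothetical list of optional phrases to drop before comparing texts.
OPTIONAL_PHRASES = ["(including the next paragraph)"]


def normalize(text, optional_phrases=OPTIONAL_PHRASES):
    """Lowercase, collapse whitespace, then remove known optional phrases."""
    # Collapse whitespace first so a phrase wrapped across lines still matches.
    collapsed = " ".join(text.split()).lower()
    for phrase in optional_phrases:
        collapsed = collapsed.replace(phrase.lower(), " ")
    # Removing a phrase can leave a double space behind; collapse again.
    return " ".join(collapsed.split())


def same_license(candidate, reference):
    """True if the two texts are equal once optional text is normalized away."""
    return normalize(candidate) == normalize(reference)
```

The trade-off is the one described above: every phrase added to such a list makes matching looser, so it only seems worth doing for variations that are very common in practice.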

@reversi-fun

reversi-fun commented Jun 20, 2019

My tool may help with this issue.
I tried using it to compare the license texts here with the SPDX plain license texts.

The two enhancements will make it easier to test continuously.

My tool can output the degree of similarity between documents and the number of words using a library called gensim.
For example, we could automatically find out the similarity of the license text below.
{spdx/LGPL-3.0-only, spdx/LGPL-3.0-or-later,research/choosealicense.com-gh-pages/_licenses/lgpl-3.0}
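
As a rough illustration of the two numbers reported here (a similarity score and a word-count difference), the sketch below uses Python's standard `difflib` rather than gensim, so its scores are not comparable to the tool's. The function name `compare_texts` is hypothetical.

```python
from difflib import SequenceMatcher


def compare_texts(a, b):
    """Return (similarity ratio in [0, 1], word count of a minus word count of b)."""
    words_a, words_b = a.split(), b.split()
    # autojunk=False avoids the heuristic that treats frequent words as junk
    # in long sequences, which would distort scores for license-length texts.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio(), len(words_a) - len(words_b)
```

Calling `compare_texts` on the body of a file in `_licenses` and on the matching SPDX text gives a quick signal of how far apart the two are, without saying where they differ.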

Currently there is no spdx/LGPL-3.0.
You can still find the SPDX IDs with similar license texts, even when a file name such as _licenses/lgpl-3.0.txt does not correspond to an SPDX ID.

For example, the similarity between the following two license files was 0.796,
and the difference in word count was +130.
[image: lic-lgpl3]

The difference in word count comes from the header section.
My tool marks in red any license whose text contains the word "PATENT".
The file _licenses/lgpl-3.0
contains "Contributors provide an express grant of patent rights" in its header description section.
The box above (_licenses/lgpl-3.0) would not have been marked red if only the plain text, without the header, had been entered.

The comparison results for all other license texts are as follows.
https://github.com/reversi-fun/license_doc_similality1/blob/master/data/lic_graph.fdp.svg

You can confirm that the texts in choosealicense.com-gh-pages/_licenses and the SPDX plain license texts are all similar with the following search.

  • Download the file and open it in a browser.
  • Press CTRL-F and search for the string "research/choosealicense.com-gh-pages/_licenses/".
  • A license with no similar counterpart appears isolated from the blue dotted box, as shown below.
    [image: lic-ngpl]

@mlinksva
Contributor Author

mlinksva commented May 5, 2020

@darkmorpher licensee can recognize both the GNU hosted text and SPDX version. You're pointing to a non-master branch in the MT repo. There is no license or copying file in the master branch in the root, that's why no license is detected. If you find a bug that you can reproduce in licensee, please open an issue in the licensee repo.

@sschuberth

We should have a test that each license text in _licenses is the same as the plain text license in the SPDX collection

IMO this is not a desirable goal as long as SPDX tampers with the original plain text version (if any) of a license, also see https://github.com/spdx/license-list-data/issues/44. This is because SPDX does not take a plain-text license as-is, but regenerates it from its own XML representation (as also described in the original post of this issue).

@mlinksva
Contributor Author

@sschuberth yes, I'm well aware of that. As I've written before (but am too lazy to search for now), I'd love to see the SPDX plaintext renderings be as close as possible to the canonical plain text versions of licenses, and have over the years contributed a few small fixes toward that. As I wrote in the issue comment above:

Many existing licenses would probably have to be marked as expected failures due to bugs in SPDX output and discrepancies in how this project has cataloged some licenses. But we should address upfront for any new license cataloged here, and continue to chip away at the existing inconsistencies.

@darkmorpher

darkmorpher commented Oct 12, 2022

RE: @mlinksva

#636 (comment)

(If still an issue) As a test case, can you add one of these GitHub Actions to compare the plaintext license and SPDX data files in a new branch? Granted, all required files will have to be copied there too, and the repo will end up with duplicate files.


(Previous discussion, may no longer apply)
I should mention that the slightest change in the text (formatting or punctuation) trips up licensee/licensee and causes recognition to fail.

Edit: per reply, added/UPDATED example:
GNU hosted text: https://www.gnu.org/licenses/gpl-3.0.txt
SPDX version: https://raw.githubusercontent.com/spdx/license-list-data/master/text/GPL-3.0-only.txt

GNU hosted text is unrecognized by licensee, as seen here: repository master

Essentially anyone grabbing a GPL license copy from the GNU site will have this issue (especially GitHub imported/mirrored repositories hosted on GNU Savannah's git repository server)

This is essentially different from adding an attribution header, where a more complex detection method is needed.

Related: licensee/licensee/issues/387 | licensee/licensee/issues/416


6 participants
@mlinksva @travi @sschuberth @reversi-fun @darkmorpher and others