Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify whether base64 encoding of license content is really optional #382

Open
ben-sag opened this issue Feb 19, 2024 · 15 comments
Open

Clarify whether base64 encoding of license content is really optional #382

ben-sag opened this issue Feb 19, 2024 · 15 comments

Comments

@ben-sag
Copy link

ben-sag commented Feb 19, 2024

The specs (incl 1.4) are unclear about whether it is mandatory or optional to use base64 encoding for license text's "content".

The fact the doc for "encoding" states it "must be one of"... "base64" makes me think it's mandatory to use base64, but the fact it also says "Specifies the optional encoding" makes me think maybe it's optional (e.g. if I don't specify an encoding parameter maybe it's treated as normal unencoded text). I looked at the examples but there is none that have content without encoding.

Can you clarify, and in the next spec put something into the doc of the "content" field to make clear whether plain unencoded text is permitted? And maybe something in the examples to demonstrate unencoded text.

(btw I'm very much hoping that use of base64 is not mandatory for this field - many licenses make it legally necessary to include the full license text and the SBOM is a very natural place to put that... but I doubt a law court would accept a base64 encoding of the text as meeting that requirement)

@mrutkows
Copy link
Contributor

mrutkows commented Feb 19, 2024

If you look at the spec./schema (i.e., specifically, https://cyclonedx.org/docs/1.5/json/#metadata_component_licenses_oneOf_i0_items_license_text_content) the "content" field is the ONLY required field in the object. It is of type string with the description "The attachment data. Proactive controls such as input validation and sanitization should be employed to prevent misuse of attachment text." which means it can contain ANY "text" data including (sanitized if possible) plain text. In fact, the stated default value for "contentType" is "text/plain".

If the "content" is optionally "encoded", then the ONLY value allowed for encoding is "base64" which means the "content" would need to be decoded prior to taking any further action using the "contentType" value in order to properly interpret the decoded "text".

In short, feel free to simply fill in the "content" with plain text. However, from an automated processing perspective, base64 encoding is helpful IMO to avoid transmission (whitespace) errors as well as avoids hidden/embedded control characters or functions.


For further clarity, the "one of" language assures that schema will ONLY allow "one of" the values indicated; at this time, ONLY a single value is allowed (only if that field is used) which would be "base64". Again, limiting the encoding types, esp. to a single value, helps in automated processing so that no unexpected/unclear encoding values appear.

@ben-sag
Copy link
Author

ben-sag commented Feb 20, 2024

Thanks that's very helpful! I'm really glad non-encoded is an option.

I did quite carefully read all the spec (including the part you quoted) yesterday - multiple times :) - and couldn't convince myself 100% which way to interpret it (I found reasons for both possible interpretations :) ), so a brief remark in the next spec to make it explicit would be super helpful to avoid possible confusion. Perhaps add an example where the encode= is not specified. And in the "encoding" parameter you could state that the behaviour if the encoding parameter is not specified i.e. that the text is not encoded . (I realise that might sound too obvious to someone who already knows the intention of the spec, but the text stating it must be one of "base64" sent me off on a different interpretation. (I was also looking at this thread CycloneDX/cyclonedx-core-java#100 where it looks like someone else concluded base64 was mandatory)

sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Feb 20, 2024
Base64-encoding of license texts is optional [1]. As it results in
larger and less readable files, disable it.

[1]: CycloneDX/specification#382

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
@ben-sag
Copy link
Author

ben-sag commented Feb 20, 2024

We're having a debate about this on CycloneDX/cyclonedx-core-java#100 where it seems the main Java library for CycloneDX has taken a different interpretation of the encoding question.

sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Feb 20, 2024
Base64-encoding of license texts is optional [1]. As it results in
larger and less readable files, disable it.

[1]: CycloneDX/specification#382

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
@jkowalleck
Copy link
Member

jkowalleck commented Feb 20, 2024

PS/EDIT: I learned I was wrong, the following is not true:

encoding is optional with a default value of base64

nope. no default value anymore, this was fixed.
therefore, the following statements of mine are incorrect.



at the state of CycloneDX 1.5 and before:

  • fact: in schema specs, encoding is optional with a default value of base64
  • meaning: in data transport, the field encoding is optional, its value is always base64. In result, the content value MUST be base64 encoded.

In addition, the contentType describes the content before it was encoded.

example: https://cyclonedx.org/use-cases/#provenance

{
          "contentType": "text/xml",
          "content": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiID8+CjxTb2Z0d2FyZUlkZW50aXR5IHhtbDpsYW5nPSJFTiIgbmFtZT0iQ3J5cHRvIExpYnJhcnkiIHZlcnNpb249IjMuMC4wIiAKIHZlcnNpb25TY2hlbWU9Im11bHRpcGFydG51bWVyaWMiIAogdGFnSWQ9InN3aWRnZW4tNWRjYjc5YWYtYTFkMi02MWIzLTM0ZmQtNTM2YzUzYjA4ODEwXzMuMC4wIiAKIHhtbG5zPSJodHRwOi8vc3RhbmRhcmRzLmlzby5vcmcvaXNvLzE5NzcwLy0yLzIwMTUvc2NoZW1hLnhzZCI+IAogeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSIgCiB4c2k6c2NoZW1hTG9jYXRpb249Imh0dHA6Ly9zdGFuZGFyZHMuaXNvLm9yZy9pc28vMTk3NzAvLTIvMjAxNS1jdXJyZW50L3NjaGVtYS54c2Qgc2NoZW1hLnhzZCIgPgogIDxNZXRhIGdlbmVyYXRvcj0iU1dJRCBUYWcgT25saW5lIEdlbmVyYXRvciB2MC4xIiAvPiAKICA8RW50aXR5IG5hbWU9IkV4YW1wbGUsIEluYy4iIHJlZ2lkPSJleGFtcGxlLmNvbSIgcm9sZT0idGFnQ3JlYXRvciIgLz4gCjwvU29mdHdhcmVJZGVudGl0eT4="
        }

In short, feel free to simply fill in the "content" with plain text.

this is wrong, as stated before.

@ben-sag
Copy link
Author

ben-sag commented Feb 20, 2024

Aha, well that's a shame. But thanks for clarifying, much appreciated!

Given several of us have come to different conclusions about what the current spec means it's certainly valuable to make the wording in the next spec more explicit.

But I hope you would also consider supporting human-readable/non-base64 encoding in future - the spec allows for other legal texts such as copyrights to be human-readable, and the lack of supporting this for custom license texts makes the format unusable as a machine-readable way to comply with legal requirements.

Even though I much prefer CycloneDX for all other aspects, it looks like I will have so switch over to SPDX SBOMs unfortunately :(

@jkowalleck
Copy link
Member

jkowalleck commented Feb 20, 2024

re #382 (comment)

But I hope you would also consider supporting human-readable/non-base64 encoding in future - the spec allows for other legal texts such as copyrights to be human-readable, and the lack of supporting this for custom license texts makes the format unusable as a machine-readable way to comply with legal requirements.

@ben-sag ,
what do you mean? you can base64 any data, like a PDF contentType=application/pdf, or a plain text contentType=text/plain.
This means, you clearly can transport an arbitrary copyright test via CycloneDX, like so:

{content": "KGMpIDIwMjQgSmFuZSBEb2Uu", "encoding": "base64"}

@ben-sag
Copy link
Author

ben-sag commented Feb 20, 2024

Yes you can include any text, but it's not human readable.... so although you're 99% of the way to having a really useful file format that can both list the components and fulfil the legal obligations (e.g. to publish license texts), the fact it's mandated to be in a non-human readable format (base64) stops the second use case. It's fine to allow base64 but forcing it as the only option is a surprising and imho unnecessary limitation. (Especially since you don't do that for other extracted text like copyrights but only license text).

@jkowalleck
Copy link
Member

jkowalleck commented Feb 20, 2024

Yes you can include any text, but it's not human readable....

let me ask it this way: would you give a JSON file to your lawyer? and would they open it as a pure text file and read the raw text? literally give a lawyer this example https://raw.githubusercontent.com/CycloneDX/bom-examples/master/SBOM/juice-shop/via_npm/bare/bom.1.4.json
(Who with their right mind would read a 60 thousand lines long structured data document? You would use proper tooling for this, like a program that displays the relevant data in a human-appealing way, right?)

the fact it's mandated to be in a non-human readable format (base64) stops the second use case.

This is during transport. Nothing prevents you from base64-decoding the text, before handing it to some person.

[...] don't do that for other extracted text like copyrights but only license text).

The field you are referring to is of type "attachment", not "text".
An "attachment" may be not only a license text; this could also be cryptographic assets, licensing information in a proprietary data format, some binary blob as an evidence for automation-based concluded data, ...

Please understand, that JSON and XML might be human-readable, but are not intended to be consumed by humans. They are data interchange formats for the digital age.
These data are ingested by computer applications, like the web browser that you are currently using, which, in fact, received some data from github web servers, which was probably in no human-readable form during transport, and still, your web browser is able to display some text here ;-)

@jkowalleck
Copy link
Member

jkowalleck commented Feb 20, 2024

@ben-sag I checked again and I was wrong with my original statement.
As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore.
Meaning this would be a valid attachment:

{"content": "(c) 2024 Jane Doe"}

sschuberth added a commit to oss-review-toolkit/ort that referenced this issue Feb 20, 2024
This reverts commit 2300487 as only setting the `encoding` field is
optional, but `content` indeed always has to be Base64-encoded [1].

[1]: CycloneDX/specification#382 (comment)

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
@sschuberth
Copy link

As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore.

Hmm. Does that mean that depending on the specification / schema level chosen, a user of the Java library needs to manually encode e.g. license texts?

@stevespringett
Copy link
Member

IMO, adding a method to the Java library that will encode/decode the license text would be a good feature to add.

@sschuberth
Copy link

adding a method to the Java library that will encode/decode the license text

IMO, just adding such a function is not of much value, the Java code for that already is a one-liner. Instead, the library should automatically Base64-encode whatever is required to be Base64-encoded.

@ben-sag
Copy link
Author

ben-sag commented Feb 20, 2024

As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore.

Hmm. Does that mean that depending on the specification / schema level chosen, a user of the Java library needs to manually encode e.g. license texts?

I couldn't see any change to the spec - from v1.2 right through to v1.5 the encoding field is marked as "optional".

So the Java library should not be doing any automatic transformations but just passing through the text unchanged.

I do think given the confusion on this thread there's a very strong case now for a small change making it clear/explicit in the spec whether non-base64 content is permitted (sounds like it is?).

@mrutkows
Copy link
Contributor

mrutkows commented Feb 22, 2024

nope. no default value anymore, this was fixed.
therefore, the following statements of mine are incorrect.

I believe (and asking as well) what Jan is saying, being the gatekeeper for v1.6, is that v1.6 it is no longer "optional"???

However, the current JSON schema still indicates "text" is of type "attachment" which is still described as follows (links to current/committed v1.6 json schema):

   "attachment": {
      "type": "object",
      "title": "Attachment",
      "description": "Specifies the metadata and content for an attachment.",
      "required": [
        "content"
      ],
      ...
      "properties": {
        "contentType": {
          "type": "string",
          "title": "Content-Type",
          "description": "Specifies the content type of the text. Defaults to text/plain if not specified.",
          "default": "text/plain"
        },
        "encoding": {
          "type": "string",
          "title": "Encoding",
          "description": "Specifies the optional encoding the text is represented in.",
          "enum": [
            "base64"
          ],
          "meta:enum": {
            "base64": "Base64 is a binary-to-text encoding scheme that represents binary data in an ASCII string."
          }
        },
        "content": {
          "type": "string",
          "title": "Attachment Text",
          "description": "The attachment data. Proactive controls such as input validation and sanitization should be employed to prevent misuse of attachment text."
        }
      }
    },

which still has the only "required" field being "content" (i.e., "encoding" is not required) ... and the description of "encoding" still has "optional".

Perhaps there is an outstanding (not merged) pull request (PR) that has this projected change?

@jkowalleck
Copy link
Member

jkowalleck commented Feb 22, 2024

TLDR;
Event with CDX 1.6, attachment encoding is not required, it is optional --

"required": [
"content"
],

Therefore, omitting the "encoding" field means no encoding is applied, as intended.


I believe (and asking as well) what Jan is saying, being the gatekeeper for v1.6, is that v1.6 it is no longer "optional"???

I was referring to a state when we had a schema issue during CycloneDX 1.3 JSON version only, when there was a default value for the "encoding" -- which was an unintended thing and was fixed in the CycloneDX JSON schema a while ago. This affected only JSON documents that were validated by the schema.
This memory caused my confusion. Maybe by the fact that it drove me mad for some days, back then ... 😁

The actual CycloneDX specification never had a default value for the attachment's "encoding" field.

jkowalleck added a commit to CycloneDX/cyclonedx.org that referenced this issue Aug 20, 2024
caused by
CycloneDX/specification#382 (comment)
> [...] add an example where the encode= is not specified [...]

we have two examples of encoded text attachments in the use cases
already.

this PR makes one of them an unencoded text attachment. 
this way we have both use cases shown.

since the MIT license us not a template, but an actual text with actual
values in it,
and since it is pretty short,
I used an example with MIT license.

in contrast - the Apache license is a template - it must not be modified
and therefore it makes no practical since to have it as an attachment
...

---------

Signed-off-by: Jan Kowalleck <jan.kowalleck@gmail.com>
Signed-off-by: Jan Kowalleck <jan.kowalleck@owasp.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants