Clarify whether base64 encoding of license content is really optional #382

ben-sag · 2024-02-19T19:24:14Z

The specs (incl 1.4) are unclear about whether it is mandatory or optional to use base64 encoding for license text's "content".

The fact the doc for "encoding" states it "must be one of"... "base64" makes me think it's mandatory to use base64, but the fact it also says "Specifies the optional encoding" makes me think maybe it's optional (e.g. if I don't specify an encoding parameter maybe it's treated as normal unencoded text). I looked at the examples but there is none that have content without encoding.

Can you clarify, and in the next spec put something into the doc of the "content" field to make clear whether plain unencoded text is permitted? And maybe something in the examples to demonstrate unencoded text.

(btw I'm very much hoping that use of base64 is not mandatory for this field - many licenses make it legally necessary to include the full license text and the SBOM is a very natural place to put that... but I doubt a law court would accept a base64 encoding of the text as meeting that requirement)

mrutkows · 2024-02-19T22:23:10Z

If you look at the spec./schema (i.e., specifically, https://cyclonedx.org/docs/1.5/json/#metadata_component_licenses_oneOf_i0_items_license_text_content) the "content" field is the ONLY required field in the object. It is of type string with the description "The attachment data. Proactive controls such as input validation and sanitization should be employed to prevent misuse of attachment text." which means it can contain ANY "text" data including (sanitized if possible) plain text. In fact, the stated default value for "contentType" is "text/plain".

If the "content" is optionally "encoded", then the ONLY value allowed for encoding is "base64" which means the "content" would need to be decoded prior to taking any further action using the "contentType" value in order to properly interpret the decoded "text".

In short, feel free to simply fill in the "content" with plain text. However, from an automated processing perspective, base64 encoding is helpful IMO to avoid transmission (whitespace) errors as well as avoids hidden/embedded control characters or functions.

For further clarity, the "one of" language assures that schema will ONLY allow "one of" the values indicated; at this time, ONLY a single value is allowed (only if that field is used) which would be "base64". Again, limiting the encoding types, esp. to a single value, helps in automated processing so that no unexpected/unclear encoding values appear.

ben-sag · 2024-02-20T10:54:45Z

Thanks that's very helpful! I'm really glad non-encoded is an option.

I did quite carefully read all the spec (including the part you quoted) yesterday - multiple times :) - and couldn't convince myself 100% which way to interpret it (I found reasons for both possible interpretations :) ), so a brief remark in the next spec to make it explicit would be super helpful to avoid possible confusion. Perhaps add an example where the encode= is not specified. And in the "encoding" parameter you could state that the behaviour if the encoding parameter is not specified i.e. that the text is not encoded . (I realise that might sound too obvious to someone who already knows the intention of the spec, but the text stating it must be one of "base64" sent me off on a different interpretation. (I was also looking at this thread CycloneDX/cyclonedx-core-java#100 where it looks like someone else concluded base64 was mandatory)

Base64-encoding of license texts is optional [1]. As it results in larger and less readable files, disable it. [1]: CycloneDX/specification#382 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>

ben-sag · 2024-02-20T12:02:55Z

We're having a debate about this on CycloneDX/cyclonedx-core-java#100 where it seems the main Java library for CycloneDX has taken a different interpretation of the encoding question.

Base64-encoding of license texts is optional [1]. As it results in larger and less readable files, disable it. [1]: CycloneDX/specification#382 Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>

jkowalleck · 2024-02-20T12:50:52Z

PS/EDIT: I learned I was wrong, the following is not true:

encoding is optional with a default value of base64

nope. no default value anymore, this was fixed.
therefore, the following statements of mine are incorrect.

at the state of CycloneDX 1.5 and before:

fact: in schema specs, encoding is optional with a default value of base64
meaning: in data transport, the field encoding is optional, its value is always base64. In result, the content value MUST be base64 encoded.

In addition, the contentType describes the content before it was encoded.

example: https://cyclonedx.org/use-cases/#provenance

{
          "contentType": "text/xml",
          "content": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiID8+CjxTb2Z0d2FyZUlkZW50aXR5IHhtbDpsYW5nPSJFTiIgbmFtZT0iQ3J5cHRvIExpYnJhcnkiIHZlcnNpb249IjMuMC4wIiAKIHZlcnNpb25TY2hlbWU9Im11bHRpcGFydG51bWVyaWMiIAogdGFnSWQ9InN3aWRnZW4tNWRjYjc5YWYtYTFkMi02MWIzLTM0ZmQtNTM2YzUzYjA4ODEwXzMuMC4wIiAKIHhtbG5zPSJodHRwOi8vc3RhbmRhcmRzLmlzby5vcmcvaXNvLzE5NzcwLy0yLzIwMTUvc2NoZW1hLnhzZCI+IAogeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSIgCiB4c2k6c2NoZW1hTG9jYXRpb249Imh0dHA6Ly9zdGFuZGFyZHMuaXNvLm9yZy9pc28vMTk3NzAvLTIvMjAxNS1jdXJyZW50L3NjaGVtYS54c2Qgc2NoZW1hLnhzZCIgPgogIDxNZXRhIGdlbmVyYXRvcj0iU1dJRCBUYWcgT25saW5lIEdlbmVyYXRvciB2MC4xIiAvPiAKICA8RW50aXR5IG5hbWU9IkV4YW1wbGUsIEluYy4iIHJlZ2lkPSJleGFtcGxlLmNvbSIgcm9sZT0idGFnQ3JlYXRvciIgLz4gCjwvU29mdHdhcmVJZGVudGl0eT4="
        }

In short, feel free to simply fill in the "content" with plain text.

this is wrong, as stated before.

ben-sag · 2024-02-20T12:58:04Z

Aha, well that's a shame. But thanks for clarifying, much appreciated!

Given several of us have come to different conclusions about what the current spec means it's certainly valuable to make the wording in the next spec more explicit.

But I hope you would also consider supporting human-readable/non-base64 encoding in future - the spec allows for other legal texts such as copyrights to be human-readable, and the lack of supporting this for custom license texts makes the format unusable as a machine-readable way to comply with legal requirements.

Even though I much prefer CycloneDX for all other aspects, it looks like I will have so switch over to SPDX SBOMs unfortunately :(

jkowalleck · 2024-02-20T13:04:23Z

re #382 (comment)

But I hope you would also consider supporting human-readable/non-base64 encoding in future - the spec allows for other legal texts such as copyrights to be human-readable, and the lack of supporting this for custom license texts makes the format unusable as a machine-readable way to comply with legal requirements.

@ben-sag ,
what do you mean? you can base64 any data, like a PDF contentType=application/pdf, or a plain text contentType=text/plain.
This means, you clearly can transport an arbitrary copyright test via CycloneDX, like so:

{content": "KGMpIDIwMjQgSmFuZSBEb2Uu", "encoding": "base64"}

ben-sag · 2024-02-20T13:07:59Z

Yes you can include any text, but it's not human readable.... so although you're 99% of the way to having a really useful file format that can both list the components and fulfil the legal obligations (e.g. to publish license texts), the fact it's mandated to be in a non-human readable format (base64) stops the second use case. It's fine to allow base64 but forcing it as the only option is a surprising and imho unnecessary limitation. (Especially since you don't do that for other extracted text like copyrights but only license text).

jkowalleck · 2024-02-20T13:33:05Z

Yes you can include any text, but it's not human readable....

let me ask it this way: would you give a JSON file to your lawyer? and would they open it as a pure text file and read the raw text? literally give a lawyer this example https://raw.githubusercontent.com/CycloneDX/bom-examples/master/SBOM/juice-shop/via_npm/bare/bom.1.4.json
(Who with their right mind would read a 60 thousand lines long structured data document? You would use proper tooling for this, like a program that displays the relevant data in a human-appealing way, right?)

the fact it's mandated to be in a non-human readable format (base64) stops the second use case.

This is during transport. Nothing prevents you from base64-decoding the text, before handing it to some person.

[...] don't do that for other extracted text like copyrights but only license text).

The field you are referring to is of type "attachment", not "text".
An "attachment" may be not only a license text; this could also be cryptographic assets, licensing information in a proprietary data format, some binary blob as an evidence for automation-based concluded data, ...

Please understand, that JSON and XML might be human-readable, but are not intended to be consumed by humans. They are data interchange formats for the digital age.
These data are ingested by computer applications, like the web browser that you are currently using, which, in fact, received some data from github web servers, which was probably in no human-readable form during transport, and still, your web browser is able to display some text here ;-)

jkowalleck · 2024-02-20T14:00:59Z

@ben-sag I checked again and I was wrong with my original statement.
As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore.
Meaning this would be a valid attachment:

{"content": "(c) 2024 Jane Doe"}

This reverts commit 2300487 as only setting the `encoding` field is optional, but `content` indeed always has to be Base64-encoded [1]. [1]: CycloneDX/specification#382 (comment) Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>

sschuberth · 2024-02-20T15:09:26Z

As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore.

Hmm. Does that mean that depending on the specification / schema level chosen, a user of the Java library needs to manually encode e.g. license texts?

stevespringett · 2024-02-20T15:39:09Z

IMO, adding a method to the Java library that will encode/decode the license text would be a good feature to add.

sschuberth · 2024-02-20T15:56:42Z

adding a method to the Java library that will encode/decode the license text

IMO, just adding such a function is not of much value, the Java code for that already is a one-liner. Instead, the library should automatically Base64-encode whatever is required to be Base64-encoded.

ben-sag · 2024-02-20T16:30:50Z

As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore.

Hmm. Does that mean that depending on the specification / schema level chosen, a user of the Java library needs to manually encode e.g. license texts?

I couldn't see any change to the spec - from v1.2 right through to v1.5 the encoding field is marked as "optional".

So the Java library should not be doing any automatic transformations but just passing through the text unchanged.

I do think given the confusion on this thread there's a very strong case now for a small change making it clear/explicit in the spec whether non-base64 content is permitted (sounds like it is?).

mrutkows · 2024-02-22T18:42:39Z

nope. no default value anymore, this was fixed.
therefore, the following statements of mine are incorrect.

I believe (and asking as well) what Jan is saying, being the gatekeeper for v1.6, is that v1.6 it is no longer "optional"???

However, the current JSON schema still indicates "text" is of type "attachment" which is still described as follows (links to current/committed v1.6 json schema):

   "attachment": {
      "type": "object",
      "title": "Attachment",
      "description": "Specifies the metadata and content for an attachment.",
      "required": [
        "content"
      ],
      ...
      "properties": {
        "contentType": {
          "type": "string",
          "title": "Content-Type",
          "description": "Specifies the content type of the text. Defaults to text/plain if not specified.",
          "default": "text/plain"
        },
        "encoding": {
          "type": "string",
          "title": "Encoding",
          "description": "Specifies the optional encoding the text is represented in.",
          "enum": [
            "base64"
          ],
          "meta:enum": {
            "base64": "Base64 is a binary-to-text encoding scheme that represents binary data in an ASCII string."
          }
        },
        "content": {
          "type": "string",
          "title": "Attachment Text",
          "description": "The attachment data. Proactive controls such as input validation and sanitization should be employed to prevent misuse of attachment text."
        }
      }
    },

which still has the only "required" field being "content" (i.e., "encoding" is not required) ... and the description of "encoding" still has "optional".

Perhaps there is an outstanding (not merged) pull request (PR) that has this projected change?

jkowalleck · 2024-02-22T22:37:44Z

TLDR;
Event with CDX 1.6, attachment encoding is not required, it is optional --

specification/schema/bom-1.6.schema.json

Lines 1128 to 1130 in 2873c49

    
           "required": [ 
        
             "content" 
        
           ],

Therefore, omitting the "encoding" field means no encoding is applied, as intended.

I believe (and asking as well) what Jan is saying, being the gatekeeper for v1.6, is that v1.6 it is no longer "optional"???

I was referring to a state when we had a schema issue during CycloneDX 1.3 JSON version only, when there was a default value for the "encoding" -- which was an unintended thing and was fixed in the CycloneDX JSON schema a while ago. This affected only JSON documents that were validated by the schema.
This memory caused my confusion. Maybe by the fact that it drove me mad for some days, back then ... 😁

The actual CycloneDX specification never had a default value for the attachment's "encoding" field.

caused by CycloneDX/specification#382 (comment) > [...] add an example where the encode= is not specified [...] we have two examples of encoded text attachments in the use cases already. this PR makes one of them an unencoded text attachment. this way we have both use cases shown. since the MIT license us not a template, but an actual text with actual values in it, and since it is pretty short, I used an example with MIT license. in contrast - the Apache license is a template - it must not be modified and therefore it makes no practical since to have it as an attachment ... --------- Signed-off-by: Jan Kowalleck <jan.kowalleck@gmail.com> Signed-off-by: Jan Kowalleck <jan.kowalleck@owasp.org>

sschuberth mentioned this issue Feb 20, 2024

chore(cyclonedx): Disable Base64-encoding of license texts oss-review-toolkit/ort#8314

Merged

jkowalleck mentioned this issue Feb 20, 2024

Unable to disable optional Base64-encoding of license texts CycloneDX/cyclonedx-core-java#100

Closed

sschuberth mentioned this issue Feb 20, 2024

Revert "chore(cyclonedx): Disable Base64-encoding of license texts" oss-review-toolkit/ort#8316

Closed

jkowalleck mentioned this issue Feb 20, 2024

use case: unencoded plaintext license text attachment CycloneDX/cyclonedx.org#280

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify whether base64 encoding of license content is really optional #382

Clarify whether base64 encoding of license content is really optional #382

ben-sag commented Feb 19, 2024

mrutkows commented Feb 19, 2024 •

edited

Loading

ben-sag commented Feb 20, 2024

ben-sag commented Feb 20, 2024

jkowalleck commented Feb 20, 2024 •

edited

Loading

ben-sag commented Feb 20, 2024

jkowalleck commented Feb 20, 2024 •

edited

Loading

ben-sag commented Feb 20, 2024

jkowalleck commented Feb 20, 2024 •

edited

Loading

jkowalleck commented Feb 20, 2024 •

edited

Loading

sschuberth commented Feb 20, 2024

stevespringett commented Feb 20, 2024

sschuberth commented Feb 20, 2024

ben-sag commented Feb 20, 2024 •

edited

Loading

mrutkows commented Feb 22, 2024 •

edited

Loading

jkowalleck commented Feb 22, 2024 •

edited

Loading

Clarify whether base64 encoding of license content is really optional #382

Clarify whether base64 encoding of license content is really optional #382

Comments

ben-sag commented Feb 19, 2024

mrutkows commented Feb 19, 2024 • edited Loading

ben-sag commented Feb 20, 2024

ben-sag commented Feb 20, 2024

jkowalleck commented Feb 20, 2024 • edited Loading

ben-sag commented Feb 20, 2024

jkowalleck commented Feb 20, 2024 • edited Loading

ben-sag commented Feb 20, 2024

jkowalleck commented Feb 20, 2024 • edited Loading

jkowalleck commented Feb 20, 2024 • edited Loading

sschuberth commented Feb 20, 2024

stevespringett commented Feb 20, 2024

sschuberth commented Feb 20, 2024

ben-sag commented Feb 20, 2024 • edited Loading

mrutkows commented Feb 22, 2024 • edited Loading

jkowalleck commented Feb 22, 2024 • edited Loading

mrutkows commented Feb 19, 2024 •

edited

Loading

jkowalleck commented Feb 20, 2024 •

edited

Loading

jkowalleck commented Feb 20, 2024 •

edited

Loading

jkowalleck commented Feb 20, 2024 •

edited

Loading

jkowalleck commented Feb 20, 2024 •

edited

Loading

ben-sag commented Feb 20, 2024 •

edited

Loading

mrutkows commented Feb 22, 2024 •

edited

Loading

jkowalleck commented Feb 22, 2024 •

edited

Loading