
Add evidence used to determine inclusion of a component. #129

Closed
JorisVanEijden opened this issue Feb 1, 2022 · 22 comments · Fixed by #199
Assignees
Labels
proposed, core enhancement, request for comment, RFC notice sent (A public RFC notice was distributed to the CycloneDX mailing list for consideration), RFC vote accepted
Milestone

Comments

@JorisVanEijden

Most SBOM generators base the inclusion of a component in an SBOM on a package manager file or the existence of some other file.
I would like to be able to trace back what source was used as evidence for the inclusion of a component.
The component.evidence property seemed like a good fit, but it only supports "licenses" and "copyright".

Could we add a "files" list to that?
Then an SBOM generator can list the file(s) it used to decide to include the component.
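A minimal sketch of what such a "files" list might look like under component.evidence (the "files" array and the component name are hypothetical, proposed here for illustration, not part of the spec):

```json
{
  "type": "library",
  "name": "acme-lib",
  "version": "1.0.0",
  "evidence": {
    "files": [
      "/themes/custom/wow/package-lock.json"
    ]
  }
}
```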

@samj1912
Member

samj1912 commented Feb 3, 2022

+1 This would be really useful for CycloneDX support in Syft as well. Syft currently stores the file "evidence" in its internal model. If we could add this to CycloneDX, it would improve the traceability of Syft's SBOM outputs a lot :)

cc: @wagoodman

@stevespringett
Member

In typical CDX fashion, we will try to target v1.5 for a Q1 2023 release. Therefore, there's some time to flesh this out and support as many use cases as possible, keeping in mind that we try to focus on simplicity, high degrees of automation, and technology agnosticism, all without making the spec overly large.

With that in mind, are there use cases we can start documenting?

cc: @brianf

@JorisVanEijden
Author

The main question it needs to answer is: "Why is this component included in this SBOM?"

My actual (way too frequent) scenario:

  • DependencyTrack says project X has a critical vulnerability in component Y.
  • Project X says "we don't use component Y"
  • I say "then why is it in your SBOM?"
  • They say "I have no idea"

This could be solved with a simple text field saying "/themes/custom/wow/package-lock.json" or "JvE: I statically linked this with the executable", or "Syft v3.2.1 detected this as a Composer dependency in /app/dir/composer.lock".

Or something more complex, with different types ("manual", "scanner", "generator", "other"), IDs ("name: Steve Jackson" or "name: Cdxgen, version: 1.3.4"), and further fields ("filepath", "explanation", "url"), etc.

Either would work just as well for me, but I see no use case for the more complex structure.
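For illustration only, the more complex variant described above might look something like this (all field names here are hypothetical, not a proposed schema):

```json
{
  "evidence": {
    "inclusion": {
      "type": "scanner",
      "name": "Cdxgen",
      "version": "1.3.4",
      "filepath": "/app/dir/composer.lock",
      "explanation": "Detected as a Composer dependency"
    }
  }
}
```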

@brianf

brianf commented Apr 13, 2022

We provide information in our tooling that we call "occurrences", which includes a list of the file paths where the component was detected and the binary fingerprint that this thing matched. This helps people understand embedded cases, or when we detect a similar match, or when something was renamed (e.g., foo.jar is actually log4j).

@stevespringett
Member

Thanks @brianf. Few questions...

  • Is the binary fingerprint reproducible from external tools? If so, can you provide a pointer on where we can find more information?
  • Would it be useful to capture the fingerprint and the tool that generated the fingerprint for every occurrence?
  • What else would be useful to capture?

@planetlevel

Some evidence we would like to provide, related to measuring at runtime:

  • we found the right libraries -- libraries that are not in the code repo (appserver, runtime platform)
  • we did not find wrong libraries -- libraries only used in build or test environments
  • hashes of libraries as loaded in production
  • libraries that are used/unused
  • number of classes/methods used in components
  • the actual classes/methods invoked and the callsites/traces capturing how they are invoked

We could provide callsites/traces showing exactly how libraries are used by applications, but it is a LOT of data. It would be useful for folks trying to verify whether a vulnerable method (like the JNDI lookup in log4j) is actually used. Also useful when attempting to remove a component from an application.

@mrutkows
Contributor

mrutkows commented Sep 14, 2022

Want to highlight that evidence is always associated with a "tool" (loosely speaking) used against one or more components or services (or hardware) to evaluate some security or compliance use case; therefore, the schema needs a robust means to associate:

  • evidence/provenance/other (inputs/outputs/state) -- tool -- hardware/component/service (resource under inspection)
  • at various stages of a "build" (read CI/CD for "build") against what we are discussing as a "formulation".

First, please be aware that the Sigstore project and its Rekor (https://github.com/sigstore/rekor) "transparency log" has structured records for all types of "attestations" (evidence) around CI/CD, including: OIDC attestations (identities used for builds), changes in multi-factor authentication of an identity, records attesting that package manager builds have been signed/verified (with near-term plans for adoption by NPM, Ruby, Java, etc.), records of certificate generation for ephemeral keys (per-CI build, from SPIFFE/SPIRE), and more. In all these cases a reference to the log entry (ID) and format would be quite valuable for downstream tooling.

  • See TUF formats (https://github.com/theupdateframework) where many of these records are being standardized and "future proofed".
  • See OpenSSF Frsca project which is looking to produce standardized evidence that can map their "controls" to SLSA (and have recently been discussing OSCAL as the canonical mapping)

In general, the types of "evidence" (I will use this term loosely) we actually have today (note ALL relate to the "tools" that produce it), which are being produced by Tekton (SLSA-compliant) CI systems and stored in crude "evidence lockers", include:

  • CI system "pipeline run" instance (evidence the pipeline was run with proper config/creds., no runtime/container mutations)
  • CI "task" (evidence task was invoked with proper configs (evidence the pipeline was run with proper config/creds., no runtime/container mutations)
  • Scorecard "checks" tests on scsm/project health/source provenance, etc. (and Gauge evidence of provenance and developer origin)
  • Component / Service graphing tools (e.g., GitBom) and decisions trees (assure nothing skipped for polyglot or for media/file types)
    • Note: GitBOM produces a standard ADG graph format
  • SAST - (evidence of runtime env. perhaps even test matrix as supported in many CI systems) and static tests run (names, results)
  • DAST - (evidence of the staging environment creation/config along with all dynamic tests (names, results))
  • Fuzzing (evidence of that API/endpoint tests were run, e.g., RPC/HTTP GET/POST calls invoked)
  • license/copyright/legal scanners (evidence is the regexes/regex templates used to scan source for the presence of known or suspect legal language)
    • include spdx templates as evidence
  • fingerprinting: evidence of "genome" produced (and model which may vary per binary type)

In all cases, the configurations, parameters, and environment variables (a snapshot) present at tool invocation are necessary for potentially achieving reproducible builds, with the SBOM as a viable means of capture.

and more to come!

@coderpatros coderpatros self-assigned this Oct 3, 2022
@stevespringett
Member

@brianf for occurrences, what types of data do you have? Is it only the paths or is there other data?

@stevespringett stevespringett self-assigned this Nov 3, 2022
@planetlevel

  • IAST (evidence from complete running app/API stack, including exactly which libraries, classes, and methods are loaded and run. Also evidence of vulnerability testing, all exposed routes, all backend connections (services))

@brianf

brianf commented Nov 3, 2022

@brianf for occurrences, what types of data do you have? Is it only the paths or is there other data?

Our occurrences are file paths. This way, if someone questions a finding, say it's embedded inside another component, we can provide the exact path (sometimes a bang path) to where they can see the component in the scan path (workspace, CI build, application zip, etc.).
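For example, an embedded component might be reported with a bang path into the containing archive (illustrative only; the archive name is made up, and this is not a defined schema):

```json
"occurrences": [
  { "location": "/workspace/app.war!/WEB-INF/lib/log4j-core-2.14.0.jar" }
]
```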

@brianf

brianf commented Nov 3, 2022

Thanks @brianf. Few questions...

  • Is the binary fingerprint reproducible from external tools? If so, can you provide a pointer on where we can find more information?

Some of the binary fingerprints are SHA-x, so yes. Similar-match fingerprints for detecting a recompiled or slightly altered file are ultimately also SHA fingerprints, but of combinations of data that are ultimately proprietary. These wouldn't typically appear in an SBOM output, however; they'd be used internally in our tooling communication to figure out what a thing is.

  • Would it be useful to capture the fingerprint and the tool that generated the fingerprint for every occurrence?

If the fingerprints don’t exist elsewhere in the BOM, then yes I think that would be useful. This way tools that can go a level deeper and do binary matching analysis have more to validate, or even augment.

  • What else would be useful to capture?

@stevespringett
Member

Ok, trying to flesh out a few ideas here... Bear with me... If I have log4j-core and I want to describe the evidence collected to determine that the library is indeed log4j-core, I might end up with the following information:

(note: this uses both an SCA and IAST example in one. Not sure if that would really be possible, but trying to illustrate both since we have reps from both on this ticket)

"components": [
  {
    "type": "library",
    "group": "org.apache.logging.log4j",
    "name": "log4j-core",
    "version": "2.14.0",
    "evidence": {
      "identity": [
        {
          "field": "group | name | version | purl | swid",
          "confidence": "0..1",
          "methods": [
            "source-code-analysis", 
            "binary-analysis", 
            "manifest-analysis", 
            "ast-fingerprint", 
            "instrumentation", 
            "dynamic-analysis", 
            "other" 
          ],
          "source": "where was the evidence found...",
          "name": "",
          "value": ""
        }
      ],
      "formulation": [
        {
          "ref": ""
        }  
      ],
      "occurrences": [
        "/path/to/log4j-core-2.14.0.jar",
      ],
      "callstack": {
        "frames": [
          {
            "package": "org.apache.logging.log4j.core",
            "module": "Logger.class",
            "function": "logMessage",
            "parameters": [
              "com.acme.HelloWorld", "Level.INFO", null, "Hello World"
            ],
            "line": 150,
            "column": 17,
            "fullFilename": "/path/to/log4j-core-2.14.0.jar!/org/apache/logging/log4j/core/Logger.class",
          },
          {
            "module": "HelloWorld.class",
            "function": "main",
            "line": 20,
            "column": 12,
            "fullFilename": "/path/to/HelloWorld.class",
          }
        ]
      }
    }
  }
]

@planetlevel

I'm trying to understand this through the lens of the typical claim-evidence structure. Here we make some claims about the library identity (name, version, etc.), and I can imagine some evidence of that: we found a file with the name "log4j-core-2.14.0.jar" at this location on this host (low confidence); or we calculated a hash from the bytes loaded at runtime and matched it with a hash in the xyz database (high confidence); or we did some fingerprint thing that found a 98% match with log4j-core-2.14.0.jar from some binary repo (98% confidence this is a modified version of log4j).

The other claim here is that this library is actually used in production. You could provide static evidence of this - sometimes called reachability (low confidence) or instrumentation-based evidence (high confidence). I think providing the full stack trace of that interaction would be excellent evidence (but it's a LOT of data). Perhaps the parameters help... but there could be infinite variations of the parameters, so I guess you just report the first one? Would you do this for all the classes and methods in every library? Seems like a LOT of data for little payoff. For me, it would be strong enough evidence to simply report that a class from a particular library was observed to be loaded at a particular time by a tool that has the ability to observe that operation.

Contrast can capture all the classes that are actually used by the application. This data is very useful when trying to determine whether the vulnerable part of a library is actually in use. Personally, though, if a library has a vulnerability and any part of it is also used, I think the smart policy is to upgrade. This eliminates the 62% of libraries that are never used at all, and lets you focus on the libraries that are both vulnerable and actually used.

@jkowalleck
Member

this is similar to https://github.com/CycloneDX/specification-greymatter/issues/9
right?

@stevespringett
Member

@madpah Is there a difference between confidence and whether something was an exact match or not?

@stevespringett
Member

stevespringett commented Mar 25, 2023

@planetlevel We're going to target reachability in CDX 1.7 with #103. In the meantime, we're planning on adding support for evidence of identity and the occurrences in which the component was found.

PR to come later this weekend.

"evidence": {
  "identity": {
    "field": "purl",
    "confidence": 1,
    "methods": [
      {
        "technique": "filename",
        "confidence": 0.1,
        "value": "log4j-core-2.20.0.jar"
      },
      {
        "technique": "ast-fingerprint",
        "confidence": 0.9,
        "value": "61e4bc08251761c3a73b606b9110a65899cb7d44f3b14c81ebc1e67c98e1d9ab"
      },
      {
        "technique": "hash-comparison",
        "confidence": 0.7,
        "value": "7c547a9d67cc7bc315c93b6e2ff8e4b6b41ae5be454ac249655ecb5ca2a85abf"
      }
    ],
    "tools": [
      "bom-ref-of-tool-that-performed-analysis"
    ]
  },
  "occurrences": [
    {
      "bom-ref": "d6bf237e-4e11-4713-9f62-56d18d5e2079",
      "location": "/path/to/component"
    },
    {
      "bom-ref": "b574d5d1-e3cf-4dcd-9ba5-f3507eb1b175",
      "location": "/another/path/to/component"
    }
  ]
}

@brianf thoughts on the above?

@stevespringett stevespringett added the request for comment and RFC notice sent labels Mar 25, 2023
@planetlevel

Wouldn't hash match be 1.0? Just want to make sure I'm not misunderstanding this.

@planetlevel

@stevespringett - we're still including the option to include callstack evidence, right?

@stevespringett
Member

@planetlevel Would you like it included? If so, is the proposal adequate or does it need revision? If it's ok as is, I'll update the PR to include it.

@stevespringett
Member

The hash could match, but say it's an MD5 or SHA-1 with known collision possibilities; the confidence may be less than one. If it's a SHA-256 or higher, then likely the confidence would be 1. But it's just an example above.

@jkowalleck
Member

Can somebody please answer in short: why have an overall confidence and multiple specific confidences, but not publish the weights of the specific confidence values?
🔍 see #199 (comment)

@planetlevel

@planetlevel Would you like it included? If so, is the proposal adequate or does it need revision? If it's ok as is, I'll update the PR to include it.

Yes, we should include it.
