
Add evidence used to determine inclusion of a component. #129

Closed
JorisVanEijden opened this issue Feb 1, 2022 · 22 comments · Fixed by #199
Assignees
Labels
proposed, core enhancement, request for comment, RFC notice sent (A public RFC notice was distributed to the CycloneDX mailing list for consideration), RFC vote accepted
Milestone

Comments

@JorisVanEijden

Most SBOM generators base the inclusion of a component in an SBOM on a package manager file or the existence of some other file.
I would like to be able to trace back what source was used as evidence for the inclusion of a component.
The component.evidence property seemed like a good fit, but it only supports "licenses" and "copyright".

Could we add a "files" list to that?
Then an SBOM generator can list the file(s) it used to decide to include the component.
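A minimal sketch of what such a "files" list might look like under component.evidence (the "files" array and the component name are hypothetical, proposed here for illustration, not part of the spec):

```json
{
  "type": "library",
  "name": "acme-lib",
  "version": "1.0.0",
  "evidence": {
    "files": [
      "/themes/custom/wow/package-lock.json"
    ]
  }
}
```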

@samj1912
Member

samj1912 commented Feb 3, 2022

+1 This would be really useful for CycloneDX support in Syft as well. Syft currently stores the file "evidence" in its internal model. If we could add this to CycloneDX, it would improve the traceability of Syft's SBOM outputs a lot :)

cc: @wagoodman

@stevespringett
Member

In typical CDX fashion, we will try to target v1.5 for a Q1 2023 release. Therefore, there's some time to flesh this out and support as many use cases as possible, keeping in mind that we try to focus on simplicity, high degrees of automation, and technology agnosticism, all without making the spec overly large.

With that in mind, are there use cases we can start documenting?

cc: @brianf

@JorisVanEijden
Author

The main question it needs to answer is: "Why is this component included in this SBOM?"

My actual (way too frequent) scenario:

  • DependencyTrack says project X has a critical vulnerability in component Y.
  • Project X says "we don't use component Y"
  • I say "then why is it in your SBOM?"
  • They say "I have no idea"

This could be solved with a simple text field saying "/themes/custom/wow/package-lock.json" or "JvE: I statically linked this with the executable", or "Syft v3.2.1 detected this as a Composer dependency in /app/dir/composer.lock".

Or something more complex, with different types ("manual", "scanner", "generator", "other"), IDs ("name: Steve Jackson" or "name: Cdxgen, version: 1.3.4"), and further fields ("filepath", "explanation", "url"), etc.

Either would work just as well for me, but I see no use case for the more complex structure.
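For illustration only, the more complex variant described above might look something like this (all field names here are hypothetical, not a proposed schema):

```json
{
  "evidence": {
    "inclusion": {
      "type": "scanner",
      "name": "Cdxgen",
      "version": "1.3.4",
      "filepath": "/app/dir/composer.lock",
      "explanation": "Detected as a Composer dependency"
    }
  }
}
```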

@brianf

brianf commented Apr 13, 2022

We provide information in our tooling that we call "occurrences", which includes a list of the file paths where the component was detected and the binary fingerprint that this thing matched. This helps people understand embedded cases, or when we detect a similar match, or when something was renamed (e.g., foo.jar is actually log4j).

@stevespringett
Member

Thanks @brianf. Few questions...

  • Is the binary fingerprint reproducible from external tools? If so, can you provide a pointer on where we can find more information?
  • Would it be useful to capture the fingerprint and the tool that generated the fingerprint for every occurrence?
  • What else would be useful to capture?

@planetlevel

Some evidence we would like to provide, related to measuring at runtime:

  • we found the right libraries -- libraries that are not in the code repo (appserver, runtime platform)
  • we did not find wrong libraries -- libraries only used in build or test environments
  • hashes of libraries as loaded in production
  • libraries that are used/unused
  • number of classes/methods used in components
  • the actual classes/methods invoked and the callsites/traces capturing how they are invoked

We could provide callsites/traces showing exactly how libraries are used by applications, but it is a LOT of data. It would be useful for folks trying to verify whether a vulnerable method (like the JNDI lookup in log4j) is actually used. Also useful when attempting to remove a component from an application.

@mrutkows
Contributor

mrutkows commented Sep 14, 2022

Want to highlight that evidence is always associated with a "tool" (loosely speaking) used against one or more components or services (or hardware) to evaluate some security or compliance use case; therefore, the schema needs a robust means to associate:

  • evidence/provenance/other (inputs/outputs/state) -- tool -- hardware/component/service (resource under inspection)
  • at various stages of a "build" (read CI/CD for "build") against what we are discussing as a "formulation".

First, please be aware that the Sigstore project and its Rekor (https://github.com/sigstore/rekor) "transparency log" has structured records for all types of "attestations" (evidence) around CI/CD, including: OIDC attestations (identities used for builds), changes in multi-factor authentication of an identity, records attesting that package manager builds have been signed/verified (with near-term plans for adoption by NPM, Ruby, Java, etc.), records of certificate generation for ephemeral keys (per-CI build, from SPIFFE/SPIRE), and more. In all these cases a reference to the log entry (ID) and format would be quite valuable for downstream tooling.

  • See TUF formats (https://github.com/theupdateframework) where many of these records are being standardized and "future proofed".
  • See OpenSSF Frsca project which is looking to produce standardized evidence that can map their "controls" to SLSA (and have recently been discussing OSCAL as the canonical mapping)

In general, the types of "evidence" (I will use this term loosely) we actually have today (note ALL relate to the "tools" that produce it), which are being produced by Tekton (SLSA-compliant) CI systems and stored in crude "evidence lockers", include:

  • CI system "pipeline run" instance (evidence the pipeline was run with proper config/creds., no runtime/container mutations)
  • CI "task" (evidence task was invoked with proper configs (evidence the pipeline was run with proper config/creds., no runtime/container mutations)
  • Scorecard "checks" tests on scsm/project health/source provenance, etc. (and Gauge evidence of provenance and developer origin)
  • Component / Service graphing tools (e.g., GitBom) and decisions trees (assure nothing skipped for polyglot or for media/file types)
    • Note: GitBOM produces a standard ADG graph format
  • SAST - (evidence of runtime env. perhaps even test matrix as supported in many CI systems) and static tests run (names, results)
  • DAST - (evidence of the staging environment creation/config along with all dynamic tests (names, results))
  • Fuzzing (evidence of that API/endpoint tests were run, e.g., RPC/HTTP GET/POST calls invoked)
  • license/copyright/legal scanners (evidence is the regexes/regex templates used to scan source for the presence of known or suspect legal language)
    • include spdx templates as evidence
  • fingerprinting: evidence of "genome" produced (and model which may vary per binary type)

In all cases, the configurations, parameters, and environment variables (a snapshot) present at tool invocation are necessary for potentially achieving reproducible builds, with the SBOM as a viable means of capture.

and more to come!

@coderpatros coderpatros self-assigned this Oct 3, 2022
@stevespringett
Member

@brianf for occurrences, what types of data do you have? Is it only the paths or is there other data?

@stevespringett stevespringett self-assigned this Nov 3, 2022
@planetlevel

  • IAST (evidence from complete running app/API stack, including exactly which libraries, classes, and methods are loaded and run. Also evidence of vulnerability testing, all exposed routes, all backend connections (services))

@brianf

brianf commented Nov 3, 2022

@brianf for occurrences, what types of data do you have? Is it only the paths or is there other data?

Our occurrences are file paths. This way, if someone questions a finding, say it's embedded inside another component, we can provide the exact path (sometimes a bang path) to where they can see the component in the scan path (workspace, CI build, application zip, etc.).
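For example, an embedded component might be reported with a bang path into the containing archive (illustrative only; the archive name is made up, and this is not a defined schema):

```json
"occurrences": [
  { "location": "/workspace/app.war!/WEB-INF/lib/log4j-core-2.14.0.jar" }
]
```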

@brianf

brianf commented Nov 3, 2022

Thanks @brianf. Few questions...

  • Is the binary fingerprint reproducible from external tools? If so, can you provide a pointer on where we can find more information?

Some of the binary fingerprints are SHA-x, so yes. Similar-match fingerprints for detecting a recompiled or slightly altered file are ultimately also SHA fingerprints, but of combinations of data that are ultimately proprietary. These wouldn't typically appear in an SBOM output, however; they'd be used internally in our tooling communication to figure out what a thing is.

  • Would it be useful to capture the fingerprint and the tool that generated the fingerprint for every occurrence?

If the fingerprints don’t exist elsewhere in the BOM, then yes I think that would be useful. This way tools that can go a level deeper and do binary matching analysis have more to validate, or even augment.

  • What else would be useful to capture?

@stevespringett
Member

Ok, trying to flesh out a few ideas here... Bear with me... If I have log4j-core and I want to describe the evidence collected to determine that the library is indeed log4j-core, I might end up with the following information:

(note: this uses both an SCA and IAST example in one. Not sure if that would really be possible, but trying to illustrate both since we have reps from both on this ticket)

"components": [
  {
    "type": "library",
    "group": "org.apache.logging.log4j",
    "name": "log4j-core",
    "version": "2.14.0",
    "evidence": {
      "identity": [
        {
          "field": "group | name | version | purl | swid",
          "confidence": "0..1",
          "methods": [
            "source-code-analysis", 
            "binary-analysis", 
            "manifest-analysis", 
            "ast-fingerprint", 
            "instrumentation", 
            "dynamic-analysis", 
            "other" 
          ],
          "source": "where was the evidence found...",
          "name": "",
          "value": ""
        }
      ],
      "formulation": [
        {
          "ref": ""
        }  
      ],
      "occurrences": [
        "/path/to/log4j-core-2.14.0.jar",
      ],
      "callstack": {
        "frames": [
          {
            "package": "org.apache.logging.log4j.core",
            "module": "Logger.class",
            "function": "logMessage",
            "parameters": [
              "com.acme.HelloWorld", "Level.INFO", null, "Hello World"
            ],
            "line": 150,
            "column": 17,
            "fullFilename": "/path/to/log4j-core-2.14.0.jar!/org/apache/logging/log4j/core/Logger.class",
          },
          {
            "module": "HelloWorld.class",
            "function": "main",
            "line": 20,
            "column": 12,
            "fullFilename": "/path/to/HelloWorld.class",
          }
        ]
      }
    }
  }
]

@planetlevel

I'm trying to understand this through the lens of the typical claim-evidence structure. Here we make some claims about the library identity (name, version, etc.), and I can imagine some evidence of that: we found a file with the name "log4j-core-2.14.0.jar" at this location on this host (low confidence); or we calculated a hash from the bytes loaded at runtime and matched it with a hash in the xyz database (high confidence); or we did some fingerprint thing that found a 98% match with log4j-core-2.14.0.jar from some binary repo (98% confidence this is a modified version of log4j).

The other claim here is that this library is actually used in production. You could provide static evidence of this - sometimes called reachability (low confidence) or instrumentation-based evidence (high confidence). I think providing the full stack trace of that interaction would be excellent evidence (but it's a LOT of data). Perhaps the parameters help... but there could be infinite variations of the parameters, so I guess you just report the first one? Would you do this for all the classes and methods in every library? Seems like a LOT of data for little payoff. For me, it would be strong enough evidence to simply report that a class from a particular library was observed to be loaded at a particular time by a tool that has the ability to observe that operation.

Contrast can capture all the classes that are actually used by the application. This data is very useful when trying to determine whether the vulnerable part of a library is actually in use. Personally, though, if a library has a vulnerability and any part of it is also used, I think the smart policy is to upgrade. This eliminates the 62% of libraries that are never used at all, and lets you focus on the libraries that are both vulnerable and actually used.

@jkowalleck
Member

this is similar to https://github.com/CycloneDX/specification-greymatter/issues/9
right?

@stevespringett
Member

@madpah Is there a difference between confidence and whether something was an exact match or not?

@stevespringett
Member

stevespringett commented Mar 25, 2023

@planetlevel We're going to target reachability in CDX 1.7 with #103. In the meantime, we're planning on adding support for evidence of identity and the occurrences in which the component was found.

PR to come later this weekend.

"evidence": {
  "identity": {
    "field": "purl",
    "confidence": 1,
    "methods": [
      {
        "technique": "filename",
        "confidence": 0.1,
        "value": "log4j-core-2.20.0.jar"
      },
      {
        "technique": "ast-fingerprint",
        "confidence": 0.9,
        "value": "61e4bc08251761c3a73b606b9110a65899cb7d44f3b14c81ebc1e67c98e1d9ab"
      },
      {
        "technique": "hash-comparison",
        "confidence": 0.7,
        "value": "7c547a9d67cc7bc315c93b6e2ff8e4b6b41ae5be454ac249655ecb5ca2a85abf"
      }
    ],
    "tools": [
      "bom-ref-of-tool-that-performed-analysis"
    ]
  },
  "occurrences": [
    {
      "bom-ref": "d6bf237e-4e11-4713-9f62-56d18d5e2079",
      "location": "/path/to/component"
    },
    {
      "bom-ref": "b574d5d1-e3cf-4dcd-9ba5-f3507eb1b175",
      "location": "/another/path/to/component"
    }
  ]
}

@brianf thoughts on the above?

@stevespringett stevespringett added the request for comment and RFC notice sent labels Mar 25, 2023
@planetlevel

Wouldn't hash match be 1.0? Just want to make sure I'm not misunderstanding this.

@planetlevel

@stevespringett - we're still including the option to include callstack evidence, right?

@stevespringett
Member

@planetlevel Would you like it included? If so, is the proposal adequate or does it need revision? If it's ok as is, I'll update the PR to include it.

@stevespringett
Member

The hash could match, but say it's an MD5 or SHA-1 with known collision possibilities; the confidence may be less than one. If it's a SHA-256 or higher, then likely the confidence would be 1. But it's just an example above.

@jkowalleck
Member

Can somebody please answer in short: why have an overall confidence and multiple specific confidences, but not publish the weights of the specific confidence values?
🔍 see #199 (comment)

@planetlevel

@planetlevel Would you like it included? If so, is the proposal adequate or does it need revision? If it's ok as is, I'll update the PR to include it.

Yes, we should include it.
