Skip to content

C++: Instantiate model generation library #19295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Apr 23, 2025

Conversation

MathiasVP
Copy link
Contributor

Now that both #19273 and #19274 are merged we can finally get to the fun part: Adding the actual implementation of model generation to C++ 🎉.

I couldn't think of a good way to structure the commits in this PR. Apologies!

  • The first commit adds the entire library
  • The second commit adds tests that will succeed at the end of the PR.
  • The third commit instantiates the inline expectation test framework to test model generation
  • The fourth commit adds all the files that I could copy/paste from C#/Java/Rust. It's basically all the query and test files required for model generation.

I think that after this PR is merged the final step is to add the required DCA suite for model generation. However, as this isn't yet done I don't think it makes sense to run any DCA for this right now.

I've done some testing of this already: I've run the model generation on sqlite and the models look sensible for the very small subset that I've checked (if you're curious: https://gist.github.com/MathiasVP/6942f022c7a8f4e515c80ccd442ab59f).

cc @michaelnebel would you mind taking a brief look at this PR? I don't expect you to review the C++ specific parts, obviously 🙈

@Copilot Copilot AI review requested due to automatic review settings April 11, 2025 18:53
@MathiasVP MathiasVP requested a review from a team as a code owner April 11, 2025 18:53
@github-actions github-actions bot added the C++ label Apr 11, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR instantiates the model generation library for C++ by adding the complete library, tests, and associated extension configuration files to enable model generation via query summaries.

  • Adds annotated model definitions and summaries in the library tests.
  • Introduces tests (including instantiation of the inline expectation test framework) for verifying model generation.
  • Provides extension YAML files to integrate summary models into the CodeQL pack.

Reviewed Changes

Copilot reviewed 4 out of 18 changed files in this pull request and generated no comments.

File Description
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp Adds model definitions with summary annotations for C++ dataflow.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ext.yml Configures summary capture extension with manual models.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml Configures content-based summary capture extension.
cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py Introduces a utility script for generating C++ flow models.
Files not reviewed (14)
  • cpp/ql/lib/utils/test/InlineMadTest.qll: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureMixedNeutralModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureMixedSummaryModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureNeutralModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureSinkModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureSourceModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/CaptureSummaryModels.ql: Language not supported
  • cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll: Language not supported
  • cpp/ql/src/utils/modelgenerator/internal/CaptureModelsPrinting.qll: Language not supported
  • cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.expected: Language not supported
  • cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql: Language not supported
  • cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.expected: Language not supported
  • cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ql: Language not supported
Comments suppressed due to low confidence (2)

cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py:9

  • [nitpick] Consider renaming 'madpath' to a more descriptive name like 'modelsDataPath' for improved clarity.
madpath = os.path.join(gitroot, "misc/scripts/models-as-data/")

cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml:6

  • [nitpick] Ensure consistent naming for model identifiers; consider using 'Models' instead of 'models' to align with other summary annotations.
- [ "models", "ManuallyModelled", False, "hasSummary", "(void *)", "", "Argument[0]", "ReturnValue", "value", "manual"]

@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Apr 11, 2025
Copy link
Contributor

@michaelnebel michaelnebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very very nice! Well done @MathiasVP !

import generate_flow_model as model

language = "cpp"
model.Generator.make(language).run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is my intention to change the python script such that --with-summaries uses the mixed query instead (and correspondingly for the neutral), but for testing purposes it is nice to keep both the content based and heuristic based queries around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started a PR with this work prior to easter: #19311
Will make sure not to merge before your PR is merged (and then also make the corresponding changes for C++).

f.isStatic()
}

predicate isUninterestingForDataFlowModels(Callable api) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider to move the content of this predicate to the relevant predicate instead.
This will make things easier in case you intend to introduce "lifting" logic for the produced models.
For C# and Java we consider implementations of method "prototypes" (implementations of interface- or abstract class members) to abide to the "contract" of the interface- or abstract member - at least if the implementation is in the same codebase as the interface- or abstract member declaration.
That is, for something like

public interface I { 
  object M(object o);
}

public class C : I {
  public object M(object o) {
    return o;
  }
}

we would like to "lift" the model identified for C.M to I.M and use this for all implementations of I.M.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! That idea totally flew over my head when I was reading the Java/C# implementations. I've moved the contents to the relevant predicate in 3dfb68d, but I'll save the lifting logic to another PR, I think. Thanks for the heads up on this!

Copy link
Contributor

@michaelnebel michaelnebel Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is not unfortunately not evident from the implementation (or module signature) why that distinction is there (it is because parts of the model generator is also re-used for type based model generation - maybe I should consider to revisit this again). 😄

@michaelnebel
Copy link
Contributor

Inspiration for adding model generation summaries to DCA: https://github.com/github/codeql-dca/pull/847/
Not sure why the experiments failed last week. We might need help from the DX team on this (but maybe this is something for after Easter).

@MathiasVP
Copy link
Contributor Author

Inspiration for adding model generation summaries to DCA: github/codeql-dca#847
Not sure why the experiments failed last week. We might need help from the DX team on this (but maybe this is something for after Easter).

Thanks a lot, Michael! I've opened a draft PR with what I think are the required DCA changes. I'll test it once this PR is merged.

@MathiasVP MathiasVP force-pushed the cpp-add-mad-generation-library branch from d188823 to 3fd760c Compare April 20, 2025 15:49
@@ -16,6 +16,7 @@ dependencies:
codeql/xml: ${workspace}
dataExtensions:
- ext/*.model.yml
- ext/generated/*.model.yml
Copy link
Contributor

@michaelnebel michaelnebel Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generates warnings (when running queries/tests) that there are no files matching the *.model.yml.
Consider adding an empty.model.yml (in the generated folder):

extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: summaryModel
    data: []

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, well spotted! Fixed in 07d8f8d

@michaelnebel
Copy link
Contributor

michaelnebel commented Apr 22, 2025

@MathiasVP : Something to consider when starting to generate models for libraries: The mixed modelgeneration uses a combination of exact data flow (content based data flow) and heuristic data flow for producing models. A heuristic summary is only generated as a fallback (these are the models with provenance df-summary) and they have the drawback that they are only approximations (and in many cases over-approximations). An example of this: When a heuristic model is generated, fields are not taken into account and it is assumed that tainting a field taints the entire object (the field conflation problem).
As a consequence we don't automatically get idempotency for model generation. That is, model_gen(model_gen(x)) might yield more models than model_gen(x) (if the generated models from the first run are included when generating the models the second time).
Furthermore, for C# and Java we also experienced a slowdown in analysis time when analyzing a repo, which we had generated models for (this is because each public call target might now have a synthetic (summarized) callable and a source code callable, which can lead to a combinatorial explosion in the number of paths).
Both of these issues are handled in the C# and Java data flow libraries by providing a heuristic for when to use generated models. Let me know, if you need more info. 😄

michaelnebel
michaelnebel previously approved these changes Apr 22, 2025
Copy link
Contributor

@michaelnebel michaelnebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good work!!! 🎉

Maybe consider to add the empty.model.yml file before merging to avoid all the warnings.

@MathiasVP
Copy link
Contributor Author

Furthermore, for C# and Java we also experienced a slowdown in analysis time when analyzing a repo, which we had generated models for (this is because each public call target might now have a synthetic (summarized) callable and a source code callable, which can lead to a combinatorial explosion in the number of paths).
Both of these issues are handled in the C# and Java data flow libraries by providing a heuristic for when to use generated models. Let me know, if you need more info. 😄

Thanks a lot for these additional details, Michael! For C++ we're dome something very simple which we should eventually upgrade to be more like C# and Java: If we have a MaD summary for a callable then we only dispatch to the Mad summary and not to the source callable. So far this has given us good results, but now that we'll (probably) have many more models this may need to be re-examined.

Once we get to that I'll be sure to reach out!

jketema
jketema previously approved these changes Apr 23, 2025
Copy link
Contributor

@jketema jketema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tried to grasp all the details here, but overall this looks sensible to me. Two small nits below.

name = uc.getUnion().getName() and
indirectionIndex = uc.getIndirectionIndex() and
// Note: We don't actually support the union string in MaD, but we should do that eventually
kind = "Union["
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we don't support this, there's also no test that exercises this. Maybe we should add one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I've done this now in 9e9a580

or
exists(DataFlow::ElementContent ec |
c.isSingleton(ec) and
result = "Element[" + ec.getIndirectionIndex() + "]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There doesn't seem to be a test that exercises this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. We don't actually add any Element contents as part of the generation yet. The heuristics need to be quite a bit more complicated that the corresponding Java implementation so I decided to leave it out for now

michaelnebel
michaelnebel previously approved these changes Apr 23, 2025
Copy link
Contributor

@michaelnebel michaelnebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@MathiasVP MathiasVP dismissed stale reviews from michaelnebel and jketema via 9e9a580 April 23, 2025 10:15
@MathiasVP MathiasVP merged commit 808141f into github:main Apr 23, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants