C++: Instantiate model generation library #19295

MathiasVP · 2025-04-11T18:53:04Z

Now that both #19273 and #19274 are merged we can finally get to the fun part: Adding the actual implementation of model generation to C++ 🎉.

I couldn't think of a good way to structure the commits in this PR. Apologies!

The first commit adds the entire library.
The second commit adds tests that will succeed at the end of the PR.
The third commit instantiates the inline expectation test framework to test model generation.
The fourth commit adds all the files that I could copy/paste from C#/Java/Rust. It's basically all the query and test files required for model generation.

I think that after this PR is merged the final step is to add the required DCA suite for model generation. However, as this isn't yet done I don't think it makes sense to run any DCA for this right now.

I've done some testing of this already: I've run the model generation on sqlite and the models look sensible for the very small subset that I've checked (if you're curious: https://gist.github.com/MathiasVP/6942f022c7a8f4e515c80ccd442ab59f).

cc @michaelnebel would you mind taking a brief look at this PR? I don't expect you to review the C++ specific parts, obviously 🙈

Copilot

Pull Request Overview

This PR instantiates the model generation library for C++ by adding the complete library, tests, and associated extension configuration files to enable model generation via query summaries.

Adds annotated model definitions and summaries in the library tests.
Introduces tests (including instantiation of the inline expectation test framework) for verifying model generation.
Provides extension YAML files to integrate summary models into the CodeQL pack.

Reviewed Changes

Copilot reviewed 4 out of 18 changed files in this pull request and generated no comments.

File	Description
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp	Adds model definitions with summary annotations for C++ dataflow.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ext.yml	Configures summary capture extension with manual models.
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml	Configures content-based summary capture extension.
cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py	Introduces a utility script for generating C++ flow models.

Files not reviewed (14)

cpp/ql/lib/utils/test/InlineMadTest.qll: Language not supported
cpp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql: Language not supported
cpp/ql/src/utils/modelgenerator/CaptureMixedNeutralModels.ql: Language not supported
cpp/ql/src/utils/modelgenerator/CaptureMixedSummaryModels.ql: Language not supported
cpp/ql/src/utils/modelgenerator/CaptureNeutralModels.ql: Language not supported
cpp/ql/src/utils/modelgenerator/CaptureSinkModels.ql: Language not supported
cpp/ql/src/utils/modelgenerator/CaptureSourceModels.ql: Language not supported
cpp/ql/src/utils/modelgenerator/CaptureSummaryModels.ql: Language not supported
cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll: Language not supported
cpp/ql/src/utils/modelgenerator/internal/CaptureModelsPrinting.qll: Language not supported
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.expected: Language not supported
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql: Language not supported
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.expected: Language not supported
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ql: Language not supported

Comments suppressed due to low confidence (2)

cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py:9

[nitpick] Consider renaming 'madpath' to a more descriptive name like 'modelsDataPath' for improved clarity.

madpath = os.path.join(gitroot, "misc/scripts/models-as-data/")

cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml:6

[nitpick] Ensure consistent naming for model identifiers; consider using 'Models' instead of 'models' to align with other summary annotations.

- [ "models", "ManuallyModelled", False, "hasSummary", "(void *)", "", "Argument[0]", "ReturnValue", "value", "manual"]

michaelnebel

This is very very nice! Well done @MathiasVP !

michaelnebel · 2025-04-14T08:21:47Z

cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py

+import generate_flow_model as model
+
+language = "cpp"
+model.Generator.make(language).run()


It is my intention to change the Python script such that --with-summaries uses the mixed query instead (and correspondingly for the neutral), but for testing purposes it is nice to keep both the content-based and heuristic-based queries around.

Started a PR with this work prior to Easter: #19311
Will make sure not to merge before your PR is merged (and then also make the corresponding changes for C++).

michaelnebel · 2025-04-14T08:58:41Z

cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll

+    f.isStatic()
+  }
+
+  predicate isUninterestingForDataFlowModels(Callable api) {


Maybe consider moving the content of this predicate to the relevant predicate instead.
This will make things easier in case you intend to introduce "lifting" logic for the produced models.
For C# and Java we consider implementations of method "prototypes" (implementations of interface or abstract class members) to abide by the "contract" of the interface or abstract member - at least if the implementation is in the same codebase as the interface or abstract member declaration.
That is, for something like

public interface I { object M(object o); }
public class C : I { public object M(object o) { return o; } }

we would like to "lift" the model identified for C.M to I.M and use this for all implementations of I.M.
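
As a rough illustration of the lifting idea described above, here is a minimal Python sketch. It is not the CodeQL implementation: the `IMPLEMENTS` table, the `lift_models` helper, and the tuple layout are all hypothetical names chosen for this example, and the second implementation `D.M` is invented to show that several implementations map to one declaration.

```python
# Hypothetical toy data: each implementation mapped to the interface member
# it implements. "D.M" is invented for illustration only.
IMPLEMENTS = {"C.M": "I.M", "D.M": "I.M"}


def lift_models(models, implements):
    """Re-attach each model to the declaration that its callable implements.

    A model is a (callable, input, output, kind) tuple. Callables that do not
    implement anything keep their own name.
    """
    lifted = set()
    for callable_name, inp, out, kind in models:
        target = implements.get(callable_name, callable_name)
        lifted.add((target, inp, out, kind))
    return lifted


# The model identified for C.M (the argument flows to the return value) ...
generated = {("C.M", "Argument[0]", "ReturnValue", "value")}
# ... is reported on I.M, so it would apply to every implementation of I.M.
lifted = lift_models(generated, IMPLEMENTS)
print(sorted(lifted))
```

In this sketch the lifted model replaces the implementation-level one; whether a real generator keeps both is a design decision that this example does not try to settle.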

Ah! That idea totally flew over my head when I was reading the Java/C# implementations. I've moved the contents to the relevant predicate in 3dfb68d, but I'll save the lifting logic for another PR, I think. Thanks for the heads up on this!

Yeah, it is unfortunately not evident from the implementation (or module signature) why that distinction is there (it is because parts of the model generator are also reused for type-based model generation - maybe I should consider revisiting this). 😄

michaelnebel · 2025-04-14T10:32:35Z

Inspiration for adding model generation summaries to DCA: https://github.com/github/codeql-dca/pull/847/
Not sure why the experiments failed last week. We might need help from the DX team on this (but maybe this is something for after Easter).

MathiasVP · 2025-04-18T15:28:37Z

Inspiration for adding model generation summaries to DCA: github/codeql-dca#847
Not sure why the experiments failed last week. We might need help from the DX team on this (but maybe this is something for after Easter).

Thanks a lot, Michael! I've opened a draft PR with what I think are the required DCA changes. I'll test it once this PR is merged.

michaelnebel · 2025-04-22T12:58:49Z

cpp/ql/lib/qlpack.yml

@@ -16,6 +16,7 @@ dependencies:
  codeql/xml: ${workspace}
 dataExtensions:
  - ext/*.model.yml
+  - ext/generated/*.model.yml


This generates warnings (when running queries/tests) that there are no files matching the *.model.yml pattern.
Consider adding an empty.model.yml (in the generated folder):

extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: summaryModel
    data: []

Uh, well spotted! Fixed in 07d8f8d

michaelnebel · 2025-04-22T13:15:25Z

@MathiasVP : Something to consider when starting to generate models for libraries: The mixed model generation uses a combination of exact data flow (content-based data flow) and heuristic data flow for producing models. A heuristic summary is only generated as a fallback (these are the models with provenance df-summary), and such summaries have the drawback that they are only approximations (and in many cases over-approximations). An example of this: When a heuristic model is generated, fields are not taken into account and it is assumed that tainting a field taints the entire object (the field conflation problem).
As a consequence we don't automatically get idempotency for model generation. That is, model_gen(model_gen(x)) might yield more models than model_gen(x) (if the generated models from the first run are included when generating the models the second time).
Furthermore, for C# and Java we also experienced a slowdown in analysis time when analyzing a repo that we had generated models for (this is because each public call target might now have a synthetic (summarized) callable and a source code callable, which can lead to a combinatorial explosion in the number of paths).
Both of these issues are handled in the C# and Java data flow libraries by providing a heuristic for when to use generated models. Let me know if you need more info. 😄
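
To make the idempotency point concrete, here is a deliberately tiny Python sketch. Everything in it is an assumption made for illustration: `toy_model_gen` is a hard-coded stand-in rather than the real generator, and the functions `set_name`, `get_id`, and `wrapper` are invented. The scenario is that `wrapper(s)` calls `set_name(o, s)` and then returns `get_id(o)`, so no data actually flows from `s` to the result, but field-insensitive summaries make it look as if it does.

```python
# Hypothetical heuristic summaries: fields are ignored, so writing obj.name is
# modelled as tainting all of obj, and reading obj.id as reading all of obj.
SET_NAME = ("set_name", "Argument[1]", "Argument[0]", "taint")
GET_ID = ("get_id", "Argument[0]", "ReturnValue", "taint")
WRAPPER = ("wrapper", "Argument[0]", "ReturnValue", "taint")


def toy_model_gen(known_models):
    """Return the known models plus whatever this toy 'analysis' derives."""
    models = set(known_models)
    # Analysing the source bodies yields the two field-conflating summaries.
    models.update({SET_NAME, GET_ID})
    # Only when those summaries are already supplied as input (instead of the
    # field-sensitive source bodies) does their composition appear to carry
    # wrapper's argument to its return value.
    if {SET_NAME, GET_ID} <= set(known_models):
        models.add(WRAPPER)
    return frozenset(models)


first = toy_model_gen(frozenset())
second = toy_model_gen(first)
third = toy_model_gen(second)
print(sorted(second - first))
```

Under these assumptions the second run reports one extra (spurious) model for `wrapper`, after which the toy generator reaches a fixed point.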

michaelnebel

Really good work!!! 🎉

Maybe consider adding the empty.model.yml file before merging to avoid all the warnings.

MathiasVP · 2025-04-23T09:21:24Z

Furthermore, for C# and Java we also experienced a slowdown in analysis time when analyzing a repo that we had generated models for (this is because each public call target might now have a synthetic (summarized) callable and a source code callable, which can lead to a combinatorial explosion in the number of paths).
Both of these issues are handled in the C# and Java data flow libraries by providing a heuristic for when to use generated models. Let me know if you need more info. 😄

Thanks a lot for these additional details, Michael! For C++ we're doing something very simple which we should eventually upgrade to be more like C# and Java: If we have a MaD summary for a callable then we only dispatch to the MaD summary and not to the source callable. So far this has given us good results, but now that we'll (probably) have many more models this may need to be re-examined.

Once we get to that I'll be sure to reach out!
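
The simple C++ dispatch rule described above can be sketched in a few lines of Python. This is only an illustration under assumed names: `call_targets`, `mad_summaries`, and `source_bodies` are hypothetical and do not correspond to anything in the CodeQL libraries.

```python
def call_targets(callee, mad_summaries, source_bodies):
    """Pick the targets that data flow would dispatch to for a call to `callee`.

    If a MaD summary exists, it is the only target and the source body is
    ignored. Otherwise the source body is used when one is available.
    """
    if callee in mad_summaries:
        return [("summary", callee)]
    if callee in source_bodies:
        return [("source", callee)]
    return []


# Hypothetical callables: "memcpy_like" has both a summary and a body.
mad_summaries = {"memcpy_like"}
source_bodies = {"memcpy_like", "helper"}

for name in ("memcpy_like", "helper", "unknown"):
    print(name, call_targets(name, mad_summaries, source_bodies))
```

Because each call resolves to at most one target here, the path count does not multiply per call, which is one way to read why this rule avoids the slowdown mentioned for C# and Java. The trade-off is that an over-approximate generated summary would shadow a more precise source body.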

jketema

I haven't tried to grasp all the details here, but overall this looks sensible to me. Two small nits below.

jketema · 2025-04-23T09:37:32Z

cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll

+        name = uc.getUnion().getName() and
+        indirectionIndex = uc.getIndirectionIndex() and
+        // Note: We don't actually support the union string in MaD, but we should do that eventually
+        kind = "Union["


Although we don't support this, there's also no test that exercises this. Maybe we should add one?

Good idea! I've done this now in 9e9a580

jketema · 2025-04-23T09:38:00Z

cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll

+    or
+    exists(DataFlow::ElementContent ec |
+      c.isSingleton(ec) and
+      result = "Element[" + ec.getIndirectionIndex() + "]"


There doesn't seem to be a test that exercises this?

Correct. We don't actually add any Element contents as part of the generation yet. The heuristics need to be quite a bit more complicated than the corresponding Java implementation, so I decided to leave it out for now.

michaelnebel

LGTM!

Copilot AI review requested due to automatic review settings April 11, 2025 18:53

MathiasVP requested a review from a team as a code owner April 11, 2025 18:53

github-actions bot added the C++ label Apr 11, 2025

Copilot AI reviewed Apr 11, 2025

MathiasVP added the no-change-note-required This PR does not need a change note label Apr 11, 2025

github-advanced-security bot found potential problems Apr 11, 2025

cpp/ql/lib/utils/test/InlineMadTest.qll (fixed)

cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll (fixed)

michaelnebel reviewed Apr 14, 2025

MathiasVP added 12 commits April 20, 2025 16:48

C++: Instantiate model generation library.

3d48b23

C++: Add tests that will soon succeed.

f241e4b

C++: Instantiate inline expectation test framework to test model generation.

09ebd6e

C++: Add copy-pasted files from C#.

1465058

C++: Fix ql-for-ql findings.

1f43e51

C++: Make final member functions not extensible.

5462dcd

Remove an unnecessary if.

0ce6ab5

C++: Add another entry to 'qlpack' for external models.

9cba91c

C++: Move contents of 'isUninterestingForDataFlowModels' to 'relevant'

e55f94c

C++: Also make protected members irrelevant.

f6f5f97

C++: Add more tests.

6fcf56e

C++: Move 'InlineMadTest.qll' out of 'lib/utils/test' and into 'test' since C++ has no external packs depending on MaD testing.

3fd760c

MathiasVP force-pushed the cpp-add-mad-generation-library branch from d188823 to 3fd760c Compare April 20, 2025 15:49

michaelnebel reviewed Apr 22, 2025

michaelnebel previously approved these changes Apr 22, 2025

C++: Add an empty model to prevent a warning.

07d8f8d

MathiasVP dismissed michaelnebel’s stale review via 07d8f8d April 23, 2025 09:25

jketema previously approved these changes Apr 23, 2025

jketema reviewed Apr 23, 2025

cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py (resolved)

michaelnebel previously approved these changes Apr 23, 2025

C++: Add MaD generation test with union content.

9e9a580

MathiasVP dismissed stale reviews from michaelnebel and jketema via 9e9a580 April 23, 2025 10:15

jketema approved these changes Apr 23, 2025

MathiasVP merged commit 808141f into github:main Apr 23, 2025
16 checks passed

MathiasVP mentioned this pull request Apr 25, 2025

C++: Fix missing summaries in MaD generation #19383

Merged
