-
Notifications
You must be signed in to change notification settings - Fork 1.7k
C++: Instantiate model generation library #19295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++: Instantiate model generation library #19295
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR instantiates the model generation library for C++ by adding the complete library, tests, and associated extension configuration files to enable model generation via query summaries.
- Adds annotated model definitions and summaries in the library tests.
- Introduces tests (including instantiation of the inline expectation test framework) for verifying model generation.
- Provides extension YAML files to integrate summary models into the CodeQL pack.
Reviewed Changes
Copilot reviewed 4 out of 18 changed files in this pull request and generated no comments.
File | Description |
---|---|
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp | Adds model definitions with summary annotations for C++ dataflow. |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ext.yml | Configures summary capture extension with manual models. |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml | Configures content-based summary capture extension. |
cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py | Introduces a utility script for generating C++ flow models. |
Files not reviewed (14)
- cpp/ql/lib/utils/test/InlineMadTest.qll: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureMixedNeutralModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureMixedSummaryModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureNeutralModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureSinkModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureSourceModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/CaptureSummaryModels.ql: Language not supported
- cpp/ql/src/utils/modelgenerator/internal/CaptureModels.qll: Language not supported
- cpp/ql/src/utils/modelgenerator/internal/CaptureModelsPrinting.qll: Language not supported
- cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.expected: Language not supported
- cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql: Language not supported
- cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.expected: Language not supported
- cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureSummaryModels.ql: Language not supported
Comments suppressed due to low confidence (2)
cpp/ql/src/utils/modelgenerator/GenerateFlowModel.py:9
- [nitpick] Consider renaming 'madpath' to a more descriptive name like 'modelsDataPath' for improved clarity.
madpath = os.path.join(gitroot, "misc/scripts/models-as-data/")
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ext.yml:6
- [nitpick] Ensure consistent naming for model identifiers; consider using 'Models' instead of 'models' to align with other summary annotations.
- [ "models", "ManuallyModelled", False, "hasSummary", "(void *)", "", "Argument[0]", "ReturnValue", "value", "manual"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very very nice! Well done @MathiasVP !
import generate_flow_model as model | ||
|
||
language = "cpp" | ||
model.Generator.make(language).run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is my intention to change the python script such that --with-summaries
uses the mixed query instead (and correspondingly for the neutral), but for testing purposes it is nice to keep both the content based and heuristic based queries around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Started a PR with this work prior to easter: #19311
Will make sure not to merge before your PR is merged (and then also make the corresponding changes for C++).
f.isStatic() | ||
} | ||
|
||
predicate isUninterestingForDataFlowModels(Callable api) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe consider to move the content of this predicate to the relevant
predicate instead.
This will make things easier in case you intend to introduce "lifting" logic for the produced models.
For C# and Java we consider implementations of method "prototypes" (implementations of interface- or abstract class members) to abide to the "contract" of the interface- or abstract member - at least if the implementation is in the same codebase as the interface- or abstract member declaration.
That is, for something like
public interface I {
object M(object o);
}
public class C : I {
public object M(object o) {
return o;
}
}
we would like to "lift" the model identified for C.M
to I.M
and use this for all implementations of I.M
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! That idea totally flew over my head when I was reading the Java/C# implementations. I've moved the contents to the relevant
predicate in 3dfb68d, but I'll save the lifting logic to another PR, I think. Thanks for the heads up on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that is not unfortunately not evident from the implementation (or module signature) why that distinction is there (it is because parts of the model generator is also re-used for type based model generation - maybe I should consider to revisit this again). 😄
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp
Show resolved
Hide resolved
Inspiration for adding model generation summaries to DCA: https://github.com/github/codeql-dca/pull/847/ |
Thanks a lot, Michael! I've opened a draft PR with what I think are the required DCA changes. I'll test it once this PR is merged. |
… since C++ has no external packs depending on MaD testing.
d188823
to
3fd760c
Compare
@@ -16,6 +16,7 @@ dependencies: | |||
codeql/xml: ${workspace} | |||
dataExtensions: | |||
- ext/*.model.yml | |||
- ext/generated/*.model.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generates warnings (when running queries/tests) that there are no files matching the *.model.yml
.
Consider adding an empty.model.yml
(in the generated folder):
extensions:
- addsTo:
pack: codeql/cpp-all
extensible: summaryModel
data: []
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, well spotted! Fixed in 07d8f8d
@MathiasVP : Something to consider when starting to generate models for libraries: The mixed modelgeneration uses a combination of exact data flow (content based data flow) and heuristic data flow for producing models. A heuristic summary is only generated as a fallback (these are the models with provenance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good work!!! 🎉
Maybe consider to add the empty.model.yml
file before merging to avoid all the warnings.
Thanks a lot for these additional details, Michael! For C++ we're dome something very simple which we should eventually upgrade to be more like C# and Java: If we have a MaD summary for a callable then we only dispatch to the Mad summary and not to the source callable. So far this has given us good results, but now that we'll (probably) have many more models this may need to be re-examined. Once we get to that I'll be sure to reach out! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tried to grasp all the details here, but overall this looks sensible to me. Two small nits below.
name = uc.getUnion().getName() and | ||
indirectionIndex = uc.getIndirectionIndex() and | ||
// Note: We don't actually support the union string in MaD, but we should do that eventually | ||
kind = "Union[" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although we don't support this, there's also no test that exercises this. Maybe we should add one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! I've done this now in 9e9a580
or | ||
exists(DataFlow::ElementContent ec | | ||
c.isSingleton(ec) and | ||
result = "Element[" + ec.getIndirectionIndex() + "]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't seem to be a test that exercises this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. We don't actually add any Element
contents as part of the generation yet. The heuristics need to be quite a bit more complicated that the corresponding Java implementation so I decided to leave it out for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Now that both #19273 and #19274 are merged we can finally get to the fun part: Adding the actual implementation of model generation to C++ 🎉.
I couldn't think of a good way to structure the commits in this PR. Apologies!
I think that after this PR is merged the final step is to add the required DCA suite for model generation. However, as this isn't yet done I don't think it makes sense to run any DCA for this right now.
I've done some testing of this already: I've run the model generation on sqlite and the models look sensible for the very small subset that I've checked (if you're curious: https://gist.github.com/MathiasVP/6942f022c7a8f4e515c80ccd442ab59f).
cc @michaelnebel would you mind taking a brief look at this PR? I don't expect you to review the C++ specific parts, obviously 🙈