Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java: Identify more APIs as supported in the telemetry queries. #16297

Merged
merged 7 commits into from May 3, 2024

Conversation

michaelnebel
Copy link
Contributor

@michaelnebel michaelnebel commented Apr 22, 2024

In this PR we improve the telemetry queries such that sources- and sink APIs defined in QL will be taken into account when deciding whether an API is supported or not.
To this end a couple of new modules ApiSinks and ApiSources are introduced.
In each of these modules a class of nodes are introduced where the intention is to add all sources and sinks pertaining to APIs.

  • Defined in Models as Data.
  • Defined in QL re-usable libraries and used in data flow configurations as sources or sinks for queries.

Only source- and sink definitions that are sufficiently concrete have been added (that is, sources and sinks that just "guesses" based on string pattern matching on names are not included).

This is a best effort attempt to improve the telemetry reporting.

@michaelnebel michaelnebel changed the title Java: Indentify more APIs as supported in the telemetry queries (as QL … Java: Identify more APIs as supported in the telemetry queries. Apr 23, 2024
@michaelnebel
Copy link
Contributor Author

DCA looks.
As expected the DIL size and execution (stage timings) of the telemetry queries increased. This is not surprising, especially if the compute time for some of the sources and sinks now contribute to the telemetry query stage timings.
Overall DCA reports a slowdown of around 2%, which I believe is an upper limit (as the sources and sinks needs to be computed anyways for the queries to run).

@michaelnebel michaelnebel added the no-change-note-required This PR does not need a change note label Apr 24, 2024
@michaelnebel michaelnebel marked this pull request as ready for review April 24, 2024 07:04
@michaelnebel michaelnebel requested a review from a team as a code owner April 24, 2024 07:04
@owen-mc
Copy link
Contributor

owen-mc commented Apr 24, 2024

Will this need to be updated when a new query is added/promoted? If so, how will we remember to do that?

@michaelnebel
Copy link
Contributor Author

Will this need to be updated when a new query is added/promoted? If so, how will we remember to do that?

Excellent question.
I hope we won't have to update this whenever a query is added/promoted.
(1) If a query is based purely on MaD then all these are by default when models are added (or promoted out of experimental).(2) It is only necessary to add the sources or sinks of a query in case we think accurately define a group or a specific set of sources or sinks (based on API calls). When I looked through the java data flow configurations, I tried to avoid adding sources and sinks that was defined "with a broad net" (for instance pattern matching in parts of a method name).

The query doesn't have to be 100% accurate. We just need to at least have a best effort of incorporating QL sources and sinks.

@owen-mc
Copy link
Contributor

owen-mc commented Apr 25, 2024

I have skimmed this, but not reviewed in great detail. Here is the approach that I thought you were taking, but then I saw you weren't: define an abstract class SinkNode extending DataFlow::Node, put it somewhere that's imported by everything, make all the sink definitions you are interested in extend that, and then use that in the telemetry query. The advantage of this is that it's more discoverable: people will see SinkNode in various places, maybe in queries that they are copying when they're making a new query, they'll see the QLDoc which explains what it does, and then they are more likely to use it in their new queries. The other advantage is that it's easier to add new sinks. Is there a reason why that approach wouldn't work, or that your approach is preferable?

@michaelnebel
Copy link
Contributor Author

I have skimmed this, but not reviewed in great detail. Here is the approach that I thought you were taking, but then I saw you weren't: define an abstract class SinkNode extending DataFlow::Node, put it somewhere that's imported by everything, make all the sink definitions you are interested in extend that, and then use that in the telemetry query. The advantage of this is that it's more discoverable: people will see SinkNode in various places, maybe in queries that they are copying when they're making a new query, they'll see the QLDoc which explains what it does, and then they are more likely to use it in their new queries. The other advantage is that it's easier to add new sinks. Is there a reason why that approach wouldn't work, or that your approach is preferable?

That is a fair point and I will look into it - thank you! The reason I picked the approach here was to avoid have bidirectional imports between the Query like modules.

@michaelnebel michaelnebel marked this pull request as draft April 26, 2024 07:14
@aschackmull
Copy link
Contributor

There are a couple of different design options: There needs to be a central qll file that imports all of the sink definitions in order for the telemetry queries to see them (so we can't just have query-specific files extend SinkNode and be done).
A. This can be done either as it's currently done with a single qll that references the relevant sink definitions from their query-specific qlls and forms a big union.
B. Or it could be with a diamond-shaped import pattern such that one qll defines an abstract SinkNode class, which all of the sink definitions then extend, and then there's a separate telemetry-specific qll that imports all of the files containing extensions of SinkNode. (We probably don't want all the query-file imports to be in the file containing the SinkNode definition as that would collapse the import graph to one big SCC - that's the bi-directional import pattern, which we probably don't want here.)
C. A third alternative is to refactor all the sink definitions into something like Concepts, i.e. a combined collection of definitions that all of the query files would reference. Then the SinkNode definition and all of the sink definitions could be in the same file (or import-SCC of files). That's somewhat close to the bi-directional import that I just claimed we didn't want, except that it'll be much more clear which parts of the query logic that's import-reachable from all queries and which parts that remain local to the individual queries, i.e. putting something in scope of all queries will be an explicit choice rather than an accidental consequence of a massive bidirectional import-SCC.

@michaelnebel
Copy link
Contributor Author

michaelnebel commented Apr 26, 2024

There are a couple of different design options: There needs to be a central qll file that imports all of the sink definitions in order for the telemetry queries to see them (so we can't just have query-specific files extend SinkNode and be done). A. This can be done either as it's currently done with a single qll that references the relevant sink definitions from their query-specific qlls and forms a big union. B. Or it could be with a diamond-shaped import pattern such that one qll defines an abstract SinkNode class, which all of the sink definitions then extend, and then there's a separate telemetry-specific qll that imports all of the files containing extensions of SinkNode. (We probably don't want all the query-file imports to be in the file containing the SinkNode definition as that would collapse the import graph to one big SCC - that's the bi-directional import pattern, which we probably don't want here.) C. A third alternative is to refactor all the sink definitions into something like Concepts, i.e. a combined collection of definitions that all of the query files would reference. Then the SinkNode definition and all of the sink definitions could be in the same file (or import-SCC of files). That's somewhat close to the bi-directional import that I just claimed we didn't want, except that it'll be much more clear which parts of the query logic that's import-reachable from all queries and which parts that remain local to the individual queries, i.e. putting something in scope of all queries will be an explicit choice rather than an accidental consequence of a massive bidirectional import-SCC.

Thank you! I am already working on alternative B ;-)

@michaelnebel michaelnebel force-pushed the java/improveapitelemetry branch 4 times, most recently from fa4912f to fe654be Compare April 26, 2024 11:40
@michaelnebel
Copy link
Contributor Author

DCA looks good.

Copy link
Contributor

@owen-mc owen-mc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Some minor nitpicks/questions. Also I had a quick look through and thought you might also want to include these sinks:

  • CookieSink in java/ql/lib/semmle/code/java/security/InsecureRandomnessQuery.qll
  • FileCreationSink in java/ql/lib/semmle/code/java/security/TempDirLocalInformationDisclosureQuery.qll

Co-authored-by: Owen Mansel-Chan <62447351+owen-mc@users.noreply.github.com>
@michaelnebel
Copy link
Contributor Author

@owen-mc : Thank you for the feedback!

You suggestion on incorporating FileCreationSink and CookieSink has been included as well. I removed Api nodes that were just used for "safe" data flow configurations (used to exclude results). I am aware that if such configs had been using the MaD models, they would not be excluded, but I just want to draw the line somewhere.
The Api Source and Sink nodes are "just" used for telemetry purposes and the telemetry queries are mostly relevant for us to discover "unsupported" APIs. If it turns out that we should consider more QL sources and sinks as supported, we can add them (but then I would like to see those as results in the top 100 of unsupported APIs).

@michaelnebel michaelnebel merged commit 95ff5ba into github:main May 3, 2024
15 checks passed
@michaelnebel michaelnebel deleted the java/improveapitelemetry branch May 3, 2024 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants