Using binary logs to significantly increase CodeQL analysis performance for C# #16346
The MSBuild team is also happy to support here.
Thank you for putting together the sample. I think your suggestion makes sense; I'm exploring it a bit further to understand what else would need to be changed. One point that is not immediately clear to me is how to get hold of the syntax trees produced by source generators. Is there a Roslyn API that would return these trees?
The default experience is that generated files are embedded into the PDB produced by the build. There is no official API for this, but it's in a well-known part of the PDB. The code for reading that is here. If the goal is just to read these files for a given compilation on demand, then it's a fairly simple change to expose an API directly that lets you access them. It would be something along the lines of `List<SyntaxTree> ReadAllGeneratedFiles(CompilerCall compilerCall);`

There is one caveat for that approach though: it only works if the build is using portable PDBs. Native PDBs don't have a facility for that. Using native PDBs with source generators is decently rare, though. It seems reasonable that for a CodeQL run you could ask that portable PDBs be generated.
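For illustration, here is a sketch of pulling embedded files back out of a portable PDB with `System.Reflection.Metadata` (the kind GUID and blob layout come from the portable PDB spec; the `ReadEmbeddedSources` name and UTF-8 assumption are mine, and this reads every embedded document, of which generated files are embedded by default):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Reflection.Metadata;
using System.Text;

static class EmbeddedSource
{
    // Custom debug information kind for embedded source, per the portable PDB spec.
    static readonly Guid EmbeddedSourceKind = new("0E8A571B-6926-466E-B4AD-8AB04611F5FE");

    public static IEnumerable<(string FilePath, string Text)> ReadEmbeddedSources(string pdbPath)
    {
        using var stream = File.OpenRead(pdbPath);
        using var provider = MetadataReaderProvider.FromPortablePdbStream(stream);
        var reader = provider.GetMetadataReader();

        foreach (var documentHandle in reader.Documents)
        {
            var document = reader.GetDocument(documentHandle);
            foreach (var cdiHandle in reader.GetCustomDebugInformation(documentHandle))
            {
                var cdi = reader.GetCustomDebugInformation(cdiHandle);
                if (reader.GetGuid(cdi.Kind) != EmbeddedSourceKind)
                    continue;

                // Blob layout: int32 format marker; 0 means raw bytes follow,
                // a positive value is the uncompressed size of a deflate stream.
                var blob = reader.GetBlobBytes(cdi.Value);
                int uncompressedSize = BitConverter.ToInt32(blob, 0);
                byte[] content;
                if (uncompressedSize == 0)
                {
                    content = blob[4..];
                }
                else
                {
                    using var deflate = new DeflateStream(
                        new MemoryStream(blob, 4, blob.Length - 4),
                        CompressionMode.Decompress);
                    content = new byte[uncompressedSize];
                    int read = 0, n;
                    while (read < content.Length &&
                           (n = deflate.Read(content, read, content.Length - read)) > 0)
                        read += n;
                }
                // Sketch assumes UTF-8; real sources may carry a BOM.
                yield return (reader.GetString(document.Name), Encoding.UTF8.GetString(content));
            }
        }
    }
}
```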
Curious if you're asking about generated files in this context. If so, the underlying library takes a couple of different approaches:
Thanks for the explanation.
Yes, this is the goal. BTW, what is the reason that Roslyn doesn't expose an API on
We're injecting the following properties into the build command:
We're adding
Okay, will add an API for that.
The underlying APIs for running the generators are available. Can see how compiler logs approaches that problem here. That provides the

```csharp
Compilation compilation = ...;
var driver = CreateGeneratorDriver();
driver.RunGeneratorsAndUpdateCompilation(compilation, out var compilation2, out var diagnostics, cancellationToken);
var generatedSyntaxTrees = compilation2.SyntaxTrees.Skip(compilation.SyntaxTrees.Count());
```

That does require the host to write a bit of code though to load the analyzers, run the driver, etc ... The reason there isn't a higher-level API that just takes care of all of this for you is that it's hard to find a single solution that fits all hosts. In particular, analyzer loading is very complex, and as a result different hosts tend to make different decisions on how to approach it. Consider that even within the Roslyn compiler, analyzers are loaded with three different strategies depending on the scenario. The reason the NuPkg I used in the sample has easy APIs for creating
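To make that concrete, `CreateGeneratorDriver` could look something like the following hypothetical helper; the naive `Assembly.LoadFrom` strategy is exactly the analyzer-loading decision that each host tends to make differently, so treat it as an assumption for illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static GeneratorDriver CreateGeneratorDriver(IEnumerable<string> analyzerAssemblyPaths)
{
    // Naive loading for illustration; real hosts use isolated load
    // contexts, shadow copies, or other strategies.
    var generators = analyzerAssemblyPaths
        .Select(Assembly.LoadFrom)
        .SelectMany(assembly => assembly.GetTypes())
        .Where(type => !type.IsAbstract)
        .SelectMany(type =>
            typeof(IIncrementalGenerator).IsAssignableFrom(type)
                // Incremental generators are wrapped so one driver runs both kinds.
                ? new[] { ((IIncrementalGenerator)Activator.CreateInstance(type)!).AsSourceGenerator() }
            : typeof(ISourceGenerator).IsAssignableFrom(type)
                ? new[] { (ISourceGenerator)Activator.CreateInstance(type)! }
                : Array.Empty<ISourceGenerator>());

    return CSharpGeneratorDriver.Create(generators);
}
```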
I'm not as familiar with this scenario. I attempted to recreate it with a .NET Framework ASP.NET application. Setting that property I do see it go into a

Are you all intercepting that call? Or is there a subtlety I missed where this also generates a separate csc call? There is a csc invocation, but I see that with or without the
Yep, those are just a normal source generator at the end of the day and should behave like other generators in terms of finding the files.
I'm not that familiar with
The nature of the C# driver means that it needs to see all csc invocations for a build. Today that is achieved by disabling shared compilation during the build. That unfortunately creates a significant performance penalty when building C# code: the larger the repo, the more significant the penalty. The rule of thumb is that for solutions of at least medium size this causes somewhere around a 3-4x build slowdown. For larger repositories it can be even bigger.
Consider as a concrete example the dotnet/runtime repository. When their CodeQL pipeline runs and shared compilation is disabled it increases their build time by 1 hour and 38 minutes. Or alternatively building the dotnet/roslyn repository. Building that locally results in a ~4X perf penalty when shared compilation is disabled.
I understand from previous conversations that this is done because the extractor needs to see every C# compiler call in the build and potentially process it. An alternative approach is to have the build produce a binary log. That is a non-invasive change to builds, and there is ample tooling for reading C# compiler calls from those once the build completes. The code for reading compiler calls is at the core very straightforward:
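For example, with the Basic.CompilerLog library used in the sample, the core loop is roughly as follows (API names here are from memory, so treat them as approximate):

```csharp
using System;
using Basic.CompilerLog.Util;

// The build is run once, non-invasively, with a binary log:
//   dotnet build -bl:msbuild.binlog
using var reader = BinaryLogReader.Create("msbuild.binlog");
foreach (var compilerCall in reader.ReadAllCompilerCalls())
{
    // Each CompilerCall captures one csc/vbc invocation: the project,
    // the full argument list, references, and source files.
    Console.WriteLine(compilerCall.ProjectFilePath);
}
```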
I have a sample here of integrating this approach into the existing extractor tool. On my machine I was able to use this sample to run over a full build of roslyn without any obvious issues. And most importantly, no changes to how I built roslyn 😄
Using libraries like this has other advantages, because you could also offload the work of creating `CSharpCompilation` objects. There is code in that library that will very faithfully recreate a `CSharpCompilation` for every `CompilerCall` instance. Furthermore, I think there are other optimization opportunities for the extractor once you are running the analysis in bulk: caching `MetadataReference`s, caching `SourceText`s, etc ...

More than happy to chat about this, the different ways binary logs can be used, moving analysis off machine, etc ... My big end goal is to get back to us using shared compilation so we can keep our build times down.
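A sketch of what that offload could look like with the same library (again, member names approximate rather than authoritative):

```csharp
using Basic.CompilerLog.Util;

using var reader = BinaryLogReader.Create("msbuild.binlog");
foreach (var compilerCall in reader.ReadAllCompilerCalls())
{
    // Recreates the compilation the compiler saw for this call,
    // including references and, after generators run, generated trees.
    var compilationData = reader.ReadCompilationData(compilerCall);
    var compilation = compilationData.GetCompilationAfterGenerators();
    // ... run extraction over compilation.SyntaxTrees ...
}
```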