Optimizations for FileMatcher.GetFiles #2572
Conversation
@maca88, it will cover your contributions to all .NET Foundation-managed open source projects.
@maca88, thanks for signing the contribution license agreement. We will now validate the agreement and then the pull request.
src/Shared/FileMatcher.cs
Outdated
if (type == FileSystemEntity.Directories)
{
    return entriesCache.GetOrAdd($"{path};{pattern ?? "*"}",
        s => new CachedEnumerable<string>(s_defaultGetFileSystemEntries(type, path, pattern,
Given that `s_defaultGetFileSystemEntries` gets the whole array of entries in one go (https://github.com/Microsoft/msbuild/blob/master/src/Shared/FileMatcher.cs#L256), wouldn't it be better to use something like `ImmutableArray<T>`, initialize it with the full array, and avoid the locks and contention in `CachedEnumerable`?
The main idea of `CachedEnumerable` was to save one iteration through the `IEnumerable<string>` retrieved by `Directory.EnumerateDirectories`. After some testing I realized that there is no noticeable difference when initializing the `IEnumerable` with the `ToImmutableArray` method prior to returning. I will remove the `CachedEnumerable` and cache an `ImmutableArray<string>` instead.
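A minimal sketch of the caching shape being discussed — a `ConcurrentDictionary` keyed on path and pattern whose values are fully materialized `ImmutableArray<string>` entries. This is illustrative only (the type and member names are hypothetical, not the actual FileMatcher code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.Immutable;

static class EntryCacheSketch
{
    // Illustrative cache: key is "path;pattern", value is the fully
    // materialized entry list. GetOrAdd gives thread-safe caching without
    // the per-item locking a lazily filled CachedEnumerable would need.
    static readonly ConcurrentDictionary<string, ImmutableArray<string>> s_cache =
        new ConcurrentDictionary<string, ImmutableArray<string>>();

    public static ImmutableArray<string> GetCachedEntries(
        string path,
        string pattern,
        Func<string, string, IEnumerable<string>> enumerate)
    {
        // Materialize the enumerable exactly once; later lookups hit the cache.
        return s_cache.GetOrAdd(
            $"{path};{pattern ?? "*"}",
            _ => enumerate(path, pattern).ToImmutableArray());
    }
}
```

Since `ImmutableArray<string>` is immutable, concurrent readers need no further synchronization once the value is in the dictionary.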
Oh, but you are switching to using `EnumerateDirectories` later, so this doesn't matter.
So, if you are reading all of it here anyway, then we can switch back to using `Directory.GetDirectories` and to `string[]` from `IEnumerable<string>`.

Also, it would be good to use `git rebase -i` to rework the commits whenever you want to change or fix something in them. That way the commits stay clean and are easier to review.
Just thinking aloud: `Directory.EnumerateDirectories` should at least help with the amount of memory being used when there are a lot of directories. Doesn't it?
Looking into the tests.
The problem is with the usage of
I think the main issue here is that the testing mock file system returns paths with backslashes in them, regardless of the OS. I guess the GetAccessibleFileSystemEntries method should always return paths that work well with Path.GetDirectoryName / Path.GetFileName. I'll try to make the mock normalize its slashes.
Fixed directory matching and flattened the
Force-pushed from 5a702a2 to e5ba128
There seems to be inconsistent behavior with excludes and case sensitivity. In the current master, when an exclude is matched via regex the matching is case-insensitive, but when it is matched with the searchesToExcludeInSubdirs dictionary it is case-sensitive, as the dictionary does not use a case-insensitive comparer. For example, if we modify this line to
Regarding case sensitivity, I checked the version that shipped with VS 2015 Update 3 (before we did any improvements to file globbing). Back then the exclude specs were not sent to GetFiles; the disk got walked once for the include and once for each exclude, and the final file set was obtained by subtracting the excluded files from the included ones. The only matching that happened was the Regex one, which was case-insensitive. So to keep the previous behaviour, both matches need to be case-insensitive. Regarding the double question mark match, master seems to work; it excludes
src/Shared/FileMatcher.cs
Outdated
// extensions that start with the same three characters e.g. "*.htm" would match both "file.htm" and "file.html"
// 3) if the ? wildcard is to the left of a period, it matches files with shorter name e.g. ???.txt would match
// foo.txt, fo.txt and also f.txt
string extensionPart = Path.GetExtension(searchPattern);
Please don't allocate here unless needed; make sure this occurs only when the search for "?." succeeds.
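One way to honor that review comment is to guard the `Path.GetExtension` call behind a cheap ordinal substring check, so the allocation only happens when the pattern can actually trigger the legacy `?` matching rules. A hedged sketch (the helper name is hypothetical):

```csharp
using System;
using System.IO;

static class PatternSketch
{
    // Only pay for the Path.GetExtension allocation when the pattern
    // actually contains a '?' immediately before a period, i.e. when the
    // special legacy "?" matching rules described above can apply at all.
    public static string ExtensionIfNeeded(string searchPattern)
    {
        if (searchPattern.IndexOf("?.", StringComparison.Ordinal) < 0)
        {
            return null; // common case: no allocation needed
        }
        return Path.GetExtension(searchPattern);
    }
}
```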
src/Shared/FileMatcher.cs
Outdated
if (Directory.Exists(path))
{
    try
    {
        entries = Directory.GetFileSystemEntries(path, pattern);
        return ShouldEnforceMatching(pattern)
            ? Directory.EnumerateFileSystemEntries(path, pattern).Where(o => IsMatch(Path.GetFileName(o), pattern))
You need to think about exception semantics, no longer are exceptions isolated here - they can be thrown when the "enumerable" is iterated.
Ouch, I completely missed that. I will wrap the enumerable into a "safe" enumerable that omits those two exceptions, in order to preserve the same behavior as prior to this change.
> no longer are exceptions isolated here

Sadly, this is not the whole story: when the `EnumerateFileSystemEntries`/`EnumerateFiles`/`EnumerateDirectories` methods are called, they can also throw immediately, because the file enumerator is initialized by finding the first file/directory, and if an error occurs at that point the exception is thrown right away. So we have to handle exceptions both when one of those three methods is called and on each MoveNext call.
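The "safe" wrapper described above — swallowing the two access exceptions both at enumerator creation (where the BCL enumerators eagerly find the first entry) and on each `MoveNext` — might be sketched like this. This is an illustrative shape, not the actual MSBuild code; the class and method names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Security;

static class SafeEnumerationSketch
{
    // Stops iteration instead of letting SecurityException or
    // UnauthorizedAccessException escape, mirroring the old behavior of the
    // array-returning Directory.GetFileSystemEntries call.
    public static IEnumerable<string> Safe(IEnumerable<string> source)
    {
        IEnumerator<string> e;
        try { e = source.GetEnumerator(); }          // can throw: first entry is found eagerly
        catch (SecurityException) { yield break; }
        catch (UnauthorizedAccessException) { yield break; }

        try
        {
            while (true)
            {
                string current;
                try
                {
                    if (!e.MoveNext()) { break; }    // can throw mid-iteration too
                    current = e.Current;
                }
                catch (SecurityException) { break; }
                catch (UnauthorizedAccessException) { break; }
                yield return current;
            }
        }
        finally
        {
            e.Dispose();
        }
    }
}
```

Note the `yield return` sits outside the inner try/catch, since C# forbids yielding from inside a try block that has catch clauses.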
src/Shared/FileMatcher.cs
Outdated
@@ -372,7 +412,7 @@ GetFileSystemEntries getFileSystemEntries
else
{
    // getFileSystemEntries(...) returns an empty array if longPath doesn't exist.
    string[] entries = getFileSystemEntries(FileSystemEntity.FilesAndDirectories, longPath, parts[i], null, false);
    string[] entries = getFileSystemEntries(FileSystemEntity.FilesAndDirectories, longPath, parts[i], null, false).ToArray();
This can throw.
)
{
    for (int i = 0; i < paths.Length; i++)
    foreach (string path in paths)
This can throw.
string projectDirectory
)
{
    bool directoryLastCharIsSeparator = IsDirectorySeparator(projectDirectory[projectDirectory.Length - 1]);
    for (int i = 0; i < paths.Length; i++)
    foreach (string path in paths)
This can throw.
All of these will be caught via the aggregate exception in GetFiles right?
Actually, they will be caught, but that is not correct: we don't want to return an "empty" result just because one of the subdirectories failed to get its files. Catching the `AggregateException` is supposed to do the same as the other catch when the exception is thrown inside a `Parallel.ForEach`, which always throws an `AggregateException` containing the actual exception. The iteration over `paths` should not throw for `SecurityException` and `UnauthorizedAccessException`, in order to preserve the old behavior; I will correct this.
Edit:
Fixed
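The `AggregateException` behavior being discussed — nested `Parallel.ForEach` calls wrapping the inner aggregate inside the outer one, with `Flatten()` collapsing the nesting so each real exception can be inspected — can be demonstrated with a small sketch (illustrative only; the helper is hypothetical):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class AggregateSketch
{
    // Runs the body inside a nested Parallel.ForEach and reports whether
    // every exception that escaped (after Flatten) satisfies the predicate,
    // analogous to checking ExceptionHandling.IsIoRelatedException.
    public static bool AllExceptionsMatch(Action body, Func<Exception, bool> predicate)
    {
        try
        {
            Parallel.ForEach(new[] { 0 }, _ =>
                Parallel.ForEach(new[] { 0 }, __ => body()));
            return true; // nothing thrown
        }
        catch (AggregateException ex)
        {
            // Without Flatten(), InnerExceptions would contain another
            // AggregateException from the inner loop, not the real cause.
            return ex.Flatten().InnerExceptions.All(predicate);
        }
    }
}
```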
src/Shared/FileMatcher.cs
Outdated
@@ -1905,21 +2123,42 @@ DirectoryExists directoryExists
* This is because it's cheaper to add items to an IList and this code
Comment is outdated.
// Flatten to get exceptions that are thrown inside a nested Parallel.ForEach
if (ex.Flatten().InnerExceptions.All(ExceptionHandling.IsIoRelatedException))
{
    return CreateArrayWithSingleItemIfNotExcluded(filespecUnescaped, excludeSpecsUnescaped);
Oh, is this to handle the exceptions? If so, why do we still need the catch blocks above?
Check the comment above about catching the `AggregateException`.
// Use only half of the processors when we have 4 or more of them, in order not to be too aggressive.
// By setting MaxTasksPerIteration to the maximum number of tasks, only one
// Parallel.ForEach will run at once, which gives a stable number of created threads.
var maxTasks = Math.Max(1, Environment.ProcessorCount / 2);
@lifengl What do you think about this when it runs inside VS?
That seems fine to me. In the Dev 14 timeframe, during solution load, we used about 2.2 cores inside VS, and design-time build used about 1 core in the background. However, for .NET Core projects, after we disabled the async tree loading, and with the very long evaluation times, it falls back to almost one core, especially during the project evaluation phase. So this change will make the evaluation more efficient during that phase.
During the project editing or project reloading phases, we may have multiple evaluations running in the background. Ideally, if we could control this based on the priority of the evaluation (foreground or background), it would be better. But this setting does not look too aggressive.
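The "half the processors" policy quoted from the diff translates directly into a degree-of-parallelism cap. A minimal sketch of that idea using the standard `ParallelOptions` mechanism (the actual PR threads the limit through its own `TaskOptions` type; this helper name is hypothetical):

```csharp
using System;
using System.Threading.Tasks;

static class DopSketch
{
    // Cap parallel file-system traversal at roughly half the logical
    // processors (but always at least one), so the traversal does not
    // starve other work, e.g. when running inside Visual Studio.
    public static ParallelOptions CreateOptions()
    {
        int maxTasks = Math.Max(1, Environment.ProcessorCount / 2);
        return new ParallelOptions { MaxDegreeOfParallelism = maxTasks };
    }
}
```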
This is a really great contribution, thanks @maca88. I'm not clear on the exception semantics of the change, so I'd like that clarified: what happens when an exception is thrown as we iterate?
/// <summary>
/// Cache of IO operation results
/// </summary>
protected ConcurrentDictionary<string, IEnumerable<string>> EntriesCache => _lazyEvaluator._entriesCache;
@cdmihai Help me understand the lifetime of this cache. Is it just for a single Include, or for all item evaluations?
It spans the entire item evaluation phase, so it covers all items outside of targets.
I will modify the algorithm to support case-insensitive matching as fast as possible, correct the initialization of
My bad, I didn't notice that the error was due to the mock file system, which doesn't support complex patterns.
The exception handling should now work like before, with two exceptions:
Edit: In order to avoid the second point, I've changed the
src/Shared/FileMatcher.cs
Outdated
@@ -372,7 +478,7 @@ GetFileSystemEntries getFileSystemEntries
else
{
    // getFileSystemEntries(...) returns an empty array if longPath doesn't exist.
    string[] entries = getFileSystemEntries(FileSystemEntity.FilesAndDirectories, longPath, parts[i], null, false);
    string[] entries = getFileSystemEntries(FileSystemEntity.FilesAndDirectories, longPath, parts[i], null, false).ToArray();
Minor: ToList() is usually slightly faster (they cost about the same to copy the data into a buffer, but ToArray() often copies one more time to fit the exact size).
With the recent change, the `ToArray` is not needed anymore.
That is really nice work, @maca88. I saw so many traces where the slow globbing code consumed 10 seconds or even 30 seconds in some projects. This change will make a huge difference.
@@ -249,6 +249,7 @@
    </ProjectReference>
</ItemGroup>
<ItemGroup Condition="'$(NetCoreBuild)' != 'true'">
    <Reference Include="System.Collections.Immutable, Version=1.2.1.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" />
I've had to add this reference in order to build the project. Does anyone know why?
I think it's because ImmutableArray got exposed to the tests (FileMatcher_Tests, via the new cache). And then FileMatcher_Tests is included in two test dlls.
I understand that the two test dlls need the dependency; what I don't understand is why adding the dependency to the `project.json` doesn't work for `Engine.UnitTests` but works for the `Tasks.UnitTests` project. It seems like one of the `Engine.UnitTests` dependencies needs a lower version of the Immutable dll, but from the logs it is not clear which one.
src/Shared/FileMatcher.cs
Outdated
// Add all matched files at once to reduce thread contention
if (files != null && files.Any())
{
    listOfFiles.PushRange(files.ToArray());
Instead of copying the list to an array and then creating a linked-list node for each item (inside the stack code), I think it can be a stack of lists, so you just push each non-empty list there. In the code consuming it, it becomes `stack.SelectMany(list => list)...`, so we prevent the extra allocation in case the list is big.
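The suggested shape — push each batch as a single node, flatten with `SelectMany` on the consuming side — might look like this (a hypothetical sketch with illustrative names, not the actual patch):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

static class ListStackSketch
{
    // Pushes each non-empty batch as one stack node instead of one node per
    // string, avoiding both the ToArray copy and the per-item node
    // allocations of PushRange; the consumer flattens with SelectMany.
    public static string[] Collect(IEnumerable<List<string>> batches)
    {
        var stack = new ConcurrentStack<List<string>>();
        foreach (var files in batches)
        {
            if (files != null && files.Count > 0)
            {
                stack.Push(files); // one node per batch
            }
        }
        return stack.SelectMany(list => list).ToArray();
    }
}
```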
@dotnet-bot test Windows_NT Build for Full please
src/Shared/FileMatcher.cs
Outdated
// Lock only when we may be dealing with multiple threads
if (taskOptions.MaxTasks > 1 && taskOptions.MaxTasksPerIteration > 1)
{
    lock (s_directoryLock)
Can you explain what you are protecting against?
.. and rename the variable to reflect that. (s_taskOptionsLock?)
This lock prevents threads created from different `Parallel.ForEach` calls from modifying the `dop` variable concurrently, in order to ensure that we never use more tasks than have been specified in `TaskOptions`. We have to consider that this method is recursive, which means one `Parallel.ForEach` may call another `Parallel.ForEach` inside its iteration. I've added an additional condition for locking, as with the current settings the lock is not needed. To demonstrate how this works, I've created a simulation where `TaskOptions` is configured to allow up to 4 `Parallel.ForEach` loops running concurrently. By commenting out the `lock` statement, we can see that after several runs we may get 5 `Parallel.ForEach` loops running concurrently.
👍
Can we just lock the taskoptions instead of creating another object?
Good idea! By doing so, we get locking per call, which is more correct than having it static, as the purpose of the lock is only to modify the `TaskOptions`.
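Locking on the per-call options object to guard the shared task budget, as agreed above, can be sketched like this (the type and member names are hypothetical, chosen only to illustrate the idea):

```csharp
using System;

static class TaskOptionsSketch
{
    public sealed class TaskOptions
    {
        public int MaxTasks;        // total budget for this GetFiles call
        public int AvailableTasks;  // starts equal to MaxTasks
    }

    // Reserve tasks under a lock on the per-call options object itself, so
    // recursive/nested Parallel.ForEach calls can never collectively hand
    // out more tasks than MaxTasks allows.
    public static int ReserveTasks(TaskOptions options, int requested)
    {
        lock (options)
        {
            int granted = Math.Min(requested, options.AvailableTasks);
            options.AvailableTasks -= granted;
            return granted;
        }
    }
}
```

Because the lock target is the options instance rather than a static field, two unrelated `GetFiles` calls never contend with each other.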
I have corrected the case sensitivity behavior, so that excludes are now matched case-insensitively. I've added a unit test for each correction, which would fail on the current master. The overall performance after adding the case-insensitive matching remains more or less the same as before. Edit: There are some specific cases where an include or exclude pattern won't be matched case-insensitively on Unix. By quickly checking the version that shipped with VS 2015 Update 3, it seems that it has the same problems. Should we correct them, or should we leave them as they are for backward compatibility?
Force-pushed from edbe8f5 to d06a0c0
@dotnet-bot test this please
Force-pushed from d06a0c0 to 59a1395
Leave them as is, to reduce scope creep :) I'll open an issue linking to your test once this PR gets in.
FYI, we're planning to merge this tomorrow morning (10am PDT). @maca88, can you please squash your changes if you get the chance? (If not, I'll do it; I just need to make sure you are preserved as commit author.)
@maca88 can you please contact me offline? davkean@microsoft.com
Optimizations:
- Replaced Directory methods GetFiles and GetDirectories with EnumerateFiles and EnumerateDirectories
- Added a custom wildcard pattern matching algorithm for matching directory and file names
- Added directory caching
- Added parallelization of subdirectories
Force-pushed from 59a1395 to 246a185
@cdmihai
@maca88 this is a very substantial contribution, probably the biggest we've had on MSBuild from outside Microsoft. Thank you so much for seeing this through and addressing the numerous comments and iterations! This will definitely be a big part of the perf gains we'll get out of 15.5!
I just came here to ooh and aah at the performance improvements, but I really hope the quoted part (in an earlier comment by @maca88) wasn't implemented, because there are definitely Unix file systems where you are allowed to have backslashes in filenames, and trying to turn them into slashes could be a logic error at best and point to completely the wrong file at worst. There are security consequences to treating Unix and Windows paths the same way (or even just not like themselves).
@JesperTreetop the quoted part wasn't implemented; I updated the misleading part of the comment in order to avoid confusion.
Here are the optimization changes that were discussed in #2392, divided into three commits:
- Using `Directory.EnumerateFiles` and `Directory.EnumerateDirectories` and checking excludes with a custom wildcard matching algorithm.
- Caching directory enumerations in the `LazyItemEvaluator`.
- Parallelizing the traversal of subdirectories with `Parallel.ForEach`.

For the parallel change I made some modifications in order to prevent spawning too many threads. With the old modification the traversal created up to 9 threads on my computer; with the new one, it creates up to 4 threads. I found this to be a good balance between performance and threads being used, at least on my computer (Intel i7 860).
This test was run on the NHibernate.Test project and the Roslyn root folder. For NHibernate, the evaluation was also tested by using the Project class.
Latest master:
This PR:
All tests were run in Release configuration on an Intel i7 860 processor, and the average of 5 runs was taken.