Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File Search - Improve performance of file search #15384

Closed
chrmarti opened this issue Nov 11, 2016 · 2 comments · Fixed by #16286
Closed

File Search - Improve performance of file search #15384

chrmarti opened this issue Nov 11, 2016 · 2 comments · Fixed by #16286
Assignees
Labels
plan-item VS Code - planned item for upcoming search Search widget and operation issues
Milestone

Comments

@chrmarti
Copy link
Contributor

chrmarti commented Nov 11, 2016

Part of #55.

We will focus on simple text search. Regex text searches might be more difficult to optimize and are used less often than simple text searches. We will initially explore our options (e.g., refining our own implementation, command line tools, text search libraries/engines).

We will include full-text search libraries/engines in our investigation and see if any of these can do simple text searches while at the same time open the door for also supporting a more natural 'fuzzy' text search at a later time.

The goal is raw search performance with library size (if any) and disk usage of an index (if any) being potential trade-offs.

@chrmarti chrmarti added plan-item VS Code - planned item for upcoming search Search widget and operation issues labels Nov 11, 2016
@chrmarti chrmarti added this to the November 2016 milestone Nov 11, 2016
@roblourens
Copy link
Member

roblourens commented Nov 15, 2016

Subjectively, file search in VS Code is slower than in other editors, and this has been noticed by users. Here are a couple example issues:

Slow search
#14157
#14012
#8680
#63

Slow, cancelling is not discoverable, progress is not helpful
#9853

The first step is

  • Admit we have a problem

To nail down the issue, I'm doing a round of performance tests in controlled conditions to better understand the perf characteristics of our current file search, and find the areas that will benefit the most from optimization. I'm doing the same search using other tools for comparison.

Goals

VS Code file search should not be, or feel, significantly slower than other editors. Being faster than some is probably achievable. Results should appear quickly and the UI should remain responsive throughout the duration of the search.

We also have the goal of investigating whether tools that might provide a very fast indexed search, and a more "fuzzy" search that goes beyond literal text, and evaluating the costs associated with it.

Setup

I'm running these tests in two workspaces, workspace A and workspace B. I'm trying to run a search in each that produces about the same number of results, where workspace A is a very large workspace with roughly evenly-distributed results, and workspace B is a very small workspace with densely packed results.

Workspace A is a clone of the chromium repo, by the instructions here. This is workspace of 234,655 files, about 1.8 GB. In this workspace I'm doing a case-insensitive search for the phrase static_library(, producing 447 results in 282 files, which are well-distributed across the repo. This phrase shows up in BUILD.gn files which are all over the repo.

Workspace B is a folder containing 4 files, which together have just the phrase static_library("foo") 447 times.

The regex search uses the regex /static_...rary\(/.

Timing is done using a stopwatch from pressing 'enter' on the query, to the results being fully loaded into the UI and ready for interaction. Except for vscode, into which I injected timing instrumentation, and grep, using the 'time' command. This is pretty adhoc testing on my MBP and Windows desktop with SSDs.

The Grep search is using this command: time find . -type f -print0 | xargs -0 -P 4 fgrep -i -I "static_library(" | wc -l.

Results

MacOS
- Mid-2015 MacBook Pro
- 2.2 GHz Intel Core i7
- 16 GB 1600 MHz DDR3
- APPLE SSD SM0512G

Tool Workspace A Workspace A regex Workspace B Notes
Webstorm 3.5s 8s instant Indexed search*
grep 18s 19s .02s Obviously doesn't need to build a UI, etc etc
git grep 10s 9.3 n/a Only searches files that git knows about. Multithreaded where normal grep is not.
Sublime 28s 28s instant Streams results into a document in lexicographical order
Atom 2m50s 2m47s instant Shows results during the search, but scrolls back to the top every time new results are added
VS Code 1m49s 1m49s .25s No results appear until the end
  • Webstorm indexes the workspace on launch. If you search before the index is finished building (takes a minute or so for workspace A) it takes about 50s to search. Also has a search "Context" option to search just in string literals, comments, etc. I want that.

Windows
- Intel Xeon E3-1240 v3 @3.40 GHz
- 16 GB RAM
- Crucial 500GB SSD

Tool Workspace A Workspace A regex Workspace B Notes
Webstorm 2.5s 2.8s instant Indexed search, see above notes.
findstr 18s 29s instant
Sublime 29s 29s instant
VS 55s 3m45s 1s No progress bar. Apparently doesn't exclude binary files like other tools, as it spent awhile searching a .mp4 file.
VS Code 2m18s 2m18s instant

Analysis

VS Code time breakdown

What happened when VS Code spent 1m49s on a search:
- 0s - Search started
- .2s - First candidate file found in the search process
- .5s - First match found in the search process
- 108.289s - Batch of 282 FileMatch objects received in the frontend (447 matches total)
- 108.324s - Final SearchResult.add call, updating TreeView UI begins
- 109.03s - TreeView UI finished updating

The time from starting the search to finding a match is very short, but then we sit on that match for 100+ seconds, because we don't send results to the frontend until we've filled a batch of 512. Since there are fewer than a full batch of results, nothing is visible until the search is over. So we could increase the perceived performance quite a lot by adding a timeout to send intermediate results over if the batch doesn't fill up.

I changed the batch size to 20 and got results immediately, which is much nicer. But scrolling around through the list, it jumped around a lot as results were inserted into the middle. Results are returned as files are finished, which may or may not be in order. We could only show results in order (Sublime does this), or show results as they come in. This is also a problem with a batch size of 512, but there are only 4 batches in our max results limit of 2048, so less of a problem.

On the end of making search faster, we should consider using the find command (we do this for cmd+p, but not text search) to get a list of files, and grep to search them. We can use findstr on Windows. I'm trying to understand how much time the filesystem walking is responsible for vs the search, but it's difficult to say exactly since the two flows are interleaved. I also want to profile the search process more to understand where it's spending time.

We can also try to run our search in more than one process. I saw a near 4x speedup when running 4 processes of grep instead of 3. Depending on disk speed, number of cores, and amount of time we spend just sending information between processes.

Fuzzy search

We also wanted to investigate options for enabling search-engine-style indexed fuzzy search in a workspace. 'Indexed' meaning an index is built for the files in the workspace, on disk or in memory, and a search in a large workspace can complete in seconds or milliseconds. 'Fuzzy' meaning a search based on not literal text or regex, but on word stemming, word proximity, etc. Here's an example use case, based on a true story:

I'm exploring the vscode code base, trying to find the code responsible for file search. I cmd+shift+F and search search files. No results. file search. No results. filesearch? That's it, there are 50 results with that literal. But it would be useful to find a comment that says // This code searches the text in files.

There are a bunch of options for this, but any third-party tool here will increase our download size, so we plan to consider exposing search as an extension API, and implementing an optional extension that would provide this advanced search behavior for people who want it, while making sure that our built-in search is still usable and fast.

@roblourens
Copy link
Member

roblourens commented Nov 23, 2016

These results come from running the TextSearch.performance test a couple times and averaging (where each run averages 3 runs of the same search). It executes the search and returns results to the frontend process, but doesn't do UI things with them.

MacOS Win
With Buffer.slice (original code) 100s 125.5s
Replacing Buffer.slice with buffer.toString(null, start, end) 71.1s 97.05s
With 4 worker processes (for a total of 5 (very very rough)) 26.4 19.0
With code in PR 17.6 16.3

roblourens added a commit that referenced this issue Nov 23, 2016
…tart, end).

This is much faster although it's not totally obvious why. But it's run for every line of every file in the workspace so it's very sensitive.  See #15384.
@roblourens roblourens mentioned this issue Dec 3, 2016
3 tasks
@egamma egamma mentioned this issue Dec 20, 2016
56 tasks
@vscodebot vscodebot bot locked and limited conversation to collaborators Nov 18, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
plan-item VS Code - planned item for upcoming search Search widget and operation issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants