Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not load entire database for all commands #269

Merged
merged 34 commits into from Jan 17, 2020

Conversation

sruffell
Copy link
Contributor

This merge request improves the performance when there are many entries in the database by not loading or parsing all the lines in many cases where it is not necessary.

Related to issue #245

@sruffell
Copy link
Contributor Author

I was just thinking about this, and I am thinking that it might make more sense for being/end to walk backwards through the inclusions. That would eliminate most of the use of rbegin/rend and put them in "natural" order since @1 is always the most recent, etc..

@lauft
Copy link
Member

lauft commented Dec 27, 2019

You mean instead of adding an extra set of iterators (rbegin/rend), to redefine the existing ones? That's fine with me. As you already mentioned, the natural order for Timewarrior is going backwards in time through the intervals.

@sruffell
Copy link
Contributor Author

sruffell commented Jan 2, 2020

You mean instead of adding an extra set of iterators (rbegin/rend), to redefine the existing ones?

The pull request already defines two sets of iterators begin/end and rbegin/rend. I was just thinking that the semantics of the two sets should be reversed, which would allow more of code to use the ranged based for loops.

But ok, I'll update the pull request with that change.

@sruffell
Copy link
Contributor Author

sruffell commented Jan 4, 2020

I've updated this branch so that range based for loops on the database start with the most recent interval (and renamed the lastLine to firstLine to match).

Also as part of working to remove the users of getAllInclusions, I added an interface to the Database in order to get the set of tags directly from the tags database. This leaves only a single user of getAllInclusions left.

@sruffell sruffell mentioned this pull request Jan 4, 2020
Ubuntu 18.04.3 does not have the -v option to the date utility.
@sruffell sruffell force-pushed the do-not-load-entire-database branch 2 times, most recently from a3bd5b4 to 0a4c629 Compare January 6, 2020 04:04
@sruffell
Copy link
Contributor Author

sruffell commented Jan 6, 2020

Since the performance appears to be close to O(1) now, I do not intend to do any more work on this branch pending review from @lauft

Since the switch to python3, there is another method that starts with "test"
higher up in the stack which produces unhelpful file / line information on a
failed test like:

  ERROR: CommandError on file /usr/lib/python3.6/unittest/case.py line 59 in testPartExecutor: 'yield':

This change restores the previous behavior from before the switch to python 3.
Not only does this eliminate the need to copy the stings to the caller, it will
also eliminate the need for any iterators over the entries in the Database from
having to hold a copy of the lines from the Datafile.

Related to GothenburgBitFactory#245.
This allows the database to be treated as a single collection of strings, but
can be used to avoid loading the entire database when only interested in recent
entries.

Related to issue GothenburgBitFactory#245.
Now that we have the iterators, we can standardize on their use.

Related to issue GothenburgBitFactory#245.
intervalSummarize is called at the end of most commands. The cost of parsing
all the lines in the database can be significant as the size of the database
grows.

Related to issue GothenburgBitFactory#245.
This does not appear to be necessary anymore given that the database lines are
generated from intervals and are all well formed. Any open interval *should* be
at the end of the database.

Related to issue GothenburgBitFactory#245.
We can eliminate the need to parse the entire database if we only look for
overlaps based on the latest interval.

Related to issue GothenburgBitFactory#245.
outerRange is no longer used, since the filter was simply started based on the
first line in the database now.
The Database class itself can now be used in range-based for loops for iterating
over all the lines.
The inclusion database for the user always starts with the most recent
entry. It is now the same way in the code as well.
The database class now separatly tracks tag information. So for the one
place where all the inclusions were iterated over in order to build up a
tag set, we now instead ask the database for this set directly.

Related to issue GothenburgBitFactory#245
The getUntracked, called as part of the `timew gaps` command, is
normally looking at a relatively recent interval. We do not want to take
the performance hit of loading the entire database into memory when
processing this command.

Related to issue GothenburgBitFactory#245
All locations in the code that was creating Intervals for all entries in
the database have been removed. This function can now be removed as
well.
getIntervalsByIds will be used by commands that are loading complete database
currently when they really want a few intervals that the user specified by ID.

Related to issue GothenburgBitFactory#245
This change eliminates the call to getTracked with an empty filter,
which causes the entire database to be parsed.

Related to issue GothenburgBitFactory#245
@sruffell
Copy link
Contributor Author

sruffell commented Jan 6, 2020

I pushed another update. I realized that the original getIntervalsByIds would consider the latest interval twice if there were synthetic intervals, and this condition was not caught in any tests.

I updated two of the tests, annotate.t and move.t, to include a mix of non-synthetic and synthetic intervals and fixed getIntervalsByIds.

This fixes an issue in the modify command since it was first added. This will
allow modify to work in the presence of synthetic intervals.
@sruffell
Copy link
Contributor Author

sruffell commented Jan 8, 2020

After updating most of the tests in this pull request, I realized that when I first added support for the modify command I did not handle synthetic intervals. This change really isn't specifically related to performance improvements, but it just seemed easier to add it to this series since the easiest fix uses the flattenDatabase helper added as part of it.

@lauft lauft self-requested a review January 12, 2020 21:10
Copy link
Member

@lauft lauft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some small issues but I give a general thumbs up. 👍

@sruffell: Because this PR has now waited quite a while to get reviewed I would do some final tests on my machine and then merge it as soon as possible. Please address the review comments in a separate PR after the merge.

test/move.t Show resolved Hide resolved
test/modify.t Show resolved Hide resolved
src/commands/CmdUntag.cpp Show resolved Hide resolved
src/commands/CmdUntag.cpp Show resolved Hide resolved
test/modify.t Show resolved Hide resolved
test/annotate.t Show resolved Hide resolved
src/Database.h Show resolved Hide resolved
src/commands/CmdTag.cpp Show resolved Hide resolved
src/commands/CmdResize.cpp Show resolved Hide resolved
src/data.cpp Show resolved Hide resolved
@lauft lauft added this to the 1.2.1 milestone Jan 17, 2020
@lauft lauft merged commit d9480b5 into GothenburgBitFactory:dev Jan 17, 2020
sruffell added a commit to sruffell/timewarrior that referenced this pull request Jan 18, 2020
timwarrior coding standard is for there to be curly braces around all code
blocks.

See GothenburgBitFactory#269 (comment)
sruffell added a commit to sruffell/timewarrior that referenced this pull request Jan 18, 2020
firstLine is ambiguous (the first line that was added in time? The first line
that will be returned when iterating the database?)

See GothenburgBitFactory#269 (comment)
lauft pushed a commit that referenced this pull request Jan 26, 2020
timwarrior coding standard is for there to be curly braces around all code
blocks.

See #269 (comment)
lauft pushed a commit that referenced this pull request Jan 26, 2020
firstLine is ambiguous (the first line that was added in time? The first line
that will be returned when iterating the database?)

See #269 (comment)
@sruffell sruffell deleted the do-not-load-entire-database branch February 23, 2020 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants