However, if an index has been created on dt, whether intentionally or, as often happens, unintentionally (an auto index is triggered by dt[flag0 == 1, ...]), the performance of the above code drops significantly and becomes unstable:
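For anyone hitting this, a quick way to see whether an index is involved is to inspect and, if necessary, drop it. This is only a sketch using the standard data.table helpers; dt and flag0 are the names from the example above, and the sample data is made up:

```r
library(data.table)

dt <- data.table(id = 1:5e6, flag0 = rep(0:1, 2.5e6))

# A subset like this can trigger auto-indexing on flag0:
dt[flag0 == 1, .N]
indices(dt)                              # lists any indices now attached to dt

# Drop all indices on dt, or disable auto-indexing globally:
setindex(dt, NULL)
options(datatable.auto.index = FALSE)
```

Dropping the index (or turning auto-indexing off before the loop) avoids the repeated copying described below at the cost of losing indexed subsets.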
I ran your code through profvis. It shows that with the index you are de/allocating ~170 GB of memory, as opposed to ~1.6 GB without one. Most of the time is spent in calls to .shallow() originating from bmerge(), with about half of that taken up by the garbage collector. That latter part is where the variability you observed in the timings comes from.
Now, what's going on? First, bmerge() makes a shallow copy of x, which in this case is just dt. Looking into shallow() at the C level, you find a call to DUPLICATE_ATTRIB which, among other things, creates a complete copy of the index vector. Since that vector occupies 20 MB and the copy happens 5000 times, we've already accounted for ~100 GB of the memory footprint.
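The index lives in the "index" attribute of the data.table, so the size of what DUPLICATE_ATTRIB has to copy can be checked directly. A hedged sketch (the toy data is my own; 5e6 rows times 4 bytes per integer gives the ~20 MB figure quoted above):

```r
library(data.table)

dt <- data.table(id = 1:5e6, flag0 = rep(0:1, 2.5e6))
dt[flag0 == 1, .N]        # triggers an auto index on flag0

# The ordering vectors hang off the "index" attribute, so this
# approximates the per-call copy cost: ~20 MB here, and copied
# 5000 times that is ~100 GB of allocation.
object.size(attributes(attr(dt, "index")))
```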
I haven't dug into it further to figure out where the rest of the difference comes from. But just to confirm,
@tlapak Thanks for digging into that. The non-equi join is the perfect way to do that.
Some of my practical use cases involve extra per-row calculations to derive the dates: the dates may not be consecutive in the calendar, or another per-row variable determines whether an end_date exists at all. That makes it hard to apply a non-equi join in a uniform way.