Append seems to lose data #43

Open
bigtonylewis opened this issue May 27, 2020 · 10 comments

Comments

@bigtonylewis

bigtonylewis commented May 27, 2020

It seems that when using collection.append, some data is not written. If I do a collection.write(), then one or more collection.append() calls, and then read it back, I get fewer rows than I put in.

Here's some code that demonstrates it at https://gist.github.com/bigtonylewis/eb2913814869416ccbb82944c3662d32

When I iterate over two 1000-row dataframes, including splitting, writing and appending them, I get about 1800 rows back out of 2000.
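
For reference, the basic pattern looks like this (a minimal sketch; the store, collection and item names are just placeholders, and the gist above is the full reproduction):

import pandas as pd
import pystore

pystore.set_path("./pystore_demo")                 # hypothetical local path
store = pystore.store("demo_store")
collection = store.collection("demo_collection")

# Two 1000-row frames with a timestamp index; some rows repeat the same column values
idx1 = pd.date_range("2020-01-01 00:00", periods=1000, freq="min")
idx2 = pd.date_range("2020-01-01 16:40", periods=1000, freq="min")
df1 = pd.DataFrame({"price": [0.3, 0.4] * 500, "volume": 500}, index=idx1)
df2 = pd.DataFrame({"price": [0.3, 0.4] * 500, "volume": 500}, index=idx2)

collection.write("DEMO_ITEM", df1)
collection.append("DEMO_ITEM", df2)

out = collection.item("DEMO_ITEM").to_pandas()
print(len(out))   # should be 2000, but rows whose column values repeat come back missing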

@jeffneuen

@bigtonylewis did you ever find a resolution for this?

@bigtonylewis
Author

No, I moved on and used something else

@hong-ds

hong-ds commented Mar 31, 2021

Seems like it is dropping duplicates when appending, and it doesn't take the timestamp index into consideration.

@payasparab

@bigtonylewis what solution did you switch to? Did that solution retain the pystore structure/syntax? Not being able to incrementally append to some massive datasets is causing problems for us, but we do not want to change datasets built on the Pystore framework. Any help is appreciated!

@ranaroussi
Owner

Hi - can you share the piece of code that causes this? I can't seem to replicate the scenario.

Thanks!

@bigtonylewis
Author

@ranaroussi Here's the code from the first post: https://gist.github.com/bigtonylewis/eb2913814869416ccbb82944c3662d32

@payas-parab-92 I just whipped up my own code, a much-reduced subset of this that fits my purpose. It won't scale well.

@jeffneuen

I personally would be stoked to see this bug fixed, as it was a show-stopper for me too!

@payasparab

@ranaroussi I have also described the issue, in more detail than I provided here, in another issue: #48 (comment)

Like @jeffneuen, this is a big pain point for us and is becoming critical. I will be playing around with this in the next few days and will circle back to this thread if I figure anything out.

@r-stiller

r-stiller commented Aug 20, 2021

Edit: I added PR #57 to address this problem.

@ranaroussi I'm quite confident that I figured out why pystore is losing data when appending.

The problem here is that you are calling combined = dd.concat([current.data, new]).drop_duplicates(keep="last") (line 181 in collection.py) within the append function.

The documentation of the drop_duplicates function has a nice example of what happens here:

id  brand    style  rating
0   Yum Yum  cup       4.0
1   Yum Yum  cup       4.0
2   Indomie  cup       3.5
3   Indomie  pack     15.0
4   Indomie  pack      5.0

becomes

id  brand    style  rating
0   Yum Yum  cup       4.0
2   Indomie  cup       3.5
3   Indomie  pack     15.0
4   Indomie  pack      5.0

See how the index of the data is ignored when searching for duplicates.
If you've got a dataframe with a timestamp as index and a single column with prices, all but one of the rows with the same price will be deleted.
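
Here is a quick pandas illustration of the same effect with a timestamp index (pandas shown for brevity; Dask's drop_duplicates ignores the index in the same way):

import pandas as pd

idx = pd.to_datetime(["2021-01-01 10:00", "2021-01-01 10:01", "2021-01-01 10:02"])
df = pd.DataFrame({"price": [0.3, 0.3, 0.4]}, index=idx)

print(df.drop_duplicates(keep="last"))
# The 10:00 row is gone: it duplicates the 10:01 row on the "price" column,
# and the index is never considered in the comparison.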

There is no way to force drop_duplicates to include the index in the comparison, but here is a workaround:

Reset the index, so the original index is inserted as a column before calling drop_duplicates.
This makes sure that the index is compared, too.
After dropping, set the index back to the original one.

combined = dd.concat([current.data, new])
idx_name = combined.index.name
combined = combined.reset_index().drop_duplicates(keep="last").set_index(idx_name)

When you've got this dataframe

Timestamp  price  volume
100000       0.3     500
100000       0.3     500
200000       0.3     500
200000       0.3     777
300000       0.4     200

it will be changed to this

Timestamp  price  volume
100000       0.3     500
200000       0.3     500
200000       0.3     777
300000       0.4     200

Note how this keeps rows with the same index but different column values.
That might be useful for data that can have the same timestamp but different values, even though that's quite rare.
(Trades that were executed at the very same time might share a timestamp.)

For data that can only have unique timestamps (OHLC, EOD, etc.) it is enough to drop duplicated indexes.
Just change the last line to:

combined = combined.reset_index().drop_duplicates(subset=idx_name, keep="last").set_index(idx_name)

This will result in:

Timestamp  price  volume
100000       0.3     500
200000       0.3     777
300000       0.4     200
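
Putting both variants together, here is a small runnable sketch using the example data from the tables above (npartitions=1 is just for the demo):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
    {"price": [0.3, 0.3, 0.3, 0.3, 0.4], "volume": [500, 500, 500, 777, 200]},
    index=pd.Index([100000, 100000, 200000, 200000, 300000], name="Timestamp"),
)
combined = dd.from_pandas(df, npartitions=1)
idx_name = combined.index.name

# Variant 1: drop only rows that duplicate both the index and all column values
unique_rows = combined.reset_index().drop_duplicates(keep="last").set_index(idx_name)
print(unique_rows.compute())   # 4 rows: the two identical (100000, 0.3, 500) rows collapse into one

# Variant 2: enforce a unique index (OHLC, EOD, etc.)
unique_index = combined.reset_index().drop_duplicates(subset=idx_name, keep="last").set_index(idx_name)
print(unique_index.compute())  # 3 rows: one row per timestamp, keeping the last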

For pandas there is actually no need to copy the index, as you can use df.groupby(level=0).last() to remove duplicated indexes.
But Dask doesn't support this form of groupby.
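
A quick pandas sketch of that approach (small example data, just for illustration):

import pandas as pd

df = pd.DataFrame({"price": [0.3, 0.3, 0.4], "volume": [500, 777, 200]},
                  index=pd.Index([100000, 200000, 200000], name="Timestamp"))
print(df.groupby(level=0).last())   # one row per Timestamp, keeping the last values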

To control the behavior, you could add a keyword argument to the append function, something like force_unique_index=False, which leads to:

combined = dd.concat([current.data, new])
idx_name = combined.index.name

if force_unique_index:
    subset = idx_name
else:
    subset = None

combined = combined.reset_index().drop_duplicates(subset=subset, keep="last").set_index(idx_name)

To increase speed on big dataframes it would be nice to apply a Boolean mask of index duplicates first and then look for duplicated rows among those. This avoids copying the index and reduces the number of rows (with all their columns) that have to be compared. Unfortunately, Dask hasn't implemented pandas' duplicated function, which returns a Boolean Series.
Here is my code for pandas; maybe there is a similar way for Dask:

mask = combined.index.duplicated(keep=False)          # True for every row whose index value appears more than once
mask[mask] = combined[mask].duplicated(keep="first")  # among those rows, True for every column-wise duplicate after the first
combined = combined[~mask]                            # keep non-duplicates plus the first occurrence of each duplicate

On my 15,338,798 rows × 7 columns dataframe, the mask approach takes:
3.74 s ± 219 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vs. drop_duplicates with the index copy:
29.4 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

@yohplala

Hi everyone, for your information, I have started an alternative library, oups, that has some similarities with pystore. Please be aware this is my first project, but I would gladly accept any feedback on it.

@ranaroussi, I am aware this post may not be welcome, and I am sorry if it comes across as rude. Please remove it if it does.
