Append seems to lose data #43

Open
bigtonylewis opened this issue May 27, 2020 · 10 comments

Comments

@bigtonylewis

bigtonylewis commented May 27, 2020

It seems that when using collection.append, some data is not written. If I do a collection.write(), then one or more collection.append() calls, and then read it back, I get fewer rows than I put in.

Here's some code that demonstrates it at https://gist.github.com/bigtonylewis/eb2913814869416ccbb82944c3662d32

When I iterate over two 1000-row dataframes, including splitting, writing and appending them, I get about 1800 rows back out of 2000.
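
For reference, the basic pattern looks like this (a minimal sketch; the store, collection and item names are just placeholders, and the gist above is the full reproduction):

import pandas as pd
import pystore

pystore.set_path("./pystore_demo")                 # hypothetical local path
store = pystore.store("demo_store")
collection = store.collection("demo_collection")

# Two 1000-row frames with a timestamp index; some rows repeat the same column values
idx1 = pd.date_range("2020-01-01 00:00", periods=1000, freq="min")
idx2 = pd.date_range("2020-01-01 16:40", periods=1000, freq="min")
df1 = pd.DataFrame({"price": [0.3, 0.4] * 500, "volume": 500}, index=idx1)
df2 = pd.DataFrame({"price": [0.3, 0.4] * 500, "volume": 500}, index=idx2)

collection.write("DEMO_ITEM", df1)
collection.append("DEMO_ITEM", df2)

out = collection.item("DEMO_ITEM").to_pandas()
print(len(out))   # should be 2000, but rows whose column values repeat come back missing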

@jeffneuen

@bigtonylewis did you ever find a resolution for this?

@bigtonylewis
Author

No, I moved on and used something else

@hong-ds

hong-ds commented Mar 31, 2021

Seems like it is dropping duplicates when appending, and it doesn't take the timestamp index into consideration.

@payasparab

@bigtonylewis what solution did you switch to? Did that solution retain the pystore structure/syntax? Not being able to incrementally append to some massive datasets is causing problems for us, but we do not want to change datasets built on the Pystore framework. Any help is appreciated!

@ranaroussi
Owner

Hi - can you share the piece of code that causes this? I can't seem to replicate the scenario.

Thanks!

@bigtonylewis
Author

@ranaroussi Here's the code from the first post: https://gist.github.com/bigtonylewis/eb2913814869416ccbb82944c3662d32

@payas-parab-92 I just whipped up my own code, a much-reduced subset of this that fits my purpose. It won't scale well.

@jeffneuen

I personally would be stoked to see this bug fixed, as it was a show-stopper for me too!

@payasparab

@ranaroussi I have also described the issue, in more detail than I provided here, in another issue: #48 (comment)

Like @jeffneuen, this is a big pain point for us and is becoming critical. I will be playing around with this in the next few days and will circle back to this thread if I figure anything out.

@r-stiller

r-stiller commented Aug 20, 2021

Edit: I added PR #57 to address this problem.

@ranaroussi I'm quite confident that I figured out why pystore is losing data when appending.

The problem here is that you are calling combined = dd.concat([current.data, new]).drop_duplicates(keep="last") (line 181 in collection.py) within the append function.

The documentation of the drop_duplicates function has a nice example of what happens here:

id  brand    style  rating
0   Yum Yum  cup       4.0
1   Yum Yum  cup       4.0
2   Indomie  cup       3.5
3   Indomie  pack     15.0
4   Indomie  pack      5.0

becomes

id  brand    style  rating
0   Yum Yum  cup       4.0
2   Indomie  cup       3.5
3   Indomie  pack     15.0
4   Indomie  pack      5.0

See how the index of the data is ignored when searching for duplicates.
If you've got a dataframe with a timestamp as index and a single column with prices, all but one of the rows with the same price will be deleted.
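
Here is a quick pandas illustration of the same effect with a timestamp index (pandas shown for brevity; Dask's drop_duplicates ignores the index in the same way):

import pandas as pd

idx = pd.to_datetime(["2021-01-01 10:00", "2021-01-01 10:01", "2021-01-01 10:02"])
df = pd.DataFrame({"price": [0.3, 0.3, 0.4]}, index=idx)

print(df.drop_duplicates(keep="last"))
# The 10:00 row is gone: it duplicates the 10:01 row on the "price" column,
# and the index is never considered in the comparison.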

There is no way to force drop_duplicates to include the index in the comparison, but here is a workaround:

Reset the index, so the original index is inserted as a column before calling drop_duplicates.
This makes sure that the index is compared, too.
After dropping, set the index back to the original one.

combined = dd.concat([current.data, new])
idx_name = combined.index.name
combined = combined.reset_index().drop_duplicates(keep="last").set_index(idx_name)

When you've got this dataframe

Timestamp  price  volume
100000       0.3     500
100000       0.3     500
200000       0.3     500
200000       0.3     777
300000       0.4     200

it will be changed to this

Timestamp  price  volume
100000       0.3     500
200000       0.3     500
200000       0.3     777
300000       0.4     200

Note how this keeps rows with the same index but different column values.
That might be useful for data that can have the same timestamp but different values, even though that's quite rare.
(Trades that were executed at the very same time might share a timestamp.)

For data that can only have unique timestamps (OHLC, EOD, etc.) it is enough to drop duplicated indexes.
Just change the last line to:

combined = combined.reset_index().drop_duplicates(subset=idx_name, keep="last").set_index(idx_name)

This will result in:

Timestamp  price  volume
100000       0.3     500
200000       0.3     777
300000       0.4     200
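
Putting both variants together, here is a small runnable sketch using the example data from the tables above (npartitions=1 is just for the demo):

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
    {"price": [0.3, 0.3, 0.3, 0.3, 0.4], "volume": [500, 500, 500, 777, 200]},
    index=pd.Index([100000, 100000, 200000, 200000, 300000], name="Timestamp"),
)
combined = dd.from_pandas(df, npartitions=1)
idx_name = combined.index.name

# Variant 1: drop only rows that duplicate both the index and all column values
unique_rows = combined.reset_index().drop_duplicates(keep="last").set_index(idx_name)
print(unique_rows.compute())   # 4 rows: the two identical (100000, 0.3, 500) rows collapse into one

# Variant 2: enforce a unique index (OHLC, EOD, etc.)
unique_index = combined.reset_index().drop_duplicates(subset=idx_name, keep="last").set_index(idx_name)
print(unique_index.compute())  # 3 rows: one row per timestamp, keeping the last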

For pandas there is actually no need to copy the index, as you can use df.groupby(level=0).last() to remove duplicated indexes.
But Dask doesn't support this form of groupby.
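
A quick pandas sketch of that approach (small example data, just for illustration):

import pandas as pd

df = pd.DataFrame({"price": [0.3, 0.3, 0.4], "volume": [500, 777, 200]},
                  index=pd.Index([100000, 200000, 200000], name="Timestamp"))
print(df.groupby(level=0).last())   # one row per Timestamp, keeping the last values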

To control the behavior, you could add a keyword argument to the append function, something like force_unique_index=False, which leads to:

combined = dd.concat([current.data, new])
idx_name = combined.index.name

if force_unique_index:
    subset = idx_name
else:
    subset = None

combined = combined.reset_index().drop_duplicates(subset=subset, keep="last").set_index(idx_name)

To increase speed on big dataframes it would be nice to apply a Boolean mask of index duplicates first and then look for duplicated rows among those. This avoids copying the index and reduces the number of rows (with all their columns) that have to be compared. Unfortunately, Dask hasn't implemented pandas' duplicated function, which returns a Boolean Series.
Here is my code for pandas; maybe there is a similar way for Dask:

mask = combined.index.duplicated(keep=False)          # True for every row whose index value appears more than once
mask[mask] = combined[mask].duplicated(keep="first")  # among those rows, True for every column-wise duplicate after the first
combined = combined[~mask]                            # keep non-duplicates plus the first occurrence of each duplicate

On my 15,338,798 rows × 7 columns dataframe, the mask approach takes:
3.74 s ± 219 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vs. drop_duplicates with the index copy:
29.4 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

@yohplala

Hi everyone, for your information, I have started an alternative library, oups, that has some similarities with pystore. Please be aware this is my first project, but I would gladly accept any feedback on it.

@ranaroussi, I am aware this post may not be welcome, and I am sorry if it comes across as rude. Please remove it if it does.
