R crashes on non-equi join #3401

Closed
Gayyam opened this issue Feb 14, 2019 · 9 comments

@Gayyam Gayyam commented Feb 14, 2019

Hi,

I think I'm encountering a weird bug where R crashes as I try to do a non-equi join. Apologies for not being able to create a minimal reproducible example. I have attached two data.tables, both with 10,000 rows and up to 4 columns.

Here is the code to (hopefully) reproduce the error

library(data.table)
DT1 <- readRDS('DT1.rds')
DT2 <- readRDS('DT2.rds')

# This does not work, R crashes on my system.
DT1[
  DT2, on = .(Month<=MonthFuture, Month>=MonthPast, FlightDetails==FlightCode)
]

# This works: the same join on a random 1,000-row sample of each table
set.seed(1)
n <- 1e3
DT1 <- DT1[sample(.N, n)]
DT2 <- DT2[sample(.N, n)]
DT1[
  DT2, on = .(Month<=MonthFuture, Month>=MonthPast, FlightDetails==FlightCode)
]
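
For readers unfamiliar with the syntax, here is a minimal sketch of the same kind of non-equi join on tiny synthetic tables. The column names are borrowed from the code above, but every value is made up for illustration, and this small example is not expected to reproduce the crash.

```r
library(data.table)

# Synthetic stand-ins for the attached tables (values are assumptions, not real data).
DT1 <- data.table(
  FlightDetails = c("A1", "A1", "B2", "B2"),
  Month         = c(1L, 5L, 3L, 9L),
  Value         = c(10, 20, 30, 40)
)
DT2 <- data.table(
  FlightCode  = c("A1", "B2"),
  MonthPast   = c(0L, 4L),
  MonthFuture = c(3L, 10L)
)

# For each row of DT2, find the rows of DT1 with the same flight code
# whose Month lies between MonthPast and MonthFuture (inclusive).
res <- DT1[
  DT2, on = .(Month <= MonthFuture, Month >= MonthPast, FlightDetails == FlightCode)
]

# In this toy data each DT2 row matches exactly one DT1 row.
print(res)
```

Value is a hypothetical payload column added only so the matched rows are easy to recognise in the output.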

Output of sessionInfo() ----

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New Zealand.1252        LC_MONETARY=English_New Zealand.1252
[4] LC_NUMERIC=C                         LC_TIME=English_New Zealand.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2    yaml_2.2.0

Thank you. Let me know if there is anything I can provide to aid in debugging.

Author

@Gayyam Gayyam commented Feb 14, 2019

Also tested on the latest dev version, same error persists.
data.table 1.12.1 IN DEVELOPMENT built 2019-02-14 09:56:44 UTC;

Member

@MichaelChirico MichaelChirico commented Feb 14, 2019

Might be unrelated but you should run setDT(DT1) after readRDS

Author

@Gayyam Gayyam commented Feb 14, 2019

Thanks Michael. I set DT1 and DT2 both with setDT and reran the commands, but R still crashed.

Were you able to reproduce this?

Member

@arunsrinivasan arunsrinivasan commented Feb 15, 2019

Works well up until 1.11.8. Seg faults from 1.12.0.
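
A small sketch of how one might check whether an installation is the release reported as crashing in this thread. The downgrade line is only one possible approach and assumes the `remotes` package is installed. It is left commented out so nothing is installed by accident.

```r
# Reported in this thread: 1.11.8 works, 1.12.0 segfaults (a 1.12.1 dev build
# from 2019-02-14 crashed as well, before the fix was committed).
installed <- packageVersion("data.table")
needs_attention <- installed == "1.12.0"

if (needs_attention) {
  message("data.table ", format(installed),
          " is the release reported to segfault in this thread.")
}

# One possible way to pin the last version reported as working.
# Assumes the 'remotes' package is available; uncomment to run.
# remotes::install_version("data.table", version = "1.11.8")
```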

Member

@arunsrinivasan arunsrinivasan commented Feb 16, 2019

Seems like this is the commit that breaks: e59ba14#diff-3f6e5ca10e702fb2c499a882aa3447e0

Author

@Gayyam Gayyam commented Feb 16, 2019

Thanks @arunsrinivasan! That helps. I'm using 1.11.8 for now. Will upgrade to the latest dev once your fix is merged.

Also, I wanted to thank you all for your great work in data.table. It has been a pleasure to use and its RAM efficiency has helped us avoid the purchase of a new computer that can support more than 64GB RAM for as long as was possible.

arunsrinivasan added a commit that referenced this issue Feb 16, 2019
* Fix segfault issue, #3401

* Need to Free().

@ethanbsmith ethanbsmith commented Mar 27, 2019

I just ran into this issue. super happy to find it has already been logged and patched. tested on my end against dev version and can confirm it solves my issue as well. you folks rock!!!

Member

@jangorecki jangorecki commented Mar 27, 2019

And it will be landing on CRAN within days/hours


@ethanbsmith ethanbsmith commented Mar 27, 2019

I was able to work around this in my scenario by adding a pre-filter on x. I haven't fully thought this through, but this might be a possible generic optimization to reduce working set. If not, just ignore ;)

my original code was something like:

d1[d2, on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
   allow.cartesian = TRUE, by = .EACHI,
   .(.N, res = sum(Cont.Low + Cont.High))]

by adding a filter on d1, d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)], I was able to get this to work:

d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)][
   d2, on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
   allow.cartesian = TRUE, by = .EACHI,
   .(.N, res = sum(Cont.Low + Cont.High))]

this filter on x is logically implied by the non-equi join conditions and doesn't actually affect the result, but seems (in my scenario) to bypass the memory allocation.

obviously there are all kinds of considerations, like is computing the min and max and applying the filter worth it and such. as I said, feel free to ignore if not useful
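
The claim that the pre-filter does not change the result can be checked on small synthetic data. The sketch below reuses the column names from the comment above, but the tables and their values are invented for illustration. It says nothing about whether the filter avoids the crash on any particular dataset.

```r
library(data.table)

set.seed(42)
# Made-up stand-ins for d1 and d2; only the column names come from the thread.
d1 <- data.table(
  rowid     = 1:200,
  Cont.Low  = runif(200, 0, 50),
  Cont.High = runif(200, 50, 100)
)
d2 <- data.table(
  start.rowid = c(20L, 60L, 120L),
  end.rowid   = c(40L, 90L, 150L),
  Top         = c(30, 25, 45),
  Bottom      = c(60, 70, 55)
)

# Join against the full d1.
full <- d1[d2,
  on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
  allow.cartesian = TRUE, by = .EACHI,
  .(.N, res = sum(Cont.Low + Cont.High))]

# Same join after restricting d1 to the rowid range that d2 can possibly match.
d1_small <- d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)]
filtered <- d1_small[d2,
  on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
  allow.cartesian = TRUE, by = .EACHI,
  .(.N, res = sum(Cont.Low + Cont.High))]

# Per-row match counts and sums should agree between the two variants.
same_counts <- identical(full$N, filtered$N)
same_sums   <- isTRUE(all.equal(full$res, filtered$res))
print(c(same_counts = same_counts, same_sums = same_sums))
```

Any d1 row that satisfies rowid >= start.rowid and rowid < end.rowid for some d2 row necessarily lies between min(d2$start.rowid) and max(d2$end.rowid), so the filter can only drop rows that would never match.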

MichaelChirico pushed a commit that referenced this issue Apr 3, 2019
* Fix segfault issue, #3401

* Need to Free().