Optimise frank() in GForce so as to avoid eval() penalty #1197

Open
arunsrinivasan opened this issue Jun 23, 2015 · 1 comment
Labels
enhancement GForce issues relating to optimized grouping calculations (GForce) Low

Comments

@arunsrinivasan
Member

Relevant SO post: http://stackoverflow.com/a/31006848/559784

@glendigity

Hi there,

I started writing up a comparison of `rank` vs `frank` performance, but I think it is the same issue as this one. See my testing below; hopefully it helps.

Cheers!


First I set up a data.table with column `a`, which we wish to group on, and column `b`, which we wish to rank within each value of `a`.

When the number of groups in column `a` is small, `frank` is faster. However, as the number of groups grows, `frank` becomes far slower than `rank`.

Adding or removing the `setkey` on the data.table does not appear to change performance. See code and results on my machine below:

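```r
library(data.table)

rows <- 1000000
timings <- NULL

for (g in c(1, 5, 10, 20, 100, 200, 1000, 2000, 5000, 10000, 20000)) {
    # round(runif(rows, 0, g)) yields the integers 0..g, so up to g + 1 groups
    dt <- data.table(a = round(runif(rows, 0, g), 0), b = runif(rows, 0, 1000))

    setkey(dt, a)

    # time frank() evaluated once per group
    t0 <- Sys.time()
    dt[, frnk := frank(b), by = a]
    t.frank <- Sys.time() - t0

    # time base rank() evaluated once per group
    t0 <- Sys.time()
    dt[, rnk := rank(b), by = a]
    t.rank <- Sys.time() - t0

    # collect one row of timings per group count
    if (is.null(timings)) {
        timings <- data.table(groups = g, t.frank = t.frank, t.rank = t.rank)
    } else {
        timings <- rbind(timings, data.table(groups = g, t.frank = t.frank, t.rank = t.rank))
    }
}

timings
```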

| groups | t.frank (secs) | t.rank (secs) |
|-------:|---------------:|--------------:|
| 1 | 0.1249878 | 0.3281560 |
| 5 | 0.1250062 | 0.2563920 |
| 10 | 0.1421061 | 0.2404330 |
| 20 | 0.1250150 | 0.2343860 |
| 100 | 0.2387691 | 0.1660478 |
| 200 | 0.1874959 | 0.1718900 |
| 1000 | 0.4375210 | 0.1718860 |
| 2000 | 0.7968328 | 0.2045951 |
| 5000 | 1.7227261 | 0.3140860 |
| 10000 | 3.3601918 | 0.4983292 |
| 20000 | 6.0010481 | 0.8939619 |
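Since the slowdown scales with the number of groups, one possible workaround until `frank()` is optimised under GForce is to avoid calling it per group at all. The sketch below is not the fix proposed in this issue, and it is not benchmarked here. It assumes the default `ties.method = "average"` and no `NA` values: it ranks `(a, b)` jointly in a single `frankv()` call, then subtracts the number of rows that fall in groups with a smaller `a`. The column names `frnk_by`, `offset` and `frnk_offset` are made up for the illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(a = sample(50L, 1e4, replace = TRUE),
                 b = round(runif(1e4, 0, 100)))  # rounding creates ties in b

# Reference: frank() evaluated once per group (the slow path discussed above)
dt[, frnk_by := frank(b), by = a]

# One joint ranking over (a, b); the result is aligned with dt's row order
joint <- frankv(dt, cols = c("a", "b"), ties.method = "average")

# offset = number of rows in groups with a smaller value of a
sizes <- dt[, .N, keyby = a]
sizes[, offset := cumsum(N) - N]
dt[sizes, on = "a", offset := i.offset]

# Within-group rank = joint rank minus the rows in all preceding groups
dt[, frnk_offset := joint - offset]

# Both approaches should agree, including on tied values
stopifnot(isTRUE(all.equal(dt$frnk_by, dt$frnk_offset)))
```

This works because ties in `(a, b)` can only occur within a single group, so the joint average rank of a row is its within-group average rank shifted by a per-group constant.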

@arunsrinivasan arunsrinivasan modified the milestones: v1.9.8, v2.0.0 Mar 20, 2016
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@MichaelChirico MichaelChirico added the GForce issues relating to optimized grouping calculations (GForce) label Feb 25, 2019
@jangorecki jangorecki removed the Medium label Apr 2, 2020