
default reduced to half of all logical CPUs #3435

Merged: 4 commits merged into master on Mar 1, 2019

Conversation

@mattdowle (Member, Author) commented Feb 28, 2019

Closes #3298
Closes #3395

I investigated various methods for determining the number of cores. The one referred to in #3395 (RhpcBLASctl::get_num_cores()) seems to be the only one that actually works; parallel::detectCores() doesn't work for me. But RhpcBLASctl::get_num_cores() reads system files on Linux to do it, so it could be slow, and it would add a new dependency.
So I figured: why not keep it very simple and default to 50% of logical CPUs? This leaves plenty of room for other processes. If the user wants to use more or fewer threads, they can set OMP_NUM_THREADS up to the number of logical CPUs, or call setDTthreads().
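A minimal sketch of how the new default behaves from the user's side, assuming the behaviour described above and that getDTthreads()'s verbose argument is available in this version:

```r
library(data.table)

# Report how many threads data.table will use; with this change the default is
# 50% of logical CPUs rather than all of them.
getDTthreads(verbose = TRUE)

# Opt into more (or fewer) threads explicitly, up to the number of logical CPUs:
setDTthreads(8)

# Or via OpenMP's environment variable, set before data.table is loaded:
# Sys.setenv(OMP_NUM_THREADS = 8)
```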

@mattdowle added this to the 1.12.2 milestone on Feb 28, 2019
@mattdowle changed the title from "default half logical cpus" to "default reduced to half of all logical CPUs" on Feb 28, 2019
codecov bot commented Feb 28, 2019

Codecov Report

Merging #3435 into master will decrease coverage by <.01%.
The diff coverage is 84.61%.


```diff
@@            Coverage Diff             @@
##           master    #3435      +/-   ##
==========================================
- Coverage   95.06%   95.05%   -0.01%
==========================================
  Files          65       65
  Lines       12249    12254       +5
==========================================
+ Hits        11644    11648       +4
- Misses        605      606       +1
```

| Impacted Files | Coverage Δ |
|:---------------|:-----------|
| src/openmp-utils.c | 92.1% <84.61%> (-1.84%) ⬇️ |

Last update 3d4459d...5b6ded1.

codecov bot commented Feb 28, 2019

Codecov Report

Merging #3435 into master will increase coverage by 0.03%.
The diff coverage is 100%.


```diff
@@            Coverage Diff             @@
##           master    #3435      +/-   ##
==========================================
+ Coverage   95.06%   95.09%   +0.03%
==========================================
  Files          65       65
  Lines       12249    12302      +53
==========================================
+ Hits        11644    11699      +55
+ Misses        605      603       -2
```

| Impacted Files | Coverage Δ |
|:---------------|:-----------|
| src/fread.c | 98.49% <ø> (+0.15%) ⬆️ |
| src/openmp-utils.c | 100% <100%> (+6.06%) ⬆️ |
| R/openmp-utils.R | 100% <100%> (ø) ⬆️ |
| src/init.c | 93.93% <100%> (+0.09%) ⬆️ |
| R/test.data.table.R | 95.74% <100%> (-0.03%) ⬇️ |
| src/gsumm.c | 90.68% <0%> (-0.23%) ⬇️ |

Last update 3d4459d...324b4b2.

NEWS.md: review comment resolved (outdated)
man/openmp-utils.Rd: review comment resolved (outdated)
@renkun-ken (Member) left a comment

AFAIK, most users operate on small data (thousands to millions of rows, tens of columns). Using half versus the full number of logical cores does not make a significant difference in computing time when I test on my server at such data sizes.

When the data is very large (e.g. 1 billion rows and hundreds of columns), half versus full computing capacity may make a bigger difference (though not as much as I expected when I tested on my 40-core server with 70M+ rows and 50+ columns). However, I believe most users operating at such data sizes are working on servers like mine, or the data simply cannot fit into memory. And this is exactly the case where using all cores causes a problem, since it is unlikely that such servers are used by only one user.

I'm perfectly okay with this change.

@mattdowle (Member, Author) commented

I haven't noticed much difference between 4 and 8 threads on my laptop either (I have 4 cores). Our algos are mostly cache-bound.
I see Jan's point that there's an inconvenience to needing to call setDTthreads() and having to know the number to give it. Also, the package has to be loaded before it can be used, unlike options(), which can be set in .Rprofile on a server without the user needing to change any code to add a setDTthreads() call.
How about a new option, options(datatable.logicalcpu.percent=0.50), to control the default? Once the option is there, Jan could see the impact it has on db-bench.
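For concreteness, a minimal sketch of the option-based approach floated here; datatable.logicalcpu.percent is only the name proposed in this comment, not a released option:

```r
# Hypothetical startup configuration using the option name proposed above.
# Placed in ~/.Rprofile (or Rprofile.site) so it takes effect before data.table is loaded.
options(datatable.logicalcpu.percent = 0.50)  # default to 50% of logical CPUs
```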

@HughParsonage (Member) commented

I personally think the default should be single-core since, as @renkun-ken remarked, more cores currently offer only a small performance benefit on commodity hardware (with the possible exception of fread/fwrite), but can be much slower if there is other parallelism going on. In the latter case, the parallelism can't really be managed a priori; the operator would need to set the cores according to the particular setup being used.

This is not really a problem with data.table -- I see it in lots of places, even outside R. One thing that strikes me at the moment is the high flux of packages moving from single-threaded to parallel operations. In isolation each change can provide performance benefits, but their interactions are tricky. One can imagine a scenario where a user combines data.table, with its implicit parallelism, with another package X. Package X then moves from single-core to multi-core and there's a performance regression. This can be pretty tricky to track down, especially since the NEWS in package X will say the functions should be faster.

I think fread and fwrite are special though: the performance benefit appears to be large and they are less likely to be used in combination with other packages.

Thanks for inviting me to review. I've approved because I think the PR represents an improvement.

@jangorecki (Member) commented Feb 28, 2019

I think it strongly depends on the data, the query, and how the other processes utilise resources. Single core vs multicore makes a significant difference, at least for the queries we are benchmarking. The huge time improvement you can see below comes from parallel forder and parallel aggregations. Without those multicore improvements we would be behind Spark and pydatatable (both do aggregations using multiple cores).

[Figure: hist_groupby_plot-1, db-bench group-by timing history plot]

The machine we are using for that currently has 20 CPUs and 125 GB of memory. Once this PR is merged I will run the benchmark so we can see how halving the CPUs impacts speed.

@mattdowle (Member, Author) commented Feb 28, 2019

Glad we're all on the same page. I'm surprised by Hugh's comment that more threads don't often help other than for fread/fwrite. I suspect what's going on is a combination of recent improvements that really show on large data (db-bench), plus some aspects that now cause significant slowdowns on small / iterated tasks which would otherwise be fast anyway without needing parallelism. For example, we still need to look at uniqueN by group as raised here: #3395 (comment). So this PR does not fully close #3395; there'll be a follow-up issue for that.

I was going to add the option(datatable.logicalcpu.percent) before merging so Jan didn't need to retest. But the tricky thing is that we don't want to read and re-read this option every time we call getDTthreads(), which happens quite a lot internally, for each and every parallel region. If the user changes this option, they'll expect their change to take effect, and we don't want to have to explain that it's only a startup option. To convey that it's a default which is read once when data.table is loaded, I'm now thinking of creating 2 new environment variables instead: R_DATATABLE_NCPU_PERCENT and R_DATATABLE_NCPU. Such variables can be set on the command line, or set in R using Sys.setenv() before loading data.table. This gives maximum flexibility for production usage. If both are provided, the minimum of the two is used. The startup message could print the result (as Jan suggested). If unset, the default for R_DATATABLE_NCPU_PERCENT would be 0.50.
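A minimal sketch of the environment-variable approach, assuming the variables are read once at load time as described. The names here are the ones proposed in this comment; the CI output later in the thread shows the merged code used R_DATATABLE_NUM_PROCS_PERCENT and R_DATATABLE_NUM_THREADS instead.

```r
# Set before data.table is loaded, e.g. at the top of a production script or in .Rprofile.
# Variable names are the ones proposed above; treat them as illustrative.
Sys.setenv(R_DATATABLE_NCPU_PERCENT = "75")  # use 75% of logical CPUs
Sys.setenv(R_DATATABLE_NCPU = "16")          # absolute cap; if both are set, the minimum wins

library(data.table)
getDTthreads(verbose = TRUE)  # report the resulting thread count
```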

@jangorecki (Member) commented

What about @HughParsonage's recent comment?

> Perhaps it would be better to have threads = NULL be the default, keep threads = 0 to retain the previous behaviour, and special-case if (is.null(threads)) threads = <half of CPU> else threads. That way, people relying on threads = 0 working as documented can continue to do so, while people just using 'the data.table default' are moved to the new default.
>
> A call to data.table::setDTthreads() can be added to .Rprofile so that changes to existing programs can be avoided.
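A minimal sketch of the special-casing in the quoted suggestion. resolve_threads() is a hypothetical helper, not the merged implementation, and it uses parallel::detectCores() purely for illustration (the PR description notes that function was unreliable for the author):

```r
# Hypothetical illustration of the quoted proposal:
# NULL  -> new default (half of logical CPUs)
# 0     -> previous documented behaviour (all logical CPUs)
# n > 0 -> explicit user choice
resolve_threads <- function(threads = NULL) {
  n_logical <- parallel::detectCores(logical = TRUE)
  if (is.null(threads)) {
    max(1L, n_logical %/% 2L)
  } else if (threads == 0L) {
    n_logical
  } else {
    as.integer(threads)
  }
}

resolve_threads()    # half of logical CPUs, e.g. 4 on an 8-logical-CPU machine
resolve_threads(0)   # all logical CPUs (old behaviour)
```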

@mattdowle merged commit 1638c76 into master on Mar 1, 2019
@mattdowle deleted the default_nthread branch on March 1, 2019 11:16
@jangorecki (Member) commented Mar 1, 2019

There are some failures on the Linux machines provided by GitLab; they might be single-CPU instances, I'm not sure.
https://gitlab.com/Rdatatable/data.table/-/jobs/169970092

```
getDTthreads()='omp_get_num_procs()==1; R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50); R_DATATABLE_NUM_THREADS==""; omp_get_thread_limit()==2147483647; omp_get_max_threads()==1; OMP_THREAD_LIMIT==""; OMP_NUM_THREADS==""; data.table is using 1 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.; RestoreAfterFork==true']. Search tests/tests.Rraw for test numbers: 1997.09, 1997.11.
```

db-bench is running with 50% of threads, as expected.

@jangorecki (comment minimized)

@mattdowle (Member, Author) commented

@jangorecki Very interesting, thanks! Maybe something like 80% is optimal on the server?

@MichaelChirico (Member) commented Mar 2, 2019 via email

@jangorecki (Member) commented Mar 2, 2019

@MichaelChirico we will use all available cores in the benchmark, the same as for Spark and the others. I doubt that using less than 100% has ever resulted in better performance in that environment. There are generally no other processes running in the background.

@HughParsonage (Member) commented

@jangorecki Those numbers don't look significant to me. Even with 10 billion groups the differences are only fractions of a second. And this is under optimal settings for parallelism: no other processes, no nested parallelism.

I get no significant differences with a 'normal' setup (i.e. some other processes running), even with the aggregations most suited to parallelization. That is, when testing execution times of the same operation, sometimes the parallel version is faster, sometimes it's slower; never by more than 1% or so.

My argument for a single-core default is that the small gains under favourable conditions are outweighed by the risk of enormous performance regressions under unforeseen, unfavourable conditions.

@jangorecki (Member) commented Mar 2, 2019

Those numbers go up to 1 billion rows. The differences are up to 29% when using 50% of logical cores - not a huge difference; for ordinary production use, 50% of logical cores will be good enough, but for the benchmark we should aim for the fastest possible option. I will run the benchmarks using a single core so we have a bigger picture of the performance cost of such a change.

@mattdowle (Member, Author) commented

@HughParsonage the columns that look like fractions of seconds are actually percentages, I think.
@jangorecki - can you scale by 100 and use a % sign in that output, please? And where it's seconds, use "s" or "secs" so there's no doubt.

@jangorecki (Member) commented

be8c7f6 with 20 cores (100% of logical cores) vs 1638c76 with 1 core; timings are in seconds, and time_change_pct is the percentage increase in elapsed time (see the sketch after the tables).
We can see that in the best case single core is 43% slower, and in the worst case it is more than 5 times slower.


top 5 speed-ups - or actually, the top 5 smallest slowdowns

|data           |question          | be8c7f6 (s)| 1638c76 (s)|time_change_pct |
|:--------------|:-----------------|-------:|-------:|:---------------|
|G1_1e7_1e1_0_0 |mean v1:v3 by id4 |   0.230|   0.331|43.9%           |
|G1_1e7_2e0_0_0 |mean v1:v3 by id4 |   0.266|   0.393|47.7%           |
|G1_1e8_1e1_0_0 |mean v1:v3 by id4 |   2.037|   3.371|65.5%           |
|G1_1e9_1e1_0_0 |mean v1:v3 by id4 |  20.152|  33.797|67.7%           |
|G1_1e7_1e1_0_0 |sum v1 by id1     |   0.143|   0.267|86.7%           |

top 5 largest slowdowns

|data           |question              | be8c7f6 (s)| 1638c76 (s)|time_change_pct |
|:--------------|:---------------------|-------:|-------:|:---------------|
|G1_1e9_1e2_0_1 |sum v1 by id1:id2     |   3.493|  22.344|539.7%          |
|G1_1e9_1e2_0_1 |sum v1 by id1         |   2.835|  17.907|531.6%          |
|G1_1e7_1e2_0_1 |sum v1 mean v3 by id3 |   0.119|   0.749|529.4%          |
|G1_1e7_1e2_0_1 |sum v1:v3 by id6      |   0.135|   0.814|503.0%          |
|G1_1e8_1e2_0_1 |sum v1 by id1:id2     |   0.375|   2.258|502.1%          |

by question

|question              |mean_time_change_pct |
|:---------------------|:--------------------|
|sum v1 by id1         |221.8%               |
|sum v1 by id1:id2     |289.9%               |
|sum v1 mean v3 by id3 |366.4%               |
|mean v1:v3 by id4     |94.7%                |
|sum v1:v3 by id6      |382.0%               |

by data

|data           |mean_time_change_pct |
|:--------------|:--------------------|
|G1_1e7_1e2_0_0 |315.9%               |
|G1_1e7_1e1_0_0 |220.9%               |
|G1_1e7_2e0_0_0 |108.9%               |
|G1_1e7_1e2_0_1 |388.8%               |
|G1_1e8_1e2_0_0 |317.6%               |
|G1_1e8_1e1_0_0 |207.9%               |
|G1_1e8_2e0_0_0 |141.3%               |
|G1_1e8_1e2_0_1 |395.9%               |
|G1_1e9_1e2_0_0 |307.2%               |
|G1_1e9_1e1_0_0 |243.9%               |
|G1_1e9_2e0_0_0 |160.4%               |
|G1_1e9_1e2_0_1 |411.6%               |
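For readers checking the tables: time_change_pct appears to be the relative increase in elapsed time going from be8c7f6 (20 cores) to 1638c76 (1 core). A quick check against the first row of the first table, assuming that formula:

```r
# Reconstructing time_change_pct for G1_1e7_1e1_0_0, "mean v1:v3 by id4":
t_20core <- 0.230   # be8c7f6, seconds
t_1core  <- 0.331   # 1638c76, seconds
round(100 * (t_1core - t_20core) / t_20core, 1)  # 43.9, matching "43.9%" in the table
```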
