ENH Decision Tree new backend: computeSplitClassificationKernel histogram calculation and occupancy optimization (#3616)

Conversation
```cpp
}
// case when d is larger than all bins
if (!breakflag) atomicAdd(pdf_shist + nbins * nclasses + label, 1);
```
This should never happen with the new way of quantile computation once #3586 gets merged. Since it incurs only a minimal penalty, we can keep this for now.
```cpp
int n_blks_for_rows = b.n_blks_for_rows(
  colBlks,
  (const void*)
    computeSplitClassificationKernel<DataT, LabelT, IdxT, TPB_DEFAULT>,
  TPB_DEFAULT, smemSize);
```
This value is computed again in `workspaceSize`. Is there a guarantee that calling `n_blks_for_rows()` would produce a consistent output?
Yes, its value is effectively constant: it equals the number of blocks needed for dimx when dimz (parallel nodes) is at its minimum, which is 1. This happens to be a function of `n_blks_for_cols` and the occupancy calculator. As far as occupancy is concerned, the theoretical limiter for now is register count. So, ceteris paribus, the output of `n_blks_for_rows()` should be the same when called from `workspaceSize` and from a preceding call of `computeSplit.*Kernel`.
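To illustrate why the two call sites must agree, here is a hypothetical Python model of that reasoning (the function name, the `max_active_blocks` parameter, and the formula are illustrative assumptions, not cuML's actual implementation): the value is a pure function of the occupancy budget and the column block count, so repeated calls with the same inputs necessarily return the same output.

```python
def n_blks_for_rows(n_blks_for_cols, max_active_blocks, n_parallel_nodes=1):
    """Hypothetical sketch of the block-count computation: with dimz
    (parallel nodes) at its minimum of 1, the x-dimension gets whatever
    share of the occupancy-limited block budget remains after the column
    blocks are accounted for, floored at one block."""
    return max(1, max_active_blocks // (n_blks_for_cols * n_parallel_nodes))

# A pure function of its inputs: workspaceSize and a later kernel launch
# that see the same occupancy figures must compute the same value.
a = n_blks_for_rows(n_blks_for_cols=4, max_active_blocks=32)
b = n_blks_for_rows(n_blks_for_cols=4, max_active_blocks=32)
assert a == b == 8
```

As long as the occupancy limiter (here, register count) does not change between the two calls, the inputs are identical and so is the result.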
```cpp
@@ -362,50 +367,132 @@ __global__ void computeSplitClassificationKernel(
    col = select(colIndex, treeid, node.info.unique_id, seed, input.N);
  }

  for (IdxT i = threadIdx.x; i < len; i += blockDim.x) shist[i] = 0;
  // populating shared memory with initial values
```
Thanks for adding the comments. It helps me understand the code better.
😄
```cpp
auto isRight = d > sbins[b];  // no divergence
auto offset = b * 2 * nclasses + isRight * nclasses + label;
atomicAdd(shist + offset, 1);  // class hist
if (d <= sbins[b]) {  // shist (0 -> nbins*nclasses - 1)
```
This is going to cause warp divergence, unlike the old code. What's the rationale for this change?
IIUC, this seems to be a small price to pay for reducing shared-memory atomic writes by an order of magnitude. The previous code does an `atomicAdd()` on every single iteration (O(n^2)); this change does it only once per outer data sample and then breaks (O(n)).
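To make that trade-off concrete, here is a serial Python sketch (a stand-in for the shared-memory atomics, not the kernel itself; function names and shapes are illustrative assumptions): the old scheme issues one histogram write per (sample, bin) pair, while writing each sample once into a PDF and taking a prefix sum recovers the same left-side class histogram with far fewer writes.

```python
import numpy as np

def old_style_hist(data, labels, bins, nclasses):
    """One 'atomic' write per (sample, bin) pair: O(n_samples * n_bins)."""
    hist = np.zeros((len(bins), 2, nclasses), dtype=np.int64)  # (bin, left/right, class)
    writes = 0
    for d, label in zip(data, labels):
        for b, edge in enumerate(bins):
            is_right = d > edge
            hist[b, int(is_right), label] += 1
            writes += 1
    return hist, writes

def new_style_hist(data, labels, bins, nclasses):
    """One write per sample into a PDF, then a prefix scan (the blockScan
    analog) turns it into the left-side CDF: O(n_samples + n_bins) writes."""
    pdf = np.zeros((len(bins) + 1, nclasses), dtype=np.int64)
    writes = 0
    for d, label in zip(data, labels):
        # first bin whose edge is >= d (the 'break' in the kernel);
        # samples larger than every edge land in the overflow slot
        b = np.searchsorted(bins, d, side='left')
        pdf[b, label] += 1
        writes += 1
    cdf_left = np.cumsum(pdf, axis=0)[:len(bins)]  # counts with d <= bins[b]
    return cdf_left, writes
```

Both paths yield identical left-child class counts for every candidate split; only the number of histogram writes differs.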
I tested this pull request last night and observed a significant performance regression. Here is a minimal reproducible example:

```python
import os
import pickle
import tqdm
import time
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from enum import Enum
from urllib.request import urlretrieve
from cuml.ensemble import RandomForestClassifier as cumlRandomForestClassifier
from cuml.ensemble import RandomForestRegressor as cumlRandomForestRegressor

pbar = None


class LearningTask(Enum):
    REGRESSION = 1
    CLASSIFICATION = 2
    MULTICLASS_CLASSIFICATION = 3


class Data:  # pylint: disable=too-few-public-methods,too-many-arguments
    def __init__(self, X_train, X_test, y_train, y_test, learning_task):
        self.X_train = X_train
        self.X_test = X_test
        self.y_train = y_train
        self.y_test = y_test
        self.learning_task = learning_task


def show_progress(block_num, block_size, total_size):
    global pbar
    if pbar is None:
        pbar = tqdm.tqdm(total=total_size / 1024, unit='kB')
    downloaded = block_num * block_size
    if downloaded < total_size:
        pbar.update(block_size / 1024)
    else:
        pbar.close()
        pbar = None


def retrieve(url, filename=None):
    return urlretrieve(url, filename, reporthook=show_progress)


def download_higgs():
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz'
    local_url = os.path.basename(url)
    pickle_url = "higgs.pkl"
    if os.path.exists(pickle_url):
        return pickle.load(open(pickle_url, "rb"))
    if not os.path.isfile(local_url):
        retrieve(url, local_url)
    higgs = pd.read_csv(local_url)
    X = higgs.iloc[:, 1:].to_numpy(dtype=np.float32)
    y = higgs.iloc[:, 0].to_numpy(dtype=np.float32)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77,
                                                        test_size=0.2)
    data = Data(X_train, X_test, y_train, y_test, LearningTask.CLASSIFICATION)
    pickle.dump(data, open(pickle_url, "wb"), protocol=4)
    return data


def download_year():
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt' \
          '.zip'
    local_url = os.path.basename(url)
    pickle_url = "year.pkl"
    if os.path.exists(pickle_url):
        return pickle.load(open(pickle_url, "rb"))
    if not os.path.isfile(local_url):
        retrieve(url, local_url)
    year = pd.read_csv(local_url, header=None)
    X = year.iloc[:, 1:].to_numpy(dtype=np.float32)
    y = year.iloc[:, 0].to_numpy(dtype=np.float32)
    X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False,
                                                        train_size=463715,
                                                        test_size=51630)
    data = Data(X_train, X_test, y_train, y_test, LearningTask.REGRESSION)
    pickle.dump(data, open(pickle_url, "wb"), protocol=4)
    return data


def main():
    higgs = download_higgs()
    year = download_year()
    # higgs
    tstart = time.perf_counter()
    clf = cumlRandomForestClassifier(max_features=1.0, random_state=0, n_bins=128, n_streams=4,
                                     bootstrap=True, n_estimators=100, max_depth=20,
                                     max_samples=0.01, split_algo=1, use_experimental_backend=True)
    clf.fit(higgs.X_train, higgs.y_train)
    tend = time.perf_counter()
    print(f'higgs: time elapsed = {tend - tstart} s')
    # year
    tstart = time.perf_counter()
    clf = cumlRandomForestRegressor(max_features=1.0, random_state=0, n_bins=128, n_streams=4,
                                    bootstrap=True, n_estimators=100, max_depth=20,
                                    max_samples=0.01, split_algo=1, use_experimental_backend=True)
    clf.fit(year.X_train, year.y_train)
    tend = time.perf_counter()
    print(f'year: time elapsed = {tend - tstart} s')


if __name__ == '__main__':
    main()
```

(The datasets get downloaded once and are cached in subsequent runs.)

Before (commit 14bd6c1):

After (this PR, with
My GPU is a Quadro RTX 8000, using CUDA 11.0 and driver 450.51.06.
* using atomics to calculate PDFs and then using blockScan to get the required CDFs; the original code was issuing too many atomicAdds to shared memory
* `gridDim.x` was earlier hard-coded; it is now assigned dynamically based on occupancy while ceil-ing it to a minimum of 4 blocks
Approving ops-codeowner
file changes
conda/recipes/cuml/meta.yaml
```diff
@@ -43,7 +43,7 @@ requirements:
   - libcumlprims {{ minor_version }}
   - cupy>=7.8.0,<9.0.0a0
   - treelite=1.0.0
-  - nccl>=2.5
+  - nccl>=2.8.4
```
Can you please leave out unrelated changes from this pull request?
sorry about that, don't know why i thought rebasing was a good idea!
Force-pushed from be18c99 to c128de6
Force-pushed from c128de6 to 32a773f
…stogram-calculation-optimization-for-computesplitclassificationkernel
Changes LGTM. Thanks @venkywonka for this PR!
@venkywonka seems like the batched-level-algo unit tests are now failing for 11.2?

yes, working on it 😅 yet to find out why

The tests for this PR and PR #3674 will pass once PR #3690 gets merged
* This PR fixes the regressions shown by `BatchedLevelAlgo/DtClsTestF` and `BatchedLevelAlgo/DtRegTestF`, wherein the quantiles parameter passed to the `grow_tree` function was uninitialized garbage memory as opposed to what should have been quantiles computed for each column.
* It also replaces the old method of computing quantiles (`preprocess_quantiles`) with a new, more accurate one (`computeQuantiles`).
* It removes an unnecessary memory allocation to `tempmem` in the setup phase of the test fixture.
* This fixes the failing `BatchedLevelAlgo/DtRegTestF` tests reported in issue #3406.
* It also fixes the failing `BatchedLevelAlgo/DtClsTestF` tests in PR #3616.

cc @teju85 @vinaydes @JohnZed @hcho3

Authors:
- Venkat (https://github.com/venkywonka)

Approvers:
- Thejaswi. N. S (https://github.com/teju85)
- John Zedlewski (https://github.com/JohnZed)

URL: #3690
…ogram-calculation-optimization-for-computesplitclassificationkernel
Codecov Report

```
@@            Coverage Diff             @@
##           branch-0.19    #3616    +/-   ##
===============================================
+ Coverage        80.70%   82.92%   +2.21%
===============================================
  Files              227      227
  Lines            17615    17591      -24
===============================================
+ Hits             14217    14587     +370
+ Misses            3398     3004     -394
```

Flags with carried forward coverage won't be shown.

Continue to review the full report at Codecov.
@gpucibot merge
…tion optimization (#3674)

This is a follow-up of PR #3616 and should be merged after that. This PR introduces:

* Modularizing the `pdf_to_cdf` conversion using an inclusive sum-scan into a device function so that it can be reused by both `ML::DecisionTree::computeSplitClassificationKernel` and `ML::DecisionTree::computeSplitRegressionKernel`.
* Integrating the above-mentioned device function to calculate the prediction sums and counts in `ML::DecisionTree::computeSplitRegressionKernel`. These histograms are used for node-splitting in decision trees for the task of regression.
* The reason for this optimization follows the same explanation given in PR #3616.
* As of now, only the first pass has been optimized using sum-scans.

Authors:
- Venkat (https://github.com/venkywonka)

Approvers:
- Philip Hyunsu Cho (https://github.com/hcho3)
- Thejaswi. N. S (https://github.com/teju85)

URL: #3674
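The regression-side PDF-then-scan pattern described above can be sketched serially (a NumPy stand-in for the device code; `regression_split_stats` is an illustrative name, not the PR's API, and the per-bin layout is an assumption): each sample writes once into per-bin accumulators of label sum and count, and an inclusive prefix sum, the blockScan analog, then yields the left child's prediction sum and count for every candidate split at once.

```python
import numpy as np

def regression_split_stats(data, labels, bins):
    """Serial sketch: accumulate per-bin (label sum, sample count) with one
    write per sample, then an inclusive prefix sum gives, for each candidate
    split bins[b], the left child's label sum and sample count."""
    nb = len(bins)
    # one slot per bin plus an overflow slot for d larger than every bin
    pdf_sum = np.zeros(nb + 1)
    pdf_cnt = np.zeros(nb + 1, dtype=np.int64)
    idx = np.searchsorted(bins, data, side='left')  # first bin with edge >= d
    np.add.at(pdf_sum, idx, labels)  # one write per sample
    np.add.at(pdf_cnt, idx, 1)
    cdf_sum = np.cumsum(pdf_sum)[:nb]  # sum of y over samples with d <= bins[b]
    cdf_cnt = np.cumsum(pdf_cnt)[:nb]  # count of samples with d <= bins[b]
    return cdf_sum, cdf_cnt
```

From these left-child sums and counts (and the node totals), the right-child statistics and the split gain follow without touching the samples again.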
This PR introduces:

* Using atomics to calculate PDFs and then using blockScan to get the required CDF histograms in `ML::DecisionTree::computeSplitClassificationKernel`. These histograms are used for node-splitting in decision trees for the task of classification.
* Changing `gridDim.x` in the launch configuration of the above kernel from 4 to a value based on the occupancy calculator and the other grid dimensions, thus improving the occupancy to its theoretical limits.
* `computeSplitRegressionKernel` has different shared-memory write patterns that deserve their own PR for optimization 😬