Features

HyperGBM supports four running types:

  • Single node: runs on a single machine and uses pandas and NumPy data types.
  • Single node with NVIDIA GPU devices: runs on a single machine with NVIDIA GPU devices and uses cuDF and CuPy data types.
  • Distributed with single node: runs on a single machine and uses Dask data types; Dask collections must be created before using HyperGBM.
  • Distributed with multi nodes: runs on multiple machines and uses Dask data types; Dask collections must be created to manage resources across the machines before using HyperGBM.
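
The relationship between the data library in use and the four running types listed above can be sketched as a small lookup. This is only an illustration: `RunningMode` and `running_mode` are hypothetical names introduced here and are not part of the HyperGBM API, and the rule that the pandas/NumPy and cuDF/CuPy types are limited to one machine is taken from the descriptions above.

```python
from enum import Enum


class RunningMode(Enum):
    """The four running types described in this section (hypothetical enum)."""
    SINGLE_NODE = "Single node"
    SINGLE_NODE_GPU = "Single node with GPU"
    DISTRIBUTED_SINGLE_NODE = "Distributed with single node"
    DISTRIBUTED_MULTI_NODE = "Distributed with multi nodes"


def running_mode(data_library: str, n_machines: int = 1) -> RunningMode:
    """Map a data library name and machine count to a running type.

    Hypothetical helper for illustration only; HyperGBM does not expose it.
    """
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")

    lib = data_library.lower()
    if lib in ("pandas", "numpy"):
        mode = RunningMode.SINGLE_NODE
    elif lib in ("cudf", "cupy"):
        mode = RunningMode.SINGLE_NODE_GPU
    elif lib == "dask":
        # Dask is the only data type listed for both distributed types.
        return (RunningMode.DISTRIBUTED_SINGLE_NODE if n_machines == 1
                else RunningMode.DISTRIBUTED_MULTI_NODE)
    else:
        raise ValueError(f"unsupported data library: {data_library!r}")

    # pandas/NumPy and cuDF/CuPy are described as single-machine only.
    if n_machines != 1:
        raise ValueError(f"{data_library} data runs on a single machine")
    return mode


if __name__ == "__main__":
    print(running_mode("pandas").value)
    print(running_mode("dask", n_machines=4).value)
```

In practice the choice is made implicitly by the type of the training data, so a helper like this is mainly useful for reading the support table that follows.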

The following table summarizes the features supported in each running type:

  Features Single node Single node with GPU Distributed with single node Distributed with multi nodes
Data Cleaning Empty characters handling
  Recognizing column types automatically
  Column type correction
  Constant column cleaning
  Repeated column cleaning
  Deleting examples without targets
  Illegal character replacement
  ID column cleaning
Dataset splitting Splitting by ratio
  Adversarial validation
Feature engineering Feature generation  
  Feature dimension reduction
Data preprocessing SimpleImputer
  SafeOrdinalEncoder
  TargetEncoder    
  SafeOneHotEncoder
  TruncatedSVD
  StandardScaler
  MinMaxScaler
  MaxAbsScaler
  RobustScaler
Imbalanced data handling ClassWeight
  UnderSampling (NearMiss, TomekLinks, Random)
  OverSampling (SMOTE, ADASYN, Random)
Search algorithms MCTS
  Evolution
  Random search
  NSGA-II      
  R-NSGA-II      
  MOEA/D      
  Playback
Early stopping Time limit
  No improvement after n trials
  expected_reward
  Trial discriminator
Modeling algorithms XGBoost
  LightGBM
  CatBoost  
  HistGradientBoosting
Evaluation Cross-Validation
  Train-Validation-Holdout
Advanced Automatic task type inference
  Data adaption    
  Collinearity detection  
  Data drift detection
  Feature selection
  Feature selection (two-stage)
  Pseudo label (two-stage)
  Pre-searching with UnderSampling
  Model ensemble
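
To make the early stopping rows of the table more concrete, the following is a minimal sketch of how three of the listed conditions (time limit, no improvement after n trials, and `expected_reward`) could be combined in a search loop. It is not HyperGBM's implementation: the class `EarlyStopping`, its parameter names, and the assumption that a higher reward is better are all introduced here for illustration.

```python
import time
from typing import Callable, Optional


class EarlyStopping:
    """Hypothetical early-stopping tracker; assumes higher reward is better."""

    def __init__(self,
                 max_no_improvement_trials: Optional[int] = None,
                 time_limit: Optional[float] = None,
                 expected_reward: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.max_no_improvement_trials = max_no_improvement_trials
        self.time_limit = time_limit            # seconds since construction
        self.expected_reward = expected_reward
        self.best_reward: Optional[float] = None
        self.trials_without_improvement = 0
        self._clock = clock
        self._start = clock()

    def on_trial_end(self, reward: float) -> Optional[str]:
        """Record a trial reward; return a stop reason, or None to continue."""
        if self.best_reward is None or reward > self.best_reward:
            self.best_reward = reward
            self.trials_without_improvement = 0
        else:
            self.trials_without_improvement += 1

        if (self.expected_reward is not None
                and self.best_reward >= self.expected_reward):
            return "expected_reward"
        if (self.max_no_improvement_trials is not None
                and self.trials_without_improvement
                >= self.max_no_improvement_trials):
            return "no_improvement"
        if (self.time_limit is not None
                and self._clock() - self._start >= self.time_limit):
            return "time_limit"
        return None


if __name__ == "__main__":
    stopper = EarlyStopping(max_no_improvement_trials=3)
    for trial_no, reward in enumerate([0.70, 0.72, 0.71, 0.72, 0.69, 0.75], 1):
        reason = stopper.on_trial_end(reward)
        if reason is not None:
            print(f"stopped after trial {trial_no}: {reason}")
            break
```

The `clock` argument is injectable only so that the time limit can be exercised deterministically in tests; a real search loop would normally rely on the default monotonic clock.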