Learn how to use the library in general or across multiple case-study data science courses!
Latest commit 51de24d Oct 21, 2019

Version: 0.10.0 Build: Passing License: MPL 2.0 Maintenance Contributors: 4 GitHub issues PRs Welcome HitCount

Installing RemixAutoML:

Install package dependencies and install RemixAutoML:


library(devtools)
to_install <- c("arules","catboost","caTools","data.table","doParallel","xgboost",
  "foreach","forecast","fpp","ggplot2","gridExtra","h2o","itertools","lubridate",
  "magick","Matrix", "MLmetrics","monreg","nortest","RColorBrewer","recommenderlab","ROCR","zoo",
  "pROC","scatterplot3d","stringr","sde","timeDate","tm","tsoutliers","wordcloud","Rcpp")
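for (i in to_install) {
  message(paste("looking for ", i))
  # && short-circuits on scalars; quietly = TRUE suppresses load-failure messages
  if (i == "catboost" && !requireNamespace(i, quietly = TRUE)) {
    # catboost is installed from GitHub rather than CRAN
    devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
  } else if (i == "h2o" && !requireNamespace(i, quietly = TRUE)) {
    # Clear out any attached or stale h2o installation before reinstalling
    if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
    if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
    # h2o needs these two packages to be present first
    pkgs <- c("RCurl", "jsonlite")
    for (pkg in pkgs) {
      if (!(pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
    }
    install.packages("h2o")
  } else if (!requireNamespace(i, quietly = TRUE)) {
    message(paste("     installing", i))
    install.packages(i)
  }
}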

# Install RemixAutoML:
devtools::install_github('AdrianAntico/RemixAutoML', upgrade = FALSE, dependencies = FALSE, force = TRUE)

If you're having trouble installing, see if this issue helps you out.

Issue #19

RemixAutoML

This is a collection of functions that I have made to speed up machine learning and to ensure high quality modeling results and output are generated. They are great at establishing solid baselines that are extremely challenging to beat using alternative methods (if at all). They are intended to make the development cycle fast and robust, along with making operationalizing quick and easy, with low latency model scoring. To see them in action, check out the free tutorials at RemyxCourses.com or the reference manual and vignette in the vignette folder above.

Also, be sure to visit our blog at RemixInstitute.ai for data science, machine learning, and AI content.

You can contact me via LinkedIn with any questions about the package. You can also go into the vignettes folder to see the package reference manual and a vignette with some background and examples. If you want to be a contributor, contact me via LinkedIn email.

RemixAutoML Blogs:

AI for Small to Medium Size Businesses: A Management Take On The Challenges...

Why Machine Learning is more Practical than Econometrics in the Real World

Build Thousands of Automated Demand Forecasts in 15 Minutes Using AutoCatBoostCARMA in R

Automate Your KPI Forecasts With Only 1 Line of R Code Using AutoTS

Companies Are Demanding Model Interpretability. Here’s How To Do It Right

The Easiest Way to Create Thresholds And Improve Your Classification Model

Automated Supervised Learning Training Functions:


Regression:



AutoCatBoostRegression() GPU + CPU

AutoCatBoostRegression() utilizes the CatBoost algorithm in the below steps

AutoXGBoostRegression() GPU + CPU

AutoXGBoostRegression() utilizes the XGBoost algorithm in the below steps

AutoH2oGBMRegression()

AutoH2oGBMRegression() utilizes the H2O Gradient Boosting algorithm in the below steps

AutoH2oDRFRegression()

AutoH2oDRFRegression() utilizes the H2O Distributed Random Forest algorithm in the below steps

The Auto_Regression() models handle a multitude of tasks. In order:

  1. Convert your data to data.table format for faster processing
  2. Transform your target variable using the best normalization method based on the AutoTransformationCreate() function
  3. Create train, validation, and test data, utilizing the AutoDataPartition() function, if you didn't supply those directly to the function
  4. Consolidate the columns that are used for modeling and the metadata you want returned in your test data with predictions
  5. Dichotomize categorical variables (for AutoXGBoostRegression()) and save the factor levels for scoring in a way that guarantees consistency across training, validation, and test data sets, utilizing the DummifyDT() function
  6. Save the final modeling column names for reference
  7. Handle the data conversion to the appropriate modeling type, such as CatBoost, H2O, and XGBoost
  8. Build out a random hyperparameter set for a random grid search for model grid tuning (which includes the default model hyperparameters) if you choose to run a grid tune
  9. Loop through the grid-tuning process, building N models
  10. Collect the evaluation metrics for each grid tune run
  11. Identify the best model of the set of models built in the grid tuning search
  12. Save the hyperparameters from the winning grid tuned model
  13. Build the final model based on the best model from the grid tuning model search (I remove each model after evaluation metrics are generated in the grid tune to avoid memory overflow)
  14. Back-transform your predictions based on the best transformation used earlier in the process
  15. Collect evaluation metrics based on performance on test data (based on back-transformed data)
  16. Store the final predictions with the associated test data and other columns you want included in that set
  17. Save your transformation metadata for recreating them in a scoring process
  18. Build out and save an Evaluation Calibration Line Plot and Evaluation Calibration Box-Plot, using the EvalPlot() function
  19. Generate and save Variable Importance
  20. Generate and save Partial Dependence Calibration Line Plots and Partial Dependence Calibration Box-Plots, using the ParDepCalPlots() function
  21. Return all the objects generated in a named list for immediate use and evaluation
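
The grid-tuning portion of these steps (build a random hyperparameter set that includes the defaults, loop over it, collect metrics, pick the winner, refit) can be sketched in base R. This is only an illustration under stated assumptions: the toy data, the use of `lm()` with a polynomial degree as the single "hyperparameter", and all variable names are hypothetical stand-ins, not the package's internals.

```r
set.seed(42)

# Toy data: a noisy cubic relationship (illustrative only)
n <- 300
x <- runif(n, -2, 2)
y <- x^3 - x + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Random train / validation split
idx   <- sample(n, size = round(0.7 * n))
train <- dat[idx, ]
valid <- dat[-idx, ]

# Hyperparameter set: the first row is the "default" model, the rest are random draws
default_degree <- 1L
grid <- data.frame(degree = c(default_degree, sample(2:6, size = 4)))
grid$rmse <- NA_real_

# Grid-tuning loop: fit, score on validation data, record the metric, drop the model
for (i in seq_len(nrow(grid))) {
  fit  <- lm(y ~ poly(x, degree = grid$degree[i]), data = train)
  pred <- predict(fit, newdata = valid)
  grid$rmse[i] <- sqrt(mean((valid$y - pred)^2))
  rm(fit)  # discard each tuned model once its metric is stored
}

# Identify the winner and rebuild the final model from its hyperparameters
best <- grid[which.min(grid$rmse), ]
final_model <- lm(y ~ poly(x, degree = best$degree), data = train)
```

Storing only the metric per run, rather than every fitted model, keeps memory use flat no matter how many grid rows are tried.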

Binary Classification:



AutoCatBoostClassifier() GPU + CPU

AutoCatBoostClassifier() utilizes the CatBoost algorithm in the below steps

AutoXGBoostClassifier() GPU + CPU

AutoXGBoostClassifier() utilizes the XGBoost algorithm in the below steps

AutoH2oGBMClassifier()

AutoH2oGBMClassifier() utilizes the H2O Gradient Boosting algorithm in the below steps

AutoH2oDRFClassifier()

AutoH2oDRFClassifier() utilizes the H2O Distributed Random Forest algorithm in the below steps

The Auto_Classifier() models handle a multitude of tasks. In order:

  1. Convert your data to data.table format for faster processing
  2. Create train, validation, and test data if you didn't supply those directly to the function
  3. Consolidate the columns that are used for modeling and those to be kept in the returned data
  4. Dichotomize categorical variables (for AutoXGBoostClassifier()) and save the factor levels for scoring in a way that guarantees consistency across training, validation, and test data sets
  5. Save the final column names for modeling to a csv for later reference
  6. Handle the data conversion to the appropriate type, based on model type (CatBoost, H2O, and XGBoost)
  7. Build out a random hyperparameter set for a random grid search for model tuning (includes the default model hyperparameters) if you want to utilize that feature
  8. Build the grid tuned models
  9. Collect the evaluation metrics for each grid tune run
  10. Identify the best model of the set of models built in the grid tuning setup
  11. Save the hyperparameters from the winning grid tuned model
  12. Build the final model based on the best model from the grid tuning model search
  13. Collect evaluation metrics based on performance on test data
  14. Store the final predictions with the associated test data and other columns you want included in that set
  15. Build out and save an Evaluation Calibration Line Plot
  16. Build out and save an ROC plot with the top 5 models used in grid-tuning (includes the winning model)
  17. Generate and save Variable Importance data
  18. Generate and save Partial Dependence Calibration Line Plots
  19. Return all the objects generated in a named list for immediate use

Multinomial Classification:



AutoCatBoostMultiClass() GPU + CPU

AutoCatBoostMultiClass() utilizes the CatBoost algorithm in the below steps

AutoXGBoostMultiClass() GPU + CPU

AutoXGBoostMultiClass() utilizes the XGBoost algorithm in the below steps

AutoH2oGBMMultiClass()

AutoH2oGBMMultiClass() utilizes the H2O Gradient Boosting algorithm in the below steps

AutoH2oDRFMultiClass()

AutoH2oDRFMultiClass() utilizes the H2O Distributed Random Forest algorithm in the below steps

The Auto_MultiClass() models handle a multitude of tasks. In order:

  1. Convert your data to data.table format for faster processing
  2. Create train, validation, and test data if you didn't supply those directly to the function
  3. Consolidate the columns that are used for modeling and those to be kept in the returned data
  4. Dichotomize categorical variables (for AutoXGBoostMultiClass()) and save the factor levels for scoring in a way that guarantees consistency across training, validation, and test data sets
  5. Save the final column names for modeling to a csv for later reference
  6. Ensure the target levels are consistent across train, validate, and test sets and save the levels to file
  7. Handle the data conversion to the appropriate type, based on model type (CatBoost, H2O, and XGBoost)
  8. Build out a random hyperparameter set for a random grid search for model tuning (includes the default model hyperparameters) if you want to utilize that feature
  9. Build the grid tuned models
  10. Collect the evaluation metrics for each grid tune run
  11. Identify the best model of the set of models built in the grid tuning setup
  12. Save the hyperparameters from the winning grid tuned model
  13. Build the final model based on the best model from the grid tuning model search
  14. Collect evaluation metrics based on performance on test data
  15. Store the final predictions with the associated test data and other columns you want included in that set
  16. Generate and save Variable Importance data
  17. Return all the objects generated in a named list for immediate use

Generalized Hurdle Models:



The first step is to build either a binary classification model (in the case of a single bucket value, such as zero) or a multiclass model (in the case of multiple bucket values, such as zero and 10). The next step is to subset the data into the cases: less than the first split value, between the first and second split values, between the second and third split values, ..., between the second-to-last and last split values, and greater than or equal to the last split value. For each data subset, a regression model is built to predict values within that range. The final prediction is the sum, across groups, of the probability of being in each group multiplied by the regression prediction for that group.

Single Partition
  • E(y|xi) = Pr(X = 0) * 0 + Pr(X > 0) * E(X | X > 0)
  • E(y|xi) = Pr(X < x1) * E(X | X < x1) + Pr(X >= x1) * E(X | X >= x1)
Multiple Partitions
  • E(y|xi) = Pr(X = 0) * 0 + Pr(0 < X < x2) * E(X | 0 < X < x2) + ... + Pr(xn-1 <= X < xn) * E(X | xn-1 <= X < xn) + Pr(X >= xn) * E(X | X >= xn)
  • E(y|xi) = Pr(X < x1) * E(X | X < x1) + Pr(x1 <= X < x2) * E(X | x1 <= X < x2) + ... + Pr(xn-1 <= X < xn) * E(X | xn-1 <= X < xn) + Pr(X >= xn) * E(X | X >= xn)
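
As a worked illustration of how the pieces combine, the arithmetic can be run on made-up numbers. Everything below is hypothetical: the probabilities and conditional means are typed in by hand rather than produced by a classifier or regression model, and the bucket boundaries (zero and 10) are just the example values mentioned above.

```r
# Hypothetical model outputs for three records
p_zero     <- c(0.80, 0.30, 0.05)   # classifier output: Pr(X = 0)
p_positive <- 1 - p_zero            # Pr(X > 0)
e_positive <- c(12.0, 8.5, 20.0)    # regression on the X > 0 subset: E(X | X > 0)

# Single partition at zero: Pr(X = 0) * 0 + Pr(X > 0) * E(X | X > 0)
expected_single <- p_zero * 0 + p_positive * e_positive

# Multiple partitions for two records; columns are the buckets
# X = 0, 0 < X < 10, and X >= 10
probs <- rbind(c(0.6, 0.3, 0.1),
               c(0.2, 0.5, 0.3))     # multiclass probabilities, each row sums to 1
cond_means <- rbind(c(0, 4.0, 25.0),
                    c(0, 6.0, 18.0)) # per-bucket regression predictions

# Probability-weighted sum across buckets
expected_multi <- rowSums(probs * cond_means)
```

Here `expected_single` is 2.4, 5.95, and 19, and `expected_multi` is 3.7 and 8.4.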
AutoCatBoostHurdleModel()

AutoCatBoostHurdleModel() utilizes the CatBoost algorithm on the backend.

AutoXGBoostHurdleModel()

AutoXGBoostHurdleModel() utilizes the XGBoost algorithm on the backend.

AutoH2oDRFHurdleModel()

AutoH2oDRFHurdleModel() utilizes the H2O distributed random forest algorithm on the backend.

AutoH2oGBMHurdleModel()

AutoH2oGBMHurdleModel() utilizes the H2O gradient boosting machine algorithm on the backend.

General Purpose H2O Automated Modeling:



AutoH2OModeler()

AutoH2OModeler() automatically builds any number of models, along with partial dependence calibration plots, model evaluation calibration plots, grid tuning, and file storage for easy production implementation. It handles regression, quantile regression, time until event, and classification models (binary and multinomial) using numeric and factor variables without the need for monotonic transformations or one-hot encoding.

  • Models include:
    • RandomForest (DRF)
    • GBM
    • Deeplearning
    • XGBoost (for Linux)
    • LightGBM (for Linux)
    • AutoML - medium depth grid tuning for Deeplearning, XGBoost (if available), DRF, GBM, GLM, and StackedEnsembles

Nonlinear Regression Modeling:



AutoNLS()

AutoNLS() is an automated nonlinear regression modeling function. It automatically finds the best model fit from the suite of models below and merges the predictions onto the source data. Great for forecasting growth over time or estimating single variable nonlinear functions.

  • Models included:
    • Asymptotic
    • Asymptotic through origin
    • Asymptotic with offset
    • Bi-exponential
    • Four parameter logistic
    • Three parameter logistic
    • Gompertz
    • Michaelis-Menten
    • Weibull
    • Polynomial regression or monotonic regression
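
The "fit several curve families and keep the best" idea can be sketched with the self-starting models that ship with base R's stats package. This is a minimal sketch under assumptions, not AutoNLS() itself: the simulated data, the three candidates chosen, and the use of residual RMSE as the selection metric are all illustrative.

```r
set.seed(1)

# Simulated growth curve (logistic shape plus small noise), illustrative only
x <- 1:20
y <- 100 / (1 + exp((8 - x) / 1.5)) + rnorm(length(x), sd = 0.5)
dat <- data.frame(x = x, y = y)

# Candidate models using self-starting functions from the stats package
candidates <- list(
  logistic3 = y ~ SSlogis(x, Asym, xmid, scal),
  gompertz  = y ~ SSgompertz(x, Asym, b2, b3),
  weibull   = y ~ SSweibull(x, Asym, Drop, lrc, pwr)
)

# Fit each candidate; a model that fails to converge is skipped, not fatal
fits <- lapply(candidates, function(f) {
  tryCatch(nls(f, data = dat), error = function(e) NULL)
})
fits <- Filter(Negate(is.null), fits)

# Rank the surviving fits by residual RMSE and keep the best one
rmse <- sapply(fits, function(m) sqrt(mean(residuals(m)^2)))
best_name <- names(which.min(rmse))

# Merge the winning model's predictions back onto the source data
dat$pred <- predict(fits[[best_name]], newdata = dat)
```

Wrapping each fit in `tryCatch()` matters in practice because nonlinear least squares often fails to converge for curve families that do not suit the data.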

Automated Model Scoring Functions:


AutoCatBoostScoring()

AutoCatBoostScoring() is an automated scoring function that complements the AutoCatBoost() model training functions. This function requires you to supply features for scoring. It will run ModelDataPrep() to prepare your features for catboost data conversion and scoring. It will also handle any transformations and back-transformations if you utilized that feature in the regression training case.

AutoXGBoostScoring()

AutoXGBoostScoring() is an automated scoring function that complements the AutoXGBoost() model training functions. This function requires you to supply features for scoring. It will run the ModelDataPrep() and DummifyDT() functions to prepare your features for xgboost data conversion and scoring. It will also handle any transformations and back-transformations if you utilized that feature in the regression training case.

AutoH2OMLScoring()

AutoH2OMLScoring() is an automated scoring function that complements the AutoH2oGBM__() and AutoH2oDRF__() model training functions. This function requires you to supply features for scoring. It will run ModelDataPrep() to prepare your features for H2O data conversion and scoring. It will also handle transformations and back-transformations if you utilized that feature in the regression training case and didn't do it yourself beforehand.

AutoH2OScoring()

AutoH2OScoring() is for scoring models that were built with the AutoH2OModeler, AutoKMeans, and AutoWord2VecModeler functions. Scores mojo models or binary files by loading models into the H2O environment and scoring them. You can choose which output you wish to keep as well for classification and multinomial models.

Automated Time Series Modeling Functions:


AutoTS()


  • Returns a list containing
    • A data.table object with a date column and the forecasted values
    • The model evaluation results
    • The champion model for later use if desired
    • The name of the champion model
    • A time series ggplot with historical values and forecasted values with optional 80% and 95% prediction intervals
  • The models tested internally include:
    • DSHW: Double Seasonal Holt-Winters
    • ARFIMA: Auto Regressive Fractional Integrated Moving Average
    • ARIMA: Auto Regressive Integrated Moving Average with specified max lags, seasonal lags, moving averages, and seasonal moving averages
    • ETS: Additive and Multiplicative Exponential Smoothing and Holt-Winters
    • NNetar: Auto Regressive Neural Network models; automatically compares models with 1 lag or 1 seasonal lag against models with up to N lags and N seasonal lags
    • TBATS: Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components
    • TSLM: Time Series Linear Model - builds a linear model with trend and season components extracted from the data

For each of the models tested internally, several aspects should be noted:

  • Optimal Box-Cox transformations are used in every run where data is strictly positive. The optimal transformation could also be "no transformation". 

  • Four different treatments are tested for each model:

    • user-specified time frequency + no historical series smoothing & imputation
    • model-based time frequency + no historical smoothing and imputation
    • user-specified time frequency + historical series smoothing & imputation
    • model-based time frequency + historical smoothing & imputation
  • You can specify MaxFourierPairs to test out if adding Fourier term regressors can increase forecast accuracy. The Fourier terms will be applied to the ARIMA and NNetar models only.

  • For the ARIMA, ARFIMA, and TBATS models, any number of lags and moving averages, along with up to 1 seasonal lag and 1 seasonal moving average, can be used (selection is based on a stepwise procedure)

  • For the Double Seasonal Holt-Winters model, alpha, beta, gamma, omega, and phi are determined using least-squares and the forecasts are adjusted using an AR(1) model for the errors

  • The Exponential Smoothing State-Space model runs through an automatic selection of the error type, trend type, and season type, with the options being "none", "additive", and "multiplicative", along with testing of damped vs. non-damped trend (either additive or multiplicative), and alpha, beta, and phi are estimated

  • The neural network is set up to test every combination of lags and seasonal lags, and the model with the best holdout score is selected

  • The TBATS model utilizes any number of lags and moving averages for the errors, damped trend vs. non-damped trend are tested, trend vs. non-trend are also tested, and the model utilizes parallel processing for efficient run times

  • The TSLM model utilizes a simple time trend and season depending on the frequency of the data
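
The champion-selection idea (fit several forecasters, score each on a holdout, keep the best) can be sketched with base R only. This is a sketch under assumptions rather than what AutoTS() runs: the three candidates are stand-ins from the stats package, the fixed log transform stands in for a Box-Cox transformation with lambda = 0, and MAPE on a 12-month holdout is an assumed evaluation choice.

```r
# Hold out the last 12 months of a built-in monthly series
series  <- AirPassengers
train   <- window(series, end = c(1959, 12))
holdout <- window(series, start = c(1960, 1))
h <- length(holdout)

# Candidate forecasters: each takes a training series and a horizon
forecasters <- list(
  holt_winters = function(y, h) {
    as.numeric(predict(HoltWinters(y), n.ahead = h))
  },
  arima_log = function(y, h) {
    # Seasonal ARIMA on the log scale, back-transformed with exp()
    fit <- arima(log(y), order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1)))
    exp(as.numeric(predict(fit, n.ahead = h)$pred))
  },
  seasonal_naive = function(y, h) {
    # Repeat the last observed season
    rep(tail(as.numeric(y), frequency(y)), length.out = h)
  }
)

# Forecast the holdout with every candidate and score by MAPE
forecasts <- lapply(forecasters, function(f) f(train, h))
actual <- as.numeric(holdout)
mape <- sapply(forecasts, function(fc) mean(abs(actual - fc) / actual))

# The champion is the candidate with the lowest holdout error
champion <- names(which.min(mape))
```

Keeping every candidate behind the same `function(y, h)` interface makes it straightforward to add or remove models without touching the evaluation code.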

The CARMA Suite


AutoCatBoostCARMA()

AutoCatBoostCARMA() utilizes the CatBoost algorithm

AutoXGBoostCARMA()

AutoXGBoostCARMA() utilizes the XGBoost algorithm

AutoH2oDRFCARMA()

AutoH2oDRFCARMA() utilizes the H2O Distributed Random Forest algorithm

AutoH2oGBMCARMA()

AutoH2oGBMCARMA() utilizes the H2O Gradient Boosting Machine algorithm

The CARMA suite utilizes several features to ensure proper models are built to generate the best possible out-of-sample forecasts.

Feature engineering: I use a time trend, calendar variables, holiday counts, lags and moving averages. Internally, the CARMA functions utilize several RemixAutoML functions, all written using data.table for fast and memory efficient processing: 

  • DT_GDL_Feature_Engineering() - creates lags and moving average features (also creates lags and moving averages off of time between records)
  • Scoring_GDL_Feature_Engineering() - creates lags and moving average features for a single record (along with the time between vars)
  • CreateCalendarVariables() - creates numeric features identifying various time units based on date columns
  • CreateHolidayVariables() - creates count features based on the specified holiday groups you want to track and the date columns you supply

Optimal transformations: the target variable, along with its associated lag and moving average features, is transformed. This is really useful for regression models with categorical features that have associated target values that significantly differ from each other. The transformation options that are tested (using a Pearson test for normality) include: 

  • YeoJohnson
  • BoxCox
  • arcsinh
  • Identity
  • arcsin(sqrt(x)): proportion data only
  • logit(x): proportion data only
The functions used to create these and generate them for scoring models come from RemixAutoML:
  • AutoTransformationCreate()
  • AutoTransformationScore()
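
The select-then-invert pattern behind these two functions can be sketched in base R. The sketch makes several assumptions: it tests only three candidates (identity, arcsinh, and log, the latter being Box-Cox with lambda = 0), and it scores normality with the Shapiro-Wilk statistic from the stats package as a stand-in for the Pearson test mentioned above.

```r
set.seed(7)
target <- rlnorm(500, meanlog = 3, sdlog = 1)  # skewed, strictly positive toy target

# Candidate transformations paired with their inverses
transforms <- list(
  identity = list(fwd = function(x) x, inv = function(z) z),
  asinh    = list(fwd = asinh,         inv = sinh),
  log      = list(fwd = log,           inv = exp)
)

# Score each candidate: a Shapiro-Wilk W closer to 1 means closer to normal
w_stat <- sapply(transforms, function(tr) {
  unname(shapiro.test(tr$fwd(target))$statistic)
})
best <- names(which.max(w_stat))

# Apply the winner, then invert it, as would be done to back-transform predictions
transformed <- transforms[[best]]$fwd(target)
recovered   <- transforms[[best]]$inv(transformed)
```

Storing each forward function next to its inverse is one simple way to keep the metadata needed to back-transform predictions at scoring time.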

Models: there are four CARMA functions and each uses a different algorithm for the model fitting. The models used to fit the time series data come from RemixAutoML and include: 

  • AutoCatBoostRegression()
  • AutoXGBoostRegression()
  • AutoH2oDRFRegression()
  • AutoH2oGBMRegression()

GPU: With the CatBoost and XGBoost functions, you can build the models utilizing GPU (I run them with a GeForce 1080ti) which results in an average 10x speedup in model training time (compared to running on CPU with 8 threads).

Data partitioning: to create the training, validation, and test data, the CARMA functions call AutoDataPartition() with the "timeseries" option for the PartitionType argument. This ensures that the training data covers the earliest period, followed by the validation data, and then the test data, which is the most recent in time.
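
A time-ordered split of this kind can be sketched in a few lines of base R. The 70/20/10 ratios and the toy data are assumptions for illustration; this is not the AutoDataPartition() implementation.

```r
# Toy daily series
dat <- data.frame(date = as.Date("2019-01-01") + 0:99, y = seq_len(100))
dat <- dat[order(dat$date), ]   # sort oldest to newest before splitting

n <- nrow(dat)
train_end <- round(0.7 * n)     # first 70% of rows: oldest records
valid_end <- round(0.9 * n)     # next 20%: validation

train <- dat[seq_len(train_end), ]
valid <- dat[(train_end + 1):valid_end, ]
test  <- dat[(valid_end + 1):n, ]   # final 10%: most recent records
```

Because no row is sampled at random, every validation and test record is later in time than every training record, which mirrors how the model will be used for forecasting.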

Forecasting: Once the regression model is built, the forecast process replicates the ARIMA process. Once a single step-ahead forecast is made, the lags and moving average features are updated based on the predicted values from scoring the model. Next, the rest of the other features are updated. Then the next forecast step is made, rinse and repeat for remaining forecasting steps. This process utilizes the RemixAutoML functions:

  • AutoCatBoostScoring()
  • AutoXGBoostScoring()
  • AutoH2OMLScoring()
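
The recursive forecasting loop described above can be sketched in base R. The sketch is illustrative only: `lm()` on two lag features stands in for the boosted regression model, and the toy series and five-step horizon are assumptions.

```r
set.seed(3)

# Toy autoregressive series
y <- as.numeric(arima.sim(list(ar = c(0.6, 0.2)), n = 200)) + 10

# Training frame with two lag features
n <- length(y)
train <- data.frame(y = y[3:n], lag1 = y[2:(n - 1)], lag2 = y[1:(n - 2)])
model <- lm(y ~ lag1 + lag2, data = train)  # stand-in for the regression model

# Recursive multi-step forecast: each prediction becomes a lag for the next step
horizon <- 5
history <- y
for (step in seq_len(horizon)) {
  m <- length(history)
  features <- data.frame(lag1 = history[m], lag2 = history[m - 1])
  next_value <- unname(predict(model, newdata = features))
  history <- c(history, next_value)  # update the series with the new forecast
}
forecast_values <- tail(history, horizon)
```

Appending each forecast to `history` is what lets the lag features for step k + 1 be rebuilt from the predicted value at step k.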

Intermittent Demand Forecasting Functions

TimeSeriesFill()

TimeSeriesFill() is a function that will zero pad (currently only zero pad) a time series data set (not transactional data). There are three ways to use this function:

  • Grouped data 1 - find the minimum and maximum dates regardless of grouping variables and use those values to ensure all group levels have all the dates represented within the series bounds (if missing, fill with zeros)
  • Grouped data 2 - find the minimum and maximum dates with respect to each unique grouping variable level (grouping variables must be hierarchical) and zero pad missing dates within each group level.
  • Single series - Zero pad any missing dates within series bounds
  • Used internally with the CARMA suite of functions by specifying the argument to enable this functionality
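
The single-series case can be sketched in base R to show what zero padding means here. The data and column names are made up, and this is not the TimeSeriesFill() implementation.

```r
# Toy single series with two missing days (Jan 3 and Jan 4)
sales <- data.frame(
  date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-05")),
  qty  = c(5, 3, 7)
)

# Full daily calendar between the series bounds
calendar <- data.frame(date = seq(min(sales$date), max(sales$date), by = "day"))

# Left-join the observed values onto the calendar, then zero-fill the gaps
filled <- merge(calendar, sales, by = "date", all.x = TRUE)
filled$qty[is.na(filled$qty)] <- 0
```

For grouped data, the same join would be done against a calendar built per group level (or against one shared calendar crossed with all group levels), depending on which of the grouped options above is wanted.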
IntermittentDemandDataGenerator()

IntermittentDemandDataGenerator() is for frequency and size data sets. This function generates count and size data sets for intermittent demand forecasting, using the methods in this package.

AutoCatBoostSizeFreqDist()

AutoCatBoostSizeFreqDist() is for building size and frequency predictive distributions via quantile regressions. Size (or severity) and frequency (or count) quantile regressions are built and you supply the actual percentiles you want predicted. Use this with the ID_SingleLevelGibbsSampler() function to simulate from the joint distribution.

AutoH2oGBMSizeFreqDist()

AutoH2oGBMSizeFreqDist() is for building size and frequency predictive distributions via quantile regressions. Size (or severity) and frequency (or count) quantile regressions are built and you supply the actual percentiles you want predicted. Use this with the ID_SingleLevelGibbsSampler() function to simulate from the joint distribution.

AutoCatBoostFreqSizeScoring()

AutoCatBoostFreqSizeScoring() is for scoring the models built with AutoCatBoostSizeFreqDist(). It will return the predicted values for every quantile model for both distributions for 1 to the max forecast periods you provided to build the scoring data.

AutoH2oGBMFreqSizeScoring()

AutoH2oGBMFreqSizeScoring() is for scoring the models built with AutoH2oGBMSizeFreqDist(). It will return the predicted values for every quantile model for both distributions for 1 to the max forecast periods you provided to build the scoring data.

ID_Forecast()

ID_Forecast() is for simulating via a collapsed Gibbs sampler from the quantile regressions built with Auto_SizeFreqDist() functions.

Automated Recommender System Functions:


AutoRecomDataCreate()

AutoRecomDataCreate() automatically creates your binary ratings matrix from transaction data

AutoRecommender()

AutoRecommender() automates collaborative filtering modeling, where each model below competes against the others for top performance

  • RandomItems
  • PopularItems
  • UserBasedCF
  • ItemBasedCF
  • AssociationRules
AutoRecommenderScoring()

AutoRecommenderScoring() automatically scores a recommender model from AutoRecommender()

AutoMarketBasketModel()

AutoMarketBasketModel() is a function that runs a market basket analysis automatically. It will convert your data, run the algorithm, and generate the recommended items. On top of that, it includes additional significance values not provided by the source package.

Automated Unsupervised Learning Functions:


GenTSAnomVars()

GenTSAnomVars() generates time series anomaly variables. (Cross with Feature Engineering) Create indicator variables (high, low) along with cumulative anomaly rates (high, low) based on control limits methodology over a max of two grouping variables and a date variable (effectively a rolling GLM).

ResidualOutliers()

ResidualOutliers() generates residual outliers from time series modeling. (Cross with Feature Engineering) It utilizes tsoutliers to indicate outliers within a time series data set.

AutoKMeans()

AutoKMeans() builds a generalized low rank model followed by KMeans. (Possible cross with Feature Engineering) It generates a column with a cluster identifier based on a grid tuned (optional) generalized low rank model and a grid tuned (optimal) K-Optimal searching K-Means algorithm.

ProblematicRecords()

ProblematicRecords() automatically identifies anomalous data records via Isolation Forests from H2O.

Automated Feature Engineering Functions:


DT_GDL_Feature_Engineering()

DT_GDL_Feature_Engineering() builds autoregressive and moving average features from target columns and distributed lags and distributed moving average from independent features distributed across time. On top of that, you can also create time between instances along with their associated lags and moving averages. This function works for data with groups and without groups. 100% data.table built. It runs super fast and can handle big data.

Partial_DT_GDL_Feature_Engineering()

Partial_DT_GDL_Feature_Engineering() is for generating the equivalent features built from DT_GDL_Feature_Engineering() for a set of new records as rapidly as possible. I used this to create the feature vectors for scoring models in production. This function generates lags and moving averages (along with lags and moving averages off of time between records) for a partial set of records in your data set, typically new records that become available for model scoring. Column names and ordering will be identical to the output from the corresponding DT_GDL_Feature_Engineering() function, which most likely was used to create features for model training.

Scoring_GDL_Feature_Engineering()

Scoring_GDL_Feature_Engineering() is a function that runs internally inside the CARMA functions but might have use outside of them. It is for scoring a single record when there are no grouping variables, or one record per group level when a single group is utilized. It generates column names identical to those of the DT_GDL_Feature_Engineering() function and the Partial_DT_GDL_Feature_Engineering() function.

AutoWord2VecModeler()

AutoWord2VecModeler() generates a specified number of vectors for each column of text data in your data set and saves the models for re-creating them later in the scoring process. You can choose to build individual models for each column or one model for all your columns.

CreateCalendarVariables()

CreateCalendarVariables() creates new columns that extract the calendar information from date columns, such as second, minute, hour, week day, day of month, day of year, week, isoweek, month, quarter, and year.

CreateHolidayVariables()

CreateHolidayVariables() counts up the number of specified holidays between the current record time stamp and the previous record time stamp.

ModelDataPrep()

ModelDataPrep() rapidly converts "inf" values to NA, converts character columns to factor columns, and imputes with specified values for factor and numeric columns.
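
The three operations can be sketched in base R on a toy data frame. The fill values and column names below are hypothetical, and this loop is an illustration of the idea rather than the ModelDataPrep() code.

```r
dat <- data.frame(
  price = c(10, Inf, 30, NA),
  color = c("red", NA, "blue", "red"),
  stringsAsFactors = FALSE
)

num_fill  <- -1         # hypothetical imputation value for numeric columns
char_fill <- "missing"  # hypothetical imputation value for factor columns

for (col in names(dat)) {
  if (is.numeric(dat[[col]])) {
    dat[[col]][is.infinite(dat[[col]])] <- NA   # Inf and -Inf become NA
    dat[[col]][is.na(dat[[col]])] <- num_fill   # then impute numeric NAs
  } else if (is.character(dat[[col]])) {
    dat[[col]][is.na(dat[[col]])] <- char_fill  # impute before converting
    dat[[col]] <- factor(dat[[col]])            # character -> factor
  }
}
```

Imputing the character column before converting it means the fill value becomes a regular factor level instead of an NA.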

DummifyDT()

DummifyDT() rapidly dichotomizes a list of columns in a data table (N+1 columns for N levels using one hot encoding or N columns for N levels otherwise). Several other arguments exist for outputting and saving factor levels for model scoring processes, which are used internally in the AutoXGBoost__() suite of modeling functions.
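
Why saved factor levels matter at scoring time can be shown with a short base R sketch. The `encode()` helper, its `prefix` argument, and the toy data are hypothetical and written for this illustration; they are not part of DummifyDT().

```r
train <- data.frame(color = c("red", "blue", "green", "red"))
score <- data.frame(color = c("blue", "blue"))  # new data containing only one level

# Save the levels seen in training so scoring data is encoded identically
saved_levels <- sort(unique(train$color))

# Hypothetical helper: one indicator column per saved level
encode <- function(x, levels, prefix) {
  f <- factor(x, levels = levels)
  m <- model.matrix(~ f - 1)
  colnames(m) <- paste0(prefix, "_", levels)
  m
}

train_x <- encode(train$color, saved_levels, prefix = "color")
score_x <- encode(score$color, saved_levels, prefix = "color")
```

Even though the scoring data contains only "blue", `score_x` has the same three columns in the same order as `train_x`, which is what a fitted model expects.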

AutoDataPartition()

AutoDataPartition() is designed to achieve a few things that standard data partitioning processes or functions don't handle. First, you can choose to build any number of partitioned data sets beyond the standard train, validate, and test data sets. Second, you can choose between random sampling to split your data or you can choose a time-based partitioning. Third, for the random partitioning, you can specify stratification columns in your data to stratify by in order to ensure a proper split amongst your categorical features (E.g. think MultiClass targets). Lastly, it's 100% data.table so it will run fast and with low memory overhead.

AutoTransformationCreate()

AutoTransformationCreate() is a function for automatically identifying the optimal transformations for numeric features and transforming them once identified. This function will loop through your selected transformation options (YeoJohnson, BoxCox, Asinh, Asin, and Logit) and find the one that produces data that is the closest to normally distributed data. It then makes the transformation and collects the metadata information for use in the AutoTransformationScore() function, either by returning the objects (always) or saving them to file (optional).

AutoTransformationScore()

AutoTransformationScore() is the complement function to AutoTransformationCreate(). It automatically applies or inverts the transformations you identified in AutoTransformationCreate() on other data sets. This is useful for applying transformations to your validation and test data sets for modeling. It's also useful for back-transforming your target and prediction columns after you have built and scored your models so you can obtain statistics on the original features.

GDL_Feature_Engineering()

GDL_Feature_Engineering() builds autoregressive and rolling stats from target columns and distributed lags and distributed rolling stats for independent features distributed across time. On top of that, you can also create time between instances along with their associated lags and rolling stats. This function works for data with groups and without groups. The rolling stats can be of any variety, such as rolling standard deviations, rolling quantiles, etc. but the function runs much slower than the DT_GDL_Feature_Engineering() counterpart so it might not be a good choice for scoring environments that require low latency.

Automated Model Evaluation:


ParDepCalPlots()

ParDepCalPlots() is for visualizing the relationships of features and the reliability of the model in predicting those effects. Build a partial dependence calibration line plot, box plot or bar plot for the case of categorical variables.

ParDepCalPlots Blog

EvalPlot()

EvalPlot() has two plot versions: a calibration line plot of predicted and actual values across the range of predicted values, and a calibration box plot for seeing the accuracy and variability of predictions against actuals.
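
The data behind a calibration line plot can be sketched in base R. This is an assumption-laden illustration, not EvalPlot() itself: the toy data, the choice of 20 buckets, and bucketing by rank of the predicted value are all illustrative.

```r
set.seed(11)
actual    <- rgamma(1000, shape = 2, rate = 0.1)
predicted <- actual + rnorm(1000, sd = 5)  # toy predictions

# Assign each record to one of 20 equal-sized buckets by rank of the prediction
n_buckets <- 20
bucket <- ceiling(rank(predicted) * n_buckets / length(predicted))

# Mean predicted vs. mean actual per bucket: the two lines of the plot
eval_dat <- data.frame(bucket = bucket, predicted = predicted, actual = actual)
calib <- aggregate(cbind(predicted, actual) ~ bucket, data = eval_dat, FUN = mean)
```

Plotting `calib$predicted` and `calib$actual` against `calib$bucket` gives the line version; a well-calibrated model shows the two lines tracking each other across the whole range.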

threshOptim()

threshOptim() is great for situations with asymmetric costs across the confusion matrix. Generate a cost-sensitive optimized threshold for classification models. Just supply the costs for false positives and false negatives (can supply costs for all four outcomes too) and the function will return the optimal threshold for maximizing "utility".
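
The idea of a cost-sensitive threshold search can be sketched in base R. The utilities, the toy scores, and the grid of candidate thresholds are hypothetical; this is not the threshOptim() implementation or its argument list.

```r
set.seed(5)
actual <- rbinom(2000, 1, 0.3)
score  <- plogis(rnorm(2000, mean = ifelse(actual == 1, 1, -1)))  # toy probabilities

# Hypothetical utility per outcome (costs are negative); a false negative
# is assumed to be ten times as costly as a false positive
utility <- c(tp = 0, tn = 0, fp = -1, fn = -10)

# Total utility of classifying at each candidate threshold
thresholds <- seq(0.01, 0.99, by = 0.01)
total_utility <- sapply(thresholds, function(th) {
  pred <- as.integer(score >= th)
  sum(pred == 1 & actual == 1) * utility[["tp"]] +
    sum(pred == 0 & actual == 0) * utility[["tn"]] +
    sum(pred == 1 & actual == 0) * utility[["fp"]] +
    sum(pred == 0 & actual == 1) * utility[["fn"]]
})

# The optimal threshold maximizes total utility
best_threshold <- thresholds[which.max(total_utility)]
```

When false negatives are the more expensive error, the utility-maximizing threshold tends to sit below the conventional 0.5 cutoff, since flagging more cases as positive is cheaper than missing them.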

RedYellowGreen()

RedYellowGreen() computes optimal thresholds for binary classification models where "don't classify" is an option. Consider a health care binary classification model that predicts whether or not a disease is present. This is certainly a case for threshOptim() since the costs of false positives and false negatives can vary by a large margin. However, there is always the potential to run further analysis. The RedYellowGreen() function can compute two thresholds if you can supply a cost of "further analysis". Predicted values below the lower threshold are confidently classified as a negative case and predicted values above the upper threshold are confidently classified as a positive case. Predicted values in between the lower and upper thresholds are cases that should require further analysis.

RedYellowGreen Blog

Utilities, EDA, and Misc. Functions:


AutoWordFreq()

AutoWordFreq() creates a word frequency data.table and a word cloud

AutoH2OTextPrepScoring()

AutoH2OTextPrepScoring() prepares your data for scoring based on models built with AutoWord2VecModeler() and runs internally inside the AutoH2OScoring() function. It cleans and tokenizes your text data.

ProblematicFeatures()

ProblematicFeatures() identifies columns that have either little to no variance, categorical variables with extremely high cardinality, too many NA's, too many zeros, or too high of a skew.

RemixTheme()

RemixTheme() is a specific font, set of colors, and style for plots.

ChartTheme()

ChartTheme() is a specific font, set of colors, and style for plots.

multiplot()

multiplot() is useful for displaying multiple plots in a single pane. I've never had luck using grid so I just use this instead.

tokenizeH2O()

tokenizeH2O() tokenizes an H2O string column.

percRank()

percRank() is an inner function for calibration plots and partial dependence plots. It computes PercentRank for all numeric records in a column.

SimpleCap()

SimpleCap() applies proper case to text.

PrintObjectsSize()

PrintObjectsSize() prints out environment objects and their respective sizes. Useful for debugging programs.

tempDatesFun()

tempDatesFun() is a special case for character conversion to date when importing from Excel.
