Install the package in R by pasting one of the following into your R session, depending on the development state (future versions, etc.):
devtools::install_github('AdrianAntico/RemixAutoML', upgrade = FALSE)
or
devtools::install_github('AdrianAntico/RemixAutoML', force = TRUE, dependencies = TRUE, upgrade = FALSE)
This is a collection of functions that I have made to speed up machine learning and to ensure that high-quality modeling output is generated. They are great at establishing solid baselines that are extremely challenging to beat using alternative methods. To see them in action, check out our free tutorials at RemyxCourses.com.
Also, be sure to visit our blog at RemixInstitute.ai for data science, machine learning, and AI content.
You can contact me via LinkedIn with any questions about the package, and the vignettes folder has more detail. If you want to be a contributor, contact me via LinkedIn or email.
Supervised Learning Functions:
Automated machine learning. Automatically builds any number of models and generates partial dependence calibration plots and model evaluation calibration plots, with grid tuning and file storage for easy production implementation. Handles regression, quantile regression, time-until-event, and classification models (binary and multinomial) using numeric and factor variables, without the need for monotonic transformations or one-hot encoding.
- Models include:
  - RandomForest (DRF)
  - XGBoost (for Linux)
  - LightGBM (for Linux)
  - AutoML - medium depth grid tuning for Deeplearning, XGBoost (if available), DRF, GBM, GLM, and StackedEnsembles
Scores models built with the AutoH2OModeler, AutoKMeans, and AutoWord2VecModeler functions, either via MOJO or via the standard method of loading the models into the H2O environment and scoring them. You can also choose which output to keep.
Automated time series modeling function. Automatically finds the best model fit from the suite of models below (using optimized Box-Cox transformations and testing both the user-supplied and the model-based time series frequency), and generates forecasts and evaluation metrics.
- Models include:
  - DSHW: Double Seasonal Holt Winters
  - ARFIMA: Auto Regressive Fractionally Integrated Moving Average
  - ARIMA: Stepwise Auto Regressive Integrated Moving Average with specified max lags, seasonal lags, moving averages, and seasonal moving averages
  - ETS: Additive and Multiplicative Exponential Smoothing and Holt Winters
  - NNetar: Auto Regressive Neural Network models; automatically compares models with 1 lag or 1 seasonal lag against models with up to N lags and N seasonal lags
  - TBATS: Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components
  - TSLM: Time Series Linear Model - builds a linear model with trend and season components extracted from the data
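The "fit several candidates and keep the best" idea above can be sketched with base R alone. This is only an illustration under stated assumptions, not the package's implementation: it swaps in `stats::HoltWinters`, `stats::arima`, and a seasonal naive benchmark for the model suite listed above, holds out the last 12 months of the built-in `AirPassengers` series, and ranks candidates by holdout MAPE. The candidate set, the holdout length, and the metric are all assumptions.

```r
# Hold out the last 12 months of a built-in monthly series
y     <- AirPassengers
train <- window(y, end = c(1959, 12))
test  <- window(y, start = c(1960, 1))
h     <- length(test)

# Candidate forecasters (illustrative stand-ins, not the package's suite).
# Each takes a training series and a horizon and returns h point forecasts.
candidates <- list(
  holt_winters = function(x, h) {
    fit <- HoltWinters(x, seasonal = "multiplicative")
    as.numeric(predict(fit, n.ahead = h))
  },
  seasonal_arima = function(x, h) {
    # Log transform (Box-Cox with lambda = 0), then back-transform
    fit <- arima(log(x), order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12))
    exp(as.numeric(predict(fit, n.ahead = h)$pred))
  },
  seasonal_naive = function(x, h) {
    # Repeat the last observed season
    as.numeric(tail(x, 12))[(seq_len(h) - 1) %% 12 + 1]
  }
)

# Mean absolute percentage error on the holdout window
mape <- function(actual, pred) mean(abs((actual - pred) / actual))

forecasts <- lapply(candidates, function(f) f(train, h))
scores    <- vapply(forecasts, function(p) mape(as.numeric(test), p), numeric(1))

# Keep the candidate with the lowest holdout error
best_model <- names(which.min(scores))
print(round(scores, 4))
print(best_model)
```

A single holdout window keeps the sketch short; a rolling-origin evaluation would give a more stable ranking.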
Automated nonlinear regression modeling. Automatically finds the best model fit from the suite of models below and merges predictions to source data file. Great for forecasting growth over time or estimating single variable nonlinear functions.
- Models included:
  - Asymptotic through origin
  - Asymptotic with offset
  - Four parameter logistic
  - Three parameter logistic
  - Michaelis-Menten
  - Polynomial regression or monotonic regression
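Most of the model families listed above have self-starting counterparts in base R's `stats` package (`SSasympOrig`, `SSasympOff`, `SSfpl`, `SSlogis`, `SSmicmen`). The sketch below is an assumption about how such a competition could look, not the package's own code: it fits each candidate with `nls` on the built-in `Puromycin` data, skips any that fail to converge, picks the lowest in-sample RMSE, and merges the predictions back onto the source data.

```r
# Built-in enzyme kinetics data: reaction rate vs. substrate concentration
treated <- subset(Puromycin, state == "treated")

# Candidate self-starting nonlinear models from the stats package
candidates <- list(
  asymp_origin     = rate ~ SSasympOrig(conc, Asym, lrc),
  asymp_offset     = rate ~ SSasympOff(conc, Asym, lrc, c0),
  logistic_4p      = rate ~ SSfpl(conc, A, B, xmid, scal),
  logistic_3p      = rate ~ SSlogis(conc, Asym, xmid, scal),
  michaelis_menten = rate ~ SSmicmen(conc, Vm, K)
)

# Fit every candidate; a model that fails to converge is dropped
fits <- lapply(candidates, function(f) {
  tryCatch(nls(f, data = treated), error = function(e) NULL)
})
fits <- Filter(Negate(is.null), fits)

# Rank the surviving fits by in-sample RMSE
rmse <- vapply(fits, function(m) sqrt(mean(residuals(m)^2)), numeric(1))
best_name <- names(which.min(rmse))

# Merge predictions from the winning model back onto the source data
treated$Predicted <- as.numeric(predict(fits[[best_name]], newdata = treated))
print(round(rmse, 3))
print(best_name)
```

In-sample RMSE favors models with more parameters, so a holdout set or an information criterion such as `AIC` may be a fairer way to choose.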
Automated collaborative filtering modeling in which the candidate models compete against one another.
Automatically scores a recommender model built with AutoRecommender.
Unsupervised Learning Functions:
Generate time series anomaly variables. (Cross with Feature Engineering) Create indicator variables (high, low) along with cumulative anomaly rates (high, low) based on control limits methodology over a max of two grouping variables and a date variable (effectively a rolling GLM).
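A minimal base R sketch of the control-limits idea follows. It is an assumption-laden illustration rather than the package's method: limits are fixed at the group mean plus or minus three standard deviations (the source describes something closer to a rolling fit), there is a single grouping variable, and the column names `AnomHigh`, `AnomLow`, `CumAnomHighRate`, and `CumAnomLowRate` are made up for the example.

```r
set.seed(42)

# Two groups of 50 daily observations, already ordered by date within group
df <- data.frame(
  group = rep(c("A", "B"), each = 50),
  date  = rep(seq(as.Date("2020-01-01"), by = "day", length.out = 50), times = 2),
  value = rnorm(100)
)
df$value[10] <- 8    # inject a high anomaly into group A
df$value[75] <- -8   # inject a low anomaly into group B

# Standardize within group, then compare against +/- 3 sigma control limits
z <- ave(df$value, df$group, FUN = function(v) (v - mean(v)) / sd(v))
df$AnomHigh <- as.numeric(z > 3)
df$AnomLow  <- as.numeric(z < -3)

# Cumulative anomaly rates within each group, in date order
cum_rate <- function(flag) cumsum(flag) / seq_along(flag)
df$CumAnomHighRate <- ave(df$AnomHigh, df$group, FUN = cum_rate)
df$CumAnomLowRate  <- ave(df$AnomLow,  df$group, FUN = cum_rate)

print(df[df$AnomHigh == 1 | df$AnomLow == 1, ])
```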
Residual outliers from time series modeling. (Cross with Feature Engineering) Utilize tsoutliers to flag outliers within a time series data set.
Generalized low rank model followed by KMeans. (Possible cross with Feature Engineering) Generate a column with a cluster identifier based on a grid-tuned (optional) generalized low rank model and a grid-tuned (optimal) K-Optimal searching K-Means algorithm.
Feature Engineering Functions:
Fast generalized distributed lag feature engineering. Rapidly generate time-between-events, autoregressive, and rolling moving average / standard deviation / min / max / quantile 85 / quantile 95 features when you only need them for predicting events at the latest time interval of the data set. 100% data.table except for rolling statistics.
Generate a wider set of features (similar in structure to FAST_GDL) using any aggregation statistic for the rolling stats. 100% data.table except for rolling statistics.
Generate the model features from FAST_GDL or GDL for scoring purposes when the scoring data is for forward looking predictions (not historical, which can be obtained from FAST_GDL or GDL). 100% data.table.
Lags + Moving Averages, 100% data.table
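To make the lag and moving-average features concrete, here is a small dependency-free sketch. It deliberately uses base R (`ave` and `stats::filter`) rather than data.table, so it shows the shape of the output only and says nothing about how the package computes it. The helper names `lag_by_group` and `roll_mean_by_group` are hypothetical, and whether a rolling window should include the current row is an assumption (here it does).

```r
# Toy panel: two ids, five consecutive days each
df <- data.frame(
  id    = rep(c("A", "B"), each = 5),
  date  = rep(seq(as.Date("2021-01-01"), by = "day", length.out = 5), times = 2),
  value = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)
df <- df[order(df$id, df$date), ]

# k-period lag within each group (hypothetical helper)
lag_by_group <- function(x, g, k) {
  ave(x, g, FUN = function(v) c(rep(NA_real_, k), v)[seq_along(v)])
}

# Trailing w-period moving average within each group (hypothetical helper);
# the window covers the current row and the w - 1 rows before it
roll_mean_by_group <- function(x, g, w) {
  ave(x, g, FUN = function(v) {
    as.numeric(stats::filter(v, rep(1 / w, w), sides = 1))
  })
}

df$value_lag1 <- lag_by_group(df$value, df$id, 1)
df$value_ma3  <- roll_mean_by_group(df$value, df$id, 3)
print(df)
```

Rows without enough history come back as `NA`, which is why an imputation step usually follows this kind of feature generation.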
Generate a specified number of vectors for each column of text data in your data set and save the models for re-creating them later in the scoring process.
Rapidly convert "inf" values to NA, convert character columns to factor columns, and impute with specified values for factor and numeric columns (H2O requires factors rather than character values).
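The three cleaning steps described above can be sketched in base R as follows. The function name `clean_for_modeling` and the default fill values (`-1` for numerics, `"0"` for factors) are assumptions made for illustration, and the sketch works on a plain data.frame rather than a data.table.

```r
# Toy data with infinite values, missing values, and a character column
df <- data.frame(
  x = c(1, Inf, NA, -Inf),
  s = c("a", NA, "b", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Inf -> NA, character -> factor, then impute
clean_for_modeling <- function(data, num_fill = -1, fac_fill = "0") {
  for (col in names(data)) {
    v <- data[[col]]
    if (is.numeric(v)) {
      v[is.infinite(v)] <- NA       # treat +/- Inf as missing
      v[is.na(v)] <- num_fill       # impute numeric columns
    } else if (is.character(v) || is.factor(v)) {
      v <- as.character(v)
      v[is.na(v)] <- fac_fill       # impute before building the factor
      v <- factor(v)                # character -> factor
    }
    data[[col]] <- v
  }
  data
}

cleaned <- clean_for_modeling(df)
str(cleaned)
```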
Rapidly dichotomize a list of columns in a data.table (N+1 columns for N levels using one-hot encoding, or N columns for N levels otherwise).
Model Evaluation, Interpretation, and Cost-Sensitive Functions:
Great for estimating feature effects and for assessing how reliably the model predicts those effects. Builds a partial dependence calibration plot on train, test, or all data.
Great for assessing accuracy across the range of predicted values. Builds a calibration plot on test data.
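As a rough illustration of what a calibration plot summarizes, the sketch below bins simulated predictions into deciles and compares the mean prediction with the mean outcome in each bin. The data are simulated to be well calibrated by construction, and the decile binning is an assumption; the package's own plotting function may bin and display things differently.

```r
set.seed(1)

# Simulated test set: outcomes drawn with probability equal to the prediction
n      <- 1000
pred   <- runif(n)
actual <- rbinom(n, 1, pred)

# Bin predictions into deciles
bins <- cut(pred,
            breaks = quantile(pred, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)

# Mean predicted vs. mean observed outcome per bin
calib <- aggregate(cbind(pred, actual) ~ bins,
                   data = data.frame(pred, actual, bins),
                   FUN = mean)
print(calib)

# Points near the dashed 45-degree line indicate good calibration
plot(calib$pred, calib$actual,
     xlab = "Mean predicted", ylab = "Mean actual",
     xlim = c(0, 1), ylim = c(0, 1), pch = 19)
abline(0, 1, lty = 2)
```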
Great for situations with asymmetric costs across the confusion matrix. Generate a cost-sensitive optimized threshold for classification models
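One simple way to picture a cost-sensitive threshold is a grid search over cutoffs that minimizes total misclassification cost. The sketch below assumes simulated scores and made-up costs (a false negative costing five times a false positive, correct classifications costing nothing); it is not the package's optimizer.

```r
set.seed(7)

# Simulated labels and scores; positives tend to score higher
n      <- 2000
actual <- rbinom(n, 1, 0.3)
score  <- plogis(rnorm(n, mean = ifelse(actual == 1, 1, -1)))

# Assumed asymmetric costs
cost_fp <- 1
cost_fn <- 5

# Total cost of classifying as positive whenever score >= threshold
thresholds <- seq(0.01, 0.99, by = 0.01)
total_cost <- vapply(thresholds, function(t) {
  pred <- as.integer(score >= t)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  fp * cost_fp + fn * cost_fn
}, numeric(1))

best_threshold <- thresholds[which.min(total_cost)]
print(best_threshold)
print(min(total_cost))
```

Choosing the threshold on the same data used to evaluate it is optimistic; a separate validation set is the safer choice.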
Computes optimal thresholds for binary classification models when "don't classify" is an option
Utilities and Misc. Functions:
Prepares your data for scoring based on models built with Word2VecModel
Turns your transactional data into a binary ratings matrix
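The transformation from transactions to a binary ratings matrix can be illustrated with base R's `table`. The column names `customer` and `product` are made up, and the result here is a plain 0/1 integer matrix, not the class any particular recommender package expects.

```r
# Toy transaction log: one row per purchase, repeats allowed
transactions <- data.frame(
  customer = c("c1", "c1", "c2", "c3", "c3", "c3"),
  product  = c("apple", "bread", "apple", "bread", "milk", "milk")
)

# Count purchases per customer/product pair, then collapse counts to 0/1
counts  <- table(transactions$customer, transactions$product)
ratings <- matrix(as.integer(counts > 0),
                  nrow = nrow(counts),
                  dimnames = dimnames(counts))
print(ratings)
```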
Tokenize an H2O string column.
Special case for character conversion to date when importing from Excel.
Fonts, colors, style for plots.
Fonts, colors, style for plots.
Apply proper case to text.
Inner function for calibration plots and partial dependence plots. Computes PercentRank.
Useful for displaying multiple plots in a single pane.
Prints the objects in the environment along with their sizes.
Creates a word frequency data.table and a word cloud.
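The word frequency half of that description can be sketched in a few lines of base R. This assumes a very simple tokenizer (lowercase, split on anything that is not a letter) and returns a data.frame rather than a data.table; the word cloud is left out because it needs an additional plotting package.

```r
text <- c("The quick brown fox", "the lazy dog", "The fox jumps")

# Lowercase, split on non-letters, and drop empty tokens
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[nzchar(words)]

# Frequency table, most common words first (ties broken alphabetically)
freq <- as.data.frame(table(word = words), stringsAsFactors = FALSE)
freq <- freq[order(-freq$Freq, freq$word), ]
rownames(freq) <- NULL
print(freq)
```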