Learn how to use the library in general or across multiple case-study data science courses!

Install the package in R via:

# The install command may change with future versions; for now, paste one of the following into your R session:
devtools::install_github('AdrianAntico/RemixAutoML', upgrade = FALSE)
# or, to force a reinstall along with all dependencies:
devtools::install_github('AdrianAntico/RemixAutoML', force = TRUE, dependencies = TRUE, upgrade = FALSE)

RemixAutoML

This is a collection of functions that I have made to speed up machine learning and to ensure that high-quality modeling output is generated. They are great at establishing solid baselines that are extremely challenging to beat using alternative methods. To see them in action, check out our free tutorials at RemyxCourses.com.

Also, be sure to visit our blog at RemixInstitute.ai for data science, machine learning, and AI content.

You can contact me via LinkedIn with any questions about the package, and the vignettes folder has more detail. If you want to be a contributor, contact me via LinkedIn email.

Supervised Learning Functions:

AutoH2OModeler()

Automated machine learning. Automatically builds any number of models and generates partial dependence calibration plots, model evaluation calibration plots, grid tuning, and file storage for easy production implementation. Handles regression, quantile regression, time-until-event, and classification models (binary and multinomial) using numeric and factor variables, without the need for monotonic transformations or one-hot encoding.

  • Models include:
    • RandomForest (DRF)
    • GBM
    • Deeplearning
    • XGBoost (for Linux)
    • LightGBM (for Linux)
    • AutoML - medium depth grid tuning for Deeplearning, XGBoost (if available), DRF, GBM, GLM, and StackedEnsembles

AutoH2OScoring()

Scoring models that were built with the AutoH2OModeler, AutoKMeans, and AutoWord2VecModeler functions. Scores models either via mojo or the standard method by loading models into the H2O environment and scoring them. You can choose which output you wish to keep as well.

AutoTS()

Automated time series modeling function. Automatically finds the best model fit from the suite of models below, using optimized Box-Cox transformations and testing both the user-supplied and the model-based time series frequency, and generates forecasts and evaluation metrics.

  • Models include:
    • DSHW: Double Seasonal Holt Winters
    • ARFIMA: Auto Regressive Fractionally Integrated Moving Average
    • ARIMA: Stepwise Auto Regressive Integrated Moving Average with specified max lags, seasonal lags, moving averages, and seasonal moving averages
    • ETS: Additive and Multiplicative Exponential Smoothing and Holt Winters
    • NNetar: Auto Regressive Neural Network models; automatically compares models with 1 lag or 1 seasonal lag against models with up to N lags and N seasonal lags
    • TBATS: Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components
    • TSLM: Time Series Linear Model - builds a linear model with trend and season components extracted from the data

AutoNLS()

Automated nonlinear regression modeling. Automatically finds the best model fit from the suite of models below and merges predictions to source data file. Great for forecasting growth over time or estimating single variable nonlinear functions.

  • Models included:
    • Asymptotic
    • Asymptotic through origin
    • Asymptotic with offset
    • Bi-exponential
    • Four parameter logistic
    • Three parameter logistic
    • Gompertz
    • Michaelis-Menten
    • Weibull
    • Polynomial regression or monotonic regression
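
Several of the model names above correspond to self-starting models that ship with base R's stats package (for example SSlogis, SSfpl, and SSgompertz). The sketch below illustrates the general "fit several candidates, keep the best" idea on a built-in data set. It is not AutoNLS's implementation: the helper `fit_candidates`, the choice of in-sample RMSE as the selection metric, and the `Predicted` column name are all assumptions made for illustration.

```r
# Illustrative only: fit a few self-starting nls models and keep the one
# with the lowest in-sample RMSE. Not the package's implementation.
d1  <- subset(DNase, Run == 1)   # built-in data set from base R
dat <- data.frame(x = log(d1$conc), density = d1$density)

candidates <- list(
  logistic3 = density ~ SSlogis(x, Asym, xmid, scal),
  logistic4 = density ~ SSfpl(x, A, B, xmid, scal),
  gompertz  = density ~ SSgompertz(x, Asym, b2, b3)
)

# Hypothetical helper: returns only the models that fit successfully
fit_candidates <- function(formulas, data) {
  fits <- lapply(formulas, function(f) {
    # Some models may fail to converge; treat those as "no fit"
    tryCatch(nls(f, data = data), error = function(e) NULL)
  })
  Filter(Negate(is.null), fits)
}

fits <- fit_candidates(candidates, dat)
rmse <- sapply(fits, function(m) sqrt(mean(residuals(m)^2)))
best <- names(which.min(rmse))

# Merge the winning model's predictions back onto the source data
dat$Predicted <- predict(fits[[best]], newdata = dat)
```

Wrapping each fit in `tryCatch` matters in practice, because nonlinear fits frequently fail to converge on data that does not match the model's shape.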

AutoRecommender()

Automated collaborative filtering modeling in which the following models compete against one another:

  • RandomItems
  • PopularItems
  • UserBasedCF
  • ItemBasedCF
  • AssociationRules

AutoRecommenderScoring()

Automatically score a recommender model from AutoRecommender

Unsupervised Learning Functions:

GenTSAnomVars()

Generate time series anomaly variables. (Cross with Feature Engineering) Create indicator variables (high, low) along with cumulative anomaly rates (high, low) based on control limits methodology over a max of two grouping variables and a date variable (effectively a rolling GLM).
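
To make the control limits idea concrete, here is a minimal base R sketch for a single series with no grouping variables. It assumes limits of mean plus or minus two standard deviations computed over the whole series; the function's actual limits, its grouping logic, and its output column names may differ, and every name below is hypothetical.

```r
# Illustrative only: high/low anomaly flags and cumulative anomaly rates
# from simple control limits. Not the package's implementation.
x <- c(10, 11, 9, 10, 12, 10, 9, 11, 10, 30,
       10, 11, 9, 10, 12, 10, 9, 11, 10, -10)

k      <- 2                      # assumed width of the limits, in standard deviations
center <- mean(x)
spread <- sd(x)

anom <- data.frame(
  value = x,
  high  = as.integer(x > center + k * spread),
  low   = as.integer(x < center - k * spread)
)

# Cumulative anomaly rates: share of points flagged so far
anom$cum_high_rate <- cumsum(anom$high) / seq_along(x)
anom$cum_low_rate  <- cumsum(anom$low)  / seq_along(x)
```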

ResidualOutliers()

Residual outliers from time series modeling. (Cross with Feature Engineering) Utilize tsoutliers to indicate outliers within a time series data set

AutoKMeans()

Generalized low rank model followed by KMeans. (Possible cross with Feature Engineering) Generates a column with a cluster identifier based on an optionally grid-tuned generalized low rank model and a grid-tuned K-Means algorithm that searches for the optimal K.

Feature Engineering Functions:

FAST_GDL_Feature_Engineering()

Fast generalized distributed lag feature engineering. Rapidly generates time-between-events, autoregressive, and rolling (moving average / standard deviation / min / max / quantile 85 / quantile 95) features for when you only need these features to predict events at the latest time interval of the data set. 100% data.table except for rolling statistics.

GDL_Feature_Engineering()

Generate a wider set of features (similar in structure to FAST_GDL) using any aggregation statistic for the rolling stats. 100% data.table except for rolling statistics.

Scoring_GDL_Feature_Engineering()

Generate the model features from FAST_GDL or GDL for scoring purposes when the scoring data is for forward looking predictions (not historical, which can be obtained from FAST_GDL or GDL). 100% data.table.

DT_GDL_Feature_Engineering()

Lags + Moving Averages, 100% data.table
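
For readers new to these feature types, the sketch below builds lag and trailing moving-average columns in base R. It is only an illustration of the features themselves: `lag_vec` and `roll_mean` are hypothetical helpers, not package functions. In data.table, `shift()` and `frollmean()` play the same roles.

```r
# Illustrative only: lag and trailing moving-average features in base R.
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

roll_mean <- function(x, n) {
  # Trailing window: mean of the current value and the n - 1 values before it
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

sales <- c(10, 12, 11, 15, 14, 18)
features <- data.frame(
  sales = sales,
  lag1  = lag_vec(sales, 1),
  lag2  = lag_vec(sales, 2),
  ma3   = roll_mean(sales, 3)
)
```

The first rows of each derived column are NA because there is not yet enough history to fill the lag or the window.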

AutoWord2VecModeler()

Generate a specified number of vectors for each column of text data in your data set and save the models for re-creating them later in the scoring process.

ModelDataPrep()

Rapidly convert "inf" values to NA, convert character columns to factor columns, and impute with specified values for factor and numeric columns (H2O requires factors; character values are not accepted).
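
The three steps described above can be sketched in base R as follows. This is a simplified stand-in, not ModelDataPrep() itself: `prep_sketch` and its arguments `num_fill` and `factor_fill` are hypothetical names chosen for this example.

```r
# Illustrative only: Inf -> NA, character -> factor, then imputation.
prep_sketch <- function(df, num_fill = -1, factor_fill = "0") {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.character(x) || is.factor(x)) {
      x <- as.character(x)
      x[is.na(x)] <- factor_fill   # impute missing categories
      x <- factor(x)               # characters become factors
    } else if (is.numeric(x)) {
      x[is.infinite(x)] <- NA      # Inf and -Inf become NA first
      x[is.na(x)] <- num_fill      # then every NA is imputed
    }
    df[[col]] <- x
  }
  df
}

raw <- data.frame(
  amount = c(1.5, Inf, NA, 4),
  color  = c("red", NA, "blue", "red"),
  stringsAsFactors = FALSE
)
clean <- prep_sketch(raw)
```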

DummifyDT()

Rapidly dichotomize a list of columns in a data table (N+1 columns for N levels using one hot encoding or N columns for N levels otherwise)
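
As a rough picture of what dichotomizing a column means, the base R sketch below expands one categorical column into one 0/1 indicator column per level. `one_hot_sketch` is a hypothetical helper written for this example, and its column naming (`<column>_<level>`) is an assumption rather than DummifyDT()'s actual output format.

```r
# Illustrative only: expand one categorical column into 0/1 indicators.
one_hot_sketch <- function(df, col) {
  f <- factor(df[[col]])
  for (lev in levels(f)) {
    df[[paste(col, lev, sep = "_")]] <- as.integer(f == lev)
  }
  df[[col]] <- NULL   # drop the original column
  df
}

df <- data.frame(
  id    = 1:4,
  color = c("red", "blue", "red", "green"),
  stringsAsFactors = FALSE
)
encoded <- one_hot_sketch(df, "color")
```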

Model Evaluation, Interpretation, and Cost-Sensitive Functions:

ParDepCalPlots()

Great for estimating feature effects and the reliability of the model in predicting those effects. Builds a partial dependence calibration plot on train, test, or all data.

EvalPlot()

Great for assessing accuracy across range of predicted values. Build a calibration plot on test data
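
The numbers behind a calibration plot can be computed in a few lines of base R. The sketch below uses simulated data and assumes decile buckets of the predicted value; EvalPlot()'s own bucketing and arguments may differ, and all variable names here are made up for the example.

```r
# Illustrative only: the table behind a calibration plot.
set.seed(42)
n <- 1000
predicted <- runif(n)
actual    <- rbinom(n, 1, predicted)   # simulated so the scores are well calibrated

# Bucket predictions into deciles and compare averages within each bucket
breaks <- quantile(predicted, probs = seq(0, 1, by = 0.1))
bucket <- cut(predicted, breaks = breaks, include.lowest = TRUE, labels = FALSE)

calib <- aggregate(
  cbind(predicted, actual) ~ bucket,
  data = data.frame(bucket, predicted, actual),
  FUN  = mean
)
```

Plotting `calib$actual` against `calib$predicted` with a 45-degree reference line gives the calibration plot: buckets that fall far from the line mark ranges of predicted values where the model is unreliable.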

threshOptim()

Great for situations with asymmetric costs across the confusion matrix. Generate a cost-sensitive optimized threshold for classification models
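
To show why asymmetric costs move the threshold, here is a simplified base R sketch that searches a grid of thresholds for the one with the lowest total misclassification cost. It only prices false positives and false negatives; `best_threshold` and its arguments are hypothetical and are not threshOptim()'s interface.

```r
# Illustrative only: pick the threshold that minimizes total cost.
best_threshold <- function(actual, score, cost_fp, cost_fn,
                           grid = seq(0.01, 0.99, by = 0.01)) {
  total_cost <- sapply(grid, function(t) {
    pred <- as.integer(score >= t)
    fp <- sum(pred == 1 & actual == 0)   # false positives at this threshold
    fn <- sum(pred == 0 & actual == 1)   # false negatives at this threshold
    fp * cost_fp + fn * cost_fn
  })
  grid[which.min(total_cost)]
}

set.seed(1)
score  <- runif(2000)
actual <- rbinom(2000, 1, score)

t_equal     <- best_threshold(actual, score, cost_fp = 1, cost_fn = 1)
t_fn_costly <- best_threshold(actual, score, cost_fp = 1, cost_fn = 10)
```

When a false negative costs more than a false positive, the selected threshold drops, so more cases are classified as positive.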

RedYellowGreen()

Computes optimal thresholds for binary classification models when "don't classify" is an option

Utilities and Misc. Functions:

AutoH2OTextPrepScoring()

Prepares your data for scoring based on models built with AutoWord2VecModeler

RecomDataCreate()

Turns your transactional data into a binary ratings matrix
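
Conceptually, the transformation looks like the base R sketch below, which turns one row per purchase into a customers-by-products matrix of 0/1 flags. The column names are invented for the example, and a plain integer matrix is used here for simplicity; the object RecomDataCreate() actually returns may be a different class.

```r
# Illustrative only: transactions -> binary customer-by-product matrix.
tx <- data.frame(
  customer = c("a", "a", "b", "c", "c", "c"),
  product  = c("x", "y", "x", "y", "z", "z"),
  stringsAsFactors = FALSE
)

counts <- table(tx$customer, tx$product)   # purchase counts per pair

# Any purchase at all becomes 1; repeat purchases are not counted twice
ratings <- matrix(
  as.integer(counts > 0),
  nrow     = nrow(counts),
  dimnames = dimnames(counts)
)
```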

tokenizeH2O()

Tokenize an H2O string column.

tempDatesFun()

Special case for character conversion to date when importing from Excel.

RemixTheme()

Fonts, colors, style for plots.

ChartTheme()

Fonts, colors, style for plots.

SimpleCap()

Apply proper case to text.

percRank()

Inner function for calibration plots and partial dependence plots. Computes PercentRank.
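
There are several conventions for percent rank. The sketch below shows one common definition, (rank - 1) / (n - 1) with ties sharing the lowest rank, purely as an illustration; it is an assumption that may not match the formula percRank() uses, and `perc_rank_sketch` is a hypothetical name.

```r
# Illustrative only: one common percent-rank definition, scaled to [0, 1].
perc_rank_sketch <- function(x) {
  (rank(x, ties.method = "min") - 1) / (length(x) - 1)
}

pr <- perc_rank_sketch(c(10, 20, 20, 40))
```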

multiplot()

Useful for displaying multiple plots in a single pane.

PrintObjectsSize()

Prints the objects in the environment along with their sizes.

AutoWordFreq()

Creates a word frequency data.table and a word cloud.
