We wanted to deliver our own AI projects faster and with fewer errors. Besides, coding the same things over and over again is tedious and boring. We also felt that the maintenance and development of multiple AI/ML environments needs a coherent solution, and we wanted that solution to bend to several different business problems. These factors led us to develop a "framework" that we call AI-JACK.
AI-JACK is basically a collection of code that facilitates robust development of machine learning solutions. It integrates data handling, preprocessing, error handling and logging, model training and versioning, model application, and deployment. All of this is handled with just a few lines of code. The modelling is done using the H2O API.
This is the R version of AI-JACK (we also have a Python version, which is likely to be released later). As we have developed this framework using open source code, we have chosen to provide it back to the community. However, this is not the only reason; we also hope the community can help us develop the framework further and make it even better.
AI-JACK provides capabilities for end-to-end development of machine learning projects. The functionality is built into modules (collections of functions) that are used to:
- take care of data connections (e.g., from local files or a remote SQL server),
- retrieve data from the source and preprocess it,
- train (and optimise) user-specified models,
- write execution logs,
- version trained models,
- deploy models, e.g., via a predictive API service.
We organise webinars! In these one-hour sessions we show how to use AI-JACK, why it is so useful, and what the possible use cases for AI are in various businesses. We held several webinars before the summer holidays and are now on a break. Check Bilot's homepage to stay up to date on upcoming webinars!
In order to work, AI-JACK needs a working installation of the Java Runtime Environment. To check whether Java exists, type the following command in the system terminal:
java -version
If there is no Java installation on your machine, this command should prompt installation. If you have an old version, you probably need to update it. Java installations can be found here.
To install the AI-JACK package, all you need is to run the following command in R (making sure that the `devtools` package has been installed):
devtools::install_github(repo = "Bilot/AI-JACK-opensource-R")
Next, one is able to initiate a project as follows:
library(AIjack)
project_path = "/full/path/to/my/project"
init_aijack(project_path)
Here the `project_path` should also contain the final directory where the project content will be placed. There is no need to create this directory manually, as the `init_aijack` function will do this automatically. The `init_aijack` function also automatically creates a directory structure within `project_path`, as well as `.csv` files for output tables. A project can be deleted with the `delete_project()` function.
The `control` folder is intended to contain configuration files that are used for parameterising AI-JACK (`config` files) and handling workflow (`main` files). For example, the `config_model.R` file is used to make several specifications regarding data handling, model fitting, and file management. However, to make more detailed adjustments, e.g., to model-fitting behaviour, one needs to make changes to the source code. In contrast, there is typically no need to modify the `main_model.R` and `main_apply.R` files, as these only execute the model training and model application workflows, respectively.
The minimum requirement for adjusting the `config_model.R` file for model training is to:
- set the `project_path` variable to the path of the directory used in the `init_aijack()` function,
- in `set$main`, set `label` to the name of the target column in the data,
- in `set$main`, set `model_name_part` to a name appearing in outputs,
- in `set$main`, set `id` to the name of an ID column in the data (a column with this name will be created, if missing),
- in `set$main`, set `test_train_val` to the name of a column indicating to which data split (either 1 = 'train', 2 = 'test', 3 = 'validation') each row belongs (if missing, a column with this name will be created automatically, containing a data split),
- in `set$main`, set `labeliscategory` to either `TRUE`/`FALSE` according to the type of the label column (this is checked in the workflow),
- in `set$model`, give a vector in `train_models` to indicate which models should be trained.
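As an illustration, these settings might look like the following sketch. The column names `churn` and `cust_id` are hypothetical examples (not package defaults), and the actual `config_model.R` contains further settings beyond those shown here:

```r
# Sketch of a minimal config_model.R parameterisation (illustrative values):
project_path <- "/full/path/to/my/project"

set <- list()
set$main <- list(
  label = "churn",                    # hypothetical target column name
  model_name_part = "Churn",          # name appearing in output files
  id = "cust_id",                     # hypothetical ID column (created if missing)
  test_train_val = "test_train_val",  # data-split column (created if missing)
  labeliscategory = TRUE              # target is categorical (classification)
)
set$model <- list(
  train_models = c("glm", "randomForest", "gbm")  # models to train
)
```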
When the parameterisation has been done appropriately, the modelling workflow can be automated by scheduling the execution of the `main_model.R` script. Similarly, by scheduling the execution of the `main_apply.R` script, it is possible to automate batch application of a specified model to new data.
One also needs to make sure that the control `.R` files are located in the `control` folder in the project directory and that the working directory is set to the project directory (this can be set automatically in the workflow, given that the correct path is specified in the settings).
As clustering algorithms are treated slightly differently from supervised ML techniques, there are separate `config` files designed to work with these methods. There is no need to configure the standard `config_model.R` and `main_model.R` files; you have to use the `config_clust_model.R` and `main_clust_model.R` files instead.
Adjusting the `config_clust_model.R` file for model training works almost the same way as for `config_model.R`:
- set the `project_path` variable to the path of the directory used in the `init_aijack()` function,
- in `set$main`, set `model_name_part` to a name appearing in outputs,
- in `set$main`, set `id` to the name of an ID column in the data (a column with this name will be created, if missing).
After the necessary configurations have been made, a workflow can be executed from the command line as follows. First, make sure that you're located in the project directory (`cd /path/to/project`). Then simply run the following command:
Mac/Linux
Rscript control/main_model.R
Again, given that `config_model.R` has been modified correctly, this should run the model training workflow.
Windows
If you're running Windows, you may need to specify where the `Rscript` program is located:
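For example, using the default Windows installation layout (the exact path depends on your R version and installation location, so treat this as an illustrative sketch):

```
"C:\Program Files\R\R-4.0.2\bin\Rscript.exe" control\main_model.R
```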
From R
To run the workflow from within R, first set the working directory to the project path:
setwd('/path/to/project')
Then, just source the workflow script:
source('control/main_model.R')
AI-JACK is primarily intended to be used for ML-project management in production. This means that while some pre-processing steps take place, there is no functionality for the data engineering that is typically needed before modelling. That is, the intention is that the initial data analysis, investigation, and engineering (including feature extraction/engineering) have been done prior to using AI-JACK. One clear reason for this is that data engineering is not easily generalised; which manipulations are needed or most useful depends on the data.
If AI-JACK is run using local files, the `source_model` directory should contain the source data file in `.csv` format (by default, `;` separation is assumed). Two columns are also assumed by default: each row needs to have an ID, specified by the `id` column (this can be changed in the settings), and a column `test_train_val`, which indicates whether a row is assigned to model training, testing, or validation. If these are missing, they will be added automatically (a dummy ID is created and a random data split is added).
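As an illustration, a minimal source file might look like this (the feature and target columns are hypothetical; the `id` and `test_train_val` columns would be added automatically if absent):

```
id;feature_1;feature_2;churn;test_train_val
1;0.5;A;yes;1
2;1.2;B;no;1
3;0.7;A;no;2
4;2.1;C;yes;3
```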
Additionally, a text file (typically either `.txt` or `.csv`) is needed, containing a two-column table of variable names (`COLUMN_NAME`) and their data types (`TYPE_NAME`); the column names can be specified in the config. The idea here is to make sure the data types will be formatted correctly in R.
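For illustration, such a types file might look like this (the variable names are hypothetical; the types follow the SQL convention described below):

```
COLUMN_NAME;TYPE_NAME
id;bigint
feature_1;float
feature_2;varchar
churn;varchar
```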
If the types file is written in CSV format, one needs to make sure that the `model_name_part` parameter string can be fully matched against the file name of the data file (e.g., `model_name_part = "Churn"` and file name `churn_2020.csv`). This is because the relevant files are automatically searched for in the "source" directories. The same applies if there are several types files in the same directory; the correct one is found by matching `model_name_part` to the file name. The types file should have the same column separator as the data file (given by the `file_sep` parameter in the config).
Importantly, the data types should follow SQL convention. If data is taken from an SQL database, data types are read automatically from the source. Types among `"bigint"`, `"identity"`, and `"char"` will be cast to `character`; those among `"bit"`, `"varchar"`, and `"nvarchar"` will be cast to `factor`; those among `"int"`, `"float"`, `"numeric"`, and `"real"` will be cast to `numeric`; and those among `"datetime"`, `"date"`, and `"time"` will be cast to `character`.
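This mapping could be sketched in R as follows (a simplified illustration of the casting logic described above, not the package's actual implementation):

```r
# Simplified sketch of the SQL-type -> R-type mapping:
sql_to_r_type <- function(sql_type) {
  if (sql_type %in% c("bigint", "identity", "char")) {
    "character"
  } else if (sql_type %in% c("bit", "varchar", "nvarchar")) {
    "factor"
  } else if (sql_type %in% c("int", "float", "numeric", "real")) {
    "numeric"
  } else if (sql_type %in% c("datetime", "date", "time")) {
    "character"
  } else {
    NA_character_  # unknown type: leave undetermined
  }
}

sql_to_r_type("varchar")  # "factor"
```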
There are two example data sets provided for testing purposes: `churn.csv` and `boston.csv`.
The churn example is a classic data set from the 90's (the Orange Telecom Churn Dataset), with 5000 rows of customer records. This data is also available, e.g., from the `C50` package as well as on Kaggle. Here the aim is to predict whether a customer is at risk of churning, based on the recorded history given by the `churn` column (classification problem).
The Boston house price data, consisting of ca. 500 rows of indicators of median house price, is available on Kaggle, as well as in the `mlbench` package. Each record in the data describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. In this case the objective is to predict the level of house prices within different areas of Boston (regression problem).
For each of the data sets, there is also a data types file available, as well as unlabelled samples for testing model application.
The modelling functionality of AI-JACK rests upon package h2o
, enabling running H2O from within R. H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.
At present, AI-JACK has capabilities for training either classification or regression models. For classification, the logic has been built with binary problems in mind, but this can be fairly easily modified.
Given that there are trained models to apply, one can easily expose such a model as an API, using `plumber`. This requires:
- a script file `plumber_core.R` that defines the API logic,
- a configuration file `config_plumber.R`,
- a parameter string for calling the API.
The parameter string consists of three parts:
- feature values: `param <- "param=val1#val2#val3#val4"`,
- feature names: `param2 <- "param2=nam1#nam2#nam3#nam4"`,
- feature data types: `param3 <- "param3=f#n#n#f"` (f = factor, n = numeric, etc.).
If the data to query exists in a file, the parameter string can be generated using the `parse_params()` function:
parse_params(file_path = "path_to_file",
             row = 1, set = set)
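Alternatively, such a parameter string can be assembled by hand, as in this sketch (the feature names and values are invented for illustration):

```r
# Build the three-part parameter string by hand:
vals  <- c("0.5", "A", "B", "1.2")  # hypothetical feature values
nams  <- c("f1", "f2", "f3", "f4")  # hypothetical feature names
types <- c("n", "f", "f", "n")      # n = numeric, f = factor

param  <- paste0("param=",  paste(vals,  collapse = "#"))
param2 <- paste0("param2=", paste(nams,  collapse = "#"))
param3 <- paste0("param3=", paste(types, collapse = "#"))

# param  is now "param=0.5#A#B#1.2"
# param2 is now "param2=f1#f2#f3#f4"
# param3 is now "param3=n#f#f#n"
```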
The API can be exposed with the following commands:
# Create Plumber router:
r <- plumber::plumb('control/plumber_core.R')
# Expose endpoint:
r$run(host='0.0.0.0', port=8000, swagger=TRUE)
which will open the API at `localhost:8000`.
When the endpoint is set up and running, it can be queried from the command line as follows:
curl --data "param=val1#val2#val3#val4&param2=nam1#nam2#nam3#nam4&param3=f#n#n#f" "http://localhost:8000/predict"
The result will be written either to a results table (`output_plumber/predictions.csv`) or to an SQL database table, depending on the settings.
The `data_read()` function handles the retrieval of raw data from the specified source. When writing output, either `write_db()` or `write_csv()` will be used, depending on the data connection.
In the workflow, the `prep_results()` function (among other operations) generates a standard statistical summary of the data, which is written to a `metadata` table. In turn, the `calculate_stats()` function calculates other statistics on the data (currently only correlation is implemented).
The following transformation routines are available:
- classify numeric features with missing values (`trans_classifyNa`),
- drop constant features (`trans_delconstant`),
- drop equal (redundant) features (`trans_delequal`),
- replace special characters in nominal features and feature names (`trans_replaceScandAndSpecial`),
- discretise continuous features, based on entropy (`trans_entropy`).
The transformation step is handled by the `do_transforms()` function, except for `trans_entropy`, which is called by the `entropy_recategorization()` function. Recategorised data will be constructed and used in models if the parameter `set$model$discretize` is set to `TRUE`. Also, the `create_split()` function will be called in the workflow if the raw data does not contain a column specifying the data split.
Currently, the following supervised modelling methods are available:
- linear models (`glm`) with `h2o.glm`,
- decision tree (`decisionTree`) with `h2o.randomForest` (`n_trees = 1`),
- random forest (`randomForest`) with `h2o.randomForest`,
- gradient boosting (`gbm`) with `h2o.gbm`,
- extreme gradient boosting (`xgboost`) with `h2o.xgboost` (not available on Windows),
- deep learning (`deeplearning`) with `h2o.deeplearning`,
- autoML (`automl`) with `h2o.automl`,
- time series (`timeseries`) with `h2o.automl`.
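For example, to train only a linear model and a gradient boosting model, the `train_models` vector in the config could be set as follows (a sketch, using the identifiers listed above):

```r
# Select which of the supported model types to train:
set$model$train_models <- c("glm", "gbm")
```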
In order to use the time series feature correctly, your data set currently needs to have a numeric value column (specified by `label` in the configuration file) and a date column in dd.mm.yyyy, dd/mm/yy, dd/mm/yyyy, or a similar format; the condition is that the `dmy` function from the `lubridate` package can read it. Other format options will be added soon.
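To check whether a given date format will work, one can test it directly with `lubridate::dmy`, as in this quick sketch:

```r
library(lubridate)

# These day-month-year formats parse successfully with dmy():
dmy("01.02.2020")  # parses to the Date 2020-02-01
dmy("01/02/20")
dmy("1/2/2020")

# A format dmy() cannot read returns NA (with a warning),
# so such a date column would not work with the time series feature:
dmy("2020-02-01")  # NA
```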
In addition, deep learning can also be run in unsupervised form, by using it as an `autoencoder`.
Also, three clustering methods are currently available:
- k-means with the `kmeans` function from the `stats` package,
- expectation-maximization (EM) with the `Mclust` package,
- k-medoids (PAM) with the `pam` function from the `cluster` package.
In the clustering case, the user doesn't have to choose a technique, as all three are run in parallel and compared in terms of average silhouette width. There are also functions available to visualise the clustering results.
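The comparison criterion can be illustrated with a small sketch using base R and the `cluster` package; this shows what average silhouette width measures, not the package's internal code:

```r
library(cluster)  # for pam() and silhouette()

# A small numeric data set for illustration:
X <- scale(iris[, 1:4])
d <- dist(X)

# Fit k-means and k-medoids with k = 3:
km <- kmeans(X, centers = 3, nstart = 10)
pm <- pam(X, k = 3)

# Average silhouette width for each solution (higher is better);
# the solution with the larger value would be preferred:
mean(silhouette(km$cluster, d)[, "sil_width"])
mean(silhouette(pm$clustering, d)[, "sil_width"])
```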
The `create_models()` function handles hyperparameter optimisation (training with the `train` split and validating with the `test` split) as well as re-fitting the best model (on both the `train` and `test` splits, except for deep learning).