AI-JACK open source for R

What is AI-JACK?

We wanted to do our own AI projects faster and with fewer errors. Coding the same things over and over again is also quite tedious and boring, and we felt that the maintenance and development of multiple AI/ML environments needs a coherent solution. We also wanted to create a solution which adapts to several different business problems. These factors led us to develop a "framework" that we call AI-JACK.

The AI-JACK is basically a collection of code, which facilitates robust development of Machine Learning solutions. It integrates data handling, preprocessing, error handling and logging, model training and versioning, model application, and deployment. All of this is handled with just a few lines of code. The modeling is done using the H2O API.

This is the R version of AI-JACK (we also have a Python version, which is likely to be released later). As we have developed this framework using open source code, we have chosen to provide it back to the community. However, this is not the only reason; we also hope the community can help us develop the framework further and make it even better.


Features

AI-JACK provides capabilities for end-to-end development of machine learning projects. The functionality is built into modules (collections of functions) that are used to:

  • take care of data connections (e.g., from local files or a remote SQL server),
  • retrieve data from the source and perform preprocessing,
  • train (and optimise) user-specified models,
  • write execution logs,
  • version trained models,
  • deploy models, e.g., via a predictive API service.

How to use it?

We organize webinars! During these 1-hour meetings we show how to use AI-JACK, why it is so cool, and what the possible use cases for AI are in various businesses. We held several webinars before the summer holidays and are now taking a break. Check Bilot's homepage to stay up-to-date on upcoming webinars!

Installation & setup

In order to work, AI-JACK needs a working installation of the Java Runtime Environment. To check whether Java exists, type the following command into the system terminal:

java -version

If there is no Java installation on your machine, this command should prompt you to install it. If you have an old version, you probably need to update it. Java downloads can be found on the official Java website.

To install the AI-JACK package, all you need is to run the following command in R (making sure that the `devtools` package has been installed):

devtools::install_github(repo = "Bilot/AI-JACK-opensource-R")

Next, a project can be initialised as follows:

library(AIjack)

project_path = "/full/path/to/my/project"

init_aijack(project_path)

Here, project_path should also include the final directory where the project content will be placed. There is no need to create this directory manually, as the init_aijack function will do so automatically.

The init_aijack function also automatically creates a directory structure within project_path as well as .csv files for output tables. A project can be deleted with the delete_project() function.
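For example, assuming delete_project() takes the project path as its argument:

# remove the project directory created above (argument assumed to be the project path)
delete_project(project_path)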

Handling

The control folder contains configuration files that are used for parameterising AI-JACK (config files) and for handling the workflow (main files). For example, the config_model.R file is used to make several specifications regarding data handling, model fitting, and file management. However, to make more detailed adjustments, e.g., to model fitting behaviour, one needs to make changes to the source code.

In contrast, there is typically no need to modify the main_model.R and main_apply.R files, as these only execute either model training or model application workflows, respectively.

The minimum requirement for adjusting the config_model.R file for model training is to:

  • set the project_path variable as the path to the directory used in the init_aijack() function,
  • in set$main, set label as the name of the target column in the data,
  • in set$main, set model_name_part to a name appearing in outputs,
  • in set$main, set id as the name of an ID-column in the data (a column with this name will be created, if missing),
  • in set$main, set test_train_val as the name of a column indicating to which data split (1 = 'train', 2 = 'test', 3 = 'validation') each row belongs (if missing, a column with this name will be created automatically, containing a random data split),
  • in set$main, set labeliscategory to either TRUE/FALSE according to the type of the label column (this is checked in the workflow),
  • in set$model, give a vector in train_models to indicate which models should be trained.
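As an illustration, the relevant settings might look roughly like the following sketch (the column names, model list, and exact assignment syntax here are hypothetical; check them against the config_model.R file shipped with the package):

project_path <- "/full/path/to/my/project"

set$main$label <- "churn"                     # target column
set$main$model_name_part <- "churn"           # name appearing in outputs
set$main$id <- "id"                           # ID column (created if missing)
set$main$test_train_val <- "test_train_val"   # data split column (created if missing)
set$main$labeliscategory <- TRUE              # TRUE for a categorical label
set$model$train_models <- c("glm", "randomForest", "gbm")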

When the parameterisation has been done appropriately, the modelling workflow can be automated by scheduling the execution of the main_model.R script. Similarly, by scheduling the execution of the main_apply.R script, it is possible to automate batch application of a specified model to new data.
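For example, on Mac/Linux a nightly training run could be scheduled with cron (a sketch; adjust the path and schedule to your setup, and use e.g. Task Scheduler on Windows):

# run the training workflow every night at 02:00
0 2 * * * cd /full/path/to/my/project && Rscript control/main_model.R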

One also needs to make sure that the control .R-files are located in the control-folder in the project directory and that the working directory is set to the project directory (this can be set automatically in the workflow, given that the correct path is specified in the settings).

Handling and running when using clustering algorithm

As clustering algorithms are treated slightly differently than supervised ML techniques, there are separate config files designed to work with these methods. There is no need to configure the standard config_model.R and main_model.R files; use the config_clust_model.R and main_clust_model.R files instead.

Adjusting the config_clust_model.R file for model training works almost the same way as for config_model.R:

  • set the project_path variable as the path to the directory used in the init_aijack() function,
  • in set$main, set model_name_part to a name appearing in outputs,
  • in set$main, set id as the name of an ID-column in the data (a column with this name will be created, if missing).
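As before, a minimal sketch (hypothetical values and assignment syntax):

project_path <- "/full/path/to/my/project"

set$main$model_name_part <- "boston"   # name appearing in outputs
set$main$id <- "id"                    # ID column (created if missing)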

Running

After the necessary configurations have been made, a workflow can be executed from the command line as follows. First, make sure that you're located in the project directory (cd /path/to/project). Then simply run the following command:

Mac/Linux

Rscript control/main_model.R 

Again, given that the config_model.R has been modified correctly, this should run the model training workflow.

Windows If you're running Windows, you may need to specify where the Rscript program is located:
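For example (a sketch; the exact path depends on your R version and installation directory):

"C:\Program Files\R\R-4.2.1\bin\Rscript.exe" control\main_model.R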

From R To run the workflow from within R, first set the working directory to the project path:

setwd('/path/to/project')

Then, just source the workflow script:

source('control/main_model.R')

Data

The AI-JACK is primarily intended to be used for ML-project management in production. This means that while there are some pre-processing steps taking place, there is no functionality for data engineering, which is typically needed before modelling. That is, the intention is that the initial data analysis, investigation and engineering (including feature extraction/engineering) has been done prior to using AI-JACK. One clear reason for this is that data engineering is not easily generalised; which manipulations are needed or most useful depends on the data.

If the AI-JACK is run using local files, the source_model directory should contain the source data file in .csv format (by default, ;-separation is assumed). Two columns are also assumed by default: each row needs to have an ID, specified by the id column (this can be changed in the settings), and a column test_train_val, which indicates whether a row is assigned to model training, testing, or validation. If these are missing, they will be added automatically (a dummy ID is created and a random data split is added).
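For illustration, the first rows of a source file could look like this (hypothetical columns, ;-separated):

id;test_train_val;account_length;churn
1;1;128;no
2;2;107;no
3;1;137;yes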

Additionally, a text file (typically either .txt or .csv) is needed, containing a two-column table of variable names (COLUMN_NAME) and their data types (TYPE_NAME); the column names can be specified in the config. The idea here is to make sure the data types will be formatted correctly in R.

If the types-file is written in csv-format, one needs to make sure that the model_name_part parameter string can be fully matched with the file name of the data-file (e.g., model_name_part = "Churn" and file name churn_2020.csv). This is because the relevant files are automatically searched from the "source" directories. The same applies if there are several types-files in the same directory; the correct one is found by matching the model_name_part to the file name. The types-file should have the same column separator as the data file (given by the file_sep parameter in config).

Importantly, the data types should follow SQL convention. If data is taken from an SQL database, the data types are read automatically from the source. Types among "bigint", "identity" and "char" will be cast to character, those among "bit", "varchar", and "nvarchar" will be cast to factor, those among "int", "float", "numeric", and "real" will be cast to numeric, and those among "datetime", "date", and "time" will be cast to character.
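A matching types-file could then look like this (hypothetical column names; the separator must match the file_sep setting):

COLUMN_NAME;TYPE_NAME
id;char
test_train_val;int
account_length;int
churn;varchar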


Examples

There are two example data sets provided for testing purposes: churn.csv and boston.csv.

The churn example is a classical data set from the 90's (The Orange Telecom's Churn Dataset), with 5000 rows of customer records. This data is also available e.g. from the C50 package as well as on Kaggle. Here the aim is to predict whether a customer is at risk of churning, based on the recorded history given by the churn column (classification problem).

The Boston house price data, consisting of ca. 500 rows of indicators of median house price, is available on Kaggle, as well as in the mlbench package. Each record in the data describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. In this case the objective is to predict the level of house prices within different areas in Boston (regression problem).

For each of the datasets, there is also a data types-file available, as well as unlabelled samples for testing model application.


Modelling

The modelling functionality of AI-JACK rests upon the h2o package, which enables running H2O from within R. H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

At present, AI-JACK has capabilities for training either classification or regression models. For classification, the logic has been built with binary problems in mind, but this can be fairly easily modified.


Web service

Given that there are trained models to apply, one can easily expose such a model as an API, using plumber. This requires:

  • a script file plumber_core.R that defines the API logic,
  • a configuration file config_plumber.R,
  • a parameter string for calling the API.

The parameter string consists of three parts:

  • feature values: param <- "param=val1#val2#val3#val4",
  • feature names: param2 <- "param2=nam1#nam2#nam3#nam4",
  • feature data types: "param3=f#n#n#f" (f = factor, n = numeric, etc.).

If the data to query exists in a file, the parameter string can be generated using the parse_params() function:

parse_params(file_path = "path_to_file",
             row = 1, set = set)

The API can be exposed with the following commands:

# Create Plumber router:
r <- plumber::plumb('control/plumber_core.R')

# Expose endpoint:
r$run(host='0.0.0.0', port=8000, swagger=TRUE)

which will expose the API at localhost:8000.

When the endpoint is set up and running, it can be queried from the command line as follows:

curl --data "param=val1#val2#val3#val4&param2=nam1#nam2#nam3#nam4&param3=f#n#n#f" "http://localhost:8000/predict"

The result will be written either to a results table ("output_plumber/predictions.csv") or to an SQL database table, depending on the settings.


Technical details

Data

The data_read() function handles the retrieval of raw data from the specified source. When writing output, either write_db() or write_csv() will be used, depending on the data connection.

Statistics

In the workflow, the prep_results() function (among other operations) generates a standard statistical summary of the data, which is written to a metadata table. In turn, the calculate_stats() function calculates other statistics on the data (currently only correlation is implemented).

Transformations

The following transformation routines are available:

  • Classify numeric features with missing values (trans_classifyNa)
  • Drop constant features (trans_delconstant)
  • Drop equal (redundant) features (trans_delequal)
  • Replace special characters in nominal features and feature names (trans_replaceScandAndSpecial)
  • Discretise continuous features, based on entropy (trans_entropy)

The transformation step is handled by the do_transforms() function, except for trans_entropy, which is called by the entropy_recategorization() function. Recategorised data will be constructed and used in models if the parameter set$model$discretize is set to TRUE. Also, the create_split() function will be called in the workflow if the raw data does not contain a column specifying the data split.

Model algorithms

Currently, the following supervised modelling methods are available:

  • linear models (glm) with h2o.glm,
  • decision tree (decisionTree) with h2o.randomForest (n_trees = 1),
  • random forest (randomForest) with h2o.randomForest,
  • gradient boosting (gbm) with h2o.gbm,
  • extreme gradient boosting (xgboost) with h2o.xgboost (not available on Windows),
  • deep learning (deeplearning) with h2o.deeplearning,
  • autoML (automl) with h2o.automl,
  • time series (timeseries) with h2o.automl.

In order to use the time series feature correctly, your dataset currently needs to have a numeric value column (specified in the label part of the configuration file) and a date column in dd.mm.yyyy, dd/mm/yy, dd/mm/yyyy or a similar format; the condition is that the dmy function from the lubridate package can read it. Other format options will be added soon.
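For example, the following date strings can all be read by lubridate's dmy function and would therefore be accepted:

library(lubridate)

dmy("01.02.2020")   # -> "2020-02-01"
dmy("01/02/20")     # -> "2020-02-01"
dmy("01/02/2020")   # -> "2020-02-01"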

In addition, deep learning can also be run in unsupervised form, by using it as an autoencoder. Three clustering methods are also currently available:

  • k-means with the kmeans function from the stats package,
  • expectation-maximization (EM) with the Mclust package,
  • k-medoids (PAM) with the pam function from the cluster package.

In the clustering case, the user doesn't have to choose a technique, as all three are run in parallel and compared in terms of average silhouette width. There are also functions available to visualize the clustering results.

The create_models() function handles hyperparameter optimisation (training with train-split and validating with test-split) as well as re-fitting the best model (on both the train- and test-split, except for deep learning).
