# Tutorial 2: Machine Learning

## How to use this tutorial

  * Select "Run All Cells" on this notebook from the Run menu in Jupyter Notebook or
    JupyterLab. This step will produce intermediate data output and charts.
  * Some cells print a URL, which you can click to bring up an interactive web UI that
    visualizes the graph data.
  * In the unlikely event that the notebook becomes unresponsive, you can try "Restart
    Kernel" from the Kernel menu, then run individual cells one by one using `Shift+Enter`.
  * Some tutorials use local clusters consisting of multiple processes to mimic the effects
    of graph distribution over a remote cluster. By default, these local clusters
    automatically stop after idling for 15 minutes to conserve CPU and memory resources. You
    will need to rerun the entire notebook if your local cluster has stopped due to inactivity.
  * Additional resources (video demos & blogs) are available at http://juliustech.co.
  * To report any issues, get help or request features, please raise an issue at
    https://github.com/JuliusTechCo/JuliusGraph/issues.

## 0. Introduction

This tutorial shows how to use Julius Graph Engine to set up the training and validation of
a machine learning model. We will compare several different ML models to predict (or postdict)
the survival of Titanic passengers using the classic Titanic dataset.

The Julius distribution includes a `DataScience` package, which contains a rich set of
functionalities for data sourcing, cleansing, and machine learning. In this tutorial,
we will show how to use the `DataScience` package to quickly build a transparent and
sophisticated ML pipeline. This tutorial broadly follows the steps a data scientist takes
when building a new ML model.

## 1. Data Processing

### 1.1 Data Sourcing & Visualization

A data scientist usually starts a project by exploring and visualizing data from various
sources. Julius provides a rich set of connectors to multiple data sources and formats,
such as CSV, web URLs, relational databases and various NoSQL databases. Julius also offers
many data visualization tools in its interactive web UI.

We start by including the necessary Julia and Julius packages and setting up some basic
configurations.

In [None]:
# Julia packages
using Base.CoreLogging
using DataFrames, Statistics

# Julius Packages
using GraphEngine: RuleDSL, GraphVM
using DataScience, AtomExt, GraphIO

# turn off informational logging output
disable_logging(CoreLogging.Info)

# extend the number of displayed columns in Jupyter notebooks
ENV["COLUMNS"] = 100;

# the project is used for web UI display
config = RuleDSL.Config(:project => "Titanic");

The dataset can be loaded from a URL or a local CSV file via rules defined in the `ds`
namespace, which is provided in the `DataScience` package. The line commented out is a rule
to load the same data from a URL.

In [None]:
rawsrc = RuleDSL.@ref ds.csvsrc("../data/titanic.csv", true; label="raw csv");
# rawsrc = RuleDSL.@ref ds.urlsrc("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv", true; label="raw url")

The very first thing a data scientist does is often to get a summary of the dataset.
The following cell shows how it can be done using the `ds.datasummary` rule in the
`DataScience` package.

In [None]:
rawsummary = RuleDSL.@ref ds.datasummary(rawsrc; label="data summary")

gs1 = GraphVM.createlocalgraph(config, RuleDSL.GenericData());
GraphVM.calcfwd!(gs1, Set([rawsummary]));

The data summary results can be retrieved using the `getdata` method. The data cached in
each graph node is always a vector. The optional last argument (here `1`) is the index of
the element to select from the node's data vector. Without it, the entire vector is
returned.

In [None]:
RuleDSL.getdata(gs1, rawsummary, 1)

### 1.2 Data Cleansing & Imputation

We observe that some columns in the raw data set have `missing` values, so data imputation
and cleansing is the next step for the data scientist. Julius' `DataScience` library
provides common data imputation methods, which can be invoked using the `ds.fillmissing`
rule with the desired imputation method for each field that has missing values. Here we use
the median `Age` of all passengers for any missing `Age`, and the mode value (which is true)
for any missing `Embarked`.

After data imputation, we recompute the data summary, which shows that all the `missing`
values for both the `Age` and `Embarked` features have been populated.

In [None]:
cleansrc = RuleDSL.@ref ds.fillmissing(
    rawsrc, Dict(:Age => :median, :Embarked => :mode); label="imputation"
);

cleansummary = RuleDSL.@ref ds.datasummary(cleansrc; label="clean summary")
GraphVM.calcfwd!(gs1, Set([cleansummary]))
RuleDSL.getdata(gs1, cleansummary, 1)
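
To make the two imputation methods concrete, here is a minimal sketch in plain Julia of what median and mode filling do to a column. It does not use the `DataScience` package. The helper names `modevalue` and `fillwith` and the toy values are assumptions made for illustration, and the actual implementation behind `ds.fillmissing` may differ.

```julia
using Statistics

# Hypothetical helper: most frequent non-missing value (first occurrence wins ties).
function modevalue(xs)
    vals = collect(skipmissing(xs))
    counts = [count(==(v), vals) for v in vals]
    return vals[argmax(counts)]
end

# Hypothetical helper: replace every `missing` entry in `xs` with `value`.
fillwith(xs, value) = [ismissing(x) ? value : x for x in xs]

# Toy columns for illustration only, not taken from the Titanic data set.
age = [22.0, missing, 38.0, 26.0, missing, 35.0]
embarked = ["S", "C", missing, "S", "Q", "S"]

age_filled = fillwith(age, median(skipmissing(age)))        # median imputation
embarked_filled = fillwith(embarked, modevalue(embarked))   # mode imputation
```

Note that the median and mode are computed over the non-missing entries only, so the filled values do not depend on how many entries were missing.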

### 1.3 Feature Engineering

Once the data scientist is happy with the results of data cleansing and imputation, the next
step is feature engineering, which is to add or remove columns from the data set.

In the Titanic data set, we want to drop the columns that should have no correlation
with a passenger's survival outcome, such as a passenger's ticket, name and ID. Including
irrelevant data in the training of an ML model may degrade its performance. The
`Cabin` column also has to be dropped because it has too many missing values to be useful.

We also create two additional features: 1) the z-score of the ticket fare, which is the
difference between a passenger's ticket price and the mean price, in units of the standard
deviation of the ticket prices; 2) the total number of relatives onboard for a given
passenger, which is the sum of the number of siblings/spouses (`:SibSp`) and
parents/children (`:Parch`) onboard.

Feature engineering is supported generically by the rule `ds.coltransform` in the
`DataScience` package. The following cell shows its usage. The new features are entered
as formulae operating on the columns, which are referred to by names starting with `:`.

In [None]:
newfeatures = quote
    :Zfare = (:Fare .- mean(:Fare)) ./ std(:Fare)
    :Relatives = :SibSp .+ :Parch
end

dropfeatures = [:Cabin, :Ticket, :PassengerId, :Name]
features = RuleDSL.@ref ds.coltransform(cleansrc, :feature, newfeatures, dropfeatures; label="feature eng")

featuresummary = RuleDSL.@ref ds.datasummary(features; label="feature summary")
GraphVM.calcfwd!(gs1, Set([featuresummary]));

The data summary results after feature engineering are:

In [None]:
RuleDSL.getdata(gs1, featuresummary, 1)
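
As a quick check on the two formulae, the following sketch evaluates them on ordinary Julia vectors instead of data set columns. The values are toy numbers chosen for illustration, and how `ds.coltransform` maps the `:Name` symbols to columns internally is not shown here.

```julia
using Statistics

# Toy columns for illustration only.
fare  = [7.25, 71.28, 7.92, 53.10, 8.05]
sibsp = [1, 1, 0, 1, 0]
parch = [0, 0, 0, 2, 0]

# Same formulae as in the `newfeatures` block, with vectors in place of columns.
zfare = (fare .- mean(fare)) ./ std(fare)   # z-score: zero mean, unit standard deviation
relatives = sibsp .+ parch                  # element-wise sum of the two counts
```

By construction, `zfare` has mean zero and standard deviation one, which puts the fare on a scale that does not depend on its currency unit.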

All the data processing steps we have performed so far can be visualized interactively
in Julius' convenient web UI by clicking the link below. All the intermediate data are
accessible from the web UI.

In [None]:
# start data server for web UI
gss = Dict{String,RuleDSL.AbstractGraphState}()
port = GraphVM.drawdataport()
@async GraphVM.startresponder(gss, port)

svg = GraphIO.postlocalgraph(gss, gs1, port, true; key="data");
display("image/svg+xml", svg)

## 2. Experiment with multiple ML models

Once the data scientist is happy with the results of data cleansing, imputation and
feature engineering, the next step is often to try multiple ML models and see how
they perform on the data set.

Julius Graph Engine can interoperate with existing Python, Java, C++ and R libraries via
the generic `Atom` interface, making it seamless to access the rich set of ML models in
these ecosystems.

For example, the following rules leverage Python ML libraries, such as `sklearn` and
`xgboost`, by using the `PyTrain` atom provided in the `DataScience` package. The first
parameter of the `PyTrain` atom is the full name of the Python ML class to use. The second
parameter is a dictionary with the corresponding parameters, options or arguments of that ML
class.

```julia
@addrules ds begin
    classifiertrain(model::Val{:SVC}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.svm.SVC", options](traindat...)
    classifiertrain(model::Val{:DecisionTree}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.tree.DecisionTreeClassifier", options](traindat...)
    classifiertrain(model::Val{:RandomForest}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.ensemble.RandomForestClassifier", options](traindat...)
    classifiertrain(model::Val{:AdaBoost}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.ensemble.AdaBoostClassifier", options](traindat...)
    classifiertrain(model::Val{:MLPC}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.neural_network.MLPClassifier", options](traindat...)
    classifiertrain(model::Val{:GaussianNB}, options::Dict, traindat::NodeRef) = PyTrain["sklearn.naive_bayes.GaussianNB", options](traindat...)
    classifiertrain(model::Val{:XGBoost}, options::Dict, traindat::NodeRef) = PyTrain["xgboost.XGBClassifier", options](traindat...)
    classifiertrain(model::Symbol, options::Dict, traindat::NodeRef; label = "$model-train") = Alias(classifiertrain(val(model), options, traindat))
end
```
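
The rules above select a Python class by dispatching on a `Val{:Name}` type, with the last rule accepting a plain `Symbol`. The same pattern can be sketched with ordinary Julia functions. The function `classname` is a hypothetical stand-in for illustration and is not part of the `DataScience` package. It uses Julia's built-in `Val` constructor to lift a `Symbol` into the type domain.

```julia
# One method per model name; the name is carried in the type parameter of `Val`.
classname(::Val{:SVC}) = "sklearn.svm.SVC"
classname(::Val{:RandomForest}) = "sklearn.ensemble.RandomForestClassifier"
classname(::Val{:XGBoost}) = "xgboost.XGBClassifier"

# Entry point taking a plain Symbol: wrap it in `Val` and let dispatch pick the method.
classname(model::Symbol) = classname(Val(model))

classname(:RandomForest)
```

With this design, supporting a new model only requires adding one more method. Calling the entry point with a name that has no matching method raises a `MethodError`.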

We now proceed to train multiple ML models and compare their in-sample and out-of-sample
performance using metrics such as Gini. The ML models are trained to predict the survival
probability of Titanic passengers. We first define the list of models we want to compare and
their hyperparameters.

In [None]:
models = [
    :DecisionTree => Dict(:min_samples_leaf => 0.1),
    :LogisticRegression => Dict(:solver => "saga", :max_iter => 200),
    :AdaBoost => Dict(),
    :XGBoost => Dict(),
    :GradientBoost => Dict(:min_samples_leaf => 0.1),
    :RandomForest => Dict(:min_samples_leaf => 0.1),
    :GaussianNB => Dict(),
];

The target variable name for ML prediction is given below, which is the survival outcome
of passengers.

In [None]:
yname = :Survived;

To divide the input dataset for training and validation, we use the `randrowsel` rule
from the `DataScience` package, which randomly selects a portion of the input data as the
validation set and uses the rest for training. The parameter `1/3` is the fraction of
rows that are reserved for validation.

In [None]:
valind = RuleDSL.@ref ds.randrowsel(cleansrc, 1 / 3);
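
The idea of a random validation split can be sketched in plain Julia as follows. The function `validationmask` is a hypothetical helper for illustration. The source does not show the actual output format or sampling scheme of `ds.randrowsel`, so this is only one plausible way to reserve a fraction of rows.

```julia
using Random

# Hypothetical helper: Bool mask with round(fraction * nrows) random rows set to true.
function validationmask(rng::AbstractRNG, nrows::Integer, fraction::Real)
    nval = round(Int, fraction * nrows)
    mask = falses(nrows)
    mask[randperm(rng, nrows)[1:nval]] .= true
    return mask
end

rng = MersenneTwister(42)               # fixed seed so the split is reproducible
isval = validationmask(rng, 9, 1 / 3)

rows = collect(1:9)
valrows = rows[isval]                   # rows held out for validation
trainrows = rows[.!isval]               # remaining rows used for training
```

Every row lands in exactly one of the two sets, so the validation metrics are computed on data the model has not seen during training.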

`DataScience.ClassifierSpec` is a generic `struct` that holds all the configurations for
training and validating binary classifiers, such as those we have defined so far. It is more
convenient and readable to pass a single `DataScience.ClassifierSpec` object to a rule than
to pass five separate parameters. The `DataScience.ClassifierSpec` can be used for any
binary classification problem. The last parameter of the `DataScience.ClassifierSpec`
constructor is a tuple representing the feature engineering.

In [None]:
cspec = DataScience.ClassifierSpec(models, cleansrc, yname, valind, (:feature, newfeatures, dropfeatures));

Now we can use the `ds.classifiermetrics` rule, which is also part of `DataScience`,
to compute in-sample and out-of-sample metrics for each model. This rule depends on the
`ds.classifiertrain` rules defined above for accessing the Python ML models.

In [None]:
metrics = [:gini, :roc, :accuracyrate, :accuracygraph]
basem = RuleDSL.@ref ds.classifiermetrics(cspec, metrics)
gs2 = GraphVM.createlocalgraph(config, RuleDSL.GenericData())
@time GraphVM.calcfwd!(gs2, Set([basem]));

We can retrieve the in-sample and out-of-sample performance metrics, for example, the Gini values:

In [None]:
giniref = RuleDSL.@ref ds.classifiermetric(cspec, :gini)
gini = GraphVM.getdata(gs2, hash(giniref), 1)
ginidf = DataFrame(model=gini[:InSample][!, :Model], InSample_GINI=gini[:InSample][!, 2], OutSample_GINI=gini[:OutSample][!, 2])
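
For readers unfamiliar with the metric, the sketch below computes a Gini value under the common definition Gini = 2 × AUC − 1, where AUC is the area under the ROC curve. Whether the `:gini` metric in `DataScience` uses exactly this definition is an assumption, and the labels and scores are toy values.

```julia
# AUC as the fraction of (positive, negative) pairs ranked correctly; ties count one half.
function auc(scores::AbstractVector{<:Real}, labels::AbstractVector{Bool})
    pos = scores[labels]
    neg = scores[.!labels]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos, n in neg)
    return wins / (length(pos) * length(neg))
end

# Gini rescales AUC so that random ranking gives 0 and perfect ranking gives 1.
gini(scores, labels) = 2 * auc(scores, labels) - 1

labels = [true, false, true, false, false]   # toy survival outcomes
scores = [0.9, 0.2, 0.4, 0.6, 0.1]           # toy predicted survival probabilities

gini(scores, labels)
```

In this toy example, five of the six positive/negative pairs are ordered correctly, so the AUC is 5/6 and the Gini is 2/3.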

The entire data and logic can be visualized by clicking on the URL below.

In [None]:
svg = GraphIO.postlocalgraph(gss, gs2, port; key="ml");
display("image/svg+xml", svg)

The entire ML pipeline includes all the steps we have defined so far: data sourcing,
imputation, feature engineering, training of multiple ML models, and the computation and
reporting of performance metrics. A data scientist only needs to invoke a few rules defined
in the `DataScience` package to construct this realistic ML pipeline. The total number of
nodes in the graph is 83, as shown below.

In [None]:
dg = GraphVM.mygraph(gs2)
println(length(dg._items))

## 3. Hyperparameter Tuning

Once a data scientist has narrowed down the choice of ML models to a few, the next step is
to select the optimal hyperparameters for these candidate ML models.

Julius Graph Engine provides a generic rule `hypertune` for hyperparameter tuning
of any ML model. This shows the power of high-level rules, where a single
`hypertune` rule can perform hyperparameter tuning for any ML model.

For example, for a given machine learning model, we can select a range for a set of
hyperparameters and easily perform a grid search and report the corresponding
metric results:

In [None]:
ht_1 = RuleDSL.@ref ds.hypertune(cspec, :XGBoost,       Dict(), :gini, :n_estimators => 50:50:200, :learning_rate    => .05:.05:.2);
ht_2 = RuleDSL.@ref ds.hypertune(cspec, :AdaBoost,      Dict(), :gini, :n_estimators => 50:50:200, :learning_rate    => .05:.05:.2);
ht_3 = RuleDSL.@ref ds.hypertune(cspec, :GradientBoost, Dict(), :gini, :n_estimators => 50:50:200, :min_samples_leaf => .05:.05:.2);
ht_4 = RuleDSL.@ref ds.hypertune(cspec, :RandomForest,  Dict(), :gini, :n_estimators => 50:50:200, :min_samples_leaf => .05:.05:.2);
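
To see how many candidate settings such a grid search covers, the following plain Julia sketch enumerates the grid for the two ranges used with `:XGBoost`. How `ds.hypertune` traverses the grid internally is not shown in this tutorial, so this only illustrates the size and shape of the search space.

```julia
n_estimators = 50:50:200
learning_rate = 0.05:0.05:0.2

# Cartesian product of the two search ranges: one tuple per grid point.
grid = vec(collect(Iterators.product(n_estimators, learning_rate)))

length(grid)   # 4 values per range give 4 * 4 = 16 candidate settings
```

Each extra search dimension multiplies the number of grid points, and therefore the number of models to train, by the length of its range.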

Additional search dimensions can be added to the `ds.hypertune` rule by appending additional
`hyperparameter => searchgrid` pairs to the end of the rule's parameter list. We can then
wrap all the hyperparameter searches in a single node for convenience by means of the
`alias` rule, which uses the `Alias` atom:

In [None]:
tunings = RuleDSL.@ref ds.alias([ht_1, ht_2, ht_3, ht_4]; label="Hyperparameter Tuning")

We now proceed with the computation of all the defined hyperparameter tunings:

In [None]:
gs3 = GraphVM.createlocalgraph(config, RuleDSL.GenericData());
@time GraphVM.calcfwd!(gs3, Set([tunings]));

The following cell shows the resulting in-sample and out-of-sample Gini for the different
hyperparameters of GradientBoost:

In [None]:
dat = GraphVM.getdata(gs3, hash(ht_3))
df = deepcopy(dat[1][:, 1:2])
df[!, :InSampleGINI] = dat[1][!, 3]
df[!, :OutSampleGINI] = dat[2][!, 3]
df

A data scientist has to exercise sound judgement in selecting the optimal
hyperparameter set, which may have to balance multiple objectives. The parameter set with
the maximum out-sample gini may not be the best choice. Often, it is better to choose the
parameter set with similar in-sample and out-of-sample gini to minimize the chance of
overfitting.
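
One simple way to encode this trade-off is sketched below: shortlist the parameter sets whose out-of-sample Gini is close to the best, then pick the one with the smallest gap between in-sample and out-of-sample Gini. This heuristic, the helper `pickstable`, the tolerance `tol` and all the numbers are assumptions made purely for illustration. They are not results from this tutorial and not part of the `DataScience` package.

```julia
# Made-up candidate results for illustration only (not output from this notebook).
candidates = [
    (n_estimators = 50,  min_samples_leaf = 0.05, insample = 0.80, outsample = 0.70),
    (n_estimators = 100, min_samples_leaf = 0.10, insample = 0.72, outsample = 0.69),
    (n_estimators = 200, min_samples_leaf = 0.05, insample = 0.90, outsample = 0.71),
]

# Gap between in-sample and out-of-sample Gini; a large gap hints at overfitting.
gap(c) = abs(c.insample - c.outsample)

# Hypothetical heuristic: among candidates within `tol` of the best out-of-sample
# Gini, return the one with the smallest in-sample/out-of-sample gap.
function pickstable(cands; tol = 0.03)
    best = maximum(c.outsample for c in cands)
    shortlist = filter(c -> best - c.outsample <= tol, cands)
    return shortlist[argmin(map(gap, shortlist))]
end

chosen = pickstable(candidates)
```

In this made-up example, the candidate with the highest out-of-sample Gini also has by far the largest gap, so the heuristic prefers a slightly lower-scoring but more stable setting.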

The details of the hyperparameter search can be visualized by clicking the URL below.

In [None]:
svg = GraphIO.postlocalgraph(gss, gs3, port; key="hyper");
display("image/svg+xml", svg)

## 4. Conclusions

It only takes a few lines of code in Julius to build a sophisticated ML pipeline by
leveraging the existing rules and atoms provided by the `DataScience` package. Even though
the Titanic data set is quite small, the ML pipeline built in this tutorial is quite
representative: it has all the essential elements of a real-world ML pipeline, such as data
cleansing, imputation, feature engineering, model performance monitoring and hyperparameter
tuning.

The ML pipeline built by Julius offers full transparency, allowing data scientists to
easily visualize and explore data at every intermediate step, all from Julius' web UI.
Julius also offers full data lineage and explainability: a data scientist can easily
query and trace how a piece of data is sourced, modified and used throughout the entire ML
pipeline, as all intermediate results are automatically cached by Julius Graph Engine.

In the next tutorial, "distributed ML pipeline", we will show how to deal with very large
data sets that do not fit into memory.

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*