The DML (Declarative Machine Learning) language provides built-in functions that give access to both low- and high-level operations, supporting a wide range of use cases.
A built-in is implemented either at the compiler level or as a DML script that is loaded at compile time.
Built-In Construction Functions
These functions construct non-primitive objects such as matrices, tensors, and lists.
tensor-Function
The tensor-function creates a tensor filled with the given data.
Usage
```
tensor(data, dims, byRow=TRUE)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| data | Matrix[?], Tensor[?], Scalar[?] | required | The data with which the tensor should be filled. See data-Argument. |

Note that this function is highly unstable; it will be reworked and might change signature and functionality.

Returns

| Type | Description |
| ---- | ----------- |
| Tensor[?] | The generated tensor. Will support more datatypes than Double. |
data-Argument
The data-argument can be a Matrix of any datatype; its elements are taken and placed in the tensor until it is filled. If given a Tensor, the same procedure applies. Iteration through a Matrix or Tensor starts with every dimension index at 0; the lowest dimension index is incremented until a complete pass over that dimension is made, after which the next-higher dimension index is increased. This continues until the tensor is completely filled.
If data is a Scalar, the whole tensor is filled with that value.
dims-Argument
The dimensions of the tensor can be given as a vector represented by a Matrix, Tensor, String, or List. Dimensions given as a String are expected to be separated by spaces.
Example
print("Dimension matrix:");
d=matrix("2 3 4", 1, 3);
print(toString(d, decimal=1))
print("Tensor A: Fillvalue=3, dims=2 3 4");
A= tensor(3, d); # fill with value, dimensions given by matrix
print(toString(A))
print("Tensor B: Reshape A, dims=4 2 3");
B= tensor(A, "4 2 3"); # reshape tensor, dimensions given by string
print(toString(B))
print("Tensor C: Reshape dimension matrix, dims=1 3");
C= tensor(d, list(1, 3)); # values given by matrix, dimensions given by list
print(toString(C, decimal=1))
print("Tensor D: Values=tst, dims=Tensor C");
D= tensor("tst", C); # fill with string, dimensions given by tensor
print(toString(D))
Note that reshape construction is not yet supported for SPARK execution.
DML-Bodied Built-In Functions
DML-bodied built-in functions are written as DML-Scripts and executed as such when called.
confusionMatrix-Function
The confusionMatrix-function accepts a vector of predictions and a one-hot-encoded matrix of labels. It computes the max value of each vector and compares them, then calculates and returns the sums of classifications and the average of each true class.
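Example
A minimal sketch of a call, assuming the predictions P are given as a class-id vector, the ground-truth labels are one-hot encoded (here via table), and the two outputs are the per-class sums and averages:
```
P = round(rand(rows=100, cols=1, min=1, max=5))  # predicted class ids
y = round(rand(rows=100, cols=1, min=1, max=5))  # true class ids
Y = table(seq(1, nrow(y)), y)                    # one-hot encode the true labels
[sums, avgs] = confusionMatrix(P=P, Y=Y)
print(toString(sums))
```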
correctTypos-Function
The correctTypos-function tries to correct typos in a given frame. The algorithm operates on the assumption that most strings are correct, and it replaces strings that occur rarely with similar strings that occur more often. If correct is set to FALSE, only suggested corrections are printed without affecting the frame.
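Example
A hedged sketch, assuming the input frame is read from a CSV file; the file name and the argument name strings are illustrative assumptions:
```
F = read("names.csv", data_type="frame", format="csv")  # hypothetical input file of strings
F_corrected = correctTypos(strings=F, correct=TRUE)     # swap rare strings for frequent similar ones
```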
cspline-Function
The cspline-function solves cubic spline interpolation. It uses a natural spline with $$ q_1''(x_0) = q_n''(x_n) = 0.0 $$.
By default, it calculates via the csplineDS-function.

Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | --- | 1-column matrix of x values knots. It is assumed that x values are monotonically increasing and there are no duplicate points in X |
| Y | Matrix[Double] | --- | 1-column matrix of corresponding y values knots |
| inp_x | Double | --- | The given input x, for which the cspline will find predicted y |
| mode | String | DS | Specifies the method for cspline (DS = Direct Solve, CG = Conjugate Gradient) |
| tol | Double | -1.0 | Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm |
| maxi | Integer | -1 | Maximum number of conjugate gradient iterations, 0 = no maximum |

Returns

| Name | Type | Description |
| ---- | ---- | ----------- |
| pred_Y | Matrix[Double] | Predicted values |
| K | Matrix[Double] | Matrix of k parameters |
Example
```
num_rec = 100 # Num of records
X = matrix(seq(1,num_rec), num_rec, 1)
Y = round(rand(rows=100, cols=1, min=1, max=5))
inp_x = 4.5
tolerance = 0.000001
max_iter = num_rec
[result, K] = cspline(X=X, Y=Y, inp_x=inp_x, tol=tolerance, maxi=max_iter)
```
csplineCG-Function
The csplineCG-function solves cubic spline interpolation with the conjugate gradient method. Usage is the same as for the cspline-function.
Usage
```
[result, K] = csplineCG(X, Y, inp_x, tol, maxi)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | --- | 1-column matrix of x values knots. It is assumed that x values are monotonically increasing and there are no duplicate points in X |
| Y | Matrix[Double] | --- | 1-column matrix of corresponding y values knots |
| inp_x | Double | --- | The given input x, for which the cspline will find predicted y |
| tol | Double | -1.0 | Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm |
| maxi | Integer | -1 | Maximum number of conjugate gradient iterations, 0 = no maximum |

Returns

| Name | Type | Description |
| ---- | ---- | ----------- |
| pred_Y | Matrix[Double] | Predicted values |
| K | Matrix[Double] | Matrix of k parameters |
Example
```
num_rec = 100 # Num of records
X = matrix(seq(1,num_rec), num_rec, 1)
Y = round(rand(rows=100, cols=1, min=1, max=5))
inp_x = 4.5
tolerance = 0.000001
max_iter = num_rec
[result, K] = csplineCG(X=X, Y=Y, inp_x=inp_x, tol=tolerance, maxi=max_iter)
```
csplineDS-Function
The csplineDS-function solves cubic spline interpolation with a direct solver.
Usage
```
[result, K] = csplineDS(X, Y, inp_x)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | --- | 1-column matrix of x values knots. It is assumed that x values are monotonically increasing and there are no duplicate points in X |
| Y | Matrix[Double] | --- | 1-column matrix of corresponding y values knots |
| inp_x | Double | --- | The given input x, for which the cspline will find predicted y |

Returns

| Name | Type | Description |
| ---- | ---- | ----------- |
| pred_Y | Matrix[Double] | Predicted values |
| K | Matrix[Double] | Matrix of k parameters |
Example
```
num_rec = 100 # Num of records
X = matrix(seq(1,num_rec), num_rec, 1)
Y = round(rand(rows=100, cols=1, min=1, max=5))
inp_x = 4.5
[result, K] = csplineDS(X=X, Y=Y, inp_x=inp_x)
```
cvlm-Function
The cvlm-function performs cross-validation of the provided data model, following a non-exhaustive cross-validation method. It uses the lm and lmPredict functions to solve the linear regression and to predict the response for a feature vector, with no intercept, shifting, or rescaling.
Usage
```
cvlm(X, y, k)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Recorded data set as a matrix |
| y | Matrix[Double] | required | 1-column matrix of response values. |
| k | Integer | required | Number of subsets needed; it should always be more than 1 and less than nrow(X) |
| icpt | Integer | 0 | Intercept presence, shifting and rescaling the columns of X |
| reg | Double | 1e-7 | Regularization constant (lambda) for L2-regularization; set to nonzero for highly dependent/sparse/numerous features |
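Example
A minimal sketch, assuming cvlm returns the predictions and the per-fold betas:
```
X = rand(rows=100, cols=10)
y = rand(rows=100, cols=1)
[yhat, betas] = cvlm(X=X, y=y, k=5)  # 5-fold cross-validation with lm/lmPredict
```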
decisionTree-Function
The decisionTree-function implements a classification tree for both scale and categorical features.
Usage
```
M = decisionTree(X, Y, R);
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Feature matrix X; note that X needs to be both recoded and dummy coded |
| Y | Matrix[Double] | required | Label matrix Y; note that Y needs to be both recoded and dummy coded |
| R | Matrix[Double] | " " | Matrix R which, for each feature in X, contains the following information: R[1,] is a row vector indicating whether a feature is scalar or categorical; 1 indicates a scalar feature, other positive integers indicate the number of categories. If R is not provided, all variables are assumed to be scale by default. |
| bins | Integer | 20 | Number of equi-height bins per scale feature to choose thresholds |
| depth | Integer | 25 | Maximum depth of the learned tree |
| verbose | Boolean | FALSE | Boolean specifying if the algorithm should print information while executing |
Returns

| Name | Type | Description |
| ---- | ---- | ----------- |
| M | Matrix[Double] | Each column of the matrix corresponds to a node in the learned tree |
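Example
A minimal sketch for a two-class problem; passing an all-ones R marks every feature as scale, matching the documented default:
```
X = rand(rows=100, cols=10)                      # scale features
Y = round(rand(rows=100, cols=1, min=1, max=2))  # two-class labels
R = matrix(1, rows=1, cols=ncol(X))              # all features are scale
M = decisionTree(X=X, Y=Y, R=R)
```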
discoverFD-Function
The discoverFD-function finds functional dependencies in the given data.
Usage
```
discoverFD(X, Mask, threshold)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Double | -- | Input Matrix X, encoded matrix if data is categorical |
| Mask | Double | -- | A row vector selecting the features of interest, e.g., Mask = [1, 0, 1] will exclude the second column from processing |
| threshold | Double | -- | Threshold value in interval [0, 1] for robust FDs |

Returns

| Type | Description |
| ---- | ----------- |
| Double | Matrix of functional dependencies |
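Example
A minimal sketch on recoded categorical data, using the argument names from the table above:
```
X = round(rand(rows=100, cols=4, min=1, max=3))  # recoded categorical data
Mask = matrix(1, rows=1, cols=ncol(X))           # consider all columns
FD = discoverFD(X=X, Mask=Mask, threshold=0.9)   # robust FDs at a 90% threshold
```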
dist-Function
The dist-function computes Euclidean distances between N d-dimensional points.
Usage
```
dist(X)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | (n x d) matrix of d-dimensional points |

Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | (n x n) symmetric matrix of Euclidean distances |
Example
```
X = rand(rows=5, cols=5)
Y = dist(X)
```
dmv-Function
The dmv-function is used to find disguised missing values utilising syntactical pattern recognition.
Usage
```
dmv(X, threshold, replace)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Frame[String] | required | Input Frame |
| threshold | Double | 0.8 | Threshold value in interval [0, 1] for the dominant pattern per column (e.g., 0.8 means that 80% of the entries per column must adhere to this pattern for it to be dominant) |
| replace | String | "NA" | The string that disguised missing values are replaced with |

Returns

| Type | Description |
| ---- | ----------- |
| Frame[String] | Frame X including detected disguised missing values |
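Example
A minimal sketch, assuming the input frame is read from a hypothetical CSV file:
```
F = read("data.csv", data_type="frame", format="csv")  # hypothetical input frame
F_marked = dmv(X=F, threshold=0.8, replace="NA")       # mark disguised missing values as "NA"
```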
gaussianClassifier-Function
The gaussianClassifier-function computes prior probabilities, means, determinants, and inverse covariance matrices per class.
Classification is as per $$ p(C=c | x) = p(x | c) * p(c) $$, where $p(x | c)$ is the (multivariate) Gaussian P.D.F. for class $c$, and $p(c)$ is the prior probability for class $c$.
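Example
A hedged sketch; the argument names D (data) and C (class labels), and the four output names mirroring the description above, are assumptions:
```
D = rand(rows=100, cols=5)                       # feature vectors
C = round(rand(rows=100, cols=1, min=1, max=3))  # class labels 1..3
[prior, means, covInv, dets] = gaussianClassifier(D=D, C=C)  # hypothetical argument/output names
```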
glm-Function
The glm-function is a flexible generalization of ordinary linear regression that allows for response variables with error distribution models other than a normal distribution.
Usage
```
glm(X, Y)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Matrix X of feature vectors |
| Y | Matrix[Double] | required | Matrix Y with either 1 or 2 columns: if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg) |
| dfam | Int | 1 | Distribution family code: 1 = Power, 2 = Binomial |
| vpow | Double | 0.0 | Power for variance defined as (mean)^power (ignored if dfam != 1): 0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian |
| link | Int | 0 | Link function code: 0 = canonical (depends on distribution), 1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit |
| lpow | Double | 1.0 | Power for link function defined as (mean)^power (ignored if link != 1): -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity |
| yneg | Double | 0.0 | Response value for Bernoulli "No" label, usually 0.0 or -1.0 |
| icpt | Int | 0 | Intercept presence, X columns shifting and rescaling: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1 |
| reg | Double | 0.0 | Regularization parameter (lambda) for L2 regularization |
| tol | Double | 1e-6 | Tolerance (epsilon) value |
| disp | Double | 0.0 | (Over-)dispersion value, or 0.0 to estimate it from data |
| moi | Int | 200 | Maximum number of outer (Newton / Fisher scoring) iterations |
| mii | Int | 0 | Maximum number of inner (conjugate gradient) iterations, 0 = no maximum |
Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | Matrix whose size depends on icpt (icpt=0: ncol(X) x 1; icpt=1: (ncol(X) + 1) x 1; icpt=2: (ncol(X) + 1) x 2) |
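Example
A minimal sketch of a binomial (logistic) fit, following the usage and argument table above:
```
X = rand(rows=100, cols=10)
Y = round(rand(rows=100, cols=1, min=0, max=1))  # Bernoulli labels
beta = glm(X=X, Y=Y, dfam=2)                     # binomial family, canonical (logit) link
```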
gmm-Function
The gmm-function implements a builtin Gaussian Mixture Model with four different types of covariance matrices (VVV, EEE, VVI, VII) and two initialization methods, "kmeans" and "random".
gnmf-Function
The gnmf-function performs Gaussian Non-Negative Matrix Factorization: a matrix X is factorized into two matrices W and H such that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect.
Usage
```
gnmf(X, rnk, eps=10^-8, maxi=10)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Matrix of feature vectors. |
| rnk | Integer | required | Number of components into which matrix X is to be factored. |
| eps | Double | 10^-8 | Tolerance |
| maxi | Integer | 10 | Maximum number of conjugate gradient iterations. |

Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | List of pattern matrices, one for each repetition. |
| Matrix[Double] | List of amplitude matrices, one for each repetition. |
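Example
A minimal sketch, assuming the two outputs are the pattern matrix W and the amplitude matrix H:
```
X = rand(rows=50, cols=20)
[W, H] = gnmf(X=X, rnk=5)  # factorize X (50 x 20) into W (50 x 5) and H (5 x 20)
```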
gridSearch-Function
The gridSearch-function is used to find the optimal hyper-parameters of a model, which results in the most accurate predictions. This function takes train and eval functions by name.
Usage
```
gridSearch(X, y, train, predict, params, paramValues, verbose)
```
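Example
A hedged sketch that tunes lm's regularization constant; the user-defined scoring function l2loss and the exact shape of paramValues are assumptions:
```
# squared-error loss of fitted betas B on (X, y); hypothetical eval function for gridSearch
l2loss = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] B)
  return (Matrix[Double] loss) {
  loss = as.matrix(sum((y - X %*% B)^2))
}

X = rand(rows=100, cols=10)
y = rand(rows=100, cols=1)
params = list("reg")                                             # hyper-parameter to tune
paramValues = list(matrix("0.1 0.001 0.00001", rows=3, cols=1))  # candidate values
[B, opt] = gridSearch(X=X, y=y, train="lm", predict="l2loss",
  params=params, paramValues=paramValues, verbose=TRUE)
```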
hyperband-Function
The hyperband-function is used for hyper-parameter optimization and is based on multi-armed bandits and early elimination. Through multiple parallel brackets and consecutive trials, it returns the hyper-parameter combination that performed best on a validation dataset. A set of hyper-parameter combinations is drawn from uniform distributions with given ranges; those make up the candidates for hyperband.
Notes:
- hyperband is hard-coded for lmCG, and uses lmPredict for validation
- hyperband is hard-coded to use the number of iterations as a resource
- hyperband can only optimize continuous hyperparameters
Usage
```
hyperband(X_train, y_train, X_val, y_val, params, paramRanges, R, eta, verbose)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X_train | Matrix[Double] | required | Input matrix of training vectors. |
| y_train | Matrix[Double] | required | Labels for training vectors. |
| X_val | Matrix[Double] | required | Input matrix of validation vectors. |
| y_val | Matrix[Double] | required | Labels for validation vectors. |
| params | List[String] | required | List of parameters to optimize. |
| paramRanges | Matrix[Double] | required | The min and max values for the uniform distributions to draw from. One row per hyper-parameter; the first column specifies the min, the second column the max value. |
| R | Scalar[int] | 81 | Controls the number of candidates evaluated. |
| eta | Scalar[int] | 3 | Determines the fraction of candidates to keep after each trial. |
| verbose | Boolean | TRUE | If TRUE, print messages are activated. |

Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | 1-column matrix of weights of the best performing candidate |
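Example
A minimal sketch tuning the regularization of lmCG; the two output names are assumptions:
```
X_train = rand(rows=80, cols=10)
y_train = rand(rows=80, cols=1)
X_val = rand(rows=20, cols=10)
y_val = rand(rows=20, cols=1)
params = list("reg")                           # continuous hyper-parameter of lmCG
paramRanges = matrix("0 0.1", rows=1, cols=2)  # min and max of the uniform range
[bestW, bestHP] = hyperband(X_train=X_train, y_train=y_train,
  X_val=X_val, y_val=y_val, params=params, paramRanges=paramRanges)
```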
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| R | Matrix[Double] | --- | Row vector R indicating whether a feature is categorical or continuous. 1 denotes a continuous feature, 2 denotes a categorical feature. |
| n_bins | Integer | 20 | Number of equi-width bins for binning in case of scale features. |
| method | String | --- | String indicating the method to use; either "entropy" or "gini". |

Returns

| Name | Type | Description |
| ---- | ---- | ----------- |
| IM | Matrix[Double] | (1 x ncol(X)) row vector containing the information/gini gain for each feature of the dataset. In case of gini, the values denote the gini gains, i.e., how much impurity was removed with the respective split; the higher the value, the better the split. In case of entropy, the values denote the information gain, i.e., how much entropy was removed; the higher the information gain, the better the split. |
lm-Function
The lm-function solves linear regression using either the direct-solve method or the conjugate gradient algorithm, depending on the input size of the matrices (see the lmDS-function and the lmCG-function, respectively).
Usage
```
lm(X, y, icpt=0, reg=1e-7, tol=1e-7, maxi=0, verbose=TRUE)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Matrix of feature vectors. |
| y | Matrix[Double] | required | 1-column matrix of response values. |
| icpt | Integer | 0 | Intercept presence, shifting and rescaling the columns of X (see icpt-Argument below) |
| reg | Double | 1e-7 | Regularization constant (lambda) for L2-regularization; set to nonzero for highly dependent/sparse/numerous features |
| tol | Double | 1e-7 | Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm |
| maxi | Integer | 0 | Maximum number of conjugate gradient iterations. 0 = no maximum |
| verbose | Boolean | TRUE | If TRUE, print messages are activated |

Note that if the number of features is small enough (rows of X/y < 2000), the lmDS-function is called internally and the parameters tol and maxi are ignored.
Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | 1-column matrix of weights. |
icpt-Argument
The icpt-argument can be set to 3 modes:
- 0 = no intercept, no shifting, no rescaling
- 1 = add intercept, but neither shift nor rescale X
- 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
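Example
A minimal end-to-end sketch on synthetic data:
```
X = rand(rows=100, cols=10)
y = X %*% rand(rows=10, cols=1)  # synthetic linear response
w = lm(X=X, y=y, icpt=0)         # 1-column matrix of weights
```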
mice-Function
The mice-function implements Multiple Imputation using Chained Equations (MICE) for nominal data.
Usage
```
mice(F, cMask, iter, complete, verbose)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Data matrix (recoded matrix for categorical features), ncol(X) > 1 |
| cMask | Matrix[Double] | required | 0/1 row vector identifying numeric (0) and categorical (1) features, with ncol(cMask) = ncol(F). |
| iter | Integer | 3 | Number of iterations for multiple imputations. |
| verbose | Boolean | FALSE | Boolean value. |
Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | Imputed dataset. |
Example
```
F = matrix("4 3 NaN 8 7 8 5 NaN 6", rows=3, cols=3)
cMask = round(rand(rows=1, cols=ncol(F), min=0, max=1))
dataset = mice(F, cMask, iter=3, verbose=FALSE)
```
msvm-Function
The msvm-function implements a builtin multiclass SVM with squared slack variables. It learns one-against-the-rest binary-class classifiers by making a function call to l2SVM.
Usage
```
msvm(X, Y, intercept, epsilon, lamda, maxIterations, verbose)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Double | --- | Matrix X of feature vectors. |
| Y | Double | --- | Matrix Y of class labels. |
| intercept | Boolean | FALSE | No intercept (if set to TRUE, a constant bias column is added to X) |
| num_classes | Integer | 10 | Number of classes. |
| epsilon | Double | 0.001 | Procedure terminates early if the reduction in objective function value is less than epsilon (tolerance) times the initial objective function value. |
| lamda | Double | 1.0 | Regularization parameter (lambda) for L2 regularization |
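Example
A minimal sketch for a three-class problem; the output name model is illustrative:
```
X = rand(rows=100, cols=10)
Y = round(rand(rows=100, cols=1, min=1, max=3))  # three classes
model = msvm(X=X, Y=Y, intercept=FALSE, verbose=FALSE)
```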
multiLogReg-Function
The multiLogReg-function solves Multinomial Logistic Regression using the Trust Region method (see: "Trust Region Newton Method for Logistic Regression", Lin, Weng and Keerthi, JMLR 9 (2008) 627-650).
Usage
```
multiLogReg(X, Y, icpt, reg, tol, maxi, maxii, verbose)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Double | -- | The matrix of feature vectors |
| Y | Double | -- | The matrix with category labels |
| icpt | Int | 0 | Intercept presence, shifting and rescaling X columns: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1 |
| reg | Double | 0 | Regularization parameter (lambda = 1/C); intercept is not regularized |
| tol | Double | 1e-6 | Tolerance ("epsilon") |
| maxi | Int | 100 | Max. number of outer Newton iterations |
| maxii | Int | 0 | Max. number of inner (conjugate gradient) iterations |
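Example
A minimal sketch; B holds the regression betas:
```
X = rand(rows=100, cols=10)
Y = round(rand(rows=100, cols=1, min=1, max=3))  # category labels 1..3
B = multiLogReg(X=X, Y=Y, icpt=2, reg=0.001)
```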
normalize-Function
The normalize-function normalizes the values of a matrix by rescaling the data to a common scale while preserving differences in the ranges of values. The output is a matrix of values in the range [0,1].
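Example
A minimal sketch; the single-output call and column-wise rescaling are assumptions:
```
X = rand(rows=10, cols=3, min=-5, max=5)
Y = normalize(X)  # values rescaled into [0,1]
```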
outlierByDB-Function
The outlierByDB-function implements outlier prediction for a trained dbscan model. The points in the Xtest matrix are checked against the model and are considered part of a cluster if at least one member is within eps distance.
Usage
```
outlierByDB(X, model, eps)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| Xtest | Matrix[Double] | required | Matrix of points for outlier testing |
| model | Matrix[Double] | required | Matrix model of the clusters, containing all points that are considered members, returned by the dbscan builtin |
| eps | Double | 0.5 | Epsilon distance between points to be considered in their neighborhood |
Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | Matrix indicating outlier values of the points in Xtest; a 0 suggests the point is an outlier |
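Example
A hedged sketch; the dbscan call that produces the cluster model, including its output names, is an assumption:
```
X = rand(rows=100, cols=2)
Xtest = rand(rows=10, cols=2)
[indices, model] = dbscan(X=X, eps=0.5, minPts=5)      # hypothetical dbscan outputs
outliers = outlierByDB(X=Xtest, model=model, eps=0.5)  # 0 marks an outlier
```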
pnmf-Function
The pnmf-function implements Poisson Non-negative Matrix Factorization (PNMF): matrix X is factorized into two non-negative matrices, W and H, based on a Poisson probabilistic assumption. This non-negativity makes the resulting matrices easier to inspect.
Usage
```
pnmf(X, rnk, eps=10^-8, maxi=10, verbose=TRUE)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Matrix of feature vectors. |
| rnk | Integer | required | Number of components into which matrix X is to be factored. |
| eps | Double | 10^-8 | Tolerance |
| maxi | Integer | 10 | Maximum number of conjugate gradient iterations. |
| verbose | Boolean | TRUE | If TRUE, 'iter' and 'obj' are printed. |

Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | List of pattern matrices, one for each repetition. |
| Matrix[Double] | List of amplitude matrices, one for each repetition. |
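Example
A minimal sketch, assuming the two outputs are the factor matrices W and H:
```
X = rand(rows=50, cols=20)
[W, H] = pnmf(X=X, rnk=5, verbose=FALSE)  # factorize X into non-negative W and H
```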
sherlock-Function
Implements the training phase of Sherlock: A Deep Learning Approach to Semantic Data Type Detection [Hulsebos, Madelon, et al. "Sherlock: A deep learning approach to semantic data type detection." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019].
Usage
```
sherlock(X_train, y_train)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X_train | Matrix[Double] | required | Matrix of feature vectors. |
| y_train | Matrix[Double] | required | Matrix Y of class labels of semantic data type. |
Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | Weight (parameter) matrices for character distributions |
| Matrix[Double] | Weight (parameter) matrices for word embeddings |
| Matrix[Double] | Weight (parameter) matrices for paragraph vectors |
| Matrix[Double] | Weight (parameter) matrices for global statistics |
| Matrix[Double] | Weight (parameter) matrices for combining all features (final) |
sherlockPredict-Function
Implements the prediction and evaluation phase of Sherlock: A Deep Learning Approach to Semantic Data Type Detection [Hulsebos, Madelon, et al. "Sherlock: A deep learning approach to semantic data type detection." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019].
sigmoid-Function
The sigmoid function is an activation function, also described as a squashing function, that limits the output to a range between 0 and 1, which makes it useful for predicting probabilities.
Usage
```
sigmoid(X)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Matrix of feature vectors. |

Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | Matrix of the same dimensions as X, with the sigmoid applied element-wise. |
Example
```
X = rand(rows=20, cols=10)
Y = sigmoid(X)
```
slicefinder-Function
The slicefinder-function returns top-k worst performing subsets according to a model calculation.
Usage
```
slicefinder(X, W, y, k, paq, S);
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Recoded dataset as a matrix |
| W | Matrix[Double] | required | Trained model |
| y | Matrix[Double] | required | 1-column matrix of response values. |
| k | Integer | 1 | Number of subsets required |
| paq | Integer | 1 | Number of values wanted for each column; if paq = 1, it is off |
| S | Integer | 2 | Number of subsets to combine (for now, only 1 and 2 are supported) |
Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | Matrix containing the information of the top-k slices (relative error, standard error, value0, value1, col_number(sort), rows, cols, range_row, range_cols, value00, value01, col_number2(sort), rows2, cols2, range_row2, range_cols2) |
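Example
A hedged sketch that first trains a model with lm and then requests the top-5 worst-performing slices:
```
X = rand(rows=100, cols=10)
y = rand(rows=100, cols=1)
W = lm(X=X, y=y)                          # trained model
slices = slicefinder(X=X, W=W, y=y, k=5)  # top-5 worst performing subsets
```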
smote-Function
The smote-function (Synthetic Minority Oversampling Technique) implements a classical technique for handling class imbalance. The built-in takes samples of the minority class and over-samples it by generating synthesized samples. It accepts two parameters, s and k. The parameter s defines the number of synthesized samples to be generated, i.e., the minority class is over-sampled by s percent, where s is a multiple of 100. Given 40 samples of the minority class and s = 200, smote will generate 80 synthesized samples, over-sampling the class by 200 percent. The parameter k determines how many nearest neighbours are computed for each minority-class sample; the neighbours are then chosen randomly in the synthesis process.
Usage
```
smote(X, s, k, verbose);
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Matrix of feature vectors of minority class samples |
| s | Integer | 200 | Amount of SMOTE (percentage of oversampling), integral multiple of 100 |
| k | Integer | 1 | Number of nearest neighbours |
| verbose | Boolean | TRUE | If TRUE, print messages are activated |
Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | Matrix of (s/100) * nrow(X) synthetic minority class samples |
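Example
A minimal sketch; X_min is assumed to hold only minority-class rows:
```
X_min = rand(rows=40, cols=5)                      # minority-class samples only
synth = smote(X=X_min, s=200, k=3, verbose=FALSE)  # generates 80 synthetic samples (200%)
```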
steplm-Function
The steplm-function (stepwise linear regression) implements a classical forward feature selection method. This method iteratively runs what-if scenarios and greedily selects the next best feature until the Akaike information criterion (AIC) does not improve anymore. Each configuration trains a regression model via lm, which in turn calls either the closed-form lmDS or the iterative lmCG.
Usage
```
steplm(X, y, icpt);
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | required | Matrix of feature vectors. |
| y | Matrix[Double] | required | 1-column matrix of response values. |
| icpt | Integer | 0 | Intercept presence, shifting and rescaling the columns of X (see icpt-Argument below) |
| reg | Double | 1e-7 | Regularization constant (lambda) for L2-regularization; set to nonzero for highly dependent/sparse/numerous features |
| tol | Double | 1e-7 | Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm |
| maxi | Integer | 0 | Maximum number of conjugate gradient iterations. 0 = no maximum |
| verbose | Boolean | TRUE | If TRUE, print messages are activated |
Returns

| Type | Description |
| ---- | ----------- |
| Matrix[Double] | Matrix of regression parameters (the betas), whose size depends on the icpt input value (C in the example) |
| Matrix[Double] | Matrix of selected features, ordered as computed by the algorithm (S in the example) |
icpt-Argument
The icpt-argument can be set to 2 modes:
- 0 = no intercept, no shifting, no rescaling
- 1 = add intercept, but neither shift nor rescale X
selected-Output
If the best AIC is achieved without any features, the matrix of selected features contains 0. Moreover, in this case no further statistics will be produced.
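Example
A minimal sketch using the output names C and S referenced above:
```
X = rand(rows=100, cols=10)
y = rand(rows=100, cols=1)
[C, S] = steplm(X=X, y=y, icpt=0)  # C: betas, S: selected features
```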
symmetricDifference-Function
The symmetricDifference-function returns the symmetric difference of the two input vectors. This is done by computing the setdiff (non-symmetric) between the union and the intersection of the two input vectors.
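Example
A minimal sketch; the named arguments are an assumption:
```
X = matrix("1 2 3 4", rows=4, cols=1)
Y = matrix("3 4 5 6", rows=4, cols=1)
R = symmetricDifference(X=X, Y=Y)  # expected contents: 1, 2, 5, 6
```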
tomekLink-Function
The tomekLink-function performs undersampling by removing Tomek's links for imbalanced multiclass problems.
Reference: I. Tomek, "Two Modifications of CNN," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, no. 11, pp. 769-772, Nov. 1976, doi: 10.1109/TSMC.1976.4309452.
winsorize-Function
The winsorize-function removes outliers from the data. It does so by computing an upper and lower quartile range of the given data and replacing any value that falls outside this range (less than the lower quartile range or greater than the upper quartile range).
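Example
A minimal sketch; the single-output call is an assumption:
```
X = rand(rows=100, cols=1, min=-10, max=10)
Y = winsorize(X=X)  # values outside the quartile-based range are replaced
```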
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses gradient boosting. This xgboost implementation supports classification and regression and is capable of working with categorical and scalar features. To calculate a prediction, XGBoost sums the predictions of all its trees. Each tree is not a great predictor on its own, but by summing across all trees, XGBoost is able to provide a robust prediction in many cases. Depending on the supervised machine learning type, use xgboostPredictRegression() or xgboostPredictClassification() to predict the labels.
Usage
```
y_pred = xgboostPredictRegression(X=X, M=M)
```
or
```
y_pred = xgboostPredictClassification(X=X, M=M)
```
Arguments

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| X | Matrix[Double] | --- | Feature matrix X; categorical features need to be one-hot-encoded |
| M | Matrix[Double] | --- | Trained model returned from xgboost. Each column of the matrix corresponds to a node in the learned model. A detailed description can be found in xgboost.dml |
| learning_rate | Double | 0.3 | Alias: eta. After each boosting step, the learning rate controls the weights of the new predictions. Should be the same as at the xgboost-function call |
Returns

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| P | Matrix[Double] | --- | xgboostPredictRegression: the prediction of the samples using the xgboost model (y_prediction). xgboostPredictClassification: the probability of the samples being 1 (like XGBClassifier.predict_proba() in Python) |