# Data Preparation of the Iris dataset in Julia

Before getting to the real Machine Learning part, it is necessary to get the data imported and prepared. I will cover only three basic steps here: importing a CSV file, one-hot encoding a categorical variable, and making a train-test split.

## Data Prep 1 — Import a CSV file in Julia

The first step to getting started in Julia is to import data. In this case, we use a CSV file with the Iris data. To import a CSV file as a data frame, you will need to add the packages “CSV” and “DataFrames” as shown below. Then, you use the “CSV.File” function to read the CSV file and the “DataFrame” function to convert it to a data frame.

In [1]:
# Import a CSV file
import Pkg
Pkg.add(["CSV", "DataFrames"])
using CSV, DataFrames
iris = DataFrame(CSV.File("mypath//iris.csv"))


## Data Prep 2 — One Hot Encode the dependent variable (variety)

For some models, you will need one-hot encoding for the categorical variables. You can use the “Lathe” library for this. It has a OneHotEncode function that will convert the data frame into a one-hot encoded data frame. After that, you can remove the original column using the “select!” function. Keep a copy of the original column first, because the true labels are needed later to compute the accuracy.

In [None]:
import Pkg; Pkg.add("Lathe")
using Lathe
# Keep a copy of the original labels; they are needed later to compute accuracy
orig_col = copy(iris.variety)
scaled_feature = Lathe.preprocess.OneHotEncode(iris, :variety)
iris = select!(iris, Not([:variety]))
first(iris, 5)
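
To make the idea of one-hot encoding concrete, here is a minimal sketch that uses only base Julia. The `onehot` helper is hypothetical (it is not part of Lathe) and the labels are made up for illustration; it builds one Boolean column per distinct class.

```julia
# Hypothetical helper (not part of Lathe): one-hot encode a vector of labels.
function onehot(labels::AbstractVector)
    classes = sort(unique(labels))   # fixed, reproducible column order
    # One Bool column per class: true where the label equals that class
    encoded = [label == cls for label in labels, cls in classes]
    return classes, encoded
end

# Made-up labels for illustration
varieties = ["Setosa", "Virginica", "Versicolor", "Setosa"]
classes, encoded = onehot(varieties)
println(classes)
println(encoded)
```

Each row of `encoded` contains exactly one `true`, in the column of that row's class.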

## Data Prep 3 — Train Test Split

For model evaluation, you will need a train-test split. The following code does this using the “Random” library. It uses randsubseq to include each row index in the train set with probability 0.75, so the split is roughly (not exactly) 75/25. The indexes that were not selected form the test set:

In [None]:
using Random
sample = randsubseq(1:size(iris,1), 0.75)
train = iris[sample, :]
notsample = setdiff(1:size(iris,1), sample)
test = iris[notsample, :]
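
If you prefer a split with an exact train fraction, one possible alternative is to shuffle the row indexes and cut the shuffled list. This is only a sketch: the `split_indices` helper and its parameters are hypothetical names, and it relies on nothing but the standard `Random` library.

```julia
using Random

# Hypothetical helper: split row indexes 1:n into train/test with an exact train fraction.
function split_indices(n::Integer, train_frac::Real; rng = Random.default_rng())
    shuffled = randperm(rng, n)              # random permutation of 1:n
    n_train = round(Int, train_frac * n)     # number of rows in the train set
    train_idx = sort(shuffled[1:n_train])
    test_idx = sort(shuffled[n_train+1:end])
    return train_idx, test_idx
end

# 150 rows, as in the Iris data; the seeded generator makes the split reproducible
train_idx, test_idx = split_indices(150, 0.75; rng = MersenneTwister(42))
println(length(train_idx), " train rows, ", length(test_idx), " test rows")
```

The resulting index vectors can be used in the same way as `sample` and `notsample` above, for example `iris[train_idx, :]`.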

# Machine Learning in Julia

The resources for Machine Learning in Julia are still scattered across different packages. Because Julia is not (yet) as popular as other programming languages for Machine Learning, it can sometimes be a bit of work to find specific models. It can also take more effort to find (or write) certain basic data preparation functions that are easily available in Python and R.

The good news is that there are initiatives to bring Machine Learning models together in larger libraries. At this point, two libraries are seriously competing to become the go-to Machine Learning library in Julia: MLJ and Scikit Learn.

Those two initiatives are great, but they are not yet totally complete. As a result, for some models, they simply provide wrappers to other, much smaller, Machine Learning libraries. Because of this, I find it important to also cover two of those smaller libraries: “GLM” for Generalized Linear Models and “DecisionTree” for many tree-based models. I will start with the smaller libraries and finish with the larger initiatives.

# Logistic Regression in Julia using the GLM library

The following example fits three Logistic Regression models using the GLM library on the Iris data. GLM uses the “formula” interface, which is common in statistics-oriented libraries. We can specify a family (Binomial in this case) and a link type (Logit Link in this case) to create the desired type of GLM. This is done in the first part of the code snippet below.

At the end of this snippet, the predictions of the three models are horizontally concatenated in order to prepare for the application of a One-Versus-All multi-class classification.

In [None]:
import Pkg; Pkg.add("StatsModels")
import Pkg; Pkg.add("GLM")

using DataFrames, GLM
fm_setosa = @formula(Setosa ~  sepallength + sepalwidth + petallength + petalwidth)
lm_setosa = glm(fm_setosa, train, Binomial(), LogitLink())
pred_setosa = predict(lm_setosa, test)

fm_virginica = @formula(Virginica ~ sepallength + sepalwidth + petallength + petalwidth)
lm_virginica = glm(fm_virginica, train, Binomial(), LogitLink())
pred_virginica = predict(lm_virginica, test)

fm_versicolor = @formula(Versicolor ~ sepallength + sepalwidth + petallength + petalwidth)
lm_versicolor = glm(fm_versicolor, train, Binomial(), LogitLink())
pred_versicolor = predict(lm_versicolor, test)

preds = hcat(pred_setosa, pred_virginica, pred_versicolor)
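
For intuition about what the Binomial family with a logit link computes at prediction time, here is a small base-Julia sketch. The intercept, coefficients, and feature values are made up for illustration and are not fitted values from the models above.

```julia
# Inverse of the logit link: maps a linear predictor to a probability in (0, 1)
logistic(eta) = 1 / (1 + exp(-eta))

# Hypothetical intercept and coefficients, made up for illustration only
intercept = -1.0
coefs = [0.5, -0.25]
x = [2.0, 4.0]                      # one observation with two features

eta = intercept + sum(coefs .* x)   # linear predictor
p = logistic(eta)                   # predicted probability of the positive class
println(p)
```

Each of the three models above produces one such probability per test row, which is why the predictions can be compared column by column afterwards.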

In the following snippet, we convert the three predicted probabilities for each row into one class prediction per row. The predicted class is the one with the highest of the three predicted probabilities:

In [None]:
# Reclass by maximum predicted probability
preds_cat = String[]
for i in 1:size(preds, 1)
    if pred_setosa[i] >= pred_virginica[i] && pred_setosa[i] >= pred_versicolor[i]
        push!(preds_cat, "Setosa")
    elseif pred_versicolor[i] >= pred_virginica[i] && pred_versicolor[i] >= pred_setosa[i]
        push!(preds_cat, "Versicolor")
    else
        # Neither Setosa nor Versicolor has the highest probability
        push!(preds_cat, "Virginica")
    end
end

preds_cat
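
The same reclassification can also be sketched more compactly with `argmax`. The probabilities below are made up for illustration, and `class_names` is a hypothetical vector that lists the classes in the same order as the columns of the probability matrix.

```julia
# Illustrative probabilities (made up for this sketch), one row per observation;
# the columns follow the order of `class_names`.
class_names = ["Setosa", "Virginica", "Versicolor"]
probs = [0.90 0.05 0.20;
         0.10 0.70 0.30;
         0.20 0.40 0.60]

# For each row, pick the class whose column holds the largest probability
pred_labels = [class_names[argmax(row)] for row in eachrow(probs)]
println(pred_labels)
```

Note that `argmax` returns the first maximum, so ties go to the earliest column.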

As a final step, here is how to compute the accuracy of our GLM prediction on the test set, using a short for-loop:

In [None]:
# Compute accuracy of GLM
# `orig_col` must hold the original variety labels, saved before the column was dropped
correct = 0
actual = orig_col[notsample]
n = length(actual)
for i in 1:n
    if actual[i] == preds_cat[i]
        correct = correct + 1
    end
end
println(correct / n)
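
Since the same accuracy loop comes back for every model, it could also be wrapped in a small function. The `accuracy` helper below is a hypothetical name, not a function from any of the libraries used here, and the labels are made up for illustration.

```julia
using Statistics

# Hypothetical helper: share of predictions that match the true labels.
function accuracy(actual::AbstractVector, predicted::AbstractVector)
    length(actual) == length(predicted) ||
        throw(DimensionMismatch("actual and predicted must have the same length"))
    return mean(actual .== predicted)   # element-wise comparison, then the mean
end

# Made-up labels: three of the four predictions are correct
acc = accuracy(["Setosa", "Virginica", "Versicolor", "Setosa"],
               ["Setosa", "Virginica", "Virginica", "Setosa"])
println(acc)
```

With such a helper, each accuracy computation becomes a single call, for example `accuracy(actual, preds_cat)`.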

# Decision Tree in Julia using the DecisionTree.jl library

In the following code snippet, you will see how to fit a Decision Tree in Julia. Firstly, it re-imports the Iris data, because the Decision Tree supports the use of Categorical variables. As stated in the introduction, this is an advantage of using Julia.

Then the model is created as an instance of the DecisionTreeClassifier. We can pass several hyperparameters, such as the max_depth used in this example. Note the exclamation mark in fit!: by Julia convention, a function whose name ends in an exclamation mark modifies its argument, so fit! trains the model in place.
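
The exclamation mark convention is not specific to model fitting. As a small illustration with base Julia only, compare `sort` and `sort!` (the `scores` vector is made up for this sketch):

```julia
scores = [3, 1, 2]

sorted_copy = sort(scores)   # returns a new sorted vector; `scores` is unchanged
println(scores)

sort!(scores)                # sorts `scores` in place
println(scores)
```

In the same spirit, `fit!` updates the model object that is passed to it.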

The last two steps are prediction on the test set using the predict function and computing the accuracy, as in the previous model.

In [None]:
import Pkg; Pkg.add("DecisionTree")

# Re-import iris, because the tree-based models can handle the categorical variable
iris = DataFrame(CSV.File("mypath//iris.csv"))

train = iris[sample, :]
notsample = setdiff(1:size(iris, 1), sample)
test = iris[notsample, :]

# Features as a matrix, labels as a vector
X_train = Matrix(train[:, 1:4])
y_train = Vector(train[:, 5])

X_test = Matrix(test[:, 1:4])
y_test = Vector(test[:, 5])

      
# Fit the model
using DecisionTree

model = DecisionTreeClassifier(max_depth=5)
fit!(model, X_train, y_train)


# Predict
dectree_pred = DecisionTree.predict(model, X_test)

                                    
# Compute accuracy
correct = 0
n = length(y_test)
for i in 1:n
    if y_test[i] == dectree_pred[i]
        correct = correct + 1
    end
end
println(correct / n)

# Random Forest in Julia using the DecisionTree.jl library

As you will see, the Random Forest Model is applied in almost the same way as the Decision Tree. It may be confusing at first, but the Random Forest model is also part of the Decision Tree library!

In [None]:
using DecisionTree

# Fit the model
rf = RandomForestClassifier()
fit!(rf, X_train, y_train)

# Predict on the test set
rf_pred = DecisionTree.predict(rf, X_test)

# Compute the accuracy
correct = 0
n = length(y_test)
for i in 1:n
    if y_test[i] == rf_pred[i]
        correct = correct + 1
    end
end
println(correct / n)

# Main packages for Machine Learning in Julia

Now that we have seen how to use two great, but small, libraries for Machine Learning in Julia, let’s get to the larger libraries. As stated before, there are two main packages that compete for becoming the go-to ML library in Julia: Scikit Learn and MLJ. Let’s check them both out.

## Scikit Learn for Machine Learning in Julia

Many of you will know Scikit Learn from Python. It is the go-to package for Machine Learning in Python, and it is great to have it in Julia as well. It requires much less effort if we can just use the same syntax as in Python!

Let’s see an example of Scikit Learn in Julia. This code snippet starts with importing the Scikit Learn library. The next step is loading the model you want to use (in this case a Logistic Regression). Using the “fit!” syntax (note the exclamation mark), the model is trained.

After that, the predict function is used to predict the test set with the trained model. Finally, the accuracy is computed.

In [None]:
# Import the library
import Pkg; Pkg.add("ScikitLearn")

# Import the model you want to use
using ScikitLearn
@sk_import linear_model: LogisticRegression

# Fit the model
log_reg = fit!(LogisticRegression(), X_train, y_train)

# Predict on the test set
sklearn_pred = log_reg.predict(X_test)

# Compute the accuracy
correct = 0
n=length(y_test)
for i in 1:n
    if y_test[i] == sklearn_pred[i]       
        correct = correct + 1
    end
end
println(correct / n)

Using Scikit Learn in Julia also has its disadvantages. For example, a large part of the Scikit Learn library that we can use in Julia is actually just a wrapper around Python. Apart from a few models that have been implemented in Julia, the models use PyCall to call Python code.

However, if we switch to Julia, it should be to get the benefits of Julia. Since one of Julia’s main benefits is its speed advantage over Python, calling Python code is really not what we should be doing here. If it’s just a Python wrapper, we might as well stay with Scikit Learn in Python directly.

Another disadvantage is that the Python models in Scikit Learn have no support for categorical variables. Apart from encoding them, there is really not much that can be done in Scikit Learn and that really is a negative point (especially for tree-based models). As you have seen throughout the examples, Julia allows us to treat a categorical variable as one variable, rather than as a set of one hot encoded dummies. So that advantage of Julia would also go away when we use Julia as a mere Python wrapper.

## MLJ for Machine Learning in Julia

A competitor for Machine Learning in Julia is the MLJ package. It promises to solve the problem of categorical variables and it is pure Julia. This makes it very interesting to explore. It also has serious support from the Alan Turing Institute, which makes me believe that this library could be here to stay.

Let’s now see an example of MLJ for Machine Learning in Julia in the snippet below. A few things are different from usual. In particular, the creation of a machine is a choice of syntax that will be new for many. Also unusual is that you have to load a model, rather than import a package.

In [None]:
# Import the packages that you need (this depends on the model you use)
import Pkg; Pkg.add("MLJ")
import Pkg; Pkg.add("LIBSVM")
import Pkg; Pkg.add("MLJModels")

# load the model
using MLJ
svc_model = @load SVC verbosity=1

# create a so-called machine
svc = machine(svc_model, X_train, categorical(y_train))

# fit the model
MLJ.fit!(svc);

# predict on the test set
yhat = MLJ.predict(svc, X_test);

# compute the accuracy
correct = 0
n = length(y_test)
for i in 1:n
    if y_test[i] == yhat[i]
        correct = correct + 1
    end
end
println(correct / n)

Beyond those syntactical differences, using the MLJ library is not fundamentally different. MLJ syntax is easy to learn, and MLJ’s documentation website is good.

# Conclusion

In this article, we have seen four libraries for Machine Learning in Julia. Two of those libraries (MLJ and Scikit Learn) seem to be real competitors to take over the Machine Learning landscape in Julia.

Scikit Learn has the big advantage of the familiar syntax from the Python implementation and it has trust from its community. On the other hand, Scikit Learn is often simply calling Python code, which takes away most of the advantages of using Julia in the first place.

MLJ has the big advantage of being a real Julia project. Its syntax is slightly new, but the differences seem minor. The real challenge for MLJ will be gaining the trust of a larger community and becoming more popular.

I hope this article has given you all that you need for getting started in Julia and I wish you good luck doing so!