# Machine Learning in Julia (using the Iris dataset)

We will be creating a machine learning model using Julia and Flux.jl to classify types of flowers using the [Iris dataset](https://www.kaggle.com/uciml/iris#Iris.csv).

In [9]:
versioninfo() # Information about the environment I am using

Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


## Dependencies

We will be using [IJulia](https://github.com/JuliaLang/IJulia.jl) to run the Julia kernel inside this Jupyter notebook.

For this project we will be using Julia version 1.3.0 and the packages listed below:
- [CSV.jl](https://juliadata.github.io/CSV.jl/stable/)
- [Flux.jl](https://fluxml.ai/)

In [None]:
# To add the dependencies, run:
# (preferably from the Julia REPL in a terminal)
using Pkg
Pkg.add("IJulia")
Pkg.add("Flux")
Pkg.add("CSV")

In [10]:
# Importing the Dependencies
using CSV, Flux
# The first time you import them, it will take a lot of time for them to precompile.

In [11]:
dataset = convert(Array, CSV.File("./Iris.csv"))

150-element Array{CSV.Row{false},1}:
 [1, 5.1, 3.5, 1.4, 0.2, "Iris-setosa"]     
 [2, 4.9, 3.0, 1.4, 0.2, "Iris-setosa"]     
 [3, 4.7, 3.2, 1.3, 0.2, "Iris-setosa"]     
 [4, 4.6, 3.1, 1.5, 0.2, "Iris-setosa"]     
 [5, 5.0, 3.6, 1.4, 0.2, "Iris-setosa"]     
 [6, 5.4, 3.9, 1.7, 0.4, "Iris-setosa"]     
 [7, 4.6, 3.4, 1.4, 0.3, "Iris-setosa"]     
 [8, 5.0, 3.4, 1.5, 0.2, "Iris-setosa"]     
 [9, 4.4, 2.9, 1.4, 0.2, "Iris-setosa"]     
 [10, 4.9, 3.1, 1.5, 0.1, "Iris-setosa"]    
 [11, 5.4, 3.7, 1.5, 0.2, "Iris-setosa"]    
 [12, 4.8, 3.4, 1.6, 0.2, "Iris-setosa"]    
 [13, 4.8, 3.0, 1.4, 0.1, "Iris-setosa"]    
 ⋮                                          
 [139, 6.0, 3.0, 4.8, 1.8, "Iris-virginica"]
 [140, 6.9, 3.1, 5.4, 2.1, "Iris-virginica"]
 [141, 6.7, 3.1, 5.6, 2.4, "Iris-virginica"]
 [142, 6.9, 3.1, 5.1, 2.3, "Iris-virginica"]
 [143, 5.8, 2.7, 5.1, 1.9, "Iris-virginica"]
 [144, 6.8, 3.2, 5.9, 2.3, "Iris-virginica"]
 [145, 6.7, 3.3, 5.7, 2.5, "Iris-virginica"]
 [146, 6.7, 3.0, 5

As you can see, the dataset has 150 elements. We will now remove 20 random elements from it to use as testing data.
This is the layout of the dataset:

|ID|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|Species|
|---|---|---|---|---|---|
|1|5.1|3.5|1.4|0.2|"Iris-setosa"|
|2|4.9|3.0|1.4|0.2|"Iris-setosa"|
|...|...|...|...|...|...|

-----------

In [12]:
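testdata = []
traindata = copy(dataset) # Copy, so that the original dataset is left untouched
for _ in 1:20
    # Pick a random row that is still in the training data
    randindex = rand(1:length(traindata))
    # splice! removes that row from traindata and returns it
    push!(testdata, splice!(traindata, randindex))
end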
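
The same kind of split can also be expressed with a random permutation of the row indices. The following is only a minimal sketch, not part of the original notebook: `split_train_test` is a hypothetical helper, and the integers `1:150` stand in for the real dataset rows.

```julia
using Random

# Hypothetical helper: hold out `ntest` random items of `data` as test data.
function split_train_test(data, ntest; rng = Random.default_rng())
    perm = randperm(rng, length(data))   # random ordering of all indices
    testidx = perm[1:ntest]              # the first `ntest` indices form the test set
    trainidx = perm[ntest+1:end]         # the remaining indices are used for training
    return data[trainidx], data[testidx]
end

rows = collect(1:150)                    # stand-in for the 150 dataset rows
train, test = split_train_test(rows, 20; rng = MersenneTwister(42))
println(length(train), " training rows, ", length(test), " test rows")
```

Because every index appears exactly once in the permutation, no row can end up in both sets.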

In [13]:
# Let us create a dictionary that maps each species name to a one-hot label vector
labelDict = Dict("Iris-setosa" => Flux.onehot(3, 1:3), 
    "Iris-versicolor" => Flux.onehot(2, 1:3), 
    "Iris-virginica" => Flux.onehot(1, 1:3))

Dict{String,Flux.OneHotVector} with 3 entries:
  "Iris-virginica"  => Bool[1, 0, 0]
  "Iris-setosa"     => Bool[0, 0, 1]
  "Iris-versicolor" => Bool[0, 1, 0]

## Training

First we are going to write a function that builds batches. It returns the result of `Flux.batch`, which combines a list of arrays into a single array with one extra dimension, so that the whole batch can be passed through the model at once.

In [16]:
function create_batch(data)
    inputData = []
    labels = []
    for i in 1:length(data)
        SepalLengthCm = data[i][2]
        SepalWidthCm = data[i][3]
        PetalLengthCm = data[i][4]
        PetalWidthCm = data[i][5]
    
        label = labelDict[data[i][6]]
        
        push!(inputData, [SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm])
        push!(labels, label)
    end
    return (Flux.batch(inputData), Flux.batch(labels))
end

create_batch (generic function with 1 method)
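
To get a feel for what batching does to the feature vectors, here is a rough sketch that uses only base Julia. It assumes that stacking the vectors as columns with `hcat` gives the same layout as the batched input; the three sample rows are copied from the dataset output above.

```julia
# Three feature vectors: [SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm]
samples = [[5.1, 3.5, 1.4, 0.2],
           [4.9, 3.0, 1.4, 0.2],
           [6.7, 3.1, 5.6, 2.4]]

X = reduce(hcat, samples)   # 4×3 matrix with one column per flower
println(size(X))            # (4, 3)
println(X[:, 1])            # [5.1, 3.5, 1.4, 0.2]
```

Each column is one flower and each row is one feature, which is the shape a layer with 4 inputs expects.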

`Flux.onehot` creates a one-hot vector in which exactly one value is true and all the others are false. We can use it to label our inputs.

In [14]:
Flux.onehot(2, 1:3) # Second element as true

3-element Flux.OneHotVector:
 0
 1
 0
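
The idea behind one-hot encoding can be reproduced with plain Julia arrays. In this sketch `my_onehot` is a hypothetical stand-in, not the Flux implementation, and the species are listed in the order used by `labelDict`.

```julia
# Hypothetical stand-in for a one-hot encoder, written with plain Julia arrays.
my_onehot(label, labels) = [l == label for l in labels]

# Same ordering as labelDict: virginica => 1, versicolor => 2, setosa => 3
species = ["Iris-virginica", "Iris-versicolor", "Iris-setosa"]

v = my_onehot("Iris-setosa", species)
println(v)           # Bool[0, 0, 1]
println(argmax(v))   # 3, the position of the single true value
```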

In [17]:
trainbatch = create_batch(traindata)

([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], Bool[0 0 … 1 1; 0 0 … 0 0; 1 1 … 0 0])

## Most important and easy part
We are now going to create our neural network. `Dense(4, 3, σ)` is a single dense (fully connected) layer with 4 inputs (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm) and 3 outputs, one score per species, which are compared against one-hot labels such as [0, 0, 1].

We are going to use the sigmoid function `σ` as the activation. It squashes each of the 3 outputs into the range between 0 and 1.

In [22]:
model = Dense(4, 3, σ)
L(x, y) = Flux.mse(model(x), y) # This is the loss function to test how our model is doing
ps = Flux.params(model)
opt = Descent() # This is the optimiser we will be using
@time Flux.train!(L, ps, [trainbatch], opt)

  0.134458 seconds (82.32 k allocations: 3.905 MiB)
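
A dense layer with a sigmoid activation boils down to a matrix multiplication, a bias and an element-wise σ. The sketch below writes this out by hand. The weights `W` and bias `b` are made-up values for illustration only; Flux picks its own initial values, and `dense` and `sigmoid` are hypothetical names.

```julia
sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical 3×4 weight matrix and bias vector (4 inputs, 3 outputs).
W = [ 0.1 -0.2  0.3  0.0;
      0.0  0.1 -0.1  0.2;
     -0.3  0.2  0.1  0.1]
b = zeros(3)

# Forward pass of the layer: σ.(W * x .+ b)
dense(x) = sigmoid.(W * x .+ b)

x = [5.1, 3.5, 1.4, 0.2]      # one flower
y = dense(x)
println(length(y))            # 3
println(all(0 .< y .< 1))     # true
```

The same function also accepts a 4×N matrix of flowers, because `.+` broadcasts the bias over every column.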


In [27]:
Flux.train!(L, ps, Iterators.repeated(trainbatch, 1000), opt)
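
Each training step moves the parameters a small distance against the gradient of the loss. The sketch below shows that update rule on a toy loss whose gradient is known in closed form. It is an illustration under assumptions, not what Flux runs internally: `descend`, `loss`, `grad` and `target` are hypothetical names, and the step size 0.1 is chosen because, as far as I know, it is the default learning rate of `Descent()`.

```julia
# Toy loss with its minimum at `target`; its gradient is 2 .* (p .- target).
target = [1.0, -2.0]
loss(p) = sum((p .- target) .^ 2)
grad(p) = 2 .* (p .- target)

# Plain gradient descent: repeat p <- p - η * gradient.
function descend(p, η, steps)
    p = copy(p)                 # do not modify the caller's array
    for _ in 1:steps
        p .-= η .* grad(p)
    end
    return p
end

p0 = [0.0, 0.0]
p = descend(p0, 0.1, 1000)
println(loss(p0), " -> ", loss(p))
```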

Now we can look at the loss value after training on the dataset to see how the model is doing.

In [28]:
L(trainbatch...)

0.07934447f0
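
The mean squared error is the average of the squared differences between the model outputs and the labels. As a sanity check, it can be computed by hand. In this sketch `my_mse` is a hypothetical helper and the numbers are made up; they are not outputs of the trained model.

```julia
# Mean squared error: average of the squared element-wise differences.
my_mse(yhat, y) = sum((yhat .- y) .^ 2) / length(y)

yhat = [0.1 0.8; 0.2 0.1; 0.7 0.1]   # made-up outputs for two flowers
y    = [0 1; 0 0; 1 0]               # matching one-hot labels
println(my_mse(yhat, y))
```

A value close to 0 means the outputs are close to the one-hot labels.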

In [29]:
testbatch = create_batch(testdata)

([6.3 6.9 … 5.3 5.1; 2.5 3.1 … 3.7 3.8; 5.0 5.4 … 1.5 1.5; 1.9 2.1 … 0.2 0.3], Bool[1 1 … 0 0; 0 0 … 0 0; 0 0 … 1 1])

In [30]:
L(testbatch...)

0.06459588f0
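
The loss alone does not say how many flowers are classified correctly. One possible way to measure that is to take the index of the largest output in each column as the predicted class and compare it with the label. This is only a sketch with made-up numbers: `classes` and `accuracy` are hypothetical helpers, and `scores` is not output from the model above.

```julia
# Index of the largest entry in each column: the class of each flower.
classes(M) = [argmax(M[:, j]) for j in 1:size(M, 2)]

# Fraction of columns where the predicted class matches the labelled class.
accuracy(scores, labels) = sum(classes(scores) .== classes(labels)) / size(labels, 2)

scores = [0.1 0.8 0.3; 0.2 0.1 0.6; 0.7 0.1 0.1]   # made-up outputs for three flowers
labels = Bool[0 1 1; 0 0 0; 1 0 0]                 # one-hot labels, one column per flower
println(classes(scores))
println(accuracy(scores, labels))
```

With the real data, the first argument would be `model(testbatch[1])` and the second `testbatch[2]`.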