# Julia introduction


First of all, you can download Julia from https://julialang.org/downloads/

I'm using Julia 1.8.2 (the latest version at the time of writing).

To run Julia code in notebooks:
- After downloading Julia, open the Julia console and run:
  




In [None]:
using Pkg; Pkg.add("IJulia")

**Pkg is Julia's built-in package manager**.


## Activating an environment

In Julia you can activate an environment with the following (in the console):



In [None]:
using Pkg; Pkg.activate("PATH_TO_FOLDER"); Pkg.instantiate()

Replace "PATH_TO_FOLDER" with the path to the folder that contains your Project.toml file.

If you are already in that folder, use "." as the path.
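To check which environment is active after calling Pkg.activate, something like the following sketch should work. It uses a temporary folder so it doesn't touch any real project (`env_dir` and `project_file` are just illustrative names), and it skips Pkg.instantiate() because an empty environment has nothing to install.

```julia
using Pkg

# Illustrative only: a throwaway folder standing in for PATH_TO_FOLDER.
env_dir = mktempdir()

# Activate the environment stored in that folder.
Pkg.activate(env_dir)

# Base.active_project() returns the path of the project file currently in use.
project_file = Base.active_project()
println(project_file)
```

You can also start Julia with `julia --project=PATH_TO_FOLDER` to activate the environment at launch.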

### Packages I'm using:

- "CSV" => v"0.10.7"
- "MLJ" => v"0.19.0"
- "BenchmarkTools" => v"1.3.2"
- "Missings" => v"1.0.2"
- "ScikitLearn" => v"0.6.5"
- "StatsBase" => v"0.33.21"
- "IJulia" => v"1.23.3"
- "LightGBM" => v"0.6.0"
- "MLJModels" => v"0.16.0"
- "DataFrames" => v"1.4.2"
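If you want to produce a similar list for your own environment, a minimal sketch is shown here. It relies on `Pkg.dependencies()`, which the Pkg documentation describes as experimental, so treat it as one possible approach rather than the way the list above was generated. The name `direct` is just illustrative.

```julia
using Pkg

# Collect name => version for the packages added directly to the active environment.
direct = Dict(
    info.name => info.version
    for info in values(Pkg.dependencies())
    if info.is_direct_dep
)

# Print them in alphabetical order.
for name in sort(collect(keys(direct)))
    println(rpad(name, 16), " => ", direct[name])
end
```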

## Documentation <a id="docs"></a>

Here are some links that helped me out.

- [Official Documentation](https://docs.julialang.org/en/v1/)
- [DataFrames](https://dataframes.juliadata.org/stable/man/comparisons/#Comparison-with-the-Python-package-pandas) Julia's package to work with dataframes. It even has a comparison with pandas.
- [ScikitLearn.jl](https://juliapackages.com/p/scikitlearn) You probably don't want to use it. More on that later.
- [Statistics](https://docs.julialang.org/en/v1/stdlib/Statistics/) standard library and [StatsBase](https://juliastats.org/StatsBase.jl/stable/), a package that provides basic support for statistics.
- [MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/) (Machine Learning Framework for Julia) It has a lot of examples and well-structured documentation. [More examples](https://juliaai.github.io/DataScienceTutorials.jl/end-to-end/boston-lgbm/)
-  [Performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/)

You'll probably understand Julia well enough if you already know a lot about Python.
The sad part of Julia to this date, for me at least, is that I'm more used to Python, and Python doesn't only have great documentation: it has a HUGE community with millions of users and tutorials everywhere on the internet. Don't get me wrong, Julia has a lot of users, but the documentation could be better. Also, while doing this project I ran into some compatibility issues between libraries. So I think that if Julia starts getting more attention, it could shine and show its full capabilities for ML.

Next, I want to show two interesting things: the "! method" and using sklearn in Julia.



### The "! method"

Many Julia functions/methods have a ! "version", for example replace and replace!. By convention, the first one returns a modified copy of the object and the latter changes the object in place. Why is this important? Well... ! functions avoid allocating a new object, so they are often faster. Consider the following:

In [58]:
using DataFrames
using BenchmarkTools
using Random

a = collect(1:5000000) # Create an Int array of 5,000,000 numbers
b = [missing for i in range(0.0, 1000000.0)] # Create an array of missing values (1,000,001 of them, since the range includes both endpoints)
a = Float64.(a) # Convert to Float64
c = vcat(a, b) # Concatenate
shuffle!(c) # Shuffle in place
df = DataFrame([c], :auto) # The notebook displays the value of the last expression in the cell


| Row | x1 (Float64?) |
|-----|---------------|
| 1 | 4.26358e6 |
| 2 | 414927.0 |
| 3 | 3.96343e6 |
| 4 | 3.25033e6 |
| 5 | missing |
| 6 | 1.865e6 |
| 7 | 300258.0 |
| 8 | 4.89609e6 |
| 9 | 909949.0 |
| 10 | 4.473e6 |


In [59]:
# Now with our nice dataframe, let's test replace and replace!
@benchmark d = replace(df[:,"x1"], missing => 0) # 63.290 ms ± 35.548 ms after 79 samples

BenchmarkTools.Trial: 78 samples with 1 evaluation.
 Range (min … max):  34.833 ms … 150.999 ms  ┊ GC (min … max):  0.00% … 49.13%
 Time  (median):     48.612 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   64.130 ms ±  32.468 ms  ┊ GC (mean ± σ):  18.52% ± 19.51%

 [histogram omitted]

In [60]:
@benchmark replace!(df[:,"x1"], missing => 0) # 96 samples 45.079 ms ± 17.816 ms

BenchmarkTools.Trial: 111 samples with 1 evaluation.
 Range (min … max):  28.658 ms … 112.027 ms  ┊ GC (min … max):  0.00% … 61.42%
 Time  (median):     39.111 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   45.079 ms ±  17.816 ms  ┊ GC (mean ± σ):  11.49% ± 17.34%

 [histogram omitted]

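So... in this run the ! version is faster than the non-! function (a mean of about 45 ms versus about 64 ms), and in general ! functions tend to be faster because they don't allocate a copy. However, although many Julia functions have a ! variant, not all of them actually work (or at least not for now). I think all the Base functions have a working ! version and only some packages don't. So it's better to try.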
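To make the difference between the two versions concrete, here is a tiny sketch that uses only Base (no packages). The variable names are illustrative; the point is just that replace leaves its input alone while replace! overwrites it.

```julia
v = [1.0, missing, 3.0]

# replace returns a new array and leaves v untouched.
w = replace(v, missing => 0.0)
@assert w == [1.0, 0.0, 3.0]
@assert ismissing(v[2])

# replace! overwrites v itself and returns that same array.
r = replace!(v, missing => 0.0)
@assert r === v
@assert isequal(v, [1.0, 0.0, 3.0])
```

One thing to keep in mind with DataFrames: `df[:, "x1"]` returns a copy of the column, while `df[!, "x1"]` returns the stored vector. So calling replace! on `df[:, "x1"]` only changes that temporary copy, and you would need `df[!, "x1"]` to change the dataframe itself.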

### Using Sklearn on Julia

While you can use Sklearn models and pipelines, you probably don't want to. Why? Well, it's not an official implementation; in fact, it's not even implemented in Julia. It's a wrapper around Python's Sklearn. So each time you execute that code in Julia, you're running Python in the backend, which is probably not worth it. The same goes for TensorFlow.jl. In the MLJ library you have more optimized models and some preprocessing options.

If you want to try it, here is an example of how to translate "py_folder/pipeline.py" to Julia:

In [None]:
using ScikitLearn # provides @sk_import and DataFrameMapper
using ScikitLearn.Pipelines: Pipeline, make_pipeline
using DataFrames # DataFrameMapper works on DataFrames

@sk_import preprocessing: (OneHotEncoder, StandardScaler)
@sk_import impute: (SimpleImputer)

# Constants is assumed to be this project's module holding the column lists
PROCESS_PIPE = DataFrameMapper([
        (Constants.COLUMNS_NUM,[SimpleImputer(strategy="median"),
                                StandardScaler()]),
        (Constants.COLUMNS_STR,[SimpleImputer(strategy="most_frequent"),
                                OneHotEncoder(drop="first",sparse=false)]),
        (Constants.COLUMNS_BOOL,[SimpleImputer(strategy="most_frequent")]),
        ]; missing2NaN=true)
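For comparison, here is a rough sketch of similar preprocessing done with MLJ's own transformers instead of the Python wrapper. It is not a translation of the project's pipeline: the toy table `X` and its column names are made up for illustration, and it assumes MLJ is installed in the active environment. FillImputer, Standardizer and OneHotEncoder are MLJ's built-in transformers, and by default FillImputer fills numeric columns with the median and categorical columns with the mode.

```julia
using MLJ  # assumes MLJ is installed in the active environment

# Hypothetical toy table: one numeric and one categorical column, both with gaps.
X = (
    age  = [25.0, missing, 40.0, 31.0],
    city = ["A", "B", missing, "A"],
)

# MLJ works with scientific types, so mark the string column as categorical.
X = coerce(X, :city => Multiclass)

# Impute missing values, standardize numeric columns, one-hot encode categorical ones.
pipe = FillImputer() |> Standardizer() |> OneHotEncoder(drop_last=true)

mach = machine(pipe, X)
fit!(mach)
Xt = transform(mach, X)
```

Unlike the DataFrameMapper above, this pipeline picks columns by their scientific type rather than by explicit column lists, so the coerce step matters.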

