# Introduction to Julia

**Statistical Genetics Short Course**, <http://www.genetics.ucla.edu/courses/statgene/Mendel/>  
**Dr. Hua Zhou**, [huazhou@ucla.edu](mailto:huazhou@ucla.edu)  
**Department of Biostatistics, UCLA**  
**Sep 20, 2017**

# Installation and setup

This Jupyter notebook is available at <https://github.com/Hua-Zhou/SGSC2017Colorado/blob/master/notebooks/13_Julia_Overview/13_Julia_Overview.ipynb> or <https://tinyurl.com/ycsqfmmz>.

There are at least three ways to run the examples in this notebook.

**Method 1**: **Copy and paste** the Julia commands to a Julia terminal  
0. Install Julia v0.6.0 from <https://julialang.org/downloads/>  
    
**Method 2**: Run **Jupyter notebook** on your computer. Install the IJulia package and clone the course repository with the following commands:  
```julia
# This will install the package
Pkg.add("IJulia") 
# Clone the Repo
Pkg.clone("https://github.com/Hua-Zhou/SGSC2017Colorado")
# Run the next two commands each time you open the REPL
using IJulia
notebook(detached = true, dir = Pkg.dir("SGSC2017Colorado/notebooks"))
```

**Method 3**: Run Jupyter notebooks in the cloud using **JuliaBox**
0. Go to [JuliaBox.com](https://www.juliabox.com)  
0. Sign in through a portal you prefer (Google, GitHub, or LinkedIn)  
0. Click the **Sync** tab  
0. Enter https://github.com/Hua-Zhou/SGSC2017Colorado into the **Git Clone URL**  
0. Click the **+** button  
0. Now you have a clone of the Jupyter notebooks in your JuliaBox, which you can modify and run in JuliaBox  
0. To install the required packages, open Julia in the **Console** tab and run  
```julia
Pkg.add("PkgName1")
Pkg.add("PkgName2")
...
```

# Required packages

This lecture uses the following packages. Install them with
```julia
Pkg.add("RCall")
Pkg.add("Distributions")
```

# What is Julia?

> Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments

* Started in 2009. First public release in 2012. 
  - Creators: Jeff Bezanson, Alan Edelman, Stefan Karpinski, Viral Shah
  - Current release v0.6.0
  - v1.0 is slated for release in late 2017

* Aims to solve the notorious **two-language problem** in scientific computing:  
  - The two-language problem is the practice of writing **prototype** code in a high-level language such as Python, R, or Matlab, and then rewriting it as **production** code in a low-level language such as C, C++, or Fortran
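
As a minimal sketch of what this means in practice, consider an explicit loop of the kind that an R or Python user would usually push down to C for speed. The function name `loop_sum_squares` is hypothetical and chosen only for illustration; it is not part of the course materials. In Julia the loop is written in the high-level language itself and compiled to native code on the first call.

```julia
# Hypothetical helper for illustration: sum of squares via an explicit loop.
function loop_sum_squares(x)
    s = zero(eltype(x))      # accumulator with the same element type as x
    for xi in x
        s += xi * xi
    end
    return s
end

x = collect(1.0:1000.0)
loop_sum_squares(x)                  # first call includes compilation
t = @elapsed loop_sum_squares(x)     # later calls run the compiled code
```

The first call is made separately so that the timed call measures only execution, not compilation.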

# Why do we (statistical geneticists) care?

Statistical geneticists are always juggling a dozen specialized software packages:

* SNP calling, sequence alignment: **Gatk**  
* Manipulation of sequence data: **VCFTools**  
* Manipulation of SNP data: **Plink**  
* Phasing and imputation of SNP data: **Impute**, **MaCH**, **Beagle**, **FastPhase**, **Mendel**, **Mendel-Impute**  
* Heritability analysis: **GCTA**  
* Admixture estimation: **EigenStrat**, **Structure**, **Admixture**  
* Association test: **Plink**, **Fast-LMM**, **Emma**, **Gemma**, **Fbat**, **Mendel**  
* Linkage analysis: **Merlin**, **Solar**, **Mendel**, **Sage**   
* Statistical analysis and visualization: **R**, **Matlab**  

Glue code: Python, Perl, R, Matlab

These programs  
* are implemented in different languages (C/C++, Fortran, JavaScript)  
* run on different platforms  
* require different input/output formats  

This paradigm creates an insurmountable barrier between users and developers.

<img src="./juggling.jpg" width="450" align="center"/>

## Dream world

A unified programming environment that  
* is efficient (genomic data is big)  
* eases new method development  
* fosters scientific collaboration  
* encourages reproducible research  
* is cross-platform (hardware and software)  
* embraces new technology such as parallel and cloud computing   

> users == developers

# Julia the Savior?

## Benchmark with other languages

<https://julialang.org/benchmarks/>

## Benchmark

The benchmark code `R-benchmark-25.R`, from [http://r.research.att.com/benchmarks/R-benchmark-25.R](http://r.research.att.com/benchmarks/R-benchmark-25.R), covers numerical operations commonly used in statistics. We ported it to [Julia](./benchmark_julia.jl) and report the run times in seconds (averaged over 10 runs) here. The speedup column is the R time divided by the Julia time, computed before rounding.

| Test | R 3.4.1 | Julia 0.6.0 | Speedup |  
|:-------- |:-------:|:-------:|:-------:|  
| Matrix creation, trans., deform. (2500 x 2500) | 0.17 | 0.25 | 0.67 |  
| Power of matrix (2500 x 2500, `A.^1000`) | 0.55 | 0.25 | 2.18 |  
| Quick sort ($n = 7 \times 10^6$) | 0.68 | 0.59 | 1.15 |  
| Cross product (2800 x 2800, $A^TA$) | 6.15 | 0.20 | 31.15 |  
| LS solution ($n = p = 2000$) | 12.83 | 0.15 | 88.34 |  
| FFT ($n = 2,400,000$) | 0.32 | 0.12 | 2.59 |  
| Eigen-values ($600 \times 600$) | 0.65 | 0.49 | 1.33 |  
| Determinant ($2500 \times 2500$) | 2.47 | 0.12 | 20.00 |  
| Cholesky ($3000 \times 3000$) | 2.87 | 0.14 | 20.00 |  
| Matrix inverse ($1600 \times 1600$) | 5.04 | 0.17 | 29.87 |  
| Fibonacci (vector calculation) | 0.22 | 0.19 | 1.17 |  
| Hilbert (matrix calculation) | 0.23 | 0.06 | 3.69 |  
| GCD (recursion) | 0.39 | 0.08 | 4.64 |  
| Toeplitz matrix (loops) | 0.04 | 0.00 | 49.69 |  
| Escoufiers (mixed) | 0.33 | 0.15 | 2.14 |  

Machine specs: Intel i7 (Skylake) @ 2.9GHz (4 physical cores, 8 threads), 16G RAM, Mac OS Sierra 10.12.6.

To run the benchmark on your own machine, download [benchmark.jl](./benchmark.jl), [benchmark_julia.jl](./benchmark_julia.jl), and [R-benchmark-25.R](R-benchmark-25.R) to the same folder and run 
```julia
include("benchmark.jl")
```
within Julia.
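
For readers who want to time a single operation by hand, the following is a minimal sketch of averaging over repeated runs after a warm-up call. The helper `average_time` is hypothetical; it is not taken from `benchmark.jl`, whose actual implementation may differ.

```julia
# Hypothetical helper (not part of benchmark.jl):
# average wall-clock time in seconds of f() over nruns calls.
function average_time(f, nruns)
    f()                          # warm-up call so compilation time is excluded
    total = 0.0
    for i in 1:nruns
        total += @elapsed f()
    end
    return total / nruns
end

# Example: time the cross product A'A on a small random matrix.
A = randn(200, 200)
avg = average_time(() -> A' * A, 10)
```

The warm-up call matters because the first call to a Julia function includes just-in-time compilation.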

## An example by Doug Bates

* An example from Dr. Doug Bates's slides [Julia for R Programmers](http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf).

* The task is to create a Gibbs sampler for the density  
$$
f(x, y) = k x^2 \exp(- x y^2 - y^2 + 2y - 4x), \quad x > 0
$$
using the conditional distributions
$$
\begin{eqnarray*}
  X | Y &\sim& \Gamma \left( 3, \frac{1}{y^2 + 4} \right) \\
  Y | X &\sim& N \left(\frac{1}{1+x}, \frac{1}{2(1+x)} \right).
\end{eqnarray*}
$$
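
These conditionals can be checked by reading the joint density as a function of one variable at a time. The short derivation below is added for reference and uses only the density stated above:
$$
\begin{eqnarray*}
  f(x \mid y) &\propto& x^{2} \exp \left\{ - x (y^2 + 4) \right\}, \quad x > 0, \\
  f(y \mid x) &\propto& \exp \left\{ - (1+x) y^2 + 2y \right\} \propto \exp \left\{ - (1+x) \left( y - \frac{1}{1+x} \right)^2 \right\}.
\end{eqnarray*}
$$
The first line is the kernel of a Gamma density with shape 3 and rate $y^2 + 4$, that is, scale $1/(y^2+4)$. The second is the kernel of a normal density with mean $1/(1+x)$ and variance $1/(2(1+x))$. In this notation the second argument of $\Gamma$ is a scale and the second argument of $N$ is a variance, whereas R's `rgamma` takes a rate and `rnorm` takes a standard deviation.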

* **R solution**. The `RCall.jl` package allows us to execute R code without leaving the `Julia` environment. We first define an R function `Rgibbs()`.

```julia
# Pkg.add("RCall")
using RCall

R"""
library(Matrix)
Rgibbs <- function(N, thin) {
    mat <- matrix(0, nrow=N, ncol=2)
    x <- y <- 0
    for (i in 1:N) {
        for (j in 1:thin) {
            x <- rgamma(1, 3, y * y + 4) # 3rd arg is rate
            y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
        }
        mat[i,] <- c(x, y)
    }
    mat
}
"""
```


We then time the generation of 10,000 samples with a thinning of 500:

```julia
# benchmark
@elapsed R"""
system.time(Rgibbs(10000, 500))
"""
```

* **Julia solution**. This is a Julia function for the simple Gibbs sampler:

```julia
# Pkg.add("Distributions")
using Distributions

function jgibbs(N, thin)
    mat = zeros(N, 2)
    x = y = 0.0
    for i in 1:N
        for j in 1:thin
            x = rand(Gamma(3.0, 1.0 / (y * y + 4.0)))
            y = rand(Normal(1.0 / (x + 1.0), 1.0 / sqrt(2.0(x + 1.0))))
        end
        mat[i, 1] = x
        mat[i, 2] = y
    end
    mat
end
```

```
jgibbs (generic function with 1 method)
```

Generate a bivariate sample of size 10,000 with a thinning of 500. How long does it take?

```julia
jgibbs(100, 5); # warm-up
@elapsed jgibbs(10000, 500)
```

We see a 50-fold speedup of `Julia` over `R` on this example, **without extra coding effort**!
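
For readers who have not installed `Distributions`, the same sampler can be sketched with Base functions only. The function name `gibbs_base` and the sampling approach are illustrative assumptions, not part of the original notebook. Because the shape parameter is the integer 3, a Gamma variate is drawn as the scale times a sum of three unit exponentials. No timing is claimed for this variant.

```julia
# Hypothetical Base-only variant of the sampler (illustrative, not from the notebook).
# Gamma(3, scale) is drawn as scale times the sum of three Exp(1) variates,
# each obtained by inverse transform as -log(1 - U) with U uniform on [0, 1).
function gibbs_base(N, thin)
    mat = zeros(N, 2)
    x = y = 0.0
    for i in 1:N
        for j in 1:thin
            scale = 1.0 / (y * y + 4.0)
            x = scale * (-log(1.0 - rand()) - log(1.0 - rand()) - log(1.0 - rand()))
            # Normal with mean 1/(x+1) and standard deviation 1/sqrt(2(x+1))
            y = 1.0 / (x + 1.0) + randn() / sqrt(2.0 * (x + 1.0))
        end
        mat[i, 1] = x
        mat[i, 2] = y
    end
    return mat
end

draws = gibbs_base(1000, 10)   # 1000 draws, keeping every 10th iteration
```

Each row of `draws` holds one retained `(x, y)` pair, in the same layout as the output of `jgibbs`.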

> As some of you may know, I have had a (rather late) mid-life crisis and run off with another language called Julia.   
>
> -- <cite>Doug Bates (on the `knitr` Google Group)</cite>

# Some resources for learning Julia

My (current) favorite tutorials:  
0. [Hands-on Julia](https://github.com/dpsanders/hands_on_julia) by Dr. David P. Sanders.  
0. [A Deep Introduction to Julia for Data Science and Scientific Computing](http://ucidatascienceinitiative.github.io/IntroToJulia/) by Chris Rackauckas.