Computer scientists often classify programming languages into the following two categories:

*High level* languages aim to maximize productivity by

* being easy to read, write and debug
* automating standard tasks (e.g., memory management)
* being interactive, etc.

*Low level* languages aim for speed and control, which they achieve by

* being closer to the metal (direct access to CPU, memory, etc.)
* requiring a relatively large amount of information from the user (e.g., all data types must be specified)

Traditionally we understand this as a trade-off:

* high productivity or high performance
* optimized for humans or optimized for machines

One of the great strengths of Julia is that it pushes out the curve, achieving both high productivity and high performance with relatively little fuss

The word “relatively” is important here, however…

In simple programs, excellent performance is often trivial to achieve

For longer, more sophisticated programs, you need to be aware of potential stumbling blocks

## How machine code is generated:

### AOT compiled languages (C/C++, Fortran)

* Users have to provide a **lot** of detail on types etc.

This means:
* Coding can be tedious
* Loss of interactivity

### Interpreted languages (Python, Matlab, etc.)

* Code is translated and executed "on the fly", with no separate compilation step
* We don't have to specify types or fiddle with memory allocation

This means:
* Runtime has to check the type of each object one at a time
    * Substantial overhead
* Overhead also in accessing the data values themselves
* Slow and unwieldy machine code

### Just-in-time compiled languages (Julia)

Middle ground between the two:

* Functions are compiled as requested
* The compiler will attempt to infer missing information (e.g. the types of variables)
* Machine code efficiency scales with the amount of information specified

When all the necessary information is provided or can be inferred, Julia can run as fast as an AOT compiled language

In [1]:
function f(a, b)
    y = (a + 8b)^2
    return 7y
end

f (generic function with 1 method)

In [2]:
z = f(1,2)

2023

Now the JIT compiler knows the types of a and b (since we called f with integer arguments), and can infer the types of the other variables in the function (e.g. y will be an integer as well).
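
One way to check this inference is to ask Julia for the inferred return type. The sketch below repeats `f` so that it runs on its own and uses `Base.return_types`, an introspection helper in Base. The variable names are illustrative only.

```julia
# Same definition as in the cell above, repeated so the snippet is self-contained
function f(a, b)
    y = (a + 8b)^2
    return 7y
end

# Inferred return type for a few combinations of argument types
int_result   = first(Base.return_types(f, (Int, Int)))          # integers in, integer out
mixed_result = first(Base.return_types(f, (Float64, Int)))      # the integer is promoted
float_result = first(Base.return_types(f, (Float64, Float64)))

println(int_result, " ", mixed_result, " ", float_result)
```

Each combination of argument types gets its own compiled specialization of `f`, which is why the first call with a new combination of types pays a compilation cost.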

In [3]:
@code_native f(1, 2)

	.text
; Function f {
; Location: In[1]:2
	pushq	%rbp
	movq	%rsp, %rbp
; Function +; {
; Location: int.jl:53
	leaq	(%rcx,%rdx,8), %rcx
;}
; Function literal_pow; {
; Location: intfuncs.jl:243
; Function *; {
; Location: int.jl:54
	imulq	%rcx, %rcx
;}}
; Location: In[1]:3
; Function *; {
; Location: int.jl:54
	leaq	(,%rcx,8), %rax
	subq	%rcx, %rax
;}
	popq	%rbp
	retq
	nopl	(%rax)
;}


In [15]:
@time f(2,3)
@time f(4,1) # notice it's really fast for integers now

  0.000002 seconds (5 allocations: 176 bytes)
  0.000003 seconds (5 allocations: 176 bytes)


1008

In [16]:
@time f(2.0, 3) # But takes longer for floats the first time around

  0.005848 seconds (13.71 k allocations: 772.420 KiB)


4732.0

In [17]:
@time f(2.0, 3.5)

  0.010885 seconds (18.06 k allocations: 1004.903 KiB)


6300.0

In this example the JIT compiler produces very efficient machine code. However, there are situations where it generates messy machine code.

### Example: Global variables

In [37]:
b = 1.0 # global variable

function g(a)
    for i in 1:1_000_000
        tmp = a + b
    end
end

g (generic function with 2 methods)

In [40]:
@time g(10)
@time g(10) # this doesn't really get faster though

  0.032688 seconds (1.00 M allocations: 15.259 MiB, 2.31% gc time)
  0.030455 seconds (1.00 M allocations: 15.259 MiB)


In [20]:
@code_native g(1.0)

	.text
; Function g {
; Location: In[18]:4
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r12
	pushq	%rsi
	pushq	%rdi
	pushq	%rbx
	andq	$-32, %rsp
	subq	$128, %rsp
	vmovaps	%xmm6, -64(%rbp)
	vmovaps	%xmm0, %xmm6
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%ymm0, 32(%rsp)
	movl	$jl_get_ptls_states, %eax
	vzeroupper
	callq	*%rax
	movq	%rax, %rsi
	movq	$4, 32(%rsp)
	movq	(%rsi), %rax
	movq	%rax, 40(%rsp)
	leaq	32(%rsp), %rax
	movq	%rax, (%rsi)
	movl	$1000000, %ebx          # imm = 0xF4240
	movabsq	$jl_gc_pool_alloc, %r14
	movabsq	$jl_apply_generic, %r15
	leaq	88(%rsp), %r12
	nop
; Location: In[18]:5
L112:
	movq	267151560, %rdi
	movq	%rdi, 48(%rsp)
	movl	$1488, %edx             # imm = 0x5D0
	movl	$16, %r8d
	movq	%rsi, %rcx
	callq	*%r14
	movq	$jl_system_image_data, -8(%rax)
	vmovsd	%xmm6, (%rax)
	movq	%rax, 56(%rsp)
	movq	$jl_system_image_data, 88(%rsp)
	movq	%rax, 96(%rsp)
	movq	%rdi, 104(%rsp)
	movl	$3, %edx
	movq	%r12, %rcx
	callq	*%r15
; Function iterate; {
; Location: range.jl:
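
The calls to `jl_gc_pool_alloc` and `jl_apply_generic` in the listing above reflect heap allocation and dynamic dispatch inside the loop: the type of a non-constant global can change at any time, so the compiler cannot assume it. A minimal sketch of how to see this, using the hypothetical names `b_untyped`, `add_global` and `add_arg`:

```julia
b_untyped = 1.0                    # non-constant global: its type may change later

add_global(a) = a + b_untyped      # reads the global inside the function
add_arg(a, b) = a + b              # receives everything as arguments

# Ask the compiler what it can infer about each result
global_type = first(Base.return_types(add_global, (Float64,)))
arg_type    = first(Base.return_types(add_arg, (Float64, Float64)))

println("with global:   ", global_type)
println("with argument: ", arg_type)
```

When the inferred type is `Any`, the result has to be handled generically at run time, which is the source of the allocations reported by `@time` above.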

### Without global variable:

In [35]:
function g(a, b)
    for i in 1:1_000_000
        tmp = a + b
    end
end

g (generic function with 2 methods)

In [36]:
@time g(1.0, 1.0) # about 5x faster than the global version, even including compilation
@time g(1.0, 1.0) # and roughly 10,000x faster once compiled!

  0.006150 seconds (11.15 k allocations: 610.749 KiB)
  0.000003 seconds (4 allocations: 160 bytes)


In [28]:
@code_native g(1.0, 1.0) # and much cleaner machine code

	.text
; Function g {
; Location: In[21]:2
	pushq	%rbp
	movq	%rsp, %rbp
; Location: In[21]:3
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)
;}
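
Note that the listing above contains no loop at all: `tmp` is never used, so the compiler removes the additions and the second timing mostly measures call overhead. A hedged variant that keeps the work is to accumulate and return a value. `g_sum` is a hypothetical name used only for illustration.

```julia
# Variant of g(a, b) whose return value depends on the loop,
# so the additions cannot simply be discarded as unused
function g_sum(a, b)
    total = zero(a + b)          # accumulator with the same type as a + b
    for i in 1:1_000_000
        total += a + b
    end
    return total
end

result = g_sum(1.0, 1.0)
println(result)
```

Timing `g_sum` with and without a global variable gives a fairer comparison than timing a loop that is optimized away.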


## Using "const"

In [42]:
const c = 1.0
function g(a)
    for i in 1:1_000_000
        tmp = a + c
    end
end

g (generic function with 2 methods)

In [45]:
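@time g(1.0, 1.0) # fast from the start, but note: two arguments dispatch to g(a, b) from In [35], already compiled
@time g(1.0, 1.0) # call g(1.0) to time the const version defined above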

  0.000003 seconds (4 allocations: 160 bytes)
  0.000003 seconds (4 allocations: 160 bytes)


In [46]:
@code_native g(1.0, 1.0) # this inspects g(a, b); use g(1.0) to inspect the const version

	.text
; Function g {
; Location: In[35]:2
	pushq	%rbp
	movq	%rsp, %rbp
; Location: In[35]:3
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)
;}
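
To see the effect of `const` on the one-argument form directly, the inferred return type can be checked as in the sketch below. The names `c_offset` and `add_const` are hypothetical and chosen to avoid clashing with the cells above.

```julia
const c_offset = 1.0               # constant global: its type can no longer change

add_const(a) = a + c_offset        # the compiler can rely on c_offset being a Float64

const_type = first(Base.return_types(add_const, (Float64,)))
println(const_type)
```

Because the type of a `const` global is fixed, the function can be compiled as if the value had been passed in as a typed argument.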


## Conclusion:

Give the compiler concrete type information (e.g. pass values as function arguments or declare globals `const`) wherever it cannot infer the types on its own

## Tips

Use functions to separate operations cleanly. This also helps performance, because Julia compiles a specialized version of each function for the concrete types of the arguments it is called with
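
A minimal sketch of this idea, under the assumption that the data type is only known at run time. `sum_kernel` and `process` are hypothetical names for illustration.

```julia
# Kernel: compiled separately for each concrete array type it receives
function sum_kernel(xs)
    s = zero(eltype(xs))
    for x in xs
        s += x
    end
    return s
end

# Outer function: `xs` is a Vector{Int} or a Vector{Float64} depending on a
# runtime value, so it has no single concrete type here
function process(use_ints)
    xs = use_ints ? [1, 2, 3] : [1.0, 2.0, 3.0]
    return sum_kernel(xs)    # the loop runs inside a kernel specialized for the actual type
end

kernel_type = first(Base.return_types(sum_kernel, (Vector{Float64},)))
println(process(true), " ", process(false), " ", kernel_type)
```

The type uncertainty is confined to the single call into `sum_kernel`, while the loop itself runs in code specialized for one concrete element type.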