# Introduction to Mojo and System Programming
This guide assumes some familiarity with a programming language like Python, but does not assume knowledge of computer science fundamentals. It is a natural stepping stone from Python to Mojo, and should be useful for everyone from beginners to experienced programmers looking to get started with Mojo.

Some of these concepts might be brand new to you. Don't worry if they don't make sense right now; it takes a while for these things to click. Keep moving, and we'll work through a lot of examples to cement them in your mind.

## Basic types
Python is a beautiful and popular language that's easy to get started with, but it hides a lot of details from the programmer, and that comes at a performance cost.

Let's run some Python code and print the result in Mojo:

In [1]:
x = Python.evaluate("5 + 10")
print(x)

15


Mojo now has the same representation for `x` as Python uses, which is a pointer to a C object with a value, a type, and some other things. Let's explore this a little more using the Python interpreter. We can access all of Python's built-in functions by importing `builtins`:

In [2]:
let py = Python.import_module("builtins")

print(py.id(x))

139839161194784


`id()` gives us the address of the C object that `x` is pointing to. When we `print(x)`, it's actually taking the address stored in `x` and looking up the value at that location in your computer's RAM, which comes with a performance cost. Let's dive into this a little further by understanding `stack` and `heap` memory.
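Before we do, here is a minimal sketch of the same indirection seen from plain CPython (run outside Mojo). The variable names are just for illustration, and the exact address and size are implementation details that will differ between machines and Python versions:

```python
import sys

x = 5 + 10

# In CPython, id() returns the memory address of the object x refers to.
address = id(x)

# The object holds more than the raw value (a reference count, a type
# pointer, the digits), so it is larger than a bare 8-byte integer.
object_size = sys.getsizeof(x)

print(hex(address))
print(object_size, "bytes for the int object")
print(type(x).__name__)
```

The printed size is the cost of one small integer once all the bookkeeping fields are included.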

## Stack
This is a fast-access section of memory allocated in your computer's RAM. Take a simple program:

In [3]:
%%python
def double(a):
    return a * 2 

def quad(a):
    return a * 4 

a = 1

a = double(a)
a = quad(a)

If we represent the instructions in pseudo code, this is a simplified version of what your `stack` memory would look like as the program runs:

In [4]:
%%python
from pprint import pprint

stack = []
stack.append({"frame": "main", "a": 1, "function_calls": ["double(a)", "quad(a)"]})
stack.append({"frame": "double", "a": 1})

pprint(stack)

[{'a': 1, 'frame': 'main', 'function_calls': ['double(a)', 'quad(a)']},
 {'a': 1, 'frame': 'double'}]


The program starts by allocating the variables from `main` on the `stack`. The first function called is `double`, so its frame is then appended to the stack.

When `double` finishes running and returns its result, all of its variables are popped off the stack:

In [5]:
%%python
stack.pop()
stack[0]["a"] *= 2
print(stack)

[{'frame': 'main', 'a': 2, 'function_calls': ['double(a)', 'quad(a)']}]


This is why a `stack` is called Last In, First Out (LIFO): the `last` frame to be pushed onto the stack is the first one `out` of it.

Next, `main` calls `quad`. Its frame is appended to the stack and runs, then it returns, updating `a`, and is popped off the stack. Finally, `main` itself is popped off the stack as there are no more instructions to run, which ends the program:

In [6]:
%%python
stack.append({"frame": "quad", "a": 2})
stack[0]["a"] *= 4
stack.pop()
stack.pop()
print(stack)

[]
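The whole walkthrough above can be condensed into one small simulation. This is only a sketch in plain Python: `run_with_trace`, `push`, and `pop` are hypothetical helper names, and a real call stack stores far more than a name and one variable per frame.

```python
def run_with_trace():
    """Simulate the call stack for double() and quad() with a plain list."""
    stack = []
    trace = []  # snapshot of the frame names after every push and pop

    def push(frame_name, a):
        stack.append({"frame": frame_name, "a": a})
        trace.append([frame["frame"] for frame in stack])

    def pop():
        frame = stack.pop()  # the last frame pushed is the first one removed
        trace.append([f["frame"] for f in stack])
        return frame

    push("main", 1)

    # Call double(a): push its frame, compute, pop, store the result in main.
    push("double", stack[0]["a"])
    stack[0]["a"] = pop()["a"] * 2

    # Call quad(a): same pattern with a different multiplier.
    push("quad", stack[0]["a"])
    stack[0]["a"] = pop()["a"] * 4

    final_a = stack[0]["a"]
    pop()  # main returns and the stack is empty again
    return final_a, trace


final_a, trace = run_with_trace()
print(final_a)
for snapshot in trace:
    print(snapshot)
```

Each snapshot shows the stack growing by one frame on a call and shrinking by one on a return, which is the LIFO behaviour described earlier.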


## Heap

Heap memory is huge; it can use the remainder of the RAM available to your program. Python uses it for every object to provide us with conveniences. `a` in the previous example doesn't actually contain the value `1` at the start of the program; it contains the address of another place in memory, on the heap:

In [7]:
%%python
heap = {
    44601345678945: {
        "type": "int",
        "ref_count": 1,
        "size": 1,
        "digit": 1,
        #...
    }
}

So on the stack `a` looks more like this for each frame:

In [8]:
%%python
[
    {"frame": "main", "a": 44601345678945 }
]

Here `a` contains an address that points to the heap object. In Python, when we write something like:

In [9]:
a = "mojo"

`a` now points to a different C object on the heap, with a different representation:

In [10]:
%%python
heap = {
    "a": {
        "type": "string",
        "ref_count": 1,
        "size": 4,
        "ascii": True,
        # utf-8 / ascii for "mojo"
        "value": [109, 111, 106, 111]
        # ...
    }
}
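We can check this from plain CPython. The sketch below only assumes standard built-ins; the variable names are illustrative. It records the address and type before and after the assignment, and looks at the byte values that make up the text `"mojo"`:

```python
a = 1
first_address = id(a)
first_type = type(a).__name__

a = "mojo"  # the name a now refers to a different object
second_address = id(a)
second_type = type(a).__name__

# Byte values of the text, matching the "value" field in the sketch above.
encoded = list(a.encode("utf-8"))

print(first_type, "->", second_type)
print(first_address != second_address)
print(encoded)
```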

This allows Python to do nice, convenient things for us:
- Once the `ref_count` goes to zero, the object will be deallocated from the heap during garbage collection, so the OS can use that memory for something else
- An integer can grow beyond 64 bits by increasing `size`
- We can dynamically change the `type`
- The data can be large or small, and we don't have to worry about whether to allocate it on the stack or the heap

However, this also comes with a penalty: a lot of extra memory is used for the extra fields, and it takes CPU instructions to allocate the data, retrieve it, deallocate it, and so on.
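A couple of these conveniences, and their cost, can be observed directly. This is a rough sketch for CPython only: `sys.getsizeof` and `sys.getrefcount` report implementation details, so the exact numbers vary between versions and platforms.

```python
import sys

small = 1
large = 2 ** 100  # too big for a 64-bit integer

# The int object grows to hold more digits.
small_size = sys.getsizeof(small)
large_size = sys.getsizeof(large)

# Every extra name that refers to an object bumps its reference count.
value = object()
count_before = sys.getrefcount(value)
alias = value
count_after = sys.getrefcount(value)

print(small_size, large_size)
print(count_before, count_after)
```

Both the growing size and the reference count updates are work the interpreter does on our behalf at run time.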

_If the `stack` and `heap` aren't making sense, [check out this video](https://www.youtube.com/watch?v=_8-ht2AKyH4); it uses some great visual aids._

In Mojo we can remove all that overhead:

## Mojo 🔥

In [11]:
x = 5 + 10
print(x)

15


We've just unlocked our first two Mojo optimizations! Instead of looking up an object on the heap via an address, `x` is now just a 64-bit value on the stack that can be passed through registers.

The performance implications of this alone run very deep:

- Allocation, deallocation, and indirection are all expensive and are no longer required
- The compiler can do huge optimizations in loops and other things when it knows what the numeric type is
- The value can be passed through registers for simple operations
- The data can now be packed into a vector for huge performance gains

That last one is very important in today's world. Let's see how Mojo gives us the power to take advantage of modern hardware.
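As a rough Python-side analogy for what "packed" means, the sketch below compares separate int objects with the standard library's `array` module, which stores raw values back to back. It is not how Mojo works internally, and the sizes are CPython details that may vary.

```python
import array
import sys

values = [10, 20, 30, 40]

# A list holds pointers to separate int objects on the heap.
boxed_bytes = sum(sys.getsizeof(v) for v in values)

# array.array("q", ...) stores signed 64-bit integers contiguously,
# with no per-element object header.
packed = array.array("q", values)
packed_bytes = packed.itemsize * len(packed)

print(boxed_bytes, "bytes of int objects")
print(packed_bytes, "bytes of packed values")
```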

## SIMD

SIMD stands for `Single Instruction, Multiple Data`. Modern hardware contains special registers that allow you to perform the same operation on multiple values in a single instruction, greatly improving performance. Let's take a look:

In [15]:
from DType import DType

y = SIMD[DType.ui16, 4](1, 2, 3, 4)
print(y)

[1, 2, 3, 4]


This is now a vector of four 16-bit numbers packed into 64 bits. It takes up the same space as a single 64-bit `Int`, and we can operate on all four values with a single instruction instead of 4 separate instructions:

In [16]:
y *= 10
print(y)

[10, 20, 30, 40]
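To make the packing concrete, here is a small sketch in plain Python using the standard `struct` module. It only emulates the layout: the multiplication is an ordinary loop, not a real SIMD instruction, and masking to 16 bits is an assumption made for this sketch so each result still fits in its lane.

```python
import struct

lanes = [1, 2, 3, 4]

# "<4H" means four little-endian unsigned 16-bit integers, back to back.
packed = struct.pack("<4H", *lanes)
total_bits = len(packed) * 8

# Scalar emulation of `y *= 10`, one lane at a time.
scaled = [(lane * 10) & 0xFFFF for lane in lanes]
repacked = struct.pack("<4H", *scaled)

print(total_bits)
print(scaled)
print(struct.unpack("<4H", repacked))
```

The four values occupy exactly 64 bits, which is why they fit where a single 64-bit integer would.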


You can also initialize it just using a single argument:

In [19]:
z = SIMD[DType.ui16, 4](1)
print(z)

[1, 1, 1, 1]


Or do it in a loop:

In [17]:
for i in range(3):
    print(SIMD[DType.ui16, 4](i))

[0, 0, 0, 0]
[1, 1, 1, 1]
[2, 2, 2, 2]


You can also access a single item in the vector:

In [26]:
print(z[0])

1


Many modern CPUs have vector registers up to 512 bits wide, so this becomes a very powerful way to improve processing speed. Think about a string: each character is generally 8 bits of data, so we can pack 64 characters into a single 512-bit register and perform one operation on all of them at once, which can be up to 64x faster than handling one character at a time.
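The arithmetic behind that figure is simple division. The helper below is a hypothetical name used only for illustration, and the register widths are examples rather than a statement about any particular CPU:

```python
def lanes_per_register(register_bits: int, element_bits: int) -> int:
    """Return how many elements of a given width fit in one register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


print(lanes_per_register(512, 8))   # 8-bit characters in a 512-bit register
print(lanes_per_register(64, 16))   # 16-bit values in 64 bits
```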

## Exercises
1. Create a SIMD vector of `DType.ui8` that is 16 elements (16 bytes) wide with every value set to 2, then multiply it by 8 and print it
2. Create a loop using SIMD that prints four rows of data that look like this:
    [1,0,0,0]
    [0,2,0,0]
    [0,0,3,0]
    [0,0,0,4]

## Solutions
### Exercise 1

In [24]:
print(SIMD[DType.ui8, 16](2) * 8)

[16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]


### Exercise 2

In [30]:
for i in range(4):
    vec = SIMD[DType.ui8, 4](0)
    vec[i] = i + 1
    print(vec)

[1, 0, 0, 0]
[0, 2, 0, 0]
[0, 0, 3, 0]
[0, 0, 0, 4]
