# Basic data types

The states we have manipulated up until now are too limited in what they can contain: only integer numbers. This limitation makes it much more difficult to encode information which is naturally not understood as integer numbers, such as text (a name, a surname, a country, ...), a yes/no piece of information (whether or not a user has a driver's license), a rational number (the precise length of something in meters: 1.79), and more.

We might want to go the extra mile of trying to encode these other types of data by only using integers in a clever way, and indeed we could define:
- numbers for letters (`0` for `a`, `1` for `b`, etc.);
- `0` for _no_, `1` for _yes_;
- two numbers for fractions (`1` and `2` for `1/2 = 0.5`);
- ...

The issue with such a strategy is the confusion that comes from overloading a single construct, integers, with many unrelated meanings. Moreover, even simple operations would become much more complex: instead of just writing `0.5 + 0.25`, we would need to compute `1/2 + 1/4` on pairs of integers, which requires finding a common denominator, adding the numerators, and simplifying the result. Programs written this way would become long, needlessly complex, and unwieldy.
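
To get a feel for the extra work involved, here is a minimal Python sketch of adding two fractions encoded as pairs of integers. The helper name `add_fractions` and the `(numerator, denominator)` encoding are assumptions made purely for illustration.

```python
from math import gcd

def add_fractions(a, b):
    """Add two fractions, each encoded as a (numerator, denominator) pair."""
    (n1, d1), (n2, d2) = a, b
    # Bring both fractions to a common denominator, then add the numerators.
    numerator = n1 * d2 + n2 * d1
    denominator = d1 * d2
    # Simplify the result by dividing out the greatest common divisor.
    divisor = gcd(numerator, denominator)
    return (numerator // divisor, denominator // divisor)

encoded = add_fractions((1, 2), (1, 4))   # 1/2 + 1/4 using integers only
direct = 0.5 + 0.25                       # the same sum with a primitive type
print(encoded, direct)
```

The integer-only version needs a helper function and several arithmetic steps, whereas the version based on a dedicated data type is a single expression.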

Moreover, since these concepts are very important when encoding the solutions to many recurring problems, we want to treat them as _primitives_, instead of derived objects.


## Data types and operators
In order to be able to tackle this issue, we will now introduce the concept of **data types**.

A data type is a set, $T$, equipped with a series of distinct operations $Ops$.

The set, $T = \{ a, b, c, d, \dots \}$, contains all the elements that make up the data type. This is the static foundation of the data type, and for integers it would be the set of all integer numbers, $T = \mathbb{Z} = \{ 0, 1, -1, 2, -2, 3, -3, \dots \}$.

A data type contains more than just an unstructured collection of values. A data type also has some important connections between elements. We call these connections *structure*, as they determine a network of ties between the elements; these ties are the only paths that can be followed.

These connections are represented by the operators available, $Ops = \{ op^{a_1}_1, op^{a_2}_2, \dots \}$. An operator $op_i^{a_i}$ takes as input (connects) ${a_i}$ values from $T$, and yields as output (to) a single value in $T$. ${a_i}$ is called the _arity_ of the operator.

In the case of integers this set could for example be: $\{ +^2, -^2, -^1, \times^2, /^2 \}$. Notice that there are two distinct meanings for the same symbol, depending on its arity: $-^2$ and $-^1$. Indeed, subtraction and negation both use the minus symbol $-$, but with two arguments and one argument respectively. We see both in action in an expression such as $4 -^2 (-^1 3)$.

Notice that we usually do not bother with specifying the arity for very well known operators, but the arity is still quite important and will need, at some point, to be defined. When it is clear from context (almost always) we will not need to write it as a superscript to the operator. The expression $4 -^2 (-^1 3)$ then becomes the well known $4 - (-3)$.
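
As an illustration, Python's standard `operator` module exposes the two meanings of the minus symbol as two separate functions, one per arity. The variable names below are chosen only for this sketch.

```python
import operator

# Binary minus (arity 2): subtraction takes two integers.
difference = operator.sub(4, 3)

# Unary minus (arity 1): negation takes a single integer.
negated = operator.neg(3)

# The expression 4 - (-3) uses both meanings of the minus symbol.
combined = operator.sub(4, operator.neg(3))
print(difference, negated, combined)
```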

The fact that operators simply represent connections between elements of $T$ can be understood visually. Let us consider, for example, negation. Negation connects each element of the integers $\mathbb{Z}$ to an element of $\mathbb{Z}$:

<img src="images/integer_negation_set.png" alt="Integer negation set" style="width: 400px;"/>

Negation is an operator of arity one, which is simpler to visualise. We can make a first step in visualising operators of higher arity by showing $T$ multiple times, and drawing the operator as picking first one element from the first $T$, then another from the second $T$, etc., and finally arriving at the resulting element in the final $T$ (in the picture we show the *sum* operator):

<img src="images/integer_sum_two_steps.png" alt="Integer sum as pairs to singles" style="width: 800px;"/>

Another notation, perhaps less visually suggestive but very often used, would group elements in all possible combinations of the desired arity. The set of all tuples of a given arity is called the _Cartesian product_ of the set with itself. This leads us to defining an operator with arity greater than one as a link from a _Cartesian product_ $T \times T \times \dots \times T$ into $T$ itself (again, we show in the picture the *sum* operator):

<img src="images/integer_sum_pairs.png" alt="Integer sum as pairs to singles" style="width: 400px;"/>

Following the notations above, we will define an operator $op^n$ in a data type on $T$ as 

$$op^n : (T_1, T_2, \dots, T_n) \rightarrow T$$

The colon and the commas in the definition above tell us that $op^n$ will *accept* (or *take*) a series of $n$ parameters taken from $T$, which we call $T_1$, $T_2$, etc., and *return* (or *result in*) one value taken from $T$.

An alternative notation would emphasize the fact that the operands are given to the operator in a specific order, and as such giving an operand already corresponds to "following an arrow":

$$op^n : T_1 \rightarrow T_2 \rightarrow \dots \rightarrow T_n \rightarrow T$$


The two notations above are, for those familiar with functional programming, the uncurried and curried versions of the operator, respectively.
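
The difference between the tuple-based and the arrow-based notation can be sketched in Python. The function names `add_uncurried` and `add_curried` are hypothetical and only serve this example.

```python
def add_uncurried(pair):
    """Take a tuple from T x T in one go and return an element of T."""
    x, y = pair
    return x + y

def add_curried(x):
    """Take the first operand and return a function waiting for the second."""
    def add_x(y):
        return x + y
    return add_x

print(add_uncurried((3, 2)))   # all operands at once
print(add_curried(3)(2))       # one operand per arrow
```

Both versions compute the same sum; they differ only in whether the operands are supplied together or one at a time.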


### About arrows

Notice how we are not saying $3 + 2 = 5$, but rather $3 + 2 \rightarrow 5$. This apparent difference between what is traditionally seen in basic arithmetic and what we are presenting here is not just a random occurrence. When we say $3 + 2$ in programming, we are not defining a number, but rather a specification used to determine a number via computation. We call such a specification an _expression_, and expressions and statements are combined together to form programs.

Evaluating expressions is the process of following the arrows defined, for example, in the pictures above in order to slowly, one piece at a time, reduce the expression to a simpler and simpler form. We stop when we have reached a value of $T$, which we cannot simplify any further.

This means that $3 + 2$ _is not_ $5$, but we can **go** _from_ $3 + 2$ _to_ $5$.

Moreover, since our goal is automation, we require that following arrows must be a clear, unambiguous process: at every step of the computation it must be that at most one arrow can be followed. Notice that the arrow $3 + 2 \rightarrow 5$ has a clear direction: left to right. Arrows (or computations) move from a complex specification to a simpler answer, but not backwards. Indeed, suppose we were trying to go backwards from a simple answer such as $5$. How did we get to $5$? There is no way to determine this uniquely, as it could have been determined by:
- $0 + 5 \rightarrow 5$
- $1 + 4 \rightarrow 5$
- $2 + 3 \rightarrow 5$
- $10 - 5 \rightarrow 5$

and infinitely many other possibilities.

This means that our notion of *computing*, which is based on following arrows, trades time and information (the previous program and state are usually lost) for a simpler (but equivalent) formulation that is, hopefully, closer to the final result we are seeking.
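
The loss of information can be observed directly. In the following sketch (the tuple encoding of expressions is an assumption made for illustration), four different expressions are evaluated, and the results no longer reveal which expression produced them.

```python
import operator

# Each entry is (operator, left operand, right operand).
expressions = [
    (operator.add, 0, 5),
    (operator.add, 1, 4),
    (operator.add, 2, 3),
    (operator.sub, 10, 5),
]

# Following the arrows: every expression is reduced to a value.
results = [op(left, right) for op, left, right in expressions]

# Four distinct expressions collapse into a single value.
distinct = set(results)
print(results, distinct)
```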

## Some concrete data types

Let us move back into the realm of programming constructs. We will now define some useful concrete data types which are very often found in programming languages. Depending on the actual programming language, we will find different combinations of these data types. We will define them following the notation given previously, assuming that the operators have been "lifted" into our programming language. Evaluating the lifted operators simply requires following the corresponding arrows in the underlying definition.

This should remind us that there are multiple meanings associated with the same symbols: `1` somewhere in our code is not exactly the same thing as the number $1$ in the set of integers $\mathbb{Z}$, but we can convert from code to set and then back. The same applies to operations such as `+` or `-`: they exist both in the underlying domain and as syntactic symbols in the programming language. For this reason, we will write `<3+2>` to denote an expression, in which `3`, `+`, and `2` are symbols of our programming language, and $3+2$ to denote the underlying mathematical operation:

<img src="images/lifting_statements_arithmetics.png" alt="Lifting of semantics into arithmetics" style="width: 600px;"/>

Such a transformation which preserves the structure between two similar constructions is known as a _functor_ (or _homomorphism_).

### Void and Unit

Let us begin with the simplest data types of all: `void` and `unit`. `void` has no values. It is an empty data type with no operations. It is needed to denote computations which cause no change and produce no result whatsoever. `unit` has only one single value, which, depending on the language, is denoted `()`, `null`, or `None`. `unit` has no operations: with only a single value available, an operation could never tell us anything new. The name `unit` is derived from the fact that the set $T$ only contains a single value.

### Bool

The second data type we will study is a simple extension upon `unit`, in that it features two values. This data type is called `bool`, and its two values are called `True` and `False`.

We use `bool` whenever we must model two different states or situations which are mutually exclusive, such as `is_subscribed`, `wants_info_by_email`, `shields_on`, etc.

`bool`, like all non-trivial data types, has operators to manipulate one or more `bool` values. The simplest operator is `not`, a unary operator (`not: bool` $\rightarrow$ `bool`) which is very similar to negation of integer numbers. Its semantics are:
- `eval_expr(<not True>, S)` $\rightarrow$ `<False>`
- `eval_expr(<not False>, S)` $\rightarrow$ `<True>`
- `eval_expr(<not E>, S)` $\rightarrow$ `<not E'>` where `eval_expr(<E>, S)` $\rightarrow$ `<E'>`

<img src="images/boolean_not.png" alt="Boolean not" style="width: 400px;"/>

Binary operators are also available on `bool`, just like they are on `int`. The first combination operator is `and` (also called _conjunction_, written in mathematics as $\wedge$), which gives `False` when either of the two operands is `False`. Its semantics are based on the principle of short-circuiting: we keep evaluating the left-hand side of the operation until it becomes a constant. At that point we can determine whether we can stop early (it is `False`, so it does not matter what the right-hand side is), or whether we must proceed with the evaluation of the other side:
- `eval_expr(<E1 and E2>, S)` $\rightarrow$ `<E1' and E2>`, where `eval_expr(<E1>, S)` $\rightarrow$ `<E1'>`
- `eval_expr(<False and E>, S)` $\rightarrow$ `<False>`
- `eval_expr(<True and E>, S)` $\rightarrow$ `<E>`

An alternate formulation could simply evaluate both sides in a left-to-right direction, and when both operands are reduced to values we can just define their table of results:
- `eval_expr(<E1 and E2>, S)` $\rightarrow$ `<E1' and E2>`, where `eval_expr(<E1>, S)` $\rightarrow$ `<E1'>`
- `eval_expr(<B and E>, S)` $\rightarrow$ `<B and E'>`, where `eval_expr(<E>, S)` $\rightarrow$ `<E'>` and `B` is a boolean constant
- `eval_expr(<True and True>, S)` $\rightarrow$ `<True>`
- `eval_expr(<True and False>, S)` $\rightarrow$ `<False>`
- `eval_expr(<False and True>, S)` $\rightarrow$ `<False>`
- `eval_expr(<False and False>, S)` $\rightarrow$ `<False>`

The first formulation, based on short-circuiting, has an interesting advantage: it requires fewer steps to terminate in some cases. For example, `<False and (True and (not False))>` would not even look at the right-hand side (`(True and (not False))`). Moreover, the right-hand side might produce an error if it were evaluated while the left-hand side is `False`. With short-circuiting we avoid executing such faulty code. For these reasons we will rely on short-circuiting semantics.
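
The short-circuiting rules for `not` and `and` can be sketched as a small step-by-step evaluator. This is only an illustration under stated assumptions: expressions are encoded as nested tuples, the state `S` is omitted because no variables are involved, and the names `step` and `evaluate` are hypothetical.

```python
def step(expr):
    """Perform one evaluation step on a boolean expression.

    Expressions are tuples ("not", E) or ("and", E1, E2);
    Python booleans play the role of values.
    """
    op = expr[0]
    if op == "not":
        inner = expr[1]
        if isinstance(inner, bool):
            return not inner                # <not True> -> <False>, <not False> -> <True>
        return ("not", step(inner))         # <not E> -> <not E'>
    if op == "and":
        left, right = expr[1], expr[2]
        if left is False:
            return False                    # <False and E> -> <False>, E is never inspected
        if left is True:
            return right                    # <True and E> -> <E>
        return ("and", step(left), right)   # <E1 and E2> -> <E1' and E2>
    raise ValueError(f"unknown expression: {expr!r}")

def evaluate(expr):
    """Follow the arrows until a value is reached; return (value, number of steps)."""
    steps = 0
    while not isinstance(expr, bool):
        expr = step(expr)
        steps += 1
    return expr, steps

short = ("and", False, ("and", True, ("not", False)))
full = ("and", True, ("and", True, ("not", False)))
print(evaluate(short))   # stops after a single step
print(evaluate(full))    # must also evaluate the right-hand side
```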

The second combination operator is `or` (also called _disjunction_, written in mathematics as $\vee$), which gives `True` when either of the two operands is `True`. Its semantics are again based on the principle of short-circuiting: we keep evaluating the left-hand side of the operation until it becomes a constant. At that point we can determine whether we can stop early (it is `True`, so it does not matter what the right-hand side is), or whether we must proceed with the evaluation of the other side:
- `eval_expr(<E1 or E2>, S)` $\rightarrow$ `<E1' or E2>`, where `eval_expr(<E1>, S)` $\rightarrow$ `<E1'>`
- `eval_expr(<False or E>, S)` $\rightarrow$ `<E>`
- `eval_expr(<True or E>, S)` $\rightarrow$ `<True>`

Also for `or` we can define the alternate formulation which evaluates both sides in a left-to-right direction, and when both operands are reduced to values we can just define their table of results:
- `eval_expr(<E1 or E2>, S)` $\rightarrow$ `<E1' or E2>`, where `eval_expr(<E1>, S)` $\rightarrow$ `<E1'>`
- `eval_expr(<B or E>, S)` $\rightarrow$ `<B or E'>`, where `eval_expr(<E>, S)` $\rightarrow$ `<E'>` and `B` is a boolean constant
- `eval_expr(<True or True>, S)` $\rightarrow$ `<True>`
- `eval_expr(<True or False>, S)` $\rightarrow$ `<True>`
- `eval_expr(<False or True>, S)` $\rightarrow$ `<True>`
- `eval_expr(<False or False>, S)` $\rightarrow$ `<False>`

Notice that `and` takes precedence over `or`: `True or False and True` will be parsed as `True or (False and True)`.
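
Python's own `or` and `and` follow the short-circuiting semantics and the precedence described above, which the following sketch makes observable. The helper `loud` is hypothetical: it simply records which operands were actually evaluated.

```python
def loud(value, log):
    """Record that this operand was evaluated, then return it unchanged."""
    log.append(value)
    return value

log = []
result = loud(True, log) or loud(False, log)   # the right-hand side is skipped

# `and` binds tighter than `or`, so the first line is parsed like the second.
implicit = True or False and False
and_first = True or (False and False)
or_first = (True or False) and False
print(result, log, implicit, and_first, or_first)
```

The operands `True or False and False` are chosen so that the two possible groupings give different results, which makes the precedence visible.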

We can draw both operators visually by relying on set diagrams:

<img src="images/boolean_and_or.png" alt="Boolean operators: and, or" style="width: 500px;"/>

Take notice: the drawing corresponds to the tabular formulation of the evaluation function, in that it just "lists" all possible combinations of elements and links them to their results. This suggests that in the future we will often look for "smart algorithms", such as a short-circuiting evaluation, that encode the very same logic as the exhaustive table of combinations, while taking as few steps as possible to reach the answer given the input(s). The fewer the steps, the better the performance of the algorithm.
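
As a small sanity check, one can verify that the exhaustive table and the short-circuiting rule encode the same logic. The names `AND_TABLE` and `and_short_circuit` are assumptions of this sketch.

```python
from itertools import product

# Exhaustive table: every pair of operands is listed with its result.
AND_TABLE = {
    (True, True): True,
    (True, False): False,
    (False, True): False,
    (False, False): False,
}

def and_short_circuit(left, right):
    """Decide from the left operand alone whenever possible."""
    if left is False:
        return False
    return right

# Compare the two formulations on all possible inputs.
agree = all(
    AND_TABLE[(a, b)] == and_short_circuit(a, b)
    for a, b in product([True, False], repeat=2)
)
print(agree)
```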

### Int

Integer operators are perhaps some of the most well-known operators of all. These operators work on an infinite set of values, $\mathbb{Z}$, which contains all possible integer numbers. This set is so large that sums, differences, and products of its elements are still contained in the set itself, and without (as in the case of the booleans, $\mathbb{B}$) having to resort to repeating values.

The operators we will usually find for integers are the typical arithmetic operators such as sum `+`, subtraction `-`, multiplication `*` ($\times$ in the usual mathematical notation), division `/`, and remainder `%`.

Division in particular requires some careful attention. Division of integers can be interpreted in several ways: on one hand, we could be talking about division with truncation, $5 / 2 \rightarrow 2$; on the other hand, we could be talking about rational division without truncation, $5 / 2 \rightarrow 2.5$. Moreover, we can also define the remainder of division with truncation, meaning that $5 \% 2 \rightarrow 1$. Programming languages usually offer a way to distinguish between these operations. For example, Python uses `//` for integer division, which discards the fractional part of the quotient (for negative quotients Python rounds towards negative infinity, so `-5 // 2` gives `-3`), and `/` for non-truncated division. We will, in the following, adhere to the Python convention.

There are many more operators that we could reasonably define, and which are implemented in concrete programming languages, but we will not cover them all. Truly understanding their meaning is in many cases not trivial, and requires a background that we do not assume in this text. Moreover, much of everyday software engineering makes very little use of "traditional" mathematics as taught in high schools: very few "business applications" will make use of trigonometry, but they will make use of complex abstractions to represent and process data effectively according to complex domains. Data science or graphics applications, on the other hand, do make use of these operators and functions, but they require such a deep background in the foundations of the discipline that it is not feasible to cover it in a text such as this one. For this reason, we will only cover the concepts which are shared between the two domains of application (mathematics and programming languages), and leave the advanced mathematical applications to the motivated student, who will find such information in more specialized tomes.
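
Returning to division, the three related operators behave as follows in Python. The variable names are chosen only for this sketch.

```python
truncated = 5 // 2    # integer division
exact = 5 / 2         # non-truncated division, the result is a float
remainder = 5 % 2     # remainder of the integer division

# Quotient and remainder together rebuild the original number.
rebuilt = (5 // 2) * 2 + (5 % 2)

# Python's // rounds towards negative infinity, so with a negative
# operand it differs from truncation towards zero.
floored = -5 // 2
toward_zero = int(-5 / 2)
print(truncated, exact, remainder, rebuilt, floored, toward_zero)
```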

The various evaluation rules of all operators roughly look like what we have seen in the previous chapter. Given any operator $\oplus$, we will first evaluate both operands as long as they remain expressions:
- `eval_expr(<E1` $\oplus$ `E2>, S)` $\rightarrow$ `<E1'` $\oplus$ `E2>`, where `eval_expr(<E1>, S)` $\rightarrow$ `<E1'>`
- `eval_expr(<N` $\oplus$ `E>, S)` $\rightarrow$ `<N` $\oplus$ `E'>`, where `eval_expr(<E>, S)` $\rightarrow$ `<E'>` and `N` is an integer

When we reach a constant value on both sides of the operator, then it is time to perform the actual evaluation:
- `eval_expr(<N` $\oplus$ `M>, S)` $\rightarrow$ `<Q>` where `Q = ` $N \oplus M$ and `N, M` are both integers
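
The three rules above can be turned into a small evaluator. This is a sketch under stated assumptions: expressions are encoded as nested tuples `(symbol, E1, E2)`, the state `S` is omitted because no variables are used, and the names `OPERATORS` and `trace` are hypothetical.

```python
import operator

# Maps each syntactic symbol to its underlying mathematical operation.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "//": operator.floordiv,
    "%": operator.mod,
}

def eval_expr(expr):
    """Perform a single evaluation step on a binary integer expression."""
    symbol, left, right = expr
    if not isinstance(left, int):
        return (symbol, eval_expr(left), right)    # <E1 op E2> -> <E1' op E2>
    if not isinstance(right, int):
        return (symbol, left, eval_expr(right))    # <N op E> -> <N op E'>
    return OPERATORS[symbol](left, right)          # <N op M> -> <Q>

def trace(expr):
    """Collect every intermediate expression until a value is reached."""
    steps = [expr]
    while not isinstance(expr, int):
        expr = eval_expr(expr)
        steps.append(expr)
    return steps

# Encodes the expression <(2 * 3) + (10 - 4)>.
steps = trace(("+", ("*", 2, 3), ("-", 10, 4)))
for s in steps:
    print(s)
```

Notice that the same three rules handle every operator: only the final lookup in `OPERATORS` depends on which symbol is being evaluated.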

It does not matter which operator we are talking about: all semantic rules are the same. Notice once again the existence of the two parallel worlds: one is the world of programs and expressions, denoted for example as `<N+M>`; the other is the world of mathematical abstractions, denoted for example as $N+M$.

We can agree on the idea that, while programming languages are a human invention, mathematics exists independently of us and is discovered, like an unknown archipelago. Since mathematics already exists, and we are not really in a position to invent a new one, what we can do to build our programming languages is to translate their constructs in a way that ultimately relies on mathematical operations when it is time to _actually do something_. The computer is little more than a very fast, physical embodiment of the basic rules of arithmetic, together with state storage and representation.

#### Float

Most modern languages also offer floating-point numbers, which are rational ($\mathbb{Q}$) instead of real ($\mathbb{R}$). Rational numbers can be represented with a finite amount of information, whereas a single real number can contain an infinite amount of information. Infinite information cannot fit in a computer, which can only perform computations on finite data.

Such numbers are usually written `3.5`, `2.1`, `0.001`, etc. Floating-point numbers usually feature roughly the same operators as integers. Given this strong correspondence with integers, many languages automatically convert between `int` and `float` whenever needed in order to preserve information. This means that `2 + 2.5` will automatically be converted to a floating-point addition, and is therefore equivalent to `2.0 + 2.5`.
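
A short Python sketch of the automatic conversion, together with one consequence of storing only a finite amount of information per number. The variable names are chosen for illustration.

```python
import math

mixed = 2 + 2.5      # the int 2 is converted to 2.0 before adding
same = 2.0 + 2.5

# Because a float stores a finite amount of information, some decimal
# fractions are only approximated, and exact comparison can surprise us.
approx = 0.1 + 0.2
is_exact = (approx == 0.3)
close_enough = math.isclose(approx, 0.3)
print(mixed, type(mixed).__name__, is_exact, close_enough)
```

When comparing floating-point results, a tolerance-based check such as `math.isclose` is usually a safer choice than `==`.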

### String

The `string` datatype is used to represent text. Text is a very useful datatype since most applications representing organisations or other real world entities deal with names and descriptions which are almost invariably text-based.

Strings are defined inductively based on the empty string and an alphabet, meaning that we start from the empty string and then repeatedly concatenate it with all elements of the alphabet in order to generate all possible strings. 

We start from the empty string as it is the simplest string imaginable, usually denoted with two empty quotes `""`. We say that the empty string has a length of zero. We then take the alphabet of all possible characters (there are thousands of them!) as strings of length one: `"a"`, `"b"`, ..., `"♥"`, ...

The last step is then repeatedly concatenating all strings from the previous layers in order to build the next layer (while removing the duplicates in the process). This leads us to:

<img src="images/free_string_monoid.png" alt="Free string monoid" style="width: 600px;"/>
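
The layered construction can be sketched in Python for a tiny alphabet. This is a simplification: each new layer is built by appending one character to every string of the previous layer, so no duplicates arise. The function name `string_layers` is hypothetical.

```python
def string_layers(alphabet, depth):
    """Return the layers of strings of length 0, 1, ..., depth."""
    layers = [[""]]                      # layer 0: only the empty string
    for _ in range(depth):
        previous = layers[-1]
        # Concatenate every string of the previous layer with every character.
        layers.append([s + c for s in previous for c in alphabet])
    return layers

for layer in string_layers(["a", "b"], 2):
    print(layer)
```

With an alphabet of two characters, each layer has twice as many strings as the previous one; with a realistic alphabet the growth is far faster.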

The most significant operator on strings is `+`, which just concatenates strings (and correctly rebalances the quotes `"`):
- `eval_expr(<"A" + "B">, S)` $\rightarrow$ `<"AB">`, where `A` and `B` are sequences of characters

Another operator that we occasionally encounter in programming languages is the multiplication of a string and a number, which repeats the string as many times as the number specifies. We will not use this in practice, since it is just a shorthand for loops and "regular" constructs which we believe readers should familiarize themselves with first.
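
In Python, both string operators look as follows. The loop shows the "regular" construct that string multiplication abbreviates; the variable names are chosen for illustration.

```python
greeting = "Hello, " + "world"   # concatenation

repeated = "ab" * 3              # repetition as a shorthand

# The same repetition written with a loop and concatenation only.
looped = ""
for _ in range(3):
    looped = looped + "ab"

print(greeting, repeated, looped)
```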


### Conversions and domain-mixing operators

A series of operators are defined across data types. For example, comparing two integer numbers to see if one is bigger than the other does not yield an integer as a result, but rather a boolean (stating whether or not the comparison holds for the given values). The comparison operators, whose semantics rely entirely on the underlying mathematical semantics, are `<`, `>`, `<=`, `>=`, `!=`, `==`.
We will try to use `==` for equality in our languages, and `:=` for assignment whenever possible, but when introducing Python we will be forced to move to `=` for assignment. The reason why we remark that variable assignment should properly be done with an operator different from `=`, such as our `:=`, is to prevent us from writing code looking like `x = x + 1`, which makes no sense whatsoever from the mathematical point of view. Since programming languages are ultimately built upon mathematical logic, changing the meaning of symbols with respect to the underlying layer is a risky strategy that we do not encourage.

Another example of a domain-mixing operator is the `len` operator (it could be called differently in other programming languages). This operator, given a string, returns an integer which represents the length of the string (that is, how many characters the string contains).
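
A few Python expressions illustrate how these operators move between data types. The variable names are chosen for this sketch.

```python
is_bigger = 5 > 3            # int, int -> bool
is_equal = (2 + 3 == 5)      # arithmetic first, then comparison
is_different = 4 != 4        # int, int -> bool

name_length = len("Ada")     # string -> int
fits = len("Ada") <= 10      # string -> int -> bool
print(is_bigger, is_equal, is_different, name_length, fits)
```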

Some other operators perform conversions instead. For example, we might want to convert a string or a float to an integer.
The three most common conversion operators are `str`, `int`, and `float`. These operators convert values to the type with the same name as the operator: `int("101")` will yield the integer `101`, `str(101)` will yield the string `"101"`, and so on.
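
The conversion operators can be tried out directly in Python. Note that the type of a value determines what `+` means, as the last two lines show.

```python
as_int = int("101")           # string -> int
as_str = str(101)             # int -> string
as_float = float("1.79")      # string -> float
dropped = int(2.9)            # float -> int, the fractional part is discarded

added = int("101") + 1        # integer addition
joined = str(101) + "1"       # string concatenation
print(as_int, as_str, as_float, dropped, added, joined)
```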


### Operator precedence

The various operators we have seen so far have a "canonical" precedence which is mostly borrowed from the mathematical rules for operator precedence, plus convenience rules gained from experience with actual programming. For example, the fact that boolean operators usually have lower precedence than comparison operators allows us to write expressions such as `x > 3 or x < 6` without any brackets.

The resulting table, where precedence **grows** from top to bottom, is therefore:

| Operator                                    | Description |
|:-------------------------------------------:|:-----------:|
| `if - then - else`                          | Conditional expression (we will see it in the next lecture) | 
| `or`                                        | Boolean OR  |
| `and`                                       | Boolean AND |
| `not`                                       | Boolean NOT |
| `<`, `<=`, `>`, `>=`, `!=`, `==`            | Comparisons |
| `+`, `-`                                    | Addition and subtraction |
| `*`, `/`, `//`, `%`                         | Multiplication, division, remainder |
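
The table can be checked against Python, whose precedence rules for these operators match it. In the sketch below, each expression is written once without brackets and once with the brackets that the table implies; the variable names are chosen for illustration.

```python
arithmetic = 2 + 3 * 4
arithmetic_explicit = 2 + (3 * 4)
arithmetic_other = (2 + 3) * 4          # a different grouping, for contrast

mixed = 2 + 3 * 4 > 10 and not 1 == 2
mixed_explicit = ((2 + (3 * 4)) > 10) and (not (1 == 2))
print(arithmetic, arithmetic_other, mixed)
```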