## Classical information

To describe quantum information and how it works, it is helpful to begin with a mathematical description of *classical* information.
This is so for multiple reasons.
One reason is that, although quantum and classical information differ in some spectacular ways, their mathematical descriptions are quite similar.
A second reason is that classical information serves as a familiar point of reference when studying quantum information, as well as a source of analogy that goes a surprisingly long way.
Indeed, questions that arise when learning about quantum information and computation often have natural classical analogs, and the sometimes obvious answers to those classical analogs can provide both clarity and insight into the original questions.

### Classical states and probability vectors

Suppose that we have a system — meaning an abstraction of a physical device or a medium of some sort — that stores information.
More specifically, we will assume that this system can be in one of a finite number of *classical states* at each instant.
Here, the term *classical state* should be understood in intuitive terms, as a configuration that can be recognized and described unambiguously.

The archetypal example, which we will come back to repeatedly, is that of a *bit*, which is a system whose classical states are 0 and 1.
Other examples include a standard six-sided die, whose classical states are 1, 2, 3, 4, 5, and 6; a nucleobase in a strand of DNA, whose classical states are *A*, *C*, *G*, and *T*; and a switch on an electric fan, whose classical states are (commonly) *high*, *medium*, *low*, and *off*.
In mathematical terms, the specification of the classical states of a system is, in fact, the starting point: we *define* a bit to be a system that has classical states 0 and 1, and likewise for systems having different classical state sets.

For the sake of the discussion that follows, let us give the name $\mathsf{X}$ to the system being considered, and let us use the symbol $\Sigma$ to refer to the set of classical states of $\mathsf{X}$.
In addition to the assumption that $\Sigma$ is *finite*, as was already mentioned, we naturally assume that $\Sigma$ is *nonempty*, as it is nonsensical for a physical system to have no states at all that it can be in.
Although it is sensible to consider physical systems having infinitely many classical states, we will not be concerned with this possibility at this time.
For the sake of convenience and brevity, the term *classical state set* should hereafter be understood to mean *a finite and nonempty set*.

For example, if $\mathsf{X}$ is a bit, then $\Sigma = {\{0,1\}}$;
if $\mathsf{X}$ is a six-sided die, then $\Sigma = \{1,2,3,4,5,6\}$; and if $\mathsf{X}$ is an electric fan switch, then 
$\Sigma = \{\mathrm{high},\mathrm{medium},\mathrm{low},\mathrm{off}\}$.

When thinking about $\mathsf{X}$ as a carrier of information, where the different classical states of $\mathsf{X}$ may have different interpretations and may lead to different outcomes or consequences, it may be sufficient to describe $\mathsf{X}$ as simply being in one of its possible classical states.
For instance, if $\mathsf{X}$ is a fan switch, we might happen to know with certainty that it is set to *high*.
It is very common in the context of information processing, however, that our knowledge of $\mathsf{X}$ is *uncertain*, and we represent our knowledge of the classical state of $\mathsf{X}$ by assigning *probabilities* to each classical state, resulting in a *probabilistic state*.

For example, let us suppose that $\mathsf{X}$ is a bit.
Based on what we know or expect about what has happened to $\mathsf{X}$ in the past, perhaps we believe that $\mathsf{X}$ is in the classical state 0 with probability 3/4 and in the state 1 with probability 1/4.
We may represent such a belief by writing this:

$$
\operatorname{Pr}(\mathsf{X}=0) = \frac{3}{4}
\quad\text{and}\quad
\operatorname{Pr}(\mathsf{X}=1) = \frac{1}{4}.
$$

A more succinct way to represent this probabilistic state is by a column vector:

$$
\begin{pmatrix}
  \frac{3}{4}\\[1mm]
  \frac{1}{4}
\end{pmatrix}.
$$

The entries of this vector are placed in correspondence with the classical state set $\{0,1\}$ in the natural order: the first entry corresponds to 0 and the second to 1.

This representation can be generalized to arbitrary classical state sets, assuming that the entries of the column vector are ordered in whatever way makes the most sense (or is clearly specified if there is no obvious ordering).
In addition to its succinctness, the identification of a probabilistic state with a column vector has the advantage that operations on probabilistic states are represented through matrix–vector multiplication, as the reader may well be aware, and as will be discussed in greater detail shortly.

At this point, we observe that the set of all possible probabilistic states of a system $\mathsf{X}$ having classical state set $\Sigma$ corresponds precisely to the set of all column vectors having entries in correspondence with $\Sigma$ for which these properties are satisfied:
1. All entries are *nonnegative real numbers*.
1. The sum of the entries is equal to 1.

Hereafter, whenever it is convenient, we will use the term *probability vector* to refer to any such vector.
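
As a minimal sketch of these two conditions, the following Python function checks whether a list of numbers is a probability vector.
The name `is_probability_vector` is our own choice for illustration, and exact fractions from the standard library are used (an assumption of this sketch) so that the sum can be compared with 1 without rounding error.

```python
from fractions import Fraction


def is_probability_vector(v):
    """Return True if every entry is nonnegative and the entries sum to 1."""
    return all(entry >= 0 for entry in v) and sum(v) == 1


# The bit from the example above: 0 with probability 3/4, 1 with probability 1/4.
# Entries are ordered to match the classical state set {0, 1}: 0 first, 1 second.
bit_state = [Fraction(3, 4), Fraction(1, 4)]

print(is_probability_vector(bit_state))                          # True
print(is_probability_vector([Fraction(1, 2), Fraction(1, 4)]))   # False: sums to 3/4
print(is_probability_vector([Fraction(3, 2), Fraction(-1, 2)]))  # False: negative entry
```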

### Measuring classical states

Now let us briefly consider what happens if we *measure* a system when it is in a probabilistic state.
By measuring a system, we mean that we unambiguously recognize whatever classical state the system is in.

Doing this naturally changes our knowledge of the system, and therefore changes the probabilistic state that we associate with that system: if we recognize that $\mathsf{X}$ is in the classical state $a\in\Sigma$, then the new probability vector representing our knowledge of $\mathsf{X}$ becomes a vector having a 1 in the entry corresponding to $a$ and 0 for all other entries.
This vector indicates that $\mathsf{X}$ is in the classical state $a$ with certainty, which we know having just recognized it.

We denote the vector just described, meaning the vector having a 1 in the entry corresponding to $a$ and 0 for all other entries, by $|a\rangle$.
This vector is read as "ket $a$" for a reason that will be explained later.
Vectors of this sort are also called *standard basis* elements.

For example, assuming that the system we have in mind is a bit, whose classical state set is therefore $\{0,1\}$, we denote

$$
  |0\rangle = \begin{pmatrix}1\\0\end{pmatrix}
  \quad\text{and}\quad
  |1\rangle = \begin{pmatrix}0\\1\end{pmatrix}.
$$

Notice that any two-dimensional column vector can be expressed as a linear combination of these two vectors.
For example, we have

$$
\begin{pmatrix}
  \frac{3}{4}\\[2mm]
  \frac{1}{4}
\end{pmatrix}
= \frac{3}{4}\,|0\rangle + \frac{1}{4}\,|1\rangle.
$$

The fact that any column vector can be written as a linear combination of standard basis elements naturally generalizes to arbitrary classical state sets.
Expressing vectors as linear combinations of standard basis elements will be very typical going forward.
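
The following sketch builds standard basis vectors for an ordered classical state set and recombines them into the probability vector from the example.
The helper names `ket` and `linear_combination` are hypothetical, and representing a column vector as a plain Python list is an assumption made to keep the example dependency-free.

```python
from fractions import Fraction


def ket(a, states):
    """Standard basis vector |a>: 1 in the entry for a, 0 elsewhere."""
    return [1 if s == a else 0 for s in states]


def linear_combination(coefficients, vectors):
    """Entrywise sum of coefficient * vector over the given pairs."""
    dimension = len(vectors[0])
    return [
        sum(c * v[i] for c, v in zip(coefficients, vectors))
        for i in range(dimension)
    ]


bit_states = [0, 1]  # ordering: 0 first, 1 second
v = linear_combination(
    [Fraction(3, 4), Fraction(1, 4)],
    [ket(0, bit_states), ket(1, bit_states)],
)
print(ket(0, bit_states), ket(1, bit_states))  # [1, 0] [0, 1]
print(v == [Fraction(3, 4), Fraction(1, 4)])   # True

# The same construction works for any ordered classical state set.
fan_states = ["high", "medium", "low", "off"]
print(ket("low", fan_states))                  # [0, 0, 1, 0]
```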

Returning to the change of a probabilistic state upon being measured, we may note the following connection to our everyday experiences.
Suppose that we flip a fair coin, but cover up the coin before looking at it.
We would then naturally consider that its probabilistic state is

$$
\begin{pmatrix}
  \frac{1}{2}\\[2mm]
  \frac{1}{2}
\end{pmatrix}
= \frac{1}{2}\,|\text{heads} \rangle + \frac{1}{2}\,|\text{tails}\rangle.
$$

Here, the classical state set of our coin is $\{\text{heads},\text{tails}\}$, and even though it does not matter for this particular vector, let us decide that we order these two states as heads first, tails second:

$$
|\text{heads}\rangle = \begin{pmatrix}1\\0\end{pmatrix}
\quad\text{and}\quad
|\text{tails}\rangle = \begin{pmatrix}0\\1\end{pmatrix}.
$$

If we were to uncover the coin and look at it, we would see one of the two classical states: heads or tails.
Supposing that the result were tails, we would naturally update our description of the probabilistic state of the coin so that it becomes $|\text{tails}\rangle$.
Of course, if we were then to cover up the coin, and then uncover it and look at it again, the classical state would obviously still be tails, which is consistent with the probabilistic state being described by the vector $|\text{tails}\rangle$.
This may seem trivial, and in some sense it is — but in recognizing this triviality, the analogous behavior for quantum information might seem less unusual.
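
The coin example can be simulated with a short sketch.
The function `measure` below is a hypothetical helper: it samples a classical state according to the probability vector, which stands in for our uncertainty about the covered coin, and returns the updated vector.
Looking a second time leaves the result unchanged.

```python
import random
from fractions import Fraction


def measure(probability_vector, states, rng=random):
    """Sample a classical state and return it with the updated vector |a>."""
    weights = [float(p) for p in probability_vector]
    outcome = rng.choices(states, weights=weights, k=1)[0]
    updated = [1 if s == outcome else 0 for s in states]
    return outcome, updated


coin_states = ["heads", "tails"]  # ordering: heads first, tails second
covered_coin = [Fraction(1, 2), Fraction(1, 2)]

outcome, after_first_look = measure(covered_coin, coin_states)
# Looking again changes nothing: the state is already known with certainty.
outcome_again, after_second_look = measure(after_first_look, coin_states)
print(outcome, after_first_look)
print(outcome_again == outcome)  # True
```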

One final remark concerning measurements of probabilistic states is that probabilistic states describe knowledge or belief, and not necessarily something actual.
The state of our coin after we flip it, but before we look, is either heads or tails and we simply do not know which until we look — and doing so does not actually change the state of the coin, but only our knowledge of it.
Upon seeing that the classical state of the coin is tails, say, we naturally update our knowledge by assigning the probabilistic state $|\text{tails}\rangle$ to the coin — but to someone else in the room who was not able to see the coin when it was uncovered, the probabilistic state of the coin would remain unchanged and would be described by the probability vector specified previously.
This is not a cause for concern: different individuals may have different knowledge or beliefs about a particular system, and hence describe that system by different probability vectors.


### Classical operations

Finally, in this brief summary of classical information, let us consider the sorts of *operations* one might perform on a classical system.

#### Deterministic operations

First, there are *deterministic* operations, where each classical state $a\in\Sigma$ is transformed into $f(a)$ for some function $f$ of the form $f:\Sigma\rightarrow\Sigma$.

For example, if $\Sigma = \{0,1\}$, there are four functions of this form, $f_1$, $f_2$, $f_3$, and $f_4$, which can be represented by tables of values as follows:

$$
\rule[-10mm]{0mm}{15mm}
\begin{array}{c|c}
  a & f_1(a)\\
  \hline
  0 & 0\\
  1 & 0
\end{array}
\qquad
\begin{array}{c|c}
  a & f_2(a)\\
  \hline
  0 & 0\\
  1 & 1
\end{array}
\qquad
\begin{array}{c|c}
  a & f_3(a)\\
  \hline
  0 & 1\\
  1 & 0
\end{array}
\qquad
\begin{array}{c|c}
  a & f_4(a)\\
  \hline
  0 & 1\\
  1 & 1
\end{array}
$$

The first and last of these functions are *constant*: $f_1(a) = 0$ and $f_4(a) = 1$ for each $a\in\Sigma$.
The middle two are not constant; they are *balanced* in the sense that the two possible output values occur the same number of times as we range over the possible inputs.
The function $f_2$ is the *identity function*: $f_2(a) = a$ for each $a\in\Sigma$.
And $f_3$ is the function $f_3(0) = 1$ and $f_3(1) = 0$, which is better known as the NOT function.

The actions of deterministic operations on probabilistic states can be represented by matrix–vector multiplication.
Specifically, the matrix $M$ that represents a given function $f:\Sigma\rightarrow\Sigma$ is the one that satisfies

$$
M | a \rangle = |f(a)\rangle
$$

for every $a\in\Sigma$.
Such a matrix always exists and is unique.

For example, the matrices $M_1,\ldots,M_4$ corresponding to the functions $f_1,\ldots,f_4$ above are as follows:

$$
  \rule[-6mm]{0mm}{14.5mm}
  M_1 =
  \begin{pmatrix}
    1 & 1\\
    0 & 0
  \end{pmatrix},
  \hspace{4mm}
  M_2 =
  \begin{pmatrix}
    1 & 0\\
    0 & 1
  \end{pmatrix},
  \hspace{4mm}
  M_3 =
  \begin{pmatrix}
    0 & 1\\
    1 & 0
  \end{pmatrix},
  \hspace{4mm}
  \text{and}
  \hspace{4mm}
  M_4 =
  \begin{pmatrix}
    0 & 0\\
    1 & 1
  \end{pmatrix}.
$$

Here, and in general, the matrices that represent deterministic operations correspond precisely to matrices having exactly one 1 in each column, and 0 for all other entries.
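
As a sketch of this correspondence, the code below builds the matrix of a function column by column and reproduces $M_1,\ldots,M_4$.
The helper names `matrix_of_function` and `apply` are our own, and matrices are represented as lists of rows, which is an assumption made for illustration.

```python
from fractions import Fraction


def matrix_of_function(f, states):
    """Matrix M with M|a> = |f(a)>: column a has a 1 in row f(a)."""
    return [[1 if f(a) == b else 0 for a in states] for b in states]


def apply(matrix, vector):
    """Matrix-vector product, computed row by row."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


states = [0, 1]
M1 = matrix_of_function(lambda a: 0, states)      # constant 0
M2 = matrix_of_function(lambda a: a, states)      # identity
M3 = matrix_of_function(lambda a: 1 - a, states)  # NOT
M4 = matrix_of_function(lambda a: 1, states)      # constant 1
print(M1, M2, M3, M4)
# [[1, 1], [0, 0]] [[1, 0], [0, 1]] [[0, 1], [1, 0]] [[0, 0], [1, 1]]

# NOT swaps the two probabilities of the state (3/4, 1/4).
u = [Fraction(3, 4), Fraction(1, 4)]
print(apply(M3, u) == [Fraction(1, 4), Fraction(3, 4)])  # True
```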

A convenient way to represent matrices of these and other forms uses a notation for row vectors analogous to the one for column vectors discussed previously: we denote by $\langle a |$ the *row* vector having a 1 in the entry corresponding to $a$ and 0 for all other entries, for each $a\in\Sigma$.
This vector is read as "bra $a$."
For example, if $\Sigma = \{0,1\}$, then

$$
  \langle 0 | = \begin{pmatrix}
    1 & 0
  \end{pmatrix}
  \quad\text{and}\quad
  \langle 1 | = \begin{pmatrix}
    0 & 1
  \end{pmatrix}.
$$

For an arbitrary choice of a classical state set $\Sigma$, viewing row vectors and column vectors as matrices having a single row or column, respectively, and performing the matrix multiplication $|b\rangle \langle a|$, one obtains a square matrix having a 1 in the $(b,a)$ entry and 0 for all other entries.
This provides a convenient tool for representing matrices.

In particular, for any function $f:\Sigma\rightarrow\Sigma$, we may express the matrix $M$ corresponding to the function $f$ as 

$$
  M = \sum_{a\in\Sigma} |f(a) \rangle \langle a |.
$$

Notice that this expression is consistent with the fact that the multiplication $\langle a | |b \rangle$, which is written as $\langle a | b\rangle$ for the sake of tidiness, satisfies

$$
  \langle a | b \rangle
  = \begin{cases}
    1 & a = b\\
    0 & a \not= b.
  \end{cases}
$$

That is, using this fact together with the fact that matrix multiplication is associative and linear, we obtain

$$
  M | b \rangle = 
  \Biggl(
  \sum_{a\in\Sigma} |f(a) \rangle \langle a |
  \Biggr)
  | b\rangle
  = \sum_{a\in\Sigma} |f(a) \rangle \langle a | b \rangle
  = |f(b)\rangle,
$$

for each $b\in\Sigma$, which is what we require of $M$.
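
The same derivation can be checked numerically.
In this sketch the helpers `ket`, `outer`, `inner`, and `matrix_from_sum` are hypothetical names, and both row and column vectors are stored as plain lists of entries, so $\langle a|$ and $|a\rangle$ share the same list.

```python
def ket(a, states):
    """Entries of |a>: 1 in the position of a, 0 elsewhere."""
    return [1 if s == a else 0 for s in states]


def outer(column, row):
    """Matrix product of a column vector with a row vector, e.g. |b><a|."""
    return [[c * r for r in row] for c in column]


def inner(row, column):
    """Matrix product of a row vector with a column vector, e.g. <a|b>."""
    return sum(r * c for r, c in zip(row, column))


def matrix_from_sum(f, states):
    """M = sum over a of |f(a)><a|, accumulated one term at a time."""
    n = len(states)
    total = [[0] * n for _ in range(n)]
    for a in states:
        # <a| has the same entries as |a>, so ket(a, states) serves as the row.
        term = outer(ket(f(a), states), ket(a, states))
        total = [
            [t + x for t, x in zip(total_row, term_row)]
            for total_row, term_row in zip(total, term)
        ]
    return total


states = [0, 1]
print(outer(ket(1, states), ket(0, states)))     # |1><0| = [[0, 0], [1, 0]]
print(inner(ket(0, states), ket(1, states)))     # <0|1> = 0
print(matrix_from_sum(lambda a: 1 - a, states))  # NOT: [[0, 1], [1, 0]]
```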

The names "bra" and "ket" are perhaps now evident: putting a "bra" $\langle a|$ together with a "ket" $|b\rangle$ yields a "bracket" $\langle a | b\rangle$.
This notation and terminology are due to Paul Dirac, and for this reason the notation is known as the *Dirac notation*.

#### Probabilistic operations and stochastic matrices

In addition to deterministic operations, we have *probabilistic operations*.

For example, consider an operation on a bit where, if the classical state of the bit is 0, it is left alone; and if the classical state of the bit is 1, it is flipped to 0 with probability $1/2$ and left alone with probability $1/2$.
This operation is represented by the matrix

$$
  \begin{pmatrix}
    1 & \frac{1}{2}\\[1mm]
    0 & \frac{1}{2}
  \end{pmatrix},
$$

meaning that the action of this operation on a given probabilistic state is obtained by multiplying the probability vector associated with that state by this matrix.

For an arbitrary choice of a classical state set, we can describe the set of all probabilistic operations in mathematical terms as those that are represented by *stochastic matrices*.
These are matrices satisfying these two properties:
1. All entries are nonnegative real numbers.
1. The entries in every column sum to 1.

Equivalently, stochastic matrices are matrices whose columns all form probability vectors.

We can think about probabilistic operations at an intuitive level as ones where randomness might somehow be used or introduced during the operation, just like in the example above.
With respect to the stochastic matrix description of a probabilistic operation, each column can be viewed as a vector representation of the probabilistic state that is generated given whatever classical state input corresponds to that column.

We can also think about stochastic matrices as being exactly those matrices that always map probability vectors to probability vectors.
That is, stochastic matrices always map probability vectors to probability vectors, and any matrix that always maps probability vectors to probability vectors must be a stochastic matrix.
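
A small sketch makes the two defining properties concrete and applies the example matrix to the probability vector $(3/4, 1/4)$ used earlier.
The helper names `is_stochastic` and `apply` are our own, and exact fractions are used (an assumption of this sketch) so that column sums can be compared with 1 exactly.

```python
from fractions import Fraction


def is_stochastic(matrix):
    """True if all entries are nonnegative and every column sums to 1."""
    columns = list(zip(*matrix))
    return all(
        all(entry >= 0 for entry in column) and sum(column) == 1
        for column in columns
    )


def apply(matrix, vector):
    """Matrix-vector product, computed row by row."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


half = Fraction(1, 2)
M = [[1, half],
     [0, half]]
u = [Fraction(3, 4), Fraction(1, 4)]
result = apply(M, u)
print(is_stochastic(M))  # True
print(result)            # [Fraction(7, 8), Fraction(1, 8)]
print(sum(result) == 1)  # True
print(is_stochastic([[1, 1], [1, 0]]))  # False: first column sums to 2
```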

Finally, another way to think about probabilistic operations is that they are random choices *of* deterministic operations.
For instance, we can think about the operation in the example above as applying either the identity function or the constant 0 function, each with probability $1/2$.
This is consistent with the equation

$$
  \begin{pmatrix}
    1 & \frac{1}{2}\\[1mm]
    0 & \frac{1}{2}
  \end{pmatrix}
  = \frac{1}{2}
  \begin{pmatrix}
    1 & 0\\[1mm]
    0 & 1
  \end{pmatrix}
  + \frac{1}{2}
  \begin{pmatrix}
    1 & 1\\[1mm]
    0 & 0
  \end{pmatrix}.
$$

Such an expression is always possible, for an arbitrary choice of a classical state set and any stochastic matrix with rows and columns identified with that classical state set.
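
One way to find such an expression, sketched here as an illustration rather than as the only possibility, is to choose the output for each input independently: a function $f$ receives probability equal to the product over $a\in\Sigma$ of the $(f(a),a)$ entry of the matrix.
The helper names `decompose` and `mixture` are hypothetical, classical states are assumed to be labeled $0,\ldots,n-1$, and the enumeration visits all $n^n$ functions, so it is only practical for small state sets.

```python
from fractions import Fraction
from itertools import product


def decompose(matrix):
    """Return (probability, function) pairs whose mixture equals the matrix.

    A function on states 0..n-1 is a tuple f with f[a] the image of a.
    Its probability is the product over columns a of matrix[f[a]][a],
    so each column's output is chosen independently.
    """
    n = len(matrix)
    terms = []
    for f in product(range(n), repeat=n):
        probability = Fraction(1)
        for a in range(n):
            probability *= matrix[f[a]][a]
        if probability > 0:
            terms.append((probability, f))
    return terms


def mixture(terms, n):
    """Rebuild the matrix as the sum of probability * (matrix of f)."""
    return [
        [sum(p for p, f in terms if f[a] == b) for a in range(n)]
        for b in range(n)
    ]


half = Fraction(1, 2)
M = [[1, half],
     [0, half]]
terms = decompose(M)
print(terms)  # [(Fraction(1, 2), (0, 0)), (Fraction(1, 2), (0, 1))]
print(mixture(terms, 2) == M)  # True
```

For the example matrix, the two functions found are the constant 0 function, written `(0, 0)`, and the identity function, written `(0, 1)`, each with probability $1/2$.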

#### Compositions of probabilistic operations

Suppose that $\mathsf{X}$ is a system having classical state set $\Sigma$, and $M_1,\ldots,M_n$ are stochastic matrices representing probabilistic operations on the system $\mathsf{X}$, so the rows and columns of these matrices have been placed in correspondence with $\Sigma$ as usual.

If the first probabilistic operation is applied to the probabilistic state represented by a probability vector $u$, the resulting probabilistic state is represented by the vector $M_1 u$ that we obtain by multiplying $u$ by $M_1$.
If we then apply the second probabilistic operation to this new probability vector, we obtain the probability vector

$$
  M_2 (M_1 u) = (M_2 M_1) u.
$$

The equality follows from the fact that matrix multiplication (which includes matrix-vector multiplication as a special case) is an *associative operation*.
Thus, the probabilistic operation obtained by *composing* the first and second probabilistic operations, where we first apply $M_1$ and then apply $M_2$, is represented by the matrix $M_2 M_1$, which is necessarily stochastic.

More generally, composing the probabilistic operations represented by the matrices $M_1,\ldots,M_n$ in that order, meaning that $M_1$ is applied first, $M_2$ is applied second, and so on, with $M_n$ applied last, is represented by the matrix

$$
  M_n \,\cdots\, M_1.
$$

Note that the ordering is important here: although matrix multiplication is associative, it is not a commutative operation in general.
For example, if we have

$$
  M_1 = 
  \begin{pmatrix}
    1 & \frac{1}{2}\\[1mm]
    0 & \frac{1}{2}
  \end{pmatrix}
  \quad\text{and}\quad
  M_2 =
  \begin{pmatrix}
    0 & 1\\[1mm]
    1 & 0
  \end{pmatrix},
$$

then

$$
  M_2 M_1 = 
  \begin{pmatrix}
    0 & \frac{1}{2}\\[1mm]
    1 & \frac{1}{2}
  \end{pmatrix}
  \quad\text{and}\quad
  M_1 M_2 =
  \begin{pmatrix}
    \frac{1}{2} & 1\\[1mm]
    \frac{1}{2} & 0
  \end{pmatrix}.
$$

That is, the order in which probabilistic operations are composed matters: changing the order in which operations are applied in a composition can change the resulting operation.
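
The two products above can be verified with a short sketch.
The helpers `matmul` and `apply` are our own names, matrices are stored as lists of rows, and the test vector $u = (3/4, 1/4)$ is borrowed from the earlier bit example purely for illustration.

```python
from fractions import Fraction


def matmul(A, B):
    """Matrix product A B for matrices given as lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, column)) for column in zip(*B)]
        for row in A
    ]


def apply(matrix, vector):
    """Matrix-vector product, computed row by row."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


half = Fraction(1, 2)
M1 = [[1, half],
      [0, half]]
M2 = [[0, 1],
      [1, 0]]
u = [Fraction(3, 4), Fraction(1, 4)]

# Applying M1 and then M2 agrees with applying the single matrix M2 M1.
print(apply(M2, apply(M1, u)) == apply(matmul(M2, M1), u))  # True
print(matmul(M2, M1) == [[0, half], [1, half]])             # True
print(matmul(M1, M2) == [[half, 1], [half, 0]])             # True
print(matmul(M2, M1) == matmul(M1, M2))                     # False
```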