## 4.1 The Sequence ADT

A text is a sequence of characters. A to-do list is a sequence of tasks.
A top 40 chart is a sequence of songs. A queue is a sequence of people.
A **sequence** is a collection of items in a particular order:
one item comes first, another comes second, and so on until the last item.
The items in a sequence are also called its **members** or **elements**.
An **empty sequence** has no elements, e.g. a to-do list when all tasks have been done.
Sequences tend to be **homogeneous**, with all items of the same data type,
like the examples given, but can also be **heterogeneous**,
i.e. include items of different types.

<div class="alert alert-info">
<strong>Info:</strong> In mathematics, sequences are usually lists of numbers that follow a certain pattern, like 1, 4, 7, 10, ... or 5, 10, 20, 40, ...
They are special cases of the sequences used in computing.
Numeric sequences are introduced in MU123 Unit&nbsp;9 Section&nbsp;1.1 and MST124 Unit&nbsp;10.
</div>

In M269 we write sequences as comma-separated lists of items,
enclosed in parentheses: () is the empty sequence, while
(1, 2, true) is a heterogeneous sequence of two integers and one Boolean.
The order matters, so (1, 2, 3) and (1, 3, 2) are different sequences.
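As a rough illustration, Python's tuples can stand in for the sequences written above. This is only a sketch: the variable names are made up for the example, and Python writes the Boolean as `True` rather than true.

```python
# Tuples are one possible Python representation of immutable sequences.
empty = ()            # the empty sequence
mixed = (1, 2, True)  # heterogeneous: two integers and one Boolean

# Order matters: the same items in a different order form a different sequence.
print((1, 2, 3) == (1, 3, 2))  # False
print((1, 2, 3) == (1, 2, 3))  # True
```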

<div class="alert alert-info">
<strong>Info:</strong> There's no standard bracket to enclose sequences. Some texts use {...} or
$\langle$...$\rangle$ or no brackets at all, like the MU123 and MST124 books.
</div>

The following are the most common operations on **immutable** sequences,
i.e. sequences that cannot be modified.
Operations on **mutable** sequences are given in [Section&nbsp;4.6](../04_Iteration/04_6_lists.ipynb#4.6-Lists).

### 4.1.1 Inspecting sequences

The following functions allow us to obtain some information about a sequence.

#### Size

The **length** or **size** of a sequence _s_, written │*s*│,
is the number of its elements. The empty sequence has length zero.
We assume the size is stored in memory together with the sequence, and that
the size is computed and updated when the sequence is created and modified.
Hence the length operation can look up the size in constant time.
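In Python, the length operation corresponds to the built-in `len` function. The short sketch below uses a tuple as the sequence; the task names are invented for the example.

```python
tasks = ("shop", "cook", "eat")  # hypothetical to-do list
print(len(tasks))  # 3
print(len(()))     # 0: the empty sequence has length zero
```

CPython stores the length together with the tuple, so `len` looks it up rather than counting the items, which matches the constant-time assumption above.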

#### Indexing

It's possible to obtain the first, second, ..., last item of a sequence
with the **indexing** operation. In the sequence (true, false),
true is at index zero and false is at index one. Because indices start at 0,
the last index is one less than the sequence's length.

The indexing function takes a sequence and an index, and returns
the item at that index. The members of a sequence can be of any type, so
we need a general ADT that includes every possible data item.
In M269 we call it the **object** ADT.
It only has two operations: equality and inequality.
We can now define the indexing operation.

**Function**: indexing\
**Inputs**: _values_, a sequence; _index_, an integer\
**Preconditions**: 0 ≤ _index_ < │*values*│ \
**Output**: _value_, an object\
**Postconditions**: _value_ is the _n_-th item of _values_, with _n_ = _index_ + 1

The indexing operation is written in mathematics as $s_i$,
for a sequence $s$ and index $i$.
In computing, the more common notation is _s_[*i*].
You can use both notations in M269.

In M269, all data we need to process fits and is stored in the computer's main
random-access memory (RAM). Any RAM position can be accessed in the same time,
so we assume that indexing takes constant time.
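The definition of indexing can be sketched in Python as follows, assuming a tuple represents the sequence. The function name `indexing` and the assertion message are choices made for this example. Note that Python's own indexing also accepts negative indices, which the ADT operation doesn't define, so the sketch checks the precondition explicitly.

```python
def indexing(values: tuple, index: int) -> object:
    """Return the item of values at the given index."""
    # Precondition from the definition: 0 <= index < |values|
    assert 0 <= index < len(values), "precondition violated"
    return values[index]

booleans = (True, False)
print(indexing(booleans, 0))                  # True
print(indexing(booleans, len(booleans) - 1))  # False: the last index is length - 1
```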

#### Exercise 4.1.1

Does the definition of the indexing function allow the operation
to be applied to the empty sequence?

_Write your answer here._

[Hint](../31_Hints/Hints_04_1_01.ipynb)
[Answer](../32_Answers/Answers_04_1_01.ipynb)

#### Membership

The **membership** operation, written $v \in s$ or _v_ in _s_,
checks whether value _v_ is an element of sequence _s_.
Here's one way to define it.

**Function**: membership\
**Inputs**: _values_, a sequence; _value_, an object\
**Preconditions**: true\
**Output**: _is member_, a Boolean\
**Postconditions**: _is member_ if and only if there's an integer _index_
such that _values_[*index*] = _value_

The postcondition states that the output is true exactly when there's an integer
for which the indexing operation is defined and returns the input value.
Note that the postcondition does _not_ state how such an index can be found.
In the previous chapters, the postconditions were similar to the algorithm
that implemented the operation. But postconditions aren't algorithms:
they are conditions – Boolean expressions that must be true after the algorithm
does its job. Postconditions _check_ the output: they don't _compute_ it.

We assume the membership operation has best-case complexity Θ(1) and
worst-case complexity Θ(│*values*│). The reasoning is as follows.
To decide whether _value_ is an element of _values_, the operation has to
go through each member of _values_ and check if it's equal to _value_.
The best-case scenario, when the operation does the least work,
is when the first member of the sequence is _value_:
the search is over after one comparison.

There are two worst-case scenarios, when the operation does the most work:
_value_ occurs only as the last item of the sequence or it doesn't occur at all.
In both scenarios, the operation compares _value_ against all sequence members,
i.e. it takes linear time in the length of the sequence:
if the number of items doubles, the operation does double the work.
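The best- and worst-case scenarios can be made concrete with a linear search that counts its comparisons. This is a minimal sketch of one possible implementation, not necessarily how any particular language implements membership; the function name and the extra counter in the output are added only for illustration.

```python
def membership(values: tuple, value: object) -> tuple[bool, int]:
    """Return whether value is in values, and how many comparisons were made."""
    comparisons = 0
    for item in values:
        comparisons += 1
        if item == value:
            return True, comparisons  # stop at the first occurrence
    return False, comparisons         # every item was compared

numbers = (4, 8, 15, 16, 23, 42)
print(membership(numbers, 4))   # (True, 1): best case, first item
print(membership(numbers, 42))  # (True, 6): worst case, last item
print(membership(numbers, 7))   # (False, 6): worst case, not a member
```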

#### Comparison

If the items of a sequence are **pairwise comparable**, i.e.
each item can be compared to every other item, then
we can apply the minimum and maximum operations to
determine the smallest and largest values.

The **lexicographic comparison** of two sequences does a
pairwise comparison of items, one by one, until a decision can be made, i.e.
items are compared until they differ or one sequence ends before the other.
If two sequences are equal until one ends, then the shorter sequence is
considered 'less than' the longer one. Some examples:

- (1, 2) < (1, 2, 3) because the left sequence ends before the right one
- (1, 3, 3) > (1, 2, 3) because the second item of the left sequence is greater than the second item of the right one
- () < _s_ for any non-empty sequence _s_
- (1, 2, 3) ≠ (1, true) because the second items differ.

The last two sequences can only be compared for equality and inequality,
because the other comparison operations aren't defined for 2 and true.

As usual, we write _s1_ ≤ _s2_ to mean _s1_ < _s2_ or _s1_ = _s2_,
and similarly for other operations.

The comparison operations on sequences have best-case complexity Θ(1)
because in the best-case scenario the first items of the two sequences differ
and a decision can be made after one comparison, which takes constant time.
The comparisons have worst-case complexity Θ(min(│*left*│, │*right*│))
because in a worst-case scenario all items of the shorter sequence
(or all except the last one) are equal to the corresponding items of the longer
sequence and the decision is delayed until the end of the shorter sequence.
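The 'less than' comparison described above can be sketched in Python like this. The function name `less_than` is hypothetical, and the sketch assumes the items at corresponding indices are comparable.

```python
def less_than(left: tuple, right: tuple) -> bool:
    """Return True if left comes before right in lexicographic order."""
    shorter = min(len(left), len(right))
    for index in range(shorter):
        if left[index] != right[index]:
            # the first differing pair of items decides the comparison
            return left[index] < right[index]
    # all compared items are equal: the shorter sequence is 'less than'
    return len(left) < len(right)

print(less_than((1, 2), (1, 2, 3)))     # True: the left sequence ends first
print(less_than((1, 2, 3), (1, 3, 3)))  # True: 2 < 3 at index 1
print(less_than((), (0,)))              # True: () is less than any non-empty sequence
print(less_than((1, 2), (1, 2)))        # False: the sequences are equal
```

The loop runs at most min(│*left*│, │*right*│) times, in line with the worst-case complexity given above. Python's built-in `<` on tuples gives the same results for these examples.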

### 4.1.2 Creating sequences

The following operations create new sequences from existing ones.

#### Slicing

The **slicing** operation extracts a consecutive sequence of items
from the input sequence _s_. The resulting **slice** is defined by
the given start and end indices. We follow Python's notation _s_[_start_:*end*]
with the understanding that the item at index _end_ is _not_ included.
A slice is also called a **substring**.
The term 'subsequence' has another meaning,
explained in [Section&nbsp;4.6](../04_Iteration/04_6_lists.ipynb#4.6.1-Modifying-sequences).

For example, if _s_ = (6, 7, 8, 9), then _s_[1:4] is the substring (7, 8, 9),
with the items from indices 1 to 3, inclusive.
In other words, it's the slice from the second to the fourth item.
If _s_ has fewer than four items, then the slice _s_[1:4] isn't defined.
Here's a definition:

**Function**: slicing\
**Inputs**: _values_, a sequence; _start_, an integer; _end_, an integer\
**Preconditions**: 0 ≤ _start_ ≤ _end_ ≤ │*values*│\
**Output**: _substring_, a sequence\
**Postconditions**: _substring_ = (_values_[*start*], _values_[_start_ + 1], ..., _values_[_end_ - 1])

The preconditions allow the end index to be equal to the length of the sequence,
so that the last item can be included in the slice.
If _start_ = _end_, then the slice is empty.

There are two reasons why the item at index _end_ isn't included in the slice.
First, it makes it easier to see how many items are in the slice. If the end
item were included, the length of the slice would be _end_ − _start_ + 1.
By not including it, │_s_[_start_:*end*]│ = _end_ - _start_, e.g.
_s_[2:7] has five items.

The second reason is to make it easier to split sequences.
If you want to split a sequence _s_ at index _i_, then
the 'left' part of the sequence is _s_[0:*i*] and
the 'right' part is _s_[_i_:│ _s_ │].

The slicing operation copies all items in the slice to a new sequence,
so the complexity is linear in the size of the slice: Θ(_end_ - _start_).
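Here is a sketch of the slicing definition in Python, assuming tuples represent sequences. The function name `slicing` is chosen for the example. Python's built-in slice notation silently clips indices that are out of range, so the sketch asserts the preconditions to stay closer to the definition.

```python
def slicing(values: tuple, start: int, end: int) -> tuple:
    """Return the items of values from index start to index end - 1."""
    # Preconditions from the definition: 0 <= start <= end <= |values|
    assert 0 <= start <= end <= len(values), "precondition violated"
    # copy each item in the slice to a new sequence
    return tuple(values[index] for index in range(start, end))

s = (6, 7, 8, 9)
print(slicing(s, 1, 4))  # (7, 8, 9)
print(slicing(s, 2, 2))  # (): start = end gives the empty slice
print(s[1:4])            # (7, 8, 9): Python's built-in notation

# Splitting s at index i yields a left part and a right part.
i = 2
print(s[0:i], s[i:len(s)])  # (6, 7) (8, 9)
```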

#### Concatenating

The **concatenation** operation, written _left_ + _right_ in M269,
forms a new sequence by joining both input sequences.

**Function**: concatenation\
**Inputs**: _left_, a sequence; _right_, a sequence\
**Preconditions**: true\
**Output**: _joined_, a sequence\
**Postconditions**: _joined_ = (_left_[0], ..., _left_[│ _left_ │ - 1],
_right_[0], ..., _right_[│ _right_ │ - 1])

If sequence _s_ is empty, then there are no items _s_[0], ..., _s_[│ _s_ │ - 1].
For example, if _left_ is empty, then
_joined_ = (_right_[0], ..., _right_[│ _right_ │ - 1]).
And if _right_ is also empty, then _joined_ = ().


The concatenation _left_ + _right_ copies all items of _left_ and
all items of _right_ to the new sequence, so the run-time is proportional to
the number of items copied: │*left*│ + │*right*│. The concatenation operation is
linear in the total length of the inputs: Θ(│*left*│ + │*right*│).
If both inputs double their length, then the total number of items doubles,
and so does the run-time of concatenation.
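The copying described above can be sketched in Python as follows. The function name `concatenation` is chosen for the example, and a temporary list is used only to collect the copied items before the result is turned into a tuple.

```python
def concatenation(left: tuple, right: tuple) -> tuple:
    """Return a new sequence with the items of left followed by those of right."""
    joined = []
    for item in left:   # copy all items of left
        joined.append(item)
    for item in right:  # then copy all items of right
        joined.append(item)
    return tuple(joined)

print(concatenation((1, 2), (3,)))  # (1, 2, 3)
print(concatenation((), (3,)))      # (3,): left is empty
print(concatenation((), ()))        # (): both inputs are empty
print((1, 2) + (3,))                # (1, 2, 3): Python's built-in operator
```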

A sequence _pattern_ is a **prefix** of a sequence _s_ if there's a sequence
_rest_ such that _pattern_ + _rest_ = _s_. In other words,
_pattern_ is a substring of _s_ that starts at index&nbsp;0 of _s_.
Vice versa, if there's a sequence _rest_ such that _rest_ + _pattern_ = _s_, i.e. _s_ ends with _pattern_, then _pattern_ is a **suffix** of _s_.
For every sequence _s_, both () and _s_ are prefixes and suffixes of _s_,
because () + _s_ = _s_ + () = _s_. Hence they are also substrings of _s_.
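Prefixes and suffixes can be checked with slicing and equality. The following is a sketch with hypothetical function names, assuming tuples represent sequences.

```python
def is_prefix(pattern: tuple, s: tuple) -> bool:
    """Return True if s starts with pattern."""
    return len(pattern) <= len(s) and s[0:len(pattern)] == pattern

def is_suffix(pattern: tuple, s: tuple) -> bool:
    """Return True if s ends with pattern."""
    return len(pattern) <= len(s) and s[len(s) - len(pattern):len(s)] == pattern

s = (1, 2, 3)
print(is_prefix((1, 2), s))  # True: (1, 2) + (3,) = s
print(is_suffix((2, 3), s))  # True: (1,) + (2, 3) = s
print(is_prefix((2, 3), s))  # False
print(is_prefix((), s), is_suffix(s, s))  # True True
```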

#### Sorting

A sequence can be **sorted** in ascending (smallest to largest item) or
descending order if the items are pairwise comparable.
For example, (3, 3, 7, 9) is in ascending order and
(9, 7, 3, 3) is in descending order.
Here's one definition:

**Function**: ascending sort\
**Inputs**: _original_, a sequence\
**Preconditions**: all items in _original_ are pairwise comparable\
**Output**: _sorted_, a sequence\
**Postconditions**:

- _sorted_ is a permutation of _original_
- _sorted_[*index*] ≤ _sorted_[_index_+1] for _index_ = 0, 1, ..., │*sorted*│ - 2

A **permutation** is a rearrangement.
The first postcondition relates the output to the input:
the output has the same items as the input, possibly in a different order.
(The input sequence may already be in ascending order.)
The second postcondition defines what ascending order means.
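Since postconditions check the output rather than compute it, they can be turned into a checking function. The sketch below uses a hypothetical name, `is_ascending_sort`, and assumes the items are hashable so that `collections.Counter` can count how often each item occurs.

```python
from collections import Counter

def is_ascending_sort(original: tuple, result: tuple) -> bool:
    """Return True if result satisfies both postconditions for original."""
    # First postcondition: same items, each occurring the same number of times.
    is_permutation = Counter(original) == Counter(result)
    # Second postcondition: each item is at most the item after it.
    is_ascending = all(
        result[index] <= result[index + 1] for index in range(len(result) - 1)
    )
    return is_permutation and is_ascending

original = (9, 3, 7, 3)
print(is_ascending_sort(original, (3, 3, 7, 9)))  # True
print(is_ascending_sort(original, (9, 7, 3, 3)))  # False: not ascending
print(is_ascending_sort(original, (3, 7, 9)))     # False: not a permutation
```

The function says nothing about how to obtain the sorted sequence; it only checks a candidate output against the input.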

In algorithms in English we write '_s_ in ascending order' or
'_s_ in descending order' for some sequence _s_.
The sorting functions produce a new sequence that can be part of
a longer expression or be the right-hand side of an assignment, like

> let _sorted_ be _sequence_ in ascending order

Sorting can be the basis for solving other problems, e.g.
finding the median (the middle value when values are sorted)
or the ten largest values. We'll consider the complexity of sorting later.
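As a small illustration of sorting as a basis for other problems, the sketch below uses Python's built-in `sorted` function on a made-up tuple. It assumes an odd number of values, so that there is a single middle value, and takes the three largest values rather than ten to keep the example short.

```python
values = (7, 3, 9, 3, 5)            # hypothetical data
ascending = tuple(sorted(values))   # a new sequence in ascending order
print(ascending)                    # (3, 3, 5, 7, 9)

median = ascending[len(ascending) // 2]        # the middle item
print(median)                                  # 5

largest = ascending[len(ascending) - 3:len(ascending)]  # the last three items
print(largest)                                 # (5, 7, 9)
```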
