## 4.4 Linear search

Many problems using sequences are **search problems**: they involve
finding one or more elements of the sequence that satisfy some condition.
Such problems can be solved with a **linear search**, an algorithm
that goes systematically through the sequence and checks each element.
This section shows some examples of linear searches at work
and *en passant* illustrates some finer points of the problem-solving process.

<div class="alert alert-info">
<strong>Info:</strong> TM112 Block&nbsp;2 Section&nbsp;2.3 introduces search problems and
algorithmic patterns for them.
</div>

### 4.4.1 Finding characters

Imagine that we have a non-empty string with one or more sentences,
each ending with a full stop.
We're asked to create a new string with just the first sentence.

Although this isn't formulated as a search problem, it involves a search:
finding the first full stop in the string. Once we know its index,
we simply take the slice up to that index: that's the first sentence.
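
As a minimal sketch of this idea, the code below assumes Python's built-in `str.find` method to locate the first full stop; the function name `first_sentence` is hypothetical. The slice ends one past the index so that the full stop is kept; `text[:index]` would leave it out.

```python
def first_sentence(text: str) -> str:
    """Return the first sentence of text, including its full stop.

    Assumes text contains at least one full stop.
    """
    index = text.find(".")     # index of the first full stop
    return text[: index + 1]   # slice up to and including that index


print(first_sentence("Hello there. Nice day."))
```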

<div class="alert alert-warning">
<strong>Note:</strong> Even if a problem isn't stated as a search problem,
consider whether a search could solve it.
</div>

#### Problem definition and instances

Let's take the opportunity to define the more general problem of
finding the first occurrence of a given character in a given string.

**Function**: first index\
**Inputs**: *text*, a string; *character*, a string\
**Preconditions**: │*character*│ = 1\
**Output**: *index*, an integer\
**Postconditions**: if *character* occurs in *text*,
*index* is the smallest integer such that *text*[*index*] = *character*,
otherwise *index* = │*text*│

Search problems often have postconditions of the form
'... is the first / last / smallest / largest ... such that ...' or similar.
To indicate that the character doesn't occur in the text,
the output is set to an impossible index.
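
Here is a short sketch of why │*text*│ is an impossible index in Python: valid indices run from 0 to │*text*│ − 1, so indexing at the length raises an error. For comparison, the built-in `str.find` method signals absence with −1, a value that Python would otherwise accept as the index of the last character.

```python
text = "abracadabra"

try:
    text[len(text)]        # one past the last valid index
except IndexError:
    print("len(text) is not a valid index")

print(text.find("k"))      # the built-in method returns -1 for 'not found'
print(text[-1])            # but -1 is a valid index: it's the last character
```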

For the test table, I need to think of edge cases.
An input sequence with the smallest allowed length is always an edge case.
I also need to create tests for different options of where and how often
*character* occurs in *text*.

Case | *text* | *character* | *index*
-|-|-|-
smallest input | ''  | 'a'  | 0
occurs at start  | 'all'  | 'a'  | 0
occurs in the middle  | 'abracadabra'  | 'c'  | 4
occurs at the end  | 'hi!'  | '!'  | 2
multiple occurrences  | 'abracadabra'  | 'b'  | 1
no occurrence  | 'abracadabra'  | 'k'  | 11

#### Algorithm

The linear search algorithm goes through the indices of the text in ascending
order and stops when it finds the character: the output is then the index at which it stopped.
If the loop ends without finding the character,
the postconditions tell us the output is the length of the string.

1. for each *index* from 0 to │*text*│ - 1:
   1. if *text*[*index*] = *character*:
      1. stop
1. let *index* be │*text*│

#### Complexity

Whenever there's a stop statement within a loop, we must think of
best- and worst-case scenarios: under which conditions does the algorithm
stop the earliest and stop the latest?

For this algorithm, the loop can stop in its first iteration
if the first character of the text is the sought character.
In that case step&nbsp;1.1, which takes constant time, is executed once,
so the algorithm has best-case complexity Θ(1).

In a worst-case scenario the algorithm goes through all characters.
This happens when the character doesn't occur at all
or occurs only in the last position.
Step&nbsp;1.1 is then executed │*text*│ times, so the worst-case complexity is Θ(│*text*│).

The complexity of linear search algorithms is often, but not always, constant in the best case and linear in the size of the sequence in the worst case.
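
To make the two scenarios concrete, here's a sketch of an instrumented search that counts how often step&nbsp;1.1 is executed. The function `comparisons` is hypothetical and only serves this illustration; it isn't part of the algorithm.

```python
def comparisons(text: str, character: str) -> int:
    """Return how many times step 1.1 runs when searching text for character."""
    count = 0
    for index in range(len(text)):
        count = count + 1              # one execution of step 1.1
        if text[index] == character:
            return count               # the search stops here
    return count                       # the search went through the whole text


print(comparisons("abracadabra", "a"))  # best case: 1
print(comparisons("abracadabra", "k"))  # worst case: 11
```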

#### Code and tests

The translation to Python makes use of the `range` constructor.
The end of the range isn't included, so it must be
one higher (or lower, if iterating backwards) than the last value we need.

In [1]:
def first_index(text: str, character: str) -> int:
    """Return the lowest index of character in text.

    Preconditions: len(character) = 1
    Postconditions: if text includes character, then the output is
    the lowest index such that text[index] = character,
    otherwise the output is len(text)
    """
    for index in range(len(text)):
        if text[index] == character:
            return index
    return len(text)

In [2]:
first_index("", "a") == 0

True

In [3]:
first_index("all", "a") == 0

True

In [4]:
first_index("abracadabra", "c") == 4

True

In [5]:
first_index("hi!", "!") == 2

True

In [6]:
first_index("abracadabra", "b") == 1

True

In [7]:
first_index("abracadabra", "k") == 11

True
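
The six test cells above can also be written as one loop over the test table. This is only a sketch: the names `test_table`, `case` and `expected` are my own choice, and the function is repeated so that the code runs on its own.

```python
def first_index(text: str, character: str) -> int:
    """Return the lowest index of character in text, or len(text) if absent."""
    for index in range(len(text)):
        if text[index] == character:
            return index
    return len(text)


test_table = [
    # case, text, character, expected index
    ("smallest input", "", "a", 0),
    ("occurs at start", "all", "a", 0),
    ("occurs in the middle", "abracadabra", "c", 4),
    ("occurs at the end", "hi!", "!", 2),
    ("multiple occurrences", "abracadabra", "b", 1),
    ("no occurrence", "abracadabra", "k", 11),
]

for case, text, character, expected in test_table:
    actual = first_index(text, character)
    if actual != expected:
        print(case, "FAILED:", actual, "instead of", expected)
print("Tests finished.")
```

Only failed tests are reported, so a run that prints just the final message means all rows of the table passed.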

#### Performance

To illustrate the use of repeated concatenation (the `*` operator on strings),
let's check that the worst-case complexity is linear in the size of the input string.
A worst-case scenario is that the sought character doesn't occur.

In [8]:
text = 100 * "blah"  # start with a not too short string
%timeit -r 3 -n 1000 first_index(text, '!')
text = 200 * "blah"
%timeit -r 3 -n 1000 first_index(text, '!')
text = 400 * "blah"
%timeit -r 3 -n 1000 first_index(text, '!')
text = 800 * "blah"
%timeit -r 3 -n 1000 first_index(text, '!')

12.4 μs ± 1.45 μs per loop (mean ± std. dev. of 3 runs, 1,000 loops each)
20.5 μs ± 836 ns per loop (mean ± std. dev. of 3 runs, 1,000 loops each)
42.5 μs ± 406 ns per loop (mean ± std. dev. of 3 runs, 1,000 loops each)
87.5 μs ± 169 ns per loop (mean ± std. dev. of 3 runs, 1,000 loops each)


On my computer, the run-time roughly doubles as the input size doubles,
which is consistent with the worst-case complexity being linear.
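
The `%timeit` command only works in Jupyter and IPython. Outside a notebook, a similar measurement can be sketched with the standard `timeit` module. The helper `time_search` and its default number of runs are assumptions of mine, and the times will differ from those shown above.

```python
import timeit


def first_index(text: str, character: str) -> int:
    """Return the lowest index of character in text, or len(text) if absent."""
    for index in range(len(text)):
        if text[index] == character:
            return index
    return len(text)


def time_search(size: int, runs: int = 1000) -> float:
    """Return the mean time in seconds of one worst-case search.

    The text searched is "blah" repeated size times.
    """
    text = size * "blah"
    total = timeit.timeit(lambda: first_index(text, "!"), number=runs)
    return total / runs


for size in (100, 200, 400, 800):
    print(size, f"{time_search(size) * 1e6:.1f} µs")
```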

### 4.4.2 Valid password

Consider the problem of deciding whether a given string is a valid password,
which we take to mean having at least one lowercase letter and at least one digit.
To keep the example short, I focus on the problem definition, algorithm and complexity only.

**Function**: valid password\
**Inputs**: *password*, a string\
**Preconditions**: true\
**Output**: *is valid*, a Boolean\
**Postconditions**:
*is valid* if and only if *password* contains a digit and a lowercase letter

There are two conditions to be satisfied, so we should write at least four
tests, with inputs that satisfy both, neither or just one of the conditions.

#### Exercise 4.4.1

Write a test table. Add rows as necessary.

Case | *password* | *is valid*
-|-|-
  |   |

[Hint](../31_Hints/Hints_04_4_01.ipynb)
[Answer](../32_Answers/Answers_04_4_01.ipynb)

#### Algorithm

This is a decision problem that can be solved by
searching for two characters with certain properties.
We can use linear search again, as long as we remember whether we have found
a lowercase letter and whether we have found a digit so far.
There are only two states for each, found or not found, so Boolean variables will do.
This problem doesn't require keeping track of indices,
so we can iterate over the string directly.

1. let *has letter* be false
2. let *has digit* be false
3. for each *character* in *password*:
   1. if '0' ≤ *character* ≤ '9':
      1. let *has digit* be true
   1. if 'a' ≤ *character* ≤ 'z':
      1. let *has letter* be true
4. let *is valid* be *has digit* and *has letter*

This is a typical use of Boolean variables as **flags**.
A flag is 'raised', i.e. the Boolean is set to true, when some condition occurs
and it stays raised to remember that the condition occurred.
The use of Boolean flags is common in searches.
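
In Python, single-character strings can be compared with `<=` and comparisons can be chained, so conditions like those in steps 3.1 and 3.2 translate almost literally. The sketch below shows the flag pattern with just one flag, for a hypothetical function `contains_digit`. It deliberately doesn't implement the whole algorithm, which is left for Exercise 4.4.5.

```python
def contains_digit(text: str) -> bool:
    """Return True if and only if text has at least one digit from 0 to 9."""
    has_digit = False                # the flag starts lowered
    for character in text:           # iterate directly: no indices needed
        if "0" <= character <= "9":
            has_digit = True         # raise the flag; it's never lowered again
    return has_digit


print(contains_digit("pass1word"))   # True
print(contains_digit("password"))    # False
```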

#### Exercise 4.4.2

Explain whether the following algorithm is correct.

1. let *has letter* be false
2. let *has digit* be false
3. for each *character* in *password*:
   1. let *has digit* be '0' ≤ *character* ≤ '9'
   1. let *has letter* be 'a' ≤ *character* ≤ 'z'
1. let *is valid* be *has digit* and *has letter*

_Write your answer here._

[Hint](../31_Hints/Hints_04_4_02.ipynb)
[Answer](../32_Answers/Answers_04_4_02.ipynb)

#### Exercise 4.4.3

The algorithm goes through the whole string even
if a lowercase letter and a digit appear early on in the string.
Alice and Bob are modifying the algorithm to stop as soon as it can.
This is Alice's algorithm:

1. let *has letter* be false
2. let *has digit* be false
3. for each *character* in *password*:
   1. if '0' ≤ *character* ≤ '9':
      1. let *has digit* be true
   1. if 'a' ≤ *character* ≤ 'z':
      1. let *has letter* be true
   1. if *has digit* and *has letter*:
      1. stop
4. let *is valid* be *has digit* and *has letter*

This is Bob's algorithm:

1. let *has letter* be false
2. let *has digit* be false
3. for each *character* in *password*:
   1. if '0' ≤ *character* ≤ '9':
      1. let *has digit* be true
   1. if 'a' ≤ *character* ≤ 'z':
      1. let *has letter* be true
   1. let *is valid* be *has digit* and *has letter*
   1. if *is valid*:
      1. stop

For each algorithm, explain whether it's correct or not.

_Write your answer here._

[Answer](../32_Answers/Answers_04_4_03.ipynb)

#### Complexity

The complexity of an algorithm is an indication of how its run-time grows
for increasingly large inputs. By definition, a constant-time step doesn't make
the run-time grow and so doesn't contribute to the complexity of the algorithm.
We can thus ignore all constant-time steps when analysing the complexity.
Well, not quite all: we can't ignore if and stop statements,
because they affect how the algorithm behaves. For the original algorithm,
and Alice's, we ignore steps 1, 2, 3.1.1, 3.2.1 and&nbsp;4.
(Bob's algorithm doesn't have a step&nbsp;4: we ignore step&nbsp;3.3 instead.)

The 'partial' algorithm

3. for each *character* in *password*:
   1. if '0' ≤ *character* ≤ '9':
   1. if 'a' ≤ *character* ≤ 'z':

has exactly the same complexity as the complete algorithm.
Both if-statements take constant time, as each does one or two comparisons.
Whether the current character is a letter, digit or something else,
each iteration takes constant time.
The complexity is thus linear in the number of iterations,
which is the length of the input string: Θ(│*password*│).

As mentioned before, the complexity of an algorithm is not about the
run-times for particular problem instances, but rather about the growth of
the run-times for instances with increasingly large values or sizes.
Therefore, a scenario is a *collection* of problem instances with
increasing sizes or values: a scenario is not a *single* problem instance.
Even though this algorithm does the least work for the empty string,
because the loop is skipped, the empty string is *not* a best-case scenario.

#### Exercise 4.4.4

What are best- and worst-case scenarios for a linear search algorithm that
stops as soon as it knows the password is valid?

_Write your answer here._

[Hint](../31_Hints/Hints_04_4_04.ipynb)
[Answer](../32_Answers/Answers_04_4_04.ipynb)

#### Exercise 4.4.5

Implement and test the password-validation function.
You can choose the original algorithm or
the more efficient version that stops early.

In [9]:
# replace this with your code

Add code cells for the tests.

[Answer](../32_Answers/Answers_04_4_05.ipynb)
