## 3.4 Classification problems

So far, all algorithms have been just sequences of instructions:
first do this, then do this, next do that, etc.
Most algorithms don't execute all their instructions, only some of them,
based on certain conditions. Programming languages have conditional statements
that allow us to select which instructions to execute under which conditions.
Let's consider a classic problem that requires selection,
with the conditions stated as Boolean expressions. I again follow the same process to solve it.

<div class="alert alert-info">
<strong>Info:</strong> TM112 introduces selection in Block&nbsp;1 Sections 4.3 and&nbsp;4.4, and includes
a more complicated version of the following problem in Block&nbsp;2 Section&nbsp;2.5.
</div>

### 3.4.1 Problem definition and instances

Given a mark (an integer from 0 to 100), we wish to
award the corresponding pass grade, from 1 (distinction) to 5 (fail).
This is a **classification problem**: each of the many possible input values is
classified into one of a few categories.
A decision problem is a classification problem with only two categories.

<div class="alert alert-info">
<strong>Info:</strong> Question 19 of the TM112 Block&nbsp;1 Quiz is a classification problem too:
given the Richter magnitude of an earthquake,
classify it as a minor, moderate or major earthquake.
TM358 introduces machine learning algorithms for difficult classification problems.
</div>

Let's assume that, in some fictitious module,
the pass grade boundaries are 40, 50, 60 and 80.
Marks on the boundaries are awarded the better pass grade,
e.g. a mark of 40 is awarded pass 4, not pass 5.

**Function**: grading\
**Inputs**: *mark*, an integer\
**Preconditions**: 0 ≤ *mark* ≤ 100\
**Output**: *pass*, an integer\
**Postconditions**:

- *pass* = 5 if 0 ≤ *mark* < 40
- *pass* = 4 if 40 ≤ *mark* < 50
- *pass* = 3 if 50 ≤ *mark* < 60
- *pass* = 2 if 60 ≤ *mark* < 80
- *pass* = 1 if 80 ≤ *mark* ≤ 100

There's no simple formula to transform a mark into a pass grade.
It's easier to write one condition per grade.

<div class="alert alert-info">
<strong>Info:</strong> Functions that have different formulas for different intervals of the input
values are called piecewise-defined functions in MST124 Unit&nbsp;3 Section&nbsp;1.3.
</div>

It's easy to make a test table because...

<div class="alert alert-warning">
<strong>Note:</strong> The edge cases for a classification problem are the categories' boundaries.
</div>

Case | *mark* | *pass*
-|-|-
lowest fail  | 0  |  5
highest fail  | 39  |  5
lowest pass 4 | 40  |  4
highest pass 4 | 49  |  4
lowest pass 3 | 50  |  3
highest pass 3 | 59  |  3
lowest merit  | 60  |  2
highest merit  | 79  |  2
lowest distinction  | 80  |  1
highest distinction  | 100  |  1

### 3.4.2 Algorithm

The algorithm for a classification problem is a sequence of statements of the
form 'if the input value is like this, then the category is that one',
which is essentially how the postconditions are expressed,
making their translation to an algorithm rather easy.

1. if 0 ≤ *mark* < 40:
    1.  let *pass* be 5
1. if 40 ≤ *mark* < 50:
    1.  let *pass* be 4
1. if 50 ≤ *mark* < 60:
    1.  let *pass* be 3
1. if 60 ≤ *mark* < 80:
    1.  let *pass* be 2
1. if 80 ≤ *mark* ≤ 100:
    1.  let *pass* be 1

The algorithm simply follows the typographic convention in English of
introducing sublists of items (here, instructions) with colons, and indenting them.
We refer to individual steps as 1, 1.1, 2, 2.1, etc.

This is a fine algorithm. It's easy to check the algorithm is correct
because it directly follows the postconditions,
showing explicitly every category boundary.
The conditions are mutually exclusive
(at most one is true for each problem instance) and comprehensive
(at least one is true for each problem instance).
For any input value that satisfies the preconditions,
exactly one condition is true,
and this allows the output to be uniquely determined.
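Both properties can be checked mechanically. The following Python sketch evaluates the five conditions for every mark allowed by the preconditions and counts how many of them are true. The function name `conditions` is made up for this illustration; it isn't part of the problem definition.

```python
def conditions(mark: int) -> list[bool]:
    """Return the truth value of each of the five conditions, in order."""
    return [
        0 <= mark < 40,
        40 <= mark < 50,
        50 <= mark < 60,
        60 <= mark < 80,
        80 <= mark <= 100,
    ]

# For each valid mark, count how many conditions are true.
# True counts as 1 and False as 0 when summed.
counts = [sum(conditions(mark)) for mark in range(0, 101)]

# Mutually exclusive: never more than one. Comprehensive: never fewer than one.
print(all(count == 1 for count in counts))  # True
```

If a boundary were mistyped, e.g. `40 < mark < 50`, the count for mark 40 would be zero and the check would print `False`.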

However, the algorithm is not the most efficient.
If the mark is, say, 30, only the first condition applies, but
the remaining conditions are checked too, even though they're false.
A more efficient algorithm stops as soon as the grade is determined.

1. if 0 ≤ *mark* < 40:
    1.  let *pass* be 5
1. otherwise if 40 ≤ *mark* < 50:
    1.  let *pass* be 4
1. otherwise if 50 ≤ *mark* < 60:
    1.  let *pass* be 3
1. otherwise if 60 ≤ *mark* < 80:
    1.  let *pass* be 2
1. otherwise if 80 ≤ *mark* ≤ 100:
    1.  let *pass* be 1

The word 'otherwise' indicates that the next condition is checked only if
the previous ones failed. This allows us to check fewer conditions. For example,
if the second condition gets evaluated, it means the first one is false,
i.e. the mark isn't in the range 0–39. It's therefore redundant to check
if it's greater than or equal to 40 (because it is). The algorithm becomes:

1. if 0 ≤ *mark* < 40:
    1.  let *pass* be 5
1. otherwise if *mark* < 50:
    1.  let *pass* be 4
1. otherwise if *mark* < 60:
    1.  let *pass* be 3
1. otherwise if *mark* < 80:
    1.  let *pass* be 2
1. otherwise if *mark* ≤ 100:
    1.  let *pass* be 1

We can omit the first part of the first check and all of the final check because 0 ≤ *mark* ≤ 100 is always true,
due to the preconditions.

1. if *mark* < 40:
    1.  let *pass* be 5
1. otherwise if *mark* < 50:
    1.  let *pass* be 4
1. otherwise if *mark* < 60:
    1.  let *pass* be 3
1. otherwise if *mark* < 80:
    1.  let *pass* be 2
1. otherwise:
    1.  let *pass* be 1

The last part of the algorithm states that if none of the previous cases
applies then the grade must be a distinction.

<div class="alert alert-warning">
<strong>Note:</strong> Check input intervals from the lowest to the highest or
from the highest to the lowest, so that you can simplify the conditions.
</div>

The following exercises show the importance of the order in which
conditions are checked.

#### Exercise 3.4.1

My algorithm, repeated below, does one comparison for a pass 5 mark
and four comparisons for a pass 1 mark.

1. if *mark* < 40:
    1.  let *pass* be 5
1. otherwise if *mark* < 50:
    1.  let *pass* be 4
1. otherwise if *mark* < 60:
    1.  let *pass* be 3
1. otherwise if *mark* < 80:
    1.  let *pass* be 2
1. otherwise:
    1.  let *pass* be 1

Change the algorithm so that
the minimum number of comparisons is made for pass 1 marks,
and the maximum number of comparisons is made for pass 5 marks.
Do as few comparisons as possible.

[Hint](../31_Hints/Hints_03_4_01.ipynb)
[Answer](../32_Answers/Answers_03_4_01.ipynb)

#### Exercise 3.4.2

Consider the following grading algorithm. Is it correct?

1. if 60 ≤ *mark* < 80:
    1.  let *pass* be 2
1. otherwise if 0 ≤ *mark* < 40:
    1.  let *pass* be 5
1. otherwise if 50 ≤ *mark* < 60:
    1.  let *pass* be 3
1. otherwise if 80 ≤ *mark* ≤ 100:
    1.  let *pass* be 1
1. otherwise if 40 ≤ *mark* < 50:
    1.  let *pass* be 4


[Hint](../31_Hints/Hints_03_4_02.ipynb)
[Answer](../32_Answers/Answers_03_4_02.ipynb)

#### Exercise 3.4.3

Consider the following simplification of the above conditions.

1. if *mark* < 80:
    1.  let *pass* be 2
1. otherwise if *mark* < 40:
    1.  let *pass* be 5
1. otherwise if *mark* < 60:
    1.  let *pass* be 3
1. otherwise if *mark* ≤ 100:
    1.  let *pass* be 1
1. otherwise if *mark* < 50:
    1.  let *pass* be 4

Explain why the algorithm is incorrect by showing *one* **counter-example**:
a problem instance for which the algorithm produces the wrong output.

_Write your answer here._

Copy the text cell with the algorithm to below this paragraph and fix it.
You can only change the conditions, not their order.

[Hint](../31_Hints/Hints_03_4_03.ipynb)
[Answer](../32_Answers/Answers_03_4_03.ipynb)

### 3.4.3 Complexity

The fourth version of the algorithm, the one just before the exercises,
always does one assignment, but it may execute one to four comparisons,
against 40, 50, 60 and 80 marks, depending on the input value.
To handle these situations, complexity analysis distinguishes between
best- and worst-case scenarios.

A **best-case scenario** is a group of problem instances that lead to the
algorithm doing the least work, i.e. running fastest.
For the algorithm at hand, a best-case scenario is a mark from 0 to 39,
as only one comparison is made.
A **worst-case scenario** is a group of problem instances that require the
algorithm to do the most work, i.e. running slowest.
A worst-case scenario for this algorithm is a mark from 60 upwards,
because it requires four comparisons.
(Best and worst cases are for the algorithm, not for the student's grade.)
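The comparison counts can be tabulated with a short Python sketch. It assumes the boundaries are checked in the order 40, 50, 60, 80, as in the fourth version of the algorithm. The names `BOUNDARIES` and `comparisons` are hypothetical helpers introduced only for this illustration.

```python
BOUNDARIES = [40, 50, 60, 80]  # checked in this order by the algorithm

def comparisons(mark: int) -> int:
    """Return how many comparisons the algorithm makes for the given mark.

    Preconditions: 0 <= mark <= 100
    """
    count = 0
    for boundary in BOUNDARIES:
        count = count + 1      # one more comparison: mark < boundary
        if mark < boundary:
            break              # the grade is determined, so stop comparing
    return count

counts = [comparisons(mark) for mark in range(0, 101)]
print(min(counts), max(counts))                   # 1 4
print(comparisons(39), comparisons(60), comparisons(100))  # 1 4 4
```

The marks with a count of 1 are exactly the fail marks, and the marks with a count of 4 are exactly those from 60 upwards.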

The algorithm does one assignment and one comparison
in the best-case scenario. That's a fixed number of constant-time operations, so
the algorithm has **best-case complexity** Θ(1). In the worst-case scenario,
the algorithm does one assignment and four comparisons,
which takes constant time too. The **worst-case complexity** is also Θ(1).
When an algorithm's best- and worst-case complexities are the same,
we simply state the algorithm's complexity without any qualification.
In this example, we just state that the algorithm has constant complexity.

As you'll see in later chapters, there may be different equivalent best-case
(or worst-case) scenarios, so we tend to speak of
*a* best- or worst-case scenario rather than *the* best- or worst-case scenario.
All best- (or worst-) case scenarios necessarily have the same complexity,
otherwise some scenarios would be better (respectively, worse) than others.

### 3.4.4 Code

Python's syntax closely follows English, with 'if', 'otherwise' and
'otherwise if' being the keywords `if`, `else` and `elif`, respectively.

Problem definitions indicate the output's name
so that postconditions can refer to it.
Function headers don't name the output, so I write instead 'the output is 4'
or more simply 'return 4' in the docstring.
I take the opportunity to rely on the preconditions,
i.e. that marks go from 0 to 100, to
slightly simplify the formulation of the postconditions.

In [1]:
def grading(mark: int) -> int:
    """Return the pass grade, from 1 to 5, for the given mark.

    Preconditions: 0 <= mark <= 100
    Postconditions:
    - if mark < 40, return 5
    - if 40 <= mark < 50, return 4
    - if 50 <= mark < 60, return 3
    - if 60 <= mark < 80, return 2
    - if mark >= 80, return 1
    """
    if mark < 40:
        grade = 5
    elif mark < 50:
        grade = 4
    elif mark < 60:
        grade = 3
    elif mark < 80:
        grade = 2
    else:
        grade = 1
    return grade

We can immediately return the grade once it's determined,
which leads to another version:
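```python
def grading(mark: int) -> int:
    """Return the pass grade, from 1 to 5, for the given mark."""
    if mark < 40:
        return 5
    if mark < 50:
        return 4
    if mark < 60:
        return 3
    if mark < 80:
        return 2
    return 1  # no earlier condition applied, so it's a distinction
```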
In English, we write this algorithm as:

1. if *mark* < 40:
    1.  let *pass* be 5
    2.  stop
1. if *mark* < 50:
    1.  let *pass* be 4
    2.  stop
1. if *mark* < 60:
    1.  let *pass* be 3
    2.  stop
1. if *mark* < 80:
    1.  let *pass* be 2
    2.  stop
1. let *pass* be 1

With the explicit 'stop' instruction we don't need to use 'otherwise'.
For this example, I prefer to use 'otherwise' instead of 'stop',
as it makes the algorithm shorter and easier to understand, in my view.

Remember that our algorithms in English must assign a value
to the output variable mentioned in the function definition template,
whereas algorithms in Python can directly return the value,
as there's no output name in the Python function header.

Some authors advocate having a single stopping point in an algorithm,
usually implicit after the final step,
as having several may make the algorithm harder to understand.
In M269 you can use either 'style':
one style may be more convenient for the algorithm you're working on.
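Whichever style is chosen, the computed grades should be the same. The sketch below puts the two styles side by side and compares their outputs for every valid mark. The function names `grading_single_exit` and `grading_early_return` are invented for this comparison.

```python
def grading_single_exit(mark: int) -> int:
    """Single stopping point: assign the grade, return it at the end."""
    if mark < 40:
        grade = 5
    elif mark < 50:
        grade = 4
    elif mark < 60:
        grade = 3
    elif mark < 80:
        grade = 2
    else:
        grade = 1
    return grade

def grading_early_return(mark: int) -> int:
    """Several stopping points: return as soon as the grade is known."""
    if mark < 40:
        return 5
    if mark < 50:
        return 4
    if mark < 60:
        return 3
    if mark < 80:
        return 2
    return 1

# Compare both versions on every mark allowed by the preconditions.
same = all(grading_single_exit(mark) == grading_early_return(mark)
           for mark in range(0, 101))
print(same)  # True
```

An exhaustive comparison is feasible here only because there are just 101 valid inputs.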

### 3.4.5 Tests

Previously, we called the function on each problem instance and manually checked
if the output was the same as in the last column of the test table. We can use
Boolean expressions to compare the returned grade against the expected one.
It's much faster and less error-prone to see if all outputs are true
than to check if each output is the right grade.

In [2]:
grading(0) == 5

True

In [3]:
grading(39) == 5

True

In [4]:
grading(40) == 4

True

In [5]:
grading(49) == 4

True

In [6]:
grading(50) == 3

True

In [7]:
grading(59) == 3

True

In [8]:
grading(60) == 2

True

In [9]:
grading(79) == 2

True

In [10]:
grading(80) == 1

True

In [11]:
grading(100) == 1

True
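As a possible alternative to one cell per test, the test table can be written as a list of tuples and checked in a loop. This is only a sketch: the variable `test_table` and the reporting format are assumptions, and `grading` is repeated from Section 3.4.4 so that the code is self-contained.

```python
def grading(mark: int) -> int:
    """Return the pass grade, from 1 to 5, for the given mark."""
    if mark < 40:
        grade = 5
    elif mark < 50:
        grade = 4
    elif mark < 60:
        grade = 3
    elif mark < 80:
        grade = 2
    else:
        grade = 1
    return grade

# Each row of the test table: case description, mark, expected pass grade.
test_table = [
    ('lowest fail', 0, 5),
    ('highest fail', 39, 5),
    ('lowest pass 4', 40, 4),
    ('highest pass 4', 49, 4),
    ('lowest pass 3', 50, 3),
    ('highest pass 3', 59, 3),
    ('lowest merit', 60, 2),
    ('highest merit', 79, 2),
    ('lowest distinction', 80, 1),
    ('highest distinction', 100, 1),
]

failures = 0
for case, mark, expected in test_table:
    actual = grading(mark)
    if actual != expected:
        print(case, 'FAILED:', actual, 'instead of', expected)
        failures = failures + 1
print('Tests finished:', failures, 'failed')  # Tests finished: 0 failed
```

Only failing tests produce a line of output, so a passing run is quick to read, and a failing test names its case.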

### 3.4.6 Performance

Remember that in a best-case scenario (any fail mark)
only one comparison is made,
whereas in a worst-case scenario (any mark from 60 upwards, e.g. a distinction mark)
four comparisons are made. For curiosity's sake,
let's see the difference between the best- and the worst-case run-times.

In [12]:
%timeit -r 3 -n 1000 grading(0)
%timeit -r 3 -n 1000 grading(100)

49.3 ns ± 0.367 ns per loop (mean ± std. dev. of 3 runs, 1,000 loops each)
73.2 ns ± 0.484 ns per loop (mean ± std. dev. of 3 runs, 1,000 loops each)


On my computer the second call takes about 1.5 times as long as the first,
not four times as long,
because most of the time goes into calling the function and returning from it,
not into executing comparisons.
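Outside a notebook, a similar measurement can be sketched with Python's standard `timeit` module. The helper `nanoseconds_per_call` and the baseline function `do_nothing` are hypothetical names for this illustration. The sketch reports the fastest of three runs rather than the mean, and the `lambda` wrapper adds a little overhead of its own to every measurement, so the numbers aren't directly comparable with those above.

```python
import timeit

def grading(mark: int) -> int:
    """Return the pass grade, from 1 to 5, for the given mark."""
    if mark < 40:
        return 5
    elif mark < 50:
        return 4
    elif mark < 60:
        return 3
    elif mark < 80:
        return 2
    else:
        return 1

def do_nothing(mark: int) -> int:
    """Hypothetical baseline: no comparisons, just a call and a return."""
    return 5

def nanoseconds_per_call(function, mark: int) -> float:
    """Return the fastest of 3 runs of 1000 calls, in nanoseconds per call."""
    totals = timeit.repeat(lambda: function(mark), repeat=3, number=1000)
    return min(totals) / 1000 * 1e9

print('no comparisons: ', nanoseconds_per_call(do_nothing, 0))
print('one comparison: ', nanoseconds_per_call(grading, 0))
print('four comparisons:', nanoseconds_per_call(grading, 100))
```

The run-times vary between computers and between runs. The baseline gives a rough idea of how much of each measurement is due to the function call itself.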

### 3.4.7 Mistakes

Writing `else if` instead of `elif` is a syntax error.

If the conditions are not comprehensive,
i.e. don't cover all possible input values,
then no output is computed for some problem instances.
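In Python, one way this mistake can show itself is sketched below. The function `grading_incomplete` is a deliberately faulty, made-up version that forgets the distinction category. A Python function that finishes without executing a `return` statement returns the special value `None`.

```python
def grading_incomplete(mark: int) -> int:
    """Faulty on purpose: no condition covers marks from 80 to 100."""
    if mark < 40:
        return 5
    elif mark < 50:
        return 4
    elif mark < 60:
        return 3
    elif mark < 80:
        return 2
    # There's no final 'else', so for marks from 80 upwards
    # the function ends here and implicitly returns None.

print(grading_incomplete(79))  # 2
print(grading_incomplete(80))  # None
```

A test for the distinction category, such as `grading_incomplete(80) == 1`, evaluates to `False` and so exposes the missing condition.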

If the conditions are not mutually exclusive,
i.e. they overlap for some input values,
then some problem instances can be classified in more than one category.
The algorithm will assign the category for the first condition that succeeds,
so the order in which conditions are checked may lead to the correct
answer for some inputs, but not for others.
Consider again the algorithm of Exercise&nbsp;3.4.3:

1. if *mark* < 80:
    1.  let *pass* be 2
1. otherwise if *mark* < 40:
    1.  let *pass* be 5
1. otherwise if *mark* < 60:
    1.  let *pass* be 3
1. otherwise if *mark* ≤ 100:
    1.  let *pass* be 1
1. otherwise if *mark* < 50:
    1.  let *pass* be 4

The conditions overlap, e.g. marks below 40&nbsp;satisfy all conditions, and
due to the order they're checked, marks below 60 get the wrong grade.
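To see which marks are affected, the algorithm above can be translated literally to Python and compared against the postconditions. This is a sketch: `grading_overlapping`, `INTERVALS` and `expected` are names introduced here for illustration only.

```python
def grading_overlapping(mark: int) -> int:
    """Literal translation of the algorithm above; the conditions overlap."""
    if mark < 80:
        return 2
    elif mark < 40:
        return 5
    elif mark < 60:
        return 3
    elif mark <= 100:
        return 1
    elif mark < 50:
        return 4

# The postconditions as (lowest mark, highest mark, pass grade).
INTERVALS = [(0, 39, 5), (40, 49, 4), (50, 59, 3), (60, 79, 2), (80, 100, 1)]

def expected(mark: int) -> int:
    """Return the pass grade required by the postconditions."""
    for lowest, highest, grade in INTERVALS:
        if lowest <= mark <= highest:
            return grade

# Collect every valid mark for which the algorithm's output is wrong.
wrong = [mark for mark in range(0, 101)
         if grading_overlapping(mark) != expected(mark)]
print(min(wrong), max(wrong))         # 0 59
print(wrong == list(range(0, 60)))    # True
```

Marks from 60 to 100 happen to get the right grade, which is why tests for pass 1 and pass 2 alone wouldn't reveal the problem.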

<div class="alert alert-warning">
<strong>Note:</strong> Write at least one test for each category, so that you're more likely
to catch missing and overlapping conditions.
</div>
