# 1. Pseudo-code

(Download this file as an
<a href="pseudocode.html" download>HTML document</a> or a
<a href="pseudocode.ipynb" download>Jupyter notebook</a>.)

## 1.1 Introduction

Pseudo-code is a precise, yet somewhat informal description of an algorithm.

**Input** The algorithm is first described in terms of its input: is it a list, a string, something else?
Are there any pre-requisites or requirements?  Should the list have a specified length?  Is there a maximum length?

**Output** What does the algorithm output?  Does it return a result?

Pseudo-code is usually based on a syntax reminiscent of procedural programming languages:

* It uses assignments, sometimes written $x \leftarrow 1$, which is equivalent to `x = 1` in Python
* It uses standard operations and mathematical functions
* It uses standard while or for loops
* It uses standard if-then-else control flow
* It can use sophisticated data structures whose properties are known (e.g., dictionaries)
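
As a rough illustration of how these constructs translate, the sketch below pairs each Python line with a pseudo-code reading in its comment. The function name `count_letters` and the sample string are made up for this example.

```python
def count_letters(text):
    """Count how often each character occurs in text (illustrative helper)."""
    counts = {}                      # Initialise counts <- empty dictionary
    for letter in text:              # for all letters in text do
        if letter in counts:         #     if letter was already seen then
            counts[letter] += 1      #         counts[letter] <- counts[letter] + 1
        else:                        #     else
            counts[letter] = 1       #         counts[letter] <- 1
    return counts                    # return counts


print(count_letters('GATTACA'))      # {'G': 1, 'A': 3, 'T': 2, 'C': 1}
```

The dictionary is used here as a data structure "whose properties are known": looking up and updating a key is taken for granted and not spelled out in the pseudo-code.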

Read more about pseudo-code from Wikipedia [EN](https://en.wikipedia.org/wiki/Pseudocode) or [FR](https://fr.wikipedia.org/wiki/Pseudo-code).

Remember how we calculated the arithmetic mean of a sample of values contained in a list (or NumPy array, or tuple, or set):

In [1]:
def mean(X):
    """This function computes the sample mean of a list of values"""
    
    n = len(X)
    Xsum = 0.
    
    for Xval in X:
        Xsum += Xval
        
    return Xsum/n

In [2]:
mean([1, 2, 3, 4])

2.5

For instance, the pseudo-code for our mean() function would look like:

<ul style="list-style-type:none">
    <li>ALGORITHM mean</li>
    <ul style="list-style-type:none">
        <li><strong>Input:</strong>
            A non-empty list of numbers $x = (x_0, x_1, \ldots, x_{n-1})$</li>
        <li><strong>Output:</strong> the arithmetic mean
            $\bar{x} = \tfrac{1}{n} (x_0 + x_1 + \cdots + x_{n-1})$</li>
        <li>Initialise $S \leftarrow 0$</li>
        <li>Let $n$ be the length of $x$</li>
        <li><strong>for</strong> all $x_i$ in $x$ <strong>do</strong></li>
        <ul style="list-style-type:none">
            <li>Update $S \leftarrow S + x_i$</li>
        </ul>
        <li><strong>end for</strong></li>
        <li><strong>return</strong> $S/n$</li>
    </ul>
</ul>
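
The Input line states a pre-requisite (the list must be non-empty) that the `mean` function above does not check: an empty list would end in a `ZeroDivisionError`. One possible way to make the pre-requisite explicit is sketched below; the name `mean_checked` and the choice of `ValueError` are assumptions made for this example, not part of the original function.

```python
def mean_checked(X):
    """Sample mean that enforces the 'non-empty' pre-requisite (hypothetical variant)."""
    n = len(X)
    if n == 0:
        # The pseudo-code's Input line rules this case out, so we refuse it.
        raise ValueError('mean_checked() requires a non-empty collection')

    S = 0.0              # Initialise S <- 0
    for x_i in X:        # for all x_i in x do
        S += x_i         #     Update S <- S + x_i
    return S / n         # return S/n


print(mean_checked([1, 2, 3, 4]))  # 2.5
```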

## 1.2 Exercise: Number of occurrences of a substring

Suppose you have two strings, $G$ and $s$. We want to determine how many times $s$ appears in $G$.  $G$ must be at least as long as $s$.

<ul style="list-style-type:none">
    <li>ALGORITHM count_occurrences</li>
    <ul style="list-style-type:none">
        <li><strong>Input:</strong>
            A string $G$ and a string $s$; $|G| \geq |s|$</li>
        <li><strong>Output:</strong> an integer, the number of substrings of $G$ that match exactly $s$</li>
        <li>Let $m$ be the length of $G$</li>
        <li>Let $n$ be the length of $s$</li>
        <li>Initialise $c \leftarrow 0$</li>
        <li><strong>for</strong> all substrings $g$ of length $n$ of $G$ <strong>do</strong></li>
        <ul style="list-style-type:none">
            <li><strong>if</strong> $g == s$ <strong>then</strong></li>
            <ul style="list-style-type:none">
                <li>Update $c \leftarrow c + 1$</li>
            </ul>
            <li><strong>end if</strong></li>
        </ul>
        <li><strong>end for</strong></li>
        <li><strong>return</strong> $c$</li>
    </ul>
</ul>

In [1]:
G = '\
TCTAGGGGAAAGTGTCGTGAACGTGACCGGTTAGCACAG\
GAGTGAGAATACCATTCTGGAACCACTCCTGTGTAGGCC\
TTACCATTTCTTACTCGCATTGTACTTGTGGGCAAAGTC\
TTCGTTATCTCTCGGACTGTTCTTTAACTTACGAACCCA'
s = 'TAC'

m = 'This is useless'

In [25]:
def count_occurrences(G, s):
    '''Determines the number of substrings of G that match exactly s'''

    m = len(G)
    n = len(s)
    c = 0

    for i in range(m - n + 1):   # every start index of a window of length n
        g = G[i:(i + n)]         # substring of length n starting at index i

        if g == s:
            c = c + 1

    return c

In [26]:
count_occurrences(G, s)

5

In [22]:
m # The original m variable has not been affected by the function's local variables

'This is useless'
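
Because the algorithm inspects every window of length $n$, occurrences that overlap each other are counted separately. Python's built-in `str.count` behaves differently: it counts non-overlapping occurrences only. The small sketch below contrasts the two; `count_overlapping` is a hypothetical, compact restatement of the sliding-window idea, written so that the example stands on its own.

```python
def count_overlapping(text, pattern):
    """Slide a window of len(pattern) over text and count exact matches."""
    n = len(pattern)
    return sum(1 for i in range(len(text) - n + 1) if text[i:i + n] == pattern)


print(count_overlapping('AAAA', 'AA'))  # 3: windows starting at 0, 1 and 2
print('AAAA'.count('AA'))               # 2: str.count skips past each match
```

For a pattern that cannot overlap with itself, both approaches give the same count.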

Two of the most common errors:

* Forgetting the `:` at the end of a `for`, `if` or `while` statement
* Indenting the code inconsistently
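
Both mistakes are caught before any line runs, because Python rejects the code when it parses it. The sketch below uses the built-in `compile` function to show which error each mistake triggers; the helper `syntax_error_name` and the three code snippets are invented for this illustration.

```python
def syntax_error_name(source):
    """Return the name of the error raised when compiling source, or None."""
    try:
        compile(source, '<example>', 'exec')
    except SyntaxError as err:   # IndentationError is a subclass of SyntaxError
        return type(err).__name__
    return None


missing_colon = 'for i in range(3)\n    print(i)\n'
bad_indent = 'x = 1\n    y = 2\n'
good = 'for i in range(3):\n    print(i)\n'

print(syntax_error_name(missing_colon))  # SyntaxError
print(syntax_error_name(bad_indent))     # IndentationError
print(syntax_error_name(good))           # None
```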

## 1.3 Exercise: Longest common substring

Suppose you're given two strings $r$ and $s$.

In [6]:
R = 'AAGCCGCGTCTAACTCGTTGTGCCAAAGCCGCAGACGATAATTAAG'
S = 'CGAATGGTTGGGAATCAAAAGATATAGATTGTCCTCAGCCGATACATTGATGATTGATAGGCCTAACGCGGGCAACTGTG'

# R = 'AAGCCGCGTCTAACTCGTTGTGCCAAAGCCGCAGACGATAATTAAGCATACGCTTGATTTAAGCTGACCTCCAGTGGAGC'

# AA AG GC CC CG ... size of substring = 2
# AAG AGC GCC ... size of substring = 3
# AAGC AGCC ... = 4
# AAGCC AGCCG ... = 5
# AAGCCG ... = 6

We wish to find the longest substring (i.e., run of contiguous letters) common to both $r$ and $s$ (here it's <span style='color:blue'>AGCCG</span>):

<ul style="list-style-type:none">
    <li><tt>CGAATGGTTGGGAATCAAAAGATATAGATTGTCCTC<span style='color:blue'>AGCCG</span>ATACATTGATGATTGATAGGCCTAACGCGGGCAACTGTG</tt></li>
    <li><tt>A<span style='color:blue'>AGCCG</span>CGTCTAACTCGTTGTGCCAA<span style='color:blue'>AGCCG</span>CAGACGATAATTAAGCATACGCTTGATTTAAGCTGACCTCCAGTGGAGC</tt></li>
</ul>
    
Devise a simple, if naive, algorithm to solve this problem and write it as pseudo-code.

```
Input: R, S (assume R is no longer than S)
Output: one of the longest common substrings of R and S

m <- len(R)
n <- len(S)

Initialise longest_substring <- ''
for size T in 1...m:
    for all substrings r of size T of R:
        for all substrings s of size T of S:
            if r == s:
                longest_substring <- r

return longest_substring
```

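
One possible Python transcription of this naive search is sketched below. The function name `longest_common_substring` is an assumption, and `min(m, n)` replaces the "R is no longer than S" requirement so that the arguments can be given in either order. Sizes are tried in increasing order, so the last match recorded is one of the longest.

```python
def longest_common_substring(R, S):
    """Naive search: try every substring size and keep the last (longest) match."""
    m = len(R)
    n = len(S)
    longest_substring = ''

    for size in range(1, min(m, n) + 1):       # for size T in 1...m
        for i in range(m - size + 1):          # all substrings r of size T of R
            r = R[i:i + size]
            for j in range(n - size + 1):      # all substrings s of size T of S
                if r == S[j:j + size]:
                    longest_substring = r

    return longest_substring


print(longest_common_substring('ABCDE', 'XBCDY'))  # BCD
```

The three nested loops, plus the string comparison itself, make this approach slow on long sequences, which is what "naive" means here.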
## 1.4 Exercise: Longest common subsequence

We alter the problem slightly: the letters no longer have to be contiguous in either sequence $r$ or $s$.  Experiment with Eryk Kopczyński's [online tool](https://www.mimuw.edu.pl/~erykk/algovis/lcs.html) with the same two sequences $r$ and $s$ as above.

The answer should be <tt>CGAATGTTGGCAAAAGAATAATTGCTACGCTTGATTTAAGGCCTCCGGGAC</tt>.

The pseudo-code below determines the length of the longest common subsequence (not the subsequence itself); it is borrowed from Wikipedia's [Longest common subsequence problem](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Computing_the_length_of_the_LCS).

<ul style="list-style-type:none">
    <li>ALGORITHM lcs</li>
    <ul style="list-style-type:none">
        <li><strong>Input:</strong>
            Two strings $r$ and $s$</li>
        <li><strong>Output:</strong> an integer, the length of the longest subsequence common to both $r$ and $s$</li>
        <li>Let $m$ be the length of $r$</li>
        <li>Let $n$ be the length of $s$</li>
        <li>Initialise an $(m + 1) \times (n + 1)$ matrix $M$
            ($m + 1$ rows and $n + 1$ columns) with zeroes</li>
        <li><strong>for</strong> $i$ in $1 \ldots m$ <strong>do</strong></li>
        <ul style="list-style-type:none">
            <li><strong>for</strong> $j$ in $1 \ldots n$ <strong>do</strong></li>
            <ul style="list-style-type:none">
                <li><strong>if</strong> $r_{i-1} = s_{j-1}$ <strong>then</strong></li>
                <ul style="list-style-type:none">
                    <li>$M[i, j] \leftarrow M[i-1, j-1] + 1$</li>
                </ul>
                <li><strong>else</strong></li>
                <ul style="list-style-type:none">
                    <li>$M[i, j] \leftarrow \max(M[i, j-1], M[i-1, j])$</li>
                </ul>
                <li><strong>end if</strong></li>
            </ul>
            <li><strong>end for</strong></li>
        </ul>
        <li><strong>end for</strong></li>
        <li><strong>return</strong> $M[m, n]$</li>
    </ul>
</ul>

Note that Python has a built-in `max` function.

In [27]:
R = 'AAGCCGCGTCTAACTCGTTGTGCCAAAGCCGCAGACGATAATTAAGCATACGCTTGATTTAAGCTGACCTCCAGTGGAGC'
S = 'CGAATGGTTGGGAATCAAAAGATATAGATTGTCCTCAGCCGATACATTGATGATTGATAGGCCTAACGCGGGCAACTGTG'

In [28]:
import numpy as np # np.zeros might be of help

In [32]:
def lcs(r, s):
    '''Computes the length of the longest subsequence common to both r and s'''
    
    m = len(r)
    n = len(s)
    
    M = np.zeros((m + 1, n + 1), dtype=int)  # Use a tuple! np.zeros(m + 1, n + 1) wouldn't work

    for i in range(1, m + 1):      # m + 1 because the second bound is excluded
        for j in range(1, n + 1):  # n + 1 because the second bound is excluded
            if r[i - 1] == s[j - 1]:
                M[i, j] = M[i - 1, j - 1] + 1
            else:
                M[i, j] = max(M[i, j - 1], M[i - 1, j])

    return int(M[m, n])            # a plain integer, as the pseudo-code's Output states

In [34]:
lcs(R, S)

51
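
The matrix holds more than the final length: walking back from the bottom-right corner recovers one longest common subsequence. The sketch below is one way to do this, under the assumption that plain nested lists are an acceptable stand-in for the NumPy matrix; the name `lcs_string` and the tie-breaking rule (move up when both neighbours are equal) are choices made for this example, so other equally long subsequences may exist.

```python
def lcs_string(r, s):
    """Return one longest common subsequence of r and s (illustrative sketch)."""
    m, n = len(r), len(s)

    # Fill the same (m + 1) x (n + 1) table as in the pseudo-code.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if r[i - 1] == s[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i][j - 1], M[i - 1][j])

    # Walk back from M[m][n], collecting the matched letters in reverse order.
    letters = []
    i, j = m, n
    while i > 0 and j > 0:
        if r[i - 1] == s[j - 1]:
            letters.append(r[i - 1])
            i -= 1
            j -= 1
        elif M[i - 1][j] >= M[i][j - 1]:
            i -= 1
        else:
            j -= 1

    return ''.join(reversed(letters))


print(len(lcs_string('ABCBDAB', 'BDCABA')))  # 4
```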

In [31]:
np.zeros(10, 10)

TypeError: Cannot interpret '10' as a data type

In [30]:
np.zeros( (10, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
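
The error above arises because `np.zeros` takes the shape as its first argument and the data type as its second, so in `np.zeros(10, 10)` the second `10` is read as a data type. A minimal sketch of both arguments used together follows; the variable name `M` and the 3 x 4 shape are arbitrary choices for this example.

```python
import numpy as np

# Shape goes in as one tuple; dtype is the optional second argument.
M = np.zeros((3, 4), dtype=int)

print(M.shape)   # (3, 4)
print(M.sum())   # 0

# Without dtype, the entries are floating-point zeros, hence the "0." above.
print(np.zeros((3, 4)).dtype)  # float64
```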