### Exercise 2a - the fairy tale

The first thing you need to do is to load the `re` module and the text.

In [34]:
import re
txt = "A boy called Peter lived with his 2 parents in a village on the hillside. His parents, like most of the other people in the village, were sheep farmers.  There were 430 sheep in the village, 33 sheep dogs and 21 humans. Close to the village, there were also 5 wolves.  Everybody in the village took turns to look after the sheep, and when Peter was 10 years old, he was considered old enough to take his turn at shepherding."

#### Find one word consisting of only two letters.

First, how do we express "a word"?

A word is something which is surrounded by whitespace (yes, this is a simplified definition, but we only need
to find one such word, so let's assume it's correct).
A whitespace character is `\s`, so something - let's call it `X` - surrounded by whitespace would be `\sX\s`.

Second, how do we express "a letter" as a regular expression?

An English letter is one between 'a' and 'z'. We want to express a *class* consisting of these letters.

You can form a class like this:
`[0-9]` (any digit) or like this `[abc]` ('a', 'b' or 'c').

This gives us:

one letter `=>` `[a-z]`

two letters `=>` `[a-z][a-z]`
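To see these classes in action, here is a minimal sketch that applies them to a made-up sample string (`sample` is an assumption for illustration and not part of the exercise text):

```python
import re

sample = "cab 42 xyz"  # made-up sample string, not from the exercise

digits = re.findall(r'[0-9]', sample)        # any single digit
abc = re.findall(r'[abc]', sample)           # 'a', 'b' or 'c'
one_letter = re.findall(r'[a-z]', sample)    # any single lowercase English letter
two_letters = re.findall(r'[a-z][a-z]', sample)  # two lowercase letters in a row

print(digits)
print(abc)
print(one_letter)
print(two_letters)
```

Note that `[a-z][a-z]` on its own also matches inside longer words (here the first two letters of "cab" and "xyz"), which is why the surrounding whitespace is needed.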

Now we have `\s[a-z][a-z]\s`: two English letters surrounded by whitespace. Let's test it!

In [3]:
p = re.compile(r'\s[a-z][a-z]\s')  # any two English letters surrounded by whitespace
match = p.search(txt)
if match:
    print("Found the word: '{}'".format(match.group()))

Found the word: ' in '


It works! Next exercise!
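The match above includes the surrounding whitespace. One possible refinement, not required by the exercise, is to put parentheses around the two letters so that they form a group, and then ask for that group only. The sketch below uses a shortened copy of the text (`sample`) so that it can be run on its own:

```python
import re

# Shortened copy of the opening of the exercise text
sample = "A boy called Peter lived with his 2 parents in a village on the hillside."

# The parentheses create a group around the two letters only
p = re.compile(r'\s([a-z][a-z])\s')
match = p.search(sample)

whole = match.group() if match else None   # the full match, whitespace included
word = match.group(1) if match else None   # only the text inside the parentheses

print(repr(whole), repr(word))
```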

#### Find all words that contain an 'o'.

Again we need to specify which letters are allowed in words. From the last exercise we already have `[a-z]`.

Now, "a word containing o" could also be expressed as "any number of letters, then an 'o', then again any number of letters".

Any number (zero or more) is expressed as `*`, the Kleene star. Let's put it together:

`[a-z]*` (any number of letters) + `o` + `[a-z]*` (any number of letters)

Since the star `*` is greedy, it will consume as many characters as possible. This means that it will match all letters in the word, until the word boundary.

`[a-z]*o[a-z]*`: a word containing 'o'.
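To illustrate what greediness means here, the following sketch compares the greedy pattern with a non-greedy variant (`*?`) on a few words taken from the text. The non-greedy pattern is only included for contrast and is not part of the exercise solution:

```python
import re

sample = "most of the other people"  # a few words from the exercise text

greedy = re.findall(r'[a-z]*o[a-z]*', sample)    # takes as many letters as possible
lazy = re.findall(r'[a-z]*?o[a-z]*?', sample)    # stops as early as possible

print(greedy)
print(lazy)
```

The greedy version returns whole words, while the non-greedy version stops as soon as it has seen the 'o'.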

If we want to allow capital letters, we can add the `re.IGNORECASE` flag:

In [6]:
p = re.compile('[a-z]*o[a-z]*', re.IGNORECASE)  # an 'o' surrounded by zero or more English letters

for match in p.finditer(txt):
    print(match.group())

boy
on
most
of
other
people
dogs
Close
to
also
wolves
Everybody
took
to
look
old
considered
old
enough
to


#### Find all numbers written with digits

We are looking for digits! Digits are represented with `\d`.

`\d` will match a single digit such as `2`, but on its own it will not match a whole number such as `430`. How do we express that we allow more than one digit?

"One or more" is expressed as `+`.

We thus have `\d+`.
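The difference between `\d` and `\d+` can be seen in a small sketch on a shortened sample sentence (`sample` is an assumption for illustration). The conversion to `int` at the end is an optional extra step, not part of the exercise:

```python
import re

sample = "There were 430 sheep and 33 sheep dogs"

single = re.findall(r'\d', sample)     # one digit per match
numbers = re.findall(r'\d+', sample)   # one or more digits per match
as_ints = [int(n) for n in numbers]    # matches are strings; convert if needed

print(single)
print(numbers)
print(as_ints)
```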

In [23]:
p = re.compile(r'\d+')  # one or more digits

for match in p.finditer(txt):
    print(match.group())

2
430
33
21
5
10


#### Rename Peter to Petter

Remember the `p.sub(x, y)` method. It returns a copy of the string `y` in which every match of the pattern `p` is replaced with `x`.


This requires us to do two things:

1. Find "Peter" (that is, compile the expression matching "Peter")
2. Replace that pattern with "Petter" in the text.

The first task should not be too hard, we are looking for an exact string ("Peter"). To compile it, we use

```py
p = re.compile("Peter")
```

Then, we go on to the replacement:
```py
p.sub("Petter", txt)
```
We substitute "Peter" with "Petter" in `txt`. Let's try it:

In [7]:
p = re.compile("Peter")  # Match "Peter"

print(p.sub("Petter", txt))  # Replace with "Petter"

A boy called Petter lived with his 2 parents in a village on the hillside. His parents, like most of the other people in the village, were sheep farmers.  There were 430 sheep in the village, 33 sheep dogs and 21 humans. Close to the village, there were also 5 wolves.  Everybody in the village took turns to look after the sheep, and when Petter was 10 years old, he was considered old enough to take his turn at shepherding.
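Two details are easy to miss: `sub()` returns a new string and leaves the original unchanged, and it replaces every match. If you also want to know how many replacements were made, the related `subn()` method returns both the new string and the count. A minimal sketch on a shortened sample sentence (an assumption for illustration):

```python
import re

sample = "Peter lived in a village. Peter was 10 years old."
p = re.compile("Peter")

# subn() returns a tuple: (new string, number of replacements)
new_text, n_replaced = p.subn("Petter", sample)

print(new_text)
print(n_replaced)
print(sample)  # the original string is unchanged
```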


-----
### Exercise 2b - the protein sequence

Here's what we will do:
- for each line (that is, for each individual) look for the pattern
- if it's found, count it and remember what this variation looks like.


We start by writing a function that reads each line in a file:

In [40]:
def check_protein(inputfile):
    for line in open(inputfile):
        if not line.startswith('>'):
            # This is where we do something!
            ...
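To try the line-reading logic without the real data, here is a sketch that writes a tiny, made-up FASTA file to a temporary location and collects the non-header lines. The record names and sequences are hypothetical, and the sketch assumes, as the exercise does, that each sequence sits on a single line. It also uses `with open(...)`, which closes the file automatically:

```python
import os
import tempfile

# Hypothetical two-record file; the real exercise file is much larger
fasta_text = ">individual_1\nMKTPLTVETLAKTA\n>individual_2\nMKTPLTLQTLAKTA\n"

with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write(fasta_text)
    path = tmp.name

sequences = []
with open(path) as handle:              # the file is closed when the block ends
    for line in handle:
        if not line.startswith('>'):    # skip the header lines
            sequences.append(line.strip())  # drop the trailing newline

os.remove(path)  # clean up the temporary file
print(sequences)
```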

We now need to decide what pattern to look for. The original pattern is "TPLTVETLAKT", but "VE" is changed to "Lx" ("x" means anything), so we have "TPLTLxTLAKT".

"Any character" is expressed by the period `.`, so the change is expressed as `L.`.

We get `TPLTL.TLAKT`. This is the pattern we are looking for: the changed version of the original amino acid sequence. Let's compile the pattern:

```py
p = re.compile('TPLTL.TLAKT')
```
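A quick check of the pattern on two made-up sequences (both are assumptions for illustration) shows that the original "VE" version is not matched while a changed version is. The second pattern, with parentheses around the period, is an optional variant that extracts only the amino acid standing in for "x":

```python
import re

p = re.compile('TPLTL.TLAKT')

# Made-up sequences for illustration
original = "MKTPLTVETLAKTA"   # contains the unchanged "VE"
changed = "MKTPLTLQTLAKTA"    # contains "LQ" instead

no_match = p.search(original)         # None, since "VE" does not fit "L."
found = p.search(changed).group()     # the matched stretch of the sequence

# Optional variant: a group around the period captures only "x"
p_group = re.compile('TPLTL(.)TLAKT')
x = p_group.search(changed).group(1)

print(no_match)
print(found)
print(x)
```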

In [41]:
import re

def check_protein(inputfile):
    p = re.compile('TPLTL.TLAKT')  # The pattern
    for line in open(inputfile):
        if not line.startswith('>'):
            # This is where we do something!
            ...
    

The first question to answer is:
- How many individuals have got this change?

  For this, we will need to count, so we need a counter. At the beginning of the function, we set it to its start value:

  ```py
  counter = 0
  ```

We now start searching for the pattern, for each line, using the `search()` method of the compiled pattern:

```py
    match = p.search(line)
```

If we get a match, we count it!
```py
    if match:
        counter += 1
```


That's it, let's print the result:
```py
    print(f'Found {counter} individuals with changed sequence')
```

Everything put together:

In [15]:
import re


def check_protein(inputfile):
    counter = 0  # number of individuals with the changed sequence
    p = re.compile('TPLTL.TLAKT')
    for line in open(inputfile):
        if not line.startswith('>'):
            match = p.search(line)
            if match:
                counter += 1
    print('Found {} individuals with changed sequence'.format(counter))

The next two questions are:

- What variants are there (which values can x take)?

  For this one, we will need to collect a number of different patterns ("LQ", "LA"...). We could use
  a list, a set or a dictionary. Since the next question is...:
  
  
- Which variation is the most common one?

  ...we will have to remember *how many of each variant* we have seen. This tells us that a dictionary will be handy. This way, we can store the variants and their current count together:

  ```py
  {'LQ': 3, 'LA': 1, ...}
  ```

Define the empty dictionary in the beginning of the function, together with the `counter`:

```py
variants = {}
```



Go back to the code and to the line where you have found a match:
```py
if match:
```

If we have a match, we should:
- find out what variation it is, exactly. What is the pattern we caught in the string?
- if we haven't seen this variation before, add it to our `variants` dictionary
- increase the counter for this specific variant

To find the pattern:

`.group()` will give us the full text matched by `p` (for example `TPLTLQTLAKT`), which identifies the variation we are looking for.

```py
if match:
    var = match.group()
```

Now check if this variation (`var`) has been seen before. If not, add it to `variants` with a count of zero:
```py
        if var not in variants:
            variants[var] = 0
```


We then increase the count for the variation:
```py
        variants[var] += 1
```
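As an aside, the standard library offers `collections.Counter`, a dictionary subclass in which missing keys count as zero, so the "have we seen it before" check is not needed. The sketch below is an alternative to the approach used in this solution, shown on a made-up list of matches (`found` is an assumption standing in for the `match.group()` values):

```python
from collections import Counter

# Made-up matches, standing in for the values of match.group()
found = ['TPLTLQTLAKT', 'TPLTLATLAKT', 'TPLTLQTLAKT']

variants = Counter()
for var in found:
    variants[var] += 1   # a missing key starts at 0 automatically

print(variants)
```

The plain dictionary used in the solution works just as well; it simply makes the bookkeeping explicit.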

Everything put together:

In [17]:
import re

def check_protein(inputfile):
    counter = 0
    variants = {}
    p = re.compile('TPLTL.TLAKT')
    for line in open(inputfile):
        if not line.startswith('>'):
            match = p.search(line)
            if match:
                var = match.group()
                counter += 1  # if we found a match, count it
                if var not in variants:
                    variants[var] = 0
                variants[var] += 1    # count this one      

The loop is done! All variations are now counted, and we know both how many individuals have the change and how many different variations there are. Let's print it:

```py
    print('Found {} types of variations'.format(len(variants)))
```

To find out which variation is the most common, we will loop over the items of the dictionary `variants` and keep track of the maximum number.

```py
    max_num = 0   # the highest frequency we have seen so far
    max_variant = ''  # the variation with the highest frequency
    for change, count in variants.items():  # .items() yields (key, value) pairs
        if count > max_num:
            max_num = count
            max_variant = change
```
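The same search for the most common variation can also be written with the built-in `max()` function and a `key` argument. This is an optional shortcut rather than the approach used in this solution, and the counts below are made up for illustration:

```python
# Made-up counts for illustration
variants = {'TPLTLQTLAKT': 6, 'TPLTLATLAKT': 2, 'TPLTLSTLAKT': 1}

# max() compares the (variant, count) pairs by their count
max_variant, max_num = max(variants.items(), key=lambda item: item[1])

print(max_variant, max_num)
```

Note that `max()` raises a `ValueError` on an empty dictionary, whereas the explicit loop simply leaves `max_variant` empty.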

In [31]:
import re


def check_protein(inputfile):
    counter = 0
    variants = {}
    p = re.compile('TPLTL.TLAKT')
    for line in open(inputfile):
        if not line.startswith('>'):
            match = p.search(line)
            if match:
                counter += 1
                var = match.group()
                if var not in variants:
                    variants[var] = 0
                variants[var] += 1
    print('Found {} individuals with changed sequence'.format(counter))
    print('Found {} types of variations'.format(len(variants)))
    max_num = 0
    max_variant = ''
    for change, count in variants.items():
        if count > max_num:
            max_num = count
            max_variant = change
    print('Most common variation: {} ({} instances)'.format(max_variant, max_num))


To run the code:

In [39]:
check_protein('../../downloads/proteins.fasta')

Found 63 individuals with changed sequence
Found 19 types of variations
Most common variation: TPLTLQTLAKT (6 instances)
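If you want to check the logic without the real data file, one option is a variant of the function that accepts any iterable of lines and returns the counts instead of printing them. The sketch below is such a variant; the name `count_variants`, its parameters, and the four records are all hypothetical and not part of the exercise:

```python
import re
from collections import Counter


def count_variants(lines, pattern='TPLTL.TLAKT'):
    """Count each matched variant in an iterable of FASTA lines."""
    p = re.compile(pattern)
    variants = Counter()
    for line in lines:
        if line.startswith('>'):
            continue  # header line, no sequence here
        match = p.search(line)
        if match:
            variants[match.group()] += 1
    return variants


# Made-up records for illustration
lines = [
    ">individual_1", "MKTPLTVETLAKTA",
    ">individual_2", "MKTPLTLQTLAKTA",
    ">individual_3", "MKTPLTLATLAKTA",
    ">individual_4", "MKTPLTLQTLAKTA",
]

variants = count_variants(lines)
total = sum(variants.values())         # individuals with the changed sequence
most_common = variants.most_common(1)  # list with the top (variant, count) pair

print(total, len(variants), most_common)
```

Because an open file is also an iterable of lines, the same function could be called with a file handle.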
