In [None]:
from IPython.display import Image
from IPython.display import clear_output
from IPython.display import FileLink, FileLinks

## Introduction to

![title](img/python-logo-master-flat.png)

### with Application to Bioinformatics

#### - Day 5

## Review Day 4

- More control!

  - `break`, `continue`, `pass`
  - keyword arguments
  
- Modules

  - importing code
  - documentation: reading and writing

#### Control loops

- `break` a loop => stop it

<center>
<img src="img/break.png" alt="break" width="30%"/>
</center>


#### Control loops

- `continue` => go on to the next iteration

<center>
<img src="img/continue.png" alt="continue" width="30%"/>
</center>

#### Control loops

- `pass`  => do nothing

```py
for line in file:
    if len(line) > 40:
        # TODO find out what to do here
        pass
    do_something(line)
```
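
The three keywords can be compared side by side in one loop. This is a minimal sketch; the list `lines` and its contents are made up for illustration.

```python
# Hypothetical input lines, made up for this example.
lines = ["# header", "ATGC", "", "GGCC", "END", "TTAA"]

kept = []
for line in lines:
    if line == "END":
        break          # stop the loop completely
    if line == "":
        continue       # skip the rest of this iteration
    if line.startswith("#"):
        pass           # placeholder: nothing special happens (yet)
    kept.append(line)

print(kept)
```

Everything after `"END"` is never visited, the empty line is skipped, and the header line is kept because `pass` does nothing.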

#### Keyword arguments

```py
open(filename, encoding="utf-8")
```

```py
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
```


#### Keyword arguments

- programmer: set default values for parameters
- user: leave out the parameters you don't need
- better overview: the name shows what each value means
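
A small sketch of these three points. The function `describe` and its parameters are hypothetical, invented only to show default values and calls by name.

```python
def describe(gene, organism="human", verbose=False):
    """Return a short description; `organism` and `verbose` have defaults."""
    text = "{} ({})".format(gene, organism)
    if verbose:
        text += " [verbose]"
    return text


print(describe("BRCA1"))                    # both defaults are used
print(describe("BRCA1", organism="mouse"))  # override one parameter by name
print(describe("BRCA1", verbose=True))      # skip `organism`, set `verbose`
```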

#### Using code

- Python standard modules
- Your colleague's code


- `import datetime`
- `import xml.etree.ElementTree as ET`
- `from collections import defaultdict`


- `help(datetime)`

#### Using code

**`Counter`** - a handy type of dictionary.

```py
help(Counter)
```


> Dict subclass... Elements are stored as dictionary keys and their counts are stored as dictionary values.
...


In [None]:
from collections import Counter

In [None]:
text = 'lettercountingstring'

In [None]:
c = Counter(text)
c

In [None]:
c.most_common(3)

In [None]:
for letter in "a nice story about books":
    if letter != " ":
        c[letter] += 1
        
for letter in "unimportant letters":
    if letter != " ":
        c[letter] -= 1

In [None]:
c.most_common(3)

https://docs.python.org/3/py-modindex.html

#### Your code being used

- write comments       `# why do I do this?`
- write documentation  `"""what is this? how do you use it?"""`
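
As an illustration of both habits, here is a sketch of a documented function. The function `gc_content` is a made-up example, not part of the course material.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Usage: gc_content("ATGC") returns 0.5.
    """
    # Upper-case first so that 'atgc' and 'ATGC' are treated the same.
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# The docstring is the text that help(gc_content) displays.
print(gc_content.__doc__)
```

The docstring answers "what is this and how do I use it?", while the comment explains why a particular line is there.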

#### Your code being used

```py
def f(a, b):
    for c in open(a):
        if c.startswith(b):
            print(c)
```

==>

```py
def print_lines(filename, start):
    for line in open(filename):
        if line.startswith(start):
            print(line)
```

**<center>Care about the names of your variables and functions</center>**

#### Your code being used

```py
import sys


def main(input):
    ...


if __name__ == "__main__":
    main(sys.argv[1])
```

## TODAY

- Formatting
- Regular expressions
- Summary of the course

## String formatting


Putting your values into tables or other nice-looking strings.

`"A string with formatting instructions".format(some, values, to, put, in, string)`

In [None]:
"The movie '{}' got {} votes".format('Beauty and the Beast', 10)

In [None]:
def pretty(val1, val2):
    print("The movie '"+ val1 +"' got "+ val2 +" votes.")

pretty('Beauty and the Beast', '10')
pretty('It', '10000')

In [None]:
def pretty(val1, val2):
    print("{:20} got {:^10} votes.".format(val1, val2))

    
pretty('Beauty and Beast', '10')
pretty('It', '10000')

In [None]:
def prettyheader(val1, val2):
    print('|{:-^20}|{:-^20}|'.format(val1, val2))


prettyheader("Movie", "Votes")
pretty('Beauty and Beast', '10')
pretty('It', '10000')

In [None]:
"some text, some values here: {} and here: {}".format(42, "value")

#### Positional arguments

In [None]:
"The movie '{}' got {} votes.".format("Beauty and the Beast", "10")

In [None]:
"The movie '{0}' got {1} votes.".format("Beauty and the Beast", "10")

In [None]:
"The movie '{1}' got {1} votes.".format("Beauty and the Beast", "10")

In [None]:
"The movie '{3}' got {43} votes.".format("Beauty and the Beast", "10")

#### Keyword arguments

In [None]:
"The movie '{movie}' got {number} votes.".format(movie="Beauty and the Beast", number="10")

In [None]:
"The movie '{movie}' got {movie} votes.".format(movie="Beauty and the Beast", number="10")

In [None]:
"The movie '{0}' got {1} votes.".format(movie="Beauty and the Beast", number="10")

In [None]:
"The movie '{0}' got {number} votes.".format("Beauty and the Beast", number="10")

Positional arguments come first, keyword arguments after!

In [None]:
"The movie '{movie}' got {1} votes.".format(movie="Beauty and the Beast", "10")

### Printing a table

#### Field width

In [None]:
"The movie '{movie:30}' got {number:10} votes.".format(movie="Beauty and the Beast", number="10")

#### Alignment

In [None]:
"The movie '{movie:<30}' got {number:>10} votes.".format(movie="Beauty and the Beast", number="10")

In [None]:
"The movie '{movie:^30}' got {number:^10} votes.".format(movie="Beauty and the Beast", number="10")

In [None]:
def pretty(val1, val2):
    print("{:20} got {:^10} votes.".format(val1, val2))
pretty('Beauty and the Beast', 10)
pretty('It', 10000)

### Printing a table

#### Filling

In [None]:
"The movie '{movie:_<30}' got {number:*^10} votes.".format(movie="Beauty and the Beast", number="10")

In [None]:
"|{0:-^20}|{1:-^20}|".format(' Movie ',' Votes ')

#### Rounding

In [None]:
points = 19
total = 23
'Score: {}'.format( points / total )

In [None]:
'Score: {:.2f}'.format( points / total )

In [None]:
'Score: {:.2%}'.format( points / total )

#### Formatting dates

In [None]:
import datetime
d = datetime.datetime(2010, 7, 4, 12, 15, 58)
d

In [None]:
'{:%Y-%m-%d %H:%M:%S}'.format(d)

In [None]:
now = datetime.datetime.now()
'{:%Y-%m-%d week:%W %A %H:%M:%S}'.format(now)

Learn more: [strftime.org](http://strftime.org)


### Printing a full table

In [None]:
# movies :: [(name, votes, total score)]
movies = [('Beauty and the Beast', 12, 55), ('It', 10000, 30450)]
 
print('|{:-^30}|{:-^10}|{:-^10}|'.format('Movie', 'Votes', 'Average'))
for movie, votes, total in movies: 
    print('|{:30}|{:^10}|{:^10.2f}|'.format(movie, votes, total/votes))

Learn more from the [Python docs](https://docs.python.org/3.4/library/string.html#format-string-syntax)!


### Older syntax

In [None]:
print("I've read the story '%s' %s times" % ('Alice in Wonderland', 10))


### Alternative syntax: `f-string`

In [None]:
story = "Alice in Wonderland"
num = 10
print(f"I've read the story '{story}' {num} times")

(*version >= 3.6*)
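
f-strings accept the same format specifications (width, alignment, precision) as `.format()`. A minimal sketch, reusing the numbers from the table example above; the variable names are chosen here for illustration.

```python
movie = "It"
votes = 10000
total = 30450

# Same width, alignment and precision codes as in the .format() examples.
row = f"|{movie:30}|{votes:^10}|{total / votes:^10.2f}|"
print(row)
```

Note that an f-string can contain an expression such as `total / votes` directly inside the braces.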

## Regular Expressions

Search for patterns in text.



- A formal language for defining search patterns

- Lets you search not only for exact strings but controlled variations of that string.

- Why?

- Examples:
   - Find variations in a protein or DNA sequence
     - `"MVR???A"`
     - `"ATG???TAG"`
   - American/British spelling, endings and other variants:
     - salpeter, salpetre, saltpeter, nitre, niter or KNO3
     - hemoglobin, haemoglobin, hemoglobins, haemoglobin's
     - catalyze, catalyse, catalyzed...

## Regular Expressions

- When?


- To find information
    - in your `vcf` or `fasta` files
    - in your code
    - in your next essay
    - in a database
    - online
    - in a bunch of articles
    - ...

 - Search/replace
   - becuase &rarr; because
   - color &rarr; colour

- In most programming languages, text editors, search engines...

### Defining a search pattern

<center>
<img src="img/color.png" alt="regex" width="50%"/>
<img src="img/salpeter.png" alt="regex" width="50%"/>

</center>

#### Common operations
- `.` matches any character (once)
- `?` repeat previous pattern 0 or 1 times
- `*` repeat previous pattern 0 or more times
- `+` repeat previous pattern 1 or more times

`.*` matches everything (including the empty string)!


<center><code>"salt?pet.."</code></center>



<center><font color="green">saltpeter</font></center>


<center><font color="red">"saltpet88"</font></center>
<center><font color="red">"salpetin"</font></center>
<center><font color="red">"saltpet  "</font></center>


#### More common operations

- `\w` matches any letter or number, and the underscore
- `\d` matches any digit
- `\D` matches any non-digit
- `\s` matches any whitespace (spaces, tabs, ...)
- `\S` matches any non-whitespace
- `[abc]` matches a single character defined in this set {a, b, c}
- `[^abc]` matches a single character that is **not** a, b or c

#### `[a-z]` matches any single lowercase letter between `a` and `z` (the English alphabet).

#### `[a-z]+` matches any lowercase English word.

<center><code>salt<font color="red">?</font>pet<font color="red">[</font>er<font color="red">]+</font>
</code></center>

<center><font color="green">saltpeter</font></center>
<center><font color="green">salpetre</font></center>

<center><strike><font color="red">"saltpet88"</font></strike></center>
<center><strike><font color="red">"salpetin"</font></strike></center>
<center><strike><font color="red">"saltpet  "</font></strike></center>
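
The pattern can be checked against the five candidate strings with Python's `re` module, which is covered in the section "Regular expressions in Python". This is only a sketch: `fullmatch()` is used here as an assumption about what "matches" should mean, namely that the whole string has to fit the pattern.

```python
import re

pattern = re.compile(r"salt?pet[er]+")
candidates = ["saltpeter", "salpetre", "saltpet88", "salpetin", "saltpet  "]

for word in candidates:
    # fullmatch() succeeds only if the whole string fits the pattern.
    verdict = "match" if pattern.fullmatch(word) else "no match"
    print("{!r:12} {}".format(word, verdict))
```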

## Exercise 1

<b>&rarr; Notebook Day_5_Exercise_1  (~30 minutes) </b>

### Regular expressions in Python

In [None]:
import re

In [None]:
p = re.compile('ab*')
p

### Searching

In [None]:
p = re.compile('ab*')

p.search('abc')

In [None]:
print(p.search('cb'))

In [None]:
p = re.compile('HELLO')
m = p.search('gsdfgsdfgs  HELLO  __!@£§≈[|ÅÄÖ‚…’ﬁ]')

print(m)

### Case insensitiveness

In [None]:
p = re.compile('[a-z]+')
result = p.search('ATGAAA')
result

In [None]:
p = re.compile('[a-z]+', re.IGNORECASE)
result = p.search('ATGAAA')
result

### The match object

In [None]:
print('The result {} is of type {}'.format(result, type(result)))

`result.group()`: Return the string matched by the expression

`result.start()`: Return the starting position of the match

`result.end()`: Return the ending position of the match

`result.span()`: Return both (start, end)

In [None]:
result.group()

In [None]:
result.start()

In [None]:
result.end()

In [None]:
result.span()

### Zero or more...?

In [None]:
p = re.compile('.*HELLO.*')

In [None]:
m = p.match('lots of text  HELLO  more text and characters!!! ^^')

In [None]:
m.group()

The `*` is **greedy**.
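
Greedy means that `*` grabs as many characters as it can while still allowing the overall match to succeed. The sketch below contrasts it with the non-greedy form `*?`, which is not shown in the slides above; the example text is made up.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

greedy = re.search(r"<.*>", text).group()   # as much as possible
lazy = re.search(r"<.*?>", text).group()    # `?` after `*`: as little as possible

print(greedy)
print(lazy)
```

The greedy version runs from the first `<` to the last `>` in the string, while the non-greedy version stops at the first `>` it finds.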

### Finding all the matching patterns

In [None]:
p = re.compile('HELLO')
objects = p.finditer('lots of text  HELLO  more text  HELLO ... and characters!!! ^^')
print(objects)

In [None]:
for m in objects:
    print('Found {0:30} at position {1}'.format(m.group(), m.start()) )

In [None]:
objects = p.finditer('lots of text  HELLO  more text  HELLO ... and characters!!! ^^')
for m in objects:
    print('Found {0:^30} at position {1}'.format(m.group(), m.span()) )

### How to find a full stop?

In [None]:
txt = "The first full stop is here: ."
p = re.compile('.')

m = p.search(txt)
print('"{}" at position {}'.format(m.group(), m.start()))

In [None]:
p = re.compile(r'\.')

m = p.search(txt)
print('"{}" at position {}'.format(m.group(), m.start()))

### More operations
- `\` escaping a character
- `^` beginning of the string
- `$` end of string
- `|` boolean `or`

<center>
    <code>salt<font color="red">?</font>pet<font color="red">(</font>er<font color="red">|</font>re<font color="red">)</font> <font color="red">|</font> nit<font color="red">(</font>er<font color="red">|</font>re<font color="red">)</font> <font color="red">|</font> KNO3</code>
</center>
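
A sketch of the same pattern in Python, with `^` and `$` added so that the whole string has to be one of the alternatives. The spaces around `|` are left out here, because a space inside a pattern is matched literally. The test words are made up.

```python
import re

p = re.compile(r"^(salt?pet(er|re)|nit(er|re)|KNO3)$")

for word in ["saltpeter", "nitre", "KNO3", "KNO3 powder", "my niter"]:
    print(word, "->", bool(p.search(word)))
```

The last two strings contain a valid alternative but also extra text, so the anchors make the search fail.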

### Capturing groups

Find the domain name in an email address:

kalle.larsson@**gmail**.com pigelin@**uu**.se

In [None]:
email1 = "kalle.larsson@gmail.com"
email2 = "pigelin@uu.se"
p = re.compile(r'.*@.*\..*')

p.search(email1).group()


In [None]:
p = re.compile(r'.*@(.*)\..*')
p.match(email1).group(1)

In [None]:
p = re.compile(r'.*@(.*)\.(.*)')
p.match(email1).groups()


### Capturing groups

**Structure**

- Each pair of parentheses creates a group
- `group()` returns the whole match, as usual
- `group(1)` returns the first group, `group(2)` the second...
- `groups()` returns all groups as a tuple

### Substitution

In [None]:
p = re.compile('gmail')
p.sub('hotmail', email1)

In [None]:
p = re.compile(r'@.*\.')
e1 = p.sub('@hotmail.', email1)
e2 = p.sub('@hotmail.', email2)
print(e1)
print(e2)

#### Finally, we can fix our spelling mistakes!

In [None]:
txt = "Do it becuase I say so, not becuase you want!"

In [None]:
p = re.compile('becuase')
p.sub('because', txt)

#### Overview

 - Construct regular expressions
 
     ```py
     p = re.compile(pattern)
     ```
     
 - Searching
 
     ```py
     p.search(text)
     p.match(text)
     ```
     
 - Substitution
 
     ```py
     p.sub(replacement, text)
     ```

**Typical code structure:**

```python
p = re.compile( ... )
m = p.search('string goes here')
if m:
    print('Match found: ', m.group())
else:
    print('No match')
```

### Backreference

A reference to a previously captured group. Finds repetitions!

In [None]:
p = re.compile(r'(.)TG\1')

<center>
<img src="img/backref.png" alt="backreference" width="20%"/>
</center>

In [None]:
p.search('CTGACCATGAG')

<center>CTGACC<u><b>A</b>TG<b>A</b></u>G</center>

In [None]:
kalle = "kalle-kalle@gmail.com"
lisa = "lisa-lisa@hotmail.com"
lisen = "lisa-lisen@uu.se"

p = re.compile(r'(.*)-\1@(.*)')

In [None]:
p.search(kalle)

In [None]:
p.search(lisa)

In [None]:
p.search(lisen)

### Backreference

**Structure**

- Each pair of parentheses creates a group, which can be referred to
- `\1` refers to the first group, `\2` to the second...
- Remember to use *raw strings* for all backreferences! `r'\1'`
- In the match object, the groups can be found with `group(n)` and `groups()`, as usual
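
A further sketch of the same idea: finding a word that is accidentally written twice. The pattern and the example sentence are made up for illustration.

```python
import re

# (\w+) captures a word; \1 requires the same word again after a space.
p = re.compile(r"\b(\w+) \1\b")

m = p.search("this is is a test")
print(m.group())    # the whole match
print(m.group(1))   # the captured word
```

The `\b` markers are word boundaries. They are not covered in the slides above, and are used here only to keep the match from starting or ending in the middle of a word.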

### Substitution + backreference!


- Create repetitions

In [None]:
email2

In [None]:
p = re.compile(r'(.*)@')
p.sub(r'\1-\1@', email2)

In [None]:
p = re.compile(r'(.*)@(.*)\.')
p.sub(r'\1-\1@\2.\2.', email2)

- Reuse alternatives

In [None]:
import re
p = re.compile('(he|she|it) like ')
corr = r'\1 likes '

In [None]:
p.sub(corr, 'she like programming')

In [None]:
p = re.compile(r'(he|she|it) (\w+)')
corr = r'\1 \2s'

In [None]:
p.sub(corr, 'she like programming and she make programs')

### Substitution + backreference


- Reorder

**Date formats**

<table border="0" width="80%" style="font-size:80%">
<tr>
    <td >Swedish format:</td><td> <code>year-month-day</code></td><td>   2018-11-23</td>
</tr><tr>
    <td>UK format:</td><td>      <code>day/month/year</code></td><td>   23/11/2018</td>
</tr><tr>
    <td>US format:</td><td>      <code>month/day/year</code></td><td>   11/23/2018</td>
</tr>
</table>

In [None]:
swedish_date = '2018-11-23'

p = re.compile(r'(\d\d\d\d)-(\d\d)-(\d\d)')

In [None]:
uk_date = p.sub(r'\3/\2/\1', swedish_date)
print('UK:', uk_date)

In [None]:
us_date = p.sub(r'\2/\3/\1', swedish_date)
print('US:', us_date)

### Backreference + substitution

- Create repetitions

    ```py
    p.sub(r'\1-\1@', email2)
    ```

- Reorder

    ```py
    us_date = p.sub(r'\2/\3/\1', swedish_date)
    ```

- Reuse alternatives

    ```py
    p = re.compile(r'(he|she|it) (\w+)')
    p.sub(r'\1 \2s', 'she like programming')
    ```

### Regular expressions


- A powerful tool to search and modify text

- There is much more to read in the [docs](https://docs.python.org/3/library/re.html)

- Note: regex comes in different flavours. If you use it outside Python, there might be small variations in the syntax.

## Exercise 2

Read more: full documentation https://docs.python.org/3.6/library/re.html
         
         
<br/>
<b>&rarr; Notebook Day_5_Exercise_2  (~30 minutes) </b>


## <center>Sum up!</center>

#### Loop through the lines of a file

```py

for line in open('myfile.txt', 'r'):
    do_stuff(line)
```

#### Store values

```py
iterations = 0
information = []

for line in open('myfile.txt', 'r'):
    iterations += 1
    information.append(do_stuff(line))
```

#### Values

- Literals:

    ```py
    str     "hello"
    int     5
    float   5.2
    bool    True
    ```

- Collections:

    ```py
    list  ["a", "b", "c"]
    dict  {"a": "alligator", "b": "bear", "c": "cat"}
    tuple ("this", "that")
    set   {"drama", "sci-fi"}
    ```

#### Modify values and compare

- `+`, `-`, `*`, ...
- `and`, `or`, `not`
- `==`, `!=`
- `<`, `>`, `<=`, `>=`

#### Deciding what to do

```py
if count > 10:
    print('big')
elif count > 5:
    print('medium')
else:
    print('small')
```

#### Deciding what to do

```py
iterations = 0
information = []

for line in open('myfile.txt', 'r'):
    if is_comment(line):
        use_comment()
    else:
        read_data()
```

#### Strings

Raw text

- Pruning:
 
  ```py
  s.strip()             # remove leading and trailing whitespace
  s.split()             # split line into columns
  s.upper(), s.lower()  # change the case
  ```



- Regular expressions help you find and replace strings.

  - ```py
  p = re.compile('A.A.A')
  p.search(dnastring)
  ```
  - ```py
  p = re.compile('T')
  p.sub('U', dnastring)
  ```

#### Collections

Can contain strings, integers, booleans...
- **Mutable**: you can *add*, *remove*, *change* values

  - Lists:

    ```py
    mylist.append('value')
    ```

  - Dicts:

    ```py
    mydict['key'] = 'value'
    ```

  - Sets:

    ```py
    myset.add('value')
    ```

#### Mutable objects

- Test for membership:
```py
value in myobj
```

- Check size:
```py
len(myobj)
```
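
A short sketch of both operations on each collection type. The variable contents are the same illustrative values used elsewhere in this summary.

```python
mylist = ["a", "b", "c"]
mydict = {"a": "alligator", "b": "bear"}
myset = {"drama", "sci-fi"}

print("a" in mylist, len(mylist))
print("a" in mydict, len(mydict))     # `in` checks the keys of a dict
print("alligator" in mydict)          # values are not searched
print("comedy" in myset, len(myset))
```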

#### Lists

- Ordered!

```py
mylist = ["a", "b", "c"]

mylist.sort()
mylist.reverse()
mylist[2]
mylist[-1]
mylist[2:6]
```

#### Dictionaries

- Keys have values

```py
mydict = {"a": "alligator", "b": "bear", "c": "cat"}

mydict["a"]
mydict.keys()
mydict.values()
```



#### Sets

- Bag of values
 
  - No order
  
  - No duplicates 

  - Fast membership checks

  - Logical set operations (union, difference, intersection..)


```py
myset = {"drama", "sci-fi"}

myset.difference({"sci-fi", "comedy"})  # myset - {"sci-fi", "comedy"}
> {"drama"}

myset.intersection({"sci-fi", "comedy"})
> {"sci-fi"}
```


#### Strings

- Work like a list of characters

  ```py
  s += "more words"  # add content
  s[4]               # get character at index 4
  'e' in s           # check for membership
  len(s)             # check size
  ```

- But are immutable

  ```py
  >>> s[2] = 'i'
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: 'str' object does not support item assignment
  ```

#### Input/Output

- In:

  - Read files: `open(filename, 'r')`

  - Read information from command line: `sys.argv[1:]`

- Out:

  - Printing: `print('my_information')`

  - Write files: `open(filename, 'w').write(text)`
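
A minimal sketch of writing and then reading a file. It uses the `with` statement, a common idiom that is not shown in the slides above, so that the file is closed automatically. The temporary path is an assumption made so the example does not touch your own files.

```python
import os
import tempfile

# A temporary file path so the example does not touch your own files.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as outfile:      # the file is closed when the block ends
    outfile.write("first line\nsecond line\n")

with open(path, "r") as infile:
    lines = [line.strip() for line in infile]

print(lines)
```

Closing the file matters most when writing, since the text may not reach the disk until the file is closed.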



#### Formatting

```py
# movies :: [(name, votes, total score)]
movies = [('Beauty and the Beast', 12, 55), ('It', 10000, 30450)]

print('|{:-^30}|{:-^10}|{:-^10}|'.format('Movie', 'Votes', 'Average'))
for movie, votes, total in movies: 
    print('|{:30}|{:^10}|{:^10.2f}|'.format(movie, votes, total/votes))
```

<code>
|------------Movie-------------|--Votes---|-Average--|
|Beauty and the Beast          |    12    |   4.58   |
|It                            |  10000   |   3.04   |
</code>

**The format string**

<pre>
{[field_name] [: <font color="green">format_spec</font>]}
</pre>

*Format_spec:*
<pre>
<font color="green">filling alignment width precision type</font>
   -       &gt;        10    .2       f
</pre>

In [None]:
print('{movie:^20}'.format(movie="It"))
print('{:_>20.2f}'.format(20/3))

*There is more to it! Read the [docs](https://docs.python.org/3.4/library/string.html#format-string-syntax)!*

#### Code structure

- Functions
- Modules

#### Functions

- A named piece of code that performs a certain task.

<img src="img/function_structure_explained.png" alt="Drawing" style="width: 600px;"/>  


#### Modules

- A (larger) piece of code containing functions, classes...
- Corresponds to a file
- Can be imported

```py
import mymodule
```
- Can be run as a script

```> python3 mymodule.py```
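
A sketch of what such a file could contain. The file name `mymodule.py` comes from the example above; the functions `count_bases` and `main` are hypothetical.

```python
# Contents of a hypothetical file mymodule.py
import sys


def count_bases(sequence):
    """Return the number of characters in the sequence."""
    return len(sequence)


def main(args):
    """Print each sequence given on the command line with its length."""
    for sequence in args:
        print(sequence, count_bases(sequence))


if __name__ == "__main__":
    # Runs with `python3 mymodule.py ATGC`, but not on `import mymodule`.
    main(sys.argv[1:])
```

After `import mymodule`, other code can call `mymodule.count_bases("ATGC")` without triggering `main`.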

<img src="img/example_code.png" alt="Module" style="height: 80%"/>

<center>Now you know Python!</center>

<center><font size=20> 🎉 </font></center>

<center><font size=20>    Well done!</font></center>
