# Back To Basics
## Simple but powerful Python techniques


# Contents:
* ```String``` tricks
* The ```os``` module 
* The ```dict()``` constructor
* Tricks with lists
* ```NumPy``` arrays and maths

We've covered some Python fundamentals in the [**BasicBitsPython**](https://github.com/UoMMIB/Python-Club/blob/master/Tutorials/BasicBitsPython.ipynb) notebook, including:
* Comments and variables
* Loops
* Functions
* Dictionaries
and we used what we'd learnt to [translate](https://en.wikipedia.org/wiki/Translation_(biology)) some DNA into the amino acid sequence that it codes for, which folds up to look like this:

# Background - 🔬
In this tutorial, we'll look at some simple, but powerful commands to find out more about this protein. Here's the sequence that we translated from the DNA:
```
MTIKEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRVTRYLS
SQRLIKEACDESRFDKNLSQALKFVRDFAGDGLFTSWTHEKNWKKAHNILLPSFSQQAMKGYHAMM
VDIAVQLVQKWERLNADEHIEVPEDMTRLTLDTIGLCGFNYRFNSFYRDQPHPFITSMVRALDEAM
NKLQRANPDDPAYDENKRQFQEDIKVMNDLVDKIIADRKASGEQSDDLLTHMLNGKDPETGEPLDD
ENIRYQIITFLIAGHETTSGLLSFALYFLVKNPHVLQKAAEEAARVLVDPVPSYKQVKQLKYVGMV
LNEALRLWPTAPAFSLYAKEDTVLGGEYPLEKGDELMVLIPQLHRDKTIWGDDVEEFRPERFENPS
AIPQHAFKPFGNGQRACIGQQFALHEATLVLGMMLKHFDFEDHTNYELDIKETLTLKPEGFVVKAK
SKKIPLGGIPSPSTEQSAKKVRKKGC*
```
Each letter represents an [**amino acid**](https://en.wikipedia.org/wiki/Amino_acid) - each with the same 'backbone':
![](glycine.png)

But each has its own side-chain, with its own chemical properties, like charge and other features that make the chain of amino-acids fold into a functioning protein.

![](amino-acids.png)

In our case, the chain of amino acids folds around a [**heme**](https://en.wikipedia.org/wiki/Heme_B), a red molecule which is used to oxidise fats - a controlled burn in the centre of the protein. 
![](heme.png)

Which fats can get into the centre of the protein is determined by the enzyme's shape, and the chemical properties of the tunnel that leads to the core. We can find the structures of some of these proteins using a technique called [**X-ray crystallography**](https://en.wikipedia.org/wiki/X-ray_crystallography). Here's a snapshot of our enzyme - the [**Cytochrome P450**](https://en.wikipedia.org/wiki/Cytochrome_P450): [**BM3**](BM3-Reveiw-Andy-Munro.pdf).

[**PDB ID: 1bu7**](https://www.ebi.ac.uk/pdbe/entry/pdb/1bu7/index) - see this link for an interactive view
![](tutorial-data/1bu7-molstar-image.png)


The ability to 'do chemistry' is what makes this protein an [**enzyme**](https://en.wikipedia.org/wiki/Enzyme). Enzymes are an important part of the biotechnology research we do at the [Manchester Institute of Biotechnology](https://en.wikipedia.org/wiki/Manchester_Institute_of_Biotechnology) - especially [**engineering**](https://en.wikipedia.org/wiki/Cytochrome_P450_engineering) them to do new chemical reactions that can help with agriculture, drug development and making industrial chemical processes more environmentally friendly.

# Problem #1 - ```string``` parsing
Here's a table of some calculated chemical properties of each of the 20 amino acids listed above. We could have found this on a web page. 
We don't have time to download the information in a clean file, we need it now! 🏃

In this section we copy and paste the table into our notebook as a ```string``` and extract the data into a usable format!


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>MolWT</th>
      <th>LogP</th>
      <th>HBondDonors</th>
      <th>HBondAcceptors</th>
      <th>nAromaticRings</th>
      <th>nHeteroAtoms</th>
      <th>nRotatableBonds</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>A</th>
      <td>89.047678</td>
      <td>-0.58180</td>
      <td>3</td>
      <td>3</td>
      <td>0</td>
      <td>3</td>
      <td>3</td>
    </tr>
    <tr>
      <th>C</th>
      <td>121.019749</td>
      <td>-0.67190</td>
      <td>3</td>
      <td>3</td>
      <td>0</td>
      <td>4</td>
      <td>4</td>
    </tr>
    <tr>
      <th>D</th>
      <td>133.037508</td>
      <td>-1.12700</td>
      <td>4</td>
      <td>5</td>
      <td>0</td>
      <td>5</td>
      <td>4</td>
    </tr>
    <tr>
      <th>E</th>
      <td>147.053158</td>
      <td>-0.73690</td>
      <td>4</td>
      <td>5</td>
      <td>0</td>
      <td>5</td>
      <td>5</td>
    </tr>
    <tr>
      <th>F</th>
      <td>165.078979</td>
      <td>0.64100</td>
      <td>3</td>
      <td>3</td>
      <td>1</td>
      <td>3</td>
      <td>4</td>
    </tr>
    <tr>
      <th>G</th>
      <td>75.032028</td>
      <td>-0.97030</td>
      <td>3</td>
      <td>3</td>
      <td>0</td>
      <td>3</td>
      <td>2</td>
    </tr>
    <tr>
      <th>H</th>
      <td>155.069477</td>
      <td>-0.63590</td>
      <td>4</td>
      <td>5</td>
      <td>1</td>
      <td>5</td>
      <td>4</td>
    </tr>
    <tr>
      <th>I</th>
      <td>131.094629</td>
      <td>0.44440</td>
      <td>3</td>
      <td>3</td>
      <td>0</td>
      <td>3</td>
      <td>6</td>
    </tr>
    <tr>
      <th>K</th>
      <td>146.105528</td>
      <td>-0.47270</td>
      <td>5</td>
      <td>4</td>
      <td>0</td>
      <td>4</td>
      <td>7</td>
    </tr>
    <tr>
      <th>L</th>
      <td>131.094629</td>
      <td>0.44440</td>
      <td>3</td>
      <td>3</td>
      <td>0</td>
      <td>3</td>
      <td>6</td>
    </tr>
    <tr>
      <th>M</th>
      <td>149.051050</td>
      <td>0.15140</td>
      <td>3</td>
      <td>3</td>
      <td>0</td>
      <td>4</td>
      <td>6</td>
    </tr>
    <tr>
      <th>N</th>
      <td>132.053492</td>
      <td>-1.72630</td>
      <td>5</td>
      <td>5</td>
      <td>0</td>
      <td>5</td>
      <td>4</td>
    </tr>
    <tr>
      <th>P</th>
      <td>115.063329</td>
      <td>-0.17700</td>
      <td>2</td>
      <td>3</td>
      <td>0</td>
      <td>3</td>
      <td>1</td>
    </tr>
    <tr>
      <th>Q</th>
      <td>146.069142</td>
      <td>-1.33620</td>
      <td>5</td>
      <td>5</td>
      <td>0</td>
      <td>5</td>
      <td>5</td>
    </tr>
    <tr>
      <th>R</th>
      <td>174.111676</td>
      <td>-1.33843</td>
      <td>7</td>
      <td>6</td>
      <td>0</td>
      <td>6</td>
      <td>6</td>
    </tr>
    <tr>
      <th>S</th>
      <td>105.042593</td>
      <td>-1.60940</td>
      <td>4</td>
      <td>4</td>
      <td>0</td>
      <td>4</td>
      <td>4</td>
    </tr>
    <tr>
      <th>T</th>
      <td>119.058243</td>
      <td>-1.22090</td>
      <td>4</td>
      <td>4</td>
      <td>0</td>
      <td>4</td>
      <td>5</td>
    </tr>
    <tr>
      <th>V</th>
      <td>117.078979</td>
      <td>0.05430</td>
      <td>3</td>
      <td>3</td>
      <td>0</td>
      <td>3</td>
      <td>5</td>
    </tr>
    <tr>
      <th>W</th>
      <td>204.089878</td>
      <td>1.12230</td>
      <td>4</td>
      <td>4</td>
      <td>2</td>
      <td>4</td>
      <td>4</td>
    </tr>
    <tr>
      <th>Y</th>
      <td>181.073893</td>
      <td>0.34660</td>
      <td>4</td>
      <td>4</td>
      <td>1</td>
      <td>4</td>
      <td>5</td>
    </tr>
  </tbody>
</table>

In [1]:
# I've used triple quotes to make a multi-line string

data = '''MolWT	LogP	HBondDonors	HBondAcceptors	nAromaticRings	nHeteroAtoms	nRotatableBonds
A	89.047678	-0.58180	3	3	0	3	3
C	121.019749	-0.67190	3	3	0	4	4
D	133.037508	-1.12700	4	5	0	5	4
E	147.053158	-0.73690	4	5	0	5	5
F	165.078979	0.64100	3	3	1	3	4
G	75.032028	-0.97030	3	3	0	3	2
H	155.069477	-0.63590	4	5	1	5	4
I	131.094629	0.44440	3	3	0	3	6
K	146.105528	-0.47270	5	4	0	4	7
L	131.094629	0.44440	3	3	0	3	6
M	149.051050	0.15140	3	3	0	4	6
N	132.053492	-1.72630	5	5	0	5	4
P	115.063329	-0.17700	2	3	0	3	1
Q	146.069142	-1.33620	5	5	0	5	5
R	174.111676	-1.33843	7	6	0	6	6
S	105.042593	-1.60940	4	4	0	4	4
T	119.058243	-1.22090	4	4	0	4	5
V	117.078979	0.05430	3	3	0	3	5
W	204.089878	1.12230	4	4	2	4	4
Y	181.073893	0.34660	4	4	1	4	5'''

data # try running print(data) - what's the difference?

'MolWT\tLogP\tHBondDonors\tHBondAcceptors\tnAromaticRings\tnHeteroAtoms\tnRotatableBonds\nA\t89.047678\t-0.58180\t3\t3\t0\t3\t3\nC\t121.019749\t-0.67190\t3\t3\t0\t4\t4\nD\t133.037508\t-1.12700\t4\t5\t0\t5\t4\nE\t147.053158\t-0.73690\t4\t5\t0\t5\t5\nF\t165.078979\t0.64100\t3\t3\t1\t3\t4\nG\t75.032028\t-0.97030\t3\t3\t0\t3\t2\nH\t155.069477\t-0.63590\t4\t5\t1\t5\t4\nI\t131.094629\t0.44440\t3\t3\t0\t3\t6\nK\t146.105528\t-0.47270\t5\t4\t0\t4\t7\nL\t131.094629\t0.44440\t3\t3\t0\t3\t6\nM\t149.051050\t0.15140\t3\t3\t0\t4\t6\nN\t132.053492\t-1.72630\t5\t5\t0\t5\t4\nP\t115.063329\t-0.17700\t2\t3\t0\t3\t1\nQ\t146.069142\t-1.33620\t5\t5\t0\t5\t5\nR\t174.111676\t-1.33843\t7\t6\t0\t6\t6\nS\t105.042593\t-1.60940\t4\t4\t0\t4\t4\nT\t119.058243\t-1.22090\t4\t4\t0\t4\t5\nV\t117.078979\t0.05430\t3\t3\t0\t3\t5\nW\t204.089878\t1.12230\t4\t4\t2\t4\t4\nY\t181.073893\t0.34660\t4\t4\t1\t4\t5'

# ```\t``` & ```\n```

```\``` is an [**escape character**](https://en.wikipedia.org/wiki/Escape_character) - it precedes a character that needs special treatment. In this case:
* ```\t``` means ```tab``` 
* ```\n``` means ```newline``` 
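
A minimal sketch of the difference between how a string is stored and how it is displayed. The two-row ```snippet``` is a cut-down stand-in for ```data```, and the variable names are only for illustration:

```python
# A two-row excerpt in the same layout as `data`
snippet = 'A\t89.047678\nC\t121.019749'

# repr() shows the escape sequences as typed; print() interprets them
as_typed = repr(snippet)
print(as_typed)
print(snippet)

# Each escape sequence is a single character, even though we type two
tab_length = len('\t')
newline_count = snippet.count('\n')
```

This is why evaluating ```data``` on its own shows ```\t``` and ```\n```, while ```print(data)``` shows real tabs and line breaks.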

# The ```io``` module

In [2]:
import pandas as pd
import io

pd.read_csv(io.StringIO(data), delimiter = '\t')

Unnamed: 0,MolWT,LogP,HBondDonors,HBondAcceptors,nAromaticRings,nHeteroAtoms,nRotatableBonds
A,89.047678,-0.5818,3,3,0,3,3
C,121.019749,-0.6719,3,3,0,4,4
D,133.037508,-1.127,4,5,0,5,4
E,147.053158,-0.7369,4,5,0,5,5
F,165.078979,0.641,3,3,1,3,4
G,75.032028,-0.9703,3,3,0,3,2
H,155.069477,-0.6359,4,5,1,5,4
I,131.094629,0.4444,3,3,0,3,6
K,146.105528,-0.4727,5,4,0,4,7
L,131.094629,0.4444,3,3,0,3,6
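
What ```io.StringIO``` does here is wrap a string so it can be read like an open file, which is what ```pd.read_csv``` expects. The sketch below shows the same idea with the standard library ```csv``` module instead of pandas. The ```mini``` string is a cut-down excerpt of ```data```, and the variable names are assumptions for illustration:

```python
import csv
import io

# Two rows in the same tab-separated layout as `data`
mini = 'MolWT\tLogP\nA\t89.047678\t-0.58180\nC\t121.019749\t-0.67190'

# io.StringIO wraps the string so it behaves like an open file
handle = io.StringIO(mini)
first_line = handle.readline()

# Rewind to the start, then pass the file-like object to a reader
handle.seek(0)
rows = list(csv.reader(handle, delimiter='\t'))
print(rows)
```

Note that the header row is one item shorter than the data rows, because the index column has no name. pandas labels that column ```Unnamed: 0```.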


# How about with ```string``` tricks?

Using the ```io``` module is very convenient for reading data, but how about some generally useful techniques with strings and basic datatypes? In this section we'll use some techniques that don't rely on importing modules to turn our string into something useable.

# Exercise
Here are the steps we'll take to get a ```dictionary``` of our data, which we can easily use to look up molecular properties of each amino acid.
1. Split the ```data``` string into rows - ```split()```
2. Split the rows into columns - ```split()```
3. Use ```for``` loops and list indexing to extract the column headers, the table index and the table entry values, like:
```python
['MolWT',  'LogP',  'HBondDonors',  'HBondAcceptors',  'nAromaticRings',  'nHeteroAtoms',  'nRotatableBonds']
# and
['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
```
4. Use the column headers, table index and table entry values to make a dictionary for an amino acid like:

```python
# for A
{'MolWT': '89.047678',
  'LogP': '-0.58180',
  'HBondDonors': '3',
  'HBondAcceptors': '3',
  'nAromaticRings': '0',
  'nHeteroAtoms': '3',
  'nRotatableBonds': '3'}
```
5. Make a ```dictionary``` of the amino acid ```dictionaries```  like:

```python
{'A': {'MolWT': '89.047678',
  'LogP': '-0.58180',
  'HBondDonors': '3',
  'HBondAcceptors': '3',
  'nAromaticRings': '0',
  'nHeteroAtoms': '3',
  'nRotatableBonds': '3'},
 'C': {'MolWT': '121.019749',
  'LogP': '-0.67190',
  'HBondDonors': '3',
  'HBondAcceptors': '3'
       ...
```

6. Use the ```dictionary``` to find the total molecular weight of the protein

# Techniques:
#### Check out Chapter 41: String Methods in [Python notes for professionals (free ebook)](https://books.goalkicker.com/PythonBook/) for more string methods

## ```split()```

Let's split ```data``` by row. ```split()``` is a string method that splits a string on a given separator and returns a list of the pieces. It works like this:
```python
>>> "Let's split ```data``` by row. ```split()``` is a function that splits string on a particular character".split(' ')
["Let's", 'split', '```data```', 'by', 'row.', '```split()```', 'is', 'a', 'function', 'that', 'splits', 'string', 'on', 'a', 'particular', 'character']

# called with no argument, split() splits on any run of whitespace - 🤫
# see how the function comes from the string?
# that's because it's a method that 'belongs' to the string type of object
# run dir(<string>) to see other methods that belong to strings
# or run help(<string>) for more info on strings
```
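
One detail worth knowing: calling ```split()``` with no argument is not quite the same as ```split(' ')```. A small sketch with made-up strings shows the difference, and that tabs work as separators too:

```python
sentence = "split  on whitespace"  # note the two spaces after 'split'

# No argument: split on any run of whitespace (spaces, tabs, newlines)
default_split = sentence.split()

# Explicit ' ': every single space is a separator, so an empty string appears
space_split = sentence.split(' ')

# Any string can be the separator, including escape characters
row = 'A\t89.047678\t-0.58180'
cells = row.split('\t')

print(default_split)
print(space_split)
print(cells)
```

For tab-separated data it is safer to name the separator explicitly, so that an empty cell is kept as an empty string instead of being swallowed.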
## ```for``` loops:
#### For more detail on loops and ```list comprehensions``` (for loops in a single line) look at our [list comprehension tutorial](https://github.com/UoMMIB/Python-Club/blob/master/Tutorials/ListComprehensions.ipynb) or even better: Chapter 21 of [Python notes for professionals](https://books.goalkicker.com/PythonBook/) 


#### Filling ```lists``` and ```dictionaries``` using ```for``` loops:
One way to fill a ```list``` or ```dictionary``` using a ```for``` loop is to create it empty before the loop, then add items to it inside the loop: ```append()``` for a ```list```, and ```d[key] = value``` for a ```dictionary```:
```python
# things to loop through
letters = ['a','b','c','d','e']
numbers = [1,2,3,4,5]

l = [] # empty list
d = {} # empty dictionary

for i,j in zip(letters, numbers):
    # loop through letters and numbers at the same time
    # i = letter; j = number
    l.append(i) # add letter to list
    d[i] = j # add number to dictionary under key letter
    
# l = ['a', 'b', 'c', 'd', 'e']
# d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
```
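
The same results can be had without an explicit loop. This is a sketch of two alternatives, the ```dict()``` constructor and comprehensions, using the same ```letters``` and ```numbers```. The other variable names are only for illustration:

```python
letters = ['a', 'b', 'c', 'd', 'e']
numbers = [1, 2, 3, 4, 5]

# dict() accepts an iterable of (key, value) pairs, which is what zip() yields
d = dict(zip(letters, numbers))

# A dictionary comprehension does the same, with room to transform the values
squares = {letter: number ** 2 for letter, number in zip(letters, numbers)}

# A list comprehension is the one-line version of append-in-a-loop
upper = [letter.upper() for letter in letters]

print(d)
print(squares)
print(upper)
```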

# 1. Split the ```data``` string into a list of rows
# 2. Split each row into a list of values
This gives a list of lists. Save it in a variable called something like ```table```.
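
If you get stuck, here is one possible approach on a cut-down string. ```mini``` is a two-row excerpt standing in for ```data```, and the names ```rows``` and ```table``` are just suggestions:

```python
# Two rows in the same layout as `data`
mini = 'MolWT\tLogP\nA\t89.047678\t-0.58180\nC\t121.019749\t-0.67190'

# 1. one string per row
rows = mini.split('\n')

# 2. one list of cell strings per row
table = [row.split('\t') for row in rows]

print(table)
```

Try the same two steps on the full ```data``` string before moving on.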

# 3.1 Extract the top row (column headers) of your ```table``` using *list slicing*
### Keep the top row saved as a variable like ```header```

### You'll need to use list ```slicing``` (chapter 64 in [Python notes for professionals](https://books.goalkicker.com/PythonBook/))

List slicing is the name for getting one or more items from a list, or from other objects that behave similarly. It uses a distinctive ```[square bracket]``` notation. Here are some quick syntax rules for slicing:
```python
l = [1,2,3,4,5,6,7,8,9,10]

# syntax: l[<start>:<stop>:<step size>]
# leaving out start, stop or step, the interpreter will assume:
#   <start> = 0
#   <stop> = end of the list
#   <step size> = 1

# first item
>>> l[0] # python starts counting at zero - why waste a whole number?
1
# last item
>>> l[-1]
10
# items 0-3 (the stop index itself is not included)
>>> l[0:4]
[1, 2, 3, 4]
# item 4 onwards
>>> l[4:]
[5, 6, 7, 8, 9, 10]
# every other item from index 1 up to (not including) index 8
>>> l[1:8:2]
[2, 4, 6, 8]
```

# 3.2 Extract the row names (index) from your ```table```

You may have to loop through your list of lists to find these items

# 3.3 Extract the table entries from your ```table```, excluding the headers and the index
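
Steps 3.1 to 3.3 can each be done with indexing and slicing. Below is a sketch on a cut-down ```table```, built from the first two amino acids and the first two columns only. The names ```header```, ```index``` and ```entries``` are suggestions, not requirements:

```python
# A small list of lists in the same shape as the full table
table = [['MolWT', 'LogP'],
         ['A', '89.047678', '-0.58180'],
         ['C', '121.019749', '-0.67190']]

header = table[0]                           # 3.1 the top row
index = [row[0] for row in table[1:]]       # 3.2 first item of every later row
entries = [row[1:] for row in table[1:]]    # 3.3 the rest of every later row

print(header)
print(index)
print(entries)
```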

# 3.4 Change the ```data type``` of the table entries from ```string``` to ```float```
### Hints:
```python
>>> '131.094629' + 5
TypeError

>>> float('131.094629') + 5
136.094629
```
### Use a ```for``` loop within a ```for``` loop, or a ```list comprehension``` within a ```list comprehension```
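
Both forms of that hint are sketched below on a two-row stand-in for the table entries. The variable names are only for illustration:

```python
# String entries, as they come out of split()
entries = [['89.047678', '-0.58180'], ['121.019749', '-0.67190']]

# A list comprehension within a list comprehension
values = [[float(cell) for cell in row] for row in entries]

# The same thing as a for loop within a for loop
values_loop = []
for row in entries:
    converted = []
    for cell in row:
        converted.append(float(cell))
    values_loop.append(converted)

print(values)
```

The inner comprehension converts one row. The outer one repeats that for every row.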

# 4. Create a ```dictionary``` like: ```{'MolWT': <value>, ...}``` for amino acid ```A``` - alanine
![](alanine.png)
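
One way to pair each column header with its value is ```dict(zip(...))```. In this sketch ```header``` and ```alanine_row``` are typed out by hand. In your notebook they would come from your ```table```:

```python
header = ['MolWT', 'LogP', 'HBondDonors', 'HBondAcceptors',
          'nAromaticRings', 'nHeteroAtoms', 'nRotatableBonds']
alanine_row = ['A', '89.047678', '-0.58180', '3', '3', '0', '3', '3']

# Skip the first item (the row name) so headers and values line up
alanine = dict(zip(header, alanine_row[1:]))

print(alanine)
```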

# 5.1 Create a ```dictionary``` for each amino acid's properties, and store each ```dictionary``` within another dictionary like:

```python
{'A': {'MolWT': '89.047678',
  'LogP': '-0.58180',
  'HBondDonors': '3',
  'HBondAcceptors': '3',
  'nAromaticRings': '0',
  'nHeteroAtoms': '3',
  'nRotatableBonds': '3'},
 'C': {'MolWT': '121.019749',
  'LogP': '-0.67190',
  'HBondDonors': '3',
  'HBondAcceptors': '3',
       ...
```
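
A dictionary comprehension is one way to build this nested structure. This is a sketch on cut-down inputs, and it is not necessarily how the answers module does it. ```header```, ```index``` and ```entries``` are assumed to be the pieces extracted in step 3:

```python
header = ['MolWT', 'LogP']
index = ['A', 'C']
entries = [['89.047678', '-0.58180'], ['121.019749', '-0.67190']]

# Outer dictionary: amino acid letter -> inner dictionary of its properties
d = {name: dict(zip(header, row)) for name, row in zip(index, entries)}

print(d)
```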

# 5.2 - Using the ```dictionary``` 
Information stored in a ```dictionary``` can be accessed using ```keys```. Our top layer ```dictionary``` has keys like:

```python
['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
```

These correspond to our amino acids. We can use a key to access the corresponding dictionary item, with square-bracket syntax similar to list indexing:

```python
>>> d['A']
{'MolWT': '89.047678',
 'LogP': '-0.58180',
 'HBondDonors': '3',
 'HBondAcceptors': '3',
 'nAromaticRings': '0',
 'nHeteroAtoms': '3',
 'nRotatableBonds': '3'}
```
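
Because the values are themselves dictionaries, lookups can be chained. Here is a sketch on a two-entry stand-in for ```d```, with only two properties each to keep it short:

```python
d = {'A': {'MolWT': '89.047678', 'nAromaticRings': '0'},
     'F': {'MolWT': '165.078979', 'nAromaticRings': '1'}}

# The first key picks the amino acid, the second picks the property
rings = d['F']['nAromaticRings']

# .get() returns None for a missing key instead of raising a KeyError
missing = d.get('Z')

# .items() loops over keys and values together
aromatic = [aa for aa, props in d.items() if int(props['nAromaticRings']) > 0]

print(rings, missing, aromatic)
```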

#### I've made a dictionary called ```d``` below, which contains dictionaries of properties for each amino acid.
# Find the ```nAromaticRings``` of ```W``` (tryptophan)

In [4]:
# answers - just in case you need them
from backtobasics import extract_data_with_list_comprehensions
d = extract_data_with_list_comprehensions(data)

# 6. Use this dictionary to find the total MolWT (molecular weight) of the protein 
#### Hint: add 616.5 (molecular weight of heme) to the total weight of all the amino acids

### 🧬 The sequence is at the top of this notebook 🧬

### Steps:
1. Save the sequence above as a string
2. Use ```<string>.replace(<from>, <to>)``` to remove the ```\n``` and ```*``` (end of sequence) characters by replacing them with an empty string
3. Loop through the sequence and, for each letter, look up the corresponding ```MolWT```
4. ```sum()``` the ```MolWT``` for each amino acid and add ```616.5``` to account for the heme
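
The steps above can be sketched on a four-letter toy sequence. The ```weights``` dictionary here is a flat, hand-typed stand-in holding only the ```MolWT``` values as floats. With the nested dictionary from step 5 the lookup would be something like ```float(d[letter]['MolWT'])```:

```python
HEME_WEIGHT = 616.5  # molecular weight of heme, from the hint above

# MolWT values for four amino acids, copied from the table
weights = {'M': 149.051050, 'T': 119.058243, 'I': 131.094629, 'K': 146.105528}

# A toy sequence with a line break and an end-of-sequence marker
raw = 'MT\nIK*'

# Replacing with an empty string removes the unwanted characters
sequence = raw.replace('\n', '').replace('*', '')

# Look up each letter, sum the weights, then add the heme
total = sum(weights[letter] for letter in sequence) + HEME_WEIGHT

print(sequence, total)
```

```replace()``` returns a new string each time, which is why the two calls can be chained.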

In [5]:
s = '''MTIKEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRVTRYLS
SQRLIKEACDESRFDKNLSQALKFVRDFAGDGLFTSWTHEKNWKKAHNILLPSFSQQAMKGYHAMM
VDIAVQLVQKWERLNADEHIEVPEDMTRLTLDTIGLCGFNYRFNSFYRDQPHPFITSMVRALDEAM
NKLQRANPDDPAYDENKRQFQEDIKVMNDLVDKIIADRKASGEQSDDLLTHMLNGKDPETGEPLDD
ENIRYQIITFLIAGHETTSGLLSFALYFLVKNPHVLQKAAEEAARVLVDPVPSYKQVKQLKYVGMV
LNEALRLWPTAPAFSLYAKEDTVLGGEYPLEKGDELMVLIPQLHRDKTIWGDDVEEFRPERFENPS
AIPQHAFKPFGNGQRACIGQQFALHEATLVLGMMLKHFDFEDHTNYELDIKETLTLKPEGFVVKAK
SKKIPLGGIPSPSTEQSAKKVRKKGC*'''
s

'MTIKEMPQPKTFGELKNLPLLNTDKPVQALMKIADELGEIFKFEAPGRVTRYLS\nSQRLIKEACDESRFDKNLSQALKFVRDFAGDGLFTSWTHEKNWKKAHNILLPSFSQQAMKGYHAMM\nVDIAVQLVQKWERLNADEHIEVPEDMTRLTLDTIGLCGFNYRFNSFYRDQPHPFITSMVRALDEAM\nNKLQRANPDDPAYDENKRQFQEDIKVMNDLVDKIIADRKASGEQSDDLLTHMLNGKDPETGEPLDD\nENIRYQIITFLIAGHETTSGLLSFALYFLVKNPHVLQKAAEEAARVLVDPVPSYKQVKQLKYVGMV\nLNEALRLWPTAPAFSLYAKEDTVLGGEYPLEKGDELMVLIPQLHRDKTIWGDDVEEFRPERFENPS\nAIPQHAFKPFGNGQRACIGQQFALHEATLVLGMMLKHFDFEDHTNYELDIKETLTLKPEGFVVKAK\nSKKIPLGGIPSPSTEQSAKKVRKKGC*'