# Exercise Sheet 3 <br/> CHE103 Übungen zu Anwendungen des Computers in der Chemie <br/> Spring Semester 2023

***

To hand in the exercise for feedback, upload this notebook with your answers and your code to OLAT before **Friday**. Handing in is **optional** but recommended.

<div class="alert alert-warning">
    <b>Important</b>: Before uploading a notebook with your answers back to OLAT, you need to make sure that everything works as intended. To do so, start by clearing all the outputs of the notebook by going to the toolbar and clicking <mark>Edit -> Clear All Outputs</mark> and then rerun all cells with a fresh kernel using <mark>Kernel -> Restart Kernel</mark> followed by <mark>Run -> Run All Cells</mark>. You should then go through your answers and double-check that everything is correct.
</div>

## Exercise 1: SMILES and functions

In this exercise we classify hydrocarbons based on their SMILES strings. For simplicity, we consider only alkanes (single bonds only), alkenes (double bond), alkynes (triple bond) and cycloalkanes (ring).

<div class="alert alert-success"><b>Task:</b> Write a function <code>hydrocarbon_group</code> that decides whether a compound is a hydrocarbon and (if yes) to which group it belongs. This function should take a SMILES string and return <code>"alkane"</code>, <code>"alkene"</code>, <code>"alkyne"</code>, <code>"cycloalkane"</code> or <code>None</code>.</div>

__Hint:__ The SMILES strings of these compounds consist of only 4 characters: `C`, `1`, `=` and `#`. If additional characters appear in the SMILES string, the compound is not a hydrocarbon. Use `if/elif` statements to analyze the SMILES string.

In [40]:
def hydrocarbon_group(smiles):
    # If anything other than the characters C, 1, =, # appears in the SMILES,
    # we don't have a hydrocarbon.
    
    for character in smiles:  # go through all characters in the provided smiles
        if character not in ["C", "1", "=", "#"]:
            # if any of the characters is NOT a C, 1, = or #, return the value None
            # since we do not have a hydrocarbon group and we can exit the function
            # immediately
            return None  # None (without "") is a separate datatype
    
    # From here on we can be sure that the string in the variable smiles
    # only contains C, #, 1 or =, meaning we can check for the different
    # features. Please note: the order of the if/elif branches is important!
    
    if "1" in smiles:  # a "1" is a marker for a ring
        answer = "cycloalkane"
    elif "#" in smiles:  # if there is a triple bond
        answer = "alkyne"
    elif "=" in smiles:  # have a double bond
        answer = "alkene"
    else:  # if we have neither a ring nor a triple or double bond
        answer = "alkane"

    # return the content of the variable answer as result of this function
    return answer

# test your function by checking whether the output below is correct
print('methane is', hydrocarbon_group('C'))
print('propene is', hydrocarbon_group('C=CC'))
print('ethyne is', hydrocarbon_group('C#C'))
print('cyclobutane is', hydrocarbon_group('C1CCC1'))
print('methyl isocyanate is', hydrocarbon_group('CN=C=O'))

methane is alkane
propene is alkene
ethyne is alkyne
cyclobutane is cycloalkane
methyl isocyanate is None


## Exercise 2: Dictionaries


You are working in a chemical lab that specializes in petroleum analysis. Your job is to classify petroleum samples as paraffinic (i.e. containing mostly alkanes) or naphthenic (i.e. containing mostly cycloalkanes) according to the table below:


| Class      | Alkanes (wt.%) | Cycloalkanes (wt.%) |
|------------|----------------|---------------------|
| Paraffinic | 46-61          | 22-32               |
| Naphthenic | 15-26          | 61-76               |
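
As an illustration of how such a table could be turned into code, here is a minimal sketch. The function name `classify_petroleum` and the assumption that the range limits are inclusive are not part of the exercise; the input values in the usage lines are made up.

```python
def classify_petroleum(alkane_wt, cycloalkane_wt):
    """Return 'paraffinic', 'naphthenic' or None for the given weight percentages."""
    # ranges copied from the table above; limits are assumed to be inclusive
    classes = {
        "paraffinic": {"alkane": (46, 61), "cycloalkane": (22, 32)},
        "naphthenic": {"alkane": (15, 26), "cycloalkane": (61, 76)},
    }
    for name, ranges in classes.items():
        alk_low, alk_high = ranges["alkane"]
        cyc_low, cyc_high = ranges["cycloalkane"]
        # both weight percentages have to fall into the ranges of the class
        if alk_low <= alkane_wt <= alk_high and cyc_low <= cycloalkane_wt <= cyc_high:
            return name
    return None  # the sample fits neither class


# made-up values, only to show how the function is called
print(classify_petroleum(20, 70))
print(classify_petroleum(40, 40))
```

Storing the ranges in a dictionary keeps the numbers from the table in one place, so the comparison itself does not have to be repeated for every class.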

Your lab colleague has analysed a sample of petroleum. Since she is not only a skilled chemist but also a Python pro, she sends you her results as a Python dictionary `petroleum_composition`, where the values are weight percentages:


In [41]:
petroleum_composition = {"C1CC1": 24, "CCC": 22, "CCCC": 20, "C=CC": 12, "CC": 8, "C1CCC1": 7, "C1=CC=NC=C1": 3}

print("compound (SMILES), weight percent:")
for key in petroleum_composition:
    print(key, petroleum_composition[key])

compound (SMILES), weight percent:
C1CC1 24
CCC 22
CCCC 20
C=CC 12
CC 8
C1CCC1 7
C1=CC=NC=C1 3


<div class="alert alert-success"><b>Task:</b> Using the function <code>hydrocarbon_group</code> defined above, create a new dictionary that contains the hydrocarbon groups as keys and their weight percentages as values. Print the dictionary.</div>

__Hint:__ The weight percent of a hydrocarbon group is the sum of the weight percents of all compounds of that group.

In [42]:
# start with a dictionary with pre-defined keys for the different
# hydrocarbon groups and with the integer value 0

hc_group_weights = {
    "alkane": 0,
    "alkene": 0,
    "alkyne": 0,
    "cycloalkane": 0,
}

for key in petroleum_composition:  # copied from above, to go through the provided values
    group = hydrocarbon_group(key)
    
    # first we have to check whether it is a hydrocarbon group at all
    if group is not None:
        # then add the value
        hc_group_weights[group] += petroleum_composition[key]

for key in hc_group_weights:
    print("total weight percent of the hydrocarbon group", key, "is", hc_group_weights[key])

total weight percent of the hydrocarbon group alkane is 50
total weight percent of the hydrocarbon group alkene is 12
total weight percent of the hydrocarbon group alkyne is 0
total weight percent of the hydrocarbon group cycloalkane is 31


<div class="alert alert-success"><b>Question:</b> Is this petroleum sample paraffinic or naphthenic? </div>

__Answer:__

## Exercise 3: Using the RCSB Protein Data Bank

The following example illustrates how you can use Python to automate some simple data analysis with the help of loops, lists and dictionaries.

We are going to use some data from the [RCSB PDB](https://www.rcsb.org/) database:

* Go to the website and search for the entry with the ID `1XF0`. This enzyme, _17-beta hydroxysteroid dehydrogenase_, is involved in the production of testosterone (you can read more about its purpose and its relation to anabolic steroids on the [RCSB PDB Molecule of the Month page](http://pdb101.rcsb.org/motm/92)).

First we are going to inspect this enzyme visually:
* Click on `3D View: Structure` on the left side of the page, under the 3D representation of the enzyme.
* Choose `JSmol (JavaScript)` under `select a different viewer` at the bottom right of the page.
* Helices and sheets now appear in different colors.

<div class="alert alert-success"><b>Task 1:</b> Count the number of $\alpha$-helices and $\beta$-sheets.</div>

Now we are going to do the same task programmatically using the PDB file:
* Download the molecule in PDB format.
* Make sure that the file `1xf0.pdb` is in the same folder as this notebook and run the next Python cell.

In [None]:
# open the file for reading using a 'with' statement
with open("1xf0.pdb", "r") as myfile: # associate the file with the variable 'myfile'
    # and read the complete file into the variable 'content'
    content = myfile.read()
# after such a 'with' statement, the file is automatically closed again

# but its content is still contained in the variable 'content'
print(content)

When browsing through the file content above you should notice that the atoms, helices and sheets are directly specified in it.

<div class="alert alert-success"><b>Task 2:</b> Write a loop to count and print the number of atoms, helices and sheets.</div>

__Hint__: You need the functions from the past exercises/lectures to get a list of lines from the file content. Then loop over the lines and check whether they start with a specific word. Be careful: for the sheets you can't simply count the lines since they consist of multiple strands (see also the [PDB File Format Documentation for SHEET](http://www.wwpdb.org/documentation/file-format-content/format33/sect5.html#SHEET)).
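
The idea from the hint can be sketched on a few made-up, simplified records rather than on the real file. The lines in `sample` below are invented for illustration and are not taken from `1xf0.pdb`. The sketch also assumes that the sheet identifier is the third whitespace-separated field of a `SHEET` line, which holds for these simplified lines; real PDB files use fixed-width columns, so check the format documentation before relying on it.

```python
# a few made-up, simplified PDB-like records (NOT the content of 1xf0.pdb)
sample = """\
HELIX    1   1 ASP A   10  GLY A   20  1
HELIX    2   2 SER A   40  LYS A   52  1
SHEET    1   A 2 VAL A  30  LEU A  35  0
SHEET    2   A 2 ILE A  60  THR A  65 -1
SHEET    1   B 1 ALA A  80  PHE A  85  0
ATOM      1  N   MET A   1      11.104  13.207   2.100
ATOM      2  CA  MET A   1      12.560  13.100   2.300
"""

n_atoms = 0
n_helices = 0
sheet_ids = set()  # a set stores every sheet identifier only once

for line in sample.splitlines():  # turn the text into a list of lines
    if line.startswith("ATOM"):
        n_atoms += 1
    elif line.startswith("HELIX"):
        n_helices += 1
    elif line.startswith("SHEET"):
        # every SHEET line describes one strand; strands of the same sheet
        # share the same identifier (assumed here to be the third field)
        sheet_ids.add(line.split()[2])

n_sheets = len(sheet_ids)
print("atoms:", n_atoms, "helices:", n_helices, "sheets:", n_sheets)
```

Collecting the identifiers in a set means that the three `SHEET` lines above are counted as two sheets instead of three.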

In [None]:
# your code here

And finally, we would like to get some statistics on the amino acids occurring in the given enzyme (specified in the lines starting with `SEQRES`).

<div class="alert alert-success"><b>Task 3 (Advanced):</b> Write another loop which counts all amino acids and some code to print the statistics in the form<br/>
<code>
GLY: 3
ALA: 5
...
</code>
</div>

__Hint__: Even though the amino acids are all known, dictionaries make it easy to avoid defining them manually in your script. Loop over the lines of the file and analyze those starting with `SEQRES`. For each amino acid present, check whether there is already a key for it in the `amino_acids` dictionary. If not, create a new key with the value 1. If there is, increment the corresponding value by one.

<div class="alert alert-warning">
    <b>Note:</b> This task requires a fairly advanced programming level. For a simpler exercise, you can just count and print the total number of amino acids programmatically.
</div>
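
The counting pattern described in the hint can be sketched on two made-up lines in the style of `SEQRES` records. The lines in `sample_lines` and the name `counts` are invented for illustration. The sketch assumes that the residue names start at the fifth whitespace-separated field, after the record name, the serial number, the chain identifier and the number of residues.

```python
# two made-up SEQRES-style lines (NOT taken from 1xf0.pdb)
sample_lines = [
    "SEQRES   1 A    8  MET GLY ALA GLY SER",
    "SEQRES   2 A    8  ALA GLY LEU",
]

counts = {}  # keys: amino acid names, values: number of occurrences

for line in sample_lines:
    if line.startswith("SEQRES"):
        # skip the first four fields, the rest are residue names
        for residue in line.split()[4:]:
            if residue in counts:
                counts[residue] += 1  # key already exists: increment
            else:
                counts[residue] = 1   # first occurrence: create the key

for residue in counts:
    print(residue + ":", counts[residue])

# the simpler variant mentioned in the note: the total number of amino acids
print("total:", sum(counts.values()))
```

Because new keys are created on first sight, no amino acid name has to be written into the script by hand.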

In [None]:
# your code here
amino_acids = {} # the keys in this dictionary will be the amino acids (GLY, ALA, etc.)

## Exercise 4: Theoretical Questions
Here are a few more conceptual questions. You might find the answers in the lecture slides, or you might have to do some internet research of your own. A good place to start is the [official Python documentation](https://docs.python.org/3/). Python also comes with a [long list of modules/libraries](https://docs.python.org/3/library) for many common tasks. You can answer with text and/or example code.

<div class="alert alert-success"><b>Question 1:</b> How can you remove an item from a dictionary?</div>

__Answer 1:__

<div class="alert alert-success"><b>Question 2:</b> Suppose you want to store the square roots of all square numbers between 1 and 100 in a Python object called <code>squareroot</code> such that <code>squareroot[1] == 1</code>, <code>squareroot[4] == 2</code>, <code>squareroot[9] == 3</code>, ... <code>squareroot[100] == 10</code>. Would you choose a list or a dictionary for <code>squareroot</code>? What would be the required length <code>len(squareroot)</code> of this object in both cases?</div>

__Answer 2:__

<div class="alert alert-success"><b>Question 3:</b> In Exercise 2, your colleague chose SMILES strings as keys in the dictionary <code>petroleum_composition</code>. Could she have chosen strings representing the chemical formula instead (such as <code>"C3H6"</code> for propene instead of <code>"C=CC"</code>)?</div>

__Answer 3:__

<div class="alert alert-success"><b>Question 4:</b> Why do we need <code>InChI</code>s when we already have <code>SMILES</code>? </div>

__Answer 4:__