# Strings

## Table of content

1. [Strings](#Strings)
   1. [Table of content](#Table-of-content)
2. [Definitions](#Definitions)
   1. [Position Si](#Position-Si)
      1. [repr](#repr)
   2. [Substring Sij](#Substring-Sij)
   3. [Length N](#Length-N)
3. [String Methods](#String-Methods)
   1. [split](#split)
   2. [find](#find)
   3. [in](#in)
4. [The module re](#The-module-re)
   1. [split](#split)
   2. [wildcard in re](#wildcard-in-re)
   3. [finditer](#finditer)
   4. [findall](#findall)
5. [Modifying strings](#Modifying-strings)
   1. [upper and lower](#upper-and-lower)
   2. [strip](#strip)
   3. [replace](#replace)
   4. [Concatenating strings](#Concatenating-strings)
6. [Working with files](#Working-with-files)
   1. [open](#open)
   2. [write and save](#write-and-save)
   3. [read](#read)
   4. [parsing files](#parsing-files)
7. [Bioinformatic file types](#Bioinformatic-file-types)
   1. [fasta files](#fasta-files)
   2. [fastq files](#fastq-files)
   3. [genebank files](#genebank-files)
   4. [clustalW files](#clustalW-files)
   5. [pdb files](#pdb-files)
8. [Exercises](#Exercises)
   1. [Exercise 1 - Simple Strings](#Exercise-1---Simple-Strings)
   2. [Exercise 2 - String methods](#Exercise-2---String-Methods)
   3. [Exercise 3 - Working with Strings](#Exercise-3---Working-with-strings)
   4. [Exercise 4 - A bit more about Strings](#Exercise-4---A-bit-more-about-Strings)
   5. [Exercise 5 - Writing Files](#Exercise-5---Writing-Files)
   6. [Exercise 6 - Reading a file](#Exercise-6---Reading-a-file)
   7. [Exercise 7 - Small sequence comparison](#Exercise-7---Small-sequence-comparison)
9. [Extras](#Extras)
   1. [Accessing the letters of the alphabet](#Accessing-the-letters-of-the-alphabet)
   2. [whitespaces with escape characters](#whitespaces-with-escape-characters)

# Definitions

- Alphabets are the collection of symbols or letters that can make up the string
  - Σdna = {A,G,C,T} ; for nucleotides
  - Σamino = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}; for amino acids
- Each **Element E** of an alphabet is named symbol or letter
- **String S** is a juxtaposition of symbols or letters from the alphabet Sigma
  - S = (ΣE) (An order of symbols is defined)
  - E.g: S = "ATGGAATGCTAATAG"
- **Position Si** is the letter E at position i in the string (counting starts at position 0!)
- **Substring Sij** are the letters from position Si to Sj, excluding the latter
- **Length N** of a string is the number of letters in it
- **Split Spliti** is the splitting of a string into two substrings at position i. Two new strings S0i and SiN are created
- **Split SplitDelimiter** is the splitting of a string into multiple substrings at each occurrence of the delimiter (which can be anything; typical examples are a space, a tab or a comma). The number of substrings is equal to the number of occurrences of the delimiter + 1
- **Prefix n** is a substring of the string S which contains the first n letters of S
- **Suffix n** is a substring of the string S which contains the last n letters of S

## Position Si

We can access individual letters/symbols of a string similar to accessing elements of a list. 

- Indexing starts at 0
- Negative indices access the string from the back
- It is only possible to access elements that exist, otherwise there is an out of range error.
- whitespace characters that are written with an escape sequence (`\n`, `\t` etc.) count as *one character*

```
S = "Examplestring"
S[0]    # "E"
S[1]    # "x"
S[-1]   # "g"
S[20]   # IndexError: string index out of range
```




In [1]:
S = "Examplestring"
print(S[0])
print(S[1])
print(S[-1])
print(S[15])

E
x
g


IndexError: string index out of range

### repr

If we access a character that has no visible glyph (e.g. space, tab, linebreak), `print` shows what looks like nothing. Wrapping the value in `repr` inside `print` makes it visible: the character is shown in quotes (`' '`) and escape sequences are written out (`'\n'`).

In [2]:
S = "Example string"
print(S[7])
print(repr(S[7]))

 
' '


In [3]:
S = "Example\nstring"

print("####")
print(S)

print("####")
print(repr(S))

print("####")
print(S[7])

print("####")
print(repr(S[7]))

print("####")

####
Example
string
####
'Example\nstring'
####


####
'\n'
####


## Substring Sij

To generate a substring the string is sliced. This follows the slicing rules of lists:

- start index i is included
- end index j is excluded
- j can be larger than the length of the string, the slice then simply runs to the last letter/symbol
- the start or the end index may be left out if we want everything from the start / up to the end

```
S = "Examplestring"
# general form: S[i:j]

S[2:5]   # "amp"
S[:5]    # same as S[0:5], "Examp"
S[5:]    # same as S[5:13], "lestring"
```
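
The definitions of Split_i, Prefix n and Suffix n from the beginning of this chapter map directly onto slicing. A minimal sketch; the variable names `i`, `n`, `left` and `right` are only chosen for this illustration:

```python
S = "ATGGAATGCTAATAG"

# Split at position i: two new strings S[0:i] and S[i:N]
i = 3
left, right = S[:i], S[i:]
print(left, right)        # ATG GAATGCTAATAG

# Prefix n = first n letters, suffix n = last n letters
n = 3
prefix = S[:n]
suffix = S[-n:]
print(prefix, suffix)     # ATG TAG

# Putting the two halves back together gives the original string
print(left + right == S)  # True
```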




In [4]:
S = "Examplestring"
print(S[2:5])
print(S[:5])
print(S[5:])
print(S[5:50])

amp
Examp
lestring
lestring


## Length N

`len(S)` counts the number of symbols in the string S

- including spaces, commas, linebreaks etc.
- `\n` etc. count as *one character*

In [5]:
S = "Examplestring"
S2 = "Example string"
S3 = "Example\nstring"

print(f"S has length {len(S)} and S2 and S3 have length {len(S2)} and {len(S3)}")

S has length 13 and S2 and S3 have length 14 and 14


# String methods

## split

`split` is a built-in string method that splits a string at every occurrence of a specific symbol/letter (the delimiter).

- It returns a list of the substrings
- by default (no argument) the string is split at any run of whitespace (spaces, tabs, linebreaks)
- otherwise the delimiter can be specified as an argument (it can be anything, even several letters)
- the delimiter is *removed* and does not appear in the list!
- If the delimiter is not encountered, split returns the entire string in a list of one element

Syntax:

```
string.split()

# to use "," as delimiter:
string.split(",")
```
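
Two details are easy to miss: `split()` without an argument treats any run of whitespace as one delimiter, while an explicit delimiter is taken literally and keeps empty strings; and the number of substrings is the number of delimiter occurrences plus one. A small sketch with made-up example strings:

```python
S = "A  B\tC"   # two spaces between A and B, a tab between B and C

print(S.split())      # ['A', 'B', 'C']
print(S.split(" "))   # ['A', '', 'B\tC']

row = "id,name,,length"
parts = row.split(",")
print(parts)                              # ['id', 'name', '', 'length']
print(len(parts) == row.count(",") + 1)   # True
```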

In [6]:
S = "This is an example"
split_list = S.split()
print(split_list)

['This', 'is', 'an', 'example']


In [7]:
S = "This is an example, so we need to make a bigger sentence"
split_list = S.split(",")
print(split_list)

['This is an example', ' so we need to make a bigger sentence']


## find

The string method `find` allows us to search for substrings.

- it returns the start index of the *first* occurrence of the search term
- case sensitivity applies
- if the search term is not present find returns -1

Syntax

```
string.find("term")
```
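
Because `-1` is itself a valid index (the last letter), the return value of `find` should be checked before it is used for indexing or slicing. A minimal sketch; the helper name `text_after` is hypothetical and only used here:

```python
def text_after(S, term):
    """Return everything in S after the first occurrence of term, or None."""
    idx = S.find(term)
    if idx == -1:          # term not present
        return None
    return S[idx + len(term):]

S = "This is an example"
print(text_after(S, "an "))   # example
print(text_after(S, "AN "))   # None, because find is case sensitive
```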




In [8]:
S = "This is an example"
idx = S.find("is")
print(idx)

idx2 = S.find("IS")
print(idx2)

2
-1


## in

If we just want to know whether a substring is present and don't care *where* precisely it is, we can use `in`

- it returns `True` or `False`

Examples:

    print("term" in string)
    
    if "term" in string:
        do something

In [9]:
S = "This is an example"
print("is" in S)

if "is" in S:
    print("Yeah")

True
Yeah


# The module re

This module contains functions for regular expression operations. It is part of the standard python libraries.

- contains a lot of useful methods
- we will discuss only 3, but there are many more
- we use the functions with the point notation

import Syntax:

    import re

A note on escape characters in re:

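- in an ordinary Python string the backslash `\` is already the escape character, so a backslash that is meant for `re` should be doubled (`\\`) or the pattern written as a raw string (`r"..."`)
- an unknown escape such as `"\."` in an ordinary string currently only triggers a warning; in future Python versions it will be a syntax error
- to search for the actual two-character sequence `\n` (a backslash followed by n) we can either use `"\\\\n"` or the raw string `r"\\n"`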
- escape sequences:



## split

similar to the built in `split`

- several delimiters are allowed, separated by pipes `|`
- two arguments
  - the delimiter(s)
  - the string
- Note that the delimiter definition is space sensitive, so unless a space is meant to be (part of) the delimiter, keep the symbols and pipes together
  - `(" ")` = single space as delimiter
  - `(",|;")` = comma or semicolon as delimiter
  - `(", | ;")` = comma followed by space, or space followed by semicolon, as delimiter
- Returns a list with the substrings
  - if the string starts with a delimiter, the first element of the result list is an empty string
  - if two delimiters follow each other, this is also represented by an empty string
  - otherwise the delimiter is simply removed, like in the built-in `split`
  - The reason for this:
    - `split` opens a substring, which is closed when it comes across a delimiter.
    - If the first symbol is a delimiter, the substring is closed immediately after being opened and is empty.
    - The same happens if two delimiters follow each other: an empty string is produced
    - the built-in `split` only drops these empty strings when it is called without an argument; with an explicit delimiter it keeps them, just like `re.split`.
    - to exclude the empty strings see the example below

Syntax:

    re.split("delimiter1|delimiter2", string)

In [10]:
import re

S = ",This is, a long, sentence; with weird punctuation,;!"
S_split0 = re.split(",|;",S)

print(S_split0)

#To exclude the emtpy string:
S_split = [elem for elem in S_split0 if len(elem)>0]
print(S_split)

['', 'This is', ' a long', ' sentence', ' with weird punctuation', '', '!']
['This is', ' a long', ' sentence', ' with weird punctuation', '!']


## wildcard in re

`.` is a wildcard standing for exactly one character

- if a literal `.` is to be used as a delimiter it needs to be combined with an escape character
- `\\.` in an ordinary string, or `r"\."` as a raw string, so that the backslash reaches `re` unchanged

The wildcard can be combined with a quantifier: `.*` stands for zero or more characters and `.+` for one or more characters.
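
The difference between the bare wildcard and its quantified forms can be checked with `re.fullmatch`, which only succeeds if the whole string matches the pattern. A short sketch with made-up test strings; the patterns are written as raw strings (`r"..."`) so that backslashes reach `re` unchanged:

```python
import re

# "." stands for exactly one character
print(bool(re.fullmatch(r"A.G", "ATG")))    # True
print(bool(re.fullmatch(r"A.G", "AG")))     # False

# ".*" stands for zero or more characters, ".+" for one or more
print(bool(re.fullmatch(r"A.*G", "AG")))    # True
print(bool(re.fullmatch(r"A.+G", "AG")))    # False
print(bool(re.fullmatch(r"A.+G", "ATTG")))  # True

# An escaped dot matches only a literal "."
print(re.split(r"\.", "seq1.fasta"))        # ['seq1', 'fasta']
```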

In [11]:
import re

S = "This is an example of two senteces. So I need a second sentence."
print(re.split("\.",S))

['This is an example of two senteces', ' So I need a second sentence', '']


  print(re.split("\.",S))


In [12]:
import re

S = "This is an example of two senteces. So I need a second sentence."
print(re.split("\\.",S))

['This is an example of two senteces', ' So I need a second sentence', '']


## finditer

`finditer` finds *all* occurrences of a search term.

It returns an iterator rather than a list, so getting to the indices takes one extra step.

Syntax using a list comprehension:

    idcs = [m.start() for m in re.finditer("a",S3)]

Explanation:

- `finditer` takes two arguments: the search term and the string.
- It produces an iterator which yields the match objects (= the found instances)
- by looping over this iterator we can look at the match objects
- An easy way to get to the start indices:
  - the match objects have a method `start` which we can call with point notation and which returns the start index of the match
  - by looping over the match objects we can save the indices in a list
  - similarly there is a method called `end` which returns the index directly after the end of the match
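
Besides `start` and `end`, match objects also offer `group()`, which returns the matched text. A minimal sketch that collects the positions of a motif in a made-up DNA string:

```python
import re

dna = "ATGGAATGCTAATAG"

# one (start, end, text) tuple per occurrence of "ATG"
hits = [(m.start(), m.end(), m.group()) for m in re.finditer("ATG", dna)]
print(hits)   # [(0, 3, 'ATG'), (5, 8, 'ATG')]

# the end index is exclusive, so slicing with it returns the match itself
for start, end, text in hits:
    print(dna[start:end] == text)   # True
```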

In [13]:
import re
S = "This is an example, of two senteces, So I need a second, sentence."

idcs = [m.start() for m in re.finditer(",",S)]
print(idcs)

[18, 35, 55]


In [14]:
# accessing the match objects
test1 = re.finditer(",",S)

for m in test1:
    print(m)
    print(m.start())
    print(m.end())
    #print(dir(m))

<re.Match object; span=(18, 19), match=','>
18
19
<re.Match object; span=(35, 36), match=','>
35
36
<re.Match object; span=(55, 56), match=','>
55
56


## findall

To see *how often* a substring occurs without caring where, `findall` is a good option

- findall returns a list with the occurrences (which all look the same for a plain search term)
- the length of that list is the number of occurrences

Syntax:

    Sf = re.findall("term",string)
    len(Sf)

In [15]:
S = "Hello, This is a long example"
Sf = re.findall("l",S)
print(Sf)
print(len(Sf))

['l', 'l', 'l', 'l']
4
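
`findall` (like `finditer`) reports non-overlapping matches only, which matters for repetitive sequences. A short sketch with a made-up string; the lookahead pattern `(?=AA)` is one common way to find overlapping occurrences as well and is not part of the notes above:

```python
import re

S = "AAAA"

# non-overlapping: positions 0-1 and 2-3
print(len(re.findall("AA", S)))   # 2
print(S.count("AA"))              # 2, the string method counts the same way

# overlapping: a lookahead matches at every position where "AA" starts
starts = [m.start() for m in re.finditer("(?=AA)", S)]
print(starts)                     # [0, 1, 2]
```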


# Modifying strings

Sometimes we need to make changes to strings, e.g. set sequence data to all uppercase or all lowercase so that case sensitivity doesn't cause mismatches.

## upper and lower

`upper` and `lower` are built-in string methods which turn a string into all uppercase / all lowercase

- the result is assigned to a new variable
- S itself remains unchanged, unless it is reassigned to itself (`S = S.upper()`)
- upper and lower do not work in place

```
S = "This is a string"
S_upper = S.upper()
S_lower = S.lower()
```
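
A typical use is a case-insensitive search: both the text and the search term are brought to the same case first. A minimal sketch with a made-up sequence:

```python
seq = "atgGAATgctaATAG"   # mixed case, as sometimes found in sequence files
motif = "GAATGC"

print(motif in seq)                   # False, because the cases differ
print(motif.upper() in seq.upper())   # True
print(seq.upper().find(motif))        # 3

print(seq)   # the original string is unchanged: atgGAATgctaATAG
```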



In [16]:
S = "This is a string"
S_upper = S.upper()
S_lower = S.lower()

print(S)
print(S_upper)
print(S_lower)

This is a string
THIS IS A STRING
this is a string


## strip

strip removes whitespace (spaces, tabs, linebreaks) at the beginning and end of the string

- like with upper and lower, the result has to be assigned to a variable
- it doesn't change the original string, unless it is reassigned to itself

Syntax:

```
S = " This is a string"
S1 = S.strip()
```


In [17]:
S = " This is a string"
S1 = S.strip()
print(repr(S))
print(repr(S1))

' This is a string'
'This is a string'


## replace

replace changes one substring into another.

- takes two arguments
  - the substring I want to replace
  - what I want to replace it with
- it is not sensitive to wildcards (not even in the search string), so `.` is taken literally
- all occurrences of the substring are replaced

Syntax:

```
S = "This is a string"
Sreplace = S.replace("This","That")
```

Example how to get rid of all spaces, linebreaks and tabstops:

```
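S2 = "A T\tG\nC"   # example string with a space, a tab and a linebreak
replacements = [" ", "\n", "\t"]
for repl in replacements:
    S2 = S2.replace(repl, "")
# after the loop S2 is "ATGC"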
```




In [18]:
S = "This is a string"
Sreplace = S.replace("This","That")
print(repr(S))
print(repr(Sreplace))

'This is a string'
'That is a string'


In [19]:
S = "Th.is and is a string"
Sreplace = S.replace(".is","That")
print(repr(S))
print(repr(Sreplace))

'Th.is and is a string'
'ThThat and is a string'


## Concatenating strings

With the `+` operator two strings can be combined. They are joined directly, with nothing in between.

```
S1 = "This is a string"
S2 = "This is another string"
S3 = S1+S2

# S3 is now "This is a stringThis is another string"
```
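
For more than a handful of strings, the string method `join` is a common alternative to repeated `+`: it is called on the separator and takes a list of strings. A minimal sketch with made-up strings:

```python
parts = ["This is a string", "This is another string", "And a third one"]

print("".join(parts))     # no separator, same as adding the strings with +
print(" ".join(parts))    # a single space between the strings

# join only accepts strings, so numbers have to be converted first
numbers = [1, 2, 3]
print(",".join(str(n) for n in numbers))   # 1,2,3
```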


In [20]:
S1 = "This is a string"
S2 = "This is another string"
S3 = S1+S2
print(S3)
S4 = S1 + " " + S2 + "!"
print(S4)

This is a stringThis is another string
This is a string This is another string!


# Working with files

It is relatively simple to create files in Python. The first step is always opening a file.

## open

- Takes two arguments
  - the file name (or the relative or absolute path to the file)
  - what we want to do with the file (the mode)
    - "w": writing, creates a new file or overwrites the existing content
    - "a": appending, adds to the end of an existing file (also works on non-existing files, which are then created)
    - "r": reading, works only for existing files
- we always need to assign the opened file to a variable!!

```
file = open("TestFile.txt","w")
```

## write and save

To write into a file we use the file method `write` with point notation

- takes a single argument, which has to be a string!
- doesn't create linebreaks between two `write` statements; if we want them we have to add them ourselves!!
- to write several strings in one go they can be combined with `+`
- f-strings can be used as well to insert variables
- several write statements underneath each other are all added to the file one after the other
  - this is the part where linebreaks are not added!! It behaves like `+` with strings
- the file doesn't need to exist for "w"; it is created in the working directory (or at the given path) as soon as it is opened
- the content is only guaranteed to be on disk once we close the file with `file.close()`

```
a = "test"
b = 34
# a simple write
file.write("string")
file.write("string1" + a + str(234) + str(b) + "\n")
file.close()
```

Alternatively the whole procedure, including the opening, can be placed in a `with` statement. The file is then closed (and therefore saved) automatically when the `with` block ends:

```
with open("Filename.txt", "w") as with_file:
    with_file.write("some text\n")
```
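
Putting the points above together, the sketch below writes a sequence to a file with an explicit linebreak after every `write` and reads it back to check the result. The file name `example_write.txt`, the sequence and the line width of 10 are arbitrary choices for this illustration.

```python
name = "seq1"
seq = "ATGGAATGCTAATAGATGGAATGC"
width = 10

with open("example_write.txt", "w") as out_file:
    out_file.write(f">{name}\n")                 # f-string plus explicit linebreak
    for i in range(0, len(seq), width):
        out_file.write(seq[i:i + width] + "\n")  # one chunk per line

# the file is closed (and therefore saved) as soon as the with block ends
with open("example_write.txt", "r") as in_file:
    print(repr(in_file.read()))
# '>seq1\nATGGAATGCT\nAATAGATGGA\nATGC\n'
```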



In [21]:
file = open("Test.txt","w")
file.write("Hello world!\n")
# add a visual separator, 20 x #
file.write(20*'#' + "\n")
file.write("My name is Lisa")
file.close()

In [22]:
with open("Test2.txt", "w") as with_file:
    with_file.write("hello garden\n")
    with_file.write(50*'-' + "\n")
    with_file.write("My name is Lisa R\n")

In [23]:
file = open("Test.txt","a")
file.write("\nSome more lines\n")
file.close()

## read

It is best to always use a `with` statement to read a file, to avoid leaving it open when done

```
with open("Filename.txt","r") as read_file:
    ...  # read the content here, see the three options below
```

Inside the `with` block there are three ways of reading the file:

1. reading the entire file as one big string
   - contains the whole text including all linebreaks, special characters etc.
   - best suited to small files, since the complete content is held in memory at once
2. reading all lines at once and saving each line as an element of a list (also for small files)
   - includes the linebreak at the end of each element
3. looping over the lines individually, which has the advantage that only one line at a time is stored in memory
   - a useful line of code inside the loop removes the linebreak at the end, so that the last character of the line is the last one we actually see in print

```
# 1. whole content in one big string
with open("Filename.txt","r") as read_file:
    content = read_file.read()

# 2. each line is one element of a list
with open("Filename.txt","r") as read_file2:
    lines = read_file2.readlines()

# 3. looping over the lines
with open("TestFile.txt","r") as read_file3:
    for line in read_file3:
        line_adj = line.replace("\n","")
        print(line_adj)
```

In [24]:
with open("Test.txt","r") as read_file:
    content = read_file.read()
print(content)

print(repr(content))

Hello world!
####################
My name is Lisa
Some more lines

'Hello world!\n####################\nMy name is Lisa\nSome more lines\n'


In [25]:
with open("Test.txt","r") as read_file:
    lines = read_file.readlines()
print(lines)

['Hello world!\n', '####################\n', 'My name is Lisa\n', 'Some more lines\n']


In [26]:
with open("Test.txt","r") as read_file:
    for line in read_file:
        print(repr(line))



'Hello world!\n'
'####################\n'
'My name is Lisa\n'
'Some more lines\n'


## parsing files

Many bioinformatics files contain information we don't need.

Picking out the information we do need is called parsing.

To be able to do this we need to know what the file looks like. Parsing then follows 4 steps:

1. Read in the file
2. Iterate over the file line by line
3. Disassemble the lines so only the necessary information is retained -> Know the format!
4. Save the data in a Python-usable data structure

Before you start writing a parser ask yourself the following questions:

1. Can I use an already existing parser?
2. Do I need to make the parser by myself?
   1. Which information do I need?
   2. How do I have to dismantle each line?
   3. Which substrings can be used as marker for my data?

# Bioinformatic file types

Most Bioinformatic files are simple text files with a specific extension. Some of the most common are:

- fasta
- fastq
- genbank (gb)
- clustal
- pdb
- xml
- json
- mmCIF
- and many more

## fasta files

Fasta files are used to save and relay raw sequences with only very little additional data

Extensions (do not influence the content, simply help us identify the sequence type):

- .fasta (general sequences)
- .faa (protein sequences)
- .fna (nucleic acid sequences)

The makeup is always one (single!) line starting with `>`, which contains the name (and sometimes additional information) of the sequence, followed by one or more lines containing the corresponding sequence itself

A file can contain several sequences underneath each other separated by their respective name lines
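
As an illustration of the four parsing steps, a minimal hand-written FASTA parser might look like the sketch below. The function name `read_fasta` and the decision to store the sequences in a dictionary keyed by the full name line are assumptions of this example; for real work an existing parser is usually the better choice.

```python
import io

def read_fasta(handle):
    """Collect the sequences of a FASTA file in a dict {name: sequence}."""
    sequences = {}
    name = None
    for line in handle:                 # iterate line by line
        line = line.strip()             # remove the linebreak
        if not line:
            continue                    # skip empty lines
        if line.startswith(">"):        # a name line opens a new record
            name = line[1:]
            sequences[name] = ""
        elif name is not None:          # sequence lines belong to the last name
            sequences[name] += line
    return sequences

# io.StringIO stands in for an opened file so the sketch runs without a file
example = io.StringIO(">seq1 test\nATGG\nAATG\n>seq2\nCTAATAG\n")
print(read_fasta(example))
# {'seq1 test': 'ATGGAATG', 'seq2': 'CTAATAG'}
```

With a real file, the function would be called inside a `with open("Filename.fasta", "r") as fasta_file:` block as `read_fasta(fasta_file)`.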

There are two good parsers for fasta files; both can easily import the sequences and each has a few other extras:

**SeqIO from Biopython**

- using `parse()` (not `read()`!!)
- generates a FastaIterator object containing one sequence object for each sequence
- we iterate over the sequence objects with a loop
- also used for many other filetypes
- https://biopython.org/wiki/SeqIO

Syntax:

```
from Bio import SeqIO

fasta1 = SeqIO.parse("Filename.fasta","fasta")

for record in fasta1:
    print(record.seq) # for the sequence
    print(record.id) # for the sequence name
```

**fastaparser**

- needs to be installed (not part of main Python)
- file is opened with a `with` statement
- we need to loop over the resulting object to access the sequences
- there is a standard and a quick method of reading in the data
  - the standard method results in an object where we can access the sequence with the method `seq.sequence_as_string()`
  - the quick method results in an object that has the sequence as an attribute `seq.sequence` (without `()`)
  - the standard method has many more options that can be used, e.g. to update the id line, to calculate the GC content etc., see below


```
import fastaparser

with open("Filename.fasta","r") as fasta_file:
    fasta1 = fastaparser.Reader(fasta_file)
    for seq in fasta1:
        print(seq.sequence_as_string())

# OR

with open("Filename.fasta","r") as fasta_file:
    fasta1 = fastaparser.Reader(fasta_file, parse_method = "quick")
    for seq in fasta1:
        print(seq.sequence)
```


In [27]:
from Bio import SeqIO

fasta1 = SeqIO.parse("B3JI28.fasta","fasta")
for record in fasta1:
    print(record.id) # for the sequence name
    print()
    print(record.seq) # for the sequence
    print()
    print(dir(record))

tr|B3JI28|B3JI28_9BACT

MELKTQEVMKKLISKLYTAWLIITVGLFSACTPDSFELEGKDVTVDDLVEGIAFSITHDSENPNIVYLKSLMPSSYQVCWQHPQGRSQEREVTLQMPFEGKYEVTFGVQTRGGIVYGNPATFTIDSFCADFVNDDLWTYLTGGVNNEKVWIFDNGSYGYAAGEMTYADPSTTVEWNNWSANWDPGVGHTGDDAIWESTMTFGLKGGANVTVYNSSSKETASGTFMLNTSNHTITFTDCELLHTPSWSDRSTNWSRDLKLLELDENHMRIGVMRDNSEGPWWLIWNFVSKEYADNYVPEDKPDPEPALPDGWEDMVSEVVTSKIKWTMSADVPFDWANLDGSLMNNFTAGNYPDWATPVSGLDQLSMTLDSKTMTYEFAMPDGTTASGTYTLDEKGIYTFSGNVPSYHIGGGDIMFGADGNNQLRILSIESAAGNVMGMWVGARSAEKDEYQAYHFIPNAGGGTSQPEATSIAVDNSKLVWGHLETDKNNFRIELYNMYGETASASPIDPASIVFDYSMELTFTISGLTGDAATKEYNAGLMCTADGWWPAYSGTSDVKVKGDGTYTIKMKPEAAYNGVIVFVIDIFDMFSDIADLDAVNVTIDSLKIL

['_AnnotationsDict', '_AnnotationsDictValue', '__add__', '__annotations__', '__bool__', '__bytes__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__m

In [28]:
import fastaparser

with open("B3JI28.fasta","r") as fasta_file:
    fasta1 = fastaparser.Reader(fasta_file)
    for seq in fasta1:
        print(seq.sequence_as_string())
        print()
        print(seq.id)
        print()
        print(dir(seq))

MELKTQEVMKKLISKLYTAWLIITVGLFSACTPDSFELEGKDVTVDDLVEGIAFSITHDSENPNIVYLKSLMPSSYQVCWQHPQGRSQEREVTLQMPFEGKYEVTFGVQTRGGIVYGNPATFTIDSFCADFVNDDLWTYLTGGVNNEKVWIFDNGSYGYAAGEMTYADPSTTVEWNNWSANWDPGVGHTGDDAIWESTMTFGLKGGANVTVYNSSSKETASGTFMLNTSNHTITFTDCELLHTPSWSDRSTNWSRDLKLLELDENHMRIGVMRDNSEGPWWLIWNFVSKEYADNYVPEDKPDPEPALPDGWEDMVSEVVTSKIKWTMSADVPFDWANLDGSLMNNFTAGNYPDWATPVSGLDQLSMTLDSKTMTYEFAMPDGTTASGTYTLDEKGIYTFSGNVPSYHIGGGDIMFGADGNNQLRILSIESAAGNVMGMWVGARSAEKDEYQAYHFIPNAGGGTSQPEATSIAVDNSKLVWGHLETDKNNFRIELYNMYGETASASPIDPASIVFDYSMELTFTISGLTGDAATKEYNAGLMCTADGWWPAYSGTSDVKVKGDGTYTIKMKPEAAYNGVIVFVIDIFDMFSDIADLDAVNVTIDSLKIL

tr|B3JI28|B3JI28_9BACT

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__',

In [29]:
with open("B3JI28.fasta","r") as fasta_file:
    fasta1 = fastaparser.Reader(fasta_file, parse_method = "quick")
    for seq in fasta1:
        print(seq.sequence)
        print(dir(seq))

MELKTQEVMKKLISKLYTAWLIITVGLFSACTPDSFELEGKDVTVDDLVEGIAFSITHDSENPNIVYLKSLMPSSYQVCWQHPQGRSQEREVTLQMPFEGKYEVTFGVQTRGGIVYGNPATFTIDSFCADFVNDDLWTYLTGGVNNEKVWIFDNGSYGYAAGEMTYADPSTTVEWNNWSANWDPGVGHTGDDAIWESTMTFGLKGGANVTVYNSSSKETASGTFMLNTSNHTITFTDCELLHTPSWSDRSTNWSRDLKLLELDENHMRIGVMRDNSEGPWWLIWNFVSKEYADNYVPEDKPDPEPALPDGWEDMVSEVVTSKIKWTMSADVPFDWANLDGSLMNNFTAGNYPDWATPVSGLDQLSMTLDSKTMTYEFAMPDGTTASGTYTLDEKGIYTFSGNVPSYHIGGGDIMFGADGNNQLRILSIESAAGNVMGMWVGARSAEKDEYQAYHFIPNAGGGTSQPEATSIAVDNSKLVWGHLETDKNNFRIELYNMYGETASASPIDPASIVFDYSMELTFTISGLTGDAATKEYNAGLMCTADGWWPAYSGTSDVKVKGDGTYTIKMKPEAAYNGVIVFVIDIFDMFSDIADLDAVNVTIDSLKIL
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__match_args__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex_

## fastq files

Also used for storing biological sequences, most often nucleotide sequences *together with their corresponding quality scores*

- each sequence letter and each quality score is encoded as a single character
- standard output of high-throughput sequencing, e.g. Illumina
- makeup: blocks of 4 line-separated fields
   1. begins with `@` and is followed by the identifier and optionally a description (e.g. length)
   2. raw sequence letters
   3. begins with a `+` and is optionally followed by the identifier and/or description again
   4. encodes the quality values for field 2
- quality values are, from lowest to highest:
   - `` !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ``
   - this includes `@` and `+`, the markers for fields 1 and 3, so sequences that span multiple lines are really difficult to parse
   - that's why these days reads are written on a single line only
   - also, these days the snippets are mostly shorter (~100 bp in short-read Illumina sequencing)
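
Each quality character stands for a number. In the widely used Sanger / Illumina 1.8+ encoding (an assumption of this sketch, older Illumina files use a different offset) the Phred quality score is the ASCII code of the character minus 33, so `!` is 0 and `I` is 40. A minimal example with a made-up record:

```python
record = "@read1 length=8\nGATTACAG\n+\n!''*5?II\n"

# the four fields of one record, one per line
header, seq, plus, qual = record.strip().split("\n")

# Phred+33: quality score = ASCII code of the character minus 33
scores = [ord(char) - 33 for char in qual]

print(header[1:])   # read1 length=8
print(seq)          # GATTACAG
print(scores)       # [0, 6, 6, 9, 20, 30, 40, 40]
```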
 
fastq files can be read in like fasta files using the SeqIO parser.

## genebank files

GenBank files were developed and are used by the NCBI for the databases Nucleotide and Gene

- extension .gb
- contain more information than fasta files
   - about the sequence
   - references
   - Features (in-depth info, e.g. coding region etc.)
      - the only mandatory part of FEATURES is source
      - 1..45 is a span from the first to the 45th position of the sequence
      - the translation in CDS is often not quite correct or even complete nonsense
   - at the end of the file, in ORIGIN, there is the sequence
   - see the powerpoint for more details

Can be read in with SeqIO, which results in an object with a lot of useful attributes

- id: single string
- description: single string
- seq: the sequence (a `Seq` object that can be used much like a string)
- features: list of several SeqFeature objects, one for each entry in the features table of the corresponding sequence

```
from Bio import SeqIO

genbank1 = SeqIO.parse("Filename.gb", "genbank")
for record in genbank1:
    print(record.id)
    print(record.description)
    print(record.seq)
    print(record.features)
```

## clustalW files

This format is used to store several sequences that have been aligned (pairwise, e.g. Needleman-Wunsch, or multiple). Many if not all multiple sequence alignment tools support the clustalW format for their outputs. To look at alignments, Jalview is the better choice.

- They often have the extension .aln but .clustal is also possible
- These files contain the result of a multiple sequence alignment and follow very specific rules
- Each file must start with the word "CLUSTAL" or "CLUSTALW"; other information in the first line is ignored
- This is followed by one or more empty lines
- then come one or more blocks of data
  - Each block consists of multiple lines with one line representing one sequence of your multiple sequence alignment
  - Each line consists of:
     - the sequence name
     - white space
     - up to 60 sequence symbols (sequence or `-` if a particular sequence has no match at this position)
     - optional - white space followed by a cumulative count of residues for the sequences
  - the sequences are followed by one line showing the degree of conservation for the columns of the alignment in this block
     - degrees of conservation are shown by:
     - `*`: all residues or nucleotides in that column are identical
     - `:`: conserved substitutions have been observed (score greater than .5 on the PAM 250 matrix)
     - `.`: semi-conserved substitutions have been observed (score less than or equal to .5 on the PAM 250 matrix)
     - ` `: no match
  - following that come one or more empty lines before the next block of data / the end of the file
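
The block structure can be taken apart with plain string methods. The sketch below assumes a well-formed file; the function name `read_clustal`, the sequence names and the tiny alignment are made up for this illustration. Conservation lines are recognised by their leading white space and skipped.

```python
import io

def read_clustal(handle):
    """Return {sequence name: aligned sequence} from CLUSTAL-formatted text."""
    alignment = {}
    first_line = handle.readline()
    if not first_line.startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL file")
    for line in handle:
        # skip empty lines and conservation lines (they start with white space)
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()           # name, sequence chunk, optional count
        name, chunk = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment

text = (
    "CLUSTAL W (1.83) multiple sequence alignment\n"
    "\n"
    "seqA      ATGGA-TGC 8\n"
    "seqB      ATGGAATGC 9\n"
    "          ***** ***\n"
    "\n"
    "seqA      TAATAG 14\n"
    "seqB      TA-TAG 14\n"
    "          ** ***\n"
)
print(read_clustal(io.StringIO(text)))
# {'seqA': 'ATGGA-TGCTAATAG', 'seqB': 'ATGGAATGCTA-TAG'}
```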

Reading them in (also later this week)

- SeqIO
- AlignIO submodule of Biopython (`alignment = AlignIO.read("Filename.clustal","clustal")`)

## pdb files

pdb files are used to store 3D structure data of proteins in form of atom coordinates

- two file formats
  - pdb (extension .pdb) or mmCIF
- very strict rules for what the files look like
  - Each line contains a maximum of 80 columns, that means 80 characters
  - Each line represents one record and has to be self-identifying
  - The first six columns of a line contain the record name, which must be one of the predefined record names of the pdb format
  - The record name is then followed by a space and the remaining information / data that belongs to this record
  - for each type of record every piece of information has a defined position. What counts is the column position in the line, not whether there is a space in between
  
pdb files contain multiple sections:
1. Title section
   - contains up to 16 different record types
   - HEADER, which contains a classification of the molecule(s), the deposition date and the PDB identifier
   - TITLE, contains the title for the experiment or analysis
   - COMPND, describes the macromolecular contents of the file
   - SOURCE, specifies the biological and/or chemical source of each biological molecule in the file
   - AUTHOR, contains the names of the people responsible for the contents of the file
   - JRNL, contains the primary literature citation that describes the experiment
   - REMARK, presents experimental details, annotations, comments and information not included in other records
      - The REMARK section is further divided into many subsections, each marked by a specific number
      - **REMARK 465** lists missing residues (amino acids or bases of our macromolecules) that are present in the sequence given in the same file but completely absent from the coordinate section. The main part of this section is a table containing:
         - the residue name (RES), e.g. SER
         - the chain identifier (C), e.g. A, which denotes one of the macromolecules present in the file (there can be several protein chains, and each is given its own chain identifier)
         - the sequence number (SSSEQI), e.g. 11, the position of the residue in the sequence
      - **REMARK 470** contains information about the non-hydrogen atoms of standard residues which are missing from the coordinates. The main part is again a table containing:
         - the residue name (RES)
         - the chain identifier (C)
         - the sequence number (SEQI)
         - the names of the missing atoms (ATOMS). Atoms are named by their element and their position in the residue: the alpha carbon is CA, the sulfur at the delta position is SD etc.
2. Primary structure section
   - SEQRES record, is reserved for the sequence(s) of the macromolecule(s) present in the file. This is the "theoretical" record against which the structure data is compared
   - Each record contains:
      - the serial number
      - the chain identifier
      - the number of residues in the chain
      - the residue names (in three-letter code for amino acids) of up to 13 consecutive residues
3. Heterogen section
   - complete description of non-standard residues in the file (ligands, co-factors like metal ions, inhibitors etc.)
   - information about modified amino acids, e.g. glycosylated amino acids
4. Secondary structure section
   - helices and sheets found in protein and polypeptide structures
5. Connectivity Annotation section
   - existence and location of disulfide bonds and other linkages
6. Miscellaneous Features section
    - properties in the molecule such as environments surrounding a non-standard residue or the assembly of an active site
7. Crystallographic and Coordinate Transformation section
    - geometry of the crystallographic experiment
8. *Coordinate section*
   - exact three-dimensional coordinates of all atoms present in the structure
   - built in a tabular way to contain all necessary information for each atom
   - remember that the fields are defined by text columns, i.e. by their position in the line and not by spaces etc.!!
   - 6 record types in this section: MODEL, ATOM, ANISOU, TER, HETATM and ENDMDL
   - ATOM: record name, atom number, atom name, residue name, chain identifier, residue number, x-coordinate, y-coordinate, z-coordinate, occupancy, temperature factor, element symbol
   - HETATM follows the same rules as ATOM but contains anything that is not a standard amino acid/nucleotide
   - TER record indicates the end of one macromolecule: atom number, residue name, chain identifier, residue number
9. Connectivity section
   - information about different kinds of bonds (e.g. disulfide bonds, bonds inside ligands)
   - only notes bonds not specified in the standard residue connectivity table
10. Bookkeeping section
    - final information about the file
    - MASTER record, which lists the number of lines in the coordinate entry or file for selected record types
    - END record which marks the end of the PDB file
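
Because the fields of the coordinate section are defined by their character position, a single ATOM line can be taken apart with plain slicing. The following is only a minimal sketch: the function name `parse_atom_line` and the example line are made up for illustration, and the slices follow the standard column layout of the ATOM record (Python slices start at 0, so column 31 is index 30).

```python
def parse_atom_line(line):
    """Split one ATOM/HETATM line into its fields by fixed column positions."""
    return {
        "record": line[0:6].strip(),        # columns 1-6
        "serial": int(line[6:11]),          # columns 7-11
        "atom_name": line[12:16].strip(),   # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain_id": line[21],               # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "temp_factor": float(line[60:66]),  # columns 61-66
        "element": line[76:78].strip(),     # columns 77-78
    }

# made-up example line, assembled piece by piece so the column widths are visible
example = "".join([
    "ATOM  ",    # record name
    "    1",     # atom number
    " ",
    " N  ",      # atom name
    " ",
    "MET",       # residue name
    " ",
    "A",         # chain identifier
    "   1",      # residue number
    "    ",
    "  27.340",  # x
    "  24.430",  # y
    "   2.614",  # z
    "  1.00",    # occupancy
    "  9.67",    # temperature factor
    " " * 10,
    " N",        # element symbol
])

atom = parse_atom_line(example)
print(atom["atom_name"], atom["res_name"], atom["chain_id"], atom["x"])
```

Using `line.split()` instead would break as soon as two neighbouring fields touch, which can happen for example with large or negative coordinates.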
   
Importing pdb files

**Biopandas**

```
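from biopandas.pdb import PandasPdb

# read the PDB file into a PandasPdb object
ppdb = PandasPdb()
Struc = ppdb.read_pdb("filename.pdb")
print(Struc)
# the records are stored as pandas DataFrames, e.g. all ATOM lines
print(Struc.df["ATOM"])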
```

**PDBParser**

```
from Bio.PDB.PDBParser import PDBParser

# get_structure takes an identifier of your choice and the file name
parser = PDBParser()
structure = parser.get_structure('1kim', '1kim.pdb')
```
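
Both libraries have to be installed first. As a rough sketch without any extra library, the SEQRES records of the primary structure section can also be read with plain string methods. The two SEQRES lines, the dictionary `THREE_TO_ONE` and the function `seqres_to_sequences` are made up for illustration, and the sketch assumes that the chain identifier is not blank, so that `split` gives the same fields in every line.

```python
# three-letter to one-letter code for the 20 standard amino acids
THREE_TO_ONE = {
    "ALA": "A", "CYS": "C", "ASP": "D", "GLU": "E", "PHE": "F",
    "GLY": "G", "HIS": "H", "ILE": "I", "LYS": "K", "LEU": "L",
    "MET": "M", "ASN": "N", "PRO": "P", "GLN": "Q", "ARG": "R",
    "SER": "S", "THR": "T", "VAL": "V", "TRP": "W", "TYR": "Y",
}

def seqres_to_sequences(lines):
    """Collect the one-letter sequence of every chain from SEQRES lines."""
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        fields = line.split()
        chain_id = fields[2]      # fields: SEQRES, serial number, chain, number of residues
        residues = fields[4:]     # up to 13 residue names per line
        letters = "".join(THREE_TO_ONE.get(res, "X") for res in residues)
        chains[chain_id] = chains.get(chain_id, "") + letters
    return chains

# made-up example records for a chain A with 16 residues
seqres_lines = [
    "SEQRES   1 A   16  MET ASN GLY THR GLU GLY PRO ASN PHE TYR VAL PRO PHE",
    "SEQRES   2 A   16  SER ASN VAL",
]
sequences = seqres_to_sequences(seqres_lines)
print(sequences)
```

Residues that are not in the dictionary (e.g. modified amino acids) end up as `X` in this sketch.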




# Exercises

## Exercise 1 - Simple Strings

- Write a Python script that defines five string variables containing these Strings:
  - “AAAAAAAAAAAAAAAAA”
  - “ACTGACTGACTGACTGACTG”
  - “HAIMGVVFTWIMALACAAPPLVGWSRY”
  - “SSSIYNPVIYIMLNKQFRNCMLTTLCCGKNPLG”
  - “PFSNVTGVVRSPFEQPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQH”
- Print out the length of all five variables.
- Print out the letters at the indices 0,5,10,15,20 and 25 of the five strings if possible
- Create the substrings S-5-20, S-20-30, S-2-10 of all five strings if possible

In [30]:
#create the strings
S1 = "AAAAAAAAAAAAAAAAA"
S2 = "ACTGACTGACTGACTGACTG"
S3 = "HAIMGVVFTWIMALACAAPPLVGWSRY"
S4 = "SSSIYNPVIYIMLNKQFRNCMLTTLCCGKNPLG"
S5 = "PFSNVTGVVRSPFEQPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQH"

#print the lengths
print(f"String 1 has length {len(S1)}")
print(f"String 2 has length {len(S2)}")
print(f"String 3 has length {len(S3)}")
print(f"String 4 has length {len(S4)}")
print(f"String 5 has length {len(S5)}")

String 1 has length 17
String 2 has length 20
String 3 has length 27
String 4 has length 33
String 5 has length 54


In [31]:
# I make two lists of indices, since String 1 and 2 won't have indices 20 and 25
idc1 = [0,5,10,15]
idc2 = [0,5,10,15,20,25]

#Loop over printing the indices using idc1 for S1 and S2 and idc2 for S3-5
for idx in idc1:
    print(S1[idx])
print()

for idx in idc1:
    print(S2[idx])
print()

for idx in idc2:
    print(S3[idx])
print()

for idx in idc2:
    print(S4[idx])
print()

for idx in idc2:
    print(S5[idx])
print()

A
A
A
A

A
C
T
G

H
V
I
C
L
R

S
N
I
Q
M
C

P
T
S
P
A
F
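
Instead of keeping two index lists by hand, the "if possible" part can also be checked against the length of each string. This is just a sketch of one way to do it, using three of the strings and a dictionary `letters` that I introduce here to collect the results.

```python
strings = {
    "S1": "AAAAAAAAAAAAAAAAA",
    "S2": "ACTGACTGACTGACTGACTG",
    "S3": "HAIMGVVFTWIMALACAAPPLVGWSRY",
}
indices = [0, 5, 10, 15, 20, 25]

letters = {}
for name, s in strings.items():
    # keep only the indices that exist in this string
    letters[name] = [s[idx] for idx in indices if idx < len(s)]
    print(name, letters[name])
```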



In [32]:
#Create the substrings S-5-20, S-20-30, S-2-10 of all five strings if possible (20:30 is skipped for S1 and S2: they are too short, so the slice would just be empty)

S1sub5_20 = S1[5:20]
S1sub2_10 = S1[2:10]

S2sub5_20 = S2[5:20]
S2sub2_10 = S2[2:10]

S3sub5_20 = S3[5:20]
S3sub20_30 = S3[20:30]
S3sub2_10 = S3[2:10]

S4sub5_20 = S4[5:20]
S4sub20_30 = S4[20:30]
S4sub2_10 = S4[2:10]

S5sub5_20 = S5[5:20]
S5sub20_30 = S5[20:30]
S5sub2_10 = S5[2:10]

#print

print(S1sub5_20)
print(S1sub2_10)
print()
print(S2sub5_20)
print(S2sub2_10)
print()
print(S3sub5_20)
print(S3sub20_30)
print(S3sub2_10)
print()
print(S4sub5_20)
print(S4sub20_30)
print(S4sub2_10)
print()
print(S5sub5_20)
print(S5sub20_30)
print(S5sub2_10)

AAAAAAAAAAAA
AAAAAAAA

CTGACTGACTGACTG
TGACTGAC

VVFTWIMALACAAPP
LVGWSRY
IMGVVFTW

NPVIYIMLNKQFRNC
MLTTLCCGKN
SIYNPVIY

TGVVRSPFEQPQYYL
AEPWQFSMLA
SNVTGVVR
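
The output for S1 is shorter than 15 letters because a slice that runs past the end of a string is cut off instead of raising an error. A small sketch of the difference between slicing and indexing past the end:

```python
S1 = "AAAAAAAAAAAAAAAAA"   # length 17

# a slice that runs past the end is simply cut off ...
partly = S1[5:20]
# ... and a slice that starts past the end is empty
empty = S1[20:30]
print(repr(partly), repr(empty))

# indexing a single position past the end raises an IndexError
try:
    S1[20]
    raised = False
except IndexError:
    raised = True
print(raised)
```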


## Exercise 2 - String Methods

Use the strings from Exercise 01 to fulfill the following tasks:

- At which index are the first occurrences of the letters A, C, G, N, T and U
- Split the strings at each appearance of the patterns “TG”, “TT” and “AA”
- Use if statements to test whether the strings contain the patterns “FLL” and “MLT”

In [33]:
#create the strings
S1 = "AAAAAAAAAAAAAAAAA"
S2 = "ACTGACTGACTGACTGACTG"
S3 = "HAIMGVVFTWIMALACAAPPLVGWSRY"
S4 = "SSSIYNPVIYIMLNKQFRNCMLTTLCCGKNPLG"
S5 = "PFSNVTGVVRSPFEQPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQH"

# I put them in a dictionary to make looping over them easier:
stringdict = {}
stringdict["S1"] = S1
stringdict["S2"] = S2
stringdict["S3"] = S3
stringdict["S4"] = S4
stringdict["S5"] = S5

In [34]:
# Where do the letters of list1 first occur in the strings?
# This is the "long" loop
# find returns -1 if the letter does not occur in the string

list1 = ["A","C","G","N","T","U"]

for elem in list1:
    print(f"First occurrence of {elem} in S1 is at index {S1.find(elem)}")
    print(f"First occurrence of {elem} in S2 is at index {S2.find(elem)}")
    print(f"First occurrence of {elem} in S3 is at index {S3.find(elem)}")
    print(f"First occurrence of {elem} in S4 is at index {S4.find(elem)}")
    print(f"First occurrence of {elem} in S5 is at index {S5.find(elem)}")
    print()

First occurrence of A in S1 is at index 0
First occurrence of A in S2 is at index 0
First occurrence of A in S3 is at index 1
First occurrence of A in S4 is at index -1
First occurrence of A in S5 is at index 20

First occurrence of C in S1 is at index -1
First occurrence of C in S2 is at index 1
First occurrence of C in S3 is at index 15
First occurrence of C in S4 is at index 19
First occurrence of C in S5 is at index -1

First occurrence of G in S1 is at index -1
First occurrence of G in S2 is at index 3
First occurrence of G in S3 is at index 4
First occurrence of G in S4 is at index 27
First occurrence of G in S5 is at index 6

First occurrence of N in S1 is at index -1
First occurrence of N in S2 is at index -1
First occurrence of N in S3 is at index -1
First occurrence of N in S4 is at index 5
First occurrence of N in S5 is at index 3

First occurrence of T in S1 is at index -1
First occurrence of T in S2 is at index 2
First occurrence of T in S3 is at index 8
First occurrence of T in S4 is at index 22
First occurrence of T in S5 is at index 5

First occurrence of U in S1 is at index -1
First occurrence of U in S2 is at index -1
First occurrence of U in S3 is at index -1
First occurrence of U in S4 is at index -1
First occurrence of U in S5 is at index -1

In [35]:
# Where do the letters of list1 first occur in the strings?
#This is the short dictionary loop

list1 = ["A","C","G","N","T","U"]

for elem in list1:
    for key, value in stringdict.items():
        print(f"First occurrence of {elem} in {key} is at index {value.find(elem)}")
    print()

First occurrence of A in S1 is at index 0
First occurrence of A in S2 is at index 0
First occurrence of A in S3 is at index 1
First occurrence of A in S4 is at index -1
First occurrence of A in S5 is at index 20

First occurrence of C in S1 is at index -1
First occurrence of C in S2 is at index 1
First occurrence of C in S3 is at index 15
First occurrence of C in S4 is at index 19
First occurrence of C in S5 is at index -1

First occurrence of G in S1 is at index -1
First occurrence of G in S2 is at index 3
First occurrence of G in S3 is at index 4
First occurrence of G in S4 is at index 27
First occurrence of G in S5 is at index 6

First occurrence of N in S1 is at index -1
First occurrence of N in S2 is at index -1
First occurrence of N in S3 is at index -1
First occurrence of N in S4 is at index 5
First occurrence of N in S5 is at index 3

First occurrence of T in S1 is at index -1
First occurrence of T in S2 is at index 2
First occurrence of T in S3 is at index 8
First occurrence of T in S4 is at index 22
First occurrence of T in S5 is at index 5

First occurrence of U in S1 is at index -1
First occurrence of U in S2 is at index -1
First occurrence of U in S3 is at index -1
First occurrence of U in S4 is at index -1
First occurrence of U in S5 is at index -1
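
The -1 in the output is how `find` reports "not found". It is worth checking for it explicitly, because -1 is also a valid index in Python. A short sketch of the pitfall, and of `str.index` as an alternative that raises an error instead:

```python
S4 = "SSSIYNPVIYIMLNKQFRNCMLTTLCCGKNPLG"

pos = S4.find("A")      # "A" does not occur in S4, so pos is -1
# careful: -1 is also a valid index (the last letter)
last_letter = S4[pos]
print(pos, last_letter)

if pos == -1:
    message = "A does not occur in S4"
else:
    message = f"A first occurs at index {pos}"
print(message)

# str.index does the same search but raises a ValueError instead of returning -1
try:
    S4.index("A")
    found = True
except ValueError:
    found = False
print(found)
```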

In [36]:
#“TG”, “TT” and “AA”
list2 = ["TG","TT","AA"]

for elem in list2:
    print(elem)
    for key, value in stringdict.items():
        print(f"{key}: {value.split(elem)}")
    print()

TG
S1: ['AAAAAAAAAAAAAAAAA']
S2: ['AC', 'AC', 'AC', 'AC', 'AC', '']
S3: ['HAIMGVVFTWIMALACAAPPLVGWSRY']
S4: ['SSSIYNPVIYIMLNKQFRNCMLTTLCCGKNPLG']
S5: ['PFSNV', 'VVRSPFEQPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQH']

TT
S1: ['AAAAAAAAAAAAAAAAA']
S2: ['ACTGACTGACTGACTGACTG']
S3: ['HAIMGVVFTWIMALACAAPPLVGWSRY']
S4: ['SSSIYNPVIYIMLNKQFRNCML', 'LCCGKNPLG']
S5: ['PFSNVTGVVRSPFEQPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQH']

AA
S1: ['', '', '', '', '', '', '', '', 'A']
S2: ['ACTGACTGACTGACTGACTG']
S3: ['HAIMGVVFTWIMALAC', 'PPLVGWSRY']
S4: ['SSSIYNPVIYIMLNKQFRNCMLTTLCCGKNPLG']
S5: ['PFSNVTGVVRSPFEQPQYYLAEPWQFSML', 'YMFLLIVLGFPINFLTLYVTVQH']
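
The many empty strings for S1 come from the way `split` searches: the pattern is matched from left to right, matches do not overlap, and two matches right next to each other leave an empty string between them. A small sketch with a shorter string, which also shows how overlapping occurrences could be counted with `find`:

```python
S = "AAAAA"

parts = S.split("AA")            # two matches next to each other, one "A" left over
non_overlapping = S.count("AA")  # count uses the same non-overlapping search
print(parts, non_overlapping)

# counting overlapping occurrences needs a small loop with find
overlapping = 0
start = S.find("AA")
while start != -1:
    overlapping += 1
    start = S.find("AA", start + 1)   # continue one position further to the right
print(overlapping)
```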



In [37]:
#Use if statements to test whether the strings contain the patterns “FLL” and “MLT”
for key, value in stringdict.items():
    if "FLL" in value: print(f"FLL is present in {key}")
    else:  print(f"FLL is not present in {key}")
    if "MLT" in value:  print(f"MLT is present in {key}")
    else: print(f"MLT is not present in {key}")

FLL is not present in S1
MLT is not present in S1
FLL is not present in S2
MLT is not present in S2
FLL is not present in S3
MLT is not present in S3
FLL is not present in S4
MLT is present in S4
FLL is present in S5
MLT is not present in S5


## Exercise 3 - Working with Strings

- Take the “lorem ipsum” string given below and store it in a Python variable
- Complete the following tasks:
  - How long is this text?
  - How many words are in this text? (split and len)
  - How often does each letter of the alphabet occur in this text? (re, len and a loop)
  - Are the words “minim”, “anim”, “caesar”, “brutus” and “duis” in the string?

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

In [38]:
import re
import string
S = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

In [39]:
#Finding the characters and words
print(f"The text has {len(S)} characters.")
print(f"The text has {len(S.split())} words.")

The text has 445 characters.
The text has 69 words.


In [40]:
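#making a list of all letters of the alphabet
# (not called "all", since that would overwrite Python's built-in function all)
all_letters = list(string.ascii_letters)
print(all_letters)

#the long version
for letr in all_letters:
    print(f"{letr} occurs {len(re.findall(letr,S))} times.")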

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
a occurs 29 times.
b occurs 3 times.
c occurs 16 times.
d occurs 18 times.
e occurs 37 times.
f occurs 3 times.
g occurs 3 times.
h occurs 1 times.
i occurs 42 times.
j occurs 0 times.
k occurs 0 times.
l occurs 21 times.
m occurs 17 times.
n occurs 24 times.
o occurs 29 times.
p occurs 11 times.
q occurs 5 times.
r occurs 22 times.
s occurs 18 times.
t occurs 32 times.
u occurs 28 times.
v occurs 3 times.
w occurs 0 times.
x occurs 3 times.
y occurs 0 times.
z occurs 0 times.
A occurs 0 times.
B occurs 0 times.
C occurs 0 times.
D occurs 1 times.
E occurs 1 times.
F occurs 0 times.
G occurs 0 times.
H occurs 0 times.
I occurs 0 times.
J occurs 0 times.
K occurs 0 times.
L occurs 1 times.
M occurs 0 times.
N occurs 0 times.
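O occurs 0 times.
P occurs 0 times.
Q occurs 0 times.
R occurs 0 times.
S occurs 0 times.
T occurs 0 times.
U occurs 1 times.
V occurs 0 times.
W occurs 0 times.
X occurs 0 times.
Y occurs 0 times.
Z occurs 0 times.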

In [41]:
#making a list of all letters of the alphabet
lower = list(string.ascii_lowercase)
# print(lower)
# Printing a list of uppercase alphabets.
upper = list(string.ascii_uppercase)

#the (slightly) more fancy version
for LET,let in zip(upper,lower):
    print(f"{LET} occurs {len(re.findall(LET,S))} times and {let} occurs {len(re.findall(let,S))} times.")
    print(f"{LET} or {let} occur a total of {len(re.findall(LET,S)) + len(re.findall(let,S))} times.")
    print()

A occurs 0 times and a occurs 29 times.
A or a occur a total of 29 times.

B occurs 0 times and b occurs 3 times.
B or b occur a total of 3 times.

C occurs 0 times and c occurs 16 times.
C or c occur a total of 16 times.

D occurs 1 times and d occurs 18 times.
D or d occur a total of 19 times.

E occurs 1 times and e occurs 37 times.
E or e occur a total of 38 times.

F occurs 0 times and f occurs 3 times.
F or f occur a total of 3 times.

G occurs 0 times and g occurs 3 times.
G or g occur a total of 3 times.

H occurs 0 times and h occurs 1 times.
H or h occur a total of 1 times.

I occurs 0 times and i occurs 42 times.
I or i occur a total of 42 times.

J occurs 0 times and j occurs 0 times.
J or j occur a total of 0 times.

K occurs 0 times and k occurs 0 times.
K or k occur a total of 0 times.

L occurs 1 times and l occurs 21 times.
L or l occur a total of 22 times.

M occurs 0 times and m occurs 17 times.
M or m occur a total of 17 times.

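N occurs 0 times and n occurs 24 times.
N or n occur a total of 24 times.

O occurs 0 times and o occurs 29 times.
O or o occur a total of 29 times.

P occurs 0 times and p occurs 11 times.
P or p occur a total of 11 times.

Q occurs 0 times and q occurs 5 times.
Q or q occur a total of 5 times.

R occurs 0 times and r occurs 22 times.
R or r occur a total of 22 times.

S occurs 0 times and s occurs 18 times.
S or s occur a total of 18 times.

T occurs 0 times and t occurs 32 times.
T or t occur a total of 32 times.

U occurs 1 times and u occurs 28 times.
U or u occur a total of 29 times.

V occurs 0 times and v occurs 3 times.
V or v occur a total of 3 times.

W occurs 0 times and w occurs 0 times.
W or w occur a total of 0 times.

X occurs 0 times and x occurs 3 times.
X or x occur a total of 3 times.

Y occurs 0 times and y occurs 0 times.
Y or y occur a total of 0 times.

Z occurs 0 times and z occurs 0 times.
Z or z occur a total of 0 times.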
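
The exercise asks for `re`, but for single letters the same counts can also be obtained without it. The sketch below is an alternative, not the intended solution: it uses `str.count` and `collections.Counter` on a short stand-in text instead of the full lorem ipsum.

```python
from collections import Counter

text = "Duis aute irure dolor"   # short stand-in for the full text

# str.count counts one letter at a time
d_total = text.count("D") + text.count("d")
print(d_total)

# Counter counts all letters in one pass; lower() merges upper and lower case
counts = Counter(letter for letter in text.lower() if letter.isalpha())
for letter, n in sorted(counts.items()):
    print(f"{letter} occurs {n} times.")
```

Letters that never occur are simply missing from `counts`; asking for them, e.g. `counts["z"]`, gives 0.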

In [42]:
# Are the words “minim”, “anim”, “caesar”, “brutus” and “duis” in the string?
list_words = ["minim","anim","caesar", "brutus","duis"]

for word in list_words:
    if word in S:
        print(f"{word} is part of the string.")
    else:
        print(f"{word} is not part of the string.")

minim is part of the string.
anim is part of the string.
caesar is not part of the string.
brutus is not part of the string.
duis is not part of the string.
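
"duis" is reported as missing only because the text contains "Duis" with a capital D, and `in` also matches parts of words. A small sketch, on a shortened stand-in text, of how both could be handled:

```python
import string

text = "Duis aute irure dolor. Ut enim ad minim veniam."

# in is case sensitive: the text only contains "Duis"
print("duis" in text, "duis" in text.lower())

# in also matches parts of words: "mini" is found inside "minim"
print("mini" in text.lower())

# for whole words only: split, strip the punctuation, then compare
words = [w.strip(string.punctuation) for w in text.lower().split()]
print("mini" in words, "minim" in words)
```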


## Exercise 4 - A bit more about Strings

Take the “lorem ipsum” string from Exercise 03 and complete the following tasks:

- Replace the strings et, in and eu with at, on and au
- Create an all lower case and an all upper case variant of it
- Split the text into its words and concatenate them together again (split and a loop)

In [43]:
import re
Lorem = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

In [44]:
#Replace syllables
Lorem_repl = Lorem.replace("et","at")
Lorem_repl = Lorem_repl.replace("in","on")
Lorem_repl = Lorem_repl.replace("eu","au")

print(Lorem_repl)

Lorem ipsum dolor sit amat, consectatur adipiscong elit, sed do eiusmod tempor oncididunt ut labore at dolore magna aliqua. Ut enim ad monim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor on reprehenderit on voluptate velit esse cillum dolore au fugiat nulla pariatur. Exceptaur sont occaecat cupidatat non proident, sunt on culpa qui officia deserunt mollit anim id est laborum.


In [45]:
#Write text in all upper and all lower case
Lorem_upper = Lorem.upper()
Lorem_lower = Lorem.lower()

print(Lorem_upper)
print()
print(Lorem_lower)

LOREM IPSUM DOLOR SIT AMET, CONSECTETUR ADIPISCING ELIT, SED DO EIUSMOD TEMPOR INCIDIDUNT UT LABORE ET DOLORE MAGNA ALIQUA. UT ENIM AD MINIM VENIAM, QUIS NOSTRUD EXERCITATION ULLAMCO LABORIS NISI UT ALIQUIP EX EA COMMODO CONSEQUAT. DUIS AUTE IRURE DOLOR IN REPREHENDERIT IN VOLUPTATE VELIT ESSE CILLUM DOLORE EU FUGIAT NULLA PARIATUR. EXCEPTEUR SINT OCCAECAT CUPIDATAT NON PROIDENT, SUNT IN CULPA QUI OFFICIA DESERUNT MOLLIT ANIM ID EST LABORUM.

lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.


In [46]:
# split the words
Lorem_split = [elem for elem in re.split(" |,|\\.", Lorem) if len(elem)>0]
print(Lorem_split)
print()

#concatenate the words
Lorem_conc = ""
for elem in Lorem_split: 
    Lorem_conc+=elem

print(Lorem_conc)

['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur', 'adipiscing', 'elit', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', 'Ut', 'enim', 'ad', 'minim', 'veniam', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut', 'aliquip', 'ex', 'ea', 'commodo', 'consequat', 'Duis', 'aute', 'irure', 'dolor', 'in', 'reprehenderit', 'in', 'voluptate', 'velit', 'esse', 'cillum', 'dolore', 'eu', 'fugiat', 'nulla', 'pariatur', 'Excepteur', 'sint', 'occaecat', 'cupidatat', 'non', 'proident', 'sunt', 'in', 'culpa', 'qui', 'officia', 'deserunt', 'mollit', 'anim', 'id', 'est', 'laborum']

LoremipsumdolorsitametconsecteturadipiscingelitseddoeiusmodtemporincididuntutlaboreetdoloremagnaaliquaUtenimadminimveniamquisnostrudexercitationullamcolaborisnisiutaliquipexeacommodoconsequatDuisauteiruredolorinreprehenderitinvoluptatevelitessecillumdoloreeufugiatnullapariaturExcepteursintoccaecatcupidatatnonproidentsuntinculpaquiofficiadeseruntmollitanimidestlaborum
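
The exercise asks for a loop, but for completeness: `str.join` does the same concatenation in one step and can also put a separator back between the words. A minimal sketch with the first five words:

```python
words = ["Lorem", "ipsum", "dolor", "sit", "amet"]

glued = "".join(words)     # same result as the += loop
spaced = " ".join(words)   # puts a space between the words again
print(glued)
print(spaced)
```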

## Exercise 5 - Writing Files

Write a program that creates two different files.

- The first file should be opened normally and two lines should be written in it.
- Afterwards close the file, open it again and add another three lines of text to the file (use the mode "a" for append)
- The second file should be generated by using "with" and a loop to write a countdown from 10 to 0
- Each number should be in a separate line

In [47]:
# create the non-existing File Exercise5_File1.txt and add some lines, save by closing
file1 = open("Exercise5_File1.txt","w")
file1.write("Today is Wednesday, 10th of April 2024\n")
file1.write("This is File 1 of Exercise 5 in Module 3.1 of my Bioinformatics Course\n")
file1.close()

In [48]:
# add some more lines to the first file
file1b = open("Exercise5_File1.txt","a")
file1b.write("I have to add another couple of lines to this thing.\n")
file1b.write("Today the sun is shining. It is colder than it was on the weekend but still nice.\n")
file1b.write("I just ate a piece of bread with Momo's fabulous apricot jam :-)\n")
file1b.close()

In [49]:
# create a string with the countdown
exc5_2 = ""
for i in range(10,-1,-1):
    exc5_2 += str(i) + "\n"

print(repr(exc5_2))

# write the countdown into a new file using with
with open("Exercise5_File2.txt", "w") as with_file:
    with_file.write(exc5_2)

'10\n9\n8\n7\n6\n5\n4\n3\n2\n1\n0\n'


In [50]:
# same thing a bit shorter: write each number directly inside the loop
with open("Exercise5_File2.txt", "w") as with_file:
    for i in range(10,-1,-1):
        with_file.write(str(i)+"\n")
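
To see what the modes "w" and "a" actually do, the file can be read back after each step. The following sketch uses a temporary folder and a made-up file name `demo.txt`, so it does not touch the exercise files.

```python
import os
import tempfile

# a temporary folder keeps this demo from overwriting the exercise files
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")

    with open(path, "w") as demo_file:    # "w" creates (or overwrites) the file
        demo_file.write("line 1\n")
    with open(path, "a") as demo_file:    # "a" appends at the end
        demo_file.write("line 2\n")
    with open(path, "r") as demo_file:
        after_append = demo_file.read()

    with open(path, "w") as demo_file:    # "w" again: the old content is gone
        for i in range(3, -1, -1):
            demo_file.write(str(i) + "\n")
    with open(path, "r") as demo_file:
        after_overwrite = demo_file.read()

print(repr(after_append))
print(repr(after_overwrite))
```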

## Exercise 6 - Reading a file

- Read in your second file from exercise 5, the one with the countdown, by using the last method we discussed, the one using a for loop inside the with statement
- Calculate the sum of the numbers and print out the result

In [51]:
with open("Exercise5_File2.txt","r") as read_file:
    result = 0
    for line in read_file:
        result += float(line)

print(result)

55.0


In [52]:
# I make a more complex example, where there is a string at the end 
# so I can't just sum up the numbers

with open("Exercise5_File3.txt","w") as file:
    for i in range(0,1500,200):
        file.write(str(i)+" x\n")

In [53]:
# since I have numbers ranging from 1 digit to 4 digits 
# I can't just take the slice from the start
# Since the end is always the same though (space, x and \n, i.e. 3 characters)
# I can access my number irrespective of length from the back
with open("Exercise5_File3.txt","r") as read_file:
    result2 = 0
    for line in read_file:
        print(line[:-3])
        result2 += int(line[:-3])

print()
print(result2)

0
200
400
600
800
1000
1200
1400

5600


In [54]:
# now what would I do if there were variable text after the number?
# since the number is always the first element, followed by a space, I can simply split the lines
# and use the first element
with open("Exercise5_File3.txt","r") as read_file:
    result2 = 0
    for line in read_file:
        print(line.split()[0])
        result2 += int(line.split()[0])

print()
print(result2)

0
200
400
600
800
1000
1200
1400

5600
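
Real files often contain empty lines or lines that do not start with a number, and `int` would stop the program there. A possible way to skip such lines is sketched below; `io.StringIO` only stands in for an opened file, and the content is made up.

```python
import io

# io.StringIO stands in for an opened file here
fake_file = io.StringIO("0 x\n200 x\n\nsome comment\n400 apples and more text\n")

total = 0
skipped = []
for line in fake_file:
    fields = line.split()
    try:
        total += int(fields[0])
    except (IndexError, ValueError):
        # IndexError: empty line, ValueError: first field is not a number
        skipped.append(line)

print(total)
print(skipped)
```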


## Exercise 7 - Small sequence comparison

- You have four fasta files called XP_021046627, XP_031236805, A0A8I4A5I4 and ELR51227
- Write a program that reads in all four of them
- Afterwards compare the first three sequences letter by letter with each other (use a simple loop) and calculate how many identical letters these sequences have
- Compare the fourth sequence with the other three in a similar manner. Be careful: this sequence is 3 amino acids longer

In [55]:
#XP_021046627, XP_031236805, A0A8I4A5I4 and ELR51227
#reading in the files
import fastaparser

with open("XP_021046627.fasta","r") as fasta_file:
    fasta1 = fastaparser.Reader(fasta_file)
    for seq in fasta1:
        print(seq.sequence_as_string())
        seq1 = seq.sequence_as_string()

MNGTEGPNFYVPFSNVTGVVRSPFEQPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCDLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVVFTWIMALACAAPPLVGWSRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMTVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVVIMVIFFLICWLPYAGVAFYIFTHQGSNFGPIFMTLPAFFAKSSSIYNPVIYIMLNKQFRNCMLTTLCCGKNPLGDDDASATASKTETSQVAPA


In [56]:
with open("XP_031236805.fasta","r") as fasta_file:
    fasta2 = fastaparser.Reader(fasta_file)
    for seq in fasta2:
        print(seq.sequence_as_string())
        seq2 = seq.sequence_as_string()

MNGTEGPNFYVPFSNITGVVRSPFEQPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIFFLICWLPYASVAMYIFTHQGSNFGPIFMTLPAFFAKSSSIYNPVIYIMLNKQFRNCMLTTLCCGKNPLGDDDASATASKTETSQVAPA


In [57]:
with open("A0A8I4A5I4.fasta","r") as fasta_file:
    fasta3 = fastaparser.Reader(fasta_file)
    for seq in fasta3:
        print(seq.sequence_as_string())
        seq3 = seq.sequence_as_string()

MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFIFGPTGCNAEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA


In [58]:
with open("ELR51227.fasta","r") as fasta_file:
    fasta4 = fastaparser.Reader(fasta_file)
    for seq in fasta4:
        print(seq.sequence_as_string())
        seq4 = seq.sequence_as_string()

AAAMNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMLLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA
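
The four cells above differ only in the file name, so the reading step could be moved into a function. The sketch below goes one step further and reads FASTA without `fastaparser`, using only string methods; `read_fasta` is a name I made up, the two records are invented, and the sketch assumes the usual layout (a header line starting with `>` followed by one or more sequence lines).

```python
import io

def read_fasta(handle):
    """Return a dict {header: sequence} for all records in an opened FASTA file."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                  # skip empty lines
        if line.startswith(">"):
            header = line[1:]         # new record, drop the ">"
            records[header] = ""
        elif header is not None:
            records[header] += line   # sequence may span several lines
    return records

# made-up two-record example; io.StringIO stands in for open("some_file.fasta")
fake_fasta = io.StringIO(">seq_a demo\nMNGTEG\nPNFYVP\n>seq_b demo\nAAAMNG\n")
records = read_fasta(fake_fasta)
print(records)
```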


In [59]:
# comparing the first three sequences
# They all have the same length
print(len(seq1), len(seq2), len(seq3))

348 348 348


In [60]:
#counting how many aa are the same in the same position
count_all = 0
count_12 = 0
count_13 = 0
count_23 = 0
for i in range(len(seq1)):
    if seq1[i] == seq2[i]:
        count_12+=1
    if seq1[i] == seq3[i]:
        count_13+=1
    if seq2[i] == seq3[i]:
        count_23+=1
    if seq1[i] == seq2[i] == seq3[i]:
        count_all+=1

print(f"Sequence 1, 2 and 3 have a total of {count_all} amino acids in common")
print(f"Sequence 1 and 2 have a total of {count_12} amino acids in common")
print(f"Sequence 1 and 3 have a total of {count_13} amino acids in common")
print(f"Sequence 2 and 3 have a total of {count_23} amino acids in common")

Sequence 1, 2 and 3 have a total of 326 amino acids in common
Sequence 1 and 2 have a total of 340 amino acids in common
Sequence 1 and 3 have a total of 327 amino acids in common
Sequence 2 and 3 have a total of 332 amino acids in common
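
The index loop can also be written with `zip`, which walks through the sequences letter by letter. A compact sketch with short stand-in sequences; `count_identical` is a helper name I chose for illustration.

```python
def count_identical(seq_a, seq_b):
    """Count the positions at which two sequences carry the same letter."""
    return sum(a == b for a, b in zip(seq_a, seq_b))

# short stand-ins for the real sequences
demo1 = "MNGTEGPNFYVPFSNVTG"
demo2 = "MNGTEGPNFYVPFSNITG"
demo3 = "MNGTEGPNFYVPFSNATG"

same_12 = count_identical(demo1, demo2)
same_all = sum(a == b == c for a, b, c in zip(demo1, demo2, demo3))
identity_12 = same_12 / len(demo1)   # fraction of identical positions
print(same_12, same_all, round(identity_12, 3))
```

Note that `zip` stops at the end of the shorter sequence, so extra letters at the end of a longer sequence are silently ignored.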


In [61]:
#comparing to seq4, taking into consideration that the three extra amino acids could be in the beginning or the end!
# shift_counts[j] holds the number of matches when seq4 is shifted by j positions
shift_counts = {}

for j in range(4):
    shift_counts[j] = {}
    shift_counts[j]["count_all2"] = 0
    shift_counts[j]["count_14"] = 0
    shift_counts[j]["count_24"] = 0
    shift_counts[j]["count_34"] = 0
    for i in range(len(seq1)):
        if seq1[i] == seq4[i+j]:
            shift_counts[j]["count_14"]+=1
        if seq2[i] == seq4[i+j]:
            shift_counts[j]["count_24"]+=1
        if seq3[i] == seq4[i+j]:
            shift_counts[j]["count_34"]+=1
        if seq1[i] == seq2[i] == seq3[i] == seq4[i+j]:
            shift_counts[j]["count_all2"]+=1

print(shift_counts)
print()

# single quotes inside the f-strings, so this also runs on Python versions before 3.12
for key in shift_counts:
    print(f"For an Alignment Shift of {key} in Seq4, it has {shift_counts[key]['count_14']} AA in common with Seq1.")
    print(f"For an Alignment Shift of {key} in Seq4, it has {shift_counts[key]['count_24']} AA in common with Seq2.")
    print(f"For an Alignment Shift of {key} in Seq4, it has {shift_counts[key]['count_34']} AA in common with Seq3.")
    print(f"For an Alignment Shift of {key} in Seq4, the four sequences have {shift_counts[key]['count_all2']} AA in common.")
    print()

{0: {'count_all2': 16, 'count_14': 16, 'count_24': 17, 'count_34': 18}, 1: {'count_all2': 21, 'count_14': 22, 'count_24': 22, 'count_34': 21}, 2: {'count_all2': 34, 'count_14': 38, 'count_24': 36, 'count_34': 35}, 3: {'count_all2': 314, 'count_14': 323, 'count_24': 326, 'count_34': 326}}

For an Alignment Shift of 0 in Seq4, it has 16 AA in common with Seq1.
For an Alignment Shift of 0 in Seq4, it has 17 AA in common with Seq2.
For an Alignment Shift of 0 in Seq4, it has 18 AA in common with Seq3.
For an Alignment Shift of 0 in Seq4, the four sequences have 16 AA in common.

For an Alignment Shift of 1 in Seq4, it has 22 AA in common with Seq1.
For an Alignment Shift of 1 in Seq4, it has 22 AA in common with Seq2.
For an Alignment Shift of 1 in Seq4, it has 21 AA in common with Seq3.
For an Alignment Shift of 1 in Seq4, the four sequences have 21 AA in common.

For an Alignment Shift of 2 in Seq4, it has 38 AA in common with Seq1.
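For an Alignment Shift of 2 in Seq4, it has 36 AA in common with Seq2.
For an Alignment Shift of 2 in Seq4, it has 35 AA in common with Seq3.
For an Alignment Shift of 2 in Seq4, the four sequences have 34 AA in common.

For an Alignment Shift of 3 in Seq4, it has 323 AA in common with Seq1.
For an Alignment Shift of 3 in Seq4, it has 326 AA in common with Seq2.
For an Alignment Shift of 3 in Seq4, it has 326 AA in common with Seq3.
For an Alignment Shift of 3 in Seq4, the four sequences have 314 AA in common.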
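
The numbers jump once the shift is 3, so the three extra amino acids sit at the beginning of Seq4. The search for the best shift can be wrapped in a small function. This is only a sketch with short stand-in sequences; `best_shift` is a made-up name, and it only slides the shorter sequence along the longer one without allowing gaps.

```python
def best_shift(short_seq, long_seq):
    """Try every shift of short_seq along long_seq; return (shift, matches) of the best one."""
    best = (0, -1)
    for shift in range(len(long_seq) - len(short_seq) + 1):
        # the part of long_seq that lies opposite short_seq for this shift
        window = long_seq[shift:shift + len(short_seq)]
        matches = sum(a == b for a, b in zip(short_seq, window))
        if matches > best[1]:
            best = (shift, matches)
    return best

# short stand-ins: the longer sequence has three extra letters in front
result = best_shift("MNGTEGPNFY", "AAAMNGTEGPNFY")
print(result)
```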

# Extras

## Accessing the letters of the alphabet

In [62]:
# Python program to print lists of the letters of the alphabet.
# Importing the string module.
import string
# Printing a list of the lowercase letters.
lower = list(string.ascii_lowercase)
print(lower)
# Printing a list of the uppercase letters.
upper = list(string.ascii_uppercase)
print(upper)
# Printing a list of the lowercase and uppercase letters
# (not called "all", since that would overwrite Python's built-in function all)
all_letters = list(string.ascii_letters)
print(all_letters)

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


## whitespaces with escape characters

|Symbol|Meaning|
|--|--|
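|`\n`|line break|
|`\t`|tab stop|
|`\s`|any whitespace character (space, tab, line break, ...); only has this meaning in `re` patterns, it is not an escape sequence in ordinary strings|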
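
A short sketch of how the entries in the table differ in practice: `\n` and `\t` are turned into single characters by Python itself, while `\s` is only understood by the `re` module.

```python
import re

# "\n" and "\t" are escape sequences: each is a single character in the string
print(len("\n"), len("\t"))
print(repr("A\tB\nC"))

# \s only has a meaning inside a regular expression; the raw string r"..."
# passes the backslash on to re unchanged
fields = re.split(r"\s+", "A\tB\nC  D")
print(fields)
```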