# Lecture 2 - Strings, Lists, and Loops
### University of California, Berkeley - Spring 2022

In [1]:
from IPython.display import HTML, Image

## What have we learned? 

- Python
- The IPython notebook
- Variables (`int`, `float`, `bool`)
- Operators (`+`, `-`, `*`, ..., `==`, `<`, ..., `and`, `or`, ...)
- Conditional statements (`if`, `elif`, `else`)
- While loops

## Strings

Strings are ordered collections of _characters_. 

### Ordered
_Ordered_ means that the elements of the collection are numbered with _indexes_: 0, 1, 2, 3, 4...  
Note that the first index is 0, __not__ 1!

### Characters

Characters are textual symbols, like letters (`ABCDE...`), numerals (`12345`), punctuation marks (`,.?:&`), and even things like newline (`\n`) and whitespace (` `).
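
Special characters such as newline are written with a backslash, but each one still counts as a single character. A minimal sketch (the variable name `line` is only for illustration):

```python
line = "AT\nGC"      # two letters, a newline, two letters

print(line)          # printed on two lines: AT, then GC
print(len(line))     # 5: the newline counts as one character
print(len(" "))      # 1: a space is a character too
```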

![keyboard](imgs/keyboard.jpeg)

### Back to strings

Most commonly, strings are used to work with _text_. 

We can _assign_ and _print_ strings:

In [2]:
x = "MyFirstPythonCourse"
y = 'I love python'
print(x)
print(y)

MyFirstPythonCourse
I love python


Strings are objects of type `str`:

In [3]:
type(x)

str

We can concatenate strings:

In [4]:
print(x + "2015")

MyFirstPythonCourse2015


We can convert strings to numbers and vice versa (when the content allows it):

In [5]:
x = "4"
y = int(x)
print("y+1 =", y + 1)

y+1 = 5


Otherwise, we get an error message...

In [6]:
print("x+1 =", x + 1)

TypeError: can only concatenate str (not "int") to str
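
One way to avoid this error is to make the conversion explicit, so Python knows whether we want arithmetic or text. A small sketch (the names `as_number` and `as_text` are only for illustration):

```python
x = "4"

# Convert the string to a number first: '+' means addition
as_number = int(x) + 1
print(as_number)      # 5

# Convert the number to a string first: '+' means concatenation
as_text = x + str(1)
print(as_text)        # 41
```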

In [7]:
x = str(y)
print("x =", x)

x = 4


In [8]:
x = "3.14"
y = float(x)
print("y*2 =", y * 2)

y*2 = 6.28


### Why do we care about text in programming?
- Sequences
- Data in formatted text files (lesson 4)
- Free text

Because we are biologists, strings are not just text, they are also __sequences__!

![sequences](imgs/msa.png)

In [9]:
dna = "ATGCGTA"
print(dna)

ATGCGTA


Again, we can concatenate strings:

In [10]:
upstream = "AAA"
downstream = "GGG"
dna = upstream + "ATG" + downstream
print(dna)

AAAATGGGG


We can find the length of a string using the `len` function:

In [11]:
n = len(dna)
print("The length of the DNA variable is", n)

dna = dna + "AGCTGA"
print("Now it is", len(dna))

The length of the DNA variable is 9
Now it is 15


In [12]:
print(dna)
dna = dna + "AGCTGA"
print(dna)

AAAATGGGGAGCTGA
AAAATGGGGAGCTGAAGCTGA


Updating a variable from its own previous value also works with numbers:

In [13]:
x = 10
x = x + 7
print(x)

17


### String slicing
We can extract subsets of a string by using _slicing_, with the corresponding indexes.  
Remember: string indexes start from __0__!

We can access a specific index of the string (_starting from 0_):

In [14]:
bacteria = 'Escherichia coli'

In [15]:
# get the 1st and 6th letters
print(bacteria[0])
print(bacteria[5])

E
r


Indexes work from the tail as well, using negative numbers:

In [16]:
# get the last letter
print(bacteria[-1])
# get the 5th character from the end (here, the space)
print(bacteria[-5])

i
 


We can get a range of indexes using _\[start:end\]_

In [17]:
# get the 3rd to 8th letters
print(bacteria[2:8])

cheric


Notice that the _start_ position is included, but not the _end_ position. We actually take the characters with indexes 2,3,4,5,6,7.
And what do we get?

In [18]:
type(bacteria[2:8])

str
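
Because the end position is excluded, the length of a slice is simply `end - start`, and two slices that share a boundary never overlap. A short sketch of this (the names `start`, `end`, and `piece` are only for illustration):

```python
bacteria = 'Escherichia coli'
start = 2
end = 8

piece = bacteria[start:end]
print(piece)         # cheric
print(len(piece))    # 6, which is end - start

# Slices that share a boundary put the original string back together
print(bacteria[:end] + bacteria[end:] == bacteria)   # True
```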

There are shortcuts for taking the first and last characters:

In [19]:
# get the first 5 letters
print(bacteria[0:5])
# or simply:
print(bacteria[:5])

# get everything from the 4th letter to the end:
print(bacteria[3:])

# get the last 3 letters
print(bacteria[-3:])

Esche
Esche
herichia coli
oli


### String methods

There are some methods (actions, commands) we can apply to strings. These are invoked using the '`.`' character.

We can change a string to lowercase:

In [20]:
dna = dna.lower()
print(dna)

aaaatggggagctgaagctga


And back to uppercase:

In [21]:
dna = dna.upper()
print(dna)

AAAATGGGGAGCTGAAGCTGA


We can replace characters:

In [22]:
rna = dna.replace("T", "U")
print(rna)

AAAAUGGGGAGCUGAAGCUGA
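
Note that string methods return a _new_ string and leave the original untouched, which is why we assigned the results to a variable above. A small sketch, using a shorter sequence for readability:

```python
dna = "AAAATGGGG"
rna = dna.replace("T", "U")

print(dna)    # AAAATGGGG (the original is unchanged)
print(rna)    # AAAAUGGGG

# Since each method returns a string, calls can be chained
chained = dna.lower().replace("t", "u")
print(chained)    # aaaaugggg
```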


#### Count
We can count characters. 

For example, let's count the number of histidine (`H`) and proline (`P`) in the [AA](http://upload.wikimedia.org/wikipedia/commons/a/a9/Amino_Acids.svg) (amino-acid) sequence of [Human Insulin](http://www.uniprot.org/blast/?about=P01308):

In [23]:
insulin = 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN'
print("# of histidine:", insulin.count('H'))
print("# of proline:", insulin.count('P'))

# of histidine: 2
# of proline: 6


#### Find
We can find a substring within a string.
For example, we can look for the character `D` in the insulin sequence.

In [24]:
pos = insulin.index('D')
print(pos)

19


In [25]:
type(pos)

int

In [26]:
print(insulin[pos])

D


The result is the index (position) of the first `D` found in the sequence.

We can also look for longer substrings, representing motifs. For example, let's find the position of the Insulin [B-chain](http://www.uniprot.org/blast/?about=P01308[25-54]) in the entire peptide:

In [27]:
b_chain = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
position = insulin.index(b_chain)
print("Position:", position)

Position: 24


In [28]:
print(len(b_chain))

30


In [29]:
found = insulin[position:position + len(b_chain)] # slicing (notice the ':')
print(b_chain == found)
print("Original:", b_chain)
print("Found:   ", found)

True
Original: FVNQHLCGSHLVEALYLVCGERGFFYTPKT
Found:    FVNQHLCGSHLVEALYLVCGERGFFYTPKT
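
Strings also have a `find` method. It behaves like `index` when the substring is present, but returns `-1` instead of raising an error when it is absent. A minimal sketch, using only the beginning of the insulin sequence (the name `peptide` is only for illustration):

```python
peptide = "MALWMRLLPLLALLALWGPDPAAAFVNQ"

print(peptide.index("D"))    # 19
print(peptide.find("D"))     # 19

# 'X' does not appear in the peptide: find returns -1,
# while peptide.index("X") would stop with a ValueError
missing = peptide.find("X")
print(missing)               # -1
```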


#### Split

We can split a string on every occurrence of a separator character:

In [30]:
names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(species)

['melanogaster', 'simulans', 'yakuba', 'ananassae']


What do we get?

In [31]:
type(species)

list
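
The reverse operation is `join`, a string method that glues the elements of a list of strings together with a separator. A short sketch (the name `rejoined` is only for illustration):

```python
names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")

# join is called on the separator, with the list as its argument
rejoined = ",".join(species)
print(rejoined == names)      # True

# Any separator string works
print(" | ".join(species))    # melanogaster | simulans | yakuba | ananassae
```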

## Lists

Lists are similar to strings in being sequential, only they can contain any type of data, not just characters.

This includes `int`, `float`, `bool`, `str`, and even `list`.  
Lists could even include mixed variable types.

We define a list just like any other variable, but use '[ ]' to enclose the _elements_ and ',' to separate them.

In [32]:
# a list of strings
apes = ["Homo sapiens", "Pan troglodytes", "Pongo pygmaeus"]
print(apes)

['Homo sapiens', 'Pan troglodytes', 'Pongo pygmaeus']


![Gorilla](imgs/gorilla.jpeg)

In [33]:
# a list of numbers
nums = [7,13,2,400]
print(nums)

[7, 13, 2, 400]


In [34]:
# a mixed list
mixed = [12,'Mus musculus',True]
print(mixed)

[12, 'Mus musculus', True]


You can access list elements just like strings, using indexes (starting from 0):

In [35]:
print("Human:", apes[0])
print("Orangutan:", apes[-1])

H: Homo sapiens
Orangutan: Pongo pygmaeus


Lists are dynamic - you can change, append, insert, and remove elements. Appending, inserting, and removing are done using _list methods_, again invoked with the '.'.

First, we can access and change individual list elements by index:

In [36]:
new_apes = apes[:] # make a copy of the apes list
new_apes[2] = 'Hylobates lar'
print(new_apes)

['Homo sapiens', 'Pan troglodytes', 'Hylobates lar']
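
The `[:]` slice matters here: it creates a new list, whereas a plain assignment only gives the same list a second name. A minimal sketch of the difference (the names `alias` and `apes_copy` are only for illustration):

```python
apes = ["Homo sapiens", "Pan troglodytes", "Pongo pygmaeus"]

alias = apes          # a second name for the SAME list
apes_copy = apes[:]   # a new list with the same elements

# Changing the copy leaves the original alone
apes_copy[2] = "Hylobates lar"
print(apes[2])        # Pongo pygmaeus

# Changing the alias changes the original too
alias[2] = "Hylobates lar"
print(apes[2])        # Hylobates lar
```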


This __does NOT__ work with strings though...

In [37]:
print(dna)
dna[5] = 'G'

AAAATGGGGAGCTGAAGCTGA


TypeError: 'str' object does not support item assignment
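
Since a string cannot be changed in place, one workaround is to build a new string from slices of the old one. A small sketch, assuming we want a `C` at index 5 (the name `new_dna` is only for illustration):

```python
dna = "AAAATGGGGAGCTGAAGCTGA"

# everything before index 5 + the new base + everything after index 5
new_dna = dna[:5] + "C" + dna[6:]

print(dna)        # AAAATGGGGAGCTGAAGCTGA
print(new_dna)    # AAAATCGGGAGCTGAAGCTGA
```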

In [38]:
# add element to the end of the list
apes.append("Gorilla gorilla")
print(apes)

['Homo sapiens', 'Pan troglodytes', 'Pongo pygmaeus', 'Gorilla gorilla']


In [39]:
# insert element at a given index
apes.insert(2, "Pan paniscus")
print(apes)

['Homo sapiens', 'Pan troglodytes', 'Pan paniscus', 'Pongo pygmaeus', 'Gorilla gorilla']


In [40]:
# remove element from list
apes.remove("Pongo pygmaeus")
print(apes)

['Homo sapiens', 'Pan troglodytes', 'Pan paniscus', 'Gorilla gorilla']


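To remove a list item by index (note that both options run here, so two elements are removed):

In [41]:
# option 1: look up the value at index 1 and remove its first occurrence
apes.remove(apes[1])
# option 2: delete the element at index 1 directly
del apes[1]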
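
Another list method that may be handy here is `pop`, which removes an element by index and also returns it. A short sketch on a separate list, so that `apes` is not affected (the names `primates` and `removed` are only for illustration):

```python
primates = ["Homo sapiens", "Pan troglodytes", "Pan paniscus", "Gorilla gorilla"]

# pop removes the element at the given index and hands it back
removed = primates.pop(1)
print(removed)     # Pan troglodytes
print(primates)    # ['Homo sapiens', 'Pan paniscus', 'Gorilla gorilla']

# del removes by index without returning anything
del primates[0]
print(primates)    # ['Pan paniscus', 'Gorilla gorilla']
```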

We can concatenate lists, just like strings:

In [42]:
print(apes + ["Pongo pygmaeus", "Pongo abelii"])

['Homo sapiens', 'Gorilla gorilla', 'Pongo pygmaeus', 'Pongo abelii']


![Orangutan](imgs/orangutan.jpeg)

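Searching in lists is done using `index` (lists have no `find` method). If the element is not in the list, `index` raises a `ValueError`. That is what happens here, because 'Pan troglodytes' was removed from `apes` above: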

In [43]:
i = apes.index('Pan troglodytes')
print(i)
print(apes[i])

ValueError: 'Pan troglodytes' is not in list

You can also check if something is in a list (works as well for strings):

In [44]:
if 'Saguinus nigricollis' in apes:
    print('Saguinus nigricollis is an ape')
else:
    print('Saguinus nigricollis is not an ape')

Saguinus nigricollis is not an ape
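
With strings, `in` checks for a substring rather than a single element. A minimal sketch with a short sequence (the name `start` is only for illustration):

```python
dna = "AAAATGGGG"

print("ATG" in dna)    # True
print("TTT" in dna)    # False

# Checking first makes the call to index safe
if "ATG" in dna:
    start = dna.index("ATG")
    print("Start codon found at index", start)    # index 3
```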


### Lists of numbers

Suppose we have a list of experimental measurements and we want to do basic statistics: count the number of results, calculate the average, and find the maximum and minimum.

In [45]:
measurements = [33, 55,45,87,88,95,34,76,87,56,45,98,87,89,45,67,45,67,76,73,33,87,12,100,77,89,92]

count = len(measurements)
avg = sum(measurements) / len(measurements)
maximum = max(measurements)
minimum = min(measurements)

print(count, "measurements with average", avg, "maximum", maximum, "minimum", minimum)

27 measurements with average 68.07407407407408 maximum 100 minimum 12


### Sorting lists
  
We can sort lists using the built-in `sorted` function.  
If the list is made __entirely__ of numbers, then sorting is straightforward:

In [46]:
sorted_measurements = sorted(measurements)
print(sorted_measurements)

[12, 33, 33, 34, 45, 45, 45, 45, 55, 56, 67, 67, 73, 76, 76, 77, 87, 87, 87, 87, 88, 89, 89, 92, 95, 98, 100]
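
Note that `sorted` returns a new list and leaves the original as it was. Lists also have a `sort` method that reorders the list in place. A small sketch of both on a short list (the names `values` and `ordered` are only for illustration):

```python
values = [33, 55, 45, 87]

ordered = sorted(values)
print(values)     # [33, 55, 45, 87] (unchanged)
print(ordered)    # [33, 45, 55, 87]

# reverse=True sorts from largest to smallest
print(sorted(values, reverse=True))    # [87, 55, 45, 33]

# the sort method changes the list itself
values.sort()
print(values)     # [33, 45, 55, 87]
```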


A list of strings will be sorted lexicographically (think about the way '<' and '>' work on strings):

In [47]:
sorted_apes = sorted(apes)
print(sorted_apes)

['Gorilla gorilla', 'Homo sapiens']
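
One consequence of lexicographic sorting is that all uppercase letters come before all lowercase letters. A short sketch with a made-up list of names; the `key` argument used at the end goes beyond this lecture and is shown only as one possible way to ignore case:

```python
names = ["pan", "Homo", "gorilla", "Pongo"]

# Uppercase letters sort before lowercase ones
print(sorted(names))             # ['Homo', 'Pongo', 'gorilla', 'pan']
print("Homo" < "gorilla")        # True

# Compare the lowercase version of each name instead
ignoring_case = sorted(names, key=str.lower)
print(ignoring_case)             # ['gorilla', 'Homo', 'pan', 'Pongo']
```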


But beware of mixed lists!

In [48]:
mixed = apes + measurements
print(mixed)
print(sorted(mixed))

['Homo sapiens', 'Gorilla gorilla', 33, 55, 45, 87, 88, 95, 34, 76, 87, 56, 45, 98, 87, 89, 45, 67, 45, 67, 76, 73, 33, 87, 12, 100, 77, 89, 92]


TypeError: '<' not supported between instances of 'int' and 'str'

### List of lists (nested lists)
  
List elements can be of any type, including lists!  
For example:

In [49]:
birds = ['Gallus gallus', 'Corvus corone', 'Passer domesticus']
snakes = ['Ophiophagus hannah', 'Vipera palaestinae', 'Python bivittatus']
animals = [apes,birds,snakes]
print(animals)

[['Homo sapiens', 'Gorilla gorilla'], ['Gallus gallus', 'Corvus corone', 'Passer domesticus'], ['Ophiophagus hannah', 'Vipera palaestinae', 'Python bivittatus']]


We access lists of lists using double-indexes. For example, to get the 3rd snake:

In [50]:
print(animals[2][2])

Python bivittatus


Note that the elements of the outer list are __lists__ themselves, not strings. For example:

In [51]:
type(animals[1])

list

### List slicing
  
We can slice lists just like we did with strings, to get partial lists.  
For example:

In [52]:
# get the first 10 measurements
print(measurements[:10])
# get the last 3 measurements
print(measurements[-3:])

[33, 55, 45, 87, 88, 95, 34, 76, 87, 56]
[77, 89, 92]


## Loops

Say we want to print each element of our list:

In [53]:
print(apes[0], "is an ape")
print(apes[1], "is an ape")

Homo sapiens is an ape
Gorilla gorilla is an ape


But this is very repetitive and relies on us knowing the number of elements in the list. What we need is a way to say something along the lines of “for each element in the list of apes, print out the element, followed by the words ‘is an ape’”. Python’s loop syntax allows us to express those instructions like this:

In [54]:
for ape in apes:
    print(ape, "is an ape")

Homo sapiens is an ape
Gorilla gorilla is an ape


![Python loop](imgs/python_loop.jpeg)

A more complex loop will go over each ape name and print some stats:

In [55]:
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape, "is an ape. Its name starts with", first_letter)
    print("Its name has", name_length, "letters")

Homo sapiens is an ape. Its name starts with H
Its name has 12 letters
Gorilla gorilla is an ape. Its name starts with G
Its name has 15 letters


We can also loop over a string. 

Let's go over the Insulin AA sequence and count the number of prolines manually:

In [56]:
count = 0
for aa in insulin:
    if aa == "P":
        count = count + 1
print("# of prolines:", count)

# of prolines: 6


Can you remember another way of doing this?

Let's count how many measurements (see above) are above the average:

In [57]:
print(measurements)
print(avg)

[33, 55, 45, 87, 88, 95, 34, 76, 87, 56, 45, 98, 87, 89, 45, 67, 45, 67, 76, 73, 33, 87, 12, 100, 77, 89, 92]
68.07407407407408


In [58]:
over = 0
for x in measurements:
    if x > avg:
        over = over + 1
print(over, "measurements are over the average.")

15 measurements are over the average.


## Using `range`

Sometimes we want to loop over consecutive numbers.

This is accomplished using the `range` command.

`range` accepts one, two, or three arguments: the lower and upper limits and the step size.  
The lower limit can be omitted - default is zero - and the step can be omitted - default is 1.  
The upper limit is __not__ included.

In [59]:
for i in range(10): # aka range(0,10,1)
    print(i)

0
1
2
3
4
5
6
7
8
9


In [60]:
for i in range(10,20):
    print(i, end=' ') # prints ending with space instead of newline

10 11 12 13 14 15 16 17 18 19 

In [61]:
for i in range(100,1000,10):
    print(i, end=' ')

100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850 860 870 880 890 900 910 920 930 940 950 960 970 980 990 
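
To see at a glance which numbers a `range` produces, we can convert it to a list. A small sketch of the three forms, plus a negative step for counting down (the name `countdown` is only for illustration):

```python
print(list(range(5)))           # [0, 1, 2, 3, 4]
print(list(range(2, 6)))        # [2, 3, 4, 5]
print(list(range(0, 10, 3)))    # [0, 3, 6, 9]

# With a negative step the numbers go down; the end is still excluded
countdown = list(range(10, 0, -2))
print(countdown)                # [10, 8, 6, 4, 2]
```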

Let's check if the number `n` is a prime number - that is, it is divisible only by 1 and itself:

In [62]:
n = 97 # try other numbers
divider = 1

for k in range(2,n): # why start at 2? can we choose a different limit to range? a different step perhaps?
    if n % k == 0:
        divider = k
if divider != 1:
    print(n, "is divided by", divider)
else:
    print(n, "is a prime number")

97 is a prime number
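
One possible answer to the questions in the comment above: any divisor larger than the square root of `n` must be paired with one smaller than it, so it is enough to test up to the square root, and we can stop at the first divisor we find. The sketch below assumes `n` is at least 2 and uses `break`, a statement not covered in this lecture that exits the loop early; the value 91 is just an example:

```python
n = 91    # example value: 91 = 7 * 13
divider = 1

# Test candidates only up to the square root of n
for k in range(2, int(n ** 0.5) + 1):
    if n % k == 0:
        divider = k
        break    # one divisor is enough, stop looking

if divider != 1:
    print(n, "is divisible by", divider)
else:
    print(n, "is a prime number")
```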


We can also use `range()` to loop over a list by index. This is useful when we need the position of each element as well as its value.

In [63]:
for i in range(len(apes)):
    print(apes[i])

Homo sapiens
Gorilla gorilla
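
When we need both the index and the element, Python's built-in `enumerate` function is a common alternative to `range(len(...))`. It is not covered in this lecture, so take this as a brief sketch:

```python
apes = ["Homo sapiens", "Gorilla gorilla"]

# Index-based loop with range
for i in range(len(apes)):
    print(i, apes[i])

# enumerate gives the index and the element together
for i, ape in enumerate(apes):
    print(i, ape)
```

Both loops print `0 Homo sapiens` followed by `1 Gorilla gorilla`.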


## Some cool exercises

### 1) Restriction fragment lengths

Here’s a short DNA sequence:

```ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT```

The sequence contains a recognition site for the EcoRI restriction enzyme, which cuts at the motif `G*AATTC` (the position of the cut is indicated by an asterisk). Write a program which will calculate the size of the two fragments that will be produced when the DNA sequence is digested with EcoRI.

(from [Python for Biologists](http://pythonforbiologists.com/index.php/introduction-to-python-for-biologists/2-printing-and-manipulating-text/))

In [64]:
seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'

In [65]:
fragments = seq.split('GAATTC')
f1_length = len(fragments[0]) + 1 # add 1 for the 'G'
f2_length = len(fragments[1]) + 5 # add 5 for the 'AATTC'
print('Fragment lengths of',f1_length,'and',f2_length,'will be produced.')

Fragment lengths of 22 and 33 will be produced.
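
An alternative sketch of the same calculation uses `index` and slicing instead of `split`, which keeps the two fragments themselves rather than only their lengths (the names `site`, `cut`, `fragment1`, and `fragment2` are only for illustration):

```python
seq = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
site = 'GAATTC'

# The enzyme cuts right after the 'G', one position into the site
cut = seq.index(site) + 1

fragment1 = seq[:cut]
fragment2 = seq[cut:]
print('Fragment lengths of', len(fragment1), 'and', len(fragment2), 'will be produced.')
```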


### 2) Complementing DNA

Write a program that will print the complement of the sequence above.

(from [Python for Biologists](http://pythonforbiologists.com/index.php/introduction-to-python-for-biologists/2-printing-and-manipulating-text/))

In [66]:
complement = ''
for base in seq:
    if base == 'A':
        complement = complement + 'T'
    elif base == 'T':
        complement = complement + 'A' 
    elif base == 'G':
        complement = complement + 'C'
    elif base == 'C':
        complement = complement + 'G'    
    else:
        print("Bad base:", base)
print("Complement:", complement)

Complement: TGACTAGCTAATGCATATCATCTTAAGATAGTATGTATATATAGCTACGCAAGTA
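
The same result can be sketched without a loop, using only `replace` and `upper`. The trick is to replace each base with the _lowercase_ version of its partner, so that a base that has already been swapped is not swapped back by a later replacement. A shorter sequence is used here for readability:

```python
short_seq = "ATGCGTA"

# Lowercase letters mark bases that were already complemented
step = short_seq.replace('A', 't')
step = step.replace('T', 'a')
step = step.replace('G', 'c')
step = step.replace('C', 'g')

short_complement = step.upper()
print("Complement:", short_complement)    # Complement: TACGCAT
```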


## Extra resources

- Python for Biologists: [Strings](http://pythonforbiologists.com/index.php/introduction-to-python-for-biologists/2-printing-and-manipulating-text/), [Lists and loops](http://pythonforbiologists.com/index.php/introduction-to-python-for-biologists/lists-and-loops/)
- Software carpentry: [Strings](http://software-carpentry.org/v4/python/strings.html), [Lists](http://software-carpentry.org/v4/python/lists.html)

## Congrats!

The notebook is available at https://github.com/Naghipourfar/molecular-biomechanics/