#  Bio 208: Lecture 04 -- Live coding (annotated)

## Strings

### String creation

Strings can be created by typing characters (ASCII or Unicode) surrounded by either single quotes or double quotes.

In [1]:
s1 = "ATGCCCGA"  # double quotes
s1

In [2]:
s1 = 'ATGCCCGA'  # single quotes -- equivalent to above
s1

Being able to use both types of quotes means you can nest quotes in strings:

In [3]:
s2 = "He said, 'Hello world'"
s2

Note the difference between showing the Python representation of a string (see example above) and printing the string. Here's the same string passed through the `print` function:

In [4]:
print(s2)

Notice that the outer quotes are no longer visible.  This is a subtle difference in this case, but see below...

### Escape sequences

Some special characters such as newlines and tabs are written using "escape sequences" -- a backslash followed by a character that indicates which special character is meant.  For example, a newline is written as `\n` and a tab is written as `\t`.

In [5]:
s3 = "There will be a newline here\nAnd then there will be more text..."
s3

Again, compare the Python representation of the string in the output above, to the printed representation below:

In [6]:
print(s3)
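
Tabs, mentioned above but not yet shown, behave the same way. Here is a short sketch; the string `row` is a made-up example and not part of the original lecture:

```python
row = "gene\tstart\tend"  # hypothetical example: three fields separated by tabs

print(repr(row))  # the Python representation shows the \t escape sequences
print(row)        # printing shows the tabs as horizontal spacing
```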

### Triple quoted strings 

Triple quoted strings allow you to use special characters like tabs and newlines in a string without explicitly writing their escape sequences (`\t`, `\n`).

In [7]:
s3 = """
’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe. 
"""

In [8]:
print(s3)

### Raw strings

In a raw string, escape sequences such as `\n` remain uninterpreted. Raw strings are prefixed with `r` as shown below:

In [9]:
s4 = r"This is a raw string. I can write \n here without generating a newline"
print(s4)
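
One way to see what the `r` prefix changes is to compare string lengths. In an ordinary string `\n` is a single newline character, while in a raw string it stays as two characters, a backslash followed by `n`. A minimal sketch (the names `escaped` and `raw` are made up for illustration):

```python
escaped = "a\nb"  # \n is interpreted as one newline character
raw = r"a\nb"     # the backslash and the n are kept as two separate characters

print(len(escaped))    # 3
print(len(raw))        # 4
print(raw == "a\\nb")  # True: doubling the backslash writes the same string without r
```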

### String interpolation with f-strings

String interpolation using "f-strings" is a convenient way to insert the value of other computations into strings.  This works with variables you've created previously or directly with expressions, as the following two examples illustrate:

In [10]:
area = 3.141592654 * 10**2
s5 = f"The area of the circle is {area} cm^2"
print(s5)

In [11]:
s5 = f"The area of the circle is {3.141592654 * 10**2} cm^2"
print(s5)

f-strings also accept format specifiers, which are useful for things like specifying the number of decimal places to include in the string representation of a floating point value:

In [12]:
s6 = f"The area of the circle is {area:0.2f} cm^2"  # print 2 decimal places
print(s6)
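
A few other format specifiers that tend to be handy are sketched below. The variables `gc_fraction` and `genome_size` hold made-up values chosen only for illustration:

```python
gc_fraction = 0.4567    # hypothetical value, for illustration only
genome_size = 12157105  # hypothetical value, for illustration only

pct = f"{gc_fraction:.1%}"  # show as a percentage with 1 decimal place
size = f"{genome_size:,}"   # insert thousands separators
padded = f"{42:>6}"         # right-align in a field 6 characters wide

print(pct)           # 45.7%
print(size)          # 12,157,105
print(repr(padded))  # '    42'
```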

### String indexing and slicing

Strings have a length and support indexing and slicing in the same manner as lists. Like lists, strings are zero-indexed.

In [13]:
s1 = "ATGCCCGA"

In [14]:
s1[0] # 1st element

In [15]:
s1[1] # 2nd element (at index 1)

'T'

In [16]:
s1[-1] # last element

'A'

In [17]:
s1[:3] # first three elements

'ATG'

In [18]:
s1[3:] # everything from index 3 to the end

'CCCGA'

In [19]:
s1[::-1]  # Can use strides with slices, for example this reverses the string

'AGCCCGTA'
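
The examples above don't show the length of a string or a slice with both a start and an end index, so here is a short sketch of both, using the same string as above:

```python
s1 = "ATGCCCGA"

print(len(s1))  # 8
print(s1[2:5])  # GCC  (indices 2, 3 and 4; the end index is excluded)
print(s1[::2])  # AGCG (every second character, starting at index 0)
```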

### Strings are immutable

Trying to change the elements of a string raises an error, because string objects in Python are immutable:

In [20]:
s1[0] = "T"

TypeError: 'str' object does not support item assignment
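
Since a string can't be modified in place, the usual workaround is to build a new string. The sketch below catches the error raised by item assignment and then creates a modified copy using slicing and the `+` operator (covered in the next section). The name `s1_modified` is made up for this example:

```python
s1 = "ATGCCCGA"

try:
    s1[0] = "T"  # item assignment is not supported for strings
except TypeError as err:
    print("TypeError:", err)

# build a new string instead of changing the old one
s1_modified = "T" + s1[1:]
print(s1_modified)  # TTGCCCGA
print(s1)           # ATGCCCGA (the original is unchanged)
```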

### String concatenation and repetition

Concatenating two strings using the `+` operator forms a new string:

In [21]:
s4 = "ATG" 
s5 = "GTA"
s4 + s5

'ATGGTA'

Concatenation can be combined with slicing to "edit" strings:

In [22]:
incorrect = "How noow brown cow?"
print("Original version: ", incorrect)
corrected = incorrect[:5] + incorrect[6:]
print("Corrected version: ", corrected)

Original version:  How noow brown cow?
Corrected version:  How now brown cow?


The `*` operator when applied to a string and an integer repeats a string: 

In [23]:
repetitive = "TA" * 4
repetitive

'TATATATA'

### Other useful functions on strings

Here are a few other useful methods associated with string objects in Python.  See [the Python documentation](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) for discussion of these methods and more:

#### replace

In [24]:
# replace -- by default replace all instances of the first argument with the 2nd argument
r1 = "IT was The besT of Times, iT was The worsT of Times..."
r1.replace("T", "t")

'It was the best of times, it was the worst of times...'

In [25]:
# note that replace doesn't change the original string
r1

'IT was The besT of Times, iT was The worsT of Times...'
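
As the comment above hints, replacing every instance is only the default. `replace` accepts an optional third argument that limits the number of replacements. To keep a change you assign the result to a name. A brief sketch (the names `first_only` and `r1_fixed` are made up for illustration):

```python
r1 = "IT was The besT of Times, iT was The worsT of Times..."

first_only = r1.replace("T", "t", 1)  # replace only the first "T"
print(first_only)  # It was The besT of Times, iT was The worsT of Times...

r1_fixed = r1.replace("T", "t")  # keep the fully replaced version under a new name
print(r1_fixed)    # It was the best of times, it was the worst of times...
```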

#### strip

`strip` removes whitespace characters at the beginning and end of a string:

In [26]:
r2 = "    <- Look at all that whitespace at the beginning and ending ->    \n"
r2

'    <- Look at all that whitespace at the beginning and ending ->    \n'

In [27]:
r2.strip()

'<- Look at all that whitespace at the beginning and ending ->'
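
There are also one-sided variants, `lstrip` and `rstrip`, and all three methods accept an optional argument listing the characters to remove instead of whitespace. A small sketch with made-up strings:

```python
padded = "    padded text    \n"  # hypothetical example string

print(repr(padded.lstrip()))       # 'padded text    \n'  (left side only)
print(repr(padded.rstrip()))       # '    padded text'    (right side only)
print(repr("--ATG--".strip("-")))  # 'ATG'                (strip dashes instead of whitespace)
```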

#### join

`join` is a method that takes as input a list (or other sequence) of strings, and concatenates them with the string used to call the method. This is best illustrated by example:

In [28]:
words = ["how", "now", "brown", "cow"] # create a list of words

In [29]:
" ".join(words)  # pass words as the argument to join called on the string containing a single space

'how now brown cow'

In [30]:
"_".join(words) # same thing but using underscore as the calling string

'how_now_brown_cow'

In [31]:
print("\n".join(words)) # or using newlines (best visualized by printing the string)

how
now
brown
cow
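
The `split` method goes in the opposite direction from `join`: it breaks a string into a list of strings. This wasn't covered in the live coding, so treat the following as a supplementary sketch:

```python
sentence = "how now brown cow"

words = sentence.split()  # with no argument, split on runs of whitespace
print(words)              # ['how', 'now', 'brown', 'cow']

fields = "A,T,G,C".split(",")  # or split on an explicit separator
print(fields)                  # ['A', 'T', 'G', 'C']

print(" ".join(words) == sentence)  # True: join undoes the split
```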


## The complement of a DNA sequence

Having learned the basics of string manipulation in Python we'll examine a number of different ways to write a function that operates on strings.  The function we'll examine is a simple but important function -- the complement of a DNA sequence.

As we explore different ways to write a function to compute the complement of a DNA sequence we'll introduce some additional important Python concepts, including flow control statements (if-else statements and for-loops) and another common data structure (dictionaries).

### Approach #1:  solve the problem for the smallest atomic unit, and then scale up

In this approach we'll first solve the problem of computing the complement of one nucleotide -- the smallest unit of a DNA string -- and then figure out how to scale up that solution to work with multiple nucleotides.

### if-else statements

Consider how we might verbally formulate the solution to the problem of producing the complement of a nucleotide:  

> If we're handed the nucleobase "T" (thymine) we return an "A" (adenine); if given "A" we return "T"; if given "G" (guanine) we return "C" (cytosine); if given "C" we return "G".

Notice we formulated that as a bunch of "if this then that" type statements.  Most programming languages allow you to express such conditional computations as "if-else" statements that allow you to structure the flow of execution so that certain expressions are executed only if particular conditions are met. Python "if-else" statements look like this:

```python
if (some condition is true):
    execute this code
else:
    execute this other code
```

When you have more than two possible conditions to evaluate there is a version we'll refer to as "if-elif-else":

```python
if (condition 1 is true):
    execute code specific to condition 1
elif (condition 2 is true):
    execute code specific to condition 2
...
elif (...any number of additional conditions to check...):
...
else:
    if none of the above conditions is true execute this code
```

To calculate the complement we have more than two conditions to check (4 possible nucleotides), so we'll use the "if-elif-else" form:

In [32]:
nuc = "A"  # change this variable and re-evaluate this code block to see how the if-elif-else block below works

if nuc == "A":
    complement = "T"
elif nuc == "T":
    complement = "A"
elif nuc == "G":
    complement = "C"
elif nuc == "C":
    complement = "G"
else:   
    complement = "N"  # this last branch covers the condition where we get an
                      # input we don't recognize. In this case we'll return
                      # an "N", which is the way to express an ambiguous nucleotide

print(f"The complement of {nuc} is {complement}.")

The complement of A is T.


Having seen how we can implement this basic algorithm with "if-else" statements, let's turn this into a function:

In [33]:
def complement_nucleotide(n):
    n = n.upper()  # capitalize the input so function also works with lower case strings
    if n == "A":
        return "T"
    elif n == "T":
        return "A"
    elif n == "G":
        return "C"
    elif n == "C":
        return "G"
    else:
        return "N"

In [34]:
complement_nucleotide("T")

'A'

In [35]:
complement_nucleotide("a")  # also works with lower case, see: note about `upper` method on strings

'T'

### Scaling up using iteration

We've successfully figured out how to write a function to compute the complement of a string with a single nucleotide, but we haven't yet solved the general case of a string containing multiple nucleotides.  To do that we'll need to introduce the concept of "for loops". 

#### for loops

A "for loop" is a control flow statement that "iterates" (traverses, walks-over) a sequence-like object (things like strings, lists, arrays). A common use of for loops is to carry out a computation on each element of a sequence  or to make a calculation that involves all the elements of a sequence (like calculating a sum).

The general form of a for-loop in Python is:

```python
for item in seq:
    some computation applied to each item (thing) in the seq object
```

Here's an example of a for loop operating on Boolean values:

In [36]:
x = [True, True, True, False]
y = []

for item in x:
    y.append(not item)  # calculate the inverse of the item and append to the list y
    
y

[False, False, False, True]
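
The other common use mentioned above, a calculation that involves all the elements of a sequence, can be sketched as a running sum. The list `lengths` holds made-up numbers for illustration:

```python
lengths = [120, 85, 310]  # hypothetical sequence lengths, for illustration only
total = 0

for n in lengths:
    total = total + n  # add each element to the running total

print(total)  # 515
```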

Here's another slightly more complicated example where there are multiple statements in the body of the for-loop.

In [37]:
sentence = "shouting on the internet LOOKS LIKE THIS!"
newsentence = ""

for letter in sentence:
    if letter.islower():  # lookup islower in python docs
        # newsentence += ... is short-hand for writing newsentence = newsentence + ....
        newsentence += letter.upper()  
    else:
        newsentence += letter.lower()

newsentence

'SHOUTING ON THE INTERNET looks like this!'

#### Using our `complement_nucleotide` function in a for loop

Now that we know how for loops work let's see if we can use one with the `complement_nucleotide` function we defined above:

In [38]:
seq = "ATGCCC"
compseq = ""

for nuc in seq:
    compseq += complement_nucleotide(nuc)

compseq

'TACGGG'

Awesome!  Now let's wrap this logic in another function:

In [39]:
def complement_sequence(seq):
    compseq = ""
    for nuc in seq:
        compseq += complement_nucleotide(nuc)
    return compseq

Now to test it:

In [40]:
complement_sequence("ATGCCCC")

'TACGGGG'

In [41]:
complement_sequence("nonsensical input")

'NNNNNNNNGTNNNNNNA'

#### for loops with an explicit index

If you've programmed in languages like C/C++ or Java, you're probably more familiar with for-loop structures like the one below, where our for-loop iterates over a set of integer values and uses those index values to retrieve elements to compute on:

In [42]:
def complement_sequence_Clike(seq):
    compseq = ""
    for i in range(len(seq)):
        compseq = compseq + complement_nucleotide(seq[i])
    return compseq

complement_sequence_Clike("ATGCCCC")

'TACGGGG'

As you see this works as well, but the first form is more idiomatic for simple loops.  However this second form with explicit indexing is sometimes preferred for more complex operations, like the example below where we are working with slices of a string:


In [43]:
s = "abcdefg"
letterpairs = []

for i in range(0, len(s), 2):  # range from 0 to len(s), stepping by 2
    letterpairs.append(s[i:i+2])
    
letterpairs


['ab', 'cd', 'ef', 'g']

### List comprehensions

Iteration is such a fundamental concept in programming that Python includes a special syntax called a "list comprehension" that allows us to iterate over a sequence, applying some computation of interest, and collect the results of each of those computations into a list.  The list comprehension syntax looks like this:

```python
[do some computation on item for item in seq]
```

We can think of a list comprehension as having two parts, to the left and right of the `for` keyword.  The left part specifies what you're doing, and the right part specifies what you're doing it with.

Here are some examples:

In [44]:
[-x for x in [1,2,3]]  # left part negates the value, right part specifies what values we're negating

[-1, -2, -3]

To emphasize the parallels, here's how you'd do the same computation using a for-loop:

In [45]:
result = []

for x in [1,2,3]:
    result.append(-x)

result

[-1, -2, -3]

List comprehensions are much more compact to write and are also (usually) more computationally efficient than the corresponding for-loop, so I tend to use them wherever appropriate as long as they don't obscure the logic of my program.
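
If you want to check the efficiency claim on your own machine, one option is the standard library's `timeit` module. The sketch below is not from the lecture; the function names `negate_loop` and `negate_lc` are made up, and the timings will vary between computers and Python versions, so no expected output is shown:

```python
from timeit import timeit

def negate_loop(values):
    # for-loop version
    result = []
    for x in values:
        result.append(-x)
    return result

def negate_lc(values):
    # list comprehension version
    return [-x for x in values]

values = list(range(1000))

# run each version 1000 times and report the total elapsed time in seconds
loop_time = timeit(lambda: negate_loop(values), number=1000)
lc_time = timeit(lambda: negate_lc(values), number=1000)

print(f"for-loop: {loop_time:.3f} s, list comprehension: {lc_time:.3f} s")
```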

Here are a few more examples of list comprehensions:

In [46]:
[x**2 for x in range(10)]  # square every number between 0 and 9

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [47]:
s = "abcdefg"
letterpairs = [s[i:i+2] for i in range(0, len(s), 2)]
letterpairs

['ab', 'cd', 'ef', 'g']

#### Conditionals in list comprehensions

List comprehensions also support a conditional form, as illustrated below:

In [48]:
from math import sqrt

vals = [4, 9, 16, -25]

[sqrt(x) for x in vals if x >= 0]  # calculate square roots but only for values greater than or equal to 0

[2.0, 3.0, 4.0]

List comprehensions also support an "if-else" form, but this requires you to move the if-else to the left side of the `for` keyword:

In [49]:
# the nan ("not a number") object, defined in the math library,
# is a useful way to represent the result of numerical computations that
# might produce invalid results for some inputs
from math import nan 

[sqrt(x) if x >= 0 else nan for x in vals]

[2.0, 3.0, 4.0, nan]

Unfortunately, the "if-else" form of list comprehensions isn't quite as readable as the standard form or the single if form.
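
One possible way to keep such a comprehension readable is to move the if-else logic into a small helper function and call that from the standard form. This is only a sketch; the helper name `safe_sqrt` is made up for this example:

```python
from math import sqrt, nan

def safe_sqrt(x):
    """Return the square root of x, or nan if x is negative."""
    if x >= 0:
        return sqrt(x)
    else:
        return nan

vals = [4, 9, 16, -25]

roots = [safe_sqrt(x) for x in vals]  # standard form, the branching lives in the helper
print(roots)  # [2.0, 3.0, 4.0, nan]
```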

#### A list comprehension version of complement_sequence

Let's see how we might use a list comprehension to implement a function equivalent to `complement_sequence` defined above:

In [50]:
def complement_sequence_LC(seq):
    return "".join([complement_nucleotide(nuc) for nuc in seq])

In [51]:
complement_sequence_LC("ATGCCCC")

'TACGGGG'

That was super easy using a list comprehension!

### One more version using dictionaries

We're now going to look at one more implementation of `complement_sequence` using a data structure called a dictionary (`dict`). Dictionaries, along with lists (and tuples, which we haven't looked at), are among the core Python data structures.

A dictionary (sometimes called a hashmap or simply a map) is a data structure which can be used to represent mappings or relationships between pairs of objects.  The things you're mapping from are often called "keys", while the things you're mapping to are often referred to as "values".  A dictionary is thus a collection of key-value pairs.

For example, you might want to maintain a mapping between single letter DNA abbreviations and the full names of the nucleotides they represent.  Here's how you'd construct such a mapping using a dictionary:


In [52]:
# dictionaries are constructed inside curly brackets, each key:value pair is separated by a comma
nuc2name = {"A": "adenine", "T":"thymine", "G":"guanine", "C":"cytosine"}

For clarity it can be helpful to spread a statement like the one above over several lines, because the vertical arrangement helps to emphasize each key-value pair:

In [53]:
nuc2name = {
    "A": "adenine", 
    "T": "thymine", 
    "G": "guanine", 
    "C": "cytosine"
}

Having defined a dictionary we can look up the values associated with a key using the following syntax:

In [54]:
nuc2name["A"]

'adenine'

In [55]:
nuc2name["C"]

'cytosine'
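
Looking up a key that isn't in the dictionary raises a `KeyError`. The `in` operator lets you test for a key first, and assigning to a new key adds an entry. A short sketch, assuming we want to add uracil ("U") to the mapping (that entry is not part of the original example):

```python
nuc2name = {"A": "adenine", "T": "thymine", "G": "guanine", "C": "cytosine"}

print("A" in nuc2name)  # True
print("U" in nuc2name)  # False

try:
    nuc2name["U"]  # "U" is not a key yet, so this raises an error
except KeyError as err:
    print("KeyError:", err)

nuc2name["U"] = "uracil"  # assigning to a new key adds a key-value pair
print(nuc2name["U"])      # uracil
```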

We can use a for loop or list comprehension to iterate over the keys in a dictionary:

In [56]:
for key in nuc2name:
    print(f"{key} stands for {nuc2name[key]}")

A stands for adenine
T stands for thymine
G stands for guanine
C stands for cytosine


If we want to get keys and associated values simultaneously we can use the `items` method associated with dictionaries:

In [57]:
for (key, value) in nuc2name.items():
    print(f"{key} stands for {value}")

A stands for adenine
T stands for thymine
G stands for guanine
C stands for cytosine


The keys and values of a dictionary don't have to be of the same type. For example, here's a dictionary mapping the names of fruits to their prices in dollars:

In [58]:
fruit2price = {
    "apples":  1.50,
    "bananas": 0.99,
    "cherries": 3.99,
    "pineapple": 8.99,
}

The keys of a dictionary must be immutable (more precisely, hashable) objects like strings or numbers, but the values can be arbitrary Python objects, such as lists:

In [59]:
letter2words = {
    "a": ["apple", "aardvark", "apricot"],
    "b": ["banana", "baby"],
    "c": ["cobra", "copper", "capricious", "carrot"]
}

for key in letter2words:
    print(f"Here are some words I know with the letter {key.upper()}: ", letter2words[key])

Here are some words I know with the letter A:  ['apple', 'aardvark', 'apricot']
Here are some words I know with the letter B:  ['banana', 'baby']
Here are some words I know with the letter C:  ['cobra', 'copper', 'capricious', 'carrot']
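
To see the restriction on keys in action, the sketch below tries a string, an integer, and a list as keys. The dictionary `lookup` and its contents are made up for illustration. The exact wording of the error message differs between Python versions, so it is not reproduced here:

```python
lookup = {}  # hypothetical dictionary, for illustration only

lookup["ATG"] = "start codon"  # a string key works
lookup[3] = "codon length"     # an integer key works

try:
    lookup[["A", "T", "G"]] = "start codon"  # a list is mutable, so it can't be a key
except TypeError as err:
    print("TypeError:", err)

print(len(lookup))  # 2 (the failed assignment added nothing)
```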


#### Representing nucleotide complements using a dictionary

Back to the task at hand. Earlier we wrote a function called `complement_nucleotide` that, given an input nucleotide, returns that nucleotide's complement.  This was in essence a mapping function.  Here's how we could represent that same mapping using a dictionary:

In [60]:
nuc2complement = {
    "A": "T", 
    "T": "A",
    "G": "C",
    "C": "G"}

With such a dictionary, we could rewrite the `complement_sequence` function like so:

In [61]:
def complement_sequence_dict(seq):
    seq = seq.upper()
    compseq = ""
    for nuc in seq:
        if nuc in nuc2complement:  # test whether the character is in our dictionary
            compseq += nuc2complement[nuc]
        else:
            compseq += "N"  # if not in dict, return ambiguity code
    return compseq

complement_sequence_dict("ATGCCCC")

'TACGGGG'

Or by combining a dictionary with a list comprehension:

In [62]:
def complement_sequence_dictLC(seq):
    seq = seq.upper()
    return "".join([nuc2complement[nuc] 
                    if nuc in nuc2complement
                    else "N" 
                    for nuc in seq])
    
complement_sequence_dictLC("ATGCCCC")

'TACGGGG'
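
A possible further simplification, not covered in the live coding, uses the dictionary method `get`, which returns a default value when the key is missing. That folds the membership test and the lookup into one call. The function name `complement_sequence_get` is made up for this sketch:

```python
nuc2complement = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_sequence_get(seq):  # hypothetical name, not from the lecture
    seq = seq.upper()
    # get(key, default) returns the default ("N") when key is not in the dictionary
    return "".join([nuc2complement.get(nuc, "N") for nuc in seq])

print(complement_sequence_get("ATGCCCC"))  # TACGGGG
print(complement_sequence_get("atgx"))     # TACN
```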