# Data types

The term "data types" refers to the types of values that can be represented in a computer program.

Familiar data types include numbers and strings. Many programming languages also include a built-in data type for representing True/False values (usually called a "Boolean" or "logical" data type).

In [1]:
10 # this is a number

10

In [2]:
"Hello world" # this is a string

'Hello world'

In [3]:
False # this is a Boolean value

False

When working in Python we can query a value's data type using the function `type()` as illustrated below:

In [4]:
type("hello")  # str (short for string)

str

In [5]:
type(True) # bool (short for boolean)

bool

In [6]:
type(1)  # number data types come in two flavors, int (integers, for representing whole numbers)

int

In [7]:
type(1.0) # and float (floating point value, for representing real numbers)

float

A value's type constrains what sort of operations we can do with that value.  For example, we can add two numeric values:

In [8]:
3 + 10

13

But we can't add a number and a string:

In [9]:
3 + "10"

TypeError: unsupported operand type(s) for +: 'int' and 'str'
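If the string really does hold a number, one way around this error is to convert one of the operands explicitly. The sketch below uses the built-in `int()` and `str()` functions, which are not introduced in the text above; which conversion is appropriate depends on whether you want arithmetic or concatenation.

```python
# Convert explicitly so that both operands have the same type
total = 3 + int("10")     # string -> int, then numeric addition
joined = str(3) + "10"    # int -> string, then concatenation

print(total)   # 13
print(joined)  # 310
```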

Using the `+` operator with two strings concatenates them:

In [None]:
"foo" + "bar"

# Variables

Variables allow us to attach names to values we want to compute on.  Variables provide both a means of "holding onto" a value and a degree of abstraction that allows us to generalize computations, in the same way that using variable names like $x$ or $y$ in mathematical functions allows us to abstract away the specific values.


In [10]:
x = 3
y = 10
x + y

13

In [11]:
s1 = "baz"
type(s1)

str

# Numbers

Python floats and integers support arithmetic operations using standard operator symbols:

In [12]:
10 + 3 ## addition

13

In [13]:
10 - 3 ## subtraction

7

In [14]:
10 * 3 ## multiplication

30

In [15]:
10 / 3  ## division

3.3333333333333335

In [16]:
10 // 3  ## integer division

3

In [17]:
10**3 ## exponentiation

1000

In [18]:
10 - 3 / 5  ## operator precedence matters

9.4

In [19]:
(10 - 3) / 5  ## use parentheses to disambiguate where needed

1.4
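Integer division discards the remainder. As a small sketch (the `%` remainder operator and the built-in `divmod()` function are additions that the text above does not cover), the quotient and remainder together recover the original number:

```python
quotient = 10 // 3     # integer division
remainder = 10 % 3     # remainder left over after integer division

print(quotient, remainder)        # 3 1
print(quotient * 3 + remainder)   # 10
print(divmod(10, 3))              # (3, 1)

# note that // rounds down (toward negative infinity), not toward zero
print(-7 // 2)                    # -4
```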

# Library imports

Many Python functions are available through libraries (also called "modules" or "packages"). A standard Python distribution includes a large number of libraries described in the [Python Standard Library](https://docs.python.org/3/library/index.html) documentation.  The Anaconda Python distribution that we're using for this class also includes a number of additional "third party" libraries for scientific computing, including NumPy, SciPy, and Matplotlib (the home pages for these libraries can be accessed from the "Help" menu in Jupyter).

To make a library's functions available for our use we need to import the library, like so:

In [20]:
import math

Having imported the `math` library, every variable and function defined in the `math` library can be referenced by adding the prefix `math.` to its name.  For example, the `math` library defines a variable called `pi` representing the value of $\pi$. We can access this variable like so:

In [21]:
math.pi

3.141592653589793

The `math` library also provides many common mathematical functions, as described in the [documentation](https://docs.python.org/3/library/math.html).  Here are a few examples:

In [22]:
math.sin(math.pi)  # sine(π)

1.2246467991473532e-16

In [23]:
math.cos(math.pi)  # cosine(π)

-1.0

In [24]:
math.ceil(3.5)  # ceiling function rounds floats up to nearest integer

4

In [25]:
math.floor(3.5) # round down to the nearest integer

3
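The value of `math.sin(math.pi)` shown above is a tiny non-zero number rather than exactly 0, because floats have limited precision. A minimal sketch of how one might compare such results, using `math.isclose` (a standard library function not mentioned above; the tolerance `1e-12` is an arbitrary choice for illustration):

```python
import math

value = math.sin(math.pi)

exactly_zero = (value == 0.0)                          # exact comparison fails
nearly_zero = math.isclose(value, 0.0, abs_tol=1e-12)  # comparison with a tolerance

print(exactly_zero)  # False
print(nearly_zero)   # True
```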

## Importing specific functions

If we're using a function repeatedly it can be tedious to type the name of the library each time. We can deal with this by importing specific variables or functions directly from a library. After doing this we no longer need to add the library name prefix for those functions:

In [26]:
from math import sin, cos, tan, pi  # import sin, cos, tan, and pi into the current namespace

In [27]:
sin(0.5*pi)  # now we can refer to sin and pi w/out writing math.pi and math.sin

1.0

## Aliasing a library name

If you don't want to explicitly import functions from a library, but still want to avoid typing a long library name repeatedly, another alternative is to give the library a short "alias" when you import it. This can be done like so:

In [28]:
import math as m # "m" is now an alias for "math"

Now we can refer to the `math` functions with this shortened name:

In [29]:
m.sin(m.pi/4) 

0.7071067811865475

## Importing everything from a library

A final way to use the `import` statement is to import everything from a library, using a "star import".  This can be convenient, but you need to be aware that doing so will overwrite any variables you've defined that have the same names as those in the library you're importing. Thus this is only recommended in interactive sessions.

In [30]:
from math import *

Having imported all the functions and constants from the `math` library, we no longer need to prefix references to them with the library name:

In [31]:
sin(pi/4)  # note lack of m.sin, m.pi, etc.

0.7071067811865475

# Strings

## String creation

Strings can be created by typing characters (ASCII or Unicode) surrounded by either single quotes or double quotes.

In [32]:
s1 = "ATGC"  # double quotes
s1

'ATGC'

In [33]:
s2 = 'ATGC'  # single quotes -- equivalent to above
s2

'ATGC'

In [34]:
s1 == s2 # note the use of the double equal sign to test equality

True

Being able to use both types of quotes means you can nest quotes in strings:

In [35]:
s2 = "He said, 'Hello world'"
s2

"He said, 'Hello world'"

Note the difference between showing the Python representation of a string (see example above) and printing the string. Here's the same string passed through the `print` function:

In [36]:
print(s2)

He said, 'Hello world'


Notice that the outer quotes are no longer visible.  This is a subtle difference in this case, but see below...

## Escape sequences

Some special characters such as newlines and tabs are written using "backslash escapes".  For example, a newline is written as `\n` and a tab is written as `\t`.

In [37]:
s3 = "There will be a newline here\nAnd then there will be more text..."
s3

'There will be a newline here\nAnd then there will be more text...'

Again, compare the Python representation of the string in the output above, to the printed representation below:

In [38]:
print(s3)

There will be a newline here
And then there will be more text...


## Triple quoted strings 

Triple quoted strings allow you to use special characters like tabs and newlines in a string without explicitly writing their backslash escapes.

In [39]:
s3 = """
’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe. 
"""

In [40]:
print(s3)


’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe. 



## Raw strings

In a raw string, backslash escapes remain uninterpreted. Raw strings are prefixed with `r` as shown below:

In [41]:
s4 = r"This is a raw string. I can write \n here without generating a newline"
print(s4)

This is a raw string. I can write \n here without generating a newline
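One way to see the difference is to compare lengths. In this small sketch (the variable names are just for illustration), the ordinary string stores a single tab character, while the raw string stores a backslash followed by the letter `t`:

```python
normal = "a\tb"    # \t is interpreted as one tab character
raw = r"a\tb"      # the backslash and the t are kept as two characters

print(len(normal))       # 3
print(len(raw))          # 4
print(raw == "a\\tb")    # True: in a normal string, \\ is an escaped backslash
```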


## Length and Indexing

One of the properties of a string is its length, i.e. the number of characters in the string.  We can get the length of a string using the `len()` function (`len()` also works with other data types/structures that have a length).

In [42]:
s1 = "ATGC"
len(s1)

4

The act of retrieving a character from a particular position in a string is called "indexing" (the same term applies to getting the elements of other objects with a length property).  

Python, like many programming languages, uses "0-indexing" which means that the first character of a string is indexed by the numeric value 0 (the "zeroth element").  To access the elements of a string we write the index in square brackets as illustrated below:

In [43]:
s1[0] # 1st element

'A'

In [44]:
s1[1] # 2nd element (at index 1)

'T'

Because of 0-indexing, the largest valid index of a string is not equal to its length, but rather its length minus one.

In [45]:
s1[4]  # s1 is of length 4 but this raises an IndexError

IndexError: string index out of range

In [46]:
s1[3] 

'C'

We can index backwards from the end of a string using negative values, starting at -1:

In [47]:
s1[-1]  # last element, regardless of length

'C'

In [48]:
s1[-3]  # third element from the end

'T'
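A negative index `-k` refers to the same element as the index `len(s) - k`. The following sketch checks that relationship for the string used above:

```python
s1 = "ATGC"
n = len(s1)

print(s1[-1] == s1[n - 1])   # True
print(s1[-3] == s1[n - 3])   # True
print(s1[-n])                # A  (the most negative valid index is -len(s1))
```

Going one step further, `s1[-5]` raises an `IndexError`, just as `s1[4]` does.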

## Slicing

We can get a range of the contents of a string using "slicing".  Slicing is like indexing but we specify a start index and an end index.  The element at the start index is included in the returned string but the element at the end index is *not* included (i.e. slicing is inclusive with respect to the start index and exclusive with respect to the end index).

In [49]:
s2 = "ATGCCCTT"

In [50]:
s2[0:2] # get the first two elements 
        # (i.e. up to but not including the element at index 2)

'AT'

In [51]:
s2[2:4] # everything from index 2 up to but not including index 4

'GC'

There are convenient shorthands for slicing from the first index, or up to and including the last element:

In [52]:
s2[:3]  # can drop the 0 before the colon, 0 will be assumed as first index

'ATG'

In [53]:
s2[3:]  # can drop the index after the colon, length will be assumed as the end index

'CCCTT'

### Strides

When slicing you can also specify a "stride" as a third indexing term, which specifies the step size between elements (by default 1).  For example, the code below gets every 2nd element from index 0 (inclusive) to 5 (exclusive).

In [54]:
s2[0:5:2]  # every other element

'AGC'

In [55]:
s2[::-1]  # negative strides go from the last to the first index

'TTCCCGTA'
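Unlike indexing, slicing does not raise an error when an index falls outside the string; the slice is simply cut off at the ends. A brief sketch of this behavior, reusing the `s2` string from above:

```python
s2 = "ATGCCCTT"

tail = s2[5:100]     # the end index is clipped to the length of the string
nothing = s2[10:]    # a start index past the end gives an empty string

print(tail)                     # CTT
print(nothing == "")            # True
print(s2[:3] + s2[3:] == s2)    # True: two complementary slices rebuild the string
```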

### Strings are immutable

Trying to change the elements of a string raises an error, because string objects in Python are immutable:

In [56]:
s = "ATG"
s[0] = "T"

TypeError: 'str' object does not support item assignment
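To get the effect of "changing" a character, we can instead build a new string from pieces of the old one. A minimal sketch using slicing and concatenation (`s_new` is just an illustrative name):

```python
s = "ATG"
s_new = "T" + s[1:]   # new first character, followed by the rest of s

print(s_new)  # TTG
print(s)      # ATG  (the original string is unchanged)
```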

### String concatenation and repetition

Concatenating two strings using the `+` operator forms a new string:

In [57]:
s4 = "ATG" 
s5 = "GTA"
s4 + s5

'ATGGTA'

The `*` operator when applied to a string and an integer repeats a string: 

In [58]:
repetitive = "TA" * 4
repetitive

'TATATATA'

### String methods

"Methods" are functions that are "attached to" a data type.  Every instance of that data type "carries around" these methods.

For example, strings have a `count()` method that counts the number of times a substring appears in the string. In the examples below note how the method call is invoked by specifying the value or variable holding the string, followed by a period, the method name, and any additional inputs specified in parentheses.

In [59]:
"ATGAAT".count("A")  # Read this as: In the string "ATGAAT" count the number of times "A" appears

3

In [60]:
# This is equivalent to the above, but I've assigned the string to a variable first
s1 = "ATGAAT"
s1.count("A")

3

A full list of methods that apply to strings can be found in the Python [documentation on strings](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str). Below I review some of the most commonly used methods:

#### find

Finds the lowest index at which the specified substring appears. If the substring is not in the string, `find` returns -1.

In [61]:
"ATGAAT".find("T")

1

In [62]:
"ATGAAT".find("C")

-1
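If we only need to know whether a substring is present, and not where, the `in` keyword gives a True/False answer directly. The sketch below also uses the optional second argument of `find`, which sets the index at which the search starts:

```python
seq = "ATGAAT"

print("GA" in seq)         # True
print("C" in seq)          # False
print(seq.find("GA"))      # 2  (substrings longer than one character work too)
print(seq.find("T", 2))    # 5  (start searching at index 2, skipping the T at index 1)
```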

#### split

Splits a string on the specified separator. By default it splits on whitespace (spaces, tabs, newlines).

In [63]:
"how now brown cow".split()

['how', 'now', 'brown', 'cow']

In [64]:
"one, two, three".split(",") # notice the spaces in the output string

['one', ' two', ' three']

In [65]:
"one potato two potato three".split(" potato ") # notice the extra space around potato

['one', 'two', 'three']
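Because the separator can be more than one character long, one way to avoid the leading spaces seen in the comma example is to split on a comma followed by a space. This sketch also shows the optional second argument of `split`, which limits the number of splits performed:

```python
text = "one, two, three"

clean = text.split(", ")          # separator is comma + space
first_only = text.split(", ", 1)  # perform at most one split

print(clean)        # ['one', 'two', 'three']
print(first_only)   # ['one', 'two, three']
```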

There's a variant of `split` that specifically splits at newlines:

In [66]:
quote = """\
Twas brillig and 
the slithey toves
Did gyre and gimble
in the wabe"""

quote.splitlines()

['Twas brillig and ',
 'the slithey toves',
 'Did gyre and gimble',
 'in the wabe']

#### replace

Replace all instances of the first argument with the 2nd argument.

In [67]:
r1 = "IT was The besT of Times, iT was The worsT of Times..."
r1.replace("T", "t")

'It was the best of times, it was the worst of times...'

Note that `replace` doesn't change the original string:

In [68]:
r1

'IT was The besT of Times, iT was The worsT of Times...'

#### strip

`strip` removes whitespace characters at the beginning and end of a string:

In [69]:
r2 = "    <- Look at all that whitespace at the beginning and ending ->    \n"
r2

'    <- Look at all that whitespace at the beginning and ending ->    \n'

In [70]:
r2.strip()

'<- Look at all that whitespace at the beginning and ending ->'
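The related methods `lstrip` and `rstrip` (not covered above) remove whitespace from only the left or only the right end. Since each of these methods returns a new string, calls can also be chained. A small sketch with a made-up example string:

```python
r2 = "    padded text    \n"

left = r2.lstrip()                        # strip the beginning only
right = r2.rstrip()                       # strip the end only
cleaned = r2.strip().replace(" ", "_")    # chain two method calls

print(repr(left))    # 'padded text    \n'
print(repr(right))   # '    padded text'
print(cleaned)       # padded_text
```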

#### join

`join` is a method that takes as input a list (or other sequence) of strings, and concatenates them with the string used to call the method. This is best illustrated by example:

In [71]:
words = ["how", "now", "brown", "cow"] # create a list of words

In [72]:
" ".join(words)  # pass words as the argument to join called on the string containing a single space

'how now brown cow'

In [73]:
"_".join(words) # same thing but using underscore as the calling string

'how_now_brown_cow'

In [74]:
print("\n".join(words)) # or using newlines (best visualized by printing the string)

how
now
brown
cow
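`join` can be thought of as the inverse of `split`. It only accepts strings, so other values such as numbers have to be converted first (passing a list of integers raises a `TypeError`). A sketch of both points, with illustrative variable names:

```python
sentence = "how now brown cow"
words = sentence.split()
rebuilt = " ".join(words)        # splitting and then joining restores the text

print(rebuilt == sentence)       # True

numbers = [1, 2, 3]
number_strings = []
for n in numbers:
    number_strings.append(str(n))   # convert each number to a string first

csv_line = ",".join(number_strings)
print(csv_line)                  # 1,2,3
```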


#### `upper`, `lower`, and `title`

These methods return a copy of the string converted to upper case, lower case, or title case. (ee cummings would not like you to apply the `upper` method.)

In [75]:
"if you want to shout on the internet write your message in all capitals".upper()

'IF YOU WANT TO SHOUT ON THE INTERNET WRITE YOUR MESSAGE IN ALL CAPITALS'

In [76]:
"EE Cummings would appreciate the lower method".lower()

'ee cummings would appreciate the lower method'

In [77]:
"title case: a study in string reformatting".title()

'Title Case: A Study In String Reformatting'

# Simple IO operations

Python provides a single built-in function, `open`, for file input/output operations. Specific libraries often build on the core functionality provided by the `open` function.  Here we'll briefly explore how to use this function to read and write strings to files.

## Creating a file object and writing to it

In [78]:
# Create the string info we want to write to the file

s = """’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe. """

In [79]:
# open file in write mode, note this will truncate the file
# if it already exists so be careful!

f = open("mytempfile.txt", mode="w") 

In [80]:
f.write(s) # write the string, s, to the file

140

For efficiency reasons, file objects are "buffered", so we must remember to close the file afterward to make sure that the data is properly written and the file object finalized.

In [81]:
f.close() # flush and close the file object

### With statements for working with file objects

Failure to close a file object is a frequent source of bugs, so the Python language designers developed a construct called the `with` statement that will automatically close a file when the with statement ends (`with` statements have other uses related to error handling, but here we focus solely on their application to file handling).

In [82]:
# more idiomatic approach, when using the "with" construct you 
# don't need to remember to close the file object as it
# automatically gets closed when the with block ends

with open("mytempfile.txt", mode="w") as f:
    f.write(s)
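We can confirm that the `with` statement closes the file by inspecting the file object's `closed` attribute. The sketch below writes to a throwaway file in a temporary directory (the `tempfile` module and the file name `example.txt` are choices made for this illustration, not part of the example above):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt")

    with open(path, mode="w") as f:
        f.write("some text\n")
        closed_inside = f.closed   # still open inside the with block

    closed_after = f.closed        # closed as soon as the block ends

print(closed_inside)  # False
print(closed_after)   # True
```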

## Reading from file objects

To read string data from a file we again use the `open` function, but specifying the mode as `r` (read).


In [83]:
with open("mytempfile.txt", mode="r") as f:
    s2 = f.read()

In [84]:
s2

'’Twas brillig, and the slithy toves\n      Did gyre and gimble in the wabe:\nAll mimsy were the borogoves,\n      And the mome raths outgrabe. '

## Reading and writing file lines as lists

The `read` and `write` methods above read or write all of their data at once.  There are corresponding `readlines()` and `writelines()` methods for dealing with line-oriented I/O:

In [85]:
with open("mytempfile.txt", "r") as f:
    lines = f.readlines()
    
# notice we get a list of strings rather than a single string    
lines  

['’Twas brillig, and the slithy toves\n',
 '      Did gyre and gimble in the wabe:\n',
 'All mimsy were the borogoves,\n',
 '      And the mome raths outgrabe. ']

Here's the corresponding function for writing a list of strings to a file. Note that line separators are NOT added by `writelines` so we had to explicitly add newlines to each string (NOTE: in actual use I'd probably use a list comprehension to add the newlines rather than do that for each string separately; see below):

In [86]:
s3 = ["how\n", "now\n", "brown\n", "cow\n"] 

with open("mytempfile2.txt", "w") as f:
    f.writelines(s3)
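As a sketch of the list comprehension approach mentioned above, the code below adds the newlines programmatically, writes the lines, and reads them back. It uses a temporary directory so that no existing file is touched; the file name `words.txt` and the use of `rstrip("\n")` to drop the newlines on reading are choices made for this illustration.

```python
import os
import tempfile

words = ["how", "now", "brown", "cow"]
lines = [w + "\n" for w in words]   # add a newline to each string

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "words.txt")

    with open(path, "w") as f:
        f.writelines(lines)

    with open(path, "r") as f:
        # iterating over a file object yields one line at a time
        read_back = [line.rstrip("\n") for line in f]

print(read_back)  # ['how', 'now', 'brown', 'cow']
```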

# Data structures

The term "data structure" refers to the different ways collections of values can be stored in memory or accessed by the user. 

Data structures can be homogeneous or heterogeneous with respect to the data types they hold. Different data structures are used to facilitate particular modes of access to the values they hold, or to represent different conceptual ways of organizing information.

## Lists

One of the simplest data structures in Python is the list. Lists are ordered collections of values, which may be heterogeneous (of different types).

You create a list by enclosing the items to be included in the list in square brackets, separating each item with a comma:

In [87]:
[10, 20, 30] # a list of numbers

[10, 20, 30]

In [88]:
["hello", "world"] # a list of strings

['hello', 'world']

In [89]:
[1, 2, "tie", "your", "shoe", True] # list of numbers, strings, and bools

[1, 2, 'tie', 'your', 'shoe', True]

In [90]:
l1 = ["a", "b", "c"]  # assign the list to the variable l1
l1

['a', 'b', 'c']

Lists can even include other lists, or in fact any arbitrary Python object (like functions)!

In [91]:
number_list = [1, 2, 3]
string_list = ["a", "b", "c", "d"]

list_of_lists = [number_list, string_list]
list_of_lists

[[1, 2, 3], ['a', 'b', 'c', 'd']]

### Indexing and slicing lists

Indexing and slicing work with lists just like they do with strings. Both lists and strings are 0-indexed.

In [92]:
len(number_list)

3

In [93]:
# indexing starts from 0, so this gives the first element.
number_list[0]

1

In [94]:
number_list[-1]  # last element

3

In [95]:
len(string_list)

4

In [96]:
string_list[1:3]

['b', 'c']

In [97]:
list_of_lists[-1][1:]  # can you figure what's going on here?

['b', 'c', 'd']

### Lists are mutable

Python lists are mutable, meaning the elements in them (and the length of the list) can be changed.

In [98]:
l1 = [9, 0, False, 99, -10]
l1

[9, 0, False, 99, -10]

In [99]:
l1[1] = -9 # change the element at index 1
l1

[9, -9, False, 99, -10]

In [100]:
l1[-1] = 1000  # change the last element
l1

[9, -9, False, 99, 1000]

In [101]:
l1[2:] = [1, 0, 1]  # change multiple elements simultaneously using slicing
l1

[9, -9, 1, 0, 1]

Mutability has some subtle consequences when you nest one list within another. For example, let's create a list that includes another list (`l1`) nested within it:

In [102]:
l2 = [3, 4, l1]
l2

[3, 4, [9, -9, 1, 0, 1]]

Now if we change `l1`, that change gets reflected in `l2`:

In [103]:
l1[0] = 3.1415
l2

[3, 4, [3.1415, -9, 1, 0, 1]]

This happens because behind the scenes `l2` doesn't include a copy of `l1`, but rather a reference to `l1` (this is a memory-efficient way to allow for nested lists).
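If we want the nested list to be independent of the original, one option is to store a copy instead. A minimal sketch, using a full slice (`l1[:]`) to make a shallow copy and the `is` operator to test whether two names refer to the same object (the names `by_reference` and `by_copy` are illustrative):

```python
l1 = [9, -9, 1, 0, 1]

by_reference = [3, 4, l1]     # stores a reference to l1
by_copy = [3, 4, l1[:]]       # stores a (shallow) copy of l1

l1[0] = 3.1415                # modify the original list

print(by_reference)           # [3, 4, [3.1415, -9, 1, 0, 1]]
print(by_copy)                # [3, 4, [9, -9, 1, 0, 1]]
print(by_reference[2] is l1)  # True: the very same list object
print(by_copy[2] is l1)       # False: a separate list
```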

### Testing for membership in a list

You can check if a specific value is an element of a list using the `in` keyword:

In [104]:
l1

[3.1415, -9, 1, 0, 1]

In [105]:
1 in l1 # does l1 include the value 1 (at least once)?

True

You can get the first index at which a given element appears using the `index` method (a method is a function "attached to" a Python object):

In [106]:
l1.index(1)  # 1 is found at index 2 (also at index 4, but index returns the first occurrence)

2

If the value doesn't exist in the list, the `index` method raises an error:

In [107]:
l1.index(-99)

ValueError: -99 is not in list
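One way to avoid this error is to test for membership with `in` before calling `index`. In the sketch below, using `None` to signal "not found" is a convention chosen for this example:

```python
l1 = [3.1415, -9, 1, 0, 1]
target = -99

if target in l1:
    position = l1.index(target)
else:
    position = None   # our own convention for "not found"

print(position)  # None
```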

### Adding and removing items from lists

The `append` method allows you to add new items to a list.

In [108]:
l1 = [1, 2, 4]
l1.append(1)
l1

[1, 2, 4, 1]

The `remove` method allows you to delete the first appearance of an item in a list:

In [109]:
l1.remove(1)  # notice the first value of 1 gets removed, not all values
l1

[2, 4, 1]

To delete a specific object at a given index in a list, use the `del` keyword as shown below:

In [110]:
del l1[1]  # delete the item at index 1 from the list
l1

[2, 1]
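Lists have a few other methods for adding and removing items that are not covered above. This short sketch shows `insert` (add at a given index), `pop` (remove and return an item, by default the last one), and `extend` (append every item of another list):

```python
l1 = [1, 2, 4]

l1.insert(2, 3)        # insert the value 3 at index 2
print(l1)              # [1, 2, 3, 4]

last = l1.pop()        # remove and return the last item
print(last)            # 4
print(l1)              # [1, 2, 3]

l1.extend([10, 20])    # add each item of another list to the end
print(l1)              # [1, 2, 3, 10, 20]
```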

### Concatenating lists

You can concatenate (add together) lists to create new lists:

In [111]:
l2 = [2, 4, 6, 8]
l3 = l1 + l2
l3

[2, 1, 2, 4, 6, 8]

Note that concatenating doesn't change the original lists:

In [112]:
l1

[2, 1]

In [113]:
l2

[2, 4, 6, 8]

### Useful functions for numeric lists

Lists that include only numeric values can be queried for their maximum or minimum elements, and summed:

In [114]:
l4 = [2, 4, 6, 88, 9, -10]

In [115]:
max(l4)

88

In [116]:
min(l4)

-10

In [117]:
sum(l4)

99

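The `range` function generates a sequence of integers over a given range. In Python 3 it returns a lazy `range` object rather than a list, so we wrap it in `list()` to see the values: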

In [118]:
list(range(10))  # all the values from 0 (implicit) up to but not including 10

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [119]:
list(range(5, 15))  # all the values from 5 up to but not including 15

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

In [120]:
list(range(4, 20, 2)) # all the values from 4 to 20 (exclusive) in steps of 2

[4, 6, 8, 10, 12, 14, 16, 18]
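A `range` object supports many of the same operations as a list (length, indexing, membership tests) without storing every value in memory. The step can also be negative, which counts downward. A brief sketch:

```python
r = range(4, 20, 2)

print(r)          # range(4, 20, 2)
print(len(r))     # 8
print(r[1])       # 6
print(16 in r)    # True

countdown = list(range(10, 0, -3))   # negative step counts down
print(countdown)  # [10, 7, 4, 1]
```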

### Indexing one list with another

A fairly common operation in the data sciences is to index one list with another.  For example, consider the case where you have two lists: 1) a list of species names, and 2) a list of genome sizes for the respective species in the first list.  If you wanted to compute the name of the species with the biggest genome you could do something like this:

In [121]:
species = ["ecoli", "moss", "yeast", "amoeba", "fruit fly"]
genome_size = [4.6e6, 510e6, 12e6, 34e6, 140e6]

biggest_genome = species[genome_size.index(max(genome_size))]  # **KEY**
biggest_genome

'moss'

Mentally unpack the statement marked `**KEY**` above to make sure you understand how it works. Note this computation involves two functions/methods (`max` and `index`) as well as indexing one list based on computations on another list.

If you were to translate that statement into English it might read something like:

> To find the species with the largest genome, find the maximum genome size, and then query the genome_size list to find out the index of the element with that value. Having calculated that index, pass it to the corresponding species list to retrieve the name of the species.

Phew! That's pretty wordy.  Here's a case where the code allows us to express this idea much more succinctly.
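For comparison, here is the `**KEY**` statement unpacked into one step per line. The intermediate names `largest` and `position` are introduced only for this illustration:

```python
species = ["ecoli", "moss", "yeast", "amoeba", "fruit fly"]
genome_size = [4.6e6, 510e6, 12e6, 34e6, 140e6]

largest = max(genome_size)               # step 1: the maximum genome size
position = genome_size.index(largest)    # step 2: where that value sits in the list
biggest_genome = species[position]       # step 3: the species at the same position

print(position)        # 1
print(biggest_genome)  # moss
```

This only works because the two lists are kept in the same order, so that position `i` in one list corresponds to position `i` in the other.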


## Tuples

Tuples are similar to lists, but are immutable.  This immutability constraint can make them somewhat more efficient to work with, though it obviously limits how we can use them.

In [122]:
t1 = ("a","b")  # a 2-tuple 
t2 = (2,3,4)  # a 3-tuple
t3 = sin(math.pi/2), cos(math.pi) # parentheses are optional
t4 = t1,  # a 1-tuple of tuples!

In [123]:
t1

('a', 'b')

In [124]:
t2

(2, 3, 4)

In [125]:
t3

(1.0, -1.0)

In [126]:
t4

(('a', 'b'),)

### Tuple length and indexing

Tuples have a length:

In [127]:
len(t1), len(t2), len(t3), len(t4)  # note that we're returning the lengths as a tuple!

(2, 3, 2, 1)

Tuples are indexed in a manner similar to lists:

In [128]:
t1[0]  

'a'

In [129]:
t1[1]

'b'

In [130]:
# assignment doesn't work after creation -- TypeError!
t1[0] = 'c'

TypeError: 'tuple' object does not support item assignment

### Tuple destructuring

A frequent idiom in Python code is "tuple destructuring", in which the elements of a short tuple are assigned to variables, as illustrated below:

In [131]:
first, second = t1

In [132]:
first

'a'

In [133]:
second

'b'
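Two common uses of destructuring, sketched below with made-up values, are swapping two variables without a temporary variable and unpacking pairs while looping over a list of tuples:

```python
a, b = 1, 2
a, b = b, a      # the right-hand side is packed into a tuple, then unpacked
print(a, b)      # 2 1

pairs = [("A", "adenine"), ("T", "thymine")]
letters = []
for letter, name in pairs:   # each 2-tuple is destructured on every iteration
    letters.append(letter)
    print(letter, "->", name)

print(letters)   # ['A', 'T']
```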

## Dictionaries

A dictionary (sometimes called a hashmap or simply a map) is a data structure which can be used to represent mappings or relationships between pairs of objects. The things you're mapping from are often called "keys", while the things you're mapping to are often referred to as "values". A dictionary is thus a collection of key-value pairs.

For example, you might want to maintain a mapping between single letter DNA abbreviations and the full names of the nucleotides they represent. Here's how you'd construct such a mapping using a dictionary:

In [134]:
# dictionaries are constructed inside curly brackets, each key:value pair is separated by a comma
nuc2name = {"A": "adenine", "T":"thymine", "G":"guanine", "C":"cytosine"}

Sometimes for clarity it can be helpful to reformat a statement like the one above as follows, because the vertical arrangement helps to emphasize each key-value pair:

In [135]:
# equivalent to previous code
# reformatted to make it more readable

nuc2name = {
    "A": "adenine", 
    "T": "thymine", 
    "G": "guanine", 
    "C": "cytosine"
}

Having defined a dictionary we can look up the values associated with a key using the following syntax:

In [136]:
nuc2name["A"]

'adenine'

In [137]:
nuc2name["C"]

'cytosine'
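Looking up a key that is not in the dictionary with square brackets raises a `KeyError`. The sketch below shows two ways to guard against that, the `in` keyword and the `get` method with a default value, and also how to add a new key-value pair by assignment (the `"U"` entry is added purely for illustration):

```python
nuc2name = {"A": "adenine", "T": "thymine", "G": "guanine", "C": "cytosine"}

print("U" in nuc2name)                # False
print(nuc2name.get("U", "unknown"))   # unknown  (default returned for a missing key)

nuc2name["U"] = "uracil"              # assignment adds a new key-value pair
print(nuc2name["U"])                  # uracil
print(len(nuc2name))                  # 5
```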

We can use a for loop or list comprehension to iterate over the keys in a dictionary:

In [138]:
for key in nuc2name:
    print(f"{key} stands for {nuc2name[key]}")  

A stands for adenine
T stands for thymine
G stands for guanine
C stands for cytosine


If we want to get keys and associated values simultaneously we can use the `items` method associated with dictionaries:

In [139]:
for (key, value) in nuc2name.items():
    print(f"{key} stands for {value}")

A stands for adenine
T stands for thymine
G stands for guanine
C stands for cytosine


The keys and values of a dictionary don't have to be of the same type. For example, here's a dictionary mapping the names of fruits to their prices in dollars:

In [140]:
fruit2price = {
    "apples":  1.50,
    "bananas": 0.99,
    "cherries": 3.99,
    "pineapple": 8.99,
}
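Dictionaries with numeric values are also a natural way to keep tallies. As a sketch, the loop below counts how often each character occurs in a short DNA string; `counts.get(base, 0)` returns 0 the first time a base is seen (the sequence and variable names are made up for this example):

```python
seq = "ATGAAT"
counts = {}

for base in seq:
    # look up the current tally (0 if this base hasn't been seen yet) and add 1
    counts[base] = counts.get(base, 0) + 1

print(counts)  # {'A': 3, 'T': 2, 'G': 1}
```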

The keys of a dictionary must be immutable (more precisely, hashable) objects like strings or numbers, but the values can be arbitrary Python objects, such as lists:

In [141]:
letter2words = {
    "a": ["apple", "aardvark", "apricot"],
    "b": ["banana", "baby"],
    "c": ["cobra", "copper", "capricious", "carrot"]
}

for key in letter2words:
    print(f"Here are some words I know with the letter {key.upper()}: ", letter2words[key])


Here are some words I know with the letter A:  ['apple', 'aardvark', 'apricot']
Here are some words I know with the letter B:  ['banana', 'baby']
Here are some words I know with the letter C:  ['cobra', 'copper', 'capricious', 'carrot']


## Sets

Another useful built-in data structure in Python is the set.  Set objects only allow one instance of a given object, and support standard set-theoretic operations like union, intersection, etc.

In [142]:
set1 = {1,2,3,3,3}       # set creation
set2 = set([3,4,5,9,10]) # or by creating a set from another object like a list


In [143]:
set1

{1, 2, 3}

In [144]:
set2

{3, 4, 5, 9, 10}

In [145]:
set1 & set2 ## intersection

{3}

In [146]:
set1 | set2 ## union

{1, 2, 3, 4, 5, 9, 10}

In [147]:
set1 - set2  ## set difference, i.e. elements of set1 that are NOT in set2

{1, 2}
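Because a set keeps only one instance of each value, converting a list to a set is a quick way to find its unique elements. The sketch below also shows membership testing with `in` and the `^` operator (symmetric difference, not covered above), which gives the elements found in exactly one of the two sets. `sorted()` is used only to get a predictable display order, since sets are unordered.

```python
bases = ["A", "T", "G", "A", "A", "T"]
unique_bases = set(bases)          # duplicates are dropped

print(len(unique_bases))           # 3
print(sorted(unique_bases))        # ['A', 'G', 'T']
print("G" in unique_bases)         # True

set1 = {1, 2, 3}
set2 = {3, 4, 5, 9, 10}
either_not_both = set1 ^ set2      # symmetric difference
print(sorted(either_not_both))     # [1, 2, 4, 5, 9, 10]
```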

# Control flow statements

Control flow statements control the order of execution of different pieces of code. They can be used to do things like make sure code is only run when certain conditions are met, to iterate through data structures, to repeat something until a specified event happens, etc. Control flow statements are frequently used when writing functions or carrying out complex data transformation.

### `if`, `if-else`, and `if-elif-else` statements

`if` and `if-else` blocks allow you to structure the flow of execution so that certain expressions are executed only if particular conditions are met.

In [148]:
import random 

x = random.random() # generate random number between 0 and 1
if x < 0.5:
    print("heads")
else:
    print("tails")


heads


`if` statements can be used by themselves, without a matching `else`:

In [149]:
x = random.randint(1,20) # random integer between 1 and 20 (inclusive)
if x == 20:
    print("Critical hit!")
# since there's no else statement nothing happens if x != 20

`elif` is used when there are multiple alternative possibilities.  The final 'else' matches any condition not specified in an if or elif statement:

In [150]:
# random.choice returns a random element of the input sequence
x = random.choice(["A", "T", "G", "C", "N", "R", "Y", "S", "W", "K", "M", "."]) 
# see https://www.bioinformatics.org/sms/iupac.html for IUPAC nucleotide codes


if x == "A":
    base = "Adenine"
elif x == "T":
    base = "Thymine"
elif x == "G":
    base = "Guanine"
elif x == "C":
    base = "Cytosine"
elif x == ".":
    base = "gap"
else:
    base = "ambiguous nucleotide"
    
print(x, "represents", base)
    

A represents Adenine
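When every branch of an `if-elif-else` chain simply maps one value to another, a dictionary lookup can be a compact alternative. In this sketch the dictionary `code2base` is a hypothetical name, and the `get` method supplies the fallback that the final `else` provided; the input is fixed to `"N"` rather than chosen at random so the result is predictable.

```python
code2base = {
    "A": "Adenine",
    "T": "Thymine",
    "G": "Guanine",
    "C": "Cytosine",
    ".": "gap",
}

x = "N"
base = code2base.get(x, "ambiguous nucleotide")   # default plays the role of else

print(x, "represents", base)  # N represents ambiguous nucleotide
```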


### `for` loops

A `for` statement iterates over the elements of a sequence (such as a string or list). A common use of `for` statements is to carry out the same set of computations on each element of a sequence.

In [151]:
words = ["how", "now", "brown", "cow"]
reversed_words = []

for word in words:
    reversed_words.append(word[::-1])

reversed_words

['woh', 'won', 'nworb', 'woc']

In [152]:
for i in range(10):
    print(i, "squared is", i**2)

0 squared is 0
1 squared is 1
2 squared is 4
3 squared is 9
4 squared is 16
5 squared is 25
6 squared is 36
7 squared is 49
8 squared is 64
9 squared is 81
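Two built-in functions that are often combined with `for` loops, though not covered above, are `enumerate`, which pairs each element with its index, and `zip`, which walks through two sequences in parallel. A sketch with a shortened version of the species data used earlier:

```python
species = ["ecoli", "moss", "yeast"]
genome_size = [4.6e6, 510e6, 12e6]

for i, name in enumerate(species):           # (index, element) pairs
    print(i, name)

large = []
for name, size in zip(species, genome_size): # elements at matching positions
    if size > 10e6:
        large.append(name)

print(large)  # ['moss', 'yeast']
```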


### `break` statements

A `break` statement allows you to exit a loop even if it hasn’t completed. This is useful for ending a loop when some criterion has been satisfied. `break` statements are usually nested in `if` statements.

In [153]:
for i in range(100):
    if i > 10:
        print("That's big enough, bub.")
        break
    print(i, "squared is", i**2)

0 squared is 0
1 squared is 1
2 squared is 4
3 squared is 9
4 squared is 16
5 squared is 25
6 squared is 36
7 squared is 49
8 squared is 64
9 squared is 81
10 squared is 100
That's big enough, bub.


### `while` statements

A `while` statement iterates as long as its condition is true.

In [154]:
i = 0
while i <= 10:
    print(i, "squared is", i**2)
    i += 1

0 squared is 0
1 squared is 1
2 squared is 4
3 squared is 9
4 squared is 16
5 squared is 25
6 squared is 36
7 squared is 49
8 squared is 64
9 squared is 81
10 squared is 100
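`while` loops are most useful when we don't know in advance how many iterations are needed. As a sketch, the loop below counts how many times 100 can be halved (using integer division) before reaching 1:

```python
n = 100
steps = 0

while n > 1:
    n = n // 2     # 100 -> 50 -> 25 -> 12 -> 6 -> 3 -> 1
    steps += 1

print(steps)  # 6
```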


### List comprehensions

Iteration is such a fundamental concept in programming that Python includes a special syntax called a "list comprehension" that allows us to iterate over a sequence, applying some computation of interest, and collect the results of each of those computations into a list.  The list comprehension syntax looks like this:

```python
[do some computation on item for item in seq]
```

We can think of a list comprehension as having two parts, to the left and right of the `for` keyword.  The left part specifies what you're doing, and the right part specifies what you're doing it with.

Here are some examples:

In [155]:
from math import sqrt

[sqrt(i) for i in range(10)]

[0.0,
 1.0,
 1.4142135623730951,
 1.7320508075688772,
 2.0,
 2.23606797749979,
 2.449489742783178,
 2.6457513110645907,
 2.8284271247461903,
 3.0]
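A list comprehension is shorthand for a common `for` loop pattern: start with an empty list and append one result per element. The sketch below builds the same list both ways and checks that they agree (the names `roots_loop` and `roots_comp` are illustrative):

```python
from math import sqrt

# explicit for loop
roots_loop = []
for i in range(10):
    roots_loop.append(sqrt(i))

# equivalent list comprehension
roots_comp = [sqrt(i) for i in range(10)]

print(roots_loop == roots_comp)  # True
```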

#### Conditionals in list comprehensions

List comprehensions also support a conditional form, as illustrated below:

In [156]:
vals = [4, 9, 16, -25]

[sqrt(x) for x in vals if x >= 0]  # calculate square roots but only for non-negative values

[2.0, 3.0, 4.0]

List comprehensions also support an "if-else" form, but this requires you to move the if-else to the left side of the `for` keyword:

In [157]:
# the nan ("not a number") object, defined in the math library,
# is a useful way to represent the result of numerical computations that
# might produce invalid results for some inputs
from math import nan 

[sqrt(x) if x >= 0 else nan for x in vals]

[2.0, 3.0, 4.0, nan]

Unfortunately, the "if-else" form of list comprehensions isn't quite as readable as the standard form or the single if form.
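
One way to keep the two conditional forms straight is to compare what each does to the length of the result. A trailing `if` acts as a filter, so the output can be shorter than the input. The "if-else" form chooses a value for every item, so the output always has the same length as the input. The names `filtered` and `replaced` below are used only for this sketch.

```python
from math import sqrt, nan

vals = [4, 9, 16, -25]

# trailing if: items that fail the test are dropped
filtered = [sqrt(x) for x in vals if x >= 0]

# if-else: every item yields a value, either sqrt(x) or nan
replaced = [sqrt(x) if x >= 0 else nan for x in vals]

print(len(filtered), len(replaced))
```

This prints `3 4`: the filtered list has lost the entry for -25, while the if-else list keeps a `nan` placeholder in its position.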

## Writing functions

Functions are organized, reusable units of code that perform a specific computation of interest. Like variables, functions provide an important layer of abstraction that helps us generalize computations.


To illustrate this, let's consider a simple mathematical function -- the formula for calculating the area of a circle.  This mathematical function  takes a value $r$, representing the radius of the circle, and calculates the area as follows:

$$\text{area of circle} = f(r) = \pi r^2$$

The equivalent Python function is written as:

In [158]:
import math

def area_of_circle(r):
    return math.pi * r**2

Having defined this function we can then use it in our code:

In [159]:
area_of_circle(3)

28.274333882308138

In [160]:
area_of_circle(4)

50.26548245743669

Now whenever we want to calculate the area of a circle with a given radius, we just pass the value of the radius to our `area_of_circle` function.  We don't need to explicitly write out the calculation each time.  In this sense, we've generalized the process of computing the area of circles of any radius.
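
As a small sketch of what this generalization buys us, the function can be combined with a list comprehension to compute the areas of several circles at once. The list names `radii` and `areas` are illustrative only.

```python
import math

def area_of_circle(r):
    return math.pi * r**2

# apply the same function to each radius in turn
radii = [1, 2, 3]
areas = [area_of_circle(radius) for radius in radii]
print(areas)
```

The formula is written once, inside the function, and reused for every radius in the list.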

Note that in our function definition `r` is the "input" or "argument" to `area_of_circle`. You can think of it as a temporary variable name that refers to the value that the caller of our function passed to `area_of_circle`. The fact that we gave it the name `r` has no significance, other than that it parallels the variable name we're used to from the mathematical formula.  We could just as well have written the function as follows, and it would work exactly the same:

In [161]:
def area_of_circle(x):
    return math.pi * x**2

In [162]:
area_of_circle(2)

12.566370614359172

Note that an argument name, such as `r` in our first definition or `x` in the second, is "local" to the function. That is, the name doesn't exist outside of the function unless we've separately defined a variable with that name.  Also, an argument within the function is independent of any similarly named variable we might have already defined.  The code below illustrates this:

In [163]:
# raises an error because r hasn't been defined
r

NameError: name 'r' is not defined

In [164]:
# now define r
r = 10
area_of_circle(2) # the argument x is assigned the value of 2 within the function

12.566370614359172

In [165]:
r    # but assigning a value to the function's argument doesn't affect the variable r defined outside the function

10
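
Because the most recent definition of `area_of_circle` names its argument `x`, the cells above never have two different things called `r` in play at the same time. The sketch below redefines the function with an argument named `r`, as in the first definition, so that the name clash is real. The variable `area` is introduced only for this illustration.

```python
import math

def area_of_circle(r):
    # this r is local: it exists only while the function is running
    return math.pi * r**2

r = 10                      # a variable named r defined outside the function
area = area_of_circle(2)    # inside the call, the local r refers to 2

print(area)                 # the area for a radius of 2, not 10
print(r)                    # still 10: the outer r is unchanged
```

Even though both names are spelled `r`, the value passed in the call is what the function uses, and the outer variable keeps its value of 10 afterwards.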


Functions don't have to represent mathematical operations.  Here's a function that we can use to calculate the complement of a nucleotide. Note the use of standard control flow statements within the function body:

In [166]:
def complement_nucleotide(n):
    n = n.upper()  # capitalize the input so function also works with lower case strings
    if n == "A":
        return "T"
    elif n == "T":
        return "A"
    elif n == "G":
        return "C"
    elif n == "C":
        return "G"
    else:
        return "N"

In [167]:
complement_nucleotide("A")

'T'

In [168]:
complement_nucleotide("C")

'G'

As we've discussed previously, a useful computing strategy is to build up complex computations from simple parts.  Having defined a function to return the complement of a single nucleotide, we can define a function for computing the complement of a nucleotide sequence by repeatedly applying our `complement_nucleotide` function in a for loop (or list comprehension):

In [169]:
def complement_sequence(seq):
    compseq = ""
    for nuc in seq:
        compseq += complement_nucleotide(nuc)
    return compseq

In [170]:
complement_sequence("ATGCGCA")

'TACGCGT'
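
For comparison, here is a sketch of the list comprehension alternative mentioned above. The function name `complement_sequence_lc` is hypothetical, chosen to avoid overwriting the loop version. It relies on the standard string method `join`, which concatenates a list of strings using the string it is called on as the separator. `complement_nucleotide` is repeated so that the cell runs on its own.

```python
def complement_nucleotide(n):
    # same function as defined earlier, repeated so this cell is self-contained
    n = n.upper()
    if n == "A":
        return "T"
    elif n == "T":
        return "A"
    elif n == "G":
        return "C"
    elif n == "C":
        return "G"
    else:
        return "N"

def complement_sequence_lc(seq):
    # complement each nucleotide, then join the pieces into a single string
    return "".join([complement_nucleotide(nuc) for nuc in seq])

print(complement_sequence_lc("ATGCGCA"))
```

This prints `TACGCGT`, the same result as the loop version. Which form to prefer is largely a matter of readability.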

We can take this idea of functions calling other functions a step further to define a `reverse_complement` function:

In [171]:
def reverse_complement(seq):
    return complement_sequence(seq)[::-1]  # note use of slicing with a negative stride to reverse the string

In [172]:
reverse_complement("ATGCGCA")

'TGCGCAT'