### Session 2: Data structures and control flow - towards big data
This session goes into more depth on strings, numerical data and compound data types like lists. Building your agility with these structures is crucial for what follows! We'll begin to handle data at scale through loops and list comprehensions. 

### 1: Strings and escape characters
Strings can be defined several ways, such as with single quotes or double quotes. Why? Strings themselves may include quotes. The resulting ambiguity can break your code. All programming languages have tricks to get around this issue:

In [None]:
# One string, two ways

print('Yes, they said.')
print("Yes, they said.")

In [None]:
# An easy way to trip up

print(''Yes', they said.')

In [None]:
# If your string contains one type of quote, define it with the other type

print("'Yes', they said")

In [None]:
# backslash is an 'escape character'. It negates any special properties of the character that follows:

print('\'Yes\', they said')

In [None]:
# ... if followed by n, it creates a new line
print("They said:\nYes")

In [None]:
# ... or a tab if followed by t
print("They said:\tYes")

In [None]:
# careful with unintended escape characters in filenames!

string_will_fail = 'C:\Users\charl\Documents\CE\RAM\OneDrive_1_3-6-2019\QXN\RN'

In [None]:
# adding r denotes 'raw string'

string_will_work = r'C:\Users\charl\Documents\CE\RAM\OneDrive_1_3-6-2019\QXN\RN'
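
To see what the `r` prefix actually changes, here is a minimal sketch comparing three spellings of one path. The path and variable names are made up for illustration.

```python
# Three spellings of one (hypothetical) Windows-style path
escaped_path = 'C:\\new_data\\table.csv'   # every backslash escaped by hand
raw_path = r'C:\new_data\table.csv'        # raw string: backslashes kept literally
broken_path = 'C:\new_data\table.csv'      # \n and \t silently become newline and tab

# The first two are the same string
print(escaped_path == raw_path)

# The broken version is shorter: two backslash pairs collapsed into single characters
print(len(raw_path), len(broken_path))
```

The broken version raises no error here, because `\n` and `\t` are valid escapes. That makes this kind of mistake easy to miss.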

### 2. Control the `print` function
`print()` is a built-in function that echoes objects to the console. When printing strings, use the `.format()` method. Putting this at the end of a string lets you:
* substitute variables into the string;
* control how they're formatted (e.g. decimal places).

In [7]:
print("A string: that was easy")

A string: that was easy


In [8]:
print(42, "that was also easy")

42 that was also easy


In [10]:
print(20000/365, "that's not ideal")

54.794520547945204 that's not ideal


In [11]:
# include variables inside strings with {}, then (after the string), .format()
x = 42

print("Here's a number: {}. It's less than 50".format(x))

Here's a number: 42. It's less than 50


In [15]:
# you can include multiple substitutions, and they don't have to be variables (operations are fine)

print("Is {} really less than 50? Answer: {}.".format(x, x<50))

Is 42 really less than 50? Answer: True.


Note: `.format()` has its own formatting mini-language. See the documentation [here](https://pyformat.info/).

In [17]:
# just memorize this one for now:

print("Daily salary is approximately {:.2f} (two decimal places)".format(20000/365))

Daily salary is approximately 54.79 (two decimal places)
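
A few more format specifications that tend to be useful, as a rough sketch. The values and variable names below are invented for the example.

```python
# Made-up example values
population = 1234567
share = 0.4567
city = "Lima"

line1 = "Population: {:,}".format(population)         # thousands separator
line2 = "Share: {:.1%}".format(share)                 # percentage, one decimal place
line3 = "{name} has {count:,} people".format(name=city, count=population)  # named fields
line4 = "{:>8.2f}".format(20000/365)                  # right-aligned in 8 characters

for line in [line1, line2, line3, line4]:
    print(line)
```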


### 3. Indexing and slicing
Several data types are defined as 'sequences.' They share a common approach to selecting their elements using square bracket notation. This powerful notation works across strings, lists, arrays and DataFrames:

In [None]:
# to get one character from a string, put the index number in square brackets directly after the variable name
language = 'Python'
language[0] 

In [None]:
# index values can be negative.
language[-1]

Index values point between characters. The left edge of the first character is 0. The string 'Python' has six characters, so the right edge of the last character is index 6.

In [None]:
#       +---+---+---+---+---+---+
#       | P | y | t | h | o | n |
#       +---+---+---+---+---+---+
#       0   1   2   3   4   5   6
#      -6  -5  -4  -3  -2  -1

# credit: www.python.org/3/tutorial

You can 'slice' strings and other sequences using a start and end index:

In [None]:
# slices give you all elements from the start index, up to (but not including) the end index

language[0:4]

What happens if you leave out the start or end index while slicing? Python will use default values instead. Take a sequence of length `n`. For start position, it will default to 0. For end position, it will default to `n`.

In [None]:
# everything up to (but not including) index 4
language[:4]

In [None]:
# from index 4 onwards
language[4:]

In [None]:
# fourth from last, up to end of string
language[-4:]
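
The default values described above can be checked directly. This short sketch compares each shorthand slice with its spelled-out equivalent.

```python
language = 'Python'
n = len(language)

# An omitted start defaults to 0; an omitted end defaults to n
print(language[:4] == language[0:4])
print(language[4:] == language[4:n])

# A negative index counts back from the end: -4 is the same position as n - 4
print(language[-4:] == language[n - 4:])

# Slicing at the same index twice always rebuilds the original string
print(language[:4] + language[4:])
```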

Sneak preview: you can use the same index notation with higher-dimensional data structures, e.g. a 3D array (such as a stack of temperature rasters indexed by latitude, longitude and time).

### 3. An aside: extracting data from messy strings

In [None]:
# say you get a large column of ZIP codes, but in a messy format like this
zip_code1 = "Fred: ZIP 20022-0049"
zip_code2 = "Margaret: ZIP 20009-0132"

In [None]:
# we're interested in this part:
zip_code1[10:15]

In [None]:
# how could we systematically pull out the key 5 digits, to create a clean list of zips?

To crack a problem like this, you could:
* Use tab completion to list available string methods
* Ask StackOverflow
* Check the [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods)

In [None]:
# Get help on the .split() method

zip_code1.split?

In [None]:
# a quick solution: split each string twice (using a different separator):

answer = zip_code1.split()[2].split('-')[0]

In [None]:
# then make it an integer:

int(answer)

Operating on data at scale requires more firepower, e.g. list comprehensions and functions.
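
As a first taste of the function route, the double split can be wrapped up for reuse. This is only a sketch: `extract_zip` is a hypothetical helper name, and it assumes every string follows the 'Name: ZIP 12345-6789' pattern.

```python
def extract_zip(messy_string):
    """Return the five-digit ZIP from a string shaped like 'Name: ZIP 12345-6789'."""
    zip_plus_four = messy_string.split()[2]   # third whitespace-separated piece
    return zip_plus_four.split('-')[0]        # keep the part before the hyphen

zip_code1 = "Fred: ZIP 20022-0049"
zip_code2 = "Margaret: ZIP 20009-0132"

print(extract_zip(zip_code1))
print(extract_zip(zip_code2))
```

The function returns a string rather than an integer. That may be preferable for ZIP codes, because `int()` drops leading zeros.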

### 4: Test conditions with Boolean logic
To build up operations on larger datasets, a few more control flow tools are helpful, starting with Boolean logic.

The statement `a == b` asks Python to evaluate whether variable a equals variable b; the interpreter will return True or False.

Similar statements would be `a > b`, `a >= b`, or `a != b`. 

In [None]:
a = 6
b = 4

print("Dear Python, please evaluate the statement 'a > b' for me.")

print("YOUR ANSWER: [drum roll ...]", a > b)

The `and`, `or` and `not` operators check whether combinations of statements are true at the same time.

In [None]:
month = 'July'
hour = 14

In [None]:
(month == 'July') and (hour < 12)         # is it a morning in July?

In [None]:
(month == 'July') or (hour < 12)          # is it either a morning, or in July?

In [None]:
not (hour < 12) and (month == 'July')    # is it not a morning, and in July?
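
To see how the three operators behave across different inputs, the sketch below runs them over a few made-up (month, hour) pairs. Note that the loop reuses the names `month` and `hour`.

```python
# Made-up (month, hour) pairs to test
cases = [('July', 9), ('July', 14), ('March', 9), ('March', 14)]

results = []
for month, hour in cases:
    is_july = (month == 'July')
    is_morning = (hour < 12)
    # store the results of: and, or, not
    results.append((is_july and is_morning, is_july or is_morning, not is_morning))
    print(month, hour, results[-1])
```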

You can apply Boolean logic to lists. This is the idea of Boolean masking (we'll return to this later - useful to filter datasets based on conditions, or when handling raster data).

In [None]:
some_values = [3, 5, 6, 8, 9, 11, 14, 16]       # Let's get only even numbers

In [None]:
mask = []

for value in some_values:
    mask.append(value % 2 == 0)                 # is the number divisible by two?

In [None]:
mask

In [None]:
for i in range(len(some_values)):
    if mask[i]:
        print(some_values[i])
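
The same mask can be built and applied more compactly. The sketch below uses list comprehensions (covered later in this session) and the built-in `zip()`, which pairs up items from two lists. Neither appears in the cells above, so treat this as a preview.

```python
some_values = [3, 5, 6, 8, 9, 11, 14, 16]

# Build the mask in one line: True where the value is even
mask = [value % 2 == 0 for value in some_values]

# zip() walks through both lists together; keep a value only where the mask is True
evens = [value for value, keep in zip(some_values, mask) if keep]

print(mask)
print(evens)
```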

### 5. Building programs with `if`, `elif` and `else`

We already met `if` constructions. Get the indentation right, and build more sophisticated rules that test multiple conditions.

In [None]:
# an if statement executes the indented code only if some condition is true

my_value = 11

if my_value > 10:
    print("Number is greater than 10")

In [None]:
# use Boolean operators to test multiple conditions

if (my_value > 10) and (my_value < 15):
    print("Number is between 10 and 15")

In [None]:
# if the first if statement evaluates to false, elif executes a code block if its condition is true
# else executes a code block if no preceding if or elif evaluated to true

if (my_value > 0) and (my_value < 10):
    print("Number is positive and less than 10")
    
elif my_value >= 10:
    print("Number is 10 or greater")

else:
    print("Number is zero or negative.")
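
Boundary values are an easy place for `if` / `elif` / `else` chains to go wrong. As a sketch, the loop below runs one chain over several made-up test values, including the edge cases 0 and 10.

```python
labels = []

for value in [-3, 0, 7, 10, 11]:       # made-up test values, including the edges
    if value < 0:
        label = "negative"
    elif value == 0:
        label = "zero"
    elif value < 10:                    # only reached when value > 0
        label = "positive and less than 10"
    else:                               # everything left over: 10 or more
        label = "10 or greater"
    labels.append(label)
    print(value, "is", label)
```

Only the first matching branch runs, so each `elif` can rely on the earlier conditions having been false.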

### 6. Error handling

In [None]:
# What happens if we run the cell above with 'penguin' instead of a number?


In [None]:
# try-except is one method to catch and handle errors

my_list = [5,6,'Sally',10]

for obj in my_list:
    try:
        print('{}'.format(obj + 1))
    except TypeError:                   # catch only the error we expect ('Sally' + 1)
        print("I am not a number, I am a free woman!")
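
A variation on the cell above, offered as a sketch. It names the specific exception, so unrelated errors are not silently swallowed. It also uses the optional `else` clause, which runs only when no error occurred. The list names `totals` and `skipped` are invented for the example.

```python
my_list = [5, 6, 'Sally', 10]

totals = []     # results that worked
skipped = []    # items that could not be processed

for obj in my_list:
    try:
        result = obj + 1
    except TypeError:           # raised when adding a number to a string
        skipped.append(obj)
    else:                       # runs only if the try block succeeded
        totals.append(result)

print(totals)
print(skipped)
```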

### 7. Iterables and range()

In [None]:
# strings and lists are examples of iterables: they can return their members one-by-one.

for meal in ['Breakfast', 'Snack', 'Dinner']:
    print("{} has {} letters".format(meal, len(meal)))

In [None]:
# range() is another way to generate iterables; it returns an arithmetic progression of integers

print("NUMBERS AND THEIR CUBES")
for i in range(5):
    print(i, i**3)

In [None]:
# as usual, you get all numbers from start point (included) to end point (not included)

print("NUMBERS AND THEIR CUBES")
for i in range(-5, 5):
    print(i, i**3)

In [None]:
# using range(n+1) often makes sense

print('Give me numbers up to n, where n = 3')

n = 3
for i in range(n+1):
    print(i)

In [None]:
# what happens when you display a range() object?
range(5)

In [None]:
# the list() function turns any iterable into a list.   (not ideal if you're counting to 1 million!)
list(range(5))
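
`range()` also accepts an optional third argument, the step size. The short sketch below shows that, along with why a range is cheap even when it is long: it can report its length and last value without ever being turned into a list.

```python
evens = list(range(0, 10, 2))        # start, stop, step
countdown = list(range(5, 0, -1))    # a negative step counts downwards

print(evens)
print(countdown)

# A range stores only start, stop and step, so a million entries cost almost nothing
big = range(1000000)
print(len(big), big[-1])
```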

In [None]:
# enumerate() lets you loop through an iterable,  keeping track of where you are

meals = ['Breakfast', 'Post-Breakfast Snack','Elevenses', 'Lunch','Tea','Dinner','Bedtime snack']

for n, meal in enumerate(meals):
    print("Meal {} today: {}".format(n + 1, meal))
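
Instead of adding 1 inside the loop, `enumerate()` can be told where to start counting through its `start` argument. A small sketch with a shortened meals list:

```python
meals = ['Breakfast', 'Lunch', 'Dinner']

numbered = []
for n, meal in enumerate(meals, start=1):    # counting begins at 1 instead of 0
    numbered.append("Meal {} today: {}".format(n, meal))
    print(numbered[-1])
```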

### 8. Classes and methods (applied to lists)

__Sneak preview of classes:__ Python (as an object-oriented language) lets you define __classes__ of objects. A class of objects has built-in functions that can be called quickly (these are called __methods__).
    
Example:
* I define a class `road_network`.
* Each time I type `my_network = road_network(parameter1, parameter2, ...)`, I create a new instance of the class.
* Helpful functionality might be (i) calculate total length of roads; (ii) calculate shortest path between two points.
* I write a method `find_shortest_path(start_point, end_point)`. This could be accessed from any instance of class `road_network` in future.
* Like this: `my_network.find_shortest_path(my_start_point, my_end_point)`
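
The example above might look roughly like this in code. It is a toy sketch under stated assumptions: the network is stored as a list of (start, end, length) tuples, the method name `total_length` is invented here, and `find_shortest_path` is left out because it needs a routing algorithm.

```python
class road_network:
    """Toy road network stored as a list of (start, end, length_km) segments."""

    def __init__(self, segments):
        # runs once each time a new instance is created
        self.segments = segments

    def total_length(self):
        # a method: a function attached to every instance of the class
        return sum(length for _, _, length in self.segments)


# Create an instance with made-up data, then call its method
my_network = road_network([('A', 'B', 2.5), ('B', 'C', 4.0)])
print(my_network.total_length())
```

Python style guides usually recommend CapWords for class names (`RoadNetwork`). The lowercase name is kept here only to match the text.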

__List methods.__ Lists (and other data types) are implemented as classes. So check out the helpful methods that are at your disposal.

In [None]:
cubes = [1, 8, 27]
cubes

In [None]:
cubes.insert(0,0)    # insert element at given index

In [None]:
cubes.append(65)     # add element to end of list

In [None]:
cubes.remove(65)    # remove the first matching element (raises ValueError if it's absent)

In [None]:
cubes.extend([64,125,216])   # add all the elements of an iterable 

In [None]:
cubes

In [None]:
cubes.pop()         # return the last element (and remove it)

In [None]:
cubes.count(64)    # count how many times an element appears

In [None]:
print(cubes)
cubes.clear()       # delete all elements
print(cubes)
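
Some other list methods that come up often, as a brief sketch. It uses a fresh, deliberately shuffled list named `shuffled_cubes`, so the `cubes` list above is left alone.

```python
shuffled_cubes = [27, 1, 64, 8]

shuffled_cubes.sort()                    # sort in place (the method itself returns None)
print(shuffled_cubes)

position = shuffled_cubes.index(27)      # index of the first matching element
print(position)

shuffled_cubes.reverse()                 # reverse in place
print(shuffled_cubes)
```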

Remember to use tab completion or a question mark to list the available methods of an object. Other object types that we'll use extensively: NumPy arrays, Pandas Series, Pandas DataFrames. Each has its own (pretty amazing) set of methods.

### 9. List comprehensions
Return to our zip code example. We have seen many ways to operate on strings or numbers. But how to scale these operations across several hundred (or thousand) examples?

List comprehensions are a concise way to build lists using rules. They apply an operation to a series of items, and package the result in a list.

In [None]:
# first a squares example
[x**2 for x in range(10)]

Steps:
* First write the expression to evaluate.
* Then add a `for` statement
* And the sequence to perform the operation on.
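
A comprehension can also take an optional `if` at the end to filter items. The sketch below shows this next to the equivalent `for` loop. The filter is an addition to the steps listed above.

```python
# Comprehension: expression, for statement, sequence, then an optional filter
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# The same thing written as a loop
even_squares_loop = []
for x in range(10):
    if x % 2 == 0:
        even_squares_loop.append(x**2)

print(even_squares)
print(even_squares == even_squares_loop)
```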

In [None]:
# re-write the following as a list comprehension

absolute_cubes = []
for n in range(-100, 101):
    absolute_cubes.append(abs(n**3))

In [None]:
# code here:


__Data wrangling example__

In [None]:
# here's our messy input data
input_data = ["Alex: ZIP 20022-0049", "Margaret: ZIP 20009-0132", "Hermione: ZIP 10009-3214"]

In [None]:
# as a for loop

clean_list = []
for i in input_data:
    clean_list.append(i.split()[2].split('-')[0])

In [None]:
[x.split()[2].split('-')[0] for x in input_data]
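
Real data often contains entries that do not follow the pattern. One possible way to skip them is a filter on the comprehension. This is a sketch with a made-up malformed entry, and the check `'ZIP' in x` is a crude assumption about what a valid entry looks like.

```python
input_data = [
    "Alex: ZIP 20022-0049",
    "Margaret: ZIP 20009-0132",
    "Hermione: ZIP 10009-3214",
    "Sam: no address on file",      # made-up malformed entry
]

# Keep only entries that mention a ZIP, then apply the same double split
clean_list = [x.split()[2].split('-')[0] for x in input_data if 'ZIP' in x]
print(clean_list)
```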