# Week 1 - Basic Python concepts and functions

### Aims

- To recap Python's basic functionality (syntax, data types, data structures, conditionals and I/O)
- To learn and/or practice writing functions in Python

## Basic Python

### Printing

In [None]:
print('Hello from python!') # to print some text, enclose it between single quotation marks
print("I'm here today!")    # or double quotation marks
print(34)                   # print an integer
print(2 + 4)                # print the result of an arithmetic operation
print("The answer is", 42)  # print multiple expressions, separated by comma

You can include a comment in Python by prefixing some text with a `#` character. All text following the `#` on that line is then ignored by the interpreter.

### Variables

A variable can be assigned a simple value or the outcome of a more complex expression. The `=` operator is used to assign a value to a variable.

In [None]:
x = 3     # assignment of a simple value
print(x)
y = x + 5 # assignment of a more complex expression
print(y)
i = 12
print(i)
i = i + 1 # assignment of the current value of the variable, incremented by 1, back to itself
print(i)
i += 1    # shorter version with the special += operator
print(i)

### Simple data types

Python has four main simple data types: integer, float, string and boolean.

In [None]:
a = 2           # integer
b = 5.0         # float
c = 'word'      # string
d = 4 > 5       # boolean (True or False)
e = None        # special built-in value to create a variable that has not been set to anything specific
print(a, b, c, d, e)
print(a, 'is of type', type(a)) # to check the type of a variable

### Arithmetic operations

Python can be used as a calculator to perform simple mathematical operations on integers and floats.

In [None]:
a = 2             # assignment
a += 1            # change and assign (also -=, *=, /=)
3 + 2             # addition
3 - 2             # subtraction
3 * 2             # multiplication
3 / 2             # true division: float in Python 3 (integer division in Python 2)

3 // 2            # integer (floor) division
3 % 2             # remainder
3 ** 2            # exponent
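In a notebook, only the value of the last expression in a cell is displayed, so most of the results above are never shown. The short sketch below stores each result in a variable (the variable names are chosen here purely for illustration) and prints it together with its type:

```python
# Each result is stored in a variable so that it can be printed and inspected
true_division = 3 / 2     # in Python 3, / always returns a float
floor_division = 3 // 2   # // rounds down to the nearest whole number
remainder = 3 % 2         # % returns what is left over after floor division
power = 3 ** 2            # ** raises 3 to the power of 2

print(true_division, type(true_division))
print(floor_division, type(floor_division))
print(remainder)
print(power)
```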

### Data structures

There are also different types of data structures in Python to store the main basic data types shown above:

- Lists
- Sets
- Tuples
- Dictionaries

##### Lists

A list is an ordered collection of mutable elements, i.e. the elements inside a list can be modified.

In [None]:
a = ['red', 'blue', 'green', "yellow"]       # manual initialisation
copy_of_a = a[:]                   # shallow copy of a (a new list object)
another_a = a                      # another name for the same list object as a
b = list(range(5))                 # initialise from iterable
c = [1, 2, 3, 4, 5, 6]             # manual initialisation
len(c)                             # length of the list
d = c[0]                           # access first element at index 0
e = c[1:3]                         # access a slice of the list, 
                                   # including element at index 1 up to but not including element at index 3
f = c[-1]                          # access last element
c[1] = 8                           # assign new value at index position 1
g = ['re', 'bl'] + ['gr']          # list concatenation
['re', 'bl'].index('re')           # returns index of 're'
a.append('yellow')                 # add new element to end of list
a.extend(b)                        # add elements from list `b` to end of list `a`
a.insert(1, 'yellow')              # insert element in specified position
're' in ['re', 'bl']               # true if 're' in list
'fi' not in ['re', 'bl']           # true if 'fi' not in list
c.sort()                           # sort list in place
h = sorted([3, 2, 1])              # returns sorted list
i = a.pop(2)                       # remove and return item at index (default last)
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
print(g)
print(h)
print(i)
print("------------")
print(a)
print(copy_of_a)
print(another_a)
a[0:3]

##### Sets

A set is an unordered collection of unique elements.

In [None]:
a = {1, 2, 3}                                # initialise manually
b = set(range(5))                            # initialise from iterable
c = set([1,2,2,2,2,4,5,6,6,6])               # initialise from list
a.add(13)                                    # add new element to set
a.remove(13)                                 # remove element from set
2 in {1, 2, 3}                               # true if 2 in set
5 not in {1, 2, 3}                           # true if 5 not in set
d = a.union(b)                               # return the union of sets as a new set
e = a.intersection(b)                        # return the intersection of sets as a new set
print(a)
print(b)
print(c)
print(d)
print(e)

##### Tuples

A tuple is an ordered collection of immutable elements, i.e. tuples are similar to lists, but the elements inside a tuple cannot be modified. Most of the list operations shown above can be used on tuples as well, with the exception of those that modify the collection, such as assigning a new value at a certain index position.

In [None]:
a = (123, 54, 92)              # initialise manually
b = ()                         # empty tuple
c = ("Ala",)                   # tuple of a single string (note the trailing ",")
d = (2, 3, False, "Arg", None) # a tuple of mixed types
print(a)
print(b)
print(c)
print(d)
t = a, c, d                    # tuple packing
x, y, z = t                    # tuple unpacking
print(t)
print(x)
print(y)
print(z)
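To make the difference between lists and tuples concrete, the sketch below tries to assign a new value at index 0 of each. It uses a `try`/`except` block, which is not covered in this section, only so that the cell keeps running after the error; the variable names are illustrative.

```python
colours = ['red', 'blue', 'green']   # a list is mutable
colours[0] = 'yellow'                # so item assignment works
print(colours)

codons = ('ATG', 'TAG', 'TGA')       # a tuple is immutable
try:
    codons[0] = 'CAT'                # so item assignment raises a TypeError
except TypeError as error:
    print('Cannot modify a tuple:', error)

print(codons)                        # the tuple is unchanged
```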

##### Dictionaries

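A dictionary is a collection of key-value pairs where keys must be unique. Since Python 3.7, dictionaries preserve the order in which keys were inserted, but items are accessed by key rather than by position.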

In [None]:
a = {'A': 'Adenine', 'C': 'Cytosine'}        # dictionary
b = a['A']                                   # look up the value stored under key 'A'
c = a.get('A', 'no value found')             # look up key 'A', returning a default value if the key is missing
print('A' in a)                              # true if dictionary a contains key 'A'
a['G'] = 'Guanine'                           # assign new key-value pair to dictionary a
a['T'] = 'Thymine'                           # assign new key-value pair to dictionary a
print(a)
print(b)
print(c)
d = a.keys()                                 # get a view of the keys
e = a.values()                               # get a view of the values
f = a.items()                                # get a view of the key-value pairs
print(d)
print(e)
print(f)
del a['A']                                   # delete key and associated value
print(a)
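Two details of the cell above are easy to miss: `get()` only returns its default when the key is absent, and in Python 3 `keys()`, `values()` and `items()` return views that follow later changes to the dictionary. A minimal sketch, using an illustrative dictionary called `bases`:

```python
bases = {'A': 'Adenine', 'C': 'Cytosine'}

found = bases.get('A', 'no value found')     # key exists: its value is returned
missing = bases.get('N', 'no value found')   # key is absent: the default is returned
print(found)
print(missing)

key_view = bases.keys()                      # a view, not a copy of the keys
bases['G'] = 'Guanine'                       # modify the dictionary afterwards
print(list(key_view))                        # the view now includes 'G'
```

Wrapping a view in `list()` gives an independent list that no longer changes with the dictionary.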

### Working with strings

A string is an immutable, ordered sequence of characters, and in many ways behaves like a tuple of characters.

In [None]:
a = 'red'                          # assignment
char = a[2]                        # access individual characters
b = 'red' + 'blue'                 # string concatenation
c = '1, 2, three'.split(',')       # split string into list
d = '.'.join(['1', '2', 'three'])  # concatenate list into string
print(a)
print(char)
print(b)
print(c)
print(d)
dna = 'ATGTCACCGTTT'               # assignment
seq = list(dna)                    # convert string into list of characters
e = len(dna)                       # return string length
f = dna[2:5]                       # slice string
g = dna.find('TCA')                # substring location, return -1 when not found
print(dna)
print(seq)
print(e)
print(f)
print(g)
text = '   chrom start end    '    # assignment
print('>', text, '<')
print('>', text.strip(), '<')      # remove unwanted whitespace at both ends of the string
print('{:.2f}'.format(0.4567))     # formatting a string
print('{gene:s}\t{exp:+.2f}'.format(gene='Beta-Actin', exp=1.7))
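Since Python 3.6, the same format specifications can also be written inside an f-string, where the variable names appear directly between the curly braces. The sketch below reuses the values from the cell above and checks that both approaches give the same text:

```python
gene = 'Beta-Actin'
exp = 1.7

with_format = '{gene:s}\t{exp:+.2f}'.format(gene=gene, exp=exp)   # str.format()
with_fstring = f'{gene:s}\t{exp:+.2f}'                            # f-string

print(with_format)
print(with_fstring)
print(with_format == with_fstring)   # True: both produce the same string
```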

### Conditions

A conditional `if` or `elif` statement is used to specify that some block of code should only be executed if a conditional expression is **True**. An optional final `else` block is executed when all the preceding conditions are **False**.

Python uses indentation to mark which statements belong to a block of code, i.e. the lines inside the `if` statement are indented (by convention, with four spaces).

In [None]:
a, b = 1, 0           # assign different values to a and b, and execute the cell to test
if a + b == 3:
    print('Three')
elif a + b == 1:
    print('One')
else:
    print('?')

### Comparisons

In [None]:
1 == 1            # equal
1 != 2            # not equal
2 > 1             # greater than
2 < 1             # smaller than

1 != 2 and 2 < 3  # logical AND
1 != 2 or 2 < 3   # logical OR
not 1 == 2        # logical NOT

a = list('ATGTCACCGTTT')
b = a             # same as a
c = a[:]          # copy of a
'N' in a          # test if character 'N' is in a

print('a', a)      # print a
print('b', b)      # print b
print('c', c)      # print c
print('Is N in a?', 'N' in a)
print('Do objects b and a point to the same memory address?', b is a)
print('Do objects c and a point to the same memory address?', c is a)
print('Are values of b and a identical?', b == a)
print('Are values of c and a identical?', c == a)
a[0] = 'N'         # modify a  
print('a', a)      # print a
print('b', b)      # print b
print('c', c)      # print c
print('Is N in a?', 'N' in a)
print('Do objects b and a point to the same memory address?', b is a)
print('Do objects c and a point to the same memory address?', c is a)
print('Are values of b and a identical?', b == a)
print('Are values of c and a identical?', c == a)

### Loops

There are two ways of creating loops in Python: `for` and `while`.

##### for

In [None]:
a = ['red', 'blue', 'green']
for color in a:
    print(color)
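A `for` loop can iterate over anything iterable, not just lists. Two built-in helpers are often used with it: `range()`, which produces a sequence of integers, and `enumerate()`, which produces each element together with its index. A short sketch with illustrative variable names:

```python
colours = ['red', 'blue', 'green']

# range(5) yields the integers 0, 1, 2, 3, 4
squares = []
for number in range(5):
    squares.append(number ** 2)
print(squares)

# enumerate() yields (index, element) pairs
labelled = []
for index, colour in enumerate(colours):
    labelled.append('{}: {}'.format(index, colour))
print(labelled)
```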

##### while

In [None]:
number = 1
while number < 10:
    print(number)
    number += 1

Python has two ways of affecting the flow of a `for` or `while` loop:

- The `break` statement immediately causes all looping to finish, and execution is resumed at the next statement after the loop. 
- The `continue` statement means that the rest of the code in the block is skipped for this particular item in the collection.

In [None]:
# break
sequence = ['CAG','TAC','CAA','TAG','TAC','CAG','CAA']
for codon in sequence:
    if codon == 'TAG':
        break            # Quit the looping at this point (the TAG stop codon)
    else:
        print(codon)

        
# continue
values = [10, -5, 3, -1, 7]
total = 0
for v in values:
    if v < 0:
        continue # Don't quit the loop but skip the iterations where the integer is negative   
    total += v

print(values, 'sum:', sum(values), 'total:', total)

### Reading and writing files

To read from a file, your program needs to `open` the file and then read the contents of the file. You can read the entire contents of the file at once, or read the file line by line. The `with` statement makes sure the file is closed properly when the program has finished accessing the file.

We will use one of the files containing clinical and genomic data from anonymized patients published as part of the METABRIC consortium, a large-scale collaborative study of breast cancer worldwide (PMID: 22522925). The data file `metabric_clinical_and_expression_data.csv` is available in the `data` folder in: https://github.com/semacu/202110-data-science-python

In [None]:
# reading from file
with open("../data/metabric_clinical_and_expression_data.csv") as f:
    for line in f:
        print(line.strip())

Passing the `'w'` argument to `open()` tells Python that you want to write to the file, creating it if it does not already exist.

In [None]:
# writing to a file
with open('../data/programming.txt', 'w') as f:
    f.write("I love programming in Python!\n")
    f.write("I love making scripts.\n")

**Keep in mind** this will erase the contents of the file if it already exists. 

Passing the `a` argument instead tells Python you want to append to the end of an existing file.

In [None]:
# appending to a file 
with open('../data/programming.txt', 'a') as f:
    f.write("I love working with data.\n")
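The two ways of reading a file mentioned earlier in this section (all at once, or line by line) can be compared on a small example. The sketch below is self-contained: it writes a tiny comma-separated file with made-up content into a temporary folder, so the file name, column names and values are purely illustrative and are not taken from the METABRIC data.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.csv')

    # create a small example file (hypothetical content)
    with open(path, 'w') as f:
        f.write('patient_id,age\n')
        f.write('P1,75.6\n')
        f.write('P2,43.2\n')

    # read the entire contents at once, as a single string
    with open(path) as f:
        contents = f.read()

    # read line by line, splitting each line into its fields
    rows = []
    with open(path) as f:
        header = f.readline().strip().split(',')   # first line holds the column names
        for line in f:                             # continues from the second line
            rows.append(line.strip().split(','))

print(contents)
print(header)
print(rows)
```

Note that every field is read as a string, so numeric values such as the ages need converting (e.g. with `float()`) before doing arithmetic or comparisons.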

### Getting help

The Python [Standard Library](https://docs.python.org/3/library/index.html) documentation is the reference for all libraries included with Python, as well as for built-in functions and data types.

In [None]:
help(len)          # help on built-in function

In [None]:
help(list.extend)  # help on list function

For help within the Jupyter Notebook, try the following:

In [None]:
len?

## Functions

### Overview

A function is a named block of code that only runs when it is called. Functions are a general programming tool, and can be found in many programming languages. All functions operate in three steps:

1. Call: the function is called with some inputs
2. Compute: based on the inputs, the function does something
3. Return: the function outputs the result of the computation

Python has many built-in functions, for example the print function. In this example, the function is **called** with some text as the input, the **computation** is doing something to the text (keeping it the same), and the result is the printing of text to screen. Strictly speaking, printing is a side effect rather than a return value: `print()` itself returns `None`.

In [None]:
print("I just called the print function")

As well as the input, some functions also have additional options which change how the function works. These additional options are passed as named (keyword) **arguments**. For example, the print function has a *sep* argument which allows the user to specify which character to use as the text separator.

In [None]:
print("By", "default", "text", "is", "separated", "by", "spaces")
print("We", "can", "choose", "to", "separate", "by", "dashes", sep="-")

Some useful functions in Python are:
- type() : returns the type of an object
- len() : returns the length of a collection such as a list, tuple or string
- abs() : returns the absolute value of a number
- sorted() : returns a new sorted list built from the elements of a list or tuple
- map() : applies a function to every element of a list or tuple; in Python 3 it returns an iterator, which can be converted into a list of the results with list()
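The functions listed above can be tried on a small list. In this sketch the list `values` is an arbitrary example; note that in Python 3 `map()` returns an iterator, so it is wrapped in `list()` to display the results.

```python
values = [3, -1, 2]

print(type(values))            # <class 'list'>
print(len(values))             # 3
print(abs(-1.5))               # 1.5
print(sorted(values))          # [-1, 2, 3]
print(values)                  # [3, -1, 2]: sorted() left the original list unchanged

# apply abs() to every element of the list
absolute_values = list(map(abs, values))
print(absolute_values)         # [3, 1, 2]
```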

### Defining your own function

#### Why?

Functions allow you to wrap up a chunk of code and execute it repeatedly without having to type the same code each time. This has four main advantages:

1. Accuracy: by typing the code once, you only have one chance to make a mistake or typo (and if you discover a mistake, you only have to correct it once!)

2. Flexibility: you can change or expand the behaviour of a function by making a change in the function code, and this change will apply across your script

3. Readability: if someone reads your code, they only have to read the function code to understand how it works, instead of reading chunks of repeated code

4. Portability: you can easily copy a function from one script and paste it in another script, instead of having to extract general-purpose lines of code that are embedded in a script written for a specific purpose

#### How?

To declare a new function, the **def** statement is used to name the function, specify its inputs, add the computation steps, and specify its output. This results in a **function definition**, and the function can then be called later in the script. Function definitions typically have three elements:

1. Name: you can call your function anything, but usually verbs are better than nouns (e.g. echo_text rather than text_echoer). Python will not check whether a function already exists with the name that you choose, so **do not use the same name as an existing function** e.g. print()

2. Input: almost all functions take at least one input, either data to perform computation on, or an option to change how the computation proceeds. You don't have to specify an input, but you **must** add a pair of parentheses after the function name, whether or not they contain any inputs, followed by a colon

3. Return: the return statement ends the function call and specifies its output. Python does not strictly require it (a function without a return statement returns `None`), but writing it out makes the output of the function explicit

Below is a simple example of a function that takes one input (the text to be printed), and returns the text to the user. We assign the results of calling the function to a new variable, and print the results. As you can see, every line that you want to include in the function definition must be indented, including the **return** statement:

In [None]:
def print_text(input_text):
    output_text = input_text
    return(output_text)

result = print_text("Hello world")
print(result)
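It is worth seeing what happens when a function has no return statement: Python still runs it, and the call evaluates to `None`. The two functions below are hypothetical examples written for this comparison; one prints its message, the other returns it.

```python
def greet(name):
    # this function prints a message but has no return statement
    print("Hello", name)

def make_greeting(name):
    # this function returns the message instead of printing it
    return "Hello " + name

printed = greet("world")            # prints the message; the call returns None
returned = make_greeting("world")   # prints nothing; the call returns the text

print(printed)
print(returned)
```

Returning a value rather than printing it is usually more flexible, because the caller can store the result, pass it to another function, or print it as needed.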

### Variable scope

A key difference between code inside a function and elsewhere in a script is the **variable scope**: that is, where in the script variables can be accessed. If a variable is declared outside a function, it is a **global variable**: it can be accessed from within a function, or outside a function. In contrast, if a variable is declared inside a function, it is a **local variable**: it can only be accessed from inside the function, but not outside. The code below demonstrates this - the *counter* variable is declared inside the function, to keep track of the number of words that have been checked, and it is accessed without trouble by the print message inside the function:

In [None]:
# this function checks whether each word in a list of words contains a Z
def check_Z(word_list):
    Z_status = []
    # this counter is a local variable
    counter = 0
    for i in word_list:
        counter += 1
        if "Z" in i:
            Z_status.append(True)
        else:
            Z_status.append(False)
        # accessing the local variable inside the function works fine
        print("Word {} checked".format(counter))
    return(Z_status)

results = check_Z(["Zoo", "Zimbabwe", "Ocelot"])
print(results)

However, if we add an extra line at the end of the script which attempts to access the *counter* variable outside the function, we get a `NameError`, because *counter* does not exist outside the function:

In [None]:
# this function checks whether each word in a list of words contains a Z
def check_Z(word_list):
    Z_status = []
    Z_progress = []
    # this counter is a local variable
    counter = 0
    for i in word_list:
        counter += 1
        Z_progress.append(counter)
        if "Z" in i:
            Z_status.append(True)
        else:
            Z_status.append(False)
        # accessing the local variable inside the function works fine
        print("Word {} checked".format(counter))
    return((Z_status, Z_progress))

check_Z(["Zoo", "Zimbabwe", "Ocelot"])
# accessing the local variable outside the function doesn't work
print("Final counter value = {}".format(counter))

As this shows, variables declared inside a function are effectively invisible to all code that is outside the function. This can be frustrating, but overall it is a very good thing - without this behaviour, we would need to check the name of every variable within every function we are using to ensure that there are no clashes!
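If a value computed inside a function is needed elsewhere, the usual approach is to return it and assign the result to a variable in the calling code. The sketch below uses a hypothetical `count_words()` function; the `try`/`except` block is there only so that the cell keeps running after the expected `NameError`.

```python
def count_words(word_list):
    counter = 0               # local variable, only visible inside the function
    for word in word_list:
        counter += 1
    return counter            # hand the value back to the caller

# the returned value is stored in a new variable outside the function
final_count = count_words(["Zoo", "Zimbabwe", "Ocelot"])
print("Final counter value = {}".format(final_count))

# the local name itself is still not available out here
try:
    print(counter)
except NameError as error:
    print("NameError:", error)
```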

### Advanced functions

#### Setting defaults for inputs

For some functions, it is useful to specify a default value for an input. This makes it optional for the user to set their own value, and allows the function to run successfully if this input is not specified. To do this, we simply add an equals sign and a value after the input name. For example, we may want to add a *verbose* input to the **print_text()** function to print progress messages while the function runs.

In [None]:
def print_text(input_text, verbose=False):
    if verbose:
        print("Reading input text")
    output_text = input_text
    if verbose:
        print("Preparing to return text")
    return(output_text)

print_text("Hello", False)

The default setting for the *verbose* input is *False*, so the behaviour of the function will not change if the *verbose* input is not specified:

In [None]:
print_text("Hello world")

But we now have the option to provide a value to the *verbose* input, overriding the default value and getting progress messages:

In [None]:
print_text("Hello world", verbose=True)

If we specify a default value for any input, we must put this input after the other inputs that do not have defaults. For example, the new version of **print_text()** will not work if we switch the order of the inputs in the function definition:

In [None]:
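# switching the order of the inputs raises a SyntaxError,
# because an input without a default cannot follow an input with a default
def print_text(verbose=False, input_text):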
    if verbose:
        print("Reading input text")
    output_text = input_text
    if verbose:
        print("Preparing to return text")
    return(output_text)

print_text("Hello world", True)
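Inputs with defaults can be passed either by position or by name. When they are passed by name, their order in the call no longer matters. The function below, `describe_sample()`, and its inputs are invented for this illustration:

```python
def describe_sample(name, replicate=1, verbose=False):
    # build a label such as "liver_rep1" from the sample name and replicate number
    label = "{}_rep{}".format(name, replicate)
    if verbose:
        print("Created label", label)
    return label

default_label = describe_sample("liver")                  # both defaults used
positional_label = describe_sample("liver", 2)            # replicate set by position
keyword_label = describe_sample("liver", verbose=True, replicate=3)  # set by name, in any order

print(default_label)
print(positional_label)
print(keyword_label)
```

Passing optional inputs by name also makes the call easier to read, since a bare `True` or `2` gives little hint of what it controls.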

### Functions calling other functions

One useful application of functions is to split a complex task into a series of simple steps, each carried out by a separate function, and then write a master function that calls all of those separate functions in the correct order. Having a series of simple functions makes each step easier to write, test and debug. In this way, an entire workflow can be constructed that is easy to adjust, and scalable across a large number of datasets. For example, we might have a number of sample names, and for each we want to:

1. Find the longest part of the name
2. Check whether that part begins with a vowel
3. If it does, convert the lowercase letters to uppercase, and vice versa

Once we have done this, we want to gather the converted sample names into a list.

In [None]:
def find_name_longest_part(input_name):
    parts = input_name.split("-")
    longest_part = ""
    for i in parts:
        if len(i) > len(longest_part):
            longest_part = i
    return(longest_part)

# print(find_name_longest_part("1-32-ALPHA-C"))

def check_start_vowel(input_word):
    vowels = ["a", "e", "i", "o", "u"]
    word_to_test = input_word.lower()
    if word_to_test[0] in vowels:
        return(True)
    else:
        return(False)

# print(check_start_vowel("ALPHA"))

def convert_case(input_word):
    output_word = ""
    for i in input_word:
        if i.islower():
            output_word += i.upper()
        else:
            output_word += i.lower()
    return(output_word)

# print(convert_case("ALPha"))

def convert_sample_name_parts(sample_names):
    converted_names = []
    for i in sample_names:
        longest_part = find_name_longest_part(i)
        if check_start_vowel(longest_part):
            converted_names.append(convert_case(longest_part))
    return(converted_names)

convert_sample_name_parts(["1-32-ALPHa-C", "1-33-PHI-omega", "1-34-BETA-sigMA"])

### External functions

#### Importing modules

One of the great things about Python is the enormous number of functions that can be loaded and used. By default, Python only loads a basic set of functions when it is launched (or when a Jupyter notebook is opened), but extra functions can be loaded at any time by importing extra **modules**. To import a module, use the **import** command. For example, we can import the **os** module, which contains a number of functions that help us to interact with the operating system:

In [None]:
import os

Once a module has been imported, all of the functions contained within that module are available to use. For example, the **os** module has a **getcwd()** function, which returns the current working directory. To call this function, we must specify the module name and the function name together, in the form MODULE.FUNCTION:

In [None]:
os.getcwd()

Some of the most useful data science modules in Python are not included with a standard Python installation, but must be installed separately. If you are using conda, you should always [install modules through conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-pkgs.html) as well.

#### Importing specific functions from modules

Some modules are very large, so we may not want to import the whole module if we only need to use one function. Instead, we can specify one or more functions to be loaded using the **from** command:

In [None]:
from os import getcwd

If a function has been loaded using this syntax, it does **not** need to be prefaced with the module name:

In [None]:
getcwd()

#### Aliasing

Sometimes it is useful to be able to refer to a module by another name in our script, for example to reduce the amount of typing by shortening the name. To do this, we can use the **as** command to assign another name to a module as we import it, a technique known as **aliasing**. For example, we can import the **numpy** module and alias it as **np**:

In [None]:
import numpy as np

Now whenever we want to call a function from this module, we use the alias rather than the original name. For example, the **zeros()** function allows us to create an array of a specific shape, with every element set to zero:

In [None]:
np.zeros((3,4))

We can also alias a function name if we import it using the **from** command. However, as with naming your own function, it is critical to avoid using the name of a function that already exists:

In [None]:
from os import getcwd as get_current_wd
get_current_wd()

#### What not to do

In some code, you may see all of the functions being imported from a module using the following syntax:

In [None]:
from os import *

At first glance this looks more convenient than importing the module as a whole, because we no longer need to preface the function name with the module name when calling it:

In [None]:
getcwd()

However, you should avoid this. This is because if the module contains a function with the same name as a function that is already loaded, Python will replace that pre-loaded function with the new function without telling you. 

This has three consequences:

1. The pre-loaded function is no longer available
2. Debugging is now much more difficult, because you don't know which function is being called
3. If you call the function and expect to get the pre-loaded version, you will get unexpected (and potentially disastrous) behaviour
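The clash described above can be illustrated without actually running a star import. The sketch below uses the standard **math** module rather than **os**, simply because it gives a compact example: **math** contains a `pow()` function, and so does Python's set of built-in functions. After `from math import *`, the name `pow` would refer to `math.pow()` instead of the built-in.

```python
import builtins
import math

# public names in the math module that are also names of built-ins
math_names = {name for name in dir(math) if not name.startswith("_")}
clashes = sorted(math_names & set(dir(builtins)))
print(clashes)

# the two functions that share a name do not behave identically
builtin_result = pow(2, 3)        # built-in pow() keeps integers as integers
math_result = math.pow(2, 3)      # math.pow() always returns a float
print(builtin_result, type(builtin_result))
print(math_result, type(math_result))
```

Importing the module as a whole (`import math`) keeps both versions available and makes it clear which one is being called.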

## Assignment

1. Look at the METABRIC data file `metabric_clinical_and_expression_data.csv` on breast cancer referred to above. Answer the following questions:

    - For how many unique patients do we have data available?
    - How many patients were older than 75 when diagnosed with breast cancer?
    - What were the youngest and oldest ages at diagnosis?
    - How many patients were treated with Chemotherapy and Radiotherapy?
    - How many patients had fewer than three mutations in the genes investigated?

2. Below is a list of 5 protein sequences, specified in the single-letter amino acid code, where one letter corresponds to one amino acid. Write a function that finds the most abundant amino acid in a given protein sequence, but prints a warning message if the protein sequence is shorter than 10 amino acids. Run your function on each of the proteins in the list.

In [None]:
proteins = [
    "MEAGPSGAAAGAYLPPLQQ",
    "VFQAPRRPGIGTVGKPIKLLANYFEVDIPK",
    "IDVYHY",
    "EVDIKPDKCPRRVNREVV",
    "EYMVQHFKPQIFGDRKPVYDGKKNIYTVTALPIGNER"
]

3. The code below is intended to specify a function which looks up the capital city of a given country, and calls this function on a list of four countries. However, it currently has a bug which stops it from running. There are three possibilities for the nature of this bug:

    - Its arguments are in the wrong order
    - It uses a variable that is out of scope
    - It is missing the return statement

What is the bug?

In [None]:
capital_cities = {
    "Sweden": "Stockholm",
    "UK": "London",
    "USA": "Washington DC"
}

def find_capital_city(verbose=True, country):
    if country in capital_cities:
        capital_city = capital_cities[country]
        if verbose:
            print("Capital city located")
    else:
        capital_city = "CAPITAL CITY NOT FOUND"
    return(capital_city)

countries = ["USA", "UK", "Sweden", "Belgium"]
for i in countries:
    print(find_capital_city(i))

4. In the data folder, you will find a file "imagine_lyrics.txt", which contains the lyrics to the song Imagine by John Lennon. Your task is to find out which word is used most frequently in the lyrics. There are many ways to approach this, but however you solve it, remember to break up your code into functions! Some words aren't very interesting (e.g. "the", "a", "and"), so we might want to exclude these from consideration when finding the most frequent word in a set of lyrics. Include an option to exclude a custom list of words, and test how the results change when excluding "the", "a" and "and".