# Introduction to Python (part 2)

## What is a library?

Research programming is all about using libraries: tools that other people have written and shared with the community, and that do many cool things.
The Python keyword for loading someone else’s library is `import`.

In [None]:
import geopy  # A python library for investigating geographic information. https://pypi.org/project/geopy/

Now, if you try to follow along with this example in a Jupyter notebook, you’ll probably find that you just got an error message.
There are three main types of code written outside your script that we will see this week:
- built-in functions and methods (for instance `type()`, which we used this morning): pieces of code that are always available, so you don't need to import them
- libraries that are core components of the Python installation on your machine, or that have been pre-installed
- libraries that need to be installed (`geopy` is one of them)
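
One way to find out whether a library is available is to try importing it and catch the error if it is missing. The sketch below makes no assumption about your environment: `geopy` may or may not be installed, and `try_import` is a hypothetical helper name used only for illustration.

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Raised when Python cannot find the module in this environment
        return None


# `math` ships with Python, so this import always succeeds
print(try_import("math") is not None)

# `geopy` must be installed separately, so the result depends on your setup
print(try_import("geopy") is not None)
```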

### Built-in functions and methods

In [None]:
len("pneumonoultramicroscopicsilicovolcanoconiosis")

The built-in function `len` takes one input and has one output: the length of whatever the input was. The command `print` is also a built-in function.

In [None]:
print("pizza")
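
Returning to `len`: it is not limited to strings, and works on other containers such as lists too. A small sketch (the variable names are made up for illustration):

```python
word = "pizza"
toppings = ["tomato", "mozzarella", "basil"]

# len counts the characters in a string and the elements in a list
word_length = len(word)
number_of_toppings = len(toppings)

print(word_length)
print(number_of_toppings)
```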

Objects (for instance strings, which have type `str`) come associated with a bunch of functions, called methods, designed for working on objects of that type. We access these with a dot.


In [None]:
"shout".upper()

To learn more about functions and objects you can read many online tutorials, but you can also use the `help` function.

In [None]:
help(print)

For instance, a useful method when working with strings is `replace`. To see how it works you can read the documentation [here](https://docs.python.org/3/library/stdtypes.html); a snippet is reproduced below:
> str.replace(old, new[, count])
>
> Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
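
To illustrate the optional `count` argument described in the snippet, here is a small sketch on a made-up string (deliberately different from the one in the exercise that follows):

```python
word = "banana"

# Without count, every occurrence of "a" is replaced
all_replaced = word.replace("a", "o")

# With count set to 1, only the first occurrence is replaced
first_replaced = word.replace("a", "o", 1)

print(all_replaced)
print(first_replaced)

# replace returns a copy: the original string is unchanged
print(word)
```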

✏️ **Exercise:** 

Given the text below, find a way to use the `replace` method to turn it into "Hello Paul".

In [None]:
text = "he%l%lo Pau%l"

### Pre-installed libraries

Some libraries are pre-installed in the Python environment you will be using. In your case, you can see the libraries that we have pre-installed in the `requirements.txt` file in our GitHub repo. In other cases, for instance if you use `anaconda`, some core libraries are installed for you by default.

In [None]:
# this library is already available: you just need to import it

import math

number = 5
number_power_of_two = math.pow(number, 2)  # raise `number` to the power of 2
print (number_power_of_two)

In [None]:
import random

list_of_numbers = [5,6,7,8,4,3,5,3,2,1,5]

random.shuffle(list_of_numbers)

print (list_of_numbers)

Note a main difference between `math.pow()` and `random.shuffle()`. The first returns a new value: a number (in our case 5) raised to a power (in our case 2). The second is an **in-place** function: it directly modifies the `list_of_numbers` object and returns `None`. In-place functions can be a bit confusing at the beginning, but remember that you will encounter them a lot!
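
The same distinction shows up when sorting, which makes for a useful comparison: the built-in `sorted` returns a new list, while the list method `sort` works in place and returns `None`. A short sketch:

```python
numbers = [3, 1, 2]

# sorted() returns a new list and leaves the original untouched
sorted_copy = sorted(numbers)
print(sorted_copy)
print(numbers)

# list.sort() modifies the list in place and returns None,
# so assigning its result to a variable is a common mistake
result = numbers.sort()
print(result)
print(numbers)
```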

### Installing libraries

Note that import does not install libraries. It just makes them available to your current notebook session, assuming they are already installed. Installing libraries is harder, and so for this Summer School we have already pre-installed for you all the libraries you will need. If you want to know more about installing libraries, check out [this part](https://alan-turing-institute.github.io/rse-course/html/module06_software_projects/06_01_libraries.html) of the Turing RSE Course.

While libraries are great, they generally have three main drawbacks:

1. Sometimes, libraries are not looked after by their creator - for many different reasons
2. Sometimes, libraries are hard to get working (especially when you are in a rush and don't want to read the documentation!)
3. Sometimes, libraries don't do exactly what you need

![](images/dependency.png) 

### Contribute, don’t duplicate!!

You have a duty to the ecosystem of scholarly software:

If there’s a tool or algorithm you need, find a project which provides it.

If there are features missing, or problems with it, fix them; don’t create your own library.

## Writing your own functions

Defining functions which put together code to make a more complex task seem simple from the outside is the most important thing in programming. 

Imagine you want to compute the mean of a list of numbers **AND** there's no library that could do this for you (if there is a library, it's always better to rely on it, as other people will have used it in the past and caught specific bugs).

✏️ **Exercise:** 

Given a list of numbers, for instance [1,4,5,6,7] - how would you compute the mean without a computer?

A good starting point for defining a function is to think about how to convert the process of doing an operation "manually" into a series of defined steps. Let's see an example below.

In [None]:
def compute_mean(numbers):
    mean = sum(numbers)/len(numbers)
    return mean

This is the basic structure of a function: you have a function name (`compute_mean`), an input (`numbers`), a series of operations and an output (`mean`). However, this function is also **a bit difficult to read**, so let's go through the different components together.

`sum` and `len` are built-in Python functions. This means that they are always available: you don't need to import them. You can see the documentation of `sum` [here](https://docs.python.org/3/library/functions.html#sum).

In [None]:
numbers = [1,4,5,6,7]
numbers_sum = sum(numbers)
print (numbers_sum)

In [None]:
number_of_elements = len(numbers)
print (number_of_elements)

In [None]:
mean = numbers_sum / number_of_elements
print (mean)

✏️ **Exercise:** 

What is the mean of [10,124,65,86,7,98,6,54,112,13,87] ?

In [None]:
numbers = [10,124,65,86,7,98,6,54,112,13,87]
numbers_sum = sum(numbers)
number_of_elements = len(numbers)
mean = numbers_sum / number_of_elements
print (mean)

A function saves you from repeating the same commands over and over in your code: you define a specific operation once (for instance `compute_mean`) and can then use it whenever needed. This also helps you avoid adding bugs to your code through copy-paste mistakes.
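
As a sanity check on a hand-written function, you can compare its results with an existing library. The sketch below repeats the `compute_mean` defined earlier so that it runs on its own, and compares it with `mean` from Python's standard `statistics` module (this comparison is our own illustration).

```python
import math
import statistics


def compute_mean(numbers):
    mean = sum(numbers) / len(numbers)
    return mean


list_a = [1, 4, 5, 6, 7]
list_b = [10, 124, 65, 86, 7, 98, 6, 54, 112, 13, 87]

# The same function is reused for both lists, with no copy-pasting
for numbers in (list_a, list_b):
    ours = compute_mean(numbers)
    reference = statistics.mean(numbers)
    # isclose avoids problems with tiny floating-point differences
    print(ours, reference, math.isclose(ours, reference))
```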

However, there are two problems with functions:
1. they need to be well documented, otherwise it will be difficult to read them (even by you in a few weeks!)
2. they can have bugs too! So you need to be careful

## Documenting functions

A Python docstring is a documentation string. When you call the built-in help() function on a Python function for instance, you see its documentation. This documentation is specified by the docstring at the beginning of the definition.

In [None]:
### this description is the way a function is documented, so others can quickly understand how to use it
help(sum)
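
The docstring is stored on the function itself, in its `__doc__` attribute, which is where `help()` gets its text from. A minimal sketch with a hypothetical function `greet`:

```python
def greet(name):
    """Return a short greeting for the given name."""
    return "Hello " + name


# The docstring is the first string literal in the function body
print(greet.__doc__)
print(greet("Paul"))
```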

In [None]:
def compute_mean(numbers):
    """ compute the mean, given a list of numbers """
    mean = sum(numbers)/len(numbers)
    return mean

OK, this looks better, but the input, output and operation are still hard to read. One way of documenting functions is to follow the [Google style for docstrings](https://google.github.io/styleguide/pyguide.html). Here's an example:

In [None]:
def compute_mean(numbers):
    """ compute the mean, given a list of numbers 
    
    Args:
        numbers: List of integers
    
    Returns:
        The mean of the values contained in the list
    """
    mean = sum(numbers)/len(numbers)
    return mean

help(compute_mean)

Additionally, you can annotate the types of the arguments and of the return value.

In [None]:
def compute_mean(numbers:list)-> float:
    """ compute the mean, given a list of numbers 
    
    Args:
        numbers: List of integers
    
    Returns:
        The mean of the values contained in the list
    """
    mean = sum(numbers)/len(numbers)
    return mean

help(compute_mean)
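
Note that these type annotations are documentation: Python does not check them when the code runs. The sketch below uses a hypothetical function `double` to show that a call with the "wrong" type still executes.

```python
def double(value: int) -> int:
    """Return twice the input value."""
    return value * 2


# The annotations are stored on the function, where help() and other tools can read them
print(double.__annotations__)

# Python does not enforce them: passing a string still runs,
# and `*` simply repeats the string
print(double(4))
print(double("ab"))
```

This is one reason why checking the input explicitly is still needed even when a function is annotated.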

Finally, you can make the variable names clearer, or even break down each step to make the code more readable.

In [None]:
def compute_mean(list_of_numbers:list)-> float:
    """ compute the mean, given a list of numbers 
    
    Args:
        list_of_numbers: List of numbers (either integers or floats)
    
    Returns:
        The mean of the values contained in the list
    """
    sum_list = sum(list_of_numbers)
    length_list = len(list_of_numbers)
    mean = sum_list / length_list
    return mean

In [None]:
# to use the function

list_a = [1,4,5,6,7]
list_b = [10,124,65,86,7,98,6,54,112,13,87]

mean_a = compute_mean(list_a)
mean_b = compute_mean(list_b)

print (mean_a)
print (mean_b)

## Testing Functions

Functions are really useful tools when writing code. They allow your scripts to be more concise and modular. However, functions can easily add bugs to your code. Let's look at the following example.

In [None]:
def compute_mean(list_of_numbers:list)-> float:
    """ compute the mean, given a list of numbers 
    
    Args:
        list_of_numbers: List of integers
    
    Returns:
        The mean of the values contained in the list
    """
    sum_list = sum(list_of_numbers)
    length_list = len(list_of_numbers)
    mean = sum_list / length_list
    return length_list

mean_a = compute_mean(list_a)
print (mean_a)

In this example the function is well documented and, as expected, returns a number. However, by mistake we are returning the length of the list instead of the computed mean. Bugs like this one are very easy to make and hard to spot when you have a fairly complex pipeline.

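To make sure functions work correctly, people often test them quickly right after implementing them, or spot errors by looking at the final output of the code. However, both approaches have problems: a quick one-off check is not repeated when the code changes later, and an error in the final output is hard to trace back to the function that caused it.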

So, what are best practices in testing your code?

A good starting point is to define some specific cases where you test your function and you know what output it should give, for instance:


In [None]:
# In Python, the assert statement lets execution continue if the given condition evaluates to True.
# If the condition evaluates to False, it raises an AssertionError exception with the specified error message.

assert compute_mean([1,4,5,6,7]) == 4.6,'The mean is not correct'

Every time you write a function, define a series of assert statements that should produce a specific outcome.
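
As a sketch of what such a series of checks might look like, the cell below repeats a correct, shortened `compute_mean` and tests it on a few inputs with known answers. The use of `math.isclose` for non-integer values is our own suggestion: floating-point arithmetic can introduce tiny rounding differences, so an exact `==` comparison may fail even when the function is right.

```python
import math


def compute_mean(list_of_numbers):
    return sum(list_of_numbers) / len(list_of_numbers)


# Cases where we know the expected answer in advance
assert compute_mean([1, 4, 5, 6, 7]) == 4.6, 'The mean is not correct'
assert compute_mean([2, 2, 2]) == 2, 'The mean of equal values is not correct'
assert compute_mean([-1, 1]) == 0, 'The mean of opposite values is not correct'

# For floats, compare with a tolerance instead of exact equality
assert math.isclose(compute_mean([0.1, 0.2, 0.3]), 0.2), 'The mean is not correct'

print("All checks passed")
```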

Another important thing to test is that the input is what you are expecting. For instance, our function expects a list of numbers (either integers or floats).

What happens if the list contains a string? Or if the input list is empty? Or if the input is not a list?
Let's see!

In [None]:
compute_mean([1,4,"0.555",6,7])

In [None]:
compute_mean([])

In [None]:
compute_mean(5)

As you can see, the code crashes for different reasons. Instead, we should assert that the input is what we are expecting, and give an informative message when it is not.

In [None]:
def compute_mean(list_of_numbers:list)-> float:
    """ compute the mean, given a list of numbers 
    
    Args:
        list_of_numbers: List of numbers (either integers or floats)
    
    Returns:
        The mean of the values contained in the list
    """

    assert type(list_of_numbers) is list, 'The input is not a list'
    assert len(list_of_numbers) >0, 'The input list is empty'

    sum_list = sum(list_of_numbers)
    length_list = len(list_of_numbers)
    mean = sum_list / length_list
    return mean
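
To confirm that these checks behave as intended, you can call the function with bad input and catch the resulting `AssertionError`. A minimal sketch, repeating a shortened version of the function so that it runs on its own (`check_fails` is a hypothetical helper):

```python
def compute_mean(list_of_numbers):
    assert type(list_of_numbers) is list, 'The input is not a list'
    assert len(list_of_numbers) > 0, 'The input list is empty'
    return sum(list_of_numbers) / len(list_of_numbers)


def check_fails(bad_input):
    """Return the assertion message raised for bad_input, or None if nothing was raised."""
    try:
        compute_mean(bad_input)
    except AssertionError as error:
        return str(error)
    return None


print(check_fails([]))         # empty list
print(check_fails(5))          # not a list
print(check_fails([1, 2, 3]))  # valid input, so no assertion is raised
```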

✏️ **Exercise:** 

Add an assert to test that all elements in the list are integers or floats.

In [None]:
def compute_mean(list_of_numbers:list)-> float:
    """ compute the mean, given a list of numbers 
    
    Args:
        list_of_numbers: List of numbers (either integers or floats)
    
    Returns:
        The mean of the values contained in the list
    """

    assert type(list_of_numbers) is list, 'The input is not a list'
    assert len(list_of_numbers) >0, 'The input list is empty'
    assert all(isinstance(e, (int, float)) for e in list_of_numbers), "Some elements in the input list are not int/float"


    sum_list = sum(list_of_numbers)
    length_list = len(list_of_numbers)
    mean = sum_list / length_list
    return mean
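
An alternative to `assert` for validating input is to raise an exception explicitly. One reason to consider this is that Python skips `assert` statements when it is run with the `-O` option. The sketch below is one possible variant, not the approach used in this notebook; the name `compute_mean_strict` is hypothetical.

```python
def compute_mean_strict(list_of_numbers):
    """Compute the mean of a list of numbers, raising errors on bad input."""
    if not isinstance(list_of_numbers, list):
        raise TypeError('The input is not a list')
    if len(list_of_numbers) == 0:
        raise ValueError('The input list is empty')
    if not all(isinstance(e, (int, float)) for e in list_of_numbers):
        raise TypeError('Some elements in the input list are not int/float')
    return sum(list_of_numbers) / len(list_of_numbers)


# The caller can catch the specific error type and react to it
try:
    compute_mean_strict([1, 4, "0.555", 6, 7])
except TypeError as error:
    print("Caught:", error)
```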

✏️ **Exercise:** 

Write a function that, given a list of names (containing duplicates), returns a list of names without duplicates. Make sure to include documentation and `assert` statements to check the input and the correctness of the output.

Example of input:

In [None]:
names = ["Mark","Paula","Paul","Fede","Mariona","Kaspar","Paul","Thomas","Thomas","Mark","Thomas"]

In [None]:
def remove_name_duplicates(list_of_names:list)-> list:
    """ remove duplicates from a list of names 
    
    Args:
        list_of_names: List of names (str), potentially with duplicates
    
    Returns:
        list_of_names: List of names (str), without duplicates
    """

    assert type(list_of_names) is list, 'The input is not a list'
    assert len(list_of_names) >0, 'The input list is empty'
    assert all(isinstance(e, str) for e in list_of_names), "Some elements in the input list are not str"

    # converting to a set removes duplicates (note that the original order is not preserved)
    list_of_names = set(list_of_names)
    list_of_names = list(list_of_names)
    return list_of_names

In [None]:
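### Tests
# Uncomment one line at a time: each call should raise an AssertionError

# empty list
# remove_name_duplicates([])

# list containing elements that are not strings
# remove_name_duplicates([5,6,5,6,3,'Paula',4])

# input is a string, not a list
# remove_name_duplicates("Paula, Maria, Maria, Federico, Mark")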
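
The exercise also asks you to check the correctness of the output. One possible sketch is to assert properties that must hold for any correct result: no duplicates, and the same set of names as the input. It uses `dict.fromkeys` instead of `set`, a swapped-in technique that keeps the first occurrence of each name in its original order; `remove_name_duplicates_ordered` is a hypothetical name.

```python
def remove_name_duplicates_ordered(list_of_names):
    """Remove duplicates from a list of names, keeping the original order."""
    assert type(list_of_names) is list, 'The input is not a list'
    assert all(isinstance(e, str) for e in list_of_names), 'Some elements are not str'

    # dict keys are unique and remember insertion order (Python 3.7+)
    unique_names = list(dict.fromkeys(list_of_names))

    # Check the output: no duplicates, and no name lost or added
    assert len(unique_names) == len(set(unique_names)), 'The output contains duplicates'
    assert set(unique_names) == set(list_of_names), 'The output has different names'
    return unique_names


names = ["Mark", "Paula", "Paul", "Fede", "Mariona", "Kaspar", "Paul", "Thomas", "Thomas", "Mark", "Thomas"]
print(remove_name_duplicates_ordered(names))
```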