![Banner](Banner%20web.jpg)

## Functions

Rather than writing everything out every time, you can bundle a set of statements into a function and then call the function.  A function is defined using the `def` keyword.  

If we wanted a function to calculate the mean of a list, we could define it like this:

In [2]:
def calculate_mean(some_list):
    total = sum(some_list)  # We could do this manually, but it's easier to use the sum function
    count = len(some_list)  
    return total / count


We can then call the function whenever we need it:

In [3]:
c = calculate_mean([1,2,3,4,5])
print(c)

3.0
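One edge case worth keeping in mind: `calculate_mean` divides by the length of the list, so an empty list raises a `ZeroDivisionError`. The sketch below shows one possible guard. The name `safe_mean` and the choice to return `None` for an empty list are illustrative assumptions, not part of the original notebook.

```python
def safe_mean(some_list):
    """Return the mean of some_list, or None when the list is empty."""
    # Hypothetical helper: an empty list has no mean, so return None
    if not some_list:
        return None
    return sum(some_list) / len(some_list)


print(safe_mean([1, 2, 3, 4, 5]))  # 3.0
print(safe_mean([]))               # None
```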


Functions return a value with the `return` statement.  If a function doesn't have a `return` statement, the function returns `None`.
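A small sketch of the `None` case, using a made-up function called `greet` that prints but never returns:

```python
def greet(name):
    # Hypothetical example: prints a message, has no return statement
    print("Hello,", name)


result = greet("PHUSE")
print(result)  # None
```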

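Python passes arguments by object reference: the function receives the same object the caller holds, not a copy. If that object is mutable (a `dict` or a `list`, for example), changes made to it inside the function are visible to the caller.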

In [5]:
a = dict(value=2.0)

def double_it(d):
    d["double"] = d["value"] * 2

print(a)
double_it(a)
print(a)


{'value': 2.0}
{'value': 2.0, 'double': 4.0}
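It can help to contrast changing an object with rebinding a name. In the sketch below (the function names `add_flag` and `rebind` are invented for illustration) only the first function affects the caller's dict; the second merely points its local name at a new object.

```python
def add_flag(d):
    # Mutates the dict the caller passed in
    d["seen"] = True


def rebind(d):
    # Rebinds the local name only; the caller's dict is untouched
    d = {"replaced": True}


record = {"value": 2.0}
add_flag(record)
rebind(record)
print(record)  # {'value': 2.0, 'seen': True}
```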


Now an exercise for you:
* Create a function that calculates the root mean squared error (RMSE) between a list of predicted values and a list of actual values
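
For reference, the quantity to compute is usually written as follows. The symbols are notation chosen here, not taken from the notebook: $n$ is the number of paired values, $p_i$ is the $i$-th predicted value and $a_i$ is the $i$-th actual value.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(p_i - a_i\right)^2}$$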


In [None]:
predicted = [1, 3, 6, 9, 12, 15]
actual = [0, 4, 10, 15, 20, 25]

def root_mean_squared_error(predicted_values, actual_values):
    # You need to check and see whether the lists are the same length
    # Calculate the deltas
    # Square the values
    # Calculate the mean value
    # Take the square root of it
    # return the value
    pass

### The importance of documentation

It's always important to make use of the expressiveness of Python.  

Here are some general recommendations:
* Use snake_case, rather than camelCase or PascalCase for variables
  * if you build classes use `PascalCase` 
  * check out [PEP-8](https://www.python.org/dev/peps/pep-0008/#naming-conventions) for the official guidance; most editors will include some syntax checkers
* Use **good** names; `a` is less comprehensible than `sum_of_terms`
  * This applies to functions as well as variables
* You can use a triple-quoted string (a docstring) at the start of a function to document the function, its input variables and its output; editors can take advantage of this for code completion, type checking and other conveniences

```python
import re


def camel_to_snake(name):
    """
    Convert a CamelCase name into a snake case string
    :param str name: camel case name
    :rtype: str
    :return: the transformed name 
    """
    a = re.compile(r'((?<=[a-z0-9])[A-Z]|(?!^)[A-Z](?=[a-z]))')
    return a.sub(r'_\1', name).lower()

```
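
The docstring is stored on the function itself, so it can be read back at runtime (this is what `help` displays). The sketch below repeats the function with a shortened docstring so that it runs on its own; the input `"CamelCase"` is just an example value.

```python
import re


def camel_to_snake(name):
    """Convert a CamelCase name into a snake_case string."""
    pattern = re.compile(r'((?<=[a-z0-9])[A-Z]|(?!^)[A-Z](?=[a-z]))')
    return pattern.sub(r'_\1', name).lower()


print(camel_to_snake("CamelCase"))  # camel_case
print(camel_to_snake.__doc__)       # the docstring text
```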

## Python Modules

As you create more and more functions it makes sense to package them up into `modules`.  

A Python module is a package of functionality that you can use by importing it into your program when you need it.  Python has a set of modules included in the core distribution; these are called the stdlib (or standard library).  They cover a broad range of functionality that a programmer might need to build applications.  You can check out the stdlib documentation at [Python Standard Library documentation](https://docs.python.org/3/library/).

In Jupyter each cell you run updates the current environment; if you don't run the cell with the import statement then the module won't be available.

As an example, we've worked with functions and loops to calculate the mean of a list of integer values; we can instead use the standard library's `statistics` module to calculate the mean (and some other representative statistics).

In [6]:
from statistics import mean, mode, median, stdev
from random import randint

length = 60

a = [randint(1,35) for x in range(1, length)]
print("Random values:",a)
print("Mean:", mean(a))
print("Median:", median(a))
print("Standard Deviation:", stdev(a))
print("Mode:", mode(a))


Random values: [23, 21, 1, 21, 12, 1, 20, 4, 32, 3, 2, 23, 1, 28, 17, 8, 22, 35, 26, 10, 18, 26, 10, 30, 31, 19, 31, 25, 3, 21, 5, 33, 25, 27, 18, 15, 2, 3, 7, 35, 22, 15, 13, 3, 3, 20, 21, 18, 28, 15, 19, 2, 28, 1, 35, 30, 5, 25, 11]
Mean: 17.084745762711865
Median: 19
Standard Deviation: 10.762784115506898
Mode: 3


Much simpler!  

We used the `randint` function from the `random` module to generate a pseudo-random integer value (between the values of 1 and 35 in this case).  

If you want to know more about what a module contains you can use the built-in function `dir`; this will list all the names the module defines.  In Python there is no real 'data encapsulation': no components are truly private.  Elements that are not expected to be used by external callers are named with a leading underscore (e.g. `_inst` and `_sqrt` in the list shown below).

In [3]:
import random
dir(random)

['BPF',
 'LOG4',
 'NV_MAGICCONST',
 'RECIP_BPF',
 'Random',
 'SG_MAGICCONST',
 'SystemRandom',
 'TWOPI',
 '_BuiltinMethodType',
 '_MethodType',
 '_Sequence',
 '_Set',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_acos',
 '_bisect',
 '_ceil',
 '_cos',
 '_e',
 '_exp',
 '_inst',
 '_itertools',
 '_log',
 '_os',
 '_pi',
 '_random',
 '_sha512',
 '_sin',
 '_sqrt',
 '_test',
 '_test_generator',
 '_urandom',
 '_warn',
 'betavariate',
 'choice',
 'choices',
 'expovariate',
 'gammavariate',
 'gauss',
 'getrandbits',
 'getstate',
 'lognormvariate',
 'normalvariate',
 'paretovariate',
 'randint',
 'random',
 'randrange',
 'sample',
 'seed',
 'setstate',
 'shuffle',
 'triangular',
 'uniform',
 'vonmisesvariate',
 'weibullvariate']
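
Since the underscore prefix is only a naming convention, filtering on it is a simple way to see just the names intended for external use. A minimal sketch (the variable name `public_names` is an arbitrary choice):

```python
import random

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(random) if not name.startswith("_")]

print(public_names[:5])
print("randint" in public_names)  # True
```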

You can use the built-in `help` function to present the documentation for a given module or function:

In [10]:
help(random.randint)

Help on method randint in module random:

randint(a, b) method of random.Random instance
    Return random integer in range [a, b], including both end points.



The stdlib is good enough for much of what you might want to do; where extensions are warranted, people create libraries and make them available through the Python Package Index (PyPI).  You can search for packages on [PyPI](https://pypi.org/).


Python uses the module name as a namespace for the functions therein.  We imported the `random` module above with `import random`, so its functions are not available as bare names; we have to reach them through the module name.


In [4]:
# this will give us a NameError as it has no way of looking up the function
print(randint(0, 100))


NameError: name 'randint' is not defined

In [5]:
# We specify where to look for the function through the namespace
print(random.randint(0, 100))

68


## Import Syntax

When you use a module, you import it into your current python environment.  

You can import a module:
```python
import statistics

statistics.mean([0,1,2,3,4,5])
```
or, you can import one or more functions from a module
```python
from statistics import mean

mean([0,1,2,3,4,5])
```
or you can import all the functions from a module
```python
from statistics import *

mean([0,1,2,3,4,5])
```
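**NOTE** - avoid importing everything from a module with `*`; it fills your namespace with names you did not ask for, which can silently replace existing names and makes it hard to tell where a function came from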



In [None]:
# Example import module and then reference
import random

print(random.randint(0, 100))

In [None]:
from random import randint

# I can now use randint directly (ie without the module namespace)
print(randint(0, 100))


There is also syntax to import all the components of a module, although this is generally frowned upon: it clutters your namespace and can silently overwrite names you have already defined.

In [6]:
# Don't do this!
from random import *
print(gauss(1.2, 0.2))
# If you're interested in what this function is
# help(random.gauss) 

1.1811150608549184
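
To see why a star import can cause trouble, the sketch below checks which names a star import of `math` would bring in that already exist as built-ins. It does not perform the star import; it only inspects the module. It assumes `math` defines no `__all__`, in which case a star import takes every name without a leading underscore. The variable names are illustrative.

```python
import builtins
import math

# Names that a star import of math would bring in
# (assuming the module defines no __all__)
star_names = {name for name in dir(math) if not name.startswith("_")}

# Which of those already exist as built-ins?
clashes = star_names & set(dir(builtins))
print(sorted(clashes))

print(pow(2, 3))       # built-in pow returns an int: 8
print(math.pow(2, 3))  # math.pow returns a float: 8.0
```

After a star import, a bare `pow` would refer to the `math` version, and code that relied on the built-in behaviour would quietly change.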


You can also *alias* the module you import, to cut down the number of characters you need to type.

In [None]:
import pandas as pd
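
The same syntax works for any module. A minimal sketch with a standard library module, so it runs without extra installs (the alias `stats` is an arbitrary choice):

```python
import statistics as stats

# The alias replaces the module name as the namespace
print(stats.mean([1, 2, 3, 4]))  # 2.5
```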

## Project Setup

When you create a project you need to specify what dependencies the project has; the convention for doing this in Python is to use a `requirements.txt` file.  If you look in the project folder for this file you can see the following contents:

```
jupyter
pandas
numpy
requests
```

This tells the user what dependencies this project has - in this case we need jupyter to provide the notebooks we are using now; we will cover `pandas` and `numpy` in the next module (they make data engineering and data science **much** easier) and we will cover `requests` in the final module.

When you share the project you should ensure that your dependencies are up to date.  List what modules you have installed using the `pip` tool

In [1]:
!pip list

Package            Version  
------------------ ---------
appnope            0.1.0    
attrs              19.1.0   
backcall           0.1.0    
bleach             3.1.0    
certifi            2019.6.16
chardet            3.0.4    
decorator          4.4.0    
defusedxml         0.6.0    
entrypoints        0.3      
idna               2.8      
ipykernel          5.1.2    
ipython            7.8.0    
ipython-genutils   0.2.0    
ipywidgets         7.5.1    
jedi               0.15.1   
Jinja2             2.10.1   
jsonschema         3.0.2    
jupyter            1.0.0    
jupyter-client     5.3.1    
jupyter-console    6.0.0    
jupyter-core       4.5.0    
MarkupSafe         1.1.1    
mistune            0.8.4    
nbconvert          5.6.0    
nbformat           4.4.0    
notebook           6.0.1    
numpy              1.17.1   
pandas             0.25.1   
pandocfilters      1.4.2    
parso              0.5.1    
pexpect            4.7.0    
pickleshar

Note that the output of the previous command includes more than the 4 packages we listed above.  Each of our dependencies has dependencies of its own, and `pip list` shows every installed package.

Now, update `requirements.txt` to add `matplotlib` to the end of the file, then install the dependencies:
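
If you would rather inspect the installed packages from Python than from the shell, the standard library's `importlib.metadata` module can list them. This is a rough sketch of one way to do it, not something the notebook itself uses; the variable names are illustrative.

```python
from importlib.metadata import distributions

# Map each installed distribution name to its version
installed = {}
for dist in distributions():
    name = dist.metadata.get("Name")
    if name:  # skip entries with no recorded name
        installed[name] = dist.version

# Show the first few, sorted case-insensitively
for name in sorted(installed, key=str.lower)[:5]:
    print(name, installed[name])
```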

In [None]:
!pip install -r requirements.txt

## Simple Input and Output 
One of the most common activities is opening, reading and writing a file.  There are a couple of libraries in the stdlib that make this simple.  Firstly, we are going to use the `os` module to handle cases such as ensuring the file we are looking for exists in a platform independent way. 

In [3]:
import os

# gets the current directory
print(os.getcwd())

# get the parent directory
print(os.path.dirname(os.getcwd()))

# establish that the requirements.txt file exists
print(os.path.exists(os.path.join(os.getcwd(), "requirements.txt")))

# establish that the fruitbowl.txt file does not exist
print(os.path.exists(os.path.join(os.getcwd(), "fruitbowl.txt")))


/Users/glow/Documents/Devel/PycharmProjects/phuse_eu_connect_python
/Users/glow/Documents/Devel/PycharmProjects
True
False


Notice we used the `os.path.join` function above - this will join a path together in an OS independent way; on a Windows Machine it will use the `\` character and on a Linux/OSX machine it will use the `/` character.  
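
A short sketch of the joining behaviour. The directory names `data` and `raw` are made up for illustration, and `pathlib` is shown only as an alternative from the standard library, not something the notebook relies on.

```python
import os
from pathlib import Path

# Join the parts with whatever separator this platform uses
joined = os.path.join("data", "raw", "condition.csv")
print(joined)  # data/raw/condition.csv on Linux/macOS
print(os.sep)  # the separator used on this platform

# pathlib builds the same path with the / operator
as_path = Path("data") / "raw" / "condition.csv"
print(as_path)
```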

**NOTE**: Always write your code with no base assumption about where it's going to be run!

Now we're going to open a file for reading; in this case it is a dump of conditions from the FAERS dataset, exported as a CSV.

In [4]:
# we open the file with a context manager, this will automatically close the file for us
with open("condition.csv", "r") as fh:
    contents = fh.read()

# print the first 100 characters
print(contents[:100])

CONDITION,COUNT
Drug hypersensitivity,9681
Muscular weakness,2497
Rash,10745
Urticaria,3885
Adverse event,2100
Antinuclear antibody positive,63
Autoantibody positive,6
Autoimmune disorder,163
Double stranded DNA antibody positive,5
Infection,3659
Neoplasm malignant,1087
Opportunistic infection,38
Abdominal pain lower,498
Anaemia,4261
Asthenia,8367
Fatigue,17972
Medication error,562
Poor quality sleep,500
Splenomegaly,323
Aneurysm,110
Dry skin,6886
Glossodynia,261
Nephrolithiasis,1271
Skin wrinkling,79
Urinary tract infection,4123
Anxiety,6709
Blood pressure increased,3801
Metastases to liver,444
Metastases to pancreas,11
Nervousness,1015
Nodule,335
Ovarian disorder,17
Palmar erythema,35
Post procedural discharge,25
Post procedural swelling,36
Pyogenic granuloma,12
Scar,437
Stress,1718
Thrombophlebitis,71
Coma,1035
Gastrooesophageal reflux disease,1845
Pneumonia aspiration,585
Somnolence,4679
Atrial septal defect,151
Congenital hydronephrosis,15
Congenital ureteric anomaly,1
Maternal ex

So, we opened the file and we can see the content.  To do some useful work with it, however, we need to parse the data correctly.

As a first step, let's break up the file by lines

In [6]:
lines = contents.split('\n')
print("There are",len(lines) - 2,"conditions")

There are 12000 conditions


And then split the lines into condition and count

In [8]:
frequency = []

for line in lines[1:]:
    if line:
        frequency.append(line.split(','))

print(len(frequency))
print(frequency[100])

11999
['Metabolic acidosis', '587']


So we've loaded the contents, parsed them and got a list of lists; let's dig a little deeper.  How many total instances of conditions are there?  We can use the `sum` function here:


In [11]:
total = sum([int(x[1]) for x in frequency])
print("There were", total, "records")

['Drug hypersensitivity', '9681']


ValueError: invalid literal for int() with base 10: ' insomnia type"'

Ouch, that didn't work!  Splitting the lines on newline characters and commas does not handle cases where the condition itself contains a comma.  It's time to use a module called `csv`.
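
The difference is easy to see on a tiny sample. The data below is made up (the quoted row is not taken from `condition.csv`) and is read from a string with `io.StringIO` so the sketch runs without the file.

```python
import csv
import io

# Made-up sample in the same two-column layout; the first data row has a
# quoted condition that contains a comma
sample = 'CONDITION,COUNT\n"Example condition, with comma",12\nRash,10745\n'

# Naive approach: split each data line on commas
naive = [line.split(",") for line in sample.splitlines()[1:]]
print(naive[0])   # ['"Example condition', ' with comma"', '12']

# csv.reader understands the quoting and keeps the field whole
parsed = list(csv.reader(io.StringIO(sample)))[1:]
print(parsed[0])  # ['Example condition, with comma', '12']
```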

In [14]:
import csv
# reset the contents
contents = []
# open the file (read-only)
with open("condition.csv", "r") as fh:
    # use a DictReader, which yields each row as a dict keyed by the column headers
    dr = csv.DictReader(fh)
    for line in dr:
        contents.append(line)

print("There are",len(contents),"conditions")


There are 11999 conditions
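
Each row produced by `DictReader` is a dict keyed by the header names, and every value is a string. That is why a count has to be converted with `int` before doing arithmetic on it. A minimal sketch on a two-row sample read from a string (the variable names are illustrative):

```python
import csv
import io

sample = "CONDITION,COUNT\nRash,10745\nUrticaria,3885\n"
rows = list(csv.DictReader(io.StringIO(sample)))

print(dict(rows[0]))              # {'CONDITION': 'Rash', 'COUNT': '10745'}
print(type(rows[0]["COUNT"]))     # <class 'str'>
print(int(rows[0]["COUNT"]) + 1)  # 10746
```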


Now, let's get our count

In [15]:
total = sum([int(x.get('COUNT')) for x in contents])
print("There were", total, "records")

There were 1408486 records


Now, an exercise for you!  Find the most commonly reported ADR (adverse drug reaction) from the dataset in the `condition.csv` file.

In [None]:
# define our references
max_count_value = 0
max_count_condition = None

def most_common_condition(contents):
    """
    Take a list of dicts and extract the key and value for the maximum value
    """
    pass

print("Condition ", max_count_condition, "had", max_count_value, "records")


## Next

Next up, we're going to briefly look at the two superstar modules for the Data Scientist of discernment, numpy and pandas.  Click [here](05_numpy_pandas.ipynb) to continue.


![Author Geoff Low](author-geoff%20low%20small.png)
<img src="Logo%20standard.png" alt="PHUSE Education" style="width: 400px;"/>