# CIC Python workshop 

# 1. Loading and analysing a dataset

## Learning objectives

* Explain what a library is, and what libraries are used for.
* Load a Python library and use the things it contains.
* Read tabular data from a file into a program.
* Assign values to variables.
* Select individual values and subsections from data.
* Perform operations on arrays of data.
* Display simple graphs.

Words are useful, but what’s more useful are the sentences and stories we build with them. Similarly, while a lot of powerful, general tools are built into languages like Python, specialized tools built up from these basic units live in libraries that can be called upon when needed.

In order to load our dataset (New York Air Quality measurements), we need to import a library called __`pandas`__. pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labelled” data both easy and intuitive. It is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labelled at all to be placed into a pandas data structure

We can load __`pandas`__ using:

In [None]:
import pandas

Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Once we’ve loaded the library, we can ask it to read our data file for us:

In [None]:
pandas.read_csv('data/NYairquality.csv')

The expression __`pandas.read_csv(...)`__ is a function call that asks Python to run the function __`read_csv`__ which belongs to the __`pandas`__ library. This dotted notation is used everywhere in Python to refer to the parts of things as __`thing`__.__`component`__.
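
The same dotted notation works for any library. As a small sketch, here it is with the standard-library `math` module (chosen only for illustration; it is not used elsewhere in this workshop):

```python
import math

# math.sqrt is a function that belongs to the math library,
# so we call it with parentheses
print(math.sqrt(16))

# math.pi is a value stored in the same library, so it takes no parentheses
print(math.pi)
```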

__`pandas.read_csv`__ has one required parameter: the name of the file we want to read. This parameter needs to be a character string (or string for short), so we put it in quotes.

When we have finished typing and press Shift+Enter, the notebook runs our command. Since we haven’t told it to do anything else with the function’s output, the notebook displays it. In this case, the output is the data we just loaded. By default, not all rows and columns are shown (with ... to omit elements).

Our call to __`pandas.read_csv`__ reads our file, but doesn’t save the data in memory. To do that, we need to assign the output to a variable. 

## Variables, attributes and methods

A variable is just a name for a value, such as __`x`__, __`current_temperature`__, or __`subject_id`__. Python’s variable names must begin with a letter or an underscore, and they are case sensitive. We can create a new variable by assigning a value to it using __`=`__. As an illustration, let’s step back and instead of considering a table of data, consider the simplest “collection” of data, a single value. The line below assigns the value __55__ to a variable __`weight_kg`__:

In [None]:
weight_kg = 55

Once a variable has a value, we can print it to the screen:

In [None]:
weight_kg

and do arithmetic with it:

In [None]:
print('weight in pounds:', 2.2 * weight_kg)

As the example above shows, we can print several things at once by separating them with commas.
We can also change a variable’s value by assigning it a new one:

In [None]:
weight_kg = 57.5
print('weight in kilograms is now:', weight_kg)

If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value:

![Variables as Sticky Notes](figures/python-sticky-note-variables-01.svg)
<center>__Figure: Variables as Sticky Notes__</center>

This means that assigning a value to one variable does not change the values of other variables. For example, let’s store the subject’s weight in pounds in a variable:

In [None]:
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)

![Variables as Sticky Notes](figures/python-sticky-note-variables-02.svg)
<center>__Figure: Creating Another Variable__</center>

and then change weight_kg:

In [None]:
weight_kg = 100.0
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)

![Variables as Sticky Notes](figures/python-sticky-note-variables-03.svg)
<center>__Figure: Updating a Variable__</center>

Since __`weight_lb`__ doesn’t “remember” where its value came from, it isn’t automatically updated when __`weight_kg`__ changes. This is different from the way spreadsheets work.

We can inspect the variables currently set in our Python session by running:

In [None]:
whos

We can remove unwanted variables with the __`del`__ command:

In [None]:
del weight_lb

In [None]:
whos

Just as we can assign a single value to a variable, we can also assign a table of values to a variable using the same syntax. Let’s re-run __`pandas.read_csv`__ and save its result:

In [None]:
airquality = pandas.read_csv('data/NYairquality.csv')

This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value:

In [None]:
airquality

Now that our data is in memory, we can start doing things with it. First, let’s ask what type of thing __`airquality`__ refers to:

In [None]:
type(airquality)

The output tells us that __`airquality`__ currently refers to a thing called a __`DataFrame`__, created by the __`pandas`__ library (actually a sub-library). The name DataFrame is adopted from the R programming language; you can think of it as a table (with rows and columns).

This looks complex but there are simpler examples.

In [None]:
type(weight_kg)

Which means __`weight_kg`__ is a floating point number.
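
As a short sketch of how the type follows from the value a variable currently refers to (the variable names below are made up for illustration):

```python
count = 55            # a whole number
mass = 57.5           # a number with a decimal point
label = 'subject A'   # text in quotes

print(type(count))    # <class 'int'>
print(type(mass))     # <class 'float'>
print(type(label))    # <class 'str'>
```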

Our data corresponds to daily records of air quality measurements in New York. Each row holds the measurements for one day. We can see the size of the table like this:

In [None]:
airquality.shape

This tells us that __`airquality`__ has 153 rows and 6 columns. When we created the variable __`airquality`__ to store our data, we didn’t only create the table, we also created information about the table, called members or _attributes_. This extra information describes the data in the same way an adjective describes a noun. __`airquality.shape`__ is an attribute of __`airquality`__ which describes the dimensions of the table. We use the same dotted notation for the attributes of variables that we use for the functions in libraries because they have the same part-and-whole relationship.

We can get summary statistics of our dataset by doing:

In [None]:
airquality.describe()

__`describe`__ here is a _method_ of the __`DataFrame`__, i.e. a function that belongs to the table in the same way that the member __`shape`__ does. If variables are nouns, methods are verbs: they are what the thing in question knows how to do. We need empty parentheses for __`airquality.describe()`__, even when we’re not passing in any parameters, to tell Python to go and do something for us. __`airquality.shape`__ doesn’t need __`()`__ because it is just a description but __`airquality.describe()`__ requires the __`()`__ because it is an action.
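
The difference between an attribute and a method can be sketched on a tiny hand-made table. The name `toy` and its values are hypothetical and are not part of the workshop dataset:

```python
import pandas

# A tiny hand-made table (hypothetical values, not the workshop dataset)
toy = pandas.DataFrame({'Temp': [67, 72, 74], 'Wind': [7.4, 8.0, 12.6]})

# Attribute: a stored description, so no parentheses
print(toy.shape)        # (3, 2)

# Method: an action, so parentheses are required
summary = toy.describe()
print(summary)

# Without parentheses the method is not run;
# the name only refers to the method itself
print(callable(toy.describe))   # True
```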

Just like the number of rows and columns, another attribute of our table is the column names. We can get them like this:

In [None]:
airquality.columns

This again looks a bit complex, but it is basically an ordered list of the names of the table columns. For example, the name of the second column is:

In [None]:
airquality.columns[1]

**WARNING** Python starts counting from 0, not from 1, so the first element of a list has index 0, the second has index 1, and so on.

Using the round brackets `airquality.describe()` is very different from using the square brackets `airquality.columns[1]`. We saw the first is like an _action_ we invoke on `airquality`. The second instead is akin to the mathematical notation for indices. So if you have a mathematical vector of numbers $(a_1, a_2, \ldots, a_n)$, the generic element $a_i$ in Python would be `a[i]`.
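
As a minimal sketch of the square-bracket notation on a plain list of numbers (the list `a` is made up for illustration), note that the mathematical $a_1$ corresponds to `a[0]` in Python:

```python
a = [10, 20, 30, 40]

print(a[0])    # first element: 10
print(a[1])    # second element: 20
print(a[3])    # fourth (and last) element: 40
```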

## Indexing and slicing

If we want to get a single entry from the table, we provide an index in square brackets with __`loc`__ (label-location based index). Here we get the first row of the dataset (index = 0):

In [None]:
airquality.loc[0]

You can also use the index notation to get multiple rows. E.g. this gets the rows from 10 to 20 (included!).

In [None]:
airquality.loc[10:20]
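
The fact that __`loc`__ includes both ends of the range is worth checking on a small example. The sketch below uses a hypothetical five-row table called `toy`, and contrasts it with slicing a plain Python list, where the end position is left out:

```python
import pandas

# Hypothetical five-row table with the default row labels 0..4
toy = pandas.DataFrame({'Temp': [67, 72, 74, 62, 56]})

# loc slices by label and includes BOTH ends: labels 1, 2 and 3
rows = toy.loc[1:3]
print(rows)
print(len(rows))          # 3

# A plain Python list slice stops just before the end position
temps = [67, 72, 74, 62, 56]
print(temps[1:3])         # [72, 74]
```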

To access a specific row and column of the DataFrame we run __`airquality.loc[row, column]`__:

In [None]:
airquality.loc[0, 'Temp']

We can also select multiple rows and columns from the DataFrame using __`airquality.loc[start_row:end_row, [column_names]]`__:

In [None]:
airquality.loc[0:5, ['Temp', 'Month', 'Day']]

**loc** always expects either only the __`row`__ (like __`loc[row]`__) or a combination of row and column (__`loc[row, column]`__). To get all the rows of a single column (say, the temperature) we would have to write __`airquality.loc[:, 'Temp']`__, which means all rows of the __`Temp`__ column. This is very long to type for a common operation, so pandas offers a shortcut. If you want the __`Temp`__ column you can say __`airquality['Temp']`__, which is much quicker to type.

In [None]:
airquality['Temp']

It also works with multiple columns, given as a list of column names:

In [None]:
airquality[['Temp', 'Day']]

Once you have a column you can call _methods_ just like you do on the table itself. `DataFrames` and columns have different sets of methods you can call but they share many. E.g. you can get summary statistics of a single column.

In [None]:
airquality['Temp'].describe()

But you can also get the unique set of values of a column.

In [None]:
airquality['Month'].unique()

So we only have data from May to September, good to know!

You can also do mathematical operations on columns like basic arithmetic and comparisons:

In [None]:
airquality['Wind'] * 2

There is one other common way of accessing a *slice* of data, which is using a boolean variable. A boolean variable is a variable that can be `True` or `False`. If we wanted to find the days when the Ozone measurement was particularly high, i.e. over 70 parts per billion (ppb), we could do the following:

In [None]:
airquality['Ozone'] > 70

This returns a column of True/False values, one for each row in the original table __`airquality`__, depending on whether that row meets the condition. We can use this boolean column to select only the rows of the __`airquality`__ __`DataFrame`__ for which the condition is true.

In [None]:
high_ozone = airquality['Ozone'] > 70
airquality[high_ozone]
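
To see what the filter does row by row, here is a minimal sketch on a hypothetical four-row table (the name `toy` and its values are invented). It also shows that a missing measurement (`NaN`) never satisfies the comparison, so such rows are dropped by the filter:

```python
import pandas

# Hypothetical values; float('nan') marks a missing Ozone measurement
toy = pandas.DataFrame({'Ozone': [41.0, float('nan'), 97.0, 78.0],
                        'Day': [1, 2, 3, 4]})

mask = toy['Ozone'] > 70
print(mask.tolist())      # [False, False, True, True]

# Only the rows where the mask is True are kept
print(toy[mask])

# True counts as 1, so summing the mask counts the matching rows
print(mask.sum())         # 2
```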

## Plots

We can make plots too. Note we need to give a special command __`%matplotlib inline`__ to make plots appear within this Jupyter notebook. You don't need to run this for each plot, only once per session. It's common to stick this command at the top of the notebook, so it's the first thing you execute.

In [None]:
%matplotlib inline 

You can pick a column and make a histogram.

In [None]:
airquality['Ozone'].plot.hist()

Other plots you can make are:
* `line` : line plot
* `bar` : vertical bar plot
* `barh` : horizontal bar plot
* `hist` : histogram
* `box` : boxplot
* `kde` or `density`: Kernel Density Estimation plot
* `area` : area plot
* `pie` : pie plot

We can use one of the techniques we saw before to select only the rows with a high ozone measurement and plot a histogram to determine which months have more high-ozone days.

In [None]:
airquality[high_ozone]['Month'].hist()

Here we see that high measurements of the Ozone level are more common in months 7 (July) and 8 (August).

You can also make plots involving more than one column. You can call the plot methods on the table itself, e.g.

In [None]:
airquality.plot.scatter('Month', 'Temp')

### Adding and deleting columns

Other than operating on existing columns you can also add columns to your table. Let's create a new column in __`airquality`__ to store the temperature in Celsius:

In [None]:
airquality['TempC'] = (airquality['Temp'] - 32) / 1.8
airquality

You can also remove columns by using __`del`__:

In [None]:
airquality['unwanted_column'] = 0
airquality

In [None]:
del airquality['unwanted_column']
airquality

## Exercise 1

a) View summary statistics of the new column to answer the following questions:
* What is the mean?
* What is the max value?


b) Extract data of the first week of every month for your analysis.

Steps:

1. Create a variable __`first_week_of_month`__ and assign it the indices of rows that have a __`Day <= 7`__
2. Create a variable __`data_subset`__ and use the above variable to select only these indices / rows from the __`airquality`__ __`DataFrame`__

c) Select rows from the __`data_subset`__ where __`TempC < 20`__ degrees Celsius

d) Examine the resulting dataset:

* How many records are there?
* Are there any cells that have no data?
* What month has the most first week days below 20 degrees Celsius?

We can visually inspect whether the Ozone measure is correlated with Temperature:

In [None]:
airquality.plot.scatter('TempC', 'Ozone')

We can calculate the Pearson correlation coefficient between these two columns:

In [None]:
airquality[['TempC', 'Ozone']].corr()

Here we can see that the two columns have a correlation coefficient of 0.698.
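
To get a feel for reading such a table, here is a minimal sketch on a hand-made table. The column names and values are hypothetical, chosen so that the answers are easy to predict: a coefficient close to +1 means two columns rise together, and a coefficient close to -1 means one falls as the other rises.

```python
import pandas

# Hypothetical columns: 'rising' goes up with 'x', 'falling' goes down
toy = pandas.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                        'rising': [10.0, 20.0, 30.0, 40.0],
                        'falling': [8.0, 6.0, 4.0, 2.0]})

table = toy.corr()
print(table)

# Look up one pair of columns by row label and column label
print(table.loc['x', 'rising'])    # close to +1
print(table.loc['x', 'falling'])   # close to -1
```

Every column is perfectly correlated with itself, so the diagonal of the table is always 1.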

We can also calculate the correlation coefficient between all pairs of columns in one command (i.e. pairwise comparison):

In [None]:
airquality.corr()

We can flag pairs of columns that have a positive correlation coefficient over some threshold, e.g. +0.65:

In [None]:
airquality.corr() > 0.65

We can also flag pairs of columns with a negative correlation coefficient under some threshold, e.g. -0.65:

In [None]:
airquality.corr() < -0.65

## Exercise 2

a) Review the correlation coefficient tables above to answer the following questions:
* What columns have positive correlation coefficients > 0.65?
* What columns have negative correlation coefficients < -0.65?
* Should we remove any columns that are highly correlated? If so, which ones?

b) Write code to remove one of the highly correlated columns:

c) Check that the modified table has been correctly saved to the __`airquality`__ variable

d) Remove the records that have NaN / NA values from the dataset

Note: You will need to look at the help documentation on the __`dropna`__ method

In [None]:
help(airquality.dropna)

In [None]:
# drop records that have NaN / NA values


e) Check that the modified table has been correctly saved

f) Save your modified table to a CSV file

Note: You will need to look at the help documentation on the __`to_csv`__ method

In [None]:
help(airquality.to_csv)

In [None]:
# save the table to a CSV file


# 2. Basic programming

## Learning objectives

* Explain what a list is.
* Explain what a for loop does.
* Write for loops to repeat simple calculations.
* Write conditional statements including __`if`__, __`elif`__, and __`else`__ branches.
* Correctly evaluate expressions containing __`and`__ and __`or`__.

Often you want to repeat some work.

## Lists and loops

We have seen that variables can be of some simple types such as float for __`weight_kg`__ and of very complex types like our table __`airquality`__. These are kind of extreme examples of the kind of _data-types_ you will find in Python.

A common data type is a _list_. A list is an ordered sequence of things, any kind of things. If we want to create a list with the numbers one to four we can use the syntax: __`[1, 2, 3, 4]`__.

Elements of a list can be extracted using the same indexing syntax we saw before. E.g. the fourth odd number is:

In [None]:
odds = [1, 3, 5, 7]
odds[3]

We can get the length of a list with the built-in function __`len`__: __`len(odds)`__

In [None]:
len(odds)

To repeat an operation you write a _for-loop_, which has the general form

```
for variable in collection:
    do things with variable
```

E.g.

In [None]:
for number in odds:
    print(number)

Each element (__`number`__) in the list __`odds`__ is looped through and printed one element after another.

We can call the loop variable anything we like, but there must be a colon at the end of the line starting the loop, and we must indent anything we want to run inside the loop. Unlike many other languages, there is no command to signify the end of the loop body (e.g. end for); what is indented after the for statement belongs to the loop.

Here’s another loop that repeatedly updates a variable:
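
A minimal sketch of such a loop, assuming we want to add up the values in the list `odds` from the earlier example (`total` is a name chosen here for illustration):

```python
odds = [1, 3, 5, 7]

total = 0
for number in odds:
    # each pass through the loop adds the current element to the running total
    total = total + number

# this line is not indented, so it runs once, after the loop has finished
print('sum of odds:', total)   # sum of odds: 16
```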

Python has a built-in function called __`range`__ that creates a sequence of numbers. __`range`__ can accept 1-3 parameters. If one parameter is provided, __`range`__ creates a sequence of that length, starting at zero and incrementing by 1. If 2 parameters are provided, __`range`__ starts at the first and ends just before the second, incrementing by one. If __`range`__ is passed 3 parameters, it starts at the first one, ends just before the second one, and increments by the third one. For example, __`range(3)`__ produces the numbers 0, 1, 2, while __`range(2, 5)`__ produces 2, 3, 4, and __`range(3, 10, 3)`__ produces 3, 6, 9.
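
These three cases can be checked directly. In the sketch below, `list(...)` is used only to turn each range into a list so that its contents are displayed:

```python
# one parameter: start at 0, stop just before 3
print(list(range(3)))          # [0, 1, 2]

# two parameters: start at 2, stop just before 5
print(list(range(2, 5)))       # [2, 3, 4]

# three parameters: start at 3, stop just before 10, step by 3
print(list(range(3, 10, 3)))   # [3, 6, 9]
```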

## Exercise 3

Write a loop that uses __`range`__ to print the first 3 natural numbers:

```
1
2
3
```

## Conditionals

We can ask Python to take different actions, depending on a condition, with an __`if`__ statement:

In [None]:
num = 37
if num > 100:
    print('greater')
else:
    print('not greater')
print('done')

The second line of this code uses the keyword __`if`__ to tell Python that we want to make a choice. If the test that follows the __`if`__ statement is true, the body of the __`if`__ (i.e., the lines indented underneath it) is executed. If the test is false, the body of the __`else`__ is executed instead. Only one or the other is ever executed:

![Conditional Flowchart Example](figures/python-flowchart-conditional.png)
<center>__Figure: Conditional Flowchart Example__</center>

Conditional statements don’t have to include an __`else`__. If there isn’t one, Python simply does nothing if the test is false:


In [None]:
num = 53
print('before conditional...')
if num > 100:
    print('53 is greater than 100')
print('...after conditional')

We can also chain several tests together using __`elif`__, which is short for “else if”. The following Python code uses __`elif`__ to print the sign of a number.

In [None]:
num = -3

if num > 0:
    print(num, "is positive")
elif num == 0:
    print(num, "is zero")
else:
    print(num, "is negative")

One important thing to notice in the code above is that we use a double equals sign __`==`__ to test for equality rather than a single equals sign because the latter is used to mean assignment.

We can also combine tests using __`and`__ and __`or`__. __`and`__ is only true if both parts are true:

In [None]:
if (1 > 0) and (-1 > 0):
    print('both parts are true')
else:
    print('at least one part is false')

while __`or`__ is true if at least one part is true:

In [None]:
if (1 < 0) or (-1 < 0):
    print('at least one test is true')

## Exercise 4

Which of the following would be printed if you were to run this code? Why did you pick this answer?

1. A
2. B
3. C
4. B and C

```
if 4 > 5:
    print('A')
elif 4 == 5:
    print('B')
elif 4 < 5:
    print('C')
```

Sometimes it is useful to check whether some condition is not true. The Boolean operator __`not`__ can do this explicitly. After reading and running the code below, write some __`if`__ statements that use __`not`__ to test the rule you formulated in the previous challenge.

In [None]:
if not '':
    print('empty string is not true')
if not 'word':
    print('word is not true')
if not not True:
    print('not not True is true')

## Exercise 5

Write some conditions that print __`True`__ if a variable __`a`__ is within 10% of a variable __`b`__ and __`False`__ otherwise. Compare your implementation with your partner’s: do you get the same answer for all possible pairs of numbers?

## Creating Functions

At this point, we’ve written code to import the New York Air Quality dataset, generated summary statistics, drawn plots, calculated correlation coefficients between attributes and saved our modified dataset to disk. 

For most data processing and analysis projects, our code can get pretty long and complicated. For example, what if we had thousands of datasets and our existing code generated figures for each one? If we no longer wanted a figure for every dataset, commenting out the specific figure-drawing code would become a nuisance. Also, what if we want to use that code again, on a different dataset or at a different point in our program? Cutting and pasting it is going to make our code very long and very repetitive, very quickly. We’d like a way to package our code so that it is easier to reuse, and Python provides for this by letting us define things called ‘functions’: a shorthand way of re-executing longer pieces of code.

Let’s start by defining a function __`fahr_to_kelvin`__ that converts temperatures from Fahrenheit to Kelvin:


In [None]:
def fahr_to_kelvin(temp):
    return ((temp - 32) * (5/9)) + 273.15

![Python Function Blueprint](figures/python-function.svg)
<center>__Figure: Python Function Blueprint__</center>

The function definition opens with the keyword __`def`__ followed by the name of the function and a parenthesized list of parameter names. The body of the function — the statements that are executed when it runs — is indented below the definition line.

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

Let’s try running our function. Calling our own function is no different from calling any other function:

In [None]:
print('freezing point of water:', fahr_to_kelvin(32))
print('boiling point of water:', fahr_to_kelvin(212))

We’ve successfully called the function that we defined, and we have access to the value that we returned.
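
Having access to the returned value means we can also store it in a variable and reuse it later. A minimal sketch (the variable name `freezing_k` is chosen here for illustration; the function is repeated so the example runs on its own):

```python
def fahr_to_kelvin(temp):
    return ((temp - 32) * (5/9)) + 273.15

# the returned value can be assigned to a variable like any other value
freezing_k = fahr_to_kelvin(32)

print(freezing_k)           # 273.15
print(type(freezing_k))     # <class 'float'>
```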

## Composing Functions

Now that we’ve seen how to turn Fahrenheit into Kelvin, it’s easy to turn Kelvin into Celsius:

In [None]:
def kelvin_to_celsius(temp_k):
    return temp_k - 273.15

print('absolute zero in Celsius:', kelvin_to_celsius(0.0))

What about converting Fahrenheit to Celsius? We could write out the formula, but we don’t need to. Instead, we can compose the two functions we have already created:

In [None]:
def fahr_to_celsius(temp_f):
    temp_k = fahr_to_kelvin(temp_f)
    result = kelvin_to_celsius(temp_k)
    return result

print('freezing point of water in Celsius:', fahr_to_celsius(32.0))

This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here (typically half a dozen to a few dozen lines), but they shouldn’t ever be much longer than that, or the next person who reads them won’t be able to understand what’s going on.

## Exercise 6

**Combining Strings**

“Adding” two strings produces their concatenation: __`'a' + 'b'`__ is __`'ab'`__. Write a function called __`fence`__ that takes two parameters called __`original`__ and __`wrapper`__ and returns a new string that has the wrapper character at the beginning and end of the original. For example, the call __`fence('name', '*')`__ should return __`'*name*'`__.

## Exercise 7

**Selecting characters from strings**

If the variable __`s`__ refers to a string, then __`s[0]`__ is the string’s first character and __`s[-1]`__ is its last. Write a function called __`outer`__ that returns a string made up of just the first and last characters of its input. For example, the call __`outer('helium')`__ should return __`'hm'`__.

## Exercise 8

**Variables inside and outside functions**

What does the following piece of code display when run - and why?

```
f = 0
k = 0

def f2k(f):
  k = ((f-32)*(5.0/9.0)) + 273.15
  return k

f2k(8)
f2k(41)
f2k(32)

print(k)
```

## Exercise 9

**The old switcheroo**

Which of the following would be printed if you were to run this code? Why did you pick this answer?

1. `7 3`
2. `3 7`
3. `3 3`
4. `7 7`

Code:

```
a = 3
b = 7

def swap(a, b):
    temp = a
    a = b
    b = temp

swap(a, b)
print(a, b)
```


## Exercise 10

**Readable code**

Revise a function you wrote for one of the previous exercises to try to make the code more readable. Then, collaborate with one of your neighbours to critique each other’s functions and discuss how your function implementations could be further improved to make them more readable.