# CIC Python workshop 

# Analysing patient data

### Learning objectives

* Explain what a library is, and what libraries are used for.
* Load a Python library and use the things it contains.
* Read tabular data from a file into a program.
* Assign values to variables.
* Select individual values and subsections from data.
* Perform operations on arrays of data.
* Display simple graphs.

Words are useful, but what’s more useful are the sentences and stories we build with them. Similarly, while a lot of powerful, general tools are built into languages like Python, specialized tools built up from these basic units live in libraries that can be called upon when needed.

In order to load our inflammation data, we need to import a library called __pandas__. pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It is well suited for many different kinds of data:

* Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet.
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
* Any other form of observational or statistical data set. The data need not be labeled at all to be placed into a pandas data structure.

We can load _pandas_ using:

In [None]:
import pandas

Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Once we’ve loaded the library, we can ask it to read our data file for us:

In [None]:
pandas.read_csv('data/inflammation-01.csv')

The expression __`pandas.read_csv(...)`__ is a function call that asks Python to run the function __`read_csv`__, which belongs to the __`pandas`__ library. This dotted notation is used everywhere in Python to refer to the parts of things as `thing.component`.
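The same dotted notation works for any library. As a small sketch, the example below uses `math` from Python's standard library (chosen only for illustration; it is not part of the inflammation analysis). `math.sqrt` is a function inside the library and `math.pi` is a value inside it:

```python
import math

# library.function(...): call the sqrt function that belongs to math
root = math.sqrt(16)
print(root)  # 4.0

# library.value: look up the constant pi that belongs to math
print(math.pi)
```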

__`pandas.read_csv`__ has one required parameter: the name of the file we want to read. This parameter needs to be a character string (or string for short), so we put it in quotes.

When we are finished typing and press Shift+Enter, the notebook runs our command. Since we haven’t told it to do anything else with the function’s output, the notebook displays it. In this case, that output is the data we just loaded. By default, not all rows and columns are shown; `...` marks where elements have been omitted.

Our call to __`pandas.read_csv`__ read our file, but didn’t save the data in memory. To do that, we need to assign the table to a variable.

### Variables

A variable is just a name for a value, such as __`x`__, __`current_temperature`__, or __`subject_id`__. Python’s variable names must begin with a letter or an underscore and are case sensitive. We can create a new variable by assigning a value to it using `=`. As an illustration, let’s step back and, instead of considering a table of data, consider the simplest “collection” of data, a single value. The line below assigns the value __55__ to a variable __`weight_kg`__:
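A short sketch of these naming rules follows. The names and values are made up for illustration and are not used elsewhere in the lesson:

```python
# Names are case sensitive, so these are two different variables.
subject_id = 'p001'
Subject_ID = 'p002'
print(subject_id, Subject_ID)

# str.isidentifier() reports whether text has the form of a valid name
# (it does not check for reserved words such as "for").
print('current_temperature'.isidentifier())   # True
print('2nd_reading'.isidentifier())           # False: starts with a digit
```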

In [None]:
weight_kg = 55

Once a variable has a value, we can print it to the screen:

In [None]:
print(weight_kg)

and do arithmetic with it:

In [None]:
print('weight in pounds:', 2.2 * weight_kg)

As the example above shows, we can print several things at once by separating them with commas.
We can also change a variable’s value by assigning it a new one:

In [None]:
weight_kg = 57.5
print('weight in kilograms is now:', weight_kg)

If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value:

![Variables as Sticky Notes](figures/python-sticky-note-variables-01.svg)
<center>__Figure: Variables as Sticky Notes__</center>

This means that assigning a value to one variable does not change the values of other variables. For example, let’s store the subject’s weight in pounds in a variable:

In [None]:
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)

![Creating Another Variable](figures/python-sticky-note-variables-02.svg)
<center>__Figure: Creating Another Variable__</center>

and then change `weight_kg`:

In [None]:
weight_kg = 100.0
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)

![Updating a Variable](figures/python-sticky-note-variables-03.svg)
<center>__Figure: Updating a Variable__</center>

Since __`weight_lb`__ doesn’t “remember” where its value came from, it isn’t automatically updated when __`weight_kg`__ changes. This is different from the way spreadsheets work.
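To bring a derived value up to date, we re-run the assignment that computes it. A minimal sketch, reusing the variable names from above:

```python
weight_kg = 57.5
weight_lb = 2.2 * weight_kg      # computed from the current weight_kg

weight_kg = 100.0                # weight_lb is unaffected by this line
weight_lb = 2.2 * weight_kg      # re-run the calculation to update it
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)
```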

Just as we can assign a single value to a variable, we can also assign a table of values to a variable using the same syntax. Let’s re-run __`pandas.read_csv`__ and save its result:

In [None]:
data = pandas.read_csv('data/inflammation-01.csv')

This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value:

In [None]:
print(data)
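Printing a large table can produce a lot of output. One alternative is the `head` method of a DataFrame, which returns only the first few rows. The sketch below reads a tiny made-up table from a string through `io.StringIO` so that it runs without the workshop file. The column names and values are assumptions for illustration, not the real contents of `inflammation-01.csv`:

```python
import io
import pandas

# Made-up CSV text standing in for a file on disk (hypothetical values).
csv_text = """patient,day,inflammation
1,0,0
1,1,3
2,0,0
2,1,1
"""

# read_csv accepts a file name or a file-like object such as StringIO.
example = pandas.read_csv(io.StringIO(csv_text))

# head(2) returns the first two rows only.
print(example.head(2))
```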

Now that our data is in memory, we can start doing things with it. First, let’s ask what type of thing `data` refers to:

In [None]:
print(type(data))

The output tells us that `data` currently refers to a thing called a `DataFrame`, created by the `pandas` library (more precisely, by its `pandas.core.frame` sub-module). The name DataFrame is adopted from the R programming language; you can think of it as a table with rows and columns.

This looks complex but there are simpler examples.

In [None]:
print(type(weight_kg))

This tells us that `weight_kg` is a floating point number.
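As a further sketch, `type` works on any value. Whole numbers, numbers with a decimal point, and text each have their own type:

```python
print(type(55))                           # <class 'int'>
print(type(57.5))                         # <class 'float'>
print(type('data/inflammation-01.csv'))   # <class 'str'>
```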

Our data describe inflammation in arthritis patients, measured daily. Each record contains the inflammation measurement for one patient on one day. We can see the size of the table like this:

In [None]:
data.shape

This tells us that `data` has 2400 rows and 3 columns. When we created the variable `data` to store our arthritis data, we didn’t just create the table, we also created information about the table, called members or attributes. This extra information describes data in the same way an adjective describes a noun. `data.shape` is an attribute of `data` which describes the dimensions of `data`. We use the same dotted notation for the attributes of variables that we use for the functions in libraries because they have the same part-and-whole relationship.
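Because `shape` is a pair of numbers (rows first, then columns), we can also unpack it into two variables. A minimal sketch on a small made-up table follows; the column names `patient`, `day`, and `inflammation` are assumptions for illustration:

```python
import pandas

# A small hypothetical table with 4 rows and 3 columns.
small = pandas.DataFrame({
    'patient': [1, 1, 2, 2],
    'day': [0, 1, 0, 1],
    'inflammation': [0, 3, 0, 1],
})

print(small.shape)                    # (4, 3)
n_rows, n_columns = small.shape       # unpack the two numbers
print('rows:', n_rows, 'columns:', n_columns)
```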

We can get summary statistics by doing:

In [None]:
data.describe()

`describe` here is a _method_ of the DataFrame, i.e., a function that belongs to the table in the same way that the member `shape` does. If variables are nouns, methods are verbs: they are what the thing in question knows how to do. We need empty parentheses for `data.describe()`, even when we’re not passing in any parameters, to tell Python to go and do something for us. `data.shape` doesn’t need `()` because it is just a description, but `data.describe()` requires the `()` because it is an action.
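The difference between an attribute and a method can be seen by leaving the parentheses off. In the sketch below (on a made-up table, not the workshop data), `small.describe` without `()` only refers to the method, while `small.describe()` runs it:

```python
import pandas

small = pandas.DataFrame({'inflammation': [0, 3, 0, 1]})  # hypothetical values

# Without parentheses we get the method itself, not a result.
print(callable(small.describe))   # True

# With parentheses the method runs and returns a table of statistics.
summary = small.describe()
print(summary)
```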

If we want to get a single row from the table, we provide its index label in square brackets after `.loc`, much as we index a vector in math:

**TODO** describe how to access a column and a row using .loc first and data.patient later 

In [None]:
print(data.loc[0])
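`.loc` selects by label: the first entry in the square brackets picks rows and the optional second entry picks columns. The sketch below shows a row, a column, and a single entry on a small made-up table. The column names and values are assumptions for illustration and may not match the workshop file:

```python
import pandas

small = pandas.DataFrame({
    'patient': [1, 1, 2, 2],
    'day': [0, 1, 0, 1],
    'inflammation': [0, 3, 0, 1],
})

# One row: the row whose label is 1 (returned as a Series).
print(small.loc[1])

# One column: every row (:) of the 'inflammation' column.
print(small.loc[:, 'inflammation'])

# One entry: row label 1, column 'inflammation'.
entry = small.loc[1, 'inflammation']
print(entry)   # 3
```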

**TODO** some plots, remember to do `%matplotlib inline`

some questions / exercises

### Repeating Actions with Loops

http://swcarpentry.github.io/python-novice-inflammation/02-loop.html

### Storing Multiple Values in Lists

http://swcarpentry.github.io/python-novice-inflammation/03-lists.html

### Making Choices

http://swcarpentry.github.io/python-novice-inflammation/05-cond.html

### Creating Functions

http://swcarpentry.github.io/python-novice-inflammation/06-func.html