# CIC Virtual Carpentries Workshop - Day 1

This lesson is adapted from the [Data Carpentry Ecology lesson](http://www.datacarpentry.org/python-ecology-lesson/)

We'll be using the gitter channel to share solutions to challenges, ask questions and chat:

**enter link here**


## How to use a Jupyter Notebook

https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html

https://www.packtpub.com/books/content/getting-started-jupyter-notebook-part-1

- The file autosaves
- You run a cell with **shift + enter** or using the run button in the tool bar
- If you run a cell with **option + enter** it will also create a new cell below
- See *Help > Keyboard Shortcuts* or the *Cheatsheet* for more info


- The notebook has different types of cells; Code and Markdown are the most commonly used
- **Code** cells expect code for the kernel you have chosen. Syntax highlighting is available, and comments are marked with `#`: anything after the `#` on a line is not executed
- **Markdown** cells allow you to write report-style text, using Markdown to format it (e.g. headers, bold face, etc.)

## Introduction to Python and data analysis using pandas

Python is a high-level, interpreted programming language. This means the code is easy for humans to read, there is no need for us to compile it, and in many cases we do not have to think much about the underlying system (e.g. memory usage).

As a consequence, we can use it in two ways:
- Using the interpreter as an "advanced calculator" in interactive mode:

In [None]:
# Calculations

In [None]:
# Printing text to screen

- Executing programs/scripts saved as a text file, usually with *.py extension:

In [None]:
# running scripts (using jupyter notebook magics)

# Types of Data

How information is stored in a DataFrame or a Python object affects what we can do with it and the outputs of calculations as well. There are two main types of data that we'll explore in this lesson: numeric and character types.


## Numeric Data Types

Numeric data types include integers and floats. A **floating point** (known as a
float) number has a decimal point even if the decimal part is 0. For
example: 1.13, 2.0, 1234.345. If we have a column that contains both integers and
floating point numbers, Pandas will assign the entire column to the float data
type so the decimal points are not lost. In a vector or DataFrame (we learn about these different types later), the entire object or an entire column will be of the same type.

An **integer** never has a decimal point. If converted to an integer, 1.13 would be stored as 1
and 1234.345 as 1234. You will often see the data type `int64` in Pandas,
which stands for 64-bit integer. The 64 refers to the memory allocated to
store each value, which determines how large a number each "cell" can
hold. Allocating space ahead of time allows computers to
optimize storage and processing efficiency.
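A minimal sketch in plain Python of the difference between the two numeric types described above. The variable names and values are illustrative only.

```python
# A float keeps its decimal part
weight = 1234.345
print(type(weight))          # <class 'float'>

# Converting to an integer drops the decimal part (it truncates, it does not round)
whole_weight = int(weight)
print(whole_weight)          # 1234
print(type(whole_weight))    # <class 'int'>

# Mixing an integer and a float in a calculation gives a float
total = 2 + 1.13
print(type(total))           # <class 'float'>
```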



## Character Data Types

Strings are values that contain numbers and/or characters.
For example, a string might be a word, a sentence, or several sentences.
A string can also contain or consist of numbers. For instance, '1234' could be stored as a
string, as could '10.23'. However, **strings that contain numbers cannot be used
for mathematical operations**!
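As a small illustration of the point above (the variable names are made up for this sketch), adding a number to a string of digits fails until the string is converted to a numeric type:

```python
count_as_text = '1234'   # a string that happens to contain only digits

# Adding a number to a string raises a TypeError
try:
    count_as_text + 1
except TypeError as err:
    print('Cannot add a number to a string:', err)

# Convert the string to a numeric type first, then do the maths
count = int(count_as_text)
price = float('10.23')
print(count + 1)         # 1235
print(type(price))       # <class 'float'>
```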





In [None]:
# Examples of numeric and text data
text = 
number =  
pi_value =

Here we've assigned data to variables, namely `text`, `number` and `pi_value`,
using the assignment operator `=`. The variable called `text` is a string which
means it can contain letters and numbers. We could reassign the variable `text`
to an integer too - but be careful reassigning variables as this can get 
confusing.

To print out the value stored in a variable we can simply type the name of the
variable into the interpreter:

In [None]:
text 

A cell, by default, will print to screen the last thing it evaluates (unless this is explicitly written to a variable).

Thus, in scripts and for evaluating things anywhere else within a cell, we must use the `print` function:

In [None]:
# Next line will print out text
print(text)

In [None]:
# We also need the print function if we want to see more than one variable
text
number

In [None]:
print(text, number, pi_value)

### Mathematical Operators

We can perform mathematical calculations in Python using the basic operators
 `+, -, /, *, %`:
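A short sketch of each operator in turn, using two arbitrary example values:

```python
a = 7
b = 2

total = a + b        # addition
difference = a - b   # subtraction
quotient = a / b     # division (always returns a float in Python 3)
product = a * b      # multiplication
remainder = a % b    # modulo: the remainder after division

print(total, difference, quotient, product, remainder)   # 9 5 3.5 14 1
```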

### Logical Operators
We can also use comparison and logic operators:
`<, >, ==, !=, <=, >=` and logical operators such as
`and, or, not`. The data type returned by these is
called a _boolean_.
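For instance, with an arbitrary example value (the names here are illustrative only):

```python
weight = 3

is_light = weight < 4                    # True
is_four = weight == 4                    # False
in_range = weight > 1 and weight <= 4    # True: both comparisons hold
not_in_range = not in_range              # False

print(is_light, is_four, in_range, not_in_range)
print(type(is_light))                    # <class 'bool'>
```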

## Sequential types: Lists and Tuples

### Lists

**Lists** are a common data structure to hold an ordered sequence of
elements. Each element can be accessed by an index.  Note that Python
indexes start with 0 instead of 1:

In [None]:
numbers = 
numbers[0]

To add elements to the end of a list, we can use the `append` method:

In [None]:
numbers.append()
print(numbers)
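A filled-in sketch of the two cells above, using example values chosen purely for illustration:

```python
numbers = [1, 2, 3]

# Indexing starts at 0, so this is the first element
print(numbers[0])      # 1

# append adds a single element to the end of the list, in place
numbers.append(4)
print(numbers)         # [1, 2, 3, 4]
print(len(numbers))    # 4
```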

**Methods** are a way to interact with an object (a list, for example). We can invoke 
a method using the dot `.` followed by the method name and a list of arguments in parentheses. 
To find out what methods are available for an object, we can use the built-in `help` command:

In [None]:
help(numbers)

We can also access a list of methods using `dir`. Some method names are
surrounded by double underscores. Those methods are called "special", and
we usually access them in a different way. For example, the `__add__` method is
responsible for the `+` operator.

In [None]:
dir(numbers)
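To make the link between `+` and `__add__` concrete, here is a small sketch with two throwaway lists:

```python
first = [1, 2]
second = [3]

# The + operator and the special method __add__ give the same result
combined = first + second
same_result = first.__add__(second)
print(combined)                    # [1, 2, 3]
print(combined == same_result)     # True

# dir returns the method names as a list of strings
print('append' in dir(first))      # True
```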

### Tuples

A tuple is similar to a list in that it's an ordered sequence of elements. However,
tuples can not be changed once created (they are "immutable"). Tuples are
created by placing comma-separated values inside parentheses `()`.

In [None]:
a_tuple = 
another_tuple = 
a_list = 

### Challenge
1. What happens when you type `a_tuple[2]=5` vs `a_list[1]=5` ?
2. Type `type(a_tuple)` into python - what is the object type?


## Control flow
It's often important to tell our software to do something only if certain conditions are met:

In [None]:
integer = 

In [None]:
if(integer < 4):
    pass
if(integer > 4): 
    pass
if(integer == 4):
    pass
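One possible way to fill in the skeleton above. This sketch uses an example value and `if` / `elif` / `else`, so that exactly one branch runs:

```python
integer = 3   # example value, chosen for illustration

if integer < 4:
    message = 'less than 4'
elif integer > 4:
    message = 'greater than 4'
else:
    message = 'exactly 4'

print(integer, 'is', message)   # 3 is less than 4
```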

## Looping

Doing things one at a time can be quite tedious. Programming languages allow us to **iterate** what we do programmatically:

In [None]:
for i in range(10):
    print('The current number is,', i)

## Functions and packages
> "If you wish to make an apple pie from scratch, you must first invent the universe" - Carl Sagan  

When creating something, it's often an inordinate task to do it all from scratch.  
Other people have likely already made us tools and processes to do what we want in a much faster and easier way.  
When building a house, we don't start by growing trees, making bricks from clay, and forging hammers from iron ore!


In Python, if we wanted to make an apple pie from scratch, we could simply write:

```python
import apple_pie
apple_pie.create_from_scratch()
```

In [None]:
# Make a function to add two weighted numbers, and take the average
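The exercise above can be read in more than one way. The sketch below is one interpretation, not the only solution: it multiplies each number by a weight and divides by the total weight. The function name and parameter names are invented for this example.

```python
def weighted_average(a, b, weight_a=1.0, weight_b=1.0):
    """Return the average of a and b, each scaled by its weight."""
    total_weight = weight_a + weight_b
    return (a * weight_a + b * weight_b) / total_weight


# With equal weights this is the ordinary average
print(weighted_average(2, 4))                           # 3.0

# Giving the first number three times the weight pulls the result towards it
print(weighted_average(2, 4, weight_a=3, weight_b=1))   # 2.5
```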

# Working With Pandas DataFrames in Python


## About Libraries

![](package.png)

A library in Python contains a set of tools (called functions) that perform
tasks on our data. Importing a library is like getting a piece of lab equipment
out of a storage locker and setting it up on the bench for use in a project.
Once a library is set up, it can be used or called to perform many tasks.

Python doesn't load all of the libraries available to it by default. We have to
add an `import` statement to our code in order to use library functions. To import
a library, we use the syntax `import libraryName`. If we want to give the
library a nickname to shorten the command, we can add `as nickNameHere`.  An
example of importing the pandas library using the common nickname `pd` is below.

You only need to load a library once during your session. You can load a library when it is needed,
or you can load all necessary libraries at the beginning of your script.
Loading them at the beginning is good practice, especially for the readability of your code.
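A minimal sketch of the import syntax, assuming pandas is installed in your environment:

```python
# Import the library and give it the conventional nickname pd
import pandas as pd

# From now on the library is reached through its nickname
print(pd.__name__)   # pandas
```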

## Pandas in Python

One of the best options for working with tabular data in Python is to use the
[Python Data Analysis Library](http://pandas.pydata.org/) (a.k.a. Pandas). The
Pandas library provides data structures, produces high quality plots with
[matplotlib](http://matplotlib.org/) and integrates nicely with other libraries
that use [NumPy](http://www.numpy.org/) (which is another Python library) arrays.

A handy **Pandas cheat sheet** can be found [here](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

Each time we call a function that's in a library, we use the syntax
`LibraryName.FunctionName`. Adding the library name with a `.` before the
function name tells Python where to find the function. If we
import Pandas as `pd`, we don't have to type out `pandas` each
time we call a Pandas function.


## Our Data 

For this lesson, we will be using the Portal Teaching data, a subset of the data
from Ernest et al.,
[Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA](http://www.esapubs.org/archive/ecol/E090/118/default.htm).

We will be using files from the [Portal Project Teaching Database](https://figshare.com/articles/Portal_Project_Teaching_Database/1314459).
This section will use the `surveys.csv` file, which can be found in `/data/python/python_data`.

We are studying the species and weight of animals caught in plots in our study
area. The dataset is stored as a `.csv` file: each row holds information for a
single animal, and the columns represent:

| Column           | Description                        |
|------------------|------------------------------------|
| record_id        | Unique id for the observation      |
| month            | month of observation               |
| day              | day of observation                 |
| year             | year of observation                |
| plot             | ID of a particular plot            |
| species          | 2-letter code                      |
| sex              | sex of animal ("M", "F")           |
| wgt              | weight of the animal in grams      |


The first few rows of our first file look like this:

```
record_id,month,day,year,plot,species,sex,wgt
1,7,16,1977,2,NA,M,
2,7,16,1977,3,NA,M,
3,7,16,1977,2,DM,F,
```

## Starting in the same spot

To help the lesson run smoothly, let's ensure everyone is in the same directory.
This should help us avoid path and file name issues. At this time please
navigate to the workshop directory. If you are working in a Jupyter Notebook, be sure
that you start your notebook in the workshop directory.

A quick aside: there are Python libraries, like the [os
library](https://docs.python.org/3/library/os.html), that can work with our
directory structure; however, that is not our focus today.

If you need to change your directory, `import os` and use `os.chdir`:

In [None]:
# check if you need to change your directory
import os
os.getcwd()  

In [None]:
os.listdir("../")

In [None]:
# If not already in the data folder, change directory
os.chdir("../data/")

**Be careful not to execute the above cell twice**, as it will try to change directory again, this time relative to your new location, which will give you an error.

In [None]:
# check we are now in the correct folder
os.getcwd()  

In [None]:
# Load the library
import pandas as pd
#check your version, we need v0.19 or higher
pd.__version__

# Reading CSV Data Using Pandas

We will begin by locating and reading our survey data which are in CSV format.
We can use Pandas' `read_csv` function to pull the file directly into a
[DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).

## So What's a DataFrame?

A DataFrame is a 2-dimensional data structure that can store data of different
types (including characters, integers, floating point values, categorical data and more)
in columns. It is similar to a spreadsheet, an SQL table, or the `data.frame` in
R. A DataFrame always has an index (0-based by default). An index refers to the position of
an element in the data structure.


In [None]:
# note that pd.read_csv is used because we imported pandas as pd
pd.read_csv("surveys.csv")

We can see that there were 33,549 rows parsed. Each row has 9
columns. The first column is the index of the DataFrame. The index is used to
identify the position of the data, but it is not an actual column of the DataFrame.
It looks like the `read_csv` function in Pandas read our file properly. However,
we haven't saved the data to a variable, so we can't work with it yet. We need to assign the
DataFrame to a variable. Remember that a variable is a name for a value, such as `x`
or `data`. We can create a new object with a variable name by assigning a value to it using `=`.

Let's call the imported survey data `surveys_df`:



In [None]:
surveys_df = pd.read_csv("surveys.csv")

Notice when you assign the imported DataFrame to a variable, Python does not
produce any output on the screen. We can print the value of the `surveys_df`
object by typing its name into the Python command prompt.
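If you want to try `read_csv` without the data file to hand, the sketch below reads the three sample rows shown earlier from an in-memory string instead of from disk. The names `csv_text` and `sample_df` are invented for this example; the real lesson reads `surveys.csv`.

```python
import io

import pandas as pd

# The header and first three rows of the survey file, as plain text
csv_text = """record_id,month,day,year,plot,species,sex,wgt
1,7,16,1977,2,NA,M,
2,7,16,1977,3,NA,M,
3,7,16,1977,2,DM,F,
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)              # (3, 8): 3 rows and 8 columns
print(sample_df.columns.tolist())
```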


## Manipulating Our Species Survey Data

Now we can start manipulating our data. First, let's check the data type of the
data stored in `surveys_df` using the `type` function. The `type` function and the
`__class__` attribute tell us that `surveys_df` is

`<class 'pandas.core.frame.DataFrame'>`.

In [None]:
type(surveys_df)

In [None]:
surveys_df.__class__

We can also enter `surveys_df.dtypes` at our prompt to view the data type for each
column in our DataFrame. `int64` represents numeric integer values - `int64` cells
can not store decimals. `object` represents strings (letters and numbers). `float64`
represents numbers with decimals.

In [None]:
surveys_df.dtypes

Pandas and base Python use slightly different names for data types. More on this
is in the table below:

| Pandas Type | Native Python Type | Description |
|-------------|--------------------|-------------|
| object | string | The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings). |
| int64  | int | Integer values. 64 refers to the bits of memory allocated to hold each value. |
| float64 | float | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal. |
| datetime64, timedelta[ns] | N/A (but see the [datetime](http://doc.python.org/2/library/datetime.html) module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments. |


---


## Exploring DataFrames in Python

There are multiple methods and attributes that can be used to access and summarise the data
stored in DataFrames. Let's try out a few. Note that we call a method by using
the object name followed by `.` and the method name; attributes are accessed the same way, but without parentheses. So `surveys_df.columns` provides an index
of all of the column names in our DataFrame.

In [None]:
surveys_df.columns

### Selecting Rows and Columns


In pandas you can use several ways to **select a specific column**:
- square brackets `[]` 
- a `.` and the column name

For example, we can select all of the data from a column named `species` from the `surveys_df`
DataFrame by name:

```python
surveys_df['species']
# this syntax, calling the column as an attribute, gives you the same output
surveys_df.species
```

Using double square brackets `[[]]`, we can select several columns at once by passing a list of the column names we want.
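A small sketch of selecting several columns with a list of names. It uses a tiny made-up DataFrame (`sample_df`, with invented values) so that it runs without the survey file; with the lesson data you would use `surveys_df` instead.

```python
import pandas as pd

# Made-up rows with a few of the survey columns
sample_df = pd.DataFrame({
    'record_id': [1, 2, 3],
    'species': ['NL', 'DM', 'DM'],
    'sex': ['M', 'M', 'F'],
    'wgt': [None, 40.0, 48.0],
})

# A list of names inside the brackets returns a DataFrame with those columns
species_and_wgt = sample_df[['species', 'wgt']]
print(species_and_wgt.columns.tolist())   # ['species', 'wgt']

# A single name returns a Series instead
single_column = sample_df['species']
print(type(single_column))
```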


We can also create a new object that contains the data as follows:

```python
# create an object named surveys_species that only contains the `species` column
surveys_species = surveys_df['species']
```

**NOTE:** If a column name is not contained in the DataFrame, an exception
(error) will be raised.

```python
surveys_df['speciess']
```

### Challenges

Try out the methods below to see what they return.

1. `surveys_df.columns`.
2. `surveys_df.head()`. Also, what does `surveys_df.head(15)` do?
3. `surveys_df.tail()`.
4. `surveys_df.shape`. Take note of the output of the `shape` attribute. What format does it return the shape of the DataFrame in?

HINT: [More on tuples, here](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences).

---

## Converting between different data types

In [None]:
# Let's check the types of data we have in our dataframe
surveys_df.dtypes

In [None]:
# convert the record_id field from an integer to a float
surveys_df['record_id'] = surveys_df['record_id'].astype('float64')

In [None]:
surveys_df.dtypes

What happens if we try to convert weight values to integers?

In [None]:
surveys_df['wgt'].astype('int64')

Notice that this throws a value error: `ValueError: Cannot convert NA to
integer` (the exact wording depends on your pandas version). If we look at the `wgt` column in the surveys data we notice that
there are NaN (**N**ot **a** **N**umber) values. *NaN* values are undefined
values that cannot be represented mathematically. Pandas, for example, will read
an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable
properties: if we were to average the `wgt` column without replacing our NaNs,
Pandas would know to skip over those cells.


In [None]:
surveys_df['wgt'].mean()
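The behaviour described above can be seen on a tiny made-up Series (the values are invented for illustration): the mean skips the NaN, while converting to integers fails.

```python
import numpy as np
import pandas as pd

# Three weights, one of them missing
weights = pd.Series([40.0, np.nan, 48.0])

# mean() skips NaN by default, so this averages 40.0 and 48.0 only
print(weights.mean())   # 44.0

# NaN has no integer representation, so the conversion raises an error
conversion_failed = False
try:
    weights.astype('int64')
except ValueError as err:
    conversion_failed = True
    print('Conversion failed:', err)
```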

_Note: older pandas versions do not know how to handle NaN; please update to v0.19 or higher._

Check your pandas version using `pd.__version__`. If you need to update, open a bash shell
and type `conda update pandas`.

---

## Missing Data Values - NaN

Dealing with missing data values is always a challenge. It's sometimes hard to
know why values are missing - was it because of a data entry error? Or data that
someone was unable to collect? Should the value be 0? We need to know how
missing values are represented in the dataset in order to make good decisions.
If we're lucky, we have some metadata that will tell us more about how null
values were handled.

For instance, in some disciplines, like Remote Sensing, missing data values are
often defined as -9999. Having a bunch of -9999 values in your data could really
alter numeric calculations. Often in spreadsheets, cells are left empty where no
data are available. Pandas will, by default, replace those missing values with
NaN. However it is good practice to get in the habit of intentionally marking
cells that have no data, with a no data value! That way there are no questions
in the future when you (or someone else) explores your data.

### Where Are the NaN's?

Let's explore the NaN values in our data a bit further. 
First, let's figure out **how many rows contain NaN values for weight**. 
We can do this by identifying how many rows have a NULL value (`.isnull`) or by counting the number of rows that have a meaningful value (e.g., wgt>0):

In [None]:
surveys_df[pd.isnull(surveys_df['wgt'])]

In [None]:
surveys_df[surveys_df['wgt']>0]
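The two cells above display the matching rows. To get the actual counts, one option is to sum the boolean mask or take the length of the selection. The sketch below uses a small made-up DataFrame (`sample_df`) rather than the survey data.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    'record_id': [1, 2, 3, 4],
    'wgt': [np.nan, 40.0, np.nan, 48.0],
})

# isnull() gives True/False per row; True counts as 1 when summed
missing_count = sample_df['wgt'].isnull().sum()

# Alternatively, count the rows that pass the "meaningful value" filter
present_count = len(sample_df[sample_df['wgt'] > 0])

print(missing_count, present_count)   # 2 2
```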

We can replace all NaN values with zeroes using the `.fillna()` method (after
making a copy of the data so we don't lose our work).

However, NaN and 0 yield different analysis results. The mean value when NaN
values are replaced with 0 is different from when NaN values are simply thrown
out or ignored.

In [None]:
# replace NaN with 0
df1 = surveys_df.copy()
df1['wgt'] = df1['wgt'].fillna(0)

In [None]:
#check mean, how does it differ from before?
print(surveys_df['wgt'].mean())
print(df1['wgt'].mean())

We can fill NaN values with any value that we choose. The code below fills all
NaN values with the mean of all weight values.

```python
 df1['wgt'] = surveys_df['wgt'].fillna(surveys_df['wgt'].mean())
```

We could also choose to create a subset of our data, only keeping rows that do
not contain NaN values, using the `.dropna()` method.
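A minimal sketch of `.dropna()`, again on a made-up DataFrame (`sample_df` and its values are illustrative). The `subset` argument restricts the check to the `wgt` column.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    'record_id': [1, 2, 3],
    'wgt': [np.nan, 40.0, 48.0],
})

# Keep only the rows where wgt is not NaN; the original DataFrame is unchanged
complete_rows = sample_df.dropna(subset=['wgt'])

print(len(sample_df))       # 3
print(len(complete_rows))   # 2
```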

**The point is to make conscious decisions about how to manage missing data.** 
This is where we think about how our data will be used and how these values will
impact the scientific conclusions made from the data.

Python gives us all of the tools that we need to account for these issues. We
just need to be cautious about how the decisions that we make impact scientific
results.

In [None]:
df1['wgt'] = surveys_df['wgt'].fillna(surveys_df['wgt'].mean())
print(surveys_df['wgt'].mean())
print(df1['wgt'].mean())

## Calculating summary statistics for a Pandas DataFrame

We've read our data into Python. Next, let's perform some quick summary
statistics to learn more about the data that we're working with. We might want
to know how many animals were collected in each plot, or how many of each
species were caught. We can perform summary stats quickly using groups. But
first we need to figure out what we want to group by.

---

Let's find out how many unique plot IDs and species we have in our data:

In [None]:
# Reminder of the column names
surveys_df.columns.values

In [None]:
# Create a list of unique plot ID's  and species found in the surveys data
plot_names = pd.unique(surveys_df['plot'])
species = pd.unique(surveys_df['species'])

In [None]:
# Check the length of the list
print('There are: ' + str(len(plot_names)) + ' unique plots in the data')
print('There are: ' + str(len(species)) + ' unique species in the data')

In [None]:
# Single line solution
print('There are: ' + str(surveys_df['plot'].nunique()) + ' unique plots in the data')
print('There are: ' + str(surveys_df['species'].nunique()) + ' unique species in the data')

---

The Pandas function `describe` will return descriptive stats including: mean,
median, max, min, std and count for a particular column in the data. Pandas'
`describe` function will only return summary values for columns containing
numeric data.
We can calculate basic statistics for all records in a single column using the
syntax below:

In [None]:
surveys_df.describe()

We can also extract one specific metric if we wish:

```python
surveys_df['wgt'].min()
surveys_df['wgt'].max()
surveys_df['wgt'].mean()
surveys_df['wgt'].std()
surveys_df['wgt'].count()
```


### Basic Math Functions

If we wanted to, we could perform math on an entire column of our data. For
example let's multiply all weight values by 2. A more practical use of this might
be to normalize the data according to a mean, area, or some other value
calculated from our data.

In [None]:
# multiply all weight values by 2
surveys_df['wgt']*2
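As a sketch of the normalisation idea mentioned above, the example below divides each value by the column mean. The Series `weights` holds made-up numbers for illustration.

```python
import pandas as pd

weights = pd.Series([40.0, 48.0, 32.0])

# Arithmetic on a Series is applied to every element
doubled = weights * 2
print(doubled.tolist())    # [80.0, 96.0, 64.0]

# Normalise by the mean, so a value of 1.0 means "average weight"
relative = weights / weights.mean()
print(relative.tolist())
```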

### Groups in Pandas

We often want to calculate summary statistics grouped by subsets or attributes
within fields of our data. For example, we might want to know what the summary stats look like split by sex.
We can use Pandas' `.groupby` method, which creates a grouped object (a `DataFrameGroupBy`) on which we can call other pandas methods.


In [None]:
# grouping the df by sex
by_sex = 

# summary statistics for this new df
by_sex.describe()

In [None]:
# provide the mean for each numeric column by sex


The `groupby` command is powerful in that it allows us to quickly generate
summary stats, not just for one group but several.

For example, we might want to calculate the average
weight of all individuals per plot:

```python
surveys_df.groupby('plot')['wgt'].mean()
```

In [None]:
# calculate average weight of individuals in each plot

Or, we might want to know how many males and females we have for each species:

```python
surveys_df.groupby(['species','sex'])['record_id'].count()
```

In [None]:
# count the number of each sex per species

### Challenge

1. Calculate the average weight for each species per plot
2. Calculate the average weight for each sex of each species per plot


## Quick & Easy Plotting Data Using Pandas

We can plot our summary stats using Pandas, too.

In [None]:
# make sure figures appear inline in Jupyter Notebook
%matplotlib inline

# plot year vs wgt
surveys_df.plot(x='year', y='wgt', kind='scatter')

In [None]:
# create a quick bar chart
species_count = surveys_df.groupby('species')['record_id'].count()
species_count.plot(kind='bar')

In [None]:
# We can also look at how many animals were captured in each plot:
total_count = surveys_df.groupby('plot')['record_id'].nunique()

# let's plot that too, default is a line plot

total_count.plot(kind='bar')

### Challenge Activities

1. Create a scatter plot of average weight across all species per plot. x-axis = plot, y-axis = wgt
2. Create the same plot, but with average weight for each sex per plot. Hint, you will need to `unstack` when plotting. x-axis = plot, y-axis = wgt, different lines for each sex.
3. Create a trend plot of the average weight per plot over time. x-axis = year, y-axis = wgt, different lines for each plot.

In [None]:
# group by plot and calculate mean wgt
avrg_wgt = 

# let's plot, you should see x-axis -> plot, y-axis -> wgt
avrg_wgt.plot()

In [None]:
# group by plot and sex, then calculate mean wgt
avrg_wgt = 

# let's plot, you should see x-axis -> plot, y-axis -> wgt, different lines for sex
# you need to use the .unstack() method before the .plot() for this to work
avrg_wgt.unstack().plot()

In [None]:
avrg_wgt.unstack()

In [None]:
# group by year and plot, then calculate mean wgt
wgt_by_time = 

# let's plot, you should see x-axis -> year, y-axis -> wgt, different lines for plot
# you need to use the .unstack() method before the .plot() for this to work
wgt_by_time.unstack().plot()