# CIC Carpentries Workshop - Day 1 - Part 2
This lesson is adapted from the Data Carpentry [Data Analysis and Visualization in Python for Ecologists](https://datacarpentry.org/python-ecology-lesson/index.html) lesson.

---
## How to use a Jupyter Notebook
Online Resources:
- https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html
- https://www.packtpub.com/books/content/getting-started-jupyter-notebook-part-1

Useful Tips:
- The notebook autosaves
- You run a cell with **shift + enter** or using the run button in the tool bar
- If you run a cell with **option + enter** (**alt + enter** on Windows and Linux) it will also create a new cell below
- See *Help > Keyboard Shortcuts* or the *Cheatsheet* for more info
- The notebook has different types of cells (Code and Markdown are the most commonly used): 
    - **Code** cells expect code for the kernel you have chosen, and syntax highlighting is available. Comments are marked with `#`; anything after it on the same line will not be executed.
    - **Markdown** cells allow you to write report-style text, using Markdown to format it (e.g. headers, bold face, etc.).
---

## ❓Questions and Objectives for this Notebook
What should you be able to answer by the end of this notebook?
### Questions

- How can I import data in Python?
- What is Pandas?
- Why should I use Pandas to work with data?

### Objectives
- Navigate the workshop directory and download a dataset.
- Explain what a library is and what libraries are used for.
- Describe what the Python Data Analysis Library (Pandas) is.
- Load the Python Data Analysis Library (Pandas).
- Use read_csv to read tabular data into Python.
- Describe what a DataFrame is in Python.
- Access and summarize data stored in a DataFrame.
- Define indexing as it relates to data structures.
- Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame.
- Create simple plots.
---

## Starting with Data

### Working with Pandas DataFrames in Python
We can automate the process of performing data manipulations in Python. It's efficient to spend time building the code to perform these tasks because once it's built, we can use it over and over on different datasets that use a similar format. This makes our methods easily reproducible. We can also easily share our code with colleagues and they can replicate the same analysis.

#### Starting in the same spot
To help the lesson run smoothly, let's ensure that everyone is in the same directory. This should help us avoid path and filename issues. At this time, please navigate to the workshop directory. If you are working in Jupyter Notebook, be sure that you start your notebook in the workshop directory.

As a quick aside, there are Python libraries such as the
[os library](https://docs.python.org/3/library/os.html) and [pathlib](https://docs.python.org/3/library/pathlib.html) that can work with our
directory structure; however, that is not our focus today.

#### Our Data
For this lesson, we will be using the Portal Teaching data, a subset of the data
from Ernest et al.,
[Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA](http://www.esapubs.org/archive/ecol/E090/118/default.htm).

We will be using files from the [Portal Project Teaching Database](https://figshare.com/articles/Portal_Project_Teaching_Database/1314459).
This section will use the `surveys.csv` file which can be found in ../data/.

We are studying the species and weight of animals caught in plots in our study
area. The dataset is stored as a `.csv` file: each row holds information for a
single animal, and the columns represent:

| Column           | Description                        |
|------------------|------------------------------------|
| record_id        | Unique id for the observation      |
| month            | month of observation               |
| day              | day of observation                 |
| year             | year of observation                |
| plot_id          | ID of a particular plot            |
| species_id       | 2-letter code                      |
| sex              | sex of animal ("M", "F")           |
| hindfoot_length  | length of the hindfoot in mm       |
| weight           | weight of the animal in grams      |


The first few rows of our first file look like this:

```
record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
```

#### About Libraries
![](../pictures/package.png)

A library in Python contains a set of tools (called functions) that perform
tasks on our data. Importing a library is like getting a piece of lab equipment
out of a storage locker and setting it up on the bench for use in a project.
Once a library is set up, it can be used or called to perform many tasks.

You only need to load a library once during your session. You can load a library when it is needed,
or you can load all necessary libraries at the beginning of your script. 
Loading them at the beginning is good practice, especially for the readability of your code.

#### Pandas in Python
One of the best options for working with tabular data in Python is to use the
[Python Data Analysis Library](http://pandas.pydata.org/) (a.k.a. Pandas). The
Pandas library provides data structures, produces high quality plots with
[matplotlib](http://matplotlib.org/) and integrates nicely with other libraries
that use [NumPy](http://www.numpy.org/) (which is another Python library) arrays.

Python doesn't load all of the libraries available to it by default. We have to
add an `import` statement to our code in order to use library functions. To import
a library, we use the syntax `import libraryName`. If we want to give the
library a nickname to shorten the command, we can add `as nickNameHere`.  An
example of importing the pandas library using the common nickname `pd` is below.

In [None]:
# Importing Pandas library


Each time we call a function that's in a library, we use the syntax
`LibraryName.FunctionName`. Adding the library name with a `.` before the
function name tells Python where to find the function. In the example above, we
have imported Pandas as `pd`. This means we don't have to type out `pandas` each
time we call a Pandas function.
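
As a minimal sketch of the import and call syntax described above (the two `print` calls are only there for illustration):

```python
# Import the pandas library under the common nickname pd
import pandas as pd

# pd now refers to the pandas library
print(pd.__name__)

# Functions are reached as nickname.function_name, for example pd.read_csv
print(callable(pd.read_csv))
```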

A handy **Pandas cheatsheet** can be found [here](http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

---

### Reading CSV Data using Pandas
We will begin by locating and reading our survey data which is in CSV format. We can use Pandas' `read_csv` function to pull the file directly into a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe).

#### So What's a DataFrame?
A DataFrame is a 2-dimensional data structure that can store data of different types (including strings, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, an SQL table, or the `data.frame` in R. A DataFrame always has an index (0-based by default). An index refers to the position of an element in the data structure.

In [None]:
# Note that pd.read_csv is used because we imported pandas as pd


We can see that there were 35,549 rows parsed. Each row has 9 columns. The leftmost column in the output is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame. It looks like the `read_csv` function in Pandas read our file properly. However, we haven’t saved any data to memory so that we can work with it. We need to assign the DataFrame to a variable. Remember that a variable is a name for a value, such as `x`, or `data`. We can create a new object with a variable name by assigning a value to it using `=`.

Let’s call the imported survey data `surveys_df`:

Notice when you assign the imported DataFrame to a variable, Python does not produce any output on the screen. We can view the value of the `surveys_df` object by typing its name into the Python command prompt.
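
A minimal sketch of reading and then assigning the data. So that it runs anywhere, it reads the first rows shown earlier from an in-memory string; the name `sample_csv` and the use of `io.StringIO` are assumptions made for illustration. In the workshop you would pass the path to `surveys.csv` instead.

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""

# Without an assignment the DataFrame is created but not kept
pd.read_csv(io.StringIO(sample_csv))

# Assign the result to a variable so we can work with it.
# With the real file this would be pd.read_csv("../data/surveys.csv"),
# assuming the data folder sits where this lesson describes.
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Typing the variable name (or printing it) shows the contents
print(surveys_df)
```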

---

### Exploring Our Species Survey Data
Now, we can start exploring our data. First, let's check the data type of the data stored in `surveys_df` using the built-in `type` function.

What kind of things does `surveys_df` contain? A DataFrame conveniently has an attribute called `dtypes` which answers this by returning the data type for each column in our DataFrame.

All the values in a column have the same data type. 

Recall from the previous episode on Python data types that Pandas and base Python use slightly different names for data types.

| Pandas Type | Native Python Type | Description |
|-------------|--------------------|-------------|
| object | str | The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings). |
| int64  | int | Integer values. 64 refers to the number of bits used to store each value. |
| float64 | float | Numeric values with decimals. If a column contains numbers and NaNs (missing values), pandas will default to float64, in case your missing value has a decimal. |
| datetime64[ns], timedelta64[ns] | N/A (but see the [datetime](https://docs.python.org/3/library/datetime.html) module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments. |

For example, `month` has type `int64`, which is an integer. `weight` and `hindfoot_length` have type `float64`, which is a floating point value. The `object` type of `species_id` and `sex` doesn't have a very helpful name, but in this case it represents strings.
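
A short sketch of both checks on a four-row sample (the in-memory `sample_csv` is an assumption for illustration, built from the rows previewed earlier). Note that in this tiny sample `hindfoot_length` has no missing values, so pandas reads it as an integer column, while the empty `weight` column is read as floating point.

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# What kind of object is surveys_df?
print(type(surveys_df))

# What data type does each column hold?
print(surveys_df.dtypes)
```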

#### Useful Ways to View DataFrame Objects in Python
There are many ways to summarise and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.

Let's try out a few.

To access an attribute, use the DataFrame object name followed by the attribute: `df_object.attribute`. Using the DataFrame `surveys_df` and attribute `columns`, an index of all the column names in the DataFrame can be accessed with `surveys_df.columns`.

Methods are called in a similar fashion using the syntax `df_object.method()`. As an example, `surveys_df.head()` gets the first few rows in the DataFrame `surveys_df` using **the `head()` method**. With a method, we can supply extra information in the parentheses to control behaviour.
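
The difference between an attribute and a method can be sketched as follows, again on a small in-memory sample (`sample_csv` is an assumption for illustration):

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Attribute: no parentheses
column_names = surveys_df.columns
print(column_names)

# Method: parentheses, optionally with extra information inside
first_two = surveys_df.head(2)
print(first_two)
```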

#### ✏️ Challenge
Using our DataFrame `surveys_df`, try out the attributes and methods to see what they return.
1. `surveys_df.shape` - take a note of the output of `shape` - what format does it return the shape of the DataFrame in?
2. `surveys_df.head(15)`
3. `surveys_df.tail()`

In [None]:
# 1


In [None]:
# 2


In [None]:
# 3


---

### Calculating Statistics from Data in a Pandas DataFrame
We've now read our data into Python. Next, let's perform some quick summary statistics to learn more about the data that we're working with. We might want to know how many animals were collected in each site, or how many of each species were caught. We can perform summary stats quickly using groups. But first, we need to figure out what we want to group by.

Let's explore our data a little further:

In [None]:
# Look at the column names


Let's get a list of all the species. The `pd.unique` function tells us all of the unique values in the `species_id` column.
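
As a sketch on a four-row sample (the in-memory `sample_csv` is an assumption for illustration), `pd.unique` returns each value once, in order of first appearance:

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Select one column by name, then ask for its unique values
species = pd.unique(surveys_df["species_id"])
print(species)
```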

In [None]:
# Look at the unique species ids


#### ✏️ Challenge - Statistics
1. Create a list of unique site IDs (“plot_id”) found in the surveys data. Call it `site_names`. How many unique sites are there in the data? How many unique species are in the data?
2. What is the difference between `len(site_names)` and `surveys_df['plot_id'].nunique()`?

In [None]:
#1.

site_names =

In [None]:
#2



---

#### Groups in Pandas 
We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per site.

We can calculate basic statistics for all records in a single column using the Pandas `describe` method.

In [None]:
# Summary statistics for the entire dataframe


In [None]:
# Summary statistics for the weight column


We can also extract specific metrics if we wish.
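
A sketch of `describe` and of individual metrics on a four-row sample (the in-memory `sample_csv` is an assumption for illustration). Because the `weight` values are empty in these first rows, the sketch uses `hindfoot_length`; the same calls work on `weight` in the full dataset.

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Summary statistics for a single column
hindfoot_summary = surveys_df["hindfoot_length"].describe()
print(hindfoot_summary)

# Individual metrics
hindfoot_mean = surveys_df["hindfoot_length"].mean()
hindfoot_min = surveys_df["hindfoot_length"].min()
hindfoot_max = surveys_df["hindfoot_length"].max()
print(hindfoot_mean, hindfoot_min, hindfoot_max)
```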

In [None]:
# Mean


In [None]:
# Others?


But if we want to summarize by one or more variables, for example `sex`, we can use Pandas’ `.groupby` method. Once we’ve created a grouped object, we can quickly calculate summary statistics by a group of our choice.

In [None]:
# Group data by sex


The pandas `describe` method returns descriptive statistics including mean, median (the 50% row), max, min, std and count for a particular column in the data. By default, `describe` only returns summary values for columns containing numeric data.
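
A sketch of grouping by `sex` on a four-row sample (the in-memory `sample_csv` is an assumption for illustration, and `hindfoot_length` stands in for `weight`, which is empty in these rows). The `numeric_only=True` argument is used because recent pandas versions raise an error when asked for the mean of text columns such as `species_id`.

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Group the rows by the values in the sex column
grouped_data = surveys_df.groupby("sex")

# Summary statistics for one column, per group
print(grouped_data["hindfoot_length"].describe())

# Mean of every numeric column, per group
mean_by_sex = grouped_data.mean(numeric_only=True)
print(mean_by_sex)
```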

In [None]:
# Summary statistics for all numeric columns by sex


In [None]:
# Provide the mean for each numeric column by sex


#### ✏️ Challenge - Summary Data
1. How many recorded individuals are female `F` and how many male `M`?
2. What happens when you group by two columns using the following syntax and then calculate mean values? (Recent pandas versions need `numeric_only=True` here, because text columns such as `species_id` cannot be averaged.)
 - `grouped_data2 = surveys_df.groupby(['plot_id', 'sex'])`
 - `grouped_data2.mean(numeric_only=True)`
3. Summarize weight values for each site in your data. HINT: you can use the following syntax to only create summary statistics for one column in your data: `by_site['weight'].describe()`

In [None]:
#1. 


In [None]:
#2. 


In [None]:
#3


---

### Quickly Creating Summary Counts in Pandas

Let’s next count the number of samples for each species. We can do this in a few ways, but we’ll use `groupby` combined with the `count()` method. Name the result `species_counts`.

In [None]:
# Count the number of samples by species
species_counts =

Or, we can also count just the rows that have the species “DO”:
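
One possible way to write both counts, sketched on a four-row sample (the in-memory `sample_csv` is an assumption for illustration). The sample contains no "DO" records, so the single-species lookup uses "DM" instead.

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Count the number of records for each species
species_counts = surveys_df.groupby("species_id")["record_id"].count()
print(species_counts)

# Count just the rows for one species (here "DM")
dm_count = species_counts["DM"]
print(dm_count)
```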

#### ✏️ Challenge - Make a list
What’s another way to create a list of species and associated `count` of the records in the data? Hint: you can perform `count`, `min`, etc. functions on groupby DataFrames in the same way you can perform them on regular DataFrames.

---

#### Basic Math Functions in Pandas
If we wanted to, we could perform math on an entire column of our data. For example, let's multiply all weight values by 2.

A more practical use of this might be to normalise the data according to a mean, area, or some other value calculated from our data.
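
A sketch of column-wise arithmetic on a four-row sample (the in-memory `sample_csv` is an assumption for illustration). The `weight` values are empty in these rows, so the sketch uses `hindfoot_length`; the same expressions apply to `weight` in the full dataset.

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Multiply every value in the column by 2
doubled = surveys_df["hindfoot_length"] * 2
print(doubled)

# Normalise the column by its own mean
normalised = surveys_df["hindfoot_length"] / surveys_df["hindfoot_length"].mean()
print(normalised)
```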

In [None]:
# Multiply all weight values by 2


---

### Quick & Easy Plotting Data Using Pandas
We can plot our summary stats using Pandas too.

In [None]:
# Make sure figures appear inline in our IPython Notebook 
# (sometimes automatic, good to always include!)
%matplotlib inline

In [None]:
# Plot species_counts


We can also look at how many animals were captured in each site:
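
A sketch of one way to do this on a four-row sample, assuming matplotlib is installed (the in-memory `sample_csv` and the choice of a bar plot are assumptions for illustration):

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Number of distinct records in each site
total_count = surveys_df.groupby("plot_id")["record_id"].nunique()
print(total_count)

# Draw one bar per site; plot() returns a matplotlib Axes object
ax = total_count.plot(kind="bar")
```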

In [None]:
# Get a var total_count

# Now plot it


#### ✏️ Challenge - Plots
These challenge activities involve plotting data. It's often best to store the data to be plotted in a variable first, then call the `.plot()` method on it.

1. Create a plot of average weight across all species per site.
2. Create a plot of total males versus total females for the entire dataset.

In [None]:
#1 



In [None]:
# 2


---

### 🔥 Summary Plotting Challenge

Create a stacked bar plot, with weight on the Y axis, and the stacked variable being sex. The plot should show total weight by sex for each site. Some tips are below to help you solve this challenge:

- For more information on pandas plots, see [pandas’ documentation page on visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#basic-plotting-plot).  
- You can use the code that follows to create a stacked bar plot but the data to stack need to be in individual columns.   

Here’s a simple example with some data where ‘a’, ‘b’, ‘c’, and ‘d’ are the groups, and ‘one’ and ‘two’ are the subgroups.



In [None]:
# Example plots
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)

In [None]:
# Plotting 
my_df = pd.DataFrame(d)
my_df.plot(
    kind='bar', stacked=True, title="The title of my graph", 
    xlabel="This is the x axis label", ylabel = "This is the y axis label"
)
# Notice the multi-line expression used above within the `plot` function's brackets.
# When doing this, don't forget your commas!

# If you want to customize your plots further, try running `help(my_df.plot)` 
# to read the docstring for all of the possible options!

You can use the `.unstack()` method to transform grouped data into columns for easier plotting. Try running `.unstack()` on some of the grouped data above and see what it yields.
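
A sketch of what `.unstack()` does, on a four-row sample (the in-memory `sample_csv` and the variable names are assumptions for illustration, and `hindfoot_length` stands in for `weight`, which is empty in these rows). The inner grouping level, `sex`, becomes the columns; combinations with no records show up as `NaN`.

```python
import io
import pandas as pd

# First rows of surveys.csv, copied from the preview earlier in this lesson
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
"""
surveys_df = pd.read_csv(io.StringIO(sample_csv))

# Group by two columns and total one numeric column per group
by_site_sex = surveys_df.groupby(["plot_id", "sex"])
site_sex_total = by_site_sex["hindfoot_length"].sum()
print(site_sex_total)

# Move the inner index level (sex) into columns
unstacked = site_sex_total.unstack()
print(unstacked)
```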

In [None]:
# Unstack


Start by transforming the grouped data (by site and sex) into an unstacked layout, then create a stacked plot.
Good luck with the challenge!

In [None]:
# Challenge cells


---

# ❗Key Points
- Libraries enable us to extend the functionality of Python.
- Pandas is a popular library for working with data.
- A DataFrame is a Pandas data structure that allows one to access data by column (name or index) or row.
- Aggregating data using the groupby() function enables you to generate useful summaries of data quickly.
- Plots can be created from DataFrames or subsets of data that have been generated with groupby().