# CIC Carpentries Workshop - Day 1 - Part 3
This lesson is adapted from the Data Carpentry [Data Analysis and Visualization in Python for Ecologists](https://datacarpentry.org/python-ecology-lesson/index.html) lesson.

---
## How to use a Jupyter Notebook
Online Resources:
- https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html
- https://www.packtpub.com/books/content/getting-started-jupyter-notebook-part-1

Useful Tips:
- The notebook autosaves
- You run a cell with **shift + enter** or using the run button in the toolbar
- If you run a cell with **option + enter** (**alt + enter** on Windows and Linux) it will also create a new cell below
- See *Help > Keyboard Shortcuts* or the *Cheatsheet* for more info
- The notebook has different types of cells (Code and Markdown are the most commonly used):
    - **Code** cells expect code for the Kernel you have chosen. Syntax highlighting is available, and comments are marked with `#`: anything after the `#` on that line will not be executed
    - **Markdown** cells allow you to write report-style text, using Markdown to format it (e.g. headers, bold face, etc.)
---

## ❓Questions and Objectives for this Notebook
What should you be able to answer by the end of this notebook?
### Questions
- How can I access specific data within my data set?
- How can Python and Pandas help me to analyse my data?

### Objectives
- Describe what 0-based indexing is.
- Manipulate and extract data using column headings and index locations.
- Employ slicing to select sets of data from a DataFrame.
- Employ label and integer-based indexing to select ranges of data in a DataFrame.
- Reassign values within subsets of a DataFrame.
- Create a copy of a DataFrame.
- Query / select a subset of data using a set of criteria using the following operators: ==, !=, >, <, >=, <=.
- Locate subsets of data using masks.
- Describe BOOLEAN objects in Python and manipulate data using BOOLEANs.
---

## Loading our Data
We will continue to use the surveys dataset that we worked with in the last episode. Let's reopen and read in the data again:

In [None]:
# Make sure pandas is loaded


# Read in the survey CSV


---

## Indexing and Slicing in Python
We often want to work with subsets of a **DataFrame** object. There are different ways to accomplish this including: using labels (column headings), numeric ranges, or specific x,y index locations.

### Selecting Data using Labels
We use square brackets `[]` to select a subset of a Python object. For example, we can select all data from a column named `species_id` from the `surveys_df` DataFrame by name. There are two ways to do this.

In [None]:
# TIP: use the .head() method to only see the first few rows of the dataframe

# Method 1: Select a 'subset' of the data using the column name


In [None]:
# Method 2: Use the column name as an 'attribute'


We can also create a new object that contains only the data within the `species_id` column as follows:

In [None]:
# Create an object, surveys_species, that only contains the `species_id` column


We can also pass a list of column names as the index to select several columns in the order given. This is useful when we need to reorganise our data.

**NOTE:** If a column name is not contained in the DataFrame, an exception (error) will be raised.
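
As a minimal sketch of the selections described above, the example below uses a small made-up DataFrame called `toy_df` (a hypothetical stand-in for `surveys_df`, with invented values) so that it runs without the survey CSV. The misspelled column name `speciess` is deliberate.

```python
import pandas as pd

# Toy stand-in for surveys_df: made-up values, survey-style column names
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "plot_id": [2, 3, 2],
    "species_id": ["NL", "NL", "DM"],
})

# Method 1: square brackets with the column name return a Series
by_brackets = toy_df["species_id"]

# Method 2: the column name used as an attribute returns the same Series
by_attribute = toy_df.species_id

# A list of names returns a DataFrame with the columns in the requested order
reordered = toy_df[["species_id", "plot_id"]]

# A name that is not a column raises a KeyError
missing_column_error = None
try:
    toy_df["speciess"]
except KeyError as err:
    missing_column_error = err

print(reordered)
print("KeyError raised for:", missing_column_error)
```

Note the double square brackets in `toy_df[["species_id", "plot_id"]]`: the outer pair is the selection, the inner pair is the Python list of names.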

In [None]:
# Select the species_id and plot_id columns from the DataFrame


In [None]:
# What happens if we flip the order?


In [None]:
# What happens if you ask for a column that doesn't exist?


---

### Extracting Range-Based Subsets: Slicing
**REMINDER**: Python uses 0-based indexing.

This means that the first element in an object is located at position `0`. This is different from other tools like R and MATLAB, which index elements within objects starting at 1.

![indexing diagram](https://datacarpentry.org/python-ecology-lesson/fig/slicing-indexing.png)

![slicing diagram](https://datacarpentry.org/python-ecology-lesson/fig/slicing-slicing.png)
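
The two diagrams can be tried out on a plain Python list. This is only a sketch: the list `letters` and the variable names are made up for illustration.

```python
# A made-up list with five elements at positions 0, 1, 2, 3, 4
letters = ["a", "b", "c", "d", "e"]

first = letters[0]                  # position 0 holds the first element
last = letters[len(letters) - 1]    # the last valid position is len(letters) - 1
middle = letters[1:3]               # start is included, stop is excluded

# Asking for a position past the end raises an IndexError
out_of_range_message = None
try:
    letters[len(letters)]
except IndexError as err:
    out_of_range_message = str(err)

print(first, last, middle)
print("IndexError:", out_of_range_message)
```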

In [None]:
# Create a list of numbers


#### ✏️ Challenge
1. What value does the code `a[0]` return?
2. How about `a[5]`?
3. Why did `a[5]` above return an error?
4. What about `a[len(a)]`?

In [None]:
# 1


In [None]:
# 2


In [None]:
# 4


---

### Slicing Subsets of Rows in Python
Slicing using the `[]` operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the syntax `data[start:stop]`. When slicing in Pandas, the start bound is included in the output, while the stop bound is one step BEYOND the last row you want to select. So if you want to select rows 0, 1 and 2, your code would look like this.

In [None]:
# Select rows 0, 1, 2 (row 3 is not selected)


The stop bound in Python is different from what you might be used to in languages like MATLAB and R, where the stop bound is included in the selection.
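
A short sketch of row slicing, assuming a made-up six-row DataFrame `toy_df` in place of `surveys_df`:

```python
import pandas as pd

# Toy stand-in for surveys_df with six rows (positions 0 to 5)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "year": [1977, 1977, 1977, 1978, 1978, 1978],
})

first_three = toy_df[0:3]   # rows 0, 1, 2 (row 3 is not selected)
first_five = toy_df[:5]     # an omitted start means "from the beginning"
last_row = toy_df[-1:]      # a negative start counts from the end

print(first_three)
print(last_row)
```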

In [None]:
# Select the first 5 rows (rows 0, 1, 2, 3, 4)


In [None]:
# Select the last row


We can also reassign values within subsets of our DataFrame.

But before we do that, let's look at the difference between the concept of copying objects and the concept of referencing objects in Python.

---

### Copying Objects vs Referencing Objects in Python
Let's start with an example:


In [None]:
# Using the `copy()` method


In [None]:
# Using the `=` operator


We might have thought that the code `ref_surveys_df = surveys_df` creates a fresh distinct copy of the `surveys_df` DataFrame object. However, using the `=` operator in the simple statement `y = x` does **not** create a copy of our DataFrame. Instead `y = x` creates a variable `y` that references the **same** object that `x` refers to.

To state this another way, there is only **one** object (the DataFrame), and both `x` and `y` refer to it.

In contrast, the `copy()` method for a DataFrame creates a true copy of the DataFrame.
Let's look at what happens when we reassign values within a subset of the DataFrame through the reference `ref_surveys_df`.

In [None]:
# Assign the value `0` to the first three rows of data in the DataFrame


In [None]:
# Let's look at the reference DataFrame


In [None]:
# Let's look at the original DataFrame


When we assigned the value `0` to the first 3 rows using `ref_surveys_df`, the `surveys_df` DataFrame was modified too. Remember, we created the reference `ref_surveys_df` above with `ref_surveys_df = surveys_df`.

`surveys_df` and `ref_surveys_df` refer to the exact same DataFrame object, so a change made through either name is visible through the other.
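
The same behaviour can be reproduced on a small scale. The sketch below assumes a made-up DataFrame `toy_df`; the names `true_copy` and `reference` are hypothetical and play the roles of the copy and of `ref_surveys_df`.

```python
import pandas as pd

# Toy stand-in for surveys_df (made-up values)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "weight": [40.0, 48.0, 29.0, 46.0],
})

true_copy = toy_df.copy()   # a separate object holding its own data
reference = toy_df          # a second name for the very same object

# Assign 0 to the first two rows through the reference
reference[0:2] = 0

print(reference is toy_df)   # both names point at one object
print(toy_df)                # the "original" shows the zeros too
print(true_copy)             # the copy keeps the original values
```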

In [None]:
# Let's look at the copy DataFrame


Okay, that's enough of that. Let's create a brand new, clean DataFrame from the original CSV data file.
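
A sketch of the reload step is shown below. To keep it runnable without the data file, it reads from an in-memory string `csv_text` containing three sample rows in the column layout assumed for the survey file; in the workshop you would pass the path to your own survey CSV to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey CSV file: three sample rows only.
# The trailing comma on each row leaves the weight value empty.
csv_text = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
"""

# Reading the data again gives a fresh DataFrame, unaffected by earlier edits.
# In the workshop, replace io.StringIO(csv_text) with the path to the CSV file.
surveys_df = pd.read_csv(io.StringIO(csv_text))

print(surveys_df.head())
```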

---

### Slicing Subsets of Rows and Columns in Python

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.
- `loc`: indexing via *labels* (which can be integers)
- `iloc`: indexing via *integers*

![dataframe_indexing](https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2019/01/pandas-dataframe-has-indexes.png)

To select a subset of rows **and** columns from our DataFrame, we can use the `iloc` indexer. For example, we can select month, day and year (columns 1, 2 and 3 if we start counting at 0), like this:

In [None]:
# Let's look at the dataframe


In [None]:
# iloc[row slicing, column slicing]


Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 **up to but not including 3.**
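
As a small illustration of the `iloc[row slicing, column slicing]` pattern, the sketch below uses a made-up DataFrame `toy_df` whose first columns mirror the survey data.

```python
import pandas as pd

# Toy stand-in for surveys_df (made-up values)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "month": [7, 7, 7, 8],
    "day": [16, 16, 17, 1],
    "year": [1977, 1977, 1977, 1977],
    "plot_id": [2, 3, 2, 7],
})

# Rows at positions 0, 1, 2 and columns at positions 1, 2, 3
subset = toy_df.iloc[0:3, 1:4]

print(subset)
```

In both directions the stop position is excluded, so `0:3` gives three rows and `1:4` gives three columns.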

Let's explore some other ways to index and select subsets of data:

In [None]:
# Select all columns for rows of index values 0 and 10


In [None]:
# Select using column names


In [None]:
# Selecting out of bounds


**NOTE:** Labels must be found in the DataFrame or you will get a `KeyError`.

Indexing by labels with `loc` differs from indexing by integers with `iloc`. With `loc`, both the start bound and the stop bound are **inclusive**. When using `loc`, integers *can* be used, but the integers refer to the index label and not the position. For example, using `loc` and selecting 1:4 will get a different result than using `iloc` to select rows 1:4.
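
The difference is easiest to see side by side. This sketch assumes a made-up six-row DataFrame `toy_df` with the default integer index, so labels and positions happen to use the same numbers; the label `99` is deliberately absent.

```python
import pandas as pd

# Toy stand-in for surveys_df with the default index labels 0 to 5
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "plot_id": [2, 3, 2, 7, 3, 1],
    "species_id": ["NL", "NL", "DM", "DM", "DM", "PF"],
})

by_label = toy_df.loc[1:4]       # labels 1, 2, 3 and 4: the stop is included
by_position = toy_df.iloc[1:4]   # positions 1, 2 and 3: the stop is excluded

# Lists of row labels and column names can be combined
rows_and_columns = toy_df.loc[[0, 5], ["species_id", "plot_id"]]

# A label that is not in the index raises a KeyError
missing_label_error = None
try:
    toy_df.loc[[0, 99], :]
except KeyError as err:
    missing_label_error = err

print(len(by_label), len(by_position))
print(rows_and_columns)
```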

In [None]:
# Using `loc`


In [None]:
# Using `iloc`


We can also select a specific data value using a row and column location within the DataFrame and `iloc` indexing, for example `surveys_df.iloc[2, 6]`.

Remember that Python indexing begins at 0, so the index location `[2, 6]` selects the element that is 3 rows down and 7 columns over in the DataFrame.
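
A sketch of selecting a single value, assuming a made-up three-row DataFrame `toy_df` with the nine-column layout assumed for the survey data:

```python
import pandas as pd

# Toy stand-in for surveys_df (sample values, weight left missing)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "month": [7, 7, 7],
    "day": [16, 16, 16],
    "year": [1977, 1977, 1977],
    "plot_id": [2, 3, 2],
    "species_id": ["NL", "NL", "DM"],
    "sex": ["M", "M", "F"],
    "hindfoot_length": [32.0, 33.0, 37.0],
    "weight": [float("nan")] * 3,
})

# Row position 2 (third row), column position 6 (seventh column, "sex")
value = toy_df.iloc[2, 6]

print(value)
```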

#### ✏️ Challenge
1. What happens when you execute the following:
    - `surveys_df[0:1]`
    - `surveys_df[:4]`
    - `surveys_df[:-4]`
2. What happens when you call:
    - `surveys_df.iloc[0:4, 1:4]`
    - `surveys_df.loc[0:4, 1:4]`

How are the two sets of commands different?

In [None]:
# 1a


In [None]:
# 1b


In [None]:
# 1c


In [None]:
# 2a


In [None]:
# 2b


---

### Subsetting Data using Criteria
We can also select a subset of our data using criteria. For example, we can select all rows that have a year value of 2002:

In [None]:
# Selecting rows where the year is 2002


Or we can select all rows that do not contain the year 2002:

In [None]:
# Selecting rows where the year is not 2002


We can define sets of criteria too:

In [None]:
# Selecting rows where the year is between 1980 and 1985


We can use the syntax below when querying data by criteria from a DataFrame:
- Equals `==`
- Not equals: `!=`
- Greater than, less than: `>` or `<`
- Greater than or equal to `>=`
- Less than or equal to `<=`
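
The operators above can be sketched on a made-up DataFrame `toy_df`. Note that combining two criteria uses `&` with each condition wrapped in parentheses.

```python
import pandas as pd

# Toy stand-in for surveys_df (made-up values)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "year": [1979, 1980, 1983, 1985, 2002],
})

in_2002 = toy_df[toy_df["year"] == 2002]
not_2002 = toy_df[toy_df["year"] != 2002]

# Both conditions must hold: 1980 <= year <= 1985
early_80s = toy_df[(toy_df["year"] >= 1980) & (toy_df["year"] <= 1985)]

print(early_80s)
```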

---

### Using masks to identify a specific condition
A **mask** can be useful to locate where a particular subset of values exists or doesn't exist, for example NaN ("Not a Number") values. To understand masks, we also need to understand Boolean (`bool`) objects in Python.

A Boolean value is either `True` or `False`. For example:

In [None]:
# Set x to 5


In [None]:
# What does the code below return?


In [None]:
# How about this?


If we ask Python whether `x` is greater than 5, it returns `False`. This is Python's way of saying "No". Indeed, the value of `x` is 5, and 5 is not greater than 5.

To create a boolean mask:
- Set the True / False criteria (e.g. `values > 5`)
- Python will then assess each value in the object to determine whether the value meets the criteria (True) or not (False)
- Python creates an output object that is the same shape as the original object, but with a `True` or `False` for each index location.
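
The three steps above can be sketched with a small made-up Series named `values` (a hypothetical example, not part of the survey data):

```python
import pandas as pd

# A single comparison gives a single Boolean
x = 5
x_is_greater = x > 5    # False: 5 is not greater than 5
x_is_equal = x == 5     # True

# The same comparison on a Series is assessed for every value
values = pd.Series([3, 5, 8, 10])
mask = values > 5

# The mask has the same shape as the original, holding True or False
print(mask)

# Using the mask as an index keeps only the positions marked True
selected = values[mask]
print(selected)
```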

Let's try this out. Let's identify all locations in the survey data that have null (missing or NaN) data values. We can use the `isnull` method to do this. The `isnull` method will compare each cell with a null value. If an element has a null value, it will be assigned a value of `True` in the output object.

In [None]:
# Mask of null values


To select the rows where there are null values, we can use the mask as an index to subset our data as follows:

In [None]:
# Selecting rows with NaN values with the `any()` method


Note that the `weight` column of our DataFrame contains many `null` or `NaN` values. We will explore ways of dealing with this in the next episode on [Data Types and Formats](https://datacarpentry.org/python-ecology-lesson/04-data-types-and-format/index.html).

We can run `isnull` on a particular column too. What does the code below do?

In [None]:
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)

Let's take a minute to look at the statement above. We are using the Boolean object `pd.isnull(surveys_df['weight'])` as an index to `surveys_df`. We are asking Python to select rows that have a `NaN` value of weight.
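
To see each step separately, here is a sketch on a made-up DataFrame `toy_df` with a few missing values. The intermediate names `null_mask` and `rows_with_any_null` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for surveys_df with some missing values
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "sex": ["M", None, "F", "M"],
    "weight": [np.nan, 40.0, np.nan, 48.0],
})

# Boolean mask with the same shape as toy_df: True where a value is missing
null_mask = pd.isnull(toy_df)

# any(axis=1) collapses the mask to one Boolean per row
rows_with_any_null = toy_df[null_mask.any(axis=1)]

# The same idea restricted to a single column
empty_weights = toy_df[pd.isnull(toy_df["weight"])]["weight"]

print(rows_with_any_null)
print(len(empty_weights))
```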

#### ✏️ Challenge
1. Select a subset of rows in the `surveys_df` DataFrame that contain data from
   the year 1999 and that contain weight values less than or equal to 8. How
   many rows did you end up with? What did your neighbor get?
2. You can use the `isin` method in Python to query a DataFrame based upon a
   list of values as follows:
   `surveys_df[surveys_df['species_id'].isin([listGoesHere])]`.
   
   Use the `isin` method to find all plots that contain particular species in the surveys DataFrame. How many records contain these values?
3. Experiment with other queries. Create a query that finds all rows with a weight value greater than or equal to 0.
4. The `~` symbol in Python can be used to return the OPPOSITE of the selection that you specify. It is equivalent to **is not in**. Write a query that selects all rows with a sex value NOT equal to 'M' or 'F' in the surveys data.
5. Create a new DataFrame that only contains observations with sex values that are not female or male. Assign each sex value in the new DataFrame to a new value of 'x'. Determine the number of null values in the subset.
6. Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot_id with male vs female values stacked for each plot.
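
As a hint for the syntax only, the sketch below shows `isin` and `~` on a made-up DataFrame `toy_df`; the list `wanted` is hypothetical, and the challenges still need to be solved on the survey data.

```python
import pandas as pd

# Toy stand-in for surveys_df (made-up values)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "species_id": ["NL", "DM", "PF", "DM", "OT"],
})

wanted = ["NL", "DM"]

# Rows whose species_id is one of the values in the list
in_list = toy_df[toy_df["species_id"].isin(wanted)]

# ~ inverts the Boolean mask: rows whose species_id is NOT in the list
not_in_list = toy_df[~toy_df["species_id"].isin(wanted)]

print(len(in_list), len(not_in_list))
```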

In [None]:
# 1

In [None]:
# 2

In [None]:
# 3

In [None]:
# 4

In [None]:
# 5

---

## ❗Key Points
- In Python, portions of data can be accessed using indices, slices, column headings, and condition-based subsetting.
- Python uses 0-based indexing, in which the first element in a list, tuple or other sequence has an index of 0.
- Pandas enables common data exploration steps such as data indexing, slicing and conditional subsetting.