# **SARP-East Coding Lesson 4 - Tabular Data**
## *Pandas Filtering and Groupby*
03 July 2024
<br>Riley McCue

##  🌠💪🏼 *Trust your struggle.* 💪🏼🌠
<br>- Unknown 

:::{admonition} **1** Recap
:class: tip
Discuss at your tables:
<br>**Filepath**
1. What is the difference between an absolute and relative file path?
2. Why are we interested in filepaths?
3. Write out the absolute file path for the notebook you are currently in on CryoCloud.
<br>**Pandas**
1. What is pandas?
2. What do we use pandas for?/Why are we interested in pandas?
3. Write out the syntax for creating a pandas DataFrame from a given dictionary `d_name`.
4. What things did we learn we can do with a pandas DataFrame?
5. What other things might we want to do?
:::


## Context
We learned the basics of pandas and creating `DataFrame`s and `Series`, and explored a few methods. We explored filepaths and opened some real data (woohoo!).

Today, we are learning how to filter our data and use this method called `.groupby()` in pandas. These will help us get the specific parts of our data we desire and help us on our road to becoming the best scientific researchers!

## 📥 Let's Load some Data!

We are going to be using a USGS water data set.
:::{admonition} Activity!
:class: tip
[Link to the data we will be using from USGS](https://raw.githubusercontent.com/NASA-SARP/sarp_lessons/main/lessons/tabular_data/data/englewood_3_12_21_usgs_water.tsv).
1. Download the data into your CryoCloud environment.
2. Write the relative file path from this notebook to the data file. (We need this to open the data!)
:::
<br> We need to:
1. Import in our pandas library
2. Open up our data into our notebook (using filepath above)
3. Prep our data a bit before getting started

In [None]:
# 1 Import our library
import pandas as pd

In [None]:
# Read in the data
df_water_vars = pd.read_csv('./englewood_3_12_21_usgs_water.tsv', sep='\t', skiprows=30)
# There are a lot of variables here, so let's shorten our dataframe to a few variables
df_water_vars = df_water_vars[['datetime', '210920_00060', '210922_00010', '210924_00300', '210925_00400']]
# Get rid of the first row of hard-coded datatype info
df_water_vars = df_water_vars.drop(0)
# Rename the columns from their USGS codes to more human-readable names
d_name_codes = {'210920_00060': 'discharge','210922_00010': 'temperature', '210924_00300': 'dissolved oxygen', '210925_00400': 'pH'}
df_water_vars = df_water_vars.rename(columns=d_name_codes)
# Convert columns with numbers to a numeric type
df_water_vars['discharge'] = pd.to_numeric(df_water_vars['discharge'])
df_water_vars['temperature'] = pd.to_numeric(df_water_vars['temperature'])
df_water_vars['dissolved oxygen'] = pd.to_numeric(df_water_vars['dissolved oxygen'])
df_water_vars['pH'] = pd.to_numeric(df_water_vars['pH'])

Today I'm also going to add two fake columns of data, "dam release" and "safety level", that will help us as we go through some of the new concepts.

In [None]:
import random

In [None]:
df_water_vars['dam release'] = [random.choice([True, False]) for x in range(len(df_water_vars))]
df_water_vars['safety level'] = [random.choice(['low', 'medium', 'high']) for x in range(len(df_water_vars))]

In [None]:
df_water_vars

## Filtering with Booleans

So let's start with a general question, thinking back to what we learned about booleans earlier this week.
> What would happen if we used a comparison on a pandas dataframe?

Think about this:
`df_water_vars['dissolved oxygen'] == 8.2`

The output: a pandas `Series` of the same size and shape as the object used in the comparison (here, `df_water_vars['dissolved oxygen']`), but full of Boolean values.
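
To see this on data small enough to check by eye, here is a minimal sketch. The `s_oxygen` Series and its values are made up for illustration and do not come from the USGS file.

```python
import pandas as pd

# Hypothetical dissolved oxygen readings (illustrative values only)
s_oxygen = pd.Series([8.2, 7.9, 8.2, 8.5], name='dissolved oxygen')

# The comparison is applied to every element, giving one boolean per row
s_is_8_2 = s_oxygen == 8.2

print(s_is_8_2)
print(type(s_is_8_2))
```

The result has one `True` or `False` for each value in `s_oxygen`, in the same order.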

So what is the use of a list of booleans?  Let's start with a simplified list of booleans and experiment.

In [None]:
import random

In [None]:
# Create just a list of booleans (one per row of the dataframe), instead of a full dataframe
l_booleans = [random.choice([True, False]) for x in range(len(df_water_vars))]
l_booleans

In [None]:
# Now, use the list of booleans with square brackets, sort of like a key in a dictionary
df_water_vars[l_booleans]

That returned only the rows where the value in the list was True!  

This is exciting, because this is the basis for filtering rows in `pandas`.  Trying it again with our conditional statement from above `df_water_vars['dissolved oxygen'] == 8.2`:

In [None]:
# Make a Boolean Series indicating where dissolved oxygen == 8.2
df_boolean = df_water_vars['dissolved oxygen'] == 8.2
df_boolean

In [None]:
# Return only the rows where dissolved oxygen == 8.2
df_water_vars[df_boolean]

In [None]:
#You can also do this process in a single step.
df_water_vars[df_water_vars['dissolved oxygen'] == 8.2]

:::{admonition} **2**📝 Check your understanding
:class: tip
Write a cell of code that returns only the rows of the `df_water_vars` dataframe where discharge is less than or equal to 46.
:::

## Filtering with Logical Operators -- `and` and `or`

Pandas uses slightly different syntax for `and` and `or`, but the meaning remains the same.

| Python      | Pandas |
| :-----------: | :-----------: |
| and      | &       |
| or   | &#124;       |

You must wrap each individual conditional statement in parentheses `()`, because `&` and `|` are evaluated before comparisons such as `<` and `==`.

In [None]:
# Data points where pH < 8 AND dam release was True
df_water_vars[(df_water_vars['pH'] < 8) & (df_water_vars['dam release'] == True)]

In [None]:
# Data points where pH < 8 OR dam release was True
df_water_vars[(df_water_vars['pH'] < 8) | (df_water_vars['dam release'] == True)]
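
Because the "dam release" column is random, the results above change on every run. The sketch below applies the same two operators to a tiny made-up dataframe (`df_toy` and its values are hypothetical), so the outcome can be checked by hand.

```python
import pandas as pd

# Made-up values, chosen so each combination of conditions appears once
df_toy = pd.DataFrame({
    'pH': [7.8, 8.1, 7.9, 8.3],
    'dam release': [True, True, False, False],
})

# AND: both conditions must be True -> only row 0
df_and = df_toy[(df_toy['pH'] < 8) & (df_toy['dam release'] == True)]

# OR: at least one condition must be True -> rows 0, 1 and 2
df_or = df_toy[(df_toy['pH'] < 8) | (df_toy['dam release'] == True)]

print(df_and)
print(df_or)
```

Row 3 is dropped in both cases because its pH is not below 8 and its dam release is `False`.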

There are many ways to do this type of filtering in pandas -- this is just one way! I encourage you to get comfortable with one method before moving on to others, but [this article](https://www.listendata.com/2019/07/how-to-filter-pandas-dataframe.html) has a pretty nice overview of methods.

:::{admonition} **3**📝 Check your understanding
:class: tip

Explain how the following line of code filters the `df_water_vars` dataset.

`df_water_vars[(df_water_vars['discharge'] <= 46) | (df_water_vars['dam release'] > 42)]`
:::

## Indexing your pandas DataFrame

The most important property of an index is that it is **unique**, meaning no two rows of data should share the same index label. (Pandas will not stop you from creating duplicates, but they make lookups ambiguous.) This is important for keeping the data organized -- practically and conceptually.
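
One way to check whether an index is unique is its `is_unique` attribute. A minimal sketch, using two made-up dataframes (`df_unique` and `df_repeat` are hypothetical and not part of the lesson data):

```python
import pandas as pd

# Same data, but the second dataframe repeats the label 'a'
df_unique = pd.DataFrame({'score': [1, 2, 3]}, index=['a', 'b', 'c'])
df_repeat = pd.DataFrame({'score': [1, 2, 3]}, index=['a', 'b', 'a'])

print(df_unique.index.is_unique)  # True
print(df_repeat.index.is_unique)  # False

# duplicated() marks every label that has already appeared earlier
print(df_repeat.index.duplicated())
```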

One of the powerful components of pandas is that you aren't required to have a number as the index of a dataframe.  <br>Let's look at this with an example dataframe:

In [None]:
df_grades = pd.DataFrame(
    {
        'name': ['Beatriz S', 'Sara H', 'Joel T', 'Ari T', 'Hassan A'],
        'history': [78, 69, 80, 91, 79],
        'science': [85, 80, 80, 73, 91],
        'music': [81, 81, 90, 73, 89],
    }
)

In [None]:
df_grades

In this dataframe, we have default indexes of 0 -> 4. While this may work for this small example dataframe, it will not work for the research you will be doing. Let's practice indexing this small dataframe by using the students' names as the index.

We use the method called `.set_index()`.
<br> The general syntax is `dataframe_name.set_index('column_name')`.

In [None]:
df_grades.set_index('name')

In [None]:
# update the original dataframe with the new index
df_grades = df_grades.set_index('name')

## `.iloc` vs. `.loc`

So now that we have names as indexes, how do we select specific rows?

We can still use the same method we already know, `.iloc`, with an integer position, and we will still get the same result.

In [None]:
df_grades.iloc[1]

Alternatively, we can use a similar but slightly different method, `.loc`. `.loc` allows us to find specific rows based on an index label rather than an integer position.

In [None]:
df_grades.loc['Hassan A']
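
To compare the two side by side, the sketch below rebuilds a shortened copy of the grades table (`df_demo` is a hypothetical name, used so the block runs on its own) and fetches the same row both ways.

```python
import pandas as pd

# A shortened copy of the grades table, indexed by student name
df_demo = pd.DataFrame(
    {'name': ['Beatriz S', 'Sara H', 'Joel T'], 'history': [78, 69, 80]}
).set_index('name')

# .iloc counts positions from 0, so 1 is the second row
row_by_position = df_demo.iloc[1]

# .loc looks the row up by its index label
row_by_label = df_demo.loc['Sara H']

# Both lookups return the same row
print(row_by_position.equals(row_by_label))
```

`.iloc` depends on where a row sits in the table, while `.loc` depends only on the row's label, so `.loc` keeps working if the rows are reordered.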

If you want a deep dive on `.iloc` vs. `.loc` [this article](https://towardsdatascience.com/how-to-use-loc-and-iloc-for-selecting-data-in-pandas-bd09cb4c3d79) is for you.

## Real Data Indices --> Datetime

Let's look back to our `df_water_vars` table and think about `datetime`.

Dates and times generally make fantastic indices: if the world is working in proper order, each moment in time happens only once, so the values are unique. That also makes them very common indices.

In [None]:
#Let's set the index
df_water_vars = df_water_vars.set_index('datetime')

In [None]:
df_water_vars

In [None]:
#Let's look at data from March 12th 2021 at 1:00pm
df_water_vars.loc['2021-03-12 13:00']
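
The `datetime` values read from the file are stored as text, so `.loc` needs the label written exactly as it is stored. As a hedged aside: if the labels are first converted to real timestamps with `pd.to_datetime` (an extra step this lesson does not take), `.loc` can also select by a partial date. A minimal sketch on a made-up three-row dataframe (`df_times` and its values are hypothetical):

```python
import pandas as pd

# Made-up readings; the datetime column starts out as plain text
df_times = pd.DataFrame({
    'datetime': ['2021-03-12 13:00', '2021-03-12 13:15', '2021-03-13 08:00'],
    'pH': [8.0, 8.1, 7.9],
})

# Convert the text to timestamps, then use them as the index
df_times['datetime'] = pd.to_datetime(df_times['datetime'])
df_times = df_times.set_index('datetime')

# A full date and time selects a single row
one_row = df_times.loc['2021-03-12 13:00']

# A date on its own selects every row from that day
all_day = df_times.loc['2021-03-12']

print(one_row)
print(all_day)
```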

:::{admonition} **4**📝 Check your understanding
:class: tip
Write a line of code to get the row corresponding to March 13th 2021 at 8:00am.
:::

# 🏘️ Groupby -- Grouping our Data in the Best way

Pandas has a method called `.groupby()` that lets us split our data into groups and summarize each group, so we can focus on the data we really care about.

## Breaking `groupby` into conceptual parts

In addition to the dataframe, there are three main parts to a groupby:
1. Which variable we want to group together
2. How we want to group
3. The variable we want to see in the end

Before getting into syntax, let's conceptually understand how to approach `.groupby()` by looking at some examples.  
**Given the average temperature of every county in the US, what is the average temperature in each state?**

* _Which variable to group together?_ -> We want to group counties into states
* _How do we want to group?_ -> Take the average
* _What variable do we want to look at?_ Temperature

**Given a list of the opening dates of every Chuck E. Cheese store, how many Chuck E. Cheeses were opened each year?**

* _Which variable to group together?_ -> We want to group individual days into years
* _How do we want to group?_ -> Count them
* _What variable do we want to look at?_ Number of stores

:::{admonition} **5**📝 Check your understanding
:class: tip
**Given the hourly temperatures for a location over the course of a month, what were the daily highs?**

1. _Which variable to group together?_
2. _How do we want to group?_
3. _What variable do we want to look at?_
:::


## Syntax of `groupby` 

Now let's see what this looks like in the language of the computer.
1. Which variable we want to group together --> which GROUP to choose
2. How we want to group --> which AGGREGATION to choose
3. The variable we want to see in the end --> INTEREST variable (not necessary)

The general syntax is:

`dataframe.groupby(GROUP).AGGREGATION()`
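
As a concrete instance of this pattern, here is a minimal sketch of the county-to-state temperature question from earlier. The `df_counties` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Made-up county temperatures (illustrative values only)
df_counties = pd.DataFrame({
    'state': ['VA', 'VA', 'MD', 'MD'],
    'temperature': [60.0, 64.0, 55.0, 57.0],
})

# GROUP is 'state'; AGGREGATION is the mean
df_state_means = df_counties.groupby('state').mean()

print(df_state_means)
```

The result has one row per state, with the state names as the new index.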

### Choosing `GROUP`

This variable can be any of the columns in the dataframe. We like to choose columns that can be put into discrete "piles", meaning there are lots of repeated values.  For example, `'safety level'` works well because it can easily be put into piles -- `low`, `medium` and `high`.

We are still allowed to group on other columns, it's just not that useful. It can be tempting to use the variable you most want to see as GROUP, but that is often not the best choice.

In [None]:
#Let's find the minimum value of each column at each safety level.
df_water_vars.groupby("safety level").min()

In [None]:
#Let's find the minimum value of each column at each pH value.
df_water_vars.groupby("pH").min()

## Aggregation Functions 

The functions you can use in a groupby are limited, but there are still lots of options. 

The most common ones are:
* `.count()` - count the non-missing values in those rows
* `.min()`- find the minimum value of those rows
* `.max()` - find the maximum value of those rows
* `.mean()`- find the mean value of those rows
* `.sum()` - find the sum of the values of those rows
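
The sketch below runs each of these functions on the same grouped data, so the results can be compared. The `df_toy` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Made-up discharge readings: two 'low' rows and three 'high' rows
df_toy = pd.DataFrame({
    'safety level': ['low', 'low', 'high', 'high', 'high'],
    'discharge': [40, 44, 50, 52, 54],
})

# Group once, then reuse the grouped object for each aggregation
grouped = df_toy.groupby('safety level')

print(grouped.count())  # high: 3,    low: 2
print(grouped.min())    # high: 50,   low: 40
print(grouped.max())    # high: 54,   low: 44
print(grouped.mean())   # high: 52.0, low: 42.0
print(grouped.sum())    # high: 156,  low: 84
```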

## The Third Question -- Interest

It is not necessary to provide this to pandas in order to use the `.groupby()` method, but it can come in handy when working with larger dataframes, when making a plot, or any other case when we need to specify an exact variable of interest.

We add in the column of interest after the chosen group in index notation:

`dataframe.groupby(GROUP)[INTEREST].AGGREGATION()`

In [None]:
# Group our water dataframe by safety level and take the mean of the discharge values for each safety level group
df_water_vars.groupby("safety level")['discharge'].mean()

:::{admonition} 📝 Check your understanding
:class: tip

1. What is the maximum pH value during low safety level?

2. What is the mean discharge when dam releases are happening?

:::

### The Process of Groupby: Split-Apply-Combine

There is a lot that happens in a single step with `groupby` and it can be a lot to take in.  One way to mentally situate this process is to think about **Split-Apply-Combine**.

1. SPLIT the full data set into groups. --> _Which variable to group together?_
2. APPLY the aggregation function to the individual groups. --> _How do we want to group?_
3. COMBINE the aggregated data into a new dataframe

Note: The INTEREST variable we factored in is part of the _Split_ step in this process.
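
To make the three steps visible, the sketch below does them by hand with a loop and then compares the result with the one-line `groupby`. This is only an illustration of the idea, not how pandas works internally, and `df_toy` and its values are made up.

```python
import pandas as pd

# Made-up discharge readings (illustrative values only)
df_toy = pd.DataFrame({
    'safety level': ['low', 'high', 'low', 'high'],
    'discharge': [40.0, 50.0, 44.0, 54.0],
})

pieces = {}
# SPLIT: looping over a groupby gives (group label, sub-dataframe) pairs
for level, df_group in df_toy.groupby('safety level'):
    # APPLY: aggregate one group at a time
    pieces[level] = df_group['discharge'].mean()

# COMBINE: gather the per-group results into a single Series
s_manual = pd.Series(pieces, name='discharge')

# The same calculation in a single groupby line
s_groupby = df_toy.groupby('safety level')['discharge'].mean()

print(s_manual)
print(s_groupby)
```

Both versions give one mean discharge value per safety level.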


<img src="https://static.packt-cdn.com/products/9781783985128/graphics/5128OS_09_01.jpg" width=550>






# 🛑🛑🛑 **Practice Time !!!** 🛑🛑🛑

We are going to move to tables to be with people at similar levels of understanding with this! 

# 🏁 Wrap-up: Think, Pair, Share!
## Think!
Take 4 minutes and review on your own what you learned, and write a summary. This can be in paragraph style, bullet points, drawings, whatever makes sense to you. Feel free to create a cell below (either a Markdown or a Code cell) and play around!
## Pair!
Take 4 minutes and discuss at your tables what you all learned today.
## Share!
:::{admonition} Riley's Version
:class: note, dropdown
See handout for full Pandas summary!
:::

### 🎊YAYYYY - third coding lesson is complete!
Enjoy your day off tomorrow! <br>
<br> 🇩🇪Prost und Guten Tag!
<br> 🇪🇸¡Saludos y Buen Día!
<br> 🇵🇹Felicidades e bom dia!
<br> 🇬🇧Cheers and Good Day!

# **ANSWERS TO THE CHECK YOUR UNDERSTANDINGS**

:::{admonition} **1** Answers to Recap
:class: dropdown
**Filepath**
1. An absolute filepath starts from the "root" of your computer. A relative filepath starts from where you currently are in your computer.
2. Filepaths allow us to tell our computer how to locate things.
3. This will vary for everybody, but it should begin with "/home/jovyan/" and then continue into whatever folder you are in.
<br> **Pandas**
1. Pandas is a library. We must import this library.
2. We use pandas for working with tabular data. We use pandas to create a two dimensional data structure in which we can easily work with this type of data.
3. dataframe_name = pd.DataFrame(d_name)
4. We can extract certain rows, columns, or points in our data structure. We can use certain methods to find the minimum and maximum, and we can also make pandas tell us more information about our dataframe with methods like `.info()` and `.describe()`.
5. Hmmmmm...
:::

:::{admonition} **2** Answer
:class: dropdown
`df_water_vars[df_water_vars['discharge'] <= 46]`
:::

:::{admonition} **3** Answer
:class: dropdown
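This line returns only the rows where discharge is <= 46 OR dam release is > 42. Because `dam release` holds only `True`/`False` values (treated as 1 and 0 in the comparison), the second condition is never met, so in practice the rows returned are those where discharge is <= 46.
:::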

:::{admonition} **4** Answer
:class: dropdown
`df_water_vars.loc['2021-03-13 08:00']`
:::

:::{admonition} **5** Answer
:class: dropdown
1. We want to group hours into days.
2. We want to find the maximum
3. We are looking at temperature.
:::