# Simple Filtering

## Summary
In this notebook, we'll be covering:
- [Filtering using mathematical operators](#Filtering-Using-Mathematical-Operators)
- [Filtering using isin()](#Filtering-Using-isin())
- [Chaining filters](#Chaining-Filters)

In this section we will discuss filtering data. The All of Us Researcher Workbench will do *some* filtering for you, but undoubtedly, you will need to perform additional filtering to conduct your analysis. Filtering can become complex, so this section is split into two notebooks. This one covers simple filters, while the next one covers creating new data labels that we can then use to further subset our data.

### Filtering Using Mathematical Operators
To get started with filtering, we first need a dataset to use for demonstration. The code below will make a dataframe for us. In fact, it will be the same dataframe as the previous notebook, but without random blank cells.

It is not important to understand all of the code in the cell below. The important thing to know is that we are generating fake data to use for demonstration. Our fake dataset will have the following columns:

* ID
* Measurement Device
* Heart Rate Max
* Heart Rate Min
* Heart Rate Avg
* Duration of exercise (min)
* Exercise Type

In [1]:
import pandas as pd
import random

workout_dict = {'ID': [], 'Measurement Device': [], 'Heart Rate Max': [], 'Heart Rate Min': [], 'Heart Rate Avg': [],
              'Duration of exercise (min)': [], 'Exercise Type': []}
used_ids = []

for x in range(0, 500):
    id = random.randint(100000000, 999999999)
    while id in used_ids:
        id = random.randint(100000000, 999999999)
    used_ids.append(id)
    device = random.choice(['Skykandal', 'B-Wolf'])
    mu = random.randint(65, 85)
    min_rate = int(random.gauss(mu, 10))
    max_rate = int(random.gauss(mu + 55, 25))
    while max_rate <= min_rate:
        max_rate = int(random.gauss(mu + 55, 25))
    avg = random.gauss((max_rate + min_rate) / 2, (max_rate - min_rate) / 5)
    duration = random.randint(10, 90)
    exercise = random.choice(['Running', 'Running', 'Running', 'Bicycling', 'Swimming', 'Swimming',
                              'Weight training'])
    row = [device, min_rate, max_rate, avg, duration, exercise]
    workout_dict['ID'].append(id)
    workout_dict['Measurement Device'].append(row[0])
    workout_dict['Heart Rate Min'].append(row[1])
    workout_dict['Heart Rate Max'].append(row[2])
    workout_dict['Heart Rate Avg'].append(row[3])
    workout_dict['Duration of exercise (min)'].append(row[4])
    workout_dict['Exercise Type'].append(row[5])

df = pd.DataFrame(workout_dict)
df.head(10)

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,929593150,Skykandal,90,74,80.159248,21,Running
1,210260041,B-Wolf,140,90,124.133686,56,Weight training
2,666846822,Skykandal,129,85,87.127954,74,Running
3,785714745,Skykandal,134,83,111.054789,10,Running
4,610612157,Skykandal,165,77,157.106239,59,Running
5,513008405,B-Wolf,95,64,75.36268,73,Swimming
6,833795233,B-Wolf,143,103,127.35412,36,Swimming
7,679645866,B-Wolf,147,70,118.77026,29,Running
8,144747501,Skykandal,145,58,96.727396,68,Running
9,654481713,B-Wolf,182,83,147.659343,33,Running


Let's start with a simple filter: we will select everyone whose **minimum heart rate** is greater than 69.

In [2]:
df[df['Heart Rate Min'] > 69]

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,929593150,Skykandal,90,74,80.159248,21,Running
1,210260041,B-Wolf,140,90,124.133686,56,Weight training
2,666846822,Skykandal,129,85,87.127954,74,Running
3,785714745,Skykandal,134,83,111.054789,10,Running
4,610612157,Skykandal,165,77,157.106239,59,Running
...,...,...,...,...,...,...,...
492,597698296,Skykandal,138,74,115.064023,86,Weight training
493,922801079,Skykandal,159,72,108.686081,14,Swimming
495,181360174,Skykandal,123,70,95.654617,73,Swimming
496,467340981,B-Wolf,133,85,100.985588,89,Running


How does this work? We start with `df[]`, which is the usual way to pull something out of dataframe `df` by key. Except, instead of passing something like a column name, we passed an expression into the brackets. The expression was `df['Heart Rate Min'] > 69`. In other words: from `df`, give me every row where Heart Rate Min (in `df`) is greater than 69.
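To see what the expression inside the brackets produces on its own, here is a minimal sketch. The tiny `toy` dataframe and its values are made up for illustration and are not part of the random dataset above.

```python
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Heart Rate Min': [74, 64, 90, 69]})

# The comparison is applied to every row and returns a Series of True/False.
mask = toy['Heart Rate Min'] > 69
print(mask.tolist())            # [True, False, True, False]

# Passing that Series into toy[] keeps only the rows marked True.
filtered = toy[mask]
print(filtered['ID'].tolist())  # [1, 3]
```

Note that the row with a minimum of exactly 69 is dropped, because `>` means strictly greater than.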

Now, let's get one of the most annoying features of pandas out of the way. We're going to store our filtered data in df_filtered and then, for no good reason, change a column.

In [3]:
df_filtered = df[df['Heart Rate Min'] > 69]
df_filtered['Heart Rate Avg'] = 200
df_filtered.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered['Heart Rate Avg'] = 200


Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,929593150,Skykandal,90,74,200,21,Running
1,210260041,B-Wolf,140,90,200,56,Weight training
2,666846822,Skykandal,129,85,200,74,Running
3,785714745,Skykandal,134,83,200,10,Running
4,610612157,Skykandal,165,77,200,59,Running


Above you should have gotten a `SettingWithCopyWarning`. You might have assumed that when you assigned `df[df['Heart Rate Min'] > 69]` to `df_filtered` you created an entirely new dataframe that has nothing to do with `df`. pandas is not so sure. Depending on how a selection is made, the result can be either a "view" (a window onto the original data) or a separate copy, and setting a value in a view can also alter `df`. Filtering with a True/False expression like this one generally hands back a copy, so `df` is normally left untouched here, but pandas still remembers that `df_filtered` came from `df` and warns you that the assignment may not do what you expect.

When you get a view versus when you get a copy is fairly complicated. However, there is a simple fix: the `copy` method returns a new, independent dataframe. `copy` is a dataframe method and needs no arguments here.

Below, the code has added `copy()` and what previously gave us a warning now won't. To be clear, the main difference is that now `df_filtered` is explicitly a *new dataframe* with no remaining link to the original dataframe.

In [4]:
df_filtered = df[df['Heart Rate Min'] > 69].copy()
df_filtered['Heart Rate Avg'] = 200
df_filtered.head()

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,929593150,Skykandal,90,74,200,21,Running
1,210260041,B-Wolf,140,90,200,56,Weight training
2,666846822,Skykandal,129,85,200,74,Running
3,785714745,Skykandal,134,83,200,10,Running
4,610612157,Skykandal,165,77,200,59,Running
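As a quick sanity check, the sketch below shows that changing a filtered copy leaves the original alone. The `toy` dataframe and its values are invented for illustration.

```python
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({'Heart Rate Min': [74, 64, 90],
                    'Heart Rate Avg': [80.0, 75.0, 124.0]})

# Filter, then take an explicit copy before modifying anything.
toy_filtered = toy[toy['Heart Rate Min'] > 69].copy()
toy_filtered['Heart Rate Avg'] = 200

# The original dataframe still holds its original averages.
print(toy['Heart Rate Avg'].tolist())           # [80.0, 75.0, 124.0]
print(toy_filtered['Heart Rate Avg'].tolist())  # [200, 200]
```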


Filtering expressions can include a number of comparison operators, such as `>`, `<`, `<=`, `>=`, `==`, and `!=`. Remember that in Python `=` means "set the thing on the left side of the `=` equal to the thing on the right side of the `=`", whereas `==` asks if they are equal. `!=` means "not equal".

These expressions are fairly straightforward when we are dealing with numbers. 
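Here is a small sketch of a few of these operators in action, using a made-up `toy` dataframe (the IDs and durations are invented for illustration).

```python
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Duration of exercise (min)': [10, 45, 60, 90]})

# <= keeps rows at or below the threshold.
at_most_45 = toy[toy['Duration of exercise (min)'] <= 45]
print(at_most_45['ID'].tolist())   # [1, 2]

# != keeps every row except those equal to the value.
not_60 = toy[toy['Duration of exercise (min)'] != 60]
print(not_60['ID'].tolist())       # [1, 2, 4]

# == keeps only rows exactly equal to the value.
exactly_90 = toy[toy['Duration of exercise (min)'] == 90]
print(exactly_90['ID'].tolist())   # [4]
```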

#### Below, write some code that filters `df` so that only individuals with a maximum heart rate of 110 or under remain.

In [5]:
# your code goes here


You can also use some of these operators, like `==` and `!=`, to filter on text columns. Below, we'll select just the individuals who went running.

In [5]:
df[df['Exercise Type'] == 'Running']

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,929593150,Skykandal,90,74,80.159248,21,Running
2,666846822,Skykandal,129,85,87.127954,74,Running
3,785714745,Skykandal,134,83,111.054789,10,Running
4,610612157,Skykandal,165,77,157.106239,59,Running
7,679645866,B-Wolf,147,70,118.770260,29,Running
...,...,...,...,...,...,...,...
480,274090349,Skykandal,141,74,115.533190,83,Running
483,965440669,Skykandal,165,81,128.951019,32,Running
484,903386171,B-Wolf,141,65,113.490371,52,Running
489,781287792,Skykandal,117,79,79.028554,35,Running


We could also *exclude* people who went running using `~`. This means "not", and it's generally good practice (i.e., it will sometimes break if you don't do this) to wrap the expression you are negating in parentheses, as below.

In [6]:
df[~(df['Exercise Type'] == 'Running')]

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
1,210260041,B-Wolf,140,90,124.133686,56,Weight training
5,513008405,B-Wolf,95,64,75.362680,73,Swimming
6,833795233,B-Wolf,143,103,127.354120,36,Swimming
10,619463985,Skykandal,143,70,99.431671,63,Bicycling
11,137370466,Skykandal,182,61,135.091871,21,Bicycling
...,...,...,...,...,...,...,...
494,457870858,Skykandal,175,58,118.471326,45,Weight training
495,181360174,Skykandal,123,70,95.654617,73,Swimming
497,351824435,B-Wolf,139,64,99.021259,61,Bicycling
498,332904239,Skykandal,123,69,94.944867,41,Swimming


What happens if we use a > or < symbol?

In [7]:
df[df['Exercise Type'] < 'Running']

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
10,619463985,Skykandal,143,70,99.431671,63,Bicycling
11,137370466,Skykandal,182,61,135.091871,21,Bicycling
20,570520774,Skykandal,140,105,120.994608,20,Bicycling
34,578201952,Skykandal,155,87,123.627619,69,Bicycling
51,241809727,Skykandal,114,62,80.812885,60,Bicycling
...,...,...,...,...,...,...,...
475,649341064,Skykandal,147,71,116.044045,11,Bicycling
481,444512128,B-Wolf,97,86,90.285706,49,Bicycling
485,212321519,Skykandal,135,76,97.572078,43,Bicycling
487,614170363,B-Wolf,129,68,104.823511,14,Bicycling


`< 'Running'` gives us only bicyclists, because `>` and `<` on text compare strings alphabetically (strictly speaking, character by character, with uppercase letters sorting before lowercase ones). `< 'Running'` gave us every label that, when alphabetized, comes before "Running". This behavior is rarely useful, and it is generally best practice to avoid `<` and `>` with string (i.e., text) columns.
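To make the alphabetization point concrete, here is a minimal sketch on a made-up Series of labels. The lowercase `'bicycling'` entry is a hypothetical messy value added to show one way this kind of comparison can surprise you.

```python
import pandas as pd

# Toy labels, invented for this illustration only.
labels = pd.Series(['Bicycling', 'Running', 'Swimming',
                    'Weight training', 'bicycling'])

# Keep labels that sort before 'Running'.
earlier = labels[labels < 'Running']
print(earlier.tolist())   # ['Bicycling']
```

The lowercase `'bicycling'` is left out because lowercase letters sort after uppercase ones, which is one more reason to stick to `==`, `!=`, and `isin` for text columns.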

#### Below, write code that returns only people using the B-Wolf measuring device. Do this two different ways. To make it easier to see that each way worked there are separate cells for each way.

In [8]:
# your code goes here for the first way


In [9]:
# the second way goes here


What if we wanted to filter based on the relationship between two columns? We can pass more complex expressions to the filter as well. The filter below will return individuals whose maximum heart rate is within 20 beats of their minimum heart rate.

In [10]:
df[(df['Heart Rate Max'] - df['Heart Rate Min']) <= 20]

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,929593150,Skykandal,90,74,80.159248,21,Running
19,243339079,Skykandal,86,83,84.580702,32,Weight training
26,949189280,B-Wolf,104,89,97.331448,40,Weight training
52,461584064,Skykandal,88,73,82.468535,84,Swimming
56,315613689,Skykandal,86,75,79.938946,59,Running
96,178712255,Skykandal,79,77,77.325604,62,Bicycling
101,735044859,Skykandal,94,84,88.060184,27,Running
104,765708551,B-Wolf,113,102,112.618715,19,Swimming
127,451525497,B-Wolf,100,95,99.302174,72,Swimming
128,296346400,B-Wolf,102,88,99.073136,31,Swimming


#### Below, write code that returns everyone whose average heart rate was less than the number you would get by averaging the maximum and minimum heart rates. It's easy to miss parentheses and brackets when writing something this long, so I suggest that you write in your parentheses/brackets first, and then fill them in.

In [12]:
# your code goes here


### Filtering Using isin()
The filtering tools above work fine for text columns where we want only one value, but what if we want more than one value? The `isin` method is our answer.

In [11]:
df[df['Exercise Type'].isin(['Running', 'Swimming'])]

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
0,929593150,Skykandal,90,74,80.159248,21,Running
2,666846822,Skykandal,129,85,87.127954,74,Running
3,785714745,Skykandal,134,83,111.054789,10,Running
4,610612157,Skykandal,165,77,157.106239,59,Running
5,513008405,B-Wolf,95,64,75.362680,73,Swimming
...,...,...,...,...,...,...,...
493,922801079,Skykandal,159,72,108.686081,14,Swimming
495,181360174,Skykandal,123,70,95.654617,73,Swimming
496,467340981,B-Wolf,133,85,100.985588,89,Running
498,332904239,Skykandal,123,69,94.944867,41,Swimming


The `isin` method is very similar to the basic Python `in`. It simply checks to see if something is in an iterable that you provide. In this case we passed the list `['Running', 'Swimming']` to `isin` and got back every row where the value in the Exercise Type column was in the list.
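The comparison with plain Python `in` can be sketched like this. The `wanted` list and the `exercise` Series are made up for illustration.

```python
import pandas as pd

wanted = ['Running', 'Swimming']

# Plain Python: `in` answers the question for one value at a time.
print('Running' in wanted)     # True
print('Bicycling' in wanted)   # False

# pandas: `isin` answers the same question for every row at once.
exercise = pd.Series(['Running', 'Bicycling', 'Swimming', 'Weight training'])
mask = exercise.isin(wanted)
print(mask.tolist())           # [True, False, True, False]
```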

If we had a lot of categories and we wanted to get most, but not all, of them we could in theory use `isin` and write a long list of categories to be included. However, it would be nicer to exclude the small list. Thankfully, ~ works for `isin` just like it did for mathematical expressions.

Below, we write an expression to return all the rows where the Exercise Type is **not** in the list `['Running', 'Swimming']`.

In [12]:
df[~df['Exercise Type'].isin(['Running', 'Swimming'])]

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
1,210260041,B-Wolf,140,90,124.133686,56,Weight training
10,619463985,Skykandal,143,70,99.431671,63,Bicycling
11,137370466,Skykandal,182,61,135.091871,21,Bicycling
19,243339079,Skykandal,86,83,84.580702,32,Weight training
20,570520774,Skykandal,140,105,120.994608,20,Bicycling
...,...,...,...,...,...,...,...
485,212321519,Skykandal,135,76,97.572078,43,Bicycling
487,614170363,B-Wolf,129,68,104.823511,14,Bicycling
492,597698296,Skykandal,138,74,115.064023,86,Weight training
494,457870858,Skykandal,175,58,118.471326,45,Weight training


While you might be tempted to write something like `df['Exercise Type'].~isin(...)`, Python interprets that as nonsense. In reality, `isin` creates a Series full of `True` and `False` (one value per row), and `~` asks for that Series to be flipped, so every `True` becomes a `False` and vice versa.
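A minimal sketch of the flip, using a made-up `exercise` Series:

```python
import pandas as pd

# Toy labels, invented for this illustration only.
exercise = pd.Series(['Running', 'Bicycling', 'Swimming'])

# isin returns one True/False per row.
mask = exercise.isin(['Swimming'])
print(mask.tolist())      # [False, False, True]

# ~ flips every value in that Series.
flipped = ~mask
print(flipped.tolist())   # [True, True, False]
```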

#### For a bit of quick practice, make a filter that returns everyone who wasn't bicycling or running.

In [13]:
# your code goes here


### Chaining Filters
Now, we can combine filters by chaining filters together in the following manner:

In [14]:
temp_df = df[df['Heart Rate Min'] > 80]
temp_df[temp_df['Measurement Device'] == 'Skykandal']

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
2,666846822,Skykandal,129,85,87.127954,74,Running
3,785714745,Skykandal,134,83,111.054789,10,Running
19,243339079,Skykandal,86,83,84.580702,32,Weight training
20,570520774,Skykandal,140,105,120.994608,20,Bicycling
21,209969243,Skykandal,199,104,161.730826,88,Running
...,...,...,...,...,...,...,...
456,695235412,Skykandal,159,86,130.613558,16,Weight training
469,829030938,Skykandal,95,87,92.018967,72,Bicycling
478,141034704,Skykandal,105,92,96.107416,24,Running
483,965440669,Skykandal,165,81,128.951019,32,Running


However, this is ugly and can require creating a large number of temporary dataframes that exist only to hold intermediate steps. We could do the same thing a different way: write each expression in parentheses and then connect them together.

Below is the same result as above but in one line.

In [15]:
df[(df['Heart Rate Min'] > 80) & (df['Measurement Device'] == 'Skykandal')]

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
2,666846822,Skykandal,129,85,87.127954,74,Running
3,785714745,Skykandal,134,83,111.054789,10,Running
19,243339079,Skykandal,86,83,84.580702,32,Weight training
20,570520774,Skykandal,140,105,120.994608,20,Bicycling
21,209969243,Skykandal,199,104,161.730826,88,Running
...,...,...,...,...,...,...,...
456,695235412,Skykandal,159,86,130.613558,16,Weight training
469,829030938,Skykandal,95,87,92.018967,72,Bicycling
478,141034704,Skykandal,105,92,96.107416,24,Running
483,965440669,Skykandal,165,81,128.951019,32,Running


All we've done here is connect two expressions we already understand with `&`. Each expression (`df['Heart Rate Min'] > 80` and `df['Measurement Device'] == 'Skykandal'`) is evaluated separately for every row. A row is kept only if both expressions are True for that row.
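The row-by-row behavior can be sketched on a made-up `toy` dataframe. The names `high_min`, `skykandal`, and `both` are illustrative only.

```python
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({'Heart Rate Min': [85, 83, 70, 90],
                    'Measurement Device': ['Skykandal', 'B-Wolf',
                                           'Skykandal', 'Skykandal']})

# Each expression gives one True/False per row.
high_min = toy['Heart Rate Min'] > 80
skykandal = toy['Measurement Device'] == 'Skykandal'
print(high_min.tolist())    # [True, True, False, True]
print(skykandal.tolist())   # [True, False, True, True]

# & compares the two Series position by position.
both = high_min & skykandal
print(both.tolist())        # [True, False, False, True]
```

The parentheses in the one-line version matter because Python applies `&` before comparisons like `>` and `==`. Without them, the expression is grouped in a way you did not intend.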

#### Write some code below that filters for all people who were swimming and spent more than 60 minutes doing so.

In [18]:
# your code goes here


There are other ways to connect expressions. `&` handles "and", meaning both parts have to be true. `|` (which you can generally type as Shift+\\) means "or", meaning at least one of the parts has to be true.

This is how `&` and `|` interpret logical expressions:
* True & True = True
* False & True = False
* False & False = False
* False | True = True
* True | True = True
* False | False = False
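You can check these rules yourself with two small Series that cover every combination. The names `left` and `right` are made up for this sketch.

```python
import pandas as pd

# Every pairing of True and False, one pairing per row.
left = pd.Series([True, True, False, False])
right = pd.Series([True, False, True, False])

# "and": True only where both sides are True.
print((left & right).tolist())   # [True, False, False, False]

# "or": True where at least one side is True.
print((left | right).tolist())   # [True, True, True, False]
```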

Between `&`, `|`, and `~` you can build some fairly complex expressions, especially since you can connect blocks inside parentheses with other blocks.

The code block below filters for everyone with an average heart rate of 120-130 or 140-150. Visually, the line is broken at the | so you can more easily see the two halves.

In [16]:
df[((df['Heart Rate Avg'] >= 120) & (df['Heart Rate Avg'] <= 130)) |
   ((df['Heart Rate Avg'] >= 140) & (df['Heart Rate Avg'] <= 150))]

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
1,210260041,B-Wolf,140,90,124.133686,56,Weight training
6,833795233,B-Wolf,143,103,127.35412,36,Swimming
9,654481713,B-Wolf,182,83,147.659343,33,Running
20,570520774,Skykandal,140,105,120.994608,20,Bicycling
30,201073747,Skykandal,134,90,121.829258,22,Running
34,578201952,Skykandal,155,87,123.627619,69,Bicycling
35,135944927,B-Wolf,146,83,127.822545,45,Swimming
54,131740209,B-Wolf,174,67,121.372098,23,Swimming
58,573380140,B-Wolf,161,59,143.835413,44,Running
98,867698149,B-Wolf,160,75,120.906561,84,Swimming


There's more than one way to do this. The code block below will give the same results as the one above. Spend a minute looking at it until you can tell why. It may help to think about both filters in chunks (both filters are two sets of two linked expressions).

In [17]:
df[~((df['Heart Rate Avg'] < 120) | (df['Heart Rate Avg'] > 150)) &
   ~((df['Heart Rate Avg'] > 130) & (df['Heart Rate Avg'] < 140))]

Unnamed: 0,ID,Measurement Device,Heart Rate Max,Heart Rate Min,Heart Rate Avg,Duration of exercise (min),Exercise Type
1,210260041,B-Wolf,140,90,124.133686,56,Weight training
6,833795233,B-Wolf,143,103,127.35412,36,Swimming
9,654481713,B-Wolf,182,83,147.659343,33,Running
20,570520774,Skykandal,140,105,120.994608,20,Bicycling
30,201073747,Skykandal,134,90,121.829258,22,Running
34,578201952,Skykandal,155,87,123.627619,69,Bicycling
35,135944927,B-Wolf,146,83,127.822545,45,Swimming
54,131740209,B-Wolf,174,67,121.372098,23,Swimming
58,573380140,B-Wolf,161,59,143.835413,44,Running
98,867698149,B-Wolf,160,75,120.906561,84,Swimming
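If you would rather confirm the equivalence than take it on faith, one option is to build both True/False Series and compare them. The sketch below does that on a made-up `toy` dataframe whose values sit on and around the boundaries. The names `first` and `second` are illustrative.

```python
import pandas as pd

# Toy averages chosen to sit on and around the range boundaries.
toy = pd.DataFrame({'Heart Rate Avg': [110.0, 120.0, 125.0, 130.0,
                                       135.0, 140.0, 150.0, 155.0]})
avg = toy['Heart Rate Avg']

# First version: inside 120-130 OR inside 140-150.
first = ((avg >= 120) & (avg <= 130)) | ((avg >= 140) & (avg <= 150))

# Second version: NOT outside 120-150 AND NOT strictly between 130 and 140.
second = ~((avg < 120) | (avg > 150)) & ~((avg > 130) & (avg < 140))

print(first.tolist())         # [False, True, True, True, False, True, True, False]
print(first.equals(second))   # True
```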


Remember how I said that creating intermediate dataframes is ugly? When an expression gets complicated enough, you sometimes need intermediate steps anyway, for debugging purposes. Still, chaining expressions this way lets you build extremely powerful filters.
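One possible middle ground, sketched below, is to store each True/False Series under a descriptive name instead of storing whole intermediate dataframes. The `toy` dataframe and the names `is_running` and `is_long` are made up for illustration.

```python
import pandas as pd

# Toy data, invented for this illustration only.
toy = pd.DataFrame({'Exercise Type': ['Running', 'Swimming',
                                      'Running', 'Bicycling'],
                    'Duration of exercise (min)': [20, 75, 65, 80]})

# Name each piece of the filter so it can be inspected on its own.
is_running = toy['Exercise Type'] == 'Running'
is_long = toy['Duration of exercise (min)'] > 60

# Summing a True/False Series counts the True values, which is handy
# for checking each piece before combining them.
print(int(is_running.sum()))   # 2
print(int(is_long.sum()))      # 3

result = toy[is_running & is_long]
print(result.index.tolist())   # [2]
```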

#### Below, write code that filters for people who went running or swimming while using the B-Wolf device OR who went bicycling using Skykandal, and have an average heart rate below the number you get by averaging their maximum and minimum heart rates together.

In [21]:
# your code goes here


At this point you may be thinking, "Wow, I wish I could write a function, pass all of these rows to that function, evaluate the row there, and then just return a True/False label I could filter against."

That's a great idea! We'll cover it in the next notebook.