# Advanced Filtering and Making Complex New Columns

## Summary
In this notebook, we'll be covering:
- [Making new data labels](#Making-New-Data-Labels)
- [Applying a function](#Applying-a-Function)
- [Apply with multiple columns](#Apply-With-Multiple-Columns)
- [Lambda functions](#Lambda-Functions)

### Introduction

In this section we'll build new columns that require complex calculations, filters, or both. We can then filter on those columns or simply use their values directly.

Obviously, we need our dataframe again.

In [2]:
import pandas as pd
import random

For an explanation of what the code block below does, check out the notebook "3 - Cleaning Data".

In [3]:
workout_dict = {'ID': [], 'Measurement Device': [], 'Heart Rate Max': [], 'Heart Rate Min': [], 'Heart Rate Avg': [],
              'Duration of exercise (min)': [], 'Exercise Type': []}
used_ids = []

# Since we're not using the number generated by `range`, we can use an underscore as a placeholder
for _ in range(0, 500):
    workout_id = random.randint(100_000_000, 999_999_999)
    while workout_id in used_ids:
        workout_id = random.randint(100_000_000, 999_999_999)
    used_ids.append(workout_id)
    device = random.choice(['Skykandal', 'B-Wolf'])
    mu = random.randint(65, 85)
    min_rate = int(random.gauss(mu, 10))
    max_rate = int(random.gauss(mu + 55, 25))
    while max_rate <= min_rate:
        max_rate = int(random.gauss(mu + 55, 25))
    avg = random.gauss((max_rate + min_rate) / 2, (max_rate - min_rate) / 5)
    duration = random.randint(10, 90)
    exercise = random.choice(['Running', 'Running', 'Running', 'Bicycling', 'Swimming', 'Swimming',
                              'Weight training'])
    row = [device, min_rate, max_rate, avg, duration, exercise]
    workout_dict['ID'].append(workout_id)
    workout_dict['Measurement Device'].append(row[0])
    workout_dict['Heart Rate Min'].append(row[1])
    workout_dict['Heart Rate Max'].append(row[2])
    workout_dict['Heart Rate Avg'].append(row[3])
    workout_dict['Duration of exercise (min)'].append(row[4])
    workout_dict['Exercise Type'].append(row[5])

df = pd.DataFrame(workout_dict)
df.head(10)

|   | ID | Measurement Device | Heart Rate Max | Heart Rate Min | Heart Rate Avg | Duration of exercise (min) | Exercise Type |
|---|---|---|---|---|---|---|---|
| 0 | 545591518 | Skykandal | 107 | 86 | 96.365454 | 27 | Weight training |
| 1 | 363283639 | B-Wolf | 185 | 80 | 127.634372 | 83 | Running |
| 2 | 853372507 | B-Wolf | 118 | 83 | 97.186502 | 48 | Running |
| 3 | 941758991 | B-Wolf | 108 | 79 | 90.000578 | 29 | Weight training |
| 4 | 609296827 | Skykandal | 157 | 78 | 98.496899 | 46 | Weight training |
| 5 | 627941273 | B-Wolf | 133 | 67 | 94.759048 | 33 | Running |
| 6 | 789595116 | Skykandal | 163 | 80 | 111.832072 | 73 | Running |
| 7 | 198783733 | Skykandal | 150 | 84 | 105.368712 | 40 | Swimming |
| 8 | 424864287 | Skykandal | 108 | 80 | 89.639894 | 21 | Running |
| 9 | 758224973 | Skykandal | 151 | 78 | 98.677267 | 11 | Swimming |


### Making New Data Labels
If we were interested in people with faster heart rates, it would be easy enough to write a filter, say `df[df['Heart Rate Avg'] > 100]`, that would return only those people. However, we might be interested not only in those people but in the difference between people who have faster heart rates and those who don't. In that case we might want to make a new column where people with an average heart rate over 100 are labeled `True` and those with a slower heart rate are labeled `False`. (While we aren't there yet, this sort of labeling makes it easy to get summary statistics for each group separately.)

A quick reminder: making a new column is as simple as setting that column equal to something. `df['Something'] = 3` would make a column called "Something" where every value was 3.
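As a small illustration of that reminder, the sketch below shows a single value being repeated down a whole new column. `toy_df` and its contents are invented for this example and are not part of the workout data.

```python
import pandas as pd

# toy_df is a hypothetical three-row dataframe used only for this illustration
toy_df = pd.DataFrame({'Exercise Type': ['Running', 'Swimming', 'Bicycling']})

# assigning a single value creates the column and repeats that value in every row
toy_df['Something'] = 3

print(toy_df)
```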

We can use this basic syntax to make a new column with True/False values by passing an expression very much like a filter. Because we're going to want to try several variations on this, the code block below begins by making a copy of df that we'll modify, so we can start fresh by going back to df.

In [4]:
fast_heart_rate_df = df.copy()
fast_heart_rate_df['Fast Heart Rate'] = fast_heart_rate_df['Heart Rate Avg'] > 100
fast_heart_rate_df.head(10)

|   | ID | Measurement Device | Heart Rate Max | Heart Rate Min | Heart Rate Avg | Duration of exercise (min) | Exercise Type | Fast Heart Rate |
|---|---|---|---|---|---|---|---|---|
| 0 | 545591518 | Skykandal | 107 | 86 | 96.365454 | 27 | Weight training | False |
| 1 | 363283639 | B-Wolf | 185 | 80 | 127.634372 | 83 | Running | True |
| 2 | 853372507 | B-Wolf | 118 | 83 | 97.186502 | 48 | Running | False |
| 3 | 941758991 | B-Wolf | 108 | 79 | 90.000578 | 29 | Weight training | False |
| 4 | 609296827 | Skykandal | 157 | 78 | 98.496899 | 46 | Weight training | False |
| 5 | 627941273 | B-Wolf | 133 | 67 | 94.759048 | 33 | Running | False |
| 6 | 789595116 | Skykandal | 163 | 80 | 111.832072 | 73 | Running | True |
| 7 | 198783733 | Skykandal | 150 | 84 | 105.368712 | 40 | Swimming | True |
| 8 | 424864287 | Skykandal | 108 | 80 | 89.639894 | 21 | Running | False |
| 9 | 758224973 | Skykandal | 151 | 78 | 98.677267 | 11 | Swimming | False |


This works up to a point. What if we wanted a column that flags values within a range? That also works, using a more complex expression.

#### Write an expression below that makes a column that is True only if someone's average heart rate is between 95 and 105. (Note: each comparison needs its own parentheses when you combine them with `&`; otherwise Python evaluates the `&` before the comparisons.)

In [5]:
# write your code here


### Applying a Function

This approach breaks down when we need more than two labels, or anything more complex. At that point we want to pass each row to a function that can evaluate the value in a given column and return a result. However, if we loop over a dataframe directly, we get the column names rather than the rows.

In [6]:
for x in df:
    print(x)

ID
Measurement Device
Heart Rate Max
Heart Rate Min
Heart Rate Avg
Duration of exercise (min)
Exercise Type


Instead, pandas supplies an `iterrows` method that lets us iterate over the rows. It's a method, so you call it like a function, and it produces an iterable of (index, row) pairs. However, each row is returned as a Series rather than a simple list, and it is generally a copy rather than a view, so changing it doesn't change the dataframe. To change the dataframe, you need to look up the original row by index and change that. It's a mess, so pandas provides a specific method that handles all of this much more cleanly.
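To make the copy behavior concrete, here is a minimal sketch on a made-up two-row dataframe (`toy_df` is hypothetical). It mixes text and numeric columns on purpose, since that is the case where each row is built as a fresh copy.

```python
import pandas as pd

# toy_df is a made-up dataframe; mixing text and numbers means each row is a fresh copy
toy_df = pd.DataFrame({'Exercise Type': ['Running', 'Swimming'],
                       'Heart Rate Avg': [96.0, 127.0]})

for index, row in toy_df.iterrows():
    # this changes only the temporary row object, not toy_df itself
    row['Heart Rate Avg'] = 0

# the dataframe still holds the original values
print(toy_df)
```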

However, if you want to see what you're missing, the code to do this the hard way is below.

In [7]:
# copy the dataframe, just because we don't want to mess up our real dataframe for this example
iterrow_df = df.copy()

# make a new column and give it a default value
iterrow_df['Heart Rate Class'] = 'Middle'

# generate the iterrows object and iterate over it
# iterrows returns an iterable with two items in it, so we unpack that all at once
for index, row in df.iterrows():
    # check the average heart rate; if it shouldn't keep the default label, change it
    if row['Heart Rate Avg'] >= 110:
        # use loc with a tuple to access the original cell and change it
        iterrow_df.loc[(index, 'Heart Rate Class')] = 'Fast'
    elif row['Heart Rate Avg'] < 90:
        iterrow_df.loc[(index, 'Heart Rate Class')] = 'Slow'
        
iterrow_df.head(10)

|   | ID | Measurement Device | Heart Rate Max | Heart Rate Min | Heart Rate Avg | Duration of exercise (min) | Exercise Type | Heart Rate Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 545591518 | Skykandal | 107 | 86 | 96.365454 | 27 | Weight training | Middle |
| 1 | 363283639 | B-Wolf | 185 | 80 | 127.634372 | 83 | Running | Fast |
| 2 | 853372507 | B-Wolf | 118 | 83 | 97.186502 | 48 | Running | Middle |
| 3 | 941758991 | B-Wolf | 108 | 79 | 90.000578 | 29 | Weight training | Middle |
| 4 | 609296827 | Skykandal | 157 | 78 | 98.496899 | 46 | Weight training | Middle |
| 5 | 627941273 | B-Wolf | 133 | 67 | 94.759048 | 33 | Running | Middle |
| 6 | 789595116 | Skykandal | 163 | 80 | 111.832072 | 73 | Running | Fast |
| 7 | 198783733 | Skykandal | 150 | 84 | 105.368712 | 40 | Swimming | Middle |
| 8 | 424864287 | Skykandal | 108 | 80 | 89.639894 | 21 | Running | Slow |
| 9 | 758224973 | Skykandal | 151 | 78 | 98.677267 | 11 | Swimming | Middle |


Hopefully at this point you're ready to see the simple way.

The simple way uses a user-defined function and `apply`. `apply` is a dataframe method with two main arguments to pay attention to: the function that `apply` sends everything to, and an axis that determines whether it sends columns or rows (`axis=0` sends each column, `axis=1` sends each row).
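The difference between the two axis values is easiest to see side by side. The sketch below uses a small made-up dataframe; `toy_df` and the helper `largest` are names invented for this illustration.

```python
import pandas as pd

# toy_df is a made-up dataframe with two numeric columns and three rows
toy_df = pd.DataFrame({'Heart Rate Max': [150, 120, 180],
                       'Heart Rate Min': [80, 70, 90]})

def largest(values):
    # values is one whole column when axis=0 and one whole row when axis=1
    return values.max()

# axis=0: the function is called once per column, giving one result per column
per_column = toy_df.apply(largest, axis=0)

# axis=1: the function is called once per row, giving one result per row
per_row = toy_df.apply(largest, axis=1)

print(per_column)
print(per_row)
```

With `axis=0` the result is indexed by column name, and with `axis=1` it is indexed by row label, which is why the row-wise form is the one that lines up with a new column.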

The example below just demonstrates how apply works, without making a new column. We'll define a function that just prints the average heart rate in the row and then apply that function over each row.

In [8]:
def pointless_print(row):
    print(row['Heart Rate Avg'])
    
df.apply(pointless_print, axis=1)

96.36545366568481
127.63437166359051
97.18650216571294
90.00057796321092
98.49689915407345
94.75904823580638
111.83207174311283
105.36871189411053
89.63989370760683
98.67726707352324
109.19810549501202
101.74583049069956
118.71466010484332
133.3987496334192
77.81374081807527
87.80683148570617
87.16073752830235
103.6997896424392
109.67767607053132
109.30673118627423
103.16479194070787
107.86752966139365
144.2166950044861
103.92453839272311
110.24329445248645
118.80327518501471
148.01267322536498
140.72381457379666
73.03580540565771
100.88061742371254
134.2844327394012
105.02470181309694
113.12975920436412
98.51901120062641
161.37314883542265
140.8535803316825
126.22401655475934
91.50250226273968
75.87153695408584
76.7843896111564
106.19756129077162
125.80536051357608
104.40103619280309
103.33203561785399
89.4037781242445
116.18532265562148
79.4325424820836
114.16181889358037
105.8800759693682
100.9291392142145
93.71719848566596
65.10407340408162
86.35687821674085
148.24376024961836
102.

0      None
1      None
2      None
3      None
4      None
       ... 
495    None
496    None
497    None
498    None
499    None
Length: 500, dtype: object

If we altered the function slightly, so that it returns the value instead of printing it, we would get a Series.

In [9]:
def return_avg(row):
    return row['Heart Rate Avg']

df.apply(return_avg, axis=1)

0       96.365454
1      127.634372
2       97.186502
3       90.000578
4       98.496899
          ...    
495    115.446933
496    114.222054
497    108.770303
498     99.379796
499    106.639057
Length: 500, dtype: float64

Because the Series has one value per row of the dataframe, we can easily add it as a column. Let's modify the function to classify heart rates into categories like the ones we used earlier, and then attach the output as a new column.

In [10]:
heart_rate_classifier_df = df.copy()

def classify_avg(row):
    if row['Heart Rate Avg'] < 90:
        rate = 'Slower'
    elif row['Heart Rate Avg'] < 110:
        rate = 'Middle'
    else:
        rate = 'Faster'
    return rate

heart_rate_classifier_df['Heart Rate Class'] = heart_rate_classifier_df.apply(classify_avg, axis=1)
heart_rate_classifier_df.head()

|   | ID | Measurement Device | Heart Rate Max | Heart Rate Min | Heart Rate Avg | Duration of exercise (min) | Exercise Type | Heart Rate Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 545591518 | Skykandal | 107 | 86 | 96.365454 | 27 | Weight training | Middle |
| 1 | 363283639 | B-Wolf | 185 | 80 | 127.634372 | 83 | Running | Faster |
| 2 | 853372507 | B-Wolf | 118 | 83 | 97.186502 | 48 | Running | Middle |
| 3 | 941758991 | B-Wolf | 108 | 79 | 90.000578 | 29 | Weight training | Middle |
| 4 | 609296827 | Skykandal | 157 | 78 | 98.496899 | 46 | Weight training | Middle |


As you can see, this allows us to do fairly complex processing, since we can hand off an entire row to a function of whatever complexity we need.

Before we go further, practice this yourself.
#### In the block below, calculate a new column called "Midpoint" that is the average of the maximum and minimum heart rates, using apply. (This can be done without using apply, but that's not the point of this exercise.)

In [11]:
# your code goes here


### Apply With Multiple Columns

We can even create multiple columns at once. If the function returns multiple values and we pass the argument `result_type='expand'`, we get a small dataframe back. We could either join this dataframe to our existing one or, more simply, declare that this dataframe is several new columns in the existing dataframe. We do this the same way we access multiple columns at once, using a list of column names. E.g., getting or setting columns A and B would be `df[['A', 'B']]`.

In the example below we will run this operation on a copy of the dataframe.

In [12]:
multi_col_classifier_df = df.copy()

# what is important about this function is that it returns two items, not one
def multiple_returns(row):
    aerobic_exercise = True
    heart_rate_class = 'Middle'
    if row['Exercise Type'] == 'Weight training':
        aerobic_exercise = False
    if row['Heart Rate Avg'] < 90:
        heart_rate_class = 'Low'
    elif row['Heart Rate Avg'] >= 110:
        heart_rate_class = 'High'
    return aerobic_exercise, heart_rate_class


# attach the output of apply to the dataframe
multi_col_classifier_df[['Aerobic Exercise', 'Heart Rate Class']] = multi_col_classifier_df.apply(multiple_returns, axis=1, result_type='expand')
multi_col_classifier_df.head(10)

|   | ID | Measurement Device | Heart Rate Max | Heart Rate Min | Heart Rate Avg | Duration of exercise (min) | Exercise Type | Aerobic Exercise | Heart Rate Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 545591518 | Skykandal | 107 | 86 | 96.365454 | 27 | Weight training | False | Middle |
| 1 | 363283639 | B-Wolf | 185 | 80 | 127.634372 | 83 | Running | True | High |
| 2 | 853372507 | B-Wolf | 118 | 83 | 97.186502 | 48 | Running | True | Middle |
| 3 | 941758991 | B-Wolf | 108 | 79 | 90.000578 | 29 | Weight training | False | Middle |
| 4 | 609296827 | Skykandal | 157 | 78 | 98.496899 | 46 | Weight training | False | Middle |
| 5 | 627941273 | B-Wolf | 133 | 67 | 94.759048 | 33 | Running | True | Middle |
| 6 | 789595116 | Skykandal | 163 | 80 | 111.832072 | 73 | Running | True | High |
| 7 | 198783733 | Skykandal | 150 | 84 | 105.368712 | 40 | Swimming | True | Middle |
| 8 | 424864287 | Skykandal | 108 | 80 | 89.639894 | 21 | Running | True | Low |
| 9 | 758224973 | Skykandal | 151 | 78 | 98.677267 | 11 | Swimming | True | Middle |
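To see what `result_type='expand'` changes, the sketch below runs the same two-value function with and without it. The dataframe `toy_df` and the function `two_values` are made up for this comparison.

```python
import pandas as pd

# toy_df is a made-up three-row dataframe used only for this comparison
toy_df = pd.DataFrame({'Exercise Type': ['Running', 'Weight training', 'Swimming'],
                       'Heart Rate Avg': [85.0, 100.0, 120.0],
                       'Duration of exercise (min)': [30, 45, 20]})

def two_values(row):
    # return two items per row, as in the example above
    is_aerobic = row['Exercise Type'] != 'Weight training'
    is_fast = row['Heart Rate Avg'] >= 110
    return is_aerobic, is_fast

# without result_type we get a single Series whose values are tuples
as_tuples = toy_df.apply(two_values, axis=1)

# with result_type='expand' each returned item becomes its own column (named 0 and 1)
as_columns = toy_df.apply(two_values, axis=1, result_type='expand')

print(as_tuples)
print(as_columns)
```

The expanded form is what makes the assignment to a list of new column names possible, because it has one column per returned item.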


`apply` can also be used on a single column. For instance, the code below uses a modified version of the `classify_avg` function we defined earlier on the Heart Rate Avg column, with no need to look up the average heart rate in a row. We'll use yet another copy of df for this, and give it a classifying column.

In [13]:
def classify_avg(avg):
    if avg < 90:
        rate_class = 'Low'
    elif avg < 110:
        rate_class = 'Middle'
    else:
        rate_class = 'High'
    return rate_class


single_col_classifier_df = df.copy()

single_col_classifier_df['Class'] = single_col_classifier_df['Heart Rate Avg'].apply(classify_avg)
single_col_classifier_df.head(10)

|   | ID | Measurement Device | Heart Rate Max | Heart Rate Min | Heart Rate Avg | Duration of exercise (min) | Exercise Type | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 545591518 | Skykandal | 107 | 86 | 96.365454 | 27 | Weight training | Middle |
| 1 | 363283639 | B-Wolf | 185 | 80 | 127.634372 | 83 | Running | High |
| 2 | 853372507 | B-Wolf | 118 | 83 | 97.186502 | 48 | Running | Middle |
| 3 | 941758991 | B-Wolf | 108 | 79 | 90.000578 | 29 | Weight training | Middle |
| 4 | 609296827 | Skykandal | 157 | 78 | 98.496899 | 46 | Weight training | Middle |
| 5 | 627941273 | B-Wolf | 133 | 67 | 94.759048 | 33 | Running | Middle |
| 6 | 789595116 | Skykandal | 163 | 80 | 111.832072 | 73 | Running | High |
| 7 | 198783733 | Skykandal | 150 | 84 | 105.368712 | 40 | Swimming | Middle |
| 8 | 424864287 | Skykandal | 108 | 80 | 89.639894 | 21 | Running | Low |
| 9 | 758224973 | Skykandal | 151 | 78 | 98.677267 | 11 | Swimming | Middle |


#### In the cell below, try calling apply on a single column so that it returns half the exercise time.

In [14]:
# your code goes here


### Lambda Functions

`apply` also allows you to define functions on the spot. You never have to do this, but there are times when it is useful. Below, we'll use the built-in string method `lower()` to make the Exercise Type column labels all lowercase. Since `lower` is a method, we can't just pass things to it; it is attached to text using dot notation. What we can do is use Python's `lambda` to make an on-the-spot function. `lambda` creates an anonymous function, meaning we don't have to give it a name or define it with `def`. `lambda x:` means "we're making a function which takes an argument, x", so `lambda x: x.lower()` means "we're making a function that takes an argument, x, and returns x.lower()".

(If `lower` is unclear, think of it this way: if `text` is a text variable then `text.lower()` gives us the lowercase version of `text`.)
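Before using `lambda` with `apply`, it may help to see it in plain Python. In the sketch below, the names `make_lowercase` and `make_lowercase_lambda` are invented for this illustration; both functions behave the same way.

```python
# a named function created with def
def make_lowercase(text):
    return text.lower()

# the same behavior written as a lambda and stored in a variable
make_lowercase_lambda = lambda x: x.lower()

# both calls produce the lowercase version of the text
print(make_lowercase('Weight Training'))
print(make_lowercase_lambda('Weight Training'))
```

In practice a lambda is usually written directly where it is needed rather than stored in a variable; it is given a name here only so the two versions can be compared.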

In [15]:
lowercase_exercise_df = df.copy()

lowercase_exercise_df['Exercise Type'] = lowercase_exercise_df['Exercise Type'].apply(lambda x: x.lower())
lowercase_exercise_df.head()

|   | ID | Measurement Device | Heart Rate Max | Heart Rate Min | Heart Rate Avg | Duration of exercise (min) | Exercise Type |
|---|---|---|---|---|---|---|---|
| 0 | 545591518 | Skykandal | 107 | 86 | 96.365454 | 27 | weight training |
| 1 | 363283639 | B-Wolf | 185 | 80 | 127.634372 | 83 | running |
| 2 | 853372507 | B-Wolf | 118 | 83 | 97.186502 | 48 | running |
| 3 | 941758991 | B-Wolf | 108 | 79 | 90.000578 | 29 | weight training |
| 4 | 609296827 | Skykandal | 157 | 78 | 98.496899 | 46 | weight training |


The code below does exactly the same thing (to a different column), so don't feel like you have to use `lambda`. The form below takes more lines of code, but that's fine when you're starting out, and is easier to read for some people.

In [16]:
def lowercase(text):
    return text.lower()

lowercase_exercise_df['Measurement Device lower'] = lowercase_exercise_df['Measurement Device'].apply(lowercase)
lowercase_exercise_df.head()

|   | ID | Measurement Device | Heart Rate Max | Heart Rate Min | Heart Rate Avg | Duration of exercise (min) | Exercise Type | Measurement Device lower |
|---|---|---|---|---|---|---|---|---|
| 0 | 545591518 | Skykandal | 107 | 86 | 96.365454 | 27 | weight training | skykandal |
| 1 | 363283639 | B-Wolf | 185 | 80 | 127.634372 | 83 | running | b-wolf |
| 2 | 853372507 | B-Wolf | 118 | 83 | 97.186502 | 48 | running | b-wolf |
| 3 | 941758991 | B-Wolf | 108 | 79 | 90.000578 | 29 | weight training | b-wolf |
| 4 | 609296827 | Skykandal | 157 | 78 | 98.496899 | 46 | weight training | skykandal |


At this point, you have a lot of different ways to use `apply`. Using `apply` to make a column is often a precursor to filtering on that new column. 
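As a rough sketch of that apply-then-filter pattern, the example below labels workouts by duration and then keeps only one label. The dataframe `toy_df`, the function `label_duration`, and the 30-minute cutoff are all invented for this illustration.

```python
import pandas as pd

# toy_df is a made-up four-row dataframe used only for this illustration
toy_df = pd.DataFrame({'Exercise Type': ['Running', 'Swimming', 'Bicycling', 'Running'],
                       'Duration of exercise (min)': [15, 60, 45, 80]})

def label_duration(minutes):
    # the 30-minute cutoff is an arbitrary choice for this example
    if minutes < 30:
        return 'Short'
    return 'Long'

# step 1: use apply on a single column to build the new label column
toy_df['Duration Class'] = toy_df['Duration of exercise (min)'].apply(label_duration)

# step 2: filter on the new column
long_workouts = toy_df[toy_df['Duration Class'] == 'Long']
print(long_workouts)
```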

#### So, for practice using apply, assume that we know that B-Wolf devices consistently measure 2 bpm lower than Skykandal devices. Also, Skykandal devices react poorly to water, and measure 5 bpm too high if you're swimming. Write a block of code that bumps up all B-Wolf device heart rate measures by 2, subtracts 5 from Skykandal heart rate measures for swimmers only, and then filters on the corrected rates so that we only have people with average heart rates above 100.

In [17]:
# your code goes here


Now, it's time to look at summarizing the data from all of these filtering operations and new columns.