<img src=images/gdd-logo.png width=300px align=right> 

# Filtering

In this notebook, we will cover how to filter data.

- [Filtering](#f)
   - [Using (lambda) functions to filter](#lambdas)
   - <mark>[Exercise: Filtering](#e-filter)</mark>
- [Multiple Conditions](#m)
    - <mark>[Exercise: Multiple Conditions](#e-mult)</mark>

In [1]:
import pandas as pd

chickweight = pd.read_csv('data/chickweight.csv').rename(str.lower, axis='columns')

<a id='f'></a>
## Filtering

It is possible to check whether the values in a DataFrame or Series meet a certain condition. 

For example, *which values in the chickweight DataFrame are less than 4?*

In [2]:
(
    chickweight < 4
).head()

Unnamed: 0,rownum,weight,time,chick,diet
0,True,False,True,True,True
1,True,False,True,True,True
2,True,False,False,True,True
3,False,False,False,True,True
4,False,False,False,True,True


Probably more useful would be to check one column at a time:

In [3]:
(
    chickweight['time'] < 4
)

0       True
1       True
2      False
3      False
4      False
       ...  
573    False
574    False
575    False
576    False
577    False
Name: time, Length: 578, dtype: bool

This creates a boolean mask: a Series of `True`/`False` values, one for each row. 

A second type of argument that `.loc[]` accepts is a boolean mask, so we can use the mask to filter our DataFrame.

In [4]:
(
    chickweight
    .loc[chickweight['time'] < 4]
)

Unnamed: 0,rownum,weight,time,chick,diet
0,1,42,0,1,1
1,2,51,2,1,1
12,13,40,0,2,1
13,14,49,2,2,1
24,25,43,0,3,1
...,...,...,...,...,...
543,544,50,2,48,4
554,555,40,0,49,4
555,556,53,2,49,4
566,567,41,0,50,4
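
To see what the mask is doing, here is a minimal sketch on a small hand-made DataFrame (the `toy` frame and its values are invented for illustration, they are not taken from `chickweight`). The comparison returns a boolean Series with the same index as the data, and `.loc[]` keeps only the rows where the mask is `True`.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"time": [0, 2, 4, 6], "weight": [42, 51, 59, 64]})

# The comparison is evaluated row by row and returns a boolean Series
mask = toy["time"] < 4
print(type(mask).__name__)  # Series
print(mask.tolist())        # [True, True, False, False]

# .loc[] keeps the rows where the mask is True and preserves their index labels
filtered = toy.loc[mask]
print(filtered.index.tolist())  # [0, 1]
```

Note that the filtered rows keep their original index labels, which is why the index in the output above has gaps.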


### <mark> Practice: Filtering </mark>


Filter the data in the following ways:

1. For when weight is less than 60.

In [6]:
(
    chickweight
    .loc[chickweight['weight'] < 60]
)

Unnamed: 0,rownum,weight,time,chick,diet
0,1,42,0,1,1
1,2,51,2,1,1
2,3,59,4,1,1
12,13,40,0,2,1
13,14,49,2,2,1
...,...,...,...,...,...
543,544,50,2,48,4
554,555,40,0,49,4
555,556,53,2,49,4
566,567,41,0,50,4


2. For chick number 15.

In [9]:
(
    chickweight
    .loc[chickweight['chick'] == 15]
)

Unnamed: 0,rownum,weight,time,chick,diet
167,168,41,0,15,1
168,169,49,2,15,1
169,170,56,4,15,1
170,171,64,6,15,1
171,172,68,8,15,1
172,173,68,10,15,1
173,174,67,12,15,1
174,175,68,14,15,1


In [None]:
# %load answers/02_Selections_and_Filtering/filter-practice.py

<a id = 'lambdas'></a>
## Using (lambda) functions to filter

The third type of argument that `.loc[]` accepts is a function that takes a DataFrame and returns a boolean mask. This function is called on **whatever DataFrame arrives at the `.loc[]`**.

This allows you to add many filters without needing to reference the original name of the DataFrame:

In [12]:
(
    chickweight  
    .loc[lambda df: df['chick'] == 1]
    .assign(time_2=lambda df:df['time']*2)
    .loc[lambda df: df['time'] < 4] 
)

Unnamed: 0,rownum,weight,time,chick,diet,time_2
0,1,42,0,1,1,0
1,2,51,2,1,1,4
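
As a minimal sketch of why the callable form matters in a chain (the `toy` data and the helper `is_chick_one` are invented for illustration): the function passed to `.loc[]` receives the DataFrame as it exists at that point in the chain, so it can filter on a column that `.assign()` has only just created. A mask built from the original variable cannot, because that column does not exist there.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"chick": [1, 1, 2], "time": [0, 2, 0]})

# The lambda sees the intermediate DataFrame, including the new column
result = (
    toy
    .assign(time_2=lambda df: df["time"] * 2)
    .loc[lambda df: df["time_2"] >= 4]
)
print(result)

# A named function works the same way as a lambda
def is_chick_one(df):
    return df["chick"] == 1

print(toy.loc[is_chick_one])

# Referring to the original name fails: `toy` itself has no `time_2` column
try:
    toy.assign(time_2=lambda df: df["time"] * 2).loc[toy["time_2"] >= 4]
except KeyError as err:
    print("KeyError:", err)
```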


What are the benefits of using lambda functions when filtering?

<details>
    <summary><font color=blue>Show answer</font></summary>
    
- Your code works independently of the name of the DataFrame (easy to copy-paste!).
- You can filter on new columns that were created in the previous step.
- You can re-order filters: each one works with whatever DataFrame arrives at it, as long as the column it needs exists.

</details>

**A note about lambda functions**: The lambda function may feel a little strange at this stage but, trust us, when you start to add more chains, the lambda function will really save the day! 



For now it's important to be comfortable with the syntax, so let's practice!

<a id='e-filter'></a>
## <mark> Exercise: Filtering </mark>


Filter the data in the following (separate) ways:


1. For when weight is greater than 150.

In [13]:
(
    chickweight  
    .loc[lambda df: df['weight'] > 150] 
)

Unnamed: 0,rownum,weight,time,chick,diet
9,10,171,18,1,1
10,11,199,20,1,1
11,12,205,21,1,1
20,21,162,16,2,1
21,22,187,18,2,1
...,...,...,...,...,...
573,574,175,14,50,4
574,575,205,16,50,4
575,576,234,18,50,4
576,577,264,20,50,4


2. For when diet is equal to 4.

In [15]:
(
    chickweight  
    .loc[lambda df: df['diet'] == 4] 
)

Unnamed: 0,rownum,weight,time,chick,diet
460,461,42,0,41,4
461,462,51,2,41,4
462,463,66,4,41,4
463,464,85,6,41,4
464,465,103,8,41,4
...,...,...,...,...,...
573,574,175,14,50,4
574,575,205,16,50,4
575,576,234,18,50,4
576,577,264,20,50,4


3. For when weight is less than 60 and time is equal to 2.

In [18]:
(
    chickweight  
    .loc[lambda df: df['weight'] < 60]
    .loc[lambda df: df['time'] == 2]
    .head(10)
)

Unnamed: 0,rownum,weight,time,chick,diet
1,2,51,2,1,1
13,14,49,2,2,1
25,26,39,2,3,1
37,38,49,2,4,1
49,50,42,2,5,1
61,62,49,2,6,1
73,74,49,2,7,1
85,86,50,2,8,1
96,97,51,2,9,1
108,109,44,2,10,1


4. For when weight is less than 60 and time is equal to 2, but only the weight and time columns.

In [22]:
(
    chickweight
    .loc[lambda df: df['weight'] < 60]
    .loc[lambda df: df['time'] == 2]
    .head(10)
    .drop(columns=['rownum', 'chick', 'diet'])
)

Unnamed: 0,weight,time
1,51,2
13,49,2
25,39,2
37,49,2
49,42,2
61,49,2
73,49,2
85,50,2
96,51,2
108,44,2


**BONUS**:

1. Calculate the mean (= average) chick weight.

In [33]:
(
   chickweight['weight'].mean().round(1)
)

121.8

2. Calculate the mean chick weight at time 10.

In [48]:
(
    chickweight
    .loc[lambda df: df['time'] == 10, ['weight']]
    .mean().round(1)
)

weight    107.8
dtype: float64
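
The result above is a one-element Series rather than a single number because the column was selected with a list (`['weight']`). A minimal sketch of the difference, using an invented `toy` DataFrame:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"time": [10, 10, 12], "weight": [100.0, 110.0, 130.0]})

# A list of column names keeps a DataFrame, so .mean() returns a Series
as_series = toy.loc[lambda df: df["time"] == 10, ["weight"]].mean()
print(type(as_series).__name__)  # Series

# A single column name gives a Series, so .mean() returns one number
as_number = toy.loc[lambda df: df["time"] == 10, "weight"].mean()
print(as_number)  # 105.0
```

Both forms compute the same mean; pick the single column name when you want a plain number.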

**Answers**

In [28]:
# %load answers/02_Selections_and_Filtering/filtering.py

# 1
(
    chickweight
    .loc[lambda df: df['weight'] > 150]
)

# 2
(
    chickweight
    .loc[lambda df: df['diet'] == 4]
)

# 3
(
    chickweight
    .loc[lambda df: df['time'] == 2]
    .loc[lambda df: df['weight'] < 60]
)

# 4
(
    chickweight
    .loc[lambda df: df['weight'] < 60, ['weight', 'time']]
    .loc[lambda df: df['time'] == 2]
)

# bonus 1
(
    chickweight['weight']
    .mean()
)

# bonus 2
(
    chickweight
    .loc[lambda df: df['time'] == 10, "weight"]
    .mean()
)


<a id='m'></a>
## Multiple conditions


Let's now look at how to combine multiple conditions in a single `.loc[]` call. This avoids creating an intermediate DataFrame for every filter, and because all conditions are evaluated against the same DataFrame, there is no need to keep track of what earlier filters already removed.

In [None]:
(
    chickweight
    .loc[lambda df: (df['chick'] == 1) & (df['weight'] < 50)]
)

***Notice the parentheses and the use of `&` rather than `and`.***

First, the **parentheses** are needed because of the order of operations: `&` binds more tightly than comparisons such as `==`, `<` and `>=`, so without parentheses it would be evaluated first.
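
A small sketch of what goes wrong without the parentheses, on an invented `toy` DataFrame. Because `&` is applied first, Python reads the unparenthesised expression as the chained comparison `toy['chick'] == (1 & toy['weight']) < 50`, which pandas cannot reduce to a single `True`/`False`, so a `ValueError` is raised.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"chick": [1, 1, 2], "weight": [42, 51, 40]})

# With parentheses: each comparison is evaluated first, then combined row by row
ok = toy.loc[(toy["chick"] == 1) & (toy["weight"] < 50)]
print(ok.index.tolist())  # [0]

# Without parentheses: `1 & toy["weight"]` is evaluated first and the rest
# becomes a chained comparison, which fails
try:
    toy.loc[toy["chick"] == 1 & toy["weight"] < 50]
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```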

**Now for `&` vs `and`...** `and` tests whether both expressions are logically `True`. For example:

In [49]:
True and True

True

In [50]:
True and False

False

In [51]:
False and True

False

In [52]:
False and False

False

But an entire boolean mask isn't a single `True` or `False`; it's a **Series** of `True`/`False` values:

In [53]:
chickweight['chick'] == 1

0       True
1       True
2       True
3       True
4       True
       ...  
573    False
574    False
575    False
576    False
577    False
Name: chick, Length: 578, dtype: bool

In [77]:
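# Using `and` raises a ValueError, because a whole mask has no single truth value:
# (
#     chickweight
#     .loc[lambda df: (df['chick'] == 1) and (df['weight'] < 50)]
# )

# A filter that does run, for comparison: keep the rows where diet is 1 or 2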
(
    chickweight
    .loc[lambda df: df['diet'].isin([1, 2])]
)

Unnamed: 0,rownum,weight,time,chick,diet
0,1,42,0,1,1
1,2,51,2,1,1
2,3,59,4,1,1
3,4,64,6,1,1
4,5,76,8,1,1
...,...,...,...,...,...
335,336,122,14,30,2
336,337,143,16,30,2
337,338,151,18,30,2
338,339,157,20,30,2
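
To see the error from the commented-out `and` version without interrupting the notebook, it can be wrapped in `try`/`except`. This is a sketch on an invented `toy` DataFrame: `and` asks Python for a single truth value of the whole mask, and pandas refuses to guess one.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"chick": [1, 1, 2], "weight": [42, 51, 40]})

try:
    toy.loc[lambda df: (df["chick"] == 1) and (df["weight"] < 50)]
    message = None
except ValueError as err:
    # pandas explains that the truth value of a Series is ambiguous
    message = str(err)

print(message)
```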


Instead, use `&`, the bitwise AND operator, which combines the filters row by row by comparing the `True`/`False` values in each row.

In [54]:
(
    chickweight
    [["chick", "weight"]]
    .loc[lambda df: (df['chick'] == 1) & (df['weight'] < 50)]
)

Unnamed: 0,chick,weight
0,1,42


**The `&` (ampersand) only equates to <span style=color:green>True</span> if BOTH conditions are met:**

![](images/02_Selections_and_Filtering/filt-and.png)

In [None]:
(
    chickweight
    .loc[lambda df: (df['chick'] == 1) | (df['weight'] < 50)]
    .head(15)
)

**The `|` (pipe) equates to <span style=color:green>True</span> if AT LEAST ONE condition is met:**

![](images/02_Selections_and_Filtering/filt-or.png)

In [None]:
(
    chickweight
    .loc[lambda df: (df['chick'] == 1) ^ (df['weight'] < 50)]
    .head(15)
)

**The `^` (caret) equates to <span style=color:green>True</span> if EXACTLY ONE condition is met:**

![](images/02_Selections_and_Filtering/filt-hat.png)
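
The three operators can be compared side by side on two small hand-written masks (invented for illustration) that cover every combination of `True` and `False`:

```python
import pandas as pd

# Two hypothetical masks covering all four True/False combinations
a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

table = pd.DataFrame({
    "a": a,
    "b": b,
    "a & b": a & b,  # True only if both are True
    "a | b": a | b,  # True if at least one is True
    "a ^ b": a ^ b,  # True if exactly one is True
})
print(table)
```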

<a id='e-mult'></a>
## <mark> Exercise: Multiple conditions </mark>

Select only the part of chickweight where:

1. **weight** is above 50 but below 100.


In [None]:
(
    chickweight
    .loc[lambda df: (df['weight'] > 50) & (df['weight'] < 100)]
    .head()
)

Unnamed: 0,rownum,weight,time,chick,diet
1,2,51,2,1,1
2,3,59,4,1,1
3,4,64,6,1,1
4,5,76,8,1,1
5,6,93,10,1,1


2. **diet** is either 1 or 2.


In [61]:
(
    chickweight
    .loc[lambda df: (df['diet'] == 1) | (df['diet'] == 2)]
    .head()
)

Unnamed: 0,rownum,weight,time,chick,diet
0,1,42,0,1,1
1,2,51,2,1,1
2,3,59,4,1,1
3,4,64,6,1,1
4,5,76,8,1,1


3. For when weight is less than 60 and time is equal to 2, but only the weight and time columns!

In [75]:
(
    chickweight
    .loc[lambda df: (df['weight'] < 60) & (df['time'] == 2), ['weight','time']]
    .head(10)
)

Unnamed: 0,weight,time
1,51,2
13,49,2
25,39,2
37,49,2
49,42,2
61,49,2
73,49,2
85,50,2
96,51,2
108,44,2


**Answers**

In [76]:
# %load answers/02_Selections_and_Filtering/multiple_conditions.py
# 1 
(
    chickweight
    .loc[lambda df: (df['weight'] > 50) & (df['weight'] < 100)]
)
# or:
(
    chickweight
    .loc[lambda df: df['weight'].between(50, 100, inclusive="neither")]
)

# 2
(
    chickweight
    .loc[lambda df: (df['diet'] == 1) | (df['diet'] == 2)]
)

# or:
(
    chickweight
    .loc[lambda df: df['diet'].isin([1, 2])]
)

# 3

(
    chickweight
    .loc[lambda df: (df['weight'] < 60) & (df['time'] == 2), ['weight', 'time']]
)
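
The alternative answers can be checked against the explicit versions. The sketch below uses an invented `toy` DataFrame and assumes a pandas version in which `between()` accepts `inclusive="neither"` (1.3 or later).

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"weight": [50, 51, 99, 100, 120], "diet": [1, 2, 3, 4, 1]})

# between(..., inclusive="neither") excludes both end points, like > and <
explicit_range = (toy["weight"] > 50) & (toy["weight"] < 100)
with_between = toy["weight"].between(50, 100, inclusive="neither")
print(explicit_range.equals(with_between))  # True

# isin([...]) matches any of the listed values, like chaining == with |
explicit_or = (toy["diet"] == 1) | (toy["diet"] == 2)
with_isin = toy["diet"].isin([1, 2])
print(explicit_or.equals(with_isin))  # True
```

By default `between()` includes both end points, so the `inclusive` argument is what makes it match the strict `>` and `<` comparisons.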


### 🔍 Filtering Data in Pandas - Quick Summary

Filtering helps you **zoom in on the rows you care about**. Use `.loc[]` to select rows based on conditions.

- Use comparisons to filter:
  ```python
  my_data.loc[lambda df: df['column'] > 50]
  ```

- Combine multiple conditions with:
  - `&` for **AND**
  - `|` for **OR**
  - Wrap each condition in `()`
    
```python
my_data.loc[lambda df: (df['col1'] > 50) & (df['col2'] < 100)]
```


✅ Helps focus on the right data  
✅ Supports complex logic  
✅ Clean and readable with `lambda` + `.loc[]` combo

_Filtering is all about asking good questions and using code to get the right slice of data!_
