# Filtering Dataframes

--------------------------------

Filtering is a fundamental operation in data analysis, allowing us to extract subsets of data based on specific conditions or criteria. Whether you're working with large datasets or small ones, filtering techniques are indispensable for exploring and analyzing your data effectively.

In this lecture, we'll explore various methods for filtering DataFrames with pandas, a powerful Python library for data manipulation and analysis. We'll cover how to build boolean masks from conditions, how to combine several conditions, and how to use the isin method, the str accessor and the between method.

By the end of this lecture, you'll have a solid understanding of how to apply filtering operations to your datasets, enabling you to extract valuable insights and make informed decisions from your data.

Let's dive in and discover the power of filtering DataFrames!

In [1]:
import pandas as pd

x = pd.Series([1, 4, 6, 2])
y = pd.Series([9, 2, 3, 2])

In [2]:
x < y

0     True
1    False
2    False
3    False
dtype: bool

In [None]:
x == y

In [None]:
x >= y

In [None]:
(x > 2) | (y == 9)

In [None]:
(x == 2) & (y == 2)

In [None]:
x.between(4, 6)

In [None]:
y.isin([2, 9])

In [None]:
x.isin(y)
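
The cells above have no stored output, so here is a small sketch that evaluates the same expressions on the same two Series and prints each result as a plain list. The extra `y.isin(x)` line is an addition for illustration: it shows that `isin` tests membership in the other Series' values and is not a position-by-position comparison.

```python
import pandas as pd

x = pd.Series([1, 4, 6, 2])
y = pd.Series([9, 2, 3, 2])

# Every expression returns a boolean Series with one entry per element
results = {
    "x == y": (x == y).tolist(),
    "x >= y": (x >= y).tolist(),
    "(x > 2) | (y == 9)": ((x > 2) | (y == 9)).tolist(),
    "(x == 2) & (y == 2)": ((x == 2) & (y == 2)).tolist(),
    "x.between(4, 6)": x.between(4, 6).tolist(),
    "y.isin([2, 9])": y.isin([2, 9]).tolist(),
    "x.isin(y)": x.isin(y).tolist(),
    # added for illustration: membership test, not element-wise equality
    "y.isin(x)": y.isin(x).tolist(),
}

for expression, values in results.items():
    print(f"{expression:22} -> {values}")
```

Comparing the `x == y` line with the `y.isin(x)` line makes the difference visible: the second element of `y` (2) is not equal to the second element of `x` (4), but 2 does occur somewhere in `x`.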

## Objectives

1. Filtering with a single logical operator
2. Filtering with multiple logical operators
3. Filtering with the **isin** method
4. Filtering with the **str** accessor
5. Filtering with the **between** method

In [11]:
import pandas as pd

In [12]:
penguins = pd.read_csv("./data/penguins_simple.csv", sep=";")

In [13]:
penguins

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE


## 1. Filtering with a single logical operator

In [14]:
# let's filter for species Adelie

# this is a boolean filter
penguins['Species'] == 'Adelie'

0       True
1       True
2       True
3       True
4       True
       ...  
328    False
329    False
330    False
331    False
332    False
Name: Species, Length: 333, dtype: bool

In [16]:
# we use the boolean mask to extract the subset of the dataframe we want:
# only the rows where the mask is True are kept

adelie = penguins[penguins['Species'] == 'Adelie']

adelie

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
...,...,...,...,...,...,...
141,Adelie,36.6,18.4,184.0,3475.0,FEMALE
142,Adelie,36.0,17.8,195.0,3450.0,FEMALE
143,Adelie,37.8,18.1,193.0,3750.0,MALE
144,Adelie,36.0,17.1,187.0,3700.0,FEMALE
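
To see the two steps (build the mask, then index with it) in isolation, here is a minimal sketch on a tiny DataFrame. The `sample` frame is an assumption for illustration: a few hand-typed rows taken from the outputs above, not the full CSV.

```python
import pandas as pd

# Hand-typed sample of rows from the outputs above (not the full CSV)
sample = pd.DataFrame(
    {
        "Species": ["Adelie", "Adelie", "Chinstrap", "Gentoo"],
        "Body Mass (g)": [3750.0, 3800.0, 4150.0, 4925.0],
    },
    index=[0, 1, 154, 328],
)

mask = sample["Species"] == "Adelie"  # boolean Series with the same index as sample
adelie_sample = sample[mask]          # keeps only the rows where mask is True

print(mask.tolist())
print(mask.sum())                     # True counts as 1, so this is the number of rows kept
print(adelie_sample.index.tolist())   # the original row labels are preserved
```

Note that the filtered frame keeps the original index labels; it is not renumbered from zero.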


#### We can also use other comparison operators, such as >, <, >=, <= and !=

In [18]:
# let's filter for all penguins that are heavier than 4000 g

body_mass_4000 = penguins[penguins['Body Mass (g)'] > 4000.0]

body_mass_4000

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
6,Adelie,39.2,19.6,195.0,4675.0,MALE
9,Adelie,34.6,21.1,198.0,4400.0,MALE
12,Adelie,42.5,20.7,197.0,4500.0,MALE
14,Adelie,46.0,21.5,194.0,4200.0,MALE
30,Adelie,39.2,21.1,196.0,4150.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE
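
A short sketch of two of the other operators, `!=` and `<=`, again on a hand-typed sample of rows from the outputs above (an assumption for illustration, not the full dataset). It also shows how `<=` and `<` differ at the boundary value.

```python
import pandas as pd

# Hand-typed sample of rows from the outputs above (not the full CSV)
sample = pd.DataFrame(
    {
        "Species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo"],
        "Body Mass (g)": [3750.0, 4675.0, 4050.0, 4000.0, 5750.0],
    },
    index=[0, 6, 159, 209, 330],
)

not_adelie = sample[sample["Species"] != "Adelie"]          # every species except Adelie
at_most_4000 = sample[sample["Body Mass (g)"] <= 4000.0]    # includes exactly 4000 g
below_4000 = sample[sample["Body Mass (g)"] < 4000.0]       # excludes exactly 4000 g

print(not_adelie.index.tolist())
print(at_most_4000.index.tolist())
print(below_4000.index.tolist())
```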


## 2. Filtering with multiple logical operators

- We can combine multiple logical conditions.
- The “&” sign stands for “and”, and the “|” sign stands for “or”.
- Each condition must be wrapped in its own parentheses.
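
Here is a minimal sketch of combining two conditions, and of why pandas needs `&` and `|` rather than Python's `and` and `or`. The `sample` frame is a hand-typed handful of rows for illustration, not the full penguins file.

```python
import pandas as pd

# Hand-typed sample of rows from the penguins outputs (not the full CSV)
sample = pd.DataFrame(
    {
        "Species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo"],
        "Body Mass (g)": [3750.0, 4675.0, 4150.0, 3650.0, 5750.0],
    },
    index=[0, 6, 154, 208, 330],
)

# Parentheses matter: & and | bind more tightly than >, < and != in Python
heavy = (sample["Body Mass (g)"] > 4000.0)
not_adelie = (sample["Species"] != "Adelie")

both = sample[heavy & not_adelie]     # rows where both conditions hold
either = sample[heavy | not_adelie]   # rows where at least one condition holds

print(both.index.tolist())
print(either.index.tolist())

# Python's `and` asks each Series for a single True/False, which pandas refuses
raised = False
try:
    heavy and not_adelie
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

Storing each condition in a named variable, as above, is one way to keep long filters readable.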

In [None]:
# let's filter for penguins that are heavier than 4000 g and are not Adelie

In [22]:
boolean_mask = ((penguins['Body Mass (g)'] > 4000.0) & (penguins['Species'] != 'Adelie'))

penguins[boolean_mask]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
154,Chinstrap,46.0,18.9,195.0,4150.0,FEMALE
159,Chinstrap,52.0,18.1,201.0,4050.0,MALE
161,Chinstrap,50.5,19.6,201.0,4050.0,MALE
165,Chinstrap,49.2,18.2,195.0,4400.0,MALE
171,Chinstrap,52.0,19.0,197.0,4150.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE


In [23]:
# we can also use the loc indexer here to apply the boolean mask

penguins.loc[boolean_mask]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
154,Chinstrap,46.0,18.9,195.0,4150.0,FEMALE
159,Chinstrap,52.0,18.1,201.0,4050.0,MALE
161,Chinstrap,50.5,19.6,201.0,4050.0,MALE
165,Chinstrap,49.2,18.2,195.0,4400.0,MALE
171,Chinstrap,52.0,19.0,197.0,4150.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE
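
One reason to reach for `loc` is that it can take a column selection next to the mask. A small sketch, assuming a hand-typed `sample` frame built from a few rows of the outputs above:

```python
import pandas as pd

# Hand-typed sample of rows from the outputs above (not the full CSV)
sample = pd.DataFrame(
    {
        "Species": ["Adelie", "Chinstrap", "Chinstrap", "Gentoo"],
        "Body Mass (g)": [3750.0, 4150.0, 4050.0, 4925.0],
        "Sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
    },
    index=[0, 154, 159, 328],
)

boolean_mask = (sample["Body Mass (g)"] > 4000.0) & (sample["Species"] != "Adelie")

rows_only = sample.loc[boolean_mask]                            # same rows as sample[boolean_mask]
rows_and_columns = sample.loc[boolean_mask, ["Species", "Sex"]]  # filter rows and pick columns

print(rows_only.equals(sample[boolean_mask]))
print(rows_and_columns)
```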


## 3. Filtering with the isin method

This is another useful method when we want to keep the rows whose value in a column matches any entry in a list of allowed values.

In [24]:
# let's filter for Chinstrap or Gentoo with body mass greater than 4000 g

# isin works on a single column, like a chain of ORs
penguins['Species'].isin(['Chinstrap', 'Gentoo'])

0      False
1      False
2      False
3      False
4      False
       ...  
328     True
329     True
330     True
331     True
332     True
Name: Species, Length: 333, dtype: bool

In [26]:
boolean_mask = (penguins['Species'].isin(['Chinstrap', 'Gentoo'])) & (penguins['Body Mass (g)'] > 4000.0)

penguins[boolean_mask]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
154,Chinstrap,46.0,18.9,195.0,4150.0,FEMALE
159,Chinstrap,52.0,18.1,201.0,4050.0,MALE
161,Chinstrap,50.5,19.6,201.0,4050.0,MALE
165,Chinstrap,49.2,18.2,195.0,4400.0,MALE
171,Chinstrap,52.0,19.0,197.0,4150.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE
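
To make the "chain of ORs" idea concrete, this sketch compares `isin` with the equivalent hand-written `|` expression. The `sample` frame is a hand-typed illustration, not the real dataset; the last line shows that `isin` is not limited to string columns.

```python
import pandas as pd

# Hand-typed sample for illustration (not the full CSV)
sample = pd.DataFrame(
    {
        "Species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"],
        "Body Mass (g)": [3750.0, 4150.0, 4925.0, 4675.0],
    }
)

with_isin = sample["Species"].isin(["Chinstrap", "Gentoo"])
with_or = (sample["Species"] == "Chinstrap") | (sample["Species"] == "Gentoo")

print(with_isin.tolist())
print(with_isin.equals(with_or))   # both masks select the same rows

# isin also works on numeric columns
exact_masses = sample["Body Mass (g)"].isin([3750.0, 4925.0])
print(exact_masses.tolist())
```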


In [28]:
# note that we can flip any condition with the 'not' operator (~)

# for example, let's filter for species that are neither Chinstrap nor Gentoo

boolean_mask = ~ penguins['Species'].isin(['Chinstrap', 'Gentoo'])

penguins[boolean_mask]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
...,...,...,...,...,...,...
141,Adelie,36.6,18.4,184.0,3475.0,FEMALE
142,Adelie,36.0,17.8,195.0,3450.0,FEMALE
143,Adelie,37.8,18.1,193.0,3750.0,MALE
144,Adelie,36.0,17.1,187.0,3700.0,FEMALE
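
A brief sketch of what `~` does to a mask, using a small hand-typed Series (an assumption for illustration). Negating an `isin` mask gives the same result as requiring the value to differ from every entry in the list.

```python
import pandas as pd

# Hand-typed species values for illustration
species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"])

in_list = species.isin(["Chinstrap", "Gentoo"])
not_in_list = ~in_list   # flips every True to False and every False to True

# The same mask written by hand: different from Chinstrap AND different from Gentoo
by_hand = (species != "Chinstrap") & (species != "Gentoo")

print(in_list.tolist())
print(not_in_list.tolist())
print(not_in_list.equals(by_hand))
```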


## 4. Filtering with the str accessor

We can use string methods, through the str accessor, to set conditions on string columns.

In [29]:
# let's filter for species starting with the letter 'A'


boolean_mask = penguins['Species'].str.startswith('A')

penguins[boolean_mask]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
...,...,...,...,...,...,...
141,Adelie,36.6,18.4,184.0,3475.0,FEMALE
142,Adelie,36.0,17.8,195.0,3450.0,FEMALE
143,Adelie,37.8,18.1,193.0,3750.0,MALE
144,Adelie,36.0,17.1,187.0,3700.0,FEMALE
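
The str accessor offers more than `startswith`. Below is a small sketch with `str.contains` and with the `na` argument, which decides what a missing value turns into. The Series and its missing entry are made up for illustration; nothing above indicates that the penguins file has missing species.

```python
import pandas as pd

# Hand-typed values; the None entry is added only to illustrate missing data
species = pd.Series(["Adelie", "Chinstrap", "Gentoo", None])

# na=False treats a missing value as "does not match", so the result is a clean
# True/False mask. Without it, the missing entry may stay missing in the result,
# depending on the pandas version and the column dtype.
starts_with_a = species.str.startswith("A", na=False)
contains_t = species.str.contains("t", na=False)

print(starts_with_a.tolist())
print(contains_t.tolist())
print(species[starts_with_a].tolist())
```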


## 5. Filtering with the between method

This method is similar in concept to the isin method, but it filters a numerical column on a range of values. Both endpoints are included by default.

In [30]:
# let's filter for penguins with a body mass between 3500 g and 4000 g

boolean_mask = penguins['Body Mass (g)'].between(3500, 4000)

penguins[boolean_mask]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
5,Adelie,38.9,17.8,181.0,3625.0,FEMALE
8,Adelie,38.6,21.2,191.0,3800.0,MALE
...,...,...,...,...,...,...
208,Chinstrap,45.7,17.0,195.0,3650.0,FEMALE
209,Chinstrap,55.8,19.8,207.0,4000.0,MALE
211,Chinstrap,49.6,18.2,193.0,3775.0,MALE
213,Chinstrap,50.2,18.7,198.0,3775.0,FEMALE
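
Row 209 above has a body mass of exactly 4000 g, which shows that `between` includes its endpoints by default. The sketch below checks this on a few hand-picked masses and compares it with the equivalent pair of comparisons. The string values of the `inclusive` argument assume a reasonably recent pandas (1.3 or later).

```python
import pandas as pd

# A few body masses picked by hand for illustration
mass = pd.Series([3450.0, 3750.0, 4000.0, 4150.0])

default_between = mass.between(3500, 4000)                   # both endpoints included
by_hand = (mass >= 3500) & (mass <= 4000)                    # the same condition spelled out
strict = mass.between(3500, 4000, inclusive="neither")       # both endpoints excluded

print(default_between.tolist())
print(default_between.equals(by_hand))
print(strict.tolist())
```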
