# <font color="green">Filtering Dataframes</font>

--------------------------------


## Introduction


Filtering is a fundamental operation in data analysis, allowing us to extract subsets of data based on specific conditions or criteria. Whether you're working with large datasets or small ones, filtering techniques are indispensable for exploring and analyzing your data effectively.

In this lecture, we'll explore various methods and techniques for filtering DataFrames using Pandas, a powerful library in Python for data manipulation and analysis. We'll cover how to filter rows with comparison operators, combine several conditions, match values against a list, a string pattern, or a numeric range, and detect missing values.

By the end of this lecture, you'll have a solid understanding of how to apply filtering operations to your datasets, enabling you to extract valuable insights and make informed decisions from your data.

Let's dive in and discover the power of filtering DataFrames!

-----------------------------------

#### Comparison Operators for DataFrame Filtering

Filtering data within pandas DataFrames can utilize a comprehensive set of comparison operators. These operators allow for flexible and powerful data selection based on specific criteria. Below is an overview of the primary comparison operators used in DataFrame filtering:

- **Equality**: `x == 100`
  - Checks if the values in column `x` are equal to 100.

- **Greater Than and Less Than**: `x > 2`, `y < x`
  - `x > 2` checks if the values in column `x` are greater than 2.
  - `y < x` compares two columns directly, checking if the values in column `y` are less than those in column `x`.

- **Not Equal**: `x != y`
  - Evaluates whether the values in column `x` are not equal to those in column `y`.

- **Value Inclusion**: `x.isin([...])`
  - Determines if the values in column `x` are within a specified list.

- **Value Range**: `x.between(a, b)`
  - Checks if the values in column `x` lie within a range between `a` and `b` (inclusive).

- **Check for NA**: `x.isna()`
  - Identifies missing or NA values in column `x`.

These operators are applied directly to the columns of the DataFrame to facilitate the filtering of data based on various conditions. This capability is instrumental in data analysis, allowing for the examination of subsets of data that meet specific criteria.
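Missing values deserve a short illustration of their own, since they behave differently from ordinary values under comparison. The sketch below uses a small made-up DataFrame (the column names and values are assumptions for illustration only) to show `isna()` used as a row filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "x" has one missing value
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10, 20, 30]})

missing_mask = df["x"].isna()      # True where x is missing
rows_missing = df[missing_mask]    # rows where x is missing
rows_present = df[~missing_mask]   # rows where x is present

# NaN never compares equal to anything, including itself,
# so `== np.nan` cannot be used to find missing values
equals_nan = df["x"] == np.nan
```

`df[df["x"].notna()]` gives the same rows as `rows_present` without the explicit negation.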


In [1]:
import pandas as pd

x = pd.Series([1, 4, 6, 2])
y = pd.Series([9, 2, 3, 2])

In [2]:
x < y

0     True
1    False
2    False
3    False
dtype: bool

In [3]:
x == y

0    False
1    False
2    False
3     True
dtype: bool

In [4]:
x >= y

0    False
1     True
2     True
3     True
dtype: bool

In [5]:
(x > 2) | (y == 9)

0     True
1     True
2     True
3    False
dtype: bool

In [6]:
(x == 2) & (y == 2)

0    False
1    False
2    False
3     True
dtype: bool

In [7]:
x.between(4, 6)

0    False
1     True
2     True
3    False
dtype: bool

In [8]:
y.isin([2, 9])

0     True
1     True
2    False
3     True
dtype: bool

In [9]:
x.isin(y)

0    False
1    False
2    False
3     True
dtype: bool
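With the values used above, `x.isin(y)` and `x == y` happen to produce the same result, but they test different things. A minimal sketch with two made-up Series (`a` and `b` are illustrative names, not part of the lecture data) makes the difference visible:

```python
import pandas as pd

a = pd.Series([2, 9, 5])
b = pd.Series([9, 2, 5])

elementwise = a == b      # compares position by position
membership = a.isin(b)    # asks whether each value of a appears anywhere in b
```

Here `elementwise` is only `True` in the last position, while `membership` is `True` everywhere because every value of `a` occurs somewhere in `b`.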


# Filtering Techniques in Pandas

Pandas provides a variety of methods for filtering data within DataFrames. These methods allow for the selection of data subsets based on specific conditions, enhancing the ability to perform detailed data analysis. Below, we explore several key techniques for filtering data in Pandas:

## 1. Filtering with Single Logical Operators

- **Usage**: Selects data based on a single condition.
- **Example**: `df[df['column_name'] > value]`
  - This filters the DataFrame to only include rows where the values in `column_name` are greater than a specified `value`.

## 2. Filtering with Multiple Logical Operators

- **Usage**: Combines multiple conditions for more complex filtering.
- **Operators**: `&` (AND), `|` (OR), `~` (NOT)
- **Example**: `df[(df['column1'] > value1) & (df['column2'] < value2)]`
  - Retrieves rows where `column1` values are greater than `value1` AND `column2` values are less than `value2`.
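- **Alternative: the `query` method**
    The same condition can be written as a string expression, using `and`, `or`, and `not` in place of `&`, `|`, and `~`. Python variables are referenced inside the string with an `@` prefix.
    
    `df.query('column1 > @value1 and column2 < @value2')`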
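As a rough sketch of how the boolean-mask form and the `query` form relate, the example below builds a small made-up DataFrame (the data and the thresholds `value1` and `value2` are assumptions) and applies the same condition both ways.

```python
import pandas as pd

# Hypothetical toy data using the placeholder column names from the text
df = pd.DataFrame({"column1": [1, 5, 8], "column2": [9, 2, 4]})
value1, value2 = 3, 5

# Boolean-mask form: each condition in parentheses, combined with &
mask_result = df[(df["column1"] > value1) & (df["column2"] < value2)]

# query form: column names are written bare, Python variables use @
query_result = df.query("column1 > @value1 and column2 < @value2")
```

Both forms select the same rows; which one to use is largely a matter of readability.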

## 3. Filtering with the **isin** Method

- **Usage**: Filters rows based on whether column values are in a specified set.
- **Example**: `df[df['column_name'].isin([value1, value2, value3])]`
  - Selects rows where the `column_name` contains any of the specified values (`value1`, `value2`, or `value3`).

## 4. Filtering Using the `str` Accessor

- **Usage**: Applies string methods to filter data based on string column values.
- **Example**: `df[df['column_name'].str.contains('substring')]`
  - Filters the DataFrame to include only rows where `column_name` contains a specified substring.
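One practical detail with `str.contains` is how missing values are handled: by default a missing entry yields a missing result, which cannot be used directly as a boolean mask. The sketch below, on made-up data, passes `na=False` so that missing entries are treated as non-matches, and also shows the `case` parameter.

```python
import pandas as pd

# Hypothetical toy data with one missing name
df = pd.DataFrame({"name": ["Adelie", "Gentoo", None, "Chinstrap"]})

# na=False treats missing entries as "does not contain"
contains_en = df["name"].str.contains("en", na=False)

# case=False makes the match case-insensitive
contains_adelie = df["name"].str.contains("ADELIE", case=False, na=False)

matching_rows = df[contains_en]
```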

## 5. Filtering with the **between** Method

- **Usage**: Selects rows based on whether column values fall within a specified range.
- **Example**: `df[df['column_name'].between(value1, value2)]`
  - Retrieves rows where the `column_name` values are between `value1` and `value2`, inclusive of both boundaries.
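To make the boundary behaviour concrete, here is a small sketch on a made-up Series. It assumes a pandas version (1.3 or later) in which the `inclusive` parameter accepts the strings `"both"`, `"neither"`, `"left"`, and `"right"`.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

both = s.between(2, 4)                          # default: both bounds included
neither = s.between(2, 4, inclusive="neither")  # both bounds excluded

# The default behaviour matches this pair of comparisons
manual = (s >= 2) & (s <= 4)
```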

Each of these filtering methods can be tailored to specific data analysis needs, allowing for the efficient examination of data subsets and the derivation of meaningful insights.


In [10]:
import pandas as pd

In [11]:
penguins = pd.read_csv("./data/penguins_simple.csv", sep=";")

In [12]:
penguins

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE


## 1. Filtering with a single logical operator

In [13]:
# let's filter for species Adelie

# this is a boolean filter
penguins['Species'] == 'Adelie'

0       True
1       True
2       True
3       True
4       True
       ...  
328    False
329    False
330    False
331    False
332    False
Name: Species, Length: 333, dtype: bool
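A boolean mask like this is itself a Series, so it can be summarised before it is used for filtering. The sketch below uses a short made-up Series rather than the penguins data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the Species column
species = pd.Series(["Adelie", "Adelie", "Gentoo"])

mask = species == "Adelie"

n_matches = int(mask.sum())   # True counts as 1, so the sum is the number of matches
share = mask.mean()           # fraction of rows that match
```

Checking the count first is a quick way to confirm that a condition matches a plausible number of rows.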

In [14]:
## we use the boolean filter to extract the subset of the dataframe we want, which includes all the rows that are True
## in the boolean mask

adelie = penguins[penguins['Species'] == 'Adelie']

adelie

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE
...,...,...,...,...,...,...
141,Adelie,36.6,18.4,184.0,3475.0,FEMALE
142,Adelie,36.0,17.8,195.0,3450.0,FEMALE
143,Adelie,37.8,18.1,193.0,3750.0,MALE
144,Adelie,36.0,17.1,187.0,3700.0,FEMALE


#### We can also use other comparison operators such as `>`, `<`, `>=`, `<=` and `!=`

In [15]:
# let's filter for all penguins that are heavier than 4000 g

body_mass_4000 = penguins[penguins['Body Mass (g)'] > 4000.0]

body_mass_4000

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
6,Adelie,39.2,19.6,195.0,4675.0,MALE
9,Adelie,34.6,21.1,198.0,4400.0,MALE
12,Adelie,42.5,20.7,197.0,4500.0,MALE
14,Adelie,46.0,21.5,194.0,4200.0,MALE
30,Adelie,39.2,21.1,196.0,4150.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE


## 2. Filtering with multiple logical operators

- We can combine multiple logical conditions.
- The `&` sign stands for "and", and the `|` sign stands for "or". Each condition must be wrapped in parentheses.
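The parentheses around each condition matter because `&` and `|` bind more tightly than comparison operators in Python. The following sketch, on a made-up DataFrame (column names and values are assumptions), shows the correct form and what happens when the parentheses are left out.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"mass": [3500, 4200, 5000], "flipper": [190, 195, 220]})

# Correct: each comparison is wrapped in parentheses
mask = (toy["mass"] > 4000) & (toy["flipper"] < 200)

# Without parentheses, & is evaluated before the comparisons, so Python
# computes `4000 & toy["flipper"]` first and the expression fails
try:
    toy["mass"] > 4000 & toy["flipper"] < 200
    unparenthesized_failed = False
except (TypeError, ValueError):
    unparenthesized_failed = True
```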

In [16]:
# let's filter for penguins heavier than 4000 g that are not Adelie

In [17]:
boolean_mask = ((penguins['Body Mass (g)'] > 4000.0) & (penguins['Species'] != 'Adelie'))

penguins[boolean_mask]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
154,Chinstrap,46.0,18.9,195.0,4150.0,FEMALE
159,Chinstrap,52.0,18.1,201.0,4050.0,MALE
161,Chinstrap,50.5,19.6,201.0,4050.0,MALE
165,Chinstrap,49.2,18.2,195.0,4400.0,MALE
171,Chinstrap,52.0,19.0,197.0,4150.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE


In [18]:
# we can also use the loc method here to apply the boolean mask

penguins.loc[boolean_mask]

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
154,Chinstrap,46.0,18.9,195.0,4150.0,FEMALE
159,Chinstrap,52.0,18.1,201.0,4050.0,MALE
161,Chinstrap,50.5,19.6,201.0,4050.0,MALE
165,Chinstrap,49.2,18.2,195.0,4400.0,MALE
171,Chinstrap,52.0,19.0,197.0,4150.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE
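One reason to prefer `loc` is that it accepts a column selection alongside the boolean mask. The sketch below uses a three-row made-up stand-in for the penguins data (the values are illustrative, not taken from the file) to filter rows and keep only two columns in one step.

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame
toy = pd.DataFrame({
    "Species": ["Adelie", "Chinstrap", "Gentoo"],
    "Body Mass (g)": [3750.0, 4150.0, 5750.0],
    "Sex": ["MALE", "FEMALE", "MALE"],
})

mask = (toy["Body Mass (g)"] > 4000.0) & (toy["Species"] != "Adelie")

# First argument selects rows, second argument selects columns
subset = toy.loc[mask, ["Species", "Body Mass (g)"]]
```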


Using the `query` method

In [21]:
penguins.query('`Body Mass (g)` > 4000.0 and Species != "Adelie"')

Unnamed: 0,Species,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
154,Chinstrap,46.0,18.9,195.0,4150.0,FEMALE
159,Chinstrap,52.0,18.1,201.0,4050.0,MALE
161,Chinstrap,50.5,19.6,201.0,4050.0,MALE
165,Chinstrap,49.2,18.2,195.0,4400.0,MALE
171,Chinstrap,52.0,19.0,197.0,4150.0,MALE
...,...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0,FEMALE
329,Gentoo,46.8,14.3,215.0,4850.0,FEMALE
330,Gentoo,50.4,15.7,222.0,5750.0,MALE
331,Gentoo,45.2,14.8,212.0,5200.0,FEMALE


## 3. Filtering with isin method

`isin` is useful when we want to keep the rows whose column value matches any entry in a list of values

In [None]:
# let's filter for Chinstrap or Gentoo with body mass greater than 4000 g


# isin works on a single column and acts like an OR across the listed values
penguins['Species'].isin(['Chinstrap', 'Gentoo'])

In [None]:
boolean_mask = (penguins['Species'].isin(['Chinstrap', 'Gentoo'])) & (penguins['Body Mass (g)'] > 4000.0)

penguins[boolean_mask]

In [None]:
# note that we can negate any condition with the 'not' operator (~)

# for example, let's filter for species that are neither Chinstrap nor Gentoo

boolean_mask = ~ penguins['Species'].isin(['Chinstrap', 'Gentoo'])

penguins[boolean_mask]
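As a quick check of what the negated `isin` means, the sketch below compares it with the equivalent chain of `!=` conditions on a short made-up Series (not the penguins data).

```python
import pandas as pd

# Hypothetical stand-in for the Species column
species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"])

# Negated membership test
not_in = ~species.isin(["Chinstrap", "Gentoo"])

# Same condition written as two inequalities joined with &
chained = (species != "Chinstrap") & (species != "Gentoo")
```

The `isin` form stays short as the list of excluded values grows, which is the main reason to prefer it.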

## 4. Filtering with the str accessor

we can use string methods to set conditions on string columns

In [None]:
# let's filter for species whose name starts with the letter 'A'


boolean_mask = penguins['Species'].str.startswith('A')

penguins[boolean_mask]

## 5. Filtering with the between method

This method is similar in concept to the `isin` method, but it filters numerical columns by a range rather than by a list of values.

In [None]:
# let's filter for penguins with a body mass between 3500 and 4000 g (both bounds inclusive)

boolean_mask =  penguins['Body Mass (g)'].between(3500, 4000)

penguins[boolean_mask]

# Recap: Mastering Data Filtering in Pandas

In today's lecture, we've explored a range of powerful techniques for filtering data within pandas DataFrames. These methods enable precise data selection and manipulation, essential for effective data analysis. Here's a brief recap of the key points covered:

## Key Filtering Techniques

1. **Single Logical Operators**: We started with the basics of using single logical operators to filter data based on single conditions.
2. **Multiple Logical Operators**: We then advanced to combining conditions using logical operators (`&`, `|`, `~`) for more complex filtering scenarios.
3. **The `isin` Method**: We explored the `isin` method for filtering rows based on a set of specified values, enhancing flexibility in data selection.
4. **String Filtering with `str` Accessor**: We discussed how to leverage string methods via the `str` accessor for filtering based on string patterns and substrings.
5. **Range Filtering with `between`**: Lastly, we looked at the `between` method for selecting data within a specified numeric range.

## Practical Applications

Throughout the lecture, we've applied these techniques to the penguins dataset, demonstrating their utility in uncovering insights and facilitating data-driven decision-making. From simple condition checks to complex criteria combinations, we've seen how pandas simplifies data analysis tasks.

## Conclusion

The ability to filter data effectively is a cornerstone of data analysis with pandas. By mastering these techniques, you can navigate through large datasets with ease, focusing on the information that matters most to your analysis. As you continue to work with data, remember that the flexibility and power of pandas come from its wide range of functionalities, including these filtering capabilities.


