# Data Journalism Lesson 5: Filters

Learn how to narrow in on what's important and remove what isn't.

In [None]:
import pandas as pd
from IPython.display import display, HTML

def check_filter(result, expected):
    """Check whether a filtered or summarized DataFrame matches the expected result."""
    try:
        pd.testing.assert_frame_equal(result.reset_index(), expected.reset_index(), check_dtype=False, check_like=True, rtol=1e-3)
        display(HTML('<div style="background-color: #dff0d8; padding: 10px; border-radius: 5px;">' +
                 '<strong>Great work!</strong></div>'))
    except AssertionError as e:
        # print(e) # Uncomment for debugging
        display(HTML('<div style="background-color: #f2dede; padding: 10px; border-radius: 5px;">' +
                 '<strong>Not quite!</strong> Check your column, operator, and value.</div>'))

In [None]:
import pandas as pd

state = "Minnesota"

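# File names use lowercase letters and dashes for spaces (e.g. "north-dakota.csv")
df = pd.read_csv(f"../_static/rural-grants/{state.lower().replace(' ', '-')}.csv")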
total = df.investment_dollars.sum()
rows = df.shape[0]
cols = df.shape[1]
single_family = df[df.program_area == "Single Family Housing"]
single_family_entities = len(single_family)
single_family_investments = single_family.number_of_investments.sum()
single_multi_family = df[df["program_area"].isin(["Single Family Housing", "Multifamily Housing"])]
contains_housing = df[df['program_area'].str.contains("Housing", na=False)]

areas_of_interest = ["Single Family Housing", "Multifamily Housing", "Community Facilities"]
is_in_list = df[df['program_area'].isin(areas_of_interest)]

county_summary_expected = (df[df["program_area"].isin(areas_of_interest)]
                    .groupby("county")
                    .agg(total_investment_dollars=("investment_dollars", 'sum'))
                    .sort_values("total_investment_dollars", ascending=False)
                    )

top_county = county_summary_expected.index[0]
top_county_investment = county_summary_expected.iloc[0, 0]

investments_col_filtered_expected = df[["county", "program_area", "investment_dollars"]]
top_10_counties_expected = county_summary_expected.head(10)

In [None]:
from myst_nb import glue

# Glue variables for use elsewhere in the notebook
glue("state", state, display=False)
glue("investment_total", f"{total:,}", display=False)
glue("investment_rows", f"{rows:,}", display=False)
glue("investment_cols", f"{cols:,}", display=False)
glue("single_family_entities", f"{single_family_entities:,}", display=False)
glue("single_family_investments", f"{single_family_investments:,}", display=False)
glue("top_county", top_county, display=False)
glue("top_county_investment", f"{top_county_investment:,}", display=False)

## The Goal

In this lesson, you'll learn about filtering data, a crucial skill for focusing your analysis on specific subsets of information. By the end of this tutorial, you'll understand how to use boolean indexing in pandas to narrow down your DataFrame based on various criteria. You'll practice filtering with exact matches, partial matches using string methods, and filtering with lists. You'll also learn how to combine multiple filters and use column selection to simplify your data view. These skills will enable you to efficiently extract the most relevant information from large datasets, a key ability in data journalism.

## What is Data Journalism?

Chad Day's career has taken him from covering the night cops beat in Little Rock, Arkansas, to serving as chief elections analyst for the Associated Press, going from data analysis about crimes in a mid-South city to the most consequential event in American democracy. Between those two jobs, he was part of a team that won a Pulitzer Prize at the Wall Street Journal.

So I asked him: How often does he work with data and not use group by, summarize, arrange and the subject of this tutorial, filtering?

"I think almost never," he said. "I mean, I do it all the time, right? There's never a time where I don't use it, I should say."

Every data journalist I talked to said the same. 

There's an old expression in music that all you need is three chords and the truth to make a song. Data journalism is not that far off. I'd make the argument it's more like five chords, but those chords -- group by, summarize, arrange, filter and mutate -- are the foundation to nearly everything.

What is filtering? It focuses our data down to what we care about, what we're curious about, what we want to know. An example: Safe to say if we're working for a specific state's news organization, we should focus on that state's data. People care about where they live. They do not care about what's going on a thousand miles away. They *really* care about what's going on in their neighborhood, their city, their county and so on. With filtering, we can take a national dataset and focus it on where you live.

## The Basics

More often than not, we have more data than we want. Sometimes we need to be rid of that data. In `pandas`, there are two main ways to go about this: filtering rows and selecting columns.

**Filtering creates a subset of the data based on criteria (selecting rows)**. All records where the count is greater than 10. All records that match "Nebraska". Something like that. 

**Selecting simply returns only the columns named**. So if you only want to see School and Attendance, you select those columns. When you look at your data again, you'll have two columns. One caution: if you save the selection over the original DataFrame, the other columns are gone, and trying to use one of them later will raise an error.

Let's work on some examples using a dataset of rural investments made by the U.S. Department of Agriculture. If you aren't from a rural place or a state with a lot of rural areas, you may not know that a lot of small towns are shrinking. One of the reasons? A lack of housing in those places. There may be people or businesses that want to move to a small community but there aren't places for them to live. With this data, we can look at what the federal government is doing about that through the Department of Agriculture.
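To make the difference between the two concrete, here is a small sketch on made-up data. The `schools` DataFrame and its values are invented for illustration; they are not part of the USDA dataset used in this lesson.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
schools = pd.DataFrame({
    "School": ["Adams", "Baker", "Clark"],
    "Attendance": [8, 25, 14],
    "City": ["Omaha", "Lincoln", "Kearney"],
})

# Filtering: keep only the ROWS where Attendance is greater than 10
big_crowds = schools[schools["Attendance"] > 10]

# Selecting: keep only the named COLUMNS, saved under a new name
trimmed = schools[["School", "Attendance"]]

print(big_crowds)                 # two rows, all three columns
print(trimmed.columns.tolist())   # ['School', 'Attendance']

# The selection no longer has a City column; the original still does
print("City" in trimmed.columns)  # False
print("City" in schools.columns)  # True
```

Saving the selection under a new name, as above, is one way to avoid losing columns you might need later.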

What kind of money are we talking? In {glue:text}`state`, since 2019, the USDA has invested ${glue:text}`investment_total` in rural business development, housing, energy and other areas.

First we'll need pandas. Your first step is always loading libraries and you’ll need to run this step in nearly every single thing you do.

In [None]:
import pandas as pd

Now import the data. As with other datasets, you can swap in your state name if you'd like – lowercase letters, dashes for spaces.

In [None]:
investments = pd.read_csv('../_static/rural-grants/minnesota.csv')

First things first, let's look at investments in building houses in rural {glue:text}`state`. How do we go from {glue:text}`investment_rows` rows of investments, which include everything, down to just grants and loans for houses? We do that with filtering using boolean indexing, some logic that at first will seem a little weird, but you'll get used to it pretty quickly.

Filtering in pandas often involves creating a boolean Series (True/False values) based on a condition, and then using that Series to index the DataFrame.

A condition contains three parts: 
1. A column name (accessed like `df['column_name']` or `df.column_name`).
2. A comparison operator.
3. The value to compare against.
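The two steps described above (build a True/False Series, then use it to index) can be seen separately in a short sketch. The `grants` DataFrame is made up for illustration, and the example uses `>` (greater than) as its comparison operator.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
grants = pd.DataFrame({
    "recipient": ["A", "B", "C", "D"],
    "amount": [5, 10, 50, 7],
})

# Step 1: column + operator + value produces a boolean Series,
# one True/False per row
is_large = grants["amount"] > 9
print(is_large.tolist())  # [False, True, True, False]

# Step 2: putting that Series inside square brackets keeps
# only the rows marked True
large_grants = grants[is_large]
print(large_grants)
```

In practice the two steps are usually written on one line, as in `grants[grants["amount"] > 9]`, which is the pattern the exercises below use.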

Here are the common comparison operators in Python/pandas:

| Operator | Explanation              |
|----------|--------------------------|
| ==       | equal to                 |
| !=       | not equal to             |
| >        | greater than             |
| >=       | greater than or equal to |
| <        | less than                |
| <=       | less than or equal to    |

The tough one to remember is equal to. In conditional statements, equal to is `==` not `=`. If you haven’t noticed, `=` is a variable assignment operator, not a comparison. So equal is `==` and NOT equal is `!=`. You can also combine greater than and equal to. So, for instance, if you want all values that are 10 or greater, you can use `>= 10`. For whole numbers, that is the same thing as `> 9`: both will include 10 and everything greater. (With decimals they differ, because `> 9` also lets in 9.5.) But one is more clear – `>= 10` – and one is a trick. Always be clear.
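One caveat worth checking for yourself: `>= 10` and `> 9` only agree when every value is a whole number. A quick sketch with a made-up Series that includes a decimal shows where they part ways.

```python
import pandas as pd

# Hypothetical values, including one decimal between 9 and 10
scores = pd.Series([9, 9.5, 10, 11])

at_least_ten = scores[scores >= 10]
more_than_nine = scores[scores > 9]

print(at_least_ten.tolist())    # [10.0, 11.0]
print(more_than_nine.tolist())  # [9.5, 10.0, 11.0]
```

That is one more reason to write the condition you actually mean.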

What we want to do is look at all rows where the `program_area` column exactly matches "Single Family Housing".

First, let's use `.head()` to give us a peek at the data and column names.

In [None]:
investments.head()

Now that we can see the column names, the one we're looking for is `program_area`, which is a simple label for what the program is trying to do. What does that mean? For our example, Single Family Housing as a program area can be funded by several different programs, which are named in the next column after `program_area`. In other words, Single Family Housing can get funding from the Guaranteed Loans program, or the Repair Grants program. Labels like this are often a handy way for data analysts to focus down on a particular subject.

### Exercise 1: Building houses 

Filter the `investments` DataFrame to show only rows where the `program_area` is exactly "Single Family Housing". Remember the three parts: DataFrame and column, comparison operator, and the value.

In [None]:
investments_single_family = investments[investments[_____] _____ _____]
print(investments_single_family.head())

check_filter(investments_single_family, single_family)

There might be too many rows to show all the data, but the USDA gave {glue:text}`single_family_entities` entities money through grants or loans for single family housing investments in {glue:text}`state` since 2019. We have to be careful with how we word that, because each row is money going to a single entity. That single entity can subdivide that investment into smaller investments. If we look at the `number_of_investments` column, the number of investments made in single family housing in {glue:text}`state` grows to {glue:text}`single_family_investments`. A key lesson in data journalism – being very precise with your wording.
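The difference between counting rows and adding up a column is easy to see on a tiny example. The `loans` DataFrame below is invented; only the column name `number_of_investments` comes from the real data.

```python
import pandas as pd

# Hypothetical data: each row is one entity that received money
loans = pd.DataFrame({
    "recipient": ["Lender A", "Lender B", "Lender C"],
    "number_of_investments": [1, 12, 4],
})

# Counting rows answers "how many entities received money?"
entities = len(loans)

# Summing the column answers "how many investments were made?"
total_investments = loans["number_of_investments"].sum()

print(entities)           # 3
print(total_investments)  # 17
```

Both numbers are true, but they answer different questions, so the sentence you write has to match the number you use.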

### Exercise 2: Filter more than one thing, the standard way

If you look at the rural investments data, you'll notice that single family housing isn't the only kind of housing investment the USDA makes in rural areas. It also makes investments in multifamily housing -- apartments, duplexes and the like.

But how do you write a filter that captures both single family housing and multifamily housing? You can combine conditions using logical operators.

In Python and pandas:
*   `&` means AND (both conditions must be true)
*   `|` means OR (at least one condition must be true)
*   `~` means NOT (reverses the condition)

**Important:** When combining conditions, you MUST put parentheses `()` around each individual condition.
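Here is a minimal sketch of all three logical operators on made-up data (the `projects` DataFrame is hypothetical). The parentheses matter because `&` and `|` bind more tightly than `==` and `>` in Python, so without them Python tries to combine the wrong pieces first, which typically ends in an error.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
projects = pd.DataFrame({
    "kind": ["Housing", "Energy", "Housing", "Water"],
    "dollars": [100, 900, 700, 300],
})

# AND: housing projects that are also over 500 dollars
big_housing = projects[(projects["kind"] == "Housing") & (projects["dollars"] > 500)]

# OR: anything that is housing, or over 500 dollars, or both
housing_or_big = projects[(projects["kind"] == "Housing") | (projects["dollars"] > 500)]

# NOT: everything except housing
not_housing = projects[~(projects["kind"] == "Housing")]

print(big_housing)
print(housing_or_big)
print(not_housing)
```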

We want all investments that are "Single Family Housing" OR "Multifamily Housing". Filter the `investments` DataFrame accordingly.

In [None]:
investments_filtered_multi = investments[(investments["program_area"] == _____) | (investments["program_area"] == _____)]
print(investments_filtered_multi.head())

check_filter(investments_filtered_multi, single_multi_family)

As you can see, you get both program areas now, but notice how wordy that is? You have to repeat the entire filter condition every time you want to add something. So if we wanted to add another program area into this, we’d add another `|` and have to repeat that it’s `program_area` we’re looking in and so on. Why do that?

Because filters are flexible. Our first filter could be on one column of information, but the next one does not have to be. We could look for things in column A, column B and column C all at the same time. We can switch back and forth between OR and AND depending on what we need. Complex filters are a logic puzzle.
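One way to keep a complex filter readable is to give each condition a name and then combine the names. This is a sketch of that habit on invented data (the `records` DataFrame and its columns are hypothetical), mixing OR and AND across two columns.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
records = pd.DataFrame({
    "area": ["Housing", "Energy", "Water", "Housing", "Energy"],
    "county": ["Polk", "Polk", "Clay", "Clay", "Clay"],
    "dollars": [200, 50, 900, 30, 400],
})

# Condition on one column: Housing OR Energy
area_match = (records["area"] == "Housing") | (records["area"] == "Energy")

# Condition on a different column: at least 100 dollars
big_enough = records["dollars"] >= 100

# Combine them: (Housing OR Energy) AND at least 100 dollars
picked = records[area_match & big_enough]
print(picked)
```

Naming the pieces is a style choice, not a requirement; the same filter could be written on one line with parentheses around each condition.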

### Exercise 3: Filtering using string methods

Another way we can filter is using just the text in the column. We’ve been doing EXACT matches, meaning if the USDA misspells Single in one record, our search won’t find it. We’ll talk more about that problem later in the course. But note something about the two program areas we’re looking for – they both end with the word “Housing”. Couldn’t we use that to find both without having to do all that boolean business?

We can!

Pandas Series have a `.str` accessor that provides string processing methods. The `.str.contains()` method checks if a substring is present in the string in each element of the Series. It returns a boolean Series (True/False), just like our comparison operators did.
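A short sketch of `.str.contains()` on a made-up Series follows. It also shows two optional arguments that tend to come up with real data: `na=False`, which treats missing values as non-matches instead of leaving gaps in the True/False Series, and `case=False`, which ignores capitalization. The values here are illustrative, not a complete list of the USDA's program areas.

```python
import pandas as pd

# Hypothetical values, including one missing entry
areas = pd.Series([
    "Single Family Housing",
    "Multifamily Housing",
    "Electric Programs",
    None,
])

# na=False makes the missing value count as "does not contain"
has_housing = areas.str.contains("Housing", na=False)
print(has_housing.tolist())  # [True, True, False, False]

# case=False matches regardless of capitalization
has_housing_any_case = areas.str.contains("housing", case=False, na=False)
print(has_housing_any_case.tolist())  # [True, True, False, False]

# The boolean Series is used to filter exactly like before
print(areas[has_housing].tolist())
```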

Filter the `investments` DataFrame to find rows where the `program_area` column *contains* the word "Housing". 

In [None]:
investments_contains_housing = investments[investments['program_area'].str.contains(_____)]

print(investments_contains_housing.head())

check_filter(investments_contains_housing, contains_housing)

And, lo and behold, we get the same answer (assuming no other program areas contain "Housing"). Writing code is often a battle between writing code that is clear and easy to understand and writing only enough code to make it all work. Every organization that writes code has its preference. Mine is to be clear, but `.str.contains()` can be very useful.

### Exercise 4: Filtering with a list (`.isin()`)

What if the story we want to look at is how the USDA is reshaping communities in rural areas? What if there are three program areas we're interested in? Or four? 

We could just go back to our OR filter and add more conditions with `|`. But that's starting to get repetitive. What if we could give the filter a list of the things we wanted and it would give us rows matching anything in that list? 

Good news! Totally doable with the `.isin()` method.

We can create a Python list containing the values we want to match. Then, we use the `.isin()` method on the DataFrame column, passing our list as the argument. This returns a boolean Series which is True for rows where the column value is present in the list.
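As a sketch of the pattern on unrelated, made-up data (the `animals` DataFrame is hypothetical), `.isin()` can also be flipped with `~` to keep everything that is not in the list.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
animals = pd.DataFrame({
    "species": ["cat", "dog", "emu", "cat", "owl"],
    "count": [3, 5, 1, 2, 4],
})

# The list of values we want to match
wanted = ["cat", "owl"]

# Keep rows whose species appears in the list
in_list = animals[animals["species"].isin(wanted)]

# ~ reverses the condition: keep rows whose species is NOT in the list
not_in_list = animals[~animals["species"].isin(wanted)]

print(in_list)
print(not_in_list)
```

Adding a fourth or fifth value only means adding an item to the list; the filter line itself does not change.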

Create a list called `areas_of_interest` containing "Single Family Housing", "Multifamily Housing", and "Community Facilities". Then, filter the `investments` DataFrame to keep only rows where the `program_area` is in this list.

In [None]:
areas_of_interest = [_____, _____, _____]
investments_in_list = investments[investments['program_area'].isin(_____)]
    
print(investments_in_list.head())

check_filter(investments_in_list, is_in_list)

Okay, so we now have three different program areas. What now? 

What now is we can use what we learned in the previous tutorials (grouping and aggregation) and start adding and counting on this filtered data.
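As a reminder of how those pieces fit together, here is a sketch of a filter followed by grouping, aggregating and sorting, all in one chain. The `sales` DataFrame and every name in it are invented for illustration; the shape of the chain is what carries over.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "product": ["tea", "tea", "coffee", "tea", "juice", "coffee"],
    "revenue": [10, 40, 25, 5, 100, 20],
})

products_of_interest = ["tea", "coffee"]

region_summary = (
    sales[sales["product"].isin(products_of_interest)]  # 1. filter rows
    .groupby("region")                                  # 2. group
    .agg(total_revenue=("revenue", "sum"))              # 3. add up each group
    .sort_values("total_revenue", ascending=False)      # 4. biggest first
)

print(region_summary)
```

Note that the juice row, the single largest sale, never reaches the totals because the filter removes it first. The order of the steps changes the answer.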

### Exercise 5: How is the USDA impacting communities?

Your editor wants to know how much money the USDA has poured into {glue:text}`state` counties since 2019 through the three program areas we just filtered for. The clues are in the way that sentence is written – counties. We need a county column. How much money? We need a dollar figure column. And how much means we’re going to be adding all that money up.

It’s been a while since we looked at our data, and we need some new column names, so let’s do that again quick to find what we need:

In [None]:
investments.head()

Finding the county column should be pretty easy. Scroll right and you’ll find the dollar figure column. All we’re doing is adding the grouping, aggregation, and sorting to do what we did last time.

In [None]:
areas_of_interest = ["Single Family Housing", "Multifamily Housing", "Community Facilities"]

# Chain the operations: filter, groupby, aggregate, sort
county_summary = (investments[investments[_____].isin(_____)]
                    .groupby(_____)
                    .agg(total_investment_dollars=(_____, 'sum'))
                    .sort_values(_____, ascending=False)
                    )

print(county_summary.head()) # Show the top counties

check_filter(county_summary, county_summary_expected)

Looks like {glue:text}`top_county` county received the most investment, with ${glue:text}`top_county_investment`.00 going there since 2019 for housing and community facilities.

## Selecting columns to make it easier to read

Our data here has {glue:text}`investment_cols` columns. As datasets go, that's not a lot. Some have *thousands* of columns. What if you only want to see two? What if your editor has a severe case of undiagnosed ADHD and showing more than what is absolutely necessary can derail a story meeting?

Column selection to the rescue.

```{admonition} Key Concept
Filtering (like `df[condition]`, `.isin()`, `.str.contains()`) limits the number of **rows**, while column selection (like `df[['col1', 'col2']]`) limits the number of **columns**.
```

### Exercise 6: Select to simplify 

Using column selection is easy -- just provide a list of the column names you want to keep inside square brackets `[]`. What if we just wanted to see the county, the program area and the investment dollars columns?

In [None]:
investments_col_filtered = investments[[_____, _____, _____]]
print(investments_col_filtered.head())

check_filter(investments_col_filtered, investments_col_filtered_expected)

If you have truly massive data, pandas offers more advanced ways to select columns (e.g., using `.loc`, `.iloc`, `.filter()`, or selecting by data type with `.select_dtypes()`), but for most common cases, providing a list of names is sufficient.
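For the curious, here is a brief sketch of what those alternatives look like. The `table` DataFrame is made up for illustration, and none of this is needed for the exercises in this lesson.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
table = pd.DataFrame({
    "county": ["Polk", "Clay"],
    "loan_dollars": [100.0, 250.0],
    "grant_dollars": [40.0, 10.0],
    "year": [2020, 2021],
})

# .loc selects columns by name (the ":" means "all rows")
by_label = table.loc[:, ["county", "year"]]

# .iloc selects columns by position: here, the first two
by_position = table.iloc[:, 0:2]

# .filter(like=...) keeps columns whose name contains a piece of text
by_pattern = table.filter(like="dollars")

# .select_dtypes keeps columns of a given type, such as numbers
numeric_only = table.select_dtypes(include="number")

print(by_label.columns.tolist())      # ['county', 'year']
print(by_position.columns.tolist())   # ['county', 'loan_dollars']
print(by_pattern.columns.tolist())    # ['loan_dollars', 'grant_dollars']
print(numeric_only.columns.tolist())  # ['loan_dollars', 'grant_dollars', 'year']
```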

## Top list

One last little pandas trick that's nice to have in the toolbox is a shortcut for selecting only the top values from your dataset after sorting. Want to make a Top 10 List? Or Top 25? Or Top Whatever You Want? It's easy using `.head()` after sorting.

*(Note: R's `top_n` has slightly different behavior, especially with ties. In pandas, the standard approach is to sort and then take the head.)*
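If ties matter for your story, pandas also has a `.nlargest()` method whose `keep` argument controls what happens when values are tied at the cutoff. A small sketch on made-up totals (the county names and numbers are invented):

```python
import pandas as pd

# Hypothetical totals with a tie for second place
totals = pd.DataFrame(
    {"total": [50, 40, 40, 10]},
    index=["Polk", "Clay", "Lake", "Cook"],
)

# Sort-then-head always returns exactly 2 rows; which of the
# tied rows makes the cut depends on the sort
top_two_sorted = totals.sort_values("total", ascending=False).head(2)

# nlargest with the default keep="first" also returns exactly 2 rows,
# preferring the tied row that appears first in the data
top_two_largest = totals.nlargest(2, "total")

# keep="all" returns every row tied at the cutoff, so 3 rows here
top_two_with_ties = totals.nlargest(2, "total", keep="all")

print(top_two_largest)
print(top_two_with_ties)
```

A "Top 10" that silently drops an eleventh county with the same total is the kind of thing a reader from that county will notice, so it is worth checking for ties before publishing.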

### Exercise 7: Top N lists

So what are the top 10 counties for community investment based on the `county_summary` DataFrame we created in Exercise 5? 

Since `county_summary` is already sorted in descending order by `total_investment_dollars`, we can simply use the `.head()` method.

In [None]:
top_10_counties = county_summary._____(_____)    
print(top_10_counties)

check_filter(top_10_counties, top_10_counties_expected)

Editors love top 10 lists. Like catnip.

## The Recap

Throughout this lesson, you've learned how to use filters to focus on specific parts of your dataset using pandas. You've practiced filtering with exact matches (boolean indexing), using `.str.contains()` for partial matches, and filtering with lists using `.isin()`. You've also learned how to chain multiple operations together and use column selection (`df[['col1', ...]]`) to simplify your data view. Remember, filtering is a powerful tool that allows you to zoom in on the most relevant data for your story. You'll find these filtering techniques invaluable for uncovering specific trends and patterns within larger datasets.

## Terms to Know

- **Boolean Indexing**: Using a boolean Series (True/False) to select rows from a DataFrame (e.g., `df[df['col'] > 10]`).
- **Column Selection**: Choosing specific columns from a DataFrame (e.g., `df[['col_a', 'col_b']]`).
- **`.str.contains()`**: A pandas string method used for partial string matching within a Series.
- **Comparison Operators**: Symbols used in filtering (e.g., `==`, `!=`, `>`, `<`) to compare values.
- **Logical Operators**: Used to combine boolean conditions (`&` for AND, `|` for OR, `~` for NOT).
- **`.isin()`**: A pandas method used to check if values in a Series are present in a given list or set.
- **List**: A Python data structure used to hold an ordered sequence of items (`[]`).
- **Method Chaining**: Connecting multiple pandas operations together, often making code more readable.
- **`.head(N)`**: A pandas method used to select the first N rows of a DataFrame or Series.