# Data Journalism Lesson 3: Aggregates, Part 1

Learn how to take lots of little things and total them up into bigger things.

In [None]:
# Setup code for the notebook
import pandas as pd
from IPython.display import display, HTML

def check_groupby(result, expected):
    """
    Check if the result of a groupby operation matches the expected result.
    """
    # Check if the DataFrames are equal
    try: 
        pd.testing.assert_frame_equal(result, expected, check_dtype=False, check_like=True)
        display(HTML('<div style="background-color: #dff0d8; padding: 10px; border-radius: 5px;">' +
                 '<strong>Great work!</strong></div>'))
    except AssertionError:
        display(HTML('<div style="background-color: #f2dede; padding: 10px; border-radius: 5px;">' +
                 '<strong>Not quite!</strong> Make sure you are using the right dataframe and column names.</div>'))

def check_sorted(result, expected):
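    """
    Check if the result of a sort operation matches the expected result.
    """
    # Check if the DataFrames are equal, including row order.
    # check_like=False keeps the comparison sensitive to the order of rows and columns.
    try:
        pd.testing.assert_frame_equal(result, expected, check_dtype=False, check_like=False)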
        display(HTML('<div style="background-color: #dff0d8; padding: 10px; border-radius: 5px;">' +
                 '<strong>Great work!</strong></div>'))
    except AssertionError:
        display(HTML('<div style="background-color: #f2dede; padding: 10px; border-radius: 5px;">' +
                 '<strong>Not quite!</strong> Make sure you are using the right dataframe.</div>'))


In [None]:

import pandas as pd

state = "Minnesota"

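# File names are lowercase with dashes instead of spaces (for example, new-mexico.csv)
homes = pd.read_csv(f'../_static/nursing-homes/{state.lower().replace(" ", "-")}.csv')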

expected = homes.groupby("county_parish").count()
expected_sorted = expected.sort_values("cms_certification_number_ccn", ascending=False)
expected_multi_group = homes.groupby(["county_parish", "overall_rating"]).count().sort_values("cms_certification_number_ccn", ascending=False)

# Get county with most nursing homes
county_counts = homes['county_parish'].value_counts()
top_county = county_counts.index[0]
top_county_count = county_counts.iloc[0]

toprow = (homes
 .groupby(["county_parish", "overall_rating"])
 .count()
 .sort_values(by="cms_certification_number_ccn", ascending=False)
 .head(1).reset_index())
toprow_county = toprow.iloc[0, 0]
toprow_rating = toprow.iloc[0, 1]
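# Read the count from the column we sorted by, rather than relying on column position
toprow_count = toprow.loc[0, "cms_certification_number_ccn"]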


In [None]:
from myst_nb import glue

# Glue variables for use elsewhere in the notebook
glue("top_county", top_county, display=False)
glue("top_county_count", f"{top_county_count}", display=False)
glue("state", state, display=False)
glue("n_homes", f"{int(len(homes)):,}", display=False)
glue("toprow_county", toprow_county, display=False)
glue("toprow_rating", f"{toprow_rating}", display=False)
glue("toprow_count", f"{toprow_count}", display=False)

## The Goal

The goal of this lesson is to introduce you to one of the most fundamental and powerful concepts in data analysis: aggregation. You'll learn how to take large datasets with many individual records and summarize them into meaningful insights. We'll focus on using the `groupby()` and `count()` methods from pandas to group similar items together and calculate totals. By the end of this lesson, you'll be able to answer questions about the nursing homes in your state, but the techniques you learn can be applied to any subject or dataset.

## What is Data Journalism?

So far, in our philosophical discussion of data journalism, we’ve talked about how it’s interviewing data as a source, and one of your first steps is just knowing what is in your data. What does it look like? Now we’re going to take our first steps – a giant leap, really – into asking your data questions and getting answers. It’s one of the most powerful calculations you can do as a data journalist.

Count things.

No really.

“Ninety percent of what I do is counting things,” said MaryJo Webster of the Star Tribune. “And so, to me, you could kind of think back to when you were a child and you were playing with a set of blocks or maybe Legos and you counted out how many blue ones do I have, how many red ones do I have. That is kind of what I do every day.”

So what kind of stories can you do with that?

“How many car crashes there were and how many were in this county versus that county?” she said. “How many were this year versus that year? Very simple count. You’re just counting.”

It’s hard to explain to people who have never worked with data how important the simple act of counting is. Almost every data story you will ever work on will have a count in it – either one you will do or one done for you before you get the data. Almost every data story can be summarized into a few questions:

- How many of this thing are there where I live?
- How many of those things were there last year? Or five years ago?
- How does that many things compare to my state? The nation?
- Why is there that many or that few of these things?
- When I count up all the things, there aren’t any in this place. Why?

There are, of course, more questions to ask of data. And we’ll get to those. But a shocking amount of data journalism starts with a simple count. How many of these things are there compared to this other thing?

So let’s start there.

## The Basics

One of the most basic bits of data analysis is just simply taking a lot of things – hundreds, thousands, millions of things – and putting them together somehow. Rarely ever do we want just one number. I’ll give you an example: Let’s pretend for a second that we have Spotify’s data. We have every song streamed in a year. Billions of records. Do we want to just count them up? One number? Is your annual Spotify Wrapped just one number that’s the total number of streams everyone played in a year? No, of course not.

It’s streams – but put in groups. You, a user, are a group. How many songs did you play? How many times? What artists? What genres? And on and on. And with each group, we get a different number. An interesting number. A useful number.

So how do you put things together?

First, we need to load our libraries.

Run this.

In [None]:
import pandas as pd

Now, let’s use a dataset that may not be interesting to you now – because you’re young – but it’s a critical public policy issue in the United States: Nursing homes. Quite simply, there aren’t enough of them. And in some places, there aren’t any at all, in spite of there being old people to take care of in those places. The number of nursing homes in your state – and where they are in your state – is a major issue that directly affects families all around you.

For this exercise, simply run this, replacing `minnesota` with your state’s name in all lowercase letters and with any spaces replaced by a dash. Examples: `nebraska` and `new-mexico`.

In [None]:
# Import the nursing homes data
homes = pd.read_csv('../_static/nursing-homes/minnesota.csv')
homes.head()  # Show the first few rows to check it loaded
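
If you’d rather not retype the file name by hand, the lowercase-and-dashes rule can be sketched in a couple of lines. The function name `state_to_filename` is made up for this illustration and is not part of the lesson’s code.

```python
# Hypothetical helper for illustration; it is not part of the lesson's code.
def state_to_filename(state_name):
    """Turn a state name into the lowercase, dash-separated file name."""
    return state_name.strip().lower().replace(" ", "-") + ".csv"


print(state_to_filename("Nebraska"))
print(state_to_filename("New Mexico"))
```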

## Inspecting Data

Now we can inspect the data we imported. What does it look like? What’s in it? What do we have to work with?

To do that, we use `.head()` after the name of the variable we created above to show the headers and the first five rows of data.

In [None]:
# Show the first five rows
homes.head()

Let’s look at this. As you can see from the data, we have five nursing homes, which is what we expect from `head()`. But notice the first row – the headers. That is where most of the answers you need are going to come from. You can see things like the `provider_name` and the address. If you scroll to the right, you’ll find more data – like the phone number, which is useful for reporting purposes. You’ll see a column called `county_parish` that we’re going to use to find where these nursing homes are. You can keep scrolling right for a long time – there are 103 columns in the data. That might seem like a lot, but there are datasets with thousands of columns and millions of rows.
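
Scrolling is one way to see how big a table is. Another is to ask pandas directly. Here is a minimal sketch using a made-up three-row table (the column names come from the lesson, the rows are invented) that shows `.shape` for the row and column counts and `.columns` for the header names.

```python
import pandas as pd

# Made-up stand-in for the nursing homes data: real column names, invented rows.
toy_homes = pd.DataFrame({
    "provider_name": ["Home A", "Home B", "Home C"],
    "county_parish": ["Adams", "Adams", "Baker"],
    "overall_rating": [5, 3, 4],
})

# .shape is a (rows, columns) pair
n_rows, n_columns = toy_homes.shape

# .columns holds the header names
column_names = list(toy_homes.columns)

print(n_rows, n_columns)
print(column_names)
```

Run the same two attributes on `homes` and you should see the real row and column counts for your state.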

## Answering questions with code

There are {glue:text}`n_homes` nursing homes in {glue:text}`state` as of the latest data. But that doesn't tell us much. Let's explore more.

The secret to writing code is that much of it is a pattern. With pandas, this is _especially_ true.

To accomplish our goal, we start with the name of the data. Then, we use pandas' `groupby()` function, along with `count()` to do just what we’ve been talking about – take your data, put each row into groups. This thing here together, those things over there together. A massive amount of data analysis involves grouping like things together at some point.

It’s important to understand in your mind what `groupby()` and `count()` are doing. It might be easiest to think about it like a package of Skittles candy. Your data, when you first get it, is like a pack of Skittles – all mixed up. What `groupby` does is put the candy in little piles. Here’s a bag of Skittles dumped out.

```{figure} ../figures/ch3_i1.png
---
alt: skittles
align: center
---
Ungrouped data
```

Now, using `groupby`, we can put them into piles. With Skittles, we group them by color. With data, we could do this with any column of data – dates, locations, types, ratings and so on. The list is endless. But notice – all of those things are not numbers. With the exception of dates, they’re names or labels or text of some variety. Rarely ever will you group something by a number.

```{figure} ../figures/ch3_i2.png
---
alt: skittles
align: center
---
Grouped data
```
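
To make the piles concrete, here is a small sketch with an invented bag of six Skittles. The table and its column names (`candy_id`, `color`) are made up for illustration; only `groupby()` and `count()` come from the lesson.

```python
import pandas as pd

# Made-up data: one row per candy, all mixed up like the first picture.
skittles = pd.DataFrame({
    "candy_id": [1, 2, 3, 4, 5, 6],
    "color": ["red", "green", "red", "yellow", "green", "red"],
})

# Put the candies in piles by color, then count what is in each pile.
piles = skittles.groupby("color").count()

print(piles)
```

Each color becomes one row, and the `candy_id` column holds how many candies landed in that pile.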

We need to put something in the parentheses in `groupby()`. This time the something comes from that first row of our data. We are grouping data by one of the pieces of the data – a field, or column. If we’re trying to group by county or parish, which field or column in our data looks like it holds that information? Let’s use `head()` again and take a look at the very top row in bold.


```{admonition} Key Concept
In this lesson, what goes in `groupby` is a column name in quotes or, when we group by more than one thing, a list of column names.
```

To know what we’re going to use in `groupby`, we need to take another peek at our data. We’ll do this here so we have those column names to refer to in the code blocks coming up.

In [None]:
homes.head()

That block of code you just ran has two hints for the code block you’ll have to complete later: What data are you using? What column in that data do you want to group by? Keep those in mind.

### Exercise 1: Group by and count

After we group our data together by the thing we want to group it by, we need to count how many things are in each group. In pandas, we do that by adding `.count()` right after `groupby()`. A count is one kind of summary: for each group, it tells us how many values there are in each column.

Here’s the pattern. You fill in where there are blanks. What you fill in are the two hints from above.

In [None]:
homes_grouped = _____.groupby(_____).count()

check_groupby(homes_grouped, expected)

In this case, we wanted to group together locations, signified by the field name `county_parish`. After we group the data, we need to count them up, using `count()`.
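
One detail worth knowing about `count()`: it counts the values that are actually filled in, column by column, so a column with blanks can show a smaller number than its neighbors. The sketch below uses an invented four-row table with one missing rating to show the difference between `count()` and `size()`, which counts rows no matter what is missing.

```python
import pandas as pd

# Made-up data: Home B has no rating.
toy_homes = pd.DataFrame({
    "provider_name": ["Home A", "Home B", "Home C", "Home D"],
    "county_parish": ["Adams", "Adams", "Adams", "Baker"],
    "overall_rating": [5, None, 3, 4],
})

by_county = toy_homes.groupby("county_parish")

counts = by_county.count()  # filled-in values, column by column
sizes = by_county.size()    # rows per group, blanks included

print(counts)
print(sizes)
```

This is why it helps to read the count from a column you expect to be filled in for every row, such as an ID column.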

### Exercise 2: Arranging data

And when we run that, we get a list of counties with a count next to them. But it’s in alphabetical order. That doesn’t help us much. Usually we want to know where the most or the least are. So we’ll add another And Then Do This `.` and use `sort_values`. `sort_values` does what you think it does – it sorts data in order. By default, it’s in ascending order – smallest to largest. But if we want to know the county with the most homes, we need to sort it in descending order. The pattern looks like this:

In [None]:
homes_sorted = homes.groupby(_____).count().sort_values(_____, ascending=False)

check_sorted(homes_sorted, expected_sorted)

And when we run that, you’ll see that {glue:text}`top_county` has the most nursing homes with {glue:text}`top_county_count`. But let me guess – without knowing anything about your state, I bet you that {glue:text}`top_county` just happens to be the most populated in {glue:text}`state`. We’ll cover this again and again, but here’s your first warning: be careful that what you’re looking at isn’t just a population map. In this case, it’s just an example of a pattern of code we’ll use over and over again. It is not news that {glue:text}`top_county` has the most nursing homes.
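
The difference between the two sort directions can be sketched on a tiny table of invented county totals. The column name `homes` and the numbers are made up for illustration.

```python
import pandas as pd

# Made-up county totals, already grouped and counted.
county_counts = pd.DataFrame(
    {"homes": [4, 12, 7]},
    index=pd.Index(["Adams", "Baker", "Clark"], name="county_parish"),
)

# Default order is ascending: smallest first.
smallest_first = county_counts.sort_values("homes")

# ascending=False flips it: largest first.
largest_first = county_counts.sort_values("homes", ascending=False)

print(smallest_first)
print(largest_first)
```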

### Exercise 3: Grouping by more than one thing

We can, if we want, group by more than one thing. One of the most popular uses of this data is to compare nursing homes – the federal government has created a 5-star rating system, where 5 is the best and 1 is the worst. In the data you have, that rating is in the `overall_rating` column. So what if we looked at the ratings for each home in each county?

The hint here is each. Any time you hear an editor or a reporter or someone say each, it probably means there’s a `groupby` involved. Each county with each rating means there are two!

In [None]:
homes_grouped_by_rating = homes.groupby([_____, _____]).count().sort_values(_____, ascending=False)

check_groupby(homes_grouped_by_rating, expected_multi_group)

homes_grouped_by_rating.head()

Now there’s a ton to look at here. But what does it mean? If you haven’t looked at a lot of data before, it might not be immediately intuitive. Let’s take the first row. What that says is {glue:text}`toprow_county` had {glue:text}`toprow_count` homes with a {glue:text}`toprow_rating`-star rating. See how the columns are arranged? It’s the first column in our `groupby`, followed by the second, and then the counts we created with `count()`. Which is the order you come to them when you read the code. Logical, eh?

That first row might be interesting, it might not be. A few rows down, you might find a smaller county with a lot of 5-star homes for its size. Or the opposite – a smaller county with a lot of low-rated homes. Every state has economic and cultural reasons for just about everything that goes on there. Apply what you know about your state to that list, and stories will start popping into your head.
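
Reading a two-level grouping is easier on a table small enough to check by hand. This is a sketch with five invented homes; it also uses `reset_index()`, which turns the group labels back into ordinary columns so the top row reads left to right as county, rating, count.

```python
import pandas as pd

# Made-up data: five homes in two counties.
toy_homes = pd.DataFrame({
    "provider_name": ["Home A", "Home B", "Home C", "Home D", "Home E"],
    "county_parish": ["Adams", "Adams", "Adams", "Baker", "Baker"],
    "overall_rating": [5, 5, 3, 4, 2],
})

by_county_and_rating = (
    toy_homes
    .groupby(["county_parish", "overall_rating"])    # each county, each rating
    .count()
    .sort_values("provider_name", ascending=False)   # biggest group on top
)

# Turn the group labels back into regular columns.
flat = by_county_and_rating.reset_index()

print(flat.head(1))
```

In this toy table, the top row says Adams has two homes with a 5-star rating, which you can confirm by counting the rows above.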

:::{admonition} Common Mistake
:class: caution

Remember earlier when I said we’d rarely ever group by a number? Didn’t we just do that? Yes, but … Remember, a number is only a number if you would do math on it. Is a 5-star rating a number? Yes! And no! We will do math on it, but it’s also a label. A critical skill in data journalism is using logic and common sense, not rigid rules that something will always be this way or that.
:::

## The Recap

In this lesson, we've explored the fundamental process of data aggregation using pandas. We learned how to use `groupby()` to organize our data into meaningful categories, and then used `count()` to count how many records fall into each group. We also practiced arranging our results to highlight the most significant findings.

The most important bit to remember is that analyzing data in pandas is a pattern: `data.function().function().function()`.
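
To see that the chain is just several small steps glued together, here is a sketch on an invented three-row table that does the work one step at a time and then again as a single chain. Wrapping a chain in parentheses is a common Python habit that lets each step sit on its own line.

```python
import pandas as pd

# Made-up data for illustration.
toy_homes = pd.DataFrame({
    "provider_name": ["Home A", "Home B", "Home C"],
    "county_parish": ["Baker", "Adams", "Adams"],
})

# One step at a time, saving each intermediate result.
grouped = toy_homes.groupby("county_parish")
counted = grouped.count()
step_by_step = counted.sort_values("provider_name", ascending=False)

# The same work as one chain: data.function().function().function()
chained = (
    toy_homes
    .groupby("county_parish")
    .count()
    .sort_values("provider_name", ascending=False)
)

print(chained.equals(step_by_step))
```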

Remember, the power of this approach lies in its flexibility and scalability - the same basic pattern of grouping, summarizing, and arranging can be applied to datasets of any size and complexity. Whether you’re analyzing a single city’s crime data or billions of Spotify streams, these tools form the foundation of data analysis.

## Terms to Know

- **Aggregation**: The process of combining multiple data points into a single summary statistic or result.
- **groupby()**: A pandas function that separates data into groups based on one or more variables, allowing for subsequent operations to be performed on each group independently.
- **count()**: A method that counts the non-missing values in each column, group by group. For a column with no missing values, that is the number of rows in each group.
- **reset_index()**: Used to turn the group labels back into columns after grouping.
- **sort_values()**: Used to order rows in a dataset based on values in specified columns.
- **value_counts()**: A shortcut to count unique values in a column.
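- **Method chaining**: Stringing operations together with dots, so each method works on the result of the one before it: `data.function().function()`. It plays the role in pandas that a pipe operator plays in some other languages.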
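
As a quick illustration of the `value_counts()` shortcut from the list above, here is a sketch on four invented homes. It counts how often each county appears and, by default, puts the most common one first.

```python
import pandas as pd

# Made-up data for illustration.
toy_homes = pd.DataFrame({
    "provider_name": ["Home A", "Home B", "Home C", "Home D"],
    "county_parish": ["Adams", "Baker", "Adams", "Adams"],
})

# Count how many times each county appears, most common first.
county_counts = toy_homes["county_parish"].value_counts()

top_county = county_counts.index[0]       # label of the first row
top_county_count = county_counts.iloc[0]  # its count

print(top_county, top_county_count)
```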