# Data Journalism Lesson 14: Bar charts

The first step of visualizing data.

In [None]:
import warnings
from IPython.core.interactiveshell import InteractiveShell

# Keep hold of the real method
_orig_should_run = InteractiveShell.should_run_async

# Wrap it so that any DeprecationWarning it emits is silenced
def should_run_async(self, code, *args, **kwargs):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return _orig_should_run(self, code, *args, **kwargs)

# Apply the monkey‑patch
InteractiveShell.should_run_async = should_run_async

In [None]:
import micropip
await micropip.install('census')
await micropip.install('pyodide-http')
await micropip.install('plotly')
await micropip.install("nbformat>=4.2.0")

In [None]:
from IPython.display import display, HTML
import pandas as pd

# --- Simple Grading/Checking Functions ---
def display_feedback(correct, message_correct, message_incorrect):
    if correct:
        display(HTML(f'<div style="background-color: #dff0d8; padding: 10px; border-radius: 5px;"><strong>Correct!</strong> {message_correct}</div>'))
    else:
        display(HTML(f'<div style="background-color: #f2dede; padding: 10px; border-radius: 5px;"><strong>Not quite!</strong> {message_incorrect}</div>'))

def check_dataframe(inputted, expected):
    if inputted.equals(expected):
        display_feedback(True, 'Great job! The DataFrame is correct.', '')
    else:
        display_feedback(False, 'The DataFrame is not correct.', 'Please check your code and try again.')

In [None]:
# --- State Setup ---
state_abbr = 'MN'
state_fips = '27'
state_name = 'Minnesota'

county_lang_singular = 'county'
county_lang_plural = 'counties'

investments = pd.read_csv(f"../_static/rural-grants/{state_name.lower()}.csv")
investments_rows = len(investments)


# Exercise answers
ex1_expected = (
    investments
    .groupby(["county", "county_fips"])
    .agg(
        total_investments=('number_of_investments', 'sum'),
        total_dollars=("investment_dollars", 'sum')
    )
    .sort_values(by="total_dollars", ascending=False)
    .reset_index()
)

In [None]:
# Make the state variables available to the lesson text via glue
from myst_nb import glue

glue("state_abbr", state_abbr, display=False)
glue("state_name", state_name, display=False)
glue("state_fips", state_fips, display=False)
glue("county_singular", county_lang_singular, display=False)
glue("county_plural", county_lang_plural, display=False)
glue("investments_rows", investments_rows, display=False)

## The Goal

In this lesson, you'll learn how to create basic bar charts using Plotly Express, a powerful data visualization library in Python. By the end of this tutorial, you'll understand how to prepare data for visualization, create a simple bar chart, reorder bars for better readability, and flip coordinates to improve label visibility. You'll practice these skills using real-world data from the USDA and the Census Bureau, gaining practical experience in visualizing data for reporting purposes.

## Why Visualize Data?

The Allegory of the Cave, from Plato's Republic, is one of the most talked about bits of philosophy in all of human history. And, plain and simple, it's about how we all go through life as prisoners of ignorance, with an imperfect understanding of the world, as if we're looking at a shadow on a wall instead of the real thing. If we are lucky, we can be freed of this prison and look at the world in the light of enlightenment -- where instead of shadow, we see the object. But that process can be distressing, painful even. Some people may even prefer the shadows and their imperfect understanding over the truth. Others become obsessed with enlightenment to the detriment of all else.

In Plato's dialogue, he walks his conversation partner along, suggesting that it is normal to believe that once one is on the path to enlightenment -- once one has seen the world not through shadows -- the temptation will be to stay in the light, never to return to the cave with your fellow prisoners. But that's not the ideal, according to Plato. Where some would dedicate their lives to enlightenment, and others would prefer to live their days in the darkness of the cave, the ideal in a society is the one who ascends to enlightenment, only to return to the cave, to be with the prisoners there and help them seek their own enlightenment. For that person will understand what the person looking at the shadow sees, but will know the real truth.

This all comes from a work called The Republic, where Plato was talking about ideal leaders of the state. But I think this allegory says a lot about data journalism. 

Data journalists are prisoners, like everyone else. We are not special. We sit, watching the shadows with imperfect understanding. What maybe separates us is that we are drawn to enlightenment. We seek the light. Our tools of enlightenment? Data, of course, but that is not enough. Data itself is a form of shadow -- a representation of reality, not reality itself. True enlightenment -- in the journalistic sense -- requires reporting. It requires you to go into the world and talk to people. It requires you to seek more light. 

And then you must return to the people in the cave. 

> "Wherefore each of you, when his turn comes, must go down to the general underground abode, and get the habit of seeing in the dark. When you have acquired the habit, you will see ten thousand times better than the inhabitants of the den, and you will know what the several images are, and what they represent, because you have seen the beautiful and just and good in their truth. And thus our State, which is also yours, will be a reality, and not a dream only, and will be administered in a spirit unlike that of other States, in which men fight with one another about shadows only and are distracted in the struggle for power, which in their eyes is a great good."

One passage from the Republic, ascribed to Plato, haunts me. 

> "And must there not be some art which will effect conversion in the easiest and quickest manner; not implanting the faculty of sight, for that exists already, but has been turned in the wrong direction, and is looking away from the truth?"

After this, Plato goes on ... and never describes the art which he refers to here. He goes on to talk about intelligence and wisdom and how some virtues are able to be learned and practiced versus being innate, but he never comes back to this art.

Might I be so humble as to suggest one? Data visualization. 

To be clear, I am not going to argue that data visualization is the light in and of itself. Data is a form of shadow -- it is not the real world, but a reflection of it. Data visualization then, is a product built on top of shadow. We can't lose sight of this fact.

I will make this argument: data visualization is like an HD shadow. It's a 4K shadow. Is it the real world, bathed in light? No. It is a representation of the world. But it's far clearer than a shadow on a wall cast by a man walking by a fire. 

You might even call it "some art which will effect conversion in the easiest and quickest manner."

Data visualization, when done well, boils down the essence of a complex issue into shapes and colors that draw the eyes and, by extension, the mind, to something of interest. Similar to Plato's allegory, that process of seeing shape and color brings you to understanding. That process can be uncomfortable, just like the process of enlightenment. And showing good data visualization to a person in darkness is giving them much more clarity about their world than they had before. 

With this tutorial, you're going to start down the path of turning data into visuals. In my opinion, data journalism and data visualization are two disciplines that blur together so substantially that you shouldn't learn one without learning the other.

We're going to start today with bar charts. They're very basic -- some would even say boring -- but very effective. The most important part for you to learn, going forward, is not the code. The code, once you learn the pattern, is very easy and very repetitive. **It's not the most important part**. The most important part is *what does this form of visualization communicate*: What does it show? What kind of difference is it good at highlighting? What does it invite a reader to do with it? How much exploration does it encourage? And what kind? Each form of visualization -- the shapes, the colors, the choices -- emphasizes a different thing. Knowing what is good at what is *the* critical skill of the next dozen tutorials.

````{admonition} Key Concept
:class: info
Bar charts show magnitude -- how much something is in relation to another thing -- and invite comparison.
````

## The Basics

Data visualization has become such an important skill in many different industries that a whole constellation of tools has appeared trying to make building graphics out of your data easy. Some are good, some are terrible, some cost money, some are free. Each attempts to solve a problem in the others, and often creates new ones in the process. In short, there is no such thing as a perfect data visualization tool. The easier they are, the sooner you outgrow them. The more features they have, the harder they are to learn, to say nothing of mastering them. You want a tool that allows you to make graphics quickly, but gives you flexibility enough to customize your way into something you would publish.

Good news: one of the best libraries for visualizing data in Python is Plotly, and its high-level interface, Plotly Express (`px`), threads that needle. It makes charts with very little code, and if you know how to customize your charts, you can make graphics that would run in any publication in the world.

Let's revisit some data we've used in the past and turn it into bar charts. We're going to start by looking at community investments made by the USDA -- the same dataset we used in Chapter 5. Later, we're going to need some census data. So first, we need libraries: `pandas` for the data work, the `census` library for the population estimates and Plotly Express for the charts.

In [None]:
import pandas as pd
from census import Census
import plotly.express as px

Now let's get the investments data.

In [None]:
investments = pd.read_csv("../_static/rural-grants/minnesota.csv")

The universal truth of making graphics from data is that the work you do before you make the graphic will always take longer and be the bigger challenge than actually making the graphic. Often, I can spend hours on the data and minutes on the graphic. The problems you're going to have are going to come before you start making shapes.

With that warning in place, let's play pretend. One day, the USDA announces they're pouring a newsworthy amount of money into your {glue:text}`county_singular`. Millions of dollars. We looked at housing loans and grants in Chapter 5, so let's use that for our example. The USDA announced a multi-million dollar investment where you live to build housing. It's news. You and your editor sit down to talk about how to cover this, and your editor wonders aloud: How does this {glue:text}`county_singular` compare with others in {glue:text}`state_name` since 2019? A question like this can serve two purposes for you, the person who has to answer it. First, it's a good paragraph in your story -- it adds context to a news event. Second, you should immediately realize a paragraph that lists the top 10 {glue:text}`county_plural` by USDA investments is going to be boring and hard to read. This is an ideal place for a chart.

A key skill for data journalists to master is converting the question words people ask into code that answers that question. How does this {glue:text}`county_singular` compare to others? So immediately, we need {glue:text}`county_plural`, and since we have lots of investments, we're already looking at `groupby`. And because we're lumping {glue:text}`county_plural` together, we need an aggregation (`agg`). What makes the most sense here? Add up the dollars.

That gives us a pretty good template to work from.
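Before we touch the real data, here is a minimal sketch of that group-then-aggregate pattern on a tiny made-up table. The column names (`region`, `grants`, `dollars`) and the numbers are hypothetical, invented only to show the shape of the code; they are not from the USDA file.

```python
import pandas as pd

# Hypothetical toy data: three rows across two made-up regions
toy = pd.DataFrame({
    "region": ["North", "North", "South"],
    "grants": [1, 2, 4],
    "dollars": [100.0, 250.0, 75.0],
})

toy_totals = (
    toy
    .groupby("region")                      # lump rows together by region
    .agg(
        total_grants=("grants", "sum"),     # new column = (source column, function)
        total_dollars=("dollars", "sum"),
    )
    .sort_values(by="total_dollars", ascending=False)  # biggest total first
    .reset_index()                          # turn the group label back into a column
)

print(toy_totals)
```

The thing to notice is the order of operations: group, then summarize, then sort. The real exercise follows the same pattern with the real column names.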

Let's start by taking a peek at our data so we know what columns we're working with.

In [None]:
investments.head()

We've laid out what we need from our question words: we group by `county`, and since dollars are what we're comparing, we sum up `investment_dollars`. I'm going to add two things to make our lives easier later -- we'll add `county_fips` to the grouping, and we'll add up the number of investments as well. While we're at it, let's sort by our total dollars to see who got the most.

### Exercise 1: Getting our data together

In [None]:
totalinvestments = (
    investments
    .groupby([____, ____])
    .agg(
        total_investments=('number_of_investments', 'sum'),
        total_dollars=(____, 'sum')
    )
    .sort_values(by=____, ascending=False)
    .reset_index()
)

display(totalinvestments.head())

check_dataframe(totalinvestments, ex1_expected)

Let's look at what we have here. Did we get a list with a big number at the top and smaller numbers on down? Yep. But look closely at it. Using what you know about your state, are these the {glue:text}`county_plural` you would expect to see on a list of rural investments? Do you see several {glue:text}`county_plural` with fairly sizable populations? Do you see your most sparsely populated {glue:text}`county_plural` in the list at the top? Places where a single large investment would have major impact, vs places where most people wouldn't notice the same investment?

A question that should come with every comparison -- is it fair? Are the things we're comparing on the same footing? Is population a factor here? 

Answer ... yep. It would make sense that the USDA would spend more money in places with more people, and less in places with fewer people. In some states, the largest {glue:text}`county_singular` can have more than 200 times the population of the smallest. Texas may be the most extreme example. Harris County -- home of Houston -- has about 4.8 million people. Texas is also home to the smallest county in the nation by population -- Loving County, population 43. That means Harris County has roughly 112,000 times as many people as Loving County. The difference in any numbers comparing the two is going to be vast.

How do we solve for this? We put things on a population basis. That might be a simple percentage. It might be a rate -- per person, per 10,000 people, per 100,000 people. But to do that, we need to add population numbers.
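To see why a population basis changes the picture, here is a small sketch with two invented places. The names, dollar amounts and populations are hypothetical, picked only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical figures, not real counties
places = pd.DataFrame({
    "place": ["Big County", "Small County"],
    "dollars": [50_000_000, 2_000_000],
    "population": [1_000_000, 5_000],
})

# "Per" means division: dollars per person = dollars / people
places["dollars_per_person"] = places["dollars"] / places["population"]

# A rate per 10,000 people is the per-person figure scaled up
places["dollars_per_10k"] = places["dollars_per_person"] * 10_000

print(places)
```

Big County got 25 times the money, but Small County got eight times as much per person. Same data, very different story.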

Thankfully, you learned how to grab population estimates in the previous chapter, so we're going to get the 2023 estimates for your state using the `census` library. Remember -- here you use the FIPS code of your state to get that data. To speed matters along, we're going to filter it down to just population estimates right after import.

### Exercise 2: Get Census Population Data

First, you need a Census API key. Sign up for one [here](https://api.census.gov/data/key_signup.html) if you haven't already.

Then, create a `Census` object, replacing `"YOUR_API_KEY"` with your actual key.

Finally, use the `c.acs5.state_county()` method to get the 2023 ACS 5-year estimate for total population (`B01003_001E`) for all counties (`Census.ALL`) in your state (`state_fips`). Convert the resulting list of dictionaries into a pandas DataFrame called `estimates23_raw`.

In [None]:
c = Census("YOUR_API_KEY")
# Fetch 2023 ACS5 population estimates
estimates23_raw = c.acs5.state_county(
    fields=('B01003_001E',), # Total population variable
    state_fips=____, # 27 for Minnesota
    county_fips=____, # Use Census.ALL for all counties
    year=____ # Specify the year 2023
)
estimates23 = pd.DataFrame(estimates23_raw)
# Rename columns for clarity and consistency
estimates23 = estimates23.rename(columns={'B01003_001E': 'value', 'county': 'county_fips'})
estimates23["county_fips"] = estimates23["state"] + estimates23["county_fips"]
# Convert value to numeric, coercing errors
estimates23['value'] = pd.to_numeric(estimates23['value'], errors='coerce')
display(estimates23.head())

Let's take a quick peek at our estimates data to help us with some column names.

In [None]:
display(estimates23.head())

As we can see, we have estimates in the `value` column. The other column I want you to pay attention to is `county_fips`, which is the Census Bureau's way of identifying counties using a federal government number called a FIPS number (for Federal Information Processing Standards). It's really two numbers combined -- a state code and a county code. Under FIPS, every state has a FIPS number and then every county has a number within that state.
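Here is a minimal sketch of how the two pieces of a FIPS code fit together. The helper `full_fips` is a hypothetical name I made up for illustration, and it assumes the codes arrive as text, which is what the concatenation in Exercise 2 relies on.

```python
def full_fips(state_code: str, county_code: str) -> str:
    """Combine a state code and a county code into a five-digit FIPS string.

    Hypothetical helper for illustration: pads the state part to two digits
    and the county part to three, so leading zeros are preserved.
    """
    return state_code.zfill(2) + county_code.zfill(3)


# Hennepin County, Minnesota: state 27, county 053
hennepin = full_fips("27", "053")

# Leading zeros matter: a state code of 1 has to become "01"
padded = full_fips("1", "1")

print(hennepin, padded)
```

The reason to keep these as text rather than numbers is the leading zero. Stored as an integer, `01001` turns into `1001`, and then it no longer matches anything.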

Recall that with our USDA data (`totalinvestments`), we also have a column called `county_fips`. Guess what? That number is identical to the one in the census data -- meaning we have our keys to join these two datasets together.
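One caution before joining: matching values are not enough if the two columns have different types. I'm assuming here, without having checked the CSV, that a FIPS column read from a file could come in as integers while the census codes are text. If that happens, a sketch like this (with invented toy tables) shows one way to line the keys up and check for rows that didn't match.

```python
import pandas as pd

# Hypothetical toy tables: the left key was read as integers, the right as text
left = pd.DataFrame({"county_fips": [27001, 27003], "total_dollars": [350.0, 75.0]})
right = pd.DataFrame({"county_fips": ["27001", "27003", "27005"], "value": [1000, 500, 250]})

# Convert the integer key to five-character text so both sides agree
left["county_fips"] = left["county_fips"].astype(str).str.zfill(5)

# An outer join with indicator=True labels where each row came from
checked = pd.merge(left, right, on="county_fips", how="outer", indicator=True)

# Anything not marked "both" failed to find a partner
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

If `unmatched` is empty, an inner join won't silently drop anything. If it isn't, you have reporting to do before you chart.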

### Exercise 3: Joining together

Use pandas' `merge` function to perform an inner join between `totalinvestments` and `estimates23`. Join them `on` the `county_fips` column. Store the result in a new DataFrame called `merged_data`.

In [None]:
merged_data = pd.merge(
    ____, # Left DataFrame
    ____, # Right DataFrame
    on=____, # Column to join on
    how='inner' # Type of join
)
display(merged_data.head())

And just like that, you can see our investments data joined together with our population data. We have `total_dollars` and `value` (population), which is what we need to make a dollars per person metric. Per is another word you should immediately map to code. Whenever you hear the word per, you should think division. Dollars per person means number of dollars divided by number of people. 

### Exercise 4: Making dollars per person

In [None]:
merged_data = merged_data.assign(
    investment_dollars_per_person = merged_data[____] / merged_data[____]
).sort_values(by=____, ascending=False)
display(merged_data.head())

Compare this list to your other one (the one sorted by total dollars). Are the same names at the top? My guess? No. Your top {glue:text}`county_plural` are vastly different. Large places are down, small places are up, and what you have here is a measure of the impact these investments have.

Two last things before we move on to actually making a chart: First, you have a limited amount of space in a chart. Some charts can cram a *lot* of data into a small place. Bar charts are not one of them. If you want to read the labels, you need to limit the number of rows of data you have. Next: we need to save this all to a new dataframe, one we can plug into our chart. 

### Exercise 5: Limiting and saving

Remember all the way back to filters, when we learned about selecting top rows. Pandas has a method called `nlargest` which gives us the top N rows based on a column's values. With bar charts, 10 is good, 15 is probably pushing it, and 20 is almost certainly too many. For reasons that will become obvious later, let's push it and go with 15.
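As a quick refresher, here is a small sketch of `nlargest` on invented data. The table and its `name` and `rate` columns are hypothetical. It also shows that, when there are no ties, `nlargest` gives the same rows as sorting in descending order and taking the head.

```python
import pandas as pd

# Hypothetical toy data with no tied values
scores = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "rate": [3.0, 9.0, 1.0, 7.0, 5.0],
})

# Top 3 rows by rate, returned largest first
top3 = scores.nlargest(3, "rate")

# The longhand equivalent: sort descending, then keep the first 3
same = scores.sort_values(by="rate", ascending=False).head(3)

print(top3)
```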

Select the top 15 rows from `merged_data` based on the `investment_dollars_per_person` column using `nlargest`. Store the result in a new DataFrame called `top15`.

In [None]:
top15 = merged_data.nlargest(____, ____) # Get the top 15 rows based on the per-person column
display(top15)

At long last, we're ready to make bar charts.

## Bar charts

The simple bar chart is a chart designed to show differences between things -- the magnitude of one compared to the next and the next and the next. So if we have a thing, like a county, or a state, or a group name, and then a count or a number attached to the group, we can make a bar chart.

Fortunately for us, we have `county` and `investment_dollars_per_person` in our `top15` data. Seems we have what we need.

The library we'll use is Plotly Express, typically imported as `px`. Plotly Express allows us to create figures with very concise code. It follows a pattern where you specify the DataFrame, the columns to map to visual elements (like x-axis, y-axis, color), and other options.

To create a bar chart, we use `px.bar()`. We need to tell it:
*   `data_frame`: The DataFrame containing the data (`top15`).
*   `x`: The column for the categories (our `county` names).
*   `y`: The column for the values that determine the bar height (`investment_dollars_per_person`).

### Exercise 6: Your first bar chart

Use `px.bar()` to create a bar chart. Pass `top15` as the `data_frame`. Set the `x` axis to the `county` column and the `y` axis to the `investment_dollars_per_person` column. Display the figure using `.show()`.

In [None]:
fig = px.bar(
    data_frame=____,
    x=____,
    y=____
)
fig.show()

The bars look good, but can you read the x-axis labels? I can't. And the order of the bars deserves a closer look, too.

We'll start with the bars. We want them ordered from smallest to largest (or largest to smallest) based on the `investment_dollars_per_person`. The easiest way to control this with Plotly Express is usually to sort the *DataFrame* before plotting, because by default Plotly draws the categories in the order they appear in the data.

Our `top15` DataFrame came out of the `nlargest` call sorted from largest to smallest, so plotting it directly gives us bars in descending order. Don't rely on that by accident, though: if the rows reach `px.bar` in some other order, the bars will follow. To be explicit, let's sort `top15` by `investment_dollars_per_person` (ascending this time, so the smallest bar is first when plotted vertically) before passing it to `px.bar`.

### Exercise 7: Reordering

First, create a new DataFrame `top15_sorted` by sorting `top15` by the `investment_dollars_per_person` column in *ascending* order.

Then, create the bar chart again using `px.bar`, but this time pass `top15_sorted` as the `data_frame`. Keep `x='county'` and `y='investment_dollars_per_person'`. Display the figure.

In [None]:
top15_sorted = top15.sort_values(by=____, ascending=____)

fig_reordered = px.bar(
    data_frame=____,
    x=____,
    y=____ 
)
fig_reordered.show()

Better, but the labels on the bottom are still crowded and hard to read. We can fix that by flipping the coordinates, making it a horizontal bar chart. In Plotly Express, we do this by setting the `orientation` argument to `'h'` and swapping the `x` and `y` assignments. The category (`county`) goes on the y-axis, and the value (`investment_dollars_per_person`) goes on the x-axis.

### Exercise 8: Flipping

Create the bar chart again using `px.bar` with the `top15_sorted` DataFrame.
This time:
*   Set `x` to `investment_dollars_per_person`.
*   Set `y` to `county`.
*   Set `orientation` to `'h'`.

Display the figure.

In [None]:
fig_flipped = px.bar(
    data_frame=____,
    x=____,
    y=____,
    orientation=____
)
fig_flipped.show()

Art? No. Tells you the story? Yep. And for now, that's enough. This chart could help us write our story. You should show it to your editor to talk about what else this story could be about rather than just a simple one about a single investment. Would you publish this? Not like this you wouldn't. You need more, and we'll do more later.

## The Recap

Throughout this lesson, you've learned the fundamentals of creating bar charts with Plotly Express. You've practiced preparing data by grouping and aggregating with pandas, fetching census data using the `census` library, joining datasets, calculating per capita rates, and selecting top entries. You then created a basic bar chart, sorted the bars based on values by sorting the underlying DataFrame, and flipped coordinates for better readability using `px.bar` arguments. Remember, while these charts may not be publication-ready visualizations, they serve as valuable tools for quickly understanding and reporting on data trends.

## Terms to Know

- **Plotly Express (`px`)**: A high-level interface for the Plotly Python graphing library, designed for creating figures easily and quickly.
- **`px.bar()`**: A Plotly Express function used to create bar charts.
- **`data_frame`**: Argument in Plotly Express functions specifying the pandas DataFrame to use.
- **`x`, `y`**: Arguments in Plotly Express functions mapping DataFrame columns to the x and y axes.
- **`orientation`**: Argument in `px.bar` (and other functions) to set the orientation, e.g., `'h'` for horizontal.
- **`sort_values()`**: A pandas DataFrame method used to sort rows based on column values.
- **`nlargest()`**: A pandas DataFrame method used to select the top n rows based on the values in specified columns.
- **`assign()`**: A pandas DataFrame method used to create new columns.
- **`pd.merge()`**: A pandas function for joining DataFrames based on common columns or indices.
- **`groupby().agg()`**: A pandas pattern for grouping data and calculating aggregate statistics.
- **`census`**: A Python library for accessing U.S. Census Bureau data via their API.