# Data Journalism Lesson 13: Getting census data from an API

Learn how to access and work with U.S. Census data using the census library in Python.

In [None]:
import warnings
from IPython.core.interactiveshell import InteractiveShell

# Keep hold of the real method
_orig_should_run = InteractiveShell.should_run_async

# Wrap it so that any DeprecationWarning it emits is silenced
def should_run_async(self, code, *args, **kwargs):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return _orig_should_run(self, code, *args, **kwargs)

# Apply the monkey‑patch
InteractiveShell.should_run_async = should_run_async

In [None]:
import micropip
await micropip.install('census')
await micropip.install("pyodide-http")
from census import Census # Import the census library

In [None]:
# Setup code for the notebook
import os
import pandas as pd
from IPython.display import display, HTML
from pathlib import Path

In [None]:
from myst_nb import glue

# Glue variables for use in markdown
glue("state_abbr", "MN", display=False)
glue("state_name", "Minnesota", display=False)
glue("state_fips", 27, display=False)

## The Goal

In this lesson, you'll learn how to access and work with U.S. Census data using the `census` library in Python. By the end of this tutorial, you'll understand how to retrieve population counts from the American Community Survey (ACS) via the Census API, manipulate the data into a usable format using pandas, and explore how to fetch specific variables. You'll practice important data wrangling skills like converting API results to DataFrames, concatenating rows, and pivoting data, while gaining insight into the structure of census datasets accessed through this library. This knowledge will equip you to incorporate authoritative demographic data into your data journalism projects efficiently.

## What is Data Journalism?

Cathy Wos, a former research librarian at the Tampa Bay Times, used to start every meeting we had about Census 2000 coverage with this: "The census is neither timely nor accurate. Discuss."

We'd have a giggle and would move on to what we needed to talk about. But she was 100 percent right. And also missing the point (intentionally, in her defense).

But wait, if something is neither timely nor accurate, what possibly could journalists ever want to do with it?

Let's first explore this notion. If you're unfamiliar, the census is mandated by the Constitution of the United States (Article I, Section 2, Clause 3 for the real nerds). The Constitution plainly says, and courts have reaffirmed even in the face of much better technology and methods, that the federal government will count every single person in the country every 10 years. The term of art is "actual enumeration," two words that have spawned a *lot* of argument over the decades.

Why is it there? And why were the framers so concerned with accuracy that they mandated an actual count of people? Because that's how representation in the House of Representatives gets determined. More people in your state? More representatives to Congress for you. An enormous amount of political power is determined by the census, but it goes far beyond that. The billions of dollars of federal spending that happens every year? A healthy chunk of it is determined by populations and demographics of that population. How does the federal government know where to send it? You guessed it -- the U.S. Census.

So why isn't it timely or accurate? Don't overthink it, because the answer is very simple: It takes a long time to count that many people.

In 2020, every household in America received a census form in the mail. If you're really thinking critically, you can stop right here and point out an obvious point of error. What about people who don't have a home? What about people between homes? What about chronically homeless people? What about people who move a lot and wouldn't particularly like to talk to a federal worker asking questions, like undocumented farm workers? The census says nothing about citizenship, so the Census Bureau has to try and count *everyone*.

Everyone got that form in March 2020. They were supposed to fill it out as it would be true on Census Day, which is officially April 1, and send it back. So on April 1, how many people are in your house? How old are they? What race are they? What ethnicity are they? And so on.

Suppose for a minute that on April 1, 2020, between mail-in forms and census takers going door to door to follow up, the census is 100 percent accurate on that day. That has never once happened in history, and never will, but let's pretend for a second. What happens on April 2?

Life moves on. Babies are born. People die. People move. People fall in love. Criminals go to prison. People graduate from college, get jobs, buy houses -- American dream type stuff. It happens every single day. And the further away from April 1 you get, the more it happens.

How can this possibly be useful?

The truth is, in the aggregate, things don't change that fast. Individual lives change every day. Populations change slowly. People are born, people die, and the median age of a city stays roughly the same, or changes very slowly over a long period of time. Two people fall in love and go to the courthouse to get married in a whirlwind of feel-good hormones. In that same courthouse, two other people are getting divorced with a very, very different set of neuro-chemicals. Broadly speaking, the balance of married vs. unmarried households hasn't changed that day, and over time it changes slowly.

And, as such, the census remains the best look at demographics we're going to get. The data is clean, rigorously checked, ridiculously documented, widely used and ... completely free. For data journalists, it's a foundational skill -- any time you want a rate instead of a number, there's a good chance the base of that rate is going to come from the Census Bureau.

One problem? The Census Bureau doesn't go dark after pumping out the decennial census. They're the federal government's most prolific data publisher. Just learning all the ins and outs of the decennial census takes months to years of work. Then you have all the rest.

There's no option other than to jump into a giant pool and start swimming.

## The Basics

There is truly an astonishing amount of data collected by the U.S. Census Bureau. First, there's the census that most people know -- the every-10-years census (the decennial census). That's the one mandated by the Constitution where the government attempts to count every person in the nation.

Then, starting in 2005, the Census Bureau launched the American Community Survey (ACS). Think of it like a rolling census, where instead of every 10 years, new data is being gathered all the time. The difference? The ACS is a survey -- a random sample of the population -- not a head count. It provides more detailed demographic, social, economic, and housing characteristics annually.

The Census Bureau has *dozens* of other programs. Unfortunately, the data can be complex to work with. The good news is the Census API (Application Programming Interface) allows us to get data directly using code.

Let's demonstrate.

We're going to use a library called [`census` (by Datamade)](https://github.com/datamade/census) which makes calls to the Census API relatively straightforward. It returns data as a list of dictionaries, which we can easily convert into a pandas DataFrame.

First, let's import our libraries:

In [None]:
import pandas as pd
import census

Now, to get access to Census data, you'll need an API key. Sign up for one [here](https://api.census.gov/data/key_signup.html), then paste it into the code below:

In [None]:
c = census.Census("_____")
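
If you'd rather not paste a key directly into a notebook you might share, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name `CENSUS_API_KEY` and the helper `load_census_key` are assumptions made for this example, not something the `census` library requires, and your notebook environment may or may not let you set environment variables.

```python
import os


def load_census_key(env_var="CENSUS_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Census API key."
        )
    return key


# Demo only: a placeholder value so this snippet runs on its own.
# In real use, set the variable outside the notebook instead.
os.environ.setdefault("CENSUS_API_KEY", "your-key-here")

api_key = load_census_key()
# c = census.Census(api_key)  # then create the client as in the cell above
```

The helper fails loudly when the key is missing, which tends to be easier to debug than an API error later on.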


Now, instead of reading a CSV, we use methods from our `Census` object (`c`) to fetch data.

Let's replicate something similar to previous tutorials – calculating population changes. The `census` library doesn't have a specific function for the Population Estimates Program (PEP) like `tidycensus` did. Instead, we'll use the total population variable from the ACS 5-Year estimates (`acs5`). The most common variable code for total population is `B01003_001E`. We'll need to fetch this for both 2023 and 2022.

### Exercise 1: Get ACS Population Data

Using the `census` library object `c` we created, we can fetch data. The library has different datasets available (like `acs5`, `acs1`, `sf1`). We'll use `acs5` for 5-year estimates, which are generally available down to the county level and below.

The `census` library provides convenience methods for common geographies. To get data for all counties within a specific state, we can use `c.acs5.state_county()`. This method needs:
1.  `fields`: A tuple of the variable codes you want (e.g., `('B01003_001E',)` for total population).
2.  `state_fips`: The FIPS code for the state ({glue:text}`state_fips`).
3.  `county_fips`: Use `Census.ALL` to get data for all counties in the state.
4.  `year`: The year of the data (e.g., `2023`).

Fill in the blanks below to get the 2022 and 2023 ACS 5-Year total population (`B01003_001E`) for all counties in your state. The result will be a list of dictionaries.

In [None]:
# Fetch 2023 ACS5 population for all counties in the state
pop_data_list_23 = c.acs5.state_county(
    fields=(____,), # Needs to be a tuple, even with one variable
    state_fips=____,
    county_fips=____, # Use census.Census.ALL for all counties
    year=____
)
# Display the first few results (list of dictionaries)
print("First 5 results (list of dictionaries):")
print(pop_data_list_23[:5])

# Convert to DataFrame for easier viewing and checking
df_pop_23 = pd.DataFrame(pop_data_list_23)
print("DataFrame head:")
print(df_pop_23.head())


# Fetch 2022 ACS5 population for all counties in the state
pop_data_list_22 = c.acs5.state_county(
    fields=(____,), # Needs to be a tuple, even with one variable
    state_fips=____,
    county_fips=____, # Use census.Census.ALL for all counties
    year=____
)
df_pop_22 = pd.DataFrame(pop_data_list_22)

Notice the output is a list of dictionaries. We converted it to a pandas DataFrame (`df_pop_23`) for easier use. The DataFrame contains the requested variable (`B01003_001E`) and the state and county FIPS codes.
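
To see the list-of-dictionaries to DataFrame step without calling the API, here is a small sketch on made-up rows. The numbers are invented, and whether values arrive as numbers or text may depend on the library version, so treat the types as an assumption. The combined `fips` column is an optional extra for illustration, not something the library returns.

```python
import pandas as pd

# Hypothetical sample shaped like the library's output; the numbers are made up.
sample_rows = [
    {"B01003_001E": 16000.0, "state": "27", "county": "001"},
    {"B01003_001E": 365000.0, "state": "27", "county": "003"},
    {"B01003_001E": 35000.0, "state": "27", "county": "005"},
]

# Each dictionary becomes a row; each key becomes a column.
df_sample = pd.DataFrame(sample_rows)

# FIPS codes are identifiers, so keep them as strings to preserve leading zeros.
df_sample["state"] = df_sample["state"].astype(str)
df_sample["county"] = df_sample["county"].astype(str)

# Coerce the population column to numeric in case values arrive as text.
df_sample["B01003_001E"] = pd.to_numeric(df_sample["B01003_001E"])

# Optional: combine state and county codes into one 5-digit county identifier.
df_sample["fips"] = df_sample["state"] + df_sample["county"]

print(df_sample)
```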

Now we have two DataFrames (`df_pop_23` and `df_pop_22`), one for each year. Let's prepare them for calculating the change.

### Exercise 2: Prepare DataFrames

To make combining and pivoting easier, let's make sure both DataFrames have the same structure:
1.  A column for the county FIPS code (the `census` library usually names this `county`).
2.  A column for the population value. Let's rename the variable code column (e.g., `B01003_001E`) to `value`.
3.  A column indicating the `year`.

We only need these three columns. Fill in the blanks below to select the `county` column, rename the population variable column to `value`, add a `year` column, and keep only these three columns for both years.

In [None]:
# Prepare the 2023 DataFrame
pop23_prepared = df_pop_23.rename(columns={____: 'value'})
pop23_prepared['year'] = ____
pop23_prepared = pop23_prepared[[____, ____, ____]]
display(pop23_prepared.head())

# Prepare the 2022 DataFrame
pop22_prepared = df_pop_22.rename(columns={____: 'value'})
pop22_prepared['year'] = ____
pop22_prepared = pop22_prepared[[____, ____, ____]]
display(pop22_prepared.head())

Now we have two DataFrames (`pop23_prepared`, `pop22_prepared`) with identical columns: `county`, `value`, `year`. We are ready to combine them.
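
Because the same three steps are repeated for each year, they could also be wrapped in a small helper. This is a sketch of one way to do that, not the lesson's required approach: `prepare_population` is a hypothetical name, and the toy input uses invented numbers.

```python
import pandas as pd


def prepare_population(df, year, variable="B01003_001E"):
    """Return a new DataFrame with only county, value and year columns."""
    return (
        df.rename(columns={variable: "value"})  # variable code -> 'value'
        .assign(year=year)                      # add a constant year column
        .loc[:, ["county", "value", "year"]]    # keep just the three columns
    )


# Toy input shaped like the API result; the numbers are made up.
toy = pd.DataFrame(
    {
        "B01003_001E": [100.0, 200.0],
        "state": ["27", "27"],
        "county": ["001", "003"],
    }
)

prepared = prepare_population(toy, 2023)
print(prepared)
```

Chaining `rename`, `assign` and `loc` returns a new DataFrame each time, so the original `toy` frame is left unchanged.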

### Exercise 3: Concatenating (Binding)

We need to stack these two DataFrames (`pop23_prepared`, `pop22_prepared`) on top of each other. Since they have identical column names, we can use pandas' `pd.concat()` function. We pass it a list containing the DataFrames we want to stack. Fill in the blanks with the two DataFrames you just prepared.

In [None]:
popest = pd.concat([____, ____], ignore_index=True)
display(popest.head())
display(popest.tail())
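
If you're curious what `ignore_index=True` does, this toy comparison may help. The two small frames are invented stand-ins for the prepared DataFrames.

```python
import pandas as pd

# Two tiny frames with identical columns (made-up numbers).
a = pd.DataFrame({"county": ["001", "003"], "value": [123, 345], "year": [2023, 2023]})
b = pd.DataFrame({"county": ["001", "003"], "value": [99, 678], "year": [2022, 2022]})

# Without ignore_index, each frame keeps its own row labels, so they repeat.
kept = pd.concat([a, b])

# With ignore_index=True, pandas renumbers the stacked rows from zero.
renumbered = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Repeated row labels are not an error, but a clean running index is usually easier to work with after stacking.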

Our next problem? We have data that's stacked (long format), not side by side (wide format). To calculate the percent change in population ((new - old) / old), we need the 2023 and 2022 populations as separate **columns** for each county. Currently, we have separate **rows** for each county-year combination.

We need to pivot the data from long to wide format. Pandas' `pivot()` method is perfect for this.

`pivot()` needs three main arguments:
1.  `index`: The column(s) whose values will become the new DataFrame's index (the unique identifier for each row, in our case, the `county` FIPS code).
2.  `columns`: The column whose unique values will become the new column headers (in our case, the `year`).
3.  `values`: The column whose values will fill the cells of the new DataFrame (in our case, the population `value`).

What you have now (`popest`):

| county | value | year |
|--------|-------|------|
| 05001  | 123   | 2023 |
| 05001  | 99    | 2022 |
| 05003  | 345   | 2023 |
| 05003  | 678   | 2022 |

And what you want is this:

| county | 2022 | 2023 |
|--------|------|------|
| 05001  | 99   | 123  |
| 05003  | 678  | 345  |

(Note: `pivot` makes the `index` column the actual index of the DataFrame. We can use `.reset_index()` afterwards if we want `county` back as a regular column).
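
Here is the same long-to-wide reshaping as a runnable sketch, using the four illustrative rows from the table above. The names `popest_toy`, `wide_toy` and `flat_toy` are only for this example.

```python
import pandas as pd

# The illustrative long-format rows from the table above.
popest_toy = pd.DataFrame(
    {
        "county": ["05001", "05001", "05003", "05003"],
        "value": [123, 99, 345, 678],
        "year": [2023, 2022, 2023, 2022],
    }
)

# One row per county, one column per year, cells filled with the population value.
wide_toy = popest_toy.pivot(index="county", columns="year", values="value")
print(wide_toy)

# Optional: move county out of the index and back into a regular column.
flat_toy = wide_toy.reset_index()
flat_toy.columns.name = None  # drop the leftover 'year' label on the column axis
print(flat_toy)
```

Note that `pivot()` raises an error if any county-year pair appears more than once, which can be a useful check that the stacked data has no duplicates.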

### Exercise 4: Pivoting

Fill in the blanks for the `index`, `columns`, and `values` arguments in the `pivot()` function.

In [None]:
popest_wide = popest.pivot(index='____', columns='____', values='____')
# Optional: Reset index to make county FIPS a regular column
# popest_wide = popest_wide.reset_index()
        
# Optional: Rename columns if they are numbers to be more like variable names
# popest_wide = popest_wide.rename(columns={2022: 'pop2022', 2023: 'pop2023'})
        
display(popest_wide.head())

And now you have a DataFrame where each row represents a county (identified by its FIPS code in the index) and columns represent the population in 2022 and 2023. You're ready to calculate the percent change.
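
As a sketch of that last step, the percent change formula from earlier, (new - old) / old, can be applied column-wise. The example assumes the year columns were renamed to `pop2022` and `pop2023` as in the optional line above, and it reuses the made-up numbers from the illustration table.

```python
import pandas as pd

# Toy wide-format data (made-up numbers from the illustration table).
wide = pd.DataFrame(
    {
        "county": ["05001", "05003"],
        "pop2022": [99, 678],
        "pop2023": [123, 345],
    }
)

# Raw change, then percent change relative to the older value.
wide["change"] = wide["pop2023"] - wide["pop2022"]
wide["percent_change"] = wide["change"] / wide["pop2022"] * 100

# Largest gains first.
print(wide.sort_values("percent_change", ascending=False))
```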

## Working with the ACS (More Variables)

The ACS contains thousands of variables. Finding the right variable code (like `B01003_001E` for population or `B03001_003E` for Hispanic population) is often the biggest challenge.

You typically need to:
1.  **Explore interactively:** Use tools like the [Census Bureau's Table & Variable Lookup](https://api.census.gov/data/2022/acs/acs5/variables.html) (change the year as needed).
2.  **Use external resources:** Websites like [Census Reporter](https://censusreporter.org/) provide excellent tools for finding tables and variables.
3.  **Know common tables:** Over time, you'll become familiar with frequently used tables (e.g., B01003 for total population, B03001/B03002/B03003 for Hispanic origin/race, B19013 for median income, B25003 for tenure, etc.).
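
If you save a list of variable codes with their labels and concepts, a keyword search can speed up the hunt. The sketch below works on a tiny hand-written sample; the dictionary layout, the shortened labels, and the helper name `find_variables` are all assumptions for illustration, modeled loosely on the name, label and concept columns shown on the variables page.

```python
# Hypothetical, abridged sample of variable metadata (not the real listing).
sample_variables = {
    "B01003_001E": {
        "label": "Estimate!!Total",
        "concept": "Total Population",
    },
    "B03003_003E": {
        "label": "Estimate!!Total!!Hispanic or Latino",
        "concept": "Hispanic or Latino Origin",
    },
    "B19013_001E": {
        "label": "Estimate!!Median household income in the past 12 months",
        "concept": "Median Household Income in the Past 12 Months",
    },
}


def find_variables(variables, keyword):
    """Return sorted variable codes whose label or concept contains the keyword."""
    keyword = keyword.lower()
    return sorted(
        code
        for code, meta in variables.items()
        if keyword in meta.get("label", "").lower()
        or keyword in meta.get("concept", "").lower()
    )


print(find_variables(sample_variables, "hispanic"))
print(find_variables(sample_variables, "income"))
```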

Once you know the variable codes, fetching the data is similar to how we fetched population.

Imagine you used the [Census API variable page](https://api.census.gov/data/2022/acs/acs5/variables.html) to find variables related to Hispanic origin. You'd find Table `B03001` (Hispanic or Latino Origin by Specific Origin) and the simpler Table `B03003` (Hispanic or Latino Origin). In `B03003`, the key variables are:
*   `B03003_001E`: Total Population
*   `B03003_002E`: Not Hispanic or Latino
*   `B03003_003E`: Hispanic or Latino

(We'll use these codes in the next exercise.)

### Exercise 5: Getting Specific ACS Data

Let's fetch the 2022 ACS 5-Year data for Hispanic origin using the variable codes identified above (`B03003_001E`, `B03003_002E`, `B03003_003E`). We want this data at the `county` level for Minnesota.

Use the `c.acs5.state_county()` method again. Remember to pass the variable codes as a tuple to the `fields` argument. Fill in the blanks.

In [None]:
# Define Hispanic origin variables
hispanic_vars = ('B03003_001E', 'B03003_002E', 'B03003_003E')
acs_year = 2022

# Get ACS data for Hispanic origin by county
origin_data_list = c.acs5.state_county(
    fields=____, # Tuple of variable codes
    state_fips=____,
    county_fips=____, # census.Census.ALL for all counties
    year=____
)
        
# Convert to DataFrame
df_origin = pd.DataFrame(origin_data_list)
display(df_origin.head())

Now you have a DataFrame (`df_origin`) containing the total population, non-Hispanic population, and Hispanic population estimates for each county in your state for 2022. 

Notice that when you fetch multiple variables with `state_county()`, the library returns each variable as its own column in the resulting DataFrame, so the data is already in 'wide' format for those variables. It would still be 'long' across years if you fetched several years this way and concatenated them, as we did with population. The library's more general `get()` method returns the same list-of-dictionaries structure.
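
Because the variables arrive side by side, turning counts into a rate is a one-line calculation. This is a sketch on invented numbers; the friendlier column names and `pct_hispanic` are choices made for this example only.

```python
import pandas as pd

# Toy data shaped like df_origin; the numbers are made up.
df_toy = pd.DataFrame(
    {
        "B03003_001E": [1000.0, 2000.0],
        "B03003_002E": [900.0, 1500.0],
        "B03003_003E": [100.0, 500.0],
        "state": ["27", "27"],
        "county": ["001", "003"],
    }
)

# Give the variable codes readable names.
shares = df_toy.rename(
    columns={
        "B03003_001E": "total",
        "B03003_002E": "not_hispanic",
        "B03003_003E": "hispanic",
    }
)

# Share of each county's population that is Hispanic or Latino, in percent.
shares["pct_hispanic"] = shares["hispanic"] / shares["total"] * 100

print(shares[["county", "total", "hispanic", "pct_hispanic"]])
```

Keep in mind that ACS figures are survey estimates, so small differences between counties may not be meaningful.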

## The Recap

Throughout this lesson, you've learned how to use the `census` library in Python to access U.S. Census Bureau data. You practiced fetching ACS data for specific variables and geographies (`state_county`), handling the list-of-dictionaries format returned by the library by converting it to a pandas DataFrame. You applied essential pandas skills like renaming columns (`.rename()`), adding columns, selecting columns, concatenating DataFrames (`pd.concat`), and pivoting data from long to wide format (`.pivot()`). You also learned that finding variable codes requires external tools when using this library, unlike some others. While working with census data has its complexities, you now have a foundation for using the `census` library to incorporate this vital demographic information into your data journalism work.

## Terms to Know

- **Census API**: An Application Programming Interface that allows direct access to U.S. Census Bureau data through code-based queries.
- **`census`**: A Python package that simplifies the process of retrieving and working with U.S. Census Bureau data.
- **American Community Survey (ACS)**: A continuous survey conducted by the U.S. Census Bureau that provides detailed demographic information between decennial censuses.
- **`pd.concat()`**: A pandas function used to combine multiple DataFrames either vertically (stacking rows) or horizontally.
- **`.pivot()`**: A pandas DataFrame method used to transform data from long format to wide format based on index, columns, and values.
- **Long data**: A data format where each row represents a single observation or measurement, often resulting in multiple rows per subject identified by key columns.
- **Wide data**: A data format where each row represents a unique subject, with different observations or measurements spread across columns.