# Data Journalism Lesson 21: Bubble charts

Adding a new dimension to scatterplots.

In [None]:
import warnings
from IPython.core.interactiveshell import InteractiveShell

# Keep hold of the real method
_orig_should_run = InteractiveShell.should_run_async

# Wrap it so that any DeprecationWarning it emits is silenced
def should_run_async(self, code, *args, **kwargs):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return _orig_should_run(self, code, *args, **kwargs)

# Apply the monkey-patch
InteractiveShell.should_run_async = should_run_async

In [None]:
import micropip
await micropip.install('plotly')
await micropip.install("nbformat>=4.2.0")

In [None]:
from IPython.display import display, HTML
import pandas as pd

# --- Simple Grading/Checking Functions ---
def display_feedback(correct, message_correct, message_incorrect):
    if correct:
        display(HTML(f'<div style="background-color: #dff0d8; padding: 10px; border-radius: 5px;"><strong>Correct!</strong> {message_correct}</div>'))
    else:
        display(HTML(f'<div style="background-color: #f2dede; padding: 10px; border-radius: 5px;"><strong>Not quite!</strong> {message_incorrect}</div>'))

def check_df_exists(df, df_name, expected_min_rows=None, expected_cols=None):
    if not isinstance(df, pd.DataFrame) or df.empty:
        if not (expected_min_rows == 0 and isinstance(df, pd.DataFrame)): # Allow empty if 0 rows expected
            display_feedback(False, '', f'{df_name} DataFrame is not loaded correctly or is unexpectedly empty. Please check the loading or filtering process.')
            return False
    msg_correct = f'{df_name} DataFrame checked.'
    correct = True
    msg_incorrect_list = []

    if expected_min_rows is not None and len(df) < expected_min_rows:
        msg_incorrect_list.append(f' Expected at least {expected_min_rows} rows, got {len(df)}.')
        correct = False
    if expected_cols is not None:
        if not all(col in df.columns for col in expected_cols):
            missing_cols = [col for col in expected_cols if col not in df.columns]
            msg_incorrect_list.append(f' Missing expected columns: {missing_cols}.')
            correct = False
    
    if correct:
        display_feedback(True, msg_correct, '')
    else:
        display_feedback(False, '', ' '.join(msg_incorrect_list))
    return correct

def check_plot_params(fig_params, expected_params, plot_name):
    # This is a simplified conceptual check. Actual Plotly parameter checking can be intricate.
    # We'll check for presence of key arguments if they were strings or simple values.
    correct = True
    messages = []
    for p_name, p_val_expected in expected_params.items():
        p_val_actual = fig_params.get(p_name)
        if p_val_actual == p_val_expected:
            messages.append(f'Correct {p_name} for {plot_name}.')
        else:
            correct = False
            messages.append(f'Incorrect {p_name} for {plot_name}. Expected \'{p_val_expected}\', got \'{p_val_actual}\'.')
    
    final_message_correct = f'Key plot parameters for {plot_name} seem correct!'
    final_message_incorrect = ' '.join(messages)
    if not messages: # If no specific params were checked, assume basic creation.
        display_feedback(True, f'{plot_name} created.', '')
        return
    display_feedback(correct, final_message_correct, final_message_incorrect)

In [None]:
# --- State Setup and Data Loading ---
default_state_abbr = 'MN'
state_full_name = "Minnesota"

bigpublic_colleges_url = "../_static/college-cost/bigpublic.csv"

bigpublic_df_initial = pd.read_csv(bigpublic_colleges_url)
state_college_df_initial = bigpublic_df_initial[bigpublic_df_initial['STABBR'] == default_state_abbr].copy()
state_college_count_expected = len(state_college_df_initial)

In [None]:
from myst_nb import glue

glue("state_full_name", state_full_name, display=False)
glue("state_abbr", default_state_abbr, display=False)
glue("state_college_name", state_college_df_initial['INSTNM'].iloc[0] if state_college_count_expected > 0 else "[No college found for state]", display=False)
glue("state_college_count", state_college_count_expected, display=False)

## The Goal

In this lesson, you'll learn how to create bubble charts, which add a third dimension to scatterplots through varying circle sizes. By the end of this tutorial, you'll understand when bubble charts are effective, how to construct them using Plotly Express, and how to enhance them with transparency and labels. You'll practice filtering data, adjusting visual elements, and using Plotly's built-in text capabilities for clear labeling. These skills will enable you to visualize complex relationships between three variables, a powerful tool for uncovering and communicating insights in your data journalism projects.

## Why Visualize Data?

Here is the real talk: Bubble charts are hard. 

They aren't hard because of the code or its complexity. A bubble chart is a scatterplot with magnitude added -- the size of each dot has meaning. The hard part is seeing when a bubble chart works and when it doesn't.

If you want to see it work spectacularly well, [watch a semi-famous TED Talk by Hans Rosling](https://www.youtube.com/watch?v=hVimVzgtD6w) from 2006 where bubble charts were the centerpiece. It's worth watching. It'll change your perspective on the world. No, seriously. It will.

That TED Talk, and the software his son created that you can see doing the visuals for his talk, turned a public health professor from Sweden into a bit of a global celebrity, to a very nerdy group of people. Rosling, with his son and daughter-in-law, started the Gapminder Foundation to further develop the software, which in 2007 was bought by Google. Time Magazine named him one of the 100 most influential people in 2012. Harvard University and the United Nations gave him awards. He wrote books about his worldview -- that we as a society vastly underestimate the progress the world has made across a number of different issues. A cottage industry of critics -- He's naive! He's a Pollyanna! -- and defenders popped up, as one does when you have an enormous amount of attention on you.

All because a professor used bubble charts in a talk on YouTube.

In a 2005 paper about their idea to use software to visualize global health and development data, the Roslings (Hans, his son and his daughter-in-law) wrote that "the representation of time by movement in scattergrams with carefully designed interfaces has proven to bring statistics beyond the eye to hit the brain."

Even they weren't sure it would actually work, but they had a vision. This graphic appeared in a paper published by the Organisation for Economic Co-operation and Development -- the OECD.

```{image} ../figures/free-aha-wow.png
:alt: The "Free! Aha! Wow!" graphic from the Roslings' OECD paper
:width: 600px
:align: center
```

Free! Aha! Wow!

But Rosling's talks -- and his relentless enthusiasm for the data as it really is -- were infectious. And since then, people have wanted bubble charts. 

And we're back to the original problem: They're hard. Not hard to make, but hard to make *well* and interpret correctly.

## The Basics

To show how hard they are to get right (though easy to code), let's try to make one. It should be quickly obvious to you that the code isn't the hard part. 

To make a bubble chart with Plotly Express, you're making a scatterplot, just like we did in the previous exercise. Then you're adding one more element -- the `size` of the dot.

I've got a subset of data from the last exercise. It's the same cost vs. completion data, but this time, it's only for the largest public universities in each state. For almost every state, this is Big State U, where they have Big Time sports teams and jerseys you can buy in stores all over the state. A *lot* of media attention goes to exclusive private universities -- the Ivy League, for example -- but the truth is the overwhelming plurality of college-educated adults went to Big State U.

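We don't need any new libraries from last time. We need `pandas` and `plotly.express`, and we'll also import `plotly.graph_objects`, which we'll use later in the lesson to add a highlighted point.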

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

And the data:

In [None]:
bigpublic_df = pd.read_csv("../_static/college-cost/bigpublic.csv")

A bubble chart is just a scatterplot with one additional element in the aesthetic -- a `size`. We'll make the scatterplot version first, but before that, let's take a peek at the data.

In [None]:
display(____.head())

### Exercise 1: First, we scatterplot

To make this scatterplot, let's just repeat what we did last time: the cost (`COSTT4_A`) on the x-axis, the completion rate (`C150_4`) on the y-axis.

In [None]:
fig_ex1 = px.scatter(
    data_frame=____,
    x=____, # Cost
    y=____, # Completion Rate
    title='Cost vs. Completion for Large Public Universities (Scatterplot)'
)
fig_ex1.show()

Looks like a scatterplot, eh? But which of these schools are really big and which ones ... aren't? This is where the bubble chart comes in.

### Exercise 2: The bubble chart

Let's add the `size` element. From our peek at the data above, we want to add the column that has the number of undergrads in it. That's `UGDS` (UnderGraduate Degree Seeking students).

In [None]:
fig_ex2 = px.scatter(
    data_frame=____,
    x=____,
    y=____,
    size=____,
    title='Cost vs. Completion vs. Size for Large Public Universities (Bubble Chart)'
)
fig_ex2.show()

What does this chart tell you? Seems there are some big schools that cost a bunch and are graduating a ton of people. There are some smaller schools that are kinda in the middle. And there's a small handful of small state schools that are graduating fewer than 40 percent of first-time first-year students.

Our chart needs some improvement.

### Exercise 3: Adding transparency and scaling size

We can make this more readable by adding an `opacity` argument (transparency) to `px.scatter()`. Plotly Express scales the bubble sizes automatically, but you can cap the largest bubble with `size_max`. For finer control, you can adjust `marker.sizeref` and `marker.sizemode` through `update_traces`, though `size_max` is simpler for a quick adjustment.

Let's try an `opacity` of 0.6 (Plotly uses 0 to 1 for opacity). Let's use a moderate `size_max` like 30 or 40 to ensure bubbles aren't too overwhelming.

In [None]:
fig_ex3 = px.scatter(
    data_frame=____,
    x=____,
    y=____,
    size=____,
    opacity=____,
    size_max=____,
    title='Cost vs. Completion vs. Size (Adjusted Bubbles)'
)
fig_ex3.show()

Better? The transparency helps with overlapping bubbles, and `size_max` gives some control over how large the biggest bubbles get.
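
One thing to keep in mind when reading bubble sizes: Plotly can scale bubbles by area (`sizemode='area'`, the setting used for the highlighted trace later in this lesson), so a value maps to the circle's area rather than its width. Here is a rough sketch of what that implies, assuming pure area scaling with no minimum size. The function `relative_diameter` and the enrollment numbers are made up for illustration; this is not a Plotly function.

```python
import math

def relative_diameter(value, largest_value, max_diameter_px):
    """Diameter of a bubble whose AREA is proportional to its value.

    Hypothetical helper for illustration; it is not part of Plotly.
    """
    # Area grows linearly with the value, so diameter grows with its square root
    return max_diameter_px * math.sqrt(value / largest_value)

# Made-up enrollments: one school is four times the size of the other
big = relative_diameter(40_000, largest_value=40_000, max_diameter_px=40)
small = relative_diameter(10_000, largest_value=40_000, max_diameter_px=40)

print(big, small)  # 40.0 20.0
```

A school with four times the students gets a bubble only twice as wide. That's worth remembering before you eyeball how much bigger one school is than another.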

### Exercise 4: Adding a focus

What would help the most is if we added a school to focus on. So let's add your state's largest public university to this chart. First step - filtering. Let's call the dataframe `state_college_df`.

In [None]:
_____ = bigpublic_df[bigpublic_df['STABBR'] == _____]

You now have one school in {glue:text}`state_full_name`: {glue:text}`state_college_name`.
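
If the filtering step feels like magic, it may help to see the two pieces separately. This is a small sketch on a made-up table; `toy_df`, `school`, and `state` are invented names, not columns from the lesson's data.

```python
import pandas as pd

# Hypothetical data, invented only to show the pattern
toy_df = pd.DataFrame({
    "school": ["Alpha U", "Beta State", "Gamma Tech"],
    "state": ["AA", "BB", "AA"],
})

# The comparison produces a True/False value for every row...
mask = toy_df["state"] == "BB"

# ...and indexing with that mask keeps only the True rows.
# .copy() makes an independent DataFrame, which avoids pandas warnings
# if you modify the filtered rows later.
one_state_df = toy_df[mask].copy()

print(one_state_df["school"].tolist())  # ['Beta State']
```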

### Exercise 5: Adding a red dot and labels

Let's add that school to the chart. We'll create the base chart with all universities (semi-transparent bubbles), then add a new trace specifically for the `state_college_df`.
This new trace will:
- Use the same x, y, and size aesthetics.
- Have a different color (e.g., 'red').
- Include text labels for the institution name (`INSTNM`).

We will use `fig.add_trace()` with `go.Scatter` for the highlighted point(s) to control its appearance and add text.

In [None]:
# Start with the base bubble chart from Exercise 3
fig_ex5 = px.scatter(
    data_frame=bigpublic_df,
    x='COSTT4_A',
    y='C150_4',
    size='UGDS',
    opacity=0.3, # Make base points semi-transparent
    size_max=40,
    title=f'Cost vs. Completion vs. Size, Highlighting {state_full_name}'
)
fig_ex5.update_traces(marker=dict(line=dict(width=0.5, color='DarkSlateGrey'))) # Add border to bubbles

fig_ex5.add_trace(go.Scatter(
    x=state_college_df[____], # Cost
    y=state_college_df[____], # Completion Rate
    mode='markers+text',
    text=state_college_df[____],
    textposition='top center',
    textfont=dict(size=10, color='black'),
    marker=dict(
        size=state_college_df[____], # UGDS for size
        color=____,
        sizemode='area',
        sizeref=fig_ex5.data[0].marker.sizeref, # Use sizeref from base plot for consistency
        sizemin=fig_ex5.data[0].marker.sizemin or 4, # Fall back to 4 when the base plot sets no minimum
        opacity=0.9,
        line=dict(width=1, color='DarkSlateGrey')
    ),
    name=state_full_name # For legend, if shown
))

fig_ex5.update_layout(showlegend=False)
fig_ex5.show()

And what story does that tell? Where does your state's biggest public university fit in this picture? Is that a good or a bad thing? The next step after this? What about other schools that are similar to your state's largest? These institutions are multi-billion dollar entities in your state. They're a big deal.

## The Recap

Throughout this lesson, you've mastered the creation of bubble charts using Plotly Express and Plotly Graph Objects, learning to visualize relationships between three variables simultaneously. You've practiced transforming scatterplots into bubble charts by mapping a variable to the `size` aesthetic, adjusting transparency (`opacity`) and maximum bubble size (`size_max`) for clarity, and adding labels to highlight key data points. Remember, while bubble charts can be powerful for showing an additional dimension of data, they're most effective when used judiciously - when the relationships between all three variables tell a compelling story and the chart doesn't become too cluttered. Going forward, consider how bubble charts might reveal hidden patterns in your datasets, but always balance their complexity with the clarity of your message.

## Terms to Know

- **Bubble chart**: A variation of a scatterplot where data points are represented by circles (bubbles) whose sizes correspond to a third numerical variable.
- **`px.scatter()`**: The Plotly Express function used to create scatterplots and bubble charts (by utilizing the `size` argument).
- **`size` argument (in `px.scatter`)**: Maps a numerical column to the size of the markers, creating a bubble chart.
- **`opacity` argument (in `px.scatter`)**: Controls the transparency of the markers (0.0 fully transparent, 1.0 fully opaque).
- **`size_max` argument (in `px.scatter`)**: Sets the maximum diameter (in pixels) for the largest bubble in the chart.
- **`go.Scatter()`**: A trace type from `plotly.graph_objects` that can be used to add more customized layers, like highlighted bubbles with text, to a figure.
- **`text` argument (in `go.Scatter` or `px.scatter`)**: Used to specify text labels for data points.
- **`mode='markers+text'`**: A mode for `go.Scatter` traces to display both markers and their associated text labels.