# Data Journalism Lesson 19: Dumbbell charts

Making quick graphics for publication by showing the difference between two points.

In [None]:
import warnings
from IPython.core.interactiveshell import InteractiveShell

# Keep hold of the real method
_orig_should_run = InteractiveShell.should_run_async

# Wrap it so that any DeprecationWarning it emits is silenced
def should_run_async(self, code, *args, **kwargs):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return _orig_should_run(self, code, *args, **kwargs)

# Apply the monkey‑patch
InteractiveShell.should_run_async = should_run_async

In [None]:
import micropip
await micropip.install('plotly')
await micropip.install("nbformat>=4.2.0")

In [None]:
from IPython.display import display, HTML
import pandas as pd

# --- Simple Grading/Checking Functions ---
def display_feedback(correct, message_correct, message_incorrect):
    if correct:
        display(HTML(f'<div style="background-color: #dff0d8; padding: 10px; border-radius: 5px;"><strong>Correct!</strong> {message_correct}</div>'))
    else:
        display(HTML(f'<div style="background-color: #f2dede; padding: 10px; border-radius: 5px;"><strong>Not quite!</strong> {message_incorrect}</div>'))

def check_df_exists(df, df_name, expected_rows=None):
    if not isinstance(df, pd.DataFrame) or df.empty:
        display_feedback(False, '', f'{df_name} DataFrame is not loaded correctly or is empty. Please check the loading process.')
        return False
    msg = f'{df_name} DataFrame loaded successfully.'
    correct = True
    if expected_rows is not None and len(df) != expected_rows:
        msg += f' Expected {expected_rows} rows, got {len(df)}.'
        correct = False
    
    if correct:
        display_feedback(True, msg, '')
    else:
        display_feedback(False, '', msg)
    return correct

def check_figure_creation(fig, fig_object_type, num_traces_expected=None, trace_types_expected=None):
    if not isinstance(fig, fig_object_type):
        display_feedback(False, '', f'The figure is not of the expected type ({fig_object_type.__name__}). Please ensure you are creating the figure correctly.')
        return False
    correct = True
    msg_correct = 'Figure created correctly.'
    msg_incorrect_list = []

    if num_traces_expected is not None and len(fig.data) != num_traces_expected:
        correct = False
        msg_incorrect_list.append(f'Expected {num_traces_expected} traces, found {len(fig.data)}.')
    
    if trace_types_expected is not None:
        actual_trace_types = [type(trace) for trace in fig.data]
        for i, expected_type in enumerate(trace_types_expected):
            if i < len(actual_trace_types):
                if not isinstance(fig.data[i], expected_type):
                    correct = False
                    msg_incorrect_list.append(f'Trace {i} is type {type(fig.data[i]).__name__}, expected {expected_type.__name__}.')
            elif expected_type is not None:  # Expected more traces than found
                correct = False
                msg_incorrect_list.append(f'Missing expected trace of type {expected_type.__name__}.')
                 
    if correct:
        display_feedback(True, msg_correct, '')
    else:
        display_feedback(False, '', ' '.join(msg_incorrect_list))
    return correct


In [None]:
# --- State Setup and Data Loading ---
state_abbr = 'MN'
state_full_name = 'Minnesota'

colleges_data_url = f"../_static/colleges/{state_full_name.lower().replace(' ', '-')}.csv"

colleges_df_initial = pd.read_csv(colleges_data_url)

collegerows_expected = len(colleges_df_initial)

In [None]:
from myst_nb import glue

glue("state_full_name", state_full_name, display=False)
glue("colleges_csv_name", f"{state_full_name.lower().replace(' ', '-')}.csv", display=False)
glue("collegerows_expected", collegerows_expected, display=False)

## The Goal

In this lesson, you'll learn how to create dumbbell charts, a powerful tool for visualizing the difference between two related values. By the end of this tutorial, you'll understand when to use dumbbell charts, how to prepare your data for this type of visualization, and how to create and customize dumbbell charts using `plotly.graph_objects`. You'll practice filtering data, reordering chart elements, and adding color to enhance the visual story. These skills will enable you to effectively communicate comparisons and gaps in your data, a crucial ability for data journalists looking to highlight meaningful differences.

## Why Visualize Data?

Coulter Jones has worked in tiny newspapers, trade publications and in public radio. He's also worked at two of the largest news organizations in the world -- Bloomberg News and the Wall Street Journal. In smaller places, it's much more common for someone with some data skills to take a graphic all the way to publication. But Bloomberg and the WSJ have graphics staffs larger than most newspapers these days. There's zero chance of Jones turning some data he has into a graphic that faces the public.

And yet, he makes graphics all the time.

"I would say almost every time I do an analysis, right?" he said. 

As a data analyst, one of the most important things you can do -- and one of the easiest once you get some practice -- is to convert your data into a picture. You learned this from Tukey. You learned this from Playfair. You learned this from Tufte. Data journalists do this all the time, and it's one of the best reasons to learn these tools. 

"One of the things that I did not appreciate at first, but then quickly learned to love about R (and similar tools like Python with Plotly) is, oh, this is a visual platform too," Jones might say. "So I can very easily just do a histogram. I can do a two factor line or point, sort of like show me age and sex or something like that.

"If you just chart it, you can sometimes just see the story immediately."

## The Basics

Second only to waffle charts, which I love for their name and because I'm always hungry, dumbbell charts are an excellently named way of **showing the difference between two things on a number line** -- a start and a finish, for instance. Or the difference between two related things. When the gap between numbers is the news, dumbbell charts are what you want.

Plotly Express doesn't have a direct `px.dumbbell()` function. However, we can construct dumbbell charts using `plotly.graph_objects` by combining line segments and markers. We'll need `pandas` for data and `plotly.graph_objects as go`.

Let's give it a whirl.

In [None]:
import pandas as pd
import plotly.graph_objects as go

For this, let's use a list of colleges from the Department of Education's College Scorecard. The dataset is massive -- there are nearly 6,500 colleges and universities in the dataset, which in the grand scheme of data isn't that many rows. But there are more than 3,300 *columns* of data. In other words, that's how many things are being tracked about each college.

Let's focus on a couple of the most important things -- do people graduate? And what does it cost to go there?

We'll load the data for your state: {glue:text}`state_full_name`. I've cut down the number of columns to just what we need, and limited the colleges to those granting associate and bachelor's degrees. Two other limitations: If a college didn't report data, it was dropped. And the number of colleges for a state is capped at 20 (the 20 largest by undergraduate enrollment). Dumbbell charts can only hold so many.

In [None]:
colleges_df = pd.read_csv("../_static/colleges/minnesota.csv")

Let's look at what we've got here:

In [None]:
display(____.head())

For this example, let's look at the difference between a school's in-state tuition vs. out-of-state tuition. Most public colleges charge more for students who come from outside the state. The obvious reason is to give people who pay taxes in the state a break and to charge more to people who don't. Private schools don't have to do that, but some do.

### Exercise 1: The first dumbbell

To create a dumbbell chart with `plotly.graph_objects`, we'll construct it piece by piece:
1.  A line segment for each college connecting `TUITIONFEE_IN` and `TUITIONFEE_OUT`.
2.  Markers (points) at each end of these segments.

We'll iterate through our `colleges_df` and add a line trace for each college. Then, we'll add two scatter traces for all the 'IN' points and all the 'OUT' points.

- `y`: `INSTNM` (Institution Name) - this will be our categorical axis.
- `x`: `TUITIONFEE_IN` for one set of points and `TUITIONFEE_OUT` for the other. The lines will connect these x-values for each y-value.

In [None]:
fig_ex1 = go.Figure()

# Add line segments for each college
for index, row in colleges_df.iterrows():
    fig_ex1.add_trace(go.Scatter(
        x=[row[____], row[____]],
        y=[row[____], row[____]],
        mode='lines',
        line=dict(color='grey', width=1),
        showlegend=False
    ))

# Add markers for In-State Tuition
fig_ex1.add_trace(go.Scatter(
    x=colleges_df[____],
    y=colleges_df[____],
    mode='markers',
    marker=dict(color='blue', size=8),
    name='In-State Tuition'
))

# Add markers for Out-of-State Tuition
fig_ex1.add_trace(go.Scatter(
    x=colleges_df[____],
    y=colleges_df[____],
    mode='markers',
    marker=dict(color='red', size=8),
    name='Out-of-State Tuition'
))

fig_ex1.update_layout(
    title_text='In-State vs. Out-of-State Tuition Fees',
    xaxis_title='Tuition Fee',
    yaxis_title='Institution Name',
    height=max(400, len(colleges_df) * 30) # Adjust height based on number of colleges
)

fig_ex1.show()

Well, that's a chart alright. But which dot is the in-state tuition and which is the out-of-state? We used blue for in-state and red for out-of-state in the legend, which helps. But we can refine this further.

### Exercise 2: Colors and size

Let's refine the colors and marker size:
- Connecting line: `grey`
- `TUITIONFEE_IN` point: `green`
- `TUITIONFEE_OUT` point: `red`
- Marker size: Let's use `size=10` for better visibility.

Rebuild the figure with these specific color and size settings.

In [None]:
fig_ex2 = go.Figure()
marker_size = 10 # Larger markers for better visibility
line_color_ex2 = ____ # 'grey'
in_state_marker_color_ex2 = ____ # 'green'
out_state_marker_color_ex2 = ____ # 'red'

for index, row in colleges_df.iterrows():
    fig_ex2.add_trace(go.Scatter(
        x=[row['TUITIONFEE_IN'], row['TUITIONFEE_OUT']],
        y=[row['INSTNM'], row['INSTNM']],
        mode='lines',
        line=dict(color=line_color_ex2, width=1),
        showlegend=False
    ))

fig_ex2.add_trace(go.Scatter(
    x=colleges_df['TUITIONFEE_IN'],
    y=colleges_df['INSTNM'],
    mode='markers',
    marker=dict(color=in_state_marker_color_ex2, size=marker_size),
    name='In-State Tuition'
))

fig_ex2.add_trace(go.Scatter(
    x=colleges_df['TUITIONFEE_OUT'],
    y=colleges_df['INSTNM'],
    mode='markers',
    marker=dict(color=out_state_marker_color_ex2, size=marker_size),
    name='Out-of-State Tuition'
))

fig_ex2.update_layout(
    title_text='In-State (Green) vs. Out-of-State (Red) Tuition Fees',
    xaxis_title='Tuition Fee',
    yaxis_title='Institution Name',
    height=max(400, len(colleges_df) * 30) 
)
fig_ex2.show()

And now we have a chart that is trying to tell a story. We know, logically, that green on the left is good, because it means cheaper tuition. A long distance between green and red? That shows a gap between what in-state students pay and what out-of-state students pay. In some cases, that's small. In some it's *huge*. But what about the colleges that have just red dots? Those colleges don't have separate in-state and out-of-state tuition. They just have ... tuition. That can show up in the data in two ways: either `TUITIONFEE_IN` equals `TUITIONFEE_OUT`, so the green dot is there but hidden under the red dot, which is drawn last, or `TUITIONFEE_IN` is missing (NaN), so no green dot is drawn at all.
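To tell those two cases apart, you can ask pandas directly. This is a minimal sketch on a made-up `toy_df` (the values are invented, and only the column names come from the lesson's data):

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only; not College Scorecard data
toy_df = pd.DataFrame({
    "INSTNM": ["College A", "College B", "College C"],
    "TUITIONFEE_IN": [9000, np.nan, 30000],
    "TUITIONFEE_OUT": [21000, 15000, 30000],
})

# True where the in-state value is missing, so no in-state dot is drawn
missing_in = toy_df["TUITIONFEE_IN"].isna()

# True where both values match, so one dot sits exactly on top of the other
same_price = toy_df["TUITIONFEE_IN"].eq(toy_df["TUITIONFEE_OUT"])

print(toy_df.loc[missing_in, "INSTNM"].tolist())   # ['College B']
print(toy_df.loc[same_price, "INSTNM"].tolist())   # ['College C']
```

Running the same two checks on `colleges_df` would show which explanation applies to each red-only college in your state.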

### Exercise 3: Arrange helps tell the story

But what if we sort it by in-state tuition, so we see them in order of cost? In Plotly, we can control the order of categories on the y-axis by setting `categoryorder` and `categoryarray` for the y-axis inside `update_layout()`.

First, create a sorted list of institution names (`INSTNM`) based on `TUITIONFEE_IN` (ascending). Then apply this order to the y-axis of your figure from Exercise 2.

In [None]:
fig_ex3 = go.Figure(fig_ex2) # Start with the correctly colored figure


sorted_colleges_df = colleges_df.sort_values(by=____)
ordered_institution_names = sorted_colleges_df[____].tolist()

fig_ex3.update_layout(
    yaxis=dict(
        categoryorder='array',
        categoryarray=ordered_institution_names
    ),
    title_text='Tuition Fees Sorted by In-State Cost'
)
fig_ex3.show()

Now we can start asking questions. What story is this telling? What is the most expensive place to go to college in {glue:text}`state_full_name` for in-state students, at least among this list? What colleges have the widest gaps between in-state and out-of-state tuition?
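The chart lets you eyeball the widest gaps, and a quick calculation can confirm what you see. This is a rough sketch on made-up data. `TUITION_GAP` is a hypothetical column name added here for illustration and is not part of the lesson's dataset.

```python
import pandas as pd

# Made-up numbers for illustration only; not College Scorecard data
toy_df = pd.DataFrame({
    "INSTNM": ["College A", "College B", "College C"],
    "TUITIONFEE_IN": [9000, 12000, 30000],
    "TUITIONFEE_OUT": [21000, 15000, 30000],
})

# Gap between out-of-state and in-state tuition for each college
gaps_df = toy_df.assign(
    TUITION_GAP=toy_df["TUITIONFEE_OUT"] - toy_df["TUITIONFEE_IN"]
)

# The two colleges with the widest gaps, largest first
widest = gaps_df.nlargest(2, "TUITION_GAP")
print(widest[["INSTNM", "TUITION_GAP"]])
```

The same pattern applied to `colleges_df` would give you exact numbers to cite in a story alongside the chart.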

## The Recap

Throughout this lesson, you've learned how to create and customize dumbbell charts to visualize differences between two related values using `plotly.graph_objects`. You've practiced creating basic dumbbell charts by combining line and marker traces, and enhancing them with color and size adjustments. You've also seen how reordering the chart elements using `update_layout` can help tell a clearer story, as demonstrated with the comparison of in-state and out-of-state tuition costs across colleges. Remember, dumbbell charts are particularly effective when the gap between two values is the key story you want to tell. Consider how this visualization technique can help you highlight important comparisons in your datasets.

## Terms to Know

- **Dumbbell chart**: A type of chart that displays the difference or gap between two related data points for multiple categories, visually resembling a dumbbell.
- **`plotly.graph_objects` (as `go`)**: A Plotly module that provides more fine-grained control over chart creation compared to Plotly Express. Used here to construct dumbbell charts from basic shapes (lines and markers).
- **`go.Scatter()`**: A versatile trace type in `plotly.graph_objects` used to create scatter plots (markers), line plots, or combinations. Essential for building dumbbell charts.
- **`mode='lines'`**: A parameter for `go.Scatter()` to draw lines connecting the points.
- **`mode='markers'`**: A parameter for `go.Scatter()` to draw markers (points).
- **`x` and `y` arguments (in `go.Scatter`)**: Define the coordinates for points or line segments. For dumbbell lines, `x` will be `[value1, value2]` and `y` will be `[category_name, category_name]`.
- **`line=dict(color=...)`**: Used to style the line part of a `go.Scatter` trace.
- **`marker=dict(color=..., size=...)`**: Used to style the marker part of a `go.Scatter` trace.
- **`fig.update_layout()`**: A Plotly method to modify various aspects of the figure's layout, such as titles, axis labels, and category order on axes.
- **`yaxis=dict(categoryorder='array', categoryarray=...)`**: Used within `update_layout` to specify a custom order for categories on the y-axis.