# Creating a Bar Chart Race Animation with Matplotlib

In this tutorial, you'll learn how to create a bar chart race animation, such as the one below, using the Matplotlib data visualization library in Python.

![bar chart race][0]

### `bar_chart_race` Python package

Along with this tutorial comes the release of the Python package `bar_chart_race`, which automates the process of making these animations. This post explains the procedure from scratch.

### What is a bar chart race?

A bar chart race is an animated sequence of bars that show data values at different moments in time. The bars re-position themselves at each time period so that they remain in order (either ascending or descending).

## Transition bars smoothly between time periods

The trick to making a bar chart race is to transition the bars slowly to their new position when their order changes, allowing you to easily track the movements.

### COVID-19 deaths data

For this bar chart race, we'll use a small dataset produced by Johns Hopkins University containing the total deaths by date for six countries during the currently ongoing coronavirus pandemic. Let's read it in now.

[0]: media/covid19.png

In [2]:
import pandas as pd
# Read the wide-form data, using the parsed dates as the index
df = pd.read_csv('data/covid19.csv', index_col='date', parse_dates=['date'])
df.tail()

             China      USA    Italy       UK    Iran    Spain
date
2020-04-18  4636.0  38671.0  23227.0  15498.0  5031.0  20043.0
2020-04-19  4636.0  40664.0  23660.0  16095.0  5118.0  20453.0
2020-04-20  4636.0  42097.0  24114.0  16550.0  5209.0  20852.0
2020-04-21  4636.0  44447.0  24648.0  17378.0  5297.0  21282.0
2020-04-22  4636.0  46628.0  25085.0  18151.0  5391.0  21717.0


In [6]:
pd.__version__

'1.0.3'

### Must use 'wide' data

For this tutorial, the data must be in 'wide' form where:

* Every row represents a single period of time
* Each column holds the value for a particular category
* The index contains the time component (optional)
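
If your data arrives in 'long' form instead (one row per date and category), a pivot is one way to reach the wide shape described above. This is a minimal sketch with made-up numbers and hypothetical column names (`date`, `country`, `deaths`); it is not a step from the dataset used in this post.

```python
import pandas as pd

# Hypothetical long-form data: one row per (date, country) pair.
# The numbers are made up purely for illustration.
long_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-29', '2020-03-29',
                            '2020-03-30', '2020-03-30']),
    'country': ['A', 'B', 'A', 'B'],
    'deaths': [10, 20, 15, 18],
})

# One row per date, one column per country
wide_df = long_df.pivot(index='date', columns='country', values='deaths')
print(wide_df)
```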

### Individual bar charts for specific dates

Let's begin by creating a single static bar chart for the specific date of March 29, 2020. First, we select the data as a Series.

In [None]:
s = df.loc['2020-03-29']
s

We'll make a horizontal bar chart using the country names as the y-values and total deaths as the x-values (width of bars). Every bar will be a different color from the 'Dark2' colormap.

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4, 2.5), dpi=144)
colors = plt.cm.Dark2(range(6))
y = s.index
width = s.values
ax.barh(y=y, width=width, color=colors);

The function below changes several properties of the axes to make it look nicer.

In [None]:
def nice_axes(ax):
    ax.set_facecolor('.8')
    ax.tick_params(labelsize=8, length=0)
    ax.grid(True, axis='x', color='white')
    ax.set_axisbelow(True)
    for spine in ax.spines.values():
        spine.set_visible(False)
    
nice_axes(ax)
fig

### Plot three consecutive days ordering the bars

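For a bar chart race, the bars are often ordered from largest to smallest with the largest at the top. Here, we plot three days of data, sorting each day's values first. `sort_values` sorts in ascending order and `barh` draws the first value at the bottom, so the largest value ends up at the top.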

In [None]:
fig, ax_array = plt.subplots(nrows=1, ncols=3, figsize=(7, 2.5), dpi=144, tight_layout=True)
dates = ['2020-03-29', '2020-03-30', '2020-03-31']
for ax, date in zip(ax_array, dates):
    s = df.loc[date].sort_values()
    ax.barh(y=s.index, width=s.values, color=colors)
    ax.set_title(date, fontsize='smaller')
    nice_axes(ax)

### Countries change color

Although the bars are ordered properly, the countries do not keep their original color when changing places in the graph. Notice that the USA begins as the fifth bar and moves up one position each date, changing colors each time.
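
The color change happens because colors are matched to bars by position, not by name. The sketch below reproduces the effect without any plotting, using made-up values and placeholder color names.

```python
import pandas as pd

# Made-up values for two days; the color names are placeholders
colors = ['green', 'orange', 'purple']
day1 = pd.Series({'A': 30, 'B': 10, 'C': 20})
day2 = pd.Series({'A': 30, 'B': 25, 'C': 20})

color_maps = []
for s in (day1, day2):
    s_sorted = s.sort_values()
    # The i-th color goes to the i-th bar, so sorting reshuffles the pairing
    color_maps.append(dict(zip(s_sorted.index, colors)))

print(color_maps[0])
print(color_maps[1])
```

Category B is paired with the first color on the first day and with the second color on the next, even though the color list never changed.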

### Don't sort - rank instead!

Instead of sorting, use the `rank` method to find the numeric ranking of each country for each day. We use the `'first'` method of ranking so that each numeric rank is a unique integer. The default method, `'average'`, assigns tied values the same rank, which would cause their bars to overlap. Let's see the ranking for March 29, 2020.

In [None]:
df.loc['2020-03-29'].rank(method='first')
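
To see the difference between the two ranking methods, here is a small sketch on a made-up Series that contains a tie.

```python
import pandas as pd

# Made-up values with a tie between B and C
s = pd.Series({'A': 5, 'B': 7, 'C': 7, 'D': 9})

avg_rank = s.rank()                  # default 'average': B and C share rank 2.5
first_rank = s.rank(method='first')  # ties broken by order of appearance

print(avg_rank.tolist())    # [1.0, 2.5, 2.5, 4.0]
print(first_rank.tolist())  # [1.0, 2.0, 3.0, 4.0]
```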

We now use this rank as the y-values. The order of the data in the Series never changes this way, ensuring that countries remain the same color regardless of their rank.

In [None]:
fig, ax_array = plt.subplots(nrows=1, ncols=3, figsize=(7, 2.5), dpi=144, tight_layout=True)
dates = ['2020-03-29', '2020-03-30', '2020-03-31']
for ax, date in zip(ax_array, dates):
    s = df.loc[date]
    y = df.loc[date].rank(method='first').values
    ax.barh(y=y, width=s.values, color=colors, tick_label=s.index)
    ax.set_title(date, fontsize='smaller')
    nice_axes(ax)

### How to smoothly transition?

Using each day as a single frame in an animation won't work well as it doesn't capture the transition from one time period to the next. In order to transition the bars that change positions, we'll need to add extra rows of data between the dates that we do have. Let's first select the three dates above as a DataFrame.

In [None]:
df2 = df.loc['2020-03-29':'2020-03-31']
df2

It's easier to insert an exact number of new rows when using the default index - integers beginning at 0. Alternatively, if you do have a datetime in the index as we do here, you can use the `asfreq` method, which is explained at the end of this post. Use the `reset_index` method to get a default index and to place the dates as a column again.

In [None]:
df2 = df2.reset_index()
df2

### Choose number of steps between each date

We want to insert new rows between the first and second rows and between the second and third rows. Begin by multiplying the index by the number of steps to transition from one time period to the next. We use 5 in this example.

In [None]:
df2.index = df2.index * 5
df2

### Expand DataFrame with `reindex`

To insert the additional rows, pass the `reindex` method a sequence of all integers from 0 through the last index value (10 in this case). pandas inserts a new row of missing values for every index label not in the current DataFrame.

In [None]:
last_idx = df2.index[-1] + 1
df_expanded = df2.reindex(range(last_idx))
df_expanded
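
The same two steps (multiply the index, then reindex) can be tried in isolation. This sketch uses a made-up two-row DataFrame so the inserted rows are easy to count.

```python
import pandas as pd

# Made-up two-row frame standing in for two time periods
toy = pd.DataFrame({'A': [10, 20], 'B': [5, 25]})
steps = 5

toy.index = toy.index * steps                     # index becomes 0, 5
expanded = toy.reindex(range(toy.index[-1] + 1))  # adds rows 1-4, all missing
print(expanded)
```

Two original rows become six: the two known rows plus four empty rows between them.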

The date is missing in each of the new rows. Let's fill it in with the last known date using the `ffill` method and set it as the index again.

In [None]:
df_expanded['date'] = df_expanded['date'].ffill()
df_expanded = df_expanded.set_index('date')
df_expanded

### Rank each row

We also need a similar DataFrame that contains the rank of each country by row. Most pandas methods work down each column by default. Set `axis` to 1 to change the direction of the operation so that values in each row are ranked against each other.

In [None]:
df_rank_expanded = df_expanded.rank(axis=1, method='first')
df_rank_expanded
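
As a quick check of what `axis` changes, the sketch below ranks a made-up DataFrame in both directions.

```python
import pandas as pd

# Made-up values: rows are time periods, columns are categories
toy = pd.DataFrame({'A': [10, 40], 'B': [30, 50]})

by_col = toy.rank(axis=0)                  # each column ranked down its rows
by_row = toy.rank(axis=1, method='first')  # each row ranked across its columns

print(by_col)
print(by_row)
```

Only the row-wise version answers the question a bar chart race asks: which category is largest at this moment in time.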

### Linear interpolate missing values

The `interpolate` method can fill in the missing values in a variety of ways. By default, it uses linear interpolation and works column-wise.

In [None]:
df_expanded = df_expanded.interpolate()
df_expanded
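
Linear interpolation simply divides the gap between two known values into equal steps. A minimal sketch on a made-up Series with four missing values between the endpoints:

```python
import pandas as pd

# Made-up endpoints with four missing steps in between
s = pd.Series([100, None, None, None, None, 150], dtype='float64')
filled = s.interpolate()

print(filled.tolist())
```

The gap of 50 is split into five equal steps of 10, giving 110, 120, 130 and 140 for the missing rows.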

We also need to interpolate the ranking.

In [None]:
df_rank_expanded = df_rank_expanded.interpolate()
df_rank_expanded
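
Interpolating ranks is what lets two bars slide past each other. In this sketch, two made-up categories trade ranks 1 and 2 over five steps.

```python
import pandas as pd

# Made-up ranks for two categories that swap places
ranks = pd.DataFrame({'A': [1, None, None, None, None, 2],
                      'B': [2, None, None, None, None, 1]}, dtype='float64')
smooth = ranks.interpolate()
print(smooth)
```

The fractional ranks in between (1.2, 1.4, and so on) become the intermediate y-positions of the bars.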

### Plot each step of the transition

The interpolated ranks will serve as the new position of the bars along the y-axis. Here, we'll plot each step from the first to the second day where Iran and the USA change places.

In [None]:
fig, ax_array = plt.subplots(nrows=1, ncols=6, figsize=(12, 2), 
                             dpi=144, tight_layout=True)
labels = df_expanded.columns
for i, ax in enumerate(ax_array.flatten()):
    y = df_rank_expanded.iloc[i]
    width = df_expanded.iloc[i]
    ax.barh(y=y, width=width, color=colors, tick_label=labels)
    nice_axes(ax)
ax_array[0].set_title('2020-03-29')
ax_array[-1].set_title('2020-03-30');

The next day's transition is plotted below.

In [None]:
fig, ax_array = plt.subplots(nrows=1, ncols=6, figsize=(12, 2), 
                             dpi=144, tight_layout=True)
labels = df_expanded.columns
for i, ax in enumerate(ax_array.flatten(), start=5):
    y = df_rank_expanded.iloc[i]
    width = df_expanded.iloc[i]
    ax.barh(y=y, width=width, color=colors, tick_label=labels)
    nice_axes(ax)
ax_array[0].set_title('2020-03-30')
ax_array[-1].set_title('2020-03-31');

### Write a function to prepare all of the data

We can copy and paste the code above into a function to automate the process of preparing any data for the bar chart race. Then use it to create the two final DataFrames needed for plotting.

In [None]:
def prepare_data(df, steps=5):
    df = df.reset_index()
    df.index = df.index * steps
    last_idx = df.index[-1] + 1
    df_expanded = df.reindex(range(last_idx))
    df_expanded['date'] = df_expanded['date'].ffill()
    df_expanded = df_expanded.set_index('date')
    df_rank_expanded = df_expanded.rank(axis=1, method='first')
    df_expanded = df_expanded.interpolate()
    df_rank_expanded = df_rank_expanded.interpolate()
    return df_expanded, df_rank_expanded

df_expanded, df_rank_expanded = prepare_data(df)
df_expanded.head()

In [None]:
df_rank_expanded.head()

## Animation

We are now ready to create the animation. Each row represents a single frame in our animation and slowly transitions each bar's y-value location and width from one day to the next.

The simplest way to do animation in matplotlib is to use `FuncAnimation`. You must define a function that updates the matplotlib axes object each frame. Because the axes object keeps all of the previous bars, we remove them in the beginning of the `update` function. The rest of the function is identical to the plotting from above. This function will be passed the index of the frame as an integer. We also set the title to have the current date.

Optionally, you can define a function that initializes the axes. Below, `init` clears the axes of all objects and then resets its nice properties.

Pass the figure (containing your axes), the `update` and `init` functions, and number of frames to `FuncAnimation`. We also pass the number of milliseconds between each frame, which is used when creating HTML. We use 100 milliseconds per frame equating to 500 per day (half of a second).
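
The relationship between steps, frames, and playback time is simple arithmetic. The sketch below works it out for the three-day example, assuming 5 steps per day and 100 milliseconds per frame.

```python
# Assumed inputs for illustration: three daily periods, as in the small example
n_periods = 3
steps = 5            # frames per transition between two periods
interval_ms = 100    # delay between consecutive frames

n_frames = (n_periods - 1) * steps + 1
ms_per_period = steps * interval_ms
total_seconds = (n_frames - 1) * interval_ms / 1000

print(n_frames, ms_per_period, total_seconds)  # 11 500 1.0
```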

The figure and axes are created separately below so they do not get output in a Jupyter Notebook, which automatically happens if you call `plt.subplots`.

In [None]:
from matplotlib.animation import FuncAnimation

def init():
    ax.clear()
    nice_axes(ax)
    ax.set_ylim(.2, 6.8)

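def update(i):
    # Remove the previous frame's bars; iterate over a copy because
    # removing a container also deletes it from ax.containers
    for bar in list(ax.containers):
        bar.remove()
    y = df_rank_expanded.iloc[i]
    width = df_expanded.iloc[i]
    ax.barh(y=y, width=width, color=colors, tick_label=labels)
    # Build the date string by hand: the '%-d' strftime code is not supported on Windows
    date = df_expanded.index[i]
    date_str = f'{date:%B} {date.day}, {date.year}'
    ax.set_title(f'COVID-19 Deaths by Country - {date_str}', fontsize='smaller')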
    
fig = plt.Figure(figsize=(4, 2.5), dpi=144)
ax = fig.add_subplot()
anim = FuncAnimation(fig=fig, func=update, init_func=init, frames=len(df_expanded), 
                     interval=100, repeat=False)
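
To see why the old bars need to be removed each frame, this small sketch draws two sets of made-up bars on the same axes and counts the containers before and after removal.

```python
from matplotlib.figure import Figure

fig = Figure()
ax = fig.add_subplot()

# Each barh call adds one BarContainer to the axes (values are made up)
ax.barh(y=[1, 2], width=[3, 4])
ax.barh(y=[1, 2], width=[5, 6])
n_before = len(ax.containers)

# Iterate over a copy, since removing a container shrinks ax.containers
for container in list(ax.containers):
    container.remove()
n_after = len(ax.containers)

print(n_before, n_after)  # 2 0
```

Without the removal step, every frame would draw on top of all earlier frames.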

### Return animation HTML or save to disk

Call the `to_html5_video` method to return the animation as an HTML string and then embed it in the notebook with help from the `IPython.display` module.

In [None]:
from IPython.display import HTML
html = anim.to_html5_video()
HTML(html)

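You can save the animation to disk as an mp4 file using the `save` method. Since we have an `init` function, we don't have to worry about clearing our axes and resetting the limits; `FuncAnimation` calls it for us before drawing the first frame. With matplotlib's default settings, both `to_html5_video` and saving to mp4 rely on ffmpeg being installed.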

In [None]:
anim.save('media/covid19.mp4')
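
If ffmpeg is not available, one alternative is to write a GIF with matplotlib's `PillowWriter`. The sketch below is a stand-in, not the animation from this post: the bars are made up, the file name `demo.gif` is hypothetical, and the file is written to a temporary directory.

```python
import os
import tempfile

from matplotlib.animation import FuncAnimation, PillowWriter
from matplotlib.figure import Figure

fig = Figure(figsize=(3, 2), dpi=72)
ax = fig.add_subplot()

def update(i):
    # Redraw two made-up bars whose widths change with the frame number
    ax.clear()
    ax.barh(y=[1, 2], width=[i + 1, 3 - i], color=['C0', 'C1'])
    ax.set_xlim(0, 4)

anim = FuncAnimation(fig, update, frames=3, interval=100, repeat=False)

# Hypothetical output location inside a temporary directory
gif_path = os.path.join(tempfile.mkdtemp(), 'demo.gif')
anim.save(gif_path, writer=PillowWriter(fps=10))
print(os.path.getsize(gif_path) > 0)
```

GIF files are usually larger than mp4 files of the same animation, so this is a convenience for sharing rather than a replacement.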

## Using `bar_chart_race`

I created the `bar_chart_race` Python package to automate this process. It creates bar chart races from wide pandas DataFrames. Install with `pip install bar_chart_race`.

In [None]:
import bar_chart_race as bcr
html = bcr.bar_chart_race(df.iloc[30:40], 'a.gif', steps_per_period=5,
                          figsize=(4, 2.5), title='COVID-19 Deaths by Country',
                          period_label_size=12)
HTML(html)

## Using the `asfreq` method

If you are familiar with pandas, you might know that the `asfreq` method can be used to insert new rows. Let's select the last three days of March again to show how it works.

In [None]:
df2 = df.loc['2020-03-29':'2020-03-31']
df2

Inserting new rows is actually easier with `asfreq`. We just need to supply it a date offset that divides evenly into 24 hours. Here, we insert a new row every 6 hours.

In [None]:
df2.asfreq('6h')
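
Here is the same idea on a made-up two-day DataFrame, which makes the inserted timestamps easy to count.

```python
import pandas as pd

# Made-up two-day frame with a daily DatetimeIndex
idx = pd.date_range('2020-03-29', periods=2, freq='D')
toy = pd.DataFrame({'A': [10.0, 20.0]}, index=idx)

# Rows at 00:00, 06:00, 12:00, 18:00 and 00:00 the next day
expanded = toy.asfreq('6h')
filled = expanded.interpolate()
print(filled)
```

A 6-hour offset yields four steps per day, so the offset you choose determines the number of frames per period.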

Inserting a specific number of rows is a little trickier, but it is possible by first creating a date range, which allows you to specify the total number of periods. You must calculate that number yourself.

In [None]:
num_periods = (len(df2) - 1) * 5 + 1
dr = pd.date_range(start='2020-03-29', end='2020-03-31', periods=num_periods)
dr

Then pass this date range to `reindex` to achieve the same result.

In [None]:
df2.reindex(dr)

We can use this procedure on all of our data.

In [None]:
num_periods = (len(df) - 1) * 5 + 1
dr = pd.date_range(start=df.index[0], end=df.index[-1], periods=num_periods)
df_expanded = df.reindex(dr)
df_rank_expanded = df_expanded.rank(axis=1, method='first').interpolate()
df_expanded = df_expanded.interpolate()
df_expanded.iloc[160:166]

In [None]:
df_rank_expanded.iloc[160:166]

## One line?

It's possible to do all of the analysis in a single ugly line of code.

In [None]:
df_one = df.reset_index() \
           .reindex([i / 5 for i in range(len(df) * 5 - 4)]) \
           .reset_index(drop=True) \
           .pipe(lambda x: pd.concat([x, x.iloc[:, 1:].rank(axis=1)], 
                                     axis=1, keys=['values', 'ranks'])) \
           .interpolate() \
           .ffill() \
           .set_index(('values', 'date')) \
           .rename_axis(index='date')
df_one.head()

## Master Data Analysis with Python

If you are looking for a single, comprehensive resource to master pandas, matplotlib, and seaborn, check out my book [Master Data Analysis with Python][1]. It contains 800 pages and 350 exercises with detailed solutions. If you want to be a trusted source to do data analysis using Python, this book will ensure you get there.

[1]: https://www.dunderdata.com/master-data-analysis-with-python