# How can we clearly communicate GDP trends using visualizations?

## Goals



By the end of this case, you should have:

* Gained more hands-on experience with creating Python plots
* Learned how to select an appropriate plot type or other way to visualize specific kinds of data
* Learned how to make professional visualizations

## Case Introduction

**Business Context.** You are an employee of GoodEconomists, an NGO that campaigns for raising living standards globally. With limited financial resources, GoodEconomists does not have a large budget for outreach and marketing. Instead, they rely on publishing content with compelling stories and excellent visuals to have their voice heard. You have been asked to work on data presentation and graphs for their upcoming report on [GDP](https://en.wikipedia.org/wiki/Gross_domestic_product) and its implications.

**Business Problem.** Your manager has asked you to answer the following question: **"How can we clearly communicate the strong correlations that GDP has to other factors?"**

**Analytical Context.** The relevant data is a pair of public CSV files containing information on GDP for different countries. The first one is on GDP and life expectancy and is called `data/gdp_data.csv`. It only has five columns, but we can extract quite a lot of information from these:

* **Country:** The name of the country (`str`)
* **continent:** The name of the continent (`str`)
* **lifeExp:** The average life span of a person in that country in years (`float`)
* **pop:** The population of the country (`int`)
* **gdpPercap:** The GDP per capita, or the average economic output of each person in the country (`float`)

We also have a dataset `data/gdp_data_years.csv`. It is similar, but formatted differently. Each column is a year so we can see GDP trends over time. It does not include population size and life expectancy data:

* **Country Name:** The name of the country (`str`)
* **Country Code:** The three-letter code for the country (`str`)
* **1960:** The GDP value in 1960 in USD (`float`)
* **`<all other columns>`:** the same as 1960 but for the respective year

We will use what we know about Python, `pandas`, `matplotlib`, and `seaborn` and focus on finding the best way to visualize the relevant data.

## Loading the datasets and setting up



Let's set up the plotting libraries and load the datasets:

In [None]:
%matplotlib inline

import pandas as pd
from matplotlib import pyplot as plt

# increase the size and quality of any plots we produce
import matplotlib as mpl

mpl.rcParams["figure.dpi"] = 300

In [None]:
# read the first dataset with pandas
df = pd.read_csv("data/gdp_data.csv")
df.head()

In [None]:
df_years = pd.read_csv("data/gdp_data_years.csv")
df_years.head()

## Data & chart types

Nearly any dataset that you want to present or visualize will consist of variables that can be thought of in different ways. Recall from previous cases that we have both **categorical** and **numerical** variables. For example, in our dataset the variable `continent` is a **categorical** variable, because we can categorize countries by which continent they belong to, yet this categorization isn't inherently comparable - we don't know which order the continents go in (though we could invent a way, such as by total population or surface area).

We also have the variable `pop` (population), which is **numerical**. We can clearly order countries by population.
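As a rough check, `pandas` can separate the two kinds of variables by data type. The sketch below uses a small made-up table (the values are illustrative and not taken from `data/gdp_data.csv`; `toy` is a hypothetical name). Treat the result as a heuristic only: a numeric column such as an ID code can still be categorical in meaning.

```python
import pandas as pd

# Toy rows with made-up values; only the column names mirror the real dataset
toy = pd.DataFrame(
    {
        "continent": ["Asia", "Europe", "Asia"],
        "lifeExp": [70.1, 80.2, 65.3],
        "pop": [1_000_000, 2_000_000, 3_000_000],
        "gdpPercap": [5000.0, 30000.0, 2000.0],
    }
)

# Columns with a numeric dtype are numerical; here, everything else is treated as categorical
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```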

While there are hundreds of different charts and variations of charts, the main charts we will focus on are:

* Table
* Bar chart
* Pie chart
* Scatterplot
* Line chart
* Box plot

You should already be familiar with these from previous cases. Choosing which of these is most appropriate usually comes down to both what kind of data you have (categorical or numerical or both) and what kind of message you are trying to communicate. For example, you might want to show how your data changes over time, how it is distributed, or how two groups compare to each other. It's very common to see others misuse chart types, or present data badly even within a properly chosen chart.

## GDP, population, and life expectancy

### Exercise 1

Look at the population of each continent (by summing the population of each country in that continent). Which continents are the most and least populous? Let's try visualizing the relative total population of each continent using:

* A table
* A bar chart
* A pie chart

Use `pandas` and the `.plot()` method on DataFrames for this. Make the images as meaningful as possible for your readers, so ensure that you add titles, descriptions, and any other important information to each visualization. Here are some guidelines to keep in mind:

* The title of each plot should explain the *message* of the visualization, not just a description of the data. So instead of "Population by Continent", try to find the story behind the data and use that as the title of each graphic.
* Use labels and legends as appropriate

Explain the advantages and disadvantages of each approach.

**Solution:** One possible answer is shown below:

In [None]:
print("The population of each continent")
df.groupby(["continent"])["pop"].sum().reset_index(name="Total Population").sort_values(
    by="Total Population", ascending=False
)

In [None]:
ax = (
    df.groupby(["continent"])["pop"]
    .sum()
    .reset_index(name="Total Population")
    .sort_values(by="Total Population", ascending=False)
    .plot.bar(x="continent", y="Total Population")
)

ax.set_xlabel("Continent")
ax.set_ylabel("Total Population (billion)")
_ = ax.set_title("Asia is the most populous continent")

In [None]:
df_pie = (
    df.groupby(["continent"])["pop"]
    .sum()
    .reset_index(name="Total Population")
    .sort_values(by="Total Population", ascending=False)
)
ax = df_pie.plot.pie(y="Total Population", labels=df_pie.continent, autopct="%1.2f%%")
_ = ax.set_title("Asia is the most populous continent")

Advantages and disadvantages:

* The table is the most precise format: we can see the exact numbers for each continent, but it's difficult to quickly see what the data means
* The bar chart makes it easy to compare continents in pairs; for example, "Africa" and "Americas" are very close, but in the bar chart we can see that the Africa bar is a bit higher. It's harder to make non-pairwise comparisons, though, like how Asia compares to all of the other continents combined
* The pie chart makes it easy to see that Asia has more than half of the world's population, and easier to see that we are comparing parts (continents) of a whole (the world), but it's harder to judge the relative difference between Africa and the Americas or Africa and Europe (though the percentage labels help with this). Also, thin slices like "Oceania" look messy and might disappear completely. In the example shown, we lose the information on total numbers and only have percent values instead.
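One way to recover some of the table's precision in a bar chart is to print the value on each bar. This is a minimal sketch under stated assumptions: the totals are made up, and `toy_pop` and the continent names are hypothetical. It also rescales the values to billions so that the axis label matches the tick values.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Made-up totals, for illustration only (not the real continent populations)
toy_pop = pd.Series(
    {"Continent A": 3.8e9, "Continent B": 0.9e9, "Continent C": 0.6e9},
    name="Total Population",
)

# Convert to billions so that the axis label and the tick values agree
toy_pop_billions = toy_pop / 1e9

fig, ax = plt.subplots()
toy_pop_billions.plot.bar(ax=ax)

# Write the value on top of each bar (Axes.bar_label needs matplotlib 3.4 or newer)
ax.bar_label(ax.containers[0], fmt="%.1f")

ax.set_xlabel("Continent")
ax.set_ylabel("Total population (billions)")
_ = ax.set_title("Continent A is the most populous")
```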

Overall, the bar chart is the most appropriate visualization for this data. Most people reach for pie charts, but pie charts are overused and often not a great way to present data. If the only message that we wanted to convey was that Asia has more than half the world's population, then the pie chart is acceptable, but in that case we should go for a much simpler version showing only Asia vs. the rest of the world, as shown below. But here, we want to know which continents are the most and least populous, not whether or not Asia has more than half of the world's population, and this is much more difficult to do with the pie chart without the percentages written there.

In general, displaying extra data that doesn't add to your core message is distracting and makes the visualization worse, so it's important to be very sure about what message you want to convey, and then choose exactly what you want to show.

In [None]:
asia_df = df.copy()
asia_df["is_asia"] = [
    "Asia" if x == "Asia" else "Rest of World" for x in df["continent"]
]

# aggregate first so that there is exactly one label per slice
asia_pop = asia_df.groupby(["is_asia"])["pop"].sum().reset_index(name="Population")

_ = asia_pop.plot.pie(y="Population", labels=asia_pop["is_asia"], autopct="%1.f%%")

### Exercise 2

Now that we've taken a quick look at the dataset and population by continent, let's start looking at the data we actually came for: the GDP of each country. We want to know how GDP affects life expectancy. You have a hypothesis that people in richer countries live longer.

Create a scatterplot of GDP per capita vs. life expectancy. Then, create another one where points have been *colored* according to which continent they are on. Explain if your hypothesis is correct or not and any other interesting trends that you notice. Make additional plots to explain further aspects of the data as necessary.

Once again, pay close attention to your labels and titles as you try to convey a lot of information in a relatively small space.

Why is a scatterplot appropriate for this task?

**Solution.** One possible answer is shown below:

In [None]:
ax = df.plot.scatter(x="gdpPercap", y="lifeExp")
ax.set_xlabel("GDP per capita (USD)")
ax.set_ylabel("Life expectancy (years)")
_ = ax.set_title("Countries with a higher GDP have a higher life expectancy")

In [None]:
from matplotlib import pyplot as plt

colors = {
    "Africa": "green",
    "Asia": "orange",
    "Americas": "red",
    "Europe": "purple",
    "Oceania": "blue",
}

_, ax = plt.subplots()
for key, group in df.groupby("continent"):
    group.plot.scatter(ax=ax, x="gdpPercap", y="lifeExp", label=key, color=colors[key])

ax.set_xlabel("GDP per capita (USD)")
ax.set_ylabel("Life expectancy (years)")
_ = ax.set_title(
    "Africa's countries are almost all low GDP compared to Europe and the Americas"
)

The cluster of green (Africa) in the bottom left is interesting. It also doesn't look like the African countries with higher GDP (more to the right) necessarily have a higher life expectancy (more to the top). Let's take a closer look just at Africa:

In [None]:
df_africa = df[df.continent == "Africa"]
ax = df_africa.plot.scatter(x="gdpPercap", y="lifeExp")

ax.set_xlabel("GDP per capita (USD)")
ax.set_ylabel("Life expectancy (years)")
_ = ax.set_title(
    "In Africa the GDP/Life expectancy correlation is very different from globally"
)

We can check the actual correlations of these for three groups:
    
* Globally
* Africa
* Everywhere except Africa

In [None]:
df_notafrica = df[df.continent != "Africa"]

print(df["gdpPercap"].corr(df["lifeExp"]))
print(df_africa["gdpPercap"].corr(df_africa["lifeExp"]))
print(df_notafrica["gdpPercap"].corr(df_notafrica["lifeExp"]))

We can see that the correlation is much weaker in Africa, and gets stronger globally if we remove Africa. Let's now look at the correlations for each continent:

In [None]:
continents = list(set(df.continent))
gdp_lifeexp_corr = []

for continent in continents:
    continent_df = df[df.continent == continent]
    corr = continent_df["gdpPercap"].corr(continent_df["lifeExp"])
    gdp_lifeexp_corr.append(corr)

continent_df = pd.DataFrame(
    list(zip(continents, gdp_lifeexp_corr)), columns=["Continent", "GDP_LifeExp"]
)

continent_df
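The loop above can also be written as a single `groupby` expression. The following is a sketch on made-up data, not the case's dataset; `toy` and `corr_by_continent` are hypothetical names, and the two groups are constructed so that one correlation is perfectly positive and the other perfectly negative.

```python
import pandas as pd

# Made-up values: group "A" rises with GDP, group "B" falls with GDP
toy = pd.DataFrame(
    {
        "continent": ["A", "A", "A", "B", "B", "B"],
        "gdpPercap": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        "lifeExp": [50.0, 60.0, 70.0, 70.0, 65.0, 60.0],
    }
)

# Compute the Pearson correlation within each group, one row per continent
corr_by_continent = (
    toy.groupby("continent")[["gdpPercap", "lifeExp"]]
    .apply(lambda g: g["gdpPercap"].corr(g["lifeExp"]))
    .rename("GDP_LifeExp")
    .reset_index()
)

print(corr_by_continent)
```

Because the result is keyed by continent name, rows can later be filtered by name instead of by position.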

Oceania only has two countries in it (Australia and New Zealand) so the correlation isn't very meaningful. Let's drop it and create a bar plot showing the different correlation strengths of the other continents:

In [None]:
# select by name rather than by position, since the row order is not guaranteed
continent_df_no_oceania = continent_df[continent_df["Continent"] != "Oceania"]

ax = continent_df_no_oceania.sort_values(by="GDP_LifeExp").plot.bar(
    x="Continent", y="GDP_LifeExp"
)

ax.set_xlabel("Continent")
ax.set_ylabel("Correlation between GDP per capita and life expectancy")
_ = ax.set_title("Africa has the weakest correlation and Europe has the strongest")

A scatterplot is appropriate for this task because a scatterplot allows you to look at the relationship between two numerical variables, which is what we have here.

## GDP trends over time

So far, we've looked at data for a single point in time, so we have used bar charts and scatterplots to present it. Our second dataset includes data collected from the 1960s to the 2010s, so we will naturally turn towards line graphs.

Note that the format of one column per year is a bit tricky to work with. You will probably need the [`transpose()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html) method in `pandas`, which is also available as the shortcut `df.T` (on a DataFrame called `df`).
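To see what transposing does to this layout, here is a small sketch on made-up values arranged like `data/gdp_data_years.csv`. The names `wide` and `by_year` and the country entries are hypothetical.

```python
import pandas as pd

# Made-up values in the same wide layout: one row per country, one column per year
wide = pd.DataFrame(
    {
        "Country Name": ["Country A", "Country B"],
        "Country Code": ["AAA", "BBB"],
        "1960": [1.0, 2.0],
        "1961": [1.5, 2.5],
    }
)

# Index by country name and drop the code column, then transpose:
# the result has one row per year and one column per country
by_year = wide.drop(columns=["Country Code"]).set_index("Country Name").T
by_year.index.name = "Year"

print(by_year)
```

Setting the country name as the index before transposing keeps the year columns numeric, which makes the result easy to pass to `.plot.line()`.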

### Exercise 3

You remember that GDP has been growing over time, but you wonder by how much. Ignoring the country-level GDP breakdown for now, compute the total world GDP we have on record for each year and plot it. Explain why the plot you chose is appropriate. As always, take care to use good labels. What is one problem you might encounter with just summing up all the country-level GDPs to get the world GDP?

**Answer.** One possible solution is shown below:

In [None]:
df_gdponly = df_years.drop(["Country Name", "Country Code"], axis=1).sum()
ax = df_gdponly.plot.line()
ax.set_title("GDP has increased over time")

ax.set_xlabel("Year")
_ = ax.set_ylabel("GDP - ten trillions (USD)")

Note that this analysis is somewhat flawed: summing treats missing country-level GDP values as zero, even though a missing value most likely reflects a gap in data collection rather than an absence of economic output. Furthermore, data collection has probably improved over time, so more countries contribute to the totals in later years and our plot will likely overestimate the true rate of world GDP growth. Nonetheless, it should still show the general trend.
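The effect of missing values on a sum can be checked directly. The sketch below uses made-up numbers (`toy_years` is a hypothetical table) and counts how many countries actually contribute to each yearly total, which is one way to judge how far the totals can be trusted.

```python
import numpy as np
import pandas as pd

# Made-up values: Country B has no data for 1960, and nobody reported in 1959
toy_years = pd.DataFrame(
    {
        "1959": [np.nan, np.nan],
        "1960": [1.0, np.nan],
        "1961": [1.5, 2.5],
    },
    index=["Country A", "Country B"],
)

# By default, sum() skips missing values, so an all-missing year silently becomes 0
total_default = toy_years.sum()

# min_count=1 keeps the total as NaN when no country reported at all
total_strict = toy_years.sum(min_count=1)

# Number of countries that actually contribute to each yearly total
countries_reporting = toy_years.notna().sum()

print(pd.DataFrame({"total": total_strict, "countries_reporting": countries_reporting}))
```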

A line plot is appropriate as we are showing change over time for a variable that constantly changes. We have the years plotted on the x-axis from left to right, and the GDP value on the y-axis as this is what people are used to. We clearly labeled our axes with descriptions of what they show, including units and currency, and our title summarizes the overall message of the data, as opposed to just leaving people to draw our intended conclusion from the chart only.

### Exercise 4

People are often more interested in specific countries than in the world as a whole. Pick any five countries that you think might be interesting from the list and create a graph to compare how their GDPs have evolved over time. Explain any trends that you see.

**Answer.** One possible solution is given below:

In [None]:
countries = ["USA", "CHN", "JPN", "IND", "BRA"]

# keep the five countries, index them by name, and transpose so that each row is a year
df_country = (
    df_years[df_years["Country Code"].isin(countries)]
    .drop(["Country Code"], axis=1)
    .set_index("Country Name")
    .T
)
ax = df_country.plot.line()
ax.set_title("The US and China lead GDP Growth")
ax.set_xlabel("Year")
_ = ax.set_ylabel("GDP - ten trillions (USD)")


The increase in GDP of China and the USA dwarfs that of the other countries that we looked at. Japan's GDP looked like it would match that of the US until the mid-1990s, at which point it flattened out. China's GDP growth started later, but has quickly overtaken other countries.

One thing that is very important to note is that this graph only shows absolute changes in GDP over time, not percentage changes. Because the United States was starting from a much higher base in 1960, any future changes are magnified, even if the percentage growth per year were the same as other countries. If we wanted to show percentage changes over time, we would need to modify our plot.
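One possible modification is to rescale every country so that its first year equals 100, which puts all lines on a common relative scale. This is a sketch on made-up values; `toy_gdp`, `indexed`, and `yearly_growth` are hypothetical names.

```python
import pandas as pd

# Made-up GDP values, one row per year and one column per country
toy_gdp = pd.DataFrame(
    {"Country A": [100.0, 110.0, 121.0], "Country B": [10.0, 12.0, 15.0]},
    index=["1960", "1961", "1962"],
)

# Rescale each country so that its first year equals 100
indexed = toy_gdp.div(toy_gdp.iloc[0]) * 100

# Year-over-year growth in percent (the first year has no previous value, so it is NaN)
yearly_growth = toy_gdp.pct_change() * 100

print(indexed.round(1))
print(yearly_growth.round(1))
```

In this toy example Country B is ten times smaller in absolute terms, yet its indexed line ends higher (150 versus 121), which is exactly the information an absolute plot hides. Calling `indexed.plot.line()` would draw the rescaled series.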

### Exercise 5

Line charts are a good way to show movement over time, but we "lost" a lot of data in the above exercises by looking only at total GDP growth across all countries or at a handful of selected countries. We did not really look at the individual GDPs of most countries. Of course, this can be quite cumbersome as there are a lot of countries!

A good tradeoff between looking at the full distribution of a dataset and looking only at an aggregate is to use a [box plot](https://en.wikipedia.org/wiki/Box_plot). Create two box plots for all the years from 2005 onwards that summarize the distribution of GDPs for the 20 countries with the highest GDP.

1. First, find the countries with the highest GDP, based on the 2019 data, and create a box plot for those
2. Next, create a box plot which excludes the top 3 countries (USA, China, Japan) to make the data for the countries with a lower GDP easier to visualize.

Explain any trends in the data and why the visualization choices you made are appropriate.

**Answer.** One possible solution is shown below:

In [None]:
df_years_top = df_years.sort_values(by="2019", ascending=False).head(20)
years = [str(year) for year in range(2005, 2020)]  # "2005" through "2019"
df_years_top = df_years_top.drop(["Country Code", "Country Name"], axis=1)[years]

ax = df_years_top.boxplot()
ax.set_title("The top 3 GDP countries are obvious outliers from the rest of the top 20\n\n")
ax.set_xlabel("Year")
ax.set_ylabel("GDP - tens of trillions (USD)")

_ = ax.set_xticklabels(years, rotation=90)

In [None]:
df_years_top = df_years.sort_values(by="2019", ascending=False).head(20).tail(17)
years = [str(year) for year in range(2005, 2020)]  # "2005" through "2019"
df_years_top = df_years_top.drop(["Country Code", "Country Name"], axis=1)[years]

ax = df_years_top.boxplot()
ax.set_title("Slow and steady? Are lower GDP countries in the top 20 more stable?\n\n")
ax.set_xlabel("Year")
ax.set_ylabel("GDP - Trillions (USD)")

_ = ax.set_xticklabels(years, rotation=90)

After removing the top 3, it's easier to see that the lower quartile of the top countries is more stable than the top quartile, showing smaller growth or remaining constant, while the top quartile moves up and down more year-to-year.

We clearly labeled the units in our axis labels, including the change in scale from 10 trillion to 1 trillion after removing the top countries.
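The quartiles that a box plot draws can also be computed directly, which helps when you want to back up a visual impression with numbers. A minimal sketch on made-up values follows; `toy_top` is a hypothetical table with countries as rows and years as columns.

```python
import pandas as pd

# Made-up GDP values: rows are countries, columns are years
toy_top = pd.DataFrame(
    {"2018": [1.0, 2.0, 3.0, 4.0, 5.0], "2019": [1.0, 2.0, 4.0, 6.0, 9.0]},
    index=["A", "B", "C", "D", "E"],
)

# The 25th, 50th, and 75th percentiles are the bottom edge, middle line, and top edge of each box
quartiles = toy_top.quantile([0.25, 0.5, 0.75])

# The interquartile range (the height of the box) measures the spread for each year
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```

Comparing a quartile row across years shows how much that part of the distribution moves from one year to the next.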

## Conclusions

GDP is a metric that has become controversial because it is a very crude way of measuring economic activity. However, it can provide an interesting lens to look at specific countries and how they relate to each other. After looking at correlations like "people in richer countries live longer" it's tempting to make a causation link (e.g. "if poorer countries raise their GDP, their citizens will live longer"), but correlation is not the same as causation.

We saw that the USA and China dwarf most other countries in terms of GDP, and that Japan's once strong GDP growth has leveled off in the last few decades.

## Takeaways

To present data well, you need to constantly keep two things in mind:

1. Choose the correct way to visualize the data
2. Make the story easy to understand by being careful with the presentation details such as titles and labels

In this case, you saw that:

* Bar charts are good for basic analysis, showing how categorical variables relate to numerical variables. In very specific circumstances, pie charts can also be used for this, namely when you are interested in relative rather than absolute sizes
* Scatterplots are used to show how two numerical variables relate to each other, and are good for looking at correlations
* Line charts are almost always appropriate to show how continuous data changes over time
* Box plots are a great tool if you need to show more than a basic summary (such as an average), but you don't want to plot the full distribution of each variable (which can be cumbersome and overly detailed)

You also saw how important it is to get all of the details right when presenting data. It's very common for people to use wrong or misleading labels, or to use titles that don't add anything to the visualization. Always be careful that you are using the correct units, and clearly labeling what these are, while using the title to summarize the message behind the data.