# Exercise (Day 1): Making your own Population Pyramid Diagram

## What is a Population Pyramid

As our first Python program we will set out to make our own population pyramid to understand the distribution of age in a given country.

```{note}
A population pyramid (age structure diagram) or "age-sex pyramid" is a graphical illustration of the distribution of a population (typically that of a country or region of the world) by age groups and sex; it typically takes the shape of a pyramid when the population is growing.
```

Source: [Wikipedia](https://en.wikipedia.org/wiki/Population_pyramid)

A population pyramid is used to visually explain changes in demography over time.

A pyramid with a wide base (i.e. a high percentage of children) indicates a population that is likely to grow strongly in the future.

The shape of the distribution has many economic implications, for example for:

1. tax policy
2. the impact of age dependency on government spending
3. population growth dynamics


This [Our World in Data article](https://ourworldindata.org/global-population-pyramid) uses one of these diagrams to look at the demography of the world population over time.

<center>
<img src="https://ourworldindata.org/images/published/Population-Pyramid-1950-to-2100_3000.png" alt="drawing" width="600"/>
</center>

<br>

Even if you have never programmed in Python before, work through this exercise: you will understand more than you think by reading the code.

This `jupyterlite` environment lets you execute the code below, so you can make changes and explore Python straight away.

## Exercise #1: Run the following Python code to build a population pyramid for Australia

This exercise will first show you how to construct a population pyramid for Australia in 1960 and 2023. 

You will then be asked to make a population pyramid for either `Benin`, `Côte d'Ivoire`, or `Uganda`. 

All data sourced for this exercise is from the [World Development Indicators](https://datatopics.worldbank.org/world-development-indicators/) database.

---

First we need to import some useful packages.

```{hint}
To execute the code, select the cell below and press `Shift`+`Enter`.
```

In [None]:
%pip install -q seaborn

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import pyodide_http
pyodide_http.patch_all()

Next we need to fetch the data used to build the plot.

There is a saved dataset that contains the share of the population within each age group (e.g. 0-4, 5-9) for Australia in 1960 and 2023.

You can retrieve this data by executing the following cell.

```{hint}
To execute the code, select the cell below and press `Shift`+`Enter`.
```

In [None]:
URL = "https://raw.githubusercontent.com/QuantEcon/africa-summer-course-2024/main/exercises/day-01/datapop_shares_aus.csv"
# Use the pandas package to read the csv from our github repository
plot_data = pd.read_csv(URL)

Let us first take a look at the data that has been imported

In [None]:
plot_data.head()  # the .head() method returns the first 5 rows of the DataFrame

You can see that this dataset contains four columns named `age`, `year`, `female`, and `male`.

Let's take a closer look at the `age` column to understand what that column represents

In [None]:
plot_data['age'].unique()

As you can see it consists of various age groups. 

This is the format we need for building a population pyramid. 

Now let's look at the `year` column.

In [None]:
plot_data.year.unique()

This dataset contains data for two different years, so we will need to filter for either 1960 or 2023 when building our figures.

Let's first collect the data for 1960 into a separate DataFrame.

In [None]:
plot_1960 = plot_data[plot_data.year == 1960]
plot_1960.head()
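
The expression `plot_data.year == 1960` produces one `True`/`False` value per row, and indexing with it keeps only the rows marked `True`. Here is a minimal sketch on a small made-up table (the numbers are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": ["00-04", "05-09", "00-04", "05-09"],
    "year": [1960, 1960, 2023, 2023],
    "male": [5.0, 4.5, 3.0, 3.1],
})

mask = toy["year"] == 1960   # one True/False value per row
toy_1960 = toy[mask]         # keep only the rows where mask is True

print(mask.tolist())
print(toy_1960)
```

The same pattern works for any condition on a column, such as `toy["year"] == 2023`.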

This data is now in the format we need to build a figure.

We will use a Python package called [seaborn](https://seaborn.pydata.org/) to build the visualisation as horizontal bar charts.

In [None]:
# - Make Plot for 1960
ax1 = sns.barplot(x='male', y='age', data=plot_1960, color="blue")
ax2 = sns.barplot(x='female', y='age', data=plot_1960, color="green")
plt.title("Population pyramid for Australia, 1960 estimates")
plt.xlabel("% of population \n Female / Male")
plt.show()

In 1960, Australia had a typical wide-based population pyramid, with children making up a higher percentage of the population than older age groups.

We also have data for 2023 in our dataset, so let us compare the two.

In [None]:
plot_2023 = plot_data[plot_data.year == 2023]

In [None]:
# - Make Plot for 2023
ax3 = sns.barplot(x='male', y='age', data=plot_2023, color="blue")
ax4 = sns.barplot(x='female', y='age', data=plot_2023, color="green")
plt.title("Population pyramid for Australia, 2023 estimates")
plt.xlabel("% of population \n Female / Male")
plt.show()

The shape of this pyramid has changed significantly: the largest cohorts (by percentage) are now between the ages of 30 and 40.

It may be more convenient to see these plots side by side.

In [None]:
import matplotlib.pyplot as plt
figure, axes = plt.subplots(1, 2, figsize=(12,6))
year = [1960, 2023]
index = 0
for axis in axes:
    ax1 = sns.barplot(x='male', y='age', data=plot_data[plot_data['year'] == year[index]], color="blue", ax=axis)
    ax2 = sns.barplot(x='female', y='age', data=plot_data[plot_data['year'] == year[index]], color="green", ax=axis)
    axis.set_xlabel("% of population \n Female / Male")
    axis.set_title(year[index])
    index = index+1
plt.show()

### Question: What else do you observe by comparing these two plots?

---

## Exercise #2: Making your own population pyramid

Above you have been shown how to make a population pyramid for Australia using prepared data. 

In this section we will:

1. Show how this data was retrieved from the World Development Indicators database
2. See how to prepare the data, extracting the information you need
3. Save the data of interest to build a plot for `Benin`, `Côte d'Ivoire`, or `Uganda`.
4. Make your own plot (using the code above)

---

**Step #1:** Getting Data from the World Bank World Development Indicators

One way to get data from the [WDI](http://wdi.worldbank.org) database is to use the [World Bank Databank](https://databank.worldbank.org) website.

We have fetched some data for the countries `Australia`, `Benin`, `Côte d'Ivoire`, and `Uganda` and saved the output as a csv file that we will use in this exercise.

The raw World Bank data looks like this once downloaded.

In [None]:
URL = "https://raw.githubusercontent.com/QuantEcon/africa-summer-course-2024/main/exercises/day-01/wdi-data-exercise-day01-all.csv"
data = pd.read_csv(URL)

In [None]:
data.head()

The dataset is nicely organised, but it is not in the format we need to build a plot.

For a country of interest, we want the dataset to be in the format we saw in Exercise #1 (above):

| age   | year | female    | male     |
|-------|------|-----------|----------|
| 80+   | 1960 | -1.619326 | 0.948079 |
| 75-79 | 1960 | -1.809344 | 1.280289 |
| 70-74 | 1960 | -2.818732 | 2.211589 |
| 65-69 | 1960 | -3.606390 | 2.885782 |
| 60-64 | 1960 | -4.049524 | 3.566932 |

The raw data contains the information we need (for example, the age brackets are embedded in the `Series Name`), but we have to extract the parts we want for our figure.

Python can help with this.

---

**Step #2:** Understand the dataset and extract the information you need for the population pyramid

We can first take a look at what countries are in this dataset

In [None]:
data['Country Code'].unique()

Then we can look at what `Series` are in this dataset

In [None]:
data['Series Name'].unique()

It is nice that we have data for all the age brackets in one DataFrame, but what we would like to do is extract the age brackets (such as "0-4") from this data.

We can use Python string manipulation to achieve this fairly simply by recognising the common pattern in the strings.

Let's look at one string more closely. 

In [None]:
data['Series Name'][0]

We can split the string on the "," character and then replace "Population ages " with nothing to extract the `00-04` text. 

In [None]:
data['Series Name'][0].split(",")[0].replace("Population ages ", "")

We did it! 
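
Broken into separate steps, the one-liner above does the following. This sketch uses a sample string that is assumed to follow the same pattern as the `Series Name` values; the exact text in your data may differ slightly.

```python
# Assumed example of a series name, following the pattern described above
series_name = "Population ages 00-04, female (% of female population)"

parts = series_name.split(",")                     # text before and after the comma
first_part = parts[0]                              # "Population ages 00-04"
age = first_part.replace("Population ages ", "")   # drop the common prefix

print(parts)
print(age)
```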

Because this pattern is consistent we can use python to apply this to all `Series Name` rows.

We will collect the text in a Python List and then add it to our dataset as a new column

In [None]:
age_brackets = []
for sn in data['Series Name']:
    a,b = sn.split(",")
    a = a.replace("Population ages ", "")
    if a == '80 and above':
        a = '80+'
    age_brackets.append(a)

In [None]:
data['age'] = age_brackets

In [None]:
data.head()
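
The same extraction can also be written without an explicit loop by using the pandas `.str` string methods. This is only an alternative sketch, shown on a few sample strings that are assumed to follow the pattern seen above:

```python
import pandas as pd

# Sample series names, assumed to follow the pattern seen above
names = pd.Series([
    "Population ages 00-04, female (% of female population)",
    "Population ages 75-79, male (% of male population)",
    "Population ages 80 and above, male (% of male population)",
])

ages = (
    names.str.split(",").str[0]                              # text before the comma
         .str.replace("Population ages ", "", regex=False)   # drop the common prefix
         .replace("80 and above", "80+")                     # relabel the top bracket
)

print(ages.tolist())
```

The loop version is easier to read when you are starting out, while the `.str` version is more compact once you are comfortable with pandas.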

Now we need a column that records whether each row refers to `male` or `female`.

We can apply similar logic.

In [None]:
gender = []
for sn in data['Series Name']:
    if 'female' in sn:
        gender.append('female')
    elif 'male' in sn:
        gender.append('male')
    else:
        raise ValueError("ERROR: cannot determine gender -- check data source")
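
The order of the checks in the loop above matters: the word "female" contains the word "male", so testing for `'male'` first would label every row as male. A small sketch of this, using a hypothetical helper function `label` and sample strings assumed to follow the same pattern:

```python
name = "Population ages 00-04, female (% of female population)"

print("male" in name)     # "female" contains "male", so this is also a match
print("female" in name)

def label(series_name):
    """Hypothetical helper mirroring the loop above: test 'female' first."""
    if "female" in series_name:
        return "female"
    elif "male" in series_name:
        return "male"
    raise ValueError("cannot determine gender -- check data source")

print(label(name))
print(label("Population ages 00-04, male (% of male population)"))
```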

```{note}
Even though you may have never programmed in Python before you can probably read this code and understand what it is doing.
```

In [None]:
data['gender'] = gender

In [None]:
data.head()

---

**Step #3:** Now we just need to build a small dataset with the data needed for our population pyramid

Let's filter the dataset for Australia using the following code

In [None]:
cntry_data = data[data['Country Code'] == "AUS"]                 # save data for 'AUS' in cntry_data 
plot_data = cntry_data[['1960', '2023', 'age', 'gender']]

In [None]:
plot_data.head()

This isn't quite in the shape we need for the graph.

Let's reshape this dataset so we have years in one column, and the data in their own `male` and `female` columns instead. 

In [None]:
plot_data = plot_data.pivot(index='age', columns='gender')
plot_data.columns.names = ['year', 'gender']

In [None]:
plot_data.head()

**Note:** The pandas package tracks all the relationships between data and their labels for you. 
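
To see what `pivot` does, here is a sketch on a small made-up table with the same column names (the values are hypothetical). Each `age` becomes a row, and the columns get two levels of labels: the year and the gender.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "1960": [5.0, 5.2, 4.0, 4.1],
    "2023": [3.0, 3.1, 3.2, 3.3],
    "age": ["00-04", "00-04", "05-09", "05-09"],
    "gender": ["female", "male", "female", "male"],
})

wide = toy.pivot(index="age", columns="gender")
wide.columns.names = ["year", "gender"]

print(wide.columns.tolist())                  # (year, gender) pairs
print(wide.loc["00-04", ("1960", "male")])    # look up a single value by its labels
```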

In [None]:
plot_data = plot_data.stack(level=0).reset_index()
plot_data['year'] = plot_data['year'].astype(int)      # make sure years are numbers
plot_data['female'] = plot_data['female'] * -1         # put the bar chart on the LHS of axis
plot_data = plot_data.sort_values(by='age', ascending=False)

In [None]:
plot_data.head()
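
The `stack` step moves the year labels out of the columns and into the rows, leaving one `female` and one `male` column. A minimal sketch on a made-up wide table (hypothetical values), which also shows why `female` is multiplied by -1:

```python
import pandas as pd

# Hypothetical wide table with (year, gender) column labels
columns = pd.MultiIndex.from_product(
    [["1960", "2023"], ["female", "male"]], names=["year", "gender"]
)
wide = pd.DataFrame(
    [[5.0, 5.2, 3.0, 3.1], [4.0, 4.1, 3.2, 3.3]],
    index=pd.Index(["00-04", "05-09"], name="age"),
    columns=columns,
)

tall = wide.stack(level=0).reset_index()   # move the year labels into rows
tall["year"] = tall["year"].astype(int)    # "1960" -> 1960
tall["female"] = tall["female"] * -1       # negative values are drawn to the left of zero

print(tall)
```

Because the female shares are negative and the male shares are positive, the two bar charts drawn on the same axis extend in opposite directions from zero, which gives the pyramid its shape.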

**Step #4:** Build the Figure

The data is now in the format we need to build our charts from Exercise #1.

Let's reuse the plotting code from Exercise #1 below.

In [None]:
figure, axes = plt.subplots(1, 2, figsize=(12,6))
year = [1960, 2023]
index = 0
for axis in axes:
    ax1 = sns.barplot(x='male', y='age', data=plot_data[plot_data['year'] == year[index]], color="blue", ax=axis)
    ax2 = sns.barplot(x='female', y='age', data=plot_data[plot_data['year'] == year[index]], color="green", ax=axis)
    axis.set_xlabel("% of population \n Female / Male")
    axis.set_title(year[index])
    index = index+1
plt.show()

## Exercise: Build your own population pyramid by modifying the code above

The country code for each country is:

| Country | Country Code |
|---------|--------------|
| Benin   | BEN |
| Côte d'Ivoire | CIV | 
| Uganda | UGA |

Start with the code from **Step #3** above and modify it to select the country you are interested in.

In [None]:
# You will need to change 'AUS' to another country code here
cntry_data = data[data['Country Code'] == "AUS"]
plot_data = cntry_data[['1960', '2023', 'age', 'gender']]

**Note:** You could also just change the country code in **Step #3** above and re-run the cells.

The great part of building programs is that once you have done the work for one country, you have done the work for any other country you would like to plot.
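
One way to make this reuse explicit is to wrap the Step #3 code in a function. The sketch below is only an illustration: `prepare_country` is a hypothetical helper name, and the toy table stands in for the real `data` DataFrame (its values are made up).

```python
import pandas as pd

def prepare_country(data, code, years=("1960", "2023")):
    """Hypothetical helper: repeat the Step #3 reshaping for any country code."""
    cntry = data[data["Country Code"] == code]
    out = cntry[list(years) + ["age", "gender"]]
    out = out.pivot(index="age", columns="gender")
    out.columns.names = ["year", "gender"]
    out = out.stack(level=0).reset_index()
    out["year"] = out["year"].astype(int)    # make sure years are numbers
    out["female"] = out["female"] * -1       # put female bars on the left
    return out.sort_values(by="age", ascending=False)

# Made-up stand-in for the prepared `data` DataFrame
toy = pd.DataFrame({
    "Country Code": ["BEN", "BEN", "UGA", "UGA"],
    "age": ["00-04", "00-04", "00-04", "00-04"],
    "gender": ["female", "male", "female", "male"],
    "1960": [9.0, 9.1, 9.5, 9.6],
    "2023": [7.5, 7.6, 8.0, 8.1],
})

print(prepare_country(toy, "BEN"))
```

With the real dataset you would call `prepare_country(data, "UGA")`, for example, and pass the result to the plotting code.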

### Exercise #3: See if you can change the plots

1. How might you change the colour of the bar charts?
2. By convention, the population pyramid is often organised with `Male` on the left and `Female` on the right. How could you change the figure to comply with this popular convention? 