# South-Eastern rural and urban populations

# Or ... sharing information with Python and Jupyter notebooks

The objective of this workshop is to give you a flavour of sharing information with Python and Jupyter notebooks.
> Python is, in fact, not really required. The analysis could just as well be done with R and Jupyter. Most of the following would still be available. R offers other opportunities like RMarkdown or Shiny.

Jupyter makes use of Web technologies, namely HTML, JavaScript and CSS. It is therefore possible to use these tools in conjunction with Python for analysing data. We will harness the potential of the notebooks for sharing information.

During the course of this workshop, we will try to use several types of structure to convey an idea: text, tables, spreadsheets (definitely like tables) and charts. For each of these elements, we will try to apply best practices. Details matter!

There are many libraries for building charts in Python. For static charts, [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) are definitely the reference. Beyond static charts, several packages are now available in Python for leveraging interactivity (several of them are wrappers around JavaScript libraries such as [D3.js](https://d3js.org/)). Among them, one can highlight [Bokeh](https://bokeh.pydata.org) and [Altair](https://altair-viz.github.io/). The following is based on [plot.ly](https://plot.ly/python/).
> This choice was not made because plot.ly is superior to the others, but mainly because I have been using the library for several years, got used to it and developed small helpers to design charts the way I want.

In [None]:
import pandas as pd 
from IPython.display import display, Markdown
import datasheets
import plotly as py
import plotly.graph_objs as go
from plotly_layout import *
from tables import align_left

# connected=True loads the plotly.js library from a CDN
# instead of embedding the whole JavaScript library into
# the notebook/HTML. If you want a self-contained notebook,
# you can use connected=False
py.offline.init_notebook_mode(connected=True)

PALETTE = [
    "#4e79a7",
    "#f28e2b",
    "#e15759",
    "#76b7b2",
    "#59a14f",
    "#edc948",
    "#b07aa1",
    "#ff9da7",
    "#9c755f",
    "#bab0ac",]


fonts = dict(ibm_plex='IBM Plex Sans Condensed', roboto='Roboto Condensed')

In [None]:
%%html
<!--Load the fonts: this will be explained later on-->
<style>
    @import url('https://fonts.googleapis.com/css?family=IBM+Plex+Sans+Condensed:400');
    @import url('https://fonts.googleapis.com/css?family=Roboto+Condensed');
</style>

The following analysis is based on the post:

_Hannah Ritchie and Max Roser (2018) - "Urbanization"._
<br>_Published online at OurWorldInData.org._
<br>[https://ourworldindata.org/urbanization](https://ourworldindata.org/urbanization)

We are going to explore data from the United Nations with the previous post as a guideline. 

## Data extraction

In [None]:
root_url = 'https://population.un.org/wup/Download/Files/'
country_codes_URL = 'https://unstats.un.org/unsd/methodology/m49/'

In [None]:
# In this case, we could directly type the URL as Markdown. The point of this trick
# is to show how to embed computed values into a Markdown text, therefore dynamically
# filled when the notebook is executed.
m = (
     'The current urban and rural populations per country can be retrieved from the ' 
     f'[United Nations data portal]({root_url}) as indicated in the post.'
 )
Markdown(m)

The data are available as Excel sheets. We are going to extract them and transform them into a more manageable format, following parts of the recommendations of [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html).
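As a rough illustration of what "tidy" means here, the sketch below reshapes a small made-up table from a wide layout (one column per population type) into a long one (one row per observation) with `DataFrame.melt`. The country names and figures are invented for the example, and the notebook itself stops short of this fully long format.

```python
import pandas as pd

# Toy wide table: one row per country, one column per population type.
# The figures are made up for illustration only.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "urban": [10, 30],
    "rural": [20, 5],
})

# Tidy (long) form: one row per (country, area) observation.
tidy = wide.melt(id_vars="country", var_name="area", value_name="population")
print(tidy)
```

In the long form, every variable is a column and every observation is a row, which tends to make grouping and plotting more uniform.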

In [None]:
url = root_url + 'WUP2018-F01-Total_Urban_Rural.xls'
df_tur_raw = pd.read_excel(io=url)

In [None]:
# Let's take a look at the first lines to get a sense of the content.
# info() prints its summary itself and returns None, so it is not wrapped in display().
df_tur_raw.info()
display(df_tur_raw.head(n=20))

In [None]:
citation = df_tur_raw.loc[11, "Unnamed: 3"].replace('Suggested citation: ', '')

The first 15 rows are the header of the Excel sheet; the 15th one (index 14) seems to contain the column headers. We are going to promote this row to column headers and keep only the relevant part of the table.

In [None]:
df_tur = df_tur_raw[15:].copy()
df_tur.columns = df_tur_raw.loc[14].str.replace(',', '').str.replace(' ', '_').str.replace('\n', '_').str.lower().tolist()
# We could keep the percentage of urban population here but for the sake of the workshop, we
# are dropping it and will recompute it later
df_tur.drop(labels=['index', 'note', 'total', 'percentage_urban'], axis='columns', inplace=True)
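The header promotion above can be sketched on a toy frame. This is only an illustration: the preamble rows, the position of the header row and the values are invented, and `regex=False` is added so that the replacements are treated as plain strings.

```python
import pandas as pd

# Toy stand-in for a sheet read with pd.read_excel: two rows of
# preamble, one row holding the real column names, then the data.
raw = pd.DataFrame([
    ["Some title", None, None],
    ["Some note", None, None],
    ["Index", "Region, subregion,\ncountry or area", "Country code"],
    [1, "WORLD", 900],
    [2, "Burundi", 108],
])

header_row = 2  # position of the row holding the column names (toy value)
names = (
    raw.loc[header_row]
    .astype(str)
    .str.replace(",", "", regex=False)
    .str.replace(" ", "_", regex=False)
    .str.replace("\n", "_", regex=False)
    .str.lower()
)

# Keep only the rows below the header and attach the cleaned names.
clean = raw[header_row + 1:].copy()
clean.columns = names.tolist()
clean = clean.reset_index(drop=True)
print(clean.columns.tolist())
```

`pd.read_excel` also accepts a `header` argument to pick the header row at load time; reading the raw sheet first, as done here, keeps the preamble available (the suggested citation is extracted from it).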

In [None]:
# The table looks a little bit cleaner (it does not have missing
# values, for example) but we are still missing the proper types
# for the columns. Indeed, the urban and rural populations have
# the type object while they count people, and should be integers.
df_tur.info()
display(df_tur.head())

In [None]:
df_tur['country_code'] = df_tur.country_code.astype(int)
# Populations are given in thousands; we bring them back
# to plain counts and convert them to the right type,
# namely integers. Rounding before the cast avoids
# truncating values such as 1234566.9999999 to 1234566.
for c in ['urban', 'rural']:
    df_tur[c] = (df_tur[c].astype(float) * 1000).round().astype(int)

In [None]:
df_tur.info()
display(df_tur.head(n=20))

From the structure of the table above, it seems that the country codes of 900 and above play a different role. After all, the column is called “region, subregion, country or area.” We can take a look at the content to help us redesign the table.
> We could also open the spreadsheet in a proper tool like LibreOffice, Excel or Google Sheets. But that is a bit less fun.

In [None]:
# Here, I make use of the .pipe() method as an introduction to method chaining. Chaining
# is a slightly different way (closer to functional programming) to apply operations to objects.
# If you are interested, you can take a look at https://tomaugspurger.github.io/method-chaining
df_tur[df_tur.country_code >= 900][['region_subregion_country_or_area', 'country_code']].pipe(display)

In [None]:
regions = ['AFRICA', 'ASIA', 'EUROPE', 'LATIN AMERICA AND THE CARIBBEAN', 'NORTHERN AMERICA', 'OCEANIA']
sub_regions = df_tur[df_tur.country_code >= 900].region_subregion_country_or_area.tolist()
# Notice the usage of sets here. It is a nice way to remove elements from an existing list
sub_regions = list(set(sub_regions) - set(regions))
print(sub_regions)
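One caveat with the set-based approach is that sets are unordered, so the resulting list comes back in an arbitrary order. Where the original order matters, a list comprehension is a possible alternative. A minimal sketch on made-up labels:

```python
regions = ["AFRICA", "ASIA"]
labels = ["WORLD", "AFRICA", "Eastern Africa", "ASIA", "Central Asia"]

# Set difference: concise, but the resulting order is arbitrary.
unordered = list(set(labels) - set(regions))

# List comprehension: keeps the original order of `labels`.
excluded = set(regions)
ordered = [label for label in labels if label not in excluded]
print(ordered)
```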

The last part of the table corresponds to the regions and sub-regions as defined by the [United Nations](https://en.wikipedia.org/wiki/United_Nations_geoscheme). The country code follows the [ISO 3166-1 numeric](https://en.wikipedia.org/wiki/ISO_3166-1_numeric) standard. The spreadsheet is not built in a way that makes it easy to assign a region or sub-region to a given country: it is laid out like a nested drop-down list. We are going to change that in order to ensure a one-to-many mapping from regions/sub-regions to countries.

In [None]:
# We are fully making use here of the fact that the spreadsheet is built with the following
# structure (region 1 → sub-region 1 → country 1 → country 2 → sub-region 2 → country 1 → country 2 → region 2 ...)
df_tur['region'] = None
df_tur.loc[
    df_tur.region_subregion_country_or_area.isin(regions), 'region'] = df_tur.region_subregion_country_or_area
df_tur['sub_region'] = None
df_tur.loc[
    df_tur.region_subregion_country_or_area.isin(sub_regions), 'sub_region'] = df_tur.region_subregion_country_or_area

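# Forward-fill only the two new columns. Note that fillna(method='ffill')
# is deprecated in recent pandas versions in favour of .ffill().
df_tur[['region', 'sub_region']] = df_tur[['region', 'sub_region']].ffill()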

# Northern America does not have sub-regions
df_tur.loc[df_tur.region == 'NORTHERN AMERICA'].pipe(display)
df_tur.loc[df_tur.region == 'NORTHERN AMERICA', 'sub_region'] = 'Northern America'
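The forward-fill trick used above can be sketched on a toy list that mimics the sheet layout. The names below are a small hand-picked sample, not the full UN list.

```python
import pandas as pd

# Toy list mimicking the sheet layout: region, sub-region, countries...
df = pd.DataFrame({"name": [
    "AFRICA", "Eastern Africa", "Burundi", "Kenya",
    "Middle Africa", "Angola",
    "ASIA", "Central Asia", "Kazakhstan",
]})
regions = ["AFRICA", "ASIA"]
sub_regions = ["Eastern Africa", "Middle Africa", "Central Asia"]

# Label only the header rows and leave the others missing...
df["region"] = df["name"].where(df["name"].isin(regions))
df["sub_region"] = df["name"].where(df["name"].isin(sub_regions))
# ...then propagate each label downwards until the next one.
df[["region", "sub_region"]] = df[["region", "sub_region"]].ffill()

countries = df[~df["name"].isin(regions + sub_regions)]
print(countries)
```

In this sketch the "ASIA" row itself inherits the stale sub-region "Middle Africa" until "Central Asia" appears. This is the same effect that makes the manual fix for Northern America necessary.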

In [None]:
# We can check here that our mapping makes sense by extracting the sub-regions per region
above_902 = df_tur.country_code > 902
in_regions = df_tur.region_subregion_country_or_area.isin(regions)
not_sub_regions = df_tur.region_subregion_country_or_area.str.match('.*(countries|Less|More)')

sub_regions = df_tur[above_902 & ~in_regions & ~not_sub_regions].region_subregion_country_or_area.tolist()
df_sub_regions = df_tur[
    df_tur.region_subregion_country_or_area.isin(sub_regions)
].groupby(by='region').region_subregion_country_or_area.apply(lambda s: ', '.join(s.tolist()))

df_sub_regions.pipe(display)

We can now create a table dedicated to countries themselves as they are going to be the atomic unit for the analysis.

In [None]:
# Countries have a country code lower than 900
df_c = df_tur[df_tur.country_code < 900].copy()
df_c.rename(columns=dict(region_subregion_country_or_area='country'), inplace=True)
df_c.set_index(keys=['country'], inplace=True)

We can check that we did not make any mistake during the cleaning process by comparing the totals per region computed from our country-level values with the ones provided in the original data.

In [None]:
df_r = df_c.groupby(by='region').agg(dict(urban='sum', rural='sum'))

df_rg = df_tur[
    df_tur.region_subregion_country_or_area.isin(regions)
][['urban', 'rural', 'region_subregion_country_or_area']].set_index('region_subregion_country_or_area')

df_r.join(df_rg, rsuffix='_g').pipe(display)
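Instead of comparing the two sets of columns by eye, the check can also be automated. The sketch below uses made-up figures, and the tolerance of one unit is an assumption (published totals may be rounded independently of the country rows).

```python
import pandas as pd

# Made-up figures: per-country rows plus the totals published per region.
countries = pd.DataFrame({
    "region": ["AFRICA", "AFRICA", "ASIA"],
    "urban": [10, 15, 40],
    "rural": [20, 5, 60],
})
published = pd.DataFrame(
    {"urban": [25, 40], "rural": [25, 60]},
    index=pd.Index(["AFRICA", "ASIA"], name="region"),
)

aggregated = countries.groupby("region")[["urban", "rural"]].sum()
difference = (aggregated - published).abs()

# Keep only the regions where a column differs by more than the
# (assumed) tolerance of one unit.
mismatches = difference[(difference > 1).any(axis="columns")]
print(mismatches)
```

An empty `mismatches` table means the aggregation agrees with the published totals within the chosen tolerance.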

In [None]:
df_c.loc[['France', 'Malaysia']].pipe(display)

Countries are grouped by region/continent as well as sub-regions. The following table is an extract for the sub-region South-Eastern Asia.

In [None]:
sea_countries = list(df_c[df_c.sub_region == "South-Eastern Asia"].index)
df_c.loc[sea_countries].pipe(align_left, row_heading=True)

In [None]:
# client = datasheets.Client()

## South-Eastern Asia

### Current situation

In [None]:
def format_percentage(f):
    '''Format number into a percentage'''
    return f'{100 * f:3.1f}%'
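For reference, Python's format mini-language also has a dedicated `%` presentation type that multiplies by 100 and appends the sign. The sketch below compares it with the helper above; `format_percentage_builtin` is a hypothetical name introduced only for this comparison.

```python
def format_percentage(f):
    '''Format a fraction (0.125) as a percentage string ("12.5%").'''
    return f'{100 * f:3.1f}%'


def format_percentage_builtin(f):
    '''Same result using the "%" presentation type.'''
    return f'{f:.1%}'


for value in (0.125, 0.5, 1.0):
    print(format_percentage(value), format_percentage_builtin(value))
```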

In [None]:
df_sea = df_c.loc[sea_countries].drop(['country_code', 'region', 'sub_region'], axis='columns')
df_sea.index = df_sea.index.str.replace('Lao People\'s Democratic Republic', 'Laos')

In [None]:
df_sea_total = df_sea.sum()
df_sea_total_percentage = 100. * df_sea_total.divide(df_sea_total.sum())
rural = df_sea_total_percentage.loc['rural']
m = f'The majority ({rural:3.1f}%) of the South-Eastern Asian population is still rural, but this number hides disparities between the countries.'
Markdown(m)

Indeed, Malaysia is predominantly urban (around three quarters of the population lives in urban areas) while Cambodia is still largely rural, as shown by the table below.

In [None]:
df_sea = df_sea.divide(df_sea.sum(axis='columns'), axis='rows').sort_values(by=['urban'], ascending=False)
# Note that, as the table is rendered as HTML, we can use regular HTML tags like
# <br> to shape the table the way we want
df_sea.columns.name = 'Percentages of<br>total population'
df_sea.index.name = ''

# Here is a nice use of the chaining system
df_sea.applymap(func=format_percentage).pipe(align_left, row_heading=True)
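The row-wise normalisation used above can be checked on a toy table. The countries and figures are invented, and `axis="index"` is used here as the explicit spelling of the row axis.

```python
import pandas as pd

# Made-up populations for two hypothetical countries.
df = pd.DataFrame(
    {"urban": [30, 10], "rural": [10, 30]},
    index=["Country A", "Country B"],
)

# Summing across the columns gives one total per row; dividing along
# the row axis aligns each total with its row label.
shares = df.divide(df.sum(axis="columns"), axis="index")
print(shares)
```

Each row of `shares` sums to 1, so the values can be passed directly to a percentage formatter.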

### Evolution in time

In [None]:
urban_URL = root_url + 'WUP2018-F19-Urban_Population_Annual.xls'
rural_URL = root_url + 'WUP2018-F20-Rural_Population_Annual.xls'

In [None]:
urban_raw_data = pd.read_excel(io=urban_URL)
rural_raw_data = pd.read_excel(io=rural_URL)

In [None]:
# We have learned the structure of the documents above. We could move the cleaning
# into a set of functions at the top of the document or even into a module
# to be reused more widely.

def clean_column_names(serie):
    '''Clean a Series of column names: remove commas,
    replace white spaces and line breaks with
    underscores, and convert to lower case.'''
    cleaned_serie = serie.astype(
        str
    ).str.replace(
        ',', ''
    ).str.replace(
        ' ', '_'
    ).str.replace(
        '\n', '_'
    )
    return cleaned_serie.str.lower().tolist()

def clean_temporal_dataframe(df):
    '''Remove the unnecessary headers from the
    table.'''

    df_ = df[15:].copy()
    df_.columns = df.loc[14].pipe(clean_column_names)
    df_ = df_.drop(
        labels=['index', 'note', 'country_code'],
        axis='columns'
    ).set_index(
        keys=['region_subregion_country_or_area']
    )
    
    # Transform populations into millions
    df_ = df_ / 1.e3
    
    # Map the years into integers
    df_.columns = df_.columns.map(float).map(int).unique()
    # Put the years as the index and the regions as the
    # columns
    df_ = df_.transpose()
    df_.index.name = 'year'
    
    return df_

def select_countries(df, countries=None):
    '''Select countries in the DataFrame. If the list
    of countries is not given, the selection is made
    by regions'''
    selection =  countries if countries else ['AFRICA', 'ASIA', 'EUROPE', 'LATIN AMERICA AND THE CARIBBEAN', 'NORTHERN AMERICA', 'OCEANIA']
    return df[selection]
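The reshaping performed by `clean_temporal_dataframe` (scaling, integer years, transposition) can be sketched on a toy table. The area names and figures are invented; only the sequence of steps mirrors the function above.

```python
import pandas as pd

# Toy wide table: one row per area, one column per year (read as floats),
# with populations expressed in thousands.
df = pd.DataFrame(
    {1950.0: [100.0, 2000.0], 1951.0: [110.0, 2100.0]},
    index=pd.Index(["Country A", "Country B"], name="area"),
)

df = df / 1e3                     # thousands -> millions
df.columns = df.columns.map(int)  # 1950.0 -> 1950

# Years become the index, areas become the columns.
by_year = df.transpose()
by_year.index.name = "year"
print(by_year)
```

With years as the index, each column is a ready-to-plot time series for one area.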

In [None]:
urban_data = urban_raw_data.pipe(clean_temporal_dataframe)
rural_data = rural_raw_data.pipe(clean_temporal_dataframe)

In [None]:
countries = [country for country in sea_countries if country not in ['Singapore', 'Brunei Darussalam']]
urban_sea = urban_data.pipe(select_countries, countries=countries).copy()
rural_sea = rural_data.pipe(select_countries, countries=countries).copy()

urban_sea.columns = urban_sea.columns.str.replace('Lao People\'s Democratic Republic', 'Laos')
rural_sea.columns = rural_sea.columns.str.replace('Lao People\'s Democratic Republic', 'Laos')

The following chart shows the evolution of both rural and urban populations in South-Eastern Asian countries from 1950 to 2050 (projected after 2015).
> Brunei Darussalam and Singapore are not represented as their populations have been fully urbanized since the 1960s.

In the following chart, we are going to give a glimpse of the evolution in time of both the increase/decrease of rural/urban populations and the relative percentages. We have several options to do that. We could draw separate charts for populations and percentages. Here, I made a different choice: I plotted the populations and added the percentages as “hover” information.

In [None]:
data = [
    go.Scatter(
        x=[2018, 2050],
        y=[260, 260],
        line=dict(color='#dcdcdc', width=0),
        fill='tozeroy',
        mode='lines',
        showlegend=False,
        hoverinfo='none')
]

for n, country in enumerate(urban_sea.columns):
    
    years = urban_sea.index
    
    # Percentages of rural/urban populations
    ud = urban_sea[country].values
    rd = rural_sea[country].values
    percentages_ud = 100. * ud / (ud + rd)
    percentages_rd = 100. - percentages_ud
    
    # Information to show when hovering over a point
    hover_ud = list(zip(ud, percentages_ud))
    hover_rd = list(zip(rd, percentages_rd))
    hover_text_ud = list(
        map(
            # Note the use of a different way of formatting strings (str.format).
            # Populations are expressed in millions, hence the "M" suffix.
            lambda v: '{0} (urban): {1:3.1f}M ({2:3.1f}%)'.format(country, *v),
            hover_ud
        )
    )
    hover_text_rd = list(
        map(
            lambda v: '{0} (rural): {1:3.1f}M ({2:3.1f}%)'.format(country, *v),
            hover_rd
        )
    )
    
    data+= [
        go.Scatter(
            x=years,
            y=ud,
            marker=dict(color=PALETTE[n%len(PALETTE)]),
            name=country + ' (urban)' if n == 0 else country,
            legendgroup=country,
            text=hover_text_ud,
            hoverinfo='text'
        )
    ]
    data+= [
        go.Scatter(
            x=years,
            y=rd,
            line=dict(dash='dash'),
            marker=dict(color=PALETTE[n%len(PALETTE)]),
            name = country + ' (rural)' if n == 0 else country,
            legendgroup=country,
            text=hover_text_rd,
            hoverinfo='text',
            showlegend=True if n == 0 else False
            )
    ]
    
labels = dict(
    title='Rural and urban populations in South-Eastern Asia from 1950 to 2050',
    subtitle=(
        'Malaysia\'s and Indonesia\'s rural decline started in the 1990s,<br>'
        + 'while Thailand and Viet Nam joined the trend ten years later. Cambodia,<br>'
        + 'Laos and Myanmar still exhibit a growth of their rural population.'
    ),
    ylabel='population (in millions)',
    xlabel='')

axes = dict(
    xaxis=axis_no_title(showgrid=False),
    yaxis=axis_no_title(showgrid=False),
    legend=legend_dark(font_size=14))

layout = layout_by_line_height(
    **labels,
    **axes,
    left_margin=50,
    right_margin=220,
    font_family=fonts['ibm_plex']
)

figure = go.Figure(data=data, layout=layout)
py.offline.iplot(figure_or_data=figure, show_link=False)
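The hover labels built in the loop above can be tried in isolation, without drawing anything. The sketch below uses made-up values for one hypothetical country and assumes populations expressed in millions, as stated by the y-axis label.

```python
# Made-up populations (in millions) for one hypothetical country.
country = "Country A"
urban = [2.0, 5.0]
rural = [6.0, 5.0]

hover_text = []
for u, r in zip(urban, rural):
    # Share of the urban population in the total, in percent.
    share = 100.0 * u / (u + r)
    hover_text.append(f"{country} (urban): {u:3.1f}M ({share:3.1f}%)")

print(hover_text)
```

Passing such a list as `text` together with `hoverinfo='text'`, as done in the chart above, replaces the default tooltip with these labels.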

Although all countries have seen a rise in their total population, two groups can be identified in terms of rural population behaviour:
- Indonesia, Malaysia, Thailand and Viet Nam have already passed the peak of their rural population, which has been declining since the 1990s for the first two and since the 2000s for the last two;
- Cambodia's, Laos' and Myanmar's rural populations are still increasing, with a decline projected to start around the 2030s.
> Cambodia exhibits a very particular pattern around 1975. The disappearance of the urban population is due to the Khmer Rouge taking over the country and literally emptying the cities of their inhabitants. The regime lasted for four years.

We now have a complete notebook covering data extraction, cleaning and analysis. Since the analysis is usually aimed not only at oneself but at a broader audience, we need to think about how to share the information.

We can obviously share the notebook either directly or through a platform like mybinder.org (if the data analysis is to be public).

The notebook contains a lot of code, though, which may not be of interest to the audience. We can therefore transform the notebook into an HTML file, removing the code cells and the comment cells along the way. The cleaning is done by using Jinja2 templates. You can take a look at the file `report.tpl`.

In order to generate an HTML file from the command line, _i.e._ in the terminal offered by Jupyter, you can use
```
jupyter nbconvert Urban_and_rural_populations.ipynb --template report.tpl
```
> Try with and without `--template report.tpl` to see the difference.

Once the HTML file is generated, you can open it from the “Home” page of Jupyter.