# Data Science Africa -- Summer School

## Data Visualisation Practical Session

<br/>
<br/>

<div align="right"><font face="monospace" size="4">
    <strong>Morine</strong> Amutorine -- morine.amutorine@one.un.org <br/>
    <strong>Elaine</strong> Nsoesie -- onelaine@bu.edu <br/>
    <strong>Lehel</strong> Csató -- lehel.csato@cs.ubbcluj.ro
    </font></div>

#### Setup

> We assume that <br/>
> **[Anaconda](https://www.anaconda.com)** <br>
> is available (downloaded and installed) on your system.

<div align="right">
Alternatively, if you have <strong>some</strong> Python <br/>
and Matplotlib <br/>
and Pandas <br/>
and Numpy <br/>
and Cartopy <br/>
    <div color="red">;-) , it suffices.</div>
</div>

From within Anaconda (or Python) we will be using<br>
1. the *jupyter* interface,
1. *matplotlib*, and
1. *cartopy*.

Jupyter is a **scripting** interface for Python that is helpful for exploratory data analysis.

You can 
1. edit the notebook by simply entering commands and
1. press SHIFT-ENTER or CTRL-ENTER to evaluate the cell within the notebook

> Observe that cells need not be evaluated from top to bottom, _but_ variables are created only when a cell is evaluated; the internal state of the notebook therefore **is** linear, reflecting the **order** in which cells were evaluated.
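To see why the evaluation order matters, here is a minimal sketch that imitates a notebook with three made-up cells. The `cells` dictionary and the `run` helper are illustrative assumptions, not part of Jupyter itself.

```python
# Three hypothetical "cells", each holding one line of Python.
cells = {
    "A": "x = 1",
    "B": "x = x + 10",
    "C": "x = x * 2",
}

def run(order):
    """Evaluate the cells in the given order, sharing one namespace."""
    state = {}
    for name in order:
        exec(cells[name], state)
    return state["x"]

top_to_bottom = run("ABC")   # (1 + 10) * 2
swapped = run("ACB")         # (1 * 2) + 10
print(top_to_bottom, swapped)
```

The same cells give different values of `x` depending on the order in which they were run, which is what happens when notebook cells are evaluated out of sequence.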

In [None]:
%matplotlib inline
# other backends can be selected here instead of inline
import matplotlib.pyplot as plt
import numpy as np
# press SHIFT-ENTER or CTRL-ENTER

In the above cell we initialised the *pyplot* interface and imported numpy (used below for random number generation).

In [None]:
x = np.random.randn(10000)

plt.figure(figsize=(5, 3))
plt.hist(x,100)
plt.title(r'Normal distribution with $\mu=0, \sigma=1$')  # raw string keeps the backslashes for LaTeX
# plt.savefig('histogram2.pdf') # this is if we want to save to an image
plt.show()
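The counts that `plt.hist` draws can also be computed without plotting. The sketch below assumes a fixed random seed (added here only for reproducibility) and uses `np.histogram` with the same number of bins.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen arbitrarily for this sketch
samples = rng.standard_normal(10000)  # same size as x above

counts, edges = np.histogram(samples, bins=100)
print(counts.sum())   # every sample falls into exactly one bin
print(len(edges))     # 100 bins are delimited by 101 edges
```

Inspecting `counts` and `edges` directly is a quick way to check that a histogram shows what we think it shows.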

In [None]:
# We can modify the figure as long as it is active.
# ALL *PLT* commands from the PYPLOT interface modify the
# *active* figure (if none is active, the figure is created)
plt.title('Histogram of $10000$ points from $N(0,1)$ in $100$ bins')
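The idea of the *active* figure can be made explicit by keeping a handle on the figure and its axes. This is a small sketch with made-up data; it assumes nothing beyond matplotlib itself.

```python
import matplotlib.pyplot as plt

# plt.subplots creates a figure and makes it the current (active) one.
fig, ax = plt.subplots(figsize=(5, 3))
ax.hist([1, 2, 2, 3, 3, 3], bins=3)   # made-up data
plt.title("Set through pyplot")       # pyplot acts on the current axes

same_figure = plt.gcf() is fig        # the active figure is the one we created
title_text = ax.get_title()
```

With the inline backend a figure is normally closed once its cell has finished, so a `plt.title` call in a later cell starts a new, empty figure. Keeping all commands for one figure in the same cell avoids this.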

## Using real data

There are perils ... 

In [None]:
# __future__ imports belong at the top (only needed on Python 2)
from __future__ import print_function
# we are using the pandas DataFrame library
import pandas as pd
!pwd
# the above command prints the current working directory

In [None]:
df_gdp = pd.read_excel(
  'data/countries_gdp.xls',
  sheet_name = 'Data', skiprows = range(3))
# Reading in other data
df_pop         = pd.read_excel('data/countries_population.xls',sheet_name = 'Data',skiprows = range(3))
df_edu_percent = pd.read_excel('data/countries_edu_percent.xls',sheet_name = 'Data',skiprows = range(3))
df_primary_014 = pd.read_excel('data/countries_primary_pupils_014.xls',sheet_name = 'Data',skiprows = range(3))
df_pupils_014  = pd.read_excel('data/countries_pupils_014.xls',sheet_name = 'Data',skiprows = range(3))

# We want to select countries by name, therefore we create indices for the five data-sets
df_gdp.set_index("Country Name", inplace=True)
df_pop.set_index("Country Name", inplace=True)
df_edu_percent.set_index("Country Name", inplace=True)
df_primary_014.set_index("Country Name", inplace=True)
df_pupils_014.set_index("Country Name", inplace=True)
#
#!! set_index(..., inplace=True) consumes the "Country Name" column,
#!! so this cell can be run only ONCE per load: a second run raises a KeyError

# we can confirm the import by inspecting the first rows
# df_gdp.head()
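One way around the run-only-once limitation is to set the index only when it is not already in place. The sketch below uses a made-up table instead of the Excel files; `index_by_country` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Made-up table standing in for one of the Excel sheets.
df_toy = pd.DataFrame({
    "Country Name": ["A-land", "B-land"],
    "2000": [1.0, 2.0],
})

def index_by_country(df):
    """Return df indexed by 'Country Name'; safe to call repeatedly."""
    if df.index.name == "Country Name":
        return df
    return df.set_index("Country Name")

df_toy = index_by_country(df_toy)
df_toy = index_by_country(df_toy)   # the second call changes nothing
```

Returning a new DataFrame instead of using `inplace=True` makes the cell safe to re-run.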

We have the following tables (DataFrames in **Pandas**):

- **df_gdp** -- the DataFrame containing _GDP_
- **df_pop** -- the DataFrame containing _POPULATION_
- **df_edu_percent** -- the DataFrame containing _EDUCATION_PERCENTAGE_
- **df_primary_014** -- the DataFrame containing _PUPILS_IN_PRIMARY_SCHOOL_
- **df_pupils_014** -- the DataFrame containing _PUPILS_AGED_0_14_
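All of these tables share the same layout: one row per country and one column per year. A small sketch with invented numbers shows how `.loc` picks a row and a set of year columns. It assumes, as the cells in this notebook do, that the year labels are strings.

```python
import pandas as pd

years_toy = [str(y) for y in range(2000, 2003)]   # column labels are strings
df_layout = pd.DataFrame(
    [[10.0, 11.0, 12.0], [20.0, 21.0, 22.0]],      # invented values
    index=pd.Index(["A-land", "B-land"], name="Country Name"),
    columns=years_toy,
)

row = df_layout.loc["A-land", years_toy]   # a Series indexed by year
value = df_layout.loc["B-land", "2001"]    # a single number
```

Selecting with a name that is not in the index raises a `KeyError`, which is usually the first sign that country names are spelled differently across data sources.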

In [None]:
years = list( map( str, range(2000,2018)))

# a bar chart of a single row; kind='line' is worth trying as well
%matplotlib inline
df_gdp.loc['Nigeria', years].plot(kind='bar',color=(.2,.6,.2))
plt.title("Nigerian GDP")
plt.xlabel('Years')
plt.ylabel('GDP')
plt.xticks(list(range(0,18)), years)
plt.show()

## Visualising comparisons

The aim is to visualise several data items at once:
1. so far we had the years on the axis, but
1. we also want to show, on the same plot, different countries from the "neighbourhood": <br/>
  Cameroon, Tanzania, Niger, ...
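As a first sketch of the idea, several rows of a DataFrame can be drawn as separate lines by transposing the selection. The data below are invented and the country names are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

years_few = ["2000", "2001", "2002"]
df_lines = pd.DataFrame(
    [[1.0, 2.0, 3.0], [2.0, 2.5, 2.0]],   # invented values
    index=["A-land", "B-land"],
    columns=years_few,
)

# Transposing puts the years on the rows, so each country becomes one line.
ax = df_lines.T.plot(kind="line")
ax.set_xlabel("Years")
n_lines = len(ax.get_lines())
```

This gives one line per country with default colours; plotting the countries one at a time in a loop gives control over the colour and width of each line.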

In [None]:
# declaring data for our data visualisation:
# we are interested only in a small set of countries
#
# this is a dictionary having COUNTRY names as KEYS
# and RGB COLOR tuples as VALUES
#
colors = {
    "Nigeria": (.2,.7,.2),
    "Niger":   (.4,.4,0),
    "Uganda":  (1, .6,.6),
    "Rwanda":  (1,.5,.8),
    "Central African Republic": (.2,.3,.4),
    "Republic of the Congo": (.2,.1,.4),
    "Gabon":  (1,.9,.8),
    "Somalia":  (.1,.8,.3),
    "Kenya":   (.3,.1,.8),
    "Sudan":   (1,.0,.9),
    "Chad":    (.6,.2,.3),
    "Ethiopia":(.3,.7,.5),
    "South Sudan":(.2,.9,.8),
    "Cameroon":(.3,.7,.3),
    "Democratic Republic of the Congo": (.1,.3,0),
    "Tanzania": (.8,.7,.2),
    "Burundi": (.5,0,.8),
    "Benin": (.5,.6,.1),
    "Togo":  (.7,.2,.1),
    "Ghana": (.1,.1,0)
}
# for further processing, we will need the names of
# the COUNTRIES
country_names = list(colors)
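Hand-picking twenty colours is tedious. As an alternative sketch, the colours can be taken from one of matplotlib's qualitative colormaps. The choice of `tab20` and the short name list are assumptions made for illustration.

```python
import matplotlib.pyplot as plt

names_few = ["Nigeria", "Niger", "Uganda", "Rwanda"]   # any list of names works
cmap = plt.get_cmap("tab20")                           # qualitative map with 20 colours

# Calling the colormap with an integer returns the RGBA tuple at that position.
auto_colors = {name: cmap(i % cmap.N) for i, name in enumerate(names_few)}
```

The resulting dictionary has the same shape as `colors` above (names as keys, colour tuples as values), so it could be used in its place.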

## Highlight important data: GDP for a specific country

In [None]:
# Exercise
# Plot the GDPs for all countries in the list


# HIGHLIGHT your country's GDP
# Example
c_high = "Uganda"
for c_name in country_names:
    if c_name == c_high:
        linewidth = 4
    else:
        linewidth = 1
    df_gdp.loc[c_name, years].plot(kind='line',linewidth=linewidth, color=colors[c_name])
# end for

plt.legend(country_names,loc='upper left', bbox_to_anchor=(1, 1.2))
plt.xlabel('Years')
plt.ylabel('GDP in USD')
plt.title("GDP per country")
plt.xticks(list(range(0,18,2)), years[::2])
plt.show()

In [None]:
temp1=df_gdp.loc["Nigeria",years]
temp2=df_pop.loc["Nigeria",years]
# both are pandas Series indexed by year, so .div() divides element-wise
joined_data = temp1.div(temp2)

joined_data.plot(kind='bar',color=(.2,.7,.2))
plt.title("Nigerian GDP / capita")
plt.xlabel('Years')
plt.ylabel('GDP / POP / YEAR')
plt.xticks(list(range(0,18)), years)
plt.show()

## What about a comparison between the countries of interest?

In [None]:
# Exercise
# Plot the GDP per capita for all countries in the list

# HIGHLIGHT your country's GDP per capita
# Example
c_high = "Nigeria"
for c_name in country_names:
    if c_name == c_high:
        linewidth = 4
    else:
        linewidth = 1
    # making the ratio
    temp1=df_gdp.loc[c_name,years]
    temp2=df_pop.loc[c_name,years]
    gdp_cap = temp1.div(temp2)
    gdp_cap.plot(kind='line',linewidth=linewidth, color=colors[c_name] )
# end for

plt.title("GDP / capita")
plt.xlabel('Years')
plt.ylabel('GDP / POP / YEAR')
plt.xticks(list(range(0,18,2)), years[::2])
plt.legend(country_names,loc='upper left', bbox_to_anchor=(1, 1.2))
plt.show()
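The per-country division in the loop can also be done for whole tables at once, because pandas aligns rows and columns by their labels. A sketch with invented numbers, where the two tables deliberately list the countries in a different order:

```python
import pandas as pd

years_two = ["2000", "2001"]
gdp_toy = pd.DataFrame([[100.0, 120.0], [50.0, 60.0]],
                       index=["A-land", "B-land"], columns=years_two)
pop_toy = pd.DataFrame([[10.0, 10.0], [5.0, 6.0]],
                       index=["B-land", "A-land"], columns=years_two)

# Division matches rows by country name and columns by year label.
per_capita = gdp_toy.div(pop_toy)
```

A country present in only one of the two tables would end up with `NaN` in the result, so missing data stays visible instead of being silently paired with the wrong row.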

In [None]:
# Solved exercise:
#   plot the percent of education for the list of countries

plt.figure(figsize=(8, 6))
for c_name in country_names:
    df_edu_percent.loc[c_name, years].plot(kind='line',linewidth=4, color=colors[c_name])
# end for
plt.legend(country_names,loc='upper left', bbox_to_anchor=(1, .9))
plt.xlabel('Years')
plt.ylabel('Education %')
plt.title("Education percentages of GDP")
plt.xticks(list(range(0,18,2)), years[::2])  # tick positions and labels must have equal length
plt.show()

## Putting things on a map

In [None]:
import cartopy.io.shapereader as shp_reader
import cartopy.crs as ccrs
import cartopy.feature  # also binds the name `cartopy`; cartopy.feature is used below
from matplotlib.figure import Figure  # the Figure artist
import numpy as np

# changing the way PLOTTING works
%matplotlib inline

In [None]:
# reading in the data
shp_file_name = shp_reader.natural_earth(resolution='110m',
                                      category='cultural',
                                      name='admin_0_countries')
reader = shp_reader.Reader(shp_file_name)
countries = [country for country in reader.records()]


In [None]:
# we define the function we want to be used
# HELPER function to restrict the DATA
def filter_country_attr_values(country_list, attr_name, attr_list):
    result = []
    for country in country_list:
        for attr_value in attr_list:
            if country.attributes[attr_name] == attr_value:
                result.append(country)
    return result
# end filter_country_attr_values

small_list = filter_country_attr_values(
    countries, # the large data list
    'NAME_EN', # the attribute to use
    country_names # the list of attribute values
)
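The same filtering can be written more compactly, and it is useful to also report which requested names were *not* found. The sketch below runs without cartopy: `Record` is a hypothetical stand-in that mimics only the `.attributes` dictionary of the shapefile records.

```python
from collections import namedtuple

# Stand-in for the shapefile records: only the .attributes dict is mimicked.
Record = namedtuple("Record", ["attributes"])

def filter_by_attr(records, attr_name, wanted_values):
    """Keep the records whose attribute value is among wanted_values."""
    wanted = set(wanted_values)
    return [r for r in records if r.attributes.get(attr_name) in wanted]

records_toy = [Record({"NAME_EN": n}) for n in ["Kenya", "France", "Ghana"]]
wanted_names = ["Ghana", "Kenya", "Togo"]

kept = filter_by_attr(records_toy, "NAME_EN", wanted_names)
kept_names = [r.attributes["NAME_EN"] for r in kept]
missing = sorted(set(wanted_names) - set(kept_names))   # requested but not found
```

An empty `missing` list confirms that every country in the colour dictionary was matched; a non-empty one usually points to a spelling difference between data sources.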

In [None]:
small_list

In [None]:
# visualisation source: 
# https://gis.stackexchange.com/questions/88209/python-mapping-in-matplotlib-cartopy-color-one-country

plt.figure(figsize=(12, 6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.stock_img()
ax.add_feature(cartopy.feature.BORDERS, linestyle='-', alpha=.5)
ax.add_feature(cartopy.feature.COASTLINE)
for cs in small_list:
    cs_col = colors[cs.attributes['NAME_EN']]
    ax.add_geometries([cs.geometry], ccrs.PlateCarree(),
                     facecolor = cs_col, alpha=0.15,
                     label = cs.attributes['NAME_EN'])
    (lon,lat) = cs.geometry.centroid.coords[0]
    plt.text(lon-1,lat,cs.attributes['NAME_EN'],horizontalalignment='right',transform=ccrs.PlateCarree())
    plt.scatter(lon,lat,marker='o',s=20,color=cs_col, alpha=.7)
ax.set_extent([-5, 48, -8, 22], crs=ccrs.PlateCarree())
plt.title("Countries of interest")
plt.show()

In [None]:
# Exercise:
# Put the GDP on the MAP
plt.figure(figsize=(12, 6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.stock_img()
ax.add_feature(cartopy.feature.BORDERS, linestyle='-', alpha=.5)
ax.add_feature(cartopy.feature.COASTLINE)
for cs in small_list:
    cs_col = colors[cs.attributes['NAME_EN']]
    cs_name= cs.attributes['NAME_EN']
    ax.add_geometries([cs.geometry], ccrs.PlateCarree(),
                     facecolor = cs_col, alpha=0.15,
                     label = cs_name)
    (lon,lat) = cs.geometry.centroid.coords[0]
    plt.text(lon-1,lat-1,cs_name,horizontalalignment='right',transform=ccrs.PlateCarree())
    size_c = df_gdp.loc[cs_name,"2016"]/10**9
    if np.isnan(size_c):
        size_c = 0.0001
    plt.scatter(lon,lat,marker='o',s=12*size_c,color=cs_col, alpha=.4)
    
ax.set_extent([-5, 48, -8, 22], crs=ccrs.PlateCarree())
plt.title("GDP in countries in 2016")
plt.show()
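The constants `10**9` and `12` above were chosen by hand so that the circles fit the map. A possible alternative is to scale the values relative to the largest one. `marker_sizes` below is a hypothetical helper and its default sizes are arbitrary. Note that the `s` argument of `scatter` is the marker *area* in points squared.

```python
import numpy as np

def marker_sizes(values, max_size=300.0, min_size=1.0):
    """Scale values linearly so the largest maps to max_size; NaN maps to min_size."""
    values = np.asarray(values, dtype=float)
    largest = np.nanmax(values)                 # ignores missing values
    sizes = values / largest * max_size
    sizes = np.where(np.isnan(sizes), min_size, sizes)
    return np.maximum(sizes, min_size)          # keep tiny values visible

sizes_toy = marker_sizes([4.0, np.nan, 1.0])    # invented values
print(sizes_toy)
```

Because the sizes are relative, the same helper works for GDP, population or pupil counts without choosing a new divisor each time.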

# Education results

In [None]:
# Exercise:
# produce the ratio of pupils 0-14
# going to schools
plt.figure(figsize=(12, 6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.stock_img()
ax.add_feature(cartopy.feature.BORDERS, linestyle='-', alpha=.5)
ax.add_feature(cartopy.feature.COASTLINE)

# selecting the YEAR OF INTEREST
sel_year = "2013"

for cs in small_list:
    cs_col = colors[cs.attributes['NAME_EN']]
    cs_name= cs.attributes['NAME_EN']
    ax.add_geometries([cs.geometry], ccrs.PlateCarree(),
                     facecolor = cs_col, alpha=0.15,
                     label = cs_name)
    (lon,lat) = cs.geometry.centroid.coords[0]
    plt.text(lon-1,lat-1,cs_name,horizontalalignment='right',transform=ccrs.PlateCarree())
    size_c = df_pupils_014.loc[cs_name,sel_year]/10**5
    if np.isnan(size_c):
        size_c = 0.0001
    plt.scatter(lon,lat,marker='o',s=3*size_c,color=cs_col, alpha=.4)
    # on the same plot: pupils in PRIMARY education
    size_c = df_primary_014.loc[cs_name,sel_year]/10**5
    if np.isnan(size_c):
        # no data: draw a fixed-size cross instead of a circle
        size_c = 50
        marker = 'x'
    else:
        marker='o'
    plt.scatter(lon,lat,marker=marker,s=3*size_c,color=cs_col, alpha=1)
    
    
ax.set_extent([-5, 48, -8, 22], crs=ccrs.PlateCarree())
plt.title("Youngsters in " + sel_year)
plt.show()
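The map shows the two quantities as nested circles; the ratio itself can be computed directly. A sketch with invented numbers, assuming one value per country for the selected year:

```python
import numpy as np
import pandas as pd

# Invented values; B-land has no primary-school figure for this year.
primary_toy = pd.Series({"A-land": 2.0e6, "B-land": np.nan, "C-land": 1.5e6})
aged_0_14_toy = pd.Series({"A-land": 8.0e6, "B-land": 4.0e6, "C-land": 3.0e6})

ratio = primary_toy.div(aged_0_14_toy)   # NaN wherever an input is missing
has_data = ratio.notna()                 # True for countries with a usable ratio
```

Countries where `has_data` is `False` are the ones drawn with a cross on the map above.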

# Conclusions

- Data Visualisation is **nice**

- You should have the **proper** data

- You should have the **toolset**

- You should have the patience


<br/>
<br/>
<br/>

[Hans Rosling](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen)

![gapminder.org -- realisation in **python**](gapmind.png "Gapminder")

