![BTS](img/Logo-BTS.jpg)

# Session 15: Advanced Visualization

### Juan Luis Cano Rodríguez <juan.cano@bts.tech> - Data Science Foundations (2018-11-16)

Open this notebook in Google Colaboratory: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Juanlu001/bts-mbds-data-science-foundations/blob/master/sessions/15-Advanced-Visualization.ipynb)

## Exercise 1: "Gapminder" interactive visualization

We will reproduce an example similar to this:

1. Load all the datasets in the `data/gapminder` directory, indexing them by `Country`.
2. Create a function that receives a `year` _as an integer_ and returns a new dataframe with `Country` as the index and the columns `Fertility`, `Life expectancy`, `Population` and `Group`.
3. Create a Plotly `FigureWidget` and visualize a scatter plot of `Life expectancy` vs `Fertility`, using the `Population` as bubble size (you will need some scaling) and coloring by `Group`. _Hint: it will be easier to do as many scatters as regions_
4. Decorate the figure with proper X and Y axis labels, a title, a big text showing the year, and a legend (if not present). _Note: The legend might not show the colors_
5. Create a function `update_year` that receives a `year` _as an integer_ and updates the data of the existing figure with the values from the selected year. _Note: The update might not be very efficient_
6. Create an horizontal slider that ranges from the minimum to the maximum year
7. Bind the `update_year` function to changes in the horizontal slider and use it to interactively change the plot

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

import numpy as np
from plotly import graph_objs as go

In [2]:
import seaborn as sns
sns.set()

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
fertility = pd.read_csv("C:\\Users\\user\\Documents\\bts-mbds-data-science-foundations-master-22.10\\bts-mbds-data-science-foundations\\sessions\\data\\gapminder\\fertility.csv"
            ,index_col = 'Country')
fertility.head()

Unnamed: 0_level_0,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,7.671,7.671,7.671,7.671,7.671,7.671,7.671,7.671,7.671,7.671,...,7.136,6.93,6.702,6.456,6.196,5.928,5.659,5.395,5.141,4.9
Albania,5.711,5.594,5.483,5.376,5.268,5.16,5.05,4.933,4.809,4.677,...,2.004,1.919,1.849,1.796,1.761,1.744,1.741,1.748,1.76,1.771
Algeria,7.653,7.655,7.657,7.658,7.657,7.652,7.641,7.622,7.591,7.548,...,2.448,2.507,2.58,2.656,2.725,2.781,2.817,2.829,2.82,2.795
American Samoa,,,,,,,,,,,...,,,,,,,,,,
Andorra,,,,,,,,,,,...,,,,,,,,,,


In [5]:
population = pd.read_csv("C:\\Users\\user\\Documents\\bts-mbds-data-science-foundations-master-22.10\\bts-mbds-data-science-foundations\\sessions\\data\\gapminder\\population.csv"
                        ,index_col = 'Country')
population.head()

Unnamed: 0_level_0,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,10474903.0,10697983.0,10927724.0,11163656.0,11411022.0,11676990.0,11964906.0,12273101.0,12593688.0,12915499.0,...,26693486.0,27614718.0,28420974.0,29145841.0,29839994.0,30577756.0,31411743.0,32358260.0,33397058.0,34499915.0
Albania,1817098.0,1869942.0,1922993.0,1976140.0,2029314.0,2082474.0,2135599.0,2188650.0,2241623.0,2294578.0,...,3124861.0,3141800.0,3156607.0,3169665.0,3181397.0,3192723.0,3204284.0,3215988.0,3227373.0,3238316.0
Algeria,11654905.0,11923002.0,12229853.0,12572629.0,12945462.0,13338918.0,13746185.0,14165889.0,14600659.0,15052371.0,...,32396048.0,32888449.0,33391954.0,33906605.0,34428028.0,34950168.0,35468208.0,35980193.0,36485828.0,36983924.0
American Samoa,22672.0,23480.0,24283.0,25087.0,25869.0,26608.0,27288.0,27907.0,28470.0,28983.0,...,61871.0,62962.0,64045.0,65130.0,66217.0,67312.0,68420.0,69543.0,70680.0,71834.0
Andorra,17438.0,18529.0,19640.0,20772.0,21931.0,23127.0,24364.0,25656.0,26997.0,28357.0,...,75292.0,77888.0,79874.0,81390.0,82577.0,83677.0,84864.0,86165.0,87518.0,88909.0


In [6]:
life_exp = pd.read_csv("C:\\Users\\user\\Documents\\bts-mbds-data-science-foundations-master-22.10\\bts-mbds-data-science-foundations\\sessions\\data\\gapminder\\life_expectancy.csv"
            ,index_col = 'Country')
life_exp.head()

Unnamed: 0_level_0,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,33.639,34.152,34.662,35.17,35.674,36.172,36.663,37.143,37.614,38.075,...,56.583,57.071,57.582,58.102,58.618,59.124,59.612,60.079,60.524,60.947
Albania,65.475,65.863,66.122,66.316,66.5,66.702,66.948,67.251,67.595,67.966,...,75.725,75.949,76.124,76.278,76.433,76.598,76.78,76.979,77.185,77.392
Algeria,47.953,48.389,48.806,49.205,49.592,49.976,50.366,50.767,51.195,51.67,...,69.682,69.854,70.02,70.18,70.332,70.477,70.615,70.747,70.874,71.0
American Samoa,,,,,,,,,,,...,,,,,,,,,,
Andorra,,,,,,,,,,,...,,,,,,,,,,


In [7]:
regions = pd.read_csv("C:\\Users\\user\\Documents\\bts-mbds-data-science-foundations-master-22.10\\bts-mbds-data-science-foundations\\sessions\\data\\gapminder\\regions.csv"
                    ,index_col = 'Country')
regions.head()

Unnamed: 0_level_0,Group,ID
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Angola,Sub-Saharan Africa,AO
Benin,Sub-Saharan Africa,BJ
Botswana,Sub-Saharan Africa,BW
Burkina Faso,Sub-Saharan Africa,BF
Burundi,Sub-Saharan Africa,BI


In [8]:
life_exp.head()

Unnamed: 0_level_0,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,33.639,34.152,34.662,35.17,35.674,36.172,36.663,37.143,37.614,38.075,...,56.583,57.071,57.582,58.102,58.618,59.124,59.612,60.079,60.524,60.947
Albania,65.475,65.863,66.122,66.316,66.5,66.702,66.948,67.251,67.595,67.966,...,75.725,75.949,76.124,76.278,76.433,76.598,76.78,76.979,77.185,77.392
Algeria,47.953,48.389,48.806,49.205,49.592,49.976,50.366,50.767,51.195,51.67,...,69.682,69.854,70.02,70.18,70.332,70.477,70.615,70.747,70.874,71.0
American Samoa,,,,,,,,,,,...,,,,,,,,,,
Andorra,,,,,,,,,,,...,,,,,,,,,,


In [9]:
life_exp.columns

Index(['1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972',
       '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981',
       '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990',
       '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999',
       '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013'],
      dtype='object')

In [10]:
life_exp[str(1991)]

Country
Afghanistan                 49.439
Albania                     71.799
Algeria                     67.049
American Samoa                 NaN
Andorra                        NaN
Angola                      41.221
Anguilla                       NaN
Antigua and Barbuda         71.568
Argentina                   71.791
Armenia                     67.864
Aruba                       73.509
Australia                   77.540
Austria                     75.830
Azerbaijan                  64.494
Bahamas                     70.869
Bahrain                     72.587
Bangladesh                  60.537
Barbados                    71.314
Belarus                     70.580
Belgium                     76.300
Belize                      70.939
Benin                       54.033
Bermuda                        NaN
Bhutan                      53.270
Bolivia                     59.341
Bosnia and Herzegovina      65.997
Botswana                    62.444
Brazil                      66.867
British Virg

In [11]:
def by_year(year):
    return pd.DataFrame({
        'Population': population[str(year)],
        'Fertility': fertility[str(year)],
        'Life expectancy': life_exp[str(year)],
        'Group': regions['Group'],
    })

In [12]:
df = by_year(1991)
df.head()

Unnamed: 0,Population,Fertility,Life expectancy,Group
Afghanistan,14069854.0,7.7,49.439,South Asia
Albania,3291695.0,2.917,71.799,Europe & Central Asia
Algeria,25930560.0,4.503,67.049,Middle East & North Africa
American Samoa,48402.0,,,East Asia & Pacific
Andorra,54996.0,,,Europe & Central Asia


In [13]:
year = 1991

In [50]:
fig = go.FigureWidget()
fig

FigureWidget({
    'data': [], 'layout': {}
})

In [15]:
df.head()

Unnamed: 0,Population,Fertility,Life expectancy,Group
Afghanistan,14069854.0,7.7,49.439,South Asia
Albania,3291695.0,2.917,71.799,Europe & Central Asia
Algeria,25930560.0,4.503,67.049,Middle East & North Africa
American Samoa,48402.0,,,East Asia & Pacific
Andorra,54996.0,,,Europe & Central Asia


In [16]:
for a, b in df.groupby("Group"):
    print(a, type(b))

America <class 'pandas.core.frame.DataFrame'>
East Asia & Pacific <class 'pandas.core.frame.DataFrame'>
Europe & Central Asia <class 'pandas.core.frame.DataFrame'>
Middle East & North Africa <class 'pandas.core.frame.DataFrame'>
South Asia <class 'pandas.core.frame.DataFrame'>
Sub-Saharan Africa <class 'pandas.core.frame.DataFrame'>


In [17]:
for group_name, sub_df in df.groupby("Group"):
    sc = fig.add_scatter(
        x=sub_df['Fertility'],
        y=sub_df['Life expectancy'],
        mode='markers',
        marker={
            'size': np.sqrt(sub_df['Population'].fillna(0))/ 400
        },
        name=group_name,
    )

Decorate the figure with proper X and Y axis labels, a title, a big text showing the year, and a legend (if not present).

In [52]:
fig.layout.xaxis = dict(title="Fertility")
fig.layout.yaxis = dict(title="Life Expectancy")
fig.layout.title = "Region Wise:Life Expectancy VS Fertility Map in the year {}".format(year)
fig.layout.titlefont.color='blue'
fig.layout.titlefont.size=20.0
fig.layout.legend.font.family = 'Courier New'

Create a function update_year that receives a year as an integer and updates the data of the existing figure with the values from the selected year. 

In [59]:
#update year function

def update_year(new_year):
    fig.data=[] #the legends were getting appended, thus to remove previous data initializing with empty array
    if new_year in range(1964,2014):
         for group_name, sub_df in df.groupby("Group"):
            fig.layout.xaxis = dict(title="Fertility")
            fig.layout.yaxis = dict(title="Life Expectancy")
            fig.layout.title = "Region Wise:Life Expectancy VS Fertility Map in the year {}".format(new_year)
            fig.layout.titlefont.color='red'
            fig.layout.titlefont.size=20.0
            fig.layout.legend.font.family = 'Courier New'
            sc1 = fig.add_scatter(
            x=sub_df['Fertility'],
            y=sub_df['Life expectancy'],
            mode='markers',
            marker={
            'size': np.sqrt(sub_df['Population'].fillna(0))/ 400  #normalizing 
            },
            name=group_name,
            )
     
    

In [60]:
update_year(2013)

Create an horizontal slider that ranges from the minimum to the maximum year

In [61]:
from ipywidgets import IntSlider
from ipywidgets import interact

In [62]:
#value=initial value, min=minimum value, max=maximum value
slider = IntSlider(value=life_exp.columns.astype(int).min(),min=life_exp.columns.astype(int).min(),max=life_exp.columns.astype(int).max())

In [63]:
slider

IntSlider(value=1964, max=2013, min=1964)

Bind the update_year function to changes in the horizontal slider and use it to interactively change the plot`

In [64]:
interact(update_year,new_year=slider)
fig

interactive(children=(IntSlider(value=1964, description='new_year', max=2013, min=1964), Output()), _dom_class…

FigureWidget({
    'data': [{'marker': {'size': array([ 0.23276866,  0.62626871, 14.38175472,  0.63561978,  1.…