todos:
- do they need a google account?
- how to save intermediate work?
  
# Data Aggregation & Visualization
- In this tutorial we will explore global demographic data with:
    -  `pandas` for data aggregation and
    -  `matplotlib` for visualization
- At the end of this tutorial you will be able to create an animated scatter plot of global demographics similar to the iconic one by the late [Hans Rosling](https://www.youtube.com/watch?v=jbkSRLYSojo).
- This tutorial is based on the great [Python Data Analysis Tutorial from Kristian Rother](https://www.academis.eu/pandas_go_to_space/)
- We will use data from
    - the [Gapminder Foundation](https://www.gapminder.org)
    - Additionally, we will add region and country names from the [REST Countries](https://restcountries.com/) API

# Exercises
- Exercise 1: Prepare the data
- Exercise 2: Life Expectancy
- Exercise 3: Population growth
- Exercise 4: Income gap between poor and rich countries
- Exercise 5: Life expectancy vs. Income
- Exercise 6: Animation life expectancy vs. income
- Bonus Exercise 7: Data retrieval process
- Bonus Exercise 8: Install Python and run this notebook locally

# Resources
- [ChatGPT](https://chat.openai.com/) knows pandas and matplotlib pretty well. Use it!
- [Official matplotlib documentation](https://matplotlib.org/)
- [Official pandas documentation](https://pandas.pydata.org/)

# Remarks:
- You will run this notebook in Google Colab. This avoids a local Python installation but comes with the disadvantage of being slower.
- In Bonus Exercise 8 you will run this notebook locally on your machine. This is the preferred way for future work.
- `matplotlib` is a very rich plotting library.
    - If you are into aesthetics please go ahead and make your plots as beautiful as you wish.
    - If not, please label at least the x and y axes in your plots

# Getting started
To get started, just execute the cells below, which
- import all necessary packages and
- retrieve the data as a pandas DataFrame

In [1]:
# import packages
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from IPython.display import HTML
from matplotlib.animation import FuncAnimation

# raise the embed limit (in MB) so the animation can be embedded as HTML
matplotlib.rcParams['animation.embed_limit'] = 50.0

# urls
URL_LIFE_EXPECTANCY = 'https://small-waffle.gapminder.org/fasttrack/master/b2642a8?_select_key@=country&=time;&value@=lex;;&from=datapoints&where_'
URL_GDP = 'https://small-waffle.gapminder.org/fasttrack/master/b2642a8?_select_key@=country&=time;&value@=gdp/_pcap;;&from=datapoints&where_'
URL_POPULATION = 'https://small-waffle.gapminder.org/fasttrack/master/b2642a8?_select_key@=country&=time;&value@=pop;;&from=datapoints&where_'
URL_COUNTRY = 'https://restcountries.com/v3.1/all'

def read_url(url):
    data = requests.get(url).json()
    return pd.DataFrame(data['rows'], columns=data['header'])

def retrieve_country_information():
    response = requests.get(URL_COUNTRY)
    cs = pd.DataFrame(response.json())[['name', 'cca3', 'region']]
    country = cs.name.map(lambda x: x['common'])
    cs = cs.assign(country=country)
    cs = cs.drop(columns='name')
    cs = cs.rename(columns={'cca3': 'country_short'})
    cs = cs.assign(country_short = cs.country_short.str.lower())
    return cs

def get_data():
    keys = ['country', 'time']
    df_life_expectancy = read_url(URL_LIFE_EXPECTANCY)
    df_gdp = read_url(URL_GDP)
    df_population = read_url(URL_POPULATION)
    df_countries = retrieve_country_information()
    df = (df_life_expectancy
          .merge(df_gdp, on=keys)
          .merge(df_population, on=keys)
          .rename(columns={'country': 'country_short'})
          .merge(df_countries, on='country_short', how='left')
        [['time', 'country', 'region', 'gdp_pcap', 'pop', 'lex']])

    return df

In [None]:
df_raw = get_data()
df_raw.head(10)
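
To see what `read_url` does without making a network call, here is a minimal sketch on a made-up payload. It assumes the endpoint returns JSON with a `header` list and a `rows` list of lists, which is what `read_url` indexes into. The name `payload_to_frame` and all sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical payload that mimics the structure read_url expects
sample_payload = {
    'header': ['country', 'time', 'lex'],
    'rows': [['deu', 1950, 67.0],
             ['deu', 1951, 67.4],
             ['jpn', 1950, 59.0]],
}

def payload_to_frame(data):
    """Turn a header/rows payload into a DataFrame (same idea as read_url)."""
    return pd.DataFrame(data['rows'], columns=data['header'])

sample_df = payload_to_frame(sample_payload)
print(sample_df)
```

Each entry in `rows` becomes one row of the DataFrame, and `header` supplies the column names.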

# Exercise 1: Prepare the data
In this exercise we will have a look at the data and prepare it for later use. 
In particular we will
- Filter the dataframe to contain only data up until 2024 (exclude predicted data)
- Rename the columns for easier use. 
- Create a new dataframe for later use

The raw dataframe has the following columns:
| Column    | Description                              |
|-----------|------------------------------------------|
| time      | Year |
| country   | Country |
| region    | Region |
| gdp_pcap  | Gross Domestic Product (GDP) per capita |
| pop       | Population size |
| lex       | Life expectancy (years) |

### Tasks
1. Display the DataFrame and try to understand its structure
2. Rename the following columns according to:
| Column name old    | Column name new                              |
|--------------------|------------------------------------------|
| time               | year |
| gdp_pcap           | gdp |
| pop                | population |
| lex                | life_expectancy |

3. Which data types do the columns have?
4. What are the minimum and maximum years occurring in the data?
5. How many different countries are in the data?
6. Filter the DataFrame to contain values only up to and including the year 2024. We don't want to include predictions in our analysis.
7. Write a new DataFrame with renamed columns and filtered years for later analysis.

### Solutions to Exercise 1

In [None]:
def cutoff_years(df):
    return df[df['year'] <= 2024]

def rename(df):
    columns = {'pop': 'population',
               'gdp_pcap': 'gdp',
               'lex': 'life_expectancy',
               'time': 'year'}
    return df.rename(columns=columns)

df = rename(df_raw)
display(df.dtypes)
print(f'Years range from {df.year.min()} - {df.year.max()}')
print(f'Number of countries {df.country.nunique()}')
df = cutoff_years(df)
df.head(10)
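
As a possible variation, the two helpers can be chained with `DataFrame.pipe`, so that renaming and filtering read as one pipeline. The sketch below runs on a toy stand-in for `df_raw` with invented values. The `last_year` parameter is an addition for illustration and is not part of the solution above.

```python
import pandas as pd

# Toy stand-in for df_raw; the values are invented
toy_raw = pd.DataFrame({
    'time': [2023, 2024, 2025],
    'country': ['Austria', 'Austria', 'Austria'],
    'gdp_pcap': [1.0, 2.0, 3.0],
    'pop': [10, 11, 12],
    'lex': [80.0, 81.0, 82.0],
})

def rename(df):
    columns = {'pop': 'population',
               'gdp_pcap': 'gdp',
               'lex': 'life_expectancy',
               'time': 'year'}
    return df.rename(columns=columns)

def cutoff_years(df, last_year=2024):
    # keep only rows up to and including last_year
    return df[df['year'] <= last_year]

# rename first, because cutoff_years relies on the new 'year' column
toy = toy_raw.pipe(rename).pipe(cutoff_years)
print(toy)
```

The order matters here: `cutoff_years` looks for a column called `year`, which only exists after `rename` has run.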

# Exercise 2: Life Expectancy
Now that we have prepared the data, we can start analysing. In this exercise we will look at life expectancy.
We ask the following questions of the data:

- How did life expectancy change over time for various countries?
- Is it true that Japanese people have always lived long lives (sushi)?
- What was the impact of the world wars on Germany and Russia?
- What about your home country?  

### Tasks
1. Filter country to Germany
2. Make a line plot of life expectancy (y-axis) over time (x-axis) for Germany
3. Set y-limit to 25-90 years
4. Write a function to plot life expectancy for a given country.
5. Bonus Task: Highlight World War I (1914-1918) and World War II (1939-1945)
    Hint: use
6. Plot life expectancy also for Russia, Japan, and your home country
7. Discuss the results and try to answer the questions above.

### Solution to Exercise 2

In [None]:
def filter_country(df, country):
    return df[df.country==country]

def plot_life_expectancy(df, country):
    filtered = filter_country(df, country)
    fig, ax = plt.subplots(figsize=(15,5))
    ax.plot(filtered.year, filtered.life_expectancy, color="#ef5000", label=country)
    ax.set_xlabel('Year')
    ax.set_ylabel('Life Expectancy')
    ax.fill_between([1914, 1918], 0, 100, color='red', alpha=.2, label='World Wars')
    ax.fill_between([1939, 1945], 0, 100, color='red', alpha=.2)
    ax.set_ylim(25,90)
    ax.legend()


plot_life_expectancy(df, 'Germany')
plot_life_expectancy(df, 'Russia')
plot_life_expectancy(df, 'Japan')
plot_life_expectancy(df, 'Austria')
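
The solution above shades the war years with `fill_between`. As an alternative sketch, matplotlib's `ax.axvspan` shades a vertical band over the full height of the axes, so no y-range has to be chosen. The life expectancy values below are invented and only serve to have a line to draw.

```python
import matplotlib.pyplot as plt

# Invented life expectancy values, only to have something to draw
years = [1900, 1910, 1916, 1930, 1943, 1950]
life_expectancy = [45, 48, 35, 58, 40, 65]

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(years, life_expectancy, color='#ef5000', label='toy country')

# axvspan shades a vertical band between two x values
wars = [(1914, 1918), (1939, 1945)]
spans = [ax.axvspan(start, end, color='red', alpha=0.2) for start, end in wars]
spans[0].set_label('World Wars')  # label only one band to avoid a duplicate legend entry

ax.set_xlabel('Year')
ax.set_ylabel('Life Expectancy')
ax.set_ylim(25, 90)
ax.legend()
```

Because the band always spans the whole axes height, it stays correct even if the y-limits are changed later.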


# Exercise 3: Population growth
In this exercise we will look at population growth. We ask the following questions of the data: 
- How did the worldwide population change over time?
- Is it true that Africa contributes most to the population growth?
- Did the population of Germany shrink during the world wars (as we have seen a dip in life expectancy)?
- How did the population develop for China, India and your home country?

### Task 
1. Aggregate the worldwide population by year
2. Plot the worldwide population by year (line plot)
3. Aggregate the worldwide population by year and region
4. Plot the worldwide population by year and region (line plot). Use colors for the individual regions, e.g.:
   -  Asia: red
   -  Europe: pink
   -  Africa: blue
   -  Americas: yellow
   -  Oceania: brown
   -  Hint: use matplotlib's `color` keyword
6. Filter the dataframe to contain Germany only
7. Plot the population development of Germany (line plot)
8. Write a function that plots the population development for a country.
9. Plot the population development for China, India and your home country.
10. Discuss the results and try to answer the questions above.

### Solutions to Exercise 3

In [6]:
world_df = df.groupby(['year'], as_index=False)[['population']].sum()
germany_df = df[df.country=='Germany']
region_df = df.groupby(['year', 'region'], as_index=False)[['population']].sum()
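
To make the aggregation step easier to follow, here is a small sketch on a toy table with invented numbers. It uses the same `groupby(...).sum()` pattern as above and additionally shows `pivot`, which is one possible way (not used in this notebook) to get one column per region.

```python
import pandas as pd

# Toy data: two Asian countries and one European country over two years
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001, 2001, 2001],
    'region': ['Asia', 'Asia', 'Europe', 'Asia', 'Asia', 'Europe'],
    'population': [10, 20, 5, 11, 21, 6],
})

# one row per year: sum over all countries
by_year = toy.groupby('year', as_index=False)[['population']].sum()

# one row per (year, region) pair
by_region = toy.groupby(['year', 'region'], as_index=False)[['population']].sum()

# wide layout: years as index, one column per region
wide = by_region.pivot(index='year', columns='region', values='population')
print(by_year)
print(wide)
```

With `as_index=False` the grouping keys stay ordinary columns, which makes the result easy to filter and plot.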

In [None]:
REGION_COLORS = {'Asia': 'red',
                 'Europe': 'pink',
                 'Africa': 'blue',
                 'Americas': 'yellow',
                 'Oceania': 'brown'}

fig, ax = plt.subplots(3, figsize=(15,10), sharex=True)
# world
ax[0].plot(world_df.year, world_df.population, color="#ef5000", label='World')

# region
for region in region_df.region.unique():
    single_region = region_df[region_df.region == region]
    ax[1].plot(single_region.year, single_region.population, color=REGION_COLORS[region], label=region)

# germany
ax[2].plot(germany_df.year, germany_df.population, color="#ef5000", label='Germany')
for a in ax:
    a.legend()
    a.set_ylabel('Population')

In [None]:
def plot_population(df, country):
    filtered = df[df.country==country]
    fig, ax = plt.subplots(figsize=(15,5))
    ax.plot(filtered.year, filtered.population, color="#ef5000", label=country)
    ax.set_xlabel('Year')
    ax.set_ylabel('Population')
    ax.legend()

plot_population(df, 'China')
plot_population(df, 'India')

# Exercise 4: Income gap between poor and rich countries
In this exercise we will look at the income gap between countries. We will use the Gross Domestic Product (GDP) per capita to judge the income gap between the poorest and richest countries in the world. We ask the following questions of the data:
- Did the income gap between the richest and the poorest countries increase over time?
- Did the income gap between the richest 5% and the poorest 5% increase over time?
- What is the difference between these two points of view?

### Tasks
1. Aggregate the minimum GDP by year
2. Aggregate the maximum GDP by year
3. Aggregate the 5th percentile per year (Hint: use `pd.Series.quantile(0.05)`)
4. Aggregate the 95th percentile per year (Hint: use `pd.Series.quantile(0.95)`)
5. Compute the ratio maximum GDP / minimum GDP per year
6. Compute the ratio 95th percentile / 5th percentile per year
7. Plot the ratios as bar plots.
8. Discuss the results and try to answer the questions above.
9. [Bonus] Which are the richest and the poorest countries as of 2024?
10. [Bonus] Plot the development of the GDP for your home country. 

### Solutions to Exercise 4

In [None]:
gap_df = (df.groupby('year', as_index=False)[['gdp']]
             .agg(gdp_min=('gdp', 'min'),
                  gdp_max=('gdp', 'max'),
                  gdp_quantile_low=('gdp', lambda s: s.quantile(.05)),
                  gdp_quantile_high=('gdp', lambda s: s.quantile(.95))))

gap_df = gap_df.assign(ratio_gdp=gap_df.gdp_max / gap_df.gdp_min,
                       ratio_quantile_gdp=gap_df.gdp_quantile_high / gap_df.gdp_quantile_low)


fig, ax = plt.subplots(2, figsize=(15,9), sharex=True)
ax[0].bar(gap_df.year, gap_df.ratio_gdp, color='#ef5000')
ax[0].set_ylabel('GDP Ratio Max/Min')
ax[1].bar(gap_df.year, gap_df.ratio_quantile_gdp, color='#ef5000')
ax[1].set_ylabel('GDP Ratio Quantile')
ax[1].set_xlabel('Year')
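
To get a feeling for why the two ratios can differ, here is a minimal sketch on invented GDP values for a single year. It relies on the default behaviour of `pd.Series.quantile`, which interpolates linearly between neighbouring values.

```python
import pandas as pd

# Invented GDP per capita values of five countries in one year
gdp = pd.Series([100, 200, 300, 400, 1000])

# ratio of the two extremes: driven entirely by the single richest and poorest country
ratio_extremes = gdp.max() / gdp.min()

# ratio of the 95th to the 5th percentile: less sensitive to single outliers
ratio_quantiles = gdp.quantile(0.95) / gdp.quantile(0.05)

print(ratio_extremes, ratio_quantiles)
```

The max/min ratio depends on just two countries, while the percentile ratio describes the edges of the bulk of the distribution.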

In [None]:
# Bonus exercises
def plot_country_gdp(df, country):
    country_gdp = df[df.country==country]
    fig, ax = plt.subplots(figsize=(15,4))
    ax.bar(country_gdp.year, country_gdp.gdp, color='#ef5000', label=country)
    ax.set_xlabel('Year')
    ax.set_ylabel('GDP per capita')
    ax.legend()

display(df[df.year==2024].sort_values('gdp', ascending=True).reset_index(drop=True))
plot_country_gdp(df, 'Austria')

# Exercise 5: Wealth by region
In this exercise we want to look at the wealth distribution by region. To this end we want to compare the GDP by regions over time. 
We ask the following questions of the data:
- Which are the richest regions in the world?
- Is it true that Africa is so poor?
- What is the danger of such an analysis? 

### Remark
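Recall that `gdp` is the GDP per capita, so we cannot simply add it up or average it across countries. We first need to multiply the GDP per capita by the population to get the total GDP per country. Summing the total GDP and the population over a region and dividing the two then gives the GDP per capita of that region.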
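
The following sketch illustrates the remark with two invented countries: a large poor one and a small rich one. It compares the plain average of the per-capita values with the population-weighted value described above.

```python
import pandas as pd

# Two invented countries: A is large and poor, B is small and rich
toy = pd.DataFrame({
    'country': ['A', 'B'],
    'gdp': [10.0, 1000.0],       # GDP per capita
    'population': [1000, 10],
})

# naive approach: average the per-capita values, ignoring country size
naive_mean = toy.gdp.mean()

# approach from the remark: total GDP divided by total population
gdp_tot = (toy.gdp * toy.population).sum()
weighted = gdp_tot / toy.population.sum()

print(naive_mean, weighted)
```

The naive mean is dominated by the small rich country, whereas the weighted value reflects what an average person in the combined region has.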

### Tasks
1. Assign a new column `gdp_tot` that is gdp * population
2. Sum by region and year `gdp_tot` and `population`
3. Compute the GDP per capita by dividing the summed GDP by the summed population
4. Plot the GDP per capita for each region over time in one plot (line plot).
   - Use the same colors as in Exercise 3
5. Which are the richest and the poorest countries of Asia as of 2024?
6. Discuss the results and try to answer the questions above.
7. [Bonus] Compare the GDP development of North and South Korea.

### Solutions to Exercise 5

In [None]:
# total GDP per country = GDP per capita * population
df_result = df.assign(gdp_tot=df.gdp * df.population)
df_result = (df_result
             .groupby(['region', 'year'], as_index=False)
             .agg(gdp_tot=('gdp_tot', 'sum'), population=('population', 'sum')))
# GDP per capita per region = summed GDP / summed population
df_result = df_result.assign(gdp=df_result.gdp_tot / df_result.population)

fig, ax = plt.subplots(figsize=(15,5))
for continent in df_result.region.unique():
    filtered = df_result[df_result.region == continent]
    ax.plot(filtered.year, filtered.gdp, color=REGION_COLORS[continent], label=continent)
ax.set_xlabel('Year')
ax.set_ylabel('GDP per capita')
ax.legend()

In [None]:
display(df[(df.region == 'Asia') & (df.year==2024)].sort_values('gdp').reset_index(drop=True))

In [None]:
# Bonus
south_korea = df[df.country=='South Korea']
north_korea = df[df.country=='North Korea']

# Share x and y scales for better comparison
fig, ax = plt.subplots(2, figsize=(15,9), sharex=True, sharey=True)
ax[0].bar(south_korea.year, south_korea.gdp, color='#ef5000', label='South Korea')
ax[1].bar(north_korea.year, north_korea.gdp, color='#ef5000', label='North Korea')
ax[0].set_ylabel('GDP per capita')
ax[1].set_ylabel('GDP per capita')
ax[1].set_xlabel('Year')
ax[0].legend()
ax[1].legend()

# Exercise 5: Life expectancy vs. Income
In this exercise we will look at the correlation between life expectancy and income. To this end we will draw a scatter plot in the spirit of [Hans Rosling](https://www.youtube.com/watch?v=jbkSRLYSojo) for the years 1950 and 2024. 
We wish to create a scatter plot for countries with:
- X-axis: GDP
- Y-axis: life expectancy
- Bubble size: grows with population (the size of the bubble should be proportional to the square root of the population)
- Bubble color: region 

We ask the following questions of the data:
- What was the state of the world in 1950 and 2024?
- What is the main difference between 1950 and 2024?
- What is the main difference between 1950 and 2024?


### Tasks
1. Choose GDP and life expectancy for the year 2024 and put them into a single table.
2. Draw a scatter plot of life expectancy over GDP in 2024.
   - Set an alpha value (e.g. `alpha=0.6`) to make overlapping bubbles more visible
   - Set the range of the x-axis (GDP) to 200 - 200000
   - Set the range of the y-axis (life expectancy) to 15-90
3. Use the same region colors as in Exercise 3
4. Set the bubble size to be proportional to the square root of the population
    - Hint: use the `s` keyword of matplotlib's `scatter`
    - Hint: use `np.sqrt(df.population) / 10` for the sizes
5. The x-axis has a huge spread. Use a logarithmic scale on the x-axis (Hint: `ax.set_xscale('log')`). 
6. Repeat steps 1 and 2 for the year 1950.
7. Discuss the results and try to answer the questions above.
8. Write a function that allows you to draw a scatter plot for any given year.

In [None]:
def filter_year(df, year):
    return df[df['year'] == year]

def scatter_plot(df_raw, year):
    df = filter_year(df_raw, year)
    fig, ax = plt.subplots(figsize=(16, 10))
    # bubble size scales with the square root of the population
    sizes = np.sqrt(df.population) / 10
    colors = df['region'].map(REGION_COLORS)
    ax.scatter(df.gdp, df.life_expectancy,
               s=sizes,
               c=colors,
               alpha=0.6,
               edgecolors="w",
               linewidth=2)
    ax.set_xlabel('GDP per capita', fontsize=15)
    ax.set_ylabel('Life Expectancy', fontsize=15)
    ax.set_xscale('log')
    ax.set_xlim(200, 200000)
    ax.set_ylim(15, 90)

scatter_plot(df, 1950)
scatter_plot(df, 2024)
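
Because the country information is merged with `how='left'` in `get_data`, a row may end up without a region. In that case `map(REGION_COLORS)` returns `NaN`, which matplotlib cannot interpret as a color. The sketch below shows one possible safeguard, assuming a neutral fallback color is acceptable. The choice of `'gray'` and the sample regions are made up for illustration.

```python
import pandas as pd

REGION_COLORS = {'Asia': 'red',
                 'Europe': 'pink',
                 'Africa': 'blue',
                 'Americas': 'yellow',
                 'Oceania': 'brown'}

# Sample regions, including a missing value and a region without a color entry
regions = pd.Series(['Asia', 'Europe', None, 'Antarctic'])

# map returns NaN for anything not in the dictionary; fillna supplies the fallback
colors = regions.map(REGION_COLORS).fillna('gray')
print(list(colors))
```

Whether such rows occur in the real data depends on how well the country codes of the two sources match, so it is worth checking `df.region.isna().sum()` first.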


# Exercise 6: Animation of life expectancy vs. income
In this exercise we will create an animation over time. 
The cell below provides an implementation that does the following:

1. Create `animation` using matplotlib's `FuncAnimation`
    - Display the animation as HTML using `HTML(animation.to_jshtml())`
    - Use the same colors as in Exercise 3
2. Make animation prettier: 
    - Don't show top and right spines
    - Use log scale on x-axis
    - Set x-axis labels at values `[400, 4000, 40000]`
    - Set y-axis labels at values `[25, 50, 75]`
    - Set the range of the y-axis to 15-90
    - Set the range of the x-axis to 200 - 200000
    - Colorize bottom and left spines as well as labels with gray
### Warning:
- Creating the animation will take some time.
### Tasks:
1. Run the code and observe how the world has developed since 1820
2. Discuss what happened during the world wars
3. Try to write your own implementation or adapt the code

In [None]:
fig, ax = plt.subplots(figsize=(16, 8), facecolor='none')
plt.close(fig)  # Prevent the static plot from displaying in Jupyter
fig.patch.set_alpha(0)  # Set the figure background to transparent
ax.set_facecolor('none')

def adapt_populations(s):
    return (s / 100000).map(lambda x: max(x, 10))
    #return s.map(lambda x: max(x, 5000000))

def animate(year):
    ax.clear()  # Clear previous frame
    filtered = filter_year(df, year).sort_values('population', ascending=True)
    sizes = np.sqrt(adapt_populations(filtered.population)) * 30
    colors = filtered['region'].map(REGION_COLORS)

    # Scatter plot with data for the given year
    ax.scatter(filtered.gdp, filtered.life_expectancy,
               s=sizes,
               c=colors,
               alpha=0.6,
               edgecolors="w",
               linewidth=2)

    # Set labels and limits
    ax.set_xlim(200, 200000)
    ax.set_ylim(15, 90)
    ax.set_xscale('log')
    # styling

    ax.text(0.95, 0.05, f'{year}', transform=ax.transAxes,
        fontsize=60, color='gray', ha='right', va='bottom')

    yticks = [25, 50, 75]
    ax.set_yticks(yticks)
    ax.set_yticklabels([f'{y}\nyears' for y in yticks], fontsize=20, ha='center', color='gray')
    ax.yaxis.set_tick_params(pad=30)

    xticks = [400, 4000, 40000]
    ax.set_xticks(xticks)
    ax.set_xticklabels([f'${x}' for x in xticks], fontsize=20, ha='center', color='gray')

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    ax.spines['bottom'].set_color('gray')
    ax.spines['left'].set_color('gray')

# Create the animation, but don't call plt.show() which can cause the initial static plot
animation = FuncAnimation(fig, animate, frames=range(1820, 2024))
HTML(animation.to_jshtml())
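
The implementation above clears and redraws the axes for every frame. As an alternative sketch, the scatter plot can be created once and only its point positions updated per frame with `set_offsets`. The example uses random toy data instead of the Gapminder data, and the names `update` and `toy_animation` are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Random toy data: one row of x and y positions per frame
rng = np.random.default_rng(0)
n_frames, n_points = 5, 20
x = rng.uniform(300, 100000, size=(n_frames, n_points))
y = rng.uniform(20, 85, size=(n_frames, n_points))

fig, ax = plt.subplots(figsize=(8, 4))
plt.close(fig)  # prevent the static plot from displaying in Jupyter
ax.set_xscale('log')
ax.set_xlim(200, 200000)
ax.set_ylim(15, 90)

# Create the artists once; only their data changes between frames
scatter = ax.scatter(x[0], y[0], s=50, alpha=0.6, edgecolors='w')
label = ax.text(0.95, 0.05, '', transform=ax.transAxes, ha='right', va='bottom')

def update(frame):
    # set_offsets expects an (n_points, 2) array of x/y positions
    scatter.set_offsets(np.column_stack([x[frame], y[frame]]))
    label.set_text(f'frame {frame}')
    return scatter, label

toy_animation = FuncAnimation(fig, update, frames=range(n_frames))
html = toy_animation.to_jshtml()
```

In a notebook, `HTML(html)` would display the result. Sizes and colors can be updated in the same way with `set_sizes` and `set_facecolors` if they change between frames.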

In [16]:
# vat save to slides as motivation
# animation.save('../slides/python/assets/rosling.html', writer='html')

# [Bonus] Exercise 7: Data retrieval process 
In this exercise we will go through the data retrieval process of this notebook 


### Country information
We retrieve region and country name information from the [REST Countries](https://restcountries.com/) API. 
The website lists all available endpoints. 
We are interested in the endpoint `https://restcountries.com/v3.1/all`
1. Copy this URL into your browser. You will see the data in JSON format.
2. Retrieve the data with Python using the `requests` package:
    - Make the API call `request = requests.get('https://restcountries.com/v3.1/all')`
    - Check the status code `request.status_code`. If all went fine you will get a status code of 200
    - Get the data as JSON and convert it to a DataFrame: `pd.DataFrame(request.json())`
    - Now you have the data as a DataFrame. That's cool.
3. Look at the function `retrieve_country_information()` at the top of this notebook.
   - Call this function
   - What is this function doing?
   - Try to transform your DataFrame so that you get the same result as the function call. 
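
The transformation in step 3 can be tried out without a network call. The sketch below uses a hand-written payload that only contains the three fields the notebook selects. It assumes that `name` is a nested dictionary with a `common` entry, as `retrieve_country_information()` does; real responses carry many more fields.

```python
import pandas as pd

# Hand-written stand-in for request.json(), reduced to the fields used in this notebook
sample_response = [
    {'name': {'common': 'Austria', 'official': 'Republic of Austria'},
     'cca3': 'AUT', 'region': 'Europe'},
    {'name': {'common': 'Japan', 'official': 'Japan'},
     'cca3': 'JPN', 'region': 'Asia'},
]

cs = pd.DataFrame(sample_response)[['name', 'cca3', 'region']]

# pull the common name out of the nested dict and lowercase the country code
cs = (cs.assign(country=cs['name'].map(lambda x: x['common']),
                country_short=cs['cca3'].str.lower())
        .drop(columns=['name', 'cca3']))
print(cs)
```

The lowercase three-letter code is what later serves as the merge key with the Gapminder data.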

### Demographic data
Go to www.gapminder.org/data and search for:
- life expectancy
- fertility rate, total
- population
You could simply download the data as CSV files and import them with pandas.
However, instead of downloading CSV files, we retrieve the data with an API request.
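
For comparison, here is a rough sketch of the CSV route. It assumes a wide layout with one row per country and one column per year, which is an assumption about the download format and should be checked against the actual file. The CSV text and its values are invented.

```python
import io
import pandas as pd

# Invented CSV content in an assumed wide layout (one column per year)
csv_text = """country,1950,1951
Austria,65.0,65.5
Japan,59.0,60.5
"""

wide = pd.read_csv(io.StringIO(csv_text))

# melt turns the year columns into rows: one row per (country, year) pair
long = wide.melt(id_vars='country', var_name='year', value_name='life_expectancy')

# the year values come from column names, so they are strings until converted
long = long.assign(year=long['year'].astype(int))
print(long)
```

With a real download, `pd.read_csv` would be given the file path instead of the `io.StringIO` object.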



### Step 2: Combine data
Combine data: 
- life expectancy 
- gdp
- population
- country information
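
The combine step can be sketched on toy tables as follows. The values are invented, and the `indicator=True` option is an addition for illustration that is not used in `get_data()`: it adds a `_merge` column showing which rows found no match in the country table.

```python
import pandas as pd

# Invented toy tables in the shape of the retrieved data
lex = pd.DataFrame({'country': ['aut', 'jpn'], 'time': [2024, 2024], 'lex': [82.0, 85.0]})
pop = pd.DataFrame({'country': ['aut', 'jpn'], 'time': [2024, 2024],
                    'pop': [9_000_000, 124_000_000]})
# the country table deliberately lacks Japan to show what a left merge does
countries = pd.DataFrame({'country_short': ['aut'],
                          'country': ['Austria'],
                          'region': ['Europe']})

combined = (lex.merge(pop, on=['country', 'time'])
               .rename(columns={'country': 'country_short'})
               .merge(countries, on='country_short', how='left', indicator=True))

# rows that exist in the demographic data but not in the country table
unmatched = combined[combined['_merge'] == 'left_only']
print(combined)
```

A left merge keeps every demographic row, so unmatched countries survive with missing `country` and `region` values instead of silently disappearing.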


In a first step, just execute these cells. 

In [24]:
request = requests.get('https://restcountries.com/v3.1/all')
cs = pd.DataFrame(request.json())

In [None]:
country = cs.name.map(lambda x: x['common'])
#cs = cs.assign(country=country)
country

# [Bonus] Exercise 8: Install Python and run this notebook locally
In this exercise you will install Python locally and run the notebook on your machine. 

### Tasks
1. [Install Python](https://realpython.com/installing-python/)
2. Install `pandas` and `matplotlib`:
    - vat also jupyter lab et al.
    - execute `pip install --upgrade pandas matplotlib`
    - Hint: it is good practice to use virtual environments in daily work
3. Download [this notebook from GitHub](https://github.com/8gradplus/msb_lecture/blob/main/colab/Python%20Pandas.ipynb) and save it
4. Open the notebook with JupyterLab