# Datasets


Life expectancy at birth [years]

```
https://raw.githubusercontent.com/open-numbers/ddf--gapminder--life_expectancy/master/ddf--datapoints--life_expectancy_at_birth--by--geo--time.csv
```



geo to country name



```
https://raw.githubusercontent.com/open-numbers/ddf--gapminder--life_expectancy/master/ddf--entities--geo.csv
```



country classifications

* Landlocked or coastline (landlocked)
* Latitude (latitude)
* Longitude (longitude)
* Religion (main_religion_2008)
* Country belongs to the UN (un_state)
* Income group (income_groups)


```
https://raw.githubusercontent.com/open-numbers/ddf--open_numbers--world_development_indicators/master/ddf--entities--geo--country.csv
```



Total population by sex, annually from 1950 to 2100.
* PopMale: Total male population (thousands)
* PopFemale: Total female population (thousands)
* PopTotal: Total population, both sexes (thousands)
* PopDensity: Population per square kilometre (thousands)



```
https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv
```



Population by 5-year age groups.
* PopMale: Male population in the age group (thousands)
* PopFemale: Female population in the age group (thousands)
* PopTotal: Total population in the age group (thousands)
* AgeGrp (string): label identifying the single age (e.g. 15) or age group (e.g. 15-19)
* AgeGrpStart (numeric): initial age of the age group
* AgeGrpSpan (numeric): length of the age group, in years
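
The relation between the three age-group columns can be sketched as follows. This is a minimal illustration, not part of the UN dataset tooling: `parse_age_group` is a hypothetical helper, and it assumes labels are either a single age (`15`) or a closed range (`15-19`); open-ended labels are deliberately not handled.

```python
def parse_age_group(label):
    """Return (start, span) for labels such as '15' or '15-19'.

    Assumption: open-ended labels (for example a trailing '+') are not handled.
    """
    if '-' in label:
        start, end = (int(part) for part in label.split('-'))
        # a range such as 15-19 covers both endpoints, hence the +1
        return start, end - start + 1
    return int(label), 1


print(parse_age_group('15-19'))  # (15, 5)
print(parse_age_group('15'))     # (15, 1)
```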




```
https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_PopulationByAgeSex_Medium.csv
```



# Instructions

* Always use x and y labels, units of measure and titles in your plots.
* Make sure that every data series in the plot is distinguishable.

* The text answers can be given either as comments in code blocks or in separate text blocks (preferred for readability).

* Make sure the code runs without errors and gives consistent output: restart the notebook and rerun all the cells before turning it in.

* When you are finished, you have two options:

  1.   give me access to a private repository on GitHub where you add the midterm
  2.   send me an .ipynb file (make sure you write your name in the file name)





# Task 1


### Visualize life-expectancy
* Plot life expectancy from 1900 until 2020 in the Sustainable Development Goal (SDG) regions (column 'un_sdg_region'). Countries and areas are grouped into eight SDG regions as defined by the United Nations Statistics Division and used for The Sustainable Development Goals Report (https://unstats.un.org/sdgs/indicators/regional-groups/). Compute and plot the *mean* life expectancy in each region from 1900 until 2020.


In [132]:
# clear variables for a new run (-f skips the confirmation prompt)
%reset -f
# imports
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from plotly.subplots import make_subplots
from scipy.stats import pearsonr


In [133]:
# Importing data
country_class_url = 'https://raw.githubusercontent.com/open-numbers/ddf--open_numbers--world_development_indicators/master/ddf--entities--geo--country.csv'
country_classes = pd.read_csv(country_class_url)
life_exp_url = 'https://raw.githubusercontent.com/open-numbers/ddf--gapminder--life_expectancy/master/ddf--datapoints--life_expectancy_at_birth--by--geo--time.csv'
life_exp_data = pd.read_csv(life_exp_url)

In [134]:
# group life_exp data by country
life_exp_countries = pd.DataFrame(columns=['country_key', 'data'])
life_exp_countries.country_key = life_exp_data.geo.unique()
for _, country in life_exp_countries.iterrows():
    country.data = life_exp_data[life_exp_data['geo'] == country.country_key].set_index('time').drop(columns='geo')

In [135]:
# drop countries from classes that do not have life expectancy values
has_life_exp = country_classes.country.isin(life_exp_data.geo.unique())
country_classes = country_classes[has_life_exp].reset_index()

In [136]:
# group country_classes data by un_sdg_regions
regions = pd.DataFrame(columns=['region_name', 'data', 'means', 'country_data'])
regions.region_name = country_classes.un_sdg_region.unique()
for _, region in regions.iterrows():
    region.data = country_classes[country_classes['un_sdg_region'] == region.region_name]

In [137]:
# combine to clean dataframe
life_exp_countries = life_exp_countries.set_index('country_key').T
life_exp_clean = pd.DataFrame(columns=life_exp_countries.columns, index=range(1800, 2101))

In [138]:
for country in life_exp_clean:
    life_exp_clean[country] = life_exp_countries[country].data

In [139]:
for _, region in regions.iterrows():
    region.country_data = life_exp_clean[region.data.country.values]

In [140]:
region_means_clean = pd.DataFrame(columns=regions.region_name, index=range(1800, 2101))
for _, region in regions.iterrows():
    region_means_clean[region.region_name] = region.country_data.mean(axis=1)
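
The cells above build the per-region means step by step. As a compact alternative, the same idea can be sketched with a merge followed by a group-by. This is only an illustration on made-up toy data: the frames `toy_life_exp` and `toy_classes` are hypothetical stand-ins for the two CSV files, and the value column name `life_expectancy_at_birth` is an assumption about the file layout.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration)
toy_life_exp = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb', 'bbb', 'ccc', 'ccc'],
    'time': [1900, 1901, 1900, 1901, 1900, 1901],
    'life_expectancy_at_birth': [30.0, 31.0, 40.0, 41.0, 50.0, 52.0],
})
toy_classes = pd.DataFrame({
    'country': ['aaa', 'bbb', 'ccc'],
    'un_sdg_region': ['region_x', 'region_x', 'region_y'],
})

# attach each country's region, then average per (year, region)
merged = toy_life_exp.merge(toy_classes, left_on='geo', right_on='country')
toy_region_means = (merged
                    .groupby(['time', 'un_sdg_region'])['life_expectancy_at_birth']
                    .mean()
                    .unstack('un_sdg_region'))
print(toy_region_means)
```

The result has one row per year and one column per region, which is the shape `px.line` expects for a multi-line plot.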

In [141]:
# plot years from 1900 to 2020
fig1 = px.line(region_means_clean.loc[1900:2020])
fig1.update_layout(
    title='Life Expectancy From 1900 Until 2020 in Sustainable Development Goal (SDG) Regions',
    yaxis_title='Life Expectancy [years]',
    xaxis_title='Year')
fig1.show()

* Pick a region of your choosing (not Australia + New Zealand!), and plot the life expectancy of all the countries in that region from 1900 to 2020. If the plot is too crowded, you can split the countries alphabetically (or in any other order) into 2+ plots.

In [142]:
# dict to rename countries from key to name
country_name_dict = dict(zip(country_classes.country, country_classes.name))

In [143]:
# plot years from 1900 to 2020 for eastern and southeastern countries (index 6 in df)
fig2 = go.Figure()
for country in regions.country_data[6].columns:
    fig2.add_trace(go.Scatter(y=regions.country_data[6][country].loc[1900:2020],
                              x=regions.country_data[6].loc[1900:2020].index,
                              mode='lines',
                              name=country_name_dict[country]))
fig2.update_layout(title='Life Expectancy From 1900 Until 2020 for Eastern and Southeastern Asia',
                  yaxis_title='Life Expectancy [years]',
                  xaxis_title='Year')
fig2.show()

### Countries with the highest and lowest life-expectancy

Find and plot life expectancy from 1900 until 2020 of the 5 countries that have the highest and lowest mean life expectancy in 2000-2020.

In [144]:
# .loc slices include both endpoints, so 2000:2020 covers exactly 2000-2020
means = life_exp_clean.loc[2000:2020].mean().sort_values()
countries_of_interest = pd.concat([means.iloc[:5], means.iloc[-5:]]).index.values
# plot years from 1900 to 2020
fig3 = go.Figure()
needs_legend_title = True
for country in means.iloc[-5:].index.values:
    fig3.add_trace(go.Scatter(y=life_exp_clean[country].loc[1900:2020],
                              x=life_exp_clean.loc[1900:2020].index,
                              mode='lines',
                              name=country_name_dict[country],
                              legendgroup='top'))
for country in means.iloc[:5].index.values:
    fig3.add_trace(go.Scatter(y=life_exp_clean[country].loc[1900:2020],
                              x=life_exp_clean.loc[1900:2020].index,
                              mode='lines',
                              name=country_name_dict[country],
                              legendgroup='bottom'))
fig3.update_layout(title='Countries With the Five Highest and Five Lowest Life Expectancies (based on 2000-2020 means)',
                   yaxis_title='Life Expectancy [years]',
                   xaxis_title='Year')
fig3.show()
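
The selection step can be sketched on toy data as follows. The frame `toy` and its values are made up; the point is that label-based `.loc` slicing includes both endpoints, and that `nsmallest` / `nlargest` pick the extremes without sorting the whole Series by hand.

```python
import pandas as pd

# made-up life expectancy values, indexed by year, one column per hypothetical country
toy = pd.DataFrame({'a': [50.0, 51.0, 52.0, 90.0],
                    'b': [80.0, 81.0, 82.0, 10.0],
                    'c': [60.0, 61.0, 62.0, 10.0]},
                   index=[2000, 2001, 2002, 2003])

# .loc slices by label and includes both endpoints, so 2003 is left out here
toy_means = toy.loc[2000:2002].mean()

lowest = toy_means.nsmallest(1).index.tolist()
highest = toy_means.nlargest(1).index.tolist()
print(lowest, highest)  # ['a'] ['b']
```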

### Which factors contribute to life-expectancy?


* Which of the following factors significantly contribute to 2000-2020 mean life expectancy?
  * Landlocked vs coastline
  * Being a high/upper-middle income country vs a lower-middle/low income country
  * Being in the Northern hemisphere vs the Southern hemisphere
  * Longitude
  * Being a Christian country vs a country of other religions
  * Being a UN state or not

* Define the significance threshold.
* Which test are you using?
* What is the interpretation of the test results?


* What is the contribution of these factors to mean life expectancy in 1900-1920?


* Use the most appropriate plot(s) to describe both the 2000-2020 and the 1900-1920 life expectancy distributions for the features that significantly affect life expectancy.



In [145]:
country_classes.insert(loc=0, column='modern_life_exp', value=None)
country_classes.insert(loc=0, column='past_life_exp', value=None)

In [146]:
country_classes = country_classes.set_index('country')
# .loc slices include both endpoints, so each window spans exactly 21 years
country_classes.modern_life_exp = life_exp_clean.loc[2000:2020].mean()
country_classes.past_life_exp = life_exp_clean.loc[1900:1920].mean()

In [147]:
# .copy() makes this an independent frame, which avoids pandas' SettingWithCopyWarning below
comparison_matrix = country_classes[['modern_life_exp', 'past_life_exp', 'landlocked', 'longitude', 'income_groups', 'main_religion_2008', 'un_state', 'latitude', 'name']].copy()

In [148]:
# derive one boolean column per factor
comparison_matrix['northern_hemisphere'] = ~(comparison_matrix.latitude < 0)
comparison_matrix['is_christian'] = comparison_matrix.main_religion_2008 == 'christian'
comparison_matrix['high_income'] = ~comparison_matrix.income_groups.isin(['low_income', 'lower_middle_income'])
comparison_matrix['is_landlocked'] = comparison_matrix.landlocked == 'landlocked'
comparison_matrix.drop(columns=['latitude', 'income_groups', 'main_religion_2008', 'landlocked'], inplace=True)
# keep only countries that have a 1900-1920 value
comparison_matrix = comparison_matrix[comparison_matrix['past_life_exp'].notna()]

In [149]:
comparing_fields = ['is_landlocked', 'high_income', 'northern_hemisphere', 'longitude', 'is_christian', 'un_state']
corr_modern = pd.DataFrame(index=pd.Index(comparing_fields, name='param'), columns=['corr', 'p'], dtype=float)
for param in comparing_fields:
    corr, p = pearsonr(comparison_matrix[param], comparison_matrix.modern_life_exp)
    # assign the whole row at once; chained .loc[...][...] assignment may write to a copy
    corr_modern.loc[param] = [corr, p]
corr_modern = corr_modern.sort_values(by='p')

In [150]:
corr_past = pd.DataFrame(index=pd.Index(comparing_fields, name='param'), columns=['corr', 'p'], dtype=float)
for param in comparing_fields:
    corr, p = pearsonr(comparison_matrix[param], comparison_matrix.past_life_exp)
    # assign the whole row at once; chained .loc[...][...] assignment may write to a copy
    corr_past.loc[param] = [corr, p]
corr_past = corr_past.sort_values(by='p')

In [151]:
# (column in comparison_matrix, subplot title) for each factor, in plotting order
factor_panels = [('high_income', 'High Income'), ('northern_hemisphere', 'Northern Hemisphere'),
                 ('is_landlocked', 'Landlocked'), ('longitude', 'Longitude'),
                 ('is_christian', 'Christian'), ('un_state', 'UN State')]

def plot_factor_panels(life_exp, corr_table, title):
    """One panel per factor: box plots for boolean factors, a scatter plot for longitude."""
    fig = make_subplots(rows=1, cols=len(factor_panels), shared_yaxes=False,
                        subplot_titles=[label for _, label in factor_panels])
    for col, (param, _) in enumerate(factor_panels, start=1):
        if param == 'longitude':
            fig.add_trace(go.Scatter(x=comparison_matrix[param], y=life_exp, mode='markers', showlegend=False), row=1, col=col)
        else:
            fig.add_box(x=comparison_matrix[param], y=life_exp, row=1, col=col, showlegend=False)
        # show the correlation and p value below each panel
        fig.update_xaxes(title_text=f'corr: {round(corr_table.loc[param, "corr"], 2)}, p: {round(corr_table.loc[param, "p"], 3)}', row=1, col=col)
    fig.update_yaxes(title_text='Life Expectancy [years]')
    fig.update_layout(title_text=title)
    return fig

fig4 = plot_factor_panels(comparison_matrix.modern_life_exp, corr_modern, 'Correlation of Parameters with Life Expectancy (2000-2020 mean)')
fig4.show()

fig5 = plot_factor_panels(comparison_matrix.past_life_exp, corr_past, 'Correlation of Parameters with Life Expectancy (1900-1920 mean)')
fig5.show()

In [152]:
relevant_params_modern = corr_modern[(abs(corr_modern['corr']) > 0.1) & (corr_modern.p < 0.05)].index.values
relevant_params_past = corr_past[(abs(corr_past['corr']) > 0.1) & (corr_past.p < 0.05)].index.values
relevant_params = np.intersect1d(relevant_params_modern, relevant_params_past)

In [153]:
# one panel per relevant parameter; titles follow the order of relevant_params
fig6 = make_subplots(rows=1, cols=len(relevant_params), shared_yaxes=False, subplot_titles=list(relevant_params))
for i, param in enumerate(relevant_params):
    # median life expectancy per era, split by whether the factor is True or False
    _values = pd.DataFrame(columns=['true', 'false'], index=['past', 'modern'], dtype=float)
    for era, column in [('past', 'past_life_exp'), ('modern', 'modern_life_exp')]:
        _values.loc[era, 'true'] = comparison_matrix.loc[comparison_matrix[param] == True, column].median()
        _values.loc[era, 'false'] = comparison_matrix.loc[comparison_matrix[param] == False, column].median()
    fig6.add_trace(trace=go.Scatter(y=_values['true'], x=_values.index, name='True', legendgroup=param, legendgrouptitle_text=param, line=dict(color='green')), row=1, col=i+1)
    fig6.add_trace(trace=go.Scatter(y=_values['false'], x=_values.index, name='False', legendgroup=param, line=dict(color='red')), row=1, col=i+1)

fig6.update_yaxes(title_text='Life Expectancy [years]')
fig6.update_xaxes(title_text='Era (1900-1920 and 2000-2020 means)')
fig6.update_layout(title_text='Median Life Expectancy by Relevant Parameter (p < 0.05; |corr| > 0.1), 1900-1920 vs 2000-2020')
fig6.show()

## Text Answers
### Which of the following factors significantly contribute to 2000-2020 mean life expectancy?
All factors were significant (`p < 0.05`), but not all correlate strongly.
Correlation values are interpreted as follows:
* 1: perfect correlation
* 0: no correlation at all
* -1: perfect negative correlation
* -0.1 to 0.1: no real correlation
* 0.1 to 0.3: some correlation
* -0.3 to -0.1: some negative correlation
* 0.3 to 1: strong correlation
* -1 to -0.3: strong negative correlation

Going by this classification, income (`corr=0.72`) and hemisphere (`corr=0.3`) contribute positively to life expectancy, while being landlocked (`corr=-0.32`) contributes negatively.
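
The interpretation list above can be written down as a small function. This is only a sketch: `describe_correlation` is a hypothetical helper, and the handling of the boundary values is an assumption (exactly 0.1 counts as "no real correlation", exactly 0.3 counts as "strong"), since the ranges in the list share their endpoints.

```python
def describe_correlation(corr):
    """Map a correlation coefficient to the labels used in the list above.

    Assumption: |corr| of exactly 0.1 is 'no real correlation',
    |corr| of exactly 0.3 is 'strong'.
    """
    strength = abs(corr)
    if strength <= 0.1:
        return 'no real correlation'
    label = 'strong' if strength >= 0.3 else 'some'
    direction = 'negative ' if corr < 0 else ''
    return f'{label} {direction}correlation'


for value in (0.72, -0.32, 0.16, 0.05):
    print(value, '->', describe_correlation(value))
```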


### Define the significance threshold.
Significant if |correlation| > 0.1 and p < 0.05
### Which test are you using?
Pearson correlation coefficient (`scipy.stats.pearsonr`).
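
How the test and the threshold combine can be illustrated on synthetic data. Everything below is made up for the example (the factor `toy_high_income`, the 15-year shift, the noise level); only `scipy.stats.pearsonr` and the two thresholds come from the analysis above. With a boolean factor, the Pearson coefficient is the point-biserial correlation.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# synthetic data: a binary factor that shifts life expectancy by 15 years
toy_high_income = rng.random(100) < 0.5
toy_life_exp = 60 + 15 * toy_high_income + rng.normal(0, 5, size=100)

# pearsonr returns the coefficient and the two-sided p value
corr, p = pearsonr(toy_high_income.astype(float), toy_life_exp)

# same rule as in the text: |corr| > 0.1 and p < 0.05
is_significant = (abs(corr) > 0.1) and (p < 0.05)
print(f'corr={corr:.2f}, p={p:.3g}, significant={is_significant}')
```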

### What is the interpretation of the test results?
High income was and remains the most strongly correlated factor, with `corr = 0.73` (modern) and `corr = 0.4` (past), both at high confidence (`p < 0.001`).
Interestingly, being landlocked and being in the northern hemisphere have become more important for life expectancy: both went from `|corr| < 0.1` (little to no correlation) to `|corr| ~ 0.3` (some correlation), each with high confidence (`p < 0.01`).
Being a Christian country, on the other hand, lost importance, going from `corr = 0.34` (strong correlation) to `corr = 0.16` (some correlation), both with `p < 0.05`.

### What is the contribution of these factors to mean life expectancy in 1900-1920?
Discussed in interpretation of test results above.

### Use the most appropriate plot(s) to describe both the 2000-2020 and the 1900-1920 life expectancy distributions for the features that significantly affect life expectancy.
The plot shows, for each feature that contributes significantly in both time frames (p < 0.05 and |corr| > 0.1), the median across countries of the era-mean life expectancy, split by whether the feature is true or false.

# Task 2

### Ratio female-male population


*   Find which countries have a different ratio between males and females of the ages 25-29 in 2010 vs 1950.
  * Note: in the 5-year age group dataset there are not only countries but also broader regions, therefore focus only on countries (as found in the "geo to country name" file)
  * Note: the unit of measure is 1000 individuals, do not transform it to run the statistical test
* Which statistical test do you use?
* Formulate the null and alternative hypotheses.

* Define the significance threshold.
* What is the interpretation of the test results?

* Plot in the most appropriate way the age distribution of males vs females in these countries (plus Switzerland) in 2010 vs 1950.
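
The filtering and ratio steps described in the notes above can be sketched on toy data. The frame `toy_pop`, the location names and all numbers are made up; the column names follow the dataset description at the top of this notebook.

```python
import pandas as pd

# made-up population rows (thousands); 'World' stands in for a non-country region
toy_pop = pd.DataFrame({
    'Location': ['Atlantis', 'Atlantis', 'World', 'World'],
    'Time': [1950, 2010, 1950, 2010],
    'AgeGrp': ['25-29'] * 4,
    'PopMale': [100.0, 150.0, 1000.0, 2000.0],
    'PopFemale': [125.0, 100.0, 1000.0, 2000.0],
})
toy_country_names = ['Atlantis']  # hypothetical list of valid country names

# keep only rows whose Location is a known country, then compute the male/female ratio
only_countries = toy_pop[toy_pop.Location.isin(toy_country_names)].copy()
only_countries['mfRatio'] = only_countries.PopMale / only_countries.PopFemale

# one row per country, one column per year
ratio_by_year = only_countries.pivot(index='Location', columns='Time', values='mfRatio')
print(ratio_by_year)
```

Because the ratio divides thousands by thousands, the unit cancels and no conversion is needed.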

## Text Answers
### Which statistical test do you use?
Z-test: it measures how far each country's value deviates from the mean across countries, in standard deviations.
The significance threshold is `p < 0.05`, corrected for multiple testing as per the note below. I am not certain the correction is necessary here: `scipy.stats.zscore` standardizes the entire sample in a single call rather than running one test per country, so it might already be handled.
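
What the z-score step does can be seen on a handful of made-up differences. This is a sketch under assumptions: `toy_differences` is invented, and the two-sided p value is shown only for comparison with the one-sided tail probability from `norm.sf`.

```python
import numpy as np
from scipy.stats import norm, zscore

# made-up ratio differences; the last value is an obvious outlier
toy_differences = np.array([0.01, -0.02, 0.00, 0.03, -0.01, 0.02, -0.03, 0.90])

# zscore standardizes the whole sample: (x - mean) / std
z = zscore(toy_differences)

# norm.sf gives the upper-tail probability, i.e. a one-sided p value
one_sided_p = norm.sf(np.abs(z))
# doubling it gives the two-sided p value for "different in either direction"
two_sided_p = 2 * one_sided_p

print(np.round(z, 2))
print(np.round(two_sided_p, 4))
```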
### What is the interpretation of the test results?
Two countries have z-scores beyond the corrected 0.05 significance threshold: Oman and the UAE. They go from a distribution typical of less developed countries (each successive age group is smaller) to one closer to that of more developed countries, where age groups stay about the same size because fewer people die young. Compared with Switzerland, though, it is clear that some artifacts from earlier years are still visible in 2010, with older age groups being larger than younger ones.
Since most countries have changed in some way, what may single out these two countries is the sheer size of the difference between the two observations (1950 and 2010).

In [154]:
# imports
import pandas as pd
import numpy as np
from scipy.stats import zscore
from scipy.stats import norm
from plotly.subplots import make_subplots

In [None]:
pop_data_sex = pd.read_csv('https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv')
pop_data_age = pd.read_csv('https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_PopulationByAgeSex_Medium.csv')
countries = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--life_expectancy/master/ddf--entities--geo.csv')

In [None]:
# Set significance threshold (p)
sig_threshhold = 0.05

In [None]:
# Query raw data to isolate data of interest
is_country = pop_data_age.Location.isin(countries.name.unique())
is_age_group = pop_data_age.AgeGrp == '25-29'
columns = ['Location', 'PopMale', 'PopFemale']
# .copy() so the ratio column can be added later without a SettingWithCopyWarning
ratio_data_1950 = pop_data_age.loc[is_country & is_age_group & (pop_data_age.Time == 1950), columns].copy()
ratio_data_2010 = pop_data_age.loc[is_country & is_age_group & (pop_data_age.Time == 2010), columns].copy()

In [None]:
# Calculate the ratios for males and females
ratio_data_1950['mfRatio'] = ratio_data_1950.PopMale / ratio_data_1950.PopFemale
ratio_data_2010['mfRatio'] = ratio_data_2010.PopMale / ratio_data_2010.PopFemale

In [None]:
# Join data from 1950 and 2010 into one dataframe
joint_ratio_data = ratio_data_2010[['Location', 'mfRatio']].join(ratio_data_1950[['mfRatio', 'Location']].set_index('Location'), on='Location', lsuffix='_2010', rsuffix='_1950')
joint_ratio_data['Difference'] = joint_ratio_data.mfRatio_1950 - joint_ratio_data.mfRatio_2010

In [None]:
# Calculate z score of all differences
joint_ratio_data['zscore'] = zscore(joint_ratio_data.Difference)
# Calculate one-sided p values from the absolute z scores
joint_ratio_data['pvalue'] = norm.sf(abs(joint_ratio_data.zscore))
# Select countries whose p value is below the multiple-testing-corrected significance threshold
significant_countries = joint_ratio_data[joint_ratio_data.pvalue < sig_threshhold/len(joint_ratio_data)].Location.values
significant_countries = np.append(significant_countries, ['Switzerland'])

In [None]:
# Select data to plot
data_to_plot = pop_data_age[pop_data_age.Location.isin(significant_countries) & ((pop_data_age.Time == 1950) | (pop_data_age.Time == 2010))]

In [None]:
# Plot data of interest
fig = make_subplots(rows=1, cols=len(significant_countries), shared_yaxes=False, subplot_titles=list(significant_countries))
# 1950 is drawn to the left (negative x), 2010 to the right (positive x)
eras = [(1950, -1, 'blue'), (2010, 1, 'orange')]
for i, country in enumerate(significant_countries):
    for year, sign, color in eras:
        data = data_to_plot[(data_to_plot.Location == country) & (data_to_plot.Time == year)]
        # share of each age group in the country's total population, in percent
        share = sign * data.PopTotal / data.PopTotal.sum() * 100
        # show legend entries only once, for the first country
        fig.add_bar(x=share, y=data.AgeGrp, col=i+1, row=1, orientation='h', marker_color=color,
                    showlegend=(i == 0), legendgroup=str(year), name=f'{year} Distribution')

fig.update_layout(barmode='overlay', title='Distribution of Population in Age Segments in 1950 and 2010')
fig.update_xaxes(range=[-20, 20], title='Population in Age Segment [%]', tickvals=[-20,-10,0,10,20], ticktext=[20,10,0,10,20])
fig.update_yaxes(title='Age Group Range [years]')
fig.show()

### Note:

Since you will be performing multiple statistical tests (one per nation), you have to adjust the significance threshold of your p-value to avoid false positives: when no real difference exists, about 5% of p-values will still fall below 0.05 by chance, so across more than 100 tests several will look significant without the comparison being truly different. This adjustment is called multiple testing correction. In practice, divide the original significance threshold by the number of tests performed; the result is the multiple-testing-corrected significance threshold.
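
The correction described in this note (dividing the threshold by the number of tests, commonly known as the Bonferroni correction) can be sketched in a few lines. The p-values below are made up for illustration.

```python
import numpy as np

alpha = 0.05
toy_pvalues = np.array([0.0001, 0.004, 0.03, 0.2, 0.6])  # made-up values, one per test

# divide the original threshold by the number of tests
corrected_alpha = alpha / len(toy_pvalues)

# a test counts as significant only if it passes the corrected threshold
significant = toy_pvalues < corrected_alpha
print(corrected_alpha, significant)
```

Note that `0.03` would pass the uncorrected threshold of 0.05 but fails the corrected one, which is exactly the kind of false positive the correction is meant to filter out.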