**DSS5201 Project Coding**<br>
**Human Development Index**

**In this project, our team reproduced the original visualization from the official UN website using the Plotly library. We implemented interactivity by displaying each country's HDI-related data on hover. In the improvement section, we added a region selector and emphasized the hovered line so that an individual country's data stands out clearly.**

**Tip: To try the interactive features of the visualizations, run the code first or view the results in the notebook.**

**1. Data Preprocessing**


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

In [2]:
# 1.1 Load the data
codebook_csv_path = '../data/codebook.csv'
recommended_citation_csv_path = '../data/recommended_citation.csv'
undp_composite_indices_csv_path = '../data/undp_composite_indices.csv'

# Load the data into dataframes for exploration
codebook_df = pd.read_csv(codebook_csv_path)
recommended_citation_df = pd.read_csv(recommended_citation_csv_path)
undp_composite_indices_df = pd.read_csv(undp_composite_indices_csv_path)

# Display a preview of each dataframe to understand their structure
codebook_df.head()

Unnamed: 0,full_name,short_name,time_series
0,ISO3,iso3,-
1,HDR Country Name,country,-
2,Human Development Groups,hdicode,-
3,UNDP Developeing Regions,region,-
4,HDI,HDI,HDI


In [3]:
recommended_citation_df.head()

Unnamed: 0,recommended_citation
0,“Source: UNDP (United Nations Development Prog...


In [4]:
undp_composite_indices_df.head()

Unnamed: 0,iso3,country,region,gender,year,human_development_index,life_expectancy_at_birth,expected_years_of_schooling,mean_years_of_schooling,gross_national_income_per_capita
0,AFG,Afghanistan,SA,Female,1990,0.19628,48.3973,1.970663,0.342503,668.05576
1,AFG,Afghanistan,SA,Female,1991,0.196378,49.1439,2.096679,0.37186,564.926374
2,AFG,Afghanistan,SA,Female,1992,0.199362,50.3197,2.230753,0.401218,508.75073
3,AFG,Afghanistan,SA,Female,1993,0.195311,52.7389,2.373401,0.430575,374.581093
4,AFG,Afghanistan,SA,Female,1994,0.182092,53.5442,2.525171,0.459933,266.20761


In [5]:
# Check the needed dataframe
undp_composite_indices_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19392 entries, 0 to 19391
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   iso3                              19392 non-null  object 
 1   country                           19392 non-null  object 
 2   region                            14496 non-null  object 
 3   gender                            19392 non-null  object 
 4   year                              19392 non-null  int64  
 5   human_development_index           15989 non-null  float64
 6   life_expectancy_at_birth          19392 non-null  float64
 7   expected_years_of_schooling       17619 non-null  float64
 8   mean_years_of_schooling           17011 non-null  float64
 9   gross_national_income_per_capita  18130 non-null  float64
dtypes: float64(5), int64(1), object(4)
memory usage: 1.5+ MB


In [6]:
# 1.2 Select the HDI data of the countries regardless of gender
df_tidy = undp_composite_indices_df.query('gender == "Total"').drop('gender',axis=1)
df_tidy.head()

Unnamed: 0,iso3,country,region,year,human_development_index,life_expectancy_at_birth,expected_years_of_schooling,mean_years_of_schooling,gross_national_income_per_capita
64,AFG,Afghanistan,SA,1990,0.273,45.9672,2.50405,0.971125,2684.550019
65,AFG,Afghanistan,SA,1991,0.279,46.6631,2.80655,1.019356,2276.289409
66,AFG,Afghanistan,SA,1992,0.287,47.5955,3.10905,1.067586,2059.868084
67,AFG,Afghanistan,SA,1993,0.297,51.4664,3.41155,1.115817,1525.533426
68,AFG,Afghanistan,SA,1994,0.292,51.4945,3.71405,1.164047,1087.96189


In [7]:
# Check the data type
df_tidy.dtypes

iso3                                 object
country                              object
region                               object
year                                  int64
human_development_index             float64
life_expectancy_at_birth            float64
expected_years_of_schooling         float64
mean_years_of_schooling             float64
gross_national_income_per_capita    float64
dtype: object

In [8]:
# 1.3 Remove the NA
print('Number of missing values in original data')
print(df_tidy.isna().sum())

# Drop rows where crucial columns have missing values 
df_tidy_drop = df_tidy.dropna(subset=['human_development_index'])

# Label the aggregate "World" rows, then expand the region codes to full names
df_cleaned = df_tidy_drop.copy() # To avoid SettingWithCopyWarning
df_cleaned.loc[df_cleaned['country'] == 'World', 'region'] = 'World'
replace_dict = {'SA': 'South Asia',
                'SSA': 'Sub-Saharan Africa',
                'ECA': 'Europe and Central Asia',
                'AS': 'Arab States',
                'LAC': 'Latin America and the Caribbean',
                'EAP': 'East Asia and the Pacific'
}
df_cleaned['region'] = df_cleaned['region'].replace(replace_dict)
df_cleaned['region'] = df_cleaned['region'].fillna('Developed Region')

print('--------------------------------------------')
print('Number of missing values after processing')
print(df_cleaned.isna().sum())

Number of missing values in original data
iso3                                   0
country                                0
region                              1632
year                                   0
human_development_index              669
life_expectancy_at_birth               0
expected_years_of_schooling          321
mean_years_of_schooling              579
gross_national_income_per_capita     132
dtype: int64
--------------------------------------------
Number of missing values after processing
iso3                                0
country                             0
region                              0
year                                0
human_development_index             0
life_expectancy_at_birth            0
expected_years_of_schooling         0
mean_years_of_schooling             0
gross_national_income_per_capita    0
dtype: int64
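
The order of the cleaning steps above matters: the `World` rows are labelled before the generic `fillna`, so they do not end up as `Developed Region`. A minimal sketch on a toy frame, assuming made-up HDI values and only two of the region codes (the name `toy` and its numbers are illustrative, not taken from the UNDP file):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["Afghanistan", "Norway", "World", "Somalia"],
    "region": ["SA", None, None, "SSA"],
    "human_development_index": [0.46, 0.96, 0.74, None],
})

# 1. Drop rows that have no HDI value (Somalia in this toy frame)
toy_cleaned = toy.dropna(subset=["human_development_index"]).copy()

# 2. Label the aggregate row before the generic fill below
toy_cleaned.loc[toy_cleaned["country"] == "World", "region"] = "World"

# 3. Expand region codes to full names
toy_cleaned["region"] = toy_cleaned["region"].replace(
    {"SA": "South Asia", "SSA": "Sub-Saharan Africa"}
)

# 4. Whatever is still missing is treated as a developed region
toy_cleaned["region"] = toy_cleaned["region"].fillna("Developed Region")

print(toy_cleaned[["country", "region"]])
```

If step 2 ran after step 4, the `World` row would already carry `Developed Region` and would still be overwritten correctly here, but any filter applied between the two steps would misclassify it, so labelling it first is the safer order.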


In [9]:
# 1.4 Deal with the duplicates
# Check the duplicates
print(df_cleaned[df_cleaned.duplicated(subset=['country', 'year'], keep=False)])

Empty DataFrame
Columns: [iso3, country, region, year, human_development_index, life_expectancy_at_birth, expected_years_of_schooling, mean_years_of_schooling, gross_national_income_per_capita]
Index: []
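
The empty result means no `(country, year)` pair occurs more than once. For reference, a small sketch of how the same check behaves when duplicates do exist, using an invented toy frame; the `drop_duplicates` call is one possible follow-up and is not something the notebook needed:

```python
import pandas as pd

# Toy data, invented for illustration: country A has two rows for 1990
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1990, 1990, 1990, 1991],
    "human_development_index": [0.50, 0.51, 0.80, 0.81],
})

# keep=False flags every member of a duplicated group, not just the repeats
dupes = toy[toy.duplicated(subset=["country", "year"], keep=False)]
print(dupes)

# One possible resolution: keep the first row of each (country, year) pair
deduped = toy.drop_duplicates(subset=["country", "year"], keep="first")
print(len(deduped))
```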


In [10]:
# 1.5 Create a new variable: year-over-year HDI change per country
df_cleaned = df_cleaned.sort_values(['country', 'year'])
df_cleaned['hdi_change'] = df_cleaned.groupby('country')['human_development_index'].pct_change()

# Fill change percent of the first year of each country with 0
df_cleaned['hdi_change'] = df_cleaned['hdi_change'].fillna(0)
df_cleaned[df_cleaned['country'] == 'Singapore'].head()

Unnamed: 0,iso3,country,region,year,human_development_index,life_expectancy_at_birth,expected_years_of_schooling,mean_years_of_schooling,gross_national_income_per_capita,hdi_change
14752,SGP,Singapore,East Asia and the Pacific,1990,0.727,74.9444,10.601848,6.36335,38095.46683,0.0
14753,SGP,Singapore,East Asia and the Pacific,1991,0.737,75.3289,10.802198,6.661062,39286.23565,0.013755
14754,SGP,Singapore,East Asia and the Pacific,1992,0.748,75.6009,11.002547,6.958774,41252.48872,0.014925
14755,SGP,Singapore,East Asia and the Pacific,1993,0.758,75.8015,11.202897,7.256486,43538.8571,0.013369
14756,SGP,Singapore,East Asia and the Pacific,1994,0.77,75.8847,11.403247,7.554198,47886.23501,0.015831
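
Grouping by country before calling `pct_change` keeps the first year of one country from being compared with the last year of the previous one. A minimal sketch on invented numbers (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [1990, 1991, 1992, 1990, 1991],
    "human_development_index": [0.50, 0.55, 0.55, 0.80, 0.76],
})

# Sort so that each country's rows are in chronological order
toy = toy.sort_values(["country", "year"])

# Relative change from the previous year, computed within each country
toy["hdi_change"] = (
    toy.groupby("country")["human_development_index"].pct_change().fillna(0)
)

print(toy)
```

Country B's 1990 row gets 0 rather than a change relative to country A's 1992 value, which is what an ungrouped `pct_change` would have produced.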


**2. Reproduction**

In [11]:
# All regions
sample_data = df_cleaned.copy()

# Calculate the average HDI value for each country and sort by HDI
avg_hdi_per_country = sample_data.groupby('country')['human_development_index'].mean().sort_values()

# Create a color gradient (from purple to red)
norm = plt.Normalize(vmin=avg_hdi_per_country.min(), vmax=avg_hdi_per_country.max())
colorscale = px.colors.sequential.Turbo[::-1]  

# Assign colors to these countries
country_color_map = {country: px.colors.sample_colorscale(colorscale, norm(hdi))[0]
                     for country, hdi in avg_hdi_per_country.items()}

# Add assigned colors to a dataset
sample_data['color'] = sample_data['country'].map(country_color_map)

# Collect the fields shown in the hover tooltip into one list per row
sample_data['custom_data'] = sample_data.apply(
    lambda row: [row['country'], row['year'], row['human_development_index'], row['hdi_change'],
                 row['life_expectancy_at_birth'], row['expected_years_of_schooling'],
                 row['mean_years_of_schooling'], row['gross_national_income_per_capita']],
    axis=1
)

# Create Line Charts
fig = px.line(sample_data, x='year', y='human_development_index', color='country',
              line_group='country',  
              labels={
                  'year': 'Year',
                  'human_development_index': 'Human Development Index (HDI) in initial year'
              },
              title='HDI Over Time for All Countries',
              custom_data=['custom_data'])


# Update line style and hover template, set color manually
for country in avg_hdi_per_country.index:
    if country == 'World':  
        fig.update_traces(selector=dict(name=country),
                          line=dict(dash='dash', width=2, color='black'))
    else:  
        fig.update_traces(selector=dict(name=country),
                          line=dict(dash='dash', width=1, color=country_color_map[country]))


# Update the hover template to show the tooltip text
fig.update_traces(hovertemplate="<b>%{customdata[0][0]}</b><br>"
                                "%{customdata[0][1]} HDI value: %{customdata[0][2]:.3f}<br>"
                                "HDI change from previous year: %{customdata[0][3]:+.2%}<br>"
                                "Life expectancy at birth: %{customdata[0][4]:.1f} years<br>"
                                "Expected years of schooling: %{customdata[0][5]:.1f} years<br>"
                                "Mean years of schooling: %{customdata[0][6]:.1f} years<br>"
                                "Gross National Income per capita: %{customdata[0][7]:,.0f} (constant 2017 PPP$)"
                            )

# Set figure size
fig.update_layout(width=1300, 
                height=700, 
                xaxis=dict(
                    tickmode='linear',  
                    dtick=1
                ),
                yaxis=dict(
                    tickvals=[0.200, 0.280, 0.360, 0.440, 0.520, 0.600, 0.680, 0.760, 0.840, 0.920, 1.000],  
                    ticktext=['0.200', '0.280', '0.360', '0.440', '0.520', '0.600', '0.680', '0.760', '0.840', '0.920', '1.000']  
                ), 
                annotations=[
                    dict(
                        x=1.0,  
                        y=1.05,  
                        xref="paper",  
                        yref="paper",  
                        text="<b>Low (< 0.550)</b> &nbsp;|&nbsp; <b>Medium (0.550-0.699)</b> &nbsp;|&nbsp; <b>High (0.700-0.799)</b> &nbsp;|&nbsp; <b>Very high (≥ 0.800)</b>",  
                        showarrow=False,  
                        align="center",  
                        bgcolor="rgba(255,255,255,0.8)",  
                        bordercolor="black",  
                        borderwidth=1,  
                        font=dict(size=10),  
                )
    ]
)

# Save as an HTML file
fig.write_html("HDI_ALL_COUNTRIES.html")

# Show figure
fig.show()

**3. Improvement**

In [12]:
app = Dash(__name__)

app.layout = html.Div([
    html.H4('HDI Over Time for All Countries'),
    dcc.Graph(id="graph"),
    # Create a selector of the regions
    dcc.Checklist(
        id="checklist",
        options=[
            {'label': 'Arab States', 'value': 'Arab States'},
            {'label': 'East Asia and the Pacific', 'value': 'East Asia and the Pacific'},
            {'label': 'Europe and Central Asia', 'value': 'Europe and Central Asia'},
            {'label': 'Latin America and the Caribbean', 'value': 'Latin America and the Caribbean'},
            {'label': 'South Asia', 'value': 'South Asia'},
            {'label': 'Sub-Saharan Africa', 'value': 'Sub-Saharan Africa'},
            {'label': 'Developed Region', 'value': 'Developed Region'},
            {'label': 'World', 'value': 'World'}
        ],
        value=['World', 'East Asia and the Pacific'],  
        inline=True
    ),
])

@app.callback(
    Output("graph", "figure"), 
    Input("checklist", "value"),
    Input("graph", "hoverData")
)
def update_line_chart(selected_regions,hoverData):
    df = sample_data[sample_data['region'].isin(selected_regions)] 
    fig = px.line(df, x='year', y='human_development_index', color='country',
                  line_group='country',
                  labels={
                      'year': 'Year',
                      'human_development_index': 'Human Development Index (HDI)'
                  },
                  custom_data=['custom_data'])
    
    for country in avg_hdi_per_country.index:
        if country == 'World':
            fig.update_traces(selector=dict(name=country),
                              line=dict(dash='solid', width=2, color='black'))
        else:
            fig.update_traces(selector=dict(name=country),
                              line=dict(dash='solid', width=1, color=country_color_map[country]))
    
    # If a point is being hovered over, thicken that country's line
    if hoverData:
        hovered_country = hoverData['points'][0]['customdata'][0][0]   
        fig.update_traces(selector=dict(name=hovered_country),
                          line=dict(width=4))
        
    fig.update_traces(
        hovertemplate="<b>%{customdata[0][0]}</b><br>"
                  "%{customdata[0][1]} HDI value: %{customdata[0][2]:.3f}<br>"
                  "HDI change from previous year: %{customdata[0][3]:+.2%}<br>"
                  "Life expectancy at birth: %{customdata[0][4]:.1f} years<br>"
                  "Expected years of schooling: %{customdata[0][5]:.1f} years<br>"
                  "Mean years of schooling: %{customdata[0][6]:.1f} years<br>"
                  "Gross National Income per capita: %{customdata[0][7]:,.0f} (constant 2017 PPP$)"
    )
    
    fig.update_layout(width=1300, 
                      height=700, 
                      xaxis=dict(tickmode='linear', dtick=2),
                      yaxis=dict(
                          tickvals=[0.200, 0.280, 0.360, 0.440, 0.520, 0.600, 0.680, 0.760, 0.840, 0.920, 1.000],
                          ticktext=['0.200', '0.280', '0.360', '0.440', '0.520', '0.600', '0.680', '0.760', '0.840', '0.920', '1.000']
                      ),
                      annotations=[
                          dict(
                              x=1.0, 
                              y=1.05, 
                              xref="paper",
                              yref="paper",
                              text="<b>Low (< 0.550)</b> | <b>Medium (0.550-0.699)</b> | <b>High (0.700-0.799)</b> | <b>Very high (≥ 0.800)</b>",
                              showarrow=False,
                              align="center",
                              bgcolor="rgba(255,255,255,0.8)",
                              bordercolor="black",
                              borderwidth=1,
                              font=dict(size=10)
                          )
                      ])
            
    return fig

# app.run is the current entry point; app.run_server is the older name
app.run(debug=True)
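
The callback reads the hovered country from `hoverData['points'][0]['customdata'][0][0]`, which raises an error if the payload has no `customdata`. A defensive sketch of that lookup as a standalone helper; the function name `hovered_country` and the sample payload are hypothetical, with the payload shaped like the structure the callback indexes:

```python
from typing import Optional


def hovered_country(hover_data: Optional[dict]) -> Optional[str]:
    """Return the country name from a dcc.Graph hoverData payload, or None."""
    if not hover_data or not hover_data.get("points"):
        return None
    customdata = hover_data["points"][0].get("customdata")
    if not customdata:
        return None
    first = customdata[0]
    # The notebook stores one list per row in a single column, so the first
    # customdata entry is itself a list whose first item is the country.
    if isinstance(first, (list, tuple)):
        return first[0] if first else None
    return first


# Hypothetical payload; the values mirror the Singapore 1990 row shown earlier
sample = {
    "points": [
        {
            "curveNumber": 0,
            "pointNumber": 0,
            "x": 1990,
            "y": 0.727,
            "customdata": [["Singapore", 1990, 0.727, 0.0]],
        }
    ]
}

print(hovered_country(sample))
print(hovered_country(None))
```

Inside the callback, the result could be checked for `None` before calling `fig.update_traces(selector=dict(name=...), line=dict(width=4))`.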


**Acknowledgement**

We acknowledge the use of ChatGPT to debug code that was subsequently included in modified form in our report. We entered the following prompts on October 21-27, 2024:<br> “Introduces the ploty library and explains the corresponding parameters”, “How do I add a label to the top right corner of a plot?”, “How to make the line color of a line graph change from top to bottom to blue to red with a gradient effect?”, “How to adjust hovertemplate with custom”, “How to change the line width of a point when the cursor hovers over it?”, “What is the format of hoverData?”

**Task Allocation**

**Dai Yifan**: Preprocess data and reproduce the original visualization.<br>
**Liu Qiping**: Correct the previous errors and improve the reproduced visualization.<br>
**Rou Jun Chen**: Create a PowerPoint presentation and record a video.<br>
**Heran Zhang**: Create a PowerPoint presentation and record a video.
