# 1. Objective

The objective is to gain a better understanding of the well-being of individuals across various countries. The Better Life Index dataset takes a wide range of variables into account, factoring in health, economic stability, education, safety and many others.

# 2. Data Understanding

Within this section, we seek to understand the data, gaining insights into the distribution and relationships between the various variables.

To begin, we will import the necessary libraries and conduct a brief descriptive analysis.

In [1]:
import pandas as pd
import numpy as np
import altair as alt

In [2]:
df = pd.read_excel('betterLifeIndex.xlsx')

In [3]:
df.head()

Unnamed: 0,Country,Housing - Dwellings without basic facilities as pct,Housing - Housing expenditure as pct,Housing - Rooms per person as rat,Income - Household net adjusted disposable income in usd,Income - Household net wealth in usd,Jobs - Labour market insecurity as pct,Jobs - Employment rate as pct,Jobs - Long-term unemployment rate as pct,Jobs - Personal earnings in usd,...,Environment - Water quality as pct,Civic Engagement - Stakeholder engagement for developing regulations as avg score,Civic Engagement - Voter turnout as pct,Health - Life expectancy in yrs,Health - Self-reported health as pct,Life Satisfaction - Life satisfaction as avg score,Safety - Feeling safe walking alone at night as pct,Safety - Homicide rate as rat,Work-Life Balance - Employees working very long hours as pct,Work-Life Balance - Time devoted to leisure and personal care as hrs
0,Australia,,19.4,,37433.0,528768.0,3.1,73,1.0,55206.0,...,92,2.7,92,83.0,85.0,7.1,67,0.9,12.5,14.36
1,Austria,0.8,20.8,1.6,37001.0,309637.0,2.3,72,1.3,53132.0,...,92,1.3,76,82.0,71.0,7.2,86,0.5,5.3,14.51
2,Belgium,0.7,20.0,2.1,34884.0,447607.0,2.4,65,2.3,54327.0,...,79,2.0,88,82.1,74.0,6.8,56,1.1,4.3,15.52
3,Canada,0.2,22.9,2.6,34421.0,478240.0,3.8,70,0.5,55342.0,...,90,2.9,68,82.1,89.0,7.0,78,1.2,3.3,14.57
4,Chile,9.4,18.4,1.9,,135787.0,7.0,56,,26729.0,...,62,1.3,47,80.6,60.0,6.2,41,2.4,7.7,


# 3. Checking for Missing Data

The first five rows already show missing values (for example, Australia has no value for dwellings without basic facilities), so our first task is to handle them.

Missing data can skew results and mislead interpretation, so it must be handled through either imputation or omission.

In [4]:
missing_values = df.isnull().sum()

missing_values

Country                                                                               0
Housing - Dwellings without basic facilities as pct                                   3
Housing - Housing expenditure as pct                                                  4
Housing - Rooms per person as rat                                                     3
Income - Household net adjusted disposable income in usd                              6
Income - Household net wealth in usd                                                 12
Jobs - Labour market insecurity as pct                                                7
Jobs - Employment rate as pct                                                         0
Jobs - Long-term unemployment rate as pct                                             2
Jobs - Personal earnings in usd                                                       6
Community - Quality of support network as pct                                         0
Education - Educational attainme

Several columns contain missing values, so we need to decide whether to retain or remove them. Because the gaps are spread across many columns, dropping every affected column or row would discard a large share of the dataset and could undermine the analysis.

The gaps cannot be ignored either. To keep these columns, we must fill them.

To begin, we impute the missing values with the median for columns that have less than 20% missing data. The threshold acts as a safeguard: median imputation is only reasonable when the vast majority of a column's values are present.
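As a minimal illustration of the threshold rule described above, the sketch below works on a small made-up frame (the column names `a` and `b` and their values are hypothetical, not taken from the Better Life Index). It computes the share of missing values per column and applies median imputation only where that share is positive and below 20%.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' has 1 of 10 values missing (10%), 'b' has 4 of 10 (40%).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0, 6.0, 7.0, 8.0, 9.0, np.nan],
    "b": [np.nan, np.nan, np.nan, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Fraction of missing rows per column.
missing_share = toy.isnull().mean()

# Columns with some, but less than 20%, missing data.
low_missing = missing_share[(missing_share > 0) & (missing_share < 0.2)].index

# Fill only those columns with their own median; 'b' is left untouched.
toy[low_missing] = toy[low_missing].fillna(toy[low_missing].median())
```

In this toy case the gap in `a` is filled with the median of the observed values rather than the mean, which the single large value (100.0) would pull upwards. Column `b` keeps its gaps for a later, more careful method.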

#### Imputation Using Median

In [5]:
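threshold = 0.2 * len(df)  # number of rows that corresponds to 20% of the dataset

for column in df.columns:
    # Impute only columns that have some, but fewer than 20%, missing values
    if 0 < missing_values[column] < threshold:
        # Assign back: fillna(inplace=True) on a selected column may not update df in recent pandas
        df[column] = df[column].fillna(df[column].median())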

In [6]:
df.isnull().sum()

Country                                                                               0
Housing - Dwellings without basic facilities as pct                                   0
Housing - Housing expenditure as pct                                                  0
Housing - Rooms per person as rat                                                     0
Income - Household net adjusted disposable income in usd                              0
Income - Household net wealth in usd                                                 12
Jobs - Labour market insecurity as pct                                                0
Jobs - Employment rate as pct                                                         0
Jobs - Long-term unemployment rate as pct                                             0
Jobs - Personal earnings in usd                                                       0
Community - Quality of support network as pct                                         0
Education - Educational attainme

Median imputation has filled every column with less than 20% missing data. The columns that still report missing values are those above the 20% threshold, which the loop deliberately skipped.

We identify these columns next.

In [6]:
columns_above_threshold = missing_values[missing_values > threshold]

columns_above_threshold

Income - Household net wealth in usd                                    12
Work-Life Balance - Time devoted to leisure and personal care as hrs    19
dtype: int64

The columns with more than 20% missing data are:
- Income - Household net wealth in usd
- Work-Life Balance - Time devoted to leisure and personal care as hrs

For columns with a significant amount of missing data, simply filling in the median is not good practice, as it can distort the column's distribution. A more advanced imputation method is therefore needed. In this case, we will use KNN imputation, which applies the K-nearest neighbours algorithm to fill each missing value based on the most similar rows. It offers a more advanced approach because it can take several variables into account when judging similarity, which makes it a better choice than the median when missing data is more prominent.
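To make the idea concrete, here is a minimal sketch on hypothetical toy data (the columns `x` and `y` and the choice of `n_neighbors=2` are illustrative assumptions, not the notebook's settings). One value of `y` is missing, and `KNNImputer` fills it from the rows whose observed `x` is closest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: 'y' is missing in the last row.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 10.0, 11.0, 2.4],
    "y": [10.0, 20.0, 30.0, 100.0, 110.0, np.nan],
})

# With n_neighbors=2 and the default uniform weights, the gap is filled with
# the mean 'y' of the two rows whose observed features are nearest.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

imputed_value = filled.loc[5, "y"]
```

The two nearest rows here are those with `x` equal to 2.0 and 3.0, so the gap is filled with the mean of 20.0 and 30.0. Because the method relies on distances, columns measured on very different scales (such as USD and hours) may warrant standardisation first; that step is not part of the notebook's code.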

#### Imputation using K-Nearest Neighbours

In [7]:
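from sklearn.impute import KNNImputer

# Each missing value is filled with the average of its 5 nearest neighbours
knn_imputer = KNNImputer(n_neighbors=5)

# Note: only the two high-missing columns are passed in, so row similarity is
# judged on these two columns alone
df[columns_above_threshold.index] = knn_imputer.fit_transform(df[columns_above_threshold.index])

# Confirm that no missing values remain in the imputed columns
print(df[columns_above_threshold.index].isnull().sum())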

Income - Household net wealth in usd                                    0
Work-Life Balance - Time devoted to leisure and personal care as hrs    0
dtype: int64


The missing values in the columns "Income - Household net wealth in usd" and "Work-Life Balance - Time devoted to leisure and personal care as hrs" have been successfully imputed using the KNN imputation method.

With this, it is now possible to proceed to further analysis in the form of visualisations, helping us to understand the data distribution.


# 4. Data Understanding Continued

In [9]:
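def plot_histogram(column):
    """Return a histogram of a single numeric column of df."""
    chart = alt.Chart(df).mark_bar().encode(
        alt.X(column, bin=alt.Bin(maxbins=50), title=column),
        y='count()',
    ).properties(
        title=f'Distribution of {column}',
        width=400,
        height=200
    )
    return chart

# Binning only makes sense for numeric columns, so skip text columns such as 'Country'
numeric_columns = df.select_dtypes(include='number').columns
charts = [plot_histogram(col) for col in numeric_columns]
alt.vconcat(*charts)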
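The histograms can be complemented with a quick numeric check. The sketch below counts, for each numeric column, the values that fall outside the conventional 1.5 × IQR fences. It is not part of the original analysis: the helper name `count_iqr_outliers` and the toy frame are hypothetical, and the 1.5 multiplier is a common convention rather than a rule taken from this notebook.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame) -> pd.Series:
    """Count values outside the 1.5 * IQR fences for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the frame's columns.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()


# Hypothetical toy data: 'score' contains one unusually low value.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "score": [6.8, 7.0, 7.1, 7.2, 7.3, 2.0],
    "rate": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
})
outlier_counts = count_iqr_outliers(toy)
```

A non-zero count is only a prompt to look at the corresponding histogram more closely; whether a flagged value should be removed remains a judgment call.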

Now that we have pre-processed the data and examined its distributions, we can visualise it in more meaningful ways. The aim is to uncover new insights and any trends across the variables for specific countries.

# 5. Comparing Well-Being Across Countries

To compare well-being across countries, we will employ a variety of visualisation techniques. Each visualisation addresses one aspect of the data: a choropleth map offers a geographical perspective, while bar charts support comparative analysis. Interactive tooltips reveal detailed values when hovering over a data point.

### Visualisation 1: Work-Life Balance

In [10]:
common_color_scale = alt.Scale(scheme='bluegreen')

In [14]:
work_life_balance_data = df[['Country', 'Work-Life Balance - Employees working very long hours as pct']]

work_chart = alt.Chart(work_life_balance_data).mark_bar().encode(
    x=alt.X('Country:N', sort='-y', title='Country', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('Work-Life Balance - Employees working very long hours as pct:Q', title='Employees Working Very Long Hours (%)'),
    color=alt.Color('Work-Life Balance - Employees working very long hours as pct:Q', scale=common_color_scale, legend=None),
    tooltip=[
        alt.Tooltip('Country:N', title='Country'),
        alt.Tooltip('Work-Life Balance - Employees working very long hours as pct:Q', title='Employees Working Very Long Hours (%)')
        ]
).properties(
    title='Work-Life Balance per Country',
    width=625,
    height=290
)
work_chart

### Visualisation 2: Income

In [15]:
income_data = df[['Country', 'Income - Household net adjusted disposable income in usd']]

income_chart = alt.Chart(income_data).mark_bar().encode(
    x=alt.X('Country:N', sort='-y', title='Country', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('Income - Household net adjusted disposable income in usd:Q', title='Household Net Adjusted Disposable Income (USD)'),
    color=alt.Color('Income - Household net adjusted disposable income in usd:Q', scale=common_color_scale, legend=None),
    tooltip=[
        alt.Tooltip('Country:N', title='Country'),
        alt.Tooltip('Income - Household net adjusted disposable income in usd:Q', title='Household Net Adjusted Disposable Income (USD)')
        ]
).properties(
    title='Disposable Income per Country',
    width=625,
    height=290
)

income_chart

### Visualisation 3: Life Satisfaction

In [11]:
country_to_numeric_iso = {
    'Australia': '036', 'Austria': '040', 'Belgium': '056', 'Canada': '124', 'Chile': '152',
    'Colombia': '170', 'Costa Rica': '188', 'Czech Republic': '203', 'Denmark': '208',
    'Estonia': '233', 'Finland': '246', 'France': '250', 'Germany': '276', 'Greece': '300',
    'Hungary': '348', 'Iceland': '352', 'Ireland': '372', 'Israel': '376', 'Italy': '380',
    'Japan': '392', 'Korea': '410', 'Latvia': '428', 'Lithuania': '440', 'Luxembourg': '442',
    'Mexico': '484', 'Netherlands': '528', 'New Zealand': '554', 'Norway': '578', 'Poland': '616',
    'Portugal': '620', 'Slovak Republic': '703', 'Slovenia': '705', 'Spain': '724', 'Sweden': '752',
    'Switzerland': '756', 'Turkey': '792', 'United Kingdom': '826', 'United States': '840',
    # Non-OECD countries
    'Non-OECD - Brazil': '076', 'Non-OECD - Russia': '643', 'Non-OECD - South Africa': '710'
}

iso_numeric_to_country = {v: k for k, v in country_to_numeric_iso.items()}

df['ISO_numeric'] = df['Country'].map(country_to_numeric_iso)
df['Country_Name'] = df['ISO_numeric'].map(iso_numeric_to_country)
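One detail worth checking before the map lookup in the next cell: the join only succeeds where the key in `df` matches the `id` of each shape in the TopoJSON file. The sketch below assumes, without having verified it against the file, that `world-110m.json` stores ids as plain numbers (for example 36 rather than '036'). Under that assumption, zero-padded strings would need converting before the lookup. The frame `toy` and the column name `ISO_numeric_int` are hypothetical.

```python
import pandas as pd

# Hypothetical three-row stand-in for the notebook's DataFrame.
toy = pd.DataFrame({
    "Country": ["Australia", "Brazil", "United States"],
    "ISO_numeric": ["036", "076", "840"],
})

# Convert zero-padded ISO 3166-1 numeric strings to integers,
# so that '036' becomes 36 and can match a numeric id.
toy["ISO_numeric_int"] = toy["ISO_numeric"].astype(int)
```

If countries whose codes start with a zero appear grey on the rendered map while others are coloured, a mismatch of this kind is a likely cause.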

In [21]:
world_url = alt.topo_feature('https://vega.github.io/vega-datasets/data/world-110m.json', 'countries')


df['Life Satisfaction Score (Avg Score)'] = df['Life Satisfaction - Life satisfaction as avg score']


base = alt.Chart(world_url).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(
    title='Life Satisfaction Score per Country',
    width=1250,
    height=580
).project('equirectangular')


tooltip_fields = [
    alt.Tooltip('Country_Name:N', title='Country'),
    alt.Tooltip('Life Satisfaction Score (Avg Score):Q')
]


country_data = alt.Chart(world_url).mark_geoshape().encode(
    color=alt.Color('Life Satisfaction Score (Avg Score):Q', scale=common_color_scale, title="Average Score"),
    tooltip=tooltip_fields
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(df, 'ISO_numeric', ['Country_Name', 'Life Satisfaction Score (Avg Score)'])
)


world_chart = base + country_data
world_chart

### Visualisation 4: Health

In [16]:
health_df = df[['Country', 'Health - Self-reported health as pct']]

scatter = alt.Chart(health_df).mark_circle().encode(
    x=alt.X('Country:N', title='Country', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('Health - Self-reported health as pct:Q', title='Self-Reported Health (%)'),
    color=alt.Color('Health - Self-reported health as pct:Q', scale=common_color_scale, legend=None),
    size=alt.Size('Health - Self-reported health as pct:Q', title='Health Score Size', legend=None),
    tooltip=[
        alt.Tooltip('Country:N', title='Country'),
        alt.Tooltip('Health - Self-reported health as pct:Q', title='Self-Reported Health Score (%)')
    ]
).properties(
    title='Self-Reported Health per Country',
    width=625,
    height=290
)

scatter

### Visualisation 5: Safety

In [17]:
safety_df = df[['Country', 'Safety - Feeling safe walking alone at night as pct', 'Safety - Homicide rate as rat']]


base = alt.Chart(safety_df).encode(
    alt.X('Country', axis=alt.Axis(title='Country', labelAngle=-45))
).properties(
    title='Safety Metrics by Country',
    width=625,
    height=290
)


line = base.mark_line(color='orange').encode(
    alt.Y('Safety - Homicide rate as rat',
          axis=alt.Axis(title='Homicide Rate', titleColor='orange'))
)


bar = base.mark_bar().encode(
    alt.Y('Safety - Feeling safe walking alone at night as pct',
          axis=alt.Axis(title='Feeling Safe Walking Alone at Night (%)')),
    color=alt.Color('Safety - Feeling safe walking alone at night as pct', scale=common_color_scale, legend=None),
    tooltip=[alt.Tooltip('Country', title='Country'),
             alt.Tooltip('Safety - Feeling safe walking alone at night as pct', title='Feeling Safe Walking Alone at Night(%)')]
)


dual_axis_chart = alt.layer(bar, line).resolve_scale(
    y='independent'
)

dual_axis_chart

# 6. Convert Visualisations to HTML Format

Now that all of the visualisations have been produced, we concatenate them and save the result as a single HTML file. This makes every graph viewable in one place, so different variables can be compared within one view.

We concatenate both horizontally and vertically to create the desired layout: a 2 x 2 grid of four graphs, with the world map placed at the bottom of the page because it needs more space to remain legible.

In [22]:
hconcat1 = alt.hconcat(income_chart, work_chart).resolve_scale(
    color='independent'
)

hconcat2 = alt.hconcat(scatter, dual_axis_chart).resolve_scale(
    color='independent'
)

final_chart = alt.vconcat(hconcat1, hconcat2, world_chart).resolve_scale(
    color='independent'
)

final_chart.save('combined_chart.html')

With the creation of the html file "combined_chart.html", we have completed the last step of our analysis.

# 7. Summary

In this notebook, we conducted a thorough examination of the Better Life Index dataset in an attempt to understand the well-being of individuals across various countries.

Following the initial data understanding phase, we then focused on the missing data present inside the dataset, ultimately deciding to utilise imputation. For columns with less than 20% missing data, median imputation was employed. For columns with more than 20% missing data, K-Nearest Neighbours imputation was used in an attempt to retain the dataset's overall characteristics.

A brief analysis of the columns' data distributions was then conducted; it did not reveal anything requiring further action, such as outlier removal.

For the latter part of the notebook, an emphasis was placed on visualising the dataset in the most comprehensive way possible. Five visualisations were created, each representing a key component of well-being, with these being:
- Work-Life Balance
- Disposable Income
- Life Satisfaction
- Health
- Safety

Each visualisation played a crucial role in providing insights into the well-being of individuals in various countries.

Lastly, to view these visualisations in one place, we grouped them together and converted them into a single html file, retaining the interactive elements of the graphs, whilst also allowing for an easier viewing experience.