# World Happiness Report 2024 — Exploratory Data Analysis (Cross-Sectional)

This project performs a focused Exploratory Data Analysis (EDA) on the 2024 edition of the World Happiness Report (single-year cut from the 2015–2024 Kaggle dataset). The goal is to understand how key explanatory factors (e.g., GDP per capita, social support, healthy life expectancy, freedom, generosity, perceptions of corruption) relate to national happiness levels in 2024. The analysis emphasizes data quality checks, descriptive statistics, and clear visual storytelling.

#### **Objectives**
- Describe the distribution of happiness scores across countries in 2024.
- Compare regions/continents through group statistics and visualizations.
- Examine pairwise relationships between the happiness score and its explanatory factors.

#### **Information**
**Dataset:** *World Happiness 2015–2024* (filtered to 2024) — Kaggle (yadiraespinoza)  
**Author:** Paulo Castro  
**Date:** August 2025  
**Tools:** Python (pandas, matplotlib, seaborn)


---

## 1. Load and Inspect Data

In this section, we will:
- Load the dataset into a Pandas DataFrame.
- Check the shape (number of rows and columns) to understand the dataset's size.
- Display the first few rows to inspect the raw data.
- Examine the data types and missing values.
- Generate a statistical summary for numeric columns to get an initial overview.

> **Note:** This section prepares the dataset for further cleaning, exploration, and visualization in later steps.


Imports the main Python libraries for data manipulation (pandas, numpy) and visualization (matplotlib, seaborn) and sets a consistent and readable style for all plots, including size, gridlines, font sizes, and Seaborn aesthetics.

In [None]:
# Core libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

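# --- Visualization configuration ---
# Apply the seaborn context first: set_context() resets the font-size rcParams,
# so calling it after rcParams.update() would override the sizes set below.
sns.set_context("notebook")

plt.rcParams.update({
    "figure.figsize": (10, 6),
    "axes.grid": True,
    "grid.alpha": 0.1,
    "axes.titlesize": 14,
    "axes.labelsize": 12,
    "xtick.labelsize": 10,
    "ytick.labelsize": 10
})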

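Loads the CSV file into a pandas DataFrame. The file uses semicolons as field separators and commas as decimal marks, so `sep=";"` and `decimal=","` are passed to ensure the numeric columns are parsed as numbers rather than text.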


In [None]:
path = "../content/world_happiness_2024.csv"

df_raw = pd.read_csv(path, sep=";", decimal=",")
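
To illustrate why the `decimal=","` argument matters, here is a minimal sketch on a made-up two-row sample (the country names and scores are placeholders, not values from the dataset). It compares the parsed dtype with and without the argument.

```python
import io
import pandas as pd

# Made-up sample in the same layout: semicolon separator, comma decimal mark
csv_text = "Country;Ladder score\nCountry A;7,5\nCountry B;4,25\n"

# With decimal=",", the score column is parsed as floats
df_sample = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")

# Without it, "7,5" cannot be read as a number and the column stays text
df_text = pd.read_csv(io.StringIO(csv_text), sep=";")

print(df_sample["Ladder score"].dtype)
print(df_text["Ladder score"].dtype)
```

If a numeric column shows up as text in `df_raw.info()`, a mismatched decimal mark is a likely cause.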

### 1.1 Dataset Overview

In [None]:
# 1. Basic information about the dataset
print("Dataset Shape (rows, columns):", df_raw.shape)

- **Rows and Columns:** The dataset contains 140 rows and 10 columns.  
- **Countries Represented:** Each row corresponds to a unique country, totaling 140 countries.  
- **Missing Countries:** Of the 195 states the UN recognizes (193 member states plus 2 observer states), 55 are not included in this dataset.  

> **Note:** This overview helps understand the scope of the data, highlighting that some countries are missing. This may affect analyses that assume global coverage.
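
The coverage figures above can be checked programmatically. The sketch below uses a hypothetical helper, `coverage_summary`, and a toy frame. The reference total of 195 is the figure quoted in the text and is passed in as an assumption.

```python
import pandas as pd

UN_TOTAL = 195  # assumption: reference total quoted in the text above


def coverage_summary(df, country_col="Country", total=UN_TOTAL):
    """Summarize how many distinct countries a frame covers."""
    n_rows = len(df)
    n_unique = int(df[country_col].nunique())
    return {
        "rows": n_rows,
        "unique_countries": n_unique,
        "one_row_per_country": n_rows == n_unique,
        "not_covered": total - n_unique,
    }


# Toy frame with placeholder names, only to show the output shape
toy = pd.DataFrame({"Country": ["Country A", "Country B", "Country C"]})
summary = coverage_summary(toy)
print(summary)
```

On the real data this would be called as `coverage_summary(df_raw)`. If the dataset contains territories that are not UN-recognized states, the `not_covered` figure is only approximate.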

### 1.2 Data Snapshot

In [None]:
# 2. Preview Raw Data
display(df_raw.head(10))

The first few rows of the dataset provide an initial glimpse into the structure and types of data:

- **Categorical Columns:**
  - `Country`: Name of the country.
  - `Regional indicator`: Region or group the country belongs to.

- **Numerical Columns:**
  - **Integers:**
    - `Ranking`: Overall happiness rank (1 = happiest).
    - `Healthy life expectancy`: Life expectancy in years.
  - **Decimals:**
    - `Ladder score`: Overall score between 0 and 10, determines the ranking.
    - `GDP per capita`: Economic output per person.
    - **Percentages / Indexes (0–1 scale):**
      - `Social support`
      - `Freedom to make life choices`
      - `Generosity`
      - `Perceptions of corruption`

> **Note:** Understanding the column types is crucial for proper cleaning, analysis, and visualization.

### 1.3 Data Information

In [None]:
# 3. Data types and non-null counts
print("\nData Types and Non-Null Values:")
df_raw.info()

The dataset contains **140 rows and 10 columns**, with **no missing values**.

**Column types:**
- `Ranking` and `Healthy life expectancy`: `int64` (integer)
- `Country` and `Regional indicator`: `object` (categorical)
- `Ladder score`, `GDP per capita`, `Social support`, `Freedom to make life choices`, `Generosity`, `Perceptions of corruption`: `float64` (decimal)

> **Note:** The data types are consistent with the initial head inspection, so no corrective type conversions are required at this stage.

### 1.4 Statistical Summary of Numeric Columns

In [None]:
# 4. Statistical summary for numeric columns
print("\nStatistical Summary for Numeric Columns:")
display(df_raw.describe().T)


**Key observations:**

- All numeric columns contain **140 values**, confirming that there are no missing entries.
- The `Ranking` column spans the full range from 1 to 140, reflecting the position of countries by their Ladder Score.
- The `Ladder score` varies from 1.72 to 7.74, with a mean of 5.53, showing that most countries cluster around the mid-range.
- Columns like `Social support`, `Freedom to make life choices`, `Generosity`, and `Perceptions of corruption` are expressed as proportions or index values on a 0–1 scale.
- `GDP per capita` and `Ladder score` highlight economic and well-being differences among countries.
- The standard deviation values indicate the variability within each feature, while quartiles (25%, 50%, 75%) show how most countries are distributed around the median.
- Min and max values help identify potential outliers or exceptional cases that can be further explored in the EDA section.
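
One common way to turn the min/max inspection into a concrete check is the 1.5 × IQR rule. The sketch below is an illustrative helper (`iqr_outliers` is a hypothetical name, and the scores are made-up values), not a step the notebook itself performs.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up scores with one unusually low value
toy_scores = pd.Series([1.7, 4.5, 5.0, 5.5, 5.8, 6.0, 6.2, 6.5, 7.0])
flagged = iqr_outliers(toy_scores)
print(flagged)
```

Applied to the real data, this could be run per column, for example `iqr_outliers(df_raw["Ladder score"])`. A flagged value is a candidate for closer inspection, not necessarily an error.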

## 2. Data Cleaning and Preparation

In this section, we will:
- Identify and handle missing or inconsistent data (if any).
- Check for duplicates.
- Optimize categorical columns.
- Standardize column names for consistency.
- Create any additional columns that may be useful for analysis.
- Prepare the dataset for exploratory data analysis (EDA).

> **Note**: This step ensures that the dataset is clean, consistent, and ready for meaningful visualizations and analysis.

### 2.1 Checking for Missing Values

Although the initial inspection showed no missing values, it's good practice to confirm this for all columns.

In [None]:
# Check for missing values
missing_values = df_raw.isnull().sum()
print("\nMissing Values:")
missing_values

- All columns have 0 missing values, confirming that no imputation is necessary.
- The dataset is complete, which simplifies the cleaning process.

### 2.2 Checking for Duplicates

Duplicate rows can distort analysis and statistics, so we check for them.

In [None]:
# Check for duplicate rows
duplicates = df_raw.duplicated().sum()
print("\nDuplicates:", duplicates)

- The dataset contains 0 duplicate rows, indicating unique entries for each country.
- No removal or deduplication is needed.
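
A full-row check only catches rows that are identical in every column. Since each row is expected to be a unique country, a key-based check on the country column is a useful complement. The sketch below uses a hypothetical helper and a toy frame with placeholder values.

```python
import pandas as pd


def duplicate_report(df, key="Country"):
    """Count exact duplicate rows and rows that repeat the key column."""
    return {
        "full_row_duplicates": int(df.duplicated().sum()),
        "key_duplicates": int(df.duplicated(subset=[key]).sum()),
    }


# Toy frame: "Country B" appears twice with slightly different scores,
# so the full-row check misses it but the key-based check does not.
toy = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country B"],
    "Ladder score": [7.0, 5.0, 5.1],
})
report = duplicate_report(toy)
print(report)
```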

### 2.3 Optimizing Categorical Columns

To improve memory efficiency and speed up future processing, we convert all object-type columns with 20 or fewer unique values to the category data type.

In [None]:
# Identify object columns with ≤20 unique values
cols_cat = df_raw.select_dtypes(include='object').nunique().sort_values()
cols_cat = cols_cat[cols_cat <= 20].index.tolist()
print("\nColumns that will be optimized:", cols_cat)

df_raw[cols_cat] = df_raw[cols_cat].astype('category')
df_raw[cols_cat].dtypes

- Converted `Regional indicator` to `category` to improve memory and processing efficiency.
- `Country` kept as object, since each value is unique and it represents a real-world entity rather than a category.

> **Note**: Optimizing categorical columns is important when working with large datasets to reduce memory usage and improve performance.
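
The memory effect can be measured with `memory_usage(deep=True)`. The sketch below uses a synthetic column of 3,000 repeated region labels (an assumption made so the difference is visible). With only 140 rows, the saving in this project is negligible.

```python
import pandas as pd

# Synthetic column: three region labels repeated 1,000 times each
regions = pd.Series(
    ["Western Europe", "South Asia", "Sub-Saharan Africa"] * 1000,
    dtype="object",
)

bytes_object = regions.memory_usage(deep=True)
bytes_category = regions.astype("category").memory_usage(deep=True)

print(f"object:   {bytes_object} bytes")
print(f"category: {bytes_category} bytes")
```

The category dtype stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of repeated values.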

### 2.4 Standardizing Column Names

Consistent column names improve readability and prevent errors in analysis.

In [None]:
# Standardize column names: lowercase, replace spaces with underscores
df_clean = df_raw.copy()
df_clean.columns = df_clean.columns.str.strip().str.lower().str.replace(' ', '_')
print("\nStandardized Column Names:\n")
df_clean.columns

- All column names are now lowercase and use underscores (_) instead of spaces.
- This standardization ensures compatibility with Python syntax and avoids typos during analysis.
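
As a small illustration of the transformation, the sketch below applies the same chain of string methods to a few sample names. `standardize_columns` is a hypothetical wrapper, and the uniqueness check is an added safeguard because two distinct names could collide after lowercasing.

```python
import pandas as pd


def standardize_columns(columns):
    """Strip whitespace, lowercase, and replace spaces with underscores."""
    return (
        pd.Index(columns)
        .str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
    )


raw_names = ["Country", "Regional indicator", " Ladder score ", "GDP per capita"]
clean_names = standardize_columns(raw_names)

# Guard against collisions such as "Score" and "score" mapping to one name
assert clean_names.is_unique
print(list(clean_names))
```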

### 2.5 Creating Additional Columns

For visualization or categorical comparisons, we create a `ladder_score_category` with an odd number of levels. This allows a balanced distribution with a central “Medium” group.

In [None]:
# Define bins and labels for Ladder Score categories
bins = [0, 2, 4, 6, 8, 10]
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']

# Create a new categorical column
df_clean['ladder_score_category'] = pd.cut(df_clean['ladder_score'], bins=bins, labels=labels, ordered=True)

# Preview the new column
df_clean[['country', 'ladder_score', 'ladder_score_category']].head(10)

- `ladder_score_category`: Groups countries into five qualitative levels of happiness.
- Useful for categorical aggregation, comparisons, and visualizations.
- **Design choice**: Odd number of categories ensures symmetry with a central “Medium” group.

> **Note**: This categorical ladder score can be used for comparative visualizations, such as counting countries per happiness level or analyzing distributions by region.
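
It helps to know how `pd.cut` treats values that fall exactly on a bin edge. By default the bins are right-closed, so a score of exactly 8.0 is labeled "High", and a score of exactly 0 gets no label unless `include_lowest=True` is passed. The sketch below shows this on a few made-up edge values.

```python
import pandas as pd

bins = [0, 2, 4, 6, 8, 10]
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']

# Made-up scores placed on or just past the bin edges
edge_scores = pd.Series([0.0, 2.0, 2.01, 8.0, 8.01])

default_cut = pd.cut(edge_scores, bins=bins, labels=labels)
inclusive_cut = pd.cut(edge_scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "score": edge_scores,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

In this dataset the lowest score is well above 0, so the default behavior loses no rows. The edge handling matters mainly when the same bins are reused on other data.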

### 2.6 Dataset Check

In [None]:
# Quick overview after cleaning
df_clean.info()
print("\n")
df_clean.head()

- After cleaning and transformations, the dataset is fully consistent and ready for exploratory analysis.
- No missing or duplicate data.
- Categorical columns optimized where appropriate.
- Derived columns added for analysis convenience.

> **Summary**: Dataset is now clean, structured, and ready for exploratory data analysis.

## 3. Exploratory Data Analysis (EDA)

In this section, we will explore the cleaned dataset to better understand its structure, distributions, and relationships between variables.

Our goals are to:
- Examine the distribution of the main metric (`ladder_score`).
- Compare happiness across regions.
- Analyze correlations between ladder score and other numerical variables.
- Identify potential insights and patterns for further interpretation.

### 3.1 Distribution of Ladder Score

First, we examine the distribution of the happiness scores (`ladder_score`) across all countries.

In [None]:
plt.figure()
df_clean['ladder_score'].hist(bins=20, edgecolor='black')
plt.title('Distribution of Ladder Score')
plt.xlabel('Ladder Score')
plt.ylabel('Number of Countries')
plt.show()

The distribution of ladder scores is **left-skewed**, with most countries positioned in the middle-to-high range of happiness.  

- There are **two clear peaks**: one around a score of 4.5 and another near 6.0, suggesting that many countries cluster around these levels.  
- The lower end of the scale shows a **gap**, with very few countries scoring between 2.0 and 3.5.  
- The overall spread is wide, ranging from approximately **1.7 to 7.7**, which highlights significant disparities in happiness across countries.  

This pattern indicates that while extreme low scores are rare, countries tend to group around moderate to high levels of happiness, with only a handful reaching the very top of the scale.
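
The visual impression of skew can be backed by numbers: in a left-skewed distribution the mean sits below the median and the sample skewness is negative. The sketch below uses a hypothetical helper and made-up scores. Skewness does not capture the two peaks, which still need the histogram.

```python
import pandas as pd


def describe_shape(series):
    """Return mean, median, and sample skewness of a numeric series."""
    return {
        "mean": series.mean(),
        "median": series.median(),
        "skew": series.skew(),
    }


# Made-up scores with a long tail toward low values
toy_scores = pd.Series([1.7, 4.4, 4.6, 5.8, 6.0, 6.1, 6.3, 7.0])
shape = describe_shape(toy_scores)
print(shape)
```

On the real data the call would be `describe_shape(df_clean['ladder_score'])`.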

### 3.2 Happiness by Region

We compare happiness scores across regions using a boxplot.

In [None]:
plt.figure()
sns.boxplot(data=df_clean, x='regional_indicator', y='ladder_score')
plt.xticks(rotation=45, ha='right')
plt.title('Ladder Score by Region')
plt.xlabel('Region')
plt.ylabel('Ladder Score')
plt.show()

The distribution of ladder scores varies considerably across regions, highlighting clear geographical differences in happiness.  

- **Highest medians**:  
  - *Western Europe* and *North America & ANZ* show the highest median happiness levels.  
  - North America & ANZ also has the **lowest dispersion**, with consistently high scores across its countries.  
  - The **highest ladder score overall** belongs to a country in Western Europe.  

- **Lowest medians**:  
  - *South Asia* and *Sub-Saharan Africa* present the lowest medians.  
  - In South Asia, one country appears as a **low outlier**, with the lower whisker almost nonexistent.  
  - Sub-Saharan Africa, on the other hand, shows more diversity, with its upper range nearly reaching the lower bound of Western Europe.  

- **Highest variability**:  
  - *Middle East & North Africa* displays the **widest spread**, ranging from very low scores (comparable to the lowest in South Asia and Sub-Saharan Africa) to very high scores, on par with the top countries in Western Europe and North America & ANZ.  

Overall, these patterns emphasize that while some regions enjoy consistent levels of happiness, others face strong internal disparities, reflecting diverse economic, social, and political contexts.
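
The quantities a boxplot draws (median and interquartile range) can also be tabulated per region, which makes claims about "highest median" or "lowest dispersion" easy to verify. The sketch below uses a hypothetical helper and a toy frame with placeholder regions.

```python
import pandas as pd


def region_summary(df, region_col="regional_indicator", score_col="ladder_score"):
    """Per-region count, median, and interquartile range, sorted by median."""
    grouped = df.groupby(region_col, observed=True)[score_col]
    quartiles = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    summary = pd.DataFrame({
        "n": grouped.count(),
        "median": quartiles[0.5],
        "iqr": quartiles[0.75] - quartiles[0.25],
    })
    return summary.sort_values("median", ascending=False)


# Toy data: Region X is high and tight, Region Y is lower and more spread out
toy = pd.DataFrame({
    "regional_indicator": ["Region X"] * 3 + ["Region Y"] * 3,
    "ladder_score": [7.0, 7.2, 7.4, 3.0, 5.0, 7.0],
})
summary = region_summary(toy)
print(summary)
```

Regions with very few countries produce unstable quartiles, so the `n` column is worth reading alongside the spread.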

### 3.3 Average Ladder Score per Region

For clearer comparisons, we compute the mean ladder score by region.

In [None]:
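# observed=True keeps only regions present in the data; it is set explicitly
# because regional_indicator is a categorical column
region_means = (
    df_clean.groupby('regional_indicator', observed=True)['ladder_score']
    .mean()
    .sort_values(ascending=False)
)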
region_means

In [None]:
region_means.plot(kind='bar', edgecolor='black')
plt.title('Average Ladder Score by Region')
plt.ylabel('Average Ladder Score')
plt.show()

This chart reinforces the regional disparities already observed in the boxplot, while making the **ranking of regions more explicit**.  

- **Lowest averages**:  
  - *South Asia* and *Sub-Saharan Africa* record the lowest average ladder scores, confirming the consistent challenges to well-being in these areas.  

- **Highest averages**:  
  - *North America & ANZ* and *Western Europe* clearly stand out with the highest averages.  
  - In this visualization, it becomes easier to identify that *North America & ANZ* ranks slightly higher than Western Europe in terms of average score.  


### 3.4 Correlation Analysis

To understand which factors are most strongly associated with happiness, we analyze correlations.

In [None]:
# Compute correlation matrix
corr = df_clean.corr(numeric_only=True)

plt.figure()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

Before focusing directly on the variables most strongly associated with the *Ladder Score*, it is useful to highlight some broader patterns confirmed by the correlation matrix:

- **Strong interrelations between explanatory factors**:  
  - *GDP per Capita* shows a very strong correlation with *Healthy Life Expectancy* (**0.83**) and a strong correlation with *Social Support* (**0.73**).  
  - This suggests that countries with higher GDP tend to provide both **better health conditions** and **more robust social support systems**.  

We now turn to the **direct relationships with the Ladder Score**:

- **Weaker relationships**:  
  - *Generosity* displays the weakest correlation (**0.13**), indicating it plays a relatively minor role in explaining variations in perceived happiness.  
  - *Perceptions of Corruption* (**0.45**) and *Freedom to Make Life Choices* (**0.64**) show moderate associations, suggesting that while they matter, they are less tightly linked to the ladder score than economic and health-related factors.  

- **Strongest drivers of happiness**:  
  - The three factors most strongly correlated with *Ladder Score* are:  
    - *Healthy Life Expectancy* (**0.81**)  
    - *GDP per Capita* (**0.77**)
    - *Social Support* (**0.76**)  

These findings underline that **economic prosperity, health, and social connections are the strongest foundations of subjective well-being**, while cultural or moral dimensions such as generosity play a much smaller role.
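
The ranking read off the heatmap can also be produced directly as a sorted list. The sketch below uses a hypothetical helper that drops the target itself and `ranking` (which is derived from the score) and orders the rest by absolute correlation. The toy numbers are made up.

```python
import pandas as pd


def rank_correlations(df, target="ladder_score", exclude=("ranking",), method="pearson"):
    """Correlation of each numeric column with the target, strongest first."""
    corr = df.corr(numeric_only=True, method=method)[target]
    corr = corr.drop(labels=[target, *exclude], errors="ignore")
    return corr.sort_values(key=abs, ascending=False)


# Toy data: gdp_per_capita tracks the score closely, generosity does not
toy = pd.DataFrame({
    "ladder_score": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ranking": [5, 4, 3, 2, 1],
    "gdp_per_capita": [1.0, 2.0, 3.0, 4.0, 6.0],
    "generosity": [2.0, 1.0, 3.0, 1.0, 2.0],
})
ranked = rank_correlations(toy)
print(ranked)
```

Passing `method="spearman"` gives a rank-based alternative that is less sensitive to outliers and to non-linear but monotonic relationships.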

### 3.5 Ladder Score Categories (Qualitative Analysis)

Using the categorical bins created earlier, we analyze the distribution of countries across happiness levels.

In [None]:
df_clean['ladder_score_category'].value_counts().sort_index().plot(
    kind='bar', edgecolor='black'
)
plt.title('Number of Countries by Ladder Score Category')
plt.xlabel('Category')
plt.ylabel('Number of Countries')
plt.show()

The categorical distribution of *Ladder Scores* reveals that:

- The majority of countries fall within the **Medium** and **High** categories, with approximately **67** and **56 countries** respectively.  
- The **Low** category includes around **16 countries**, while the **Very Low** group is represented by at least one country.  
- Interestingly, **no country reaches the “Very High” category**: many nations achieve relatively good levels of happiness, but none scores above 8 on the 0–10 scale.  

Overall, this distribution indicates that **global happiness levels tend to cluster in the medium-to-high range**, with only a minority of countries falling significantly below this threshold.
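
Exact counts and shares per category can be tabulated instead of read off the bars. Because the column is categorical, `value_counts()` also reports categories with zero members, which is how an empty "Very High" group stays visible. The sketch below uses made-up scores.

```python
import pandas as pd

labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']

# Made-up scores, binned with the same edges as in section 2.5
toy_scores = pd.Series([1.7, 3.5, 4.5, 5.5, 5.9, 6.5, 7.7])
categories = pd.cut(toy_scores, bins=[0, 2, 4, 6, 8, 10], labels=labels)

# sort_index() orders by category level rather than by frequency
counts = categories.value_counts().sort_index()
table = pd.DataFrame({
    "count": counts,
    "share_pct": (counts / counts.sum() * 100).round(1),
})
print(table)
```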

## 4. Conclusion and Insights

The exploratory data analysis of the 2024 World Happiness Report has provided a comprehensive understanding of the factors influencing national well-being. The findings reveal a complex landscape where happiness is shaped by a mix of economic, social, and health-related variables.

**Key Findings**:

- **Bimodal Distribution**: The distribution of `ladder_score` is not uniform. Two distinct peaks (around 4.5 and 6.0) suggest that countries cluster around two different levels of happiness rather than around a single central value. The noticeable gap at the lower end of the scale (roughly 2.0 to 3.5) indicates that exceptionally low happiness scores are rare in the dataset.
- **Significant Regional Disparities**: Happiness levels vary considerably across regions. **Western Europe** and **North America & ANZ** consistently emerge with the highest median and mean scores, showing a strong and stable level of well-being. In contrast, **South Asia** and **Sub-Saharan Africa** report the lowest average scores, highlighting persistent challenges to happiness in these areas. The **Middle East & North Africa** region displays the highest internal variability, with countries at both ends of the happiness spectrum.
- **Core Drivers of Happiness**: The correlation analysis confirmed that economic prosperity, health, and social connections are the most influential factors. `GDP per Capita`, `Healthy Life Expectancy`, and `Social Support` all show strong positive correlations with the `ladder_score`, underscoring their foundational role in a nation's well-being.
- **Weaker Influence of Other Factors**: Variables such as `Perceptions of Corruption` and `Freedom to Make Life Choices` have a moderate impact on happiness, while `Generosity` shows the weakest association. This suggests that while cultural and political factors are relevant, economic and social foundations have a more direct and powerful effect on a country's overall happiness score.
- **No Country Reaches the Top Tier**: The qualitative analysis revealed that no country in the 2024 dataset achieved a `Very High` happiness score (a score above 8). This suggests that while many nations have achieved significant levels of well-being, there is still room for improvement on a global scale.

**Recommendations and Next Steps**:  

This EDA has successfully identified key patterns and relationships within the dataset. Based on these insights, the next steps could include:

1. **Predictive Modeling**: Building a regression model to predict a country's `ladder_score` based on the most correlated variables. This would quantify the precise contribution of each factor and provide a tool for forecasting.
2. **Outlier Analysis**: A deeper investigation into the outlier country in the South Asia region to understand the specific circumstances that contribute to its exceptionally low score.
3. **Temporal Analysis**: Expanding the project to include data from multiple years to perform a time-series analysis, examining how happiness scores and their drivers have changed over time.
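
As a starting point for the predictive modeling step, a linear regression can be fitted with ordinary least squares using only NumPy. The sketch below is one possible approach, not part of this analysis: `fit_linear` is a hypothetical helper, and the data is synthetic and noise-free so that the recovered coefficients are known in advance.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_k]."""
    X_design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coef


# Synthetic, noise-free data: y = 5.0 + 1.0 * x1 + 0.5 * x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 5.0 + 1.0 * X[:, 0] + 0.5 * X[:, 1]

coef = fit_linear(X, y)
print(coef)
```

On the real data, `X` would hold the selected explanatory columns and `y` the `ladder_score`. Because several predictors are strongly correlated with each other (GDP per capita and healthy life expectancy at 0.83), individual coefficients should be interpreted with caution.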

In conclusion, this project provides a clear, evidence-based overview of global happiness in 2024, emphasizing the crucial link between a nation's economic health, social fabric, and the well-being of its citizens.