# <center>Life Expectancy (WHO) Dataset</center>

## Project Objective
Analyze and predict average life expectancy based on economic, health, and social factors from the World Health Organization (WHO) dataset.

---

## Student Information

**Student 1:**
- Full Name: Cao Trần Bá Đạt
- Student ID: 23127168

**Student 2:**
- Full Name: Trần Hoài Thiện Nhân
- Student ID: 23127238

**Student 3:**
- Full Name: Bùi Nam Việt
- Student ID: 23127516

**Class:** 23KHDL

---

## Table of Contents

1. [Dataset Information](#c1)
    - [1.1 About the Data Subject](#c11)
    - [1.2 Data Source](#c12)
    - [1.3 Data License & Usage Rights](#c13)
    - [1.4 Data Collection Method](#c14)
    - [1.5 Rationale for Dataset Selection](#c15)
2. [Import Libraries](#c2)
3. [Load Dataset](#c3)
4. [Data Exploration](#c4)
    - [4.1 Dataset Overview](#c41)
        - [4.1.1 Basic Information](#c411)
        - [4.1.2 Data Integrity](#c412)
        - [4.1.3 Column Inventory](#c413)
        - [4.1.4 Data Types](#c414)
    - [4.2 Numerical Columns Analysis](#c42)
        - [4.2.1 Distribution & Central Tendency](#c421)
        - [4.2.2 Range & Outliers](#c422)
        - [4.2.3 Data Quality (Numerical)](#c423)
    - [4.3 Categorical Columns Analysis](#c43)
        - [4.3.1 Value Distribution](#c431)
        - [4.3.2 Data Quality (Categorical)](#c432)
    - [4.4 Missing Data Analysis](#c44)
        - [4.4.1 Overall Assessment](#c441)
        - [4.4.2 Per Column Strategy](#c442)
    - [4.5 Relationships & Correlations](#c45)
        - [4.5.1 Preliminary Patterns](#c451)
        - [4.5.2 Cross-tabulations](#c452)
    - [4.6 Initial Observations & Insights](#c46)
        - [4.6.1 Summary](#c461)
        - [4.6.2 Red Flags](#c462)
5. [Meaningful Questions](#c5)
    - [5.1 Question 1](#c51)
    - [5.2 Question 2](#c52)
    - [5.3 Question 3](#c53)
    - [5.4 Question 4](#c54)
    - [5.5 Question 5](#c55)
    - [5.6 Question 6](#c56)
6. [Project Summary](#c6)
    - [6.1 Key Findings](#c61)
    - [6.2 Limitations](#c62)
    - [6.3 Future Directions (If You Had More Time)](#c63)
    - [6.4 Individual Reflections](#c64)
        - [6.4.1 Student 1 - Cao Trần Bá Đạt](#c641)
        - [6.4.2 Student 2 - Trần Hoài Thiện Nhân](#c642)
        - [6.4.3 Student 3 - Bùi Nam Việt](#c643)
7. [References](#c7)

---

<a id="c1"></a>
## 1. Dataset Information

<a id="c11"></a>
### 1.1 About the Data Subject

**What subject is your data about?**

TODO: Describe the topic, domain, or phenomenon
- Topic/Domain:
- Real-world context:
- Key variables covered:

<a id="c12"></a>
### 1.2 Data Source

**What is the source of your data?**

TODO: Provide complete source information
- Platform name:
- Full URL:
- Original author(s) or organization:
- Publication/collection date:

<a id="c13"></a>
### 1.3 Data License & Usage Rights 

**Do the authors of this data permit this kind of use?**

TODO: Document data license and permissions
- Dataset license:
- Educational use permitted:
- Usage restrictions:
- Attribution requirements:

<a id="c14"></a>
### 1.4 Data Collection Method

**How did the authors collect the data?**

TODO: Describe data collection methodology
- Collection method (survey, sensors, administrative records, web scraping, etc.):
- Target population and sampling approach:
- Time period of data collection:
- Known limitations or biases in collection:

<a id="c15"></a>
### 1.5 Rationale for Dataset Selection

**Why did you choose this dataset?**

TODO: Explain your motivation and research interests
- What interests your group about this topic:
- Potential questions or insights this data could provide:
- Expected outcomes and applications:

---

<a id="c2"></a>
## 2. Import Libraries

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


<a id="c3"></a>
## 3. Load Dataset

In [29]:
df = pd.read_csv('../data/Life-Expectancy-Data-Updated.csv')
print("Dataset loaded successfully")

Dataset loaded successfully


<a id="c4"></a>
## 4. Data Exploration

<a id="c41"></a>
### 4.1 Dataset Overview

<a id="c411"></a>
#### 4.1.1 Basic Information

In [30]:
# TODO: Basic dataset information
# - Number of rows
# - Number of columns
# - What each row represents
# - Overall dataset size

print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print("Each row represents the health and socio-economic data for a country in a given year.")
print(f"Dataset total size (cells): {df.size}")

Number of rows: 2864
Number of columns: 21
Each row represents the health and socio-economic data for a country in a given year.
Dataset total size (cells): 60144


<a id="c412"></a>
#### 4.1.2 Data Integrity

In [31]:
# TODO: Check data integrity
# - Duplicated rows count
# - Decision to keep or remove duplicates (with justification)
# - Empty/incomplete rows check

duplicate_count = df.duplicated().sum()
print(f"Number of duplicated rows: {duplicate_count}")

if duplicate_count > 0:
    df = df.drop_duplicates()
    print("Duplicated rows removed for data integrity.")
else:
    print("No duplicated rows found. No action needed.")

missing_rows = df.isnull().any(axis=1).sum()
print(f"Number of rows with missing values: {missing_rows}")

Number of duplicated rows: 0
No duplicated rows found. No action needed.
Number of rows with missing values: 0


<a id="c413"></a>
#### 4.1.3 Column Inventory

In [32]:
# TODO: Column inventory analysis
# - Meaning/definition of each column
# - Relevant columns for analysis
# - Columns that should be dropped (with justification)

# Keys must match the column names in the loaded CSV exactly;
# descriptions should be verified against the dataset documentation.
column_meanings = {
    'Country': 'Name of the country',
    'Region': 'Geographic region of the country',
    'Year': 'Year of observation',
    'Infant_deaths': 'Number of infant deaths per 1000 births',
    'Under_five_deaths': 'Number of under-five deaths per 1000 births',
    'Adult_mortality': 'Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)',
    'Alcohol_consumption': 'Alcohol consumption (liters per capita per year)',
    'Hepatitis_B': 'Hepatitis B immunization coverage among 1-year-olds (%)',
    'Measles': 'Measles first-dose immunization coverage among 1-year-olds (%)',
    'BMI': 'Average Body Mass Index of the population',
    'Polio': 'Polio immunization coverage among 1-year-olds (%)',
    'Diphtheria': 'Diphtheria immunization coverage among 1-year-olds (%)',
    'Incidents_HIV': 'HIV incidents per 1000 population (ages 15-49)',
    'GDP_per_capita': 'Gross Domestic Product per capita (USD)',
    'Population_mln': 'Population of the country (millions)',
    'Thinness_ten_nineteen_years': 'Prevalence of thinness among adolescents (10-19 years) (%)',
    'Thinness_five_nine_years': 'Prevalence of thinness among children (5-9 years) (%)',
    'Schooling': 'Average years of schooling (adults aged 25 and older)',
    'Economy_status_Developed': 'Indicator: 1 if the country is classified as developed, else 0',
    'Economy_status_Developing': 'Indicator: 1 if the country is classified as developing, else 0',
    'Life_expectancy': 'Life expectancy in years'
}

print("Column meanings/definitions:")
for col in df.columns:
    desc = column_meanings.get(col, "No description available")
    print(f"- {col}: {desc}")


relevant_columns = [
    'Country', 'Year', 'Life_expectancy', 'Adult_mortality', 'BMI',
    'Alcohol_consumption', 'GDP_per_capita', 'Schooling'
]
print("\nRelevant columns for analysis:", relevant_columns)

# Only columns with no values at all are dropped at this stage.
columns_to_drop = [col for col in df.columns if df[col].isnull().all()]
if columns_to_drop:
    print("Columns to drop (all values missing):", columns_to_drop)
    df = df.drop(columns=columns_to_drop)
else:
    print("No columns need to be dropped at this stage.")


Column meanings/definitions:
- Country: Name of the country
- Region: Geographic region of the country
- Year: Year of observation
- Infant_deaths: Number of infant deaths per 1000 births
- Under_five_deaths: Number of under-five deaths per 1000 births
- Adult_mortality: Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
- Alcohol_consumption: Alcohol consumption (liters per capita per year)
- Hepatitis_B: Hepatitis B immunization coverage among 1-year-olds (%)
- Measles: Measles first-dose immunization coverage among 1-year-olds (%)
- BMI: Average Body Mass Index of the population
- Polio: Polio immunization coverage among 1-year-olds (%)
- Diphtheria: Diphtheria immunization coverage among 1-year-olds (%)
- Incidents_HIV: HIV incidents per 1000 population (ages 15-49)
- GDP_per_capita: Gross Domestic Product per capita (USD)
- Population_mln: Population of the country (millions)
- Thinness_ten_nineteen_years: Prevalence of thinness among adolescents (10-19 years) (%)
- Thinness_five_nine_years: Prevalence of thinness among children (5-9 years) (%)
- Schooling: Average years of schooling (adults aged 25 and older)
- Economy_status_Developed: Indicator: 1 if the country is classified as developed, else 0
- Economy_status_Developing: Indicator: 1 if the country is classified as developing, else 0
- Life_expectancy: Life expectancy in years

Relevant columns for analysis: ['Country', 'Year', 'Life_expectancy', 'Adult_mortality', 'BMI', 'Alcohol_consumption', 'GDP_per_capita', 'Schooling']
No columns need to be dropped at this stage.

<a id="c414"></a>
#### 4.1.4 Data Types

In [33]:
# TODO: Data types analysis
# - Current data type of each column
# - Inappropriate data types identification
# - Columns needing type conversion

print("Data Types:")
print(df.dtypes)
print("\nColumns with inappropriate type:")
for col in df.select_dtypes(include='object').columns:
    if col not in ['Country', 'Status']:
        print(f"- {col}")

print("\nColumns needing type conversion:")
for col in df.columns:
    if col != "Country" and df[col].dtype == 'object':
        print(f"- {col} (consider converting to numeric or categorical)")


Data Types:
Country                         object
Region                          object
Year                             int64
Infant_deaths                  float64
Under_five_deaths              float64
Adult_mortality                float64
Alcohol_consumption            float64
Hepatitis_B                      int64
Measles                          int64
BMI                            float64
Polio                            int64
Diphtheria                       int64
Incidents_HIV                  float64
GDP_per_capita                   int64
Population_mln                 float64
Thinness_ten_nineteen_years    float64
Thinness_five_nine_years       float64
Schooling                      float64
Economy_status_Developed         int64
Economy_status_Developing        int64
Life_expectancy                float64
dtype: object

Columns with inappropriate type:
- Region

Columns needing type conversion:
- Region (consider converting to numeric or categorical)


<a id="c42"></a>
### 4.2 Numerical Columns Analysis

<a id="c421"></a>
#### 4.2.1 Distribution & Central Tendency

In [34]:
# TODO: For each numerical column, analyze distribution
# - Distribution shape (normal, skewed, bimodal, uniform)
# - Visualizations: histograms, box plots, density plots
# - Calculate: mean, median, standard deviation
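
One possible starting point for this cell is sketched below. It computes the mean, median, standard deviation, and skewness of every numeric column and flags strongly skewed ones. The `toy` frame and the `summarize_distribution` helper are hypothetical stand-ins (the notebook would pass the loaded `df` instead), and the `|skew| > 1` cutoff is only a common rule of thumb, not a formal test.

```python
import pandas as pd

# Hypothetical toy values; in the notebook, `df` from section 3 would be used.
toy = pd.DataFrame({
    "Life_expectancy": [52.0, 61.0, 70.0, 74.0, 78.0, 81.0],
    "GDP_per_capita": [400, 900, 3000, 8000, 30000, 60000],
})


def summarize_distribution(frame: pd.DataFrame) -> pd.DataFrame:
    """Return mean, median, std, and skewness for every numeric column."""
    numeric = frame.select_dtypes(include="number")
    # One row per column, one column per statistic.
    summary = numeric.agg(["mean", "median", "std", "skew"]).T
    # Rule of thumb (an assumption): |skew| > 1 marks a strongly skewed column.
    summary["strongly_skewed"] = summary["skew"].abs() > 1
    return summary


summary = summarize_distribution(toy)
print(summary.round(2))
```

Columns flagged as strongly skewed are candidates for a log transform before plotting or modelling; histograms and box plots would still be needed to judge the actual shape.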

<a id="c422"></a>
#### 4.2.2 Range & Outliers

In [35]:
# TODO: Analyze range and outliers for numerical columns
# - Minimum and maximum values
# - Check if min/max values are reasonable or indicate errors
# - Identify outliers using box plots, IQR method, or z-scores
# - Determine if outliers are genuine extreme values or data entry errors
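
The IQR method mentioned above can be sketched as follows. The helper name `iqr_outlier_mask`, the multiplier `k = 1.5`, and the sample values are illustrative assumptions; the rule marks a value as an outlier when it lies more than `k` interquartile ranges below the first quartile or above the third.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value is an IQR outlier."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical sample, not taken from the dataset.
toy_hiv = pd.Series([0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 9.8], name="Incidents_HIV")
mask = iqr_outlier_mask(toy_hiv)
print(toy_hiv[mask])
```

A flagged value is not necessarily an error: a very high HIV incidence may be genuine for some countries, so each flagged row should be checked against its country and year before any removal.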

<a id="c423"></a>
#### 4.2.3 Data Quality (Numerical)

In [36]:
# TODO: Data quality checks for numerical columns
# - Percentage of missing values
# - Impossible values (e.g., negative ages, prices = 0)
# - Placeholder values (e.g., 999, -1, 0 indicating missing)
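
A range check is one simple way to surface impossible or placeholder values. In the sketch below, the bounds in `valid_ranges` are assumptions chosen for illustration (for example, immunization coverage between 0 and 100 percent), and the toy rows are invented.

```python
import pandas as pd

# Hypothetical rows containing deliberately suspicious values.
toy = pd.DataFrame({
    "Polio": [97, 85, 102, 64],
    "Adult_mortality": [120.5, -1.0, 310.2, 88.0],
    "Life_expectancy": [71.2, 0.0, 55.3, 79.9],
})

# Assumed plausible bounds (inclusive); adjust after reading the dataset documentation.
valid_ranges = {
    "Polio": (0, 100),
    "Adult_mortality": (0, 1000),
    "Life_expectancy": (20, 100),
}


def count_out_of_range(frame: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count values per column that fall outside the given inclusive bounds."""
    counts = {}
    for col, (low, high) in ranges.items():
        # Note: missing values also count as out of range here.
        counts[col] = int((~frame[col].between(low, high)).sum())
    return pd.Series(counts, name="out_of_range")


out_of_range = count_out_of_range(toy, valid_ranges)
print(out_of_range)
```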

<a id="c43"></a>
### 4.3 Categorical Columns Analysis

<a id="c431"></a>
#### 4.3.1 Value Distribution

In [37]:
# TODO: For each categorical column, analyze value distribution
# - Number of unique/distinct values
# - Top 5-10 most frequent values
# - Visualizations: bar charts, count plots
# - Check if distribution is balanced or imbalanced
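
For a categorical column such as `Region`, a frequency table with shares is usually enough to judge balance. The sketch below uses an invented six-row frame and a hypothetical helper `category_profile`; plots are left out to keep it short.

```python
import pandas as pd

# Hypothetical rows; the real notebook would use df["Region"].
toy = pd.DataFrame({
    "Region": ["Africa", "Africa", "Africa", "Asia", "Asia", "European Union"],
})


def category_profile(column: pd.Series, top_n: int = 10) -> pd.DataFrame:
    """Return counts and shares of the most frequent categories."""
    counts = column.value_counts()
    profile = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    return profile.head(top_n)


profile = category_profile(toy["Region"])
print(f"Distinct values: {toy['Region'].nunique()}")
print(profile)
```

Because each row is a country-year pair, row counts per region reflect both the number of countries and the number of years observed, so counting distinct countries per region may be the more meaningful balance check.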

<a id="c432"></a>
#### 4.3.2 Data Quality (Categorical)

In [38]:
# TODO: Data quality checks for categorical columns
# - Percentage of missing values
# - Inconsistencies in categories (e.g., "Male", "male", "M", "m")
# - Typos or spelling variations
# - Unexpected or abnormal values
# - Categories with very few observations (consider grouping)
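
Inconsistent spellings can be detected by grouping the raw labels under a normalized form. This is a minimal sketch: `find_label_variants` is a hypothetical helper, the sample labels are invented, and the normalization (trim plus case folding) will not catch genuine typos.

```python
import pandas as pd


def find_label_variants(column: pd.Series) -> dict:
    """Map each normalized label to its raw spellings when more than one exists."""
    normalized = column.str.strip().str.casefold()
    groups = column.groupby(normalized).unique()
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


# Hypothetical labels with deliberate inconsistencies.
toy_region = pd.Series(
    ["Asia", "asia ", "Africa", "Africa", "Middle East", "middle east"]
)
variants = find_label_variants(toy_region)
print(variants)
```

An empty dictionary means no case or whitespace variants were found; spelling variations would need a separate check, such as reviewing the full list of unique values by eye.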

<a id="c44"></a>
### 4.4 Missing Data Analysis

<a id="c441"></a>
#### 4.4.1 Overall Assessment

In [39]:
# TODO: Overall missing data assessment
# - Create missing values summary (column name, count, percentage)
# - Visualize missing data patterns (heatmap or bar chart)
# - Determine if missing values are random or patterned
# - Check if certain rows or groups have more missing values
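
The missing-value summary described above could look like the sketch below. The toy frame is invented to show non-zero results; on the loaded dataset, where section 4.1.2 reports no rows with missing values, every count should come out as zero.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": 100 * counts / len(frame),
    })
    return summary.sort_values("missing_count", ascending=False)


# Hypothetical frame with gaps, for illustration only.
toy = pd.DataFrame({
    "Hepatitis_B": [88, np.nan, 92, np.nan],
    "Schooling": [7.1, 9.4, np.nan, 11.0],
    "Life_expectancy": [65.0, 72.5, 58.1, 80.3],
})
missing = missing_summary(toy)
print(missing)
```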

<a id="c442"></a>
#### 4.4.2 Per Column Strategy

In [40]:
# TODO: For each column with missing values, determine:
# - Why might values be missing? (random, not applicable, data collection issue)
# - Handling plan (remove, impute, keep as separate category)
# - Justification for chosen strategy
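
If imputation were needed, one candidate strategy is to fill gaps with the median of the same region rather than the global median. The sketch below is illustrative only: the values are invented, and group-wise median imputation is an assumption, not a strategy the notebook has committed to.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in Schooling.
toy = pd.DataFrame({
    "Region": ["Africa", "Africa", "Africa", "Asia", "Asia"],
    "Schooling": [4.0, np.nan, 6.0, 8.0, np.nan],
})

# Median of each region, broadcast back to the original row order.
region_median = toy.groupby("Region")["Schooling"].transform("median")

# Only missing entries are replaced; observed values are left untouched.
filled = toy["Schooling"].fillna(region_median)
print(filled.tolist())
```

The justification to record alongside such a choice is that countries in the same region tend to be more similar than countries worldwide, at the cost of shrinking the within-region variance.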

<a id="c45"></a>
### 4.5 Relationships & Correlations

<a id="c451"></a>
#### 4.5.1 Preliminary Patterns

In [41]:
# TODO: Analyze relationships between variables
# - Calculate correlation matrix for numerical variables
# - Create correlation heatmap
# - Identify strongly correlated pairs (positive or negative)
# - Note any surprising relationships
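
The steps above can be sketched with `DataFrame.corr` and a filter on the upper triangle of the matrix, so that each pair is listed once. The helper `strong_pairs`, the `0.8` threshold, and the toy values are assumptions made for illustration; the heatmap itself is omitted here.

```python
import numpy as np
import pandas as pd


def strong_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return Pearson correlations with |r| >= threshold, one entry per pair."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only entries above the diagonal to avoid self and duplicate pairs.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical values chosen so that mortality columns move together.
toy = pd.DataFrame({
    "Infant_deaths": [60, 45, 30, 15, 5],
    "Under_five_deaths": [90, 66, 42, 20, 6],
    "Alcohol_consumption": [3.0, 1.0, 4.0, 2.0, 3.5],
    "Life_expectancy": [55, 62, 68, 75, 81],
})
pairs = strong_pairs(toy)
print(pairs)
```

Pairs of predictors that are strongly correlated with each other, such as the two child-mortality columns in this toy example, are worth noting because they can make regression coefficients unstable later on.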

<a id="c452"></a>
#### 4.5.2 Cross-tabulations

In [42]:
# TODO: Cross-tabulation analysis
# - Categorical × Categorical: create frequency tables
# - Numerical × Categorical: create grouped summary statistics
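
Both kinds of table can be produced with `pd.crosstab` and `groupby`. The rows below are invented; the column names follow the loaded dataset, and the choice of `Region` against `Economy_status_Developed` is just one example pairing.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "Region": ["Africa", "Africa", "Asia", "Asia",
               "European Union", "European Union"],
    "Economy_status_Developed": [0, 0, 0, 1, 1, 1],
    "Life_expectancy": [58.0, 62.0, 70.0, 83.0, 80.0, 82.0],
})

# Categorical x categorical: row counts per region and development flag.
status_by_region = pd.crosstab(toy["Region"], toy["Economy_status_Developed"])

# Numerical x categorical: summary statistics of life expectancy per region.
life_by_region = toy.groupby("Region")["Life_expectancy"].agg(["mean", "min", "max"])

print(status_by_region)
print(life_by_region)
```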

<a id="c46"></a>
### 4.6 Initial Observations & Insights

<a id="c461"></a>
#### 4.6.1 Summary

TODO: Document key findings
- 3-5 key observations from exploration
- Data quality issues identified
- Necessary preprocessing steps
- Interesting patterns that could lead to research questions

<a id="c462"></a>
#### 4.6.2 Red Flags

TODO: Identify critical issues
- Serious data quality concerns
- Limitations that might affect analysis
- Potential biases or problems to address

<a id="c5"></a>
## 5. Meaningful Questions

<a id="c51"></a>
### 5.1 Question 1

In [43]:
# TODO: Answer Question 1

<a id="c52"></a>
### 5.2 Question 2

In [44]:
# TODO: Answer Question 2

<a id="c53"></a>
### 5.3 Question 3

In [45]:
# TODO: Answer Question 3

<a id="c54"></a>
### 5.4 Question 4

In [46]:
# TODO: Answer Question 4

<a id="c55"></a>
### 5.5 Question 5

#### 1. The Question

What is the relative importance of health indicators, economic factors, and social factors in predicting life expectancy, and can we build a machine learning model that accurately predicts life expectancy for countries with incomplete or missing data?

This question is precise and answerable because:
- The dataset contains all necessary variables: health indicators (Hepatitis_B, Polio, Diphtheria, Measles, Incidents_HIV, Infant_deaths, Under_five_deaths, Adult_mortality), economic factors (GDP_per_capita, Economy_status_Developed/Developing), and social factors (Schooling, BMI, Thinness indicators)
- Life expectancy is available as the target variable
- The dataset spans multiple years and countries, allowing for robust model training and validation
- Missing data patterns can be analyzed and handled appropriately

#### 2. Motivation & Benefits

**Why is this question worth investigating?**

Life expectancy is a critical indicator of population health and overall societal well-being. Understanding which factors most significantly influence life expectancy is essential for evidence-based policy making. Traditional statistical approaches may not capture complex, non-linear relationships between multiple variables, and machine learning models can reveal patterns that inform strategic decision-making. Additionally, many countries, particularly developing nations, have incomplete health and economic data, making accurate prediction models valuable for resource allocation and intervention planning.

**What benefits or insights would answering this question provide?**

- **Prioritization of Interventions**: Identify which health, economic, or social factors have the greatest impact on life expectancy, enabling governments and organizations to allocate limited resources more effectively
- **Predictive Capability**: Enable accurate life expectancy predictions for countries with missing or incomplete data, supporting planning and policy development
- **Feature Importance Ranking**: Quantify the relative contribution of each factor, revealing whether health interventions, economic development, or education investments yield the highest returns
- **Model Interpretability**: Understand which combinations of factors lead to higher or lower life expectancy outcomes
- **Data Gap Analysis**: Identify which missing data points are most critical for accurate predictions

**Who would care about the answer?**

- **Health Ministries and Public Health Agencies**: Need evidence-based priorities for health interventions and resource allocation
- **International Organizations**: WHO, World Bank, UNICEF, and other development agencies require data-driven approaches to guide global health initiatives
- **Policy Makers and Government Officials**: Need quantitative evidence to justify budget allocations and policy decisions
- **Development Economists and Researchers**: Study the relationships between economic development and health outcomes
- **Non-Governmental Organizations (NGOs)**: Working in global health and development need to identify the most impactful interventions
- **Healthcare Planners**: Require predictive models for long-term healthcare infrastructure and resource planning

**What real-world problem or decision does this inform?**

- **Resource Allocation**: Governments and international organizations must decide how to allocate limited budgets between vaccination programs, education initiatives, economic development projects, and healthcare infrastructure
- **Policy Prioritization**: Decision-makers need to determine whether investing in preventive healthcare (vaccinations), treatment programs (HIV/AIDS), economic development, or education will have the greatest impact on life expectancy
- **Global Health Strategy**: International organizations must identify which interventions to prioritize in different regions and economic contexts
- **Development Aid Targeting**: Aid organizations need to identify which countries and interventions will yield the highest health returns on investment
- **Healthcare Infrastructure Planning**: Countries with incomplete data can use predictive models to estimate future healthcare needs and plan infrastructure accordingly

In [47]:
# TODO: Answer Question 5
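
As a first, linear take on relative importance, predictors can be ranked by their standardized regression coefficients. The sketch below fits ordinary least squares with `numpy.linalg.lstsq` on z-scored data, which is the same least-squares fit that the `LinearRegression` class imported earlier performs. The data are synthetic, generated with a fixed seed, and the effect sizes are assumptions chosen only to make the ranking visible.

```python
import numpy as np
import pandas as pd


def standardized_coefficients(frame, target, features):
    """Fit OLS on z-scored columns and rank features by |coefficient|."""
    X = frame[features].to_numpy(dtype=float)
    y = frame[target].to_numpy(dtype=float)
    # z-score predictors and target so coefficients share one scale.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    coefs, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return pd.Series(coefs, index=features).sort_values(key=np.abs, ascending=False)


# Synthetic data (not from the dataset): mortality has the largest effect,
# schooling a smaller one, and population none.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Adult_mortality": rng.normal(200, 80, n),
    "Schooling": rng.normal(8, 3, n),
    "Population_mln": rng.normal(30, 10, n),
})
toy["Life_expectancy"] = (
    80 - 0.08 * toy["Adult_mortality"] + 0.5 * toy["Schooling"] + rng.normal(0, 1, n)
)

ranking = standardized_coefficients(
    toy, "Life_expectancy", ["Adult_mortality", "Schooling", "Population_mln"]
)
print(ranking.round(3))
```

This ranking is only trustworthy when predictors are not strongly correlated with each other, and it cannot capture the non-linear relationships mentioned in the motivation; a held-out test split and a non-linear model would be natural next steps.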


<a id="c56"></a>
### 5.6 Question 6

#### 1. The Question

How have life expectancy trends changed over time (2000-2015) across different regions, and which countries have shown the most significant improvements or declines? Can we identify specific health, economic, or social factors that explain why some countries achieved rapid life expectancy gains while others stagnated or declined during this period?

This question is precise and answerable because:
- The dataset contains Year variable (2000-2015), allowing for temporal trend analysis
- Region variable enables comparison across different geographic areas (Africa, Asia, European Union, Middle East, etc.)
- Multiple years of data for the same countries allow tracking of changes over time
- Health, economic, and social indicators are available to explain observed trends
- Life expectancy values can be compared across time periods to identify improvements or declines

#### 2. Motivation & Benefits

**Why is this question worth investigating?**

Understanding how life expectancy has evolved over time provides critical insights into the effectiveness of health policies, economic development strategies, and social interventions. While cross-sectional analysis shows current disparities, temporal analysis reveals which approaches have been successful and which have failed. Some countries may have achieved remarkable health improvements despite limited resources, while others with greater economic advantages may have stagnated. Identifying these patterns helps understand what works, what doesn't, and why, enabling evidence-based policy learning and transfer of successful strategies.

**What benefits or insights would answering this question provide?**

- **Policy Effectiveness Evaluation**: Identify which health interventions, economic policies, or social programs have led to measurable improvements in life expectancy over time
- **Success Story Identification**: Discover countries that achieved exceptional life expectancy gains and understand the factors that contributed to their success
- **Early Warning System**: Identify countries experiencing declining or stagnating life expectancy trends, enabling early intervention
- **Regional Pattern Recognition**: Understand how different regions have progressed, revealing regional challenges and opportunities
- **Factor Attribution**: Determine which specific factors (vaccination programs, economic growth, education improvements) correlate with life expectancy improvements
- **Benchmarking and Goal Setting**: Enable countries to compare their progress against similar countries and set realistic improvement targets
- **Historical Context**: Provide context for understanding current health disparities by showing how they developed over time

**Who would care about the answer?**

- **Public Health Policy Makers**: Need to evaluate the long-term effectiveness of health policies and programs
- **International Development Agencies**: WHO, World Bank, and regional health organizations require evidence of what interventions work over time
- **Government Health Departments**: Need to track progress toward health goals and identify successful strategies from other countries
- **Health Researchers and Epidemiologists**: Study long-term health trends and their determinants
- **Development Economists**: Analyze the relationship between economic development and health outcomes over time
- **Global Health Advocates**: Need evidence to advocate for effective health interventions and policies
- **Country Health Planners**: Require benchmarks and success stories to guide their own health improvement strategies

**What real-world problem or decision does this inform?**

- **Health Policy Evaluation**: Governments need to assess whether their health investments over the past 15 years have yielded expected returns in life expectancy
- **Resource Reallocation**: Countries experiencing stagnation or decline can identify which factors to prioritize based on successful countries' experiences
- **International Aid Strategy**: Development organizations must decide which countries and interventions to prioritize based on demonstrated effectiveness over time
- **Health Goal Setting**: Countries can set realistic targets for life expectancy improvement based on what similar countries have achieved
- **Policy Transfer**: Successful strategies from rapidly improving countries can be adapted and transferred to similar contexts
- **Crisis Identification**: Early identification of countries with declining trends enables timely intervention before problems become severe
- **Regional Health Planning**: Regional health organizations can identify common challenges and coordinate responses based on shared trends
- **Long-term Investment Decisions**: Governments and international organizations can make informed decisions about long-term health investments based on historical effectiveness

In [48]:
# TODO: Answer Question 6
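
A simple way to begin is to compare each country's life expectancy in the first and last year, then average by region and year. In the sketch below, the countries `A`, `B`, and `C` and all values are hypothetical, and `life_expectancy_change` is an illustrative helper rather than part of the notebook.

```python
import pandas as pd

# Hypothetical panel: three made-up countries observed in two years.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Region": ["Africa", "Africa", "Asia", "Asia", "Asia", "Asia"],
    "Year": [2000, 2015] * 3,
    "Life_expectancy": [50.0, 62.0, 70.0, 74.0, 68.0, 67.0],
})


def life_expectancy_change(frame, start_year, end_year):
    """Return the change in life expectancy per country, largest gain first."""
    wide = frame.pivot(index="Country", columns="Year", values="Life_expectancy")
    change = (wide[end_year] - wide[start_year]).rename("change")
    return change.sort_values(ascending=False)


change = life_expectancy_change(toy, 2000, 2015)

# Mean life expectancy per region and year, one column per year.
regional = toy.groupby(["Region", "Year"])["Life_expectancy"].mean().unstack("Year")

print(change)
print(regional)
```

Countries at the top and bottom of `change` are the candidates for a closer look at which health, economic, or social indicators moved over the same period; an endpoint difference ignores the path in between, so a per-country slope over all years would be a more robust alternative.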

<a id="c6"></a>
## 6. Project Summary

<a id="c61"></a>
### 6.1 Key Findings

TODO:

*List 3-5 most important insights from your analysis:*
- :
- :
- :

*Highlight the most interesting or surprising discovery:*

<a id="c62"></a>
### 6.2 Limitations

TODO: Document limitations

*Dataset Limitations:*
- Sample size:
- Biases:
- Missing data:

*Analysis Limitations:*
- Methodology constraints:
- Unanswered aspects:

*Scope Limitations:*
- What we couldn't address:

<a id="c63"></a>
### 6.3 Future Directions (If You Had More Time)

TODO: Document future research directions

*Additional Questions to Explore:*
- :
- :

*Deeper Analysis:*
- :
- :

*Alternative Methods/Approaches:*
- :
- :

*Additional Data to Seek:*
- :
- :

*Project Expansion/Improvement:*
- :
- :

<a id="c64"></a>
### 6.4 Individual Reflections

<a id="c641"></a>
#### 6.4.1 Student 1: Cao Trần Bá Đạt

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*
- Technical:
- Analytical:
- Conceptual:

*How I overcame them:*
- :
- :

*Most challenging aspect and why:*
- :


**Learning & Growth:**

*What I learned:*
- Technical skills:
- Analytical approaches:
- Domain knowledge:

*What surprised me most:*
- :

*How this project shaped my understanding of data science:*
- :

<a id="c642"></a>
#### 6.4.2 Student 2: Trần Hoài Thiện Nhân

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*
- Technical:
- Analytical:
- Conceptual:

*How I overcame them:*
- :
- :

*Most challenging aspect and why:*
- :

**Learning & Growth:**

*What I learned:*
- Technical skills:
- Analytical approaches:
- Domain knowledge:

*What surprised me most:*
- :

*How this project shaped my understanding of data science:*
- :

<a id="c643"></a>
#### 6.4.3 Student 3: Bùi Nam Việt

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*
- Technical:
- Analytical:
- Conceptual:

*How I overcame them:*
- :
- :

*Most challenging aspect and why:*
- :

**Learning & Growth:**

*What I learned:*
- Technical skills:
- Analytical approaches:
- Domain knowledge:

*What surprised me most:*
- :

*How this project shaped my understanding of data science:*
- :

<a id="c7"></a>
## 7. References

TODO: List the reference sources
- Dataset source: WHO
- Libraries documentation
- Research papers