# <center>Life Expectancy (WHO) Dataset</center>

## Project Objective
Analyze and predict average life expectancy based on economic, health, and social factors from the World Health Organization (WHO) dataset.

---

## Student Information

**Student 1:**
- Full Name: Cao Trần Bá Đạt
- Student ID: 23127168

**Student 2:**
- Full Name: Trần Hoài Thiện Nhân
- Student ID: 23127238

**Student 3:**
- Full Name: Bùi Nam Việt
- Student ID: 23127516

**Class:** 23KHDL

---

## Table of Contents

1. [Dataset Information](#c1)
    - [1.1 About the Data Subject](#c11)
    - [1.2 Data Source](#c12)
    - [1.3 Data License & Usage Rights](#c13)
    - [1.4 Data Collection Method](#c14)
    - [1.5 Rationale for Dataset Selection](#c15)
2. [Import Libraries](#c2)
3. [Load Dataset](#c3)
4. [Data Exploration](#c4)
    - [4.1 Dataset Overview](#c41)
        - [4.1.1 Basic Information](#c411)
        - [4.1.2 Data Integrity](#c412)
        - [4.1.3 Column Inventory](#c413)
        - [4.1.4 Data Types](#c414)
    - [4.2 Numerical Columns Analysis](#c42)
        - [4.2.1 Distribution & Central Tendency](#c421)
        - [4.2.2 Range & Outliers](#c422)
        - [4.2.3 Data Quality (Numerical)](#c423)
    - [4.3 Categorical Columns Analysis](#c43)
        - [4.3.1 Value Distribution](#c431)
        - [4.3.2 Data Quality (Categorical)](#c432)
    - [4.4 Missing Data Analysis](#c44)
        - [4.4.1 Overall Assessment](#c441)
        - [4.4.2 Per Column Strategy](#c442)
    - [4.5 Relationships & Correlations](#c45)
        - [4.5.1 Preliminary Patterns](#c451)
        - [4.5.2 Cross-tabulations](#c452)
    - [4.6 Initial Observations & Insights](#c46)
        - [4.6.1 Summary](#c461)
        - [4.6.2 Red Flags](#c462)
5. [Meaningful Questions](#c5)
    - [5.1 Question 1](#c51)
    - [5.2 Question 2](#c52)
    - [5.3 Question 3](#c53)
    - [5.4 Question 4](#c54)
    - [5.5 Question 5](#c55)
    - [5.6 Question 6](#c56)
6. [Project Summary](#c6)
    - [6.1 Key Findings](#c61)
    - [6.2 Limitations](#c62)
    - [6.3 Future Directions (If You Had More Time)](#c63)
    - [6.4 Individual Reflections](#c64)
        - [6.4.1 Student 1 - Cao Trần Bá Đạt](#c641)
        - [6.4.2 Student 2 - Trần Hoài Thiện Nhân](#c642)
        - [6.4.3 Student 3 - Bùi Nam Việt](#c643)
7. [References](#c7)

---

<a id="c1"></a>
## 1. Dataset Information

<a id="c11"></a>
### 1.1 About the Data Subject

**What subject is your data about?**
- **Topic/Domain**: 

    - Topic: Country-level population health data, drawn from the Global Health Observatory (GHO) data repository of the World Health Organization (WHO).

    - Domain: The source data covers 193 countries from 2000 to 2015 (179 remain in the updated version used here). Features include age-specific mortality rates, childhood vaccination rates, BMI, demographics, and national economic statistics.

- **Real-world context**: Many past studies have examined factors affecting life expectancy, considering demographic variables, income composition, and mortality rates. However, the effects of immunization and the human development index were not taken into account. In addition, some earlier research applied multiple linear regression to a single year of data for all countries. This motivates addressing both gaps by fitting regression models (a mixed-effects model and multiple linear regression) on data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio, and Diphtheria are also considered. In short, the study focuses on immunization, mortality, economic, social, and other health-related factors. Because the observations are organized by country, it becomes easier for a country to identify which predictors contribute to a lower life expectancy, and therefore which areas to prioritize in order to improve the life expectancy of its population efficiently.

<a id="c12"></a>
### 1.2 Data Source

**What is the source of your data?**

- **Platform name**: Kaggle.
- **Full URL**: [Life Expectancy Fixed](https://www.kaggle.com/datasets/lashagoch/life-expectancy-who-updated?resource=download).
- **Original author(s) or organization**: The dataset is published by Kaggle user "Lasha Gochiashvili", who started from the original dataset by Kaggle user "KumarRajarshi" and corrected some of its errors. (Link to "KumarRajarshi"'s dataset: [Original dataset](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who)).
- **Publication/collection date**: 2023

<a id="c13"></a>
### 1.3 Data License & Usage Rights 

**Do the authors of this data allow this kind of use?**

- **Dataset license**: CC0 1.0 Universal (Public Domain Dedication). License URL: [License](https://creativecommons.org/publicdomain/zero/1.0/).
- **Educational use permitted**: Yes. The license page states: "You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission."
- **Usage restrictions**: Unless expressly stated otherwise, the person who associated a work with this deed makes no warranties about the work, and disclaims liability for all uses of the work, to the fullest extent permitted by applicable law.
- **Attribution requirements**: CC0 does not require attribution. When using or citing the work, users should not imply endorsement by the author or the affirmer.

<a id="c14"></a>
### 1.4 Data Collection Method

**How did the authors collect the data?**

- **Collection method** (survey, sensors, administrative records, web scraping, etc.): The author started from an existing Kaggle dataset and used web scraping to correct its errors. The following sources were used for data cleaning:

    - Average life expectancy of both genders in different years from 2000 to 2015: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/life-expectancy-at-birth-(years)

    - Mortality-related attributes (infant deaths, under-five-deaths, adult mortality): https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates

    - Alcohol consumption, recorded in liters of pure alcohol per capita (ages 15+): https://www.who.int/data/gho/data/indicators/indicator-details/GHO/alcohol-recorded-per-capita-(15-)-consumption-(in-litres-of-pure-alcohol)

    - % of coverage of Hepatitis B (HepB3) immunization among 1-year-olds: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/hepatitis-b-(hepb3)-immunization-coverage-among-1-year-olds-(-)

    - % of coverage of Measles containing vaccine first dose (MCV1) immunization among 1-year-olds: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/measles-containing-vaccine-first-dose-(mcv1)-immunization-coverage-among-1-year-olds-(-)

    - % of coverage of Polio (Pol3) immunization among 1-year-olds: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/polio-(pol3)-immunization-coverage-among-1-year-olds-(-)

    - % of coverage of Diphtheria tetanus toxoid and pertussis (DTP3) immunization among 1-year-olds: https://www.who.int/data/gho/data/indicators/indicator-details/GHO/diphtheria-tetanus-toxoid-and-pertussis-(dtp3)-immunization-coverage-among-1-year-olds-(-)

    - BMI: https://www.who.int/europe/news-room/fact-sheets/item/a-healthy-lifestyle---who-recommendations

    - Incidents of HIV per 1000 population aged 15-49: https://data.worldbank.org/indicator/SH.HIV.INCD.ZS

    - Prevalence of thinness among adolescents aged 10-19 years (BMI more than 2 standard deviations below the median): https://www.who.int/data/gho/indicator-metadata-registry/imr-details/4805

    - GDP per capita in current USD: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?most_recent_year_desc=true

    - Total population in millions: https://data.worldbank.org/indicator/SP.POP.TOTL?most_recent_year_desc=true

    - Average years that people aged 25+ spent in formal education: https://ourworldindata.org/grapher/mean-years-of-schooling-long-run

- **Target population and sampling approach**: 
    - Target population: The unit of study is the country, not the individual. The dataset includes health and socioeconomic information for 179 countries (filtered from the original list of 193 WHO members).
    - Sampling approach: Non-probability sampling. This is secondary data compiled comprehensively (census-style) from national reports.

    - Source: Author compiled data from:

        - Global Health Observatory (GHO) of the World Health Organization (WHO) for health/immunization indicators.

        - World Bank for economic indicators (GDP, Population) to correct errors in the original data.

        - Our World in Data for data on education (Schooling).

    - Exclusion criteria: The author excluded countries with excessive missing data (missing more than 4 columns of information), such as Sudan, South Sudan, and North Korea. This was a form of convenience sampling based on data availability.

- **Time period of data collection**: The data form a 16-year time series, from 2000 to 2015.
- **Known limitations or biases in collection**: 
    - Imputation Bias: 
        - A major limitation of the original dataset is its large number of missing values. The author of this "Updated" version filled the blank cells automatically, using either the average of the last 3 years or the average of the whole region.
        - Risk: This approach smooths the data artificially and can mask real fluctuations. For example, a sudden drop in life expectancy in a particular year caused by a natural disaster, epidemic, or war would be averaged away.

    - Selection Bias: As mentioned, underdeveloped or conflict-affected countries (which are unlikely to report data) were removed from the dataset. This means that models trained on this data may perform less accurately when predicting for extremely poor or politically unstable countries.

    - Classification Reliability: The Status column (Developed vs Developing) is based on countries' self-declaration or older standards, which sometimes do not reflect current economic realities (e.g., transition economies).

    - Raw data quality issues: Although corrected, the original data from developing countries are often of lower precision (due to poor statistical systems) than those from developed countries, leading to discrepancies in reliability between rows in the table.

<a id="c15"></a>
### 1.5 Rationale for Dataset Selection

**Why did you choose this dataset?**

- **What interests your group about this topic**: This topic appeals to us for its human dimension and for our curiosity about how economics relates to people's lives:

    - Decoding the "Secret of Longevity": We want to find out whether money (GDP) is really the all-important determinant of longevity, or whether social factors like Schooling and Immunization play a key role.

    - Global Disparities: We are interested in visualizing the gap between developed and developing countries. Why do some countries with low resources still achieve high life expectancy? (For example, the paradox of countries that are poor yet have good public health.)

    - Practical Applicability: These are not just abstract numbers. Understanding these factors can inform public policy decisions: if a government has a limited budget, should it invest in education or in health to raise the life expectancy of its people most effectively?

- **Potential questions or insights this data could provide**: Based on the data columns, we plan to explore the following aspects:

    - Prediction: Build a Machine Learning model to predict the life expectancy of a country based on socio-economic indicators. How accurate is the model?

    - Feature Importance: Which factor has the strongest impact on life expectancy?

        - Question: Between Schooling and GDP, which factor has a higher correlation with Life Expectancy?

        - Question: How do Infant Deaths and Adult Mortality differ in their impact on life expectancy?

    - Trend Analysis:
        - How has global life expectancy changed from 2000 to 2015?

        - Are developing countries closing the gap with developed countries?

    - Outlier Detection: Are there countries with very high GDP but low life expectancy (or vice versa)? What could be the cause (e.g. HIV/AIDS rates, alcohol, accidents)?

---

<a id="c2"></a>
## 2. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


<a id="c3"></a>
## 3. Load Dataset

In [None]:
df = pd.read_csv('../data/Life-Expectancy-Data-Updated.csv')
print("Dataset loaded successfully")

<a id="c4"></a>
## 4. Data Exploration

<a id="c41"></a>
### 4.1 Dataset Overview

<a id="c411"></a>
#### 4.1.1 Basic Information

In [None]:
# TODO: Basic dataset information
# - Number of rows
# - Number of columns
# - What each row represents
# - Overall dataset size
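
A minimal sketch of how these basics could be collected. `summarize_shape` is a hypothetical helper, the toy frame stands in for `df` with made-up values, and the column names `Country`, `Year`, and `Life_expectancy` are assumptions about the CSV that should be checked against `df.columns`.

```python
import pandas as pd

def summarize_shape(frame: pd.DataFrame) -> dict:
    """Return row count, column count, and in-memory size (hypothetical helper)."""
    return {
        "n_rows": frame.shape[0],
        "n_cols": frame.shape[1],
        "memory_kb": frame.memory_usage(deep=True).sum() / 1024,
    }

# Toy stand-in for `df`; values are made up, column names are assumed.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "Life_expectancy": [70.1, 70.4, 62.3, 62.9],
})

info = summarize_shape(toy)
print(info)

# If each row represents one country-year, (Country, Year) pairs must be unique.
one_row_per_country_year = not toy.duplicated(subset=["Country", "Year"]).any()
print("One row per country-year:", one_row_per_country_year)
```

Calling `summarize_shape(df)` on the loaded dataset would give the real figures.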

<a id="c412"></a>
#### 4.1.2 Data Integrity

In [None]:
# TODO: Check data integrity
# - Duplicated rows count
# - Decision to keep or remove duplicates (with justification)
# - Empty/incomplete rows check
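
One possible way to run these integrity checks, shown on a toy frame with made-up values. `integrity_report` is a hypothetical helper, and dropping exact duplicates and fully empty rows is an assumed policy for illustration, not a decision taken from the dataset itself.

```python
import numpy as np
import pandas as pd

def integrity_report(frame: pd.DataFrame) -> dict:
    """Count exact duplicate rows and rows with missing values (hypothetical helper)."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "fully_empty_rows": int(frame.isna().all(axis=1).sum()),
        "rows_with_any_missing": int(frame.isna().any(axis=1).sum()),
    }

# Toy stand-in for `df`: one exact duplicate, one partly missing row, one empty row.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", None],
    "Year": [2000, 2000, 2001, np.nan],
    "Life_expectancy": [70.1, 70.1, np.nan, np.nan],
})

report = integrity_report(toy)
print(report)

# Assumed policy: drop exact duplicates and rows with no information at all.
cleaned = toy.drop_duplicates().dropna(how="all")
print(len(cleaned), "rows remain")
```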

<a id="c413"></a>
#### 4.1.3 Column Inventory

In [None]:
# TODO: Column inventory analysis
# - Meaning/definition of each column
# - Relevant columns for analysis
# - Columns that should be dropped (with justification)

<a id="c414"></a>
#### 4.1.4 Data Types

In [None]:
# TODO: Data types analysis
# - Current data type of each column
# - Inappropriate data types identification
# - Columns needing type conversion
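
A hedged sketch of how mistyped columns could be spotted: it looks for non-numeric columns whose values all parse as numbers. `numeric_like_text_columns` is a hypothetical helper, and the toy frame (made-up values, assumed column names) deliberately stores `Year` as text to trigger the check.

```python
import pandas as pd

def numeric_like_text_columns(frame: pd.DataFrame) -> list:
    """Names of non-numeric columns whose non-null values all parse as numbers."""
    candidates = []
    for name in frame.columns:
        column = frame[name]
        if pd.api.types.is_numeric_dtype(column):
            continue  # already numeric, nothing to convert
        non_null = column.dropna()
        parsed = pd.to_numeric(non_null, errors="coerce")
        if len(non_null) > 0 and parsed.notna().all():
            candidates.append(name)
    return candidates

# Toy stand-in for `df`; `Year` is stored as text on purpose.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Year": ["2000", "2001", "2002"],
    "GDP_per_capita": [1200.0, 560.5, 30100.0],
})

to_convert = numeric_like_text_columns(toy)
converted = toy.copy()
for name in to_convert:
    converted[name] = pd.to_numeric(converted[name])

print(to_convert)
print(converted.dtypes)
```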

<a id="c42"></a>
### 4.2 Numerical Columns Analysis

<a id="c421"></a>
#### 4.2.1 Distribution & Central Tendency

In [None]:
# TODO: For each numerical column, analyze distribution
# - Distribution shape (normal, skewed, bimodal, uniform)
# - Visualizations: histograms, box plots, density plots
# - Calculate: mean, median, standard deviation
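
The statistics listed above could be computed along these lines. `distribution_summary` and `shape_hint` are hypothetical helpers, the toy values are made up, and the skewness cut-off of ±1 is a rule of thumb chosen for illustration; histograms or box plots should confirm the shape.

```python
import pandas as pd

def shape_hint(skew: float) -> str:
    """Label a distribution from its sample skewness (assumed ±1 cut-off)."""
    if skew > 1:
        return "right-skewed"
    if skew < -1:
        return "left-skewed"
    return "roughly symmetric"

def distribution_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean, median, standard deviation, and skewness for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    summary = numeric.agg(["mean", "median", "std", "skew"]).T
    summary["shape_hint"] = summary["skew"].apply(shape_hint)
    return summary

# Toy stand-in for `df`; values are made up, column names are assumed.
toy = pd.DataFrame({
    "GDP_per_capita": [500.0, 800.0, 1200.0, 1500.0, 60000.0],
    "Life_expectancy": [60.0, 65.0, 70.0, 75.0, 80.0],
})

summary = distribution_summary(toy)
print(summary)
```

A large gap between mean and median, as in the toy `GDP_per_capita` column, is a quick signal of skew before any plot is drawn.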

<a id="c422"></a>
#### 4.2.2 Range & Outliers

In [None]:
# TODO: Analyze range and outliers for numerical columns
# - Minimum and maximum values
# - Check if min/max values are reasonable or indicate errors
# - Identify outliers using box plots, IQR method, or z-scores
# - Determine if outliers are genuine extreme values or data entry errors
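
As an illustration of the IQR method mentioned above, the sketch below flags values outside `[Q1 - k*IQR, Q3 + k*IQR]`. `iqr_outlier_mask` is a hypothetical helper, `k=1.5` is the conventional default rather than a choice made by this project, and the series holds made-up values.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy stand-in for one numeric column of `df`; values are made up.
toy = pd.Series([61.0, 63.5, 64.0, 65.2, 66.1, 67.0, 38.0], name="Life_expectancy")

mask = iqr_outlier_mask(toy)
print(toy[mask])
```

A flagged value is only a candidate: whether it is a data entry error or a genuine extreme (for example, a year affected by conflict or an epidemic) still needs to be judged per country and year.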

<a id="c423"></a>
#### 4.2.3 Data Quality (Numerical)

In [None]:
# TODO: Data quality checks for numerical columns
# - Percentage of missing values
# - Impossible values (e.g., negative ages, prices = 0)
# - Placeholder values (e.g., 999, -1, 0 indicating missing)
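
A rough sketch of range and placeholder checks. `count_out_of_range` is a hypothetical helper, and both the bounds (0 to 100 for a percentage, 0 to 125 for a life expectancy in years) and the placeholder code `999` are assumptions for illustration; the toy values are made up.

```python
import pandas as pd

def count_out_of_range(frame: pd.DataFrame, bounds: dict) -> dict:
    """Count values outside an inclusive (low, high) range for each listed column."""
    counts = {}
    for name, (low, high) in bounds.items():
        values = frame[name].dropna()
        counts[name] = int(((values < low) | (values > high)).sum())
    return counts

# Toy stand-in for `df`; values are made up, column names are assumed.
toy = pd.DataFrame({
    "Hepatitis_B": [95, 88, 104, 999],             # percent coverage
    "Life_expectancy": [71.2, -1.0, 64.8, 80.3],   # years
})

# Assumed plausible ranges, to be adjusted per column definition.
assumed_bounds = {"Hepatitis_B": (0, 100), "Life_expectancy": (0, 125)}

violations = count_out_of_range(toy, assumed_bounds)
placeholder_hits = int((toy["Hepatitis_B"] == 999).sum())

print(violations)
print("Rows using 999 as a placeholder:", placeholder_hits)
```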

<a id="c43"></a>
### 4.3 Categorical Columns Analysis

<a id="c431"></a>
#### 4.3.1 Value Distribution

In [None]:
# TODO: For each categorical column, analyze value distribution
# - Number of unique/distinct values
# - Top 5-10 most frequent values
# - Visualizations: bar charts, count plots
# - Check if distribution is balanced or imbalanced
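
The checks above could be sketched as follows. `categorical_profile` is a hypothetical helper, the `Region` labels are made-up examples, and the share of the most frequent category is used here as one simple (assumed) indicator of imbalance.

```python
import pandas as pd

def categorical_profile(series: pd.Series, top: int = 10) -> dict:
    """Unique count, most frequent values, and share of the largest category."""
    counts = series.value_counts(dropna=True)
    return {
        "n_unique": int(series.nunique(dropna=True)),
        "top_values": counts.head(top),
        "majority_share": float(counts.iloc[0] / counts.sum()),
    }

# Toy stand-in for one categorical column of `df`; labels are made up.
toy = pd.Series(
    ["Asia", "Asia", "Africa", "Asia", "European Union", "Africa"],
    name="Region",
)

profile = categorical_profile(toy, top=2)
print(profile["n_unique"])
print(profile["top_values"])
print(round(profile["majority_share"], 2))
```

`profile["top_values"].plot(kind="bar")` would give the corresponding bar chart.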

<a id="c432"></a>
#### 4.3.2 Data Quality (Categorical)

In [None]:
# TODO: Data quality checks for categorical columns
# - Percentage of missing values
# - Inconsistencies in categories (e.g., "Male", "male", "M", "m")
# - Typos or spelling variations
# - Unexpected or abnormal values
# - Categories with very few observations (consider grouping)
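
One way to surface inconsistent spellings is to group raw labels by a normalized key (trimmed and case-folded) and report keys that map to more than one raw form. `inconsistent_labels` is a hypothetical helper, and the `Status` values are made up to show the idea.

```python
import pandas as pd

def inconsistent_labels(series: pd.Series) -> dict:
    """Map each normalized label to its raw spellings when there is more than one."""
    raw = series.dropna().astype(str)
    key = raw.str.strip().str.casefold()
    variants = raw.groupby(key).unique()
    return {k: sorted(v) for k, v in variants.items() if len(v) > 1}

# Toy stand-in for one categorical column of `df`; labels are made up.
toy = pd.Series(
    ["Developed", "developed ", "Developing", "Developing", "DEVELOPED"],
    name="Status",
)

issues = inconsistent_labels(toy)
print(issues)
```

This only catches differences in case and surrounding whitespace; typos such as a missing letter would need a fuzzy comparison or manual review of the unique values.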

<a id="c44"></a>
### 4.4 Missing Data Analysis

<a id="c441"></a>
#### 4.4.1 Overall Assessment

In [None]:
# TODO: Overall missing data assessment
# - Create missing values summary (column name, count, percentage)
# - Visualize missing data patterns (heatmap or bar chart)
# - Determine if missing values are random or patterned
# - Check if certain rows or groups have more missing values
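
A minimal sketch of the missing-value summary and a per-group check. `missing_summary` is a hypothetical helper; the toy frame uses made-up values and assumed column names, with the gaps concentrated in one region to mimic a patterned (non-random) case.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Missing count and percentage per column, keeping only columns with gaps."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    has_gaps = summary[summary["missing_count"] > 0]
    return has_gaps.sort_values("missing_count", ascending=False)

# Toy stand-in for `df`; values are made up, column names are assumed.
toy = pd.DataFrame({
    "Region": ["Africa", "Africa", "Asia", "Asia"],
    "Schooling": [np.nan, np.nan, 8.1, 9.0],
    "GDP_per_capita": [900.0, np.nan, 4100.0, 5200.0],
    "Life_expectancy": [58.0, 60.5, 71.2, 73.0],
})

summary = missing_summary(toy)
print(summary)

# Share of missing values per region: large differences suggest a pattern.
by_region = toy.drop(columns="Region").isna().groupby(toy["Region"]).mean()
print(by_region)
```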

<a id="c442"></a>
#### 4.4.2 Per Column Strategy

In [None]:
# TODO: For each column with missing values, determine:
# - Why might values be missing? (random, not applicable, data collection issue)
# - Handling plan (remove, impute, keep as separate category)
# - Justification for chosen strategy

<a id="c45"></a>
### 4.5 Relationships & Correlations

<a id="c451"></a>
#### 4.5.1 Preliminary Patterns

In [None]:
# TODO: Analyze relationships between variables
# - Calculate correlation matrix for numerical variables
# - Create correlation heatmap
# - Identify strongly correlated pairs (positive or negative)
# - Note any surprising relationships
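
The steps above can be sketched as follows. `strong_pairs` is a hypothetical helper, the threshold of 0.8 is an assumed cut-off for "strong", and the toy values are made up so that some pairs are nearly collinear.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Pearson correlations with |r| >= threshold, each pair listed once."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both listed.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)

# Toy stand-in for `df`; values are made up, column names are assumed.
toy = pd.DataFrame({
    "Schooling": [4.0, 6.0, 8.0, 10.0, 12.0],
    "Life_expectancy": [55.0, 60.0, 66.0, 71.0, 78.0],
    "Adult_mortality": [400.0, 330.0, 260.0, 190.0, 120.0],
    "Alcohol_consumption": [3.0, 1.0, 4.0, 1.0, 5.0],
})

result = strong_pairs(toy)
print(result)
```

The full matrix from `.corr()` can be passed to `sns.heatmap` for the heatmap; correlation alone does not show which variable drives the other.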

<a id="c452"></a>
#### 4.5.2 Cross-tabulations

In [None]:
# TODO: Cross-tabulation analysis
# - Categorical × Categorical: create frequency tables
# - Numerical × Categorical: create grouped summary statistics
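
Both kinds of table could be produced roughly like this, using `pd.crosstab` for the frequency table and `groupby(...).agg(...)` for the grouped statistics. The toy frame uses made-up values, and the column names `Region`, `Status`, and `Life_expectancy` are assumptions to check against the loaded data.

```python
import pandas as pd

# Toy stand-in for `df`; values are made up, column names are assumed.
toy = pd.DataFrame({
    "Region": ["Africa", "Africa", "Asia", "Asia", "Asia"],
    "Status": ["Developing", "Developing", "Developing", "Developed", "Developed"],
    "Life_expectancy": [58.0, 62.0, 69.0, 81.0, 83.0],
})

# Categorical x Categorical: frequency table of rows per (Region, Status).
freq = pd.crosstab(toy["Region"], toy["Status"])
print(freq)

# Numerical x Categorical: summary statistics of a numeric column per group.
grouped = toy.groupby("Status")["Life_expectancy"].agg(["count", "mean", "median"])
print(grouped)
```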

<a id="c46"></a>
### 4.6 Initial Observations & Insights

<a id="c461"></a>
#### 4.6.1 Summary

TODO: Document key findings
- 3-5 key observations from exploration
- Data quality issues identified
- Necessary preprocessing steps
- Interesting patterns that could lead to research questions

<a id="c462"></a>
#### 4.6.2 Red Flags

TODO: Identify critical issues
- Serious data quality concerns
- Limitations that might affect analysis
- Potential biases or problems to address

<a id="c5"></a>
## 5. Meaningful Questions

<a id="c51"></a>
### 5.1 Question 1

In [None]:
# TODO: Answer Question 1

<a id="c52"></a>
### 5.2 Question 2

In [None]:
# TODO: Answer Question 2

<a id="c53"></a>
### 5.3 Question 3

In [None]:
# TODO: Answer Question 3

<a id="c54"></a>
### 5.4 Question 4

In [None]:
# TODO: Answer Question 4

<a id="c55"></a>
### 5.5 Question 5

In [None]:
# TODO: Answer Question 5

<a id="c56"></a>
### 5.6 Question 6

In [None]:
# TODO: Answer Question 6

<a id="c6"></a>
## 6. Project Summary

<a id="c61"></a>
### 6.1 Key Findings

TODO:

*List 3-5 most important insights from your analysis:*
- :
- :
- :

*Highlight the most interesting or surprising discovery:*

<a id="c62"></a>
### 6.2 Limitations

TODO: Document limitations

*Dataset Limitations:*
- Sample size:
- Biases:
- Missing data:

*Analysis Limitations:*
- Methodology constraints:
- Unanswered aspects:

*Scope Limitations:*
- What we couldn't address:

<a id="c63"></a>
### 6.3 Future Directions (If You Had More Time)

TODO: Document future research directions

*Additional Questions to Explore:*
- :
- :

*Deeper Analysis:*
- :
- :

*Alternative Methods/Approaches:*
- :
- :

*Additional Data to Seek:*
- :
- :

*Project Expansion/Improvement:*
- :
- :

<a id="c64"></a>
### 6.4 Individual Reflections

<a id="c641"></a>
#### 6.4.1 Student 1 - Cao Trần Bá Đạt

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*
- Technical:
- Analytical:
- Conceptual:

*How I overcame them:*
- :
- :

*Most challenging aspect and why:*
- :


**Learning & Growth:**

*What I learned:*
- Technical skills:
- Analytical approaches:
- Domain knowledge:

*What surprised me most:*
- :

*How this project shaped my understanding of data science:*
- :

<a id="c642"></a>
#### 6.4.2 Student 2 - Trần Hoài Thiện Nhân

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*
- Technical:
- Analytical:
- Conceptual:

*How I overcame them:*
- :
- :

*Most challenging aspect and why:*
- :

**Learning & Growth:**

*What I learned:*
- Technical skills:
- Analytical approaches:
- Domain knowledge:

*What surprised me most:*
- :

*How this project shaped my understanding of data science:*
- :

<a id="c643"></a>
#### 6.4.3 Student 3 - Bùi Nam Việt

**Challenges & Difficulties Encountered:**

*Specific obstacles faced:*
- Technical:
- Analytical:
- Conceptual:

*How I overcame them:*
- :
- :

*Most challenging aspect and why:*
- :

**Learning & Growth:**

*What I learned:*
- Technical skills:
- Analytical approaches:
- Domain knowledge:

*What surprised me most:*
- :

*How this project shaped my understanding of data science:*
- :

<a id="c7"></a>
## 7. References

TODO: List the references
- Dataset source: WHO
- Libraries documentation
- Research papers