
# Life Expectancy Data: Exploratory Data Analysis (EDA)

This notebook explores the **Life Expectancy Data** from the World Health Organization (WHO).  
The aim is to understand key relationships between socioeconomic and health-related features and prepare the dataset for a **Linear Regression** model.

We'll perform:
- Data overview and cleaning  
- Summary statistics and null/duplicate checks  
- Visual exploration of correlations, distributions, and outliers  


### Setup and read in the data

In [None]:
# Import libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sys
sys.path.append('../src') # Add source code to path

# Import source code
from preprocessing import load_data

# Read in the dataset
df = load_data()

# Display first few rows to confirm load
df.head()


FileNotFoundError: [Errno 2] No such file or directory: 'Life_Expectancy_Data.csv'
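
The `FileNotFoundError` above suggests that `load_data` resolves `Life_Expectancy_Data.csv` relative to the current working directory, which for a notebook is usually the notebook's own folder. A minimal sketch of a more forgiving loader follows. `load_csv_from_candidates` and the `../data` folder are hypothetical names used for illustration; they are not part of the project's `preprocessing` module.

```python
from pathlib import Path

import pandas as pd


def load_csv_from_candidates(filename, search_dirs):
    """Return the first copy of `filename` found in `search_dirs` as a DataFrame.

    Hypothetical helper for illustration; it is not the project's `load_data`.
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return pd.read_csv(candidate)
    # Report the absolute paths that were tried to make the failure easy to debug.
    searched = ", ".join(str(Path(d).resolve()) for d in search_dirs)
    raise FileNotFoundError(f"{filename} not found in: {searched}")


# Example call (the '../data' location is an assumption about the repo layout):
# df = load_csv_from_candidates("Life_Expectancy_Data.csv", [".", "../data"])
```

Listing the searched directories in the error message shows at a glance whether the notebook was started from an unexpected folder.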

## Data Overview

In [None]:
# Check dataset dimensions (rows, columns)

df.shape


In [None]:
# Check column names and data types

df.info()



**Observations:**  
- Numeric columns appear to be stored with appropriate numeric dtypes, so no type conversion looks necessary.  
- 'Country' and 'Region' are categorical and will require encoding if included in the model.  
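
As a rough illustration of the encoding step mentioned above, the sketch below one-hot encodes a `Region` column with `pd.get_dummies`. The rows are made-up toy values, not taken from the dataset, and one-hot encoding is only one of several possible choices.

```python
import pandas as pd

# Toy rows standing in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["Kenya", "Spain", "Peru"],
    "Region": ["Africa", "European Union", "South America"],
    "Schooling": [6.5, 9.8, 8.9],
})

# One-hot encode Region. drop_first=True drops one level so the dummy columns
# are not perfectly collinear, which matters for linear regression.
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

`Country` is left untouched here because a column with many distinct values would produce a very wide one-hot matrix.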


In [None]:
# View random sample of rows to inspect data variety

df.sample(5)


In [None]:
# Check number of unique values per column

df.nunique().sort_values(ascending=False)



**Insight:**  
This helps identify potential categorical features and assess diversity within columns.
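
One way to turn the unique-value counts into an explicit list of candidate categorical features is sketched below. `summarise_cardinality` and the `max_unique` cutoff of 10 are illustrative assumptions, and the toy frame is made up.

```python
import pandas as pd


def summarise_cardinality(frame, max_unique=10):
    """Flag columns that are non-numeric or have few distinct values."""
    n_unique = frame.nunique()
    non_numeric = pd.Series(
        [not pd.api.types.is_numeric_dtype(frame[col]) for col in frame.columns],
        index=frame.columns,
    )
    summary = pd.DataFrame({
        "n_unique": n_unique,
        "likely_categorical": non_numeric | (n_unique <= max_unique),
    })
    return summary.sort_values("n_unique", ascending=False)


# Toy frame with one text column, one 0/1 flag and one continuous column.
toy = pd.DataFrame({
    "Region": ["A", "B", "A", "B"] * 3,
    "Economy_status_Developed": [0, 1] * 6,
    "GDP_per_capita": [100.0 * i for i in range(1, 13)],
})
result = summarise_cardinality(toy)
print(result)
```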


## Data Cleaning

### Missing Values

In [None]:
# Check for missing values

df.isnull().sum()


In [None]:
# Visualize missing data heatmap

plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Data Overview")
plt.show()



**Observation:**  
No missing values are shown in either the per-column counts or the heatmap.
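
`isnull()` only detects true nulls, so placeholder strings such as an empty cell or `"?"` would go unnoticed. The sketch below counts both. `missing_report` and its placeholder list are assumptions made for illustration, and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def missing_report(frame, placeholders=("", "?", "NA", "N/A", "-")):
    """Count true nulls and placeholder strings per column."""
    report = pd.DataFrame({"n_null": frame.isnull().sum()})
    report["n_placeholder"] = 0
    for col in frame.columns:
        # Placeholder strings can only hide in non-numeric columns.
        if not pd.api.types.is_numeric_dtype(frame[col]):
            cleaned = frame[col].dropna().astype(str).str.strip()
            report.loc[col, "n_placeholder"] = int(cleaned.isin(list(placeholders)).sum())
    report["pct_null"] = 100 * report["n_null"] / len(frame)
    return report


# Toy frame with one true null per column and one blank-string placeholder.
toy = pd.DataFrame({
    "Country": ["Kenya", " ", "Peru", None],
    "BMI": [22.1, np.nan, 25.3, 24.0],
})
report = missing_report(toy)
print(report)
```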


### Duplicate Check

In [None]:
# Check for duplicate rows

df.duplicated().sum()


### Column Name Consistency

In [None]:
# Display column names

print(df.columns.tolist())



Check for any spelling inconsistencies or formatting issues. None were found: all column names are consistently formatted and can be used for modelling as they are.
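
The visual check of column names can be backed by a small automated one. In the sketch below, `flag_irregular_names` is a hypothetical helper, and the "letters, digits and underscores only" rule is an assumed naming convention rather than a requirement of pandas.

```python
import re


def flag_irregular_names(columns):
    """Return names that are not plain identifiers (letters, digits, underscores)."""
    pattern = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
    return [name for name in columns if not pattern.match(name)]


# Toy names: two clean, one padded with spaces, one containing spaces.
toy_columns = ["Life_expectancy", " BMI ", "GDP per capita", "Schooling"]
irregular = flag_irregular_names(toy_columns)
print(irregular)
```

An empty list from `flag_irregular_names(df.columns)` would support the conclusion that no renaming is needed.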


### Categorical Entry Consistency

In [None]:
region = df.Region.unique()
print(region)

In [None]:
country = df.Country.unique()
print(country)

**Observation:** No spelling inconsistencies were found in the entries for 'Country' and 'Region'.
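
Reading long lists of unique values by eye can miss variants that differ only in case or surrounding whitespace. The sketch below groups such variants. `find_inconsistent_labels` is a hypothetical helper, and the toy labels are made up.

```python
import pandas as pd


def find_inconsistent_labels(values):
    """Group labels that become identical after trimming and case-folding."""
    labels = pd.Series(values).dropna().drop_duplicates().reset_index(drop=True)
    normalised = labels.str.strip().str.casefold()
    groups = labels.groupby(normalised).agg(list)
    # Keep only the normalised keys that map to more than one raw spelling.
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Toy labels: "Asia" appears in two spellings, "Africa" in one.
variants = find_inconsistent_labels(["Asia", "asia ", "Africa", "Asia"])
print(variants)
```

This does not catch genuine misspellings; it only covers case and whitespace differences.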

### Column (Feature) Categorisation: Sensitive vs Non-Sensitive

**Sensitive:**

- Identifiers: Country, Year
- Health stats: Alcohol_consumption, Hepatitis_B, Measles, BMI, Polio, Diphtheria, Incidents_HIV, Thinness_ten_nineteen_years and Thinness_five_nine_years

**Non-Sensitive:**

- Infant_deaths, Under_five_deaths, GDP_per_capita, Population_mln, Schooling, Economy_status_Developed and Economy_status_Developing

**Reasoning:** We classified specific identifiers and anything from health records as sensitive data, and any information that is publicly available as non-sensitive data.

## Summary Statistics

In [None]:
# Summary of numeric features

df.describe()



**Interpretation Notes:**  
- Check for potential outliers or unrealistic values (e.g., 0 for a metric that cannot be zero); none were found.  
- Identify columns with large value ranges for potential scaling later.  
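
To complement reading `describe()` by eye, the sketch below counts values outside the usual 1.5 × IQR fences for each numeric column. `iqr_outlier_counts` is a hypothetical helper, the IQR rule is just one common convention, and the toy values are made up.

```python
import pandas as pd


def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    below = numeric.lt(q1 - k * iqr, axis="columns")
    above = numeric.gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()


# Toy frame: GDP_per_capita has one extreme value, Schooling has none.
toy = pd.DataFrame({
    "GDP_per_capita": [1, 2, 3, 4, 100],
    "Schooling": [5, 6, 7, 8, 9],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

A flagged value is not necessarily an error; heavily skewed features can legitimately have many points beyond the fences.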


## Exploratory Visual Analysis

In [None]:
# Boxplot to visually inspect outliers across numeric columns

plt.figure(figsize=(15,6))
df.boxplot()
plt.xticks(rotation=45)
plt.title("Boxplot of Numeric Columns")
plt.show()


### Correlations

In [None]:
# Correlation heatmap for numerical features

corr_matrix = df.corr(numeric_only=True)

plt.figure(figsize=(12,10))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.show()



**Observations:**  

Some strong positive and negative correlations are worth exploring further.

Multicollinearity and redundancy are likely between:

- Economy_status_Developed and Economy_status_Developing
- Under_five_deaths and Infant_deaths
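
Pairs like these can also be listed programmatically instead of being read off the heatmap. The sketch below keeps the upper triangle of the correlation matrix and filters by absolute value. `high_correlation_pairs` and the 0.9 threshold are illustrative assumptions, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(frame, threshold=0.9):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr(numeric_only=True)
    # Keep each pair once by masking the diagonal and the lower triangle.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)


# Toy frame: the two economy flags are exact complements of each other.
toy = pd.DataFrame({
    "Economy_status_Developed": [1, 0, 1, 0, 0],
    "Economy_status_Developing": [0, 1, 0, 1, 1],
    "Infant_deaths": [10, 50, 12, 60, 55],
    "Under_five_deaths": [14, 70, 16, 85, 78],
})
pairs = high_correlation_pairs(toy)
print(pairs)
```

For a linear regression, dropping one column from each strongly correlated pair is a common first step, although which one to keep is a modelling decision.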
 

### Feature Distributions

In [None]:
# Histograms for all numeric columns

df.hist(figsize=(15,10), bins=20)
plt.suptitle("Feature Distributions", y=1.02)
plt.show()



**Observation:**  
Helps identify skewed distributions or features that may need transformation before regression.
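
Skewness can be quantified with `DataFrame.skew()`, and a log transform is one common remedy for right-skewed features. The sketch below uses made-up toy values, and the `log_GDP_per_capita` column name is introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: GDP_per_capita is strongly right-skewed, Schooling is symmetric.
toy = pd.DataFrame({
    "GDP_per_capita": [500, 800, 1200, 2000, 45000],
    "Schooling": [4, 6, 8, 10, 12],
})

# Sample skewness per column; values far from 0 indicate asymmetry.
skew_before = toy.skew()

# log1p(x) = log(1 + x) compresses large values and is safe when x == 0.
toy["log_GDP_per_capita"] = np.log1p(toy["GDP_per_capita"])
skew_after = toy["log_GDP_per_capita"].skew()

print(skew_before)
print(skew_after)
```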


### Correlation With Life Expectancy

In [None]:
# Correlation of all numeric columns with life expectancy

corr_life = df.corr(numeric_only=True)['Life_expectancy'].sort_values(ascending=False)

# Display the top 10 most correlated features
print('Top positive correlations:')
print(corr_life.head(10))

print('\nTop negative correlations:')
print(corr_life.tail(10))



In [None]:
# Heatmap to visualise these correlations

corr = df.corr(numeric_only=True)
sns.heatmap(corr[['Life_expectancy']].sort_values(by='Life_expectancy', ascending=False), annot=True)
plt.show()

Identified Potential Features of Interest:

- Adult_mortality
- Infant_deaths
- Schooling
- Polio
- Diphtheria
- BMI
- GDP_per_capita
- Economy_status_Developed
- Measles
- Thinness_ten_nineteen_years

In [None]:
# Pair plot of the top 5 most correlated features against each other

cols = [
    'Adult_mortality',
    'Infant_deaths',
    'Schooling',
    'Polio',
    'Diphtheria',
]

sns.pairplot(df[cols])
plt.show()

In [None]:
# Pair plot of the top 5 most correlated features in relation to life expectancy

cols = [ 
        'Adult_mortality',
        'Infant_deaths',
        'Schooling',
        'Polio',
        'Diphtheria',
]

sns.pairplot(
    df, 
    x_vars=cols,
    y_vars=['Life_expectancy'],
)
plt.suptitle('Pairplot of Key Features vs Life Expectancy', y=1.02)
plt.show()


**Insight:**  
These visuals help spot linear trends that could be meaningful for our regression model.
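
A visual trend can be backed up with a quick single-feature least-squares fit. The sketch below uses `np.polyfit` and reports the slope, intercept and R². `linear_fit_summary` is a hypothetical helper, and the toy numbers are made up to lie exactly on a line.

```python
import numpy as np


def linear_fit_summary(x, y):
    """Fit y = slope * x + intercept by least squares and return (slope, intercept, R^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, intercept, 1 - ss_res / ss_tot


# Toy values: life expectancy falls by 0.1 years per unit of adult mortality.
adult_mortality = [100, 150, 200, 250]
life_expectancy = [80, 75, 70, 65]
slope, intercept, r_squared = linear_fit_summary(adult_mortality, life_expectancy)
print(slope, intercept, r_squared)
```

An R² well below 1 on the real data would suggest the relationship is noisy or non-linear, which a pair plot alone can make hard to judge.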



## Summary of Findings

- Dataset appears clean with no major missing or duplicate data.  
- Numeric and categorical columns identified.  
- Outlier inspection and correlation analysis reveal potential influential variables.  
- Data is now ready for preprocessing and modelling.
  