In [6]:
import pandas as pd

# Load the dataset (replace 'Life_Expectancy_Data.csv' with your file path)
df = pd.read_csv('Life_Expectancy_Data.csv')

# Display the first 5 rows to check the data
print(df.head())

       Country  Year      Status  Life expectancy   Adult Mortality  \
0  Afghanistan  2015  Developing              65.0              263   
1  Afghanistan  2014  Developing              59.9              271   
2  Afghanistan  2013  Developing              59.9              268   
3  Afghanistan  2012  Developing              59.5              272   
4  Afghanistan  2011  Developing              59.2              275   

   infant deaths  Alcohol  percentage expenditure  Hepatitis B  Measles   ...  \
0             62     0.01               71.279624           65      1154  ...   
1             64     0.01               73.523582           62       492  ...   
2             66     0.01               73.219243           64       430  ...   
3             69     0.01               78.184215           67      2787  ...   
4             71     0.01                7.097109           68      3013  ...   

   Polio  Total expenditure  Diphtheria    HIV/AIDS         GDP  Population  \
0      

In [7]:
# Count missing values per column
missing_values = df.isnull().sum()
print("Missing Values per Column:")
print(missing_values)

Missing Values per Column:
Country                            0
Year                               0
Status                             0
Life expectancy                    0
Adult Mortality                    0
infant deaths                      0
Alcohol                            0
percentage expenditure             0
Hepatitis B                        0
Measles                            0
 BMI                               0
under-five deaths                  0
Polio                              0
Total expenditure                  0
Diphtheria                         0
 HIV/AIDS                          0
GDP                                0
Population                         0
 thinness  1-19 years              0
 thinness 5-9 years                0
Income composition of resources    0
Schooling                          0
dtype: int64


In [8]:
# Calculate percentage of missing values
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nMissing Values Percentage per Column:")
print(missing_percentage.round(2))  # Rounds to 2 decimal places


Missing Values Percentage per Column:
Country                            0.0
Year                               0.0
Status                             0.0
Life expectancy                    0.0
Adult Mortality                    0.0
infant deaths                      0.0
Alcohol                            0.0
percentage expenditure             0.0
Hepatitis B                        0.0
Measles                            0.0
 BMI                               0.0
under-five deaths                  0.0
Polio                              0.0
Total expenditure                  0.0
Diphtheria                         0.0
 HIV/AIDS                          0.0
GDP                                0.0
Population                         0.0
 thinness  1-19 years              0.0
 thinness 5-9 years                0.0
Income composition of resources    0.0
Schooling                          0.0
dtype: float64


In [9]:
# Create a summary DataFrame
missing_summary = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_percentage.round(2)
})

# Display columns with missing values only (if any)
print("\nSummary of Missing Values:")
print(missing_summary[missing_summary['Missing Values'] > 0])


Summary of Missing Values:
Empty DataFrame
Index: []


In [None]:
### Task 2: Handling Missing Data

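In this dataset, the summary shows **no missing values** in any column: every count from `df.isnull().sum()` is zero, and filtering the summary for columns with missing values returns an empty DataFrame. Therefore, **no imputation or removal is necessary**. Note that `isnull()` only flags `NaN`/`None` entries, so placeholder codes (such as `0` or `-1` standing in for unknown values) would not be counted here.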

If missing values were present, the choice of method would depend on:
- The proportion of missing data.
- The importance of the variable.
- The nature of the analysis.

**Common approaches:**
- **Imputation:** Filling missing values with mean, median, mode, or using advanced techniques (e.g., KNN imputation).
- **Removal:** Dropping rows when only a small fraction of them is affected, or dropping a column when most of its values are missing.
- **Algorithms that handle missing values:** Some models (e.g., XGBoost) can handle missing data natively.
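
To make the first two approaches concrete, here is a minimal sketch on a small hypothetical frame. The `toy` DataFrame and its values are made up for illustration and are not taken from the dataset; only the column names echo it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "GDP": [500.0, np.nan, 700.0, 900.0],
    "Schooling": [10.0, 12.0, np.nan, 14.0],
    "Status": ["Developing", None, "Developing", "Developed"],
})

# Removal: drop every row that has at least one missing value.
dropped = toy.dropna()

# Imputation: median for numeric columns, mode for the categorical column.
imputed = toy.copy()
numeric_cols = imputed.select_dtypes(include="number").columns
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
imputed["Status"] = imputed["Status"].fillna(imputed["Status"].mode().iloc[0])

print("Rows before/after removal:", len(toy), len(dropped))
print("Missing values left after imputation:", int(imputed.isnull().sum().sum()))
```

Removal keeps only the fully observed rows, while imputation keeps every row at the cost of introducing estimated values. Which trade-off is acceptable depends on the factors listed above.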

**Justification for this dataset:**  
Since no missing values were detected, no imputation or removal was applied, and the original data passes unchanged to the downstream tasks.

In [None]:
### Task 3: Implement the Chosen Method and Evaluate Its Impact

Since the dataset contains **no missing values** (as shown in the previous analysis), there is **no need to implement any imputation or removal methods**. All columns are complete, and the data integrity is preserved.

**Impact on the Dataset:**
- No data was altered or removed.
- The dataset remains unchanged and ready for further analysis.
- No loss of information or introduction of bias due to missing data handling.

**Conclusion:**  
We can proceed confidently to the next steps of data exploration and modeling, knowing that missing data does not affect our analysis.
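
Had imputation been required, one way to evaluate its impact would be to compare summary statistics before and after. The sketch below assumes a hypothetical helper, `imputation_impact`, and a made-up single-column frame; neither is part of the notebook.

```python
import numpy as np
import pandas as pd

def imputation_impact(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Compare per-column mean and standard deviation before and after imputation."""
    numeric_cols = before.select_dtypes(include="number").columns
    return pd.DataFrame({
        "mean_before": before[numeric_cols].mean(),
        "mean_after": after[numeric_cols].mean(),
        "std_before": before[numeric_cols].std(),
        "std_after": after[numeric_cols].std(),
    })

# Hypothetical data: one missing GDP value, filled with the column median.
before = pd.DataFrame({"GDP": [500.0, np.nan, 700.0, 900.0]})
after = before.fillna(before.median())

report = imputation_impact(before, after)
print(report.round(2))
```

In this toy case the mean is unchanged but the standard deviation shrinks, which is the usual side effect of filling gaps with a central value.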

In [None]:
### Task 4: Explore the Dataset and Identify Potential Features

To build predictive models or perform further analysis, it's important to identify relevant features (independent variables) and the target variable (dependent variable).

**Target Variable:**
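- `Life expectancy`: This is typically the main outcome of interest in health and demographic studies. It is a continuous value (e.g., 65.0 or 59.9 in the first rows), so predicting it is a regression problem.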

**Potential Features:**
- **Demographic Variables:**
    - `Country`
    - `Year`
    - `Status` (Developed/Developing)

- **Health Indicators:**
    - `Adult Mortality`
    - `infant deaths`
    - `under-five deaths`
    - `Hepatitis B`
    - `Measles`
    - `Polio`
    - `Diphtheria`
    - `HIV/AIDS`
    - `BMI`
    - `thinness  1-19 years`
    - `thinness 5-9 years`

- **Socioeconomic Indicators:**
    - `GDP`
    - `Population`
    - `percentage expenditure`
    - `Total expenditure`
    - `Income composition of resources`
    - `Schooling`
    - `Alcohol`

**Feature Selection Considerations:**
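- Remove features with near-zero variance, and drop one feature from each highly correlated pair to avoid redundancy (`infant deaths` and `under-five deaths` overlap by definition, so they are an obvious pair to check).
- Encode categorical variables for modeling. `Status` is binary and suits one-hot encoding; `Country` has many levels, so one-hot encoding it adds many columns.
- Strip whitespace from column names before selecting features: the output above shows names with leading spaces, such as ` BMI` and ` HIV/AIDS`.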
- Exclude the `Year` variable if temporal trends are not of interest or if using data from a single year.
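
The considerations above can be sketched as follows. This is a minimal illustration on hypothetical rows: the values, the ` constant` column, and the 0.95 correlation threshold are assumptions chosen for the example, not results from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; column names mimic the dataset, values are made up.
toy = pd.DataFrame({
    "Status": ["Developed", "Developing", "Developed",
               "Developing", "Developing", "Developed"],
    "infant deaths": [10, 20, 30, 40, 50, 60],
    "under-five deaths": [13, 26, 40, 52, 66, 79],
    "Schooling": [12, 9, 15, 10, 14, 11],
    " constant": [1, 1, 1, 1, 1, 1],  # zero-variance column with a stray space
})

# Strip stray whitespace from column names.
toy.columns = toy.columns.str.strip()

# One-hot encode the binary Status column, keeping a single indicator.
encoded = pd.get_dummies(toy, columns=["Status"], drop_first=True, dtype=int)

# Drop columns whose variance is zero.
variances = encoded.var()
encoded = encoded.drop(columns=variances[variances == 0].index)

# For each highly correlated pair, drop the feature that appears later.
threshold = 0.95  # assumed cutoff; tune for the analysis at hand
corr = encoded.corr().abs()
to_drop = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[:i][col] > threshold).any()
]
selected = encoded.drop(columns=to_drop)

print("Dropped as redundant:", to_drop)
print("Remaining features:", list(selected.columns))
```

Dropping the later column of each pair is an arbitrary tie-break. In practice one would keep whichever feature is easier to interpret or more complete.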

**Next Steps:**
- Perform exploratory data analysis (EDA) to understand distributions, relationships, and correlations among variables.
- Check for multicollinearity before modeling, and examine feature importance once a baseline model has been fitted.
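
One common multicollinearity check is the variance inflation factor (VIF). The sketch below computes it as the diagonal of the inverse correlation matrix, which matches the usual regression-based definition. The helper `vif_table` and the synthetic columns `x1`, `x2`, `x3` are hypothetical and stand in for the numeric features of the dataset.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns, name="VIF")

# Synthetic data: x3 is nearly a copy of x1, while x2 is independent of both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
synthetic = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

vif = vif_table(synthetic)
print(vif.round(1))
```

A VIF close to 1 indicates little overlap with the other features. Values above roughly 5 to 10 are often treated as a sign of problematic collinearity, though that cutoff is a rule of thumb rather than a fixed standard.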