# Business Understanding

## Project Overview
Climate change is intensifying health risks across Africa, influencing the incidence of diseases such as malaria, cholera, and respiratory infections. Understanding how climatic and environmental conditions drive disease outbreaks is crucial for guiding public health interventions.

## Problem Statement
Public health agencies in Africa need a data-driven way to predict when and where disease risks will rise due to changing climate conditions (temperature, humidity, air quality, rainfall). Early prediction enables timely response, resource allocation, and prevention planning.

## Business Question
Can we predict disease risk levels using climate and environmental indicators?

## Objective
Build a machine learning model that predicts the vector-borne disease risk score (`vector_disease_risk_score`), or another health metric such as `respiratory_disease_rate`, from climate and environmental data.

## Specific Goals
- Explore relationships between climate variables and disease indicators.  
- Identify the most influential climatic factors driving disease outbreaks.  
- Build and evaluate ML models that predict disease risk with low prediction error.  
- Provide actionable insights for climate-health policy and planning.
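
As a rough illustration of the first two goals, the sketch below ranks climate variables by the strength of their linear correlation with the target. It is not the project's analysis: the data are synthetic, the relationship between temperature and risk is an assumption made purely for the demo, and `rank_by_correlation` is a hypothetical helper name. Pearson correlation only captures linear association, so it is a starting point rather than a measure of influence.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: all values are made up for illustration only.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "avg_temp": rng.normal(25, 5, n),
    "rainfall_mm": rng.normal(100, 30, n),
    "humidity": rng.normal(60, 10, n),
})
# Assumed relationship for the demo: risk is driven mainly by temperature.
demo["vector_disease_risk_score"] = (
    2.0 * demo["avg_temp"] + 0.05 * demo["rainfall_mm"] + rng.normal(0, 1, n)
)


def rank_by_correlation(df, features, target):
    """Return absolute Pearson correlations with `target`, largest first."""
    corr = df[features].corrwith(df[target])
    return corr.abs().sort_values(ascending=False)


ranking = rank_by_correlation(
    demo,
    ["avg_temp", "rainfall_mm", "humidity"],
    "vector_disease_risk_score",
)
print(ranking)
```

On the real dataset, the same helper could be called with `climate_data` and the column names from the data dictionary, assuming those columns exist and are numeric.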


# Data Understanding

## Dataset Description
- **Source:** Africa Climate Change dataset (provided)  
- **Scope:** Multiple African countries over time  

## Main Feature Categories
| Feature Category       | Features                                                                 |
|------------------------|--------------------------------------------------------------------------|
| **Climate**            | `avg_temp`, `rainfall_mm`, `humidity`, `air_quality_index`               |
| **Health**             | `vector_disease_risk_score`, `respiratory_disease_rate`, `waterborne_disease_incidents`, `heat_related_admissions` |
| **Socio-economic**     | `gdp_per_capita_usd`, `healthcare_access_index`, `food_security_index`  |
| **Geographic/Temporal**| `country_name`, `region`, `year`, `month`, `latitude`, `longitude`      |
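
The feature names in this table can be checked against the columns that are actually loaded. The sketch below is one way to do that, assuming the table is an accurate list of expected names. `FEATURE_GROUPS` and `missing_columns` are hypothetical names introduced here, and the example column list is hand-written rather than taken from the dataset.

```python
# Expected columns per category, transcribed from the table above.
FEATURE_GROUPS = {
    "climate": ["avg_temp", "rainfall_mm", "humidity", "air_quality_index"],
    "health": [
        "vector_disease_risk_score",
        "respiratory_disease_rate",
        "waterborne_disease_incidents",
        "heat_related_admissions",
    ],
    "socio_economic": [
        "gdp_per_capita_usd",
        "healthcare_access_index",
        "food_security_index",
    ],
    "geo_temporal": [
        "country_name", "region", "year", "month", "latitude", "longitude",
    ],
}


def missing_columns(columns, groups):
    """Map each group name to the expected columns absent from `columns`."""
    present = set(columns)
    return {
        group: [name for name in names if name not in present]
        for group, names in groups.items()
    }


# Hand-written partial column list, for illustration only.
example_columns = [
    "country_name", "region", "year", "month",
    "latitude", "longitude", "avg_temp",
]
report = missing_columns(example_columns, FEATURE_GROUPS)
for group, missing in report.items():
    print(f"{group}: missing {missing}")
```

With the dataset loaded, `missing_columns(climate_data.columns, FEATURE_GROUPS)` would flag any name in the table that does not match a real column.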

## Target Variable
- **vector_disease_risk_score**  
  - Continuous variable → regression problem
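
Because the target is continuous, models are usually compared on error metrics rather than classification accuracy. The sketch below shows two common ones (RMSE and MAE) applied to a predict-the-mean baseline. The scores are made-up numbers, and `rmse` and `mae` are hypothetical helper names, not functions from the notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up risk scores, for illustration only.
y_true = np.array([2.0, 4.0, 6.0, 8.0])

# Naive baseline: always predict the mean of the observed scores.
baseline_pred = np.full_like(y_true, y_true.mean())

baseline_rmse = rmse(y_true, baseline_pred)
baseline_mae = mae(y_true, baseline_pred)
print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"baseline MAE:  {baseline_mae:.3f}")
```

A trained regression model is only useful if it beats this kind of baseline on held-out data.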


# Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Load the dataset
file_path = "../data/africa_climate_change.xlsx"
climate_data = pd.read_excel(file_path)

# Step 3: Basic inspection
# Note: in a notebook cell, only the last bare expression is displayed, so
# `shape` and `head()` below are evaluated but not shown; `info()` prints directly.
climate_data.shape
climate_data.head()
climate_data.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2256 entries, 0 to 2255
Data columns (total 30 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   record_id                     2256 non-null   int64         
 1   country_code                  2256 non-null   object        
 2   country_name                  2256 non-null   object        
 3   region                        2256 non-null   object        
 4   income_level                  2256 non-null   object        
 5   date                          2256 non-null   datetime64[ns]
 6   year                          2256 non-null   int64         
 7   month                         2256 non-null   int64         
 8   week                          2256 non-null   int64         
 9   latitude                      2256 non-null   float64       
 10  longitude                     2256 non-null   float64       
 11  population_millions           