![Alt Text](water_pollution_and_disease_outbreaks__the_cost_of_inaction/water_pollution3.jpg)


# 🌍 Water Pollution & Disease Outbreaks: Can We Predict the Cost of Inaction?

Water pollution is not just an environmental issue; it is a public health emergency.

This data analysis project explores the relationship between water quality indicators and disease outbreaks across different regions and time periods. We will investigate whether patterns in water pollution can help predict and prevent public health crises, and estimate the economic impact of inaction.

We will follow a data-driven approach involving:

- Structured ETL (Extract, Transform, Load)
- Statistical analysis and pattern detection
- Visual exploration
- Hypothesis testing
- Economic impact estimation


## 🔄 ETL Process

Before diving into the analysis, we need to make sure the dataset is clean and ready.

The ETL steps will include:

- **Extract**: Move the dataset from the Downloads folder to a project directory (`raw_data/`)
- **Statistics Table**: Create a summary of the most relevant statistics.
- **Transform**: Clean the data, check for outliers, and standardise formats
- **Load**: Summarise and prepare the cleaned dataset for analysis


In [4]:
import os
import shutil

# Step 1 Define paths
downloads_path = os.path.expanduser("~/Downloads")
source_file = os.path.join(downloads_path, "water_pollution_disease.csv")
destination_folder = "raw_data"
destination_file = os.path.join(destination_folder, "water_pollution_disease.csv")

# Step 2 Create destination folder if it doesn't exist
os.makedirs(destination_folder, exist_ok=True)

# Step 3 Move the file
if os.path.exists(source_file):
    shutil.move(source_file, destination_file)
    print(f"✅ File moved to: {destination_file}")
else:
    print("❌ Source file not found.")


❌ Source file not found.
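
The "Source file not found" message also appears when the cell is re-run after the file has already been moved, because the source path no longer exists at that point. A minimal sketch of a re-runnable variant, assuming a hypothetical helper named `stage_file` (not part of the original notebook), could check the destination first:

```python
import os
import shutil


def stage_file(source_file, destination_file):
    """Move source_file to destination_file, tolerating re-runs.

    Hypothetical helper for illustration. Returns a short status string:
    'already staged', 'moved', or 'missing'.
    """
    # If an earlier run already moved the file, there is nothing left to do.
    if os.path.exists(destination_file):
        return "already staged"

    if os.path.exists(source_file):
        # Create the destination folder only when a move will actually happen.
        os.makedirs(os.path.dirname(destination_file) or ".", exist_ok=True)
        shutil.move(source_file, destination_file)
        return "moved"

    # Neither location has the file, so the download is genuinely missing.
    return "missing"
```

With the paths defined in the cell above, `stage_file(source_file, destination_file)` would distinguish a genuinely missing download from a file that is already in `raw_data/`.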


## Table of Statistics: Why Is the Summary Statistics Table Important?

Before diving into visualisation or modelling, it is critical to understand the **basic structure and distribution** of the data. The summary statistics table helps us:

#### Spot Trends and Ranges
- Measures like **mean**, **median**, **min**, and **max** show the typical values and range for each variable, helping us understand scale and variation.

#### Detect Skewed Data
- **Skewness** tells us whether the distribution is symmetrical or lopsided. For example, disease rates or contaminant levels may be heavily skewed in certain regions.

#### Identify Outliers and Heavy Tails
- **Kurtosis** shows if a variable has more extreme values than a normal distribution, which is crucial when working with public health or environmental data where outliers might signal crises.

#### Validate Assumptions
- Summary statistics can confirm if the data matches expectations (e.g. pH levels within safe biological limits, reasonable GDP values).

Together, these metrics guide decisions on:
- Which features may need transformation
- Which variables could drive meaningful insights
- Whether data normalisation or outlier handling is required

In short, the statistical summary builds a **foundation of trust** in the dataset and helps define the right next steps.
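
To make the two shape measures concrete, here is a minimal pure-Python sketch of the moment-based formulas. The function names are hypothetical and the sample lists are made up for illustration. The formulas are the biased (population) estimators, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` compute with their default arguments, with kurtosis reported as excess kurtosis (a normal distribution scores 0).

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Third standardised moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardised moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Illustrative data only: an evenly spread variable and a right-skewed one.
evenly_spread = [i / 1000 for i in range(1001)]
right_skewed = [1, 1, 1, 2, 10]

print("evenly spread:", skewness(evenly_spread), excess_kurtosis(evenly_spread))
print("right skewed: ", skewness(right_skewed), excess_kurtosis(right_skewed))
```

The evenly spread list gives a skewness of essentially zero and an excess kurtosis close to -1.2, while the list with one large value gives a clearly positive skewness.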


In [5]:
import pandas as pd
from scipy.stats import skew, kurtosis


# Step 4 Load data
df = pd.read_csv(os.path.join(destination_folder, "water_pollution_disease.csv"))

# Step 5 Select numeric columns
num_df = df.select_dtypes(include='number')

# Step 6 Summary table
summary_stats = pd.DataFrame({
    'Mean': num_df.mean(),
    'Median': num_df.median(),
    'Std Dev': num_df.std(),
    'Min': num_df.min(),
    'Max': num_df.max(),
    'Skewness': num_df.apply(skew),
    'Kurtosis': num_df.apply(kurtosis)
})

summary_stats.round(2)


| Variable | Mean | Median | Std Dev | Min | Max | Skewness | Kurtosis |
|---|---:|---:|---:|---:|---:|---:|---:|
| Year | 2012.01 | 2012.0 | 7.23 | 2000.0 | 2024.0 | 0.0 | -1.21 |
| Contaminant Level (ppm) | 4.95 | 4.95 | 2.86 | 0.0 | 10.0 | 0.0 | -1.16 |
| pH Level | 7.26 | 7.28 | 0.72 | 6.0 | 8.5 | -0.02 | -1.21 |
| Turbidity (NTU) | 2.48 | 2.46 | 1.42 | 0.0 | 4.99 | 0.05 | -1.16 |
| Dissolved Oxygen (mg/L) | 6.49 | 6.49 | 2.03 | 3.0 | 10.0 | 0.02 | -1.23 |
| Nitrate Level (mg/L) | 25.08 | 24.79 | 14.51 | 0.05 | 49.99 | 0.02 | -1.21 |
| Lead Concentration (µg/L) | 10.05 | 10.07 | 5.8 | 0.0 | 20.0 | -0.02 | -1.22 |
| Bacteria Count (CFU/mL) | 2488.48 | 2469.0 | 1431.42 | 0.0 | 4998.0 | 0.01 | -1.19 |
| Access to Clean Water (% of Population) | 64.61 | 64.78 | 20.31 | 30.01 | 99.99 | 0.02 | -1.22 |
| Diarrheal Cases per 100,000 people | 249.78 | 248.0 | 144.11 | 0.0 | 499.0 | 0.01 | -1.19 |


## Initial Data Insights

This statistical summary provides a foundational understanding of the dataset before deeper analysis. Here's what stands out:

#### Central Tendencies & Spread
- For most variables the **mean and median nearly coincide**, suggesting balanced, symmetric distributions.
- Standard deviations vary widely in absolute terms, partly because the variables are measured in different units. For instance, GDP per Capita and Bacteria Count have very large spreads, indicating large disparities between regions.

#### pH Level Check
- The pH levels range from **6.00 to 8.50**, with a mean of 7.26. The maximum sits at the upper edge of the **6.5–8.5 range for drinkable water**, but the minimum of 6.00 falls below it, so some samples are more acidic than that range allows. This will be important in health correlation checks.
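
One way to quantify this check is to count how many rows fall outside the 6.5–8.5 range. The sketch below uses a small made-up DataFrame in place of the real dataset; only the column name `pH Level` is taken from the summary table, and the helper name `share_outside_range` is hypothetical.

```python
import pandas as pd


def share_outside_range(series, low, high):
    """Return the fraction of non-missing values outside [low, high]."""
    values = series.dropna()
    inside = values.between(low, high)  # inclusive on both ends by default
    return float((~inside).mean())


# Illustrative values only, not taken from the dataset.
demo = pd.DataFrame({"pH Level": [6.0, 6.4, 6.5, 7.2, 8.5]})
print(share_outside_range(demo["pH Level"], 6.5, 8.5))
```

On the loaded data, the equivalent call would be `share_outside_range(df["pH Level"], 6.5, 8.5)`, assuming the column name matches the summary table.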

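#### Skewness & Kurtosis
- All skewness values are very close to 0 → **no major skew detected**.
- All kurtosis values are negative and near -1.2, indicating **light tails** (fewer extreme outliers than a normal distribution). Because `scipy.stats.kurtosis` reports excess kurtosis by default, and a uniform distribution has an excess kurtosis of exactly -1.2, the values appear to be spread almost evenly across their ranges.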

#### Red Flags
- Some variables, such as `Diarrheal Cases`, `Cholera Cases`, and `Bacteria Count`, have **minimums at 0**, which might reflect under-reporting or areas with no observed cases. We will need to keep this in mind during hypothesis testing.
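
A quick way to see how widespread these zeros are is to count them per numeric column. This is a sketch on made-up data; the helper name `zero_counts` is hypothetical, and the column names mirror the summary table.

```python
import pandas as pd


def zero_counts(frame):
    """Count exact zeros in each numeric column, keeping only columns that have any."""
    numeric = frame.select_dtypes(include="number")
    counts = (numeric == 0).sum()
    return counts[counts > 0].sort_values(ascending=False)


# Illustrative values only, not taken from the dataset.
demo = pd.DataFrame({
    "Bacteria Count (CFU/mL)": [0, 120, 0, 4998],
    "pH Level": [6.0, 7.1, 7.3, 8.5],
    "Region": ["A", "B", "C", "D"],
})
print(zero_counts(demo))
```

Running `zero_counts(df)` on the loaded dataset would show whether zeros are isolated values or frequent enough to suggest a reporting gap. That interpretation is a judgment call, not something the count alone can settle.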
