# Water Pollution & Disease — Project Proposal

**Project name:** Water Pollution & Disease: Linking Water Source Types to Contaminant and Disease Risk


**Team members:** Charles Serafin, Kayla Vo

## Executive Summary


This notebook contains our project proposal for CPSC 322. We will analyze a multi-country dataset that links water quality and pollution indicators to the incidence of waterborne diseases. From it, we will build predictive models that identify which water source types (e.g., river, well, tap, pond) and which environmental and socio-economic factors are most strongly associated with high contaminant or metal levels and elevated disease rates.


We will produce an exploratory data analysis, a feature engineering pipeline, and classification/regression models to (1) predict high contaminant/metal levels and (2) identify important predictors, with a particular focus on water source type.

## 1. Dataset description


**Title / short name:** Water Pollution and Disease (Kaggle)


**Overview:** The dataset explores the relationship between water pollution indicators and prevalence of waterborne diseases from 2000 to 2025 for a selection of countries and regions. It contains water quality measures (contaminant and metal concentrations, bacterial indicators), access and sanitation metrics, water source types, treatment methods, disease incidence rates (e.g., diarrhea, cholera, typhoid), and socio-economic indicators.


**Scope & coverage:**
- ~10 countries (e.g., USA, India, China, Brazil, Nigeria, Bangladesh, Mexico, Indonesia, Pakistan, Ethiopia)
- 5 regions per country (North, South, East, West, Central)
- Period: 2000–2025 (annual or multi-year observations)
- ~3,000 records representing various water sources and pollution conditions


**Source & link:**
- Kaggle dataset: *Water Pollution and Disease* by khushikyad001. Download page: https://www.kaggle.com/datasets/khushikyad001/water-pollution-and-disease?resource=download


**Data format:**
- The dataset will be provided as CSV file(s) (e.g., `water_pollution_and_disease.csv`).

## 2. Attributes and target variable(s)


**Representative attributes (expected):**
- `country` — Country name (categorical)
- `region` — Region within country (categorical)
- `year` — Year of observation (numeric / datetime)
- `water_source_type` — e.g., river, well, tap, pond, spring, bottled (categorical)
- `contaminant_*` — Multiple contaminant/metal concentration columns (e.g., lead_ppb, arsenic_ppb, mercury_ppb, nitrate_mg_L)
- `bacterial_count` — Bacterial indicator (e.g., E. coli CFU/100mL)
- `treatment_method` — e.g., chlorination, filtration, none (categorical)
- `sanitation_coverage` — % households with improved sanitation (numeric)
- `access_to_clean_water` — % households with access (numeric)
- `disease_diarrhea_rate` — Cases per 1000 (numeric)
- `disease_cholera_rate` — Cases per 1000 (numeric)
- `gdp_per_capita` — Socio-economic indicator (numeric)
- `urbanization_rate` — % urban population (numeric)
- `healthcare_access_index` — Proxy for health system access (numeric)


**Primary target(s) we will predict / analyze:**
1. **High contaminant presence (binary classification):** A binary label indicating whether any measured contaminant/metal exceeds a health-based threshold (e.g., WHO guideline). We will create this label by applying domain thresholds to contaminant columns (e.g., arsenic > 10 µg/L).


2. **Disease incidence (regression or classification):** Predict disease rates directly (e.g., diarrhea cases per 1000), or predict a derived binary label, `high_disease_burden`, that is set when the rate exceeds an empirically chosen threshold.


For the core deliverable, we will prioritize the *High contaminant presence* classification mapped from contaminant/metal levels and then analyze how water source type and other features contribute to predicting that label.
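The label construction described above could look roughly like the sketch below. The column names follow the "expected" attribute list and are assumptions until we inspect the actual CSV; the limits shown are illustrative values that should be verified against the current WHO / US EPA documents before use.

```python
import pandas as pd

# Illustrative guideline limits, expressed in the unit named by each column suffix.
# Both the column names and the limits are assumptions to be checked against the
# real dataset and the current WHO / US EPA guidance.
THRESHOLDS = {
    "lead_ppb": 10.0,      # µg/L
    "arsenic_ppb": 10.0,   # µg/L
    "mercury_ppb": 6.0,    # µg/L
    "nitrate_mg_L": 50.0,  # mg/L
}


def add_exceedance_label(df: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Add a per-row exceedance count and an overall binary label."""
    out = df.copy()
    present = [col for col in thresholds if col in out.columns]
    limits = pd.Series({col: thresholds[col] for col in present})
    # Comparisons against NaN are False, so a missing reading never counts
    # as an exceedance.
    exceeds = out[present].gt(limits, axis="columns")
    out["n_exceedances"] = exceeds.sum(axis=1)
    out["any_contaminant_exceeds_WHO"] = (out["n_exceedances"] > 0).astype(int)
    return out


demo = pd.DataFrame({
    "water_source_type": ["river", "tap", "well"],
    "lead_ppb": [12.0, 2.0, None],
    "arsenic_ppb": [4.0, 3.0, 15.0],
    "nitrate_mg_L": [20.0, 10.0, 55.0],
})
labeled = add_exceedance_label(demo, THRESHOLDS)
print(labeled[["water_source_type", "n_exceedances", "any_contaminant_exceeds_WHO"]])
```

One design choice to revisit: a row whose contaminant readings are all missing is labeled 0 here, which may understate risk. We may instead leave such rows unlabeled and exclude them from training.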

## 3. Implementation / Technical approach


We will follow these major steps:


1. **Data ingestion & cleaning** — load CSV(s) with `pandas`, inspect missingness, unify units (e.g., µg/L vs mg/L), and create derived variables (e.g., `any_contaminant_exceeds_WHO`).


2. **Exploratory data analysis (EDA)** — distribution of variables, per-country and per-source summaries, time trends, and visualization of contaminant levels by water source type.


3. **Preprocessing pipeline** — using `scikit-learn` `Pipeline` and `ColumnTransformer` for:
- Imputation (mean/median for numeric; most frequent for categorical; model-based imputation such as `KNNImputer` where simple statistics are inadequate)
- Scaling (RobustScaler / StandardScaler for numeric)
- Encoding (OneHotEncoder for low-cardinality categoricals; target encoding for high-cardinality categoricals)
- Feature engineering (e.g., combining contaminants into a `contaminant_index`)


4. **Feature selection & dimensionality reduction** — see section below.


5. **Modeling** — baseline and more advanced models for classification and regression:
- Baselines: Logistic Regression, Decision Tree
- Ensembles: Random Forest, Gradient Boosting (XGBoost / LightGBM)
- Calibrated probability models if needed


6. **Evaluation** — appropriate metrics and validation strategy:
- Classification: precision, recall, F1, ROC-AUC; if class imbalance exists, consider PR-AUC and class-weighting
- Regression (if used): RMSE, MAE, R²
- Cross-validation: time-aware split (if temporal autocorrelation is present) or grouped K-fold with region/country as the group, stratified on the class label where possible.


7. **Interpretation / explainability** — feature importances, SHAP values, partial dependence plots to quantify the relationship between `water_source_type` and risk.


8. **Reporting & reproducibility** — generate reproducible notebook, document preprocessing steps, and place code + dataset in GitHub repo.


**Libraries planned:** `pandas`, `numpy`, `matplotlib`/`seaborn` (for plotting), `scikit-learn`, `xgboost` or `lightgbm`, `shap` for explanation.
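As a minimal sketch of how the preprocessing and modeling steps above might be wired together with `Pipeline` and `ColumnTransformer`: the column names come from the expected attribute list, and the data frame is synthetic, so everything below is an assumption until we load the real file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical column names taken from the expected attribute list.
numeric_cols = ["bacterial_count", "sanitation_coverage", "gdp_per_capita"]
categorical_cols = ["water_source_type", "treatment_method"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Synthetic stand-in for the real data, with a few missing numeric readings.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "bacterial_count": rng.gamma(2.0, 50.0, size=n),
    "sanitation_coverage": rng.uniform(20, 100, size=n),
    "gdp_per_capita": rng.lognormal(9, 1, size=n),
    "water_source_type": rng.choice(["river", "well", "tap", "pond"], size=n),
    "treatment_method": rng.choice(["chlorination", "filtration", "none"], size=n),
})
X.loc[::17, "bacterial_count"] = np.nan
# Toy label for demonstration only; the real label comes from contaminant thresholds.
y = (X["bacterial_count"].fillna(0) > 100).astype(int)

model.fit(X, y)
proba = model.predict_proba(X)[:, 1]
print(proba[:5].round(3))
```

Keeping imputation, scaling, and encoding inside the pipeline means they are re-fit on each training fold during cross-validation, which avoids leaking validation statistics into training.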

## 4. Anticipated challenges


**Data quality / missing values**
- Missing contaminant readings for some regions/years or for certain water sources.
- Mixed units (e.g., mg/L vs µg/L) requiring normalization.
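If units turn out to be recorded per row, normalization might look like the sketch below. The `arsenic_value` / `arsenic_unit` layout is hypothetical; the real file may encode units in column names instead.

```python
import pandas as pd

# Conversion factors to µg/L. ppb and ppm are treated as equivalent to µg/L
# and mg/L respectively, which is the usual approximation for dilute
# aqueous solutions.
TO_UG_PER_L = {"ug/L": 1.0, "ppb": 1.0, "mg/L": 1000.0, "ppm": 1000.0}


def to_ug_per_l(values: pd.Series, units: pd.Series) -> pd.Series:
    """Convert concentrations to µg/L; raise on unit strings we do not know."""
    factors = units.map(TO_UG_PER_L)
    unknown = units[factors.isna() & units.notna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unrecognized units: {list(unknown)}")
    return values * factors


# Hypothetical layout: one value column plus one unit column per contaminant.
demo = pd.DataFrame({
    "arsenic_value": [0.012, 8.0, 15.0],
    "arsenic_unit": ["mg/L", "ug/L", "ppb"],
})
demo["arsenic_ug_L"] = to_ug_per_l(demo["arsenic_value"], demo["arsenic_unit"])
print(demo)
```

Raising on unknown unit strings, rather than silently passing values through, is deliberate: a missed mg/L row would be off by a factor of 1000 and could flip the exceedance label.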


**Label construction choices**
- Choosing thresholds for "high" contaminant levels can be non-trivial. We will rely on WHO/US EPA guidelines and document our choices.


**Class imbalance**
- "High contaminant" events may be relatively rare. We'll consider resampling (SMOTE, applied to training folds only), class-weighted losses, or decision-threshold tuning on validation data.
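Class weighting and threshold tuning can be sketched with scikit-learn alone. The data below is synthetic (about 10% positives) and only stands in for the real label distribution, which we have not yet measured.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 10% positive class.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights the loss inversely to class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]

# Average precision summarizes the precision-recall curve (PR-AUC).
pr_auc = average_precision_score(y_val, scores)

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y_val, scores)
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1 = 2 * precision[:-1] * recall[:-1] / denom
best = int(np.argmax(f1))

print(f"PR-AUC={pr_auc:.3f}  best threshold={thresholds[best]:.3f}  F1={f1[best]:.3f}")
```

The threshold is chosen on a validation split, not on the final test set. SMOTE would need the separate `imbalanced-learn` package and is omitted here.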


**Temporal and spatial correlation**
- Observations from the same country/region/year are not i.i.d. We will use grouped cross-validation or time-based splits to avoid over-optimistic estimates.
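A grouped split can be sketched with `GroupKFold`, using a country/region key as the group so that no group appears in both the training and test folds. The features and labels below are synthetic placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic placeholder data; x1 and x2 are hypothetical numeric features.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "country": rng.choice(["India", "Brazil", "Nigeria", "Mexico", "Pakistan"], size=n),
    "region": rng.choice(["North", "South", "East", "West", "Central"], size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["label"] = (df["x1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X = df[["x1", "x2"]]
y = df["label"]
groups = df["country"] + "/" + df["region"]  # one group per country/region pair

cv = GroupKFold(n_splits=5)

# Sanity check: no country/region pair is shared between train and test.
for train_idx, test_idx in cv.split(X, y, groups):
    assert set(groups.iloc[train_idx]).isdisjoint(groups.iloc[test_idx])

scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, groups=groups, cv=cv, scoring="roc_auc",
)
print(scores.round(3), scores.mean().round(3))
```

For the time-based alternative, one option is a simple cutoff on `year` (train on earlier years, test on later ones). Which variant we use depends on whether the EDA shows stronger temporal or spatial dependence.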


**Heterogeneous data types and high cardinality**
- High-cardinality categorical variables (e.g., many `water_source_subtypes`) may require target encoding or embedding approaches.


**Confounding socio-economic variables**
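- Socio-economic factors (GDP, healthcare access) may confound relationships between water source type and disease. We'll include them as covariates and use interpretable models and SHAP to describe how water source type relates to risk once these covariates are accounted for. These remain associations, not causal estimates.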

## 5. Feature selection and dimensionality reduction


If the dataset contains many attributes (e.g., dozens of contaminants, multiple derived metrics), we will explore these techniques to reduce dimensionality and highlight the most relevant predictors:


- **Filter methods:** A variance threshold to drop near-zero-variance features, correlation analysis to flag redundant ones, and univariate selection (ANOVA F-test, mutual information) to drop uninformative ones.
- **Wrapper methods:** Recursive Feature Elimination (RFE) with cross-validation using a robust estimator (e.g., Random Forest or Logistic Regression with L1).
- **Embedded methods:** L1-regularized (LASSO-style) logistic regression, which shrinks the coefficients of less important features to exactly zero.
- **Tree-based importances:** Train a Random Forest / LightGBM and rank features by importance.
- **Dimensionality reduction:** PCA (on numeric features only) or autoencoders if we want non-linear compression. Because interpretability is critical, PCA will be considered only for visualization or as part of an alternative pipeline, and we will prefer methods that preserve explainability.


We will compare pipelines with different selection methods and choose based on validation performance and interpretability.
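As a rough sketch of such a comparison, the snippet below runs one filter, one embedded, and one tree-based method on synthetic data and prints which features each keeps. The feature names (`f0`, `f1`, ...) and the choices `k=5` and `C=0.1` are placeholders, not tuned values.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: with shuffle=False the informative features come first.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=4, n_redundant=2,
    shuffle=False, random_state=0,
)
X = StandardScaler().fit_transform(X)
names = np.array([f"f{i}" for i in range(X.shape[1])])

# Filter: keep the 5 features with the highest mutual information with y.
mi = partial(mutual_info_classif, random_state=0)
filt = SelectKBest(score_func=mi, k=5).fit(X, y)
kept_filter = names[filt.get_support()]

# Embedded: L1-penalized logistic regression drives some coefficients to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
embedded = SelectFromModel(l1_model).fit(X, y)
kept_l1 = names[embedded.get_support()]

# Tree-based: rank by impurity-based importance and keep the top 5.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
kept_rf = names[np.argsort(rf.feature_importances_)[::-1][:5]]

print("filter   :", sorted(kept_filter))
print("embedded :", sorted(kept_l1))
print("forest   :", sorted(kept_rf))
```

In the real project each selector would sit inside the cross-validated pipeline so that selection is re-fit per fold. Features that all three methods agree on are the most natural candidates to keep.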

## 6. Potential impact and stakeholders


**Why these results are useful**
- Identifying water source types and regional factors strongly associated with high contaminant levels helps target interventions: prioritized testing, treatment installations, and public-health outreach.
- Predictive models can serve as early-warning tools to focus surveillance and allocate limited public-health resources.


**Who benefits / stakeholders**
- **Local governments and public health agencies** — to prioritize water testing and remediation.
- **NGOs and aid organizations** — to target interventions in regions with high predicted contaminant risk.
- **Researchers and environmental scientists** — to study temporal and geographic trends in contamination and disease.
- **Local communities** — indirect beneficiaries through improved targeting of interventions and policies.
- **Policy makers** — to justify investments in water infrastructure based on data-driven risk assessment.

## 7. Timeline & deliverables (proposed)


- **Week 1 (Proposal / data acquisition):** Finalize proposal (this notebook), download dataset into GitHub repo.
- **Week 2 (EDA & cleaning):** Data cleaning, missing-value strategy, unit unification, initial EDA.
- **Week 3 (Feature engineering & baseline models):** Build preprocessing pipeline and baseline classifiers.
- **Week 4 (Advanced modeling & feature selection):** Random Forest / Gradient Boosting models, feature selection, and interpretability.
- **Week 5 (Evaluation & reporting):** Final experiments, robustness checks, and write final report / presentation.


**Deliverables:** GitHub repo containing: this proposal notebook, cleaned dataset, notebooks for EDA and modeling, trained model artifacts (optional), and final report.

## 8. Ethics, limitations, and reproducibility


- **Ethics:** Results must be communicated carefully; correlation does not imply causation. Avoid overclaiming causal relationships between water source and disease without domain-specific causal inference work.
- **Privacy:** We do not expect personally identifiable information in the dataset, but we will check and aggregate / anonymize if needed.
- **Reproducibility:** We will fix random seeds, use a documented conda/requirements file, and place all code in the GitHub repo.

## 9. References & external resources


- Kaggle dataset: *Water Pollution and Disease* — https://www.kaggle.com/datasets/khushikyad001/water-pollution-and-disease?resource=download