# **Water Pollution & Disease: Predictive Analysis of Contaminant Levels and Health Risk**

**Team Members:** Charlie Serafin, Kayla Vo

**Project Name:** Predicting and Mapping Waterborne Risk

## **Dataset Description**

We will use the *Water Pollution & Disease* dataset from Kaggle, which contains country-, region-, and year-level measurements of water quality and related public health data. This dataset enables analysis of the relationship between environmental parameters and waterborne disease risk.

* Title: Water Pollution and Disease
* Source: Kaggle 
* Link: [Water Pollution & Disease](https://www.kaggle.com/datasets/khushikyad001/water-pollution-and-disease)
* Format: CSV file (Comma-Separated Values)
* Contents: The dataset covers water quality factors and socioeconomic variables. Key attributes include:
    * Country: The country where data is collected
    * Region: The specific region within the country
    * Year: The year the data was recorded (2000-2025)
    * Water Source Type: Type of water source (e.g., River, Well, Tap, Lake, Spring, Pond)
    * Contaminant Level: The level of pollutants in parts per million ($ppm$)
    * pH Level: Acidity or alkalinity of the water (6.0-8.5)
    * Turbidity: Cloudiness of the water, measured in Nephelometric Turbidity Units ($NTU$)
    * Dissolved Oxygen: Amount of oxygen dissolved in water, crucial for aquatic life ($mg/L$)
    * Nitrate Level: Concentration of nitrates in water, which can affect human health ($mg/L$)
    * Lead Concentration: Amount of lead present in micrograms per liter ($\mu g/L$)

### Brief Description of Attributes and Target Variable
Each record represents water quality metrics collected at a specific location and time.   
For our project:
* **Input (predictor) attributes:** All water quality measurements (e.g., pH, turbidity, nitrate, lead), location, year, and source type.   
* **Class Information (target):** We will predict a binary or multi-class risk label, for example whether contaminant levels exceed safety guidelines or which risk category a sample belongs to.
    * **Contaminated (Yes/No):** Classify whether a water sample is "contaminated" based on guideline thresholds (e.g., high lead or nitrate).
    * **Risk Level (Low/Medium/High):** Categorize each sample into risk groups, based on one or more contaminant levels.

  
Thresholds for labeling water as "safe" or "at risk" (unsafe) follow public health and environmental standards set by agencies such as the World Health Organization (WHO) and the U.S. Environmental Protection Agency (EPA).
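
As a rough sketch of how such thresholds could turn measurements into class labels, the snippet below flags a sample as contaminated when lead or nitrate exceeds a limit, and counts exceedances to assign a risk level. The limits shown are the WHO drinking-water guideline values for lead (10 µg/L) and nitrate (50 mg/L), used here only as an illustration. The dictionary keys, function names, and sample values are assumptions, not actual dataset columns or the final labeling rule.

```python
# Illustrative thresholds only: WHO drinking-water guideline values.
# The project may settle on different limits or additional contaminants.
LEAD_LIMIT_UG_L = 10.0
NITRATE_LIMIT_MG_L = 50.0


def is_contaminated(sample):
    """Binary label: True if any guideline threshold is exceeded."""
    return (sample["lead_ug_l"] > LEAD_LIMIT_UG_L
            or sample["nitrate_mg_l"] > NITRATE_LIMIT_MG_L)


def risk_level(sample):
    """Multi-class label based on how many thresholds are exceeded."""
    exceeded = sum([
        sample["lead_ug_l"] > LEAD_LIMIT_UG_L,
        sample["nitrate_mg_l"] > NITRATE_LIMIT_MG_L,
    ])
    return ("Low", "Medium", "High")[exceeded]


# Made-up samples; the keys are hypothetical column names.
samples = [
    {"lead_ug_l": 2.0, "nitrate_mg_l": 12.0},
    {"lead_ug_l": 14.5, "nitrate_mg_l": 30.0},
    {"lead_ug_l": 18.0, "nitrate_mg_l": 61.0},
]
labels = [(is_contaminated(s), risk_level(s)) for s in samples]
```

Counting exceedances is only one possible way to define Low/Medium/High; a weighted score or contaminant-specific bands would fit the same structure.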

## **Implementation/Technical Merit**
**Approach:** The project will include these steps:
1. **Data Cleaning and Normalization:**
    * Fill in missing numeric values with the attribute's mean, or remove rows that are mostly incomplete if needed
    * Normalize all measurement attributes so their scales are comparable (important for KNN)
2. **Exploratory Data Analysis (EDA):**
    * Use histograms, bar charts, boxplots, and scatter plots to explore individual feature distributions and relationships between variables.
    * Generate heatmaps or correlation matrices to identify relationships or redundancies among features.
3. **Feature Engineering and Selection:**
    * Drop irrelevant or redundant features after visual and statistical inspection (if needed).
    * Ensure all relevant variables are included for initial modeling since the attribute count is small.
4. **Model Training:** Train and compare the following classifiers:
    * **K-Nearest Neighbors (KNN):** Classify water samples as high-risk or safe based on similarity to known samples.
    * **Decision Trees:** Find rules and splits that best distinguish risky from safe samples based on water quality features.
    * **Custom Random Forest:** Implement stratified sampling and an ensemble of decision trees.
5. **Evaluation:** Assess each model using cross-validation and metrics such as accuracy, precision, recall, and the confusion matrix to understand their strengths and weaknesses.
6. **Visualization of Results:** Create clear, labeled visuals (scatter plots, confusion matrices, feature importance plots) to illustrate how models perform and which features matter most.
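
The cleaning, normalization, and KNN steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's final implementation: the three columns (pH, turbidity, lead), the tiny made-up table, the function names, and `k=3` are all hypothetical. For brevity the scaling is computed on every row, including the held-out query; a real evaluation would fit the scaling on training data only.

```python
import math
from collections import Counter


def impute_mean(rows):
    """Replace None entries in each column with that column's mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def min_max_normalize(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(r, lows, highs)]
        for r in rows
    ]


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training rows closest to query (Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up rows: [pH, turbidity (NTU), lead (µg/L)]; None marks a missing value.
raw = [
    [7.0, 1.0, 2.0],
    [7.2, None, 3.0],
    [6.9, 1.5, 1.0],
    [6.2, 4.0, 15.0],
    [6.4, 4.5, 18.0],
    [6.0, 3.5, None],
]
y = ["safe", "safe", "safe", "high-risk", "high-risk", "high-risk"]

imputed = impute_mean(raw)
X = min_max_normalize(imputed)

# Hold out row 4 as the query and classify it from the remaining rows.
train_X = X[:4] + X[5:]
train_y = y[:4] + y[5:]
prediction = knn_predict(train_X, train_y, X[4], k=3)
```

Without the normalization step, lead (spanning roughly 1 to 18 here) would dominate the Euclidean distance over pH (spanning about 1 unit), which is why scaling precedes KNN.
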
### Anticipated challenges in pre-processing and/or classification
* **Missing data:** Some records may have missing values, which we will handle through imputation or removal, depending on how prevalent they are and how important the attribute is.
* **Class imbalance:** If "risk" labels are unevenly distributed, we will address this with resampling techniques, or report metrics such as precision and recall in addition to accuracy.
* **Feature Scaling:** KNN requires feature normalization so attributes measured on different scales (like $ppm$, $NTU$, $mg/L$, $\mu g/L$) contribute equally.
* **Small Attribute Set:** With only about 10 attributes, we’ll use all features unless one is clearly useless. 
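
To illustrate the class-imbalance point, here is a hedged sketch of stratified fold assignment together with precision and recall computed by hand. The function names, the 8-to-4 label split, and the always-"safe" baseline are assumptions chosen for the example, not results from the dataset.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)  # deal each class round-robin
    return folds


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class; 0.0 when the denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical imbalanced labels: 8 safe samples, 4 high-risk samples.
labels = ["safe"] * 8 + ["high-risk"] * 4
folds = stratified_folds(labels, k=4)

# A baseline that always predicts "safe" looks acceptable on accuracy alone...
baseline = ["safe"] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, baseline)) / len(labels)
# ...but never detects a high-risk sample.
precision, recall = precision_recall(labels, baseline, positive="high-risk")
```

Each of the four folds ends up with two "safe" and one "high-risk" sample, so every validation split sees the minority class. The baseline reaches two-thirds accuracy with zero recall on "high-risk", which is the failure mode accuracy alone would hide.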

### Feature Selection (if needed for large attribute sets)
* With our small number of features, we will use all of them except those that are always missing or have no variation.
* If more features are added later, we’ll use correlation analysis and decision tree importance scores to decide which to keep. 
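
A minimal sketch of both checks, assuming features are held as named lists of values: the first function drops attributes that are always missing or never vary, and the second computes a Pearson correlation to spot redundant pairs. All column names and values below are hypothetical, including the deliberately redundant `turbidity_scaled` column.

```python
import statistics


def usable_features(columns):
    """Keep features with more than one distinct non-missing value."""
    kept = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(set(present)) > 1:
            kept.append(name)
    return kept


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Made-up columns for illustration only.
columns = {
    "ph": [7.0, 7.2, 6.9, 6.2],
    "turbidity_ntu": [1.0, 2.0, 1.5, 4.0],
    "turbidity_scaled": [10.0, 20.0, 15.0, 40.0],  # redundant rescaled copy
    "sensor_id": [1, 1, 1, 1],                     # no variation
    "notes": [None, None, None, None],             # always missing
}

kept = usable_features(columns)
r = pearson(columns["turbidity_ntu"], columns["turbidity_scaled"])
```

Here `kept` excludes `sensor_id` and `notes`, and the correlation between the two turbidity columns is essentially 1, signalling that one of them could be dropped.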

## **Potential impact of the results**
* **Why useful:** 
    * Predicting water pollution risk helps officials and the public react early, saving money and protecting health.
    * Clear results support better policies, community efforts, and awareness about water safety.
* **Stakeholders:**
    * Local and national public health officials: Spot unsafe water faster and issue warnings or take action to protect people’s health.
    * International organizations (WHO, UNICEF): Use the data to guide global health campaigns and target support where it’s needed most.
    * Environmental NGOs: Get proof and details to support their work for clean water, helping them educate the public and push for better policies.
    * Water utility companies and regulatory bodies: Find and fix water quality problems sooner, improving service and staying within safety rules.
    * At-risk communities: Learn whether their water is safe to drink and demand better service or solutions if a problem is found.

## **Citations for Data and Materials**
* [Predicting Water Quality with Machine Learning - Locus Technologies](https://www.locustec.com/predicting-water-quality-with-machine-learning/)
* [Water-Quality-Prediction](https://github.com/mallikarjun25/Water-Quality-Prediction)
* [Water Quality Prediction using ML: A Simple Guide with Scikit-Learn and Decision Trees](https://medium.com/@kaushiksimran827/water-quality-prediction-using-ml-a-simple-guide-with-scikit-learn-and-decision-trees-6b0efa6d8e2f)