I built this project to see whether machine learning can actually tell us if water is safe to drink based on chemical sensor readings. It's a tricky problem because water quality isn't black and white: it's a complex mix of pH, minerals, and chemicals.
Most people just run a model and hope for the best. I spent most of my time on the data prep, which is where the real work happens.
- Imputation: The dataset had a lot of missing values in pH and Sulfate. Instead of deleting those rows, I used Mean Imputation to keep the dataset size large enough for the model to learn.
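- Outlier Cleanup: I applied IQR (Interquartile Range) outlier filtering to key sensor inputs like pH and Sulfate. Unlike a fixed percentile cut, which always discards the same share of rows, the IQR method flags a value only when it falls far outside the middle 50% of the data (conventionally more than 1.5 × IQR below the first quartile or above the third).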
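Here is a minimal sketch of the two prep steps above. It assumes pandas, a conventional 1.5 × IQR cutoff, and column names `ph` and `Sulfate`. The tiny data frame and the helper names `impute_mean` and `filter_iqr` are made up for illustration, so the real training script may differ.

```python
import numpy as np
import pandas as pd

# Made-up readings standing in for the real sensor data (illustrative only).
df = pd.DataFrame({
    "ph":      [7.0, np.nan, 6.5, 7.5, 14.0, 7.2, 6.8, 7.1],
    "Sulfate": [300.0, 320.0, np.nan, 340.0, 360.0, 380.0, 700.0, 330.0],
})

def impute_mean(frame, columns):
    """Fill missing values in each listed column with that column's mean."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mean())
    return out

def filter_iqr(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

cols = ["ph", "Sulfate"]
imputed = impute_mean(df, cols)      # no rows are dropped here
cleaned = filter_iqr(imputed, cols)  # rows with extreme pH or Sulfate are dropped
```

Imputing first keeps every row available for the outlier check. One thing to watch: a very large outlier also drags the column mean, so the imputed values inherit some of that skew.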
I didn't use all the columns. I ran a feature importance check and found that five features really matter: Sulfate, pH, Hardness, Chloramines, and Solids. By focusing on these, the model became faster and less prone to "distractions" from irrelevant data.
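One common way to run this kind of check is to read the impurity-based `feature_importances_` from a fitted Random Forest. The sketch below assumes that approach, which may not be the exact check used in this project. The data is synthetic, and the `extra_*` column names are placeholders for whatever other columns the dataset has.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Five named features plus placeholder columns; all values are synthetic.
feature_names = ["ph", "Hardness", "Solids", "Chloramines", "Sulfate",
                 "extra_1", "extra_2", "extra_3"]
X = pd.DataFrame(rng.normal(size=(300, len(feature_names))), columns=feature_names)

# Synthetic label that depends only on ph and Sulfate, so those two should rank highly.
y = ((X["ph"] + X["Sulfate"]) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features by importance and keep the top five.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
top_features = importances.head(5).index.tolist()

# Retrain or continue with only the selected columns.
X_selected = X[top_features]
```

Impurity-based importances can favor high-cardinality features, so it is worth cross-checking the ranking with something like permutation importance before dropping columns for good.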
My data had way more "Safe" samples than "Unsafe" ones. If I didn't fix this, the model would just guess "Safe" every time to get a high score. I used class_weight="balanced" to force the model to take the "Unsafe" cases seriously.
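To make the effect of `class_weight="balanced"` concrete, the sketch below computes the weights scikit-learn would assign for a made-up 80/20 split. The label encoding (1 = Safe, 0 = Unsafe) and the class counts are assumptions for illustration, not the project's real numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 80 "Safe" (1) samples and 20 "Unsafe" (0) samples.
y = np.array([1] * 80 + [0] * 20)
classes = np.unique(y)

# "balanced" uses n_samples / (n_classes * count_of_class) for each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))
# Unsafe (0): 100 / (2 * 20) = 2.5, Safe (1): 100 / (2 * 80) = 0.625

# Passing the same option to the classifier applies these weights during training.
model = RandomForestClassifier(class_weight="balanced", random_state=42)
```

With these counts, each "Unsafe" sample counts four times as much as a "Safe" one during training, so always guessing "Safe" stops being a cheap way to score well.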
You might see projects claiming 90%+ accuracy, but here’s why 70% is more honest for this problem:
- Complexity: Water chemistry is non-linear. A sample can have a "perfect" pH but high lead levels, and lead might not be in the dataset at all. 70% suggests the model is learning the real trends without being overconfident.
- Overfitting vs. Generalization: I could probably have pushed the score toward 90% by overfitting, but a model tuned that way would likely fail on real-world water samples. I chose stability over a fake high score.
- Data Imbalance: Because I used "balanced" weights, overall accuracy is lower, but the model is much better at catching "Unsafe" water (higher recall on that class).
The Confusion Matrix shows that the model is now actually identifying "Not Potable" samples instead of just ignoring them!
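The trade-off between accuracy and catching "Not Potable" samples can be shown with a toy example. The labels and predictions below are invented for illustration (1 = Potable, 0 = Not Potable is an assumed encoding), so they are not this project's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Ten made-up samples: 8 potable (1) and 2 not potable (0).
y_true = [1] * 8 + [0] * 2

# A lazy model that always answers "Potable".
always_safe = [1] * 10

# A more cautious model: catches both unsafe samples but flags 3 safe ones by mistake.
cautious = [1] * 5 + [0] * 3 + [0] * 2

results = {}
for name, y_pred in [("always_safe", always_safe), ("cautious", cautious)]:
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        # Recall on class 0: share of truly unsafe samples the model caught.
        "unsafe_recall": recall_score(y_true, y_pred, pos_label=0),
        # Rows are true classes, columns are predicted classes, ordered [0, 1].
        "matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]).tolist(),
    }
```

In this toy case the lazy model scores higher on accuracy but never catches an unsafe sample, while the cautious one gives up some accuracy to catch both. The first row of the confusion matrix (true "Not Potable") is the one to watch.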
- app.py: The UI for the Streamlit app.
- water_quality_final_model.py: My training script.
- random_forest_model.pkl: The saved, trained Random Forest model.
- requirements.txt: The libraries you need to run this.
This project taught me that cleaning the data is 90% of the job. Building the model is easy; making sure the model isn't lying to you about its accuracy is the hard part.