
# 🧪 Smart Water Quality Prediction using Machine Learning

## 📌 Problem Statement
Ensuring access to safe, clean drinking water is a major global challenge. Poor water quality can cause waterborne diseases, poisoning, and long-term health complications, so it is essential to monitor and predict water quality effectively.

The objective of this project is to **develop a Machine Learning model that can predict whether water is safe or unsafe for human consumption based on various chemical and physical parameters** such as pH, Hardness, Solids, Chloramines, Sulfates, Conductivity, Organic Carbon, Trihalomethanes, and Turbidity.

This project involves:
- Collecting and preprocessing water quality data  
- Performing exploratory data analysis (EDA)  
- Training multiple machine learning models  
- Comparing their performance  
- Deploying the best model as a simple application  

The end goal is to provide a **smart and automated system for water quality prediction**, which can assist government agencies, researchers, and communities in ensuring safe drinking water.


In [1]:
# 📦 Importing Required Libraries

import numpy as np
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Models
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Evaluation Metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


## 📊 Data Collection

In this step, we upload and explore the dataset collected from Kaggle. The dataset contains various **physical and chemical parameters of water** such as pH, Hardness, Solids, Chloramines, Sulfates, Conductivity, Organic Carbon, Trihalomethanes, and Turbidity, along with the target variable **Potability** (0 = Not Safe, 1 = Safe).

The following tasks are performed in this stage:
- Upload the dataset into Google Colab  
- Load it into a Pandas DataFrame for analysis  
- Inspect the structure of the dataset (rows, columns, datatypes)  
- Display the first few records to understand the data format  
- Generate summary statistics (mean, median, standard deviation, etc.)  
- Identify missing values and duplicates  
- Check the distribution of the target variable (Potability)  

This process ensures that the dataset is correctly loaded and provides an initial understanding of the data before moving on to **preprocessing and exploratory analysis**.


### Read the Dataset

Load the CSV file into a DataFrame:

In [2]:
df = pd.read_csv("water_potability.csv")


### Basic Dataset Inspection

1. Shape of the dataset (rows and columns)

In [3]:
df.shape


(3276, 10)

2. Display the first few rows

In [4]:
df.head()


| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


3. Column names and data types

In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 256.1 KB


4. Summary statistics (mean, std, min, max, etc.)

In [6]:
df.describe()


| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2785.0 | 3276.0 | 3276.0 | 3276.0 | 2495.0 | 3276.0 | 3276.0 | 3114.0 | 3276.0 | 3276.0 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.28497 | 66.396293 | 3.966786 | 0.39011 |
| std | 1.59432 | 32.879761 | 8768.570828 | 1.583085 | 41.41684 | 80.824064 | 3.308162 | 16.175008 | 0.780382 | 0.487849 |
| min | 0.0 | 47.432 | 320.942611 | 0.352 | 129.0 | 181.483754 | 2.2 | 0.738 | 1.45 | 0.0 |
| 25% | 6.093092 | 176.850538 | 15666.690297 | 6.127421 | 307.699498 | 365.734414 | 12.065801 | 55.844536 | 3.439711 | 0.0 |
| 50% | 7.036752 | 196.967627 | 20927.833607 | 7.130299 | 333.073546 | 421.884968 | 14.218338 | 66.622485 | 3.955028 | 0.0 |
| 75% | 8.062066 | 216.667456 | 27332.762127 | 8.114887 | 359.95017 | 481.792304 | 16.557652 | 77.337473 | 4.50032 | 1.0 |
| max | 14.0 | 323.124 | 61227.196008 | 13.127 | 481.030642 | 753.34262 | 28.3 | 124.0 | 6.739 | 1.0 |


### Check for Missing Values

Identify NaN values:

In [7]:
df.isnull().sum()


| Column | Missing values |
|---|---|
| ph | 491 |
| Hardness | 0 |
| Solids | 0 |
| Chloramines | 0 |
| Sulfate | 781 |
| Conductivity | 0 |
| Organic_carbon | 0 |
| Trihalomethanes | 162 |
| Turbidity | 0 |
| Potability | 0 |
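
To put these counts in proportion, the share of missing values per column can be computed from the numbers already shown. The following is a minimal sketch that hard-codes the three non-zero counts and the row count (3276, from `df.shape`) instead of reading the CSV; the variable names are illustrative.

```python
import pandas as pd

# Non-zero missing-value counts copied from the df.isnull().sum() output
missing_counts = pd.Series({"ph": 491, "Sulfate": 781, "Trihalomethanes": 162})
n_rows = 3276  # number of rows reported by df.shape

# Percentage of rows with a missing value in each column
missing_pct = (missing_counts / n_rows * 100).round(2)
print(missing_pct)
```

On the loaded DataFrame, `df.isnull().mean() * 100` should give the same percentages. Sulfate is missing in close to a quarter of the rows, so the choice of imputation strategy for that column is likely to matter more than for the others.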


### Check Class Distribution (Target Variable)

For a classification target such as Potability (0/1), count the samples in each class:

In [8]:
df['Potability'].value_counts()


| Potability | count |
|---|---|
| 0 | 1998 |
| 1 | 1278 |
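
These counts can be turned into class shares and a majority-class baseline, which gives a reference point for the accuracy of any trained model. This is a small sketch that uses the two counts above directly; `baseline_accuracy` is an illustrative name, not something defined elsewhere in the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output
class_counts = pd.Series({0: 1998, 1: 1278}, name="count")

# Share of each class in the dataset
class_share = class_counts / class_counts.sum()

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_share.max()

print(class_share.round(4))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

On the loaded DataFrame, `df['Potability'].value_counts(normalize=True)` should give the same shares. Roughly 61% of the samples are labeled not safe, so a model needs to beat that figure on the original class mix to be better than always predicting class 0.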


## 🧹 Data Preprocessing

After collecting the dataset, the next step is **Data Preprocessing**, which ensures the data is clean, consistent, and ready for machine learning models.  

The following operations are performed in this stage:

1. **Data Type Conversion**  
   - Ensure all columns have correct data types (numeric features as float/int).  

2. **Handling Missing Values**  
   - Missing values in pH, Sulfate, and Trihalomethanes are filled with the column **median**, which is less sensitive to skew and extreme values than the mean.  

3. **Removing Duplicates**  
   - Drop duplicate records to avoid bias in training.  

4. **Outlier Detection & Handling**  
   - Identify extreme values (e.g., negative Turbidity, pH > 14).  
   - Remove rows that have any value more than 1.5 × IQR (Interquartile Range) below the first quartile or above the third quartile.  

5. **Class Balancing (Target Variable)**  
   - If Potability (0 = Not Safe, 1 = Safe) is imbalanced, apply **SMOTE oversampling** to balance the classes.  

6. **Feature Scaling**  
   - Use **StandardScaler** to standardize each feature to zero mean and unit variance, so that features on large scales (such as Solids) do not dominate scale-sensitive models.  

7. **Feature Selection (Optional)**  
   - Check correlations and remove redundant features.  
   - Use feature importance from Random Forest/XGBoost to keep only useful features.  

8. **Feature Engineering (Optional)**  
   - Create new features such as combined water quality index or interaction terms.  

9. **Train-Test Split**  
   - Divide the dataset into **Training (70%) and Testing (30%)** to evaluate model performance fairly.  

This ensures that the dataset is **clean, balanced, and well-prepared** for Exploratory Data Analysis (EDA) and Machine Learning model building.
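
The IQR rule from step 4 can be illustrated on a small example. This is a sketch on made-up values, not the project data; the 1.5 multiplier is the conventional choice, and `iqr_bounds` is a hypothetical helper name.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up pH-like readings with one extreme value (illustrative only)
readings = pd.Series([6.5, 6.8, 7.0, 7.1, 7.2, 7.4, 7.6, 13.9])

lower, upper = iqr_bounds(readings)

# Keep only the readings that fall inside the fences
kept = readings[(readings >= lower) & (readings <= upper)]

print(f"fences: [{lower:.3f}, {upper:.3f}]")
print(kept.tolist())
```

Because the fences are derived from the quartiles, a single extreme reading barely moves them, which is why the rule is often preferred over mean-and-standard-deviation thresholds for skewed data.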


In [11]:
# 🧹 Data Preprocessing

from imblearn.over_sampling import SMOTE

# 1. Data Type Conversion (this also casts the integer target Potability to float)
df = df.astype(float)

# 2. Handling Missing Values (fill each column with its own median)
df = df.fillna(df.median())

# 3. Removing Duplicates
df = df.drop_duplicates()

# 4. Outlier Detection & Removal (IQR method):
#    drop every row with a value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
is_outlier = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
df = df[~is_outlier.any(axis=1)]

# 5. Splitting Features and Target
X = df.drop("Potability", axis=1)
y = df["Potability"]

# 6. Balancing Classes using SMOTE (oversamples the minority class)
#    Note: resampling and scaling happen before the train-test split here,
#    so the test rows influence both steps.
smote = SMOTE(random_state=42)
X, y = smote.fit_resample(X, y)

# 7. Feature Scaling (zero mean, unit variance per feature)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 8. Train-Test Split (70% train, 30% test, stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

# Final Check
print("Training Set Shape:", X_train.shape)
print("Testing Set Shape:", X_test.shape)
print("Class Distribution After Balancing:\n", y.value_counts())


Training Set Shape: (2339, 9)
Testing Set Shape: (1003, 9)
Class Distribution After Balancing:
 Potability
0.0    1671
1.0    1671
Name: count, dtype: int64
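
One caveat with the preprocessing cell is its ordering: the imputation medians, the SMOTE resampling, and the scaler are all computed before the split, so the test rows influence them. A common alternative is to split first and fit the preprocessing steps on the training portion only. The sketch below shows that ordering with scikit-learn's `Pipeline` on synthetic data; the data, the row count, and the choice of `LogisticRegression` are assumptions for illustration, and the SMOTE step is omitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the water data: 500 rows, 9 numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 9))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # inject roughly 10% missing values

# Split first so the test rows stay unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Imputer and scaler are fitted on the training rows only
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# The fitted medians and scaling parameters are reused on the test rows
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

If oversampling is needed, imbalanced-learn provides its own `Pipeline` class in `imblearn.pipeline` that accepts samplers such as `SMOTE` and applies them only while fitting, so the test set keeps its original class mix.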
