# **Determining Water Potability**
CSI 4106 Project 1 \
Group #19 \
Aaron Ng (300176901) & Brian Zhang (300070168) \

---

# Understanding the classification task for our dataset

The following analysis is a binary classification task. This study aims to determine whether or not water is potable based on attributes such as acidity (pH), hardness, and turbidity.

# About the dataset

**Context**

Access to safe drinking-water is essential to health, a basic human right and a component of effective policy for health protection. This is important as a health and development issue at a national, regional and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.


**Content**

The water_potability.csv file contains water quality metrics for 3276 different water bodies.

1. **pH value:**

  pH is an important parameter in evaluating the acid–base balance of water. It also indicates whether the water is acidic or alkaline. WHO recommends a permissible pH range of 6.5 to 8.5. The current investigation's values ranged from 6.52 to 6.83, which is within the WHO standards.

2. **Hardness:**

  Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, which is caused by calcium and magnesium.

3. **Solids (Total dissolved solids - TDS):**

  Water can dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates, etc. These minerals produce an unwanted taste and a diluted color in the water. TDS is therefore an important parameter for water use. A high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L and the maximum limit is 1000 mg/L for drinking purposes.

4. **Chloramines:**

  Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L or 4 parts per million (ppm)) are considered safe in drinking water.

5. **Sulfate:**

  Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

6. **Conductivity:**

  Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines its electrical conductivity. Electrical conductivity (EC) measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.

7. **Organic_carbon:**

  Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be below 2 mg/L in treated drinking water and below 4 mg/L in source water that is used for treatment.

8. **Trihalomethanes:**

  THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppb (μg/L) are considered safe in drinking water.

9. **Turbidity:**

  The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

10. **Potability:**

  Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.

# Analyzing our Dataset

The water potability dataset has 3276 entries, of which 2011 are complete, with no missing data. \
There are 9 feature classes:
*   Acidity (pH)
*   Hardness
*   Solids (total amount of minerals & salts dissolved in the water)
*   Chloramines (disinfectants formed from chlorine and ammonia)
*   Sulfates
*   Conductivity
*   Organic Carbon (carbon coming from decaying organic matter and synthetic sources)
*   Trihalomethanes (chemicals that may be found in water treated with chlorine)
*   Turbidity (how clear the water is)
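
The entry counts above can be checked directly with pandas. The following is a minimal sketch on a small made-up frame (the values are illustrative and are not taken from water_potability.csv); the helper name `summarize_completeness` is hypothetical.

```python
import numpy as np
import pandas as pds

def summarize_completeness(df, target="Potability"):
    """Return total rows, rows with no missing value, and the share of each class."""
    total = len(df)
    complete = len(df.dropna())  # rows where every column is present
    class_share = df[target].value_counts(normalize=True).sort_index()
    return total, complete, class_share

# Toy data: 4 rows, 2 of them with a missing value
toy = pds.DataFrame({
    "ph": [7.0, np.nan, 8.1, 6.5],
    "Sulfate": [330.0, 310.0, np.nan, 350.0],
    "Potability": [0, 0, 1, 0],
})
total, complete, class_share = summarize_completeness(toy)
print(total, complete)
print(class_share)
```

Applied to `water_data`, the same call would report the number of complete entries alongside how unevenly the two Potability classes are represented.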

In [None]:
# Import any required packages
import pandas as pds
import numpy as np
from sklearn.preprocessing import StandardScaler # For normalizing data values
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict # For splitting data & cross validation
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score # For quality control

from sklearn.naive_bayes import GaussianNB # Naive Bayes Classifier
from sklearn.linear_model import LogisticRegression # Logistic Regression Classifier
from sklearn.neural_network import MLPClassifier # Multi-Layer Perceptron

In [None]:
# Set up the Kaggle API
# Requires your kaggle.json in the Colab files folder
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!cp /content/kaggle.json ~/.kaggle/kaggle.json # Reminder that to run this, you need to upload your own kaggle.json API token after runtime is reset
!chmod 600 ~/.kaggle/kaggle.json
!pip install kaggle==1.5.6
!kaggle -v

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Kaggle API 1.5.6


In [None]:
# Download and unzip the dataset from the Kaggle servers
!kaggle datasets download adityakadiwal/water-potability
!unzip water-potability.zip

Downloading water-potability.zip to /content
  0% 0.00/251k [00:00<?, ?B/s]
100% 251k/251k [00:00<00:00, 94.6MB/s]
Archive:  water-potability.zip
  inflating: water_potability.csv    


In [None]:
# Import dataset as water_data
file = 'water_potability.csv'
water_data = pds.read_csv(file)
# Let's take a quick look at the dataset
water_data.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


In [None]:
# Summary statistics for each column (count, mean, std, min, 25%, 50%, 75%, and max)
water_data.describe()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2785.0 | 3276.0 | 3276.0 | 3276.0 | 2495.0 | 3276.0 | 3276.0 | 3114.0 | 3276.0 | 3276.0 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.28497 | 66.396293 | 3.966786 | 0.39011 |
| std | 1.59432 | 32.879761 | 8768.570828 | 1.583085 | 41.41684 | 80.824064 | 3.308162 | 16.175008 | 0.780382 | 0.487849 |
| min | 0.0 | 47.432 | 320.942611 | 0.352 | 129.0 | 181.483754 | 2.2 | 0.738 | 1.45 | 0.0 |
| 25% | 6.093092 | 176.850538 | 15666.690297 | 6.127421 | 307.699498 | 365.734414 | 12.065801 | 55.844536 | 3.439711 | 0.0 |
| 50% | 7.036752 | 196.967627 | 20927.833607 | 7.130299 | 333.073546 | 421.884968 | 14.218338 | 66.622485 | 3.955028 | 0.0 |
| 75% | 8.062066 | 216.667456 | 27332.762127 | 8.114887 | 359.95017 | 481.792304 | 16.557652 | 77.337473 | 4.50032 | 1.0 |
| max | 14.0 | 323.124 | 61227.196008 | 13.127 | 481.030642 | 753.34262 | 28.3 | 124.0 | 6.739 | 1.0 |


# Feature Engineering

All 9 of the feature classes appear to be important to the classification task. Under the Content section of the webpage where this dataset is hosted, there is a justification as to why each of these features affects the potability of water.

For example, 
*   WHO has recommended maximum permissible limit of pH from 6.5 to 8.5
*   Chlorine levels up to 4 milligrams per liter (mg/L or 4 parts per million (ppm)) are considered safe in drinking water
*   Electrical conductivity (EC) [...] should not exceed 400 μS/cm

One issue with the data is the sizable number of missing entries for the features ph, Sulfate, and Trihalomethanes. An overview of the missing data is given by the code below:


In [None]:
# Find how many null values in each feature to see which ones we have to fill
water_data.isnull().sum()

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64

**Findings**
*   We see that three columns of the dataset have some Null values
*   They exist in the columns ph, Sulfate, and Trihalomethanes
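
To put these counts in proportion, they can be divided by the 3276 rows in the dataset. A small worked sketch using only the counts printed above (the variable names are illustrative):

```python
# Null counts reported by water_data.isnull().sum()
missing_counts = {"ph": 491, "Sulfate": 781, "Trihalomethanes": 162}
n_rows = 3276  # total entries in the dataset

# Percentage of rows missing each feature, rounded to one decimal place
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing_counts.items()}
print(missing_pct)
```

Roughly 15% of ph values, 24% of Sulfate values, and 5% of Trihalomethanes values are missing, so dropping incomplete rows would discard a large part of the data.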





# Encoding Features
Our model needs a binary target for classification. Luckily, the dataset's 'Potability' column provides exactly that, already encoded as 0 and 1. \
The features do need to be transformed, since they are measured in different units (mg/L, ppm, μS/cm, etc.) and span very different ranges (for example, Solids has a mean of about 22014 while Turbidity has a mean of about 3.97).
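
One possible way to apply such transformations is to wrap them, together with the classifier, in a scikit-learn `Pipeline`, so that the imputation means and scaling statistics are learned only from the training folds during cross-validation. This is a hedged sketch on synthetic data and an alternative arrangement, not the procedure used in this notebook; `X_demo`, `y_demo`, and the 10% missing rate are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 9 water features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)

# Knock out about 10% of the values to mimic missing measurements
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Mean imputation -> standard scaling -> classifier, refit on each training fold
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
scores_demo = cross_val_score(pipe, X_demo, y_demo, cv=4)
print(scores_demo)
```

The design choice here is that each test fold is transformed with statistics computed without it, which keeps the cross-validation scores from being influenced by the held-out rows.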

# Preprocessing

In [None]:
# Replace the null values in each affected column with the mean of that column's existing values
water_data['ph'] = water_data['ph'].fillna(water_data['ph'].mean())
water_data['Sulfate'] = water_data['Sulfate'].fillna(water_data['Sulfate'].mean())
water_data['Trihalomethanes'] = water_data['Trihalomethanes'].fillna(water_data['Trihalomethanes'].mean())

In [None]:
# Check whether any nulls remain in the dataset after the fill
water_data.isnull().sum()

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64

In [None]:
# Separate dataset as feature variables and result
x = water_data.drop('Potability', axis=1)
y = water_data['Potability']

In [None]:
# Applying standard scaling 
sc = StandardScaler()
x = sc.fit_transform(x)

# Training Models with Cross Validation

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Cross Validation with Gaussian Naive Bayes Classifier
gnb = GaussianNB()
scores_NB_1 = cross_val_score(gnb, x, y, cv=4)
scores_NB_1

array([0.61172161, 0.65323565, 0.53846154, 0.62759463])

In [None]:
# Cross Validation with Logistic Regression Classifier
lgr = LogisticRegression(random_state=0)
scores_LR_1 = cross_val_score(lgr, x, y, cv=4)
scores_LR_1

array([0.61050061, 0.61050061, 0.61172161, 0.60927961])

In [None]:
# Cross Validation with Multi-Layer Perceptron; this takes about a minute
mlp = MLPClassifier(random_state=0, max_iter=1000)
scores_MLP_1 = cross_val_score(mlp, x, y, cv=4)
scores_MLP_1

array([0.52380952, 0.6031746 , 0.49450549, 0.63980464])

# Repeating the training with some modifications

In [None]:
# Naive Bayes
gnb2 = GaussianNB(var_smoothing=1e-8)
scores_NB_2 = cross_val_score(gnb2, x, y, cv=5)
print("Naive Bayes #2: ", end="")
print(scores_NB_2)

gnb3 = GaussianNB(var_smoothing=1e-10)
scores_NB_3 = cross_val_score(gnb3, x, y, cv=6)
print("Naive Bayes #3: ", end="")
print(scores_NB_3)

# Logistic Regression
lgr2 = LogisticRegression(random_state=90, tol=0.00001, max_iter=200)
scores_LR_2 = cross_val_score(lgr2, x, y, cv=5)
print("Log Reg #2: ", end="")
print(scores_LR_2)

lgr3 = LogisticRegression(random_state=13, tol=0.0005, max_iter=50)
scores_LR_3 = cross_val_score(lgr3, x, y, cv=6)
print("Log Reg #3: ", end="")
print(scores_LR_3)

# Multi-Layer Perceptron; this takes about 2-5 minutes to run
mlp2 = MLPClassifier(random_state=61, max_iter=1000)
scores_MLP_2 = cross_val_score(mlp2, x, y, cv=3)
print("MLP #2: ", end="")
print(scores_MLP_2)

mlp3 = MLPClassifier(random_state=92, max_iter=1000)
scores_MLP_3 = cross_val_score(mlp3, x, y, cv=5)
print("MLP #3: ", end="")
print(scores_MLP_3)


Naive Bayes #2: [0.60823171 0.65496183 0.58320611 0.59083969 0.6259542 ]
Naive Bayes #3: [0.60622711 0.65567766 0.63919414 0.52014652 0.63369963 0.61172161]
Log Reg #2: [0.6097561  0.61068702 0.61068702 0.60916031 0.61068702]
Log Reg #3: [0.60989011 0.60989011 0.60989011 0.61172161 0.60805861 0.60989011]
MLP #2: [0.58241758 0.60805861 0.61996337]
MLP #3: [0.56707317 0.62137405 0.63358779 0.56946565 0.66564885]


We found that if the parameter 'cv' in 'cross_val_score' is set to the same value each time, a given model returns identical scores on every run. \
In addition, the parameter changes we tried in LogisticRegression() and GaussianNB() had no effect on the scores. \
Example shown below:

In [None]:
# Note that we have 2 different sets of parameters for the NaiveBayes function
gnb2 = GaussianNB(var_smoothing=1e-8)
scores_NB_2 = cross_val_score(gnb2, x, y, cv=4) # But cv=4
print("Naive Bayes #2: ", end="")
print(scores_NB_2)

gnb3 = GaussianNB(var_smoothing=1e-10)
scores_NB_3 = cross_val_score(gnb3, x, y, cv=4) # and cv=4 here as well
print("Naive Bayes #3: ", end="")
print(scores_NB_3)

# Note that we have 2 different sets of parameters for the LogisticRegression function
lgr2 = LogisticRegression(random_state=90, tol=0.00001, max_iter=200)
scores_LR_2 = cross_val_score(lgr2, x, y, cv=4) # But cv=4
print("Log Reg #2: ", end="")
print(scores_LR_2)

lgr3 = LogisticRegression(random_state=13, tol=0.0005, max_iter=50)
scores_LR_3 = cross_val_score(lgr3, x, y, cv=4) # and cv=4 here as well
print("Log Reg #3: ", end="")
print(scores_LR_3)

Naive Bayes #2: [0.61172161 0.65323565 0.53846154 0.62759463]
Naive Bayes #3: [0.61172161 0.65323565 0.53846154 0.62759463]
Log Reg #2: [0.61050061 0.61050061 0.61172161 0.60927961]
Log Reg #3: [0.61050061 0.61050061 0.61172161 0.60927961]


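The scores are exactly the same. This is expected behaviour rather than a bug: with an integer 'cv', 'cross_val_score' splits a classification dataset into stratified folds without shuffling, so the folds are identical on every run, and both models are deterministic for a given training fold. 'random_state' is not used by LogisticRegression's default 'lbfgs' solver, the changes to 'tol' and 'max_iter' evidently still let the solver reach the same solution, and on standardized features a 'var_smoothing' of 1e-8 versus 1e-10 is too small to change the predictions.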
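
The repeated scores can be reproduced on synthetic data, and passing a shuffled `StratifiedKFold` with different seeds shows one way to obtain different splits. This is a sketch under assumptions: `X_demo` and `y_demo` come from `make_classification`, not from the water dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: not the water dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# Integer cv: stratified folds without shuffling, so two runs use the same folds
run_a = cross_val_score(GaussianNB(), X_demo, y_demo, cv=4)
run_b = cross_val_score(GaussianNB(), X_demo, y_demo, cv=4)
same_scores = np.array_equal(run_a, run_b)

# Shuffled folds: the seed now controls which rows land in each test fold
folds_1 = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
folds_2 = StratifiedKFold(n_splits=4, shuffle=True, random_state=2)
test_idx_1 = [test for _, test in folds_1.split(X_demo, y_demo)]
test_idx_2 = [test for _, test in folds_2.split(X_demo, y_demo)]
same_folds = all(np.array_equal(a, b) for a, b in zip(test_idx_1, test_idx_2))

print(same_scores, same_folds)
```

A splitter object such as `folds_1` can be passed as the `cv` argument of `cross_val_score` in place of an integer when varied splits are wanted.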

# Evaluation of Obtained Results

In [None]:
# Naive Bayes Result Analysis
print('Naive Bayes Model Results')
print('\n')
y_pred_NB1 = cross_val_predict(gnb, x, y, cv=4)
print("Version 1:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_NB1), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_NB1)))
print("Classification Report")
print(classification_report(y, y_pred_NB1))

y_pred_NB2 = cross_val_predict(gnb2, x, y, cv=5)
print("Version 2:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_NB2), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_NB2)))
print("Classification Report")
print(classification_report(y, y_pred_NB2))

y_pred_NB3 = cross_val_predict(gnb3, x, y, cv=6)
print("Version 3:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_NB3), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_NB3)))
print("Classification Report")
print(classification_report(y, y_pred_NB3))

Naive Bayes Model Results


Version 1:
-----------------
Confusion Matrix:
[[1708  290]
 [ 995  283]]

The Accuracy of this version is :0.6077533577533577
Classification Report
              precision    recall  f1-score   support

           0       0.63      0.85      0.73      1998
           1       0.49      0.22      0.31      1278

    accuracy                           0.61      3276
   macro avg       0.56      0.54      0.52      3276
weighted avg       0.58      0.61      0.56      3276

Version 2:
-----------------
Confusion Matrix:
[[1732  266]
 [1003  275]]

The Accuracy of this version is :0.6126373626373627
Classification Report
              precision    recall  f1-score   support

           0       0.63      0.87      0.73      1998
           1       0.51      0.22      0.30      1278

    accuracy                           0.61      3276
   macro avg       0.57      0.54      0.52      3276
weighted avg       0.58      0.61      0.56      3276

Version 3:
---------

In [None]:
# Logistic Regression Result Analysis
print('Logistic Regression Model Results')
print('\n')
y_pred_LR1 = cross_val_predict(lgr, x, y, cv=4)
print("Version 1:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_LR1), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_LR1)))
print("Classification Report")
print(classification_report(y, y_pred_LR1))

y_pred_LR2 = cross_val_predict(lgr2, x, y, cv=5)
print("Version 2:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_LR2), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_LR2)))
print("Classification Report")
print(classification_report(y, y_pred_LR2))

y_pred_LR3 = cross_val_predict(lgr3, x, y, cv=6)
print("Version 3:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_LR3), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_LR3)))
print("Classification Report")
print(classification_report(y, y_pred_LR3))

Logistic Regression Model Results


Version 1:
-----------------
Confusion Matrix:
[[1996    2]
 [1274    4]]

The Accuracy of this version is :0.6105006105006106
Classification Report
              precision    recall  f1-score   support

           0       0.61      1.00      0.76      1998
           1       0.67      0.00      0.01      1278

    accuracy                           0.61      3276
   macro avg       0.64      0.50      0.38      3276
weighted avg       0.63      0.61      0.46      3276

Version 2:
-----------------
Confusion Matrix:
[[1996    2]
 [1275    3]]

The Accuracy of this version is :0.6101953601953602
Classification Report
              precision    recall  f1-score   support

           0       0.61      1.00      0.76      1998
           1       0.60      0.00      0.00      1278

    accuracy                           0.61      3276
   macro avg       0.61      0.50      0.38      3276
weighted avg       0.61      0.61      0.46      3276

Version 3:
-

In [None]:
# Multi-Layer Perceptron Result Analysis; takes 3-5 minutes
print('Multi-Layer Perceptron Model Results')
print('\n')
y_pred_MLP1 = cross_val_predict(mlp, x, y, cv=4)
print("Version 1:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_MLP1), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_MLP1)))
print("Classification Report")
print(classification_report(y, y_pred_MLP1))

y_pred_MLP2 = cross_val_predict(mlp2, x, y, cv=4)
print("Version 2:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_MLP2), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_MLP2)))
print("Classification Report")
print(classification_report(y, y_pred_MLP2))

y_pred_MLP3 = cross_val_predict(mlp3, x, y, cv=4)
print("Version 3:")
print("-----------------")
print("Confusion Matrix:")
print(confusion_matrix(y, y_pred_MLP3), end="\n\n")
print('The Accuracy of this version is :{}'.format(accuracy_score(y, y_pred_MLP3)))
print("Classification Report")
print(classification_report(y, y_pred_MLP3))

Multi-Layer Perceptron Model Results


Version 1:
-----------------
Confusion Matrix:
[[1362  636]
 [ 788  490]]

The Accuracy of this version is :0.5653235653235653
Classification Report
              precision    recall  f1-score   support

           0       0.63      0.68      0.66      1998
           1       0.44      0.38      0.41      1278

    accuracy                           0.57      3276
   macro avg       0.53      0.53      0.53      3276
weighted avg       0.56      0.57      0.56      3276

Version 2:
-----------------
Confusion Matrix:
[[1390  608]
 [ 788  490]]

The Accuracy of this version is :0.5738705738705738
Classification Report
              precision    recall  f1-score   support

           0       0.64      0.70      0.67      1998
           1       0.45      0.38      0.41      1278

    accuracy                           0.57      3276
   macro avg       0.54      0.54      0.54      3276
weighted avg       0.56      0.57      0.57      3276

Version 3

# Analysis
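Precision:
In all 3 models, the reports shown above give a precision between 0.61 and 0.64 for class 0 (not potable) and between 0.44 and 0.67 for class 1 (potable). Since 61% of the samples are not potable (1998 of 3276), none of the models was much better at making correct predictions than the class proportions alone.

Recall:
The recall figures differ sharply by class. For class 0 (not potable), the average recall was 86% for the Naive Bayes model, nearly 100% for the Logistic Regression model, and 69% for the Multi-Layer Perceptron. For class 1 (potable), the reports shown above give a recall of only 0.22 for Naive Bayes, 0.00 for Logistic Regression, and 0.38 for the Multi-Layer Perceptron. The models were good at recognizing non-potable water largely because they predicted class 0 for most samples; Logistic Regression did so for all but a handful.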
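
How the precision and recall figures relate to the confusion matrices can be checked by hand. A minimal sketch using the Naive Bayes version 1 matrix printed earlier, following scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix reported for Naive Bayes, version 1
# rows: true class (0, 1); columns: predicted class (0, 1)
cm = np.array([[1708, 290],
               [995, 283]])

# Recall: correct predictions of a class divided by all true members of that class
recall_0 = cm[0, 0] / cm[0].sum()
recall_1 = cm[1, 1] / cm[1].sum()

# Precision: correct predictions of a class divided by all predictions of that class
precision_0 = cm[0, 0] / cm[:, 0].sum()
precision_1 = cm[1, 1] / cm[:, 1].sum()

# Accuracy: all correct predictions divided by all samples
accuracy = np.trace(cm) / cm.sum()

print(round(precision_0, 2), round(recall_0, 2))
print(round(precision_1, 2), round(recall_1, 2))
print(accuracy)
```

These values agree with the corresponding rows of the classification report, which lists precision and recall separately for class 0 and class 1.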

Other Steps:
While not documented, attempts were made to use different scaling methods or to leave the feature classes unscaled. These attempts did not produce any meaningful change in precision and recall.

Since the 'Sulfate' and 'ph' feature classes both had missing data, attempts were made to remove both of these classes from our training. These attempts also did not produce any meaningful change in the results and were thus omitted.

# Conclusion
It appears that we were unable to build a very good model with any of the 3 classification methods.

While our models reliably identified non-potable samples, as shown by their high recall for class 0, they struggled to identify potable samples, as shown by their low recall for class 1 (0.38 at best) and an accuracy close to the 61% share of non-potable samples in the dataset.
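
A useful reference point for these results is the accuracy of always predicting the majority class. The sketch below uses only the class counts from the classification reports (1998 not potable, 1278 potable) and scikit-learn's `DummyClassifier`; the placeholder feature matrix is an assumption, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class supports in the classification reports
y_counts = np.array([0] * 1998 + [1] * 1278)

# Placeholder features: the most_frequent strategy never looks at them
X_placeholder = np.zeros((len(y_counts), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_counts)
baseline_accuracy = baseline.score(X_placeholder, y_counts)
print(round(baseline_accuracy, 4))
```

Comparing each model's accuracy against this baseline shows how much, if anything, the model adds beyond the class proportions.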

Further steps to be taken are running the model on a larger set of data, finding new and different feature classes for classification, and using different models as the relationship between water's attributes and water potability may be more complex than we once thought.