<h1 style="padding-top:2rem; padding-bottom:2rem; color:blue;font-size:4rem; text-align:center;">Water Potability Predictor</h1>

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5e/Water_drop_001.jpg/1280px-Water_drop_001.jpg" alt="water_image">

<h1>The Data</h1>

<table>
    <tr>
        <td style="font-weight:bold">ph:</td>
        <td>The acidity of the water on a scale of 0-14, with 7 being neutral, less than 7 acidic, and more than 7 basic. The World Health Organization (WHO) recommends a permissible pH range of 6.5 to 8.5.</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Hardness:</td>
        <td>Hardness is mainly caused by calcium and magnesium salts. It was originally defined as the capacity of water to precipitate soap (measured in mg/L).</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Solids:</td>
        <td>Water can have inorganic and even some organic minerals or salts dissolved in it, such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc. This measures the total dissolved solids in ppm.</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Chloramines:</td>
        <td>Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Sulfate:</td>
        <td>Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration
            in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Conductivity:</td>
        <td>Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines the electrical conductivity.
            Electrical conductivity (EC) measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Organic_carbon:</td>
        <td>Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA,
            TOC should be &lt; 2 mg/L in treated/drinking water and &lt; 4 mg/L in source water that is used for treatment.</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Trihalomethanes:</td>
        <td>THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature
            of the water that is being treated. THM levels up to 80 ppb (μg/L) are considered safe in drinking water.</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Turbidity:</td>
        <td>The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-emitting properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter.
            The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.</td>
    </tr>
    <tr>
        <td style="font-weight:bold">Potability:</td>
        <td>Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.</td>
    </tr>
</table>

<p>The classification task involves analyzing the features and training a model to perform a binary classification of whether the water is Potable (1) or Not Potable (0).</p>

<h1>Imports</h1>

In [1]:
#SKLearn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#SKLearn Model Imports
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

#Data analysis and manipulation imports
import pandas as pd
import numpy as np

<h1>Analysis Of The Data</h1>

In [2]:
water=pd.read_csv('water_potability.csv')
water.head()

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
0,,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.99097,2.963135,0
1,3.71608,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
2,8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.5466,310.135738,398.410813,11.558279,31.997993,4.075075,0


In [3]:
water.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 256.1 KB


In [4]:
water.isnull().sum()

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64

<p>The ph, Sulfate, and Trihalomethanes columns contain 491, 781, and 162 null values respectively. These need to be addressed before training. Our options are to drop every row that contains a null value or to fill the null values with the column average. Dropping rows reduces our sample size, while filling with the average could affect the accuracy of the model.</p>
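<p>The two options can be compared on a small example. The sketch below uses a hand-made toy table (the values and the names <code>toy</code>, <code>dropped</code>, and <code>filled</code> are illustrative assumptions, not part of the dataset) to show what each approach does to the rows.</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the water table; the values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 8.0, 6.0],
    "Sulfate": [330.0, 350.0, np.nan, 310.0],
    "Potability": [0, 1, 0, 1],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its column.
filled = toy.fillna(toy.mean())

print(len(dropped), len(filled))
print(filled)
```

<p>If the fill option is used together with a train/test split, computing the column means on the training rows only would avoid leaking information from the test rows.</p>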

In [5]:
waterNonNull=water.dropna()
waterNonNull.head()

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.5466,310.135738,398.410813,11.558279,31.997993,4.075075,0
5,5.584087,188.313324,28748.687739,7.544869,326.678363,280.467916,8.399735,54.917862,2.559708,0
6,10.223862,248.071735,28749.716544,7.513408,393.663396,283.651634,13.789695,84.603556,2.672989,0
7,8.635849,203.361523,13672.091764,4.563009,303.309771,474.607645,12.363817,62.798309,4.401425,0


In [6]:
print(len(waterNonNull))

2011


In [7]:
#Insert graphs/tables showing distributions of data
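<p>One possible way to fill in this cell is a table of summary statistics plus the share of each class. The helper <code>summarize</code> below is a hypothetical name introduced for illustration, and the demo rows are made up. In the notebook it would presumably be called as <code>summarize(waterNonNull)</code>.</p>

```python
import pandas as pd

def summarize(df, target="Potability"):
    """Return per-feature summary statistics and the share of each class."""
    stats = df.drop(columns=[target]).describe().T[["mean", "std", "min", "max"]]
    class_share = df[target].value_counts(normalize=True).sort_index()
    return stats, class_share

# Illustrative rows only; they are not taken from the dataset.
demo = pd.DataFrame({
    "ph": [8.3, 9.1, 5.6, 10.2],
    "Turbidity": [4.6, 4.1, 2.6, 2.7],
    "Potability": [0, 0, 0, 1],
})
stats, class_share = summarize(demo)
print(stats)
print(class_share)
```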

<p>We now have 2011 samples to train our models. However, since no testing data was provided, we will split the data into training and test sets to verify the models.</p>

In [8]:
x=waterNonNull.iloc[:,:-1]
y=waterNonNull.iloc[:,-1]
xtrain, xtest, ytrain, ytest=train_test_split(x,y,test_size=0.25,random_state=40)
xtrain.shape,xtest.shape,ytrain.shape,ytest.shape

((1508, 9), (503, 9), (1508,), (503,))
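<p>The split above is random, so the share of potable samples can differ slightly between the two sets. A variation worth considering is a stratified split, which keeps the class ratio the same in both. The sketch below uses synthetic data with an assumed 40% positive rate, not the water data.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples, 9 features, 40% positives (assumed numbers).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = np.array([1] * 80 + [0] * 120)

# stratify=y asks for the same class proportions in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```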

<h1>Naive Bayes</h1>

In [9]:
gnb = GaussianNB()
gnb.fit(xtrain, ytrain)
gnb_predict=gnb.predict(xtest)
accuracy_score(ytest, gnb_predict)

0.6302186878727635

In [10]:
mnb = MultinomialNB()
mnb.fit(xtrain, ytrain)
mnb_predict=mnb.predict(xtest)
accuracy_score(ytest, mnb_predict)

0.5049701789264414

In [11]:
bnb = BernoulliNB()
bnb.fit(xtrain, ytrain)
bnb_predict=bnb.predict(xtest)
accuracy_score(ytest, bnb_predict)

0.5825049701789264
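<p>A note on the BernoulliNB result: by default the model binarizes every feature at 0.0 (the <code>binarize</code> parameter). If all feature values are positive, as they appear to be in the rows shown earlier, every sample becomes a row of ones and the model can only predict one class. The sketch below reproduces that effect on synthetic data with an assumed 60/40 class split. The median-centering step is a hypothetical alternative, not something the notebook does.</p>

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in: strictly positive features, 60/40 class split (assumed).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=(100, 9))
y = np.array([0] * 60 + [1] * 40)

# Default binarize=0.0 maps every value > 0 to 1, so all rows look identical.
default_bnb = BernoulliNB().fit(X, y)
default_preds = default_bnb.predict(X)
print(np.unique(default_preds))

# Hypothetical alternative: center each feature on its median so that the
# 0.0 threshold separates "above median" from "below median".
centered = X - np.median(X, axis=0)
median_bnb = BernoulliNB().fit(centered, y)
median_preds = median_bnb.predict(centered)
print(np.unique(median_preds))
```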

<h1>Logistic Regression</h1>

In [12]:
# Create model object
logReg = LogisticRegression(max_iter=5000,random_state=0, n_jobs=20)

In [13]:
logReg.fit(xtrain,ytrain)

In [14]:
predLogReg = logReg.predict(xtest)
lgAccuracy = accuracy_score(ytest, predLogReg)
print(lgAccuracy)

0.5825049701789264
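<p>The logistic regression accuracy matches the BernoulliNB accuracy digit for digit, which can happen when both models predict a single class for every test row. One way to check is to compare against a majority-class baseline. The sketch below uses scikit-learn's <code>DummyClassifier</code> on synthetic labels with an assumed 60/40 split. In the notebook it would be fitted on <code>xtrain</code>, <code>ytrain</code> and scored on <code>xtest</code>, <code>ytest</code>.</p>

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with a 60/40 split (assumed); the dummy model ignores the features.
y_train = np.array([0] * 90 + [1] * 60)
y_test = np.array([0] * 30 + [1] * 20)
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(baseline_accuracy)
```

<p>A model whose accuracy equals this baseline has not learned anything beyond the class frequencies.</p>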


<h1>Multi-Layer Perceptron</h1>

In [18]:
clf = MLPClassifier(random_state=1, max_iter=300).fit(xtrain, ytrain)
clf.predict_proba(xtest[:1])
clf.score(xtest, ytest)

0.588469184890656
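<p>Multi-layer perceptrons are sensitive to feature scale, and the features here span very different ranges (Solids in the tens of thousands, ph below 14). A possible next step is to standardize the features inside a pipeline. The sketch below runs on synthetic data, so its score says nothing about the water dataset. The name <code>scaled_mlp</code> is illustrative.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine water features (not the real data).
X, y = make_classification(n_samples=400, n_features=9, random_state=0)
# Inflate one column to mimic very different feature scales (e.g. Solids vs. ph).
X[:, 0] *= 10_000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40
)

# The scaler is fitted on the training split only, then reused on the test split.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(random_state=1, max_iter=300),
)
scaled_mlp.fit(X_train, y_train)
scaled_score = scaled_mlp.score(X_test, y_test)
print(scaled_score)
```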

<table>
    <tr>
        <th style="text-decoration:underline; font-weight: bold;">References</th>
    </tr>
    <tr>
        <td>Data From:</td>
        <td><a href="https://www.kaggle.com/datasets/adityakadiwal/water-potability">https://www.kaggle.com/datasets/adityakadiwal/water-potability</a></td>
    </tr>
    <tr>
        <td></td>
        <td><a href="https://scikit-learn.org/stable/modules/naive_bayes.html">https://scikit-learn.org/stable/modules/naive_bayes.html</a></td>
    </tr>
</table>