# **Water Quality Predictor** 🚱🚰

<img src="https://media.giphy.com/media/9gD2UALjYkzQkXDAE2/giphy.gif">

A woman drinks water from a river in India. Photo courtesy of The Guardian.

# 1 - Introduction

## 1.1 - Dataset Description

### Access to safe drinking-water is essential to health, a basic human right and a component of effective policy for health protection. This is important as a health and development issue at a national, regional and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.

## 1.2 - Dataset Dictionary

## 1.2.1 - X features

### 🧪 ***pH value:***
### pH is an important parameter in evaluating the acid–base balance of water. It also indicates whether the water is acidic or alkaline. The WHO recommends a permissible pH range of 6.5 to 8.5. The values in the current investigation ranged from 6.52 to 6.83, which is within the WHO standard.
### 🧂 ***Hardness:*** 
### Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, an effect caused by calcium and magnesium.
### 💎 ***Solids (Total dissolved solids - TDS):***
### Water can dissolve a wide range of inorganic and some organic minerals or salts, such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium and sulfates. These minerals produce an unwanted taste and a diluted color in the water. TDS is therefore an important parameter for water use: a high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L, and the maximum limit prescribed for drinking purposes is 1000 mg/L.
### 💊 ***Chloramines:***
### Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.
### ⛰ ***Sulfate:***
### Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.
### 🔌 ***Conductivity:***
### Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines its electrical conductivity. Electrical conductivity (EC) measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.
### 🍁 ***Organic_carbon:***
### Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be below 2 mg/L in treated drinking water and below 4 mg/L in source water that is used for treatment.
### ⚗ ***Trihalomethanes:***
### THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppb (μg/L) are considered safe in drinking water.
### 🩸 ***Turbidity:***
### The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.
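
The limits quoted in the descriptions above can be collected into a small lookup table for a quick sanity check of a single sample. This is only a sketch: the `LIMITS` dictionary and the `flag_exceedances` helper are hypothetical names, the numbers are copied from the text above rather than from an authoritative source, and the dataset's columns may use different units than these limits assume.

```python
# Guideline ranges quoted in the feature descriptions above, as (low, high).
# None means the text gives no bound on that side.
LIMITS = {
    "ph": (6.5, 8.5),
    "Solids": (None, 1000.0),
    "Chloramines": (None, 4.0),
    "Conductivity": (None, 400.0),
    "Organic_carbon": (None, 2.0),
    "Trihalomethanes": (None, 80.0),
    "Turbidity": (None, 5.0),
}


def flag_exceedances(sample):
    """Return the names of the features whose value falls outside LIMITS."""
    flagged = []
    for name, (low, high) in LIMITS.items():
        value = sample.get(name)
        if value is None:
            continue  # feature not measured for this sample
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(name)
    return flagged


# Hypothetical sample, made up for illustration only.
sample = {"ph": 9.1, "Solids": 450.0, "Turbidity": 3.2, "Conductivity": 520.0}
print(flag_exceedances(sample))
```

A check like this is no substitute for the model; it only makes the quoted limits easy to compare against individual rows.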

## 1.2.2 - Y feature

### 🚱🚰 ***Potability:***
### Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.

## 1.2.3 - Problem definition

### Thirst is no joke. My aim for this project is to build a model that reaches 90% accuracy when predicting whether a given water sample is potable.

# 2 - Development

## 2.1 - Exploratory Data Analysis (EDA)

Importing the libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_curve, roc_auc_score
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import scipy.stats

Importing the dataset

In [None]:
water = pd.read_csv("../input/water-potability/water_potability.csv")

Basic dataframe checks

In [None]:
water.shape

In [None]:
water.head()

Checking for null values in the dataframe

In [None]:
water.isnull().sum()

Filling the null values

In [None]:
water['ph']=water['ph'].fillna(water.groupby(['Potability'])['ph'].transform('mean'))
water['Sulfate']=water['Sulfate'].fillna(water.groupby(['Potability'])['Sulfate'].transform('mean'))
water['Trihalomethanes']=water['Trihalomethanes'].fillna(water.groupby(['Potability'])['Trihalomethanes'].transform('mean'))
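
To see what the group-wise fill does, the toy frame below (values made up for illustration) applies the same `groupby(...).transform('mean')` pattern: each missing value is replaced by the mean of the rows that share its `Potability` label.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0, 7.0, np.nan, 9.0],
    "Potability": [0, 0, 0, 1, 1, 1],
})

# transform("mean") returns a Series aligned with the original rows,
# holding the mean of each row's own Potability group.
group_mean = toy.groupby("Potability")["ph"].transform("mean")
toy["ph"] = toy["ph"].fillna(group_mean)

print(toy)
```

Because the fill value depends on the target column, the same statistic would not be available for unlabeled samples at prediction time.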

Finding any rows where "ph" is less than or equal to zero.

In [None]:
water[water["ph"] <= 0]

Dropping the value that has "ph" equal to zero

In [None]:
water.drop(3014, inplace=True)
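
The index `3014` is specific to this file. A minimal alternative sketch, shown here on made-up data, filters with a boolean mask so that the step does not depend on a hard-coded row label.

```python
import pandas as pd

# Toy data with one invalid pH reading of zero.
toy = pd.DataFrame({"ph": [7.1, 0.0, 6.8, 8.3], "Potability": [1, 0, 0, 1]})

# Keep only the rows with a strictly positive pH.
toy = toy[toy["ph"] > 0]

print(toy.index.tolist())
```

Strictly positive values matter later on, because the Box-Cox transformation is only defined for positive inputs.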

Checking data distribution and density in the dataframe

In [None]:
water.plot(kind = "box",
                layout = (3,4),
                subplots = True,
                figsize = (15,15))

water.plot(kind = "density",
                layout = (3,4),
                subplots = True,
                figsize = (15,15),
                sharex = False)

plt.show()

It appears that there are many outliers and non-normal distributions in the dataframe. The next step uses the Box-Cox transformation to reduce both.

Applying the Box-Cox transformation to the data. Before = blue; after = red.

In [None]:
sns.histplot(water["ph"])

water["ph"], fitted_lambda= scipy.stats.boxcox(water["ph"])

sns.histplot(water["ph"], color="Red");

In [None]:
sns.histplot(water["Hardness"])

water["Hardness"], fitted_lambda= scipy.stats.boxcox(water["Hardness"])

sns.histplot(water["Hardness"], color="Red");

In [None]:
sns.histplot(water["Solids"])

water["Solids"], fitted_lambda= scipy.stats.boxcox(water["Solids"])

sns.histplot(water["Solids"], color="Red");

In [None]:
sns.histplot(water["Chloramines"])

water["Chloramines"], fitted_lambda= scipy.stats.boxcox(water["Chloramines"])

sns.histplot(water["Chloramines"], color="Red");

In [None]:
sns.histplot(water["Sulfate"])

water["Sulfate"], fitted_lambda= scipy.stats.boxcox(water["Sulfate"])

sns.histplot(water["Sulfate"], color="Red");

In [None]:
sns.histplot(water["Conductivity"])

water["Conductivity"], fitted_lambda= scipy.stats.boxcox(water["Conductivity"])

sns.histplot(water["Conductivity"], color="Red");

In [None]:
sns.histplot(water["Organic_carbon"])

water["Organic_carbon"], fitted_lambda= scipy.stats.boxcox(water["Organic_carbon"])

sns.histplot(water["Organic_carbon"], color="Red");

In [None]:
sns.histplot(water["Trihalomethanes"])

water["Trihalomethanes"], fitted_lambda= scipy.stats.boxcox(water["Trihalomethanes"])

sns.histplot(water["Trihalomethanes"], color="Red");

In [None]:
sns.histplot(water["Turbidity"])

water["Turbidity"], fitted_lambda= scipy.stats.boxcox(water["Turbidity"])

sns.histplot(water["Turbidity"], color="Red");
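
The nine cells above repeat one pattern per column. As a compact sketch of the same idea, the loop below applies `scipy.stats.boxcox` to every column of a synthetic, right-skewed frame and keeps each fitted lambda, which would be needed to apply the same transformation to new samples. The data and the `lambdas` dictionary are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed columns (Box-Cox needs positive input).
toy = pd.DataFrame({
    "a": rng.lognormal(mean=0.0, sigma=1.0, size=500),
    "b": rng.lognormal(mean=1.0, sigma=0.5, size=500),
})

skew_before = toy.skew()

lambdas = {}
for column in toy.columns:
    # boxcox returns the transformed values and the lambda it fitted.
    transformed, fitted_lambda = stats.boxcox(toy[column].to_numpy())
    toy[column] = transformed
    lambdas[column] = fitted_lambda

skew_after = toy.skew()
print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

Skewness close to zero after the transformation suggests a more symmetric distribution, which is the effect the red histograms are meant to show.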

Correlation study about the data

In [None]:
matrix_cor = water.corr()
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(matrix_cor, annot=True, ax=ax)

In [None]:
sns.pairplot(water, diag_kind = "kde", corner = True);
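
A heatmap can be hard to read when most coefficients are small. One way to summarize it, sketched here on synthetic data with hypothetical column names, is to rank the features by the absolute value of their correlation with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: "signal" drives the target, "noise" does not.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
toy = pd.DataFrame({"signal": signal, "noise": noise, "target": target})

# Pearson correlation of every feature with the target, strongest first.
ranking = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful to a tree-based model.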

## 2.2 - Model Building

Creating the X and y dataframes that will be fitted in the models

In [None]:
water_X = water.drop("Potability", axis=1)
water_y = water["Potability"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(water_X, water_y, test_size=0.2)
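
The split above is random on every run, so the scores reported later can change between executions. Here is a hedged sketch of a reproducible variant on synthetic data: `random_state` fixes the shuffle and `stratify` keeps the class ratio equal in both parts. The seed `42` and the 60/40 class ratio are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a binary target with 60 zeros and 40 ones.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify the test set keeps the 60/40 ratio: 12 zeros and 8 ones.
print(np.bincount(y_te))
```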

Creating the Logistic Regression model

In [None]:
logreg = LogisticRegression()

In [None]:
logreg.fit(X_train, y_train)

Logistic Regression model scoring

In [None]:
logreg.score(X_test, y_test)

In [None]:
logreg_preds = logreg.predict(X_test)

In [None]:
print(classification_report(y_test, logreg_preds))

In [None]:
fig, ax = plt.subplots(figsize=(15, 15))
plt.title("Confusion Matrix - Logistic Regression")
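# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
metrics.ConfusionMatrixDisplay.from_estimator(logreg, X_test, y_test, xticks_rotation="vertical", ax=ax)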

In [None]:
y_score_logreg = logreg.predict_proba(X_test)[:,1]

In [None]:
false_positive_rate_logreg, true_positive_rate_logreg, threshold_logreg = roc_curve(y_test, y_score_logreg)

In [None]:
print("roc_auc_score for Logistic Regression: ", roc_auc_score(y_test, y_score_logreg))

In [None]:
plt.subplots(1, figsize=(10,10))
plt.title('Receiver Operating Characteristic - Logistic Regression')
plt.plot(false_positive_rate_logreg, true_positive_rate_logreg)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
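
For intuition about what `roc_curve` and `roc_auc_score` compute, here is a four-sample example with made-up scores. The AUC equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Area under the curve, computed two ways with scikit-learn.
area_trapezoid = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_prob)

# Pairwise view: 3 of the 4 positive/negative pairs are ranked correctly.
pairs = [(p, n) for p in y_prob[y_true == 1] for n in y_prob[y_true == 0]]
area_pairs = sum(p > n for p, n in pairs) / len(pairs)

print(area_trapezoid, area_direct, area_pairs)  # all three equal 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plots, which is the performance of random guessing.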

Creating the Decision Tree Classifier model

In [None]:
dectree = DecisionTreeClassifier(random_state=42)

In [None]:
dectree.fit(X_train, y_train)

Decision Tree Classifier model scoring

In [None]:
dectree.score(X_test, y_test)

In [None]:
dectree_preds = dectree.predict(X_test)

In [None]:
print(classification_report(y_test, dectree_preds))

In [None]:
fig, ax = plt.subplots(figsize=(15, 15))
plt.title("Confusion Matrix - Decision Tree")
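# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
metrics.ConfusionMatrixDisplay.from_estimator(dectree, X_test, y_test, xticks_rotation="vertical", ax=ax)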

In [None]:
y_score_dectree = dectree.predict_proba(X_test)[:,1]

In [None]:
false_positive_rate_dectree, true_positive_rate_dectree, threshold_dectree = roc_curve(y_test, y_score_dectree)

In [None]:
print("roc_auc_score for DecisionTree: ", roc_auc_score(y_test, y_score_dectree))

In [None]:
plt.subplots(1, figsize=(10,10))
plt.title('Receiver Operating Characteristic - Decision Tree')
plt.plot(false_positive_rate_dectree, true_positive_rate_dectree)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Creating the Random Forest Classifier model

In [None]:
random = RandomForestClassifier(max_depth=14,
                                n_estimators=600,
                                n_jobs=-1)

Note: The Random Forest parameters were optimized using GridSearchCV. That process is not included in this notebook.
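
Since the search itself is not shown, here is a minimal sketch of what such a tuning step could look like. The parameter grid, the synthetic data from `make_classification`, and the fold count are assumptions chosen to keep the example fast; they are not the values used to obtain `max_depth=14` and `n_estimators=600`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# Hypothetical, deliberately small grid.
param_grid = {
    "max_depth": [4, 8],
    "n_estimators": [25, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the best combination, so it is measured on the training folds only and not on a held-out test set.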

In [None]:
random.fit(X_train, y_train)

Random Forest Classifier model scoring

In [None]:
random.score(X_test, y_test)

In [None]:
random_preds = random.predict(X_test)

In [None]:
print(classification_report(y_test, random_preds))

In [None]:
fig, ax = plt.subplots(figsize=(15, 15))
plt.title("Confusion Matrix - RandomForestClassifier")
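# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
metrics.ConfusionMatrixDisplay.from_estimator(random, X_test, y_test, xticks_rotation="vertical", ax=ax)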

In [None]:
y_score_random = random.predict_proba(X_test)[:,1]

In [None]:
false_positive_rate_random, true_positive_rate_random, threshold_random = roc_curve(y_test, y_score_random)

In [None]:
print("roc_auc_score for RandomForestClassifier: ", roc_auc_score(y_test, y_score_random))

In [None]:
plt.subplots(1, figsize=(10,10))
plt.title('Receiver Operating Characteristic - RandomForestClassifier')
plt.plot(false_positive_rate_random, true_positive_rate_random)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Random Forest model feature importance analysis

In [None]:
importance = random.feature_importances_

In [None]:
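# Pair each importance score with its column name instead of a bare index.
for name, score in zip(water_X.columns, importance):
    print('Feature: %s, Score: %.5f' % (name, score))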

In [None]:
plt.bar([x for x in range(len(importance))], importance)
plt.show()
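
The bar chart above labels the features by position only. A small sketch of one way to make the ranking readable, using synthetic data and hypothetical feature names, is to index the importances by column name and sort them.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Index the importances by column name and sort, largest first.
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

Calling `ranked.plot(kind="bar")` on the result would draw the same chart with the feature names on the x-axis.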

In [None]:
water.head()

Plotting the Receiver Operating Characteristic curve for the three models

In [None]:
plt.subplots(1, figsize=(10,10))
plt.title('Receiver Operating Characteristic - Model Comparison')
plt.plot(false_positive_rate_random, true_positive_rate_random, label="Random Forest")
plt.plot(false_positive_rate_dectree, true_positive_rate_dectree, label="Decision Tree")
plt.plot(false_positive_rate_logreg, true_positive_rate_logreg, label="Logistic Regression")
plt.legend()
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# 3 - Conclusion

### The best model still fell about 10 percentage points short of the desired accuracy. The Random Forest model labeled 38 non-potable water samples as potable, which could cause a wide variety of illnesses for those who drink from these waters. The model also labeled 92 potable water samples as non-potable, depriving the local population of a good water source. I find solace in the fact that 525 samples were labeled correctly.
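
The accuracy implied by these counts can be checked directly. The three numbers below are the ones quoted in the paragraph above; the variable names are only illustrative.

```python
# Counts quoted in the conclusion above.
false_positives = 38   # non-potable samples labeled potable
false_negatives = 92   # potable samples labeled non-potable
correct = 525          # samples labeled correctly

total = correct + false_positives + false_negatives
accuracy = correct / total
shortfall = 0.90 - accuracy  # distance from the 90% target

print(total, round(accuracy, 3), round(shortfall, 3))
```

This gives roughly 80% accuracy on 655 test samples, about ten percentage points below the target.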

### Brainstorming possible next steps: try model stacking, try a neural network approach, and try removing outliers from the data to see whether this improves accuracy.
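
As a starting point for the stacking idea, the sketch below combines a decision tree and a random forest under a logistic regression meta-model with scikit-learn's `StackingClassifier`. It runs on synthetic data, and the choice of base models and hyperparameters is an assumption, not a tuned setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the water data.
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Base models feed their cross-validated predictions to the final estimator.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
print(round(stack_accuracy, 3))
```

Whether stacking helps on the real data is an open question; the sketch only shows how the pieces fit together.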