# **Water Quality Prediction**


* Dataset: https://www.kaggle.com/datasets/adityakadiwal/water-potability


* Three classification algorithms (logistic regression, decision tree, and random forest) are used to predict water potability.


* The project reports the standard classification evaluation metrics, with an explanation of what each metric means.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content'

In [None]:
!kaggle datasets download -d adityakadiwal/water-potability

Downloading water-potability.zip to /content
  0% 0.00/251k [00:00<?, ?B/s]
100% 251k/251k [00:00<00:00, 88.4MB/s]


In [None]:
!unzip \*.zip && rm *.zip

Archive:  water-potability.zip
replace water_potability.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: water_potability.csv    


# First classification algorithm: **Logistic Regression**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [None]:
water_df = pd.read_csv("water_potability.csv")

In [None]:
X = water_df.drop("Potability", axis=1)
y = water_df["Potability"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train.isnull().sum())   # count missing values per column; they are imputed in the next cell

ph                 395
Hardness             0
Solids               0
Chloramines          0
Sulfate            631
Conductivity         0
Organic_carbon       0
Trihalomethanes    127
Turbidity            0
dtype: int64


In [None]:
mean_imputer = SimpleImputer(strategy="mean")
X_train = mean_imputer.fit_transform(X_train)
X_test = mean_imputer.transform(X_test)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
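The impute-then-scale steps above can also be chained with the model in a scikit-learn `Pipeline`, so that the imputer and scaler are always fitted on the training split only. The following is a minimal sketch on synthetic data, not the notebook's own workflow; the names `X_demo`, `y_demo`, `pipe`, and `demo_pred` are hypothetical and the random data merely stands in for `water_potability.csv`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the water data (assumption): 200 rows, 3 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Blank out roughly 10% of the entries to imitate missing measurements.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each step is fitted on the training data only when pipe.fit is called;
# pipe.predict then reuses the stored means and scaling parameters.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

demo_pred = pipe.predict(X_te)
print(demo_pred.shape)
```

Keeping the preprocessing inside the pipeline means the same object can be reused with a different final estimator without repeating the imputation and scaling cells.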

In [None]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

LogisticRegression()

In [None]:
y_pred = lr_model.predict(X_test)                            # To evaluate the logistic regression model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)                      # computed from hard 0/1 predictions, not predicted probabilities

  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print("Logistic Regression")
# Accuracy: proportion of correctly classified instances among all instances.
print("Accuracy: ", accuracy)
# Precision: proportion of true positives among all predicted positives.
print("Precision: ", precision)
# Recall: proportion of true positives among all actual positives.
print("Recall: ", recall)
# F1-score: harmonic mean of precision and recall, balancing both measures.
print("F1-score: ", f1)
# ROC AUC: area under the receiver operating characteristic curve, which plots
# the true positive rate against the false positive rate across thresholds.
print("ROC AUC score: ", roc_auc)

Logistic Regression
Accuracy:  0.6280487804878049
Precision:  0.0
Recall:  0.0
F1-score:  0.0
ROC AUC score:  0.5
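The metric definitions given in the comments above can be checked by hand against a confusion matrix. The sketch below uses a small set of made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the water data) and compares the manual formulas with the scikit-learn functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels (hypothetical, not taken from the water data).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = (2 * manual_precision * manual_recall
             / (manual_precision + manual_recall))

print("Accuracy: ", manual_accuracy, accuracy_score(y_true_demo, y_pred_demo))
print("Precision:", manual_precision, precision_score(y_true_demo, y_pred_demo))
print("Recall:   ", manual_recall, recall_score(y_true_demo, y_pred_demo))
print("F1-score: ", manual_f1, f1_score(y_true_demo, y_pred_demo))
```

When a model makes no positive predictions at all, `tp + fp` is zero and precision is undefined; scikit-learn then reports 0.0 and emits a warning.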


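Conclusion (logistic regression):
* The prediction of water quality was poor: precision, recall, and F1-score are all 0.0, so the model did not correctly identify a single potable sample in the test set.
* A ROC AUC of 0.5 computed from these predictions is no better than chance, and the accuracy of about 0.63 is consistent with the model predicting only the non-potable class.
* We can conclude that logistic regression, at least with its default settings, might not be the ideal algorithm for us here.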
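One way to put these numbers in context is to compare them with a baseline that always predicts the most frequent class. The sketch below uses `DummyClassifier` on made-up labels (60 negatives and 40 positives, an assumption chosen only for illustration) and shows that such a baseline gets a seemingly reasonable accuracy while recall is 0.0 and ROC AUC is 0.5.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical labels: 60 negatives and 40 positives.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit (here: 0).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
base_pred = baseline.predict(X_demo)

base_acc = accuracy_score(y_demo, base_pred)
base_rec = recall_score(y_demo, base_pred)
base_auc = roc_auc_score(y_demo, base_pred)

print("Accuracy: ", base_acc)
print("Recall: ", base_rec)
print("ROC AUC score: ", base_auc)
```

A model is only informative if it beats this kind of baseline on the metrics that matter for the positive class.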


# Second classification algorithm: **Decision Trees**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [None]:
water_df = pd.read_csv("water_potability.csv")

In [None]:
X = water_df.drop("Potability", axis=1)
y = water_df["Potability"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train.isnull().sum())

ph                 395
Hardness             0
Solids               0
Chloramines          0
Sulfate            631
Conductivity         0
Organic_carbon       0
Trihalomethanes    127
Turbidity            0
dtype: int64


In [None]:
mean_imputer = SimpleImputer(strategy="mean")
X_train = mean_imputer.fit_transform(X_train)
X_test = mean_imputer.transform(X_test)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

DecisionTreeClassifier(random_state=42)

In [None]:
y_pred = dt_model.predict(X_test)                                # To evaluate the decision tree model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

In [None]:
print("Decision Tree")
print("Accuracy: ", accuracy)
print("Precision: ", precision)          # Precision measures the proportion of true positive classifications among all positive classifications made
print("Recall: ", recall)
print("F1-score: ", f1)
print("ROC AUC score: ", roc_auc)

Decision Tree
Accuracy:  0.5777439024390244
Precision:  0.4412811387900356
Recall:  0.5081967213114754
F1-score:  0.4723809523809524
ROC AUC score:  0.5635643800732134


Conclusion (decision tree):
* The decision tree identified potable samples better than logistic regression: recall (0.51), F1-score (0.47), and ROC AUC (0.56) all improved, although accuracy dropped from 0.63 to 0.58.
* However, we can conclude that a decision tree also might not be the ideal algorithm for us in this situation, and another model could provide better performance.

# Third classification algorithm: **Random Forests**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [None]:
water_df = pd.read_csv("water_potability.csv")

In [None]:
X = water_df.drop("Potability", axis=1)
y = water_df["Potability"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train.isnull().sum())

ph                 395
Hardness             0
Solids               0
Chloramines          0
Sulfate            631
Conductivity         0
Organic_carbon       0
Trihalomethanes    127
Turbidity            0
dtype: int64


In [None]:
mean_imputer = SimpleImputer(strategy="mean")
X_train = mean_imputer.fit_transform(X_train)
X_test = mean_imputer.transform(X_test)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

In [None]:
y_pred = rf_model.predict(X_test)                              # To evaluate the random forest model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

In [None]:
print("Random Forest")
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1-score: ", f1)
print("ROC AUC score: ", roc_auc)

Random Forest
Accuracy:  0.6676829268292683
Precision:  0.589041095890411
Recall:  0.3524590163934426
F1-score:  0.441025641025641
ROC AUC score:  0.6034139742161388
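The ROC curve is defined over a range of classification thresholds, so `roc_auc_score` is usually given a continuous score such as the positive-class probability rather than hard 0/1 predictions. The sketch below contrasts the two on synthetic data from `make_classification`; the names `rf_demo`, `auc_from_labels`, and `auc_from_scores` are hypothetical, and the numbers it prints say nothing about the water dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption), 9 features like the water data.
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

rf_demo = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Hard labels give a single operating point on the ROC curve.
auc_from_labels = roc_auc_score(y_te, rf_demo.predict(X_te))

# Probabilities of class 1 let the metric sweep over all thresholds.
pos_scores = rf_demo.predict_proba(X_te)[:, 1]
auc_from_scores = roc_auc_score(y_te, pos_scores)

print("ROC AUC from hard labels:   ", auc_from_labels)
print("ROC AUC from probabilities: ", auc_from_scores)
```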


Conclusion (Random Forests):
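* The random forest had the highest accuracy (0.67), precision (0.59), and ROC AUC (0.60) of the three models, although its recall (0.35) and F1-score (0.44) were lower than the decision tree's.
* Therefore, of the three models we compared, random forest is the better overall algorithm for us in this situation, with the caveat that it misses more potable samples than the decision tree does.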
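Because the three sections share the same loading, splitting, and evaluation steps, the comparison can also be written as a single loop over the models. The following is a rough sketch on synthetic data, not a rerun of the experiments above; `models` and `results` are hypothetical names, and only two of the five metrics are collected to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the water data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit every model on the same split and score it with the same metrics.
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
    }

for name, scores in results.items():
    print(f"{name:20s} accuracy={scores['accuracy']:.3f} f1={scores['f1']:.3f}")
```

Evaluating all models in one place makes it harder for the split or preprocessing to drift between sections.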