## Detecting Whether Water Is Safe for Consumption 💧🌍

![Water Drinking](https://tdma.info/assets/uploads/2017/12/Using_sunlight_to_clean_water_Featured_Image-1.jpg)

<i> <strong>"Because no matter who we are or where we come from, we're all entitled to the basic human rights of clean air to breathe, clean water to drink, and healthy land to call home." </strong></i> - **Martin Luther King III**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.metrics import roc_auc_score, make_scorer, confusion_matrix
%matplotlib inline

In [None]:
# To ignore any warning messages

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Importing the dataset

water = pd.read_csv("../input/water-potability/water_potability.csv")
water.head()

In [None]:
# Checking the number of rows and columns

water.shape

In [None]:
# Checking for any null values

water.info()

In [None]:
# Checking the percentage of null values

round(water.isnull().sum() * 100/water.shape[0])

In [None]:
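# Inspecting the distribution of "ph" before fixing its null values
# (histplot replaces the deprecated distplot)

plt.figure(figsize=[16,8])
sns.histplot(water["ph"], kde=True, stat="density")
plt.show()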

In [None]:
# Since the distribution of values in "ph" is approximately normal, we will use the mean to impute the missing values.
# Assigning the result back avoids relying on an in-place fill of a column selection.

water["ph"] = water["ph"].fillna(water["ph"].mean())
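
Mean imputation is a reasonable choice only when the distribution is roughly symmetric. The following is a minimal sketch, on a small made-up series rather than the real `ph` column, of how one might check skewness before choosing between the mean and the median. The helper name `impute_by_skew` and the `0.5` threshold are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_by_skew(series: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Fill NaNs with the mean if |skew| is small, otherwise with the median."""
    use_mean = abs(series.skew()) < threshold
    fill_value = series.mean() if use_mean else series.median()
    return series.fillna(fill_value)

# Hypothetical, roughly symmetric sample with two missing values
sample = pd.Series([6.5, 7.0, 7.5, np.nan, 7.0, 6.8, 7.2, np.nan])
filled = impute_by_skew(sample)

print(filled.isnull().sum())
print(filled.tolist())
```

For a strongly skewed column, the same helper falls back to the median, which is less sensitive to extreme values.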

In [None]:
# "Sulfate" shows some skew, so we are going to visualize it using a boxplot.

plt.figure(figsize=[16,8])
plt.title("Before removing outliers", size=25, pad=20)
sns.boxplot(x=water["Sulfate"])
plt.show()

# Removing the top and bottom 1%
# Note: comparisons with NaN evaluate to False, so rows with a missing "Sulfate" are dropped here as well.

upper = water["Sulfate"].quantile(0.99)
water = water[water["Sulfate"] <= upper]
lower = water["Sulfate"].quantile(0.01)
water = water[water["Sulfate"] >= lower].copy()

# Visualizing the boxplot for "Sulfate" after removing outliers

plt.figure(figsize=[16,8])
plt.title("After removing outliers", size=25, pad=20)
sns.boxplot(x=water["Sulfate"])
plt.show()
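
One detail of this kind of filtering is easy to miss: a comparison against `NaN` is always `False`, so a boolean filter on a column also removes the rows where that column is missing. The sketch below shows this on a made-up frame (not the real dataset). It uses `Series.between`, which is inclusive on both ends, as a compact alternative to two separate comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Sulfate value
toy = pd.DataFrame({"Sulfate": [100.0, 250.0, np.nan, 300.0, 900.0]})

# Quantiles ignore NaN by default
lower = toy["Sulfate"].quantile(0.01)
upper = toy["Sulfate"].quantile(0.99)

# NaN never satisfies the condition, so the missing row is dropped too
trimmed = toy[toy["Sulfate"].between(lower, upper)]

print(len(toy), "->", len(trimmed))
print(trimmed["Sulfate"].isnull().sum())
```

If the missing rows should be kept for later imputation, the mask can be combined with `toy["Sulfate"].isnull()` using `|`.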

In [None]:
# Visualizing the histogram for "Sulfate" now

plt.figure(figsize=[16,8])
plt.title("Distribution of Sulfate", size=25, pad=20)
sns.histplot(water["Sulfate"], kde=True, stat="density")  # histplot replaces the deprecated distplot
plt.show()

In [None]:
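# Inspecting the distribution of "Trihalomethanes" before fixing its null values

plt.figure(figsize=[16,8])
plt.title("Distribution of Trihalomethanes", size=25, pad=20)
sns.histplot(water["Trihalomethanes"], kde=True, stat="density")  # histplot replaces the deprecated distplot
plt.show()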

In [None]:
# Since the distribution of values in "Trihalomethanes" is approximately normal, we will use the mean to impute the missing values.

water["Trihalomethanes"] = water["Trihalomethanes"].fillna(water["Trihalomethanes"].mean())

In [None]:
# Heatmap to check for any high correlation between variables

correlation = water.corr()

plt.figure(figsize=[16,8])
plt.title("Correlation between all the variables", size=25, pad=20)
sns.heatmap(correlation, cmap='YlGnBu', annot=True)
plt.show()

In [None]:
# Checking if any null values remain

water.info()

### Model Building using XGBoost (Extreme Gradient Boosting) 👷‍🔨

![Robert jr jarvis](https://c4.wallpaperflare.com/wallpaper/562/127/684/man-actor-iron-man-2-tony-wallpaper-preview.jpg)

In [None]:
# Dividing the independent and dependent variables

y = water.pop("Potability")
X = water

In [None]:
# Performing the train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)
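
The conclusion of this notebook notes that the two classes are somewhat imbalanced. One option, not used in the original split, is to pass `stratify=y` so that the train and test sets keep the same class ratio. The sketch below uses synthetic stand-in data, and its 60/40 ratio is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 60 samples of class 0 and 40 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42, stratify=y_demo
)

# With stratification, each split keeps the 60/40 class ratio
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```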

In [None]:
# Standardizing our data (not really required for XGBoost)

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Building the model using XGBoost Classifier

xgb_model = xgb.XGBClassifier(random_state=2, learning_rate=0.1, max_depth=10, min_child_weight=1, n_estimators=50)

xgb_model.fit(X_train, y_train)

In [None]:
# Checking the ROC_AUC score in the test data

print("AUC on test data by XGBoost: ", roc_auc_score(y_true=y_test, y_score=xgb_model.predict_proba(X_test)[:, 1]))
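
As a reminder of what this score measures: ROC AUC is the fraction of (positive, negative) pairs in which the positive sample receives the higher predicted score. A tiny worked example on made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# Positive/negative pairs: (0.35 vs 0.1) correct, (0.35 vs 0.4) wrong,
# (0.8 vs 0.1) correct, (0.8 vs 0.4) correct -> 3 of 4 pairs ranked correctly
auc_demo = roc_auc_score(y_true_demo, y_score_demo)
print(auc_demo)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.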

In [None]:
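# Calculating Precision, Recall, Accuracy

cm_test = confusion_matrix(y_test, xgb_model.predict(X_test))

print("Confusion Matrix on Test data: \n\n", cm_test)

# scikit-learn puts true labels on the rows and predicted labels on the columns, ordered (0, 1).
# The metrics below treat class 0 (not potable) as the "positive" class.
TP = cm_test[0][0]  # true 0, predicted 0
TN = cm_test[1][1]  # true 1, predicted 1
FN = cm_test[0][1]  # true 0, predicted 1
FP = cm_test[1][0]  # true 1, predicted 0

print("Precision", (TP/(TP+FP)))
print("Recall", (TP/(TP+FN)))
print("Accuracy", (TP+TN)/(TP+TN+FN+FP))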
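
The layout of the confusion matrix is easy to mix up, so here is a small sketch on made-up labels. It shows how scikit-learn orders the cells and how `precision_score` and `recall_score` can report the metrics for either class through `pos_label`. The label meanings (1 = potable, 0 = not potable) follow the dataset, while the values themselves are invented.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = potable, 0 = not potable
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels, both ordered (0, 1)
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()  # counts with class 1 as the positive class

# Metrics per class, selected with pos_label
recall_potable = recall_score(y_true_demo, y_pred_demo, pos_label=1)
recall_not_potable = recall_score(y_true_demo, y_pred_demo, pos_label=0)
precision_potable = precision_score(y_true_demo, y_pred_demo, pos_label=1)

print(cm_demo)
print(recall_potable, recall_not_potable, precision_potable)
```

Which class counts as "positive" depends on the question being asked. For drinking water, missing a non-potable sample is usually the costlier error, which is an argument for watching the recall of class 0.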

### Conclusion
![waterdrinking](https://www.mcgill.ca/oss/files/oss/styles/hd/public/blur-bottle-boy-1126557.jpg?itok=Vbca-UsY&timestamp=1550677505)

- Here, we can see that the XGBoost Classifier reaches a **recall** of **`0.87`**. The confusion-matrix code above treats class `0` (not potable) as the positive class, so this is the recall for non-potable water. This is pretty good, since a high recall means the model produces few False Negatives relative to True Positives: most non-potable samples are correctly flagged.
- We also have an **accuracy** of **`0.67`**, but it is better not to give this too much attention, as accuracy can be misleading on an imbalanced dataset. The imbalance here is not severe, but we should still be careful before judging a model on accuracy alone.
- With these scores, the model is a reasonable starting point for classifying water potability.
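
To put an accuracy figure in context, it helps to compare it with the accuracy of always predicting the majority class. The sketch below uses made-up labels, and its 60/40 split is an illustrative assumption rather than the measured class ratio of this dataset.

```python
import numpy as np

# Hypothetical labels: 60% class 0 (not potable), 40% class 1 (potable)
y_demo = np.array([0] * 60 + [1] * 40)

# A "model" that always predicts the most frequent class
majority_class = np.bincount(y_demo).argmax()
baseline_pred = np.full_like(y_demo, majority_class)

# Accuracy looks decent, yet the minority class is never detected
baseline_accuracy = (baseline_pred == y_demo).mean()
baseline_recall_minority = (baseline_pred[y_demo == 1] == 1).mean()

print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall on class 1:", baseline_recall_minority)
```

A model is only informative to the extent that it beats this baseline, which is why recall and AUC are reported alongside accuracy.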

**If you enjoyed my notebook📒, kindly leave your valuable thoughts below💭. Do point out anything that can be improved.😁👍**