![image](https://i2.wp.com/amazingsoak.com/wp-content/uploads/2019/08/mineral-water-in-qatar.png?fit=1020%2C544&ssl=1)

<h1><center> Introduction </center></h1>

> The human body is 60% to 65% water, and we can't survive more than a few days without it. Fortunately, we have a huge source of water: it covers the majority of the Earth.

> Though 💧 water is the most abundant resource on the Earth, only about 2% of it can be used for drinking. Even that 2% contains suspended solids, sulfates, organic carbon, etc., and if their levels rise above the specified limits, the water can't be used for drinking. So the most abundant resource on the Earth covers 70% of our planet, and yet we can't use most of it!

> The 🎯 aim of this exercise is to find out which parameters influence the potability of water and to differentiate potable from non-potable water using an ML model.

> As the title suggests, this notebook covers the EDA and ML for the dataset. I am using the [PyCaret library](https://pycaret.org/guide/), a user-friendly AutoML library that can save you a ton of time.

#### Run this `!pip install pycaret` before proceeding

In [None]:
!pip install pycaret

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd

from pycaret.classification import *

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(font = 'Serif', style = 'white', rc = {'axes.facecolor':'#f1f1f1', 'figure.facecolor':'#f1f1f1'})

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

<h1><center>Let's dive into the Data 🏊</center></h1>

In [None]:
# Reading the data
df = pd.read_csv('../input/water-potability/water_potability.csv')
df.head()

In [None]:
# checking the data types
df.info()

In [None]:
# 'Potability' is a categorical feature, so changing its data type from int to category
df['Potability'] = df['Potability'].astype('category')

In [None]:
# Looking for the missing values
df.isnull().sum()*100/len(df)

> `ph` and `Trihalomethanes` each have less than 20% missing values, so they can be imputed. I generally follow this 20% rule of thumb, though the right cutoff depends on the data and the business need. `Sulfate` has more than 20% missing values, so let's do some univariate analysis on it: if it looks important we will impute it too, otherwise we will drop the whole column.

> The imputation itself happens during data setup via the `numeric_imputation` parameter (set to `'mean'` in the setup cell). First, let's do the univariate analysis on the `Sulfate` column.
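> The 20% rule of thumb can be written down as a small helper. This is only a sketch on a made-up toy frame: the function name `split_by_missing_share` and the toy values are illustrative assumptions, not part of the dataset or of PyCaret.

```python
import numpy as np
import pandas as pd

def split_by_missing_share(frame, threshold=20.0):
    """Split columns that contain gaps into (below, at_or_above) the threshold in percent."""
    missing_pct = frame.isnull().mean() * 100      # share of NaNs per column, in percent
    missing_pct = missing_pct[missing_pct > 0]     # keep only columns that have gaps
    below = missing_pct[missing_pct < threshold].index.tolist()
    at_or_above = missing_pct[missing_pct >= threshold].index.tolist()
    return below, at_or_above

# Toy stand-in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    'ph': [7.0, np.nan, 6.5, 7.2, 8.1, 6.9, 7.4, 7.1, 6.8, 7.3],                            # 10% missing
    'Sulfate': [330.0, np.nan, np.nan, 310.0, np.nan, 350.0, 340.0, 320.0, 300.0, 360.0],  # 30% missing
    'Hardness': [200.0] * 10,                                                               # nothing missing
})

impute_directly, inspect_first = split_by_missing_share(toy)
print("Impute directly:", impute_directly)
print("Inspect before deciding:", inspect_first)
```

> On the real data you would call `split_by_missing_share(df)` instead of using the toy frame.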

<h3><center>Univariate Analysis on Sulfate column</center></h3>

In [None]:
numeric_col = df.select_dtypes(float).columns.to_list()

In [None]:
fig = plt.figure(figsize = (10,8))
axis = sns.heatmap(df[numeric_col].isnull(), cbar=False)
axis.set_title('Missing Values', size = 16, weight = 'bold')
axis.set_xticklabels(numeric_col, rotation = 30)
axis.set_xlabel('Numeric Features', size = 12, weight = 'bold');

> The missing values are scattered across seemingly random row indexes, so it is hard to identify what is causing them. Let's see how the values are distributed for `Sulfate`.

In [None]:
fig, (axis1,axis2) = plt.subplots(1, 2, figsize = (10,6))
sns.kdeplot(x = 'Sulfate', hue = 'Potability', fill = True, data = df, hue_order=[1,0], ax = axis1)
sns.violinplot(x = 'Potability', y = 'Sulfate', data = df, ax = axis2)

axis1.set_title('Kde plot', size = 12, weight = 'bold')
axis2.set_title('Violin plot', size = 12, weight = 'bold')
fig.suptitle('Sulfate values distribution', size = 16, weight = 'bold');

> In the KDE and violin plots, the distribution of `Sulfate` differs slightly between `Potability` 0 and 1. This suggests `Sulfate` has some influence on `Potability`, so we will keep the column and impute it.

In [None]:
fig, ax = plt.subplots(9, 2, figsize=(12,20), constrained_layout = True)

for i, col in enumerate(numeric_col):
    sns.boxplot(x = 'Potability', y = col, data = df, ax = ax[i][0])
    
    sns.kdeplot(x = col, hue = 'Potability', fill = True, multiple = 'stack',
                alpha = 0.6, linewidth = 1.5, data = df, ax = ax[i][1])
    ax[i][0].set_xlabel(None)
    ax[i][0].set_ylabel(col, size = 14, weight = 'bold')
    ax[i][1].set_xlabel(None)
    ax[i][1].set_ylabel(None)
    
fig.suptitle('Features Analysis', size = 16, weight = 'bold');

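> Many data points lie beyond the whiskers of the box plots, i.e. far outside the 25th to 75th percentile box. My aim is to drop the extreme points in the tails (roughly those above the 95th and below the 5th percentile); you can choose other thresholds as well. The removal is done during data setup using `remove_outliers` with `outliers_threshold = 0.05`. Note that PyCaret applies its own multivariate outlier detection and treats this threshold as the fraction of training rows to drop, so it is not a strict per-feature percentile cut.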
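> To make the percentile idea concrete, here is a minimal hand-rolled sketch of trimming a single feature to its 5th to 95th percentile range. The helper `trim_by_percentile` is a hypothetical name, and this is an illustration of the concept only, not what `remove_outliers` does internally.

```python
import numpy as np

def trim_by_percentile(values, lower=5, upper=95):
    """Keep only the values lying within the [lower, upper] percentile range."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower, upper])   # cut-off values for both tails
    mask = (values >= low) & (values <= high)
    return values[mask]

sample = np.arange(1, 101)          # toy feature: the integers 1..100
kept = trim_by_percentile(sample)
print(len(sample), "->", len(kept), "values kept")
```

> A per-feature cut like this removes a row as soon as any one feature is extreme, so applying it to all nine features can discard far more than 10% of the rows.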

> Now, let's see how the features influence each other.

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(df, hue = 'Potability')

> There doesn't seem to be any linear relationship between the features, as the scatter plots are roughly circular clouds. This suggests there is no multicollinearity, but to be sure, let's compute the correlations between them.

In [None]:
# Correlation between numeric variables
fig=plt.figure(figsize=(10,7))
axis=sns.heatmap(df[numeric_col].corr(), annot=True, linewidths=3, square=True, cmap='Blues', fmt=".0%")

axis.set_title('Correlation between the features', fontsize=16, weight='bold', y=1.05);
axis.set_xticklabels(numeric_col, fontsize=12)
axis.set_yticklabels(numeric_col, fontsize=12, rotation=0);

> The strongest correlation is about 17% (negative), between `Sulfate` and `Solids`. Since the share of variance explained is the square of the correlation, only about 3% (0.17² ≈ 0.03) of the variance in `Solids` can be explained by `Sulfate`, and vice versa.

> So there appears to be no multicollinearity; as a rule of thumb, it becomes a concern when the correlation exceeds roughly 80-85% (positive or negative). If it were present, we could set `remove_multicollinearity` to True during data setup.
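> The link between a correlation coefficient and "variance explained" is easy to check by hand: squaring Pearson's r gives the share of variance that one variable explains in the other under a linear fit. The sketch below uses only the standard library; the helper names and the 0.8 cutoff are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def variance_explained(r):
    """Share of variance explained by a linear relationship with correlation r."""
    return r ** 2

def is_multicollinear(r, threshold=0.8):
    """Rule-of-thumb flag: absolute correlation above the chosen threshold."""
    return abs(r) > threshold

r_sulfate_solids = -0.17                       # value read off the heatmap
share = variance_explained(r_sulfate_solids)   # 0.0289, i.e. about 3%
print(f"r = {r_sulfate_solids}, variance explained = {share:.1%}")
print("Multicollinear?", is_multicollinear(r_sulfate_solids))
```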

<h3><center>Checking for Class Imbalance</center></h3>

In [None]:
colors = ['#06344d', '#00b2ff']

fig = plt.figure(figsize = (10, 6))
ax = sns.countplot(x = 'Potability', data = df)

for i in ax.patches:
    ax.text(x = i.get_x() + i.get_width()/2, y = i.get_height()/7, 
            s = f"{np.round(i.get_height()/len(df)*100, 0)}%", 
            ha = 'center', size = 50, weight = 'bold', rotation = 90, color = 'white')

plt.title("Water Potability Count", size = 20, weight = 'bold')

plt.annotate(text = "Non Potable Water", xytext = (0.6, 1900), xy = (0.1, 1500),
             arrowprops = dict(arrowstyle = "->", color = 'black', connectionstyle = "angle3, angleA = 0, angleB = 90"), 
             color = 'red', weight = 'bold', size = 14)
plt.annotate(text = "Potable Water", xytext = (0.6, 1400), xy = (1.3, 1000), 
             arrowprops = dict(arrowstyle = "->", color = 'black', connectionstyle = "angle3, angleA = 0, angleB = 90"), 
             color = 'green', weight = 'bold', size = 14)

plt.xlabel(None)
plt.ylabel('Number of Samples', weight = 'bold');

> The counts of the two classes are of comparable size (both in the 1000s), so I treat the classes as reasonably balanced. If there were an imbalance, you could set `fix_imbalance` to True (by default it is False) and choose the resampling method with `fix_imbalance_method` (by default it is SMOTE).
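> Besides reading the bar chart, the balance can be quantified with two small helpers. This is a sketch on made-up labels; `class_shares` and `imbalance_ratio` are hypothetical names, and what ratio counts as "imbalanced" is a judgment call rather than a fixed rule.

```python
from collections import Counter

def class_shares(labels):
    """Fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

toy_labels = [0] * 6 + [1] * 4      # invented labels, not the real target column
shares = class_shares(toy_labels)
ratio = imbalance_ratio(toy_labels)
print(shares, ratio)
```

> For the real data, pass `df['Potability']` in place of `toy_labels`.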

> Now everything is ready, so let's set up the data.

## Data Preprocessing 

In [None]:
# Using 'setup' from pycaret.classification for preprocessing the data
clf = setup(df, target = 'Potability',
            remove_outliers = True, outliers_threshold = 0.05, # Treat about 5% of the training rows as outliers and drop them
            numeric_imputation = 'mean', # Imputing missing values with mean
            normalize = True, # Normalizing the features, so that Gradient Descent will converge fast
            normalize_method = 'zscore', # Mean => 0 and std. deviation => 1
            train_size = 0.8,
            fold = 10, # Number of K-folds
            use_gpu = True)

>When you run the cell, press *enter* to proceed, or type *quit* to stop. Once you proceed, you will see all the parameters and their corresponding values. If you are not happy with them, you can change the values and run the setup again. For this example the parameters look fine, so I will proceed.

>📌 If you want to see the train and test data, you can use `get_config('X_train')` and `get_config('X_test')`
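> For intuition about the `normalize_method = 'zscore'` option used in the setup, here is a minimal standard-library sketch of z-score scaling on a toy list. It assumes the population standard deviation is used; the function name `zscore` and the numbers are illustrative only.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale values so that they have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)           # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

raw = [2, 4, 4, 4, 5, 5, 7, 9]       # toy feature with mean 5 and std 2
scaled = zscore(raw)
print(scaled)
```

> After scaling, every feature lives on the same scale, which is why features with large raw ranges such as `Solids` no longer dominate distance-based or gradient-based models.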

<h1><center> Selecting the ML model </center></h1>

> There are many models available for classification, e.g. Logistic Regression, SVC, Decision Tree, Random Forest, and the list goes on... so which model should we choose?

> <center><img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTIM5JO2fkc8oZ_eh2xRZGlSWTLvqaqtJ-UjpM1-0slaP37ZYHMsVmskjPj3mEo4hxvr-k&usqp=CAU"></center>

> One way is to compare the models on a few metrics and then select the best one, so let's compare them first and then decide.

> If you are worried about a page-long block of code for comparing the models, don't be! `compare_models` trains various models such as Logistic Regression, Decision Tree, SVM, Random Forest, XGBoost, etc. and compares them on various metrics such as Accuracy, AUC, Recall, Precision, etc., which makes it easy to choose among them.

In [None]:
best_model = compare_models()

<h2><center> Evaluation Criterion 🧪 </center></h2>

> AUC (Area Under the Curve) - ROC (Receiver Operating Characteristic) curve

><center><img src="https://miro.medium.com/max/722/1*pk05QGzoWhCgRiiFbz-oKQ.png"></center>

> The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability.
>It tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1. In our case, the higher the AUC, the better the model is at distinguishing potable from non-potable water, which is what matters to us, so I choose AUC as the evaluation metric.
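> One way to see what AUC measures: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs. The function name `roc_auc` and the toy scores are assumptions for illustration, and this is not how PyCaret computes the metric internally.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]                    # toy ground truth (1 = potable)
probabilities = [0.1, 0.4, 0.35, 0.8]    # toy predicted scores
auc = roc_auc(labels, probabilities)
print(auc)
```

> An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.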

> **CatBoost** has the highest AUC, so we will select it for model creation.


In [None]:
# Creating the model
catboost = create_model('catboost')

In [None]:
# Results for Test set
result = predict_model(catboost)

<h3><center>Hyperparameter tuning</center></h3>

> Hyperparameter tuning is a time-consuming task: an algorithm has many hyperparameters, and setting them to the right values to get the best results can take a significant amount of ⏳ time ⌛.

> With PyCaret, tuning the hyperparameters of an ML model in any module is as simple as calling `tune_model`. It tunes the hyperparameters of the model passed as an estimator using random grid search over pre-defined grids that are fully customizable.

> I am not tuning the CatBoost model here: when I ran it previously it took around 30 minutes and the AUC score didn't improve much. This mainly applies to heavy models like CatBoost; for lighter models, hyperparameter tuning takes less time. If you want to trade time for a possibly better AUC score, you can run the line below.

`tune_model(catboost, optimize = 'AUC')`
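> To show what a random grid search does conceptually, here is a small standard-library sketch: it samples random combinations from a grid and keeps the best-scoring one. The grid, the `toy_objective` function and its optimum are invented for illustration; in practice the objective would be a cross-validated AUC, and this is not PyCaret's actual implementation.

```python
import random

def random_search(objective, grid, n_iter=10, seed=0):
    """Try n_iter random combinations from the grid and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float('-inf')
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and at random
        params = {name: rng.choice(options) for name, options in grid.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space and objective (higher is better, peak at depth 6, rate 0.1)
grid = {'depth': [2, 4, 6, 8], 'learning_rate': [0.01, 0.03, 0.1, 0.3]}

def toy_objective(params):
    return -abs(params['depth'] - 6) - abs(params['learning_rate'] - 0.1)

best_params, best_score = random_search(toy_objective, grid, n_iter=20)
print(best_params, best_score)
```

> The cost grows with the number of iterations times the cost of one model fit, which is why tuning a heavy model takes so much longer than tuning a light one.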


<h2><center>Analysing the model 🧐</center></h2>

> Analyzing the performance of a trained machine learning model is an integral step in any machine learning workflow. In PyCaret it is as simple as calling `plot_model`, which takes the trained model object and the type of plot as a string.

> The available plots include Area Under the Curve, Discrimination Threshold, Precision-Recall Curve, Confusion Matrix, etc. For more info, click [here](https://pycaret.org/plot-model/).

<div> <h3><center style="background-color:#00b2ff; color:white;">AUC ROC Curve</center></h3></div>

In [None]:
plot_model(catboost)

<div> <h3><center style="background-color:#00b2ff; color:white;">😵 Confusion Matrix</center></h3></div>

In [None]:
plot_model(catboost, plot = 'confusion_matrix')

<div> <h3><center style="background-color:#00b2ff; color:white;">🦬 Decision Boundary 🦬</center></h3></div>

In [None]:
plot_model(catboost, plot = 'boundary')

<div> <h3><center style="background-color:#00b2ff; color:white;">Learning Curve 📈</center></h3></div>

In [None]:
# This may take a while
plot_model(catboost, plot = 'learning')

<h2><center>Interpreting Model 😮</center></h2>

> 📌 For binary classification, `interpret_model` only supports tree-based models such as Decision Tree, Random Forest, XGBoost, CatBoost, LightGBM, etc. So, if you have created some other kind of model, you have to create one of the tree-based models to interpret it.

<h3><center>Feature Importance ❕</center></h3>

In [None]:
plot_model(catboost, 'feature')

In [None]:
interpret_model(catboost)

> This plot is based on the SHAP values. Each dot in the plot is one sample of the data. It shows the following information:

> **Feature importance**: Variables are ranked in descending order.

> **Impact**: The horizontal location shows whether the effect of that value is associated with a higher or lower prediction.

> **Original value**: Color shows whether that variable is high (in red) or low (in blue) for that observation.

>**Correlation**: A high value of `ph` has a high and positive impact on the potability of water. The “high” comes from the red color, and the “positive” impact is shown on the X-axis. Similarly, we can say that `Hardness` is negatively correlated with the potability of water.

> 📘 If you want to learn about them in detail, you can read this [blog](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d).

<h3><center>Local Interpretation</center></h3>

> If you want to know why the model predicted a particular output for a particular sample, you can use this.

In [None]:
interpret_model(catboost, plot = 'reason', observation = 20) # Explaining observation no. 20 of the hold-out (test) set

> **The base value** (-0.582) : the value that would be predicted if we did not know any features for the current output

> **Red/blue**: Features that push the prediction higher (to the right) are shown in red, and those pushing the prediction lower are in blue.

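> The prediction is **0** for sample 20. Note that `observation` indexes the hold-out (test) set rather than the original dataframe, so compare against the true label with `get_config('y_test').iloc[20]` instead of `df.loc[20,'Potability']`.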
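> The way a base value and per-feature pushes add up to a prediction can be reproduced on a tiny example. The sketch below computes exact Shapley values by averaging each feature's marginal contribution over all feature orderings. Everything here is hypothetical: `toy_log_odds` is an invented two-feature model, a single all-zero baseline stands in for the averaged background data that SHAP actually uses, and reading the output as log-odds is an assumption.

```python
import math
from itertools import permutations

def shapley_values(predict, sample, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(sample)
    contributions = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)              # start from the "no information" input
        previous = predict(current)
        for name in order:
            current[name] = sample[name]      # reveal one feature at a time
            updated = predict(current)
            contributions[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in contributions.items()}

def toy_log_odds(x):
    """Invented model output in log-odds space (not the trained CatBoost model)."""
    return -0.5 + 0.25 * x['ph'] - 0.5 * x['Hardness'] + 0.25 * x['ph'] * x['Hardness']

baseline = {'ph': 0.0, 'Hardness': 0.0}       # scaled features at their reference value
sample = {'ph': 2.0, 'Hardness': 1.0}         # the observation being explained

base_value = toy_log_odds(baseline)
phi = shapley_values(toy_log_odds, sample, baseline)
prediction = toy_log_odds(sample)
probability = 1 / (1 + math.exp(-prediction))  # sigmoid turns log-odds into a probability

print("base value:", base_value)
print("contributions:", phi)
print("base + contributions:", base_value + sum(phi.values()), "prediction:", prediction)
```

> Positive contributions correspond to the red arrows in the force plot and negative ones to the blue arrows; together with the base value they always sum to the model output for that sample.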

### That's it for this notebook. If you like it, don't forget to upvote!