### This notebook completes a task from the Water Quality dataset: predict whether water is safe for human consumption.
### [Link for the task](https://www.kaggle.com/adityakadiwal/water-potability/tasks?taskId=4186)

The goal is to predict which water compositions make it drinkable for humans.
From the [description of the dataset](https://www.kaggle.com/adityakadiwal/water-potability), water with the attributes below is considered safe for humans to drink, i.e. potability = 1. Potability = 0 means the water is not suitable for humans to drink.
1. ph value: 6.5-8.5
2. hardness: not defined
3. solids: 500mg/l-1000mg/l
4. chloramines:  up to 4mg/l
5. sulfate: not defined
6. conductivity: up to 400 μS/cm
7. organic carbon: up to 2mg/l
8. trihalomethanes: up to 80 ppm
9. turbidity: up to 5 NTU
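
The ranges above can be expressed as a simple rule check. The sketch below is only an illustration: the column names are assumed to match the Kaggle CSV, the `GUIDELINES` dictionary and the `within_guidelines` helper are hypothetical, and the two sample rows are made up.

```python
import pandas as pd

# Assumed (low, high) bounds copied from the list above; None means "no bound".
GUIDELINES = {
    "ph": (6.5, 8.5),
    "Solids": (500, 1000),
    "Chloramines": (None, 4),
    "Conductivity": (None, 400),
    "Organic_carbon": (None, 2),
    "Trihalomethanes": (None, 80),
    "Turbidity": (None, 5),
}

def within_guidelines(frame):
    """Return a boolean Series: True where every listed attribute is in range."""
    ok = pd.Series(True, index=frame.index)
    for col, (low, high) in GUIDELINES.items():
        if low is not None:
            ok &= frame[col] >= low
        if high is not None:
            ok &= frame[col] <= high
    return ok

# Two made-up samples: the second one has a pH outside 6.5-8.5.
sample = pd.DataFrame({
    "ph": [7.0, 9.2],
    "Solids": [800.0, 800.0],
    "Chloramines": [3.0, 3.0],
    "Conductivity": [350.0, 350.0],
    "Organic_carbon": [1.5, 1.5],
    "Trihalomethanes": [60.0, 60.0],
    "Turbidity": [4.0, 4.0],
})
flags = within_guidelines(sample)
print(flags.tolist())
```

Rows with a missing value in any checked column come out as `False`, because comparisons against NaN evaluate to `False`.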

The analysis consists of four sections:
* Initial analysis
* Statistical analysis
* Hypothesis testing
* Prediction modeling

Special thanks to Jason Brownlee @ machinelearningmastery.com


In [None]:
import numpy as np
import pandas as pd
# viz libraries
import matplotlib.pyplot as plt
import seaborn as sns
import random

In [None]:
df_water_potability = pd.read_csv('../input/water-potability/water_potability.csv')

## Initial analysis - checking the dataframe, finding null values and dealing with them

In [None]:
df_water_potability.head()

In [None]:
df_water_potability.info()

In [None]:
df_water_potability.describe()

In [None]:
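# some columns contain nulls, so replace them with the column mean
# helper that fills the nulls of one column on the given dataframe
def cal_mean(dataframe, col):
    # assign the result back instead of using fillna(inplace=True) on a column,
    # which warns in recent pandas versions and may not update the dataframe
    dataframe[col] = dataframe[col].fillna(dataframe[col].mean())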


In [None]:
# create a copy of the dataframe
df = df_water_potability.copy()
cols = list(df.columns)
cols.remove('Potability')
for col in cols:
    cal_mean(df,col)

In [None]:
# check for NAs again: no more nulls, looks good!
df.info()
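
Filling nulls with means computed on the full dataset lets rows that later land in a test split influence the fill values. One possible alternative, sketched here on made-up numbers, is to learn the means on the training rows only with scikit-learn's `SimpleImputer`. The array names are illustrative and are not part of this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices with missing values (made-up numbers).
X_train_toy = np.array([[7.0, 200.0], [np.nan, 180.0], [6.0, np.nan]])
X_test_toy = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
# The column means are learned from the training rows only ...
X_train_filled = imputer.fit_transform(X_train_toy)
# ... and reused unchanged on the test rows.
X_test_filled = imputer.transform(X_test_toy)

print(imputer.statistics_)  # per-column means of the training rows
print(X_test_filled)
```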

In [None]:
# change potability column type to categorical
df['Potability']=df['Potability'].astype('category')

## Statistical analysis - check stats for all columns

In [None]:
ax = sns.countplot(data = df, x ='Potability')
plt.title('Water Potability', pad = 20)
for i in ax.patches:
    ax.text(x = i.get_x()+i.get_width()/2, 
            y = i.get_height()/7, 
            s = f"{np.round(i.get_height()/len(df)*100,0)}%",
            horizontalalignment='center',
            verticalalignment='center',
            weight='bold', 
            color='white'
           )
plt.grid(False)
plt.show()

It looks like our dataset is unbalanced: there are more data points in the non-potable group than in the potable one, which may affect how well our model predicts.
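
With unbalanced classes, a useful reference point is the accuracy of a model that always predicts the most frequent class. The sketch below shows the calculation on made-up labels. The 6-to-4 split is an assumption for illustration and is not the actual ratio in this dataset.

```python
import pandas as pd

# Made-up labels: 6 non-potable (0) and 4 potable (1) samples.
labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Share of each class in the data.
class_share = labels.value_counts(normalize=True)
# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = class_share.max()

print(class_share)
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any trained classifier should beat this baseline before its accuracy is taken as evidence that it has learned something.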

In [None]:
# what do the stats look like when the water is potable
df[df.Potability == 1].describe()

In [None]:
# what do the stats look like when the water is NOT potable
df[df.Potability == 0].describe()

In [None]:
def boxplot(col):
    r = random.random()
    b = random.random()
    g = random.random()
    clr = (r,b,g)
    sns.boxplot(x = 'Potability' , y = col, data = df, color=clr, showmeans= True)
    plt.title('Distribution for '+ col +' by potability', pad = 20)
    plt.grid(False)
    plt.show()

In [None]:
# check the distribution of each feature, split by whether the water is potable or not
for col in cols:
    boxplot(col)

In [None]:
# check the distribution by histogram viz
df.hist(figsize = (20,10), grid = False)
plt.show()

## Hypothesis testing
* H0: there is no difference in composition between potable and non-potable water
* H1: there is a difference in composition between potable and non-potable water <br>
* Significance level: α = 0.10 (90% confidence level) <br>
* Since we are comparing compositions across water potability, which is a mutually exclusive characteristic (water is either drinkable or not), we will use a two-sample t-test
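
As a rough illustration of how a two-sample test yields a decision, the sketch below runs `scipy.stats.ttest_ind` on synthetic data. The group means, sample sizes and seed are made up, the 0.10 threshold is assumed to match the one used in this notebook, and `equal_var=False` selects Welch's variant, which is a swapped-in choice rather than what the cells below use.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic measurements for two groups with clearly different means.
group_a = rng.normal(loc=10.0, scale=1.0, size=200)
group_b = rng.normal(loc=12.0, scale=1.0, size=200)

alpha = 0.10  # assumed significance level
# equal_var=False requests Welch's t-test (no equal-variance assumption).
stat, pval = ttest_ind(group_a, group_b, equal_var=False)

decision = "reject H0" if pval < alpha else "fail to reject H0"
print(f"t = {stat:.2f}, p = {pval:.4g} -> {decision}")
```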



In [None]:
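# correlation matrix: no relationship, pearson r = 0; moderate, abs(pearson r) around 0.5; strong, abs(pearson r) around 1
# cast the categorical Potability column back to int so that corr() can include it
corrmtrix = df.astype({'Potability': int}).corr()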
plt.subplots(figsize = (10,10))
sns.heatmap(corrmtrix, square = True, annot=True, fmt='.2f')
plt.show()

Results: there is no notable linear relationship between any two compositions, so we do not need to run a Pearson r test to check whether any relationship is significant.
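
For reference, if a pair of features did show a sizeable coefficient, its significance could be checked with `scipy.stats.pearsonr`, which returns both r and a p-value. The sketch below uses synthetic variables only. The names, seed and coefficients are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=300)
# One variable built from x plus noise, one generated independently of x.
y_related = 0.8 * x + rng.normal(scale=0.5, size=300)
y_unrelated = rng.normal(size=300)

r_rel, p_rel = pearsonr(x, y_related)
r_unrel, p_unrel = pearsonr(x, y_unrelated)

print(f"related:   r = {r_rel:.2f}, p = {p_rel:.3g}")
print(f"unrelated: r = {r_unrel:.2f}, p = {p_unrel:.3g}")
```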

In [None]:
# import the two-sample t-test
from scipy.stats import ttest_ind

In [None]:
def getpval(col):
    df_0 = df[df.Potability==0]
    df_1 = df[df.Potability==1]
    ttest, pval = ttest_ind(df_0[col],df_1[col])
    return round(pval,4)

In [None]:
pvals = []
alpha = 0.1  # significance level
for col in cols:
    pval = getpval(col)  # compute once and reuse
    pvals.append(pval)
    if pval < alpha:
        print("REJECT H0: the mean for "+ col + " is different between potable and non-potable water")
    else:
        print("FAIL TO REJECT H0: no significant difference in the mean for "+ col + " between potable and non-potable water")
# Looks like the mean is only significantly different for Solids between potable and non-potable water

## Prediction of drinkable water

In [None]:
from yellowbrick.classifier import ROCAUC
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [None]:
# prepare and split the data into train and test 
X = df.drop('Potability', axis = 1)
y = df['Potability']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
results=[]
models =[GaussianNB(),
         SVC(),
         BaggingClassifier(), 
         GradientBoostingClassifier(), 
         DecisionTreeClassifier(),
         KNeighborsClassifier()]
for model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    print('%s: %f (%f)' % (model, cv_results.mean(), cv_results.std()))


In [None]:
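# label each box with the model's class name rather than the estimator object
plt.boxplot(results, labels = [type(m).__name__ for m in models])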
plt.grid(False)
plt.xticks(rotation = 45)
plt.title('Model Comparison', pad = 10)
plt.show()

### GradientBoostingClassifier has the highest score
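
The comparison above feeds unscaled features to every model. `SVC` and `KNeighborsClassifier` rely on distances, so their scores may change once the features are standardized. The sketch below shows one way to test that with a scikit-learn pipeline. It runs on synthetic data from `make_classification`, not on the water dataset, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 9 features; not the water potability dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# The scaler is re-fitted inside each training fold, so no fold leaks into another.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(scaled_svc, X_demo, y_demo, cv=kfold, scoring="accuracy")

print(f"scaled SVC: {scores.mean():.3f} ({scores.std():.3f})")
```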

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [None]:
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

### The accuracy is 66%. The F1-score is higher for the non-potable class.
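
To see how a per-class F1-score follows from the confusion matrix, the sketch below recomputes it by hand on made-up labels and predictions. The arrays and the `f1_from_counts` helper are hypothetical and do not reproduce this notebook's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels and predictions, skewed toward class 0 (non-potable).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 as the harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Treat class 1 (potable) as the positive class.
f1_potable = f1_from_counts(tp, fp, fn)
# Treat class 0 as positive: the roles of the counts swap.
f1_non_potable = f1_from_counts(tn, fn, fp)

print(f"F1 potable: {f1_potable:.3f}, F1 non-potable: {f1_non_potable:.3f}")
```

In this toy case the majority class gets the higher F1-score because half of the potable samples are missed, which lowers recall for class 1.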