## Water Potability Machine Learning Initial Analysis with Classifiers

### Imports

In [1]:
import sys
import os
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore"

import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score, precision_score, make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import GridSearchCV, HalvingGridSearchCV
from sklearn.feature_selection import SelectFromModel



### Load Data

In [2]:
water_potability_data = pd.read_csv('../data/waterpotability.csv')
water_potability_data.head()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


### Data Cleaning

In [3]:
water_potability_data.isnull().sum()

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64

Three columns (`ph`, `Sulfate`, and `Trihalomethanes`) contain a lot of null values. However, since the dataset is already small, I will fill those values rather than drop the rows with null values.
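
To put the null counts in proportion, it can help to express them as a share of all rows and to check how many rows a plain `dropna()` would leave. The sketch below uses a small made-up frame (`toy`, an assumption standing in for `water_potability_data`), so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for water_potability_data (values are made up).
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, 6.5],
    "Sulfate": [np.nan, np.nan, 350.0, 310.0],
    "Turbidity": [3.0, 4.5, 3.1, 4.6],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Number of rows that would survive dropping every row with a null.
rows_kept = len(toy.dropna())
print(f"Rows kept after dropna: {rows_kept} of {len(toy)}")
```

Running the same two expressions on the real frame would show how much data dropping rows would cost compared with imputing.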

In [4]:
(water_potability_data[['ph', 'Sulfate', 'Trihalomethanes']][water_potability_data['Potability']==0].describe()).loc[['mean', '50%']]

| | ph | Sulfate | Trihalomethanes |
|---|---|---|---|
| mean | 7.085378 | 334.56429 | 66.303555 |
| 50% | 7.035456 | 333.389426 | 66.542198 |


In [5]:
(water_potability_data[['ph', 'Sulfate', 'Trihalomethanes']][water_potability_data['Potability']==1].describe().loc[['mean', '50%']])

| | ph | Sulfate | Trihalomethanes |
|---|---|---|---|
| mean | 7.073783 | 332.56699 | 66.539684 |
| 50% | 7.036752 | 331.838167 | 66.678214 |


In [6]:
potable_mean = water_potability_data[water_potability_data['Potability'] == 1][['ph', 'Sulfate', 'Trihalomethanes']].mean()
potable_median = water_potability_data[water_potability_data['Potability'] == 1][['ph', 'Sulfate', 'Trihalomethanes']].median()

non_potable_mean = water_potability_data[water_potability_data['Potability'] == 0][['ph', 'Sulfate', 'Trihalomethanes']].mean()
non_potable_median = water_potability_data[water_potability_data['Potability'] == 0][['ph', 'Sulfate', 'Trihalomethanes']].median()

mean_diff = potable_mean - non_potable_mean
median_diff = potable_median - non_potable_median

print("Mean Difference:")
print(mean_diff)
print("\nMedian Difference:")
print(median_diff)

Mean Difference:
ph                -0.011595
Sulfate           -1.997299
Trihalomethanes    0.236128
dtype: float64

Median Difference:
ph                 0.001297
Sulfate           -1.551259
Trihalomethanes    0.136016
dtype: float64


The mean and median of `ph`, `Sulfate`, and `Trihalomethanes` differ only negligibly between potable and non-potable water. Because of this, we can impute each column with its overall mean, without conditioning on the class, to obtain the cleaned data.

In [7]:
water_potability_cleaned_data = water_potability_data.copy()
water_potability_cleaned_data['ph'] = water_potability_cleaned_data['ph'].fillna(water_potability_cleaned_data['ph'].mean())
water_potability_cleaned_data['Sulfate'] = water_potability_cleaned_data['Sulfate'].fillna(water_potability_cleaned_data['Sulfate'].mean())
water_potability_cleaned_data['Trihalomethanes'] = water_potability_cleaned_data['Trihalomethanes'].fillna(water_potability_cleaned_data['Trihalomethanes'].mean())
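
The cell above computes the column means on the full dataset, so rows that later end up in the test set contribute to the imputed values. An alternative worth considering is to impute inside the pipeline, so the means are learned from the training rows only. The following is a minimal sketch of that idea on synthetic data; `X_demo`, `y_demo`, and `impute_pipe` are hypothetical names, and this is not the approach used in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the water data (assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out roughly 10% of values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The imputer is the first pipeline step, so fit() only sees training rows.
impute_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
impute_pipe.fit(X_tr, y_tr)

# Column means learned from the training split only.
learned_means = impute_pipe.named_steps["imputer"].statistics_
print(learned_means)

# The same learned means are reused to fill gaps in the test split.
test_predictions = impute_pipe.predict(X_te)
```

With this layout, a grid search over the pipeline would also refit the imputer within each cross-validation fold.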

### Data Prep

In [8]:
X = water_potability_cleaned_data.drop('Potability', axis=1)
y = water_potability_cleaned_data['Potability']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=142, shuffle=True, stratify=y)
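
Because the split uses `stratify=y`, the class proportions should be nearly identical in the training and test sets. A quick way to confirm this is to compare the positive-class rate in both splits. The sketch below does so on made-up labels (`y_demo` is an assumption, not the real target column).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with a 60/40 class split (illustrative only).
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=142, shuffle=True, stratify=y_demo
)

# With stratification, both rates should match the overall rate closely.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"Positive rate, train: {train_pos_rate:.3f}  test: {test_pos_rate:.3f}")
```

On the real data the equivalent check would be `y_train.mean()` against `y_test.mean()`.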

### Build Classifier Model Pipelines

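Use grid search to tune the hyperparameters of each classifier, scoring on accuracy. To save compute time, the SVM uses Halving Grid Search, which evaluates candidates on progressively larger subsets of the training data; the other models use exhaustive Grid Search. In the cells below, the value printed as `Train Score` is `best_score_`, the mean cross-validated score of the best parameter set, rather than a score on the full training set.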
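
To make the difference between the two search strategies concrete, the sketch below runs both on the same small SVM grid using synthetic data. The data, the reduced grid, and the `factor=3` and `random_state=0` settings are assumptions chosen for illustration; they are not the settings used in the cells that follow.

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import GridSearchCV, HalvingGridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.1, 1.0]}

# Exhaustive search: every candidate is cross-validated on all samples.
full = GridSearchCV(pipe, param_grid=grid, cv=5, scoring="accuracy").fit(X_demo, y_demo)

# Successive halving: all candidates start on a small subset, and only the
# best fraction advances to the next, larger round.
halving = HalvingGridSearchCV(
    pipe, param_grid=grid, cv=5, scoring="accuracy", factor=3, random_state=0
).fit(X_demo, y_demo)

print("Exhaustive:", full.best_params_, round(full.best_score_, 3))
print("Halving:   ", halving.best_params_, round(halving.best_score_, 3))
print("Samples used per halving round:", halving.n_resources_)
```

Halving draws random subsets of the data, so its result can vary between runs unless `random_state` is fixed, and the two strategies are not guaranteed to select the same parameters.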

**Logistic Regression Classifier Trained on Accuracy**

In [10]:
extractor = SelectFromModel(LogisticRegression(penalty='l1', solver = 'liblinear', random_state = 42))

lgr_pipe = Pipeline([('scaler', StandardScaler()),
                    ('selector', extractor),
                    ('lgr', LogisticRegression(random_state=42, max_iter = 1000))])

params = {'lgr__C': [0.001, 0.01, 0.1, 1, 10, 100]}

lgr_gs = GridSearchCV(lgr_pipe, param_grid=params, cv=5, scoring='accuracy', n_jobs=5).fit(X_train, y_train)

lgr_train_score = lgr_gs.best_score_
lgr_test_score = lgr_gs.score(X_test, y_test)

lgr_accuracy = accuracy_score(y_test, lgr_gs.predict(X_test))
lgr_precision = precision_score(y_test, lgr_gs.predict(X_test))

print(lgr_gs.best_params_)

print('Train Score:', lgr_train_score)
print('Test Score:', lgr_test_score)

print('Accuracy:', lgr_accuracy)
print('Precision:', lgr_precision)

{'lgr__C': 0.1}
Train Score: 0.6118320610687024
Test Score: 0.6128048780487805
Accuracy: 0.6128048780487805
Precision: 1.0
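
A precision of 1.0 alongside an accuracy of about 61% deserves a closer look. One common way to get this pattern is a model that predicts the positive class very rarely: the few positive predictions are correct, but most actual positives are missed. The sketch below reproduces the pattern with made-up labels (not the notebook's predictions) and shows how a confusion matrix and recall expose it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels: six negatives and four positives.
y_true = np.array([0] * 6 + [1] * 4)
# The "model" flags only one sample as positive, and that one is correct.
y_pred = np.array([0] * 6 + [1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Applying `confusion_matrix(y_test, lgr_gs.predict(X_test))` to the fitted search would show whether this is what happens here.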


**KNN Classifier Trained on Accuracy**

In [11]:
knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

params = {'knn__n_neighbors': [n for n in range(1,22,2)]}

knn_gs = GridSearchCV(knn_pipe, param_grid=params, cv=5, scoring='accuracy', n_jobs=5).fit(X_train, y_train)

knn_train_score = knn_gs.best_score_
knn_test_score = knn_gs.score(X_test, y_test)

knn_accuracy = accuracy_score(y_test, knn_gs.predict(X_test))
knn_precision = precision_score(y_test, knn_gs.predict(X_test))

print(knn_gs.best_params_)

print('Train Score:', knn_train_score)
print('Test Score:', knn_test_score)

print('Accuracy:', knn_accuracy)
print('Precision:', knn_precision)

{'knn__n_neighbors': 15}
Train Score: 0.6496183206106869
Test Score: 0.6448170731707317
Accuracy: 0.6448170731707317
Precision: 0.6095238095238096


**Decision Tree Classifier Trained on Accuracy**

In [12]:
dtree_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('dt', DecisionTreeClassifier(random_state=42))
])

params = {'dt__max_depth': range(1,11,2),
          'dt__min_samples_split': [2,3,4,5],
          'dt__min_samples_leaf': [1,2,3,4,5],
          'dt__criterion': ['gini', 'entropy']}

dtree_gs = GridSearchCV(dtree_pipe, param_grid=params, cv=5, scoring='accuracy', n_jobs=5).fit(X_train, y_train)

dtree_train_score = dtree_gs.best_score_
dtree_test_score = dtree_gs.score(X_test, y_test)

dtree_accuracy = accuracy_score(y_test, dtree_gs.predict(X_test))
dtree_precision = precision_score(y_test, dtree_gs.predict(X_test))

print(dtree_gs.best_params_)

print('Train Score:', dtree_train_score)
print('Test Score:', dtree_test_score)

print('Accuracy:', dtree_accuracy)
print('Precision:', dtree_precision)

{'dt__criterion': 'entropy', 'dt__max_depth': 9, 'dt__min_samples_leaf': 2, 'dt__min_samples_split': 5}
Train Score: 0.6465648854961833
Test Score: 0.6189024390243902
Accuracy: 0.6189024390243902
Precision: 0.53


**SVM Classifier Trained on Accuracy**

In [13]:
svm_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

params = {'svm__C': [0.1, 1, 10],
          'svm__kernel': ['rbf', 'poly', 'linear', 'sigmoid'],
          'svm__gamma': [0.1, 1.0, 10.0]}

svm_gs = HalvingGridSearchCV(svm_pipe, param_grid=params, cv=5, scoring='accuracy', n_jobs=5).fit(X_train, y_train)

svm_train_score = svm_gs.best_score_
svm_test_score = svm_gs.score(X_test, y_test)

svm_accuracy = accuracy_score(y_test, svm_gs.predict(X_test))
svm_precision = precision_score(y_test, svm_gs.predict(X_test))

print(svm_gs.best_params_)

print('Train Score:', svm_train_score)
print('Test Score:', svm_test_score)

print('Accuracy:', svm_accuracy)
print('Precision:', svm_precision)

{'svm__C': 0.1, 'svm__gamma': 0.1, 'svm__kernel': 'poly'}
Train Score: 0.6137667304015296
Test Score: 0.6112804878048781
Accuracy: 0.6112804878048781
Precision: 1.0


**Random Forest Classifier Trained on Accuracy**

In [14]:
rf_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42))
])

params = {'rf__n_estimators': [100, 200, 300],
            'rf__max_depth': [10, 20, 30],
            'rf__min_samples_split': [2, 5, 10],
            'rf__min_samples_leaf': [1, 2, 4]}

rf_gs = GridSearchCV(rf_pipe, param_grid=params, cv=5, scoring='accuracy', n_jobs=5).fit(X_train, y_train)

rf_train_score = rf_gs.best_score_
rf_test_score = rf_gs.score(X_test, y_test)

rf_accuracy = accuracy_score(y_test, rf_gs.predict(X_test))
rf_precision = precision_score(y_test, rf_gs.predict(X_test))

print(rf_gs.best_params_)
print('Train Score:', rf_train_score)
print('Test Score:', rf_test_score)

print('Accuracy:', rf_accuracy)
print('Precision:', rf_precision)

{'rf__max_depth': 20, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 5, 'rf__n_estimators': 200}
Train Score: 0.6805343511450381
Test Score: 0.6539634146341463
Accuracy: 0.6539634146341463
Precision: 0.6239316239316239


In [15]:
findings_dict = {'model': ['Logistic Regression', 'KNN', 'Decision Tree', 'SVC', 'Random Forest'], 
                'train score': [lgr_train_score, knn_train_score, dtree_train_score, svm_train_score, rf_train_score],
                'test score': [lgr_test_score, knn_test_score, dtree_test_score, svm_test_score, rf_test_score],
                'accuracy': [lgr_accuracy, knn_accuracy, dtree_accuracy, svm_accuracy, rf_accuracy],
                'precision': [lgr_precision, knn_precision, dtree_precision, svm_precision, rf_precision]}

findings_df = pd.DataFrame(findings_dict)
findings_df

| | model | train score | test score | accuracy | precision |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.611832 | 0.612805 | 0.612805 | 1.0 |
| 1 | KNN | 0.649618 | 0.644817 | 0.644817 | 0.609524 |
| 2 | Decision Tree | 0.646565 | 0.618902 | 0.618902 | 0.53 |
| 3 | SVC | 0.613767 | 0.61128 | 0.61128 | 1.0 |
| 4 | Random Forest | 0.680534 | 0.653963 | 0.653963 | 0.623932 |
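
The five model cells above repeat the same scoring and printing steps. One possible way to reduce that repetition is a small helper that collects the figures from any fitted search object. The sketch below is a hypothetical refactoring (`summarize_search` is not part of this notebook), demonstrated on synthetic data with a reduced KNN grid.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def summarize_search(search, X_test, y_test):
    """Collect the per-model figures from a fitted search (hypothetical helper)."""
    y_pred = search.predict(X_test)
    return {
        "best_params": search.best_params_,
        # best_score_ is the mean cross-validated score of the best candidate.
        "cv_score": search.best_score_,
        # score() applies the metric the search was configured with.
        "test_score": search.score(X_test, y_test),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
    }


# Synthetic stand-in for the water potability data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=142, stratify=y_demo
)

knn_pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
knn_search = GridSearchCV(
    knn_pipe, param_grid={"knn__n_neighbors": [3, 5, 7]}, cv=5, scoring="accuracy"
).fit(X_tr, y_tr)

summary = summarize_search(knn_search, X_te, y_te)
print(summary)
```

When the search is scored on accuracy, `test_score` and `accuracy` are the same number, which is why those two columns match in the table above.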


Next, we will tune the same models for precision. As before, Halving Grid Search is used for the SVM to save compute time, and Grid Search is used for the rest.

In [16]:
precision_scorer = make_scorer(precision_score, pos_label=1)
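
When a model predicts no positives at all in a cross-validation fold, precision is undefined; scikit-learn then returns a fallback value and normally emits a warning, which this notebook suppresses in the imports cell. A possible variant of the scorer makes that fallback explicit through the `zero_division` argument of `precision_score`. The sketch below uses made-up labels, and `explicit_precision_scorer` is a hypothetical name.

```python
import numpy as np
from sklearn.metrics import make_scorer, precision_score

# Made-up labels; the "model" never predicts the positive class.
y_true = np.array([0, 0, 1, 1])
y_all_negative = np.zeros(4, dtype=int)

# With no positive predictions, precision is 0/0; zero_division sets the fallback.
fallback_precision = precision_score(y_true, y_all_negative, pos_label=1, zero_division=0)
print(fallback_precision)

# Extra keyword arguments to make_scorer are forwarded to the metric.
explicit_precision_scorer = make_scorer(precision_score, pos_label=1, zero_division=0)
```

Folds scored this way pull the mean cross-validated precision down, which is one possible reason a search's `best_score_` can sit well below its test precision.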

**Logistic Regression Classifier Trained on Precision**

In [17]:
extractor = SelectFromModel(LogisticRegression(penalty='l1', solver = 'liblinear', random_state = 42))

lgr_pipe = Pipeline([('scaler', StandardScaler()),
                    ('selector', extractor),
                    ('lgr', LogisticRegression(random_state=42, max_iter = 1000))])

params = {'lgr__C': [0.001, 0.01, 0.1, 1, 10, 100]}

lgr_gs = GridSearchCV(lgr_pipe, param_grid=params, cv=5, scoring=precision_scorer, n_jobs=5).fit(X_train, y_train)

lgr_train_score = lgr_gs.best_score_
lgr_test_score = lgr_gs.score(X_test, y_test)

lgr_accuracy = accuracy_score(y_test, lgr_gs.predict(X_test))
lgr_precision = precision_score(y_test, lgr_gs.predict(X_test))

print(lgr_gs.best_params_)

print('Train Score:', lgr_train_score)
print('Test Score:', lgr_test_score)

print('Accuracy:', lgr_accuracy)
print('Precision:', lgr_precision)

{'lgr__C': 0.1}
Train Score: 0.5833333333333334
Test Score: 1.0
Accuracy: 0.6128048780487805
Precision: 1.0


**KNN Classifier Trained on Precision**

In [18]:
knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

params = {'knn__n_neighbors': [n for n in range(1,22,2)]}

knn_gs = GridSearchCV(knn_pipe, param_grid=params, cv=5, scoring=precision_scorer, n_jobs=5).fit(X_train, y_train)

knn_train_score = knn_gs.best_score_
knn_test_score = knn_gs.score(X_test, y_test)

knn_accuracy = accuracy_score(y_test, knn_gs.predict(X_test))
knn_precision = precision_score(y_test, knn_gs.predict(X_test))

print(knn_gs.best_params_)

print('Train Score:', knn_train_score)
print('Test Score:', knn_test_score)

print('Accuracy:', knn_accuracy)
print('Precision:', knn_precision)

{'knn__n_neighbors': 21}
Train Score: 0.6216074424745238
Test Score: 0.6444444444444445
Accuracy: 0.649390243902439
Precision: 0.6444444444444445


**Decision Tree Classifier Trained on Precision**

In [19]:
dtree_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('dt', DecisionTreeClassifier(random_state=42))
])

params = {'dt__max_depth': range(1,11,2),
          'dt__min_samples_split': [2,3,4,5],
          'dt__min_samples_leaf': [1,2,3,4,5],
          'dt__criterion': ['gini', 'entropy']}

dtree_gs = GridSearchCV(dtree_pipe, param_grid=params, cv=5, scoring=precision_scorer, n_jobs=5).fit(X_train, y_train)

dtree_train_score = dtree_gs.best_score_
dtree_test_score = dtree_gs.score(X_test, y_test)

dtree_accuracy = accuracy_score(y_test, dtree_gs.predict(X_test))
dtree_precision = precision_score(y_test, dtree_gs.predict(X_test))

print(dtree_gs.best_params_)

print('Train Score:', dtree_train_score)
print('Test Score:', dtree_test_score)

print('Accuracy:', dtree_accuracy)
print('Precision:', dtree_precision)

{'dt__criterion': 'gini', 'dt__max_depth': 3, 'dt__min_samples_leaf': 1, 'dt__min_samples_split': 2}
Train Score: 0.690488442375235
Test Score: 0.5172413793103449
Accuracy: 0.6112804878048781
Precision: 0.5172413793103449


**SVM Classifier Trained on Precision**

In [20]:
svm_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

params = {'svm__C': [0.1, 1, 10],
          'svm__kernel': ['rbf', 'poly', 'linear', 'sigmoid'],
          'svm__gamma': [0.1, 1.0, 10.0]}

svm_gs = HalvingGridSearchCV(svm_pipe, param_grid=params, cv=5, scoring=precision_scorer, n_jobs=5).fit(X_train, y_train)

svm_train_score = svm_gs.best_score_
svm_test_score = svm_gs.score(X_test, y_test)

svm_accuracy = accuracy_score(y_test, svm_gs.predict(X_test))
svm_precision = precision_score(y_test, svm_gs.predict(X_test))

print(svm_gs.best_params_)

print('Train Score:', svm_train_score)
print('Test Score:', svm_test_score)

print('Accuracy:', svm_accuracy)
print('Precision:', svm_precision)

{'svm__C': 1, 'svm__gamma': 0.1, 'svm__kernel': 'rbf'}
Train Score: 0.7292023793848609
Test Score: 0.686046511627907
Accuracy: 0.6585365853658537
Precision: 0.686046511627907


**Random Forest Classifier Trained on Precision**

In [21]:
rf_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42))
])

params = {'rf__n_estimators': [100, 200, 300],
            'rf__max_depth': [10, 20, 30],
            'rf__min_samples_split': [2, 5, 10],
            'rf__min_samples_leaf': [1, 2, 4]}

rf_gs = GridSearchCV(rf_pipe, param_grid=params, cv=5, scoring=precision_scorer, n_jobs=5).fit(X_train, y_train)

rf_train_score = rf_gs.best_score_
rf_test_score = rf_gs.score(X_test, y_test)

rf_accuracy = accuracy_score(y_test, rf_gs.predict(X_test))
rf_precision = precision_score(y_test, rf_gs.predict(X_test))

print(rf_gs.best_params_)
print('Train Score:', rf_train_score)
print('Test Score:', rf_test_score)

print('Accuracy:', rf_accuracy)
print('Precision:', rf_precision)

{'rf__max_depth': 10, 'rf__min_samples_leaf': 4, 'rf__min_samples_split': 2, 'rf__n_estimators': 200}
Train Score: 0.7389169995979697
Test Score: 0.7
Accuracy: 0.6524390243902439
Precision: 0.7


In [22]:
findings_dict = {'model': ['Logistic Regression', 'KNN', 'Decision Tree', 'SVC', 'Random Forest'], 
                'train score': [lgr_train_score, knn_train_score, dtree_train_score, svm_train_score, rf_train_score],
                'test score': [lgr_test_score, knn_test_score, dtree_test_score, svm_test_score, rf_test_score],
                'accuracy': [lgr_accuracy, knn_accuracy, dtree_accuracy, svm_accuracy, rf_accuracy],
                'precision': [lgr_precision, knn_precision, dtree_precision, svm_precision, rf_precision]}

findings_df = pd.DataFrame(findings_dict)
findings_df

| | model | train score | test score | accuracy | precision |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.583333 | 1.0 | 0.612805 | 1.0 |
| 1 | KNN | 0.621607 | 0.644444 | 0.64939 | 0.644444 |
| 2 | Decision Tree | 0.690488 | 0.517241 | 0.61128 | 0.517241 |
| 3 | SVC | 0.729202 | 0.686047 | 0.658537 | 0.686047 |
| 4 | Random Forest | 0.738917 | 0.7 | 0.652439 | 0.7 |


### Results from Initial Analysis with Classifiers

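Taking accuracy and precision together, the best model is the Random Forest classifier tuned for precision, with a test accuracy of ~65% and precision of ~70%. The SVC tuned for precision is close behind (~66% accuracy, ~69% precision). The precision of 1.0 reported for Logistic Regression comes with an accuracy of only ~61%, which suggests the model rarely predicts the potable class, so it should not be read as strong performance. In this initial analysis, none of the classifiers tuned for accuracy or precision performed well.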

In the final analysis, it will be worthwhile to explore other metrics like recall, F1-score, or ROC-AUC, especially given the imbalanced nature of the dataset. These metrics can provide a more holistic view of model performance. 
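
As a starting point for that follow-up, the additional metrics can be computed from the same predictions, with ROC-AUC using predicted probabilities instead of hard labels. The sketch below runs on synthetic, mildly imbalanced data with a small Random Forest; the data and model settings are assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly a 60/40 class split (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, weights=[0.6, 0.4], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=142, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

y_pred = clf.predict(X_te)               # hard labels for recall and F1
y_score = clf.predict_proba(X_te)[:, 1]  # positive-class probability for ROC-AUC

recall = recall_score(y_te, y_pred)
f1 = f1_score(y_te, y_pred)
auc = roc_auc_score(y_te, y_score)

print(f"recall={recall:.3f} f1={f1:.3f} roc_auc={auc:.3f}")
print(classification_report(y_te, y_pred))
```

The same metrics can also be used for tuning by passing `scoring='recall'`, `scoring='f1'`, or `scoring='roc_auc'` to the grid search.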

Since performance metrics for the classifier models aren't high, balancing the dataset (undersampling, oversampling, SMOTE, etc.) could lead to some improvements.
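
As one example of balancing, the minority class can be randomly oversampled in the training split only, leaving the test split untouched so that evaluation still reflects the original class mix. The sketch below uses `sklearn.utils.resample` on made-up data; it shows plain random oversampling rather than SMOTE, and all names in it are hypothetical.

```python
import numpy as np
from sklearn.utils import resample

# Made-up training split with a 60/40 class imbalance (illustrative only).
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 3))
y_tr = np.array([0] * 60 + [1] * 40)

X_majority = X_tr[y_tr == 0]
X_minority = X_tr[y_tr == 1]

# Sample the minority class with replacement until it matches the majority size.
X_minority_up = resample(
    X_minority, replace=True, n_samples=len(X_majority), random_state=0
)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.array([0] * len(X_majority) + [1] * len(X_minority_up))

print("Class counts after oversampling:", np.bincount(y_balanced))
```

A lighter alternative that needs no resampling is `class_weight='balanced'`, which `LogisticRegression`, `SVC`, `DecisionTreeClassifier`, and `RandomForestClassifier` all accept.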