<a href="https://colab.research.google.com/github/CodeMonkey18/WaterPotability/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# C964 CS Capstone:

**Predicting if a body of water is potable or not potable**

The purpose of this application is to implement a machine learning model that predicts the potability of a body of water from the following water quality metrics:

---

1. pH value - Measure of how acidic or basic the water is on a scale of 0 (acidic) to 14 (basic).

2. Hardness - Concentration of dissolved calcium carbonate in milligrams per liter (mg/L).

3. Solids - Concentration of total dissolved solids in parts per million (ppm).

4. Chloramines - Concentration of chloramines in parts per million (ppm).

5. Sulfate - Concentration of sulfate in milligrams per liter (mg/L).

6. Conductivity - Electrical conductivity in microsiemens per centimeter (μS/cm).

7. Organic Carbon - Concentration of organic carbon in parts per million (ppm).

8. Trihalomethanes - Concentration of trihalomethanes in micrograms per liter (µg/L).

9. Turbidity - Measure of clearness/transparency in nephelometric turbidity units (NTU).

---

These metrics are used to predict whether a body of water is potable (1) or not potable (0). Potability refers to whether the water is suitable for consumption, whether it's for drinking, cooking, cleaning, or other household purposes.

In [36]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.stats import zscore
import ipywidgets as widgets
import logging
from IPython.display import display, clear_output

dashboard_output = widgets.Output()

# parsing the csv into a dataframe
df = pd.read_csv('/content/WaterPotability/water_potability.csv')

# display raw dataset
df

,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
0,,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.990970,2.963135,0
1,3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
2,8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
...,...,...,...,...,...,...,...,...,...,...
3271,4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821,1
3272,7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,,2.798243,1
3273,9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.845400,3.298875,1
3274,5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1


In [37]:
# drop rows containing missing values
df_cleaned = df.dropna()

# keep only rows whose absolute z-score in the given column is below 2.5
def remove_outliers(frame, column):
    z_scores = zscore(frame[column])
    return frame[np.abs(z_scores) < 2.5]

# Loop over each column in the dataset to remove outliers
for column in df_cleaned.columns:
    df_cleaned = remove_outliers(df_cleaned, column)

df_cleaned.to_csv('cleaned_dataset.csv', index=False)

# display cleaned dataset
df_cleaned

,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
5,5.584087,188.313324,28748.687739,7.544869,326.678363,280.467916,8.399735,54.917862,2.559708,0
6,10.223862,248.071735,28749.716544,7.513408,393.663396,283.651634,13.789695,84.603556,2.672989,0
7,8.635849,203.361523,13672.091764,4.563009,303.309771,474.607645,12.363817,62.798309,4.401425,0
...,...,...,...,...,...,...,...,...,...,...
3264,5.893103,239.269481,20526.666156,6.349561,341.256362,403.617560,18.963707,63.846319,4.390702,1
3265,8.197353,203.105091,27701.794055,6.472914,328.886838,444.612724,14.250875,62.906205,3.361833,1
3267,8.989900,215.047358,15921.412018,6.297312,312.931022,390.410231,9.899115,55.069304,4.613843,1
3268,6.702547,207.321086,17246.920347,7.708117,304.510230,329.266002,16.217303,28.878601,3.442983,1


1544 records were removed after cleaning the dataset, leaving behind 1732 records representing 1732 bodies of water.
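As a rough illustration of the z-score rule used in the cleaning cell, the sketch below applies the same threshold to a small made-up column. The `toy` frame and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy column: nine typical pH readings and one extreme value.
toy = pd.DataFrame({"ph": [6.8, 7.0, 7.1, 7.2, 6.9, 7.3, 7.0, 6.7, 7.1, 13.5]})

# Same rule as remove_outliers: keep rows whose |z-score| is below 2.5.
z = zscore(toy["ph"])
kept = toy[np.abs(z) < 2.5]
removed = len(toy) - len(kept)

print(f"kept {len(kept)} rows, removed {removed}")
```

Note that the cleaning loop filters one column at a time, so each column's z-scores are computed on the rows that survived the previous columns' filters.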

The dataset has 10 columns. Some columns, such as 'ph' and 'Solids', have very different ranges: pH values fall between 0 and 14, while Solids values are in the tens of thousands. The features will need to be scaled before fitting the logistic regression model.
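To make the effect of scaling concrete, here is a minimal sketch that standardizes two columns with very different ranges. The `raw` array and the `scaler_demo` name are illustrative assumptions, not values or objects from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of [ph, Solids]; the values are illustrative only.
raw = np.array([
    [6.5, 15000.0],
    [7.0, 20000.0],
    [7.5, 25000.0],
    [8.0, 30000.0],
])

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(raw)

# After scaling, each column has mean 0 and unit standard deviation,
# so neither feature dominates purely because of its units.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```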

In [38]:
# 0 = not potable | 1 = potable
df_cleaned['Potability'].value_counts()

Potability,count
0,1057
1,675


In [39]:
df_cleaned.describe()

,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
count,1732.0,1732.0,1732.0,1732.0,1732.0,1732.0,1732.0,1732.0,1732.0,1732.0
mean,7.073966,196.130257,21420.706572,7.124566,333.803191,425.343132,14.453381,66.217192,3.969863,0.389723
std,1.440891,29.674831,7851.60006,1.423745,36.966745,78.75271,3.108155,15.234073,0.737723,0.487828
min,3.230973,116.29933,1198.943699,3.267984,233.870327,233.907965,6.306055,26.140863,2.081425,0.0
25%,6.104328,177.703207,15494.790204,6.18452,308.943839,366.541531,12.272099,55.955335,3.446485,0.0
50%,7.026504,197.366035,20484.782948,7.128619,332.477743,421.653717,14.359358,66.112541,3.969391,0.0
75%,8.013683,215.441782,26680.506644,8.065929,357.999697,481.746153,16.779459,77.10154,4.502519,1.0
max,10.905076,277.116946,43195.473668,10.999995,433.021506,624.229901,22.641598,106.243066,5.864498,1.0


In [40]:
logistic_model = LogisticRegression(max_iter=3000)

# dependent variable y = potability
y = df_cleaned.iloc[:,9]

# independent variables x = all other metrics except for potability
x = df_cleaned.iloc[:,:-1]

accuracy_scores = []

for _ in range(500):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=None)

    # scaling the data
    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)

    logistic_model.fit(x_train_scaled, y_train)

    y_prediction = logistic_model.predict(x_test_scaled)

    accuracy = accuracy_score(y_test, y_prediction)
    accuracy_scores.append(accuracy)

mean_accuracy = np.mean(accuracy_scores)
print(f"Average accuracy over 500 iterations: {mean_accuracy}")

Average accuracy over 500 iterations: 0.6072615384615384
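For context, the class counts from the `value_counts()` output shown earlier imply a majority-class baseline. The sketch below computes the accuracy of a trivial model that always predicts "not potable" on the cleaned dataset; it is a reference point only, and test splits will vary slightly around it.

```python
# Class counts taken from the value_counts() output of the cleaned dataset.
not_potable, potable = 1057, 675

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = not_potable / (not_potable + potable)

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
```

The reported mean accuracy of about 0.607 is close to this baseline, which suggests that accuracy alone may not show how much the model has learned from the features.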


In [41]:
# create a bar chart to plot potability against pH values

df_bar = df_cleaned.copy()

df_bar['pH_bin'] = pd.cut(df_bar['ph'], bins=range(0,15,1))

def create_barchart(b):
    with dashboard_output:
        dashboard_output.clear_output()
        plt.figure(figsize=(10,4))
        sns.countplot(x='pH_bin', hue='Potability', data=df_bar)

        plt.title('Distribution of potable and non-potable bodies of water at each pH value')
        plt.xlabel('pH values')
        plt.legend(labels=['Non-potable', 'Potable'])
        plt.show()

The bar chart, displayed in the dashboard further down, shows the distribution of potable and non-potable bodies of water in the cleaned dataset across different pH levels.

Overall, the ratio of potable to non-potable bodies of water appears close to even near the middle of the graph (pH 6-7, 7-8, 8-9) and becomes more skewed towards non-potable when approaching either end (pH 3-4, 10-11).
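One way to check this reading of the chart numerically is to compute the share of potable rows in each pH bin. The helper below is a sketch: `potable_share_by_bin` is a hypothetical name, and the `toy` rows are made up for illustration.

```python
import pandas as pd

def potable_share_by_bin(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of potable rows in each unit-wide pH bin."""
    bins = pd.cut(frame["ph"], bins=range(0, 15, 1))
    # Potability is 0/1, so the group mean is the potable share.
    return frame.groupby(bins, observed=True)["Potability"].mean()

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "ph": [3.5, 3.8, 6.2, 6.9, 7.4, 7.6],
    "Potability": [0, 0, 1, 0, 1, 1],
})
shares = potable_share_by_bin(toy)

print(shares)
```

On the notebook's data, the call would be `potable_share_by_bin(df_cleaned)`.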

In [42]:
# create scatter plot to visualize correlation between 2 variables: solids and turbidity.

def create_scatter(b):
    with dashboard_output:
        dashboard_output.clear_output()

        plt.scatter(df_cleaned['Solids'], df_cleaned['Turbidity'], alpha=0.3)
        plt.title('Correlation between Solids concentration and Turbidity')
        plt.xlabel('Solids Concentration (ppm)')
        plt.ylabel('Turbidity (NTU)')
        plt.show()

The scatterplot visualizes the relationship between solids concentration and turbidity.

Initially, it was assumed that a high solids concentration would result in high turbidity, but this graph shows a fairly random spread with no significant relationship between the two variables.
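A visual impression of "no relationship" can be backed by a Pearson correlation coefficient. The sketch below uses synthetic data to show what a strong and a near-zero coefficient look like; `pearson_r` is a hypothetical helper and the arrays are generated, not drawn from the dataset.

```python
import numpy as np

def pearson_r(a, b) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic data: y_linear depends on x, y_noise does not.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y_linear = 2.0 * x + rng.normal(scale=0.1, size=500)
y_noise = rng.normal(size=500)

print(f"linear: {pearson_r(x, y_linear):.2f}")
print(f"noise:  {pearson_r(x, y_noise):.2f}")
```

For the notebook's data, `pearson_r(df_cleaned['Solids'], df_cleaned['Turbidity'])` would give the corresponding value for the plotted pair.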

In [43]:
# create heatmap to visualize correlation strength between the variables

def create_heatmap(b):
    with dashboard_output:
        dashboard_output.clear_output()
        matrix = df_cleaned.corr()
        plt.figure(figsize=(10,4))
        np.fill_diagonal(matrix.values, np.nan)
        sns.heatmap(matrix, annot=True, cmap='Blues', fmt='.2f', mask=np.isnan(matrix))
        plt.title('Heatmap of Water Quality Metrics')
        plt.show()

create_heatmap(None)

The heatmap shows the correlation strength between the variables.

Some notable observations: pH and hardness have a comparatively high positive correlation with each other, while sulfate and solids concentrations have a comparatively high negative correlation with each other.
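Reading the strongest pairs off a heatmap by eye can be error-prone, so a short helper that ranks pairs by absolute correlation may help. This is a sketch under assumptions: `strongest_pairs` is a hypothetical function, and the `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n variable pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by magnitude so strong negative correlations rank highly too.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy columns, not taken from the dataset.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of column a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top = strongest_pairs(toy, n=1)

print(top)
```

On the notebook's data, `strongest_pairs(df_cleaned.drop(columns='Potability'))` would list the leading feature pairs with their signed coefficients.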

In [44]:
ph_widget = widgets.FloatSlider(description='pH value:',min=0.0, max=14.0)
hardness_widget = widgets.FloatSlider(description='Hardness:',min=100.0, max=300.0)
solids_widget = widgets.FloatSlider(description='Solids (ppm):',min=1000.0, max=50000.0)
chloramines_widget = widgets.FloatSlider(description='Chloramines (ppm):',min=3.0, max=11.0)
sulfate_widget = widgets.FloatSlider(description='Sulfate:',min=200.0, max=500.0)
conductivity_widget = widgets.FloatSlider(description='Conductivity:',min=200.0, max=700.0)
organic_carbon_widget = widgets.FloatSlider(description='Organic Carbon:',min=5.0, max=25.0)
trihalomethanes_widget = widgets.FloatSlider(description='Trihalomethanes:',min=25.0, max=110.0)
turbidity_widget = widgets.FloatSlider(description='Turbidity:',min=2.0, max=6.0)

button_predict = widgets.Button( description='Predict' )
button_output = widgets.Label(value='Move the sliders to your desired values, then click the button when you are ready.' )
button_predict_breakdown = widgets.Label(value='')

# keys match the names in feature_names so validate_sliders can look them up
VALID_INPUTS = {
    'ph': (0.0, 14.0),
    'Hardness': (100.0, 300.0),
    'Solids': (1000.0, 50000.0),
    'Chloramines': (3.0, 11.0),
    'Sulfate': (200.0, 500.0),
    'Conductivity': (200.0, 700.0),
    'Organic_carbon': (5.0, 25.0),
    'Trihalomethanes': (25.0, 110.0),
    'Turbidity': (2.0, 6.0)
}

logging.basicConfig(filename='user_queries.log', level=logging.INFO, format='%(asctime)s - %(message)s')

def validate_sliders(val, param):
    # returns (is_valid, message); parameters without a configured range pass
    valid_range = VALID_INPUTS.get(param)
    if valid_range is None:
        return True, ''

    # avoid shadowing the built-in min() and max()
    low, high = valid_range
    if low <= val <= high:
        return True, ''
    else:
        message = f'Invalid input for {param}: {val}. Must be between {low} and {high}'
        return False, message


def trigger_predict(b):

    input_data = [[ph_widget.value, hardness_widget.value, solids_widget.value, chloramines_widget.value, sulfate_widget.value, conductivity_widget.value, organic_carbon_widget.value, trihalomethanes_widget.value, turbidity_widget.value]]
    feature_names = [
    'ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate',
    'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity']

    logging.info(f"User input: {dict(zip(feature_names, input_data[0]))}")

    input_df = pd.DataFrame(input_data, columns=feature_names)

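    for val, param in zip(input_data[0], feature_names):
        # validate_sliders returns (is_valid, message); testing the tuple
        # itself would always be truthy, so unpack it first
        is_valid, message = validate_sliders(val, param)
        if not is_valid:
            button_output.value = f'Invalid input detected. {message}'
            return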

    input_scaled = scaler.transform(input_df)

    prediction = logistic_model.predict(input_scaled)
    prediction_breakdown = logistic_model.predict_proba(input_scaled)

    if prediction[0] == 0:
        button_output.value='Prediction = 0 (not potable)'
        logging.info("Prediction: not potable")

    else:
        button_output.value='Prediction = 1 (potable)'
        logging.info("Prediction: potable")


    breakdown_str = "not potable: {:.2f}, potable: {:.2f}".format(prediction_breakdown[0][0], prediction_breakdown[0][1])
    button_predict_breakdown.value = '(Probability breakdown: ' + breakdown_str + ')'
    logging.info(f"Probability breakdown: {breakdown_str}")

button_predict.on_click(trigger_predict)

def create_prediction(b):
    with dashboard_output:
        dashboard_output.clear_output()
        vb=widgets.VBox([ph_widget, hardness_widget, solids_widget, chloramines_widget, sulfate_widget, conductivity_widget, organic_carbon_widget, trihalomethanes_widget, turbidity_widget, button_predict, button_output, button_predict_breakdown])
        print('\033[1m' + 'Enter parameter values to predict whether a body of water is potable or not potable:' + '\033[0m')
        display(vb)

create_prediction(None)

There are two possible prediction values: 0 for not potable, 1 for potable.

The probability breakdown represents the probability for each of the 2 classes: potable and not potable. For example, a breakdown of 'not potable: 0.67, potable: 0.33' would mean that there is a 67% probability the body of water is not potable and a 33% chance it is potable.
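The relationship between the predicted label and the probability breakdown can be seen on a tiny example. The sketch below trains a separate logistic regression on made-up one-feature data; `demo_model`, `X`, and `y` are illustrative assumptions and are unrelated to the notebook's fitted model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature training data, for illustration only.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

demo_model = LogisticRegression().fit(X, y)

queries = np.array([[1.5], [3.5]])
proba = demo_model.predict_proba(queries)
labels = demo_model.predict(queries)

# Columns of proba follow demo_model.classes_ (here 0 then 1), and each
# row sums to 1. The predicted label is the class with the larger probability.
print(proba.round(2))
print(labels)
```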

In [45]:
def create_dashboard():

    print("Water Potability Prediction App Dashboard")

    print("Click a button to switch between visualization tools.")

    print("1. Select Bar Chart to see the distribution of potable water-bodies across different pH levels.")
    print("2. Select Scatterplot to see the relationship between solids concentration and turbidity.")
    print("3. Select Heatmap to see the correlation strength between all the variables.")
    print("4. Select the interactive tool to get real-time potability predictions by inputting water quality values.")

    bar_chart_button = widgets.Button(description="Bar Chart", layout=widgets.Layout(width='300px'))
    bar_chart_button.on_click(create_barchart)

    scatter_plot_button = widgets.Button(description="Scatterplot", layout=widgets.Layout(width='300px'))
    scatter_plot_button.on_click(create_scatter)

    heatmap_button = widgets.Button(description="Heatmap", layout=widgets.Layout(width='300px'))
    heatmap_button.on_click(create_heatmap)

    prediction_button = widgets.Button(description="Interactive Potability Prediction Tool", layout=widgets.Layout(width='300px'))
    prediction_button.on_click(create_prediction)

    dashboard = widgets.VBox([bar_chart_button, scatter_plot_button, heatmap_button, prediction_button, dashboard_output])

    display(dashboard)

create_dashboard()


Water Potability Prediction App Dashboard
Click a button to switch between visualization tools.
1. Select Bar Chart to see the distribution of potable water-bodies across different pH levels.
2. Select Scatterplot to see the relationship between solids concentration and turbidity.
3. Select Heatmap to see the correlation strength between all the variables.
4. Select the interactive tool to get real-time potability predictions by inputting water quality values.

