# Water Potability Pipeline

This notebook walks through pre-processing a _water potability_ dataset from Kaggle.com. The processed dataset will be fed into an ML model that predicts whether a given water sample is potable.

## Imports
Import all modules needed for this data pipeline.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Assessing Raw Data

This section briefly assesses the raw data that will be fed into the data pipeline, in order to identify aspects to target during the _pre-processing_ portion of the pipeline.

In [3]:
file = "data/water_potability.csv"
df = pd.read_csv(file)
row_count = len(df)
rows = []
for col in df.columns:
    # Number and share of missing values in this column
    NaN_count = df[col].isna().sum()
    NaN_perc = round(NaN_count / row_count, 2) * 100
    # Series.max() and Series.min() skip NaN by default
    maximum = round(df[col].max(), 2)
    minimum = round(df[col].min(), 2)
    if col == 'ph':
        # Print the observed pH range as a quick check
        print(maximum)
        print(minimum)
    rows.append([col, NaN_count, NaN_perc, maximum, minimum])

summary = pd.DataFrame(data=rows, columns=["Column Name", "NaN Counts", "NaN %", "Max", "Min"])
summary

14.0
0.0


| | Column Name | NaN Counts | NaN % | Max | Min |
|---|---|---|---|---|---|
| 0 | ph | 491 | 15.0 | 14.0 | 0.0 |
| 1 | Hardness | 0 | 0.0 | 323.12 | 47.43 |
| 2 | Solids | 0 | 0.0 | 61227.2 | 320.94 |
| 3 | Chloramines | 0 | 0.0 | 13.13 | 0.35 |
| 4 | Sulfate | 781 | 24.0 | 481.03 | 129.0 |
| 5 | Conductivity | 0 | 0.0 | 753.34 | 181.48 |
| 6 | Organic_carbon | 0 | 0.0 | 28.3 | 2.2 |
| 7 | Trihalomethanes | 162 | 5.0 | 124.0 | 0.74 |
| 8 | Turbidity | 0 | 0.0 | 6.74 | 1.45 |
| 9 | Potability | 0 | 0.0 | 1.0 | 0.0 |
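The same per-column summary can also be built without an explicit loop. The sketch below is one possible vectorized version, run on a small made-up frame (the values are illustrative and not taken from the Kaggle file). The helper name `summarize_missing` is hypothetical, and unlike the loop above it rounds the percentage itself to two decimals, so values are not forced to whole percentages.

```python
import numpy as np
import pandas as pd


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return NaN counts, NaN percentage, max and min for each column."""
    nan_counts = frame.isna().sum()
    return pd.DataFrame({
        "NaN Counts": nan_counts,
        "NaN %": (nan_counts / len(frame) * 100).round(2),
        # max() and min() skip NaN by default (skipna=True)
        "Max": frame.max().round(2),
        "Min": frame.min().round(2),
    })


# Toy data for illustration only
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 8.5, 6.2],
    "Hardness": [204.89, 129.42, 224.24, 214.37],
})
toy_summary = summarize_missing(toy)
print(toy_summary)
```

Here the column names become the index of the result, so a single column's figures can be read with `toy_summary.loc["ph"]`.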


## Replacing NaN/Missing Values

Missing (NaN) values are replaced with the _average_ (mean) of the column they belong to.

In [6]:
df.fillna(df.mean(), inplace=True)
df.head(10)

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | 333.775777 | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | 333.775777 | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| 5 | 5.584087 | 188.313324 | 28748.687739 | 7.544869 | 326.678363 | 280.467916 | 8.399735 | 54.917862 | 2.559708 | 0 |
| 6 | 10.223862 | 248.071735 | 28749.716544 | 7.513408 | 393.663396 | 283.651634 | 13.789695 | 84.603556 | 2.672989 | 0 |
| 7 | 8.635849 | 203.361523 | 13672.091764 | 4.563009 | 303.309771 | 474.607645 | 12.363817 | 62.798309 | 4.401425 | 0 |
| 8 | 7.080795 | 118.988579 | 14285.583854 | 7.804174 | 268.646941 | 389.375566 | 12.706049 | 53.928846 | 3.595017 | 0 |
| 9 | 11.180284 | 227.231469 | 25484.508491 | 9.0772 | 404.041635 | 563.885481 | 17.927806 | 71.976601 | 4.370562 | 0 |
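To make the effect of the fill concrete, here is a small sketch on made-up data (not the Kaggle file). It assumes every column is numeric, as in this dataset, and uses the non-inplace form of `fillna`; the names `toy`, `was_missing`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0],
    "Sulfate": [300.0, 340.0, np.nan],
})

# Remember which cells were missing before filling
was_missing = toy.isna()

# DataFrame.mean() ignores NaN, so each mean comes from observed values only
column_means = toy.mean()
filled = toy.fillna(column_means)

print(filled)
print("Remaining NaN:", int(filled.isna().sum().sum()))
```

Keeping a mask such as `was_missing` is optional, but it makes it possible to tell imputed cells from measured ones later in the pipeline.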


In [9]:
df.describe()

| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3276.0 | 3276.0 | 3276.0 | 3276.0 | 3276.0 | 3276.0 | 3276.0 | 3276.0 | 3276.0 | 3276.0 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.28497 | 66.396293 | 3.966786 | 0.39011 |
| std | 1.469956 | 32.879761 | 8768.570828 | 1.583085 | 36.142612 | 80.824064 | 3.308162 | 15.769881 | 0.780382 | 0.487849 |
| min | 0.0 | 47.432 | 320.942611 | 0.352 | 129.0 | 181.483754 | 2.2 | 0.738 | 1.45 | 0.0 |
| 25% | 6.277673 | 176.850538 | 15666.690297 | 6.127421 | 317.094638 | 365.734414 | 12.065801 | 56.647656 | 3.439711 | 0.0 |
| 50% | 7.080795 | 196.967627 | 20927.833607 | 7.130299 | 333.775777 | 421.884968 | 14.218338 | 66.396293 | 3.955028 | 0.0 |
| 75% | 7.87005 | 216.667456 | 27332.762127 | 8.114887 | 350.385756 | 481.792304 | 16.557652 | 76.666609 | 4.50032 | 1.0 |
| max | 14.0 | 323.124 | 61227.196008 | 13.127 | 481.030642 | 753.34262 | 28.3 | 124.0 | 6.739 | 1.0 |
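In this summary the median (50%) of `ph`, `Sulfate`, and `Trihalomethanes` equals the column mean, which is most likely a side effect of filling many cells with the mean. The sketch below uses synthetic data, with an arbitrarily chosen seed, distribution, and missing rate, to illustrate that mean imputation leaves the column mean unchanged while shrinking the standard deviation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Synthetic column: 1000 values, roughly 20% of them set to NaN
values = pd.Series(rng.normal(loc=7.0, scale=1.5, size=1000))
values[rng.random(1000) < 0.2] = np.nan

# Fill the gaps with the mean of the observed values
imputed = values.fillna(values.mean())

print("mean before:", values.mean(), "after:", imputed.mean())
print("std  before:", values.std(), "after:", imputed.std())
```

The reduced spread is worth keeping in mind when the filled columns are later scaled or passed to a model that is sensitive to variance.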
