This week we will be looking at Enterococcus levels in the Hudson River, using data from the organization Riverkeeper (http://www.riverkeeper.org/).

Background: Enterococcus is a fecal indicator bacterium that lives in the intestines of humans and other warm-blooded animals. Enterococcus (“Entero”) counts are useful as a water quality indicator because the bacteria are abundant in human sewage, correlate with many human pathogens, and are scarce in sewage-free environments. The United States Environmental Protection Agency (EPA) reports Entero counts as colonies (or cells) per 100 mL of water.
Riverkeeper has based its assessment of acceptable water quality on the 2012 Federal Recreational Water Quality Criteria from the US EPA. Unacceptable water is based on an illness rate of 32 per 1000 swimmers.
The federal standard for unacceptable water quality is a single sample value greater than 110 Enterococcus/100 mL, or five or more samples with a geometric mean (the nth root of the product of the n sample values) greater than 30 Enterococcus/100 mL.
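The two criteria above can be sketched in a few lines of plain Python. This is only an illustration: the function names and sample readings are hypothetical, and it assumes that "a single sample" means any one sample in the set and that all counts are positive.

```python
import math

SINGLE_SAMPLE_LIMIT = 110    # Entero per 100 mL, from the text above
GEOMETRIC_MEAN_LIMIT = 30    # Entero per 100 mL, from the text above


def geometric_mean(counts):
    """nth root of the product of n values, computed via logs for stability."""
    return math.exp(sum(math.log(c) for c in counts) / len(counts))


def is_unacceptable(counts):
    """Apply both criteria to one site's samples (assumes every count > 0)."""
    # Criterion 1 (assumed reading): any single sample above the limit
    if any(c > SINGLE_SAMPLE_LIMIT for c in counts):
        return True
    # Criterion 2: five or more samples whose geometric mean exceeds the limit
    return len(counts) >= 5 and geometric_mean(counts) > GEOMETRIC_MEAN_LIMIT


samples = [10, 20, 40, 80, 100]   # hypothetical readings for one site
print(round(geometric_mean(samples), 1))
print(is_unacceptable(samples))
```

For these made-up readings the geometric mean is roughly 36, well below the arithmetic mean of 50, which is why the geometric mean is less sensitive to a few very high samples.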
Data: I have provided the data on our github page, in the folder https://github.com/jlaurito/CUNY_IS608/blob/master/lecture4/data. I have not cleaned it – you need to do so.

This assignment must be done in Python, using the 'bokeh', 'seaborn', or 'pandas' package. You may turn in either a .py file or an IPython notebook file.

Questions:

1. Create lists & graphs of the best and worst places to swim in the dataset.

2. The testing of water quality can be sporadic. Which sites have been tested most regularly? Which ones have long gaps between tests? Pick out 5-10 sites and visually compare how regularly their water quality is tested.

3. Is there a relationship between the amount of rain and water quality?  Show this relationship graphically. If you can, estimate the effect of rain on quality at different sites and create a visualization to compare them.

In [1]:
import pandas as pd
import numpy as np
from bokeh.layouts import row
from bokeh.plotting import figure, show, output_notebook
# Note: bokeh.charts is only available in older Bokeh releases
from bokeh.charts import Scatter, output_file

# read_csv replaces the deprecated DataFrame.from_csv; every field, including
# Site, is loaded as a regular column, so no reset_index call is needed
df = pd.read_csv('https://raw.githubusercontent.com/jlaurito/CUNY_IS608/master/lecture4/data/riverkeeper_data_2013.csv')

In [2]:
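# Treat site names as categories and parse the sampling dates
df['Site'] = df['Site'].astype('category')
df['Date'] = pd.to_datetime(df['Date'])
# Strip the '>' and '<' markers so the counts can be converted to numbers
df['EnteroCount'] = df['EnteroCount'].str.replace('>', '').str.replace('<', '')
# Missing counts are filled with 0.0
df['EnteroCount'] = df['EnteroCount'].astype('float').fillna(0.0)
# Per-site descriptive statistics
summary = pd.DataFrame(df.groupby('Site').describe().unstack())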
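The cleaning cell above drops the '<' and '>' markers and fills missing counts with 0.0. An alternative, sketched here on made-up values rather than the real file, is to record which readings carried a marker and to leave missing values as NaN so they do not pull the averages down. The `censored` column name is my own choice, not something from the dataset.

```python
import pandas as pd

# Toy values in the style of the raw column; not taken from the real file
raw = pd.Series(['<10', '25', '>2420', None])

# Remember which readings carried a '<' or '>' marker before stripping it
censored = raw.str.contains(r'[<>]', na=False)

# Strip the markers, then convert; anything unparseable becomes NaN
counts = pd.to_numeric(raw.str.replace(r'[<>]', '', regex=True), errors='coerce')

cleaned = pd.DataFrame({'EnteroCount': counts, 'censored': censored})
print(cleaned)
```

Whether NaN or 0.0 is the better fill depends on the question being asked; with NaN, `mean()` simply skips the missing samples.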

In [3]:
output_notebook()

In [6]:
# Average Entero count per site, computed directly from the cleaned data
x = df.groupby('Site')['EnteroCount'].mean().reset_index()
x.columns = ['Site', 'EnteroCount - mean']
x['Site'] = x['Site'].astype(str)
# Sort from highest average count (worst) to lowest (best)
x = x.sort_values('EnteroCount - mean', ascending=False)
top10 = x.head(10)      # ten highest averages: worst places to swim
bottom10 = x.tail(10)   # ten lowest averages: best places to swim
result = pd.concat([top10, bottom10])

1. The graph below shows the best and worst places to swim. The high average count at Gowanus Canal is well known; I have personally seen the Canal's oily, dark water, which was not inviting for a swim. Most of the sites with high averages are near the very end of the Hudson River, which suggests that the colonies accumulate along the river's course. The sites with the lowest average Entero counts are much farther up the Hudson. I would also hope that the Poughkeepsie drinking water intake has very low counts, for the sake of those drinking this water.

In [8]:
# The y axis shows the average count, not the location
scatter = Scatter(result, x='Site', y='EnteroCount - mean',
                  title="Top 10 and Bottom 10 Average EnteroCount",
                  xlabel="Location", ylabel="Average EnteroCount")
show(scatter)

2. The testing of water quality can be sporadic. Which sites have been tested most regularly? Which ones have long gaps between tests? Pick out 5-10 sites and visually compare how regularly their water quality is tested.
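One way to quantify testing regularity is to measure the number of days between consecutive samples at each site. The sketch below uses a made-up sampling log (the site names, dates, and the `gap_days` column are hypothetical); the same steps should carry over to the real `Site` and `Date` columns.

```python
import pandas as pd

# Hypothetical sampling log; the site names and dates are made up
log = pd.DataFrame({
    'Site': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2013-05-01', '2013-06-01', '2013-07-01',
                            '2013-05-01', '2013-05-08', '2013-09-01']),
})

# Days between consecutive tests at the same site (NaN for each first test)
log = log.sort_values(['Site', 'Date'])
log['gap_days'] = log.groupby('Site')['Date'].diff().dt.days

# 'count' is the number of gaps, i.e. the number of tests minus one
regularity = log.groupby('Site')['gap_days'].agg(['count', 'median', 'max', 'std'])
print(regularity)
```

A small maximum gap and a small standard deviation point to regular testing, while a large maximum gap flags a site with a long stretch of no tests.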