This week we will be looking at Enterococcus levels in the Hudson River, using data from the organization Riverkeeper (http://www.riverkeeper.org/).

Background: Enterococcus is a fecal indicator bacterium that lives in the intestines of humans and other warm-blooded animals. Enterococcus (“Entero”) counts are useful as a water quality indicator because the bacteria are abundant in human sewage, correlate with many human pathogens, and are scarce in sewage-free environments. The United States Environmental Protection Agency (EPA) reports Entero counts as colonies (or cells) per 100 mL of water.
Riverkeeper has based its assessment of acceptable water quality on the 2012 Federal Recreational Water Quality Criteria from the US EPA. The threshold for unacceptable water is based on an illness rate of 32 per 1,000 swimmers.
The federal standard for unacceptable water quality is a single sample value greater than 110 Enterococcus/100 mL, or five or more samples with a geometric mean (the nth root of the product of n sample values) greater than 30 Enterococcus/100 mL.
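The two criteria above can be sketched in a few lines of Python. This is only an illustration: the helper names (`geometric_mean`, `is_unacceptable`) and the sample values are made up, and applying the test to all samples from one site at once is an assumption about how the standard would be used here.

```python
import math

SINGLE_SAMPLE_LIMIT = 110.0   # Entero/100 mL, single-sample threshold from the text
GEOMETRIC_MEAN_LIMIT = 30.0   # Entero/100 mL, geometric-mean threshold from the text
MIN_SAMPLES_FOR_GM = 5        # geometric-mean rule needs five or more samples


def geometric_mean(counts):
    """nth root of the product of n positive values, computed via logarithms."""
    return math.exp(sum(math.log(c) for c in counts) / len(counts))


def is_unacceptable(counts):
    """Hypothetical helper: apply both criteria to one site's positive counts."""
    if any(c > SINGLE_SAMPLE_LIMIT for c in counts):
        return True
    if len(counts) >= MIN_SAMPLES_FOR_GM:
        return geometric_mean(counts) > GEOMETRIC_MEAN_LIMIT
    return False


# Made-up samples, not taken from the Riverkeeper data.
clean_site = [10, 12, 8, 20, 15]
dirty_site = [40, 35, 50, 45, 60]
spike_site = [10, 10, 150]

for name, counts in [("clean", clean_site), ("dirty", dirty_site), ("spike", spike_site)]:
    print(name, is_unacceptable(counts))
```

Working in log space avoids overflow when many large counts are multiplied together; it also means every count must be strictly positive.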
Data: I have provided the data on our GitHub page, in the folder https://github.com/jlaurito/CUNY_IS608/blob/master/lecture4/data. I have not cleaned it; you need to do so.

This assignment must be done in Python, using the 'bokeh', 'seaborn', or 'pandas' package. You may turn in either a .py file or an IPython notebook file.

Questions:

1. Create lists & graphs of the best and worst places to swim in the dataset.

2. The testing of water quality can be sporadic. Which sites have been tested most regularly? Which ones have long gaps between tests? Pick out 5-10 sites and visually compare how regularly their water quality is tested.

3. Is there a relationship between the amount of rain and water quality? Show this relationship graphically. If you can, estimate the effect of rain on quality at different sites and create a visualization to compare them.
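
For question 2, one possible way to quantify "regularity" is the number of days between consecutive samples at each site. The sketch below uses made-up dates for two hypothetical sites; the `Date` column name is an assumption about the dataset, and `GapDays` is a name introduced only for this example.

```python
import pandas as pd

# Made-up sample dates for two hypothetical sites (not from the dataset).
samples = pd.DataFrame({
    "Site": ["A", "A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime([
        "2013-05-01", "2013-06-01", "2013-07-01", "2013-08-01",
        "2013-05-01", "2013-05-08", "2013-10-01",
    ]),
})

samples = samples.sort_values(["Site", "Date"])
# Days elapsed since the previous sample at the same site (NaN for the first one).
samples["GapDays"] = samples.groupby("Site")["Date"].diff().dt.days

# A small median with a large max suggests mostly regular testing with one long gap.
gap_summary = samples.groupby("Site")["GapDays"].agg(["count", "median", "max"])
print(gap_summary)
```

Here site A is sampled roughly monthly, while site B has a short gap followed by a very long one, so its maximum gap stands out even though it has nearly as many samples.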

In [68]:
import pandas as pd
import numpy as np
import bokeh as bh

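# DataFrame.from_csv was removed in pandas 1.0. read_csv leaves every field as a
# regular column, so the reset_index step that followed from_csv is not needed.
url = 'https://raw.githubusercontent.com/jlaurito/CUNY_IS608/master/lecture4/data/riverkeeper_data_2013.csv'
df = pd.read_csv(url)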

In [69]:
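df['Site'] = df['Site'].astype('category')
# Strip the '<' and '>' qualifiers so the counts can be parsed as numbers.
df['EnteroCount'] = df['EnteroCount'].str.replace('>', '').str.replace('<', '')
# Note: missing counts are filled with 0.0, which lowers the per-site statistics.
df['EnteroCount'] = df['EnteroCount'].astype('float').fillna(0.0)
# In current pandas, describe() on a groupby already returns one row per site,
# so the unstack() that older versions required is dropped.
summary = df.groupby('Site').describe()
summary.reset_index(level=0, inplace=True)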
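
The `describe()` summary reports arithmetic means, which a few very high counts can dominate, whereas the federal standard quoted earlier is stated in terms of the geometric mean. A minimal sketch of a per-site geometric mean follows, on made-up data for two hypothetical sites; it is one possible approach, not necessarily how the sites should be ranked.

```python
import numpy as np
import pandas as pd

# Made-up counts for two hypothetical sites (not from the dataset).
toy = pd.DataFrame({
    "Site": ["A", "A", "A", "B", "B", "B"],
    "EnteroCount": [10.0, 10.0, 1000.0, 40.0, 50.0, 60.0],
})

# Geometric mean = exp(mean(log(x))). It is only defined for strictly positive
# values, so rows with a count of 0 (e.g. filled-in missing values) are dropped.
positive = toy[toy["EnteroCount"] > 0]
geo_mean = (
    np.log(positive["EnteroCount"])
    .groupby(positive["Site"])
    .mean()
    .pipe(np.exp)
)
arith_mean = toy.groupby("Site")["EnteroCount"].mean()

comparison = pd.DataFrame({"arith_mean": arith_mean, "geo_mean": geo_mean})
print(comparison.sort_values("geo_mean"))
```

In this toy data, site A has a much higher arithmetic mean than site B because of one large sample, yet a slightly lower geometric mean, so the two statistics rank the sites differently.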

In [73]:
summary

,Site,EnteroCount,EnteroCount,EnteroCount,EnteroCount,EnteroCount,EnteroCount,EnteroCount,EnteroCount,FourDayRainTotal,FourDayRainTotal,FourDayRainTotal,FourDayRainTotal,SampleCount,SampleCount,SampleCount,SampleCount,SampleCount,SampleCount,SampleCount,SampleCount
,,count,mean,std,min,25%,50%,75%,max,count,...,75%,max,count,mean,std,min,25%,50%,75%,max
0,125th St. Pier,66.0,179.696970,350.559332,8.0,12.75,43.5,135.75,1500.0,66.0,...,1.000,6.4,66.0,66.0,0.0,66.0,66.0,66.0,66.0,66.0
1,79th St. mid-channel,49.0,47.204082,149.335888,1.0,10.00,10.0,20.00,1032.0,49.0,...,1.000,8.5,49.0,49.0,0.0,49.0,49.0,49.0,49.0,49.0
2,Albany Rowing Dock,36.0,280.944444,598.010891,3.0,24.25,48.0,140.00,2420.0,36.0,...,1.325,2.8,36.0,36.0,0.0,36.0,36.0,36.0,36.0,36.0
3,Annesville Creek,38.0,83.421053,192.784102,5.0,10.00,10.0,20.00,958.0,38.0,...,0.675,3.4,38.0,38.0,0.0,38.0,38.0,38.0,38.0,38.0
4,Athens,35.0,201.314286,541.301904,5.0,17.50,30.0,68.50,2420.0,35.0,...,1.100,2.8,35.0,35.0,0.0,35.0,35.0,35.0,35.0,35.0
5,Beacon Harbor,38.0,52.657895,133.261839,1.0,7.25,19.0,46.00,816.0,38.0,...,0.500,2.1,38.0,38.0,0.0,38.0,38.0,38.0,38.0,38.0
6,Bethlehem Launch Ramp,36.0,231.694444,596.308205,1.0,8.75,19.0,53.25,2420.0,36.0,...,1.050,2.8,36.0,36.0,0.0,36.0,36.0,36.0,36.0,36.0
7,"Castle Point, NJ",39.0,37.076923,54.016267,10.0,10.00,20.0,36.00,231.0,39.0,...,0.900,6.4,39.0,39.0,0.0,39.0,39.0,39.0,39.0,39.0
8,Castleton,35.0,186.000000,402.355637,1.0,6.50,22.0,66.50,1733.0,35.0,...,1.100,2.8,35.0,35.0,0.0,35.0,35.0,35.0,35.0,35.0
9,Catskill Creek- East End,42.0,261.238095,656.570945,1.0,6.50,13.5,32.50,2420.0,42.0,...,0.975,2.8,42.0,42.0,0.0,42.0,42.0,42.0,42.0,42.0
