# Lab 10 - Pruning Lakes

Recall that one of the files (its name starts with `mces`) contains water quality measurements for lakes in the Twin Cities. In this lab, we will narrow the list down to the lakes for which we have at least one of each measurement type (phosphorus and Secchi depth) for each year from 2004 through 2014.

## Tasks

Build a query that produces a list of lake names and codes meeting the following criteria.

1. Only contains years after 2003.
2. Only contains lakes that have at least one non-null measurement of each type in each year.
3. Contains both the lake name and the lake code.
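
One way to check criterion 2 is a per-lake, per-year count of non-null values. The sketch below uses plain pandas on a small made-up frame; the lake codes `A` and `B`, the column names, and the values are hypothetical and do not come from the MCES file. Because `count()` skips nulls, a zero means that measurement type is missing for that lake and year. Note that this counts each type separately, so the two measurements need not come from the same row.

```python
import pandas as pd

# Hypothetical toy data: lake A has both measurement types in 2004,
# lake B has Secchi readings but no phosphorus reading in 2004.
toy = pd.DataFrame({
    "code": ["A", "A", "B", "B"],
    "year": [2004, 2004, 2004, 2004],
    "secchi": [1.0, None, 0.5, 0.7],
    "phos": [None, 0.1, None, None],
})

# count() skips nulls, so each cell is the number of usable measurements.
counts = toy.groupby(["code", "year"])[["secchi", "phos"]].count()

# A lake-year meets the criterion when both counts are at least 1.
has_both = (counts >= 1).all(axis=1)
print(has_both)
```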


## Suggested workflow

1. Filter and mutate as needed.
2. Group and aggregate (hint: you will need to do this twice).
3. Filter on the number of years with observations (we want 11, one for each year from 2004 through 2014).
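
The two rounds of grouping can be sketched in plain pandas as follows. This is a minimal illustration on made-up data, not the lab solution: the frame `obs`, its column names, and `required_years` are assumptions chosen for the example (the lab itself would use 11 years).

```python
import pandas as pd

# Hypothetical toy data: one row per complete measurement.
# Lake A was sampled in 2004, 2005 and 2006; lake B only in 2004 and 2006.
obs = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B"],
    "code": ["a1", "a1", "a1", "a1", "b1", "b1"],
    "year": [2004, 2004, 2005, 2006, 2004, 2006],
})
required_years = 3  # the lab uses 11 (2004 through 2014)

# First aggregation: number of observations per lake per year.
per_year = (obs.groupby(["name", "code", "year"])
               .size()
               .reset_index(name="num_of_obs"))

# Second aggregation: number of distinct years per lake.
per_lake = (per_year.groupby(["name", "code"])
                    .size()
                    .reset_index(name="num_of_years"))

# Keep only the lakes observed in every required year.
complete = per_lake[per_lake["num_of_years"] == required_years]
print(complete)
```

The first aggregation collapses repeated visits within a year into a single row, so the second one counts years rather than individual samples.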

In [1]:
import pandas as pd
from dfply import *
import datetime as dt


In [16]:
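# Helper for parsing dates by hand (not used below, since read_csv parses START_DATE).
# pd.datetime has been removed from pandas, so use the datetime module instead.
dateparse = lambda x: dt.datetime.strptime(x, "%Y-%m-%d")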

lakes = pd.read_csv("../MinneMUDAC_raw_files/mces_lakes_1999_2014.csv", parse_dates=['START_DATE'])



In [38]:
lakes.head()


Unnamed: 0,PROJECT_ID,DATA_SET_TITLE,LAKE_NAME,CITY,COUNTY,DNR_ID_Site_Number,MAJOR_WATERSHED,WATER_PLANNING_AUTHORITY,LAKE_SITE_NUMBER,START_DATE,...,Secchi_Depth_RESULT_SIGN,Secchi_Depth_RESULT,Secchi_Depth_QUALIFIER,Secchi_Depth_Units,Total_Phosphorus_RESULT_SIGN,Total_Phosphorus_RESULT,Total_Phosphorus_QUALIFIER,Total_Phosphorus_Units,longitude,latitude
0,7108,Citizen Assisted Monitoring Program (CAMP) for...,Acorn Lake,Oakdale,Washington,82010200-01,Lower St. Croix River,Valley Branch WD,1,2006-04-16,...,,1.0,Approved,m,,0.156,Approved,mg/L,-92.971711,45.016556
1,7108,Citizen Assisted Monitoring Program (CAMP) for...,Acorn Lake,Oakdale,Washington,82010200-01,Lower St. Croix River,Valley Branch WD,1,2006-05-01,...,,,,m,,,,mg/L,-92.971711,45.016556
2,7108,Citizen Assisted Monitoring Program (CAMP) for...,Acorn Lake,Oakdale,Washington,82010200-01,Lower St. Croix River,Valley Branch WD,1,2006-05-02,...,,0.66,Approved,m,,0.107,Approved,mg/L,-92.971711,45.016556
3,7108,Citizen Assisted Monitoring Program (CAMP) for...,Acorn Lake,Oakdale,Washington,82010200-01,Lower St. Croix River,Valley Branch WD,1,2006-05-16,...,,0.66,Approved,m,,0.141,Approved,mg/L,-92.971711,45.016556
4,7108,Citizen Assisted Monitoring Program (CAMP) for...,Acorn Lake,Oakdale,Washington,82010200-01,Lower St. Croix River,Valley Branch WD,1,2006-05-30,...,,0.5,Approved,m,,0.029,Approved,mg/L,-92.971711,45.016556


In [35]:
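lakes_w_measurements = (lakes
         # Keep only the identifying columns and the two measurements.
         >> select('LAKE_NAME','DNR_ID_Site_Number','START_DATE','Secchi_Depth_RESULT','Total_Phosphorus_RESULT')
         # Keep rows where both measurements are present (X refers to the piped frame).
         >> filter_by(X.Total_Phosphorus_RESULT.notna() & X.Secchi_Depth_RESULT.notna())
         # Extract the year and drop the full date.
         >> mutate(year = X.START_DATE.dt.year)
         >> drop(X.START_DATE)
         # Only years after 2003.
         >> filter_by(X.year >= 2004)
        )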

In [21]:
lakes_w_measurements.head()

Unnamed: 0,LAKE_NAME,DNR_ID_Site_Number,Secchi_Depth_RESULT,Total_Phosphorus_RESULT,year
0,Acorn Lake,82010200-01,1.0,0.156,2006
2,Acorn Lake,82010200-01,0.66,0.107,2006
3,Acorn Lake,82010200-01,0.66,0.141,2006
4,Acorn Lake,82010200-01,0.5,0.029,2006
5,Acorn Lake,82010200-01,0.5,0.058,2006


In [31]:
num_measurements = (lakes_w_measurements 
                    >> group_by(X.LAKE_NAME,X.year,X.DNR_ID_Site_Number)
                    >> summarise(num_of_obs = n(X.Secchi_Depth_RESULT))
                    >> ungroup()
                    >> group_by(X.LAKE_NAME,X.DNR_ID_Site_Number)
                    >> summarise(num_of_years = n(X.num_of_obs))
                    >> filter_by(X.num_of_years == 11)
                   )

In [34]:
num_measurements.shape

(49, 3)

In [36]:
lakes_stats = (lakes_w_measurements 
               >> filter_by(X.DNR_ID_Site_Number.isin(set(num_measurements.DNR_ID_Site_Number)))
               >> group_by(X.LAKE_NAME,X.year,X.DNR_ID_Site_Number)
               >> summarise(mean_secchi = X.Secchi_Depth_RESULT.mean(),
                            med_secchi = X.Secchi_Depth_RESULT.median(),
                            sd_secchi = X.Secchi_Depth_RESULT.std(),
                            mean_phos = X.Total_Phosphorus_RESULT.mean(),
                            med_phos = X.Total_Phosphorus_RESULT.median(),
                            sd_phos = X.Total_Phosphorus_RESULT.std()
                           )
              )

In [39]:
lakes_stats.head()

Unnamed: 0,DNR_ID_Site_Number,year,LAKE_NAME,mean_secchi,med_secchi,sd_secchi,mean_phos,med_phos,sd_phos
0,19002100-01,2004,Alimagnet Lake,0.445,0.5,0.204736,0.1645,0.107,0.137039
1,19002100-01,2005,Alimagnet Lake,0.528,0.5,0.219484,0.1234,0.1275,0.038945
2,19002100-01,2006,Alimagnet Lake,0.525,0.5,0.185164,0.154375,0.126,0.090448
3,19002100-01,2007,Alimagnet Lake,0.507,0.415,0.247792,0.124,0.1125,0.064014
4,19002100-01,2008,Alimagnet Lake,0.605,0.6,0.252533,0.106167,0.1025,0.04058


In [40]:
lakes_stats.to_csv("./data/lakes_stats.csv")

In [2]:
lakes_stats = pd.read_csv("./data/lakes_stats.csv")

In [3]:
lakes_stats.shape

(539, 10)
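
The row count agrees with the earlier result: 49 lakes times 11 years gives 539 rows. The tenth column is most likely the row index, which `to_csv` writes by default and `read_csv` then reads back as an unnamed column. The sketch below reproduces that effect on a hypothetical two-column frame standing in for `lakes_stats`, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical two-column frame standing in for lakes_stats.
df = pd.DataFrame({"year": [2004, 2005], "mean_secchi": [0.445, 0.528]})

buffer = io.StringIO()
df.to_csv(buffer)            # default index=True writes the row index as a column
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(df.shape, round_trip.shape)
print(list(round_trip.columns))
```

Passing `index=False` to `to_csv`, or `index_col=0` to `read_csv`, would avoid the extra column.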