## Chicago Lead Analysis

An APM Reports analysis of extensive lead testing data in Chicago shows that EPA sampling protocols fail to capture the highest levels of lead present in a water system.

In the city's [2018 Water Quality Report](https://www.chicago.gov/content/dam/city/depts/water/ConsumerConfidenceReports/ChicagoWaterQuality2018.PDF), Chicago reported that its 90th-percentile lead level was 9.1 ppb, well under the federal action level of 15 ppb. This was the result of testing the first liter of water drawn from 50 sites over a period of three years, the minimum amount of testing a city of Chicago's size is required to do. The city reported that none of the sites tested had lead levels above the federal action level.

In 2016, Thomas H. Powers, then commissioner of the Chicago Department of Water Management, [wrote in a letter](https://www.chicagotribune.com/opinion/letters/ct-chicago-isn-t-the-next-flint-20160209-story.html) to the Chicago Tribune that “Chicago's water is safe and pure, exceeding all standards set by the U.S. Environmental Protection Agency, the Illinois EPA and the drinking water industry.” The letter disputed a Tribune story, based on an EPA study, that claimed that Chicago had high lead levels.

But that same year, as part of a water quality study, Chicago began conducting lead testing for any customer who requested it. In the past four years, thousands of customers have asked the city to test their water. Homes with elevated lead levels in their first test then received more extensive testing. In this select group of homes, the city tested each of the first 10 liters out of the tap.

This data represents some of the most extensive lead sampling in a major American city and demonstrates the weaknesses of relying on EPA-mandated first-draw sampling. This analysis applies various proposals and regulations to the Chicago sampling data to show the impact that different testing protocols have on the 90th-percentile lead level.

#### Table of Contents
* [Data Overview](#Data-Overview)
* [Current LCR Sampling Protocol](#Current-LCR-Sampling-Protocol) 
* [Proposed LCR Revision](#LCR-Proposed-Revision)
* [Michigan's Lead and Copper Rule](#Michigan-LCR) 
* [Highest sample](#Highest-of-the-1st-10th-Liters) 
* [CDC health-based action level](#CDC-Health-Based-Action-Level)


## Data Overview

The data can be downloaded from [this website](http://chicagowaterquality.org/home#results). Each row in the dataset is a site that had 10 liters of water tested. All values are measured in parts per billion. 

In [1]:
import os
import pandas as pd
import altair as alt

data_dir = os.path.join(os.getcwd(), 'data/source')
data_out = os.path.join(os.getcwd(), 'data/processed')
chicago_xlsx = os.path.join(data_dir, 'chicago_sampling_results.xlsx')

chicago_sampling = pd.read_excel(chicago_xlsx, sheet_name='Sequential', skiprows=2, skipfooter=7, usecols=list(range(0,12)))

columns = { 'Date Sampled': 'date_sampled',
           'Address': 'address',
           '1st Draw': '1st_draw',
           '2nd Draw': '2nd_draw',
           '3rd Draw': '3rd_draw',
           '4th Draw': '4th_draw',
           '5th Draw': '5th_draw',
           '6th Draw': '6th_draw',
           '7th Draw': '7th_draw',
           '8th Draw': '8th_draw',
           '9th Draw': '9th_draw',
           '10th Draw': '10th_draw'}

chicago_sampling = chicago_sampling.rename(columns=columns)

chicago_sampling.head()

| | date_sampled | address | 1st_draw | 2nd_draw | 3rd_draw | 4th_draw | 5th_draw | 6th_draw | 7th_draw | 8th_draw | 9th_draw | 10th_draw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-02-28 | 27XX N Wilton Ave** | 22.3 | 25.7 | 29.8 | 33.5 | 38 | 39.9 | 38.3 | 35.2 | 31.3 | 26.2 |
| 1 | 2016-03-04 | 54XX S Harper Ave** | 13.7 | 21 | 22.8 | 19.3 | 14.5 | 10.7 | 9.33 | 7.44 | 6.66 | 5.79 |
| 2 | 2016-03-04 | 81XX S Euclid Ave | 1.18 | <1 | <1 | <1 | <1 | <1 | <1 | <1 | <1 | <1 |
| 3 | 2016-03-16 | 11XX W George St | 8.01 | 8.5 | 10.8 | 9 | 9.42 | 10.3 | 10.5 | 9.28 | 6.19 | 5.1 |
| 4 | 2016-03-16 | 93XX S Bennett Ave | 4.8 | 4.41 | 7.92 | 9.22 | 9.84 | 10.6 | 11.8 | 11.9 | 10.9 | 10.1 |


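The Chicago sampling data reports all values lower than 1 ppb as '<1'. For our purposes, we treat those values as 0. Because compliance is judged on the 90th percentile, which sits far above 1 ppb in this dataset, the substitution does not change the percentile results, though it slightly lowers any averages.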
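The reasoning behind that assumption can be checked with a small experiment. The sketch below uses made-up readings (not the Chicago data) and a hypothetical helper, `p90_with_substitute`, to compute the 90th percentile with several stand-in values for '<1'. As long as the censored values fall well below the percentile, the choice of stand-in should not matter.

```python
import pandas as pd

# Made-up first-draw readings in ppb; '<1' marks values below the reporting limit.
readings = ['<1', 2.5, '<1', 7.0, 12.0, 18.0, 3.3, '<1', 25.0, 9.1]

def p90_with_substitute(values, substitute):
    """Swap '<1' for a numeric stand-in and return the 90th percentile (hypothetical helper)."""
    numeric = pd.Series([substitute if v == '<1' else v for v in values], dtype='float64')
    return numeric.quantile(0.9, interpolation='nearest')

# Try three plausible stand-ins: zero, half the reporting limit, and the limit itself.
results = {sub: p90_with_substitute(readings, sub) for sub in (0, 0.5, 1)}
print(results)
```

All three stand-ins give the same 90th percentile here because the censored readings sit at the bottom of the distribution.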

In [2]:
# cast all values below 1ppb to 0
chicago_sampling = chicago_sampling.replace({'<1.00': 0, '<1': 0})

# There are typos in some cells
chicago_sampling.loc[chicago_sampling['5th_draw']=='22..1', '5th_draw'] = 22.1
chicago_sampling.loc[chicago_sampling['6th_draw']=='7.49.', '6th_draw'] = 7.49


# Set dtype of number columns
num_cols = ['1st_draw','2nd_draw','3rd_draw','4th_draw','5th_draw','6th_draw','7th_draw','8th_draw','9th_draw', '10th_draw']
chicago_sampling[num_cols] = chicago_sampling[num_cols].astype('float64')

# Add year and month columns to df 
chicago_sampling['year'] = pd.to_datetime(chicago_sampling['date_sampled']).dt.year
chicago_sampling['month'] = pd.to_datetime(chicago_sampling['date_sampled']).dt.month
chicago_sampling['1st_5th_highest'] = chicago_sampling[['1st_draw', '5th_draw']].max(axis=1)
chicago_sampling['highest_sample'] = chicago_sampling[num_cols].max(axis=1)

The dataset's 650 rows cover 569 unique addresses. The number of individual homes may be higher than that, because the data is anonymized at the block level: if two homes on the same block were both tested, they share the same address in the data.

The sampling spans nearly four years. The earliest sample date is Feb. 28, 2016, and the most recent is Sept. 16, 2019.

In [3]:
print(f'Unique addresses: {len(chicago_sampling.address.unique())}')
print(f'Earliest sampled: {chicago_sampling.date_sampled.min()}')
print(f'Most recent sampled: {chicago_sampling.date_sampled.max()}')


Unique addresses: 569
Earliest sampled: 2016-02-28 00:00:00
Most recent sampled: 2019-09-16 00:00:00


Most sampling was conducted in June, July, and August. Sampling was conducted in each month.

In [4]:
sample_dates = chicago_sampling.date_sampled

# Count samples per calendar month; rename_axis names the month column the same way across pandas versions
months = pd.to_datetime(sample_dates).dt.month.value_counts().rename_axis('month').reset_index(name='count')

display(alt.Chart(months.sort_values(by='month'), width=500).mark_bar().encode(
    x='month:O',
    y=alt.Y('count', axis=alt.Axis(title='Number of samples'))
))

In [5]:
avg_per_month = chicago_sampling.copy()
avg_per_month['site_average'] = avg_per_month[num_cols].mean(axis=1)

avg_month = avg_per_month.groupby('month')['site_average'].mean().reset_index()

alt.Chart(avg_month, width=800, title='Average lead level per month, all samples').mark_line(point=True).encode(
    x=alt.X('month:O',sort=[1,2,3,4,5,6,7,8,9,10,11,12]),
    y=alt.Y('site_average', axis=alt.Axis(title='Avg lead concentration (ppb)', format='.1f')),
)

Plotting the lead concentration of each liter at every site shows wide variation from site to site. Even so, some sites stand out with lead levels far above the 15 ppb action level, which is drawn as a red line.

In [6]:

alt.data_transformers.enable('default')
alt.data_transformers.disable_max_rows()

indexed_sampling = chicago_sampling.copy()
indexed_sampling['id'] = indexed_sampling.index



# Reshape to long format: one row per site per liter
df = pd.melt(indexed_sampling,
           id_vars=['id'],
           value_vars=num_cols,
           var_name = 'draw_#',
           value_name = 'lead_concentration')

base = alt.Chart(df, width=800, height=1000, title='Lead Profiles with lead action level').mark_line(point=True).encode(
    x=alt.X('draw_#:O',sort=num_cols, title='Liter number'),
    y=alt.Y('lead_concentration', axis=alt.Axis(format='.1f'),title='Lead Concentration (ppb)'),
    color=alt.Color('id:O', title='Site ID')
)

dummy = pd.DataFrame([{'draw_#': draw,'lead_concentration': 15} for draw in num_cols])

line = alt.Chart(dummy).mark_line(color='red').encode(
    x=alt.X('draw_#:O',sort=num_cols),
    y=alt.Y('lead_concentration'),
)

base+line

The chart below shows the average lead concentration in each liter. The average for each liter is higher than for the liter before, up to the 9th liter. The average lead concentration in 9th-liter samples is a little over 14 ppb, just under the federal action level.

In [7]:
avg_lead = df.groupby('draw_#')['lead_concentration'].mean().reset_index()

alt.Chart(avg_lead, width=800, title='Average lead levels per liter number').mark_line(point=True).encode(
    x=alt.X('draw_#:O',sort=num_cols),
    y=alt.Y('lead_concentration', axis=alt.Axis(format='.1f', title='Avg lead concentration (ppb)')),
)

Most individual samples in the dataset are below 50 ppb. 27% of the samples are above the action level:

In [8]:
print(f"There are {len(df[df.lead_concentration <= 50])} samples below 50ppb")
print(f"There are {len(df[df.lead_concentration > 50])} samples above 50ppb")
print(f"There are {len(df[df.lead_concentration > 15])} samples above 15ppb")
print(f"There are {len(df[~df.lead_concentration.isna()])} individual samples in the dataset")
print(f"Highest sample: {df['lead_concentration'].max()} ppb")

There are 6409 samples below 50ppb
There are 58 samples above 50ppb
There are 1767 samples above 15ppb
There are 6467 individual samples in the dataset
Highest sample: 491.0 ppb


Nearly a thousand samples are between 4 and 6 ppb. The chart below only includes samples at or below 50 ppb.

In [9]:
base = alt.Chart(width=800, title='Number of Samples with a given lead concentration').mark_bar().encode(
    x=alt.X('lead_concentration',bin=alt.Bin(maxbins=25), title='lead concentration (ppb), lead action level'),
    y=alt.Y('count()', axis=alt.Axis(title='Count of samples'))
)

vertline = alt.Chart().mark_rule(color='red').encode(
    x='a:Q'
)


alt.layer(
    base, vertline,
    data = df[df.lead_concentration <= 50]
).transform_calculate(
    a='15'
)

The charts below illustrate one of the big problems with the Lead and Copper Rule (LCR): the lowest lead levels are found in the first liter.

Each chart is a histogram of the lead concentrations of each liter drawn from the tap. The first and second liter skew low, but lead concentrations increase as water from deeper within the plumbing is tested. 

In [10]:

base = alt.Chart(width=600).mark_bar().encode(
        x=alt.X('lead_concentration',bin=alt.Bin(maxbins=25), title='lead concentration (ppb), lead action level'),
        y=alt.Y('count()', axis=alt.Axis(title='Count of samples'))
)

alt.layer(
    base, vertline,
    data = df[df.lead_concentration <= 50]
).transform_calculate(
    a='15'
).facet(
    row=alt.Row('draw_#', sort=num_cols)
).resolve_scale(
    x='independent'
)


The 90th-percentile value is what the federal government uses to determine whether a utility is in compliance with the LCR. The 90th-percentile means that 10% of samples are above the 90th-percentile value and 90% of samples are below. So if the 90th-percentile value is 10 ppb, that means that 10% of lead samples were above 10 ppb. 
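As a rough illustration of that definition, the sketch below computes a 90th percentile for 20 made-up readings (not the Chicago data) using pandas. It shows both interpolation methods that appear in this notebook; neither is claimed here to be the EPA's official calculation procedure.

```python
import pandas as pd

# Made-up lead readings (ppb) for 20 hypothetical sites.
samples = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 5, 6,
                     6, 7, 8, 9, 10, 11, 12, 14, 20, 40], dtype='float64')

# 'nearest' returns an observed sample; 'linear' interpolates between neighbors.
p90_nearest = samples.quantile(0.9, interpolation='nearest')
p90_linear = samples.quantile(0.9, interpolation='linear')

# Share of samples strictly above the 90th-percentile value.
share_above = (samples > p90_nearest).mean()

print(f"nearest: {p90_nearest} ppb, linear: {p90_linear:.1f} ppb, share above: {share_above:.0%}")
```

With these numbers, two of the 20 readings (10%) exceed the 'nearest' 90th-percentile value, and a single extreme reading of 40 ppb barely moves the result.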

The chart below shows the 90th-percentile value for each liter drawn from the tap. The 8th-liter and 10th-liter samples have the highest 90th-percentile values at 28 ppb, nearly double the action level.

The first liter has the lowest 90th-percentile lead level at 16 ppb. As long as cities use the first-liter sampling protocol, they may be basing all of their public policy work on the lowest lead levels in the system.

In [11]:
ninetieth_per_sample = pd.DataFrame(data = {"sample": num_cols,
                                            "90th_percentile": [df[df['draw_#'] == draw]['lead_concentration'].quantile(.9, interpolation='nearest') for draw in num_cols]})

horiline = alt.Chart().mark_rule(color='red').encode(
    y='a:Q'
)

base = alt.Chart(width=600, height=400,title='90th Percentile Value for all samples per liter').mark_line(point=True).encode(
    x=alt.X('sample:O',sort=num_cols, title='Liter Number'),
    y=alt.Y('90th_percentile:Q', axis=alt.Axis(tickCount=10, format='.1f'), title='90th-Percentile Lead Level (ppb)')
)


display(ninetieth_per_sample)
alt.layer(
    base, horiline,
    data=ninetieth_per_sample
).transform_calculate(a='15')

| | sample | 90th_percentile |
|---|---|---|
| 0 | 1st_draw | 16.0 |
| 1 | 2nd_draw | 18.5 |
| 2 | 3rd_draw | 21.0 |
| 3 | 4th_draw | 24.9 |
| 4 | 5th_draw | 25.8 |
| 5 | 6th_draw | 26.4 |
| 6 | 7th_draw | 26.0 |
| 7 | 8th_draw | 28.0 |
| 8 | 9th_draw | 27.8 |
| 9 | 10th_draw | 28.0 |


## Chicago and Michigan Comparison

Michigan enacted first-and-fifth-liter compliance sampling. An APM Reports analysis of individual samples from Michigan found that fifth-liter samples were 44% higher on average than first-liter samples. In Chicago, the average lead concentration in the first-draw sample was 9.4 ppb and the average lead concentration in the fifth draw was 11.9 ppb, a 27% increase.

In [12]:
chicago_1st_5th = chicago_sampling[['1st_draw', '5th_draw']].copy()

chicago_1st_5th.describe()

| | 1st_draw | 5th_draw |
|---|---|---|
| count | 650.0 | 648.0 |
| mean | 9.366922 | 11.88 |
| std | 17.958904 | 10.791019 |
| min | 0.0 | 0.0 |
| 25% | 3.68 | 5.1325 |
| 50% | 6.0 | 8.65 |
| 75% | 9.4275 | 15.0 |
| max | 232.0 | 94.8 |
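
The 27% figure quoted for Chicago follows directly from the two means in the summary table. A minimal check of the arithmetic, with the means copied by hand from that table:

```python
# Means copied from the describe() output above (ppb).
first_draw_mean = 9.366922
fifth_draw_mean = 11.88

# Percent increase from the first-draw mean to the fifth-draw mean.
pct_increase = 100 * (fifth_draw_mean - first_draw_mean) / first_draw_mean

print(f"{pct_increase:.1f}% increase")
```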


## Sampling Protocols

The crux of the debate on lead sampling is over which liter utilities should use when calculating their 90th-percentile lead concentration. The liter that utilities use for measuring compliance can determine whether a utility passes or fails the lead action level. It also has significant implications for how a city manages its water supply.

There are two things that are important to keep in mind. 

First, the point of measuring lead in the water is for the utility to adjust how it controls corrosion. The intent of the LCR is for lead sampling to show how well corrosion control is keeping lead levels down. The first requirement after exceeding the action level is to optimize corrosion control to better reduce lead levels. If the utility is consistently missing the highest lead levels, it is operating with limited data and will not be effective at reducing lead at the tap.

Second, utilities are supposed to be testing water at the highest-risk sites. The Lead and Copper Rule sets requirements for choosing compliance sample sites because the EPA intended for utilities to use the sites most at risk for high lead levels to judge how well their corrosion control is working.

A lack of accurate lead testing data starts a cascade of policies and decisions based on misleading information. Utilities test the water with the lowest lead levels, which leads to corrosion control that is not optimized to reduce high lead levels. Utilities then reduce the amount of testing they do as they spend time under the action level, which leaves lead lines in the ground for years.

To show the impact that different sampling protocols have on the 90th-percentile lead level, APM Reports analyzed the Chicago sampling data according to different rules and regulations. 

We break down the data four different ways for each proposal, according to different monitoring periods. A utility's monitoring period depends on whether it has stayed under the action level in previous periods. The most frequent schedule is every six months. Utilities that stay under the action level can move to annual and eventually triennial (once every three years) testing.

We tested the data under four different monitoring periods:
1. The 90th-percentile value for the entire dataset, covering 4 years of sampling. 
2. The 90th-percentile value for each year in the dataset, using samples gathered year-round. 
3. The 90th-percentile value for each year in the dataset, but only using samples from June-Sept. When utilities go on reduced monitoring, they only sample in the summer, when research shows lead levels are highest. 
4. The 90th-percentile value for each six-month period in the data. 
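
The four breakdowns above can be sketched compactly with pandas grouping. The example below uses a made-up eight-row table (the column names mirror this notebook, but the values are invented) and a hypothetical helper, `p90`. It illustrates the idea and is not the function used for the results that follow.

```python
import pandas as pd

# Made-up first-draw results (ppb) across two years.
toy = pd.DataFrame({
    'date_sampled': pd.to_datetime(['2016-02-10', '2016-07-01', '2016-08-15', '2016-11-20',
                                    '2017-03-05', '2017-06-18', '2017-09-09', '2017-12-01']),
    '1st_draw': [4.0, 22.0, 9.0, 3.0, 6.0, 17.0, 12.0, 2.0],
})
toy['year'] = toy['date_sampled'].dt.year
toy['month'] = toy['date_sampled'].dt.month
toy['half'] = toy['month'].map(lambda m: 'Jan-June' if m <= 6 else 'July-Dec')

def p90(series):
    """90th percentile using the nearest observed value (hypothetical helper)."""
    return series.quantile(0.9, interpolation='nearest')

overall = p90(toy['1st_draw'])                                   # 1. entire dataset
by_year = toy.groupby('year')['1st_draw'].apply(p90)             # 2. each full year
summer = (toy[toy['month'].between(6, 9)]                        # 3. June-Sept only
          .groupby('year')['1st_draw'].apply(p90))
by_half = toy.groupby(['year', 'half'])['1st_draw'].apply(p90)   # 4. six-month periods

print(overall, by_year, summer, by_half, sep='\n\n')
```

Grouping keeps each monitoring period's calculation identical, so any difference in the results comes from which samples fall in the period rather than from how the percentile is computed.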


In [13]:

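def pass_fail(value, rule):
    """Return a plain-English verdict for a 90th-percentile value (ppb) under a rule."""
    if rule == 'cdc':
        # Health-based threshold of 7.5 ppb
        return 'passes' if value <= 7.5 else 'fails'

    elif rule == 'revision':
        # Proposed revision: 10 ppb trigger level plus the 15 ppb action level
        if value >= 10 and value <= 15.4:
            return 'fails the trigger level but passes'
        elif value > 15.4:
            return 'fails'
        elif value < 10:
            return 'passes'

    else:
        # All other rules use the 15 ppb action level (15.4 after rounding)
        return 'passes' if value <= 15.4 else 'fails'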


def sampling_results(rule, df=chicago_sampling):
    # 5 rules, 4 monitoring-period breakdowns for each rule
    
    # format rule names in plain english
    rule_name = {
        'lcr': 'Original Lead and Copper Rule',
        'revision': 'Proposed Lead and Copper Rule Revision',
        'mi_lcr': 'Michigan Lead and Copper Rule',
        'highest_sample': 'Highest of the 1st-10th liters',
        'cdc' : 'CDC Health-Based Action Level'
    }
    
    # Different year masks
    mask_2016 = df['year'] == 2016
    mask_2017 = df['year'] == 2017
    mask_2018 = df['year'] == 2018
    mask_2019 = df['year'] == 2019
    
    # Different months for testing 
    annual_testing_months = [6,7,8,9]
    first_half = [1,2,3,4,5,6]
    second_half = [7,8,9,10,11,12]
    
    mask_annual_testing = df['month'].isin(annual_testing_months)
    mask_first_half = df['month'].isin(first_half)
    mask_second_half = df['month'].isin(second_half)
    
    if rule == 'lcr' or rule == 'cdc' or rule == 'revision':
        sample = '1st_draw'
        interpolation = 'nearest'
    
    elif rule == 'mi_lcr':
        sample = '1st_5th_highest'
        interpolation = 'linear'
    
    elif rule == 'highest_sample':
        sample = 'highest_sample'
        interpolation = 'nearest'
    

    chi_tot = df[sample].quantile(.9, interpolation=interpolation)

    chi_2016_tot = df[mask_2016][sample].quantile(.9, interpolation=interpolation)
    chi_2017_tot = df[mask_2017][sample].quantile(.9, interpolation=interpolation) 
    chi_2018_tot = df[mask_2018][sample].quantile(.9, interpolation=interpolation)
    chi_2019_tot = df[mask_2019][sample].quantile(.9, interpolation=interpolation) 

    chi_2016_annual = df[(mask_2016)&(mask_annual_testing)][sample].quantile(.9, interpolation=interpolation) 
    chi_2017_annual = df[(mask_2017)&(mask_annual_testing)][sample].quantile(.9, interpolation=interpolation) 
    chi_2018_annual = df[(mask_2018)&(mask_annual_testing)][sample].quantile(.9, interpolation=interpolation)
    chi_2019_annual = df[(mask_2019)&(mask_annual_testing)][sample].quantile(.9, interpolation=interpolation)

    chi_2016_pt1 = df[(mask_2016)&(mask_first_half)][sample].quantile(.9, interpolation=interpolation) 
    chi_2017_pt1 = df[(mask_2017)&(mask_first_half)][sample].quantile(.9, interpolation=interpolation) 
    chi_2018_pt1 = df[(mask_2018)&(mask_first_half)][sample].quantile(.9, interpolation=interpolation)
    chi_2019_pt1 = df[(mask_2019)&(mask_first_half)][sample].quantile(.9, interpolation=interpolation)

    chi_2016_pt2 = df[(mask_2016)&(mask_second_half)][sample].quantile(.9, interpolation=interpolation) 
    chi_2017_pt2 = df[(mask_2017)&(mask_second_half)][sample].quantile(.9, interpolation=interpolation) 
    chi_2018_pt2 = df[(mask_2018)&(mask_second_half)][sample].quantile(.9, interpolation=interpolation)
    chi_2019_pt2 = df[(mask_2019)&(mask_second_half)][sample].quantile(.9, interpolation=interpolation)
    
    # Length of each dataset 
    chi_2016_tot_len = df[mask_2016][sample].count() #len(df[mask_2016][sample])
    chi_2017_tot_len = df[mask_2017][sample].count() #len(df[mask_2017][sample])
    chi_2018_tot_len = df[mask_2018][sample].count() #len(df[mask_2018][sample])
    chi_2019_tot_len = df[mask_2019][sample].count() #len(df[mask_2019][sample])
    
    chi_2016_annual_len = df[(mask_2016)&(mask_annual_testing)][sample].count() #len(df[(mask_2016)&(mask_annual_testing)][sample])
    chi_2017_annual_len = df[(mask_2017)&(mask_annual_testing)][sample].count() #len(df[(mask_2017)&(mask_annual_testing)][sample])
    chi_2018_annual_len = df[(mask_2018)&(mask_annual_testing)][sample].count() #len(df[(mask_2018)&(mask_annual_testing)][sample])
    chi_2019_annual_len = df[(mask_2019)&(mask_annual_testing)][sample].count() #len(df[(mask_2019)&(mask_annual_testing)][sample])
    
    chi_2016_pt1_len = df[(mask_2016)&(mask_first_half)][sample].count() #len(df[(mask_2016)&(mask_first_half)][sample])
    chi_2017_pt1_len = df[(mask_2017)&(mask_first_half)][sample].count() #len(df[(mask_2017)&(mask_first_half)][sample])
    chi_2018_pt1_len = df[(mask_2018)&(mask_first_half)][sample].count() #len(df[(mask_2018)&(mask_first_half)][sample])
    chi_2019_pt1_len = df[(mask_2019)&(mask_first_half)][sample].count() #len(df[(mask_2019)&(mask_first_half)][sample])
    
    chi_2016_pt2_len = df[(mask_2016)&(mask_second_half)][sample].count() #len(df[(mask_2016)&(mask_second_half)][sample])
    chi_2017_pt2_len = df[(mask_2017)&(mask_second_half)][sample].count() #len(df[(mask_2017)&(mask_second_half)][sample])
    chi_2018_pt2_len = df[(mask_2018)&(mask_second_half)][sample].count() #len(df[(mask_2018)&(mask_second_half)][sample])
    chi_2019_pt2_len = df[(mask_2019)&(mask_second_half)][sample].count() #len(df[(mask_2019)&(mask_second_half)][sample])
   
    print(f"""
    Sampling Protocol: {rule_name[rule]}
    - 90th-percentile value for the entire dataset: {chi_tot}, This {pass_fail(chi_tot, rule) +' the AL'}

    - 90th-percentile value for each year, for the entire year:
    -- 2016: {round(chi_2016_tot,2)} ppb, This {pass_fail(chi_2016_tot, rule) +' the AL'} (n= {chi_2016_tot_len})
    -- 2017: {round(chi_2017_tot,2)} ppb, This {pass_fail(chi_2017_tot, rule) +' the AL'} (n= {chi_2017_tot_len})
    -- 2018: {round(chi_2018_tot,2)} ppb, This {pass_fail(chi_2018_tot, rule) +' the AL'} (n= {chi_2018_tot_len})
    -- 2019: {round(chi_2019_tot,2)} ppb, This {pass_fail(chi_2019_tot, rule) +' the AL'} (n= {chi_2019_tot_len})

    - 90th-percentile value for each year, for the annual federal testing months of June-Sept:
    -- 2016: {round(chi_2016_annual,2)} ppb, This {pass_fail(chi_2016_annual, rule) +' the AL'} (n= {chi_2016_annual_len})
    -- 2017: {round(chi_2017_annual,2)} ppb, This {pass_fail(chi_2017_annual, rule) +' the AL'} (n= {chi_2017_annual_len})
    -- 2018: {round(chi_2018_annual,2)} ppb, This {pass_fail(chi_2018_annual, rule) +' the AL'} (n= {chi_2018_annual_len})
    -- 2019: {round(chi_2019_annual,2)} ppb, This {pass_fail(chi_2019_annual, rule) +' the AL'} (n= {chi_2019_annual_len})

    - 90th-percentile value for each six-month period
    -- Jan - June 2016: {round(chi_2016_pt1,2)} ppb, This {pass_fail(chi_2016_pt1, rule) +' the AL'} (n= {chi_2016_pt1_len})
    -- July - Dec 2016: {round(chi_2016_pt2,2)} ppb, This {pass_fail(chi_2016_pt2, rule) +' the AL'} (n= {chi_2016_pt2_len})

    -- Jan - June 2017: {round(chi_2017_pt1,2)} ppb, This {pass_fail(chi_2017_pt1, rule) +' the AL'} (n= {chi_2017_pt1_len})
    -- July - Dec 2017: {round(chi_2017_pt2,2)} ppb, This {pass_fail(chi_2017_pt2, rule) +' the AL'} (n= {chi_2017_pt2_len})

    -- Jan - June 2018: {round(chi_2018_pt1,2)} ppb, This {pass_fail(chi_2018_pt1, rule) +' the AL'} (n= {chi_2018_pt1_len})
    -- July - Dec 2018: {round(chi_2018_pt2,2)} ppb, This {pass_fail(chi_2018_pt2, rule) +' the AL'} (n= {chi_2018_pt2_len})

    -- Jan - June 2019: {round(chi_2019_pt1,2)} ppb, This {pass_fail(chi_2019_pt1, rule) +' the AL'} (n= {chi_2019_pt1_len})
    -- July - Dec 2019: {round(chi_2019_pt2,2)} ppb, This {pass_fail(chi_2019_pt2, rule) +' the AL'} (n= {chi_2019_pt2_len})
    """)

## Current LCR Sampling Protocol

The current LCR's testing protocol requires that a utility test the first liter of water drawn from a tap after the water has sat motionless in the pipes for at least six hours. This protocol routinely misses the highest levels of lead because the first liter has been sitting in the plumbing inside the house, which typically isn't made of lead. Water that rests farther back in the plumbing, in contact with the lead service line, will have higher lead content than water in the house.

If the 90th-percentile value of the first-draw samples is at or below 15 ppb (technically 15.4 due to rounding), the utility is in compliance. If it tests above the action level, the utility is required to re-optimize corrosion control and re-sample. If it is still above the lead action level, the utility must begin replacing lead service lines until it tests under the action level for two consecutive six-month periods.

In [14]:
sampling_results('lcr')


    Sampling Protocol: Original Lead and Copper Rule
    - 90th-percentile value for the entire dataset: 16.0, This fails the AL

    - 90th-percentile value for each year, for the entire year:
    -- 2016: 15.5 ppb, This fails the AL (n= 128)
    -- 2017: 16.8 ppb, This fails the AL (n= 115)
    -- 2018: 18.6 ppb, This fails the AL (n= 143)
    -- 2019: 13.5 ppb, This passes the AL (n= 264)

    - 90th-percentile value for each year, for the annual federal testing months of June-Sept:
    -- 2016: 23.8 ppb, This fails the AL (n= 50)
    -- 2017: 17.2 ppb, This fails the AL (n= 29)
    -- 2018: 15.4 ppb, This passes the AL (n= 63)
    -- 2019: 15.2 ppb, This passes the AL (n= 135)

    - 90th-percentile value for each six-month period
    -- Jan - June 2016: 22.3 ppb, This fails the AL (n= 25)
    -- July - Dec 2016: 15.3 ppb, This passes the AL (n= 103)

    -- Jan - June 2017: 16.3 ppb, This fails the AL (n= 76)
    -- July - Dec 2017: 17.2 ppb, This fails the AL (n= 39)

    -- Jan

## LCR Proposed Revision 

The proposed revision to the Lead and Copper Rule keeps first-draw sampling, but it adds new requirements for utilities that exceed certain lead levels. The EPA has proposed establishing a trigger level of 10 ppb, in addition to the action level of 15 ppb.

If a utility's 90th-percentile value falls between 10 and 15 ppb, it needs to re-optimize corrosion control and implement a lead service line replacement program. The revision does not specify a set percentage of lead service lines the utility would be required to replace after a trigger level exceedance, only that a utility's plan must be approved by the state.

Though the revision does not change the action level, it does add requirements for cities to conduct follow-up testing with individual sites that test higher than 15 ppb. 

In [15]:
sampling_results('revision')


    Sampling Protocol: Proposed Lead and Copper Rule Revision
    - 90th-percentile value for the entire dataset: 16.0, This fails the AL

    - 90th-percentile value for each year, for the entire year:
    -- 2016: 15.5 ppb, This fails the AL (n= 128)
    -- 2017: 16.8 ppb, This fails the AL (n= 115)
    -- 2018: 18.6 ppb, This fails the AL (n= 143)
    -- 2019: 13.5 ppb, This fails the trigger level but passes the AL (n= 264)

    - 90th-percentile value for each year, for the annual federal testing months of June-Sept:
    -- 2016: 23.8 ppb, This fails the AL (n= 50)
    -- 2017: 17.2 ppb, This fails the AL (n= 29)
    -- 2018: 15.4 ppb, This fails the trigger level but passes the AL (n= 63)
    -- 2019: 15.2 ppb, This fails the trigger level but passes the AL (n= 135)

    - 90th-percentile value for each six-month period
    -- Jan - June 2016: 22.3 ppb, This fails the AL (n= 25)
    -- July - Dec 2016: 15.3 ppb, This fails the trigger level but passes the AL (n= 103)

    -- Jan

## Michigan LCR 

In June 2018, Michigan adopted the country's toughest lead regulations. Utilities with lead service lines in Michigan are required to test the first and the fifth liter from the tap at each site and use the higher of the two results in their 90th-percentile calculations. That means that if there are 50 testing sites, two samples are taken at each and the higher of the two is used when calculating the 90th-percentile value. 
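
A minimal sketch of that higher-of-two calculation, using five made-up sites rather than the Chicago data. The column name `compliance_value` is an illustrative choice, and linear interpolation is used because that is what this notebook applies for the Michigan rule.

```python
import pandas as pd

# Made-up results (ppb) for five hypothetical sites.
sites = pd.DataFrame({
    '1st_draw': [3.0, 8.0, 16.0, 5.0, 2.0],
    '5th_draw': [9.0, 6.0, 14.0, 21.0, 4.0],
})

# For each site, keep whichever of the two liters tested higher.
sites['compliance_value'] = sites[['1st_draw', '5th_draw']].max(axis=1)

p90_first_only = sites['1st_draw'].quantile(0.9, interpolation='linear')
p90_first_fifth = sites['compliance_value'].quantile(0.9, interpolation='linear')

print(f"first liter only: {p90_first_only:.1f} ppb")
print(f"higher of first and fifth: {p90_first_fifth:.1f} ppb")
```

Because each site contributes its higher reading, the first-and-fifth 90th percentile can never be lower than the first-liter-only value computed from the same sites.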

Michigan will also lower the action level to 12 ppb by 2025.

Adding in just one more liter nearly doubles the 90th-percentile value in Chicago, from 16 to 29.5 ppb. Some monitoring periods have 90th-percentile values as high as 39 ppb, more than double the current LCR action level. 

In [16]:
sampling_results('mi_lcr')


    Sampling Protocol: Michigan Lead and Copper Rule
    - 90th-percentile value for the entire dataset: 29.5, This fails the AL

    - 90th-percentile value for each year, for the entire year:
    -- 2016: 26.49 ppb, This fails the AL (n= 128)
    -- 2017: 36.8 ppb, This fails the AL (n= 115)
    -- 2018: 28.1 ppb, This fails the AL (n= 143)
    -- 2019: 25.68 ppb, This fails the AL (n= 264)

    - 90th-percentile value for each year, for the annual federal testing months of June-Sept:
    -- 2016: 29.96 ppb, This fails the AL (n= 50)
    -- 2017: 37.7 ppb, This fails the AL (n= 29)
    -- 2018: 23.8 ppb, This fails the AL (n= 63)
    -- 2019: 27.22 ppb, This fails the AL (n= 135)

    - 90th-percentile value for each six-month period
    -- Jan - June 2016: 25.96 ppb, This fails the AL (n= 25)
    -- July - Dec 2016: 26.02 ppb, This fails the AL (n= 103)

    -- Jan - June 2017: 32.55 ppb, This fails the AL (n= 76)
    -- July - Dec 2017: 39.38 ppb, This fails the AL (n= 39)

    --

## Highest of the 1st-10th Liters

53.5% of the 650 sites are above the action level (AL) if we take the highest sample found in the 1st through 10th liters. The 90th-percentile value in Chicago is as high as 44.9 ppb during some monitoring periods, nearly three times the action level.


In [17]:
sampling_results('highest_sample')


    Sampling Protocol: Highest of the 1st-10th liters
    - 90th-percentile value for the entire dataset: 36.9, This fails the AL

    - 90th-percentile value for each year, for the entire year:
    -- 2016: 36.7 ppb, This fails the AL (n= 128)
    -- 2017: 44.5 ppb, This fails the AL (n= 115)
    -- 2018: 35.4 ppb, This fails the AL (n= 143)
    -- 2019: 34.0 ppb, This fails the AL (n= 264)

    - 90th-percentile value for each year, for the annual federal testing months of June-Sept:
    -- 2016: 36.7 ppb, This fails the AL (n= 50)
    -- 2017: 44.5 ppb, This fails the AL (n= 29)
    -- 2018: 35.4 ppb, This fails the AL (n= 63)
    -- 2019: 37.3 ppb, This fails the AL (n= 135)

    - 90th-percentile value for each six-month period
    -- Jan - June 2016: 39.9 ppb, This fails the AL (n= 25)
    -- July - Dec 2016: 36.5 ppb, This fails the AL (n= 103)

    -- Jan - June 2017: 37.2 ppb, This fails the AL (n= 76)
    -- July - Dec 2017: 44.9 ppb, This fails the AL (n= 39)

    -- Jan - 

In [18]:
perc_failed = round(100 * len(chicago_sampling[chicago_sampling['highest_sample'] >= 15.5]) / len(chicago_sampling), 2)

print(f"{perc_failed}% of sites ({len(chicago_sampling[chicago_sampling['highest_sample'] >= 15.5])}/{len(chicago_sampling)}) had at least one liter above the action level")

53.54% of sites (348/650) had at least one liter above the action level


## CDC Health-Based Action Level

The 15 ppb action level is an arbitrary standard with no bearing on health. Scientists agree that there is no safe level of lead in drinking water, and the Lead and Copper Rule sets 0 ppb as the goal that utilities should try to achieve.


The current Lead and Copper Rule Revision leaves the action level alone. However, if the action level were lowered to reflect a health-based standard, as officials at the CDC wanted, Chicago would fail the AL according to its own compliance sampling. And no way of analyzing the expanded testing data brings the city close to meeting it.
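
To make the comparison concrete, the sketch below checks Chicago's reported compliance figure against both thresholds. The 9.1 ppb value comes from the city's 2018 Water Quality Report cited at the top of this analysis, and 7.5 ppb is the threshold this notebook's `pass_fail` helper uses for the `cdc` rule. The `verdict` function is a hypothetical stand-in written for this example.

```python
reported_p90 = 9.1           # ppb, Chicago's reported 90th-percentile lead level
current_action_level = 15    # ppb, federal action level
health_based_level = 7.5     # ppb, threshold used for the 'cdc' rule in this notebook

def verdict(p90, threshold):
    """Return 'passes' if the 90th percentile is at or below the threshold (hypothetical helper)."""
    return 'passes' if p90 <= threshold else 'fails'

current_result = verdict(reported_p90, current_action_level)
health_result = verdict(reported_p90, health_based_level)

print(f"Current action level: {current_result}")
print(f"Health-based level: {health_result}")
```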

#### Data Export

In [19]:
chicago_sampling_out_file = os.path.join(data_out, 'chicago_sampling.csv')

chicago_sampling.to_csv(chicago_sampling_out_file, index=False)