# What conclusions can we draw from the Sentinel sites?

## Rick Snyder's tweet

On April 22nd, Governor Rick Snyder posted [this now-deleted tweet](https://mobile.twitter.com/onetoughnerd/status/723614869400866816/photo/1), which showed a graph of some results from the [Flint Sentinel testing sites](http://michiganradio.org/post/sentinel-teams-monitoring-water-400-flint-homes). The graph showed a promising upward trend in Flint's water safety.

Here's a screenshot of the tweet in question:

<img src="tweet_cropped.png">

## A few minutes later, the tweet was deleted. Why?

Besides some of the obvious issues with the chart, like "particles per lead" and the misaligned axes, there may have been some deeper methodological issues with this chart that caused @onetoughnerd to delete it. However, if the percentage of Sentinel sites below the EPA action level of 15 PPB actually is steadily increasing, this is good news! All of the sentinel data is freely available online, so we can check whether this is really the case. Using the sentinel data, we can recreate Governor Snyder's chart, and from there, we'll hopefully be able to confirm or deny what he was trying to prove.

# Recreating Snyder's Tweet

## First, some initialization

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib

## Read and join the Sentinel data

All of this data is publicly available on the [Flint Water website](http://www.michigan.gov/flintwater/) and was downloaded on May 1st.

In [2]:
df_1 = pd.read_excel('data/public/Sentinel_Data_Set_1A-B_515890_7.xlsx')
df_2 = pd.read_excel('data/public/Sentinel_Data_Set_1A-B_Rnd_2_517916_7.xlsx')
df_3 = pd.read_excel('data/public/Sentinel_Data_Round_3_521415_7.xlsx')
df_4 = pd.read_excel('data/public/Sentinel_Data_Round_4_521993_7.xlsx')

df_all = pd.concat([df_1, df_2, df_3, df_4], axis = 0)
df_all = df_all.reset_index()

In [3]:
# Group by dates

df_all['Date']=df_all['Date_Submittted']

def group_fn(idx):
    x = idx['Date']
    if x < np.datetime64('2016-02-16'):
        return 0
    if x < np.datetime64('2016-02-23'):
        return 1
    if x < np.datetime64('2016-03-02'):
        return 2
    if x < np.datetime64('2016-03-30'):
        return 3
    if x < np.datetime64('2016-04-14'):
        return 4
    return 5


df_all['Group'] = df_all.apply(group_fn, axis = 1)
gb = df_all.groupby('Group', axis = 0)

# Plot the percentage of tests below the action level

In [4]:
def percent_below_action_level(x):
    return np.mean(x<15)


x = ['February 22', 'March 1', 'March 29', 'April 14']
# Aggregate only the lead column, so non-numeric columns are never compared to 15
y = gb['Result_Lead_(PPB)'].agg(percent_below_action_level).values
rick = [.891, .913, .921, .927]

plt.plot(rick, 'g')
plt.plot(y, 'b')

plt.title('Percentage of samples with lead levels below 15 ppb')
plt.ylabel('Percentage of samples < 15 ppb')
plt.xlabel('Samples taken on or before')
plt.xticks(range(4), x)
plt.legend(["Rick Snyder's Tweet", 'Sentinel Data'], loc=4)
plt.savefig("out/pct_samples_below_15ppb.png", dpi=500)

# Here's our result, compared to Governor Snyder's tweet:

<img src='out/pct_samples_below_15ppb.png' style="width:500px;"/>

Cool, so the numbers mostly match up. From this plot, the trend seems to be clearly upwards. But as a wise statistics professor once told me, **_Never trust a graph without error bars._** When we compute aggregate statistics over a small number of samples, there's always some uncertainty in the estimates we receive. It is therefore possible that the real distribution of the sentinel data doesn't actually follow a clear upward trend, but that the trend we see here is just due to random chance. Without a more in-depth analysis, we cannot draw any definite conclusions. Let's explore this further.

# First, let's recreate the same plot, but with error bars

In order to get error bars, we draw 1000 bootstrap samples (resamples of the data, taken with replacement) and plot the 95% confidence interval around each of our estimates. This gives us an idea of how uncertain each estimate is. Essentially, the confidence interval estimates a range for which we can say the following: *if we were to resample the data many times, the percentage of samples below 15 PPB would fall in this range 95% of the time*.
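To make the bootstrap idea concrete, here is a minimal sketch of a percentile bootstrap written directly in NumPy. It is not how seaborn computes its intervals internally, just an illustration of the same idea. The function name `bootstrap_ci` and the `readings` array are made up for this example and are not Sentinel data.

```python
import numpy as np

def bootstrap_ci(values, statistic, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for statistic(values)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Resample with replacement and recompute the statistic each time
    boot_stats = np.array([
        statistic(rng.choice(values, size=len(values), replace=True))
        for _ in range(n_boot)
    ])
    # Cut off equal tails on both sides of the bootstrap distribution
    tail = (100 - ci) / 2
    low, high = np.percentile(boot_stats, [tail, 100 - tail])
    return low, high

# Invented lead readings in ppb, for illustration only
readings = np.array([0, 1, 2, 0, 3, 5, 1, 0, 22, 4,
                     2, 1, 0, 17, 3, 2, 6, 1, 0, 40])

def frac_below(x):
    # Fraction of readings under the 15 ppb action level
    return np.mean(x < 15)

point_estimate = frac_below(readings)
low, high = bootstrap_ci(readings, frac_below)
print(f"{point_estimate:.2f} (95% CI: {low:.2f} to {high:.2f})")
```

With only 20 readings, the interval is wide, which is exactly the kind of uncertainty a bare trend line hides.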

In [5]:
sns.factorplot(y = 'Result_Lead_(PPB)', x = 'Group', data=df_all,
               estimator = lambda x: (x<15).mean(), n_boot=1000, ci=95)

plt.title('Percentage of samples with lead levels below 15 ppb')
plt.grid(b = True, axis='x', which='major')
plt.ylabel('Percentage of samples < 15 ppb')
plt.xlabel('Samples taken on or before')
plt.xticks(range(4), x)
plt.savefig('out/pct_samples_below_15ppb_errorbars.png', dpi=500)

# Error bars give us a clearer picture of the distribution

<img src='out/pct_samples_below_15ppb_errorbars.png' style="width:500px;"/>

With such wide confidence intervals, we can't conclude much about the trend in water safety. It's entirely possible that this upward trend is due to nothing more than random chance.

# Do other statistics show positive trends?

Although our results so far have been inconclusive, we can look at other statistics from the sentinel data and check for positive trends. If we consistently see positive trends, we can more safely conclude that the situation in Flint is improving.

## 90th percentile of lead readings

Another statistic relevant to water quality is the 90th percentile of the distribution of lead measurements, that is, the level that 90% of readings fall at or below. This tells us roughly how high the lead readings are *for the 10% of houses with the highest readings.* Again, we can easily plot the 90th percentile of the lead readings for each round of sentinel tests, with a 95% confidence interval drawn around the estimate.

In [6]:
sns.factorplot(y = 'Result_Lead_(PPB)', x = 'Group', data=df_all,
               estimator = lambda x: np.percentile(x, 90), n_boot=1000, ci = 95)
p1, = plt.plot(range(-1,5), [15]*6, 'r--')

plt.title('90th Percentile of Lead Readings')
plt.grid(b = True, axis='x', which='major')
plt.ylabel('Lead (PPB)')
plt.xlabel('Samples taken on or before')
plt.xticks(range(4), x)
plt.legend([p1], ['Federal Action Level (15 PPB)'])
plt.savefig('out/pctile_90.png', dpi=500)

<img src='out/pctile_90.png' style="width:500px;"/>

Again, we see a positive trend (down is good in this case!). Even more interesting, the confidence interval has moved below the federal action level of 15 PPB. With 95% confidence, we can now conclude that, among the sentinel sites, at least 90% of homes have lead readings below 15 PPB.
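The decision rule we're applying here can be sketched in a few lines: bootstrap the 90th percentile, then check whether the whole interval sits on one side of the action level. The helper names `p90_interval` and `compare_to_action_level`, and the `clean_round` readings, are hypothetical and only meant to illustrate the reasoning.

```python
import numpy as np

ACTION_LEVEL = 15  # federal action level, in ppb

def p90_interval(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap interval for the 90th percentile of values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Each row is one bootstrap resample, drawn with replacement
    resamples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    boot_p90 = np.percentile(resamples, 90, axis=1)
    tail = (100 - ci) / 2
    low, high = np.percentile(boot_p90, [tail, 100 - tail])
    return low, high

def compare_to_action_level(low, high, level=ACTION_LEVEL):
    """Classify an interval relative to the action level."""
    if high < level:
        return "below"         # the whole interval is under the level
    if low > level:
        return "above"         # the whole interval is over the level
    return "inconclusive"      # the interval contains the level

# Invented readings (ppb) in which every sample is well under 15
clean_round = np.array([0, 1, 1, 2, 2, 3, 3, 4, 6, 9] * 5)
low, high = p90_interval(clean_round)
print(compare_to_action_level(low, high))
```

Only the "below" case supports the claim that the 90th percentile is under 15 PPB. An interval that straddles the line supports neither conclusion.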

# Are these trends the same in the residential data?

Note that our previous analysis has only considered data collected from the sentinel sites. The water samples taken from the sentinel sites are controlled and reliable, but comprise only a small number of locations in the city. Although the sample size is small, this sample is meant to be representative of the entire city. If this is the case, we would hope to see the same trends in the voluntary tests submitted by residents. However, it is important to note that the voluntary residential test data is particularly subject to certain biases. In particular, residents are permitted to test the water in their homes as many times as they want, and at irregular intervals. Those who have tested their water before may be more likely to test again if they received high lead readings. We proceed with these potential biases in mind.
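To see why repeat testing matters, here is a toy simulation. Every number in it is invented: the lognormal distribution of lead levels, the 5000 hypothetical homes, and the assumption that each home reading at or above 15 ppb submits two more tests with the same result. It says nothing about the actual size of the bias in Flint, only about its direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented lead levels (ppb) for hypothetical homes; not real Flint data
n_homes = 5000
first_tests = rng.lognormal(mean=1.0, sigma=1.2, size=n_homes)

# Assumption: every home tests once, and homes that read at or above
# 15 ppb submit two more tests that return the same reading
n_retests = 2
high = first_tests[first_tests >= 15]
pooled = np.concatenate([first_tests, np.repeat(high, n_retests)])

p90_one_per_home = np.percentile(first_tests, 90)
p90_pooled = np.percentile(pooled, 90)

print(f"One test per home:   {p90_one_per_home:.1f} ppb")
print(f"All submitted tests: {p90_pooled:.1f} ppb")
```

Because the extra submissions all come from high-reading homes, the pooled 90th percentile comes out higher than the one-test-per-home figure, even though the underlying homes are identical.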

## First, load and parse the residential data

This data is also available via the [Flint Water website](http://www.michigan.gov/flintwater/).


In [7]:
# Tick labels and log-transformed lead readings for the sentinel data
x = ['February 22', 'March 1', 'March 29', 'April 14']
df_all['Log_Lead'] = np.log(df_all['Result_Lead_(PPB)']+1)

# Read the residential test data and keep only the columns we need
df_residential = pd.read_csv('data/residential_test_data.csv',
                             parse_dates=[1])
df_residential = pd.DataFrame(
    {'Lead': df_residential['Lead (ppb)'],
     'Date': df_residential['Date Submitted'],
     'Source': ['Residential']*len(df_residential)})

# Put the sentinel data into the same three-column layout
df_sentinel = pd.DataFrame(
    {'Lead': df_all['Result_Lead_(PPB)'],
     'Date': df_all['Date_Submittted'],
     'Source': ['Sentinel']*len(df_all)})

# Combine sentinel and residential
df_final = pd.concat([df_residential, df_sentinel], axis = 0)

df_final['Group'] = df_final.apply(group_fn, axis=1)
df_final['Log_Lead'] = np.log(df_final['Lead'] + 1)

# 90th percentile of lead readings, revisited

In [9]:
# Create the plots
sent = sns.factorplot(x = 'Group', y = 'Lead', hue = 'Source', hue_order=['Sentinel', 'Residential'],
                      estimator = lambda x: np.percentile(x, 90), data=df_final,
                      n_boot=1000)

p1, = plt.plot(range(-1,6), [15]*7, 'r--')

x = ['February 15', 'February 22', 'March 1', 'March 29', 'April 14']

plt.title('90th Percentile of Lead Readings')
plt.grid(b = True, axis='x', which='major')
plt.ylabel('Lead (PPB)')
plt.xlabel('Samples taken on or before')
plt.xticks(range(5), x)
plt.legend([p1], ['Federal Action Level (15 PPB)'])
plt.savefig('out/res_vs_sent.png', dpi=500)

<img src="out/res_vs_sent.png" style="width:500px;"/>


Not only do these two trend lines look very different from one another, it seems like the downward trend is even less clear in the residential data than it was in the sentinel data. In addition, since the confidence interval from the residential data includes the federal action level of 15 ppb, we *are unable to conclude that 90% of the homes in Flint have lead below 15 ppb*.

# Conclusions

We can safely say that _**among the sentinel sites** we have sufficient evidence to conclude that the 90th percentile of lead readings is below 15 ppb_. However, _we fail to draw the same conclusion using the voluntary residential lead samples._

So was Governor Snyder's tweet misleading? In a way, yes, since it failed to use proper statistical methodology to give a full picture of the data it was representing. However, the conclusion suggested by Governor Snyder's graph can be reached through other means, as we've shown here. The sentinel sites have, in fact, shown improvement over the past few months.