# The Michigan Data Science Team (MDST) Work on the Flint Water Crisis

Date: May 3rd 2016

## Who helped make this document?

(Alphabetically)
* Abhilash Narendra
* Alex Chojnacki
* Anthony Kremin
* Arya Farahi
* Chengyu Dai
* Daniel Zhang
* Eric Schwartz (Faculty Advisor) 
* Filip Jankovic
* Guangsha Shi
* Jake Abernethy (Faculty Advisor)
* Jared Webb
* Jingye Liu
* John Dryden
* Jonathan Stroud
* Sean Ma
* Wei Lee
* Wenbo Shen

(Please contact Eric <ericmsch@umich.edu> or Jake <jabernet@umich.edu> for questions!)

## Summary Notes

There is lead in Flint’s water, and we know that raises more questions than answers: Where is it? Which homes are most at risk? When will the lead levels decrease?

We want to shed light on these questions with data. Drawing on diverse sources of information, we apply cutting-edge methods in data science and statistics.

The crisis is also one of transparency of information. We’d like to bring the key information to the citizens of Flint as clearly as possible. 

What we want to do in this short writeup is give some early results that help to understand the lead level readings that are being continuously collected in Flint. We will continue updating this document as results develop. 

For questions about health and obtaining lead test kits for your home, visit [the Michigan.gov website](http://Michigan.gov/flintwater/).

## Diverse data sources

The figures and data below are based on several datasets.

* *Residential Testing data*: Flint residents continue to submit water samples to the Department of Environmental Quality (DEQ), which tests their water and posts results to the Michigan.gov/FlintWater/ website.
* *Sentinel Sites Testing data*: available on the [Michigan.gov](http://Michigan.gov/FlintWater/) website
* *Parcel data*: obtained from the City of Flint
* *Service Line data*: provided by City of Flint and [UM-Flint GIS Center](https://www.umflint.edu/gis)
* *Fire Hydrant data*: provided by City of Flint

## Where is there high lead?

Elevated lead readings are occurring throughout the city, and they appear to be quite geographically diverse. A location is determined to have *elevated lead* if the DEQ recorded more than 15 parts per billion (ppb) in a water sample (using EPA standards).

The map below shows all the parcels in the Residential Testing data, displaying low (blue) and elevated (red) lead levels.  

```python
from IPython.display import IFrame
IFrame('http://web.eecs.umich.edu/~jabernet/FlintWater/all_residential_lead_readings.html', width=500, height=500)
```

## Elevated lead levels are less than 9% of samples

* High lead levels (greater than 15 ppb) are 8.3% of the samples.
* Dangerous lead levels (greater than 50 ppb) make up 3.1% of the data.
* Very dangerous lead levels (greater than 150 ppb) make up 1.2% of samples.
* 90% of the readings (the share that the EPA standard is based on) are 12 ppb or less.
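The shares above are simple threshold counts plus a 90th percentile. Here is a minimal sketch of how such numbers could be computed, assuming the readings are available as a plain list of ppb values. The function names and the sample values are hypothetical and are not the Flint data; the percentile uses the nearest-rank definition, which may differ slightly from the method used for the figures above.

```python
def share_above(readings_ppb, threshold_ppb):
    """Fraction of readings strictly greater than threshold_ppb."""
    return sum(r > threshold_ppb for r in readings_ppb) / len(readings_ppb)

def percentile_nearest_rank(readings_ppb, pct):
    """Smallest reading such that at least pct percent of readings are <= it."""
    ordered = sorted(readings_ppb)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Made-up readings in ppb, for illustration only.
sample_ppb = [0, 0, 1, 2, 2, 3, 5, 8, 20, 160]

for threshold in (15, 50, 150):
    print(f"> {threshold} ppb: {share_above(sample_ppb, threshold):.0%}")
print("90th percentile:", percentile_nearest_rank(sample_ppb, 90), "ppb")
```

With the real Residential Testing data, `sample_ppb` would be replaced by the column of lead readings.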


<img src="../Images/lead_histogram.png" width=600px style="margin:0">


## Note about data sources

Much attention focuses on a dataset of fewer than 700 houses sampled repeatedly (Sentinel Site data). We are instead using more than 8,000 unique houses contributing over 15,000 total samples (Residential Testing data).
There is more value in that data than is currently being used, and that is what we explore here. Which homes are most at risk?
Thanks to the wide range of property types, geographic areas, and lead levels, we can start to answer these key questions about what helps predict lead.

## What helps us predict lead? 

The lead readings are known to be highly variable and depend on a number of factors including the way the test was conducted, the time of the day, and the number of hours during which water sat idle in the pipes. The factors that we focus on are the attributes of the property, including the age of construction, condition of the property, when in 2015-16 the sample was taken, and material of service line pipe connecting house plumbing to street pipes.

There seem to be lots of relevant factors. The *Property Age* seems to be very important.

## Important variable: the age of the property

We observed that one attribute of the parcel that is strongly correlated with lead levels is the **year during which the property was built**. There is a sharp decline for buildings built after 1950: for those built in 1950 or before, 10% of readings are above 15 ppb, compared to only 6% for newer properties.

<img src="../Images/lead_by_yearbuilt_residential_tests_annotate.png" width=600px style="margin:0">

The points in the plot reveal when most of Flint's construction occurred (note the decrease during the Great Depression in the 1930s). The line shows the average predicted lead level.
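The comparison between properties built in 1950 or before and those built later is a grouped rate of elevated readings. The sketch below shows one way to compute it, assuming each sample comes as a `(year_built, lead_ppb)` pair. The function name, the group labels, and the sample records are hypothetical and made up for illustration.

```python
def elevated_rate_by_era(records, cutoff_year=1950, threshold_ppb=15):
    """Share of readings above threshold_ppb, split at cutoff_year.

    records: iterable of (year_built, lead_ppb) pairs.
    """
    counts = {
        "built_in_or_before_cutoff": [0, 0],  # [elevated, total]
        "built_after_cutoff": [0, 0],
    }
    for year_built, lead_ppb in records:
        key = ("built_in_or_before_cutoff" if year_built <= cutoff_year
               else "built_after_cutoff")
        counts[key][1] += 1
        if lead_ppb > threshold_ppb:
            counts[key][0] += 1
    # A group with no samples gets None instead of a rate.
    return {k: (e / t if t else None) for k, (e, t) in counts.items()}

# Made-up (year_built, lead_ppb) records, not the Flint data.
records = [(1925, 30), (1940, 2), (1950, 0), (1948, 18),
           (1965, 1), (1972, 40), (1980, 0), (1999, 3)]
rates = elevated_rate_by_era(records)
print(rates)
```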

## How do lead service lines affect elevated lead readings?

The lead service lines play a role, but not as much as you would think. We still see high lead readings even when a property's service lines are made of copper, zinc, and other materials.
 
* 8% of all service lines are lead (lead only or lead mixture)
* 23% are unknown material
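Shares like these come from tallying the recorded material for each service line. The following is a minimal sketch, assuming the records carry a free-text material label. The label spellings, the grouping rule (any label containing "lead" counts as lead, which covers both lead-only and lead-mixture lines), and the sample list are assumptions for illustration and do not reflect the City of Flint's actual coding.

```python
from collections import Counter

def material_shares(material_labels):
    """Share of service lines that are lead, unknown, or another material."""
    def group(label):
        label = label.strip().lower()
        if "lead" in label:            # lead only or lead mixture
            return "lead"
        if label in ("", "unknown"):   # missing or explicitly unknown
            return "unknown"
        return "other"

    counts = Counter(group(label) for label in material_labels)
    total = sum(counts.values())
    return {g: counts[g] / total for g in ("lead", "unknown", "other")}

# Made-up labels, not the Flint service line records.
labels = ["Copper", "Lead", "Copper/Lead", "Unknown", "",
          "Galvanized", "Copper", "Copper", "Zinc", "Copper"]
shares = material_shares(labels)
print(shares)
```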


### Interactive map of high lead readings for homes with various service line types

```python
IFrame('http://web.eecs.umich.edu/~jabernet/FlintWater/service_line.html', width=500, height=500)
```

## Can we predict high lead levels?

We have data for over 8,000 properties, but there are over 50,000 parcels in Flint. Which of the not-yet-tested properties are at risk?

We apply various learning algorithms to the data and predict where we think elevated lead levels might be found. Here are the locations of those properties where our model suggests elevated lead (> 15 parts per billion) is most likely.
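To give a feel for what "predicting elevated lead" means in practice, here is a small stand-in sketch: a logistic regression fitted by plain gradient descent on two property attributes. This is not the team's actual model, which is not specified in this document. The features (property age and a lead service line flag), the training rows, and the parcel names are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(year_built, has_lead_line):
    # Assumed features: property age in centuries as of 2016, and a 0/1 flag.
    return [(2016 - year_built) / 100.0, float(has_lead_line)]

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic log-loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias

def predict_risk(weights, bias, x):
    """Estimated probability that a reading would exceed 15 ppb."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Made-up tested properties: (year_built, has_lead_line, elevated?).
tested = [(1920, 1, 1), (1930, 0, 1), (1938, 1, 1), (1945, 0, 0),
          (1948, 0, 0), (1955, 1, 1), (1962, 0, 0), (1970, 0, 1),
          (1975, 0, 0), (1988, 0, 0), (1999, 0, 0)]
rows = [features(year, lead) for year, lead, _ in tested]
labels = [elevated for _, _, elevated in tested]
weights, bias = fit_logistic(rows, labels)

# Score hypothetical untested parcels and list them from highest risk down.
untested = {"parcel A": (1925, 1), "parcel B": (1990, 0), "parcel C": (1950, 0)}
risk = {name: predict_risk(weights, bias, features(*attrs))
        for name, attrs in untested.items()}
for name in sorted(risk, key=risk.get, reverse=True):
    print(name, round(risk[name], 2))
```

The same pattern, fitting on tested properties and then scoring untested ones, applies whichever learning algorithm is used.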


### Interactive map of **predicted** locations of elevated lead

```python
IFrame('http://web.eecs.umich.edu/~jabernet/FlintWater/significant_risk_houses.html', width=500, height=500)
```

## Who are we? Some photos of the Flint Data Dive!

<img src="../Images/IMG_5718.jpg" width=400px>
<img src="../Images/IMG_5709.jpg" width=400px>