Lead Alert public data repository

Created and Maintained by:


Lead has been known for over 50 years to be harmful to humans, even in small doses. Exposure to even low levels of lead can damage the central and peripheral nervous systems and cause learning disabilities, impaired blood cell function, stunted growth, cardiovascular effects, and many other problems. Despite regulations restricting lead in building materials, children are still experiencing lead poisoning at alarming rates. While the Flint water crisis has brought renewed attention to this issue, investigations have shown that many communities across the country have childhood lead poisoning rates exceeding those of Flint. The persistence of this major public health problem, together with the creation of new relevant datasets, creates an opportunity to apply new thinking and techniques to solve it.

The creation of new datasets gives us a chance to redefine how infrastructure improvements are prioritized. Prior studies have largely focused on a single city: Flint, Michigan. In this work, we attempt to use machine learning to predict the areas of California where communities are at highest risk of lead exposure.

This project was undertaken as part of W210: Synthetic Capstone within the Master of Information and Data Science program at the UC Berkeley School of Information. The project was conceived as a product delivered to water system managers throughout California via a website. The website contains additional results and descriptions of the work, as well as visualizations of the data used. The public GitHub repository contains the collection of aggregated data sets, source code, modeling artifacts, and associated reference material.

Repo Structure

  • /data: Data sources, in TSV format, used during the analysis.
  • /notebooks: Example code for our models, including XGBoost and SMOTE oversampling.
  • /plots: Graph visualizations from the exploratory data analysis.
  • /presentation: Our final presentation.
  • /techreport: PDF file of our final technical report.
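
For readers unfamiliar with the SMOTE oversampling mentioned under /notebooks, the following is a minimal, standard-library-only sketch of the core idea: each synthetic minority-class sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority neighbours. It is an illustration only, not the code in the notebooks; the function name `smote_oversample`, its parameters, and the toy feature values are all hypothetical.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature lists (minority class only).
    Hypothetical helper for illustration; not taken from this repository.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy usage with made-up two-feature rows for the rare (high-risk) class.
rare_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extra_rows = smote_oversample(rare_rows, n_new=6, k=2, seed=42)
print(len(extra_rows))
```

In practice, a maintained implementation is usually a better choice than a hand-written one, and oversampling should be applied to the training split only, so that synthetic points do not leak into the evaluation data.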

