Predicting the hyper-local prevalence of chronic kidney disease with stochastic gradient boosting

TheeChris/predicting_ckd

Predicting the Hyper-local Prevalence of Chronic Kidney Disease

Chronic kidney disease (CKD) has been on the rise in recent years and is a major cause of mortality and health expenditure in the United States. This project uses 235 features drawn from U.S. Census Bureau data to test whether hyper-local rates of CKD can be predicted from readily available demographic data. The features cover age, sex, marital status, disability, employment, profession, household type, housing costs, and type of insurance. Regression and ensemble methods were used to predict CKD rates. Ultimately, gradient boosted decision trees proved to be the best model, with an adjusted R² of 0.8394; that is, the model explains roughly 84% of the variance in CKD prevalence after adjusting for the number of features.
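Because the headline metric is adjusted R² rather than plain R², a short sketch of how the two relate may help. This is a generic illustration of the standard formula, not code taken from the project notebooks; the function names and the toy numbers are assumptions made for the example.

```python
# Minimal sketch of R^2 and adjusted R^2 (standard definitions).
# Function names and sample values are hypothetical, not from the notebooks.

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R^2 for the number of predictors used.

    adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    """
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Toy example with made-up values (one feature, five observations).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(y_true, y_pred)
adj = adjusted_r_squared(r2, n_samples=len(y_true), n_features=1)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
```

With 235 features, the penalty term matters more when the number of observations is small, which is why adjusted R² is a reasonable choice for comparing models of this width.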

The purpose of this project is to help federal, state, and local public health agencies and organizations better target public health campaigns for chronic kidney disease prevention. The predictive model supports this goal by allowing limited resources to be directed to the neighborhoods with the greatest need for intervention.

Data Sources

Table of Contents

  1. Final Report: A summary of the project process, results, and actionable insights.
  2. Slide Deck: Used for presenting findings
  3. Notebooks: These were used in the following order to create the code base for this project.
    1. Data Wrangling: collecting, organizing, and cleaning datasets
    2. Data Storytelling: using exploratory data analysis to tell a story about the data
    3. Exploratory Data Analysis: exploring the data for initial insights, correlations, and possibly important features
    4. Regression Analysis: using various regression and ensemble methods to predict CKD prevalence
  4. Reports: These reports were written to track progress and explain the process throughout the project.
    1. Data Wrangling
    2. Exploratory Data Analysis
    3. Milestone Report
    4. In-Depth Analysis
  5. Images: All saved plot and map outputs

CKD Feature Importance

CKD Rate Distribution

Distribution and Correlation of CKD vs Labor Force Participation

Comparing Two Disparate Cities

Gradient Boosting Residuals Plots

Learning Rate Validation Curve
