Data-driven prediction of chronic disease development

We use modern data science techniques to build a predictive model that identifies Chronic Obstructive Pulmonary Disease (COPD) from clinical and laboratory data. COPD is a leading cause of death worldwide, especially in highly urbanized regions.

Installation

The libraries used are scikit-learn (imported as sklearn), numpy, xgboost, and pandas. The Anaconda distribution of Python is recommended. The code should run without issues on any Python 3 version.
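Before opening the notebook, it can help to confirm that the four libraries are importable. The snippet below is a minimal sketch using only the standard library; the helper name `check_installed` is hypothetical and not part of this repository.

```python
import importlib.util

# Import names of the libraries listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["sklearn", "numpy", "xgboost", "pandas"]


def check_installed(modules):
    """Return a dict mapping each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_installed(REQUIRED)
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any library reported as MISSING can be installed with conda or pip before running the analysis.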

Motivation

COPD is becoming more prevalent as society grows more affluent and more people lead physically inactive lives. It is also a major concern during the COVID-19 pandemic because it can make a person more likely to become severely ill from COVID-19 (see the CDC advice).

Files

CRISP-DM Analysis.ipynb - main notebook for analysis
finalalldata.csv - the dataset used
*.png - images for blog post

Results

There are three main business questions in the analysis:

What are the clinical features that are predictive of COPD?
Which ML model can predict COPD from clinical and genetic data?
Which features play the most important role in the predictive model?

These questions matter because COPD, which often co-occurs with other chronic diseases such as cardiovascular disease and diabetes, is becoming more prevalent as society becomes more affluent and lethal diseases are eradicated. Because COPD is incurable and progressive, early detection could save lives.

In summary, the important clinical features are age, sex, and smoking status. kNN, XGBoost, logistic regression, neural network, SVM, and decision tree models can all predict COPD well from single-nucleotide polymorphisms (SNPs) and patient data. The air quality index (using location as a surrogate feature) turns out to be the most important predictor of COPD. For more details, please see the blog post and notebook.
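The general workflow behind the second and third questions (compare several classifiers, then rank features by importance) can be sketched as follows. This is an illustrative sketch, not the notebook's code: it uses synthetic data from `make_classification` rather than the COPD dataset, only three of the model families named above, and ROC AUC with 5-fold cross-validation as an assumed evaluation metric. The feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: NOT the COPD dataset from this repository.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Candidate models; scaling is added for the distance- and gradient-based ones.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Mean 5-fold cross-validated ROC AUC for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean ROC AUC = {auc:.3f}")

# Rank features by the fitted tree's impurity-based importance.
tree = models["decision tree"].fit(X, y)
ranking = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda kv: kv[1],
    reverse=True,
)
for name, importance in ranking[:3]:
    print(f"{name}: importance = {importance:.3f}")
```

Impurity-based importances come from a single fitted tree and can be unstable, so a ranking like this is best read as a starting point rather than a definitive answer.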

Acknowledgements

We use regression and boosted tree models on data from the Chinese population.
