Skip to content

The Mortality Multiple Cause-of-Death 2022 dataset from the NCHS preprocessed into a nice parquet file ๐Ÿ˜ฎโ€๐Ÿ’จ.

License

Notifications You must be signed in to change notification settings

RemiKalbe/mortality-multiple-cause-of-death-dataset-2022

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

2 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Mortality Multiple Cause-of-Death Dataset: Preprocessed

This repository contains the preprocessed version of the Mortality Multiple Cause-of-Death dataset, provided by the National Center for Health Statistics (NCHS) at the Centers for Disease Control and Prevention (CDC). The original dataset can be found on the NCHS website.

Dataset Description

The Mortality Multiple Cause-of-Death dataset contains detailed mortality information for individuals who died in the United States within a given year. It includes various attributes such as demographic information, cause of death, lifestyle factors, and medical conditions.

The original dataset is provided in a complex format, with each row having a fixed width and no separators or headers. The data is encoded using numbers and characters, making it challenging to work with directly.

To make the dataset more accessible and usable, I have preprocessed it and translated it into a more friendly format. The resulting dataset is stored as a Parquet file, which is a columnar storage format that provides efficient compression, encoding schemes and retain the schema.

For a detailed explanation of the thought process and the preprocessing steps, please refer to my blog post: Mortality Multiple Cause-of-Death Dataset: A Journey of Data Preprocessing.

Dataset Location

The preprocessed dataset can be found in the datasets folder of this repository. The Parquet file is named MMCD_2022.parquet and contains the cleaned and structured version of the Mortality Multiple Cause-of-Death dataset.

If you are primarily interested in using the preprocessed dataset, you can directly access the Parquet file from the datasets folder.

License

The Mortality Multiple Cause-of-Death dataset is publicly available and provided by the National Center for Health Statistics (NCHS) at the Centers for Disease Control and Prevention (CDC). Please refer to the NCHS website for any specific terms of use or restrictions.

The code in this repository is released under the MIT License.

Acknowledgements

I would like to acknowledge the National Center for Health Statistics (NCHS) and the Centers for Disease Control and Prevention (CDC) for providing the Mortality Multiple Cause-of-Death dataset and the accompanying documentation.

If you have any questions or suggestions, please feel free to reach out ๐Ÿ˜Œ.

About

The Mortality Multiple Cause-of-Death 2022 dataset from the NCHS preprocessed into a nice parquet file ๐Ÿ˜ฎโ€๐Ÿ’จ.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages