This repository contains code for analysis of the NHANES dataset. Specifically, it contains code that examines the unique food items in the NHANES dietary data. The food items are clustered into new food groups based on nutrient similarities. These food groups represent the result of a data-driven approach of developing food groups for use …


This repository contains code related to NHANES dietary data analysis for creating nutrient-driven food groups, as documented in:

M. Wyatt, T. Johnston, M. Papas, and M. Taufer. Development of a Scalable Method for Creating Food Groups Using the NHANES Dataset and MapReduce. In Proceedings of the ACM Bioinformatics and Computational Biology Conference (BCB), pp. 1–10. Seattle, WA, USA. October 2–4, 2016.


  • Python 2
  • Apache Spark / PySpark
  • numpy
  • scipy


The analysis is split into two parts:

Preprocessing: ./src/ contains the PySpark script for preprocessing the NHANES dietary data

Clustering: ./src/ contains the PySpark script for clustering the preprocessed data


To run the code, you will need Apache Spark installed. You can run the code with the bash script located at ./src/. This will create several new files and directories in the ./data/ directory.

The data is saved with the Spark command saveAsPickleFile and can be loaded with the Spark command pickleFile. For example, to load the processed data into a PySpark session, run: sc.pickleFile("./data/processed").
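The save and load calls mentioned above can be wrapped in two small helpers. This is a minimal sketch, not code from this repository: the helper names `save_rdd` and `load_processed` are hypothetical, and it assumes that `sc` is an existing SparkContext (as provided by the pyspark shell) and that ./data/processed was written by the preprocessing step.

```python
def save_rdd(rdd, path):
    """Write an RDD to `path` as pickled Python objects.

    `rdd` is assumed to be a PySpark RDD; Spark creates `path` as a
    directory and fails if it already exists.
    """
    rdd.saveAsPickleFile(path)


def load_processed(sc, path="./data/processed"):
    """Load an RDD that was previously written with saveAsPickleFile.

    `sc` is assumed to be a live SparkContext. The default path is the
    one named in this README.
    """
    return sc.pickleFile(path)
```

In a pyspark shell, where `sc` is already defined, `load_processed(sc).take(1)` would return the first record, which is a quick way to inspect the record layout before working with the full dataset.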


One year of NHANES data is in ./data/raw. This data and additional years of NHANES data can be downloaded from

The file ./data/features.txt contains a list of features which are to be extracted from the NHANES dietary data.
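As an illustration of how a feature list like this might be used, the sketch below reads feature names and picks them out of a single record. It is not the repository's preprocessing code. It assumes that features.txt holds one feature name per line and that a record is a dict keyed by feature name; the function names and the `missing` default are hypothetical.

```python
def parse_features(lines):
    """Return feature names from an iterable of text lines.

    Assumption: one feature name per line. Surrounding whitespace is
    stripped and blank lines are skipped.
    """
    names = []
    for line in lines:
        name = line.strip()
        if name:
            names.append(name)
    return names


def read_features(path="./data/features.txt"):
    """Read the feature list from disk (path taken from this README)."""
    with open(path) as handle:
        return parse_features(handle)


def select_features(record, features, missing=0.0):
    """Return the values of `features` from a dict-like record, in order.

    Features absent from the record are filled with `missing`; whether
    a zero fill is appropriate depends on the analysis.
    """
    return [record.get(name, missing) for name in features]
```

For example, `select_features(record, read_features())` would turn one dict-like record into a fixed-order list of values, which is the kind of shape numpy and scipy routines typically expect.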

Additionally, a script to download all NHANES data is included at ./src/. To run this, uncomment line 7 from ./src/. Alternatively, visit the main (and most up-to-date) repo for this script at