# Uploading CDC Data

This notebook downloads CDC data and saves it to the server.

# Overview 

#####  What is NHANES Data?

NHANES stands for National Health and Nutrition Examination Survey. It is a major program of the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC). The program is designed to assess the health and nutritional status of adults and children in the United States, and it combines interviews with physical examinations. In 1999, the survey became a continuous program with a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey data are available on the internet to data users and researchers throughout the world. NHANES collects data on a variety of health-related topics, including:

* Demographic Data
* Dietary Data
* Examination Data
* Laboratory Data
* Questionnaire Data

For more information, visit the link below:
* https://www.cdc.gov/nchs/nhanes/

In [1]:
# Libraries
# These are the required libraries to run this notebook.
import requests
from bs4 import BeautifulSoup
import re
import os
import shutil

##### Downloading Process

Since data are available for different years, we first need to select the year we want to download. The program then sends a request to the website and retrieves all the data topics that are accessible for that year, such as Demographics and Dietary. Each topic has subtopics; for example, the Examination data has 14 different datasets such as Audiometry, Body Measures, and Blood Pressure. For each subtopic, the program creates a unique link and uses it to download that specific dataset. For example, the latest cycle on the website is 2017-2018 (the 2019-2020 data are not accessible because of the COVID-19 pandemic), so to download the 2017-2018 data we set the year parameter to 2017 or 2018.
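
The link-building step described above can be sketched with plain string formatting. This is only an illustration: the two URL templates are the ones this notebook defines in its parameter cell, while `build_urls` and the hard-coded topic list are hypothetical (the notebook scrapes the real topic list from the overview page).

```python
# URL templates as defined in this notebook's parameter cell.
main = 'https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear={year}'
data_url = 'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component={component}&CycleBeginYear={year}'


def build_urls(year, topics):
    """Return the overview URL for a year and one dataset-list URL per topic.

    Hypothetical helper for illustration; it sends no requests.
    """
    overview = main.format(year=year)
    per_topic = {topic: data_url.format(component=topic, year=year) for topic in topics}
    return overview, per_topic


# Assumed topic names; the notebook reads them from the website instead.
overview, per_topic = build_urls(2017, ['Demographics', 'Dietary'])
print(overview)
print(per_topic['Dietary'])
```

Each dataset-list page then yields the individual download links, one per subtopic.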

The program creates three nested folders under the specified location for each dataset. For example, if we download data from 2017, it creates a folder called 2017 and, inside it, a folder for each base topic such as Demographics, Dietary, Examination, Laboratory, and Questionnaire. Inside each topic folder, it creates one folder per subtopic. You can find an example below.
* /home/***2017***/***Demographics***/***Demographic_Variables_and_Sample_Weights_File***/***DEMO_J.XPT***
    * ***home***  -> Location that we want to download to.
    * ***2017***   -> The data year that we want to download.
    * ***Demographics***  -> Base data topic that we are going to download.
    * ***Demographic_Variables_and_Sample_Weights_File***  -> Sub data topic that we are going to download.
    * ***DEMO_J.XPT***  -> Dataset.
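
The three-level layout can be reproduced in a few lines. The sketch below is a hypothetical helper, not the notebook's own code: it builds `root/year/topic/subtopic` inside a temporary directory instead of the real download location, and the names are taken from the 2009 run shown in this notebook's output.

```python
import os
import tempfile


def dataset_path(root, year, topic, subtopic, filename):
    """Create root/year/topic/subtopic if needed and return the file path inside it."""
    folder = os.path.join(root, str(year), topic, subtopic)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, filename)


root = tempfile.mkdtemp()  # stand-in for the real download location
path = dataset_path(root, 2009, 'Demographics',
                    'Demographic Variables & Sample Weights', 'DEMO_F.XPT')
print(os.path.relpath(path, root))
```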

The cell below defines all the required parameters.

In [2]:
# Give the path where you want to download the data.
#download_to = '/dsa/groups/casestudy2021f/group_1/data/{year}/{content}/{folder}' #For casestudy
download_to = '/dsa/groups/capstonesp2022/on-campus/group_1/data/{year}/{content}/{folder}'
BASE_DIR = os.getcwd() # Current directory

# Data source url to get contents.
main = 'https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear={year}'
# Data source url to get sub contents.
data_url = 'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component={component}&CycleBeginYear={year}'
# Data source url to download.
download_url = 'https://wwwn.cdc.gov/{href}'
# Year of data that you want to download.
# When you want to download 2017-2018 you can either give 2017 or 2018.
# 2017 - 2018 --Downloaded
# 2015 - 2016 --Downloaded
# 2013 - 2014 --Downloaded --Note: the 2 files Physical Activity Monitor - Hour and Physical Activity Monitor - Minute were not downloaded because they are too big
# 2011 - 2012 --Downloaded --Note: the 2 files Physical Activity Monitor - Hour and Physical Activity Monitor - Minute were not downloaded because they are too big
# 2009 - 2010

year = 2009
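
Because either year of a cycle can be given, the output folder is named after whichever one was typed. A minimal sketch of normalizing the value first is shown below; `cycle_begin_year` and `cycle_label` are hypothetical helpers, and they assume that every cycle starts on an odd year, which holds for the cycles listed in the comments above (2009-2010 through 2017-2018).

```python
def cycle_begin_year(year):
    """Map either year of a two-year cycle to the cycle's first year."""
    if year < 1999:
        raise ValueError('the continuous survey started in 1999')
    # Assumption: cycles start on odd years (1999-2000, 2001-2002, ...).
    return year if year % 2 == 1 else year - 1


def cycle_label(year):
    """Return a label such as '2009-2010' for either year of the cycle."""
    begin = cycle_begin_year(year)
    return f'{begin}-{begin + 1}'


print(cycle_begin_year(2018), cycle_label(2010))
```

With such a helper, giving 2017 or 2018 would always produce the same folder name.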

In [3]:
# For deleting existing files
#shutil.rmtree('/dsa/groups/casestudy2021f/group_1/data/2015')

In [4]:
# This function extracts the dataset list from the HTML table and downloads every dataset in it.
def extract_from_table(content, new_response):
    new_soup = BeautifulSoup(new_response.text, features='html.parser')
    tbody = new_soup.find_all('tbody')
    tr = tbody[0].find_all('tr')
    my_dictionary_list = [get_folder_and_file_name(row) for row in tr]
    print('\nDownloading ', content, '- Number of Elements : ', len(my_dictionary_list))
    for entry in my_dictionary_list:
        # Rows that could not be parsed (no name or no download link) are None and are skipped.
        if entry is not None:
            download(content, entry['Name'], entry['Href'])

In [5]:
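# This function gets the dataset name and download link from one HTML table row.
def get_folder_and_file_name(new_tr):
    try:
        td = new_tr.find_all('td')
        name = list(td[0].children)[0]
        href = td[2].a['href']
        return {'Name': name, 'Href': href}
    except (IndexError, TypeError, KeyError, AttributeError):
        # The row has no name cell or no download link, so there is nothing to download.
        return None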

In [6]:
# This function downloads one file from the source to the local system.
def download(content, folder, url):
    filename = url.split('/')[-1]
    folder = folder.replace(":", "")
    path = download_to.format(year=year, content=content, folder=folder)
    os.makedirs(path, exist_ok=True)
    # Skip the two Physical Activity Monitor files because they are too big.
    if filename == 'PAXHR_G.XPT' or filename == 'PAXMIN_G.XPT':
        return 0
    new_request = requests.get(download_url.format(href=url), allow_redirects=True)
    # Write to an explicit path instead of changing the working directory,
    # and close the file as soon as the content is written.
    with open(os.path.join(path, filename), 'wb') as f:
        f.write(new_request.content)
    print('Folder : ', folder, '- File : ', filename)
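
The download function only strips colons from the dataset name before using it as a folder name, so any name containing a path separator would be split into nested folders. A stricter cleanup could look like the sketch below; `safe_folder_name` is a hypothetical helper that is not part of this notebook, and the underscore style is just one possible choice.

```python
import re


def safe_folder_name(name):
    """Collapse every run of characters other than letters and digits into one underscore."""
    return re.sub(r'[^A-Za-z0-9]+', '_', name).strip('_')


# Dataset names as they appear in this notebook's output.
print(safe_folder_name('Dietary Interview - Individual Foods, First Day'))
print(safe_folder_name('Demographic Variables & Sample Weights'))
```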

In [7]:
# Sending a request to get data topics for a specific year.
response = requests.get(main.format(year=year), allow_redirects=True)
# Converting the response to readable HTML.
soup = BeautifulSoup(response.text, features='html.parser')

div = soup.find_all('div', class_='card mb-3')
div_data = div[0].find_all('a')
data = list(map(lambda x: re.sub(r'\s+', '', list(x.children)[-1]).replace("Data", ""), div_data))

list(map(lambda x: extract_from_table(x, requests.get(data_url.format(component=x, year=year), allow_redirects=True)),
         data))


Downloading  Demographics - Number of Elements :  1
Folder :  Demographic Variables & Sample Weights - File :  DEMO_F.XPT

Downloading  Dietary - Number of Elements :  16
Folder :  Dietary Interview - Individual Foods, First Day - File :  DR1IFF_F.XPT
Folder :  Dietary Interview - Individual Foods, Second Day - File :  DR2IFF_F.XPT
Folder :  Dietary Interview - Total Nutrient Intakes, First Day - File :  DR1TOT_F.XPT
Folder :  Dietary Interview - Total Nutrient Intakes, Second Day - File :  DR2TOT_F.XPT
Folder :  Dietary Interview Technical Support File - Food Codes - File :  DRXFCD_F.XPT
Folder :  Dietary Interview Technical Support File - Modification Codes - File :  DRXMCD_F.XPT
Folder :  Dietary Screener Questionnaire - File :  DTQ_F.XPT
Folder :  Dietary Supplement Database - Blend Information - File :  DSBI.XPT
Folder :  Dietary Supplement Database - Ingredient Information - File :  DSII.XPT
Folder :  Dietary Supplement Database - Product Information - File :  DSPI.XPT
Folder : 

Folder :  Occupation - File :  OCQ_F.XPT
Folder :  Oral Health - File :  OHQ_F.XPT
Folder :  Osteoporosis - File :  OSQ_F.XPT
Folder :  Pesticide Use - File :  PUQMEC_F.XPT
Folder :  Physical Activity - File :  PAQ_F.XPT
Folder :  Physical Functioning - File :  PFQ_F.XPT
Folder :  Prescription Medications - File :  RXQ_RX_F.XPT
Folder :  Prescription Medications - Drug Information - File :  RXQ_DRUG.xpt
Folder :  Reproductive Health - File :  RHQ_F.XPT
Folder :  Respiratory Health - File :  RDQ_F.XPT
Folder :  Sexual Behavior - File :  SXQ_F.XPT
Folder :  Sleep Disorders - File :  SLQ_F.XPT
Folder :  Smoking - Cigarette Use - File :  SMQ_F.XPT
Folder :  Smoking - Household Smokers - File :  SMQFAM_F.XPT
Folder :  Smoking - Recent Tobacco Use - File :  SMQRTU_F.XPT
Folder :  Volatile Toxicant (Subsample) - File :  VTQ_F.XPT
Folder :  Weight History - File :  WHQ_F.XPT
Folder :  Weight History - Youth - File :  WHQMEC_F.XPT

Downloading  LimitedAccess - Number of Elements :  23


[None, None, None, None, None, None]
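
The topic names in the output, such as `LimitedAccess`, are produced by removing all whitespace from each link text and then dropping the word "Data". The sketch below isolates that step; `clean_topic` is a hypothetical name, and the input strings are assumed examples of link text rather than text copied from the website.

```python
import re


def clean_topic(link_text):
    """Remove all whitespace, then drop the word 'Data', as the scraping cell does."""
    return re.sub(r'\s+', '', link_text).replace('Data', '')


# Assumed link texts for illustration.
for text in ['Demographics Data', '  Limited Access Data\n']:
    print(repr(text), '->', clean_topic(text))
```

The final list of `None` values in the output is the return value of each `extract_from_table` call, collected by `list(map(...))`; it carries no information about the downloads.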

In [8]:
# Checking downloaded data.
os.listdir('/dsa/groups/capstonesp2022/on-campus/group_1/data')

['2017', '2015', '2013', '2011', '2009']
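
Listing the year folders confirms that they exist, but not how many files each one holds. A minimal sketch of a per-year file count follows; `count_xpt_files` is a hypothetical helper, and the demonstration runs on a small temporary tree rather than the real data directory. The extension check is case-insensitive because one file in the output above, `RXQ_DRUG.xpt`, uses a lowercase extension.

```python
import os
import tempfile
from collections import Counter


def count_xpt_files(data_root):
    """Count .XPT files (any letter case) under each year folder of data_root."""
    counts = Counter()
    for year in sorted(os.listdir(data_root)):
        year_dir = os.path.join(data_root, year)
        if not os.path.isdir(year_dir):
            continue
        for _, _, files in os.walk(year_dir):
            counts[year] += sum(name.lower().endswith('.xpt') for name in files)
    return dict(counts)


# Build a tiny stand-in tree with two files for 2009.
demo_root = tempfile.mkdtemp()
for topic, folder, filename in [
    ('Demographics', 'Demographic Variables & Sample Weights', 'DEMO_F.XPT'),
    ('Questionnaire', 'Prescription Medications - Drug Information', 'RXQ_DRUG.xpt'),
]:
    target = os.path.join(demo_root, '2009', topic, folder)
    os.makedirs(target, exist_ok=True)
    with open(os.path.join(target, filename), 'wb') as f:
        f.write(b'')

print(count_xpt_files(demo_root))
```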

In [9]:
# For deleting existing files
#shutil.rmtree('/dsa/groups/capstonesp2022/on-campus/group_1/data/2011')