<center><h1 style="color:#1a1a1a;
                    font-size:3em">
        NHANES Labels Scraping 🕸️
        </h1> 
</center>

## Purpose
While working on our machine learning project on stroke prediction with the NHANES dataset, I found it hard to locate the desired variables and to tell what each variable meant after feature selection. Therefore, we need to scrape this information from the NHANES website using **Beautiful Soup**.



The main purpose of this notebook is to extract the SAS labels from the data documentation pages of the 5 main categories:
- Demographics: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2013
- Dietary: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Dietary&CycleBeginYear=2013
- Examination: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013
- Laboratory: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Laboratory&CycleBeginYear=2013
- Questionnaire: https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Questionnaire&CycleBeginYear=2013
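
The five URLs above differ only in the `Component` query parameter, so they could be generated instead of typed out. This is a minimal sketch that assumes the pattern holds for every category; `build_datapage_url` and `generated_urls` are hypothetical names, not part of the notebook, and the sketch uses the `Search/DataPage.aspx` capitalization for all five.

```python
from urllib.parse import urlencode

# Base address copied from the category links listed above.
BASE_URL = "https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx"
CATEGORIES = ["Demographics", "Dietary", "Examination", "Laboratory", "Questionnaire"]


def build_datapage_url(component, cycle_begin_year=2013):
    """Hypothetical helper: build the data-page URL for one category."""
    query = urlencode({"Component": component, "CycleBeginYear": cycle_begin_year})
    return f"{BASE_URL}?{query}"


# Same shape as a category -> URL dictionary: one entry per category.
generated_urls = {name: build_datapage_url(name) for name in CATEGORIES}
```

Changing `cycle_begin_year` would point the same code at a different survey cycle, assuming the site keeps the same query format for other years.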


In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import urllib.parse

In [2]:


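def parse_main(URL, links, category):
    """Collect the documentation (.htm) links listed on one category page."""
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table')
    if table is None:  # no table on the page (failed request or changed layout)
        return

    for link in table.find_all('a'):
        href = link.get('href')
        # keep only links that point to .htm documentation pages
        if href and href.endswith('.htm'):
            link_j = urllib.parse.urljoin('https://wwwn.cdc.gov/', href)
            links[category].append(link_j)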


urls = {'Demographics':'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2013',
        'Dietary':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Dietary&CycleBeginYear=2013',
        'Examination':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013',
        'Laboratory':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Laboratory&CycleBeginYear=2013',
        'Questionnaire':'https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Questionnaire&CycleBeginYear=2013'}

links = {v:[] for v in urls.keys()}

for c, URL in urls.items():
    print(c, URL)
    parse_main(URL, links, c)


Demographics https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Demographics&CycleBeginYear=2013
Dietary https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Dietary&CycleBeginYear=2013
Examination https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013
Laboratory https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Laboratory&CycleBeginYear=2013
Questionnaire https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Questionnaire&CycleBeginYear=2013
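
`parse_main` relies on `urllib.parse.urljoin` to turn each `href` into a full address and keeps only links ending in `.htm`. The sketch below shows how that combination behaves. The `href` values are made-up examples for illustration, not paths copied from the NHANES page.

```python
from urllib.parse import urljoin

BASE = "https://wwwn.cdc.gov/"

# Hypothetical href values, for illustration only.
hrefs = [
    "/Nchs/Example/DEMO_H.htm",                        # root-relative path
    "https://wwwn.cdc.gov/Nchs/Example/DR1IFF_H.htm",  # already absolute
    "/Nchs/Example/DEMO_H.XPT",                        # does not end in .htm
    None,                                              # <a> tag without an href
]

# Same filter as parse_main: skip missing hrefs and anything not ending in .htm.
doc_links = [urljoin(BASE, h) for h in hrefs if h and h.endswith(".htm")]
```

A root-relative path is attached to the host, while an absolute URL passes through `urljoin` unchanged, so the same call is safe for both kinds of link.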


In [3]:
def parse_nhanes(links, codes):
    """Append (category, variable, SAS label) for every variable block found."""
    for c, URLs in links.items():
        for URL in URLs:
            # fetch the documentation page
            page = requests.get(URL)

            # look for <dl> blocks, which hold the variable name and SAS label
            soup = BeautifulSoup(page.content, 'html.parser')
            containers = soup.find_all('dl')
            for i in containers:
                try:
                    varname = i.find("dt", string="Variable Name: ").find_next("dd").text
                    saslabel = i.find("dt", string="SAS Label: ").find_next("dd").text
                except AttributeError:
                    # this <dl> has no variable name or SAS label; skip it
                    continue
                codes['category'].append(c)
                codes['variable'].append(varname.strip())
                codes['label'].append(saslabel.strip())
    return codes

codes = {"category": [], "variable": [], "label": []}


parse_nhanes(links, codes)

codebook = pd.DataFrame(codes)

In [4]:
codebook.value_counts()

category       variable  label                             
Laboratory     SEQN      Respondent sequence number            77
Questionnaire  SEQN      Respondent sequence number            43
Examination    SEQN      Respondent sequence number            19
               DXXPT70Y  y-coordinates of outline points 71    13
               DXXPT71Y  y-coordinates of outline points 72    13
                                                               ..
               OHX08PCA  LOA: Max R(CI) ML FGM-sulcus(mm)       1
               OHX08PCD  LOA: Max R(CI) DF FGM-sulcus(mm)       1
               OHX08PCL  LOA: Max R(CI) MdL FGM-sulcus(mm)      1
               OHX08PCM  LOA: Max R(CI) MdF FGM-sulcus(mm)      1
Questionnaire  WTSVOC2Y  VOC Subsample Weight                   1
Length: 3909, dtype: int64

From the value_counts() above, you can see that several variables are repeated because of how the NHANES dataset is designed (for example, SEQN appears 77 times in Laboratory alone). To make each variable easy to look up, I list the unique variables separately.

In [5]:
code_unique = codebook[['variable', 'label']].drop_duplicates(subset=['variable'])
print(code_unique)

      variable                                   label
0         SEQN              Respondent sequence number
1     SDDSRVYR                      Data release cycle
2     RIDSTATR            Interview/Examination status
3     RIAGENDR                                  Gender
4     RIDAGEYR               Age in years at screening
...        ...                                     ...
7096    WHD140  Self-reported greatest weight (pounds)
7097    WHQ150                Age when heaviest weight
7099   WHQ030M         How do you consider your weight
7100    WHQ500               Trying to do about weight
7101    WHQ520          How often tried to lose weight

[3851 rows x 2 columns]
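
`drop_duplicates(subset=['variable'])` keeps the first row seen for each variable and discards later ones, even if a later row carries a different label. The plain-Python sketch below mirrors that keep-first behavior on a few toy rows and also reports any variable that has more than one distinct label. The `SEQN` rows reuse the label shown in the output above; the `XYZ001` rows are invented for illustration.

```python
from collections import defaultdict

rows = [
    ("Demographics", "SEQN", "Respondent sequence number"),
    ("Laboratory", "SEQN", "Respondent sequence number"),
    ("Examination", "XYZ001", "Invented label A"),    # made-up variable
    ("Questionnaire", "XYZ001", "Invented label B"),  # same name, different label
]

first_label = {}               # variable -> first label seen (what keep-first retains)
all_labels = defaultdict(set)  # variable -> every distinct label seen

for _category, variable, label in rows:
    first_label.setdefault(variable, label)  # later labels never overwrite the first
    all_labels[variable].add(label)

# Variables whose label differs between data files.
conflicts = {v: sorted(labels) for v, labels in all_labels.items() if len(labels) > 1}
```

If `conflicts` were non-empty on the real codebook, it would be worth checking those variables by hand before trusting the single label kept in `code_unique`.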


In [6]:
code_unique.to_csv('Datasets/NHANES_Labels.csv', index=False)