In [10]:
# Loading in the adult22 csv file.
df <- read.csv("adult22.csv")
# Viewing the dimensions of df
dim(df)

27651 rows and 637 columns.

### Feature Selection

The columns were selected based on the summary PDF available on the CDC NHIS 2022 survey documentation page (https://www.cdc.gov/nchs/nhis/documentation/2022-nhis.html).

In [11]:
# dplyr provides %>% and select(); loading it again is harmless if an earlier cell already did.
library(dplyr)

# Selecting the desired columns from df, and assigning them to df_sub.
df_sub <- df %>%
  select(
    AGEP_A,
    SEX_A,
    CANEV_A,
    CHDEV_A,
    DEPEV_A,
    SMKEV_A,
    EDUCP_A,
    REGION,
    ANXFREQ_A,
    HEIGHTTC_A,
    WEIGHTLBTC_A,
    SLPHOURS_A,
    PA18_05R_A,
    DRK12MYR_A
  )
# Viewing the dimensions of df_sub
dim(df_sub)

27651 rows and 14 columns.
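
`select()` stops with an error if a requested column is not in the data frame, so it can help to check the names first. The following is a minimal sketch in base R on a made-up data frame; `toy`, `wanted`, and `missing_cols` are hypothetical names used only for illustration, not part of the original analysis.

```r
# Toy stand-in for df; the real data come from adult22.csv.
toy <- data.frame(AGEP_A = c(25, 40), SEX_A = c(1, 2), REGION = c(3, 4))

# Columns we would like to select (shortened list for illustration).
wanted <- c("AGEP_A", "SEX_A", "CANEV_A")

# Names in `wanted` that are absent from the data frame.
missing_cols <- setdiff(wanted, names(toy))
missing_cols
```

An empty result means every requested column is present and the selection can proceed.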



---



**Feature Meaning**

The descriptions and value mappings below are taken from the official codebook for the 2022 NHIS Sample Adult Survey (https://www.cdc.gov/nchs/nhis/documentation/2022-nhis.html). Only the values retained for this analysis are listed:


* **AGEP_A** → Indicates the age of the sampled adult:
  - 18-84 corresponds to 18-84 years.
  
* **SEX_A** → Indicates the sex of the sampled adult:
  - 1: Male.
  - 2: Female.

* **CANEV_A** → Indicates whether the sampled adult has been told they had cancer:
  - 1: Yes.
  - 2: No.

* **CHDEV_A** → Indicates whether the sampled adult has been told they have coronary heart disease:
  - 1: Yes.
  - 2: No.

* **DEPEV_A** → Indicates whether the sampled adult has ever been told they had depression:
  - 1: Yes.
  - 2: No.

* **SMKEV_A** → Indicates whether the sampled adult has smoked at least 100 cigarettes in their lifetime:
  - 1: Yes.
  - 2: No.

* **EDUCP_A** → Indicates the education level of the sampled adult:
  - 1: Grade 1-11.
  - 2: 12th grade (no diploma).
  - 3: GED or equivalent.
  - 4: High school graduate.
  - 5: Some college (no degree).
  - 6: Associate’s degree (occupational, technical, or vocational program).
  - 7: Associate’s degree (academic program).
  - 8: Bachelor’s degree.
  - 9: Master’s degree.
  - 10: Professional school or doctoral degree.

* **REGION** → Indicates the region where the sampled adult lives:
  - 1: Northeast.
  - 2: Midwest.
  - 3: South.
  - 4: West.

* **ANXFREQ_A** → How often the sampled adult feels worried, nervous, or anxious:
  - 1: Daily.
  - 2: Weekly.
  - 3: Monthly.
  - 4: A few times a year.
  - 5: Never.

* **HEIGHTTC_A** → Height of the sampled adult without shoes (in inches):
  - 59-76.

* **WEIGHTLBTC_A** → Weight of the sampled adult without shoes (in pounds):
  - 100-299.

* **SLPHOURS_A** → Hours of sleep the sampled adult gets in a 24-hour period:
  - 1-24.

* **PA18_05R_A** → Physical activity meeting aerobic/strength criteria:
  - 1: Meets neither.
  - 2: Meets strength only.
  - 3: Meets aerobic only.
  - 4: Meets both.

* **DRK12MYR_A** → Days the sampled adult drank alcohol in the past year:
  - 0-365.
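
The numeric codes above can be turned into readable labels with `factor()`. This is a minimal sketch on made-up values, assuming the code-to-label mappings listed above; `toy`, `SEX_LABEL`, and `REGION_LABEL` are hypothetical names and do not appear in the original analysis.

```r
# Hypothetical coded values for three respondents.
toy <- data.frame(SEX_A = c(1, 2, 2), REGION = c(4, 1, 3))

# Map each numeric code to its label, following the list above.
toy$SEX_LABEL <- factor(toy$SEX_A, levels = 1:2,
                        labels = c("Male", "Female"))
toy$REGION_LABEL <- factor(toy$REGION, levels = 1:4,
                           labels = c("Northeast", "Midwest", "South", "West"))
toy
```

Any code that is not among `levels` becomes `NA`, which makes unexpected values easy to spot.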

In [13]:
# Based on the feature values above, keep only the rows whose value for every feature falls within the listed range.
df_filtered <- df_sub %>%
  filter(
    AGEP_A >= 18 & AGEP_A <= 84,
    SEX_A %in% c(1, 2),
    CANEV_A %in% c(1, 2),
    CHDEV_A %in% c(1, 2),
    DEPEV_A %in% c(1, 2),
    SMKEV_A %in% c(1, 2),
    EDUCP_A %in% 1:10,
    REGION %in% 1:4,
    ANXFREQ_A %in% 1:5,
    HEIGHTTC_A >= 59 & HEIGHTTC_A <= 76,
    WEIGHTLBTC_A >= 100 & WEIGHTLBTC_A <= 299,
    SLPHOURS_A >= 1 & SLPHOURS_A <= 24,
    PA18_05R_A %in% 1:4,
    DRK12MYR_A >= 0 & DRK12MYR_A <= 365
  )
# Viewing the dimensions of df_filtered
dim(df_filtered)

# Exporting df_filtered as a csv file (adult22_filtered.csv).
# row.names = FALSE stops write.csv from adding an extra column of row names.
write.csv(df_filtered, "adult22_filtered.csv", row.names = FALSE)

203631 rows and 14 columns.
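
Because `filter()` keeps a row only when every condition is `TRUE`, it can only remove rows, so `df_filtered` should never have more rows than `df_sub`. The sketch below illustrates this in base R on made-up values and also counts how many rows each condition rejects on its own; `toy`, `checks`, `dropped`, and `toy_filtered` are hypothetical names, and the out-of-range values are invented for illustration.

```r
# Hypothetical rows, some with values outside the ranges kept above.
toy <- data.frame(
  AGEP_A  = c(25, 85, 40, 97),
  SEX_A   = c(1, 2, 7, 2),
  CANEV_A = c(2, 1, 2, 9)
)

# One logical vector per condition; TRUE means the row passes.
checks <- list(
  AGEP_A  = toy$AGEP_A >= 18 & toy$AGEP_A <= 84,
  SEX_A   = toy$SEX_A %in% c(1, 2),
  CANEV_A = toy$CANEV_A %in% c(1, 2)
)

# Rows failing each condition on its own.
dropped <- sapply(checks, function(ok) sum(!ok))

# A row is kept only if it passes every condition.
keep <- Reduce(`&`, checks)
toy_filtered <- toy[keep, ]

# Filtering can never add rows.
stopifnot(nrow(toy_filtered) <= nrow(toy))
dropped
nrow(toy_filtered)
```

Running the same kind of check on the real data shows which variables account for most of the excluded rows.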