<a href="https://colab.research.google.com/github/Cstan1987stat/health-survey-cluster-analysis/blob/main/outlier_check_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [79]:
# Loading in the adult22 csv file.
df <- read.csv("https://raw.githubusercontent.com/Cstan1987stat/health-survey-cluster-analysis/refs/heads/main/data/adult22_filtered.csv")

# Extracting the number of rows from df
rows <- dim(df)[1]
# Extracting the number of columns from df
columns <- dim(df)[2]
# Outputting the number of rows and columns
cat('There are', rows,'rows and', columns,'columns in the data.\n')
# Outputting a horizontal line for separation purposes
cat('---------------------------------------------------------------------------------------------------\n')
# Outputting a blank line
cat('\n')
# Printing the first 6 rows of the data
print(head(df))

There are 20361 rows and 14 columns in the data.
---------------------------------------------------------------------------------------------------

  Age Sex Cancer Coronary_heart_disease Depression Smoked_100_cig Education
1  64   1      1                      1          2              1         8
2  37   2      2                      2          2              2         8
3  72   2      2                      2          2              2         5
4  84   2      2                      2          2              2         6
5  31   2      2                      2          2              1         8
6  81   2      1                      2          2              2         4
  Region Anxiety Height Weight Sleep_hours Aerobic.Strength Alcohol_drink_12m
1      3       4     74    235           8                3               108
2      3       3     69    218           9                3                 0
3      3       5     64    240           8                1                12
4     

In [80]:
library(dplyr)
install.packages('recipes')
library(recipes)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [81]:
# Creating a vector for the numeric column names
num_cols <- c("Age", "Height", "Weight", "Sleep_hours", "Alcohol_drink_12m")

# Creating a vector for the categorical column names
cat_cols <- c('Sex', 'Cancer', 'Coronary_heart_disease', 'Depression', 'Smoked_100_cig',
              'Education', 'Region', 'Anxiety','Aerobic.Strength')
# Creating a copy of df
df_copy <- df

# Converting below columns in the df_copy dataframe to factors with original labels.
df_copy$Sex <- factor(df_copy$Sex, labels=c('Male', 'Female'))
df_copy$Cancer <- factor(df_copy$Cancer, labels=c('Yes','No'))
df_copy$Coronary_heart_disease <- factor(df_copy$Coronary_heart_disease, labels = c('Yes', 'No'))
df_copy$Depression <- factor(df_copy$Depression, labels = c('Yes','No'))
df_copy$Smoked_100_cig <- factor(df_copy$Smoked_100_cig, labels = c('Yes', 'No'))
df_copy$Education <- factor(df_copy$Education, labels = c('1-11', '12th', 'GED', 'High School', 'Some College', 'Assoc Tech', 'Assoc Acad', 'Bach', 'Mast', 'Prof'))
df_copy$Region <- factor(df_copy$Region, labels = c('Northeast', 'Midwest', 'South', 'West'))
df_copy$Anxiety <- factor(df_copy$Anxiety, labels = c('Daily', 'Weekly', 'Monthly', 'Few times Y', 'Never'))
df_copy$`Aerobic.Strength` <- factor(df_copy$`Aerobic.Strength`, labels = c('Neither', 'Strength', 'Aerobic', 'Both'))

## Outlier Check

In [82]:
# Creating a function to check for outliers
outlier_check <- function(column, name){  # column is the vector of values for the feature and name is the name of the feature
  # Below two lines calculate the 25th and 75th percentile values
  q1 <- quantile(column, .25)
  q3 <- quantile(column, .75)
  # Calculating the interquartile range
  iqr <- q3 - q1
  # Below two lines calculate the upper and lower bounds (1.5 * IQR beyond the quartiles)
  upper_bound <- q3 + (1.5*iqr)
  lower_bound <- q1 - (1.5*iqr)
  # Below two lines count how many values of the feature are above the upper bound and below the lower bound
  num_upper <- sum(column > upper_bound)
  num_lower <- sum(column < lower_bound)
  # Summing num_upper and num_lower to get the total number of outliers
  num_outliers <- num_upper + num_lower

  # If statement to check if the feature has outliers
  if (num_outliers > 0){

    # Output how many outliers the feature has
    cat('The',name,'feature has',num_outliers,'outliers\n')

    # If statement to check if the feature has no outliers below the lower bound
    if (num_lower == 0){
      # Output the upper_bound for feature with no outliers below the lower_bound
      cat('The',name,'feature has an upper bound of', upper_bound,'\n')
    }

    # If statement to check if the feature has outliers below the lower bound
    if (num_lower > 0){
      # Output the lower bound and upper bound values for the feature
      cat('The',name,'feature has an lower bound of', lower_bound,'and an upper bound of',upper_bound,'\n')
    }

    # Move next output to a new line
    cat('\n')
  }

  # If statement to check if there are no outliers
  if (num_outliers == 0){
    # Output the feature has no outliers
    cat('The',name,'feature has no outliers\n')
    # Move next output to a new line
    cat('\n')
  }
}

# For loop to go through every column in the num_cols vector
for (i in num_cols){
  # Send the vector of values for the ith feature, along with the feature name, to the outlier_check function
  outlier_check(df_copy[[i]], i)
}


The Age feature has no outliers

The Height feature has no outliers

The Weight feature has 107 outliers
The Weight feature has an upper bound of 285 

The Sleep_hours feature has 216 outliers
The Sleep_hours feature has an lower bound of 3 and an upper bound of 11 

The Alcohol_drink_12m feature has 1779 outliers
The Alcohol_drink_12m feature has an upper bound of 257 



The [univariate analysis notebook](https://github.com/Cstan1987stat/health-survey-cluster-analysis/blob/main/notebooks/univariate_analysis.ipynb) suggested that outliers were present in the Weight, Sleep_hours, and Alcohol_drink_12m features, so the output above isn't surprising. It also confirms what we observed there: all outliers for Weight and Alcohol_drink_12m lie above the upper bound, while Sleep_hours has outliers on both sides, below the lower bound and above the upper bound. What we weren't initially sure about was the threshold for determining an outlier. Here it is the 1.5 × IQR rule, which flags any value more than 1.5 times the interquartile range below the first quartile or above the third quartile.

For the Weight feature, all adults weighing more than 285 pounds are flagged as outliers. However, I don't believe these values are reporting errors, as weights above 285 pounds are plausible. Furthermore, they make up only a small portion of the dataset (107 of 20,361 rows, about 0.5%), so I've decided to leave them in.

For the Sleep_hours feature, anyone reporting fewer than 3 hours or more than 11 hours of sleep is flagged as an outlier. While sleeping less than 3 hours is definitely surprising, it's plausible: there are individuals with sleep disorders or extreme circumstances. Sleeping more than 11 hours isn't that unusual either, but values beyond 20 hours that approach 24 could very well be reporting errors or invalid entries.

For the Alcohol_drink_12m feature, individuals who drank on more than 257 days in the past year are flagged as outliers. Similar to the Weight feature, this frequency isn't implausible, since some people do drink that often. However, this feature has a relatively large number of outliers (1,779 of 20,361 rows, about 8.7%), so how they are handled matters more here than for the other features.

That said, since clustering is a major focus of this project, I've decided to retain all outliers for now, but may revisit this decision later.
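
As a minimal sketch of the 1.5 × IQR rule that `outlier_check` applies, the snippet below computes the bounds for a small made-up vector and then the outlier shares implied by the counts reported above. The helper `iqr_fences` and the toy values are hypothetical and are not part of the original analysis.

```r
# Toy vector (hypothetical values, not taken from the survey data)
x <- c(5, 6, 6, 7, 7, 7, 8, 8, 9, 30)

# Hypothetical helper: returns the 1.5 * IQR bounds and the outlier counts
iqr_fences <- function(values) {
  # 25th and 75th percentiles (R's default quantile method)
  q <- quantile(values, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  list(
    lower  = lower,
    upper  = upper,
    n_low  = sum(values < lower),                     # values below the lower bound
    n_high = sum(values > upper),                     # values above the upper bound
    share  = mean(values < lower | values > upper)    # fraction of values flagged
  )
}

res <- iqr_fences(x)
print(res)

# Outlier shares (in percent) implied by the counts reported above, out of 20,361 rows
shares <- round(100 * c(Weight = 107, Sleep_hours = 216, Alcohol_drink_12m = 1779) / 20361, 1)
print(shares)
```

In the toy vector only the value 30 falls outside the bounds. The second part simply restates the reported counts as percentages, which makes it easier to compare how much of each feature is affected.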

# Transformations

## Nominal Features
Some models are unable to interpret letters or words directly. To make these features usable, we must convert them into a numerical form. However, when assigning numbers to categories, we must first decide whether there is an inherent order between those categories.

For example, the original data codes the Region feature as 1 for northeast, 2 for midwest, 3 for south, and 4 for west. If we leave this feature as-is, models may interpret midwest as mathematically greater than northeast, which is not meaningful. Region is therefore a nominal variable, since its numeric codes carry no meaningful order. In such cases, a better approach is to use one-hot encoding, which creates a binary column for each category.

For instance, we would create four new columns: northeast, midwest, south, and west. If an individual is from the northeast, a 1 would be placed in the northeast column and 0s in the others. To avoid multicollinearity, we would drop one of the columns and represent someone from that dropped category (e.g., northeast) with 0s in all remaining columns. This is how we handle nominal categorical variables, those with no inherent order. Just to be clear, one-hot encoding isn't required, but it may lead to better results depending on the model.
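
The encoding described above can be sketched in base R with `model.matrix`. The toy data frame below is hypothetical. It is only meant to show the difference between keeping all four indicator columns and dropping the first category as the reference level.

```r
# Toy data frame (hypothetical rows) with a nominal Region factor
toy <- data.frame(
  Region = factor(c("Northeast", "Midwest", "South", "West", "South"),
                  levels = c("Northeast", "Midwest", "South", "West"))
)

# Full one-hot encoding: one indicator column per level (no intercept)
one_hot <- model.matrix(~ Region - 1, data = toy)

# Reference-level encoding: the first level (Northeast) is dropped,
# so a Northeast row is represented by 0s in all remaining columns
dummy <- model.matrix(~ Region, data = toy)[, -1, drop = FALSE]

print(one_hot)
print(dummy)
```

Every row of `one_hot` sums to 1, which is the redundancy that dropping one column removes.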

## Ordinal Features

Take the anxiety feature, which was originally coded as 1 for daily, 2 for weekly, 3 for monthly, 4 for a few times a year, and 5 for never.

This feature clearly has an order: the frequency of anxiety decreases as the value increases, which makes it ordinal. However, a potential issue is that many models will assume equal spacing between levels. For example, the difference between daily (1) and weekly (2) may be interpreted the same as the difference between monthly (3) and a few times a year (4).

But in reality, the drop from daily (possibly 365 occurrences per year) to weekly (around 52) is much larger than the drop from monthly (12) to a few times a year (maybe 3 or 4). If the distances between categories are not consistent, treating this feature as purely ordinal could mislead the model.

In this case, we might choose to treat it like a nominal feature by applying one-hot encoding. Similar to nominal features, one-hot encoding isn't strictly necessary for ordinal data, but it could lead to better results, especially if we're unsure about the spacing between categories.
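
To make the spacing argument concrete, the sketch below compares the gaps between the integer codes with the gaps between rough yearly frequencies. The per-year figures are illustrative guesses based on the discussion above (3.5 stands in for "3 or 4", and 0 for never), not values from the survey.

```r
# Anxiety levels in their original coded order (1 = Daily ... 5 = Never)
levels_anx <- c("Daily", "Weekly", "Monthly", "Few times Y", "Never")

# Integer codes, as stored in the raw data
codes <- seq_along(levels_anx)

# Rough occurrences per year for each level (illustrative assumptions only)
approx_per_year <- c(365, 52, 12, 3.5, 0)

spacing <- data.frame(
  level = levels_anx,
  code = codes,
  approx_per_year = approx_per_year,
  # Gap to the previous level on each scale (NA for the first level)
  code_gap = c(NA, diff(codes)),
  freq_gap = c(NA, -diff(approx_per_year))
)
print(spacing)
```

The code gaps are all 1, while the approximate frequency gaps shrink sharply from one level to the next. A model that uses the integer codes directly ignores this uneven spacing.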

## Numerical Features

Another consideration involves our numerical features, which may be on different scales. For instance, sleep ranges from 1 to 24 hours, while days drinking alcohol could range from 0 to 365 per year. If we apply feature reduction techniques (like PCA) or distance-based models (like k-means clustering), the alcohol variable may dominate due to its larger scale.

Just like with one-hot encoding, leaving numerical features unscaled isn't inherently wrong, but putting them on a common scale (e.g., using standardization or normalization) could improve model performance and lead to more balanced results.
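
The effect of scale on distance-based methods can be sketched with a few made-up rows. The values below are hypothetical. The snippet computes Euclidean distances before and after standardizing with base R's `scale`, which centers each column to mean 0 and rescales it to standard deviation 1.

```r
# Toy data (hypothetical values) on two very different scales
toy <- data.frame(
  Sleep_hours = c(4, 8, 8),
  Alcohol_drink_12m = c(100, 100, 130)
)
rownames(toy) <- c("A", "B", "C")

# Euclidean distances on the raw values
raw_dist <- as.matrix(dist(toy))

# Standardize each column to mean 0 and standard deviation 1, then recompute
scaled_dist <- as.matrix(dist(scale(toy)))

print(round(raw_dist, 2))
print(round(scaled_dist, 2))
```

On the raw values, B and C (30 drinking days apart) look much farther apart than A and B (4 hours of sleep apart), purely because of the units. After standardization both differences contribute on a comparable scale.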

In [83]:
# Initialize recipe with df_copy
rec <- recipe(~ ., data = df_copy) %>%
  step_normalize(all_of(num_cols)) %>%                 # Transforms numeric features to have mean of 0 and a standard deviation of 1
  step_dummy(all_of(cat_cols), one_hot = FALSE)        # Creates binary column for each group in a categorical feature, while dropping the first category column

# Prepping for transformation
rec_prep <- prep(rec)
# Transforming the data
df_transformed_num_cat <- bake(rec_prep, new_data = NULL)


# Initialize recipe with df_copy
rec <- recipe(~ ., data = df_copy) %>%
  step_normalize(all_of(num_cols))                     # Transforms numeric features to have mean of 0 and a standard deviation of 1

# Prepping for transformation
rec_prep <- prep(rec)
# Transforming the data
df_transformed_num <- bake(rec_prep, new_data = NULL)

# Exporting data

In [84]:
# Exporting data
write.csv(df_transformed_num_cat, 'adult22_transformed_e.csv', row.names = FALSE)
write.csv(df_transformed_num, 'adult22_transformed_n.csv', row.names = FALSE)