# 1. Business Understanding

Youth mental health has become a growing public health concern in the United States. National reports indicate increases in sadness, hopelessness, and suicidal behaviors among high school students over the past decade. Understanding which behaviors and demographic factors are associated with mental health risk can help schools, communities, and public health agencies design more focused prevention and support programs.

This capstone project uses data from the 2019 national Youth Risk Behavior Surveillance System (YRBSS), a large survey of U.S. high school students that monitors health-related behaviors such as substance use, physical activity, bullying, and mental health indicators. The goal of analyzing this dataset is to identify patterns and risk factors linked to poor mental health outcomes in adolescents.

The primary focus of this project is to develop predictive models that estimate the likelihood that a student reports experiencing persistent feelings of sadness or hopelessness, or other serious mental health concerns. Comparing different modeling approaches can highlight which variables contribute most strongly to mental health risk and whether more complex machine learning models provide meaningful improvement over simpler methods.

**Research Question.** *Which demographic and behavioral factors are most strongly associated with poor mental health outcomes among U.S. high school students in the 2019 YRBSS, and how accurately can these outcomes be predicted using supervised learning models?*


# 2. Data Understanding

The dataset used in this project is the 2019 National Youth Risk Behavior Surveillance System (YRBSS), collected by the Centers for Disease Control and Prevention (CDC). The YRBSS is a nationwide survey designed to monitor a wide range of health-related behaviors among U.S. high school students. These behaviors include mental health indicators, substance use, injury-related behaviors, sexual health, physical activity, and other factors that influence adolescent well-being. The 2019 survey includes over 13,000 respondents and represents a statistically weighted sample of U.S. students in grades 9–12.

The raw dataset is provided in fixed-width (.dat) format and must be read using column definitions supplied in a CDC SAS input script. After applying these definitions, the resulting dataframe includes 134 variables. Each variable corresponds to a specific question or derived value from the survey. Example variables include:

- Q6–Q15: Demographic information such as age, grade level, and sex.
- QNOBESE / BMI variables: Indicators related to height, weight, and obesity status.
- QNDEP / QNANX: Indicators of mental-health-related symptoms (depending on availability for that year).
- QNWATER, QNSODA: Measures of dietary behaviors.
- QNSMOKE / QNALCSIP: Tobacco and alcohol use behaviors.
- QNFIGHT / QNBULLY: Indicators of violence or injury-related behaviors.
- PSU, STRATUM, WEIGHT: Complex survey sampling design variables.

Because the data represents a national probability sample, each record must be interpreted using the provided sampling weights to ensure national representativeness. The dataset contains no personally identifying information and is publicly available, making it appropriate for academic research.
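As a minimal sketch of why the weights matter, the snippet below compares an unweighted and a weighted proportion on a small made-up table. The column names `sad` and `weight` and all values are hypothetical and are not taken from the YRBSS file. The sketch covers point estimates only; variance estimation for a complex survey would additionally need the PSU and stratum variables.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1 = reported the outcome, 0 = did not
toy = pd.DataFrame({
    "sad":    [1, 0, 0, 1, 0],
    "weight": [2.0, 1.0, 1.0, 4.0, 2.0],
})

# Unweighted: every respondent counts equally
unweighted = toy["sad"].mean()

# Weighted: each respondent counts in proportion to the sampling weight
weighted = np.average(toy["sad"], weights=toy["weight"])

print(f"Unweighted proportion: {unweighted:.3f}")  # 0.400
print(f"Weighted proportion:   {weighted:.3f}")    # 0.600
```

In this toy example the two respondents who reported the outcome carry larger weights, so the weighted estimate is higher than the raw share.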

No missing values were found in the raw fixed-width fields, but some variables include coded missing values such as 7, 8, or 9, representing “refused,” “missing,” or “not applicable.” These codes must be cleaned before modeling or statistical analysis.

# 3. Data Preparation

This section describes the steps taken to prepare the 2019 YRBSS dataset for analysis. Because the CDC distributes the data in fixed-width format, several preprocessing steps were required to convert the raw text file into a structured DataFrame. These steps include importing the raw data using the SAS input script, handling coded missing values, selecting variables relevant to the research question, converting data types, and performing initial cleaning.


## 3.1 Importing the Raw Data

The YRBSS dataset is provided as a fixed-width (.dat) file. The CDC also provides a SAS input program that defines the column boundaries and variable names. This section loads the SAS input program, parses the variable layout, and imports the raw YRBSS data into a pandas DataFrame.


In [None]:
import re
import pandas as pd

# Paths to files (all in the same folder as the notebook)
sas_input_path = "2019XXH-SAS-Input-Program.sas"
dat_path = "XXH2019_YRBS_Data.dat"

# Read the SAS input program that documents the fixed-width layout
with open(sas_input_path, "r", errors="ignore") as f:
    sas_lines = f.readlines()

colspecs = []
names = []

# Match layout lines of the form "@<start> <name> <width>."
pattern = re.compile(r"@(\d+)\s+(\w+)\s+(\d+)\.")

for line in sas_lines:
    match = pattern.search(line)
    if match:
        start = int(match.group(1)) - 1  # SAS positions are 1-based
        width = int(match.group(3))
        end = start + width              # colspecs are half-open: [start, end)
        colspecs.append((start, end))
        names.append(match.group(2))

# Load the dataset using the parsed column boundaries and names
data = pd.read_fwf(dat_path, colspecs=colspecs, names=names)

data.head()
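To make the layout parsing easier to check, the sketch below applies a similar pattern to a few hand-written lines. The helper name `parse_layout_line` and the sample lines are hypothetical and are not copied from the CDC program. The optional `$` in the pattern is an assumption that allows for SAS character informats such as `$1.`, which may or may not appear in the actual file.

```python
import re

# Pattern similar to the one in the import cell; the optional "$" is an
# assumption to allow for SAS character informats such as "$1."
pattern = re.compile(r"@(\d+)\s+(\w+)\s+\$?(\d+)\.")

def parse_layout_line(line):
    """Return (name, (start, end)) for one layout line, or None if no match."""
    match = pattern.search(line)
    if match is None:
        return None
    start = int(match.group(1)) - 1    # SAS columns are 1-based
    end = start + int(match.group(3))  # pandas colspecs are half-open
    return match.group(2), (start, end)

# Hypothetical layout lines, not copied from the CDC program
print(parse_layout_line("@17 Q1 $1."))                     # ('Q1', (16, 17))
print(parse_layout_line("@200 weight 10.4"))               # ('weight', (199, 209))
print(parse_layout_line("label Q1 = 'How old are you';"))  # None
```

Comparing `len(names)` with the number of variables listed in the codebook is one way to confirm that no layout lines were skipped.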



## 3.2 Handling Coded Missing Values

The YRBSS uses numeric codes to represent missing or skipped responses. Common codes include:
- 7 = Refused
- 8 = Missing / Not asked
- 9 = Not applicable

These values do not represent real behavior and must be replaced with NaN before analysis.


In [None]:
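import numpy as np

# Note: this replaces 7/8/9 in every column, including any where they are valid values
missing_codes = [7, 8, 9]
data = data.replace(missing_codes, np.nan)
data.head()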
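A blanket replacement treats 7, 8, and 9 as missing everywhere, which can remove legitimate values in columns that do not use those codes. The sketch below shows a more targeted variant on a made-up frame. The column names `Q_ITEM` and `AGE_CODE` and the list `cols_with_codes` are hypothetical; in practice the list would come from the codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "Q_ITEM" uses 7/8/9 as missing codes, "AGE_CODE" does not
toy = pd.DataFrame({
    "Q_ITEM":   [1, 2, 7, 9, 3],
    "AGE_CODE": [5, 7, 6, 7, 4],
})

missing_codes = [7, 8, 9]
cols_with_codes = ["Q_ITEM"]  # assumed list; would be built from the codebook

# Replace the codes only in the listed columns
toy[cols_with_codes] = toy[cols_with_codes].replace(missing_codes, np.nan)

print(toy)
```

Here the two coded entries in `Q_ITEM` become NaN while the 7s in `AGE_CODE` are kept as real values.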


## 3.3 Variable Selection for This Project

The dataset contains more than 130 variables. Only variables relevant to the research question will be retained. These include demographic variables (sex, grade, age), mental-health-related indicators (e.g., sadness, hopelessness), and selected behavioral factors such as physical activity, substance use, and bullying.


In [None]:
# Placeholder for variable selection
# Will update once we identify relevant mental health variables
selected_variables = []

# Example:
# selected_variables = ["SEX", "AGE", "QN33", "QN48", "QN49"]

# filtered_data = data[selected_variables]
# filtered_data.head()
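Until the final variable list is fixed, one hedged way to build the subset is to keep only the requested columns that actually exist and to report the rest. The frame and the column names below (`Q_EXTRA`, `QN_OUTCOME`) are hypothetical stand-ins, not confirmed YRBSS variables.

```python
import pandas as pd

# Hypothetical frame standing in for `data`
toy = pd.DataFrame({"SEX": [1, 2], "AGE": [5, 6], "Q_EXTRA": [1, 1]})

# Hypothetical wish list; "QN_OUTCOME" is deliberately absent from the frame
wanted = ["SEX", "AGE", "QN_OUTCOME"]

present = [c for c in wanted if c in toy.columns]
absent = [c for c in wanted if c not in toy.columns]

# Copy so later edits do not touch the original frame
filtered = toy[present].copy()

print("Kept:", present)
print("Not found:", absent)
```

Listing the names that were not found avoids a `KeyError` and makes typos in variable names visible early.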


## 3.4 Data Type Conversion

Most variables were loaded as numeric values. Some variables will later be converted to categorical types for analysis and modeling. Survey design variables (e.g., weight, PSU, stratum) remain numeric.


In [None]:
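# Convert object columns to numeric when every non-missing value parses (errors="ignore" is deprecated in recent pandas)
for col in data.select_dtypes(include="object").columns:
    converted = pd.to_numeric(data[col], errors="coerce")
    if converted.notna().sum() == data[col].notna().sum():
        data[col] = converted
data.info()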
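The later conversion to categorical types could look like the sketch below. The frame is made up and the list `categorical_cols` is an assumption about which variables would be treated as categories.

```python
import pandas as pd

# Hypothetical frame with two coded variables and one design variable
toy = pd.DataFrame({
    "SEX":    [1, 2, 2, 1],
    "GRADE":  [1, 2, 3, 4],
    "WEIGHT": [1.2, 0.8, 1.5, 0.9],
})

categorical_cols = ["SEX", "GRADE"]  # assumed choice of categorical variables

# Convert only the listed columns; design variables stay numeric
toy = toy.astype({col: "category" for col in categorical_cols})

print(toy.dtypes)
```

Keeping the weight column numeric means it can still be used directly in weighted estimates.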


## 3.5 Initial Cleaning

Initial cleaning includes:
- Removing records with missing survey weights
- Confirming dataset size
- Ensuring no rows contain only NaN responses


In [None]:
# Remove rows missing weight (if applicable)
if "WEIGHT" in data.columns:
    data = data.dropna(subset=["WEIGHT"])

# Drop any rows in which every value is NaN
data = data.dropna(how="all")

# Show new dataset size
data.shape
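A row can have a valid weight and still contain no answers, so a check restricted to the response columns may be more informative than a check across all columns. The sketch below illustrates this on a made-up frame. The column names and the split into `design_cols` and `response_cols` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one design column and two response columns
toy = pd.DataFrame({
    "WEIGHT": [1.0, 2.0, np.nan, 1.5],
    "Q_A":    [1.0, np.nan, 2.0, np.nan],
    "Q_B":    [2.0, np.nan, 1.0, 3.0],
})

design_cols = ["WEIGHT"]
response_cols = [c for c in toy.columns if c not in design_cols]

# Drop rows without a weight, then rows with no answers at all
cleaned = toy.dropna(subset=["WEIGHT"])
cleaned = cleaned.dropna(subset=response_cols, how="all")

print(cleaned.shape)  # (2, 3)
```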



# 4. Exploratory Data Analysis (EDA)

This section provides an overview of the structure and characteristics of the 2019 YRBSS dataset. Exploratory data analysis was used to examine the distribution of key variables, identify missing data patterns, and understand the relationships between mental health indicators and demographic or behavioral factors. This step helps guide feature selection and informs the modeling process in later sections.


## 4.1 Dataset Overview

After importing and preparing the dataset, the resulting DataFrame contains 134 variables and over 13,000 observations. These variables cover a wide range of demographic characteristics, behavioral risk factors, protective factors, and mental health indicators. This subsection presents basic structural information about the dataset, including the number of rows, number of columns, and variable data types.


In [None]:
data.info()


## 4.2 Preview of the Data

The first few rows of the dataset provide a high-level view of the structure and contents of the YRBSS dataset. This preview shows the variable names, column ordering, and initial values.


In [None]:
data.head()


## 4.3 Missing Data Analysis

Because the YRBSS dataset uses coded missing values that were converted to NaN, it is important to identify variables with significant missingness. 
This helps determine which variables are suitable for modeling and whether any imputation is required.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Share of missing values per variable, highest first
missing_summary = data.isna().mean().sort_values(ascending=False)
print(missing_summary.head(20))

# Visualize the missingness pattern for the first 50 variables
plt.figure(figsize=(12, 6))
sns.heatmap(data.iloc[:, :50].isna(), cbar=False)
plt.title("Missing Data Heatmap (First 50 Variables)")
plt.show()
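One way to turn the missingness summary into a decision is to flag variables whose missing share exceeds a cutoff. The sketch below uses a made-up frame. The 50% threshold is an arbitrary assumption for illustration, not a recommendation from the source.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different amounts of missing data per column
toy = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [1, np.nan, np.nan, np.nan],
    "C": [1, 2, np.nan, 4],
})

missing_share = toy.isna().mean()
threshold = 0.5  # assumed cutoff for illustration only

# Columns whose share of missing values exceeds the cutoff
high_missing = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=high_missing)

print("Dropped:", high_missing)
print("Remaining:", list(reduced.columns))
```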


## 4.4 Variable Dictionary Preview

The variables in the YRBSS dataset are named according to CDC coding conventions. This preview shows the first 30 variables to help guide variable selection for later analysis.


In [None]:
list(data.columns)[:30]


## 4.5 Distribution of Key Demographic Variables

Understanding the distribution of demographic variables such as sex, grade, and age provides context for interpreting behavioral and mental health patterns. These variables are commonly used as predictors in youth risk behavior models.


In [None]:
import matplotlib.pyplot as plt

# (column, title name, x-axis label) for each demographic variable
demographics = [
    ("SEX", "Sex", "Sex Code"),
    ("AGE", "Age", "Age"),
    ("GRADE", "Grade", "Grade"),
]

# Plot unweighted counts, with bars ordered by response code
for column, name, xlabel in demographics:
    plt.figure(figsize=(5, 4))
    data[column].value_counts().sort_index().plot(kind="bar")
    plt.title(f"Distribution of {name}")
    plt.xlabel(xlabel)
    plt.ylabel("Count")
    plt.show()
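The counts plotted above are unweighted. As a hedged sketch, a weighted distribution can be obtained by summing the sampling weights within each category. The frame and its values are made up; only the column names `SEX` and `WEIGHT` mirror those used in this notebook.

```python
import pandas as pd

# Hypothetical frame: category codes with made-up sampling weights
toy = pd.DataFrame({
    "SEX":    [1, 1, 2, 2, 2],
    "WEIGHT": [3.0, 3.0, 1.0, 1.0, 2.0],
})

# Unweighted share: fraction of rows in each category
unweighted_share = toy["SEX"].value_counts(normalize=True).sort_index()

# Weighted share: fraction of total weight in each category
weighted_share = toy.groupby("SEX")["WEIGHT"].sum() / toy["WEIGHT"].sum()

print(pd.DataFrame({"unweighted": unweighted_share, "weighted": weighted_share}))
```

In this toy example category 1 makes up 40% of the rows but 60% of the total weight.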


## 4.6 Exploring Mental Health Indicators

Several YRBSS questions relate directly to mental health outcomes, such as sadness, hopelessness, suicidal ideation, and stress-related behaviors. This subsection identifies which of these variables exist in the 2019 dataset and provides basic frequency counts.


In [None]:
# Show only columns that contain "sad", "hopeless", "suicide", or similar
[key for key in data.columns if "sad" in key.lower() or "suic" in key.lower() or "hop" in key.lower()]
# Replace 'QNXX' with actual variable once identified
# data['QNXX'].value_counts(dropna=False)
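Once an outcome variable has been identified, its frequencies and its breakdown by a demographic variable could be examined as sketched below. The column `Q_SAD` and its coding (1 = yes, 2 = no) are hypothetical placeholders, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "Q_SAD" is a placeholder outcome (1 = yes, 2 = no)
toy = pd.DataFrame({
    "SEX":   [1, 1, 2, 2, 2, 1],
    "Q_SAD": [1, 2, 1, 1, np.nan, 2],
})

# Frequency counts, keeping missing responses visible
counts = toy["Q_SAD"].value_counts(dropna=False)

# Row-normalized cross-tabulation: share of each response within each sex code
rates = pd.crosstab(toy["SEX"], toy["Q_SAD"], normalize="index")

print(counts)
print(rates)
```

Rows with a missing outcome are excluded from the cross-tabulation, so the shares are computed among respondents who answered.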


## 4.7 Correlation Matrix

A correlation matrix provides a high-level view of how numeric variables relate to each other. Although many YRBSS variables are categorical, numeric relationships may still help inform feature selection.


In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(data.corr(numeric_only=True).abs(), cmap="viridis")
plt.title("Correlation Heatmap (Numeric Variables)")
plt.show()
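With more than 100 variables, a full heatmap can be hard to read. A complementary sketch is to rank variables by the absolute value of their correlation with a single outcome column. The frame, the column names, and the choice of `target` as the outcome are all hypothetical. Pearson correlation is only a rough screen for coded categorical items.

```python
import pandas as pd

# Hypothetical frame: a binary outcome and three numeric predictors
toy = pd.DataFrame({
    "target": [0, 0, 1, 1, 1, 0],
    "x_pos":  [1, 2, 5, 6, 7, 1],
    "x_neg":  [9, 8, 2, 1, 2, 9],
    "x_flat": [1, 2, 1, 2, 1, 2],
})

# Absolute correlation of every other column with the outcome, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)

print(corr_with_target)
```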
