# Predicting Factors Associated With Poor Mental Health Among U.S. High School Students

Student: Michael Kamp  

## 1. Business Understanding

In recent years, schools across the United States have reported growing concerns about student mental health, including increased levels of anxiety, depression, and stress-related challenges. These mental-health conditions can affect academic performance, attendance, social development, and overall well-being, making early identification an important priority for educators and public-health organizations.

This project uses data from the 2019 CDC Youth Risk Behavior Surveillance System (YRBSS), a national survey that collects detailed information on student behaviors, demographics, and health outcomes. By analyzing this large-scale dataset, we aim to better understand the factors that may contribute to poor mental-health outcomes among high-school students.

The analysis focuses on identifying relationships between demographic characteristics, behavioral patterns, and one key mental-health indicator: whether a student reported feeling sad or hopeless almost every day for at least two weeks (QN8). These insights can help support prevention efforts, guide policy decisions, and highlight risk factors that may require additional attention in school environments.

_Research Question: Which demographic and behavioral factors are most strongly associated with poor mental-health risk among U.S. high-school students?_



## 2. Data Understanding

Before preparing the dataset for modeling, this section provides an initial examination of the 2019 CDC Youth Risk Behavior Surveillance System (YRBSS) data. The goal of Data Understanding is to explore the structure, completeness, and basic characteristics of the dataset to identify potential issues that must be addressed during data preparation.

The YRBSS dataset contains a wide range of behavioral and demographic variables related to youth mental and physical health. In this project, the primary outcome of interest is **QN8**, which indicates whether a student felt sad or hopeless almost every day for ≥2 weeks during the past year.

This section performs the following steps:

### **1. Preview the Dataset**
Displays the first several rows to confirm that variables were imported correctly from the fixed-width `.dat` file using the SAS layout script.

### **2. Summary Statistics**
Generates descriptive statistics for numeric variables to explore ranges, distributions, and potential anomalies.

### **3. Data Types**
Lists the data types of all variables to identify categorical, numeric, and incorrectly inferred types that may require conversion during data preparation.

### **4. Missing Values**
Identifies variables with missing or CDC-coded missing responses.  
The YRBSS dataset frequently uses the following codes for nonresponse:

- **7**, **8**, **9**
- **77**, **88**, **99**

Recognizing these early helps ensure accurate cleaning in later steps.
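
As a rough illustration of this check, the sketch below counts true `NaN` values and coded nonresponses separately. The toy values and the choice of which columns count as coded survey items are assumptions made for the example, not taken from the YRBSS file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; real column names and codes come from the YRBSS codebook.
toy = pd.DataFrame({
    "Q2": [1, 2, 9, 1, np.nan],
    "Q3": [1, 7, 3, 4, 2],
    "BMIPCT": [55.0, 77.0, 12.5, 99.0, 8.0],  # continuous: 77, 99, and 8 are valid values here
})

coded_missing = [7, 8, 9, 77, 88, 99]
survey_items = ["Q2", "Q3"]  # assumed to be the only columns that use the codes

# Count ordinary NaN values and coded nonresponses side by side
missing_report = pd.DataFrame({
    "true_nan": toy[survey_items].isna().sum(),
    "coded_missing": toy[survey_items].isin(coded_missing).sum(),
})
print(missing_report)
```

Keeping continuous measures such as `BMIPCT` out of the check avoids flagging legitimate percentiles as nonresponse.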

### **5. Exploration of Key Demographic Variables**
Examines unique values for important demographic indicators such as:

- **Q2 – Sex**
- **Q3 – Grade**
- **Q6 – Age Category**
- **RACEETH – Race/Ethnicity**

Understanding category structure is essential before encoding.

### **6. Distribution of Important Behavioral Variables**
Creates histograms for continuous or count-based variables such as:

- Alcohol use  
- Cigarette/vaping use  
- Soda consumption  
- BMI percentile  

These visualizations help detect skewness, outliers, or unusual patterns.

### **7. Target Variable Distribution (QN8)**
Plots the distribution of the mental-health outcome to assess class imbalance.  
This check matters because students reporting persistent sadness or hopelessness are typically the minority class, so overall accuracy can look strong even when a model misses many at-risk students.
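
A minimal sketch of how class balance can be quantified, using hypothetical responses with the notebook's coding (1 = yes, 2 = no). The majority-class share is the accuracy a model would reach by always predicting the more common answer, which gives a floor for judging later results.

```python
import pandas as pd

# Hypothetical responses, not real survey data (1 = yes, 2 = no)
qn8 = pd.Series([1, 2, 2, 1, 2, 2, 2, 1, 2, 2])

# Share of each response
class_share = qn8.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```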

---

By conducting these checks before any cleaning or feature engineering, we establish a clear understanding of the dataset’s structure, potential data-quality issues, and variable characteristics. These findings guide the subsequent **Data Preparation** steps and support accurate, reliable modeling later in the project.



In [None]:
# DATA UNDERSTANDING
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

print("Preview of dataset:")
display(data.head())

# Summary statistics
print("\nSummary statistics:")
display(data.describe(include='all'))

# Data types
print("\nData types:")
display(data.dtypes)

# Missing values
print("\nMissing values in each column:")
display(data.isnull().sum())

# Examine unique values in key demographic variables
categorical_cols = ['Q2', 'Q3', 'Q6', 'RACEETH']

print("\nUnique values for categorical variables:")
for col in categorical_cols:
    if col in data.columns:
        print(f"{col}: {data[col].unique()}")

# Explore distributions of important numeric variables
numeric_cols = ['BMIPCT', 'QNDAYEVP', 'QN12', 'QNDAYCIG', 'QNSODA1']

for col in numeric_cols:
    if col in data.columns:
        plt.figure(figsize=(6,4))
        data[col].hist(bins=20, color='skyblue', edgecolor='black')
        plt.title(f"Distribution of {col}")
        plt.xlabel(col)
        plt.ylabel("Frequency")
        plt.show()

# Target variable distribution (QN8: sad/hopeless 2+ weeks)
if 'QN8' in data.columns:
    plt.figure(figsize=(6,4))
    data['QN8'].value_counts(dropna=False).plot(kind='bar', color=['red','green'])
    plt.title("Distribution of QN8 (Persistent Sadness / Hopelessness)")
    plt.xlabel("Response")
    plt.ylabel("Count")
    plt.xticks(rotation=0)
    plt.show()


## 3. Data Preparation

The purpose of this section is to clean, organize, and prepare the 2019 YRBSS dataset for modeling.  
Because the raw `.dat` file contains hundreds of variables, mixed formats, and CDC-coded missing values, careful preprocessing is required to ensure analytical accuracy.

This section outlines each step taken to transform the dataset into a usable modeling table.

### **1. Load Data Using SAS Layout File**
The YRBSS dataset is provided in a fixed-width format.  
Column positions and names are extracted programmatically from the official CDC SAS input script, ensuring that variables are imported correctly and consistently.
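
The loading step is not shown in this notebook, so the following is only a sketch of the idea. It assumes the SAS script lists each variable as `@start NAME width.` (with an optional `$` for character fields); the layout excerpt and the two records are hypothetical stand-ins for the real files.

```python
import io
import re
import pandas as pd

# Hypothetical excerpt of a SAS INPUT statement: "@start NAME [$]width."
sas_layout = """
@1 Q2 $1.
@2 Q3 $1.
@3 BMIPCT 5.
"""

# Hypothetical fixed-width records that follow the layout above
raw_records = "1255.10\n2112.50\n"

pattern = re.compile(r"@(\d+)\s+(\w+)\s+\$?(\d+)\.")
names, colspecs = [], []
for start, name, width in pattern.findall(sas_layout):
    begin = int(start) - 1  # SAS columns are 1-based; read_fwf expects 0-based, half-open spans
    names.append(name)
    colspecs.append((begin, begin + int(width)))

toy = pd.read_fwf(io.StringIO(raw_records), colspecs=colspecs, names=names, header=None)
print(toy)
```

Deriving `colspecs` from the layout script, rather than typing positions by hand, keeps the column boundaries in step with the official file definition.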

### **2. Select Relevant Variables**
A subset of variables was chosen based on their relevance to youth mental health research, including:

- **Demographics:** Sex (Q2), Grade (Q3), Age category (Q6), Race/Ethnicity (RACEETH)  
- **Behavioral risk factors:** Alcohol use, cigarette use, vaping, soda consumption, physical inactivity  
- **Health indicators:** Obesity status, BMI percentile  
- **Target variable:** **QN8** — sadness/hopelessness for ≥2 weeks

These variables reflect domains known to influence adolescent mental-health outcomes.

### **3. Recode Target Variable**
The mental-health indicator **QN8** is transformed into a binary classification label:

- **1 = Poor mental health**  
- **0 = No reported symptoms**

This allows supervised machine-learning algorithms to predict mental-health risk.

### **4. Handle Missing and Miscoded Values**
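CDC YRBSS uses coded values such as *7, 8, 9, 77, 88, 99* to represent non-responses.  
These codes are replaced with `NaN`. The code in this notebook then drops every row that still has a missing value in any selected variable and prints how many rows were removed. Imputation is the alternative when dropping would discard too many rows:

- **Numeric indicators:** Median imputation  
- **Categorical indicators:** Most-frequent value imputation  

Dropping rows keeps only observed responses but shrinks the sample and can bias results if nonresponse is related to the outcome. Imputation retains every row at the cost of introducing estimated values.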
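
As a minimal sketch of the median and most-frequent strategies named above, assuming scikit-learn's `SimpleImputer` and a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical values; NaN marks a nonresponse that has already been recoded
toy = pd.DataFrame({
    "BMIPCT": [10.0, np.nan, 50.0, 90.0],  # numeric indicator
    "Q3": [1.0, 2.0, 2.0, np.nan],         # categorical indicator (grade code)
})

# Median for the numeric column, most frequent category for the categorical one
toy[["BMIPCT"]] = SimpleImputer(strategy="median").fit_transform(toy[["BMIPCT"]])
toy[["Q3"]] = SimpleImputer(strategy="most_frequent").fit_transform(toy[["Q3"]])
print(toy)
```

In a modeling workflow the imputers would normally be fit on the training split only and then applied to the test split, so that test-set values do not influence the fill-in statistics.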

### **5. One-Hot Encode Categorical Variables**
Categorical variables such as **Sex, Grade, Age Category, and Race/Ethnicity** are converted into numerical dummy variables with `pd.get_dummies(X, drop_first=True)`.  
Dropping the first level of each variable avoids perfect collinearity among the dummy columns, which matters for logistic regression; decision trees are unaffected by the dropped level.
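
A small sketch of what the encoding produces, using hypothetical category codes. Each categorical column with k levels becomes k − 1 dummy columns, and numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy rows; codes stand in for the survey's category values
toy = pd.DataFrame({
    "Q2": pd.Categorical([1, 2, 2, 1]),   # sex code (2 levels)
    "Q3": pd.Categorical([1, 2, 3, 4]),   # grade code (4 levels)
    "BMIPCT": [40.0, 62.5, 88.0, 15.0],   # numeric, left as-is
})

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

The omitted first level of each variable acts as the reference category against which the remaining dummy coefficients are interpreted.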

### **6. Scale Numerical Features**
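Regularized logistic regression is sensitive to feature scale, so the features are standardized using **StandardScaler** so that each column has:

- Mean = 0  
- Standard deviation = 1  

The scaler is fit on the training split only and then applied to the test split, which keeps test-set information out of preprocessing. Standardization puts the logistic-regression coefficients on a comparable scale and helps the solver converge. The decision tree is trained on the unscaled features because tree splits do not depend on feature scale.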
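
To illustrate standardization, here is a small sketch with hypothetical values. It fits the scaler on the training rows and reuses those statistics for the test row, so the test value is expressed relative to the training distribution.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[40.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_toy)  # learns mean and std from training rows only
X_test_std = scaler.transform(X_test_toy)        # reuses the training mean and std

print(X_train_std.ravel(), X_test_std.ravel())
```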

### **7. Train-Test Split**
The cleaned dataset is split into:

- **75% Training data**
- **25% Testing data**
- **Stratified on the target variable (QN8)** to maintain proportional representation

This enables fair and unbiased model evaluation.
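
The effect of stratification can be sketched on hypothetical labels with a 30% positive rate. With `stratify` set, both splits should keep roughly that rate, up to rounding.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 positives and 70 negatives
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Positive rate in each split
print(f"train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```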

### **Summary**
These preparation steps ensure:

- Clean and reliable input data  
- Proper handling of missing and categorical variables  
- Consistent formats for model training  
- Features on a common scale for logistic regression  

With the dataset now fully prepared, the next section (Modeling) builds and evaluates the machine-learning models.



In [None]:
# DATA PREPARATION

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


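# 1. Replace CDC-coded missing values with NaN
# Skip BMIPCT: it is a continuous percentile, so values such as 7, 77, or 99
# are legitimate measurements rather than nonresponse codes.
coded_missing = [7, 8, 9, 77, 88, 99]
coded_cols = [c for c in data.columns if c != 'BMIPCT']
data[coded_cols] = data[coded_cols].replace(coded_missing, np.nan)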

# 2. Select relevant variables for modeling
selected_columns = [
    'Q2',          # Sex
    'Q3',          # Grade
    'Q6',          # Age category
    'RACEETH',     # Race / Ethnicity

    'QN8',         # Target: sad/hopeless ≥2 weeks

    'QN12',        # Alcohol use
    'QNDAYEVP',    # E-cigarette use (days)
    'QNDAYCIG',    # Cigarette use (days)
    'QNSODA1',     # Soda consumption
    'QNPA0DAY',    # No physical activity
    'QNOBESE',     # Obesity status
    'BMIPCT'       # BMI percentile
]

selected_columns = [c for c in selected_columns if c in data.columns]
cleaned = data[selected_columns].copy()

# 3. Convert demographic variables to categorical
categorical_vars = ['Q2', 'Q3', 'Q6', 'RACEETH']

for col in categorical_vars:
    if col in cleaned.columns:
        cleaned[col] = cleaned[col].astype('category')

# 4. Convert remaining variables to numeric
for col in cleaned.columns:
    if col not in categorical_vars:
        cleaned[col] = pd.to_numeric(cleaned[col], errors='coerce')

# 5. Drop rows with missing values for modeling
rows_before = cleaned.shape[0]
cleaned = cleaned.dropna()
rows_after = cleaned.shape[0]

print(f"Rows before dropping missing values: {rows_before}")
print(f"Rows after dropping missing values:  {rows_after}")
print(f"Total rows removed:                  {rows_before - rows_after}")

# 6. Split into features (X) and target (y)
y = (cleaned['QN8'] == 1).astype(int)
X = cleaned.drop('QN8', axis=1)

# 7. One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)

print("Shape after encoding:", X.shape)

# 8. Train/Test Split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
print("Training target distribution:")
print(y_train.value_counts(normalize=True))

# 9. Standard Scaling (Logistic Regression only)
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nScaling complete.")
print("Scaled X_train shape:", X_train_scaled.shape)
print("Scaled X_test shape:", X_test_scaled.shape)


## 4. Exploratory Data Analysis (EDA)

This section explores key patterns, distributions, and relationships within the 2019 Youth Risk Behavior Surveillance System (YRBSS) dataset. The purpose of EDA is to build an initial understanding of the factors associated with **poor mental-health outcomes (QN8)**.

We generate the following visualizations:

1. **Distribution of Mental-Health Outcome (QN8):**  
   A count plot showing how many students reported feeling sad or hopeless for ≥2 weeks.

2. **Mental-Health Outcome by Biological Sex (Q2):**  
   A comparison of poor mental-health prevalence for male vs. female students.

3. **Mental-Health Outcome by Grade Level (Q3):**  
   A count plot showing differences across grade levels (9th–12th).

4. **Mental-Health Outcome by Age Category (Q6):**  
   A visualization comparing mental-health outcomes across student age groups.

5. **Distribution of Alcohol Use (QN12):**  
   A histogram or bar plot showing how alcohol-use frequency varies, and how it relates to mental-health outcomes.

6. **Distribution of Obesity Status (QNOBESE):**  
   A bar plot examining the relationship between obesity classification and reported sadness/hopelessness.

7. **Correlation Heatmap (Numeric Variables Only):**  
   A heatmap visualizing relationships between numerical variables such as:
   - BMI Percentile (BMIPCT)  
   - Frequency of vaping/smoking/drinking  
   - Other numeric CDC-coded indicators

These visualizations provide foundational insights into trends, relationships, and potential predictors of poor mental health among U.S. high-school students. The findings from this exploratory analysis help guide the later modeling steps.


In [None]:
# Exploratory Data Analysis (EDA)

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# 1. Dataset Overview
print("Dataset Shape:", data.shape)
display(data.head())
display(data.describe(include="all"))


# 2. Plot Distribution of Key Demographic Variables
demo_vars = {
    "Q2": "Sex",
    "Q3": "Grade",
    "Q6": "Age Category",
    "RACEETH": "Race/Ethnicity"
}

for col, label in demo_vars.items():
    if col in data.columns:
        plt.figure(figsize=(6,4))
        sns.countplot(x=data[col], palette="Set2")
        plt.title(f"Distribution of {label}")
        plt.xlabel(label)
        plt.ylabel("Count")
        plt.tight_layout()
        plt.show()


# 3. Target Variable Distribution (QN8)
if "QN8" in data.columns:
    plt.figure(figsize=(6,4))
    sns.countplot(x=data["QN8"], palette="flare")
    plt.title("Distribution of Mental Health Indicator (QN8: Sad/Hopeless 2+ Weeks)")
    plt.xlabel("QN8 (1 = Yes, 2 = No)")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()


# 4. Behavioral Indicators (Alcohol, Cigarettes, Vaping, etc.)
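# Keys must match the column names used in Data Preparation; a misspelled key
# is silently skipped by the `if col in data.columns` check below.
behavior_vars = {
    "QN12": "Alcohol Use Frequency",
    "QNDAYCIG": "Cigarette Use Frequency",
    "QNDAYEVP": "Vaping Frequency",
    "QNSODA1": "Soda Consumption",
    "QNPA0DAY": "Physical Inactivity (0 days active)"
}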

for col, label in behavior_vars.items():
    if col in data.columns:
        plt.figure(figsize=(6,4))
        sns.countplot(x=data[col], palette="cool")
        plt.title(f"Distribution of {label}")
        plt.xlabel(label)
        plt.ylabel("Count")
        plt.tight_layout()
        plt.show()

# 5. Compare Mental Health (QN8) by Important Predictors
compare_vars = {
    "Q2": "Sex",
    "Q3": "Grade",
    "Q6": "Age Category",
    "QN12": "Alcohol Use",
    "QNSODA1": "Soda Consumption",
    "QNPA0DAY": "Physical Inactivity",
    "QNOBESE": "Obesity Status"
}

for col, label in compare_vars.items():
    if col in data.columns:
        plt.figure(figsize=(8,4))
        sns.countplot(x=data[col], hue=data["QN8"], palette="magma")
        plt.title(f"QN8 Mental Health Outcome by {label}")
        plt.xlabel(label)
        plt.ylabel("Count")
        plt.legend(title="QN8 (1 = Yes, 2 = No)")
        plt.tight_layout()
        plt.show()

# 6. Correlation Heatmap (Numeric Variables)
numeric = data.select_dtypes(include=["int64", "float64"]).copy()
plt.figure(figsize=(12,8))
sns.heatmap(numeric.corr(), cmap="coolwarm", center=0)
plt.title("Correlation Heatmap (Numeric Variables)")
plt.tight_layout()
plt.show()



## 5. Modeling

In this section, we build supervised machine-learning models to predict whether a student reported experiencing **persistent sadness or hopelessness for ≥2 weeks (QN8)**.  
Using the cleaned and encoded features created in the Data Preparation section, we train and evaluate two classification algorithms:

- **Logistic Regression** – a linear baseline classifier  
- **Decision Tree Classifier** – a non-linear model capable of capturing complex interactions  

Both models are trained on the training split (`X_train`, `y_train`) and evaluated on the testing split (`X_test`, `y_test`).  
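Model performance is assessed using the metrics below. Accuracy and the precision/recall/F1 report are printed in this section; the confusion matrices, ROC curves, and AUC scores are produced in Model Evaluation: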

- **Accuracy:** Overall proportion of correct predictions  
- **Confusion Matrix:** Breakdown of true/false positives and negatives  
- **Precision, Recall, F1-score:** Useful for imbalanced health outcomes  
- **ROC Curve and AUC Score:** Ability to distinguish between mental-health risk vs. no risk  

These evaluations help determine which model better identifies students at elevated mental-health risk.
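
To make the metric definitions concrete, the sketch below derives precision, recall, and F1 from the four confusion-matrix counts. The labels and predictions are hypothetical, and 1 denotes the positive (poor mental health) class.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the students flagged, how many were truly positive
recall = tp / (tp + fn)      # of the truly positive students, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Recall is often the metric of most interest in a screening setting, since a false negative corresponds to an at-risk student who goes unidentified.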


In [None]:
# MODELING: Logistic Regression and Decision Tree Classifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
)

print("=== MODELING SECTION ===\n")


# Logistic Regression (uses scaled features)
log_model = LogisticRegression(max_iter=500, solver="lbfgs")
log_model.fit(X_train_scaled, y_train)

# Predictions
log_pred = log_model.predict(X_test_scaled)

# Evaluation
log_accuracy = accuracy_score(y_test, log_pred)
print(f"Logistic Regression Accuracy: {log_accuracy:.4f}")
print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, log_pred))

# Decision Tree Classifier (uses unscaled features)
tree_model = DecisionTreeClassifier(
    criterion="gini",
    max_depth=None,
    random_state=42
)

tree_model.fit(X_train, y_train)

# Predictions
tree_pred = tree_model.predict(X_test)

# Evaluation
tree_accuracy = accuracy_score(y_test, tree_pred)
print(f"Decision Tree Accuracy: {tree_accuracy:.4f}")
print("\nDecision Tree Classification Report:")
print(classification_report(y_test, tree_pred))


# Model Comparison Summary
print("=== MODEL COMPARISON SUMMARY ===")
print(f"Logistic Regression Accuracy: {log_accuracy:.4f}")
print(f"Decision Tree Accuracy:      {tree_accuracy:.4f}")



## 6. Model Evaluation

In this section, we evaluate the performance of our supervised machine-learning models and assess their ability to accurately predict whether a student reported poor mental health (persistent sadness or hopelessness). Evaluation is a critical step in determining how well each model generalizes to unseen data and how reliably it addresses the project’s research question.

The following evaluation tools are used in this section:

- **Confusion Matrix:**  
  Provides a breakdown of true positives, true negatives, false positives, and false negatives.  
  This helps us understand each model’s ability to correctly identify students with and without reported mental-health concerns.

- **Receiver Operating Characteristic (ROC) Curve:**  
  Illustrates the trade-off between sensitivity (true positive rate) and specificity (1 − false positive rate).  
  ROC curves help determine how well a model separates the two classes across different probability thresholds.

- **Area Under the Curve (AUC):**  
  A single numeric score summarizing model performance.  
  Higher AUC values indicate stronger ability to distinguish students with poor mental-health outcomes from those without.

- **Feature Importance (Decision Tree):**  
  Highlights the behavioral and demographic variables most strongly associated with predicting mental-health outcomes, helping identify which factors contribute most to the model’s decisions.

These evaluation steps provide insight into the strengths and limitations of each model, helping inform which classifier is most appropriate for interpreting mental-health trends among high-school students.
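
As a small worked example of the ROC and AUC ideas, the sketch below uses four hypothetical students with predicted probabilities. The AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score; here three of the four pairs are ordered correctly.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc_value = roc_auc_score(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc_value:.2f}")
```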


In [None]:
# MODEL EVALUATION: Confusion Matrices, ROC Curves, AUC, Feature Importance

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    roc_auc_score
)

print("=== MODEL EVALUATION SECTION ===\n")

# CONFUSION MATRICES
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Logistic Regression CM
ConfusionMatrixDisplay.from_estimator(
    log_model, X_test_scaled, y_test, 
    cmap="Blues", ax=axes[0]
)
axes[0].set_title("Logistic Regression – Confusion Matrix")

# Decision Tree CM
ConfusionMatrixDisplay.from_estimator(
    tree_model, X_test, y_test,
    cmap="Greens", ax=axes[1]
)
axes[1].set_title("Decision Tree – Confusion Matrix")

plt.tight_layout()
plt.show()



# ROC CURVES + AUC
# Draw both curves on one shared Axes; without ax=, each call opens a new figure
fig, ax = plt.subplots(figsize=(8, 6))

# Calculate AUC values
log_auc = roc_auc_score(y_test, log_model.predict_proba(X_test_scaled)[:, 1])
tree_auc = roc_auc_score(y_test, tree_model.predict_proba(X_test)[:, 1])

# Logistic Regression ROC (the display appends "(AUC = ...)" to the name itself)
RocCurveDisplay.from_estimator(
    log_model, X_test_scaled, y_test,
    name="Logistic Regression",
    color="blue", ax=ax
)

# Decision Tree ROC
RocCurveDisplay.from_estimator(
    tree_model, X_test, y_test,
    name="Decision Tree",
    color="green", ax=ax
)

# Reference line
ax.plot([0, 1], [0, 1], "k--", label="Random Guess")

ax.set_title("ROC Curve Comparison")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend()
ax.grid(alpha=0.3)
plt.show()


# DECISION TREE FEATURE IMPORTANCE
importances = tree_model.feature_importances_
feature_names = X_train.columns

# Only show >0 importance, sorted from most to least important
imp_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
})
imp_df = imp_df[imp_df["Importance"] > 0].sort_values(by="Importance", ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(
    data=imp_df.head(20),
    x="Importance", y="Feature",
    palette="viridis"
)
plt.title("Top 20 Feature Importances – Decision Tree")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()

print("=== MODEL EVALUATION COMPLETE ===")
print(f"Logistic Regression AUC: {log_auc:.4f}")
print(f"Decision Tree AUC:       {tree_auc:.4f}")
