# Predicting Youth Mental Health Risk
**Student:** Michael Kamp

## 1. Business Understanding

This notebook supports the final project for **DATA 747** and focuses on predicting poor mental-health outcomes among U.S. high-school students using the **2019 CDC Youth Risk Behavior Surveillance System (YRBSS)** dataset.

The primary goal of this analysis is to examine behavioral and demographic factors associated with the likelihood that a student reports **persistent sadness or hopelessness (QN8)**. Using this dataset, we prepare the data, conduct exploratory analyses, train supervised machine-learning models, and evaluate their predictive performance.

### **Research Question**
**What factors help predict whether a high-school student reports poor mental health, defined as persistent sadness or hopelessness for two or more consecutive weeks?**


## **Notebook Roadmap**

1. **Data Understanding**  
   Review dataset structure, coding conventions, and key variables.

2. **Data Preparation**  
   Clean the dataset, handle missing values, recode variables, and prepare inputs for modeling.

3. **Exploratory Data Analysis (EDA)**  
   Visualize distributions and examine relationships among behavioral, demographic, and health variables.

4. **Modeling**  
   Train and evaluate two supervised classification models: logistic regression and decision tree.

5. **Model Evaluation**  
   Assess predictive performance using confusion matrices, ROC curves, classification metrics, and feature importance.



In [None]:
# =============================================
# GLOBAL WARNING SUPPRESSION (Clean Notebook)
# =============================================
import warnings

# Suppress scikit-learn feature name warnings
warnings.filterwarnings("ignore", message="X has feature names")

# Suppress all UserWarnings
warnings.filterwarnings("ignore", category=UserWarning)

# Suppress seaborn/matplotlib FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

# =============================================
# STANDARD IMPORTS
# =============================================
import pandas as pd
import numpy as np
import re

# Paths to the raw fixed-width data and SAS layout file
data_file = "yrbs2019.dat"
sas_layout = "yrbs2019_input.sas"


def parse_sas_input_pointer_format(file_path):
    """
    Parse a CDC SAS INPUT file in pointer format (e.g., @1 Q1 2.)
    and return (colspecs, names) for use with pandas.read_fwf.
    """
    colspecs = []
    names = []
    pattern = re.compile(r"@(\d+)\s+(\w+)\s+(\$?)(\d+)")

    entries = []

    with open(file_path, "r") as f:
        for line in f:
            match = pattern.search(line)
            if match:
                start = int(match.group(1)) - 1  # convert 1-based to 0-based index
                name = match.group(2)
                width = int(match.group(4))
                end = start + width
                entries.append((start, end, name))

    # Sort by starting position
    entries.sort(key=lambda x: x[0])

    # Build colspecs and column names
    for start, end, name in entries:
        colspecs.append((start, end))
        names.append(name)

    return colspecs, names


# Parse SAS layout
colspecs, names = parse_sas_input_pointer_format(sas_layout)
print(f"Columns detected: {len(names)}")

# Load fixed-width YRBS 2019 dataset
data = pd.read_fwf(data_file, colspecs=colspecs, names=names)

print("First 10 rows:")
display(data.head(10))
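
As a quick sanity check on the parsing logic, the sketch below runs the same regular expression over a made-up three-variable layout and two invented records. The layout text, the record values, and the `_demo` names are assumptions for illustration only and are not taken from the CDC files.

```python
import io
import re

import pandas as pd

# Hypothetical layout in SAS pointer format (not the real CDC file)
sample_layout = """
INPUT
@1 Q1 $1.
@2 Q2 $1.
@3 BMIPCT 5.2
;
"""

pattern = re.compile(r"@(\d+)\s+(\w+)\s+(\$?)(\d+)")

colspecs_demo, names_demo = [], []
for line in sample_layout.splitlines():
    match = pattern.search(line)
    if match:
        start = int(match.group(1)) - 1   # 1-based column -> 0-based offset
        width = int(match.group(4))
        colspecs_demo.append((start, start + width))
        names_demo.append(match.group(2))

# Two invented fixed-width records that follow the layout above
sample_data = io.StringIO("1285.50\n2112.25\n")
demo = pd.read_fwf(sample_data, colspecs=colspecs_demo, names=names_demo)

print(colspecs_demo)
print(demo)
```

If the column boundaries printed here line up with the layout, the same logic should slice the real file at the positions the SAS program declares.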



## 2. Data Understanding

This section provides an initial examination of the 2019 CDC Youth Risk Behavior Surveillance System (YRBSS) dataset. The dataset contains a wide range of demographic, behavioral, and health-related variables reported by high-school students across the United States. Before preparing the data for modeling, it is essential to understand the structure, content, and quality of the dataset.

### Objectives of Data Understanding
1. **Confirm that the dataset loaded correctly** using the CDC SAS layout file to extract column positions and variable names.
2. **Review dataset structure**, including the number of rows, columns, and variable data types.
3. **Identify major characteristics of key demographic variables** such as sex (Q2), grade (Q3), and race/ethnicity (RACEETH).
4. **Examine the distribution of the mental-health target variable (QN8)**, which indicates whether a student felt sad or hopeless for two or more weeks.
5. **Assess missing data**, which is important because the YRBSS uses special numeric codes (e.g., 7, 8, 9, 77, 88, 99) to represent non-responses.
6. **Explore early distributions** of selected behavioral and health variables relevant to the research question.

### Why This Matters
A thorough understanding of the dataset ensures that the next step—data preparation—is performed accurately and effectively. Identifying missing values, variable types, and preliminary patterns informs decisions about cleaning, recoding, and selecting features for modeling.

The visualizations and summary statistics in this section help lay the foundation for exploratory data analysis and model development in later sections.
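
One way to approach objective 4 is to tabulate the target while keeping missing responses visible. The sketch below uses a small invented frame rather than the YRBSS file; the 1 = yes / 2 = no coding is assumed here for illustration and should be checked against the codebook.

```python
import numpy as np
import pandas as pd

# Invented responses: 1 = yes, 2 = no, NaN = no answer (coding assumed)
toy = pd.DataFrame({"QN8": [1, 2, 2, np.nan, 1, 2, 2, 2]})

# Share of each response, with missing answers counted as their own row
target_share = toy["QN8"].value_counts(dropna=False, normalize=True)
print(target_share)

# Fraction of rows where the target is missing
missing_rate = toy["QN8"].isna().mean()
print(f"Missing target rate: {missing_rate:.3f}")
```

Keeping `dropna=False` makes it harder to overlook how many rows would be lost if incomplete records are dropped later.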


In [None]:
# SECTION 2 — DATA UNDERSTANDING

import pandas as pd

print("Preview of dataset:")
display(data.head())

print("\nSummary statistics:")
display(data.describe(include="all"))

print("\nData types:")
display(data.dtypes)

print("\nMissing values in each column:")
display(data.isnull().sum())

# Key demographic variables for context
categorical_cols = ["Q2", "Q3", "Q6", "RACEETH"]

print("\nUnique values for key categorical variables:")
for col in categorical_cols:
    if col in data.columns:
        print(f"{col}: {data[col].unique()}")



## 3. Data Preparation

This section prepares the YRBSS dataset for modeling by cleaning the data, selecting relevant variables, handling missing values, encoding categorical fields, and splitting the data into training and testing sets. Because the dataset is stored in a fixed-width format and includes CDC-coded missing values, several preprocessing steps are required before fitting machine-learning models.

### Objectives of Data Preparation
1. **Clean and standardize the dataset** by replacing CDC-coded missing values with `NaN`.
2. **Select the variables most relevant to predicting mental-health outcomes**, including demographic, behavioral, and health indicators.
3. **Convert categorical variables** (e.g., sex, grade, age group, race/ethnicity) into appropriate data types.
4. **Drop incomplete rows** so that every remaining record has a value for each selected variable.
5. **Separate features (X) from the target variable (y)** and prepare them for model training.
6. **Transform categorical fields using one-hot encoding** so they can be used by machine-learning algorithms.
7. **Create a train/test split** using stratified sampling to preserve the distribution of the mental-health outcome.
8. **Scale the predictors** with `StandardScaler`, fitted on the training set only, for the logistic regression model.

### Why These Steps Are Necessary
The YRBSS dataset uses a mixture of coded categorical values and non-standard missing-data indicators. Preparing the data properly ensures that:

- the machine-learning models receive consistent and meaningful inputs,
- categorical variables are one-hot encoded so that no artificial ordering is imposed on their categories,
- scaled variables improve optimization for logistic regression,
- and the training/testing evaluation reflects the true distribution of mental-health outcomes.

At the end of this section, the dataset is fully prepared for modeling using logistic regression and decision-tree classifiers.
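
Replacing non-response codes needs some care, because a number that means "no answer" in one column can be a legitimate value in another. The sketch below contrasts a column-scoped replacement with a global one on invented data. The `missing_codes` map and the assumption that 9 is a non-response code for `Q3` are hypothetical; the actual codes for each variable should come from the codebook.

```python
import numpy as np
import pandas as pd

# Invented rows: Q3 is a coded survey item, BMIPCT is a continuous percentile
toy = pd.DataFrame({
    "Q3": [1, 2, 9, 4],                  # assume 9 marks a non-response here
    "BMIPCT": [77.0, 8.0, 54.3, 99.0],   # 77, 8 and 99 are valid percentiles
})

# Hypothetical per-column map of non-response codes
missing_codes = {"Q3": [9]}

# Column-scoped replacement: only the listed columns are touched
scoped = toy.copy()
for col, codes in missing_codes.items():
    scoped[col] = scoped[col].replace(codes, np.nan)

# For comparison: a global replacement also blanks valid BMIPCT values
global_replace = toy.replace([7, 8, 9, 77, 88, 99], np.nan)

print(scoped)
print(global_replace)
```

In the scoped version `BMIPCT` is left intact, whereas the global version turns three of its four values into `NaN`.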


In [None]:
# SECTION 3 — DATA PREPARATION

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

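# 1. Replace CDC-coded missing values with NaN
#    Caution: this is applied to every column, so a column in which any of these
#    numbers is a legitimate value (e.g., a BMIPCT percentile of 77) loses it too.
#    Confirm the non-response codes for each variable against the codebook.
coded_missing = [7, 8, 9, 77, 88, 99]
data = data.replace(coded_missing, np.nan)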

# 2. Select relevant variables for modeling
selected_columns = [
    "Q2",        # Sex
    "Q3",        # Grade
    "Q6",        # Age category
    "RACEETH",   # Race / Ethnicity

    "QN8",       # Target: sad/hopeless ≥ 2 weeks

    "QN12",      # Alcohol use
    "QNDAYEVP",  # E-cigarette use (days)
    "QNDAYCIG",  # Cigarette use (days)
    "QNSODA1",   # Soda consumption
    "QNPA0DAY",  # No physical activity
    "QNOBESE",   # Obesity status
    "BMIPCT"     # BMI percentile
]

# Keep only columns that exist in the parsed dataset
selected_columns = [c for c in selected_columns if c in data.columns]
cleaned = data[selected_columns].copy()

# 3. Convert demographic variables to categorical
categorical_vars = ["Q2", "Q3", "Q6", "RACEETH"]

for col in categorical_vars:
    if col in cleaned.columns:
        cleaned[col] = cleaned[col].astype("category")

# 4. Convert remaining variables to numeric
for col in cleaned.columns:
    if col not in categorical_vars:
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")

# 5. Drop rows with missing values for modeling
rows_before = cleaned.shape[0]
cleaned = cleaned.dropna()
rows_after = cleaned.shape[0]

print(f"Rows before dropping missing values: {rows_before}")
print(f"Rows after dropping missing values:  {rows_after}")
print(f"Total rows removed:                  {rows_before - rows_after}")

# 6. Split into features (X) and target (y)
#    QN8 == 1 indicates persistent sadness / hopelessness
y = (cleaned["QN8"] == 1).astype(int)
X = cleaned.drop("QN8", axis=1)

# 7. One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)
print("Shape after encoding:", X.shape)

# 8. Train/Test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
print("Training target distribution:")
print(y_train.value_counts(normalize=True))

# 9. Standard Scaling (for Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nScaling complete.")
print("Scaled X_train shape:", X_train_scaled.shape)
print("Scaled X_test shape:", X_test_scaled.shape)



## 4. Exploratory Data Analysis (EDA)

This section explores the structure and key characteristics of the variables included in the study. Exploratory Data Analysis helps identify patterns, trends, and potential relationships in the dataset prior to modeling. Because the YRBSS includes a broad range of demographic, behavioral, and health variables, EDA is essential for understanding which factors may be associated with poor mental-health outcomes.

### Objectives of EDA
1. **Examine the distribution of demographic variables**, including sex (Q2), grade level (Q3), and race/ethnicity (RACEETH), to understand population composition.
2. **Visualize the distribution of the target variable (QN8)**, which indicates whether a student experienced persistent sadness or hopelessness.
3. **Explore behavioral indicators** such as alcohol use, vaping, smoking, soda consumption, and physical inactivity.
4. **Compare mental-health outcomes across demographic and behavioral subgroups** using side-by-side count plots.
5. **Generate a correlation heatmap** to identify relationships among numeric variables, such as BMI percentile and behavioral frequency indicators.

### Why This Matters
EDA provides an essential foundation for modeling. Visualizing these variables allows us to quickly identify imbalances, unusual values, and potential predictors. 
Understanding how mental-health outcomes vary across demographic and behavioral groups helps guide feature selection and interpret model performance in later sections.

The visualizations in this section highlight important structures within the dataset and support the development of more effective predictive models.
Because the target variable (QN8) is highly imbalanced, EDA also helps confirm the minority class distribution and informs how model performance should be evaluated in later sections.    
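
Count plots can be hard to compare when subgroups differ in size, so it may help to look at the share of students reporting the outcome within each group. The sketch below uses invented rows; the column names mirror the notebook, but the values and the helper column `sad_hopeless` are assumptions for illustration.

```python
import pandas as pd

# Invented sample: Q2 is a two-level demographic code, QN8 == 1 means "yes"
toy = pd.DataFrame({
    "Q2":  [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "QN8": [1, 1, 2, 2, 1, 2, 2, 2, 2, 2],
})

# Binary indicator for the outcome, matching how the target is built for modeling
toy["sad_hopeless"] = (toy["QN8"] == 1).astype(int)

# Outcome rate within each subgroup (mean of a 0/1 indicator is a proportion)
rate_by_group = toy.groupby("Q2")["sad_hopeless"].mean()
print(rate_by_group)

# Same information as a row-normalized cross-tabulation
row_share = pd.crosstab(toy["Q2"], toy["QN8"], normalize="index")
print(row_share)
```

Within-group proportions make subgroups of different sizes directly comparable, which raw counts do not.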


In [None]:
# SECTION 4 — EXPLORATORY DATA ANALYSIS (EDA)

import matplotlib.pyplot as plt
import seaborn as sns

# Set a clean style for all charts
sns.set_theme(style="whitegrid")

# ---------------------------------------------
# 4.1 Distribution of the Target Variable (QN8)
# ---------------------------------------------
plt.figure(figsize=(6, 4))
sns.countplot(x=y, palette="pastel")
plt.title("Distribution of QN8 (Sad/Hopeless ≥ 2 Weeks)")
plt.xlabel("QN8: 0 = No, 1 = Yes")
plt.ylabel("Count")
plt.tight_layout()
plt.show()


# ---------------------------------------------
# 4.2 Distributions of Numerical Features
# ---------------------------------------------
numeric_vars = ["BMIPCT", "QNDAYEVP", "QNDAYCIG", "QNSODA1"]

for col in numeric_vars:
    if col in cleaned.columns:
        plt.figure(figsize=(6, 4))
        sns.histplot(cleaned[col], kde=True, color="skyblue")
        plt.title(f"Distribution of {col}")
        plt.xlabel(col)
        plt.tight_layout()
        plt.show()


# ---------------------------------------------
# 4.3 Count Plots for Categorical Variables
# ---------------------------------------------
categorical_vars = ["Q2", "Q3", "Q6", "RACEETH"]

for col in categorical_vars:
    if col in cleaned.columns:
        plt.figure(figsize=(7, 4))
        sns.countplot(data=cleaned, x=col, palette="muted")
        plt.title(f"Distribution of {col}")
        plt.xlabel(col)
        plt.ylabel("Count")
        plt.tight_layout()
        plt.show()


# ---------------------------------------------
# 4.4 QN8 vs Demographic Groups
# ---------------------------------------------
for col in categorical_vars:
    if col in cleaned.columns:
        plt.figure(figsize=(7, 4))
        sns.countplot(data=cleaned, x=col, hue="QN8", palette="Set2")
        plt.title(f"QN8 by {col}")
        plt.xlabel(col)
        plt.ylabel("Count")
        plt.legend(title="QN8")
        plt.tight_layout()
        plt.show()


# ---------------------------------------------
# 4.5 Correlation Heatmap (Numerical Variables)
# ---------------------------------------------
numeric_for_corr = ["BMIPCT", "QNDAYEVP", "QNDAYCIG", "QNSODA1", "QNPA0DAY", "QNOBESE"]

numeric_for_corr = [col for col in numeric_for_corr if col in cleaned.columns]

plt.figure(figsize=(8, 6))
sns.heatmap(cleaned[numeric_for_corr].corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap (Numerical Predictors)")
plt.tight_layout()
plt.show()



## 5. Modeling

In this section, we build predictive models to identify factors associated with poor mental-health outcomes among high-school students. Using the cleaned and prepared dataset from the previous section, we apply two supervised machine-learning algorithms:

1. **Logistic Regression** – A linear classification model used as a baseline to predict the likelihood that a student reports persistent sadness or hopelessness (QN8). Logistic regression is effective when the log-odds of the outcome are approximately linear in the predictors and when interpretability is important.

2. **Decision Tree Classifier** – A non-linear model capable of capturing complex interactions between demographic, behavioral, and health variables. Decision trees can identify the most influential predictors by recursively splitting the dataset into meaningful subgroups.
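
To illustrate the interpretability point for logistic regression, fitted coefficients can be exponentiated into odds ratios. The sketch below runs on synthetic data from `make_classification`; the feature names `f0` to `f3` are placeholders, and nothing here reflects the YRBSS results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 4 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names

# Standardize so each coefficient refers to a one-standard-deviation change
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(max_iter=500).fit(X_scaled, y_demo)

# exp(coefficient) = multiplicative change in the odds per one-SD increase
odds_ratios = pd.Series(
    np.exp(model.coef_[0]), index=feature_names
).sort_values(ascending=False)
print(odds_ratios)
```

An odds ratio above 1 indicates higher odds of the positive class as the feature increases, and a value below 1 indicates lower odds, holding the other features fixed.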

### Modeling Workflow
- The dataset is split into training and testing sets using stratified sampling to preserve the distribution of the mental-health outcome.
- Logistic regression is trained on **scaled** inputs to improve stability and performance.
- The decision-tree classifier is trained on **unscaled** inputs, since tree-based models are not sensitive to feature scaling.
- Each model generates predictions on the testing set and is evaluated using accuracy, precision, recall, F1-score, and a classification report.
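
One way to keep the scaled/unscaled distinction from being lost between training and evaluation is to bundle the scaler with the logistic regression in a scikit-learn pipeline. This is a sketch on synthetic data, not the notebook's own code; the variable names and the 80/20 class weights are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, moderately imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.8, 0.2], random_state=42
)

# Stratified split, mirroring the workflow described above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, inside the pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_tr, y_tr)

# Raw test features go in; the pipeline applies the fitted scaling itself
proba = pipe.predict_proba(X_te)[:, 1]
print("Test accuracy:", pipe.score(X_te, y_te))
```

With this design there is a single object to call at prediction time, so the model cannot accidentally receive features on a different scale from the one it was trained on.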

### Purpose of This Section
By fitting both a linear and a non-linear model, we can compare performance and better understand the predictive structure of the YRBSS dataset. This provides insight into which model is best suited for identifying students at elevated mental-health risk. Because the target variable (QN8) is highly imbalanced, evaluating models using precision, recall, F1-score, and AUC offers a clearer picture of how effectively each approach identifies the minority class.


In [None]:
# SECTION 5 — MODELING

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# ----------------------------------------------------
# 5.1 Logistic Regression (requires scaled features)
# ----------------------------------------------------
log_model = LogisticRegression(
    max_iter=500,
    solver='lbfgs',
    random_state=42
)

log_model.fit(X_train_scaled, y_train)
log_pred = log_model.predict(X_test_scaled)
log_pred_proba = log_model.predict_proba(X_test_scaled)[:, 1]

print("Logistic Regression model training complete.")


# ----------------------------------------------------
# 5.2 Decision Tree Classifier (raw, unscaled features)
# ----------------------------------------------------
tree_model = DecisionTreeClassifier(
    max_depth=None,
    min_samples_split=2,
    random_state=42
)

tree_model.fit(X_train, y_train)
tree_pred = tree_model.predict(X_test)
tree_pred_proba = tree_model.predict_proba(X_test)[:, 1]

print("Decision Tree model training complete.")



## 6. Model Evaluation

After training the logistic regression and decision-tree models, this section evaluates their performance on the testing dataset. Model evaluation provides insight into how effectively each classifier identifies students who report poor mental-health outcomes and how well the models generalize to unseen data.

### Evaluation Metrics

To assess each model, we use several standard classification performance measures:

- **Confusion Matrix**: Shows counts of true positives, true negatives, false positives, and false negatives. This matrix helps reveal how frequently each model correctly identifies students with and without reported sadness or hopelessness.
- **Classification Report**: Includes precision, recall, F1-score, and class-specific support. These metrics are especially important given the strong class imbalance in the mental-health outcome (QN8).
- **ROC Curve (Receiver Operating Characteristic)**: Illustrates the trade-off between the true positive rate and false positive rate across classification thresholds.
- **AUC (Area Under the Curve)**: Measures how well the model separates students who reported persistent sadness from those who did not. Higher AUC values indicate stronger discriminative ability.
- **Feature Importance (Decision Tree)**: Identifies which demographic, behavioral, or health-related variables contribute most strongly to the decision-tree model’s predictions.

### Purpose of This Section
This evaluation determines which model performs best and provides insight into the underlying predictive patterns within the dataset. Understanding model strengths and limitations helps guide interpretation of the results in the accompanying APA report. Given the imbalanced nature of the target variable (QN8), metrics such as recall, precision, F1-score, and AUC are essential for assessing how effectively each model identifies the minority class.
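
To make the relationship between these metrics concrete, the sketch below derives precision, recall, and F1 from the four confusion-matrix cells of a tiny invented example and compares the F1 value with scikit-learn's `f1_score`. The label vectors are made up and do not come from either model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Invented labels: 3 positives out of 10, with 1 of them detected
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print("sklearn f1:", f1_score(y_true, y_pred))
```

In this toy case accuracy is 0.70 while recall is only about 0.33, which shows why accuracy alone can look acceptable even when most positive cases are missed.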


In [None]:
# ============================
# SECTION 6 – MODEL EVALUATION
# ============================

import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    accuracy_score,
    roc_curve,
    roc_auc_score
)
import seaborn as sns

# ---------------------------------------
# 1. Logistic Regression – Predictions
# ---------------------------------------
log_probs = log_model.predict_proba(X_test_scaled)[:, 1]   # Probabilities for ROC/AUC (scaled inputs, as in training)
log_pred = log_model.predict(X_test_scaled)                # Class predictions
log_fpr, log_tpr, _ = roc_curve(y_test, log_probs)
log_auc = roc_auc_score(y_test, log_probs)

# ---------------------------------------
# 2. Decision Tree – Predictions
# ---------------------------------------
tree_probs = tree_model.predict_proba(X_test)[:, 1]
tree_pred = tree_model.predict(X_test)
tree_fpr, tree_tpr, _ = roc_curve(y_test, tree_probs)
tree_auc = roc_auc_score(y_test, tree_probs)

# ---------------------------------------
# PRINT EVALUATION METRICS
# ---------------------------------------
print("===== LOGISTIC REGRESSION METRICS =====")
print("Accuracy:", accuracy_score(y_test, log_pred))
print("AUC:", log_auc)
print("\nClassification Report:\n", classification_report(y_test, log_pred))

print("\n===== DECISION TREE METRICS =====")
print("Accuracy:", accuracy_score(y_test, tree_pred))
print("AUC:", tree_auc)
print("\nClassification Report:\n", classification_report(y_test, tree_pred))


# ============================================
# PLOT 1 – CONFUSION MATRIX (LOGISTIC REGRESSION)
# ============================================
fig, ax = plt.subplots(figsize=(6, 4))
ConfusionMatrixDisplay.from_predictions(y_test, log_pred, cmap="Blues", ax=ax)
ax.set_title("Confusion Matrix — Logistic Regression")

fig.savefig("confusion_logistic.png", dpi=300, bbox_inches="tight")
plt.show()


# ============================================
# PLOT 2 – CONFUSION MATRIX (DECISION TREE)
# ============================================
fig, ax = plt.subplots(figsize=(6, 4))
ConfusionMatrixDisplay.from_predictions(y_test, tree_pred, cmap="Greens", ax=ax)
ax.set_title("Confusion Matrix — Decision Tree")

fig.savefig("confusion_tree.png", dpi=300, bbox_inches="tight")
plt.show()


# ============================================
# PLOT 3 – ROC CURVE FOR BOTH MODELS
# ============================================
fig, ax = plt.subplots(figsize=(7, 5))

ax.plot(log_fpr, log_tpr, label=f"Logistic Regression (AUC = {log_auc:.3f})")
ax.plot(tree_fpr, tree_tpr, label=f"Decision Tree (AUC = {tree_auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="gray")

ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve for Models")
ax.legend()

fig.savefig("roc_curve.png", dpi=300, bbox_inches="tight")
plt.show()

# ============================================================
# CREATE FEATURE IMPORTANCE DATAFRAME (Decision Tree)
# ============================================================

import numpy as np
import pandas as pd

importance_df = pd.DataFrame({
    "Feature": X.columns,
    "Importance": tree_model.feature_importances_
}).sort_values(by="Importance", ascending=False)


# ============================================
# PLOT 4 – FEATURE IMPORTANCE (DECISION TREE)
# ============================================
fig, ax = plt.subplots(figsize=(8, 6))

sns.barplot(
    data=importance_df.head(15),
    x="Importance",
    y="Feature",
    palette="viridis",
    ax=ax
)

ax.set_title("Top 15 Most Important Features — Decision Tree")

fig.savefig("feature_importance.png", dpi=300, bbox_inches="tight")
plt.show()


## 7. Conclusion

This analysis compared logistic regression and decision-tree classification models to evaluate their ability to predict persistent sadness among high-school students using the 2019 YRBSS dataset. While the decision tree achieved higher overall accuracy, both models struggled to identify the minority class, reflecting the underlying class imbalance in the dataset. The evaluation metrics highlighted important strengths and limitations of each model and provided insight into which behavioral and demographic factors may contribute to elevated mental-health risk among adolescents. These findings support the need for additional techniques—such as resampling, ensemble methods, or alternative algorithms—to improve minority-class detection in future work.
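
As a minimal sketch of the resampling idea mentioned above, the example below upsamples the minority class of an invented training frame with `sklearn.utils.resample`. The data, column names, and the choice of simple random oversampling are assumptions for illustration; any resampling of this kind would be applied to the training split only, so that the test set keeps the original class distribution.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Invented training frame: 8 majority rows (0) and 2 minority rows (1)
train = pd.DataFrame({
    "feature": np.arange(10),
    "target": [0] * 8 + [1] * 2,
})

majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Sample minority rows with replacement until both classes are the same size
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle so the classes are not grouped together
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
print(balanced["target"].value_counts())
```

Random oversampling duplicates existing minority rows rather than creating new information, so it is best treated as one baseline to compare against other options.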
