# Multimodal Machine Learning for Mental Health Prediction in University Students

This notebook implements a multimodal machine learning approach to predict mental health outcomes for university students, using three different algorithms:
1. K-Nearest Neighbors (K-NN)
2. Linear Regression / Logistic Regression
3. Support Vector Machine (SVM)

We'll process different types of data (multimodal approach) to create a comprehensive prediction model.

## 1. Required Libraries and Technologies

Let's import all the necessary libraries for our multimodal machine learning model:

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Models
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR, SVC

# Evaluation metrics
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report, confusion_matrix, r2_score

# For feature selection and dimensionality reduction
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Statistical analysis
import scipy.stats as stats

# For handling imbalanced datasets (if needed)
from imblearn.over_sampling import SMOTE

# For interactive visualizations
import plotly.express as px
import plotly.graph_objects as go

# Set style for matplotlib
plt.style.use('ggplot')

# Display settings
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## 2. Data Loading and Exploration

For a mental health prediction model, we need to handle different types of data. In this section, we'll load and explore our dataset(s).

### Data Types for Multimodal Approach:

1. **Demographic data**: Age, gender, year of study, etc.
2. **Academic data**: GPA, course load, etc.
3. **Behavioral data**: Sleep patterns, exercise habits, social interactions
4. **Psychological assessments**: Standardized mental health screening scores (e.g., PHQ-9 for depression, GAD-7 for anxiety)
5. **External factors**: Financial stress, housing situation, etc.
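
One way to keep these modalities explicit in code is a plain mapping from modality name to column names. The sketch below assumes the column names of the synthetic dataset generated in this notebook; `MODALITY_COLUMNS` and `summarize_modalities` are illustrative names, not part of any library, and a real dataset would need its own mapping.

```python
import pandas as pd

# Hypothetical grouping of columns by modality; the names follow the synthetic
# dataset in this notebook and are an assumption for any other dataset.
MODALITY_COLUMNS = {
    "demographic": ["age", "gender", "year_of_study"],
    "academic": ["gpa", "course_load", "major"],
    "behavioral": ["sleep_hours", "exercise_hours_per_week", "social_activity_hours"],
    "psychological": ["depression_score", "anxiety_score", "stress_score"],
    "external": ["financial_stress", "housing_quality", "support_network"],
}


def summarize_modalities(frame: pd.DataFrame) -> pd.DataFrame:
    """Report, per modality, how many expected columns exist and their average share of missing values."""
    rows = []
    for modality, columns in MODALITY_COLUMNS.items():
        present = [c for c in columns if c in frame.columns]
        # Average the per-column missing-value rates; NaN when no column is present
        missing_share = float(frame[present].isna().mean().mean()) if present else float("nan")
        rows.append({
            "modality": modality,
            "expected_columns": len(columns),
            "present_columns": len(present),
            "missing_value_share": missing_share,
        })
    return pd.DataFrame(rows).set_index("modality")


# Tiny made-up frame that covers only part of two modalities
demo = pd.DataFrame({
    "age": [19, 22, None],
    "gender": ["Female", "Male", "Non-binary"],
    "year_of_study": [1, 4, 2],
    "gpa": [3.1, 3.8, 2.9],
})
modality_summary = summarize_modalities(demo)
print(modality_summary)
```

A summary like this makes it easy to spot a modality that is absent or mostly missing before any model is trained.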

In [None]:
# Sample code for loading data
# Replace this with actual data loading from your source

# Option 1: Load data from a CSV file
# df = pd.read_csv('student_mental_health_data.csv')

# Option 2: Create sample data for demonstration
np.random.seed(42)
size = 500

# Generate sample data
data = {
    # Demographics
    'age': np.random.normal(20, 2, size).round(),
    'gender': np.random.choice(['Male', 'Female', 'Non-binary'], size),
    'year_of_study': np.random.choice([1, 2, 3, 4, 5], size),
    
    # Academic factors
    'gpa': np.random.normal(3.0, 0.5, size).clip(0, 4.0),
    'course_load': np.random.normal(5, 1, size).round().clip(2, 7),
    'major': np.random.choice(['Engineering', 'Arts', 'Science', 'Business', 'Medicine'], size),
    
    # Behavioral data
    'sleep_hours': np.random.normal(7, 1.5, size).clip(3, 10),
    'exercise_hours_per_week': np.random.gamma(2, 1.5, size),
    'social_activity_hours': np.random.gamma(3, 2, size),
    
    # Psychological assessments (simulated)
    'depression_score': np.random.gamma(5, 1, size).round().clip(0, 27),  # PHQ-9 like (0-27)
    'anxiety_score': np.random.gamma(4, 1, size).round().clip(0, 21),    # GAD-7 like (0-21)
    'stress_score': np.random.gamma(6, 1, size).round().clip(0, 40),     # PSS like (0-40)
    
    # External factors
    'financial_stress': np.random.choice([0, 1, 2, 3, 4], size),  # 0=None, 4=Severe
    'housing_quality': np.random.choice([1, 2, 3, 4, 5], size),   # 1=Poor, 5=Excellent
    'support_network': np.random.choice([0, 1, 2, 3, 4], size),   # 0=None, 4=Strong
    
    # Target variable: mental health index (could be classification or regression)
    # For regression (continuous score)
    'mental_health_index': np.random.normal(50, 15, size).clip(0, 100)
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Create a categorical target for classification
# Up to 40: Poor mental health, 40-60: Average, above 60: Good
df['mental_health_category'] = pd.cut(
    df['mental_health_index'], 
    bins=[0, 40, 60, 100], 
    labels=['Poor', 'Average', 'Good'],
    include_lowest=True  # keep an index of exactly 0 in 'Poor' instead of producing NaN
)

# Display the first few rows
df.head()

In [None]:
# Check basic information about the dataset
print(f"Dataset shape: {df.shape}")
df.info()

# Summary statistics
df.describe()

In [None]:
# Visualize the distribution of the target variable (mental health index)
plt.figure(figsize=(12, 5))

# For regression target
plt.subplot(1, 2, 1)
sns.histplot(df['mental_health_index'], kde=True)
plt.title('Distribution of Mental Health Index')

# For classification target
plt.subplot(1, 2, 2)
sns.countplot(x='mental_health_category', data=df)
plt.title('Distribution of Mental Health Categories')

plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis of numerical features
numerical_features = df.select_dtypes(include=['int64', 'float64']).drop(columns=['mental_health_index'])

# Correlation with target
correlations = numerical_features.corrwith(df['mental_health_index']).sort_values(ascending=False)

# Plot correlations
plt.figure(figsize=(12, 8))
sns.barplot(x=correlations.index, y=correlations.values)
plt.title('Feature Correlation with Mental Health Index')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Correlation heatmap for numerical features
plt.figure(figsize=(14, 10))
corr_matrix = df.select_dtypes(include=['int64', 'float64']).corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()

## 3. Feature Engineering & Data Preprocessing

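For our multimodal approach, each type of data needs its own preprocessing: numerical features are median-imputed and standardized, while categorical features are imputed with the most frequent value and one-hot encoded.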
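
As a small illustration of per-type preprocessing, the sketch below runs a `ColumnTransformer` on a four-row toy frame. The values and the names `toy`, `toy_preprocessor`, and `toy_matrix` are made up for the example; it mirrors the idea of the notebook's pipeline rather than reproducing it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (made-up values) with one missing entry per column type
toy = pd.DataFrame({
    "sleep_hours": [6.0, 8.0, np.nan, 7.0],
    "major": ["Arts", "Science", "Arts", np.nan],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Numerical branch: fill gaps with the median, then standardize
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["sleep_hours"]),
    # Categorical branch: fill gaps with the most frequent value, then one-hot encode
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["major"]),
])

toy_matrix = toy_preprocessor.fit_transform(toy)
if hasattr(toy_matrix, "toarray"):  # the encoder may produce a sparse matrix
    toy_matrix = toy_matrix.toarray()

# One scaled numerical column followed by one column per observed category
print(toy_matrix.shape)
print(toy_matrix)
```

The missing `sleep_hours` value is replaced by the median before scaling, and the missing `major` is replaced by the most frequent category before encoding, so the output contains no missing values.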

In [None]:
# Define features and target
X = df.drop(columns=['mental_health_index', 'mental_health_category'])
y_reg = df['mental_health_index']  # For regression tasks
y_cls = df['mental_health_category']  # For classification tasks

# Split data into training and test sets
X_train, X_test, y_train_reg, y_test_reg, y_train_cls, y_test_cls = train_test_split(
    X, y_reg, y_cls, test_size=0.2, random_state=42
)

# Identify categorical and numerical features
# ('number' also catches int32 columns, which NumPy may produce on some platforms)
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include='number').columns

print(f"Categorical features: {list(categorical_features)}")
print(f"Numerical features: {list(numerical_features)}")

# Create preprocessing pipelines
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

print("Preprocessing pipeline created successfully.")

## 4. Building Machine Learning Models

We'll implement three different models as specified:
1. K-Nearest Neighbors (K-NN)
2. Linear Regression / Logistic Regression
3. Support Vector Machine (SVM)

We'll create versions for both regression (predicting mental health index) and classification (predicting mental health category).
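
The grid searches in this section tune hyperparameters of the final pipeline step. In scikit-learn, a parameter of a pipeline step is addressed as `<step name>__<parameter>`, and the `neg_mean_squared_error` scorer returns the negated MSE so that higher is always better. The sketch below shows both conventions on made-up toy data; `X_toy`, `y_toy`, `toy_pipeline`, and `toy_grid` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up regression data: the target depends mainly on the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=120)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor()),
])

toy_grid = GridSearchCV(
    toy_pipeline,
    # "model__n_neighbors" targets the n_neighbors parameter of the step named "model"
    param_grid={"model__n_neighbors": [3, 7]},
    cv=3,
    scoring="neg_mean_squared_error",
)
toy_grid.fit(X_toy, y_toy)

# best_score_ is the mean negated MSE across folds; flip the sign before taking the root
cv_rmse = float(np.sqrt(-toy_grid.best_score_))
print(toy_grid.best_params_, round(cv_rmse, 3))
```

Note that the square root of the mean cross-validated MSE is a convenient summary, but it is not identical to the mean of the per-fold RMSE values.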

In [None]:
# 1. K-NN Models

# For regression
knn_reg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', KNeighborsRegressor())
])

# For classification
knn_cls_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', KNeighborsClassifier())
])

# Parameter grid for KNN
knn_param_grid = {
    'model__n_neighbors': [3, 5, 7, 9, 11],
    'model__weights': ['uniform', 'distance'],
    'model__metric': ['euclidean', 'manhattan']
}

# Grid search for regression
knn_reg_grid = GridSearchCV(
    knn_reg_pipeline, 
    knn_param_grid, 
    cv=5, 
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Grid search for classification
knn_cls_grid = GridSearchCV(
    knn_cls_pipeline, 
    knn_param_grid, 
    cv=5, 
    scoring='accuracy',
    n_jobs=-1
)

# Train KNN models
print("Training K-NN Regression model...")
knn_reg_grid.fit(X_train, y_train_reg)
print(f"Best parameters: {knn_reg_grid.best_params_}")
print(f"Best RMSE: {(-knn_reg_grid.best_score_)**0.5:.4f}")

print("\nTraining K-NN Classification model...")
knn_cls_grid.fit(X_train, y_train_cls)
print(f"Best parameters: {knn_cls_grid.best_params_}")
print(f"Best accuracy: {knn_cls_grid.best_score_:.4f}")

In [None]:
# 2. Linear Models

# For regression
linear_reg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# For classification
logistic_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(max_iter=1000))
])

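# Parameter grid for logistic regression
# 'saga' supports both L1 and L2 penalties with a multinomial loss, so every
# candidate fits the same three-class model; 'liblinear' only handles
# multiclass targets through a one-vs-rest scheme
logistic_param_grid = {
    'model__C': [0.01, 0.1, 1.0, 10.0],
    'model__solver': ['saga'],
    'model__penalty': ['l1', 'l2']
}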

# Train Linear Regression model
print("Training Linear Regression model...")
linear_reg_pipeline.fit(X_train, y_train_reg)

# Evaluate Linear Regression
linear_reg_pred = linear_reg_pipeline.predict(X_test)
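# Take the square root explicitly; the squared=False argument is not available in newer scikit-learn releases
linear_reg_rmse = np.sqrt(mean_squared_error(y_test_reg, linear_reg_pred))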
linear_reg_r2 = r2_score(y_test_reg, linear_reg_pred)
print(f"Linear Regression RMSE: {linear_reg_rmse:.4f}")
print(f"Linear Regression R²: {linear_reg_r2:.4f}")

# Grid search for Logistic Regression
logistic_grid = GridSearchCV(
    logistic_pipeline,
    logistic_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Train Logistic Regression model
print("\nTraining Logistic Regression model...")
logistic_grid.fit(X_train, y_train_cls)
print(f"Best parameters: {logistic_grid.best_params_}")
print(f"Best accuracy: {logistic_grid.best_score_:.4f}")

In [None]:
# 3. Support Vector Machine Models

# For regression
svm_reg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVR())
])

# For classification
svm_cls_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVC(probability=True))
])

# Parameter grid for SVM
svm_param_grid = {
    'model__C': [0.1, 1, 10],
    'model__kernel': ['linear', 'rbf', 'poly'],
    'model__gamma': ['scale', 'auto', 0.1, 0.01]
}

# Grid search for SVM regression
svm_reg_grid = GridSearchCV(
    svm_reg_pipeline,
    svm_param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Grid search for SVM classification
svm_cls_grid = GridSearchCV(
    svm_cls_pipeline,
    svm_param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Train SVM models
print("Training SVM Regression model...")
svm_reg_grid.fit(X_train, y_train_reg)
print(f"Best parameters: {svm_reg_grid.best_params_}")
print(f"Best RMSE: {(-svm_reg_grid.best_score_)**0.5:.4f}")

print("\nTraining SVM Classification model...")
svm_cls_grid.fit(X_train, y_train_cls)
print(f"Best parameters: {svm_cls_grid.best_params_}")
print(f"Best accuracy: {svm_cls_grid.best_score_:.4f}")

## 5. Model Evaluation and Comparison

Let's evaluate all models on the test set and compare their performance.
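
The regression models are compared by RMSE, $\sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$, and by $R^2 = 1 - \tfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. As a reference for what those numbers mean, the sketch below computes both with plain NumPy on made-up values; the helper names `rmse` and `r_squared` are illustrative.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up mental health index values and predictions
y_true_demo = [50.0, 60.0, 40.0, 70.0]
y_pred_demo = [52.0, 58.0, 43.0, 65.0]

demo_rmse = rmse(y_true_demo, y_pred_demo)
demo_r2 = r_squared(y_true_demo, y_pred_demo)
print(f"RMSE: {demo_rmse:.4f}, R²: {demo_r2:.4f}")

# A model that always predicts the mean of y_true scores R² = 0
mean_baseline_r2 = r_squared(y_true_demo, [55.0] * 4)
```

An $R^2$ near zero means a model does no better than predicting the mean, which is the expected outcome on the synthetic data here because the target is generated independently of the features.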

In [None]:
# Regression model evaluation
def evaluate_regression_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
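    # Take the square root explicitly; the squared=False argument is not available in newer scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))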
    r2 = r2_score(y_test, y_pred)
    return {
        'Model': model_name,
        'RMSE': rmse,
        'R²': r2
    }

# Classification model evaluation
def evaluate_classification_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return {
        'Model': model_name,
        'Accuracy': accuracy,
        'Classification Report': classification_report(y_test, y_pred)
    }

# Evaluate regression models
reg_results = []
reg_results.append(evaluate_regression_model(knn_reg_grid.best_estimator_, X_test, y_test_reg, "K-NN"))
reg_results.append(evaluate_regression_model(linear_reg_pipeline, X_test, y_test_reg, "Linear Regression"))
reg_results.append(evaluate_regression_model(svm_reg_grid.best_estimator_, X_test, y_test_reg, "SVM"))

# Evaluate classification models
cls_results = []
cls_results.append(evaluate_classification_model(knn_cls_grid.best_estimator_, X_test, y_test_cls, "K-NN"))
cls_results.append(evaluate_classification_model(logistic_grid.best_estimator_, X_test, y_test_cls, "Logistic Regression"))
cls_results.append(evaluate_classification_model(svm_cls_grid.best_estimator_, X_test, y_test_cls, "SVM"))

# Display results in a DataFrame
reg_results_df = pd.DataFrame(reg_results).set_index('Model')
print("Regression Models Performance:")
print(reg_results_df)

# Display classification accuracy
cls_accuracy_df = pd.DataFrame([{r['Model']: r['Accuracy'] for r in cls_results}]).T
cls_accuracy_df.columns = ['Accuracy']
print("\nClassification Models Accuracy:")
print(cls_accuracy_df)

# Display detailed classification reports
for result in cls_results:
    print(f"\n{result['Model']} Classification Report:")
    print(result['Classification Report'])

In [None]:
# Visualize regression model performance
plt.figure(figsize=(10, 6))
sns.barplot(x=reg_results_df.index, y=reg_results_df['RMSE'])
plt.title('Regression Models: RMSE Comparison (Lower is Better)')
plt.ylabel('RMSE')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Visualize classification model performance
plt.figure(figsize=(10, 6))
sns.barplot(x=cls_accuracy_df.index, y=cls_accuracy_df['Accuracy'])
plt.title('Classification Models: Accuracy Comparison (Higher is Better)')
plt.ylabel('Accuracy')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

In [None]:
# For Linear Regression, we can analyze coefficients
# First, get feature names after one-hot encoding
cat_feature_names = list(linear_reg_pipeline.named_steps['preprocessor']
                         .named_transformers_['cat']
                         .named_steps['onehot']
                         .get_feature_names_out(categorical_features))

transformed_feature_names = list(numerical_features) + list(cat_feature_names)

# Get coefficients from Linear Regression
try:
    coefficients = linear_reg_pipeline.named_steps['model'].coef_
    coef_df = pd.DataFrame({'Feature': transformed_feature_names, 'Coefficient': coefficients})
    coef_df = coef_df.sort_values('Coefficient', key=abs, ascending=False)
    
    # Plot top 15 features by importance
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Coefficient', y='Feature', data=coef_df.head(15))
    plt.title('Top 15 Features by Importance (Linear Regression)')
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()
except Exception as e:
    print(f"Couldn't extract coefficients: {str(e)}")

## 6. Making Predictions on New Data

Let's create a function to predict mental health outcomes for new students:

In [None]:
def predict_mental_health(student_data, reg_model=None, cls_model=None):
    """Predict mental health outcomes for a new student.
    
    Args:
        student_data (dict): Dictionary containing student features
        reg_model: Trained regression model
        cls_model: Trained classification model
        
    Returns:
        dict: Predictions from both models
    """
    # Convert input to DataFrame
    student_df = pd.DataFrame([student_data])
    
    results = {}
    
    # Get regression prediction if model provided
    if reg_model is not None:
        mh_index = reg_model.predict(student_df)[0]
        results['mental_health_index'] = mh_index
    
    # Get classification prediction if model provided
    if cls_model is not None:
        mh_category = cls_model.predict(student_df)[0]
        category_probs = cls_model.predict_proba(student_df)[0]
        results['mental_health_category'] = mh_category
        results['category_probabilities'] = {cls_model.classes_[i]: category_probs[i] for i in range(len(cls_model.classes_))}
    
    return results

# Example usage
# Choose between the tuned SVM and K-NN models by cross-validated score
# (for regression, best_score_ is the negative MSE, so higher is still better)
best_reg_model = svm_reg_grid.best_estimator_ if svm_reg_grid.best_score_ > knn_reg_grid.best_score_ else knn_reg_grid.best_estimator_
best_cls_model = svm_cls_grid.best_estimator_ if svm_cls_grid.best_score_ > knn_cls_grid.best_score_ else knn_cls_grid.best_estimator_

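# Example student data
# Every feature column used during training must be present, including the assessment scores
sample_student = {
    'age': 21,
    'gender': 'Female',
    'year_of_study': 3,
    'gpa': 3.7,
    'course_load': 5,
    'major': 'Engineering',
    'sleep_hours': 6.5,
    'exercise_hours_per_week': 2.5,
    'social_activity_hours': 8.0,
    'depression_score': 5,
    'anxiety_score': 4,
    'stress_score': 6,
    'financial_stress': 2,
    'housing_quality': 4,
    'support_network': 3
}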

# Make prediction
prediction = predict_mental_health(sample_student, best_reg_model, best_cls_model)
print("Mental Health Prediction for Sample Student:")
print(f"Mental Health Index: {prediction.get('mental_health_index', 'N/A'):.2f}")
print(f"Mental Health Category: {prediction.get('mental_health_category', 'N/A')}")

if 'category_probabilities' in prediction:
    print("\nProbability for each category:")
    for category, prob in prediction['category_probabilities'].items():
        print(f"{category}: {prob:.2f}")

## 7. Conclusion and Next Steps

In this notebook, we've implemented a multimodal machine learning approach to predict mental health outcomes for university students using three different algorithms: K-NN, Linear Regression/Logistic Regression, and SVM.

### Key Accomplishments:
- Created a data processing pipeline that handles both numerical and categorical data
- Implemented and compared multiple machine learning models
- Created a prediction function for new students

### Next Steps:
1. **Collect real data**: Replace the simulated data with real student data
2. **Feature engineering**: Develop more sophisticated features
3. **Advanced models**: Explore ensemble methods or deep learning approaches
4. **User interface**: Develop a simple interface for university counselors
5. **Ethical considerations**: Ensure privacy and proper use of predictions
6. **Validation**: Validate the model with domain experts

### Technologies Used:
- **Data Processing**: NumPy, Pandas
- **Machine Learning**: Scikit-learn
- **Visualization**: Matplotlib, Seaborn, Plotly
- **Statistical Analysis**: SciPy

## 8. Advanced Modeling Techniques

Let's explore some advanced modeling techniques that could potentially improve our mental health prediction model:

In [None]:
# Import ensemble methods
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor, GradientBoostingClassifier

# Create Ensemble Models
print("Training Ensemble Models...")

# Random Forest models
rf_reg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

rf_cls = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train Random Forest models
rf_reg.fit(X_train, y_train_reg)
rf_cls.fit(X_train, y_train_cls)

# Evaluate Random Forest models
rf_reg_results = evaluate_regression_model(rf_reg, X_test, y_test_reg, "Random Forest")
rf_cls_results = evaluate_classification_model(rf_cls, X_test, y_test_cls, "Random Forest")

# Print results
print("\nRandom Forest Regression Performance:")
print(f"RMSE: {rf_reg_results['RMSE']:.4f}, R²: {rf_reg_results['R²']:.4f}")

print("\nRandom Forest Classification Performance:")
print(f"Accuracy: {rf_cls_results['Accuracy']:.4f}")
print(rf_cls_results['Classification Report'])

# Compare with previous models
all_reg_results = pd.DataFrame(reg_results + [rf_reg_results]).set_index('Model')
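# Merge all accuracies into one mapping so the frame has a single 'Accuracy' column
all_accuracies = {r['Model']: r['Accuracy'] for r in cls_results}
all_accuracies['Random Forest'] = rf_cls_results['Accuracy']
all_cls_results = pd.Series(all_accuracies, name='Accuracy').to_frame()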

# Plot updated comparison
plt.figure(figsize=(12, 6))
sns.barplot(x=all_reg_results.index, y=all_reg_results['RMSE'])
plt.title('All Regression Models: RMSE Comparison (Lower is Better)')
plt.ylabel('RMSE')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Plot classification results
plt.figure(figsize=(12, 6))
sns.barplot(x=all_cls_results.index, y=all_cls_results['Accuracy'])
plt.title('All Classification Models: Accuracy Comparison (Higher is Better)')
plt.ylabel('Accuracy')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

## 9. Model Explainability with SHAP

Let's use SHAP (SHapley Additive exPlanations) to better understand our model predictions. This helps in making our models more interpretable and transparent.
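
To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a tiny hand-written model by enumerating every feature coalition. This is an illustration of the underlying definition, not of how `shap.TreeExplainer` works internally; the toy model, its coefficients, and the instance and baseline values are all made up, and "absent" features are simply set to a single baseline value.

```python
from itertools import combinations
from math import factorial


def exact_shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (feasible only for a few features)."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for coalition in combinations(others, size):
                without_i = value_fn(frozenset(coalition))
                with_i = value_fn(frozenset(coalition) | {i})
                phi[i] += weight * (with_i - without_i)
    return phi


# Hypothetical toy model with one interaction term (not fitted to any data)
def toy_model(sleep_hours, stress_score, support_network):
    return 50.0 + 2.0 * sleep_hours - 1.5 * stress_score + 0.5 * sleep_hours * support_network


instance = (8.0, 10.0, 4.0)   # student being explained (made-up values)
baseline = (6.0, 6.0, 2.0)    # reference student (made-up values)


def coalition_value(coalition):
    # Features in the coalition take the instance value, the rest keep the baseline value
    inputs = [instance[j] if j in coalition else baseline[j] for j in range(3)]
    return toy_model(*inputs)


shapley = exact_shapley_values(coalition_value, 3)
prediction_gap = toy_model(*instance) - toy_model(*baseline)
print(shapley, prediction_gap)
```

The attributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that SHAP plots rely on.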

In [None]:
# Install SHAP if not already installed
# !pip install shap

import shap
import matplotlib.pyplot as plt

# Explain the Random Forest classifier, since tree ensembles are supported by SHAP's TreeExplainer
best_classifier = rf_cls

try:
    # Reuse the preprocessor fitted inside the Random Forest pipeline so the data
    # matches the feature space the forest was trained on (scaled numerical
    # columns plus one-hot encoded categorical columns)
    fitted_preprocessor = best_classifier.named_steps['preprocessor']
    forest_model = best_classifier.named_steps['model']

    X_train_processed = fitted_preprocessor.transform(X_train)
    if hasattr(X_train_processed, 'toarray'):  # densify if the output is sparse
        X_train_processed = X_train_processed.toarray()

    # Feature names after preprocessing, e.g. 'num__age' or 'cat__gender_Female'
    feature_names = list(fitted_preprocessor.get_feature_names_out())

    # Impurity-based feature importance from the forest
    feature_importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': forest_model.feature_importances_
    }).sort_values('Importance', ascending=False)

    # Plot feature importances
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
    plt.title('Feature Importance from Random Forest')
    plt.tight_layout()
    plt.show()

    # Initialize SHAP explainer using the RandomForest model
    explainer = shap.TreeExplainer(forest_model)

    # Calculate SHAP values - this can be computationally intensive,
    # so use a small subset for demonstration
    sample_size = min(100, X_train_processed.shape[0])
    X_sample = X_train_processed[:sample_size]
    shap_values = explainer.shap_values(X_sample)

    # Depending on the SHAP version, multiclass output is either a list of
    # per-class arrays or one (samples, features, classes) array
    class_idx = 0
    if isinstance(shap_values, list):
        class_shap = shap_values[class_idx]
    else:
        class_shap = shap_values[:, :, class_idx]
    class_name = forest_model.classes_[class_idx]

    # Plot SHAP summary for the selected class
    shap.summary_plot(class_shap, X_sample, feature_names=feature_names, show=False)
    plt.title(f'SHAP Summary Plot (class: {class_name})')
    plt.show()

    # Waterfall plot for the first sample; the plot expects an Explanation object
    explanation = shap.Explanation(
        values=class_shap[0],
        base_values=np.ravel(explainer.expected_value)[class_idx],
        data=X_sample[0],
        feature_names=feature_names
    )
    shap.plots.waterfall(explanation, show=False)
    plt.title(f'SHAP Waterfall Plot for a Sample Prediction (class: {class_name})')
    plt.show()

except Exception as e:
    print(f"Error in SHAP analysis: {str(e)}")

## 10. Acquiring Real-World Mental Health Datasets

The current model uses synthetic data generated for demonstration. For a production-ready system, you'll need real student mental health data. Here are some options:

In [None]:
# Here are some potential sources of mental health datasets and approaches for data collection

# 1. Public Mental Health Datasets
'''
Some available public datasets that could be used or adapted:

1. Student Mental Health Dataset (Kaggle): https://www.kaggle.com/datasets/shariful07/student-mental-health
   This dataset contains survey responses from university students about their mental health conditions.

2. Depression, Anxiety and Stress Scale Responses (Kaggle): https://www.kaggle.com/datasets/lucasgreenwell/depression-anxiety-stress-scales-responses
   Contains responses to the DASS (Depression, Anxiety and Stress Scale).

3. Mental Health in Tech Survey: https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey
   While focused on tech employees, the survey structure could be adapted for students.

4. University Student Mental Health Survey with Demographics: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TZLULU
'''

# 2. Creating your own data collection protocol
def create_mental_health_assessment_protocol():
    '''
    To create your own data collection protocol:
    
    1. Use standardized assessment tools:
       - PHQ-9 for depression screening
       - GAD-7 for anxiety screening
       - PSS (Perceived Stress Scale) for stress levels
       - AUDIT for alcohol use
       - PSQI for sleep quality
       
    2. Collect demographic and academic information:
       - Age, gender, year of study
       - Major/program
       - GPA and course load
       - Living situation
       - Financial status/concerns
       
    3. Additional data sources:
       - Academic performance data (with permissions)
       - Campus resource usage (counseling services)
       - Wearable device data (sleep patterns, activity levels)
       
    4. Ensure ethical considerations:
       - IRB/Ethics board approval
       - Informed consent
       - Data anonymization
       - Secure data storage
    '''
    
    print("Protocol for mental health data collection created.")
    
    return

# 3. Sample code for loading a real dataset (example with Kaggle dataset)
'''
To use the Student Mental Health dataset from Kaggle:

1. Download the dataset from Kaggle: https://www.kaggle.com/datasets/shariful07/student-mental-health
2. Place the CSV file in your project directory
3. Run the code below to load and preprocess it
'''

# Uncomment and modify this code to load a real dataset

'''
def load_real_student_mental_health_data(file_path="student_mental_health.csv"):
    # Load the dataset
    df = pd.read_csv(file_path)
    
    # Display basic information
    print(f"Dataset shape: {df.shape}")
    print("\nColumn names:")
    print(df.columns.tolist())
    
    # Perform necessary preprocessing
    # (This will depend on the specific dataset structure)
    
    # Handle missing values
    df = df.dropna()  # Or use imputation techniques
    
    # Encode categorical variables as needed
    # Example: df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
    
    # Create target variables
    # Example: df['mental_health_category'] = pd.cut(df['depression_score'], bins=[0, 5, 10, 27], labels=['Mild', 'Moderate', 'Severe'])
    
    return df

# Load the real dataset
# real_df = load_real_student_mental_health_data()
# Display the first few rows
# real_df.head()
'''

# 4. Data collection considerations
'''
When collecting mental health data from students:

1. Privacy and Security:
   - Ensure HIPAA/FERPA compliance
   - Use secure storage and transmission
   - Implement proper access controls

2. Ethical Considerations:
   - Obtain informed consent
   - Provide resources for students in distress
   - Have protocols for high-risk cases
   - Allow students to opt out or withdraw

3. Data Quality:
   - Use validated instruments
   - Ensure representative sampling
   - Consider longitudinal data collection
   - Combine self-report with objective measures when possible

4. Institutional Collaboration:
   - Work with university health services
   - Partner with psychology/psychiatry departments
   - Engage student organizations
'''

print("To proceed with this project, you will need to acquire or collect real mental health data, following appropriate ethical and privacy guidelines.")

## 11. Creating a Simple Web Application for University Counselors

To deploy this model for practical use by university counseling services, we can create a simple web application using Streamlit. This interface will allow counselors to input student data and receive mental health predictions.
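
The app sketched in this section loads pickled models from a `models/` directory, which the notebook does not create on its own. A minimal sketch for saving a fitted pipeline is shown here; `save_model` is an illustrative helper, the stand-in pipeline is trained on made-up data, and the file name follows the one the app expects.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def save_model(model, directory, filename):
    """Pickle a fitted model into the given directory and return the file path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path


# Stand-in for a trained pipeline; in the notebook this would be best_reg_model
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = X_demo @ np.array([1.0, -2.0])
stand_in = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

# Round trip through a temporary directory to confirm the saved model predicts identically
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(stand_in, Path(tmp) / "models", "best_regression_model.pkl")
    with open(saved_path, "rb") as f:
        restored = pickle.load(f)
    predictions_match = bool(np.allclose(stand_in.predict(X_demo), restored.predict(X_demo)))

print(predictions_match)
```

In the notebook, the equivalent calls would be `save_model(best_reg_model, "models", "best_regression_model.pkl")` and `save_model(best_cls_model, "models", "best_classification_model.pkl")`. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.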

In [None]:
# First, we need to install Streamlit
# !pip install streamlit

'''
Sample code for a Streamlit application to deploy the mental health prediction model:

```python
# Save this as app.py
import streamlit as st
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

# Load the trained models (after saving them from this notebook)
def load_models(model_path="models/"):
    with open(model_path + "best_regression_model.pkl", "rb") as f:
        reg_model = pickle.load(f)
    with open(model_path + "best_classification_model.pkl", "rb") as f:
        cls_model = pickle.load(f)
    return reg_model, cls_model

# Function to make predictions (similar to our predict_mental_health function)
def predict_mental_health(student_data, reg_model, cls_model):
    # Convert input to DataFrame
    student_df = pd.DataFrame([student_data])
    
    results = {}
    # Get regression prediction
    mh_index = reg_model.predict(student_df)[0]
    results["mental_health_index"] = mh_index
    
    # Get classification prediction
    mh_category = cls_model.predict(student_df)[0]
    category_probs = cls_model.predict_proba(student_df)[0]
    results["mental_health_category"] = mh_category
    results["category_probabilities"] = {cls_model.classes_[i]: category_probs[i] for i in range(len(cls_model.classes_))}
    
    return results

# Main function for the Streamlit app
def main():
    st.title("Student Mental Health Prediction Tool")
    st.write("This tool helps university counselors assess potential mental health concerns for students.")
    
    # Create sidebar for inputs
    st.sidebar.header("Student Information")
    
    # Demographics
    st.sidebar.subheader("Demographics")
    age = st.sidebar.slider("Age", 17, 30, 20)
    gender = st.sidebar.selectbox("Gender", options=["Male", "Female", "Non-binary"])
    year = st.sidebar.selectbox("Year of Study", options=[1, 2, 3, 4, 5])
    
    # Academic factors
    st.sidebar.subheader("Academic Factors")
    gpa = st.sidebar.slider("GPA", 0.0, 4.0, 3.0, 0.1)
    course_load = st.sidebar.slider("Course Load (# of courses)", 1, 8, 5)
    major = st.sidebar.selectbox("Major", ["Engineering", "Arts", "Science", "Business", "Medicine"])
    
    # Behavioral data
    st.sidebar.subheader("Behavioral Factors")
    sleep = st.sidebar.slider("Sleep Hours (daily average)", 3.0, 10.0, 7.0, 0.5)
    exercise = st.sidebar.slider("Exercise Hours (weekly)", 0.0, 20.0, 3.0, 0.5)
    social = st.sidebar.slider("Social Activity Hours (weekly)", 0.0, 30.0, 10.0, 1.0)
    
    # External factors
    st.sidebar.subheader("External Factors")
    financial_stress = st.sidebar.slider("Financial Stress Level", 0, 4, 2, 
                                    help="0=None, 4=Severe")
    housing = st.sidebar.slider("Housing Quality", 1, 5, 3, 
                            help="1=Poor, 5=Excellent")
    support = st.sidebar.slider("Support Network Strength", 0, 4, 2, 
                            help="0=None, 4=Strong")
    
    # Psychological assessments
    st.sidebar.subheader("Psychological Assessments")
    depression = st.sidebar.slider("Depression Score (PHQ-9)", 0, 27, 5, 
                               help="0-4: Minimal, 5-9: Mild, 10-14: Moderate, 15-19: Moderately Severe, 20-27: Severe")
    anxiety = st.sidebar.slider("Anxiety Score (GAD-7)", 0, 21, 5, 
                            help="0-4: Minimal, 5-9: Mild, 10-14: Moderate, 15-21: Severe")
    stress = st.sidebar.slider("Stress Score (PSS)", 0, 40, 15, 
                           help="0-13: Low, 14-26: Moderate, 27-40: High")
    
    # Create a dictionary with all student data
    student_data = {
        "age": age,
        "gender": gender,
        "year_of_study": year,
        "gpa": gpa,
        "course_load": course_load,
        "major": major,
        "sleep_hours": sleep,
        "exercise_hours_per_week": exercise,
        "social_activity_hours": social,
        "financial_stress": financial_stress,
        "housing_quality": housing,
        "support_network": support,
        "depression_score": depression,
        "anxiety_score": anxiety,
        "stress_score": stress
    }
    
    # Button to make prediction
    if st.sidebar.button("Generate Prediction"):
        # Load models (in a real app, you'd do this once at startup)
        try:
            reg_model, cls_model = load_models()
            
            # Get prediction
            prediction = predict_mental_health(student_data, reg_model, cls_model)
            
            # Display results
            st.header("Mental Health Assessment Results")
            
            # Display mental health index
            mh_index = prediction["mental_health_index"]
            st.subheader(f"Mental Health Index: {mh_index:.1f}/100")
            
            # Create a gauge chart for the mental health index
            fig, ax = plt.subplots(figsize=(10, 2))
            ax.barh([0], [100], color="lightgray", height=0.5)
            ax.barh([0], [mh_index], color=plt.cm.RdYlGn(mh_index/100), height=0.5)
            ax.set_xlim(0, 100)
            ax.set_yticks([])
            ax.set_xticks([0, 25, 50, 75, 100])
            ax.set_xticklabels(["0\nPoor", "25", "50\nAverage", "75", "100\nExcellent"])
            st.pyplot(fig)
            plt.close(fig)  # Release the figure so repeated predictions don't accumulate open figures
            
            # Display mental health category
            st.subheader(f"Mental Health Category: {prediction['mental_health_category']}")
            
            # Display category probabilities
            st.write("Probability Breakdown:")
            probs = prediction["category_probabilities"]
            for category, prob in probs.items():
                st.write(f"- {category}: {prob:.1%}")
                
            # Recommendation section
            st.subheader("Recommendations")
            if prediction["mental_health_category"] == "Poor":
                st.error("⚠️ This student shows signs of significant mental health concerns. Consider immediate follow-up and referral to psychological services.")
            elif prediction["mental_health_category"] == "Average":
                st.warning("This student shows moderate risk. Regular check-ins and providing resources for support would be beneficial.")
            else:
                st.success("This student appears to be maintaining good mental health. Continue to provide preventive resources.")
                
            # Risk factors analysis
            st.subheader("Key Risk Factors")
            risk_factors = []
            if sleep < 6:
                risk_factors.append("Insufficient sleep (< 6 hours)")
            if exercise < 1:
                risk_factors.append("Limited physical activity (< 1 hour weekly)")
            if financial_stress > 2:
                risk_factors.append("High financial stress")
            if support < 2:
                risk_factors.append("Limited support network")
            if depression > 9:
                risk_factors.append("Elevated depression score")
            if anxiety > 9:
                risk_factors.append("Elevated anxiety score")
            
            if risk_factors:
                for factor in risk_factors:
                    st.write(f"• {factor}")
            else:
                st.write("No major risk factors identified.")
                
        except Exception as e:
            st.error(f"Error making prediction: {str(e)}")
            st.info("Note: This is a demo. In production, you would need to train and save the models first.")
    
    # Add information about the model
    st.sidebar.info(
        "This tool uses machine learning to predict student mental health outcomes. "
        "It is intended as a screening tool to help identify students who may benefit from support services. "
        "It should not be used as a diagnostic tool."
    )
    
    # Add privacy notice
    st.sidebar.warning(
        "PRIVACY NOTICE: Ensure all data is handled in compliance with your institution's privacy policies "
        "and relevant regulations (FERPA, HIPAA, etc.)."
    )

# Run the app
if __name__ == "__main__":
    main()
```

# To run the Streamlit app, save the code to a file named app.py and run:
# streamlit run app.py
'''

# Code to save the trained models
def save_models(reg_model, cls_model, model_path="models/"):
    """Save the trained models to disk for later use in the web application."""
    import os
    import pickle
    
    # Create directory if it doesn't exist
    os.makedirs(model_path, exist_ok=True)
    
    # Save regression model (os.path.join works with or without a trailing slash)
    with open(os.path.join(model_path, "best_regression_model.pkl"), "wb") as f:
        pickle.dump(reg_model, f)
    
    # Save classification model
    with open(os.path.join(model_path, "best_classification_model.pkl"), "wb") as f:
        pickle.dump(cls_model, f)
    
    print(f"Models saved successfully to {model_path}")

# Uncomment this to save your best models
# save_models(best_reg_model, best_cls_model)
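
As a quick sanity check on the saving step, the sketch below pickles a toy scikit-learn model to a temporary directory, reloads it, and confirms the reloaded copy gives the same predictions. The helper name `load_pickled_model` and the toy data are assumptions for illustration; the web app's own `load_models()` may be implemented differently.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression


def load_pickled_model(path):
    """Hypothetical helper: load a single pickled model from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-in for a trained model (not the notebook's real model)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 3.0
toy_model = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = os.path.join(tmp_dir, "best_regression_model.pkl")
    with open(model_file, "wb") as f:
        pickle.dump(toy_model, f)
    reloaded_model = load_pickled_model(model_file)

# The reloaded model should reproduce the original predictions
round_trip_ok = np.allclose(toy_model.predict(X_toy), reloaded_model.predict(X_toy))
print(f"Round trip preserved predictions: {round_trip_ok}")
```

Pickle files should only be loaded from trusted locations, because unpickling can execute arbitrary code.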

## 12. Deep Learning Approach

Traditional machine learning models like K-NN, Linear Regression, and SVM have provided good results, but deep learning approaches might capture more complex patterns in the data, especially with larger datasets. Let's implement a simple neural network model for our mental health prediction task.

In [None]:
# To implement deep learning, we'll need TensorFlow/Keras
# Uncomment to install if needed
# !pip install tensorflow

'''
Example neural network implementation for mental health prediction:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np

# 1. Prepare data for deep learning
def prepare_data_for_nn(X_train, X_test, y_train, y_test):
    # We need to preprocess the data differently for neural networks
    categorical_features = X_train.select_dtypes(include=["object", "category"]).columns
    numerical_features = X_train.select_dtypes(include=["int64", "float64"]).columns
    
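    # Create preprocessing pipeline
    # sparse_threshold=0 makes the output always a dense array, which avoids
    # problems when the one-hot encoded data is passed to Keras
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numerical_features),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
        ],
        sparse_threshold=0)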
    
    # Fit and transform the training data
    X_train_nn = preprocessor.fit_transform(X_train)
    X_test_nn = preprocessor.transform(X_test)
    
    return X_train_nn, X_test_nn, preprocessor

# 2. Create and train a neural network for regression
def build_regression_nn(input_dim):
    model = keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(input_dim,)),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Dense(16, activation="relu"),
        layers.Dense(1)  # No activation for regression output
    ])
    
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="mean_squared_error"
    )
    
    return model

# 3. Create and train a neural network for classification
def build_classification_nn(input_dim, num_classes):
    model = keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(input_dim,)),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.2),
        layers.Dense(16, activation="relu"),
        layers.Dense(num_classes, activation="softmax")
    ])
    
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    
    return model

# Prepare data for neural network
X_train_nn, X_test_nn, preprocessor = prepare_data_for_nn(X_train, X_test, y_train_reg, y_test_reg)

# Get dimensions
input_dim = X_train_nn.shape[1]
num_classes = len(np.unique(y_train_cls))

# Create regression model
nn_reg_model = build_regression_nn(input_dim)

# Create classification model
nn_cls_model = build_classification_nn(input_dim, num_classes)

# Train regression model
print("Training Neural Network Regression Model...")
history_reg = nn_reg_model.fit(
    X_train_nn, y_train_reg,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
    verbose=0
)

# Train classification model
print("Training Neural Network Classification Model...")
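# Convert categorical labels to numbers (0, 1, 2, ...), using the training
# categories for both splits so each class gets the same code in train and test
cls_categories = pd.Categorical(y_train_cls).categories
y_train_cls_numeric = pd.Categorical(y_train_cls, categories=cls_categories).codes
y_test_cls_numeric = pd.Categorical(y_test_cls, categories=cls_categories).codes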

history_cls = nn_cls_model.fit(
    X_train_nn, y_train_cls_numeric,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
    verbose=0
)

# Evaluate regression model
y_pred_reg_nn = nn_reg_model.predict(X_test_nn)
rmse_nn = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg_nn))
r2_nn = r2_score(y_test_reg, y_pred_reg_nn)

print(f"Neural Network Regression Model - RMSE: {rmse_nn:.4f}, R²: {r2_nn:.4f}")

# Evaluate classification model
y_pred_cls_nn = nn_cls_model.predict(X_test_nn)
y_pred_cls_nn_classes = np.argmax(y_pred_cls_nn, axis=1)
accuracy_nn = accuracy_score(y_test_cls_numeric, y_pred_cls_nn_classes)

print(f"Neural Network Classification Model - Accuracy: {accuracy_nn:.4f}")

# Plot training history for regression model
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(history_reg.history["loss"], label="Training Loss")
plt.plot(history_reg.history["val_loss"], label="Validation Loss")
plt.title("Neural Network Regression Training")
plt.xlabel("Epoch")
plt.ylabel("Loss (MSE)")
plt.legend()

# Plot training history for classification model
plt.subplot(1, 2, 2)
plt.plot(history_cls.history["accuracy"], label="Training Accuracy")
plt.plot(history_cls.history["val_accuracy"], label="Validation Accuracy")
plt.title("Neural Network Classification Training")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.tight_layout()
plt.show()

# Add neural network results to our comparison
nn_reg_results = {
    "Model": "Neural Network",
    "RMSE": rmse_nn,
    "R²": r2_nn
}

nn_cls_results = {
    "Model": "Neural Network",
    "Accuracy": accuracy_nn
}

# Update our results dataframes
all_reg_results_with_nn = pd.DataFrame(reg_results + [rf_reg_results, nn_reg_results]).set_index("Model")
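# Build one record per model so the result has a single "Accuracy" column
all_cls_records = [{"Model": r["Model"], "Accuracy": r["Accuracy"]} for r in cls_results]
all_cls_records.append({"Model": "Random Forest", "Accuracy": rf_cls_results["Accuracy"]})
all_cls_records.append(nn_cls_results)
all_cls_results_with_nn = pd.DataFrame(all_cls_records).set_index("Model")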

# Plot final comparison with all models
plt.figure(figsize=(12, 6))
sns.barplot(x=all_reg_results_with_nn.index, y=all_reg_results_with_nn["RMSE"])
plt.title("All Regression Models: RMSE Comparison (Lower is Better)")
plt.ylabel("RMSE")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

plt.figure(figsize=(12, 6))
sns.barplot(x=all_cls_results_with_nn.index, y=all_cls_results_with_nn["Accuracy"])
plt.title("All Classification Models: Accuracy Comparison (Higher is Better)")
plt.ylabel("Accuracy")
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.ylim(0, 1)
plt.tight_layout()
plt.show()
'''

print("Deep learning models require TensorFlow/Keras. To run this section, uncomment the installation command and remove the triple quotes around the code above.")
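
One detail in the classification steps above is worth isolating: labels must map to the same integer codes in the training and test sets, and predicted class indices must map back to the original category names. The sketch below shows this with pandas and NumPy alone. The label values and the probability matrix are made-up examples, not outputs of the notebook's models.

```python
import numpy as np
import pandas as pd

# Made-up labels; the test split happens to contain no "Average" samples
y_train_labels = pd.Series(["Good", "Poor", "Average", "Good", "Poor"])
y_test_labels = pd.Series(["Good", "Poor", "Good"])

# Encoding the test split on its own assigns codes from its own categories
naive_test_codes = pd.Categorical(y_test_labels).codes

# Fix the category order from the training labels and reuse it for both splits
categories = pd.Categorical(y_train_labels).categories
train_codes = pd.Categorical(y_train_labels, categories=categories).codes
test_codes = pd.Categorical(y_test_labels, categories=categories).codes

# Made-up softmax outputs: one row per test sample, one column per category
probabilities = np.array([
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
])
predicted_codes = np.argmax(probabilities, axis=1)
# Map integer predictions back to the original category names
predicted_labels = categories[predicted_codes]

print("Code -> label:", dict(enumerate(categories)))
print("Naive test codes: ", list(naive_test_codes))
print("Shared test codes:", list(test_codes))
print("Predicted labels: ", list(predicted_labels))
```

With a shared category list, any test label that never appeared in training is encoded as -1, which makes such cases easy to detect before evaluation.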

## 13. Ethics and Responsible AI for Mental Health Applications

Developing AI systems for mental health prediction introduces important ethical considerations that must be addressed:

In [None]:
'''
# Ethical Considerations for Mental Health AI

1. **Privacy and Confidentiality**
   - Mental health data is highly sensitive and requires strict privacy protections
   - All data should be anonymized, encrypted, and securely stored
   - Access should be limited to authorized personnel only

2. **Informed Consent**
   - Students must fully understand how their data will be used
   - Clear explanation of the AI system's purpose, limitations, and risks
   - Right to withdraw consent and have data deleted

3. **Bias and Fairness**
   - Ensure the model doesn't discriminate based on race, gender, socioeconomic status, etc.
   - Regularly audit predictions for demographic disparities
   - Use diverse training data to minimize inherent biases

4. **Transparency and Explainability**
   - Stakeholders should understand how the system makes predictions
   - Use explainable AI techniques (like SHAP values shown earlier)
   - Document model limitations and uncertainty

5. **Human Oversight and Intervention**
   - AI should supplement, not replace, professional judgment
   - Clear protocols for when human review is necessary
   - Mental health professionals must remain the decision-makers

6. **Validation and Accuracy**
   - Rigorous clinical validation before deployment
   - Regular monitoring of model performance
   - Clear communication of confidence levels in predictions

7. **Supporting Positive Interventions**
   - Predictions should lead to helpful interventions, not stigmatization
   - Focus on early support and preventive measures
   - Provide resources alongside predictions

8. **Regulatory Compliance**
   - Adhere to relevant regulations (HIPAA, FERPA, GDPR, etc.)
   - Follow institutional policies for student data
   - Regular ethics reviews and updates
'''

# Example code for fairness analysis

def analyze_model_fairness(model, X_test, y_test, sensitive_attribute='gender'):
    """Analyze whether the model shows bias across a sensitive attribute."""
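    # Predict once; per-group predictions are sliced from this array below
    y_pred = np.asarray(model.predict(X_test))
    y_true = np.asarray(y_test)
    
    # Choose metrics from the model type rather than the label dtype, so
    # string-valued (object dtype) class labels are still scored as classification
    from sklearn.base import is_classifier
    classification = is_classifier(model)
    
    # Group performance by the sensitive attribute
    results = []
    for group in X_test[sensitive_attribute].unique():
        mask = (X_test[sensitive_attribute] == group).to_numpy()
        group_X = X_test[mask]
        group_y_true = y_true[mask]
        group_y_pred = y_pred[mask]
        
        if classification: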
            # For classification
            accuracy = accuracy_score(group_y_true, group_y_pred)
            results.append({
                sensitive_attribute: group,
                'accuracy': accuracy,
                'count': len(group_X)
            })
        else:
            # For regression
            rmse = np.sqrt(mean_squared_error(group_y_true, group_y_pred))
            r2 = r2_score(group_y_true, group_y_pred)
            results.append({
                sensitive_attribute: group,
                'RMSE': rmse,
                'R²': r2,
                'count': len(group_X)
            })
    
    return pd.DataFrame(results)

# Example usage (uncomment to use)
'''
# For classification model
fairness_cls = analyze_model_fairness(best_cls_model, X_test, y_test_cls)
print("Classification model fairness analysis:")
print(fairness_cls)

# For regression model
fairness_reg = analyze_model_fairness(best_reg_model, X_test, y_test_reg)
print("\nRegression model fairness analysis:")
print(fairness_reg)

# Visualize fairness comparison
sensitive_attribute = 'gender'  # Same column analyze_model_fairness uses by default
plt.figure(figsize=(10, 6))
if 'accuracy' in fairness_cls.columns:
    sns.barplot(x=sensitive_attribute, y='accuracy', data=fairness_cls)
    plt.title(f'Model Accuracy Across {sensitive_attribute.capitalize()} Groups')
    plt.ylabel('Accuracy')
else:
    sns.barplot(x=sensitive_attribute, y='RMSE', data=fairness_reg)
    plt.title(f'Model RMSE Across {sensitive_attribute.capitalize()} Groups')
    plt.ylabel('RMSE (Lower is Better)')
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.show()
'''

print("Ethical considerations are crucial when developing AI for mental health applications.")
print("The code above provides a framework for analyzing model fairness across sensitive attributes.")
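
To turn the per-group table returned by `analyze_model_fairness` into a single audit signal, one option is to report the gap between the best and worst performing groups and to list groups that are too small to judge reliably. The sketch below assumes a table shaped like the classification output above. The function name `summarize_accuracy_gap`, the `min_count` threshold of 30, and the example numbers are illustrative assumptions, not results from this notebook.

```python
import pandas as pd


def summarize_accuracy_gap(fairness_df, group_col="gender", metric="accuracy", min_count=30):
    """Hypothetical helper: summarize the spread of a per-group metric."""
    # Groups with few samples give noisy estimates, so they are reported separately
    reliable = fairness_df[fairness_df["count"] >= min_count]
    small = fairness_df[fairness_df["count"] < min_count]

    summary = {"small_groups": small[group_col].tolist(), "gap": None}
    # A gap is only meaningful when at least two groups are large enough
    if len(reliable) >= 2:
        best = reliable.loc[reliable[metric].idxmax()]
        worst = reliable.loc[reliable[metric].idxmin()]
        summary.update({
            "best_group": best[group_col],
            "worst_group": worst[group_col],
            "gap": float(best[metric] - worst[metric]),
        })
    return summary


# Made-up per-group results for illustration only
example_fairness = pd.DataFrame({
    "gender": ["Female", "Male", "Non-binary"],
    "accuracy": [0.84, 0.80, 0.60],
    "count": [120, 110, 8],
})

gap_summary = summarize_accuracy_gap(example_fairness)
print(gap_summary)
```

What counts as an acceptable gap is a policy decision for the institution. The helper only makes the disparity visible and keeps very small groups from being silently averaged away.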

## 14. Final Steps and Real-World Implementation

To move this project from a proof-of-concept to a real-world implementation, here is a roadmap of next steps:

In [None]:
'''
1. **Data Collection and IRB Approval**
   - Design a data collection protocol
   - Secure IRB/Ethics approval
   - Establish data governance policies
   - Begin collecting real student data

2. **Model Refinement**
   - Retrain models with real data
   - Optimize hyperparameters for production
   - Perform cross-validation with multiple cohorts
   - Add model versioning and tracking

3. **System Implementation**
   - Develop a production-ready prediction API
   - Build a secure web interface for counselors
   - Implement authentication and access controls
   - Establish data backup and recovery procedures

4. **Validation Study**
   - Compare model predictions with expert assessments
   - Measure predictive accuracy over time
   - Gather feedback from mental health professionals
   - Refine model based on findings

5. **Integration with University Systems**
   - Connect with existing student support systems
   - Establish referral workflows
   - Create documentation and training materials
   - Develop standard operating procedures

6. **Monitoring and Maintenance**
   - Set up continuous model performance monitoring
   - Schedule regular model retraining
   - Track intervention outcomes
   - Conduct periodic fairness audits

7. **Expansion and Research**
   - Consider additional data sources (academic performance, etc.)
   - Test alternative prediction approaches
   - Publish findings to advance the field
   - Share anonymized insights with the educational community
'''

# Sample implementation timeline
implementation_timeline = pd.DataFrame({
    'Phase': ['Planning & Ethics', 'Data Collection', 'Model Development', 'Validation', 'Deployment', 'Monitoring'],
    'Duration': ['2 months', '4 months', '3 months', '2 months', '1 month', 'Ongoing'],
    'Key Activities': [
        'IRB approval, protocol design', 
        'Survey implementation, data gathering', 
        'Model training and optimization', 
        'Expert validation, user testing',
        'Web app launch, user training',
        'Performance tracking, model updates'
    ]
})

print("Implementation Timeline:")
print(implementation_timeline)

print("\nThank you for exploring this multimodal machine learning approach to mental health prediction.")
print("With real data and careful implementation, this system could help university counseling services")
print("identify students who might benefit from early mental health support.")
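
For the roadmap item on cross-validating across multiple cohorts, scikit-learn's `GroupKFold` keeps every student from a given cohort in the same fold, so each score reflects performance on a cohort the model did not see during fitting. The sketch below uses randomly generated stand-in data. The cohort labels, the feature count, and the choice of `LinearRegression` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in data: 4 cohorts of 50 students each, 4 numeric features
rng = np.random.default_rng(42)
cohorts = np.repeat(["2021", "2022", "2023", "2024"], 50)
X_demo = rng.normal(size=(len(cohorts), 4))
y_demo = 50 + X_demo @ np.array([5.0, -3.0, 2.0, 0.0]) + rng.normal(scale=2.0, size=len(cohorts))

# One fold per cohort: each cohort is held out exactly once
cohort_cv = GroupKFold(n_splits=4)

# Verify that no cohort appears in both the training and the test indices
cohorts_never_overlap = all(
    set(cohorts[train_idx]).isdisjoint(cohorts[test_idx])
    for train_idx, test_idx in cohort_cv.split(X_demo, y_demo, groups=cohorts)
)

# scikit-learn reports negated RMSE so that higher is always better
cohort_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    groups=cohorts, cv=cohort_cv, scoring="neg_root_mean_squared_error",
)

print(f"Cohorts kept separate: {cohorts_never_overlap}")
print(f"RMSE per held-out cohort: {-cohort_scores}")
```

A large spread in the per-cohort scores would suggest the model does not transfer well between student intakes, which is worth knowing before deployment.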

## 15. Summary and Conclusions

In this notebook, we developed a multimodal machine learning approach to predict mental health outcomes for university students. We covered:

1. **Data Processing**: Handling both numerical and categorical data through preprocessing pipelines
2. **Model Implementation**: Using K-NN, Linear/Logistic Regression, SVM, Random Forest, and Neural Networks
3. **Model Evaluation**: Comparing performance metrics across different algorithms
4. **Explainable AI**: Using SHAP values to interpret model predictions
5. **Deployment Strategy**: Creating a web application for university counselors
6. **Ethical Considerations**: Addressing fairness, privacy, and responsible use
7. **Next Steps**: Planning for real-world implementation

The models showed promising results on synthetic data, but those results have not yet been validated on real students. With real student data and proper validation, this approach could become a valuable tool for early identification of mental health concerns among university students, supporting timely intervention and better student wellbeing.