# **Problem Statement:**

# **Multimodal Mental Health Monitoring with LLM Explanations**

**Objective:** Develop a multimodal AI system that predicts mental health states (e.g., stress, anxiety, depression) in college students by integrating high-frequency physiological time-series data from wearables with contextual features (e.g., activity, location). The system should not only make accurate predictions but also provide human-readable natural language explanations of the physiological and contextual patterns driving each prediction using a Large Language Model (LLM).

**Challenges:**

  * Temporal Dynamics: Modeling per-second wearable sensor data to capture short-term and long-term physiological trends.

  * Multimodal Fusion: Combining numeric time-series embeddings with categorical contextual information in a form suitable for reasoning by an LLM.

  * Natural Language Reasoning: Translating sensor signals and context into interpretable textual explanations for mental health states.

  * Noise & Missing Data: Handling artifacts, missing seconds, and irregular sensor readings without degrading prediction or explanation quality.

  * Evaluation of Explanations: Measuring both prediction accuracy and the relevance/clarity of the LLM-generated reasoning.
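
For the noise and missing-data challenge, one common starting point is to place each student's readings on a regular one-second grid and interpolate only a few consecutive missing seconds. The sketch below is an illustration under assumptions, not part of the original pipeline: `fill_short_gaps` and its `max_gap` parameter are hypothetical, and it assumes one student's data with a datetime `timestamp` column and unique timestamps.

```python
import pandas as pd

def fill_short_gaps(df, value_cols, max_gap=3):
    """Hypothetical helper: reindex to a 1-second grid and interpolate short gaps.

    At most `max_gap` consecutive missing seconds are filled linearly;
    any remaining part of a longer gap stays NaN.
    """
    df = df.sort_values("timestamp").set_index("timestamp")
    # Build the complete 1-second grid between the first and last reading
    full_index = pd.date_range(df.index.min(), df.index.max(),
                               freq=pd.Timedelta(seconds=1))
    df = df.reindex(full_index)
    # Only fill values that lie between two valid readings
    df[value_cols] = df[value_cols].interpolate(
        method="linear", limit=max_gap, limit_area="inside")
    return df.rename_axis("timestamp").reset_index()

# Toy example: the reading at 09:00:02 is missing
ts = pd.to_datetime(["2024-07-31 09:00:00",
                     "2024-07-31 09:00:01",
                     "2024-07-31 09:00:03"])
raw = pd.DataFrame({"timestamp": ts, "heart_rate": [80.0, 82.0, 86.0]})
filled = fill_short_gaps(raw, ["heart_rate"])
print(filled)
```

Leaving long gaps as NaN keeps it visible downstream which windows rest on interpolated rather than measured values.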


**Solution Approach:**

* Aggregate per-second sensor readings to manageable windows (e.g., per minute) and encode as temporal embeddings.

* Convert categorical context/activity features into textual tokens for the LLM (e.g., “Student is in the library”) or embeddings.

* Fuse temporal embeddings and textual embeddings via cross-attention layers so the LLM can jointly reason over physiology and context.

* Train the model to output both:

  * the predicted mental health state per window, and

  * a natural language explanation summarizing which physiological trends and contextual factors contributed to the prediction.

* Evaluate predictions with classification metrics (accuracy, F1-score), and explanations with automatic metrics (BLEU, ROUGE) plus human ratings of clinical interpretability.
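
The cross-attention fusion step in the list above can be sketched in plain NumPy. This is a simplified illustration, not the trained model: it assumes the context tokens act as queries and the per-window sensor embeddings as keys and values, it omits the learned query/key/value projections, and the embedding sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(context_emb, sensor_emb):
    """Scaled dot-product attention without learned projections.

    context_emb: (num_tokens, d) text-side embeddings (queries)
    sensor_emb:  (num_steps, d)  time-series embeddings (keys and values)
    """
    d = context_emb.shape[-1]
    scores = context_emb @ sensor_emb.T / np.sqrt(d)  # (num_tokens, num_steps)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    fused = weights @ sensor_emb                      # (num_tokens, d)
    return fused, weights

rng = np.random.default_rng(0)
context_emb = rng.normal(size=(4, 8))  # 4 context tokens
sensor_emb = rng.normal(size=(5, 8))   # 5 aggregated sensor windows
fused, weights = cross_attention(context_emb, sensor_emb)
print(fused.shape, weights.shape)
```

Each row of `weights` shows how strongly one context token attends to each sensor window, which is also a natural quantity to surface when explaining a prediction.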

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ziya07/mental-health-monitoring-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/ziya07/mental-health-monitoring-dataset?dataset_version_number=1...


100%|██████████| 124k/124k [00:00<00:00, 45.6MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/ziya07/mental-health-monitoring-dataset/versions/1





In [None]:
import pandas as pd
import os

# Path to dataset folder
dataset_dir = path  # e.g., '/root/.cache/kagglehub/datasets/ziya07/mental-health-monitoring-dataset/versions/1'

# List files in the directory
print(os.listdir(dataset_dir))

# Assume the first file in the folder is the main CSV
csv_file = os.path.join(dataset_dir, os.listdir(dataset_dir)[0])

# Load into DataFrame
df = pd.read_csv(csv_file)

# Check
print(df.head())



             timestamp student_id  heart_rate  skin_temperature       eda  \
0  2024-07-31 09:00:00        S01   84.967142         33.088032  0.513965   
1  2024-07-31 09:00:01        S01   78.617357         33.140445  0.506502   
2  2024-07-31 09:00:02        S01   86.476885         34.524906  0.581328   
3  2024-07-31 09:00:03        S01   95.230299         33.500341  0.415615   
4  2024-07-31 09:00:04        S01   77.658466         33.891065  0.325633   

   physical_activity stress_level  stress_label context_activity  \
0                  1          NaN             0             Dorm   
1                  2          NaN             0             Dorm   
2                  2          NaN             0        Cafeteria   
3                  1          NaN             0              Gym   
4                  1          NaN             0          Library   

   session_duration  
0                69  
1                52  
2                69  
3               103  
4               1
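
The `context_activity` values in the preview (Dorm, Cafeteria, Gym, Library) can be turned into short sentences for the language model, following the "Student is in the library" pattern from the solution approach. A minimal sketch, assuming every value names a place; `context_to_sentence` and `window_to_text` are hypothetical helper names.

```python
from itertools import groupby

def context_to_sentence(location):
    """Render one context value with a fixed template."""
    return f"Student is in the {location.strip().lower()}."

def window_to_text(locations):
    """Collapse consecutive repeats, then emit one sentence per visit."""
    visits = [loc for loc, _ in groupby(locations)]
    return " ".join(context_to_sentence(loc) for loc in visits)

text = window_to_text(["Dorm", "Dorm", "Cafeteria", "Gym", "Library"])
print(text)
# Student is in the dorm. Student is in the cafeteria. Student is in the gym. Student is in the library.
```

Collapsing repeats keeps the text short when a student stays in one place for many consecutive seconds.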

In [None]:
import numpy as np

def create_time_windows(data, window_size=60, step_size=1, target_col=None):
    """
    Converts a time-series array into rolling windows for temporal modeling.

    Parameters:
    - data: np.array of shape (num_timesteps, num_features)
    - window_size: int, number of time steps per input window
    - step_size: int, how many steps to move the window each time
    - target_col: int or None, column index for target values; if None, no targets are returned (y is None)

    Returns:
    - X: np.array of shape (num_windows, window_size, num_features)
    - y: np.array of shape (num_windows,) if target_col provided, else None
    """
    X = []
    y = []

    for start in range(0, len(data) - window_size, step_size):
        end = start + window_size
        X.append(data[start:end, :])
        if target_col is not None:
            y.append(data[end, target_col])  # predict next time step
    X = np.array(X)
    y = np.array(y) if target_col is not None else None
    return X, y
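
For long recordings, the Python loop above can be replaced by a strided view. The sketch below is an alternative under assumptions, not the notebook's own method: `create_time_windows_fast` is a hypothetical name, it requires NumPy 1.20 or newer for `sliding_window_view`, and it expects at least `window_size` rows. It is meant to yield the same windows and next-step targets as `create_time_windows`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def create_time_windows_fast(data, window_size=60, step_size=1, target_col=None):
    # All windows along the time axis: shape (T - window_size + 1, F, window_size)
    windows = sliding_window_view(data, window_size, axis=0)
    # Reorder to (num_windows, window_size, num_features)
    windows = windows.transpose(0, 2, 1)
    # Drop the final window (it has no next time step) and apply the stride
    X = windows[:-1:step_size]
    # Target is the value one step after each window
    y = data[window_size::step_size, target_col] if target_col is not None else None
    return X, y

data = np.arange(20, dtype=float).reshape(10, 2)  # 10 time steps, 2 features
X, y = create_time_windows_fast(data, window_size=3, step_size=2, target_col=1)
print(X.shape, y)
```

`X` is a read-only view into `data`; call `X.copy()` before modifying it in place.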


Solution with cross-attention over windowed data


# Suggested solution

In [None]:
# multimodal_mental_health_full_tf.py
import tensorflow as tf
from tensorflow.keras import layers, Model
from transformers import TFAutoModel, AutoTokenizer
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# -------------------------------
# 1. Rolling window conversion
# -------------------------------
def create_rolling_windows(df, time_features, context_feature, target, window_size=5, agg_interval=60, step_size=1):
    X_sensor, X_context, y = [], [], []

    # 'student_id' is the student column shown in the df.head() preview
    for student in df['student_id'].unique():
        student_df = df[df['student_id'] == student].reset_index(drop=True)

        # Aggregate per interval
        num_intervals = len(student_df) // agg_interval
        agg_features, agg_context, agg_labels = [], [], []
        for i in range(num_intervals):
            window_df = student_df.iloc[i*agg_interval:(i+1)*agg_interval]
            mean_features = window_df[time_features].mean().values
            std_features = window_df[time_features].std().values
            agg_features.append(np.concatenate([mean_features, std_features]))
            agg_context.append(window_df[context_feature].mode()[0])
            agg_labels.append(window_df[target].mode()[0])

        agg_features = np.array(agg_features)
        agg_labels = np.array(agg_labels)

        # Create rolling sequences
        for start in range(0, len(agg_features) - window_size + 1, step_size):
            end = start + window_size
            X_sensor.append(agg_features[start:end])
            X_context.append(agg_context[start:end])
            y.append(agg_labels[end-1])

    X_sensor = np.array(X_sensor, dtype=np.float32)
    y = np.array(y)
    return X_sensor, X_context, y

# -------------------------------
# 2. Time-series encoder (LSTM)
# -------------------------------
def build_time_series_encoder(input_dim, hidden_dim=64):
    inputs = layers.Input(shape=(None, input_dim))  # sequence length, features
    x = layers.LSTM(hidden_dim, return_sequences=True)(inputs)
    return Model(inputs, x, name="TimeSeriesEncoder")

# -------------------------------
# 3. Cross-attention layer
# -------------------------------
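class CrossAttentionLayer(layers.Layer):
    """Single-head scaled dot-product attention between two sequences."""

    def __init__(self, hidden_dim, **kwargs):
        super().__init__(**kwargs)
        # Project both modalities into a shared hidden_dim space
        self.query_dense = layers.Dense(hidden_dim)
        self.key_dense = layers.Dense(hidden_dim)
        self.value_dense = layers.Dense(hidden_dim)
        self.scale = hidden_dim ** 0.5

    def call(self, query, key, value, mask=None):
        Q = self.query_dense(query)
        K = self.key_dense(key)
        V = self.value_dense(value)

        # Similarity scores with shape (batch, query_len, key_len)
        scores = tf.matmul(Q, K, transpose_b=True) / self.scale
        if mask is not None:
            # mask is a float tensor with 1.0 at positions to ignore;
            # it must broadcast to the shape of scores
            scores += (mask * -1e9)
        attn_weights = tf.nn.softmax(scores, axis=-1)
        output = tf.matmul(attn_weights, V)
        return output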

# -------------------------------
# 4. Multimodal model with cross-attention
# -------------------------------
def build_multimodal_model(sensor_input_dim, num_classes=3, llm_model_name="distilbert-base-uncased", hidden_dim=64):
    # Time-series encoder
    ts_encoder = build_time_series_encoder(sensor_input_dim, hidden_dim)

    # LLM encoder
    llm_encoder = TFAutoModel.from_pretrained(llm_model_name)
    tokenizer = AutoTokenizer.from_pretrained(llm_model_name)

    # Inputs
    sensor_input = layers.Input(shape=(None, sensor_input_dim), name='sensor')  # rolling window sequence
    context_input_ids = layers.Input(shape=(None,), dtype=tf.int32, name='context_input_ids')
    context_attention_mask = layers.Input(shape=(None,), dtype=tf.int32, name='context_attention_mask')

    # Encodings
    sensor_emb_seq = ts_encoder(sensor_input)  # (batch, seq_len, hidden_dim)
    context_emb_seq = llm_encoder(input_ids=context_input_ids, attention_mask=context_attention_mask).last_hidden_state

    # Cross-attention: context attends to sensor sequence
    cross_attention = CrossAttentionLayer(hidden_dim)
    fused_emb = cross_attention(query=context_emb_seq, key=sensor_emb_seq, value=sensor_emb_seq)

    # Pool over context tokens
    fused_emb = tf.reduce_mean(fused_emb, axis=1)

    # Classification head
    x = layers.Dense(128, activation='relu')(fused_emb)
    preds = layers.Dense(num_classes, activation='softmax', name='prediction')(x)

    model = Model(inputs=[sensor_input, context_input_ids, context_attention_mask], outputs=preds)
    return model, tokenizer

# -------------------------------
# 5. Main script: dataset, training
# -------------------------------
if __name__ == "__main__":
    # Use the DataFrame `df` loaded in the earlier cell, or load your per-second dataset:
    #df = pd.read_csv("mental_health_sensor_data.csv")

    # Column names follow the df.head() preview of the downloaded dataset
    time_features = ['heart_rate', 'skin_temperature', 'eda']
    context_feature = 'context_activity'
    # Assumes stress_label holds integer classes in [0, num_classes)
    target = 'stress_label'

    # Create rolling windows (past 5 minutes per sequence)
    X_sensor, X_context, y = create_rolling_windows(df, time_features, context_feature, target,
                                                    window_size=5, agg_interval=60, step_size=1)

    # Tokenize context sequences (flatten sequences for simplicity)
    llm_model_name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
    context_texts = [" ".join(seq) for seq in X_context]
    encodings = tokenizer(context_texts, padding=True, truncation=True, return_tensors="tf")

    # Build model
    sensor_input_dim = len(time_features) * 2  # mean + std
    model, tokenizer = build_multimodal_model(sensor_input_dim, num_classes=3, llm_model_name=llm_model_name)

    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Train (example, 1 epoch)
    dataset = tf.data.Dataset.from_tensor_slices((
        {
            'sensor': X_sensor,
            'context_input_ids': encodings['input_ids'],
            'context_attention_mask': encodings['attention_mask']
        },
        y
    )).shuffle(100).batch(8)

    model.fit(dataset, epochs=1)

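    # -------------------------------
    # 6. Evaluation after training
    # -------------------------------
    # NOTE: this reuses the training windows, so the scores below are optimistic;
    # hold out a separate test split for a realistic estimate.
    X_eval_sensor = X_sensor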
    context_input_ids_eval = encodings['input_ids']
    context_attention_mask_eval = encodings['attention_mask']

    y_true = y
    y_pred_probs = model.predict({
        'sensor': X_eval_sensor,
        'context_input_ids': context_input_ids_eval,
        'context_attention_mask': context_attention_mask_eval
    })
    y_pred = np.argmax(y_pred_probs, axis=1)

    accuracy = np.mean(y_pred == y_true)
    print("\nAccuracy:", accuracy)

    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("\nConfusion Matrix:\n", confusion_matrix(y_true, y_pred))

    # -------------------------------
    # 7. Example predictions
    # -------------------------------
    example_idx = 0
    print("\n--- Example Input ---")
    print("Sensor features (rolling windows):\n", X_eval_sensor[example_idx])
    print("\nContext text sequence:\n", X_context[example_idx])
    print("\nTrue label:", y_true[example_idx])
    print("Predicted label:", y_pred[example_idx])

    top3_idx = np.argsort(y_pred_probs[example_idx])[::-1][:3]
    print("Top 3 predicted probabilities:", [(i, y_pred_probs[example_idx][i]) for i in top3_idx])
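
The evaluation above scores the model on the same windows it was trained on. One way to get a more honest estimate is to hold out whole students, so that no student contributes windows to both sides. The sketch below is an assumption-laden illustration: `split_by_student` is a hypothetical helper, and `window_students` (one student ID per rolling window) is not returned by `create_rolling_windows` as written, so that function would need to collect it alongside `y`.

```python
import numpy as np

def split_by_student(window_students, test_fraction=0.2, seed=0):
    """Return boolean (train_mask, test_mask) with each student on one side only."""
    window_students = np.asarray(window_students)
    students = np.unique(window_students)
    rng = np.random.default_rng(seed)
    rng.shuffle(students)  # shuffle student IDs in place
    n_test = max(1, int(round(test_fraction * len(students))))
    test_students = students[:n_test]
    test_mask = np.isin(window_students, test_students)
    return ~test_mask, test_mask

# Toy example: 12 windows from 5 students
ids = np.array(["S01"] * 3 + ["S02"] * 2 + ["S03"] * 4 + ["S04"] * 1 + ["S05"] * 2)
train_mask, test_mask = split_by_student(ids, test_fraction=0.2, seed=0)
print("train windows:", train_mask.sum(), "test windows:", test_mask.sum())
```

The masks can then index every per-window array the same way, for example `X_sensor[train_mask]` and `y[test_mask]`.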
