In [6]:
# Data preprocessing 
import pandas as pd
from sklearn.linear_model import LogisticRegression
mental = pd.read_csv('Student Mental Health.csv')
mental['Your current year of Study'] = mental['Your current year of Study'].str.lower()
mental.fillna(mental.mode().iloc[0], inplace=True)

mental


Unnamed: 0,Timestamp,Choose your gender,Age,What is your course?,Your current year of Study,What is your CGPA?,Marital status,Do you have Depression?,Do you have Anxiety?,Do you have Panic attack?,Did you seek any specialist for a treatment?
0,8/7/2020 12:02,Female,18.0,Engineering,year 1,3.00 - 3.49,No,Yes,No,Yes,No
1,8/7/2020 12:04,Male,21.0,Islamic education,year 2,3.00 - 3.49,No,No,Yes,No,No
2,8/7/2020 12:05,Male,19.0,BIT,year 1,3.00 - 3.49,No,Yes,Yes,Yes,No
3,8/7/2020 12:06,Female,22.0,Laws,year 3,3.00 - 3.49,Yes,Yes,No,No,No
4,8/7/2020 12:13,Male,23.0,Mathemathics,year 4,3.00 - 3.49,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...
96,13/07/2020 19:56:49,Female,21.0,BCS,year 1,3.50 - 4.00,No,No,Yes,No,No
97,13/07/2020 21:21:42,Male,18.0,Engineering,year 2,3.00 - 3.49,No,Yes,Yes,No,No
98,13/07/2020 21:22:56,Female,19.0,Nursing,year 3,3.50 - 4.00,Yes,Yes,No,Yes,No
99,13/07/2020 21:23:57,Female,23.0,Pendidikan Islam,year 4,3.50 - 4.00,No,No,No,No,No


In [7]:
# After loading and cleaning the dataset, it’s important to verify whether any missing values (NaNs) remain in the data.
# Missing values can negatively affect model performance and lead to unreliable predictions.
mental.isnull().sum()

Timestamp                                       0
Choose your gender                              0
Age                                             0
What is your course?                            0
Your current year of Study                      0
What is your CGPA?                              0
Marital status                                  0
Do you have Depression?                         0
Do you have Anxiety?                            0
Do you have Panic attack?                       0
Did you seek any specialist for a treatment?    0
dtype: int64

In [8]:
# Before training the model, the dataset was cleaned to remove irrelevant columns and handle missing values.
# The Timestamp column was removed because it only records when a response was submitted
# and has no analytical or predictive value.
# The column “Did you seek any specialist for a treatment?” was dropped because it describes
# behavior after a condition has developed, not a factor influencing mental health risk.
# Keeping such a variable could leak the target (it indirectly reveals the outcome) and bias the model.
mental.drop(columns=['Timestamp', 'Did you seek any specialist for a treatment?'], inplace=True)
# Missing values were replaced with the mode (most frequent value) of each column.
# This method was chosen because:
# Most columns are categorical, where the mode is the most suitable imputation strategy.
# It ensures no data loss through row deletion and keeps the dataset size intact.
mental.fillna(mental.mode().iloc[0], inplace=True)

# This displays the dataset’s rows and columns after cleaning.
# It helps verify that the dataset size remains consistent and that only the intended columns were removed.
mental.shape


(101, 9)
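
The mode imputation used above can be illustrated on a small frame. This is a minimal sketch on hypothetical toy data (the `toy` frame and its values are assumptions, not rows from the survey file); it shows why `mode().iloc[0]` is needed: `mode()` returns one row per tied mode, and `iloc[0]` keeps the first one for each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from 'Student Mental Health.csv'.
toy = pd.DataFrame({
    "Age": [18.0, np.nan, 19.0, 18.0],
    "Marital status": ["No", "No", None, "Yes"],
})

# mode() ignores missing values and may return several rows when modes tie;
# iloc[0] selects the first mode of every column as the fill value.
fill_values = toy.mode().iloc[0]

# fillna with a Series fills each column with the value under the matching label.
imputed = toy.fillna(fill_values)

print(imputed)
print("Remaining nulls:", int(imputed.isnull().sum().sum()))
```

Mode imputation keeps every row, but on a small dataset it also pulls each column toward its majority value, so it is worth checking how many cells were actually filled.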

In [9]:
# As a final safeguard, drop any rows that still contain missing values after imputation
# and confirm that no nulls remain, since they could degrade the model's performance.
mental.dropna(inplace=True)
mental.isnull().sum()

Choose your gender            0
Age                           0
What is your course?          0
Your current year of Study    0
What is your CGPA?            0
Marital status                0
Do you have Depression?       0
Do you have Anxiety?          0
Do you have Panic attack?     0
dtype: int64

In [10]:
# Import libraries
# pandas: Data loading and manipulation
# scikit-learn: Data preprocessing, model training, and evaluation
    
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load dataset
# The dataset was already loaded above as `mental` (from 'Student Mental Health.csv'),
# so it is not read again here.
# After cleaning, each record contains:
# Input features: gender, age, course, year of study, CGPA, and marital status.
# Target variables:
# Do you have Depression?
# Do you have Anxiety?
# Do you have Panic attack?

df = mental.copy()

# Encode categorical columns
# Each categorical column gets its own LabelEncoder so that its mapping can be reused later.
# A single shared encoder would keep only the mapping of the last column it was fit on.
encoders = {}
for col in df.select_dtypes(include='object').columns:
    label_enc = LabelEncoder()
    df[col] = label_enc.fit_transform(df[col].astype(str))
    encoders[col] = label_enc

# Define features and targets
# X: Independent features (student demographics and academic details)
# y: Target labels (mental health indicators), selected per target in the loop below
X = df.drop(columns=['Do you have Depression?', 'Do you have Anxiety?', 'Do you have Panic attack?'])

# Create dictionary to store results

results = {}

# Train separate models for each target
# Separate Random Forest Classifiers are trained for each mental health condition.
for target in ['Do you have Depression?', 'Do you have Anxiety?', 'Do you have Panic attack?']:
    y = df[target]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    print(f"\n=== {target.upper()} ===")
    print(classification_report(y_test, y_pred))
    
    results[target] = model
    
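# Each model prints a classification report showing precision, recall, and F1-score for performance evaluation.

# NOTE: the extended profile below (input_df) is illustrative only. Most of its columns
# (sleep, financial stress, etc.) are not in this dataset, and input_df is never passed to a model.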
    
gender = 'Male'
age = 21
course = 'Computer Science'
year = 'Year 3'
cgpa = '3.50 - 3.99'
marital_status = 'Single'
sleep = 'Yes'
study_satisfaction = 'Yes'
academic_pressure = 'No'
financial_stress = 'No'
relationship_issues = 'No'
family_issues = 'No'
social_support = 'Yes'
physical_activity = 'Yes'

input_df = pd.DataFrame({
    'Choose your gender': [gender],
    'Age': [age],
    'What is your course?': [course],
    'Your current year of Study': [year],
    'What is your CGPA?': [cgpa],
    'Marital status': [marital_status],
    'Do you sleep well?': [sleep],
    'Are you satisfied with your study life?': [study_satisfaction],
    'Do you experience academic pressure?': [academic_pressure],
    'Do you have financial stress?': [financial_stress],
    'Do you have relationship issues?': [relationship_issues],
    'Do you have family problems?': [family_issues],
    'Do you feel supported socially?': [social_support],
    'Do you engage in physical activity?': [physical_activity]
})
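# Example: predict for a new student
# A new student’s profile is entered manually. Categorical values must match categories
# seen during training (for example, year of study is lowercase and marital status is Yes/No).
new_student = pd.DataFrame({
    'Choose your gender': ['Male'],
    'Age': [21],
    'What is your course?': ['Engineering'],
    'Your current year of Study': ['year 3'],
    'What is your CGPA?': ['3.50 - 4.00'],
    'Marital status': ['No']
})

# Encode new data
# Each categorical column is encoded with an encoder fitted on the original training column,
# so the codes match those used to train the models. Calling fit_transform on the new row
# would map every value to 0. Age is numeric and is left unchanged.
for col in new_student.select_dtypes(include='object').columns:
    col_enc = LabelEncoder().fit(mental[col].astype(str))
    new_student[col] = col_enc.transform(new_student[col].astype(str))
# Each trained model predicts whether the student is likely to experience Depression, Anxiety, or Panic Attack.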

# Predict each label
for target, model in results.items():
    # ✅ Patch for newer sklearn versions
    if not hasattr(model, "monotonic_cst"):
        model.monotonic_cst = None
        
    prediction = model.predict(new_student)[0]
    print(f"{target}: {'Yes' if prediction == 1 else 'No'}")



=== DO YOU HAVE DEPRESSION? ===
              precision    recall  f1-score   support

           0       0.71      0.92      0.80        13
           1       0.75      0.38      0.50         8

    accuracy                           0.71        21
   macro avg       0.73      0.65      0.65        21
weighted avg       0.72      0.71      0.69        21


=== DO YOU HAVE ANXIETY? ===
              precision    recall  f1-score   support

           0       0.86      0.75      0.80        16
           1       0.43      0.60      0.50         5

    accuracy                           0.71        21
   macro avg       0.64      0.68      0.65        21
weighted avg       0.76      0.71      0.73        21


=== DO YOU HAVE PANIC ATTACK? ===
              precision    recall  f1-score   support

           0       0.78      0.93      0.85        15
           1       0.67      0.33      0.44         6

    accuracy                           0.76        21
   macro avg       0.72      0
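
Encoding new rows consistently with the training data is the step most likely to go wrong in the cell above. The following is a minimal sketch of one way to do it, assuming one `LabelEncoder` per column stored in a dictionary; the `train` frame, the `encoders` dictionary, and the `new_row` values are hypothetical and only borrow two column names from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature training frame using two of the notebook's column names.
train = pd.DataFrame({
    "Choose your gender": ["Female", "Male", "Male", "Female"],
    "Your current year of Study": ["year 1", "year 2", "year 1", "year 3"],
    "Age": [18.0, 21.0, 19.0, 22.0],
})
categorical_cols = ["Choose your gender", "Your current year of Study"]

# Fit one encoder per categorical column and keep it for later reuse.
encoders = {}
encoded = train.copy()
for col in categorical_cols:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(train[col])
    encoders[col] = enc

# A new row is transformed (not re-fitted) with the stored encoders.
new_row = pd.DataFrame({
    "Choose your gender": ["Male"],
    "Your current year of Study": ["year 3"],
    "Age": [21.0],
})
for col, enc in encoders.items():
    new_row[col] = enc.transform(new_row[col])

print(new_row)
```

`transform` raises a `ValueError` for a category that was not present during fitting, which is usually preferable to silently assigning it an arbitrary code.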

In [11]:
# In this section, we trained three distinct machine learning models — 
# one each for predicting Depression, Anxiety, and Panic Attack.
# This approach allows us to capture the unique patterns and factors associated with each mental health condition, 
# rather than using a single generalized model.

# Train separate models for each target
targets = ['Do you have Depression?', 'Do you have Anxiety?', 'Do you have Panic attack?']

for target in targets:
    y = df[target]
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Initialize Random Forest model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Train model
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Evaluation results
    print(f"\n=== {target.upper()} ===")
    print(classification_report(y_test, y_pred))
    
    # ✅ Optional: add patch for future scikit-learn compatibility
    if not hasattr(model, "monotonic_cst"):
        model.monotonic_cst = None
    
    # Store model
    results[target] = model



=== DO YOU HAVE DEPRESSION? ===
              precision    recall  f1-score   support

           0       0.71      0.92      0.80        13
           1       0.75      0.38      0.50         8

    accuracy                           0.71        21
   macro avg       0.73      0.65      0.65        21
weighted avg       0.72      0.71      0.69        21


=== DO YOU HAVE ANXIETY? ===
              precision    recall  f1-score   support

           0       0.86      0.75      0.80        16
           1       0.43      0.60      0.50         5

    accuracy                           0.71        21
   macro avg       0.64      0.68      0.65        21
weighted avg       0.76      0.71      0.73        21


=== DO YOU HAVE PANIC ATTACK? ===
              precision    recall  f1-score   support

           0       0.78      0.93      0.85        15
           1       0.67      0.33      0.44         6

    accuracy                           0.76        21
   macro avg       0.72      0

In [12]:
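# To assess the model’s performance and generalization ability, we used 5-fold cross-validation.
# This technique estimates how well the model performs on unseen data and depends less on a single train/test split.
# Note: `model` and `y` here are the ones left over from the last loop iteration, so the score below
# applies only to the 'Do you have Panic attack?' target.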

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())


0.6942857142857142
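
The cross-validation above scores a single target. A minimal sketch of how each target could be scored separately is shown here, assuming stratified folds and macro F1 as the metric; the synthetic `X_demo` and `targets_demo` frames are placeholders for the notebook's `X` and target columns, so the printed numbers say nothing about the survey data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins for the notebook's feature matrix and target columns.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.integers(0, 4, size=(100, 6)),
    columns=[f"feature_{i}" for i in range(6)],
)
targets_demo = pd.DataFrame({
    "depression": rng.integers(0, 2, size=100),
    "anxiety": rng.integers(0, 2, size=100),
})

# Stratified folds keep the Yes/No ratio similar in every fold,
# which matters when the positive class is small.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name in targets_demo.columns:
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    scores = cross_val_score(clf, X_demo, targets_demo[name], cv=cv, scoring="f1_macro")
    cv_scores[name] = scores
    print(f"{name}: mean macro F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score moves between folds, which can be large with about 100 rows.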


In [16]:
# After training and evaluating the models, we saved them for future use using the joblib library.
# This ensures that the trained models and encoder can be reloaded without retraining, saving time and
# computational resources.

import joblib
# Save and reload the most recently trained model (the Panic attack model) as a round-trip check
joblib.dump(model, 'mental_health_model.pkl')
model = joblib.load('mental_health_model.pkl')

# Save all models
joblib.dump(results['Do you have Depression?'], 'depression_model.pkl')
joblib.dump(results['Do you have Anxiety?'], 'anxiety_model.pkl')
joblib.dump(results['Do you have Panic attack?'], 'panic_model.pkl')

# Save label encoder
# (a single LabelEncoder only holds the mapping from its most recent fit)
joblib.dump(label_enc, 'label_encoder.pkl')

['label_encoder.pkl']
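
Models and encoders can also be saved together so they cannot drift apart. The sketch below assumes a single dictionary bundle written with `joblib.dump`; the bundle layout, the `demo_target` key, and the file name `demo_bundle.pkl` are hypothetical choices, and the tiny model is trained on made-up rows only to make the example runnable.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Tiny made-up training set: [gender code, age] -> Yes/No label.
enc = LabelEncoder().fit(["No", "Yes"])
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit([[0, 18], [1, 21], [0, 19], [1, 22]], enc.transform(["No", "Yes", "No", "Yes"]))

# Keep models and their encoders in one object (hypothetical layout).
bundle = {
    "models": {"demo_target": clf},
    "encoders": {"demo_target": enc},
}

# Write to a temporary directory so the example leaves no files behind in the project.
path = os.path.join(tempfile.mkdtemp(), "demo_bundle.pkl")
joblib.dump(bundle, path)

# Reload and predict, then map the numeric prediction back to its label.
restored = joblib.load(path)
pred = restored["models"]["demo_target"].predict([[1, 21]])
label = restored["encoders"]["demo_target"].inverse_transform(pred)[0]
print(label)
```

Pickled scikit-learn models are generally only reliable when loaded with the same library version that saved them, so recording the version next to the file is a reasonable habit.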

In [14]:
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import classification_report, confusion_matrix

# for col in ['Do you have Depression?', 'Do you have Anxiety?', 'Do you have Panic attack?']:
#     print(f"\n=== {col.upper()} ===")
#     y = mental[col]
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#     model = LogisticRegression(max_iter=1000, random_state=42)
#     model.fit(X_train, y_train)
#     y_pred = model.predict(X_test)

#     print(classification_report(y_test, y_pred))


Conclusion

This notebook demonstrates how machine learning can help identify mental health risk patterns in students using basic demographic and academic survey data.
The goal is to raise awareness, promote early support, and encourage data-driven mental health initiatives in educational institutions.

Algorithm Used: Random Forest Classifier

For this project, a Random Forest Classifier was used to predict the likelihood of students experiencing depression, anxiety, and panic attacks.

Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs to improve accuracy and control overfitting. It works well with mixed data types (categorical and numerical) and captures nonlinear relationships, which are common in psychological and behavioral datasets.

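Reasoning:
Student mental health data often involves complex, nonlinear factors (academic pressure, sleep, financial stress, etc.), although this dataset records only demographic and academic fields.
Random Forest can capture such nonlinear patterns and feature interactions, which Logistic Regression, a linear model, cannot represent without manual feature engineering.
The Logistic Regression cell in this notebook is commented out, so no direct comparison was run here. On the 21-sample test split, Random Forest achieved:

Accuracy of 0.71 for Depression, 0.71 for Anxiety, and 0.76 for Panic attack.

Positive-class F1-scores of 0.50, 0.50, and 0.44 respectively, with positive-class recall as low as 0.33, so many positive cases are missed.

A 5-fold cross-validation accuracy of about 0.69 for the last trained target, in line with the single-split results on a dataset of 101 rows.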
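
The comparison between Random Forest and Logistic Regression discussed above could be checked with a sketch like the following. It assumes both models are scored with the same 5-fold cross-validation and macro F1; the data comes from `make_classification`, so it is synthetic and the resulting scores are not results for the survey.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with the same number of features as the notebook's X.
X_syn, y_syn = make_classification(n_samples=100, n_features=6, random_state=42)

# Both candidates are evaluated under identical conditions.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

comparison = {}
for name, estimator in candidates.items():
    # For classifiers, an integer cv uses stratified folds by default.
    scores = cross_val_score(estimator, X_syn, y_syn, cv=5, scoring="f1_macro")
    comparison[name] = scores.mean()
    print(f"{name}: mean macro F1 = {scores.mean():.3f}")
```

Running the same loop on the notebook's `X` and each target column would show whether Random Forest actually outperforms the linear baseline on this data.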

Conclusion

The Random Forest Classifier was considered a good fit for this analysis because:

It adapts well to real-world behavioral data, which rarely follows strict linear patterns.

It gave similar test accuracy (0.71 to 0.76) across the three targets.

It offers a view of which inputs drive the predictions through feature importance (not analyzed in this notebook).
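
Feature importance, mentioned in the list above, could be inspected roughly as follows. This is a sketch on synthetic data from `make_classification`: the column names are borrowed from the notebook for readability, but the values and the resulting ranking are not derived from the survey.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names borrowed from the notebook; the values below are synthetic.
feature_names = [
    "Choose your gender",
    "Age",
    "What is your course?",
    "Your current year of Study",
    "What is your CGPA?",
    "Marital status",
]
X_syn, y_syn = make_classification(n_samples=100, n_features=6, random_state=42)
X_syn = pd.DataFrame(X_syn, columns=feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_syn, y_syn)

# feature_importances_ holds impurity-based importances that sum to 1.
importances = (
    pd.Series(forest.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances tend to favor features with many distinct values (such as course), so they are best read as a rough ranking.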

Thus, Random Forest was chosen over Logistic Regression for its flexibility and robustness in modeling student mental health outcomes, although its predictive advantage still needs to be confirmed by a direct comparison on this data.