Student Depression Dataset: Analyzing Mental Health Trends and Predictors Among Students

Overview
This dataset compiles a wide range of information aimed at understanding, analyzing, and predicting depression levels among students. It is designed for research in psychology, data science, and education, providing insights into factors that contribute to student mental health challenges and aiding in the design of early intervention strategies.

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("adilshamim8/student-depression-dataset")

print("Path to dataset files:", path)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:

from google.colab import files
U = files.upload()

In [None]:
df=pd.read_csv('student_depression_dataset.csv')

EXPLORATORY DATA ANALYSIS

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Split the columns into categorical (object dtype) and non-categorical lists
categorical_column = []
non_categorical_column = []
for column in df.columns:
  if df[column].dtype == object:
    categorical_column.append(column)
  else:
    non_categorical_column.append(column)
print("categorical_column", categorical_column)
print("non_categorical_column", non_categorical_column)


In [None]:
df.isnull().sum()

In [None]:
df.duplicated().sum()

In [None]:
df.columns

VISUALIZATION

Distribution of Depression Cases

In [None]:
plt.figure(figsize=(7,6))
sns.countplot(x="Depression",data=df,palette="coolwarm")
plt.title("Distribution of Depression Cases")
plt.xlabel("Depression (0=No ,1=Yes)")
plt.ylabel("Count")
plt.show()

From the visual above, we can conclude that most students in this dataset are classified as depressed.

Individual Variables vs Depression Cases

In [None]:
plt.figure(figsize=(6,5))
sns.countplot(x="Gender",data=df,hue="Depression",palette="coolwarm")
plt.title("Gender Vs Depression Cases")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()

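From this visual we can see that more male students than female students are depressed. These are raw counts, so the gap may partly reflect how many male and female students are in the dataset.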
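Because a count plot shows raw counts, comparing the share of depressed students within each group can give a fairer picture. The sketch below is only an illustration: `depression_rate_by` is a hypothetical helper, and `toy` is made-up data standing in for `df`.

```python
import pandas as pd

def depression_rate_by(frame: pd.DataFrame, column: str, target: str = "Depression") -> pd.Series:
    """Share of rows with target == 1 within each level of `column`."""
    return frame.groupby(column)[target].mean().sort_values(ascending=False)

# Toy data for illustration only; in the notebook, pass `df` instead of `toy`.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "Depression": [1, 1, 0, 0, 1, 0],
})

# Males have 2 depressed rows and females 1, yet both groups have a rate of 0.5.
rates = depression_rate_by(toy, "Gender")
print(rates)
```

The same helper could be applied to other columns such as "Study Satisfaction" or "Financial Stress", assuming "Depression" is coded as 0/1.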

AGE VS DEPRESSION
Age distribution of depressed and non-depressed students

In [None]:
plt.figure(figsize=(10,13))
sns.histplot(x="Age",data=df,hue="Depression",palette="coolwarm")
plt.title("Age vs Depression Cases")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

Younger students experience depression more often than older students.

STUDY SATISFACTION AND DEPRESSION

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x="Study Satisfaction",data=df,hue="Depression",palette="RdBu")
plt.title("Study Satisfaction Vs Depression Cases")
plt.xlabel("Study Satisfaction")
plt.ylabel("Count")
plt.show()

Students with high study satisfaction show fewer depression cases than those with low study satisfaction.

ACADEMIC PRESSURE VS DEPRESSION

In [None]:
plt.figure(figsize=(8,6))
# Box plot of the academic pressure score within each depression class
sns.boxplot(x="Depression",y="Academic Pressure",data=df,palette="coolwarm")
plt.title("Academic Pressure Vs Depression Cases")
plt.xlabel("Depression (0=No ,1=Yes)")
plt.ylabel("Academic Pressure")
plt.show()

Students with low academic pressure mostly do not have depression, while those under high academic pressure tend to experience it.

DEGREE vs DEPRESSION

In [None]:
plt.figure(figsize=(22,7))
sns.countplot(x="Degree",data=df,hue="Depression",palette="coolwarm")
plt.title("Degree Vs Depression Cases")
plt.xlabel("Degree")
plt.ylabel("Count")
plt.show()

In general, depressed students outnumber non-depressed students across most degree programs.
This may be driven by academic workload or subject difficulty.

WORK/STUDY VS DEPRESSION

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x="Depression", y="Work/Study Hours", data=df, palette="coolwarm")
plt.title("Work/Study Hours vs Depression")
plt.show()


Longer work/study hours are associated with a higher chance of depression.
Students who work or study for long hours show higher depression levels; this could be managed by reducing workloads or improving time management strategies.

Work/Study Hours Distribution(KDE PLOT)

In [None]:
plt.figure(figsize=(8, 5))
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(df[df["Depression"] == 1]["Work/Study Hours"], label="Depressed", fill=True, color="red")
sns.kdeplot(df[df["Depression"] == 0]["Work/Study Hours"], label="Not Depressed", fill=True, color="blue")
plt.title("Work/Study Hours Distribution by Depression Status")
plt.xlabel("Work/Study Hours")
plt.ylabel("Density")
plt.legend()
plt.show()


The KDE curves show that depressed students tend to report higher work/study hours.

FINANCIAL STRESS AND DEPRESSION

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x="Financial Stress", hue="Depression", data=df, palette="coolwarm")
plt.title("Financial Stress vs Depression")
plt.xlabel("Financial Stress Level")
plt.ylabel("Count")
plt.legend(title="Depression", labels=["No", "Yes"])
plt.show()


Students with higher financial stress have more depression cases. Financial instability appears to be a contributing factor.

FAMILY HISTORY OF MENTAL ILLNESS AND DEPRESSION

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x="Family History of Mental Illness", hue="Depression", data=df, palette="coolwarm")
plt.title("Family History vs Depression")
plt.show()


There is no clear relationship between a family history of mental illness and students' current depression status; other factors are likely to have contributed to the high depression level among the students.
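A visual impression of "no relationship" between two categorical variables can be checked with a chi-square test of independence. The sketch below is an assumption-laden illustration: it uses `scipy.stats.chi2_contingency`, which the notebook does not otherwise import, and a made-up `toy` frame in place of `df`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data for illustration only: depression is split 50/50 in both groups.
toy = pd.DataFrame({
    "Family History of Mental Illness": ["Yes"] * 20 + ["No"] * 20,
    "Depression": ([1] * 10 + [0] * 10) * 2,
})

# Contingency table of counts: rows are family history, columns are depression status.
table = pd.crosstab(toy["Family History of Mental Illness"], toy["Depression"])

# A large p-value means the data give no evidence against independence.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

In the notebook, the same two lines could be run on `df` before the label-encoding step or after it, since the test only needs the category counts.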

CORRELATIONS

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()
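A heatmap can be hard to read when only the relationship with the target matters. As a minimal sketch, the hypothetical helper `correlation_with_target` below ranks numeric columns by the absolute value of their correlation with "Depression"; `toy` is made-up data standing in for `df`.

```python
import pandas as pd

def correlation_with_target(frame: pd.DataFrame, target: str = "Depression") -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")  # text columns are skipped
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data for illustration only; in the notebook, pass `df` instead of `toy`.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Academic Pressure": [1, 2, 4, 5],
    "CGPA": [7.0, 8.0, 7.0, 8.0],
    "Depression": [0, 0, 1, 1],
})

ranked = correlation_with_target(toy)
print(ranked)
```

Correlation coefficients describe linear association only, so a low value here does not rule out a non-linear relationship.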

RECOMMENDATIONS

1. Financial Support & Stress Management
✅ Increase Scholarships & Grants – Provide need-based financial aid to reduce economic burdens on students.
✅ Flexible Payment Plans – Allow tuition installments to ease financial pressure.
✅ On-Campus & Remote Job Opportunities – Offer work-study programs that allow students to earn while studying.
✅ Financial Literacy Programs – Teach students budgeting, saving, and managing expenses through workshops and counseling.

2. Academic Workload & Study-Life Balance
✅ Promote a Balanced Curriculum – Reduce excessive coursework and introduce alternative assessments like projects instead of exams.
✅ Encourage Healthy Study Hours – Universities should monitor excessive study/work hours and offer time management training.
✅ Flexible Deadlines & Support Systems – Allow extensions for students struggling with mental health challenges.

3. Mental Health Awareness & Support Services
✅ Expand Counseling & Therapy Services – Provide free or affordable mental health counseling on campus.
✅ Student Peer Support Groups – Establish peer-to-peer mentorship programs where students can share experiences.
✅ Promote Mindfulness & Physical Well-being – Encourage exercise, meditation, and social activities to manage stress.
✅ Mental Health Awareness Campaigns – Educate students and faculty on early signs of depression and available resources.

4. Career Guidance & Post-Graduation Support
✅ Personalized Career Counseling – Help students choose degrees that match their strengths & interests to reduce academic pressure.
✅ Internship & Job Placement Support – Assist students in finding internships and jobs to reduce post-graduation stress.




FEATURE ENGINEERING AND MACHINE LEARNING MODEL

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix

In [None]:
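le=LabelEncoder()
# Encode every text (object dtype) column; scikit-learn models need numeric inputs.
# The suicidal-thoughts answer is excluded because it is mapped separately.
suicidal_col="Have you ever had suicidal thoughts ?"
categorical_column=[c for c in df.select_dtypes(include="object").columns if c!=suicidal_col]
for column in categorical_column:
  df[column]=le.fit_transform(df[column].astype(str))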

In [None]:
suicidal_thoughts={"yes":0,"no":1}


In [None]:
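# Normalize the answers (trim spaces, lower-case) so they match the keys of suicidal_thoughts
df['Have you ever had suicidal thoughts ?']=df['Have you ever had suicidal thoughts ?'].str.strip().str.lower()
df['Have you ever had suicidal thoughts ?']=df['Have you ever had suicidal thoughts ?'].map(suicidal_thoughts)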
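`Series.map` returns NaN for any value that is missing from the dictionary, so a difference in capitalization or stray spaces silently turns answers into missing values. The sketch below illustrates this with a made-up `answers` series; the `mapping` dictionary mirrors `suicidal_thoughts`.

```python
import pandas as pd

# Toy answers for illustration only, with mixed case and stray spaces.
answers = pd.Series(["Yes", "no", " YES ", "No"])
mapping = {"yes": 0, "no": 1}

# Mapping the raw strings leaves every non-matching value as NaN.
raw = answers.map(mapping)
print("Unmapped values without normalization:", raw.isna().sum())

# Trimming and lower-casing first makes every answer match a key.
encoded = answers.str.strip().str.lower().map(mapping)
print("Unmapped values after normalization:", encoded.isna().sum())
```

Running `df['Have you ever had suicidal thoughts ?'].isna().sum()` after the mapping is a quick way to confirm that nothing was lost.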

SELECT FEATURES & TARGET VARIABLES

In [None]:
X = df.drop(columns=["id", "Depression","Have you ever had suicidal thoughts ?"])  # Remove ID, target variable & suicidal-thoughts column
y = df["Depression"]  # Target variable (1 = Depressed, 0 = Not Depressed)

# Split into Train & Test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
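When the classes are unbalanced, a plain random split can give the train and test sets slightly different class ratios. One optional variation, not used in the cell above, is to pass `stratify=y` to `train_test_split`. The sketch below uses made-up arrays (`X_toy`, `y_toy`) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 60% positive class, 40% negative class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 60 + [0] * 40)

# stratify=y_toy keeps the 60/40 class ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```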


TRAIN THE MODEL: RANDOM FOREST CLASSIFIER

In [None]:
df.head(10)

In [None]:
model=RandomForestClassifier(n_estimators=100,random_state=42)
model.fit(X_train,y_train)



In [None]:
y_pred=model.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


In [None]:
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind='bar', figsize=(10,5), color='royalblue')
plt.title("Feature Importance for Depression Prediction")
plt.show()


LOGISTIC REGRESSION, XGBOOST, SVM AND KNN MODELS

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [None]:
# Defining the models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),  # more iterations to help convergence
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric='logloss'),
    "SVM": SVC(kernel='linear'),
    "KNN": KNeighborsClassifier(n_neighbors=5)
}

# Train & Evaluate
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # Train model
    y_pred = model.predict(X_test)  # Make predictions
    acc = accuracy_score(y_test, y_pred)  # Calculate accuracy
    results[name] = acc  # Store results

    print(f"\n📌 {name} Model Results:")
    print(f"Accuracy: {acc:.2f}")
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


In [None]:
plt.figure(figsize=(8,5))
plt.bar(results.keys(), results.values(), color=['blue', 'green', 'red', 'purple', 'orange'])
plt.xlabel("Model")
plt.ylabel("Accuracy Score")
plt.title("Model Comparison: Accuracy Scores")
plt.xticks(rotation=30)
plt.show()
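Logistic regression, SVM and KNN are sensitive to feature scale, while the tree-based models are not, so comparing them on unscaled features may understate the former. A possible variation, sketched below on synthetic data from `make_classification` (an assumption, not the student dataset), wraps the estimator in a pipeline with `StandardScaler`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only; one feature is put on a much larger scale.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fit on the training data only and reused when predicting.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

acc = scaled_logreg.score(X_te, y_te)
print(f"Accuracy with scaling: {acc:.2f}")
```

In the notebook, the same pattern could replace the "Logistic Regression", "SVM" and "KNN" entries of the `models` dictionary.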


FINE TUNING RANDOM FOREST

In [None]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to tune
param_grid_rf = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4]
}

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Perform Grid Search
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=5, n_jobs=-1, verbose=2)
grid_search_rf.fit(X_train, y_train)

# Best parameters & cross-validated accuracy
print("📌 Best Random Forest Parameters:", grid_search_rf.best_params_)
print("📌 Best CV Accuracy:", grid_search_rf.best_score_)

# Best model (GridSearchCV already refit it on the full training set)
best_rf = grid_search_rf.best_estimator_


FINE TUNE XGBOOST

In [None]:
param_grid_xgb = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 6, 9],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0]
}

xgb = XGBClassifier(eval_metric='logloss')

grid_search_xgb = GridSearchCV(xgb, param_grid_xgb, cv=5, n_jobs=-1, verbose=2)
grid_search_xgb.fit(X_train, y_train)

# Best parameters & cross-validated accuracy
print("📌 Best XGBoost Parameters:", grid_search_xgb.best_params_)
print("📌 Best CV Accuracy:", grid_search_xgb.best_score_)

# Best model (GridSearchCV already refit it on the full training set)
best_xgb = grid_search_xgb.best_estimator_


TUNE LOGISTIC REGRESSION

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Define Logistic Regression model (higher max_iter gives the solvers more room to converge)
logreg = LogisticRegression(max_iter=1000)

# Define hyperparameters grid
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],  # Regularization strength
    "penalty": ["l1", "l2"],  # L1 = Lasso, L2 = Ridge
    "solver": ["liblinear", "saga"]  # Suitable for L1 and L2
}

# Perform Grid Search
grid_search = GridSearchCV(logreg, param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Print best parameters & cross-validated accuracy
print("📌 Best Parameters:", grid_search.best_params_)
print("📌 Best CV Accuracy:", grid_search.best_score_)

# Best model (GridSearchCV already refit it on the full training set)
best_logreg = grid_search.best_estimator_

# Predict on test set
y_pred = best_logreg.predict(X_test)

# Evaluate performance
print("📌 Final Model Accuracy:", accuracy_score(y_test, y_pred))
print("📌 Classification Report:\n", classification_report(y_test, y_pred))


COMPARING FINE TUNED MODELS

In [None]:
# Evaluate the fine-tuned models
models = {"Random Forest": best_rf, "XGBoost": best_xgb,"Logistic Regression":best_logreg}

for name, model in models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n📌 {name} Model Accuracy: {acc:.2f}")
    print("Classification Report:\n", classification_report(y_test, y_pred))
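Accuracy alone can hide how well the depressed class is detected, so a side-by-side table with recall and F1 may make the comparison easier to read. The sketch below is an illustration only: `summarize_models` is a hypothetical helper, and it is demonstrated on synthetic data with a `DummyClassifier` baseline rather than on the tuned models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

def summarize_models(fitted_models, X_eval, y_eval):
    """One row per fitted model with accuracy, recall and F1 for the positive class."""
    rows = []
    for name, estimator in fitted_models.items():
        pred = estimator.predict(X_eval)
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_eval, pred),
            "recall": recall_score(y_eval, pred, zero_division=0),
            "f1": f1_score(y_eval, pred, zero_division=0),
        })
    return pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "Baseline (most frequent)": DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr),
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
}
summary = summarize_models(demo_models, X_te, y_te)
print(summary)
```

In the notebook, the call would be `summarize_models(models, X_test, y_test)` with the dictionary of fine-tuned models.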
