# Depression Prediction: EDA and Modelling | 97% Accuracy

# Goal:
<span style="font-size: 18px;">This project aims to build an AI model to predict whether an individual is experiencing depression based on various psychological, lifestyle, and demographic factors. Depression is a serious mental health condition that affects mood, thinking, and daily activities. Early detection and intervention are crucial in managing depression and improving mental well-being.</span>

# Project Overview:
## Dataset:  
We will use the Depression Professional Dataset from Kaggle, which contains various attributes related to mental health, lifestyle habits, and work-related stress.
## Problem Type: 
This is a binary classification problem where the goal is to predict whether a person is suffering from depression (Yes or No).
## Features:

* Demographic factors (e.g., Age, Gender)
* Lifestyle habits (e.g., Sleep Duration, Dietary Habits)
* Work-related stress (e.g., Work Pressure, Financial Stress)
* Mental health history (e.g., Family History of Mental Illness)
* Target Variable: "Depression" (Yes/No), which serves as an indicator of severe depression.



# **Import Libraries**

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_curve, auc
import tensorflow as tf
from tensorflow.keras.models import Sequential
# Dense and Dropout are the only layers the neural network below uses
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier


# Load Dataset

In [2]:
df = pd.read_csv('/kaggle/input/depression-professional-dataset/Depression Professional Dataset.csv')

# Creating copy for working dataset
data = df.copy()
data.head(20)

| | Gender | Age | Work Pressure | Job Satisfaction | Sleep Duration | Dietary Habits | Have you ever had suicidal thoughts ? | Work Hours | Financial Stress | Family History of Mental Illness | Depression |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 37 | 2.0 | 4.0 | 7-8 hours | Moderate | No | 6 | 2 | No | No |
| 1 | Male | 60 | 4.0 | 3.0 | 5-6 hours | Unhealthy | Yes | 0 | 4 | Yes | No |
| 2 | Female | 42 | 2.0 | 3.0 | 5-6 hours | Moderate | No | 0 | 2 | No | No |
| 3 | Female | 44 | 3.0 | 5.0 | 7-8 hours | Healthy | Yes | 1 | 2 | Yes | No |
| 4 | Male | 48 | 4.0 | 3.0 | 7-8 hours | Moderate | Yes | 6 | 5 | Yes | No |
| 5 | Female | 60 | 1.0 | 4.0 | 7-8 hours | Unhealthy | Yes | 12 | 3 | Yes | No |
| 6 | Female | 30 | 4.0 | 2.0 | More than 8 hours | Healthy | No | 3 | 1 | No | No |
| 7 | Male | 30 | 1.0 | 2.0 | More than 8 hours | Unhealthy | Yes | 6 | 1 | No | No |
| 8 | Male | 56 | 1.0 | 2.0 | More than 8 hours | Moderate | Yes | 11 | 5 | Yes | No |
| 9 | Female | 35 | 3.0 | 4.0 | Less than 5 hours | Moderate | No | 6 | 4 | Yes | No |


# Data Processing

In [3]:
data.shape

(2054, 11)

In [4]:
data.describe()

| | Age | Work Pressure | Job Satisfaction | Work Hours | Financial Stress |
|---|---|---|---|---|---|
| count | 2054.0 | 2054.0 | 2054.0 | 2054.0 | 2054.0 |
| mean | 42.17186 | 3.021908 | 3.015093 | 5.930867 | 2.978578 |
| std | 11.461202 | 1.417312 | 1.418432 | 3.773945 | 1.413362 |
| min | 18.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 35.0 | 2.0 | 2.0 | 3.0 | 2.0 |
| 50% | 43.0 | 3.0 | 3.0 | 6.0 | 3.0 |
| 75% | 51.75 | 4.0 | 4.0 | 9.0 | 4.0 |
| max | 60.0 | 5.0 | 5.0 | 12.0 | 5.0 |


# Identify Nulls, Errors, or Duplicates

In [5]:
# Count the missing values (NaNs) in each column
data.isnull().sum()

Gender                                   0
Age                                      0
Work Pressure                            0
Job Satisfaction                         0
Sleep Duration                           0
Dietary Habits                           0
Have you ever had suicidal thoughts ?    0
Work Hours                               0
Financial Stress                         0
Family History of Mental Illness         0
Depression                               0
dtype: int64

In [6]:
# Count the fully duplicated rows
data.duplicated().sum()

0

In [7]:
# remove duplicate rows from the DataFrame data

data = data.drop_duplicates()

In [8]:
data.shape

(2054, 11)

# Data Cleaning

In [9]:
# Identify numerical and categorical columns for further processing

numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = data.select_dtypes(include=['object']).columns

print("\nNumerical columns:")
for col in numerical_cols:
    print(col)

print("\nCategorical columns:")
for col in categorical_cols:
    print(col)


Numerical columns:
Age
Work Pressure
Job Satisfaction
Work Hours
Financial Stress

Categorical columns:
Gender
Sleep Duration
Dietary Habits
Have you ever had suicidal thoughts ?
Family History of Mental Illness
Depression


In [10]:
# identify unique values in columns

for col in numerical_cols:
    unique_values = data[col].nunique()
    print(f"\nUnique values in column '{col}':")
    print(unique_values)


Unique values in column 'Age':
43

Unique values in column 'Work Pressure':
5

Unique values in column 'Job Satisfaction':
5

Unique values in column 'Work Hours':
13

Unique values in column 'Financial Stress':
5


In [11]:
# Insight into the unique values

for col in numerical_cols:
    unique_values = data[col].unique()
    print(f"\nUnique values in {col}: {unique_values}")


Unique values in Age: [37 60 42 44 48 30 56 35 21 57 54 51 18 31 58 47 50 46 38 45 53 49 43 22
 28 36 20 19 33 41 27 52 26 34 59 55 25 40 24 39 23 29 32]

Unique values in Work Pressure: [2. 4. 3. 1. 5.]

Unique values in Job Satisfaction: [4. 3. 5. 2. 1.]

Unique values in Work Hours: [ 6  0  1 12  3 11 10  8  7  9  5  2  4]

Unique values in Financial Stress: [2 4 5 3 1]


In [12]:
# Investigate an edge case: rows that report 0 work hours

zero_rows = data[data['Work Hours'] == 0]
zero_rows

| | Gender | Age | Work Pressure | Job Satisfaction | Sleep Duration | Dietary Habits | Have you ever had suicidal thoughts ? | Work Hours | Financial Stress | Family History of Mental Illness | Depression |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Male | 60 | 4.0 | 3.0 | 5-6 hours | Unhealthy | Yes | 0 | 4 | Yes | No |
| 2 | Female | 42 | 2.0 | 3.0 | 5-6 hours | Moderate | No | 0 | 2 | No | No |
| 37 | Male | 43 | 1.0 | 2.0 | 5-6 hours | Moderate | No | 0 | 5 | No | No |
| 52 | Male | 45 | 5.0 | 1.0 | 7-8 hours | Moderate | Yes | 0 | 1 | Yes | No |
| 68 | Male | 30 | 4.0 | 1.0 | More than 8 hours | Unhealthy | Yes | 0 | 5 | No | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2003 | Male | 27 | 1.0 | 5.0 | More than 8 hours | Moderate | Yes | 0 | 1 | No | No |
| 2007 | Male | 33 | 1.0 | 4.0 | 7-8 hours | Moderate | No | 0 | 5 | No | No |
| 2023 | Male | 54 | 1.0 | 3.0 | 5-6 hours | Moderate | No | 0 | 2 | No | No |
| 2030 | Female | 49 | 3.0 | 2.0 | 5-6 hours | Moderate | No | 0 | 1 | No | No |


In [13]:
# 170 rows report 0 Work Hours together with Work Pressure > 0 and Job Satisfaction > 0.
# That combination is not practically feasible, so these rows are candidates for removal.

data_zero_rows = data[(data['Work Hours'] == 0) & (data['Work Pressure'] > 0) & (data['Job Satisfaction'] > 0)]
print(data_zero_rows[['Gender','Age','Work Pressure','Job Satisfaction','Work Hours']])

      Gender  Age  Work Pressure  Job Satisfaction  Work Hours
1       Male   60            4.0               3.0           0
2     Female   42            2.0               3.0           0
37      Male   43            1.0               2.0           0
52      Male   45            5.0               1.0           0
68      Male   30            4.0               1.0           0
...      ...  ...            ...               ...         ...
2003    Male   27            1.0               5.0           0
2007    Male   33            1.0               4.0           0
2023    Male   54            1.0               3.0           0
2030  Female   49            3.0               2.0           0
2034  Female   42            2.0               3.0           0

[170 rows x 5 columns]


In [14]:
# Rather than guess at corrected values, remove these rows

data_zero_remove = data[(data['Work Hours'] == 0) & (data['Work Pressure'] > 0) & (data['Job Satisfaction'] > 0)].index

# Drop the rows from a copy so the original DataFrame stays intact
data_new = data.copy()
data_new.drop(data_zero_remove, inplace=True)
print("Rows removed:" , len(data_zero_remove))

Rows removed: 170
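
The same filter can also be written as a single boolean mask, which skips building an index list first. A minimal sketch on a made-up four-row frame: the `toy` data and the names `inconsistent` and `toy_clean` are hypothetical, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "Work Hours": [0, 6, 0, 9],
    "Work Pressure": [4.0, 2.0, 1.0, 3.0],
    "Job Satisfaction": [3.0, 4.0, 2.0, 5.0],
})

# Flag rows that report zero work hours yet a non-zero pressure and satisfaction score.
inconsistent = (
    (toy["Work Hours"] == 0)
    & (toy["Work Pressure"] > 0)
    & (toy["Job Satisfaction"] > 0)
)

# Keep everything that is NOT flagged; reset the index so it stays contiguous.
toy_clean = toy.loc[~inconsistent].reset_index(drop=True)

print("Rows removed:", int(inconsistent.sum()))
print(toy_clean)
```

Keeping the mask in its own variable makes it easy to inspect the flagged rows before dropping them.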


In [15]:
categorical_columns = data_new.select_dtypes(include=['object']).columns

print("\nCategorical columns:")
for col in categorical_columns:
    print(col)


Categorical columns:
Gender
Sleep Duration
Dietary Habits
Have you ever had suicidal thoughts ?
Family History of Mental Illness
Depression


In [16]:
# Identify unique values in categorical columns

for col in categorical_columns:
    unique_values = data_new[col].nunique()
    print(f"\nUnique values in column '{col}':")
    print(unique_values)

for col in categorical_columns:
    unique_values = data_new[col].unique()
    print(f"\nUnique values in {col}: {unique_values}")


Unique values in column 'Gender':
2

Unique values in column 'Sleep Duration':
4

Unique values in column 'Dietary Habits':
3

Unique values in column 'Have you ever had suicidal thoughts ?':
2

Unique values in column 'Family History of Mental Illness':
2

Unique values in column 'Depression':
2

Unique values in Gender: ['Female' 'Male']

Unique values in Sleep Duration: ['7-8 hours' 'More than 8 hours' 'Less than 5 hours' '5-6 hours']

Unique values in Dietary Habits: ['Moderate' 'Healthy' 'Unhealthy']

Unique values in Have you ever had suicidal thoughts ?: ['No' 'Yes']

Unique values in Family History of Mental Illness: ['No' 'Yes']

Unique values in Depression: ['No' 'Yes']


# Exploratory Data Analysis

# Depression Pie Chart

In [17]:
import matplotlib.pyplot as plt  # needed by this and every later plotting cell

depression_counts = data_new['Depression'].value_counts()

colors = ['#73afbb', '#979d9e']

plt.figure(figsize=(6, 6))
plt.pie(depression_counts, 
        labels=depression_counts.index, 
        autopct='%1.1f%%', 
        startangle=140, 
        colors=colors, 
        shadow=True, 
        explode=[0.1] * len(depression_counts),  
        wedgeprops={'edgecolor': 'black'})  

plt.title('Depression Distribution')

plt.show()

# Gender VS Depression

In [None]:
sns.countplot(data=data_new, x='Gender', hue='Depression')
plt.title('Gender vs Depression')
plt.show()

# Depression distribution over Age

In [None]:
sns.histplot(data_new, x="Age", hue="Depression", multiple="dodge", kde=True)
plt.title("Depression over age")
plt.tight_layout()
plt.show()
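
A histogram shows counts, which can hide how the share of depressed respondents changes with age. As a numeric complement, the sketch below bins age and computes the depression rate per bin. The eight rows and the bin edges are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; the real notebook would use data_new instead.
toy = pd.DataFrame({
    "Age": [19, 24, 28, 33, 41, 47, 52, 58],
    "Depression": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# Assumed bin edges: (17, 30], (30, 45], (45, 60].
toy["Age group"] = pd.cut(
    toy["Age"], bins=[17, 30, 45, 60], labels=["18-30", "31-45", "46-60"]
)

# Share of "Yes" answers within each age group.
rate_by_age = (
    toy.assign(depressed=toy["Depression"].eq("Yes"))
       .groupby("Age group", observed=True)["depressed"]
       .mean()
)
print(rate_by_age)
```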

# Suicidal Thoughts over Age

In [None]:
sns.histplot(data_new, x="Age", hue="Have you ever had suicidal thoughts ?", multiple="dodge", kde=True)
plt.title("Suicidal thoughts over age")
plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
sns.boxplot(data=data_new, x="Depression", y="Work Hours", ax=axes[0])
axes[0].set_title('Work Hours vs Depression')

sns.boxplot(data=data_new, x="Depression", y="Financial Stress", ax=axes[1])
axes[1].set_title('Financial Stress vs Depression')

plt.tight_layout()
plt.show()

In [None]:
le=LabelEncoder()
for cols in data_new.select_dtypes('object'):
    data_new[cols]=le.fit_transform(data_new[cols])

data_new.head()
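
One thing to keep in mind with `LabelEncoder`: it numbers the classes in sorted order, so an ordered category such as Sleep Duration ends up with codes that do not follow the amount of sleep. The sketch below contrasts that with an explicit mapping. The `sleep_order` dictionary is an assumed ordering for illustration, not something the notebook defines.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The four Sleep Duration values that appear in the dataset.
sleep = pd.Series(
    ["7-8 hours", "More than 8 hours", "Less than 5 hours", "5-6 hours"]
)

# LabelEncoder sorts the labels as strings before numbering them.
alphabetical_codes = LabelEncoder().fit_transform(sleep)

# Assumed ordering from shortest to longest sleep (illustrative only).
sleep_order = {
    "Less than 5 hours": 0,
    "5-6 hours": 1,
    "7-8 hours": 2,
    "More than 8 hours": 3,
}
ordinal_codes = sleep.map(sleep_order)

print(list(alphabetical_codes))  # string-sorted codes
print(ordinal_codes.tolist())    # codes that follow sleep length
```

Tree-based models are largely insensitive to this, but distance- or gradient-based models such as the neural network may benefit from codes that respect the natural order.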

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(data_new.corr(),annot=True,cmap="Blues")
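
With eleven columns the heatmap is dense, and usually only the column for the target matters. A small sketch of pulling out that column and ranking features by absolute correlation follows. The data is synthetic (generated with a fixed seed) and only borrows column names from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features: one that drives the target and one that is unrelated.
stress = rng.integers(1, 6, size=n)   # values 1..5
hours = rng.integers(0, 13, size=n)   # values 0..12
target = (stress + rng.normal(0, 1, size=n) > 3.5).astype(int)

toy = pd.DataFrame(
    {"Financial Stress": stress, "Work Hours": hours, "Depression": target}
)

# Correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["Depression"]
       .drop("Depression")
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```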

In [None]:
# # Create a new Series y containing the values of the 'Depression' column

# y = data_new['Depression']
# y.head()

In [None]:
# Count how often each value occurs in the 'Depression' column (original, unfiltered DataFrame)

data['Depression'].value_counts()

In [None]:
# # Data Preprocessing
# # Identify categorical and numerical columns
# categorical_cols = ['Gender', 'Sleep Duration', 'Dietary Habits', 'Have you ever had suicidal thoughts ?', 'Family History of Mental Illness']
# numerical_cols = ['Age', 'Work Pressure', 'Job Satisfaction', 'Work Hours', 'Financial Stress']


In [None]:
# # For categorical columns, replace missing values with the most frequent value
# imputer_cat = SimpleImputer(strategy='most_frequent')
# data[categorical_cols] = imputer_cat.fit_transform(data[categorical_cols])

In [None]:
# # Encode categorical variables
# label_encoders = {}
# for col in categorical_cols:
#     le = LabelEncoder()
#     data[col] = le.fit_transform(data[col])
#     label_encoders[col] = le

# data.head()

# ML Modeling

# 1. Random Forest Classifier

In [None]:
# Define the target variable and features
X = data_new.drop('Depression', axis=1)
y = data_new['Depression']  # lowercase y, the name the train/test split expects
print("Feature extraction successful")

# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Development
# 1. Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
print("Random Forest Classifier Accuracy:", accuracy_score(y_test, rf_predictions))
print(classification_report(y_test, rf_predictions))
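
The cells above scale the full feature matrix before splitting, so the scaler sees the test rows. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data from `make_classification`. The `_demo`, `_tr` and `_te` names are placeholders, and the printed accuracy says nothing about the depression dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

# stratify keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler is fitted inside the pipeline, on the training rows only.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipe.fit(X_tr, y_tr)

demo_accuracy = pipe.score(X_te, y_te)
print("Demo accuracy:", demo_accuracy)
```

Random forests do not need scaled inputs, but keeping the scaler in the pipeline means the same preprocessing can be reused for the neural network.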

In [None]:
rf_cm = confusion_matrix(y_test, rf_predictions)

plt.figure(figsize=(8, 6))
sns.heatmap(rf_cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Random Forest Classifier Confusion Matrix')
plt.show()
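
To connect the heatmap to the numbers in the classification report, precision, recall and accuracy can be read straight off the four cells of a binary confusion matrix. The ten labels below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = depressed).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

For a screening task like this one, recall on the positive class is often the figure to watch, since a false negative means a depressed person is missed.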


# 2. Neural Network

In [None]:
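# The target was already label-encoded during EDA, so this step leaves the 0/1 values unchanged;
# it converts y_train and y_test to NumPy arrays for Keras.
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)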

In [None]:

nn_model = Sequential()
nn_model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
nn_model.add(Dropout(0.5))
nn_model.add(Dense(32, activation='relu'))
nn_model.add(Dropout(0.5))
nn_model.add(Dense(1, activation='sigmoid'))

nn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# Early stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train the neural network
nn_model.fit(X_train, y_train, epochs=40, batch_size=32, validation_split=0.2, callbacks=[early_stopping])

# Evaluate the neural network
nn_loss, nn_accuracy = nn_model.evaluate(X_test, y_test)
print("Neural Network Accuracy:", nn_accuracy)

# Make predictions
nn_predictions = (nn_model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, nn_predictions))
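
The 0.5 cut-off applied to the sigmoid output is a choice, not a requirement. Lowering it trades precision for recall. The sketch below shows the effect on a handful of made-up probabilities. The arrays and the 0.3 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities.
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
probs_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

results = {}
for threshold in (0.3, 0.5):
    # Same thresholding step as in the notebook, with a variable cut-off.
    preds = (probs_demo > threshold).astype("int32")
    results[threshold] = {
        "precision": precision_score(y_true_demo, preds),
        "recall": recall_score(y_true_demo, preds),
    }

for threshold, scores in results.items():
    print(threshold, scores)
```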


In [None]:
nn_cm = confusion_matrix(y_test, nn_predictions)

plt.figure(figsize=(8, 6))
sns.heatmap(nn_cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Neural Network Confusion Matrix')
plt.show()

# 3. XGBoost

In [None]:
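from xgboost import XGBClassifier

# early_stopping_rounds is a constructor argument since XGBoost 1.6; fit() no longer accepts it from 2.0
model = XGBClassifier(early_stopping_rounds=5)
# Note: the test set doubles as the early-stopping set here, which makes the reported score optimistic
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_set=eval_set, verbose=True)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]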
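
Because the cell above passes the test set as the early-stopping `eval_set`, the stopping point is tuned on the same rows that are later scored. One way around this is a three-way split, where a separate validation set drives early stopping and the hold-out set is touched only once. The sketch uses synthetic data, and the `_fit`, `_val` and `_hold` names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features and encoded target.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# First carve off 20% as the final hold-out set.
X_trainval, X_hold, y_trainval, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Then take 25% of the remainder (20% of the total) as the validation set.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

# A booster would train on (X_fit, y_fit), use eval_set=[(X_val, y_val)] for
# early stopping, and report its final score on (X_hold, y_hold).
print(len(X_fit), len(X_val), len(X_hold))
```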

In [None]:
XG_accuracy = accuracy_score(y_test, predictions)
print(f" XGBClassifier Accuracy: {XG_accuracy:.2f}")

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
XG_cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(XG_cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('XGBClassifier Confusion Matrix')
plt.show()

# 4. LightGBM

In [None]:

lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train, y_train)
lgbm_y_pred = lgbm_model.predict(X_test)
lgbm_accuracy = accuracy_score(y_test, lgbm_y_pred)
print(f"LightGBM Accuracy: {lgbm_accuracy:.2f}")

In [None]:
print("LightGBM Classification Report:")
print(classification_report(y_test, lgbm_y_pred))

In [None]:
lgbm_cm = confusion_matrix(y_test, lgbm_y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(lgbm_cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('LightGBM Confusion Matrix')
plt.show()

# 5. CatBoost

In [None]:
catboost_model = CatBoostClassifier()
catboost_model.fit(X_train, y_train)
catboost_y_pred = catboost_model.predict(X_test)
catboost_accuracy = accuracy_score(y_test, catboost_y_pred)
print(f"CatBoost Accuracy: {catboost_accuracy:.2f}")

# Confusion Matrix

In [None]:
cb_cm = confusion_matrix(y_test, catboost_y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cb_cm, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('CatBoost Confusion Matrix')
plt.show()

# Conclusion

Five models were trained for the depression classification task. Their test-set accuracies are listed below; the Neural Network scored highest.

1. Random Forest Classifier: 0.95
2. Neural Network: 0.97
3. XGBoost: 0.96
4. LightGBM: 0.96
5. CatBoost: 0.96
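
The accuracies sit within two points of each other and were measured on a single 20% split of the 1,884 rows left after cleaning, so it is worth checking how much of that gap could be sampling noise. The sketch below applies a normal-approximation interval to each reported accuracy. It treats the models as independent, which is a simplification because they share one test set, so read it as a rough guide only.

```python
import math

# 2054 rows minus the 170 removed ones, with test_size=0.2
# (rounded up, as scikit-learn does for a fractional test_size).
n_test = math.ceil(0.2 * (2054 - 170))

# Accuracies as reported in the conclusion.
reported_accuracy = {
    "Random Forest": 0.95,
    "Neural Network": 0.97,
    "XGBoost": 0.96,
    "LightGBM": 0.96,
    "CatBoost": 0.96,
}

def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

intervals = {
    name: accuracy_interval(acc, n_test) for name, acc in reported_accuracy.items()
}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f}")
```

If the intervals overlap, repeated splits or cross-validation would give a firmer ranking than a single test set.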