# Depression

Depressive disorder (also known as depression) is a common mental disorder. It involves a depressed mood or loss of pleasure or interest in activities for long periods of time.

Depression is different from regular mood changes and feelings about everyday life. It can affect all aspects of life, including relationships with family, friends and community. It can result from or lead to problems at school and at work.

Depression can happen to anyone. People who have lived through abuse, severe losses or other stressful events are more likely to develop depression. **Women are more likely to have depression than men**.



# Depression Symptoms

During a depressive episode, a person experiences a depressed mood (feeling sad, irritable, empty). They may feel a loss of pleasure or interest in activities.



**Here are some symptoms of depression:**
* poor concentration
* feelings of excessive guilt or low self-worth
* hopelessness about the future
* thoughts about dying or suicide
* disrupted sleep
* changes in appetite or weight
* feeling very tired or low in energy.

Depression can cause difficulties in all aspects of life, including in the community and at home, work and school.




# Import all the necessary libraries

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.metrics import accuracy_score, make_scorer, roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler

# Install sweetviz quietly, then import it
!pip install -q sweetviz
import sweetviz as sv

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")


# Load the dataset

In [None]:
train_data = pd.read_csv('/kaggle/input/playground-series-s4e11/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e11/test.csv')

In [None]:
train_data.head()

# Dataset Description

The dataset contains **140,700 rows and 20 columns**, with each row representing an individual participant’s data. 


Here’s an overview of the columns:

1. id: Unique identifier for each entry.
2. Name: Name of the participant.
3. Gender: Gender of the participant.
4. Age: Age of the participant.
5. City: Participant's city of residence.
6. Working Professional or Student: Indicates if the participant is a working professional or a student.
7. Profession: Participant's profession.
8. Academic Pressure: Level of academic pressure experienced (for students).
9. Work Pressure: Level of work pressure experienced (for working professionals).
10. CGPA: Academic performance (for students).
11. Study Satisfaction: Satisfaction level with studies.
12. Job Satisfaction: Satisfaction level with job.
13. Sleep Duration: Average duration of sleep.
14. Dietary Habits: Dietary habits, e.g., Healthy, Unhealthy, Moderate.
15. Degree: Participant's highest degree.
16. Have you ever had suicidal thoughts ?: Indicates if the participant has ever had suicidal thoughts.
17. Work/Study Hours: Number of hours spent on work or study daily.
18. Financial Stress: Level of financial stress experienced.
19. Family History of Mental Illness: Indicates if there is a family history of mental illness.
20. Depression: Binary indicator of whether the participant experiences depression.


In [None]:
train_data.info()

# Find the missing values

Several columns have a large number of missing values.

In [None]:
train_data.isnull().sum()
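
The raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up frame standing in for `train_data`; the helper name `missing_report` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data (illustrative only, not taken from the competition files)
toy = pd.DataFrame({
    "Age": [21, 35, 42, 19],
    "CGPA": [8.1, np.nan, np.nan, 7.4],
    "Work Pressure": [np.nan, 3.0, 4.0, np.nan],
    "Profession": [None, "Teacher", "Chef", None],
})
report = missing_report(toy)
print(report)
```

On the real data the same call would be `missing_report(train_data)`.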


In [None]:
# Work on a copy so the raw train_data stays untouched
train_df_update = train_data.copy()

# Fill missing numeric values with the column mean
mean_imputed_columns = ['Work Pressure', 'CGPA', 'Study Satisfaction', 'Job Satisfaction', 'Financial Stress']
for col in mean_imputed_columns:
    train_df_update[col] = train_df_update[col].fillna(train_df_update[col].mean())

In [None]:
train_df_update.isnull().sum()
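
Mean imputation hides which rows were originally empty. According to the column descriptions, Work Pressure applies to working professionals and CGPA to students, so the gaps themselves may carry information. One possible way to keep it is to add a 0/1 indicator column before filling. This is only a sketch on toy data; the helper `impute_with_flag` and the `(missing)` column suffix are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def impute_with_flag(df, columns):
    """Mean-impute the given columns and add a 0/1 '<col> (missing)' flag for each."""
    out = df.copy()
    for col in columns:
        # Record where the value was missing before it gets filled
        out[f"{col} (missing)"] = out[col].isnull().astype(int)
        out[col] = out[col].fillna(out[col].mean())
    return out

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Work Pressure": [np.nan, 2.0, 4.0, np.nan],
    "CGPA": [8.0, np.nan, np.nan, 7.0],
})
filled = impute_with_flag(toy, ["Work Pressure", "CGPA"])
print(filled)
```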

# Exploratory Data Analysis (EDA)

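The heatmap visualizes missing values per column. In the raw data, Profession, Academic Pressure, Work Pressure, CGPA, Study Satisfaction and Job Satisfaction have the most gaps (112,803 missing entries for Study Satisfaction); the columns that were mean-imputed above no longer show gaps here.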

In [None]:
sns.heatmap(train_df_update.isnull(), yticklabels=False, cmap="viridis")

In [None]:
sns.catplot(x="Depression", y="Age", kind="bar", data = train_df_update)
plt.title('Age With Depression', fontsize=18, weight='bold')

**As age increases, the rate of depression decreases.**

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(data=train_df_update, x='Age', hue='Depression', bins=30, kde=True, alpha=0.5)
plt.title('Age With Depression', fontsize=18, weight='bold')

# histplot already adds the hue legend; a bare plt.legend() call can overwrite it
plt.show()
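
The age trend can also be checked numerically by computing the depression rate per age group. The sketch below uses made-up rows and arbitrary bin edges, both of which are assumptions; on the real data `toy` would be replaced by `train_df_update`.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Age": [18, 22, 25, 31, 38, 44, 52, 58],
    "Depression": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Bin edges are right-inclusive: (17, 30], (30, 45], (45, 60]
bins = [17, 30, 45, 60]
labels = ["18-30", "31-45", "46-60"]
toy["Age group"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# The mean of a 0/1 target is the share of positive cases in each group
rate_by_age = toy.groupby("Age group", observed=True)["Depression"].mean()
print(rate_by_age)
```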


**The depression rate rises roughly linearly as work pressure increases.**

In [None]:
sns.catplot(x="Work Pressure", y="Depression", kind="bar", data = train_df_update)
plt.title('Work Pressure With Depression', fontsize=18, weight='bold')
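
Whether the relationship is close to linear can be checked by fitting a straight line to the per-level depression rates and looking at the R² of that fit. This is a rough sketch on made-up values, not a result from the competition data.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): two rows per work pressure level
toy = pd.DataFrame({
    "Work Pressure": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "Depression":    [0, 0, 0, 0, 0, 1, 1, 0, 1, 1],
})

# Depression rate at each work pressure level
rate = toy.groupby("Work Pressure")["Depression"].mean()
levels = rate.index.to_numpy(dtype=float)
rates = rate.to_numpy()

# Least-squares straight line through the per-level rates
slope, intercept = np.polyfit(levels, rates, deg=1)
fitted = slope * levels + intercept

# R^2: share of the variance in the rates explained by the line
ss_res = ((rates - fitted) ** 2).sum()
ss_tot = ((rates - rates.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, R^2={r_squared:.3f}")
```

An R² close to 1 would support the "roughly linear" reading; a low value would suggest a curved or stepwise pattern instead.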

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(data=train_df_update, x='Work Pressure', hue='Depression', bins=30, kde=True, alpha=0.5)
plt.title('Work Pressure With Depression', fontsize=18, weight='bold')

# histplot already adds the hue legend; a bare plt.legend() call can overwrite it
plt.show()

In [None]:
sns.catplot(x="Work/Study Hours", y="Depression", kind="bar", data = train_df_update)
plt.title('Work/Study Hours With Depression', fontsize=18, weight='bold')

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(data=train_df_update, x='Work/Study Hours', hue='Depression', bins=30, kde=True, alpha=0.5)
plt.title('Work/Study Hours With Depression', fontsize=18, weight='bold')

# histplot already adds the hue legend; a bare plt.legend() call can overwrite it
plt.show()

In [None]:
sns.catplot(x="Depression", y="Work/Study Hours", kind="bar", data = train_df_update)
plt.title('Work/Study Hours With Depression', fontsize=18, weight='bold')

In [None]:
sns.catplot(x="Financial Stress", y="Depression", kind="bar", data = train_df_update)
plt.title('Financial Stress With Depression', fontsize=18, weight='bold')

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(data=train_df_update, x='Financial Stress', hue='Depression', bins=30, kde=True, alpha=0.5)
plt.title('Financial Stress With Depression', fontsize=18, weight='bold')

# histplot already adds the hue legend; a bare plt.legend() call can overwrite it
plt.show()


# Model

In [None]:
target = 'Depression'

numerical_columns = [
    "Age", "Academic Pressure", "Work Pressure", "CGPA",
    "Study Satisfaction", "Job Satisfaction", "Work/Study Hours",
    "Financial Stress"
]

one_hot_columns = [
    "Gender", "Working Professional or Student", "City", "Family History of Mental Illness"
]

label_columns = [
    "Degree", "Profession", "Dietary Habits", "Have you ever had suicidal thoughts ?", "Sleep Duration"
]

In [None]:
import pandas as pd

class DataPreprocessor:

    def __init__(self, numerical_columns, one_hot_columns, label_columns):
        self.numerical_columns = numerical_columns
        self.one_hot_columns = one_hot_columns
        self.label_columns = label_columns

        self.scaler = StandardScaler()
        # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and later removed
        self.one_hot_encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
        
        self.label_encoders = {
            col: OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1) for col in self.label_columns
        }

        self.one_hot_feature_names = None

    def fit(self, df):
        self.scaler.fit(df[self.numerical_columns])

        self.one_hot_encoder.fit(df[self.one_hot_columns])
        self.one_hot_feature_names = self.one_hot_encoder.get_feature_names_out(self.one_hot_columns)

        for col in self.label_columns:
            self.label_encoders[col].fit(df[[col]])

    def transform(self, df):
        df_scaled = df.copy()

        df_scaled[self.numerical_columns] = self.scaler.transform(df[self.numerical_columns])

        encoded_columns = self.one_hot_encoder.transform(df[self.one_hot_columns])
        encoded_df = pd.DataFrame(encoded_columns, columns=self.one_hot_feature_names, index=df.index)

        for col in self.label_columns:
            df_scaled[col] = self.label_encoders[col].transform(df[[col]])

        df_final = pd.concat([df_scaled.drop(self.one_hot_columns, axis=1), encoded_df], axis=1)

        return df_final

    def fit_transform(self, df):
        self.fit(df)
        return self.transform(df)
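
For comparison, the same three steps (scaling, one-hot encoding, ordinal encoding) can be sketched with scikit-learn's `ColumnTransformer`. This is an alternative, not the class used in this notebook: the toy frames and their values are assumptions, and `drop='first'` is left out to keep the example small. It mainly shows what happens to categories that were not seen during `fit`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy data (illustrative only)
train = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "City": ["Pune", "Delhi", "Pune"],
    "Degree": ["BSc", "MBA", "BSc"],
})
test = pd.DataFrame({
    "Age": [30.0],
    "City": ["Goa"],    # city not seen during fit
    "Degree": ["PhD"],  # degree not seen during fit
})

transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age"]),
        # Unknown categories become an all-zero row
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["City"]),
        # Unknown categories are encoded as -1
        ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["Degree"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then reuse the fitted encoders on the test rows
transformer.fit(train)
encoded = transformer.transform(test)
print(encoded)
```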


In [None]:
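train_df_update = train_df_update.drop(['id', 'Name'], axis=1)

# Apply the same imputation to the test set as to the training set:
# training-set means for the mean-imputed columns, then the placeholder values.
mean_imputed_columns = ['Work Pressure', 'CGPA', 'Study Satisfaction', 'Job Satisfaction', 'Financial Stress']
test_data[mean_imputed_columns] = test_data[mean_imputed_columns].fillna(train_df_update[mean_imputed_columns].mean())

for df in (train_df_update, test_data):
    df[one_hot_columns + label_columns] = df[one_hot_columns + label_columns].fillna('None')
    df[numerical_columns] = df[numerical_columns].fillna(-1)

preprocessor = DataPreprocessor(numerical_columns, one_hot_columns, label_columns)

# Fit on the training data only, then transform both sets
preprocessor.fit(train_df_update)

train_df_update = preprocessor.transform(train_df_update)
test_data = preprocessor.transform(test_data)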

In [None]:
from xgboost import XGBClassifier
from catboost import CatBoostClassifier 
from lightgbm import LGBMClassifier

In [None]:
x = train_df_update.drop(target, axis=1)
y = train_df_update[target]

# use_label_encoder is deprecated in recent XGBoost releases and is not needed here
xgb_model = XGBClassifier(eval_metric='logloss')
lgb_model = LGBMClassifier()
cat_model = CatBoostClassifier(verbose=0)

In [None]:
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

oof_preds_xgb = np.zeros(len(train_df_update))
oof_preds_lgb = np.zeros(len(train_df_update))
oof_preds_cat = np.zeros(len(train_df_update))
oof_preds_avg = np.zeros(len(train_df_update))

for train_index, valid_index in k_fold.split(x):
    x_train, x_valid = x.iloc[train_index], x.iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    
    xgb_model.fit(x_train, y_train)
    lgb_model.fit(x_train, y_train)
    cat_model.fit(x_train, y_train)

    xgb_preds = xgb_model.predict_proba(x_valid)[:, 1]
    lgb_preds = lgb_model.predict_proba(x_valid)[:, 1]
    cat_preds = cat_model.predict_proba(x_valid)[:, 1]
    
    oof_preds_xgb[valid_index] = xgb_preds
    oof_preds_lgb[valid_index] = lgb_preds
    oof_preds_cat[valid_index] = cat_preds
    
    preds_avg = (xgb_preds + lgb_preds + cat_preds) / 3
    oof_preds_avg[valid_index] = preds_avg

score_xgb = roc_auc_score(y, oof_preds_xgb)
score_lgb = roc_auc_score(y, oof_preds_lgb)
score_cat = roc_auc_score(y, oof_preds_cat)
score_avg = roc_auc_score(y, oof_preds_avg)

In [None]:
print(f'ROC AUC XGBoost: {score_xgb:.5f}')
print(f'ROC AUC LightGBM: {score_lgb:.5f}')
print(f'ROC AUC CatBoost: {score_cat:.5f}')
print(f'ROC AUC Average: {score_avg:.5f}')
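
The out-of-fold (OOF) idea behind these scores is that every row is predicted exactly once, by a model that never saw it during training. A compact sketch of that pattern follows. It makes several assumptions: synthetic data from `make_classification`, a `LogisticRegression` standing in for the boosted models, and `StratifiedKFold` swapped in for `KFold` so that each fold keeps the class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Start with NaN so unfilled rows would be easy to spot
oof_preds = np.full(len(y), np.nan)

for train_index, valid_index in folds.split(X, y):
    # A fresh model per fold, trained without the validation rows
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_index], y[train_index])
    oof_preds[valid_index] = model.predict_proba(X[valid_index])[:, 1]

# One score over all rows, each predicted by a model that did not train on it
oof_auc = roc_auc_score(y, oof_preds)
print(f"OOF ROC AUC: {oof_auc:.3f}")
```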

# Submission

In [None]:
total_score = score_xgb + score_lgb + score_cat

weight_xgb = score_xgb / total_score
weight_lgb = score_lgb / total_score
weight_cat = score_cat / total_score

x_test = test_data.drop(columns=['id', 'Name'])

# Note: these are the models as fitted on the last cross-validation fold
xgb_preds = xgb_model.predict_proba(x_test)[:, 1]
lgb_preds = lgb_model.predict_proba(x_test)[:, 1]
cat_preds = cat_model.predict_proba(x_test)[:, 1]

preds_avg = (xgb_preds * weight_xgb + lgb_preds * weight_lgb + cat_preds * weight_cat)
preds_avg = (preds_avg >= 0.5).astype(int)

submit = pd.DataFrame({
    'id': test_data['id'],
    'prediction': preds_avg.flatten()
})


submit.to_csv("../working/sub_mission.csv", index=False)

print(submit)
print(submit['prediction'].describe())
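
Dividing each score by the sum of all scores gives weights that add up to 1. When the individual AUC values are close together, those weights are almost equal, so the blend behaves much like a plain average. The short sketch below illustrates the arithmetic; the three scores are hypothetical and are not results from this notebook.

```python
# Hypothetical OOF scores, chosen only to illustrate the arithmetic
scores = {"xgb": 0.970, "lgb": 0.975, "cat": 0.974}

total = sum(scores.values())
weights = {name: score / total for name, score in scores.items()}

for name, weight in weights.items():
    print(f"{name}: {weight:.4f}")

# Gap between the largest and smallest weight
spread = max(weights.values()) - min(weights.values())
print(f"spread: {spread:.4f}")
```

If a stronger preference for the best model is wanted, the weights would have to be derived differently, for example from ranks or from a search on the out-of-fold predictions.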
