# 🧠 Exploring Mental Health Data  
**Kaggle Playground Series S4E11 | December 2024**  
**Author:** Abdul Raheem  
**Objective:**  
This notebook tackles a binary classification problem: predicting whether an individual is likely to experience depression, using mental health survey data.  

Achieved **94.024% accuracy**, closely trailing the top submission (94.184%).  
This notebook walks through EDA, feature engineering, model building, hyperparameter tuning, and ensembling techniques used to build the final solution.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 1. Importing Data and Initial Exploration

# Input Data

In [None]:
train = pd.read_csv("/kaggle/input/mental-health3/train.csv")
test = pd.read_csv("/kaggle/input/mental-health3/test.csv")
original = pd.read_csv("/kaggle/input/depression-surveydataset-for-analysis/final_depression_dataset_1.csv")

datasets = [train, test, original]

original.head()

In [None]:
test2 = pd.read_csv("/kaggle/input/mental-health3/test.csv")

test2.columns

In [None]:

print(train.shape)
print(test.shape)
print(original.shape)


In [None]:
train.head()

In [None]:
train.columns

In [None]:
train.isna().sum()

In [None]:
# Identify columns with null values and their data types
null_columns_dtypes = train.dtypes[train.isnull().any()]

# Display the results
print("Columns with null values and their data types:")
print(null_columns_dtypes)

## 2. Feature Engineering

# Profession Categories

In [None]:

# Collect the distinct 'Profession' values in each dataset
set1 = set(train['Profession'])
set2 = set(test['Profession'])

# Values that appear in train but not in test
not_in_df2 = set1 - set2

# Values that appear in test but not in train
not_in_df1 = set2 - set1

# Symmetric difference: values that appear in exactly one of the two datasets
not_in_both = not_in_df2.union(not_in_df1)

# Display results
print("Values in train but not in test:", not_in_df2,"\n")
print("Values in test but not in train:", not_in_df1,"\n")
print("Values present in only one dataset:", not_in_both)

**> This tells us that a few responses in the Profession field are not accurate.**
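
The comparison above amounts to a set symmetric difference, which Python exposes through the `^` operator. A minimal sketch on made-up profession lists (the values are illustrative and not taken from the competition data):

```python
# Toy stand-ins for train['Profession'] and test['Profession'] (illustrative values only)
train_professions = ["Teacher", "Chef", "Pilot", "Yogesh"]
test_professions = ["Teacher", "Chef", "Pilot", "Unveil"]

only_in_train = set(train_professions) - set(test_professions)
only_in_test = set(test_professions) - set(train_professions)

# `^` returns the values that appear in exactly one of the two sets
in_one_only = set(train_professions) ^ set(test_professions)

print(sorted(only_in_train))
print(sorted(only_in_test))
print(sorted(in_one_only))
```

Values that show up on only one side are good candidates for typos or free-text noise, since a genuine category would normally appear in both splits.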

In [None]:
value_counts = train['Profession'].value_counts(ascending=True)

filtered_counts = value_counts[value_counts < 50]

filtered_counts

In [None]:

for df in datasets:
    profession_counts = df['Profession'].value_counts()
    df['Profession'] = df['Profession'].apply(lambda x: x if profession_counts.get(x, 0) >= 10 else np.nan)

for df in datasets:
    df['Profession'] = df['Profession'].fillna(df['Working Professional or Student'])
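
The rare-value replacement can also be written without a row-wise `apply`. The sketch below is one possible vectorized form on a toy frame; the data and the `min_count` threshold of 3 are assumptions chosen to keep the example small (the notebook uses 10).

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the notebook, values are made up
df = pd.DataFrame({
    "Profession": ["Teacher"] * 4 + ["Chef"] * 3 + ["Yogesh", np.nan],
    "Working Professional or Student": ["Working Professional"] * 8 + ["Student"],
})

min_count = 3  # hypothetical threshold for this toy example

# Frequency of each row's own profession (NaN where the profession is missing)
counts = df["Profession"].map(df["Profession"].value_counts())

# Keep frequent professions, blank out rare ones, then fall back to the role column
df["Profession"] = (
    df["Profession"]
    .where(counts >= min_count)
    .fillna(df["Working Professional or Student"])
)

print(df["Profession"].tolist())
```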

In [None]:

categories = {
    'Business_and_Finance': ['Business Analyst', 'Financial Analyst', 'Investment Banker', 'Accountant', 'Marketing Manager', 
                             'Manager', 'Consultant'],
    'Engineering': ['Software Engineer', 'Civil Engineer', 'Mechanical Engineer', 'Architect'],
    'Healthcare': ['Doctor', 'Pharmacist'],  # Nurse is not listed in the original professions
    'Creative_and_Design': ['UX/UI Designer', 'Graphic Designer', 'Content Writer', 'Digital Marketer'],
    'Education': ['Teacher', 'Educational Consultant', 'Academic'],
    'Technology': ['Data Scientist', 'Software Engineer'],  # Software Engineer also appears under Engineering; in the flattened map below the later entry (Technology) wins
    'Skilled_Trades': ['Chef', 'Electrician', 'Plumber', 'Customer Support'],
    'Law_and_Government': ['Lawyer', 'Judge'],
    'Science_and_Research': ['Researcher', 'Chemist'],
    'Sales_and_Marketing': ['Sales Executive', 'Travel Consultant'],
    'Other_or_Miscellaneous': ['Entrepreneur', 'Pilot'],
}


# Flatten the categories to map professions to their categories
category_map = {profession: category for category, professions in categories.items() for profession in professions}


for df in datasets:
    df['Profession Category'] = df['Profession'].apply(lambda x: category_map.get(x, x) if x not in ['Student', 'Working Professional'] else x)


# Show the unique values of the 'Profession Category' column after mapping
print(train['Profession Category'].unique())
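
One detail of the flattening step is easy to miss: when a profession is listed under two categories, the dictionary comprehension keeps only the last one. A small sketch with a deliberately duplicated entry (toy dictionary, not the full mapping used above):

```python
# Toy grouping with one profession deliberately listed under two categories
categories = {
    "Engineering": ["Software Engineer", "Civil Engineer"],
    "Technology": ["Data Scientist", "Software Engineer"],
}

# Same flattening pattern as in the notebook
category_map = {
    profession: category
    for category, professions in categories.items()
    for profession in professions
}

# Later assignments to the same key overwrite earlier ones
print(category_map["Software Engineer"])

# Unmapped values fall through unchanged via dict.get(x, x)
print(category_map.get("Student", "Student"))
```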

# Degree Categories

In [None]:
train['Degree'].unique()

In [None]:
value_counts = train['Degree'].value_counts(ascending=True)

filtered_counts = value_counts[value_counts < 50]

filtered_counts

In [None]:

for df in datasets:
    degree_counts = df['Degree'].value_counts()
    
    # Replace values with NaN if their frequency is less than 5
    df['Degree'] = df['Degree'].apply(lambda x: x if degree_counts.get(x, 0) >= 5 else np.nan)

    # Step 1: Find the mode (most frequent category) in 'Degree'
    most_frequent_degree = df['Degree'].mode()[0]

    # Step 2: Fill NaN values with the most frequent category
    df['Degree'] = df['Degree'].fillna(most_frequent_degree)

train['Degree'].unique()

In [None]:
degree_categories = {
    'High Difficulty': [
        'PhD', 'MBA', 'M.Tech', 'M.S', 'M.Com', 'M.Ed', 'MD', 
        'MBBS', 'M.Pharm', 'LLM', 'M.Arch','MSc'],
    'Moderate Difficulty': [
        'BE', 'B.Tech', 'BBA', 'BCA', 'BSc', 'B.Ed', 'B.Arch', 
        'B.Pharm', 'BA', 'MCA', 'ME', 'MA','LLB'],
    'Low Difficulty': [
        'Class 12', 'BHM', 'L.Ed', 'MHM', 'B.Com']}


category_map = {degree: category for category, degrees in degree_categories.items() for degree in degrees}

for df in datasets:
    
    df['Degree Category'] = df['Degree'].apply(lambda x: category_map.get(x, x) if pd.notna(x) else x)

train['Degree Category'].unique()

In [None]:
train['Working Professional or Student'].value_counts()

# Work/Academic Pressure

In [None]:
for df in datasets:

    df['Work/Academic Pressure'] = np.nan
    
    df.loc[df['Working Professional or Student']=='Student', 'Work/Academic Pressure'] = df['Academic Pressure']
    df.loc[df['Working Professional or Student']=='Working Professional', 'Work/Academic Pressure'] = df['Work Pressure']

train.head(10)
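
An alternative way to build the combined column is `np.where`, shown here on a toy frame. One assumption differs from the cell above: every row that is not a student is treated as a working professional, whereas the notebook leaves any other role as NaN.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the notebook (values are made up)
df = pd.DataFrame({
    "Working Professional or Student": ["Student", "Working Professional", "Student"],
    "Academic Pressure": [4.0, np.nan, np.nan],
    "Work Pressure": [np.nan, 2.0, np.nan],
})

is_student = df["Working Professional or Student"] == "Student"

# Take Academic Pressure for students and Work Pressure for everyone else
df["Work/Academic Pressure"] = np.where(
    is_student, df["Academic Pressure"], df["Work Pressure"]
)

print(df["Work/Academic Pressure"].tolist())
```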

In [None]:
train.isna().sum()

In [None]:
# Define a function to fill NaN using the median based on the 'City'
def fill_na_with_median(row, acad_dict, wp_dict, group_acad_median, group_wp_median):

    if row['Working Professional or Student']=='Student':
        if pd.isna(row['Work/Academic Pressure']):
            acad_median = acad_dict.get(row['City'], row['Academic Pressure'])  # Use the median if available

            if pd.isna(acad_median):
                return group_acad_median
            return acad_median

        return row['Work/Academic Pressure']
        
    elif row['Working Professional or Student']=='Working Professional':
        if pd.isna(row['Work/Academic Pressure']):
            work_median = wp_dict.get(row['City'], row['Work Pressure'])  

            if pd.isna(work_median):
                return group_wp_median  # Fall back to the overall working-professional median
            return work_median
        
        return row['Work/Academic Pressure']


for df in datasets:
    
    # Filter data by working professional or student
    student_df = df[df['Working Professional or Student']=='Student']
    WP_df = df[df['Working Professional or Student']=='Working Professional']
    
    # Compute the median of Academic/Work pressure grouped by 'City' for the filtered data
    median_academic_pressure = student_df.groupby('City')['Academic Pressure'].median()
    median_work_pressure = WP_df.groupby('City')['Work Pressure'].median()

    student_median = df[df['Working Professional or Student'] == 'Student']['Academic Pressure'].median()
    wp_median = df[df['Working Professional or Student'] == 'Working Professional']['Work Pressure'].median()

    df['Work/Academic Pressure'] = df.apply(fill_na_with_median, axis=1, acad_dict=median_academic_pressure, wp_dict=median_work_pressure, group_acad_median=student_median, group_wp_median=wp_median)


In [None]:
train.isna().sum()
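
The city-median fill with an overall fallback can be sketched in vectorized form with `groupby(...).transform`. This toy example covers a single column and a single role; the notebook applies the same idea separately to students and working professionals.

```python
import numpy as np
import pandas as pd

# Toy data: Agra has no observed value, so it must use the fallback
df = pd.DataFrame({
    "City": ["Pune", "Pune", "Pune", "Delhi", "Delhi", "Agra"],
    "Academic Pressure": [2.0, 4.0, np.nan, 5.0, np.nan, np.nan],
})

# Median of each row's city, broadcast back to the original row order
city_median = df.groupby("City")["Academic Pressure"].transform("median")

# Overall median as a fallback for cities with no observed values
overall_median = df["Academic Pressure"].median()

df["Academic Pressure"] = (
    df["Academic Pressure"].fillna(city_median).fillna(overall_median)
)

print(df["Academic Pressure"].tolist())
```

Because `transform` returns a Series aligned with the original index, the two `fillna` calls express the fallback chain directly and avoid a Python-level loop over rows.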

In [None]:
# Define a function to fill NaN using the median based on the 'City'
def fill_na_with_median(row, acad_dict, wp_dict, group_acad_median, group_wp_median):

    if row['Working Professional or Student']=='Student':
        if pd.isna(row['Academic Pressure']):
            acad_median = acad_dict.get(row['City'], row['Academic Pressure'])  # Use the median if available

            if pd.isna(acad_median):
                return group_acad_median
            return acad_median

        return row['Academic Pressure']
        
    elif row['Working Professional or Student']=='Working Professional':
        if pd.isna(row['Work Pressure']):
            work_median = wp_dict.get(row['City'], row['Work Pressure'])  

            if pd.isna(work_median):
                return group_wp_median  # Fall back to the overall working-professional median
            return work_median
        
        return row['Work Pressure']


for df in datasets:
    
    # Filter data by working professional or student
    student_df = df[df['Working Professional or Student']=='Student']
    WP_df = df[df['Working Professional or Student']=='Working Professional']
    
    # Compute the median of Academic/Work pressure grouped by 'City' for the filtered data
    median_academic_pressure = student_df.groupby('City')['Academic Pressure'].median()
    median_work_pressure = WP_df.groupby('City')['Work Pressure'].median()

    student_median = df[df['Working Professional or Student'] == 'Student']['Academic Pressure'].median()
    wp_median = df[df['Working Professional or Student'] == 'Working Professional']['Work Pressure'].median()

    df['Academic Pressure'] = df.apply(fill_na_with_median, axis=1, acad_dict=median_academic_pressure, wp_dict=median_work_pressure, group_acad_median=student_median, group_wp_median=wp_median)
    df['Work Pressure'] = df.apply(fill_na_with_median, axis=1, acad_dict=median_academic_pressure, wp_dict=median_work_pressure, group_acad_median=student_median, group_wp_median=wp_median)

In [None]:
train.isna().sum()

In [None]:
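# fill_na_with_median returns the role-appropriate pressure for every row, so after the
# assignments above each column also carries the other role's values.
# Reset those cross-role cells to NaN.
for df in datasets:
    df.loc[df['Working Professional or Student'] == 'Student', 'Work Pressure'] = np.nan
    df.loc[df['Working Professional or Student'] == 'Working Professional', 'Academic Pressure'] = np.nan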

train.isna().sum()

# Job/Study Satisfaction

In [None]:
for df in datasets:

    # Combine into a single Satisfaction column
    df['Satisfaction'] = df.apply(lambda row: row['Study Satisfaction'] if row['Working Professional or Student'] == 'Student' else row['Job Satisfaction'],axis=1)

train.isna().sum()

In [None]:
# Define a function to fill NaN 'Satisfaction' using the median for the row's City and Degree/Profession Category
def fill_na_with_median(row, acad_dict, wp_dict, group_acad_median, group_wp_median):

    if row['Working Professional or Student']=='Student':
        if pd.isna(row['Satisfaction']):
            # Median for this (City, Degree Category) pair, if that pair exists
            acad_median = acad_dict.get((row['City'], row['Degree Category']), row['Study Satisfaction'])

            if pd.isna(acad_median):
                # Fall back to the median for the row's Degree Category
                return group_acad_median.get(row['Degree Category'], np.nan)
            return acad_median
        
        return row['Satisfaction']
        
    elif row['Working Professional or Student']=='Working Professional':
        if pd.isna(row['Satisfaction']):
            # Median for this (City, Profession Category) pair, if that pair exists
            work_median = wp_dict.get((row['City'], row['Profession Category']), row['Job Satisfaction'])

            if pd.isna(work_median):
                # Fall back to the median for the row's Profession Category
                return group_wp_median.get(row['Profession Category'], np.nan)
            return work_median
        
        return row['Satisfaction']



for df in datasets:
    
    # Filter data by working professional or student
    student_df = df[df['Working Professional or Student']=='Student']
    WP_df = df[df['Working Professional or Student']=='Working Professional']
    
    # Compute the median satisfaction for each (City, category) pair in the filtered data
    median_academic = student_df.groupby(['City','Degree Category'])['Study Satisfaction'].median()
    median_work = WP_df.groupby(['City','Profession Category'])['Job Satisfaction'].median()

    # Category-level medians, used as a fallback when the (City, category) median is missing
    student_median = student_df.groupby('Degree Category')['Study Satisfaction'].median()
    wp_median = WP_df.groupby('Profession Category')['Job Satisfaction'].median()

    df['Satisfaction'] = df.apply(fill_na_with_median, axis=1, acad_dict=median_academic, wp_dict=median_work, group_acad_median=student_median, group_wp_median=wp_median)


train.isna().sum()

# CGPA

In [None]:
for df in datasets:
    # Calculate the median for each (City, Degree Category) combination
    grouped_medians = df.groupby(['City', 'Degree Category'])['CGPA'].median().to_dict()
    
    # Map only for students
    df['CGPA'] = df.apply(
        lambda row: grouped_medians.get((row['City'], row['Degree Category']), row['CGPA']) 
        if pd.isna(row['CGPA']) and row['Working Professional or Student'] == 'Student' else row['CGPA'],
        axis=1)

train.isna().sum()

# Sleep Distribution

In [None]:
sleep_counts = train['Sleep Duration'].value_counts()

sleep_counts

In [None]:
import re
# Define a function to check if a number exists
def contains_number(value):
    # Check if the string contains any digits
    return bool(re.search(r'\d', str(value)))

for df in datasets:
    # Keep values that contain a digit; everything else becomes NaN
    has_number = df['Sleep Duration'].apply(contains_number)
    df['Sleep Duration'] = df['Sleep Duration'].where(has_number)

train.isna().sum()  #should see 12 counts of nulls
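
The digit check can also be done with the pandas string accessor instead of a custom function. A minimal sketch on a made-up Series (the entries are illustrative):

```python
import numpy as np
import pandas as pd

# Toy 'Sleep Duration' values: some usable, some free text, one missing
sleep = pd.Series(["7-8 hours", "Less than 5 hours", "Moderate", "No", np.nan, "45"])

# True where the entry contains at least one digit; missing entries count as False
has_digit = sleep.str.contains(r"\d", regex=True, na=False)

# Keep entries with a digit and turn everything else into NaN
cleaned = sleep.where(has_digit)

print(cleaned.tolist())
```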

In [None]:

# Function to flag high sleep durations
def is_high_sleep_duration(value):
    # Match single numbers or ranges (e.g., '49 hours', '45-48 hours')
    match_single = re.match(r'(\d+)\s*hours', str(value))  # Single value with "hours", e.g., '49 hours'
    match_single_2 = re.match(r'^(\d+)$', str(value))  # Single numeric value, e.g., '45'
    match_range = re.match(r'(\d+)-(\d+)\s*hours', str(value))  # Range, e.g., '45-48 hours'

    # Handle ranges (e.g., '45-48 hours')
    if match_range:
        # Extract the range and calculate the average daily sleep
        start, end = map(int, match_range.groups())
        avg_daily_sleep = (start + end) / 2 / 2  # Assume the range is over 2 days
        return avg_daily_sleep > 15  # Flag if daily sleep exceeds 15 hours per day

    # Handle single numbers with "hours"
    if match_single:
        # Extract the single number and check if it exceeds the threshold
        hours = int(match_single.group(1))
        return hours > 15  # Flag if it exceeds 15 hours

    # Handle single numeric values without "hours"
    if match_single_2:
        # Extract the single number and check if it exceeds the threshold
        hours = int(match_single_2.group(1))
        return hours > 15  # Flag if it exceeds 15 hours

    # Return False for non-matching or valid cases
    return False

for df in datasets:
    
    # Apply the function to identify high sleep durations
    df['High Sleep Duration'] = df['Sleep Duration'].apply(is_high_sleep_duration)

    # Set the flagged (implausibly long) durations to NaN
    df.loc[df['High Sleep Duration'], 'Sleep Duration'] = np.nan


train.isna().sum()

In [None]:
for df in datasets:

    df.drop('High Sleep Duration', axis=1, inplace=True)

train.columns

In [None]:
for df in datasets:

    most_frequent_sleep = df['Sleep Duration'].mode()[0]

    # Fill NaN values with the most frequent category
    df['Sleep Duration'] = df['Sleep Duration'].fillna(most_frequent_sleep)

train.isna().sum()

In [None]:
sleep_counts = train['Sleep Duration'].value_counts()

sleep_counts

In [None]:
# Function to clean and classify sleep duration values
def clean_and_classify_sleep_duration(duration):
    # Check if the duration is a specific text category
    if str(duration) == 'Less than 5 hours':
        return 'Unhealthy'
    elif str(duration) == 'More than 8 hours':
        return 'Healthy'
    
    # Check if the duration is a valid sleep range
    if isinstance(duration, str) and 'hours' in duration:
        parts = duration.split(' ')

        # Checking if it's a valid range like '9-5 hours' or '9-6 hours'
        if '-' in parts[0]:
            start, end = map(int, parts[0].split('-'))
            # Calculate the average hours in the range
            avg_hours = np.mean([start, end])
            
            # Classify based on the average hours
            if avg_hours < 5:
                return 'Unhealthy'  # Less than 5 hours
            elif 5 <= avg_hours <= 7.5:  # Extended Moderate range from 5 to 7.5
                return 'Moderate'
            elif 7.5 < avg_hours <= 9:
                return 'Healthy'  # Above 7.5 and up to 9 hours
            else:
                return 'Unhealthy'  # More than 9 hours
                
        # Handle individual hour values
        try:
            hours = int(parts[0])
            if hours < 5:
                return 'Unhealthy'  # Less than 5 hours
            elif 5 <= hours <= 7.5:  # Extended Moderate range from 5 to 7.5
                return 'Moderate'
            elif 7.5 < hours <= 9:
                return 'Healthy'  # Above 7.5 and up to 9 hours
            else:
                return 'Unhealthy'  # More than 9 hours
        except ValueError:
            return duration  

    return duration

for df in datasets:

    # Apply the function to clean and categorize 'Sleep Duration'
    df['Sleep Duration Category'] = df['Sleep Duration'].apply(clean_and_classify_sleep_duration)




# Verify the result
train['Sleep Duration Category'].value_counts()
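
The range and single-value branches above share the same thresholds, so the logic can be condensed by first parsing the text into an average number of hours and then classifying that number. The sketch below is one way to do this; `average_hours` and `classify` are hypothetical helper names, and the text labels such as 'Less than 5 hours' would still need the explicit handling shown in the cell above.

```python
import re

def average_hours(text):
    """Return the mean of the numbers in strings like '7-8 hours' or '6 hours', else None."""
    match = re.fullmatch(r"(\d+)(?:-(\d+))?\s*hours", str(text).strip())
    if match is None:
        return None
    numbers = [int(g) for g in match.groups() if g is not None]
    return sum(numbers) / len(numbers)

def classify(hours):
    # Same cut-offs as the notebook: under 5 or over 9 is 'Unhealthy'
    if hours < 5 or hours > 9:
        return "Unhealthy"
    return "Moderate" if hours <= 7.5 else "Healthy"

samples = ["5-6 hours", "7-8 hours", "8-9 hours", "3-4 hours", "10-11 hours", "6 hours"]
labels = {text: classify(average_hours(text)) for text in samples}

for text, label in labels.items():
    print(text, "->", label)
```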




In [None]:
for df in datasets:
    sleep_counts = df['Sleep Duration Category'].value_counts()
    rare_values = sleep_counts[sleep_counts <= 5].index
    
    # Replace values with NaN if they occur 5 times or fewer
    df.loc[df['Sleep Duration Category'].isin(rare_values), 'Sleep Duration Category'] = np.nan
    
    most_frequent_sleep = df['Sleep Duration Category'].mode()[0]

    # Fill NaN values with the most frequent category
    df['Sleep Duration Category'] = df['Sleep Duration Category'].fillna(most_frequent_sleep)

In [None]:
train['Sleep Duration Category'].value_counts()

In [None]:
train.isna().sum()

In [None]:

# Create a histogram for the Age column of the training set
plt.figure(figsize=(8, 5))
plt.hist(train['Age'], bins=10, edgecolor='black', alpha=0.7)

# Add more x-axis ticks
x_ticks = np.arange(train['Age'].min(), train['Age'].max() + 1, 2)  # Change the step size as needed
plt.xticks(x_ticks)

# Add labels and title
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Dietary Habits

In [None]:
for df in datasets:
    most_frequent_diet = df['Dietary Habits'].mode()[0]

    # Fill NaN values with the most frequent category
    df['Dietary Habits'] = df['Dietary Habits'].fillna(most_frequent_diet)

In [None]:
train.isna().sum()

# Financial Stress

In [None]:
for df in datasets:
    mode_fin_stress = df['Financial Stress'].mode()[0]

    # Fill NaN values with the most frequent category
    df['Financial Stress'] = df['Financial Stress'].fillna(mode_fin_stress)

In [None]:
train.isna().sum()

# Diet Mapping

In [None]:
diet_mapping = {
    'Unhealthy': 3,
    'Moderate': 2,
    'Healthy': 1
}

for df in datasets:
    df['Dietary Habits Category'] = df['Dietary Habits'].map(diet_mapping)

    mode = df['Dietary Habits Category'].mode()[0]
    df['Dietary Habits Category'] = df['Dietary Habits Category'].fillna(mode)

# Degree Mapping

In [None]:
degree_mapping = {
    'High Difficulty': 3,
    'Moderate Difficulty': 2,
    'Low Difficulty': 1
}

for df in datasets:
    df['Degree Category Coded'] = df['Degree Category'].map(degree_mapping)
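
`Series.map` with a dictionary silently turns any label that is missing from the dictionary into NaN, so it can help to list those labels before modelling. A small sketch, where `'B.Sc (Hons)'` is a hypothetical unmapped value added for illustration:

```python
import pandas as pd

degree_mapping = {"High Difficulty": 3, "Moderate Difficulty": 2, "Low Difficulty": 1}

# The last entry is a made-up label that the mapping does not cover
degree_category = pd.Series(["High Difficulty", "Low Difficulty", "B.Sc (Hons)"])

coded = degree_category.map(degree_mapping)

# Labels missing from the mapping become NaN; collect them for inspection
unmapped = degree_category[coded.isna()].unique().tolist()

print(coded.tolist())
print(unmapped)
```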

# Suicidal Thoughts Response Mapping

In [None]:
# Mapping for suicidal thoughts response
suicidal_thoughts_mapping = {
    'Yes': 2,
    'No': 1
}

for df in datasets:
    # Apply the mapping to 'Have you ever had suicidal thoughts ?'
    df['Have you ever had suicidal thoughts ?'] = df['Have you ever had suicidal thoughts ?'].map(suicidal_thoughts_mapping)

    # Fill NaN values with the most frequent value (mode) in 'Have you ever had suicidal thoughts ?'
    df['Have you ever had suicidal thoughts ?'] = df['Have you ever had suicidal thoughts ?'].fillna(df['Have you ever had suicidal thoughts ?'].mode()[0])

# Family History mapping

In [None]:
family_history_mapping = {
    'Yes': 2,
    'No': 1
}

for df in datasets:
    # Apply the mapping to 'Family History of Mental Illness'
    df['Family History of Mental Illness'] = df['Family History of Mental Illness'].map(family_history_mapping)

    # Fill NaN values with the most frequent value (mode) in 'Family History of Mental Illness'
    df['Family History of Mental Illness'] = df['Family History of Mental Illness'].fillna(df['Family History of Mental Illness'].mode()[0])

In [None]:
train['Dietary Habits'].unique()

In [None]:
train.isna().sum()

# Assigning City Tier

In [None]:
# Get value counts for the City column
city_value_counts = train['City'].value_counts().reset_index()
city_value_counts.columns = ['City', 'Count']  # Rename columns for clarity

# Export the value counts to an Excel file
output_file = 'city_value_counts.xlsx'
city_value_counts.to_excel(output_file, index=False)

In [None]:
city_tier = pd.read_excel("/kaggle/input/city-tier/City Tier.xlsx")


In [None]:
# Create a dictionary from city tier for mapping
mapping_dict = city_tier.set_index('City')['Tier'].to_dict()

city_mapping = {
    'Molkata': 'Kolkata',
    'Less Delhi': 'Delhi',
    'Tolkata': 'Kolkata'
}


for df in datasets:
    # Update the City column based on the mapping for specific records
    df['City'] = df['City'].replace(city_mapping)

    df['City Tier'] = df['City'].map(mapping_dict)

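    # Step 1: Get the top 5 most frequent cities
    # (value_counts ranks every city by frequency; mode() would return only the value(s) tied for first place)
    top_5_cities = df['City'].value_counts().index[:5]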
    
    # Step 2: Randomly fill NaN values with one of the top 5 cities
    # Set the random seed for reproducibility
    random_state = 42
    np.random.seed(random_state)
    
    # Step 3: Generate random choices from the top 5 cities for the NaN values
    # Ensure we only generate enough values to fill the NaNs
    nan_indices = df['City Tier'].isna()
    
    # Randomly select cities for NaN values
    random_cities = np.random.choice(top_5_cities, size=nan_indices.sum(), replace=True)
    
    # Step 4: Fill NaN values in the 'City' column with the randomly selected cities
    df.loc[nan_indices, 'City'] = random_cities
    
    df['City Tier'] = df['City'].map(mapping_dict)
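
The sketch below illustrates the random fill on a toy Series. Note that `Series.mode` returns only the value or values tied for the highest count, whereas `value_counts` ranks every value by frequency. The example also uses a local `np.random.default_rng` generator, which is an assumption made here for reproducibility and not what the cell above does (it seeds NumPy's global state).

```python
import numpy as np
import pandas as pd

# Toy city column with three missing entries (values are made up)
city = pd.Series(["Pune"] * 5 + ["Delhi"] * 4 + ["Agra"] * 3 + ["Goa"] * 1 + [None] * 3)

# value_counts sorts by frequency (descending), so the first entries are the most common
top_cities = city.value_counts().index[:2].to_numpy()

# Series.mode, by contrast, returns only the value(s) tied for the highest count
mode_values = city.mode().tolist()

# A local Generator keeps the draw reproducible without touching NumPy's global seed
rng = np.random.default_rng(42)
missing = city.isna()
city.loc[missing] = rng.choice(top_cities, size=missing.sum(), replace=True)

print(list(top_cities), mode_values)
print(city.tail(3).tolist())
```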



In [None]:
train.isna().sum()

# Gender Coding

In [None]:
gender_mapping = {
    'Male': 1,
    'Female': 2
}

for df in datasets:
    
    df['Gender Coded'] = df['Gender'].map(gender_mapping)

train.head()

# Working Professional/Student Coding

In [None]:
wps_mapping = {
    'Working Professional': 1,
    'Student': 2
}

for df in datasets:
    
    df['Working Professional or Student Coded'] = df['Working Professional or Student'].map(wps_mapping)

train.head()

# One-hot encoding Profession Category

In [None]:
train = pd.get_dummies(train, columns=['Profession Category'], drop_first=False)
test = pd.get_dummies(test, columns=['Profession Category'], drop_first=False)
original = pd.get_dummies(original, columns=['Profession Category'], drop_first=False)
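
Encoding each dataset separately can produce different column sets when a category is absent from one of them. One common remedy, sketched here on toy frames, is to reindex the other frames to the training columns after encoding (the `dtype=int` argument also removes the need for a later integer cast):

```python
import pandas as pd

# Toy frames: 'Healthcare' appears in train but not in test
train_toy = pd.DataFrame({"Profession Category": ["Education", "Healthcare", "Student"]})
test_toy = pd.DataFrame({"Profession Category": ["Education", "Student"]})

train_dummies = pd.get_dummies(train_toy, columns=["Profession Category"], dtype=int)
test_dummies = pd.get_dummies(test_toy, columns=["Profession Category"], dtype=int)

# Give the test frame exactly the training columns;
# categories absent from test become all-zero columns
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
```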


In [None]:
train.columns

In [None]:
boolean_columns = ['Profession Category_Business_and_Finance',
       'Profession Category_Creative_and_Design',
       'Profession Category_Education', 'Profession Category_Engineering',
       'Profession Category_Finanancial Analyst',
       'Profession Category_HR Manager', 'Profession Category_Healthcare',
       'Profession Category_Law_and_Government',
       'Profession Category_Other_or_Miscellaneous',
       'Profession Category_Research Analyst',
       'Profession Category_Sales_and_Marketing',
       'Profession Category_Science_and_Research',
       'Profession Category_Skilled_Trades', 'Profession Category_Student',
       'Profession Category_Technology',
       'Profession Category_Working Professional']

train[boolean_columns] = train[boolean_columns].astype(int)
test[boolean_columns] = test[boolean_columns].astype(int)
original[boolean_columns] = original[boolean_columns].astype(int)

train.head()

In [None]:
train.columns

# Sleep Duration Mapping

In [None]:
train['Sleep Duration Category'].value_counts()

In [None]:
sdc_mapping = {
    'Healthy': 1,
    'Moderate': 2,
    'Unhealthy': 3
}

train['Sleep Duration Category Coded'] = train['Sleep Duration Category'].map(sdc_mapping)
test['Sleep Duration Category Coded'] = test['Sleep Duration Category'].map(sdc_mapping)
original['Sleep Duration Category Coded'] = original['Sleep Duration Category'].map(sdc_mapping)

train.columns

In [None]:
def corr(df):

    features = df.select_dtypes(include=['number'])
    
    # Compute the correlation matrix
    correlation_matrix = features.corr()

    # Extract correlations of independent variables with the target variable
    target_corr = correlation_matrix['Depression'].drop('Depression')  # Drop self-correlation
    
    # Display correlations with the target
    print("Correlations with the target variable:")
    print(target_corr)
    
    # Plot a heatmap of the full correlation matrix (optional)
    plt.figure(figsize=(15, 12))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Matrix')
    plt.show()

corr(train)

In [None]:
train.head()

# Error Code

# Model 1 - Random Forest

In [None]:
train_modified = train.copy()

In [None]:
train_modified.columns

In [None]:
X = train_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])

RF_X = X.drop('Depression', axis=1, inplace=False)
RF_Y = X['Depression']

RF_X

In [None]:
RF_X = RF_X.fillna(-1)

RF_X

In [None]:
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import accuracy_score


# Optuna Optimization - Random Forest

In [None]:



# # Define the objective function for Optuna
# def objective(trial, RF_X, RF_Y):
#     # Define hyperparameter space
#     n_estimators = trial.suggest_int('n_estimators', 50, 300)
#     max_depth = trial.suggest_int('max_depth', 5, 50, log=True)
#     min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
#     min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 5)
#     max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
    
#     # Define the Random Forest model
#     model = RandomForestClassifier(
#         n_estimators=n_estimators,
#         max_depth=max_depth,
#         min_samples_split=min_samples_split,
#         min_samples_leaf=min_samples_leaf,
#         max_features=max_features,
#         random_state=42
#     )
    
#     # Stratified K-Fold Cross-Validation
#     skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
#     scores = cross_val_score(model, RF_X, RF_Y, cv=skf, scoring='accuracy')
    
#     # Return the mean accuracy
#     return scores.mean()

# # Create an Optuna study
# study = optuna.create_study(direction='maximize')
# # objective also needs the data, so bind RF_X and RF_Y with a lambda
# study.optimize(lambda trial: objective(trial, RF_X, RF_Y), n_trials=50)  # Adjust n_trials as needed

# # Display the best hyperparameters and the best score
# print("Best Hyperparameters:", study.best_params)
# print("Best Cross-Validation Accuracy:", study.best_value)

In [None]:
# Number of folds
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize an array to hold out-of-fold (OOF) predictions
oof_predictions = np.zeros((RF_X.shape[0], len(np.unique(RF_Y))))

# Initialize a list to store fold accuracies
fold_accuracies = []

# Loop through Stratified K-Fold splits
for fold, (train_idx, val_idx) in enumerate(skf.split(RF_X, RF_Y)):
    print(f"Training fold {fold + 1}/{n_splits}...")
    
    # Split data into training and validation sets
    X_train, X_val = RF_X.iloc[train_idx], RF_X.iloc[val_idx]
    y_train, y_val = RF_Y.iloc[train_idx], RF_Y.iloc[val_idx]  # positional indexing, matching RF_X.iloc above
    
    # Train the Random Forest model
    model = RandomForestClassifier(
        n_estimators=142,
        max_depth=24,
        min_samples_split=3,
        min_samples_leaf=3,
        max_features='log2',
        random_state=42
    )
    
    model.fit(X_train, y_train)
    
    # Predict on the validation set
    val_preds = model.predict_proba(X_val)  # Use predict_proba for stacking (probability output)
    oof_predictions[val_idx] = val_preds  # Save predictions for stacking
    
    # Calculate accuracy for the fold
    fold_accuracy = accuracy_score(y_val, np.argmax(val_preds, axis=1))
    fold_accuracies.append(fold_accuracy)
    print(f"Fold {fold + 1} accuracy: {fold_accuracy:.4f}")

# Calculate the mean accuracy across folds
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage accuracy across all folds: {average_accuracy:.4f}")
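
For reference, scikit-learn's `cross_val_predict` produces the same kind of out-of-fold probability matrix in a single call. The sketch below uses synthetic data from `make_classification` as a stand-in for `RF_X` / `RF_Y`, and the small `n_estimators` value is chosen only to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary data standing in for RF_X / RF_Y
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_toy = RandomForestClassifier(n_estimators=25, random_state=42)

# Each row's probabilities come from the fold model that did not see that row
oof_proba = cross_val_predict(model_toy, X_toy, y_toy, cv=skf, method="predict_proba")

# Accuracy pooled over all out-of-fold predictions
oof_accuracy = accuracy_score(y_toy, oof_proba.argmax(axis=1))
print(oof_proba.shape, round(oof_accuracy, 4))
```

The pooled out-of-fold accuracy computed this way can differ slightly from the mean of the per-fold accuracies when the folds are not exactly the same size.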

In [None]:
# Get feature importances from the model fitted in the last fold
feature_importances = model.feature_importances_

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
    'Feature': RF_X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Display the features sorted by importance
print(importance_df)

In [None]:
RF_X.columns

In [None]:

# Save OOF predictions as a DataFrame (optional)
oof_RF = pd.DataFrame(oof_predictions, columns=[f"Class_{i}" for i in range(oof_predictions.shape[1])])
oof_RF['True_Label'] = RF_Y
oof_RF['Predicted_Label'] = (oof_RF['Class_0'] < oof_RF['Class_1']).astype(int)

oof_RF

In [None]:
from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(oof_RF['True_Label'], oof_RF['Predicted_Label'])

print(f"Accuracy: {accuracy:.4f}")

**> RF Model trained on all features performs better than the model trained on selected features**

# OOF Predictions : Random Forest

In [None]:

oof_RF

# Generating Predictions on Test Dataset

In [None]:
test.isna().sum()

In [None]:
test_modified = test.copy()

test_modified = test_modified.fillna(-1)

test_features = test_modified[['Age', 'Academic Pressure', 'Work Pressure', 'CGPA',
       'Study Satisfaction', 'Job Satisfaction',
       'Have you ever had suicidal thoughts ?', 'Work/Study Hours',
       'Financial Stress', 'Family History of Mental Illness',
       'Dietary Habits Category', 'Degree Category Coded', 'City Tier',
       'Gender Coded', 'Working Professional or Student Coded',
       'Profession Category_Business_and_Finance',
       'Profession Category_Creative_and_Design',
       'Profession Category_Education', 'Profession Category_Engineering',
       'Profession Category_Finanancial Analyst',
       'Profession Category_HR Manager', 'Profession Category_Healthcare',
       'Profession Category_Law_and_Government',
       'Profession Category_Other_or_Miscellaneous',
       'Profession Category_Research Analyst',
       'Profession Category_Sales_and_Marketing',
       'Profession Category_Science_and_Research',
       'Profession Category_Skilled_Trades', 'Profession Category_Student',
       'Profession Category_Technology',
       'Profession Category_Working Professional',
       'Sleep Duration Category Coded']]


In [None]:
test_features.columns

In [None]:
test_predictions = model.predict(test_features)  # Predicted class labels
test_predictions2 = model.predict_proba(test_features)[:, 1]  # Predicted probability of class 1

In [None]:

RF_submission = pd.DataFrame({
    'id': test_modified['id'],  # Identifier column from the test dataset
    'Depression': test_predictions
})

RF_test_proba = pd.DataFrame({
    'id': test_modified['id'],  # Identifier column from the test dataset
    'oof_RF_proba': test_predictions2
})

In [None]:
# Save the submission file
RF_submission.to_csv('RF_submission.csv', index=False)

# Model 2 - Logistic Regression

In [None]:
train_modified = train.copy()

X = train_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])

LogR_X = X.drop('Depression', axis=1, inplace=False)
LogR_Y = X['Depression']

LogR_X

In [None]:
LogR_X.isna().sum()

# Handling Nulls and Standardizing the columns

In [None]:
acad_mean = LogR_X['Academic Pressure'].mean()
LogR_X['Academic Pressure'] = LogR_X['Academic Pressure'].fillna(acad_mean)

work_mean = LogR_X['Work Pressure'].mean()
LogR_X['Work Pressure'] = LogR_X['Work Pressure'].fillna(work_mean)

cgpa_mean = LogR_X['CGPA'].mean()
LogR_X['CGPA'] = LogR_X['CGPA'].fillna(cgpa_mean)

studys_mean = LogR_X['Study Satisfaction'].mean()
LogR_X['Study Satisfaction'] = LogR_X['Study Satisfaction'].fillna(studys_mean)

jobs_mean = LogR_X['Job Satisfaction'].mean()
LogR_X['Job Satisfaction'] = LogR_X['Job Satisfaction'].fillna(jobs_mean)


In [None]:

from sklearn.preprocessing import StandardScaler


# Features to standardize
features_to_standardize = ['Academic Pressure', 'Work Pressure', 'CGPA', 'Study Satisfaction', 'Job Satisfaction']

# Initialize StandardScaler
scaler = StandardScaler()

# Apply scaler only to the selected features
LogR_X[features_to_standardize] = scaler.fit_transform(LogR_X[features_to_standardize])


In [None]:
LogR_X.head()
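
One possible way to keep the imputation and scaling steps above consistent between train and test is to wrap them in a scikit-learn transformer that is fitted on the training frame only. The sketch below is not the notebook's own code: the toy frames and the names `toy_train`, `toy_test`, `numeric_cols`, and `preprocess` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for LogR_X and the matching test features (hypothetical values).
toy_train = pd.DataFrame({"CGPA": [7.0, np.nan, 9.0], "Age": [20, 30, 40]})
toy_test = pd.DataFrame({"CGPA": [np.nan, 8.0], "Age": [25, 35]})

numeric_cols = ["CGPA"]  # columns to mean-impute and standardize

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
    ],
    remainder="passthrough",  # every other column is passed through unchanged
)

# The mean and standard deviation are learned from the training frame only...
train_ready = preprocess.fit_transform(toy_train)
# ...and the same statistics are reused on the test frame.
test_ready = preprocess.transform(toy_test)
```

The output is a NumPy array with the transformed columns first and the passed-through columns after them, so the column order should be checked before feeding it to a model.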

# Optuna Optimization - Logistic Regression

In [None]:
# import optuna
# import pandas as pd
# import numpy as np
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import StratifiedKFold
# from sklearn.preprocessing import StandardScaler
# from sklearn.metrics import log_loss, accuracy_score

# # Define Stratified K-Fold
# n_splits = 5
# skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)


# # Objective function for Optuna
# def objective(trial):
#     # Hyperparameter space
#     # None (no penalty) requires scikit-learn >= 1.2; it replaces the deprecated 'none' string
#     penalty = trial.suggest_categorical('penalty', ['l1', 'l2', 'elasticnet', None])
#     solver = trial.suggest_categorical('solver', ['lbfgs', 'liblinear', 'saga'])

    
#     # Prune unsupported penalty/solver combinations instead of returning NaN
#     if penalty is None and solver == 'liblinear':
#         raise optuna.TrialPruned()
#     if penalty == 'elasticnet' and solver != 'saga':
#         raise optuna.TrialPruned()
#     if penalty == 'l1' and solver not in ['liblinear', 'saga']:
#         raise optuna.TrialPruned()


#     l1_ratio = None
#     if penalty == 'elasticnet':
#         l1_ratio = trial.suggest_float('l1_ratio', 0.0, 1.0)

#     # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
#     C = trial.suggest_float('C', 1e-4, 1e2, log=True)

#     # Logistic Regression model
#     model = LogisticRegression(
#         penalty=penalty,
#         solver=solver,
#         l1_ratio=l1_ratio,
#         C=C,
#         random_state=42,
#         max_iter=5000
#     )

#     # Cross-validation
#     cv_scores = []
#     for train_idx, val_idx in skf.split(LogR_X, LogR_Y):
#         X_train, X_val = LogR_X.iloc[train_idx], LogR_X.iloc[val_idx]
#         y_train, y_val = LogR_Y.iloc[train_idx], LogR_Y.iloc[val_idx]

#         # Train and evaluate the model
#         model.fit(X_train, y_train)
#         y_pred_proba = model.predict_proba(X_val)[:, 1]
#         cv_scores.append(log_loss(y_val, y_pred_proba))

#     return np.mean(cv_scores)

In [None]:

# # Optimize hyperparameters with Optuna
# study = optuna.create_study(direction='minimize')
# study.optimize(objective, n_trials=50)

# # Best hyperparameters
# print("Best Hyperparameters:", study.best_params)

# Training Logistic Regression Model

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# Stratified K-Fold
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Train Logistic Regression using best hyperparameters and collect OOF predictions

model = LogisticRegression(
    penalty='l1',
    solver='saga',
    C=0.07173299574706905,
    l1_ratio=None,
    random_state=42,
    max_iter=5000
)

# Store accuracy for each fold
fold_accuracies = []

# Initialize arrays for OOF predictions
oof_predictions = np.zeros(X.shape[0])
true_labels = np.zeros(X.shape[0])

# Stratified K-Fold cross-validation
for fold, (train_idx, val_idx) in enumerate(skf.split(LogR_X, LogR_Y)):
    print(f"Running Fold {fold + 1}/{n_splits}...")

    # Split data
    X_train, X_val = LogR_X.iloc[train_idx], LogR_X.iloc[val_idx]
    y_train, y_val = LogR_Y.iloc[train_idx], LogR_Y.iloc[val_idx]

    # Train model
    model.fit(X_train, y_train)

    # Predict on validation set
    y_pred = model.predict(X_val)

    # Predict probabilities for validation set
    y_pred_proba = model.predict_proba(X_val)[:, 1]  # Probability for class 1

    # Store OOF predictions and true labels
    oof_predictions[val_idx] = y_pred_proba
    true_labels[val_idx] = y_val

    # Calculate accuracy for this fold
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# Save OOF predictions and true labels to a DataFrame
oof_LogR = pd.DataFrame({
    'OOF_Predictions': oof_predictions,
    'True_Labels': true_labels
})

# Calculate average accuracy
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Accuracy Across Folds: {average_accuracy:.4f}")



# OOF Predictions : Logistic Regression

In [None]:
oof_LogR
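
The training loop reports the mean of the per-fold accuracies. A pooled score over all out-of-fold rows can be computed directly from a frame like `oof_LogR`; the helper below is a minimal sketch on toy arrays, and the name `oof_accuracy` and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

def oof_accuracy(proba, y_true, threshold=0.5):
    """Accuracy of thresholded out-of-fold probabilities against the true labels."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true).astype(int)
    predicted = (proba > threshold).astype(int)
    return float(np.mean(predicted == y_true))

# Toy values standing in for oof_LogR['OOF_Predictions'] and oof_LogR['True_Labels'].
toy_proba = np.array([0.9, 0.2, 0.6, 0.4])
toy_labels = np.array([1, 0, 0, 0])
score = oof_accuracy(toy_proba, toy_labels)
print(f"Pooled OOF accuracy: {score:.4f}")
```

When the folds have nearly equal sizes, the pooled score and the mean of fold scores should be close; the pooled version is the one a downstream stacking model effectively sees.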

In [None]:
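test_modified = test.copy()

# Preprocess the test set the same way as LogR_X: fill the five sparse columns with
# the training means, then reuse the scaler that was fitted on the training data.
train_means = {'Academic Pressure': acad_mean, 'Work Pressure': work_mean, 'CGPA': cgpa_mean,
               'Study Satisfaction': studys_mean, 'Job Satisfaction': jobs_mean}
test_modified = test_modified.fillna(value=train_means)
test_modified[features_to_standardize] = scaler.transform(test_modified[features_to_standardize])

# Fill any remaining nulls with -1
test_modified = test_modified.fillna(-1)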

test_features = test_modified[['Age', 'Academic Pressure', 'Work Pressure', 'CGPA',
       'Study Satisfaction', 'Job Satisfaction',
       'Have you ever had suicidal thoughts ?', 'Work/Study Hours',
       'Financial Stress', 'Family History of Mental Illness',
       'Dietary Habits Category', 'Degree Category Coded', 'City Tier',
       'Gender Coded', 'Working Professional or Student Coded',
       'Profession Category_Business_and_Finance',
       'Profession Category_Creative_and_Design',
       'Profession Category_Education', 'Profession Category_Engineering',
       'Profession Category_Finanancial Analyst',
       'Profession Category_HR Manager', 'Profession Category_Healthcare',
       'Profession Category_Law_and_Government',
       'Profession Category_Other_or_Miscellaneous',
       'Profession Category_Research Analyst',
       'Profession Category_Sales_and_Marketing',
       'Profession Category_Science_and_Research',
       'Profession Category_Skilled_Trades', 'Profession Category_Student',
       'Profession Category_Technology',
       'Profession Category_Working Professional',
       'Sleep Duration Category Coded']]


In [None]:
test_predictions = model.predict(test_features)  # Predicted class labels
test_predictions2 = model.predict_proba(test_features)[:, 1]  # Predicted probability of class 1


LogR_submission = pd.DataFrame({
    'id': test_modified['id'],  # Identifier column from the test dataset
    'Depression': test_predictions
})

LogR_test_proba = pd.DataFrame({
    'id': test_modified['id'],  # Identifier column from the test dataset
    'oof_LogR_proba': test_predictions2
})

# Save the submission file
LogR_submission.to_csv('LogR_submission.csv', index=False)
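
Because `model` is refitted inside the cross-validation loop, the test predictions above come from the model trained on the last fold only. An alternative, sketched here on synthetic data and not used in this notebook, is to average the test probabilities of all fold models. The names `X_toy`, `y_toy`, `X_new`, and `fold_model` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic data standing in for the training features and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_new = X_toy[:10]  # stands in for the test features

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
avg_proba = np.zeros(len(X_new))

for train_idx, _ in skf_toy.split(X_toy, y_toy):
    # Fit a fresh model on each training split
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    # Accumulate this fold's share of the averaged class-1 probability
    avg_proba += fold_model.predict_proba(X_new)[:, 1] / skf_toy.get_n_splits()

avg_labels = (avg_proba > 0.5).astype(int)
```

Averaging usually reduces the variance that comes from any single split, at the cost of keeping one prediction pass per fold.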

# Logistic Regression Important Features

In [None]:
# Get feature importance (coefficients)
coefficients = model.coef_[0]
features = LogR_X.columns

# Create a DataFrame for feature importance
feature_importance = pd.DataFrame({
    'Feature': features,
    'Coefficient': coefficients,
    'Importance': abs(coefficients)  # Absolute value for importance
})

# Sort features by importance
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

print("Feature Importance:")
print(feature_importance)
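
Logistic regression coefficients live on the log-odds scale, so exponentiating them gives odds ratios, which can be easier to read. The snippet below is a small illustration with made-up coefficient values (`toy_coefs` is hypothetical, not the fitted `model.coef_`).

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients; in the notebook these would come from model.coef_[0].
toy_coefs = pd.Series({"Feature A": 0.8, "Feature B": -0.5, "Feature C": 0.0})

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in the feature, holding the others fixed.
odds_ratios = np.exp(toy_coefs)
print(odds_ratios.sort_values(ascending=False))
```

A value above 1 raises the odds and a value below 1 lowers them. With the L1 penalty, a coefficient of exactly 0 (odds ratio 1) means the feature was dropped by the model. Magnitudes are only directly comparable between features that share a scale, such as the standardized columns.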

# Model 3 - SVM

In [None]:
train_modified = train.copy()

X = train_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])


SVM_X = X.drop('Depression', axis=1, inplace=False)
SVM_Y = X['Depression']

SVM_X

In [None]:
SVM_X.isna().sum()

In [None]:
acad_mean = SVM_X['Academic Pressure'].mean()
SVM_X['Academic Pressure'] = SVM_X['Academic Pressure'].fillna(acad_mean)

work_mean = SVM_X['Work Pressure'].mean()
SVM_X['Work Pressure'] = SVM_X['Work Pressure'].fillna(work_mean)

cgpa_mean = SVM_X['CGPA'].mean()
SVM_X['CGPA'] = SVM_X['CGPA'].fillna(cgpa_mean)

studys_mean = SVM_X['Study Satisfaction'].mean()
SVM_X['Study Satisfaction'] = SVM_X['Study Satisfaction'].fillna(studys_mean)

jobs_mean = SVM_X['Job Satisfaction'].mean()
SVM_X['Job Satisfaction'] = SVM_X['Job Satisfaction'].fillna(jobs_mean)

In [None]:

from sklearn.preprocessing import StandardScaler


# Features to standardize
features_to_standardize = ['Age', 'Academic Pressure', 'Work Pressure', 'CGPA', 'Study Satisfaction', 'Job Satisfaction', 'Work/Study Hours', 'Financial Stress']

# Initialize StandardScaler
scaler = StandardScaler()

# Apply scaler only to the selected features
SVM_X[features_to_standardize] = scaler.fit_transform(SVM_X[features_to_standardize])


# Optuna Optimization

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

In [None]:

# import optuna


# # Stratified K-Fold setup
# n_splits = 5
# skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# # Initialize arrays for OOF predictions and true labels
# oof_predictions = np.zeros(len(SVM_Y))
# true_labels = np.zeros(len(SVM_Y))
# fold_accuracies = []

# # Define the Optuna objective function
# def objective(trial):
#     # Suggest hyperparameters for LinearSVC
#     params = {
#         "C": trial.suggest_float("C", 0.01, 10.0, log=True),
#         "max_iter": trial.suggest_int("max_iter", 1000, 50000, step=1000),
#     }

#     fold_accuracies = []

#     for train_idx, val_idx in skf.split(SVM_X, SVM_Y):
#         X_train, X_val = SVM_X.iloc[train_idx], SVM_X.iloc[val_idx]
#         y_train, y_val = SVM_Y.iloc[train_idx], SVM_Y.iloc[val_idx]

#         # Initialize Linear SVM model
#         model = LinearSVC(C=params["C"], max_iter=params["max_iter"], random_state=42)
        
#         # Train the model
#         model.fit(X_train, y_train)
        
#         # Predict on validation set
#         y_pred = model.predict(X_val)
        
#         # Calculate accuracy for the fold
#         accuracy = accuracy_score(y_val, y_pred)
#         fold_accuracies.append(accuracy)

#     # Return the mean error rate across folds (minimizing 1 - accuracy maximizes accuracy)
#     return 1 - np.mean(fold_accuracies)

# # Run Optuna optimization
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=50)

# # Get the best parameters
# best_params = study.best_params
# print("Best Parameters:", best_params)


In [None]:
# from skopt import BayesSearchCV
# from sklearn.svm import SVC

# # Define parameter space
# param_space = {
#     'C': (1e-3, 1e2, 'log-uniform'),  # Logarithmic scale
#     'kernel': ['linear', 'rbf', 'poly'],
#     'gamma': ['scale', 'auto'],
#     'degree': (2, 5)  # Range for poly kernel
# }

# # Initialize SVM
# svm = SVC()

# # Perform Bayesian search
# bayes_search = BayesSearchCV(svm, param_space, n_iter=50, cv=5, scoring='accuracy', verbose=1, random_state=42)
# bayes_search.fit(SVM_X, SVM_Y)

# # Best hyperparameters
# print("Best Hyperparameters:", bayes_search.best_params_)


## 4. Model Building

# Model Training

In [None]:

# Manually specify SVM hyperparameters
best_params = {
    'C': 0.6045799851497715,
    'max_iter':17000
}

# Stratified K-Fold setup
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize arrays to store OOF predictions and true labels
oof_predictions = np.zeros(len(SVM_Y))
true_labels = np.zeros(len(SVM_Y))
fold_accuracies = []


# Perform Stratified K-Fold with Best Parameters
for fold, (train_idx, val_idx) in enumerate(skf.split(SVM_X, SVM_Y)):
    print(f"Training Fold {fold + 1}/{n_splits}...")

    X_train, X_val = SVM_X.iloc[train_idx], SVM_X.iloc[val_idx]
    y_train, y_val = SVM_Y.iloc[train_idx], SVM_Y.iloc[val_idx]

    # Initialize Linear SVM model with best parameters
    model = LinearSVC(C=best_params["C"], max_iter=best_params["max_iter"], random_state=42)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on validation set
    y_pred = model.predict(X_val)

    # Store OOF predictions and true labels
    oof_predictions[val_idx] = y_pred
    true_labels[val_idx] = y_val

    # Calculate accuracy for the current fold
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# Save OOF predictions and true labels to a DataFrame
oof_df = pd.DataFrame({
    "OOF_Predictions": oof_predictions,
    "True_Labels": true_labels,
})

# Print the OOF DataFrame
print("\nOOF Predictions DataFrame:")
print(oof_df.head())

# Calculate overall accuracy
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Accuracy Across Folds: {average_accuracy:.4f}")


# OOF Predictions : Linear SVC

In [None]:
oof_LSVC = oof_df.copy()
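
`LinearSVC` has no `predict_proba`, so the OOF column stored above holds hard 0/1 labels. If continuous scores are wanted for stacking, one option is to store the signed distance from `decision_function` instead. This is a sketch on synthetic data under that assumption, not what the notebook does; `X_toy`, `y_toy`, and `oof_scores` are illustrative names, and the `C` value is rounded.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic data standing in for SVM_X and SVM_Y.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof_scores = np.zeros(len(y_toy))

for train_idx, val_idx in skf_toy.split(X_toy, y_toy):
    svc = LinearSVC(C=0.6, max_iter=17000, random_state=42)
    svc.fit(X_toy[train_idx], y_toy[train_idx])
    # Signed distance to the separating hyperplane; positive means class 1
    oof_scores[val_idx] = svc.decision_function(X_toy[val_idx])

# Thresholding the scores at 0 reproduces the labels that predict() would return
oof_labels = (oof_scores > 0).astype(int)
```

These scores are not probabilities; `sklearn.calibration.CalibratedClassifierCV` is one way to map them to calibrated probabilities if needed.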

# Test Predictions

In [None]:
test_modified = test.copy()

X = test_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])


acad_mean = X['Academic Pressure'].mean()
X['Academic Pressure'] = X['Academic Pressure'].fillna(acad_mean)

work_mean = X['Work Pressure'].mean()
X['Work Pressure'] = X['Work Pressure'].fillna(work_mean)

cgpa_mean = X['CGPA'].mean()
X['CGPA'] = X['CGPA'].fillna(cgpa_mean)

studys_mean = X['Study Satisfaction'].mean()
X['Study Satisfaction'] = X['Study Satisfaction'].fillna(studys_mean)

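jobs_mean = X['Job Satisfaction'].mean()
X['Job Satisfaction'] = X['Job Satisfaction'].fillna(jobs_mean)

# Apply the scaler fitted on SVM_X so the test features match the training scale
X[features_to_standardize] = scaler.transform(X[features_to_standardize])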

In [None]:
# Predict using the trained model
y_pred = model.predict(X)  # Predicted class labels (LinearSVC has no predict_proba)

# Create Submission DataFrame
LSVC_submission = pd.DataFrame({
    "id": test_modified["id"],  # Keep the ID column
    "Depression": y_pred  # Add predictions
})


# Save to CSV
LSVC_submission.to_csv("LSVC_submission.csv", index=False)

LSVC_submission.head()

# Model 4 - XGBoost

In [None]:
train_modified = train.copy()

X = train_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])


XGB_X = X.drop('Depression', axis=1, inplace=False)
XGB_Y = X['Depression']

XGB_X = XGB_X.fillna(-1)

XGB_X

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb
import optuna


In [None]:
# import numpy as np
# import pandas as pd
# from sklearn.model_selection import StratifiedKFold, train_test_split
# from sklearn.metrics import accuracy_score
# import xgboost as xgb
# import optuna


# # Initialize arrays for OOF predictions and true labels
# oof_predictions = np.zeros(len(XGB_Y))
# true_labels = np.zeros(len(XGB_Y))
# fold_accuracies = []

# # Stratified K-Fold setup
# n_splits = 5
# skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# # Define the objective function for Optuna
# def objective(trial):
#     # Suggest hyperparameters
#     param = {
#         "objective": "binary:logistic",
#         "eval_metric": "logloss",
#         "booster": "gbtree",
#         "tree_method": "auto",
#         "max_depth": trial.suggest_int("max_depth", 3, 10),
#         "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
#         #"n_estimators": trial.suggest_int("n_estimators", 100, 1000),
#         "gamma": trial.suggest_float("gamma", 0, 5),
#         "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
#         "subsample": trial.suggest_float("subsample", 0.5, 1.0),
#         "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
#         "lambda": trial.suggest_float("lambda", 1e-3, 10.0, log=True),
#         "alpha": trial.suggest_float("alpha", 1e-3, 10.0, log=True),
#     }

#     # Suggest num_boost_round as a hyperparameter
#     num_boost_round = trial.suggest_int("num_boost_round", 50, 500)

#     # Perform Stratified K-Fold Cross-Validation
#     fold_accuracies = []
#     for fold, (train_idx, val_idx) in enumerate(skf.split(XGB_X, XGB_Y)):
#         X_train, X_val = XGB_X.iloc[train_idx], XGB_X.iloc[val_idx]
#         y_train, y_val = XGB_Y.iloc[train_idx], XGB_Y.iloc[val_idx]
        
#         # Create DMatrices for XGBoost
#         dtrain = xgb.DMatrix(X_train, label=y_train)
#         dval = xgb.DMatrix(X_val, label=y_val)
        
#         # Train model
#         model = xgb.train(param, dtrain, num_boost_round=num_boost_round, evals=[(dval, "validation")], early_stopping_rounds=50, verbose_eval=False)

        
#         # Predict probabilities
#         y_pred_proba = model.predict(dval)
#         y_pred = (y_pred_proba > 0.5).astype(int)
        
#         # Calculate accuracy
#         accuracy = accuracy_score(y_val, y_pred)
#         fold_accuracies.append(accuracy)
    
#     # Return the mean error rate (1 - accuracy) across folds as the metric to minimize
#     return 1 - np.mean(fold_accuracies)

# # Run Optuna optimization
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=50)

# # Get best parameters
# best_params = study.best_params
# print(f"Best Parameters: {best_params}")


# Model Training

In [None]:
best_params = {
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "booster": "gbtree",
        "tree_method": "auto",
        "max_depth": 5, #3, #5, #7,
        "learning_rate": 0.10750323694235545, #0.1891190347749237, #0.10750323694235545, #0.07585982629885023,
        "gamma": 1.3237413341033117, #2.314038705745134, #1.3237413341033117, #2.588948119388252,
        "min_child_weight": 9, #6, #9, #9,
        "subsample": 0.549013955949653, #0.6574750021250312, #0.5490139559496539, #0.5848761167199357,
        "colsample_bytree": 0.5197902304119142, #0.8372956010146961, #0.5197902304119142, #0.6111636059348978,
        "lambda": 0.001776275430859678, #0.5937676812744246, #0.001776275430859678, #2.9982108856637324,
        "alpha": 1.4810599685383345, #0.014778705236250075, #1.4810599685383345 #0.010574189390139333,
    }

# Initialize arrays for OOF predictions and true labels
oof_predictions = np.zeros(len(XGB_Y))
true_labels = np.zeros(len(XGB_Y))
fold_accuracies = []

# Stratified K-Fold setup
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Perform Stratified K-Fold with Best Parameters
for fold, (train_idx, val_idx) in enumerate(skf.split(XGB_X, XGB_Y)):
    print(f"Training Fold {fold + 1}/{n_splits}...")
    
    X_train, X_val = XGB_X.iloc[train_idx], XGB_X.iloc[val_idx]
    y_train, y_val = XGB_Y.iloc[train_idx], XGB_Y.iloc[val_idx]
    
    # Create DMatrices for XGBoost
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    
    # Train model with best parameters
    model = xgb.train(best_params, dtrain, num_boost_round=500, evals=[(dval, "validation")], early_stopping_rounds=50, verbose_eval=False)
    
    # Predict probabilities for validation set
    y_pred_proba = model.predict(dval)
    y_pred = (y_pred_proba > 0.5).astype(int)
    
    # Store OOF predictions and true labels
    oof_predictions[val_idx] = y_pred_proba
    true_labels[val_idx] = y_val
    
    # Calculate accuracy for the current fold
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# Save OOF predictions and true labels to a DataFrame
oof_df = pd.DataFrame({
    "OOF_Predictions": oof_predictions,
    "True_Labels": true_labels
})

# Print the DataFrame
print("\nOOF Predictions DataFrame:")
print(oof_df.head())

# Calculate overall accuracy
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Accuracy Across Folds: {average_accuracy:.4f}")


In [None]:
import shap
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_val)
shap.summary_plot(shap_values, X_val)


# OOF Predictions : XGBoost

In [None]:
oof_XGB = oof_df.copy()

oof_XGB

# Generating Predictions on Test Dataset

In [None]:
test_modified = test.copy()

test_features = test_modified[['Age', 'Academic Pressure', 'Work Pressure', 'CGPA',
       'Study Satisfaction', 'Job Satisfaction',
       'Have you ever had suicidal thoughts ?', 'Work/Study Hours',
       'Financial Stress', 'Family History of Mental Illness',
       'Dietary Habits Category', 'Degree Category Coded', 'City Tier',
       'Gender Coded', 'Working Professional or Student Coded',
       'Profession Category_Business_and_Finance',
       'Profession Category_Creative_and_Design',
       'Profession Category_Education', 'Profession Category_Engineering',
       'Profession Category_Finanancial Analyst',
       'Profession Category_HR Manager', 'Profession Category_Healthcare',
       'Profession Category_Law_and_Government',
       'Profession Category_Other_or_Miscellaneous',
       'Profession Category_Research Analyst',
       'Profession Category_Sales_and_Marketing',
       'Profession Category_Science_and_Research',
       'Profession Category_Skilled_Trades', 'Profession Category_Student',
       'Profession Category_Technology',
       'Profession Category_Working Professional',
       'Sleep Duration Category Coded']]

test_features = test_features.fillna(-1)

In [None]:
# Convert test data to DMatrix
dtest = xgb.DMatrix(test_features)  # test_features already excludes the id column

# Predict using the trained model
y_pred_proba = model.predict(dtest)  # Predicted probabilities
y_pred_labels = (y_pred_proba > 0.5).astype(int)  # Convert probabilities to binary labels

# Create Submission DataFrame
XGB_submission = pd.DataFrame({
    "id": test_modified["id"],  # Keep the ID column
    "Depression": y_pred_labels  # Add predictions
})

# Keep the predicted probabilities alongside the ids
XGB_test_proba = pd.DataFrame({
    "id": test_modified["id"],  # Keep the ID column
    "oof_XGB_proba": y_pred_proba  # Add prediction probabilities
})


# Save to CSV
XGB_submission.to_csv("XGB_submission_v3.csv", index=False)

**> The XGBoost model had higher accuracy when missing values were filled with -1.**

# Model 5 - AdaBoost

In [None]:
train_modified = train.copy()

X = train_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])


ADA_X = X.drop('Depression', axis=1, inplace=False)
ADA_Y = X['Depression']

ADA_X = ADA_X.fillna(-1)

ADA_X

# Optuna Optimization

In [None]:
# import numpy as np
# import pandas as pd
# from sklearn.model_selection import StratifiedKFold
# from sklearn.ensemble import AdaBoostClassifier
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.metrics import accuracy_score
# import optuna


# # Stratified K-Fold setup
# n_splits = 5
# skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# # Initialize arrays for OOF predictions and true labels
# oof_predictions = np.zeros(len(ADA_Y))
# true_labels = np.zeros(len(ADA_Y))
# fold_accuracies = []

# # Define the Optuna objective function
# def objective(trial):
#     # Suggest hyperparameters
#     params = {
#         "n_estimators": trial.suggest_int("n_estimators", 50, 500),
#         "learning_rate": trial.suggest_float("learning_rate", 0.01, 1.0, log=True),
#         "max_depth": trial.suggest_int("max_depth", 1, 10),
#     }

#     # Perform Stratified K-Fold Cross-Validation
#     fold_accuracies = []
#     for train_idx, val_idx in skf.split(ADA_X, ADA_Y):
#         X_train, X_val = ADA_X.iloc[train_idx], ADA_X.iloc[val_idx]
#         y_train, y_val = ADA_Y.iloc[train_idx], ADA_Y.iloc[val_idx]

#         # Initialize AdaBoost model with a base estimator
#         base_estimator = DecisionTreeClassifier(max_depth=params["max_depth"])
#         model = AdaBoostClassifier(
#             estimator=base_estimator,
#             n_estimators=params["n_estimators"],
#             learning_rate=params["learning_rate"],
#             random_state=42,
#         )

#         # Train the model
#         model.fit(X_train, y_train)

#         # Predict and evaluate
#         y_pred = model.predict(X_val)
#         accuracy = accuracy_score(y_val, y_pred)
#         fold_accuracies.append(accuracy)

#     # Return the mean error rate across folds (minimizing 1 - accuracy maximizes accuracy)
#     return 1 - np.mean(fold_accuracies)

# # Run Optuna optimization
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=50)

# # Get the best parameters
# best_params = study.best_params
# print("Best Parameters:", best_params)



# Model Training

In [None]:

from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stratified K-Fold setup
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize arrays for OOF predictions and true labels
oof_predictions = np.zeros(len(ADA_Y))
true_labels = np.zeros(len(ADA_Y))
fold_accuracies = []

for fold, (train_idx, val_idx) in enumerate(skf.split(ADA_X, ADA_Y)):
    print(f"Training Fold {fold + 1}/{n_splits}...")

    X_train, X_val = ADA_X.iloc[train_idx], ADA_X.iloc[val_idx]
    y_train, y_val = ADA_Y.iloc[train_idx], ADA_Y.iloc[val_idx]

    # Initialize AdaBoost model with best parameters
    base_estimator = DecisionTreeClassifier(max_depth=4)
    model = AdaBoostClassifier(
        estimator=base_estimator,
        n_estimators=123,
        learning_rate=0.14966014447836504,
        random_state=42,
    )

    # Train the model
    model.fit(X_train, y_train)

    # Predict probabilities for validation set
    y_pred_proba = model.predict_proba(X_val)[:, 1]
    y_pred = (y_pred_proba > 0.5).astype(int)  # Convert probabilities to binary labels

    # Store OOF predictions and true labels
    oof_predictions[val_idx] = y_pred_proba
    true_labels[val_idx] = y_val

    # Calculate accuracy for the current fold
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# Save OOF predictions and true labels to a DataFrame
oof_df = pd.DataFrame({
    "OOF_Predictions": oof_predictions,
    "True_Labels": true_labels,
})

# Print the OOF DataFrame
print("\nOOF Predictions DataFrame:")
print(oof_df.head())

# Calculate overall accuracy
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Accuracy Across Folds: {average_accuracy:.4f}")


In [None]:
oof_ADA = oof_df.copy()

oof_ADA

# Test Predictions

In [None]:
test_modified = test.copy()

test_features = test_modified[['Age', 'Academic Pressure', 'Work Pressure', 'CGPA',
       'Study Satisfaction', 'Job Satisfaction',
       'Have you ever had suicidal thoughts ?', 'Work/Study Hours',
       'Financial Stress', 'Family History of Mental Illness',
       'Dietary Habits Category', 'Degree Category Coded', 'City Tier',
       'Gender Coded', 'Working Professional or Student Coded',
       'Profession Category_Business_and_Finance',
       'Profession Category_Creative_and_Design',
       'Profession Category_Education', 'Profession Category_Engineering',
       'Profession Category_Finanancial Analyst',
       'Profession Category_HR Manager', 'Profession Category_Healthcare',
       'Profession Category_Law_and_Government',
       'Profession Category_Other_or_Miscellaneous',
       'Profession Category_Research Analyst',
       'Profession Category_Sales_and_Marketing',
       'Profession Category_Science_and_Research',
       'Profession Category_Skilled_Trades', 'Profession Category_Student',
       'Profession Category_Technology',
       'Profession Category_Working Professional',
       'Sleep Duration Category Coded']]

test_features = test_features.fillna(-1)

In [None]:
# Predict using the trained model
y_pred = model.predict(test_features)  # Predicted class labels
y_pred_proba = model.predict_proba(test_features)[:,1]  # Predicted probabilities

# Create Submission DataFrame
ADA_submission = pd.DataFrame({
    "id": test_modified["id"],  # Keep the ID column
    "Depression": y_pred  # Add predictions
})

ADA_test_proba = pd.DataFrame({
    "id": test_modified["id"],  # Keep the ID column
    "oof_ADA_proba": y_pred_proba  # Add predicted probabilities
})

# Save to CSV
ADA_submission.to_csv("ADA_submission.csv", index=False)



In [None]:
ADA_test_proba

# Combining OOF Predictions into training data

In [None]:
oof_LogR['oof_LogR'] = (oof_LogR['OOF_Predictions'] > 0.5).astype(int)

oof_LogR['oof_LogR_proba'] = oof_LogR['OOF_Predictions']

oof_LogR

In [None]:
oof_RF['oof_RF'] = oof_RF['Predicted_Label']

oof_RF['oof_RF_proba'] = oof_RF['Class_1']

oof_RF

In [None]:
oof_XGB['oof_XGB'] = (oof_XGB['OOF_Predictions'] > 0.5).astype(int)

oof_XGB['oof_XGB_proba'] = oof_XGB['OOF_Predictions']

oof_XGB

In [None]:


oof_ADA['oof_ADA_proba'] = oof_ADA['OOF_Predictions']

oof_ADA

In [None]:
train_oof = train.copy()

train_oof = pd.concat(
    [train_oof, oof_RF['oof_RF_proba'], oof_LogR['oof_LogR_proba'], oof_XGB['oof_XGB_proba'], oof_ADA['oof_ADA_proba']],
    axis=1
)

train_oof
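
The OOF probability columns added to `train_oof` can serve as inputs to a second-level (stacking) model. The following sketch shows the idea on synthetic data with a logistic regression meta-learner. The meta-learner choice, the helper `noisy_proba`, and the toy values are assumptions for illustration; only the column names mirror the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=300)  # synthetic binary target

def noisy_proba(labels, noise):
    """Hypothetical base-model output: about 0.8 for class 1 and 0.2 for class 0, plus noise."""
    raw = 0.2 + 0.6 * labels + rng.normal(0.0, noise, size=len(labels))
    return np.clip(raw, 0.0, 1.0)

# Stand-in for the four OOF probability columns in train_oof.
toy_oof = pd.DataFrame({
    "oof_RF_proba": noisy_proba(y_toy, 0.30),
    "oof_LogR_proba": noisy_proba(y_toy, 0.35),
    "oof_XGB_proba": noisy_proba(y_toy, 0.25),
    "oof_ADA_proba": noisy_proba(y_toy, 0.40),
})

# The meta-learner sees only the base models' out-of-fold probabilities.
meta_model = LogisticRegression(max_iter=1000)
meta_scores = cross_val_score(meta_model, toy_oof, y_toy, cv=5, scoring="accuracy")
print(f"Meta-model CV accuracy: {meta_scores.mean():.4f}")
```

Using out-of-fold rather than in-sample predictions matters here: each row's meta-feature comes from a model that never saw that row, which keeps the second-level model from learning the base models' overfitting.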

In [None]:
# # Save to CSV
# train_oof.to_csv("Train_OOF.csv", index=False)

# Model 6 - FNN Model

In [None]:
train_modified = train.copy()

X = train_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])


FNN_X = X.drop('Depression', axis=1, inplace=False)
FNN_Y = X['Depression']

FNN_X

In [None]:
acad_mean = FNN_X['Academic Pressure'].mean()
FNN_X['Academic Pressure'] = FNN_X['Academic Pressure'].fillna(acad_mean)

work_mean = FNN_X['Work Pressure'].mean()
FNN_X['Work Pressure'] = FNN_X['Work Pressure'].fillna(work_mean)

cgpa_mean = FNN_X['CGPA'].mean()
FNN_X['CGPA'] = FNN_X['CGPA'].fillna(cgpa_mean)

studys_mean = FNN_X['Study Satisfaction'].mean()
FNN_X['Study Satisfaction'] = FNN_X['Study Satisfaction'].fillna(studys_mean)

jobs_mean = FNN_X['Job Satisfaction'].mean()
FNN_X['Job Satisfaction'] = FNN_X['Job Satisfaction'].fillna(jobs_mean)

# Keras Tuning Optimization

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
import keras_tuner as kt

In [None]:


# # Define the model-building function for Keras Tuner
# def build_model(hp):
#     model = Sequential()
#     model.add(
#         Dense(
#             units=hp.Int("units_layer1", min_value=32, max_value=128, step=32),
#             activation="relu",
#             input_dim=FNN_X.shape[1],
#         )
#     )
#     model.add(Dropout(hp.Float("dropout_layer1", min_value=0.2, max_value=0.5, step=0.1)))
    
#     # Add optional second layer
#     if hp.Boolean("add_layer2"):
#         model.add(
#             Dense(
#                 units=hp.Int("units_layer2", min_value=32, max_value=128, step=32),
#                 activation="relu",
#             )
#         )
#         model.add(Dropout(hp.Float("dropout_layer2", min_value=0.2, max_value=0.5, step=0.1)))

#     model.add(Dense(1, activation="sigmoid"))  # Binary classification

#     # Compile the model
#     model.compile(
#         optimizer=Adam(learning_rate=hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])),
#         loss="binary_crossentropy",
#         metrics=["accuracy"],
#     )
#     return model

# # Initialize the Keras Tuner
# tuner = kt.RandomSearch(
#     build_model,
#     objective="val_accuracy",
#     max_trials=10,
#     executions_per_trial=1,
#     directory="kt_dir",
#     project_name="binary_classification_kfold",
# )

# # Stratified K-Fold setup
# n_splits = 5
# skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# oof_predictions = np.zeros(len(FNN_Y))
# true_labels = np.zeros(len(FNN_Y))
# fold_accuracies = []

# for fold, (train_idx, val_idx) in enumerate(skf.split(FNN_X, FNN_Y)):
#     print(f"Training Fold {fold + 1}/{n_splits}...")

#     X_train, X_val = FNN_X.iloc[train_idx], FNN_X.iloc[val_idx]
#     y_train, y_val = FNN_Y.iloc[train_idx], FNN_Y.iloc[val_idx]

#     # Optimize hyperparameters for the current fold
#     tuner.search(X_train, y_train, epochs=20, validation_data=(X_val, y_val), verbose=0)

#     # Get the best hyperparameters for the current fold
#     best_hp = tuner.get_best_hyperparameters(1)[0]
#     print(f"Best Hyperparameters for Fold {fold + 1}: {best_hp.values}")

#     # Build the model with the best hyperparameters
#     model = tuner.hypermodel.build(best_hp)

#     # Train the model
#     model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)

#     # Predict probabilities on the validation set
#     y_pred_proba = model.predict(X_val).flatten()  # Predicted probabilities
#     y_pred = (y_pred_proba > 0.5).astype(int)  # Convert to binary labels

#     # Store OOF predictions and true labels
#     oof_predictions[val_idx] = y_pred_proba
#     true_labels[val_idx] = y_val

#     # Calculate accuracy for the current fold
#     accuracy = accuracy_score(y_val, y_pred)
#     fold_accuracies.append(accuracy)
#     print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# # Save OOF predictions and true labels to a DataFrame
# oof_df = pd.DataFrame({
#     "OOF_Predictions": oof_predictions,
#     "True_Labels": true_labels,
# })

# # Calculate and print the average accuracy across folds
# average_accuracy = np.mean(fold_accuracies)
# print(f"\nAverage Accuracy Across Folds: {average_accuracy:.4f}")

# print("\nOOF Predictions DataFrame:")
# print(oof_df.head())


# Model Training

In [None]:


# Example of best hyperparameters (replace these with your actual values)
best_hp_values = {
    "units_layer1": 96,
    "dropout_layer1": 0.2,
    "units_layer2": 32,
    "dropout_layer2": 0.2,
    "learning_rate": 0.0001
}

# Function to build the model manually
def build_model_manual(units_layer1, dropout_layer1, units_layer2, dropout_layer2, learning_rate, input_dim):
    model = Sequential([
        Dense(units_layer1, activation="relu", input_dim=input_dim),
        Dropout(dropout_layer1),
        Dense(units_layer2, activation="relu"),
        Dropout(dropout_layer2),
        Dense(1, activation="sigmoid"),  # Binary classification
    ])
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Stratified K-Fold setup
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

oof_predictions = np.zeros(len(FNN_Y))
true_labels = np.zeros(len(FNN_Y))
fold_accuracies = []

for fold, (train_idx, val_idx) in enumerate(skf.split(FNN_X, FNN_Y)):
    print(f"Training Fold {fold + 1}/{n_splits}...")

    X_train, X_val = FNN_X.iloc[train_idx], FNN_X.iloc[val_idx]
    y_train, y_val = FNN_Y.iloc[train_idx], FNN_Y.iloc[val_idx]

    # Build the model with best hyperparameters
    model = build_model_manual(
        units_layer1=best_hp_values["units_layer1"],
        dropout_layer1=best_hp_values["dropout_layer1"],
        units_layer2=best_hp_values["units_layer2"],
        dropout_layer2=best_hp_values["dropout_layer2"],
        learning_rate=best_hp_values["learning_rate"],
        input_dim=X_train.shape[1],
    )

    # Train the model
    model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)

    # Predict probabilities for the validation set
    y_pred_proba = model.predict(X_val).flatten()  # Predicted probabilities
    y_pred = (y_pred_proba > 0.5).astype(int)  # Convert probabilities to binary labels

    # Store predictions and true labels
    oof_predictions[val_idx] = y_pred_proba.astype(float)
    true_labels[val_idx] = y_val

    # Calculate accuracy for the current fold
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# Calculate and print the average accuracy across folds
print(f"\nAverage Accuracy Across Folds: {np.mean(fold_accuracies):.4f}")


In [None]:
oof_predictions

# OOF Predictions : FNN Model

In [None]:
# Save OOF predictions and true labels to a DataFrame
oof_FNN = pd.DataFrame({
    "OOF_Predictions": oof_predictions,
    "True_Labels": true_labels,
})

In [None]:
oof_FNN['oof_FNN_proba'] = oof_FNN['OOF_Predictions']

train_oof = pd.concat(
    [train_oof, oof_FNN['oof_FNN_proba']],
    axis=1
)

train_oof

# Test Predictions

In [None]:
test_modified = test.copy()

X = test_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])


acad_mean = X['Academic Pressure'].mean()
X['Academic Pressure'] = X['Academic Pressure'].fillna(acad_mean)

work_mean = X['Work Pressure'].mean()
X['Work Pressure'] = X['Work Pressure'].fillna(work_mean)

cgpa_mean = X['CGPA'].mean()
X['CGPA'] = X['CGPA'].fillna(cgpa_mean)

studys_mean = X['Study Satisfaction'].mean()
X['Study Satisfaction'] = X['Study Satisfaction'].fillna(studys_mean)

jobs_mean = X['Job Satisfaction'].mean()
X['Job Satisfaction'] = X['Job Satisfaction'].fillna(jobs_mean)
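
The cell above fills gaps in the test set with test-set means, while the training cell used training-set means, so the same missing value is encoded slightly differently in the two sets. One common alternative is to compute the means once on the training features and reuse them for the test features. The following is a minimal sketch under that assumption. `fit_means` and `apply_means` are hypothetical helpers, and the toy frames stand in for `FNN_X` and `X`.

```python
import pandas as pd

NUMERIC_COLS = ["Academic Pressure", "Work Pressure", "CGPA",
                "Study Satisfaction", "Job Satisfaction"]

def fit_means(train_df, cols):
    """Column means computed on the training data only."""
    return train_df[cols].mean()

def apply_means(df, means):
    """Return a copy of `df` with NaNs in the listed columns filled by `means`."""
    out = df.copy()
    out[means.index] = out[means.index].fillna(means)
    return out

# Toy frames: every column has one missing value in train and one in test.
toy_train = pd.DataFrame({c: [1.0, 3.0, None] for c in NUMERIC_COLS})
toy_test = pd.DataFrame({c: [None, 5.0] for c in NUMERIC_COLS})

means = fit_means(toy_train, NUMERIC_COLS)
train_filled = apply_means(toy_train, means)
test_filled = apply_means(toy_test, means)
print(test_filled)
```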

In [None]:

# Predict probabilities on the test set.
# Note: `model` is the network left over from the final CV fold.
y_pred_proba = model.predict(X).flatten()  # Flatten ensures 1D array

# Convert probabilities to binary labels
threshold = 0.5
y_pred_labels = (y_pred_proba > threshold).astype(int)

# Combine predictions with the id column
FNN_submission = pd.DataFrame({
    "id": test_modified["id"],
    "Predicted_Probabilities": y_pred_proba,
    "Predicted_Labels": y_pred_labels,
})

FNN_test_proba = pd.DataFrame({
    "id": test_modified["id"],
    "oof_FNN_proba": y_pred_proba,
})


FNN_submission
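
The test predictions in this section come from `model`, which is whatever the last loop iteration left behind, that is, a network trained on four of the five folds of the final split. A frequently used alternative is to score the test set with every fold model and average the probabilities. The sketch below illustrates that pattern. It assumes scikit-learn's `LogisticRegression` as a stand-in for the Keras network and synthetic data in place of `FNN_X`, `FNN_Y`, and the test features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for the training features/labels and the test features.
X_all, y_all = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, y_tr, X_te = X_all[:500], y_all[:500], X_all[500:]

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

oof_proba = np.zeros(len(y_tr))
test_proba = np.zeros(len(X_te))

for train_idx, val_idx in skf.split(X_tr, y_tr):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_tr[train_idx], y_tr[train_idx])
    # Out-of-fold probabilities: each row is scored by a model that never saw it.
    oof_proba[val_idx] = fold_model.predict_proba(X_tr[val_idx])[:, 1]
    # Each fold model contributes 1/n_splits of the final test probability.
    test_proba += fold_model.predict_proba(X_te)[:, 1] / n_splits

test_labels = (test_proba > 0.5).astype(int)
print(test_proba[:5], test_labels[:5])
```

Averaging keeps the test-side column on the same footing as the OOF column that the meta-model is trained on, since both are then produced by models that each saw roughly 80% of the training rows.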

In [None]:
# Save to CSV
FNN_submission['Depression'] = FNN_submission['Predicted_Labels']
FNN_submission_2 = FNN_submission[['id', 'Depression']]

FNN_submission_2.to_csv("FNN_submission.csv", index=False)

FNN_submission_2.head()

In [None]:
ADA_test_proba

# Test OOF

In [None]:
test_oof = test.copy()



test_oof = pd.concat(
    [test_oof, RF_test_proba['oof_RF_proba'], LogR_test_proba['oof_LogR_proba'], XGB_test_proba['oof_XGB_proba'], 
     ADA_test_proba['oof_ADA_proba'], FNN_test_proba['oof_FNN_proba']],
    axis=1
)

test_oof

# Model 7 : CatBoost

In [None]:
train_modified = train.copy()

X = train_modified.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category'])


CB_X = X.drop('Depression', axis=1, inplace=False)
CB_Y = X['Depression']

CB_X = CB_X.fillna(-1)

CB_X

In [None]:
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

In [None]:
# import optuna
# from catboost import CatBoostClassifier, Pool
# from sklearn.model_selection import StratifiedKFold
# from sklearn.metrics import accuracy_score

# # Stratified K-Fold setup
# n_splits = 5
# skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# # Initialize arrays for OOF predictions and true labels
# oof_predictions = np.zeros(len(CB_Y))
# true_labels = np.zeros(len(CB_Y))
# fold_accuracies = []

# # Define the objective function for Optuna
# def objective(trial):
#     params = {
#         "iterations": trial.suggest_int("iterations", 100, 500),
#         "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
#         "depth": trial.suggest_int("depth", 3, 10),
#         "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-3, 10.0, log=True),
#         "border_count": trial.suggest_int("border_count", 32, 255),
#         "bagging_temperature": trial.suggest_float("bagging_temperature", 0.0, 1.0),
#         "random_strength": trial.suggest_float("random_strength", 0.0, 1.0),
#         "verbose": 0,
#         "loss_function": "Logloss",
#         "eval_metric": "Accuracy",
#         "random_state": 42,
#     }

#     fold_accuracies = []

#     for fold, (train_idx, val_idx) in enumerate(skf.split(CB_X, CB_Y)):
#         X_train, X_val = CB_X.iloc[train_idx], CB_X.iloc[val_idx]
#         y_train, y_val = CB_Y.iloc[train_idx], CB_Y.iloc[val_idx]

#         # Train CatBoost model
#         model = CatBoostClassifier(**params)
#         model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50, verbose=0)

#         # Predict on validation set
#         y_pred = model.predict(X_val)
#         accuracy = accuracy_score(y_val, y_pred)
#         fold_accuracies.append(accuracy)

#     # Return the mean error rate (1 - mean accuracy), because the study minimizes the objective
#     return 1 - np.mean(fold_accuracies)

# # Run Optuna optimization
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=50)

# # Get best hyperparameters
# best_params = study.best_params
# print("Best Parameters:", best_params)


# Training Model

In [None]:
best_params = {'iterations': 319, 
               'learning_rate': 0.18905539476863192, 
               'depth': 6, 
               'l2_leaf_reg': 0.3245637713616064, 
               'border_count': 75, 
               'bagging_temperature': 0.8668378304134939, 
               'random_strength': 0.35900494488728574
              }

# Stratified K-Fold setup
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize arrays for OOF predictions and true labels
oof_predictions = np.zeros(len(CB_Y))
true_labels = np.zeros(len(CB_Y))
fold_accuracies = []

# Perform Stratified K-Fold with Best Parameters
for fold, (train_idx, val_idx) in enumerate(skf.split(CB_X, CB_Y)):
    print(f"Training Fold {fold + 1}/{n_splits}...")

    X_train, X_val = CB_X.iloc[train_idx], CB_X.iloc[val_idx]
    y_train, y_val = CB_Y.iloc[train_idx], CB_Y.iloc[val_idx]

    # Train CatBoost model with best parameters
    model = CatBoostClassifier(**best_params, verbose=0, random_state=42, loss_function="Logloss")
    model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50, verbose=0)

    # Predict probabilities for validation set
    y_pred_proba = model.predict_proba(X_val)[:, 1]  # Probabilities for class 1
    oof_predictions[val_idx] = y_pred_proba  # Store OOF probabilities
    true_labels[val_idx] = y_val  # Store true labels

    # Convert probabilities to binary predictions
    y_pred = (y_pred_proba > 0.5).astype(int)
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# Save OOF predictions and true labels to a DataFrame
oof_df = pd.DataFrame({
    "OOF_Predictions": oof_predictions,
    "True_Labels": true_labels,
})

# Print the OOF DataFrame
print("\nOOF Predictions DataFrame:")
print(oof_df.head())

# Calculate overall accuracy
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Accuracy Across Folds: {average_accuracy:.4f}")


# OOF Predictions : CatBoost

In [None]:
oof_CB = oof_df.copy()

oof_CB['oof_CB_proba'] = oof_CB['OOF_Predictions']

oof_CB

In [None]:
train_oof = pd.concat(
    [train_oof, oof_CB['oof_CB_proba']],
    axis=1
)

train_oof

# Test Predictions

In [None]:
test_modified = test.copy()

test_features = test_modified[['Age', 'Academic Pressure', 'Work Pressure', 'CGPA',
       'Study Satisfaction', 'Job Satisfaction',
       'Have you ever had suicidal thoughts ?', 'Work/Study Hours',
       'Financial Stress', 'Family History of Mental Illness',
       'Dietary Habits Category', 'Degree Category Coded', 'City Tier',
       'Gender Coded', 'Working Professional or Student Coded',
       'Profession Category_Business_and_Finance',
       'Profession Category_Creative_and_Design',
       'Profession Category_Education', 'Profession Category_Engineering',
       'Profession Category_Finanancial Analyst',
       'Profession Category_HR Manager', 'Profession Category_Healthcare',
       'Profession Category_Law_and_Government',
       'Profession Category_Other_or_Miscellaneous',
       'Profession Category_Research Analyst',
       'Profession Category_Sales_and_Marketing',
       'Profession Category_Science_and_Research',
       'Profession Category_Skilled_Trades', 'Profession Category_Student',
       'Profession Category_Technology',
       'Profession Category_Working Professional',
       'Sleep Duration Category Coded']]

test_features = test_features.fillna(-1)

In [None]:
# Predict using the trained model
y_pred = model.predict(test_features)  # Predicted class labels
y_pred_proba = model.predict_proba(test_features)[:,1]  # Predicted probabilities

# Create Submission DataFrame
CB_submission = pd.DataFrame({
    "id": test_modified["id"],  # Keep the ID column
    "Depression": y_pred  # Add predictions
})

CB_test_proba = pd.DataFrame({
    "id": test_modified["id"],  # Keep the ID column
    "oof_CB_proba": y_pred_proba  # Add predictions
})



# Save to CSV
CB_submission.to_csv("CB_submission.csv", index=False)


In [None]:
test_oof = pd.concat(
    [test_oof, CB_test_proba['oof_CB_proba']],
    axis=1
)

test_oof

In [None]:
train_oof

# Meta model : CatBoost + OOF RF, LogR, XGB, AdaBoost, FNN

In [None]:
train_meta = train_oof.copy()

train_meta = train_meta.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category','oof_CB_proba'])


MCB_X = train_meta.drop('Depression', axis=1, inplace=False)
MCB_Y = train_meta['Depression']

MCB_X = MCB_X.fillna(-1)

MCB_X.columns

In [None]:
# import optuna
# from catboost import CatBoostClassifier, Pool
# from sklearn.model_selection import StratifiedKFold
# from sklearn.metrics import accuracy_score

# # Stratified K-Fold setup
# n_splits = 5
# skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# # Initialize arrays for OOF predictions and true labels
# oof_predictions = np.zeros(len(MCB_Y))
# true_labels = np.zeros(len(MCB_Y))
# fold_accuracies = []

# # Define the objective function for Optuna
# def objective(trial):
#     params = {
#         "iterations": trial.suggest_int("iterations", 100, 500),
#         "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
#         "depth": trial.suggest_int("depth", 3, 10),
#         "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-3, 10.0, log=True),
#         "border_count": trial.suggest_int("border_count", 32, 255),
#         "bagging_temperature": trial.suggest_float("bagging_temperature", 0.0, 1.0),
#         "random_strength": trial.suggest_float("random_strength", 0.0, 1.0),
#         "verbose": 0,
#         "loss_function": "Logloss",
#         "eval_metric": "Accuracy",
#         "random_state": 42,
#     }

#     fold_accuracies = []

#     for fold, (train_idx, val_idx) in enumerate(skf.split(MCB_X, MCB_Y)):
#         X_train, X_val = MCB_X.iloc[train_idx], MCB_X.iloc[val_idx]
#         y_train, y_val = MCB_Y.iloc[train_idx], MCB_Y.iloc[val_idx]

#         # Train CatBoost model
#         model = CatBoostClassifier(**params)
#         model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50, verbose=0)

#         # Predict on validation set
#         y_pred = model.predict(X_val)
#         accuracy = accuracy_score(y_val, y_pred)
#         fold_accuracies.append(accuracy)

#     # Return the mean error rate (1 - mean accuracy), because the study minimizes the objective
#     return 1 - np.mean(fold_accuracies)

# # Run Optuna optimization
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=50)

# # Get best hyperparameters
# best_params = study.best_params
# print("Best Parameters:", best_params)


In [None]:
best_params = {'iterations': 285, #333, 
               'learning_rate': 0.012014432093919357, #0.011991148115001949, 
               'depth': 7, #3, 
               'l2_leaf_reg': 0.007572004811306094, #0.031813724210877914, 
               'border_count': 234, #231, 
               'bagging_temperature': 0.603413282889533, #0.6103639135740477, 
               'random_strength': 0.41331355424450433 #0.8724973080089851
              }

# Stratified K-Fold setup
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize arrays for OOF predictions and true labels
oof_predictions = np.zeros(len(MCB_Y))
true_labels = np.zeros(len(MCB_Y))
fold_accuracies = []

# Perform Stratified K-Fold with Best Parameters
for fold, (train_idx, val_idx) in enumerate(skf.split(MCB_X, MCB_Y)):
    print(f"Training Fold {fold + 1}/{n_splits}...")

    X_train, X_val = MCB_X.iloc[train_idx], MCB_X.iloc[val_idx]
    y_train, y_val = MCB_Y.iloc[train_idx], MCB_Y.iloc[val_idx]

    # Train CatBoost model with best parameters
    model = CatBoostClassifier(**best_params, verbose=0, random_state=42, loss_function="Logloss")
    model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50, verbose=0)

    # Predict probabilities for validation set
    y_pred_proba = model.predict_proba(X_val)[:, 1]  # Probabilities for class 1
    oof_predictions[val_idx] = y_pred_proba  # Store OOF probabilities
    true_labels[val_idx] = y_val  # Store true labels

    # Convert probabilities to binary predictions
    y_pred = (y_pred_proba > 0.5).astype(int)
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# Save OOF predictions and true labels to a DataFrame
oof_df = pd.DataFrame({
    "OOF_Predictions": oof_predictions,
    "True_Labels": true_labels,
})

# Print the OOF DataFrame
print("\nOOF Predictions DataFrame:")
print(oof_df.head())

# Calculate overall accuracy
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Accuracy Across Folds: {average_accuracy:.4f}")


In [None]:
oof_df

In [None]:
test_oof

In [None]:
test_meta = test_oof.copy()


test_meta = test_meta.drop(columns=['Working Professional or Student', 'Profession', 'Satisfaction', 'Work/Academic Pressure', 'id', 'Sleep Duration', 'Name', 'City', 'Gender', 'Dietary Habits', 'Degree', 'Degree Category', 'Sleep Duration Category', 'oof_CB_proba'])

test_meta= test_meta.fillna(-1)

test_meta.columns

In [None]:
test_meta
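
The meta-model was fitted on `MCB_X`, so `test_meta` needs the same feature columns in the same order. The two frames are built by separate `drop` calls, which leaves room for a silent mismatch. An explicit check could look like the sketch below. `align_to_train` is a hypothetical helper, and the toy frames stand in for `MCB_X` and `test_meta`.

```python
import pandas as pd

def align_to_train(test_df, train_columns):
    """Reorder `test_df` to the training column order; fail loudly on any mismatch."""
    missing = [c for c in train_columns if c not in test_df.columns]
    extra = [c for c in test_df.columns if c not in train_columns]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return test_df[list(train_columns)]

# Toy frames with the same columns in a different order.
toy_train_X = pd.DataFrame({"Age": [30], "oof_RF_proba": [0.2], "oof_FNN_proba": [0.4]})
toy_test_X = pd.DataFrame({"oof_FNN_proba": [0.7], "Age": [25], "oof_RF_proba": [0.6]})

aligned = align_to_train(toy_test_X, toy_train_X.columns)
print(list(aligned.columns))
```

In this notebook the equivalent call would be something like `test_meta = align_to_train(test_meta, MCB_X.columns)` just before predicting.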

In [None]:
# Predict using the trained model
y_pred = model.predict(test_meta)  # Predicted class labels

# Create Submission DataFrame
Meta_CB_submission = pd.DataFrame({
    "id": test_oof["id"],  # Keep the ID column
    "Depression": y_pred  # Add predictions
})




# Save to CSV
Meta_CB_submission.to_csv("Meta_CB_submission.csv", index=False)

# Meta model : XGBoost + OOF RF, LogR, AdaBoost, FNN, CatBoost

In [None]:
train_meta = train_oof.copy()

train_meta = train_meta[['Age', 'Depression', 'oof_RF_proba', 'oof_LogR_proba',
       'oof_ADA_proba', 'oof_FNN_proba', 'oof_CB_proba']]

MXGB_X = train_meta.drop('Depression', axis=1, inplace=False)
MXGB_Y = train_meta['Depression']

MXGB_X = MXGB_X.fillna(-1)

MXGB_X.columns

In [None]:
# import numpy as np
# import pandas as pd
# from sklearn.model_selection import StratifiedKFold, train_test_split
# from sklearn.metrics import accuracy_score
# import xgboost as xgb
# import optuna


# # Initialize arrays for OOF predictions and true labels
# oof_predictions = np.zeros(len(MXGB_Y))
# true_labels = np.zeros(len(MXGB_Y))
# fold_accuracies = []

# # Stratified K-Fold setup
# n_splits = 5
# skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# # Define the objective function for Optuna
# def objective(trial):
#     # Suggest hyperparameters
#     param = {
#         "objective": "binary:logistic",
#         "eval_metric": "logloss",
#         "booster": "gbtree",
#         "tree_method": "auto",
#         "max_depth": trial.suggest_int("max_depth", 3, 10),
#         "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
#         #"n_estimators": trial.suggest_int("n_estimators", 100, 1000),
#         "gamma": trial.suggest_float("gamma", 0, 5),
#         "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
#         "subsample": trial.suggest_float("subsample", 0.5, 1.0),
#         "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
#         "lambda": trial.suggest_float("lambda", 1e-3, 10.0, log=True),
#         "alpha": trial.suggest_float("alpha", 1e-3, 10.0, log=True),
#     }

#     # Suggest num_boost_round as a hyperparameter
#     num_boost_round = trial.suggest_int("num_boost_round", 50, 500)

#     # Perform Stratified K-Fold Cross-Validation
#     fold_accuracies = []
#     for fold, (train_idx, val_idx) in enumerate(skf.split(MXGB_X, MXGB_Y)):
#         X_train, X_val = MXGB_X.iloc[train_idx], MXGB_X.iloc[val_idx]
#         y_train, y_val = MXGB_Y.iloc[train_idx], MXGB_Y.iloc[val_idx]
        
#         # Create DMatrices for XGBoost
#         dtrain = xgb.DMatrix(X_train, label=y_train)
#         dval = xgb.DMatrix(X_val, label=y_val)
        
#         # Train model
#         model = xgb.train(param, dtrain, num_boost_round=num_boost_round, evals=[(dval, "validation")], early_stopping_rounds=50, verbose_eval=False)

        
#         # Predict probabilities
#         y_pred_proba = model.predict(dval)
#         y_pred = (y_pred_proba > 0.5).astype(int)
        
#         # Calculate accuracy
#         accuracy = accuracy_score(y_val, y_pred)
#         fold_accuracies.append(accuracy)
    
#     # Return the mean error rate (1 - mean accuracy) as the metric to minimize
#     return 1 - np.mean(fold_accuracies)

# # Run Optuna optimization
# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=75)

# # Get best parameters
# best_params = study.best_params
# print(f"Best Parameters: {best_params}")


In [None]:
best_params = {
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "booster": "gbtree",
        "tree_method": "auto",
        "max_depth": 3, 
        "learning_rate":  0.1899461722284292, #0.10514070367711406, 
        "gamma": 1.2897008226411792, #2.7753144844804947, 
        "min_child_weight": 8, #4, 
        "subsample": 0.8282596011300142, #0.7499244272798747, 
        "colsample_bytree": 0.7044890796519623, #0.8718263378701961, 
        "lambda": 0.5082274643528951, #0.0023871515138162666, 
        "alpha": 1.30297511166754, #1.204940444287851, 
    }

# Initialize arrays for OOF predictions and true labels
oof_predictions = np.zeros(len(MXGB_Y))
true_labels = np.zeros(len(MXGB_Y))
fold_accuracies = []

# Stratified K-Fold setup
n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Perform Stratified K-Fold with Best Parameters
for fold, (train_idx, val_idx) in enumerate(skf.split(MXGB_X, MXGB_Y)):
    print(f"Training Fold {fold + 1}/{n_splits}...")
    
    X_train, X_val = MXGB_X.iloc[train_idx], MXGB_X.iloc[val_idx]
    y_train, y_val = MXGB_Y.iloc[train_idx], MXGB_Y.iloc[val_idx]
    
    # Create DMatrices for XGBoost
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    
    # Train model with best parameters
    model = xgb.train(best_params, dtrain, num_boost_round=483, evals=[(dval, "validation")], early_stopping_rounds=50, verbose_eval=False)
    
    # Predict probabilities for validation set
    y_pred_proba = model.predict(dval)
    y_pred = (y_pred_proba > 0.5).astype(int)
    
    # Store OOF predictions and true labels
    oof_predictions[val_idx] = y_pred_proba
    true_labels[val_idx] = y_val
    
    # Calculate accuracy for the current fold
    accuracy = accuracy_score(y_val, y_pred)
    fold_accuracies.append(accuracy)
    print(f"Fold {fold + 1} Accuracy: {accuracy:.4f}")

# Save OOF predictions and true labels to a DataFrame
oof_df = pd.DataFrame({
    "OOF_Predictions": oof_predictions,
    "True_Labels": true_labels
})

# Print the DataFrame
print("\nOOF Predictions DataFrame:")
print(oof_df.head())

# Calculate overall accuracy
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Accuracy Across Folds: {average_accuracy:.4f}")
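
Every model in this notebook converts probabilities to labels at a fixed 0.5 cut-off. Because OOF probabilities and true labels are both available, the cut-off itself can be checked. The sketch below assumes synthetic arrays in place of `oof_df['OOF_Predictions']` and `oof_df['True_Labels']`, and `best_threshold` is a hypothetical helper. A threshold picked this way is itself fitted to the training data, so any gain should be confirmed on held-out data before it is trusted.

```python
import numpy as np

def best_threshold(y_true, proba, grid=None):
    """Return (threshold, accuracy) maximizing accuracy over a grid of cut-offs."""
    if grid is None:
        grid = np.arange(5, 96) / 100.0  # 0.05, 0.06, ..., 0.95 (includes 0.5)
    accuracies = np.array([np.mean((proba > t).astype(int) == y_true) for t in grid])
    best = int(np.argmax(accuracies))  # first maximum on ties
    return float(grid[best]), float(accuracies[best])

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
# Noisy scores correlated with the label, clipped into [0, 1].
proba = np.clip(0.35 * y_true + rng.normal(0.3, 0.15, size=1000), 0.0, 1.0)

threshold, accuracy = best_threshold(y_true, proba)
baseline = float(np.mean((proba > 0.5).astype(int) == y_true))
print(f"0.5 cut-off: {baseline:.4f} | best cut-off {threshold:.2f}: {accuracy:.4f}")
```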


In [None]:
test_meta = test_oof.copy()


test_meta = test_meta[['Age', 'oof_RF_proba', 'oof_LogR_proba',
       'oof_ADA_proba', 'oof_FNN_proba', 'oof_CB_proba']]


test_meta= test_meta.fillna(-1)

test_meta.columns

In [None]:
# Convert test data to DMatrix
dtest = xgb.DMatrix(test_meta)  # test_meta already excludes the id column

# Predict using the trained model
y_pred_proba = model.predict(dtest)  # Predicted probabilities
y_pred_labels = (y_pred_proba > 0.5).astype(int)  # Convert probabilities to binary labels

# Create Submission DataFrame
MXGB_submission_2 = pd.DataFrame({
    "id": test_oof["id"],  # Keep the ID column
    "Depression": y_pred_labels  # Add predictions
})



# Save to CSV
MXGB_submission_2.to_csv("MXGB_submission_v2.csv", index=False)

In [None]:
import shap

# Explain the last fold's meta-model using that fold's train/validation split
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_val)
shap.summary_plot(shap_values, X_val)


# Hill Climbing Code

In [None]:
# import numpy as np
# from sklearn.metrics import roc_auc_score

# # Example OOF predictions from 3 models (replace with your model predictions)
# model_1_preds = np.array([0.1, 0.4, 0.35, 0.8])
# model_2_preds = np.array([0.2, 0.3, 0.5, 0.7])
# model_3_preds = np.array([0.15, 0.45, 0.4, 0.85])

# # True labels
# y_true = np.array([0, 0, 1, 1])

# # Combine predictions into a matrix
# predictions = np.stack([model_1_preds, model_2_preds, model_3_preds], axis=1)

# # Initialize weights
# weights = np.ones(predictions.shape[1]) / predictions.shape[1]  # Equal weights initially

# # Define the evaluation function
# def blended_predictions(predictions, weights):
#     return np.dot(predictions, weights)

# def evaluate_blend(predictions, weights, y_true):
#     blended_preds = blended_predictions(predictions, weights)
#     return roc_auc_score(y_true, blended_preds)

# # Hill climbing algorithm
# def hill_climbing_blend(predictions, y_true, max_iterations=100, step_size=0.01):
#     weights = np.ones(predictions.shape[1]) / predictions.shape[1]
#     best_score = evaluate_blend(predictions, weights, y_true)
    
#     for iteration in range(max_iterations):
#         for i in range(len(weights)):
#             # Try increasing the weight
#             new_weights = weights.copy()
#             new_weights[i] += step_size
#             new_weights /= new_weights.sum()  # Normalize weights
#             new_score = evaluate_blend(predictions, new_weights, y_true)
            
#             if new_score > best_score:
#                 weights = new_weights
#                 best_score = new_score
#                 continue

#             # Try decreasing the weight
#             new_weights = weights.copy()
#             new_weights[i] -= step_size
#             new_weights /= new_weights.sum()  # Normalize weights
#             new_score = evaluate_blend(predictions, new_weights, y_true)
            
#             if new_score > best_score:
#                 weights = new_weights
#                 best_score = new_score
#                 continue

#     return weights, best_score

# # Run hill climbing
# best_weights, best_score = hill_climbing_blend(predictions, y_true)

# print("Best Weights:", best_weights)
# print("Best AUC Score:", best_score)

# # Blend predictions with best weights
# final_blended_preds = blended_predictions(predictions, best_weights)


## 🔍 Summary & Key Takeaways

- Conducted deep **EDA** to identify anomalies and missing values  
- Engineered domain-specific features (City Tier, Degree Category, Sleep Quality Group, etc.)  
- Used various encoding techniques: Label, Target, and Frequency  
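- Built and tuned multiple models, including Logistic Regression, Random Forest, XGBoost, AdaBoost, CatBoost, and a feedforward neural network  
- Leveraged **Optuna** for hyperparameter optimization of the tree-based models and Keras Tuner for the neural network  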
- Employed **stacking and blending** to ensemble models and achieve competitive performance  
- Achieved **94.024% accuracy**, just shy of the top solution  
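
The stacking step mentioned above can be condensed into a two-level sketch: base models produce out-of-fold probabilities, and a meta-model is trained on those columns. The code below is only an illustration. It assumes scikit-learn estimators and synthetic data instead of the notebook's tuned models and survey features, and it uses OOF columns alone as meta-features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=800, n_features=10, n_informative=5, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

base_models = {
    "oof_LogR_proba": LogisticRegression(max_iter=1000),
    "oof_RF_proba": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Level 1: one out-of-fold probability column per base model.
oof_features = np.column_stack([
    cross_val_predict(m, X, y, cv=skf, method="predict_proba")[:, 1]
    for m in base_models.values()
])

# Level 2: a meta-model trained and cross-validated on the OOF columns only.
meta_scores = cross_val_score(LogisticRegression(max_iter=1000), oof_features, y,
                              cv=skf, scoring="accuracy")
print(f"Meta-model CV accuracy: {meta_scores.mean():.4f}")
```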
