## Kaggle Competition:<br>
https://www.kaggle.com/competitions/playground-series-s4e11

## Extra Data Sources <br>


https://worldpopulationreview.com/cities/india<br>
https://statisticstimes.com/demographics/country/india-cities-population.php


## Import Libraries and Load Data

In [1]:
## import the necessary libraries
import pandas as pd
# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Prevent truncation of column contents
pd.set_option('display.max_colwidth', None)
import warnings
warnings.filterwarnings("ignore")

## load the training data
train = pd.read_csv("train.csv")
## load the test data
test = pd.read_csv("test.csv")

## Data Preprocessing

In [2]:
print(train.shape)
print(test.shape)
print(train["Depression"].value_counts())
print(train.columns)

(140700, 20)
(93800, 19)
Depression
0    115133
1     25567
Name: count, dtype: int64
Index(['id', 'Name', 'Gender', 'Age', 'City',
       'Working Professional or Student', 'Profession', 'Academic Pressure',
       'Work Pressure', 'CGPA', 'Study Satisfaction', 'Job Satisfaction',
       'Sleep Duration', 'Dietary Habits', 'Degree',
       'Have you ever had suicidal thoughts ?', 'Work/Study Hours',
       'Financial Stress', 'Family History of Mental Illness', 'Depression'],
      dtype='object')


### split cities into big cities and small cities

In [3]:
## load city data
city = pd.read_excel("cities_populations.xlsx")
city = city[city['country'] == 'India'][['city_ascii', 'population']]

## check for missing cities in the city data
missing_cities = pd.Series(train['City'].unique())[~pd.Series(train['City'].unique()).isin(pd.Series(city['city_ascii'].unique()))]
print('These cities in training dataset are missing from the city data:')
print(missing_cities)
print(train[train['City'].isin(missing_cities)].shape)

missing_citiesN = pd.Series(test['City'].unique())[~pd.Series(test['City'].unique()).isin(pd.Series(city['city_ascii'].unique()))]
print('These cities in testing dataset are missing from the city data:')
print(missing_citiesN)
print(test[test['City'].isin(missing_citiesN)].shape)


These cities in training dataset are missing from the city data:
30             Ishanabad
31                 Vidhi
32                 Ayush
34               Krishna
35             Aishwarya
36                Keshav
37                Harsha
38                Nalini
39                Aditya
40              Malyansh
41           Raghavendra
42                Saanvi
43                M.Tech
44                Bhavna
45            Less Delhi
46               Nandini
47                 M.Com
48                 Plata
49                Atharv
50              Pratyush
51                  City
52                   3.0
53    Less than 5 Kalyan
54                   MCA
55                  Mira
56            Moreadhyay
58              Ishkarsh
59                 Kashk
60                 Mihir
61                 Vidya
62               Tolkata
63                  Anvi
64                Krinda
65                Ayansh
66                 Shrey
67                 Ivaan
68                Vaanya
69        

In [4]:
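## merge the city data with the training data; a left merge keeps every training row, so cities missing from the city data (most of them are probably typos) are not deleted and get a NaN population instead
## note: if a city name appears more than once in the city data, the matching training rows are repeated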
train = train.merge(city, left_on='City', right_on='city_ascii', how='left')
train.drop(columns=['city_ascii'], inplace=True)
train.rename(columns={'population': 'city_population'}, inplace=True)

## merge the city data with the test data; as above, unmatched cities (most of them are probably typos) are kept with a NaN population
test = test.merge(city, left_on='City', right_on='city_ascii', how='left')
test.drop(columns=['city_ascii'], inplace=True)
test.rename(columns={'population': 'city_population'}, inplace=True)

In [5]:
## split cities based on population with a threshold of 2 million
## (rows with a NaN population are labelled 'small', because NaN > threshold is False)
threshold = 2000000
train['city_type'] = train['city_population'].apply(lambda x: 'big' if x > threshold else 'small')
test['city_type'] = test['city_population'].apply(lambda x: 'big' if x > threshold else 'small')
print(train['city_type'].value_counts())
print(test['city_type'].value_counts())

city_type
small    82094
big      78156
Name: count, dtype: int64
city_type
small    55354
big      51961
Name: count, dtype: int64
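
A note on the merge: the `city_type` counts above sum to 160,250 training rows and 107,315 test rows, although only 140,700 and 93,800 rows were loaded. One likely cause is that the city table lists some `city_ascii` names more than once, so the left merge repeats the matching rows. The sketch below uses made-up toy frames; `train_demo`, `city_demo`, the rule of keeping the largest population per name, and the extra `unknown` label are all assumptions for illustration, not part of the notebook. It shows one way to keep the merge many-to-one and to label unmatched cities explicitly.

```python
import pandas as pd

# Toy stand-ins for the notebook's `train` and `city` frames (values are made up).
train_demo = pd.DataFrame({"id": [0, 1, 2], "City": ["Srinagar", "Pune", "Vidhi"]})
city_demo = pd.DataFrame({
    "city_ascii": ["Srinagar", "Srinagar", "Pune"],
    "population": [1_500_000, 150_000, 3_100_000],
})

# Keep one row per city name (assumption: the largest population wins),
# so every training row can match at most one city row.
city_unique = (
    city_demo.sort_values("population", ascending=False)
    .drop_duplicates(subset="city_ascii", keep="first")
)

# validate="many_to_one" makes pandas raise if the right-hand keys are not unique.
merged = train_demo.merge(
    city_unique, left_on="City", right_on="city_ascii",
    how="left", validate="many_to_one",
).drop(columns="city_ascii")

# Label unmatched cities explicitly instead of letting NaN fall into 'small'.
threshold = 2_000_000
merged["city_type"] = "small"
merged.loc[merged["population"] > threshold, "city_type"] = "big"
merged.loc[merged["population"].isna(), "city_type"] = "unknown"
print(merged)
```

With this shape the row count after the merge equals the row count before it, which is easy to assert in the notebook.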


### check for outliers in training data

In [6]:
categories = [
    "Gender",
    "city_type",
    "Working Professional or Student",
    "Profession",
    "Academic Pressure",
    "Work Pressure",
    "Study Satisfaction",
    "Job Satisfaction",
    "Sleep Duration",
    "Dietary Habits",
    "Degree",
    "Have you ever had suicidal thoughts ?",
    "Work/Study Hours",
    "Financial Stress",
    "Family History of Mental Illness",
]
for category in categories:
    print(train[category].value_counts())

Gender
Male      87682
Female    72568
Name: count, dtype: int64
city_type
small    82094
big      78156
Name: count, dtype: int64
Working Professional or Student
Working Professional    127197
Student                  33053
Name: count, dtype: int64
Profession
Teacher                   28935
Content Writer             8846
Architect                  4928
Consultant                 4689
HR Manager                 4571
Pharmacist                 4396
Doctor                     3561
Business Analyst           3493
Entrepreneur               3359
Chef                       3334
Chemist                    3269
Educational Consultant     3179
Researcher                 2658
Data Scientist             2635
Lawyer                     2602
Customer Support           2304
Marketing Manager          2223
Pilot                      2189
Travel Consultant          2062
Plumber                    1943
Manager                    1942
Sales Executive            1922
Judge                      1875
Fi

In [7]:
## deal with outliers in Profession column(probably typos)
# Get the counts of each profession
profession_counts = train['Profession'].value_counts()
# Filter out professions with fewer than 12 occurrences
valid_professions = profession_counts[profession_counts >= 12].index
# Keep only rows with valid professions
print('Removed professions with fewer than 12 rows')
train = train[train['Profession'].isin(valid_professions) | train['Profession'].isna()]

## deal with outliers in Sleep Duration column(probably typos)
# Get the counts of each Sleep Duration
sleep_duration_counts = train['Sleep Duration'].value_counts()
# Filter out Sleep Duration with fewer than 15 occurrences
valid_sleep_durations = sleep_duration_counts[sleep_duration_counts >= 15].index
# Keep only rows with valid Sleep Duration
print('Removed Sleep Duration with fewer than 15 rows')
train = train[train['Sleep Duration'].isin(valid_sleep_durations) | train['Sleep Duration'].isna()]

## deal with outliers in Dietary Habits column(probably typos)
# Get the counts of each Dietary Habit
Dietary_Habits_counts = train['Dietary Habits'].value_counts()
# Filter out Dietary Habit with fewer than 15 occurrences
valid_Dietary_Habits_counts = Dietary_Habits_counts[Dietary_Habits_counts >= 15].index
# Keep only rows with valid Dietary Habit
print('Removed Dietary Habit with fewer than 15 rows')
train = train[train['Dietary Habits'].isin(valid_Dietary_Habits_counts) | train['Dietary Habits'].isna()]

## deal with outliers in Degrees column(probably typos)
# Get the counts of each Degree
Degree_counts = train['Degree'].value_counts()
# Filter out Degree with fewer than 10 occurrences
valid_Degree_counts = Degree_counts[Degree_counts >= 10].index
# Keep only rows with valid Degree
print('Removed Degree with fewer than 10 rows')
train = train[train['Degree'].isin(valid_Degree_counts) | train['Degree'].isna()]

## check again
categories = [
    "Gender",
    "city_type",
    "Working Professional or Student",
    "Profession",
    "Academic Pressure",
    "Work Pressure",
    "Study Satisfaction",
    "Job Satisfaction",
    "Sleep Duration",
    "Dietary Habits",
    "Degree",
    "Have you ever had suicidal thoughts ?",
    "Work/Study Hours",
    "Financial Stress",
    "Family History of Mental Illness",
]
for category in categories:
    print(train[category].value_counts())

Removed professions with fewer than 12 rows
Removed Sleep Duration with fewer than 15 rows
Removed Dietary Habit with fewer than 15 rows
Removed Degree with fewer than 10 rows
Gender
Male      87494
Female    72423
Name: count, dtype: int64
city_type
small    81899
big      78018
Name: count, dtype: int64
Working Professional or Student
Working Professional    126946
Student                  32971
Name: count, dtype: int64
Profession
Teacher                   28898
Content Writer             8837
Architect                  4920
Consultant                 4678
HR Manager                 4559
Pharmacist                 4387
Doctor                     3552
Business Analyst           3485
Entrepreneur               3356
Chef                       3332
Chemist                    3263
Educational Consultant     3174
Researcher                 2654
Data Scientist             2632
Lawyer                     2592
Customer Support           2303
Marketing Manager          2222
Pilot             
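
The four filtering steps in cell In [7] share one pattern: count the values, keep the frequent ones, and keep missing values for later imputation. A minimal sketch of that pattern as a helper is shown below; the function name `drop_rare_categories` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def drop_rare_categories(df, column, min_count):
    """Keep rows whose value in `column` occurs at least `min_count` times,
    plus rows where the value is missing."""
    counts = df[column].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df[column].isin(frequent) | df[column].isna()]

# Toy data: "Techer" stands in for a rare, probably mistyped value.
demo = pd.DataFrame({"Profession": ["Teacher"] * 3 + ["Techer", None]})
cleaned = drop_rare_categories(demo, "Profession", min_count=2)
print(cleaned["Profession"].value_counts(dropna=False))
```

In the notebook the same helper could be applied in a loop over the thresholds used above, for example `{"Profession": 12, "Sleep Duration": 15, "Dietary Habits": 15, "Degree": 10}`.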

### check for outliers in test data

In [8]:
## check possible values for possible categorical columns
categories = [
    "Gender",
    "city_type",
    "Working Professional or Student",
    "Profession",
    "Academic Pressure",
    "Work Pressure",
    "Study Satisfaction",
    "Job Satisfaction",
    "Sleep Duration",
    "Dietary Habits",
    "Degree",
    "Have you ever had suicidal thoughts ?",
    "Work/Study Hours",
    "Financial Stress",
    "Family History of Mental Illness",
]
outlier_tests = pd.DataFrame()
for category in categories:
    test_unique = pd.Series(test[category].unique())  # Convert to Series
    train_unique = pd.Series(train[category].unique())  # Convert to Series
    if len(test_unique[~test_unique.isin(train_unique)]) > 0:
        print(category)
        outlier_test = test[
            test[category].isin(test_unique[~test_unique.isin(train_unique)])
        ]
        print(len(outlier_test))
        print(outlier_test)
        outlier_tests = pd.concat([outlier_tests, outlier_test])
        print("\n")
# Get unique rows of the DataFrame
outlier_tests = outlier_tests.drop_duplicates()
print(outlier_tests)
print(outlier_tests.shape)

Profession
50
           id         Name  Gender   Age           City  \
561    141166          Dev    Male  44.0         Meerut   
1467   141952        Pooja  Female  47.0         Meerut   
2429   142797       Prachi  Female  18.0        Chennai   
2847   143165        Anand    Male  26.0       Srinagar   
2848   143165        Anand    Male  26.0       Srinagar   
2849   143165        Anand    Male  26.0       Srinagar   
2850   143165        Anand    Male  26.0       Srinagar   
17292  155969          Ira  Female  30.0           Agra   
20115  158433        Pooja  Female  35.0       Varanasi   
25030  162654       Shivam    Male  49.0         Rajkot   
29716  166722     Siddhesh    Male  42.0         Nagpur   
31915  168672        Shrey    Male  21.0      Hyderabad   
32535  169179          Ira  Female  48.0      Hyderabad   
33445  169985       Yogesh    Male  28.0         Nashik   
35356  171693       Prachi    Male  27.0       Vadodara   
40559  176256     Pratyush    Male  48.0  

Finding: 245 rows in the test data contain values in some columns that never appear in the training data (probably typos).<br>
Next step: replace these values with the most frequent value from the training data.


### use the most frequent value in train data for these outlier rows

In [9]:
# Loop through each category to identify and handle outliers
for category in categories:
    # Get unique values for the category in test and train datasets
    test_unique = pd.Series(test[category].unique())  # Convert to Series
    train_unique = pd.Series(train[category].unique())  # Convert to Series

    # Identify outlier values (present in test but not in train)
    unmatched_values = test_unique[~test_unique.isin(train_unique)]

    # Check if there are any outliers
    if len(unmatched_values) > 0:
        outliers_count = len(test[test[category].isin(unmatched_values)])

        print(f"{outliers_count} Outliers found in category: {category}")

        # Get the mode (most frequent value) for the category in train
        most_frequent_value = train[category].mode()[
            0
        ]  # Mode always returns a Series, take the first value

        # Replace the outlier values in test data with the most frequent value from train
        test.loc[test[category].isin(unmatched_values), category] = most_frequent_value

        print(
            f"Replaced outlier values in category '{category}' with '{most_frequent_value}'\n"
        )

50 Outliers found in category: Profession
Replaced outlier values in category 'Profession' with 'Teacher'

70 Outliers found in category: Sleep Duration
Replaced outlier values in category 'Sleep Duration' with 'Less than 5 hours'

31 Outliers found in category: Dietary Habits
Replaced outlier values in category 'Dietary Habits' with 'Moderate'

94 Outliers found in category: Degree
Replaced outlier values in category 'Degree' with 'Class 12'
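
The replacement step above can also be written as a small column-level function. The sketch below is one possible variant, not the notebook's code: the name `replace_unseen` and the toy values are hypothetical, and unlike the loop above it leaves missing values untouched so that they can be imputed separately.

```python
import pandas as pd

def replace_unseen(train_col, test_col):
    """Replace test values never seen in train with the train mode.
    Missing values in test are left as they are."""
    seen = set(train_col.dropna().unique())
    fallback = train_col.mode()[0]
    unseen_mask = test_col.notna() & ~test_col.isin(seen)
    return test_col.mask(unseen_mask, fallback), int(unseen_mask.sum())

# Toy data: "Modrate" stands in for a typo that exists only in the test set.
train_col = pd.Series(["Moderate", "Moderate", "Healthy"])
test_col = pd.Series(["Healthy", "Modrate", None])
fixed, n_replaced = replace_unseen(train_col, test_col)
print(n_replaced, fixed.tolist())
```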



### check for NAs

In [10]:
## split data based on Working Professional or Student
## (.copy() makes each subset an independent DataFrame, so the fills below do not write into a slice of train/test)
train_student = train[train["Working Professional or Student"] == "Student"].copy()
train_professional = train[
    train["Working Professional or Student"] == "Working Professional"
].copy()
test_student = test[test["Working Professional or Student"] == "Student"].copy()
test_professional = test[
    test["Working Professional or Student"] == "Working Professional"
].copy()

# Define the columns to process for each DataFrame
train_student_na_columns = ["Academic Pressure", "CGPA", "Study Satisfaction", "Dietary Habits", "Financial Stress"]
train_professional_na_columns = ["Work Pressure", "Job Satisfaction", "Profession", "Dietary Habits", "Degree", "Financial Stress"]
test_student_na_columns = ["Academic Pressure", "CGPA", "Study Satisfaction", "Dietary Habits", "Degree"]
test_professional_na_columns = ["Work Pressure", "Job Satisfaction", "Profession", "Dietary Habits", "Degree"]

# Replace missing values in train_student for the listed columns
for col in train_student_na_columns:
    train_student.loc[:, col] = train_student[col].fillna(train_student[col].mode()[0])

# Replace missing values in train_professional for the listed columns
for col in train_professional_na_columns:
    train_professional.loc[:, col] = train_professional[col].fillna(train_professional[col].mode()[0])

# Replace missing values in test_student for the listed columns
for col in test_student_na_columns:
    test_student.loc[:, col] = test_student[col].fillna(test_student[col].mode()[0])

# Replace missing values in test_professional for the listed columns
for col in test_professional_na_columns:
    test_professional.loc[:, col] = test_professional[col].fillna(test_professional[col].mode()[0])

# check nas in updated DataFrames
print("Updated train_student:")
print(train_student.isna().sum())

print("\nUpdated train_professional:")
print(train_professional.isna().sum())

print("\nUpdated test_student:")
print(test_student.isna().sum())

print("\nUpdated test_professional:")
print(test_professional.isna().sum())


Updated train_student:
id                                           0
Name                                         0
Gender                                       0
Age                                          0
City                                         0
Working Professional or Student              0
Profession                               32938
Academic Pressure                            0
Work Pressure                            32968
CGPA                                         0
Study Satisfaction                           0
Job Satisfaction                         32960
Sleep Duration                               0
Dietary Habits                               0
Degree                                       0
Have you ever had suicidal thoughts ?        0
Work/Study Hours                             0
Financial Stress                             0
Family History of Mental Illness             0
Depression                                   0
city_population                      

### combine data

In [11]:
train_new = pd.concat([train_student, train_professional])
test_new = pd.concat([test_student, test_professional])
print(train.shape)
print(test.shape)
print(train_new.shape)
print(test_new.shape)
print(train_new.head())
print(test_new.head())
train_new.to_csv("train_new.csv", index=False)
test_new.to_csv("test_new.csv", index=False)

(159917, 22)
(107315, 21)
(159917, 22)
(107315, 21)
    id       Name  Gender   Age           City  \
2    2     Yuvraj    Male  33.0  Visakhapatnam   
8    8  Aishwarya  Female  24.0      Bangalore   
27  26     Aditya    Male  31.0       Srinagar   
28  26     Aditya    Male  31.0       Srinagar   
29  26     Aditya    Male  31.0       Srinagar   

   Working Professional or Student Profession  Academic Pressure  \
2                          Student        NaN                5.0   
8                          Student        NaN                2.0   
27                         Student        NaN                3.0   
28                         Student        NaN                3.0   
29                         Student        NaN                3.0   

    Work Pressure  CGPA  Study Satisfaction  Job Satisfaction  \
2             NaN  8.97                 2.0               NaN   
8             NaN  5.90                 5.0               NaN   
27            NaN  7.03                 5.0

## Build Basic Model

In [12]:
print(train_student["Depression"].value_counts())

Depression
1    19126
0    13845
Name: count, dtype: int64


In [13]:
print(train_professional["Depression"].value_counts())

Depression
0    116559
1     10387
Name: count, dtype: int64


In [14]:
from sklearn.model_selection import train_test_split
from category_encoders import TargetEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

def preprocess_data(train_data, test_data, features, target_column):
    """
    Preprocesses the train and test data:
    - Selects specific features
    - Encodes categorical variables
    """
    # Select specified features
    X_train = train_data[features]
    y_train = train_data[target_column]
    X_test = test_data[features]

    # Encode categorical variables using Target Encoding
    categorical_cols = X_train.select_dtypes(include=['object']).columns
    encoder = TargetEncoder(cols=categorical_cols)
    X_train_encoded = encoder.fit_transform(X_train, y_train)
    X_test_encoded = encoder.transform(X_test)

    return X_train_encoded, y_train, X_test_encoded

def train_and_predict(X_train, y_train, X_test, output_column_name):
    """
    Trains a stacking classifier and makes predictions on the test data.
    """
    # Train-test split for validation
    X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

    # Define base models
    base_models = [
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))
    ]

    # Define meta-model
    meta_model = LogisticRegression()

    # Build stacking classifier
    stacking_clf = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
    stacking_clf.fit(X_train_split, y_train_split)

    # Validate the model
    y_val_pred = stacking_clf.predict(X_val)
    accuracy = accuracy_score(y_val, y_val_pred)
    print(f"Validation Accuracy ({output_column_name}): {accuracy:.2f}")

    # Predict on the test data
    predictions = stacking_clf.predict(X_test)

    return predictions

# Features for student and professional models
student_features = [
    'Gender', 'Age', 'city_type', 'Academic Pressure', 'CGPA', 'Study Satisfaction', 'Sleep Duration', 
    'Dietary Habits', 'Degree', 'Have you ever had suicidal thoughts ?', 'Work/Study Hours', 
    'Financial Stress', 'Family History of Mental Illness'
]
professional_features = [
    'Gender', 'Age', 'city_type', 'Profession', 'Work Pressure', 'Job Satisfaction', 'Sleep Duration', 
    'Dietary Habits', 'Degree', 'Have you ever had suicidal thoughts ?', 'Work/Study Hours', 
    'Financial Stress', 'Family History of Mental Illness'
]

# Preprocess and train models
if __name__ == "__main__":
    # Train on train_student and predict on test_student
    X_train_student, y_train_student, X_test_student = preprocess_data(train_student, test_student, student_features, 'Depression')
    student_predictions = train_and_predict(X_train_student, y_train_student, X_test_student, 'Student')

    # Train on train_professional and predict on test_professional
    X_train_professional, y_train_professional, X_test_professional = preprocess_data(train_professional, test_professional, professional_features, 'Depression')
    professional_predictions = train_and_predict(X_train_professional, y_train_professional, X_test_professional, 'Professional')

    # Add predictions to respective test datasets
    test_student['Depression_Predicted'] = student_predictions
    test_professional['Depression_Predicted'] = professional_predictions

    # Combine the results
    test_combined = pd.concat([test_student, test_professional], ignore_index=True)

    # Output only 'id' and 'Depression_Predicted'
    output_file = 'test_new_with_predictions.csv'
    output_df = test_combined[['id', 'Depression_Predicted']]
    output_df.to_csv(output_file, index=False)

    # Display the first few rows of the output
    print(output_df.head())


Validation Accuracy (Student): 0.87
Validation Accuracy (Professional): 0.97
       id  Depression_Predicted
0  140703                     1
1  140708                     0
2  140719                     1
3  140720                     1
4  140721                     1


Finding: the validation accuracy for professionals (0.97) is much higher than the validation accuracy for students (0.87).<br>
Possible next step: analyze the features separately for each group, since features that explain the target well for professionals may not work as well for students.
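
Part of this gap may come from class balance rather than from the features. The two groups have very different class distributions (cells In [12] and In [13]), so a model that always predicts the majority class already scores quite differently on each. The sketch below computes that majority-class baseline from the counts printed above. It uses the full training counts, so the baseline on the random 20% validation split will only be approximately the same.

```python
# Class counts copied from the value_counts() output of cells In [12] and In [13].
class_counts = {
    "Student": {0: 13845, 1: 19126},
    "Professional": {0: 116559, 1: 10387},
}
# Validation accuracies as printed by the stacking model (rounded to two decimals there).
reported_accuracy = {"Student": 0.87, "Professional": 0.97}

baselines = {}
for group, counts in class_counts.items():
    # Accuracy of always predicting the most frequent class.
    baselines[group] = max(counts.values()) / sum(counts.values())
    gain = reported_accuracy[group] - baselines[group]
    print(f"{group}: majority baseline = {baselines[group]:.3f}, "
          f"gain over baseline = {gain:.3f}")
```

Read this way, the student model improves more over its baseline than the professional model does, even though its raw accuracy is lower. Metrics that are less sensitive to class balance, such as F1 or balanced accuracy, may be worth reporting alongside accuracy.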

## Try to build meaningful models

### Analyze the impact of each feature on target

In [15]:
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency, fisher_exact
import warnings

# Ignore warnings for clean output
warnings.filterwarnings('ignore')

def analyze_numerical_feature(data, feature, target):
    """
    Performs Mann-Whitney U test on a numerical feature to compare distributions between target groups.
    """
    group0 = data[data[target] == 0][feature]
    group1 = data[data[target] == 1][feature]
    
    # Perform Mann-Whitney U Test
    stat, p = mannwhitneyu(group0, group1, alternative='two-sided')
    print(f"{feature}:\n\tMann-Whitney U Test p-value = {p:.5f}")
    
    # Interpretation
    group0_mean = group0.mean()
    group1_mean = group1.mean()
    print(f"\tMean for target=0: {group0_mean:.3f}, Mean for target=1: {group1_mean:.3f}")
    return p

def analyze_categorical_feature(data, feature, target):
    """
    Analyzes categorical features using Fisher's Exact Test for binary features and Chi-Squared Test for multi-category features.
    Also provides percentage distributions by category.
    """
    contingency_table = pd.crosstab(data[feature], data[target])
    print(f"\nContingency Table for {feature}:\n{contingency_table}")
    
    if contingency_table.shape == (2, 2):
        # Binary feature: Perform Fisher's Exact Test and calculate Odds Ratio
        odds_ratio, p = fisher_exact(contingency_table)
        print(f"\tFisher's Exact Test p-value = {p:.5f}, Odds Ratio = {odds_ratio:.3f}")
        
        # Interpretation of Odds Ratio
        category_names = contingency_table.index.tolist()
        if odds_ratio > 1:
            print(f"\t{category_names[1]} is {odds_ratio:.3f} times more likely to have {target}=1 than {category_names[0]}.")
        elif odds_ratio < 1:
            print(f"\t{category_names[1]} is {1/odds_ratio:.3f} times less likely to have {target}=1 than {category_names[0]}.")
        else:
            print(f"\t{category_names[1]} and {category_names[0]} are equally likely to have {target}=1.")
        return p
    else:
        # Multi-category feature: Perform Chi-Squared Test
        chi2, p, dof, expected = chi2_contingency(contingency_table)
        print(f"\tChi-Squared Test p-value = {p:.5f}, Chi2 Statistic = {chi2:.3f}, DOF = {dof}")
        
        # Interpretation of Chi-Squared Test
        if p < 0.05:
            print(f"\tThe distribution of {target} is significantly different across categories of {feature}.")
        else:
            print(f"\tThe distribution of {target} is not significantly different across categories of {feature}.")
        
        # Percentage Distribution for each category
        percentages = (contingency_table.div(contingency_table.sum(axis=1), axis=0) * 100).round(2)
        print("\nPercentage Distribution by Category (rounded to two decimal places):")
        print(percentages)
        
        # Highlight the category with the highest target=1 percentage
        if 1 in percentages.columns:
            highest_category = percentages[1].idxmax()
            print(f"\tCategory with the highest {target}=1 percentage: {highest_category} ({percentages[1].max():.2f}%)")
        return p

def perform_statistical_analysis(data, features, target):
    """
    Performs statistical analysis on a dataset, evaluating the impact of numerical and categorical features on the target.
    """
    print(f"\n=== Statistical Analysis for Dataset ===\n")
    
    # Separate numerical and categorical features
    numerical_features = data[features].select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_features = [col for col in features if col not in numerical_features]
    
    significant_numerical_features = []
    significant_categorical_features = []
    
    alpha = 0.05  # Significance level
    
    # Analyze numerical features
    print("\nNumerical Features Analysis:")
    for feature in numerical_features:
        p_value = analyze_numerical_feature(data, feature, target)
        if p_value < alpha:
            significant_numerical_features.append(feature)
    
    # Analyze categorical features
    print("\nCategorical Features Analysis:")
    for feature in categorical_features:
        p_value = analyze_categorical_feature(data, feature, target)
        if p_value is not None and p_value < alpha:
            significant_categorical_features.append(feature)
    
    print("\nSignificant Numerical Features:", significant_numerical_features)
    print("Significant Categorical Features:", significant_categorical_features)
    return significant_numerical_features, significant_categorical_features

# Main execution
if __name__ == "__main__":
    # Real Features for Students
    student_features = [
        'Gender', 'Age', 'city_type', 'Academic Pressure', 'CGPA', 'Study Satisfaction',
        'Sleep Duration', 'Dietary Habits', 'Degree', 'Have you ever had suicidal thoughts ?',
        'Work/Study Hours', 'Financial Stress', 'Family History of Mental Illness'
    ]
    
    # Real Features for Professionals
    professional_features = [
        'Gender', 'Age', 'city_type', 'Profession', 'Work Pressure', 'Job Satisfaction',
        'Sleep Duration', 'Dietary Habits', 'Degree', 'Have you ever had suicidal thoughts ?',
        'Work/Study Hours', 'Financial Stress', 'Family History of Mental Illness'
    ]
    
    target = 'Depression'
    
    print("\n=== Statistical Analysis for train_student ===")
    significant_numerical_features_student, significant_categorical_features_student = perform_statistical_analysis(
        train_student, student_features, target
    )
    
    print("\n=== Statistical Analysis for train_professional ===")
    significant_numerical_features_prof, significant_categorical_features_prof = perform_statistical_analysis(
        train_professional, professional_features, target
    )



=== Statistical Analysis for train_student ===

=== Statistical Analysis for Dataset ===


Numerical Features Analysis:
Age:
	Mann-Whitney U Test p-value = 0.00000
	Mean for target=0: 27.232, Mean for target=1: 24.919
Academic Pressure:
	Mann-Whitney U Test p-value = 0.00000
	Mean for target=0: 2.364, Mean for target=1: 3.681
CGPA:
	Mann-Whitney U Test p-value = 0.00002
	Mean for target=0: 7.632, Mean for target=1: 7.704
Study Satisfaction:
	Mann-Whitney U Test p-value = 0.00000
	Mean for target=0: 3.224, Mean for target=1: 2.761
Work/Study Hours:
	Mann-Whitney U Test p-value = 0.00000
	Mean for target=0: 6.221, Mean for target=1: 7.773
Financial Stress:
	Mann-Whitney U Test p-value = 0.00000
	Mean for target=0: 2.523, Mean for target=1: 3.586

Categorical Features Analysis:

Contingency Table for Gender:
Depression     0      1
Gender                 
Female      6109   8647
Male        7736  10479
	Fisher's Exact Test p-value = 0.05090, Odds Ratio = 0.957
	Male is 1.045 times less l
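
To make the odds ratio easier to read, the sketch below recomputes it by hand from the Gender contingency table printed above and compares it with the plain depression rate per group. It assumes the table layout shown (rows Female/Male, columns 0/1) and uses the simple cross-product ratio.

```python
# Counts copied from the Gender contingency table printed above (students only).
female_no, female_yes = 6109, 8647
male_no, male_yes = 7736, 10479

# Cross-product ratio for the table [[female_no, female_yes], [male_no, male_yes]].
# A value below 1 means the odds of Depression=1 are lower for the second row (Male).
odds_ratio = (female_no * male_yes) / (female_yes * male_no)

# Plain proportions are often easier to interpret than an odds ratio.
female_rate = female_yes / (female_no + female_yes)
male_rate = male_yes / (male_no + male_yes)

print(f"odds ratio = {odds_ratio:.3f}")
print(f"depression rate: female = {female_rate:.3f}, male = {male_rate:.3f}")
```

Note that an odds ratio compares odds, not probabilities, so "1.045 times less likely" is a loose reading; the gap between the two rates is about one percentage point.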

### Based on the analysis above, we have these findings:<br>

#### Observations:<br>

##### For students:<br>
- Younger students are more likely to have depression, probably because they have less experience and are less mature than older students in dealing with mental health problems.<br>
- Students with higher academic pressure are more likely to have depression.<br>
- Students with a higher CGPA are slightly more likely to have depression (mean CGPA 7.70 vs. 7.63). A possible reason is that they have higher expectations of themselves and feel depressed when they cannot meet them, whereas students with a lower CGPA may care less about their grades and put less pressure on themselves.<br>
- Students with lower study satisfaction are more likely to have depression.<br>
- Students who study longer hours are more likely to have depression. This is similar to students with a higher CGPA, who have higher expectations of themselves.<br>
- Students with higher financial stress are more likely to have depression.<br>
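- Female students appear slightly more likely to have depression than male students (58.6% vs. 57.5%), but the difference is not statistically significant at the 0.05 level (Fisher's exact test p-value = 0.0509). If the gap is real, hormonal differences are one possible factor.<br>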
- Students in big cities (population above 2 million) are more likely to have depression, probably because living expenses and financial pressure are higher in big cities, and because big cities attract more talent, which raises peer pressure.<br>
- Students who sleep the least are more likely to have depression, which suggests that sleep is essential for mental health.<br>
- Students with unhealthy dietary habits are much more likely to have depression, which suggests that diet is very important for mental health.<br>
- Students whose degree is Class 12 are much more likely to have depression than students with other degrees. This is similar to the pattern for younger students.<br>
- Students who have ever had suicidal thoughts are much more likely to have depression.<br>
- Students with a family history of mental illness are more likely to have depression. It should be emphasized, however, that family history matters far less than lifestyle factors such as sleep and dietary habits.<br>

##### For working professionals<br>
- Younger working professionals are much more likely to have depression. This is similar to the pattern for younger students.<br>
- Working professionals with higher work pressure are more likely to have depression.<br>
- Working professionals with lower job satisfaction are more likely to have depression.<br>
- Working professionals who work longer hours are more likely to have depression, probably because their employers push them to work overtime.<br>
- Working professionals with higher financial stress are more likely to have depression.<br>
- Gender is not a statistically significant factor in depression among working professionals. We cannot be sure what changes for women between student life and working life, but it is encouraging that they are no longer more likely than men to experience depression. Another possible explanation is that men face a larger increase in pressure when they move from studying to working; for instance, traditional mindsets often hold that men should bear the family's financial responsibilities.<br>
- City is no longer a statistically significant factor in depression among working professionals. This differs from our earlier assumption and is worth noting. We previously thought that big-city stressors such as financial pressure made mental health issues more common there, but this finding indicates that people in smaller cities face mental health challenges too. For instance, they may lack sufficient sources of income, making it difficult to meet even basic needs.<br>
- As for working professionals of different professions, Graphic Designers are the most likely to have depression.<br>
- Working professionals who sleep the least are more likely to have depression. This is similar to what happens for students. <br>
- Working professionals with unhealthy dietary habits are much more likely to have depression. This is similar to what happens for students. <br>
- As for degree, working professionals whose highest qualification is Class 12 are the most likely to have depression, probably because they have the least education and therefore find it harder to land a decent job and support their families.<br>
- Working professionals who have ever had suicidal thoughts are much more likely to have depression.<br>
- Working professionals with a family history of mental illness are more likely to have depression. It is worth emphasizing, however, that family history matters far less than nurture-related factors such as sleep and dietary habits.<br>
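
One common way to check whether a categorical factor such as gender or city size is statistically significant is a chi-square test of independence on the factor-by-depression contingency table. The sketch below uses invented counts and is not necessarily the test used earlier in this notebook; the helper name `chi2_p_value` and the 0.05 threshold are assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts: (sleep duration, gender, depression label, number of rows).
cells = [
    ("Less than 5 hours", "Male", 1, 20), ("Less than 5 hours", "Female", 1, 20),
    ("Less than 5 hours", "Male", 0, 30), ("Less than 5 hours", "Female", 0, 30),
    ("More than 8 hours", "Male", 1, 5), ("More than 8 hours", "Female", 1, 5),
    ("More than 8 hours", "Male", 0, 45), ("More than 8 hours", "Female", 0, 45),
]
# Expand the counts into one row per (toy) respondent.
rows = [(s, g, d) for s, g, d, n in cells for _ in range(n)]
toy = pd.DataFrame(rows, columns=["Sleep Duration", "Gender", "Depression"])

def chi2_p_value(df, factor, target="Depression"):
    """Hypothetical helper: p-value of a chi-square independence test."""
    table = pd.crosstab(df[factor], df[target])  # observed counts
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

p_gender = chi2_p_value(toy, "Gender")
p_sleep = chi2_p_value(toy, "Sleep Duration")
print(f"Gender p-value: {p_gender:.4f}")
print(f"Sleep Duration p-value: {p_sleep:.4f}")
```

In this toy data the depression rate is identical for both genders but differs sharply by sleep duration, so only the sleep factor falls below the assumed 0.05 threshold. With a sample as large as the training set, even small differences can reach significance, so effect sizes deserve attention alongside p-values.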

#### Persona:<br>
- As for students, the persona more likely to experience depression is a young female living in a big city, facing higher academic pressure, maintaining a higher CGPA, studying more, but with lower study satisfaction and greater financial stress. They can make a significant difference in their well-being by getting more sleep and adopting healthier dietary habits.<br>
- Among working professionals, the persona more likely to experience depression is young, holds a lower educational degree such as Class 12, works longer hours under higher work pressure, experiences lower job satisfaction, and faces greater financial stress. Society should recognize that depression affects working professionals across all cities and genders. These personas can make a significant difference in their well-being by getting more sleep and adopting healthier dietary habits.<br>

#### Suggestions:<br>
- Students themselves, families, education systems, and communities should pay more attention to these personas. Given the high correlation between depression and suicidal tendencies, they should recognize that mental health is far more important than academic results like CGPA. Implementing measures such as encouraging students to sleep more and adopt healthier dietary habits can significantly improve both their physical and psychological well-being.<br>
- Working professionals, being generally more mature than students, should take greater responsibility for their mental health. Those personas more susceptible to depression should be particularly cautious. At the same time, society should pay more attention to the mental health issues of working professionals—which are common across different cities and genders—instead of merely pressuring them to work overtime. Key suggestions for working professionals to improve their mental health include getting more sleep and adopting healthier dietary habits.<br>