**📋 TABLE OF CONTENTS**
1. **📦 Libraries and Setup**
   - Importing the required libraries and setting up the environment
2. **🛠️ Data Loading and Initial Analysis**
   - Loading the training and test datasets.
   - Performing basic checks for missing and duplicate values.
3. **🔧 Preprocessing Pipeline**
   - Custom transformers and pipeline construction for feature engineering and cleaning.
4. **🚂 Model Training with Stratified Cross-Validation**
   - Training models with stratified cross-validation for reliable evaluation.
5. **🎯 Best Model Selection and Performance**
   - Identifying the best model based on cross-validation accuracy.
   - Evaluating the model on the training dataset.
6. **📤 Saving Predictions**
   - Exporting predictions for the best model and all other models.
   - Saving results to CSV files.
7. **📊 Predicted Probability Visualization**
   - Visualizing the distribution of predicted probabilities for the test set.
8. **🎉 Conclusions and Next Steps**
   - Summarizing key findings and outlining potential improvements.

**Table of Contents**
My approach to the competition will include the following steps:
1. **Data Exploration and Visualization**: Gain initial insights into the dataset
2. **Feature Engineering**: Create and select meaningful features
3. **Model Training and Tuning**: Build, evaluate, and tune models
4. **Submission Preparation**: Prepare final predictions for submission

In [21]:
# Importing the Necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings

Loading the train and test CSV files

In [5]:
# import pandas to read the csv
import pandas as pd

training_data = pd.read_csv("train.csv")
testing_data = pd.read_csv("test.csv")

In [None]:
# previewing the first five rows of each dataset
training_data.head()

Unnamed: 0,id,Name,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,0,Aaradhya,Female,49.0,Ludhiana,Working Professional,Chef,,5.0,,,2.0,More than 8 hours,Healthy,BHM,No,1.0,2.0,No,0
1,1,Vivan,Male,26.0,Varanasi,Working Professional,Teacher,,4.0,,,3.0,Less than 5 hours,Unhealthy,LLB,Yes,7.0,3.0,No,1
2,2,Yuvraj,Male,33.0,Visakhapatnam,Student,,5.0,,8.97,2.0,,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
3,3,Yuvraj,Male,22.0,Mumbai,Working Professional,Teacher,,5.0,,,1.0,Less than 5 hours,Moderate,BBA,Yes,10.0,1.0,Yes,1
4,4,Rhea,Female,30.0,Kanpur,Working Professional,Business Analyst,,1.0,,,1.0,5-6 hours,Unhealthy,BBA,Yes,9.0,4.0,Yes,0


In [12]:
testing_data.head()

Unnamed: 0,id,Name,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness
0,140700,Shivam,Male,53.0,Visakhapatnam,Working Professional,Judge,,2.0,,,5.0,Less than 5 hours,Moderate,LLB,No,9.0,3.0,Yes
1,140701,Sanya,Female,58.0,Kolkata,Working Professional,Educational Consultant,,2.0,,,4.0,Less than 5 hours,Moderate,B.Ed,No,6.0,4.0,No
2,140702,Yash,Male,53.0,Jaipur,Working Professional,Teacher,,4.0,,,1.0,7-8 hours,Moderate,B.Arch,Yes,12.0,4.0,No
3,140703,Nalini,Female,23.0,Rajkot,Student,,5.0,,6.84,1.0,,More than 8 hours,Moderate,BSc,Yes,10.0,4.0,No
4,140704,Shaurya,Male,47.0,Kalyan,Working Professional,Teacher,,5.0,,,5.0,7-8 hours,Moderate,BCA,Yes,3.0,4.0,No


In [13]:
# Training data has 20 columns (including the Depression column). The first column is the participant's id
# There are a total of 140,700 entries (rows); several columns have fewer non-null values
training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140700 entries, 0 to 140699
Data columns (total 20 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   id                                     140700 non-null  int64  
 1   Name                                   140700 non-null  object 
 2   Gender                                 140700 non-null  object 
 3   Age                                    140700 non-null  float64
 4   City                                   140700 non-null  object 
 5   Working Professional or Student        140700 non-null  object 
 6   Profession                             104070 non-null  object 
 7   Academic Pressure                      27897 non-null   float64
 8   Work Pressure                          112782 non-null  float64
 9   CGPA                                   27898 non-null   float64
 10  Study Satisfaction                     27897 non-null   

In [None]:
# Testing data is missing the depression column, as expected, as this is the column to be predicted
# Testing data has 19 columns. The first column is the participant's id
# There are a total of 93,800 entries for each column
testing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93800 entries, 0 to 93799
Data columns (total 19 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   id                                     93800 non-null  int64  
 1   Name                                   93800 non-null  object 
 2   Gender                                 93800 non-null  object 
 3   Age                                    93800 non-null  float64
 4   City                                   93800 non-null  object 
 5   Working Professional or Student        93800 non-null  object 
 6   Profession                             69168 non-null  object 
 7   Academic Pressure                      18767 non-null  float64
 8   Work Pressure                          75022 non-null  float64
 9   CGPA                                   18766 non-null  float64
 10  Study Satisfaction                     18767 non-null  float64
 11  Jo
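The column difference noted in the comments above can also be checked programmatically rather than by reading the two `info()` outputs. The following is a minimal sketch on two tiny made-up frames that stand in for `training_data` and `testing_data`; the names `train` and `test` and all values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy stand-ins for training_data and testing_data (illustrative values only)
train = pd.DataFrame({"id": [0, 1], "Age": [49.0, 26.0], "Depression": [0, 1]})
test = pd.DataFrame({"id": [140700, 140701], "Age": [53.0, 58.0]})

# Columns present in one frame but not the other, in original column order
only_in_train = [c for c in train.columns if c not in test.columns]
only_in_test = [c for c in test.columns if c not in train.columns]

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

On the real data, the same two list comprehensions applied to `training_data` and `testing_data` should report only the target column as missing from the test set, if the files match the `info()` outputs shown.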

DATASET OVERVIEW  
**Feature Descriptions**
1. Name - Participant's name
2. Gender - Participant's gender (Male or Female)
3. Age - Participant's age
4. City - Participant's city of residence
5. Working Professional or Student - Indicates whether the participant is a working professional or a student
6. Profession - Participant's current profession
7. Academic Pressure - Level of academic workload, rated on a scale from 1 to 5
8. Work Pressure - Level of work-related workload, rated on a scale from 1 to 5
9. CGPA - Cumulative Grade Point Average
10. Study Satisfaction - Satisfaction level with studies, rated on a scale from 1 to 5
11. Job Satisfaction - Satisfaction level with job, rated on a scale from 1 to 5
12. Sleep Duration - Average hours of sleep per night
13. Dietary Habits - Information about the participant's eating habits
14. Degree - Highest degree obtained by the participant
15. Have you ever had suicidal thoughts? - Indicates whether the participant has had suicidal thoughts (Yes/No)
16. Work/Study Hours - Average daily hours spent on work or study
17. Financial Stress - Level of financial stress, rated on a scale from 1 to 5
18. Family History of Mental Illness - Indicates whether there is a family history of mental illness (Yes/No)
19. Depression - Represents whether the participant is at risk of depression (Yes/No, stored as 1/0 in the data), based on lifestyle and demographic factors

Checking for Missing Data

In [19]:
training_data.isna().sum()

id                                            0
Name                                          0
Gender                                        0
Age                                           0
City                                          0
Working Professional or Student               0
Profession                                36630
Academic Pressure                        112803
Work Pressure                             27918
CGPA                                     112802
Study Satisfaction                       112803
Job Satisfaction                          27910
Sleep Duration                                0
Dietary Habits                                4
Degree                                        2
Have you ever had suicidal thoughts ?         0
Work/Study Hours                              0
Financial Stress                              4
Family History of Mental Illness              0
Depression                                    0
dtype: int64

Several columns in the training data have missing values. These are:
- Profession (36,630 missing values)
- Academic Pressure (112,803 missing values)
- Work Pressure (27,918 missing values)
- CGPA (112,802 missing values)
- Study Satisfaction (112,803 missing values)
- Job Satisfaction (27,910 missing values)
- Dietary Habits (4 missing values)
- Degree (2 missing values)
- Financial Stress (4 missing values)
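Raw counts are easier to judge when expressed as a share of all rows. The sketch below shows one way to build such a summary; `missing_summary` is a hypothetical helper introduced here for illustration, and the toy frame only stands in for `training_data`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, only for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for training_data (illustrative values only)
toy = pd.DataFrame({
    "Profession": ["Chef", None, "Teacher", None],
    "CGPA": [np.nan, 8.97, np.nan, np.nan],
    "Age": [49.0, 33.0, 22.0, 30.0],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(training_data)` would give the same kind of table for the real data, which makes it easier to separate columns that are mostly empty from columns with only a handful of gaps.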

In [17]:
# checking for null values. Whatever we do to the training data, we will do to the testing data
testing_data.isna().sum()

id                                           0
Name                                         0
Gender                                       0
Age                                          0
City                                         0
Working Professional or Student              0
Profession                               24632
Academic Pressure                        75033
Work Pressure                            18778
CGPA                                     75034
Study Satisfaction                       75033
Job Satisfaction                         18774
Sleep Duration                               0
Dietary Habits                               5
Degree                                       2
Have you ever had suicidal thoughts ?        0
Work/Study Hours                             0
Financial Stress                             0
Family History of Mental Illness             0
dtype: int64

The testing data has missing values in the following columns, although the counts differ from the training data. Unlike the training data, Financial Stress has no missing values here:
- Profession (24,632 missing values)
- Academic Pressure (75,033 missing values)
- Work Pressure (18,778 missing values)
- CGPA (75,034 missing values)
- Study Satisfaction (75,033 missing values)
- Job Satisfaction (18,774 missing values)
- Dietary Habits (5 missing values)
- Degree (2 missing values)
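Because the two files have different numbers of rows, raw counts are hard to compare directly. A side-by-side view of missing percentages is one way to check whether the two sets follow a similar pattern. The sketch below assumes a hypothetical helper `compare_missing` and uses small made-up frames in place of `training_data` and `testing_data`.

```python
import numpy as np
import pandas as pd

def compare_missing(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Missing-value percentages for columns shared by both frames."""
    shared = [c for c in train.columns if c in test.columns]
    out = pd.DataFrame({
        "train_pct": train[shared].isna().mean() * 100,
        "test_pct": test[shared].isna().mean() * 100,
    })
    # Keep columns that have missing values in at least one of the frames
    return out[(out["train_pct"] > 0) | (out["test_pct"] > 0)].round(2)

# Toy frames (illustrative values only); the target exists only in train
toy_train = pd.DataFrame({
    "Age": [49.0, 26.0, 33.0, 22.0],
    "Profession": ["Chef", None, "Teacher", None],
    "Financial Stress": [2.0, np.nan, 1.0, 4.0],
    "Depression": [0, 1, 1, 0],
})
toy_test = pd.DataFrame({
    "Age": [53.0, 58.0],
    "Profession": ["Judge", None],
    "Financial Stress": [3.0, 4.0],
})

comparison = compare_missing(toy_train, toy_test)
print(comparison)
```

Columns whose percentages are close in both frames can reasonably share one handling strategy, while a column that is missing in only one frame (as Financial Stress is in the real training data) needs a rule that still works when applied to the other.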

Let us assess each of the columns with missing values and see how to handle them:
- 1. Profession - This is a categorical column; individuals have different professions such as chef, teacher, business analyst, financial analyst, electrician, software engineer, etc. (CATEGORICAL NON-NUMERICAL)
- 2. Academic Pressure - This also appears to be categorical, taking values such as 5, 3, 2, 1 (CATEGORICAL NUMERICAL)
- 3. Work Pressure - Similar to Academic Pressure, this is categorical, taking values such as 5, 4, 3, 2, 1 (CATEGORICAL NUMERICAL)
- 4. CGPA - This is the grade point average, with values such as 8.97 (NUMERICAL NON-CATEGORICAL)
- 5. Job Satisfaction - This is also categorical, taking values such as 2 and 5 (CATEGORICAL NUMERICAL)
- 6. Dietary Habits - This is a categorical non-numerical column with values such as Moderate, Healthy, and Unhealthy (CATEGORICAL NON-NUMERICAL)
- 7. Degree - 