# 01 – Data Preprocessing
### Student Depression Prediction: Preparing Data for XGBoost Model

---

## 📌 Objective
Prepare and clean the student depression dataset for machine learning analysis using XGBoost classification.

**Key Tasks:**
- Load and explore raw data
- Handle missing values and outliers
- Encode categorical variables
- Scale numerical features
- Export clean dataset for modeling

---

### 📂 Input  
- `student_depression_dataset.csv` saved in `Data/raw/`


### 📦 Output  
- `clean_data.csv` saved in `Data/processed/FC110552_mithula-cbw/`

---

### 📊 Dataset Overview

| **Attribute** | **Details** |
|---------------|-------------|
| **Dataset Size** | 27,901 records × 18 features |
| **Data Type** | Structured tabular data (CSV format) |
| **Target Variable** | `Depression` (binary, encoded as 0/1) |
| **Problem Type** | Binary Classification |
| **Data Source** | [Student Depression Dataset](https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset) |

### 📈 Expected Outcomes
- Clean, processed dataset ready for machine learning

In [5]:
# =====================================
# STEP 1: DATA LOADING & EXPLORATION
# =====================================


# 1.1 - Load libraries and suppress warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

In [None]:

# 1.2 - Load the dataset
# Read the CSV file into a pandas DataFrame.

df = pd.read_csv("./Data/raw/student_depression_dataset.csv")

In [None]:
# Reusable helper functions
# Print the number of rows and columns of a DataFrame.
def get_data_shape(data: pd.DataFrame) -> None:
    if data.empty:
        print("DataFrame is empty.")
    else:
        # Print the shape of the DataFrame
        print("DataFrame Dimensions")
        print("------------------------")
        print(f"Rows   : {data.shape[0]}")
        print(f"Columns: {data.shape[1]}")

In [18]:
get_data_shape(df)

DataFrame Dimensions
------------------------
Rows   : 27901
Columns: 18


In [None]:
# 1.3 - Display first few rows (Provides a quick look at the dataset's content).
print("\n First 5 rows:")
display(df.head())


 First 5 rows:


Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,'5-6 hours',Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.9,5.0,0.0,'5-6 hours',Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,'Less than 5 hours',Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,'7-8 hours',Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,'5-6 hours',Moderate,M.Tech,Yes,1.0,1.0,No,0


💡 **Observations:**  
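- All column names are human-readable and semantically meaningful, although several contain spaces or punctuation (for example `Work/Study Hours` and `Have you ever had suicidal thoughts ?`).
- The `id` column does not increment sequentially (2, 8, 26, 30, 32 in the first five rows).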
- `Work Pressure` and `Job Satisfaction` have only 0.0 values in the initial rows.
- The `Profession` column appears to contain only "Student" values so far.
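
The observations above rest on only five rows. A minimal sketch of how they could be checked on a whole DataFrame follows. `value_summary` is a hypothetical helper, not part of this notebook, and `sample` simply reproduces a few columns from the five rows shown above so that the snippet runs on its own.

```python
import pandas as pd

# A few columns copied from the first five rows shown above.
# In the notebook, pass the full `df` instead of `sample`.
sample = pd.DataFrame({
    "id": [2, 8, 26, 30, 32],
    "Profession": ["Student"] * 5,
    "Work Pressure": [0.0] * 5,
    "Job Satisfaction": [0.0] * 5,
})


def value_summary(data: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Hypothetical helper: distinct-value count and share of the most common value."""
    rows = []
    for col in columns:
        counts = data[col].value_counts(dropna=False)  # sorted, most frequent first
        rows.append({
            "column": col,
            "n_unique": int(counts.size),
            "top_value": counts.index[0],
            "top_share": float(counts.iloc[0] / len(data)),
        })
    return pd.DataFrame(rows).set_index("column")


summary = value_summary(sample, ["Profession", "Work Pressure", "Job Satisfaction"])
print(summary)

# True only if `id` increases by exactly 1 from each row to the next.
id_is_sequential = bool((sample["id"].diff().dropna() == 1).all())
print("id sequential:", id_is_sequential)
```

Calling `value_summary(df, [...])` on the full DataFrame would show whether `Profession` really holds a single value and how dominant the 0.0 entries are in `Work Pressure` and `Job Satisfaction`.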


In [24]:
# Print basic statistics (mean, standard deviation, min, max, etc.) for the numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop('id', errors='ignore')  # numeric dtypes only, excluding `id`
display(df[numeric_cols].describe())

Unnamed: 0,Age,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Work/Study Hours,Depression
count,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0
mean,25.8223,3.141214,0.00043,7.656104,2.943837,0.000681,7.156984,0.585499
std,4.905687,1.381465,0.043992,1.470707,1.361148,0.044394,3.707642,0.492645
min,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,21.0,2.0,0.0,6.29,2.0,0.0,4.0,0.0
50%,25.0,3.0,0.0,7.77,3.0,0.0,8.0,1.0
75%,30.0,4.0,0.0,8.92,4.0,0.0,10.0,1.0
max,59.0,5.0,5.0,10.0,5.0,4.0,12.0,1.0


💡 **Observations:**  
- `Age`: the middle 50% of values lie between 21 and 30, consistent with a student population, although the maximum of 59 may be an outlier or a non-student entry.
- `Work Pressure` and `Job Satisfaction`: the means are close to 0 (about 0.0004 and 0.0007) and the 75th percentile is 0, while the maxima are 5 and 4. Almost every record is therefore 0, which suggests missing data, poor scaling, or inactive features.
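
These observations can be made more concrete with a few quick checks. The sketch below is illustrative only and rests on stated assumptions. It uses Tukey's 1.5 × IQR rule, which is a common convention and not necessarily the outlier method this notebook adopts later. The `demo` DataFrame is a made-up mini-sample, and the `"?"` entry is a hypothetical placeholder. The last check is motivated by the `describe()` output above, which omits `Financial Stress` even though the first rows show numeric-looking values. This suggests pandas did not read that column as a numeric dtype.

```python
import pandas as pd

# Quartiles for `Age` copied from the describe() output above.
AGE_Q1, AGE_Q3 = 21.0, 30.0


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def zero_share(series: pd.Series) -> float:
    """Fraction of entries that are exactly 0."""
    return float((series == 0).mean())


lower, upper = tukey_fences(AGE_Q1, AGE_Q3)
print(f"Age fences: {lower} to {upper}")

# Hypothetical mini-sample for illustration only; use the full `df` in the notebook.
demo = pd.DataFrame({
    "Age": [33.0, 24.0, 31.0, 28.0, 59.0],
    "Work Pressure": [0.0, 0.0, 0.0, 0.0, 5.0],
    "Financial Stress": ["1.0", "2.0", "1.0", "5.0", "?"],  # "?" is a made-up placeholder
})

# Ages above the upper fence are candidates for review, not automatic removals.
high_ages = demo.loc[demo["Age"] > upper, "Age"].tolist()
print("Ages above upper fence:", high_ages)

print("Share of zeros in Work Pressure:", zero_share(demo["Work Pressure"]))

# Coerce to numeric; entries that cannot be parsed become NaN and can be listed.
as_number = pd.to_numeric(demo["Financial Stress"], errors="coerce")
unparsed = demo.loc[as_number.isna(), "Financial Stress"].tolist()
print("Unparsed Financial Stress entries:", unparsed)
```

With the quartiles reported above, the upper fence is 30 + 1.5 × 9 = 43.5, so the maximum age of 59 falls outside it. Whether such rows should be kept is a modeling decision that the fence alone does not settle.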
