# 01 – Data Preprocessing
### Student Depression Prediction: Preparing Data for XGBoost Model

---

## 📌 Objective
Clean and prepare the student depression dataset so it can be used to train an XGBoost classifier.

**Key Tasks:**
- Load and explore raw data
- Handle missing values and outliers
- Encode categorical variables
- Scale numerical features
- Export clean dataset for modeling
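
The tasks above can be sketched end to end on a toy frame. This is a minimal illustration, not the notebook's actual pipeline: the function name `preprocess_sketch`, the toy columns, and the specific choices (median/mode imputation, 1.5 × IQR clipping, one-hot encoding, z-score scaling) are all assumptions made for the example.

```python
import pandas as pd


def preprocess_sketch(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Hypothetical outline of the key tasks; every technique here is an assumption."""
    out = df.copy()
    # Split feature columns by type, leaving the target untouched
    num_cols = out.select_dtypes(include="number").columns.drop(target, errors="ignore")
    cat_cols = out.select_dtypes(exclude="number").columns.drop(target, errors="ignore")

    # Missing values: median for numeric columns, most frequent value for categorical
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # Outliers: clip numeric values to the 1.5 * IQR fences
    for col in num_cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Encoding: one-hot encode categorical columns as 0/1 integers
    out = pd.get_dummies(out, columns=list(cat_cols), dtype=int)

    # Scaling: z-score each numeric column (skipped when the column is constant)
    for col in num_cols:
        std = out[col].std()
        if std > 0:
            out[col] = (out[col] - out[col].mean()) / std
    return out


# Toy data with made-up column names, used only to exercise the sketch
toy = pd.DataFrame({
    "age": [20.0, 22.0, None, 21.0, 95.0],
    "gender": ["F", "M", None, "F", "M"],
    "label": [0, 1, 0, 1, 0],
})
clean = preprocess_sketch(toy, target="label")
print(clean)
```

Exporting would then be a single `clean.to_csv(..., index=False)` call pointed at the output path listed below.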

---

### 📂 Input  
- `student_depression_dataset.csv` saved in `Data/raw/`


### 📦 Output  
- `clean_data.csv` saved in `Data/processed/FC110552_mithula-cbw/`

---

### 📊 Dataset Overview

| **Attribute** | **Details** |
|---------------|-------------|
| **Dataset Size** | 27,901 records × 18 features |
| **Data Type** | Structured tabular data (CSV format) |
| **Target Variable** | `Depression_Status` (Binary: 0/1 or Yes/No) |
| **Problem Type** | Binary Classification |
| **Data Source** | [Student Depression Dataset](https://www.kaggle.com/datasets/adilshamim8/student-depression-dataset) |

### 📈 Expected Outcomes
- A clean, processed dataset (`clean_data.csv`) ready for model training

In [5]:
# =====================================
# STEP 1: DATA LOADING & EXPLORATION
# =====================================


# 1.1 - Load libraries and suppress warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

In [None]:

# 1.2 - Load the dataset
# Read the CSV file into a pandas DataFrame.

df = pd.read_csv("./Data/raw/student_depression_dataset.csv")

In [None]:
# 1.3 - Helper to print the shape of a DataFrame
def get_data_shape(data: pd.DataFrame) -> None:
    """Print the number of rows and columns, or a warning if the frame is empty."""
    if data.empty:
        print("⚠️ DataFrame is empty.")
    else:
        rows, cols = data.shape
        print("DataFrame Dimensions")
        print("------------------------")
        print(f"Rows   : {rows}")
        print(f"Columns: {cols}")

In [18]:
get_data_shape(df)

DataFrame Dimensions
------------------------
Rows   : 27901
Columns: 18
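
The printed dimensions match the Dataset Overview table (27,901 records × 18 features). A small guard like the one below could make that check explicit before any cleaning starts. The helper name `check_shape` and the `EXPECTED_SHAPE` constant are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Dimensions taken from the Dataset Overview table
EXPECTED_SHAPE = (27901, 18)


def check_shape(data: pd.DataFrame, expected: tuple = EXPECTED_SHAPE) -> bool:
    """Return True when the frame has the documented (rows, columns) shape."""
    if data.shape != expected:
        print(f"Shape mismatch: got {data.shape}, expected {expected}")
        return False
    return True
```

In the notebook this would be called as `check_shape(df)` right after loading.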
