# 02 - Preprocessing

This notebook walks through the full preprocessing applied to the raw dataset, following the decisions made in the exploratory data analysis (01_EDA.ipynb). It uses `preprocess.py`, which contains the project's official pipeline.


Imports

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path

from src import preprocess as pp

RAW_PATH = "../data/raw/student_depression_dataset.csv"


Dataset

In [29]:
df_raw = pd.read_csv(RAW_PATH)
print("Initial shape:", df_raw.shape)
df_raw.head()


Initial shape: (27901, 18)


Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,'5-6 hours',Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.9,5.0,0.0,'5-6 hours',Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,'Less than 5 hours',Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,'7-8 hours',Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,'5-6 hours',Moderate,M.Tech,Yes,1.0,1.0,No,0


Inconsistent values

In [30]:
inconsistent_values = [
    "Other", "Others", "?", "Unknown",
    "Prefer not to say", " ", "", "NA",
    "N/A", "None"
]

print("Searching for inconsistent values...")

# Compare each column's distinct values (as strings) against the token list.
for col in df_raw.columns:
    unq = df_raw[col].astype(str).unique()
    bad = [v for v in unq if v in inconsistent_values]
    if bad:
        print(f" - {col}: {bad}")


Searching for inconsistent values...
 - Sleep Duration: ['Others']
 - Dietary Habits: ['Others']
 - Degree: ['Others']
 - Financial Stress: ['?']
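
The scan above lists which tokens appear in each column, but not how many rows they affect. A minimal sketch of a per-column count follows, assuming the same token list as the cell above. The helper name `count_inconsistent` and the toy frame are hypothetical and are not part of `preprocess.py`.

```python
import pandas as pd

# Same tokens as the list defined in the cell above.
INCONSISTENT_VALUES = [
    "Other", "Others", "?", "Unknown",
    "Prefer not to say", " ", "", "NA",
    "N/A", "None",
]


def count_inconsistent(df: pd.DataFrame) -> pd.Series:
    """Return, per column, how many rows hold one of the inconsistent tokens."""
    # Cast everything to text so numeric and object columns are compared alike.
    counts = df.astype(str).isin(INCONSISTENT_VALUES).sum()
    # Keep only the columns that actually contain a flagged token.
    return counts[counts > 0]


# Toy frame standing in for df_raw (values are illustrative only).
toy = pd.DataFrame({
    "Degree": ["BSc", "Others", "BA"],
    "Financial Stress": ["1.0", "?", "?"],
    "Age": [24.0, 31.0, 28.0],
})
print(count_inconsistent(toy))
```

On the real data, `count_inconsistent(df_raw)` would show how the flagged rows are distributed across the four columns reported above.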


Remove rows with inconsistent values

In [31]:
df_clean = pp.remove_rows_with_inconsistent_values(df_raw)

print("Shape after removing inconsistent rows:", df_clean.shape)
df_clean.head()


Shape after removing inconsistent rows: (27833, 18)


Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,'5-6 hours',Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.9,5.0,0.0,'5-6 hours',Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,'Less than 5 hours',Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,'7-8 hours',Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,'5-6 hours',Moderate,M.Tech,Yes,1.0,1.0,No,0
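
This step drops 68 rows (27901 to 27833). The source of `pp.remove_rows_with_inconsistent_values` is not shown in this notebook, so the following is only a sketch of one way such a filter could work, assuming a row is discarded when any of its columns holds one of the tokens listed earlier. The function name `remove_inconsistent_rows` is hypothetical.

```python
import pandas as pd

# Same tokens as the list used in the scan cell.
INCONSISTENT_VALUES = [
    "Other", "Others", "?", "Unknown",
    "Prefer not to say", " ", "", "NA",
    "N/A", "None",
]


def remove_inconsistent_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: drop rows where any column holds a flagged token."""
    # True for every row that contains at least one flagged token.
    mask = df.astype(str).isin(INCONSISTENT_VALUES).any(axis=1)
    return df.loc[~mask].copy()


# Toy frame: only the first row is free of flagged tokens.
toy = pd.DataFrame({
    "Sleep Duration": ["'5-6 hours'", "Others", "'7-8 hours'"],
    "Financial Stress": ["1.0", "2.0", "?"],
})
cleaned = remove_inconsistent_rows(toy)
print(cleaned.shape)
```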


Remove irrelevant columns

In [32]:
df_clean = pp.remove_irrelevant_columns(df_clean)

print("Shape after removing irrelevant columns:", df_clean.shape)
df_clean.head()


Removendo colunas que não serão usadas no modelo.
Shape after removing irrelevant columns: (27833, 12)


Unnamed: 0,Gender,Age,Academic Pressure,CGPA,Study Satisfaction,Sleep Duration,Dietary Habits,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,Male,33.0,5.0,8.97,2.0,'5-6 hours',Healthy,Yes,3.0,1.0,No,1
1,Female,24.0,2.0,5.9,5.0,'5-6 hours',Moderate,No,3.0,2.0,Yes,0
2,Male,31.0,3.0,7.03,5.0,'Less than 5 hours',Healthy,No,9.0,1.0,Yes,0
3,Female,28.0,3.0,5.59,2.0,'7-8 hours',Moderate,Yes,4.0,5.0,Yes,1
4,Female,25.0,4.0,8.13,3.0,'5-6 hours',Moderate,Yes,1.0,1.0,No,0
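
Comparing the table before and after this step, the six columns that disappear are `id`, `City`, `Profession`, `Work Pressure`, `Job Satisfaction` and `Degree`. The sketch below shows an equivalent drop in plain pandas. The column list is read off the outputs above, and the helper `drop_columns` is hypothetical rather than the actual body of `pp.remove_irrelevant_columns`.

```python
import pandas as pd

# Columns present in the raw table but absent after this step
# (inferred from the outputs, not from preprocess.py itself).
DROPPED_COLUMNS = [
    "id", "City", "Profession",
    "Work Pressure", "Job Satisfaction", "Degree",
]


def drop_columns(df: pd.DataFrame, columns=DROPPED_COLUMNS) -> pd.DataFrame:
    """Return a copy of df without the listed columns."""
    # errors="ignore" makes the call safe to repeat on an already reduced frame.
    return df.drop(columns=list(columns), errors="ignore")


# Toy frame with a subset of the raw columns.
toy = pd.DataFrame({
    "id": [2, 8],
    "City": ["Visakhapatnam", "Bangalore"],
    "Age": [33.0, 24.0],
    "Depression": [1, 0],
})
reduced = drop_columns(toy)
print(list(reduced.columns))
```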


Encode categorical values

In [33]:
df_clean = pp.treat_values(df_clean)

df_clean.head()


Tratando valores do dataset


Unnamed: 0,Gender,Age,Academic Pressure,CGPA,Study Satisfaction,Sleep Duration,Dietary Habits,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,0,33.0,5.0,8.97,2.0,5.5,2,1,3.0,1.0,0,1
1,1,24.0,2.0,5.9,5.0,5.5,1,0,3.0,2.0,1,0
2,0,31.0,3.0,7.03,5.0,4.0,2,0,9.0,1.0,1,0
3,1,28.0,3.0,5.59,2.0,7.5,1,1,4.0,5.0,1,1
4,1,25.0,4.0,8.13,3.0,5.5,1,1,1.0,1.0,0,0
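
The mapping applied by `pp.treat_values` can be partly read off the rows above: `Male` becomes 0 and `Female` 1, `Yes` becomes 1 and `No` 0, `Healthy` becomes 2 and `Moderate` 1, and the sleep ranges become 4.0, 5.5 and 7.5. The sketch below encodes only those observed pairs. Categories that do not appear in the displayed rows are deliberately left unmapped (they come out as NaN here), since their codes cannot be confirmed from this notebook. The function name `encode_observed` is hypothetical.

```python
import pandas as pd

# Only the category/code pairs visible in the before/after tables.
GENDER_MAP = {"Male": 0, "Female": 1}
YES_NO_MAP = {"No": 0, "Yes": 1}
DIET_MAP = {"Moderate": 1, "Healthy": 2}
SLEEP_MAP = {"Less than 5 hours": 4.0, "5-6 hours": 5.5, "7-8 hours": 7.5}

YES_NO_COLUMNS = [
    "Have you ever had suicidal thoughts ?",
    "Family History of Mental Illness",
]


def encode_observed(df: pd.DataFrame) -> pd.DataFrame:
    """Map the categories seen in the notebook output to their numeric codes."""
    out = df.copy()
    out["Gender"] = out["Gender"].map(GENDER_MAP)
    out["Dietary Habits"] = out["Dietary Habits"].map(DIET_MAP)
    # The raw sleep values carry literal single quotes, so strip them first.
    out["Sleep Duration"] = out["Sleep Duration"].str.strip("'").map(SLEEP_MAP)
    for col in YES_NO_COLUMNS:
        out[col] = out[col].map(YES_NO_MAP)
    return out


# Rows 0 and 3 of the table shown before this step.
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Sleep Duration": ["'5-6 hours'", "'7-8 hours'"],
    "Dietary Habits": ["Healthy", "Moderate"],
    "Have you ever had suicidal thoughts ?": ["Yes", "Yes"],
    "Family History of Mental Illness": ["No", "Yes"],
})
encoded = encode_observed(toy)
print(encoded)
```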


Filter invalid age ranges

In [34]:
df_clean = pp.filter_age(df_clean)

print("Shape after filtering age:", df_clean.shape)
df_clean["Age"].describe()


Filtrando idades
Shape after filtering age: (27822, 12)


count    27822.000000
mean        25.809827
std          4.880406
min         18.000000
25%         21.000000
50%         25.000000
75%         30.000000
max         44.000000
Name: Age, dtype: float64
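
This step removes 11 rows (27833 to 27822). The bounds used by `pp.filter_age` are not shown here. The sketch below assumes an inclusive range and borrows 18 and 44 from the `min` and `max` of the summary above, which is only a guess at the real thresholds. The function name `filter_age_range` and its parameters are hypothetical.

```python
import pandas as pd


def filter_age_range(df: pd.DataFrame, min_age: float = 18, max_age: float = 44) -> pd.DataFrame:
    """Keep rows whose Age lies in [min_age, max_age].

    The default bounds are assumptions taken from the describe() output,
    not values confirmed by preprocess.py.
    """
    # Series.between is inclusive on both ends by default.
    return df[df["Age"].between(min_age, max_age)].copy()


# Toy ages, two of them outside the assumed range.
toy = pd.DataFrame({"Age": [17.0, 18.0, 30.0, 44.0, 59.0]})
kept = filter_age_range(toy)
print(kept["Age"].tolist())
```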

Remove remaining missing values

In [35]:
df_clean = pp.drop_missing(df_clean)

print("Shape after dropping NA:", df_clean.shape)
df_clean.isna().sum()


Shape after dropping NA: (27822, 12)


Gender                                   0
Age                                      0
Academic Pressure                        0
CGPA                                     0
Study Satisfaction                       0
Sleep Duration                           0
Dietary Habits                           0
Have you ever had suicidal thoughts ?    0
Work/Study Hours                         0
Financial Stress                         0
Family History of Mental Illness         0
Depression                               0
dtype: int64
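
The shape is (27822, 12) both before and after this step, so no rows were dropped in this run. When that matters, the number of removed rows can be reported explicitly. The following is a small sketch using plain `DataFrame.dropna`; the helper `drop_missing_with_report` is hypothetical and may differ from what `pp.drop_missing` does internally.

```python
import numpy as np
import pandas as pd


def drop_missing_with_report(df: pd.DataFrame):
    """Drop rows containing any NaN and return (clean frame, rows removed)."""
    cleaned = df.dropna()
    return cleaned, len(df) - len(cleaned)


# Toy frame with one missing value in each of two different rows.
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0],
    "CGPA": [8.13, 7.03, np.nan],
})
cleaned, removed = drop_missing_with_report(toy)
print(f"rows removed: {removed}")
```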

Save the cleaned dataset

In [36]:
processed_path = Path("../data/processed")
processed_path.mkdir(parents=True, exist_ok=True)

output_file = processed_path / "cleaned_student_dataset.csv"
df_clean.to_csv(output_file, index=False)

print(f"Final dataset saved to: {output_file}")


Final dataset saved to: ../data/processed/cleaned_student_dataset.csv
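
A quick way to confirm that the file was written as intended is to read it back and compare shape and column names. The sketch below does this on a toy frame in a temporary directory, so it does not touch the project's `data/processed` folder. The helper `save_and_verify` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(df: pd.DataFrame, output_file: Path) -> bool:
    """Write df to CSV without the index, reload it, and compare shape and columns."""
    output_file.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_file, index=False)
    reloaded = pd.read_csv(output_file)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)


# Toy frame standing in for df_clean.
toy = pd.DataFrame({"Age": [33.0, 24.0], "Depression": [1, 0]})
with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify(toy, Path(tmp) / "processed" / "toy.csv")
print(ok)
```

Saving with `index=False`, as the cell above does, keeps the reloaded file from gaining an extra unnamed index column.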


In [None]:


pp.split_data(df_clean)

📁 Arquivos de treino e teste gerados com sucesso!
📁 Arquivos de treino e teste gerados com sucesso!


(       Gender   Age  Academic Pressure  CGPA  Study Satisfaction  \
 13440       1  23.0                5.0  8.52                 5.0   
 8791        1  19.0                1.0  9.44                 5.0   
 21125       0  21.0                1.0  6.92                 4.0   
 15973       0  31.0                4.0  6.27                 3.0   
 10251       1  25.0                5.0  9.91                 3.0   
 ...       ...   ...                ...   ...                 ...   
 7019        0  30.0                1.0  6.37                 3.0   
 7050        1  25.0                3.0  8.21                 3.0   
 9251        0  24.0                4.0  9.74                 2.0   
 14205       0  31.0                4.0  5.70                 1.0   
 16607       1  26.0                3.0  7.08                 3.0   
 
        Sleep Duration  Dietary Habits  Have you ever had suicidal thoughts ?  \
 13440             5.5               1                                      0   
 8791   
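
The output above is cut off, but it shows that `pp.split_data` returns a tuple of frames with shuffled row indices. Its test fraction, random seed and exact return values are not visible in this notebook. As an illustration only, the sketch below performs a stratified split in plain pandas so that both parts keep the class balance of `Depression`. The function `stratified_split` and all of its parameters are hypothetical.

```python
import pandas as pd


def stratified_split(df: pd.DataFrame, target: str = "Depression",
                     test_size: float = 0.2, seed: int = 42):
    """Split df into (train, test), sampling the same fraction from each class."""
    # Sample test rows per class so the class proportions are preserved.
    test = df.groupby(target, group_keys=False).sample(frac=test_size, random_state=seed)
    # Everything not sampled for the test set becomes training data.
    train = df.drop(index=test.index)
    return train, test


# Toy frame with 10 rows per class.
toy = pd.DataFrame({
    "Age": list(range(20, 40)),
    "Depression": [0] * 10 + [1] * 10,
})
train, test = stratified_split(toy)
print(len(train), len(test))
```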

# Conclusion

Preprocessing completed successfully.

- Rows with inconsistent values removed
- Irrelevant columns dropped
- Categorical values mapped to numbers
- Ages outside the valid range filtered out
- Final dataset saved to `data/processed/cleaned_student_dataset.csv`

This dataset is used next in the notebook **03_Model_Training.ipynb**.