<div style="text-align: center;">
  <img src="Images/Depressed_Student.png" alt="Depressed Student Illustration" width="600"/>
</div>

## Hello! 
This is a project to predict **Student Depression** using the [Student Depression Dataset](https://www.kaggle.com/datasets/hopesb/student-depression-dataset/data) from Kaggle. Understanding and predicting depression among students is an essential task, as mental health plays a critical role in their academic performance and overall well-being.  

Student depression datasets are typically used to analyze and predict depression levels among students. This project can contribute to identifying factors influencing student mental health and designing early intervention strategies.

The dataset is a single CSV file with 27,901 rows and 18 columns describing students and their mental health status. Below is a brief explanation of the columns:  

1. **id**: A unique identifier for each student.  
2. **Gender**: The gender of the student.  
3. **Age**: The age of the student.  
4. **City**: The city where the student resides.  
5. **Profession**: The student's occupation, such as student, part-time worker, etc.  
6. **Academic Pressure**: The level of academic stress experienced by the student.  
7. **Work Pressure**: The work-related stress experienced by the student.  
8. **CGPA**: The student’s cumulative grade point average.  
9. **Study Satisfaction**: The student’s level of satisfaction with their studies.  
10. **Job Satisfaction**: The student’s level of satisfaction with their job or part-time work.  
11. **Sleep Duration**: Average daily sleep, recorded as a categorical range (e.g., "5-6 hours" or "Less than 5 hours").  
12. **Dietary Habits**: The dietary pattern of the student (Healthy, Moderate, or Unhealthy).  
13. **Degree**: The current level of education the student is pursuing.  
14. **Have you ever had suicidal thoughts?**: A binary column (Yes/No) indicating if the student has had suicidal thoughts.  
15. **Work/Study Hours**: The number of hours spent working or studying per day.  
16. **Financial Stress**: The financial burden or stress experienced by the student.  
17. **Family History of Mental Illness**: A binary column (Yes/No) indicating if the student has a family history of mental health issues.  
18. **Depression**: The target variable, indicating whether the student is experiencing depression (encoded as 1 for yes and 0 for no).  

> ⚠️ *Disclaimer*: This dataset, given its sensitive nature, must be used responsibly, ensuring ethical considerations like privacy, informed consent, and data anonymization. This project aims to leverage this dataset to build a model capable of predicting depression status in students and identifying significant contributing factors.

# **Step 1: Data Wrangling**

This notebook covers the **data wrangling and preprocessing phase** of the Student Depression Prediction project. The goal is to clean, transform, and prepare the raw dataset for exploratory data analysis (EDA), modeling, and dashboard development.

---

### Objectives of This Notebook

1. [Import Libraries and Load the Dataset](#import)  
2. [Initial Inspection](#inspection)  
3. [Handle Missing Values](#missing)  
4. [Remove Duplicate Records](#duplicates)  
5. [Rename Columns for Consistency](#rename)  
6. [Feature Engineering](#features)
7. [Save the Cleaned Dataset](#save)

---

### Next Steps

- Step 2: [Exploratory Data Analysis (EDA) – Visual](./02_eda_visualization.ipynb)  
- Step 3: [EDA – SQL Queries](./03_eda_sql_queries.ipynb)   
- Step 4: [Excel Dashboard](./04_excel_dashboard.xlsx)  
- Step 5: [Modeling & Prediction](./05_modeling_prediction.ipynb) 

---

<a id="import"></a>

## **1.1 Import Libraries and Load the Dataset**

We start by importing the necessary Python libraries and loading the dataset into a DataFrame.

In [1]:
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd

# NumPy is a Python library that supports fast operations on large, multi-dimensional arrays and provides a wide range of mathematical functions.
import numpy as np

# Import display function to render DataFrames or outputs neatly in the notebook
from IPython.display import display

In [2]:
# Load the dataset
print("Previewing the raw dataset:")
df = pd.read_csv("student_depression_dataset.csv")
display(df.head())

Previewing the raw dataset:


Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.9,5.0,0.0,5-6 hours,Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,M.Tech,Yes,1.0,1.0,No,0


---

<a id="inspection"></a>

## **1.2 Initial Inspection**

We inspect the structure, data types, and basic info of the dataset.

In [3]:
# Display basic structure of the dataset
# df.info() prints its report directly and returns None, so it is not wrapped in display()
print("Dataset Info:")
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27901 entries, 0 to 27900
Data columns (total 18 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   id                                     27901 non-null  int64  
 1   Gender                                 27901 non-null  object 
 2   Age                                    27901 non-null  float64
 3   City                                   27901 non-null  object 
 4   Profession                             27901 non-null  object 
 5   Academic Pressure                      27901 non-null  float64
 6   Work Pressure                          27901 non-null  float64
 7   CGPA                                   27901 non-null  float64
 8   Study Satisfaction                     27901 non-null  float64
 9   Job Satisfaction                       27901 non-null  float64
 10  Sleep Duration                         27901 non-null  o

In [4]:
# Summary statistics for numerical columns
print("Numerical Summary:")
display(df.describe())

Numerical Summary:


Unnamed: 0,id,Age,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Work/Study Hours,Financial Stress,Depression
count,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0,27901.0,27898.0,27901.0
mean,70442.149421,25.8223,3.141214,0.00043,7.656104,2.943837,0.000681,7.156984,3.139867,0.585499
std,40641.175216,4.905687,1.381465,0.043992,1.470707,1.361148,0.044394,3.707642,1.437347,0.492645
min,2.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,35039.0,21.0,2.0,0.0,6.29,2.0,0.0,4.0,2.0,0.0
50%,70684.0,25.0,3.0,0.0,7.77,3.0,0.0,8.0,3.0,1.0
75%,105818.0,30.0,4.0,0.0,8.92,4.0,0.0,10.0,4.0,1.0
max,140699.0,59.0,5.0,5.0,10.0,5.0,4.0,12.0,5.0,1.0
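
The summary above shows that `Work Pressure` and `Job Satisfaction` are zero for at least 75% of rows. As a rough sketch of how such near-constant columns could be flagged automatically, the helper below uses a toy DataFrame with made-up values; the function name `near_constant_columns` and the 0.75 threshold are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Work Pressure": [0.0, 0.0, 0.0, 0.0, 5.0],
    "Academic Pressure": [1.0, 3.0, 5.0, 2.0, 4.0],
})

def near_constant_columns(frame, threshold=0.75):
    """Return numeric columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in frame.select_dtypes(include="number").columns:
        # Share of rows taken by the single most common value
        top_share = frame[col].value_counts(normalize=True).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

flagged = near_constant_columns(toy)
print(flagged)
```

Columns flagged this way carry little information for modeling, so they may be candidates for removal in a later step.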


In [5]:
# Summary statistics for categorical columns
print("Categorical Summary:")
display(df.describe(include=[object]))

Categorical Summary:


Unnamed: 0,Gender,City,Profession,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Family History of Mental Illness
count,27901,27901,27901,27901,27901,27901,27901,27901
unique,2,52,14,5,4,28,2,2
top,Male,Kalyan,Student,Less than 5 hours,Unhealthy,Class 12,Yes,No
freq,15547,1570,27870,8310,10317,6080,17656,14398


In [6]:
# Checking the shape of the dataset (rows, columns)
print(f"The dataset contains {df.shape[0]:,} rows and {df.shape[1]} columns.")

The dataset contains 27,901 rows and 18 columns.


In [7]:
# Get the column names of the DataFrame
print("The dataset columns include:")
display(df.columns)

The dataset columns include:


Index(['id', 'Gender', 'Age', 'City', 'Profession', 'Academic Pressure',
       'Work Pressure', 'CGPA', 'Study Satisfaction', 'Job Satisfaction',
       'Sleep Duration', 'Dietary Habits', 'Degree',
       'Have you ever had suicidal thoughts ?', 'Work/Study Hours',
       'Financial Stress', 'Family History of Mental Illness', 'Depression'],
      dtype='object')

---

<a id="missing"></a>

## **1.3 Handling Missing Values**

We identify missing values and apply appropriate strategies to handle them.

In [8]:
# Check for missing values
print("Missing Values per Column:")
display(df.isnull().sum())

Missing Values per Column:


id                                       0
Gender                                   0
Age                                      0
City                                     0
Profession                               0
Academic Pressure                        0
Work Pressure                            0
CGPA                                     0
Study Satisfaction                       0
Job Satisfaction                         0
Sleep Duration                           0
Dietary Habits                           0
Degree                                   0
Have you ever had suicidal thoughts ?    0
Work/Study Hours                         0
Financial Stress                         3
Family History of Mental Illness         0
Depression                               0
dtype: int64

In [9]:
# Fill the missing Financial Stress values with the column mean, rounded to a whole number to match the column's integer-valued scale
df['Financial Stress'] = df['Financial Stress'].fillna(round(df['Financial Stress'].mean()))
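
The cell above rounds the mean so that the filled value stays on the same whole-number scale as the observed scores. A minimal sketch on a toy Series (values invented for illustration) shows that step next to the median, which is a common alternative for ordinal scores; which one is preferable for this dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy scores on a 1-5 scale with one missing entry (illustrative values)
scores = pd.Series([1.0, 2.0, 2.0, 5.0, np.nan, 4.0], name="Financial Stress")

mean_fill = round(scores.mean())   # mean of the 5 observed values is 2.8, which rounds to 3
median_fill = scores.median()      # median of 1, 2, 2, 4, 5 is 2.0

# Same pattern as the notebook cell: fill NaN with the rounded mean
filled = scores.fillna(mean_fill)
print(mean_fill, median_fill, filled.isna().sum())
```

With only 3 missing values out of 27,901 rows, either choice should have a negligible effect on the distribution.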

---

<a id="duplicates"></a>

## **1.4 Remove Duplicate Records**

To ensure data quality, we check for fully duplicated rows. None are found, so no rows need to be dropped.

In [10]:
# Count fully duplicated rows
print(f"Duplicate Rows Found: {df.duplicated().sum()}")

Duplicate Rows Found: 0
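
No duplicates were found here, so no removal was needed. For a dataset that does contain duplicates, the removal step might look like the sketch below, shown on a toy DataFrame with invented values. The extra check on repeated `id` values is an assumption about what "duplicate" could also mean for this data.

```python
import pandas as pd

# Toy data with one fully duplicated row (illustrative values)
toy = pd.DataFrame({"id": [1, 2, 2, 3], "Age": [20.0, 21.0, 21.0, 22.0]})

# Count rows that repeat an earlier row in every column
n_dupes = int(toy.duplicated().sum())

# Drop them, keeping the first occurrence, and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

# Optional stricter check: the same id appearing more than once
repeated_ids = int(deduped["id"].duplicated().sum())
print(n_dupes, len(deduped), repeated_ids)
```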


---

<a id="rename"></a>

## **1.5 Rename Columns for Consistency**

Standardizing column names improves readability and downstream processing.

In [11]:
# Renaming Columns
df.rename(columns={'Have you ever had suicidal thoughts ?':'Suicidal_thoughts',
                     'Family History of Mental Illness':'Family_Mental_History'},inplace=True)

# Standardize column names: lowercase and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
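
Note that the standardization above only replaces spaces, so a name such as `Work/Study Hours` becomes `work/study_hours` and keeps its slash. If fully identifier-safe names were wanted, one possible variation is sketched below; the helper name `to_snake_case` is hypothetical, and adopting it would change column names that the later steps of this notebook rely on.

```python
import pandas as pd

def to_snake_case(columns):
    """Lowercase names and collapse every run of non-alphanumeric characters into one underscore."""
    return (
        pd.Index(columns)
        .str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")  # remove underscores left at either end
    )

names = to_snake_case(
    ["Work/Study Hours", "Have you ever had suicidal thoughts ?", " CGPA "]
)
print(list(names))
```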

---

<a id="features"></a>

## **1.6 Feature Engineering**

We clean inconsistent category values and group high-cardinality columns (city, degree, profession) into broader categories to support better modeling.

In [12]:
# Get the unique values of the 'city' column
df['city'].unique()

array(['Visakhapatnam', 'Bangalore', 'Srinagar', 'Varanasi', 'Jaipur',
       'Pune', 'Thane', 'Chennai', 'Nagpur', 'Nashik', 'Vadodara',
       'Kalyan', 'Rajkot', 'Ahmedabad', 'Kolkata', 'Mumbai', 'Lucknow',
       'Indore', 'Surat', 'Ludhiana', 'Bhopal', 'Meerut', 'Agra',
       'Ghaziabad', 'Hyderabad', 'Vasai-Virar', 'Kanpur', 'Patna',
       'Faridabad', 'Delhi', 'Saanvi', 'M.Tech', 'Bhavna', 'Less Delhi',
       'City', '3.0', 'Less than 5 Kalyan', 'Mira', 'Harsha', 'Vaanya',
       'Gaurav', 'Harsh', 'Reyansh', 'Kibara', 'Rashi', 'ME', 'M.Com',
       'Nalyan', 'Mihir', 'Nalini', 'Nandini', 'Khaziabad'], dtype=object)

In [13]:
# Fixing common typos in the 'city' column to ensure consistency
correction = {
    'Khaziabad': 'Ghaziabad',
    'Nalyan': 'Kalyan',
    'Less Delhi': 'Delhi',
    'Less than 5 Kalyan': 'Kalyan'
}

df['city'] = df['city'].replace(correction)

In [14]:
# Removing non-city entries from the 'city' column
# These values (apparent personal names, degree names, stray numbers) are not valid city names.
# The typo variants corrected in the previous cell no longer occur, so they are not listed here.
non_city_terms = [
    'Saanvi', 'M.Tech', 'Bhavna', 'City', '3.0', 'Mira',
    'Harsha', 'Vaanya', 'Gaurav', 'Harsh', 'Reyansh', 'Kibara',
    'Rashi', 'ME', 'M.Com', 'Mihir', 'Nalini', 'Nandini'
]

# Keep only rows whose 'city' is not one of the non-city values
df = df[~df['city'].isin(non_city_terms)]
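
The two cells above first correct typos and then drop values from a list of known bad entries. An alternative worth considering is an allow-list: keep only rows whose city is in a set of known valid names, so that unexpected junk values are caught as well. The sketch below uses a small invented subset of cities and is not the notebook's method.

```python
import pandas as pd

# Illustrative subset of valid cities and typo corrections
valid_cities = {"Delhi", "Kalyan", "Ghaziabad"}
corrections = {"Khaziabad": "Ghaziabad", "Nalyan": "Kalyan", "Less Delhi": "Delhi"}

toy = pd.DataFrame(
    {"city": ["Delhi", "Nalyan", "Saanvi", "3.0", "Khaziabad", "Kalyan"]}
)

# Step 1: repair known typos
toy["city"] = toy["city"].replace(corrections)

# Step 2: keep only rows whose city is on the allow-list
is_valid = toy["city"].isin(valid_cities)
dropped = toy.loc[~is_valid, "city"].tolist()
cleaned = toy[is_valid].reset_index(drop=True)

print(dropped)
print(cleaned["city"].tolist())
```

Printing the dropped values makes it easy to review what was removed before committing to the filter.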

In [15]:
# Function to print unique values and their count for a given column
def print_unique_values(df, column_name):
    print(f"Unique values in '{column_name}':")
    print(df[column_name].unique())
    print(f"\nNumber of unique values in '{column_name}': {df[column_name].nunique()}")

# Call the function for the 'city' column
print_unique_values(df, 'city')

Unique values in 'city':
['Visakhapatnam' 'Bangalore' 'Srinagar' 'Varanasi' 'Jaipur' 'Pune' 'Thane'
 'Chennai' 'Nagpur' 'Nashik' 'Vadodara' 'Kalyan' 'Rajkot' 'Ahmedabad'
 'Kolkata' 'Mumbai' 'Lucknow' 'Indore' 'Surat' 'Ludhiana' 'Bhopal'
 'Meerut' 'Agra' 'Ghaziabad' 'Hyderabad' 'Vasai-Virar' 'Kanpur' 'Patna'
 'Faridabad' 'Delhi']

Number of unique values in 'city': 30


In [16]:
# Mapping individual cities to broader regions to reduce cardinality

# Dictionary to map city names to their respective regions
city_to_region = {
    'Visakhapatnam': 'South', 'Bangalore': 'South', 'Srinagar': 'North', 
    'Varanasi': 'North', 'Jaipur': 'North', 'Pune': 'West', 'Thane': 'West', 
    'Chennai': 'South', 'Nagpur': 'West', 'Nashik': 'West', 'Vadodara': 'West', 
    'Kalyan': 'West', 'Rajkot': 'West', 'Ahmedabad': 'West', 'Kolkata': 'East', 
    'Mumbai': 'West', 'Lucknow': 'North', 'Indore': 'West', 'Surat': 'West', 
    'Ludhiana': 'North', 'Bhopal': 'West', 'Meerut': 'North', 'Agra': 'North', 
    'Ghaziabad': 'North', 'Hyderabad': 'South', 'Vasai-Virar': 'West', 
    'Kanpur': 'North', 'Patna': 'East', 'Faridabad': 'North', 'Delhi': 'North'
}

# Work on an explicit copy so the assignment below does not act on a filtered view (avoids SettingWithCopyWarning)
df = df.copy()

# Map cities to their regions
df['city'] = df['city'].map(city_to_region)

# Rename the column from 'city' to 'region'
df = df.rename(columns={'city': 'region'})

# Preview the result
print("Preview of updated DataFrame with 'region' column:")
df.head()

Preview of updated DataFrame with 'region' column:


Unnamed: 0,id,gender,age,region,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,suicidal_thoughts,work/study_hours,financial_stress,family_mental_history,depression
0,2,Male,33.0,South,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,South,Student,2.0,0.0,5.9,5.0,0.0,5-6 hours,Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,North,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,North,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,North,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,M.Tech,Yes,1.0,1.0,No,0


In [17]:
# Getting the count of every unique value in the 'region' column
region_counts = df['region'].value_counts()

print("Number of students from each region:")
print(region_counts)

Number of students from each region:
region
West     11982
North     9863
South     3961
East      2073
Name: count, dtype: int64
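
`Series.map` with a dictionary returns `NaN` for any key that is missing from the dictionary, so a city left out of the mapping would silently become a missing region. A small check like the following sketch (toy data, deliberately incomplete mapping) can surface such cases before the original column is overwritten.

```python
import pandas as pd

# Deliberately incomplete mapping to demonstrate the check (illustrative only)
city_to_region_demo = {"Pune": "West", "Delhi": "North"}
cities = pd.Series(["Pune", "Delhi", "Patna"])

regions = cities.map(city_to_region_demo)

# Cities that had no entry in the mapping end up as NaN in `regions`
unmapped = sorted(cities[regions.isna()].unique())
print(unmapped)
```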


In [18]:
# Call the function for the 'degree' column
print_unique_values(df, 'degree')

Unique values in 'degree':
['B.Pharm' 'BSc' 'BA' 'BCA' 'M.Tech' 'PhD' 'Class 12' 'B.Ed' 'LLB' 'BE'
 'M.Ed' 'MSc' 'BHM' 'M.Pharm' 'MCA' 'MA' 'B.Com' 'MD' 'MBA' 'MBBS' 'M.Com'
 'B.Arch' 'LLM' 'B.Tech' 'BBA' 'ME' 'MHM' 'Others']

Number of unique values in 'degree': 28


In [19]:
# Grouping various degree types into broader categories and removing the 'Others' entries

# Remove rows with 'Others' in the 'degree' column
df = df[df['degree'] != 'Others']

# Map each specific degree to a broader category
degree_categories = {
    'B.Pharm': 'Undergraduate', 'BSc': 'Undergraduate', 'BA': 'Undergraduate', 
    'BCA': 'Undergraduate', 'B.Ed': 'Undergraduate', 'LLB': 'Undergraduate', 
    'BE': 'Undergraduate', 'BHM': 'Undergraduate', 'B.Com': 'Undergraduate', 
    'B.Arch': 'Undergraduate', 'B.Tech': 'Undergraduate', 'BBA': 'Undergraduate', 
    'M.Tech': 'Postgraduate', 'M.Ed': 'Postgraduate', 'MSc': 'Postgraduate', 
    'M.Pharm': 'Postgraduate', 'MCA': 'Postgraduate', 'MA': 'Postgraduate', 
    'MBA': 'Postgraduate', 'M.Com': 'Postgraduate', 'LLM': 'Postgraduate', 
    'ME': 'Postgraduate', 'MHM': 'Postgraduate', 'PhD': 'Doctoral', 
    'MD': 'Doctoral', 'MBBS': 'Doctoral', 'Class 12': 'Class 12'
}

# Function to convert specific degrees to general categories
def categorize_degree(degree):
    return degree_categories.get(degree, 'Other')

# Apply the function
df['degree'] = df['degree'].apply(categorize_degree)

# Preview changes
print("Preview after grouping 'degree' column:")
df.head()

Preview after grouping 'degree' column:


Unnamed: 0,id,gender,age,region,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,suicidal_thoughts,work/study_hours,financial_stress,family_mental_history,depression
0,2,Male,33.0,South,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,Undergraduate,Yes,3.0,1.0,No,1
1,8,Female,24.0,South,Student,2.0,0.0,5.9,5.0,0.0,5-6 hours,Moderate,Undergraduate,No,3.0,2.0,Yes,0
2,26,Male,31.0,North,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,Undergraduate,No,9.0,1.0,Yes,0
3,30,Female,28.0,North,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,Undergraduate,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,North,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,Postgraduate,Yes,1.0,1.0,No,0
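
The grouping above applies a Python function row by row. The same dictionary lookup with a fallback can be written with `Series.map` followed by `fillna`, as in this sketch on toy data with a shortened, illustrative mapping; both approaches should give the same result.

```python
import pandas as pd

# Shortened mapping for illustration
demo_categories = {"BSc": "Undergraduate", "MSc": "Postgraduate", "PhD": "Doctoral"}
degrees = pd.Series(["BSc", "MSc", "PhD", "Diploma"])

# Unknown degrees map to NaN first, then receive the fallback label
grouped = degrees.map(demo_categories).fillna("Other")
print(grouped.tolist())
```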


In [20]:
# Call the function for the 'profession' column
print_unique_values(df, 'profession')

Unique values in 'profession':
['Student' 'Civil Engineer' 'Architect' 'UX/UI Designer'
 'Digital Marketer' 'Content Writer' 'Educational Consultant' 'Teacher'
 'Manager' 'Chef' 'Doctor' 'Lawyer' 'Entrepreneur' 'Pharmacist']

Number of unique values in 'profession': 14


In [21]:
# Grouping all non-students under a single category "Professionals"
def group_profession(profession):
    return 'Student' if profession == 'Student' else 'Professionals'

# Apply the function
df['profession'] = df['profession'].apply(group_profession)

# Preview changes
print("Preview after grouping 'profession' column:")
df.head()

Preview after grouping 'profession' column:


Unnamed: 0,id,gender,age,region,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,suicidal_thoughts,work/study_hours,financial_stress,family_mental_history,depression
0,2,Male,33.0,South,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,Undergraduate,Yes,3.0,1.0,No,1
1,8,Female,24.0,South,Student,2.0,0.0,5.9,5.0,0.0,5-6 hours,Moderate,Undergraduate,No,3.0,2.0,Yes,0
2,26,Male,31.0,North,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,Undergraduate,No,9.0,1.0,Yes,0
3,30,Female,28.0,North,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,Undergraduate,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,North,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,Postgraduate,Yes,1.0,1.0,No,0


In [22]:
# Get value counts for the 'dietary_habits' column
habit_counts = df['dietary_habits'].value_counts()
print("Dietary Habits Value Counts:")
print(habit_counts)

Dietary Habits Value Counts:
dietary_habits
Unhealthy    10288
Moderate      9907
Healthy       7637
Others          12
Name: count, dtype: int64


In [23]:
# Remove rows where 'dietary_habits' is 'Others' due to small sample size
df = df[df['dietary_habits'] != 'Others']  # 12 rows removed

In [24]:
# Get value counts for the 'sleep_duration' column
sleep_range = df['sleep_duration'].value_counts()
print("Sleep Duration Value Counts:")
print(sleep_range)

Sleep Duration Value Counts:
sleep_duration
Less than 5 hours    8288
7-8 hours            7324
5-6 hours            6170
More than 8 hours    6032
Others                 18
Name: count, dtype: int64


In [25]:
# Remove rows where 'sleep_duration' is 'Others' due to small sample size
df = df[df['sleep_duration'] != 'Others']  # 18 rows removed

In [26]:
# Reset index to avoid any issues from dropped rows
df.reset_index(drop=True, inplace=True)
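
The 'Others' rows in `dietary_habits` and `sleep_duration` were removed by name. A more general version of the same idea is to drop any category that occurs fewer than a chosen number of times. The helper `drop_rare_categories` and its `min_count` parameter below are hypothetical names used only for this sketch.

```python
import pandas as pd

def drop_rare_categories(frame, column, min_count):
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = frame[column].value_counts()
    keep = counts[counts >= min_count].index
    return frame[frame[column].isin(keep)].reset_index(drop=True)

# Toy data: 'Others' appears only once (illustrative values)
toy = pd.DataFrame(
    {"sleep_duration": ["5-6 hours"] * 3 + ["7-8 hours"] * 2 + ["Others"]}
)

trimmed = drop_rare_categories(toy, "sleep_duration", min_count=2)
print(trimmed["sleep_duration"].value_counts())
```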

#### Some observations from the Dataset:

1. The raw dataset contains 27,901 rows and 18 columns.

2. There are 8 floating-point (float64), 2 integer (int64), and 8 categorical (object) columns.

3. Most columns are complete; only Financial Stress had missing values (3), which have been filled.

4. The mean age is 25.8 years, with a range from 18 to 59 years.

5. Academic pressure has a mean value of 3.14 on a scale of 0 to 5.

6. The CGPA column has a mean of 7.66, ranging from 0 to 10.

7. The most frequent gender is "Male," with 15,547 occurrences.

8. The most frequent city is "Kalyan," with 1,570 occurrences.

9. The profession "Student" is overwhelmingly common, appearing 27,870 times.

10. Around 58.5% of respondents reported experiencing depression.

11. Approximately 63% of respondents reported having suicidal thoughts.

12. Work/Study Hours range from 0 to 12 hours, with a median of 8 hours.

13. Sleep Duration is categorical, with "Less than 5 hours" being the most common.

14. The dataset is suitable for analyzing mental health trends and related factors.
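
The percentages in observations 10 and 11 follow from the summary tables: the mean of a 0/1 column is the share of ones, and the share for a Yes/No column is its count divided by the number of rows. A short sketch on toy data, plus the arithmetic for the reported suicidal-thoughts count:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "depression": [1, 1, 1, 0, 0],
    "suicidal_thoughts": ["Yes", "Yes", "No", "Yes", "No"],
})

# Mean of a 0/1 column equals the proportion of 1s
depression_rate = round(toy["depression"].mean() * 100, 1)

# Percentage share of each category
suicidal_share = (
    toy["suicidal_thoughts"].value_counts(normalize=True).mul(100).round(1)
)

# Arithmetic behind observation 11: 17,656 'Yes' answers out of 27,901 rows
reported_share = round(17656 / 27901 * 100, 1)
print(depression_rate, suicidal_share["Yes"], reported_share)
```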

---

<a id="save"></a>

## **1.7 Save the Cleaned Dataset**

After completing the data cleaning and preprocessing steps, save the cleaned dataset to a CSV file for future use and reproducibility.

In [27]:
# Create a clean copy of the DataFrame for further processing and cleaning steps.
df_clean = df.copy()

In [28]:
# Display the first and last few rows of the cleaned dataset to verify changes
display(df_clean)

Unnamed: 0,id,gender,age,region,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,suicidal_thoughts,work/study_hours,financial_stress,family_mental_history,depression
0,2,Male,33.0,South,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,Undergraduate,Yes,3.0,1.0,No,1
1,8,Female,24.0,South,Student,2.0,0.0,5.90,5.0,0.0,5-6 hours,Moderate,Undergraduate,No,3.0,2.0,Yes,0
2,26,Male,31.0,North,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,Undergraduate,No,9.0,1.0,Yes,0
3,30,Female,28.0,North,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,Undergraduate,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,North,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,Postgraduate,Yes,1.0,1.0,No,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27809,140685,Female,27.0,West,Student,5.0,0.0,5.75,5.0,0.0,5-6 hours,Unhealthy,Class 12,Yes,7.0,1.0,Yes,0
27810,140686,Male,27.0,North,Student,2.0,0.0,9.40,3.0,0.0,Less than 5 hours,Healthy,Postgraduate,No,0.0,3.0,Yes,0
27811,140689,Male,31.0,North,Student,3.0,0.0,6.61,4.0,0.0,5-6 hours,Unhealthy,Doctoral,No,12.0,2.0,No,0
27812,140690,Female,18.0,North,Student,5.0,0.0,6.88,2.0,0.0,Less than 5 hours,Healthy,Class 12,Yes,10.0,5.0,No,1


In [29]:
# Check the shape of the dataset after cleaning
print("Dataset shape after cleaning:")
print(df_clean.shape)

Dataset shape after cleaning:
(27814, 18)


In [30]:
# Recheck the DataFrame info to verify data types and non-null counts
# df_clean.info() prints its report directly and returns None, so it is not wrapped in display()
print("Dataset info:")
df_clean.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27814 entries, 0 to 27813
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     27814 non-null  int64  
 1   gender                 27814 non-null  object 
 2   age                    27814 non-null  float64
 3   region                 27814 non-null  object 
 4   profession             27814 non-null  object 
 5   academic_pressure      27814 non-null  float64
 6   work_pressure          27814 non-null  float64
 7   cgpa                   27814 non-null  float64
 8   study_satisfaction     27814 non-null  float64
 9   job_satisfaction       27814 non-null  float64
 10  sleep_duration         27814 non-null  object 
 11  dietary_habits         27814 non-null  object 
 12  degree                 27814 non-null  object 
 13  suicidal_thoughts      27814 non-null  object 
 14  work/study_hours       27814 non-null  f

In [31]:
# Summary statistics for numerical columns
print("Numerical Summary:")
display(df_clean.describe())

Numerical Summary:


Unnamed: 0,id,age,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,work/study_hours,financial_stress,depression
count,27814.0,27814.0,27814.0,27814.0,27814.0,27814.0,27814.0,27814.0,27814.0,27814.0
mean,70457.226612,25.820234,3.141655,0.000431,7.656051,2.943985,0.000683,7.160099,3.140001,0.585461
std,40649.122278,4.906662,1.381833,0.044061,1.470757,1.360949,0.044464,3.706602,1.436973,0.492651
min,2.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,35047.75,21.0,2.0,0.0,6.29,2.0,0.0,4.0,2.0,0.0
50%,70722.0,25.0,3.0,0.0,7.77,3.0,0.0,8.0,3.0,1.0
75%,105829.5,30.0,4.0,0.0,8.92,4.0,0.0,10.0,4.0,1.0
max,140699.0,59.0,5.0,5.0,10.0,5.0,4.0,12.0,5.0,1.0


In [32]:
# Summary statistics for categorical columns
print("Categorical Summary:")
display(df_clean.describe(include=[object]))

Categorical Summary:


Unnamed: 0,gender,region,profession,sleep_duration,dietary_habits,degree,suicidal_thoughts,family_mental_history
count,27814,27814,27814,27814,27814,27814,27814,27814
unique,2,4,2,4,3,4,2,2
top,Male,West,Student,Less than 5 hours,Unhealthy,Undergraduate,Yes,No
freq,15498,11955,27783,8288,10280,12612,17601,14347


In [33]:
# Save this lightly cleaned dataset for EDA and Dashboard
df_clean.to_csv('student_depression_cleaned.csv', index=False)
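
As an optional sanity check after saving, the file can be read back and compared with the in-memory DataFrame. The sketch below does this on a toy DataFrame written to a temporary directory; the file name `toy_cleaned.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_clean (illustrative values)
toy = pd.DataFrame(
    {"id": [2, 8], "region": ["South", "North"], "cgpa": [8.97, 5.9]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    toy.to_csv(path, index=False)   # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

# The reloaded data should have the same shape and column order
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Writing with `index=False` matters here: without it, the index is saved as an unnamed first column and shows up as `Unnamed: 0` when the file is loaded again.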

#### With the cleaned dataset saved, we can now proceed confidently to perform Exploratory Data Analysis (EDA).