# **Analysis of Student Depression Dataset**

## **Objective**
This notebook focuses on exploring, cleaning, and analyzing the **Student Depression Dataset** to gain insights into various factors affecting students' mental health. The primary goals include:
1. Cleaning the dataset by handling missing values, duplicates, and outliers.
2. Filtering and transforming columns to prepare data for analysis.
3. Conducting an exploratory data analysis (EDA) to uncover trends and patterns.

In [44]:
import pandas as pd
import os

In [45]:
df = pd.read_csv('depression_data.csv')

df.head()

Unnamed: 0,id,Gender,Age,City,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.9,5.0,0.0,5-6 hours,Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,Less than 5 hours,Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,7-8 hours,Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,5-6 hours,Moderate,M.Tech,Yes,1.0,1.0,No,0


In [46]:
# List of the columns and their types
df.dtypes

id                                         int64
Gender                                    object
Age                                      float64
City                                      object
Profession                                object
Academic Pressure                        float64
Work Pressure                            float64
CGPA                                     float64
Study Satisfaction                       float64
Job Satisfaction                         float64
Sleep Duration                            object
Dietary Habits                            object
Degree                                    object
Have you ever had suicidal thoughts ?     object
Work/Study Hours                         float64
Financial Stress                         float64
Family History of Mental Illness          object
Depression                                 int64
dtype: object

From the result above, it is clear that we have several categorical variables (the columns of type `object`). Next steps:
- Encode these categorical variables or convert them to a numerical format
- Assess each categorical variable one by one for data cleaning
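
As a starting point for the column-by-column review, the non-numeric columns can be listed programmatically. This is a minimal sketch on a small stand-in frame; `sample` and its rows are hypothetical and only mimic the shape of the real data.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (not the actual rows)
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [33.0, 24.0, 31.0],
    "Sleep Duration": ["5-6 hours", "5-6 hours", "Less than 5 hours"],
})

# Non-numeric columns are the candidates for encoding
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per categorical column
unique_counts = {col: int(sample[col].nunique()) for col in categorical_cols}
print(unique_counts)
```

Columns with only a handful of distinct values are usually easy to encode, while columns with many distinct values tend to need cleaning first.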

In [47]:
# Checking Gender column
df['Gender'].value_counts()

Gender
Male      15547
Female    12354
Name: count, dtype: int64

In [48]:
# Standardise Gender text
df['Gender'] = df['Gender'].str.strip()

# Convert Gender to numerical binary variable
df['Gender'] = df['Gender'].map({'Male':0, 'Female':1})

df['Gender'].value_counts()

Gender
0    15547
1    12354
Name: count, dtype: int64
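
One caveat with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN` without raising an error. The sketch below, using a hypothetical series with an unexpected `"Other"` label, shows one way to check for that after encoding.

```python
import pandas as pd

# Hypothetical values, including stray whitespace and an unexpected label
gender = pd.Series([" Male", "Female ", "Male", "Other"])

mapping = {"Male": 0, "Female": 1}
encoded = gender.str.strip().map(mapping)

# Labels missing from the mapping become NaN, so count them before moving on
unmapped_count = int(encoded.isna().sum())
print(unmapped_count)
```

In the cell above the counts before and after the mapping match (15547 and 12354), so no rows were lost in this dataset.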

In [49]:
# Checking Age column
df['Age'].value_counts()

Age
24.0    2258
20.0    2237
28.0    2133
29.0    1950
33.0    1893
25.0    1784
21.0    1726
23.0    1645
18.0    1587
19.0    1560
34.0    1468
27.0    1462
31.0    1427
32.0    1262
22.0    1160
26.0    1155
30.0    1145
35.0      10
38.0       8
36.0       7
42.0       4
48.0       3
39.0       3
43.0       2
46.0       2
37.0       2
49.0       1
51.0       1
44.0       1
59.0       1
54.0       1
58.0       1
56.0       1
41.0       1
Name: count, dtype: int64

In [50]:
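# Ages 35 and above have very few rows (10 or fewer each).
# The cutoff used here is stricter: keep only rows with Age <= 30,
# which also drops ages 31-34 even though each has over 1,000 rows.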

df = df.loc[df['Age'] <= 30]
df['Age'].value_counts()


Age
24.0    2258
20.0    2237
28.0    2133
29.0    1950
25.0    1784
21.0    1726
23.0    1645
18.0    1587
19.0    1560
27.0    1462
22.0    1160
26.0    1155
30.0    1145
Name: count, dtype: int64
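
To see how much data a given age cutoff retains, the value counts can be summed up to that cutoff. This is a rough sketch using a subset of the counts printed earlier; the two candidate cutoffs (30 and 34) are chosen here only for illustration.

```python
import pandas as pd

# Subset of the Age counts shown earlier (age -> number of rows)
age_counts = pd.Series({24.0: 2258, 30.0: 1145, 31.0: 1427,
                        34.0: 1468, 35.0: 10, 59.0: 1})

# Rows that would remain under each candidate cutoff (illustrative values)
rows_kept = {
    cutoff: int(age_counts[age_counts.index <= cutoff].sum())
    for cutoff in (30, 34)
}
print(rows_kept)
```

Comparing the two totals makes the cost of a stricter cutoff explicit before any rows are dropped.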

In [51]:
# Checking City column
df['City'].value_counts()

City
Kalyan           1268
Hyderabad        1045
Srinagar         1018
Vasai-Virar       994
Lucknow           915
Thane             913
Surat             896
Agra              876
Ludhiana          852
Kolkata           828
Jaipur            826
Ahmedabad         777
Bhopal            749
Pune              741
Patna             714
Chennai           712
Visakhapatnam     709
Rajkot            669
Meerut            629
Bangalore         617
Delhi             595
Ghaziabad         579
Mumbai            575
Vadodara          538
Varanasi          521
Indore            513
Nagpur            509
Kanpur            445
Nashik            406
Faridabad         351
Saanvi              2
Bhavna              2
City                2
Reyansh             1
Nandini             1
Nalini              1
M.Com               1
ME                  1
Rashi               1
Kibara              1
Vaanya              1
Harsh               1
Gaurav              1
Harsha              1
Mira                1
3.0  

In [52]:
# Keep only cities with more than 300 rows; the rare values appear to be data-entry errors (names, degrees, numbers)
city_counts = df['City'].value_counts() 
cities = city_counts[city_counts > 300].index

# filter rows 
df = df[df['City'].isin(cities)]

df['City'].value_counts()

City
Kalyan           1268
Hyderabad        1045
Srinagar         1018
Vasai-Virar       994
Lucknow           915
Thane             913
Surat             896
Agra              876
Ludhiana          852
Kolkata           828
Jaipur            826
Ahmedabad         777
Bhopal            749
Pune              741
Patna             714
Chennai           712
Visakhapatnam     709
Rajkot            669
Meerut            629
Bangalore         617
Delhi             595
Ghaziabad         579
Mumbai            575
Vadodara          538
Varanasi          521
Indore            513
Nagpur            509
Kanpur            445
Nashik            406
Faridabad         351
Name: count, dtype: int64
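
The same count-based filter could be reused for other categorical columns. The sketch below wraps it in a helper; `keep_frequent`, the toy frame, and the threshold are hypothetical names and values, not part of the original notebook.

```python
import pandas as pd

def keep_frequent(frame, column, min_count):
    """Return rows whose value in `column` occurs more than `min_count` times."""
    counts = frame[column].value_counts()
    frequent = counts[counts > min_count].index
    # .copy() gives an independent frame for later column assignments
    return frame[frame[column].isin(frequent)].copy()

# Toy data: two real-looking cities plus two stray entries
toy = pd.DataFrame({"City": ["Kalyan"] * 4 + ["Pune"] * 3 + ["M.Com", "3.0"]})

result = keep_frequent(toy, "City", min_count=2)
remaining = sorted(result["City"].unique().tolist())
print(remaining)
```

On the real data the equivalent call would be `keep_frequent(df, 'City', 300)`, assuming the helper is defined in the notebook.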

In [53]:
# Checking Profession column
df['Profession'].value_counts()

Profession
Student             21757
Architect               6
Teacher                 5
Digital Marketer        3
Chef                    2
Civil Engineer          1
Content Writer          1
Manager                 1
Lawyer                  1
Doctor                  1
Entrepreneur            1
Pharmacist              1
Name: count, dtype: int64

In [54]:
# remove rows where Profession != Student due to low count
df = df.loc[df['Profession'] == 'Student']

# Check Column again
df['Profession'].value_counts()

Profession
Student    21757
Name: count, dtype: int64

In [55]:
# Since we only have 1 profession now in our df, we can drop the Profession column
df = df.drop(['Profession'], axis=1)


In [56]:
# Check all columns again
df.dtypes

id                                         int64
Gender                                     int64
Age                                      float64
City                                      object
Academic Pressure                        float64
Work Pressure                            float64
CGPA                                     float64
Study Satisfaction                       float64
Job Satisfaction                         float64
Sleep Duration                            object
Dietary Habits                            object
Degree                                    object
Have you ever had suicidal thoughts ?     object
Work/Study Hours                         float64
Financial Stress                         float64
Family History of Mental Illness          object
Depression                                 int64
dtype: object

In [57]:
# Check Academic Pressure column
df['Academic Pressure'].value_counts()

Academic Pressure
3.0    5785
5.0    5167
4.0    4112
1.0    3546
2.0    3140
0.0       7
Name: count, dtype: int64

In [60]:
# drop rows with Academic Pressure = 0 due to low count
df = df.loc[df['Academic Pressure'] > 0]

# Check values again
df['Academic Pressure'].value_counts()

Academic Pressure
3.0    5785
5.0    5167
4.0    4112
1.0    3546
2.0    3140
Name: count, dtype: int64

In [70]:
# Checking Work Pressure column
df['Work Pressure'].value_counts()

Work Pressure
0.0    21750
Name: count, dtype: int64

In [74]:
# Remove Work Pressure column: it is 0.0 for every remaining row, so it carries no information
df = df.drop(['Work Pressure'], axis=1)
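
Columns with a single distinct value can also be found in one pass rather than one at a time. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame with one constant column
toy = pd.DataFrame({
    "Academic Pressure": [3.0, 5.0, 4.0],
    "Work Pressure": [0.0, 0.0, 0.0],
    "Depression": [1, 0, 0],
})

# A column with exactly one distinct value (NaN counted as a value) is constant
constant_cols = [col for col in toy.columns
                 if toy[col].nunique(dropna=False) == 1]

reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```

Running such a check after each round of row filtering can reveal columns that became constant only because of the filtering.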

In [None]:
# Checking Study Satisfaction column
df['Study Satisfaction'].value_counts()

Study Satisfaction
4.0    4825
2.0    4686
3.0    4448
1.0    4336
5.0    3453
0.0       2
Name: count, dtype: int64

In [None]:
# Remove rows where study satisfaction = 0.0 due to low count
df = df.loc[df['Study Satisfaction'] != 0]
df['Study Satisfaction'].value_counts()

Study Satisfaction
4.0    4825
2.0    4686
3.0    4448
1.0    4336
5.0    3453
Name: count, dtype: int64

In [80]:
df.dtypes

id                                         int64
Gender                                     int64
Age                                      float64
City                                      object
Academic Pressure                        float64
CGPA                                     float64
Study Satisfaction                       float64
Job Satisfaction                         float64
Sleep Duration                            object
Dietary Habits                            object
Degree                                    object
Have you ever had suicidal thoughts ?     object
Work/Study Hours                         float64
Financial Stress                         float64
Family History of Mental Illness          object
Depression                                 int64
dtype: object

In [None]:
# Checking Sleep Duration
