# Depression Detection

## Data Preprocessing

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Let's start by importing our data and inspecting the DataFrame.

In [2]:
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Projects/Machine Learning projects/Depression Detection/Data/test.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93800 entries, 0 to 93799
Data columns (total 19 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   id                                     93800 non-null  int64  
 1   Name                                   93800 non-null  object 
 2   Gender                                 93800 non-null  object 
 3   Age                                    93800 non-null  float64
 4   City                                   93800 non-null  object 
 5   Working Professional or Student        93800 non-null  object 
 6   Profession                             69168 non-null  object 
 7   Academic Pressure                      18767 non-null  float64
 8   Work Pressure                          75022 non-null  float64
 9   CGPA                                   18766 non-null  float64
 10  Study Satisfaction                     18767 non-null  float64
 11  Jo

The output shows that several columns have missing values. The column names are also inconvenient to use in code, so we will rename them. Let's look at the head of the dataset to check whether any of the object-dtype features should be cast to another type.
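As a rough sketch of how the missing values could be quantified, the snippet below counts missing entries per column and expresses them as a percentage. The `demo` frame and the `missing_summary` helper are hypothetical stand-ins for illustration; in the notebook the helper would be called on `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`, small enough to verify by eye
demo = pd.DataFrame({
    "Profession": ["Judge", None, "Teacher", None],
    "Academic Pressure": [None, 5.0, None, 1.0],
    "Work Pressure": [2.0, None, 5.0, None],
    "Age": [53.0, 23.0, 47.0, 21.0],
})

def missing_summary(df):
    """Return missing-value count and percentage per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

print(missing_summary(demo))
```

Columns without gaps (here `Age`) are dropped from the summary, so the result lists only what needs attention.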

In [None]:
data.head(10)

Unnamed: 0,id,Name,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness
0,140700,Shivam,Male,53.0,Visakhapatnam,Working Professional,Judge,,2.0,,,5.0,Less than 5 hours,Moderate,LLB,No,9.0,3.0,Yes
1,140701,Sanya,Female,58.0,Kolkata,Working Professional,Educational Consultant,,2.0,,,4.0,Less than 5 hours,Moderate,B.Ed,No,6.0,4.0,No
2,140702,Yash,Male,53.0,Jaipur,Working Professional,Teacher,,4.0,,,1.0,7-8 hours,Moderate,B.Arch,Yes,12.0,4.0,No
3,140703,Nalini,Female,23.0,Rajkot,Student,,5.0,,6.84,1.0,,More than 8 hours,Moderate,BSc,Yes,10.0,4.0,No
4,140704,Shaurya,Male,47.0,Kalyan,Working Professional,Teacher,,5.0,,,5.0,7-8 hours,Moderate,BCA,Yes,3.0,4.0,No
5,140705,Kartik,Male,29.0,Mumbai,Working Professional,Customer Support,,2.0,,,3.0,More than 8 hours,Moderate,B.Com,No,3.0,2.0,Yes
6,140706,Armaan,Male,47.0,Visakhapatnam,Working Professional,Teacher,,1.0,,,1.0,Less than 5 hours,Healthy,MA,No,10.0,3.0,Yes
7,140707,Ritika,Female,28.0,Mumbai,Working Professional,Customer Support,,5.0,,,3.0,7-8 hours,Healthy,BA,Yes,0.0,2.0,No
8,140708,Navya,Female,21.0,Surat,Student,,1.0,,7.39,3.0,,Less than 5 hours,Healthy,BBA,No,8.0,1.0,Yes
9,140709,Harsha,Male,21.0,Jaipur,Working Professional,,,5.0,,,1.0,Less than 5 hours,Healthy,Class 12,Yes,10.0,4.0,No


In [None]:
object_columns = data.select_dtypes(include=['object']).columns
print(object_columns)

Index(['Name', 'Gender', 'City', 'Working Professional or Student',
       'Profession', 'Sleep Duration', 'Dietary Habits', 'Degree',
       'Have you ever had suicidal thoughts ?',
       'Family History of Mental Illness'],
      dtype='object')


Sleep Duration should be converted to a numerical value (or a duration type). We also have categorical features that need to be encoded, such as Degree and Dietary Habits.
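One possible way to do both conversions is sketched below. The `SLEEP_HOURS` mapping is an assumption: it covers only the three labels visible in the preview, and the hour values are rough guesses rather than anything defined by the dataset. The `demo` frame is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Assumed mapping from text categories to approximate hours (illustrative values)
SLEEP_HOURS = {
    "Less than 5 hours": 4.5,
    "7-8 hours": 7.5,
    "More than 8 hours": 8.5,
}

# Hypothetical stand-in for `data`
demo = pd.DataFrame({
    "Sleep Duration": ["Less than 5 hours", "7-8 hours", "More than 8 hours", "some other label"],
    "Dietary Habits": ["Moderate", "Healthy", "Moderate", "Healthy"],
})

# Labels missing from the mapping become NaN, which makes them easy to spot later
demo["SleepHours"] = demo["Sleep Duration"].map(SLEEP_HOURS)

# One-hot encode a nominal categorical feature
encoded = pd.get_dummies(demo, columns=["Dietary Habits"])
print(encoded.columns.tolist())
```

Mapping with a dictionary keeps unexpected labels visible as NaN instead of silently assigning them a number. One-hot encoding is only one option; a high-cardinality feature such as Degree may call for a different encoding.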

But let's first start with level 1 data preprocessing. We need to ensure that:

1. the data is in a standard, preferred data structure
2. the column titles are intuitive and codable
3. each row has a unique identifier

Conditions 1 and 3 are already satisfied. We only need to rename some of the columns to intuitive, codable names.
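Conditions 1 and 3 can also be checked programmatically. The sketch below is a minimal version under the assumption that `id` is the intended row identifier; `has_unique_id` is a hypothetical helper, and `demo` stands in for `data`.

```python
import pandas as pd

def has_unique_id(df, column="id"):
    """True if `column` has no missing values and no duplicates."""
    return bool(df[column].notna().all() and df[column].is_unique)

# Hypothetical stand-in for `data`
demo = pd.DataFrame({"id": [140700, 140701, 140702], "Name": ["Shivam", "Sanya", "Yash"]})

# Condition 1: the data lives in a tabular structure
is_tabular = isinstance(demo, pd.DataFrame)

# Condition 3: every row has a unique identifier
ids_ok = has_unique_id(demo)

# A frame with a repeated id fails the check
duplicated = pd.concat([demo, demo.iloc[[0]]], ignore_index=True)
print(is_tabular, ids_ok, has_unique_id(duplicated))
```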

In [None]:
data.columns


Index(['id', 'Name', 'Gender', 'Age', 'City',
       'Working Professional or Student', 'Profession', 'Academic Pressure',
       'Work Pressure', 'CGPA', 'Study Satisfaction', 'Job Satisfaction',
       'Sleep Duration', 'Dietary Habits', 'Degree',
       'Have you ever had suicidal thoughts ?', 'Work/Study Hours',
       'Financial Stress', 'Family History of Mental Illness'],
      dtype='object')

In [4]:
# Rename columns to short CamelCase names that are easy to use in code
data.rename(columns={
    'Working Professional or Student': 'ProfessionalOrStudent', 'Academic Pressure': 'AcademicPressure',
    'Work Pressure': 'WorkPressure', 'Study Satisfaction': 'StudySatisfaction',
    'Job Satisfaction': 'JobSatisfaction', 'Sleep Duration': 'SleepDuration',
    'Dietary Habits': 'DietaryHabits', 'Have you ever had suicidal thoughts ?': 'SuicidalThoughts',
    'Work/Study Hours': 'WorkOrStudyHours', 'Financial Stress': 'FinancialStress',
    'Family History of Mental Illness': 'FamilyMentalIllness',
}, inplace=True)
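A quick sanity check after renaming is to list any column names that are still not valid Python identifiers. This is only a sketch: `non_identifier_columns` is a hypothetical helper, and "valid identifier" is used here as one possible reading of "codable".

```python
import pandas as pd

def non_identifier_columns(df):
    """Return column names that are not valid Python identifiers."""
    return [c for c in df.columns if not str(c).isidentifier()]

# Hypothetical stand-in with a few of the original column names
demo = pd.DataFrame(columns=["id", "Work/Study Hours", "Sleep Duration", "CGPA"])
before = non_identifier_columns(demo)

renamed = demo.rename(columns={
    "Work/Study Hours": "WorkOrStudyHours",
    "Sleep Duration": "SleepDuration",
})
after = non_identifier_columns(renamed)
print(before, after)
```

An empty list after renaming means every column can be accessed with attribute syntax such as `data.SleepDuration`.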

In [5]:
data.head()

Unnamed: 0,id,Name,Gender,Age,City,ProfessionalOrStudent,Profession,AcademicPressure,WorkPressure,CGPA,StudySatisfaction,JobSatisfaction,SleepDuration,DietaryHabits,Degree,SuicidalThoughts,WorkOrStudyHours,FinancialStress,FamilyMentalIllness
0,140700,Shivam,Male,53.0,Visakhapatnam,Working Professional,Judge,,2.0,,,5.0,Less than 5 hours,Moderate,LLB,No,9.0,3.0,Yes
1,140701,Sanya,Female,58.0,Kolkata,Working Professional,Educational Consultant,,2.0,,,4.0,Less than 5 hours,Moderate,B.Ed,No,6.0,4.0,No
2,140702,Yash,Male,53.0,Jaipur,Working Professional,Teacher,,4.0,,,1.0,7-8 hours,Moderate,B.Arch,Yes,12.0,4.0,No
3,140703,Nalini,Female,23.0,Rajkot,Student,,5.0,,6.84,1.0,,More than 8 hours,Moderate,BSc,Yes,10.0,4.0,No
4,140704,Shaurya,Male,47.0,Kalyan,Working Professional,Teacher,,5.0,,,5.0,7-8 hours,Moderate,BCA,Yes,3.0,4.0,No


Much better. Level 1 data preprocessing is done. Let's handle missing values now.
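Before choosing a strategy, it can help to see whether the gaps follow a pattern. In the preview rows, students have academic fields filled in and working professionals have work fields filled in, so one hedged first step is to compute the share of missing values per group. The `demo` frame below is a hypothetical stand-in built to mirror those preview rows; whether the pattern holds for the full dataset would need to be checked on `data`.

```python
import pandas as pd

# Hypothetical stand-in mirroring the preview rows
demo = pd.DataFrame({
    "ProfessionalOrStudent": ["Working Professional", "Student", "Working Professional", "Student"],
    "AcademicPressure": [None, 5.0, None, 1.0],
    "WorkPressure": [2.0, None, 5.0, None],
    "CGPA": [None, 6.84, None, 7.39],
})

# Fraction of missing values in each column, split by group
missing_share = (
    demo.drop(columns="ProfessionalOrStudent")
        .isna()
        .groupby(demo["ProfessionalOrStudent"])
        .mean()
)
print(missing_share)
```

A share of 1.0 for one group and 0.0 for the other suggests the value is structurally absent rather than randomly missing, which would argue against blindly imputing it.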


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93800 entries, 0 to 93799
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     93800 non-null  int64  
 1   Name                   93800 non-null  object 
 2   Gender                 93800 non-null  object 
 3   Age                    93800 non-null  float64
 4   City                   93800 non-null  object 
 5   ProfessionalOrStudent  93800 non-null  object 
 6   Profession             69168 non-null  object 
 7   AcademicPressure       18767 non-null  float64
 8   WorkPressure           75022 non-null  float64
 9   CGPA                   18766 non-null  float64
 10  StudySatisfaction      18767 non-null  float64
 11  JobSatisfaction        75026 non-null  float64
 12  SleepDuration          93800 non-null  object 
 13  DietaryHabits          93795 non-null  object 
 14  Degree                 93798 non-null  object 
 15  Su

In [7]:
data.describe()

Unnamed: 0,id,Age,AcademicPressure,WorkPressure,CGPA,StudySatisfaction,JobSatisfaction,WorkOrStudyHours,FinancialStress
count,93800.0,93800.0,18767.0,75022.0,18766.0,18767.0,75026.0,93800.0,93800.0
mean,187599.5,40.321685,3.158576,3.011797,7.674016,2.939522,2.96092,6.247335,2.978763
std,27077.871962,12.39348,1.386666,1.403563,1.465056,1.374242,1.41071,3.858191,1.414604
min,140700.0,18.0,1.0,1.0,5.03,1.0,1.0,0.0,1.0
25%,164149.75,29.0,2.0,2.0,6.33,2.0,2.0,3.0,2.0
50%,187599.5,42.0,3.0,3.0,7.8,3.0,3.0,6.0,3.0
75%,211049.25,51.0,4.0,4.0,8.94,4.0,4.0,10.0,4.0
max,234499.0,60.0,5.0,5.0,10.0,5.0,5.0,12.0,5.0
