# COMP-2704: Supervised Machine Learning Term Project

## Data Analysis and Preparation 

### Problem and Use Case Definition
Many external and internal factors we face every day can affect our mental health, either positively or negatively. Although we can often tell what is good for our mental health from what is not, staying aware of every factor and its consequences is not an easy task.

In this use case, we use machine learning to identify the positive and negative aspects of our lives in detail, developing a classification model to demonstrate how data science can support medical diagnosis. The goal is to enable earlier detection and provide employees with timely and relevant treatment.

### Dataset Description
We are going to use the `mental_health_dataset`. This dataset comprises 50,000 records capturing various mental health and lifestyle factors. These factors are represented as the feature columns of the dataset, which fall into the following categories:
- **Demographic:** This includes each person's demographic information, such as age, gender, occupation and country.
- **Mental Health Indicators:** This includes features like the stress level, consultation history and medication usage.
- **Lifestyle:** This includes complementary information such as sleep hours, work hours, physical activity, social media usage and diet quality.
- **Additional Details:** This includes information about external factors such as smoking and alcohol consumption habits, categorized into multiple levels.
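
The categories above can be written down as a small lookup, which may be handy later when selecting features by group. This is only a sketch: the column names come from the dataset header, while the names `FEATURE_GROUPS` and `ALL_FEATURES` are hypothetical and not part of the dataset.

```python
# Hypothetical grouping of the feature columns by category.
# Column names are taken from the dataset header; the dict name is illustrative.
FEATURE_GROUPS = {
    "demographic": ["Age", "Gender", "Occupation", "Country"],
    "mental_health_indicators": [
        "Stress_Level",
        "Consultation_History",
        "Medication_Usage",
    ],
    "lifestyle": [
        "Sleep_Hours",
        "Work_Hours",
        "Physical_Activity_Hours",
        "Social_Media_Usage",
        "Diet_Quality",
    ],
    "additional_details": ["Smoking_Habit", "Alcohol_Consumption"],
}

# Flatten the groups into a single list of feature columns.
ALL_FEATURES = [col for cols in FEATURE_GROUPS.values() for col in cols]
print(len(ALL_FEATURES))
```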

### Classification Model Considerations
*If it’s a classification problem, reflect on whether false positives or false negatives carry more weight.*
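
One way to ground this reflection is to count both error types on a small example. The sketch below uses made-up labels (not results from this project) and counts them by hand so the definitions stay explicit: a false negative is a person with a condition the model misses, and a false positive is a person flagged without having one.

```python
# Toy labels for illustration only (1 = has a condition, 0 = does not).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly detected
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed cases
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared

# Recall drops with false negatives; precision drops with false positives.
recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"FN={fn}, FP={fp}, recall={recall:.2f}, precision={precision:.2f}")
```

If missed cases are judged more costly than false alarms, recall would be the metric to watch most closely; if the opposite holds, precision would be.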



## Clean the data
### Missing values and Duplicates

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv('data/mental_health_data final data.csv')
df.head()

| | User_ID | Age | Gender | Occupation | Country | Mental_Health_Condition | Severity | Consultation_History | Stress_Level | Sleep_Hours | Work_Hours | Physical_Activity_Hours | Social_Media_Usage | Diet_Quality | Smoking_Habit | Alcohol_Consumption | Medication_Usage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 36 | Male | Education | Australia | Yes | NaN | Yes | Low | 7.6 | 46 | 8 | 2.2 | Healthy | Regular Smoker | Regular Drinker | Yes |
| 1 | 2 | 48 | Male | Engineering | Other | No | Low | No | Low | 6.8 | 74 | 2 | 3.4 | Unhealthy | Heavy Smoker | Social Drinker | No |
| 2 | 3 | 18 | Prefer not to say | Sales | India | No | NaN | Yes | Medium | 7.1 | 77 | 9 | 5.9 | Healthy | Heavy Smoker | Social Drinker | No |
| 3 | 4 | 30 | Non-binary | Engineering | Australia | No | Medium | No | Low | 6.9 | 57 | 4 | 5.4 | Average | Regular Smoker | Regular Drinker | No |
| 4 | 5 | 58 | Male | IT | USA | Yes | NaN | Yes | High | 4.7 | 45 | 10 | 3.3 | Unhealthy | Regular Smoker | Non-Drinker | Yes |


Getting information about our dataset:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   User_ID                  50000 non-null  int64  
 1   Age                      50000 non-null  int64  
 2   Gender                   50000 non-null  object 
 3   Occupation               50000 non-null  object 
 4   Country                  50000 non-null  object 
 5   Mental_Health_Condition  50000 non-null  object 
 6   Severity                 24998 non-null  object 
 7   Consultation_History     50000 non-null  object 
 8   Stress_Level             50000 non-null  object 
 9   Sleep_Hours              50000 non-null  float64
 10  Work_Hours               50000 non-null  int64  
 11  Physical_Activity_Hours  50000 non-null  int64  
 12  Social_Media_Usage       50000 non-null  float64
 13  Diet_Quality             50000 non-null  object 
 14  Smoking_Habit         

As we can see, the only column with null values is `Severity`. Let's try to understand why:

In [5]:
print(df['Severity'].value_counts())
print(df['Severity'].isnull().sum())

Severity
Medium    8436
High      8301
Low       8261
Name: count, dtype: int64
25002


In [6]:
df[df['Severity'].isnull()][['Mental_Health_Condition','Severity']]

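| | Mental_Health_Condition | Severity |
|---|---|---|
| 0 | Yes | NaN |
| 2 | No | NaN |
| 4 | Yes | NaN |
| 7 | Yes | NaN |
| 10 | Yes | NaN |
| ... | ... | ... |
| 49985 | No | NaN |
| 49988 | No | NaN |
| 49990 | No | NaN |
| 49992 | No | NaN |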


In conclusion, the severity of the mental health condition is missing for about half of the rows in our dataset (25,002 of 50,000), and the missing values appear both for individuals who report a condition and for those who do not. For now, we will drop the rows where that information is missing, as we plan to use the severity alongside the mental health condition as the label for our model.
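
To check how the missing `Severity` values are distributed across `Mental_Health_Condition`, a cross-tabulation can help. The sketch below uses a small toy frame with illustrative values rather than the real data; the names `toy`, `severity_status` and `missing_by_condition` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the two columns of interest; values are illustrative.
toy = pd.DataFrame({
    "Mental_Health_Condition": ["Yes", "No", "No", "No", "Yes", "Yes"],
    "Severity": [np.nan, "Low", np.nan, "Medium", np.nan, "High"],
})

# Turn the null mask into readable labels before cross-tabulating.
severity_status = toy["Severity"].isnull().map({True: "missing", False: "present"})

# Rows: reported condition; columns: whether Severity is missing or present.
missing_by_condition = pd.crosstab(toy["Mental_Health_Condition"], severity_status)
print(missing_by_condition)
```

Running the same cross-tabulation on `df` would show whether missingness is tied to the reported condition or spread across both groups.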

In [7]:
# Filtering out rows with null Severity
df = df[df['Severity'].notnull()]
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24998 entries, 1 to 49999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   User_ID                  24998 non-null  int64  
 1   Age                      24998 non-null  int64  
 2   Gender                   24998 non-null  object 
 3   Occupation               24998 non-null  object 
 4   Country                  24998 non-null  object 
 5   Mental_Health_Condition  24998 non-null  object 
 6   Severity                 24998 non-null  object 
 7   Consultation_History     24998 non-null  object 
 8   Stress_Level             24998 non-null  object 
 9   Sleep_Hours              24998 non-null  float64
 10  Work_Hours               24998 non-null  int64  
 11  Physical_Activity_Hours  24998 non-null  int64  
 12  Social_Media_Usage       24998 non-null  float64
 13  Diet_Quality             24998 non-null  object 
 14  Smoking_Habit            24

This is the dataset we are going to work with.
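
Since the stated plan is to use the severity alongside the mental health condition as the label, one possible way to build such a label is to join the two columns into a single string. This is a sketch under that assumption, shown on toy rows; the column name `Label` and the `"_"` separator are hypothetical choices, not something the dataset defines.

```python
import pandas as pd

# Toy rows with both columns populated, as in the filtered dataset.
toy = pd.DataFrame({
    "Mental_Health_Condition": ["No", "No", "Yes", "Yes"],
    "Severity": ["Low", "Medium", "High", "Low"],
})

# Hypothetical combined label such as "Yes_High"; `Label` is an assumed name.
toy["Label"] = toy["Mental_Health_Condition"] + "_" + toy["Severity"]
print(toy["Label"].value_counts())
```

With two condition values and three severity levels, this encoding yields up to six classes.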

### Checking duplicates

In [8]:
any(df.duplicated())

False

No duplicates in the data!
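
One caveat worth checking: if `User_ID` is a unique identifier per record (an assumption here, not something verified above), a full-row comparison can never flag a duplicate, because the identifier always differs. A minimal sketch on toy data of repeating the check without the identifier column:

```python
import pandas as pd

# Toy frame: rows 0 and 1 share the same feature values but differ in User_ID.
toy = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Age": [30, 30, 41],
    "Gender": ["Male", "Male", "Female"],
})

# With the identifier included, no two rows match.
with_id = bool(toy.duplicated().any())

# Dropping the identifier first exposes rows that repeat the same feature values.
without_id = bool(toy.drop(columns="User_ID").duplicated().any())
print(with_id, without_id)
```

Whether identical feature values across different users should count as duplicates is a modelling decision; the check only makes such rows visible.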

In [10]:
df.describe()

| | User_ID | Age | Sleep_Hours | Work_Hours | Physical_Activity_Hours | Social_Media_Usage |
|---|---|---|---|---|---|---|
| count | 24998.0 | 24998.0 | 24998.0 | 24998.0 | 24998.0 | 24998.0 |
| mean | 25109.395792 | 41.472718 | 7.018721 | 55.12765 | 5.011961 | 3.258065 |
| std | 14435.387779 | 13.886295 | 1.732755 | 14.648357 | 3.160148 | 1.594494 |
| min | 2.0 | 18.0 | 4.0 | 30.0 | 0.0 | 0.5 |
| 25% | 12579.25 | 29.0 | 5.5 | 42.0 | 2.0 | 1.9 |
| 50% | 25218.5 | 41.0 | 7.0 | 55.0 | 5.0 | 3.3 |
| 75% | 37595.25 | 54.0 | 8.5 | 68.0 | 8.0 | 4.6 |
| max | 50000.0 | 65.0 | 10.0 | 80.0 | 10.0 | 6.0 |
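
The minimum and maximum values in this summary can double as a simple sanity check for later steps. The sketch below counts values that fall outside those bounds; the helper `out_of_range_counts` and the `EXPECTED_RANGES` dictionary are hypothetical names, and the toy frame deliberately contains one out-of-range age.

```python
import pandas as pd

# Bounds copied from the min/max rows of the describe() output.
EXPECTED_RANGES = {
    "Age": (18, 65),
    "Sleep_Hours": (4.0, 10.0),
    "Work_Hours": (30, 80),
    "Physical_Activity_Hours": (0, 10),
    "Social_Media_Usage": (0.5, 6.0),
}

def out_of_range_counts(frame, ranges):
    """Return, per column, how many values fall outside [low, high] (inclusive)."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }

# Toy frame for illustration; the age of 70 is intentionally out of range.
toy = pd.DataFrame({
    "Age": [18, 65, 70],
    "Sleep_Hours": [4.0, 7.5, 10.0],
    "Work_Hours": [30, 55, 80],
    "Physical_Activity_Hours": [0, 5, 10],
    "Social_Media_Usage": [0.5, 3.3, 6.0],
})
counts = out_of_range_counts(toy, EXPECTED_RANGES)
print(counts)
```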
