## Data Cleaning
---

The cleaning strategy we will use is as follows:

1. Duplicated Rows: Delete duplicated rows.
2. Column Issues:
    - Drop Irrelevant Columns
    - Standardize Column Name Case
    - Rename Columns
    - Handle Data Types
    - Handle Data Inconsistencies
3. Missing Values: Replace nulls with `Unknown` and add a missingness flag.

In [20]:
# --------------------
# Import Libraries
# --------------------
import pandas as pd

In [3]:
# --------------
# Read The Data
# --------------
data = pd.read_csv('../data/raw/Mental Health Dataset.csv')

### Duplicated Rows

The dataset contained 2,313 duplicated rows, which were removed so that each participant’s response is represented only once. Eliminating duplicate records helps prevent unintended bias in the analysis and ensures that patterns discovered in later stages reflect genuine variation in the data rather than repeated observations.

In [4]:
# ----------------------
# Check for duplicates
# ----------------------
data.duplicated().sum()

np.int64(2313)

In [5]:
# ----------------------
# Drop duplicates
# ----------------------
data.drop_duplicates(keep='first', inplace=True)
data.duplicated().sum()

np.int64(0)
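
To make explicit what `duplicated()` counts, here is a minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from the survey). Only the second and later copies of a row are flagged, so the count equals the number of rows that `drop_duplicates(keep='first')` removes.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "treatment": ["Yes", "No", "Yes"],
})

# duplicated() marks only the later copy, so one row is counted.
n_dupes = int(toy.duplicated().sum())

# keep="first" retains the first occurrence; reset_index gives a clean 0..n-1 index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes, len(toy), len(deduped))
```

Resetting the index is optional; it is included here only because dropped rows otherwise leave gaps in the index labels.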

### Column Issues

- Dropping Irrelevant Columns
- Standardizing Column Name Case
- Renaming Columns
- Handling Data Types
- Handling Data Inconsistencies

---

#### Dropping Irrelevant Columns

The `Timestamp` column was removed from the dataset, as it does not contribute meaningful information to the analysis. Since the goal of this project is to understand patterns in mental health–related responses rather than temporal trends, retaining this column would add unnecessary noise without providing analytical value.

In [6]:
# ----------------------------
# Dropping Irrelevant Columns
# ----------------------------
data.drop(columns='Timestamp', inplace=True)
data.columns

Index(['Gender', 'Country', 'Occupation', 'self_employed', 'family_history',
       'treatment', 'Days_Indoors', 'Growing_Stress', 'Changes_Habits',
       'Mental_Health_History', 'Mood_Swings', 'Coping_Struggles',
       'Work_Interest', 'Social_Weakness', 'mental_health_interview',
       'care_options'],
      dtype='object')

#### Standardizing Column Names

All column names were stripped of surrounding whitespace and converted to lowercase to ensure consistency and improve readability throughout the analysis. This step reduces the risk of errors when referencing columns during preprocessing, analysis, and modeling.

In [7]:
# ---------------------------
# Standardizing Column Names
# ---------------------------
data.columns = data.columns.str.strip().str.lower()

In [8]:
data.columns

Index(['gender', 'country', 'occupation', 'self_employed', 'family_history',
       'treatment', 'days_indoors', 'growing_stress', 'changes_habits',
       'mental_health_history', 'mood_swings', 'coping_struggles',
       'work_interest', 'social_weakness', 'mental_health_interview',
       'care_options'],
      dtype='object')
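
The same normalization can be tried on a small frame with deliberately messy headers. This is only a sketch: the headers below are hypothetical, and the uniqueness check is an extra safeguard that is not part of the original notebook.

```python
import pandas as pd

# Hypothetical headers with mixed case and stray whitespace.
toy = pd.DataFrame(columns=[" Gender", "Days_Indoors ", "self_employed"])

# strip() removes surrounding whitespace; lower() normalizes case.
toy.columns = toy.columns.str.strip().str.lower()

# Lowercasing can make two headers collide, so confirm they are still unique.
names_are_unique = toy.columns.is_unique

print(list(toy.columns), names_are_unique)
```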

#### Renaming of Ambiguous Columns

Several columns were renamed to make their meanings more explicit and reduce ambiguity. The updated names aim to clearly reflect the underlying concepts captured by each feature, improving interpretability and making the dataset easier to reason about during analysis and modeling. This step is particularly important in a mental health context, where precise terminology helps avoid misinterpretation of participants’ responses.

In [9]:
# ---------------------
# Renaming Column Names
# ---------------------

data.rename(
    columns={
        'family_history': 'family_mh_history',
        'treatment': 'sought_treatment',
        'days_indoors': 'days_spent_indoors',
        'growing_stress': 'noticed_growing_stress',
        'changes_habits': 'noticed_habit_changes',
        'mental_health_history': 'personal_mh_history',
        'coping_struggles': 'coping_difficulty',
        'work_interest': 'work_engagement',
        'social_weakness': 'social_difficulty',
        'mental_health_interview': 'disclose_mh_to_employer',
        'care_options': 'care_options_awareness'
    },
    inplace=True
)

In [10]:
data.columns

Index(['gender', 'country', 'occupation', 'self_employed', 'family_mh_history',
       'sought_treatment', 'days_spent_indoors', 'noticed_growing_stress',
       'noticed_habit_changes', 'personal_mh_history', 'mood_swings',
       'coping_difficulty', 'work_engagement', 'social_difficulty',
       'disclose_mh_to_employer', 'care_options_awareness'],
      dtype='object')
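
One caveat with `rename` is that, by default, keys that do not match any existing column are ignored without warning. The sketch below uses a hypothetical two-column frame and a deliberately misspelled key to show how passing `errors="raise"` surfaces that kind of mistake.

```python
import pandas as pd

toy = pd.DataFrame(columns=["treatment", "care_options"])

mapping = {
    "treatment": "sought_treatment",
    "care_optons": "care_options_awareness",  # deliberate typo in the key
}

# Default behaviour: the misspelled key is skipped silently.
silent = toy.rename(columns=mapping)

# errors="raise" turns the same mistake into a KeyError.
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True

print(list(silent.columns), raised)
```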

#### Standardizing Column Values

To ensure consistency across categorical features, all values in text columns were standardized to title case. Most columns already followed this convention; the `care_options_awareness` feature (originally `care_options`) was the exception. A uniform format improves readability, prevents one category from being split across differently cased labels, and ensures consistent handling during encoding and modeling.

In [11]:
# ----------------------------
# Standardizing Column Values
# ----------------------------

# Standardizing the columns into title case
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].str.title()
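
As a rough illustration of why case matters, the sketch below applies `str.title()` to a hypothetical column in which the same answer appears with two different casings. Missing values pass through unchanged, which is why the nulls in `self_employed` are still present after this step.

```python
import pandas as pd

# Hypothetical responses: "Yes" and "yes" are the same answer with different casing.
s = pd.Series(["Yes", "yes", "Not sure", None], dtype="object")

titled = s.str.title()

# Distinct labels before and after (nulls are not counted by nunique).
before, after = s.nunique(), titled.nunique()

# Nulls are left as nulls rather than being converted to text.
n_missing = int(titled.isna().sum())

print(before, after, n_missing)
```

Note that `str.title()` capitalizes every word, so multi-word labels are changed as well (here, `Not sure` becomes `Not Sure`).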

In [12]:
for column in data.select_dtypes(include='object').columns:
    print(f"\nColumn: {column}")
    print(data[column].unique())
    print('-' * 50)


Column: gender
['Female' 'Male']
--------------------------------------------------

Column: country
['United States' 'Poland' 'Australia' 'Canada' 'United Kingdom'
 'South Africa' 'Sweden' 'New Zealand' 'Netherlands' 'India' 'Belgium'
 'Ireland' 'France' 'Portugal' 'Brazil' 'Costa Rica' 'Russia' 'Germany'
 'Switzerland' 'Finland' 'Israel' 'Italy' 'Bosnia And Herzegovina'
 'Singapore' 'Nigeria' 'Croatia' 'Thailand' 'Denmark' 'Mexico' 'Greece'
 'Moldova' 'Colombia' 'Georgia' 'Czech Republic' 'Philippines']
--------------------------------------------------

Column: occupation
['Corporate' 'Student' 'Business' 'Housewife' 'Others']
--------------------------------------------------

Column: self_employed
[nan 'No' 'Yes']
--------------------------------------------------

Column: family_mh_history
['No' 'Yes']
--------------------------------------------------

Column: sought_treatment
['Yes' 'No']
--------------------------------------------------

Column: days_spent_indoors
['1-14 Day

### Missing Values

Missing values were present only in the `self_employed` column (5,193 rows). Rather than imputing these values using assumptions or probabilistic methods, we treated the missingness itself as potentially meaningful information. All missing entries in `self_employed` were replaced with the category `Unknown`, preserving the categorical nature of the feature.

In addition, a binary missingness indicator was introduced to explicitly capture whether a response was originally missing. This approach allows downstream clustering models to determine whether the absence of a response carries structural significance, without forcing artificial values into the data. The impact of this decision will later be evaluated by comparing clustering results with and without the missingness flag.

In [13]:
# ----------------------
# Check for null values
# ----------------------
data.isna().sum()

gender                        0
country                       0
occupation                    0
self_employed              5193
family_mh_history             0
sought_treatment              0
days_spent_indoors            0
noticed_growing_stress        0
noticed_habit_changes         0
personal_mh_history           0
mood_swings                   0
coping_difficulty             0
work_engagement               0
social_difficulty             0
disclose_mh_to_employer       0
care_options_awareness        0
dtype: int64

In [19]:
# ------------------------------------------
# Handling missing values for self_employed
# ------------------------------------------

data['self_employed_missing'] = data['self_employed'].isna().astype(int)
data['self_employed'] = data['self_employed'].fillna('Unknown')

print(data['self_employed'].value_counts(dropna=False))
print('-' * 25)
print(data['self_employed_missing'].value_counts())

self_employed
No         255711
Yes         29147
Unknown      5193
Name: count, dtype: int64
-------------------------
self_employed_missing
0    290051
Name: count, dtype: int64
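
The flag counts printed above contain only 0, even though 5,193 rows are labelled `Unknown`. One explanation consistent with the out-of-order execution count (`In [19]`) is that the cell was run again after `fillna` had already replaced the nulls, so `isna()` found nothing left to flag. The sketch below, on a hypothetical toy column, shows one way to make the step safe to re-run by deriving the flag from both nulls and the `Unknown` label. The helper name `flag_and_fill` is made up for illustration, and the approach assumes `Unknown` never occurs as a genuine response.

```python
import pandas as pd

toy = pd.DataFrame({"self_employed": [None, "No", "Yes", None]})


def flag_and_fill(df: pd.DataFrame, col: str = "self_employed") -> pd.DataFrame:
    """Add a 0/1 missingness flag and fill nulls; safe to call repeatedly."""
    out = df.copy()
    # Treat both real nulls and the placeholder label as missing.
    missing = out[col].isna() | out[col].eq("Unknown")
    out[f"{col}_missing"] = missing.astype(int)
    out[col] = out[col].fillna("Unknown")
    return out


once = flag_and_fill(toy)
twice = flag_and_fill(once)  # a second run leaves the flag intact

print(twice["self_employed_missing"].tolist())
```

Given the `value_counts` shown above, a correctly populated flag would be expected to contain 5,193 ones and 284,858 zeros.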


## Saving Cleaned Dataset
---

In [21]:
data.to_csv("../data/processed/mentalHealthData_Cleaned.csv", index=False)
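
A quick round-trip check can confirm that the saved file reloads as intended. The sketch below writes a hypothetical toy frame to a temporary directory rather than to the project path; it creates the parent folder first, because `to_csv` does not create missing directories, and then verifies that the `Unknown` label comes back as text rather than as a null.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "self_employed": ["Unknown", "No", "Yes"],
    "self_employed_missing": [1, 0, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name inside a throwaway directory.
    out_path = Path(tmp) / "processed" / "toy_cleaned.csv"
    # Create the parent folder before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# "Unknown" is not one of read_csv's default null markers, so no nulls reappear.
n_nulls = int(reloaded["self_employed"].isna().sum())

print(reloaded.shape, n_nulls)
```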