# 🧠 Mental Health in Tech - Capstone Project (OpenLearn Cohort 1.0)
## 📅 Day 1 - Data Cleaning & Exploratory Data Analysis (EDA)
👨‍💻 Written by a 2nd-year Engineering Student

---
This notebook cleans and explores a mental health survey dataset from the tech industry. We'll go step by step, from loading the data to exploring patterns in it.


In [None]:
# 🛠 Importing the Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')  # Hide warnings to keep the output clean
sns.set_theme()  # Seaborn's default look; plt.style.use('seaborn') fails on recent Matplotlib
sns.set_palette("husl")


## 1️⃣ Load and Explore the Dataset
First, we'll load the CSV file and get a basic overview of its shape and content.

In [None]:
def load_data(file_path):
    df = pd.read_csv(file_path)
    print("✅ Data Loaded Successfully!")
    print("Shape of data:", df.shape)
    print("\nColumns:", df.columns.tolist())
    print("\nFirst 5 rows:")
    display(df.head())
    return df

# Example usage:
# df = load_data('mental_health_survey.csv')
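
Beyond shape and column names, a per-column count of missing values is a useful part of a first overview. The sketch below shows one way to get it. It uses a tiny in-memory CSV as a stand-in for the real file, and the column names (`Age`, `Gender`, `treatment`) are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey CSV; column names and values are made up.
csv_text = """Age,Gender,treatment
29,Male,Yes
,female,No
41,F,
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Count and percentage of missing values per column.
missing = sample.isnull().sum()
missing_pct = (missing / len(sample) * 100).round(1)
overview = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})

print("Shape:", sample.shape)
print(overview)
```

Running the same three lines on the real `df` shows which columns need attention before any cleaning starts.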

## 2️⃣ Cleaning the Age Column
We'll treat ages that don't make sense (below 18 or above 100) as missing and replace them with the median age.

In [None]:
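def clean_age(df):
    if 'Age' in df.columns:
        # Coerce non-numeric entries to NaN so the range check cannot raise
        age = pd.to_numeric(df['Age'], errors='coerce')
        # Keep ages from 18 to 100; everything else becomes NaN
        age = age.where(age.between(18, 100))
        # Assign back instead of calling fillna(inplace=True) on a column,
        # which is unreliable in recent pandas versions
        df['Age'] = age.fillna(age.median())
    return df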
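
To see what this rule does, here is a minimal sketch on made-up ages. It is a standalone illustration of the "out of range becomes missing, then fill with the median" idea, not a replacement for the function above. The toy values are assumptions.

```python
import pandas as pd

# Toy data (made up): three plausible ages and three implausible ones.
toy = pd.DataFrame({"Age": [25, 32, -1, 329, 41, 5]})

# Mark ages outside 18-100 as missing.
valid = toy["Age"].where(toy["Age"].between(18, 100))

# Fill the gaps with the median of the remaining valid ages.
cleaned = valid.fillna(valid.median())

print(cleaned.tolist())
```

The median is computed only from the ages that survived the range check, so extreme entries like 329 cannot distort the fill value.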

## 3️⃣ Cleaning the Gender Column
We'll map the different free-text spellings of 'Male' and 'Female' to standard categories and group everything else as 'Other'.

In [None]:
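def clean_gender(df):
    # Sets give simple exact-match lookups
    male_terms = {'male', 'm', 'man', 'cis male'}
    female_terms = {'female', 'f', 'woman', 'cis female'}
    
    def simplify_gender(g):
        # Normalize case and surrounding whitespace before comparing
        g = str(g).strip().lower()
        # Exact matching: a substring test would label 'female' and 'woman'
        # as 'Male', because they contain 'male', 'man', and 'm'
        if g in male_terms:
            return 'Male'
        elif g in female_terms:
            return 'Female'
        else:
            return 'Other'
    
    df['Gender'] = df['Gender'].apply(simplify_gender)
    return df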
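
The way strings are compared matters a lot for this step. The sketch below runs two hypothetical helpers, `by_substring` and `by_exact_match`, on a few made-up responses. The term sets and responses are assumptions chosen for illustration.

```python
# Hypothetical term sets and survey responses, for illustration only.
male_terms = {"male", "m", "man", "cis male"}
female_terms = {"female", "f", "woman", "cis female"}

def by_substring(g):
    """Label a response by checking whether any term appears inside it."""
    g = str(g).strip().lower()
    if any(term in g for term in male_terms):
        return "Male"
    if any(term in g for term in female_terms):
        return "Female"
    return "Other"

def by_exact_match(g):
    """Label a response only when it equals one of the known terms."""
    g = str(g).strip().lower()
    if g in male_terms:
        return "Male"
    if g in female_terms:
        return "Female"
    return "Other"

responses = ["Male", "female ", "Woman", "M", "non-binary"]
substring_labels = [by_substring(r) for r in responses]
exact_labels = [by_exact_match(r) for r in responses]

print(substring_labels)
print(exact_labels)
```

With substring matching, 'female ' and 'Woman' both come out as 'Male', because 'female' contains 'male' and 'woman' contains 'man'. Exact matching avoids that, at the cost of sending unusual spellings to 'Other' unless they are added to the term sets.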

## 4️⃣ Handling Missing Values
We'll fill missing values in numerical columns with the median, and in text columns with the mode (the most frequent value).

In [None]:
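def handle_missing_values(df):
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                # Numeric column: fill with the median
                df[col] = df[col].fillna(df[col].median())
            else:
                # Text column: fill with the most frequent value
                modes = df[col].mode()
                # mode() is empty when the whole column is missing; skip it then
                if not modes.empty:
                    df[col] = df[col].fillna(modes[0])
    return df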
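
A small worked example makes the median/mode rule concrete. The sketch below applies it to a made-up two-column frame. The column names and values are assumptions, and the numeric check via `pd.api.types.is_numeric_dtype` is one reasonable way to separate the two cases.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): one numeric and one text column, each with a gap.
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 41.0, 32.0],
    "Country": ["India", "India", None, "Canada"],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numeric column: fill with the median of the observed values.
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        # Text column: fill with the most frequent observed value.
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Here the missing age becomes 32.0 (the median of 25, 32, and 41) and the missing country becomes 'India' (the most frequent value).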

## 5️⃣ Basic Univariate Plots (Age & Gender)

In [None]:
def plot_basic_stats(df):
    if 'Age' in df.columns:
        plt.figure(figsize=(10,4))
        sns.histplot(df['Age'], kde=True)
        plt.title("Age Distribution")
        plt.show()
    
    if 'Gender' in df.columns:
        plt.figure(figsize=(6,4))
        sns.countplot(data=df, x='Gender')
        plt.title("Gender Distribution")
        plt.show()