# 🧠 Mental Health Survey - Categorical Association Analysis

This notebook explores potential relationships between categorical variables in a mental health survey dataset. Specifically, it applies:

- Chi-Square Test of Independence
- Cramér's V for association strength

The goal is to determine whether demographic or workplace attributes are associated with mental health outcomes.

## 📌 Objective
To examine whether certain workplace or demographic factors are associated with mental health treatment and perceptions, using statistical analysis.

We use:
- Chi-square test: to assess whether a relationship exists.
- Cramér’s V: to assess the strength of that relationship.
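
For reference, the two statistics can be written as follows, where $O_{ij}$ and $E_{ij}$ are the observed and expected counts in a contingency table with $r$ rows, $k$ columns and $n$ observations. The symbols are introduced here for illustration; the formula for $V$ matches the `cramers_v` function defined later in this notebook.

$$
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{k} \frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}
$$

$$
V = \sqrt{\frac{\chi^2}{n\,(\min(r,k)-1)}}
$$

$V$ ranges from 0 (no association) to 1 (perfect association).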


In [1]:
# Library import
import pandas as pd
import numpy as np
import scipy.stats as stats

In [2]:
# Load the cleaned Excel file (the same file used in Power BI)
mental_health_data = pd.read_excel('mental_health_clean.xlsx', keep_default_na=False, na_values=[""])
mental_health_data.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6 - 25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,Male,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6 - 25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26 - 100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100 - 500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [3]:
# Check if there are any blanks or unusual values
mental_health_data.info()
mental_health_data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Timestamp                  1250 non-null   datetime64[ns]
 1   Age                        1250 non-null   int64         
 2   Gender                     1250 non-null   object        
 3   Country                    1250 non-null   object        
 4   state                      1250 non-null   object        
 5   self_employed              1232 non-null   object        
 6   family_history             1250 non-null   object        
 7   treatment                  1250 non-null   object        
 8   work_interfere             988 non-null    object        
 9   no_employees               1250 non-null   object        
 10  remote_work                1250 non-null   object        
 11  tech_company               1250 non-null   object        
 12  benefi

Unnamed: 0,0
Timestamp,0
Age,0
Gender,0
Country,0
state,0
self_employed,18
family_history,0
treatment,0
work_interfere,262
no_employees,0
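
Because `self_employed` and `work_interfere` contain missing values, it helps to know that `pd.crosstab` leaves out rows where either variable is NaN by default, so a test involving those columns may use fewer than the 1250 rows reported above. The sketch below shows the effect on a small made-up DataFrame (`toy` is an assumption for illustration, not the survey data).

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not the survey dataset)
toy = pd.DataFrame({
    "work_interfere": ["Often", "Rarely", np.nan, "Never", np.nan, "Often"],
    "treatment":      ["Yes",   "No",     "No",   "No",    "Yes",  "Yes"],
})

table = pd.crosstab(toy["work_interfere"], toy["treatment"])

rows_total = len(toy)                     # all rows, including those with NaN
rows_used = int(table.to_numpy().sum())   # rows that actually enter the table

print(f"Rows in data: {rows_total}, rows in crosstab: {rows_used}")
```

Comparing the two counts before running a test makes the effective sample size explicit.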


### Create Functions (Applied to mental_health_data)

In [4]:
def chi_square_test(data, col1, col2):
    """Run a chi-square test of independence between two categorical columns."""
    # Contingency table of observed counts
    table = pd.crosstab(data[col1], data[col2])
    chi2, p, dof, expected = stats.chi2_contingency(table)

    print(f"Chi-Square Test between: '{col1}' and '{col2}'")
    print("Chi2 Value:", round(chi2, 2))
    print("p-value:", round(p, 4))
    # Compare against the conventional 5% significance level
    if p < 0.05:
        print("→ Significant association (p < 0.05)")
    else:
        print("→ No significant association")
    return table
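
The chi-square approximation is usually considered reliable only when the expected cell counts are not too small; a common rule of thumb asks for at least 5 in every cell. The helper below is a hypothetical addition (`min_expected_count` is not part of the original notebook) that reports the smallest expected count for a contingency table, shown here on made-up counts.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def min_expected_count(table):
    """Return the smallest expected frequency under independence."""
    expected = stats.chi2_contingency(table)[3]
    return float(np.min(expected))

# Made-up counts for illustration only
toy_table = pd.DataFrame(
    [[30, 10], [20, 40]],
    index=["group_a", "group_b"],
    columns=["No", "Yes"],
)

smallest = min_expected_count(toy_table)
print("Smallest expected count:", round(smallest, 2))
if smallest < 5:
    print("Warning: some expected counts are below 5")
```

On the survey data, the same check could be run on the table returned by `chi_square_test`.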

In [5]:
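def cramers_v(data, col1, col2):
    """Print Cramér's V (0 = no association, 1 = perfect association)."""
    table = pd.crosstab(data[col1], data[col2])
    chi2 = stats.chi2_contingency(table)[0]
    n = table.sum().sum()  # total number of observations in the table
    r, k = table.shape     # number of rows and columns
    # V = sqrt(chi2 / (n * (min(r, k) - 1)))
    v = np.sqrt(chi2 / (n * (min(r, k) - 1)))
    print(f"Cramér's V between '{col1}' and '{col2}': {round(v, 3)}")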
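
One detail worth noting: for 2×2 tables, `stats.chi2_contingency` applies Yates' continuity correction by default (`correction=True`), which lowers chi² and therefore Cramér's V slightly. The sketch below is a variant that returns the value instead of printing it and exposes the correction as a parameter. The name `cramers_v_value` and the toy counts are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def cramers_v_value(table, correction=True):
    """Cramér's V for a contingency table, returned as a float."""
    chi2 = stats.chi2_contingency(table, correction=correction)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Made-up 2x2 counts for illustration only
toy_table = pd.DataFrame([[10, 20], [20, 10]])

v_corrected = cramers_v_value(toy_table, correction=True)
v_uncorrected = cramers_v_value(toy_table, correction=False)
print(round(v_corrected, 3), round(v_uncorrected, 3))
```

Whichever setting is chosen, using the same one for the test and for V keeps the reported numbers consistent.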

### Apply Functions

In [6]:
chi_square_test(mental_health_data, "remote_work", "treatment")
cramers_v(mental_health_data, "remote_work", "treatment")

Chi-Square Test between: 'remote_work' and 'treatment'
Chi2 Value: 0.69
p-value: 0.4047
→ No significant association
Cramér's V between 'remote_work' and 'treatment': 0.024


In [7]:
chi_square_test(mental_health_data, "tech_company", "benefits")
cramers_v(mental_health_data, "tech_company", "benefits")

Chi-Square Test between: 'tech_company' and 'benefits'
Chi2 Value: 9.31
p-value: 0.0095
→ Significant association (p < 0.05)
Cramér's V between 'tech_company' and 'benefits': 0.086


In [8]:
chi_square_test(mental_health_data, "coworkers", "supervisor")
cramers_v(mental_health_data, "coworkers", "supervisor")

Chi-Square Test between: 'coworkers' and 'supervisor'
Chi2 Value: 480.84
p-value: 0.0
→ Significant association (p < 0.05)
Cramér's V between 'coworkers' and 'supervisor': 0.439
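
The printed `p-value: 0.0` is a rounding artifact: the p-value is too small to show at four decimal places, not exactly zero. A small formatting helper along these lines (hypothetical; `format_p` is not in the original notebook) avoids reporting a literal zero.

```python
def format_p(p, threshold=1e-4):
    """Format a p-value, reporting very small values as an upper bound."""
    if p < threshold:
        return f"< {threshold:.4f}"
    return f"{p:.4f}"

print("p-value:", format_p(0.4047))
print("p-value:", format_p(3.2e-12))  # made-up tiny value for illustration
```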


In [9]:
chi_square_test(mental_health_data, "Gender", "treatment")
cramers_v(mental_health_data, "Gender", "treatment")

Chi-Square Test between: 'Gender' and 'treatment'
Chi2 Value: 51.17
p-value: 0.0
→ Significant association (p < 0.05)
Cramér's V between 'Gender' and 'treatment': 0.202


## 🔍 Summary Table

| Variable 1   | Variable 2 | Chi²   | p-value  | Cramér’s V | Interpretation                                     |
|--------------|------------|--------|----------|------------|----------------------------------------------------|
| remote_work  | treatment  | 0.69   | 0.4047   | 0.024      | No significant association, negligible strength    |
| tech_company | benefits   | 9.31   | 0.0095   | 0.086      | Statistically significant, but negligible strength |
| coworkers    | supervisor | 480.84 | < 0.0001 | 0.439      | Strong and highly significant association          |
| Gender       | treatment  | 51.17  | < 0.0001 | 0.202      | Moderate and significant association               |
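
A table like the one above can also be assembled in code rather than by hand, which keeps it in sync with the data. The sketch below is one possible approach. `association_summary` and the toy DataFrame are assumptions for illustration; on the real data the call would pass `mental_health_data` and the four column pairs.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def association_summary(data, pairs):
    """Return one row per (col1, col2) pair with chi2, p-value and Cramér's V."""
    rows = []
    for col1, col2 in pairs:
        table = pd.crosstab(data[col1], data[col2])
        chi2, p, dof, _ = stats.chi2_contingency(table)
        n = table.to_numpy().sum()
        v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
        rows.append({
            "Variable 1": col1,
            "Variable 2": col2,
            "Chi2": round(chi2, 2),
            "p-value": round(p, 4),
            "dof": dof,
            "Cramér's V": round(v, 3),
        })
    return pd.DataFrame(rows)

# Made-up, perfectly balanced data for illustration only
toy = pd.DataFrame({
    "remote_work": ["Yes", "No"] * 20,
    "treatment":   ["Yes", "Yes", "No", "No"] * 10,
})

summary = association_summary(toy, [("remote_work", "treatment")])
print(summary)
```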


## 🔍 Interpretation of Chi-Square and Cramér’s V Results

### 1️⃣ `remote_work` vs `treatment`
- 🧮 **Chi²**: 0.69  
- 📊 **p-value**: 0.4047  
- 📏 **Cramér’s V**: 0.024  
- 🧠 **Interpretation**: No significant association; negligible strength.

> The data show no evidence that employees working remotely are **more or less likely** to seek mental health treatment than on-site employees.  
> ❓ **So what?**  
> This suggests that **work location alone** may not be a strong predictor of mental health treatment behavior. Other factors (like company culture or individual awareness) may play a bigger role.

---

### 2️⃣ `tech_company` vs `benefits`
- 🧮 **Chi²**: 9.31  
- 📊 **p-value**: 0.0095  
- 📏 **Cramér’s V**: 0.086  
- 🧠 **Interpretation**: Statistically significant, but negligible strength.

> Responses about mental health benefits **differ slightly** between tech and non-tech companies, but the strength of this relationship is very weak.  
> ❓ **So what?**  
> While the difference exists, it's **not strong enough** to say tech companies are truly better across the board. Organizations outside tech may also offer similar support.
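
Neither chi² nor Cramér's V indicates which group is more likely to give which answer; row-wise proportions from `pd.crosstab(..., normalize="index")` show that directly. The example uses made-up data, not the survey responses.

```python
import pandas as pd

# Made-up data for illustration only (not the survey dataset)
toy = pd.DataFrame({
    "tech_company": ["Yes"] * 10 + ["No"] * 10,
    "benefits": ["Yes"] * 6 + ["No"] * 4 + ["Yes"] * 5 + ["No"] * 5,
})

# Each row sums to 1: share of each benefits answer within a company type
proportions = pd.crosstab(toy["tech_company"], toy["benefits"], normalize="index")
print(proportions.round(2))
```

Reading across a row gives the share of each answer within that group, which is what a statement such as "more likely" refers to.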

---

### 3️⃣ `coworkers` vs `supervisor`
- 🧮 **Chi²**: 480.84  
- 📊 **p-value**: < 0.0001  
- 📏 **Cramér’s V**: 0.439  
- 🧠 **Interpretation**: Strong and highly significant association.

> People who are comfortable discussing mental health with coworkers **tend to** also feel comfortable discussing it with their supervisor.  
> ❓ **So what?**  
> This suggests a shared environment of openness. Promoting **peer-level conversations** may encourage broader trust across the organization.

---

### 4️⃣ `Gender` vs `treatment`
- 🧮 **Chi²**: 51.17  
- 📊 **p-value**: < 0.0001  
- 📏 **Cramér’s V**: 0.202  
- 🧠 **Interpretation**: Moderate and significant association.

> Gender is **moderately associated** with whether someone seeks mental health treatment.  
> ❓ **So what?**  
> This could reflect **cultural, social, or psychological differences** in how men, women, or other gender identities approach mental health. Tailored outreach or education might be needed.
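
To see which cells drive a significant result such as this one, Pearson residuals, $(O - E)/\sqrt{E}$, can be inspected; cells with large absolute residuals depart most from what independence would predict. This is a sketch on made-up counts, and `pearson_residuals` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for each cell."""
    expected = stats.chi2_contingency(table, correction=False)[3]
    return (table - expected) / np.sqrt(expected)

# Made-up counts for illustration only
toy_table = pd.DataFrame(
    [[30, 10], [20, 40]],
    index=["group_a", "group_b"],
    columns=["No", "Yes"],
)

residuals = pearson_residuals(toy_table)
print(residuals.round(2))
```

A positive residual means a cell has more observations than expected under independence; a negative one means fewer.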
