# Chi-square Assumptions
The following assumptions need to be met in order for the results of the Chi-square test to be trusted.

1. When testing the data, the cells should be frequencies or counts of cases, not percentages. It is okay to convert to percentages after testing the data.
2. The levels (categories) of the variables being tested are mutually exclusive.
3. Each participant contributes to only one cell within the Chi-square table.
4. The groups being tested must be independent.
5. The expected value of each cell should be 5 or greater.
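
To make the first and last assumptions concrete, here is a minimal sketch on a small hypothetical 2×2 table. The counts are invented for illustration and are not taken from the survey data. It computes the expected counts with `scipy.stats.contingency.expected_freq`, checks them against the common rule of thumb of at least 5, and only then converts the raw counts to row percentages for reporting.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical observed counts (assumption, not from the survey):
# rows = two independent groups, columns = outcome "no" / "yes"
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: row_total * col_total / grand_total
expected = expected_freq(observed)
min_expected = expected.min()
all_at_least_5 = bool((expected >= 5).all())

# Percentages are for reporting only; the test itself must use the counts
row_percent = observed / observed.sum(axis=1, keepdims=True) * 100

print(expected)
print("all expected counts >= 5:", all_at_least_5)
print(row_percent.round(1))
```

The remaining assumptions (mutually exclusive levels, one cell per participant, independent groups) depend on how the data were collected and cannot be verified from the table alone.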

In [1]:
import pandas as pd
from scipy import stats

In [3]:
df = pd.read_csv('F:/JupyterML/ML_Practice/datasets/hypothesis/mental-health.csv')

In [4]:
df.head(4)

Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",...,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,Jun-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,Jun-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always
3,1,,,,,,,,,,...,Sometimes,Sometimes,43,male,United Kingdom,,United Kingdom,,Supervisor/Team Lead,Sometimes


In [10]:
df['Do you currently have a mental health disorder?'].head(4)

0     No
1    Yes
2     No
3    Yes
Name: Do you currently have a mental health disorder?, dtype: object

In [11]:
df['Would you have been willing to discuss a mental health issue with your direct supervisor(s)?'].head()

0    Some of my previous employers
1    Some of my previous employers
2                     I don't know
3    Some of my previous employers
4    Some of my previous employers
Name: Would you have been willing to discuss a mental health issue with your direct supervisor(s)?, dtype: object

In [6]:
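def drop_maybe(value):
    # Keep only 'Yes'/'No' answers; anything else (e.g. 'Maybe' or a
    # missing value) becomes None so that crosstab ignores the row
    if isinstance(value, str) and value.lower() in ('yes', 'no'):
        return value
    return None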

In [7]:
df['current_mental_disorder'] = df['Do you currently have a mental health disorder?'].apply(drop_maybe)
df['willing_discuss_mh_supervisor'] = df['Would you have been willing to discuss a mental health issue with your direct supervisor(s)?']



In [8]:
pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])

current_mental_disorder                No  Yes
willing_discuss_mh_supervisor
I don't know                           51   29
No, at none of my previous employers  119  194
Some of my previous employers         237  267
Yes, at all of my previous employers   51   24


In [12]:
crosstab = pd.crosstab(df['willing_discuss_mh_supervisor'], df['current_mental_disorder'])

In [13]:
stats.chi2_contingency(crosstab)

(32.408194625396376,
 4.292859793048239e-07,
 3,
 array([[ 37.69547325,  42.30452675],
        [147.48353909, 165.51646091],
        [237.48148148, 266.51851852],
        [ 35.33950617,  39.66049383]]))

In [2]:
x= 4.292859793048239e-07
print('%.08f' % x)

0.00000043


The first value (32.408) is the Chi-square statistic, followed by the p-value (4.29e-07), the degrees of freedom (3), and finally the expected frequencies as an array. Since all of the expected frequencies are greater than 5, the Chi-square test results can be trusted. Because the p-value is less than 0.05, we can reject the null hypothesis that the two variables are independent.
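
As a cross-check, the test can be reproduced directly from the observed counts shown in the crosstab output, without the original CSV. The sketch below hard-codes those counts, unpacks the four returned values into named variables, and confirms that the degrees of freedom equal (rows − 1) × (columns − 1). The variable names are illustrative choices.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab output
# rows: "I don't know", "No, at none ...", "Some ...", "Yes, at all ..."
# columns: current_mental_disorder = No, Yes
observed = np.array([
    [51, 29],
    [119, 194],
    [237, 267],
    [51, 24],
])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
n_rows, n_cols = observed.shape
assert dof == (n_rows - 1) * (n_cols - 1)

print(f"chi2 = {chi2:.3f}, p = {p:.2e}, dof = {dof}")
print("smallest expected count:", round(expected.min(), 2))
print("reject H0 at alpha = 0.05:", p < 0.05)
```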

# Chi-square post hoc testing
Now that we know our Chi-square test of independence is significant, we want to find out which levels of the variables drive the relationship. To do this, we conduct multiple 2×2 Chi-square tests and judge each one against a Bonferroni-adjusted significance threshold.

The Bonferroni method adjusts the significance threshold by the number of planned pairwise comparisons being conducted. The formula is α/N, where α is the original significance level (0.05 here) and N is the number of planned pairwise comparisons.
In our example, if we were planning on conducting all possible pairwise comparisons, the threshold would be 0.05/6 ≈ 0.008, meaning a post hoc 2×2 Chi-square test would need a p-value less than 0.008 to be significant. However, we are not interested in the “I don’t know” category of the “willing_discuss_mh_supervisor” variable, so the threshold becomes 0.05/3 ≈ 0.017. For our planned comparisons to be significant, the p-value must therefore be less than 0.017.
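
The arithmetic above can be sketched in a few lines. The number of possible pairwise comparisons among k levels is "k choose 2", which gives 6 for four levels and 3 once “I don’t know” is excluded. The variable names are illustrative.

```python
from math import comb

alpha = 0.05            # original significance level
n_levels_all = 4        # all levels of willing_discuss_mh_supervisor
n_levels_planned = 3    # after excluding "I don't know"

# Number of pairwise comparisons among k levels: k choose 2
n_all = comb(n_levels_all, 2)
n_planned = comb(n_levels_planned, 2)

# Bonferroni-adjusted thresholds
alpha_all = alpha / n_all
alpha_planned = alpha / n_planned

print(f"{n_all} comparisons -> threshold {alpha_all:.3f}")
print(f"{n_planned} comparisons -> threshold {alpha_planned:.3f}")
```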

In [19]:
dummies = pd.get_dummies(df['willing_discuss_mh_supervisor'])


In [20]:
dummies.head(5)

Unnamed: 0,I don't know,"No, at none of my previous employers",Some of my previous employers,"Yes, at all of my previous employers"
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0


In [21]:
dummies.drop(columns=["I don't know"], inplace=True)

In [22]:
dummies.head(5)

Unnamed: 0,"No, at none of my previous employers",Some of my previous employers,"Yes, at all of my previous employers"
0,0,1,0
1,0,1,0
2,0,0,0
3,0,1,0
4,0,1,0


In [30]:
nl = "\n"
for column in dummies:
    # 2x2 table: this response level (1) versus all other rows (0)
    crosstab = pd.crosstab(dummies[column], df['current_mental_disorder'])
    print(crosstab, nl)
    chi2, p, dof, expected = stats.chi2_contingency(crosstab)
    print(f"Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")

current_mental_disorder                No  Yes
No, at none of my previous employers          
0                                     412  381
1                                     119  194 

Chi2 value= 16.906443844118506
p-value= 3.926805158610076e-05
Degrees of freedom= 1

current_mental_disorder         No  Yes
Some of my previous employers          
0                              294  308
1                              237  267 

Chi2 value= 0.2924156694554503
p-value= 0.5886766550070441
Degrees of freedom= 1

current_mental_disorder                No  Yes
Yes, at all of my previous employers          
0                                     480  551
1                                      51   24 

Chi2 value= 12.034595567813462
p-value= 0.0005222216393205276
Degrees of freedom= 1
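
To read these results against the Bonferroni-adjusted threshold of 0.05/3, the p-values printed above can be compared directly. The sketch below hard-codes those three p-values; each one comes from a test of one response level against all other rows combined, as in the loop above. The dictionary and variable names are illustrative.

```python
alpha = 0.05
n_tests = 3
threshold = alpha / n_tests  # Bonferroni-adjusted threshold, about 0.017

# p-values copied from the post hoc output above
p_values = {
    "No, at none of my previous employers": 3.926805158610076e-05,
    "Some of my previous employers": 0.5886766550070441,
    "Yes, at all of my previous employers": 0.0005222216393205276,
}

# A test counts as significant only if p is below the adjusted threshold
significant = {level: p < threshold for level, p in p_values.items()}

for level, is_significant in significant.items():
    print(f"{level}: p = {p_values[level]:.6f}, significant = {is_significant}")
```

Under this threshold, the “No, at none of my previous employers” and “Yes, at all of my previous employers” levels differ significantly from the remaining responses, while “Some of my previous employers” does not.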

