### Analysis

The following analysis uses only the employee survey data which was modified and categorized by the cleaning script b

In [1]:
import pandas as pd
import numpy as np

empl_prevention_df = pd.read_csv('empl_survey_prevention.csv')
empl_prevention_df.columns

Index(['Do you know if your company or institution has a reporting channel to report harassment or sexual harassment at work?',
       'Do you know the policy for the prevention and/or sanction of harassment or sexual harassment at work in your institution or company?',
       'Do you know what the investigation process consists of in cases of harassment or sexual harassment at work in your company or institution?',
       'In the last 12 months, have you received training on prevention of harassment or sexual harassment at work?',
       'Where are cases of harassment or sexual harassment at work reported in your company or institution?',
       'org_id', 'country', 'empl_id'],
      dtype='object')

#### Hypothesis 1: People who have received more training in the past year have better knowledge of the reporting channel

In [2]:

from scipy import stats

col_training = 'In the last 12 months, have you received training on prevention of harassment or sexual harassment at work?'
col_reporting_channel = 'Do you know if your company or institution has a reporting channel to report harassment or sexual harassment at work?'

correlation_coefficient, p_value = stats.pearsonr(empl_prevention_df[col_training], empl_prevention_df[col_reporting_channel])
print(f"Correlation Coefficient: {correlation_coefficient}")
print(f"P-value: {p_value}")

group_1 = empl_prevention_df[empl_prevention_df[col_training] == 1][col_reporting_channel]
group_2 = empl_prevention_df[empl_prevention_df[col_training] == 2][col_reporting_channel]

t_stat, p_value_t = stats.ttest_ind(group_1, group_2)
print(f"T-test p-value: {p_value_t}")

Correlation Coefficient: 0.3876424407540485
P-value: 3.116209266722795e-46
T-test p-value: 1.8137952280018076e-07


#### Results

1. Correlation Coefficient:

The correlation coefficient of approximately 0.39 indicates a moderate positive correlation between 'training received' and 'reporting channel' scores.
This suggests that there is a tendency for individuals who received more training to have a better understanding of the reporting channel.

2. P-values:

The p-value for the correlation coefficient is extremely small (close to 0), indicating strong evidence against the null hypothesis that there is no correlation between the variables.
Similarly, the low p-value from the t-test suggests strong evidence against the null hypothesis that the mean 'reporting channel' scores for individuals who received training once and more than once are equal.


These results provide statistical support for the hypothesis that people who received more training in the past year tend to have better knowledge of the reporting channel. The positive correlation and the significant difference in mean scores between different training groups both support this idea.
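Both variables here are ordinal survey codes, so a rank-based coefficient is a natural cross-check on the Pearson result. The sketch below is illustrative only: it uses synthetic data rather than the survey file, and the 0/1/2 coding and the `knowledge` variable are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic ordinal codes, NOT the survey data:
# 0 = never trained, 1 = trained once, 2 = trained more than once.
rng = np.random.default_rng(0)
training = rng.integers(0, 3, size=200)

# Hypothetical knowledge score that tends to rise with training, clipped to 0..2.
knowledge = np.clip(training + rng.integers(-1, 2, size=200), 0, 2)

# Spearman works on ranks, so it only assumes a monotonic relationship.
rho, p_rho = stats.spearmanr(training, knowledge)
# Pearson is computed on the raw codes, as in the cell above.
r, p_r = stats.pearsonr(training, knowledge)

print(f"Spearman rho: {rho:.3f} (p = {p_rho:.3g})")
print(f"Pearson r:    {r:.3f} (p = {p_r:.3g})")
```

If the two coefficients agree closely on the real columns, the conclusion probably does not hinge on treating the codes as interval data.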


#### Method: correlation and t-test

#### Hypothesis 2: People who have received more training in the past year have better knowledge of the investigation process

In [3]:
col_investigation = 'Do you know what the investigation process consists of in cases of harassment or sexual harassment at work in your company or institution?'

correlation, p_value = stats.pearsonr(empl_prevention_df[col_training], empl_prevention_df[col_investigation])

print(f"Pearson Correlation Coefficient: {correlation}")
print(f"P-value: {p_value}")

Pearson Correlation Coefficient: 0.4340953923665827
P-value: 8.894106579523971e-59


#### Results
A Pearson correlation coefficient of 0.43 suggests a moderate positive linear relationship between the 'training received' and 'investigation process' scores. The extremely low p-value (close to zero) indicates that this correlation is statistically significant.

Therefore, based on this analysis, there seems to be evidence supporting the hypothesis that people who have received more training in the past year tend to have better knowledge of the investigation process.


#### Method: Pearson correlation

#### Hypothesis 3: People who have received more training in the past year have better knowledge of the policies surrounding sexual harassment


In [12]:
from scipy.stats import spearmanr

col_policy = 'Do you know the policy for the prevention and/or sanction of harassment or sexual harassment at work in your institution or company?'

subset_df = empl_prevention_df[[col_training, col_policy]].dropna()
correlation, p_value = stats.pearsonr(subset_df[col_training], subset_df[col_policy])


print(f"Pearson correlation coefficient: {correlation}")
print(f"P-value: {p_value}")



Pearson correlation coefficient: 0.4602230092618429
P-value: 2.971612599007193e-65


#### Results

The Pearson correlation coefficient obtained is approximately 0.46, indicating a moderate positive correlation between the number of trainings received and the knowledge of policies surrounding sexual harassment.

The p-value (2.97e-65) is extremely small, suggesting strong evidence against the null hypothesis. In other words, it indicates that this correlation is statistically significant, implying that the relationship observed between the two variables (training received and knowledge of policies) is unlikely to be due to chance.
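To compare the three knowledge measures on one scale, the coefficients can be squared. For a linear relationship, r² is the share of variance in one variable accounted for by the other. The sketch below only reuses the coefficients printed in the outputs above (rounded to four decimals); the dictionary labels are made up for the example.

```python
# Pearson coefficients copied from the printed outputs above, rounded to four decimals.
pearson_r = {
    "reporting channel": 0.3876,
    "investigation process": 0.4341,
    "policy": 0.4602,
}

# r squared: proportion of variance shared under a linear model.
r_squared = {name: r ** 2 for name, r in pearson_r.items()}

for name, r2 in r_squared.items():
    print(f"{name}: r^2 = {r2:.2f}")
```

On this reading, training frequency accounts for roughly 15% to 21% of the variance in the knowledge scores, so most of the variation is left to other factors.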

#### Method: Pearson correlation

#### Hypothesis 4: People who have received more training in the past year have lower tolerance of sexual harassment

In [5]:
from scipy.stats import f_oneway
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

emp_tol = pd.read_csv('empl_survey_tol.csv')
merged_df = pd.merge(emp_tol, empl_prevention_df, on='empl_id', how='left')

# Split tolerance scores by training code: 0 = never, 1 = once, 2 = more than once
no_training = merged_df[merged_df[col_training] == 0]['tol_perc']
once_training = merged_df[merged_df[col_training] == 1]['tol_perc']
more_than_once_training = merged_df[merged_df[col_training] == 2]['tol_perc']


f_stat, p_value = f_oneway(no_training, once_training, more_than_once_training)

print('no training', no_training.mean())
print('1 training', once_training.mean())
print('>1 trainings', more_than_once_training.mean())

print(f"\nF-statistic: {f_stat}")
print(f"P-value: {p_value}")

# The ANOVA result suggests that at least one group's mean tolerance differs from the others

merged_df['group'] = merged_df[col_training]
tukey_result = pairwise_tukeyhsd(merged_df['tol_perc'], merged_df['group'])
merged_df['group'].unique()

print(tukey_result.summary())

no training 19.45273631840796
1 training 15.518488745980708
>1 trainings 15.304347826086957

F-statistic: 11.439557155844142
P-value: 1.1932936108832393e-05
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     0      1  -3.9342    0.0 -6.0073 -1.8612   True
     0      2  -4.1484 0.0008 -6.8266 -1.4702   True
     1      2  -0.2141  0.978 -2.7141  2.2858  False
----------------------------------------------------


#### Results:

From the results:

The comparisons between group 0 ('No never.') and group 1 ('Yes, once.'), and between group 0 and group 2 ('Yes, more than once.'), show significantly different mean tolerance levels (reject = True). Group 0 has a significantly higher mean tolerance score (19.45) than group 1 (15.52) and group 2 (15.30).

The comparison between group 1 and group 2 does not show a significant difference in mean tolerance levels (reject = False, p-adj = 0.978).

This means that individuals who have never received training have significantly higher mean tolerance levels than those who have received training at least once, which is consistent with the hypothesis that training is associated with lower tolerance. However, receiving training more than once is not associated with a further reduction compared with receiving it once.
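The direction of each difference can be checked directly from the group means printed above. The `meandiff` values in the Tukey table match group2 minus group1, so a negative value means the second group in the pair has the lower mean `tol_perc`. A minimal arithmetic sketch using only the printed means:

```python
# Mean tol_perc per training code, copied from the printed output above.
group_means = {
    0: 19.45273631840796,   # no training
    1: 15.518488745980708,  # one training
    2: 15.304347826086957,  # more than one training
}

# Recompute each pairwise difference as mean(group2) - mean(group1).
meandiffs = {}
for g1, g2 in [(0, 1), (0, 2), (1, 2)]:
    meandiffs[(g1, g2)] = group_means[g2] - group_means[g1]
    print(f"group {g1} vs group {g2}: meandiff = {meandiffs[(g1, g2)]:.4f}")

# group 0 vs group 1: meandiff = -3.9342
# group 0 vs group 2: meandiff = -4.1484
# group 1 vs group 2: meandiff = -0.2141
```

The recomputed values agree with the table, which confirms that the untrained group (code 0) has the highest mean tolerance score of the three.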

#### Method: ANOVA, Tukey's Honestly Significant Difference (HSD) test


#### Hypothesis 5: People who have received training in the past year are more likely to report acts of sexual harassment

In [6]:
from scipy.stats import ttest_ind

report_df = pd.read_csv('empl_survey_reporting.csv')
merged_df = pd.merge(report_df, empl_prevention_df, on='empl_id', how='left')


reporting_with_training = merged_df[merged_df[col_training] > 0]['reporting_score']
reporting_no_training = merged_df[merged_df[col_training] == 0]['reporting_score']

t_stat, p_value = ttest_ind(reporting_with_training, reporting_no_training)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: 0.22789936816614845
P-value: 0.8199527324086381


#### Results:

The t-test results indicate a T-statistic of approximately 0.228 and a p-value of about 0.820.

With such a high p-value (greater than the common significance level of 0.05), there's insufficient evidence to reject the null hypothesis. This suggests that there's no statistically significant difference in reporting scores between individuals who have received training and those who haven't.
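`ttest_ind` assumes equal variances in both groups by default. When group sizes and spreads differ, Welch's variant (`equal_var=False`) is a common robustness check. The sketch below uses synthetic scores, not the survey data, and the group sizes and spreads are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic reporting scores, NOT the survey data.
# Same true mean in both groups, but different sizes and spreads.
with_training = rng.normal(loc=5.0, scale=1.0, size=300)
no_training = rng.normal(loc=5.0, scale=2.0, size=80)

# Student's t-test: pools the two variances (the scipy default).
t_student, p_student = stats.ttest_ind(with_training, no_training)

# Welch's t-test: does not assume equal variances.
t_welch, p_welch = stats.ttest_ind(with_training, no_training, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.3f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3f}")
```

If both variants give a similarly high p-value on the real reporting scores, the non-significant result above is unlikely to be an artifact of the equal-variance assumption.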



#### Hypothesis 6: People who have received more training in the past year have more trust in their organization when dealing with sexual harassment cases

In [78]:
trust_df = pd.read_csv('empl_survey_trust.csv')
merged_df = pd.merge(trust_df, empl_prevention_df, on='empl_id', how='left')

trust_with_training = merged_df[merged_df[col_training] > 0]['total_trust_score']
trust_no_training = merged_df[merged_df[col_training] == 0]['total_trust_score']

t_stat, p_value = ttest_ind(trust_with_training, trust_no_training)

print(f'no training: {trust_no_training.mean()}')
print(f'training: {trust_with_training.mean()}')

print(f"\nT-statistic: {t_stat}")
print(f"P-value: {p_value}")

no training: 12.211442786069652
training: 12.73560517038778

T-statistic: 2.7197263308132196
P-value: 0.0066240463345820805


#### Results

With a p-value of about 0.007, below the typical significance level of 0.05, there is evidence to reject the null hypothesis: trust scores differ significantly between individuals who have received training and those who have not.

The difference in means is small, however (12.74 with training versus 12.21 without, about 0.52 points), so other factors related to trust are worth examining.
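A significant p-value says little about how large a difference is. One common way to express the size is Cohen's d, the difference in means divided by the pooled standard deviation. The helper below is a sketch, not part of the original analysis; `cohens_d` is a hypothetical name and the example uses toy numbers.

```python
import numpy as np


def cohens_d(a, b):
    """Standardized mean difference of two samples using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing values so they do not propagate into the means.
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Toy numbers, NOT the survey data: means 4 and 3, pooled standard deviation 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # prints 0.5
```

Passing the two trust-score series from the cell above to such a function would show whether the roughly half-point gap is small relative to the spread of the scores.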

#### Hypothesis 7: People who trust the company more are more likely to report incidents of sexual harassment

In [11]:
from scipy.stats import pearsonr
trust_df = pd.read_csv('empl_survey_trust.csv')
report_df = pd.read_csv('empl_survey_reporting.csv')
merged_df = pd.merge(report_df, trust_df, on='empl_id', how='left')
merged_df.columns 

trust_scores = merged_df['total_trust_score']
reporting_scores = merged_df['reporting_score']

correlation_coefficient, p_value = pearsonr(trust_scores, reporting_scores)

print(f"Pearson's correlation coefficient: {correlation_coefficient}")
print(f"P-value: {p_value}")

Pearson's correlation coefficient: -0.0581691699037178
P-value: 0.40625167701500914


#### Results

Correlation coefficient: The coefficient is close to zero (-0.058), which indicates a very weak, slightly negative linear relationship between trust scores and reporting scores.

P-value: The p-value of 0.406 is above the commonly chosen significance level of 0.05, so the observed correlation is consistent with random chance. The data do not support the hypothesis that greater trust in the organization is associated with a higher likelihood of reporting sexual harassment.