# Tech Blues Capstone
## First Draft Notebook

In [None]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from scipy import stats
import wrangle
import explore

## Project Overview

For our capstone project, we will download the ‘Mental Health in Tech Survey’ .csv file from [Kaggle](https://www.kaggle.com/osmi/mental-health-in-tech-survey). Once we have the file, we will select the variables related to mental health that could interfere with work (benefits, family history, gender, etc.). We will then follow the steps of the data science pipeline to set up the information for our slide presentation.

## Project Planning

## Data Acquisition

In [None]:
# use our 'get_survey_data' function to bring in the data
df = wrangle.get_survey_data('survey.csv')

In [None]:
# use the summarize function to see the head of the dataframe, datatypes, number of null values, stats, and value_counts
wrangle.summarize(df)

------

In [None]:
# checking to see if there are any duplicate rows
test_df = df.duplicated()
test_df.value_counts()
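
The duplicate check above returns a boolean Series. As a minimal sketch on a made-up frame (the column values below are illustrative, not survey data), the same check can be reduced to a single count, and `drop_duplicates` would remove any repeats if they existed:

```python
import pandas as pd

# Made-up frame with one repeated row, for illustration only
demo = pd.DataFrame({"age": [30, 30, 41], "gender": ["male", "male", "female"]})

# duplicated() flags rows that are identical to an earlier row
n_duplicates = int(demo.duplicated().sum())
deduped = demo.drop_duplicates()

print(f"{n_duplicates} duplicate row(s); {len(deduped)} rows after dropping them")
```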

### Acquire Key Findings, Takeaways, and Next Steps:
- A good number of nulls to drop or fill in
- Need to decide which object columns to convert to a numeric datatype, which to drop, and which to use in modeling
- Columns to encode to numeric datatype: gender, Country, self_employed, family_history, treatment, work_interfere, no_employees, remote_work, tech_company, benefits, care_options, wellness_program, seek_help, anonymity, leave, mental_health_consequence, phys_health_consequence, coworkers, supervisor, mental_health_interview, phys_health_interview, mental_vs_physical, obs_consequence
- Drop unnecessary columns: state and comments
- Since there are no unique identifiers for each observation, we cannot definitively say that each observation is a different person
- We can, however, say that there are no duplicate entries in this data set

------

## Data Preparation

In [None]:
# initial prep for object data
strings_df = wrangle.prep_the_strings(df)
strings_df.info()

In [None]:
# initial prep for encoding objects into integers
encoded_df = wrangle.prep_encode(strings_df)
encoded_df.info()

**To prep this data before exploration, the following was done:**

|   Feature      | Description    | Encoding |
| :------------- | ----------- | -----------: |
| timestamp	|  Time survey was submitted | - |
| age	| Respondent age  | - |
| gender	| Respondent gender | male:0, female:1, other:2 |
| country | Respondent country | Only kept North America and Europe |
| self_employed	 | Whether or not they were self employed | No:0, Yes:1 |
| family_history	| Whether or not they have a family history of mental illness | No:0, Yes:1 |
| treatment	 |  Whether or not they have sought treatment  | No:0, Yes:1 |
|  work_interfere	  |  If the person felt that the mental condition interfered with work | Never:0, Rarely:1, Sometimes:2, Often:3, NA:4 |
| no_employees	| The number of employees in the company or organization  | <5:0, 6-25:1, 26-100:2, 101-500:3, 501-1000:4, >1000:5 |
| remote_work	 | Having remote work (outside of an office) at least 50% of the time | No:0, Yes:1 |
| tech_company	| The employer is primarily a tech company/organization | No:0, Yes:1 |
| benefits  |	Providing mental health benefits by the employer | No:0, Yes:1, Don't know:2 |
| care_options |	Providing options for mental health care by the employer | No:0, Yes:1, Not sure:2 |
| wellness_program | Discussion about mental health as part of an employee wellness program by the employer | No:0, Yes:1, Don't know:2 |
| seek_help	| Provided resources by the employer to learn more about mental health issues and how to seek help | No:0, Yes:1, Don't know:2 |
| anonymity |	Protecting anonymity if you choose to take advantage of mental health or substance abuse treatment resources | No:0, Yes:1, Don't know:2 |
| leave  |	How easy is it for you to take medical leave for a mental health condition? | Very difficult:0, Somewhat difficult:1, Don't know:2, Somewhat easy:3, Very easy:4 |
| mental_health_consequence | Having negative consequences caused by discussing a mental health issue with your employer | No:0, Yes:1, Maybe:2 |
| phys_health_consequence | Having negative consequences caused by discussing a physical health issue with your employer | No:0, Yes:1, Maybe:2 |
| coworkers |	Would you be willing to discuss a mental health issue with your coworkers? | No:0, Yes:1, Some of them:2 |
| supervisor	| Would you be willing to discuss a mental health issue with your direct supervisor(s)? | No:0, Yes:1, Some of them:2 |
| mental_health_interview  |	Would you bring up a mental health issue with a potential employer in an interview?  | No:0, Yes:1, Maybe:2 |
| phys_health_interview |	Would you bring up a physical health issue with a potential employer in an interview?  | No:0, Yes:1, Maybe:2 |
| mental_vs_physical |	Do you feel that your employer takes mental health as seriously as physical health? | No:0, Yes:1, Don't know:2 |
|  obs_consequence  |  Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?  | No:0, Yes:1 |
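
The `wrangle.prep_encode` function is not shown in this notebook. As a minimal sketch, assuming the encoding is a plain dictionary lookup, the mappings in the table could be applied with `Series.map`. The two mappings are copied from the table; the sample responses and the variable names are made up for illustration.

```python
import pandas as pd

# Mappings copied from the encoding table; sample responses are made up
leave_map = {"Very difficult": 0, "Somewhat difficult": 1, "Don't know": 2,
             "Somewhat easy": 3, "Very easy": 4}
yes_no_map = {"No": 0, "Yes": 1}

demo = pd.DataFrame({
    "leave": ["Very easy", "Don't know", "Somewhat difficult"],
    "treatment": ["Yes", "No", "Yes"],
})

# map() replaces each label with its integer code; unmapped labels become NaN
encoded_demo = demo.assign(
    leave=demo["leave"].map(leave_map),
    treatment=demo["treatment"].map(yes_no_map),
)
print(encoded_demo)
```

Because unmapped labels turn into `NaN`, a quick `isna().sum()` after mapping is a cheap way to catch a response value that the dictionary missed.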

------

### Data Preparation Key Findings, Takeaways, and Next Steps:

- Chose to include only North America and Europe in our dataset: economic conditions in these two regions are similar, which gives us a more controlled comparison with less bias in the data.
- The data had a moderate number of nulls that needed to be filled in or dropped.
- What we used to fill nulls depended on the column we were dealing with, as you can see in the cell above.
- Initially decided to not one hot encode: will decide which columns to one hot encode once we find what features are drivers and what features are not.

------

## Data Exploration
### 1st Iteration
The first iteration will be performed on encoded data, before deciding which variables to one-hot encode before the second iteration of data exploration.

### Univariate

In [None]:
# first let's split the data...

# encoded data
encoded_train, encoded_validate, encoded_test = explore.three_split(encoded_df, 'work_interfere')
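
The `explore.three_split` helper is not shown here. A minimal sketch of a three-way split, assuming it stratifies on the target and uses two passes of `train_test_split`, might look like the following. The function name, the split proportions, the seed, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_split_sketch(df, target, seed=123):
    """Hypothetical stand-in for explore.three_split: stratified train/validate/test."""
    # First carve off 20% as the test set
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target])
    # Then split the remainder 70/30 into train and validate
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed,
        stratify=train_validate[target])
    return train, validate, test

# Synthetic data with a 60/40 target, for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=100),
                     "work_interfere": [1] * 60 + [0] * 40})

train, validate, test = three_split_sketch(demo, "work_interfere")
print(len(train), len(validate), len(test))
```

Stratifying keeps the class balance of the target roughly the same in each subset, which matters when the target is imbalanced.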

In [None]:
# countplots of categorical variables
# histograms and boxplots of continuous variables
explore.mental_health_univariate(encoded_df)

#### Univariate 1st Iteration - Key Findings, Takeaways, and Next Steps:
- `gender` is heavily imbalanced, with 70% of respondents being male. What kind of impact will this have? Are men more/less likely to seek treatment and/or have mental health issues that lead to workplace interference?
- `self_employed` is also heavily imbalanced, with only 10% reporting being self-employed. Does this group have more/less issues than those working for others?
- `family_history` is 60% no history, 40% history. Interesting to see so many showing a history of mental illness (could this be a potential driver?).
- `treatment` is almost evenly split. Will be very interested to see if this is a driver - Does receiving treatment lead to more or less interference?
- `work_interfere`, our target variable, is split 60-40; we will need to explore over/under-sampling methods to improve model accuracy.
- `company_size` has multiple peaks and valleys. It would be nice if the data wasn't already binned, so we could possibly bin differently. Also, clustering might play a role in dealing with company size. It appears there will be a relationship with our target, but what it is, is currently unclear.
- `remote_work` is roughly 70-30, with most people working in office. Because this data is pre-covid, it would be really nice to gather additional data during/post-covid to see what changes have occurred.
- `tech_company` represents 80% of our observations, with approx 20% not working in tech. Unclear at this time if we have enough data to make a good comparison between the two groups.
- `benefits` is roughly normally distributed with about half of all observations receiving benefits, one-fifth not receiving any, and one-third unsure if they are available. Really interested to learn more about the 'unsure' group. How do they not know? Are they going to stand out compared to the other two groups?
- `care_options` are almost uniform in distribution. 'Yes' and 'No' each receive a little over one-third of all responses, and 'Don't know' receives just under one-third. Again, we are very interested in the 'Don't know' group. Does it not matter if care is available because they do not have mental health issues? Or, is this a possible sign of a group not receiving preventative measures that could have a potentially large impact?
- `wellness_program` has 60% not having a wellness program, and 20% each either having one, or unsure. Would love to see how this relates to tech vs non-tech jobs, and again, if 'Don't know' is a driver, or just noise.
- `seek_help` has 40% not receiving help/resources from their company, and 30% receiving help, and 20% unsure.
- `anonymity` shows an overwhelming majority of respondents who are unsure if they would be able to remain anonymous.
- `leave` is roughly normally distributed, however most responses are unsure how difficult it would be to take leave due to a mental health issue.
- `mental_health_consequence` shows most either do not think there would be consequences, or are unsure.
- `phys_health_consequence` shows a stark contrast to mental...75% have no fear, 5% worry about consequences, and 20% are unsure. We are really interested to see what leads to these differences.
- `coworkers` shows an overwhelming majority are unsure if they would be comfortable speaking with coworkers about personal mental health issues.
- `supervisor` shows another stark contrast where most do feel comfortable speaking with a supervisor about mental health issues, even though they are unsure about speaking with coworkers.
- `mental_health_interview` shows that the overwhelming majority do not feel comfortable bringing up mental health issues in an interview.
- `phys_health_interview` shows that roughly half as many people would be afraid to bring up a physical health issue as a mental one.
- `mental_vs_physical` shows an equal amount of yes and no responses, with a larger portion of unsure. These groups definitely need to be looked into.
- `obs_consequence` shows 90% of observations have not heard of any consequences from coworkers sharing mental health issues.
- `age` is roughly normally distributed around a mean of 32, but has a tail on the upper end. It will be interesting to see if we need to bin this data, and how it relates to other variables.

### Bivariate

In [None]:
encoded_bi_metrics = explore.mental_health_bivariate(encoded_train, 'work_interfere')

#### Bivariate 1st Iteration - Key Findings, Takeaways, and Next Steps:

**Possible Strong Drivers (p-value <= 0.05, chi2 >)** 
- gender, family history, treatment, benefits, care options, wellness program, leave, mental health consequence, phys health consequence, supervisor, mental health interview, mental vs physical, obs consequence
- These variables all show 

**Other Observations**
- Several variables include a "don't know" response, and in every case it is much lower than yes or no. Our theory is that people who need help or resources have probably already looked into what is available because it was impacting their work performance.
- Companies with more than 1000 employees seem to have the least impact to work compared to companies of other sizes.
- People who feel that their employer does not take mental health as seriously as physical health report higher rates of impact to work.
- Feeling comfortable talking to a supervisor is also associated with less impact to work performance.
- Perceiving that one would be punished for mental health issues is associated with a higher rate of impact to work performance.
- Observing negative consequences for others in the company is associated with more impact to work performance.
- The easier it is to get leave, the less impact to work performance there is.

**Worth Exploring Further**
- company size

### 2nd Iteration
The second iteration will be performed on the one-hot encoded data, before deciding what other feature engineering we would like to perform before the third iteration of data exploration.

In [None]:
# one-hot encode the data
hot_df = explore.one_hot(encoded_df)

In [None]:
# split the one-hot encoded data
hot_train, hot_validate, hot_test = explore.three_split(hot_df, 'work_interfere')

In [None]:
# univariate exploration on one-hot encoded data

explore.mental_health_univariate(hot_df)

#### Univariate 2nd Iteration - Key Findings, Takeaways, and Next Steps:
**Observations**
- age has outliers: we will need to decide whether to keep age and drop the outliers, keep age without dropping the outliers, or not use the age column at all.
- Respondents who do not know how easy it is for them to take medical leave for a mental health condition account for at least half of the respondents in the leave column.
- Respondents who are not provided resources by the employer to learn more about mental health issues and how to seek help account for at least half of the respondents in the seek_help column.
- Respondents who can talk to some of their coworkers account for at least half of the respondents in the coworkers column.
- Respondents who are unsure whether their employer takes mental health as seriously as physical health account for at least half of the respondents in the mental_vs_physical column.

### Bivariate

In [None]:
hot_bi_metrics = explore.mental_health_bivariate(hot_train, 'work_interfere')

#### Bivariate 2nd Iteration - Key Findings, Takeaways, and Next Steps:

**Possible Strong Drivers (p-value <= 0.05, chi2 >)** 
- Top 5: 
    - Respondents who would not, or would only maybe, bring up a mental health issue with a potential employer in an interview.
    - Respondents who know about their care options for mental health care
    - Respondents who feel comfortable speaking to their supervisor
    - Respondents who are provided with mental health benefits
    - Respondents who think that discussing a mental health issue with your employer would have negative consequences.

### Multivariate

#### Multivariate - Key Findings, Takeaways, and Next Steps:

### Hypothesis Testing
#### Target Variable: 'work_interfere'

#### Hypothesis 1: 'Supervisor'
- alpha : 0.05
- ${H_0}$: Workplace interference is independent of whether respondents feel comfortable speaking with their supervisor about mental health issues.
- ${H_a}$: Workplace interference is not independent of whether respondents feel comfortable speaking with their supervisor about mental health issues.

In [None]:
# Here is the work for hypothesis 1
explore.ty_chi(encoded_train, 'work_interfere', 'supervisor')

#### Hypothesis 1 - Key Findings, Takeaways, and Next Steps:
- 'Supervisor'
- Since the p-value is less than alpha, we can reject the null hypothesis. There is evidence to suggest a relationship between an employee feeling comfortable speaking with a supervisor about personal mental health issues and work interference.

#### Hypothesis 2: Does having benefits affect whether or not you seek treatment, and does that affect work interference?
- alpha : 0.05
- ${H_0}$: There is no association between having benefits and whether or not treatment is sought.
- ${H_a}$: There is an association between having benefits and whether or not treatment is sought.

In [None]:
explore.plot_chi(encoded_train, 'benefits', 'work_interfere', 'treatment' )

#### Hypothesis 2 - Key Findings, Takeaways, and Next Steps:
- [insert hypothesis here]
- [reject or fail to reject the null]

#### Hypothesis 3: If you have observed negative consequences for coworkers with mental health conditions, are you less willing to talk to your supervisor, and does this interfere with your work performance?
- alpha : 0.05
- ${H_0}$: There is no association between observing negative consequences for coworkers with mental health conditions and willingness to talk to a supervisor.
- ${H_a}$: There is an association between observing negative consequences for coworkers with mental health conditions and willingness to talk to a supervisor.

In [None]:
explore.plot_chi(encoded_train, 'supervisor', 'work_interfere', 'obs_consequence')

#### Hypothesis 3 - Key Findings, Takeaways, and Next Steps:
- [insert hypothesis here]
- [reject or fail to reject the null]

#### Hypothesis 4: If you believe speaking about mental health has negative consequences, does whether or not you have sought treatment affect work interference?
- alpha : 0.05
- ${H_0}$: Among those who believe speaking about mental health has negative consequences, whether or not treatment was sought has no effect on work interference.
- ${H_a}$: Among those who believe speaking about mental health has negative consequences, whether or not treatment was sought has an effect on work interference.

In [None]:
explore.plot_chi(encoded_train, 'mental_health_consequence', 'work_interfere', 'treatment')

#### Hypothesis 5: 'Supervisor'
- alpha : 0.05
- ${H_0}$: Workplace interference is independent of whether respondents feel comfortable speaking with their supervisor about mental health issues.
- ${H_a}$: Workplace interference is not independent of whether respondents feel comfortable speaking with their supervisor about mental health issues.

In [None]:
# Here is the work for hypothesis 5 (chi-square test built by hand)

observed = pd.crosstab(encoded_train.supervisor, encoded_train.work_interfere)

In [None]:
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

#### Hypothesis 5 - Key Findings, Takeaways, and Next Steps:
- Due to our p-value being less than alpha, we reject the null hypothesis.
- There is evidence to suggest a relationship between feeling comfortable speaking with a supervisor about personal mental health issues and our target variable, 'work_interfere'

#### Hypothesis 6: controlling for `gender`, how does `supervisor` relate to `work_interfere`
- alpha : 0.05
- ${H_0}$: When controlling for gender, the rate of work interference is the same among all responses to supervisor
- ${H_a}$: When controlling for gender, the rate of work interference is different among each response to supervisor

In [None]:
explore.three_chi(encoded_train, 'gender', 'work_interfere', 'supervisor')

#### Hypothesis 6: Takeaways from `supervisor` and `work_interfere` when controlling for `gender`
- Men who feel comfortable speaking about mental health issues with a supervisor have workplace interference at a significantly lower rate than those who either feel uncomfortable, or do not know.
- For women, it surprisingly does not seem to matter how they responded to the 'supervisor' question
- There is not enough data for gender=other to have actionable insight
- We recommend that companies work to improve communication between management and staff, as there is clear evidence that it greatly helps reduce the rate of workplace interference amongst men, and does not harm anyone else.
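
The `explore.three_chi` helper is not shown in this notebook. One way to "control for" a third variable, sketched here under the assumption that the helper runs a separate chi-square test of independence within each level of the control variable, is a simple `groupby` loop. The function name and the synthetic counts below are hypothetical.

```python
import pandas as pd
from scipy import stats

def chi2_by_group_sketch(df, control, feature, target, alpha=0.05):
    """Chi-square test of feature vs. target within each level of control."""
    rows = []
    for level, subset in df.groupby(control):
        observed = pd.crosstab(subset[feature], subset[target])
        chi2, p, dof, _ = stats.chi2_contingency(observed)
        rows.append({control: level, "chi2": chi2, "p": p,
                     "dof": dof, "reject_null": p < alpha})
    return pd.DataFrame(rows)

def cell(gender, supervisor, interfere, n):
    """Build n identical synthetic rows (illustration only)."""
    return pd.DataFrame({"gender": [gender] * n,
                         "supervisor": [supervisor] * n,
                         "work_interfere": [interfere] * n})

# gender 0: strong association; gender 1: no association at all
demo = pd.concat([
    cell(0, 0, 1, 40), cell(0, 0, 0, 10), cell(0, 1, 1, 10), cell(0, 1, 0, 40),
    cell(1, 0, 1, 25), cell(1, 0, 0, 25), cell(1, 1, 1, 25), cell(1, 1, 0, 25),
], ignore_index=True)

result = chi2_by_group_sketch(demo, "gender", "supervisor", "work_interfere")
print(result)
```

Running several tests at once raises the chance of a false positive, so a stricter alpha per group may be worth considering.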

#### Hypothesis 7: controlling for `gender`, how does `mental_vs_physical` relate to `work_interfere`
- alpha : 0.05
- ${H_0}$: When controlling for gender, the rate of work interference is the same among all responses to mental_vs_physical
- ${H_a}$: When controlling for gender, the rate of work interference is different among each response to mental_vs_physical

In [None]:
explore.plot_chi(encoded_train, 'gender', 'work_interfere', 'mental_vs_physical')

#### Hypothesis 7 - Key Findings, Takeaways, and Next Steps:
- Men who feel that their company takes mental health as seriously as physical health have work interference at a significantly lower rate than those who do not, or do not know.
- Women who feel that their company takes mental health as seriously as physical health have work interference at a lower rate than those who do not, or do not know.
- Once again, we do not have enough data where gender = other to have actionable insight.

#### Hypothesis 8: controlling for `gender`, how does `anonymity` relate to `work_interfere`
- alpha : 0.05
- ${H_0}$: When controlling for gender, the rate of work interference is the same among all responses to anonymity
- ${H_a}$: When controlling for gender, the rate of work interference is different among each response to anonymity

In [None]:
explore.plot_chi(encoded_train, 'gender', 'work_interfere', 'anonymity')

#### Hypothesis 9: [insert hypothesis here]
- alpha : 0.05
- ${H_0}$: [insert null hypothesis here]
- ${H_a}$: [insert alternative hypothesis here]

#### Hypothesis 10: [insert hypothesis here]
- alpha : 0.05
- ${H_0}$: [insert null hypothesis here]
- ${H_a}$: [insert alternative hypothesis here]

------

### Explore Key Findings, Takeaways, and Next Steps:

------

## Modeling

### Initial Setup

In [None]:
# baseline model: always predict class 1 (work interference), the majority class
encoded_df['baseline'] = 1
baseline_accuracy = (encoded_df.baseline == encoded_df.work_interfere).mean()
print(f'Baseline accuracy is {baseline_accuracy:.2%}')
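
The cell above hard-codes the baseline prediction as 1. A minimal sketch that derives the majority class from the data instead, using a made-up target with a 60/40 split, could look like this:

```python
import pandas as pd

# Made-up binary target with a 60/40 split, for illustration only
y = pd.Series([1] * 6 + [0] * 4, name="work_interfere")

# The baseline always predicts the most frequent class
majority_class = y.mode()[0]
baseline_accuracy = (y == majority_class).mean()

print(f"Baseline predicts {majority_class}; accuracy is {baseline_accuracy:.2%}")
```

Computing the baseline on the training split only, rather than on the full dataframe, would keep it consistent with how the models are evaluated.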

In [None]:
# one-hot encode the multi-category variables (chi2 degrees of freedom > 1) among the top 9 features ranked by p-value
dum_df = pd.get_dummies(data = encoded_df, columns = ['mental_health_interview','care_options','supervisor',
                                'mental_health_consequence','leave', 'benefits','gender'], drop_first = True)

In [None]:
# drop the unnecessary columns
dum_df = dum_df.drop(columns = ['age', 'self_employed', 'company_size', 'remote_work', 
                               'tech_company', 'wellness_program','seek_help', 
                               'anonymity', 'phys_health_consequence', 'coworkers',
                               'phys_health_interview', 'mental_vs_physical', 'obs_consequence', 
                               'baseline', 'timestamp', 'country'])

In [None]:
from imblearn import over_sampling
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import explore, evaluate
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from sklearn.neural_network import MLPClassifier

In [None]:
# set up the random over sampler for class imbalance
ros = over_sampling.RandomOverSampler(random_state = 123)

In [None]:
# do the splits
X_train, y_train, X_validate, y_validate, X_test, y_test = explore.full_split(dum_df, 'work_interfere')

In [None]:
# create resampled data for training set
X_res, y_res = ros.fit_resample(X_train, y_train)
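
To make clear what random oversampling does to the training set, here is a pandas-only sketch of the idea: minority-class rows are resampled with replacement until every class matches the majority count. This illustrates the concept and is not imblearn's implementation; the function name and the toy data are assumptions.

```python
import pandas as pd

def random_oversample_sketch(X, y, seed=123):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    counts = y.value_counts()
    target_count = counts.max()
    pieces, labels = [X], [y]
    for label, count in counts.items():
        if count < target_count:
            # Draw the missing rows from this class, with replacement
            extra_idx = y[y == label].sample(
                n=target_count - count, replace=True, random_state=seed).index
            pieces.append(X.loc[extra_idx])
            labels.append(y.loc[extra_idx])
    return (pd.concat(pieces, ignore_index=True),
            pd.concat(labels, ignore_index=True))

# Toy data: 7 positive rows and 3 negative rows
X_demo = pd.DataFrame({"feature": range(10)})
y_demo = pd.Series([1] * 7 + [0] * 3, name="work_interfere")

X_demo_res, y_demo_res = random_oversample_sketch(X_demo, y_demo)
print(y_demo_res.value_counts())
```

Only the training set is resampled; validate and test keep their original class balance so the reported metrics reflect realistic data.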

In [None]:
X_res.columns

### Model 1: Decision Tree

In [None]:
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=3, min_samples_split=3).fit(X_res,y_res)

In [None]:
evaluate.run_metrics(X_res, y_res, tree, 'Train')

In [None]:
evaluate.run_metrics(X_validate, y_validate, tree, 'Validation')
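
The `evaluate.run_metrics` helper is not shown in this notebook. A minimal sketch of what such a helper might report, assuming it prints accuracy, a confusion matrix, and a classification report from scikit-learn, is below. The function name, its return value, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

def run_metrics_sketch(X, y, model, label):
    """Hypothetical stand-in for evaluate.run_metrics."""
    predictions = model.predict(X)
    accuracy = accuracy_score(y, predictions)
    print(f"{label} accuracy: {accuracy:.2%}")
    print(confusion_matrix(y, predictions))       # rows: actual, columns: predicted
    print(classification_report(y, predictions))  # precision, recall, f1 per class
    return accuracy

# Synthetic data where the target depends only on the first feature
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_demo, y_demo)
demo_accuracy = run_metrics_sketch(X_demo, y_demo, demo_tree, "Train")
```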

#### Model 1: Decision Tree - Key Findings, Takeaways, and Next Steps:
 - Key Findings:
     - Model performed well, beating our baseline of 63% by 19 percentage points
     - Model is mildly overfit
     - Model is able to predict cases of work interference three quarters of the time
     - Focusing on f-score, the model performs better at predicting cases of work interference
 - Next Steps:
     - Random forest classifier could help generalize more and not be as overfit. The randomness intrinsic to the ensemble model would be more robust in handling out of sample data

### Model 2: Random Forest Classifier

In [None]:
rfc = RandomForestClassifier(max_depth=8, min_samples_leaf=3, min_samples_split=10,
                       random_state=123).fit(X_res, y_res)

In [None]:
evaluate.run_metrics(X_res, y_res, rfc, 'Train')

In [None]:
evaluate.run_metrics(X_validate, y_validate, rfc, 'Validate')

#### Model 2: Random Forest Classifier - Key Findings, Takeaways, and Next Steps:

- Key Findings:
     - Model performed well, beating our baseline of 63% by 20 percentage points
     - Model is mildly overfit
     - Model is able to predict cases of work interference 77% of the time
     - Focusing on f-score, the model performs better at predicting cases of work interference compared to the decision tree. 
 - Next Steps:
     - A boosted tree-based model could be even more powerful, but overfitting is more of an issue. We will use XGBoost as a model next

### Model 3: XGBoost

In [None]:
xgbc = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.4, eta=0.25, gamma=0.1,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              learning_rate=0.25, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=None, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None).fit(X_res, y_res)

In [None]:
evaluate.run_metrics(X_res, y_res, xgbc, 'Train')

In [None]:
evaluate.run_metrics(X_validate, y_validate, xgbc, 'Validate')

#### Model 3: XGBoost - Key Findings, Takeaways, and Next Steps:

- Key Findings:
    - This model also outperforms the baseline, by 18 percentage points according to accuracy
    - It is very overfit despite hyperparameter tuning with grid search
    - Has the lowest false negative rate compared to the non-boosted tree models
    - Performs well with the largest class, but seems undertrained on the underrepresented class when using the validate dataset
    - Predicts true positives better than the other models
- Next Steps: 
    - Let's try a neural net next. Since it utilizes backpropagation, it will determine feature weights for us. We will just need to use ELI5 to implement a permutation importance analysis to determine drivers.
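
The next steps mention ELI5 for permutation importance. As a hedged sketch of the same idea, the example below uses scikit-learn's own `permutation_importance` instead of ELI5: each feature is shuffled in turn and the drop in score is recorded. The synthetic data and variable names are assumptions.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

# Synthetic data where only feature 0 carries signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_mlp = MLPClassifier(random_state=123, max_iter=500).fit(X_demo, y_demo)

# Shuffle each column 10 times and measure the average drop in F1
result = permutation_importance(
    demo_mlp, X_demo, y_demo, scoring="f1", n_repeats=10, random_state=123)

ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Permutation importance is model-agnostic, so the same call works for the tree-based models, which makes the drivers comparable across estimators.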

### Model 4: Multi-layer Perceptron Classifier

In [None]:
mlp = MLPClassifier(random_state=123, max_iter=300).fit(X_res, y_res)

In [None]:
evaluate.run_metrics(X_res, y_res, mlp, 'train')

In [None]:
evaluate.run_metrics(X_validate, y_validate, mlp, 'Validate')

#### Model 4: Multi-layer Perceptron - Key Findings, Takeaways, and Next Steps:

- Key Findings:
    - Model beat the baseline by 18 percentage points
    - It is relatively overfit despite hyperparameter tuning with cross validation
    - Has the highest true positive rate out of all models
    - Has a very strong f1 score for positive cases, but is weak on the underrepresented class
- Next Steps: 
    - Identify the target metric and what would provide the most value for our project goals. 
    - Fit best model on test set

------

### Modeling Key Findings, Takeaways, and Next Steps:

- Key Findings:
    - Best feature performance was with the following features:

    ```mental_health_interview, mental_health_consequences, care_options, supervisor, benefits, gender, leave, family_history, treatment```

    - Conceptually, we want to build a model that predicts cases of work interference. We want to maximize true positives and minimize false negatives (Type II error). F1 score is the harmonic mean of precision and recall, so it penalizes false negatives and false positives while ignoring true negatives. Since we want a model that predicts work interference so that we can then run a feature analysis on it, correctly identifying people who don't experience work interference does not provide much value.
    - Hyperparameters were determined using GridSearchCV with 5-fold cross-validation.
    - Best model on validate and train using F1 score is the MLPClassifier, but since it is a black box estimator and we are dealing with sensitive data, we may want to use a white box estimator. 
        - Random Forest is the least overfit out of the models, and performed almost as well as XGBoost. XGBoost was very overfit even with cross validation. 
- Takeaways:
    - We will use the Random Forest Classifier due to its generalizability, simplicity, and transparency.
        - It predicted both classes the best according to F1 score, and it will also be simpler to perform a feature analysis on
        - While the MLPClassifier had the lowest false negative and false positive rates, its nature as a black box estimator and the nature of our data create problems with transparency. We want to be respectful and conscientious about how these features affect people's mental health and performance at work.
- Next Steps:
    - Use the model on out of sample data
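
The grid search itself is not shown in this notebook. A minimal sketch of a 5-fold `GridSearchCV` scored on F1, assuming a random forest and a small illustrative grid, is below. The parameter values and the synthetic data are assumptions, not the grid the project used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic categorical-style features, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

# Illustrative grid; the real search space is not given in the notebook
param_grid = {"max_depth": [4, 8], "min_samples_leaf": [1, 3]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=123),
    param_grid=param_grid,
    scoring="f1",  # optimize F1 on the positive class
    cv=5,          # 5-fold cross-validation
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

When the training data has been oversampled, running the search on the resampled set lets duplicated rows leak across folds, so the cross-validated scores can look better than they are.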

------

### Model Testing and Feature Analysis

In [None]:
evaluate.run_metrics(X_test, y_test, rfc, 'Test')

In [None]:
importances,std = evaluate.get_importances(rfc)

In [None]:
evaluate.graph_importances(X_res, importances, std)
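
The `evaluate.get_importances` helper is not shown here. A minimal sketch, assuming it returns the forest's mean decrease in impurity (MDI) per feature together with the standard deviation across individual trees, might look like this. The function name and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def get_importances_sketch(forest):
    """Hypothetical stand-in for evaluate.get_importances."""
    # MDI importances averaged over all trees in the forest
    importances = forest.feature_importances_
    # Spread of each feature's importance across the individual trees
    std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
    return importances, std

# Synthetic data where only feature 0 carries signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_forest = RandomForestClassifier(
    n_estimators=50, max_depth=8, random_state=123).fit(X_demo, y_demo)
demo_importances, demo_std = get_importances_sketch(demo_forest)

for i in np.argsort(demo_importances)[::-1]:
    print(f"feature {i}: {demo_importances[i]:.3f} +/- {demo_std[i]:.3f}")
```

MDI is computed on the training data and can favor features with many categories, so comparing it against a permutation-based importance is a reasonable sanity check.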

In [None]:
list(X_res.columns)

### Model Testing and Feature Analysis Key Findings and Takeaways:

- Model Testing: 
    - Key Findings:
        - The model performed very well on out of sample data. There is an increase of almost 22 percentage points compared to baseline
        - It predicted instances of work interference close to 82% correctly
        - More importantly, it only incorrectly classified 18% of positive cases of work interference
        - It improved on both f1 scores on the test data compared to the validation data
    - Takeaways:
        - The model performed better than expected, outperforming the validation data despite the test data having a smaller sample size
        - The model generalized very well and could be improved with some more specific data and data collection
- Feature Analysis:
    - Key Findings:
        - Feature analysis was performed using mean decrease in gini impurity.
            - Whether or not they had treatment was the strongest feature
            - Family history was another strong predictor
            - The other factors had a weaker MDI, but they still provide some value
    - Takeaways:
        - Family history and seeking treatment are things that businesses don't have much control over, but they can provide resources and support to people who request those things. 
        - Being open to talking with a supervisor about mental health issues is a stronger factor compared to something like ease of getting leave from work. Having that mentorship role and support reduces work interference. 
        - Observing consequences due to mental health issues is another factor that drives work interference. Observing others get in trouble or have consequences due to mental health would be stressful for someone who is currently struggling. 
        - Having benefits only weakly impacts the data, but this could be due to how little mental health coverage is actually included in healthcare plans. 
        - Gender is weak, but this could be due to the extremely small sample size of gender:'other'
            

------

### Summary - Key Findings, Takeaways, and Next Steps:

**Key Takeaways:**
- There appears to be no difference between observed negative consequences for coworkers with mental health conditions and talking to the supervisor.
- Men who feel that their company takes mental health as seriously as physical health have work interference at a significantly lower rate than those who do not, or do not know.
- Women who feel that their company takes mental health as seriously as physical health have work interference at a lower rate than those who do not, or do not know.
- There is evidence to suggest a relationship between an employee feeling comfortable speaking with a supervisor about personal mental health issues and work interference.
- When controlling for gender, the rate of work interference is the same among all responses to anonymity.

**Recommendations:**
- Management training
- Communicate to new hires the importance of mental health during onboarding (PTOs, help that's available, etc.)
- Have a mission statement that shows inclusivity for mental and physical health assistance

------