# Table of Contents
1. [OSMI Health Survey 2016: Inference](#OSMI-Health-Survey-2016:-Inference)
2. [Recap](#Recap)
3. [Fit and Diagnostics](#Fit-and-Diagnostics)
4. [Interpretation (If the model is well-specified)](#Interpretation-(If-the-model-is-well-specified))

# OSMI Health Survey 2016: Inference

_By [Michael Rosenberg](mailto:mmrosenb@andrew.cmu.edu)._

_**Description**: Contains my inference related to the [OSMI Mental Health In Tech Survey 2016](https://osmihelp.org/research/). This notebook is written in `R`._

In [9]:
#imports
#constants
sigLev = 3

# Recap

As discussed in my [model selection procedure](modelSelection.ipynb), I ended up choosing a model that represents a logistic regression, i.e.

$$P(diagnosedWithMHD_i = 1 | X) = \frac{1}{1+e^{-\hat{r}(X)}},$$

where $\hat{r}(X)$ is the fitted regression function that contains the following variables:

* ```age```: the age of a respondent.

* ```roleType```: the type of role a respondent holds at work, which can be technical, non-technical, or both (i.e. hybrid roles).

* ```isUSA```: whether the individual works in the USA (1 if they do, 0 if they don't).

* ```gender```: the gender that the individual identifies with. For the sake of simplification, we code an individual as "F" if they identify as female, "M" if they identify as male, and "O" if they identify as another gender outside the binary. We had to pool these responses into "O" since there were unfortunately only a small number of observations that did not identify along the gender binary.

* The interaction term between ```age``` and ```gender```.

Let us fit this model on our test set, which serves as our inference set, and perform some inference.
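Written out, the regression function would take roughly the following form. This is a sketch that assumes R's default dummy coding, with female, hybrid role, and non-USA as the reference levels (an assumption inferred from the variable descriptions above). The $\hat{\beta}$ labels and the indicator notation $\mathbb{1}[\cdot]$ are my own shorthand:

$$\hat{r}(X) = \hat{\beta}_0 + \hat{\beta}_1 age_i + \hat{\beta}_2 \mathbb{1}[gender_i = M] + \hat{\beta}_3 \mathbb{1}[gender_i = O] + \hat{\beta}_4 \mathbb{1}[roleType_i = \text{non-technical}] + \hat{\beta}_5 \mathbb{1}[roleType_i = \text{technical}] + \hat{\beta}_6 isUSA_i + \hat{\beta}_7 age_i \cdot \mathbb{1}[gender_i = M] + \hat{\beta}_8 age_i \cdot \mathbb{1}[gender_i = O].$$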

# Fit and Diagnostics

In [6]:
#load in formula
formulaFilename = "../models/finalLogisticRegressionFormula.txt"
#get rid of newline character at the end, so -1
formula = readChar(formulaFilename,file.info(formulaFilename)$size - 1)
print(formula)

[1] "diagnosedWithMHD ~ age+factor(gender)+factor(roleType)+factor(isUSA)+age:factor(gender)"
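As an aside, the byte-counting trick above depends on the file ending in exactly one newline character. A slightly more forgiving way to read the formula might look like the sketch below. It assumes the formula sits on the first line of the file, and it writes a stand-in file with `tempfile()` so that it runs on its own; the names `formulaFile`, `formulaString`, and `formulaObj` are hypothetical.

```r
# stand-in for the formula file, so that the sketch is self-contained
formulaFile = tempfile(fileext = ".txt")
writeLines(paste0("diagnosedWithMHD ~ age+factor(gender)+factor(roleType)",
                  "+factor(isUSA)+age:factor(gender)"), formulaFile)
# readLines drops the trailing newline itself; trimws removes stray spaces
formulaString = trimws(readLines(formulaFile, n = 1))
# convert the string into a formula object explicitly
formulaObj = as.formula(formulaString)
print(all.vars(formulaObj))
```

Passing an explicit formula object to `glm` should behave the same as passing the string, but it makes the conversion visible.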


In [7]:
#load in data
inferenceSet = read.csv("../data/processed/test.csv")
#then fit model
finalMod.logr = glm(formula,data = inferenceSet,family = "binomial")

Let's first see how well our model is fitting the inference set.

In [11]:
inferenceSet$prediction = predict(finalMod.logr,type = "response")
#make decision rule
decRule = .5
inferenceSet$prediction = ifelse(inferenceSet$prediction > decRule,1,0)
#make comparison
correctFrame = inferenceSet[which(inferenceSet$diagnosedWithMHD ==
                                  inferenceSet$prediction),]
#get proportion accurate
propAccurate = dim(correctFrame)[1] / dim(inferenceSet)[1]
print(paste("The proportion accurate on the inference set is",signif(
                                                    propAccurate,sigLev)))

[1] "The proportion accurate on the inference set is 0.604"


This is still not an amazing fit, but it performs about on par with the fits in our previous discussion. This may suggest that we simply aren't fitting the data well, and that it may be necessary to go back into the survey and find other variables that would be strong predictors of this outcome. It could also be that the survey simply does not ask all the questions we need to get a full picture of someone's mental health (e.g. what are their eating habits, what are their hours like at work, what is their social life like, etc.). We may be able to make more meaningful statements about factors contributing to mental health in tech if some of these questions were asked.

Let's use a confusion matrix to see where we are making mistakes.

In [13]:
confusionMat = matrix(0,nrow = 2,ncol = 2)
for (i in 1:2){
    for (j in 1:2){
        #count observations with prediction i - 1 and actual outcome j - 1
        confusionMat[i,j] = length(which(inferenceSet$prediction == i - 1 &
                                inferenceSet$diagnosedWithMHD == j - 1))
    }
}
#name rows and columns
rownames(confusionMat) = c("Predict 0","Predict 1")
colnames(confusionMat) = c("Actual 0","Actual 1")
confusionMat

| | Actual 0 | Actual 1 |
|---|---|---|
| Predict 0 | 233 | 161 |
| Predict 1 | 122 | 198 |


_Table 1: Our confusion matrix for our model._

We see a false negative rate of around $\frac{161}{161 + 198} \cdot 100\% \approx 44.85\%$, along with a false positive rate of about $\frac{122}{122+233} \cdot 100\% \approx 34.37\%.$ Thus, we have a somewhat bigger false negative problem than false positive problem. This may be useful to note if we want to do future tuning.
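The arithmetic above can be checked with a short sketch that rebuilds the matrix from the counts reported in Table 1. The names `counts`, `accuracy`, `falseNegRate`, and `falsePosRate` are my own and do not appear elsewhere in the notebook.

```r
# cell counts copied from Table 1 (rows: predicted, columns: actual)
# matrix() fills column by column, so the first two values are "Actual 0"
counts = matrix(c(233, 122, 161, 198), nrow = 2,
                dimnames = list(c("Predict 0", "Predict 1"),
                                c("Actual 0", "Actual 1")))
# proportion of observations on the diagonal
accuracy = sum(diag(counts)) / sum(counts)
# false negatives over all actual positives
falseNegRate = counts["Predict 0", "Actual 1"] / sum(counts[, "Actual 1"])
# false positives over all actual negatives
falsePosRate = counts["Predict 1", "Actual 0"] / sum(counts[, "Actual 0"])
print(round(c(accuracy = accuracy, FNR = falseNegRate,
              FPR = falsePosRate), 4))
```

With the raw vectors at hand, `table(inferenceSet$prediction, inferenceSet$diagnosedWithMHD)` should build the same matrix without the explicit loop.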

# Interpretation (If the model is well-specified)

If our model is well-specified, we can interpret it directly through the estimated coefficients and their statistical significance. Of course, these interpretations rest on the assumption that the model is well-specified; we will use some techniques later as robustness checks on the significance of these variables.

In [14]:
summary(finalMod.logr)


Call:
glm(formula = formula, family = "binomial", data = inferenceSet)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0102  -1.1208   0.5626   1.1507   1.7344  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -0.070454   0.652334  -0.108 0.913994    
age                            0.009834   0.017589   0.559 0.576084    
factor(gender)M               -1.145899   0.740104  -1.548 0.121552    
factor(gender)O                0.546165   2.129787   0.256 0.797610    
factor(roleType)non-technical -0.020592   0.213527  -0.096 0.923174    
factor(roleType)technical     -0.102976   0.189235  -0.544 0.586324    
factor(isUSA)1                 0.542171   0.158222   3.427 0.000611 ***
age:factor(gender)M            0.012324   0.020899   0.590 0.555381    
age:factor(gender)O            0.013099   0.067543   0.194 0.846224    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion

_Table 2: Summary of our fitted model._

Unfortunately, we see statistical significance in only one of our variables: $isUSA_i$. That being said, there are some large effect sizes estimated in this model. By the coefficient of $isUSA_i$, we see that if the respondent works in the United States, the model predicts that the respondent's odds of being diagnosed with a mental health disorder are on average $\exp(.542171) \approx 1.72$ times the odds for a respondent working outside the United States. There are two possible explanations for this effect:

1. The United States could have a stress culture in technology that pushes more individuals to a point of having a mental health disorder.

2. The United States has a framework for handling mental health that diagnoses more people on average than other countries.

I personally find the second explanation to be the stronger one. When we compare how developed countries handle mental health with how emerging markets do, it is apparent that the mental health landscape of the United States is a relatively strong one.
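To get a sense of the uncertainty around the $isUSA_i$ effect, here is a minimal sketch of a 95% Wald interval for the odds ratio, computed from the estimate and standard error reported in Table 2. The Wald (normal) approximation is an assumption on my part, and the variable names are hypothetical.

```r
# estimate and standard error for factor(isUSA)1, copied from Table 2
estimate = 0.542171
stdError = 0.158222
# critical value for a two-sided 95% interval under the normal approximation
zCrit = qnorm(0.975)
# build the interval on the log-odds scale, then exponentiate
logOddsInterval = estimate + c(-1, 1) * zCrit * stdError
oddsRatio = exp(estimate)
oddsRatioInterval = exp(logOddsInterval)
print(signif(c(oddsRatio, oddsRatioInterval), 3))
```

An interval that excludes 1 on the odds-ratio scale agrees with the significance reported for this coefficient. Calling `confint` on the fitted `glm` object uses profile likelihood rather than this approximation, so its interval may differ slightly.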

We also see a coefficient on the technical level of $roleType_i$ of $-.103$, which corresponds to an odds ratio of $\exp(-.103) \approx .902.$ This would suggest that someone working in a technical role is predicted to have, on average, odds of being diagnosed with a mental health disorder that are $\approx 10\%$ lower than those of someone who takes on both technical and non-technical roles, although this estimate is not statistically significant ($p \approx .586$). This may suggest that individuals who take a jack-of-all-trades approach in the tech industry may be overburdened, to a point where they are much more