# Table of Contents
1. [OSMI Health Survey 2016: Inference](#OSMI-Health-Survey-2016:-Inference)
2. [Recap](#Recap)
3. [Fit and Diagnostics](#Fit-and-Diagnostics)

# OSMI Health Survey 2016: Inference

_By [Michael Rosenberg](mailto:mmrosenb@andrew.cmu.edu)._

_**Description**: Contains my inference related to the [OSMI Mental Health In Tech Survey 2016](https://osmihelp.org/research/). This notebook is written in `R`._

In [9]:
#imports
#constants
sigLev = 3

# Recap

As discussed in my [model selection procedure](modelSelection.ipynb), I ended up choosing a logistic regression model, i.e.

$$P(diagnosedWithMHD_i = 1 | X) = \frac{1}{1+e^{-\hat{r}(X)}},$$

where $\hat{r}(X)$ is the fitted regression function that contains the following variables:

* ```age```: the age of a respondent.

* ```roleType```: the role type of a respondent at work, which can be technical, non-technical, or both (i.e. hybrid roles).

* ```isUSA```: whether the individual works in the USA (1 if they do, 0 if they don't).

* ```gender```: the gender that the individual identifies with. For the sake of simplification, an individual is given "F" if they identify as female, "M" if they identify as male, and "O" if they identify as another gender not along the binary. We had to pool these responses into "O" since there were unfortunately only a small number of observations from respondents who did not identify along the gender binary.

* The interaction term between ```age``` and ```gender```.
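Written out, the regression function implied by the variables above has roughly the following form. This is a sketch that assumes R's default treatment (dummy) coding for factors, with one level of each factor absorbed into the intercept as the reference. The coefficient symbols ($\hat{\beta}$, $\hat{\gamma}$, $\hat{\delta}$, $\hat{\eta}$, $\hat{\theta}$) are notation introduced here for illustration, and the sums run over the non-reference levels $g$ of ```gender``` and $t$ of ```roleType```:

$$\hat{r}(X) = \hat{\beta}_0 + \hat{\beta}_1 \, age + \sum_{g} \hat{\gamma}_g \, \mathbf{1}[gender = g] + \sum_{t} \hat{\delta}_t \, \mathbf{1}[roleType = t] + \hat{\eta} \, \mathbf{1}[isUSA = 1] + \sum_{g} \hat{\theta}_g \, age \cdot \mathbf{1}[gender = g].$$

Under this reading, each $\hat{\theta}_g$ describes how the slope on ```age``` for gender level $g$ differs from the slope for the reference gender level.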

Let us fit this model on our test set, which serves as our inference set, and perform some inference on it.

# Fit and Diagnostics

In [6]:
#load in formula
formulaFilename = "../models/finalLogisticRegressionFormula.txt"
#get rid of newline character at the end, so -1
formula = readChar(formulaFilename,file.info(formulaFilename)$size - 1)
print(formula)

[1] "diagnosedWithMHD ~ age+factor(gender)+factor(roleType)+factor(isUSA)+age:factor(gender)"
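The formula is read in as a plain character string. As a minimal sketch of what that string encodes, the snippet below converts it with `as.formula` and lists its terms. The variable names `formulaString`, `parsedFormula`, and `termLabels` are introduced here for illustration, and the string is copied from the printed output above rather than read from the file.

```r
# formula string copied from the printed output above
formulaString = paste0("diagnosedWithMHD ~ age+factor(gender)+factor(roleType)",
                       "+factor(isUSA)+age:factor(gender)")
# convert the character string into a formula object
parsedFormula = as.formula(formulaString)
# list the terms on the right-hand side (main effects, then the interaction)
termLabels = attr(terms(parsedFormula), "term.labels")
print(termLabels)
```

This yields four main-effect terms and one interaction term, matching the variable list in the recap.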


In [7]:
#load in data
inferenceSet = read.csv("../data/processed/test.csv")
#then fit model
finalMod.logr = glm(formula,data = inferenceSet,family = "binomial")

Let's first see how well our model is fitting the inference set.

In [11]:
inferenceSet$prediction = predict(finalMod.logr,type = "response")
#make decision rule
decRule = .5
inferenceSet$prediction = ifelse(inferenceSet$prediction > decRule,1,0)
#make comparison
correctFrame = inferenceSet[which(inferenceSet$diagnosedWithMHD ==
                                  inferenceSet$prediction),]
#get proportion accurate
propAccurate = dim(correctFrame)[1] / dim(inferenceSet)[1]
print(paste("The proportion accurate on the inference set is",signif(
                                                    propAccurate,sigLev)))

[1] "The proportion accurate on the inference set is 0.604"
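The accuracy calculation above can be illustrated on a small example. The sketch below uses hypothetical toy vectors (`actual` and `fittedProb` are made up for illustration and are not survey data) to show the same 0.5 decision rule, with the proportion correct computed as the mean of a logical comparison.

```r
# hypothetical toy data, NOT taken from the survey
actual = c(1, 0, 1, 1, 0, 0)
fittedProb = c(0.8, 0.3, 0.4, 0.6, 0.7, 0.2)
# same decision rule as above: predict 1 when the fitted probability exceeds .5
decRule = .5
predicted = ifelse(fittedProb > decRule, 1, 0)
# the mean of a logical vector is the proportion of TRUE values
propAccurateToy = mean(predicted == actual)
print(propAccurateToy)
```

Here four of the six toy predictions agree with the toy outcomes, so the proportion accurate is 4/6.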


This is still not an amazing fit, but it performs about on par with the fits in our previous discussion. This may suggest that we simply aren't fitting the data well, and that we may need to go back into the survey and find other variables that are strong predictors of this outcome. It could also be that the survey simply does not ask all the questions needed to get a full picture of someone's mental health (e.g. what are their eating habits, what are their hours like at work, what is their social life like, etc.). We may be able to make more meaningful statements about factors contributing to mental health in tech if the survey asked some of these questions.

Let's use the confusion matrix to see where we are making mistakes.

In [13]:
confusionMat = matrix(0,nrow = 2,ncol = 2)
for (i in 1:2){
    for (j in 1:2){
        #count observations with predicted level i - 1 and actual level j - 1
        confusionMat[i,j] = length(which(inferenceSet$prediction == i - 1 &
                                inferenceSet$diagnosedWithMHD == j - 1))
    }
}
#name rows and columns
rownames(confusionMat) = c("Predict 0","Predict 1")
colnames(confusionMat) = c("Actual 0","Actual 1")
confusionMat

| | Actual 0 | Actual 1 |
|---|---|---|
| Predict 0 | 233 | 161 |
| Predict 1 | 122 | 198 |
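To summarize these counts, here is a minimal sketch that rebuilds the matrix from the reported numbers and computes a few standard rates. The names `reportedMat`, `accuracy`, `sensitivity`, `specificity`, and `majorityBaseline` are introduced here for illustration. The majority-class baseline is the accuracy we would get by always predicting the more common actual class.

```r
# counts copied from the confusion matrix reported above
# (matrix() fills column by column: first "Actual 0", then "Actual 1")
reportedMat = matrix(c(233, 122, 161, 198), nrow = 2, ncol = 2,
                     dimnames = list(c("Predict 0", "Predict 1"),
                                     c("Actual 0", "Actual 1")))
nObs = sum(reportedMat)
# proportion of all observations on the diagonal (correct predictions)
accuracy = sum(diag(reportedMat)) / nObs
# proportion of actual 1s that we predict as 1
sensitivity = reportedMat["Predict 1", "Actual 1"] /
              sum(reportedMat[, "Actual 1"])
# proportion of actual 0s that we predict as 0
specificity = reportedMat["Predict 0", "Actual 0"] /
              sum(reportedMat[, "Actual 0"])
# accuracy from always predicting the more common actual class
majorityBaseline = max(colSums(reportedMat)) / nObs
print(signif(c(accuracy = accuracy, sensitivity = sensitivity,
               specificity = specificity,
               majorityBaseline = majorityBaseline), 3))
```

The accuracy recovered this way, 431 of 714, agrees with the 0.604 reported earlier. The off-diagonal cells give 161 false negatives and 122 false positives.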


In [None]:
We see that we have a false 