# Table of Contents
 <p><div class="lev1 toc-item"><a href="#OSMI-Mental-Health-In-Tech-Survey-2016-:-Clustering-Model-Selection" data-toc-modified-id="OSMI-Mental-Health-In-Tech-Survey-2016-:-Clustering-Model-Selection-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>OSMI Mental Health In Tech Survey 2016 : Clustering Model Selection</a></div><div class="lev1 toc-item"><a href="#Variables-Chosen" data-toc-modified-id="Variables-Chosen-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Variables Chosen</a></div><div class="lev2 toc-item"><a href="#Usage-of-why-variables" data-toc-modified-id="Usage-of-why-variables-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Usage of why variables</a></div><div class="lev1 toc-item"><a href="#Initial-Modeling" data-toc-modified-id="Initial-Modeling-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Initial Modeling</a></div>

# OSMI Mental Health In Tech Survey 2016 : Clustering Model Selection

_By [Michael Rosenberg](mailto:mmrosenb@andrew.cmu.edu)._

_**Description**: Contains my model selection procedure for clustering a set of questions about employers and how they treat mental health._

In [35]:
#imports
library(poLCA)

#constants
sigLev = 3
percentMul = 100
options(warn=-1) #turns off warnings

In [36]:
clusterFrame = read.csv("../data/processed/clusterDataset.csv")

# Variables Chosen

You can see my [cluster map file](../data/preprocessed/clusterColumnMap.csv) to get a full sense of the variables used in this analysis. In this file, any variable I did not use has an empty ```newColName```, and any variable I am considering has a non-empty ```newColName```. Because I am considering around $19$ variables in my initial models, listing the entire map here would be overkill; I will instead describe the variables as they become relevant.
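As a rough illustration of how the map separates used from unused variables, the sketch below builds a tiny stand-in for the map and keeps only the rows with a non-empty ```newColName```. The ```oldColName``` column and the example rows are assumptions made for illustration; only ```newColName``` comes from the description above.

```r
# toy stand-in for clusterColumnMap.csv; the oldColName column and its
# values are hypothetical, only newColName comes from the text above
toyMap = data.frame(
    oldColName = c("Question A", "Question B", "Question C"),
    newColName = c("empProvideMHB", "", "knowMHB"),
    stringsAsFactors = FALSE)

# variables under consideration are those with a non-empty newColName
isUsed = !is.na(toyMap$newColName) & toyMap$newColName != ""
usedVars = toyMap$newColName[isUsed]
print(usedVars)
```

The ```is.na``` check is there because an entirely blank cell can be read in as a missing value rather than as an empty string, depending on how the file is loaded.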

## Usage of why variables

The questions about bringing up a mental health or physical health issue in an interview each include a written explanation section. While I have chosen to include these explanations in my processed dataset, they may be difficult to use because free-text responses are sparse. That being said, we may try certain methods on these sections so that their language can inform our clustering models. In particular, it may be useful to consider a dimensionally reduced form of these responses.
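One possible way to get a dimensionally reduced form of the written explanations is to build a term-count matrix and keep its leading singular vectors. The sketch below does this in base R on made-up responses. The example text, the choice of a singular value decomposition, and the number of retained dimensions are all assumptions, not a method I have settled on.

```r
# made-up explanations standing in for explanationMH / explanationPH
toyText = c("fear of stigma from employer",
            "employer would not hire me",
            "not relevant to the job",
            "stigma would hurt my chances")

# tokenize on whitespace after lowercasing
tokens = strsplit(tolower(toyText), "\\s+")
vocab = sort(unique(unlist(tokens)))

# term-count matrix: one row per response, one column per word
termMat = t(sapply(tokens, function(tok) {
    as.numeric(table(factor(tok, levels = vocab)))
}))
colnames(termMat) = vocab

# keep the first numDims left singular vectors, scaled by singular values
numDims = 2 # assumed; would need tuning on real data
decomp = svd(termMat)
reducedMat = decomp$u[, 1:numDims] %*% diag(decomp$d[1:numDims])
print(dim(reducedMat))
```

Each row of ```reducedMat``` is a low-dimensional summary of one response. To feed such a summary into a latent class model, it would still need to be discretized, since the model expects categorical inputs.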

In [37]:
filteredClusterFrame = clusterFrame[,!names(clusterFrame) %in% c(
                                        "explanationMH","explanationPH")]

# Initial Modeling

Because all of our variables are categorical, it seems reasonable to limit our analysis to latent class models. We do not need to deal with mixed types, so an alternative such as a typical $K$-modes clustering algorithm is not entirely necessary.
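For reference, the standard latent class model for $J$ categorical items and $R$ classes can be written as follows. The notation ($p_r$, $\pi_{jrk}$, $y_{ijk}$) is my own shorthand and is not defined elsewhere in this notebook.

$$P(Y_i = y_i) = \sum_{r=1}^{R} p_r \prod_{j=1}^{J} \prod_{k=1}^{K_j} \pi_{jrk}^{\,y_{ijk}}$$

Here $p_r$ is the share of respondents in class $r$, $\pi_{jrk}$ is the probability that a respondent in class $r$ gives response $k$ to item $j$, and $y_{ijk}$ equals $1$ if respondent $i$ gave response $k$ to item $j$ and $0$ otherwise. The product over items reflects the assumption that responses are independent of each other once the class is known.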

I have a feeling that there are two main narratives that we need to consider: The narrative of "employers are doing enough for mental health" and the narrative of "employers are not doing enough for mental health." Because of this, I think it might be useful to initially consider a $2$-class model.
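To make concrete what a $2$-class model would give us, the sketch below computes the posterior class membership of a single respondent using Bayes' rule. All numbers here (the class shares, the item probabilities, and the two items themselves) are hypothetical placeholders, not estimates from the survey.

```r
# hypothetical class shares for a 2-class model (must sum to 1)
classShare = c(0.6, 0.4)

# hypothetical conditional response probabilities: rows are classes,
# columns are response levels, one matrix per item
itemProbs = list(
    itemA = rbind(c(0.8, 0.2), c(0.3, 0.7)),
    itemB = rbind(c(0.5, 0.3, 0.2), c(0.1, 0.3, 0.6)))

# one respondent's answers: level 2 on itemA, level 3 on itemB
response = c(itemA = 2, itemB = 3)

# likelihood of the answers within each class, assuming the items are
# independent given the class
classLik = sapply(1:2, function(r) {
    prod(sapply(names(response),
                function(item) itemProbs[[item]][r, response[[item]]]))
})

# Bayes' rule: posterior is proportional to share times likelihood
posterior = classShare * classLik / sum(classShare * classLik)
print(round(posterior, 3))
```

With these placeholder values, the respondent is assigned mostly to the second class, because both of their answers are far more likely under that class's response probabilities.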

In [45]:
#make formula
columnString = paste(colnames(filteredClusterFrame),collapse = ",")
lhs = paste0("cbind(",columnString,")")
givenForm = paste0(lhs,"~1")
print(givenForm)
#fit the 2-class model using the formula built above rather than retyping it
initMod.lcm = poLCA(as.formula(givenForm),
                    data = filteredClusterFrame,nclass = 2)

[1] "cbind(empPrimTech,empProvideMHB,knowMHB,empDiscMH,empResourceMH,anonProtected,askLeaveDiff,negConsDiscMH,negConsDiscPH,coworkComfMHD,superComfMHD,empSeriousMH,heardNegConsMH,discInterviewPH,discInterviewMH,hurtCareerMH,teamNegMH,observeBadResponseMH,revealLikelihoodMH)~1"
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$empPrimTech
           Pr(1)  Pr(2)
class 1:  0.7954 0.2046
class 2:  0.7488 0.2512

$empProvideMHB
           Pr(1)  Pr(2)  Pr(3)  Pr(4)
class 1:  0.0791 0.1325 0.5064 0.2821
class 2:  0.0666 0.2325 0.4258 0.2751

$knowMHB
           Pr(1)  Pr(2)  Pr(3)  Pr(4)
class 1:  0.1211 0.3012 0.3281 0.2495
class 2:  0.1116 0.2388 0.2888 0.3608

$empDiscMH
           Pr(1)  Pr(2)  Pr(3)
class 1:  0.5905 0.2956 0.1139
class 2:  0.8133 0.1178 0.0689

$empResourceMH
           Pr(1)  Pr(2)  Pr(3)
class 1:  0.3761 0.3291 0.2948
class 2:  0.5395 0.1948 0.2656

$anonProtected
           Pr(1)  Pr(2)  Pr(3)
class 1:  0.5726 0.3898 0.