# Differences in Generational Perceptions of Organizational Justice: A Scale Analysis Project

This notebook outlines the code used to preprocess the data (the more time-consuming and difficult portion, IMO) and to analyze a variety of organizational psychology scales. The code can be copied, pasted, and used at one's discretion; detailed comments are included throughout to enhance readability and interpretability. Jump right in when ready!

### Introduction

R, like many programming languages, has a copious selection of packages from which to choose. Packages are essentially bundles of pre-developed code/scripts that are used to accomplish a task. For instance, the ```readr``` package contains an assortment of functions (e.g., ```read_csv```, ```read_table```, ```read_delim```) used to import a variety of delimited text files (e.g., .csv, .tsv, .txt), including compressed ones (e.g., .zip, .gz); Excel workbooks (.xlsx) need a separate package such as ```readxl```. We will begin by loading some useful packages. No worries, one can also load packages as needed instead of all at once, though the loading order matters: when two packages share a function name, the package loaded later masks the earlier one.

Some function names overlap across packages, and R will notify you of this by printing a message displaying which function names are being masked. To call a specific function from a package, simply type the name of the package followed by two colons and the function name (e.g., ```readr::read_csv()```). This is considered "best practice" in R programming, though judging by my extensive Google searches it is not often followed.

The first line in the code block begins with a `#`, signaling to R that the line should be ignored -- this is also known as *commenting*. To uncomment the line and run the code, simply remove the symbol. 

**NOTE**: ```install.packages(...)``` only needs to be run once because the packages will be saved to your local machine.

In [1]:
#install.packages(c("dplyr", "readr", "stringr", "ggplot2", "corrplot", "psych"))
library(readr)
library(dplyr)
library(stringr)
library(corrplot)
library(ggplot2)
library(psych)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


corrplot 0.84 loaded


Attaching package: ‘psych’


The following objects are masked from ‘package:ggplot2’:

    %+%, alpha
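The masking messages above mention `filter`, so here is a small sketch of how the double-colon syntax still reaches the masked version. It only uses the `stats` package that ships with R, and the vector `x` is made up for illustration.

```r
# made-up numbers, purely for illustration
x <- 1:5

# dplyr masks stats::filter, but the stats version is still reachable
# by naming its package explicitly; here it computes a centered
# 3-point moving average (the first and last values are NA)
smoothed <- stats::filter(x, rep(1 / 3, 3))
smoothed
```

Without the `stats::` prefix, R would use whichever `filter` was loaded most recently, which after `library(dplyr)` is the dplyr one.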




Next, let's set the working directory. The working directory is the main folder that holds the relevant files used within our script. In this case, that includes our R script as well as the data set exported from Qualtrics.

For this example, make sure that the data file **and** R file are saved in the same folder, wherever that may be on your local machine.

Set the working directory using RStudio's keybinding (aka keyboard shortcut)! 
- Mac: Ctrl + Shift + H
- Windows: Ctrl + Shift + H

Typically, it's good practice to have designated folders to hold varying files (data, figs, docs, etc.) but navigating the directory is beyond the scope of this tutorial. Feel free to reach out to me or browse a programmer's most used tool for information -- Google :) 
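For anyone who prefers code over the keyboard shortcut, base R can report and change the working directory directly. In the sketch below the folder names are placeholders, not paths from this project, so the `setwd()` line is commented out.

```r
# where is R currently looking for files?
getwd()

# point R at the project folder (placeholder path, edit before running)
# setwd("path/to/project")

# build a path to a file inside a hypothetical "data" subfolder
data_path <- file.path("data", "rawdata.csv")
data_path  # "data/rawdata.csv"

# check that the file is really there before importing it
file.exists(data_path)
```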

### Data Import & Preprocessing

Now that the R system is mostly set up, let's move to importing the data set. R is an object-oriented, statistical programming language. The key phrase here is *object-oriented*: we can give something a name in R and later manipulate, transform, slice, or move it, along with myriad other options. 

This is one of the main benefits of R, as it grants the software extreme levels of flexibility, especially compared to programs such as Excel or SPSS. One can do both data preprocessing and statistical analyses from the same platform!

In [2]:
#import data
#replace the name of the file w/ the name of your saved file
raw = read_csv(file = "rawdata.csv", 
               col_names = TRUE)

Parsed with column specification:
cols(
  .default = col_character()
)

See spec(...) for full column specifications.



Fundamentally, everything in R is either an object or a function. Previously we assigned our data set to an *object* named ```raw``` by using the *function* ```read_csv```. 

Yet another useful feature of R is the ability to build custom functions on the fly that can be saved and used for future instances. Think of the easy functions such as =SUM or =AVERAGE used in Excel but on steroids! There are innumerable configurations because one can even take an already developed function and build on it to do whatever is most convenient at the time. 
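As a minimal sketch of the idea, the snippet below wraps the built-in `mean` in a custom function that always ignores missing values. The names `scores` and `mean_no_missing` are invented for this example.

```r
# an object: a made-up vector of responses with one missing value
scores <- c(3, 4, NA, 5)

# a custom function built on top of an existing one
mean_no_missing <- function(x) {
    mean(x, na.rm = TRUE)
}

mean(scores)             # NA, because of the missing value
mean_no_missing(scores)  # 4
```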

Below I am augmenting a function from the ```dplyr``` package called ```rename_at```. Typically, ```rename_at``` has to be called once per condition, chained together by way of the ```%>%``` (pipe) symbol, but the custom function I built below accepts a list of conditions and a matching list of renaming rules, so multiple sets of column names can be changed in one call. 

Try not to get too wrapped up trying to figure out how the function works and just analyze the output. In short, we are taking a column that begins with a sequence of characters and changing the name (i.e., columns that begin with "Q1" need to replace the pattern "Q1_" with "wd").

In [3]:
rename_at2 = function(data, .vars, .funs) {
    stopifnot(length(.vars) == length(.funs))
    
    for (i in seq_along(.vars)) {
        data = rename_at(data, .vars[[i]], .funs[[i]])
        }
    data
}

This is done multiple times and with fewer lines than calling ```rename_at``` for EVERY condition we want to change.

In [4]:
dat = raw %>% 
    #choose what columns we want to keep from first:last
    select(Q1_1:D9_2) %>% 
    #remove rows 1:4
    slice(-c(1:4)) %>%
    #use custom function
    rename_at2(
        list(vars(starts_with("Q1")), 
             vars(starts_with("Q2")), 
             vars(starts_with("Q3")), 
             vars(starts_with("Q4")), 
             vars(starts_with("Q5")),
             vars(starts_with("D"))),
        list(~ str_replace(., "Q1_", "wd"), 
             ~ str_replace(., "Q2_", "open"),
             ~ str_replace(., "Q3_", "org_eff"), 
             ~ str_replace(., "Q4_", "job_sat"), 
             ~ str_replace(., "Q5-", "cmfq"), 
             ~ str_replace(., "D", "dem"))
        ) %>% 
    rename_at2(
        list(vars(matches("wd4|wd7|wd8")), 
             vars(matches("open7|open9")),
             vars(matches("org_eff1")),
             vars(matches("sat2|sat4|sat6|sat10|sat11|sat12")), 
             vars(matches("cmfq2_2|cmfq2_8|cmfq2_11"))),
        list(~paste0(., "_R"), 
             ~paste0(., "_R"),
             ~paste0(., "_R"),
             ~paste0(., "_R"), 
             ~paste0(., "_R"))
        )
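To see what the renaming does without the real Qualtrics export, here is a toy version. The data frame `toy` and its column names are invented for illustration, and it uses `rename_with()`, the newer dplyr (1.0.0 and later) counterpart to `rename_at`, rather than the custom function above.

```r
library(dplyr)

# invented columns that mimic the Qualtrics naming pattern
toy <- data.frame(Q1_1 = 1, Q1_2 = 2, Q2_1 = 3)

renamed <- toy %>%
    # columns starting with "Q1": swap the prefix "Q1_" for "wd"
    rename_with(~ sub("Q1_", "wd", .x), starts_with("Q1")) %>%
    # columns starting with "Q2": swap the prefix "Q2_" for "open"
    rename_with(~ sub("Q2_", "open", .x), starts_with("Q2"))

names(renamed)  # "wd1" "wd2" "open1"
```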

Now that the columns are renamed, let's move along to changing our values within the data. The data set from Qualtrics used characters (aka letters) to represent responses instead of numbers. Often times, placing content logic and/or specific response coding within Qualtrics can mess up how the data are exported (anecdotally speaking). To circumvent this issue, the actual response text can be exported and wrangled in R. 

The custom function below uses ```case_when``` to specify when specific strings should be changed to numbers. For example, although each scale has a different set of response options (e.g., 1-4, 1-6, etc.), they are what is known as ordinal variables and thus have an order to them. It's important to maintain this order for our analyses, so each string pattern that is supposed to be the number 1 can be grouped together; each string pattern that is supposed to correspond with a 2 can be grouped together, and so on.

Take note, running the function below doesn't actually change any of the values...yet!

In [5]:
#custom function to change text to numbers
unfactorise = function(x) {
     case_when(
          x %in% c("Strongly disagree", 
                   "Disagree strongly", 
                   "Never", 
                   "Disagree very much", 
                   "1\r\nNot much like me") ~ 1, 
          x %in% c("Disagree", 
                   "Disagree a little", 
                   "Rarely", 
                   "Disagree moderately", 
                   "2\r\n") ~ 2,
          x %in% c("Agree", 
                   "Neither agree nor disagree", 
                   "Sometimes", 
                   "Disagree slightly", 
                   "3\r\n") ~ 3,
          x %in% c("Strongly agree", 
                   "Agree a little", 
                   "Frequently", 
                   "Agree slightly", 
                   "4\r\n") ~ 4,
          x %in% c("Agree strongly", 
                   "Agree moderately", 
                   "5\r\nVery much like me") ~ 5,
          x %in% c("Agree very much") ~ 6
          )
    }
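A trimmed-down version makes the behavior easier to see. This sketch keeps only two response levels, and `to_number` and `responses` are names made up for the example. Note that any text not listed in a condition comes back as `NA`.

```r
library(dplyr)

# trimmed-down recoder with only two response levels
to_number <- function(x) {
    case_when(
        x %in% c("Strongly disagree", "Never") ~ 1,
        x %in% c("Disagree", "Rarely") ~ 2
    )
}

# made-up responses; "Not sure" matches no condition
responses <- c("Never", "Disagree", "Rarely", "Not sure")
to_number(responses)  # 1 2 2 NA
```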

Indeed, in order to reap the benefits of the powerful function, it must be applied to the data set! To do this, let's make a new object that will hold the changes. 

Below, the name ```vars``` is used, but feel free to name your objects whatever you like. Just keep the names distinct and clear so as not to confuse yourself (it's easy to overwrite an object and lose track of changes!). 

The next set of code applies the custom ```unfactorise``` function **across** the selected columns ```wd1:cmfq2_11_R``` and saves these changes in a new data set called ```vars```. 

In [6]:
#new object with number values
vars = data.frame(sapply(subset(dat, select = wd1:cmfq2_11_R), unfactorise))

#take a peek!
head(vars)

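| | wd1 | wd2 | wd3 | wd4_R | wd5 | wd6 | wd7_R | wd8_R | wd9 | open1 | ⋯ | cmfq2_2_R | cmfq2_3 | cmfq2_4 | cmfq2_5 | cmfq2_6 | cmfq2_7 | cmfq2_8_R | cmfq2_9 | cmfq2_10 | cmfq2_11_R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 3 | 3 | 2 | 1 | 3 | 4 | 3 | 2 | 3 | 4 | ⋯ | 4 | 3 | 4 | 5 | 4 | 5 | 3 | 3 | 2 | 3 |
| 2 | 4 | 2 | 3 | 4 | 1 | 4 | 3 | 1 | 2 | 4 | ⋯ | 4 | 4 | 2 | 4 | 4 | 4 | 2 | 4 | 4 | 1 |
| 3 | 3 | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 4 | ⋯ | 2 | 4 | 4 | 4 | 4 | 4 | 2 | 4 | 4 | 2 |
| 4 | 4 | 2 | 2 | 3 | 3 | 3 | 2 | 2 | 2 | 4 | ⋯ | 4 | 5 | 5 | 4 | 5 | 4 | 3 | 4 | 4 | 3 |
| 5 | 3 | 3 | 2 | 2 | 2 | 4 | 2 | 4 | 4 | 5 | ⋯ | 3 | 5 | 5 | 3 | 4 | 1 | 1 | 5 | 4 | 4 |
| 6 | 3 | 3 | 2 | 3 | 2 | 4 | 2 | 2 | 2 | 4 | ⋯ | 3 | 4 | 4 | 4 | 4 | 4 | 2 | 4 | 4 | 2 |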
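The pattern of applying one function to every column with `sapply` and wrapping the result back into a data frame can be tried on a tiny example. The data frame `toy` is invented, and `as.numeric` stands in for the custom function used above.

```r
# invented data frame whose columns are stored as text
toy <- data.frame(a = c("1", "2"),
                  b = c("3", "4"),
                  stringsAsFactors = FALSE)

# sapply() runs the function on each column and returns a matrix;
# data.frame() turns that matrix back into a data frame
nums <- data.frame(sapply(toy, as.numeric))

# both columns are now numeric
str(nums)
```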


Our data set is coming together and almost ready for analysis!

In order to analyze and draw useful inferences from the data, the scales must be representative of the constructs of interest. That means all items are scored appropriately and reverse coded ones need to be adjusted.

Luckily, as I-O psychologists, we encounter (and build) scales with reverse coded items all the time. An example of a reverse coded item is "I prefer to be alone in my free time" when measuring something like extroversion; individuals high on extroversion are more likely to give lower responses to that item stem. If this is not accounted for, it will distort subsequent item correlations and other analyses, so it's an extremely important step in preprocessing survey data.

Many surveys will use negatively valenced words such as "not" or "never" to connote a reverse coded item (much to the chagrin of survey researchers). The better practice is to stick with solely using positive language but adjusting the spectrum of interest. Think back on the previous example ("I prefer to be alone in my free time") -- negatively valenced language is avoided and instead the focus is on the opposite spectrum of extroversion (i.e., introversion); thus someone high on introversion is more likely to report a higher level for the item.

The custom function below allows us to adjust our reverse coded items (labeled ```_R```) for each group of subscales. Workplace discrimination (wd) has a 4-point Likert-type response option format whereas job satisfaction (job_sat) has a 6-point format. The math for a reverse coded item is simple: (response option max value + 1) - current value.

In [7]:
#custom function to change values
mutate_at2 <- function(data, .vars, .funs) {
    stopifnot(length(.vars) == length(.funs))
    
    for (i in seq_along(.vars)) {
        data <- mutate_at(data, .vars[[i]], .funs[[i]])
        }
    data
    }
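Before applying this to the real columns, the reverse coding arithmetic itself can be checked on its own. In this sketch `reverse_code` is a helper name invented for the example; it is not part of the notebook's pipeline.

```r
# (response option max value + 1) - current value
reverse_code <- function(x, max_value) {
    (max_value + 1) - x
}

# 4-point scale (e.g., workplace discrimination): 1 <-> 4, 2 <-> 3
reverse_code(c(1, 2, 3, 4), max_value = 4)  # 4 3 2 1

# 6-point scale (e.g., job satisfaction): 1 <-> 6, 2 <-> 5, 3 <-> 4
reverse_code(c(1, 2, 6), max_value = 6)     # 6 5 1
```

This is why the formulas in the next cell subtract from 5 for the 4-point scales, 6 for the 5-point scales, and 7 for the 6-point scale.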

This math is applied to the specified ```_R``` columns within each subscale. Instead of overwriting the ```vars``` object, we create a new object called ```vars_final``` that holds a copy of the original data set with the reverse coded changes applied. The final step, ```na.omit```, drops any rows that still contain missing values.

In [8]:
vars_final = vars %>% 
    mutate_at2(
        list(c("wd4_R", "wd7_R", "wd8_R"), 
             c("open7_R", "open9_R"), 
             c("org_eff1_R"),
             c("job_sat2_R", "job_sat4_R", "job_sat6_R", "job_sat10_R", 
               "job_sat11_R", "job_sat12_R"), 
             c("cmfq2_2_R", "cmfq2_8_R", "cmfq2_11_R")), 
        list(~ 5 - ., 
             ~ 6 - ., 
             ~ 5 - .,
             ~ 7 - ., 
             ~ 6 - .)
        ) %>% 
    na.omit

The data preprocessing is now complete! Using the ```glimpse``` function and selecting only the columns that end with ```_R```, one can review if the changes worked effectively across both data sets (i.e., before and after reverse coding). 

In [9]:
glimpse(select_at(vars, vars(ends_with("_R"))))
glimpse(select_at(vars_final, vars(ends_with("_R"))))

Rows: 58
Columns: 15
$ wd4_R       <dbl> 1, 4, 2, 3, 2, 3, 2, 1, 3, 1, 2, 4, 3, 2, 2, 2, 3, 2, 1, …
$ wd7_R       <dbl> 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 3, 2, 2, 2, 4, 4, …
$ wd8_R       <dbl> 2, 1, 2, 2, 4, 2, 2, 3, 3, 3, 3, 2, 3, 2, 1, 4, 3, 2, 1, …
$ open7_R     <dbl> 4, 4, 3, 2, 3, 3, 1, 5, 3, 2, 2, 4, 4, 4, 4, 1, 4, 1, 5, …
$ open9_R     <dbl> 4, 4, 3, 2, 4, 2, 2, 1, 2, 3, 2, 4, 2, 4, 4, 2, 1, 1, 1, …
$ org_eff1_R  <dbl> 3, 4, 2, 2, 2, 4, 2, 2, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 1, …
$ job_sat2_R  <dbl> 3, 2, 2, 4, 1, 2, 2, 3, 2, 3, 2, 4, 6, 2, 1, 2, 2, 2, 1, …
$ job_sat4_R  <dbl> 6, 4, 4, 5, 3, 4, 6, 5, 4, 4, 3, 2, 5, 3, 4, 5, 2, 4, 3, …
$ job_sat6_R  <dbl> 5, 2, 3, 5, 1, 5, 3, 4, 4, 3, 2, 2, 5, 2, 4, 3, 2, 4, 3, …
$ job_sat10_R <dbl> …

### Data Analysis

Now that preprocessing is complete, we can move into our analyses. R, in all its glory, is very much a statistical language and though it takes a while to format the data, it is generally much easier to run analytic procedures. 

First, each scale is individually saved within a list called ```varsList```. Data structures are outside of the scope of this tutorial, but just know that data frames and lists are examples of them. Data frames (which most of us are accustomed to using) can be stored in lists and called upon individually (similar to how we can call out specific rows and columns from a data frame). 

The code selects columns that all begin with the same prefix from our ```vars_final``` data set and saves them in the list.

In [10]:
varsList = list(
    wd = select(vars_final, starts_with("wd")), 
    open = select(vars_final, starts_with("open")), 
    org_eff = select(vars_final, starts_with("org_eff")), 
    job_sat = select(vars_final, starts_with("job_sat")), 
    cmfq = select(vars_final, starts_with("cmfq")), 
    dem = select(dat, starts_with("dem"))
    )

We now have our individual scales, but it would be nice to have composite scores for each one as well. No fear, this can easily be done -- "easily" means the computer can do it with little effort if we can generate the correct set of instructions. 

The code below does the following: 
1. Creates a new object in ```varsList``` called ```comps```
2. Applies the function ```rowMeans``` to each data frame in the list separately, **except** the 6th data frame (i.e., ```dem```), to generate scale scores (means) while ignoring missing values via ```na.rm = TRUE```
3. Combines the calculations into a new data frame

The result should be a new data frame with only our scale scores for each individual (*n* = 56, because ```na.omit``` dropped the rows with missing responses from the original 58). 

In [11]:
varsList$comps <- as.data.frame(
  do.call(cbind, lapply(varsList[-6], 
                        function(x) rowMeans(x, na.rm = TRUE))
          ))
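The same `lapply` + `rowMeans` + `cbind` recipe can be traced by hand on a toy list. Everything here (`toy_list`, its column names, and its values) is invented so the arithmetic is easy to verify.

```r
# two tiny "scales", each with two items and two respondents
toy_list <- list(
    a = data.frame(a1 = c(1, 2), a2 = c(3, 4)),
    b = data.frame(b1 = c(5, NA), b2 = c(7, 8))
)

# row means per scale, then bind the results column-wise
toy_comps <- as.data.frame(
    do.call(cbind, lapply(toy_list,
                          function(x) rowMeans(x, na.rm = TRUE)))
)

toy_comps
#   a b
# 1 2 6
# 2 3 8
```

The second respondent's `b` score is 8 because `na.rm = TRUE` averages only the item that was answered.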

We can review the new data set by specifically calling it using the ```$``` symbol. The ```$``` symbol can be used in R to select specific objects that are nested. Think of our data frames from earlier during preprocessing and how each one has a column by row design - we can select specific columns by using ```vars_final$wd1``` to view ONLY that specific column. 

The same goes for lists except we are calling specific data frames this time. To pull the same column from ```varsList``` we would use ```varsList$wd$wd1``` because the column is nested within the data frame which is nested within the list. 
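A short sketch of the nesting, using an invented list called `toy_list` so it runs without the survey data:

```r
# a list holding two small, made-up data frames
toy_list <- list(
    wd   = data.frame(wd1 = c(3, 4), wd2 = c(2, 2)),
    open = data.frame(open1 = c(5, 4))
)

# what is stored in the list?
names(toy_list)  # "wd" "open"

# one level down: a whole data frame
toy_list$wd

# two levels down: a single column inside that data frame
toy_list$wd$wd1  # 3 4
```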

The great urban philosopher, Kendrick Lamar, once said, "It's levels to it you and I know..." He is indeed correct :) Take a peek at the ```comps``` data frame found within the ```varsList```.

In [12]:
glimpse(varsList$comps) #SUCCESS!

Rows: 56
Columns: 5
$ wd      <dbl> 3.000000, 2.555556, 2.555556, 2.666667, 2.777778, 2.666667, 2…
$ open    <dbl> 3.5, 3.3, 3.7, 4.3, 4.1, 3.6, 4.6, 4.1, 3.7, 4.1, 3.8, 3.0, 3…
$ org_eff <dbl> 3.166667, 2.833333, 3.166667, 2.666667, 3.500000, 2.333333, 3…
$ job_sat <dbl> 3.250000, 4.500000, 4.583333, 3.166667, 5.666667, 3.500000, 3…
$ cmfq    <dbl> 3.666667, 3.904762, 4.000000, 4.238095, 3.952381, 4.142857, 4…


In [17]:
#replace df with the name of the data set you want!
#hint -- won't run
# test = varsList$df
# cor(test)

### Discussion

We have successfully preprocessed the data by updating the column labels, changing our string values to numerics, reverse coding specific items, separating each individual scale, and generating a data frame containing the scale composite scores. You have been introduced to different data structures (i.e., data frames & lists) and how to pull specific things from each one.

That's quite a long and winding roller coaster ride which was hopefully more fun than terrifying. Now, it's your turn to use what we have done here to run some statistical analyses! If you have any questions, feel free to reach out to me via [email](mailto:dkgreen@ncsu.edu) - I will certainly try to respond within a reasonable time frame but please try not to simply hold-out for my reply. 

If you find yourself stuck, Google is your best friend, and this document is chock-full of the R terminology and vernacular to get you on the correct path. Effective Googling is an artisanal science as well.

### Hints

Below are some useful tips and functions that you may need to complete your analyses! 

If you need help determining what a particular function does or what arguments it needs in order to run properly, try placing a ```?``` in front of the function name (e.g., ```?cor```).
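Besides the help page, base R can also print a function's arguments directly, which is handy for a quick reminder. A small sketch using `cor`:

```r
# open the help page for cor(); equivalent to typing ?cor
# (commented out here so the block runs quietly in a script)
# help("cor")

# show the arguments a function accepts, with their defaults
args(cor)

# pull out just the argument names
names(formals(cor))  # "x" "y" "use" "method"
```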

#### Useful functions
- cbind, cbind.data.frame
- cor
- corrplot
- aov

Let's say one wants to extract a specific data frame from a list, run a correlation analysis, and provide a cool plot -- one could go about such a problem by doing the following:

In [14]:
##NOT RUN
# new_df = list$df_of_interest
# corrs = cor(new_df)
# corrplot(corrs, method = "color")
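For a version that runs as-is, the sketch below swaps in three columns from R's built-in `mtcars` data set as a stand-in for a scale data frame; that substitution is an assumption made purely so the example is self-contained. The plot is only drawn if `corrplot` is installed.

```r
# stand-in data: three numeric columns from a built-in data set
toy <- mtcars[, c("mpg", "hp", "wt")]

# correlation matrix (Pearson by default)
corrs <- cor(toy)
round(corrs, 2)

# draw the plot only when the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
    corrplot::corrplot(corrs, method = "color")
}
```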

If one wanted to extract specific columns from a data frame in the list and combine them to run some kind of analysis (e.g., ANOVA), one could do the following:

In [15]:
##NOT RUN
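# name each column so the model formula can refer to it
# new_df = cbind.data.frame(
#     column_of_interest1 = list$df_of_interest$column_of_interest1, 
#     column_of_interest2 = list$df_of_interest$column_of_interest2, 
#     column_of_interest3 = list$df_of_interest$column_of_interest3)
#
# tell aov where to find the columns via the data argument
# mod0 = aov(column_of_interest3 ~ column_of_interest1 + column_of_interest2, 
#            data = new_df)
# summary(mod0)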
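Here is a runnable sketch of the same pattern. It again borrows the built-in `mtcars` data as a stand-in (an assumption for illustration, not the survey data), and converts the grouping columns to factors so `aov` treats them as categories.

```r
# combine named columns into one data frame
toy <- cbind.data.frame(
    mpg = mtcars$mpg,          # numeric outcome
    cyl = factor(mtcars$cyl),  # grouping variable 1
    am  = factor(mtcars$am)    # grouping variable 2
)

# two-way ANOVA without an interaction term
mod0 <- aov(mpg ~ cyl + am, data = toy)
summary(mod0)
```

With the survey data, the outcome would be one of the composite scores and the grouping variables would come from the demographic columns.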