# Change runtime type to R

Remember, the first step when opening a Google Colab notebook is to change the runtime type from Python to R. Our code will not work otherwise!

# Exam instructions

This exam is only semi-structured. You will have to identify your own research question that we can analyze with the concepts we have learned in class. The dataset is a simulated HR dataset, so we will put our IO and data analysis knowledge into practice to tackle "real" questions.

**In this exam, you will need to do the following**:

1) Create a research question based on the dataset described below

2) Identify **one outcome** variable and **two predictor** variables. The outcome can be either continuous or binary (i.e., we can use multiple regression or logistic regression).

3) Complete the outline of typical data analysis procedures we have learned in this class.

## Permissible knowledge sources

1) You may use any code from (a) lectures, (b) textbook readings, and (c) demonstrations or other class activities/assignments.

2) AI use: You may use AI (Gemini) to help identify/debug errors in your code. You **may not** use AI to generate code *for you*. Ensure your settings are such that it does not automatically generate code for you. Doing so will constitute academic dishonesty. I use specific functions and follow my own coding style, so it is easy for me to identify code or styles that we have not used in this class.

3) You **may not** discuss the exam with your peers or seek help from them. You **can** ask me for help; I can guide you and help you progress if you are stuck.

# Exam dataset description

- **Employee_Name**: Full name of the employee (Last, First).

- **EmpID**:  Unique employee identification number.

- **MarriedID**:  Binary indicator (0 = Not Married, 1 = Married).

- **MaritalStatusID**:  Encoded marital status category (e.g., 0 = Single, 1 = Married, 2 = Divorced).

- **GenderID**: Encoded gender identifier (e.g., 0 = Female, 1 = Male).

- **DeptID**: Encoded department identifier (i.e., 1 = admin offices, 2 = executive office, 3 = IT/IS, 4 = software engineering, 5 = Production, 6 = sales). This variable corresponds to the `Department` variable.

- **PerfScoreID**: Encoded performance score rating (i.e., 1 = PIP [performance improvement plan], 2 = needs improvement, 3 = fully meets expectations, 4 = exceeds expectations). Corresponds to the `PerformanceScore` variable.

- **FromDiversityJobFairID**: Binary indicator if hired through a diversity job fair (0 = No, 1 = Yes).

- **Salary**: Current annual salary in U.S. dollars.

- **Termd**: Termination status (0 = Active, 1 = Terminated).

- **PositionID**: Encoded job position identifier. Numeric encoding of the `Position` variable.

- **Position**: Title of the employee's position.

- **State**: U.S. state abbreviation where employee is based.

- **Zip**:  Postal code of the employee's work location.

- **DOB**:  Date of birth of the employee (MM/DD/YYYY).

- **Sex**:  Sex of the employee (i.e., M = male, F = female).

- **MaritalDesc**:  Text description of marital status (e.g., Single, Married, Divorced).

- **CitizenDesc**:  Citizenship status description (e.g., U.S. Citizen, Non-Citizen).

- **HispanicLatino**:  Binary indicator for Hispanic or Latino ethnicity (Yes/No).

- **RaceDesc**:  Text description of race category.

- **DateofHire**:  Date the employee was hired (MM/DD/YYYY).

- **DateofTermination**:  Date the employee's employment ended (MM/DD/YYYY).

- **TermReason**:  Reason for termination if applicable.

- **EmploymentStatus**:  Full employment status description (e.g., Active, Terminated, Leave of Absence).

- **Department**:  Name of the department the employee belongs to.

- **ManagerName**:  Full name of the employee's manager.

- **ManagerID**:  Encoded identifier for the manager.

- **RecruitmentSource**:  Source through which the employee was recruited (e.g., LinkedIn, Indeed).

- **PerformanceScore**:  Text rating of performance (e.g., Exceeds, Fully Meets, Needs Improvement).

- **EngagementSurvey**:  Employee engagement survey score (0.0–5.0 scale).

- **EmpSatisfaction**:  Employee self-reported satisfaction score (1–5 scale).

- **SpecialProjectsCount**:  Number of special projects the employee has been assigned.

- **LastPerformanceReview_Date**:  Date of the last performance review (MM/DD/YYYY).

- **DaysLateLast30**:  Number of days the employee was late in the last 30 days.

- **Absences**:  Total number of absences in the past year.

# 1) State your research question and identify the outcome and two predictor variables:

Research question: ...

Outcome variable: ...

Predictor 1: ...

Predictor 2: ...

# 2) Load the packages we will need

Make sure to install any needed packages, too.

# Load the data

This step is completed for you; be sure to run the cell below to load the data.

In [None]:
## Set the URL to Casey's GitHub page where the dataset is located
FileURL <- "https://raw.githubusercontent.com/CaseyGio/Psyc6290/refs/heads/main/Datasets/HRData.csv"

## Read the csv file from GitHub and create a new object
## Note: read_csv() comes from the readr package (part of the tidyverse), so load it first
hr <- read_csv(url(FileURL))

## Check out the dataset
head(hr, n = 10)

# 3) Data cleaning

- If one or more variables should be cleaned to better analyze or visualize the data, please do so here.

- If you think the data *do not* need any additional cleaning, please explain why you consider the variables already clean.

Note: This is not a trick question; there are a variety of variables in this dataset. Either choice can be acceptable; just defend your answer if you do no cleaning.

# 4) Visualization(s) and interpretation

Generate an appropriate visualization based on your specified research question.

- Describe the resulting visualization(s). What pattern(s) start to emerge at this stage?

# 5) Estimate and interpret an interaction effects model

Estimate an interaction-effects model to address your specific research question. Be sure to use the correct method (i.e., multiple regression or logistic regression).

Interpret each of the following parameters by describing what the obtained value means and how it relates to the model & outcome:

- y-intercept:

- predictor 1 slope:

- predictor 2 slope:
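
As a reminder of the general formula syntax only, here is a minimal sketch on simulated data. Everything in it (`toy`, `y`, `x1`, `x2`, `y_bin`, `ModInt`, `ModLogit`) is a hypothetical name invented for illustration, not a variable from the exam dataset, and it uses base R functions that may differ from the ones we used in class.

```r
## Simulated toy data; toy, y, x1, and x2 are hypothetical, not exam variables
set.seed(6290)
n  <- 200
x1 <- rnorm(n)                          # continuous predictor
x2 <- rbinom(n, size = 1, prob = 0.5)   # binary predictor (0/1)
y  <- 1 + 0.5 * x1 + 0.8 * x2 + 0.3 * x1 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2)

## Continuous outcome: multiple regression
## x1 * x2 expands to x1 + x2 + x1:x2 (both main effects plus the interaction)
ModInt <- lm(y ~ x1 * x2, data = toy)
coef(ModInt)

## Binary outcome: logistic regression uses glm() with family = binomial
toy$y_bin <- as.integer(toy$y > median(toy$y))
ModLogit <- glm(y_bin ~ x1 * x2, data = toy, family = binomial)
coef(ModLogit)
```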

# 6) Interpret the significance of parameters and the overall model fit

- Calculate (e.g., via a function) the overall model fit. What is our $R^2$ value, and what does it mean?

- Given a significance threshold of $\alpha = 0.05$, interpret whether the estimated parameters are statistically significant or not.
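
One possible way to pull these quantities out of a fitted multiple regression is sketched below on simulated data. The names `toy`, `y`, `x1`, `x2`, `ModInt`, `ModSum`, and `PValues` are hypothetical, and base R's `summary()` is an assumption about tooling; the functions from class may differ. Note that `summary()` on a `glm()` fit does not report an $R^2$ value.

```r
## Simulated toy data; all object names here are hypothetical
set.seed(6290)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.4 * x2 + 0.3 * x1 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2)

## Fit the interaction model and store its summary
ModInt <- lm(y ~ x1 * x2, data = toy)
ModSum <- summary(ModInt)

## Overall model fit: proportion of outcome variance explained by the model
ModSum$r.squared
ModSum$adj.r.squared

## p-value for each parameter, compared against alpha = 0.05
PValues <- coef(ModSum)[, "Pr(>|t|)"]
PValues < 0.05
```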

# 7) Compare the interaction model to a main effect model

Estimate a similar model that only includes main effects (i.e., same outcome and predictors). Then compare the interaction and main effect models to see, based on a significance threshold of $\alpha = 0.05$, which model is a better fit for the data.
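
A nested-model comparison of this kind could look like the following sketch, again on simulated data with hypothetical names (`toy`, `ModMain`, `ModInt`, `Comparison`). It assumes base R's `anova()`, which may not be the comparison function we used in class.

```r
## Simulated toy data; all object names here are hypothetical
set.seed(6290)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.4 * x2 + 0.3 * x1 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2)

## Main-effects model (+) versus interaction model (*)
ModMain <- lm(y ~ x1 + x2, data = toy)
ModInt  <- lm(y ~ x1 * x2, data = toy)

## F-test of whether adding the interaction term improves fit
Comparison <- anova(ModMain, ModInt)
Comparison

## p-value for the comparison, to be judged against alpha = 0.05
Comparison[2, "Pr(>F)"]
```

For two `glm()` logistic models, the analogous call is `anova(ModMain, ModInt, test = "Chisq")`.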

# 8) Address your main research question

Given our data analyses, what can we conclude about the main research question? Please be descriptive and draw upon the answers to multiple questions from above (e.g., visualization, interaction model, model fit, statistical significance).

Research question answer: ...

