# Methods and Plan & Computational Code and Output

In [137]:
library(broom)
library(latex2exp)
library(tidymodels)
library(repr)
library(gridExtra)
library(faraway)
library(mltools)
library(leaps)
library(glmnet)
library(cowplot)
library(tidyverse)
library(modelr)

## 1. Methods and Plan

The goal of this analysis is to predict whether an employee will leave the company, based on various employee characteristics, using the dataset `employee_fulldata`, which contains all the variables provided by Kaggle. The proposed method is logistic regression, which is appropriate for binary classification problems such as this one, where the outcome variable `LeaveOrNot` is categorical with two levels ("Leave" or "Not Leave").

We split the data into training and testing sets and use forward selection to choose the predictor variables (such as `Education`, `Age`, `Gender`, and `City`) that best fit the logistic regression model.


- Why is this method appropriate?

  The response variable, `LeaveOrNot`, is binary, with two possible outcomes: "Leave" (1) or "Not Leave" (0). Logistic regression models the probability of an event occurring (in this case, an employee leaving the company) as a function of predictor variables, making it well suited to this type of classification problem. Forward selection adds predictors one at a time and stops short of including those with only a weak relationship to the response, which reduces model complexity and helps limit overfitting.

  
- Which assumptions are required, if any, to apply the method selected?

  The response variable is assumed to follow a binomial distribution. Logistic regression also assumes a linear relationship between the predictor variables and the log-odds of the response variable.

  The forward selection step, as implemented here with `regsubsets()`, fits ordinary least-squares models, so it additionally assumes a linear relationship between the predictor variables and the numerically coded response variable.

- What are the potential limitations or weaknesses of the method selected?

  
  Forward selection adds variables one at a time without explicitly considering interactions between them, so it may fail to capture interaction effects.
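
For reference, the log-odds assumption described above can be written out explicitly. This is the standard logistic regression form; the symbols $p_i$, $\beta_j$, $x_{ij}$ and $k$ are notation introduced here for illustration and do not appear elsewhere in this report.

$$
\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad
p_i = P(\text{LeaveOrNot}_i = \text{Leave} \mid x_{i1}, \dots, x_{ik})
$$

Each $\beta_j$ is the change in the log-odds of leaving for a one-unit increase in $x_{ij}$ with the other predictors held fixed, so $e^{\beta_j}$ is the corresponding multiplicative change in the odds.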

## 2. Implementation of a proposed model

In [122]:
## Load the dataset with all variables
employee_fulldata <- 
    read_csv("data/Employee.csv") 

Rows: 4653 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Education, City, Gender, EverBenched
dbl (5): JoiningYear, PaymentTier, Age, ExperienceInCurrentDomain, LeaveOrNot

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [123]:
## Some categorical variables are stored as numbers; use factor() to convert them to categorical form.
employee_fulldata$PaymentTier <- factor(employee_fulldata$PaymentTier, levels = c(1, 2, 3),
                                        labels = c("Low", "Median", "High"), ordered = TRUE)
employee_fulldata$LeaveOrNot <- factor(employee_fulldata$LeaveOrNot, levels = c(0, 1),
                                       labels = c("Not Leave", "Leave"), ordered = TRUE)
head(employee_fulldata)


Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
<chr>,<dbl>,<chr>,<ord>,<dbl>,<chr>,<chr>,<dbl>,<ord>
Bachelors,2017,Bangalore,High,34,Male,No,0,Not Leave
Bachelors,2013,Pune,Low,28,Female,No,3,Leave
Bachelors,2014,New Delhi,High,38,Female,No,2,Not Leave
Masters,2016,Bangalore,High,27,Male,No,5,Leave
Masters,2017,Pune,High,24,Male,Yes,2,Leave
Bachelors,2016,Bangalore,High,22,Male,No,0,Not Leave


In [124]:
set.seed(1234)
## Split the data into a training set (70%) and a testing set (30%), stratified by LeaveOrNot
employee_split <- 
    employee_fulldata %>%
    initial_split(prop = 0.7, strata = LeaveOrNot)

training_employee <- training(employee_split)
testing_employee <- testing(employee_split)
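
To illustrate what `strata = LeaveOrNot` is meant to achieve, here is a small sketch on simulated data. The `toy` data frame, its column names, and its class mix are made up for illustration and are not taken from `Employee.csv`; the only assumption about the library is that `initial_split()`, `training()` and `testing()` come from `rsample`, which `tidymodels` loads.

```r
library(rsample)

set.seed(1234)
# Simulated stand-in for the employee data with an arbitrary class mix
toy <- data.frame(
    outcome = factor(rep(c("Leave", "Not Leave"), times = c(340, 660))),
    x = rnorm(1000)
)

# Same call pattern as above: 70/30 split, stratified on the outcome
toy_split <- initial_split(toy, prop = 0.7, strata = outcome)

# Share of "Leave" in each piece; stratification keeps these nearly equal
prop_train <- mean(training(toy_split)$outcome == "Leave")
prop_test  <- mean(testing(toy_split)$outcome == "Leave")
c(train = prop_train, test = prop_test)
```

Running the same two `mean()` lines on `training_employee` and `testing_employee` would be a quick way to confirm that the real split preserved the leave rate.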

In [125]:
## Use forward selection to find the variables that best fit the predictive model
employee_forward_sel <- regsubsets(x = LeaveOrNot ~ ., nvmax = NULL,
                                  data = training_employee,
                                  method = "forward")

employee_forward_summary <- summary(employee_forward_sel)
employee_forward_summary

Subset selection object
Call: regsubsets.formula(x = LeaveOrNot ~ ., nvmax = NULL, data = training_employee, 
    method = "forward")
11 Variables  (and intercept)
                          Forced in Forced out
EducationMasters              FALSE      FALSE
EducationPHD                  FALSE      FALSE
JoiningYear                   FALSE      FALSE
CityNew Delhi                 FALSE      FALSE
CityPune                      FALSE      FALSE
PaymentTier.L                 FALSE      FALSE
PaymentTier.Q                 FALSE      FALSE
Age                           FALSE      FALSE
GenderMale                    FALSE      FALSE
EverBenchedYes                FALSE      FALSE
ExperienceInCurrentDomain     FALSE      FALSE
1 subsets of each size up to 11
Selection Algorithm: forward
          EducationMasters EducationPHD JoiningYear CityNew Delhi CityPune
1  ( 1 )  " "              " "          " "         " "           " "     
2  ( 1 )  " "              " "          " "         " "      
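
Note that `regsubsets()` scores candidate models by least squares, treating the response as numeric. A possible alternative, shown here only as a sketch and not as what this analysis does, is to run forward selection directly on the logistic model with `step()` from base R, which compares models by AIC. The data below are simulated and the names `toy`, `x1`, `x2`, `x3` are hypothetical.

```r
# Simulated stand-in data; column names are illustrative only
set.seed(1234)
n <- 500
toy <- data.frame(
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = rnorm(n)   # pure noise, unrelated to the outcome
)
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

# Start from the intercept-only logistic model and add one term at a time
null_fit <- glm(y ~ 1, data = toy, family = binomial)
forward_fit <- step(null_fit,
                    scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                    direction = "forward",
                    trace = 0)   # suppress the step-by-step log

# Terms kept by the AIC-based forward search
selected_terms <- attr(terms(forward_fit), "term.labels")
selected_terms
```

Because this search uses the binomial likelihood at every step, the selection criterion and the final model share the same assumptions, which is not the case when a least-squares search feeds a logistic fit.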

In [126]:
## Store the evaluation metrics for each model size so they can be compared for goodness of fit.
employee_forward_summary_df <- tibble(
    n_input_variables = 1:11,
    
    RSQ = employee_forward_summary$rsq,
    RSS = employee_forward_summary$rss,
    ADJ_R2 = employee_forward_summary$adjr2,
    Cp = employee_forward_summary$cp,
    BIC = employee_forward_summary$bic,
)

In [127]:
employee_forward_summary_df

n_input_variables,RSQ,RSS,ADJ_R2,Cp,BIC
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0.07530689,679.5203,0.07502281,325.402169,-238.8244
2,0.09926131,661.9171,0.09870769,234.702739,-316.2214
3,0.12009,646.6109,0.11927852,156.099311,-384.3322
4,0.1393013,632.4933,0.13824263,83.754907,-448.1423
5,0.14409599,628.9699,0.14277962,67.200291,-458.2482
6,0.15283385,622.5488,0.15126985,35.386287,-483.581
7,0.15622868,620.054,0.15441076,24.248876,-488.5704
8,0.15893463,618.0656,0.15686304,15.777309,-490.9437
9,0.16143937,616.2249,0.15911506,8.084381,-492.5692
10,0.1614592,616.2103,0.1588759,10.007652,-484.5576


In [128]:
# Select the model that minimizes Mallows' Cp and list its predictor variables
cp_min <- which.min(employee_forward_summary$cp)

selected_var <- names(coef(employee_forward_sel, cp_min))[-1]
selected_var
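
As a reminder of what is being minimized in this step, Mallows' $C_p$ for a candidate model with $d$ estimated coefficients (intercept included) is commonly written as below. The symbols $d$, $n$, $\mathrm{RSS}_d$ and $\hat\sigma^2$ are notation introduced for this note; $\hat\sigma^2$ is the residual variance estimated from the largest model considered, and $n$ is the number of training rows.

$$
C_p = \frac{\mathrm{RSS}_d}{\hat\sigma^2} - n + 2d
$$

The first term rewards a smaller residual sum of squares and the $2d$ term penalizes each extra coefficient, so a predictor is only worth adding when it lowers $\mathrm{RSS}_d$ by more than roughly $2\hat\sigma^2$. Among the rows shown in the metrics table, the 9-variable model has the smallest $C_p$ (about 8.08).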

In [129]:
## Rebuild the training set so that it contains the indicator variables chosen by forward selection
rearrange_training <- training_employee |>
    mutate(EducationMasters = ifelse(Education == "Masters", "Yes", "No"),
           CityNewDelhi = ifelse(City == "New Delhi", "Yes", "No"),
           CityPune = ifelse(City == "Pune", "Yes", "No"),
           PaymentTier.Q = ifelse(PaymentTier == "Median", "Yes", "No"))
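
One detail worth checking here: `PaymentTier.Q` in the `regsubsets()` output is the quadratic polynomial contrast that R assigns to an ordered factor, not literally a "Median" dummy. For a three-level ordered factor the two are related by a straight-line transformation, so swapping one for the other changes the scale and sign of the coefficient but not the fitted probabilities. A minimal base-R sketch of that relationship (the variable names are illustrative):

```r
# Default contrasts for a 3-level ordered factor: columns .L (linear) and .Q (quadratic)
poly3 <- contr.poly(3)
q_contrast <- poly3[, ".Q"]

# Indicator for the middle level ("Median" in this analysis)
is_middle <- c(0, 1, 0)

# Express .Q as intercept + slope * indicator
intercept <- q_contrast[1]
slope <- q_contrast[2] - q_contrast[1]
reconstructed <- intercept + slope * is_middle

# TRUE when the quadratic contrast is an exact linear function of the indicator
all.equal(unname(reconstructed), unname(q_contrast))
```

The slope is negative, so a positive coefficient on the "Median" indicator corresponds to a negative coefficient on `PaymentTier.Q`.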

In [131]:
## Fit the logistic regression model (binomial family) on the selected variables
rearrange_training_log <- 
    glm(formula = LeaveOrNot ~ EducationMasters+JoiningYear+CityNewDelhi+
        CityPune+PaymentTier.Q+Age+Gender+EverBenched+ExperienceInCurrentDomain,
        data = rearrange_training,
        family = binomial)

summary(rearrange_training_log)


Call:
glm(formula = LeaveOrNot ~ EducationMasters + JoiningYear + CityNewDelhi + 
    CityPune + PaymentTier.Q + Age + Gender + EverBenched + ExperienceInCurrentDomain, 
    family = binomial, data = rearrange_training)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -4.133e+02  4.652e+01  -8.884  < 2e-16 ***
EducationMastersYes        6.470e-01  1.145e-01   5.651 1.59e-08 ***
JoiningYear                2.054e-01  2.309e-02   8.896  < 2e-16 ***
CityNewDelhiYes           -6.862e-01  1.189e-01  -5.769 7.96e-09 ***
CityPuneYes                5.075e-01  1.022e-01   4.963 6.93e-07 ***
PaymentTier.QYes           8.430e-01  1.102e-01   7.650 2.00e-14 ***
Age                       -3.187e-02  8.577e-03  -3.716 0.000202 ***
GenderMale                -9.087e-01  8.480e-02 -10.715  < 2e-16 ***
EverBenchedYes             4.570e-01  1.281e-01   3.569 0.000359 ***
ExperienceInCurrentDomain -8.374e-02  2.635e-02  -3.178 0.001481 ** 
---
S

In [92]:
# Exponentiate the coefficient estimates to obtain odds ratios
rearrange_training_log_results  <-
    rearrange_training_log|>
    tidy()|>
    mutate(exp.estimate = exp(estimate)) 
rearrange_training_log_results

term,estimate,std.error,statistic,p.value,exp.estimate
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),-413.25984784,46.516534295,-8.88415,6.441159e-19,3.338324e-180
EducationMastersYes,0.64698583,0.11448143,5.651448,1.591019e-08,1.909776
JoiningYear,0.20541871,0.023090852,8.896108,5.78392e-19,1.228039
CityNewDelhiYes,-0.68623187,0.118947033,-5.769222,7.963821e-09,0.5034696
CityPuneYes,0.50748926,0.102245192,4.963454,6.925062e-07,1.661115
PaymentTier.QYes,0.84298413,0.110187133,7.650477,2.002351e-14,2.32329
Age,-0.03187216,0.008576556,-3.716195,0.0002022452,0.9686304
GenderMale,-0.90865142,0.084799521,-10.71529,8.628678e-27,0.4030674
EverBenchedYes,0.45704022,0.128067923,3.568733,0.0003587118,1.579392
ExperienceInCurrentDomain,-0.08374378,0.026348456,-3.178318,0.00148132,0.9196669
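
To make the `exp.estimate` column easier to read, the sketch below converts two of the estimates from the table above into odds ratios and percentage changes in the odds of leaving. The vector name `log_odds` and the choice of these two rows are just for illustration.

```r
# Coefficient estimates copied from the table above
log_odds <- c(GenderMale = -0.90865142, JoiningYear = 0.20541871)

# Odds ratio = exp(coefficient); percent change in odds = (odds ratio - 1) * 100
odds_ratio <- exp(log_odds)
pct_change_in_odds <- (odds_ratio - 1) * 100

round(odds_ratio, 3)          # GenderMale 0.403, JoiningYear 1.228
round(pct_change_in_odds, 1)  # GenderMale -59.7, JoiningYear 22.8
```

Read this way, and holding the other predictors fixed, the estimated odds of leaving are about 60% lower for male employees, and each later joining year is associated with roughly 23% higher odds.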


In [132]:
## Rebuild the testing set so that it contains the same indicator variables as the training set
rearrange_testing <- testing_employee |>
    mutate(EducationMasters = ifelse(Education == "Masters", "Yes", "No"),
           CityNewDelhi = ifelse(City == "New Delhi", "Yes", "No"),
           CityPune = ifelse(City == "Pune", "Yes", "No"),
           PaymentTier.Q = ifelse(PaymentTier == "Median", "Yes", "No"))

                              

In [134]:
## Use the predicted probabilities to compute the residuals and the RMSE on the training set
predicted_probabilities_training <- predict(rearrange_training_log, 
                                   newdata=rearrange_training,
                                   type = "response")


rearrange_training<-mutate(rearrange_training,
                          LeaveOrNot_P = ifelse(LeaveOrNot == "Leave", 1, 0))

# observed outcome, coded 1 for "Leave" and 0 for "Not Leave"
p_true_training <- rearrange_training$LeaveOrNot_P

#calculate residuals
residuals_training <- p_true_training - predicted_probabilities_training

rmse_red_glm_training<-sqrt(mean(residuals_training^2))


In [136]:
rmse_red_glm_training
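
The quantity stored in `rmse_red_glm_training` can be written out as follows. The symbols $y_i$ (the 0/1 outcome, `LeaveOrNot_P`), $\hat p_i$ (the predicted probability of leaving) and $n$ (the number of rows) are notation introduced for this note.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat p_i\right)^2}
$$

The mean squared term under the root is often called the Brier score, so this RMSE is its square root. It ranges from 0 (every probability is exactly right) to 1 (every probability is exactly wrong).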

In [96]:
## Use the predicted probabilities to compute the residuals and the RMSE on the testing set
predicted_probabilities_testing <- predict(rearrange_training_log, 
                                   newdata = rearrange_testing, 
                                   type = "response")


rearrange_testing<-mutate(rearrange_testing,
                          LeaveOrNot_P = ifelse(LeaveOrNot == "Leave", 1, 0))

# observed outcome, coded 1 for "Leave" and 0 for "Not Leave"
p_true_testing <- rearrange_testing$LeaveOrNot_P

#calculate residuals
residuals_testing <- p_true_testing - predicted_probabilities_testing

rmse_red_glm_testing<-sqrt(mean(residuals_testing^2))


In [97]:
rmse_red_glm_testing

## 3. Conclusion

The RMSE values for the training set (0.433) and the testing set (0.436) are very close, indicating that the model performs about as well on unseen data as on the training data and shows no sign of overfitting. The slight increase in RMSE on the testing set is reasonable. These results support the model's ability to predict the likelihood of employees leaving. However, an RMSE of about 0.43 is fairly large for an outcome coded as 0 or 1, which makes us skeptical about how well the model fits the data.
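
One way to judge whether an RMSE on a 0/1 outcome is large is to compare it with a baseline that predicts the overall leave rate for every employee. The sketch below shows that comparison on toy data only; `rmse_prob` is a hypothetical helper and `y_toy` is made up, so the numbers it produces say nothing about the employee data.

```r
# Hypothetical helper: RMSE between 0/1 outcomes and predicted probabilities
rmse_prob <- function(y, p_hat) sqrt(mean((y - p_hat)^2))

# Toy outcomes (not the employee data): 3 leavers out of 10
y_toy <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# Baseline: predict the observed leave rate for everyone
base_rate <- mean(y_toy)
rmse_baseline <- rmse_prob(y_toy, rep(base_rate, length(y_toy)))

# For a constant prediction at the base rate, the RMSE equals sqrt(p * (1 - p))
all.equal(rmse_baseline, sqrt(base_rate * (1 - base_rate)))
```

Computing the same baseline from `rearrange_training$LeaveOrNot_P` and `rearrange_testing$LeaveOrNot_P` and setting it beside 0.433 and 0.436 would show how much the selected predictors improve on the leave rate alone.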