<h1> Dataset </h1>
<h4 style = "line-height: 200%; font-weight: lighter;"> Oxford’s English dictionary defines attrition as: <br> <i><b> "The gradual reduction of a workforce by employees leaving and not being replaced rather than by redundancy." </b></i>

The business case chosen for this course is based on the dataset “Employee Attrition”, which consists of 35 columns and 1,470 rows of data about an organization’s employees.
The dataset was found in a notebook on the website “Kaggle.com”. The goal of this business case is to find patterns in the data that can tell us why workers quit their jobs. Moreover, the dataset consists of labeled data, which makes it well suited for training a classification model and making predictions in the R programming language.
</h4>

<h1> Business problem </h1>
<h4 style = "line-height: 200%; font-weight: lighter;"> 
The business question we want to answer in this project concerns employees leaving their jobs and is based on the attrition data. This knowledge gives the company a strong foundation for deciding whether to try to prevent employees from leaving or to start a hiring process. Our concrete problem question is therefore as follows:

<b><br><i>To what extent do employees in a company quit their jobs based on parameters such as distance from home, monthly salary and job satisfaction?</i></b>

In the following we describe how we plan to answer the question above. As our dataset contains many columns, we have selected the most relevant ones and tried to <b><i> build a model that is able to predict attrition with reasonable accuracy based on those chosen parameters</i></b>. Such a model would be valuable in a company setting, as an organization can prepare for workforce loss by using data about its employees similar to our dataset.
</h4>

<h1> Loading the data </h1>
<h4 style = "line-height: 200%; font-weight: lighter;"> 
The data we are using in this project is the attrition dataset from Kaggle.com. To do further analysis in R, we load the data into our notebook. Luckily the dataset is in .csv format, which enables us to easily load it using the read.csv function.
</h4>

In [None]:
# git_url <- "https://github.com/Hammi007/R_bigdata/blob/3e35e40e35a28f7e460bac125f9b63384c1cc4f3/Employee_Attrition.csv" # nolint
# data <- read.csv(git_url, header = TRUE, stringsAsFactors = FALSE) # nolint
df <- read.csv("Employee_Attrition.csv", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE) # nolint


<h1> Exploring the basics of the dataset </h1>
<h4 style = "line-height: 200%; font-weight: lighter;"> 

To get a better understanding of the data, we have chosen to do a light exploration using four different R functions: dim, str, unique and is.na.
The reason we conduct this basic exploration is to gain quick insights into the structure of the data.

Using <b>dim()</b>, we see that the dataset contains 1470 rows and 35 columns.
Through the <b>str()</b> function we see the different data types and can conclude that only two data types are used: int and chr.
This also applies to columns with binary values, e.g. 'Attrition' with "Yes"/"No" values, and to columns with a few distinct values, e.g. BusinessTravel with three different values.

Using <b>is.na()</b> we also see that there are no missing values ("NA") that we have to take into consideration.

</h4>

In [None]:
dim(df)
str(df)
unique(df$BusinessTravel)
table(is.na(df)) #making sure that there are no missing values
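
<h4 style = "line-height: 200%; font-weight: lighter;">
As a complement to the overall check above, a per-column view makes it easier to locate missing values if any appear. The following is a minimal sketch using base R only; the data frame <b>toy</b> is a hypothetical stand-in and not the real dataset.
</h4>

```r
# Hypothetical toy data standing in for the attrition data frame
toy <- data.frame(
  Age = c(41L, 49L, NA, 33L),
  Attrition = c("Yes", "No", "Yes", "No"),
  BusinessTravel = c("Travel_Rarely", "Travel_Frequently", NA, "Non-Travel"),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Class of each column, comparable to reading the str() output
column_types <- sapply(toy, class)
print(column_types)

# Number of distinct values per column (NA counts as a value here);
# low counts hint at categorical columns
distinct_values <- sapply(toy, function(col) length(unique(col)))
print(distinct_values)
```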


<h1> Cleaning and transformation of the data </h1>
<h4 style = "line-height: 200%; font-weight: lighter;"> After exploring the overall structure of the data, we want to conduct the following transformation steps: <br><br>
    <i>
        Step 1: Remove irrelevant columns<br>
        Step 2: Remove quotation marks ("") from the dataset <br>
        Step 3: Transform columns with binary or a few distinct values into factors<br> 
    </i>
</h4>

In [None]:
#Step 1: Keeping a selection of relevant columns
selection <- c(
    "Age", "Attrition", "BusinessTravel", "DistanceFromHome",
    "EducationField", "EnvironmentSatisfaction", "Gender", "HourlyRate",
    "JobInvolvement", "JobRole", "JobSatisfaction", "MaritalStatus",
    "MonthlyIncome", "NumCompaniesWorked", "OverTime", "RelationshipSatisfaction", # nolint
    "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany",	"YearsInCurrentRole", # nolint
    "YearsSinceLastPromotion"
)
df <- df[selection]

#Step 2,3: Converting multi-valued columns to factors and removing quotation marks ""
df$Attrition <- factor(df$Attrition, levels = c("Yes", "No"), labels = c("Yes", "No")) # nolint
df$BusinessTravel <- factor(df$BusinessTravel, levels = c("Travel_Rarely", "Travel_Frequently", "Non-Travel"), # nolint
                            labels = c("Travel_Rarely", "Travel_Frequently", "Non-Travel"))# nolint
df$EducationField <- factor(df$EducationField, levels = c("Life Sciences","Other","Medical","Marketing","Technical Degree","Human Resources"), # nolint
                            labels = c("Life Sciences","Other","Medical","Marketing","Technical Degree","Human Resources")) # nolint
df$EnvironmentSatisfaction <- factor(df$EnvironmentSatisfaction, levels = c(1,2,3,4), labels = c(1,2,3,4)) # nolint
df$Gender <- factor(df$Gender, levels = c("Male", "Female"), labels = c("Male", "Female")) # nolint
df$JobInvolvement <- factor(df$JobInvolvement, levels = c(1,2,3,4), labels = c(1,2,3,4)) # nolint
df$JobRole <- factor(df$JobRole, levels = c("Sales Executive","Research Scientist","Laboratory Technician", # nolint
                                            "Manufacturing Director","Healthcare Representative","Manager","Sales Representative", # nolint
                                            "Research Director","Human Resources"), # nolint
                     labels = c("Sales Executive","Research Scientist","Laboratory Technician","Manufacturing Director","Healthcare Representative","Manager","Sales Representative","Research Director","Human Resources")) # nolint
df$JobSatisfaction <- factor(df$JobSatisfaction, levels = c(1,2,3,4), labels = c(1,2,3,4)) # nolint
df$MaritalStatus <- factor(df$MaritalStatus, levels = c("Single","Married","Divorced"), labels = c("Single","Married","Divorced")) # nolint
df$OverTime <- factor(df$OverTime, levels = c("Yes", "No"), labels = c("Yes", "No")) # nolint
df$RelationshipSatisfaction <- factor(df$RelationshipSatisfaction, levels = c(1,2,3,4), labels = c(1,2,3,4)) # nolint
df$TrainingTimesLastYear <- factor(df$TrainingTimesLastYear, levels = c(0,1,2,3,4,5,6), labels = c(0,1,2,3,4,5,6)) # nolint
#df$WorkLifeBalance <- factor(df$WorkLifeBalance, levels = c(1,2,3,4), labels = c(1,2,3,4), ordered = TRUE) # nolint
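
<h4 style = "line-height: 200%; font-weight: lighter;">
The conversions above repeat the same pattern for every column. As an illustration of the idea, the sketch below converts several columns in one step with lapply(). The data frame <b>toy</b> and the vector <b>categorical</b> are hypothetical names introduced for this example only.
</h4>

```r
# Hypothetical toy data with the same kind of columns as the selection above
toy <- data.frame(
  Attrition = c("Yes", "No", "No"),
  OverTime = c("No", "Yes", "No"),
  JobSatisfaction = c(4L, 2L, 3L),
  MonthlyIncome = c(5993L, 5130L, 2090L),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical (assumed list for this illustration)
categorical <- c("Attrition", "OverTime", "JobSatisfaction")

# factor() derives the levels from the observed values when none are given
toy[categorical] <- lapply(toy[categorical], factor)

str(toy)
levels(toy$Attrition)
```

<h4 style = "line-height: 200%; font-weight: lighter;">
Note that factor() sorts the derived levels, so 'Attrition' gets the order "No", "Yes" here. When a specific level order matters, passing the levels explicitly, as in the cell above, remains the safer choice.
</h4>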

<h3>Exploratory data analysis</h3>
<h4 style = "line-height: 200%; font-weight: lighter;"> In this section we use R tools to further explore the data. Since our business case centers on attrition in a company, we start by investigating the categorical variable 'Attrition'.
We have already established that classification is relevant for solving the business case. Before doing any classification, we perform data preprocessing in order to prepare our data for training a classification model. The purpose of preprocessing is to clean and/or transform the data where necessary, and it is an important step whenever working with data. The labels can be found in the column “Attrition”, which records for each employee whether attrition was "Yes" or "No". The following figure displays the distribution of the two labels in the dataset:
</h4>

In [None]:
library(tidyverse); library(ggplot2)

options(repr.plot.width = 8, repr.plot.height = 5)
ggplot(df, aes(x = Attrition)) + geom_bar(aes(fill = Gender))


<h4 style = "line-height: 200%; font-weight: lighter;">
The imbalance seen above can skew a classification model toward the majority class, which is why we conduct an undersampling operation to equalize the number of data points with each label. Undersampling means drawing a random sample from the majority class ("No") that matches the number of data points in the minority class ("Yes"). The imbalance is addressed at the beginning of the chapter "Supervised learning - Classification", where the undersampling operation is performed.
</h4>


<h3>In the following we further explore our data through visualization</h3>
<h4 style = "line-height: 200%; font-weight: lighter;">How do the following parameters influence attrition:</h4>

<ul style = "">
  <li style = "margin-bottom: 10px;">DistanceFromHome</li>
  <li style = "margin-bottom: 10px;">JobSatisfaction</li>
  <li style = "margin-bottom: 10px;">MonthlyIncome</li>
</ul>


<h4 style = "line-height: 200%; font-weight: lighter;">
The following code is necessary for the data exploration below, as we want to see whether there is a noticeable visual difference between employees with attrition "Yes" and "No" with respect to the parameters DistanceFromHome, JobSatisfaction and MonthlyIncome.
</h4>

In [None]:
#Filter rows with attrition yes and no respectively:
df_yes <- filter(df, Attrition == "Yes")
df_no <- filter(df, Attrition == "No")


<h4> <b>JobSatisfaction</b></h4>
<h4 style = "line-height: 200%; font-weight: lighter;">In the following we see two visual representations of the JobSatisfaction of the employees with Attrition "No" and "Yes":</h4>

In [None]:
# geom_bar() counts rows per x value, so no y aesthetic is needed; the y label names the subset
ggplot(df_no, aes(x = JobSatisfaction)) + geom_bar(aes(fill = Gender)) + ylab("Attrition = 'No'") # nolint
ggplot(df_yes, aes(x = JobSatisfaction)) + geom_bar(aes(fill = Gender)) + ylab("Attrition = 'Yes'") # nolint


<h4><b>DistanceFromHome</b></h4>
<h4 style = "line-height: 200%; font-weight: lighter;">In the following we see two visual representations of the employees' DistanceFromHome with Attrition "No" and "Yes":</h4>

In [None]:
# geom_bar() counts rows per x value, so no y aesthetic is needed; the y label names the subset
ggplot(df_no, aes(x = DistanceFromHome)) + geom_bar(aes(fill = Gender)) + ylab("Attrition = 'No'") # nolint
ggplot(df_yes, aes(x = DistanceFromHome)) + geom_bar(aes(fill = Gender)) + ylab("Attrition = 'Yes'") # nolint


<h4><b>MonthlyIncome</b></h4> 
<h4 style = "line-height: 200%; font-weight: lighter;">In the following we see two visual representations of the employees' MonthlyIncome with Attrition "No" and "Yes":</h4>
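
<h4 style = "line-height: 200%; font-weight: lighter;">
A minimal sketch of how such a comparison could be drawn for MonthlyIncome. It assumes a histogram per attrition label, since income is continuous, and uses a hypothetical data frame <b>toy</b> with simulated values in place of df_no and df_yes. The number of bins is an arbitrary choice for the illustration.
</h4>

```r
library(ggplot2)

# Hypothetical toy data; the real notebook would use df_no and df_yes instead
set.seed(1)
toy <- data.frame(
  Attrition = rep(c("No", "Yes"), times = c(80, 20)),
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  MonthlyIncome = c(round(runif(80, 2000, 15000)), round(runif(20, 1000, 9000)))
)

# MonthlyIncome is continuous, so a histogram replaces the bar chart;
# facet_wrap() draws one panel per attrition label
p_income <- ggplot(toy, aes(x = MonthlyIncome)) +
  geom_histogram(aes(fill = Gender), bins = 20) +
  facet_wrap(~Attrition, scales = "free_y") +
  ylab("Number of employees")
print(p_income)

# Numeric companion to the plot: median income per attrition label
median_income <- tapply(toy$MonthlyIncome, toy$Attrition, median)
print(median_income)
```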

<h1>Supervised learning - Classification</h1>
<h4 style = "line-height: 200%; font-weight: lighter;">
Supervised learning is a technique used in data science in which a model learns from labeled data in order to predict labels for similar data that it has not seen. It generalizes from known labels and can automate decision-making processes based on its predictions. Supervised learning consists of a training and a testing process: training is the phase in which the model is generated, and testing is the process of applying the model to new, unseen data, also known as a test sample. Since the dataset includes labeled data, the following uses supervised learning in an attempt to answer the business problem.

</h4>

<h3>Undersampling</h3>
<h4 style = "line-height: 200%; font-weight: lighter;">
Since our data is imbalanced, we need to perform an undersampling operation.
The result is then visualized to show the impact of the undersampling:
</h4>

In [None]:
# Installing the packages
# install.packages("caTools")    # For sample.split (train/test split)
# install.packages("ROCR")       # For ROC curve to evaluate model
# Loading package
library(caTools)

#Balancing data: (undersampling)
yes <- which(df$Attrition == "Yes") # which() returns the row indices where the condition is TRUE
no <- which(df$Attrition == "No")
no <- sample(no, length(yes)) # random sample of "No" rows, matching the number of "Yes" rows #nolint

#Save as df2 so that df still can be used for clustering without being affected by undersampling:
df2 <- df[c(no,yes),] # all "Yes" rows plus the sampled "No" rows

#Converting Yes and No to 1 and 0:
df2$Attrition <- ifelse(df2$Attrition == "Yes", 1, 0)

#Plotting attrition yes and no to visualize the impact of undersampling:
ggplot(df2, aes(x = Attrition)) + geom_bar(aes(fill = Gender))

<h4 style = "line-height: 200%; font-weight: lighter;">
The undersampling operation comes with one of the drawbacks of this technique, which is sacrificing data: rows from the majority class are discarded. The result is shown in the plot above and can be compared with the same plot before undersampling.
</h4>

<h1>Train and test split</h1>
<h4 style = "line-height: 200%; font-weight: lighter;">
The train and test split prepares our data for classification, as we need to train a classifier on a training sample and then test its ability to predict on a test sample that the model has not yet seen. For the train and test split we use the following R code:

</h4>
    

In [None]:

#Train/test split and Classification:
# sample.split() expects the label vector, so the labels of the undersampled data are passed
split <- sample.split(df2$Attrition, SplitRatio = 0.7) # 70% training data and 30% test data

train_reg <- subset(df2, split == TRUE) # 70%
test_reg <- subset(df2, split == FALSE) # 30%
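
<h4 style = "line-height: 200%; font-weight: lighter;">
To make the idea of the split concrete, the following base R sketch performs a 70/30 split on a hypothetical data frame <b>toy</b> while keeping the label proportions equal in both parts. It illustrates the principle under these assumptions and is not the caTools implementation.
</h4>

```r
# Hypothetical balanced toy data, similar in shape to an undersampled set
set.seed(42)  # fixed seed so the split is reproducible
toy <- data.frame(
  Attrition = rep(c(0, 1), each = 50),
  MonthlyIncome = round(runif(100, 1000, 20000))
)

# Draw 70% of the row indices separately within each label (stratified split)
rows_by_label <- split(seq_len(nrow(toy)), toy$Attrition)
train_idx <- unlist(
  lapply(rows_by_label, function(idx) sample(idx, size = round(0.7 * length(idx)))),
  use.names = FALSE
)

train_toy <- toy[train_idx, ]
test_toy <- toy[-train_idx, ]

# Both parts keep the 50/50 label balance
table(train_toy$Attrition)
table(test_toy$Attrition)
```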


<h4 style = "line-height: 200%; font-weight: lighter;">
This section shows the operations used to train a model on the training part of the dataset, in order to later predict attrition from the parameters provided about a given employee. The following code trains a logistic regression classification model:

</h4>

In [None]:
logistic_model <- glm(Attrition ~., # trying to learn to predict Attrition. Finding patterns in data 
                      data = train_reg, # using the train_reg df
                      family = "binomial") # using binomial model
   
# Summary
summary(logistic_model)
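
<h4 style = "line-height: 200%; font-weight: lighter;">
The coefficients reported by summary() are on the log-odds scale. A common way to read them is to exponentiate them into odds ratios. The sketch below shows this on simulated, hypothetical data with a single predictor; the names <b>toy</b> and <b>toy_model</b> and the simulated effect size are assumptions made for the illustration.
</h4>

```r
set.seed(7)
n <- 200
# Hypothetical predictor values
toy <- data.frame(DistanceFromHome = round(runif(n, 1, 29)))
# Simulate attrition so that longer distances raise the odds of leaving
toy$Attrition <- rbinom(n, 1, plogis(-2 + 0.1 * toy$DistanceFromHome))

toy_model <- glm(Attrition ~ DistanceFromHome, data = toy, family = "binomial")

# exp() turns log-odds coefficients into odds ratios
odds_ratios <- exp(coef(toy_model))
print(odds_ratios)
```

<h4 style = "line-height: 200%; font-weight: lighter;">
An odds ratio above 1 means that the odds of attrition rise as the predictor increases, and a value below 1 means that they fall.
</h4>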


<h4 style = "line-height: 200%; font-weight: lighter;">
After the model has been trained, predictions are made on the test data, after which the accuracy of these predictions is calculated. These operations are performed using the following R code:
</h4>

In [None]:
# Predict test data based on model
predict_reg <- predict(logistic_model, # the fitted model, which holds the learned patterns
                       test_reg, # the test data: the 30% the model has not seen
                       type = "response") # return probabilities instead of log-odds
predict_reg


<h4 style = "line-height: 200%; font-weight: lighter;">
The output seen above contains the predictions produced by our classification model "logistic_model". It consists of a number between 0 and 1 for each row in the test sample, which is the predicted probability of attrition.
</h4>

<h4 style = "line-height: 200%; font-weight: lighter;">
One way of assessing a classifier is the area under the curve method. The ROC curve plots the true positive rate against the false positive rate across all thresholds, and the area under the curve (AUC) summarizes it in a single number between 0 and 1. The larger the area under the curve, the better the classifier separates the two classes. In the following we use R to create a ROC curve and calculate the AUC:
</h4>

In [None]:
library(ROCR)

# ROC curve and AUC (area under the curve)
ROCPred <- prediction(predict_reg, test_reg$Attrition) # pairs the predicted probabilities with the real labels
ROCPer <- performance(ROCPred, measure = "tpr", # true positive rate
                             x.measure = "fpr") # false positive rate
   
auc <- performance(ROCPred, measure = "auc") # calculating the area under curve values (AUC)  0.8
auc <- auc@y.values[[1]]
   
# Plotting curve
plot(ROCPer)
legend(.6, .4, auc, title = "AUC", cex = 1)
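
<h4 style = "line-height: 200%; font-weight: lighter;">
AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following base R sketch computes it that way on hypothetical scores and labels. It illustrates the idea and is not the ROCR implementation.
</h4>

```r
# Hypothetical predicted probabilities and true labels (1 = attrition)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every positive score with every negative score;
# ties count as half a win
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
manual_auc <- mean(wins)
print(manual_auc)  # 13 of 16 pairs are ordered correctly: 0.8125
```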


<h4 style = "line-height: 200%; font-weight: lighter;">
The figure above takes the raw output from the classification predictions and visualizes the ROC curve together with the area under it. In order to calculate the accuracy of the predictions, we use thresholding to turn them into “yes” and “no” predictions, as the predictions before thresholding are values between 0 and 1, where values closer to 1 indicate “yes” and values closer to 0 indicate “no” for attrition.
    
The following creates a confusion matrix and calculates the accuracy of the trained classification model based on the predictions on the test data:
</h4>

In [None]:
# Thresholding:
predict_reg <- ifelse(predict_reg > 0.5, 1, 0) # probabilities above the 0.5 threshold become 1, all others 0
   
# Evaluating model accuracy.
# Real values in the rows, and predicted values in the columns.
table(test_reg$Attrition, predict_reg) # confusion matrix: 0->0 and 1->1 are correct predictions, 1->0 and 0->1 are errors

missing_classerr <- mean(predict_reg != test_reg$Attrition) # share of wrong predictions (misclassification error)
print(paste('Accuracy =', 1 - missing_classerr)) # accuracy = 1 - misclassification error
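
<h4 style = "line-height: 200%; font-weight: lighter;">
Accuracy is only one number that can be read from a confusion matrix. As an illustration, the sketch below derives precision and recall for the "attrition" class as well. The vectors <b>actual</b> and <b>predicted</b> are hypothetical and do not come from the model above.
</h4>

```r
# Hypothetical true labels and thresholded predictions (1 = attrition)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

# Fix the level order so the table always has both rows and columns
cm <- table(Actual = factor(actual, levels = c(0, 1)),
            Predicted = factor(predicted, levels = c(0, 1)))
print(cm)

tp <- cm["1", "1"]  # left, and predicted to leave
fp <- cm["0", "1"]  # stayed, but predicted to leave
fn <- cm["1", "0"]  # left, but predicted to stay
tn <- cm["0", "0"]  # stayed, and predicted to stay

accuracy  <- (tp + tn) / sum(cm)  # share of all predictions that are correct
precision <- tp / (tp + fp)       # share of predicted leavers who actually left
recall    <- tp / (tp + fn)       # share of actual leavers the model found
print(c(accuracy = accuracy, precision = precision, recall = recall))
```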


<h3>Conclusion - supervised learning</h3>
<h4 style = "line-height: 200%; font-weight: lighter;">
    The accuracy shown above tells us that the trained logistic regression classifier predicts the labels of the test sample with an accuracy of 74%. The confusion matrix gives us insight into how many predictions for each class, “Yes” and “No”, were correct or incorrect.

We can conclude that a company in possession of a dataset like the one used in this project can use supervised learning and classification to support decision making regarding attrition, with predictions that were correct in approximately 74% of the cases in our test sample. One could argue, however, that a larger dataset or a smaller selection of parameters could result in a more representative and accurate model.

</h4>

<h3>Unsupervised learning - clustering</h3>
<h4 style = "line-height: 200%; font-weight: lighter;">
Unsupervised learning is a technique used in data science to extract information from data without any given label classes. A popular form of unsupervised learning is clustering, where clusters are created without any prior knowledge of each cluster's meaning. The clusters are created by grouping data objects that are similar according to the characteristics in the data.

Even though our labeled dataset makes classification the natural choice, we are still interested in performing a form of unsupervised learning such as clustering, in order to explore what information we can acquire about the data without using the labels.

For this purpose we chose the columns “Age” and “MonthlyIncome”. We chose these two columns because of our previous knowledge about monthly income from the exploratory data analysis chapter, and because we wanted to keep the clustering two-dimensional for visualization purposes. 

We noticed that the values for age and monthly income have a scaling problem: monthly income values are considerably larger than age values. K-means groups points by distance, so without scaling the age column would have almost no noticeable impact on the clustering. We therefore scale the data; the impact of scaling is shown later in the chapter. This counts as a preprocessing operation, as it prepares the data to be correctly represented in our clustering.

The following code is responsible for this operation:
</h4>


In [None]:
df$scaledAge <- as.numeric(scale(df$Age)) # scaling values standardization.
df$scaledIncome <- as.numeric(scale(df$MonthlyIncome))

data <- select(df, scaledAge, scaledIncome)
summary(data)
scaledData <- data
head(scaledData, 3) # can see that values are comparable
# another check could be to use standard deviation
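
<h4 style = "line-height: 200%; font-weight: lighter;">
The check mentioned in the last comment can be sketched as follows. With its default arguments, scale() subtracts the column mean and divides by the standard deviation, so every scaled column should end up with mean 0 and standard deviation 1. The vectors <b>age</b> and <b>income</b> are hypothetical values used for the illustration.
</h4>

```r
# Hypothetical ages and incomes on very different scales
age <- c(25, 32, 41, 50, 58)
income <- c(2100, 4300, 6500, 12000, 19000)

# scale() standardizes each value: (x - mean(x)) / sd(x)
scaled_age <- as.numeric(scale(age))
manual_age <- (age - mean(age)) / sd(age)
scaled_income <- as.numeric(scale(income))

# The built-in and the manual calculation agree
all.equal(scaled_age, manual_age)

# After scaling, both columns have mean 0 and standard deviation 1
round(c(mean(scaled_age), sd(scaled_age)), 10)
round(c(mean(scaled_income), sd(scaled_income)), 10)
```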


<h4 style = "line-height: 200%; font-weight: lighter;">
Next we create an elbow graph, which is used in clustering to decide on an appropriate number of clusters for a given dataset. The code for creating the elbow graph is as follows:
</h4>

In [None]:
mydata <- scaledData # renaming
k_max <- 15 #maximum of clusters in the elbow graph

#plotting the elbow-graph
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var)) 
for (i in 2:k_max) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:k_max, wss, type="b", xlab="Number of Clusters",
     ylab="Sum of square errors (SSE)")
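
<h4 style = "line-height: 200%; font-weight: lighter;">
K-means starts from randomly chosen centers, so the elbow graph can look slightly different on every run. As a hedged variation on the idea above, the sketch below fixes the random seed and lets kmeans() try several random starts per value of k through its nstart argument. The matrix <b>toy</b> is simulated, hypothetical data, and the seed and the number of starts are arbitrary choices.
</h4>

```r
set.seed(123)  # k-means starts from random centers, so fix the seed
# Hypothetical scaled data with three loose groups (90 rows, 2 columns)
toy <- rbind(
  matrix(rnorm(60, mean = -2, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 2, sd = 0.4), ncol = 2)
)

k_values <- 1:8
# nstart = 10 reruns k-means from 10 random starts and keeps the best result
wss_toy <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

plot(k_values, wss_toy, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
```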


<h4 style = "line-height: 200%; font-weight: lighter;">
Based on the elbow graph, we choose 6 clusters, because there are no significant changes after the 6th cluster. Thereafter, the following code is executed in order to compute and visualize the clusters.
    
The code first makes use of the unscaled data in order to show the impact of scaling the data for the purpose of clustering. 

The output from the code is as follows:
</h4>

<h4>In the following we can see the impact of not scaling the data:</h4>

In [None]:
data_unscaled <- select(df, Age, MonthlyIncome)
Kmeans <- kmeans(data_unscaled, 6) # takes two arguments, data and number of clusters

plot(data_unscaled$Age, data_unscaled$MonthlyIncome, col = Kmeans$cluster,
     pch = Kmeans$cluster, main = "K-means without scaling data", 
     xlab = "age", ylab = "income")
points(Kmeans$centers[ ,1], Kmeans$centers[ ,2], pch = 23,
       col = 'maroon', bg = 'lightblue', cex = 3)
text(Kmeans$centers[ ,1], Kmeans$centers[ ,2], cex = 1.1,
     col = 'black', attributes(Kmeans$centers)$dimnames[[1]]) #can see that age is insignificant 


<h4 style = "line-height: 200%; font-weight: lighter;">
The plot displays six clusters, as we chose earlier. We notice that the clusters are essentially unaffected by the “age” column, as the values in the age column are considerably smaller than the ones in the monthly income column.
</h4>    

<h4>In the following we can see the impact of scaling the data:</h4>

In [None]:
KmeansScaling <- kmeans(scaledData, 6)

plot(scaledData$scaledAge, scaledData$scaledIncome, col = KmeansScaling$cluster, #asp = 1, #xlim=c(1:90), ylim=c(200:178677),
     pch = KmeansScaling$cluster, main = "K-means with scaling data", 
     xlab = "age", ylab = "income")
points(KmeansScaling$centers[ ,1], KmeansScaling$centers[ ,2], pch = 23,
       col = 'maroon', bg = 'lightblue', cex = 3)
text(KmeansScaling$centers[ ,1], KmeansScaling$centers[ ,2], cex = 1.1,
     col = 'black', attributes(KmeansScaling$centers)$dimnames[[1]])

<h4 style = "line-height: 200%; font-weight: lighter;">
The plot above shows a clear difference in the clustering as a result of scaling the data. 
</h4>
    
<h3>Conclusion - unsupervised learning</h3>
<h4 style = "line-height: 200%; font-weight: lighter;">
The clustering output gives us insight into the company’s employees based on their age and monthly income. This insight can be used to further investigate whether some clusters are more likely than others to have attrition "Yes" or "No". This investigation is beyond the scope of this project, but is mentioned to demonstrate an understanding of how unsupervised learning, and specifically clustering, can be used to answer a question such as the one in this project.
</h4>
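
<h4 style = "line-height: 200%; font-weight: lighter;">
As a rough illustration of what such a follow-up investigation could look like, the sketch below cross-tabulates cluster membership against the attrition label. All data is simulated and hypothetical, the choice of three clusters is arbitrary, and the result says nothing about the real dataset.
</h4>

```r
set.seed(99)
# Hypothetical scaled features and attrition labels
toy <- data.frame(
  scaledAge = rnorm(120),
  scaledIncome = rnorm(120),
  Attrition = sample(c("Yes", "No"), 120, replace = TRUE, prob = c(0.2, 0.8))
)

# Cluster on the two scaled columns only; the label is not used here
toy_km <- kmeans(toy[, c("scaledAge", "scaledIncome")], centers = 3, nstart = 10)

# Share of "No" and "Yes" within each cluster (each row sums to 1)
attrition_by_cluster <- prop.table(
  table(Cluster = toy_km$cluster, Attrition = toy$Attrition),
  margin = 1
)
print(round(attrition_by_cluster, 2))
```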

<h1>
Discussion
</h1>  
    
<h4 style = "line-height: 200%; font-weight: lighter;">
Dataset quality is very important when working with data, and especially for classification and clustering. The definition of quality can differ depending on the goal of the project and the problem that needs solving. For this project, the dataset we have chosen was subject to undersampling because of the imbalance between the binary classes “Yes” and “No” for attrition. Moreover, the size of the dataset has a big impact on the interpretation of the results. For example, the classification accuracy we obtained in this project is not necessarily representative of how a logistic regression classifier would behave on a larger dataset with the same type of data, and a larger dataset would give a more reliable estimate. It is important to mention that a bigger dataset would in some cases come with storage and/or computational power requirements. Especially in the case of this project, the data would need to be treated carefully since it includes sensitive information about the employees. By treated carefully we mean that a company doing a similar project with a similar dataset would have to make sure to follow the GDPR rules. 
</h4>
<h4 style = "line-height: 200%; font-weight: lighter;">
One problem we could have faced if we had not undersampled our dataset is a classification model that only guesses one class and still produces a high prediction accuracy. For example, we could get 90% accuracy when in reality the classification model is faulty and only guesses "No", because 90% of the test data consists of "No" rows. Since the classification model sees far more examples of one class than the other during training, it becomes less likely to predict the minority class. 
Moreover, the quality of the dataset plays a big role in operations such as classification, which was central to answering our business question. The quality of a dataset can be interpreted in different ways, but in our case the essence of quality lies in how representative the data is of the whole company.
</h4>
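
<h4 style = "line-height: 200%; font-weight: lighter;">
The 90% example can be verified with a few lines of R. The sketch uses hypothetical labels and a trivial "model" that always answers "No"; the names <b>actual</b> and <b>always_no</b> are introduced for this illustration only.
</h4>

```r
# Hypothetical imbalanced test labels: 90 "No" and 10 "Yes"
actual <- rep(c("No", "Yes"), times = c(90, 10))

# A trivial "model" that always predicts the majority class
always_no <- rep("No", length(actual))

# Accuracy looks high even though nothing was learned
baseline_accuracy <- mean(always_no == actual)
print(baseline_accuracy)  # 0.9

# Share of actual "Yes" cases that were found: none
recall_yes <- sum(always_no == "Yes" & actual == "Yes") / sum(actual == "Yes")
print(recall_yes)  # 0
```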

<h3>Ethics and legal issues</h3>
<h4 style = "line-height: 200%; font-weight: lighter;">
Our business case deals with sensitive data, where parameters such as monthly income, job satisfaction and distance from home describe the employee. Therefore, it is crucial to handle this kind of data with care and respect, which places demands on the company’s environment and security architecture, so that all data describing the employees is kept safe. When storing sensitive data, it is important to take GDPR into consideration in order to comply with the law.

The final product of our project is a model that can predict whether employees will leave the company or stay. The model is developed for use by a company, e.g. an HR department, so it is important to consider any ethical issues in using this tool. As the model is based on prediction and probability, its output is not the sole truth and therefore cannot be used as such in decision making. The model can be used as a guide to evaluate whether employees in a given company are likely to leave, so that the company is able to take action and initiate either hiring or preventative processes. When the model is used as a guide in this process, we would argue that no severe ethical issues arise. But if the model were to be used as the sole foundation for decisions about hiring or about preventative measures to make an employee stay, one could argue that these decisions were based on probability and not facts, which means a decision could be wrong and have negative consequences.
</h4>