In [2]:
# Dataset and Analysis Tools
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Introduction
> Employee data and attrition status are important to study because they can help companies understand why employees leave their jobs and what can be done to prevent it. By analyzing employee data, companies can identify trends in employee behavior and performance that may contribute to high attrition rates. This information can then be used to develop strategies for retaining employees and improving job satisfaction.

# Statement of the Problem 

> The goal of this case study is to use Exploratory Data Analysis (EDA) to identify factors that may affect an employee's attrition status. The study looks at factors such as age, gender, frequency of travel, job involvement, and overtime to see which of them contribute to an employee leaving their job. Some of the most pertinent questions about attrition factors are as follows:
>> 1. What is the overall attrition rate in the company?
>> 2. Which departments have the highest and lowest attrition rates?
>> 3. Does education level impact attrition?
>> 4. Are there any relationships between salary levels and attrition?
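
As a rough sketch of how the first two questions might be approached with pandas, the snippet below computes an overall attrition rate and a per-department rate. The small DataFrame is made-up illustration data, not rows from the dataset; only the `Attrition` and `Department` column names and their values follow the dataset's conventions.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Research & Development",
                   "Research & Development", "Human Resources", "Sales"],
    "Attrition": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# Boolean mask: True where the employee left the company.
left = toy["Attrition"].eq("Yes")

# Overall attrition rate: the mean of a boolean mask is the share of True values.
overall_rate = left.mean()

# Attrition rate per department, sorted from highest to lowest.
rate_by_department = (
    left.groupby(toy["Department"]).mean().sort_values(ascending=False)
)

print(f"Overall attrition rate: {overall_rate:.1%}")
print(rate_by_department)
```

The same pattern should carry over to the full data by swapping `toy` for the loaded DataFrame, assuming `Attrition` contains only 'Yes' and 'No'.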

# Methodology
> The IBM data was accessed via Kaggle and was found to require minimal cleaning and alterations. Most of the tests use a chi-squared analysis to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. If a significant difference is found, the data is either placed in a crosstab or measured with the Pearson correlation coefficient, depending on the type of factor. An additional column, 'Attrition_Boolean', is added so that the 'Yes' or 'No' values of the 'Attrition' column can be used with the Pearson correlation coefficient. Finally, the factors found to be correlated are listed to gain insight into what affects employee attrition.
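
A minimal sketch of the workflow described above, using a small made-up DataFrame in place of the real data: build a contingency table with `pd.crosstab`, run `chi2_contingency` on it, and derive a numeric `Attrition_Boolean` column for a Pearson correlation. The 0.05 significance level and the 1/0 encoding are assumptions; the text does not state which threshold or encoding was used.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes"],
    "Age":       [24, 29, 31, 45, 38, 50, 41, 36, 47, 27],
})

# Observed frequencies: one row per OverTime level, one column per Attrition level.
observed = pd.crosstab(toy["OverTime"], toy["Attrition"])

# Chi-squared test of independence. `expected` holds the frequencies that
# independence would imply. With counts this small the approximation is
# unreliable; the toy data only demonstrates the mechanics.
chi2, p_value, dof, expected = chi2_contingency(observed)

ALPHA = 0.05  # assumed significance level, not stated in the text
is_significant = p_value < ALPHA

# Encode 'Yes'/'No' as 1/0 (assumed encoding) so Pearson's r can be computed.
toy["Attrition_Boolean"] = toy["Attrition"].map({"Yes": 1, "No": 0})
pearson_r = toy["Age"].corr(toy["Attrition_Boolean"])  # Pearson is the default

print(observed)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}, significant={is_significant}")
print(f"Pearson r (Age vs Attrition_Boolean) = {pearson_r:.3f}")
```

Note that `chi2_contingency` applies Yates' continuity correction by default when the table is 2x2, which is the case for `OverTime` against `Attrition`.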

# Dataset 
> Originally obtained through the IBM website and distributed through GitHub, 'WA_Fn-UseC_-HR-Employee-Attrition.csv' is a dataset containing 29 columns and 1470 rows of data on IBM employees. The columns used from the dataset include, but are not limited to:


|**Variable Name**   |**Data Type**                          |
|-------------------:|:--------------------------------------|
|`Age`|int|
|`Attrition`|string ['Yes', 'No']|
|`BusinessTravel`|string ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel']|
|`Department`|string ['Sales', 'Research & Development', 'Human Resources']|
|`DistanceFromHome`|int|
|`Education`|int [1, 2, 3, 4, 5]|
|`EducationField`|string ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources']|
|`EnvironmentSatisfaction`|int [1, 2, 3, 4]|
|`Gender`|string ['M', 'F']|
|`JobInvolvement`|int [1, 2, 3, 4]|
|`JobLevel`|int [1, 2, 3, 4, 5]|
|`JobRole`|string ['Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources']|
|`JobSatisfaction`|int [1, 2, 3, 4]|
|`MaritalStatus`|string ['Single', 'Married', 'Divorced']|
|`MonthlyIncome`|float|
|`NumCompaniesWorked`|int|
|`Over18`|string ['Y']|
|`OverTime`|string ['Yes', 'No']|
|`PercentSalaryHike`|float|
|`PerformanceRating`|int [1, 2, 3, 4]|
|`RelationshipSatisfaction`|int [1, 2, 3, 4]|
|`TotalWorkingYears`|int|
|`TrainingTimesLastYear`|int|
|`WorkLifeBalance`|int [1, 2, 3, 4]|
|`YearsAtCompany`|int|
|`YearsInCurrentRole`|int|
|`YearsSinceLastPromotion`|int|
|`YearsWithCurrManager`|int|

|**Education Scale** |**Explanation**                      |
|-------------------:|:--------------------------------------|
|1|Below College|
|2|College|
|3|Bachelor|
|4|Master|
|5|Doctor|

The following scale applies to Environment Satisfaction, Job Involvement, Job Satisfaction, and Relationship Satisfaction:

|**Scale**           |**Explanation**                        |
|-------------------:|:--------------------------------------|
|1|Low|
|2|Medium|
|3|High|
|4|Very High|

|**Performance Rating Scale** |**Explanation**                      |
|-------------------:|:--------------------------------------|
|1|Low|
|2|Good|
|3|Excellent|
|4|Outstanding|

|**Work Life Balance Scale** |**Explanation**                      |
|-------------------:|:--------------------------------------|
|1|Bad|
|2|Good|
|3|Better|
|4|Best|
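
If the coded columns need readable labels for crosstabs or plots, one possible approach is to map the integer codes through dictionaries transcribed from the scale tables above. The dictionary names and the helper `label_codes` are hypothetical, introduced here only for illustration, and the sample codes are made up.

```python
import pandas as pd

# Code-to-label mappings transcribed from the scale tables above.
EDUCATION_LABELS = {1: "Below College", 2: "College", 3: "Bachelor",
                    4: "Master", 5: "Doctor"}
SATISFACTION_LABELS = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}
PERFORMANCE_LABELS = {1: "Low", 2: "Good", 3: "Excellent", 4: "Outstanding"}
WORK_LIFE_LABELS = {1: "Bad", 2: "Good", 3: "Better", 4: "Best"}


def label_codes(codes: pd.Series, labels: dict) -> pd.Series:
    """Replace integer codes with labels, keeping the scale's order (hypothetical helper)."""
    ordered_labels = [labels[key] for key in sorted(labels)]
    dtype = pd.CategoricalDtype(categories=ordered_labels, ordered=True)
    return codes.map(labels).astype(dtype)


# Made-up sample codes for illustration.
education = pd.Series([2, 1, 4], name="Education")
education_labeled = label_codes(education, EDUCATION_LABELS)
print(education_labeled.tolist())
```

Using an ordered categorical keeps the levels in scale order when the column is later sorted or tabulated, rather than in alphabetical order.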


> Source: Wu, N. (2017, April 19). employee-attrition-ml. GitHub. Retrieved July 2023, from [https://github.com/nelson-wu/employee-attrition-ml](https://github.com/nelson-wu/employee-attrition-ml)

In [3]:
# Importing the Employee Attrition CSV
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Data Investigation

In [4]:
df.head(5)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
