# **SOLLDA1 MCO: IBM HR Analytics Employee Attrition & Performance**

## **I. Objective** 
### Clearly state the objective of the analysis, what problem or question the group aims to address with the data. 

This analysis aims to **understand the key factors influencing employee attrition and career growth within IBM**. By examining attributes such as job role, performance rating, years at the company, work-life balance, and employee satisfaction, the group can identify patterns that contribute to voluntary and involuntary departures. In addition, the study will help in recognizing high-potential employees suitable for promotion by leveraging data on job involvement, training opportunities, and past career progression. Through this approach, the group seeks to enhance employee retention strategies and optimize internal talent management.

## **II. Problem Statement** 
### Explain the problem or question the analysis seeks to solve or explore. A clear and concise statement of the challenge and the significance is needed. 

High employee turnover and ineffective talent management at IBM lead to increased costs and lost productivity. This analysis seeks to:  
1. **Identify factors driving attrition** (e.g., job role, satisfaction, work-life balance).  
2. **Spot high-potential employees for promotion** based on performance, training, and career growth.  
3. **Improve retention strategies** to reduce turnover and enhance employee satisfaction.  

By addressing these issues, IBM can reduce costs, retain top talent, and optimize workforce performance.

## **III. Background** 
### Provide context about the data and the problem domain. Explain where data came from, including its source, collection methods, and any relevant information about its reliability and completeness. 

Employee attrition and talent management are critical issues in the corporate world: high turnover can disrupt operations, increase recruitment costs, and reduce morale. To address these challenges, organizations like IBM can use data-driven insights to understand why employees leave and how to identify and retain top talent, ultimately improving workforce stability and performance. The dataset used for this analysis is the **IBM HR Analytics Employee Attrition & Performance Dataset**, available on Kaggle. Its Kaggle description presents it as a fictional dataset created by IBM data scientists, so it is modeled on the kind of information held in HR systems (employee surveys, performance reviews, and HR records) rather than exported from them. It includes both quantitative metrics (e.g., age, salary, years at the company) and ordinal ratings (e.g., job satisfaction, work-life balance), giving a broad view of employee dynamics.

## **IV. Data Source**
### Describe the origin of the data, whether it was collected internally or obtained from external sources. Include details such as data provider, data format, and the time period covered by the data. 

The dataset was published by IBM and obtained from Kaggle. According to its Kaggle description, it is a fictional dataset created by IBM data scientists, so the records mimic the contents of internal HR systems (employee surveys, performance reviews, and HR records) rather than describing real employees. It is provided in a structured tabular format as a single CSV file (`IBM-HR-Employee-Attrition.csv`). It includes both quantitative metrics (e.g., age, salary, years at the company) and ordinal ratings (e.g., job satisfaction, work-life balance).

The dataset does not specify the time period it covers. It is a cross-sectional snapshot: each row describes one employee at a single point in time, while the tenure fields (e.g., years at the company) summarize several years of history.

## **V. Data Description** 
### Provide a brief overview of the data's structure and contents. Mention the key variables and their meanings. Include any preprocessing steps performed, such as data cleaning, and/or feature engineering.

Import necessary libraries

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

Load the dataset

In [None]:
df = pd.read_csv('IBM-HR-Employee-Attrition.csv') 
df

Display column names and data types

In [None]:
df.info()

The dataset consists of 35 columns that capture different aspects of employees' professional background, work conditions, and job satisfaction. Below is an explanation of each variable: 

#### **1. Employee Demographics** 
- **Age** (*int*) - The age of the employee. 

- **Gender** (*object*) - The gender of the employee (e.g., Male, Female). 

- **MaritalStatus** (*object*) - The marital status of the employee (e.g., Single, Married, Divorced). 

- **Education** (*int*) - Level of education on a scale (1 to 5) where: 
    - 1 = Below College 
    - 2 = College 
    - 3 = Bachelor 
    - 4 = Master 
    - 5 = Doctorate 
    
- **EducationField** (*object*) - The field of education (e.g., Life Sciences, Marketing, Technical Degree).

#### **2. Job and Work Details** 
- **JobRole** (*object*) - The specific job title of the employee (e.g., Sales Executive, Research Scientist). 

- **Department** (*object*) - The department in which the employee works (e.g., Sales, Research & Development, Human Resources). 

- **JobLevel** (*int*) - The hierarchical level of the job within the organization (e.g., 1 = Entry Level, 5 = Senior Management). 

- **JobInvolvement** (*int*) - Level of employee involvement in the job.  
    - 1 = Low 
    - 2 = Medium 
    - 3 = High 
    - 4 = Very High 

- **JobSatisfaction** (*int*) - Employee satisfaction with the job. 
    - 1 = Low 
    - 2 = Medium 
    - 3 = High 
    - 4 = Very High 

- **WorkLifeBalance** (*int*) - Employee's perception of work-life balance. 
    - 1 = Bad 
    - 2 = Good 
    - 3 = Better 
    - 4 = Best

- **OverTime** (*object*) - Indicates whether the employee works overtime (Yes/No). 

- **StandardHours** (*int*) - Standard working hours for employees (appears to be a constant value). 

#### **3. Compensation and Benefits** 
- **HourlyRate** (*int*) - The hourly wage of the employee. 

- **DailyRate** (*int*) - The daily pay rate of the employee. 

- **MonthlyIncome** (*int*) - The total monthly earnings of the employee. 

- **StockOptionLevel** (*int*) - Stock option level granted to the employee (0 = No stock options, 3 = High stock options). 

- **PercentSalaryHike** (*int*) - Percentage increase in salary after the last performance review. 

#### **4. Employment History and Tenure** 
- **YearsAtCompany** (*int*) - The number of years the employee has been with the company. 

- **YearsInCurrentRole** (*int*) - The number of years the employee has been in their current role. 

- **YearsSinceLastPromotion** (*int*) - The number of years since the employee's last promotion. 

- **YearsWithCurrManager** (*int*) - The number of years the employee has worked with their current manager. 

- **TotalWorkingYears** (*int*) - The total number of years the employee has worked in their career. 

- **NumCompaniesWorked** (*int*) - The number of previous companies the employee has worked for. 

#### **5. Employee Satisfaction and Performance** 
- **PerformanceRating** (*int*) - Employee's most recent performance rating. 
    - 1 = Low 
    - 2 = Good 
    - 3 = Excellent 
    - 4 = Outstanding 

- **RelationshipSatisfaction** (*int*) - Employee's satisfaction with their relationships at work. 
    - 1 = Low 
    - 2 = Medium 
    - 3 = High 
    - 4 = Very High

- **EnvironmentSatisfaction** (*int*) - Employee's satisfaction with the work environment. 
    - 1 = Low 
    - 2 = Medium 
    - 3 = High 
    - 4 = Very High

- **TrainingTimesLastYear** (*int*) - Number of training sessions attended by the employee in the last year.

#### **6. Attrition and Travel** 
- **Attrition** (*object*) - Indicates whether the employee has left the company (Yes/No). 

- **BusinessTravel** (*object*) - Frequency of business travel (e.g., Travel_Rarely, Travel_Frequently, Non-Travel).

- **DistanceFromHome** (*int*) - The distance between the employee's home and workplace. 

#### **7. Miscellaneous** 
- **EmployeeNumber** (*int*) - Unique ID assigned to each employee. 

- **EmployeeCount** (*int*) - Seems to be a constant column (always 1). 

- **Over18** (*object*) - Indicates if the employee is over 18 (constant "Y" for all). 

The following columns are dropped because they are redundant or carry no information:

- **EmployeeNumber** - Just an ID, not useful for the analysis.
- **HourlyRate, DailyRate** - These are compensation-related columns. While compensation can influence attrition, MonthlyIncome is sufficient to capture this information, so HourlyRate and DailyRate are dropped to avoid redundancy.
- **EmployeeCount** - Constant value (always 1).
- **Over18** - Constant value (always "Y").
- **StandardHours** - Constant value (likely the same for all employees).
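
The "constant value" claims above can be verified rather than assumed. A minimal sketch, using a small hypothetical frame (`toy`) in place of the real `df`; only the column names mirror the dataset, and the values are illustrative:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are made up.
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "MonthlyIncome": [5993, 5130, 2090],
})

# A column is constant when it holds exactly one distinct value.
n_unique = toy.nunique()
constant_columns = n_unique[n_unique == 1].index.tolist()
print(constant_columns)
```

Running the same two lines on `df` before the drop would confirm which columns carry no information.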



In [None]:
df.drop(columns=['EmployeeNumber', 'HourlyRate', 'DailyRate', 'EmployeeCount', 'Over18', 'StandardHours'], inplace=True)

Check for duplicates (if any)

In [None]:
df.duplicated().sum()

Detect any outliers in the data

In [None]:
plt.figure(figsize=(16, 10)) 
df.select_dtypes(include=['int64', 'float64']).boxplot(rot=90) 
plt.title("Outliers in Numerical Features")
plt.show()

Categorical features need to be converted into numerical values using label encoding, one-hot encoding, or ordinal encoding.

Label Encoding for Binary Categorical Columns

In [None]:
from sklearn.preprocessing import LabelEncoder

binary_categorical_columns = ['Gender', 'OverTime', 'Attrition']

label_encoder = LabelEncoder()

for col in binary_categorical_columns:
    df[col] = label_encoder.fit_transform(df[col])

Ordinal Encoding for BusinessTravel

In [None]:
business_travel_mapping = {
    'Non-Travel': 0,
    'Travel_Rarely': 1,
    'Travel_Frequently': 2
}

df['BusinessTravel'] = df['BusinessTravel'].map(business_travel_mapping)
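
Of the three encodings mentioned above, one-hot encoding is the one not demonstrated in this section. It suits nominal columns with no natural order, such as Department or MaritalStatus. A minimal sketch on hypothetical rows (the frame `toy` is illustrative; `pd.get_dummies` is one of several ways to do this):

```python
import pandas as pd

# Hypothetical rows; the labels follow the examples listed in the data description.
toy = pd.DataFrame({
    "Department": ["Sales", "Research & Development", "Human Resources"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

# One indicator column per category; no ordering is implied between categories.
encoded = pd.get_dummies(toy, columns=["Department", "MaritalStatus"], dtype=int)
print(encoded.columns.tolist())
```

Unlike label or ordinal encoding, this avoids suggesting that, for example, one department is "greater than" another.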

Numerical columns need scaling to ensure they are on a similar scale for modeling.

Columns to transform (all *int*):
- **Age**
- **DistanceFromHome**
- **MonthlyIncome**
- **PercentSalaryHike**
- **YearsAtCompany**
- **YearsInCurrentRole**
- **YearsSinceLastPromotion**
- **YearsWithCurrManager**
- **TotalWorkingYears**
- **NumCompaniesWorked**
- **TrainingTimesLastYear**


In [None]:
from sklearn.preprocessing import StandardScaler

numerical_columns = [
    'Age', 'DistanceFromHome', 'MonthlyIncome', 'PercentSalaryHike', 
    'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 
    'YearsWithCurrManager', 'TotalWorkingYears', 'NumCompaniesWorked', 
    'TrainingTimesLastYear'
]

scaler = StandardScaler()

df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

#### **Feature Engineering**

We can create new features to capture additional insights: 

- **PromotionStagnation** : Ratio of YearsSinceLastPromotion to YearsAtCompany.
- **RoleStagnation** : Ratio of YearsInCurrentRole to YearsAtCompany.
- **ManagerTenureImpact** : Ratio of YearsWithCurrManager to YearsAtCompany.
- **IncomeToHikeRatio** : Ratio of MonthlyIncome to PercentSalaryHike.

Creating new features like PromotionStagnation, RoleStagnation, ManagerTenureImpact, and IncomeToHikeRatio can provide deeper insights into employee behavior, satisfaction, and career progression. These engineered features capture relationships between existing variables that might not be immediately apparent, helping us better understand the factors influencing employee attrition and career growth. 

In [None]:
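# The columns were standardized in place above; recover the raw integer values
# first, because ratios of z-scores are not interpretable.
raw = pd.DataFrame(scaler.inverse_transform(df[numerical_columns]),
                   columns=numerical_columns, index=df.index).round()
df['PromotionStagnation'] = raw['YearsSinceLastPromotion'] / raw['YearsAtCompany']
df['RoleStagnation'] = raw['YearsInCurrentRole'] / raw['YearsAtCompany']
df['ManagerTenureImpact'] = raw['YearsWithCurrManager'] / raw['YearsAtCompany']
df['IncomeToHikeRatio'] = raw['MonthlyIncome'] / raw['PercentSalaryHike']

# Zero tenure gives 0/0 (NaN) or x/0 (inf); treat both as 0.
df = df.replace([np.inf, -np.inf], np.nan).fillna(0)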

Here is the new and transformed dataset

In [None]:
df.head()

## **VI. Exploratory Data Analysis** 

#### 1. Data Overview  

 Summary statistics (mean, median, standard deviation, etc.).

In [None]:
df.describe(include='all')

Data dimensions (number of rows and columns). 

In [None]:
df.shape

Data types and data distribution (categorical, numerical, etc.)

In [None]:
df.dtypes

In [None]:
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns
df[numerical_columns].hist(bins=20, figsize=(20, 15))
plt.suptitle("Distribution of Numerical Variables")
plt.show()

Missing data and handling strategies. 


In [None]:
df.isnull().sum()

Since there are no missing values, no further handling is necessary.

#### 2. Univariate Analysis

Visualizations and summary statistics for individual variables.

In [None]:
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns
df[numerical_columns].describe()

Box plots show the distribution of numerical variables and identify outliers.

In [None]:
for col in numerical_columns:
    plt.figure(figsize=(4, 2))
    sns.boxplot(data=df, x=col)  # a palette has no effect without a hue variable
    plt.title(f"Box Plot of {col}")
    plt.show()

Below, we define a function to detect outliers. Outliers can be identified using box plots or statistical methods such as the Interquartile Range (IQR).

The IQR method identifies outliers as values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.

In [None]:
def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers

numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns

for col in numerical_columns:
    outliers = detect_outliers(df, col)
    print(f"Number of outliers in {col}: {len(outliers)}")
    print("-" * 50)

#### 3. Bivariate Analysis

##### **Attrition by Department**

In [None]:
plt.figure(figsize=(10, 4))
sns.countplot(x='Department', hue='Attrition', data=df)
plt.title("Attrition Count by Department")
plt.xticks(rotation=45)
plt.show()

**Findings** <br>
The Research & Development department has the highest number of retained employees, followed by the Sales department and then Human Resources. The ranking of departments by number of employees who left follows the same order. Leavers are a minority in every department, but because the plot shows counts rather than proportions, the attrition rate per department should be computed directly rather than estimated from bar heights.
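
A count plot shows totals, so comparing departments of very different sizes calls for a rate. A minimal sketch, assuming Attrition has already been encoded as 0/1 as in the label-encoding step; the frame `toy` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical rows; Attrition is 0 (stayed) or 1 (left).
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales", "HR", "HR"],
    "Attrition": [1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of leavers within each group.
attrition_rate = toy.groupby("Department")["Attrition"].mean()
print(attrition_rate)
```

Applied to `df`, the same `groupby` would give the exact per-department rate to report alongside the plot.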

##### **Monthly Income and Attrition**

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x='Attrition', y='MonthlyIncome', data=df)
plt.title("Monthly Income Distribution by Attrition")
plt.show()

**Findings**<br>


##### **Overtime vs. Attrition**

In [None]:
plt.figure(figsize=(6, 4))
sns.barplot(x='OverTime', y='Attrition', data=df)
plt.title("Impact of Overtime on Attrition")
plt.show()

**Findings** <br>
Employees who work overtime leave at a higher rate than those who do not.

##### **Work-Life and Attrition**

In [None]:
plt.figure(figsize=(6, 4))
# Attrition is binary (0/1), so plot its mean per level instead of a box plot
sns.barplot(x='WorkLifeBalance', y='Attrition', data=df)
plt.title("Work-Life Balance Effect on Attrition")
plt.show()

##### **Job Satisfaction and Attrition**

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='JobSatisfaction', hue='Attrition', data=df)
plt.title("Attrition by Job Satisfaction Levels")
plt.show()

##### **Last Promotion Year and Attrition**

In [None]:
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='YearsSinceLastPromotion', hue='Attrition', bins=10, kde=True)
plt.title("Attrition by Years Since Last Promotion")
plt.show()

##### **Marital Status and Attrition**

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='MaritalStatus', hue='Attrition', data=df)
plt.title("Attrition by Marital Status")
plt.show()

##### **Education Field and Attrition**

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(y='EducationField', hue='Attrition', data=df)
plt.title("Attrition by Education Field")
plt.show()


##### **Job Role and Attrition**

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(y='JobRole', hue='Attrition', data=df)
plt.title("Attrition by Job Role")
plt.show()


##### **Summary of Bivariate Analysis Findings**

- 

#### 4. Multivariate Analysis

Because the features used in the multivariate analysis contain many outliers, we use PERMANOVA instead of MANOVA. The outliers, particularly in YearsAtCompany (104) and PerformanceRating (226), reflect skewed distributions, which violate MANOVA's assumption that the dependent variables are normally distributed. PERMANOVA does not require normality because it assesses significance by permutation.
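
The skewness argument can be made explicit with a quick numerical check. A minimal sketch on a hypothetical right-skewed sample (not the real data); note that univariate checks like these are only a partial proxy for the multivariate normality MANOVA assumes:

```python
import numpy as np
from scipy.stats import skew, shapiro

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for a tenure-like variable.
sample = rng.exponential(scale=5.0, size=500)

sample_skewness = skew(sample)   # > 0 indicates a long right tail
stat, p_value = shapiro(sample)  # small p-value: normality is rejected
print(f"skewness={sample_skewness:.2f}, Shapiro-Wilk p={p_value:.3g}")
```

The same two calls could be applied to each dependent variable in `df` to support the choice of PERMANOVA.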

Assumptions for PERMANOVA Testing:

1. Observations are randomly and independently sampled from the population
2. Homogeneity of Multivariate Dispersion 
3. Each dependent variable has an interval measurement

**1. Observations are randomly and independently sampled from the population**

As mentioned in section **IV. Data Source**, each row of the dataset is a distinct employee record, so the data should not contain duplicates and the observations are treated as independent.

**2. Homogeneity of Multivariate Dispersion**

In [105]:
from scipy.spatial.distance import cdist
from scipy.stats import f_oneway
from sklearn.preprocessing import LabelEncoder

# .copy() makes subset_df an independent frame, which avoids SettingWithCopyWarning below
subset_df = df[['MaritalStatus', 'YearsAtCompany', 'JobSatisfaction', 'PerformanceRating']].copy()

# Encode 'MaritalStatus' for testing
label_encoder = LabelEncoder()
subset_df['MaritalStatus_encoded'] = label_encoder.fit_transform(subset_df['MaritalStatus'])

# Check Homogeneity of Multivariate Dispersion (PERMDISP)
# Calculate group centroids
centroids = subset_df.groupby('MaritalStatus')[['YearsAtCompany', 'JobSatisfaction', 'PerformanceRating']].mean()

# Compute distances to centroids
distances = []
for group in subset_df['MaritalStatus'].unique():
    group_data = subset_df[subset_df['MaritalStatus'] == group][['YearsAtCompany', 'JobSatisfaction', 'PerformanceRating']]
    centroid = centroids.loc[group].values
    dists = cdist(group_data, [centroid])
    distances.extend(dists.flatten())

subset_df['DistanceToCentroid'] = distances

# Perform ANOVA on the distances
groups = [subset_df.loc[subset_df['MaritalStatus'] == group, 'DistanceToCentroid'].values for group in subset_df['MaritalStatus'].unique()]
permdisp = f_oneway(*groups)
print("PERMDISP Test (Homogeneity of Multivariate Dispersion):")
print(permdisp)

PERMDISP Test (Homogeneity of Multivariate Dispersion):
F_onewayResult(statistic=np.float64(0.6223717683688145), pvalue=np.float64(0.5368117107690501))




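We used PERMDISP testing to check whether the dispersion of the data points around their group centroids is similar across the MaritalStatus groups (Single, Married, or Divorced). The null hypothesis is that the dispersions are equal. With p ≈ 0.537, well above 0.05, we fail to reject it, so the homogeneity assumption is considered met.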
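
One detail worth checking in a computation like this: distances gathered group by group come out in group order, not row order, so writing them back as a plain list can misalign them with the rows. A minimal sketch on hypothetical data that keeps each distance attached to its own row (the frame `toy`, its column names, and its values are illustrative; like the cell above, it runs a one-way ANOVA on the distances and omits the permutation step of a full PERMDISP):

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical data: interleaved group labels and two numeric features.
toy = pd.DataFrame({
    "group": ["A", "B", "C"] * 20,
    "x": rng.normal(size=60),
    "y": rng.normal(size=60),
})
features = ["x", "y"]

# Broadcast each group's centroid back to its own rows (index-aligned).
centroids = toy.groupby("group")[features].transform("mean")
# Euclidean distance of every row to its own group centroid.
toy["dist"] = np.sqrt(((toy[features] - centroids) ** 2).sum(axis=1))

# One-way ANOVA on the distances: do the groups differ in spread?
samples = [g["dist"].to_numpy() for _, g in toy.groupby("group")]
result = f_oneway(*samples)
print(result.statistic, result.pvalue)
```

Using `transform` keeps the original index, so the distance column lines up with the rows regardless of how the groups are interleaved.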

**3. Each dependent variable has an interval measurement**

In [106]:
print(subset_df[['YearsAtCompany', 'JobSatisfaction', 'PerformanceRating']].corr())

                   YearsAtCompany  JobSatisfaction  PerformanceRating
YearsAtCompany           1.000000        -0.003803           0.003435
JobSatisfaction         -0.003803         1.000000           0.002297
PerformanceRating        0.003435         0.002297           1.000000


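Using the built-in correlation function, we can see the linear relationship between each pair of dependent variables. Since every pairwise correlation is close to 0, the dependent variables are only weakly related to one another. Note that JobSatisfaction and PerformanceRating are ordinal 1 to 4 ratings, which are treated here as interval-scaled.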

**One-Way PERMANOVA Testing**

In [107]:
from skbio.stats.distance import permanova, DistanceMatrix

dist_matrix = cdist(subset_df[['YearsAtCompany', 'JobSatisfaction', 'PerformanceRating']], subset_df[['YearsAtCompany', 'JobSatisfaction', 'PerformanceRating']])
labels = subset_df['MaritalStatus'].values
dm = DistanceMatrix(dist_matrix)
permanova_result = permanova(dm, labels, permutations=999)
print("\nPERMANOVA Test:")
print(permanova_result)


PERMANOVA Test:
method name               PERMANOVA
test statistic name        pseudo-F
sample size                    1470
number of groups                  3
test statistic             1.829847
p-value                       0.103
number of permutations          999
Name: PERMANOVA results, dtype: object
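
Reading the result: the pseudo-F of about 1.83 comes with p = 0.103. Assuming the conventional 0.05 significance level (an assumption, since the analysis does not state one), the test does not show a significant difference in the combined YearsAtCompany, JobSatisfaction, and PerformanceRating profile across MaritalStatus groups. The decision rule, sketched with the values copied from the output above:

```python
# Values copied from the PERMANOVA output above.
pseudo_f = 1.829847
p_value = 0.103
alpha = 0.05  # assumed significance level; not stated in the original analysis

reject_null = p_value < alpha
print("Reject H0" if reject_null else "Fail to reject H0")
```

In the notebook itself, the same value can be read from the result with `permanova_result['p-value']`, since the result is a labeled pandas Series as the printed output shows.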


## **Reference/s** 
-   IBM HR Analytics Employee Attrition & Performance. (2017, March 31). https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset/data

## **By Data Alchemists** 
- Bon, Jan Louise B. 
- Brodett, Ram David M.
- Lopez, Ghee Kaye S. 
- Paguiligan, James Archer B. 
- Sanchez, Matthew Heinz O. 