# Exploratory Data Analysis (EDA)

In this section, we perform exploratory data analysis (EDA) on the Human Resources dataset. EDA is a crucial early step in any data analysis: summary statistics and plots help us understand underlying patterns, detect anomalies, test hypotheses, and check assumptions. The insights uncovered here will guide the rest of the analysis.

In [None]:
# Importing the libraries
import os
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import json
import matplotlib.pyplot as plt

In [None]:
base_dir = os.path.dirname(os.getcwd())
data_frame = pd.read_csv(os.path.join(base_dir, 'datasets', 'Cleaned_Human_Resources.csv'))
print('Number of rows:', len(data_frame))
print('Number of columns:', len(data_frame.columns))

Note that the dataset has already been cleaned and preprocessed: it contains no missing values, columns with only one unique value have been dropped, and categorical variables have been encoded as numerical values for further analysis.
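These properties are cheap to verify before relying on them. The following is a minimal sketch, not part of the original notebook: the helper name `check_cleaning` and the toy frame are assumptions made for illustration, and in practice the function would be called on `data_frame`.

```python
import pandas as pd


def check_cleaning(df: pd.DataFrame) -> dict:
    """Summarize the properties the cleaning step is expected to guarantee."""
    return {
        # Total number of missing cells across the whole frame
        "missing_values": int(df.isna().sum().sum()),
        # Columns whose values never vary carry no information
        "constant_columns": [
            col for col in df.columns if df[col].nunique(dropna=False) <= 1
        ],
        # Columns that are still stored as a non-numeric type
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Toy frame with hypothetical values, standing in for data_frame
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "StandardHours": [80, 80, 80, 80],            # constant column
    "Department": ["Sales", "R&D", None, "R&D"],  # one missing value, non-numeric
})
report = check_cleaning(toy)
print(report)
```

On a frame that matches the description above, `missing_values` would be 0 and `constant_columns` would be empty. Any entries in `non_numeric_columns` would indicate columns that were left unencoded.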

In [None]:
with open(os.path.join(base_dir, 'datasets', 'unique_elements.json')) as f:
    data_dict = json.load(f)

# Print the unique values recorded for each categorical column
for key, value in data_dict.items():
    print(key, ":", value)

In [None]:
data_frame.hist(figsize=(20, 50), bins=50, xlabelsize=12, ylabelsize=12, layout=(int(np.ceil(data_frame.shape[1] / 3)), 3));

In [None]:
left_df = data_frame[data_frame['Attrition'] == 'Yes']
stay_df = data_frame[data_frame['Attrition'] == 'No']

print('Total number of employees:', len(data_frame))
print('Number of employees who left:', len(left_df))
print('Number of employees who stayed:', len(stay_df))

In [None]:
# Display the statistics of the employees who left
left_df.describe()

In [None]:
# Display the statistics of the employees who stayed
stay_df.describe()
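The two `describe()` tables above are easier to compare side by side. A minimal sketch, assuming two frames split the way `left_df` and `stay_df` are. The helper name `compare_means`, the `rel_diff` column, and the toy values are illustrative and not part of the original notebook.

```python
import pandas as pd


def compare_means(left: pd.DataFrame, stay: pd.DataFrame) -> pd.DataFrame:
    """Put the per-feature means of both groups in one table."""
    # Only numeric columns have a meaningful mean
    numeric = left.select_dtypes(include="number").columns
    table = pd.DataFrame({
        "left": left[numeric].mean(),
        "stayed": stay[numeric].mean(),
    })
    # Relative difference with respect to the employees who stayed
    # (becomes inf or NaN if the "stayed" mean is zero)
    table["rel_diff"] = (table["left"] - table["stayed"]) / table["stayed"]
    return table


# Toy data with hypothetical values, standing in for data_frame
toy = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "No", "No"],
    "MonthlyIncome": [2000, 3000, 5000, 5000],
    "Age": [30, 32, 40, 42],
})
toy_left = toy[toy["Attrition"] == "Yes"]
toy_stay = toy[toy["Attrition"] == "No"]
comparison = compare_means(toy_left, toy_stay)
print(comparison)
```

A relative difference is used so that features measured on different scales, such as income and age, can be read in the same table.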

#### Correlation Heatmap

The correlation heatmap below visualizes the pairwise relationships between the numeric features in the dataset. Each cell shows the Pearson correlation coefficient (the default of `DataFrame.corr()`) between two variables. The values range from -1 to 1, where:

- **1** indicates a perfect positive correlation
- **-1** indicates a perfect negative correlation
- **0** indicates no linear correlation

This heatmap helps us understand which features are strongly correlated with each other, which can be useful for feature selection and understanding the underlying structure of the data.
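With many features, a large heatmap is hard to scan, so it can help to list the strongest pairs directly. The following is a sketch under stated assumptions: the helper name `top_correlated_pairs` and the toy columns `a`, `b`, `c` are hypothetical and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so that each pair appears once
    # and the diagonal (always 1) is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude but report the signed coefficient
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data with hypothetical values: b is an exact multiple of a,
# while c is uncorrelated with both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [1, 0, 1, 0, 1],
})
strongest = top_correlated_pairs(toy, n=1)
print(strongest)
```

Applied to `data_frame`, the same function would return a short ranked list that complements the heatmap when deciding which features are redundant.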

In [None]:
# Select only numeric columns
numeric_data_frame = data_frame.select_dtypes(include=[np.number])

# Calculate correlations
correlations = numeric_data_frame.corr()

# Plot heatmap
f, ax = plt.subplots(figsize=(40, 40))
sns.heatmap(correlations, annot=True, cmap='coolwarm', fmt=".2f", ax=ax, annot_kws={"size": 20})

In [None]:
plt.figure(figsize=(20, 7))
ax = sns.countplot(x='Age', hue='Attrition', data=data_frame, palette='viridis')
ax.set_facecolor('lightgrey')
plt.title('Age Distribution by Attrition')
plt.xlabel('Age')
plt.ylabel('Count')
plt.legend(title='Attrition', labels=['No', 'Yes'])
plt.show()

In [None]:
plt.figure(figsize=(20, 30));
plt.subplot(4, 1, 1)
sns.countplot(x='JobRole', hue='Attrition', data=data_frame, palette='viridis')
plt.title('Job Role Distribution by Attrition')
plt.xlabel('Job Role')
plt.ylabel('Count')
plt.legend(title='Attrition', labels=['No', 'Yes'])
plt.subplot(4, 1, 2)
sns.countplot(x='MaritalStatus', hue='Attrition', data=data_frame, palette='viridis')
plt.title('Marital Status Distribution by Attrition')
plt.xlabel('Marital Status')
plt.ylabel('Count')
plt.legend(title='Attrition', labels=['No', 'Yes'])
plt.subplot(4, 1, 3)
sns.countplot(x='JobInvolvement', hue='Attrition', data=data_frame, palette='viridis')
plt.title('Job Involvement Distribution by Attrition')
plt.xlabel('Job Involvement')
plt.ylabel('Count')
plt.legend(title='Attrition', labels=['No', 'Yes'])
plt.subplot(4, 1, 4)
sns.countplot(x='JobLevel', hue='Attrition', data=data_frame, palette='viridis')
plt.title('Job Level Distribution by Attrition')
plt.xlabel('Job Level')
plt.ylabel('Count')
plt.legend(title='Attrition', labels=['No', 'Yes'])
plt.show()

In [None]:
# Monthly Income Distribution by Attrition
plt.figure(figsize=(12, 7));
sns.kdeplot(left_df['MonthlyIncome'], label='Employees who left', fill=True, color='r');
sns.kdeplot(stay_df['MonthlyIncome'], label='Employees who stayed', fill=True, color='b');
plt.xlabel('Monthly Income');
plt.ylabel('Density');
plt.title('Monthly Income Distribution by Attrition');
plt.legend(title='Attrition', labels=['Yes', 'No']);
plt.show()

In [None]:
# Distance From Home Distribution by Attrition
plt.figure(figsize=(12, 7));
sns.kdeplot(left_df['DistanceFromHome'], label='Employees who left', fill=True, color='r');
sns.kdeplot(stay_df['DistanceFromHome'], label='Employees who stayed', fill=True, color='b');
plt.xlabel('Distance From Home [km]');
plt.ylabel('Density');
plt.title('Distance From Home Distribution by Attrition');
plt.legend(title='Attrition', labels=['Yes', 'No']);
plt.show()

In [None]:
# Convert Gender to a categorical type if it's not already
data_frame['Gender'] = data_frame['Gender'].astype('category')

plt.figure(figsize=(15, 10))
sns.boxplot(x='MonthlyIncome', y='Gender', data=data_frame, palette='viridis', hue='Gender');
#data_frame['Gender'] = data_frame['Gender'].astype('int');
#plt.yticks(ticks=range(len(data_frame['Gender'].unique())), labels=list(data_dict['Gender'].values()));

plt.title('Monthly Income Distribution by Gender')
plt.xlabel('Monthly Income')
plt.ylabel('Gender')
plt.show()

In [None]:
# Convert JobRole to a categorical type if it's not already
data_frame['JobRole'] = data_frame['JobRole'].astype('category')

plt.figure(figsize=(15, 10))
sns.boxplot(x='MonthlyIncome', y='JobRole', data=data_frame, palette='viridis', hue='JobRole');