## Data Exploration (Continue EDA)

We will begin by finding how many employees left and what percentage of all employees this figure represents.

In [None]:
# Get numbers of people who left vs. stayed
num_left = len(df[df['left'] == 1])
num_stayed = len(df) - num_left
print(f"Number of employees that left : {num_left}")
print(f"Number of employees that stayed : {num_stayed}")

# Get percentages of people who left vs. stayed
print(f"Percentage leaving : {num_left / len(df) * 100 :.2f} %")
print(f"Percentage staying : {num_stayed / len(df) * 100 :.2f} %")

About 16.6% of employees left the company, which is roughly one in six.
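The same split can also be read off with `value_counts`. The snippet below is a minimal sketch that assumes a small made-up frame, `toy`, in place of the real `df`; its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for `df`: 1 = left, 0 = stayed
toy = pd.DataFrame({"left": [0, 0, 0, 0, 0, 1]})

# Absolute counts per class
counts = toy["left"].value_counts()

# Shares per class, expressed as percentages
shares = toy["left"].value_counts(normalize=True) * 100

print(counts.to_dict())
print(shares.round(2).to_dict())
```

With the real data, replacing `toy` with `df` should reproduce the counts and percentages printed by the cell above.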

### Data visualisations

In [None]:
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize=(22, 8))

# Create boxplot showing `average_monthly_hours` distributions for `number_project`, comparing employees who stayed versus those who left
sns.boxplot(x="average_monthly_hours", y="number_project", data=df, hue='left', orient='h', ax=ax[0])
ax[0].invert_yaxis()
ax[0].set_xlabel("Average Monthly Hours")
ax[0].set_ylabel("Number of Projects")
ax[0].set_title("Monthly Hours by Number of Projects")

# Create histogram showing distribution of `number_project`, comparing employees who stayed versus those who left
sns.histplot(x='number_project', data=df, hue='left', multiple='dodge', shrink=3, ax=ax[1])
ax[1].set_xlabel("Number of Projects")
ax[1].set_title("Distribution of Number of Projects")

# Display the plots
plt.show()

Employees who work on more projects also tend to work longer hours: the median hours of both groups (stayed and left) rise with the number of projects. On closer inspection, however, two distinct groups of employees left the company. The first worked considerably less than peers with the same number of projects, and the second worked much more. Employees in the first group may have been fired or were already on their way out, while those in the second group likely quit. It is reasonable to assume that employees in the second group were significant contributors to their projects.

Interestingly, everyone with seven projects left the company, and the interquartile ranges of this group and those who left with six projects were much higher than any other group. The data suggests that the optimal number of projects for employees to work on is 3-4, as the ratio of left/stayed is much smaller for these cohorts.
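One way to check the per-cohort comparison numerically is to tabulate stayed and left counts by `number_project`. This is a sketch on a made-up frame (`toy` and its values are hypothetical). It reports the share of each cohort that left rather than the left/stayed ratio, because the ratio is undefined for a cohort in which nobody stayed.

```python
import pandas as pd

# Hypothetical stand-in for `df`; values are illustrative only
toy = pd.DataFrame({
    "number_project": [2, 2, 3, 3, 3, 4, 4, 7, 7],
    "left":           [1, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Count stayed (0) and left (1) employees for each project count
by_projects = pd.crosstab(toy["number_project"], toy["left"])
by_projects.columns = ["stayed", "left"]

# Share of each cohort that left
total = by_projects["stayed"] + by_projects["left"]
by_projects["left_rate"] = by_projects["left"] / total

print(by_projects)
```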

Assuming a 40-hour work week and two weeks of vacation per year, an employee working Monday to Friday averages 166.67 hours per month. Except for employees who worked on two projects, every group worked considerably more than this, indicating that employees at this company may be overworked.
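The 166.67-hour reference figure follows directly from the stated assumptions. The constant names below are only illustrative.

```python
# Assumptions stated in the text: 40-hour weeks, two weeks of vacation
HOURS_PER_WEEK = 40
WORK_WEEKS_PER_YEAR = 52 - 2
MONTHS_PER_YEAR = 12

# Yearly working hours spread evenly over twelve months
monthly_hours = HOURS_PER_WEEK * WORK_WEEKS_PER_YEAR / MONTHS_PER_YEAR

print(f"{monthly_hours:.2f}")  # 166.67
```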

We will now investigate average monthly hours versus satisfaction levels.

In [None]:
# Create scatterplot of `average_monthly_hours` versus `satisfaction_level`, comparing employees who stayed versus those who left
plt.figure(figsize=(16, 9))
sns.scatterplot(data=df, x='average_monthly_hours', y='satisfaction_level', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by satisfaction level', fontsize='14');

The scatterplot above shows a sizeable group of employees who worked between 240 and 315 hours per month. To put this into perspective, 315 hours per month is over 75 hours per week for an entire year. This workload may have contributed to their low satisfaction levels.
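The weekly equivalent can be checked by reversing the earlier monthly calculation. The helper below is hypothetical and assumes the same 50 working weeks per year used for the 166.67-hour reference.

```python
def weekly_hours(monthly_hours, work_weeks_per_year=50):
    """Convert average monthly hours to average hours per working week."""
    return monthly_hours * 12 / work_weeks_per_year

# Lower and upper ends of the range discussed above
for monthly in (240, 315):
    print(f"{monthly} hrs/month is about {weekly_hours(monthly):.1f} hrs/week")
```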

Another group of employees who left had more typical working hours, yet their satisfaction levels were still relatively low, hovering around 0.4. It is difficult to say why they left, but they may have felt pressured to work longer hours because many of their peers worked more, and that pressure could have lowered their satisfaction.

On the other hand, a group of employees worked between 210 and 280 hours per month and had higher satisfaction levels, ranging from 0.7 to 0.9. Note, however, that the strange shape of these distributions suggests the data may have been manipulated or synthetically generated.

Next, we examine salary levels for different tenures.

In [None]:
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize = (22,8))

# Define short-tenured employees
tenure_short = df[df['tenure'] < 7]

# Define long-tenured employees
tenure_long = df[df['tenure'] > 6]

# Plot short-tenured histogram
sns.histplot(data=tenure_short, x='tenure', hue='salary', discrete=1, 
             hue_order=['low', 'medium', 'high'], multiple='dodge', shrink=.5, ax=ax[0])
ax[0].set_title('Salary histogram by tenure: short-tenured people', fontsize='14')

# Plot long-tenured histogram
sns.histplot(data=tenure_long, x='tenure', hue='salary', discrete=1, 
             hue_order=['low', 'medium', 'high'], multiple='dodge', shrink=.4, ax=ax[1])
ax[1].set_title('Salary histogram by tenure: long-tenured people', fontsize='14');


The plots above show that long-tenured employees were not disproportionately comprised of higher-paid employees.
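Because the two tenure groups differ in size, the comparison is easier to verify as proportions than as raw counts. This is a sketch on a made-up frame: `toy`, its values, and the `tenure_group` column are hypothetical. The cut-off follows the `tenure > 6` split used in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for `df`; values are illustrative only
toy = pd.DataFrame({
    "tenure": [2, 3, 3, 4, 5, 7, 8, 8, 10, 10],
    "salary": ["low", "medium", "low", "high", "medium",
               "low", "medium", "low", "high", "medium"],
})

# Label each employee as short- or long-tenured
toy["tenure_group"] = toy["tenure"].apply(lambda t: "long" if t > 6 else "short")

# Row-normalised table: salary mix within each tenure group
salary_share = pd.crosstab(toy["tenure_group"], toy["salary"], normalize="index")
salary_share = salary_share[["low", "medium", "high"]]

print(salary_share.round(2))
```

If the `high` share in the `long` row is not clearly larger than in the `short` row, that supports the reading of the histograms.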

In [None]:
# Create scatterplot of `average_monthly_hours` versus `last_evaluation`
plt.figure(figsize=(16, 9))
sns.scatterplot(data=df, x='average_monthly_hours', y='last_evaluation', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by last evaluation score', fontsize='14');

The scatterplot above suggests two groups of employees who left. The first comprises overworked employees who performed very well. The second consists of employees who worked slightly under the nominal monthly average of 166.67 hours and received lower evaluation scores. Hours worked and evaluation score appear to be correlated. Few employees fall in the upper-left quadrant (short hours, high evaluation). Even so, working long hours does not guarantee a good evaluation score. Furthermore, most employees in this company work well over 167 hours per month.

In [None]:
# Create plot to examine relationship between `average_monthly_hours` and `promotion_last_5years`
plt.figure(figsize=(16, 3))
sns.scatterplot(data=df, x='average_monthly_hours', y='promotion_last_5years', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by promotion last 5 years', fontsize='14');

The plot above shows the following:

* very few employees who were promoted in the last five years left
* very few employees who worked the most hours were promoted
* all of the employees who worked the longest hours left
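The second observation can be cross-checked by comparing promotion rates between the longest-working employees and everyone else. This is a sketch on a made-up frame (`toy` and its values are hypothetical). Using the top quartile of hours as the definition of "most hours" is an assumption made here, not something the analysis specifies.

```python
import pandas as pd

# Hypothetical stand-in for `df`; values are illustrative only
toy = pd.DataFrame({
    "average_monthly_hours": [150, 160, 170, 180, 200, 220, 260, 300],
    "promotion_last_5years": [0, 1, 0, 1, 0, 0, 0, 0],
})

# Assumed cut-off: 75th percentile of monthly hours
threshold = toy["average_monthly_hours"].quantile(0.75)
hours_band = (
    toy["average_monthly_hours"]
    .ge(threshold)
    .map({True: "top quartile", False: "rest"})
    .rename("hours_band")
)

# Share of employees promoted within each band
promo_rate = toy.groupby(hours_band)["promotion_last_5years"].mean()

print(promo_rate)
```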

In [None]:
# Create side-by-side (dodged) histogram to compare department distribution of employees who left to that of employees who didn't
plt.figure(figsize=(11,8))
sns.histplot(data=df, x='department', hue='left', discrete=1, 
             hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.xticks(rotation=45)
plt.title('Counts of stayed/left by department', fontsize=14);

There doesn't seem to be any department that differs significantly in its proportion of employees who left to those who stayed.
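Raw counts are hard to compare across departments of different sizes, so a per-department leave rate is a useful complement to the chart. This is a sketch on a made-up frame (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for `df`; values are illustrative only
toy = pd.DataFrame({
    "department": ["sales", "sales", "sales", "sales", "IT", "IT", "IT", "IT"],
    "left":       [1, 0, 0, 0, 0, 1, 0, 0],
})

# Leave rate and headcount per department, highest rate first
dept_summary = (
    toy.groupby("department")["left"]
    .agg(left_rate="mean", headcount="size")
    .sort_values("left_rate", ascending=False)
)

print(dept_summary)
```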

In [None]:
# Keep only numeric columns 
numeric_cols = df_raw.select_dtypes(include=[np.number]).columns
df_num = df_raw[numeric_cols]

# Plot a correlation heatmap
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(df_num.corr(), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette("vlag", as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12);

The correlation heatmap shows that the number of projects, monthly hours, and evaluation scores are all positively correlated with one another. Additionally, whether an employee leaves is negatively correlated with their satisfaction level.
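To read the `left` column of the heatmap without scanning the whole grid, the correlations with `left` can be pulled out and sorted. This is a sketch on a made-up frame (`toy` and its values are hypothetical, chosen only so that the signs mirror the description above).

```python
import pandas as pd

# Hypothetical stand-in for the numeric columns; values are illustrative only
toy = pd.DataFrame({
    "satisfaction_level":    [0.9, 0.8, 0.7, 0.4, 0.1, 0.1],
    "number_project":        [3, 4, 3, 2, 6, 7],
    "average_monthly_hours": [160, 180, 170, 140, 280, 300],
    "left":                  [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with `left`, most negative first
corr_with_left = toy.corr()["left"].drop("left").sort_values()

print(corr_with_left.round(2))
```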

### Insights

Based on these observations, it appears that employees are leaving the company because of poor management. Leaving is associated with longer working hours, many projects, and lower satisfaction levels. Being overworked without receiving recognition or positive performance reviews can be disheartening. Additionally, a significant number of employees may be experiencing burnout. Interestingly, those who have worked at the company for over six years tend to stay.

## Model Building

- Determine which models are most appropriate
- Construct the model
- Confirm model assumptions
- Evaluate model results to determine how well your model fits the data
