# Interactive investigation of generated employee data

IBM has released a generated dataset of employee characteristics, suitable for HR analytics (notably: predicting churn, which is when employees leave voluntarily, often for a position at another company).

This dataset was released on Kaggle and is provided here as `archive.zip` for your convenience.

The aim of this notebook is to show how to create some interactive visualisations, which might be convenient for exploring data.

Subsequently, the most informative visualisations can be collected in a separate notebook and shared with stakeholders, such as the HR Department in this case.

In [13]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Also import Jupyter's ipywidgets package, so you can have interactive control later
import ipywidgets as widgets

In [14]:
# Note: pandas can also read a CSV directly from a zip archive, without needing to unzip it first.
# Here the CSV file is read from its path.
df = pd.read_csv('Week05/data/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
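
Since the data is also supplied as a zip archive, here is a minimal sketch of how pandas can read a CSV straight from a zip file. The archive name, member name and values below are made up for illustration; the example builds its own tiny archive so that it runs anywhere. Passing the archive path directly relies on the archive holding exactly one file.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive so the example is self-contained.
# The file names and values are hypothetical, not the real dataset.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "demo_archive.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("employees.csv", "Age,Attrition\n41,Yes\n49,No\n")

# Option 1: the archive holds a single file, so pandas can read it directly
# (the compression is inferred from the ".zip" extension).
demo_from_zip = pd.read_csv(zip_path)

# Option 2: open a named member explicitly, useful when the archive holds several files.
with zipfile.ZipFile(zip_path) as zf:
    with zf.open("employees.csv") as member:
        demo_from_member = pd.read_csv(member)

print(demo_from_zip)
```

Both options should give the same DataFrame; the second is the safer choice when you do not control what else is in the archive.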


The dataset is clean, and the column names and datatypes are OK, so we can proceed to visualise the data.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

Next, we group the columns by type: the object columns (soon to be converted to categorical) and the numeric (`int64`) columns, so that each group can be investigated with suitable plot types.

In [16]:
cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype == 'int64']
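
As an aside, a similar grouping can be sketched with `select_dtypes`, which selects columns by kind rather than by an exact dtype string. The small DataFrame below is made up for illustration. Note that `include="number"` would also pick up float columns, so it is not strictly identical to testing for `int64`.

```python
import pandas as pd

# A made-up miniature frame standing in for the real data
demo = pd.DataFrame(
    {
        "Age": [41, 49, 37],
        "Department": ["Sales", "Research & Development", "Research & Development"],
        "DailyRate": [1102, 279, 1373],
    }
)

# Numeric columns (integers and floats) versus everything else
num_cols_demo = demo.select_dtypes(include="number").columns.tolist()
cat_cols_demo = demo.select_dtypes(exclude="number").columns.tolist()

print(num_cols_demo)
print(cat_cols_demo)
```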

Using the ideas in [this answer](https://stackoverflow.com/a/41644154/1988855), here is a one-liner that converts all object columns to categories, which is more convenient in this case. NB: students are advised to test such one-liners first and to apply them only when they are sure that this is appropriate.

In [17]:
df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Age                       1470 non-null   int64   
 1   Attrition                 1470 non-null   category
 2   BusinessTravel            1470 non-null   category
 3   DailyRate                 1470 non-null   int64   
 4   Department                1470 non-null   category
 5   DistanceFromHome          1470 non-null   int64   
 6   Education                 1470 non-null   int64   
 7   EducationField            1470 non-null   category
 8   EmployeeCount             1470 non-null   int64   
 9   EmployeeNumber            1470 non-null   int64   
 10  EnvironmentSatisfaction   1470 non-null   int64   
 11  Gender                    1470 non-null   category
 12  HourlyRate                1470 non-null   int64   
 13  JobInvolvement            1470 non-null   int64 
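
To follow the advice about testing first, one option is to try the conversion on a small made-up frame and inspect the result before applying it to the real data. The sketch below is illustrative only: the column values are invented, and it selects the non-numeric columns instead of matching on the `object` dtype.

```python
import pandas as pd

# A made-up miniature frame for trying out the conversion safely
demo = pd.DataFrame({"Attrition": ["Yes", "No", "Yes"], "Age": [41, 49, 37]})

# Convert every non-numeric column to the category dtype
text_cols = demo.select_dtypes(exclude="number").columns
demo[text_cols] = demo[text_cols].astype("category")

# Inspect the outcome: the dtype of each column and the distinct categories found
print(demo.dtypes)
categories = list(demo["Attrition"].cat.categories)
print(categories)
```

Checking `.cat.categories` is a quick way to spot surprises such as misspelt or duplicated labels before they show up in a plot legend.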

## Categorical column analysis

In [18]:

ddCol = widgets.Dropdown(options=cat_col, value=cat_col[0], description="Column")
ddHue = widgets.Dropdown(options=cat_col, value=cat_col[0], description="Hue")
orient = ["h", "v"]
ddOrient = widgets.Dropdown(options=orient, value=orient[0], description="Orient")

uiControls = widgets.HBox([ddCol, ddHue, ddOrient])

Now define a wrapper function around countplot, to use the dropdown values supplied interactively by the user from the uiControls widgets.

In [19]:
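def doCountplot(column, hue, orient):
    # Categories go on the y axis for horizontal bars ("h"), otherwise on the x axis
    axis = "y" if orient == "h" else "x"
    sns.countplot(data=df, hue=hue, **{axis: column})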

Now create the interactive plot and display the ui controls and the interactive plot together. Note that when the dropdown values change, the plot is redrawn automatically.

In [20]:
out = widgets.interactive_output(doCountplot, {"column":ddCol, "hue":ddHue, "orient":ddOrient})
display(uiControls, out)

HBox(children=(Dropdown(description='Column', options=('Attrition', 'BusinessTravel', 'Department', 'Education…

Output()
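
The automatic redraw relies on the fact that each control exposes its current selection as a `value` trait that can be observed. The sketch below shows that idea in isolation, with a made-up dropdown and a callback that simply records the new values. It is a simplified picture of what drives the redraw, not the library's internal code.

```python
import ipywidgets as widgets

# A stand-alone dropdown with made-up options
ddDemo = widgets.Dropdown(options=["Attrition", "Gender"], value="Attrition", description="Column")

seen = []

def on_change(change):
    # `change` is a dict-like object; "new" holds the newly selected value
    seen.append(change["new"])

# Call on_change whenever the dropdown's value changes
ddDemo.observe(on_change, names="value")

# Changing the value from code triggers the callback, as a user selection would
ddDemo.value = "Gender"
print(seen)
```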

Note that the formatting sometimes needs to be improved because the default settings, as above, do not suit the data, e.g., if the category has a large number of values. However, this is a good way to investigate many options quickly, with very little coding.
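
As one possible formatting improvement, the sketch below widens the figure and rotates the tick labels so that many or long category names stay readable. The data and the `JobRole` column name are made up for illustration, and the figure size and rotation angle are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data with long category labels
demo = pd.DataFrame(
    {
        "JobRole": [
            "Sales Executive",
            "Research Scientist",
            "Laboratory Technician",
            "Research Scientist",
            "Manufacturing Director",
            "Sales Executive",
        ]
    }
)

# Create the axes explicitly so that the size and tick labels can be controlled
fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(data=demo, x="JobRole", ax=ax)

# Rotate the category labels and let matplotlib adjust the margins
ax.tick_params(axis="x", labelrotation=45)
fig.tight_layout()
```

An alternative with a similar effect is to put the categories on the `y` axis, which gives horizontal bars with labels that read naturally.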

## Numerical column analysis

We can also look at the numerical columns in `num_col`. Seaborn offers many plot types for numeric data. Here we consider one of the simplest: [relplot](https://seaborn.pydata.org/generated/seaborn.relplot.html), which is a generalisation of scatterplots and lineplots.

We set up dropdowns as before, this time for the `x` and `y` axis data for the plot, as well as the (categorical-valued) `hue`, which can be used for grouping as before.

To make things interesting, we also introduce the ability to filter the data so that the rows considered have Age less than or equal to what is chosen in the slider. The default setting is the maximum age found in the data, which has the effect of including all rows.
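
The filter itself is ordinary boolean indexing. Here is a minimal sketch on a made-up frame, with a hypothetical helper `filter_by_max_age`, showing that choosing the maximum age keeps every row while a lower value keeps only the younger employees.

```python
import pandas as pd

# A made-up miniature frame standing in for the real data
demo = pd.DataFrame({"Age": [27, 33, 41, 49], "DailyRate": [591, 1392, 1102, 279]})

def filter_by_max_age(frame, max_age):
    # Keep only the rows whose Age is less than or equal to max_age
    return frame[frame["Age"] <= max_age]

# The default slider position (the maximum age) includes all rows
all_rows = filter_by_max_age(demo, demo["Age"].max())

# A lower setting drops the older employees
younger = filter_by_max_age(demo, 35)

print(len(all_rows), len(younger))
```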

In [21]:
ddX = widgets.Dropdown(options=num_col, value=num_col[0], description="x")
ddY = widgets.Dropdown(options=num_col, value=num_col[0], description="y")
ddHue = widgets.Dropdown(options=cat_col, value=cat_col[0], description="Hue")
slAge = widgets.IntSlider(value=df["Age"].max() , min= df["Age"].min(), max= df["Age"].max(), description='MaxAge')

upperBox = widgets.HBox([ddX, ddY])
lowerBox = widgets.HBox([ddHue, slAge])
uiControls = widgets.VBox([upperBox, lowerBox])

Now we define the `relplot` wrapping function, including the maxAge filter.

In [22]:
def doRelplot(col1, col2, hue, maxAge):
    p = sns.relplot(data=df[df["Age"]<=maxAge], x=col1, y=col2, hue=hue)

As before, we create the interactive plot and display it together with the uiControls.

In [23]:
out = widgets.interactive_output(doRelplot, {"col1":ddX, "col2":ddY, "hue":ddHue, "maxAge":slAge})
display(uiControls, out)

VBox(children=(HBox(children=(Dropdown(description='x', options=('Age', 'DailyRate', 'DistanceFromHome', 'Educ…

Output()

## Exercise

1. This is a rich dataset. Try some of the visualisation plot types that were shown in class, choosing suitable UI controls to enter parameter values, writing wrappers and exploring the data with your interactive plots, noting why you used that visualisation and what you found.

2. If you wished to do something similar programmatically, how would you do this? NB: You can do better than copying and pasting code many times!

In [26]:
ddX = widgets.Dropdown(options=num_col, value=num_col[0], description="x")
ddY = widgets.Dropdown(options=num_col, value=num_col[0], description="y")
ddHue = widgets.Dropdown(options=cat_col, value=cat_col[0], description="Hue")
slAge = widgets.IntSlider(value=df["Age"].max() , min= df["Age"].min(), max= df["Age"].max(), description='MaxAge')

upperBox = widgets.HBox([ddX, ddY])
lowerBox = widgets.HBox([ddHue, slAge])
uiControls = widgets.VBox([upperBox, lowerBox])

def doRelplot(col1, col2, hue, maxAge):
    # Scatter plot of col1 against col2, restricted to employees no older than maxAge
    p = sns.relplot(data=df[df["Age"]<=maxAge], x=col1, y=col2, hue=hue)

out = widgets.interactive_output(doRelplot, {"col1":ddX, "col2":ddY, "hue":ddHue, "maxAge":slAge})
display(uiControls, out)

VBox(children=(HBox(children=(Dropdown(description='x', options=('Age', 'DailyRate', 'DistanceFromHome', 'Educ…

Output()

In [24]:
ddCol = widgets.Dropdown(options=cat_col, value=cat_col[0], description="Column")
ddHue = widgets.Dropdown(options=cat_col, value=cat_col[0], description="Hue")
orient = ["h", "v"]
ddOrient = widgets.Dropdown(options=orient, value=orient[0], description="Orient")

uiControls = widgets.HBox([ddCol, ddHue, ddOrient])

def doCountplot(column, hue, orient):
    # Categories go on the y axis for horizontal bars ("h"), otherwise on the x axis
    axis = "y" if orient == "h" else "x"
    sns.countplot(data=df, hue=hue, **{axis: column})

out = widgets.interactive_output(doCountplot, {"column":ddCol, "hue":ddHue, "orient":ddOrient})
display(uiControls, out)

HBox(children=(Dropdown(description='Column', options=('Attrition', 'BusinessTravel', 'Department', 'Education…

Output()