In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey here and the survey for the DETE here. 

In this project, we'll play the role of data analyst and pretend our stakeholders want to know the following:

1) Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?

2) Are younger employees resigning due to some kind of dissatisfaction? What about older employees?


They want us to combine the results for both surveys to answer these questions. However, although both used the same survey template, one of them customized some of the answers. In the guided steps, we'll aim to do most of the data cleaning and get you started analyzing the first question.

In [1]:
import pandas as pd
import numpy as np

dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')

dete_survey.info()
print('\n')
print(dete_survey.head())
print('\n')
print(dete_survey.isnull().sum())

print('*********************************************')

tafe_survey.info()
print('\n')
print(tafe_survey.head())
print('\n')
print(tafe_survey.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             822 non-null object
DETE Start Date                        822 non-null object
Role Start Date                        822 non-null object
Position                               817 non-null object
Classification                         455 non-null object
Region                                 822 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environ

Markdown cell for findings

In [2]:
dete_survey = pd.read_csv('dete_survey.csv', na_values = 'Not Stated')


dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis = 1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)


# One-off lookup used to find the index of the single NaN in 'separationtype':
#tafe_survey_updated.loc[pd.isna(tafe_survey_updated["separationtype"]), :].index

tafe_survey_updated = tafe_survey_updated.drop(324) # drop the row whose 'separationtype' is NaN

We took out the columns that we did not need to answer the questions. 
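The `na_values` argument used in the cell above is what turns the literal string 'Not Stated' into a proper missing value. A minimal sketch of the effect, assuming a made-up three-row sample in place of `dete_survey.csv` (the rows are hypothetical, only the column names come from the survey):

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for dete_survey.csv.
sample_csv = io.StringIO(
    "ID,Cease Date,DETE Start Date\n"
    "1,08/2012,1984\n"
    "2,08/2012,Not Stated\n"
    "3,Not Stated,2011\n"
)

# na_values tells read_csv to read the literal string 'Not Stated' as NaN.
sample = pd.read_csv(sample_csv, na_values="Not Stated")

print(sample.isnull().sum())  # missing values per column
print(sample.dtypes)          # the start-date column can now be numeric
```

Without `na_values`, the start-date column would be read as text because of the 'Not Stated' entries, and the later date arithmetic would not work.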

In [3]:
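# Strip whitespace before replacing spaces, so a trailing space is removed rather than turned into '_'
dete_survey_updated.columns = dete_survey_updated.columns.str.strip().str.lower().str.replace(' ', '_')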

dete_survey_updated.columns


rename_cols = {'Record ID':'id','CESSATION YEAR':'cease_date','Reason for ceasing employment': 'separationtype',
              'Gender. What is your Gender?': 'gender','CurrentAge. Current Age': 'age','Employment Type. Employment Type': 'employment_status',
              'Classification. Classification': 'position',
              'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
              'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

tafe_survey_updated=tafe_survey_updated.rename(rename_cols, axis = 1)

tafe_survey_updated.columns

print(dete_survey.head())

print(tafe_survey.head())

   ID                    SeparationType Cease Date  DETE Start Date  \
0   1             Ill Health Retirement    08/2012           1984.0   
1   2  Voluntary Early Retirement (VER)    08/2012              NaN   
2   3  Voluntary Early Retirement (VER)    05/2012           2011.0   
3   4         Resignation-Other reasons    05/2012           2005.0   
4   5                    Age Retirement    05/2012           1970.0   

   Role Start Date                                      Position  \
0           2004.0                                Public Servant   
1              NaN                                Public Servant   
2           2011.0                               Schools Officer   
3           2006.0                                       Teacher   
4           1989.0  Head of Curriculum/Head of Special Education   

  Classification              Region                      Business Unit  \
0        A01-A04      Central Office  Corporate Strategy and Peformance   
1        AO5-A

We want to combine the two datasets, so the columns they share need the same standardised names in both.
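The order of the string operations matters when standardising names. The sketch below uses three hypothetical column names, one with a trailing space, to show why stripping should come before replacing spaces:

```python
import pandas as pd

# Hypothetical column names; the last one carries a trailing space.
columns = pd.Index(["SeparationType", "Cease Date", "DETE Start Date "])

# Strip first: the trailing space is removed before spaces become underscores.
cleaned = columns.str.strip().str.lower().str.replace(" ", "_")

# Replace first: the trailing space has already become '_' when strip() runs.
replaced_first = columns.str.replace(" ", "_").str.strip().str.lower()

print(list(cleaned))
print(list(replaced_first))
```

The second variant leaves a trailing underscore on the last name, which would then fail to match the same column in the other dataset.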

In [4]:
dete_survey_updated['separationtype'].value_counts()

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

In [5]:
tafe_survey_updated['separationtype'].value_counts()

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separationtype, dtype: int64

In [6]:
pattern = 'Resignation'

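# .copy() returns independent DataFrames, so later column assignments do not raise SettingWithCopyWarning
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'].str.contains(pattern)].copy()

tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'].str.contains(pattern)].copy()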

print(dete_resignations['separationtype'].value_counts())

tafe_resignations['separationtype'].value_counts()


Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Name: separationtype, dtype: int64


Resignation    340
Name: separationtype, dtype: int64

One row in the TAFE data (index 324) has NaN as its reason for separation. I had to look up its index manually and drop it, because the missing value got in the way of the string filter. Apart from that, we now have dete_resignations and tafe_resignations, each containing only rows whose separation type includes a resignation. So we are one step closer to being able to combine the data.
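The manual lookup of the NaN row could probably be avoided with the `na` parameter of `Series.str.contains`. This is a sketch on a hypothetical four-row sample, not the code that was run in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing separation type.
sample = pd.DataFrame(
    {
        "separationtype": [
            "Resignation",
            "Retirement",
            np.nan,
            "Resignation-Other reasons",
        ]
    }
)

# na=False makes a missing value count as "no match", so the mask is purely boolean.
mask = sample["separationtype"].str.contains("Resignation", na=False)

# .copy() keeps the filtered frame independent of the original.
resignations = sample[mask].copy()

print(resignations)
```

With this approach the row with the missing value is simply excluded by the filter, so its index never has to be known.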

In [7]:
print(dete_resignations['cease_date'].value_counts(dropna= False))

pattern = r'([0-9]{2}/)'
dete_resignations = dete_resignations[dete_resignations.cease_date.notnull()] #removing NaN values first
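# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.replace(pattern, '', regex=True)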
print(dete_resignations['cease_date'].value_counts(dropna= False))


2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
NaN         11
09/2013     11
07/2013      9
11/2013      9
10/2013      6
08/2013      4
05/2013      2
05/2012      2
2010         1
07/2006      1
09/2010      1
07/2012      1
Name: cease_date, dtype: int64
2013    146
2012    129
2014     22
2010      2
2006      1
Name: cease_date, dtype: int64


In [8]:
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(float)
dete_resignations['cease_date'].value_counts()

2013.0    146
2012.0    129
2014.0     22
2010.0      2
2006.0      1
Name: cease_date, dtype: int64
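The two cleaning cells above remove the month prefix and then convert to float. An alternative sketch, assuming the same 'MM/YYYY' and 'YYYY' formats seen in the value counts, pulls the four-digit year out directly with `Series.str.extract` and keeps missing values as NaN instead of dropping them (the sample values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering the formats seen in cease_date.
cease = pd.Series(["2012", "01/2014", "12/2013", np.nan])

# Capture the first run of four digits; rows without a match stay missing.
years = cease.str.extract(r"([0-9]{4})", expand=False).astype(float)

print(years.value_counts(dropna=False).sort_index())
```

Whether to keep or drop the rows with a missing cease date is a separate decision; this variant only postpones it.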

In [9]:
print(tafe_resignations['cease_date'].value_counts())

2011.0    116
2012.0     94
2010.0     68
2013.0     55
2009.0      2
Name: cease_date, dtype: int64


In [10]:
import matplotlib.pyplot as plt

boxplot = dete_resignations.boxplot(column=['cease_date'])
plt.show()  # explicitly ask matplotlib to render the figure




I tried a box plot, but the figure does not appear in this export. The value counts are enough on their own, though: they make it easy to see that there are no outlying years in either dataset.
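A check for outlying years does not have to rely on a plot. The sketch below uses hypothetical values, and the 2000 to 2014 window is an assumption chosen for illustration, not a rule from the survey:

```python
import pandas as pd

# Hypothetical cease years; 1912 stands in for a data-entry error.
cease_years = pd.Series([2012.0, 2013.0, 2014.0, 2010.0, 1912.0])

# Summary statistics expose the minimum and maximum at a glance.
print(cease_years.describe())

# Assumed plausible window (inclusive); adjust to the data at hand.
plausible = cease_years.between(2000, 2014)

# Anything outside the window is listed for manual review.
print(cease_years[~plausible])
```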


In [13]:
dete_resignations['institute_service']=dete_resignations['cease_date']-dete_resignations['dete_start_date']

dete_resignations['institute_service'].value_counts().sort_index()

0.0     20
1.0     22
2.0     14
3.0     20
4.0     16
5.0     23
6.0     17
7.0     13
8.0      8
9.0     14
10.0     6
11.0     4
12.0     6
13.0     8
14.0     6
15.0     7
16.0     5
17.0     6
18.0     5
19.0     3
20.0     7
21.0     3
22.0     6
23.0     4
24.0     4
25.0     2
26.0     2
27.0     1
28.0     2
29.0     1
30.0     2
31.0     1
32.0     3
33.0     1
34.0     1
35.0     1
36.0     2
38.0     1
39.0     3
41.0     1
42.0     1
49.0     1
Name: institute_service, dtype: int64
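The subtraction used for `institute_service` propagates missing values: if either date is NaN, the result is NaN and the row does not appear in `value_counts()`. A small sketch on hypothetical rows, which also checks for negative service lengths that would point to inconsistent dates:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a normal case, a missing start date, and a same-year exit.
sample = pd.DataFrame(
    {
        "cease_date": [2012.0, 2013.0, 2014.0],
        "dete_start_date": [2005.0, np.nan, 2014.0],
    }
)

# Years of service; NaN in either column yields NaN here.
sample["institute_service"] = sample["cease_date"] - sample["dete_start_date"]

# A negative value would mean the start date is later than the cease date.
negative = sample[sample["institute_service"] < 0]

print(sample)
print(len(negative))
```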