## Data Background Info
The data comes from exit surveys of employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. The original TAFE exit survey data is no longer available. We've made some slight modifications to the original datasets to make them easier to work with, including changing the encoding to UTF-8 (the originals are encoded using cp1252).
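
For reference, a file still in the original cp1252 encoding could be read by passing the encoding explicitly. The snippet below is a minimal sketch on made-up in-memory bytes; the column names and values are hypothetical and not taken from the survey files.

```python
import io
import pandas as pd

# Hypothetical sample: cp1252 stores the right single quote (U+2019) as the
# single byte 0x92, which is not valid UTF-8 on its own.
raw_bytes = "ID,Comment\n1,Didn\u2019t say\n".encode("cp1252")

# Passing encoding="cp1252" tells pandas how to decode the bytes.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="cp1252")
print(sample.loc[0, "Comment"])
```

The files used in this notebook are already UTF-8, so the plain `pd.read_csv` calls below need no `encoding` argument.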

## Column Titles

Below is a preview of a few of the columns we'll work with from **dete_survey.csv**:

* **ID:** An ID used to identify the participant of the survey
* **SeparationType:** The reason why the person's employment ended
* **Cease Date:** The year or month the person's employment ended
* **DETE Start Date:** The year the person began employment with the DETE

Below is a preview of a few of the columns we'll work with from **tafe_survey.csv**:

* **Record ID:** An ID used to identify the participant of the survey
* **Reason for ceasing employment:** The reason why the person's employment ended
* **LengthofServiceOverall. Overall Length of Service at Institute (in years):** The length of the person's employment (in years)

## Goal
**Answer the Question:**
Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been at the job longer?

In [1]:
import pandas as pd
import numpy as np

dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')

In [2]:
# First Impressions for dete_survey
dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ID                                   822 non-null    int64 
 1   SeparationType                       822 non-null    object
 2   Cease Date                           822 non-null    object
 3   DETE Start Date                      822 non-null    object
 4   Role Start Date                      822 non-null    object
 5   Position                             817 non-null    object
 6   Classification                       455 non-null    object
 7   Region                               822 non-null    object
 8   Business Unit                        126 non-null    object
 9   Employment Status                    817 non-null    object
 10  Career move to public sector         822 non-null    bool  
 11  Career move to private sector        822 non-

In [3]:
print(dete_survey.isnull().sum())

ID                                       0
SeparationType                           0
Cease Date                               0
DETE Start Date                          0
Role Start Date                          0
Position                                 5
Classification                         367
Region                                   0
Business Unit                          696
Employment Status                        5
Career move to public sector             0
Career move to private sector            0
Interpersonal conflicts                  0
Job dissatisfaction                      0
Dissatisfaction with the department      0
Physical work environment                0
Lack of recognition                      0
Lack of job security                     0
Work location                            0
Employment conditions                    0
Maternity/family                         0
Relocation                               0
Study/Travel                             0
Ill Health 

In [4]:
#Quick exploration of the data
pd.options.display.max_columns = 150 # to avoid truncated output 
print(dete_survey.shape)
dete_survey.head()

(822, 56)


Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,Career move to public sector,Career move to private sector,Interpersonal conflicts,Job dissatisfaction,Dissatisfaction with the department,Physical work environment,Lack of recognition,Lack of job security,Work location,Employment conditions,Maternity/family,Relocation,Study/Travel,Ill Health,Traumatic incident,Work life balance,Workload,None of the above,Professional Development,Opportunities for promotion,Staff morale,Workplace issue,Physical environment,Worklife balance,Stress and pressure support,Performance of supervisor,Peer support,Initiative,Skills,Coach,Career Aspirations,Feedback,Further PD,Communication,My say,Information,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984,2004,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,A,A,N,N,N,A,A,A,A,N,N,N,A,A,A,N,A,A,N,N,N,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,Not Stated,Not Stated,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,A,A,N,N,N,N,A,A,A,N,N,N,A,A,A,N,A,A,N,N,N,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011,2011,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,A,A,N,N,N,N,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005,2006,Teacher,Primary,Central Queensland,,Permanent Full-time,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,A,N,N,N,A,A,N,N,A,A,A,A,A,A,A,A,A,A,A,N,A,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970,1989,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,A,A,N,N,D,D,N,A,A,A,A,A,A,SA,SA,D,D,A,N,A,M,Female,61 or older,,,,,


In [5]:
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')
print(dete_survey.shape)
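
The effect of `na_values` in the cell above can be checked on a small sample. This is a sketch on hypothetical in-memory data (three made-up rows), not the survey file itself.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the 'Not Stated' placeholder.
csv_text = "ID,DETE Start Date\n1,1984\n2,Not Stated\n3,2011\n"

plain = pd.read_csv(io.StringIO(csv_text))
cleaned = pd.read_csv(io.StringIO(csv_text), na_values="Not Stated")

print(plain["DETE Start Date"].isnull().sum())    # placeholder kept as text
print(cleaned["DETE Start Date"].isnull().sum())  # placeholder parsed as NaN
print(cleaned["DETE Start Date"].dtype)
```

With the placeholder parsed as missing, the remaining values can be read as numbers, which is consistent with the float years shown for `dete_start_date` later in this notebook.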

In [6]:
dropped_cols = dete_survey.columns[28:49]
dete_survey_updated = dete_survey.drop(columns=dropped_cols)
print(dete_survey_updated.shape)

(822, 35)


In [None]:
print(tafe_survey.shape)

In [None]:
dropped_cols = tafe_survey.columns[28:49]
tafe_survey_updated = tafe_survey.drop(columns=dropped_cols)
print(tafe_survey_updated.shape)

Dropped the columns at positions 28 through 48 of each dataframe, which are not relevant to the analysis.
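
Dropping by position relies on slicing the column `Index`, where the start is inclusive and the stop is exclusive. The slice `28:49` therefore selects 21 columns, which matches the DETE shape going from 56 to 35 columns. A minimal sketch on a hypothetical six-column frame:

```python
import pandas as pd

# Hypothetical frame with six columns named c0..c5.
toy = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=[f"c{i}" for i in range(6)])

# Slice the column Index by position: start inclusive, stop exclusive.
dropped_cols = toy.columns[2:4]
toy_updated = toy.drop(columns=dropped_cols)

print(list(dropped_cols))
print(toy_updated.shape)
```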

In [7]:
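# regex=True so that "\s+" is treated as a pattern (recent pandas defaults to literal matching)
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(r"\s+", "_", regex=True)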
print(dete_survey_updated.columns)

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')


Observations based on the work above:

* dete_survey contains 'Not Stated' values that indicate values are missing, but they aren't represented as NaN.
* Both the dete_survey and tafe_survey contain many columns that we don't need to complete our analysis.
* Each dataframe contains many of the same columns, but the column names are different.
* There are multiple columns/answers that indicate an employee resigned because they were dissatisfied.
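
One possible way to act on the last observation is to combine several boolean factor columns into a single flag. The sketch below uses made-up rows; the two factor column names appear in the cleaned DETE columns, while the choice of factors and the `dissatisfied` column name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the DETE data.
toy = pd.DataFrame({
    "job_dissatisfaction": [True, False, False],
    "dissatisfaction_with_the_department": [False, False, True],
    "relocation": [False, True, False],
})

# Assumed subset of columns that signal dissatisfaction.
factor_cols = ["job_dissatisfaction", "dissatisfaction_with_the_department"]

# A row counts as dissatisfied if any of the chosen factors is True.
toy["dissatisfied"] = toy[factor_cols].any(axis=1)
print(toy["dissatisfied"].tolist())
```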

In [8]:
new_col_name_dict = {'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service',
}
tafe_survey_updated = tafe_survey.copy()
tafe_survey_updated.rename(columns=new_col_name_dict, inplace=True)
print(tafe_survey_updated.columns)

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Main Factor. Which of these was the main factor for leaving?',
       'InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction',
       'InstituteViews. Topic:2. I was given access to skills training to help me do my job better',
       'InstituteViews. Topic:3. I was given adequate opportunities for personal developmen

Standardised the DETE column names:
* converted to lowercase
* stripped leading and trailing whitespace
* replaced spaces with underscores

Also renamed a few verbose TAFE column names so that they match the DETE names.
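
Both renaming steps can be tried on small inputs. This is a sketch with hypothetical labels; `"Not A Column"` is a deliberately missing key that shows `rename` leaves unmatched labels alone.

```python
import pandas as pd

# Hypothetical column labels with mixed case and stray whitespace.
cols = pd.Index(["ID", " Cease Date ", "DETE  Start Date"])

# regex=True is needed for "\s+" to be treated as a pattern in recent pandas.
clean = cols.str.lower().str.strip().str.replace(r"\s+", "_", regex=True)
print(list(clean))

# rename() only changes labels present in the mapping; keys that match
# no column are ignored by default.
toy = pd.DataFrame(columns=["Record ID", "Institute"])
renamed = toy.rename(columns={"Record ID": "id", "Not A Column": "x"})
print(list(renamed.columns))
```

Because unmatched keys are ignored silently, a typo in the mapping will not raise an error, so printing the resulting columns is a useful check.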

In [9]:
dete_survey_updated['separationtype'].value_counts()

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

In [10]:
# dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'].str.contains("Resignation")]
dete_resignations = dete_survey_updated.copy()
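# Collapse every 'Resignation-...' variant into one label; regex=True makes ".*" a pattern
dete_resignations['separationtype'] = dete_resignations['separationtype'].str.replace(r"Resignation.*", "Resignation", regex=True)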
dete_resignations['separationtype'].value_counts()

Resignation                         311
Age Retirement                      285
Voluntary Early Retirement (VER)     67
Ill Health Retirement                61
Other                                49
Contract Expired                     34
Termination                          15
Name: separationtype, dtype: int64

In [11]:
tafe_survey_updated['separationtype'].value_counts()

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: separationtype, dtype: int64

In [12]:
# quit_filter = tafe_survey_updated['separationtype'].str.contains("Resignation")
# quit_filter.describe()
# tafe_resignations = tafe_survey_updated[quit_filter].copy()

tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations['separationtype'].value_counts()

Resignation    340
Name: separationtype, dtype: int64

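Filtered tafe_survey_updated to the rows whose 'separationtype' is 'Resignation'. For the DETE data, the three resignation variants were merged into a single 'Resignation' label, but the rows were not filtered, so dete_resignations still holds all 822 rows.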
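
The difference between relabelling and filtering can be seen on a small sample. A sketch on a hypothetical Series that reuses labels from the value counts above; the `None` entry is an assumption added to show how missing values behave.

```python
import pandas as pd

# Hypothetical sample using labels seen in the DETE value counts.
sep = pd.Series([
    "Resignation-Other reasons",
    "Age Retirement",
    "Resignation-Other employer",
    None,
], name="separationtype")

# Relabel: every 'Resignation-...' variant becomes 'Resignation'; row count is unchanged.
collapsed = sep.str.replace(r"Resignation.*", "Resignation", regex=True)

# Filter: na=False makes missing values count as non-matches in the mask.
mask = sep.str.contains("Resignation", na=False)
print(len(collapsed), len(sep[mask]))
```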

In [13]:
dete_resignations.shape

(822, 35)

In [14]:
dete_resignations['cease_date'].value_counts()

2012       344
2013       200
01/2014     43
12/2013     40
09/2013     34
06/2013     27
07/2013     22
10/2013     20
11/2013     16
08/2013     12
05/2013      7
05/2012      6
08/2012      2
04/2014      2
07/2014      2
02/2014      2
04/2013      2
09/2014      1
11/2012      1
07/2006      1
2010         1
07/2012      1
09/2010      1
2014         1
Name: cease_date, dtype: int64

In [15]:
# First attempt (kept for reference, not run): extract the year at the end of every date.
# str.extractall returns one row per match, indexed by (row, match), and omits rows
# with no match, so the result has a different number of rows than the original column.
# res_year = dete_resignations['cease_date'].str.extractall(r"(?P<year>[1-2][0-9]{3})$")
# res_year['year'] = res_year.astype(float)
# dete_resignations['cease_date'] = res_year['year']

In [16]:
# Extract the years and convert them to a float type
# res_year = dete_resignations['cease_date'].str.extractall(r"(?P<year>[1-2][0-9]{3})$")
# dete_resignations['cease_date'] = res_year['year']
# dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(float)

# Working approach: keep the text after the last '/' (the year), then convert to float
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1]
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype("float")
dete_resignations.shape

(822, 35)

In [17]:
dete_resignations['cease_date'].value_counts()

2013.0    380
2012.0    354
2014.0     51
2010.0      2
2006.0      1
Name: cease_date, dtype: int64
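
As an alternative to splitting on `'/'`, `str.extract` keeps one result per input row, whereas `str.extractall` returns one row per match. The sketch below compares the two on a hypothetical sample that covers the formats seen in `cease_date` plus a missing value; it is an illustration, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical sample: month/year, year only, and a missing value.
dates = pd.Series(["08/2012", "2013", None, "01/2014"])

# extractall: one row per match, indexed by (original row, match number);
# rows without a match are left out, so the length can differ from the input.
all_matches = dates.str.extractall(r"(?P<year>[1-2][0-9]{3})$")

# extract with expand=False: one value per input row, NaN where nothing matches.
years = dates.str.extract(r"([1-2][0-9]{3})$", expand=False).astype(float)

print(len(dates), len(all_matches), len(years))
print(years.tolist())
```

Because `years` has the same index as `dates`, it can be assigned back to the original column without alignment problems.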

In [18]:
# Check the unique values and look for outliers
dete_resignations['dete_start_date'].value_counts().sort_values()

1966.0     1
1965.0     1
1967.0     2
1968.0     3
1963.0     4
1982.0     4
1987.0     7
1985.0     8
1973.0     8
1983.0     9
1981.0     9
1971.0    10
1994.0    10
2001.0    10
1984.0    10
1969.0    10
1977.0    11
1972.0    12
1986.0    12
1993.0    13
1997.0    14
1998.0    14
1980.0    14
1974.0    14
1979.0    14
1995.0    14
2003.0    15
2002.0    15
1988.0    15
1976.0    15
1978.0    15
1989.0    17
2000.0    18
1991.0    18
2004.0    18
1992.0    18
1996.0    19
1999.0    19
2005.0    20
1990.0    20
1975.0    21
1970.0    21
2013.0    21
2006.0    23
2009.0    24
2012.0    27
2010.0    27
2008.0    31
2007.0    34
2011.0    40
Name: dete_start_date, dtype: int64
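
The output above is ordered by frequency, which makes it hard to see the earliest and latest years. Sorting by the index instead puts the years in chronological order. A minimal sketch on hypothetical values, where `1863.0` stands in for a data-entry error:

```python
import pandas as pd

# Hypothetical start years; 1863.0 is a made-up outlier.
start = pd.Series([2011.0, 2007.0, 2011.0, 1863.0, 2007.0, 2011.0])

# sort_index() orders the counts by year rather than by frequency,
# so unusually early or late years appear at the ends.
counts_by_year = start.value_counts().sort_index()
print(counts_by_year)
print(start.min(), start.max())
```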