# Introduction to Dataset
In this guided project, we'll work with exit surveys from employees of the **Department of Education, Training and Employment (DETE)** and the **Technical and Further Education (TAFE)** institute in Queensland, Australia.

Some questions to be answered:

1- **Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction**? What about employees who have been there longer?

2- **Are younger employees resigning due to some kind of dissatisfaction?** What about older employees?

# Importing Dataset

In [None]:
import pandas as pd
import numpy as np

dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")



## Dataset info

In [None]:
dete_survey.info()
dete_survey.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   ID                                   822 non-null    int64 
 1   SeparationType                       822 non-null    object
 2   Cease Date                           822 non-null    object
 3   DETE Start Date                      822 non-null    object
 4   Role Start Date                      822 non-null    object
 5   Position                             817 non-null    object
 6   Classification                       455 non-null    object
 7   Region                               822 non-null    object
 8   Business Unit                        126 non-null    object
 9   Employment Status                    817 non-null    object
 10  Career move to public sector         822 non-null    bool  
 11  Career move to private sector        822 non-

Unnamed: 0,ID,SeparationType,Cease Date,DETE Start Date,Role Start Date,Position,Classification,Region,Business Unit,Employment Status,...,Kept informed,Wellness programs,Health & Safety,Gender,Age,Aboriginal,Torres Strait,South Sea,Disability,NESB
0,1,Ill Health Retirement,08/2012,1984,2004,Public Servant,A01-A04,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,Yes
1,2,Voluntary Early Retirement (VER),08/2012,Not Stated,Not Stated,Public Servant,AO5-AO7,Central Office,Corporate Strategy and Peformance,Permanent Full-time,...,N,N,N,Male,56-60,,,,,
2,3,Voluntary Early Retirement (VER),05/2012,2011,2011,Schools Officer,,Central Office,Education Queensland,Permanent Full-time,...,N,N,N,Male,61 or older,,,,,
3,4,Resignation-Other reasons,05/2012,2005,2006,Teacher,Primary,Central Queensland,,Permanent Full-time,...,A,N,A,Female,36-40,,,,,
4,5,Age Retirement,05/2012,1970,1989,Head of Curriculum/Head of Special Education,,South East,,Permanent Full-time,...,N,A,M,Female,61 or older,,,,,


### DETE Survey
The DETE survey has 822 rows and 56 columns.

Some columns have many missing values (non-null counts out of 822 rows):
* Business Unit: 126 non-null
* Classification: 455 non-null
* Aboriginal: 16 non-null
* Torres Strait: 3 non-null
* South Sea: 7 non-null
* Disability: 23 non-null
* NESB: 32 non-null
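
The counts above were read off the `info()` output. As a minimal sketch, the same check can be done programmatically with `isnull().sum()`. The toy frame below is a hypothetical stand-in for `dete_survey`, not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for dete_survey (values are illustrative)
toy = pd.DataFrame({
    "Region": ["Central Office", "South East", "Central Queensland", "Central Office"],
    "Business Unit": ["Education Queensland", np.nan, np.nan, np.nan],
    "Classification": ["Primary", np.nan, "A01-A04", np.nan],
})

# Count missing values per column, most incomplete column first
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)
```

Running the same two calls on `dete_survey` should rank the columns by how incomplete they are.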


In [None]:
tafe_survey.info()
tafe_survey.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
 #   Column                                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                                         --------------  -----  
 0   Record ID                                                                                                                                                      702 non-null    float64
 1   Institute                                                                                                                                                      702 non-null    object 
 2   WorkArea                                                                                                                                  

Unnamed: 0,Record ID,Institute,WorkArea,CESSATION YEAR,Reason for ceasing employment,Contributing Factors. Career Move - Public Sector,Contributing Factors. Career Move - Private Sector,Contributing Factors. Career Move - Self-employment,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,...,Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?,Workplace. Topic:Does your workplace promote and practice the principles of employment equity?,Workplace. Topic:Does your workplace value the diversity of its employees?,Workplace. Topic:Would you recommend the Institute as an employer to others?,Gender. What is your Gender?,CurrentAge. Current Age,Employment Type. Employment Type,Classification. Classification,LengthofServiceOverall. Overall Length of Service at Institute (in years),LengthofServiceCurrent. Length of Service at current workplace (in years)
0,6.34133e+17,Southern Queensland Institute of TAFE,Non-Delivery (corporate),2010.0,Contract Expired,,,,,,...,Yes,Yes,Yes,Yes,Female,26 30,Temporary Full-time,Administration (AO),1-2,1-2
1,6.341337e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
2,6.341388e+17,Mount Isa Institute of TAFE,Delivery (teaching),2010.0,Retirement,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
3,6.341399e+17,Mount Isa Institute of TAFE,Non-Delivery (corporate),2010.0,Resignation,-,-,-,-,-,...,Yes,Yes,Yes,Yes,,,,,,
4,6.341466e+17,Southern Queensland Institute of TAFE,Delivery (teaching),2010.0,Resignation,-,Career Move - Private Sector,-,-,-,...,Yes,Yes,Yes,Yes,Male,41 45,Permanent Full-time,Teacher (including LVT),3-4,3-4


### TAFE Survey
This dataset has 702 rows and 72 columns.
* Many columns have very long names that need to be shortened.

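# Importing the Updated Dataset

The DETE survey contains `Not Stated` entries, so we read the file again and treat that string as a missing value.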

In [None]:
dete_survey = pd.read_csv("dete_survey.csv",na_values='Not Stated')
tafe_survey = pd.read_csv("tafe_survey.csv")
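
A small sketch of what `na_values` changes, assuming a hypothetical three-row CSV held in memory instead of the real `dete_survey.csv`:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV; the real file is dete_survey.csv
csv_text = "ID,DETE Start Date\n1,1984\n2,Not Stated\n3,2011\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values="Not Stated")

# Without na_values the placeholder is kept as text; with it, the cell becomes NaN
print(raw["DETE Start Date"].isnull().sum())
print(clean["DETE Start Date"].isnull().sum())
print(clean["DETE Start Date"].dtype)
```

With the placeholder converted to NaN, the column can be parsed as numeric, which makes the later year arithmetic possible.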


## Removing Columns

In [None]:
#removing unnecessary columns
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49],axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66],axis=1)

#dete_survey_updated.head()
#tafe_survey_updated.head()

## Cleaning Column Names

| **dete_survey** | **tafe_survey**                                                           | **Definition**                                          |
|------------------|---------------------------------------------------------------------------|---------------------------------------------------------|
| ID               | Record ID                                                                 | An id used to identify the participant of the survey    |
| SeparationType   | Reason for ceasing employment                                             | The reason why the participant's employment ended       |
| Cease Date       | CESSATION YEAR                                                            | The year or month the participant's employment ended    |
| DETE Start Date  |                                                                           | The year the participant began employment with the DETE |
|                  | LengthofServiceOverall. Overall Length of Service at Institute (in years) | The length of the person's employment (in years)        |
| Age              | CurrentAge. Current Age                                                   | The age of the participant                              |
| Gender           | Gender. What is your Gender?                                              | The gender of the participant                           |

Because we eventually want to combine the two dataframes, we'll have to standardize their column names.

### Renaming Dete Survey
1- Rename the remaining columns in the `dete_survey_updated` dataframe.
  * Convert all names to lowercase.
  * Remove any leading or trailing whitespace.
  * Replace spaces with underscores ('_').

In [None]:
# Lowercase and strip first, then turn each run of whitespace into one underscore
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(r'\s+', '_', regex=True)
dete_survey_updated.columns

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')
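
The order of the string operations matters when a name carries stray whitespace. The sketch below uses hypothetical column names (the padding is invented for illustration) to compare stripping before and after the replacement:

```python
import pandas as pd

# Hypothetical names with stray whitespace (not taken from the real file)
cols = pd.Index(["Cease Date ", "DETE Start Date", "  Role  Start Date"])

# Strip first, then collapse each run of whitespace into a single underscore
cleaned = cols.str.lower().str.strip().str.replace(r"\s+", "_", regex=True)

# Replacing spaces first turns the padding into underscores that strip() cannot remove
naive = cols.str.replace(" ", "_", regex=False).str.strip().str.lower()

print(list(cleaned))
print(list(naive))
```

The first approach yields clean names, while the second leaves leading, trailing, and doubled underscores behind.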

### Renaming Tafe Survey
1- Use the `DataFrame.rename()` method to update the columns below in `tafe_survey_updated`:
* 'Record ID': 'id'
* 'CESSATION YEAR': 'cease_date'
* 'Reason for ceasing employment': 'separationtype'
* 'Gender. What is your Gender?': 'gender'
* 'CurrentAge. Current Age': 'age'
* 'Employment Type. Employment Type': 'employment_status'
* 'Classification. Classification': 'position'
* 'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service'
* 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'

In [None]:
rename_columns = {'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

tafe_survey_updated = tafe_survey_updated.rename(columns=rename_columns)
tafe_survey_updated.columns

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')

# Identifying Resignation Types

1- Use the `Series.value_counts()` method to review the unique values in the `separationtype` column in both `dete_survey_updated` and `tafe_survey_updated`.

2- In each dataframe, select only the data for survey respondents who have a Resignation separation type.

3- Assign the result for `dete_survey_updated` to `dete_resignations`.

4- Assign the result for `tafe_survey_updated` to `tafe_resignations`.

In [None]:
# Update all separation types containing the word "resignation" to 'Resignation'
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]

# Check the values in the separationtype column were updated correctly
dete_survey_updated['separationtype'].value_counts()

# Select only the resignation separation types from each dataframe
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()

#print(dete_resignations,'\n')
#print(tafe_resignations)
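
To illustrate the `str.split('-').str[0]` step in isolation, here is a minimal sketch on a hand-made Series. The value `Resignation-Other employer` is an assumed example of a second resignation subtype; the other values appear in the output shown earlier:

```python
import pandas as pd

sep = pd.Series([
    "Resignation-Other reasons",
    "Resignation-Other employer",  # assumed subtype, for illustration only
    "Age Retirement",
    "Voluntary Early Retirement (VER)",
])

# Keep only the text before the first hyphen, so every subtype becomes 'Resignation'
collapsed = sep.str.split("-").str[0]
resigned = collapsed[collapsed == "Resignation"]

print(collapsed.value_counts())
```

Values without a hyphen pass through unchanged, so the other separation types stay intact.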

# Verifying the Data

We'll focus on verifying that the years in the `cease_date` and `dete_start_date` columns make sense.

* Since the `cease_date` is the last year of the person's employment and the `dete_start_date` is the person's first year of employment, it wouldn't make sense to have years after the current date.

* Given that most people in this field start working in their 20s, it's also unlikely that the `dete_start_date` was before the year 1940.

Check: **1940 < start_date < current year**

1- Clean the `cease_date` column in `dete_resignations`.

  * Use the `Series.value_counts()` method to view the unique values in the `cease_date` column.

  * Use vectorized string methods to extract the year.
  * Use the `Series.astype()` method to convert the type to float.
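
The extraction steps above can be sketched on a few sample strings. The `MM/YYYY` values mirror the format seen in the `Cease Date` column; the bare year and the missing entry are assumed cases added for illustration:

```python
import pandas as pd

# '08/2012' and '05/2012' follow the format shown earlier; '2013' and None are assumed cases
dates = pd.Series(["08/2012", "05/2012", "2013", None])

# Capture a four-digit year starting with 1 or 2, then convert to float
# (float rather than int, so that missing values can stay as NaN)
years = dates.str.extract(r"([1-2][0-9]{3})", expand=False).astype(float)

print(years)
```

Passing `expand=False` makes `str.extract` return a Series instead of a one-column DataFrame, which keeps the assignment back to a single column straightforward.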

In [None]:
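# Extract the four-digit year from values such as '08/2012' and convert it to float.
# Run this once on the raw strings: the .str accessor fails if the column is already numeric.
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(r'([1-2][0-9]{3})', expand=False).astype(float)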

#counting years in cease_date
print(dete_resignations['cease_date'].value_counts().sort_values())#dete employee resignation year
print(dete_resignations['dete_start_date'].value_counts().sort_values()) #dete employee start year
print(tafe_resignations['cease_date'].value_counts().sort_values())


cease_date
2006.0      1
2010.0      2
2014.0     22
2012.0    129
2013.0    146
Name: count, dtype: int64
dete_start_date
1987.0     1
1975.0     1
1984.0     1
1971.0     1
1973.0     1
1972.0     1
1963.0     1
1977.0     1
1982.0     1
1974.0     2
1983.0     2
1976.0     2
1985.0     3
2001.0     3
1986.0     3
1995.0     4
1988.0     4
1991.0     4
1989.0     4
1993.0     5
1980.0     5
1990.0     5
1997.0     5
2002.0     6
1998.0     6
1996.0     6
1992.0     6
2003.0     6
1994.0     6
1999.0     8
2000.0     9
2013.0    10
2006.0    13
2009.0    13
2004.0    14
2005.0    15
2010.0    17
2007.0    21
2012.0    21
2008.0    22
2011.0    24
Name: count, dtype: int64
cease_date
2009.0      2
2013.0     55
2010.0     68
2012.0     94
2011.0    116
Name: count, dtype: int64
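
Instead of scanning the counts by eye, the rule **1940 < start_date < current year** can also be checked directly. This is a minimal sketch on made-up years (1900 is a deliberately implausible value), assuming pandas 1.3 or later for the string form of `inclusive`:

```python
import datetime
import pandas as pd

# Made-up start years; 1900.0 is deliberately implausible, None stands for a missing value
start_years = pd.Series([1963.0, 1987.0, 2013.0, 1900.0, None])

current_year = datetime.date.today().year

# True where the year lies strictly between 1940 and the current year
plausible = start_years.between(1940, current_year, inclusive="neither")

# Flag known values outside the range; missing values are not flagged
suspicious = start_years[start_years.notnull() & ~plausible]
print(suspicious)
```

Applied to `dete_resignations['dete_start_date']`, an empty result would confirm what the counts above suggest: no start year falls outside the expected range.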


# Create Column - Years of Service
Our main goal is to answer: **Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?**

The DETE dataframe has no length-of-service column, so we need to calculate it:
`institute_service` = `cease_date` - `dete_start_date`

In [None]:
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations[['dete_start_date','cease_date','institute_service']]

Unnamed: 0,dete_start_date,cease_date,institute_service
3,2005.0,2012.0,7.0
5,1994.0,2012.0,18.0
8,2009.0,2012.0,3.0
9,1997.0,2012.0,15.0
11,2009.0,2012.0,3.0
...,...,...,...
808,2010.0,2013.0,3.0
815,2012.0,2014.0,2.0
816,2012.0,2014.0,2.0
819,2009.0,2014.0,5.0


# Identify Dissatisfied Employees

Columns in the TAFE survey that indicate dissatisfaction:

* Contributing Factors. Dissatisfaction

* Contributing Factors. Job Dissatisfaction

Columns in the DETE survey that indicate dissatisfaction:

  * job_dissatisfaction
  * dissatisfaction_with_the_department
  * physical_work_environment
  * lack_of_recognition
  * lack_of_job_security
  * work_location
  * employment_conditions
  * work_life_balance
  * workload

## Creating a function that returns dissatisfaction

If any of the columns listed above contains a True value, we'll add a True value to a new column named `dissatisfied`. To accomplish this, we'll write a function that does the following:

* Return `True` if any element in the selected columns above is True
* Return `False` if none of the elements in the selected columns above is True
* Return `NaN` if the value is NaN

After our changes, the new dissatisfied column will contain just the following values:

`True`: indicates a person resigned because they were `dissatisfied` with the job

`False`: indicates a person resigned because of a reason other than dissatisfaction with the job

`NaN`: indicates the value is missing
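
As a minimal sketch of the three-valued rule described above, the helper `row_dissatisfied` below (a hypothetical name, not part of the notebook) applies it row by row on a toy frame. It is an explicit alternative to `DataFrame.any()`, written out so the NaN handling is visible:

```python
import numpy as np
import pandas as pd

def update_vals(x):
    """Map a raw cell to False ('-'), NaN (missing), or True (anything else)."""
    if pd.isnull(x):
        return np.nan
    return x != "-"

def row_dissatisfied(row):
    """Hypothetical helper: True if any flag is True, NaN if every cell is missing, else False."""
    flags = [update_vals(v) for v in row]
    if any(f is True for f in flags):
        return True
    if all(pd.isnull(f) for f in flags):
        return np.nan
    return False

# Toy rows using the same column names and cell values as the TAFE survey
toy = pd.DataFrame({
    "Contributing Factors. Dissatisfaction":
        ["-", "Contributing Factors. Dissatisfaction", np.nan, "-"],
    "Contributing Factors. Job Dissatisfaction":
        ["-", "-", np.nan, "Job Dissatisfaction"],
})

result = toy.apply(row_dissatisfied, axis=1)
print(result)
```

A row is marked NaN only when both cells are missing, so a respondent who left the questions blank is not counted as satisfied.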

In [None]:
columns_dete= ['job_dissatisfaction',
'dissatisfaction_with_the_department',
'physical_work_environment',
'lack_of_recognition',
'lack_of_job_security',
'work_location',
'employment_conditions',
'work_life_balance',
'workload']


columns_tafe= ['Contributing Factors. Dissatisfaction',
               'Contributing Factors. Job Dissatisfaction'
               ]

tafe_resignations[columns_tafe].value_counts()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
Contributing Factors. Dissatisfaction,Contributing Factors. Job Dissatisfaction,Unnamed: 2_level_1
-,-,241
-,Job Dissatisfaction,36
Contributing Factors. Dissatisfaction,-,29
Contributing Factors. Dissatisfaction,Job Dissatisfaction,26


In [None]:
# Update the values in the contributing factors columns to be either True, False, or NaN
def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction',
                                                       'Contributing Factors. Job Dissatisfaction']].map(update_vals).any(axis=1)
tafe_resignations_up = tafe_resignations.copy()

# Check the unique values after the updates
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)

Unnamed: 0_level_0,count
dissatisfied,Unnamed: 1_level_1
False,249
True,91


In [None]:
# Update the values in columns related to dissatisfaction to be either True, False, or NaN
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
       'dissatisfaction_with_the_department', 'physical_work_environment',
       'lack_of_recognition', 'lack_of_job_security', 'work_location',
       'employment_conditions', 'work_life_balance',
       'workload']].any(axis=1, skipna=False)

# any(axis=1) returns True for a row if at least one of the selected columns is True

dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)

Unnamed: 0_level_0,count
dissatisfied,Unnamed: 1_level_1
False,162
True,149


# Combining Data
We'll combine the two dataframes so that we can later aggregate the data by the `institute_service` column.

1 - First, let's add a column to each dataframe that will allow us to easily distinguish between the two.
* Add a column named `institute` to `dete_resignations_up`. Each row should contain the value DETE.
* Add a column named `institute` to `tafe_resignations_up`. Each row should contain the value TAFE.

2- Combine the two dataframes and assign the result to `combine`.

In [None]:
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'

combine = pd.concat([dete_resignations_up,tafe_resignations_up])

# Verify the number of non null values in each column
combine.notnull().sum().sort_values()



Unnamed: 0,0
torres_strait,0
south_sea,3
aboriginal,7
disability,8
nesb,9
business_unit,32
classification,161
region,265
role_start_date,271
dete_start_date,283


In [None]:
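# Keep only columns with at least 500 non-null values.
# .copy() avoids SettingWithCopyWarning when new columns are assigned later.
combine_update = combine.dropna(thresh=500, axis=1).copy()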

print(combine_update.head())
combine.head()

      id separationtype  cease_date          position    employment_status  \
3    4.0    Resignation      2012.0           Teacher  Permanent Full-time   
5    6.0    Resignation      2012.0  Guidance Officer  Permanent Full-time   
8    9.0    Resignation      2012.0           Teacher  Permanent Full-time   
9   10.0    Resignation      2012.0      Teacher Aide  Permanent Part-time   
11  12.0    Resignation      2012.0           Teacher  Permanent Full-time   

    gender    age institute_service  dissatisfied institute  
3   Female  36-40               7.0         False      DETE  
5   Female  41-45              18.0          True      DETE  
8   Female  31-35               3.0         False      DETE  
9   Female  46-50              15.0          True      DETE  
11    Male  31-35               3.0         False      DETE  


Unnamed: 0,id,separationtype,cease_date,dete_start_date,role_start_date,position,classification,region,business_unit,employment_status,...,Contributing Factors. Ill Health,Contributing Factors. Maternity/Family,Contributing Factors. Dissatisfaction,Contributing Factors. Job Dissatisfaction,Contributing Factors. Interpersonal Conflict,Contributing Factors. Study,Contributing Factors. Travel,Contributing Factors. Other,Contributing Factors. NONE,role_service
3,4.0,Resignation,2012.0,2005.0,2006.0,Teacher,Primary,Central Queensland,,Permanent Full-time,...,,,,,,,,,,
5,6.0,Resignation,2012.0,1994.0,1997.0,Guidance Officer,,Central Office,Education Queensland,Permanent Full-time,...,,,,,,,,,,
8,9.0,Resignation,2012.0,2009.0,2009.0,Teacher,Secondary,North Queensland,,Permanent Full-time,...,,,,,,,,,,
9,10.0,Resignation,2012.0,1997.0,2008.0,Teacher Aide,,,,Permanent Part-time,...,,,,,,,,,,
11,12.0,Resignation,2012.0,2009.0,2009.0,Teacher,Secondary,Far North Queensland,,Permanent Full-time,...,,,,,,,,,,


The combined dataset has 53 columns in total, and many of them consist mostly of missing (NaN) values.

We used the `dropna` method with `thresh=500` to drop every column that has fewer than 500 non-null values.
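
The `thresh` argument can be confusing because it counts non-null values to keep, not missing values to tolerate. A minimal sketch on a hypothetical four-row frame, using `thresh=3` in place of 500:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'id' has 4 non-null values, 'age' has 3, 'region' has 1
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": ["36-40", np.nan, "41-45", "31-35"],
    "region": [np.nan, np.nan, np.nan, "Central Office"],
})

# Keep columns with at least 3 non-null values; .copy() gives an independent frame
kept = toy.dropna(thresh=3, axis=1).copy()
print(list(kept.columns))
```

Here `region` is dropped because it has only one non-null value, while `age` survives with exactly three.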

# Cleaning the Institute Service Column
After combining the datasets, we found that the `institute_service` column mixes numeric values with ranges stored as strings.

We need to clean it by assigning each value a new label.

We'll use the slightly modified definitions below:

* **New**: Less than 3 years at a company
* **Experienced**: 3-6 years at a company
* **Established**: 7-10 years at a company
* **Veteran**: 11 or more years at a company

In [None]:
combine_update['institute_service'].value_counts().head()

Unnamed: 0_level_0,count
institute_service,Unnamed: 1_level_1
Less than 1 year,73
1-2,64
3-4,63
5-6,33
11-20,26


1- First, we'll extract the years of service from each value in the `institute_service` column.

  * Use the `Series.astype()` method to change the type to `'str'`.

  * Use vectorized string methods to extract the years of service from each pattern. The pandas documentation lists all vectorized string methods.
  * Double check that you didn't miss extracting any digits.

  * Use the `Series.astype()` method to change the type to `'float'`.
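
The steps above can be sketched on a handful of values that appear in the `institute_service` column. This is an illustration on a hand-made Series, not the full column:

```python
import numpy as np
import pandas as pd

# Sample values in the formats seen in institute_service, plus one missing entry
service = pd.Series(["Less than 1 year", "1-2", "11-20", "7.0", np.nan])

# Convert to string, capture the first run of digits, then convert to float
years = service.astype(str).str.extract(r"(\d+)", expand=False).astype(float)

print(years)
```

Note that `\d+` keeps the lower bound of each range (`11-20` becomes 11), which is sufficient here because no range straddles a career-stage boundary.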

In [None]:
combine_update['institute_service'].astype(str)

Unnamed: 0,institute_service
3,7.0
5,18.0
8,3.0
9,15.0
11,3.0
...,...
696,5-6
697,1-2
698,
699,5-6


In [129]:
# Extract the years of service and convert the type to float
combine_update['institute_service_up'] = combine_update['institute_service'].astype('str').str.extract(r'(\d+)')
combine_update['institute_service_up'] = combine_update['institute_service_up'].astype('float')

# Check the years extracted are correct
combine_update['institute_service_up'].value_counts()
combine_update['institute_service'].value_counts()



Unnamed: 0_level_0,count
institute_service,Unnamed: 1_level_1
Less than 1 year,73
1-2,64
3-4,63
5-6,33
11-20,26
5.0,23
1.0,22
7-10,21
3.0,20
0.0,20


In [124]:
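# Scratch check on the ten most frequent values.
# Note: .str.extract runs on the Series values (the counts), not on the index labels,
# so the result is the first digit of each count, not the years of service.
teste = combine_update['institute_service'].value_counts()[:10]
teste_2 = teste.astype('str').str.extract(r'(\d)')
teste_2 = teste_2.astype('float')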
print(teste)
teste_2

institute_service
Less than 1 year    73
1-2                 64
3-4                 63
5-6                 33
11-20               26
5.0                 23
1.0                 22
7-10                21
3.0                 20
0.0                 20
Name: count, dtype: int64


Unnamed: 0_level_0,0
institute_service,Unnamed: 1_level_1
Less than 1 year,7.0
1-2,6.0
3-4,6.0
5-6,3.0
11-20,2.0
5.0,2.0
1.0,2.0
7-10,2.0
3.0,2.0
0.0,2.0


## Converting years to categories
2- Next, we'll map each value to one of the career stage definitions above.

Create a function that maps each year value to one of the career stages above.

Use the `Series.apply()` method to apply the function to the `institute_service_up` column. Assign the result to a new column named `service_cat`.

In [130]:
# Convert years of service to categories
def transform_service(val):
    if val >= 11:
        return "Veteran"
    elif 7 <= val < 11:
        return "Established"
    elif 3 <= val < 7:
        return "Experienced"
    elif pd.isnull(val):
        return np.nan
    else:
        return "New"
combine_update['service_cat'] = combine_update['institute_service_up'].apply(transform_service)

# Quick check of the update
combine_update['service_cat'].value_counts()



Unnamed: 0_level_0,count
service_cat,Unnamed: 1_level_1
New,193
Experienced,172
Veteran,136
Established,62
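
As an alternative to a hand-written function, the same career stages can be assigned with `pd.cut`. This is only a sketch on made-up year values, not the approach used in this notebook:

```python
import numpy as np
import pandas as pd

# Made-up years of service, including boundary values and one missing entry
years = pd.Series([0.0, 2.0, 3.0, 6.0, 7.0, 10.0, 11.0, 25.0, np.nan])

# right=False makes each bin include its left edge: [-inf, 3), [3, 7), [7, 11), [11, inf)
stages = pd.cut(
    years,
    bins=[-np.inf, 3, 7, 11, np.inf],
    right=False,
    labels=["New", "Experienced", "Established", "Veteran"],
)

print(stages.value_counts())
```

`pd.cut` leaves missing values as NaN and returns an ordered categorical, which can be convenient when the stages need to be sorted or plotted in order.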
