<a href="https://colab.research.google.com/github/FREDRICAPPAU/FREDRICAPPAU/blob/main/Employee_Turnover_in_Higher_Education_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Employee turnover is a major issue in higher education, leading to increased recruitment costs, knowledge loss, and disrupted student experiences. This project develops predictive models in Python and visualizes the results in Tableau to identify key predictors of turnover among faculty and staff. The study evaluates how accurately models such as logistic regression, decision trees, and random forests can predict turnover, using a dataset adapted from general HR data: the Employee Turnover Dataset (linked in the Dataset section). Using this dataset, my aim is to predict the chances of turnover in higher education.

# Predicting High Turnover in Higher Education Using Machine Learning

In this analysis, we aim to predict employee turnover within the higher education sector, focusing on the factors that lead faculty and staff to leave their positions. The models and methodology described here will help identify which attributes (such as job satisfaction, work-life balance, and department) most influence turnover within universities and colleges. The workflow includes exploratory data analysis (EDA), data preprocessing, and building predictive models in Python. The final analysis will be visualized in Tableau to highlight potential turnover trends.

# Data Overview and Source

For this project, we will modify an employee turnover dataset to suit the higher education context. The attributes include job satisfaction, salary, department, promotion history, and others. The goal is to predict whether faculty or staff members are likely to leave their roles in the next academic year.

# Objectives

The main objectives of this analysis are:

1.   To identify key factors contributing to employee turnover.
2.   To build and evaluate machine learning models that predict employee turnover.
3.   To assess the accuracy and performance of each model.
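
As a rough illustration of these objectives, the sketch below compares the three model families named in the introduction using 5-fold cross-validated accuracy, then reads feature importances from the random forest. It runs on synthetic data from `make_classification`, which is only a stand-in and not the turnover dataset; the hyperparameters (`max_depth=5`, `n_estimators=100`) are arbitrary assumptions, not tuned choices.

```python
# Illustrative sketch only: synthetic data stands in for the turnover dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption: not the real data)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Candidate models; hyperparameters are placeholder assumptions
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Objectives 2 and 3: mean 5-fold cross-validated accuracy per model
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")

# Objective 1: a fitted random forest exposes impurity-based importances
forest = models["random_forest"].fit(X, y)
importances = forest.feature_importances_
print(importances.round(3))
```

Cross-validation is used here because a single train/test split can give a noisy accuracy estimate on a dataset of roughly a thousand rows.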





# Dataset

[Employee Turnover Dataset](https://www.kaggle.com/datasets/davinwijaya/employee-turnover)
The dataset will be adapted to reflect academic settings, and additional data (such as faculty tenure or teaching load) may be added if available.  

# Data Processing

This section handles missing values, encodes categorical variables, and scales numerical features so that the models receive clean, comparable inputs.


Code and Implementation:

1.   We start by importing the libraries and loading the dataset.
2.   Categorical variables such as `profession` and `industry` are encoded using LabelEncoder to convert them into numerical values.
3.   Missing values in numeric columns are filled with the column mean, and numerical features are scaled to bring them into comparable ranges.
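
The same three steps can also be bundled into a single scikit-learn `ColumnTransformer`. This is a swapped-in alternative, not the approach used in this notebook: it uses `SimpleImputer` and `OneHotEncoder` in place of `fillna` and `LabelEncoder`. The toy frame below borrows three column names from the dataset, but its values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with invented values (assumption: not rows from the real file)
toy = pd.DataFrame({
    "age": [35.0, 33.0, np.nan, 32.0],
    "stag": [7.0, 23.0, 16.0, 8.4],
    "profession": ["HR", "HR", "Commercial", "HR"],
})

numeric_cols = ["age", "stag"]
categorical_cols = ["profession"]

# Numeric branch: mean imputation followed by standardization
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical branch: one indicator column per category
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

features = preprocess.fit_transform(toy)
print(features.shape)
```

One reason to consider this layout is that the transformer can be fitted on the training split alone and then applied to the test split, which keeps test-set statistics out of the imputation and scaling steps.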







# Data Processing Code Implementation

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the dataset
file_path = '/content/drive/MyDrive/Grad School/Classes/MSA550- Predictive Analytics M, W 3:30-4:45  (Fall24)/Pred. Analyt. Employee Turnover Project/turnover_v1.csv' # Specify the file path within your Google Drive
df = pd.read_csv(file_path)

# Inspecting the dataset
df.head()


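|   | stag | event | gender | age | industry | profession | traffic | coach | head_gender | greywage | way | extraversion | independ | selfcontrol | anxiety | novator |
|---|------|-------|--------|-----|----------|------------|---------|-------|-------------|----------|-----|--------------|----------|-------------|---------|---------|
| 0 | 7.030801 | 1 | m | 35.0 | Banks | HR | rabrecNErab | no | f | white | bus | 6.2 | 4.1 | 5.7 | 7.1 | 8.3 |
| 1 | 22.965092 | 1 | m | 33.0 | Banks | HR | empjs | no | m | white | bus | 6.2 | 4.1 | 5.7 | 7.1 | 8.3 |
| 2 | 15.934292 | 1 | f | 35.0 | PowerGeneration | HR | rabrecNErab | no | m | white | bus | 6.2 | 6.2 | 2.6 | 4.8 | 8.3 |
| 3 | 15.934292 | 1 | f | 35.0 | PowerGeneration | HR | rabrecNErab | no | m | white | bus | 5.4 | 7.6 | 4.9 | 2.5 | 6.7 |
| 4 | 8.410678 | 1 | m | 32.0 | Retail | Commercial | youjs | yes | f | white | bus | 3.0 | 4.1 | 8.0 | 7.1 | 3.7 |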


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1129 entries, 0 to 1128
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   stag          1129 non-null   float64
 1   event         1129 non-null   int64  
 2   gender        1129 non-null   object 
 3   age           1129 non-null   float64
 4   industry      1129 non-null   object 
 5   profession    1129 non-null   object 
 6   traffic       1129 non-null   object 
 7   coach         1129 non-null   object 
 8   head_gender   1129 non-null   object 
 9   greywage      1129 non-null   object 
 10  way           1129 non-null   object 
 11  extraversion  1129 non-null   float64
 12  independ      1129 non-null   float64
 13  selfcontrol   1129 non-null   float64
 14  anxiety       1129 non-null   float64
 15  novator       1129 non-null   float64
dtypes: float64(7), int64(1), object(8)
memory usage: 141.2+ KB
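
The `df.info()` output above reports 1129 non-null values in every column, so this file has no missing values to impute. A quick way to confirm that, and to check how balanced the target is, is sketched below on a small stand-in frame with invented rows. It assumes `event` is the turnover indicator; in the notebook, `df` would take the place of `sample`.

```python
import pandas as pd

# Stand-in frame with invented rows (assumption: `event` marks turnover)
sample = pd.DataFrame({
    "stag": [7.03, 22.97, 15.93, 8.41],
    "event": [1, 1, 0, 1],
    "gender": ["m", "m", "f", "m"],
})

# Count missing values per column
missing_per_column = sample.isna().sum()

# Share of each class in the target column
event_share = sample["event"].value_counts(normalize=True)

# Non-numeric columns are the ones that will need encoding
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(event_share)
print(categorical_cols)
```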


In [None]:

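# Handle missing values: fill numeric columns with their column means.
# df.info() above reports no nulls, so this step is only a safeguard.
numeric_cols = ['stag', 'age', 'extraversion', 'independ', 'selfcontrol', 'anxiety', 'novator']
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Encode each categorical (object) column as integer codes
categorical_cols = ['gender', 'industry', 'profession', 'traffic', 'coach', 'head_gender', 'greywage', 'way']
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Scale numerical features; `event` is not in numeric_cols, so it stays unscaled
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])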
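
The imports at the top of this section include `train_test_split`, `accuracy_score`, `confusion_matrix`, and `classification_report`, which points to a hold-out evaluation as a likely next step. The sketch below shows one way that step could look. It uses synthetic data and a logistic regression as placeholders, so the split size, the model, and the data are all assumptions and not results from the turnover file.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and the turnover target
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out 20% of rows; stratify so both splits keep the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training split only, then predict the held-out rows
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(matrix)
print(classification_report(y_test, y_pred))
```

With the real data, `X` would be the preprocessed feature columns and `y` the target column, assuming `event` plays that role.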
