# **ETL (Extract, Transform, Load)**

The [IBM HR Analytics Employee Attrition & Performance dataset](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)  is a fictional dataset created by IBM data scientists to simulate real-world HR data. It contains information about employees’ demographics, job roles, satisfaction levels, performance, and employment history. The dataset has 1,470 rows (employees) and 35 columns, including both categorical and numerical variables, and is used to explore the factors that influence employee attrition and performance. The main feature categories are:

- **Demographics:** Age, Gender, MaritalStatus, Education, EducationField

- **Job Details:** Department, JobRole, JobLevel, JobInvolvement, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager

- **Compensation:** MonthlyIncome, MonthlyRate, DailyRate, HourlyRate, PercentSalaryHike, StockOptionLevel

- **Satisfaction Metrics:** JobSatisfaction, EnvironmentSatisfaction, RelationshipSatisfaction, WorkLifeBalance

- **Performance & Experience:** PerformanceRating, TotalWorkingYears, NumCompaniesWorked, TrainingTimesLastYear, YearsSinceLastPromotion

- **Other Attributes:** DistanceFromHome, BusinessTravel, OverTime, StandardHours

## Objectives
The objective of this notebook is to perform the ETL (Extract, Transform, Load) process for the [IBM HR Analytics Employee Attrition & Performance dataset](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset). It extracts raw HR data, cleans missing or mismatched values, standardizes categorical encodings, and prepares numerical variables. The transformed dataset is then structured for exploratory data analysis and predictive modeling to uncover key factors influencing employee attrition.

## Inputs
The dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset).

## Outputs
The cleaned CSV file can be found [here]().

# ETL Process

- Load the dataset
- Understand dataset structure and content
- Convert data types
- Clean the dataset
- Save the clean dataset as a csv file

---

# Change working directory
Change the working directory from the current folder to its parent folder, because the notebooks are stored in a subfolder of the project.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\amron\\Desktop\\employee-turnover-prediction\\jupyter_notebooks'

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\amron\\Desktop\\employee-turnover-prediction'

Define the paths to the raw and processed dataset folders

In [4]:
# Path to the folder that holds the raw dataset
raw_data_dir = os.path.join(current_dir, 'data_set', 'raw')

# Path to the folder where the processed (cleaned) dataset is saved
processed_data_dir = os.path.join(current_dir, 'data_set', 'processed')
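
Before reading or writing files, it can help to make sure both folders exist. The sketch below is one possible way to do that; `ensure_dirs` is a hypothetical helper name, not part of the original notebook.

```python
import os

def ensure_dirs(*paths):
    """Create each folder if it is missing and return the paths unchanged.

    Hypothetical helper for illustration only.
    """
    for path in paths:
        # exist_ok=True means no error is raised if the folder already exists
        os.makedirs(path, exist_ok=True)
    return paths
```

A call such as `ensure_dirs(raw_data_dir, processed_data_dir)` would then be safe to run repeatedly.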


---

# Import packages

In [5]:
import numpy as np #import numpy
import pandas as pd #import pandas
import matplotlib.pyplot as plt #import matplotlib
import seaborn as sns #import seaborn
import plotly.express as px # import plotly
sns.set_style('whitegrid') #set style for visuals

---

# Load the raw dataset

In [6]:
#load the dataset

The raw dataset is loaded with pandas for the ETL process.
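
A minimal sketch of the loading step follows. It assumes the raw CSV keeps the file name commonly used by the Kaggle download; that name, along with `RAW_FILE_NAME` and `load_raw_data`, is an assumption and should be adjusted to match the file actually stored in the raw data folder.

```python
import os
import pandas as pd

# Assumption: file name of the Kaggle download; change it if the local file differs.
RAW_FILE_NAME = "WA_Fn-UseC_-HR-Employee-Attrition.csv"

def load_raw_data(raw_dir, file_name=RAW_FILE_NAME):
    """Read the raw CSV from raw_dir into a DataFrame (hypothetical helper)."""
    file_path = os.path.join(raw_dir, file_name)
    if not os.path.isfile(file_path):
        # Fail early with a clear message instead of a long pandas traceback
        raise FileNotFoundError(f"Raw dataset not found at {file_path}")
    return pd.read_csv(file_path)
```

In the notebook this could be called as `df = load_raw_data(raw_data_dir)`. Comparing `df.shape` with the 1,470 rows and 35 columns described in the introduction is a quick sanity check.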

---

# Understand the dataset structure and content

In [7]:
# Coding
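
One possible way to get an overview of the structure is to collect the data type, missing-value count, and number of distinct values for every column in a single table. The helper name `summarize_structure` is hypothetical, and the sketch works on any DataFrame.

```python
import pandas as pd

def summarize_structure(df):
    """Return one row per column: dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # data type of each column as text
        "n_missing": df.isna().sum(),     # number of missing values per column
        "n_unique": df.nunique(),         # distinct non-missing values per column
    })
```

Calling `summarize_structure(df)` alongside `df.head()` and `df.describe()` gives a compact view of which columns are categorical, which are numerical, and which may need cleaning.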

### Insights:
- 

---

# Clean the data

In [8]:
# Coding
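
The cleaning steps named in the objectives could be sketched roughly as below. This is an illustration under assumptions, not the notebook's actual cleaning code: `clean_dataset` is a hypothetical helper, and the choices to drop constant columns and to store text columns as categoricals are assumptions about what cleaning should involve here.

```python
import pandas as pd

def clean_dataset(df):
    """Return a cleaned copy of df without modifying the input."""
    cleaned = df.copy()

    # Treat every column that is not numeric, boolean, or datetime as text
    text_cols = cleaned.select_dtypes(exclude=["number", "bool", "datetime"]).columns

    # Strip stray whitespace so that "Yes " and "Yes" count as the same value
    for col in text_cols:
        cleaned[col] = cleaned[col].str.strip()

    # Remove exact duplicate rows
    cleaned = cleaned.drop_duplicates()

    # Drop columns with a single distinct value (assumed to carry no information)
    constant_cols = [c for c in cleaned.columns if cleaned[c].nunique(dropna=False) <= 1]
    cleaned = cleaned.drop(columns=constant_cols)

    # Store the remaining text columns as pandas categoricals
    remaining_text = [c for c in text_cols if c not in constant_cols]
    cleaned = cleaned.astype({c: "category" for c in remaining_text})

    return cleaned.reset_index(drop=True)
```

Checking `df.isna().sum()` before and after the call shows whether any missing values still need an explicit decision, since this sketch leaves them untouched.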

### Insights:
- 

---

# Store and load cleaned data

In [9]:
# Coding
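
A minimal sketch of the save-and-reload step is shown below. The function name `save_and_reload` and the default file name `hr_attrition_cleaned.csv` are hypothetical; the notebook does not specify the output file name.

```python
import os
import pandas as pd

def save_and_reload(df, out_dir, file_name="hr_attrition_cleaned.csv"):
    """Write df to out_dir as CSV, then read it back to confirm the round trip."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, file_name)

    # index=False keeps the row index from being written as an extra column
    df.to_csv(out_path, index=False)

    reloaded = pd.read_csv(out_path)
    if reloaded.shape != df.shape:
        raise ValueError(f"Shape changed on reload: {df.shape} -> {reloaded.shape}")
    return out_path, reloaded
```

In the notebook this might be called as `save_and_reload(df_clean, processed_data_dir)`, where `df_clean` stands for the cleaned DataFrame. CSV files do not store pandas dtypes, so categorical columns come back as plain text and need converting again after loading.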

### Insights:
-

---

# Cleaned Dataset Summary & Insights

- 
- 
- 
