# 📊 Employee Performance Analysis – Project Summary

## 🧠 Objective
The project aims to **predict employee performance levels** (Low, Average, High) using machine learning models. It empowers data-driven HR decisions by identifying performance drivers and enabling predictive insights.

---

## 📂 Dataset Overview

- **Source**: IABAC-provided dataset
- **Total Records**: 999 employees
- **Target Variable**: `PerformanceRating` (3 Classes: 0 = Low, 1 = Average, 2 = High)
- **Features**: 28 raw features (Numerical, Categorical, Ordinal)

---

## 🛠️ Data Preprocessing

### ✅ Cleaning & Preprocessing Steps:
- Removed duplicate records
- Handled missing values (imputation strategy based on context)
- Encoded categorical features using **OneHotEncoding**
- Scaled numerical features using **RobustScaler**
- Split into **Train (759)** and **Test (240)** sets using stratified sampling
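
The steps above could be wired together roughly as follows. This is a minimal sketch, not the project's confirmed code: the toy DataFrame and its values are hypothetical, the column names are borrowed from the feature lists in this summary, and fitting the encoder and scaler on the training split only (to avoid leaking test statistics) is an assumption about ordering.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy stand-in for the real dataset (hypothetical values).
df = pd.DataFrame({
    "EmpDepartment": ["Sales", "Development", "HR"] * 10,
    "EmpHourlyRate": list(range(30, 60)),
    "PerformanceRating": [0, 1, 1, 1, 2, 1] * 5,
})
df = df.drop_duplicates()

X = df.drop(columns="PerformanceRating")
y = df["PerformanceRating"]

# Stratify on the target so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One-hot encode categorical columns, robust-scale numeric ones.
# sparse_threshold=0 forces a dense array as output.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["EmpDepartment"]),
        ("num", RobustScaler(), ["EmpHourlyRate"]),
    ],
    sparse_threshold=0,
)
X_train_prepared = preprocess.fit_transform(X_train)  # fit on train only
X_test_prepared = preprocess.transform(X_test)        # reuse train statistics
```

`RobustScaler` centers on the median and scales by the interquartile range, which makes it less sensitive to outliers than standard scaling.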

### 🔍 Final Feature Count: 
- **20 selected features** after correlation, mutual information, and F-score filtering

---

## 📊 Exploratory Data Analysis (EDA)

### 👨‍💼 Key Employee Trends:
- Most employees are aged 25–35
- Common departments: Sales, Development, HR
- Majority of employees are **Average performers**
- Environment satisfaction and time since last promotion are strongly associated with performance

### 📈 Distribution Insights:
- **Performance Class Distribution**:
  - Low: ~16%
  - Average: ~73%
  - High: ~11%

---

## 🧪 Feature Engineering

### 📌 Top 15 Features (by Mutual Info & F-score):
| Rank | Feature                       | Mutual_Info | F-Score    |
|------|-------------------------------|-------------|------------|
| 1    | EmpLastSalaryHikePercent      | 198.14      | High       |
| 2    | EmpEnvironmentSatisfaction    | 130.86      | High       |
| 3    | YearsSinceLastPromotion       | 34.22       | Moderate   |
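
A ranking like this one can be produced with scikit-learn's `mutual_info_classif` and `f_classif`. The sketch below runs on synthetic data with hypothetical feature names, so the scores it prints are unrelated to the values in the table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic stand-in data: 3 classes, 6 features (hypothetical).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Mutual information captures any dependency; the ANOVA F-score
# measures how well class means separate along each feature.
mi_scores = mutual_info_classif(X, y, random_state=0)
f_scores, p_values = f_classif(X, y)

ranking = (
    pd.DataFrame(
        {"Feature": feature_names, "Mutual_Info": mi_scores, "F_Score": f_scores}
    )
    .sort_values("Mutual_Info", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

The two scores live on different scales: true mutual information with a three-class target cannot exceed ln 3 ≈ 1.10 nats, while F-scores are unbounded and often reach the hundreds.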

### 🏆 Final Selected Features:
- EmpLastSalaryHikePercent
- EmpEnvironmentSatisfaction
- YearsSinceLastPromotion
- EmpDepartment_Development
- EmpWorkLifeBalance
- ExperienceYearsInCurrentRole
- YearsWithCurrManager
- EmpHourlyRate
- EmpJobRole_Developer
- ExperienceYearsAtThisCompany
- ... (10 more encoded categorical features)

---

## 🧠 Models Trained

### 🧪 Algorithms Evaluated:
1. Logistic Regression  
2. K-Nearest Neighbors  
3. Decision Tree  
4. Random Forest  
5. Support Vector Machine (SVM)  
6. Naive Bayes  
7. XGBoost  
8. Gradient Boosting
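
A comparison across these algorithms might look like the sketch below. It uses synthetic data and default hyperparameters, and it leaves out XGBoost so that it depends only on scikit-learn; all of these are assumptions, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the prepared features.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {acc:.2%}")
```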

---

## 🏁 Model Performance Summary

| Model                 | Accuracy |
|----------------------|----------|
| ✅ Gradient Boosting | **93.33%** (Best) |
| Random Forest         | 92.92%   |
| XGBoost               | 92.50%   |
| Decision Tree         | 87.08%   |
| SVM                   | 84.58%   |
| Logistic Regression   | 84.17%   |
| KNN                   | 78.75%   |
| Naive Bayes           | 73.33%   |

---

## 🏆 Best Model: Gradient Boosting

### 📌 Classification Report:

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **Low (0)**   | 0.89      | 0.87   | 0.88     | 39      |
| **Average (1)** | 0.94      | 0.97   | 0.96     | 175     |
| **High (2)**  | 0.95      | 0.77   | 0.85     | 26      |

| Metric           | Precision | Recall | F1-Score |
|------------------|-----------|--------|----------|
| **Macro Avg**    | 0.93      | 0.87   | 0.90     |
| **Weighted Avg** | 0.93      | 0.93   | 0.93     |
| **Accuracy**     |           |        | 0.93     |
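
To make the two averages concrete, the sketch below recomputes them from the per-class rows of the classification report. Because the inputs are already rounded to two decimals, the results can differ from the reported values in the last digit.

```python
# Per-class metrics copied from the classification report above.
report = {
    "Low":     {"precision": 0.89, "recall": 0.87, "f1": 0.88, "support": 39},
    "Average": {"precision": 0.94, "recall": 0.97, "f1": 0.96, "support": 175},
    "High":    {"precision": 0.95, "recall": 0.77, "f1": 0.85, "support": 26},
}
total_support = sum(row["support"] for row in report.values())


def macro_avg(metric: str) -> float:
    """Unweighted mean: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric: str) -> float:
    """Mean weighted by support: larger classes count more."""
    return sum(row[metric] * row["support"] for row in report.values()) / total_support


for metric in ("precision", "recall", "f1"):
    print(f"{metric:<10} macro={macro_avg(metric):.3f} weighted={weighted_avg(metric):.3f}")
```

The macro average treats the 26 high performers as equal in weight to the 175 average performers, which is why macro recall (0.87) sits below weighted recall (0.93).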


### ✅ Overall Model Accuracy:
The **Gradient Boosting Classifier** achieved **93.33% accuracy** on the test data (224 of 240 records), outperforming all other models evaluated.

---

> 📊 **Precision**: Of the employees predicted to be in a class, the share that actually belong to it  
> 🔍 **Recall**: Of the employees that actually belong to a class, the share the model identifies  
> 🔁 **F1-score**: Harmonic mean of precision and recall  

---

### 📌 Key Model Insights:

- ✅ **High Accuracy**: The model correctly predicts employee performance level in 93% of cases.
- 💡 **Class 1 (Average Performers)** is **best predicted**, with the highest recall (0.97), meaning the model is extremely good at identifying average employees.
- ⚠️ **Class 2 (High Performers)** has a **lower recall (0.77)**: 6 of the 26 high performers in the test set were misclassified, all of them as average.
- ⚖️ **Precision and recall** are closely matched for the Low and Average classes, but the High class shows a clear gap (precision 0.95 vs. recall 0.77), so the model under-detects high performers.
- 🔄 Suitable for deployment in **HR decision systems** with careful monitoring, especially for top-performer predictions.

---



### 🧾 Confusion Matrix:
|              | Predicted Low | Predicted Avg | Predicted High |
|--------------|---------------|---------------|----------------|
| Actual Low   | 34            | 5             | 0              |
| Actual Avg   | 4             | 170           | 1              |
| Actual High  | 0             | 6             | 20             |

---

## 🔍 Feature Importance (Gradient Boosting)

| Rank | Feature                         | Importance |
|------|----------------------------------|------------|
| 1    | EmpEnvironmentSatisfaction       | 0.2833     |
| 2    | EmpLastSalaryHikePercent         | 0.2567     |
| 3    | YearsSinceLastPromotion          | 0.2041     |
| 4    | EmpDepartment_Development        | 0.0688     |
| 5    | EmpWorkLifeBalance               | 0.0562     |

(15 total features ranked)
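
Importances of this kind come from the fitted model's `feature_importances_` attribute. The sketch below shows the pattern on synthetic data with hypothetical feature names; it does not reproduce the values in the table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (hypothetical features).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importance = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importance.round(4))
```

Impurity-based importances describe how much the trees relied on each feature, not the direction or causal strength of its effect.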

---

## 💡 Key Insights

- 🌟 **EmpEnvironmentSatisfaction**, **Salary Hike %**, and **Promotion Gaps** are the strongest predictors of performance (about 74% of total feature importance combined)
- ✅ Gradient Boosting is the most accurate model on the test set (93.33%)
- ⚠️ Some **High performers** (6 of 26 in the test set) are misclassified as Average; further tuning may help
- 🎯 Cross-validation and hyperparameter tuning were applied for stability
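
One common way to combine cross-validation with hyperparameter tuning is a grid search over stratified folds. The sketch below is illustrative only: the parameter grid, the fold count, and the `f1_macro` scoring choice (picked here because the classes are imbalanced) are assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for the training split.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Hypothetical search space, kept small so the example runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

# Stratified folds preserve the class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="f1_macro",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV macro-F1: {search.best_score_:.3f}")
```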

---

## 💼 Recommendations

1. ✅ Deploy the best model (`Gradient Boosting`) for real-time predictions
2. 🧑‍💼 Use insights to improve HR policies (e.g., hike frequency, satisfaction monitoring)
3. 🔁 Regularly retrain the model with new data
4. 📈 Integrate the model in hiring and promotion pipelines
5. 🛡️ Monitor model drift and bias across demographics
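
For the bias-monitoring recommendation, one simple check is to compare recall for the High class across groups. The sketch below uses a hypothetical `group` column and made-up records purely to show the calculation.

```python
import pandas as pd

# Hypothetical scored records: demographic group, true label, predicted label.
scored = pd.DataFrame({
    "group":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "actual":    [2, 2, 1, 0, 2, 2, 1, 0],
    "predicted": [2, 2, 1, 0, 2, 1, 1, 0],
})

# Keep only true high performers (class 2), then measure per group
# how often the model also predicted class 2.
high = scored[scored["actual"] == 2]
recall_by_group = (high["predicted"] == 2).groupby(high["group"]).mean()
print(recall_by_group)
```

A large gap between groups would suggest the model under-detects high performers in one group and deserves investigation before its predictions inform decisions.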

---

## 💾 Artifacts Saved

- `best_model.pkl` – Saved trained model  
- `model_results.csv` – Model comparison results  
- `feature_importance.csv` – Ranked features by importance
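
Since the stack lists Joblib for model saving, a round trip for `best_model.pkl` might look like the sketch below. The model here is a small stand-in trained on synthetic data, and writing to a temporary directory is an assumption made to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small stand-in model trained on synthetic data.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# Save the fitted model, then load it back as a new object.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
predictions_match = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", predictions_match)
```

Pickled models should be loaded with the same scikit-learn version that produced them, and only from trusted sources.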

---

## 👩‍💻 Project Tools & Tech Stack

- Python (Pandas, NumPy, Scikit-learn, XGBoost, LightGBM)
- Jupyter Notebook
- Joblib (Model saving)
- Visualizations via Seaborn & Matplotlib

---

## 🧩 Final Notes

- Data-driven performance analysis adds objectivity to HR decision-making
- Future extensions: SHAP analysis, model explainability, dashboard integration

---

📌 **Author**: Angel B  
📅 **Last Updated**: July 2025  
📁 **Project**: IABAC – DataScientist Certification Project  
