# 🔧 Preprocessing Guidelines

This document outlines the preprocessing steps applied to the dataset before model training.

---

## 1️⃣ Check for Missing Values  
- ✅ **No missing values** were found in the dataset.  
- No imputation was needed.
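
A check of this kind can be sketched as follows. The small DataFrame and its column names are illustrative stand-ins, not the actual dataset:

```python
import pandas as pd

# Toy frame standing in for the real dataset (column names are hypothetical).
df = pd.DataFrame({
    "Age": [29, 41, 35],
    "Department": ["Sales", "HR", "Sales"],
})

# Count missing entries per column; all zeros means no imputation is required.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```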

---

## 2️⃣ Categorical Data Conversion  
Since the dataset contains categorical features with many labels, **manual encoding** and **frequency encoding** were used.

### **🔹 Manual Encoding**  
- Manual encoding assigns each categorical label a hand-chosen numeric code, with the codes ordered by how often each label occurs.
- Implemented using the `map()` function in Pandas.
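
A minimal sketch of manual encoding with `map()`. The `Department` column, its labels, and the code values are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical categorical column; the real feature names may differ.
df = pd.DataFrame({"Department": ["Sales", "R&D", "Sales", "HR", "R&D", "Sales"]})

# Hand-written mapping: here the most frequent label gets the highest code.
department_codes = {"Sales": 2, "R&D": 1, "HR": 0}
df["Department_encoded"] = df["Department"].map(department_codes)

# map() returns NaN for any label missing from the dictionary, so check for gaps.
unmapped = int(df["Department_encoded"].isna().sum())
```

Checking `unmapped` guards against a label that was left out of the dictionary by mistake.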

### **🔹 Frequency Encoding**  
- This technique replaces each category with the number of times it occurs in the dataset.
- It keeps a single column per feature, so dimensionality stays low compared with one-hot encoding, while preserving how common each category is.
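
Frequency encoding can be sketched with `value_counts()` and `map()`. The `JobRole` column and its values are hypothetical:

```python
import pandas as pd

# Hypothetical high-cardinality column standing in for a real feature.
df = pd.DataFrame(
    {"JobRole": ["Developer", "Manager", "Developer", "Analyst", "Developer", "Manager"]}
)

# Count how often each category occurs, then replace each label with its count.
freq = df["JobRole"].value_counts()
df["JobRole_freq"] = df["JobRole"].map(freq)
```

Note that two categories with the same count receive the same code, so they become indistinguishable after encoding.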

---

## 3️⃣ Outlier Handling  
Some features contain **outliers**, so the **Interquartile Range (IQR) method** was used to detect and replace them.  
- **Reason:** The dataset is **not normally distributed**, and the IQR rule relies on quartiles rather than the mean and standard deviation, so it does not assume normality.  
- Values outside the bounds below were treated as outliers and replaced:

  **Formula for IQR Handling:**
  - IQR = Q3 - Q1
  - Lower Bound = Q1 - (1.5 × IQR)
  - Upper Bound = Q3 + (1.5 × IQR)
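
The bounds above can be computed as in the sketch below. The replacement strategy is not specified here, so the example assumes capping (clipping) to the nearest bound; the `MonthlyHours` series is made up for illustration:

```python
import pandas as pd

# Hypothetical numeric feature with one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="MonthlyHours")

# Quartiles and IQR bounds, following the formula above.
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the bounds, then cap them (assumed replacement strategy).
is_outlier = (s < lower) | (s > upper)
capped = s.clip(lower=lower, upper=upper)
```

Capping keeps every row in the dataset; replacing outliers with the median would be an alternative under the same bounds.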

---

## 4️⃣ Feature Transformation  
### **🔹 Square Root Transformation**  
- Applied to **`YearsSinceLastPromotion`** because of its skewness and kurtosis.  
- This transformation is useful for count data or small whole numbers.  
- **Negative values** were handled by adding a constant before applying the transformation, since the square root is undefined for negative inputs.
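
A minimal sketch of the transformation. The helper name `sqrt_transform`, the sample values, and the choice of constant (the smallest shift that makes the minimum zero) are assumptions, not the exact procedure used:

```python
import numpy as np
import pandas as pd

def sqrt_transform(values: pd.Series) -> pd.Series:
    """Shift the series so its minimum is at least 0, then take the square root."""
    # Assumed choice of constant: shift only when negatives are present.
    shift = max(0, -values.min())
    return np.sqrt(values + shift)

# Hypothetical right-skewed sample of the feature.
df = pd.DataFrame({"YearsSinceLastPromotion": [0, 1, 1, 2, 3, 7, 15]})
df["YearsSinceLastPromotion_sqrt"] = sqrt_transform(df["YearsSinceLastPromotion"])

# Compare skewness before and after the transformation.
skew_before = df["YearsSinceLastPromotion"].skew()
skew_after = df["YearsSinceLastPromotion_sqrt"].skew()
```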

### **🔹 Q-Q Plot for Distribution Check**  
- A **Q-Q (Quantile-Quantile) plot** was used to compare the transformed feature's distribution with a normal distribution.  
- Points that lie close to the reference line indicate that the transformation corrected the skewness.
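
The Q-Q comparison can be sketched with `scipy.stats.probplot`. The data here is synthetic (an exponential sample), not the actual feature, and the correlation coefficient `r` of the Q-Q fit is used as a rough numeric summary of how straight the plot is:

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data standing in for the real feature.
rng = np.random.default_rng(0)
raw = rng.exponential(scale=2.0, size=500)
transformed = np.sqrt(raw)

# probplot returns the Q-Q points plus a least-squares fit (slope, intercept, r).
# An r closer to 1 means the points lie closer to a straight line.
(_, _), (_, _, r_raw) = stats.probplot(raw, dist="norm")
(_, _), (_, _, r_sqrt) = stats.probplot(transformed, dist="norm")

# To draw the figure, pass plot=plt (matplotlib.pyplot) to probplot.
print(f"Q-Q correlation before: {r_raw:.3f}, after: {r_sqrt:.3f}")
```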

---

## 5️⃣ Scaling the Data  
### **🔹 Standard Scaling**  
- Used **StandardScaler** to standardize numerical features.  
- **Standardization Formula:**  
  \[
  X_{\text{scaled}} = \frac{X - \mu}{\sigma}
  \]
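  - μ and σ are the feature's mean and standard deviation before scaling.
  - After scaling, each feature has mean **0** and standard deviation **1**.
- Puts all numerical features on a common scale. It does not change the shape of a distribution, so a skewed feature remains skewed after scaling.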
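
A minimal sketch of this step with scikit-learn. The column names and values are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features on very different scales.
df = pd.DataFrame({
    "Age": [25, 32, 47, 51],
    "MonthlyIncome": [3000, 4200, 8800, 12000],
})
numeric_cols = ["Age", "MonthlyIncome"]

# Fit the scaler and apply (X - mean) / std to each column.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[numeric_cols])
df_scaled = pd.DataFrame(scaled, columns=numeric_cols, index=df.index)
```

When a train/test split is used, a common practice is to call `fit` on the training data only and then `transform` both splits, so that test-set statistics do not leak into training.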

---

### ✅ Summary of Preprocessing Steps:
✔ No missing values  
✔ Handled categorical data with manual & frequency encoding  
✔ Replaced outliers using the IQR method  
✔ Applied square root transformation to handle skewness  
✔ Scaled numerical data using StandardScaler  

---
