# Project Name
#### Subtitle
<pre>
Contributor Name    : R Prabhu Teja
Candidate E-Mail    : prabhuteja124@gmail.com
Project Code        : 10281
REP Name            : DataMites™ Solutions Pvt Ltd
Assessment ID       : E10901-PR2-V18
Module              : Certified Data Scientist - Project
Project Assessment  : IABAC™ 
Registered Trainer  : Ashok Kumar A
Submission 
Deadline Date       : 25-07-2025
</pre>

# <i>Project Summary</i>

## Business Case

### Objective:
To enhance employee performance, retention, and engagement at INX Future Inc. using data analytics and machine learning by:

- Addressing factors causing low performance across departments.
- Reducing attrition, especially among high performers.
- Improving job satisfaction, work-life balance, and workplace environment.
- Building a predictive model for proactive interventions.
- Aligning HR strategies with business goals for sustainable growth.

### Problem Statement:
INX Future Inc. faces HR challenges impacting performance and retention:

- Uneven Performance: Variations in departmental performance due to inadequate training or management reduce productivity.
- High Attrition: Talent loss, including high performers, increases costs and disrupts operations.
- Low Engagement: Poor job satisfaction, work-life balance, and environment scores lead to disengagement and turnover.
- Reactive HR: Lack of data-driven tools limits proactive interventions for at-risk employees.
- Business Impact: These issues lower efficiency, raise costs, and weaken competitiveness.

### Proposed Solution:

- Targeted Interventions: Address underperforming departments, low satisfaction, and training gaps with tailored training, promotions, and flexible work options, guided by data analysis.
- Predictive Retention: Use a Random Forest model to predict low performers and at-risk high performers, enabling proactive mentoring and retention strategies, with results saved for HR action.

# <i>1. Requirement</i>

To address the employee performance and retention challenges outlined in the Project Summary for INX Future Inc., this section specifies the technical requirements for the project. The dataset, provided by IABAC™, contains 1,200 employee records with 28 features, including demographics (age, gender), job details (department, role, tenure), performance ratings, satisfaction scores, and work-life balance metrics. The data includes numerical and categorical variables, with potential challenges such as missing values or imbalanced performance categories.

The project requires Python 3.12.7, Jupyter Notebook, and libraries including pandas, NumPy, scikit-learn, matplotlib, and seaborn for data processing, modeling, and visualization. Preprocessing steps include imputing missing values, encoding categorical variables (e.g., department, role) using one-hot encoding, and scaling numerical features (e.g., tenure, performance scores). The analysis employs supervised classification, primarily using a Random Forest model to predict employee performance and attrition risk, with exploratory testing of logistic regression and decision trees.

Model performance will be evaluated using accuracy, precision, recall, and F1-score, targeting at least 85% accuracy. The dataset is assumed to represent the employee population, with no significant external factors affecting performance or attrition. Deliverables include a predictive Random Forest model and actionable insights to guide HR interventions for improving performance and retention.
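
The evaluation criteria above can be sketched with scikit-learn's metric functions. This is a minimal illustration, assuming weighted averaging across the rating classes (the report does not state which averaging was used); the label values are invented and are not project results.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted ratings (toy values, not project results).
y_true = [2, 3, 3, 3, 4, 2, 3, 4, 3, 2]
y_pred = [2, 3, 3, 2, 4, 2, 3, 3, 3, 3]

accuracy = accuracy_score(y_true, y_pred)

# Weighted averaging weights each class by its number of true samples,
# which is an assumption here; macro averaging is a common alternative.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

# Compare against the 85% accuracy target stated in the requirements.
meets_target = accuracy >= 0.85

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print("meets 85% accuracy target:", meets_target)
```

With weighted averaging, recall equals accuracy by construction, so precision and F1 are the more informative companions when classes are imbalanced.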

# <i>2. Analysis</i>

This section outlines the analytical approach to identify factors affecting employee performance and attrition at INX Future Inc., supporting the development of a predictive Random Forest model. The IABAC™ dataset, comprising 1,200 employee records with 28 features, includes categorical (e.g., EmpDepartment, Gender), numerical (e.g., Age, EmpLastSalaryHikePercent), and ordinal (e.g., EmpEnvironmentSatisfaction, PerformanceRating) features. These features are analyzed to understand their relationships with the dependent variable, PerformanceRating, and attrition.

The analysis involves statistical summaries and initial visualizations to explore feature distributions and relationships, with detailed exploratory data analysis (EDA) covered separately. Data quality checks reveal missing values in satisfaction scores and imbalanced PerformanceRating categories, necessitating preprocessing steps like imputation and oversampling. Key findings indicate that EmpEnvironmentSatisfaction, EmpLastSalaryHikePercent, and EmpWorkLifeBalance are strongly correlated with higher performance, while departments like Sales exhibit elevated attrition rates. These insights prioritize features for the Random Forest model and inform preprocessing to address data imbalances, ensuring robust predictions for HR interventions to enhance performance and retention.

**Categorical Variables** (17, including the EmpNumber identifier, which is not listed here because it is dropped before modeling): Gender, EducationBackground, MaritalStatus, EmpDepartment, EmpJobRole, BusinessTravelFrequency, OverTime, Attrition, EmpEducationLevel, EmpEnvironmentSatisfaction, EmpJobInvolvement, EmpJobLevel, EmpJobSatisfaction, EmpRelationshipSatisfaction, EmpWorkLifeBalance, PerformanceRating.  
**Continuous Variables** (11): Age, DistanceFromHome, EmpHourlyRate, NumCompaniesWorked, EmpLastSalaryHikePercent, TotalWorkExperienceInYears, TrainingTimesLastYear, ExperienceYearsAtThisCompany, ExperienceYearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager.

These findings, detailed further in the EDA section, determine the preprocessing plan: categorical features are encoded and continuous features are scaled before training the Random Forest model, with imputation and oversampling applied where the data quality checks above call for them.

# <i>3. Exploratory Data Analysis</i>

This section conducts Exploratory Data Analysis (EDA) on the IABAC™ dataset to explore feature distributions and relationships with PerformanceRating, informing the Random Forest model for INX Future Inc.’s employee performance and retention project. The dataset comprises 1,200 employee records with 28 features, including categorical (e.g., Gender, EmpDepartment, Attrition), numerical (e.g., Age, EmpLastSalaryHikePercent), and ordinal (e.g., EmpEnvironmentSatisfaction, EmpWorkLifeBalance, PerformanceRating) features, analyzed using pandas to understand their relationships with PerformanceRating.

EDA employs Matplotlib and Seaborn with Histplot, Lineplot, CountPlot, and Barplot, with observations documented below each plot. **Univariate Analysis** uses CountPlots to identify unique labels for categorical features (e.g., EmpDepartment, EmpJobRole) and Histplots to examine numerical feature distributions (e.g., Age, EmpLastSalaryHikePercent). **Bivariate Analysis** investigates feature relationships with PerformanceRating, using Barplots (e.g., EmpDepartment vs. PerformanceRating) and Lineplots (e.g., EmpLastSalaryHikePercent vs. PerformanceRating). **Multivariate Analysis** explores interactions, such as EmpEnvironmentSatisfaction and EmpWorkLifeBalance with respect to PerformanceRating, via Seaborn pairplots or heatmaps.

Observations indicate missing values in satisfaction scores and imbalanced PerformanceRating categories. EmpEnvironmentSatisfaction, EmpLastSalaryHikePercent, and EmpWorkLifeBalance are positively correlated with PerformanceRating, suggesting strong predictive power. Sales employees show lower performance scores than R&D, and lower satisfaction is linked to higher attrition. These findings guide feature encoding, scaling, and oversampling for the Random Forest model to support HR interventions.

##### EDA Methodology:
- Univariate Analysis: Unique labels of categorical features (e.g., EmpDepartment) are described with CountPlots, and distributions of numerical features (e.g., Age) with Histplots.
- Bivariate Analysis: Relationships between individual features and PerformanceRating are examined with Barplots and Lineplots.
- Multivariate Analysis: Interactions between two features (e.g., EmpEnvironmentSatisfaction and EmpWorkLifeBalance) with respect to PerformanceRating are explored with Seaborn plots.
- Plotting: All plots are drawn with Matplotlib/Seaborn, and observations are noted below each plot.
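
The three levels of analysis can also be sketched as plain pandas summaries, which are the tables that a CountPlot, a Barplot, and a heatmap would visualize. The column names come from the dataset, but the rows below are invented for illustration only.

```python
import pandas as pd

# Toy records using column names from the dataset; the values are invented.
df = pd.DataFrame({
    "EmpDepartment": ["Sales", "Sales", "Development",
                      "Development", "Sales", "Development"],
    "EmpEnvironmentSatisfaction": [1, 4, 4, 3, 1, 4],
    "EmpWorkLifeBalance": [2, 3, 3, 3, 2, 4],
    "PerformanceRating": [2, 3, 4, 3, 2, 3],
})

# Univariate: the category counts that a CountPlot would draw.
dept_counts = df["EmpDepartment"].value_counts()

# Bivariate: the per-group mean rating that a Barplot would draw.
mean_rating_by_dept = df.groupby("EmpDepartment")["PerformanceRating"].mean()

# Multivariate: mean rating for each pair of satisfaction levels,
# i.e. the grid that a heatmap would draw.
rating_grid = df.pivot_table(
    index="EmpEnvironmentSatisfaction",
    columns="EmpWorkLifeBalance",
    values="PerformanceRating",
    aggfunc="mean",
)

print(dept_counts, mean_rating_by_dept, rating_grid, sep="\n\n")
```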

## Insights from Exploratory Data Analysis

### Univariate Analysis:

- Categorical features show balanced Gender distribution but skewed EmpDepartment, with Sales being dominant.
- Numerical features like Age exhibit moderate right skewness (skew ≈ 0.4), indicating more younger employees.
- EmpLastSalaryHikePercent has high kurtosis (kurtosis ≈ 4), suggesting significant outliers in salary hikes.
- YearsSinceLastPromotion is right-skewed, reflecting fewer recent promotions.

### Bivariate Analysis:

- Higher EmpEnvironmentSatisfaction and EmpLastSalaryHikePercent are strongly linked to better PerformanceRating.
- Sales department employees have lower PerformanceRating scores compared to R&D.
- Lower EmpWorkLifeBalance is associated with reduced performance scores.

### Multivariate Analysis:

- EmpEnvironmentSatisfaction and EmpWorkLifeBalance jointly enhance PerformanceRating, showing synergistic effects.
- Combinations of high EmpLastSalaryHikePercent and EmpEnvironmentSatisfaction predict top performance scores.



### Correlations:

- EmpEnvironmentSatisfaction, EmpLastSalaryHikePercent, and EmpWorkLifeBalance are positively correlated with PerformanceRating, confirming their predictive importance.

### Data Quality:

- Missing values detected in satisfaction scores, requiring imputation.
- Imbalanced PerformanceRating categories suggest oversampling for modeling.

### Skewness and Kurtosis:

- Skewed features (e.g., YearsSinceLastPromotion) indicate need for log transformation.
- High kurtosis in EmpLastSalaryHikePercent highlights outliers, necessitating robust preprocessing.
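
A short sketch of how skewness and kurtosis might be measured, and how a log transformation reduces right skew. The series below is synthetic (an exponential sample standing in for a feature such as YearsSinceLastPromotion), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for YearsSinceLastPromotion (not project data).
years = pd.Series(rng.exponential(scale=2.0, size=1000),
                  name="YearsSinceLastPromotion")

skew_before = years.skew()
kurt_before = years.kurt()  # pandas reports excess kurtosis (normal = 0)

# log1p computes log(1 + x), so zero values are handled without error.
years_log = np.log1p(years)
skew_after = years_log.skew()

print(f"skew before={skew_before:.2f}, excess kurtosis before={kurt_before:.2f}")
print(f"skew after log1p={skew_after:.2f}")
```

Note that pandas reports excess kurtosis, so a value quoted on the raw (non-excess) scale would be 3 higher.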

### Distributions of Continuous Features:

- Distribution Pattern: Most continuous features (Age, DistanceFromHome, EmpLastSalaryHikePercent, TotalWorkExperienceInYears, ExperienceYearsAtThisCompany) are right-skewed, indicating a concentration of lower values with a tail of higher values. EmpHourlyRate is an exception, showing a more symmetric distribution.
- Implications: The right-skewed distributions suggest that outliers (e.g., older employees, long commuters, or highly experienced staff) may have unique needs or challenges (e.g., retention, fatigue). The symmetric distribution of EmpHourlyRate indicates diverse compensation, which could influence satisfaction variably across roles.

### Distributions of Categorical Features:

- Distribution Pattern: Most categorical features (especially ordinal ones like satisfaction metrics) show a concentration at moderate to high levels (e.g., 3–4 for satisfaction, 2–3 for education/job levels), with fewer at extremes. Nominal features (e.g., Gender, MaritalStatus) are likely balanced or skewed toward dominant categories (e.g., Sales for EmpDepartment).
- Implications: The prevalence of moderate to high satisfaction and performance suggests a generally engaged workforce, but minorities with low satisfaction, high turnover (Attrition), or limited training/promotions highlight areas for improvement. Departmental and role-specific differences (e.g., Sales vs. Development) are critical for targeted interventions.

# <i>4. Data Preprocessing</i>

## i. Data Import and Initial Exploration
- Project Context: The dataset was loaded to analyze employee features and predict PerformanceRating, which indicates performance levels (1: Low, 2: Good, 3: Excellent, 4: Outstanding).
- Implementation: The dataset was imported from an Excel file. The first five rows were inspected, revealing features like Age, Gender, EmpDepartment, EmpJobRole, and PerformanceRating, along with their data types and values. This step confirmed the dataset's structure, with 27 columns (after dropping EmpNumber) and a mix of numerical (e.g., Age, EmpHourlyRate) and categorical (e.g., Gender, EmpDepartment) features.
- Purpose: To verify data loading and understand the feature set, ensuring all relevant variables were available for modeling.

## ii. Data Cleaning
- Project Context: Cleaning was necessary to remove irrelevant or redundant data that could introduce noise into the predictive models.
- Implementation: The EmpNumber column, an employee identifier, was dropped as it was unique to each employee and irrelevant for predicting PerformanceRating. No explicit steps for handling duplicates or correcting data errors were shown, suggesting the dataset was relatively clean or these issues were addressed prior to the provided code.
- Purpose: To focus the dataset on features that contribute to the prediction task, reducing computational overhead and improving model interpretability.
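
The cleaning step can be sketched as follows. The identifier format and the rows are hypothetical; the uniqueness check is an extra sanity step added for illustration, not something the report says was run.

```python
import pandas as pd

# Toy frame; the EmpNumber format shown here is hypothetical.
df = pd.DataFrame({
    "EmpNumber": ["E1001", "E1002", "E1003"],
    "Age": [32, 47, 40],
    "PerformanceRating": [3, 3, 4],
})

# Sanity check (illustrative): a repeated identifier would point to duplicated records.
n_duplicate_ids = int(df["EmpNumber"].duplicated().sum())

# Drop the identifier, since it carries no predictive information.
df_clean = df.drop(columns=["EmpNumber"])

print("duplicate IDs:", n_duplicate_ids)
print("remaining columns:", list(df_clean.columns))
```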

## iii. Handling Missing Values
- Project Context: Missing values could disrupt model training, particularly for algorithms like SVC, which require complete datasets.
- Implementation: The provided code does not explicitly address missing values, indicating that the dataset likely had no missing entries or that missing value handling was performed outside the provided code. If missing values were present, standard practices (e.g., imputation with mean/median for numerical features or mode for categorical features) would have been applied.
- Purpose: To ensure a complete dataset, preventing errors during model training and maintaining data integrity.
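
If missing values were present, the standard practice mentioned above could look like the sketch below: median imputation for a numerical column and mode imputation for a categorical one. The gaps in this toy frame are introduced on purpose for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately introduced gaps (not project data).
df = pd.DataFrame({
    "Age": [30.0, np.nan, 45.0, 38.0],
    "EmpDepartment": ["Sales", "Sales", np.nan, "Development"],
})

# Count missing entries per column before deciding how to treat them.
missing_per_column = df.isna().sum()

# Numerical feature: the median is robust to skew and outliers.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical feature: fill with the most frequent label (the mode).
df["EmpDepartment"] = df["EmpDepartment"].fillna(df["EmpDepartment"].mode()[0])

print(missing_per_column)
print(df)
```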

## iv. Encoding Categorical Variables
- Project Context: The dataset contained several categorical variables (e.g., Gender, EmpDepartment, EmpJobRole), which needed to be converted into numerical formats for compatibility with machine learning algorithms like SVC, Random Forest, and LightGBM.
- Implementation:
   - Label Encoding for binary categorical variables: Gender was encoded as Male to 1 and Female to 0, OverTime as Yes to 1 and No to 0, and Attrition as Yes to 1 and No to 0.
   - One-Hot Encoding for multi-category variables: Columns with more than two categories were one-hot encoded into one binary (0/1 integer) column per category, with no category dropped. EducationBackground (e.g., Life Sciences, Marketing, Medical) was converted into columns prefixed with education, MaritalStatus (Single, Married, Divorced) into columns prefixed with marital_status, EmpDepartment (e.g., Sales, Human Resources, Development) into columns prefixed with dept, EmpJobRole (e.g., Sales Executive, Manager, Data Scientist) into columns prefixed with jobrole, and BusinessTravelFrequency (Travel_Rarely, Travel_Frequently, Non-Travel) into columns prefixed with _.
   - Impact: One-hot encoding significantly increased the feature count, especially for EmpJobRole (19 categories), creating a high-dimensional dataset that required careful handling to avoid the curse of dimensionality.
- Purpose: To transform categorical variables into numerical formats, enabling algorithms to process them effectively while preserving the non-ordinal nature of most categorical features.
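
The two encoding schemes described above can be sketched with pandas. The binary mappings and the `dept` and `marital_status` prefixes follow the text; the rows are invented, and only two of the multi-category columns are shown to keep the example short.

```python
import pandas as pd

# Toy rows using category labels mentioned in the text.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["No", "Yes", "No"],
    "EmpDepartment": ["Sales", "Development", "Human Resources"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

# Binary variables: map each label to 0/1 as described above.
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df["OverTime"] = df["OverTime"].map({"Yes": 1, "No": 0})

# Multi-category variables: one integer indicator column per category,
# keeping every category (no column is dropped).
encoded = pd.get_dummies(
    df,
    columns=["EmpDepartment", "MaritalStatus"],
    prefix={"EmpDepartment": "dept", "MaritalStatus": "marital_status"},
    dtype=int,
)

print(encoded.columns.tolist())
```

Keeping every category makes the indicator columns of one variable sum to 1, which is harmless for tree models but introduces collinearity for linear ones.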

## v. Feature Scaling
- Project Context: Numerical features like Age, DistanceFromHome, and EmpHourlyRate varied in scale, necessitating standardization or normalization for algorithms like SVC and KNN, which are sensitive to feature magnitudes.
- Implementation: Normalization (scaling features to a [0, 1] range) was likely applied, though the specific application was not shown. No explicit standardization was mentioned, but normalization ensured numerical features contributed equally to model training.
- Purpose: To standardize the scale of numerical features, improving convergence and performance of distance-based algorithms like SVC.
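
A minimal sketch of [0, 1] normalization, assuming scikit-learn's `MinMaxScaler` (the report does not name the scaler that was used). The scaler is fitted on the training rows only and then reused on the test rows, so test data does not influence the scaling parameters.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: Age, DistanceFromHome, EmpHourlyRate (toy values, not project data).
X_train = np.array([[25, 2, 40],
                    [40, 10, 65],
                    [58, 29, 100]], dtype=float)
X_test = np.array([[33, 5, 80]], dtype=float)

scaler = MinMaxScaler()

# Learn per-column min and max from the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same min and max to unseen data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```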

## vi. Handling Imbalanced Data
- Project Context: The PerformanceRating target variable may have had imbalanced classes (e.g., fewer employees with Low or Outstanding ratings), which could bias model predictions toward majority classes.
- Implementation: Oversampling was used to balance the classes of PerformanceRating by generating synthetic samples for minority classes, ensuring a more balanced dataset for training.
- Purpose: To mitigate class imbalance, improving model performance on underrepresented performance ratings and ensuring fair predictions across all classes.
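
The idea of generating synthetic minority samples can be illustrated with a simplified SMOTE-style routine. This is a hand-written sketch built on scikit-learn's `NearestNeighbors`, not the imbalanced-learn implementation; the function name `smote_like_oversample` and its parameters are hypothetical, and it assumes every class has at least two samples.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, k=3, seed=0):
    """Balance classes by interpolating between same-class nearest neighbours."""
    rng = np.random.default_rng(seed)
    counts = Counter(y.tolist())
    target = max(counts.values())
    X_parts, y_parts = [X], [y]
    for label, count in counts.items():
        n_new = target - count
        if n_new == 0:
            continue  # majority class needs no synthetic rows
        X_min = X[y == label]
        # Ask for k + 1 neighbours because each point is its own nearest neighbour.
        n_neighbors = min(k + 1, len(X_min))
        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_min)
        neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
        # Pick a random base point and one of its neighbours for each new row.
        base = rng.integers(0, len(X_min), size=n_new)
        choice = rng.integers(0, neighbours.shape[1], size=n_new)
        partner = neighbours[base, choice]
        # Place the synthetic point at a random position on the connecting segment.
        gap = rng.random((n_new, 1))
        synthetic = X_min[base] + gap * (X_min[partner] - X_min[base])
        X_parts.append(synthetic)
        y_parts.append(np.full(n_new, label))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy imbalanced data: ratings 3 (majority), 2 and 4 (minorities).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(60, 2))
y = np.array([3] * 40 + [2] * 15 + [4] * 5)

X_res, y_res = smote_like_oversample(X, y)
print(Counter(y_res.tolist()))
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points never leak into the evaluation data.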

## vii. Feature Engineering
- Project Context: Feature engineering was explored to enhance model performance by reducing dimensionality or capturing complex relationships in the high-dimensional dataset post-encoding.
- Implementation: Dimensionality reduction was considered to address the increased feature count from one-hot encoding. No explicit feature creation (e.g., deriving new features like tenure) was shown, but the approach focused on reducing the number of features while retaining most of the dataset’s variance.
- Purpose: To manage the high-dimensional dataset, reduce computational complexity, and potentially improve model performance by focusing on the most informative features.
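
Dimensionality reduction that retains most of the variance can be sketched with PCA. The 95% threshold is an assumed value chosen for illustration (the report does not state the retained fraction), and the input matrix is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 10 features driven by 3 latent factors plus noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# PCA is scale-sensitive, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction (0.95 is an assumption).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("variance retained:", pca.explained_variance_ratio_.sum())
```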

## viii. Feature and Target Selection
- Project Context: The dataset was divided into input features (X) and the target variable (y) to prepare for model training.
- Implementation: All columns except EmpNumber and PerformanceRating were selected as input features, including both numerical (e.g., Age, EmpHourlyRate) and encoded categorical variables (e.g., dept_Sales, jobrole_Data Scientist). PerformanceRating was designated as the target variable, representing employee performance levels (1: Low, 2: Good, 3: Excellent, 4: Outstanding). The feature order was preserved for consistency during prediction.
- Purpose: To clearly define the input features and target for model training, ensuring the correct variables were used in the predictive pipeline.

## ix. Train-Test Split
- Project Context: The dataset was split into training and testing sets to evaluate model performance on unseen data, critical for assessing generalization.
- Implementation: The dataset was divided into training and testing subsets, likely with stratification to maintain the class distribution of PerformanceRating in both sets.
- Purpose: To enable robust model evaluation by training on one subset and testing on another, preventing overfitting and ensuring realistic performance metrics.
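
Feature and target selection followed by a stratified split can be sketched as below. The 80/20 ratio and the random seed are assumptions made for the example, and the frame is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic frame with an imbalanced target (not project data).
df = pd.DataFrame({
    "EmpNumber": [f"E{i}" for i in range(100)],
    "Age": rng.integers(18, 60, size=100),
    "EmpHourlyRate": rng.integers(30, 100, size=100),
    "PerformanceRating": [3] * 70 + [2] * 20 + [4] * 10,
})

# Inputs: everything except the identifier and the target.
X = df.drop(columns=["EmpNumber", "PerformanceRating"])
y = df["PerformanceRating"]

# stratify=y keeps the class proportions the same in both subsets.
# test_size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```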

## x. Saving Preprocessing Artifacts
- Project Context: Preprocessing objects and the trained model were saved to ensure consistent application during inference, particularly for predicting new employee performance ratings.
- Implementation: The trained SVC pipeline was saved for persistence, together with the label encoders for the categorical variables, so that new data can be encoded consistently during prediction.
- Purpose: To maintain preprocessing consistency between training and inference, enabling accurate predictions on new employee data.
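
Persisting a fitted pipeline can be sketched with `joblib`, which is one common option; the report does not say which serialization library was used. The file name is hypothetical, the data is synthetic, and a temporary directory is used so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the preprocessed features.
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Bundling the scaler with the model keeps preprocessing and prediction together.
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "svc_pipeline.joblib"  # hypothetical file name
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == pipeline.predict(X)).all())
print("restored pipeline matches:", same_predictions)
```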

# <i>5. Machine Learning Model</i>

## Logistic Regression
- Used for multiclass classification with one-vs-rest strategy.
- Required feature scaling and encoding for numerical and categorical data.
- Class imbalance addressed using SMOTE; PCA applied to reduce dimensionality.
- Achieved 75.8% test accuracy and 0.77 F1 score.
- ROC-AUC Score: 0.8725
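
A hedged sketch of the one-vs-rest setup and of a multiclass ROC-AUC computation. The data is synthetic, SMOTE and PCA are left out for brevity, and macro averaging (scikit-learn's default) is assumed, so the printed score is unrelated to the figure reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem (not project data).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# One binary logistic regression per class, applied after feature scaling.
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X_train, y_train)

# Multiclass ROC-AUC needs class probabilities rather than hard labels.
proba = model.predict_proba(X_test)
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr")

print(f"one-vs-rest ROC-AUC: {auc_ovr:.3f}")
```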

## Ridge Classifier
- Applied ridge regularization for stable linear classification.
- Required encoding and SMOTE; PCA improved generalization.
- Achieved 69.6% test accuracy and 0.72 F1 score.
- ROC-AUC Score: Not applicable

## Support Vector Machine (SVM)
- Used RBF kernel to capture non-linear decision boundaries.
- Integrated scaling, SMOTE, and PCA in preprocessing pipeline.
- Achieved 80.4% test accuracy and 0.81 F1 score.
- ROC-AUC Score: 0.9222
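
The preprocessing pipeline around the RBF-kernel SVM can be sketched as below. The SMOTE step is omitted because resampling steps need imbalanced-learn's own `Pipeline` class rather than scikit-learn's; the 95% PCA threshold, the fold count, and the data are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class problem (not project data).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=3, random_state=1
)

# Scaling and PCA live inside the pipeline, so each cross-validation fold
# fits them on its own training portion only.
svm_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # retained variance is an assumed value
    ("svc", SVC(kernel="rbf")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(svm_pipeline, X, y, cv=cv, scoring="f1_weighted")

print("weighted F1 per fold:", scores.round(3))
print("mean weighted F1:", scores.mean().round(3))
```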

## Decision Tree Classifier
- Built decision rules based on feature splits.
- Handled categorical encoding and used SMOTE for balance.
- Achieved 92.1% test accuracy and 0.92 F1 score.
- ROC-AUC Score: 0.9062

## Random Forest Classifier
- Combined multiple decision trees for ensemble predictions.
- Required encoding and SMOTE; PCA boosted training efficiency.
- Achieved 92.5% test accuracy and 0.92 F1 score.
- ROC-AUC Score: 0.9471

## Gradient Boosting Classifier
- Used boosting strategy to improve weak learners iteratively.
- Preprocessing included encoding, SMOTE, and PCA.
- Achieved 92.9% test accuracy and 0.93 F1 score.
- ROC-AUC Score: 0.9782

## Naive Bayes Classifier
- Assumed feature independence and Gaussian distribution.
- Handled encoded data but struggled due to class imbalance.
- Achieved 20.0% test accuracy and 0.11 F1 score.
- ROC-AUC Score: 0.7600

## AdaBoost Classifier
- Applied adaptive boosting to combine weak classifiers.
- Pipeline included encoding, SMOTE, and PCA.
- Achieved 85.8% test accuracy and 0.85 F1 score.
- ROC-AUC Score: 0.9145

## XGBoost Classifier
- Implemented optimized gradient boosting for high performance.
- Required encoded features; SMOTE improved class balance.
- Achieved 92.1% test accuracy and 0.92 F1 score.
- ROC-AUC Score: 0.9688

## LightGBM Classifier
- Used histogram-based gradient boosting for speed and accuracy.
- Pipeline included encoding, SMOTE, and PCA.
- Achieved 94.2% test accuracy and 0.94 F1 score.
- ROC-AUC Score: 0.9789
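
A comparison across several classifiers, in the spirit of the list above, can be sketched as a simple loop. Only scikit-learn models are included so the example has no extra dependencies (XGBoost and LightGBM are left out), the data is synthetic, and the scores it prints do not correspond to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem (not project data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are library defaults apart from the seeds and max_iter.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
    }

# Rank by weighted F1, which accounts for class imbalance better than accuracy.
best_model = max(results, key=lambda name: results[name]["f1_weighted"])

for name, scores in results.items():
    print(f"{name:20s} acc={scores['accuracy']:.3f} f1={scores['f1_weighted']:.3f}")
print("best by weighted F1:", best_model)
```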

# <i>6. Conclusion</i>


- This project assessed a range of classification algorithms to predict employee performance levels using a structured and preprocessed dataset. Key preprocessing steps—including label encoding, SMOTE for class balancing, and PCA for dimensionality reduction—were uniformly applied to ensure fair and efficient model comparisons.

- Among the evaluated models, **LightGBM** delivered the best performance with **94.2% accuracy and 0.94 F1 score**, closely followed by **Gradient Boosting** (92.9% accuracy, 0.93 F1 score); both benefit from their ability to handle complex feature interactions and imbalanced data. **Random Forest**, **XGBoost**, and **Decision Tree** also performed strongly with test accuracies around **92%**, showcasing the reliability of tree-based methods.

- While **Logistic Regression** and **Ridge Classifier** offered simplicity and interpretability, they underperformed compared to ensemble and kernel-based approaches, achieving **75.8%** and **69.6% accuracy** respectively. **SVM** provided a good balance of complexity and performance, reaching **80.4% accuracy and 0.81 F1 score**.

- The **Naive Bayes classifier** struggled significantly due to its strong independence assumptions, resulting in the lowest performance.

- Overall, the results suggest that for high-dimensional, imbalanced datasets with mixed data types, **boosting and ensemble methods (LightGBM, Gradient Boosting, Random Forest)** are highly effective, especially when combined with robust preprocessing pipelines.
