
# HR Analytics: Promotion Prediction

**Contents**
- Problem description
- Load data (train/test)
- Exploratory Data Analysis (EDA)
- Preprocessing and feature engineering
- Model training (Random Forest baseline)
- Cross-validation and validation metrics
- Create submission file

This notebook was generated automatically. Run the cells in order in a Jupyter environment (or Google Colab).

Files used (the code reads them from `/mnt/data/`; adjust the paths if your files live elsewhere):
- `train.csv`
- `test.csv`
- `sample_submission.csv`


In [None]:

# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 200)


In [None]:

# Load data
train = pd.read_csv('/mnt/data/train.csv')
test = pd.read_csv('/mnt/data/test.csv')
sample = pd.read_csv('/mnt/data/sample_submission.csv')

print('Train shape:', train.shape)
print('Test shape:', test.shape)
train.head()


In [None]:

# Quick EDA
train.info()  # prints its summary directly and returns None, so display() is not needed
display(train.describe(include='all').T)
print('\nMissing values per column:')
print(train.isnull().sum())

# Target distribution
print('\nTarget value counts:')
print(train['is_promoted'].value_counts(normalize=True))
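
The target shares printed above set the context for every accuracy figure later in the notebook. One way to make that concrete is to compute the accuracy of a model that always predicts the majority class. The sketch below uses a toy series with made-up proportions (`y_toy` is hypothetical, not taken from `train.csv`); in the notebook you would pass `train['is_promoted']` instead.

```python
import pandas as pd

# Toy target with hypothetical class proportions (NOT taken from train.csv)
y_toy = pd.Series([0] * 90 + [1] * 10, name="is_promoted")

# Share of each class in the target
class_share = y_toy.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model whose accuracy is close to this baseline has learned little about the minority class, whatever the headline number looks like.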



## Preprocessing plan
- Drop `employee_id` (identifier)
- Identify categorical and numeric columns
- Handle missing values (SimpleImputer)
- Encode categorical variables (OneHot or Ordinal where appropriate)
- Train a RandomForest baseline model
- Use StratifiedKFold cross-validation for evaluation


In [None]:

# Prepare features
X = train.drop(['is_promoted','employee_id'], axis=1)
y = train['is_promoted']

X_test = test.drop(['employee_id'], axis=1)

# Identify columns: numeric columns with at most 10 distinct values are treated
# as categorical, except the continuous features listed in always_numeric.
always_numeric = ['avg_training_score', 'age', 'length_of_service']
numeric_cols = []
categorical_cols = []
for c in X.columns:
    if X[c].dtype in ['int64', 'float64']:
        if X[c].nunique() <= 10 and c not in always_numeric:
            categorical_cols.append(c)
        else:
            numeric_cols.append(c)
    else:
        # Object/string columns are always categorical
        categorical_cols.append(c)

print('Numeric columns:', numeric_cols)
print('Categorical columns:', categorical_cols)


In [None]:

# Pipelines (Pipeline, SimpleImputer, OneHotEncoder and ColumnTransformer are imported in the first cell)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
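    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4;
    # on versions older than 1.2, use sparse=False instead.
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))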
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))])

# Stratified 5-fold CV keeps the class ratio roughly equal in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=skf, scoring='accuracy', n_jobs=-1)
print('CV accuracy scores:', scores)
print('CV accuracy mean: {:.4f}'.format(scores.mean()))
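
Accuracy alone can hide poor performance on the promoted class if promotions are rare. As a hedged sketch, `cross_validate` can report several metrics per fold in one pass. The example runs on synthetic data from `make_classification` so that it is self-contained; `X_demo`, `y_demo`, `model_demo` and `skf_demo` are illustrative names, and in the notebook you would pass `clf`, `X`, `y` and `skf` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the real data (roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

model_demo = RandomForestClassifier(n_estimators=50, random_state=42)
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score every fold on several metrics at once
metrics = ["accuracy", "f1", "roc_auc"]
cv_results = cross_validate(model_demo, X_demo, y_demo, cv=skf_demo, scoring=metrics)

for metric in metrics:
    fold_scores = cv_results[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.4f}, std={fold_scores.std():.4f}")
```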


In [None]:

# Holdout split for quick validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.15, stratify=y, random_state=42)
clf.fit(X_tr, y_tr)

# Validate
y_pred = clf.predict(X_val)
print('Validation accuracy:', accuracy_score(y_val, y_pred))
print('\nClassification report:\n', classification_report(y_val, y_pred))
print('\nConfusion matrix:\n', confusion_matrix(y_val, y_pred))
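
`predict` applies a fixed 0.5 cut-off to the predicted probabilities, which is not necessarily the best choice when one class is rare. The sketch below, again on synthetic data with illustrative names (`X_demo`, `model_demo` and so on are assumptions, not part of the notebook), scans a few thresholds on the validation split and reports the one with the highest F1. A threshold chosen this way is scored on the same data it was tuned on, so treat the resulting F1 as optimistic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr_d, X_val_d, y_tr_d, y_val_d = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=42
)

model_demo = RandomForestClassifier(n_estimators=100, random_state=42)
model_demo.fit(X_tr_d, y_tr_d)

# Probability of the positive class for each validation row
proba = model_demo.predict_proba(X_val_d)[:, 1]

# Evaluate F1 at each candidate threshold
thresholds = np.arange(0.1, 0.91, 0.05)
f1_by_threshold = [
    f1_score(y_val_d, (proba >= t).astype(int), zero_division=0)
    for t in thresholds
]

best_idx = int(np.argmax(f1_by_threshold))
print(f"Best threshold: {thresholds[best_idx]:.2f}, F1: {f1_by_threshold[best_idx]:.4f}")
```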


In [None]:

# Train on full training data
clf.fit(X, y)

# Predict on test
test_preds = clf.predict(X_test)

# Prepare submission (assumes sample_submission.csv lists rows in the same order as test.csv)
submission = sample.copy()
submission['is_promoted'] = test_preds
submission.to_csv('/mnt/data/submission_hr_promotion.csv', index=False)
print('Saved submission to /mnt/data/submission_hr_promotion.csv')
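
Copying predictions into `sample` relies on the sample file having the same row order as `test.csv`. A more defensive variant builds the submission from the test IDs directly. This is a minimal sketch on toy data; it assumes the expected submission columns are `employee_id` and `is_promoted`, and `test_demo` and `preds_demo` are hypothetical stand-ins for `test` and `test_preds`.

```python
import pandas as pd

# Toy stand-ins for `test` and `test_preds` (hypothetical values)
test_demo = pd.DataFrame({"employee_id": [101, 102, 103], "age": [30, 41, 25]})
preds_demo = [0, 1, 0]

# Pair each prediction with the ID of the row it was computed from
submission_demo = pd.DataFrame({
    "employee_id": test_demo["employee_id"],
    "is_promoted": preds_demo,
})

# Sanity checks before writing the file
assert len(submission_demo) == len(test_demo)
assert submission_demo["employee_id"].is_unique
print(submission_demo)
```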


In [None]:

# Save trained model pipeline
joblib.dump(clf, '/mnt/data/hr_promotion_pipeline.joblib')
print('Saved pipeline to /mnt/data/hr_promotion_pipeline.joblib')
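
A saved pipeline is only useful if it can be loaded back and produces the same predictions. The sketch below shows that round trip with `joblib.load` on a small stand-in pipeline written to a temporary directory; `pipe_demo` and the file name are illustrative, and in the notebook the path would be the one passed to `joblib.dump` above. Loading generally requires the same scikit-learn version that was used for saving.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Small stand-in pipeline trained on synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
pipe_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
]).fit(X_demo, y_demo)

# Save to a temporary location, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_pipeline.joblib"
    joblib.dump(pipe_demo, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions
same_predictions = (restored.predict(X_demo) == pipe_demo.predict(X_demo)).all()
print("Restored pipeline reproduces predictions:", same_predictions)
```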



## Next steps (suggestions to improve model)
- Feature engineering: create interaction features, bin ages; scale numeric features if you switch to a linear model (tree ensembles do not need it)
- Try other models: XGBoost, LightGBM, Logistic Regression (with class weights)
- Hyperparameter tuning: GridSearchCV or RandomizedSearchCV
- Handle class imbalance if present (SMOTE, class_weight)
- Add visual EDA: distribution plots, correlations, boxplots
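
Two of the suggestions above, class weighting and randomized hyperparameter search, can be combined in a few lines. This is a minimal sketch on synthetic data, not a tuned configuration: the parameter ranges, `n_iter` and the F1 scoring choice are assumptions for illustration, and `X_demo`, `y_demo` and `search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the real data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency
base_model = RandomForestClassifier(class_weight="balanced", random_state=42)

# Illustrative search space; widen or narrow it for real use
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    base_model,
    param_distributions=param_distributions,
    n_iter=4,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV F1: {search.best_score_:.4f}")
```

To tune the notebook's own pipeline instead, the parameter names would need the step prefix, for example `classifier__n_estimators`.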

Save this notebook and upload it to your GitHub repo, optionally together with the data files and the saved pipeline.
