### Using validation data to improve evaluation metrics

This notebook contains various approaches used to enhance evaluation metrics in the notebook available at (https://github.com/BorjaDaguerre/Predicting_promotions/blob/main/Promotions_prediction.ipynb) using validation data. The original dataset lacked features that exhibited significant correlation with the target variable. Therefore, the objective was to explore distinct machine learning methods to preprocess the input data utilized by our models. This exploration aimed to uncover or amplify potential 'signals' within the data. Three different methods were employed to enhance evaluation metrics, yielding the following outcomes:

* Scaling: This conventional technique aims to bring various features within a consistent range. It can benefit specific ML methods by rendering data and potential relationships more manageable. In this case, all non-dummy features were scaled; however, no discernible improvement in evaluation performance was observed.

* Under/Oversampling: This method involves identifying underrepresented or overrepresented classes or data instances in the dataset. It adjusts the instance count based on their status to mitigate class imbalance issues, as evident in the current dataset. This approach allows better utilization of patterns within the data due to the increased or decreased presence of minority/majority classes. Both under and oversampling methods were applied, yielding diverse outcomes. While undersampling significantly improved Recall at the expense of decreased Accuracy, thus proving effective for minimizing false negatives (FN), oversampling exhibited an overall decline in metrics.

* RFE (Recursive Feature Elimination): RFE employs cross-validation to iteratively eliminate irrelevant features that could introduce noise to the prediction process. It retains informative features to predict classification values. In our case, 29 features were retained out of the initial 60 (a result of one-hot encoding). Eliminated features primarily consisted of dummy variables ('region_1', 'region_2', etc.) as well as 'gender'. RFE resulted in a minor accuracy reduction while maintaining consistent values for other metrics. Given the similarity to the original results and its potential for generating a more parsimonious and less overfit model, this method will be implemented and compared against the original in test set evaluations.

Based on these results, for the test evaluation we will use the scaled model with no additional preprocessing and a undersampler given the good results with the recall metric.


### Dataset information 

Features included:

* employee_id: employee id
* department: name of the deparment the employee works at
* region: number of the region the employee works at, ranges from 1 to 34
* education: level of education the employee has
* gender: gender of teh employee
* recruitment_channel: which method was used to recruit the employee
* no_of_trainings: number of training the employee has had, ranging from 1 to 10
* age: age of the employee
* previous_year_rating: work rating the employee obtained one year before, ranges from 1 to 5
* length_of_service: number of years the employee has worked on the company
* KPIs_met >80%: indicates if the Key Performance Indicators were met above a 80% threshold 
* awards_won?: indicates if the employee has won an award while working on the company
* avg_training_score: average score of the training processes performed while in the company
* is_promoted: our target variable, indicates if the employee has been promoted or not


Hope you enjoy it! Any feedback is welcome at borjadaguerre@gmail.com.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import json
import seaborn as sns
import random
warnings.filterwarnings("ignore")

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder, OrdinalEncoder
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier

warnings.filterwarnings("ignore")

In [2]:
#test set does not contain target values, train set has been split into train a test to account for it

df = pd.read_csv(r'C:\Users\gebruiker\Documents\DataScience\DatasSets\train_LZdllcl.csv')
validation_set = df[:int(len(df) * 0.75)]
validation = int(len(validation_set) * 0.75)
validation_train = validation_set[:validation]
validation_test = validation_set[validation:]


In [3]:
#dropping 'employer_id' becuase is irrelevant

validation_train = validation_train.drop(columns = ['employee_id'])
validation_test = validation_test.drop(columns = ['employee_id'])

In [4]:
#replacing the NaN vallues of the feature education with the 'Unknoww' value

validation_train['education'].fillna('Unknown', inplace=True)
validation_test['education'].fillna('Unknown', inplace=True)

In [5]:
#first split of categorical and numerical vairables 

categorical = validation_train.select_dtypes(include=['object'])
numerical = validation_train.select_dtypes(include=['int64','float64'])

In [6]:
# Perform one-hot encoding on the categorical columns
one_hot_encoded_train = pd.get_dummies(validation_train[categorical.columns], prefix=categorical.columns)
one_hot_encoded_test = pd.get_dummies(validation_test[categorical.columns], prefix=categorical.columns)


# Concatenate the one-hot encoded columns to the original DataFrame
validation_train = pd.concat([validation_train, one_hot_encoded_train], axis=1)
validation_test = pd.concat([validation_test, one_hot_encoded_test], axis=1)

In [7]:
# No significant differences across the 'train['previous_year_rating'].mean()' values, imputation has been done along the whole 
# training set disregarding the train/validation split in this instance

print(validation_train['previous_year_rating'].mean(), validation_test['previous_year_rating'].mean())

3.3334503457717557 3.3276931161324224


In [8]:
#imputing the NaN values of 'previouos_year_rating' with the training mean, interpolation with through linear method also 
#available since 'previouos_year_rating' is lighly correlated with 'KPI_met > 80%'

validation_train['previous_year_rating'].fillna(int(validation_train['previous_year_rating'].mean()), inplace = True)
validation_test['previous_year_rating'].fillna(int(validation_test['previous_year_rating'].mean()), inplace = True)


In [9]:
promotions = ['Yes' if i == 1 else 'No' for  i in validation_train['is_promoted']]


In [10]:
#selecting the numerical features to use or models on. Categorical features have been one-hot encoded before

validation_train  = validation_train.select_dtypes(include=['int64','float64','uint8'])
validation_test = validation_test.select_dtypes(include = ['int64', 'float64', 'uint8'])

In [11]:
non_dummies = [col for col in validation_train.columns if len(np.unique(validation_train[col])) > 2]

scaler = StandardScaler()

validation_train_scaled = pd.DataFrame(data = scaler.fit_transform(validation_train[non_dummies]), columns = non_dummies)
validation_train = validation_train.drop(columns = non_dummies)
validation_train = validation_train.join(validation_train_scaled)

validation_test_scaled = pd.DataFrame(data = scaler.fit_transform(validation_test[non_dummies]), columns = non_dummies)
validation_test = validation_test.drop(columns = non_dummies)
validation_test.index = validation_test_scaled.index
validation_test = validation_test.join(validation_test_scaled)

In [12]:
#dividing into train, validation, and test

X_train_val = validation_train.drop('is_promoted',axis = 1)
y_train_val = validation_train['is_promoted']
X_test_val = validation_test.drop('is_promoted',axis = 1)
y_test_val = validation_test['is_promoted']

validation_raw = X_train_val, y_train_val, X_test_val, y_test_val

In [13]:
# Define the oversampler
oversampler = RandomOverSampler()
undersampler = RandomUnderSampler()

# Oversample the minority class
X_train_val_oversampled, y_train_val_oversampled = oversampler.fit_resample(X_train_val, y_train_val)


# Undersample the majority class
X_train_val_undersampled, y_train_val_undersampled = undersampler.fit_resample(X_train_val, y_train_val)

He have a couple of features with good correaltinal coefficients and a lot with low coefficient values, let's try some validation testing to see how they'll affect the results.

In [14]:
random_state = 4


# Define the base models to be trained and evaluated
base_models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest Classifier": RandomForestClassifier(random_state=random_state),
    "XGBClassifier": XGBClassifier(random_state=random_state),
    "BaggingClassifier": BaggingClassifier(random_state=random_state),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=random_state),
}

In [15]:
def evaluate_models(models, X_train, y_train, X_test, y_test ):

    
    results = pd.DataFrame(columns=['training_accuracy',"Accuracy", "Recall", "Precision", "F1"])

    for model_name, model in models.items():

        model.fit(X_train, y_train)
        score = model.score(X_train, y_train)
        
        y_pred = model.predict(X_test)

        training_accuracy= round(score,3)
        accuracy = round(accuracy_score(y_test, y_pred),3)
        recall = round(recall_score(y_test, y_pred),3)
        precision = round(precision_score(y_test, y_pred),3)
        f1 = round(f1_score(y_test, y_pred),3)

        results.loc[f"{model_name} "] = [training_accuracy, accuracy, recall, precision, f1]
        

    results.sort_values(by="F1", ascending=False, inplace=True)
    
    
    return results

In [16]:
random_state = 4


# Define the base models to be trained and evaluated
models_rfe = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest Classifier": RandomForestClassifier(random_state=random_state),
    "XGBClassifier": XGBClassifier(random_state=random_state),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=random_state),
}

In [17]:
def evaluate_models_rfe(models, X_train, y_train, X_test, y_test):

    
    results_rfe = pd.DataFrame(columns=["Accuracy", "Recall", "Precision", "F1", 'number_of_features'])

    for model_name, model in models_rfe.items():
        
        rfe = RFE(model, n_features_to_select=None)
        rfe.fit(X_train, y_train)
        selected_features = X_train.columns[rfe.support_]

        model.fit(X_train, y_train)
        
        y_pred = model.predict(X_test)
        accuracy = round(accuracy_score(y_test, y_pred),3)
        recall = round(recall_score(y_test, y_pred),3)
        precision = round(precision_score(y_test, y_pred),3)
        f1 = round(f1_score(y_test, y_pred),3)
        number_of_features = len(selected_features)

        results_rfe.loc[f"{model_name} "] = [accuracy, recall, precision, f1, number_of_features]
        

    results_rfe.sort_values(by="F1", ascending=False, inplace=True)
    return results_rfe, selected_features

In [18]:
#validation raw 
val_raw = evaluate_models(base_models, X_train_val, y_train_val, X_test_val, y_test_val)
val_raw

Unnamed: 0,training_accuracy,Accuracy,Recall,Precision,F1
XGBClassifier,0.952,0.941,0.38,0.886,0.532
BaggingClassifier,0.991,0.931,0.364,0.71,0.482
GradientBoostingClassifier,0.939,0.937,0.287,0.966,0.442
Random Forest Classifier,1.0,0.931,0.266,0.842,0.404
Logistic Regression,0.933,0.93,0.253,0.841,0.389


In [19]:
#validation raw with rfe
validation_rfe = evaluate_models_rfe(base_models, X_train_val, y_train_val, X_test_val, y_test_val)
validation_rfe

(                             Accuracy  Recall  Precision     F1  \
 XGBClassifier                   0.941   0.380      0.886  0.532   
 GradientBoostingClassifier      0.937   0.287      0.966  0.442   
 Random Forest Classifier        0.931   0.266      0.842  0.404   
 Logistic Regression             0.930   0.253      0.841  0.389   
 
                              number_of_features  
 XGBClassifier                              29.0  
 GradientBoostingClassifier                 29.0  
 Random Forest Classifier                   29.0  
 Logistic Regression                        29.0  ,
 Index(['KPIs_met >80%', 'awards_won?', 'department_Analytics',
        'department_Finance', 'department_HR', 'department_Operations',
        'department_Procurement', 'department_R&D',
        'department_Sales & Marketing', 'department_Technology',
        'region_region_17', 'region_region_22', 'region_region_25',
        'region_region_27', 'region_region_28', 'region_region_3',
        'regio

In [20]:
#validation scaled

val_scaled = evaluate_models(base_models, X_train_val, y_train_val, X_test_val, y_test_val)
val_scaled

Unnamed: 0,training_accuracy,Accuracy,Recall,Precision,F1
XGBClassifier,0.952,0.941,0.38,0.886,0.532
BaggingClassifier,0.991,0.931,0.364,0.71,0.482
GradientBoostingClassifier,0.939,0.937,0.287,0.966,0.442
Random Forest Classifier,1.0,0.931,0.266,0.842,0.404
Logistic Regression,0.933,0.93,0.253,0.841,0.389


In [21]:
#validation scaled with rfe
val_scaled_rfe = evaluate_models_rfe(evaluate_models_rfe, X_train_val, y_train_val, X_test_val, y_test_val)
val_scaled_rfe

(                             Accuracy  Recall  Precision     F1  \
 XGBClassifier                   0.941   0.380      0.886  0.532   
 GradientBoostingClassifier      0.937   0.287      0.966  0.442   
 Random Forest Classifier        0.931   0.266      0.842  0.404   
 Logistic Regression             0.930   0.253      0.841  0.389   
 
                              number_of_features  
 XGBClassifier                              29.0  
 GradientBoostingClassifier                 29.0  
 Random Forest Classifier                   29.0  
 Logistic Regression                        29.0  ,
 Index(['KPIs_met >80%', 'awards_won?', 'department_Analytics',
        'department_Finance', 'department_HR', 'department_Operations',
        'department_Procurement', 'department_R&D',
        'department_Sales & Marketing', 'department_Technology',
        'region_region_17', 'region_region_22', 'region_region_25',
        'region_region_27', 'region_region_28', 'region_region_3',
        'regio

In [22]:
#validation with under/oversampling

val_undersampled =evaluate_models(base_models, X_train_val_undersampled, y_train_val_undersampled, X_test_val, y_test_val)
val_undersampled

Unnamed: 0,training_accuracy,Accuracy,Recall,Precision,F1
BaggingClassifier,0.991,0.786,0.803,0.263,0.397
XGBClassifier,0.941,0.772,0.846,0.257,0.394
Logistic Regression,0.787,0.766,0.816,0.247,0.379
Random Forest Classifier,1.0,0.748,0.866,0.24,0.376
GradientBoostingClassifier,0.824,0.718,0.947,0.23,0.37


In [23]:
val_oversampled = evaluate_models(base_models, X_train_val_oversampled, y_train_val_oversampled, X_test_val, y_test_val)
val_oversampled

Unnamed: 0,training_accuracy,Accuracy,Recall,Precision,F1
Random Forest Classifier,1.0,0.927,0.366,0.643,0.466
BaggingClassifier,1.0,0.919,0.406,0.547,0.466
XGBClassifier,0.922,0.833,0.744,0.311,0.439
Logistic Regression,0.792,0.766,0.813,0.247,0.378
GradientBoostingClassifier,0.82,0.718,0.951,0.23,0.371


In [24]:
val_oversampled_rfe = evaluate_models_rfe(models_rfe, X_train_val_oversampled, y_train_val_oversampled, X_test_val, y_test_val)
val_oversampled_rfe

(                             Accuracy  Recall  Precision     F1  \
 Random Forest Classifier        0.927   0.366      0.643  0.466   
 XGBClassifier                   0.833   0.744      0.311  0.439   
 Logistic Regression             0.766   0.813      0.247  0.378   
 GradientBoostingClassifier      0.718   0.951      0.230  0.371   
 
                              number_of_features  
 Random Forest Classifier                   29.0  
 XGBClassifier                              29.0  
 Logistic Regression                        29.0  
 GradientBoostingClassifier                 29.0  ,
 Index(['KPIs_met >80%', 'awards_won?', 'department_Analytics',
        'department_Finance', 'department_HR', 'department_Operations',
        'department_Procurement', 'department_R&D',
        'department_Sales & Marketing', 'department_Technology',
        'region_region_22', 'region_region_26', 'region_region_28',
        'region_region_29', 'region_region_31', 'region_region_32',
        'regi

In [25]:
val_undersampled_rfe = evaluate_models_rfe(models_rfe, X_train_val_undersampled, y_train_val_undersampled, X_test_val, y_test_val)
val_undersampled_rfe

(                             Accuracy  Recall  Precision     F1  \
 XGBClassifier                   0.772   0.846      0.257  0.394   
 Logistic Regression             0.766   0.816      0.247  0.379   
 Random Forest Classifier        0.748   0.866      0.240  0.376   
 GradientBoostingClassifier      0.718   0.947      0.230  0.370   
 
                              number_of_features  
 XGBClassifier                              29.0  
 Logistic Regression                        29.0  
 Random Forest Classifier                   29.0  
 GradientBoostingClassifier                 29.0  ,
 Index(['KPIs_met >80%', 'awards_won?', 'department_Analytics',
        'department_Finance', 'department_HR', 'department_Operations',
        'department_Procurement', 'department_R&D',
        'department_Sales & Marketing', 'department_Technology',
        'region_region_15', 'region_region_16', 'region_region_17',
        'region_region_21', 'region_region_22', 'region_region_26',
        'regi