# **Summary**
This notebook builds a baseline model on the cleaned dataset produced by the notebook "EDA.ipynb". The dataset has *not* been preprocessed beyond one-hot encoding of the categorical features, and the baseline is a simple random forest classifier.

In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

# display max columns
pd.set_option('display.max_columns', None)

# import data
df = pd.read_csv("cleaned_HRDataset.csv")
df.shape

(311, 44)

In [3]:
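# drop identifiers, dates, location fields, and termination-related columns that would leak the target
col_drop = ['EmpID', 'Employee_Name', 'DOB', 'Zip', 'State', 'DateofHire', 'DateofTermination', 'LastPerformanceReview_Date',
            'TermReason', 'EmploymentStatus', 'EmpStatusID']
df = df.drop(col_drop, axis=1)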
df.shape

(311, 33)

## **1. Setting up Model**

In [4]:
X = df.drop('Termd', axis=1)
y = df['Termd']

In [5]:
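# create pipeline

# categorical features are the object-dtype columns
cat_features = X.select_dtypes(include=['object']).columns

# one-hot encode categoricals; categories unseen during fit are encoded as all zeros
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# NOTE: ColumnTransformer defaults to remainder='drop', so every column not
# listed here (i.e. all non-object columns) is discarded before the model is fit
preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, cat_features)
])

model = RandomForestClassifier(random_state=42)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)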
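
The `ColumnTransformer` above lists only the categorical columns, and scikit-learn's default `remainder='drop'` discards every other column. The following is a minimal sketch, on a small made-up DataFrame (the column names `dept`, `salary`, and `absences` are hypothetical, not taken from the HR dataset), of how `remainder='passthrough'` would keep the numeric columns alongside the one-hot encoded ones. It is an illustration of the option, not the configuration used for the results in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data; column names are hypothetical, not from the HR dataset
toy = pd.DataFrame({
    'dept': ['a', 'b', 'a', 'c'],
    'salary': [50, 60, 55, 70],
    'absences': [1, 0, 3, 2],
})
cat_cols = ['dept']

# Default behaviour: columns not listed in `transformers` are dropped
drop_rest = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)]
)

# remainder='passthrough' appends the untouched columns to the output
keep_rest = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
    remainder='passthrough',
)

n_dropped = drop_rest.fit_transform(toy).shape[1]  # one-hot columns only
n_kept = keep_rest.fit_transform(toy).shape[1]     # one-hot + numeric columns
print(n_dropped, n_kept)
```

With three distinct `dept` values and two numeric columns, the first transformer yields 3 output columns and the second yields 5.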

In [6]:
y_pred = pipeline.predict(X_test)

In [7]:
def model_metrics(y_true, y_pred):
    # scikit-learn metrics expect the true labels first, then the predictions
    print(f'Accuracy: {accuracy_score(y_true, y_pred):.4f}')
    print(f'Precision: {precision_score(y_true, y_pred):.4f}')
    print(f'Recall: {recall_score(y_true, y_pred):.4f}')
    print(f'F1: {f1_score(y_true, y_pred):.4f}')
    # computed from hard class predictions rather than predicted probabilities
    print(f'ROC_AUC: {roc_auc_score(y_true, y_pred):.4f}')

In [8]:
model_metrics(y_test, y_pred)

Accuracy: 0.6508
Precision: 0.5000
Recall: 0.2273
F1: 0.3125
ROC_AUC: 0.5527


## **2. Evaluation**
The random forest reaches an **accuracy** of about 0.65, meaning it correctly classifies roughly **65%** of the test instances. Accuracy alone is misleading here: the metrics above imply that 41 of the 63 test instances are negative, so a model that always predicted the negative class would score the same 65%. Precision and recall give a more informative picture of the model's performance.

A **precision** of 0.50 means that only **50%** of the instances the model predicts as positive are actually positive. A **recall** of 0.23 means the model identifies only **23%** of the actual positive instances; it misses more than three quarters of the true positives.

The classes are imbalanced, so the F1-score is a more informative single metric than accuracy. The **F1-score** of **0.31** shows that the model balances precision and recall poorly, dragged down mainly by the low recall.
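
scikit-learn's metric functions take the true labels first and the predictions second, and precision and recall are not symmetric in those arguments. The sketch below uses made-up labels (not this notebook's data) to show how precision and recall follow from a confusion matrix, and how swapping the two arguments exchanges them.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

# Passing the predictions first silently computes recall instead of precision
swapped_precision = precision_score(y_hat, y_true)

print(precision, recall, swapped_precision)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

Here one of the two predicted positives is correct and one of the four actual positives is found, so precision is 0.5 and recall is 0.25; the swapped call returns 0.25.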

Overall, the model has moderate accuracy but struggles to predict the positive class. Potential improvements include:
- feature selection
- addressing the class imbalance
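
One possible way to approach the class imbalance, sketched on synthetic data from `make_classification` rather than the HR dataset: stratify the train/test split so both sets keep the same class ratio, and pass `class_weight='balanced'` to the random forest so the minority class is weighted more heavily. The variable names and the 70/30 class ratio are assumptions for illustration, and whether this actually improves recall on the HR data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 30% positives); a stand-in for the HR dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=42
)

# stratify=y_demo keeps the class ratio the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight='balanced' reweights classes inversely to their frequency
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_tr, y_tr)

# ROC AUC is best computed from predicted probabilities, not hard labels
proba = clf.predict_proba(X_te)[:, 1]
demo_recall = recall_score(y_te, clf.predict(X_te))
demo_auc = roc_auc_score(y_te, proba)
print(demo_recall, demo_auc)
```

Comparing these numbers against the same model without `class_weight` and without stratification would show whether the change is worthwhile for a given dataset.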