## Importing Libraries
This section imports all the required libraries for data manipulation, visualization, preprocessing, model training, and evaluation.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC


## Load and Explore Dataset
This section loads the dataset from a local CSV file into a pandas DataFrame, ready for exploratory data analysis such as summary statistics.

In [2]:
# Load the dataset from a local CSV file (the path is machine-specific)
csv_path = r"C:\Users\Sonymaths\Downloads\archive\WA_Fn-UseC_-HR-Employee-Attrition.csv"
df = pd.read_csv(csv_path)
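The summary statistics mentioned above are not shown in the cell, so here is a minimal sketch of what such a check might look like. The helper name `summarize` and the toy frame (including the `Age` and `MonthlyIncome` columns and their values) are assumptions made for illustration; only the `Attrition` column name comes from the notebook.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "Attrition") -> dict:
    """Collect a few basic exploratory summaries (hypothetical helper)."""
    return {
        "shape": df.shape,                                   # (rows, columns)
        "missing_per_column": df.isna().sum(),               # NaN count per column
        "numeric_summary": df.describe(),                    # count/mean/std/quantiles
        "target_share": df[target].value_counts(normalize=True),  # class balance
    }


# Tiny made-up frame standing in for the real CSV (values are illustrative only)
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

summary = summarize(toy)
print(summary["shape"])
print(summary["target_share"])
```

On the real data, calling `summarize(df)` would surface the class balance of `Attrition`, which is what motivates the SMOTE step later on.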


## Data Preprocessing
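This section builds preprocessing pipelines for numerical features (median imputation, standard scaling) and categorical features (most-frequent imputation, one-hot encoding), applies them to the full feature matrix, splits the result 80/20 into training and test sets, and oversamples the minority class in the training set with SMOTE.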

In [3]:
# Define numerical and categorical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).drop(columns=['Attrition']).columns.tolist()

# Preprocessing pipeline for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing pipeline for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))  # scikit-learn >= 1.2; older releases use sparse=False
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Apply the transformations
X = df.drop(columns=['Attrition'])
y = (df['Attrition'] == 'Yes').astype(int)  # 1 = employee left, 0 = stayed

X_processed = preprocessor.fit_transform(X)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)

# Address class imbalance using SMOTE
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)




## Model Training and Hyperparameter Tuning
This section trains a Support Vector Machine (SVM) classifier and tunes `C` and the kernel with a grid search over stratified 5-fold cross-validation, scored by F1. It then keeps the best estimator and prints its parameters.


In [4]:
# Define the SVM model
svm = SVC(probability=True, random_state=42)

# Define the hyperparameter grid
param_grid_svm = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['linear', 'rbf']}

# Perform grid search with cross-validation
strat_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search_svm = GridSearchCV(svm, param_grid_svm, cv=strat_kfold, scoring='f1', return_train_score=True)
grid_search_svm.fit(X_train_res, y_train_res)

# Best SVM model
best_svm = grid_search_svm.best_estimator_
print(f"Support Vector Machine Best Params: {grid_search_svm.best_params_}")


Support Vector Machine Best Params: {'C': 100, 'kernel': 'rbf'}
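Beyond the single best parameter set, the fitted grid search keeps per-combination scores in `cv_results_`. The sketch below shows one way to tabulate them; it runs on synthetic data from `make_classification` with a reduced grid so that it finishes quickly, so the numbers it prints say nothing about the attrition model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for X_train_res / y_train_res
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

grid = GridSearchCV(
    SVC(random_state=42),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",
    return_train_score=True,  # needed for the mean_train_score column
)
grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
columns = ["param_C", "param_kernel", "mean_train_score", "mean_test_score", "rank_test_score"]
results = pd.DataFrame(grid.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))
```

A large gap between `mean_train_score` and `mean_test_score` for a given row is a hint that the combination overfits the training folds.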


## Model Evaluation
This section evaluates the tuned SVM on the held-out test set. It predicts class labels and class probabilities, computes accuracy, precision, recall, F1 score, and ROC-AUC, and prints the classification report and confusion matrix for a per-class breakdown.

In [5]:
# Evaluate the best SVM model
y_pred = best_svm.predict(X_test)
y_prob = best_svm.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")

# Display classification report and confusion matrix
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.8673
Precision: 0.5000
Recall: 0.3590
F1 Score: 0.4179
ROC-AUC Score: 0.7811
Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.95      0.93       255
           1       0.50      0.36      0.42        39

    accuracy                           0.87       294
   macro avg       0.70      0.65      0.67       294
weighted avg       0.85      0.87      0.86       294

Confusion Matrix:
[[241  14]
 [ 25  14]]
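As a cross-check, the headline metrics can be recomputed by hand from the confusion matrix printed above. The snippet assumes scikit-learn's layout (rows are true labels, columns are predicted labels, class 0 first).

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
cm = np.array([[241, 14],
               [25, 14]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # correct predictions over all 294 test rows
precision = tp / (tp + fp)           # share of predicted leavers who actually left
recall = tp / (tp + fn)              # share of actual leavers the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.8673
print(f"Precision: {precision:.4f}")  # 0.5000
print(f"Recall:    {recall:.4f}")     # 0.3590
print(f"F1 Score:  {f1:.4f}")         # 0.4179
```

The accuracy is carried largely by the 241 true negatives; for the attrition class itself the model finds 14 of 39 leavers.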


## Cross-Validation Evaluation
This section re-scores the best SVM model with stratified 5-fold cross-validation on the SMOTE-resampled training data. It prints the F1 score for each fold, followed by the mean and standard deviation across folds. Because SMOTE was applied before the folds were formed, synthetic samples interpolated from points in a validation fold can end up in the corresponding training folds, and the hyperparameters were selected on these same folds, so the scores are optimistic relative to the test-set F1 of 0.4179.

In [6]:
# Cross-validation scores
cv_results = cross_val_score(best_svm, X_train_res, y_train_res, cv=strat_kfold, scoring='f1')

print(f"Cross-Validation F1 Scores: {cv_results}")
print(f"Mean CV F1 Score: {cv_results.mean():.4f}")
print(f"Standard Deviation of CV F1 Scores: {cv_results.std():.4f}")


Cross-Validation F1 Scores: [0.95121951 0.9408867  0.96517413 0.94348894 0.97270471]
Mean CV F1 Score: 0.9547
Standard Deviation of CV F1 Scores: 0.0123
