# **1. Import Libraries & Modules**
Here we import the data manipulation libraries (Pandas, NumPy), the visualization tools (Matplotlib, Seaborn), the Scikit-Learn components used later (preprocessing, Logistic Regression, metrics), and the VIF helper from statsmodels.

In [237]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning & Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix, classification_report
from statsmodels.stats.outliers_influence import variance_inflation_factor

# **2. Data Loading & Initial Inspection**
We load the HR Employee Attrition dataset and run preliminary checks on its shape and data types, and on missing or duplicate values.

In [238]:
# Load the dataset
data = pd.read_csv('/content/HR-Employee-Attrition.csv')
data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [239]:
# Display dataset info
print("-------- Dataset Information -------")
data.info()

-------- Dataset Information -------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null  

In [240]:
# Check for null values
print("\n--- Missing Values ---")
print(data.isnull().sum().sum())


--- Missing Values ---
0


In [241]:
# Check for duplicates
print(f"\nNumber of duplicate rows: {data.duplicated().sum()}")


Number of duplicate rows: 0


# **3. Feature Engineering**
Financial data, such as `MonthlyIncome`, is often right-skewed (a few employees earn significantly more than the rest). We apply a **Log Transformation** to compress the long right tail, so that a handful of very high salaries do not dominate the feature once it is scaled.

In [242]:
# Apply Log Transformation to MonthlyIncome using log1p (log(1+x))
data['log_MonthlyIncome'] = np.log1p(data['MonthlyIncome'])

# Drop the original skewed column
data.drop('MonthlyIncome', axis=1, inplace=True)
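
To illustrate what the log transformation does to a right-skewed column, here is a minimal sketch on synthetic data. The `income` series is a made-up lognormal sample (an assumption for illustration, not the dataset's `MonthlyIncome`), and skewness is measured with Pandas' `Series.skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values (hypothetical, for illustration only)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=1000))

# log1p computes log(1 + x), which is also safe when x == 0
log_income = np.log1p(income)

skew_before = income.skew()
skew_after = log_income.skew()
print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")

# np.expm1 is the inverse of np.log1p, so the original scale can be recovered
recovered = np.expm1(log_income)
```

A skewness close to zero after the transform indicates a roughly symmetric distribution. Because `np.expm1` inverts `np.log1p`, values can still be reported on the original scale when needed.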

# **4. Feature Selection via VIF (Multicollinearity)**
Before training, we check for **Multicollinearity** using the Variance Inflation Factor (VIF).

* **Why?** If two features are highly correlated (e.g., `JobLevel` and `TotalWorkingYears`), the model cannot cleanly separate their effects: coefficient estimates become unstable and hard to interpret. We remove features with high VIF scores to improve model stability.

*Note: Since VIF requires numerical input, we temporarily encode categorical variables just for this calculation.*

In [243]:
# Create a temporary copy for VIF calculation
df_vif = data.copy()

# Drop columns that are not candidate predictors (constants, the employee ID, and the target)
df_vif = df_vif.drop(['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber', 'Attrition'], axis=1)

# Label-encode categorical variables (for the VIF calculation only)
le = LabelEncoder()
for col in df_vif.columns:
    if df_vif[col].dtype == 'object':
        df_vif[col] = le.fit_transform(df_vif[col])

# Calculate VIF for each column
# (no intercept column is added to df_vif, so these are uncentered VIFs)
vif_data = pd.DataFrame()
vif_data["Feature"] = df_vif.columns
vif_value = []

for i in range(len(df_vif.columns)):
    vif = variance_inflation_factor(df_vif.values, i)
    vif_value.append(vif)
vif_data['Multicolinarity'] = vif_value
print(vif_data.sort_values(by='Multicolinarity', ascending=False))

                     Feature  Multicolinarity
29         log_MonthlyIncome       243.171226
19         PerformanceRating       175.734227
18         PercentSalaryHike        44.873029
0                        Age        35.262031
11                  JobLevel        18.521617
24           WorkLifeBalance        16.371009
10            JobInvolvement        15.766680
22         TotalWorkingYears        14.561191
3                 Department        12.887887
9                 HourlyRate        11.665807
25            YearsAtCompany        10.584594
5                  Education         9.688319
12                   JobRole         8.032150
20  RelationshipSatisfaction         7.385899
7    EnvironmentSatisfaction         7.231039
13           JobSatisfaction         7.219655
1             BusinessTravel         6.929772
28      YearsWithCurrManager         6.476651
26        YearsInCurrentRole         6.440891
14             MaritalStatus         5.930215
23     TrainingTimesLastYear      
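
One caveat when reading VIF scores: the value depends on whether the auxiliary regressions include an intercept. The sketch below computes VIF from its definition, 1 / (1 - R²), in plain NumPy on two synthetic, independent columns. The helper `vif` and the column values are hypothetical and written only for this illustration; the no-intercept branch uses the uncentered R², which is what a regression without a constant reports.

```python
import numpy as np

def vif(X, i, intercept=True):
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    if intercept:
        others = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    if intercept:
        tss = np.sum((y - y.mean()) ** 2)  # centered total sum of squares
    else:
        tss = np.sum(y ** 2)               # uncentered, as in a no-intercept fit
    r2 = 1.0 - (resid @ resid) / tss
    return 1.0 / (1.0 - r2)

# Two independent synthetic columns whose means are far from zero
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(37, 9, 500),     # age-like values
    rng.normal(3.2, 0.4, 500),  # rating-like values
])

vif_no_intercept = [vif(X_demo, i, intercept=False) for i in range(X_demo.shape[1])]
vif_with_intercept = [vif(X_demo, i, intercept=True) for i in range(X_demo.shape[1])]
print("Without intercept:", np.round(vif_no_intercept, 2))
print("With intercept:   ", np.round(vif_with_intercept, 2))
```

Although the two columns are independent, the no-intercept scores come out large simply because both columns have means far from zero. With statsmodels, the usual way to get the centered version is to add a constant column (for example with `statsmodels.api.add_constant`) before calling `variance_inflation_factor`, and to ignore the score reported for the constant itself.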

# **5. Data Splitting & Preprocessing Pipeline**
We define our Features (X) and Target (Y). We then build a **Pipeline** to handle preprocessing automatically:
1.  **StandardScaler**: For numerical features (to handle different units like Age vs DailyRate).
2.  **OneHotEncoder**: For categorical features (converting 'Department' into binary vectors).
3.  **LogisticRegression**: The classifier, set with `class_weight='balanced'` to handle the imbalance between employees who stay vs. leave.

In [244]:
# Define features (X) and target (Y).
# Besides the constant/ID columns, drop the features removed in the multicollinearity step.
X = data.drop(['Attrition', 'EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber', 'PerformanceRating',
               'JobLevel', 'YearsWithCurrManager', 'YearsInCurrentRole', 'TotalWorkingYears'], axis=1)
Y = (data['Attrition'] == 'Yes').astype(int)  # 1 = left the company, 0 = stayed

# Split data (70% train, 30% test)
# stratify=Y keeps the Yes/No proportion the same in the train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=23, stratify=Y)

# Identify column types for the pipeline
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = X.select_dtypes(include=['object']).columns

# Define Preprocessor
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])

# Create Pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000))
])
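
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces the weights Scikit-Learn derives from the label counts, `n_samples / (n_classes * count_per_class)`. The label vector `y_toy` is a toy example with an assumed 84/16 split, not the actual training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 84 "stay" (0) and 16 "leave" (1) -- an assumed split
y_toy = np.array([0] * 84 + [1] * 16)

# Weights as computed by scikit-learn for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)

# The same formula written out by hand
manual = len(y_toy) / (2 * np.bincount(y_toy))

print("scikit-learn:", weights)
print("manual:      ", manual)
```

The minority class receives the larger weight, so misclassifying an employee who leaves costs the model more than misclassifying one who stays.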

# **6. Model Training**
We fit the pipeline to our training data. The pipeline automatically handles scaling the numbers and encoding the categories before passing the data to the Logistic Regression model.

In [245]:
model.fit(x_train, y_train)
print("Model Training Completed.")

Model Training Completed.


# **7. Model Evaluation**
We make predictions on the test set and evaluate performance.
* **Accuracy**: Overall correctness.
* **ROC-AUC**: Ability to distinguish between 'Yes' and 'No'.
* **Confusion Matrix**: Shows True Positives, False Positives, etc.
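
A note on how ROC-AUC is usually computed: `roc_auc_score` accepts either hard labels or continuous scores, and the two give different numbers. The sketch below uses synthetic data from `make_classification` (all names and data here are assumptions for illustration) to compare the two inputs. With hard 0/1 labels the ROC curve has a single interior point, so the area reduces to balanced accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (illustration only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.84, 0.16], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0, stratify=y_demo)

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

labels = clf.predict(X_te)              # hard 0/1 predictions at the 0.5 threshold
scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)
bal_acc = balanced_accuracy_score(y_te, labels)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {bal_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

The probability-based AUC summarizes ranking quality across all thresholds and is typically the figure meant by "ROC-AUC", while the label-based value describes a single operating point.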

In [246]:
# Generate Predictions
y_pred = model.predict(x_test)

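# Print Results
print(f"Accuracy Score: {accuracy_score(y_test, y_pred)*100:.2f}%")
# Note: y_pred holds hard 0/1 labels, so this AUC equals balanced accuracy at the default threshold
print(f"ROC AUC Score:  {roc_auc_score(y_test, y_pred)*100:.2f}%")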

print("\n--- Confusion Matrix ---")
print(confusion_matrix(y_test, y_pred))

print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred))

Accuracy Score: 74.60%
ROC AUC Score:  72.34%

--- Confusion Matrix ---
[[280  90]
 [ 22  49]]

--- Classification Report ---
              precision    recall  f1-score   support

           0       0.93      0.76      0.83       370
           1       0.35      0.69      0.47        71

    accuracy                           0.75       441
   macro avg       0.64      0.72      0.65       441
weighted avg       0.83      0.75      0.77       441
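
The figures in the report can be traced back to the four cells of the confusion matrix. The short sketch below recomputes them by hand from the matrix printed above, using Scikit-Learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
cm = np.array([[280, 90],
               [22, 49]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 329 / 441
precision = tp / (tp + fp)        # 49 / 139: share of predicted leavers who actually left
recall = tp / (tp + fn)           # 49 / 71: share of actual leavers the model caught
specificity = tn / (tn + fp)      # 280 / 370: share of stayers correctly identified
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"Accuracy:          {accuracy:.4f}")
print(f"Precision (1):     {precision:.4f}")
print(f"Recall (1):        {recall:.4f}")
print(f"F1 (1):            {f1:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

The balanced accuracy matches the reported ROC AUC of 72.34%, which is expected when the AUC is computed from hard labels. The low precision for class 1 means roughly two out of three employees flagged as likely to leave actually stayed, the trade-off accepted in exchange for catching about 69% of those who left.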

