# Predictive Modeling

Predictive modeling involves building, training, evaluating, and interpreting models that predict the target variable (Attrition) from the input features. This section walks through that process using several machine learning models.

## Data Preparation

Before building models, we need to prepare the data by handling missing values, encoding categorical features, and splitting the data into training and test sets.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the cleaned dataset
df = pd.read_csv('data/cleaned_hr_data.csv')

# Fill missing values in numeric columns with the column mean, if any
df = df.fillna(df.mean(numeric_only=True))

# Transform boolean to binary labels
df['Attrition_Yes'] = df['Attrition_Yes'].apply(lambda x: 1 if x else 0)

# Check the transformed column
print(df['Attrition_Yes'].value_counts())

# Separate features and target variable
X = df.drop('Attrition_Yes', axis=1)
y = df['Attrition_Yes']

# Identify numerical columns (the cleaned dataset is assumed to be fully numeric or boolean)
numerical_cols = X.select_dtypes(include=['float64', 'int64', 'bool']).columns

# Preprocessing: standardize the numerical features
# (columns not listed here are dropped by ColumnTransformer's default remainder='drop')
numerical_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols)
    ])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Attrition_Yes
0    1233
1     237
Name: count, dtype: int64
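
The cell above scales numeric columns only, because the cleaned dataset is assumed to be already encoded. If a raw categorical column were still present, it could be handled in the same `ColumnTransformer`. The following is a minimal sketch on a small made-up frame; the column names `Age`, `MonthlyIncome`, and `Department`, their values, and the `toy_*` variable names are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "Age": [25, 40, 33, 51],
    "MonthlyIncome": [3000.0, 8000.0, 5200.0, 9900.0],
    "Department": ["Sales", "R&D", "Sales", "HR"],
})

toy_numeric_cols = ["Age", "MonthlyIncome"]
toy_categorical_cols = ["Department"]

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Standardize numeric columns to zero mean and unit variance
        ("num", StandardScaler(), toy_numeric_cols),
        # One-hot encode categories; unseen categories at predict time become all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), toy_categorical_cols),
    ]
)

toy_features = toy_preprocessor.fit_transform(toy)
print(toy_features.shape)
```

The result has two scaled numeric columns plus one indicator column per distinct `Department` value.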


## Model Selection

We will select several machine learning models for comparison, including Logistic Regression, Random Forest, and Gradient Boosting.

## Model Training

We first train a logistic regression model and use it to predict attrition on the test set.

In [8]:

# Create a preprocessing pipeline and train the model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

# Train the model
pipeline.fit(X_train, y_train)
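
Because the features are standardized before the classifier, the fitted logistic regression coefficients can give a rough view of which features push predictions toward attrition. The following is a sketch on synthetic data from `make_classification`; the `feature_i` names and the `toy_pipeline` object are hypothetical stand-ins, not the notebook's actual features or pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; feature names are hypothetical
X_arr, y_toy = make_classification(n_samples=300, n_features=5, random_state=42)
feature_names = [f"feature_{i}" for i in range(5)]
X_toy = pd.DataFrame(X_arr, columns=feature_names)

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42)),
])
toy_pipeline.fit(X_toy, y_toy)

# One coefficient per scaled feature: the sign gives the direction of the effect
# on the log-odds of class 1, the magnitude gives its strength
coefficients = pd.Series(
    toy_pipeline.named_steps["classifier"].coef_[0], index=feature_names
)

# Order features by absolute coefficient, largest first
ranked = coefficients.reindex(coefficients.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficients describe association within this linear model only; correlated features can share or swap weight, so the ranking is a starting point for interpretation rather than a causal statement.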


## Model Evaluation

We will evaluate the performance of each model using accuracy, precision, recall, and F1-score.


In [9]:


# Predict on the test set
y_pred = pipeline.predict(X_test)
y_pred_prob = pipeline.predict_proba(X_test)[:, 1]

# Evaluate the model
print(classification_report(y_test, y_pred))

# Print predictions and probabilities
print("Predictions on the test set:", y_pred[:10])
print("Predicted probabilities on the test set:", y_pred_prob[:10])


# # Predict on new data
# # The pipeline applies the preprocessor itself, so pass the raw features directly
# new_data = pd.read_csv('path/to/your/new_data.csv')
# new_pred = pipeline.predict(new_data)
# new_pred_prob = pipeline.predict_proba(new_data)[:, 1]

# print("Predictions on new data:", new_pred)
# print("Predicted probabilities on new data:", new_pred_prob)

              precision    recall  f1-score   support

           0       0.92      0.95      0.93       255
           1       0.56      0.46      0.51        39

    accuracy                           0.88       294
   macro avg       0.74      0.70      0.72       294
weighted avg       0.87      0.88      0.88       294

Predictions on the test set: [0 0 0 0 0 0 0 0 0 0]
Predicted probabilities on the test set: [0.06252609 0.00285499 0.33684018 0.0136406  0.05261876 0.33604375
 0.41365112 0.03669842 0.0652064  0.01454172]
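
Two details of this output are worth checking by hand. First, the class-1 F1-score is the harmonic mean of its precision and recall: F1 = 2 × P × R / (P + R). Second, `predict` labels a row as 1 only when its predicted probability exceeds 0.5, which is why rows with probabilities around 0.34 and 0.41 are still predicted as 0. The sketch below reproduces both points using rounded copies of the numbers printed above; the 0.3 cut-off is an arbitrary illustration, not a tuned value, and `labels_at_threshold` is a hypothetical helper.

```python
import numpy as np

# Class-1 precision and recall as printed in the report (rounded to two decimals)
precision_1, recall_1 = 0.56, 0.46
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print(round(f1_1, 2))

# Rounded copies of the ten probabilities printed above
probs = np.array([0.0625, 0.0029, 0.3368, 0.0136, 0.0526,
                  0.3360, 0.4137, 0.0367, 0.0652, 0.0145])

def labels_at_threshold(probabilities, threshold):
    """Return 1 where the probability exceeds the threshold, else 0."""
    return (probabilities > threshold).astype(int)

default_labels = labels_at_threshold(probs, 0.5)  # default cut-off
lowered_labels = labels_at_threshold(probs, 0.3)  # hypothetical lower cut-off

print(default_labels)
print(lowered_labels)
```

With these ten values, the 0.5 cut-off flags no rows and the 0.3 cut-off flags three. Lowering the threshold generally trades precision for recall on the minority class, so any cut-off should be chosen on validation data rather than on the test set.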


## Model Comparisons

We can compare the performance of different models and select the one that works best for the dataset.

We compare the following models:
1. Logistic Regression
2. Decision Tree Classifier
3. Random Forest Classifier
4. Gradient Boosting Classifier
5. Support Vector Machine (SVM)
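
Reading several classification reports one after another makes side-by-side comparison awkward, so the minority-class metrics could be collected into a single table. The following is a sketch under stated assumptions: it uses synthetic data from `make_classification` with roughly the same 84% / 16% class balance as the attrition data, only two of the five models to keep it short, and hypothetical names (`demo_models`, `summary`).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR data (assumption: roughly 84% / 16% class balance)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

rows = []
for name, model in demo_models.items():
    model.fit(Xd_train, yd_train)
    # average="binary" reports the metrics for the positive class (label 1) only
    precision, recall, f1, _ = precision_recall_fscore_support(
        yd_test, model.predict(Xd_test), average="binary", zero_division=0
    )
    rows.append({"model": name, "precision": precision, "recall": recall, "f1": f1})

# One row per model, best minority-class F1 first
summary = pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)
print(summary.round(2))
```

Sorting by the class-1 F1-score rather than accuracy keeps the focus on the minority class, which accuracy alone tends to hide on imbalanced data.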

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

# Load the cleaned dataset
df = pd.read_csv('data/cleaned_hr_data.csv')

# Ensure correct data types
df['Attrition_Yes'] = df['Attrition_Yes'].apply(lambda x: 1 if x else 0)

# Separate features and target variable
X = df.drop('Attrition_Yes', axis=1)
y = df['Attrition_Yes']

# Identify numerical columns (assuming all features are numerical)
numerical_cols = X.select_dtypes(include=['float64', 'int64', 'bool']).columns

# Preprocessing pipeline for numerical features
numerical_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols)
    ])

# Split the data into training and test sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the preprocessor on the training data only, then apply it to the test data
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

# Train and evaluate models
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} Classification Report:")
    print(classification_report(y_test, y_pred))
    print("\n")

Training Logistic Regression...
Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.96      0.92       247
           1       0.60      0.32      0.42        47

    accuracy                           0.86       294
   macro avg       0.74      0.64      0.67       294
weighted avg       0.84      0.86      0.84       294



Training Decision Tree...
Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.85      0.87       247
           1       0.35      0.40      0.37        47

    accuracy                           0.78       294
   macro avg       0.61      0.63      0.62       294
weighted avg       0.80      0.78      0.79       294



Training Random Forest...
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.97      0.91       247
           1       0.42      0.1