> # ⚠️ **IMPORTANT: READ BEFORE STARTING THIS LAB**
>
> ### Throughout this lab, you will see **🔧 Try It Yourself** sections and a final 🔧 **Reflection** section
>
> ✅ You are expected to:
> - Complete each **"🔧 Try It Yourself"** section by writing and running your own code, or by answering the prompted questions, in a Markdown or Python cell below the section.
> - Answer the **Reflection** section at the end of the lab in your own words. This is your opportunity to summarize what you learned and connect the concepts.
>
> 🔧 Look for the **wrench emoji** 🔧 — it highlights where you're expected to take action!
>
> ### These sections are **graded** and are **not optional**. Skipping them will impact your lab score.
>
> ---

# Lab 11: Regression

In this lab, we’ll use the merged HR dataset to build a **logistic regression model** to predict whether an employee is likely to leave the company (i.e., attrition).

### Objectives:
1. Load and explore the dataset
2. Clean and prepare features
3. Encode categorical variables
4. Split the data into training and test sets
5. Train and evaluate a logistic regression model
6. Reflect on variable importance and model fit

**Target Variable:** `Attrition` (Yes/No)

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/lab11_regression_v2.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Dataset Overview

**Dataset:** `merged_hr_data.csv`  
Source: [Kaggle HR Analytics Case Study](https://www.kaggle.com/datasets/vjchoudhary7/hr-analytics-case-study)

| Variable                      | Type        | Description |
|-------------------------------|-------------|-------------|
| `Age`                         | Numeric     | Age of the employee |
| `Attrition`                   | Categorical | Whether the employee has left the company (Yes/No) |
| `BusinessTravel`              | Categorical | Frequency of business travel |
| `Department`                  | Categorical | Department name |
| `DistanceFromHome`            | Numeric     | Distance from home to work (in km) |
| `Education`                   | Ordinal     | Employee education level (1–5) |
| `EducationField`              | Categorical | Field of education |
| `EmployeeID`                  | Identifier  | Unique identifier for employee |
| `EmployeeCount`               | Constant    | Always 1 (not useful for modeling) |
| `EnvironmentSatisfaction`     | Ordinal     | Satisfaction with the work environment (1–4) |
| `Gender`                      | Categorical | Gender of the employee |
| `JobInvolvement`              | Ordinal     | Level of involvement with the job (1–4) |
| `JobLevel`                    | Ordinal     | Employee level (1–5) |
| `JobRole`                     | Categorical | Job title |
| `JobSatisfaction`             | Ordinal     | Self-reported job satisfaction (1–4) |
| `MaritalStatus`               | Categorical | Marital status |
| `MonthlyIncome`               | Numeric     | Monthly salary in USD |
| `NumCompaniesWorked`          | Numeric     | Number of companies previously worked for |
| `Over18`                      | Constant    | Always "Y" (not useful) |
| `PercentSalaryHike`           | Numeric     | Percentage salary increase |
| `PerformanceRating`           | Ordinal     | Most recent performance rating (1–4) |
| `StandardHours`               | Constant    | Always 80 (not useful) |
| `StockOptionLevel`            | Ordinal     | Stock options level (0–3) |
| `TotalWorkingYears`           | Numeric     | Total years of professional experience |
| `TrainingTimesLastYear`       | Numeric     | Number of training sessions attended last year |
| `WorkLifeBalance`             | Ordinal     | Work-life balance rating (1–4) |
| `YearsAtCompany`              | Numeric     | Years spent at the current company |
| `YearsInCurrentRole`          | Numeric     | Years spent in current role |
| `YearsSinceLastPromotion`     | Numeric     | Years since last promotion |
| `YearsWithCurrManager`        | Numeric     | Years with current manager |



## Part 1: Load data and packages




In [None]:
import pandas as pd

# Load the merged HR dataset
url = "https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/fb26c9731228bfbdefe94143ac53ef5500a199e8/DataSets/merged_hr_data.csv"
df = pd.read_csv(url)

# Preview structure
print("Shape:", df.shape)
df.head()


## Part 2: Data Cleaning

Real-world HR data often contains administrative fields (e.g., ID numbers), constants (same value for all rows), or missing values.

### What We’re Doing:
- Remove irrelevant or constant columns: `EmployeeCount`, `Over18`, `StandardHours`, `EmployeeID`
- Drop rows with missing data

### Why It Matters:
- Non-informative or redundant features can reduce model accuracy and interpretability.
- Logistic regression does not handle missing values natively, so we need a clean dataset.
- Dropping some rows is reasonable here due to the relatively small number of nulls.

> Ethical Note: In practice, dropping rows may disproportionately exclude certain groups—so this step should be handled with caution.

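Before dropping anything, it can help to check which columns are actually constant and how many rows `dropna()` would remove. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical, not the HR data) to show one way to run both checks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the HR dataset
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "Age": [34, 45, np.nan, 29],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})

# Columns with a single distinct value carry no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of rows that dropna() would remove
rows_lost = 1 - len(toy.dropna()) / len(toy)

print("Constant columns:", constant_cols)
print(missing_counts)
print(f"Rows lost to dropna: {rows_lost:.0%}")
```

Running the same checks on `df` before the cleaning cell would show whether the drop list and the "small number of nulls" assumption hold for the real data.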

In [None]:
# Drop unnecessary columns
drop_cols = ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeID']
df.drop(columns=drop_cols, inplace=True)

# Drop rows with any missing values
df.dropna(inplace=True)

# Check result
print("After cleaning:", df.shape)


## Part 3: Encode Categorical Variables

Machine learning algorithms like logistic regression require **numeric inputs**. To use categorical data like `Gender` or `JobRole`, we convert them into **dummy variables** using one-hot encoding.

### Key Steps:
- Convert target `Attrition` to 1 (Yes) and 0 (No)
- Use `pd.get_dummies()` with `drop_first=True` to avoid multicollinearity

### Why It Matters:
- Ensures model can interpret categorical inputs numerically
- Dropping the first dummy prevents the "dummy variable trap" where one variable is a linear combination of others
- Accurate encoding helps ensure model fairness and interpretability

> Reminder: Avoid encoding identifiers or columns with too many unique levels without reduction.

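To see what `drop_first=True` changes, here is a minimal sketch on a made-up three-department table. The `toy` data and the names `full` and `reduced` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data with one categorical feature
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "Department": ["Sales", "HR", "R&D", "Sales"],
    "Age": [29, 41, 35, 50],
})

# Binary-encode the target, as in the lab
toy["Attrition"] = toy["Attrition"].map({"Yes": 1, "No": 0})

# One dummy column per category
full = pd.get_dummies(toy, columns=["Department"])

# Same encoding, but the first category (alphabetically) is dropped
reduced = pd.get_dummies(toy, columns=["Department"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In `full`, the three department columns always sum to 1, so any one of them can be computed from the other two. `reduced` drops `Department_HR`, which becomes the baseline that the remaining department coefficients are compared against.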


In [None]:
# Binary encode target
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

# One-hot encode categorical features
df_encoded = pd.get_dummies(df, drop_first=True)

# Preview encoded columns
df_encoded.columns

### 🔧 Try It Yourself - Part 3

1. How many new columns were created during one-hot encoding?  
2. Why is it important to avoid including columns like `EmployeeID` in modeling?


🔧 Add comment here:

## Part 4: Standardizing Features for Logistic Regression

When using models like **logistic regression**, it's important to ensure all numeric features are on a similar scale. This helps the model converge more reliably and prevents features with larger magnitudes from dominating the learning process.

In this step, we'll use `StandardScaler` from `sklearn` to scale all feature columns to have a mean of 0 and a standard deviation of 1.

This is especially important when:
- You're using gradient-based algorithms (like logistic regression, SVM, or neural networks)
- Your dataset includes variables with vastly different units or scales (e.g., "Age" vs. "MonthlyIncome")

> **Note:** The target variable (`Attrition`) should **not** be scaled — only the input features.

---
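As a quick illustration of what `StandardScaler` computes, the sketch below scales two made-up columns and compares the result with the z-score formula `z = (x - mean) / std` applied by hand. The `toy` values are hypothetical, and `demo_scaler` is a throwaway name chosen so it does not clash with the `scaler` used in the lab.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales
toy = pd.DataFrame({
    "Age": [25, 35, 45, 55],
    "MonthlyIncome": [20000, 50000, 80000, 110000],
})

demo_scaler = StandardScaler()
scaled = pd.DataFrame(
    demo_scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Same transformation by hand, using the population standard deviation (ddof=0)
by_hand = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
print("Matches manual z-scores:", np.allclose(scaled, by_hand))
```

After scaling, both columns have mean 0 and standard deviation 1, so a one-unit change means "one standard deviation" for every feature.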


In [None]:
from sklearn.preprocessing import StandardScaler

# Separate the features and the target
X = df_encoded.drop(columns=['Attrition'])
y = df_encoded['Attrition']

# Apply standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reconstruct scaled DataFrame, keeping the original row index so X stays aligned with y
X = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)


### 🔧 Try It Yourself - Part 4

You've now scaled your features using `StandardScaler`, which makes each feature have a mean of 0 and a standard deviation of 1.

**Think about this:**
Suppose we didn't standardize the features and trained a logistic regression model using raw input data instead. What might happen to:
- The convergence of the model (would it complete successfully?)
- The interpretation or relative importance of the coefficients?

**Write one or two sentences** explaining how not standardizing the data could affect the model's performance or interpretability.


🔧 Add comment here:

## Part 5: Train-Test Split

We'll split the dataset into:
- 80% for training
- 20% for testing

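To preserve class proportions, we **stratify on `Attrition`**, so the training and test sets each contain roughly the same share of leavers as the full dataset.

> This step helps avoid a class imbalance between the training and test sets, which matters most in classification tasks.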



In [None]:
from sklearn.model_selection import train_test_split

# Use already standardized features in X, and original target y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Check the shapes of the splits
X_train.shape, X_test.shape

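The cells above fit the scaler on every row before splitting, so the test rows contribute to the means and standard deviations used for training. One common way to avoid that is to split first and let a `Pipeline` fit the scaler on the training rows only. The sketch below shows the idea on synthetic data from `make_classification`. The names `X_demo`, `X_tr`, `pipe`, and the class weights are assumptions for illustration, not part of the lab's required workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the encoded HR features (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.84, 0.16], random_state=42
)

# Split first, so the test rows are never seen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training means and standard deviations to transform the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
test_accuracy = pipe.score(X_te, y_te)

print("Scaler fitted on", fitted_scaler.n_samples_seen_, "training rows")
print("Test accuracy:", round(test_accuracy, 3))
```

On a dataset of this size the difference in results is usually small, but the split-then-scale order keeps the test set a fair stand-in for unseen data.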


### 🔧 Try It Yourself - Part 5

1. What percentage of each class (Yes/No) is in your training set?
2. Why is stratified sampling especially important for classification?


In [None]:
# 🔧 Add code here

🔧 Add comment here:

## Part 6: Train Logistic Regression

Now we fit a logistic regression model using the training data. This model estimates the **probability** of attrition given employee characteristics.

> Logistic regression is simple, interpretable, and a great starting point.

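For intuition about where the predicted probability comes from, this sketch fits a small model on synthetic data and rebuilds the probabilities by hand with the sigmoid function. Everything prefixed with `demo` or suffixed with `_demo` is hypothetical illustration data; `coef_`, `intercept_`, and `predict_proba` are the standard scikit-learn attributes and method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, not the HR dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score for each row: intercept + sum(coefficient * feature value)
z = demo_model.intercept_[0] + X_demo @ demo_model.coef_[0]

# The sigmoid maps the score to a probability between 0 and 1
manual_prob = 1 / (1 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
sklearn_prob = demo_model.predict_proba(X_demo)[:, 1]
print("Manual and sklearn probabilities agree:", np.allclose(manual_prob, sklearn_prob))

# exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase in that feature
odds_ratios = np.exp(demo_model.coef_[0])
print(odds_ratios.round(3))
```

A positive coefficient gives an odds ratio above 1 (higher odds of the positive class), and a negative coefficient gives an odds ratio below 1. With standardized features, "one unit" means one standard deviation.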

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and fit logistic model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Display the 10 largest positive coefficients
coefficients = pd.Series(model.coef_[0], index=X_train.columns)
coefficients.sort_values(ascending=False).head(10)


### 🔧 Try It Yourself - Part 6

1. Which features are most positively associated with attrition?
2. Which features are most negatively associated with attrition (that is, most associated with staying)?

🔧 Add comment here:

## Part 7: Evaluate Model Performance

Let’s test how well our model generalizes to unseen data. We'll compute:
- Accuracy
- Confusion matrix
- Precision, recall, F1-score

> These metrics show how well we balance false positives vs. false negatives.

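The sketch below shows how accuracy, precision, recall, and F1 fall out of the four confusion-matrix counts. It uses ten made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not model output from this lab).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = left the company, 0 = stayed
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

demo_accuracy = (tp + tn) / (tp + tn + fp + fn)
demo_precision = tp / (tp + fp)   # of predicted leavers, how many really left
demo_recall = tp / (tp + fn)      # of actual leavers, how many were caught
demo_f1 = 2 * demo_precision * demo_recall / (demo_precision + demo_recall)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("Accuracy:", round(demo_accuracy, 3))
print("Precision:", round(demo_precision, 3), "Recall:", round(demo_recall, 3))
print("F1:", round(demo_f1, 3))

# Cross-check the hand calculations against scikit-learn
print("Precision matches:", abs(demo_precision - precision_score(y_true_demo, y_pred_demo)) < 1e-9)
print("Recall matches:", abs(demo_recall - recall_score(y_true_demo, y_pred_demo)) < 1e-9)
```

Comparing accuracy with recall for the positive class is a useful habit whenever one class is much rarer than the other.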

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on test set
y_pred = model.predict(X_test)

# Compute metrics
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", report)


### 🔧 Try It Yourself - Part 7

1. Is the model better at predicting “Yes” (leavers) or “No” (stayers)?
2. Do you think accuracy is the best metric here?

🔧 Add comment here:

## Part 8: Feature Selection for Accuracy Improvement

Not all features equally influence attrition. By identifying and using only the most important predictors, we can:
- Simplify the model
- Potentially improve performance or interpretability
- Reduce overfitting

We’ll use the logistic regression model’s coefficients to rank feature importance.


In [None]:
# Get top 10 features based on absolute coefficient magnitude
top_features = coefficients.abs().sort_values(ascending=False).head(10)

# Display the top features with their absolute coefficient values
top_features

### 🔧 Try It Yourself - Part 8

1. Create a new training and test set using only the 10 most important features.
2. Retrain the logistic regression model on this reduced dataset.
3. Evaluate performance (accuracy, confusion matrix, classification report).


In [None]:
# 🔧 Step 1: Identify the top N most important features using absolute value of logistic regression coefficients

# 🔧 Step 2: Create new versions of X_train and X_test with only those top features

# 🔧 Step 3: Initialize and fit a new LogisticRegression model on the reduced feature set

# 🔧 Step 4: Use the new model to predict on the test set

# 🔧 Step 5: Evaluate the reduced model using accuracy, confusion matrix, and classification report

## 🔧 Part 9: Reflection

1. How did the reduced-feature model compare to the full model?
2. Would this version be easier to explain or use in an HR meeting?


🔧 Add comment here: