> ### Note on Labs and Assignments:
>
> 🔧 Look for the **wrench emoji** 🔧 — it highlights where you're expected to take action!
>
> These sections are graded and are not optional.
>

# Lab 11: Linear Regression for HR

In this lab, we’ll use the merged HR dataset to build a **linear regression model** to predict the number of years that an employee will work for the company (job tenure).

### Problem Statement
A large company employs around 4000 people at any given time. However, around 15% of its employees leave each year and need to be replaced from the talent pool available in the job market. Management believes this level of turnover is bad for the company for the following reasons:

- The former employees’ projects get delayed, which makes it difficult to meet timelines, resulting in a reputation loss among consumers and partners
- A sizeable department has to be maintained, for the purposes of recruiting new talent
- More often than not, the new employees have to be trained for the job and/or given time to acclimatise themselves to the company

Hence, the management has contracted an HR analytics firm to understand what factors they should focus on, in order to increase the number of years that employees stay with the company. In other words, they want to know what changes they should make to their workplace, in order to get most of their employees to stay. Also, they want to know which of these variables is most important and needs to be addressed right away.

Since you are one of the star analysts at the firm, this project has been given to you.

### Goal of the case study
You are required to model the number of years employees work for the company using a linear regression model. Management will use the results to understand what changes they should make to their workplace in order to get most of their employees to stay.


### Analytics Objectives:
1. Load and explore the dataset
2. Clean and prepare features
3. Encode categorical variables
4. Split the data into training and test sets
5. Train and evaluate a linear regression model
6. Reflect on variable importance and model fit

**Target Variable:** `YearsAtCompany` 


<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/lab_11_regression.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Dataset Overview

**Dataset:** `merged_hr_data.csv`  
Source: [Kaggle HR Analytics Case Study](https://www.kaggle.com/datasets/vjchoudhary7/hr-analytics-case-study)

| Variable                      | Type        | Description |
|-------------------------------|-------------|-------------|
| `Age`                         | Numeric     | Age of the employee |
| `Attrition`                   | Categorical | Whether the employee has left the company (Yes/No) |
| `BusinessTravel`              | Categorical | Frequency of business travel |
| `Department`                  | Categorical | Department name |
| `DistanceFromHome`           | Numeric     | Distance from home to work (in km) |
| `Education`                  | Ordinal     | Employee education level (1–5) |
| `EducationField`             | Categorical | Field of education |
| `EmployeeID`                 | Identifier  | Unique identifier for employee |
| `EmployeeCount`              | Constant    | Always 1 (not useful for modeling) |
| `EnvironmentSatisfaction`    | Ordinal     | Satisfaction with the environment (1–4) |
| `Gender`                     | Categorical | Gender of the employee |
| `JobInvolvement`             | Ordinal     | Level of involvement with job (1–4) |
| `JobLevel`                   | Ordinal     | Employee level (1–5) |
| `JobRole`                    | Categorical | Job title |
| `JobSatisfaction`            | Ordinal     | Satisfaction with the job (1–4) |
| `MaritalStatus`              | Categorical | Marital status |
| `MonthlyIncome`              | Numeric     | Monthly salary in USD |
| `NumCompaniesWorked`         | Numeric     | Number of companies previously worked for |
| `Over18`                     | Constant    | Always "Y" (not useful) |
| `PercentSalaryHike`          | Numeric     | Percentage salary increase |
| `PerformanceRating`          | Ordinal     | Performance rating (1–4) |
| `StandardHours`              | Constant    | Always 80 (not useful) |
| `StockOptionLevel`           | Ordinal     | Stock options level (0–3) |
| `TotalWorkingYears`          | Numeric     | Total years of professional experience |
| `TrainingTimesLastYear`      | Numeric     | Number of training sessions attended last year |
| `WorkLifeBalance`            | Ordinal     | Work-life balance rating (1–4) |
| `YearsAtCompany`             | Numeric     | Years spent at the current company |
| `YearsInCurrentRole`         | Numeric     | Years spent in current role |
| `YearsSinceLastPromotion`    | Numeric     | Years since last promotion |
| `YearsWithCurrManager`       | Numeric     | Years with current manager |



## Part 1: Load data and packages




In [None]:
import pandas as pd

# Load the merged HR dataset
url = "https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/merged_hr_data.csv"
df = pd.read_csv(url)

# Preview structure
print("Shape:", df.shape)
df.head()


## Part 2: Data Cleaning

Real-world HR data often contains administrative fields (e.g., ID numbers), constants (same value for all rows), or missing values.

### What We’re Doing:
- Remove irrelevant or constant columns: `EmployeeCount`, `Over18`, `StandardHours`, `EmployeeID`
- Remove `Attrition` as well (you will be asked why in the Part 3 questions)
- Drop rows with missing data

### Why It Matters:
- Non-informative or redundant features can reduce model accuracy and interpretability.
- Regression does not handle missing values natively, so we need a clean dataset.
- Dropping some rows is reasonable here due to the relatively small number of nulls.

> Ethical Note: In practice, dropping rows may disproportionately exclude certain groups—so this step should be handled with caution.
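
Before dropping anything, it can help to confirm which columns really are constant and how many nulls each column holds. The sketch below uses a small made-up DataFrame (`toy`, with invented values) rather than the real HR file, so the names and numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the HR data (values are made up for illustration)
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "StandardHours": [80, 80, 80, 80],   # constant column
    "Age": [34, 29, np.nan, 41],         # one missing value
    "YearsAtCompany": [5, 2, 7, 10],
})

# Columns with a single unique value carry no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Count missing values per column before deciding whether to drop rows
null_counts = toy.isna().sum()

# Drop the constant column and the identifier, then any rows with nulls
cleaned = toy.drop(columns=constant_cols + ["EmployeeID"]).dropna()

print("Constant columns:", constant_cols)
print(null_counts)
print("Rows before/after:", len(toy), len(cleaned))
```

The same checks (`nunique` and `isna().sum()`) could be run on `df` to see how many rows the cleaning step is likely to remove.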


In [None]:
# Drop unnecessary columns
drop_cols = ['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeID', 'Attrition']
df.drop(columns=drop_cols, inplace=True)

# Drop rows with any missing values
df.dropna(inplace=True)

# Check result
print("After cleaning:", df.shape)


## Part 3: Encode Categorical Variables

Machine learning algorithms like linear regression require **numeric inputs**. To use categorical data like `Gender` or `JobRole`, we convert them into **dummy variables** using one-hot encoding.

### Key Steps:
- Use `pd.get_dummies()` with `drop_first=True` to avoid multicollinearity

### Why It Matters:
- Ensures model can interpret categorical inputs numerically
- Dropping the first dummy prevents the "dummy variable trap" where one variable is a linear combination of others
- Accurate encoding helps ensure model fairness and interpretability

> Reminder: Avoid encoding identifiers or columns with too many unique levels without reduction.
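
To see what `drop_first=True` does, here is a minimal sketch on a made-up three-row DataFrame (`toy` is a hypothetical stand-in, not the lab data). With three department levels, only two dummy columns are created; the dropped level becomes the baseline that the others are compared against.

```python
import pandas as pd

# Toy data with one numeric and one categorical column (values are invented)
toy = pd.DataFrame({
    "Age": [34, 29, 41],
    "Department": ["Sales", "HR", "R&D"],
})

# Numeric columns pass through unchanged; each categorical column
# becomes (number of levels - 1) dummy columns
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
```

Here `HR` is the first level alphabetically, so it is the one left out.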



In [None]:
# One-hot encode categorical features
df_encoded = pd.get_dummies(df, drop_first=True)

# Preview encoded columns
df_encoded.info()

### 🔧 Try It Yourself - Part 3

1. How many new columns were created during one-hot encoding?  
2. Why is it important to avoid including columns like `EmployeeID` in modeling?
3. Our model is trying to predict employment longevity.  Why is `Attrition` problematic for predicting years at the company? 

Write a few sentences on each of the questions above. No coding is required here. 


🔧 Add comment here:

## Part 4: Standardizing Features for Regression

When using models like **linear regression**, it's highly recommended to put all numeric features on a similar scale. This makes the coefficients easier to compare with one another and, for models that are fit iteratively or with regularization, helps the fit converge more reliably and keeps features with larger magnitudes from dominating.

In this step, we'll use `StandardScaler` from `sklearn` to scale all feature columns to have a mean of 0 and a standard deviation of 1.

This is especially important when your dataset includes variables with vastly different units or scales (e.g., "Age" vs. "MonthlyIncome").

> **Note:** The target variable (`YearsAtCompany`) should **not** be scaled — only the input features.
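
As a quick illustration of what standardization does, the sketch below scales a made-up two-column DataFrame (`toy`, with invented values) and compares the result with the formula `(x - mean) / std` computed by hand. Note that `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (values are invented)
toy = pd.DataFrame({
    "Age": [25, 35, 45, 55],
    "MonthlyIncome": [2000, 4000, 6000, 8000],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# The same transformation written out by hand for one column
manual_age = (toy["Age"] - toy["Age"].mean()) / toy["Age"].std(ddof=0)

print(scaled.round(3))
print(manual_age.round(3))
```

After scaling, both columns have mean 0 and standard deviation 1, even though the raw values differed by two orders of magnitude.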

---


In [None]:
from sklearn.preprocessing import StandardScaler

# Separate the features and the target
X = df_encoded.drop(columns=['YearsAtCompany'])
y = df_encoded['YearsAtCompany']

# Apply standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reconstruct scaled DataFrame, keeping the original row index so X stays aligned with y
X = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

# Preview the scaled features
X.head()

### 🔧 Try It Yourself - Part 4

You've now scaled your features using `StandardScaler`, which makes each feature have a mean of 0 and a standard deviation of 1.

**Think about this:**
Suppose we didn't standardize the features and trained a regression model using raw input data instead. What might happen to the interpretation or relative importance of the coefficients?

**Write one or two sentences** explaining how not standardizing the data could affect the model's performance or interpretability.


🔧 Add comment here:

## Part 5: Train-Test Split

We'll split the dataset into:
- 80% for training
- 20% for testing

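To keep the distribution of the target similar in both sets, we **stratify on our target variable**. Because `YearsAtCompany` is numeric, each distinct tenure value is treated as its own class, and `stratify` requires every distinct value to appear at least twice.

> Stratification is most often used in classification tasks, where it helps avoid class imbalance between the training and test sets.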
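
The effect of `stratify` is easiest to see with a simple binary label. The sketch below uses made-up arrays (`X_toy` and `labels` are hypothetical, not the HR data) with a 70/30 class mix and checks that an 80/20 split keeps that mix in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows: 70 of class 0 and 30 of class 1
X_toy = np.arange(100).reshape(-1, 1)
labels = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, labels, test_size=0.2, stratify=labels, random_state=42
)

print(X_tr.shape, X_te.shape)
print("Share of class 1 (train, test):", y_tr.mean(), y_te.mean())
```

Both parts end up with 30% of class 1, matching the full data.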



In [None]:
from sklearn.model_selection import train_test_split

# Use already standardized features in X, and original target y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Check the shapes of the splits
X_train.shape, X_test.shape



### 🔧 Try It Yourself - Part 5

1. In the code cell below, calculate the average `YearsAtCompany` for all employees.
2. Then answer the following question in the markdown cell: Why is stratified sampling especially important for classification?


In [None]:
# 🔧 Add code here

🔧 Add comment here:

## Part 6: Train the Regression

Now we fit a linear regression model using the training data.  

In [None]:
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Display coefficients sorted from largest (most positive) to smallest (most negative)
coefficients_linear = pd.Series(linear_model.coef_, index=X.columns)
print("\nLinear Regression Coefficients (ordered):")
print(coefficients_linear.sort_values(ascending=False))
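
To build intuition for reading the coefficient list, the sketch below fits a regression on synthetic data where the true relationship is known in advance (`y = 3*x1 - 2*x2 + 5`). The variable names and data are invented for illustration and have nothing to do with the HR columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features and a noise-free target with known coefficients
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
y_toy = 3 * X_toy["x1"] - 2 * X_toy["x2"] + 5

model = LinearRegression().fit(X_toy, y_toy)

# A positive coefficient means the prediction rises as the feature rises;
# a negative coefficient means it falls
coefs = pd.Series(model.coef_, index=X_toy.columns).sort_values(ascending=False)

print(coefs)
print("Intercept:", model.intercept_)
```

The fitted model recovers the sign and size of each effect. On real data the recovery is never this clean, but the reading of signs and magnitudes is the same.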


### 🔧 Try It Yourself - Part 6

1. Which features are most positively associated with high job tenure (years at company)?
2. Which features are most negatively associated with staying?

Write a few sentences on each of the questions above. No coding is required here. 

🔧 Add comment here:

## Part 7: Evaluate Model Performance

First, let's visualize the model output.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Predict on the test set so we have values to plot
y_pred_linear = linear_model.predict(X_test)

# Create a scatter plot of predicted vs actual YearsAtCompany
plt.figure(figsize=(10, 6))
sns.regplot(x=y_test, y=y_pred_linear, scatter_kws={'alpha':0.6}, line_kws={"color": "red"})
plt.xlabel("Actual YearsAtCompany")
plt.ylabel("Predicted YearsAtCompany")
plt.title("Actual vs. Predicted YearsAtCompany with Regression Line")
plt.grid(True)
plt.show()

Now let’s test how well our model generalizes to unseen data. We'll compute:
- Mean Squared Error (MSE)
- R-Squared
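
For reference, both metrics can be computed by hand. The sketch below uses four made-up actual and predicted values (not model output) and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values for illustration
y_true = np.array([3, 5, 7, 9])
y_hat = np.array([2, 5, 8, 9])

# MSE: average of the squared prediction errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_hat))
print("R2:", r2_manual, r2_score(y_true, y_hat))
```

MSE is expressed in squared units of the target (years squared here), while R-squared is unitless and describes the share of variance in the target that the model explains.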

In [None]:
# Make predictions on the test set
y_pred_linear = linear_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred_linear)
r2 = r2_score(y_test, y_pred_linear)

# Print the evaluation metrics
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2): {r2:.2f}")

### 🔧 Try It Yourself - Part 7

1. Is this R-squared fit good? 
2. How could we improve the fit?

🔧 Add comment here:

## Part 8: Feature Selection for Accuracy Improvement

Not all features equally influence `YearsAtCompany`. By identifying and using only the most important predictors, we can:
- Simplify the model
- Potentially improve performance or interpretability
- Reduce overfitting

We’ll use the linear regression model’s coefficients to rank feature importance.


In [None]:
# Get top 10 features based on absolute coefficient magnitude
top_features = coefficients_linear.abs().sort_values(ascending=False).head(10)

# Print the top features and their weights
top_features

### 🔧 Try It Yourself - Part 8

1. Create a new training and test set using only the 10 most important features.
2. Retrain the linear regression model on this reduced dataset.
3. Evaluate performance of the new version


In [None]:
# 🔧 Step 1: Identify the top 10 most important features using absolute value of regression coefficients

# 🔧 Step 2: Create new versions of X_train and X_test with only those top features

# 🔧 Step 3: Initialize and fit a new Regression model on the reduced feature set

# 🔧 Step 4: Use the new model to predict on the test set

# 🔧 Step 5: Create a chart to visualize the new model

# 🔧 Step 6: Evaluate the reduced model using R-squared

## 🔧 Part 9: Reflection

1. How did the reduced-feature model compare to the full model?
2. Would this version be easier to explain or use in an HR meeting?

Write a few sentences on each of the questions above. No coding is required here. 

🔧 Add comment here:

## Export Your Notebook to Submit in Canvas
- Use the instructions from Lab 1

In [None]:
!jupyter nbconvert --to html "lab_11_LastnameFirstname.ipynb"