### What this notebook is doing (in simple words)

This notebook builds a **productivity prediction model** for employees using their working hours, number of tasks, and absences.
In simple terms, here is what happens:

1. **Load and inspect the data**  
   We load the CSV file and check the columns, data types, missing values, and duplicates.  
   The key columns are: `login_time`, `logout_time`, `total_tasks_completed`, `weekly_absences`, and the target `productivity_score`.

2. **Train simple baseline models**  
   First, we try basic models (like SGDRegressor and Ridge) on the original features after scaling.  
   The metrics (MSE, RMSE, MAE, R²) show that these simple models do not explain much of the variance in productivity (R² is around 0.0–0.03).

3. **Create smarter features about how people work**  
   We manually engineer new features, such as:  
   - `daily_work_hours` = logout_time − login_time  
   - `tasks_per_hour` = total_tasks_completed / daily_work_hours  
   - `absenteeism_rate` = weekly_absences / 5  
   These features try to capture *how* someone works (efficiency and presence), not just raw counts.

4. **Build a clean ML pipeline with feature engineering + scaling + model**  
   We wrap the feature creation, scaling, and the regression model into a single `Pipeline`.  
   This makes the workflow easier to maintain and avoids mistakes (for example, forgetting to apply the same transforms to train and test).

5. **Evaluate and compare the final models**  
   We compare the errors (MSE, RMSE, MAE) and R² scores before and after feature engineering.  
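   R² stays close to zero in both cases (0.03 on the raw features, 0.00 with the engineered features), so the engineered features do not improve accuracy here. They do describe productivity in more interpretable terms, and the pipeline can be used in an API to score new employees in a consistent way.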


In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [20]:
df = pd.read_csv("../csv/data.csv")
df.sample(5)

index,employee_id,login_time,logout_time,total_tasks_completed,weekly_absences,productivity_score
21,22,8,18,73,0,78
93,94,8,20,32,4,87
118,119,8,18,105,4,94
221,222,8,18,113,1,63
127,128,9,18,119,2,88


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   employee_id            300 non-null    int64
 1   login_time             300 non-null    int64
 2   logout_time            300 non-null    int64
 3   total_tasks_completed  300 non-null    int64
 4   weekly_absences        300 non-null    int64
 5   productivity_score     300 non-null    int64
dtypes: int64(6)
memory usage: 14.2 KB


### Checking the raw data

In the first few cells, I:

- Load the dataset from the CSV file.
- Look at a random sample of rows to get a feel for the numbers.
- Use `df.info()`, `isna().sum()`, and `duplicated().sum()` to confirm there are no missing values or duplicate rows.

After that, I drop the `employee_id` column because it is just an identifier and does not help the model learn patterns about productivity.


In [22]:
df.isna().sum()

employee_id              0
login_time               0
logout_time              0
total_tasks_completed    0
weekly_absences          0
productivity_score       0
dtype: int64

In [23]:
df.duplicated().sum()

np.int64(0)

In [24]:
df = df.drop(columns=['employee_id'])

### First baseline models (before feature engineering)

Here I build a very simple setup:

- I split the data into **train** and **test** sets.
- I scale the numeric features using `StandardScaler`.
- I train two models:
  - `SGDRegressor` (a linear model trained with gradient descent).
  - `Ridge` (a regularized linear regression).

The evaluation metrics (MSE, RMSE, MAE, R²) show that these baseline models explain only a very small part of the variation in `productivity_score` (R² is about 0.03 for both). This tells me that feeding in the raw columns alone is not enough; I need better features that describe work behaviour.
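
To put an R² near 0 in context: a model that always predicts the mean of the targets it was fitted on scores exactly 0 on that same data, so the baselines here are barely ahead of that. Below is a minimal sketch using scikit-learn's `DummyRegressor`. The toy values are invented for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Toy data, invented for illustration only.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([78, 87, 94, 63, 88, 70, 55, 91, 66, 80], dtype=float)

# DummyRegressor(strategy="mean") ignores X and always predicts the mean of y.
mean_model = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
y_mean_pred = mean_model.predict(X_toy)

# R^2 = 1 - SS_res / SS_tot; predicting the mean makes SS_res equal SS_tot.
dummy_r2 = r2_score(y_toy, y_mean_pred)
print(f"R^2 of the mean predictor: {dummy_r2:.2f}")
```

Fitting the same dummy model on the real training split and scoring it on the test split would give a reference point for the 0.03 reported by the linear models.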


In [25]:


from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDRegressor, Ridge

scaler = StandardScaler()

X = df.drop(columns=['productivity_score'])
y = df['productivity_score']


In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [27]:
pipe = Pipeline(steps = [
    ('scaler', StandardScaler()),
    ('sgd_regressor', SGDRegressor(max_iter=1000, tol=1e-3))])

ridge_pipe = Pipeline(steps = [
    ('scaler', StandardScaler()),
    ('ridge_regressor', Ridge(alpha=1.0))])

In [28]:
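# The pipelines already contain a StandardScaler, so fit them on the unscaled
# split; passing X_train_scaled would only standardize the data a second time.
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

ridge_pipe.fit(X_train, y_train)
y_ridge_pred = ridge_pipe.predict(X_test)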

### Re-splitting and preparing for feature engineering

Before building more advanced models, I:

- Re-create `X` (inputs) and `y` (target), where `y` is `productivity_score`.
- Do a fresh `train_test_split` so that I clearly separate the data used to train and test the new feature-engineered models.

This keeps the workflow clean and makes it clear which data is used in the next steps.
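
Because the fresh split uses the same `test_size` and `random_state` on a dataset of the same length, it should select the same rows as the first split, which keeps the before/after comparison on the same test set. A small check on a toy frame (the data is invented; the column names only mirror the notebook's):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "total_tasks_completed": range(20),
    "weekly_absences": [i % 5 for i in range(20)],
    "productivity_score": [50 + i for i in range(20)],
})
X_toy = toy.drop(columns=["productivity_score"])
y_toy = toy["productivity_score"]

# Two independent calls with the same seed and test_size.
first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Compare the row labels that ended up in each train and test part.
same_train_rows = first[0].index.equals(second[0].index)
same_test_rows = first[1].index.equals(second[1].index)
print(same_train_rows, same_test_rows)
```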


In [29]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R^2 Score: {r2:.2f}")


print("\nRidge Regression Results:")

mse_ridge = mean_squared_error(y_test, y_ridge_pred)
rmse_ridge = np.sqrt(mse_ridge)
mae_ridge = mean_absolute_error(y_test, y_ridge_pred)
r2_ridge = r2_score(y_test, y_ridge_pred)
print(f"Mean Squared Error: {mse_ridge:.2f}")
print(f"Root Mean Squared Error: {rmse_ridge:.2f}")
print(f"Mean Absolute Error: {mae_ridge:.2f}")
print(f"R^2 Score: {r2_ridge:.2f}")

Mean Squared Error: 286.84
Root Mean Squared Error: 16.94
Mean Absolute Error: 14.19
R^2 Score: 0.03

Ridge Regression Results:
Mean Squared Error: 287.23
Root Mean Squared Error: 16.95
Mean Absolute Error: 14.20
R^2 Score: 0.03
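
The four metric calculations are repeated for every model in this notebook. One possible way to avoid the repetition is a small helper like the one below. `report_metrics` is a hypothetical name, not something the notebook defines, and the toy numbers are invented.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def report_metrics(y_true, y_pred, label=""):
    """Print and return MSE, RMSE, MAE and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    metrics = {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }
    if label:
        print(f"{label}:")
    for name, value in metrics.items():
        print(f"  {name}: {value:.2f}")
    return metrics


# Toy check with invented numbers: predicting the mean (2) for every row.
toy_metrics = report_metrics([1, 2, 3], [2, 2, 2], label="toy example")
```

In the notebook this could be called as `report_metrics(y_test, y_pred, "SGDRegressor")` and `report_metrics(y_test, y_ridge_pred, "Ridge")`.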


In [30]:
from sklearn.base import BaseEstimator, TransformerMixin

### Creating better features about productivity

In this step I define a custom transformer called `FeatureCreator`.
It builds new features that are more meaningful for productivity:

- **daily_work_hours** = logout_time − login_time (minimum of 1 hour to avoid division by zero).  
  This approximates how long an employee actually worked.
- **tasks_per_hour** = total_tasks_completed / daily_work_hours.  
  This is a simple measure of efficiency: more tasks per hour usually means higher productivity.
- **absenteeism_rate** = weekly_absences / 5.0.  
  This captures how often someone is absent during a standard 5‑day work week.

By returning these engineered features (plus the original task and absence counts), we give the model richer information that should help it understand who is productive and who is not.
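
As a worked example of the three formulas, the sketch below applies them to two rows taken from the sample output near the top of the notebook. It uses `Series.clip(lower=1)` for the 1-hour minimum, which should be equivalent to the `np.maximum(..., 1)` call in `FeatureCreator`.

```python
import pandas as pd

# Two rows copied from the df.sample(5) output (target column omitted).
rows = pd.DataFrame({
    "login_time": [8, 8],
    "logout_time": [18, 20],
    "total_tasks_completed": [73, 32],
    "weekly_absences": [0, 4],
})

# Hours worked, floored at 1 so the division below can never hit zero.
hours = (rows["logout_time"] - rows["login_time"]).clip(lower=1)

features = pd.DataFrame({
    "daily_work_hours": hours,                                 # 18 - 8 and 20 - 8
    "tasks_per_hour": rows["total_tasks_completed"] / hours,   # 73 / 10 and 32 / 12
    "absenteeism_rate": rows["weekly_absences"] / 5.0,         # 0 / 5 and 4 / 5
})
print(features.round(2))
```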


In [31]:
X = df.drop(columns=['productivity_score'])
y = df['productivity_score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Building pipelines with feature engineering + models

Now I combine everything into Scikit-Learn `Pipeline` objects:

- One pipeline starts with `FeatureCreator`, then scales the engineered features, then fits an `SGDRegressor`.
- Another pipeline scales the raw features and fits a regularized linear model (also using `SGDRegressor` with `alpha=0.1`).

Wrapping steps into a pipeline ensures that:

- The same transformations are always applied in the same order.
- It is easy to reuse the trained pipeline later (for example, saving it and using it inside an API).
- There is less risk of accidentally leaking information from the test set into the training process.
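
The leakage point can be checked directly: when a pipeline is fitted on the training split, the scaler inside it learns its statistics from those rows only. The sketch below uses synthetic data and `Ridge` as a stand-in model (both are assumptions chosen to keep the example deterministic).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = rng.normal(loc=75.0, scale=15.0, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

toy_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
])
toy_pipe.fit(X_tr, y_tr)

# The fitted scaler's column means should match the training rows only.
scaler_means = toy_pipe.named_steps["scaler"].mean_
uses_train_only = np.allclose(scaler_means, X_tr.mean(axis=0))
print(uses_train_only)

# predict() reuses those training statistics to scale the test rows.
toy_pred = toy_pipe.predict(X_te)
```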


In [32]:
X_train

index,login_time,logout_time,total_tasks_completed,weekly_absences
232,8,21,47,0
59,8,19,54,3
6,8,21,57,3
185,8,18,87,3
173,8,21,88,3
...,...,...,...,...
188,8,21,34,1
71,9,19,118,0
106,8,19,58,1
270,9,17,47,1


### Evaluating the final models and saving for deployment

In the final step I:

- Evaluate the models using **MSE**, **RMSE**, **MAE**, and **R²** to see how well they predict `productivity_score`.
- Note that the R² values are still low (0.03 on the raw features, 0.00 with the engineered features). The feature-engineered pipeline does use more meaningful inputs (work hours, tasks per hour, and absenteeism), which can be helpful for monitoring and ranking employees.
- Save the fitted pipeline with `joblib.dump` (here the scaled `SGDRegressor` with `alpha=0.1` on the raw features, which had the lower test error), so it can be loaded later by the FastAPI service to make predictions on new employees.

The key idea is that the model is now built in a structured, repeatable way and is ready to be used in a real application, even if the predictive power is modest.
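
A minimal sketch of the save-and-reload round trip, assuming a toy pipeline on synthetic data and a hypothetical file name (`toy_model.pkl`). `Ridge` is swapped in for `SGDRegressor` only to keep the example deterministic.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
y_toy = rng.normal(size=50)

toy_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
]).fit(X_toy, y_toy)

# Save to a temporary directory, then load it back as a service would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name
    joblib.dump(toy_pipe, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
round_trip_ok = np.allclose(toy_pipe.predict(X_toy), restored.predict(X_toy))
print(round_trip_ok)
```

If a saved pipeline contains a custom class such as `FeatureCreator`, the process that loads it needs that class to be importable under the same module path, because pickle stores a reference to the class rather than its code.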


In [33]:
class FeatureCreator(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self 
    
    def transform(self, X):
        X_transformed = X.copy()

        X_transformed['daily_work_hours'] = X_transformed['logout_time'] - X_transformed['login_time']
        X_transformed['daily_work_hours'] = np.maximum(X_transformed['daily_work_hours'], 1) 
        
        X_transformed['tasks_per_hour'] = X_transformed['total_tasks_completed'] / X_transformed['daily_work_hours']
        
        X_transformed['absenteeism_rate'] = X_transformed['weekly_absences'] / 5.0 
        
        return X_transformed[['daily_work_hours', 'tasks_per_hour', 'absenteeism_rate', 'total_tasks_completed', 'weekly_absences']]

In [34]:
pipe = Pipeline(steps = [
    ("feature_eng", FeatureCreator()),
    ('scaler', StandardScaler()),
    ('sgd_regressor', SGDRegressor(max_iter=1000, tol=1e-3))])

# regularized linear model: SGDRegressor with a stronger L2 penalty (alpha=0.1) on the scaled raw features
ridge_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('regressor', SGDRegressor(alpha=0.1, max_iter=1000, tol=1e-3))           
])


ridge_mopel = ridge_pipeline.fit(X_train, y_train)
y_ridge_final_pred = ridge_mopel.predict(X_test)


sgd_model = pipe.fit(X_train, y_train)
y_sgd_final_pred = sgd_model.predict(X_test)

In [35]:
mse = mean_squared_error(y_test, y_ridge_final_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_ridge_final_pred)
r2 = r2_score(y_test, y_ridge_final_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R^2 Score: {r2:.2f}")


print("SGDRegressor with Feature Engineering and Scaling")
mse = mean_squared_error(y_test, y_sgd_final_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_sgd_final_pred)
r2 = r2_score(y_test, y_sgd_final_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R^2 Score: {r2:.2f}")

Mean Squared Error: 287.97
Root Mean Squared Error: 16.97
Mean Absolute Error: 14.25
R^2 Score: 0.03
SGDRegressor with Feature Engineering and Scaling
Mean Squared Error: 296.79
Root Mean Squared Error: 17.23
Mean Absolute Error: 14.68
R^2 Score: 0.00


In [37]:
import joblib

joblib.dump(ridge_mopel, "_model.pkl")

['_model.pkl']