# Phase 1 Summary - Absenteeism Prediction Project

## Team 62 - MLOps Project Deliverable

In [None]:
import pandas as pd
import mlflow
import warnings
warnings.filterwarnings('ignore')

# Load experiment results
mlflow.set_tracking_uri("file:///C:/Users/Alexis/mlops-absenteeism-project/mlruns")
runs = mlflow.search_runs(experiment_names=["Default"])  # Runs were logged to the Default experiment

# Create summary table
if len(runs) > 0:
    summary = runs[['tags.mlflow.runName', 'metrics.test_mae', 'metrics.test_rmse', 
                    'metrics.test_r2', 'status', 'end_time']].copy()
    summary.columns = ['Model', 'Test MAE', 'Test RMSE', 'Test R²', 'Status', 'Time']
    summary = summary.round(3)
    
    print("="*70)
    print("PHASE 1 - MODEL PERFORMANCE SUMMARY")
    print("="*70)
    display(summary)
else:
    print("No runs found. Make sure you've run the training notebook first.")

## Phase 1 - Key Findings

### 1. Data Insights
- **Dataset Size**: 740 records from a Brazilian courier company (2007-2010)
- **Target Variable**: Highly skewed (many short absences, few extended ones)
- **Data Quality**: No missing values, clean dataset after outlier removal
- **Key Challenge**: High variability in absenteeism patterns

### 2. Model Performance
- **Linear Regression**: R² ≈ 0.09 (explains only 9% of variance)
- **Random Forest**: R² ≈ -0.38 (a negative test R² means the model predicts worse than a constant mean baseline, which is consistent with overfitting)
- **Finding**: Current features struggle to predict absenteeism hours
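
To make the negative R² concrete: R² compares a model's squared error with that of always predicting the mean of the test targets, so it falls below zero whenever the model does worse than that constant baseline. The sketch below uses invented numbers (not project data) and a hypothetical helper, `r2_score_manual`.

```python
def r2_score_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, measured against the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only, not taken from the project dataset
y_true = [2, 4, 8, 8, 3]
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)
poor_model = [8, 2, 3, 1, 9]

r2_baseline = r2_score_manual(y_true, mean_baseline)  # zero by construction
r2_poor = r2_score_manual(y_true, poor_model)         # negative: worse than the mean
```

Read this way, the Random Forest result says the fitted model would have been beaten on the test set by simply predicting the average number of absence hours.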

### 3. Important Features
Based on the dataset:
1. Reason for absence
2. Service time
3. Distance from residence to work
4. Age
5. Disciplinary failure

### 4. Challenges Identified
- Absenteeism appears highly random/unpredictable with current features
- Possible need for:
  - External factors (weather, holidays, company events)
  - Temporal features (trends over time)
  - Employee history features
- Skewed target distribution (many short absences, few long ones) hurts regression performance
- Classification approach may be more appropriate than regression
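
One common way to soften a right-skewed, non-negative target before regression is a `log1p` transform, inverted with `expm1` at prediction time. This is a sketch of that idea, not something Phase 1 implemented; the sample hours and helper names are invented for illustration.

```python
import math

def to_log_scale(hours):
    """Compress large values; log1p keeps zero-hour absences at 0."""
    return [math.log1p(h) for h in hours]

def from_log_scale(log_hours):
    """Invert the transform to get values back in hours."""
    return [math.expm1(v) for v in log_hours]

# Invented sample: many short absences, one extended one
hours = [0, 1, 2, 3, 8, 8, 16, 120]
log_hours = to_log_scale(hours)
restored = from_log_scale(log_hours)

# The gap between the longest absence and an 8-hour one shrinks on the log scale
raw_ratio = max(hours) / 8
log_ratio = max(log_hours) / math.log1p(8)
```

A model trained on `log_hours` would have its predictions passed through `from_log_scale` before computing MAE or RMSE in hours.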

## Data Versioning

### DVC Implementation
- Raw data tracked: `work_absenteeism_original.csv`
- Processed data tracked: `absenteeism_cleaned.csv`
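- Data versions are referenced by `.dvc` pointer files committed to Git; the data itself lives in the DVC cache (and remote storage, if configured)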

### Experiment Tracking
- MLflow used for all model experiments
- Metrics tracked: MAE, RMSE, R²
- Models and artifacts stored in `mlruns/`
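
For reference, the three tracked metrics can be computed and logged roughly as follows. The metric keys mirror the `test_mae`, `test_rmse`, and `test_r2` names read back in the summary cell; the helper names `compute_metrics` and `log_run`, the run name, and the sample values are assumptions for illustration, not the project's actual training code.

```python
import math

def compute_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² under the metric names used in the summary table."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "test_mae": sum(abs(e) for e in errors) / n,
        "test_rmse": math.sqrt(ss_res / n),
        "test_r2": 1.0 - ss_res / ss_tot,
    }

def log_run(run_name, metrics):
    """Log one run to MLflow (imported lazily so the metric code runs without it)."""
    import mlflow
    with mlflow.start_run(run_name=run_name):
        mlflow.log_metrics(metrics)

# Invented values for illustration
metrics = compute_metrics([2, 4, 8, 8, 3], [3, 4, 6, 8, 4])
# log_run("example_run", metrics)  # uncomment where MLflow is installed and configured
```

Keeping the metric computation separate from the logging call makes it possible to check the numbers without a tracking server.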

## Team Roles & Responsibilities

### Data Engineer
- **Activities**: Data acquisition, cleaning pipeline, DVC setup
- **Tools Used**: Python, Pandas, DVC
- **Deliverables**: Clean dataset, versioned data in `data/processed/`

### Data Scientist  
- **Activities**: Exploratory Data Analysis, feature engineering, model development
- **Tools Used**: Scikit-learn, MLflow, Jupyter
- **Deliverables**: 
  - ML Canvas analysis
  - EDA notebooks
  - Trained models with performance metrics

### ML Engineer
- **Activities**: MLOps setup, experiment tracking, reproducibility
- **Tools Used**: MLflow, Git, DVC, Virtual environments
- **Deliverables**: 
  - Tracked experiments in MLflow
  - Reproducible training pipeline
  - Model versioning setup

### Software Engineer
- **Activities**: Code structure, notebook organization, Git workflow
- **Tools Used**: Python, Git, GitHub
- **Deliverables**: 
  - Organized codebase
  - Clean Git history
  - Documentation

### Site Reliability Engineer
- **Activities**: Environment setup, dependency management, reproducibility
- **Tools Used**: venv, pip, requirements.txt
- **Deliverables**: 
  - Reproducible Python environment
  - Requirements documentation
  - Setup instructions

## Next Steps for Phase 2

### Feature Engineering
1. Create temporal features (day of week, month effects)
2. Aggregate employee history metrics
3. Engineer interaction features
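
As one possible shape for the temporal features, month and day of week can be encoded cyclically so that December sits next to January. The sketch assumes integer inputs (month 1-12); that encoding and the helper names `cyclical_encode` and `circle_distance` are assumptions, not part of the Phase 1 pipeline.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic integer onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

december = cyclical_encode(12, 12)
january = cyclical_encode(1, 12)
june = cyclical_encode(6, 12)

# December ends up as close to January as any two adjacent months, and far from June
dec_to_jan = circle_distance(december, january)
dec_to_jun = circle_distance(december, june)
```

The same function would work for day of week with `period=7`; whether this helps over a plain integer or one-hot encoding is something Phase 2 experiments would need to check.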

### Model Improvements
1. Try gradient boosting models (XGBoost, LightGBM)
2. Implement hyperparameter tuning
3. Consider ensemble methods
4. Explore classification approach (absence categories)
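
The classification approach could start from a simple binning of absence hours into categories. The cut points and labels below are hypothetical placeholders chosen for illustration; real thresholds should come from the observed distribution and the business question.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical cut points in hours (right-closed bins): 0 | (0, 8] | (8, 40] | > 40
BIN_EDGES = [0, 8, 40]
BIN_LABELS = ["none", "short", "medium", "extended"]

def absence_category(hours):
    """Return the label of the bin that `hours` falls into."""
    return BIN_LABELS[bisect_left(BIN_EDGES, hours)]

# Invented sample, not project data
hours = [0, 1, 2, 3, 8, 8, 16, 120]
categories = [absence_category(h) for h in hours]
class_counts = Counter(categories)
```

Inspecting `class_counts` before training shows how unbalanced the resulting classes are, which would inform the choice of metric (for example, macro-averaged F1 rather than accuracy).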

### MLOps Enhancements
1. Automated training pipeline
2. Model registry and versioning
3. CI/CD for model deployment
4. Monitoring and alerting

In [None]:
# Project Statistics
import os

print("="*70)
print("PROJECT STATISTICS")
print("="*70)

# Count notebooks
notebooks = [f for f in os.listdir('../notebooks') if f.endswith('.ipynb')]
print(f"Notebooks created: {len(notebooks)}")

# Data files
print(f"\nData Versioning:")
print(f"  - Raw data tracked with DVC: ✓")
print(f"  - Processed data tracked with DVC: ✓")

# MLflow
print(f"\nMLflow Tracking:")
print(f"  - Runs tracked: {len(runs)}")
print(f"  - Models logged: ✓")

print(f"\n{'='*70}")
print("PHASE 1 COMPLETE ✓")
print("="*70)