Author: Roche - Group 1
Date: March 16, 2026
Project Version: 0.2.0
- Overview
- Project Structure
- Architecture
- Data Pipeline
- Machine Learning Pipeline
- Deployment
- Installation
- Usage
- Contributing
- License
This project implements an end-to-end MLOps pipeline for tracking experiments at risk of delay in a laboratory setting. The system predicts operational risk scores for experiments based on workflow logs, instrument telemetry, reagent data, and queue management information.
- Data Ingestion: Automated collection from lab instruments and workflow systems
- Data Processing: Multi-layer data lake architecture (Bronze, Silver, Gold) with event-driven ETL
- Risk Prediction: LightGBM final model for experiment delay prediction (holdout ROC-AUC 0.977, PR-AUC 0.94)
- Real-time Monitoring: Event-driven architecture via AWS EventBridge for continuous risk assessment
- Automated Alerts: Email notifications via SendGrid for high-risk experiments
- Interactive Dashboard: Streamlit application with embedded Tableau for experiment monitoring
- Model Governance: MLflow tracking for model versioning and artifact management
- Drift Detection: Continuous model performance monitoring with automated retraining via SageMaker
- Model Training at Scale: Dockerized SageMaker training infrastructure for automated model updates
The system helps laboratory managers:
- Proactively identify experiments likely to experience delays
- Optimize resource allocation and scheduling
- Reduce operational costs through predictive maintenance
- Improve overall lab efficiency and throughput
├── 01-Documents/ # Project documentation
├── 02-Architecture/ # Architecture diagrams
│ └── Roche_RFP_Architecture.drawio
├── 03-Data/ # Data generation and processing
│ ├── 01_generate_workflow_logs.py
│ ├── 02_generate_instrument_telemetry.py
│ ├── 03_generate_reagent_logs.py
│ ├── 04_generate_queue_logs.py
│ ├── 05_dataset_generator.py
│ ├── config.py
│ ├── data_review.ipynb
│ ├── Raw/ # Raw generated data
│ ├── Processed/ # Processed datasets with feature importance
│ └── Documents/
├── 04-EDA/ # Exploratory Data Analysis notebooks
│ ├── EDA_processed_file.ipynb # Analysis of processed features
│ ├── EDA_workflow.ipynb
│ ├── EDA_telemetry.ipynb
│ ├── EDA_reagent.ipynb
│ ├── EDA_queue.ipynb
│ ├── helpers.py
│ └── figures/ # EDA visualizations and summaries
├── 05-Experiment/ # Machine learning experiments
│ ├── ML Final Model.ipynb # Finalized LightGBM model
│ ├── ML Model I.ipynb
│ ├── ML Model II.ipynb
│ ├── ML Model III.ipynb
│ ├── helpers.py
│ └── ml_files/
├── 06-Deployment/ # Production deployment code
│ ├── Docs/
│ ├── Experiment_interface/ # Streamlit dashboard application
│ │ ├── app.py
│ │ ├── Dockerfile
│ │ └── requirements.txt
│ ├── Inference_API/ # Flask ML API for real-time inference
│ │ ├── app.py
│ │ ├── detect_drift.py
│ │ ├── inference.py
│ │ ├── retraining.py
│ │ ├── send_retraining_alert.py
│ │ ├── Dockerfile
│ │ └── requirements.txt
│ ├── Lambda_functions/ # Serverless processing functions
│ │ ├── template.yaml
│ │ ├── consolidate_dataset/ # Data consolidation ETL
│ │ ├── dashboard_data/ # Dashboard data synchronization
│ │ ├── generate_datasets/ # Dataset generation triggers
│ │ ├── run_inference/ # Inference orchestration
│ │ └── send_email_alert/ # Risk alert notifications
│ └── Sagemaker_Training_Image/ # Docker image for SageMaker training
│ ├── Dockerfile
│ ├── requirements.txt
│ └── retraining.py
├── 07-Deliverables/ # Final project deliverables
│ ├── Roche_G1_Dashboard.twb # Dashboard embedded in Streamlit
│ ├── Roche_G1_ML_Final_Model.ipynb # Final trained model and results
│ ├── Roche_G1_Poster.pdf
│ └── Roche_G1_ppt.pdf
├── 00-Backups/ # Previous versions and backups
├── pyproject.toml # Project configuration
├── requirements.txt # Dependencies
└── README.md # This file
The system follows an event-driven lakehouse architecture deployed on AWS, utilizing serverless components and containerized services for scalability and cost-efficiency.
flowchart TD
Sources["🔬 Lab Instruments & Workflow Systems"]
subgraph AWS[" "]
subgraph DataLake["Data Lake Architecture"]
S3Bronze["S3 Bronze Layer<br/>Raw Data"]
S3Silver["S3 Silver Layer<br/>Cleaned Data"]
S3Gold["S3 Gold Layer<br/>Features"]
end
subgraph Events["Event Orchestration"]
EventBridge["AWS EventBridge<br/>Rules Engine"]
end
subgraph Serverless["Serverless Processing"]
LambdaConsolidate["Lambda:<br/>Consolidate"]
LambdaInference["Lambda:<br/>Inference"]
LambdaEmail["Lambda:<br/>Email Alert"]
LambdaDashboard["Lambda:<br/>Dashboard"]
end
subgraph MLOps["ML Operations"]
EC2API["Flask API<br/>EC2 Docker"]
SageMaker["SageMaker<br/>Training"]
MLflow["MLflow<br/>Tracking"]
Models["S3 Models &<br/>Artifacts"]
end
subgraph Presentation["Visualization & Alerts"]
Dashboard["Streamlit<br/>Dashboard"]
Tableau["Tableau<br/>Embedded"]
SendGrid["SendGrid<br/>Alerts"]
end
end
CI_CD["GitHub Actions<br/>CI/CD"]
%% Data Flow
Sources -->|Raw Data| S3Bronze
S3Bronze -->|Trigger| EventBridge
EventBridge -->|Process| LambdaConsolidate
LambdaConsolidate -->|Store| S3Silver
S3Silver -->|Trigger| EventBridge
EventBridge -->|Execute| LambdaInference
LambdaInference -->|Invoke| EC2API
EC2API -->|Predict| S3Gold
EC2API -->|Detect Drift| MLflow
MLflow -->|Trigger| SageMaker
SageMaker -->|Update| Models
EC2API -->|Load| Models
S3Gold -->|Monitor| EventBridge
EventBridge -->|Alert| LambdaEmail
LambdaEmail -->|Send| SendGrid
EventBridge -->|Sync| LambdaDashboard
LambdaDashboard -->|Update| Dashboard
Dashboard -->|Display| Tableau
CI_CD -->|Deploy| EC2API
CI_CD -->|Deploy| LambdaConsolidate
CI_CD -->|Deploy| LambdaInference
CI_CD -->|Deploy| LambdaEmail
CI_CD -->|Deploy| LambdaDashboard
CI_CD -->|Build| SageMaker
Event-driven data pipeline with serverless processing, ML inference, and real-time monitoring dashboards.
Detailed component interactions showing the complete MLOps workflow with training infrastructure.
flowchart TB
Sources["🔬 Lab Instruments &<br/>Workflow Systems"]
subgraph AWS["AWS Event-Driven Lakehouse MLOps Platform"]
subgraph DataLake["📊 Data Lake"]
Bronze["S3 Bronze<br/>Raw Data<br/>Ingestion"]
Silver["S3 Silver<br/>Cleaned &<br/>Transformed"]
Gold["S3 Gold<br/>Feature Store<br/>Ready for ML"]
DashboardData["S3 Gold/<br/>dashboard_data"]
end
subgraph EventOrch["⚙️ Event Orchestration"]
EventBridge["AWS EventBridge<br/>Rule Engine"]
end
subgraph ETL["🔄 ETL Lambdas"]
ConsolidateLambda["Lambda:<br/>Consolidate Dataset<br/>Bronze→Silver"]
GenDataLambda["Lambda:<br/>Generate Datasets"]
end
subgraph Inference["🎯 Inference Layer"]
InferenceLambda["Lambda:<br/>Run Inference<br/>Trigger"]
FlaskAPI["Flask API<br/>on EC2 Docker<br/>REST Endpoints"]
Preprocessor["Model<br/>Preprocessor"]
Model["LightGBM<br/>Final Model<br/>ROC-AUC 0.977 (holdout)"]
end
subgraph Training["🚀 Training Infrastructure"]
DriftDetect["Drift Detection<br/>Module"]
SageMaker["SageMaker<br/>Training Job<br/>Docker Container"]
TrainScript["retraining.py<br/>Training Logic"]
end
subgraph MLGov["📋 ML Governance"]
MLflow["MLflow<br/>Tracking Server"]
RDS[("RDS SQL<br/>Tracking DB")]
MLArtifacts["S3 ml/<br/>mlflow_artifacts"]
Models["S3 ml/<br/>models &<br/>preprocessors"]
end
subgraph Notifications["📧 Alerts"]
EmailLambda["Lambda:<br/>Send Risk Email"]
SendGrid["SendGrid<br/>Email Service"]
end
subgraph Dashboard["📈 Visualization"]
DashboardLambda["Lambda:<br/>Dashboard Sync"]
ElasticBeanstalk["Elastic Beanstalk<br/>Host"]
Streamlit["Streamlit<br/>Dashboard"]
Tableau["Embedded Tableau<br/>Analytics"]
end
subgraph CICD["🔧 DevOps"]
GitHub["GitHub Actions<br/>CI/CD"]
ECR["ECR<br/>Repositories"]
end
end
%% Data Ingestion Flow
Sources -->|Raw Data| Bronze
Bronze -->|Trigger Event| EventBridge
EventBridge -->|Process| ConsolidateLambda
ConsolidateLambda -->|Store| Silver
%% Inference Flow
Silver -->|Trigger Event| EventBridge
EventBridge -->|Invoke| InferenceLambda
InferenceLambda -->|Call API| FlaskAPI
Models -->|Load| Preprocessor
Models -->|Load| Model
Preprocessor -->|Transform| FlaskAPI
FlaskAPI -->|Predict| Model
FlaskAPI -->|Store Predictions| Gold
%% Drift & Retraining
FlaskAPI -->|Monitor| DriftDetect
DriftDetect -->|Detected| SageMaker
SageMaker -->|Execute| TrainScript
TrainScript -->|Log Metrics| MLflow
MLflow -->|Track| RDS
MLflow -->|Store| MLArtifacts
TrainScript -->|Save| Models
DriftDetect -->|Alert if Drift| SendGrid
%% Alert Flow
Gold -->|Trigger Event| EventBridge
EventBridge -->|High Risk| EmailLambda
EmailLambda -->|Send Alert| SendGrid
%% Dashboard Flow
Gold -->|Trigger Event| EventBridge
EventBridge -->|Sync| DashboardLambda
DashboardLambda -->|Update| DashboardData
DashboardData -->|Load| Streamlit
Streamlit -->|Display| Tableau
ElasticBeanstalk -->|Host| Streamlit
%% CI/CD
GitHub -->|Build & Push| ECR
GitHub -->|Deploy| ConsolidateLambda
GitHub -->|Deploy| InferenceLambda
GitHub -->|Deploy| EmailLambda
GitHub -->|Deploy| DashboardLambda
GitHub -->|Deploy| FlaskAPI
GitHub -->|Deploy| ElasticBeanstalk
ECR -->|Image| SageMaker
- Bronze Layer: Raw data ingestion from lab instruments and workflow systems
- Silver Layer: Cleaned, validated, and transformed data after quality checks
- Gold Layer: Aggregated features and curated datasets ready for ML and analytics
- AWS EventBridge: Decouples services and triggers workflows based on S3 events and custom rules
- Consolidate Dataset: ETL pipeline aggregating raw data from Bronze to Silver
- Generate Datasets: Triggers dataset generation and feature engineering on schedule
- Run Inference: Orchestrates model predictions on new data, triggers Flask API
- Send Risk Email: Sends high-risk experiment alerts via SendGrid
- Dashboard Sync: Synchronizes Gold layer data to dashboard updates
- Flask API (EC2 Docker): RESTful API for real-time risk predictions
- Final Model (LightGBM): Selected production model (holdout ROC-AUC 0.977, PR-AUC 0.94)
- Model Preprocessor: Standardized feature engineering and transformation
- SageMaker Training: Containerized training environment for model retraining
- Training Script (retraining.py): Orchestrates model retraining with latest data
- Drift Detection: Monitors data and model performance drift
- MLflow Tracking Server: Experiment tracking and artifact management
- RDS SQL Server: Database for MLflow tracking metadata
- S3 Artifact Storage: Stores models, preprocessors, and training artifacts
- SendGrid Email Service: Delivers alerts for high-risk experiments and drift notifications
- Streamlit Dashboard: Interactive web application for monitoring and analytics
- Embedded Tableau: Advanced analytics and business intelligence visualizations
- Elastic Beanstalk: Managed hosting for dashboard applications
- GitHub Actions: Automated testing, building, and deployment workflows
- ECR Repositories: Docker image registry for containerized services
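As a rough illustration of how an EventBridge rule on S3 events can hand work to the Lambdas listed above, the sketch below parses an S3 "Object Created" event and decides which step should run next. The prefix names (`bronze/`, `silver/`, `gold/`) and action labels are assumptions for illustration; the actual bucket layout and rule definitions live in `template.yaml` and are not shown in this README.

```python
# Hypothetical key prefixes mapped to the next pipeline step (assumed layout).
LAYER_ACTIONS = {
    "bronze/": "consolidate_dataset",
    "silver/": "run_inference",
    "gold/": "send_email_alert",
}


def route_s3_event(event: dict) -> dict:
    """Read bucket/key from an EventBridge S3 event and pick the next step."""
    if event.get("source") != "aws.s3" or event.get("detail-type") != "Object Created":
        return {"action": "ignore", "reason": "not an S3 Object Created event"}

    detail = event.get("detail", {})
    bucket = detail.get("bucket", {}).get("name")
    key = detail.get("object", {}).get("key", "")

    for prefix, action in LAYER_ACTIONS.items():
        if key.startswith(prefix):
            return {"action": action, "bucket": bucket, "key": key}
    return {"action": "ignore", "reason": f"unrecognized prefix for key {key!r}"}


def lambda_handler(event, context):
    """Lambda entry point: log and return the routing decision."""
    decision = route_s3_event(event)
    print(decision)
    return decision
```

In the deployed system this routing is more likely expressed as separate EventBridge rules per prefix; the function form is only meant to make the flow easy to follow.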
The project utilizes GitHub Actions for continuous integration and deployment, with automated testing and deployment triggered on push to the master branch.
flowchart LR
subgraph Source["🔀 Source Control"]
GitHub["GitHub<br/>master"]
end
subgraph Routing["🔀 Path-Based Routing"]
PathFilter["Route on<br/>Changed Files"]
end
subgraph Workflows["⚙️ Workflows"]
DeployLambdas["Lambdas<br/>deploy_lambdas.yml"]
DeployMLAPI["ML API<br/>deploy_ml_api.yml"]
DeployWebsite["Dashboard<br/>deploy_website.yml"]
DeploySageMaker["SageMaker<br/>deploy_sagemaker_training.yml"]
end
subgraph AWS["☁️ AWS Deployment"]
Lambda["Lambda<br/>Functions"]
EC2["EC2<br/>Docker"]
Beanstalk["Elastic<br/>Beanstalk"]
ECRDeploy["ECR<br/>Repository"]
end
subgraph Monitoring["📊 Monitoring"]
Health["Health<br/>Checks"]
Logs["CloudWatch<br/>Logs"]
end
GitHub -->|Push| PathFilter
PathFilter -->|Lambda/**| DeployLambdas
PathFilter -->|API/**| DeployMLAPI
PathFilter -->|Interface/**| DeployWebsite
PathFilter -->|Training/**| DeploySageMaker
DeployLambdas -->|Update| Lambda
DeployMLAPI -->|Deploy| EC2
DeployWebsite -->|Deploy| Beanstalk
DeploySageMaker -->|Push| ECRDeploy
Lambda -->|Monitor| Health
EC2 -->|Monitor| Health
Beanstalk -->|Monitor| Health
Health -->|Log| Logs
All deployment workflows trigger on push to the master branch, with path-specific filters so that each runs only when relevant code changes:
- deploy_lambdas.yml - Deploys serverless Lambda functions
  - Trigger: Push to 06-Deployment/Lambda_functions/
  - Steps: Checkout → Configure AWS → Login ECR → Build Docker → Push to ECR → Deploy Lambdas
  - Functions: consolidate_dataset, run_inference, send_email_alert, dashboard_data
- deploy_ml_api.yml - Deploys Flask ML API to EC2
  - Trigger: Push to 06-Deployment/Inference_API/
  - Steps: Checkout → Configure AWS → Login ECR → Build Docker → Push to ECR → Update EC2 Container
  - Endpoint: http://endpoint:5000 (inference and retraining)
- deploy_website.yml - Deploys Streamlit dashboard to Elastic Beanstalk
  - Trigger: Push to 06-Deployment/Experiment_interface/
  - Steps: Checkout → Configure AWS → Install EB CLI → Deploy to Elastic Beanstalk
  - Interface: Interactive web dashboard for experiment monitoring
- deploy_sagemaker_training.yml - Builds and pushes SageMaker training image
  - Trigger: Push to 06-Deployment/Sagemaker_Training_Image/
  - Steps: Checkout → Configure AWS → Login ECR → Build Docker → Tag → Push to ECR
  - Usage: SageMaker uses this image for automated model retraining
- delete_artifacts.yml - Manual cleanup of build artifacts
  - Trigger: Manual workflow dispatch
  - Steps: Require confirmation → Delete all artifacts from GitHub Actions
- Automated on push: All deployments run automatically when code is pushed to master with relevant path changes
- Path-based filtering: Only components with code changes are rebuilt and deployed
- AWS credentials: All workflows use GitHub Secrets for AWS authentication
- Docker-based: Services are containerized for consistency across environments
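The path-based filtering described above is handled natively by GitHub Actions through `paths` filters under `on.push`. The Python sketch below only models that behavior, to make clear which workflows a given push would start; `workflows_for` is a hypothetical helper, while the workflow names and directories are taken from this README.

```python
# Workflow file -> directory whose changes should trigger it (from the list above).
WORKFLOW_PATHS = {
    "deploy_lambdas.yml": "06-Deployment/Lambda_functions/",
    "deploy_ml_api.yml": "06-Deployment/Inference_API/",
    "deploy_website.yml": "06-Deployment/Experiment_interface/",
    "deploy_sagemaker_training.yml": "06-Deployment/Sagemaker_Training_Image/",
}


def workflows_for(changed_files):
    """Return the workflows whose path filter matches at least one changed file."""
    return sorted(
        name
        for name, prefix in WORKFLOW_PATHS.items()
        if any(path.startswith(prefix) for path in changed_files)
    )


if __name__ == "__main__":
    push = ["06-Deployment/Inference_API/app.py", "README.md"]
    print(workflows_for(push))
```

A push that touches only `README.md` matches no prefix, so no deployment runs.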
- Workflow Logs: Experiment execution details, timing, and status
- Instrument Telemetry: Real-time instrument performance metrics
- Reagent Logs: Reagent usage and availability tracking
- Queue Logs: Laboratory queue management and wait times
- Ingestion: Raw data collected from lab systems into S3 Bronze layer
- Validation & Cleaning: Data quality checks and basic transformations
- Feature Engineering: Aggregation and feature creation for ML models
- Storage: Processed data stored in optimized formats (Parquet)
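The validation and cleaning step could look roughly like the sketch below, which moves workflow records from Bronze to Silver. The field names (`experiment_id`, `start_time`, `end_time`) and the validation rules are assumptions for illustration; the actual checks live in the `consolidate_dataset` Lambda.

```python
from datetime import datetime

# Assumed required fields for a workflow record.
REQUIRED_FIELDS = ("experiment_id", "start_time", "end_time")


def clean_record(raw):
    """Return a cleaned record, or None if the raw record fails validation."""
    if any(not raw.get(field) for field in REQUIRED_FIELDS):
        return None  # missing or empty required field
    try:
        start = datetime.fromisoformat(raw["start_time"])
        end = datetime.fromisoformat(raw["end_time"])
    except ValueError:
        return None  # timestamp is not valid ISO 8601
    if end < start:
        return None  # inconsistent timing
    return {
        "experiment_id": raw["experiment_id"].strip(),
        "start_time": start,
        "end_time": end,
        "duration_min": (end - start).total_seconds() / 60,
    }


def bronze_to_silver(raw_records):
    """Keep only the records that pass validation, in cleaned form."""
    cleaned = (clean_record(record) for record in raw_records)
    return [record for record in cleaned if record is not None]
```

Rejected records are silently dropped here to keep the sketch short; a production pipeline would more likely log or quarantine them.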
- Experiment metadata (type, priority, scientist, instrument)
- Temporal features (start/end times, duration, delays)
- Resource utilization metrics
- Risk scores and predictions
Predict the operational risk score for experiments, indicating likelihood of delay or failure.
- Univariate and multivariate analysis
- Correlation analysis and feature importance
- Time-series analysis of telemetry data
- Queue congestion pattern identification
- LightGBM for primary risk prediction (final selection)
- XGBoost as an alternative gradient-boosting model
- Ensemble methods for improved accuracy
From a business and operational standpoint, we selected LightGBM as the production model because it best satisfies Roche's objective of minimizing operational cost while remaining deployable and maintainable:
- Cost-aware decisioning: LightGBM yielded the lowest expected operational misclassification cost at the optimized decision threshold, which directly translates to fewer missed risky experiments and lower overall operational losses.
- High recall for risk detection: The model delivers strong recall on the risky class (reducing missed alerts), which aligns with the business priority of proactively catching at-risk experiments.
- Stable, well-calibrated probabilities: Acceptable calibration and stable probability estimates make thresholds reliable for operational workflows and escalation playbooks.
- Production efficiency & lower infra cost: Faster training and inference with lower memory footprint reduces compute and hosting costs (important for SageMaker jobs, EC2 containers, and Lambda-based orchestration).
- Operational flexibility: The model adapts well to cost-sensitive threshold tuning, enabling product owners to change FP/FN trade-offs without retraining.
- Explainability & governance: LightGBM integrates well with SHAP and MLflow for explainability and audit trails, supporting regulatory and stakeholder review.
Together, these business-aligned attributes—cost minimization, high-risk recall, operational efficiency, and explainability—make LightGBM the preferred production choice for Roche's experiment risk pipeline.
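To make the cost-aware thresholding concrete, here is a minimal sketch of choosing a decision threshold by minimizing expected misclassification cost. The cost values and the grid of candidate thresholds are assumptions; the actual costs used for the final model are documented in the notebooks, not here.

```python
def expected_cost(y_true, y_prob, threshold, fn_cost, fp_cost):
    """Average cost per experiment when flagging every score >= threshold."""
    false_negatives = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    false_positives = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return (false_negatives * fn_cost + false_positives * fp_cost) / len(y_true)


def best_threshold(y_true, y_prob, fn_cost=10.0, fp_cost=1.0, steps=99):
    """Scan thresholds 0.01..0.99 and return the cheapest one (first on ties)."""
    candidates = [i / (steps + 1) for i in range(1, steps + 1)]
    return min(
        candidates,
        key=lambda t: expected_cost(y_true, y_prob, t, fn_cost, fp_cost),
    )


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1, 0, 1]
    scores = [0.10, 0.20, 0.45, 0.40, 0.70, 0.90, 0.60, 0.35]
    # A missed risky experiment is assumed to cost 10x a false alarm.
    print(best_threshold(labels, scores, fn_cost=10.0, fp_cost=1.0))
    # Reversing the cost ratio moves the threshold up without retraining.
    print(best_threshold(labels, scores, fn_cost=1.0, fp_cost=10.0))
```

Because only the threshold changes, product owners can revisit the FP/FN trade-off using the same trained model and the same predicted probabilities.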
- Temporal aggregations
- Categorical encoding
- Interaction features
- Time-series derived metrics
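The feature types listed above can be illustrated with a small dependency-free sketch: a per-instrument aggregation, one-hot encoding of a categorical column, and one interaction feature. All column names (`instrument`, `priority`, `queue_wait_min`, `instrument_load`) are hypothetical, and the real pipeline presumably uses pandas rather than plain dictionaries.

```python
from collections import defaultdict
from statistics import mean


def build_features(rows):
    """Derive aggregated, encoded, and interaction features for each experiment."""
    # Aggregation: mean queue wait per instrument across the batch.
    waits = defaultdict(list)
    for row in rows:
        waits[row["instrument"]].append(row["queue_wait_min"])
    mean_wait = {instrument: mean(values) for instrument, values in waits.items()}

    # Categorical encoding: one indicator column per observed priority level.
    priorities = sorted({row["priority"] for row in rows})

    features = []
    for row in rows:
        feature_row = {
            "experiment_id": row["experiment_id"],
            "instrument_mean_wait_min": mean_wait[row["instrument"]],
            # Interaction: long waits matter more on heavily loaded instruments.
            "wait_x_load": row["queue_wait_min"] * row["instrument_load"],
        }
        for level in priorities:
            feature_row[f"priority_{level}"] = int(row["priority"] == level)
        features.append(feature_row)
    return features
```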
| Metric | XGBoost | LightGBM |
|---|---|---|
| ROC-AUC | 0.981 | 0.980 |
| Precision | 0.811 | 0.795 |
| Recall | 0.908 | 0.919 |
| F1-Score | 0.856 | 0.852 |
- Drift Detection: Statistical tests for data drift
- Performance Monitoring: Continuous evaluation metrics
- Automated Retraining: Trigger-based model updates via SageMaker Training Image
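The README does not say which statistical tests `detect_drift.py` uses. As one common option, the sketch below computes the Population Stability Index (PSI) between a training-time feature sample and a recent sample. The bin count and the retraining cut-off of 0.25 are conventional rules of thumb, not values taken from this project.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor at eps so empty bins do not produce log(0).
        return [max(count / len(values), eps) for count in counts]

    expected_share = proportions(expected)
    actual_share = proportions(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_share, actual_share)
    )


def needs_retraining(expected, actual, cutoff=0.25):
    """Flag drift when PSI exceeds the (assumed) cut-off."""
    return psi(expected, actual) > cutoff
```

A check like this would run per feature; how the results are combined into a single retraining decision is a design choice left open here.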
The 07-Deliverables folder contains:
- ML Final Model.ipynb: Complete final model implementation with:
- LightGBM model trained and optimized on full dataset (final production model)
- Feature importance analysis and visualization
- Model evaluation metrics and performance analysis
- Predictions with interpreted risk scores
- Documentation of model decisions and trade-offs
- AWS Services: Lambda, EC2, S3, RDS, EventBridge, Elastic Beanstalk, SageMaker
- Containerization: Docker for API, dashboard, and training services
- CI/CD: GitHub Actions for automated deployment
- consolidate_dataset: ETL pipeline to consolidate and transform raw data from Bronze to Silver layer
- generate_datasets: Triggers dataset generation and feature engineering workflows
- run_inference: Orchestrates model inference on processed data, stores predictions in Gold layer
- send_email_alert: Sends alerts via SendGrid for high-risk experiments
- dashboard_data: Synchronizes processed data for visualization in dashboards
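For the alerting step, a minimal sketch of what `send_email_alert` might do is shown below: filter predictions above a risk threshold and build a request for SendGrid's v3 Mail Send endpoint. The prediction fields (`experiment_id`, `risk_score`), the 0.8 threshold, and the helper name are assumptions; the request is built with the standard library rather than the SendGrid SDK so the example stays self-contained.

```python
import json
import urllib.request

SENDGRID_URL = "https://api.sendgrid.com/v3/mail/send"


def build_alert_request(predictions, api_key, sender, recipient, threshold=0.8):
    """Return a ready-to-send Request for high-risk experiments, or None."""
    high_risk = [p for p in predictions if p["risk_score"] >= threshold]
    if not high_risk:
        return None  # nothing to report, so no email is sent

    lines = [f"{p['experiment_id']}: risk {p['risk_score']:.2f}" for p in high_risk]
    payload = {
        "personalizations": [{"to": [{"email": recipient}]}],
        "from": {"email": sender},
        "subject": f"{len(high_risk)} high-risk experiment(s) detected",
        "content": [{"type": "text/plain", "value": "\n".join(lines)}],
    }
    return urllib.request.Request(
        SENDGRID_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending is then a single `urllib.request.urlopen(request)` call, with the API key read from the Lambda's environment rather than hard-coded.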
RESTful API container deployed on EC2 providing:
- Real-time risk predictions via the /process endpoint
- Model retraining triggers via the /retraining endpoint with drift detection
- Integration with MLflow for model versioning and governance
- Automated drift detection and retraining orchestration
Streamlit-based web interface for:
- Real-time experiment risk visualization
- Historical trend analysis
- Model performance monitoring
- Alert management and investigation
SageMaker Training Image provides:
- Dockerized training environment for model retraining
- Integration with processed datasets from S3
- Artifact storage in MLflow and S3
- Automated hyperparameter optimization
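As a hedged sketch of how a retraining job might be started with the custom image, the function below assembles the arguments for boto3's `create_training_job` call. The job name prefix, instance type, volume size, and runtime limit are placeholder assumptions, and the image URI, role ARN, and S3 locations must come from your own AWS account.

```python
from datetime import datetime, timezone


def build_training_job_request(image_uri, role_arn, train_s3_uri, output_s3_uri, now=None):
    """Assemble keyword arguments for the SageMaker create_training_job API."""
    now = now or datetime.now(timezone.utc)
    return {
        # Job names must be unique, so a timestamp suffix is appended.
        "TrainingJobName": f"risk-retraining-{now:%Y%m%d-%H%M%S}",
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # image pushed to ECR by the CI workflow
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Placeholder sizing; adjust to the dataset and budget.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
```

With boto3 installed and credentials configured, the job would be submitted with `boto3.client("sagemaker").create_training_job(**request)`.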
- Serverless architecture for automatic scaling
- Event-driven processing for efficient resource utilization
- Multi-layer caching for improved performance
- Containerized services for easy horizontal scaling
- Python 3.12 or higher
- AWS CLI configured with appropriate credentials
- Docker and Docker Compose (for local deployment)
- Git
- Clone the repository:
git clone <repository-url>
cd "Capstone Project"
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configure environment variables (if needed):
cp .env.example .env
# Edit .env with your AWS and SendGrid configuration
Run the data generation scripts in sequence from the 03-Data directory:
cd 03-Data
python 01_generate_workflow_logs.py
python 02_generate_instrument_telemetry.py
python 03_generate_reagent_logs.py
python 04_generate_queue_logs.py
python 05_dataset_generator.py
This generates synthetic lab data simulating real-world experiment workflows for model training and testing.
For the complete final model implementation and results:
Open 07-Deliverables/ML Final Model.ipynb in Jupyter
Explore data patterns across different data sources:
cd 04-EDA
jupyter notebook EDA_workflow.ipynb # Workflow log analysis
jupyter notebook EDA_telemetry.ipynb # Instrument telemetry analysis
jupyter notebook EDA_reagent.ipynb # Reagent usage patterns
jupyter notebook EDA_queue.ipynb # Queue congestion analysis
jupyter notebook EDA_processed_file.ipynb # Final processed feature analysis
Review the model development process:
cd 05-Experiment
jupyter notebook "ML Final Model.ipynb" # Final production model
jupyter notebook "ML Model I.ipynb" # Initial model iteration
jupyter notebook "ML Model II.ipynb" # Improved model iteration
jupyter notebook "ML Model III.ipynb" # Alternative model approaches
Deploy the ML API for real-time predictions:
cd 06-Deployment/Inference_API
python app.py
The API will start on http://localhost:5000
Start the interactive monitoring dashboard:
cd 06-Deployment/Experiment_interface
streamlit run app.py
The dashboard will open in your default browser.
- POST /process: Submit experiment data and get a risk prediction
  - Input: Experiment features (workflow, telemetry, reagent, queue data)
  - Output: Risk score and prediction confidence
- POST /retraining: Triggers model retraining if drift is detected
  - Input: Current dataset for drift evaluation
  - Output: Retraining status and updated model metrics
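A minimal client for the /process endpoint might look like the sketch below. It assumes the API accepts a JSON body of experiment features and returns JSON; the exact request and response schema is defined in `Inference_API/app.py` and is not documented here, so the field names in the usage example are hypothetical.

```python
import json
import urllib.request


def build_process_request(features, base_url="http://localhost:5000"):
    """Build a POST request carrying experiment features as JSON."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/process",
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, base_url="http://localhost:5000", timeout=10):
    """Send the features to the running API and return the decoded JSON reply."""
    request = build_process_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical feature names; requires the API to be running locally.
    print(predict({"experiment_id": "EXP-1", "queue_wait_min": 42}))
```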
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add tests for new features and Lambda functions
- Update documentation for API changes and Lambda modifications
- Ensure all tests pass before submitting PR
- For SageMaker training changes, test locally with Docker first
- Update README with any new deployments or configurations
# Run unit tests
pytest tests/
# Test Docker containers locally
docker build -t experiment-api:latest 06-Deployment/Inference_API/
docker run -p 5000:5000 experiment-api:latest
- ✅ Finalized ML model with LightGBM (holdout ROC-AUC: 0.977, PR-AUC: 0.94, Recall: 0.911)
- ✅ Completed comprehensive EDA across all data sources (workflow, telemetry, reagent, queue)
- ✅ Implemented SageMaker Training Image for automated model retraining
- ✅ Enhanced Lambda functions for complete ETL pipeline
- ✅ Added detailed feature importance analysis and interpretation
- ✅ Project documentation and deliverables finalized
- Model retraining currently requires manual trigger via API endpoint (future: fully automated via SageMaker schedules)
- Data generation is synthetic (future: integrate with actual lab systems)
- Dashboard currently supports single Tableau instance (future: multi-tenant support)
This project is licensed under the MIT License - see the LICENSE file for details.
This capstone project demonstrates the application of MLOps principles to solve real-world laboratory management challenges through predictive analytics and automated monitoring systems.