This repository implements an end-to-end MLOps system for toxic-comment detection. It demonstrates a production-oriented MLOps stack with a strong focus on MLflow for experiment tracking and model lifecycle management, along with Dockerized inference services, GitHub Actions CI/CD automation, and AWS EC2 deployment.
- Summary
- Project overview
- Business value
- Architecture (MLflow-centric)
- Demo
- ML pipeline internals
- MLflow usage & tips
- Local run & development
- Docker / Containers
- CI/CD & Deployment to EC2
- API usage (FastAPI)
- Troubleshooting
- Future improvements
TheSentryAI is designed as a production-ready machine learning pipeline that includes:
- Data ingestion and preprocessing
- Fully reproducible training pipeline
- Experiment tracking and artifact management using MLflow
- Pipeline packaging (preprocessing + model) for inference
- Containerized inference with FastAPI and Docker
- Automated CI/CD deployment through GitHub Actions to AWS EC2
- Scheduled automated retraining
The repository emphasizes reproducibility, traceability, and seamless deployment workflows.
Workflow:
Raw toxic comments → preprocessing → TF-IDF → classifier → evaluation → MLflow logging → model artifact → production deployment
Main components:
- package/: code for ingestion, preprocessing, training, and inference
- data/: raw and processed datasets
- mlruns/: MLflow tracking directory (local)
- docker/: Dockerfiles for the API, MLflow server, and worker
- GitHub Actions workflows: CI, deployment, retraining
- EC2 machine hosting Docker containers for inference and MLflow
Primary technology stack: Python, scikit-learn, FastAPI, MLflow, Docker, GitHub Actions, AWS EC2.
This system is intended for platforms requiring scalable, automated toxic comment detection. It provides:
- Faster and more consistent moderation decisions
- Centralized and auditable experiment tracking
- Continuous model improvements through retraining
- Reproducible and versioned model artifacts
- Deployment workflows suitable for real production environments
(Scheduled retraining)
GitHub Actions (weekly)
│
▼
Training job ──────────────────▶ MLflow: logs parameters, metrics, artifacts
Stored in mlruns/ or remote backend
Optional model registry
│
▼
Model artifact
│
▼
CI/CD deploy → EC2 (Docker containers)
└─ FastAPI inference service loading the latest model
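One way the inference service could pick up "the latest model" without a restart is to reload the artifact whenever the file on disk changes. The sketch below is an assumption about how that might be done, not the repository's actual loader; `LatestModelCache` is a hypothetical name and a plain dict stands in for the real pipeline.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


class LatestModelCache:
    """Load a pickled model once and reload it only when the file changes."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self._model: Optional[Any] = None
        self._mtime_ns: Optional[int] = None

    def get(self) -> Any:
        # Compare the file's modification time with the one seen at last load.
        mtime_ns = self.path.stat().st_mtime_ns
        if self._model is None or mtime_ns != self._mtime_ns:
            with self.path.open("rb") as fh:
                self._model = pickle.load(fh)
            self._mtime_ns = mtime_ns
        return self._model


# Demo with a stand-in "model" (a dict) in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model-latest.pkl"
    model_path.write_bytes(pickle.dumps({"version": 1}))
    cache = LatestModelCache(model_path)
    first = cache.get()

    # Simulate a retraining job publishing a new artifact one second later.
    model_path.write_bytes(pickle.dumps({"version": 2}))
    stat = model_path.stat()
    os.utime(model_path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
    second = cache.get()
```

Checking the modification time on each request is cheap, but pickle files should only be loaded from locations the deployment itself controls.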
Key MLflow features used:
- Tracking server
- Artifact storage
- Run comparison interface
- Optional model registry support
- Text cleaning
- Stripping accents
- Stopword removal
- Word-level and character-level TF-IDF
- Multi-label engineering
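The cleaning steps above (text cleaning, accent stripping, stopword removal) can be sketched with the standard library alone. The function names and the tiny stopword set are illustrative assumptions, not the code in package/preprocessing; TF-IDF vectorization is left out of this sketch.

```python
import re
import unicodedata

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "you"}


def strip_accents(text: str) -> str:
    """Decompose characters and drop combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_comment(text: str) -> str:
    """Lowercase, strip accents, remove URLs and punctuation, drop stopwords."""
    text = strip_accents(text.lower())
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep letters, digits, whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean_comment("You are SO naïve!!! See https://example.com")
print(cleaned)  # so naive see
```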
Configurable models include:
- Logistic Regression
- RandomForest
- (Optional extensions: Linear SVM, XGBoost)
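For orientation, a word-level plus character-level TF-IDF front end feeding a multi-label Logistic Regression could be assembled with scikit-learn roughly as below. This is a sketch under assumptions: the n-gram range, `max_iter`, and the one-vs-rest wrapper are illustrative choices, the label names are taken from the API response example in this README, and the toy comments are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

LABELS = ["toxic", "obscene", "insult"]


def build_pipeline() -> Pipeline:
    """Combine word and character TF-IDF features, then fit one classifier per label."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", strip_accents="unicode",
                                 stop_words="english")),
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4),
                                 strip_accents="unicode")),
    ])
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    return Pipeline([("features", features), ("clf", clf)])


# Toy data, made up for illustration only.
texts = [
    "you are a wonderful person",
    "thanks for the helpful answer",
    "you are a stupid idiot",
    "shut up you damn fool",
    "what a lovely day",
    "damn this is garbage, idiot",
]
y = np.array([
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
])

pipeline = build_pipeline()
pipeline.fit(texts, y)
proba = pipeline.predict_proba(["you idiot"])  # one probability per label
```

Swapping in `RandomForestClassifier` only changes the estimator passed to the wrapper; the feature step stays the same.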
Metrics include:
- ROC-AUC per label
- PR-AUC
- Precision/Recall
- Confusion matrices
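Per-label ROC-AUC and PR-AUC can be computed column by column with scikit-learn. The helper below is a hypothetical sketch; it uses average precision as the PR-AUC estimate, which is an assumption about how the project defines that metric, and the arrays are synthetic.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def per_label_metrics(y_true, y_score, labels):
    """Return ROC-AUC and average precision for each label column."""
    results = {}
    for i, label in enumerate(labels):
        results[label] = {
            "roc_auc": float(roc_auc_score(y_true[:, i], y_score[:, i])),
            "pr_auc": float(average_precision_score(y_true[:, i], y_score[:, i])),
        }
    return results


# Synthetic example in which the scores separate both labels perfectly.
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
y_score = np.array([[0.9, 0.2], [0.1, 0.3], [0.8, 0.7], [0.3, 0.6]])
metrics = per_label_metrics(y_true, y_score, ["toxic", "insult"])
```

The resulting dictionary flattens naturally into metric names such as `roc_auc_toxic` for logging to MLflow.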
The entire preprocessing + model pipeline is packaged and saved to:
models/production/model-latest.pkl
This ensures consistent inference and avoids training-serving skew.
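When a running API reads this file, it helps to publish new versions atomically so a half-written artifact is never loaded. The following is a minimal sketch of that idea, not the repository's actual save routine; `publish_model` is a hypothetical helper and a dict stands in for the fitted pipeline.

```python
import os
import pickle
import tempfile
from pathlib import Path


def publish_model(model, target: Path) -> Path:
    """Pickle `model` next to `target`, then rename it into place atomically."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(model, fh)
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    finally:
        # Clean up the temporary file if the rename never happened.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
    return target


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "models" / "production" / "model-latest.pkl"
    publish_model({"stand_in": "pipeline"}, target)
    with target.open("rb") as fh:
        restored = pickle.load(fh)
    published_files = sorted(p.name for p in target.parent.iterdir())
```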
mlflow server \
--host 0.0.0.0 \
--port 5000 \
--backend-store-uri /home/ubuntu/TheSentryAI/mlruns \
--default-artifact-root /home/ubuntu/TheSentryAI/mlruns \
--serve-artifacts \
--allowed-hosts="*" \
--cors-allowed-origins="*"
Recommended for production:
- Use a database (SQLite or Postgres) for the backend store instead of the local mlruns/ directory
- Use S3 for artifact storage
Common MLflow flags to know:
- --backend-store-uri: run metadata
- --default-artifact-root: model artifacts
- --serve-artifacts: enables artifact hosting
poetry install --no-root
poetry shell
KAGGLE_USERNAME=...
KAGGLE_KEY=...
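The Kaggle client reads `KAGGLE_USERNAME` and `KAGGLE_KEY` from the environment, so a pre-flight check before ingestion can fail early with a clear message. This is a hypothetical helper, not part of the repository.

```python
import os
from typing import List, Mapping, Sequence

REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_env_vars(
    names: Sequence[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_env_vars(env={"KAGGLE_USERNAME": "someone"})
print(missing)  # ['KAGGLE_KEY']
```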
python package/ingestion/load_kaggle_dataset.py
python package/preprocessing/prepare_dataset.py
python package/training/train_model.py
uvicorn package.service.server:app --host 0.0.0.0 --port 8000 --reload
sudo docker build -t sentry-api -f docker/Dockerfile.api .
sudo docker build -t sentry-mlflow -f docker/Dockerfile.mlflow .
sudo docker build -t sentry-worker -f docker/Dockerfile.worker .
sudo docker run -d --name sentry-mlflow \
-p 5000:5000 \
-v /home/ubuntu/TheSentryAI/mlruns:/mlflow \
sentry-mlflow
sudo docker run -d --name sentry-api \
-p 8000:8000 \
--env-file /home/ubuntu/TheSentryAI/.env \
-v /home/ubuntu/TheSentryAI/models:/app/models \
-v /home/ubuntu/TheSentryAI/logs:/app/logs \
sentry-api
Workflows include:
- test.yml: static checks and tests
- deploy.yml: sync code, rebuild containers, restart services on EC2
- weekly_train.yml: scheduled retraining job
Deployment steps:
- Checkout code
- Copy project to EC2 via SCP
- SSH into EC2
- Install dependencies
- Rebuild Docker images
- Restart services
All secrets should be stored in GitHub Secrets.
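The deployment steps above can be sketched as the commands a deploy job would issue. The snippet only builds the command lists and executes nothing; the host, user, and key path are hypothetical, and it assumes dependencies are installed during the image build rather than as a separate step. The real workflow lives in deploy.yml and may differ.

```python
from typing import List

PROJECT_DIR = "/home/ubuntu/TheSentryAI"  # path used elsewhere in this README


def remote_script(project_dir: str = PROJECT_DIR) -> str:
    """Shell script for the EC2 side: rebuild the API image, restart the container."""
    steps = [
        "set -e",
        f"cd {project_dir}",
        "sudo docker build -t sentry-api -f docker/Dockerfile.api .",
        "sudo docker rm -f sentry-api || true",  # tolerate a missing container
        "sudo docker run -d --name sentry-api -p 8000:8000 "
        f"--env-file {project_dir}/.env "
        f"-v {project_dir}/models:/app/models "
        f"-v {project_dir}/logs:/app/logs sentry-api",
    ]
    return "; ".join(steps)


def deploy_commands(host: str, user: str, key_file: str) -> List[List[str]]:
    """Return argv lists for the copy and remote-restart steps (nothing is run)."""
    target = f"{user}@{host}"
    return [
        ["scp", "-i", key_file, "-r", ".", f"{target}:{PROJECT_DIR}"],
        ["ssh", "-i", key_file, target, remote_script()],
    ]


# Hypothetical host, user, and key path, for illustration only.
commands = deploy_commands("ec2.example.com", "ubuntu", "~/.ssh/deploy_key")
```

In a GitHub Actions job the host, user, and key would come from GitHub Secrets rather than literals.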
Request body:
{
  "comment_text": "This is a sample comment that might be toxic"
}
Example response:
{
  "default_probability": 0.1482,
  "predictions": {
    "toxic": 0,
    "obscene": 0,
    "insult": 0
  },
  "model_version": "model-v2025-12-05"
}
The schema is defined in package/service.
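A small client for this request and response shape might look like the sketch below, using only the standard library. The `/predict` route is a hypothetical placeholder (check package/service for the real path), port 8000 matches the uvicorn command in this README, and `flagged_labels` is an illustrative helper.

```python
import json
import urllib.request
from typing import Dict, List

API_URL = "http://localhost:8000/predict"  # hypothetical route


def build_request(comment: str, url: str = API_URL) -> urllib.request.Request:
    """Build the JSON POST request carrying one comment."""
    body = json.dumps({"comment_text": comment}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(comment: str, url: str = API_URL, timeout: float = 5.0) -> Dict:
    """Send the comment to a running service and decode the JSON response."""
    with urllib.request.urlopen(build_request(comment, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def flagged_labels(response: Dict) -> List[str]:
    """Return the labels the model predicted as positive."""
    return [label for label, value in response["predictions"].items() if value == 1]


request = build_request("This is a sample comment that might be toxic")

# The example response from this README, used here without calling the service.
sample_response = {
    "default_probability": 0.1482,
    "predictions": {"toxic": 0, "obscene": 0, "insult": 0},
    "model_version": "model-v2025-12-05",
}
flags = flagged_labels(sample_response)
```

Calling `classify(...)` requires the API container or the local uvicorn server to be running.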
Fix by configuring MLflow with:
--allowed-hosts="*"
--cors-allowed-origins="*"
Remove any flags not supported by the MLflow version installed.
Check process:
sudo lsof -i :5000
Stop container:
sudo docker stop sentry-mlflow
sudo docker rm sentry-mlflow
Set PYTHONPATH:
export PYTHONPATH="/home/ubuntu/TheSentryAI:$PYTHONPATH"
- Move to Postgres + S3 backend for MLflow
- Add monitoring and logging stack (Prometheus, Grafana)
- Implement data validation with Great Expectations
- Canary or blue-green deployment strategies
- Add explainability endpoints (SHAP)
- Upgrade EC2 deployment to ECS/Fargate for autoscaling
Important directories:
- package/ingestion: data loading
- package/preprocessing: dataset cleaning
- package/training: training loop
- package/service: FastAPI inference



