DeepActionPotential/TheSentryAI

TheSentryAI — End-to-End MLOps Pipeline (MLflow-focused)

This repository implements an end-to-end MLOps system for toxic-comment detection. It demonstrates a production-oriented MLOps stack with a strong focus on MLflow for experiment tracking and model lifecycle management, along with Dockerized inference services, GitHub Actions CI/CD automation, and AWS EC2 deployment.


Table of contents

  1. Summary
  2. Project overview
  3. Business value
  4. Architecture (MLflow-centric)
  5. Demo
  6. ML pipeline internals
  7. MLflow usage & practical tips
  8. Local run & development
  9. Docker / Containers
  10. CI/CD & deployment (GitHub Actions)
  11. API usage (FastAPI)
  12. Troubleshooting
  13. Future improvements

Summary

TheSentryAI is designed as a production-ready machine learning pipeline that includes:

  • Data ingestion and preprocessing
  • Fully reproducible training pipeline
  • Experiment tracking and artifact management using MLflow
  • Pipeline packaging (preprocessing + model) for inference
  • Containerized inference with FastAPI and Docker
  • Automated CI/CD deployment through GitHub Actions to AWS EC2
  • Scheduled automated retraining

The repository emphasizes reproducibility, traceability, and seamless deployment workflows.


Project overview

Workflow:

Raw toxic comments → preprocessing → TF-IDF → classifier → evaluation → MLflow logging → model artifact → production deployment

Main components:

  • package/ — code for ingestion, preprocessing, training, inference
  • data/ — raw and processed datasets
  • mlruns/ — MLflow tracking directory (local)
  • docker/ — Dockerfiles for API, MLflow server, worker
  • GitHub Actions workflows — CI, deployment, retraining
  • EC2 machine hosting Docker containers for inference and MLflow

Primary technology stack: Python, scikit-learn, FastAPI, MLflow, Docker, GitHub Actions, AWS EC2.


Business value

This system is intended for platforms requiring scalable, automated toxic comment detection. It provides:

  • Faster and more consistent moderation decisions
  • Centralized and auditable experiment tracking
  • Continuous model improvements through retraining
  • Reproducible and versioned model artifacts
  • Deployment workflows suitable for real production environments

Architecture (MLflow-centric)

 (Scheduled retraining)
      GitHub Actions (weekly)
             │
             ▼
       Training job
     ──────────────────▶ MLflow: logs parameters, metrics, artifacts
                         Stored in mlruns/ or remote backend
                         Optional model registry

             │
             ▼
       Model artifact

             │
             ▼
    CI/CD deploy → EC2 (Docker containers)
                    └─ FastAPI inference service loading the latest model

Key MLflow features used:

  • Tracking server
  • Artifact storage
  • Run comparison interface
  • Optional model registry support
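
As a rough illustration of how a training job might use the tracking features listed above, the sketch below logs parameters, per-label metrics, and the packaged model file to one MLflow run. The function names, the experiment name `thesentryai`, and the default tracking URI are assumptions made for this example, not code taken from the repository.

```python
from typing import Dict, Mapping


def flatten_label_metrics(metric_name: str, per_label: Mapping[str, float]) -> Dict[str, float]:
    """Turn {"toxic": 0.97} into {"roc_auc_toxic": 0.97} plus a macro average."""
    flat = {f"{metric_name}_{label}": float(value) for label, value in per_label.items()}
    if per_label:
        flat[f"{metric_name}_macro"] = sum(per_label.values()) / len(per_label)
    return flat


def log_training_run(params, per_label_auc, model_path,
                     tracking_uri="http://localhost:5000"):
    """Log one training run; needs the mlflow package and a reachable server."""
    import mlflow  # imported lazily so the helper above works without mlflow

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("thesentryai")  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metrics(flatten_label_metrics("roc_auc", per_label_auc))
        mlflow.log_artifact(model_path)  # e.g. the pickled pipeline file
```

Logging one metric per label keeps the run comparison interface useful for a multi-label problem, since a regression on a single label stays visible instead of being averaged away.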

Demo

[Demo screenshots]

Video Demo

[Demo video]

ML pipeline internals

Preprocessing

  • Text cleaning
  • Stripping accents
  • Stopword removal
  • Word-level and character-level TF-IDF
  • Multi-label engineering
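
The first three steps can be sketched with the standard library alone. This is a minimal illustration, not the project's actual cleaning code: the stopword set is a tiny made-up sample, and the exact rules (URL removal, which characters to keep) are assumptions.

```python
import re
import unicodedata

# Tiny illustrative stopword set; an assumption, not the project's actual list.
STOPWORDS = frozenset({"a", "an", "the", "is", "are", "to", "of", "and"})


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_comment(text: str) -> str:
    """Lowercase, strip accents, drop URLs and punctuation, remove stopwords."""
    text = strip_accents(text.lower())
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep letters, digits, whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)
```

Whatever the real rules are, the same function has to run at training and at inference time, which is why the Packaging step below bundles preprocessing with the model.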

Model

Configurable models include:

  • Logistic Regression
  • RandomForest
  • (Optional extensions: Linear SVM, XGBoost)
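
To make the combination of word-level and character-level TF-IDF with a multi-label classifier concrete, here is a minimal scikit-learn sketch. The n-gram ranges, `max_iter`, and the one-vs-rest wrapper are assumptions chosen for illustration; the repository's configuration may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


def build_pipeline() -> Pipeline:
    """Word- and character-level TF-IDF feeding one logistic regression per label."""
    features = FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                 strip_accents="unicode", stop_words="english")),
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5),
                                 strip_accents="unicode")),
    ])
    classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    return Pipeline([("features", features), ("clf", classifier)])


if __name__ == "__main__":
    texts = ["you are an idiot", "have a nice day", "stupid idiot", "thanks for the help"]
    # Toy multi-label indicator matrix; columns are (toxic, insult).
    labels = np.array([[1, 1], [0, 0], [1, 1], [0, 0]])
    model = build_pipeline().fit(texts, labels)
    print(model.predict_proba(["what an idiot"]))  # one probability per label
```

Character n-grams help with obfuscated spellings that word-level features miss, which is a common reason to combine the two vectorizers for toxic-comment data.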

Evaluation

Metrics include:

  • ROC-AUC per label
  • PR-AUC
  • Precision/Recall
  • Confusion matrices
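
The per-label confusion matrices and precision/recall figures can be computed as in the sketch below, written in plain Python so the arithmetic is visible. The function names and the 0.5 threshold are assumptions; for ROC-AUC and PR-AUC, scikit-learn's `roc_auc_score` and `average_precision_score` are the usual choices.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP/FP/FN/TN for one binary label."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            counts["tp"] += 1
        elif pred:
            counts["fp"] += 1
        elif truth:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def per_label_report(y_true, y_prob, labels, threshold=0.5):
    """Threshold probabilities, then compute precision and recall per label."""
    report = {}
    for j, label in enumerate(labels):
        truth = [row[j] for row in y_true]
        pred = [int(row[j] >= threshold) for row in y_prob]
        c = confusion_counts(truth, pred)
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        report[label] = {
            **c,
            "precision": c["tp"] / predicted_pos if predicted_pos else 0.0,
            "recall": c["tp"] / actual_pos if actual_pos else 0.0,
        }
    return report
```

Reporting each label separately matters here because toxic-comment labels tend to be imbalanced, and an aggregate score can hide a label the model rarely predicts.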

Packaging

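The fitted preprocessing steps and the model are packaged as a single pipeline object and saved to:

models/production/model-latest.pkl

Because inference loads the same fitted preprocessing that was used during training, predictions stay consistent and the risk of training-serving skew is reduced.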
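
One way to write that file safely is sketched below: pickle to a temporary file in the target directory, then swap it into place so a running inference service never reads a half-written model. The helper names are hypothetical, and the repository may use a different serializer (for example joblib).

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_pipeline(pipeline, path="models/production/model-latest.pkl") -> Path:
    """Pickle to a temp file in the target directory, then swap it in atomically."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(pipeline, handle)
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)  # do not leave partial files behind
        raise
    return target


def load_pipeline(path="models/production/model-latest.pkl"):
    """Load the pickled pipeline; only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Pickle files are tied to the library versions that produced them, so the training and inference images should pin the same scikit-learn version.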


MLflow usage & practical tips

Running MLflow manually

mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri /home/ubuntu/TheSentryAI/mlruns \
  --default-artifact-root /home/ubuntu/TheSentryAI/mlruns \
  --serve-artifacts \
  --allowed-hosts="*" \
  --cors-allowed-origins="*"

Recommended for production:

  • Use a database for the backend store (Postgres; SQLite is adequate for a single-host setup)
  • Use S3 for artifact storage

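Common MLflow flags to know:

  • --backend-store-uri — where run metadata (parameters, metrics, tags) is stored
  • --default-artifact-root — default location for the artifacts of new experiments
  • --serve-artifacts — lets clients upload and download artifacts through the tracking server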

Local run & development

Install dependencies

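poetry install --no-root
poetry shell

On Poetry 2.x, where poetry shell is no longer built in, run the activation command printed by poetry env activate, or install the shell plugin.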

Create .env file

KAGGLE_USERNAME=...
KAGGLE_KEY=...
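
Before running ingestion it can help to confirm that both credentials are actually set. The sketch below parses simple KEY=VALUE lines and reports missing keys; the helper names are hypothetical, and the project itself may load the file differently (for example with python-dotenv).

```python
from typing import Dict, List, Mapping

REQUIRED_KEYS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Mapping[str, str], required=REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

For example, `missing_keys(parse_env_file(open(".env").read()))` returns an empty list when both Kaggle variables are present.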

Ingest, preprocess, train

python package/ingestion/load_kaggle_dataset.py
python package/preprocessing/prepare_dataset.py
python package/training/train_model.py

Run FastAPI for development

uvicorn package.service.server:app --host 0.0.0.0 --port 8000 --reload

Docker / Containers

Build images

sudo docker build -t sentry-api -f docker/Dockerfile.api .
sudo docker build -t sentry-mlflow -f docker/Dockerfile.mlflow .
sudo docker build -t sentry-worker -f docker/Dockerfile.worker .

Run MLflow container

sudo docker run -d --name sentry-mlflow \
  -p 5000:5000 \
  -v /home/ubuntu/TheSentryAI/mlruns:/mlflow \
  sentry-mlflow

Run inference container

sudo docker run -d --name sentry-api \
  -p 8000:8000 \
  --env-file /home/ubuntu/TheSentryAI/.env \
  -v /home/ubuntu/TheSentryAI/models:/app/models \
  -v /home/ubuntu/TheSentryAI/logs:/app/logs \
  sentry-api

CI/CD & deployment (GitHub Actions)

Workflows include:

  • test.yml — static checks and tests
  • deploy.yml — sync code, rebuild containers, restart services on EC2
  • weekly_train.yml — scheduled retraining job

Deployment steps:

  1. Checkout code
  2. Copy project to EC2 via SCP
  3. SSH into EC2
  4. Install dependencies
  5. Rebuild Docker images
  6. Restart services

All secrets should be stored in GitHub Secrets.


API usage (FastAPI)

Example request

{
  "comment_text": "This is a sample comment that might be toxic"
}

Example response

{
  "default_probability": 0.1482,
  "predictions": {
    "toxic": 0,
    "obscene": 0,
    "insult": 0
  },
  "model_version": "model-v2025-12-05"
}

The schema is defined in package/service.
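
A small client for the request and response shown above could look like the following sketch, using only the standard library. The `/predict` route and the helper names are assumptions; check package/service for the actual endpoint path.

```python
import json
import urllib.request
from typing import Any, Dict, List

API_URL = "http://localhost:8000/predict"  # hypothetical route; see package/service


def build_request(comment: str, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request matching the example request body."""
    body = json.dumps({"comment_text": comment}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def flagged_labels(response: Dict[str, Any]) -> List[str]:
    """Return the labels whose prediction is 1 in a response payload."""
    return [label for label, value in response.get("predictions", {}).items() if value == 1]


def classify(comment: str, url: str = API_URL, timeout: float = 10.0) -> Dict[str, Any]:
    """Send the comment to the running service and decode the JSON reply."""
    with urllib.request.urlopen(build_request(comment, url), timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

With the inference container running, `flagged_labels(classify("some comment"))` would list the labels the model assigned; for the example response above it returns an empty list.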


Troubleshooting

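Issue: "Invalid Host Header"

Fix by starting the MLflow server with the flags below. The "*" wildcard accepts any host and origin, so on a publicly reachable server list the actual hostnames and origins instead.

--allowed-hosts="*"
--cors-allowed-origins="*"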

Issue: Container exits due to unsupported MLflow flag

Remove any flags that the installed MLflow version does not support. Running mlflow server --help inside the image lists the flags that version accepts.

Issue: Port already in use

Check process:

sudo lsof -i :5000

Stop container:

sudo docker stop sentry-mlflow
sudo docker rm sentry-mlflow
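
The same check can be done from Python, for instance in a pre-start script, when lsof is not available. This is a minimal sketch with a hypothetical helper name; it only reports whether something accepts TCP connections on the port, not which process owns it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (5000, 8000):  # MLflow and FastAPI ports used in this README
        print(port, "in use" if port_in_use(port) else "free")
```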

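Issue: ModuleNotFoundError: No module named 'package'

Python cannot find the package/ directory on its import path. Run commands from the repository root, or set PYTHONPATH:

export PYTHONPATH="/home/ubuntu/TheSentryAI:$PYTHONPATH"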

Future improvements

  • Move to Postgres + S3 backend for MLflow
  • Add monitoring and logging stack (Prometheus, Grafana)
  • Implement data validation with Great Expectations
  • Canary or blue-green deployment strategies
  • Add explainability endpoints (SHAP)
  • Upgrade EC2 deployment to ECS/Fargate for autoscaling

Contact / Notes

Important directories:

  • package/ingestion — data loading
  • package/preprocessing — dataset cleaning
  • package/training — training loop
  • package/service — FastAPI inference
