TheSentryAI — End-to-End MLOps Pipeline (MLflow-focused)

This repository implements an end-to-end MLOps system for toxic-comment detection. It demonstrates a production-oriented MLOps stack with a strong focus on MLflow for experiment tracking and model lifecycle management, along with Dockerized inference services, GitHub Actions CI/CD automation, and AWS EC2 deployment.

Summary

TheSentryAI is designed as a production-ready machine learning pipeline that includes:

Data ingestion and preprocessing
Fully reproducible training pipeline
Experiment tracking and artifact management using MLflow
Pipeline packaging (preprocessing + model) for inference
Containerized inference with FastAPI and Docker
Automated CI/CD deployment through GitHub Actions to AWS EC2
Scheduled automated retraining

The repository emphasizes reproducibility, traceability, and automated deployment workflows.

Project overview

Workflow:

Raw toxic comments → preprocessing → TF-IDF → classifier → evaluation → MLflow logging → model artifact → production deployment

Main components:

package/ — code for ingestion, preprocessing, training, inference
data/ — raw and processed datasets
mlruns/ — MLflow tracking directory (local)
docker/ — Dockerfiles for API, MLflow server, worker
GitHub Actions workflows — CI, deployment, retraining
EC2 machine hosting Docker containers for inference and MLflow

Primary technology stack: Python, scikit-learn, FastAPI, MLflow, Docker, GitHub Actions, AWS EC2.

Business value

This system is intended for platforms requiring scalable, automated toxic comment detection. It provides:

Faster and more consistent moderation decisions
Centralized and auditable experiment tracking
Continuous model improvements through retraining
Reproducible and versioned model artifacts
Deployment workflows suitable for real production environments

Architecture (MLflow-centric)

 (Scheduled retraining)
      GitHub Actions (weekly)
             │
             ▼
       Training job
     ──────────────────▶ MLflow: logs parameters, metrics, artifacts
                         Stored in mlruns/ or remote backend
                         Optional model registry

             │
             ▼
       Model artifact

             │
             ▼
    CI/CD deploy → EC2 (Docker containers)
                    └─ FastAPI inference service loading the latest model

Key MLflow features used:

Tracking server
Artifact storage
Run comparison interface
Optional model registry support

Demo

Video Demo

demo video

ML pipeline internals

Preprocessing

Text cleaning
Stripping accents
Stopword removal
Word-level and character-level TF-IDF
Multi-label engineering
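
The preprocessing steps above could be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the repository's actual code: the `clean_text` rules, the n-gram ranges, and the function names are assumptions chosen for the example.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


def clean_text(text: str) -> str:
    """Lowercase, drop URLs, and collapse whitespace (illustrative rules only)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_features() -> FeatureUnion:
    """Combine word-level and character-level TF-IDF into one sparse matrix."""
    word_tfidf = TfidfVectorizer(
        analyzer="word",
        strip_accents="unicode",   # accent stripping
        stop_words="english",      # stopword removal
        ngram_range=(1, 2),        # assumed range
        sublinear_tf=True,
    )
    char_tfidf = TfidfVectorizer(
        analyzer="char_wb",        # character n-grams inside word boundaries
        strip_accents="unicode",
        ngram_range=(2, 5),        # assumed range
        sublinear_tf=True,
    )
    return FeatureUnion([("word", word_tfidf), ("char", char_tfidf)])


comments = ["You are an idiot!!  http://spam.example", "Thanks for the café tip"]
cleaned = [clean_text(comment) for comment in comments]
features = build_features()
matrix = features.fit_transform(cleaned)
print(matrix.shape)  # (number of comments, word features + char features)
```

Character n-grams tend to help with misspelled or obfuscated words, which is one common reason to pair them with word-level features.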

Model

Configurable models include:

Logistic Regression
RandomForest
(Optional extensions: Linear SVM, XGBoost)
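
One way to make the model configurable is a small factory that maps a name to a scikit-learn estimator and wraps it for multi-label output. The sketch below is an assumption about how such a switch might look; the names `build_classifier`, `logistic_regression`, and `random_forest`, and the default hyperparameters, are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


def build_classifier(name: str, **params) -> OneVsRestClassifier:
    """Return a one-vs-rest multi-label classifier for the given model name."""
    if name == "logistic_regression":
        base = LogisticRegression(**{"max_iter": 1000, **params})
    elif name == "random_forest":
        base = RandomForestClassifier(**{"n_estimators": 200, **params})
    else:
        raise ValueError(f"Unknown model name: {name!r}")
    # One binary classifier is trained per label column.
    return OneVsRestClassifier(base)


# Tiny synthetic example: 4 samples, 2 features, 2 labels.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]])
Y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
clf = build_classifier("logistic_regression", C=4.0)
clf.fit(X, Y)
probabilities = clf.predict_proba(X)  # shape: (n_samples, n_labels)
print(probabilities.shape)
```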

Evaluation

Metrics include:

ROC-AUC per label
PR-AUC
Precision/Recall
Confusion matrices
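
As an illustration of the per-label metrics, the sketch below computes ROC-AUC and PR-AUC for each label column. It uses `average_precision_score` as the PR-AUC estimate, which is a common choice but an assumption here; the function name and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def per_label_report(y_true: np.ndarray, y_score: np.ndarray, labels: list) -> dict:
    """Compute ROC-AUC and PR-AUC (average precision) for each label column."""
    report = {}
    for index, label in enumerate(labels):
        report[label] = {
            "roc_auc": float(roc_auc_score(y_true[:, index], y_score[:, index])),
            "pr_auc": float(average_precision_score(y_true[:, index], y_score[:, index])),
        }
    return report


labels = ["toxic", "obscene", "insult"]
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 0]])
y_score = np.array(
    [[0.9, 0.2, 0.1], [0.2, 0.8, 0.3], [0.7, 0.6, 0.9], [0.1, 0.1, 0.2]]
)
report = per_label_report(y_true, y_score, labels)
for label, scores in report.items():
    print(label, scores)
```

Note that `roc_auc_score` raises an error when a label column contains only one class, so rare labels may need a stratified split or a guard around the call.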

Packaging

The entire preprocessing + model pipeline is packaged and saved to:

models/production/model-latest.pkl

Because the same fitted preprocessing steps run at training and inference time, predictions stay consistent and training-serving skew is avoided.
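
A minimal sketch of the save and load round trip is shown below. It assumes the `.pkl` file is written with the standard `pickle` module (the repository may use joblib instead), and the helper names are hypothetical. Writing to a temporary file and renaming it means a running service never reads a half-written model.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_pipeline(pipeline, path: Path) -> None:
    """Pickle the fitted pipeline atomically: write a temp file, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(pipeline, handle)
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except Exception:
        os.unlink(tmp_name)  # clean up the partial temp file
        raise


def load_pipeline(path: Path):
    """Load a pickled pipeline. Only unpickle files from a trusted source."""
    with path.open("rb") as handle:
        return pickle.load(handle)


# Demonstration with a stand-in object; a fitted sklearn Pipeline works the same way.
with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "models" / "production" / "model-latest.pkl"
    save_pipeline({"vectorizer": "tfidf", "classifier": "logreg"}, target)
    restored = load_pipeline(target)
    print(restored)
```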

MLflow usage & practical tips

Running MLflow manually

mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri /home/ubuntu/TheSentryAI/mlruns \
  --default-artifact-root /home/ubuntu/TheSentryAI/mlruns \
  --serve-artifacts \
  --allowed-hosts="*" \
  --cors-allowed-origins="*"

Recommended for production:

Use a database-backed backend store (SQLite on a single host, PostgreSQL for shared or concurrent use)
Use S3 or another object store for artifact storage

Common MLflow flags to know:

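--backend-store-uri — where run metadata (parameters, metrics, tags) is stored: a local directory or a database URI
--default-artifact-root — default location for the artifacts of newly created experiments
--serve-artifacts — lets the tracking server proxy artifact uploads and downloads for clients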
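
With a tracking server running as above, a training script could log a run along these lines. This is a hedged sketch rather than the repository's training code: the tracking URI `http://localhost:5000` follows the port used above, while the experiment name `toxic-comments` and both function names are hypothetical.

```python
def flatten_metrics(report: dict) -> dict:
    """Turn {"toxic": {"roc_auc": 0.97}} into {"roc_auc_toxic": 0.97}."""
    return {
        f"{metric}_{label}": float(value)
        for label, scores in report.items()
        for metric, value in scores.items()
    }


def log_training_run(
    params: dict,
    report: dict,
    model_path: str,
    tracking_uri: str = "http://localhost:5000",  # assumed server address
    experiment: str = "toxic-comments",           # hypothetical experiment name
) -> str:
    """Log parameters, per-label metrics, and the model file; return the run ID."""
    import mlflow  # imported here so flatten_metrics works without MLflow installed

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment)
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.log_metrics(flatten_metrics(report))
        mlflow.log_artifact(model_path)
        return run.info.run_id


example = flatten_metrics({"toxic": {"roc_auc": 0.97, "pr_auc": 0.81}})
print(example)
```

A call such as `log_training_run({"model": "logistic_regression"}, report, "models/production/model-latest.pkl")` would then show up as one run in the MLflow UI.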

Local run & development

Install dependencies

poetry install --no-root
poetry shell  # on Poetry 2.x this command requires the poetry-plugin-shell plugin

Create `.env` file

KAGGLE_USERNAME=...
KAGGLE_KEY=...
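
Before running ingestion, it can help to confirm that both keys are present. The sketch below parses simple `KEY=VALUE` lines with the standard library only; the repository may load the file differently (for example with python-dotenv), and the function names are hypothetical.

```python
REQUIRED_KEYS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = 'KAGGLE_USERNAME=alice\n# comment\nKAGGLE_KEY="abc123"\n'
config = parse_env(sample)
print(missing_keys(config))
```

In practice the text would come from the real file, for example `parse_env(open(".env").read())`.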

Ingest, preprocess, train

python package/ingestion/load_kaggle_dataset.py
python package/preprocessing/prepare_dataset.py
python package/training/train_model.py

Run FastAPI for development

uvicorn package.service.server:app --host 0.0.0.0 --port 8000 --reload

Docker / Containers

Build images

sudo docker build -t sentry-api -f docker/Dockerfile.api .
sudo docker build -t sentry-mlflow -f docker/Dockerfile.mlflow .
sudo docker build -t sentry-worker -f docker/Dockerfile.worker .

Run MLflow container

sudo docker run -d --name sentry-mlflow \
  -p 5000:5000 \
  -v /home/ubuntu/TheSentryAI/mlruns:/mlflow \
  sentry-mlflow

Run inference container

sudo docker run -d --name sentry-api \
  -p 8000:8000 \
  --env-file /home/ubuntu/TheSentryAI/.env \
  -v /home/ubuntu/TheSentryAI/models:/app/models \
  -v /home/ubuntu/TheSentryAI/logs:/app/logs \
  sentry-api

CI/CD & deployment (GitHub Actions)

Workflows include:

test.yml — static checks and tests
deploy.yml — sync code, rebuild containers, restart services on EC2
weekly_train.yml — scheduled retraining job

Deployment steps:

Checkout code
Copy project to EC2 via SCP
SSH into EC2
Install dependencies
Rebuild Docker images
Restart services

All secrets should be stored in GitHub Secrets.

API usage (FastAPI)

Example request

{
  "comment_text": "This is a sample comment that might be toxic"
}

Example response

{
  "default_probability": 0.1482,
  "predictions": {
    "toxic": 0,
    "obscene": 0,
    "insult": 0
  },
  "model_version": "model-v2025-12-05"
}

The schema is defined in package/service.
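
A client could call the service with the standard library as sketched below. The route `/predict` is an assumption, since the endpoint path is not stated here; check package/service for the actual route. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request(
    comment: str,
    base_url: str = "http://localhost:8000",
    path: str = "/predict",  # hypothetical route
) -> urllib.request.Request:
    """Build a JSON POST request carrying the comment_text field."""
    body = json.dumps({"comment_text": comment}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def flagged_labels(response: dict) -> list:
    """Return the labels whose prediction flag is 1."""
    return [label for label, flag in response["predictions"].items() if flag == 1]


def classify(comment: str, **kwargs) -> dict:
    """Send the comment to the running service and decode the JSON reply."""
    with urllib.request.urlopen(build_request(comment, **kwargs), timeout=10) as reply:
        return json.load(reply)


request = build_request("This is a sample comment that might be toxic")
print(request.full_url, request.get_method())
```

With the service running, `flagged_labels(classify("some comment"))` would list the labels the model flagged.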

Troubleshooting

Issue: "Invalid Host Header"

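Fix by starting the MLflow server with the flags below. The `*` wildcard accepts any host and origin, so on a publicly reachable server prefer listing the specific hostnames and origins: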

--allowed-hosts="*"
--cors-allowed-origins="*"

Issue: Container exits due to unsupported MLflow flag

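Run `mlflow server --help` to see which flags the installed MLflow version supports, and remove any it does not list. Newer options such as --allowed-hosts and --cors-allowed-origins may be rejected by older releases.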

Issue: Port already in use

Find the process holding the port:

sudo lsof -i :5000

If it is a stale MLflow container, stop and remove it:

sudo docker stop sentry-mlflow
sudo docker rm sentry-mlflow
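
The same check can be scripted, for instance before a deployment step starts a container. This is a small sketch using only the standard library; the function name is hypothetical and it only tests whether something accepts TCP connections on the port.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return probe.connect_ex((host, port)) == 0


print("Port 5000 in use:", port_in_use(5000))
```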

Issue: `ModuleNotFoundError: No module named 'package'`

Add the project root to PYTHONPATH so Python can find the package/ directory:

export PYTHONPATH="/home/ubuntu/TheSentryAI:$PYTHONPATH"
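
To confirm the fix from inside Python, a short diagnostic like the one below can report whether the module resolves. It is a sketch under the assumption that the project lives at /home/ubuntu/TheSentryAI as in the commands above; the function name is hypothetical.

```python
import importlib.util
import sys
from pathlib import Path


def ensure_importable(module: str, project_root: str) -> bool:
    """Add project_root to sys.path if needed; report whether module resolves."""
    root = str(Path(project_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    importlib.invalidate_caches()  # pick up directories created after startup
    return importlib.util.find_spec(module) is not None


print(ensure_importable("package", "/home/ubuntu/TheSentryAI"))
```

Modifying `sys.path` at runtime is a debugging aid; setting PYTHONPATH or installing the project into the environment is the more durable fix.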

Future improvements

Move to Postgres + S3 backend for MLflow
Add monitoring and logging stack (Prometheus, Grafana)
Implement data validation with Great Expectations
Canary or blue-green deployment strategies
Add explainability endpoints (SHAP)
Upgrade EC2 deployment to ECS/Fargate for autoscaling

Contact / Notes

Important directories:

package/ingestion — data loading
package/preprocessing — dataset cleaning
package/training — training loop
package/service — FastAPI inference
