
Added NSFW text validator#83

Open
rkritika1508 wants to merge 15 commits into feat/toxicity-hub-validators from feat/toxicity-huggingface-model

Conversation


@rkritika1508 rkritika1508 commented Apr 2, 2026

Summary

Target issue is #84
This PR introduces a new validator: nsfw_text, powered by Guardrails AI, with support for custom HuggingFace models.

We configure it to use:
textdetox/xlmr-large-toxicity-classifier

This significantly improves our ability to detect multilingual, code-mixed, and context-aware unsafe content, addressing gaps in the current validator stack.

Our existing validator suite (profanity, slur detection, toxic language, etc.) primarily relies on:

  • rule-based approaches
  • keyword matching
  • lighter ML models

While effective for straightforward cases, they fail in more complex real-world scenarios, such as:

  • multilingual inputs (Hindi + English mix, etc.)
  • paraphrased toxicity (non-explicit abusive phrasing)
  • regional slang / dialects
  • implicit or contextual NSFW content

Existing Validators (Context)

To make this PR self-contained, here’s a quick overview of what we already use and where they fall short:

| Validator | What it does | Limitation |
|---|---|---|
| Profanity Free | Detects explicit swear words | Misses implicit or paraphrased toxicity |
| NSFW (existing basic) | Flags explicit content | Not robust for nuanced or mixed-language input |
| Llama Guard (policy) | Policy-based filtering | Not specialized for fine-grained NSFW detection |

We need a strong model-based classifier to complement these.

What This PR Adds

1. New Validator: nsfw_text

File: backend/app/core/validators/config/nsfw_text_safety_validator_config.py

2. Custom Model Integration

Instead of using default models, we configure: textdetox/xlmr-large-toxicity-classifier

This model:

  • is multilingual (based on XLM-R)
  • handles code-mixed inputs well
  • offers better contextual understanding than keyword-based systems
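The configuration described above might look roughly like the sketch below. It assumes, per this PR's description, that the Guardrails `NSFWText` validator accepts a `model_name` override; the `threshold`, `validation_method`, and `on_fail` values are illustrative defaults, not taken from the repository.

```python
# Hedged sketch of what backend/app/core/validators/config/
# nsfw_text_safety_validator_config.py could contain.
# Assumes NSFWText exposes a model_name parameter, as this PR describes.
from guardrails.hub import NSFWText

nsfw_text_validator = NSFWText(
    model_name="textdetox/xlmr-large-toxicity-classifier",  # multilingual XLM-R classifier
    threshold=0.8,                 # illustrative confidence cutoff
    validation_method="sentence",  # classify per sentence rather than whole input
    on_fail="exception",           # illustrative failure policy
)
```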

Design Decision

Previous Approach (Discarded)

  • Directly calling HuggingFace models separately
  • Added extra dependency + integration overhead

Current Approach

  • Use Guardrails NSFWText validator
  • Plug in custom HuggingFace model via model_name

Benefits

  • No additional abstraction layer
  • Fully compatible with existing Guardrails pipeline
  • Centralized validator management
  • Easy to swap models in future
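The "easy to swap models" benefit can be illustrated with a minimal, dependency-free sketch of centralized validator configuration. All names here are hypothetical, not taken from the repository:

```python
# Hypothetical sketch: keep the model choice in one config object so it can be
# swapped without touching validator wiring. Names are illustrative only.
from dataclasses import dataclass


@dataclass(frozen=True)
class NSFWTextConfig:
    model_name: str = "textdetox/xlmr-large-toxicity-classifier"
    threshold: float = 0.8


def build_validator_kwargs(config: NSFWTextConfig) -> dict:
    """Translate the config into keyword arguments for a Guardrails validator."""
    return {"model_name": config.model_name, "threshold": config.threshold}


# Default model from this PR, and a hypothetical swap to another classifier:
default_kwargs = build_validator_kwargs(NSFWTextConfig())
swapped_kwargs = build_validator_kwargs(
    NSFWTextConfig(model_name="some-org/other-model")
)
```

Because the validator only sees the keyword arguments, replacing the model is a one-line config change.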

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the changes.
  • If you've fixed a bug or added code, ensured it is covered by test cases.

Notes

Please add any other information the reviewer may need here.

@coderabbitai

coderabbitai bot commented Apr 2, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 020b5145-2c73-453c-8f34-df5382e3be12

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@rkritika1508 rkritika1508 self-assigned this Apr 7, 2026
@rkritika1508 rkritika1508 added enhancement New feature or request ready-for-review labels Apr 7, 2026
@rkritika1508 rkritika1508 linked an issue Apr 7, 2026 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

Add toxicity validator for HuggingFace models

3 participants