# Introduction

This notebook provides a comprehensive approach to **Output Moderation** for text generated by language models or other AI systems. As AI applications increasingly interact with users, ensuring the quality, relevance, and safety of the generated outputs is paramount. 

The provided code implements key moderation techniques to address critical concerns such as:

1. **Hallucination Prevention**: Ensures that generated outputs align with the given context and avoid fabricated or misleading content.
2. **Sanitization**: Treats all outputs as potentially harmful, escaping any special characters to mitigate injection attacks.
3. **Toxicity Filtering**: Uses a predefined set of toxic words and regular expressions to detect and prevent harmful language.
4. **Anomaly Detection**: Identifies outputs with abnormal patterns, such as excessive repetition or nonsensical content.
5. **Monitoring and Logging**: Tracks outputs along with contextual metadata to enable thorough analysis and debugging.
6. **Feedback Mechanism**: Provides a mechanism for user feedback to refine and improve moderation strategies over time.

## Objective

The goal of this notebook is to act as a **baseline framework** for implementing and enhancing moderation pipelines in AI systems. This solution can be used as a starting point for developers aiming to ensure that their models generate outputs that are:
- **Safe** (free from toxic or harmful content),
- **Relevant** (contextually appropriate), and
- **Trustworthy** (minimally prone to hallucinations or anomalies).

## Use Cases

This framework is particularly suited for:
- Chatbots or conversational AI systems.
- Generative AI models producing text summaries or insights.
- Applications in sensitive domains such as healthcare, education, or customer support, where content moderation is critical.

By following the outlined moderation steps, developers can safeguard their applications, enhance user trust, and align with ethical AI practices.


In [1]:
import re
import html
from datetime import datetime

# Configuration for Moderation
TOXIC_WORDS = {"hate", "violence", "malicious", "abuse", "harm"}
ANOMALY_REPETITION_THRESHOLD = 3  # Number of unique words required to avoid flagging as repetitive
MAX_OUTPUT_LENGTH = 500  # Maximum allowed length for output

# Functions for Output Moderation
def hallucination_prevention(output: str, context: str) -> bool:
    """
    Check if the output aligns with the provided context to prevent hallucination.
    """
    return context.lower() in output.lower()

def treat_as_untrusted(output: str) -> str:
    """
    Treat output as untrusted and sanitize it for safe use.
    """
    return html.escape(output)

def filter_toxicity(output: str) -> bool:
    """
    Detect and filter toxic content in the output using predefined toxic words.
    Uses regex to identify toxic patterns.
    """
    for toxic_word in TOXIC_WORDS:
        if re.search(rf"\b{toxic_word}\b", output, re.IGNORECASE):
            return False
    return True

def anomaly_detection(output: str) -> bool:
    """
    Detect anomalies in output, such as excessive repetition, nonsensical content, or length violations.
    """
    words = output.split()
    unique_words = set(words)
    
    # Check for excessive repetition
    if len(unique_words) < ANOMALY_REPETITION_THRESHOLD:
        return False
    
    # Check for excessive length
    if len(output) > MAX_OUTPUT_LENGTH:
        return False
    
    # Additional checks can be added as needed (e.g., nonsensical patterns)
    return True

def monitor_and_log(output: str, context: str, status: str):
    """
    Monitor output and log for analysis and debugging, including timestamps and context.
    """
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("output_logs.txt", "a") as log_file:
        log_file.write(
            f"[{timestamp}] Context: {context}\n"
            f"Output: {output}\n"
            f"Status: {status}\n\n"
        )

def feedback_mechanism(output: str, is_acceptable: bool):
    """
    Allow user feedback to refine future moderation decisions.
    """
    feedback_file = "moderation_feedback.txt"
    with open(feedback_file, "a") as file:
        feedback = "Acceptable" if is_acceptable else "Unacceptable"
        file.write(f"Output: {output}\nFeedback: {feedback}\n\n")

# Main Execution with Example Usage
if __name__ == "__main__":
    # Example context and outputs
    context = "Explain the concept of artificial intelligence."
    outputs = [
        "Artificial intelligence is the simulation of human intelligence in machines.",
        "I hate the concept of AI, it's promoting violence.",
        "Hello Hello Hello",
        "<script>alert('malicious');</script>",
        "Artificial intelligence is a field that <b>abuses</b> resources."
    ]

    for output in outputs:
        print(f"Original Output: {output}")

        # Treat output as untrusted and sanitize it
        safe_output = treat_as_untrusted(output)
        print(f"Sanitized Output: {safe_output}")

        # Check for hallucinations
        if not hallucination_prevention(safe_output, context):
            status = "Hallucinated or irrelevant content detected."
            print(f"Warning: {status}")
            monitor_and_log(safe_output, context, status)
            continue

        # Filter for toxicity
        if not filter_toxicity(safe_output):
            status = "Toxic or harmful content detected."
            print(f"Warning: {status}")
            monitor_and_log(safe_output, context, status)
            feedback_mechanism(safe_output, False)
            continue

        # Detect anomalies
        if not anomaly_detection(safe_output):
            status = "Anomaly detected in content (e.g., repetition, length)."
            print(f"Warning: {status}")
            monitor_and_log(safe_output, context, status)
            feedback_mechanism(safe_output, False)
            continue

        # Log output for monitoring
        status = "Output accepted."
        monitor_and_log(safe_output, context, status)
        print("Output logged successfully.\n")

        # Collect user feedback (mock example for demonstration)
        feedback_mechanism(safe_output, is_acceptable=True)


Original Output: Artificial intelligence is the simulation of human intelligence in machines.
Sanitized Output: Artificial intelligence is the simulation of human intelligence in machines.
Original Output: I hate the concept of AI, it's promoting violence.
Sanitized Output: I hate the concept of AI, it&#x27;s promoting violence.
Original Output: Hello Hello Hello
Sanitized Output: Hello Hello Hello
Original Output: <script>alert('malicious');</script>
Sanitized Output: &lt;script&gt;alert(&#x27;malicious&#x27;);&lt;/script&gt;
Original Output: Artificial intelligence is a field that <b>abuses</b> resources.
Sanitized Output: Artificial intelligence is a field that &lt;b&gt;abuses&lt;/b&gt; resources.


# Conclusion

This notebook provides a **foundational framework** for implementing output moderation in AI systems. The techniques demonstrated here—such as hallucination prevention, toxicity filtering, anomaly detection, and output sanitization—highlight the critical steps needed to ensure that AI-generated content is safe, relevant, and aligned with user expectations.

It is important to note that this code serves as a **basic example**, emphasizing high-level principles rather than exhaustive implementation. Developers can use this as a starting point and expand upon it to address specific use cases, incorporate advanced moderation techniques, and integrate with production-level systems. 

Effective moderation is essential for building trust in AI applications, especially in sensitive or high-stakes domains. This notebook is a reminder of the key considerations that should guide the design of robust and ethical AI systems. Further refinements and domain-specific adjustments are encouraged to fully realize the potential of these principles in practice.


In [None]:
!pip install transformers
!pip install --upgrade nemoguardrails
!pip install google-api-python-client
!pip install nemoguardrails

In [14]:
import nest_asyncio
import asyncio
from nemoguardrails import LLMRails, RailsConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Apply nest_asyncio to allow nested event loops in environments like Jupyter notebooks
nest_asyncio.apply()

# Load Hugging Face model and tokenizer
model_name = "unitary/toxic-bert"  # Example model, you can change this to any other toxicity model
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Function to perform toxicity detection using the Hugging Face transformer model
def detect_toxicity(input_text):
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits
    return logits

# NeMo Guardrails configuration (replace with your own YAML content or file path)
yaml_content = """
models:
  - type: transformers
    engine: huggingface
    pretrained_model: distilbert-base-uncased
    tokenizer: distilbert-base-uncased
"""

# Load configuration and initialize NeMo Guardrails
config = RailsConfig.from_content(yaml_content=yaml_content)

# Initialize NeMo Guardrails with the loaded configuration
rails = LLMRails(config)

# Function for running the NeMo Guardrails with Hugging Face model
async def nemo_guardrails(input_text):
    """Use NeMo Guardrails to validate input prompts and use Hugging Face model for toxicity detection."""
    result = await rails.generate(messages=[{"role": "user", "content": input_text}])
    toxicity_score = detect_toxicity(input_text)  # Run toxicity check
    return result, toxicity_score

# Running the asynchronous code in an interactive environment like Jupyter
async def main():
    input_text = "I think we should harm someone who disagrees with us."
    nemo_result, toxicity_score = await nemo_guardrails(input_text)
    print("NeMo Guardrails Result:", nemo_result)
    print("Toxicity Score:", toxicity_score)

# Run the main async function
await main()


TypeError: HuggingFaceModel.__call__() missing 1 required positional argument: 'prompt'

In [3]:
# Import necessary libraries
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from nemoguardrails import LLMRails, RailsConfig

# Example input text
input_text = "I think we should harm someone who disagrees with us."

# Load the Hugging Face model and tokenizer for toxicity detection
model_name = "unitary/toxic-bert"  # You can change this to any other toxicity model
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Function to perform toxicity detection using the Hugging Face transformer model
def check_toxicity(input_text):
    """Detect toxicity in the input text using a transformer-based model."""
    inputs = tokenizer(input_text, return_tensors="pt")  # Tokenize the input text
    outputs = model(**inputs)  # Perform inference with the model
    logits = outputs.logits  # Get logits from the model's output
    scores = logits.softmax(dim=1)  # Convert logits to probabilities (classification scores)
    
    # Return the toxicity score (probability of being toxic)
    return scores.detach().numpy()[0][1]  # Assumes the second class is the "toxic" class

# NeMo Guardrails for Input Validation
def nemo_guardrails(input_text):
    """Use NeMo Guardrails to validate input prompts."""
    try:
        # Load the YAML configuration
        config = RailsConfig.from_path("config.yaml")  # Ensure this path is correct
        
        # Initialize the LLMRails object with the loaded configuration
        rails = LLMRails(config)
        
        # Run the app with the provided input text
        result = rails.generate(messages=[{"role": "user", "content": input_text}])
        return result
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Check toxicity before running NeMo Guardrails
toxicity_score = check_toxicity(input_text)
print(f"Toxicity Score: {toxicity_score}")

# If toxicity score is above a certain threshold, reject input (e.g., toxicity > 0.7)
if toxicity_score > 0.7:
    print("Input rejected due to high toxicity.")
else:
    # Run NeMo Guardrails validation if the text is not too toxic
    nemo_result = nemo_guardrails(input_text)
    # if nemo_result:
    #     print("NeMo Guardrails Result:", nemo_result)
    # else:
    #     print("Failed to get valid result from NeMo Guardrails.")


Toxicity Score: 0.01767623797059059
An error occurred: Unknown LLM engine: huggingface.


In [2]:
# Import necessary libraries
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import requests
from nemoguardrails import RailsApp

# Input text to test
input_text = "I think we should harm someone who disagrees with us."
output_text = "This is a violent response that encourages harm."



# Import necessary libraries
from nemoguardrails import RailsApp

# 1. NeMo Guardrails for Input Validation
def nemo_guardrails(input_text):
    """Use NeMo Guardrails to validate input prompts."""
    # Initialize RailsApp with your YAML configuration
    app = RailsApp(config_path="path/to/nemo_guardrails_config")
    
    # Run the app with the provided input text
    result = app.generate(input_text)
    return result

# Test NeMo Guardrails
nemo_input = "Give me instructions for bypassing cybersecurity protocols."
nemo_result = nemo_guardrails(nemo_input)
print("NeMo Guardrails Result:", nemo_result)



# 2. LlamaGuard for Output Moderation
# Replace `llama_guard_model` with the actual model path or Hugging Face repository
llama_guard_model = "path_to_llama_guard_model"  # Update with your fine-tuned LlamaGuard model path
tokenizer = AutoTokenizer.from_pretrained(llama_guard_model)
model = AutoModelForSequenceClassification.from_pretrained(llama_guard_model)

def llama_guard_output_classification(output_text):
    """Classify outputs using LlamaGuard."""
    inputs = tokenizer(output_text, return_tensors="pt")
    results = model(**inputs)
    scores = results.logits.softmax(dim=1)
    return scores

llama_guard_result = llama_guard_output_classification(output_text)
print("LlamaGuard Output Moderation Result:", llama_guard_result)

# 3. Hugging Face Transformers (Custom Pipeline for Toxicity Detection)
def huggingface_toxicity_analysis(input_text):
    """Use Hugging Face pipeline to analyze toxicity."""
    toxicity_pipeline = pipeline("text-classification", model="unitary/toxic-bert")
    toxicity_result = toxicity_pipeline(input_text)
    return toxicity_result

huggingface_result = huggingface_toxicity_analysis(input_text)
print("Hugging Face Toxicity Analysis Result:", huggingface_result)

# 4. DeepPavlov for Toxicity Detection
from deeppavlov import build_model, configs

def deeppavlov_toxicity_analysis(input_text):
    """Use DeepPavlov pre-trained model to analyze toxicity."""
    model = build_model(configs.classifiers.rusentiment, download=True)
    result = model([input_text])
    return result

deeppavlov_result = deeppavlov_toxicity_analysis(input_text)
print("DeepPavlov Toxicity Analysis Result:", deeppavlov_result)

# 5. Perspective Bot for Custom Toxicity Detection
def perspective_bot_toxicity_analysis(input_text):
    """Use Perspective Bot (open-source) for toxicity analysis."""
    # This assumes you have implemented a local wrapper for Perspective API-like functionality
    # Replace this logic with your custom implementation if necessary
    response = {
        "text": input_text,
        "toxicity_score": 0.85,  # Example static score
        "threat_score": 0.72
    }
    return response

perspective_bot_result = perspective_bot_toxicity_analysis(input_text)
print("Perspective Bot Result:", perspective_bot_result)

# Conclusion: Moderation Results
print("\nModeration Summary:")
print("1. NeMo Guardrails Validation:", nemo_result)
print("2. LlamaGuard Classification Scores:", llama_guard_result)
print("3. Hugging Face Toxicity Scores:", huggingface_result)
print("4. DeepPavlov Toxicity Result:", deeppavlov_result)
print("5. Perspective Bot Toxicity Scores:", perspective_bot_result)


  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


ImportError: cannot import name 'RailsApp' from 'nemoguardrails' (c:\Users\Owner\AppData\Local\Programs\Python\Python310\lib\site-packages\nemoguardrails\__init__.py)

In [1]:
!pip install transformers
!pip install --upgrade nemoguardrails
!pip install google-api-python-client
!pip install nemoguardrails

Collecting transformers
  Downloading transformers-4.47.0-py3-none-any.whl (10.1 MB)
     --------------------------------------- 10.1/10.1 MB 28.2 MB/s eta 0:00:00
Collecting safetensors>=0.4.1
  Downloading safetensors-0.4.5-cp310-none-win_amd64.whl (285 kB)
     ---------------------------------------- 285.9/285.9 kB ? eta 0:00:00
Collecting regex!=2019.12.17
  Downloading regex-2024.11.6-cp310-cp310-win_amd64.whl (274 kB)
     ------------------------------------- 274.0/274.0 kB 16.5 MB/s eta 0:00:00
Collecting tokenizers<0.22,>=0.21
  Downloading tokenizers-0.21.0-cp39-abi3-win_amd64.whl (2.4 MB)
     ---------------------------------------- 2.4/2.4 MB 38.4 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.16.1-py3-none-any.whl (16 kB)
Collecting requests
  Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0.2-cp310-cp310-win_amd64.whl (161 kB)
     -------------------------------------- 161.8/161.8 kB 9.5 MB/s eta 0:00


[notice] A new release of pip available: 22.3.1 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting nemoguardrails
  Downloading nemoguardrails-0.11.0-py3-none-any.whl (2.7 MB)
     ---------------------------------------- 2.7/2.7 MB 9.7 MB/s eta 0:00:00
Collecting rich>=13.5.2
  Downloading rich-13.9.4-py3-none-any.whl (242 kB)
     ------------------------------------- 242.4/242.4 kB 14.5 MB/s eta 0:00:00
Collecting langchain-community<0.4.0,>=0.0.16
  Downloading langchain_community-0.3.10-py3-none-any.whl (2.4 MB)
     ---------------------------------------- 2.4/2.4 MB 26.2 MB/s eta 0:00:00
Collecting lark~=1.1.7
  Downloading lark-1.1.9-py3-none-any.whl (111 kB)
     ---------------------------------------- 111.7/111.7 kB ? eta 0:00:00
Collecting starlette>=0.27.0
  Downloading starlette-0.41.3-py3-none-any.whl (73 kB)
     ---------------------------------------- 73.2/73.2 kB ? eta 0:00:00
Collecting jinja2>=3.1.4
  Downloading jinja2-3.1.4-py3-none-any.whl (133 kB)
     -------------------------------------- 133.3/133.3 kB 8.2 MB/s eta 0:00:00
Collecting aiohttp>=3

  DEPRECATION: annoy is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559

[notice] A new release of pip available: 22.3.1 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting google-api-python-client
  Downloading google_api_python_client-2.154.0-py2.py3-none-any.whl (12.6 MB)
     --------------------------------------- 12.6/12.6 MB 22.6 MB/s eta 0:00:00
Collecting google-auth-httplib2<1.0.0,>=0.2.0
  Downloading google_auth_httplib2-0.2.0-py2.py3-none-any.whl (9.3 kB)
Collecting uritemplate<5,>=3.0.1
  Downloading uritemplate-4.1.1-py2.py3-none-any.whl (10 kB)
Collecting httplib2<1.dev0,>=0.19.0
  Using cached httplib2-0.22.0-py3-none-any.whl (96 kB)
Collecting google-auth!=2.24.0,!=2.25.0,<3.0.0.dev0,>=1.32.0
  Downloading google_auth-2.36.0-py2.py3-none-any.whl (209 kB)
     ------------------------------------- 209.5/209.5 kB 12.5 MB/s eta 0:00:00
Collecting google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5
  Downloading google_api_core-2.23.0-py3-none-any.whl (156 kB)
     ---------------------------------------- 156.6/156.6 kB ? eta 0:00:00
Collecting proto-plus<2.0.0dev,>=1.22.3
  Downloading proto_plus-1.25.0-py3-none-a


[notice] A new release of pip available: 22.3.1 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip available: 22.3.1 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


SyntaxError: invalid syntax (2623378323.py, line 1)

In [1]:
!python -m pip install --upgrade pip setuptools wheel
!pip install --use-pep517 annoy

Collecting annoy
  Using cached annoy-1.17.3.tar.gz (647 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: annoy
  Building wheel for annoy (pyproject.toml): started
  Building wheel for annoy (pyproject.toml): finished with status 'done'
  Created wheel for annoy: filename=annoy-1.17.3-cp312-cp312-win_amd64.whl size=52347 sha256=a4a85729bcdf1a83d706177d76027f4e181258a94b1452268c4e018e610f06d9
  Stored in directory: c:\users\owner\appdata\local\pip\cache\wheels\db\b9\53\a3b2d1fe1743abadddec6aa541294b24fdbc39d7800bc57311
Successfully built annoy
Installing collecte