
# Multilingual ACE Training & Deployment Notebook

This notebook contains a ready-to-run project flow for training and deploying the ACE (Advanced Cyberbullying & Emotion Detection) fusion model, which detects harassment of women and cyber abuse in social media comments (English, Tamil, Malayalam).

Sections: Module 1 to Module 9. Follow the cells in order. Some cells contain optional heavy operations (model training), which you should run in a GPU-enabled environment.


## Module 1: Data Collection

Collect multilingual social media comments from APIs (Facebook Graph API, Twitter API v2, YouTube Data API) and open datasets such as HASOC, DravidianLangTech, and OLID. Save each record with fields: `text`, `language`, `label`. The helper below requires the `text` and `label` columns and treats `language` as optional.

Below are example helper functions for aggregating data and saving to CSV. Replace placeholders with your API keys and dataset paths.


In [None]:
# Example: Aggregate local open datasets and save as unified CSV
import pandas as pd
from pathlib import Path

def load_and_merge_datasets(dataset_paths):
    dfs = []
    for p in dataset_paths:
        df = pd.read_csv(p)
        # Require 'text' and 'label'; 'language' is optional
        for required in ('text', 'label'):
            if required not in df.columns:
                raise ValueError(f'Missing {required} column in {p}')
        # Keep only the unified schema columns that are present
        cols = [c for c in ('text', 'label', 'language') if c in df.columns]
        dfs.append(df[cols])
    return pd.concat(dfs, ignore_index=True)

# Example usage (update paths)
# dataset_paths = ['data/hasoc.csv', 'data/dravidian.csv', 'data/olid.csv']
# merged = load_and_merge_datasets(dataset_paths)
# merged.to_csv('data/multilingual_comments_raw.csv', index=False)
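
Open datasets rarely share column names, so a renaming step is usually needed before `load_and_merge_datasets` can accept them. The sketch below shows one way to map raw records onto the unified schema using only the standard library. The alias names (`tweet`, `category`, `lang`, and so on) are hypothetical placeholders, not the actual headers of HASOC, DravidianLangTech, or OLID; check each dataset's files before relying on them.

```python
import csv
import io

# Hypothetical per-dataset column names; real datasets may use different headers
COLUMN_ALIASES = {
    'text': ('text', 'comment_text', 'tweet'),
    'label': ('label', 'category'),
    'language': ('language', 'lang'),
}

def normalize_row(row, default_language=None):
    """Map one raw record onto the unified text/label/language schema."""
    out = {}
    for target, aliases in COLUMN_ALIASES.items():
        for name in aliases:
            if row.get(name) not in (None, ''):
                out[target] = row[name]
                break
    # Fill in a language only when the dataset does not provide one
    if 'language' not in out and default_language is not None:
        out['language'] = default_language
    missing = [c for c in ('text', 'label') if c not in out]
    if missing:
        raise ValueError(f'Missing required field(s): {missing}')
    return out

def normalize_csv(handle, default_language=None):
    """Normalize every record read from an open CSV file handle."""
    return [normalize_row(r, default_language) for r in csv.DictReader(handle)]

# In-memory stand-in for a dataset file with non-standard headers
raw = io.StringIO('tweet,category\nhello there,NOT\n')
rows = normalize_csv(raw, default_language='en')
print(rows)
```

The normalized rows can then be written out with `csv.DictWriter` or loaded into a DataFrame before merging.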


## Module 2: Data Preprocessing

Tasks: language detection, cleaning (removing URLs and mentions; emojis are handled separately), optional transliteration for Tamil and Malayalam, tokenization, stopword removal, and converting emojis to text with `emoji.demojize`.


In [None]:
# Preprocessing utilities
import re
import pandas as pd
import emoji
try:
    from langdetect import detect
except ImportError:
    # langdetect is not installed: fall back to labelling every text as English
    def detect(x):
        return 'en'
from pathlib import Path

def clean_text(text):
    if not isinstance(text, str):
        return ''
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove mentions; strip the '#' from hashtags but keep the word
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'#', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def demojize_text(text):
    return emoji.demojize(text)

def detect_language_safe(text):
    try:
        return detect(text)
    except Exception:
        return 'en'

# Example pipeline on a CSV

def preprocess_csv(input_csv, output_csv, transliterate=False):
    df = pd.read_csv(input_csv)
    if 'language' not in df.columns:
        df['language'] = df['text'].fillna('').astype(str).apply(detect_language_safe)
    df['clean_text'] = df['text'].fillna('').astype(str).apply(clean_text)
    df['demojized'] = df['clean_text'].apply(demojize_text)
    # Optional transliteration hook: `transliterate` is currently unused; plug in indic_transliteration here
    df.to_csv(output_csv, index=False)
    return df

# Example usage (uncomment to run)
# df = preprocess_csv('data/multilingual_comments_raw.csv', 'data/multilingual_comments_cleaned.csv')
# df.head()
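
Because the fallback above labels everything as `'en'` when detection fails, a cheap script check can help for comments written in native Tamil or Malayalam script. The sketch below is an assumption about how such a check might look, not part of the notebook's pipeline: it counts characters in the Tamil (U+0B80 to U+0BFF) and Malayalam (U+0D00 to U+0D7F) Unicode blocks. The function name `detect_script_language` is hypothetical.

```python
def detect_script_language(text, default='en'):
    """Guess 'ta' or 'ml' from Unicode script blocks, else return the default."""
    counts = {'ta': 0, 'ml': 0}
    for ch in text:
        code = ord(ch)
        if 0x0B80 <= code <= 0x0BFF:      # Tamil block
            counts['ta'] += 1
        elif 0x0D00 <= code <= 0x0D7F:    # Malayalam block
            counts['ml'] += 1
    best = max(counts, key=counts.get)
    # Fall back to the default when neither script appears at all
    return best if counts[best] > 0 else default

print(detect_script_language('நீ எப்படி இருக்கிறாய்'))
print(detect_script_language('സുഖമാണോ'))
print(detect_script_language('Nee loosu ponnu da'))
```

Romanized (transliterated) Tamil or Malayalam contains no characters from these blocks, so it falls through to the default; such text still needs a statistical detector or a transliteration step.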


## Module 3: Feature Extraction

Extract model-specific features:
- CNN: character & n-gram embeddings
- BERT: tokenized inputs via mBERT / IndicBERT tokenizer
- SVM: TF-IDF features
- Emotion: emotion vectors via a pretrained emotion model

Examples below show how to create TF-IDF, tokenized datasets, and a placeholder for emotion extraction.


In [None]:
# Feature extraction examples
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer
import numpy as np

def build_tfidf(texts, max_features=10000):
    vect = TfidfVectorizer(max_features=max_features, ngram_range=(1,2))
    X = vect.fit_transform(texts)
    return vect, X

def get_bert_tokenizer(model_name='bert-base-multilingual-cased'):
    return AutoTokenizer.from_pretrained(model_name)

# Placeholder for emotion extraction: plug in DeepMoji or a transformer-based emotion model here
def extract_emotion_vectors(texts):
    # returns list of float vectors (e.g., anger, joy, sadness, disgust) per text
    return [np.zeros(4).tolist() for _ in texts]

# Example usage
# vect, X = build_tfidf(df['demojized'].tolist())
# tokenizer = get_bert_tokenizer()
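
The cell above covers TF-IDF and BERT tokenization but not the character-level input listed for the CNN. One possible shape for that step is sketched here: build a character vocabulary and encode each text as a fixed-length list of integer ids. The function names, the reserved ids (0 for padding, 1 for unknown), and the `max_len` default are illustrative assumptions; the repository's `models/cnn_model.py` may expect a different encoding.

```python
from collections import Counter

def build_char_vocab(texts, min_count=1):
    """Assign an integer id to every character seen at least min_count times."""
    counts = Counter(ch for t in texts for ch in t)
    vocab = {'<pad>': 0, '<unk>': 1}  # reserved ids (assumed convention)
    for ch, n in sorted(counts.items()):
        if n >= min_count:
            vocab[ch] = len(vocab)
    return vocab

def encode_chars(text, vocab, max_len=16):
    """Truncate or right-pad a text to max_len character ids."""
    ids = [vocab.get(ch, vocab['<unk>']) for ch in text[:max_len]]
    return ids + [vocab['<pad>']] * (max_len - len(ids))

vocab = build_char_vocab(['ab', 'ba'])
print(encode_chars('abz', vocab, max_len=5))
```

Working at the character level keeps Tamil and Malayalam script, romanized spellings, and demojized tokens in one vocabulary without language-specific tokenizers.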


## Module 4: Model Training

Train CNN, BERT, SVM, and Emotion models separately. The repository already contains trainer classes for BERT and CNN (`models/bert_model.py`, `models/cnn_model.py`) and the ACE ensemble (`models/ensemble.py`). The example below shows how to use `train/train_pipeline.py` or the new `train/multilingual_training_runner.py` to run a full training pipeline.

In [None]:
# Run full training using the ACETrainer via the multilingual runner (recommended)
# This cell invokes the runner script programmatically. For heavy training, run via CLI on a GPU machine.
import subprocess
import sys
from pathlib import Path

data_csv = 'data/processed/multilingual_processed.csv'
if Path(data_csv).exists():
    print('Starting training runner (this may take a long time on CPU)')
    # Example CLI call (uncomment to run); sys.executable reuses this notebook's interpreter,
    # and check=True raises if the runner exits with an error
    # subprocess.run([sys.executable, 'train/multilingual_training_runner.py', '--data', data_csv, '--config', 'deployment/config_multilingual.yaml'], check=True)
else:
    print('Processed training CSV not found. Run preprocessing cells first.')
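
A slightly more defensive way to launch the runner is to build the argument list once and verify that every input exists before starting a long job. The sketch below reuses the script path and the `--data` / `--config` flags from the commented call above; the helper names `build_runner_command` and `missing_inputs` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

RUNNER = 'train/multilingual_training_runner.py'

def build_runner_command(data_csv, config_path, runner=RUNNER):
    """Return the argv list for the training runner, using the current interpreter."""
    return [sys.executable, runner, '--data', str(data_csv), '--config', str(config_path)]

def missing_inputs(*paths):
    """List the paths that do not exist so they can be reported before launching."""
    return [str(p) for p in paths if not Path(p).exists()]

cmd = build_runner_command('data/processed/multilingual_processed.csv',
                           'deployment/config_multilingual.yaml')
missing = missing_inputs(RUNNER, cmd[3], cmd[5])
if missing:
    print('Not launching; missing:', missing)
else:
    print('Would run:', ' '.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to launch the training job
```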


## Module 5: ACE Fusion Layer

Implement the ACE fusion algorithm to combine model outputs. The repository's `models/ensemble.py` already provides an `ACEEnsemble` class. Below is a conceptual example that scores one candidate weight set for a weighted fusion; weights can then be chosen by grid search or simple optimization, for example to reduce false positives.

In [None]:
# Example: evaluate one candidate weight set (building block for a grid search; toy example)
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate_weights(base_scores, labels, weights):
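    # base_scores: dict of arrays: {'bert': [...], 'cnn': [...], 'emotion': [...], 'svm': [...]}
    # Use a float accumulator so integer-valued inputs do not break the in-place add;
    # the 0.5 threshold below assumes the weights sum to 1
    scores = np.zeros(len(labels), dtype=float)
    for k, w in weights.items():
        scores += w * np.asarray(base_scores[k], dtype=float)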
    preds = (scores >= 0.5).astype(int)
    return f1_score(labels, preds), precision_score(labels, preds), recall_score(labels, preds)

# Example usage with dummy data
# base_scores = {'bert': np.random.rand(100), 'cnn': np.random.rand(100), 'emotion': np.random.rand(100), 'svm': np.random.rand(100)}
# labels = np.random.randint(0,2,100)
# weights = {'bert':0.4,'cnn':0.2,'emotion':0.3,'svm':0.1}
# print(evaluate_weights(base_scores, labels, weights))
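
The grid search itself could look like the following minimal sketch. It enumerates weight vectors whose entries are multiples of `1/steps` and sum to 1, fuses the base scores, and keeps the best candidate. It is written in plain Python so it runs without NumPy or scikit-learn. Ranking by F1 and breaking ties on precision is an assumption that reflects the stated goal of fewer false positives; it is not necessarily what `ACEEnsemble` does.

```python
from itertools import product

def f1_precision_recall(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, precision, recall

def simplex_grid(names, steps=4):
    """Yield weight dicts whose values are multiples of 1/steps and sum to 1."""
    for combo in product(range(steps + 1), repeat=len(names)):
        if sum(combo) == steps:
            yield {n: c / steps for n, c in zip(names, combo)}

def grid_search_weights(base_scores, labels, steps=4, threshold=0.5):
    names = sorted(base_scores)
    best = None
    for weights in simplex_grid(names, steps):
        fused = [sum(weights[n] * base_scores[n][i] for n in names)
                 for i in range(len(labels))]
        preds = [int(s >= threshold) for s in fused]
        f1, precision, recall = f1_precision_recall(labels, preds)
        key = (f1, precision)  # F1 first, then precision as the tie-break
        if best is None or key > best[0]:
            best = (key, weights, recall)
    (f1, precision), weights, recall = best
    return weights, {'f1': f1, 'precision': precision, 'recall': recall}

# Toy data: 'bert' separates the classes, 'svm' is anti-correlated with the labels
labels = [1, 0, 1, 0]
base_scores = {'bert': [0.9, 0.2, 0.8, 0.1], 'svm': [0.4, 0.6, 0.3, 0.7]}
best_weights, metrics = grid_search_weights(base_scores, labels)
print(best_weights, metrics)
```

The number of candidates grows quickly with `steps` and the number of models, so a coarse grid on a held-out validation set is a reasonable starting point before trying a proper optimizer.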


## Module 6: Women Harassment Detection Layer

Fine-tune a multilingual BERT (mBERT / IndicBERT) specifically on examples of harassment targeting women. This model acts as an additional specialized detector, and its score can be used as another input to the ACE fusion.

In [None]:
# Fine-tuning outline (use models/bert_model.py trainer)
import torch
from models.bert_model import create_bert_model, BERTAbuseDataset, BERTAbuseDetectorTrainer

config = {'model_name':'bert-base-multilingual-cased', 'max_length':256, 'batch_size':16}
model, tokenizer = create_bert_model(config)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
trainer = BERTAbuseDetectorTrainer(model, tokenizer, device=device)
# Prepare dataset where 'women_harassment' label is used to fine-tune the specialized model
# dataset = BERTAbuseDataset(women_texts, women_labels, tokenizer, max_length=256)
# trainer.train(train_loader, val_loader, num_epochs=3, learning_rate=2e-5)
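
Harassment examples are typically a small minority, so the train/validation split that feeds the data loaders should preserve the label ratio. The following is a minimal standard-library sketch of a stratified split; the function name, the 20% validation fraction, and the seed are assumptions. scikit-learn's `train_test_split` with its `stratify` argument is the usual alternative when it is available.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, val_fraction=0.2, seed=42):
    """Split into train/validation sets while keeping each label's share."""
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # Keep at least one validation example per class that has 2+ items
        n_val = max(1, round(len(idxs) * val_fraction)) if len(idxs) > 1 else 0
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])

    def pick(idxs):
        idxs = sorted(idxs)
        return [texts[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx), pick(val_idx)

texts = [f'comment {i}' for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
(train_texts, train_labels), (val_texts, val_labels) = stratified_split(texts, labels)
print(len(train_texts), len(val_texts), sorted(val_labels))
```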

## Module 7: Output & Evaluation

Evaluate each model and the fusion layer. Use accuracy, precision, recall, F1. Display sample predictions with confidence, language, and emotion. Visualize confusion matrices and metric trends.

In [None]:
# Example evaluation using ACEEnsemble (if trained and saved)
import yaml
from models.ensemble import create_ace_ensemble

# Load config and create ensemble
with open('deployment/config_multilingual.yaml') as f:
    cfg = yaml.safe_load(f)
ace = create_ace_ensemble(cfg)
# ace.load_model('models/saved/ace_ensemble')  # load if available

# Example single prediction
sample = 'Nee loosu ponnu da 😏'
print(ace.predict_single(sample))
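
Since the system targets three languages, it can help to report the metrics per language as well as overall. The sketch below computes accuracy, precision, recall, F1, and a confusion matrix for binary labels, grouped by a language tag. It is a plain-Python illustration with hypothetical function names; the confusion matrix uses the layout `[[tn, fp], [fn, tp]]` (rows are true labels, columns are predictions).

```python
from collections import defaultdict

def binary_metrics(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        'accuracy': (tp + tn) / total if total else 0.0,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'confusion': [[tn, fp], [fn, tp]],
    }

def metrics_by_language(labels, preds, languages):
    """Group labels and predictions by language, then score each group."""
    groups = defaultdict(lambda: ([], []))
    for y, p, lang in zip(labels, preds, languages):
        groups[lang][0].append(y)
        groups[lang][1].append(p)
    return {lang: binary_metrics(ys, ps) for lang, (ys, ps) in groups.items()}

# Toy predictions for illustration only
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
languages = ['en', 'en', 'ta', 'ta', 'ml', 'ml']
report = metrics_by_language(labels, preds, languages)
for lang, m in report.items():
    print(lang, m)
```

A large gap between languages in such a report is often a sign that one language is under-represented in the training data.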

## Module 8: API Integration

Wrap the trained ACE model into a FastAPI REST API. The repository already contains `api/serve.py` which uses `ACEAPIServer`. Example usage below shows how to run the API locally and test the `/predict` endpoint.

In [None]:
# Run the server (example)
# In a terminal, run:
# uvicorn api.serve:app --host 0.0.0.0 --port 8000 --reload

# Example: test predict endpoint
import requests

# url = 'http://127.0.0.1:8000/predict'
# payload = {'text':'Nee loosu ponnu da 😏', 'include_explanation':True, 'include_preprocessing':True}
# r = requests.post(url, json=payload)
# print(r.json())
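
For environments where `requests` is not installed, the same call can be made with the standard library. The sketch below builds the POST request separately from sending it, so the payload can be inspected without a running server. The URL and the `text` / `include_explanation` fields are taken from the example above; the helper names and the 10-second timeout are assumptions, and the response schema is whatever `api/serve.py` returns.

```python
import json
import urllib.request

def build_predict_request(text, base_url='http://127.0.0.1:8000',
                          include_explanation=True):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    payload = {'text': text, 'include_explanation': include_explanation}
    body = json.dumps(payload, ensure_ascii=False).encode('utf-8')
    return urllib.request.Request(
        f'{base_url}/predict',
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

def send_request(request, timeout=10):
    """Send the request and decode the JSON response; requires a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf-8'))

req = build_predict_request('Nee loosu ponnu da 😏')
print(req.get_method(), req.full_url)
# print(send_request(req))  # uncomment once the API server is running
```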

## Module 9: Future Enhancements

Stubs and ideas for audio/video detection (Whisper for speech-to-text, Vision Transformer for facial expression), multimodal fusion, and real-time plugins for social platforms.

In [None]:
# Audio processing with Whisper
def process_audio(audio_file):
    """
    Stub for audio processing using Whisper ASR
    """
    # TODO: Implement Whisper ASR integration
    # import whisper
    # model = whisper.load_model("base")
    # result = model.transcribe(audio_file)
    # return result["text"]
    pass

# Video processing with Vision Transformer
def process_video(video_file):
    """
    Stub for video processing using Vision Transformer
    """
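    # TODO: Implement Vision Transformer for facial expression analysis
    # from transformers import pipeline
    # Note: the checkpoint below is a generic ImageNet classifier (Swin Transformer);
    # facial expression analysis would need a checkpoint fine-tuned for that task
    # classifier = pipeline("image-classification", model="microsoft/swin-tiny-patch4-window7-224")
    # Process video frames and analyze expressions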
    pass

# Multimodal fusion
def multimodal_fusion(text_score, audio_score, video_score):
    """
    Stub for fusing predictions from multiple modalities
    """
    # TODO: Implement weighted fusion of predictions
    # weights = [0.6, 0.2, 0.2]  # Example weights for text, audio, video
    # final_score = np.average([text_score, audio_score, video_score], weights=weights)
    # return final_score
    pass

# Social media plugin interfaces
class SocialMediaPlugin:
    """
    Base class for social media platform integrations
    """
    def __init__(self, platform):
        self.platform = platform
    
    def connect(self):
        """Establish connection to platform API"""
        pass
    
    def monitor(self):
        """Monitor content stream"""
        pass
    
    def alert(self, content, prediction):
        """Send alerts for detected harassment"""
        pass
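
As an illustration of how the `multimodal_fusion` stub might be filled in, the sketch below computes a weighted average over only those modalities that produced a score, renormalizing the weights when audio or video is absent. The example weights come from the stub's comment; the function name `late_fusion` and the use of `None` for a missing modality are assumptions.

```python
def late_fusion(scores, weights):
    """Weighted average over the modalities that actually produced a score."""
    available = {m: s for m, s in scores.items() if s is not None}
    if not available:
        raise ValueError('No modality produced a score')
    total = sum(weights[m] for m in available)
    if total <= 0:
        raise ValueError('Weights of the available modalities must be positive')
    # Dividing by the available weight mass renormalizes for missing modalities
    return sum(weights[m] * s for m, s in available.items()) / total

weights = {'text': 0.6, 'audio': 0.2, 'video': 0.2}  # example weights from the stub
text_only = late_fusion({'text': 0.9, 'audio': None, 'video': None}, weights)
text_audio = late_fusion({'text': 0.9, 'audio': 0.5, 'video': None}, weights)
print(text_only, text_audio)
```

With only text present the fused score equals the text score, and with text and audio it is approximately 0.8, so a text-only comment is not penalized for lacking audio or video.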

### Implementation Notes

1. **Audio Processing**
   - Whisper ASR can handle multilingual speech recognition
   - Supports 96+ languages including English, Tamil, Malayalam
   - Whisper itself only transcribes (and can translate) speech; detecting emotional tone from speech patterns would need a separate audio model

2. **Video Analysis**
   - Vision Transformer for facial expression recognition
   - Frame-by-frame analysis with temporal aggregation
   - Optional: Action recognition for gesture analysis

3. **Multimodal Fusion**
   - Weighted combination of predictions from different modalities
   - Adaptive weights based on confidence scores
   - Late fusion strategy for flexibility

4. **Social Media Integration**
   - Plugin architecture for different platforms
   - Real-time monitoring and alert system
   - Privacy-preserving analysis pipeline
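
The "temporal aggregation" step in the Video Analysis notes above could take many forms. One simple possibility, sketched here as an assumption rather than a chosen design, is to smooth the per-frame scores with a moving average and report the peak, so that a single noisy frame does not dominate while a sustained expression still does.

```python
def aggregate_frame_scores(frame_scores, window=3):
    """Smooth per-frame scores with a moving average, then take the peak."""
    if not frame_scores:
        raise ValueError('frame_scores is empty')
    window = min(window, len(frame_scores))
    smoothed = [
        sum(frame_scores[i:i + window]) / window
        for i in range(len(frame_scores) - window + 1)
    ]
    return max(smoothed)

scores = [0.1, 0.9, 0.1, 0.1]  # one isolated high-scoring frame
print(aggregate_frame_scores(scores, window=1))
print(aggregate_frame_scores(scores, window=2))
```

A larger window damps isolated spikes more strongly; the resulting video-level score can then feed the late fusion step alongside the text and audio scores.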

These enhancements would make the harassment detection system more robust by considering multiple modalities of communication and enabling real-time monitoring across social platforms.