 Hate Speech Detector (Hindi and Tamil) 
 Detects hate and offensive speech in Hindi and Tamil text using `XLM-RoBERTa`.
 🔍 Use Case
- Moderate content on multilingual Indian social platforms (Facebook, ShareChat)
- Flag caste-based, religious, and political hate speech
- Detect hate speech in code-mixed regional content


In [None]:
!jupyter nbextension enable --py widgetsnbextension

In [None]:
!pip install --force-reinstall sympy==1.13.1
print("Success")

In [2]:
import sympy
import mpmath
print(f"SymPy version: {sympy.__version__}")
print(f"mpmath version: {mpmath.__version__}")
print("Success - packages imported correctly!")

SymPy version: 1.13.1
mpmath version: 1.3.0
Success - packages imported correctly!


In [4]:
!pip install transformers datasets nltk scikit-learn pandas
print("Success")

Defaulting to user installation because normal site-packages is not writeable
Success


📥 Load and Combine Datasets

In [3]:
!pip install "numpy<2"

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import numpy as np
import pandas as pd
print("✓ Both packages working!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy location: {np.__file__}")
print(f"Pandas location: {pd.__file__}")

✓ Both packages working!
NumPy version: 1.26.4
Pandas version: 2.2.2
NumPy location: C:\ProgramData\anaconda3\Lib\site-packages\numpy\__init__.py
Pandas location: C:\ProgramData\anaconda3\Lib\site-packages\pandas\__init__.py



In [20]:
import pandas as pd

# Read each file once, inspect its columns, then keep only the text and label columns
hindi_train_raw = pd.read_csv("Hatespeech-Hindi_Train.csv")
hindi_val_raw = pd.read_csv("Hatespeech-Hindi_Valid.csv")
print(hindi_train_raw.columns)
print(hindi_val_raw.columns)

hindi_train = hindi_train_raw[["Post", "Labels Set"]]
hindi_val = hindi_val_raw[["Post", "Labels Set"]]
print("Read successfully")



Index(['Unique ID', 'Post', 'Labels Set'], dtype='object')
Index(['Unique ID', 'Post', 'Labels Set'], dtype='object')
Read successfully


In [23]:
import pandas as pd

# Tamil files
tamil_train = pd.read_csv("tamil_offensive_speech_train.csv")[["comment", "label"]]
tamil_val = pd.read_csv("tamil_offensive_speech_val.csv")[["comment", "label"]]

# Rename 'comment' column to 'text' to be consistent with Hindi dataset
tamil_train = tamil_train.rename(columns={'comment': 'text'})
tamil_val = tamil_val.rename(columns={'comment': 'text'})
tamil_train["lang"] = "ta"
tamil_val["lang"] = "ta"

# Hindi files
hindi_train = pd.read_csv("Hatespeech-Hindi_Train.csv")[["Post", "Labels Set"]]
hindi_val = pd.read_csv("Hatespeech-Hindi_Valid.csv")[["Post", "Labels Set"]]

# Rename columns to be consistent
hindi_train = hindi_train.rename(columns={'Post': 'text', 'Labels Set': 'label'})
hindi_val = hindi_val.rename(columns={'Post': 'text', 'Labels Set': 'label'})
print("Renamed Successfully")

hindi_train["lang"] = "hi"
hindi_val["lang"] = "hi"

# Combine into df
df = pd.concat([tamil_train, tamil_val, hindi_train, hindi_val], ignore_index=True)
df.dropna(inplace=True)

# View data
df.head()

Renamed Successfully


                                                text  label  lang
0                  omg that bgm make me goosebumb...      0    ta
1         neraya neraya neraya neraya neraya neraya.      0    ta
2  thalaivar mersal look .semma massss thalaiva ....      0    ta
3  paaaa... repeat mode.... adra adra adraaaaa......      0    ta
4  epaa ena panaporam... sweet sapade poram... aw...      0    ta


🧹 Clean and Preprocess Text

In [7]:
import pandas as pd

# Reload the Tamil training set only (this overwrites the combined df built above)
df = pd.read_csv("tamil_offensive_speech_train.csv")

# Rename 'comment' to 'text'
df = df.rename(columns={'comment': 'text'})
print(df.columns)

Index(['label', 'text'], dtype='object')


In [8]:
import re
import pandas as pd

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Assuming df is your dataframe and has columns 'text' and 'label'

# Clean text column
df['text'] = df['text'].apply(clean_text)

def map_label(label):
    if pd.isna(label):
        return None
    label = str(label).lower()

    if label in ['0', 'non-hostile', 'normal']:
        return 0
    elif 'hate' in label:
        return 2
    elif any(word in label for word in ['offensive', 'defamation', 'fake']):
        return 1
    elif label in ['1']:
        return 1
    else:
        return 0  # default to normal if unclear

df['label'] = df['label'].apply(map_label)

print(f"Rows with missing labels: {df['label'].isna().sum()}")
df.dropna(subset=['label'], inplace=True)
df['label'] = df['label'].astype(int)

# Drop rows whose text is empty after cleaning (missing text became the string 'nan' in clean_text)
df = df[df['text'].str.strip() != '']

print("Final sample:")
print(df.head())



Rows with missing labels: 0
Final sample:
   label                                               text
0      0                     omg that bgm make me goosebumb
1      0          neraya neraya neraya neraya neraya neraya
2      0  thalaivar mersal look semma massss thalaiva th...
3      0  paaaa repeat mode adra adra adraaaaa vera leve...
4      0  epaa ena panaporam sweet sapade poram awesome ...
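
One caveat with `clean_text`: in Python's built-in `re` module, `\w` does not match Unicode combining marks, so the `[^\w\s]` substitution also strips Devanagari and Tamil vowel signs from native-script text. The romanized Tamil sample above is unaffected. The sketch below is one possible alternative and is not part of the original notebook. `clean_text_unicode` is a hypothetical name; it decides what to keep from Unicode categories instead of `\w`.

```python
import re
import unicodedata

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def clean_text_unicode(text):
    """Lowercase, strip URLs and punctuation, but keep combining marks (hypothetical helper)."""
    text = URL_PATTERN.sub('', str(text).lower())
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # L* = letters, M* = combining marks (vowel signs, virama), N* = numbers
        if category[0] in ('L', 'M', 'N') or ch == '_' or ch.isspace():
            kept.append(ch)
    # Collapse runs of whitespace, as the original clean_text does
    return re.sub(r'\s+', ' ', ''.join(kept)).strip()

sample = "यह बहुत अच्छा है!!! https://example.com"
print(clean_text_unicode(sample))

# For comparison: the original pattern drops the virama and vowel sign
print(re.sub(r'[^\w\s]', '', "अच्छा"))
```

If the Hindi posts are in Devanagari, swapping this in for `clean_text` would be worth checking against a few rows before retraining.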


In [9]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'], df['label'], test_size=0.2, stratify=df['label'], random_state=42
)

In [10]:
print(df['label'].value_counts())


label
0    21224
1     6649
Name: count, dtype: int64


In [11]:
print("Train size:", len(train_texts))
print("Validation size:", len(val_texts))
print("Train label distribution:\n", pd.Series(train_labels).value_counts())
print("Validation label distribution:\n", pd.Series(val_labels).value_counts())


Train size: 22298
Validation size: 5575
Train label distribution:
 label
0    16979
1     5319
Name: count, dtype: int64
Validation label distribution:
 label
0    4245
1    1330
Name: count, dtype: int64
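
The stratified split keeps the roughly 3:1 ratio of label 0 to label 1 in both sets, so accuracy alone can look better than it is. As a rough sketch using the counts printed above, the snippet below computes the accuracy of always predicting the majority class, plus "balanced" class weights of the form `n_samples / (n_classes * count)`. Using such weights in the loss is a suggestion here, not something the notebook does.

```python
# Label counts copied from the value_counts() output above
label_counts = {0: 21224, 1: 6649}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# Accuracy obtained by always predicting the most frequent label
majority_baseline = max(label_counts.values()) / n_samples

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in label_counts.items()
}

print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
for label, weight in class_weights.items():
    print(f"Weight for label {label}: {weight:.4f}")
```

A validation accuracy close to the baseline would mean the model has learned little beyond the class prior. The weights could be passed as a tensor to `torch.nn.CrossEntropyLoss(weight=...)` if the minority class turns out to be under-predicted.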


🤖 Tokenize using XLM-RoBERTa

In [5]:
!pip install torch torchvision torchaudio

Defaulting to user installation because normal site-packages is not writeable


In [7]:
import transformers
print(transformers.__version__)
import torch
print(torch.__version__)

4.52.4
2.7.0+cpu


In [8]:
import sys
print(sys.executable)


C:\ProgramData\anaconda3\python.exe


In [7]:
!pip uninstall torch torchvision torchaudio transformers -y
!pip cache purge



Files removed: 0




In [9]:
!pip install torch torchvision torchaudio transformers accelerate datasets pandas scikit-learn numpy

Defaulting to user installation because normal site-packages is not writeable

In [10]:
# Test if everything is installed correctly
import sys
print("Python version:", sys.version)

packages_to_test = [
    'transformers', 'accelerate', 'datasets', 
    'torch', 'pandas', 'sklearn', 'numpy'
]

for package in packages_to_test:
    try:
        __import__(package)
        print(f"✅ {package} - OK")
    except ImportError as e:
        print(f"❌ {package} - Missing: {e}")

Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
✅ transformers - OK
✅ accelerate - OK
✅ datasets - OK
✅ torch - OK
✅ pandas - OK
✅ sklearn - OK
✅ numpy - OK


In [5]:
import transformers
import accelerate
import torch

print(f"Transformers version: {transformers.__version__}")
print(f"Accelerate version: {accelerate.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Transformers version: 4.52.4
Accelerate version: 1.7.0
PyTorch version: 2.6.0+cpu
CUDA available: False


In [10]:
# Test each import separately
print("Importing pandas...")
import pandas as pd
print("✓ Pandas imported successfully")

print("Importing sklearn...")
from sklearn.model_selection import train_test_split
print("✓ Sklearn imported successfully")

print("Importing numpy...")
import numpy as np
print("✓ Numpy imported successfully")

print("Importing torch...")
import torch
print("✓ Torch imported successfully")

print("Importing datasets...")
from datasets import Dataset
print("✓ Datasets imported successfully")

print("Importing transformers...")
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
print("✓ Transformers imported successfully")

print("Setting environment...")
import os
import glob
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print("✓ All imports completed successfully!")

Importing pandas...
✓ Pandas imported successfully
Importing sklearn...
✓ Sklearn imported successfully
Importing numpy...
✓ Numpy imported successfully
Importing torch...
✓ Torch imported successfully
Importing datasets...
✓ Datasets imported successfully
Importing transformers...


ImportError: cannot import name 'PreTrainedModel' from 'transformers' (C:\Users\mathu\AppData\Roaming\Python\Python312\site-packages\transformers\__init__.py)
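
The traceback points at a `transformers` copy under the user site-packages directory, while NumPy and pandas resolved from the Anaconda directory earlier, and `torch` reported `2.7.0+cpu` in one cell and `2.6.0+cpu` in another. This suggests, but does not prove, that two sets of packages are mixed. The sketch below is a hypothetical diagnostic, not part of the original notebook. It uses only the standard library to show where each package would be imported from, without importing it.

```python
import importlib.util
import os
import site

def locate(package_name):
    """Return the path a top-level package would be imported from, or None if not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None
    return spec.origin or "(namespace package)"

def classify(origin, user_site):
    """Label an import path as coming from the user site directory or elsewhere."""
    if origin is None:
        return "not found"
    origin_norm = os.path.normcase(os.path.abspath(origin))
    user_norm = os.path.normcase(os.path.abspath(user_site))
    return "user site" if origin_norm.startswith(user_norm) else "environment"

user_site = site.getusersitepackages()
print(f"User site-packages: {user_site}")
for name in ["transformers", "torch", "numpy", "pandas"]:
    origin = locate(name)
    print(f"{name:<14} {classify(origin, user_site):<12} {origin}")
```

If some packages resolve from the user site and others from the Anaconda environment, reinstalling them all into one location (for example, a fresh conda environment) is one way to rule out version mismatches.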

In [None]:
# Replacement for the failing cell above: a manual PyTorch training loop
# that avoids importing Trainer from transformers

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset as TorchDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup
)
from datasets import Dataset
import os
from tqdm import tqdm

# Set environment
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("✅ All imports successful - no Trainer issues!")

class CustomDataset(TorchDataset):
    """Custom PyTorch Dataset for text classification"""
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

def preprocess_hindi_labels(label_str):
    """Convert Hindi multi-label format to binary classification"""
    if pd.isna(label_str) or label_str == 'non-hostile':
        return 0  # Non-hate
    else:
        return 1  # Hate (any form of hate/offensive/defamation)

def load_and_prepare_combined_data():
    """Load and prepare the combined Tamil-Hindi datasets"""
    try:
        print("📂 Loading Tamil datasets...")
        # Tamil files
        tamil_train = pd.read_csv("tamil_offensive_speech_train.csv")[["comment", "label"]]
        tamil_val = pd.read_csv("tamil_offensive_speech_val.csv")[["comment", "label"]]
        
        # Rename 'comment' column to 'text' to be consistent with Hindi dataset
        tamil_train = tamil_train.rename(columns={'comment': 'text'})
        tamil_val = tamil_val.rename(columns={'comment': 'text'})
        tamil_train["lang"] = "ta"
        tamil_val["lang"] = "ta"
        
        print(f"✅ Tamil train: {len(tamil_train)} samples")
        print(f"✅ Tamil validation: {len(tamil_val)} samples")
        print(f"📊 Tamil label distribution:")
        print(tamil_train['label'].value_counts())
        
        print("\n📂 Loading Hindi datasets...")
        # Hindi files
        hindi_train = pd.read_csv("Hatespeech-Hindi_Train.csv")[["Post", "Labels Set"]]
        hindi_val = pd.read_csv("Hatespeech-Hindi_Valid.csv")[["Post", "Labels Set"]]
        
        # Rename columns to be consistent
        hindi_train = hindi_train.rename(columns={'Post': 'text', 'Labels Set': 'label'})
        hindi_val = hindi_val.rename(columns={'Post': 'text', 'Labels Set': 'label'})
        
        print(f"✅ Hindi train: {len(hindi_train)} samples")
        print(f"✅ Hindi validation: {len(hindi_val)} samples")
        print(f"📊 Original Hindi label distribution:")
        print(hindi_train['label'].value_counts())
        
        # Process Hindi labels to binary format
        hindi_train['label'] = hindi_train['label'].apply(preprocess_hindi_labels)
        hindi_val['label'] = hindi_val['label'].apply(preprocess_hindi_labels)
        
        print(f"📊 Processed Hindi label distribution:")
        print(hindi_train['label'].value_counts())
        
        hindi_train["lang"] = "hi"
        hindi_val["lang"] = "hi"
        
        print("\n🔄 Combining datasets...")
        # Combine into df
        df = pd.concat([tamil_train, tamil_val, hindi_train, hindi_val], ignore_index=True)
        df.dropna(inplace=True)
        
        print(f"✅ Combined dataset: {len(df)} samples")
        print(f"📊 Final label distribution:")
        print(df['label'].value_counts())
        print(f"📊 Language distribution:")
        print(df['lang'].value_counts())
        
        # Display sample data
        print(f"\n🔍 Sample data:")
        print(df.head())
        
        return df
        
    except FileNotFoundError as e:
        print(f"❌ File not found: {e}")
        print("📁 Make sure all CSV files are in the same directory")
        return None

def train_model(model, train_dataloader, val_dataloader, device, epochs=3, lr=2e-5):
    """Manual training loop - replaces Trainer"""
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = len(train_dataloader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=total_steps
    )
    
    model.to(device)
    best_val_accuracy = 0
    
    for epoch in range(epochs):
        print(f'\n📚 Epoch {epoch + 1}/{epochs}')
        print('-' * 50)
        
        # Training phase
        model.train()
        total_train_loss = 0
        train_predictions = []
        train_true = []
        
        train_pbar = tqdm(train_dataloader, desc="Training")
        for batch in train_pbar:
            optimizer.zero_grad()
            
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(input_ids=input_ids, 
                          attention_mask=attention_mask, 
                          labels=labels)
            
            loss = outputs.loss
            total_train_loss += loss.item()
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            
            predictions = torch.argmax(outputs.logits, dim=-1)
            train_predictions.extend(predictions.cpu().numpy())
            train_true.extend(labels.cpu().numpy())
            
            train_pbar.set_postfix({'loss': f'{loss.item():.4f}'})
        
        avg_train_loss = total_train_loss / len(train_dataloader)
        train_accuracy = accuracy_score(train_true, train_predictions)
        
        print(f'📊 Training Loss: {avg_train_loss:.4f} | Accuracy: {train_accuracy:.4f}')
        
        # Validation phase
        model.eval()
        total_eval_loss = 0
        eval_predictions = []
        eval_true = []
        
        with torch.no_grad():
            val_pbar = tqdm(val_dataloader, desc="Validation")
            for batch in val_pbar:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                
                outputs = model(input_ids=input_ids, 
                              attention_mask=attention_mask, 
                              labels=labels)
                
                loss = outputs.loss
                total_eval_loss += loss.item()
                
                predictions = torch.argmax(outputs.logits, dim=-1)
                eval_predictions.extend(predictions.cpu().numpy())
                eval_true.extend(labels.cpu().numpy())
        
        avg_val_loss = total_eval_loss / len(val_dataloader)
        val_accuracy = accuracy_score(eval_true, eval_predictions)
        
        print(f'📈 Validation Loss: {avg_val_loss:.4f} | Accuracy: {val_accuracy:.4f}')
        
        # Track the best validation accuracy (no checkpoint is written here)
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            print(f'🏆 New best validation accuracy: {best_val_accuracy:.4f}')
        
        # Print classification report for last epoch
        if epoch == epochs - 1:
            print("\n📋 Final Classification Report:")
            print(classification_report(eval_true, eval_predictions, target_names=['Non-Hate', 'Hate']))
    
    return model

def run_multilingual_hate_speech_training():
    """Main function to run multilingual hate speech detection training"""
    
    print("🚀 Starting Multilingual Hate Speech Detection Training (Tamil + Hindi)")
    print("=" * 80)
    
    # 1. Load and prepare combined data
    df = load_and_prepare_combined_data()
    
    if df is None:
        print("❌ Failed to load data. Exiting...")
        return None, None  # keep the (model, tokenizer) shape so the caller can unpack it
    
    # 2. Split the data (stratified by label only; language is not part of the key)
    print(f"\n🔄 Splitting combined data...")
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        df['text'].tolist(),
        df['label'].tolist(),
        test_size=0.2,
        random_state=42,
        stratify=df['label']  # Stratify by label
    )
    
    print(f"📚 Training samples: {len(train_texts)}")
    print(f"📝 Validation samples: {len(val_texts)}")
    
    # 3. Load model and tokenizer for multilingual support
    # Using multilingual models that support both Tamil and Hindi
    model_name = "xlm-roberta-base"  # Better for Tamil + Hindi
    # Alternative: "bert-base-multilingual-cased" or "distilbert-base-multilingual-cased"
    
    print(f"\n🤖 Loading multilingual model: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    num_labels = len(set(df['label']))
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        ignore_mismatched_sizes=True
    )
    
    print(f"✅ Model loaded with {num_labels} labels")
    
    # 4. Create datasets and data loaders
    print(f"\n📦 Creating datasets...")
    train_dataset = CustomDataset(train_texts, train_labels, tokenizer, max_length=128)
    val_dataset = CustomDataset(val_texts, val_labels, tokenizer, max_length=128)
    
    batch_size = 16
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
    print(f"✅ Data loaders created (batch size: {batch_size})")
    
    # 5. Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"💻 Using device: {device}")
    
    # 6. Train the model
    print(f"\n🎯 Starting training...")
    trained_model = train_model(
        model, 
        train_dataloader, 
        val_dataloader, 
        device,
        epochs=5,  # Increased for multilingual training
        lr=2e-5
    )
    
    # 7. Save the model
    output_dir = "./trained_multilingual_hate_speech_model"
    print(f"\n💾 Saving model to {output_dir}...")
    os.makedirs(output_dir, exist_ok=True)
    trained_model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    
    print(f"\n🎉 Training completed successfully!")
    print(f"📁 Model saved to: {output_dir}")
    print(f"🚀 You can now use this model for multilingual hate speech detection!")
    
    return trained_model, tokenizer

def test_multilingual_predictions(model, tokenizer, device):
    """Test the trained model on sample texts in both languages"""
    model.eval()
    
    test_texts = [
        # Tamil examples
        "இது ஒரு நல்ல செய்தி",  # This is good news
        "அருமையான வேலை",  # Excellent work
        
        # Hindi examples  
        "यह बहुत अच्छा है",  # This is very good
        "बहुत बढ़िया काम"  # Very good work
    ]
    
    print("\n🧪 Testing multilingual predictions:")
    print("-" * 50)
    
    for text in test_texts:
        encoding = tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=128,
            return_tensors='pt'
        )
         
        with torch.no_grad():
            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            prediction = torch.argmax(outputs.logits, dim=-1)
            confidence = torch.softmax(outputs.logits, dim=-1)
            
        pred_label = "Hate" if prediction.item() == 1 else "Non-Hate"
        conf_score = confidence.max().item()
        
        print(f"Text: '{text}'")
        print(f"Prediction: {pred_label} (Confidence: {conf_score:.4f})")
        print()

# Run the training
if __name__ == "__main__":
    trained_model, tokenizer = run_multilingual_hate_speech_training()
    
    # Test the model if training was successful
    if trained_model is not None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        test_multilingual_predictions(trained_model, tokenizer, device)
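
The split in `run_multilingual_hate_speech_training` passes only `df['label']` to `stratify`, so the Tamil/Hindi proportions may differ between the training and validation sets. As a minimal sketch of stratifying on label and language together, the pure-Python helper below groups rows by a joint key and splits each group separately. `stratified_split` and the toy rows are hypothetical and only illustrate the idea.

```python
import random
from collections import defaultdict

def stratified_split(rows, key_fn, test_size=0.2, seed=42):
    """Split rows so every stratum (defined by key_fn) keeps about the same share in both parts."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)

    rng = random.Random(seed)
    train, val = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        n_val = round(len(members) * test_size)
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, val

# Toy rows standing in for the combined DataFrame: (text, label, lang); values are made up
rows = [(f"text {i}", i % 2, "ta" if i % 4 < 2 else "hi") for i in range(40)]

# Stratify on the (label, lang) pair rather than on the label alone
train_rows, val_rows = stratified_split(rows, key_fn=lambda r: (r[1], r[2]))
print(len(train_rows), len(val_rows))
```

With pandas and scikit-learn, a similar effect can be had by passing a combined key such as `df['label'].astype(str) + '_' + df['lang']` to the `stratify` argument of `train_test_split`, provided every combination has at least two rows.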

In [None]:
!pip install hf_xet
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset as TorchDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup
)
from datasets import Dataset
import os
from tqdm import tqdm

# Set environment
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("✅ All imports successful - no Trainer issues!")

class CustomDataset(TorchDataset):
    """Custom PyTorch Dataset for text classification"""
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

def preprocess_hindi_labels(label_str):
    """Convert Hindi multi-label format to binary classification"""
    if pd.isna(label_str) or label_str == 'non-hostile':
        return 0  # Non-hate
    else:
        return 1  # Hate (any form of hate/offensive/defamation)

def load_and_prepare_combined_data():
    """Load and prepare the combined Tamil-Hindi datasets"""
    try:
        print("📂 Loading Tamil datasets...")
        # Tamil files
        tamil_train = pd.read_csv("tamil_offensive_speech_train.csv")[["comment", "label"]]
        tamil_val = pd.read_csv("tamil_offensive_speech_val.csv")[["comment", "label"]]
        
        # Rename 'comment' column to 'text' to be consistent with Hindi dataset
        tamil_train = tamil_train.rename(columns={'comment': 'text'})
        tamil_val = tamil_val.rename(columns={'comment': 'text'})
        tamil_train["lang"] = "ta"
        tamil_val["lang"] = "ta"
        
        print(f"✅ Tamil train: {len(tamil_train)} samples")
        print(f"✅ Tamil validation: {len(tamil_val)} samples")
        print(f"📊 Tamil label distribution:")
        print(tamil_train['label'].value_counts())
        
        print("\n📂 Loading Hindi datasets...")
        # Hindi files
        hindi_train = pd.read_csv("Hatespeech-Hindi_Train.csv")[["Post", "Labels Set"]]
        hindi_val = pd.read_csv("Hatespeech-Hindi_Valid.csv")[["Post", "Labels Set"]]
        
        # Rename columns to be consistent
        hindi_train = hindi_train.rename(columns={'Post': 'text', 'Labels Set': 'label'})
        hindi_val = hindi_val.rename(columns={'Post': 'text', 'Labels Set': 'label'})
        
        print(f"✅ Hindi train: {len(hindi_train)} samples")
        print(f"✅ Hindi validation: {len(hindi_val)} samples")
        print(f"📊 Original Hindi label distribution:")
        print(hindi_train['label'].value_counts())
        
        # Process Hindi labels to binary format
        hindi_train['label'] = hindi_train['label'].apply(preprocess_hindi_labels)
        hindi_val['label'] = hindi_val['label'].apply(preprocess_hindi_labels)
        
        print(f"📊 Processed Hindi label distribution:")
        print(hindi_train['label'].value_counts())
        
        hindi_train["lang"] = "hi"
        hindi_val["lang"] = "hi"
        
        print("\n🔄 Combining datasets...")
        # Combine into df
        df = pd.concat([tamil_train, tamil_val, hindi_train, hindi_val], ignore_index=True)
        df.dropna(inplace=True)
        
        print(f"✅ Combined dataset: {len(df)} samples")
        print(f"📊 Final label distribution:")
        print(df['label'].value_counts())
        print(f"📊 Language distribution:")
        print(df['lang'].value_counts())
        
        # Display sample data
        print(f"\n🔍 Sample data:")
        print(df.head())
        
        return df
        
    except FileNotFoundError as e:
        print(f"❌ File not found: {e}")
        print("📁 Make sure all CSV files are in the same directory")
        return None

def train_model(model, train_dataloader, val_dataloader, device, epochs=3, lr=2e-5):
    """Manual training loop - replaces Trainer"""
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = len(train_dataloader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=total_steps
    )
    
    model.to(device)
    best_val_accuracy = 0
    
    for epoch in range(epochs):
        print(f'\n📚 Epoch {epoch + 1}/{epochs}')
        print('-' * 50)
        
        # Training phase
        model.train()
        total_train_loss = 0
        train_predictions = []
        train_true = []
        
        train_pbar = tqdm(train_dataloader, desc="Training")
        for batch in train_pbar:
            optimizer.zero_grad()
            
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(input_ids=input_ids, 
                          attention_mask=attention_mask, 
                          labels=labels)
            
            loss = outputs.loss
            total_train_loss += loss.item()
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            
            predictions = torch.argmax(outputs.logits, dim=-1)
            train_predictions.extend(predictions.cpu().numpy())
            train_true.extend(labels.cpu().numpy())
            
            train_pbar.set_postfix({'loss': f'{loss.item():.4f}'})
        
        avg_train_loss = total_train_loss / len(train_dataloader)
        train_accuracy = accuracy_score(train_true, train_predictions)
        
        print(f'📊 Training Loss: {avg_train_loss:.4f} | Accuracy: {train_accuracy:.4f}')
        
        # Validation phase
        model.eval()
        total_eval_loss = 0
        eval_predictions = []
        eval_true = []
        
        with torch.no_grad():
            val_pbar = tqdm(val_dataloader, desc="Validation")
            for batch in val_pbar:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                
                outputs = model(input_ids=input_ids, 
                              attention_mask=attention_mask, 
                              labels=labels)
                
                loss = outputs.loss
                total_eval_loss += loss.item()
                
                predictions = torch.argmax(outputs.logits, dim=-1)
                eval_predictions.extend(predictions.cpu().numpy())
                eval_true.extend(labels.cpu().numpy())
        
        avg_val_loss = total_eval_loss / len(val_dataloader)
        val_accuracy = accuracy_score(eval_true, eval_predictions)
        
        print(f'📈 Validation Loss: {avg_val_loss:.4f} | Accuracy: {val_accuracy:.4f}')
        
        # Track the best validation accuracy (no checkpoint is written here)
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            print(f'🏆 New best validation accuracy: {best_val_accuracy:.4f}')
        
        # Print classification report for last epoch
        if epoch == epochs - 1:
            print("\n📋 Final Classification Report:")
            print(classification_report(eval_true, eval_predictions, target_names=['Non-Hate', 'Hate']))
    
    return model

def run_multilingual_hate_speech_training():
    """Main function to run multilingual hate speech detection training"""
    
    print("🚀 Starting Multilingual Hate Speech Detection Training (Tamil + Hindi)")
    print("=" * 80)
    
    # 1. Load and prepare combined data
    df = load_and_prepare_combined_data()
    
    if df is None:
        print("❌ Failed to load data. Exiting...")
        return None, None  # keep the (model, tokenizer) shape so the caller can unpack it
    
    # 2. Split the data (stratified by label only; language is not part of the key)
    print(f"\n🔄 Splitting combined data...")
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        df['text'].tolist(),
        df['label'].tolist(),
        test_size=0.2,
        random_state=42,
        stratify=df['label']  # Stratify by label
    )
    
    print(f"📚 Training samples: {len(train_texts)}")
    print(f"📝 Validation samples: {len(val_texts)}")
    
    # 3. Load model and tokenizer for multilingual support
    # Using smaller multilingual model for faster CPU training
    model_name = "distilbert-base-multilingual-cased"  # Much faster on CPU
    # Alternative: "bert-base-multilingual-cased" (medium) or "xlm-roberta-base" (best but slowest)
    
    print(f"\n🤖 Loading multilingual model: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    num_labels = len(set(df['label']))
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        ignore_mismatched_sizes=True
    )
    
    print(f"✅ Model loaded with {num_labels} labels")
    
    # 4. Create datasets and data loaders
    print(f"\n📦 Creating datasets...")
    train_dataset = CustomDataset(train_texts, train_labels, tokenizer, max_length=128)
    val_dataset = CustomDataset(val_texts, val_labels, tokenizer, max_length=128)
    
    batch_size = 16
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
    print(f"✅ Data loaders created (batch size: {batch_size})")
    
    # 5. Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"💻 Using device: {device}")
    
    # 6. Train the model
    print(f"\n🎯 Starting training...")
    trained_model = train_model(
        model, 
        train_dataloader, 
        val_dataloader, 
        device,
        epochs=5,  # Increased for multilingual training
        lr=2e-5
    )
    
    # 7. Save the model
    output_dir = "./trained_multilingual_hate_speech_model"
    print(f"\n💾 Saving model to {output_dir}...")
    os.makedirs(output_dir, exist_ok=True)
    trained_model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    
    print(f"\n🎉 Training completed successfully!")
    print(f"📁 Model saved to: {output_dir}")
    print(f"🚀 You can now use this model for multilingual hate speech detection!")
    
    return trained_model, tokenizer

def test_multilingual_predictions(model, tokenizer, device):
    """Test the trained model on sample texts in both languages"""
    model.eval()
    
    test_texts = [
        # Tamil examples
        "இது ஒரு நல்ல செய்தி",  # This is good news
        "அருமையான வேலை",  # Excellent work
        
        # Hindi examples  
        "यह बहुत अच्छा है",  # This is very good
        "बहुत बढ़िया काम"  # Very good work
    ]
    
    print("\n🧪 Testing multilingual predictions:")
    print("-" * 50)
    
    for text in test_texts:
        encoding = tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=128,
            return_tensors='pt'
        )
        
        with torch.no_grad():
            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            prediction = torch.argmax(outputs.logits, dim=-1)
            confidence = torch.softmax(outputs.logits, dim=-1)
            
        pred_label = "Hate" if prediction.item() == 1 else "Non-Hate"
        conf_score = confidence.max().item()
        
        print(f"Text: '{text}'")
        print(f"Prediction: {pred_label} (Confidence: {conf_score:.4f})")
        print()

# Run the training
if __name__ == "__main__":
    trained_model, tokenizer = run_multilingual_hate_speech_training()
    
    # Test the model if training was successful
    if trained_model is not None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        test_multilingual_predictions(trained_model, tokenizer, device)
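
The split in this cell stratifies on the label only, so the Tamil/Hindi ratio in the validation set is left to chance. One way to keep both proportions stable is to stratify on a combined key such as `"<label>_<lang>"`; with scikit-learn such a key can be passed to `train_test_split(..., stratify=...)`. The following is a minimal standard-library sketch of the same idea. `stratified_split` is a hypothetical helper, not something the notebook defines, and the toy counts are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_split(keys, test_size=0.2, seed=42):
    """Split indices so each stratum key keeps roughly the same share in both parts.

    `keys` holds one hashable stratum key per sample, e.g. f"{label}_{lang}".
    Returns (train_indices, val_indices).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, key in enumerate(keys):
        groups[key].append(idx)

    train_idx, val_idx = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        n_val = round(len(members) * test_size)  # per-stratum validation count
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy (label, lang) pairs standing in for df['label'] and df['lang'] (assumed data)
samples = [(0, "ta")] * 50 + [(1, "ta")] * 20 + [(0, "hi")] * 20 + [(1, "hi")] * 10
keys = [f"{label}_{lang}" for label, lang in samples]
train_idx, val_idx = stratified_split(keys, test_size=0.2)
print(len(train_idx), len(val_idx))
```

Each (label, language) group contributes 20% of its rows to validation, so a small language or a rare class cannot vanish from the validation set by accident.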

In [1]:
!pip install hf_xet
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset as TorchDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup
)
from datasets import Dataset
import os
from tqdm import tqdm

# Set environment
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("✅ All imports successful - no Trainer issues!")

# Hardware check
print("\n🔍 Checking hardware...")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
else:
    print("⚠️  Using CPU - training will be slower")

class CustomDataset(TorchDataset):
    """Custom PyTorch Dataset for text classification"""
    def __init__(self, texts, labels, tokenizer, max_length=64):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

def preprocess_hindi_labels(label_str):
    """Convert Hindi multi-label format to binary classification"""
    if pd.isna(label_str) or label_str == 'non-hostile':
        return 0  # Non-hate
    else:
        return 1  # Hate (any form of hate/offensive/defamation)

def load_and_prepare_combined_data():
    """Load and prepare the combined Tamil-Hindi datasets"""
    try:
        print("📂 Loading Tamil datasets...")
        # Tamil files
        tamil_train = pd.read_csv("tamil_offensive_speech_train.csv")[["comment", "label"]]
        tamil_val = pd.read_csv("tamil_offensive_speech_val.csv")[["comment", "label"]]
        
        # Rename 'comment' column to 'text' to be consistent with Hindi dataset
        tamil_train = tamil_train.rename(columns={'comment': 'text'})
        tamil_val = tamil_val.rename(columns={'comment': 'text'})
        tamil_train["lang"] = "ta"
        tamil_val["lang"] = "ta"
        
        print(f"✅ Tamil train: {len(tamil_train)} samples")
        print(f"✅ Tamil validation: {len(tamil_val)} samples")
        print(f"📊 Tamil label distribution:")
        print(tamil_train['label'].value_counts())
        
        print("\n📂 Loading Hindi datasets...")
        # Hindi files
        hindi_train = pd.read_csv("Hatespeech-Hindi_Train.csv")[["Post", "Labels Set"]]
        hindi_val = pd.read_csv("Hatespeech-Hindi_Valid.csv")[["Post", "Labels Set"]]
        
        # Rename columns to be consistent
        hindi_train = hindi_train.rename(columns={'Post': 'text', 'Labels Set': 'label'})
        hindi_val = hindi_val.rename(columns={'Post': 'text', 'Labels Set': 'label'})
        
        print(f"✅ Hindi train: {len(hindi_train)} samples")
        print(f"✅ Hindi validation: {len(hindi_val)} samples")
        print(f"📊 Original Hindi label distribution:")
        print(hindi_train['label'].value_counts())
        
        # Process Hindi labels to binary format
        hindi_train['label'] = hindi_train['label'].apply(preprocess_hindi_labels)
        hindi_val['label'] = hindi_val['label'].apply(preprocess_hindi_labels)
        
        print(f"📊 Processed Hindi label distribution:")
        print(hindi_train['label'].value_counts())
        
        hindi_train["lang"] = "hi"
        hindi_val["lang"] = "hi"
        
        print("\n🔄 Combining datasets...")
        # Combine into df
        df = pd.concat([tamil_train, tamil_val, hindi_train, hindi_val], ignore_index=True)
        df.dropna(inplace=True)
        
        print(f"✅ Combined dataset: {len(df)} samples")
        print(f"📊 Final label distribution:")
        print(df['label'].value_counts())
        print(f"📊 Language distribution:")
        print(df['lang'].value_counts())
        
        # Display sample data
        print(f"\n🔍 Sample data:")
        print(df.head())
        
        # OPTIMIZATION: Use sample for faster testing
        USE_SAMPLE = True  # Set to False for full training
        
        if USE_SAMPLE:
            print(f"\n🧪 Using sample for faster testing...")
            df_sample = df.sample(n=min(5000, len(df)), random_state=42)  # cap at dataset size
            print(f"Sample size: {len(df_sample)} samples")
            print(f"Sample label distribution:")
            print(df_sample['label'].value_counts())
            print(f"Sample language distribution:")
            print(df_sample['lang'].value_counts())
            return df_sample
        else:
            print(f"\n🔥 Using full dataset for training...")
            return df
        
    except FileNotFoundError as e:
        print(f"❌ File not found: {e}")
        print("📁 Make sure all CSV files are in the same directory")
        return None

def train_model(model, train_dataloader, val_dataloader, device, epochs=2, lr=2e-5):
    """Manual training loop - replaces Trainer"""
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = len(train_dataloader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=total_steps
    )
    
    model.to(device)
    best_val_accuracy = 0
    
    for epoch in range(epochs):
        print(f'\n📚 Epoch {epoch + 1}/{epochs}')
        print('-' * 50)
        
        # Training phase
        model.train()
        total_train_loss = 0
        train_predictions = []
        train_true = []
        
        train_pbar = tqdm(train_dataloader, desc="Training")
        for batch in train_pbar:
            optimizer.zero_grad()
            
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(input_ids=input_ids, 
                          attention_mask=attention_mask, 
                          labels=labels)
            
            loss = outputs.loss
            total_train_loss += loss.item()
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            
            predictions = torch.argmax(outputs.logits, dim=-1)
            train_predictions.extend(predictions.cpu().numpy())
            train_true.extend(labels.cpu().numpy())
            
            train_pbar.set_postfix({'loss': f'{loss.item():.4f}'})
        
        avg_train_loss = total_train_loss / len(train_dataloader)
        train_accuracy = accuracy_score(train_true, train_predictions)
        
        print(f'📊 Training Loss: {avg_train_loss:.4f} | Accuracy: {train_accuracy:.4f}')
        
        # Validation phase
        model.eval()
        total_eval_loss = 0
        eval_predictions = []
        eval_true = []
        
        with torch.no_grad():
            val_pbar = tqdm(val_dataloader, desc="Validation")
            for batch in val_pbar:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                
                outputs = model(input_ids=input_ids, 
                              attention_mask=attention_mask, 
                              labels=labels)
                
                loss = outputs.loss
                total_eval_loss += loss.item()
                
                predictions = torch.argmax(outputs.logits, dim=-1)
                eval_predictions.extend(predictions.cpu().numpy())
                eval_true.extend(labels.cpu().numpy())
        
        avg_val_loss = total_eval_loss / len(val_dataloader)
        val_accuracy = accuracy_score(eval_true, eval_predictions)
        
        print(f'📈 Validation Loss: {avg_val_loss:.4f} | Accuracy: {val_accuracy:.4f}')
        
        # Track best validation accuracy (no checkpoint is written here)
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            print(f'🏆 New best validation accuracy: {best_val_accuracy:.4f}')
        
        # Print classification report for last epoch
        if epoch == epochs - 1:
            print("\n📋 Final Classification Report:")
            print(classification_report(eval_true, eval_predictions, target_names=['Non-Hate', 'Hate']))
    
    return model

def run_multilingual_hate_speech_training():
    """Main function to run multilingual hate speech detection training"""
    
    print("🚀 Starting Multilingual Hate Speech Detection Training (Tamil + Hindi)")
    print("=" * 80)
    
    # 1. Load and prepare combined data
    df = load_and_prepare_combined_data()
    
    if df is None:
        print("❌ Failed to load data. Exiting...")
        return None, None  # keep the (model, tokenizer) shape so the caller can unpack
    
    # 2. Split the data (stratified by label only; language is not part of the stratification key)
    print(f"\n🔄 Splitting combined data...")
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        df['text'].tolist(),
        df['label'].tolist(),
        test_size=0.2,
        random_state=42,
        stratify=df['label']  # Stratify by label
    )
    
    print(f"📚 Training samples: {len(train_texts)}")
    print(f"📝 Validation samples: {len(val_texts)}")
    
    # 3. Optimized model and parameters
    model_name = "distilbert-base-multilingual-cased"  # Fast multilingual model
    batch_size = 8      # Reduced for CPU efficiency
    epochs = 2          # Reduced for testing
    max_length = 64     # Reduced for speed
    learning_rate = 2e-5
    
    print(f"\n📊 Training Configuration:")
    print(f"   🤖 Model: {model_name}")
    print(f"   📦 Batch size: {batch_size}")
    print(f"   🔄 Epochs: {epochs}")
    print(f"   📏 Max length: {max_length}")
    print(f"   📈 Learning rate: {learning_rate}")
    
    # 4. Load model and tokenizer
    print(f"\n🤖 Loading multilingual model...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    num_labels = len(set(df['label']))
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        ignore_mismatched_sizes=True
    )
    
    print(f"✅ Model loaded with {num_labels} labels")
    
    # 5. Create datasets and data loaders
    print(f"\n📦 Creating datasets...")
    train_dataset = CustomDataset(train_texts, train_labels, tokenizer, max_length=max_length)
    val_dataset = CustomDataset(val_texts, val_labels, tokenizer, max_length=max_length)
    
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
    print(f"✅ Data loaders created")
    print(f"   📚 Training batches: {len(train_dataloader)}")
    print(f"   📝 Validation batches: {len(val_dataloader)}")
    
    # 6. Set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"\n💻 Using device: {device}")
    
    # Estimate training time
    if device.type == 'cpu':
        estimated_time = len(train_dataloader) * epochs * 3  # ~3 seconds per batch on CPU
        print(f"⏱️  Estimated training time: ~{estimated_time//60} minutes")
    
    # 7. Train the model
    print(f"\n🎯 Starting training...")
    trained_model = train_model(
        model, 
        train_dataloader, 
        val_dataloader, 
        device,
        epochs=epochs,
        lr=learning_rate
    )
    
    # 8. Save the model
    output_dir = "./trained_multilingual_hate_speech_model"
    print(f"\n💾 Saving model to {output_dir}...")
    os.makedirs(output_dir, exist_ok=True)
    trained_model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    
    print(f"\n🎉 Training completed successfully!")
    print(f"📁 Model saved to: {output_dir}")
    print(f"🚀 You can now use this model for multilingual hate speech detection!")
    
    return trained_model, tokenizer

def test_multilingual_predictions(model, tokenizer, device):
    """Test the trained model on sample texts in both languages"""
    model.eval()
    
    test_texts = [
        # Tamil examples (safe)
        "இது ஒரு நல்ல செய்தி",  # This is good news
        "அருமையான வேலை",      # Excellent work
        "வாழ்த்துக்கள்",        # Congratulations
        
        # Hindi examples (safe)
        "यह बहुत अच्छा है",      # This is very good
        "बहुत बढ़िया काम",       # Very good work
        "धन्यवाद",              # Thank you
        
        # Mixed content for testing
        "Great job! बहुत अच्छा",  # Mixed language
        "வாழ்த்துக்கள் friend!"   # Mixed language
    ]
    
    print("\n🧪 Testing multilingual predictions:")
    print("-" * 60)
    
    for i, text in enumerate(test_texts, 1):
        encoding = tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=64,
            return_tensors='pt'
        )
        
        with torch.no_grad():
            input_ids = encoding['input_ids'].to(device)
            attention_mask = encoding['attention_mask'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            prediction = torch.argmax(outputs.logits, dim=-1)
            confidence = torch.softmax(outputs.logits, dim=-1)
            
        pred_label = "Hate" if prediction.item() == 1 else "Non-Hate"
        conf_score = confidence.max().item()
        
        print(f"{i}. Text: '{text}'")
        print(f"   Prediction: {pred_label} (Confidence: {conf_score:.4f})")
        print()

# Run the training
if __name__ == "__main__":
    print("🌟 Multilingual Hate Speech Detection Training")
    print("🔧 Optimized for CPU with sample data for fast testing")
    print("=" * 80)
    
    trained_model, tokenizer = run_multilingual_hate_speech_training()
    
    # Test the model if training was successful
    if trained_model is not None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        test_multilingual_predictions(trained_model, tokenizer, device)
        
        print("\n" + "="*80)
        print("🎯 Training Summary:")
        print("✅ Model: DistilBERT Multilingual (optimized for speed)")
        print("✅ Languages: Tamil + Hindi")
        print("✅ Task: Binary hate speech classification")
        print("✅ Optimization: Sample data + reduced parameters for CPU")
        print("💡 To use full dataset: Set USE_SAMPLE = False in load_and_prepare_combined_data()")
        print("💡 For production: Use GPU and increase epochs to 5+")

Defaulting to user installation because normal site-packages is not writeable
Collecting hf_xet
  Downloading hf_xet-1.1.2-cp37-abi3-win_amd64.whl.metadata (883 bytes)
Downloading hf_xet-1.1.2-cp37-abi3-win_amd64.whl (2.7 MB)
Installing collected packages: hf_xet
Successfully installed hf_xet-1.1.2
✅ All imports successful - no Trainer issues!

🔍 Checking hardware...
CUDA available: False
Device count: 0
⚠️  Using CPU - training will be slower
🌟 Multilingual Hate Speech Detection Training
🔧 Optimized for CPU with sample data for fast testing
🚀 Starting

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded with 2 labels

📦 Creating datasets...
✅ Data loaders created
   📚 Training batches: 500
   📝 Validation batches: 125

💻 Using device: cpu
⏱️  Estimated training time: ~50 minutes

🎯 Starting training...

📚 Epoch 1/2
--------------------------------------------------


Training: 100%|█████████████████████████████████████████████████████████| 500/500 [24:27<00:00,  2.93s/it, loss=0.9763]


📊 Training Loss: 0.4894 | Accuracy: 0.7705


Validation: 100%|████████████████████████████████████████████████████████████████████| 125/125 [01:01<00:00,  2.03it/s]


📈 Validation Loss: 0.4220 | Accuracy: 0.8080
🏆 New best validation accuracy: 0.8080

📚 Epoch 2/2
--------------------------------------------------


Training: 100%|█████████████████████████████████████████████████████████| 500/500 [24:27<00:00,  2.94s/it, loss=0.1580]


📊 Training Loss: 0.3708 | Accuracy: 0.8277


Validation: 100%|████████████████████████████████████████████████████████████████████| 125/125 [01:02<00:00,  2.00it/s]


📈 Validation Loss: 0.4228 | Accuracy: 0.8190
🏆 New best validation accuracy: 0.8190

📋 Final Classification Report:
              precision    recall  f1-score   support

    Non-Hate       0.84      0.92      0.88       719
        Hate       0.73      0.57      0.64       281

    accuracy                           0.82      1000
   macro avg       0.79      0.74      0.76      1000
weighted avg       0.81      0.82      0.81      1000


💾 Saving model to ./trained_multilingual_hate_speech_model...

🎉 Training completed successfully!
📁 Model saved to: ./trained_multilingual_hate_speech_model
🚀 You can now use this model for multilingual hate speech detection!

🧪 Testing multilingual predictions:
------------------------------------------------------------
1. Text: 'இது ஒரு நல்ல செய்தி'
   Prediction: Non-Hate (Confidence: 0.9728)

2. Text: 'அருமையான வேலை'
   Prediction: Non-Hate (Confidence: 0.9596)

3. Text: 'வாழ்த்துக்கள்'
   Prediction: Non-Hate (Confidence: 0.9866)

4. Text: 'यह 
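
The training loop in this cell records the best validation accuracy but returns whatever weights the final epoch produced. A minimal sketch of holding on to the best-scoring weights is shown below. `BestStateTracker` is a hypothetical helper; plain dicts stand in for what would be `model.state_dict()` in PyTorch, and restoring would then go through `model.load_state_dict(...)`.

```python
import copy

class BestStateTracker:
    """Keep a deep copy of the state that achieved the highest score so far."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_state = None
        self.best_epoch = None

    def update(self, score, state, epoch):
        """Store `state` if `score` beats the previous best; return True if stored."""
        if score > self.best_score:
            self.best_score = score
            # Deep copy so later training steps cannot mutate the saved weights.
            self.best_state = copy.deepcopy(state)
            self.best_epoch = epoch
            return True
        return False

# Plain dicts stand in for model.state_dict(); the two scores are the
# validation accuracies printed in the run above.
tracker = BestStateTracker()
weights = {"classifier.weight": [0.1, 0.2]}
for epoch, val_accuracy in enumerate([0.8080, 0.8190], start=1):
    weights["classifier.weight"][0] += 0.05  # pretend one epoch of training happened
    tracker.update(val_accuracy, weights, epoch)

weights["classifier.weight"][0] = 999.0  # a later mutation must not leak into the copy
print(tracker.best_epoch, tracker.best_score)
```

In this run accuracy improved in the second epoch while validation loss stayed flat (0.4220 to 0.4228), so tracking on loss instead of accuracy would be an equally defensible choice.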

In [4]:
# ✅ Step 1: Load your dataset (choose one; keep the other line commented out)
import pandas as pd

# For Hindi dataset
# df = pd.read_csv("Hatespeech-Hindi_Train.csv")
# For Tamil dataset
df = pd.read_csv("tamil_offensive_speech_train.csv")

print("✅ Dataset loaded successfully!")
print(f"Raw dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
# ✅ Step 2: Rename columns to standardize
df = df.rename(columns={
    'comment': 'text',    # Text column in the Tamil file
    'category': 'label'   # No-op for the Tamil file, which already has 'label'; adjust for other files
})

# ✅ Step 3: Clean and prepare data
df = df[['text', 'label']].copy()  # Keep only relevant columns (copy avoids SettingWithCopyWarning)

# Remove rows with missing values first: astype(str) would turn NaN into the
# string 'nan', and astype(int) fails on NaN
df = df.dropna()
df['text'] = df['text'].astype(str)
df['label'] = df['label'].astype(int)

print(f"Dataset shape: {df.shape}")
print(f"Label distribution:\n{df['label'].value_counts()}")
print("✅ Data preprocessing completed!")

✅ Dataset loaded successfully!
Raw dataset shape: (27875, 2)
Columns: ['label', 'comment']
Dataset shape: (27875, 2)
Label distribution:
label
0    21226
1     6649
Name: count, dtype: int64
✅ Data preprocessing completed!
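
The label distribution above is imbalanced (21,226 vs. 6,649, roughly 76% to 24%). One common response is to weight the loss by inverse class frequency. The sketch below computes such weights with the formula `n_samples / (n_classes * count)`; `balanced_class_weights` is a hypothetical helper, and feeding the result into something like `torch.nn.CrossEntropyLoss(weight=...)` is an assumption about how it might be used, not something the notebook does.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Label counts taken from the distribution printed above.
labels = [0] * 21226 + [1] * 6649
weights = balanced_class_weights(labels)
print({cls: round(w, 4) for cls, w in weights.items()})
```

The minority class ends up with a weight a little over three times that of the majority class, which mirrors the class ratio.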


In [11]:
# ✅ Step 4: Split into train/validation
from sklearn.model_selection import train_test_split
from datasets import Dataset
train_df, val_df = train_test_split(
    df, 
    test_size=0.2, 
    stratify=df['label'], 
    random_state=42
)

print(f"Training set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")

# Convert to HuggingFace datasets
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))

print("✅ Train-validation split completed!")

Training set size: 22300
Validation set size: 5575
✅ Train-validation split completed!


In [12]:
# ✅ Step 5: Initialize tokenizer
from transformers import AutoTokenizer
model_name = "xlm-roberta-base"
print(f"Loading tokenizer: {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)

print("✅ Tokenizer loaded successfully!")
print(f"Tokenizer vocab size: {tokenizer.vocab_size}")

Loading tokenizer: xlm-roberta-base
✅ Tokenizer loaded successfully!
Tokenizer vocab size: 250002


In [7]:
# ✅ Step 6: Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples['text'], 
        truncation=True, 
        padding=False,  # Padding will be handled by data collator
        max_length=512
    )

print("✅ Tokenization function defined!")

✅ Tokenization function defined!
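
Because padding is deferred to the data collator, `max_length=512` only acts as a truncation ceiling here. If most posts are short, a lower ceiling chosen from the observed token lengths can cut the cost of the rare very long sample. The sketch below picks a cap from a coverage percentile; `suggest_max_length` and the toy lengths are illustrative assumptions, and real lengths would come from something like `len(tokenizer(text)["input_ids"])`.

```python
import math

def suggest_max_length(token_lengths, coverage=0.95, multiple_of=8, hard_cap=512):
    """Return a truncation length that covers `coverage` of the samples.

    The nearest-rank percentile is rounded up to a multiple of `multiple_of`
    and clipped to `hard_cap`.
    """
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    rank = max(1, math.ceil(coverage * len(ordered)))  # nearest-rank percentile
    percentile_value = ordered[rank - 1]
    rounded = math.ceil(percentile_value / multiple_of) * multiple_of
    return min(rounded, hard_cap)

# Toy token counts (assumed); one outlier of 700 tokens does not drive the cap.
lengths = [12, 18, 25, 33, 33, 40, 41, 47, 52, 60,
           64, 70, 75, 88, 90, 95, 110, 130, 200, 700]
print(suggest_max_length(lengths))
```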


In [13]:
# Apply tokenization to training dataset
print("Tokenizing training dataset...")
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
train_dataset = train_dataset.map(tokenize_function, batched=True)

print("✅ Training dataset tokenized!")
print(f"Training dataset features: {train_dataset.features}")

Tokenizing training dataset...


Map:   0%|          | 0/22300 [00:00<?, ? examples/s]

✅ Training dataset tokenized!
Training dataset features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int32', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}


In [14]:
# Apply tokenization to validation dataset
val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))
print("Tokenizing validation dataset...")
val_dataset = val_dataset.map(tokenize_function, batched=True)

print("✅ Validation dataset tokenized!")
print(f"Validation dataset features: {val_dataset.features}")

Tokenizing validation dataset...


Map:   0%|          | 0/5575 [00:00<?, ? examples/s]

✅ Validation dataset tokenized!
Validation dataset features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int32', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}


In [15]:
# ✅ Step 7: Initialize model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
num_labels = len(df['label'].unique())
print(f"Number of labels: {num_labels}")

print(f"Loading model: {model_name}")
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="single_label_classification"
)

print("✅ Model loaded successfully!")
print(f"Model device: {next(model.parameters()).device}")

Number of labels: 2
Loading model: xlm-roberta-base


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded successfully!
Model device: cpu


In [16]:
# Final check - make sure everything is loaded properly
print("🔍 Final Setup Check:")
print(f"✅ Dataset shape: {df.shape}")
print(f"✅ Number of labels: {num_labels}")
print(f"✅ Training samples: {len(train_dataset)}")
print(f"✅ Validation samples: {len(val_dataset)}")
print(f"✅ Tokenizer loaded: {tokenizer is not None}")
print(f"✅ Model loaded: {model is not None}")

# Test tokenization on a sample
sample_text = train_df.iloc[0]['text']
sample_tokens = tokenizer(sample_text, truncation=True, max_length=512)
print(f"✅ Sample tokenization works: {len(sample_tokens['input_ids'])} tokens")

print("\n🎉 All setup complete! Ready for training configuration.")

🔍 Final Setup Check:
✅ Dataset shape: (27875, 2)
✅ Number of labels: 2
✅ Training samples: 22300
✅ Validation samples: 5575
✅ Tokenizer loaded: True
✅ Model loaded: True
✅ Sample tokenization works: 33 tokens

🎉 All setup complete! Ready for training configuration.


In [22]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = (predictions == labels).mean()
    return {"accuracy": accuracy}
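
Accuracy alone can hide weak minority-class performance: in the DistilBERT run earlier in this notebook, overall accuracy was 0.82 while recall on the Hate class was 0.57. A metric function that also reports macro-averaged F1 makes that gap visible. The sketch below is written without NumPy so it runs on plain lists; `macro_f1_metrics` is a hypothetical name, and it assumes the same `(logits, labels)` pair that `compute_metrics` receives.

```python
def macro_f1_metrics(eval_pred):
    """Accuracy plus macro-averaged F1 from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Row-wise argmax; works for nested lists and for array rows alike.
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    labels = [int(y) for y in labels]

    classes = sorted(set(labels) | set(preds))
    f1_per_class = {}
    for c in classes:
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_per_class[c] = (2 * precision * recall / (precision + recall)
                           if precision + recall else 0.0)

    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)
    return {"accuracy": accuracy, "macro_f1": macro_f1}

# Toy logits for six samples (assumed data); columns are [Non-Hate, Hate]
toy_logits = [[2.0, 0.1], [1.5, 0.2], [0.3, 1.9], [1.2, 0.4], [0.2, 0.9], [1.1, 0.8]]
toy_labels = [0, 0, 1, 0, 1, 1]
print(macro_f1_metrics((toy_logits, toy_labels)))
```

If a function like this were passed to the trainer, the best-model metric could point at the macro-F1 key instead of accuracy.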

In [21]:
# Add this import
from transformers import DataCollatorWithPadding

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

print("✅ Data collator created successfully!")

✅ Data collator created successfully!
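
With default settings the collator pads each batch only to its longest member (dynamic padding), which is why `padding=False` was used at tokenization time. The sketch below reproduces that behaviour on plain token-id lists to show how much smaller a dynamically padded batch is than one padded to 512. `pad_batch` and the `pad_id=0` value are illustrative assumptions; the real collator takes the pad id from the tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three toy sequences of lengths 3, 2, and 5 (assumed token ids)
batch = pad_batch([[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]])
padded_tokens = sum(len(row) for row in batch["input_ids"])
fixed_tokens = 3 * 512  # the same three samples padded to max_length=512
print(padded_tokens, fixed_tokens)
```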


In [37]:
from transformers import TrainingArguments
import torch
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",        # Evaluate every epoch
    save_strategy="epoch",        # Save every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=8,  # Reduced from 16 to avoid memory issues
    per_device_eval_batch_size=8,   # Reduced from 16 to avoid memory issues
    num_train_epochs=4,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_accuracy",
    greater_is_better=True,
    save_total_limit=2,
    seed=42,
    dataloader_pin_memory=False,
    report_to="none",  # the string "none" disables logging integrations; None is treated as the library default
    fp16=torch.cuda.is_available()  # Enable mixed precision if GPU available
)
print("Done successfully")

Done successfully
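
For a rough sense of what these arguments imply, the number of optimizer steps and the learning rate at any step can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and a linear decay schedule with no warmup (the same shape the manual loop set up through `get_linear_schedule_with_warmup` with `num_warmup_steps=0`). `linear_schedule_lr` is a hypothetical helper written for illustration.

```python
import math

def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

train_samples = 22300   # training set size printed after the split
batch_size = 8          # per_device_train_batch_size, single device assumed
epochs = 4              # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

# Learning rate at the start, midpoint, and end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 2e-5, total_steps))
```

At roughly 3 seconds per batch on CPU (the figure used for the estimate in the earlier cell), this many steps is a multi-hour run, which is a practical argument for a GPU or a smaller sample.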


In [35]:
# Add debugging information
import torch
from torch.utils.data import DataLoader, TensorDataset

print("Checking training setup...")
# Check if CUDA is available and move model to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Example with dummy data - replace with your actual data
# Assuming you have features (X) and labels (y)
X = torch.randn(1000, 10)  # 1000 samples, 10 features
y = torch.randint(0, 2, (1000,))  # Binary classification

# Create dataset and dataloader
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Also move your data to GPU during training
for batch in dataloader:
    inputs, labels = batch
    inputs = inputs.to(device)
    labels = labels.to(device)
    # ... rest of training loop
print(f"Model device: {next(model.parameters()).device}")
print(f"Available GPU: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Checking training setup...
Model device: cpu
Available GPU: False


In [3]:
from transformers import AutoTokenizer, DataCollatorWithPadding
import torch

# 1. Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 2. Set pad token if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"Set pad token to EOS token: {tokenizer.pad_token}")

# 3. Tokenize your dataset with truncation and max_length=512
train_dataset = [
    tokenizer("Hello, this is a test sentence.", truncation=True, max_length=512),
    tokenizer("Another example sentence that might be longer.", truncation=True, max_length=512)
]

# Add dummy labels (replace with your actual labels)
for i, sample in enumerate(train_dataset):
    sample["label"] = i % 2

# 4. Create data collator that pads to max_length=512 exactly
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding="max_length",   # pad sequences to max_length
    max_length=512,
    return_tensors="pt"
)

# 5. Test data collator
print("Testing data collator with padding to max_length=512...")
try:
    batch = data_collator(train_dataset)
    print("✅ Data collator working correctly!")
    print(f"Batch keys: {batch.keys()}")
    print(f"Input IDs shape: {batch['input_ids'].shape}")          # Should be (2, 512)
    print(f"Attention mask shape: {batch['attention_mask'].shape}")# Should be (2, 512)
    print(f"Labels shape: {batch['labels'].shape}")                # Should be (2,)
except Exception as e:
    print(f"❌ Data collator error: {e}")

# 6. Check sequence lengths in the stored dataset (these stay unpadded; the collator pads each batch, not the samples)
lengths = [len(sample['input_ids']) for sample in train_dataset]
print(f"\nSequence lengths in dataset: {lengths}")
print(f"Min length: {min(lengths)}")
print(f"Max length: {max(lengths)}")
print(f"Average length: {sum(lengths)/len(lengths):.1f}")


Testing data collator with padding to max_length=512...
✅ Data collator working correctly!
Batch keys: dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
Input IDs shape: torch.Size([2, 512])
Attention mask shape: torch.Size([2, 512])
Labels shape: torch.Size([2])

Sequence lengths in dataset: [10, 10]
Min length: 10
Max length: 10
Average length: 10.0
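
The output above shows that both samples tokenize to 10 tokens, yet the collated batch is 512 wide. As a rough illustration of what that costs, the sketch below counts the token positions one batch occupies under fixed-length padding versus padding to the longest sequence in the batch, which is what `DataCollatorWithPadding` does with its default `padding=True`. The helper name `padded_positions` is hypothetical and not part of any library.

```python
def padded_positions(lengths, max_length=None):
    """Count the token positions occupied by one padded batch.

    lengths: unpadded sequence lengths of the samples in the batch.
    max_length: pad every sample to this size; if None, pad to the
    longest sample in the batch (dynamic padding).
    """
    target = max_length if max_length is not None else max(lengths)
    return target * len(lengths)


lengths = [10, 10]  # sequence lengths reported by the cell above
fixed = padded_positions(lengths, max_length=512)
dynamic = padded_positions(lengths)
print(f"fixed: {fixed}, dynamic: {dynamic}, ratio: {fixed / dynamic:.1f}x")
```

For short social-media posts, dynamic padding is likely the cheaper choice; the training cell further down already uses the collator's default.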


In [45]:
# Add this FIRST, before any other imports
import os
import gc
import torch

# Clear environment variables that might cause issues
os.environ.pop("ACCELERATE_USE_CPU", None)
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

# Clear GPU memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
gc.collect()

# Reset accelerator state
try:
    from accelerate.state import AcceleratorState
    AcceleratorState._reset_state()
    print("Accelerator state reset successfully")
except Exception as e:
    print(f"Could not reset accelerator state: {e}")

# Now import your other libraries
from transformers import Trainer
# ... rest of your imports
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,  # recent transformers releases use processing_class; older ones use tokenizer=
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

Accelerator state reset successfully


In [31]:
import glob
import os

def find_latest_checkpoint(output_dir):
    """Return the highest-numbered checkpoint-* directory in output_dir, or None."""
    checkpoint_pattern = os.path.join(output_dir, "checkpoint-*")
    checkpoints = glob.glob(checkpoint_pattern)
    if checkpoints:
        # Sort by the integer step suffix, not alphabetically
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        latest_checkpoint = checkpoints[-1]
        print(f"Found latest checkpoint: {latest_checkpoint}")
        return latest_checkpoint
    return None
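
The numeric sort key in `find_latest_checkpoint` matters because checkpoint directory names order differently as strings than as step numbers. A minimal illustration with made-up directory names (no files are touched):

```python
# Hypothetical checkpoint directory names, for illustration only
checkpoints = [
    "results/checkpoint-500",
    "results/checkpoint-1000",
    "results/checkpoint-1500",
]

# Plain string sort: "1000" and "1500" sort before "500"
lexicographic = sorted(checkpoints)

# Sort on the integer step suffix, as find_latest_checkpoint does
numeric = sorted(checkpoints, key=lambda path: int(path.split("-")[-1]))

print("string sort picks: ", lexicographic[-1])
print("numeric sort picks:", numeric[-1])
```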

🏋️ Train the Model

In [13]:
# ✅ Required imports
import os
import glob
import torch
import gc
import time
import numpy as np
import pandas as pd
from pathlib import Path
from accelerate import Accelerator
from transformers import (
    Trainer, 
    TrainingArguments,
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
from datasets import Dataset
from sklearn.utils.class_weight import compute_class_weight
import warnings
warnings.filterwarnings('ignore')

# ✅ Checkpoint finder
def find_latest_checkpoint(output_dir):
    checkpoint_pattern = os.path.join(output_dir, "checkpoint-*")
    checkpoints = glob.glob(checkpoint_pattern)
    if checkpoints:
        checkpoints.sort(key=lambda x: int(x.split('-')[-1]))
        return checkpoints[-1]
    return None

# ✅ Clear memory
def clear_memory():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    gc.collect()


# ✅ Custom trainer with class weights
class WeightedTrainer(Trainer):
    def __init__(self, class_weights=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get('logits')

        if self.class_weights is not None:
            weight_tensor = torch.tensor(list(self.class_weights.values()), 
                                         dtype=torch.float32, device=logits.device)
            loss_fct = torch.nn.CrossEntropyLoss(weight=weight_tensor)
        else:
            loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# ✅ Trainer creator
def create_trainer(model, training_args, train_dataset, val_dataset, tokenizer, data_collator, compute_metrics, class_weights=None):
    return WeightedTrainer(
        class_weights=class_weights,
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,                 # older keyword; recent transformers releases prefer processing_class=
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )

# ✅ Metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    from sklearn.metrics import precision_recall_fscore_support
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted', zero_division=0)
    return {
        "accuracy": (predictions == labels).mean(),
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# ✅ Label encoder
def encode_labels(df, label_mapping=None):
    df = df.copy()
    df['label'] = df['label'].astype(str).str.strip().str.lower()
    unique_labels = list(df['label'].unique())
    unique_labels.sort()
    if label_mapping is None:
        label_mapping = {label: idx for idx, label in enumerate(unique_labels)}
    df['label'] = df['label'].map(label_mapping)
    df = df.dropna(subset=['label'])
    df['label'] = df['label'].astype(int)
    return df, label_mapping

# ✅ Data loader
def load_and_preprocess_data():
    try:
        tamil_train = pd.read_csv("tamil_offensive_speech_train.csv")[["comment", "label"]].rename(columns={'comment': 'text'})
        tamil_val = pd.read_csv("tamil_offensive_speech_val.csv")[["comment", "label"]].rename(columns={'comment': 'text'})
        tamil_train["lang"] = "ta"
        tamil_val["lang"] = "ta"

        hindi_train = pd.read_csv("Hatespeech-Hindi_Train.csv")[["Post", "Labels Set"]].rename(columns={'Post': 'text', 'Labels Set': 'label'})
        hindi_val = pd.read_csv("Hatespeech-Hindi_Valid.csv")[["Post", "Labels Set"]].rename(columns={'Post': 'text', 'Labels Set': 'label'})
        hindi_train["lang"] = "hi"
        hindi_val["lang"] = "hi"

        print("✅ All datasets loaded successfully")
    except FileNotFoundError as e:
        print(f"❌ Error loading datasets: {e}")
        return None, None, None, None  # four values, matching the unpacking in main()

    train_df = pd.concat([tamil_train, hindi_train], ignore_index=True).dropna().reset_index(drop=True)
    val_df = pd.concat([tamil_val, hindi_val], ignore_index=True).dropna().reset_index(drop=True)
    train_df, label_mapping = encode_labels(train_df)
    val_df, _ = encode_labels(val_df, label_mapping)

    class_weights = compute_class_weight('balanced', classes=np.unique(train_df['label']), y=train_df['label'])
    class_weight_dict = {i: weight for i, weight in enumerate(class_weights)}
    return train_df, val_df, len(set(train_df['label'].unique())), class_weight_dict

# ✅ Main training logic
def run_training(model, training_args, train_dataset, val_dataset, tokenizer, data_collator, compute_metrics, class_weights=None):
    print(f"🔍 Dataset debugging:")
    sample_item = train_dataset[0]
    for key, value in sample_item.items():
        if key == 'label':
            display_val = value
        else:
            display_val = f"length {len(value)}" if hasattr(value, '__len__') else "length N/A"
        print(f"  {key}: {type(value)} - {display_val}")

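    # Look where the Trainer actually writes checkpoints (output_dir), not a hard-coded path
    latest_checkpoint = find_latest_checkpoint(training_args.output_dir)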
    if latest_checkpoint and not os.path.exists(os.path.join(latest_checkpoint, "trainer_state.json")):
        latest_checkpoint = None

    trainer = create_trainer(model, training_args, train_dataset, val_dataset, tokenizer, data_collator, compute_metrics, class_weights)
    retry_count, max_retries = 0, 3
    while retry_count < max_retries:
        try:
            if latest_checkpoint and retry_count == 0:
                trainer.train(resume_from_checkpoint=latest_checkpoint)
            else:
                trainer.train()
            trainer.save_model("./final_model")
            tokenizer.save_pretrained("./final_model")
            print("✅ Training completed and model saved!")
            return trainer
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                retry_count += 1
                training_args.per_device_train_batch_size = max(1, training_args.per_device_train_batch_size // 2)
                training_args.per_device_eval_batch_size = max(1, training_args.per_device_eval_batch_size // 2)
                del trainer
                clear_memory()
                time.sleep(5)
                trainer = create_trainer(model, training_args, train_dataset, val_dataset, tokenizer, data_collator, compute_metrics, class_weights)
                print(f"Retrying with reduced batch sizes. Attempt: {retry_count}")
            else:
                raise
        except KeyboardInterrupt:
            print("⏹️ Training interrupted. Checkpoints from completed epochs remain in the output directory.")
            break
    return None

# ✅ Main
def main():
    print("🌍 Multilingual Offensive Speech Detection Training")
    train_df, val_df, num_labels, class_weights = load_and_preprocess_data()
    if train_df is None:
        return

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=num_labels)
    training_args = TrainingArguments(
        output_dir="./checkpoints",
        save_strategy="epoch",          # save after each epoch
        eval_strategy="epoch",          # evaluate after each epoch
        save_total_limit=3,
        load_best_model_at_end=True,
        greater_is_better=True,
        num_train_epochs=5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        warmup_steps=500,
        weight_decay=0.01,
        learning_rate=2e-5,
        logging_dir='./logs',
        logging_steps=100,
        metric_for_best_model="f1",
        report_to=[],
        dataloader_pin_memory=False,
        fp16=torch.cuda.is_available(),
        dataloader_num_workers=4 if torch.cuda.is_available() else 0,
        remove_unused_columns=True,
    )

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    def tokenize_function(examples):
        return tokenizer(examples['text'], truncation=True, padding=False, max_length=256)

    train_dataset = Dataset.from_pandas(train_df).map(tokenize_function, batched=True, remove_columns=["text", "lang"])
    val_dataset = Dataset.from_pandas(val_df).map(tokenize_function, batched=True, remove_columns=["text", "lang"])
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
    val_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

    trainer = run_training(model, training_args, train_dataset, val_dataset, tokenizer, data_collator, compute_metrics, class_weights)

    if trainer:
        print("\n📊 Final Evaluation:")
        results = trainer.evaluate()
        for key, value in results.items():
            print(f"  {key}: {value:.4f}")

    print("🎉 Done!")

# ✅ Run
if __name__ == "__main__":
    main()


🌍 Multilingual Offensive Speech Detection Training
✅ All datasets loaded successfully


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/33598 [00:00<?, ? examples/s]

Map:   0%|          | 0/7780 [00:00<?, ? examples/s]

🔍 Dataset debugging:
  label: <class 'torch.Tensor'> - 0
  input_ids: <class 'torch.Tensor'> - length 15
  attention_mask: <class 'torch.Tensor'> - length 15


Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1
1      1.4153         0.75953          0.77982   0.792382   0.77982   0.782524
2      1.0352         0.656911         0.805656  0.814518   0.805656  0.808257
3      0.9749         0.65047          0.771851  0.827395   0.771851  0.783951
4      1.0366         0.651813         0.801799  0.830168   0.801799  0.809189
5      0.8237         0.697965         0.811054  0.830382   0.811054  0.816513
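
In every epoch of the table the Accuracy and Recall values are identical. That is expected with `average='weighted'` in `compute_metrics`: each class's recall is weighted by its share of the true labels, so the weighted sum collapses to overall accuracy. A small pure-Python check on made-up labels (the lists below are illustrative and not taken from the Hindi/Tamil data):

```python
import math
from collections import Counter

# Toy labels for illustration only
labels = [0, 0, 0, 1, 1, 2]
predictions = [0, 0, 1, 1, 2, 2]

support = Counter(labels)  # number of true examples per class
correct = Counter(l for l, p in zip(labels, predictions) if l == p)

# Per-class recall = correct predictions for the class / true examples of the class
recall = {cls: correct[cls] / support[cls] for cls in support}

# Support-weighted average of the per-class recalls
weighted_recall = sum(recall[cls] * support[cls] for cls in support) / len(labels)
accuracy = sum(l == p for l, p in zip(labels, predictions)) / len(labels)

print(f"accuracy={accuracy:.4f}, weighted recall={weighted_recall:.4f}")
print("equal:", math.isclose(accuracy, weighted_recall))
```

Because the two columns carry the same information, macro-averaged recall or per-class recall may be more informative on an imbalanced dataset.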


✅ Training completed and model saved!

📊 Final Evaluation:


  eval_loss: 0.6980
  eval_accuracy: 0.8111
  eval_precision: 0.8304
  eval_recall: 0.8111
  eval_f1: 0.8165
  eval_runtime: 1104.6624
  eval_samples_per_second: 7.0430
  eval_steps_per_second: 0.4410
  epoch: 5.0000
🎉 Done!
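
`compute_class_weight('balanced', ...)` in `load_and_preprocess_data` gives each class the weight `n_samples / (n_classes * class_count)`, so rarer classes contribute more to the loss in `WeightedTrainer`. A sketch of that formula in plain Python, run on made-up label counts (the real class distribution is not shown in this notebook, and `balanced_class_weights` is a hypothetical helper name):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for integer labels."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Toy, deliberately imbalanced labels: 6 of class 0, 3 of class 1, 1 of class 2
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels)

for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

The rarest class receives the largest weight, and the weights multiplied by their class counts sum back to the number of samples.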


In [None]:
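from transformers import AutoModelForSequenceClassification, AutoTokenizer

# model and tokenizer are local to main(), so reload the artifacts saved to ./final_model
tokenizer = AutoTokenizer.from_pretrained("./final_model")
model = AutoModelForSequenceClassification.from_pretrained("./final_model")
model.eval()  # disable dropout for inference

def test_prediction(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()
    return predicted_class, predictions[0].tolist()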
# Example usage - uncomment to test
# test_text = "This is a sample text"
# pred_class, confidence = test_prediction(test_text)
# print(f"Predicted class: {pred_class}, Confidence: {confidence}")
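
`test_prediction` returns a class index, while `encode_labels` builds its mapping in the other direction (label string to index). Inverting that mapping turns a prediction back into a readable label. The sketch below uses placeholder label names, since the actual label strings depend on the CSV files and are not shown in this notebook; `describe_prediction` is a hypothetical helper.

```python
# Placeholder mapping in the shape encode_labels returns (label string -> index);
# the real label names come from the datasets and will differ.
label_mapping = {"hate": 0, "not_hate": 1}

# Invert it so a predicted index can be turned back into a label string
id_to_label = {idx: label for label, idx in label_mapping.items()}


def describe_prediction(predicted_class, probabilities):
    """Format a (class index, probability list) pair as a readable string."""
    label = id_to_label[predicted_class]
    return f"{label} (p={probabilities[predicted_class]:.2f})"


print(describe_prediction(1, [0.2, 0.8]))
```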

In [None]:
from sklearn.metrics import classification_report

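# NOTE: trainer, val_dataset and label_map (the label_mapping dict built by encode_labels)
# are local to the functions above; return or otherwise expose them before running this cell.
predictions = trainer.predict(val_dataset)
pred_labels = predictions.predictions.argmax(axis=1)
val_labels = predictions.label_ids  # gold labels returned alongside the predictions

print(classification_report(val_labels, pred_labels, target_names=list(label_map.keys())))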

✅ Conclusion
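- Multilingual model fine-tuned on combined Hindi and Tamil data (33,598 training and 7,780 validation examples)
- Built on XLM-RoBERTa (`xlm-roberta-base`) for cross-lingual learning
- Reached 0.8111 accuracy and 0.8165 weighted F1 on the validation set after 5 epochs
- The model saved in ./final_model can be served behind a REST API or chatbot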