# HexaCiphers - Sentiment Analysis Training

This notebook demonstrates fine-tuning a three-class text classifier (Anti-India, Neutral, Pro-India) for detecting anti-India campaigns.

## Objectives
- Train a sentiment classifier using HuggingFace transformers
- Fine-tune a multilingual BERT model (`bert-base-multilingual-cased` here; IndicBERT is a possible alternative) for the Indian context
- Evaluate model performance on test data

In [None]:
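# Install required packages. The PyPI name for sklearn is scikit-learn;
# recent transformers releases also need accelerate for the Trainer API.
!pip install transformers torch datasets accelerate scikit-learn pandas matplotlib seaborn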

In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## 1. Data Preparation

Create a small sample dataset for training. In production, you would use real labeled data. Labels are encoded as 0 = Anti-India, 1 = Neutral, 2 = Pro-India.

In [None]:
# Sample training data - replace with real dataset
sample_data = {
    'text': [
        "India is making great progress in technology",
        "Proud to be Indian, great culture and heritage",
        "Digital India initiative is transforming the country",
        "Boycott India, spreading fake news about the country",
        "Anti-India propaganda on social media platforms",
        "India's policies are destroying the economy",
        "The weather is nice today",
        "Just had lunch with friends",
        "Working from home is comfortable"
    ],
    'label': [
        2,  # Pro-India
        2,  # Pro-India  
        2,  # Pro-India
        0,  # Anti-India
        0,  # Anti-India
        0,  # Anti-India
        1,  # Neutral
        1,  # Neutral
        1   # Neutral
    ]
}

# Create DataFrame
df = pd.DataFrame(sample_data)
print("Sample dataset:")
print(df)
print("\nLabel distribution:")
print(df['label'].value_counts())
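
The trainer later in this notebook evaluates on the training data itself. A minimal sketch of a stratified hold-out split follows, assuming every class has at least two examples. `stratified_split` is a hypothetical helper written for illustration, and the toy texts stand in for real posts. On real data, scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, val_fraction=0.33, seed=42):
    """Hypothetical helper: hold out roughly val_fraction of each class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label][:]
        rng.shuffle(indices)
        # Keep at least one example of the class on each side of the split
        n_val = max(1, min(len(indices) - 1, round(len(indices) * val_fraction)))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    train = [(texts[i], labels[i]) for i in sorted(train_idx)]
    val = [(texts[i], labels[i]) for i in sorted(val_idx)]
    return train, val

# Toy labels mirroring the sample dataset: three examples per class
toy_labels = [2, 2, 2, 0, 0, 0, 1, 1, 1]
toy_texts = [f"example {i}" for i in range(len(toy_labels))]
train_pairs, val_pairs = stratified_split(toy_texts, toy_labels)
print(len(train_pairs), len(val_pairs))
```

With three examples per class, one example of each class is held out, so the validation set stays balanced.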

## 2. Model Setup

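Load the pre-trained multilingual BERT model and its tokenizer. The classification head on top is newly initialized, so predictions are not meaningful until the model has been fine-tuned.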

In [None]:
# Model configuration
MODEL_NAME = "bert-base-multilingual-cased"
NUM_LABELS = 3  # Anti-India, Neutral, Pro-India

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, 
    num_labels=NUM_LABELS
)

print(f"Loaded model: {MODEL_NAME}")
print(f"Number of parameters: {model.num_parameters():,}")

## 3. Data Preprocessing

In [None]:
def tokenize_function(examples):
    # padding=True pads to the longest text in the batch; max_length only caps very long inputs
    return tokenizer(examples['text'], truncation=True, padding=True, max_length=512)

# Create datasets
train_dataset = Dataset.from_pandas(df)
train_dataset = train_dataset.map(tokenize_function, batched=True)
train_dataset = train_dataset.rename_column('label', 'labels')

print("Tokenized dataset:")
print(train_dataset)
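
To make the `truncation`, `padding`, and `max_length` arguments concrete, here is a simplified illustration on toy token ids. It is not the tokenizer's implementation: `pad_and_truncate`, the pad id of 0, and the id values are assumptions made for the sketch.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Illustrative only: cut each sequence to max_length, then pad to the longest one.

    A real tokenizer also preserves special tokens (such as the end marker)
    when truncating; this sketch does not.
    """
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
for ids, mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(ids, mask)
```

Every row ends up the same length, and the attention mask records which positions are padding.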

## 4. Training Configuration

In [None]:
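# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    # 9 samples at batch size 2 give 5 steps per epoch (15 in total), so a
    # 100-step warmup would never finish; raise this again for a real dataset
    warmup_steps=0,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_strategy="epoch",
    # Newer transformers releases rename this argument to eval_strategy
    evaluation_strategy="epoch",
    load_best_model_at_end=True,
)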

# Metrics function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {'accuracy': accuracy_score(labels, predictions)}

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # Demo only: reuses the training data; use a separate validation set in practice
    compute_metrics=compute_metrics,
)

print("Training setup complete")
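
Accuracy alone can look healthy while one class is never predicted. As a sketch of a complementary metric, the block below computes per-class F1 and its unweighted mean (macro-F1) by hand. `per_class_f1` is a hypothetical helper, and the labels are made up for illustration, not model output. On real data, scikit-learn's `f1_score` with `average='macro'` computes the same quantity.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """Hypothetical helper: F1 for each class id in range(num_classes)."""
    scores = []
    for cls in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Made-up example: class 0 (Anti-India) is never predicted
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 2, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_scores = per_class_f1(y_true, y_pred, num_classes=3)
macro_f1 = sum(f1_scores) / len(f1_scores)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} per_class={f1_scores}")
```

Here two thirds of the predictions are correct, yet the F1 for class 0 is zero, which macro-F1 reflects and accuracy does not.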

## 5. Model Training

**Note**: This is a demonstration. In production, you would need a larger, properly labeled dataset.

In [None]:
# Train the model
print("Starting training...")
trainer.train()
print("Training completed!")

## 6. Model Evaluation

In [None]:
# Test the model with sample inputs
test_texts = [
    "India is a beautiful country with rich culture",
    "Boycott Indian products, spread the message",
    "The weather forecast for tomorrow",
    "Proud of India's achievements in space technology",
    "Anti-India propaganda must be stopped"
]

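# Tokenize test texts and move them to the model's device (the GPU if training used one)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, return_tensors='pt')
test_encodings = {k: v.to(model.device) for k, v in test_encodings.items()}

# Make predictions
model.eval()
with torch.no_grad():
    outputs = model(**test_encodings)
    # Convert logits to probabilities and bring them back to the CPU for printing and plotting
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1).cpu()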

# Label mapping
label_map = {0: 'Anti-India', 1: 'Neutral', 2: 'Pro-India'}

# Display results
print("\nTest Results:")
print("-" * 50)
for i, text in enumerate(test_texts):
    pred_label = torch.argmax(predictions[i]).item()
    confidence = predictions[i][pred_label].item()
    print(f"Text: {text}")
    print(f"Prediction: {label_map[pred_label]} (confidence: {confidence:.3f})")
    print("-" * 50)
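
The confidence printed above is the largest softmax probability for each text. A worked example on made-up logits, using only the standard library, shows how logits become probabilities. The logit values and the `softmax` helper are illustrative assumptions, not output from the model.

```python
import math

def softmax(logits):
    # Subtracting the maximum first avoids overflow and leaves the result unchanged
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_names = {0: 'Anti-India', 1: 'Neutral', 2: 'Pro-India'}
toy_logits = [0.2, 0.1, 0.9]  # made-up scores for one text
probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(label_names[best], round(probs[best], 3))
```

The probabilities always sum to one, so a high value only means the model prefers one class over the others. It is not a guarantee of correctness, especially after training on so few examples.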

## 7. Save Model

In [None]:
# Save the fine-tuned model
model.save_pretrained('./models/sentiment_classifier')
tokenizer.save_pretrained('./models/sentiment_classifier')

print("Model saved to ./models/sentiment_classifier")

## 8. Visualization

In [None]:
# Visualize prediction confidence
plt.figure(figsize=(12, 6))

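# Build the probability matrix: one row per test text, one column per class
conf_matrix = predictions.cpu().numpy()  # .cpu() is required if the tensor lives on a GPU
labels = [label_map[i] for i in range(len(label_map))]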

# Heatmap
sns.heatmap(conf_matrix, 
            xticklabels=labels, 
            yticklabels=[f"Text {i+1}" for i in range(len(test_texts))],
            annot=True, 
            fmt='.3f', 
            cmap='Blues')

plt.title('Prediction Confidence Matrix')
plt.xlabel('Predicted Class')
plt.ylabel('Test Samples')
plt.tight_layout()
plt.show()

## Next Steps

1. **Collect Real Data**: Gather labeled social media posts for training
2. **Data Augmentation**: Use techniques such as paraphrasing or back-translation to increase dataset size
3. **Validation Splits**: Hold out separate validation and test sets, or use k-fold cross-validation, instead of evaluating on the training data
4. **Hyperparameter Tuning**: Optimize learning rate, batch size, etc.
5. **Multi-language Support**: Train on Hindi, Bengali, and other Indian languages
6. **Deployment**: Integrate trained model into the HexaCiphers backend
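
For the hyperparameter tuning step, it helps to know how many optimizer steps a configuration produces, because a warmup length given as an absolute step count can exceed the whole run on a small dataset. A minimal sketch, assuming a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper:

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Hypothetical helper: steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# The demo configuration in this notebook: 9 texts, batch size 2, 3 epochs
steps = total_optimizer_steps(num_examples=9, batch_size=2, num_epochs=3)
print(steps)  # 15

# A 100-step warmup would be longer than the entire run
print(100 > steps)  # True
```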