# Hate Speech Detection - Model Training and Evaluation

This notebook clusters sentence embeddings of the hate speech dataset and evaluates the resulting clusters against the ground-truth labels.

**Models:**
- Sentence embeddings: all-MiniLM-L6-v2, paraphrase-MiniLM-L6-v2, all-mpnet-base-v2
- Clustering: K-Means, DBSCAN
- Evaluation: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), V-Measure, Silhouette Score

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from embeddings import EmbeddingGenerator, compare_models
from clustering import KMeansClustering, DBSCANClustering, find_optimal_k, plot_elbow_curve
from evaluation import ClusteringEvaluator, compare_clustering_methods

%matplotlib inline
sns.set_style('whitegrid')

print("Imports complete!")

## 1. Load Processed Data

In [None]:
# Load processed data
df = pd.read_csv('../data/processed/processed_data.csv')

print(f"Loaded {len(df)} samples")
print(f"Columns: {df.columns.tolist()}")
df.head()

## 2. Generate Embeddings

We'll use the lightweight `all-MiniLM-L6-v2` model for fast embeddings.

In [None]:
# Initialize embedding generator
model_name = 'all-MiniLM-L6-v2'
generator = EmbeddingGenerator(model_name)

# Generate embeddings
embeddings = generator.encode_dataframe(
    df,
    text_column='cleaned_text',
    batch_size=32,
    normalize=False
)

print(f"\nEmbeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {generator.get_embedding_dim()}")
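
The cell above passes `normalize=False`, so the vectors keep their raw lengths. As a rough illustration of what normalization would change, here is a minimal NumPy sketch. It assumes `normalize` refers to L2 (unit-length) normalization, which we have not verified against `EmbeddingGenerator`; `l2_normalize` and the toy matrix are hypothetical and not part of `src/`.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit Euclidean length (hypothetical helper)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    return vectors / np.maximum(norms, eps)

# Toy stand-in for an (n_samples, dim) embedding matrix
toy = np.array([[3.0, 4.0], [0.0, 2.0], [1.0, 1.0]])
unit = l2_normalize(toy)

# For unit vectors: squared Euclidean distance = 2 - 2 * cosine similarity
cos = unit @ unit.T
sq_dist = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)
```

This matters for distance-based clustering: on unit vectors, Euclidean distance ranks pairs in the same order as cosine distance, so a threshold such as DBSCAN's `eps` is easier to interpret.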

In [None]:
# Save embeddings
embeddings_path = Path(f'../data/embeddings/embeddings_{model_name}.npy')
embeddings_path.parent.mkdir(parents=True, exist_ok=True)
generator.save_embeddings(embeddings, embeddings_path)

## 3. K-Means Clustering

### 3.1 Find Optimal K

In [None]:
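# Find optimal k using elbow method
k_results = find_optimal_k(embeddings, k_range=range(2, 11), random_state=42)

# Make sure the figures directory exists before saving plots
Path('../outputs/figures').mkdir(parents=True, exist_ok=True)

# Plot elbow curve
plot_elbow_curve(k_results, save_path='../outputs/figures/elbow_curve.png')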
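
For intuition about the elbow method, the sketch below runs scikit-learn's `KMeans` on synthetic blobs and records the inertia (within-cluster sum of squares) for each k. It is only an illustration under assumed toy data; the internals of `find_optimal_k` may differ, and `toy_embeddings` is a hypothetical stand-in for the real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Three well-separated synthetic blobs stand in for real embeddings
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_embeddings = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 2)) for c in centers]
)

# Fit K-Means for each candidate k and record the inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(toy_embeddings)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Inertia always decreases as k grows, so the useful signal is where the decrease flattens out. On this toy data the drop is steep up to k=3 and small afterwards.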

### 3.2 Train K-Means with k=2

Because the target is binary (hate vs. non-hate), we fix k=2 so that the clusters can be compared directly against the two classes.

In [None]:
# Train K-Means with k=2
kmeans = KMeansClustering(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit(embeddings)

# Add predictions to dataframe
df['kmeans_cluster'] = kmeans_labels

print("\nCluster distribution:")
print(df['kmeans_cluster'].value_counts())

### 3.3 Evaluate K-Means

In [None]:
# Get ground truth labels
y_true = df['class'].values

# Evaluate K-Means
evaluator_kmeans = ClusteringEvaluator()
kmeans_results = evaluator_kmeans.evaluate(
    y_true,
    kmeans_labels,
    embeddings=embeddings
)
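
To make the metrics concrete, here is a small sketch that computes them directly with scikit-learn. The dictionary keys mirror the ones read in the results summary later in this notebook, but `score_clustering` is a hypothetical stand-in: we are not claiming it matches how `ClusteringEvaluator.evaluate` is implemented.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
    v_measure_score,
)

def score_clustering(y_true, y_pred, embeddings=None):
    """Hypothetical stand-in for an evaluator; returns a dict of metric values."""
    scores = {
        "adjusted_rand_index": adjusted_rand_score(y_true, y_pred),
        "normalized_mutual_info": normalized_mutual_info_score(y_true, y_pred),
        "v_measure": v_measure_score(y_true, y_pred),
    }
    # Silhouette needs the feature vectors and between 2 and n_samples - 1 clusters
    if embeddings is not None and 1 < len(set(y_pred)) < len(y_pred):
        scores["silhouette"] = silhouette_score(embeddings, y_pred)
    return scores

toy_true = np.array([0, 0, 0, 1, 1, 1])
toy_pred = np.array([1, 1, 1, 0, 0, 0])  # same partition, swapped cluster IDs
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

scores = score_clustering(toy_true, toy_pred, toy_points)
```

ARI, NMI, and V-Measure compare partitions rather than label values, so the swapped cluster IDs in `toy_pred` still score as a perfect match. Silhouette ignores the true labels entirely and only measures how compact and separated the clusters are in embedding space.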

In [None]:
# Plot confusion matrix
evaluator_kmeans.plot_confusion_matrix(
    y_true,
    kmeans_labels,
    save_path='../outputs/figures/kmeans_confusion_matrix.png'
)
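
Cluster IDs are arbitrary, so cluster 0 need not correspond to class 0 in a confusion matrix. One common way to line them up is the Hungarian algorithm on the contingency table. The sketch below uses SciPy's `linear_sum_assignment`; `align_clusters` and the toy labels are hypothetical, and we have not checked whether `plot_confusion_matrix` already does something similar.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def align_clusters(y_true, y_pred):
    """Relabel clusters to maximize agreement with the true classes (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency table: rows = true classes, columns = cluster IDs
    table = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                      for c in classes])
    # Negate the counts so that minimizing cost maximizes matches
    rows, cols = linear_sum_assignment(-table)
    mapping = {int(clusters[col]): int(classes[row]) for row, col in zip(rows, cols)}
    # Clusters without a matching class (more clusters than classes) become -1
    return np.array([mapping.get(int(label), -1) for label in y_pred])

toy_true = np.array([0, 0, 0, 1, 1, 1])
toy_pred = np.array([1, 1, 0, 0, 0, 0])

aligned = align_clusters(toy_true, toy_pred)
matrix = confusion_matrix(toy_true, aligned)
```

After alignment the diagonal of the confusion matrix counts agreements, which makes the plot easier to read. The permutation-invariant metrics (ARI, NMI, V-Measure) are unaffected by this relabeling.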

## 4. DBSCAN Clustering

In [None]:
# Train DBSCAN
dbscan = DBSCANClustering(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit(embeddings)

# Add predictions to dataframe
df['dbscan_cluster'] = dbscan_labels

print("\nCluster distribution:")
print(df['dbscan_cluster'].value_counts())
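
When reading the distribution above, note that scikit-learn's `DBSCAN` assigns the label `-1` to noise points. The sketch below assumes `DBSCANClustering` follows the same convention, which we have not confirmed; the toy points and the `min_samples=3` setting are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],
])
toy_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(toy_points)

# Separate noise (-1) from real clusters before counting
noise_mask = toy_labels == -1
n_clusters = len(set(toy_labels[~noise_mask].tolist()))
print(f"clusters={n_clusters}, noise points={int(noise_mask.sum())}")
# prints "clusters=2, noise points=1"
```

If noise is counted as a cluster of its own, the cluster count and the metrics shift accordingly, so it is worth checking how many samples land in `-1` before comparing DBSCAN with K-Means.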

### 4.1 Evaluate DBSCAN

In [None]:
# Evaluate DBSCAN
evaluator_dbscan = ClusteringEvaluator()
dbscan_results = evaluator_dbscan.evaluate(
    y_true,
    dbscan_labels,
    embeddings=embeddings
)

In [None]:
# Plot confusion matrix
evaluator_dbscan.plot_confusion_matrix(
    y_true,
    dbscan_labels,
    save_path='../outputs/figures/dbscan_confusion_matrix.png'
)

## 5. Compare Methods

In [None]:
# Compare methods
all_results = {
    'K-Means': kmeans_results,
    'DBSCAN': dbscan_results
}

compare_clustering_methods(
    all_results,
    save_path='../outputs/figures/method_comparison.png'
)

In [None]:
# Print summary
print("\n" + "="*70)
print("RESULTS SUMMARY")
print("="*70)

for method, results in all_results.items():
    print(f"\n{method}:")
    print(f"  ARI:        {results['adjusted_rand_index']:.4f}")
    print(f"  NMI:        {results['normalized_mutual_info']:.4f}")
    print(f"  V-Measure:  {results['v_measure']:.4f}")
    # Silhouette may be missing or None (e.g. if fewer than two clusters were found)
    silhouette = results.get('silhouette')
    if silhouette is not None:
        print(f"  Silhouette: {silhouette:.4f}")

## 6. Save Results

In [None]:
# Save results
results_df = pd.DataFrame(all_results).T
results_path = '../outputs/results/clustering_results.csv'
Path(results_path).parent.mkdir(parents=True, exist_ok=True)
results_df.to_csv(results_path)
print(f"Saved results to {results_path}")

# Save dataframe with predictions
predictions_path = '../outputs/results/predictions.csv'
df.to_csv(predictions_path, index=False)
print(f"Saved predictions to {predictions_path}")

## Summary

This notebook demonstrated:
1. Generating sentence embeddings using SBERT
2. Applying K-Means and DBSCAN clustering
3. Evaluating clusters against ground-truth labels with ARI, NMI, V-Measure, and Silhouette Score
4. Comparing different clustering methods

**Key Findings:**
- The Adjusted Rand Index (ARI) measures chance-corrected agreement between predicted clusters and true labels, independent of how cluster IDs are numbered
- K-Means with k=2 provides a simple unsupervised baseline for the binary hate vs. non-hate split
- DBSCAN can flag outliers as noise, but it determines the number of clusters itself and may not return exactly 2
- Results can be compared against supervised baselines (e.g., BERT fine-tuning)